This notebook scores the provided dataset with the best model found during training, an H2O Gradient Boosting Estimator. The training notebook saved the path of the best model, which is loaded here to score a new dataset. The helper functions needed for scoring were also saved to the artifacts folder. Below, we import the necessary libraries and define a function, Project2_scoring, that scores the loaded data and returns the results in the required format. The model stores the threshold that maximizes the F1 score, and this threshold determines the predicted outcome (0 or 1).
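As a rough illustration of how helper functions can travel through an artifacts pickle, the sketch below stores a function as source text and restores it with `exec`. It assumes the artifacts hold source strings, which is what the `exec(...)` calls in the scoring function suggest. The helper `double_amount` is hypothetical, and the pickle is kept in memory rather than written to the artifacts folder.

```python
import pickle

# Hypothetical helper, kept as source text rather than as a function object.
helper_source = '''
def double_amount(value):
    return value * 2
'''

# Training side: serialize a dictionary of source strings.
# (In memory here; the notebook reads such a dictionary from a file instead.)
payload = pickle.dumps({"double_amount": helper_source})

# Scoring side: load the dictionary and execute the definition
# in a private namespace instead of globals().
functions_dict = pickle.loads(payload)
namespace = {}
exec(functions_dict["double_amount"], namespace)

double_amount = namespace["double_amount"]
print(double_amount(21))  # prints 42
```

Executing into a dedicated dictionary keeps the restored names out of the notebook's global scope. Both `pickle.load` and `exec` run arbitrary code, so this pattern is only appropriate for artifacts you produced yourself.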

In [1]:
import h2o

In [2]:
h2o.init(max_mem_size = "4G")
h2o.remove_all()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "17.0.11" 2024-04-16 LTS; Java(TM) SE Runtime Environment (build 17.0.11+7-LTS-207); Java HotSpot(TM) 64-Bit Server VM (build 17.0.11+7-LTS-207, mixed mode, sharing)
  Starting server from /Users/vishu/Downloads/ml-summer-2024/lib/python3.10/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/b4/37mclspn5hb08s778j8dsmpw0000gn/T/tmpuuwq3r8n
  JVM stdout: /var/folders/b4/37mclspn5hb08s778j8dsmpw0000gn/T/tmpuuwq3r8n/h2o_vishu_started_from_python.out
  JVM stderr: /var/folders/b4/37mclspn5hb08s778j8dsmpw0000gn/T/tmpuuwq3r8n/h2o_vishu_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html


H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,America/Chicago
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.44.0.3
H2O_cluster_version_age:,7 months and 18 days
H2O_cluster_name:,H2O_from_python_vishu_p8auzi
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,4 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


In [3]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 1500)
import warnings
warnings.filterwarnings('ignore')
# Extend cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [4]:
# Read the data in pandas dataframe as X_score.
X_score = pd.read_csv('SBA_loans_project_2_holdout_students_valid.csv')

In [5]:
# Import the necessary libraries used in the scoring function.

import pickle
import os
from sklearn.preprocessing import PolynomialFeatures
from h2o.estimators import H2OGradientBoostingEstimator

In [6]:
def Project2_scoring(X_data):

    # Load the model path
    artifacts_dir = "../artifacts"
    with open(os.path.join(artifacts_dir, "artifacts_model_path.pkl"), "rb") as artifacts_model_path_file:
        artifacts_dict = pickle.load(artifacts_model_path_file)

    # Load the GBM model
    loaded_gbm_model = h2o.load_model(os.path.join(artifacts_dir, "GBM_model_1722923354827_4554"))

    # Load the function definitions
    with open(os.path.join(artifacts_dir, "artifacts_functions.pkl"), "rb") as artifacts_functions_file:
        functions_dict = pickle.load(artifacts_functions_file)

    # Execute the function definitions
    exec(functions_dict["data_cleaning"], globals())
    exec(functions_dict["new_features_1"], globals())
    exec(functions_dict["add_interaction_terms"], globals())
    exec(functions_dict["changing_datatype"], globals())

    # Clean the scoring data
    X_data = data_cleaning(X_data)

    # Add the engineered features
    X_data = new_features_1(X_data)

    # Look up each saved mapping dictionary; categories missing from it get 0
    Bank_SBAappv_mapping = functions_dict["Bank_SBAappv_mapping"]
    X_data['Avg_SBAappv_bank'] = X_data['Bank'].map(Bank_SBAappv_mapping).fillna(0)

    # Apply the remaining mapping dictionaries the same way
    sector_GrAppv_mapping = functions_dict["sector_GrAppv_mapping"]
    X_data['Avg_GrAppv_sector'] = X_data['sector'].map(sector_GrAppv_mapping).fillna(0)

    State_GrAppv_mapping = functions_dict["State_GrAppv_mapping"]
    X_data['Avg_GrAppv_State'] = X_data['State'].map(State_GrAppv_mapping).fillna(0)

    Disb_Bank_mapping = functions_dict["Disb_Bank_mapping"]
    X_data['Avg_Disb_Bank'] = X_data['Bank'].map(Disb_Bank_mapping).fillna(0)

    # Add interaction terms for selected feature pairs
    feature_cols = ['UrbanRural', 'FranchiseCode']
    X_data = add_interaction_terms(X_data, feature_cols, 2)
    feature_cols = ['disb_greater_app', 'FranchiseCode']
    X_data = add_interaction_terms(X_data, feature_cols, 2)

    # Convert X_data to an H2OFrame
    X_data_h2o = h2o.H2OFrame(X_data)

    # Set the column data types of the scoring data
    X_data_h2o = changing_datatype(X_data_h2o)

    # Make predictions using the loaded GBM model
    predictions = loaded_gbm_model.predict(X_data_h2o)
    X_data_h2o = X_data_h2o.cbind(predictions)

    predicted_labels = X_data_h2o["predict"]
    probability_0 = X_data_h2o["p0"]   # Probability of class 0
    probability_1 = X_data_h2o["p1"]   # Probability of class 1

    # Add a row index column to the scoring data
    data_with_index = X_data_h2o.cbind(h2o.H2OFrame(list(range(X_data_h2o.nrow)), column_names=["index"]))

    # Create a new H2OFrame with the desired output
    output = data_with_index.cbind([predicted_labels, probability_0, probability_1])
    output.columns = X_data_h2o.columns + ["ID", "label", "probability_0", "probability_1"]
    h2o.export_file(output, path="temp.csv", force=True)
    output_df = pd.read_csv("temp.csv")
    os.remove("temp.csv")

    # Select and reorder the required columns
    output_df = output_df[["ID", "label", "probability_0", "probability_1"]]
    return output_df
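The mapping steps in Project2_scoring rely on `Series.map` followed by `fillna(0)`, so a category that never appeared in the training data receives 0 instead of a missing value. A minimal sketch of that behavior follows. The bank names and amounts are made up for illustration and are not taken from the saved artifacts.

```python
import pandas as pd

# Hypothetical training-time lookup: average approved amount per bank.
bank_avg_mapping = {"Bank A": 120000.0, "Bank B": 45000.0}

# Scoring rows; "Bank C" does not appear in the lookup.
scoring = pd.DataFrame({"Bank": ["Bank A", "Bank C", "Bank B"]})

# map() yields NaN for keys missing from the dictionary;
# fillna(0) then replaces those NaN values with 0.
scoring["Avg_SBAappv_bank"] = scoring["Bank"].map(bank_avg_mapping).fillna(0)

print(scoring)
```

Filling with 0 keeps the column numeric and free of missing values, at the cost of treating every unseen category as if its average were zero.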

In [7]:
# We use the function Project2_scoring on X_score and display the first 5 rows.
Project2_scoring(X_score).head()

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Export File progress: |██████████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,ID,label,probability_0,probability_1
0,0,0,0.9577,0.0423
1,1,0,0.887146,0.112854
2,2,0,0.881359,0.118641
3,3,0,0.982325,0.017675
4,4,0,0.877177,0.122823
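The label column follows from probability_1 and the model's max-F1 threshold. The sketch below shows that relationship in pandas, assuming a row is labeled 1 when its probability is at or above the threshold. The threshold of 0.30 and the last probability are hypothetical values chosen for illustration; the first three probabilities are copied from the output above.

```python
import pandas as pd

# Hypothetical threshold; the real value is stored with the trained model.
threshold = 0.30

# First three values come from the output above; 0.45 is made up
# to show a row that crosses the threshold.
scores = pd.DataFrame({"probability_1": [0.0423, 0.112854, 0.118641, 0.45]})

# A row is labeled 1 when its class-1 probability reaches the threshold.
scores["label"] = (scores["probability_1"] >= threshold).astype(int)

print(scores)
```

Because the threshold is tuned for F1 rather than fixed at 0.5, a row can be labeled 1 even when probability_0 is larger than probability_1.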


In [8]:
h2o.cluster().shutdown()

H2O session _sid_bac0 closed.
