## Stop Here. Create new file/notebook

## ==============================================

## Model Scoring

Write a function that loads the artifacts saved above, applies the same transformations to a new dataset, and scores it.
Your function should return a Python list of labels, for example: [0,1,0,1,1,0,0]


### Example of Scoring function

Don't copy the code as is; it is provided as an example only.
- Function `train_model`: focus on how the model and artifacts are saved:
    ```
    pickle.dump(obj=artifacts_dict, file=artifacts_dict_file)
    ```
- Function `project_1_scoring`: you should have a similar function named `project_1_scoring`. The function will:
    - Take a Pandas DataFrame as its parameter
    - Load the model and all needed encoders
    - Perform the needed manipulations on the input DataFrame, which arrives in exactly the same format as the project's input file, minus the MIS_Status feature
    - Return a Pandas DataFrame with:
        - record index
        - predicted class, using the threshold that maximizes F1
        - probability for class 0 (PIF)
        - probability for class 1 (CHGOFF)
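
The save-and-reload round trip for the artifacts dictionary can be sketched as follows. This is only an illustration: the dictionary keys, the placeholder values, and the temporary path are assumptions, and in the project the values would be the fitted encoders and column lists.

```python
import os
import pickle
import tempfile

# Hypothetical artifacts: plain values stand in for fitted encoders.
artifacts_dict = {
    "ohe_columns": ["State", "NAICS"],
    "woe_columns": ["Bank", "City"],
    "threshold": 0.5,
}

# Assumed location; the project would use its own artifacts folder.
artifacts_path = os.path.join(tempfile.mkdtemp(), "artifacts_dict_file.pkl")

# Save: open in binary write mode; the context manager closes the file.
with open(artifacts_path, "wb") as artifacts_dict_file:
    pickle.dump(obj=artifacts_dict, file=artifacts_dict_file)

# Load (as the scoring function would): binary read mode, same path.
with open(artifacts_path, "rb") as artifacts_dict_file:
    loaded_artifacts = pickle.load(file=artifacts_dict_file)

print(loaded_artifacts == artifacts_dict)
```

Keeping every encoder and column list in one dictionary means the scoring function needs a single file to rebuild the preprocessing used at training time.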


In [1]:
import h2o
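try:
    # shut down any cluster left over from a previous session
    h2o.cluster().shutdown()
except Exception:
    # no cluster is running, so there is nothing to shut down
    pass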

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O_cluster_uptime: 3 hours 34 mins
H2O_cluster_timezone: America/Chicago
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.38.0.4
H2O_cluster_version_age: 4 months and 2 days !!!
H2O_cluster_name: H2O_from_python_16826_m9jzd0
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.660 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8


In [3]:
import numpy as np  # needed for np.log below

def cleaning(df,train=True):
    # drop the 'index' column if it exists; do nothing otherwise
    df = df.drop(columns=['index'], errors='ignore')
    
    # drop null
    df = df.dropna()
    
    # clean money columns
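    money_cols = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']
    # strip whitespace, '$' and thousands separators, e.g. '$1,234.00 ' -> '1234.00'
    df[money_cols] = df[money_cols].apply(
        lambda col: col.str.strip().str.replace('$', '', regex=False).str.replace(',', '', regex=False))
    # parse as float first so strings with a decimal part are accepted, then truncate to int
    df[money_cols] = df[money_cols].astype('float64')
    df[money_cols] = df[money_cols].astype('int64')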
    
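    # clean RevLineCr: drop rows holding any of these unwanted codes
    values = ['0','T','1','R','`','2','C','3',',','7','A','5','.','4','-','Q']
    df = df[~df.RevLineCr.isin(values)]
    
    # clean LowDoc: drop rows holding any of these unwanted codes
    values = ['0','C','S','A','R','1']
    # copy so the column assignments below act on an independent DataFrame
    df = df[~df.LowDoc.isin(values)].copy()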
    
    # clean NAICS
    df['NAICS'] = df['NAICS'].astype('str').apply(lambda x: x[:2])
    df['NAICS'] = df['NAICS'].map({
        '11': 'Ag_For_Fish_Hunt',
        '21': 'Min_Quar_Oil_Gas_ext',
        '22': 'Utilities',
        '23': 'Construction',
        '31': 'Manufacturing',
        '32': 'Manufacturing',
        '33': 'Manufacturing',
        '42': 'Wholesale_trade',
        '44': 'Retail_trade',
        '45': 'Retail_trade',
        '48': 'Trans_Ware',
        '49': 'Trans_Ware',
        '51': 'Information',
        '52': 'Finance_Insurance',
        '53': 'RE_Rental_Lease',
        '54': 'Prof_Science_Tech',
        '55': 'Mgmt_comp',
        '56': 'Admin_sup_Waste_Mgmt_Rem',
        '61': 'Educational',
        '62': 'Healthcare_Social_assist',
        '71': 'Arts_Entertain_Rec',
        '72': 'Accom_Food_serv',
        '81': 'Other_no_pub',
        '92': 'Public_Admin'
    })
    df = df.dropna(subset=['NAICS'])
    
    df['Loan_Approval_Ratio'] = df['SBA_Appv'] / df['GrAppv']
    df['Loan_Disbursement_Ratio'] = df['DisbursementGross'] / df['GrAppv']
    df['Log_DisbursementGross'] = np.log(df['DisbursementGross'])
    
    numerical_columns = [
    # 'Term',
    'NoEmp',
    'CreateJob',
    'RetainedJob',                
    'DisbursementGross',
    'BalanceGross',
    'GrAppv',
    'SBA_Appv',
    'LoanInd',
    'Loan_Approval_Ratio',
    'Loan_Disbursement_Ratio',
    'Log_DisbursementGross']

    categorical_columns = [
    'City',
    'State',
    'Zip',
    'Bank',
    'BankState',
    'FranchiseCode',
    'NAICS',
    'NewExist',
    'UrbanRural',
    'RevLineCr',
    'LowDoc']

    target_column = 'MIS_Status'
    
    X = df[categorical_columns+numerical_columns]
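    # scoring input arrives without MIS_Status, so the target may be absent
    y = df[target_column] if target_column in df.columns else None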

    data = (X,y)
    features = (categorical_columns,numerical_columns,target_column)

    return (data, features)
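
Two of the cleaning steps above can be checked in isolation with plain Python. The helpers below are a minimal sketch, not part of the original notebook: `parse_money`, `naics_sector`, and the abbreviated `NAICS_SECTORS` map are hypothetical names introduced only for illustration.

```python
def parse_money(value):
    """Turn a string such as '$1,234.00 ' into the integer 1234."""
    cleaned = value.strip().replace("$", "").replace(",", "")
    # float first so a decimal part is accepted, then truncate to int
    return int(float(cleaned))

# Abbreviated, hypothetical subset of the sector map used in cleaning().
NAICS_SECTORS = {
    "23": "Construction",
    "44": "Retail_trade",
    "45": "Retail_trade",
}

def naics_sector(code):
    """Map a NAICS code to its sector via the first two digits; None if unknown."""
    return NAICS_SECTORS.get(str(code)[:2])

print(parse_money("$1,234.00 "))
print(naics_sector(451120))
```

Codes whose two-digit prefix is not in the map come back as `None`, which corresponds to the rows that `cleaning` removes with `dropna(subset=['NAICS'])`.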
        

In [4]:
def project_2_scoring(data):
    """
    Function to score input dataset.
    
    Input: dataset in Pandas DataFrame format
    Output: H2OFrame with columns predict, p0, p1, in the order of the records kept by cleaning()
    
    Flow:
        - Load artifacts
        - Transform dataset
        - Score dataset
        - Return predictions
    
    """
    import pickle
    import numpy as np
    import pandas as pd
    import h2o
    
    grid = h2o.load_grid('./h2o_model_grids/gbm_grid1')

    gbm_grid = grid.get_grid(sort_by='auc', decreasing=True)
    best_gbm = gbm_grid.models[0]

    # the context manager closes the file even if unpickling fails
    with open("./artifacts/artifacts_dict_file.pkl", "rb") as artifacts_dict_file:
        artifacts_dict = pickle.load(file=artifacts_dict_file)

    ohe_encoder = artifacts_dict["ohe_encoder"]
    woe_encoder = artifacts_dict["woe_encoder"]
    numerical_encoder = artifacts_dict["numerical_encoder"]
    ohe_columns = artifacts_dict["ohe_columns"]
    woe_columns = artifacts_dict["woe_columns"]
    poly = artifacts_dict["poly_features_encoder"]
    poly_numerical_features = artifacts_dict["poly_numerical_features"]
    
    # clean the DataFrame passed in as `data` (not a global variable), calling cleaning() only once
    (X_test, y_test), (categorical_columns, numerical_columns, target_column) = cleaning(data)

    X_test_encoded = ohe_encoder.transform(X_test)
    X_test_encoded = woe_encoder.transform(X_test_encoded)
    X_test_encoded[numerical_columns] = numerical_encoder.transform(X_test_encoded[numerical_columns])

    # use transform, not fit_transform: saved encoders must not be refit on scoring data
    poly_df= poly.transform(X_test_encoded[poly_numerical_features])[:,len(poly_numerical_features):]
    poly_features = poly.get_feature_names_out(input_features=poly_numerical_features)[len(poly_numerical_features):]
    X_test_encoded[poly_features] = poly_df

    test = h2o.H2OFrame(X_test_encoded)

    results = best_gbm.predict(test)
    
    
    return results
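
The function above returns the raw H2O prediction frame. One way to reshape it into the table described in the requirements (record index, label, and both class probabilities) is sketched below. It assumes the predictions were first converted to pandas, for example with `results.as_data_frame()`; the helper name `to_scoring_frame` and the output column names are hypothetical.

```python
import pandas as pd

def to_scoring_frame(pred_df, record_index):
    """Build the output table: record index, label, P(PIF), P(CHGOFF)."""
    return pd.DataFrame({
        "index": list(record_index),
        "label": pred_df["predict"].astype(int).to_numpy(),
        "p0_PIF": pred_df["p0"].to_numpy(),
        "p1_CHGOFF": pred_df["p1"].to_numpy(),
    })

# Stand-in for results.as_data_frame(); values copied from the first
# three rows of this notebook's sample output.
pred_df = pd.DataFrame({
    "predict": [0, 0, 1],
    "p0": [0.873951, 0.914855, 0.314169],
    "p1": [0.126049, 0.085145, 0.685831],
})

# Hypothetical index of the rows that survived cleaning().
scored = to_scoring_frame(pred_df, record_index=[0, 2, 5])
labels = scored["label"].tolist()
print(labels)
```

Passing the index of the cleaned `X_test` keeps each prediction tied to its input record even though `cleaning` drops rows, and `scored["label"].tolist()` gives the plain Python list of labels.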

In [5]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
test_data = pd.read_csv('test_data.csv')

In [6]:
project_2_scoring(test_data)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


predict,p0,p1
0,0.873951,0.126049
0,0.914855,0.085145
1,0.314169,0.685831
0,0.846864,0.153136
1,0.497732,0.502268
0,0.98091,0.0190895
0,0.789548,0.210452
0,0.981894,0.0181064
0,0.821437,0.178563
0,0.916856,0.0831438
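
The `predict` column depends on the probability cut-off applied to `p1`. H2O's binomial `predict` is generally documented as applying its own F1-optimal threshold; the plain-Python sketch below only illustrates the idea of choosing a threshold that maximizes F1 on labeled validation data and is not H2O's implementation. The function names and the validation values are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(y_true, p1):
    """Try each distinct score as a cut-off; return (threshold, f1) with the highest F1."""
    best = (None, -1.0)
    for thr in sorted(set(p1)):
        preds = [1 if p >= thr else 0 for p in p1]
        score = f1_score(y_true, preds)
        if score > best[1]:
            best = (thr, score)
    return best

# Hypothetical validation labels and P(class 1) scores.
y_valid = [0, 0, 1, 0, 1, 1]
p1_valid = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

threshold, f1 = best_f1_threshold(y_valid, p1_valid)
labels = [1 if p >= threshold else 0 for p in p1_valid]
print(threshold, round(f1, 3), labels)
```

The threshold should be chosen on training or validation data and stored with the other artifacts, so that scoring applies a fixed cut-off instead of recomputing one on unlabeled input.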
