## TRAINING

### Feature engineering

You should train/fit categorical features scalers and encoders on Train only. Use `transform` or equivalent function on Validation/Test datasets.

It is important to understand all the steps before model training, so that you can reliably replicate and test them to produce scoring function.


You should generate various new features. Examples of such features can be seen in the Module-3 lecture on GLMs.  
Your final model should have at least **10** new engineered features.   
On-hot-encoding, label encoding, and target encoding **is not included in the** **10** features to create.    
You can attempt target encoding, however the technique is not expected to produce improvement for Linear models.

Ideas for Feature engineering for various types of variables:
1. https://docs.h2o.ai/driverless-ai/1-10-lts/docs/userguide/transformations.html
2. GLM lecture and hands-on (Module-3)


**Note**: 
- You don't have to perform feature engineering using H2O-3 even if you decided to use H2O-3 GLM for model training.
- It is OK to perform feature engineering using any technique, as long as you can replicate it correctly in the Scoring function.

### Threshold calculation

You will need to calculate optimal threshold for class assignment using F1 metric:
- If using sklearn, use F1 `macro`: `f1_score(y_true, y_pred, average='macro')` 
- If using H2O-3, use F1

You will need to find optimal probability threshold for class assignment, the threshold that maximizes above F1.

### Deliverables in a single zip file in the following structure:
- `notebook` (folder)
    - Jupyter notebook with complete code to manipulate data, train and tune final model. `ipynb` format.
    - Jupyter notebook with scoring function. `ipynb` format.
- `artifacts` (folder)
    - Model and any potential encoders in the "pkl" format or native H2O-3 format (for H2O-3 model)
    - Scoring function that will load the final model and encoders. Separate from above notebook or `.py` file



Your notebook should include explanations about your code and be designed to be easily followed and results replicated. Once you are done with the final version, you will need to test it by running all cells from top to bottom after restarting Kernel. It can be done by running `Kernel -> Restart & Run All`


**Important**: To speed up progress, first produce working code using a small subset of the dataset.

In [225]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
from copy import deepcopy

def train_model(df):
    """
    Train sample model and save artifacts
    """
    from sklearn.linear_model import LogisticRegression
    import pickle
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.metrics import average_precision_score
    import numpy as np
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    
    
    target_col = "MIS_Status"
    cols_to_drop = ['City', 'State', 'Zip','Bank', 'BankState', 'LowDoc','RevLineCr','MIS_Status']
    # Removing the index column
    if "index" in df.columns:
        df.drop(columns="index", inplace=True)
    y = df[target_col] if target_col in df.columns else None
    X = df.drop(columns=[target_col]) if target_col in df.columns else df.copy()


    # Relacing Missing values
    
    for i in df['RevLineCr']:
      if i not in ['Y','N']:
        df['RevLineCr'].replace(i,'N',inplace=True)
        print("RevLineCr",df['RevLineCr'].unique())

    for i in df['LowDoc']:
      if i not in ['Y','N']:
        df['LowDoc'].replace(i,'N',inplace=True)
        print("LowDoc",df['LowDoc'].unique())

    for i in df['NewExist']:
      if i not in [1,2]:
        df['NewExist'].replace(i,None,inplace=True)
        print("NewExist",df['NewExist'].unique())


    category_cols=['City', 'State', 'Bank', 'BankState', 'RevLineCr', 'LowDoc','NewExist']
    for column in category_cols:
        df[column]=df[column].fillna(df[column].mode()[0])

    # Target encoding the categorical columns
    categorical_columns = ['City', 'State', 'Bank', 'BankState', 'RevLineCr', 'LowDoc','NewExist']
    encoder = ce.TargetEncoder(cols=categorical_columns)
    encoder.fit(df[categorical_columns], df['MIS_Status'])
    train_encoded = encoder.transform(df[categorical_columns])
    train_encoded = train_encoded.add_suffix('_trg')
    #train_encoded = pd.concat([train_encoded, data], axis=1)
    train_encoded = pd.concat([train_encoded, df], axis=1)
    for column in categorical_columns:
        train_encoded[column + "_trg"].fillna(train_encoded[column + "_trg"].mean(), inplace=True)
    
    # Renaming the columns
    #train_encoded.rename(columns={col: col + "_trg" if col in categorical_columns else col for col in train_encoded.columns}, inplace=False)
    print(train_encoded.columns)
    


    # Adding Features
    import numpy as np
    # Apply the log transformation to the specific feature in your training data
    small_constant = 1e-10  # You can adjust this constant as needed
    # df['LogColumn'] = np.log(df['OriginalColumn'] + small_constant)
    train_encoded['Log_DisbursementGross'] = np.log1p(train_encoded['DisbursementGross'])
    train_encoded['Log_GrAppv'] = np.log1p(train_encoded['GrAppv'])
    train_encoded['Log_SBA_Appv'] = np.log1p(train_encoded['SBA_Appv'])
    train_encoded['Log_BalanceGross'] = np.log1p(train_encoded['BalanceGross'])
    train_encoded['TotalJobs'] = train_encoded['CreateJob'] + train_encoded['RetainedJob']
    #train_encoded['Loan_Efficiency'] = train_encoded['DisbursementGross'] / (train_encoded['CreateJob'] + train_encoded['RetainedJob'] + 1)
    # Calculate 'LoanToIncomeRatio' as a ratio of 'SBA_Appv' to 'DisbursementGross'
    train_encoded['IncomeToLoanRatio'] = train_encoded['DisbursementGross'] / train_encoded['SBA_Appv']
    # Calculate 'LoanToEmployeesRatio' as a ratio of 'SBA_Appv' to 'NoEmp'
    train_encoded['EmployeesToLoanRatio'] = train_encoded['NoEmp'] / train_encoded['SBA_Appv']
    # Create a binary feature to indicate loans with a balance ('BalanceGross' > 0)
    #train_encoded['HasBalance'] = (train_encoded['BalanceGross'] > 0).astype(int)
    # Calculate 'LoanPerJob' as a ratio of 'SBA_Appv' to 'TotalJobs'
    train_encoded['JobPerLoan'] = train_encoded['TotalJobs'] / train_encoded['SBA_Appv'] 
    # Calculate SBA's Gaurenteed Portion of Approved Loan
    train_encoded['Gauren_SBA_Appv'] = train_encoded['GrAppv'] / train_encoded['SBA_Appv']

    
    # Scaling the numerical columns
    numerical_columns = [ 'NoEmp', 'CreateJob', 'RetainedJob', 'GrAppv', 'SBA_Appv', 'DisbursementGross', 'BalanceGross',
                        'Log_DisbursementGross', 'Log_GrAppv', 'Log_SBA_Appv', 'Log_BalanceGross','TotalJobs','IncomeToLoanRatio', 
                        'EmployeesToLoanRatio', 'JobPerLoan', 'Gauren_SBA_Appv']
    
    scaler = StandardScaler()
    #fit and transform separately
    scaler.fit(train_encoded[numerical_columns])
    train_encoded[numerical_columns] = scaler.transform(train_encoded[numerical_columns])
      

    #X = df.copy()
    #X = X.reset_index(drop=True)

    clf = LogisticRegression(random_state=42,max_iter=100,n_jobs=-1, verbose=1)
    columns_to_train = [col for col in train_encoded.columns if col not in cols_to_drop]
    clf_ = clf.fit(train_encoded[columns_to_train],y)

    # Evaluate the model using AUCPR
    aucpr_score = average_precision_score(y_true=y, y_score=clf_.predict_proba(train_encoded[columns_to_train])[:, 1])
    print("AUCPR score:", aucpr_score)
       
    
    param_grid = {'C':[10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                 'penalty': ['l2', 'l1', 'elasticnet'],
                 'solver': ['lbfgs'], # 'newton-cholesky', 'sag'
                 'l1_ratio': [0.1, 0.3, 0.7]
                 }
    

    
    grid1 = GridSearchCV(clf_, 
                    param_grid, cv =7 , return_train_score= True)
    grid1.fit(train_encoded[columns_to_train], y)
    print("Best parameters found: ", grid1.best_params_)
    print("Best cross-validation score: {:.2f}".format(grid1.best_score_))
    top_clf = grid1.best_estimator_
        
    #clf = LogisticRegression(max_iter=100, random_state=0)
    
    #columns_to_train = [x for x in X.columns if x not in cols_to_drop]
    #print("Training on following columns:", columns_to_train)
    # Create logistic regression classifier
    # Fit classifier to training data
    # grid1 = GridSearchCV(clf.fit(X[columns_to_train], y), 
    #                  param_grid, cv =7 , return_train_score= True)
    # grid1.fit(X[columns_to_train], y)
    # print(grid1.best_params_)
    #clf.fit(X_train[columns_to_train], y_train)

          
   
    # End Todo
    
    # Saving the artifacts
    artifacts_dict = {
        "model": top_clf,
        "target_encoder": encoder,
        "te_columns": categorical_columns,
        "columns_to_train":columns_to_train,
        "numerical_columns":numerical_columns,
        "category_cols":category_cols,
        "scaler":scaler,
        "h2o_model_path":model_path
    }

    #calculating threshold
    if y is not None:
        optimal_threshold = calculate_optimal_threshold(clf_, train_encoded[columns_to_train], y)
        print(f"Optimal Threshold: {optimal_threshold}")
        # Saving the threshold in artifacts
        artifacts_dict["threshold"] = optimal_threshold

    artifacts_dict_file = open("D:/Work/Gre/UTD/Courses/Fall/MIS6341/Softwares/Python/ml-fall-2023/Project1/artifacts/artifacts_dict_file.pkl", "wb")
    pickle.dump(obj=artifacts_dict, file=artifacts_dict_file)
    
    artifacts_dict_file.close()    
    return clf_

In [226]:
from sklearn.model_selection import train_test_split
        
df = pd.read_csv("D:/Work/Gre/UTD/Courses/Fall/MIS6341/Softwares/Python/ml-fall-2023/Project1/SBA_loans_project_1.csv")
target = "MIS_Status"
y = df[target]
x = df.drop(columns=[target])

# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)
df_train = X_train.copy()
df_train[target] = y_train
train_model(df_train)

RevLineCr ['Y' 'N' 'T' nan '`' 'R' '1' '.' '5' '2' '3' 'A' '-' '7' 'C']
RevLineCr ['Y' 'N' nan '`' 'R' '1' '.' '5' '2' '3' 'A' '-' '7' 'C']
RevLineCr ['Y' 'N' '`' 'R' '1' '.' '5' '2' '3' 'A' '-' '7' 'C']
RevLineCr ['Y' 'N' 'R' '1' '.' '5' '2' '3' 'A' '-' '7' 'C']
RevLineCr ['Y' 'N' '1' '.' '5' '2' '3' 'A' '-' '7' 'C']
RevLineCr ['Y' 'N' '.' '5' '2' '3' 'A' '-' '7' 'C']
RevLineCr ['Y' 'N' '5' '2' '3' 'A' '-' '7' 'C']
RevLineCr ['Y' 'N' '2' '3' 'A' '-' '7' 'C']
RevLineCr ['Y' 'N' '3' 'A' '-' '7' 'C']
RevLineCr ['Y' 'N' 'A' '-' '7' 'C']
RevLineCr ['Y' 'N' '-' '7' 'C']
RevLineCr ['Y' 'N' '7' 'C']
RevLineCr ['Y' 'N' 'C']
RevLineCr ['Y' 'N']
LowDoc ['N' 'Y' 'R' 'S' 'C' '0' 'A' '1']
LowDoc ['N' 'Y' 'S' 'C' '0' 'A' '1']
LowDoc ['N' 'Y' 'C' '0' 'A' '1']
LowDoc ['N' 'Y' '0' 'A' '1']
LowDoc ['N' 'Y' 'A' '1']
LowDoc ['N' 'Y' '1']
LowDoc ['N' 'Y']
NewExist [1.0 2.0 None nan]
NewExist [1.0 2.0 None nan]
NewExist [1.0 2.0 None nan]
NewExist [1.0 2.0 None nan]
NewExist [1.0 2.0 None]
NewExist [1.0 2.0

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


AUCPR score: 0.15862387003852957


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.

In [None]:
import pickle

def load_and_print_artifacts_dict(path):
    artifacts_dict = pickle.load(open(path, "rb"))

    print("Target encoder mapping:")
    print([ac for ac in artifacts_dict["target_encoder"].mapping])

    print("Columns to train:")
    print([ac for ac in artifacts_dict["columns_to_train"]])

if __name__ == "__main__":
    load_and_print_artifacts_dict("./Artifacts/artifacts_dict_file.pkl")

Target encoder mapping:
['City', 'State', 'Bank', 'BankState', 'RevLineCr', 'LowDoc', 'NewExist']
Columns to train:
['City_trg', 'State_trg', 'Bank_trg', 'BankState_trg', 'RevLineCr_trg', 'LowDoc_trg', 'NewExist_trg', 'NAICS', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural', 'DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv', 'Log_DisbursementGross', 'Log_GrAppv', 'Log_SBA_Appv', 'Log_BalanceGross', 'TotalJobs', 'IncomeToLoanRatio', 'EmployeesToLoanRatio', 'JobPerLoan', 'Gauren_SBA_Appv']
