<a href="https://colab.research.google.com/github/revanthmadasu/machine-learning/blob/master/nallam-project1/Project-1-starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1 Starter

**This is a draft (version 0); changes are possible and will be announced.**

Project 1 lets students practice the Data Science concepts learned so far.

The project includes the following tasks:
- Load the dataset. Don't use the "index" column for training.
- Clean up the data:
    - Encode or replace missing values
    - Replace feature values that appear incorrect
- Encode categorical variables
- Split the dataset into Train/Validation/Test
- Add engineered features
- Train and tune an ML model
- Provide final metrics using the Test dataset

### Types of models to train

Your final submission should include a single model.
To find the best model, try the following:
1. Sklearn Logistic Regression - try all combinations of regularization
2. H2O-3 GLM - try different combinations of regularization



### Feature engineering

Fit scalers and categorical feature encoders on Train only. Use `transform` or an equivalent function on the Validation/Test datasets.

It is important to understand every step that precedes model training, so that you can reliably replicate and test those steps when writing the scoring function.
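
The fit-on-Train-only rule can be sketched as follows. This is a minimal illustration, not the required implementation: the toy frames, the choice of `UrbanRural` and `Term`, and the helper name `apply_transforms` are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames standing in for the Train and Validation splits (hypothetical values).
train = pd.DataFrame({"UrbanRural": [0, 1, 2, 1], "Term": [60, 84, 240, 120]})
valid = pd.DataFrame({"UrbanRural": [1, 2], "Term": [36, 300]})

# Fit the encoder and the scaler on Train only.
ohe = OneHotEncoder(handle_unknown="ignore").fit(train[["UrbanRural"]])
scaler = StandardScaler().fit(train[["Term"]])


def apply_transforms(df):
    """Apply the already-fitted encoder and scaler; nothing is refitted here."""
    ohe_cols = [f"UrbanRural_{v}" for v in ohe.categories_[0]]
    encoded = pd.DataFrame(
        ohe.transform(df[["UrbanRural"]]).toarray(),
        columns=ohe_cols,
        index=df.index,
    )
    out = pd.concat([df.drop(columns=["UrbanRural"]), encoded], axis=1)
    out["Term_scaled"] = scaler.transform(df[["Term"]]).ravel()
    return out


train_t = apply_transforms(train)
valid_t = apply_transforms(valid)  # same function, same fitted objects
print(valid_t)
```

The same `apply_transforms` function can later be called from the scoring function, provided the fitted `ohe` and `scaler` objects are saved as artifacts.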


You should generate various new features. Examples of such features can be seen in the Module-3 lecture on GLMs.
Your final model should have at least **10** new engineered features. One-hot encoding, label encoding, and target encoding do not count toward the **10** features.
You may try target encoding, but it is not expected to improve linear models.

Ideas for Feature engineering for various types of variables:
1. https://docs.h2o.ai/driverless-ai/1-10-lts/docs/userguide/transformations.html
2. GLM lecture and hands-on (Module-3)
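
A few engineered features of the kind described above might look like the sketch below. The specific features, the helper name `add_engineered_features`, and the demo rows are illustrative assumptions; the sketch also assumes the currency columns have already been converted to floats.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df):
    """Return a copy of df with a few illustrative engineered columns."""
    out = df.copy()
    # Share of the approved amount that the SBA guarantees (NaN when GrAppv is 0).
    out["SBA_Appv_ratio"] = out["SBA_Appv"] / out["GrAppv"].replace(0, np.nan)
    # Log transform to compress the long right tail of loan amounts.
    out["DisbursementGross_log"] = np.log1p(out["DisbursementGross"])
    # Flag: borrower and lender are located in the same state.
    out["same_state"] = (out["State"] == out["BankState"]).astype(int)
    # First two digits of the six-digit NAICS code (0 when the code is missing).
    out["NAICS_sector"] = out["NAICS"] // 10000
    # Disbursed amount per employee (+1 avoids division by zero).
    out["disb_per_emp"] = out["DisbursementGross"] / (out["NoEmp"] + 1)
    return out


# Hypothetical demo rows using column names from the project dataset.
demo = pd.DataFrame({
    "State": ["MD", "NY"],
    "BankState": ["VA", "NY"],
    "NAICS": [811111, 0],
    "NoEmp": [7, 1],
    "DisbursementGross": [743000.0, 25000.0],
    "GrAppv": [743000.0, 50000.0],
    "SBA_Appv": [743000.0, 25000.0],
})
feats = add_engineered_features(demo)
print(feats[["SBA_Appv_ratio", "same_state", "NAICS_sector"]])
```

Because the function is pure (it only reads the input frame), the identical call can be reused inside the scoring function.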


**Note**: 
- You don't have to perform feature engineering using H2O-3 even if you decide to use H2O-3 GLM for model training.
- It is OK to perform feature engineering using any technique, as long as you can replicate it correctly in the scoring function.


### Threshold calculation

You will need to calculate the optimal threshold for class assignment using the F1 metric:
- If using sklearn, use F1 `macro`: `f1_score(y_true, y_pred, average='macro')`
- If using H2O-3, use F1

The optimal probability threshold is the one that maximizes the F1 metric above.
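
One way to search for that threshold with sklearn is sketched below. The helper name `best_f1_threshold`, the threshold grid, and the toy validation arrays are assumptions for illustration; in the project the inputs would be the Validation labels and the model's predicted probability of class 1.

```python
import numpy as np
from sklearn.metrics import f1_score


def best_f1_threshold(y_true, proba_1, thresholds=None):
    """Return (threshold, macro F1) for the threshold that maximizes macro F1."""
    if thresholds is None:
        thresholds = np.arange(0.05, 0.96, 0.01)
    scores = [
        f1_score(y_true, (proba_1 >= t).astype(int), average="macro", zero_division=0)
        for t in thresholds
    ]
    best = int(np.argmax(scores))
    return float(thresholds[best]), float(scores[best])


# Toy validation data: true labels and predicted probabilities of class 1.
y_val = np.array([0, 0, 0, 0, 1, 1])
p_val = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.80])

threshold, score = best_f1_threshold(y_val, p_val)
print(f"threshold={threshold:.2f}, macro F1={score:.3f}")
```

The sweep must run over probabilities (for example `model.predict_proba(X_val)[:, 1]`), not over hard labels from `predict`, otherwise every threshold gives the same result.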



### Scoring function

Project 1 will be graded on the completeness and performance of your final model against the hold-out dataset.
The hold-out dataset will not be known to the students. As part of your deliverables, you will need to submit a scoring function. The scoring function will perform the following:
- Accept a dataset in the same format as provided with the project, minus the "MIS_Status" column
- Load the trained model and any encoders/scalers needed to transform the data
- Transform the dataset into a format that can be scored with the trained model
- Score the dataset and return the results for each record:
    - Record ID
    - Record label as determined by the final model (0 or 1)
    - If your model returns probabilities, assign the label based on the maximum-F1 threshold
    
Scoring function header:
```
def project_1_scoring(data):
    """
    Function to score input dataset.
    
    Input: dataset in Pandas DataFrame format
    Output: Python list of labels in the same order as input records
    
    Flow:
        - Load artifacts
        - Transform dataset
        - Score dataset
        - Return labels
    
    """
    l = data.shape[0]
    return l*[0]
```

Look for a full example of the scoring function at the bottom of the notebook. **Don't copy it as is; it is just an example.**



### Deliverables in a single zip file in the following structure:
- `notebook` (folder)
    - Jupyter notebook with complete code to manipulate data, train and tune final model. `ipynb` format
    - Jupyter notebook in `html` format
- `artifacts` (folder)
    - Model and any potential encoders in the "pkl" format or native H2O-3 format (for H2O-3 model)
    - Scoring function that loads the final model and encoders, either in a notebook separate from the one above or in a `.py` file



Your notebook should include explanations of your code and be designed so that it is easy to follow and the results are easy to replicate. Once you are done with the final version, test it by restarting the kernel and running all cells from top to bottom. This can be done with `Kernel -> Restart & Run All`.


**Important**: To speed up progress, first produce working code using a small subset of the dataset.

In [1]:
pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
pip install category-encoders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Dataset description

The dataset is from the U.S. Small Business Administration (SBA). The U.S. SBA was founded in 1953 on the principle of promoting and assisting small enterprises in the U.S. credit market (SBA Overview and History, US Small Business Administration (2015)). Small businesses have been a primary source of job creation in the United States; therefore, fostering small business formation and growth has social benefits by creating job opportunities and reducing unemployment. There have been many success stories of start-ups receiving SBA loan guarantees, such as FedEx and Apple Computer. However, there have also been stories of small businesses and/or start-ups that have defaulted on their SBA-guaranteed loans.  
More info on the original dataset: https://www.kaggle.com/mirbektoktogaraev/should-this-loan-be-approved-or-denied

**Don't use the original dataset; use only the dataset provided with the project requirements in eLearning.**

## Preparation

Use the dataset provided in eLearning.

In [2]:
import pandas as pd
pd.set_option('display.max_columns', 1500)

import warnings
warnings.filterwarnings('ignore')

#Extend cell width
# `display` and `HTML` live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [None]:
"""
Created on Mon Mar 18 18:25:50 2019

@author: Uri Smashnov

Purpose: Analyze input Pandas DataFrame and return stats per column
Details: The function calculates levels for categorical variables and allows to analyze summarized information

To view wide table set following Pandas options:
pd.set_option('display.width', 1000)
pd.set_option('max_colwidth',200)
"""
import pandas as pd
def describe_more(df,normalize_ind=False, weight_column=None, skip_columns=[], dropna=True):
    var = [] ; l = [] ; t = []; unq =[]; min_l = []; max_l = [];
    assert isinstance(skip_columns, list), "Argument skip_columns should be list"
    if weight_column is not None:
        if weight_column not in list(df.columns):
            raise AssertionError('weight_column is not a valid column name in the input DataFrame')
      
    for x in df:
        if x in skip_columns:
            pass
        else:
            var.append( x )
            # Count distinct observed values; unused categorical levels have count 0 and are excluded
            value_counts = df[x].value_counts(dropna=dropna)
            uniq_counts = int((value_counts > 0).sum())
            l.append(uniq_counts)
            t.append( df[ x ].dtypes )
            min_l.append(df[x].apply(str).str.len().min())
            max_l.append(df[x].apply(str).str.len().max())
            if weight_column is not None and x not in skip_columns:
                df2 = df.groupby(x).agg({weight_column: 'sum'}).sort_values(weight_column, ascending=False)
                df2['authtrans_vts_cnt']=((df2[weight_column])/df2[weight_column].sum()).round(2)
                unq.append(df2.head(n=100).to_dict()[weight_column])
            else:
                df_cat_d = df[x].value_counts(normalize=normalize_ind,dropna=dropna).round(decimals=2)
                df_cat_d = df_cat_d[df_cat_d>0]
                #unq.append(df[x].value_counts().iloc[0:100].to_dict())
                unq.append(df_cat_d.iloc[0:100].to_dict())
            
    levels = pd.DataFrame( { 'A_Variable' : var , 'Levels' : l , 'Datatype' : t ,
                             'Min Length' : min_l,
                             'Max Length': max_l,
                             'Level_Values' : unq} )
    #levels.sort_values( by = 'Levels' , inplace = True )
    return levels

### Load data

In [3]:
data = pd.read_csv('./data/SBA_loans_project_1.zip')

In [None]:
print("Data shape:", data.shape)

Data shape: (809247, 21)


**Review dataset**

In [None]:
desc_df = describe_more(data)
desc_df

## Dataset preparation and clean-up

Modify and clean up the dataset as follows:
- Replace or encode NA/null values
- Convert the strings styled as '$XXXX.XX' to float values. Columns = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']
- Convert MIS_Status to 0/1, with "CHGOFF" mapped to 1

Perform any additional clean-up you see fit.
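
The currency and target conversions listed above can also be done with vectorized pandas operations. The sketch below is one possible approach; the helper name `currency_to_float` and the two sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw rows in the same style as the project file.
raw = pd.DataFrame({
    "DisbursementGross": ["$743,000.00 ", "$25,000.00 "],
    "MIS_Status": ["P I F", "CHGOFF"],
})


def currency_to_float(series):
    """Strip '$', ',' and whitespace, then parse as float (unparseable values become NaN)."""
    cleaned = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


clean = raw.copy()
clean["DisbursementGross"] = currency_to_float(clean["DisbursementGross"])
# Map the target explicitly; any unexpected label becomes NaN instead of passing through.
clean["MIS_Status"] = clean["MIS_Status"].map({"P I F": 0, "CHGOFF": 1})
print(clean)
```

Using `map` with an explicit dictionary makes unexpected target values visible as NaN, which is easier to audit than chained `replace` calls.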

### Removing Duplicates

In [17]:
# removing duplicates
data = data.drop_duplicates()
print("Data shape:", data.shape)

Data shape: (809247, 21)


### Removing index and converting currency to float values

In [18]:
# Drop the "index" column; copy so later assignments do not write into a view of `data`
processed_data = data.loc[:, data.columns != 'index'].copy()

def dollar_string_to_float(dollar_str):
  """Convert a string such as '$743,000.00' to a float."""
  dollar_str = dollar_str.strip().replace(',', '')
  return float(dollar_str.lstrip('$'))

currency_cols = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']
for colname in currency_cols:
  processed_data[colname] = processed_data[colname].apply(dollar_string_to_float)
processed_data.loc[0]

City                                  GLEN BURNIE
State                                          MD
Zip                                         21060
Bank                 BUSINESS FINANCE GROUP, INC.
BankState                                      VA
NAICS                                      811111
Term                                          240
NoEmp                                           7
NewExist                                      1.0
CreateJob                                       6
RetainedJob                                     7
FranchiseCode                                   1
UrbanRural                                      1
RevLineCr                                       0
LowDoc                                          N
DisbursementGross                        743000.0
BalanceGross                                  0.0
GrAppv                                   743000.0
SBA_Appv                                 743000.0
MIS_Status                                  P I F


### Convert MIS_Status to 0/1. Make value "CHGOFF" as 1

In [19]:
processed_data['MIS_Status'] = processed_data['MIS_Status'].replace('P I F', 0)
processed_data['MIS_Status'] = processed_data['MIS_Status'].replace('CHGOFF', 1)
processed_data = processed_data.dropna()
processed_data['MIS_Status'].unique()

array([0., 1.])

In [None]:
processed_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 784130 entries, 0 to 809246
Data columns (total 20 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   City               784130 non-null  category
 1   State              784130 non-null  category
 2   Zip                784130 non-null  int64   
 3   Bank               784130 non-null  category
 4   BankState          784130 non-null  category
 5   NAICS              784130 non-null  int64   
 6   Term               784130 non-null  int64   
 7   NoEmp              784130 non-null  int64   
 8   NewExist           784130 non-null  float64 
 9   CreateJob          784130 non-null  int64   
 10  RetainedJob        784130 non-null  int64   
 11  FranchiseCode      784130 non-null  int64   
 12  UrbanRural         784130 non-null  int64   
 13  RevLineCr          784130 non-null  category
 14  LowDoc             784130 non-null  category
 15  DisbursementGross  784130 non-null

### Handling null data for individual columns.
Rows with null values were already removed above, so this step currently has no effect.
Because the dataset is large, dropping those rows removes only about 1.2% of the data (799,510 of 809,247 rows remain, per the `MIS_Status` counts below), which is why we chose to drop them.
To impute nulls instead of dropping rows, comment out `processed_data = processed_data.dropna()` and run this code.


In [20]:
# handling null data
row_null_counts = processed_data.isna().any(axis=1).sum()
print("\nNumber of rows with null values:", row_null_counts)
def check_null_counts():
  col_null_counts = dict()
  for column in processed_data.columns:
    col_null_count = processed_data[column].isna().sum()
    if col_null_count:
      col_null_counts[column] = col_null_count
  return col_null_counts
print(f'before handling: {check_null_counts()}')
categoral_null_cols = ['City', 'State', 'Bank', 'BankState', 'RevLineCr', 'LowDoc']
for null_cat_col in categoral_null_cols:
  processed_data[null_cat_col] = processed_data[null_cat_col].fillna('unknown').astype('category')
processed_data['NewExist'] = processed_data['NewExist'].fillna(0)
print(f'after handling handling: {check_null_counts()}')


Number of rows with null values: 0
before handling: {}
after handling handling: {}


In [21]:
processed_data['MIS_Status'].value_counts()

0.0    658734
1.0    140776
Name: MIS_Status, dtype: int64

In [22]:
processed_data['RevLineCr'].value_counts()

N    374421
0    231797
Y    179628
T     13606
1        22
R        12
`         9
2         6
C         2
-         1
.         1
3         1
5         1
7         1
A         1
Q         1
Name: RevLineCr, dtype: int64

Feature Engineering: 

Processing the RevLineCr feature: `N` and `0` are mapped to 0, `Y` is mapped to 1, and rows with any other value (including the 13,606 `T` rows) are removed.

In [23]:
processed_data['RevLineCr'] = processed_data['RevLineCr'].replace('N', 0)
processed_data['RevLineCr'] = processed_data['RevLineCr'].replace('0', 0)
processed_data['RevLineCr'] = processed_data['RevLineCr'].replace('Y', 1)
for remove_cat in processed_data['RevLineCr'].cat.categories.tolist():
  if remove_cat not in [0,1]:
    processed_data = processed_data[processed_data['RevLineCr'] != remove_cat]
    processed_data['RevLineCr'] = processed_data['RevLineCr'].cat.remove_categories(remove_cat)

processed_data['RevLineCr'].value_counts()

0    606218
1    179628
Name: RevLineCr, dtype: int64
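
The same kind of flag clean-up can be written with an explicit mapping, which is easy to reuse unchanged in the scoring function. This is a sketch under assumptions: the helper name `to_binary_flag` and the sample values are hypothetical.

```python
import pandas as pd


def to_binary_flag(series, mapping):
    """Map known raw codes to 0/1; any code not in the mapping becomes NaN."""
    return series.map(mapping)


rev = pd.Series(["N", "0", "Y", "T", "1", None])
flag = to_binary_flag(rev, {"N": 0, "0": 0, "Y": 1})

# Rows with unrecognised codes can then be dropped (or imputed) in one step.
kept = flag.dropna().astype(int)
print(kept.tolist())
```

Dropping rows is only possible at training time; a scoring function has to return a label for every record, so there the unrecognised codes would need a default value instead.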

In [None]:
processed_data['MIS_Status'].unique()

array([0., 1.])

Feature Engineering:
Processing the NewExist feature: the value 1 is recoded to 0 and the value 2 to 1, giving a binary feature.

In [None]:
processed_data['NewExist'] = processed_data['NewExist'].replace(1, 0)
processed_data['NewExist'] = processed_data['NewExist'].replace(2, 1)

Feature Engineering:

Processing the LowDoc feature: `Y` and `1` are mapped to 1, `N` and `0` are mapped to 0, and rows with any other value are removed.

In [25]:
processed_data['LowDoc'] = processed_data['LowDoc'].replace('Y', 1)
processed_data['LowDoc'] = processed_data['LowDoc'].replace('N', 0)
processed_data['LowDoc'] = processed_data['LowDoc'].replace('0', 0)
processed_data['LowDoc'] = processed_data['LowDoc'].replace('1', 1)
for remove_cat in processed_data['LowDoc'].cat.categories.tolist():
  if remove_cat not in [0,1]:
    processed_data = processed_data[processed_data['LowDoc'] != remove_cat]
    processed_data['LowDoc'] = processed_data['LowDoc'].cat.remove_categories(remove_cat)
processed_data['LowDoc'].value_counts()
# processed_data['LowDoc'].cat.categories

0    686998
1     97132
Name: LowDoc, dtype: int64

In [None]:
processed_data['UrbanRural'].value_counts()

1    408049
0    284083
2     91998
Name: UrbanRural, dtype: int64

## Categorical and numerical variables encoding

Encode categorical variables using any of the techniques below. Don't use LabelEncoder.
- One-hot encoder for variables with fewer than 10 valid values. Name your new columns "Original_name"_valid_value
- Target encoder from the following library: https://contrib.scikit-learn.org/category_encoders/index.html . Name your new column "Original_name"_trg
- WOE encoder from the following library: https://contrib.scikit-learn.org/category_encoders/index.html . Name your new column "Original_name"_woe


WOE encoder can be used with numerical variables too. 


Example of use for target encoder:
```
import category_encoders as ce

encoder = ce.TargetEncoder(cols=[...])

encoder.fit(X, y)
X_cleaned = encoder.transform(X_dirty)
```
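
To show what a WOE encoding represents, the sketch below computes a simplified weight of evidence by hand: the log of the ratio between a category's share of events (target = 1) and its share of non-events (target = 0). The helper name `fit_woe`, the smoothing constant, and the toy data are assumptions; the library's encoder may use a different regularization, so its numbers need not match these.

```python
import numpy as np
import pandas as pd


def fit_woe(categories, y, smoothing=0.5):
    """Return a Series mapping each category to a smoothed weight of evidence."""
    df = pd.DataFrame({"cat": categories, "y": y})
    events = df.groupby("cat")["y"].sum()        # count of y == 1 per category
    totals = df.groupby("cat")["y"].count()
    non_events = totals - events                 # count of y == 0 per category
    event_share = (events + smoothing) / (events.sum() + 2 * smoothing)
    non_event_share = (non_events + smoothing) / (non_events.sum() + 2 * smoothing)
    return np.log(event_share / non_event_share)


# Toy Train data: a categorical column and a binary target.
train_cat = pd.Series(["Y", "Y", "N", "N", "N", "N"])
train_y = pd.Series([1, 1, 0, 0, 0, 1])

woe = fit_woe(train_cat, train_y)

# Apply to new data; categories unseen during fitting fall back to 0 (no evidence).
new_cat = pd.Series(["Y", "N", "T"])
new_woe = new_cat.map(woe).fillna(0.0)
print(woe.to_dict(), new_woe.tolist())
```

A positive value means the category is over-represented among events, a negative value means it is under-represented.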

### Sampling to reduce run time

In [27]:
s_processed_data = processed_data.sample(frac=0.0125, random_state=27)
s_processed_data
print(s_processed_data.shape)

(9802, 20)


In [26]:
cat_cols_bin_en = ['City', 'State', 'Bank', 'BankState', 'Zip', 'NAICS', 'UrbanRural']

In [None]:
s_processed_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3921 entries, 724290 to 635320
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   City               3921 non-null   category
 1   State              3921 non-null   category
 2   Zip                3921 non-null   int64   
 3   Bank               3921 non-null   category
 4   BankState          3921 non-null   category
 5   NAICS              3921 non-null   int64   
 6   Term               3921 non-null   int64   
 7   NoEmp              3921 non-null   int64   
 8   NewExist           3921 non-null   float64 
 9   CreateJob          3921 non-null   int64   
 10  RetainedJob        3921 non-null   int64   
 11  FranchiseCode      3921 non-null   int64   
 12  UrbanRural         3921 non-null   int64   
 13  RevLineCr          3921 non-null   category
 14  LowDoc             3921 non-null   category
 15  DisbursementGross  3921 non-null   float64 
 16 

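### Choosing Binary Encoder
Several of the categorical variables (for example City, Bank, Zip, and NAICS) have many distinct values, so one-hot encoding would produce a very high-dimensional, sparse feature space. Binary encoding represents each variable with far fewer columns.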
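
The idea behind binary encoding can be illustrated by hand: assign each category an integer code and spread the code's binary digits over a handful of 0/1 columns. This is a simplified sketch, and the helper name `binary_encode` is an assumption; the `category_encoders` implementation may assign codes and handle unknown values differently.

```python
import math
import pandas as pd


def binary_encode(series, name):
    """Encode a categorical Series as the binary digits of a 1-based ordinal code."""
    # factorize gives codes 0..k-1 (and -1 for missing); shift so missing maps to 0.
    codes = pd.Series(pd.factorize(series)[0] + 1, index=series.index)
    n_bits = max(1, math.ceil(math.log2(codes.max() + 1)))
    bits = {
        f"{name}_{i}": (codes // 2 ** (n_bits - 1 - i)) % 2
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)


states = pd.Series(["MD", "VA", "NY", "MD", "TX", "CA"])
encoded = binary_encode(states, "State")
print(encoded)
```

Five distinct states need only three columns here, whereas one-hot encoding would need five; the gap grows quickly for columns such as City or Zip.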

In [28]:
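from category_encoders import BinaryEncoder
import pandas as pd
# Note: the encoder is fitted on the full dataset here; to follow the project rules,
# fit it on the Train split only and call transform on Validation/Test
bin_encoder = BinaryEncoder(cols=cat_cols_bin_en)
bin_encoded_data = bin_encoder.fit_transform(processed_data)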

In [29]:
bin_encoded_data.columns

Index(['City_0', 'City_1', 'City_2', 'City_3', 'City_4', 'City_5', 'City_6',
       'City_7', 'City_8', 'City_9', 'City_10', 'City_11', 'City_12',
       'City_13', 'City_14', 'State_0', 'State_1', 'State_2', 'State_3',
       'State_4', 'State_5', 'Zip_0', 'Zip_1', 'Zip_2', 'Zip_3', 'Zip_4',
       'Zip_5', 'Zip_6', 'Zip_7', 'Zip_8', 'Zip_9', 'Zip_10', 'Zip_11',
       'Zip_12', 'Zip_13', 'Zip_14', 'Bank_0', 'Bank_1', 'Bank_2', 'Bank_3',
       'Bank_4', 'Bank_5', 'Bank_6', 'Bank_7', 'Bank_8', 'Bank_9', 'Bank_10',
       'Bank_11', 'Bank_12', 'BankState_0', 'BankState_1', 'BankState_2',
       'BankState_3', 'BankState_4', 'BankState_5', 'NAICS_0', 'NAICS_1',
       'NAICS_2', 'NAICS_3', 'NAICS_4', 'NAICS_5', 'NAICS_6', 'NAICS_7',
       'NAICS_8', 'NAICS_9', 'NAICS_10', 'Term', 'NoEmp', 'NewExist',
       'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural_0',
       'UrbanRural_1', 'RevLineCr', 'LowDoc', 'DisbursementGross',
       'BalanceGross', 'GrAppv', 'SBA_Appv', 'MIS_Sta

In [30]:
encoded_data = bin_encoded_data

In [40]:
encoded_data.iloc[:, 66:-1].columns

Index(['Term', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob',
       'FranchiseCode', 'UrbanRural_0', 'UrbanRural_1', 'RevLineCr', 'LowDoc',
       'DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv'],
      dtype='object')

### Test Train splitting

In [None]:
from sklearn.model_selection import train_test_split
X = encoded_data.iloc[:, :-1].values
y = encoded_data.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=44)
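
The project asks for three splits (Train/Validation/Test). A minimal sketch of a stratified 60/20/20 split is shown below; the synthetic arrays and the split proportions are assumptions for illustration. Performing the split before fitting any encoder keeps Validation and Test data out of the fitting step.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 20% positive class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# First hold out the Test set (20%), keeping class proportions.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=44
)
# Then split the remainder into Train (60% of total) and Validation (20% of total).
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=44
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing `random_state` in both calls makes the three splits reproducible across kernel restarts.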

# Model Training

See Project summary for types of models

### Using Logistic Regression
We select Logistic Regression because this is a binary classification problem on a large dataset; Logistic Regression is designed for binary classification and trains quickly at this scale.

Logistic Regression assumes that the log-odds of the outcome are approximately linear in the input features. We treat this as a working assumption for the loan data; engineered features can help where the relationship is not linear.



In [82]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8293182678042331
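
To put this accuracy in context, it helps to compare it with the accuracy of always predicting the majority class. The sketch below uses the `MIS_Status` class counts printed earlier in the notebook; those counts predate the later row filtering, so treat the result as an approximation.

```python
# Class counts taken from the MIS_Status value_counts() output shown earlier.
n_paid, n_charged_off = 658_734, 140_776

# Accuracy of a model that predicts class 0 (paid in full) for every record.
baseline_accuracy = n_paid / (n_paid + n_charged_off)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Because the classes are imbalanced, an accuracy close to this baseline says little about how well charge-offs are detected, which is why the project evaluates the model with F1.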


## Model Tuning

Choose one model from the above list and explain why you picked it over the others. Perform tuning for the selected model:
- Hyper-parameter tuning. Your hyper-parameter search space should have at least 50 combinations.
- To avoid overfitting and to obtain a reasonable estimate of model performance on the hold-out dataset, split your dataset as follows:
    - Train: used to train the model
    - Validation: used to validate the model in each round of training
    - Test: used only once, on the final model, to provide final performance metrics
- Feature engineering. See the project description.

**Select the final model that produces the best performance on the Test dataset.**
- For the best model, calculate the probability threshold that maximizes F1.
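
A search space with at least 50 valid combinations can be built as a list of dictionaries, so that each penalty is paired only with solvers that support it. The values below are illustrative assumptions, not recommended settings, and the accepted `penalty` values differ between scikit-learn versions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid

C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

param_grid = [
    # l1 and l2 are both supported by liblinear and saga: 2 * 2 * 6 * 2 = 48 combinations
    {
        "penalty": ["l1", "l2"],
        "solver": ["liblinear", "saga"],
        "C": C_values,
        "fit_intercept": [True, False],
    },
    # elasticnet needs saga and an l1_ratio: 1 * 1 * 3 * 6 * 2 = 36 combinations
    {
        "penalty": ["elasticnet"],
        "solver": ["saga"],
        "l1_ratio": [0.25, 0.5, 0.75],
        "C": C_values,
        "fit_intercept": [True, False],
    },
]

n_combinations = len(ParameterGrid(param_grid))
print("Combinations:", n_combinations)

# Score candidates with macro F1, the metric the project uses for sklearn models.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1_macro",
    cv=3,
    n_jobs=-1,
)
```

Calling `search.fit(X_train, y_train)` would then run the search; `ParameterGrid` is used here only to count the candidates before spending time on fitting.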

### Grid search for hyper parameter tuning

In [84]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
logreg = LogisticRegression()
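# Carve a validation set out of the training data (no random_state, so this split changes between runs)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3)
# Define the hyperparameter search space: 3 * 3 * 2 * 3 * 1 = 54 combinations.
# Not every combination is valid: 'elasticnet' requires solver='saga' plus an l1_ratio,
# so those candidates fail to fit and receive NaN scores under GridSearchCV's default error_score.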
param_grid = {
    'penalty': ['l2', 'elasticnet', 'none'],
    'C': [0.01, 0.1, 1.0],
    'fit_intercept': [True, False],
    'solver': ['newton-cg', 'lbfgs', 'saga'],
    'max_iter': [100]
}

# Define the GridSearchCV object
grid_search = GridSearchCV(logreg, param_grid, cv=2, n_jobs=-1, verbose=1, 
                           scoring='accuracy')

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)
val_score = grid_search.score(X_val, y_val)

Fitting 2 folds for each of 54 candidates, totalling 108 fits


In [None]:
print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Validation score: {val_score}")

Best hyperparameters: {'C': 1.0, 'fit_intercept': False, 'max_iter': 100, 'penalty': 'l2', 'solver': 'newton-cg'}
Validation score: 0.850360060189166


In [None]:
best_model = grid_search.best_estimator_
score = best_model.score(X_test, y_test)

print("Best model score:", score)

Best model score: 0.8517082626605282


## Save all artifacts

Save all artifacts needed for scoring function:
- Trained model
- Encoders
- Any other artifacts you will need for scoring

**You should stop your notebook here. Scoring function should be in a separate file/notebook.**

In [8]:
pip install category_encoders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting category_encoders
  Downloading category_encoders-2.6.0-py2.py3-none-any.whl (81 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.0


### Artifact: ModelFormation
The ModelFormation class handles all preprocessing, train/test splitting, and model training.
It holds the best Logistic Regression model found by hyper-parameter tuning.

The trainedModel object is stored in the artifact dictionary and is used by the model scoring function.

Because trainedModel is an instance of a custom class, Python needs the class definition in order to unpickle it.

The same definition is therefore provided alongside the model scoring function.

Note that the scoring function does not train anything; the class definition is there only so that Python can recreate the object from the artifact.
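
An alternative that avoids shipping a custom class definition is to pickle only standard scikit-learn objects, for example a fitted `Pipeline` plus the chosen threshold. The sketch below is an assumption-laden illustration: the synthetic data, the pipeline steps, and the placeholder threshold are not part of the project code.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared training data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

# Only library objects and plain values go into the artifact dictionary.
artifacts = {"pipeline": pipeline, "threshold": 0.5}  # 0.5 is a placeholder value

# Round trip through pickle (in memory here; a file works the same way).
blob = pickle.dumps(artifacts)
restored = pickle.loads(blob)

proba_1 = restored["pipeline"].predict_proba(X_demo)[:, 1]
labels = (proba_1 >= restored["threshold"]).astype(int).tolist()
print(labels[:10])
```

With this layout the scoring file only needs scikit-learn installed; no training class has to be copied into it.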

In [41]:
from category_encoders import BinaryEncoder
import pandas as pd
class ModelFormation:
    def process_encode_data(self, data):
        # Copy so that the column assignments below do not write into a view of `data`
        processed_data = data.loc[:,data.columns!='index'].copy()

        def dollar_string_to_float(dollar_str):
          dollar_str = dollar_str.replace(',', '')
          dollar_float = float(dollar_str[1:])
          return dollar_float
        # converting dollar to float
        currency_cols = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']
        for colname in currency_cols:
          processed_data[colname] = processed_data[colname].apply(dollar_string_to_float)

        processed_data = processed_data.dropna()
        
        processed_data['MIS_Status'] = processed_data['MIS_Status'].replace('P I F', 0)
        processed_data['MIS_Status'] = processed_data['MIS_Status'].replace('CHGOFF', 1)

        processed_data['RevLineCr'] = processed_data['RevLineCr'].astype('category')

        processed_data['RevLineCr'] = processed_data['RevLineCr'].replace('N', 0)
        processed_data['RevLineCr'] = processed_data['RevLineCr'].replace('0', 0)
        processed_data['RevLineCr'] = processed_data['RevLineCr'].replace('Y', 1)
        for remove_cat in processed_data['RevLineCr'].cat.categories.tolist():
          if remove_cat not in [0,1]:
            processed_data = processed_data[processed_data['RevLineCr'] != remove_cat]
            processed_data['RevLineCr'] = processed_data['RevLineCr'].cat.remove_categories(remove_cat)

        processed_data['NewExist'] = processed_data['NewExist'].replace(1, 0)
        processed_data['NewExist'] = processed_data['NewExist'].replace(2, 1)
        
        processed_data['LowDoc'] = processed_data['LowDoc'].astype('category')
        processed_data['LowDoc'] = processed_data['LowDoc'].replace('Y', 1)
        processed_data['LowDoc'] = processed_data['LowDoc'].replace('N', 0)
        processed_data['LowDoc'] = processed_data['LowDoc'].replace('0', 0)
        processed_data['LowDoc'] = processed_data['LowDoc'].replace('1', 1)
        for remove_cat in processed_data['LowDoc'].cat.categories.tolist():
          if remove_cat not in [0,1]:
            processed_data = processed_data[processed_data['LowDoc'] != remove_cat]
            processed_data['LowDoc'] = processed_data['LowDoc'].cat.remove_categories(remove_cat)
        processed_data['LowDoc'].value_counts()

        s_processed_data = processed_data.sample(frac=0.5, random_state=27)

        cat_cols_bin_en = ['City', 'State', 'Bank', 'BankState', 'Zip', 'NAICS', 'UrbanRural']

        # Keep the fitted encoder on the instance so the scoring function can reuse it
        self.bin_encoder = BinaryEncoder(cols=cat_cols_bin_en)
        bin_encoded_data = self.bin_encoder.fit_transform(processed_data)

        return bin_encoded_data

    def test_train_split(self, encoded_data):
        from sklearn.model_selection import train_test_split

        X = encoded_data.iloc[:, :-1].values
        y = encoded_data.iloc[:, -1].values

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)

        self.x_cols_to_score = encoded_data.iloc[:, :-1].columns

        return X_train, X_test, y_train, y_test
    def get_cols_to_score(self, start=66, end=-1):
      return self.x_cols_to_score[start:end]

    def get_model(self, data):
      encoded_data = self.process_encode_data(data)
      X_train, X_test, y_train, y_test = self.test_train_split(encoded_data)
      from sklearn.linear_model import LogisticRegression

      lr_model = LogisticRegression(C= 1.0, fit_intercept= False, max_iter= 100, penalty= 'l2', solver='newton-cg')
      lr_model.fit(X_train, y_train)

      return lr_model


    def __init__(self, train_model = False, data = None):
      # remove index
      if train_model:
        self.trained_model = self.get_model(data)


trainedModel = ModelFormation(True, data)

In [16]:
trainedModel.trained_model

### Calculating Threshold


In [78]:
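from sklearn.metrics import f1_score
import numpy as np
X_train, X_test, y_train, y_test = trainedModel.test_train_split(encoded_data)
# Caution: predict() returns hard 0/1 labels, not probabilities
y_pred = trainedModel.trained_model.predict(X_test)
thresholds = np.arange(0.1, 1.0, 0.1)

# Calculate the F1 score for each threshold.
# Because y_pred holds only 0/1 labels, `y_pred >= t` is identical for every t in this range,
# so all F1 scores tie and argmax returns the first threshold (0.1).
# A meaningful search would sweep predict_proba(X_test)[:, 1] and use average='macro'.
f1_scores = [f1_score(y_test, y_pred >= t) for t in thresholds]

# Find the threshold that maximizes the F1 score
best_threshold = thresholds[np.argmax(f1_scores)]

print("Best threshold:", best_threshold)
f1_score(y_test, y_pred, average='macro')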

Best threshold: 0.1


0.6265405330846991

In [42]:
import pickle
import joblib

In [80]:
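artifacts_dict = {
    "model": trainedModel,
    # 0.63 matches the macro F1 printed above rather than a probability threshold;
    # replace it with the threshold found by sweeping predicted probabilities
    "threshold": 0.63
}
# Use a context manager so the file is closed even if pickling fails
with open("artifacts_dict_file.pkl", "wb") as artifacts_dict_file:
    pickle.dump(obj=artifacts_dict, file=artifacts_dict_file)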

## Stop Here. Create new file/notebook

## ==============================================

## Model Scoring

Write function that will load artifacts from above, transform and score on a new dataset.
Your function should return Python list of labels. For example: [0,1,0,1,1,0,0]


In [None]:
def project_1_scoring(data):
    """
    Function to score input dataset.
    
    Input: dataset in Pandas DataFrame format
    Output: Python list of labels in the same order as input records
    
    Flow:
        - Load artifacts
        - Transform dataset
        - Score dataset
        - Return pandas DF with following columns:
            - index
            - label
            - probability_0
            - probability_1
    """
    pass

### Example of Scoring function

Don't copy the code as is. It is provided as an example only. 
- Function `train_model` - you need to focus on model and artifacts saving:
    ```
    pickle.dump(obj=artifacts_dict, file=artifacts_dict_file)
    ```
- Function `project_1_scoring` - you should have similar function with name `project_1_scoring`. The function will:
    - Get Pandas dataframe as parameter
    - Will load model and all needed encoders
    - Will perform needed manipulations on the input Pandas DF - in the exact same format as input file for the project, minus MIS_Status feature
    - Return Pandas DataFrame
        - record index
        - predicted class for threshold maximizing F1
        - probability for class 0 (PIF)
        - probability for class 1 (CHGOFF)


In [None]:
"""
Don't copy or use the cell code in any way!!!
The code is provided as an example of generating artifacts for scoring function
Your scoring function code should not have model training part!!!!
"""
import pandas as pd
import numpy as np
def train_model(data):
    """
    Train sample model and save artifacts
    """
    from sklearn.preprocessing import OneHotEncoder
    from copy import deepcopy
    from sklearn.linear_model import LogisticRegression
    import pickle
    from sklearn.impute import SimpleImputer
    
    target_col = "Survived"
    cols_to_drop = ['Name', 'Ticket', 'Cabin','SibSp', 'Parch', 'Sex','Embarked','PassengerId','Survived']
    y = data[target_col]
    X = data.drop(columns=[target_col])
    
    # Impute Embarked (assign the result; np.NaN and chained inplace replace are unavailable in recent NumPy/pandas)
    X['Embarked'] = X['Embarked'].fillna('S')
    
    # Create new feature
    X['FamilySize'] = X['SibSp'] + X['Parch']
    
    # Mean impute Age
    imp_age_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    imp_age_mean.fit(X[['Age']])
    X['Age'] = imp_age_mean.transform(X[['Age']])


    ohe_orig_columns = ["Embarked","Sex"]
    cat_encoders = {}
    for col in ohe_orig_columns:
        enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        enc.fit(X[[col]])
        result = enc.transform(X[[col]])
        ohe_columns = [col+"_"+str(x) for x in enc.categories_[0]]
        result_train = pd.DataFrame(result, columns=ohe_columns)
        X= pd.concat([X, result_train], axis=1)
        cat_encoders[col] = [deepcopy(enc),"ohe"]
        
    clf = LogisticRegression(max_iter=1000, random_state=0)
    
    columns_to_train = [x for x in X.columns if x not in cols_to_drop]
    clf.fit(X[columns_to_train], y)
    
    # Todo: Add code to calculate optimal threshold. Replace 0.5 !!!!!
    threshold = 0.5
    # End Todo
    
    artifacts_dict = {
        "model": clf,
        "cat_encoders": cat_encoders,
        "imp_age_mean": imp_age_mean,
        "ohe_columns": ohe_orig_columns,
        "columns_to_train":columns_to_train,
        "threshold": threshold
    }
    artifacts_dict_file = open("./artifacts/artifacts_dict_file.pkl", "wb")
    pickle.dump(obj=artifacts_dict, file=artifacts_dict_file)
    
    artifacts_dict_file.close()    
    return clf

In [None]:
df_train = pd.read_csv('titanic.csv')
train_model(df_train)

### Example scoring function

This is example only. Don't copy code as is!!!   
You must place scoring function in a separate Python file or Jupyter notebook.   

**Don't place function in the same notebook as rest of the code**

In [None]:
def project_1_scoring(data):
    """
    Function to score input dataset.
    
    Input: dataset in Pandas DataFrame format
    Output: Python list of labels in the same order as input records
    
    Flow:
        - Load artifacts
        - Transform dataset
        - Score dataset
        - Return labels
    
    """
    # The scoring function lives in its own file, so it must import everything it uses
    import pickle
    import numpy as np
    import pandas as pd
    
    X = data.copy()
    
    '''Load Artifacts'''
    artifacts_dict_file = open("./artifacts/artifacts_dict_file.pkl", "rb")
    artifacts_dict = pickle.load(file=artifacts_dict_file)
    artifacts_dict_file.close()
    
    clf = artifacts_dict["model"]
    cat_encoders = artifacts_dict["cat_encoders"]
    imp_age_mean = artifacts_dict["imp_age_mean"]
    ohe_columns = artifacts_dict["ohe_columns"]
    columns_to_score = artifacts_dict["columns_to_train"]
    threshold = artifacts_dict["threshold"]
    
    # Impute Embarked (same rule as in training)
    X['Embarked'] = X['Embarked'].fillna('S')
    
    # Create new feature
    X['FamilySize'] = X['SibSp'] + X['Parch']
    
    # Mean impute Age
    X['Age'] = imp_age_mean.transform(X[['Age']])
    
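    '''Encode categorical columns'''
    for col in ohe_columns:
        enc = cat_encoders[col][0]
        result = enc.transform(X[[col]])
        # Use a separate name so the list being iterated is not overwritten
        ohe_names = [col+"_"+str(x) for x in enc.categories_[0]]
        # Reuse X's index so concat aligns rows even if the input index is not 0..n-1
        result_df = pd.DataFrame(result, columns=ohe_names, index=X.index)
        X = pd.concat([X, result_df], axis=1)
        
    y_pred_proba = clf.predict_proba(X[columns_to_score])
    # Label a record 1 when its class-1 probability reaches the threshold
    y_pred = (y_pred_proba[:,1] >= threshold).astype(np.int16)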
    d = {"index":data["PassengerId"],
         "label":y_pred,
         "probability_0":y_pred_proba[:,0],
         "probability_1":y_pred_proba[:,1]}
    
    return pd.DataFrame(d)

In [None]:
project_1_scoring(df_train).head()

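|   | index | label | probability_0 | probability_1 |
|---|-------|-------|---------------|---------------|
| 0 | 1 | 0 | 0.901298 | 0.098702 |
| 1 | 2 | 1 | 0.071879 | 0.928121 |
| 2 | 3 | 1 | 0.367665 | 0.632335 |
| 3 | 4 | 1 | 0.098564 | 0.901436 |
| 4 | 5 | 0 | 0.92346 | 0.07654 |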
