# BLU15 - Model CSI


In [1]:
import pandas as pd
import numpy as np
import hashlib
import io
import json
import pickle
import requests
import joblib
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import precision_score, recall_score, precision_recall_curve
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from lightgbm import LGBMClassifier
import os

Alright, let's go on with the BLU and have fun doing some exercises!

<img src="media/show.jpg" width=300/>

As a reminder:

In the learning unit we received a pretrained model and a new batch of data, checked whether the model still performed well, and decided what to do with it.

In the end, we found unexpected changes in the data distribution, so the model needs to be retrained.

Because the new dataset is pretty small, we have to concatenate the old data with the new data and train a new model on the combination of the two datasets.


## Exercise 1:

- Read the .csv file with the original dataframe as **df_old.**

- If you take a look at the **VehicleSearchedIndicator** column, you'll see that this subset contains searched cars only, so we can drop the **VehicleSearchedIndicator** column.

- As the new dataset doesn't contain the **InterventionDateTime** column, we also need to drop it from the old dataset.

- Read new observations as **df_new**.

- Combine both the dataframes and add a new column called **is_new** that is going to have all **False** values for the old data and all **True** values for the new observations.

- Call the combined dataframe **df_combined**

- Drop all **NaN** values

- Apply lowercase to department names and intervention location names in the combined dataset

In [2]:
# YOUR CODE HERE
df_old = pd.read_csv(os.path.join('data', 'train_searched.csv'))
df_old = df_old.drop(['VehicleSearchedIndicator', 'InterventionDateTime'], axis=1)
df_old['is_new'] = False
df_new = pd.read_csv(os.path.join('data', 'new_observations.csv'))
df_new['is_new'] = True
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row labels, as append did
df_combined = pd.concat([df_old, df_new]).dropna()
df_combined['Department Name'] = df_combined['Department Name'].apply(lambda x: str(x).lower())
df_combined['InterventionLocationName'] = df_combined['InterventionLocationName'].apply(lambda x: str(x).lower())
df_combined.head(n=2)

Unnamed: 0,ContrabandIndicator,Department Name,InterventionLocationName,InterventionReasonCode,ReportingOfficerIdentificationID,ResidentIndicator,SearchAuthorizationCode,StatuteReason,SubjectAge,SubjectEthnicityCode,SubjectRaceCode,SubjectSexCode,TownResidentIndicator,is_new
0,False,bridgeport,bridgeport,V,1207,True,I,Speed Related,37.0,H,W,M,True,False
1,True,milford,milford,E,2325,True,I,Defective Lights,30.0,N,W,M,True,False


In [3]:
assert df_combined.shape == (78715, 14), 'combined dataframe shape is wrong'
assert 'VehicleSearchedIndicator' not in df_combined.columns, 'Did you drop the VehicleSearchedIndicator column?'
assert 'is_new' in df_combined.columns, 'Did you add is_new column?'
assert sum(df_combined['is_new']) == 2000, 'is_new column has a wrong number of True values'
assert all([name.islower() for name in df_combined['Department Name']]), 'Department name is not lowercased'
assert all([name.islower() or not name.isalpha() for name in df_combined['InterventionLocationName']]), 'InterventionLocationName is not lowercased'

## Exercise 2:

**Split the created dataset into train and test parts in the following way:**

- First, create the train and test sets. Call them **df_train** and **df_test**.
> We'll need them in future exercises. 
- Then, split **train** and **test** into **X_train**, **X_test**, **y_train** and **y_test**.
- The test set should contain 25% of the rows of df_combined
- Make sure that 25% of the new observations end up in the test set.
- Use random state 42 while splitting the datasets
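To see why the "25% of new values" requirement calls for a stratified split, here is a minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration, not the real dataset). Passing the `is_new` column to `stratify` asks `train_test_split` to keep the same share of new rows on both sides of the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 8 "old" rows and 4 "new" rows (made up for illustration)
toy = pd.DataFrame({
    "SubjectAge": list(range(20, 32)),
    "is_new": [False] * 8 + [True] * 4,
})

# stratify keeps the old/new proportion identical in train and test
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["is_new"]
)

print(len(toy_train), int(toy_train["is_new"].sum()))
print(len(toy_test), int(toy_test["is_new"].sum()))
```

Without `stratify`, the number of new rows in the test set would depend on the random shuffle, which matters when only a small fraction of the data is new.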

In [4]:
# YOUR CODE HERE
target = 'ContrabandIndicator'

# Stratify on is_new so that 25% of the new observations land in the test set
df_train, df_test = train_test_split(df_combined, test_size=0.25, random_state=42, stratify=df_combined['is_new'])
X_train = df_train.drop(target, axis=1)
X_test = df_test.drop(target, axis=1)
y_train = df_train[target]
y_test = df_test[target]

In [5]:
df_train.shape

(59036, 14)

In [6]:
assert df_train.shape == (59036, 14), 'df_train shape is wrong. Are you sure test size is 25%?'
assert df_test.shape == (19679, 14), 'df_test shape is wrong. Are you sure test size is 25%?'
assert X_train.shape == (59036, 13), 'X_train shape is wrong. Are you sure test size is 25%?'
assert X_test.shape == (19679, 13), 'X_test shape is wrong. Are you sure test size is 25%?'
assert y_train.shape == (59036,), 'y_train shape is wrong. Are you sure test size is 25%?'
assert y_test.shape == (19679,), 'y_test shape is wrong. Are you sure test size is 25%?'
assert sum(X_train['is_new']) == 1500, 'is_new column in Training set has a wrong number of True values. Make sure to have 25% of new values'
assert sum(X_test['is_new']) == 500, 'is_new column in Test set has a wrong number of True values. Make sure to have 25% of new values'

Now we need to retrain the model.

If we simply load the old pipeline and refit it, it's going to ignore our new **is_new** feature, because its column transformer only lists the columns it was built with.
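A minimal sketch of that behaviour on toy data (the frame and the two transformers below are assumptions for illustration, not the original pipeline): a `ColumnTransformer` only transforms the columns it is given and, with the default `remainder='drop'`, silently discards everything else.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data, not the real dataset
toy = pd.DataFrame({
    "StatuteReason": ["Other", "Speed Related", "Other"],
    "is_new": [False, False, True],
})

# A transformer configured before `is_new` existed
old_preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["StatuteReason"])]
)
# The same transformer with `is_new` listed explicitly
new_preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["StatuteReason", "is_new"])]
)

# Number of encoded features each transformer passes on to the model
old_width = old_preprocessor.fit_transform(toy).shape[1]
new_width = new_preprocessor.fit_transform(toy).shape[1]
print(old_width, new_width)
```

The old transformer produces only the two `StatuteReason` indicator columns, so a model fitted behind it never sees `is_new`.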

So let's create the same pipeline as in the original notebook:

In [7]:
categorical_features = df_combined.columns.drop(['ContrabandIndicator', 'SubjectAge'])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[('cat', categorical_transformer, categorical_features)])

pipeline = make_pipeline(
    preprocessor,
    LGBMClassifier(n_jobs=-1, random_state=42),
)

In [8]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  Index(['Department Name', 'InterventionLocationName', 'InterventionReasonCode',
       'ReportingOfficerIdentificationID', 'ResidentIndicator',
       'SearchAuthorizationCode', 'StatuteReason', 'SubjectEthnicityCode',
       'SubjectRaceCode', 'SubjectSexCode', 'TownResidentIndicator', 'is_new'],
      dtype='object'))])),
                ('lgbmclassifier', LGBMClassifier(random_

## Exercise 3:

Now let's test how the model performs:

- Make model binary predictions and save them to an array **preds**. 
- Make model probability predictions and save them to an array called **preds_proba**. Keep only the True class probabilities (by default, probability prediction returns the probabilities of both the False and True classes)
- Create a variable called **precision** with the model precision score.
- Create a variable called **recall** with the model recall score

In [9]:
# YOUR CODE HERE
preds = pipeline.predict(X_test)
preds = preds.astype(int)
preds_proba = pipeline.predict_proba(X_test)
preds_proba = preds_proba[:,1]
precision = precision_score(y_test,preds)
recall = recall_score(y_test,preds)
#raise NotImplementedError()
recall

0.5460269865067466

In [10]:
expected_recall = 'a2cffa866c48b997372a62104161ba89e68fb439c418fc2559e2a32c44987ce8'
hash_recall = hashlib.sha256(bytes(str(round(recall, 2)), encoding='utf8')).hexdigest()
assert hash_recall == expected_recall

In [11]:
assert len(preds) == 19679, 'Are you sure you made predictions for test set only?'
assert len(preds_proba) == 19679, 'Are you sure you made predictions for test set only?'
assert not isinstance(preds_proba[0], np.ndarray), 'Are you sure you kept only the True class predictions?'
assert round(sum(preds_proba)) == 6563
np.testing.assert_almost_equal(precision, 0.64966, decimal=2)
np.testing.assert_almost_equal(recall, 0.546027, decimal=2)


## Exercise 4: 

It's already not bad, but let's now try to calculate the optimal threshold.

By threshold I mean the minimal probability of a prediction that we're going to call "True". 

By default, any prediction with probability > 0.5 is called True, but we might find a better value. 

The metric is the same: our success rate (precision) needs to be at least 50%, and the recall should be as high as possible. 

Save the result to a variable called **threshold**. Round the result to 2 decimal places.
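Here is a small sketch of the idea on made-up scores (the `y_true` and `scores` arrays are assumptions chosen so the result is easy to check by hand). Among all thresholds whose precision is at least 0.5, it picks the one with the highest recall. Note that `precision_recall_curve` counts a prediction as positive when its score is greater than or equal to the threshold.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and predicted probabilities, sorted by score for readability
y_true = np.array([0, 1, 0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.55, 0.6, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# precision and recall have one more entry than thresholds; drop it to align them
precision, recall = precision[:-1], recall[:-1]

# Keep only thresholds that meet the precision requirement,
# then take the one with the highest recall among them
meets_requirement = precision >= 0.5
best = int(np.argmax(np.where(meets_requirement, recall, -1.0)))
toy_threshold = thresholds[best]

print(toy_threshold, precision[best], recall[best])
```

In this toy case, lowering the threshold below the chosen value would catch the remaining positive, but only by letting precision fall under 0.5.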

In [12]:
# YOUR CODE HERE
precision, recall, thresholds = precision_recall_curve(y_test, preds_proba)
# precision and recall have one more element than thresholds; drop the last one to align them
precision = precision[:-1]
recall = recall[:-1]
# Recall never increases as the threshold grows, so the first threshold
# that reaches precision >= 0.5 is the one with the highest recall
min_index = [i for i, prec in enumerate(precision) if prec >= 0.5][0]
threshold = round(thresholds[min_index], 2)
recall[min_index]

0.8493253373313343

In [13]:
assert round(threshold, 2) == threshold, 'Did you round the value?'
ans_threshold = hashlib.sha256(bytes(str(threshold), encoding='utf8')).hexdigest()
assert ans_threshold == "6382e07f9de0c85293aee2a45b88c61c28589419682ecc2f8c097f750e861a24"

## Exercise 5:

- Now create a list of predictions. 

> All the values from the **preds_proba** list that have a value > threshold should be True. The rest should be False.

> Save the result to a variable called **best_preds**

- Calculate the precision and recall and save them to variables **precision** and **recall**

In [14]:
# YOUR CODE HERE
best_preds = [pred > threshold for pred in preds_proba]
precision = precision_score(y_test, best_preds)
recall = recall_score(y_test, best_preds)
recall

0.8442278860569715

In [15]:
np.testing.assert_almost_equal(precision, 0.50290256, decimal=2)
np.testing.assert_almost_equal(recall, 0.84422789, decimal=2)

## Exercise 6:

Now let's find out whether removing rare values is going to help. 

**Filter *df_train* (the one you created in Exercise 2) the following way:** 

- Remove rows with **Department Name** that appear <= 50 times
- Remove rows with **InterventionLocationName** that appear <= 50 times
- Remove rows with **ReportingOfficerIdentificationID** that appear <= 30 times
- Remove rows with **StatuteReason** that appear <= 10 times
- Note: it's better to leave the original dataframe untouched. Create a copy of the original dataframe and save the results to a variable **train_filtered**

> We have to filter the values after splitting the dataset into training and test sets, because filtering the test set would also change the score. If we kept only the examples that are the easiest to predict, we'd get a very nice score, but in production we should expect both the filtered and the unfiltered values. 

> We don't need to worry that some values in the test set will be absent from the training set, because the pipeline simply ignores them.

> (You can reuse the logic from the original model's notebook, but I suggest implementing it yourself; it's a good exercise.)
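The "pipeline simply ignores them" part comes from `OneHotEncoder(handle_unknown='ignore')`. A minimal sketch with made-up department names (the arrays below are assumptions, not rows from the dataset): a category that was never seen during fitting is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up values: "middlebury" appears only at prediction time
train_values = np.array([["bridgeport"], ["milford"], ["bridgeport"]], dtype=object)
test_values = np.array([["milford"], ["middlebury"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_values)

# Columns follow the sorted training categories: bridgeport, milford
encoded = encoder.transform(test_values).toarray()
print(encoded)
```

The unseen value carries no signal for that feature, so the model has to rely on the remaining columns for such rows.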

In [16]:
# YOUR CODE HERE
min_frequency = {
    "Department Name": 50,
    "InterventionLocationName": 50,
    "ReportingOfficerIdentificationID": 30,
    "StatuteReason": 10
}

def filter_values(df: pd.DataFrame, column_name: str, threshold: int) -> pd.DataFrame:
    """Keep only the rows whose value in column_name appears more than threshold times."""
    value_counts = df[column_name].value_counts()
    to_keep = value_counts[value_counts > threshold].index
    return df[df[column_name].isin(to_keep)]

train_filtered = df_train.copy()
# The loop variable is not called `threshold`, so the probability threshold from Exercise 4 is not overwritten
for feature, min_count in min_frequency.items():
    train_filtered = filter_values(train_filtered, feature, min_count)

In [17]:
assert train_filtered.shape == (30502, 14), 'Make sure to filter rare values. Make sure to filter only train set.'
# `x in series` checks the index labels, so compare against .values to test the column contents
assert 'middlebury' not in train_filtered['Department Name'].values, 'Did you filter department names?'
assert 'hampton' not in train_filtered['InterventionLocationName'].values, 'Did you filter InterventionLocationName ?'
assert 'DACYR048' not in train_filtered['ReportingOfficerIdentificationID'].values, 'Did you filter officer ids?'
assert 'Stop Sign ' not in train_filtered['StatuteReason'].values, 'Did you filter statute reasons?'

## Exercise 7:

**Let's split *train_filtered* into *X* and *y* parts and do the same thing once again:**

- Fit the model on the training set (this time the filtered one).

- Predict probabilities for the test set (the untouched one).

- Select the best threshold for the specified requirements (precision >= 0.5, maximum possible recall).

- Round the threshold to 2 decimal places.

- Transform probabilities to binary answers: probability above the threshold = True, False otherwise.

- Calculate the precision and recall scores for these predictions. 

You shouldn't need exact instructions, as you did exactly the same things in Exercises 3, 4 and 5.

Save the scores to variables called **filtered_precision** and **filtered_recall**

In [18]:
# YOUR CODE HERE
X_train_filtered = train_filtered.drop('ContrabandIndicator', axis=1)
y_train_filtered = train_filtered['ContrabandIndicator']
pipeline.fit(X_train_filtered, y_train_filtered)
preds = pipeline.predict(X_test).astype(int)
preds_proba = pipeline.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, preds_proba)
precision = precision[:-1]
recall = recall[:-1]
min_index = [i for i, prec in enumerate(precision) if prec >= 0.5][0]
# Scores read off the curve at the unrounded threshold
filtered_precision = precision[min_index]
filtered_recall = recall[min_index]
threshold = round(thresholds[min_index], 2)
filtered_recall

0.8344827586206897

In [19]:
np.testing.assert_almost_equal(filtered_precision, 0.501309, decimal=2)
np.testing.assert_almost_equal(filtered_recall, 0.83238, decimal=2)

Okay, so it seems the original notebook made the mistake of evaluating the model on a filtered test set. In fact, filtering features with these frequency limits decreased the recall (0.844 -> 0.832). 

## Exercise 8:
    
Now I'll let you use your imagination and try to filter the categorical values differently. 

You're free to do whatever you want, but here are a few ideas you can use:

- Instead of dropping rare categories, create a new value for them

- Adjust the frequency values (e.g. keep a part of departments we just filtered or filter even more). You can try to search all the possible combinations of frequency values if you want. 

Your task is to create a list of *True/False* predictions for the **X_test** and call them **best_preds**. These predictions have to have precision >= 0.5 and recall > 0.84422789
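As a starting point for the first idea, here is a hedged sketch of replacing rare categories with a placeholder instead of dropping rows. The helper name `group_rare_values`, the `"other_rare"` placeholder and the toy frames are all assumptions for illustration. The frequent values are learned from the training set only and then applied to both sets, so the test set is not used to make any decision.

```python
import pandas as pd

def group_rare_values(train, test, column, min_count, placeholder="other_rare"):
    """Replace values seen <= min_count times in train with a placeholder (hypothetical helper)."""
    counts = train[column].value_counts()
    frequent = counts[counts > min_count].index
    train_out = train.copy()
    test_out = test.copy()
    # Series.where keeps values where the condition holds and replaces the rest
    train_out[column] = train[column].where(train[column].isin(frequent), placeholder)
    test_out[column] = test[column].where(test[column].isin(frequent), placeholder)
    return train_out, test_out

# Made-up department names
toy_train = pd.DataFrame({"Department Name": ["hartford"] * 3 + ["milford"] * 2 + ["hampton"]})
toy_test = pd.DataFrame({"Department Name": ["hartford", "hampton", "salem"]})

toy_train_grouped, toy_test_grouped = group_rare_values(
    toy_train, toy_test, "Department Name", min_count=1
)
print(toy_train_grouped["Department Name"].tolist())
print(toy_test_grouped["Department Name"].tolist())
```

Unlike dropping rows, this keeps every training example, and values that are rare or unseen at prediction time fall into the same bucket. Whether it actually improves the recall here is something you'd have to measure.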

In [20]:
df_train['SubjectAge'].value_counts().values

array([3281, 3233, 3191, 3184, 3153, 3126, 2714, 2527, 2471, 2369, 2319,
       1935, 1926, 1564, 1561, 1464, 1409, 1395, 1129, 1054,  995,  920,
        832,  793,  772,  660,  628,  605,  595,  586,  575,  514,  506,
        455,  428,  408,  401,  387,  367,  300,  254,  248,  235,  232,
        193,  160,  118,  113,   95,   92,   67,   53,   47,   47,   47,
         43,   38,   28,   20,   18,   14,   13,   13,   12,   10,    9,
          8,    8,    8,    7,    6,    6,    5,    5,    4,    4,    4,
          3,    3,    3,    2,    2,    2,    1,    1,    1,    1,    1])

In [21]:
df_train

Unnamed: 0,ContrabandIndicator,Department Name,InterventionLocationName,InterventionReasonCode,ReportingOfficerIdentificationID,ResidentIndicator,SearchAuthorizationCode,StatuteReason,SubjectAge,SubjectEthnicityCode,SubjectRaceCode,SubjectSexCode,TownResidentIndicator,is_new
59156,False,glastonbury,glastonbury,V,SAS0410,True,C,Administrative Offense,29.0,H,B,M,True,False
39492,False,csp troop i,shelton,V,1000002761,True,C,Moving Violation,25.0,N,W,M,False,False
43199,True,csp troop c,tolland,V,1000002352,True,I,Other,26.0,N,W,M,False,False
50400,False,new haven,new haven,E,137,True,C,Defective Lights,28.0,N,B,M,True,False
60632,False,stratford,stratford,E,30252,True,C,Display of Plates,26.0,N,B,M,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36837,True,middletown,middletown,I,231,False,C,Other,22.0,H,W,M,True,False
1818,True,westport,westport,I,PSC20044,False,O,Other,24.0,N,B,M,False,True
12268,False,rocky hill,rocky hill,V,EAG0167,True,O,Suspended License,26.0,N,W,M,False,False
36443,True,csp troop g,westport,V,119974171,False,O,Speed Related,20.0,N,B,M,False,False


In [26]:
# YOUR CODE HERE
min_frequency = {
    "Department Name": 50,
    #"InterventionLocationName": 50,
    #"ReportingOfficerIdentificationID": 30,
    #"StatuteReason": 10,
    #"SubjectAge": 10,
}
# filter_values is defined in the Exercise 6 cell.
# train_filtered still carries the Exercise 6 filters, so this loop narrows it further.
for feature, min_count in min_frequency.items():
    train_filtered = filter_values(train_filtered, feature, min_count)
X_train_filtered = train_filtered.drop('ContrabandIndicator', axis=1)
y_train_filtered = train_filtered['ContrabandIndicator']
pipeline.fit(X_train_filtered, y_train_filtered)
preds = pipeline.predict(X_test).astype(int)
preds_proba = pipeline.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, preds_proba)
precision = precision[:-1]
recall = recall[:-1]
min_index = [i for i, prec in enumerate(precision) if prec >= 0.5][0]
filtered_precision = precision[min_index]
filtered_recall = recall[min_index]
threshold = round(thresholds[min_index], 2)
# The check below scores best_preds, so build it from this model's probabilities
best_preds = [pred > threshold for pred in preds_proba]
filtered_precision

0.5000448068823371

In [23]:
precision = precision_score(y_test, best_preds)
recall = recall_score(y_test, best_preds)
assert precision >= 0.5
assert recall > 0.84422789

AssertionError: 

## Exercise 9:

So, we have the model. It's usually a good idea to retrain the model on the whole dataset, so now I want you to:
- Apply the filters that you just created in Exercise 7 to **df_combined**
- Train the same model on the whole dataset
- Export the model, train columns and data types to **/tmp/<file_name>**, where files are called **new_pipeline.pickle**, **new_dtypes.pickle** and **new_columns.json**.

In [34]:
# YOUR CODE HERE
min_frequency = {
    "Department Name": 50,
    "InterventionLocationName": 50,
    "ReportingOfficerIdentificationID": 30,
    "StatuteReason": 10
}
df_combined_filtered = df_combined.copy()
for feature, threshold in min_frequency.items():
    df_combined_filtered=filter_values(df_combined_filtered, feature, threshold)
df_combined_filtered_X = df_combined_filtered.drop('ContrabandIndicator',axis = 1)
df_combined_filtered_y =df_combined_filtered['ContrabandIndicator']
categorical_features = df_combined.columns.drop(['ContrabandIndicator', 'SubjectAge'])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[('cat', categorical_transformer, categorical_features)])

pipeline = make_pipeline(
    preprocessor,
    LGBMClassifier(n_jobs=-1, random_state=42),
)
pipeline.fit(df_combined_filtered_X,df_combined_filtered_y)
with open('/tmp/new_columns.json','w') as fh:
    json.dump(df_combined_filtered_X.columns.tolist(),fh)

with open('/tmp/new_dtypes.pickle','wb') as fh:
    pickle.dump(df_combined_filtered_X.dtypes,fh)
    
joblib.dump(pipeline, '/tmp/new_pipeline.pickle');
#raise NotImplementedError()

In [35]:
with open('/tmp/new_columns.json') as fh:
    columns = json.load(fh)


with open('/tmp/new_pipeline.pickle', 'rb') as fh:
    pipeline = joblib.load(fh)


with open('/tmp/new_dtypes.pickle', 'rb') as fh:
    dtypes = pickle.load(fh)

assert isinstance(columns, list), 'columns need to be a list of training features'
assert 'ContrabandIndicator' not in columns, 'there should be only training features in columns. You got target there.'
assert 'is_new' in columns, "your columns don't contain is_new feature. Are you sure you updated the columns file?"
assert isinstance(pipeline, Pipeline), 'new_pipeline.pickle does not seem to be an instance of Pipeline class.'
assert isinstance(dtypes, pd.core.series.Series)
assert all([column in dtypes.index for column in columns]), 'some columns from new_columns file are not in the new_dtypes file'
assert all([dtype in columns for dtype in dtypes.index]), 'some dtypes from new_dtypes file are not in the new_columns file'
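To show why the columns and dtypes are exported next to the pipeline, here is a self-contained sketch of how a serving process might use the three files. Everything in it is an assumption for illustration: a tiny toy pipeline with `LogisticRegression` stands in for the real one, the files go to a temporary directory rather than **/tmp**, and the file names are made up.

```python
import json
import os
import pickle
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data and pipeline (stand-ins, not the real model)
toy_X = pd.DataFrame({
    "StatuteReason": ["Other", "Speed Related", "Other", "Speed Related"],
    "SubjectAge": [37.0, 30.0, 25.0, 41.0],
    "is_new": [False, False, True, True],
})
toy_y = pd.Series([True, False, True, False])
toy_pipeline = make_pipeline(
    ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), ["StatuteReason", "is_new"])]),
    LogisticRegression(),
)
toy_pipeline.fit(toy_X, toy_y)

# Export: the same three artifacts as in the exercise, under hypothetical names
export_dir = tempfile.mkdtemp()
with open(os.path.join(export_dir, "toy_columns.json"), "w") as fh:
    json.dump(toy_X.columns.tolist(), fh)
with open(os.path.join(export_dir, "toy_dtypes.pickle"), "wb") as fh:
    pickle.dump(toy_X.dtypes, fh)
joblib.dump(toy_pipeline, os.path.join(export_dir, "toy_pipeline.pickle"))

# Serving side: reload everything and rebuild one row in the training layout
with open(os.path.join(export_dir, "toy_columns.json")) as fh:
    served_columns = json.load(fh)
with open(os.path.join(export_dir, "toy_dtypes.pickle"), "rb") as fh:
    served_dtypes = pickle.load(fh)
served_pipeline = joblib.load(os.path.join(export_dir, "toy_pipeline.pickle"))

# A request payload may arrive with keys in any order
observation = {"is_new": True, "SubjectAge": 22.0, "StatuteReason": "Other"}
row = pd.DataFrame([observation], columns=served_columns).astype(served_dtypes)
proba = served_pipeline.predict_proba(row)[0, 1]
print(list(row.columns), proba)
```

The columns file fixes the column order and the dtypes file restores the training types, so the single-row frame matches what the pipeline saw during `fit`.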

## Exercise 10:

And now it's time to change the server! I know you missed this part :) 

Before we do it, I want to remind you that this exercise doesn't cover the topic of ethics.

Our model is trained on sensitive features like race, sex and ethnicity. 

In a real situation you'd need to make sure that your model doesn't discriminate against anyone. 

Now, go and create a copy of the **protected_server.py** file. Call it **new_server.py**

In that file:
- Change the **check_valid_column** function to have the new added columns

> You can also automate it by reading the columns file, it's even better!

- Change the **check_categorical_values** function:

> We didn't really affect any of the checked columns there besides **StatuteReason** (of course, if you didn't change it in your best solution). Remove the values that should not be in this column anymore.

> We also add one more categorical feature to the dataframe (**is_new**). Go and add possible values to the check.

- As soon as it's done, go ahead and start the server. 

- Play with the predictions. Make sure that the server checks the **is_new** feature values. Try to send requests without **is_new** or with a different value (not True or False). 

- After you're done, change the value of **done** to **True** to pass the exercise
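The two checks above could look roughly like this. It is only a sketch under assumptions: the real signatures in **protected_server.py** may differ, so the functions below (including `load_valid_columns` and `check_is_new`) are hypothetical and return an `(ok, error_message)` pair by my own choice. The columns file is written to a temporary directory here so that the example runs on its own.

```python
import json
import os
import tempfile

def load_valid_columns(path):
    """Read the exported list of training columns (hypothetical helper)."""
    with open(path) as fh:
        return set(json.load(fh))

def check_valid_column(observation, valid_columns):
    """Reject observations with missing or unexpected keys (hypothetical signature)."""
    keys = set(observation.keys())
    missing = valid_columns - keys
    if missing:
        return False, "Missing columns: {}".format(sorted(missing))
    extra = keys - valid_columns
    if extra:
        return False, "Unrecognized columns provided: {}".format(sorted(extra))
    return True, ""

def check_is_new(observation):
    """Accept only real booleans for is_new (hypothetical helper)."""
    value = observation.get("is_new")
    if not isinstance(value, bool):
        return False, "Invalid value provided for is_new: {!r}. Allowed values are: True, False".format(value)
    return True, ""

# Stand-in for /tmp/new_columns.json with a shortened, made-up column list
columns_path = os.path.join(tempfile.mkdtemp(), "toy_columns.json")
with open(columns_path, "w") as fh:
    json.dump(["SubjectAge", "is_new"], fh)
valid_columns = load_valid_columns(columns_path)

print(check_valid_column({"SubjectAge": 30, "is_new": True}, valid_columns))
print(check_valid_column({"SubjectAge": 30}, valid_columns))
print(check_is_new({"SubjectAge": 30, "is_new": "yes"}))
```

Reading the columns from the exported file means the server cannot drift out of sync with the model the next time a feature is added.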

In [None]:
done = True

In [None]:
# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
assert done == True

Aaaaaand...we're done!

<img src="media/congrats.png" width=300/>