Open a new Jupyter notebook and create a final dataframe containing all features, as you did in milestone 1 of project 2.

In [1]:
import pandas as pd


# Preprocessing
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import StandardScaler


# Model definition
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Save the model
import joblib

In [2]:
result_df = pd.read_csv('../model_data/merged_df.csv')
result_df.head()

Unnamed: 0.1,Unnamed: 0,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,PatientID,Outcome,BirthYear,City,State,Country,Pregnancies,Age
0,0,101,58,17,265,24.2,0.614,1017,0,1998,Winona,Minnesota,United States,2.0,23
1,1,108,70,0,0,30.5,0.955,1031,1,1988,Springfield,Illinois,United States,8.0,33
2,2,148,60,27,318,30.9,0.15,1033,1,1992,Socorro,Texas,United States,4.0,29
3,3,113,76,0,0,33.3,0.278,1035,1,1998,Erie,Pennsylvania,United States,0.0,23
4,4,83,86,19,0,29.3,0.317,1048,0,1987,Sioux Falls,South Dakota,United States,4.0,34
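
The two "Unnamed" columns in the output above look like row indices written out by earlier `to_csv` calls; that is an assumption, since the export code is not shown here. A minimal sketch of dropping them, using a hypothetical abbreviated copy of the first two rows rather than the real file:

```python
import io

import pandas as pd

# Two rows copied from the head() output above, abbreviated to a few columns.
# The real notebook reads '../model_data/merged_df.csv' instead.
csv_text = """Unnamed: 0.1,Unnamed: 0,Glucose,BMI,Outcome
0,0,101,24.2,0
1,1,108,30.5,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns whose name does not start with "Unnamed".
cleaned = df.loc[:, ~df.columns.str.startswith("Unnamed")]
print(list(cleaned.columns))
```

The model cells below select their columns explicitly, so the extra columns do not reach the classifier either way; dropping them only keeps the dataframe tidy.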


In [3]:
# lowercase all of these column names
result_df.columns = result_df.columns.str.lower()

In [4]:
X = result_df.drop(columns=['outcome'])
y = result_df['outcome']

Split the dataset into training and testing sets with an 80-20 ratio.

In [5]:
# The instructions do not ask for a seed, but fixing random_state makes the split reproducible

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1234)
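
As a rough check of what an 80-20 split does, the sketch below splits a synthetic dataset of 686 rows, a count inferred from the support totals of 548 and 138 reported later in this notebook. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, and `stratify` is an optional extra that the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 686  # 548 + 138, inferred from the classification reports in this notebook
X_demo = np.arange(n_rows).reshape(-1, 1)   # stand-in feature column
y_demo = np.array([0, 1] * (n_rows // 2))   # stand-in labels, evenly balanced

# stratify keeps the class proportions the same in both parts (optional).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234, stratify=y_demo
)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```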

You will retrain the model using all features.

Define an ML pipeline with preprocessing steps and a logistic regression model. Preprocessing can include the following:

Select all features to build your ML model.
Normalize these features using StandardScaler in sklearn.
Hint: You can do both steps (#4 and #5) using ColumnTransformer, which can drop features and apply the standard scaler.
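
To illustrate the hint, here is a small sketch of how `ColumnTransformer` scales the listed columns and drops the rest. The `demo` dataframe is a hypothetical three-column extract of the rows shown earlier, not the full dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical extract: two numeric features plus one column that should be dropped.
demo = pd.DataFrame({
    "glucose": [101, 108, 148, 113, 83],
    "bmi": [24.2, 30.5, 30.9, 33.3, 29.3],
    "city": ["Winona", "Springfield", "Socorro", "Erie", "Sioux Falls"],
})

ct = ColumnTransformer(
    transformers=[("scale_features", StandardScaler(), ["glucose", "bmi"])],
    remainder="drop",  # the default, spelled out here for clarity
)
scaled = ct.fit_transform(demo)

print(scaled.shape)          # only the two listed columns remain
print(scaled.mean(axis=0))   # each column is centered near zero
```

Because unlisted columns are dropped by default, the list of feature names passed to the transformer is what decides which features the model sees.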

In [6]:
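# Note: despite the instruction to use all features, this list leaves out
# bloodpressure, skinthickness, insulin and diabetespedigreefunction, and it
# includes patientid, which is an identifier rather than a measurement.
features_use = ['glucose', 'bmi', 'pregnancies', 'age', 'patientid', 'birthyear']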

In [7]:
column_transformers_v1 = ColumnTransformer(
    transformers=[('scale_features', StandardScaler(), features_use)])


Train the model on training data and test it on testing data.

In [8]:
# Logistic regression


lr_model = LogisticRegression()


# create sklearn ML pipeline 
model_pipeline_v1 = Pipeline(steps=[
                    ('pre_processing', column_transformers_v1),
                    ('linear_model', lr_model)        
                ])

# train the model on training dataset
model_pipeline_v1.fit(X_train, y_train)

train_prediction_v1 = model_pipeline_v1.predict(X_train)

test_prediction_v1 = model_pipeline_v1.predict(X_test)

Analyze the accuracy of the model and check the confusion matrix.

In [9]:
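# Accuracy on the training set; it is lower than the earlier model's.
# accuracy_score expects (y_true, y_pred); the result is the same in either order.
accuracy_score(y_train, train_prediction_v1)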

0.7609489051094891

In [10]:
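# Note: confusion_matrix expects (y_true, y_pred). With the predictions passed first,
# the rows below are predicted labels and the columns are true labels.
confusion_matrix(train_prediction_v1, y_train)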

array([[313,  86],
       [ 45, 104]], dtype=int64)

In [11]:
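# print() renders the report as a table instead of a raw string.
# Note: classification_report expects (y_true, y_pred). With the predictions passed
# first, the precision and recall columns are swapped and support counts predictions.
print(classification_report(train_prediction_v1, y_train))

              precision    recall  f1-score   support

           0       0.87      0.78      0.83       399
           1       0.55      0.70      0.61       149

    accuracy                           0.76       548
   macro avg       0.71      0.74      0.72       548
weighted avg       0.79      0.76      0.77       548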
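
The metric calls above pass the predictions first, while scikit-learn documents the order as `(y_true, y_pred)`. The sketch below recomputes the usual quantities by hand from the confusion matrix printed above, transposing it first. The variable names are illustrative, and class 1 is assumed to be the positive class.

```python
import numpy as np

# Matrix as printed above: rows are predicted labels and columns are true
# labels, because the predictions were passed as the first argument.
cm_pred_rows = np.array([[313, 86],
                         [45, 104]])

# Transpose to scikit-learn's documented layout: rows true, columns predicted.
cm = cm_pred_rows.T
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)   # of the predicted positives, how many are correct
recall_pos = tp / (tp + fn)      # of the actual positives, how many are found

print(round(accuracy, 4), round(precision_pos, 2), round(recall_pos, 2))
```

Read this way, precision for class 1 is about 0.70 and recall about 0.55, the reverse of the two columns in the report above.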

Save the model using either pickle or joblib. Name the file appropriately (suffix it with “v2”). The file extension should be “.pkl”.

In [12]:
# The variable is still named model_pipeline_v1, but it holds the retrained pipeline saved as v2.
joblib.dump(model_pipeline_v1, '../saved_models/ml_pipeline_v2.pkl')

['../saved_models/ml_pipeline_v2.pkl']
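
A saved pipeline can be read back with `joblib.load` and used for prediction directly. The sketch below shows the round trip on a small synthetic pipeline written to a temporary directory; the data, the step name `scale`, and the file name `demo_pipeline.pkl` are hypothetical and stand in for the real model file.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the label depends on the sign of the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("linear_model", LogisticRegression()),
])
pipeline.fit(X_demo, y_demo)

# Save to a temporary file, then load it back as a separate object.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline includes the fitted scaler, so raw features go straight in.
same_predictions = np.array_equal(pipeline.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```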

You can also retrain the model using a different set of features or by tuning the parameters of your model. Choose your own approach, and make additional versions of the model: v3, v4, or more.

In [13]:
features_use = ['glucose', 'bmi', 'age']

column_transformers_v1 = ColumnTransformer(
    transformers=[('scale_features', StandardScaler(), features_use)])


# Logistic regression


lr_model = LogisticRegression()


# create sklearn ML pipeline 
model_pipeline_v1 = Pipeline(steps=[
                    ('pre_processing', column_transformers_v1),
                    ('linear_model', lr_model)        
                ])

# train the model on training dataset
model_pipeline_v1.fit(X_train, y_train)

train_prediction_v1 = model_pipeline_v1.predict(X_train)

test_prediction_v1 = model_pipeline_v1.predict(X_test)

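# Note: scikit-learn metrics expect (y_true, y_pred). With the predictions passed
# first, accuracy is unchanged, but the confusion matrix has predicted labels on its
# rows, and the report's precision and recall columns are swapped.
print(accuracy_score(test_prediction_v1, y_test))
print(confusion_matrix(test_prediction_v1, y_test))
print(classification_report(test_prediction_v1, y_test))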

0.7681159420289855
[[79 22]
 [10 27]]
              precision    recall  f1-score   support

           0       0.89      0.78      0.83       101
           1       0.55      0.73      0.63        37

    accuracy                           0.77       138
   macro avg       0.72      0.76      0.73       138
weighted avg       0.80      0.77      0.78       138
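
The cell above varies the feature set. The other option the instructions mention, tuning model parameters, could be sketched with `GridSearchCV` over the regularization strength `C`. This runs on synthetic data from `make_classification`, and the grid values are arbitrary illustrative choices rather than recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three scaled features used above.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("linear_model", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"linear_model__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.best_estimator_` is a fitted pipeline, so it could be saved with `joblib.dump` as a further model version in the same way as the earlier one.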

