Create a pandas DataFrame from the CSV file “feature_data_milestone_1.csv” in the feature_store folder.


In [7]:
import pandas as pd
# load data from feature store
features_df = pd.read_csv("../feature_store/feature_data_milestone_1.csv")
features_df.head()

Unnamed: 0,PatientID,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Pregnancies,Age,Outcome
0,1017,101,58,17,265,24.2,0.614,2.0,23,0
1,1031,108,70,0,0,30.5,0.955,8.0,33,1
2,1033,148,60,27,318,30.9,0.15,4.0,29,1
3,1035,113,76,0,0,33.3,0.278,0.0,23,1
4,1048,83,86,19,0,29.3,0.317,4.0,34,0


Remove any records with outliers.
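The notebook does not show a cell for this step, and the source does not say which outlier rule to use. One common option is the interquartile range (IQR) rule, sketched below under that assumption. The helper name `remove_outliers_iqr`, the multiplier `k=1.5`, and the small demo frame are illustrative and are not part of the milestone data.

```python
import pandas as pd


def remove_outliers_iqr(df, columns, k=1.5):
    """Drop rows where any listed column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Tiny made-up frame (not the milestone data); the BMI of 95.0 is an obvious outlier.
demo_df = pd.DataFrame({
    "Glucose": [101, 108, 118, 113, 95, 120],
    "BMI": [27.2, 30.5, 30.9, 33.3, 29.3, 95.0],
})
clean_df = remove_outliers_iqr(demo_df, ["Glucose", "BMI"])
print(len(demo_df), len(clean_df))
```

On the real data this could be applied as `features_df = remove_outliers_iqr(features_df, [...])` before the split, with the column list chosen to suit the features you plan to use.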


Split the dataset into training and testing sets with an 80:20 ratio.

In [8]:
from sklearn.model_selection import train_test_split

# separate target and non-target variables
X=features_df.drop(columns=["Outcome"])
y=features_df["Outcome"]

train_x, test_x, train_y, test_y = train_test_split(
    X,
    y,
    test_size=0.20
)
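The split above is not seeded, so the accuracy figures in this notebook will vary from run to run. A minimal sketch of a reproducible split follows, using the `random_state` and `stratify` parameters of `train_test_split`. The seed value 42 and the synthetic `demo_X` / `demo_y` data are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with a 70/30 class balance.
demo_X = pd.DataFrame({"Glucose": range(100), "BMI": range(100, 200)})
demo_y = pd.Series([0] * 70 + [1] * 30, name="Outcome")

# random_state fixes the shuffle; stratify keeps the class ratio in both subsets.
demo_train_x, demo_test_x, demo_train_y, demo_test_y = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=42, stratify=demo_y
)

# Repeating the call with the same seed selects the same rows.
_, repeat_test_x, _, _ = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=42, stratify=demo_y
)
print(len(demo_train_x), len(demo_test_x))
```

Stratifying is optional here, but it can help when one outcome class is much rarer than the other.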

Define the ML pipeline with preprocessing steps and a logistic regression model. It will be similar to the “v1” model you built previously. Preprocessing can include the following:

Select four relevant/significant features to build your ML model, or drop the irrelevant/nonsignificant features.
Normalize these features using StandardScaler in sklearn. Hint: You can do both of these steps with ColumnTransformer, which can drop features and apply StandardScaler.

In [9]:
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


columns_to_drop = ["PatientID"]
selected_features_v1 = ["Glucose","BMI","Pregnancies","Age"]

# add all preprocessing steps to a ColumnTransformer; columns not listed in any
# transformer (e.g. "Unnamed: 0") are discarded because remainder defaults to "drop"
column_transformers_v1 = ColumnTransformer(transformers=[
    ("drop_columns", "drop", columns_to_drop),
    ("scale_features", StandardScaler(), selected_features_v1),
])

ml_model = LogisticRegression()
# create sklearn ML pipeline
model_pipeline_v1 = Pipeline(steps=[
    ("pre_processing", column_transformers_v1),
    ("linear_model", ml_model),
])


Train the model on training data and test it on testing data.

In [10]:
# train the model on training dataset
model_pipeline_v1.fit(train_x, train_y)

train_prediction_v1 = model_pipeline_v1.predict(train_x)

Analyze the accuracy of the model on the training data. Also, check the confusion matrix.

In [11]:
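# Check accuracy of the model on training data
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# Note: these functions are documented as (y_true, y_pred). Predictions are passed
# first here, so accuracy is unaffected, but the confusion matrix rows are the
# predicted labels and precision and recall are swapped in the report.
print(accuracy_score(train_prediction_v1, train_y))
print(confusion_matrix(train_prediction_v1, train_y))
print(classification_report(train_prediction_v1, train_y))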

0.7773722627737226
[[317  79]
 [ 43 109]]
              precision    recall  f1-score   support

           0       0.88      0.80      0.84       396
           1       0.58      0.72      0.64       152

    accuracy                           0.78       548
   macro avg       0.73      0.76      0.74       548
weighted avg       0.80      0.78      0.78       548
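When reading a confusion matrix, argument order matters: scikit-learn documents `confusion_matrix(y_true, y_pred)`, with rows for true labels and columns for predicted labels. The sketch below uses a five-element made-up example (not the notebook's data) to show that swapping the two arguments transposes the matrix.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: one true negative is misclassified as positive.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]

cm_documented_order = confusion_matrix(y_true, y_pred)  # rows = true labels
cm_swapped_order = confusion_matrix(y_pred, y_true)     # rows = predicted labels

print(cm_documented_order)
print(cm_swapped_order)
```

The same applies to `classification_report`: with the arguments swapped, the precision and recall columns trade places and the support column counts predictions rather than true labels.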



Save this model using either pickle or joblib with a name of the form “*_feature_store_v1.pkl”.

In [15]:
# save ml pipeline with model using joblib
import joblib
joblib.dump(model_pipeline_v1, '../saved_models/ml_pipeline_fs_v1.pkl')


['../saved_models/ml_pipeline_fs_v1.pkl']

Load the saved v1 pipeline and check its accuracy on the testing data. Then retrain the model with either a different set of features or all features.

In [16]:
# load the saved model with joblib
test_pipeline_v1 = joblib.load('../saved_models/ml_pipeline_fs_v1.pkl')
test_prediction_v1 = test_pipeline_v1.predict(test_x)
print(accuracy_score(test_prediction_v1, test_y))


0.717391304347826


In [17]:
# retrain the model
selected_features_v2 = ["Glucose","BloodPressure","SkinThickness","Insulin","BMI","DiabetesPedigreeFunction","Pregnancies","Age"]

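# add all preprocessing steps to a ColumnTransformer
column_transformers_v2 = ColumnTransformer(transformers=[
    ("drop_columns", "drop", columns_to_drop),
    ("scale_features", StandardScaler(), selected_features_v2),
])

# create sklearn ML pipeline
# use a fresh estimator: reusing ml_model would refit the same LogisticRegression
# object that model_pipeline_v1 holds, because Pipeline fits its final step in place
model_pipeline_v2 = Pipeline(steps=[
    ("pre_processing", column_transformers_v2),
    ("linear_model", LogisticRegression()),
])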

# train the model on training dataset
model_pipeline_v2.fit(train_x, train_y)

train_prediction_v2 = model_pipeline_v2.predict(train_x)
print(accuracy_score(train_prediction_v2, train_y))


0.7791970802919708


Test the retrained model, check the accuracy, and save this model with a name of the form “*_feature_store_v2.pkl”.

In [19]:
# save ml pipeline with model using joblib
joblib.dump(model_pipeline_v2, '../saved_models/ml_pipeline_fs_v2.pkl')

['../saved_models/ml_pipeline_fs_v2.pkl']
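The cell above only saves the v2 pipeline; the test step is not shown. With the notebook's own variables it would likely amount to `model_pipeline_v2.predict(test_x)` followed by `accuracy_score(test_y, ...)`. The sketch below shows the same pattern end to end on synthetic data so that it runs on its own. All `demo_*` names and the generated values are assumptions and do not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature-store data (values are made up).
rng = np.random.default_rng(0)
n_rows = 200
demo_df = pd.DataFrame({
    "PatientID": np.arange(n_rows),
    "Glucose": rng.normal(120, 30, n_rows),
    "BMI": rng.normal(32, 6, n_rows),
})
demo_df["Outcome"] = (
    demo_df["Glucose"] + rng.normal(0, 20, n_rows) > 125
).astype(int)

demo_train_x, demo_test_x, demo_train_y, demo_test_y = train_test_split(
    demo_df.drop(columns=["Outcome"]),
    demo_df["Outcome"],
    test_size=0.20,
    random_state=0,
)

# Same pipeline shape as in the notebook: scale selected columns, then fit.
demo_pipeline = Pipeline(steps=[
    ("pre_processing", ColumnTransformer(transformers=[
        ("scale_features", StandardScaler(), ["Glucose", "BMI"]),
    ])),
    ("linear_model", LogisticRegression()),
])
demo_pipeline.fit(demo_train_x, demo_train_y)

# Evaluate on the held-out rows, passing true labels first.
demo_test_prediction = demo_pipeline.predict(demo_test_x)
test_accuracy = accuracy_score(demo_test_y, demo_test_prediction)
print(test_accuracy)
print(confusion_matrix(demo_test_y, demo_test_prediction))
```

Comparing the test accuracy of v1 and v2 on the same held-out rows gives a fairer picture than comparing training accuracy alone.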

You can retrain the model as many times as you want and save it with increasing versions.
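One way to keep version numbers increasing without editing file names by hand is to derive the next number from the files already saved. The helper below is a hypothetical sketch: the name `next_versioned_path` and the `ml_pipeline_fs` prefix (taken from the file names used above) are assumptions, and a temporary directory stands in for `../saved_models`.

```python
import re
import tempfile
from pathlib import Path

import joblib


def next_versioned_path(model_dir, prefix="ml_pipeline_fs"):
    """Return <model_dir>/<prefix>_v<N>.pkl, where N is one above the highest saved version."""
    pattern = re.compile(rf"^{re.escape(prefix)}_v(\d+)\.pkl$")
    versions = []
    for path in Path(model_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            versions.append(int(match.group(1)))
    return Path(model_dir) / f"{prefix}_v{max(versions, default=0) + 1}.pkl"


# Demo in a throwaway directory; a plain dict stands in for a fitted pipeline.
with tempfile.TemporaryDirectory() as tmp_dir:
    first_path = next_versioned_path(tmp_dir)
    joblib.dump({"demo": 1}, first_path)
    second_path = next_versioned_path(tmp_dir)
    first_name, second_name = first_path.name, second_path.name

print(first_name, second_name)
```

In the notebook this could be used as `joblib.dump(model_pipeline_v2, next_versioned_path("../saved_models"))`, assuming that directory already exists.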