You will be using new data for this milestone. In your data files, there is a folder named scoring_data. Check that the files in scoring_data are accessible. Note that these files do not contain the target variable, Outcome, because your scoring pipeline will predict it.

Open a new Jupyter notebook and load the data files from the scoring_data folder.

In [1]:
import pandas as pd


# Preprocessing
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import StandardScaler


# Model definition
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Save the model
import joblib

In [2]:
diabetes_df = pd.read_csv('../scoring_data/diabetes.csv')
patient_df = pd.read_csv('../scoring_data/patient_data.csv')
pregnancies_df = pd.read_csv('../scoring_data/pregnancies_records.csv')

Derive the number of pregnancies per patient from the pregnancies_records.csv file.

In [3]:
# Count the pregnancy records for each patient
pregnancies_summary_df = pregnancies_df.groupby('PatientID', as_index=False).agg({"PregnancyRecordID": "count"})
# Rename the count column to the feature name, Pregnancies
pregnancies_summary_df.columns = ['PatientID', 'Pregnancies']
print(pregnancies_summary_df.columns)

Index(['PatientID', 'Pregnancies'], dtype='object')


Create a final dataframe containing all features using a left join.

Recreate any new features that you added during training, for example Age.
Treat missing values for Pregnancies. You may not need to do any further feature processing or engineering.
Make sure you supply every feature that was used in model training.
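
One way to check that the scoring data supplies every training feature is to compare column names before calling predict. The sketch below is illustrative only: `expected_features` is a hypothetical list standing in for the columns your training notebook actually used, `missing_features` is a helper name introduced here, and the small dataframe is made up.

```python
import pandas as pd


def missing_features(df, expected_features):
    """Return the expected column names that are absent from df, in order."""
    present = set(df.columns)
    return [name for name in expected_features if name not in present]


# Hypothetical feature list; replace it with the columns used in your training notebook.
expected_features = ["Glucose", "BMI", "Pregnancies", "Age"]

# Tiny stand-in for the merged scoring dataframe (Age is deliberately left out).
scoring_df = pd.DataFrame(
    {"Glucose": [129, 159], "BMI": [31.2, 27.4], "Pregnancies": [0.0, 7.0]}
)

missing = missing_features(scoring_df, expected_features)
print(missing)
```

An empty list means every expected column is present; any names that are returned need to be created or merged in before scoring.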

In [4]:
# Merge all three files on the common key, PatientID, using left joins
merged_df = pd.merge(pd.merge(diabetes_df,patient_df, on="PatientID", how="left"),pregnancies_summary_df, on="PatientID", how="left")
merged_df.head()

Unnamed: 0,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,PatientID,BirthYear,City,State,Country,Pregnancies
0,129,80,0,0,31.2,0.703,18022,,,,,
1,159,64,0,0,27.4,0.294,18024,,,,,7.0
2,137,61,0,0,24.2,0.151,18051,,,,,6.0
3,113,50,10,85,29.5,0.626,18114,,,,,3.0
4,105,90,0,0,29.6,0.197,18147,,,,,
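
When a left join leaves columns such as BirthYear empty, it can help to see which keys found no match in the right-hand table. The sketch below uses the `indicator=True` option of `pd.merge` on two tiny made-up frames; the IDs and values are illustrative and are not taken from the scoring data.

```python
import pandas as pd

# Made-up stand-ins for diabetes_df and patient_df.
left = pd.DataFrame({"PatientID": [1, 2, 3], "Glucose": [129, 159, 137]})
right = pd.DataFrame({"PatientID": [2, 3], "BirthYear": [1980, 1975]})

# indicator=True adds a "_merge" column that records where each row came from.
checked = pd.merge(left, right, on="PatientID", how="left", indicator=True)

# Rows marked "left_only" had no matching PatientID in the right-hand frame.
unmatched_ids = checked.loc[checked["_merge"] == "left_only", "PatientID"].tolist()
print(unmatched_ids)
```

Applied to the real files, a check like this would show whether the empty patient columns come from IDs that are missing in patient_data.csv.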


In [5]:
# BirthYear is not a useful feature on its own.
# However, we can derive an Age feature from BirthYear, and age may be correlated with diabetes.
from datetime import date 
yy = date.today().year
print(yy)
age = yy-merged_df["BirthYear"]
merged_df["Age"]=age
merged_df.head()

2021


Unnamed: 0,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,PatientID,BirthYear,City,State,Country,Pregnancies,Age
0,129,80,0,0,31.2,0.703,18022,,,,,,
1,159,64,0,0,27.4,0.294,18024,,,,,7.0,
2,137,61,0,0,24.2,0.151,18051,,,,,6.0,
3,113,50,10,85,29.5,0.626,18114,,,,,3.0,
4,105,90,0,0,29.6,0.197,18147,,,,,,


In [6]:
# Patients with no pregnancy records get a count of 0
merged_df["Pregnancies"] = merged_df["Pregnancies"].fillna(0)
merged_df.isnull().sum()

Glucose                      0
BloodPressure                0
SkinThickness                0
Insulin                      0
BMI                          0
DiabetesPedigreeFunction     0
PatientID                    0
BirthYear                   82
City                        82
State                       82
Country                     82
Pregnancies                  0
Age                         82
dtype: int64

Load the ML model from the pickle file. You can pick any version or the one with the best accuracy.

In [7]:
test_pipeline_v2 = joblib.load('../saved_models/ml_pipeline_v2.pkl')

Generate predictions on the new data using the pipeline's predict method.

In [8]:
test_prediction_v2 = test_pipeline_v2.predict(merged_df)
test_prediction_v2

ValueError: Number of features of the input must be equal to or greater than that of the fitted transformer. Transformer n_features is 14 and input n_features is 13.

Add the predictions to your dataframe as a new column named “prediction”.

Save the results as a CSV file. The CSV file should include all features and a prediction column.
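
The last two steps can be sketched as follows. This is a minimal illustration under stated assumptions: `features_df` stands in for the merged dataframe, `predictions` holds made-up values rather than real model output, and the output file name mentioned in the comment is hypothetical.

```python
import io

import pandas as pd

# Illustrative stand-ins: in the notebook these would be merged_df and the
# array returned by the pipeline's predict method.
features_df = pd.DataFrame({"PatientID": [18022, 18024], "Glucose": [129, 159]})
predictions = [0, 1]  # made-up values, not real model output

# Keep every feature column and append the predictions as a new column.
results_df = features_df.copy()
results_df["prediction"] = predictions

# Write to an in-memory buffer here; pass a file path instead (for example a
# hypothetical "scoring_results.csv") to save to disk.
buffer = io.StringIO()
results_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so the CSV contains only the feature columns and the prediction column.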