Most of these steps are similar to what we did in Project 1 of this series, “Creating Features.”

You will use the data files in the model_data folder. Launch Jupyter Notebook and load all three files with the pandas library.


In [1]:
import pandas as pd
# load data files
# load diabetes.csv
diabetes_df = pd.read_csv("../model_data/diabetes.csv")
# load patient_data.csv
patient_df = pd.read_csv("../model_data/patient_data.csv")
# load pregnancies_records.csv
pregnancies_df = pd.read_csv("../model_data/pregnancies_records.csv")


Aggregate pregnancy records to derive the number of pregnancies by patient.


In [2]:
# aggregate pregnancy records: count the records for each patient
pregnancies_summary_df = pregnancies_df.groupby("PatientID", as_index=False).agg({"PregnancyRecordID": "count"})
# rename the count column to Pregnancies
pregnancies_summary_df.columns = ["PatientID", "Pregnancies"]
print(pregnancies_summary_df.columns)

Index(['PatientID', 'Pregnancies'], dtype='object')
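
To see what this aggregation produces, here is a minimal sketch on made-up records (the `toy_records` values are hypothetical and not taken from pregnancies_records.csv). It uses pandas named aggregation, which counts the records and names the result column in a single call.

```python
import pandas as pd

# Hypothetical toy data: one row per recorded pregnancy
toy_records = pd.DataFrame({
    "PregnancyRecordID": [1, 2, 3, 4],
    "PatientID": [1017, 1017, 1031, 1048],
})

# Count the records per patient and name the result column Pregnancies
toy_summary = (
    toy_records.groupby("PatientID", as_index=False)
    .agg(Pregnancies=("PregnancyRecordID", "count"))
)
print(toy_summary)
```

Patient 1017 has two records in the toy data and therefore a count of 2; patients with no records do not appear in the summary at all, which is why missing values show up after the join later on.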


In [3]:
print(pregnancies_summary_df)

     PatientID  Pregnancies
0         1017            2
1         1031            8
2         1033            4
3         1048            4
4         1074            1
..         ...          ...
670      17877            4
671      17903            1
672      17920            1
673      17963            2
674      17995            2

[675 rows x 2 columns]


Consolidate the features from all three DataFrames into a single DataFrame. Because diabetes.csv contains the target variable, merge it with the other two files using a left join so that every row of diabetes.csv is kept.


In [4]:
# Merge all three DataFrames on the common key, PatientID, using left joins
merged_df = diabetes_df.merge(patient_df, on="PatientID", how="left").merge(pregnancies_summary_df, on="PatientID", how="left")
merged_df.head()

,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,PatientID,Outcome,BirthYear,City,State,Country,Pregnancies
0,101,58,17,265,24.2,0.614,1017,0,1998,Winona,Minnesota,United States,2.0
1,108,70,0,0,30.5,0.955,1031,1,1988,Springfield,Illinois,United States,8.0
2,148,60,27,318,30.9,0.15,1033,1,1992,Socorro,Texas,United States,4.0
3,113,76,0,0,33.3,0.278,1035,1,1998,Erie,Pennsylvania,United States,
4,83,86,19,0,29.3,0.317,1048,0,1987,Sioux Falls,South Dakota,United States,4.0


In [5]:
merged_df.shape

(686, 13)
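
The left join explains why Pregnancies is empty for a patient such as 1035: the patient exists in diabetes.csv but has no row in the pregnancy summary. The sketch below reproduces this on two hypothetical toy frames. The `validate` and `indicator` arguments are optional extras that are not part of the solution above, and `validate="one_to_one"` assumes PatientID is unique in both frames.

```python
import pandas as pd

# Hypothetical toy frames: patient 1035 has no pregnancy summary row
left = pd.DataFrame({"PatientID": [1017, 1035], "Outcome": [0, 1]})
right = pd.DataFrame({"PatientID": [1017], "Pregnancies": [2]})

joined = left.merge(
    right,
    on="PatientID",
    how="left",             # keep every row of the left frame
    validate="one_to_one",  # raise an error if a key is duplicated on either side
    indicator=True,         # add a _merge column showing where each row matched
)
print(joined)
```

The joined frame has the same number of rows as `left`, Pregnancies is NaN for patient 1035, and the `_merge` column marks that row as `left_only`. The NaN also turns the column into floats, which is why the counts above display as 2.0 and 8.0.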


Derive age from birth year and add it to the DataFrame as a new column.


In [6]:
# BirthYear itself is not a useful feature.
# However, we can derive an Age feature from BirthYear, and age may be correlated with diabetes.
from datetime import date

# Age is relative to the year in which the notebook is run (2021 in the output below)
yy = date.today().year
print(yy)
merged_df["Age"] = yy - merged_df["BirthYear"]
merged_df.head()

2021


,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,PatientID,Outcome,BirthYear,City,State,Country,Pregnancies,Age
0,101,58,17,265,24.2,0.614,1017,0,1998,Winona,Minnesota,United States,2.0,23
1,108,70,0,0,30.5,0.955,1031,1,1988,Springfield,Illinois,United States,8.0,33
2,148,60,27,318,30.9,0.15,1033,1,1992,Socorro,Texas,United States,4.0,29
3,113,76,0,0,33.3,0.278,1035,1,1998,Erie,Pennsylvania,United States,,23
4,83,86,19,0,29.3,0.317,1048,0,1987,Sioux Falls,South Dakota,United States,4.0,34
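
Because the cell above reads the current year from the system clock, the Age values change when the notebook is rerun in a later year. If you want reproducible features, one option is to pin the reference year. The sketch below is only an illustration: `REFERENCE_YEAR` and `add_age` are hypothetical names, and 2021 is simply the year printed in the output above.

```python
import pandas as pd

REFERENCE_YEAR = 2021  # assumption: the year the notebook output was produced


def add_age(df, reference_year=REFERENCE_YEAR):
    """Return a copy of df with an Age column derived from BirthYear."""
    out = df.copy()
    out["Age"] = reference_year - out["BirthYear"]
    return out


# Hypothetical toy rows using birth years that appear in the output above
toy = pd.DataFrame({"PatientID": [1017, 1031], "BirthYear": [1998, 1988]})
aged = add_age(toy)
print(aged)
```

With the reference year fixed at 2021, the toy rows get the same ages as in the output above (23 and 33) no matter when the code runs.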



Impute any missing feature values, as you did in Milestone 2 of Project 1. Imputation can also be done at scoring time, but for Pregnancies it makes sense to treat missing values now rather than later.

Create a folder named “feature_store” to store all cleaned features.

Save the DataFrame as a CSV file, “feature_data_milestone_1.csv,” in the feature_store folder. For this liveProject, you will use a file-based feature store.
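
Note that `to_csv` does not create missing folders, so the feature_store folder has to exist before the save step runs. You can create it by hand or from the notebook. The following is a minimal sketch of the second option; `ensure_feature_store` is a hypothetical helper name, and the demonstration uses a temporary directory so that it does not touch your project.

```python
import tempfile
from pathlib import Path


def ensure_feature_store(path="../feature_store"):
    """Create the feature store folder if it is missing and return its path."""
    folder = Path(path)
    # exist_ok=True makes the call safe to repeat when the notebook is rerun
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_feature_store(Path(tmp) / "feature_store")
    folder_exists = created.is_dir()
print(folder_exists)
```

In the notebook you would call `ensure_feature_store()` with its default path once, before the cell that writes the CSV file.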

In [7]:
# Per the dataset specification, a patient who does not appear in pregnancies_records.csv can be assumed to have had no pregnancies,
# so impute 0 for missing Pregnancies values.
# Alternatively, you can impute missing values according to the distribution of Pregnancies values.
# Assign the result back instead of calling fillna(inplace=True) on a column selection, which newer pandas versions warn about.
merged_df["Pregnancies"] = merged_df["Pregnancies"].fillna(0)
merged_df.isnull().sum()

Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
PatientID                   0
Outcome                     0
BirthYear                   0
City                        0
State                       0
Country                     0
Pregnancies                 0
Age                         0
dtype: int64
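
The comment in the cell above mentions imputing according to the distribution of Pregnancies values as an alternative. One simple way to do that, sketched here under the assumption that sampling from the observed values is acceptable, is to draw each missing value at random from the non-missing ones. `impute_from_observed` is a hypothetical helper and the toy series is made up; for this dataset, imputing 0 remains the choice supported by the specification.

```python
import numpy as np
import pandas as pd


def impute_from_observed(series, seed=42):
    """Fill missing values by sampling from the observed values of the series."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the result reproducible
    observed = series.dropna().to_numpy()
    missing = series.isna()
    filled = series.copy()
    # Draw one observed value for each missing entry
    filled.loc[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled


# Hypothetical toy column with two missing entries
toy = pd.Series([2.0, 8.0, 4.0, np.nan, 4.0, np.nan], name="Pregnancies")
imputed = impute_from_observed(toy)
print(imputed.isna().sum())
```

Sampling keeps the shape of the observed distribution roughly intact, whereas filling with a single constant concentrates all imputed rows on one value.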

In [8]:
useful_features = ["PatientID","Glucose","BloodPressure","SkinThickness","Insulin","BMI","DiabetesPedigreeFunction","Pregnancies","Age","Outcome"]
result_df = merged_df[useful_features]
result_df.shape

(686, 10)

In [9]:
# Save useful features in a centralized place
# Ideally, features should be stored in a distributed data store. For this liveProject, we will save them in CSV files.
result_df.to_csv("../feature_store/feature_data_milestone_1.csv", index = False, header=True)
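
A quick way to confirm that the saved file is usable is to read it back and compare its columns with the DataFrame you wrote. The sketch below does this with a hypothetical toy frame in a temporary directory; in the notebook you would read the real file from the feature_store folder instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy features standing in for result_df
toy_features = pd.DataFrame({
    "PatientID": [1017, 1031],
    "Pregnancies": [2.0, 8.0],
    "Outcome": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feature_data_milestone_1.csv"
    # index=False keeps the row index out of the file
    toy_features.to_csv(path, index=False, header=True)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```

Because the file is written with `index=False`, the reloaded frame has exactly the feature columns and no extra index column.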