# The Problem

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  
%matplotlib inline 
plt.style.use('fivethirtyeight')

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics

In [None]:
pima_column_names = ['times_pregnant', 'plasma_glucose_concentration', 'diastolic_blood_pressure', 'triceps_thickness', 'serum_insulin', 'bmi', 'pedigree_function', 'age', 'onset_diabetes']
pima = pd.read_csv(r'../input/pima-indians-diabetes-database/diabetes.csv',names = pima_column_names,skiprows=1)
pima.head()



In [None]:
pima.info()

# Exploratory data analysis (EDA)
To identify our missing values, we will begin with an EDA of our dataset. We will use pandas and numpy to store our data and make some simple calculations, and matplotlib and seaborn to see what the distribution of our data looks like. With the imports already in place, let's dive into some code, starting with the balance of the two classes in the response column:

In [None]:
pima['onset_diabetes'].value_counts(normalize=True) 

If our eventual goal is to exploit patterns in our data in order to predict the onset of diabetes, let us try to visualize some of the differences between those who developed diabetes and those who did not. Our hope is that the histogram will reveal some sort of pattern, or an obvious difference in values, between the two classes.

In [None]:
col = 'plasma_glucose_concentration'
plt.figure(figsize=(10,5))
plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='non-diabetes')
plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes')
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))
plt.show()

This histogram shows a considerable difference in plasma_glucose_concentration between the two prediction classes. Let's draw the same style of histogram for several columns:

In [None]:
for col in ['bmi', 'diastolic_blood_pressure', 'serum_insulin','triceps_thickness', 'plasma_glucose_concentration']:
    plt.figure(figsize=(8,4))
    plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='non-diabetes')
    plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes')
    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()

We can definitely see some differences simply by looking at just a few histograms. For example, there seems to be a large jump in plasma_glucose_concentration for those who will eventually develop diabetes. To solidify this, perhaps we can visualize a linear correlation matrix in an attempt to quantify the relationship between these variables. 

In [None]:
# look at the heatmap of the correlation matrix of our dataset
plt.figure(figsize=(12,8))
corr=round(pima.corr(),2)
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed from NumPy; mask hides the redundant upper triangle
sns.heatmap(corr,mask=mask, square=True, annot=True)
plt.xticks(rotation=90)
plt.show()
# plasma_glucose_concentration definitely seems to be an interesting feature here

#Following is the correlation matrix of our dataset. This is showing us the correlation amongst 
#the different columns in our Pima dataset. The output is as follows:

This correlation matrix shows a comparatively strong correlation between plasma_glucose_concentration and onset_diabetes. Let's take a closer look at the numerical correlations for the onset_diabetes column:

In [None]:
pima.corr()['onset_diabetes'] 

In [None]:
pima.describe()

This quickly shows us some basic statistics, such as the mean, the standard deviation, and several percentiles of our data. But notice that the minimum value of the bmi column is 0. That is medically impossible, so there must be a reason for it. Perhaps missing values have been encoded as the number zero instead of as None or an empty cell. Upon closer inspection, we see that the value 0 appears as the minimum for the following columns:

1. times_pregnant
2. plasma_glucose_concentration
3. diastolic_blood_pressure
4. triceps_thickness
5. serum_insulin
6. bmi
7. onset_diabetes

Because zero is a class for onset_diabetes and 0 is actually a viable number for times_pregnant, we may conclude that the number 0 is encoding missing values for:

1. plasma_glucose_concentration
2. diastolic_blood_pressure
3. triceps_thickness
4. serum_insulin
5. bmi
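Before replacing anything, it can help to count how many zeros each suspect column actually holds. The following is a minimal sketch on a made-up toy frame (the values and the `zero_means_missing` list are assumptions for illustration, not the real Pima data):

```python
import pandas as pd

# Toy frame standing in for a few Pima columns (values are made up).
toy = pd.DataFrame({
    'times_pregnant': [0, 2, 5, 0],
    'bmi': [33.6, 0.0, 23.3, 0.0],
    'serum_insulin': [0, 94, 0, 168],
})

# Columns where a zero is not a plausible measurement.
zero_means_missing = ['bmi', 'serum_insulin']

# Comparing to 0 gives a boolean frame; sum() counts the True cells per column.
zero_counts = (toy[zero_means_missing] == 0).sum()
print(zero_counts)
```

The same expression applied to `pima` with the five columns listed above would give the number of cells we are about to treat as missing.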

In [None]:
pima['serum_insulin'] = pima['serum_insulin'].map(lambda x:x if x != 0 else None)
# manually replace all 0's with a None value

pima['serum_insulin'].isnull().sum()

In [None]:
pima.describe()

In [None]:
columns = ['bmi', 'plasma_glucose_concentration', 'diastolic_blood_pressure', 'triceps_thickness']

for col in columns:
    pima[col] = pima[col].map(lambda x:x if x != 0 else None)

In [None]:
pima.isnull().sum()

In [None]:
pima.info()

In [None]:
pima.describe()

In [None]:
pima.head(5)

In [None]:
pima['plasma_glucose_concentration'].mean(), pima['plasma_glucose_concentration'].std()


In [None]:
empty_plasma_index = pima[pima['plasma_glucose_concentration'].isnull()].index
pima.loc[empty_plasma_index, 'plasma_glucose_concentration']  # select rows and column in a single .loc call

In [None]:
# Will try to impute the missing values from the existing v
def relation_with_output(column):
    # median of the observed (non-null) values of `column` for each class of onset_diabetes
    temp = pima[pima[column].notnull()]
    d = temp.groupby('onset_diabetes')[column].median().reset_index()
    return d

In [None]:
#lets look relation of missing columns with onset_diabetes
relation_with_output('plasma_glucose_concentration')

In [None]:
relation_with_output('diastolic_blood_pressure')

In [None]:
relation_with_output('triceps_thickness')

In [None]:
relation_with_output('serum_insulin')

In [None]:
relation_with_output('bmi')

There is a large difference in the median values of these columns between patients with and without diabetes.

We will try to impute values according to these statistics
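As an alternative to typing the medians in by hand, the class-wise fill can be computed directly. This is only a sketch on a small made-up frame (the numbers are assumptions, not the Pima medians), using `groupby(...).transform('median')` to broadcast each class median back to its rows:

```python
import numpy as np
import pandas as pd

# Toy data: two missing insulin values, one in each class (values are made up).
toy = pd.DataFrame({
    'serum_insulin': [100.0, np.nan, 110.0, 160.0, np.nan, 180.0],
    'onset_diabetes': [0, 0, 0, 1, 1, 1],
})

# Median of the observed values within each class, aligned to every row.
class_median = toy.groupby('onset_diabetes')['serum_insulin'].transform('median')

# Fill only the missing cells with the median of their own class.
toy['serum_insulin'] = toy['serum_insulin'].fillna(class_median)
print(toy)
```

One design point worth keeping in mind: this kind of fill uses the label itself, so it can only be applied to rows whose onset_diabetes value is already known.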

In [None]:
pima.isnull().sum()

In [None]:
pima.loc[(pima['onset_diabetes'] == 0 ) & (pima['serum_insulin'].isnull()), 'serum_insulin'] = 102.5
pima.loc[(pima['onset_diabetes'] == 1 ) & (pima['serum_insulin'].isnull()), 'serum_insulin'] = 169.5

In [None]:
pima.loc[(pima['onset_diabetes'] == 0 ) & (pima['bmi'].isnull()), 'bmi'] = 30.1
pima.loc[(pima['onset_diabetes'] == 1 ) & (pima['bmi'].isnull()), 'bmi'] = 34.3



In [None]:
pima.loc[(pima['onset_diabetes'] == 0 ) & (pima['triceps_thickness'].isnull()), 'triceps_thickness'] = 27.0
pima.loc[(pima['onset_diabetes'] == 1 ) & (pima['triceps_thickness'].isnull()), 'triceps_thickness'] = 32.0



In [None]:
pima.loc[(pima['onset_diabetes'] == 0 ) & (pima['diastolic_blood_pressure'].isnull()), 'diastolic_blood_pressure'] = 70.0
pima.loc[(pima['onset_diabetes'] == 1 ) & (pima['diastolic_blood_pressure'].isnull()), 'diastolic_blood_pressure'] = 75.0

In [None]:
pima.loc[(pima['onset_diabetes'] == 0 ) & (pima['plasma_glucose_concentration'].isnull()), 'plasma_glucose_concentration'] = 107.0
pima.loc[(pima['onset_diabetes'] == 1 ) & (pima['plasma_glucose_concentration'].isnull()), 'plasma_glucose_concentration'] = 140.0


In [None]:
# confirm that no missing values remain after the class-wise median imputation
pima.isnull().sum()

In [None]:
X = pima.loc[:,:'age']
y = pima['onset_diabetes']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

# Standardization and normalization
Up until now, we have dealt with identifying the types of data, the ways data can be missing, and finally the ways we can fill in missing data. Now, let's talk about how we can manipulate our data (and our features) in order to enhance our machine learning pipelines further. So far, we have tried four different ways of manipulating our dataset, and the best cross-validated accuracy we have achieved with a KNN model is .745. If we look back at some of the EDA we have previously done, we will notice something about our features:

In [None]:
pima.hist(figsize=(15, 15))
plt.show()

In [None]:
pima.info()


The scale of features is very different.

But why does this matter? Some machine learning models rely on learning methods that are greatly affected by the scale of the data, meaning that if we have a column such as diastolic_blood_pressure that lives between 24 and 122, and an age column between 21 and 81, then our learning algorithms will not learn optimally. To really see the differences in scales, let's invoke an optional parameter of the histogram method, sharex, so that every graph is drawn on the same x axis as every other graph:

In [None]:
pima.hist(figsize=(15, 15), sharex=True)
plt.show()

It is quite clear that our data all live on vastly different scales. Data engineers have several options for dealing with this problem in machine learning pipelines, all of which fall under a family of operations called normalization. Normalization operations are meant to align and transform both columns and rows to a consistent set of rules. For example, a common form of normalization is to transform all quantitative columns to lie within a consistent and static range of values (for example, all values must be between 0 and 1). We may also impose mathematical rules, such as requiring that all columns have the same mean and standard deviation so that they appear nicely on the same histogram (unlike the pima histograms we computed recently). Normalization techniques are meant to level the playing field of data by ensuring that all rows and columns are treated equally in the eyes of machine learning.

We will focus on three methods of data normalization:
1. Z-score standardization
2. Min-max scaling
3. Row normalization

The first two deal specifically with altering features in place, while the third option actually manipulates the rows of the data, but is still just as pertinent as the first two.

# Z-score standardization
The most common of the normalization techniques, z-score standardization, relies on the simple statistical idea of a z-score. The output of z-score standardization is a set of features re-scaled to have a mean of zero and a standard deviation of one. By re-scaling our features to have a uniform mean and variance (the square of the standard deviation), we allow models such as KNN to learn optimally and not skew towards larger-scaled features. The formula is simple: for every column, we replace each cell with the following value:

z = (x - μ) / σ

Where:
1. z is our new value (z-score)
2. x is the previous value of the cell
3. μ is the mean of the column
4. σ is the standard deviation of the column

In [None]:
print (pima['plasma_glucose_concentration'].head())

In [None]:
# get the mean of the column
mu = pima['plasma_glucose_concentration'].mean()

# get the standard deviation of the column
sigma = pima['plasma_glucose_concentration'].std()

# calculate z scores for every value in the column.
print (((pima['plasma_glucose_concentration'] - mu) / sigma).head())

We see that every single value in the column will be replaced, and notice that some of them are now negative. This is because the resulting values represent a distance from the mean: if a value was originally below the mean of the column, the resulting z-score will be negative. Of course, scikit-learn has a built-in object for this, StandardScaler.

In [None]:
# mean and std before z score standardizing
pima['plasma_glucose_concentration'].mean(), pima['plasma_glucose_concentration'].std()

(121.68676277850591, 30.435948867207657)


In [None]:
ax = pima['plasma_glucose_concentration'].hist()
ax.set_title('Distribution of plasma_glucose_concentration')

Here, we can see the distribution of the column before doing anything. Now, let's apply a z-score scaling:

In [None]:
scaler = StandardScaler()

glucose_z_score_standardized = scaler.fit_transform(pima[['plasma_glucose_concentration']])
glucose_z_score_standardized.mean(), glucose_z_score_standardized.std()

We can see that after we apply our scaler to the column, the mean drops to a value very close to zero and the standard deviation is one. Furthermore, let's take a look at the distribution of values across our recently scaled data:

In [None]:
ax = pd.Series(glucose_z_score_standardized.reshape(-1,)).hist()
ax.set_title('Distribution of plasma_glucose_concentration after Z Score Scaling')

We will notice that our x axis is now much more constrained, while our y axis is unchanged. Also note that the shape of the data is entirely unchanged. Let's take a look at the histograms of our DataFrame after we apply a z-score transformation to every single column. When we do this, the StandardScaler computes a mean and standard deviation for every column separately:

In [None]:
scale = StandardScaler() # instantiate a z-scaler object

pima_scaled = pd.DataFrame(scale.fit_transform(pima), columns=pima_column_names)
pima_scaled.hist(figsize=(15, 15), sharex=True)
plt.show()
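To check the per-column behaviour by hand, here is a small sketch on a made-up two-column frame (the column names and values are assumptions for illustration). It also shows one detail that is easy to miss: StandardScaler divides by the population standard deviation (ddof=0), whereas the pandas `.std()` method defaults to the sample standard deviation (ddof=1), so the manual z-scores computed earlier with `.std()` will differ very slightly from the scaler's output.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two toy columns on different scales (values are made up).
toy = pd.DataFrame({
    'glucose': [85.0, 120.0, 140.0, 183.0],
    'age': [21.0, 33.0, 50.0, 64.0],
})

# StandardScaler fits one mean and one standard deviation per column.
scaled = StandardScaler().fit_transform(toy)

# Manual z-scores using the population standard deviation (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0), scaled.std(axis=0))
```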

In [None]:
mean_impute_standardize = Pipeline([('imputer', SimpleImputer()), ('standardize', StandardScaler()), ('classify', knn)])
X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']

knn_params = {'imputer__strategy':['mean', 'median'], 'classify__n_neighbors':[1, 2, 3, 4, 5, 6, 7]}
#X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)
grid = GridSearchCV(mean_impute_standardize, knn_params)
grid.fit(X, y)

print (grid.best_score_, grid.best_params_)

We can see that our model has already started outperforming the benchmark. That is good progress so far.

# The min-max scaling method
Min-max scaling is similar to z-score normalization in that it will replace every value in a column with a new value using a formula. In this case, that formula is:

m = (x - x_min) / (x_max - x_min)

Where:

1. m is our new value
2. x is the original cell value
3. x_min is the minimum value of the column
4. x_max is the maximum value of the column
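The formula above can be worked through by hand on a tiny example. The sketch below uses a made-up single-column frame (the values are assumptions, chosen so the arithmetic is easy to follow) and compares the manual result with scikit-learn's MinMaxScaler:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# One toy column: minimum 18, maximum 42, so the range is 24.
col = pd.DataFrame({'bmi': [18.0, 24.0, 30.0, 42.0]})

# Apply the formula directly: subtract the minimum, divide by the range.
manual = (col - col.min()) / (col.max() - col.min())

# The built-in scaler should agree up to floating-point error.
scaled = MinMaxScaler().fit_transform(col)

print(manual['bmi'].tolist())  # [0.0, 0.25, 0.5, 1.0]
print(np.allclose(scaled.ravel(), manual['bmi']))
```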

In [None]:
min_max = MinMaxScaler()
pima_min_maxed = pd.DataFrame(min_max.fit_transform(pima), columns=pima_column_names)
pima_min_maxed.describe()

Notice how the minimum values are all zero and the maximum values are all one. Note further that the standard deviations are now all very small, a side effect of this type of scaling. This can hurt some models, as it takes away weight from outliers. Let's plug our new normalization technique into our pipeline:

In [None]:
mean_impute_standardize = Pipeline([('imputer', SimpleImputer()), ('standardize', MinMaxScaler()), ('classify', knn)])
X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']

knn_params = {'imputer__strategy': ['mean', 'median'], 'classify__n_neighbors':[1, 2, 3, 4, 5, 6, 7]}
grid = GridSearchCV(mean_impute_standardize, knn_params)
grid.fit(X, y)

print (grid.best_score_, grid.best_params_)

# The row normalization method
Our final normalization method works row-wise instead of column-wise. Instead of calculating statistics on each column (mean, min, max, and so on), the row normalization technique ensures that each row of data has a unit norm, meaning that each row will have the same vector length. Imagine that each row of data belongs to an n-dimensional space; each one would have a vector norm, or length. Another way to put it is to consider every row to be a vector in space:

x = (x_1, x_2, ..., x_n)

where n would be 8 in the case of Pima, one for each feature (not including the response). The norm would then be calculated as:

||x|| = √(x_1² + x_2² + ... + x_n²)

This is called the L2 norm. Other types of norms exist, but we will not get into them in this text. Instead, we are concerned with making sure that every single row has the same norm. This comes in handy especially when working with text data or clustering algorithms.
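As a small worked example of the norm, the sketch below normalizes three made-up two-dimensional rows by hand and compares the result with scikit-learn's Normalizer (the row values are assumptions chosen so the norms come out as whole numbers):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Three toy rows; the first two point in the same direction.
rows = np.array([[3.0, 4.0],
                 [6.0, 8.0],
                 [1.0, 0.0]])

# L2 norm of each row: square, sum across the row, take the square root.
norms = np.sqrt((rows ** 2).sum(axis=1))   # 5, 10 and 1

# Divide every row by its own norm so that each ends up with length one.
manual = rows / norms[:, np.newaxis]

scaled = Normalizer(norm='l2').fit_transform(rows)
print(manual)
print(np.allclose(manual, scaled))
```

Note that the first two rows become identical after normalization: only the direction of each row survives, not its magnitude.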

Before doing anything, let's see the average norm of our imputed matrix:

In [None]:
np.sqrt((pima**2).sum(axis=1)).mean() 
# average vector length of imputed matrix

In [None]:
normalize = Normalizer()
pima_normalized = pd.DataFrame(normalize.fit_transform(pima), columns=pima_column_names)
np.sqrt((pima_normalized**2).sum(axis=1)).mean()
# average vector length of row normalized imputed matrix

After normalizing, we see that every single row now has a norm of one. Let's see how this method fares in our pipeline:

In [None]:
mean_impute_normalize = Pipeline([('imputer', SimpleImputer()), ('normalize', Normalizer()), ('classify', knn)])
X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']

knn_params = {'imputer__strategy': ['mean', 'median'], 'classify__n_neighbors':[1, 2, 3, 4, 5, 6, 7]}
grid = GridSearchCV(mean_impute_normalize, knn_params)
grid.fit(X, y)

print (grid.best_score_, grid.best_params_)

Not great, but worth a try. Now that we have seen three different methods of data normalization, let's put it all together and see how we did on this dataset.

There are many learning algorithms that are affected by the scale of data. Here is a list of some popular learning algorithms that are affected by the scale of data:

1. KNN: due to its reliance on the Euclidean distance
2. K-means clustering: same reasoning as KNN
3. Logistic regression, SVM, neural networks: if you are using gradient descent to learn weights
4. Principal component analysis: eigenvectors will be skewed towards larger columns
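To make the first item concrete, here is a minimal sketch of how the Euclidean distance can change its mind about which point is nearest once the columns are standardized. The five rows and the two column meanings (a wide-range and a narrow-range feature) are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 0 has a wide range, column 1 a narrow one (values are made up).
X = np.array([
    [100.0, 0.2],   # q: the query row
    [110.0, 0.2],   # a: same narrow feature, slightly different wide feature
    [101.0, 2.0],   # b: almost same wide feature, very different narrow feature
    [300.0, 1.0],
    [500.0, 1.2],
])

# Raw distances from q: the wide-range column dominates the comparison.
raw_to_a = np.linalg.norm(X[0] - X[1])
raw_to_b = np.linalg.norm(X[0] - X[2])

# After z-score scaling, both columns contribute on a comparable footing.
X_scaled = StandardScaler().fit_transform(X)
scaled_to_a = np.linalg.norm(X_scaled[0] - X_scaled[1])
scaled_to_b = np.linalg.norm(X_scaled[0] - X_scaled[2])

print('raw nearest:   ', 'a' if raw_to_a < raw_to_b else 'b')
print('scaled nearest:', 'a' if scaled_to_a < scaled_to_b else 'b')
```

On the raw data, b looks closer to q because the large gap in the narrow feature barely registers; after scaling, a is the nearer neighbour.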

In [None]:
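def run_model(model,hyp,X,y,cv, Scaler):
    # hold out 20% of the rows as a test set that the grid search never sees
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=3)
    # impute -> scale -> classify; the step names must match the prefixes used in hyp
    mean_impute_standardize = Pipeline([('imputer',SimpleImputer()),
                                       ('standardize_values',Scaler),
                                       ('classification',model)])
    
    # cross-validated grid search on the training split only
    grid = GridSearchCV(mean_impute_standardize,hyp,cv=cv)
    grid.fit(X_train,y_train)
    pred = grid.best_estimator_.predict(X_test)
    print(grid.best_params_)
    print(grid.best_estimator_)
    # accuracy_score expects the true labels first, then the predictions
    return metrics.accuracy_score(y_test,pred)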

In [None]:
hyper_parameters = {'classification__penalty':['l1','l2'],'imputer__strategy':['mean','median']}
print('Logistic Regression accuracy: ')
run_model(LogisticRegression(solver='liblinear'),hyper_parameters,
          pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,MinMaxScaler())

In [None]:
hyper_parameters = {'classification__penalty':['l1','l2'],'imputer__strategy':['mean','median']}
print('Logistic Regression accuracy: ')
run_model(LogisticRegression(solver='liblinear'),hyper_parameters,
          pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,StandardScaler())

In [None]:
hyper_parameters = {'classification__criterion':['gini','entropy'],
                   'classification__n_estimators':[40,50,100,150,200],
                   'imputer__strategy':['mean','median']}
print('RandomForest Accuracy: ')
run_model(RandomForestClassifier(n_jobs=-1),hyper_parameters,
         pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,MinMaxScaler())

In [None]:
hyper_parameters = {'classification__criterion':['gini','entropy'],
                   'classification__n_estimators':[40,50,100,150,200],
                   'imputer__strategy':['mean','median']}
print('RandomForest Accuracy: ')
run_model(RandomForestClassifier(n_jobs=-1),hyper_parameters,
         pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,StandardScaler())

In [None]:
hyper_parameters = {'classification__kernel':['rbf','sigmoid','poly'],
                   'classification__C':[0.1,0.001,0.3,1],
                   'imputer__strategy':['mean','median']}
print('SupportVectorClassifier Accuracy: ')
run_model(SVC(),hyper_parameters,
         pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,MinMaxScaler())

In [None]:
hyper_parameters = {'classification__kernel':['rbf','sigmoid','poly'],
                   'classification__C':[0.1,0.001,0.3,1],
                   'imputer__strategy':['mean','median']}
print('SupportVectorClassifier Accuracy: ')
run_model(SVC(),hyper_parameters,
         pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,StandardScaler())

In [None]:
hyper_parameters = {'classification__p':[1.3,1.5,2],
                   'classification__n_neighbors':[5,7,8,9],
                   'classification__weights':['uniform','distance'],
                    'imputer__strategy':['mean','median']}
print('KNeighborsClassifier accuracy: ')
run_model(KNeighborsClassifier(n_jobs=-1),hyper_parameters,
         pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,MinMaxScaler())

In [None]:
hyper_parameters = {'classification__p':[1.3,1.5,2],
                   'classification__n_neighbors':[5,7,8,9],
                   'classification__weights':['uniform','distance'],
                    'imputer__strategy':['mean','median']}
print('KNeighborsClassifier accuracy: ')
run_model(KNeighborsClassifier(n_jobs=-1),hyper_parameters,
         pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,StandardScaler())

In [None]:
hyper_parameters = {'classification__learning_rate':[0.1,0.3,0.6,1],
                   'classification__n_estimators':[30,50,80,100],
                   'imputer__strategy':['mean','median']}
print('AdaBoostClassifier accuracy: ')
run_model(AdaBoostClassifier(),hyper_parameters,
         pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,MinMaxScaler())

In [None]:
hyper_parameters = {'classification__learning_rate':[0.1,0.3,0.6,1],
                   'classification__n_estimators':[30,50,80,100],
                   'imputer__strategy':['mean','median']}
print('AdaBoostClassifier accuracy: ')
run_model(AdaBoostClassifier(),hyper_parameters,
         pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,StandardScaler())

In [None]:
hyper_parameters = {'imputer__strategy':['mean','median'],
                   'classification__learning_rate':[0.1,0.3,0.5,1],
                    'classification__max_depth':[3,6,8],
                    'classification__n_estimators':[30,60,100,150]
                   }
print('GradientBoostingClassifier accuracy: ')
run_model(GradientBoostingClassifier(),hyper_parameters,
         pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,MinMaxScaler())

In [None]:
hyper_parameters = {'imputer__strategy':['mean','median'],
                   'classification__learning_rate':[0.1,0.3,0.5,1],
                    'classification__max_depth':[3,6,8],
                    'classification__n_estimators':[30,60,100,150]
                   }
print('GradientBoostingClassifier accuracy: ')
run_model(GradientBoostingClassifier(),hyper_parameters,
         pima.drop(labels='onset_diabetes',axis=1),pima['onset_diabetes'],3,StandardScaler())
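The scaler comparisons above can also be collected into one table rather than read off cell by cell. The following is only a sketch of that idea: it uses a synthetic dataset from `make_classification` as a stand-in for the Pima features, a default KNN model, and plain `cross_val_score` instead of the grid search, so the numbers it prints say nothing about the real data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Synthetic stand-in with 8 features, like Pima; this is not the real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 'passthrough' tells the Pipeline to skip the scaling step entirely.
scalers = {
    'none': 'passthrough',
    'z-score': StandardScaler(),
    'min-max': MinMaxScaler(),
    'row-norm': Normalizer(),
}

rows = []
for name, scaler in scalers.items():
    pipe = Pipeline([('scale', scaler), ('classify', KNeighborsClassifier())])
    scores = cross_val_score(pipe, X, y, cv=3)   # 3-fold accuracy
    rows.append({'scaler': name, 'mean_cv_accuracy': scores.mean()})

results = pd.DataFrame(rows).set_index('scaler')
print(results)
```

Swapping in `pima.drop('onset_diabetes', axis=1)` and `pima['onset_diabetes']` for `X` and `y` would give the equivalent table for this notebook's data.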