## Machine Learning

In this notebook, I will do some exercises with prediction.
Reference: https://github.com/ikhlaqsidhu/data-x/blob/master/05a-tools-predicition-titanic/titanic.ipynb

In [22]:
import numpy as np
import pandas as pd

In [23]:
# machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier



**Reading the `diabetesdata.csv` file into a pandas DataFrame. About the data:**

1. __TimesPregnant__: Number of times pregnant
2. __glucoseLevel__: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. __BP__: Diastolic blood pressure (mm Hg)
4. __insulin__: 2-hour serum insulin (mu U/ml)
5. __BMI__: Body mass index (weight in kg/(height in m)^2)
6. __Pedigree__: Diabetes pedigree function
7. __Age__: Age (years)
8. __IsDiabetic__: 0 if not diabetic, 1 if diabetic








In [24]:
#Read data & print the head
df = pd.read_csv('diabetesdata.csv')
df.head()

Unnamed: 0,TimesPregnant,glucoseLevel,BP,insulin,BMI,Pedigree,Age,IsDiabetic
0,6,148.0,72,0,33.6,0.627,50.0,1
1,1,,66,0,26.6,0.351,31.0,0
2,8,183.0,64,0,23.3,0.672,,1
3,1,,66,94,28.1,0.167,21.0,0
4,0,137.0,40,168,43.1,2.288,33.0,1


**Calculating the percentage of null values in each column and displaying it.**

In [25]:
print("Percentage of Null values in each column: ")
display((df.isnull().sum()/len(df))*100)

Percentage of Null values in each column: 


TimesPregnant    0.000000
glucoseLevel     4.427083
BP               0.000000
insulin          0.000000
BMI              0.000000
Pedigree         0.000000
Age              4.296875
IsDiabetic       0.000000
dtype: float64
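
As a minimal sketch of an equivalent computation: `isnull()` returns booleans, so the column-wise mean gives the fraction of nulls directly. The small frame below is made up for illustration and is not `diabetesdata.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the diabetes data) with a few missing values.
toy = pd.DataFrame({
    "glucoseLevel": [148.0, np.nan, 183.0, np.nan],
    "Age": [50.0, 31.0, np.nan, 21.0],
    "BP": [72, 66, 64, 66],
})

# The mean of a boolean mask is the fraction of True values,
# so this matches toy.isnull().sum() / len(toy) * 100.
null_pct = toy.isnull().mean() * 100
print(null_pct)
```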

In [26]:
print("Null values in glucoseLevel: ")
print(df[df['glucoseLevel'].isnull()][['glucoseLevel']])
display(df[df['glucoseLevel'].isnull()]) 

Null values in glucoseLevel: 
     glucoseLevel
1             NaN
3             NaN
9             NaN
13            NaN
16            NaN
28            NaN
33            NaN
38            NaN
79            NaN
86            NaN
93            NaN
97            NaN
100           NaN
182           NaN
185           NaN
187           NaN
298           NaN
303           NaN
686           NaN
690           NaN
698           NaN
700           NaN
708           NaN
711           NaN
713           NaN
714           NaN
715           NaN
717           NaN
719           NaN
723           NaN
725           NaN
728           NaN
733           NaN
737           NaN


Unnamed: 0,TimesPregnant,glucoseLevel,BP,insulin,BMI,Pedigree,Age,IsDiabetic
1,1,,66,0,26.6,0.351,31.0,0
3,1,,66,94,28.1,0.167,21.0,0
9,8,,96,0,0.0,0.232,54.0,1
13,1,,60,846,30.1,0.398,59.0,1
16,0,,84,230,45.8,0.551,31.0,1
28,13,,82,110,22.2,0.245,57.0,0
33,6,,92,0,19.9,0.188,28.0,0
38,2,,68,0,38.2,0.503,27.0,1
79,2,,66,0,25.0,0.307,24.0,0
86,13,,72,0,36.6,0.178,45.0,0


In [27]:
print("Null values in Age: ")
print(df[df['Age'].isnull()][['Age']])
df[df['Age'].isnull()] 

Null values in Age: 
     Age
2    NaN
10   NaN
27   NaN
34   NaN
75   NaN
77   NaN
124  NaN
128  NaN
288  NaN
299  NaN
303  NaN
541  NaN
547  NaN
598  NaN
606  NaN
613  NaN
616  NaN
649  NaN
652  NaN
653  NaN
654  NaN
655  NaN
680  NaN
684  NaN
688  NaN
690  NaN
695  NaN
701  NaN
707  NaN
712  NaN
727  NaN
733  NaN
738  NaN


Unnamed: 0,TimesPregnant,glucoseLevel,BP,insulin,BMI,Pedigree,Age,IsDiabetic
2,8,183.0,64,0,23.3,0.672,,1
10,4,110.0,92,0,37.6,0.191,,0
27,1,97.0,66,140,23.2,0.487,,0
34,10,122.0,78,0,27.6,0.512,,0
75,1,0.0,48,0,24.7,0.14,,0
77,5,95.0,72,0,37.7,0.37,,0
124,0,113.0,76,0,33.3,0.278,,1
128,1,117.0,88,145,34.5,0.403,,1
288,4,96.0,56,49,20.8,0.34,,0
299,8,112.0,72,0,23.6,0.84,,0


**Splitting __`df`__ into __`train_df`__ and __`test_df`__ with 15% as test.**

In [28]:
train_df, test_df = train_test_split(df, test_size=0.15, random_state = 5) 
train_df.head()

Unnamed: 0,TimesPregnant,glucoseLevel,BP,insulin,BMI,Pedigree,Age,IsDiabetic
745,12,100.0,84,105,30.0,0.488,46.0,0
424,8,151.0,78,210,42.9,0.516,36.0,1
572,3,111.0,58,44,29.5,0.43,22.0,0
203,2,99.0,70,44,20.4,0.235,27.0,0
193,11,135.0,0,0,52.3,0.578,40.0,1


**Displaying the means of the features in train and test sets. Replacing the null values in  __`train_df`__ and __`test_df`__  with the mean of EACH feature column separately for train and test.**

In [29]:
print(test_df.mean())
print(train_df.mean())
test_df = test_df.fillna(test_df.mean())
train_df = train_df.fillna(train_df.mean())

display(test_df.head())
train_df.head()

TimesPregnant      3.948276
glucoseLevel     120.263636
BP                70.603448
insulin           72.922414
BMI               32.133621
Pedigree           0.519457
Age               34.222222
IsDiabetic         0.310345
dtype: float64
TimesPregnant      3.826687
glucoseLevel     121.149038
BP                68.838957
insulin           81.023006
BMI               31.967485
Pedigree           0.463411
Age               33.204147
IsDiabetic         0.355828
dtype: float64


Unnamed: 0,TimesPregnant,glucoseLevel,BP,insulin,BMI,Pedigree,Age,IsDiabetic
567,6,92.0,62,126,32.0,0.085,46.0,0
123,5,132.0,80,0,26.8,0.186,69.0,0
615,3,106.0,72,0,25.8,0.207,27.0,0
492,4,99.0,68,0,32.8,0.145,33.0,0
288,4,96.0,56,49,20.8,0.34,34.222222,0


Unnamed: 0,TimesPregnant,glucoseLevel,BP,insulin,BMI,Pedigree,Age,IsDiabetic
745,12,100.0,84,105,30.0,0.488,46.0,0
424,8,151.0,78,210,42.9,0.516,36.0,1
572,3,111.0,58,44,29.5,0.43,22.0,0
203,2,99.0,70,44,20.4,0.235,27.0,0
193,11,135.0,0,0,52.3,0.578,40.0,1
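
The cell above fills each set with its own column means, as the exercise asks. A common alternative, sketched here under the assumption that the test set should be treated as unseen data, is to compute the means on the training rows only and reuse them for the test rows. The frames and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy train/test frames (not the diabetes data).
toy_train = pd.DataFrame({"glucoseLevel": [100.0, np.nan, 140.0],
                          "Age": [30.0, 40.0, np.nan]})
toy_test = pd.DataFrame({"glucoseLevel": [np.nan, 120.0],
                         "Age": [np.nan, 25.0]})

# Column means come from the training rows only (NaNs are skipped by default).
train_means = toy_train.mean()

# The same training means fill the gaps in both sets,
# so no statistic computed on the test rows is used.
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(train_means)
print(toy_test_filled)
```

Whether this matters in practice depends on the dataset size and the share of missing values; it is shown here only as a point of comparison.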


**Split __`train_df`__ & __`test_df`__   into  __`X_train`__, __`Y_train`__  and __`X_test`__, __`Y_test`__. __`Y_train`__  and __`Y_test`__ should only have the column we are trying to predict,  __`IsDiabetic`__.**

In [30]:
X_train = train_df.iloc[:, :-1]  # all feature columns (everything except IsDiabetic)
Y_train = train_df.iloc[:, -1:]  # IsDiabetic, kept as a one-column DataFrame
X_test = test_df.iloc[:, :-1]
Y_test = test_df.iloc[:, -1:]
display(X_train.head())
Y_test.head()

Unnamed: 0,TimesPregnant,glucoseLevel,BP,insulin,BMI,Pedigree,Age
745,12,100.0,84,105,30.0,0.488,46.0
424,8,151.0,78,210,42.9,0.516,36.0
572,3,111.0,58,44,29.5,0.43,22.0
203,2,99.0,70,44,20.4,0.235,27.0
193,11,135.0,0,0,52.3,0.578,40.0


Unnamed: 0,IsDiabetic
567,0
123,0
615,0
492,0
288,0
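
Note that `iloc[:, -1:]` keeps the target as a one-column DataFrame of shape `(n, 1)`, while scikit-learn estimators expect a 1-D target and may emit a `DataConversionWarning` when given a column vector. A minimal sketch of two ways to get a 1-D target, using a made-up frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame (not the diabetes data).
toy = pd.DataFrame({"BMI": [33.6, 26.6, 23.3],
                    "Age": [50.0, 31.0, 21.0],
                    "IsDiabetic": [1, 0, 1]})

X = toy.drop(columns="IsDiabetic")   # feature columns only
y_frame = toy.iloc[:, -1:]           # DataFrame, shape (3, 1)
y_series = toy["IsDiabetic"]         # Series, shape (3,)
y_flat = y_frame.values.ravel()      # NumPy array, shape (3,)

print(y_frame.shape, y_series.shape, y_flat.shape)
```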


**Training perceptron, logistic regression and random forest models using 15% test split.**

In [31]:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
# Keep the predicted labels; they are printed next to the log probabilities later
Y_pred_train = logreg.predict(X_train)
Y_pred_test = logreg.predict(X_test)
# score() returns the mean accuracy as a fraction between 0 and 1
acc_logreg_train = logreg.score(X_train, Y_train)
acc_logreg_test = logreg.score(X_test, Y_test)
print('logreg training accuracy= ', str(round(acc_logreg_train*100,2)),'%')
print('logreg test accuracy= ',str(round(acc_logreg_test*100,2)),'%')

logreg training accuracy=  75.92 %
logreg test accuracy=  81.03 %




In [32]:
# Perceptron
perceptron = Perceptron(max_iter = 10000)                                  
perceptron.fit(X_train, Y_train)                           
acc_perceptron_train = perceptron.score(X_train, Y_train)       
acc_perceptron_test = perceptron.score(X_test, Y_test) 
print('Perceptron training accuracy:', str(round(acc_perceptron_train*100,2)),'%')
print('Perceptron testing accuracy:', str(round(acc_perceptron_test*100,2)),'%')



Perceptron training accuracy: 73.93 %
Perceptron testing accuracy: 81.9 %


In [56]:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=231)   
random_forest.fit(X_train, Y_train)                         
acc_rf_train = random_forest.score(X_train, Y_train) 
acc_rf_test = random_forest.score(X_test, Y_test)
print('Random Forest Training accuracy:', str(round(acc_rf_train*100,2)),'%') 
print('Random Forest Testing accuracy:', str(round(acc_rf_test*100,2)),'%')  



Random Forest Training accuracy: 100.0 %
Random Forest Testing accuracy: 80.17 %


**Computing the log probability of the classes in __`IsDiabetic`__ for the first 10 samples of the train set and displaying it.**


In [57]:
print(logreg.predict_log_proba(X_train[:10]))
print() 
print(Y_pred_train[:10]) 

[[-0.45326433 -1.00936623]
 [-1.34291022 -0.30257204]
 [-0.28051228 -1.40811744]
 [-0.09216015 -2.42995362]
 [-2.951801   -0.0536598 ]
 [-1.02791664 -0.44277999]
 [-0.18237866 -1.79147402]
 [-0.34706216 -1.22676865]
 [-0.76335924 -0.62754309]
 [-0.17226818 -1.84360069]]

[0 1 0 0 1 1 0 0 1 0]


**Computing the log probability of the classes in __`IsDiabetic`__ for the first 10 samples of the test set and displaying it.**


In [58]:
print(logreg.predict_log_proba(X_test[:10]))
print()
print(Y_pred_test[:10]) 

[[-0.25129728 -1.50413742]
 [-0.44971237 -1.0155907 ]
 [-0.14801183 -1.98355632]
 [-0.20769657 -1.67372858]
 [-0.15923075 -1.91596004]
 [-0.21076159 -1.66055832]
 [-0.62072957 -0.7712214 ]
 [-0.93639623 -0.49764308]
 [-0.62179312 -0.76998655]
 [-0.63711585 -0.75250536]]

[0 0 0 0 0 0 0 1 0 0]


Since these are log probabilities, the predicted class for each sample is the one with the least negative (largest) value.
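
To make this concrete, here is a small sketch that takes the ten training-set rows printed above, exponentiates them to recover ordinary probabilities, and picks the column with the largest log probability. Only NumPy is assumed; the array is copied from the printed output, not recomputed from the model.

```python
import numpy as np

# Log probabilities copied from the training-set output above
# (column 0 = class 0, column 1 = class 1).
log_proba = np.array([
    [-0.45326433, -1.00936623],
    [-1.34291022, -0.30257204],
    [-0.28051228, -1.40811744],
    [-0.09216015, -2.42995362],
    [-2.951801,   -0.0536598],
    [-1.02791664, -0.44277999],
    [-0.18237866, -1.79147402],
    [-0.34706216, -1.22676865],
    [-0.76335924, -0.62754309],
    [-0.17226818, -1.84360069],
])

# exp() undoes the log, so each row should sum to roughly 1.
proba = np.exp(log_proba)
row_sums = proba.sum(axis=1)

# The predicted class is the column with the largest (least negative) value.
predicted = log_proba.argmax(axis=1)

print(row_sums)
print(predicted)
```

The resulting labels can be compared with the `Y_pred_train[:10]` values printed above.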

A few notes:
Mean imputation is not the best type of imputation, and I don't think it will work all the time. For example, if a dataset has highly variable values (or a bimodal distribution), the mean is a poor stand-in for the missing entries. Likewise, for a binary variable that is either 0 or 1, filling in the mean does not make sense, because the result is not a valid value. It might work when the values are concentrated in a narrow range, but even then it changes the relationships with other variables and therefore biases the dataset. Methods such as Multiple Imputation and Maximum Likelihood can help in this situation. In Multiple Imputation, instead of filling in a single value, the distribution of the observed data is used to estimate multiple values that reflect the uncertainty around the true value. This way we don't contaminate the relationships between variables.
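
The effect on relationships between variables can be illustrated with a small simulation. This is a sketch on synthetic data, not on `diabetesdata.csv`: the variables `x` and `y`, the 30% missing rate, and the assumption that values are missing completely at random are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two correlated synthetic variables.
x = rng.normal(loc=120.0, scale=30.0, size=n)
y = 0.5 * x + rng.normal(scale=10.0, size=n)

# Drop about 30% of x completely at random.
missing = rng.random(n) < 0.3
x_obs = x[~missing]

# Mean imputation: every missing entry gets the observed mean.
x_imputed = x.copy()
x_imputed[missing] = x_obs.mean()

# Compare spread and correlation before and after imputation.
var_before = x_obs.var()
var_after = x_imputed.var()
corr_before = np.corrcoef(x_obs, y[~missing])[0, 1]
corr_after = np.corrcoef(x_imputed, y)[0, 1]

print("variance of x:", var_before, "->", var_after)
print("corr(x, y):   ", corr_before, "->", corr_after)
```

Because every imputed point sits exactly at the mean, the variance of `x` shrinks and its correlation with `y` is pulled toward zero, which is the distortion described above.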