# About

This kernel is the result of a student team project for the course "Introduction to Data Science" at the University of Tartu in autumn 2018. Based on the data set "Clinical, Anthropometric & Bio-Chemical Survey", we tried to draw useful insights from the data, in particular to predict lifestyle disease indicators.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
from matplotlib import pyplot as plt # plotting
from math import * # sqrt() etc
# with %matplotlib inline you turn on the immediate display.
# %matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.
import warnings
warnings.filterwarnings("ignore")

# Gather Data

In [None]:
data_dictionary_loc = '../input/CAB_data_dictionary.xlsx'
data_dic = pd.read_excel(data_dictionary_loc, dtype = object)
data_dic['File Content Description'] #well, how to import the correct column width? can be viewed using other programs
data_dic

In [None]:
data_u_pradesh = pd.read_csv('../input/CAB_09_UP.csv', low_memory = False) 
# low_memory = False is needed because columns 14 and 43 have mixed types
data_u_pradesh.head()

# Cleaning data

Subset of adults has 299 570 individuals

In [None]:
data = data_u_pradesh[(data_u_pradesh['age_code']=='Y')&(data_u_pradesh['age']>=18)]
len(data)

Original data had -1 for missing values

In [None]:
data = data.replace([-1, '-1'], np.nan)

Dropping columns that only apply to children under 5 (and under 3) years old

In [None]:
cols_under5 = ['illness_type', 'illness_duration', 'treatment_type']
cols_under3 = ['first_breast_feeding', 'is_cur_breast_feeding', 'day_or_month_for_breast_feeding_', 'day_or_month_for_breast_feeding', 'water_month', 'ani_milk_month', 'semisolid_month_or_day', 'solid_month', 'vegetables_month_or_day']

In [None]:
data = data.drop(cols_under5, axis = 1)
data = data.drop(cols_under3, axis = 1)

Dropping unnecessary features
 - 'state_code'
 - 'PSU_ID' - This is a seven digit number to uniquely identify each record.
 - 'ahs_house_unit' - House Number
 - 'house_hold_no' - Household Number
 - 'record_code_iodine_reason' - Why was iodine testing refused
 - 'sl_no' - Each record of the Household has a serial no. 
 - 'usual_residence' - Whether the member usually lives here
 - 'usual_residence_reason' - Reason for member not being usual resident
 - 'identification_code' - Each member of a PSU is assigned a unique number
 - 'v54' ?

In [None]:
data = data.drop(['state_code', 'psu_id', 'ahs_house_unit', 'house_hold_no', 'record_code_iodine_reason', 'sl_no', 'usual_residance', 'usual_residance_reason', 'identification_code', 'v54'], axis = 1)

From data dictionary:
- 'rural_urban' - Rural-1; Urban-2
- 'stratum' - 1 or 2 when 'rural_urban'=1, 0 when 'rural_urban'=2

dropping feature 'rural_urban', since 'stratum' contains the same information

I guess 'stratum' feature values:
- 0 - urban
- 1 - rural  
- 2 - very rural?

not specified in dictionary

In [None]:
data = data.drop('rural_urban', axis = 1)

## Age related
From data dictionary:
- 'age_code' - unit of recording age
- 'age'
- 'date_of_birth' - DD
- 'month_of_birth' - MM
- 'year_of_birth' - YYYY

Dropping feature 'age_code' (values: Y, M, D for years, months, days), since age is always recorded in years for adults

In [None]:
display(np.unique(data['age_code']))
data = data.drop('age_code', axis = 1)

In [None]:
plt.hist(data.age.dropna(), bins = 50)
plt.title('Age')
plt.show()

## Iodine
From data dictionary:
- 'test_salt_iodine' - Salt used by the Household has been tested for Iodine content[Recorded as Parts Per Million(PPM)]
- 'record_code_iodine' - No iodine – 1; Less than 15 PPM – 2; More than or equal to 15 PPM – 3; No salt in Household – 4; Salt not tested  – 5

In [None]:
data['record_code_iodine'].value_counts()

## Height/weight
From data dictionary:
- 'weight_measured' - Measured-1;  Member - not present-2, Refused-3, Other-4
- 'weight_in_kg' - outcome
- 'length_height_measured' - Measured-1;  Member not present-2, Refused-3, Other-4
- 'length_height_code' - L- Length, H-Height
- 'length_height_cm' - outcome

Dropping unnecessary columns. The weight/length column is NA if the measurement was not conducted

In [None]:
data = data.drop(['weight_measured', 'length_height_measured', 'length_height_code'], axis = 1)

In [None]:
data = data.rename(index=str, columns={"weight_in_kg": "weight", "length_height_cm": "height"})

In [None]:
plt.boxplot(data['weight'].dropna())
plt.title('Weight with outliers')
plt.show()

In [None]:
plt.boxplot(data['height'].dropna())
plt.title('Height with outliers')
plt.show()

In [None]:
# exclude any measurements where difference from median is larger than 3 standard deviations
def remove_outliers(data, feature):
    stdev = sqrt(np.var(data[feature].dropna()))
    median = np.median(data[feature].dropna())
    print("number of discarded measurements")
    display(len(data[[feature]].where(abs(data[feature] - median)>(3*stdev)).dropna()))
    # keep original values if difference from median is less than 3 standard deviations. NA otherwise
    return data[[feature]].where(abs(data[feature] - median)<(3*stdev), other = np.nan)

In [None]:
data['height'] = remove_outliers(data, 'height')

Removing weight outliers. NA for anything under 20kg

In [None]:
print('number of discarded measurements')
display(len(data[data['weight']<20]))
data['weight'] = data['weight'].where(data['weight']>=20, other=np.nan)

In [None]:
plt.boxplot(data['weight'].dropna())
plt.title('Weight without outliers')
plt.show()

In [None]:
plt.boxplot(data['height'].dropna())
plt.title('Height without outliers')
plt.show()

Body mass index: weight(kg)/(height(m) * height(m))

In [None]:
data['bmi'] = data['weight']/(data['height']/100)**2

In [None]:
plt.hist(data['weight'].dropna(), bins = 50)
plt.title('Weight without outliers')
plt.show()

In [None]:
data_weird = data.where(data.height.where(np.isin(data.height,[130,140,150])).notna()).dropna(how = 'all')
data_without_weird = data.drop(list(data_weird.index.values), axis = 0)

In [None]:
plt.hist(data_without_weird.height.dropna(), bins = 80, label = 'normally distributed heights', color = 'lightblue')
plt.hist(data_weird.height, color = 'red', bins = 50, label = 'outstanding heights')
plt.title('Height without outliers')
plt.xlabel('height[cm]')
plt.ylabel('freq')
plt.legend()
plt.savefig('weird_heights.pdf')
plt.show()

In [None]:
display(data.height.value_counts().head())
data_weird.height.value_counts()

A lot of individuals with 130, 140, 150cm height

In [None]:
plt.hist(data['bmi'].dropna(), bins = 50)
plt.title('BMI')
plt.show()

Data cleaning steps for height/weight related data: 
- Discarded any height measurements further than 3 standard deviations from the median, treating the distribution of height as approximately normal.
- Discarded any weight measurements under 20kg
- Calculated BMI

Discarded ~800 values for height and ~460 values for weight, out of ~200 000
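
The cleaning rule above can be tried on made-up numbers. This is only a minimal sketch, not the notebook's `remove_outliers` function: the helper name `mask_outliers` and all values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def mask_outliers(values: pd.Series, n_std: float = 3.0) -> pd.Series:
    """Replace values further than n_std standard deviations from the
    median by NaN (hypothetical helper, for illustration only)."""
    median = values.median()
    stdev = values.std(ddof=0)  # population standard deviation, like sqrt(np.var(...))
    return values.where((values - median).abs() < n_std * stdev)

# toy heights in cm: twenty plausible values and one obvious input error
heights = pd.Series([155, 158, 160, 162, 165] * 4 + [400], dtype=float)
clean_heights = mask_outliers(heights)

# BMI = weight (kg) / height (m)^2; a missing height gives a missing BMI
weights = pd.Series([60.0] * 21)
bmi = weights / (clean_heights / 100) ** 2
print(clean_heights.isna().sum(), round(bmi.iloc[2], 2))
```

Because the standard deviation is itself inflated by extreme values, this rule only catches measurements that are far off, which fits the small number of discarded values reported above.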

## Pulse, blood pressure(heart disease)
From data dictionary:
- 'bp_systolic'
- 'bp_systolic_2_reading'
- 'bp_diastolic'
- 'bp_diastolic_2reading'
- 'pulse_rate',
- 'pulse_rate_2_reading'

In [None]:
# distribution of measurement differences
#plt.hist((data['bp_systolic'] - data['bp_systolic_2_reading']).dropna(), bins = 50)
#plt.hist((data['pulse_rate'] - data['pulse_rate_2_reading']).dropna(), bins = 50)
#plt.hist((data['bp_diastolic'] - data['bp_diastolic_2reading']).dropna(), bins = 50)

In [None]:
# for features where two measurements were taken, exclude any where difference between measurements is larger than 3 standard deviations
def remove_outliers_difference(data, col1, col2):
    stdev = sqrt((data[col1] - data[col2]).var())
# how many measurements were excluded
    print('number of discarded measurements')
    display(len(data[[col1, col2]].where(abs(data[col1] - data[col2])>(3*stdev)).dropna()))
# keep original values if difference of two measurements is less than 3 standard deviations. NA otherwise
    return data[[col1, col2]].where(abs(data[col1] - data[col2])<(3*stdev), other = np.nan)

In [None]:
data[['bp_systolic', 'bp_systolic_2_reading']] = remove_outliers_difference(data, 'bp_systolic', 'bp_systolic_2_reading')
data[['bp_diastolic', 'bp_diastolic_2reading']] = remove_outliers_difference(data, 'bp_diastolic', 'bp_diastolic_2reading')
data[['pulse_rate', 'pulse_rate_2_reading']] = remove_outliers_difference(data, 'pulse_rate', 'pulse_rate_2_reading')

Now that outliers have been removed, aggregate remaining data by finding mean between two readings

In [None]:
# aggregate two reading by finding mean
def aggregate_readings(data, col1, col2):
    # vectorized mean of the two readings; NaN in either reading gives NaN
    data[col1] = (data[col1] + data[col2]) / 2
    data = data.drop(col2, axis = 1)
    return data

In [None]:
data = aggregate_readings(data, 'bp_systolic', 'bp_systolic_2_reading')
data = aggregate_readings(data, 'bp_diastolic', 'bp_diastolic_2reading')
data = aggregate_readings(data, 'pulse_rate', 'pulse_rate_2_reading')

Systolic - beating, diastolic - resting blood pressure. Likely input/measurement error where systolic < diastolic

In [None]:
# retain original values where resting blood pressure lower than beating. NA otherwise 
data[['bp_diastolic', 'bp_systolic']] = data[['bp_diastolic', 'bp_systolic']].where(data.bp_diastolic < data.bp_systolic, other = np.nan)

Data cleaning steps for heart disease related data: 
- Discarded any pair of readings whose difference was larger than 3 standard deviations of the differences, treating the measurement differences as approximately normally distributed.
- Aggregated two measurements by finding mean
- Discarded any where diastolic pressure was higher than systolic

Lost less than 5% of values for each feature
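
The first two steps can be illustrated on a toy example. This is a sketch under assumptions: the column names `first` and `second` and the readings themselves are invented, and the two steps are combined in one expression instead of two separate functions.

```python
import pandas as pd

# toy systolic readings: 21 people, the last pair disagrees by 60 mmHg
diffs = [0, 1, -1, 2, -2] * 4 + [60]
readings = pd.DataFrame({
    "first": [120.0] * 21,
    "second": [120.0 - d for d in diffs],
})

# standard deviation of the difference between the two readings
difference = readings["first"] - readings["second"]
stdev = difference.std()

# pairs that agree within 3 standard deviations are kept
agrees = difference.abs() < 3 * stdev

# aggregate by the mean of both readings, NaN where the pair was rejected
mean_reading = readings.mean(axis=1, skipna=False).where(agrees)
print(mean_reading.isna().sum())
```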

## Haemoglobin(anemia)
From data dictionary:
- 'haemoglobin_test' - Consent for Haemoglobin test (Yes-1; No-2)
- 'haemoglobin'- Status of Haemoglobin Test (Measured-1; Member not present-2; Refused-3, Other-4)
- 'haemoglobin_level' - Outcome of Haemoglobin Level (Hb) Test (in percentage gms)  

In [None]:
data = data.drop(['haemoglobin_test', 'haemoglobin'], axis = 1)

In [None]:
plt.hist(data.haemoglobin_level[~np.isnan(data.haemoglobin_level)], bins=50)
plt.title('Blood haemoglobin')
plt.show()

## Blood sugar(diabetes)
From data dictionary:
- 'diabetes_test' - consent for testing
- 'fasting_blood_glucose' - Measured-1; Member not present-2; Refused-3; Other-4
- 'fasting_blood_glucose_mg_dl' - outcome of test

In [None]:
data = data.drop(['diabetes_test', 'fasting_blood_glucose'], axis = 1)

In [None]:
data = data.rename(index = str, columns = {'fasting_blood_glucose_mg_dl' : 'glucose'})
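# note: a missing glucose value also evaluates to 0 here; rows without a
# glucose measurement are dropped later when data_glucose is created
data['diabetes'] = data['glucose'].apply(lambda x: 1 if x >= 100 else 0)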

In [None]:
plt.hist(data.glucose[~np.isnan(data.glucose)], bins=50)
plt.title('Blood sugar')
plt.show()

In [None]:
plt.boxplot(data.glucose[~np.isnan(data.glucose)])
plt.title('Blood sugar')
plt.show()

In [None]:
data['glucose'] = remove_outliers(data,'glucose')

## Features only applicable to women
From data dictionary:
- 'marital_status' - Never married=1,Married but Gauna not performed=2, Married and Gauna perfomed=3, Remarried=4,Widow=5, Divorced=6, Separated=7, Not stated=8
- 'gauna_perfor_not_perfor' - Pregnant-1; Lactating-2; Non-pregnant or Non-lactating-3
- 'duration_pregnanacy' - Duration of pregnancy/lactation (in months)

In [None]:
cols_women = ['marital_status', 'gauna_perfor_not_perfor', 'duration_pregnanacy']

placing NA where marital status 'not stated' 

In [None]:
data['marital_status'] = data['marital_status'].where(~(data['marital_status']==8.0), other = np.nan)

In [None]:
# input errors have to be dealt with
plt.boxplot(data['duration_pregnanacy'].dropna())
plt.show()

In [None]:
corr=data.corr()[['haemoglobin_level', 'pulse_rate', 'bp_diastolic', 'bp_systolic', 'glucose']]
corr.where(abs(corr)>0.1)

Removing features where there's no correlation

In [None]:
data_correlated = data.drop(['district_code', 'stratum', 'test_salt_iodine', 'record_code_iodine', 'date_of_birth', 'month_of_birth', 'duration_pregnanacy'], axis = 1)
corr = data_correlated.corr()[['haemoglobin_level', 'pulse_rate', 'bp_diastolic', 'bp_systolic', 'glucose']]
corr.where(abs(corr)>0.1)


## Summary of first Data Exploration
- From 53 initial features to 21
- data_correlated contains only data that has relationships to other data
- the data quality is limited: there are many missing values, and the measurements seem to have been taken with varying precision

TODO:
- A lot of individuals with 130, 140, 150cm height value??

# Prepare Data for training models

## Goals of preparing data
1. Clarify if the data with heights 130cm, 140cm, 150cm is usable (and maybe find a reason for this)
2. apply OneHotEncoding to categorical features
3. encode date_survey as ordinal feature
4. create a scaled and normalized data set
5. Check which features have a high NaN property (for later use)
6. create data sets for anemia, heart, and glucose testing



## 1. Clarify if the data with heights 130cm, 140cm, 150cm is usable (and maybe find a reason for this)

In [None]:
# get an impression of how many values are there again
print("About how many instances are we talking?\n")
print(data.height.value_counts().head())
weird_heights = data.height.value_counts().index[:3].tolist()
data_filter_helper = data.isin(weird_heights)
weird_heights_data = data.loc[data_filter_helper.height]
print("This affects ", weird_heights_data.shape[0], "instances. ")
print("That is ", weird_heights_data.shape[0]*100/data.shape[0], "% of our data in total. ")

So this affects a substantial share of our data. We should investigate before we treat these values as measurement errors.

In [None]:
weird_heights_data.describe()

This shows that the data varies widely in the other features. Let's visualize:

In [None]:
fig = plt.figure(figsize = (10, 30))

for counter, column in enumerate(weird_heights_data.columns): 
    axes= fig.add_subplot(7, 4, 1+ counter)
    axes.bar(weird_heights_data[column].value_counts().index, weird_heights_data[column].value_counts().values)
    axes.set_title(column)  
plt.subplots_adjust(wspace = 0.5)
plt.show()

As we can observe, most of the other measurements are reasonably spread out. As the district code varies, I can only assume that the height was sometimes recorded very loosely. It is also interesting that month and date of birth are the same for everyone in this group. Hence I can only conclude that these people do not have a birth certificate and that their height was perhaps simply estimated when they were registered. We should keep this data.

##  2. Apply OneHotEncoding to categorical features



Dummies should be created for all features that are encoded numerically but are actually categorical. Of the 21 remaining features, these are:

- district_code
- stratum
- record_code_iodine. Here, 1, 2 and 3 are ordered, while 4 should be 0 (no salt in household) and 5 should be replaced by NaN (no information).
- sex should be replaced by binary encoding instead of 0, 1, 2
- marital status
- gauna_perfor_not_perfor: 1 - pregnant, 2 - lactating, 3 - neither. Better rename to "pregnant" and "lactating" after OneHotEncoding
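
The ordinal recoding suggested for record_code_iodine could look like the sketch below. Note that the next cell one-hot encodes this column instead, so this is only an illustration of the idea; the mapping dictionary `ordinal_map` and the toy series are assumptions.

```python
import numpy as np
import pandas as pd

# codes from the data dictionary: 1 no iodine, 2 below 15 PPM,
# 3 at least 15 PPM, 4 no salt in household, 5 salt not tested
iodine_codes = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])

# hypothetical ordinal recoding: 4 -> 0, 5 -> NaN, 1 to 3 unchanged
ordinal_map = {1.0: 1.0, 2.0: 2.0, 3.0: 3.0, 4.0: 0.0, 5.0: np.nan}
iodine_ordinal = iodine_codes.map(ordinal_map)
print(iodine_ordinal.tolist())
```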



In [None]:
dummieable =['district_code', 'stratum', 'record_code_iodine', 'sex', 'marital_status', 'gauna_perfor_not_perfor']
dummiedata = [data]
for dum in dummieable: 
    dummiedata.append(pd.get_dummies(data[dum], prefix = dum))
dummied_data = pd.concat(dummiedata, axis = 1)
print("Number of features now: ", len(dummied_data.columns))
dummied_data.columns

Finally remove the old columns, rename the new ones and set all not given data NaN.

In [None]:
dummied_data = dummied_data.drop(dummieable, axis =1)
print("Number of features after making categorical numeric: ", len(dummied_data.columns))
rename_dict = {'marital_status_1.0': 'never_married', 'marital_status_2.0': 'married_no_gauna',
               'marital_status_3.0': 'married_and_gauna',
       'marital_status_4.0': 'remarried', 'marital_status_5.0': 'widow', 'marital_status_6.0': 'divorced',
       'marital_status_7.0': 'separated', 'gauna_perfor_not_perfor_1.0': 'pregnant',
       'gauna_perfor_not_perfor_2.0': 'lactating', 'gauna_perfor_not_perfor_3.0': 'non_pregnant_non_lactating',
        'sex_1': 'male', 'sex_2': 'female'}
dummied_data = dummied_data.rename(rename_dict, axis = 'columns')
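
To see why the keys in `rename_dict` end in `.0`, here is a small sketch on invented data. The toy frame is an assumption; only `pd.get_dummies` and the renaming pattern are taken from the cells above.

```python
import numpy as np
import pandas as pd

# toy column with float codes and one missing value, as in the survey data
toy = pd.DataFrame({"marital_status": [1.0, 3.0, np.nan, 1.0]})

# float codes produce column names such as 'marital_status_1.0'
dummies = pd.get_dummies(toy["marital_status"], prefix="marital_status")
dummies = dummies.rename(columns={
    "marital_status_1.0": "never_married",
    "marital_status_3.0": "married_and_gauna",
})

# the row with a missing status gets 0 in every indicator column
print(dummies.astype(int))
```

A missing category is therefore not kept as NaN after encoding; it shows up as a row of zeros.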

## 3. Encode date_survey as ordinal feature

In [None]:
def parse(string):
    return int(string[6:])*10000 + int(string[3:5])*100 + int(string[:2])
dummied_data['year_month_day_survey'] = dummied_data.date_survey.apply(parse)
display(dummied_data[['date_survey', 'year_month_day_survey']].head(10)) #show how encoding looks like
dummied_data.drop('date_survey', axis = 1, inplace = True); #remove the original encoding
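
The encoding can be checked by hand on a few strings. This sketch assumes that `date_survey` is laid out as DD-MM-YYYY, which the slicing in `parse` suggests but which we have not confirmed in the data dictionary; the helper name `parse_survey_date` and the dates are made up.

```python
def parse_survey_date(text: str) -> int:
    """Turn a 'DD-MM-YYYY' string into the integer YYYYMMDD (hypothetical helper)."""
    day, month, year = int(text[:2]), int(text[3:5]), int(text[6:])
    return year * 10000 + month * 100 + day

dates = ["05-03-2015", "28-02-2015", "01-01-2016"]
encoded = [parse_survey_date(d) for d in dates]

# sorting the integers sorts the dates chronologically
print(sorted(encoded))
```

The result is ordinal but not evenly spaced: consecutive days at a month boundary differ by more than 1, so the value should not be read as a number of days.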

## 4. Create a scaled and centered data set

In [None]:
ft_numeric = ['year_month_day_survey','test_salt_iodine', 'age', 'date_of_birth', 'month_of_birth', 'year_of_birth', 'weight', 
              'height', 'haemoglobin_level', 'bp_systolic', 'bp_diastolic', 'glucose', 'duration_pregnanacy',
              'bmi', 'pulse_rate']
print("before: ")
display(dummied_data[ft_numeric].head())
# center to zero mean and scale to unit variance
cols = ["std_"+ x for x in  dummied_data[ft_numeric].columns]
dummied_data_numeric_std = pd.DataFrame(StandardScaler(with_mean = True).fit_transform(dummied_data[ft_numeric]), 
                                    columns = cols, index = dummied_data.index)
print("after scaling and centering: ")
dummied_data_numeric = dummied_data[ft_numeric]
dummied_data_numeric_std.head()

In [None]:
dummied_data_numeric_std.hist(figsize = (20, 20));

In [None]:
dummied_data_std = pd.concat([dummied_data.drop(ft_numeric, axis = 1), dummied_data_numeric_std], axis = 1)

In [None]:
dummied_data_std.head()
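
What the standardization does to a single column can be written out by hand. As far as we understand `StandardScaler`, it uses the population standard deviation and leaves missing values in place; the sketch below reproduces that with plain pandas on an invented `ages` series.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, 30.0, 40.0, np.nan, 50.0])

# z-score with the population standard deviation (ddof=0); pandas skips
# NaN when computing mean and std and keeps it in the result
std_ages = (ages - ages.mean()) / ages.std(ddof=0)
print(std_ages.round(3).tolist())
```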

## 5. Check which features have a high NaN property (for later use)



In [None]:
print("Missing values in both dummied data sets: ")
# count missing values per column in one vectorized pass
for c, nan_count in dummied_data.isna().sum().items():
    if nan_count > 0:
        print(c, nan_count)

## 6. create data sets for anemia, heart, and glucose testing

In [None]:
def drop_null_targets(data, target): 
     return data[data[target].notnull()]

data_anemia = drop_null_targets(dummied_data, 'haemoglobin_level')
data_anemia_std = drop_null_targets(dummied_data_std, 'std_haemoglobin_level')
data_glucose = drop_null_targets(dummied_data, 'glucose')
data_glucose_std = drop_null_targets(dummied_data_std, 'std_glucose')
data_heart = drop_null_targets(dummied_data, 'pulse_rate')
data_heart_std = drop_null_targets(dummied_data_std, 'std_pulse_rate')

In [None]:
#for control
print(dummied_data.shape)
print(dummied_data_std.shape)


# Investigations on Anemia

## Steps

1.  Drop irrelevant data
2. Add feature "anemia"
3. Optimization of models and features


## 1. Drop irrelevant data

Dropping data that is irrelevant to anemia from a medical point of view.

In [None]:
anemia_relevant = ['stratum_0', 'stratum_1', 'stratum_2',"test_salt_iodine",'record_code_iodine_1', 'record_code_iodine_2', 
                   'record_code_iodine_3',"age","weight","height","haemoglobin_level","bp_systolic",
                   "bp_diastolic","pulse_rate","glucose","duration_pregnanacy","bmi", 'male', 'female']
temp = [x for x in anemia_relevant if not np.isin(x, ft_numeric)]
temp2 = [x for x in anemia_relevant if np.isin(x, ft_numeric)]
anemia_relevant_std = temp + ["std_" + x for x in temp2]
data_anemia_red = data_anemia[anemia_relevant]
data_anemia_std_red = data_anemia_std[anemia_relevant_std]

display(data_anemia_red.head())
display(data_anemia_red.describe(include = 'all'))
display(data_anemia_std_red.head())
display(data_anemia_std_red.describe(include = 'all'))

## 2. Create an anemia feature

Adding a column for anemia. For men, anemia is diagnosed when the haemoglobin level is less than 13 g/dL, and for women less than 12 g/dL.

In [None]:
data_anemia_red['anemia'] = np.where(((data_anemia_red['male'] == 1) & (data_anemia_red['haemoglobin_level'] < 13.0)) |
                                    ((data_anemia_red['female'] == 1) & (data_anemia_red['haemoglobin_level'] < 12.0)), 1, 0)
data_anemia_std_red['anemia'] = data_anemia_red.anemia
display(data_anemia_std_red.head()) #no need to find the borders here.
data_anemia_red.head()

In [None]:
data_anemia_red_corr = data_anemia_red.corr()
data_anemia_std_red_corr = data_anemia_std_red.corr()
display(data_anemia_std_red_corr)
display(data_anemia_red_corr)

In [None]:
data_anemia_women = data_anemia_red.where(data_anemia_red.male == 0)
data_anemia_men = data_anemia_red.where(data_anemia_red.male == 1)

In [None]:
fig = plt.figure(figsize = (15, 5))
#plot anemia status of all women
axes = fig.add_subplot(1, 2, 1)
axes.fill_between(x = [0, 12], y1= [0.28, 0.28], color = 'lightcoral' )
axes.fill_between(x = [12, 18], y1= [0.28, 0.28], color = 'lightgreen' )
axes.hist(data_anemia_women.haemoglobin_level.dropna(), density = True, bins = 20)
axes.axvline(12, color = 'red', label = 'critical haemoglobin level')
axes.set_xlabel('haemoglobin [g/dl]')
axes.set_ylabel('freq')
axes.set_title('haemoglobin of women, mean ' + str(np.nanmean(data_anemia_women.haemoglobin_level)))
#anemia of all men
axes = fig.add_subplot(1, 2, 2)
axes.fill_between(x = [0, 13], y1= [0.28, 0.28], color = 'lightcoral' )
axes.fill_between(x = [13, 18], y1= [0.28, 0.28], color = 'lightgreen' )
axes.hist(data_anemia_men.haemoglobin_level.dropna(), density = True, bins = 20)
axes.axvline(13, color = 'red', label = 'critical haemoglobin level')
axes.set_xlabel('haemoglobin [g/dl]')
axes.set_ylabel('freq')
axes.set_title('haemoglobin of men, mean ' + str(np.nanmean(data_anemia_men.haemoglobin_level)))
axes.legend()
plt.savefig("anemia_men_women.pdf")
plt.show()

In [None]:
data_anemia_red = data_anemia_red.drop('haemoglobin_level', axis = 1)
data_anemia_std_red = data_anemia_std_red.drop('std_haemoglobin_level', axis = 1)

Now to get the data ready for training a model. There are many missing values, so we'll have to fill them in. I'll use imputation.

In [None]:
def impute_data(data):
    imputer = SimpleImputer()  # default strategy: fill missing values with the column mean
    data_i= pd.DataFrame(imputer.fit_transform(data), columns = data.columns, index = data.index)
    return data_i
    
data_anemia_red = impute_data(data_anemia_red)
data_anemia_std_red = impute_data(data_anemia_std_red)
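
The cell above computes the fill values on the whole data set before it is split. A possible variation, not what this notebook does, is to learn the mean from the training rows only and reuse it for the test rows. The sketch below shows the idea with plain pandas; the toy frame and the fixed split are assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"weight": [50.0, 60.0, np.nan, 70.0, np.nan, 90.0]})
train, test = toy.iloc[:4], toy.iloc[4:]

# learn the fill value from the training rows only ...
train_mean = train["weight"].mean()

# ... and reuse it for both parts, so test rows do not influence it
train_filled = train.fillna({"weight": train_mean})
test_filled = test.fillna({"weight": train_mean})
print(train_mean, test_filled["weight"].tolist())
```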

In [None]:
display(data_anemia_red.head())
data_anemia_std_red.head()

In [None]:
def create_sampled_train_test_split(data, label, test_size, under = True):
    X_train,X_test,y_train,y_test = train_test_split(data, data[label],test_size=test_size)
    r = False
    if(under): 
        class_count = np.amin(X_train.groupby(label)[label].count().values)
    else: 
        class_count = np.amax(X_train.groupby(label)[label].count().values)
        r = True
    
    print("Classes before under- or oversampling in train split")
    display(X_train.groupby(label)[label].count())
    
    negative_cases = X_train[X_train[label] == 0].sample(n = class_count, replace = r)
    positive_sample = X_train[X_train[label] == 1].sample(n=class_count, replace = r)
    X_train_balanced = pd.concat([negative_cases, positive_sample])
    X_train_balanced = X_train_balanced.sort_index()
    print("Classes after under- or oversampling")
    display(X_train_balanced.groupby(label)[label].count())
    
    y_train = X_train_balanced.pop(label)
    y_test = X_test.pop(label)
    
    return X_train_balanced, X_test, y_train, y_test

X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_red, label = 'anemia', test_size = 0.2)
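
The balancing step inside `create_sampled_train_test_split` can be illustrated in isolation. This is a compact sketch on invented data that uses `groupby(...).sample(...)` instead of the two separate `sample` calls above; it is meant to show the effect on the class counts, not to replace the function.

```python
import pandas as pd

toy = pd.DataFrame({
    "bmi": [18, 19, 20, 21, 22, 23, 24, 25],
    "anemia": [0, 0, 0, 0, 0, 0, 1, 1],
})

# undersampling: draw as many rows per class as the smallest class has
n_min = toy["anemia"].value_counts().min()
under = toy.groupby("anemia").sample(n=n_min, random_state=0)

# oversampling: draw with replacement up to the size of the largest class
n_max = toy["anemia"].value_counts().max()
over = toy.groupby("anemia").sample(n=n_max, replace=True, random_state=0)

print(under["anemia"].value_counts().to_dict())
print(over["anemia"].value_counts().to_dict())
```

With oversampling, rows of the smaller class appear several times in the training data, which is worth keeping in mind when the same data is later used for cross-validation.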

In [None]:
#Now train a model
def evaluate_model(model, X_train, y_train, X_test, y_test):
    print("Evaluation of", model)
    rf = model.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    confusion_matrix_result = confusion_matrix(y_test.values, y_pred)
    print("Confusion matrix:\n%s" % confusion_matrix_result)
    print(classification_report(y_test, y_pred))
    print("Accuracy: %.2f" % accuracy_score(y_test, y_pred))
    
evaluate_model(RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0), X_train, y_train, X_test, y_test)

In [None]:
#repeat the same on standardized data set
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_std_red, label = 'anemia', test_size = 0.2)
evaluate_model(RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0), X_train, y_train, X_test, y_test)

In [None]:
#try oversampling
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_std_red, label = 'anemia', test_size = 0.2, under = False)
evaluate_model(RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0), X_train, y_train, X_test, y_test)

## 3. Optimization of model and feature selection

Currently, the anemia dataset has 19 features. We should find and keep k best features to use. Let's try with k=13. Since both data_anemia_red and data_anemia_std_red currently include the anemia feature, doing that will give us 12 best features for classifying anemia.
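
An alternative to counting the label as one of the k features is to pass it to `SelectKBest` only as the target. The sketch below shows this on synthetic data; the column names and the generated values are assumptions, and in the notebook the label column would have to be added back afterwards because `create_sampled_train_test_split` expects it in the frame.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
label = np.array([0] * 50 + [1] * 50)
toy = pd.DataFrame({
    "signal": 3 * label + rng.normal(size=100),  # depends on the label
    "noise_a": rng.normal(size=100),             # unrelated to the label
    "noise_b": rng.normal(size=100),             # unrelated to the label
})

# the label is passed as y only, so it cannot take one of the k slots
selector = SelectKBest(score_func=f_classif, k=1).fit(toy, label)
selected = toy.columns[selector.get_support()]
print(list(selected))
```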

In [None]:
select_k_best_classifier = SelectKBest(score_func=f_classif, k=13)

select_k_best_classifier.fit_transform(data_anemia_red, data_anemia_red.anemia)
mask = select_k_best_classifier.get_support()
relevant_columns = data_anemia_red.columns[mask]
display(relevant_columns)

select_k_best_classifier.fit_transform(data_anemia_std_red, data_anemia_std_red.anemia)
mask = select_k_best_classifier.get_support()
relevant_columns_std = data_anemia_std_red.columns[mask]
display(relevant_columns_std)

In [None]:
data_anemia_new = data_anemia_red[relevant_columns]
data_anemia_std_new = data_anemia_std_red[relevant_columns_std]
display(data_anemia_new.head())
display(data_anemia_std_new.head())

Now let's try training some models with the new datasets.

In [None]:
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_new, label = 'anemia', test_size = 0.2)
evaluate_model(RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0), X_train, y_train, X_test, y_test)

In [None]:
#repeat the same on standardized data set
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_std_new, label = 'anemia', test_size = 0.2)
evaluate_model(RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0), X_train, y_train, X_test, y_test)

In [None]:
#try oversampling
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_new, label = 'anemia', test_size = 0.2, under = False)
evaluate_model(RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0), X_train, y_train, X_test, y_test)



In [None]:
#try oversampling on standardized
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_std_new, label = 'anemia', test_size = 0.2, under = False)
evaluate_model(RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0), X_train, y_train, X_test, y_test)


Let's try learning another model. How will a KNN classifier with n=3 perform relative to the random forest on an oversampled dataset?

In [None]:
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_new, label = 'anemia', test_size = 0.2, under = False)
evaluate_model(KNeighborsClassifier(n_neighbors=3), X_train, y_train, X_test, y_test)


That's way better! Let's try performing 10-fold cross-validation on our oversampled anemia dataset to see which n gives us the lowest validation error for KNN.

In [None]:
# odd numbers from 1 to 49
neighbors = list(range(1, 50, 2))
cv_scores = [] # cross-validation scores

for n in neighbors:
    knn = KNeighborsClassifier(n_neighbors=n)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

In [None]:
errors = [1 - x for x in cv_scores]
optimal_n = neighbors[errors.index(min(errors))]
print("The optimal number of neighbors is %d" % optimal_n)

# plot misclassification errors for each n
plt.plot(neighbors, errors)
plt.xlabel('Number of Neighbors')
plt.ylabel('Misclassification Error')
plt.show()


In [None]:
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_new, label = 'anemia', test_size = 0.2, under = False)
evaluate_model(KNeighborsClassifier(n_neighbors=optimal_n), X_train, y_train, X_test, y_test)

In [None]:
# let's try on the standardized dataset as well
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_std_new, label = 'anemia', test_size = 0.2, under = False)
evaluate_model(KNeighborsClassifier(n_neighbors=optimal_n), X_train, y_train, X_test, y_test)

In [None]:
# and on the undersampled datasets
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_new, label = 'anemia', test_size = 0.2)
evaluate_model(KNeighborsClassifier(n_neighbors=optimal_n), X_train, y_train, X_test, y_test)

In [None]:
# standardized dataset with undersampling
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_anemia_std_new, label = 'anemia', test_size = 0.2)
evaluate_model(KNeighborsClassifier(n_neighbors=optimal_n), X_train, y_train, X_test, y_test)


The model that gave us the highest accuracy (84%) was KNN classifier with n = 1 on the oversampled anemia dataset with 12 best features.

# Investigation on Diabetes

## Questions: 

1. How is blood sugar distributed in the general population and among subpopulations (men, women, different districts, strata)?
2. Can we predict gestational diabetes? Pregnant women have a higher risk of developing diabetes.
3. Create a new categorical column for blood sugar, where "1" stands for (pre)diabetes:
    - 1: >= 100 mg/dl
    - 0: < 100 mg/dl

The data is collected as fasting blood sugar, for which the normal range is <100mg/dl.
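
The `diabetes` column in this notebook is built with a plain threshold, which maps a missing glucose value to 0. If the missing values should stay missing in the label, a variation like the following sketch could be used. It is not the code used above, and the toy series is invented.

```python
import numpy as np
import pandas as pd

glucose = pd.Series([90.0, 100.0, 130.0, np.nan])

# 1 for fasting glucose >= 100 mg/dl, 0 below, NaN where nothing was measured
diabetes = (glucose >= 100).astype(float).where(glucose.notna())
print(diabetes.tolist())
```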


## Steps
0. Create a diabetes feature
1. Explore how features act together with glucose
2. Feature selection (Variance filtering)
3. Applying ensemble models on women-specific and men-specific data
4. Gestational Diabetes
5. Overview

## 0. Create a diabetes feature

In [None]:
data_glucose.diabetes.value_counts() #see how imbalanced the data is

- 77893 individuals have a high fasting blood sugar.
- Looking at how each feature correlates to blood sugar values. Using categorical values for depicting blood sugar. Category "1" stands for diabetic or prediabetic based on https://www.mayoclinic.org/diseases-conditions/diabetes/diagnosis-treatment/drc-20371451

## 1. Explore how features act together with glucose


In [None]:
def plot_ft_against_pred_label(data, ft, pred_label):     
    print("These", len(ft),"features are going to be compared to diabetes: \n")
    print(ft)
    # plot pairwise in rows of up to 4 features, covering every feature exactly once
    for start in range(0, len(ft), 4):
        sns.pairplot(data.dropna(), x_vars = ft[start:start + 4], y_vars = [pred_label])
    
#look at undummied data
ft_without_district = [x for x in data.columns if not x.startswith('district')]
ft_without_district.remove('diabetes')
ft_without_district.remove('glucose')

In [None]:
ft_without_district

In [None]:
plot_ft_against_pred_label(data, ft_without_district, 'glucose')

Based on medical knowledge, people with higher weight, BMI and age should have higher blood sugar. These plots do not really show that, so it is hard to draw any conclusion from them.
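When scatter plots are hard to read, a rank correlation can put a number on the same question. The sketch below is only a suggestion and runs on made-up data: the columns and the assumed dependence of glucose on age are invented for the example and say nothing about the survey data.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
toy = pd.DataFrame({"age": rng.uniform(18, 80, 500), "weight": rng.normal(60, 10, 500)})
# made-up glucose that depends on age and not on weight
toy["glucose"] = 85 + 0.5 * toy["age"] + rng.normal(0, 15, 500)

# Spearman correlation of every feature with glucose
corr = toy.corr(method="spearman")["glucose"].drop("glucose")
print(corr)
```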

In [None]:
# getting relative frequency of high blood sugar
def diabetes_relative_freq(data, feature):
    # diabetes is coded 0/1, so the group mean is the share of individuals with high blood sugar
    high = data.groupby(feature)['diabetes'].mean()
    plt.bar(np.arange(1, len(high)+1), high.values)
    plt.ylabel("Relative freq of high blood sugar")
    plt.title(feature)
    plt.show()

### Looking at relative frequency of high blood sugar by different features to find features that might be good indicators of diabetes risk

In [None]:
plt.rcParams['figure.figsize'] = [35, 15]
diabetes_relative_freq(data, "district_code")

Relative frequency of high blood sugar by district. Number of individuals with high blood sugar is normalized by total number of individuals in district. This shows that frequency of high blood sugar might be as low as 10% or as high as 45% depending on the district. So district code might be a good indicator of diabetes risk.
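The normalization described above can be sketched on made-up data. The frame `toy` is hypothetical; the only assumption is that `diabetes` is coded as 0/1.

```python
import pandas as pd

# made-up data: two districts with different shares of high blood sugar
toy = pd.DataFrame({
    "district_code": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "diabetes":      [1, 0, 0, 0, 1, 1, 0, 0, 0],
})

# the mean of a 0/1 column within a group is the share of 1s in that group
rel_freq = toy.groupby("district_code")["diabetes"].mean()
print(rel_freq)
```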

In [None]:
plt.rcParams['figure.figsize'] = [6, 6]
diabetes_relative_freq(data, "stratum")

Urban and rural populations have a similar frequency of high blood sugar, so stratum is not really informative.

In [None]:
plt.rcParams['figure.figsize'] = [6, 6]
diabetes_relative_freq(data, 'sex')

The same holds for the relative frequency in men and women.

## 2. Feature selection with variance filtering

In [None]:
ft_with_district = [x for x in dummied_data.columns if x.startswith('district')]

fig = plt.figure(figsize = (20, 50))

for counter, d in enumerate(ft_with_district): 
    df= dummied_data.loc[dummied_data[d] == 1].glucose.dropna()
    axes= fig.add_subplot(14, 5, 1+ counter)
    axes.hist(df, density = True, bins = 20)
    axes.set_title(d + ' ' +str(df.shape[0]))
plt.subplots_adjust(wspace = 0.5, hspace = 0.5)
plt.show()


Blood glucose seems to be quite similarly distributed in all districts. At least there are no districts that remarkably stand out.

In [None]:
var = []
for dist in ft_with_district:
    df= data_glucose.loc[data_glucose[dist] == 1].glucose.dropna()
    var.append(np.var(df.value_counts().sort_index()))
print("The lowest glucose distribution variance: ", np.min(var))
print("The highest glucose distribution variance ", np.max(var))

ft_district_droppable = []
for d in ft_with_district: 
    df= data_glucose.loc[data_glucose[d] == 1].glucose.dropna()
    if(np.var(df.value_counts().sort_index()))< 2000: 
        ft_district_droppable.append(d)
print("The following districts will be discarded: ", ft_district_droppable)

In [None]:
print(data_glucose.shape[1], "features before in glucose not null data")
print(data_glucose_std.shape[1], "features before in glucose not null std data")
data_glucose.drop(ft_district_droppable, axis = 1, inplace = True)#district related features are named the same
data_glucose_std.drop(ft_district_droppable, axis = 1, inplace = True)
print(data_glucose.shape[1], "features left")
print(data_glucose_std.shape[1], "features left in glucose not null std data")

In [None]:
def plot_cat_against_pred_label(dummied_data, pred_label_list):#the binary feature must be first
    ft_cat_no_distr = [x for x in dummied_data.columns if x not in ft_numeric + ft_with_district + pred_label_list]
    fig = plt.figure(figsize = (20, 50))
    std_dict = {}

    for counter, d in enumerate(ft_cat_no_distr): 
        df= dummied_data.loc[dummied_data[d] == 1][pred_label_list[0]].dropna()
        #append the standard deviation among the pred label relative to the mean of the pred label
        std_dict[d] = (np.std(df.value_counts().sort_index()/np.nanmean(dummied_data[pred_label_list[0]])))
        axes= fig.add_subplot(14, 5, 1+ counter)
        axes.hist(df, density = True, bins = 20)
        axes.set_title(d + ' ' +str(df.shape[0]))
    plt.subplots_adjust(wspace = 0.5, hspace = 0.5)
    plt.show()
    return std_dict


In [None]:
cat_std_dict = plot_cat_against_pred_label(data_glucose, ['glucose', 'diabetes'])

Blood sugar is very similarly distributed across all features. There are some differences in features only applicable to women (remarried, separated, widow, divorced).
Now look at the differences in standard deviation and discard those features that fall below a threshold.
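The idea of scoring each binary feature and keeping only the high-scoring ones can be sketched as below. The helpers `relative_count_std` and `keep_top_share` and the frame `toy` are hypothetical names for this illustration; they mirror the measure used in this notebook (standard deviation of the target's value counts relative to the target mean) but are not the notebook's functions.

```python
import numpy as np
import pandas as pd

def relative_count_std(data, feature, target):
    """Std of the target's value counts among rows with feature == 1, relative to the overall target mean."""
    values = data.loc[data[feature] == 1, target].dropna()
    counts = values.value_counts().sort_index()
    return np.std(counts.to_numpy() / np.nanmean(data[target]))

def keep_top_share(scores, factor=0.75):
    """Keep the features whose score reaches the score at position factor in ascending order."""
    ordered = sorted(scores.values())
    threshold = ordered[min(int(len(ordered) * factor), len(ordered) - 1)]
    return {name: s for name, s in scores.items() if s >= threshold}

toy = pd.DataFrame({
    "glucose": [90, 90, 90, 110, 95, 100, 105, 110],
    "flag_a":  [1, 1, 1, 1, 0, 0, 0, 0],
    "flag_b":  [0, 0, 0, 0, 1, 1, 1, 1],
})
scores = {f: relative_count_std(toy, f, "glucose") for f in ["flag_a", "flag_b"]}
print(scores)
print(keep_top_share(scores, factor=0.5))
```

In this toy frame `flag_b` selects four different glucose values that each occur once, so its counts do not vary and its score is zero.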

In [None]:
def plot_bar_dict(d):
    plt.figure(figsize = (14, 14))
    plt.xticks(rotation = 90)
    plt.bar(d.keys(), [d[y] for y in d.keys() ])
    plt.show()

In [None]:
plot_bar_dict(cat_std_dict)

We can see that there are big differences in the information density that the categorical features give when compared to glucose. We can try to create a new feature summarizing the information given by stratum1 and stratum2 and see how much effect this has on glucose.
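Combining two dummy columns by addition works because one-hot columns are mutually exclusive. A minimal sketch on a made-up `stratum` column (the values 0, 1, 2 are assumptions for the example):

```python
import pandas as pd

toy = pd.DataFrame({"stratum": [0, 1, 2, 1, 0, 2]})
dummies = pd.get_dummies(toy["stratum"], prefix="stratum").astype(int)

# at most one of the two columns is 1 per row, so their sum is again a 0/1 indicator
dummies["stratum_1_2"] = dummies["stratum_1"] + dummies["stratum_2"]
print(dummies)
```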

In [None]:
data_glucose['stratum_1_2'] = data_glucose['stratum_1'] + data_glucose.stratum_2
data_glucose_std['stratum_1_2'] = data_glucose_std['stratum_1'] + data_glucose_std.stratum_2
print(np.nanstd(data_glucose.stratum_1_2.value_counts().sort_index())/np.nanmean(data_glucose.glucose))


Good, this feature has a large variance among glucose. 

In [None]:
def plot_distr_against_pred_label(dummied_data, pred_label_list):#the binary feature must be first
    ft_with_district = [x for x in dummied_data.columns if x.startswith('district')]
    fig = plt.figure(figsize = (20, 50))
    std_dict = {}

    for counter, d in enumerate([x for x in ft_with_district if x not in pred_label_list]): 
        df= dummied_data.loc[dummied_data[d] == 1][pred_label_list[0]].dropna()
        std_dict[d] = (np.std(df.value_counts().sort_index()/np.nanmean(dummied_data[pred_label_list[0]])))
        axes= fig.add_subplot(14, 5, 1+ counter)
        axes.hist(df, density = True, bins = 20)
        axes.set_title(d + ' ' +str(df.shape[0]))
    plt.subplots_adjust(wspace = 0.5, hspace = 0.5)
    plt.show()
    return std_dict

distr_std_dict = plot_distr_against_pred_label(data_glucose, ['glucose', 'diabetes'])

In [None]:
plot_bar_dict(distr_std_dict)

In [None]:
def filter_by_std_prop(data, std_dict, factor = 3/4):
    stds = sorted(std_dict.values())#sort ascending
    ft = list(std_dict.keys())
    #index of the threshold, clipped so that factor = 1 does not run out of range
    std_threshold = stds[min(int(len(stds)*factor), len(stds) - 1)]#roughly 1-factor of all keys will be kept
    for f in ft: 
        if std_dict[f] < std_threshold: 
            std_dict.pop(f)#note: std_dict is modified in place
    data.drop([x for x in ft if x not in std_dict], axis = 1, inplace= True)
    print(data.shape[1], "features left")
#district_signi didnt make much sense here anyway, it wouldve been discarded

In [None]:
filter_by_std_prop(data_glucose, distr_std_dict)


In [None]:
diff = [x for x in data_glucose_std.columns if not np.isin(x,data_glucose.columns)]
dropped_districts = [x for x in diff if x.startswith('district')]
data_glucose_std.drop(dropped_districts, axis = 1, inplace = True)
data_glucose_std.shape, data_glucose.shape

In [None]:
print("Standard deviations of normalized glucose related data")
display(data_glucose_std[['std_'+ x for x in ft_numeric]].describe().loc['std'])
desc = data_glucose_std[['std_'+ x for x in ft_numeric]].describe()
rel_std = pd.DataFrame({'rel_std': np.divide(desc.loc['std'].values, np.subtract(desc.loc['max'].values,desc.loc['min'].values)),
                        'ft': desc.columns})
print("Standard deviations of normalized glucose related data relative to the min-max range of the feature")
display(rel_std)


In [None]:
ft_to_drop = list(rel_std.loc[rel_std.rel_std <0.11].ft)
print("The following features will be dropped because their relative standard deviation is too low: ")
ft_to_drop

In [None]:
if np.isin('std_glucose', ft_to_drop):
    ft_to_drop.remove('std_glucose')
ft_nostd_to_drop = [x[4:] for x in ft_to_drop ]
ft_nostd_to_drop

In [None]:
#This is a different result than in the glucose notebook. Lets see the outcome, because this is more precise
data_glucose_std.drop(ft_to_drop, inplace = True, axis = 1)
data_glucose.drop(ft_nostd_to_drop, inplace = True, axis = 1)

In [None]:
print("Now there are", len(data_glucose.columns), "columns in unscaled data")
print("Now there are", len(data_glucose_std.columns), "columns in scaled and centered data")


## 3. Applying ensemble models on woman-specific and men-specific data

- Try to predict the diabetes variable (1 stands for fasting blood glucose >= 100 mg/dl)

### Over 45 year olds

Results so far:
- linear SVM had 68% accuracy
- random forest with men and women achieved 59% accuracy

First prepare the women's and men's data.
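When subsetting by age and sex, it matters whether rows are removed or only masked. The sketch below contrasts boolean indexing with `DataFrame.where` on a made-up frame; `toy` and its columns are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "age":      [30, 47, 52, 61],
    "male":     [1, 1, 0, 0],
    "pregnant": [0, 0, 0, 0],
    "glucose":  [88, 104, 97, 121],
})

# boolean indexing drops the non-matching rows, so the index only holds people aged 45+
over45 = toy.loc[toy["age"] >= 45]
# DataFrame.where keeps every row and masks the non-matching ones with NaN
masked = toy.where(toy["age"] >= 45)

# women-specific columns are removed from the men's subset only
men_over45 = over45.loc[over45["male"] == 1].drop(columns=["male", "pregnant"])
women_over45 = over45.loc[over45["male"] == 0].drop(columns=["male"])
print(len(over45), len(masked), men_over45.shape, women_over45.shape)
```

The difference is relevant whenever the index of a subset is reused to select the matching rows of a second frame, such as the standardized copy of the data.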

In [None]:
cols_women = ['never_married', 'married_no_gauna', 'married_and_gauna', 'remarried', 'widow', 'divorced', 'separated', 
              'pregnant', 'lactating', 'non_pregnant_non_lactating', 'duration_pregnanacy']
cols_women_std = ['never_married', 'married_no_gauna', 'married_and_gauna', 'remarried', 'widow', 'divorced', 'separated', 
              'pregnant', 'lactating', 'non_pregnant_non_lactating', 'std_duration_pregnanacy']
cols_w = [x for x in cols_women if not np.isin(x,ft_nostd_to_drop)]
cols_w_std = [x for x in cols_women_std if not np.isin(x,ft_to_drop)]
#select rows with boolean indexing; .where would keep all rows (masked with NaN), so the index would not be filtered
data_glucose_over45 = data_glucose.loc[data_glucose['age']>=45]
data_glucose_over45_std = data_glucose_std.loc[data_glucose_over45.index, :]
print(data_glucose_over45.shape)
print(data_glucose_over45_std.shape)
men_glucose_over45 = data_glucose_over45.loc[data_glucose_over45.male == 1].drop(cols_w + ['male', 'female'], axis = 1).dropna()
women_glucose_over45 = data_glucose_over45.loc[data_glucose_over45.male == 0].drop(['male', 'female'], axis = 1).dropna()
men_glucose_over45_std = data_glucose_over45_std.loc[data_glucose_over45_std.male == 1].drop(cols_w_std + ['male', 'female'], axis = 1).dropna()
women_glucose_over45_std = data_glucose_over45_std.loc[data_glucose_over45_std.male == 0].drop(['male', 'female'], axis = 1).dropna()

In [None]:
print(men_glucose_over45.shape,
women_glucose_over45.shape,
men_glucose_over45_std.shape,
women_glucose_over45_std.shape)

In [None]:
fig = plt.figure(figsize = (15, 5))
#plot glucose of postmeno women
axes = fig.add_subplot(1, 2, 1)
axes.fill_between(x = [100, 200], y1= [0.04, 0.04], color = 'lightcoral' )
axes.fill_between(x = [0, 100], y1= [0.04, 0.04], color = 'lightgreen' )
axes.hist(women_glucose_over45.glucose.dropna(), density = True, bins = 20)
axes.set_xlabel('blood glucose [mg/dl]')
axes.set_ylabel('freq')
axes.axvline(100, color = 'red', label = 'critical glucose')
axes.set_title('glucose of elder women [mg/dL], mean ' + str(np.nanmean(women_glucose_over45.glucose)))
#plot glucose of older men
axes = fig.add_subplot(1, 2, 2)
axes.fill_between(x = [100, 200], y1= [0.04, 0.04], color = 'lightcoral' )
axes.fill_between(x = [0, 100], y1= [0.04, 0.04], color = 'lightgreen' )
axes.hist(men_glucose_over45.glucose.dropna(), density = True, bins = 20)
axes.axvline(100, color = 'red', label = 'critical glucose')
axes.set_title('glucose of elder men [mg/dL], mean ' + str(np.nanmean(men_glucose_over45.glucose)))
axes.set_xlabel('blood glucose [mg/dl]')
axes.set_ylabel('freq')
axes.legend()
plt.savefig("elder_women_elder_men_glucose.pdf")
plt.show()
#plt.savefig("elder_women_elder_men_glucose.pdf")

In [None]:
#For random forest standardization is not necessary
X_train, X_test, y_train, y_test = train_test_split(men_glucose_over45.drop(['diabetes', 'glucose'], axis = 1).fillna(0),
                                                    men_glucose_over45.diabetes)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
#predictions for men, initially optimized with gridsearch
#rfc = RandomForestClassifier()
#rfc_param_grid = {'n_estimators': np.arange(160, 170, 1), 
#                  'criterion': ['entropy', 'gini']}
#rfc_grid_search = GridSearchCV(rfc, rfc_param_grid, cv=20, return_train_score=True)
#rfc_grid_search.fit(X_train, y_train)
#print(rfc_grid_search.best_params_)
r_m = RandomForestClassifier(criterion = 'gini', n_estimators = 168).fit(X_train, y_train)
print("Accuracy achieved by Random Forest with criterion gini, 168 estimators: ", 
     accuracy_score(y_test, r_m.predict(X_test)))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(women_glucose_over45.drop(['diabetes', 'glucose'], axis = 1).fillna(0),
                                                    women_glucose_over45.diabetes)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
#predictions for women, parameters were optimized with grid search
#rfc = RandomForestClassifier()
#rfc_grid_search = GridSearchCV(rfc, rfc_param_grid, cv=20, return_train_score=True)
#rfc_grid_search.fit(X_train, y_train)
#print(rfc_grid_search.best_params_)
r_w = RandomForestClassifier(criterion = 'entropy', n_estimators = 164).fit(X_train, y_train)
print("Accuracy achieved by Random Forest with criterion entropy, n_estimators = 164: ", 
     accuracy_score(y_test, r_w.predict(X_test)))

## 4. Gestational diabetes

We try to predict diabetes in pregnant women. This paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4673797/) suggests that the normal upper limit for fasting blood glucose in pregnant women is 92 mg/dl, while the upper limit for the general population is slightly higher (100 mg/dl). I have taken this into account when assigning the categorical feature "diabetes".
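How much the choice of limit changes the label can be seen on a few made-up glucose values. This is only an illustration; the series below is invented and the two rules follow the limits named in the text.

```python
import pandas as pd

glucose = pd.Series([80, 92, 93, 99, 100, 118], name="glucose")  # made-up values in mg/dl

general = (glucose >= 100).astype(int)   # general-population rule used earlier
pregnant = (glucose > 92).astype(int)    # stricter rule for pregnant women

comparison = pd.DataFrame({"glucose": glucose, "limit_100": general, "limit_92": pregnant})
print(comparison)
print(comparison[["limit_100", "limit_92"]].sum())
```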

In [None]:
#first for upper limit 100mg/dl
data_glucose_pregnant = data_glucose.loc[data_glucose.pregnant == 1].drop(['male', 'female'], axis = 1)
data_glucose_pregnant = impute_data(data_glucose_pregnant)
data_glucose_pregnant_std = data_glucose_std.loc[data_glucose_std.pregnant == 1].drop(['male', 'female'], axis = 1)

# I will try imputing instead of dropping all nan rows
data_glucose_pregnant_std = impute_data(data_glucose_pregnant_std)
# centralized age is used for modeling

In [None]:
data_glucose_pregnant_std.diabetes.value_counts()

In [None]:
X_train, X_test, y_train, y_test = create_sampled_train_test_split(data_glucose_pregnant_std.drop(['std_glucose'], axis = 1), label = 'diabetes', test_size = 0.2, under = False)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
#data_glucose_pregnant_std.columns

In [None]:
#X_train, X_test, y_train, y_test = train_test_split(data_glucose_pregnant_std.drop(['diabetes', 'std_glucose'], axis = 1),
                                                    #data_glucose_pregnant_std.diabetes)


In [None]:
rfc = RandomForestClassifier(criterion = 'gini', n_estimators = 162)#was optimized with GridSearch below
#rfc_param_grid = {'n_estimators': np.arange(162, 180, 3), 
#                  'criterion': ['entropy', 'gini']}#
#rfc_grid_search = GridSearchCV(rfc, rfc_param_grid, cv=20, return_train_score=True)
#rfc_grid_search.fit(X_train, y_train)
#print(rfc_grid_search.best_params_)

In [None]:
evaluate_model(rfc, X_train, y_train, X_test, y_test)

These are the results when predicting with a normal upper limit for blood glucose of 100 mg/dl.

In [None]:
#now for upper limit 92mg/dl
data_glucose_pregnant.drop(['diabetes'], axis = 1, inplace = True)
data_glucose_pregnant_std.drop(['diabetes'], axis = 1, inplace = True)
data_glucose_pregnant['diabetes'] = data_glucose_pregnant.glucose.apply(lambda x : 1 if x > 92 else 0)
data_glucose_pregnant_std['diabetes'] = data_glucose_pregnant.diabetes

In [None]:
data_glucose_pregnant_std.diabetes.value_counts()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data_glucose_pregnant_std.drop(['diabetes', 'std_glucose'], axis = 1),
                                                    data_glucose_pregnant_std.diabetes)

#rfc = RandomForestClassifier()
#rfc_grid_search = GridSearchCV(rfc, rfc_param_grid, cv=20, return_train_score=True)
#rfc_grid_search.fit(X_train, y_train)
#print(rfc_grid_search.best_params_)
rfc = RandomForestClassifier(criterion = 'gini', n_estimators = 171)

In [None]:
evaluate_model(rfc, X_train, y_train, X_test, y_test)

In [None]:
imps = rfc.feature_importances_
important_features = [idx for idx in range(len(imps))if imps[idx]>0]
print("There are ", len(important_features), "features used for Random Forest for pregnant women: ")
plt.xticks(rotation='vertical')
#label the bars with the columns the model was trained on; the full frame still contains the dropped label columns
plt.bar(X_train.columns[important_features], imps[important_features])
plt.show()

In [None]:

#predictions for pregnant women, initially optimized with gridsearch
#dtc = DecisionTreeClassifier()
#dtc_param_grid = {'min_samples_split': np.arange(2, 10, 1), 'criterion': ['entropy', 'gini'], "splitter":["best", "random"]}
#dtc_grid_search = GridSearchCV(dtc, dtc_param_grid, cv=20, return_train_score=True)
#dtc_grid_search.fit(X_train, y_train)
#print(dtc_grid_search.best_params_)
dt = DecisionTreeClassifier(criterion = 'gini', splitter = "random", min_samples_split = 7).fit(X_train, y_train)
print("Accuracy achieved by Decision Tree with criterion gini, min_samples_split = 7, random splitter: ", 
     accuracy_score(y_test, dt.predict(X_test)))

In [None]:
#predictions for pregnant women, initially optimized with gridsearch
#knn = KNeighborsClassifier()
#knn_param_grid = {'n_neighbors': np.arange(3,10,1)}
#knn_grid_search = GridSearchCV(knn, knn_param_grid, cv=20, return_train_score=True)
#knn_grid_search.fit(X_train, y_train)
#print(knn_grid_search.best_params_)
kn = KNeighborsClassifier(n_neighbors = 8).fit(X_train, y_train)
print("Accuracy achieved by KNN with 8 neighbors: ", 
     accuracy_score(y_test, kn.predict(X_test)))

In [None]:
#Using standardized data for SVM
X_train, X_test, y_train, y_test = train_test_split(men_glucose_over45_std.drop(['diabetes', 'std_glucose'], axis = 1).fillna(0),
                                                    men_glucose_over45_std.diabetes)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

## 5. Overview

These models aim to predict type 2 diabetes or a prediabetic condition, both of which are characterized by a fasting blood glucose of 100 mg/dl or more. India has the largest population of type 2 diabetics in the world. Most patients are diagnosed at 46-47 years of age (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3920109/). The rising number of diabetic patients in India is thought to be caused by urbanization and excess weight.

Pregnant women also have a higher risk of developing type 2 diabetes. In addition, India has a high prevalence of gestational diabetes (diabetes that develops during pregnancy) compared to the world average (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4673797/).

These studies are the reason why the models focus on predicting high blood glucose specifically in people over 45 and in pregnant women.

**Possible Todos**: 
    ...

# Investigation on Heart Diseases

## Steps
1.  Find relevant features
2. Design an evaluation measure
3. Find a good modeling technique


## 1. Find relevant features

We expect that pulse rate, age, sex and BMI will be indicators of heart disease. As we do not have a feature that tells us whether a person has a heart disease, we will use the pulse rate as the target value. So let's investigate how the pulse rate and the other features interact.

In [None]:
distr_std_dict = plot_distr_against_pred_label(data_heart, ['pulse_rate'])

We can see that in most of the districts the pulse_rate is approximately normally distributed. But there are some districts where it is not, although their sample size is comparable to that of the normally distributed ones. So belonging to a district might have an impact on the pulse rate. So let's filter out all the districts that have a normally distributed pulse_rate.

In [None]:
filter_by_std_prop(data_heart, distr_std_dict)
diff = [x for x in data_heart_std.columns if not np.isin(x,data_heart.columns)]
dropped_districts = [x for x in diff if x.startswith('district')]
data_heart_std.drop(dropped_districts, axis = 1, inplace = True)
data_heart_std.shape, data_heart.shape

In [None]:
cat_std_dict = plot_cat_against_pred_label(data_heart, ['pulse_rate'])

In general, most of the pulse_rates here are normally distributed. There is a clear difference between male and female too. Visually, stratum1 and stratum2 have a very similar distribution, as do all the record_code_iodines. Between the different marital statuses there are visible differences. We could summarize some of these similar features in one feature or omit the ones which are very normally distributed (and have a low variance).

In [None]:
cat_std_dict

This contradicts the visual inspection. Let's try to construct some extra features and compare them to the single features. It turns out that combining stratum1 and stratum2 creates a feature that has a very high standard deviation (relative to the mean pulse rate).

In [None]:
data_heart['stratum_1_2'] = data_heart['stratum_1'] + data_heart.stratum_2
data_heart_std['stratum_1_2'] = data_heart_std['stratum_1'] + data_heart_std.stratum_2
print(np.nanstd(data_heart.stratum_1_2.value_counts().sort_index())/np.nanmean(data_heart.pulse_rate))

Trying the same with significant districts did not deliver a good feature, so we will omit the code here. We will not apply the std filtering for categorical features yet, as we might need some for later use. 

### Standard deviation filtering of numeric features
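The measure used in this subsection is the standard deviation of a feature divided by its min-max range. The sketch below shows it on two made-up columns; the column names and the cutoff 0.11 are illustrative assumptions, not values chosen for the heart data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "std_spread": np.linspace(-1, 1, 100),            # values spread evenly over the range
    "std_concentrated": np.r_[np.zeros(99), 4.0],     # one outlier stretches the range
})

desc = toy.describe()
# standard deviation relative to the min-max range of each feature
rel_std = desc.loc["std"] / (desc.loc["max"] - desc.loc["min"])
low = list(rel_std[rel_std < 0.11].index)
print(rel_std)
print("low relative standard deviation:", low)
```

A feature whose values are concentrated in a small part of its range gets a low score, which is the reason such features are candidates for removal.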

In [None]:
#first we need a dict of standard deviations of the numerical features relative to their min-max range
ft_numeric_std = ['std_'+x for x in ft_numeric]
#build a new list instead of an alias, so that ft_numeric_std keeps 'std_pulse_rate'
ft_numeric_std_heart = [x for x in ft_numeric_std if x != 'std_pulse_rate']
desc = data_heart_std[ft_numeric_std_heart].describe()
desc

In [None]:
std_dict = {}
for c in desc.columns:
    std_dict[c] = desc.loc['std', c]/(desc.loc['max', c]- desc.loc['min', c])
plot_bar_dict(std_dict)

In [None]:
filter_by_std_prop(data_heart_std, std_dict, factor = 0.3)

In [None]:
dropped = [x for x in ft_numeric_std_heart if not np.isin(x, data_heart_std.columns)]
print("The following features have been discarded:")
dropped

In [None]:
data_heart.drop([x[4:] for x in dropped], axis = 1, inplace = True)
data_heart.shape, data_heart_std.shape

In [None]:
cols_w = [x for x in cols_w if np.isin(x, data_heart.columns)]
cols_w_std = [x for x in cols_w_std if np.isin(x, data_heart_std.columns)]#in case a numeric feature was kept
#cols_w, cols_w_std

As a last thing, we can create two different subsets for men and women and discard all female features for the men's data set.

In [None]:
men_data_heart = data_heart.loc[data_heart['male'] == 1].drop(cols_w + ['male', 'female'], axis = 1)
fem_data_heart = data_heart.loc[data_heart['female']== 1].drop(['male', 'female'], axis = 1)
men_data_heart_std = data_heart_std.loc[data_heart_std['male'] == 1].drop(cols_w_std + ['male', 'female'], axis = 1)
fem_data_heart_std = data_heart_std.loc[data_heart_std['female']== 1].drop(['male', 'female'], axis = 1)
print(men_data_heart.shape[1], "features for men")
print(fem_data_heart.shape[1], "features for women")
print(men_data_heart_std.shape[1], "features for men std")
print(fem_data_heart_std.shape[1], "features for women std")
men_data_heart = impute_data(men_data_heart)
fem_data_heart = impute_data(fem_data_heart)
men_data_heart_std = impute_data(men_data_heart_std)
fem_data_heart_std= impute_data(fem_data_heart_std)

## 2. Design an evaluation measure

As we focus on the pulse rate, we need a good estimate of when a pulse rate is dangerous. Therefore, let's set up a classification task to predict whether a person has a normal (0) or high (1, 76 or above) pulse rate.

In [None]:
print("mean pulse rate of women", fem_data_heart.pulse_rate.mean())
print("mean pulse rate of men", men_data_heart.pulse_rate.mean())

It is interesting that the mean pulse rate is already higher than the pulse rate that is considered to increase the risk of heart attack, which is 76 beats per minute for postmenopausal women according to https://www.nhs.uk/news/heart-and-lungs/pulse-predicts-heart-attacks/. As this is a clear threshold, let's select women aged 50 and older, take a look at their heart beat, and perform our classification task on them first. As we will apply Random Forest, we only use the nonstandardized data.
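The selection and labelling step described above can be sketched on made-up data. The frame `toy` is hypothetical; the age limit of 50 and the pulse limit of 76 are the ones named in the text.

```python
import pandas as pd

toy = pd.DataFrame({
    "age":        [34, 51, 58, 63, 70, 49],
    "pulse_rate": [72, 75, 76, 90, 68, 88],
})

# copy the subset so that the new column is not written to a view of the original frame
older = toy.loc[toy["age"] >= 50].copy()
older["pulse_rate_dangerous"] = (older["pulse_rate"] >= 76).astype(int)

share = older["pulse_rate_dangerous"].mean()  # fraction of the positive class
print(older)
print(f"{100 * share:.1f} % of the older group are at or above 76 beats/min")
```

The share of the positive class is also a quick check of how balanced the classification task is.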

In [None]:
fem_data_heart_postmeno = fem_data_heart.loc[fem_data_heart['age'] >= 50]
men_data_heart_older = men_data_heart.loc[men_data_heart['age'] >= 50]

fig = plt.figure(figsize = (15, 5))
#plot pulse rate of postmeno women
axes = fig.add_subplot(1, 2, 1)
axes.fill_between(x = [76, 180], y1= [0.04, 0.04], color = 'lightcoral' )
axes.fill_between(x = [0, 76], y1= [0.04, 0.04], color = 'lightgreen' )
axes.hist(fem_data_heart_postmeno.pulse_rate.dropna(), density = True, bins = 20)
axes.axvline(76, color = 'red', label = 'critical pulse rate')
axes.set_title('pulse rate of elder women, mean ' + str(np.nanmean(fem_data_heart_postmeno.pulse_rate)))
axes.set_xlabel('pulse_rate [beat/min]')
axes.set_ylabel('freq')
#plot pulse rate of older men
axes = fig.add_subplot(1, 2, 2)
axes.fill_between(x = [76, 180], y1= [0.04, 0.04], color = 'lightcoral' )
axes.fill_between(x = [0, 76], y1= [0.04, 0.04], color = 'lightgreen' )
axes.hist(men_data_heart_older.pulse_rate.dropna(), density = True, bins = 20)
axes.axvline(76, color = 'red', label = 'critical pulse rate')
axes.set_title('pulse rate of elder men, mean ' + str(np.nanmean(men_data_heart_older.pulse_rate)))
axes.set_xlabel('pulse_rate [beat/min]')
axes.set_ylabel('freq')
axes.legend()
plt.savefig("elder_women_elder_men_pulse_rate.pdf")
plt.show()


In [None]:
def pulse_danger(x): 
    return (x >= 76) * 1
fem_data_heart_postmeno['pulse_rate_dangerous'] = fem_data_heart_postmeno.pulse_rate.apply(pulse_danger)
men_data_heart_older['pulse_rate_dangerous'] = men_data_heart_older.pulse_rate.apply(pulse_danger)
men_count= men_data_heart_older.pulse_rate_dangerous.sum()
fem_count = fem_data_heart_postmeno.pulse_rate_dangerous.sum()
#multiply by 100 so that the printed number really is a percentage
print(men_count, "older men are affected by increased heart attack risk, that is", 100*men_count/men_data_heart_older.shape[0], "%" )
print(fem_count, "older women are affected by increased heart attack risk, that is", 100*fem_count/fem_data_heart_postmeno.shape[0], "%" )

So there is no need to do oversampling.

In [None]:
men_data_heart_older.shape, fem_data_heart_postmeno.shape

## 3. Find a good modeling technique

The following parameters have been found with Grid Search. 
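For readers unfamiliar with grid search, the sketch below shows how `GridSearchCV` exposes the best parameters and the per-split scores of the best candidate. It runs on synthetic data from `make_classification` with a deliberately small grid and `cv=3` to keep it fast; none of these settings are the ones used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the real features and labels
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {"n_estimators": [10, 20], "criterion": ["entropy", "gini"]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

best = search.best_index_  # row of cv_results_ that belongs to best_params_
split_scores = [search.cv_results_[f"split{i}_test_score"][best] for i in range(3)]
print(search.best_params_)
print(split_scores, np.mean(split_scores))
```

Indexing `cv_results_` with `best_index_` makes sure that the per-split scores belong to the selected parameter combination and not to an arbitrary row of the grid.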

In [None]:
X_train, X_test, y_train, y_test = train_test_split(fem_data_heart_postmeno.drop(['pulse_rate', 'pulse_rate_dangerous'], axis = 1),
                                                    fem_data_heart_postmeno.pulse_rate_dangerous)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
rfc = RandomForestClassifier()
rfc_param_grid = {'n_estimators': np.arange(150, 180, 5), 
                  'criterion': ['entropy', 'gini']}
rfc_grid_search = GridSearchCV(rfc, rfc_param_grid, cv=20, return_train_score=True)
rfc_grid_search.fit(X_train, y_train)
rfc_grid_search.best_params_

In [None]:
print(accuracy_score(y_test, rfc_grid_search.best_estimator_.predict(X_test)))

In [None]:
#to see how homogeneously the best estimator performs across the cv splits
splits = [x for x in list(rfc_grid_search.cv_results_.keys()) if x.endswith('test_score') and x.startswith('split')]
best_rfc_scores = {}
for counter, x in enumerate(splits): 
    #best_index_ is the row of cv_results_ that belongs to best_params_
    best_rfc_scores[counter]= (rfc_grid_search.cv_results_[x][rfc_grid_search.best_index_])
plt.scatter(list(best_rfc_scores.keys()), list(best_rfc_scores.values()))
plt.title('variance in performance dependent on split')

In [None]:
imps = rfc_grid_search.best_estimator_.feature_importances_
important_features_women = [idx for idx in range(len(imps))if imps[idx]>0]
print("There are ", len(important_features_women), "features used for Random Forest for women: ")
plt.xticks(rotation='vertical')
#label the bars with the training columns, the full frame still contains pulse_rate and pulse_rate_dangerous
plt.bar(X_train.columns[important_features_women], imps[important_features_women])
plt.show()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(men_data_heart_older.drop(['pulse_rate', 'pulse_rate_dangerous'], axis = 1),
                                                    men_data_heart_older.pulse_rate_dangerous)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
#rfc = RandomForestClassifier()
#rfc_param_grid = {'n_estimators': np.arange(150, 180, 5), 
#                  'criterion': ['entropy', 'gini']}
#rfc_grid_search = GridSearchCV(rfc, rfc_param_grid, cv=20, return_train_score=True)
#rfc_grid_search.fit(X_train, y_train)
#rfc_grid_search.best_params_
rfc = RandomForestClassifier(criterion = 'gini', n_estimators = 165) #optimized with grid search above

In [None]:
evaluate_model(rfc, X_train, y_train, X_test, y_test)

In [None]:
imps = rfc.feature_importances_
important_features_men = [idx for idx in range(len(imps))if imps[idx]>0]
print("There are ", len(important_features_men), "features used for Random Forest for men: ")
plt.xticks(rotation='vertical')
#label the bars with the training columns, the full frame still contains pulse_rate and pulse_rate_dangerous
plt.bar(X_train.columns[important_features_men], imps[important_features_men])
plt.show()

This shows that it is possible to predict quite precisely whether a person tends to have a dangerous pulse rate, which might indicate a risk of heart disease. More models could be trained and optimized, but this one already gives a good impression of what is possible. Furthermore, it would be interesting to see how precise the predictions of these random forests are if the main features used for prediction (diastolic and systolic blood pressure) are not given.
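The effect of removing a dominant feature can be sketched on made-up data. In the example below the target is driven by `bp_systolic` alone, which is an assumption of the toy data and not a finding about the survey. Pairing the importances with the columns of the training frame keeps every bar attached to the right feature name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = pd.DataFrame({
    "bp_systolic": rng.normal(120, 15, 300),
    "age": rng.uniform(50, 80, 300),
    "noise": rng.normal(size=300),
})
y = (X["bp_systolic"] > 120).astype(int)  # made-up target driven by one feature

full = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# pair each importance with the column it was computed for
imp_full = pd.Series(full.feature_importances_, index=X.columns).sort_values(ascending=False)

# refit without the dominant feature and look at the importances again
X_reduced = X.drop(columns=["bp_systolic"])
reduced = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_reduced, y)
imp_reduced = pd.Series(reduced.feature_importances_, index=X_reduced.columns)

print(imp_full)
print(imp_reduced)
```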

In [None]:
X_train, X_test, y_train, y_test = train_test_split(men_data_heart_older.drop(['pulse_rate', 'pulse_rate_dangerous', 'bp_systolic', 'bp_diastolic'], axis = 1),
                                                    men_data_heart_older.pulse_rate_dangerous)
rfc = RandomForestClassifier(criterion = 'gini', n_estimators = 165) #optimized with grid search above
evaluate_model(rfc, X_train, y_train, X_test, y_test)

In [None]:
imps = rfc.feature_importances_
important_features_men = [idx for idx in range(len(imps))if imps[idx]>0]
print("There are ", len(important_features_men), "features used for Random Forest for men when no blood pressure is given: ")
plt.xticks(rotation='vertical')
#label the bars with the training columns, the full frame still contains the dropped columns
plt.bar(X_train.columns[important_features_men], imps[important_features_men])
plt.show()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(fem_data_heart_postmeno.drop(['pulse_rate', 'pulse_rate_dangerous', 'bp_systolic', 'bp_diastolic'], axis = 1),
                                                    fem_data_heart_postmeno.pulse_rate_dangerous)
#according to the grid search above
rfc = RandomForestClassifier(criterion = 'entropy', n_estimators = 175) #optimized with grid search above
evaluate_model(rfc, X_train, y_train, X_test, y_test)

In [None]:
imps = rfc.feature_importances_
important_features_women = [idx for idx in range(len(imps))if imps[idx]>0]
print("There are ", len(important_features_women), "features used for Random Forest for women when no blood pressure is given: ")
plt.xticks(rotation='vertical')
#label the bars with the training columns, the full frame still contains the dropped columns
plt.bar(X_train.columns[important_features_women], imps[important_features_women])
plt.show()