# Capstone Project: Stroke Risk Prediction - Exploratory Data Analysis #
This is a capstone project for Springboard's data science intensive track. The dataset comes from a data science competition sponsored by McKinsey Analytics and hosted on the Analytics Vidhya platform.
The competition can be found on the [contest page](https://datahack.analyticsvidhya.com/contest/mckinsey-analytics-online-hackathon/).

**Problem Statement:** A chain of hospitals in the US collected an anonymized pool of patient data for stroke classification. Stroke is a critical disease that affects nearly 1 in 20 Americans; it affects the arteries leading to and within the brain. A stroke occurs when a blood vessel that carries oxygen and nutrients to the brain is either blocked by a clot or ruptures. When that happens, part of the brain cannot get the blood (and oxygen) it needs, so brain cells in the affected region die. The client captured several lifestyle, health-monitoring and demographic factors about its anonymized patients. These include age, gender, health-monitoring measurements (e.g., glucose level, body mass index) and lifestyle factors (e.g., smoking status, occupation type).

**Stakeholders:**
Hospital cardiac unit managers, clinicians, insurance companies, and employers of the patient population.

## Part A. Import Cleaned Dataset ##

In [1]:
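# Import all libraries #
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# scikit-learn pieces used by the helper functions defined below
import sklearn.metrics as skm
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
plt.style.use('ggplot')

# Authorship #
__author__ = "Taesun Yoo"
__email__ = "yoots1988@gmail.com"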



In [2]:
# --- 2. Write Out List of Functions --- #
def load_file(file):
    '''load the CSV files as a dataframe'''
    df = pd.read_csv(file)
    return df

def drop_column_by_index(df, var):
    '''drop a column by specified variable'''
    df = df.drop(var, axis=1)
    return df

def join_data(df_train, df_label, key,
              left_index=False, right_index=False):
    '''Merge the feature and label dataframe(s)'''
    # Pass the index flags through instead of silently ignoring them
    df_join = pd.merge(df_train, df_label, how='inner', on=key,
                       left_index=left_index, right_index=right_index)
    return df_join

def clean_data(df):
    '''drop any duplicate based on specific column'''
    clean_df = df.drop_duplicates(subset='id')
    return clean_df

def eda_missing_data(df):
    '''count missing values per column and express them as a percentage of rows'''
    missing_df = pd.DataFrame(df.isnull().sum())
    missing_df.columns = ['count']
    missing_df['pct'] = (missing_df['count']/len(df))*100
    return missing_df

def eda_summary_stat_num(df):
    '''compute summary statistics for numerical variables'''
    df_stat_num = df.describe().T
    df_stat_num = df_stat_num[['count', 'min', 'mean', 'max', '25%', '50%', '75%', 'std']]
    df_stat_num = df_stat_num.sort_values(by='count', ascending=True)
    df_stat_num = pd.DataFrame(df_stat_num)
    return df_stat_num

def eda_summary_stat_cat(df):
    '''compute summary statistics for categorical variables'''
    df_stat_cat = pd.DataFrame(df.describe(include='O').T)
    return df_stat_cat

def compute_outliers(df_stat_num):
    '''compute 1.5*IQR lower/upper bounds (LB, UB) from summary statistics'''
    df_stat_num['IQR'] = df_stat_num['75%'] - df_stat_num['25%']
    df_stat_num['UB'] = df_stat_num['75%'] + 1.5*df_stat_num['IQR']
    df_stat_num['LB'] = df_stat_num['25%'] - 1.5*df_stat_num['IQR']
    df_outliers = df_stat_num[['LB', 'min', 'UB', 'max']]
    return df_outliers

def EDA_plot_correlation(df_EDA):
    '''compute and plot correlation matrix'''
    corr = df_EDA.corr()
    # Create a mask to filter matrix: diagonally
    # np.bool was removed in NumPy 1.24; the built-in bool is equivalent
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    # Matrix Plot:
    fig, ax = plt.subplots(figsize=(7,7))
    cmap = sns.diverging_palette(220,10,as_cmap=True)
    sns.set(font_scale=1.1)
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
                annot=True, square=True, linewidths=.5, fmt=".2f",
                annot_kws={'size':10}, cbar_kws={'shrink':.6})
    plt.xticks(rotation=90)
    plt.yticks(rotation=0)

def encode_categorical_feature(df, var_name, map_name):
    '''encode categorical features into mapping values'''
    df[var_name] = df[var_name].map(map_name)
    return df[var_name]

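def feature_imputer(X, missing_val_format, method, indices):
    '''imputes missing values based on different uni-variate methods'''
    # SimpleImputer replaces the Imputer class removed in scikit-learn 0.22.
    # It always imputes column-wise, so the old axis=0 argument is dropped;
    # pass np.nan (not the string 'NaN') as missing_val_format.
    imputer = SimpleImputer(missing_values=missing_val_format, strategy=method)
    imputer = imputer.fit(X.iloc[:, indices])
    X.iloc[:, indices] = imputer.transform(X.iloc[:, indices])
    return X.iloc[:, indices]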

def convert_data_type(df, var_name, dt_type):
    '''convert data type into specified metadata type'''
    df[var_name] = df[var_name].astype(dt_type)
    return df[var_name]

def split_dataframe(df):
    '''Split dataframe into features and label'''
    X, y = df.iloc[:, :-1], df.iloc[:, -1]
    return X, y

def avg_groupby_data(df, num_var, cat_var, avg_var_name):
    '''perform average group by categorical variable to compute a mean'''
    avg_groupby_val = df.groupby(cat_var)[num_var].mean().sort_values(ascending=False)
    # Take the category labels from the grouped index so each label stays
    # paired with its own mean (df[cat_var].unique() can be in a different order)
    avg_groupby_df = pd.DataFrame({cat_var: avg_groupby_val.index,
                                   avg_var_name: avg_groupby_val.values})
    return avg_groupby_df

def left_join_data(train_df, avg_groupby_df, key=None, left_index=False, right_index=False):
    '''performs left join on train data to average groupby data'''
    joined_df = pd.merge(train_df, avg_groupby_df, how='left', on=key,
                         left_index=left_index, right_index=right_index)
    return joined_df

def one_hot_encode_feature(df, cat_vars=None, num_vars=None):
    '''performs one-hot encoding on all categorical variables and
       combine results with numerical variables '''
    cat_df = pd.get_dummies(df[cat_vars], drop_first=True)
    num_df = df[num_vars].apply(pd.to_numeric)
    return pd.concat([cat_df, num_df], axis=1)

def get_label_data(df, label_var):
    '''separate label from a dataframe'''
    df_label = df[label_var]
    return df_label

def split_data_by_age_group(df, var_name):
    '''split dataframe by age group'''
    df_age_group = pd.DataFrame(df.groupby(var_name)[var_name].count().sort_values(ascending=False))
    df_age_group.columns = ['count']
    df_age_group.index.name = 'age_group'
    return df_age_group

def strata_by_age_group(df, group_name, idx):
    '''stratify dataframe by label group index'''
    df_strata = df[df[group_name] == idx]
    return df_strata

def resample_data_by_group(df, n_samples):
    '''resample data by random replacement'''
    sample_group = resample(df, n_samples=n_samples, random_state=0, replace=True)
    return sample_group

def EDA_feature_importance_plot(model, X, y):
    '''plots the feature importance plot on trained model'''
    model.fit(X, y)
    feat_labels = X.columns
    importances = model.feature_importances_
    indices = np.argsort(importances)[::-1]
    
    plt.bar(range(X.shape[1]), importances[indices], align='center')
    plt.xticks(range(X.shape[1]), feat_labels[indices], rotation=90, fontsize=7)
    plt.xlim(-1, X.shape[1])

def feature_scale_data(X):
    '''Feature scaled data based on standardization'''
    sc_X = StandardScaler()
    X_std = sc_X.fit_transform(X)
    return X_std
    
# Plot confusion matrix (each cell shown as a fraction of all samples)
def plot_confusion_matrix(cm, classes):
    '''plot the confusion matrix of trained model'''
    fig, ax = plt.subplots(figsize=(7,7))
    cm = cm.astype('float')/cm.sum()
    
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    
    fmt='.2f'
    thresh = cm.max()/2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i,j], fmt), ha='center', va='center',
                    color='white' if cm[i,j] > thresh else 'black')
    plt.xlabel('predicted label')
    plt.ylabel('true label')

# Print a summary report of classification metrics
def report_class_summary(model_name, y_act, y_pred):
    '''print accuracy, precision, recall and ROC AUC for a fitted model'''
    print ('Accuracy of ' + model_name + ' is %0.2f'% skm.accuracy_score(y_act, y_pred))
    print ('Precision of ' + model_name + ' is %0.2f'% skm.precision_score(y_act, y_pred))
    print ('Recall of ' + model_name + ' is %0.2f'% skm.recall_score(y_act, y_pred))
    # Note: ROC AUC here is computed from hard class predictions, not probabilities
    print ('ROC score of ' + model_name + ' is %0.2f'% skm.roc_auc_score(y_act, y_pred))

# Compute confusion matrix:
def compute_confusion_matrix(y_act, y_pred):
    '''compute sklearn confusion matrix'''
    cm_model = skm.confusion_matrix(y_act, y_pred)
    return cm_model    

def score_model_roc_auc(model, X_train, y_train, X_val, y_val):
    '''computes the roc_auc score for probability of being a stroke case'''
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_val)
    return skm.roc_auc_score(y_val, probs[:,1])

def model_tuning_param(model, feature_df, label_df, param_dist, n_iter):
    '''performs RandomizedSearchCV to tune model hyper-parameters'''
    # n_iter must be passed by keyword in recent scikit-learn releases
    random_search = RandomizedSearchCV(model, param_dist, n_iter=n_iter, cv=5)
    random_search.fit(feature_df, label_df)
    return random_search

def print_best_param(random_search, param_1=None, param_2=None, param_3=None, param_4=None):
    '''print the best model parameter(s)'''
    best_params = random_search.best_estimator_.get_params()
    # Skip unset names so the None defaults no longer raise a TypeError
    for param in (param_1, param_2, param_3, param_4):
        if param is not None:
            print("Best " + param + ":", best_params[param])

def model_train(model, feature_df, label_df, n_proc, mean_roc_auc, cv_std):
    '''train a model and output mean roc_auc and CV std.dev roc_auc'''
    roc_auc = cross_val_score(model, feature_df, label_df, n_jobs=n_proc,
                               cv=5, scoring='roc_auc')
    mean_roc_auc[model] = np.mean(roc_auc)
    cv_std[model] = np.std(roc_auc)    

def model_summary(model, mean_roc_auc, cv_std):
    '''print out the model performances'''
    print('\nModel:\n', model)
    print('Average roc_auc:\n', mean_roc_auc[model])
    print('Std. Dev during CV:\n', cv_std[model])    

def model_results(model, mean_roc_auc, predictions, feature_importances):
    '''saves the model description, predictions, and feature importances
       (mean_roc_auc is accepted but not written to disk)'''
    with open('model.txt', 'w') as file:
        file.write(str(model))
        feature_importances.to_csv('feat_importances.csv')
        predictions.to_csv('pred_results_best.csv', index=False)

In [3]:
# --- 3. Load the data --- #
# Define input CSVs:
train_file = 'stroke_train.csv'
test_file = 'stroke_test.csv'

# Define type of variables list:
#df_train.select_dtypes(include='object').columns
cat_vars = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

#df_train.select_dtypes(include='int64').columns
#df_train.select_dtypes(include='float64').columns
num_vars = ['hypertension', 'heart_disease', 'age', 'avg_glucose_level', 'bmi']
label_var = 'stroke'

# Define variables to drop
list_vars = 'id'

# Load data
df_train = load_file(train_file)
df_test = load_file(test_file)

# Check the metadata of dataframe:
df_train.info()

# Create a label dataframe:
df_label = df_train[['id', 'stroke']]

# Drop the label column: stroke
df_train = drop_column_by_index(df_train, label_var)

# join train set and label:
train_raw_df = join_data(df_train, df_label, key='id')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43400 entries, 0 to 43399
Data columns (total 12 columns):
id                   43400 non-null int64
gender               43400 non-null object
age                  43400 non-null float64
hypertension         43400 non-null int64
heart_disease        43400 non-null int64
ever_married         43400 non-null object
work_type            43400 non-null object
Residence_type       43400 non-null object
avg_glucose_level    43400 non-null float64
bmi                  41938 non-null float64
smoking_status       30108 non-null object
stroke               43400 non-null int64
dtypes: float64(3), int64(4), object(5)
memory usage: 4.0+ MB
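
The `df_train.info()` output above shows that only `bmi` and `smoking_status` have fewer non-null values than the 43,400 rows. As a rough sketch, the missing counts and percentages can be derived directly from those reported numbers. The variable names below are illustrative and are not part of the notebook.

```python
import pandas as pd

# Row count and non-null counts copied from the df_train.info() output
n_rows = 43400
non_null = pd.Series({'bmi': 41938, 'smoking_status': 30108})

# Same layout as eda_missing_data: a count column and a percentage column
missing = pd.DataFrame({'count': n_rows - non_null})
missing['pct'] = (missing['count'] / n_rows * 100).round(2)
print(missing)
```

This gives 1,462 missing `bmi` values (about 3.37%) and 13,292 missing `smoking_status` values (about 30.63%), which suggests the two columns may need different handling.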


## Part B. Exploratory Data Analysis ##

**Stroke Patients Data: training set** 

Data exploration is conducted on the cleaned training set. The main goal of this phase is to explore relationships among features and identify which features are good predictors of the stroke label.

The following guiding questions are asked:

- Can I count something interesting?
- Can I find some trends (increases, decreases, or anomalies)?
- Can I plot a bar chart or a histogram?
- Can I make a scatter plot?

These guiding questions will help us explore insights and tell a compelling story about the stroke dataset.
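
As a minimal sketch, each guiding question maps onto a short pandas or matplotlib call. The frame below is synthetic: the column names mirror the training set, but every value, the 2% stroke rate, and the age bands are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the training set (hypothetical values only)
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    'age': rng.uniform(1, 82, size=n).round(),
    'smoking_status': rng.choice(
        ['never smoked', 'formerly smoked', 'smokes'], size=n),
    'avg_glucose_level': rng.normal(105, 40, size=n).clip(55, 290),
})
toy['stroke'] = (rng.random(n) < 0.02).astype(int)

# Count something: class balance of the label
label_counts = toy['stroke'].value_counts()

# Find a trend: stroke rate per (assumed) age band
toy['age_band'] = pd.cut(toy['age'], bins=[0, 20, 40, 60, 100])
rate_by_age = toy.groupby('age_band', observed=True)['stroke'].mean()

# Bar chart, histogram, and scatter plot side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
toy['smoking_status'].value_counts().plot(kind='bar', ax=axes[0])
axes[0].set_title('Patients by smoking status')
axes[1].hist(toy['avg_glucose_level'], bins=30)
axes[1].set_title('Average glucose level')
axes[2].scatter(toy['age'], toy['avg_glucose_level'], s=5, alpha=0.4)
axes[2].set_title('Glucose level vs. age')
fig.tight_layout()
```

On the real data, `toy` would be replaced by the joined training frame, and the label counts would show how imbalanced the stroke class is.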