# Taqiya Ehsan
# Programming Exercise \#1
---

# Preamble

In [None]:
# optional code cell when using Google Colab with Google Drive

# mount Google Drive in Google Colab
from google.colab import drive
drive.mount('/content/drive')

# change directory using the magic command %cd
### replace [MY PATH] below with your own path in Google Drive ###
%cd /content/drive/My\ Drive/ML/ProgrammingAssignment1

Mounted at /content/drive
/content/drive/My Drive/ML/ProgrammingAssignment1


In [None]:
# import relevant Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats as sps
from IPython.display import display, Latex

# **1. Fetal Health Classification Dataset**

## **Clean Dataset**

### Problem 1.1

In [None]:
#@title
# load the clean dataset csv file into a pandas dataframe
fetal_clean_df = pd.read_csv('fetal_health_dataset_clean.csv')
fetal_clean_df

#### (a)

This is a **supervised machine learning task**. The dataset contains Cardiotocogram exam features that may help infer fetal health; these are the independent variables. Each sample, i.e. each combination of Cardiotocogram exam features, is also labeled with one of the three fetal health classifications, which is the dependent variable. Because the model is trained on both the features and their fetal health labels, this is supervised learning.

#### (b)

In [None]:
#@title
### Your code for 1.1(b) goes here ###
print(f'dtypes: {fetal_clean_df.dtypes}')
print(f'axes: {fetal_clean_df.axes}')

#### (c)



In [None]:
#@title
### Your code for 1.1(c) goes here ###
fetal_clean_df.head(10)

#### (d)

In [None]:
#@title
### Your code for 1.1(d) goes here ###
fetal_clean_df.shape # (row, column)

#### (e)

Each row is a sample

#### (f)

2126 samples

#### (g)

21 independent variables:
'baseline value', 'accelerations', 'fetal_movement', 'uterine_contractions', 'light_decelerations', 'severe_decelerations', 'prolongued_decelerations', 'abnormal_short_term_variability', 'mean_value_of_short_term_variability', 'percentage_of_time_with_abnormal_long_term_variability', 'mean_value_of_long_term_variability', 'histogram_width', 'histogram_min', 'histogram_max', 'histogram_number_of_peaks', 'histogram_number_of_zeroes', 'histogram_mode', 'histogram_mean', 'histogram_median', 'histogram_variance', 'histogram_tendency'

#### (h)

1 dependent variable: 'fetal_health'

#### (i)

(2126, 21)

#### (j)

(2126, 1)
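
The two shapes above follow from splitting the dataframe into a feature matrix and a label column. A minimal sketch of that split, using a tiny made-up frame (the values and the names `toy_df`, `X`, `y` are illustrative assumptions, not the actual dataset):

```python
import pandas as pd

# Tiny made-up frame: two feature columns plus the label column.
toy_df = pd.DataFrame({
  "baseline value": [120.0, 132.0, 133.0],
  "accelerations": [0.000, 0.006, 0.003],
  "fetal_health": [2, 1, 1],
})

# Independent variables: every column except the label.
X = toy_df.drop(columns=["fetal_health"])
# Dependent variable: the label column, kept two-dimensional.
y = toy_df[["fetal_health"]]

print(X.shape)  # (3, 2)
print(y.shape)  # (3, 1)
```

With the real dataframe the same split would give 21 feature columns and 1 label column.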

#### (k)

They have been pre-processed. Raw data would not be as organized and concise as the data provided in the csv file. Moreover, there are no inconsistent or NaN entries in it, which would have been present if this were raw data. Every sample has also been reviewed by expert obstetricians and assigned a consensus classification label. Hence, this is pre-processed data.

#### (l)

In [None]:
#@title
for col in fetal_clean_df.columns:
  print(f'{col}')
  print(fetal_clean_df[f'{col}'].unique())

# fetal_clean_df['fetal_health'].unique()
# fetal_clean_df['fetal_health'].hist()
# plt.show()

**"fetal_health"** is the only categorical variable in this dataset, as it is the only one whose values define groups. All the other variables in the dataframe are numerical: each is a measurement of some feature, and mathematical operations on them are meaningful. Although "fetal_health" is expressed as numbers in the dataframe, each number is actually a category, and the samples can be grouped into these categories based on the rest of the numeric/non-categorical variables.

#### (m)

The categorical variable in this dataset follows **label encoding**: each category corresponds to an integer value representing that label. So the integer 1 in "fetal_health" represents the category "normal," 2 represents "suspect" and 3 represents "pathological."
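
As a small illustration of label encoding, the integers can be mapped back to their category names. This is only a sketch: the `label_names` dictionary and the sample labels are assumptions written out from the description above, not something stored in the csv file.

```python
import pandas as pd

# Integer label -> category name, as described in the text.
label_names = {1: "normal", 2: "suspect", 3: "pathological"}

# Made-up label column for illustration.
labels = pd.Series([1, 3, 2, 1], name="fetal_health")
decoded = labels.map(label_names)

print(decoded.tolist())  # ['normal', 'pathological', 'suspect', 'normal']
```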


#### (n)

In [None]:
#@title
fetal_clean_df.loc[:, "fetal_health"].value_counts()

## **Dirty Dataset**

In [None]:
#@title
# load the dirty dataset csv file into a pandas dataframe
fetal_dirty_df = pd.read_csv('fetal_health_dataset_dirty.csv')
fetal_dirty_df.columns

### Problem 1.2

#### (a)

In [None]:
#@title
print(fetal_dirty_df.isna().any())
print(f'Total NaN entries: {fetal_dirty_df.isna().sum().sum()}')

#### (b)

In [None]:
#@title
fetal_dirty_df.columns[fetal_dirty_df.isna().any()]

Index(['baseline value',
       'percentage_of_time_with_abnormal_long_term_variability',
       'histogram_max', 'histogram_mode'],
      dtype='object')

#### (c)

In [None]:
#@title
fetal_dirty_df.shape[0]-fetal_dirty_df.dropna().shape[0]

86
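
Note that the number of rows containing a NaN is generally not the same as the total number of NaN entries, because one row can hold several. A minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
  "a": [1.0, np.nan, 3.0],
  "b": [np.nan, np.nan, 6.0],
})

# Rows lost when dropping every row that has at least one NaN.
rows_dropped = toy.shape[0] - toy.dropna().shape[0]
# Same count, computed directly from a row-wise NaN mask.
rows_with_nan = int(toy.isna().any(axis=1).sum())
# Total number of NaN cells in the frame.
total_nan = int(toy.isna().sum().sum())

print(rows_dropped, rows_with_nan, total_nan)  # 2 2 3
```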

### Problem 1.3

In [None]:
#@title
# work on a copy so the raw dirty dataframe stays untouched
processed_fetal_dirty_df = fetal_dirty_df.copy()
# a fetal heart rate baseline cannot be negative
processed_fetal_dirty_df.loc[processed_fetal_dirty_df["baseline value"] < 0, "baseline value"] = np.nan
# the label must be one of the three classes 1, 2, 3
processed_fetal_dirty_df.loc[~processed_fetal_dirty_df["fetal_health"].isin([1, 2, 3]), "fetal_health"] = np.nan
# per-second rates: negative or implausibly high (> 0.5) values are invalid
processed_fetal_dirty_df.loc[(processed_fetal_dirty_df["fetal_movement"] < 0) | (processed_fetal_dirty_df["fetal_movement"] > 0.5), "fetal_movement"] = np.nan
processed_fetal_dirty_df.loc[(processed_fetal_dirty_df["uterine_contractions"] < 0) | (processed_fetal_dirty_df["uterine_contractions"] > 0.5), "uterine_contractions"] = np.nan
# a percentage must lie in [0, 100]
processed_fetal_dirty_df.loc[(processed_fetal_dirty_df["percentage_of_time_with_abnormal_long_term_variability"] < 0) | (processed_fetal_dirty_df["percentage_of_time_with_abnormal_long_term_variability"] > 100), "percentage_of_time_with_abnormal_long_term_variability"] = np.nan
# mean variability values cannot be negative
processed_fetal_dirty_df.loc[processed_fetal_dirty_df["mean_value_of_short_term_variability"] < 0, "mean_value_of_short_term_variability"] = np.nan
processed_fetal_dirty_df.loc[processed_fetal_dirty_df["mean_value_of_long_term_variability"] < 0, "mean_value_of_long_term_variability"] = np.nan

processed_fetal_dirty_df.iloc[91:101]

* The baseline value of the fetal heart rate cannot be negative.
* The fetal health value cannot be anything other than 1, 2, or 3.
* The average movement rate is 0.003/sec, so any negative or very high value (above 0.5) is inconsistent.
* General uterine contractions occur at 0.007-0.008 contractions per second, so high values (above 0.5) and negative values are inconsistent.
* The mean short-term and long-term variability values cannot be negative.
* The percentage of time with abnormal long-term variability cannot be greater than 100 or less than 0.
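
Rules like these can also be kept in one table and applied in a loop. The sketch below is one possible way to do that, not the method used in this notebook: the helper `mask_out_of_range`, the `valid_ranges` table, and the toy values are all hypothetical, and only three of the rules are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical rule table: column -> (lowest valid value, highest valid value).
# None means there is no bound on that side.
valid_ranges = {
  "baseline value": (0, None),
  "fetal_movement": (0, 0.5),
  "percentage_of_time_with_abnormal_long_term_variability": (0, 100),
}

def mask_out_of_range(df, ranges):
  """Return a copy of df with out-of-range entries replaced by NaN."""
  out = df.copy()
  for col, (low, high) in ranges.items():
    bad = pd.Series(False, index=out.index)
    if low is not None:
      bad |= out[col] < low
    if high is not None:
      bad |= out[col] > high
    out.loc[bad, col] = np.nan
  return out

# Made-up rows: each column holds exactly one invalid entry.
toy = pd.DataFrame({
  "baseline value": [120.0, -5.0, 140.0],
  "fetal_movement": [0.0, 0.003, 0.9],
  "percentage_of_time_with_abnormal_long_term_variability": [43.0, 150.0, 0.0],
})
cleaned = mask_out_of_range(toy, valid_ranges)
print(cleaned.isna().sum())
```

Keeping the bounds in a dictionary makes it easier to compare them against the written justification for each rule.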



### Problem 1.4

In [None]:
#@title
non_cat_cols = list(processed_fetal_dirty_df.columns[0:21])
processed_fetal_dirty_df.iloc[:, 0:21] = sps.zscore(processed_fetal_dirty_df[non_cat_cols], nan_policy='omit')
processed_fetal_dirty_df.head(20)
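
For reference, the z-score of a value is (value - column mean) / column standard deviation. The sketch below computes it by hand with pandas on a made-up column; it assumes the population standard deviation (`ddof=0`, which is also the default of `scipy.stats.zscore`) and relies on pandas skipping NaN entries when computing the mean and standard deviation, which mirrors `nan_policy='omit'`.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing entry.
col = pd.Series([1.0, 2.0, np.nan, 3.0])

# pandas ignores NaN in mean() and std(); ddof=0 gives the population std.
z = (col - col.mean()) / col.std(ddof=0)

print(z.round(4).tolist())
```

After standardization the non-missing entries have mean 0 and standard deviation 1, and the NaN stays NaN.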

### Problem 1.5

In [None]:
#@title
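for col in processed_fetal_dirty_df.columns:
  med = processed_fetal_dirty_df[col].median()
  # assign the result back instead of calling fillna(inplace=True) on a column
  # selection, which can act on a temporary copy and leave the dataframe unchanged
  processed_fetal_dirty_df[col] = processed_fetal_dirty_df[col].fillna(med)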

# processed_fetal_dirty_df.iloc[91:101]
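
Median imputation can also be written without a loop, since `DataFrame.fillna` accepts a Series of per-column fill values. A minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
  "a": [1.0, np.nan, 3.0, 10.0],
  "b": [np.nan, 2.0, 2.0, 4.0],
})

# toy.median() is a Series indexed by column name (NaN entries are skipped),
# so each column is filled with its own median.
filled = toy.fillna(toy.median())

print(filled)
```

Here column `a` is filled with 3.0 and column `b` with 2.0.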

### Problem 1.6

In [None]:
#@title
fetal_dirty_df_onehot = processed_fetal_dirty_df.copy()
# one-hot encode the cleaned labels (not the raw dirty ones), computed once
fetal_health_dummies = pd.get_dummies(processed_fetal_dirty_df["fetal_health"])
fetal_dirty_df_onehot["f0"] = fetal_health_dummies[1]  # Normal       --> (f0, f1, f2) = (1, 0, 0)
fetal_dirty_df_onehot["f1"] = fetal_health_dummies[2]  # Suspect      --> (f0, f1, f2) = (0, 1, 0)
fetal_dirty_df_onehot["f2"] = fetal_health_dummies[3]  # Pathological --> (f0, f1, f2) = (0, 0, 1)
fetal_dirty_df_onehot.to_csv("fetal_health_dataset_processed.csv")
fetal_dirty_df_onehot.head(10)
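
To make the encoding concrete, here is a minimal sketch of what `pd.get_dummies` produces for a label column. The four labels are made up, and `dtype=int` is used only so the result prints as 0/1 rather than booleans.

```python
import pandas as pd

# Made-up labels: normal, suspect, pathological, normal.
labels = pd.Series([1, 2, 3, 1], name="fetal_health")

# One indicator column per distinct label, in sorted label order.
onehot = pd.get_dummies(labels, dtype=int)
onehot.columns = ["f0", "f1", "f2"]

print(onehot.values.tolist())  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each row contains exactly one 1, in the column that corresponds to its class.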

# **2. Heart Failure Prediction Dataset**

## Problem 2.1

In [None]:
# load the dataset csv file into a pandas dataframe
heart_df = pd.read_csv('heart_failure_dataset.csv')
heart_df.shape

(299, 13)

### (a)

This is a **supervised machine learning task**. The dataset contains features that may help infer a person's cardiovascular health; these are the independent variables. Each sample, i.e. each combination of cardiac health features, is also labeled with a death event, either "died" or "didn't die", which is the dependent variable. Because the model is trained on both the cardiac health features and the death event labels, this is supervised learning.



### (b)

In [None]:
print(f'dtypes: {heart_df.dtypes}')
print(f'axes: {heart_df.axes}')

### (c)



In [None]:
heart_df.head(10)

### (d)

In [None]:
heart_df.shape

### (e)

299 samples

### (f)

12 independent variables:
'age', 'anaemia', 'creatinine_phosphokinase', 'diabetes', 'ejection_fraction', 'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time'

### (g)

1 dependent variable: DEATH_EVENT

### (h)

(299, 12)

### (i)

(299, 1)

### (j)

In [None]:
### Your code for 2.1(j) goes here ###
print(heart_df.nunique())
for col in heart_df.columns:
  print(f'{col}')
  print(heart_df[f'{col}'].unique())

In this dataframe, **anaemia, diabetes, high_blood_pressure, sex, smoking, DEATH_EVENT** are categorical variables. Each of them groups the samples into categories, like anaemic/non-anaemic, male/female, smoker/non-smoker, diabetic/non-diabetic, etc.

### (k)

anaemia, diabetes, high_blood_pressure, sex, smoking, DEATH_EVENT are **binary encoded**.

### (l)

In [None]:
heart_df.loc[:,"DEATH_EVENT"].value_counts()

0    203
1     96
Name: DEATH_EVENT, dtype: int64

### (m)

In [None]:
heart_df.loc[:,"sex"].value_counts()

1    194
0    105
Name: sex, dtype: int64

### (n)

In [None]:
heart_df.loc[:,"smoking"].value_counts()

0    203
1     96
Name: smoking, dtype: int64

## Problem 2.2

In [None]:
processed_heart_df = heart_df.copy()

# non-categorical variables: a value is invalid if it is below the lower bound OR above the upper bound
processed_heart_df.loc[(processed_heart_df["platelets"] < 50000) | (processed_heart_df["platelets"] > 1000000), "platelets"] = np.nan
processed_heart_df.loc[(processed_heart_df["ejection_fraction"] < 0) | (processed_heart_df["ejection_fraction"] > 100), "ejection_fraction"] = np.nan
processed_heart_df.loc[(processed_heart_df["serum_sodium"] < 50) | (processed_heart_df["serum_sodium"] > 500), "serum_sodium"] = np.nan
processed_heart_df.loc[(processed_heart_df["creatinine_phosphokinase"] < 0) | (processed_heart_df["creatinine_phosphokinase"] > 200), "creatinine_phosphokinase"] = np.nan
# follow-up time cannot be negative
processed_heart_df["time"] = processed_heart_df["time"].apply(lambda x: x if x >= 0 else np.nan)

# categorical variables: only 0 and 1 are valid
processed_heart_df["anaemia"] = processed_heart_df["anaemia"].apply(lambda x: x if x in [0,1] else np.nan)
processed_heart_df["diabetes"] = processed_heart_df["diabetes"].apply(lambda x: x if x in [0,1] else np.nan)
processed_heart_df["high_blood_pressure"] = processed_heart_df["high_blood_pressure"].apply(lambda x: x if x in [0,1] else np.nan)
processed_heart_df["sex"] = processed_heart_df["sex"].apply(lambda x: x if x in [0,1] else np.nan)
processed_heart_df["smoking"] = processed_heart_df["smoking"].apply(lambda x: x if x in [0,1] else np.nan)
processed_heart_df["DEATH_EVENT"] = processed_heart_df["DEATH_EVENT"].apply(lambda x: x if x in [0,1] else np.nan)

processed_heart_df.head(20)

* All the categorical variables have to be either 0 or 1.
* ejection_fraction cannot be less than 0 or more than 100, as it is a percentage.
* The average platelet count is between 150,000 and 450,000, so values that are far too low (<50,000) or far too high (>1,000,000) are inconsistent.
* serum_sodium levels are 135 to 145 milliequivalents per liter, so anything too high (>500) or too low (<50) is inconsistent.
* The average level of creatinine phosphokinase is 10 to 120 micrograms per liter of blood, so negative or too high (>200) values are inconsistent.
* time cannot be negative.
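
Each range rule above joins two comparisons. A value is invalid if it is below the lower bound or above the upper bound, so the two masks should be combined with `|`. Combining them with `&` asks for a value that is both too low and too high at once, which never happens. A minimal sketch with made-up platelet counts:

```python
import pandas as pd

# Made-up platelet counts: too low, plausible, too high.
platelets = pd.Series([25000.0, 263000.0, 1200000.0])

# AND: no value can be below 50,000 and above 1,000,000 at the same time.
both = (platelets < 50000) & (platelets > 1000000)
# OR: flags values outside the range on either side.
either = (platelets < 50000) | (platelets > 1000000)

print(both.tolist())    # [False, False, False]
print(either.tolist())  # [True, False, True]
```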

## Problem 2.3

In [None]:
### Your code for 2.3 goes here ###
non_cat_cols = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']
# replace the columns outright so integer columns can become float z-scores
processed_heart_df[non_cat_cols] = sps.zscore(processed_heart_df[non_cat_cols], nan_policy='omit')
processed_heart_df.head(20)

## Problem 2.4

In [None]:
### Your code for 2.4 goes here ###
heart_df_onehot = processed_heart_df.copy()
# one-hot encode the cleaned labels, computed once
death_event_dummies = pd.get_dummies(processed_heart_df["DEATH_EVENT"])
heart_df_onehot["D0"] = death_event_dummies[0]  # did not die --> (D0, D1) = (1, 0)
heart_df_onehot["D1"] = death_event_dummies[1]  # died        --> (D0, D1) = (0, 1)
heart_df_onehot.to_csv("heart_failure_dataset_processed.csv")

## Problem 2.5

In [None]:
heart_df.corr()["DEATH_EVENT"]

age                         0.253729
anaemia                     0.066270
creatinine_phosphokinase   -0.039709
diabetes                   -0.001943
ejection_fraction          -0.268603
high_blood_pressure         0.079351
platelets                  -0.020900
serum_creatinine            0.294278
serum_sodium               -0.177677
sex                        -0.004316
smoking                    -0.012623
time                       -0.526964
DEATH_EVENT                 1.000000
Name: DEATH_EVENT, dtype: float64

In [None]:
plt.matshow(heart_df.corr())
plt.show()

### (a)

serum_creatinine (pairwise correlation: 0.294278) and age (pairwise correlation: 0.253729)

### (b)

time (pairwise correlation: -0.526964) and ejection_fraction (pairwise correlation: -0.268603)
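
The answers to (a) and (b) can be read off programmatically rather than by eye. The sketch below re-enters the correlation values printed above as a Series (in the notebook this would simply be `heart_df.corr()["DEATH_EVENT"]`), drops the label's correlation with itself, and picks the two largest and two smallest entries.

```python
import pandas as pd

# Correlations with DEATH_EVENT, copied from the output above.
corr = pd.Series({
  "age": 0.253729,
  "anaemia": 0.066270,
  "creatinine_phosphokinase": -0.039709,
  "diabetes": -0.001943,
  "ejection_fraction": -0.268603,
  "high_blood_pressure": 0.079351,
  "platelets": -0.020900,
  "serum_creatinine": 0.294278,
  "serum_sodium": -0.177677,
  "sex": -0.004316,
  "smoking": -0.012623,
  "time": -0.526964,
  "DEATH_EVENT": 1.000000,
})

# The label always correlates perfectly with itself, so exclude it.
features = corr.drop("DEATH_EVENT")

print(features.nlargest(2).index.tolist())   # ['serum_creatinine', 'age']
print(features.nsmallest(2).index.tolist())  # ['time', 'ejection_fraction']
```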

### (c)

The second-most positively correlated variable with DEATH_EVENT is **age** (pairwise correlation: 0.253729). The logical reasoning behind this correlation could be that as people age, their heart condition tends to deteriorate, making death from heart-related causes more and more likely. 

### (d)

The two most negatively correlated variables with DEATH_EVENT are **time** (pairwise correlation: -0.526964) and **ejection_fraction** (pairwise correlation: -0.268603). For time, a likely explanation is that it records the length of the follow-up period: a patient who dies leaves the study early, so death events tend to go with short follow-up times, while survivors accumulate long ones. For ejection_fraction, which measures how much blood the heart pumps out, a low value indicates a weaker heart, so lower values tend to go with death events. Monitoring ejection fraction during follow-ups could therefore let a doctor act against heart failure sooner, decreasing the chances of death.