# Exam - Introduction to Machine Learning

The goal of this exam is to perform an analysis on data related to heart disease.
We would like to explore the relationship between a `target` variable (whether a patient has heart disease or not) and several other variables such as cholesterol level, age, ...

The dataset is stored in the file `'heartData_simplified.csv'`, which is a cleaned and simplified version of the [UCI heart disease data set](https://archive.ics.uci.edu/ml/datasets/heart+Disease).

You should explore the dataset and answer the questions of the exam in a Jupyter notebook, with comments explaining what you did.
You should send your code to **wandrille.duchemin [at] unibas.ch**.



## Description of the columns

* age : Patient age in years
* sex : Patient sex
* chol : Cholesterol level in mg/dl.
* thalach : Maximum heart rate during the stress test
* oldpeak : ST-segment depression induced by exercise, relative to rest.
* ca : Number of main blood vessels coloured by the radioactive dye. The number ranges from 0 to 3.
* thal : Results of the blood flow observed via the radioactive dye.
	* defect -> fixed defect (no blood flow in some part of the heart)
	* normal -> normal blood flow
	* reversible -> reversible defect (a blood flow is observed but it is not normal)
* target : Whether the patient has heart disease or not

**code to read the data**

In [1]:
import pandas as pd

# the file is comma-separated, so read_csv can be used with its default separator
df = pd.read_csv('heartData_simplified.csv')
df.head()
print(len(df))

296


You can see that we have several categorical variables in the dataset: `sex`, `thal`, and our response variable, `target`.

If taken as-is, these variables may create problems in ML models, which generally expect numeric inputs.
So we will first transform them into sets of columns containing only 0 and 1.

In [2]:
# in this first line, I make sure that "normal" becomes the default (reference) value for the thal column
df['thal'] = df['thal'].astype(pd.CategoricalDtype(categories=["normal", "reversible", "defect"], ordered=True))

# get_dummies will transform these categorical columns into sets of 0/1 columns;
# drop_first=True drops the column of the first (default) level of each variable
df = pd.get_dummies(df, drop_first=True)

df.head()

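| Unnamed: 0 | age | chol | thalach | oldpeak | ca | sex_male | thal_reversible | thal_defect | target_yes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 233 | 150 | 2.3 | 0 | 1 | 0 | 1 | 0 |
| 1 | 37 | 250 | 187 | 3.5 | 0 | 1 | 0 | 0 | 0 |
| 2 | 41 | 204 | 172 | 1.4 | 0 | 0 | 0 | 0 | 0 |
| 3 | 56 | 236 | 178 | 0.8 | 0 | 1 | 0 | 0 | 0 |
| 4 | 57 | 354 | 163 | 0.6 | 0 | 0 | 0 | 0 | 0 |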


So now, we can see that `sex` has been replaced by `sex_male`, a column where 1 means male and 0 means female. Female is therefore the default value, and in subsequent models the parameter associated with this column will represent the effect of being male.

Similarly, `target` has been replaced with `target_yes`: 0 means no heart disease, 1 means heart disease.

Finally, `thal` has been replaced by 2 columns: this is how categorical variables with more than 2 levels are represented. A 1 in one of these two columns means that the defect is reversible or fixed (depending on which column holds the 1).
When `thal_reversible` and `thal_defect` are both 0, it means that blood flow is normal, the "default" value (note that there is no case where they are both 1).
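
To make this encoding concrete, here is a minimal sketch on a made-up 4-row table (the `toy` DataFrame and its values are invented for illustration and are not taken from the exam data). It applies the same two steps as above and checks that the two `thal` columns are never both 1. The `dtype=int` argument is only there so that the output shows 0/1 regardless of the pandas version.

```python
import pandas as pd

# Invented toy data: only the two categorical covariates, 4 hypothetical patients
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "thal": ["normal", "reversible", "defect", "normal"],
})

# Put "normal" first so that it becomes the dropped (default) level
thal_levels = pd.CategoricalDtype(
    categories=["normal", "reversible", "defect"], ordered=True
)
toy["thal"] = toy["thal"].astype(thal_levels)

# One 0/1 column per non-default level of each categorical variable
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)

# A patient has exactly one thal level, so both columns cannot be 1 at once
both_set = (encoded["thal_reversible"] == 1) & (encoded["thal_defect"] == 1)
print("rows with both thal columns at 1:", int(both_set.sum()))
```

A row with 0 in both `thal_reversible` and `thal_defect` corresponds to the default level, `normal`.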


**Instructions**

Your mission is to implement a machine learning pipeline in order to predict `target_yes`.

Try to choose a relevant approach regarding:
 * the split of your data into train and test sets
 * the metric you would like to optimize
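
As an illustration of the first point, here is one possible way to split data while preserving the class balance. This is a sketch only: it uses synthetic data from `make_classification` as a stand-in for the exam dataset, and the 25% test size and the `random_state` values are arbitrary assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in (296 rows, 8 covariates); NOT the heart disease data
X, y = make_classification(
    n_samples=296, n_features=8, n_informative=4, random_state=0
)

# stratify=y keeps the proportion of positive cases similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print("train:", X_train.shape, "test:", X_test.shape)
print("fraction of positives, train:", round(float(y_train.mean()), 3))
print("fraction of positives, test: ", round(float(y_test.mean()), 3))
```

With only a few hundred patients, a stratified split is one way to avoid a test set whose class balance differs noticeably from the training set. Whether it is the most relevant choice here is for you to argue.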


Try a logistic regression and a decision tree approach.
 * Which one yields the best result here?
 * If you had to remove 3 measurements from this analysis, which ones would you choose?
 * Try to describe briefly (in 3 or 4 sentences) which insights your model gives you about the relationship between heart disease and the covariates.
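
The comparison of the two approaches could be organised along the following lines. This is a hedged sketch, not a reference solution: the data are again synthetic, and the metric (`roc_auc`), the 5 folds, and the tree depth are placeholder assumptions that you should replace with your own justified choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (296 rows, 8 covariates); NOT the heart disease data
X, y = make_classification(
    n_samples=296, n_features=8, n_informative=4, random_state=0
)

# The two candidate models; scaling is placed inside the pipeline so that
# it is re-fitted on the training part of each cross-validation fold
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Same folds for both models, so that their scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores[name].mean():.3f} "
          f"(std = {scores[name].std():.3f})")
```

In a real analysis, the cross-validation would be run on the training set only, keeping the test set for a final evaluation of the selected model.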



In [4]:
coVariables = df.drop('target_yes' , axis=1)
response = df.target_yes