## Classification using scikit-learn

## Dataset
**[UCI Heart Disease Dataset](https://www.kaggle.com/ronitf/heart-disease-uci)** <br/>
Goal: predict the presence or absence of heart disease based on the following health-related features:

- *age*: age in years 
- *sex*: (1 = male; 0 = female) 
- *cp*: chest pain type 
- *trestbps*: resting blood pressure (in mm Hg on admission to the hospital) 
- *chol*: serum cholesterol in mg/dl 
- *fbs*: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) 
- *restecg*: resting electrocardiographic results 
- *thalach*: maximum heart rate achieved 
- *exang*: exercise induced angina (1 = yes; 0 = no) 
- *oldpeak*: ST depression induced by exercise relative to rest 
- *slope*: the slope of the peak exercise ST segment 
- *ca*: number of major vessels (0-3) colored by fluoroscopy 
- *thal*: 3 = normal; 6 = fixed defect; 7 = reversible defect 
- *target*: have disease or not (1=yes, 0=no)

In [1]:
### Importing libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

In [21]:
data = pd.read_csv("heart.csv", index_col = 0)
data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,3,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,6,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,6,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,6,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,6,1


## Part 1: Data Preprocessing Techniques (50 pts)

a) 20 pts

b) 15 pts

c) 15 pts

### One-Hot Encoding

One-hot encoding is used for categorical features. For example, the 'thal' feature is coded as (normal: 3; fixed defect: 6; reversible defect: 7). A model might read these codes as an ordering and infer false associations such as reversible > fixed. Instead, we create three new binary features: is_normal, is_fixed, is_reversible.
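
As a minimal sketch of the idea, the toy example below applies `pd.get_dummies` to a hypothetical three-row frame. The values are invented for illustration and are not taken from heart.csv. `dtype=int` is passed here so the indicator columns display as 0/1 rather than booleans on newer pandas versions.

```python
import pandas as pd

# Hypothetical toy frame: one row per 'thal' code (not real patient data).
toy = pd.DataFrame({"thal": [3, 6, 7]})

# Map the numeric codes to readable labels, then expand into indicator columns.
toy["thal"] = toy["thal"].replace({3: "normal", 6: "fixed", 7: "reversible"})
encoded = pd.get_dummies(toy, columns=["thal"], prefix=["is"], dtype=int)

print(encoded)
```

Each row has exactly one indicator set to 1, so no ordering between the categories is implied.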

In [22]:
'''
a) Use one-hot encoding to transform the 'thal' feature into three columns called 'is_normal', 'is_fixed', 
and 'is_reversible'. (15 pts). Be sure to drop the 'thal' column afterwards.
Hint: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
'''
# your code here
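# Assign the result back; inplace=True on a column selection triggers warnings in newer pandas.
data['thal'] = data['thal'].replace({3: 'normal', 6: 'fixed', 7: 'reversible'})
# get_dummies removes the original 'thal' column; dtype=int keeps the indicators as 0/1.
data = pd.get_dummies(data, columns=["thal"], prefix=["is"], dtype=int)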

In [23]:
data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,target,is_fixed,is_normal,is_reversible
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,0,1,0
1,37,1,2,130,250,0,1,187,0,3.5,0,0,1,1,0,0
2,41,0,1,130,204,0,0,172,0,1.4,2,0,1,1,0,0
3,56,1,1,120,236,0,1,178,0,0.8,2,0,1,1,0,0
4,57,0,0,120,354,0,1,163,1,0.6,2,0,1,1,0,0


In [24]:
''' 
End of Section
'''

' \nEnd of Section\n'

### Feature Normalization

The ranges of different features usually vary widely. This can cause problems for models that are sensitive to feature scale (for example, regularized linear models), because features with larger ranges can dominate. Thus, we rescale all features to the range 0 to 1 using min-max normalization. For a feature x, the formula is: x' = (x - min(x)) / (max(x) - min(x)).
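
The formula can be checked by hand on a small example. The sketch below uses hypothetical blood pressure readings (illustrative values, not rows from the dataset) and compares the manual calculation with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical resting blood pressure readings (illustrative values only).
x = np.array([[100.0], [120.0], [150.0], [200.0]])

# Manual min-max normalization: x' = (x - min(x)) / (max(x) - min(x))
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# MinMaxScaler applies the same formula column by column.
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.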

In [28]:
'''
b) Use min-max normalization to rescale all the features between 0 and 1 (15 pts). Make sure that the data remains in the same
dataframe format.
Hint: Use https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
'''
# your code here

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data.values), columns=data.columns, index=data.index)

X = data.drop(columns = 'target')
y = data['target']

In [29]:
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,is_fixed,is_normal,is_reversible
0,0.708333,1.0,1.0,0.481132,0.244292,1.0,0.0,0.603053,0.0,0.370968,0.0,0.0,0.0,1.0,0.0
1,0.166667,1.0,0.666667,0.339623,0.283105,0.0,0.5,0.885496,0.0,0.564516,0.0,0.0,1.0,0.0,0.0
2,0.25,0.0,0.333333,0.339623,0.178082,0.0,0.0,0.770992,0.0,0.225806,1.0,0.0,1.0,0.0,0.0
3,0.5625,1.0,0.333333,0.245283,0.251142,0.0,0.5,0.816794,0.0,0.129032,1.0,0.0,1.0,0.0,0.0
4,0.583333,0.0,0.0,0.245283,0.520548,0.0,0.5,0.70229,1.0,0.096774,1.0,0.0,1.0,0.0,0.0


In [30]:
''' 
End of Section
'''

' \nEnd of Section\n'

### Train/test Split

- Training Data: examples used to train the model
- Testing Data: examples used to test the model, separate from training data
- We randomly choose 75% of all data for training, and the remaining 25% for testing. 
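
To see what a 75/25 split looks like in practice, here is a small sketch on hypothetical toy arrays (`X_toy` and `y_toy` are made up for illustration). Fixing `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 examples with 2 features each.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 25% of the examples for testing; the rest are used for training.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

print(X_tr.shape, X_te.shape)
```

With 8 examples, 6 end up in the training set and 2 in the test set, and no example appears in both.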

In [31]:
'''
c) Split the data into train and test sets using a 75/25 split. Use a random state of 42 for grading purposes (20 pts).
Hint: Use https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
'''

# your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [11]:
''' 
End of Section
'''

' \nEnd of Section\n'

## Part 2: Fitting Model and Analyzing Results (50 pts)

a) 15 pts

b) 10 pts

c) 10 pts

d) 15 pts

In [32]:
'''
a) Fit a logistic regression classifier on the data. Save the model in a variable called 'clf'. Use a random state of 42.
Use the following parameters: penalty:'l2', solver:'liblinear', C:0.1. 15 pts.
'''

# your code here
clf = LogisticRegression(penalty='l2', solver='liblinear', C=0.1, random_state = 42)
clf.fit(X_train, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [33]:
''' 
End of Section
'''

' \nEnd of Section\n'

In [36]:
'''
b) Generate 0/1 predictions on the test set and store them in a variable called 'pred'. 
Generate probability predictions on the test set and store them in a variable called 'scores'.
10 pts
'''

# your code here
pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)
print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores[:,1]))
print(classification_report(y_test, pred))

Accuracy:  0.8421052631578947
AUROC:  0.9066202090592334
              precision    recall  f1-score   support

         0.0       0.83      0.83      0.83        35
         1.0       0.85      0.85      0.85        41

    accuracy                           0.84        76
   macro avg       0.84      0.84      0.84        76
weighted avg       0.84      0.84      0.84        76
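
The relationship between `predict` and `predict_proba` can be illustrated on synthetic data. The sketch below is not part of the assignment: it uses `make_classification` as a stand-in for the heart disease features, and the `demo_` names are hypothetical. The L2 penalty is left at its default here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (a stand-in for the heart disease features).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)

demo_clf = LogisticRegression(solver='liblinear', C=0.1, random_state=42)
demo_clf.fit(X_demo, y_demo)

demo_pred = demo_clf.predict(X_demo)          # hard 0/1 labels
demo_scores = demo_clf.predict_proba(X_demo)  # one probability column per class

# Columns follow the order of demo_clf.classes_, so column 1 is P(class 1).
print(demo_clf.classes_)
print(demo_scores.shape)

# Thresholding P(class 1) at 0.5 should reproduce the hard predictions.
thresholded = (demo_scores[:, 1] > 0.5).astype(int)
print(np.array_equal(thresholded, demo_pred))
```

This is why `scores[:,1]` is passed to `roc_auc_score`: AUROC needs the probability of the positive class, not the hard labels.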



In [37]:
''' 
End of Section
'''

' \nEnd of Section\n'

In [39]:
def rsme(predictions, actuals):
    '''
    c) Fill in this function to find and return the root mean squared error (RMSE) between the predicted and actual values.
    Hint: Use this formula for the RMSE: https://sciencing.com/calculate-mean-deviation-7152540.html.
    10 pts
    '''
    # your code here
    from sklearn.metrics import mean_squared_error
    # Take the square root explicitly; the 'squared' argument is deprecated in newer scikit-learn releases.
    return np.sqrt(mean_squared_error(actuals, predictions))
print('RMSE: ', rsme(pred, y_test))

RMSE:  0.39735970711951313
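
For 0/1 labels, each squared error is either 0 (correct) or 1 (wrong), so the root mean squared error equals the square root of the misclassification rate. The sketch below checks this with synthetic arrays sized to match the logistic regression result above: 76 test examples, of which the reported accuracy implies 12 were misclassified. The arrays themselves are made up. Only the counts come from the output above.

```python
import numpy as np

# Synthetic labels sized to match the result above:
# 76 test examples, 12 of them misclassified (accuracy 64/76).
actuals = np.zeros(76)
predictions = actuals.copy()
predictions[:12] = 1.0  # flip 12 predictions so they disagree with the labels

# Root mean squared error computed directly from its definition.
rmse_value = np.sqrt(np.mean((predictions - actuals) ** 2))
accuracy = np.mean(predictions == actuals)

print(rmse_value)
print(np.sqrt(1 - accuracy))
```

Both printed values agree, which means that on hard 0/1 predictions this metric carries the same information as accuracy.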


In [40]:
''' 
End of Section
'''

' \nEnd of Section\n'

In [42]:
'''
d) Try using a random forest classifier to fit the data instead. Use the default parameters and a random state of 42.
Save the fitted model into a variable called 'rf'. Generate the 'pred' and 'scores' in a similar way to part b.
Hint: Use https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
15 pts
'''
from sklearn.ensemble import RandomForestClassifier
# your code here
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
scores = rf.predict_proba(X_test)
print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores[:,1]))
print(classification_report(y_test, pred))

Accuracy:  0.8552631578947368
AUROC:  0.9059233449477352
              precision    recall  f1-score   support

         0.0       0.88      0.80      0.84        35
         1.0       0.84      0.90      0.87        41

    accuracy                           0.86        76
   macro avg       0.86      0.85      0.85        76
weighted avg       0.86      0.86      0.85        76



In [43]:
''' 
End of Section
'''

' \nEnd of Section\n'

Looking at the metrics defined above, which classifier performs better? (ungraded)
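
One way to approach this question is to line up the reported numbers side by side. The sketch below copies the accuracy and AUROC values from the outputs above into a small table. The `results` frame and its row labels are hypothetical names chosen for illustration.

```python
import pandas as pd

# Metrics copied from the outputs reported above (76 test examples).
results = pd.DataFrame(
    {
        "accuracy": [0.8421052631578947, 0.8552631578947368],
        "auroc": [0.9066202090592334, 0.9059233449477352],
    },
    index=["logistic_regression", "random_forest"],
)

# Convert accuracy to a count of correct predictions out of 76.
results["n_correct"] = (results["accuracy"] * 76).round().astype(int)

print(results)
print(results.loc["random_forest"] - results.loc["logistic_regression"])
```

The accuracy gap corresponds to a single test example, and the AUROC values differ by less than 0.001, so with a test set this small the two classifiers are hard to separate on these metrics alone.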