# Day 3:

## Outline of the Datasets we'll use today:
1. Heart disease
    * decision tree
2. Diamonds
    * Make a brief exploratory analysis.
    * Prepare the data in order to perform a linear regression on the variable "price". In particular we will perform label encoding or one-hot encoding on categorical variables, according to their nature.
    * Understand how Scikit-Learn implements linear regression. Perform this algorithm and study its performance.
    * Understand how the k-NN regressor works, in theory and in practice with Scikit-Learn. How is it different from the k-NN classifier? Use its Scikit-Learn implementation to study its performance.
    * Understand how the decision tree regressor works, in theory and in practice with Scikit-Learn. Study the performance of its Scikit-Learn implementation.
    * Compare the performance of all of the previous algorithms.

# Heart Disease

In [None]:
# Data manipulation:
import pandas as pd

# scikit learn ML models
# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Models
from sklearn.tree import DecisionTreeClassifier, plot_tree
# cross validation
from sklearn.model_selection import GridSearchCV
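from sklearn.metrics import RocCurveDisplay
# plot_roc_curve was removed in scikit-learn 1.2; alias its replacement so the calls below keep working
plot_roc_curve = RocCurveDisplay.from_estimator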

## Load and prepare data

In [None]:
## Use the same method as Day1 to prepare X and y for the heart dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

## [Decision tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

### Research:
Research online how this algorithm works and how it is implemented in Scikit-Learn. Do not blindly trust what I tell you: double-check!

### Practice
Find and test out at least 4 relevant decision tree properties from the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).
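
As a starting point, here is a minimal sketch of such a search. It runs on a synthetic dataset, not on the heart data, and the four parameters and their candidate values (`criterion`, `max_depth`, `min_samples_leaf`, `max_features`) are only one possible selection. The names `Xd_*`, `tree_grid` and `tree_search` are made up for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (assumption: 13 numeric features, binary target)
Xd, yd = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=7)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(Xd, yd, test_size=0.3, random_state=7)

# Four decision tree parameters with a few candidate values each
tree_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 4, 8, None],
    'min_samples_leaf': [1, 5, 20],
    'max_features': [None, 'sqrt'],
}

tree_search = GridSearchCV(DecisionTreeClassifier(random_state=7), tree_grid, cv=5)
tree_search.fit(Xd_train, yd_train)

print(tree_search.best_params_)
print('Test accuracy:', tree_search.score(Xd_test, yd_test))
```

Inside the pipeline of the next cell, the same keys would carry the step prefix, for example `classifier__max_depth`.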

In [None]:
pipe_ex = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('classifier', )
])

params_ex = {
    'imputer__strategy': ['mean', 'median'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'pca__n_components': [5, 11, 12, 13],
    'classifier__': [],
    'classifier__': [],
    'classifier__': [],
    'classifier__': [],
    'classifier__': []
}

gridsearch_ex = GridSearchCV(pipe_ex, params_ex, cv=5, verbose=1, n_jobs=-1)
gridsearch_ex_result = gridsearch_ex.fit(X_train, y_train)

display(gridsearch_ex_result.best_estimator_)
display('Best model accuracy over previously unseen data: {}'.format(
    gridsearch_ex_result.score(X_test, y_test)
))

plot_roc_curve(gridsearch_ex, X_test, y_test)

In [None]:
# The deprecated `rotate` argument was removed from plot_tree in scikit-learn 1.0
plot = plot_tree(gridsearch_ex_result.best_estimator_['classifier'], filled=True)

# Diamonds

## Load the data

In [None]:
data = pd.read_csv('../data/diamonds.csv')

## Explore the data

In [None]:
data

## Prepare the data
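
The two encodings mentioned in the outline can be sketched as follows. The frame below is hand-made and only mimics a few diamonds columns; its values are illustrative, not real rows. The grade order used for `cut` is an assumption you should check against the dataset description.

```python
import pandas as pd

# Tiny hand-made frame mimicking a few diamonds columns (illustrative values only)
toy = pd.DataFrame({
    'carat': [0.23, 0.31, 0.90, 1.20],
    'cut': ['Ideal', 'Good', 'Premium', 'Fair'],
    'color': ['E', 'J', 'D', 'G'],
    'price': [326, 335, 4000, 5200],
})

# Label (ordinal) encoding: map each grade to its rank (assumed order, worst to best)
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
toy['cut'] = toy['cut'].map({grade: rank for rank, grade in enumerate(cut_order)})

# One-hot encoding: one indicator column per category
toy = pd.get_dummies(toy, columns=['color'])

# Separate the features from the target
X_toy = toy.drop(columns=['price'])
y_toy = toy['price']

print(X_toy.columns.tolist())
```

Label encoding keeps a single column and preserves an order between categories. One-hot encoding makes no assumption about order, at the cost of one extra column per category.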

In [None]:
## prepare the report

In [None]:
## explore the widget

## Experiment with the [Linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
### Research:
Research online how this algorithm works and how it is implemented in Scikit-Learn.

### Practice
Identify and test out the two meaningful properties that might impact the accuracy.
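
To see why one of these properties matters, the sketch below fits `LinearRegression` with and without an intercept on synthetic data whose true intercept is far from zero. The data and all variable names are invented for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * x[:, 0] + 50.0 + rng.normal(0, 1, size=200)  # true intercept is 50

with_intercept = LinearRegression(fit_intercept=True).fit(x, y)
no_intercept = LinearRegression(fit_intercept=False).fit(x, y)

# Centering both x and y removes the need for an intercept
x_c, y_c = x - x.mean(axis=0), y - y.mean()
no_intercept_centered = LinearRegression(fit_intercept=False).fit(x_c, y_c)

print('R^2 with intercept:              ', with_intercept.score(x, y))
print('R^2 without intercept:           ', no_intercept.score(x, y))
print('R^2 without intercept, centered: ', no_intercept_centered.score(x_c, y_c))
```

On uncentered data the model without an intercept is forced through the origin and fits poorly. Once the data is centered, it recovers the same slope as the model with an intercept.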

In [None]:
pipe_ex = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('classifier', )
])

params_ex = {
    'imputer__strategy': ['mean', 'median'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'pca__n_components': [5, 11, 12, 13],
    'classifier__': [],
    'classifier__': []
}

gridsearch_ex = GridSearchCV(pipe_ex, params_ex, cv=5, verbose=1, n_jobs=-1)
gridsearch_ex_result = gridsearch_ex.fit(X_train, y_train)

display(gridsearch_ex_result.best_estimator_)
# For a regressor, score() returns the R^2 coefficient of determination, not an accuracy
display('Best model R^2 over previously unseen data: {}'.format(
    gridsearch_ex_result.score(X_test, y_test)
))

# ROC curves only apply to classifiers, so there is no ROC plot for a regressor

## Experiment with the [k-NN Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)

### Research:
Research online how this algorithm works and how it is implemented in Scikit-Learn. How does it differ from the k-NN classifier?
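
A tiny worked example can help you check your answer. With the default uniform weights, the regressor averages the targets of the k nearest neighbours, while the classifier takes a majority vote among their labels. The four points below are invented for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X_points = np.array([[0.0], [1.0], [2.0], [10.0]])
y_values = np.array([10.0, 20.0, 60.0, 100.0])  # continuous target for the regressor
y_labels = np.array([0, 0, 1, 1])                # class labels for the classifier

reg = KNeighborsRegressor(n_neighbors=3).fit(X_points, y_values)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_points, y_labels)

# The three nearest neighbours of 1.2 are the points 0.0, 1.0 and 2.0
query = np.array([[1.2]])
print(reg.predict(query))  # mean of 10, 20 and 60
print(clf.predict(query))  # majority label among 0, 0 and 1
```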

### Practice
Identify and experiment with at least 4 meaningful properties that might impact the model accuracy. Test out a few of the different possible values for each property you selected.
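
One possible shape for this experiment is sketched below, on synthetic regression data instead of the diamonds data. The choice of the four properties and of their candidate values is an assumption, and the names `knn_pipe` and `knn_grid` are made up for the example.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared diamonds features
X_knn, y_knn = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=7)

# Scaling matters for k-NN because distances mix all the features
knn_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', KNeighborsRegressor()),
])

knn_grid = {
    'regressor__n_neighbors': [3, 5, 11],
    'regressor__weights': ['uniform', 'distance'],
    'regressor__metric': ['euclidean', 'manhattan', 'chebyshev'],
    'regressor__algorithm': ['ball_tree', 'kd_tree', 'brute'],
}

knn_search = GridSearchCV(knn_pipe, knn_grid, cv=5)
knn_search.fit(X_knn, y_knn)

print(knn_search.best_params_)
print('Mean cross-validated R^2:', knn_search.best_score_)
```

The three metrics listed here are supported by all three search algorithms. Other metrics may restrict which `algorithm` values are valid.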

In [None]:
pipe_ex = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('classifier', )
])

params_ex = {
    'imputer__strategy': ['mean', 'median'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'pca__n_components': [5, 11, 12, 13],
    'classifier__': [],
    'classifier__': []
}

gridsearch_ex = GridSearchCV(pipe_ex, params_ex, cv=5, verbose=1, n_jobs=-1)
gridsearch_ex_result = gridsearch_ex.fit(X_train, y_train)

display(gridsearch_ex_result.best_estimator_)
# For a regressor, score() returns the R^2 coefficient of determination, not an accuracy
display('Best model R^2 over previously unseen data: {}'.format(
    gridsearch_ex_result.score(X_test, y_test)
))

# ROC curves only apply to classifiers, so there is no ROC plot for a regressor

## [Decision tree regressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)

### Research:
Research online how this algorithm works and how it is implemented in Scikit-Learn. How does it differ from the classifier?

### Practice
Find and test out at least 4 relevant decision tree properties from the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html). Use your previous experimentation to select 2 different parameters that might have a stronger impact.
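
The sketch below illustrates the effect of one such property, `max_depth`, on synthetic data. The dataset and variable names are invented for the example. It also shows a difference with the classifier: each leaf predicts a constant numeric value, so a tree of depth 2 can output at most 4 distinct predictions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared diamonds features
X_tree, y_tree = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=7)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_tree, y_tree, test_size=0.3, random_state=7)

# Compare train and test R^2 for increasingly deep trees
depth_scores = {}
for depth in [2, 5, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=7).fit(Xr_train, yr_train)
    depth_scores[depth] = (tree.score(Xr_train, yr_train), tree.score(Xr_test, yr_test))
    print(depth, depth_scores[depth])

# A depth-2 tree has at most 4 leaves, hence at most 4 distinct predicted values
shallow = DecisionTreeRegressor(max_depth=2, random_state=7).fit(Xr_train, yr_train)
n_distinct = len(set(shallow.predict(Xr_test)))
print('Distinct predictions at depth 2:', n_distinct)
```

An unrestricted tree (`max_depth=None`) reproduces the training targets almost exactly, which is a sign of overfitting when the test score is clearly lower.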

In [None]:
pipe_ex = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('classifier', )
])

params_ex = {
    'imputer__strategy': ['mean', 'median'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'pca__n_components': [5, 11, 12, 13],
    'classifier__': [],
    'classifier__': [],
    'classifier__': [],
    'classifier__': [],
    'classifier__': []
}

gridsearch_ex = GridSearchCV(pipe_ex, params_ex, cv=5, verbose=1, n_jobs=-1)
gridsearch_ex_result = gridsearch_ex.fit(X_train, y_train)

display(gridsearch_ex_result.best_estimator_)
# For a regressor, score() returns the R^2 coefficient of determination, not an accuracy
display('Best model R^2 over previously unseen data: {}'.format(
    gridsearch_ex_result.score(X_test, y_test)
))

# ROC curves only apply to classifiers, so there is no ROC plot for a regressor

# Diamonds cheat sheet:

1. Make a brief exploratory analysis.
    * identify and understand the non-numerical values
    * identify correlated data => you might want to reduce the number of dimensions
    * identify missing values
2. Prepare the data in order to perform a linear regression on the variable "price". In particular we will perform label encoding or one-hot encoding on categorical variables, according to their nature.
    * X:
        * one-hot (or label) encode the categorical columns `cut`, `color` and `clarity`; `carat` is already numeric
        * remove 'Unnamed: 0'
        * remove 'price'
    * y:
        * only keep the 'price' column
3. Understand how Scikit-Learn implements linear regression. Perform this algorithm and study its performance.
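    * `fit_intercept` and `normalize` are the only relevant parameters to test out here (`normalize` was removed in scikit-learn 1.2; use a `StandardScaler` step in the pipeline instead)
    * with `fit_intercept=False` the model is forced through the origin, so the data should be centered first. This is a preprocessing step: did you do it in order to improve your score?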
4. Understand how the k-NN regressor works, in theory and in practice with Scikit-Learn. How is it different from the k-NN classifier? Use its Scikit-Learn implementation to study its performance.
    * `metric` is probably one of the most important parameters to experiment with. Try out the different distance functions available
    * `algorithm` might restrict your metric choices. Test out multiple implementations to experiment with all the possibilities
    * n_neighbors
    * weights
5. Understand how the decision tree regressor works, in theory and in practice with Scikit-Learn. Study the performance of its Scikit-Learn implementation.
    * criterion
    * max_features
    * min_impurity_decrease (`min_impurity_split` was removed in scikit-learn 1.0)
    * max_depth
6. Compare the performance of all of the previous algorithms.
    * the power of cross-validation!
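
The last step of the list above can be sketched with `cross_val_score`, which scores each model on several train/validation splits instead of a single one. The data here is synthetic and linear by construction, so the ranking it produces says nothing about the diamonds data. The model settings and names are assumptions made for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared diamonds features
X_cmp, y_cmp = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=7)

candidates = {
    'linear regression': LinearRegression(),
    'k-NN regressor': make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    'decision tree regressor': DecisionTreeRegressor(max_depth=5, random_state=7),
}

# Mean R^2 over 5 folds for each candidate model
cv_results = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_cmp, y_cmp, cv=5, scoring='r2')
    cv_results[name] = fold_scores.mean()
    print(f'{name}: mean R^2 = {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})')
```

Comparing mean scores together with their spread across folds gives a fairer picture than a single train/test split.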