**Table of contents**<a id='toc0_'></a>    
- [Supervised Learning - Classification](#toc1_)    
  - [K-Nearest Neighbors (distance-based)](#toc1_1_)    
    - [Overfitting check](#toc1_1_1_)    
    - [Hyperparameter tuning](#toc1_1_2_)    
  - [Logistic Regression (equation-based)](#toc1_2_)    
    - [Evaluation metrics](#toc1_2_1_)    
  - [Decision Trees (tree-based)](#toc1_3_)    
    - [Hyperparameter Tuning](#toc1_3_1_)    
    - [Review decision tree](#toc1_3_2_)    
    - [Feature importance](#toc1_3_3_)    
    - [💥 **Bonus**: Lollipop charts in Python](#toc1_3_4_)    
  - [Support Vector Machines](#toc1_4_)    
    - [Cross Validation](#toc1_4_1_)    
    - [💥 **Bonus**: Stratification in sklearn](#toc1_4_2_)    
- [Resources](#toc2_)    
- [Acknowledgements](#toc3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Supervised Learning - Classification](#toc0_)

In [None]:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)
cancer = load_breast_cancer()

In [None]:
# description of the dataset
print(cancer['DESCR'])

In [None]:
# 212 people with cancer
# 357 people without cancer

In [None]:
# Extract dataset into pandas
features = pd.DataFrame(cancer['data'], columns = cancer['feature_names'])
labels = pd.Series(cancer['target'], name = 'labels')

In [None]:
# Display features & labels
display(features.head())
display(labels.head())

In [None]:
from sklearn.model_selection import train_test_split

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)

## <a id='toc1_1_'></a>[K-Nearest Neighbors (distance-based)](#toc0_)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Initialize model with N=9
model = KNeighborsClassifier(n_neighbors=9)

In [None]:
# Train model & predict
model = model.fit(X_train, y_train)
model.predict(X_test)

In [None]:
# Compare predictions to reality
np.array(y_test)

In [None]:
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# Compute overall accuracy (sklearn metrics expect y_true first, then y_pred)
accuracy_score(y_test, model.predict(X_test))

In [None]:
# Compute classification report (y_true first, then y_pred; swapping them swaps precision and recall)
print(classification_report(y_test, model.predict(X_test)))

### <a id='toc1_1_1_'></a>[Overfitting check](#toc0_)

Has my model simply memorized the data or did it infer some patterns and relationships from the data?

In [None]:
# once the model is trained you can call the score method, to compare results of test predictions with actual values -> returns the accuracy
print("test data accuracy was ",model.score(X_test,y_test))

# you should always also see the accuracy of the training
print("train data accuracy was ", model.score(X_train, y_train))


In [None]:
# Compute classification report on the training data (y_true first, then y_pred)
print(classification_report(y_train, model.predict(X_train)))

The model performs better on the test data than on the training data (unusual!). This suggests our model is slightly **underfit**.

In this case, we could add more data points or more features to further increase the accuracy, or we could remove or reduce techniques used to prevent overfitting, e.g. regularization (which we'll talk about in the regression class).

### <a id='toc1_1_2_'></a>[Hyperparameter tuning](#toc0_)

One strategy to deal with underfitting/overfitting is to change the parameters of the model. For the KNN algorithm, the main parameter is the **number of neighbors**:

In [None]:
import matplotlib.pyplot as plt

# hyperparameter tuning - extract train-test scores into lists
train_accuracy = []
test_accuracy = []

# try n_neighbors from 1 to 29 (range excludes the upper bound)
neighbors_settings = range(1, 30)

for n_neighbors in neighbors_settings:
  # Build the model
  clf = KNeighborsClassifier(n_neighbors=n_neighbors)
  # Train the model
  clf.fit(X_train, y_train)
  # record training set accuracy
  train_accuracy.append(clf.score(X_train, y_train))
  # record generalization accuracy
  test_accuracy.append(clf.score(X_test, y_test))

# Plot results
plt.plot(neighbors_settings, train_accuracy, label="train accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()

plt.show()

**Which is the optimal number of neighbors?** Let's review using plotly:

In [None]:
# Adding a plotly chart for comparison
import plotly.graph_objects as go
import plotly.express as px

fig = go.Figure()
fig.add_trace(go.Scatter(x=list(neighbors_settings), y=train_accuracy, name='Training Accuracy'))
fig.add_trace(go.Scatter(x=list(neighbors_settings), y=test_accuracy, name='Testing Accuracy'))
fig.update_layout(xaxis_title='Number of neighbors', yaxis_title='Accuracy', title='')
fig.show()

10-15 neighbors seems to be the optimal range, as that is where the test score peaks, in spite of the underfitting.
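
The optimum can also be read off the numbers instead of the chart. The following is a minimal, self-contained sketch that repeats the loop above and picks the setting with the highest test accuracy using `np.argmax`; the names `best_index` and `best_k` are introduced here for illustration only. Note that `np.argmax` returns the first maximum, so ties resolve to the smallest number of neighbors.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split as in the notebook
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Record the test accuracy for each candidate number of neighbors
neighbors_settings = range(1, 30)
test_accuracy = []
for k in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    test_accuracy.append(clf.score(X_test, y_test))

# Position of the (first) highest test accuracy, mapped back to its k
best_index = int(np.argmax(test_accuracy))
best_k = neighbors_settings[best_index]
print("best n_neighbors:", best_k, "test accuracy:", test_accuracy[best_index])
```

Choosing a hyperparameter on the test set tends to give an optimistic score, so treat the result as a rough guide; the cross-validation section further down shows a more robust way to evaluate a model.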

## <a id='toc1_2_'></a>[Logistic Regression (equation-based)](#toc0_)

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize & fit the model
model = LogisticRegression()
model = model.fit(X_train, y_train)

In [None]:
# Basic Accuracy data
print("test data accuracy was ",model.score(X_test,y_test))
print("train data accuracy was ", model.score(X_train, y_train))

In [None]:
from sklearn.metrics import accuracy_score

# Get overall accuracy
pred = model.predict(X_test)
accuracy_score(y_test, pred)

### <a id='toc1_2_1_'></a>[Evaluation metrics](#toc0_)

In [None]:
from sklearn.metrics import confusion_matrix

# Get confusion matrix, convert to dataframe
cm = pd.DataFrame(confusion_matrix(y_test, pred))
cm.rename({0: 'No - Pred', 1: 'Yes - Pred'}, axis=1, inplace=True)
cm.rename({0: 'No - True', 1: 'Yes - True'}, axis=0, inplace=True)
cm

In [None]:
# Plot confusion matrix
px.imshow(cm, text_auto=True, color_continuous_scale='RdBu', color_continuous_midpoint=0)

In [None]:
from sklearn.metrics import precision_score, recall_score

# Extract precision & recall separately
print(precision_score(y_test,pred))
print(recall_score(y_test,pred))

In [None]:
# Review full classification report
print(classification_report(y_test,pred))

Is 99% accuracy good when predicting cancer outcomes? 

[Not really - 23 minutes worth of explanation for this one, feat. Bayes theorem](https://www.youtube.com/watch?v=lG4VkPoG3ko). 

In the video: **PPV ~ precision** and **sensitivity ~ recall**. For more on this matter, check out this article: [Data Science in Medicine — Precision & Recall or Specificity & Sensitivity?](https://towardsdatascience.com/should-i-look-at-precision-recall-or-specificity-sensitivity-3946158aace1)
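
To make the link between these terms concrete, here is a small sketch that derives precision (PPV), recall (sensitivity) and specificity by hand from the four cells of a confusion matrix. The labels `y_true_toy` and `y_pred_toy` are made-up toy values, not results from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels (1 = positive class, 0 = negative class)
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)    # PPV: of the predicted positives, how many are truly positive
recall = tp / (tp + fn)       # sensitivity: of the true positives, how many were found
specificity = tn / (tn + fp)  # of the true negatives, how many were correctly rejected

print(precision, recall, specificity)
# The hand-computed values agree with sklearn's helpers
print(precision_score(y_true_toy, y_pred_toy), recall_score(y_true_toy, y_pred_toy))
```

Keep in mind that `precision_score` and `recall_score` treat label `1` as the positive class by default, and in `load_breast_cancer` label `1` means benign. To score the malignant class instead, pass `pos_label=0`.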

## <a id='toc1_3_'></a>[Decision Trees (tree-based)](#toc0_)

![](https://imgs.search.brave.com/-MSTVa2-jo6LRnfBlwZ6P2ogZFJuJ431Os0ha2p2GuU/rs:fit:860:0:0/g:ce/aHR0cHM6Ly9taXJv/Lm1lZGl1bS5jb20v/djIvMCpsV0R1a2dJ/NE9RNkZ5RHpzLnBu/Zw)  
(Source: [An Exhaustive Guide to Decision Tree Classification in Python 3.x, Towards Data Science](https://towardsdatascience.com/an-exhaustive-guide-to-classification-using-decision-trees-8d472e77223f?gi=ec8e06014983))

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Instantiate and fit decision tree
model = DecisionTreeClassifier(max_depth=10)
model.fit(X_train, y_train)

Decision trees (and their family, Random Forests) are very prone to overfitting:

In [None]:
# Review overall accuracy scores
print("test data accuracy was ",model.score(X_test,y_test))
print("train data accuracy was ",model.score(X_train,y_train))

### <a id='toc1_3_1_'></a>[Hyperparameter Tuning](#toc0_)

Repeat the fitting process using different values for the `max_depth` parameter:

In [None]:
max_depth = range(1, 30)
test = []
train = []

for depth in max_depth:
  model = DecisionTreeClassifier(max_depth= depth)
  model.fit(X_train, y_train)
  test.append(model.score(X_test,y_test))
  train.append(model.score(X_train,y_train))

Review the train/test accuracy with different parameter values:

In [None]:
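# Pass max_depth as x so the axis shows the actual depths (1-29) instead of list positions (0-28)
plt.plot(max_depth, train, label="training accuracy")
plt.plot(max_depth, test, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("max_depth")
plt.legend()
plt.show()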

**Which is the ideal depth?** Let's review in plotly:

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(max_depth), y=train, name='Training Accuracy'))
fig.add_trace(go.Scatter(x=list(max_depth), y=test, name='Testing Accuracy'))
fig.update_layout(xaxis_title='Max Tree Depth', yaxis_title='Accuracy', title='')
fig.show()

The best performing `max_depth` is 24, although 2-4 perform almost equally well. Here, we'd decide on a depth of 3 which is less likely to overfit.

### <a id='toc1_3_2_'></a>[Review decision tree](#toc0_)

In [None]:
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

from sklearn import tree
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(model, 
                   feature_names=cancer.feature_names,  
                   class_names=["malignant", "benign"],
                   filled=True)

### <a id='toc1_3_3_'></a>[Feature importance](#toc0_)

In [None]:
# Review features
X_train.head().T

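> Decision trees automatically give you feature importances. In sklearn they are based on how much the splits on a given feature reduce impurity (weighted by the number of samples reaching each split), not simply on how many times the feature is used: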

In [None]:
model.feature_names_in_

In [None]:
model.feature_importances_

It's not very clear to see which features are the most important by comparing the dataframe to the numpy array, so we will plot the values:

In [None]:
def plot_feature_importances_cancer(model):
  n_features = cancer.data.shape[1]
  plt.barh(range(n_features), model.feature_importances_, align='center')
  plt.yticks(np.arange(n_features), cancer.feature_names)
  plt.xlabel("Feature importance")
  plt.ylabel("Feature")
plot_feature_importances_cancer(model)

In [None]:
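# Horizontal bar chart of the importances, sorted so the largest bar is at the top
fig = px.bar(x=model.feature_importances_, y=model.feature_names_in_, orientation='h')
fig.update_yaxes(title='Feature', categoryorder='total ascending')
fig.update_layout(xaxis_title='Feature importance', height=600)
fig.show()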

### <a id='toc1_3_4_'></a>[💥 **Bonus**: Lollipop charts in Python](#toc0_)

In [None]:
feat_importances = pd.DataFrame(model.feature_names_in_, columns=['feature'])
feat_importances['importance'] = model.feature_importances_
# Sort, keep the top 15, and reset the index so that label i equals position i
feat_importances = feat_importances.sort_values(by=['importance'],
                    ascending=False).iloc[0:15].reset_index(drop=True)
fig = go.Figure()
# Draw points
fig.add_trace(go.Scatter(x=feat_importances["importance"], 
                          y=feat_importances["feature"],
                          mode='markers',
                          marker_color='darkblue',
                          marker_size=10))
# Draw lines (without the index reset above, [i] would pick rows by their old labels
# and the lines would not match the points)
for i in range(0, len(feat_importances)):
    fig.add_shape(type='line',
                  x0 = 0, y0 = i,
                  x1 = feat_importances["importance"][i],
                  y1 = i,
                  line=dict(color='crimson', width = 3))
# Set title
fig.update_layout(title_text = 
                   "Top 15 feature importances",
                   title_font_size = 30)
# Set x-axes range
fig.update_xaxes(title = 'Feature importance' , range=[0, 1])
fig.show()

## <a id='toc1_4_'></a>[Support Vector Machines](#toc0_)

> SVMs are used when there are two categories and no obvious linear classifier that separates them in a nice way. (OrtusAI + StatQuest)

![](https://editor.analyticsvidhya.com/uploads/1403824.png)  
(Source: [Guide on Support Vector Machine (SVM) Algorithm, Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/))

Types of SVC boundaries: 
 
![](https://scikit-learn.org/stable/_images/sphx_glr_plot_iris_svc_001.png)
(Source: [sklearn documentation](https://scikit-learn.org/stable/auto_examples/svm/plot_iris_svc.html))

In [None]:
# Support Vector Machine
from sklearn.svm import LinearSVC

# Initialize and fit model
model = LinearSVC()
model.fit(X_train, y_train)

In [None]:
# Review overall accuracy score
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

### <a id='toc1_4_1_'></a>[Cross Validation](#toc0_)

What is cross-validation?

![](https://imgs.search.brave.com/tEBDW7f_GRHyGUhYVI0mmwKHv5NYPdYEFKxDqBUF3mk/rs:fit:860:0:0/g:ce/aHR0cHM6Ly93d3cu/c2VjdGlvbi5pby9l/bmdpbmVlcmluZy1l/ZHVjYXRpb24vaG93/LXRvLWltcGxlbWVu/dC1rLWZvbGQtY3Jv/c3MtdmFsaWRhdGlv/bi81LWZvbGQtY3Yu/anBlZw)  
(Source: [How to Implement K fold Cross-Validation in Scikit-Learn, Section.io](https://www.section.io/engineering-education/how-to-implement-k-fold-cross-validation/))
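
The idea in the picture can be written out by hand. The sketch below splits the data into 5 folds with `KFold`, trains on 4 of them and scores on the remaining one, rotating until every fold has been used for validation once. Using `KNeighborsClassifier(n_neighbors=9)` and 5 shuffled folds are choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5 folds; shuffle because the rows are not guaranteed to be in random order
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Train a fresh model on 4 folds, evaluate on the held-out fold
    clf = KNeighborsClassifier(n_neighbors=9)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[val_idx], y[val_idx]))

fold_scores = np.array(fold_scores)
print("score per fold:", fold_scores)
print("mean score:", fold_scores.mean())
```

`cross_validate` wraps this loop for you. One difference worth knowing: when given a classifier and an integer `cv`, it uses stratified folds rather than the plain `KFold` shown here.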

In [None]:
# Applying an example of cross validation
from sklearn.model_selection import cross_validate

model = LinearSVC()
results = cross_validate(model, cancer['data'], cancer['target'], cv=10)

In [None]:
# Review test scores per validation set
results['test_score']

In [None]:
# Review overall test score
results['test_score'].mean()

**Why cross-validation?**

### <a id='toc1_4_2_'></a>[💥 **Bonus**: Stratification in sklearn](#toc0_)

In [None]:
salary = pd.read_csv('https://raw.githubusercontent.com/sabinagio/data-analytics/main/data/salaries.csv')
salary.head()

In [None]:
salary.Experience.value_counts(dropna=False)

When predicting salary we would want to have an equal distribution of experience in both the train and test sets, i.e. we want to stratify our train-test split by experience:

In [None]:
# Train-test split with stratification
X = salary.drop('Salary', axis=1)
y = salary['Salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25, stratify=X['Experience'])

In [None]:
# Review train set proportions
X_train.Experience.value_counts(normalize=True) * 100

In [None]:
# Review test set proportions
X_test.Experience.value_counts(normalize=True) * 100

Let's see what our proportions look like without stratification:

In [None]:
# Train-test split without stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)

In [None]:
# Review train set proportions
X_train.Experience.value_counts(normalize=True) * 100

In [None]:
# Review test set proportions
X_test.Experience.value_counts(normalize=True) * 100
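
The same comparison can be reproduced without downloading anything. The sketch below uses a made-up table (`toy`, with 80 "Junior" and 20 "Senior" rows; these values are hypothetical and unrelated to the salary file) to show that `stratify` keeps the class proportions nearly identical in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: an imbalanced Experience column and a dummy Salary
toy = pd.DataFrame({
    "Experience": ["Junior"] * 80 + ["Senior"] * 20,
    "Salary": list(range(100)),
})
X_toy = toy.drop("Salary", axis=1)
y_toy = toy["Salary"]

# Stratified split: each set keeps roughly the original 80/20 mix
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=X_toy["Experience"]
)
train_share = Xs_train["Experience"].value_counts(normalize=True)
test_share = Xs_test["Experience"].value_counts(normalize=True)

print(train_share * 100)
print(test_share * 100)
```

Removing the `stratify` argument and re-running with different `random_state` values shows how much the "Senior" share can drift between the two sets by chance.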

**Why stratification?** This is for you to do some research on 😉  

*(Hint: It has something to do with sampling... remember that lesson in inferential statistics?)*

# <a id='toc2_'></a>[Resources](#toc0_)

- [Decision Trees (StatQuest) - 18 mins](https://www.youtube.com/watch?v=_L39rN6gz7Y)
- [Cross-Validation (StatQuest) - 6 mins](https://www.youtube.com/watch?v=fSytzGwwBVw&t=0s)
- Support Vector Machine (StatQuest)
    - [Main Ideas - 20 mins](https://www.youtube.com/watch?v=efR1C6CvhmE)
    - [The Polynomial Kernel - 7 mins](https://www.youtube.com/watch?v=Toet3EiSFcM&t=0s)
    - [The Radial (RBF) Kernel - 16 mins](https://www.youtube.com/watch?v=Qc5IyLW_hns&t=0s)
- [Support Vector Machine - multi-class implementation](https://archive.is/20230328072327/https://towardsdatascience.com/multiclass-classification-with-support-vector-machines-svm-kernel-trick-kernel-functions-f9d5377d6f02)

# <a id='toc3_'></a>[Acknowledgements](#toc0_)

Thank you, David Henriques, for your awesome class structure and content!