<a href="https://colab.research.google.com/github/SimeonHristov99/ML_21-22/blob/main/classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification. KNN and Logistic Regression

|             |continuous           | categorical    |
|-----------  | --------------      | ----------     |
|**supervised**   | regression          | **classification** |
|**unsupervised** | dimension reduction | clustering     |

The first thing to say is that logistic regression is not a regression algorithm but a classification algorithm. The name comes from statistics: its mathematical formulation is similar to that of linear regression. In fact, the only difference is that the linear combination of the features gets passed through the so-called `sigmoid` function:

$$result = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n)}}$$

Note that:

$$\frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}$$
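
The identity follows from multiplying the numerator and the denominator by $e^x$:

$$\frac{1}{1 + e^{-x}} = \frac{1}{1 + e^{-x}} \cdot \frac{e^x}{e^x} = \frac{e^x}{e^x + e^{-x}e^x} = \frac{e^x}{1 + e^x}$$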

In [None]:
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

from sklearn.preprocessing import StandardScaler

from sklearn import metrics

In [None]:
FIG_SIZE = (12, 10)
plt.rc('figure', figsize=FIG_SIZE)

In [None]:
def sigmoid(x):
  # return np.exp(x) / (1 + np.exp(x))
  return 1 / (1 + np.exp(-x))

interval = np.linspace(-10, 10, num=1000)

plt.plot(interval, sigmoid(interval))
plt.show()

The sigmoid takes the value of 0.5 at 0 and goes towards 1 when moving towards $+\infty$ and to 0 when moving towards $-\infty$.

We can use those two limits as the identifiers of two classes. For example, if we're classifying whether an email is spam, we can use 0 for "no" and 1 for "yes".
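
As a small sketch of this idea, the snippet below evaluates the sigmoid at a few points and turns the outputs into class labels with a 0.5 threshold. The helper name `to_label` and the sample inputs are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def to_label(scores, threshold=0.5):
  # Hypothetical helper: 1 ("yes") when the sigmoid output exceeds the threshold, else 0 ("no")
  return (sigmoid(scores) > threshold).astype(int)

# Made-up inputs on both sides of 0
scores = np.array([-6.0, -0.5, 0.0, 0.5, 6.0])
probabilities = sigmoid(scores)
labels = to_label(scores)

print(probabilities.round(3))
print(labels)
```

Inputs far below 0 map close to class 0 and inputs far above 0 map close to class 1; an input of exactly 0 gives 0.5 and, with a strict `>` comparison, falls into class 0.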

## Introducing the Iris dataset

- 50 samples of 3 different species of iris (150 samples total)
- measurements: sepal length, sepal width, petal length, petal width
- More information in the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)

## Imports and Constants

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.datasets import make_blobs

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report

In [None]:
FIG_SIZE = (12, 10)
plt.rc('figure', figsize=FIG_SIZE)

## Get the data

In [None]:
iris = load_iris()
iris

In [None]:
type(iris)

In [None]:
iris.data

In [None]:
# print the names of the four features
iris.feature_names

In [None]:
# print integers representing the species of each observation
iris.target

In [None]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
iris.target_names.tolist()

## The rules of Machine Learning


1.   Features and labels are **separate objects**.
2.   Features are **numeric**, and if the problem is a regression problem the labels are also **numeric**.
3.   Features and labels are **NumPy arrays**.
4.   Features and labels have **specific shapes**. The features must have **two** dimensions: the first is the number of samples/observations and the second is the number of features. The labels must have one dimension, whose length is the number of samples (equal to the first dimension of the feature object).
5.   The feature object is named `X` by convention. `X` is capitalized because it represents a matrix.
6.   The label object is named `y` by convention. `y` is lowercase because it represents a vector.
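
A quick way to check these rules on the Iris data, as a sketch (it reloads the dataset so that it runs on its own):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # feature matrix: one row per sample, one column per feature
y = iris.target  # label vector: one entry per sample

print(type(X), type(y))  # both are NumPy arrays
print(X.shape)           # two dimensions: (samples, features)
print(y.shape)           # one dimension: (samples,)
print(X.dtype)           # numeric features
```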



## Exploratory Data Analysis

In [None]:
# Let's create a dataframe

# Step 1: Add the features
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df

In [None]:
# Step 2: Add target
df['Target'] = iris.target
df

In [None]:
# Everything is a number
df.info()

In [None]:
# No class imbalance!
df['Target'].value_counts(normalize=True)

In [None]:
# There are no missing values
df.isna().mean()

In [None]:
# The sepal width feature has very low variance. Probably won't help in
# predicting the target.
df.describe()

In [None]:
# Check the correlation
df.corr()
# The petal features and sepal length are strongly correlated with the target;
# sepal width has the weakest (and a negative) correlation

In [None]:
plt.boxplot(df['sepal width (cm)'])
plt.show()

In [None]:
plt.boxplot(df['petal length (cm)'])
plt.show()

In [None]:
plt.title('Using petal length and petal width to predict target')
plt.xlabel('petal width')
plt.ylabel('petal length')

scatter=plt.scatter(df['petal width (cm)'], df['petal length (cm)'], c=df['Target'])

plt.legend(handles=scatter.legend_elements()[0], labels=iris.target_names.tolist())
plt.show()

## Preprocessing

In [None]:
def preprocess_inputs(df):
  df = df.copy()

  # Split into X and y
  y = df['Target']
  X = df.drop(['Target'], axis=1)

  return X, y

In [None]:
X, y = preprocess_inputs(df)

In [None]:
X

In [None]:
y

## Choosing a model

For classification tasks, the most widely used models include:

- KNN
- LogisticRegression
- Tree-based (Decision Tree, Random Forest, AdaBoost, etc.)
- Perceptron
- SVM
- Naive Bayes (for text classification)
- Ensemble (combination of the above)

Today, we'll look at KNN and LogisticRegression. Do you remember how KNN worked?

## The K-nearest neighbors (KNN) algorithm

1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.

In [None]:
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

class KNN:
  def __init__(self, k=3):
    self.k = k

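  def fit(self, X, y):
    # KNN is a lazy learner: "training" only stores the data.
    # np.asarray lets the class accept DataFrames/Series as well as arrays.
    self.X_train = np.asarray(X)
    self.y_train = np.asarray(y)

  def predict(self, X):
    # Predict a label for every row of X
    y_pred = np.array([self._predict(x) for x in np.asarray(X)])
    return y_pred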

  def _predict(self, x):
    # Compute distances between x and all examples in the training set
    distances = [euclidean_distance(x, x_train) for x_train in self.X_train]

    # Sort by distance and return indices of the first k neighbors
    k_idx = np.argsort(distances)[:self.k]

    # Extract the labels of the k nearest neighbor training samples
    k_neighbor_labels = self.y_train[k_idx]
    
    # return the most common class label
    most_common = Counter(k_neighbor_labels).most_common(1)
    
    return most_common[0][0]

In [None]:
X, y = make_blobs(
    n_samples=150, n_features=2, centers=2, cluster_std=1.05, random_state=2
)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
plt.show()

In [None]:
knn = KNN(k=2)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

print("Training classification accuracy", metrics.accuracy_score(y_train, knn.predict(X_train)))
print("Testing classification accuracy", metrics.accuracy_score(y_test, predictions))

plt.scatter(X_test[:, 0], X_test[:, 1], c=predictions)
plt.show()

## The Logistic Regression algorithm

In [None]:
class Logistic_Regression():
  def __init__(self, learning_rate, no_of_iterations):
    self.learning_rate = learning_rate
    self.no_of_iterations = no_of_iterations

  def fit(self, X, Y):
    self.m, self.n = X.shape

    self.w = np.zeros(self.n)
    self.b = 0

    self.X = X
    self.Y = Y

    for i in range(self.no_of_iterations):
      self.update_weights()

  def update_weights(self):
    # Y_hat formula (sigmoid function)
    Y_hat = 1 / (1 + np.exp( - (self.X.dot(self.w) + self.b ) ))    

    dw = (1/self.m)*np.dot(self.X.T, (Y_hat - self.Y))
    db = (1/self.m)*np.sum(Y_hat - self.Y)

    self.w = self.w - self.learning_rate * dw
    self.b = self.b - self.learning_rate * db

  def predict(self, X):
    Y_pred = 1 / (1 + np.exp( - (X.dot(self.w) + self.b ) )) 
    Y_pred = np.where( Y_pred > 0.5, 1, 0)
    return Y_pred

In [None]:
logreg = Logistic_Regression(learning_rate=0.3, no_of_iterations=200)
logreg.fit(X_train, y_train)
predictions = logreg.predict(X_test)

print("Training classification accuracy", metrics.accuracy_score(y_train, logreg.predict(X_train)))
print("Testing classification accuracy", metrics.accuracy_score(y_test, predictions))

plt.scatter(X_test[:, 0], X_test[:, 1], c=predictions)
plt.show()
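
The `update_weights` method above is one step of gradient descent. Assuming the loss being minimized is the mean log-loss (binary cross-entropy), which the class does not state explicitly but which yields exactly the `dw` and `db` expressions in the code, the update can be written as follows. The symbols $J$, $\hat{y}$, and $\alpha$ (the learning rate) are notation introduced here, with $\hat{y}_i$ the sigmoid of $w \cdot x_i + b$ and $m$ the number of samples.

$$J(w, b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i\log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

$$\frac{\partial J}{\partial w} = \frac{1}{m}X^T(\hat{y} - y), \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)$$

$$w \leftarrow w - \alpha\frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha\frac{\partial J}{\partial b}$$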

## scikit-learn 5-step modeling pattern

In [None]:
X, y = preprocess_inputs(df)

**Step 1:** Import the class you plan to use

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for model

In [None]:
model = KNeighborsClassifier(n_neighbors=1)

- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults
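
To see which values the unspecified parameters received, one option is the estimator's `get_params()` method. A minimal sketch; the variable names `default_model` and `tuned_model` are made up for this example:

```python
from sklearn.neighbors import KNeighborsClassifier

default_model = KNeighborsClassifier()            # nothing specified: all defaults
tuned_model = KNeighborsClassifier(n_neighbors=1) # one hyperparameter overridden

# get_params() returns a dict with every hyperparameter, including the defaults
default_k = default_model.get_params()['n_neighbors']
tuned_k = tuned_model.get_params()['n_neighbors']

print(default_k)
print(tuned_k)
```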

**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between X and y
- Occurs in-place

In [None]:
# Let's first train the model on the whole dataset.
model.fit(X, y)

**Step 4:** Predict the response for a new observation

- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process
- **NOTE**: Only **2D** data arrays can be passed to models when making a prediction. They should have the same feature names as the dataframe that was used in the `.fit` method.
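
If the new observation is a plain 1D NumPy array rather than a DataFrame, one way to give it the required 2D shape is `reshape(1, -1)`. This is only a sketch with made-up measurements; note that scikit-learn may warn about missing feature names when a model fitted on a DataFrame is given a bare array.

```python
import numpy as np

observation = np.array([5.5, 3.0, 1.5, 0.0])  # 1D: shape (4,)
observation_2d = observation.reshape(1, -1)    # 2D: 1 sample, 4 features

print(observation.shape, observation_2d.shape)
```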

In [None]:
X_test = pd.DataFrame([[5.5, 3, 1.5, 0]], columns=iris.feature_names)
X_test

In [None]:
# Let's create a fake example and see what the model will output.
model.predict(X_test)

- Returns a NumPy array
- Can predict for multiple observations at once
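
A sketch of predicting for several observations in one call. To stay self-contained it refits a K=1 model on the raw Iris arrays, and the two "new" observations are simply the first and last rows of the dataset (an assumption made for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(iris.data, iris.target)

# Two observations at once: a 2D array with 2 rows and 4 columns
batch = iris.data[[0, 149]]
batch_pred = knn.predict(batch)

print(type(batch_pred), batch_pred.shape)  # a NumPy array with one prediction per row
print(iris.target_names[batch_pred])       # map the integer codes back to species names
```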

**Step 5:** Evaluate the model using a metric. **EVALUATION MUST BE DONE ON A VALIDATION / TEST SET**. That way, we know whether our model can truly generalize. Scores on the training set are useful only for comparison with the results on the validation / test set.

- **accuracy**: can suffice for classification when there is no class imbalance
- **F1 score**: a must when there is class imbalance
- **classification report**: text summary of the precision, recall, and F1 score for each class, plus the overall accuracy
- **confusion matrix**: shows how often the model predicts each class for each true class
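
The metrics listed above can be computed with `sklearn.metrics`. The labels below are toy values invented for this sketch (class 1 is the minority class), not results from the notebook:

```python
from sklearn import metrics

# Toy labels, made up for illustration
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 1, 1, 0]

acc = metrics.accuracy_score(y_true, y_hat)
f1 = metrics.f1_score(y_true, y_hat)          # F1 of the positive class (label 1)
cm = metrics.confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

print(acc)
print(f1)
print(cm)
```

Here four of the six predictions are correct, yet only half of the minority class is recovered, which is why the F1 score comes out lower than the accuracy.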

In [None]:
# Note that this doesn't tell us much, since we're
# evaluating on the training set
model.score(X, y)

In [None]:
y_pred = model.predict(X)
print(classification_report(y, y_pred))

## Hyperparameter tuning

In [None]:
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)
model.score(X, y)

In [None]:
y_pred = model.predict(X)
print(classification_report(y, y_pred))

In [None]:
model.predict(X_test)

In [None]:
# Let's try a different model
model = LogisticRegression()
model.fit(X, y)
model.score(X, y)

In [None]:
# Note that the model failed to converge.
# Let's try to increase the iterations.
model = LogisticRegression(max_iter=200)
model.fit(X, y)
model.score(X, y)

In [None]:
y_pred = model.predict(X)
print(classification_report(y, y_pred))

In [None]:
model.predict(X_test)

## Evaluation procedure #1: Train and test on the entire dataset

1. Train the model on the **entire dataset**.
2. Test the model on the **same dataset**, and evaluate how well we did by comparing the **predicted** response values with the **true** response values.

This is the procedure we used in the previous notebook.

### Logistic regression

In [None]:
model = LogisticRegression(max_iter=200)
model.fit(X, y)
y_pred = model.predict(X)
len(y_pred)

Classification accuracy:

- **Proportion** of correct predictions
- Common **evaluation metric** for classification problems
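
As a sketch of what "proportion of correct predictions" means, accuracy can be computed by hand as the mean of an element-wise comparison. The arrays are made-up toy values:

```python
import numpy as np

y_true = np.array([0, 1, 2, 2])  # toy true labels
y_hat = np.array([0, 1, 1, 2])   # toy predictions: one mistake out of four

# True counts as 1 and False as 0, so the mean is the share of correct predictions
accuracy = np.mean(y_true == y_hat)
print(accuracy)
```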

In [None]:
model.score(X, y)

In [None]:
# Alternative way.
metrics.accuracy_score(y, y_pred)

- Known as **training accuracy** when you train and test the model on the same data

### KNN (K=5)

In [None]:
model = KNeighborsClassifier(5)
model.fit(X, y)
model.score(X, y)

### KNN (K=1)

In [None]:
model = KNeighborsClassifier(1)
model.fit(X, y)
model.score(X, y)

KNN (K=1) will have 100% training accuracy here: to make a prediction for any observation in the training set, KNN searches for the nearest observation in the training set and finds that exact same observation! (The only exception would be identical observations that carry different labels.) In other words, KNN has memorized the training set, and because we're testing on the exact same data, it makes correct predictions.

### Problems with training and testing on the same data

- Goal is to estimate likely performance of a model on **out-of-sample data**
- But, maximizing training accuracy rewards **overly complex models** that won't necessarily generalize
- Unnecessarily complex models **overfit** the training data

> **Rule of thumb**: For KNN a lower value of K creates a more complex model.

## Evaluation procedure #2: Train/test split

1. Split the dataset into two pieces: a **training set** and a **testing set**.
2. Train the model on the **training set**.
3. Test the model on the **testing set**, and evaluate how well we did.

In [None]:
df

In [None]:
def preprocess_inputs(df):
  df = df.copy()

  # Split into X and y
  y = df['Target']
  X = df.drop(['Target'], axis=1)

  # Train/test split
  # usually test_size is between 0.10 and 0.40
  # use random state to get same results on every run
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

  return X_train, X_test, y_train, y_test

In [None]:
X_train, X_test, y_train, y_test = preprocess_inputs(df)

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

In [None]:
X_train.head()

In [None]:
y_train.head()

What did this accomplish?

- Model can be trained and tested on **different data**
- Response values are known for the testing set, and thus **predictions can be evaluated**
- **Testing accuracy** is a better estimate than training accuracy of out-of-sample performance

In [None]:
# Logistic regression
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
# KNN (K=5)
model = KNeighborsClassifier(5)
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
# KNN (K=1)
model = KNeighborsClassifier(1)
model.fit(X_train, y_train)
model.score(X_test, y_test)

> **Conclusion**: Logistic regression is likely the best model for this split.

### Can we find the best value for k?

In [None]:
scores = []
ks = range(1, 31)

for k in ks:
  model = KNeighborsClassifier(k)
  model.fit(X_train, y_train)
  scores.append(model.score(X_test, y_test))

In [None]:
plt.title('Searching for the best value of K for KNN')
plt.xlabel('Value for K')
plt.ylabel('Accuracy')

plt.plot(ks, scores)
plt.show()

- **Training accuracy** rises as model complexity increases
- **Testing accuracy** penalizes models that are too complex or not complex enough
- For KNN models, complexity is determined by the **value of K** (lower value = more complex)
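
To compare the two kinds of accuracy side by side, the sketch below records both for a few values of K. It reloads Iris and repeats the split settings used earlier so that it runs on its own; the variable names and the chosen K values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# Same split settings as in preprocess_inputs
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=42
)

train_acc, test_acc = {}, {}
for k in (1, 5, 15, 30):
  model = KNeighborsClassifier(n_neighbors=k)
  model.fit(X_tr, y_tr)
  train_acc[k] = model.score(X_tr, y_tr)  # accuracy on data the model has seen
  test_acc[k] = model.score(X_te, y_te)   # accuracy on held-out data
  print(k, round(train_acc[k], 3), round(test_acc[k], 3))
```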

In [None]:
# KNN (K=15)
model = KNeighborsClassifier(15)
model.fit(X_train, y_train)
model.score(X_test, y_test)

### Problems with train/test split

- Provides a **high-variance estimate** of out-of-sample accuracy
- **K-fold cross-validation** overcomes this limitation. We'll see it next time.
- But, train/test split is still useful because of its **flexibility and speed**
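
One way to see the variance of the estimate is to repeat the split with different random seeds and look at how much the testing accuracy moves. A minimal sketch, assuming KNN with K=5 and ten seeds (both arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

split_scores = []
for seed in range(10):
  # Only the random seed of the split changes between runs
  X_tr, X_te, y_tr, y_te = train_test_split(
      iris.data, iris.target, test_size=0.4, random_state=seed
  )
  model = KNeighborsClassifier(n_neighbors=5)
  model.fit(X_tr, y_tr)
  split_scores.append(model.score(X_te, y_te))

# The gap between the lowest and highest score shows how much the estimate depends on the split
print(min(split_scores), max(split_scores))
```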

# Homework

Do classification on the Titanic dataset.