<a href="https://colab.research.google.com/github/raulbenitez/PRML_Probabilistic_classifiers/blob/main/PRML_P2_1_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Probabilistic classification models - General workflow

## 1. Load data (n observations, d features)

- data matrix: $X_{n\times d}$ 
- class labels: $y = \{\omega_1,\dots,\omega_{n}\}$

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

In the iris dataset, we have $n=150$ observations, $d=4$ features and the class labels take 3 possible values:

In [23]:
import numpy as np 

print('Number of observations n = {}'.format(X.shape[0]))
print('Number of features d = {}'.format(X.shape[1]))
print('Possible class labels = {}'.format(np.unique(y)))

Number of observations n = 150
Number of features d = 4
Possible class labels = [0 1 2]


## 2. Split data into training and test subsets (for instance, 70% train, 30% test)

In [3]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.7)


We have 105 observations in the training set and 45 observations in the test set:

In [24]:
print('Training observations = {}'.format(Xtrain.shape[0]))
print('Test observations = {}'.format(Xtest.shape[0]))

Training observations = 105
Test observations = 45
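
The split above is random, so the subsets (and every result computed from them) change from run to run. The following is a minimal sketch of a repeatable, class-balanced variant. The `random_state=0` and `stratify=y` arguments are assumptions added for illustration and are not used in the rest of this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle so the same split is produced on every run;
# stratify=y keeps the class proportions equal in both subsets.
# Both values are illustrative choices, not part of the original workflow.
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

print('Training class counts = {}'.format(np.bincount(ytrain)))
print('Test class counts = {}'.format(np.bincount(ytest)))
```

Because iris has 50 observations per class, the stratified split places 35 observations of each class in the training set and 15 in the test set.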


## 3. Define the classification model

In this case, we use a classifier implementing the k-nearest neighbors rule (here with k = 1):
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [4]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=1)

## 4. Fit/train/learn/estimate the model using training data

In [5]:
model.fit(Xtrain, ytrain)

KNeighborsClassifier(n_neighbors=1)
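
A fitted k-nearest neighbors classifier can also return class-membership estimates through `predict_proba`, which is what makes it usable as a probabilistic classifier. The sketch below is illustrative only: it assumes `n_neighbors=5` and `random_state=0`, neither of which is used elsewhere in this notebook. With uniform weights, each estimate is the fraction of the k nearest training points that belong to the class, so with k = 1 every estimate would be exactly 0 or 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, train_size=0.7, random_state=0  # fixed seed, illustrative choice
)

# k = 5 is an illustrative choice; the notebook itself uses k = 1
model = KNeighborsClassifier(n_neighbors=5)
model.fit(Xtrain, ytrain)

# One row per observation, one column per class; each row sums to 1
proba = model.predict_proba(Xtest[:3])
print('Class order = {}'.format(model.classes_))
print(proba)
```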

## 5. Evaluate classification performance using the test subset

In [20]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# use the model to predict class labels of the test set 
ytest_pred = model.predict(Xtest)


cm = confusion_matrix(ytest, ytest_pred)

acc = accuracy_score(ytest, ytest_pred)

print('Confusion matrix:')
print(cm)

print('Accuracy = {}'.format(acc))


Confusion matrix:
[[14  0  0]
 [ 0 14  0]
 [ 0  3 14]]
Accuracy = 0.9333333333333333
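
The accuracy can be recovered directly from the confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total number of test observations, here $42/45 \approx 0.933$. As a short worked check, the sketch below recomputes it from the matrix printed above and also derives per-class recall and precision. It follows the scikit-learn convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class)
cm = np.array([[14, 0, 0],
               [0, 14, 0],
               [0, 3, 14]])

# Accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# Recall: correct predictions over the number of true members of each class
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: correct predictions over the number of predictions of each class
precision = np.diag(cm) / cm.sum(axis=0)

print('Accuracy = {}'.format(accuracy))
print('Recall per class = {}'.format(recall))
print('Precision per class = {}'.format(precision))
```

The three misclassified observations belong to class 2 and were predicted as class 1, which lowers the recall of class 2 and the precision of class 1 to $14/17$.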


# Exercise 1: 

a) Load the mpg cars dataset included in the seaborn library: https://seaborn.pydata.org/generated/seaborn.load_dataset.html

b) Apply the following probabilistic classification methods to predict the origin of the car from the numerical features:

- k-nearest neighbors (KNN)

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

- Linear Discriminant Analysis (LDA) 

https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html

- Quadratic Discriminant Analysis (QDA)

https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html

- Gaussian Naive Bayes 

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB

c) Compare the different methods using the accuracy score.
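
One possible way to organise such a comparison is to loop over a dictionary of models that share the same train/test split. The sketch below does this on the iris data from the workflow above rather than on the mpg data, so the exercise itself remains open. The dictionary keys, `n_neighbors=5` and `random_state=0` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, train_size=0.7, random_state=0  # fixed seed, illustrative choice
)

# Hyperparameters are illustrative, not tuned
models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'LDA': LinearDiscriminantAnalysis(),
    'QDA': QuadraticDiscriminantAnalysis(),
    'Gaussian NB': GaussianNB(),
}

# Fit every model on the same training set and score it on the same test set
scores = {}
for name, clf in models.items():
    clf.fit(Xtrain, ytrain)
    scores[name] = accuracy_score(ytest, clf.predict(Xtest))
    print('{}: accuracy = {:.3f}'.format(name, scores[name]))
```

Using a single split for all models keeps the comparison fair, although with a test set this small the differences between methods can depend heavily on the particular split.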

# Exercise 2: 

Download a classification dataset from either Kaggle or the UCI Machine Learning Repository, apply a probabilistic classification method to it, and evaluate its performance.