# Gaussian Models for Classification
### From Naive Bayes to Gaussian Mixture Model

Contents:
- Maximum Likelihood and Maximum a Posteriori classifiers
- Gaussian Naive Bayes
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- Gaussian Mixture Model
- How to improve the model?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image

## Dataset
In this notebook I use the MAGIC Gamma Telescope Dataset. It is a binary classification problem with 10 real-valued features: we want to classify every item as Gamma (signal) or Hadron.

### Reading the Dataset

In [None]:
features = [ 'fLength', 'fWidth', 'fSize', 'fConc', 'fConc1', 'fAsym', 'fM3Long', 'fM3Trans', 'fAlpha', 'fDist' ]
raw_data = pd.read_csv('../input/magic-gamma-telescope-dataset/telescope_data.csv', names=features + ['class'], skiprows=1)

In [None]:
raw_data

### Encode class labels

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(raw_data['class'])
raw_data['class'] = le.transform(raw_data['class'])
X = raw_data[features].values
y = raw_data['class'].values

### Checking class imbalance

In [None]:
raw_data['class'].plot.hist()

As you can see, the dataset is not balanced. Hence, to evaluate the models we'll check the F1-score in addition to accuracy.
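
To see why accuracy alone can mislead on imbalanced data, here is a minimal sketch. The labels below are hypothetical toy values, not the telescope data, and the imbalance is exaggerated on purpose.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy labels: 90 items of class 0 and 10 items of class 1.
y_true = np.array([0] * 90 + [1] * 10)
# A useless classifier that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no item is predicted as class 1.
f1 = f1_score(y_true, y_pred, zero_division=0)
print('accuracy:', acc)
print('f1-score:', f1)
```

The accuracy looks good only because the majority class dominates, while the F1-score of the minority class exposes that this classifier never detects it.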

### Create the train and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

It is better to write a function for easier evaluation of the classifiers. 

In [None]:
def evaluate_clf(X_train, X_test, y_train, y_test, clf, name="Classifier"):
  from sklearn.metrics import f1_score, accuracy_score
  # fit the classifier
  clf.fit(X_train, y_train)
  pred = clf.predict(X_test)
  # evaluate prediction using acc and f1 score
  score_f1 = f1_score(y_test, pred)
  score_acc = accuracy_score(y_test, pred)
  print('{} acc-score: {}'.format(name, score_acc))
  print('{} f1-score: {}'.format(name, score_f1))

## Define The Enemy!
The Decision Tree is an excellent classifier. First we'll evaluate its performance, then try to beat it using Gaussian models.

In [None]:
from sklearn.tree import DecisionTreeClassifier
evaluate_clf(X_train, X_test, y_train, y_test, DecisionTreeClassifier(), "Decision Tree")

## Maximum Likelihood and Maximum a Posteriori Classifiers
If you can find a proper probability distribution for every class, then you can calculate the likelihood of a new data item under each class and pick the class for which the likelihood is maximum. Sometimes you have prior knowledge about your classes, and you can encode it as prior distributions. In this notebook we use the Multivariate Normal (Gaussian) distribution to model the density of each class. We will start with a simple model and then improve it until it outperforms the Decision Tree classifier.
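
The difference between the two decision rules can be sketched in a few lines. The means, covariances, and priors below are made-up values for illustration only; the Maximum a Posteriori rule simply adds the log prior to the log likelihood before taking the argmax.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D class densities (means and covariances are made up).
densities = [
    multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),
    multivariate_normal(mean=[2.0, 0.0], cov=np.eye(2)),
]
# Hypothetical class priors: class 0 is assumed to be much more common.
priors = np.array([0.9, 0.1])

# A new item that lies slightly closer to the mean of class 1.
x = np.array([1.2, 0.0])

# Log likelihood of x under each class density.
log_lik = np.array([d.logpdf(x) for d in densities])

# Maximum likelihood: pick the class with the highest likelihood.
ml_class = int(np.argmax(log_lik))
# Maximum a posteriori: weight the likelihood by the class prior.
map_class = int(np.argmax(log_lik + np.log(priors)))
print('ML class:', ml_class, 'MAP class:', map_class)
```

In this toy setting the two rules disagree: the likelihood favors class 1, but the strong prior for class 0 flips the decision.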


## Gaussian Naive Bayes Classifier
Let's start with a simple Gaussian model. In this model we assume that the features are independent within each class, which means a diagonal covariance matrix for each Gaussian distribution. It's called Naive because of this feature independence assumption.
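
As a quick check of the claim that independence means a diagonal covariance matrix, the sketch below compares the two ways of computing the density for one class. The data is randomly generated and hypothetical; it is not how `GaussianNB` is implemented internally, only an illustration of the equivalence.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
# Hypothetical data for one class: 200 items with 3 features.
X_c = rng.normal(size=(200, 3)) * np.array([1.0, 2.0, 0.5])

mean = X_c.mean(axis=0)
var = X_c.var(axis=0)  # one variance per feature, no covariances

x = np.array([0.3, -1.0, 0.2])
# Naive Bayes view: sum of independent per-feature log densities.
log_p_naive = norm.logpdf(x, loc=mean, scale=np.sqrt(var)).sum()
# Same value from a multivariate normal with a diagonal covariance matrix.
log_p_diag = multivariate_normal(mean=mean, cov=np.diag(var)).logpdf(x)
print(log_p_naive, log_p_diag)
```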

In [None]:
Image('../input/gaussiannotebookimg/2.png')

In [None]:
from sklearn.naive_bayes import GaussianNB
evaluate_clf(X_train, X_test, y_train, y_test, GaussianNB(), "Gaussian NB")

Well, compared to the Decision Tree result, this is disappointing. But we can improve it. First, we need to understand what the problem is.

### Why doesn't Gaussian Naive Bayes perform well?
In this case, it's mainly because of the assumption that the features are independent.
From a geometrical point of view, it means the ellipsoid of the Normal distribution cannot rotate and can only be scaled along the feature axes.
To understand this better, let's check the correlation of the features.

In [None]:
raw_data[features].corr()

We can see that some features are highly correlated. For example, see the scatter plot of **fSize** and **fConc**:

In [None]:
raw_data[features].plot.scatter('fSize', 'fConc')

### How to fix the feature correlation problem?
In the Naive Bayes model, the Covariance Matrix of each class's Normal distribution was diagonal. Instead, we can use a full Covariance Matrix. For now, let's use a single Covariance Matrix shared by all classes; that is, we assume the features are correlated in the same way in every class. Such a model is called *LDA* or *Linear Discriminant Analysis*.
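
A shared Covariance Matrix can be estimated by pooling the per-class covariances. The sketch below uses hypothetical two-class data and a helper named `pooled_covariance` that is introduced here for illustration; scikit-learn's LDA may compute its estimate differently depending on the solver.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-class data with 2 correlated features.
true_cov = [[1.0, 0.8], [0.8, 1.0]]
X0 = rng.multivariate_normal([0.0, 0.0], true_cov, size=300)
X1 = rng.multivariate_normal([2.0, 1.0], true_cov, size=200)

def pooled_covariance(groups):
  # weighted average of the per-class covariance matrices,
  # where each class is centered on its own mean
  n_total = sum(len(g) for g in groups)
  scatter = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups)
  return scatter / (n_total - len(groups))

shared_cov = pooled_covariance([X0, X1])
print(shared_cov)
```

Unlike the diagonal matrices of Naive Bayes, the off-diagonal entries are not forced to zero here, so the ellipsoid is free to rotate, but all classes share the same shape.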

## Linear Discriminant Analysis (LDA)

In [None]:
Image('../input/gaussiannotebookimg/3.png')

### Performing LDA

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
evaluate_clf(X_train, X_test, y_train, y_test, LinearDiscriminantAnalysis(), "LDA")

Much better than Naive Bayes but still worse than Decision Tree.

### Why not use different Covariance Matrices?
We can. However, a separate Covariance Matrix per class increases the number of parameters, especially when there are many classes or features, and can lead to overfitting. This method is called *Quadratic Discriminant Analysis* or *QDA*.
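
To make the parameter growth concrete, here is a small counting sketch. The helper `n_gaussian_params` is a name made up for this example; it counts only means and (co)variances and ignores class priors.

```python
def n_gaussian_params(d, k):
  # d: number of features, k: number of classes
  means = k * d                # one mean vector per class
  full_cov = d * (d + 1) // 2  # free entries of a symmetric d x d matrix
  return {
    'Naive Bayes': means + k * d,   # d variances per class
    'LDA': means + full_cov,        # one shared full covariance
    'QDA': means + k * full_cov,    # one full covariance per class
  }

# This dataset has 10 features and 2 classes.
counts = n_gaussian_params(d=10, k=2)
print(counts)  # {'Naive Bayes': 40, 'LDA': 75, 'QDA': 130}
```

The QDA count grows with the number of classes times the square of the number of features, which is where the overfitting risk comes from.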

In [None]:
Image('../input/gaussiannotebookimg/4.png')

### QDA Result

In [None]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
evaluate_clf(X_train, X_test, y_train, y_test, QuadraticDiscriminantAnalysis(), "QDA")

### Why was the LDA result better?
It seems that for this dataset, increasing the flexibility of the Normal distribution doesn't have a big impact on performance and instead causes overfitting. Unfortunately, the Decision Tree is still better. Let's find another way.

### So, how can we improve the model now?
So far, we have used one Normal distribution for every class. If that's not enough, let's use more of them: we can use a convex combination of several Normal distributions. This is called a Gaussian Mixture Model.
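
A convex combination here means a weighted sum of Gaussian densities whose weights are nonnegative and sum to one. The sketch below evaluates such a density; the weights, means, and covariances are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical mixture weights: nonnegative and summing to 1.
weights = np.array([0.3, 0.7])
# Hypothetical Gaussian components.
components = [
    multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),
    multivariate_normal(mean=[3.0, 3.0], cov=[[1.0, 0.5], [0.5, 2.0]]),
]

def mixture_pdf(x):
  # density of the mixture: weighted sum of the component densities
  return sum(w * c.pdf(x) for w, c in zip(weights, components))

p = mixture_pdf([0.0, 0.0])
print(p)
```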

In [None]:
Image('../input/gaussiannotebookimg/5.png')

### Expectation Maximization Algorithm
Estimating the parameters of a single Gaussian distribution is trivial: it only requires computing the mean and the covariance. Parameter estimation for a Gaussian Mixture model is harder. Minimizing the negative log likelihood directly is difficult for gradient-based optimizers, because every Covariance Matrix must stay positive-definite, a constraint that many optimizers do not handle easily [2]. Instead we can use the Expectation Maximization (EM) algorithm, which is commonly used to estimate the parameters of graphical models with latent variables (in our case, the latent variable is the Gaussian component that generated each item). Fortunately, the EM algorithm for Gaussian Mixture models is implemented in Scikit-learn, so there is no need to implement it manually, although implementing EM is not hard. For the details of EM, see reference [1].
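
To give a feeling for how EM works, here is a minimal sketch for a one-dimensional mixture. It is a simplified illustration, not scikit-learn's implementation: the function name `em_gmm_1d`, the evenly spaced initialization, the fixed number of iterations, and the generated data are all assumptions made for this example.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_components=2, n_iter=100):
  n = len(x)
  # simple initialization: evenly spaced means, global variance, equal weights
  means = np.linspace(x.min(), x.max(), n_components)
  variances = np.full(n_components, x.var())
  weights = np.full(n_components, 1.0 / n_components)
  for _ in range(n_iter):
    # E-step: responsibility of each component for each item
    dens = weights * norm.pdf(x[:, None], loc=means, scale=np.sqrt(variances))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted maximum likelihood updates
    nk = resp.sum(axis=0)
    weights = nk / n
    means = (resp * x[:, None]).sum(axis=0) / nk
    variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
  return weights, means, variances

# Hypothetical data drawn from two well separated Gaussians.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
weights, means, variances = em_gmm_1d(x)
print(weights, means, variances)
```

The E-step fills in the latent component assignments softly, and the M-step then reduces to the same mean and variance computation as for a single Gaussian, only weighted by those assignments.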

### Creating a Scikit-Learn classifier based on Gaussian Mixture 
Now we use scikit-learn's implementation of the EM algorithm for Gaussian Mixture models to create a custom estimator for classification.

In [None]:
from sklearn.base import BaseEstimator
from sklearn.mixture import GaussianMixture

class GaussianMixtureClassifier(BaseEstimator):
  
  def __init__(self, n_components=1):
    self.n_components = n_components

  def fit(self, X, y):
    # find number of classes (labels are assumed to be integers 0..K-1)
    self.n_classes = int(y.max() + 1)
    # create a GM for each class
    self.gm_densities = [GaussianMixture(self.n_components, covariance_type='full') for _ in range(self.n_classes)]
    # fit the Mixture densities for each class
    for c in range(self.n_classes):
      # select the items that belong to class c
      temp = X[y == c]
      # estimate density parameters using EM
      self.gm_densities[c].fit(temp)
    # return self, following the scikit-learn estimator convention
    return self

  def predict(self, X):
    # calculate the log likelihood of every item under each class density
    log_likelihoods = np.column_stack([gm.score_samples(X) for gm in self.gm_densities])
    # return the class whose density maximizes the log likelihood
    return log_likelihoods.argmax(axis=1)

Now, we create a Gaussian Mixture Classifier with a mixture of 2 Gaussian distributions per class.

In [None]:
evaluate_clf(X_train, X_test, y_train, y_test, GaussianMixtureClassifier(n_components=2), "Gaussian Mixture")

### Finally we beat the Decision Tree!!!
Now you can see that the mixture model outperforms the Decision Tree.

## How to improve the result further?
- Encode your domain knowledge about the problem as prior distributions.
- Try optimizing the hyperparameters of the model: the number of Gaussians and the Covariance Matrix type for each class.
- Not everything is Gaussian. Again, use your domain knowledge to find a proper distribution for each feature.
- Do a little feature engineering.
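
As one possible way to tune the hyperparameters mentioned above, the sketch below picks the number of Gaussians and the covariance type for a single class by the Bayesian Information Criterion (BIC). The helper `select_gmm_by_bic`, the search grid, and the generated data are assumptions for illustration. Note that BIC measures the quality of the density fit, not classification performance, so cross-validating the F1-score is a reasonable alternative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm_by_bic(X_class, component_grid=(1, 2, 3),
                      covariance_types=('full', 'diag'), seed=0):
  # fit every candidate and keep the one with the lowest BIC
  best_model, best_bic = None, np.inf
  for n in component_grid:
    for cov in covariance_types:
      gm = GaussianMixture(n_components=n, covariance_type=cov,
                           random_state=seed).fit(X_class)
      bic = gm.bic(X_class)
      if bic < best_bic:
        best_model, best_bic = gm, bic
  return best_model

# Hypothetical items of one class that form two separate clusters.
rng = np.random.default_rng(0)
X_class = np.vstack([rng.normal(-3.0, 1.0, size=(200, 2)),
                     rng.normal(3.0, 1.0, size=(200, 2))])
best = select_gmm_by_bic(X_class)
print(best.n_components, best.covariance_type)
```

Running this once per class would give each class its own mixture size and covariance type.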

## References
These resources helped me a lot in writing this notebook.  
- [1] Machine Learning: A Probabilistic Perspective, Kevin P. Murphy  
See Chapter 4 for Gaussian Models and Chapter 11 for Mixture Models and EM Algorithm  
- [2] Coursera, Advanced Machine Learning Specialization, Bayesian Methods for Machine Learning Course

I would be happy to hear your comments :)