# Naive Bayes Intuition

Naive Bayes is a probabilistic classifier. It uses Bayes' theorem to infer the probability of an event given that other events have occurred.

Naive Bayes does not consider correlations between events: it assumes that all events/features are independent of each other given the class.

This means that the individual probabilities can be multiplied together to get the joint probability, which is particularly convenient with word vocabularies.

### Example
Given a training set of 30 emails and the set of words w that tend to appear in them, determine whether a new email is spam or not.

Using Bayes' theorem, the posterior probability that an email containing the words X is spam is P(spam|X) = (P(X|spam) * P(spam)) / P(X). The email is assigned to whichever class has the higher posterior: if P(spam|X) > P(not spam|X) the email is classified as spam, and if P(not spam|X) > P(spam|X) it is classified as not spam. Because P(X) is the same for every class, it can be dropped from the comparison, so the predicted class is the argmax over the classes C given an observation X:

$$Y = \arg\max_C P(C \mid X) = \arg\max_C P(X \mid C)\,P(C)$$

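Because the features are assumed independent given the class, the likelihood of an observation X = (x_1, ..., x_n) under class C factorises into a product over the individual features:

$$P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C)$$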

If there are 10 spam emails and 20 non-spam emails, then P(spam) = 10/30 = 1/3 and P(not spam) = 20/30 = 2/3.
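
The spam example can be sketched in a few lines of Python. The priors come from the 10/20 split above. The per-word likelihoods and the `classify` helper are made up for illustration and are not taken from any dataset.

```python
import math

# Priors from the example: 10 spam and 20 non-spam emails out of 30.
priors = {"spam": 10 / 30, "not spam": 20 / 30}

# Hypothetical per-word likelihoods P(word | class); illustrative values only.
likelihoods = {
    "spam": {"free": 0.6, "meeting": 0.1},
    "not spam": {"free": 0.1, "meeting": 0.5},
}

def classify(words):
    """Return the class maximising P(X|C) * P(C), plus the score of each class."""
    scores = {}
    for c in priors:
        # Independence assumption: multiply the per-word likelihoods.
        likelihood = math.prod(likelihoods[c][w] for w in words)
        scores[c] = likelihood * priors[c]
    return max(scores, key=scores.get), scores

label, scores = classify(["free"])
print(label, scores)
```

With these assumed numbers, an email containing only "free" scores 0.6 * 1/3 = 0.2 for spam against 0.1 * 2/3 for not spam, so it is labelled spam even though spam has the smaller prior.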


## Implementation
Using the MNIST dataset means that the values in the arrays will be between 0 and 255. For ease of computation, these values should be normalised to lie between 0 and 1, which can be done by dividing by 255. The likelihood of each value given a class can then be modelled with the Gaussian distribution equation:

$$P(X) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X - \mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mean of X and $\sigma^2$ is the variance of X.
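
As a minimal sketch, the univariate Gaussian density can be evaluated with the standard library alone. The function name `gaussian_pdf` and the mean and variance used here are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a univariate Gaussian with the given mean and variance."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return coeff * math.exp(-0.5 * (x - mean) ** 2 / variance)

# Normalise a raw pixel value from [0, 255] to [0, 1] before evaluating.
pixel = 128 / 255
density = gaussian_pdf(pixel, mean=0.5, variance=0.04)  # illustrative parameters
print(density)
```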

There is also the multivariate Gaussian distribution which, much like the univariate Gaussian distribution, gives the probability density of a D-dimensional input vector X:

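$$P(X) = \frac{1}{\sqrt{(2\pi)^D \, |\Sigma|}} \exp\left(-\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu)\right)$$

where $\mu$ is the mean vector, $\Sigma$ is the covariance matrix and $|\Sigma|$ is its determinant.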

The SciPy library can calculate the multivariate Gaussian distribution directly (`scipy.stats.multivariate_normal`).
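
The sketch below writes the multivariate density out by hand with NumPy and compares it with `scipy.stats.multivariate_normal.pdf`. The helper name `mvn_pdf` and the 2-D mean and covariance are assumptions for illustration, not values estimated from MNIST.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mean, cov):
    """Multivariate Gaussian density written out from the formula."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

# Illustrative 2-D parameters, not estimated from real data.
mean = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
x = np.array([0.5, 0.5])

manual = mvn_pdf(x, mean, cov)
library = multivariate_normal.pdf(x, mean=mean, cov=cov)
print(manual, library)  # the two values should agree
```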

NB: The covariance matrix describes the relationship between the components of a vector. Since Naive Bayes assumes that these components are all independent, every off-diagonal entry is zero:

$$\mathrm{Cov}(i,j) = E[(x_i - \mu_i)(x_j - \mu_j)] = 0 \quad \text{if } x_i \text{ is independent of } x_j,\ i \neq j$$

As a result, only the D diagonal entries (the per-feature variances) are needed, stored as a D-size vector. This corresponds to an axis-aligned elliptical covariance.
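
To illustrate why a D-size vector is enough, the sketch below checks that a Gaussian with a diagonal covariance gives the same log density whether it is computed from the variance vector or from the full D x D matrix. The function names and the numbers are assumptions for illustration.

```python
import numpy as np

def diag_gaussian_log_pdf(x, mean, var):
    """Log density of a Gaussian whose covariance is diag(var)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def full_gaussian_log_pdf(x, mean, cov):
    """Log density using the full D x D covariance matrix."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)  # log of the determinant
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

mean = np.array([0.2, 0.5, 0.8])
var = np.array([0.1, 0.2, 0.05])  # the D-size vector of variances
x = np.array([0.3, 0.4, 0.9])

diag_result = diag_gaussian_log_pdf(x, mean, var)
full_result = full_gaussian_log_pdf(x, mean, np.diag(var))
print(diag_result, full_result)  # the two values should agree
```

The vector form avoids building and inverting a D x D matrix, which matters when D is the number of pixels in an image.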

Additionally, since the logarithm is monotonically increasing, we can take the log of each term without changing which class attains the maximum. This turns the product of probabilities into a sum of log probabilities.


$$\text{Prediction} = \arg\max_C \left\{ \log P(X \mid C) + \log P(C) \right\}$$
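
One practical reason to work with logs is numerical underflow. The sketch below assumes 784 features (one per pixel of a 28 x 28 MNIST image) and gives each an illustrative likelihood of 0.01. The prior of 0.1 is also an assumption.

```python
import math

# One likelihood per pixel of a 28 x 28 image; 0.01 is an illustrative value.
likelihoods = [0.01] * 784
prior = 0.1

# The direct product underflows to exactly 0.0 in double precision.
product_score = prior * math.prod(likelihoods)

# The sum of logs stays finite, so classes can still be compared.
log_score = math.log(prior) + sum(math.log(p) for p in likelihoods)

print(product_score, log_score)
```

If every class underflows to 0.0, the argmax over the raw products is meaningless, while the log scores remain comparable.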

### Smoothing 
This tackles the singular covariance problem, which arises when the covariance matrix cannot be inverted (for example, when a feature has zero variance), and adds numerical stability. To prevent this, add the identity matrix multiplied by a small number lambda, e.g. 10**-3, to the covariance matrix.
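
A minimal sketch of the smoothing step, assuming toy data in which the second feature is constant (like a pixel that is zero in every image). The variable names and the data are illustrative.

```python
import numpy as np

# Toy data: the second feature never changes, so its variance is zero.
X = np.array([[0.2, 0.0], [0.4, 0.0], [0.9, 0.0]])
lam = 1e-3  # smoothing strength, the lambda from the text

cov = np.cov(X, rowvar=False)  # 2 x 2 covariance matrix, singular here
smoothed = cov + lam * np.eye(cov.shape[0])

det_before = np.linalg.det(cov)
det_after = np.linalg.det(smoothed)
inverse = np.linalg.inv(smoothed)  # well defined after smoothing

print(det_before, det_after)
```

With a diagonal covariance the same idea reduces to adding lambda to every entry of the variance vector.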

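```python
import numpy as np

def fit(X, Y, smoothing=1e-3):
    """Estimate a diagonal Gaussian and a prior for each class."""
    dict_of_gaussians = {}
    priors = {}
    for c in np.unique(Y):
        Xc = X[Y == c]  # rows of X whose label is c
        # mean and diagonal covariance of Xc, smoothed to avoid zero variance
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + smoothing
        dict_of_gaussians[c] = {'mu': mu, 'var': var}
        priors[c] = len(Xc) / len(X)
    return dict_of_gaussians, priors

def log_pdf(x, mu, var):
    """Log density of a Gaussian with diagonal covariance diag(var)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def predict(X, dict_of_gaussians, priors):
    predictions = []
    for x in X:  # loop through each sample
        # reset per sample so earlier samples do not affect later ones
        max_posterior, best_class = -np.inf, None
        for c in dict_of_gaussians:  # loop through all classes available
            g = dict_of_gaussians[c]
            # log likelihood plus log prior gives the unnormalised log posterior
            posterior = log_pdf(x, g['mu'], g['var']) + np.log(priors[c])
            if posterior > max_posterior:
                max_posterior, best_class = posterior, c
        predictions.append(best_class)
    return predictions
```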