
In [3]:
import numpy as np
import matplotlib.pyplot as plt

# Bernoulli naive Bayes

Run the cell below to create the following variables:

`X` = Data matrix of shape $(n, d)$. All features are binary, taking the value $0$ or $1$.

`y` = Label vector. Labels are $0$ and $1$.

In [4]:
rng = np.random.default_rng(seed=1)
X1 = np.concatenate((rng.binomial(size = 50,n = 1, p =0.7), rng.binomial(size = 50,n = 1, p =0.2))).reshape(-1, 1)
X2 = np.concatenate((rng.binomial(size = 50,n = 1, p =0.6), rng.binomial(size = 50,n = 1, p =0.1))).reshape(-1, 1)
X3 = np.concatenate((rng.binomial(size = 50,n = 1, p =0.6), rng.binomial(size = 50,n = 1, p =0.2))).reshape(-1, 1)
X4 = np.concatenate((rng.binomial(size = 50,n = 1, p =0.8), rng.binomial(size = 50,n = 1, p =0.1))).reshape(-1, 1)


X = np.column_stack((X1,X2,X3,X4))

y = np.concatenate((np.zeros(50, dtype= int), np.ones(50, dtype = int))).reshape(-1, 1)
permute = rng.permuted(range(100))

X = X[permute]
y = y[permute]

## Question 1
If we train the naive Bayes model on the dataset, what will be the value of $\hat{p}$, the estimate for $P(Y=1)$?



In [5]:
# Enter your solution here
p = y.sum()/len(y)
p

0.5

## Question 2
What will be the value of $\hat{p}_0^0$, the estimate of $P(f_0=1|y=0)$?  Write your answer correct to two decimal places.



In [6]:
# Enter your solution here
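def estimatedProb(label, index):
  # Estimate P(f_index = 1 | y = label): the fraction of examples
  # with this label whose feature f_index equals 1
  val = 0
  count = 0
  for i in range(len(X)):
    if y[i] == label:
      val += X[i][index]
      count += 1
  return val/count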

estimatedProb(0,0)

0.68
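
The same estimate can be computed for all features at once with a boolean mask. The sketch below is a minimal illustration on a small made-up dataset; `X_toy`, `y_toy`, and `estimate_feature_probs` are hypothetical names that are not part of the assignment. For each class it averages the binary features over the rows carrying that label.

```python
import numpy as np

# Hypothetical toy data: 6 examples, 2 binary features
X_toy = np.array([[1, 0],
                  [1, 1],
                  [0, 0],
                  [0, 1],
                  [0, 0],
                  [1, 0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

def estimate_feature_probs(X, y, label):
    """Per feature, the fraction of examples with the given label whose value is 1."""
    mask = (np.ravel(y) == label)  # accepts label arrays of shape (n,) or (n, 1)
    return X[mask].mean(axis=0)

probs_0 = estimate_feature_probs(X_toy, y_toy, 0)
probs_1 = estimate_feature_probs(X_toy, y_toy, 1)
print(probs_0, probs_1)
```

Dividing by the number of rows in the selected class, rather than by a fixed count, keeps the estimate correct when the classes are not balanced.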

## Question 3
What will be the value of $\hat{p}_0^1$, the estimate of $P(f_0=1|y=1)$?  Write your answer correct to two decimal places.



In [7]:
# Enter your solution here
estimatedProb(1,0)

0.26

## Question 4
What will be the value of $\hat{p}_3^1$, the estimate of $P(f_3=1|y=1)$?  Write your answer correct to two decimal places.




In [8]:
# Enter your solution here
estimatedProb(1,3)

0.12

## Question 5

What will be the predicted label for the point $[1, 0, 1, 0]$?



In [9]:
# Enter your solution here
def predict(point):
  # Start from the class priors: p estimates P(Y=1), so 1-p estimates P(Y=0)
  p0 = 1-p
  p1 = p
  for i in range(len(point)):
    if point[i] == 1:
      p0 *= estimatedProb(0,i)
      p1 *= estimatedProb(1,i)
    else:
      p0 *= 1-estimatedProb(0,i)
      p1 *= 1-estimatedProb(1,i)
  return 1 if p1 > p0 else 0

predict([1, 0, 1, 0])

1

## Question 6

What will be the predicted label for the point $[1, 0, 1, 1]$?



In [10]:
# Enter your solution here
predict([1, 0, 1, 1])

0
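
With many features, a product of probabilities can underflow, so the comparison is often done on sums of logarithms instead. The following is a sketch under stated assumptions: the function name `predict_bernoulli_nb` and the parameter values are hypothetical (they are not the estimates from the dataset above), and every probability is assumed to lie strictly between 0 and 1 so that the logarithms are finite.

```python
import numpy as np

def predict_bernoulli_nb(point, prior_1, probs_0, probs_1):
    """Return the label (0 or 1) with the larger log posterior score."""
    point = np.asarray(point)
    probs_0 = np.asarray(probs_0, dtype=float)
    probs_1 = np.asarray(probs_1, dtype=float)
    # log P(y) + sum_j log P(f_j | y): use p_j when f_j = 1 and 1 - p_j when f_j = 0
    score_0 = np.log(1 - prior_1) + np.sum(
        point * np.log(probs_0) + (1 - point) * np.log(1 - probs_0))
    score_1 = np.log(prior_1) + np.sum(
        point * np.log(probs_1) + (1 - point) * np.log(1 - probs_1))
    return 1 if score_1 > score_0 else 0

# Hypothetical parameters, chosen only for illustration
prior_1 = 0.5
probs_0 = [0.7, 0.6]  # assumed P(f_j = 1 | y = 0)
probs_1 = [0.2, 0.1]  # assumed P(f_j = 1 | y = 1)

label_a = predict_bernoulli_nb([1, 1], prior_1, probs_0, probs_1)
label_b = predict_bernoulli_nb([0, 0], prior_1, probs_0, probs_1)
print(label_a, label_b)
```

Because the logarithm is monotonic, comparing the two scores gives the same label as comparing the two products.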

# Gaussian naive Bayes

Run the cell below to create the following variables:

`X_train` = Training dataset of shape $(n, d)$. All examples are drawn from multivariate Gaussian distributions.

`y_train` = Label vector for the corresponding training examples. Labels are $0$ and $1$.

`X_test` = Test dataset of shape $(m, d)$, where $m$ is the number of examples in the test dataset. All examples are drawn from multivariate Gaussian distributions.

`y_test` = Label vector for the corresponding test examples. Labels are $0$ and $1$.



In [11]:
from sklearn.datasets import make_classification, make_blobs
from sklearn.model_selection import train_test_split

# generate artificial data points
X, y = make_blobs(n_samples = 100,
                  n_features=2,
                  centers=[[5,5],[10,10]],
                  cluster_std=1.5,
                  random_state=2)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=123)

## Question 7

How many examples are there in the training dataset?



In [12]:
# Enter your solution here
len(X_train)

80

## Question 8
How many features are there in the dataset?



In [13]:
# Enter your solution here
len(X_train[0])

2

## Question 9

If we train the Gaussian naive Bayes model on the training dataset, what will be the value of $\hat{p}$, the estimate for $P(Y=1)$? Write your answer correct to two decimal places.





In [14]:
# Enter your solution here
p = y_train.sum()/len(y_train)
p

0.4875

## Question 10

Let $\hat{\mu}_0 = [\mu_1, \mu_2, ..., \mu_d]$ be the estimate for $\mu_0$, the mean of the examples labeled $0$. What will be the value of $\mu_1+\mu_2+...+\mu_d$? Write your answer correct to two decimal places.



In [19]:
# Enter your solution here
def getmu(label):
  mu = np.zeros(len(X_train[0]))
  count = 0
  for i in range(len(X_train)):
    if y_train[i] == label:
      mu += X_train[i]
      count += 1
  return mu/count

mu0 = getmu(0)
mu0.sum()

9.575936394688135
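
The class mean can also be obtained without an explicit loop by selecting the rows of one class and averaging over them. This is a minimal sketch on made-up numbers; `X_toy`, `y_toy`, and `class_mean` are hypothetical names, and the label array is assumed to be one-dimensional, as `y_train` is here.

```python
import numpy as np

# Hypothetical toy data: two examples labeled 0 and one labeled 1
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [10.0, 20.0]])
y_toy = np.array([0, 0, 1])

def class_mean(X, y, label):
    """Mean feature vector of the examples carrying the given label."""
    return X[y == label].mean(axis=0)

mu_0 = class_mean(X_toy, y_toy, 0)
print(mu_0, mu_0.sum())
```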

We will use a different covariance matrix for each class. The estimate for $\Sigma_k$ will be

$$\hat{\Sigma}_k = \text{diag}(\sigma_1, \sigma_2, ..., \sigma_d)$$ where $\sigma_i$ is the variance of the $i^{th}$ feature over the examples labeled $k$.
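
One way to read this estimate is as a diagonal matrix whose $i^{th}$ diagonal entry is the variance of the $i^{th}$ feature within class $k$. The sketch below illustrates that reading on made-up numbers; `X_toy`, `y_toy`, and `diagonal_covariance` are hypothetical names. It assumes the variance divides by the number of examples in the class, which is what `np.var` does by default (`ddof=0`).

```python
import numpy as np

# Hypothetical toy data: three examples labeled 0 and one labeled 1
X_toy = np.array([[1.0, 2.0],
                  [3.0, 6.0],
                  [5.0, 4.0],
                  [9.0, 9.0]])
y_toy = np.array([0, 0, 0, 1])

def diagonal_covariance(X, y, label):
    """Diagonal matrix holding the per-feature variances of one class."""
    variances = X[y == label].var(axis=0)  # ddof=0: divide by the class size
    return np.diag(variances)

sigma_0 = diagonal_covariance(X_toy, y_toy, 0)
print(sigma_0)
print(np.trace(sigma_0))
```

The off-diagonal entries are zero by construction, which reflects the naive Bayes assumption that features are conditionally independent given the label.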



## Question 11
What will be value of $\text{trace}({\hat{\Sigma}}_0)$?  Write your answer correct to two decimal places.







In [26]:
# Enter your solution here
def getSigma(label, mu):
  # Diagonal covariance: the i-th diagonal entry is the variance of
  # feature i over the examples with this label, measured around mu
  var = np.zeros(len(X_train[0]))
  count = 0
  for i in range(len(X_train)):
    if y_train[i] == label:
      var += (X_train[i]-mu)**2
      count += 1
  return np.diag(var/count)

sigma0 = getSigma(0, mu0)
print(round(np.trace(sigma0), 2))

4.44

## Question 12

Once we have estimated all the parameters for Gaussian naive Bayes, assuming a different covariance matrix for each class, we predict the labels for the training examples. What will be the training accuracy?

Accuracy is defined as the proportion of correctly classified examples.  Write your answer correct to two decimal places.




In [27]:
# Enter your solution here
mu0 = getmu(0)
mu1 = getmu(1)
sigma0 = getSigma(0, mu0)
sigma1 = getSigma(1, mu1)

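def predict(point):
  # Start from the class priors: p estimates P(Y=1), so 1-p estimates P(Y=0)
  p0 = 1-p
  p1 = p
  for i in range(len(point)):
    # Multiply by the univariate Gaussian density of feature i under each class
    p0 *= 1/np.sqrt(2*np.pi*sigma0[i][i])*np.exp(-(point[i]-mu0[i])**2/(2*sigma0[i][i]))
    p1 *= 1/np.sqrt(2*np.pi*sigma1[i][i])*np.exp(-(point[i]-mu1[i])**2/(2*sigma1[i][i]))
  return 1 if p1 > p0 else 0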

def getAccuracy(X, y):
  count = 0
  for i in range(len(X)):
    if predict(X[i]) == y[i]:
      count += 1
  return count/len(X)

getAccuracy(X_train, y_train)

0.9875
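
The prediction and the accuracy can also be computed for a whole dataset at once by working with log densities. This is a sketch under stated assumptions, not the reference solution: the function names, parameter values, and toy points are all hypothetical, and each class is described by a mean vector and a vector of per-feature variances (the diagonal of its covariance matrix).

```python
import numpy as np

def gaussian_log_likelihood(X, mu, var):
    """Sum over features of log N(x_j; mu_j, var_j), one value per row of X."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (X - mu) ** 2 / (2 * var), axis=1)

def predict_gaussian_nb(X, prior_1, mu_0, var_0, mu_1, var_1):
    """Label each row with the class that has the larger log posterior score."""
    score_0 = np.log(1 - prior_1) + gaussian_log_likelihood(X, mu_0, var_0)
    score_1 = np.log(prior_1) + gaussian_log_likelihood(X, mu_1, var_1)
    return (score_1 > score_0).astype(int)

# Hypothetical parameters and points, chosen only for illustration
mu_0, var_0 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
mu_1, var_1 = np.array([4.0, 4.0]), np.array([1.0, 1.0])
X_toy = np.array([[0.5, -0.2],
                  [3.8, 4.1],
                  [1.0, 1.0]])
y_toy = np.array([0, 1, 1])

y_pred = predict_gaussian_nb(X_toy, 0.5, mu_0, var_0, mu_1, var_1)
accuracy = np.mean(y_pred == y_toy)  # proportion of correctly classified examples
print(y_pred, accuracy)
```

In this toy example the third point lies closer to the class 0 mean, so it is predicted as 0 even though its label is 1, and the accuracy is two out of three.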

## Question 13

What will be the test accuracy?

Accuracy is defined as the proportion of correctly classified examples.  




In [28]:
# Enter your solution here
getAccuracy(X_test, y_test)

1.0