# Naive Bayes classifier

In [16]:
import pandas as pd
import numpy as np
import math
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [17]:
data=pd.read_csv("mark_evaluation.csv")
data.head(5)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,traveltime,studytime,...,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,pass
0,1,1,18,1,0,0,4,4,2,2,...,3,4,1,1,3,6,5,6,6,0
1,1,1,17,1,0,1,1,1,1,2,...,3,3,1,1,3,4,5,5,6,0
2,1,1,15,1,1,1,1,1,1,2,...,3,2,2,3,3,10,7,8,10,1
3,1,1,15,1,0,1,4,2,1,3,...,2,2,1,1,5,2,15,14,15,1
4,1,1,16,1,0,1,3,3,1,2,...,3,2,1,2,5,4,6,10,10,1


In [18]:
data.isna().sum()

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
pass          0
dtype: int64

In [19]:
x=data.iloc[:,0:-1]
y=data.iloc[:,-1]

In [20]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

In [21]:
x_train=pd.DataFrame(x_train)
x_test=pd.DataFrame(x_test)

In [22]:
def calprobability(x_train,y_train):
    # For every feature column, estimate P(result == 1 | column == value)
    # as (positive rows with that value) / (all rows with that value).
    probability={}
    train=x_train.copy()  # work on a copy so the caller's frame is not modified
    train["result"]=y_train
    x_positive=train[train["result"]==1]

    for col in x_train.columns:
        probability[col]={}
        total_val=train[col].value_counts().to_dict()
        positive_val=x_positive[col].value_counts().to_dict()
        for k in total_val:
            # a value never seen in a positive row gets probability 0
            probability[col][k]=positive_val.get(k,0)/total_val[k]
    return probability

In [23]:
baysianprobability=calprobability(x_train,y_train)

In [24]:
x_test=x_test.reset_index(drop=True)

In [25]:
def testmodel(x_test,baysianprobability):
    # For each row, multiply the per-feature estimates p and (1 - p);
    # predict 1 when the product of the p values is the larger one.
    y_pred=[]
    column=list(x_test.columns)
    for i in range(len(x_test)):
        current_ele=x_test.iloc[i]
        positive_prob=1
        negative_prob=1
        for col in column:
            value=current_ele[col]  # look up by column name, not by position
            # feature values not seen during training are skipped
            if value in baysianprobability[col]:
                individual_prob=baysianprobability[col][value]
                positive_prob*=individual_prob
                negative_prob*=(1-individual_prob)
        if positive_prob>negative_prob:
            y_pred.append(1)
        else:
            y_pred.append(0)
    return y_pred

In [26]:
y_pred=testmodel(x_test,baysianprobability)

In [27]:
confusion_matrix(y_test,y_pred)

array([[20,  9],
       [ 2, 48]], dtype=int64)

In [28]:
accuracy=accuracy_score(y_test,y_pred)

In [29]:
print("Accuracy of our model :",accuracy*100,"%")

Accuracy of our model : 86.07594936708861 %



 The notebook reads mark_evaluation.csv into a pandas DataFrame called data with pd.read_csv and previews the first five rows with head(5).
 data.isna().sum() counts the missing values per column; every count is 0, so no imputation is needed.
 The features x are all columns except the last (data.iloc[:,0:-1]), and the target y is the last column, pass, which is 0 or 1.
 Note that x still contains the grade columns G1, G2 and G3; if pass is derived from the final grade G3, as the preview rows suggest, the features leak the target.
 train_test_split(x,y,test_size=0.2,random_state=0) shuffles the rows and returns four objects: x_train, x_test, y_train and y_test.
 test_size=0.2 holds out 20% of the rows for testing (79 rows, the total of the confusion matrix entries), and random_state=0 makes the shuffle reproducible.
 The training part is used to estimate the probabilities and the test part is used only for evaluation.
 Wrapping x_train and x_test in pd.DataFrame does not change their contents, because train_test_split already returns DataFrames when it is given one; both keep the same feature columns.
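 To make the hold-out split concrete, here is a minimal pure-Python sketch of the idea: shuffle the row indices with a seeded generator and set aside a fraction for testing. The function name holdout_split and the toy data are made up for illustration, and this is not scikit-learn's implementation, so it will not reproduce the exact rows that train_test_split selects.

```python
import math
import random

def holdout_split(rows, labels, test_size=0.2, seed=0):
    """Shuffle row indices with a seeded RNG and hold out a test fraction."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_test = math.ceil(test_size * len(rows))   # size of the held-out part
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Rows and labels are selected with the same indices, so they stay aligned.
    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

rows = [[i, i % 3] for i in range(10)]   # made-up feature rows
labels = [i % 2 for i in range(10)]      # made-up 0/1 labels
x_tr, x_te, y_tr, y_te = holdout_split(rows, labels)
print(len(x_tr), len(x_te))
```

 Fixing the seed plays the same role as random_state=0 in the notebook: rerunning the cell yields the same partition.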
 calprobability estimates, for every feature value seen in training, the fraction of training rows with that value whose label is 1.
 It starts with an empty dictionary called probability and joins y_train to the training features as a result column.
 x_positive is the subset of training rows with result == 1.
 For each column, value_counts gives the number of rows per distinct value, once over all training rows (total_val) and once over the positive rows only (positive_val).
 The stored entry probability[column][value] is positive_val[value]/total_val[value], an estimate of P(pass = 1 | column = value).
 A value that never occurs in a positive row gets probability 0.
 The result is a nested dictionary keyed first by column name and then by feature value.
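 The per-value ratio computed by calprobability can be illustrated for a single column with the standard library alone. This is a sketch on made-up data; value_pass_rates, studytime and passed are hypothetical names, not part of the notebook.

```python
from collections import Counter

def value_pass_rates(values, labels):
    """Return {value: share of rows with that value whose label is 1}."""
    total = Counter(values)                                   # rows per value
    positive = Counter(v for v, y in zip(values, labels) if y == 1)
    # Counter returns 0 for values that never occur with label 1.
    return {v: positive[v] / total[v] for v in total}

studytime = [1, 1, 2, 2, 2, 3]   # made-up feature column
passed    = [0, 1, 1, 1, 0, 0]   # made-up 0/1 labels
rates = value_pass_rates(studytime, passed)
print(rates)
```

 Here the value 1 occurs twice with one positive label (rate 1/2), the value 2 occurs three times with two positive labels (rate 2/3), and the value 3 never occurs with a positive label (rate 0).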
 baysianprobability=calprobability(x_train,y_train) stores that nested dictionary; it is computed from the training data only and involves no randomness.
 x_test.reset_index(drop=True) renumbers the test rows 0, 1, 2, ... and discards the shuffled original index.
 testmodel then builds y_pred, an initially empty list that receives one prediction per test row, and column, the list of feature names in x_test.
 No training happens at this stage: the function only looks up probabilities that were estimated earlier.
 testmodel takes two arguments: x_test, the held-out rows to classify, and baysianprobability, the dictionary returned by calprobability.
 For each test row it starts with positive_prob = 1 and negative_prob = 1.
 For each feature it looks up p = baysianprobability[column][value], then multiplies positive_prob by p and negative_prob by 1 - p.
 Feature values that did not occur in the training data are skipped.
 The row is predicted as 1 (pass) if positive_prob > negative_prob and as 0 otherwise.
 The function returns y_pred, a list with one 0/1 prediction per row of x_test.
 This scoring multiplies estimates of P(pass | value) directly; a textbook naive Bayes classifier instead multiplies the class prior P(class) by the likelihoods P(value | class), so the two can disagree.
 Because a single value with p = 0 or p = 1 drives one of the two products to zero, one feature can decide the prediction on its own.
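 For comparison, here is a minimal sketch of a textbook categorical naive Bayes classifier. It is a different technique from the notebook's testmodel: it scores each class by log P(class) plus the sum of log P(value | class), with Laplace smoothing (alpha) so that unseen combinations do not produce a zero probability. The function names and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def fit_categorical_nb(rows, labels, alpha=1.0):
    """Collect the counts needed for P(class) and P(value | class)."""
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # value_counts[j][c][v] = number of class-c rows whose feature j equals v
    value_counts = [defaultdict(Counter) for _ in range(n_features)]
    vocab = [set() for _ in range(n_features)]   # distinct values per feature
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[j][c][v] += 1
            vocab[j].add(v)
    return class_counts, value_counts, vocab, alpha

def predict_categorical_nb(model, row):
    """Return the class with the highest smoothed log-posterior score."""
    class_counts, value_counts, vocab, alpha = model
    n = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / n)                # log prior
        for j, v in enumerate(row):
            count = value_counts[j][c][v]        # 0 if never seen with class c
            # Laplace-smoothed estimate of P(feature j == v | class c)
            score += math.log((count + alpha) / (n_c + alpha * len(vocab[j])))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rows = [[0, 0], [0, 1], [1, 1], [1, 1], [0, 0], [1, 0]]   # made-up features
labels = [0, 0, 1, 1, 0, 1]                                # made-up classes
model = fit_categorical_nb(rows, labels)
print(predict_categorical_nb(model, [1, 1]), predict_categorical_nb(model, [0, 0]))
```

 Summing logarithms instead of multiplying raw probabilities also avoids numerical underflow when there are many feature columns.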
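 confusion_matrix(y_test,y_pred) tabulates the predictions against the true labels, with one row per true class and one column per predicted class.
 Here 20 rows with pass = 0 and 48 rows with pass = 1 are classified correctly, 9 rows with pass = 0 are predicted as 1, and 2 rows with pass = 1 are predicted as 0.
 accuracy_score returns the fraction of correct predictions, (20 + 48) / 79 ≈ 0.861, which the notebook prints as about 86.1%.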
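 The reported accuracy can be checked by hand from the confusion matrix shown earlier. The sketch below uses only the four numbers from that output and assumes scikit-learn's convention of rows for true classes and columns for predicted classes; the variable names are illustrative.

```python
# Confusion matrix from the notebook output:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = [[20, 9],
      [2, 48]]

correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal: 20 + 48
total = sum(sum(row) for row in cm)               # all test rows
accuracy = correct / total

# Recall for class 1: share of true pass rows that were predicted as pass.
recall_pass = cm[1][1] / sum(cm[1])

print(f"accuracy = {accuracy:.4f}, recall for pass = {recall_pass:.2f}")
```

 The diagonal holds the correct predictions, so accuracy is simply the diagonal sum divided by the total number of test rows.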
