## Data Mining Task 1
I follow this links when i write this code.
Sources: <br>
1-) https://medium.com/@abhishek.km23/naive-bayes-classifier-calculation-of-prior-likelihood-evidence-posterior-74d7d27eec24 <br>
2-) https://www.geeksforgeeks.org/ml-naive-bayes-scratch-implementation-using-python/ <br>
3-) https://scikit-learn.org/stable/modules/naive_bayes.html <br>
4-) https://gist.github.com/tuttelikz/94f750ef3bf14f8a126a <br>

### İmport required libraries and read dataset from cvs file 

In [233]:
import itertools
import numpy as np # using dataframe as list of lists
import pandas as pd # reading csv and translate it to dataframe 
import math # for sqrt, pi and exp steps
import random  # for creating randomization when spliting data to train and test

df = pd.read_csv('teleCust1000t.csv')
df.head(3)

Unnamed: 0,region,tenure,age,marital,address,income,ed,employ,retire,gender,reside,custcat
0,2,13,44,1,9,64.0,4,5,0.0,0,2,1
1,3,11,33,1,7,136.0,5,5,0.0,0,6,4
2,3,68,52,1,24,116.0,1,29,0.0,1,2,3


### Drop (or delete) the columns that we will not use. Keep only numerical values and class labels. We can drop the colums or transfer only the columns that we will use to dataset list.

In [234]:
# another method -------------------------------------------------------------------------
#                                                                                        |
# df = df.drop(['region','marital', 'address', 'ed', 'retire', 'gender', 'reside'],1) <--- 
dataset = df[['tenure','age','income','employ','custcat']].values
type(dataset)
# we have a list that every element of list have one row from our dataset.csv file.

numpy.ndarray

### Split the dataset for train and test data.

In [235]:
def split(datataset, ratio):
    sizeoftrain = int(len(dataset)*ratio) # get size of total dataset and calculate size of train data size with the ratio(value for splitting)
    trainDataset = [] # create new list for train dataset
    copyDataset = list(dataset) # get copy of original dataset
    while len(trainDataset) < sizeoftrain:
        x = random.randrange(len(copyDataset)) # select random row of dataset
        trainDataset.append(copyDataset.pop(x)) # move selected row to train dataset
    return [trainDataset, copyDataset] # return train and test set

### Calculate probability of likelihood (P(Xi | Label))

In [236]:
# https://wikimedia.org/api/rest_v1/media/math/render/svg/685339e22f57b18d804f2e0a9c507421da59e2ab  <--- formula
# we dont use pow(stddev,2) on first part of formula because it will become a stdDev in sqrt function
def calculateLikelihood(x,mean,stdDev):
    exponent = math.exp(-(math.pow(x - mean,2)/2*math.pow(stdDev,2)))
    result = 1/(math.sqrt(2*math.pi)*stdDev)*exponent
    return result

### Seperate the dataset by classes and store them in dictionary

In [237]:
def separateByClass(dataset):
    classDataDict = {} # create dictionary for store classes and their instances {label1:rows, label2:rows ...}
    for i in range(len(dataset)):
        current = dataset[i] # get the current row to "current" variable
        if current[-1] not in classDataDict: # if the current row's class label not seen before, create new list for it
            classDataDict[current[-1]] = []
        classDataDict[current[-1]].append(current) # append row (instance) to their class labels list on dictionary 
    return classDataDict

### Get every class label's instances mean and standart deviations of every columns(features)

In [238]:
def meanAndStddevs(dataset):
	summaries = [(np.mean(attribute), np.std(attribute)) for attribute in zip(*dataset)] # zip(*list) function unpack elements in list and use them
    # https://stackoverflow.com/questions/29139350/difference-between-ziplist-and-ziplist    <-- source
	del summaries[-1] # we delete the last element of list because its the mean and stdDev of our class labels and we will not use this values
	return summaries

### Calculate mean and stdDev for every class label

In [239]:
def meanAndStddevPerClass(dataset):
    seperated = separateByClass(dataset) # get seperated by class dictionary 
    result = {}
    for label, instances in seperated.items(): # calculate mean and stdDev of instances per Class and store them on dict
        result[label] = meanAndStddevs(instances)
    return result

### Calculate probabilities of test input and find the best probability(and use it as result)

In [240]:
def predict(meanAndStddevData, inputData):  # https://www.saedsayad.com/images/Bayes_rule.png
    probabilities = {}
    for label, meanandstdDev in meanAndStddevData.items():
        probabilities[label] = 1
        for i in range(len(meanandstdDev)): # loop run for every column of instances and use the mean and stdDev values of this columns for this class labels
            x = inputData[i]
            mean, stdev = meanandstdDev[i]
            probabilities[label] *= calculateLikelihood(x,mean,stdev) 
            # we dont calculate P(C) as (number of C/ total) because every calculation "total" is constant and dont effect our calculation

    bestlabel = None
    bestProb = -1
    for labels, probability in probabilities.items():
        if bestlabel is None or bestProb < probability:
            bestProb = probability
            bestlabel = labels
    return bestlabel

### Get test sample and calculate accuracy value

In [241]:
def getPredictions(meanAndStddevData, testSet):
	predictions = []
	for i in range(len(testSet)):
		result = predict(meanAndStddevData, testSet[i])
		predictions.append(result) # "prediction" list have the class labels that predicted 
	return predictions


def getAccuracy(testSet, predictions):
	correct = 0
	for x in range(len(testSet)):
		if testSet[x][-1] == predictions[x]: # if prediction is true on dataset, increase correct answer counter
			correct += 1
	result = (correct/float(len(testSet)))*100.0
	return result

### Test the naive bayes implementation

In [242]:
def main():
    ratio = 0.80
    trainingSet, testSet = split(dataset, ratio)
    print('Size of dataset: ', len(dataset), '\nTrain dataset size:', len(trainingSet), '\nTest dataset size: ', len(testSet))
    meanAndStdDevdata = meanAndStddevPerClass(trainingSet)
    predictions = getPredictions(meanAndStdDevdata, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ', accuracy)

In [243]:
main()

Size of dataset:  1000 
Train dataset size: 800 
Test dataset size:  200
Accuracy:  25.5


### The accuracy value can change every run because of the random selection on dataset splitting and also the ratio effect the accuracy.

Some results that i got:
Ratio: 0.75     %26.8 <br>
Ratio: 0.8      %27.0 <br>
Ratio: 0.85     %30.0 <br>
Ratio: 0.9      %28.9 <br>
Ratio: 0.85     %34.0 <br>

This values can be change on every run.