<a href="https://colab.research.google.com/github/simonrio23/Simon_TM9/blob/main/Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Naive-Bayes Algorithm
=====================
***

What Is It?
-----------

The Naive-Bayes algorithm is an intuitive approach to making predictions based on prior beliefs or probabilities. Quoting Jason Brownlee, "it is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically".

Let's dive into the mathematics. We start off with a belief, or *prior probability*, of event $A$, denoted $P(A)$. Everything seems to be going well until we're hit with some new evidence $X$ that affects the probability of our belief. As much as we'd like to, we can't simply ignore $X$ and go home. Instead, given evidence $X$, we must calculate an updated probability for event $A$ called the *posterior probability*, denoted $P(A | X)$. Finally, for the sake of completeness, $P(X | A)$ is the probability of observing evidence $X$ when event $A$ holds (the *likelihood*), and $P(X)$ is the overall probability of observing evidence $X$, regardless of $A$.

\begin{align}
 P(A \mid X) &= \frac{P(X \mid A) \, P(A)}{P(X)}
\end{align}
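
To make the formula concrete, here is a small worked example in Python. The numbers are hypothetical and chosen purely for illustration (they do not come from the Pima data set): a condition with a 10% prior, a test that flags 80% of true cases, and a 20% false-positive rate.

```python
# Hypothetical numbers, for illustration only.
p_a = 0.10              # P(A): prior probability of the condition
p_x_given_a = 0.80      # P(X | A): probability of a positive test given the condition
p_x_given_not_a = 0.20  # P(X | not A): false-positive rate

# P(X) via the law of total probability.
p_x = p_x_given_a * p_a + p_x_given_not_a * (1 - p_a)

# Bayes' theorem: P(A | X) = P(X | A) P(A) / P(X).
p_a_given_x = p_x_given_a * p_a / p_x

print('P(X) = {0:.2f}'.format(p_x))
print('P(A | X) = {0:.4f}'.format(p_a_given_x))
```

Under these assumed numbers, a positive test raises the belief in $A$ from 10% to roughly 31%: the evidence matters, but the low prior still dominates.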

You're probably wondering what makes this algorithm *naive*. Well, it's due to the underlying assumption that the individual pieces of evidence $X_1, \dots, X_n$ are independent of one another given event $A$, so each $P(X_i | A)$ can be estimated separately. This rarely holds exactly in real data, but it simplifies the computation a lot, which explains the algorithm's popularity in many fields.
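
Written out for $n$ pieces of evidence $X_1, \dots, X_n$, the independence assumption lets the likelihood factor into a product of per-feature terms. The denominator $P(X_1, \dots, X_n)$ is the same for every class, so it can be dropped when only the most probable class is needed:

\begin{align}
 P(A \mid X_1, \dots, X_n) &\propto P(A) \prod_{i=1}^{n} P(X_i \mid A)
\end{align}

The `predict` function later in this notebook multiplies per-feature likelihoods in this way, although as written it starts the product at 1 rather than at the prior $P(A)$.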

This notebook uses Python to predict whether a patient has diabetes, given a set of attributes. The data set is the "Pima Indians Diabetes Data Set" provided by the National Institute of Diabetes and Digestive and Kidney Diseases. The target accuracy, used as a sanity check on the implementation, is between 70% and 76%.

Data Loading and Formatting
-------------------------------

The data set is given as a `csv` file, which requires parsing and partitioning to form a training set and a test set.

In [21]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [29]:
import csv

def load_csv(file):
    with open(file, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for i in range(len(dataset)):
            dataset[i] = [float(x) for x in dataset[i]]
    return dataset

file = '/content/drive/MyDrive/SimonRioArwam/TM9/Dataset/pima-indians-diabetes.data.csv'  # Adjust to your own file path
dataset = load_csv(file)
print('Loaded data from {0} with {1} rows'.format(file, len(dataset)))

Loaded data from /content/drive/MyDrive/SimonRioArwam/TM9/Dataset/pima-indians-diabetes.data.csv with 768 rows


In [31]:
from random import randrange

def partition_data(dataset, ratio):
    train_size = int(len(dataset) * ratio)
    test_set = list(dataset)
    train_set = []

    while len(train_set) < train_size:
        index = randrange(len(test_set))
        train_set.append(test_set.pop(index))

    return [train_set, test_set]

train_set, test_set = partition_data(dataset, 0.67)
print('Split total data ({0} rows) into training set ({1} rows) and testing set ({2} rows)'.format(len(dataset), len(train_set), len(test_set)))


Split total data (768 rows) into training set (514 rows) and testing set (254 rows)
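
Because `partition_data` draws from the global random generator, each run produces a different split and therefore a slightly different accuracy. A minimal sketch of a reproducible variant follows; the name `partition_data_seeded` and the default seed are assumptions for illustration and are not part of the original notebook.

```python
import random

def partition_data_seeded(dataset, ratio, seed=42):
    """Hypothetical variant of partition_data that uses a fixed seed."""
    rng = random.Random(seed)   # private generator; global random state is untouched
    shuffled = list(dataset)    # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * ratio)
    return shuffled[:train_size], shuffled[train_size:]

# Toy rows standing in for the real data set.
rows = [[float(i), float(i % 2)] for i in range(10)]
train_a, test_a = partition_data_seeded(rows, 0.67)
train_b, test_b = partition_data_seeded(rows, 0.67)

print(len(train_a), len(test_a))
print(train_a == train_b)  # same seed, same split
```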


In [33]:
def group_by_class(dataset):
    klass_map = {}
    for el in dataset:
        klass = int(el[-1])
        if klass not in klass_map:
            klass_map[klass] = []
        klass_map[klass].append(el[:-1])
    return klass_map

classified_set = group_by_class(train_set)

for klass, data_points in classified_set.items():
    print('Class {0} contains {1} data points'.format(klass, len(data_points)))

Class 0 contains 336 data points
Class 1 contains 178 data points
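
The class counts above are also what a prior $P(A)$ would be estimated from. A minimal sketch, assuming a hypothetical helper `class_priors` that is not part of the original notebook:

```python
def class_priors(klass_map):
    """Hypothetical helper: fraction of training rows that fall in each class."""
    total = sum(len(points) for points in klass_map.values())
    return {klass: len(points) / total for klass, points in klass_map.items()}

# Toy stand-in for classified_set: three rows of class 0, one row of class 1.
toy = {0: [[1.0], [2.0], [3.0]], 1: [[4.0]]}
print(class_priors(toy))  # {0: 0.75, 1: 0.25}
```

For the split shown above, this would give roughly 0.65 for class 0 (336 of 514 rows) and 0.35 for class 1 (178 of 514 rows).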


In [34]:
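import math

def mean(n):
    # Arithmetic mean of a sequence of numbers.
    return sum(n) / float(len(n))

def stdev(n):
    # Sample standard deviation (divides by N - 1, so it needs at least two values).
    average = mean(n)
    return math.sqrt(sum([pow(x - average, 2) for x in n]) / float(len(n) - 1))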
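
As a quick sanity check, the two helpers can be compared against Python's standard `statistics` module, whose `stdev` also computes the sample standard deviation. The sketch below repeats the notebook's definitions so it runs on its own; the `values` list is made-up toy data.

```python
import math
import statistics

def mean(n):
    return sum(n) / float(len(n))

def stdev(n):
    average = mean(n)
    return math.sqrt(sum([pow(x - average, 2) for x in n]) / float(len(n) - 1))

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy data

# Each line prints the notebook's result next to the standard library's.
print(mean(values), statistics.mean(values))
print(stdev(values), statistics.stdev(values))
```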

In [35]:
import multiprocessing as mp

def format_calc(t):
    return (mean(t), stdev(t))

def prepare_data(dataset):
    pool = mp.Pool(mp.cpu_count())
    summary = {}
    for klass, data_points in dataset.items():  # Use items() instead of Python 2's iteritems()
        summary[klass] = pool.map(format_calc, zip(*data_points))
    pool.close()
    pool.join()
    return summary

summary_set = prepare_data(classified_set)

for klass, tupl in summary_set.items():  # Use items() instead of Python 2's iteritems()
    print('Class {0} contains {1} tuples'.format(klass, len(tupl)))  # format() is called inside the print() parentheses


Class 0 contains 8 tuples
Class 1 contains 8 tuples
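
With only eight attributes per class, a process pool is not strictly needed, and on platforms that start worker processes with "spawn" (Windows, and macOS by default) a pool may fail to find functions defined inside a notebook. A minimal serial sketch that should produce the same structure is shown below; `prepare_data_serial` is a hypothetical name, and it swaps in `statistics.mean` and `statistics.stdev` for the notebook's own helpers.

```python
import statistics

def prepare_data_serial(klass_map):
    """Hypothetical serial equivalent of prepare_data."""
    summary = {}
    for klass, data_points in klass_map.items():
        # zip(*rows) transposes the rows into columns, one tuple per attribute.
        summary[klass] = [(statistics.mean(col), statistics.stdev(col))
                          for col in zip(*data_points)]
    return summary

# Toy stand-in for classified_set: two classes, two attributes each.
toy = {0: [[1.0, 10.0], [3.0, 14.0]], 1: [[5.0, 20.0], [9.0, 30.0]]}
for klass, stats in prepare_data_serial(toy).items():
    print(klass, stats)
```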


In [36]:
import math

def gauss(x, mean, stdev):
    ex = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * ex

In [37]:
def predict(summary_set, data_point):
    # Multiply the per-attribute Gaussian likelihoods for each class and
    # return the class with the largest product.
    probabilities = {}
    for klass, summary in summary_set.items():  # items() replaces Python 2's iteritems()
        probabilities[klass] = 1
        for i in range(len(summary)):  # range() replaces Python 2's xrange()
            mean, stdev = summary[i]
            probabilities[klass] *= gauss(data_point[i], mean, stdev)
    return max(probabilities, key=(lambda key: probabilities[key]))

In [68]:
accuracy = get_accuracy(summary_set, test_set)

print('The Naive-Bayes model yields {0}% accuracy'.format(round(accuracy, 2)))
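
The `get_accuracy` function called in the last cell is not defined in any of the cells shown here. The sketch below is one plausible definition, assuming it returns the percentage of test rows whose predicted class matches the label in the last column; that behaviour is an assumption, not the author's confirmed code. Python 3 forms of `gauss` and `predict` are repeated so the block runs on its own, and the summary and test rows are toy values.

```python
import math

def gauss(x, mean, stdev):
    ex = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * ex

def predict(summary_set, data_point):
    probabilities = {}
    for klass, summary in summary_set.items():
        probabilities[klass] = 1
        for i in range(len(summary)):
            mean, stdev = summary[i]
            probabilities[klass] *= gauss(data_point[i], mean, stdev)
    return max(probabilities, key=probabilities.get)

def get_accuracy(summary_set, test_set):
    """Assumed definition: percentage of rows classified correctly."""
    correct = 0
    for row in test_set:
        # The last column holds the true class; the rest are attributes.
        if predict(summary_set, row[:-1]) == int(row[-1]):
            correct += 1
    return 100.0 * correct / len(test_set)

# Toy summary: one attribute per class, as (mean, stdev).
toy_summary = {0: [(1.0, 0.5)], 1: [(5.0, 0.5)]}
# Toy test rows: [attribute, true class]. The last row is deliberately misleading.
toy_test = [[1.2, 0.0], [4.8, 1.0], [0.9, 0.0], [4.0, 0.0]]

print(get_accuracy(toy_summary, toy_test))  # 75.0
```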