# Project


Naive Bayes is a classification algorithm that uses Bayes' theorem to make predictions from the conditional probability of an event A given that another event B has occurred.

#### Bayes Theorem

$$P(A|B) = \frac{P(B|A) P(A)}{P(B)}$$

where:

- $P(A|B)$ - conditional probability of A given B
- $P(B|A)$ - conditional probability of B given A
- $P(A)$ - probability of A
- $P(B)$ - probability of B
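
As a quick numeric illustration of the formula, the sketch below plugs in made-up probabilities. All three input values are hypothetical and chosen only for the arithmetic: A stands for "the email is spam" and B for "the email contains the word free".

```python
# Hypothetical probabilities, chosen only to illustrate the formula
p_a = 0.4              # P(A): prior probability that an email is spam
p_b_given_a = 0.5      # P(B|A): probability of the word "free" in spam
p_b_given_not_a = 0.1  # P(B|not A): probability of the word "free" in non-spam

# P(B) from the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print("P(B)   = {:.2f}".format(p_b))
print("P(A|B) = {:.3f}".format(p_a_given_b))
```

With these assumed numbers, P(B) is 0.26 and P(A|B) is roughly 0.77: seeing the word raises the probability of spam from the prior of 0.4.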

# Non-code examples

# Imports

In [3]:
import numpy as np
import pandas as pd
import urllib
import sklearn

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score

In [4]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB


from sklearn.datasets import load_iris

# Iris sklearn Gaussian Naive Bayes 

### Loading Data

In [5]:
def load_data():
    """Load the iris dataset from scikit-learn and split it, stratified by
    class, into train and test sets in a 0.8/0.2 ratio.
    """
    iris = load_iris()
    samples, targets = iris["data"], iris["target"]
    return train_test_split(
        samples, targets, 
        test_size=0.2, 
        stratify=targets,
        shuffle=True, 
        random_state=42
    )

### Implementation

In [6]:

X_train, X_test, y_train, y_test = load_data()
print("X_train matrix size: {}".format(X_train.shape))
print("y_train vector size: {}".format(y_train.shape))
print("X_test matrix size: {}".format(X_test.shape))
print("y_test vector size: {}".format(y_test.shape))
print("\n----\n")


X_train matrix size: (120, 4)
y_train vector size: (120,)
X_test matrix size: (30, 4)
y_test vector size: (30,)

----



In [7]:

# Creating Gaussian Naive Bayes model
scikit_nb = GaussianNB()

# Training model
scikit_nb.fit(X_train, y_train)

# Making prediction
pred = scikit_nb.predict(X_test)
print("Prediction vector: {}".format(pred))
print("  Expected values: {}".format(y_test))
print("\n----\n")


Prediction vector: [0 2 1 1 0 1 0 0 2 1 2 2 2 1 0 0 0 1 1 2 0 2 1 2 2 2 1 0 2 0]
  Expected values: [0 2 1 1 0 1 0 0 2 1 2 2 2 1 0 0 0 1 1 2 0 2 1 2 2 1 1 0 2 0]

----



In [9]:

# Evaluation (accuracy_score expects the true labels first, then the predictions)
accuracy = accuracy_score(y_test, pred)
print("Prediction accuracy: {}%".format(accuracy * 100.0))

Prediction accuracy: 96.66666666666667%
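
To make clearer what a Gaussian Naive Bayes model is doing, here is a minimal from-scratch sketch, not scikit-learn's actual implementation. It estimates a prior, a mean and a variance per class and feature, then picks the class with the highest log-posterior. The helper names `fit_gaussian_nb` and `predict_gaussian_nb` and the `1e-9` variance floor are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant keeps variances strictly positive (value is an assumption)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances


def predict_gaussian_nb(model, X):
    """Pick the class with the highest log-posterior for every row of X."""
    classes, priors, means, variances = model
    # diff has shape (n_samples, n_classes, n_features) via broadcasting
    diff = X[:, np.newaxis, :] - means[np.newaxis, :, :]
    log_likelihood = -0.5 * (np.log(2.0 * np.pi * variances) + diff ** 2 / variances)
    # Naive assumption: features are independent given the class,
    # so the per-feature log-likelihoods simply add up
    log_posterior = np.log(priors) + log_likelihood.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]


# Same split settings as load_data() above
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris["data"], iris["target"],
    test_size=0.2, stratify=iris["target"], shuffle=True, random_state=42
)

model = fit_gaussian_nb(X_tr, y_tr)
manual_pred = predict_gaussian_nb(model, X_te)
sklearn_pred = GaussianNB().fit(X_tr, y_tr).predict(X_te)

print("Manual accuracy: {:.3f}".format(np.mean(manual_pred == y_te)))
print("Agreement with scikit-learn: {:.3f}".format(np.mean(manual_pred == sklearn_pred)))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow for datasets with more features.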


In [10]:
# Wrapping Gaussian Naive Bayes in a scikit-learn Pipeline (still on the iris data)
from sklearn.pipeline import Pipeline


In [11]:

clf = Pipeline([('nb', GaussianNB())])
clf.fit(X_train, y_train)


In [12]:

clf.score(X_test, y_test)
# clf.predict(user_input)

0.9666666666666667

# Predict spam

In [7]:
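import urllib.request

# Spambase dataset from the UCI Machine Learning Repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"

# Each row holds 57 numeric features followed by the class label
raw_data = urllib.request.urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=',')
print(dataset[0])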

[  0.      0.64    0.64    0.      0.32    0.      0.      0.      0.
   0.      0.      0.64    0.      0.      0.      0.32    0.      1.29
   1.93    0.      0.96    0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.778   0.      0.
   3.756  61.    278.      1.   ]


In [14]:
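# Use the first 48 columns (the word-frequency features) as inputs
X = dataset[:, 0:48]

# The last column is the class label (1 = spam, 0 = not spam)
y = dataset[:, -1]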

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=17)

In [18]:
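# binarize expects a numeric threshold; True is treated as 1.0 here,
# so a feature becomes 1 only where its value exceeds 1.0
BernNB = BernoulliNB(binarize=True)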
BernNB.fit(X_train, y_train)
print(BernNB)

y_expect = y_test
y_pred = BernNB.predict(X_test)

print(accuracy_score(y_expect, y_pred))

BernoulliNB(binarize=True)
0.8577633007600435


In [19]:
MultiNB = MultinomialNB()
MultiNB.fit(X_train, y_train)
print(MultiNB)


y_pred = MultiNB.predict(X_test)

print(accuracy_score(y_expect, y_pred))

MultinomialNB()
0.8816503800217155


In [20]:
GausNB = GaussianNB()
GausNB.fit(X_train, y_train)
print(GausNB)
y_pred = GausNB.predict(X_test)
print(accuracy_score(y_expect, y_pred))

GaussianNB()
0.8197611292073833


In [None]:
# Lower threshold: a feature becomes 1 wherever its value exceeds 0.1
BernNB = BernoulliNB(binarize=0.1)
BernNB.fit(X_train, y_train)
print(BernNB)

y_expect = y_test
y_pred = BernNB.predict(X_test)

print(accuracy_score(y_expect, y_pred))

BernoulliNB(alpha=1.0, binarize=0.1, class_prior=None, fit_prior=True)
0.9109663409337676
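
The `binarize` argument is the threshold `BernoulliNB` uses to turn each numeric feature into a 0/1 indicator, which may explain why the accuracy differs between the two Bernoulli runs above. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up for illustration and are not Spambase) to check that passing `binarize=0.1` gives the same predictions as thresholding by hand.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic non-negative "frequency" features (hypothetical data, not Spambase)
rng = np.random.RandomState(0)
X_demo = rng.exponential(scale=0.3, size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0.5).astype(int)

threshold = 0.1

# Option 1: let BernoulliNB binarize the features internally
nb_internal = BernoulliNB(binarize=threshold).fit(X_demo, y_demo)

# Option 2: binarize by hand and switch internal binarization off
X_binary = (X_demo > threshold).astype(float)
nb_manual = BernoulliNB(binarize=None).fit(X_binary, y_demo)

same = np.array_equal(nb_internal.predict(X_demo), nb_manual.predict(X_binary))
print("Identical predictions:", same)
```

Since most Spambase word frequencies are small, a threshold of 1.0 maps many non-zero values to 0, while 0.1 keeps more of the "word is present" signal.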


# Other titles

## Data

##### VARIABLE DESCRIPTIONS

### Checking that your target variable is binary

### Checking for missing values

### Taking care of missing values
##### Dropping missing values (evaluate each variable)

### Imputing missing values for var1

### Converting categorical variables to dummy indicators

### Checking for independence between features

### Checking that your dataset size is sufficient

### Deploying and evaluating the model

## Model Evaluation



### Classification report without cross-validation
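
One possible way to produce such a report is sketched below. It reuses the iris split and `GaussianNB` model from earlier in this notebook, which is an assumption, since no dataset is named for this step.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Single stratified train/test split, no cross-validation
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris["data"], iris["target"],
    test_size=0.2, stratify=iris["target"], shuffle=True, random_state=42
)

y_hat = GaussianNB().fit(X_tr, y_tr).predict(X_te)

# Per-class precision, recall, F1-score and support
report = classification_report(y_te, y_hat, target_names=iris["target_names"])
print(report)
```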

### K-fold cross-validation & confusion matrices
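
A minimal sketch of this step, assuming the iris data and a `GaussianNB` model (the choice of 5 stratified folds and `random_state=42` is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_all, y_all = iris["data"], iris["target"]

# 5 stratified folds; every sample is used for testing exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold
scores = cross_val_score(GaussianNB(), X_all, y_all, cv=cv)
print("Fold accuracies: {}".format(scores))
print("Mean accuracy: {:.3f}".format(scores.mean()))

# Out-of-fold predictions for every sample, summarized in a confusion matrix
cv_pred = cross_val_predict(GaussianNB(), X_all, y_all, cv=cv)
cm = confusion_matrix(y_all, cv_pred)
print(cm)
```

Rows of the confusion matrix are true classes and columns are predicted classes, so off-diagonal entries count the misclassified samples.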

### Make a test prediction