# Naive Bayes
### Building a Spam Detection Model

In [2]:
import pandas as pd
df = pd.read_csv('spam.csv')
df.head()

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
df.groupby('Category').describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


In [6]:
# Encode the label as a binary target: 1 for spam, 0 for ham
df['spam'] = df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()

,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [7]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.25)

#### Converting the text in the 'Message' column into numeric word counts

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
x_train_count = v.fit_transform(x_train.values)
x_train_count.toarray()[:5]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
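
The matrix above is mostly zeros because each message uses only a few words from the whole vocabulary. To make the conversion easier to see, here is a minimal sketch on a made-up two-message corpus. The messages and the names `toy_messages`, `toy_vectorizer`, `vocab`, and `dense` are illustrative assumptions and do not come from `spam.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from spam.csv
toy_messages = ["free entry win win", "call me later"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each word to its column index; sorting by index
# recovers the column order (alphabetical for a fitted vocabulary)
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
dense = toy_counts.toarray()

print(vocab)   # one entry per column
print(dense)   # one row per message, one count per word
```

Each row is one message and each column is one word, so "win" appearing twice in the first message shows up as a count of 2 in that row.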

In [9]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train_count, y_train)

In [11]:
# Reuse the vocabulary fitted on the training set: transform only, do not refit on test data
x_test_count = v.transform(x_test)
model.predict(x_test_count)

array([0, 0, 1, ..., 0, 0, 0], dtype=int64)

In [12]:
model.score(x_test_count, y_test)

0.9842067480258435

In [13]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [14]:
clf.fit(x_train, y_train)

In [15]:
clf.score(x_test, y_test)

0.9842067480258435

In [16]:
clf.predict(x_test)

array([0, 0, 1, ..., 0, 0, 0], dtype=int64)

In [17]:
y_test

1400    0
4197    0
5487    1
1074    0
2698    0
       ..
5163    0
800     0
4535    0
1364    0
2752    0
Name: spam, Length: 1393, dtype: int64
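
Because the pipeline bundles the vectorizer and the classifier, it accepts raw strings directly, which is convenient for scoring new messages. The sketch below shows this on a tiny made-up training set so that it runs without `spam.csv`. The messages, labels, and the names `demo_clf`, `train_messages`, and `new_messages` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the training data (1 = spam, 0 = ham)
train_messages = [
    "win a free prize now",
    "free entry claim your prize",
    "are we meeting for lunch",
    "call me when you get home",
]
train_labels = [1, 1, 0, 0]

demo_clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])
demo_clf.fit(train_messages, train_labels)

# Raw text goes straight in; the pipeline vectorizes it before predicting
new_messages = ["claim your free prize", "see you at lunch"]
predictions = demo_clf.predict(new_messages)
probabilities = demo_clf.predict_proba(new_messages)  # columns: P(ham), P(spam)

print(predictions)
print(probabilities)
```

With the pipeline fitted on the real data, the same two calls (`clf.predict` and `clf.predict_proba`) would apply to any list of message strings.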

## Another Task

Use the wine dataset from sklearn.datasets to classify wines into 3 categories. Load the dataset and split it into train and test sets. Then train a Gaussian and a Multinomial Naive Bayes classifier and report which model performs better. Finally, use the trained models to make predictions on the test data.

In [20]:
from sklearn.datasets import load_wine
wine = load_wine()
dir(wine)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [23]:
data = pd.DataFrame(wine.data, columns=wine.feature_names)
data

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0


In [24]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.2)

In [25]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB
model1 = GaussianNB()
model2 = MultinomialNB()

In [26]:
model1.fit(x_train, y_train)

In [27]:
model1.score(x_test, y_test)

0.9722222222222222

In [28]:
model2.fit(x_train, y_train)

In [29]:
model2.score(x_test, y_test)

0.8611111111111112

#### On this test split, GaussianNB (accuracy 0.972) performs better than MultinomialNB (accuracy 0.861)

In [30]:
model1.predict(x_test)

array([2, 1, 0, 1, 1, 1, 2, 0, 2, 1, 2, 1, 1, 1, 1, 2, 0, 1, 2, 1, 2, 2,
       0, 1, 1, 0, 0, 2, 1, 0, 2, 1, 1, 2, 2, 0])

In [31]:
model2.predict(x_test)

array([1, 1, 0, 1, 1, 1, 2, 0, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1,
       1, 1, 1, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 0])
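
The comparison above rests on a single random split with only 36 test samples, so the scores can shift from run to run. One way to get a steadier comparison is cross-validation. The following is a minimal sketch, assuming 5-fold cross-validation with `cross_val_score`; this step is not part of the original notebook, and the variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB

wine = load_wine()

# Each call fits and scores the model on 5 different train/test folds
gaussian_scores = cross_val_score(GaussianNB(), wine.data, wine.target, cv=5)
multinomial_scores = cross_val_score(MultinomialNB(), wine.data, wine.target, cv=5)

# Compare the mean accuracy across folds rather than a single split
print("GaussianNB   :", gaussian_scores.mean())
print("MultinomialNB:", multinomial_scores.mean())
```

MultinomialNB is designed for count-like, non-negative features, while the wine measurements are continuous, which is a plausible reason for GaussianNB fitting this dataset better.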