<a type="anchor"></a>
# **Proyecto MAC**
# **Clasificador con redes bayesianas**
Alumno: Victor Alberto Lopez Cardona

Exp: 747175

<a type="anchor" id="100"></a>
# **Contenido**

1.	[Introduccion](#1)
2.	[Requerimientos](#2)
3.	[Datos](#3)
4.	[Limpiar la informacion](#4)
5.	[Transformar la informacion en vectores](#5)
6.	[Entrenar el modelo](#6)
7.	[Pruebas](#7)
8.	[Conclusiones](#8)


# **1. Introduccion** <a type="anchor" id="1"></a>
[Contenido](#100)

## **Multinomial Naïve Bayes algorithm**

With a Multinomial Naïve Bayes model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial (p1, . . . ,pn) where pi is the probability that event i occurs. Multinomial Naïve Bayes algorithm is preferred to use on data that is multinomially distributed. It is one of the standard algorithms which is used in text categorization typeification.

## **Gaussian Naïve Bayes algorithm**


When we have continuous attribute values, we made an assumption that the values associated with each type are distributed according to Gaussian or Normal distribution. For example, suppose the training data contains a continuous attribute x. We first segment the data by the type, and then compute the mean and variance of x in each type. Let µi be the mean of the values and let σi be the variance of the values associated with the ith type. Suppose we have some observation value xi . Then, the probability distribution of xi given a type can be computed by the following equation –


![Gaussian Naive Bayes algorithm](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQEWCcq1XtC1Yw20KWSHn2axYa7eY-a0T1TGtdVn5PvOpv9wW3FeA&s)

## **Bernoulli Naïve Bayes algorithm**

In the multivariate Bernoulli event model, features are independent boolean variables (binary variables) describing inputs. Just like the multinomial model, this model is also popular for document typeification tasks where binary term occurrence features are used rather than term frequencies.

Para nuestro proyecto utlizaremos Multinomial Naive Bayes algorithm porque...

# **2. Requerimientos** <a type="anchor" id="2"></a>
[Contenido](#100)

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
message = pd.read_csv('SMSSpamCollection',sep='\t',names=["type","message"])
message.head()



Unnamed: 0,type,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


# **3. Datos** <a type="anchor" id="3"></a>
[Contenido](#100)

In [20]:
message.groupby('type').describe()

Unnamed: 0_level_0,message,message,message,message
Unnamed: 0_level_1,count,unique,top,freq
type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
ham,4825,4516,"Sorry, I'll call later",30
spam,747,653,Please call our customer service representativ...,4


In [4]:
message['length']=message['message'].apply(len)
message.head()

Unnamed: 0,class,message,length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61


In [5]:
message.length.describe()

count    5572.000000
mean       80.489950
std        59.942907
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: length, dtype: float64

El mensaje mas largo consta de 910 caracteres, entonces...

La longitud del texto es un indicativo de si el mensaje es spam?

Ahora vamos a limpiar nuestra informacion

# **4. Limpiar la informacion** <a type="anchor" id="4"></a>
[Contenido](#100)

In [6]:
import nltk
try:
    nltk.corpus.stopwords.words('english')
except:
    nltk.download('stopwords')

def clean_data(mnsj):
    msg = [x for x in mnsj if x not in string.punctuation]
    msg = ''.join(msg)
    clean_msg = [x for x in msg.split() if x.lower not in nltk.corpus.stopwords.words('english')]
    return clean_msg

message['message'].apply(clean_data)


0       [Go, until, jurong, point, crazy, Available, o...
1                          [Ok, lar, Joking, wif, u, oni]
2       [Free, entry, in, 2, a, wkly, comp, to, win, F...
3       [U, dun, say, so, early, hor, U, c, already, t...
4       [Nah, I, dont, think, he, goes, to, usf, he, l...
                              ...                        
5567    [This, is, the, 2nd, time, we, have, tried, 2,...
5568         [Will, ü, b, going, to, esplanade, fr, home]
5569    [Pity, was, in, mood, for, that, Soany, other,...
5570    [The, guy, did, some, bitching, but, I, acted,...
5571                     [Rofl, Its, true, to, its, name]
Name: message, Length: 5572, dtype: object

# **5. Transformar la informacion en vectores** <a type="anchor" id="5"></a>
[Contenido](#100)

In [7]:
bow_transformer = CountVectorizer(analyzer=clean_data).fit(message['message'])
messages_bow = bow_transformer.transform(message['message'])
print('Shape of Sparse Matrix: ',messages_bow.shape)
print('Amount of non-zero occurences:',messages_bow.nnz)

Shape of Sparse Matrix:  (5572, 11747)
Amount of non-zero occurences: 79463


In [8]:
sparsity =(100.0 * messages_bow.nnz/(messages_bow.shape[0]*messages_bow.shape[1]))
print('sparsity:{}'.format(round(sparsity)))

sparsity:0


In [12]:
tfidf_transformer=TfidfTransformer().fit(messages_bow)
messages_tfidf=tfidf_transformer.transform(messages_bow)
print(messages_tfidf.shape)

(5572, 11747)


# **6. Entrenar el modelo** <a type="anchor" id="6"></a>
[Contenido](#100)

In [31]:
spam_detect_model = MultinomialNB().fit(messages_tfidf,message['type'])

message4=message['message'][3]
bow4=bow_transformer.transform([message4])
tfidf4 = tfidf_transformer.transform(bow4)
print('predicted:',spam_detect_model.predict(tfidf4)[0])
print('expected:',message.type[3])

predicted: ham
expected: ham


# **7. Pruebas** <a type="anchor" id="7"></a>
[Contenido](#100)

In [24]:
all_predictions = spam_detect_model.predict(messages_tfidf)
print(all_predictions)

['ham' 'ham' 'spam' ... 'ham' 'ham' 'ham']


Checar la precision

In [35]:
from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'. format(accuracy_score(message['type'], all_predictions)))

Model accuracy score: 0.9722


# **8. Conclusiones** <a type="anchor" id="8"></a>
[Contenido](#100)