## CORRECTION
https://nbviewer.jupyter.org/github/fxbabin/ateliers_ml_2019/blob/master/Semaine7/NaiveBayesSpamClassifier.ipynb

# Naive Bayes pour la classification du spam


La classification naïve bayésienne (Naive bayes) est un algorithme de classification probabiliste relativement simple qui convient bien aux données que l'on peut catégoriser.

En Machine Learning, les applications courantes de Naive Bayes sont la classification des courriels non sollicités (spam), l'analyse des sentiments (emotion analyses) et la catégorisation des documents (data categorisation). Naive Bayes présente des avantages par rapport aux autres algorithmes de classification couramment utilisés en raison de sa simplicité, de sa rapidité et de sa précision sur de petits ensembles de données.

## Data Description

Nous utiliserons les données du dépot d'apprentissage machine de l'UCI qui contient plusieurs commentaires Youtube de vidéos musicales très populaires. Chaque commentaire dans les données a été étiqueté comme spam ou ham (commentaire légitime), et nous utiliserons ces données pour former notre algorithme Naive Bayes pour la classification de spam commentaire youtube.

In [1]:
# Import modules

# Pour la manipulation des données
import pandas as pd

# Pour les opérations matricielles
import numpy as np

# Pour utiliser le regexabs
import re

In [2]:
# Charger les données du fichier 'YoutubeCommentsSpam.csv' en utilisant pandas
data_comments = pd.read_csv('YoutubeCommentsSpam.csv')

# Créer des libellés de colonne : "content" et "label". 
# conseils : la méthode 'colums' peut être utile 
data_comments.columns = ['content','label']

# Afficher les premières lignes de notre ensemble de données pour s'assurer que la colonne "label" a été ajoutée
data_comments.head(10)
data_comments["content"]

0                +447935454150 lovely girl talk to me xxx
1          I always end up coming back to this song<br />
2       my sister just received over 6,500 new <a rel=...
3                                                    Cool
4                               Hello I am from Palastine
5       Wow this video almost has a billion views! Did...
6       Go Sgrout check out my rapping video called Fo...
7                                        Almost 1 billion
8                    sgrout Aslamu Lykum... From Pakistan
9       Eminem is idol for very people in EspaÃ±a and ...
10                            Help me get 50 subs please 
11                                         i love song :)
12      Alright ladies, if you like this song, then ch...
13      The perfect example of abuse from husbands and...
14       The boyfriend was Charlie from the TV show LOST 
15      <a href="https://www.facebook.com/groups/10087...
16                  Take a look at this video on YouTube:
17            

$\textbf{ATTENTION: Ne regardez pas les liens dans les commentaires, ce sont des spams! ;)}$

In [3]:
# Afficher les commentaires de spam dans les données
# N'ALLEZ PAS SUR LES LIENS !!!!! sérieusement, ce sont des spams.... 
print(data_comments["content"][data_comments["label"] == 1])

0                +447935454150 lovely girl talk to me xxx
2       my sister just received over 6,500 new <a rel=...
4                               Hello I am from Palastine
6       Go Sgrout check out my rapping video called Fo...
8                    sgrout Aslamu Lykum... From Pakistan
10                            Help me get 50 subs please 
12      Alright ladies, if you like this song, then ch...
15      <a href="https://www.facebook.com/groups/10087...
16                  Take a look at this video on YouTube:
17                 Check out our Channel for nice Beats!!
19                    Check out this playlist on YouTube:
21                                            like please
24      I shared my first song &quot;I Want You&quot;,...
25      Come and check out my music!Im spamming on loa...
26                    Check out this playlist on YouTube:
27      HUH HYUCK HYUCK IM SPECIAL WHO S WATCHING THIS...
30      Check out this video on YouTube:<br /><br />Lo...
33            

En parcourant les commentaires qui ont été étiquetés comme spam dans ces données, il semble que ces commentaires sont soit sans rapport avec la vidéo, soit comme une forme de publicité. L'expression "check out" semble être très populaire dans ces commentaires.

## Summary Statistics and Data Cleaning

Le tableau ci-dessous montre que cet ensemble de données se compose de $1959$ commentaires youtube, dont environ $49\%$ sont des commentaires légitimes et environ $51\%$ sont du spam. Cette grande variation de classes dans notre ensemble de données nous aidera à tester l'exactitude de nos algorithmes sur l'ensemble des données de test. 

La longueur moyenne de chaque commentaire est d'environ $96$ caractères, ce qui représente environ $15$ mots en moyenne par commentaire.

In [4]:
# Ajouter une nouvelle colonne pour la longueur de chaque commentaire
# conseils: utiliser map et lambda
data_comments['length_as_lambda'] = list(map (lambda x: len(x), data_comments['content']))
data_comments['length_as_len'] = list(map (len, data_comments['content']))
data_comments['length'] = data_comments['content'].apply(len)

# Permet d'afficher un tableau avec plusieurs données statistiques (mean, stdev, min, max)
data_comments[["label","length_as_lambda"]].describe()


data_comments.loc[:,["length_as_lambda","length_as_len",'length']]

Unnamed: 0,length_as_lambda,length_as_len,length
0,40,40,40
1,46,46,46
2,200,200,200
3,4,4,4
4,25,25,25
5,73,73,73
6,65,65,65
7,16,16,16
8,36,36,36
9,69,69,69


Pour notre algorithme de classification Naive Bayes, nous diviserons les données en deux parties: entrainement et tests. La partie d'entrainement sera utilisé pour former l'algorithme de classification du spam, et l'ensemble de test ne sera utilisé que pour tester sa précision. 

En général, La partie d'entrainement devrait être plus grand que La partie de test et les deux devraient provenir de la même population (la population dans notre cas est Youtube commentaires pour les vidéos musicales). 

**Nous sélectionnerons au hasard $75\%$ des données pour la formation et $25\%$ des données pour les tests.**

In [5]:
# Séparons les données en 2 groupes ! (75% training, 25% test)

# Ceci nous permet d'obtenir la même allocation aléatoire pour chaque série de codes. RTFM if you want ;)
np.random.seed(2019)

# Ajout d'un vecteur colonne 'uniform' de nombres générés aléatoirement entre 0 et 1 
# Astuce : dans numpy, il existe une méthode pour prélever un échantillon à partir d'une distribution uniforme.

data_comments["uniform"] = np.random.uniform(0,1, len(data_comments))
#print(data_comments['uniform'])

#a_list_of_unif = np.random.uniform(0,1, len(data_comments))
#a_list_of_unif
# Comme le nombre dans notre colonne 'uniform' est distribué uniformément, 
# environ 75 % de ces chiffres devraient être inférieurs à 0,75 %, prenons ces 75 %.
data_comments_train = data_comments[data_comments["uniform"] < 0.75]

# Même chose pour les 25 % de ces numéros qui sont supérieurs à 0,75
data_comments_test = data_comments[data_comments["uniform"] >= 0.75]

print(data_comments_train.head())
print(data_comments_test.head())

                                             content  label  length_as_lambda  \
1     I always end up coming back to this song<br />      0                46   
2  my sister just received over 6,500 new <a rel=...      1               200   
3                                               Cool      0                 4   
5  Wow this video almost has a billion views! Did...      0                73   
6  Go Sgrout check out my rapping video called Fo...      1                65   

   length_as_len  length   uniform  
1             46      46  0.393081  
2            200     200  0.623970  
3              4       4  0.637877  
5             73      73  0.299172  
6             65      65  0.702198  
                                              content  label  \
0            +447935454150 lovely girl talk to me xxx      1   
4                           Hello I am from Palastine      1   
7                                    Almost 1 billion      0   
8                sgrout Aslamu Lyku

In [6]:
# Vérifiez que les données d'entraînement contiennent à la fois des commentaires de spam et de ham
data_comments_train["label"].describe()

count    1467.000000
mean        0.507157
std         0.500119
min         0.000000
25%         0.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: label, dtype: float64

In [7]:
# Même chose pour le data test
data_comments_test["label"].describe()

count    492.000000
mean       0.528455
std        0.499698
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: label, dtype: float64

Les données d'entrainement et de test ont toutes deux un bon mélange de spam et de ham, nous sommes donc prêts à passer à la formation sur le classificateur Naive Bayes. 

In [8]:
# Joindre tout les commentaires dans une seule et même liste
# Astuce: 'separator'.join(list)
training_list_words = ' '.join(data_comments_train["content"])

# Diviser la liste des commentaires en une liste de mots uniques
# Astuce: set() and sorted()
train_unique_words = set(sorted(str.split(training_list_words)))

# Nombre de mots uniques dans le data training
vocab_size_train = len(train_unique_words)

#print(training_list_words)

In [9]:
# Description résumée des commentaires
print('Unique words in training data: %s' % vocab_size_train)
print('First 5 words in our unique set of words: \n % s' % list(train_unique_words)[1:6])

Unique words in training data: 5491
First 5 words in our unique set of words: 
 ['clicked', 'Dear', 'mother', 'DID', 'Hi']


Cela devrait ressember à quelquechose comme ça:

```Unique words in training data: 5898
First 5 words in our unique set of words: 
['now!!!!!!', 'yellow', 'four', '/>.Pewdiepie', 'Does']```


Actuellement, "now !!" et "Now !!!!!!", ainsi que "DOES", "DoEs", et "does" sont tous considérés comme des mots uniques. Pour la classification du spam, il est probablement préférable de traiter légèrement les données pour en améliorer l'exactitude. Dans notre cas, nous pouvons nous concentrer sur les lettres et les chiffres, ainsi que convertir tous les commentaires en minuscules.

In [10]:
def cleanword(a_list_of_word):
    a_list_of_word = [ re.sub('[^a-zA-Z0-9]*', '', word) for word in a_list_of_word]
    a_list_of_word = set([ word.lower() for word in a_list_of_word])
    return a_list_of_word

In [14]:
# Garder uniquement les chiffres et les lettres
# Astuce: utiliser regex et les list comprehension
#train_unique_words_alt = [ re.sub('\W*', '', word) for word in train_unique_words]
vocab_size_train_before = len(train_unique_words)
train_unique_words_alt = cleanword(train_unique_words)
train_unique_words = [ re.sub('[^a-zA-Z0-9]*', '', word) for word in train_unique_words]

# Convertir toutes les lettres en minuscule
# Astuce: set() ?
train_unique_words_alt = set([ word.lower() for word in train_unique_words_alt])
train_unique_words = set([ word.lower() for word in train_unique_words])
# Nombre de mots uniques dans le data training
vocab_size_train = len(train_unique_words)

# Description résumée des commentaires
#print('Unique words in processed training data: %s' % vocab_size_train)
print('First 5 words in our processed unique set of words: \n % s' % list(train_unique_words)[1:6],vocab_size_train_before, vocab_size_train)
print('alt First 5 words in our processed unique set of words: \n % s' % list(train_unique_words_alt)[1:6], len(train_unique_words_alt))

First 5 words in our processed unique set of words: 
 ['colour', 'hot', 'quothow', 'httpwwwgcmforexcompartnersawaspxtaskjoint2ampaffiliateid9107', 'versace'] 5491 3481
alt First 5 words in our processed unique set of words: 
 ['colour', 'httpwwwgcmforexcompartnersawaspxtaskjoint2ampaffiliateid9107', 'hot', 'quothow', 'versace'] 3481


## Naive Bayes for Spam Classification

Ok, alors voilà le plan :

- Tout d'abord, nous avons séparé nos données de formation en 2 sous-ensembles : training et test.

- puis nous allons créer plusieurs fonctions pour vérifier combien de fois chaque mot est apparu dans le spam et non dans les commentaires de spam, 
    - et vérifier la probabilité que chaque mot apparaisse dans le spam/non spam

- alors les 2 fonctions les plus importantes : train() et classify()

- Et enfin, nous allons vérifiez l'exactitude de nos prédictions.

Passons au code !

In [15]:
#trainPositive = dict(\
#                     word : count for word in \
#                     set(sorted(str.split(data_comments_train["content"][data_comments_train["label"] == 1])))

In [16]:
# On initialise des dictionnaire avec des mots de commentaires comme "keys", 
# et leur étiquette comme "value".
trainPositive = dict()
trainNegative = dict()

# On initialise ces variables à zéro
positiveTotal = 0
negativeTotal = 0

# Même chose, mais en float ;) 
pSpam = .0
pNotSpam = .0

# Laplace smoothing
alpha = 1

In [17]:
#def initialize_dicts():

# # On initialise les dictionnaires avec 0 comme valeur 
for word in train_unique_words:
    #print(word)
    global trainPositive
    global trainNegative
    # skip empty words('' and ' ')
    if word == '' or word == ' ':
        continue # goes to next word
    
    # Pour le moment, tout est classifier comme ham (légitime) 
    ## NON for the moment nothing is none
    trainPositive[word] = 0 #create word
    trainNegative[word] = 0
print(trainNegative['sgrout'])

0


In [18]:
# Compter le nombre de fois que le mot dans le commentaire apparaît dans les commentaires de spam et ham
def processComment(comment,label):
    global positiveTotal
    global negativeTotal
    global trainNegative
    global trainPositive
    
    # Séparer le commentaire en liste de mots
    comment = cleanword(set(sorted(str.split(comment))))
    
    # Pour chaque mot du commentaire
    for word in comment:
        print(word)
        ##print(type(train_unique_words))
        #Checker si le mot est bien dans la base de donnée
        if not(word in trainPositive):
            print("mot inconnu")
            continue
        
        #Checker si ce n'est pas un '' ou ' '
        if word == '' or word == ' ':
            continue
        
        # Checker si le mot n'est pas du spam (ham)
        if label == 0:
            trainNegative[word] += 1
            negativeTotal += 1
            # Incrémenter le nombre de fois que le mot apparaît dans les commentaires non spam
            
            
        # spam comments
        if label == 1:
            trainPositive[word] += 1
            positiveTotal += 1
            # Incrémenter le nombre de fois que le mot apparaît dans les commentaires spam
            
            
#onedata = data_comments_train.loc[158]
#data_comments.loc[2:3,["label"]]

##processComment(onedata["content"],onedata["label"])
#print(onedata["content"],onedata["label"])

#print(trainPositive['help'],trainNegative['someone'],positiveTotal, negativeTotal)

##METTRE LA FORMULE



In [19]:
# Ici, on a la fonction qui va calculer la Prob(word|spam) et Prob(word|ham)
def conditionalWord(word,label):
   
    # Paramètre de lissage de Laplace (Laplace Smoothing)
    # Rappel : pour avoir accès à une variable globale à l'intérieur d'une fonction 
    # vous devez le spécifier en utilisant le mot 'global'.
    global positiveTotal
    global negativeTotal
    global alpha
    
    if word not in trainPositive:
        return 1.0
    # word in ham comment
    if(label == 0):
        print("trying word[",word,"]")
        # Calculer Prob(word|ham)
        print((trainNegative[word] + alpha) / (float)(negativeTotal + (alpha * vocab_size_train)))
        return (trainNegative[word] + alpha) / (float)(negativeTotal + (alpha * vocab_size_train))
    
    # word in spam comment
    else:
        # Calculer Prob(word|ham)
        ##print(trainPositive[word] + alpha) / (float)(positiveTotal + (alpha * vocab_size_train))
        return (trainPositive[word] + alpha) / (float)(positiveTotal + (alpha * vocab_size_train))
        
       
       

In [20]:
# Ici, on a la fonction qui va calculer la Prob(spam|comment) or Prob(ham|comment)
def conditionalComment(comment,label):
    
    # On initialise la probabilité conditionelle
    prob_label_comment = 1.0
    
    # On sépare le commentaire en liste de mots
    #comment = cleanword(set(sorted(str.split(comment))))
    comment = comment.split(' ')
    
    # Pour chaque mot du commentaire
    for word in comment:
        
        # Calculer la valeur de P(label|comment)
        # On suppose ici qu'on a une independance conditionnelle (p(A) * p(B))
        prob_label_comment = prob_label_comment * conditionalWord(word,label)
    
    return prob_label_comment

In [21]:
# Calculer plusieurs probabilités conditionnelles dans les données d'entraînement
def train():
    # Rappel: on aura besoin de pSpam et pNotSpam ici ;) 
    ## NON on peut juste les trouver tout de suite
    global pSpam
    global pNotSpam

    # Initialisation de nos variables: le nombre total de commentaires et le nombre de commentaires de spam 
    total = 0.0
    num_spam = 0.0
    
    #print(data_comments_train.head())
    print('Starting training ...')
    
    # Passez en revue chaque commentaire dans les données d'entraînement 
    for index, row in data_comments_train.iterrows():
        #print(type(row))
        #print(row)
        total += 1
                
       # Vérifiez si le commentaire est du spam ou non (ham)
        if row.label == 1: #spam
       # Incrémenter les valeurs selon que le commentaire est du spam ou non
            num_spam += 1
       # Mettre à jour le dictionnaire du spam et ham
        
        ##print("line is [",row.content,"]\n")
        processComment(row.content,row.label)
            
        #conditionalComment(row.content,row.label)
        
    # Calcule des probabilitées a priori, P(spam), P(ham)
    pSpam = num_spam / total
    pNotSpam = 1 - pSpam
    print('Training done')

In [22]:
# Lancer notre fonction train de Naive Bayes
train()

Starting training ...
back
this

mot inconnu
always
end
coming
up
songbr
i
to
youtube
was
used
sister
received
com
my
classothashtag
a
now
the
just
views
only
right
6500
new
she
thing
hrefhttpsplusgooglecoms23activeactivea
pimpmyviews
over
relnofollow
cool
this
has
views
video
was
t
wow
know
almost
a
billion
so
popular
it
didn

mot inconnu
video
check
my
out
go
rapping
sgrout
four
called
wheels
please
idol
latinoamerica
or
eminem
and
is
people
for
very
espaa
in
mexico
me
get
subs
50
help
please

mot inconnu
love
song
i
hot
than
like
check
not
rapper
but
out
some
you
hotter
alright
better
this
channel
my
lyrically
he
a
coming
songs
the
if
on
rage
game
john
into
smoking
song
s
hear
eminem
then
his
of
ladies
the
was
show
charlie
lost
tv
boyfriend
from
at
this
youtube
video
on
look
a
take
channel
beats
our
check
nice
out
for
rihanna
are
together
eminem
unstoppable
and

mot inconnu
this
song
br
basically
and
rape
is
cheating
about
like
please
this
5
song
was
old
t
get
ago
can
though
years
b

love
this
song
two
s
know
guys
part
there
a
of
do
you
d

mot inconnu
at
qerrassa
check
make
yboiveth
over
out
you
scale
visit
called
it
home
start
this
moneygqcom
reiltas
shame
parede
mogotrevo
plifal
today
qiameth
from
can
as
shorogyt
month
website
eggmode
drirathiel
value
extraordinary
besloor
and
guys
hoppler
making
moderock
i
online
memory
ferirama
money
3000
wazzasoft
working
am
should
sasaroo
per
charlie
s
hey
lost
from
it
this
youtube
video
check
on
out
me
this
fears
my
guss
be
to
life
abusive
him
and
who
live
towered
worst
i
song
someone
with
one
is
of
in
the
plz
youtube
video
watch
vote
quoteminem
share
of
yearquot
to
artist
ytma
the
someone
video
get
dont
still
and
help
years
5
i
music

mot inconnu
song
video
check
t
new
ad
my
out
don
play
music
please
fgw
the
us
like
tell
and
subscribe
title
so
your
i
can
music
please

mot inconnu
views
this
br
video
kinda
close
is
1
to
million
you
eminem
fuck

mot inconnu
at
hilarious
check
make
drews
out
expansion
you
visit
called
smoke
it

my
way
a
coming
no
got
subscribers
to
criticism
leave
little
the
if
been
constructive
on
and
im
heard
ive
getting
who
just
so
favour
told
up
young
really
i
take
views
have
4000
enjoy
of
its
sub
wont
music
please
eminem
rocks
youtube
funny
check
cool
my
and
out
for
videos
rap
sometime
keep
make
luck
you
good
honesty
they
this
much
get
my
kill
thanks
12year
for
doing
can
makeing
breaken
on
dream
people
self
so
cazzy
love
i
hay
song
harts
old
or
rule
am
go
is
thank
wright
sure
your
comeing
awesome
lovet
i
video
like
my
and
subscribe
please
btw

mot inconnu
have
keep
good
it
brother
a
you
up
subscribed
style

mot inconnu
the
best
eminem

mot inconnu
15
trying
check
make
bless
rapper
out
hello
you
me
my
help
a
friends
to
life
can
second
and
reach
god
young
videos
i
take
year
old
something
am
dreams
of
nothingplease
your
song
beautiful
best
ever
rap
this
song
i
lover
appreciate
big
wanna
skip
check
rapper
look
time
know
but
out
hey
you
m
this
my
a
coming
for
constructive
and
dream
guys
chanc

sucks
love
because
i
it

mot inconnu
picture
like
hey
likes
where
you
concert
lot
it
this
perry
a
be
to
if
could
httpswwwfacebookcomexagdlphotosa9368685796602841073741943111719098841907937732262907249type1amptheater
nice
would
and
guys
im
so
very
really
i
loosing
katy
contest
go
her
of
thank
need
want
in
please
this
inaccurate
video
rip
face
would
is
her
a
of
very
tiger
check
out
great
channel
creativity
reviews
perry
a
for
no
review
well
as
our
an
plus
and
comparison
tech
artist
comparisons
katy
video
admire
nonsense
iphone
by
her
mobile
6
in
animals
real
are
those
this
the
15
video
done
best
in
pop
folks
is
last
how
years
maybe
music
katy
perry
lt3
love
i
boooobs
views
this
has
understand
600
or
lol
dont
not
im
anything
its
hating
how
just
million
really
i
confusing
katy
happy
katys
say

mot inconnu
october
almost
25th
its
birthday
to
lets
katy
song
go
you
great
my
telephone
this
views
video
get
would
be
billion
1
want
to
amazing
really
stole
out
started
catchy
hating
book
perry
all


as
it
katy

mot inconnu
the
my
girl
is
dreams
of
mine
this
love
song
i
fit
is
she
me
the
think
song
fill
make
roar
way
am
love
i
it
have
has
the
11000
plane
happening
and
what
forget
surviving
chance
a
of
do
you
choice
how
crash
fuck
hits
are
like
look
many
there
stupid
but
mothers
they
numbering
about
lot
because
it
great
guess
ever
this
piece
idiots
trash
a
all
for
these
to
foolish
little
the
garnered
fathers
lyrics
would
why
millions
and
also
turn
who
going
just
so
young
i
listen
into
have
video
singer
anyone
or
fool
that
she
thing
girls
is
obviously
her
of
how
arrogant
in

mot inconnu
full
firework
plz
katys
birtgday
attention
time
thumbs
browser
it
gtgtjust
wasting
left
share
for
be
to
1
from
500m
watch
10
could
on
hand
and
reach
more
up
need
gtgtdont
sec
listen
we
views
have
only
day
video
view
remember
possible
dont
half
open
different
its
tabs
how
million
30

mot inconnu
colors
the
beauty
httpsplaygooglecomstoreappsdetailsidcomrapidfacebookmagicdroid
characterized
by
fast
and
i

last
music
best
ever
song
nice
love
musique
super
wowwwwwwwwwwwwwwwwwwwwwwwwwwwwww
party

mot inconnu
8
rock
if
2011
it
watched
thumbs
you
up
in
increidebl
this
is
lmfao

mot inconnu
just
superr
2015
llikee
start
let
the
s
yeah
party
views
this
year
song
awesome
while
4
not
fairrrrrrr
switch
reach
1
its
800
mil
years
tango
to
bitch
needed
lmfao

mot inconnu
super
awesome
videobr

mot inconnu
dance
3
just
this
will
hrefhttpwwwyoutubecomwatchvkq6zr6kcpj8ampt3m40s340aa
the
video
favorite
my
always
part
but
is
a
be
songbr
in
hrefhttpwwwyoutubecomwatchvkq6zr6kcpj8ampt4m11s411a

mot inconnu
lmfao
moves
bennett
rock
meh
some
makes
hrefhttpsyoutubekq6zr6kcpj8httpsyoutubekq6zr6kcpj8a
lauren
ft
he
anthem
a
man
got
box
d
smile
br
party
goo

mot inconnu
ever
hirebr
song
the
this
best
i
actually
lmfao
uncle
funny
are
t
lmao
that
and
know
part
brother
her
is
its
of
mahogany
there
i
didn
loveparty
rock
anthem
is
party
life
top
stopis
a
1
very
back
suck
t
my
when
life
ahhh
didn
party
xd
rock

mot inco

turned
views
the
wanted
check
came
on
is
just
mute
soon
to
as
i
it
me
hey
subscribe
to
youre
not
started
stay
do
you
entire
me
alive
to
once
now
if
die
and
subscribe
stop
so
have
will
right
day
one
within
want
reading
family
httpstwittercomgbphotographygb
views
only
the
checking
im
httpwwwebaycomitm171183229277sspagenamestrkmeselxitamptrksidp3984m1555l2649
httpubuntuonecom40beuutvu2zkxk4utgpz8k
at
news
are
clothing
oncueapparel
you
visit
crop
follow
bringing
festival
us
perfect
apparel
tank
neon
company
site
tops
designs
edm
for
free
to
our
accessories
an
on
any
inspired
and
instagraml
more
vnecks
facebook
rave
giveaways
we
have
or
is
dedicated
tshirts
music
httpwwwteespringcomtigermeathoodie
show
here
pride
your
auburn
views
the
just
checking
channel
out
my
check

mot inconnu
comstrategygameandrijamatf
money
marketglory
earn
game
real
from

mot inconnu
yr
check
like
make
even
out
some
hey
you
it
my
thanks
chiptunes
a
be
to
checked
if
wonderful
would
and
guys
im
producer
remix
i
8bit
h

our
us
video
check
on
new
give
and
out
chance
a
you
be
disappointed
wont
music
please
me
subscribe
to
please
this
the
youtube
video
funny
best
and
history
true
is
of
viral
check
vidios
my
out
please
oppa
gangnam
style
lt3
oppa
holy
views
the
check
here
im
shit
to
this
have
loving
day
are
and
a
person
you
reading
dear
beautiful
great
httpwwwermailpldolaczv3veygin
click
toll
tried
tsu
even
twitter
you
thingyou
paid
this
fb
get
special
here
registr
share
a
well
can
as
has
the
on
and
posts
httpswwwtsucoworldwidelife
have
new
social
network
guy
suit
the
kinda
like
yellow
jaesuk
looks
in

mot inconnu
like
make
not
time
commenting
fbyou
you
paid
liking
fb
waste
get
here
a
for
onedont
everyone
to
free
difference
the
on
httpswwwtsucoslema13
being
and
also
people
posts
so
register
only
sharing
money
new
that
with
is
everyday
your
wellcome
network
this
the
video
are
and
27
there
watched
omg
people
7
billion
world
because
2
in
saying
he
is
what
this
has
views
many
so
views
omg
billion
over
2
me
pl

is
download
other
lets
simply
in
wakad
video
check
new
my
halloween
out
waka
trying
industry
wanna
in
are
like
give
enter
look
know
not
anything
but
prove
annoying
do
see
thumb
you
visit
about
comment
called
ourself
channel
made
this
us
get
boys
a
these
everyone
thankss
to
play
can
as
hell
the
if
our
dreamers
on
and
heard
chance
game
subscribe
just
so
up
maybe
sorry
i
14
take
we
things
have
only
song
year
old
that
one
music
please
back
me
u
plz
wilsubscribe
frndzzl
subscribe
i

mot inconnu
at
check
make
out
you
describes
visit
fantastic
called
it
home
start
this
moneygqcom
does
today
son
from
can
as
month
website
the
swim
why
extraordinary
and
audit
guys
chance
enlist
making
person
ice
i
rate
online
money
3000
frequent
working
am
fragile
should
how
over
per
at
media
35
social
youtube
facebooktwitter
are
t
make
experience
time
immediately
twitter
world
you
prior
skills
paid
begin
home
hour
mess
4netjobscom
monthly
get
wide
wasting
currently
a
work
all
incomeonly
for
to
from
can
as
such


click
day
singer
killing
anyone
only
s
by
hear
that
truly
am
enjoy
is
thank
of
writer
page
please
check
are
out
others
case
lands
send
to
the
courtthanks
on
and
line
supreme
google
l
reed
freedom
steven
justice
in
please

mot inconnu
write
guitar
original
continue
namepicture
know
too
drums
thumb
m
you
already
because
2
it
made
this
meet
my
way
help
a
record
completely
to
play
songs
musiclyrics
independent
the
if
on
nice
album
found
listenersupported
person
up
willing
bass
i
amp
johnny
click
keyboards
day
awesome
by
new
hear
sing
enjoy
re
is
100
1
want
wish
1manband
reading
music
wow

mot inconnu
this
love
song
this
quotthis
youtubebr
our
video
thankyou
check
shakira
on
africaquot
time
and
one
motherlandbr
out
for
trayvon
hot
this
views
the
i
that
and
it
am
girl
is
of
watched
amount
convinced
so
because
large
now

mot inconnu
me
this
song
gives
always
chills
song
dis
3
love
i
this
will
year
song
like
never
my
forget
of
a
for
your
hearing
life
comment
i
in
movement
the
join
check
old
co

In [28]:
# for fun, let's see that the pNotSpam is easy to find
1 - len(data_comments_train[data_comments_train["label"] == 1])/len(data_comments_train)

0.49284253578732107

In [29]:
pNotSpam

0.49284253578732107

In [30]:
len(data_comments_train[data_comments_train["label"] == 1])/len(data_comments_train)

0.5071574642126789

In [31]:
pSpam

0.5071574642126789

In [32]:
# Classifier les commentaires sont du spam ou ham
def classify(comment):
    
    # get global variables
    global pSpam
    global pNotSpam
    
    # Calculer la valeur proportionnelle à Pr(comment|ham)
    isNegative = pSpam * float(conditionalComment(comment, 0))
    
    # Calculer la valeur proportionnelle à Pr(comment|spam)
    isPositive = pNotSpam * float(conditionalComment(comment, 1))
    
    # Output -> True = spam, False = ham en fonction des 2 variables calculées précédemment (il faut comparer les variables)
    return (isNegative < isPositive)

In [33]:
print(trainNegative)
print(trainPositive)



In [34]:
# Initialiser la prédiction du spam dans les données de test
prediction_test = []

# Obtenez la précision des prédictions sur les données d'essai
for comment in data_comments_test.content:

    # ajouter un commentaire classifié à la liste prediction_test 
    prediction_test.append(classify(comment))

# Checker la précision: 
# D'abord le nombre de prédictions correctes 
correct_labels = np.equal(prediction_test, data_comments_test["label"])
# Ensuite la moyenne des prédictions correctes
test_accuracy = np.mean(correct_labels)

#print prediction_test
print("Proportion of comments classified correctly on test set: %s" % test_accuracy)

trying word[ girl ]
0.0012723995334535045
trying word[ to ]
0.009330929911992365
trying word[ me ]
0.003074965539179302
trying word[ xxx ]
0.0003180998833633761
trying word[ am ]
0.0015904994168168805
trying word[ from ]
0.0016965327112713393
trying word[ 1 ]
0.0020146325946347153
trying word[ billion ]
0.003817198600360513
trying word[ sgrout ]
0.0001060332944544587
trying word[ perfect ]
0.0007422330611812109
trying word[ of ]
0.0062559643728130635
trying word[ from ]
0.0016965327112713393
trying word[ and ]
0.009118863323083449
trying word[ the ]
0.016011027462623263
trying word[ thing ]
0.0006361997667267522
trying word[ is ]
0.01473862792916976
trying word[ m ]
0.0007422330611812109
trying word[ a ]
0.007846463789629944
trying word[ feminist ]
0.0002120665889089174
trying word[ so ]
0.005513731311631852
trying word[ agree ]
0.0001060332944544587
trying word[ with ]
0.0022266991835436325
trying word[ this ]
0.020888559007528364
trying word[ song ]
0.016117060757077722
trying word[ 

0.0003180998833633761
trying word[ getting ]
0.0003180998833633761
trying word[ with ]
0.0022266991835436325
trying word[ of ]
0.0062559643728130635
trying word[ 2013 ]
0.0001060332944544587
trying word[ because ]
0.00243876577245255
trying word[ to ]
0.009330929911992365
trying word[ was ]
0.0026508323613614673
trying word[ to ]
0.009330929911992365
trying word[ become ]
0.0002120665889089174
trying word[ a ]
0.007846463789629944
trying word[ me ]
0.003074965539179302
trying word[ keep ]
0.0005301664722722935
trying word[ you ]
0.006998197433994274
trying word[ can ]
0.001802566005725798
trying word[ please ]
0.0003180998833633761
trying word[ give ]
0.0002120665889089174
trying word[ me ]
0.003074965539179302
trying word[ a ]
0.007846463789629944
trying word[ chance ]
0.0002120665889089174
trying word[ and ]
0.009118863323083449
trying word[ so ]
0.005513731311631852
trying word[ more ]
0.001060332944544587
trying word[ people ]
0.002544799066907009
trying word[ can ]
0.0018025660057

trying word[ on ]
0.003180998833633761
trying word[ my ]
0.004029265189269431
trying word[ me ]
0.003074965539179302
trying word[ improve ]
0.0001060332944544587
trying word[ everyday ]
0.0005301664722722935
trying word[ m ]
0.0007422330611812109
trying word[ lyrical ]
0.0001060332944544587
trying word[ and ]
0.009118863323083449
trying word[ i ]
0.019828226062983775
trying word[ keep ]
0.0005301664722722935
trying word[ it ]
0.008058530378538861
trying word[ real ]
0.0009542996500901282
trying word[ help ]
0.0004241331778178348
trying word[ me ]
0.003074965539179302
trying word[ reach ]
0.0006361997667267522
trying word[ my ]
0.004029265189269431
trying word[ me ]
0.003074965539179302
trying word[ build ]
0.0001060332944544587
trying word[ a ]
0.007846463789629944
trying word[ fan ]
0.0003180998833633761
trying word[ song ]
0.016117060757077722
trying word[ song ]
0.016117060757077722
trying word[ in ]
0.008058530378538861
trying word[ world ]
0.0019085993001802565
trying word[ even ]

trying word[ know ]
0.0015904994168168805
trying word[ its ]
0.003074965539179302
trying word[ annoying ]
0.0002120665889089174
trying word[ getting ]
0.0003180998833633761
trying word[ sorry ]
0.0004241331778178348
trying word[ bout ]
0.0002120665889089174
trying word[ that ]
0.0059378644894496875
trying word[ but ]
0.003074965539179302
trying word[ please ]
0.0003180998833633761
trying word[ take ]
0.0001060332944544587
trying word[ a ]
0.007846463789629944
trying word[ moment ]
0.0002120665889089174
trying word[ to ]
0.009330929911992365
trying word[ check ]
0.001060332944544587
trying word[ out ]
0.001060332944544587
trying word[ my ]
0.004029265189269431
trying word[ channel ]
0.0002120665889089174
trying word[ with ]
0.0022266991835436325
trying word[ i ]
0.019828226062983775
trying word[ want ]
0.001060332944544587
trying word[ to ]
0.009330929911992365
trying word[ share ]
0.0002120665889089174
trying word[ my ]
0.004029265189269431
trying word[ music ]
0.0032870321280882198
tr

trying word[ with ]
0.0022266991835436325
trying word[ another ]
0.0001060332944544587
trying word[ hit ]
0.0004241331778178348
trying word[ like ]
0.006361997667267522
trying word[ this ]
0.020888559007528364
trying word[ and ]
0.009118863323083449
trying word[ it ]
0.008058530378538861
trying word[ will ]
0.001378432827907963
trying word[ happen ]
0.0001060332944544587
trying word[ video ]
0.006043897783904146
trying word[ is ]
0.01473862792916976
trying word[ so ]
0.005513731311631852
trying word[ are ]
0.003074965539179302
trying word[ only ]
0.0014844661223624219
trying word[ want ]
0.001060332944544587
trying word[ to ]
0.009330929911992365
trying word[ win ]
0.0001060332944544587
trying word[ the ]
0.016011027462623263
trying word[ check ]
0.001060332944544587
trying word[ my ]
0.004029265189269431
trying word[ channel ]
0.0002120665889089174
trying word[ this ]
0.020888559007528364
trying word[ video ]
0.006043897783904146
trying word[ is ]
0.01473862792916976
trying word[ grea

trying word[ school ]
0.0001060332944544587
trying word[ love ]
0.01060332944544587
trying word[ it ]
0.008058530378538861
trying word[ hate ]
0.0008482663556356696
trying word[ it ]
0.008058530378538861
trying word[ when ]
0.002756865655815926
trying word[ comes ]
0.0003180998833633761
trying word[ in ]
0.008058530378538861
trying word[ my ]
0.004029265189269431
trying word[ head ]
0.0004241331778178348
trying word[ this ]
0.020888559007528364
trying word[ is ]
0.01473862792916976
trying word[ like ]
0.006361997667267522
trying word[ 2 ]
0.003180998833633761
trying word[ years ]
0.0023327324779980913
trying word[ out ]
0.001060332944544587
trying word[ this ]
0.020888559007528364
trying word[ video ]
0.006043897783904146
trying word[ on ]
0.003180998833633761
trying word[ is ]
0.01473862792916976
trying word[ 1 ]
0.0020146325946347153
trying word[ billion ]
0.003817198600360513
trying word[ sorry ]
0.0004241331778178348
trying word[ to ]
0.009330929911992365
trying word[ all ]
0.00222

trying word[ me ]
0.003074965539179302
trying word[ some ]
0.0007422330611812109
trying word[ on ]
0.003180998833633761
trying word[ how ]
0.002756865655815926
trying word[ my ]
0.004029265189269431
trying word[ video ]
0.006043897783904146
trying word[ was ]
0.0026508323613614673
trying word[ and ]
0.009118863323083449
trying word[ how ]
0.002756865655815926
trying word[ i ]
0.019828226062983775
trying word[ could ]
0.0007422330611812109
trying word[ improve ]
0.0001060332944544587
trying word[ be ]
0.002120665889089174
trying word[ sure ]
0.0002120665889089174
trying word[ to ]
0.009330929911992365
trying word[ go ]
0.0009542996500901282
trying word[ check ]
0.001060332944544587
trying word[ out ]
0.001060332944544587
trying word[ the ]
0.016011027462623263
trying word[ about ]
0.0011663662389990457
trying word[ to ]
0.009330929911992365
trying word[ see ]
0.0012723995334535045
trying word[ what ]
0.002120665889089174
trying word[ all ]
0.0022266991835436325
trying word[ for ]
0.0036

trying word[ get ]
0.002120665889089174
trying word[ 100 ]
0.0001060332944544587
trying word[ will ]
0.001378432827907963
trying word[ to ]
0.009330929911992365
trying word[ from ]
0.0016965327112713393
trying word[ the ]
0.016011027462623263
trying word[ face ]
0.0005301664722722935
trying word[ of ]
0.0062559643728130635
trying word[ earth ]
0.0006361997667267522
trying word[ and ]
0.009118863323083449
trying word[ guys ]
0.0003180998833633761
trying word[ my ]
0.004029265189269431
trying word[ name ]
0.0004241331778178348
trying word[ is ]
0.01473862792916976
trying word[ and ]
0.009118863323083449
trying word[ do ]
0.001378432827907963
trying word[ football ]
0.0002120665889089174
trying word[ videos ]
0.0004241331778178348
trying word[ have ]
0.0026508323613614673
trying word[ subscribers ]
0.0001060332944544587
trying word[ and ]
0.009118863323083449
trying word[ think ]
0.0014844661223624219
trying word[ you ]
0.006998197433994274
trying word[ guys ]
0.0003180998833633761
trying

0.0001060332944544587
trying word[ out ]
0.001060332944544587
trying word[ this ]
0.020888559007528364
trying word[ playlist ]
0.0001060332944544587
trying word[ on ]
0.003180998833633761
trying word[ out ]
0.001060332944544587
trying word[ this ]
0.020888559007528364
trying word[ video ]
0.006043897783904146
trying word[ on ]
0.003180998833633761
trying word[ out ]
0.001060332944544587
trying word[ this ]
0.020888559007528364
trying word[ playlist ]
0.0001060332944544587
trying word[ on ]
0.003180998833633761
trying word[ subscribe ]
0.0001060332944544587
trying word[ my ]
0.004029265189269431
trying word[ out ]
0.001060332944544587
trying word[ this ]
0.020888559007528364
trying word[ playlist ]
0.0001060332944544587
trying word[ on ]
0.003180998833633761
trying word[ guys ]
0.0003180998833633761
trying word[ love ]
0.01060332944544587
trying word[ this ]
0.020888559007528364
trying word[ but ]
0.003074965539179302
trying word[ check ]
0.001060332944544587
trying word[ out ]
0.001060

0.0003180998833633761
trying word[ for ]
0.003605132011451596
trying word[ people ]
0.002544799066907009
trying word[ who ]
0.001802566005725798
trying word[ can ]
0.001802566005725798
trying word[ t ]
0.0011663662389990457
trying word[ experience ]
0.0001060332944544587
trying word[ the ]
0.016011027462623263
trying word[ that ]
0.0059378644894496875
trying word[ we ]
0.0015904994168168805
trying word[ can ]
0.001802566005725798
trying word[ you ]
0.006998197433994274
trying word[ donate ]
0.0001060332944544587
trying word[ to ]
0.009330929911992365
trying word[ give ]
0.0002120665889089174
trying word[ them ]
0.0005301664722722935
trying word[ a ]
0.007846463789629944
trying word[ amount ]
0.0002120665889089174
trying word[ would ]
0.0007422330611812109
trying word[ do ]
0.001378432827907963
trying word[ on ]
0.003180998833633761
trying word[ the ]
0.016011027462623263
trying word[ link ]
0.0001060332944544587
trying word[ and ]
0.009118863323083449
trying word[ donate ]
0.0001060332

Essayons d'écrire quelques commentaires pour voir s'ils sont classés comme spam ou ham. 

Rappelez-vous que le "True" est pour les commentaires de spam, et "False" est pour les commentaires ham. 
Essayez vous même !

In [35]:
# spam
classify("Guys check out my new chanell")

trying word[ check ]
0.001060332944544587
trying word[ out ]
0.001060332944544587
trying word[ my ]
0.004029265189269431
trying word[ new ]
0.0004241331778178348
trying word[ chanell ]
0.0001060332944544587


True

In [36]:
# spam
classify("I have solved P vs. NP, check my video https://www.youtube.com/watch?v=dQw4w9WgXcQ")

trying word[ have ]
0.0026508323613614673
trying word[ check ]
0.001060332944544587
trying word[ my ]
0.004029265189269431
trying word[ video ]
0.006043897783904146


True

In [37]:
# ham
classify("I liked the video")

trying word[ liked ]
0.0003180998833633761
trying word[ the ]
0.016011027462623263
trying word[ video ]
0.006043897783904146


False

In [38]:
# ham
classify("Its great that this video has so many views")

trying word[ great ]
0.0007422330611812109
trying word[ that ]
0.0059378644894496875
trying word[ this ]
0.020888559007528364
trying word[ video ]
0.006043897783904146
trying word[ has ]
0.002544799066907009
trying word[ so ]
0.005513731311631852
trying word[ many ]
0.001060332944544587
trying word[ views ]
0.0065740642561764396


False

In [41]:
# ??
classify("sgrout your video is interesting but why no update check my website")

trying word[ sgrout ]
0.0001060332944544587
trying word[ your ]
0.0022266991835436325
trying word[ video ]
0.006043897783904146
trying word[ is ]
0.01473862792916976
trying word[ but ]
0.003074965539179302
trying word[ why ]
0.0023327324779980913
trying word[ no ]
0.0011663662389990457
trying word[ update ]
0.0001060332944544587
trying word[ check ]
0.001060332944544587
trying word[ my ]
0.004029265189269431
trying word[ website ]
0.0001060332944544587


True

### Pour aller plus loin...
## Extending Bag of Words by Using TF-IDF

Jusqu'à présent, nous avons utilisé le modèle du Bag of Words pour représenter les commentaires en tant que vecteurs. Le "Bag of Words" est une liste de tous les mots uniques trouvés dans les données training, alors chaque commentaire peut être représenté par un vecteur qui contient la fréquence de chaque mot unique qui apparaît dans le commentaire.

Par exemple, si les données training contiennent les mots $(hi, how, how, my, grade, are, you),$ alors le texte "how are you you" peut être représenté par $(0,1,0,0,0,1,2).$ La principale raison pour laquelle nous faisons cela dans notre application est que les commentaires peuvent varier en longueur, mais la longueur des mots uniques reste fixe.

Dans notre contexte, le TF-IDF est une mesure de l'importance d'un mot dans un commentaire par rapport à tous les mots de nos données de formation. Par exemple, si un mot tel que "the" apparaissait dans la plupart des commentaires, le TF-IDF serait petit car ce mot ne nous aide pas à faire la différence entre les commentaires spam et ham. Notez que "TF" signifie "Term Frequency" et "IDF" signifie "Inverse Document Frequency".

En particulier, "TF" indiqué par $tf(w,c)$ est le nombre de fois que le mot $w$ apparaît dans le commentaire donné $c$. Alors que "IDF" est une mesure de la quantité d'informations qu'un mot donné fournit pour différencier les commentaires. PLus précisement, $IDF$ est formulé comme ceci:


>$idf(w, D) = log(\frac{\text{Number of comments in train data $D$}}{\text{Number of comments containing the word $w$}}).$ 


Pour combiner "TF" et "IDF" ensemble, nous prenons simplement le produit, donc:


>$$TFIDF = tf(w,c) \times idf(w, D) = (\text{Number of times $w$ appears in comment $c$})\times log(\frac{\text{Number of comments in train data $D$}}{\text{Number of comments containing the word $w$}}).$$


Maintenant, le $TF-IDF$ peut être utilisé pour pondérer les vecteurs qui résultent de l'approche "Bag of Words".

Par exemple, supposons qu'un commentaire contienne "ceci" 2 fois, donc $tf = 2$. 
Si nous avions alors 1000 commentaires dans nos données de formation, et que le mot "ceci" apparaît dans 100 commentaires, $idf = log(1000/100) = 2.$. 

Par conséquent, dans cet exemple, le poids TF-IDF serait de $2*2 = 4$ pour le mot "ceci" apparaît deux fois dans un commentaire particulier. Pour incorporer TF-IDF dans le réglage des baies naïves, nous pouvons calculer :

>$$Pr(word|spam) = \frac{\sum_{\text{c is spam}}TFIDF(word,c,D)}{\sum_{\text{word in spam c}}\sum_{\text{c is spam}}TFIDF(word,c,D)+ \text{Number of unique words in data}},$$ 

>where $TFIDF(word,c,D) = TF(word,c) \times IDF(word,data).$ 

In [77]:
# Calculer TFIDF(word, comment, data)
def TFIDF(comment, train):
    
    # Diviser le commentaire en une liste de mot
    comment = comment.split(' ')
    
    # Initiailiser tf-idf selon la longueur du commentaire
    tfidf_comment = np.zeros(len(comment))
    print(tfidf_comment)
    # Initiailiser nombre de commentaires contenant un mot
    num_comment_word = 0
    
    # Initialiser l'index pour les mots dans le commentaire
    word_index = 0
    
    # Pour chaque mot du commentaire
    for word in comment:
        
        # Calculer la fréquence des termes (tf)
        # Compter la fréquence du mot dans les commentaires
        tf = comment.count(word)
        
        # Trouver le nombre de commentaires contenant un mot
        for text in train["content"]:
            
            # Incrémenter le compteur de mots si le mot trouvé dans le commentaire
            if text.split(' ').count(word) > 0:
                num_comment_word += 1
        
        # Calculer la fréquence du document inverse (idf)
        # log(Nombre total de commentaires/nombre de commentaires avec mot)
        idf = np.log(len(train)/num_comment_word)
        
        # Mettre a jour le poids tf-idf du mot
        tfidf_comment[word_index] = tf * idf
        
        # Réinitialiser le nombre de commentaires contenant un mot
        num_comment_word = 0
        
        # Passer au mot suivant dans le commentaire
        word_index += 1
        
    return tfidf_comment

def TFIDF_WCD(word, comment, train):
    list_of_tfidf = TFIDF(comment, train)
    print(comment.split(' ').index(word))
    index_of_word = comment.split(' ').index(word)
    return list_of_tfidf[index_of_word]

In [78]:
TFIDF("Check out my my my new music video plz",data_comments_train)

[0. 0. 0. 0. 0. 0. 0. 0. 0.]


array([2.1261888 , 1.60739501, 5.50696099, 5.50696099, 5.50696099,
       3.73562672, 3.16384039, 2.08148863, 5.21153324])

In [80]:
TFIDF_WCD("music", "Check out my my my new music video plz",data_comments_train)

[0. 0. 0. 0. 0. 0. 0. 0. 0.]
6


3.1638403930978902

In [None]:
# Ici, on a la fonction qui va calculer la Prob(word|spam) et Prob(word|ham)
def conditionalWordTFIDF(word,label):
   
    # Paramètre de lissage de Laplace (Laplace Smoothing)
    # Rappel : pour avoir accès à une variable globale à l'intérieur d'une fonction 
    # vous devez le spécifier en utilisant le mot 'global'.
    global positiveTotal
    global negativeTotal
    global alpha
    
    if word not in trainPositive:
        return 1.0
    # word in ham comment
    if(label == 0):
        #print("trying word[",word,"]")
        # Calculer Prob(word|ham)
        print((trainNegative[word] + alpha) / (float)(negativeTotal + (alpha * vocab_size_train)))
        return (trainNegative[word] + alpha) / (float)(negativeTotal + (alpha * vocab_size_train))
    
    # word in spam comment
    else:
        # Calculer Prob(word|ham)
        ##print(trainPositive[word] + alpha) / (float)(positiveTotal + (alpha * vocab_size_train))
        return (trainPositive[word] + alpha) / (float)(positiveTotal + (alpha * vocab_size_train))
        
   

In [None]:
# Ici, on a la fonction qui va calculer la Prob(spam|comment) or Prob(ham|comment)
def conditionalCommentTFDIF(comment,label):
    
    # On initialise la probabilité conditionelle
    prob_label_comment = 1.0
    
    # On sépare le commentaire en liste de mots
    #comment = cleanword(set(sorted(str.split(comment))))
    comment = comment.split(' ')
    
    # Pour chaque mot du commentaire
    for word in comment:
        
        # Calculer la valeur de P(label|comment)
        # On suppose ici qu'on a une independance conditionnelle (p(A) * p(B))
        prob_label_comment = prob_label_comment * conditionalWordTFIDF(word,label)
    
    return prob_label_comment

In [None]:
# Et maintenant, implémente TFIDF avec ta fonction de classification
# Have fun :D

# Classifier les commentaires sont du spam ou ham
def classifyTDIDF(comment):
    
    # get global variables
    global pSpam
    global pNotSpam
    
    # Calculer la valeur proportionnelle à Pr(comment|ham)
    isNegative = pSpam * float(conditionalCommentTFIDF(comment, 0))
    
    # Calculer la valeur proportionnelle à Pr(comment|spam)
    isPositive = pNotSpam * float(conditionalCommentTFIDF(comment, 1))
    
    # Output -> True = spam, False = ham en fonction des 2 variables calculées précédemment (il faut comparer les variables)
    return (isNegative < isPositive)

