## CORRECTION
https://nbviewer.jupyter.org/github/fxbabin/ateliers_ml_2019/blob/master/Semaine7/NaiveBayesSpamClassifier.ipynb

# Naive Bayes pour la classification du spam


La classification naïve bayésienne (Naive bayes) est un algorithme de classification probabiliste relativement simple qui convient bien aux données que l'on peut catégoriser.

En Machine Learning, les applications courantes de Naive Bayes sont la classification des courriels non sollicités (spam), l'analyse des sentiments (emotion analyses) et la catégorisation des documents (data categorisation). Naive Bayes présente des avantages par rapport aux autres algorithmes de classification couramment utilisés en raison de sa simplicité, de sa rapidité et de sa précision sur de petits ensembles de données.

## Data Description

Nous utiliserons les données du dépot d'apprentissage machine de l'UCI qui contient plusieurs commentaires Youtube de vidéos musicales très populaires. Chaque commentaire dans les données a été étiqueté comme spam ou ham (commentaire légitime), et nous utiliserons ces données pour former notre algorithme Naive Bayes pour la classification de spam commentaire youtube.

In [1]:
# Import modules

# Pour la manipulation des données
import pandas as pd

# Pour les opérations matricielles
import numpy as np

# Pour utiliser le regexabs
import re

In [64]:
# Charger les données du fichier 'YoutubeCommentsSpam.csv' en utilisant pandas
data_comments = pd.read_csv('YoutubeCommentsSpam.csv')

# Créer des libellés de colonne : "content" et "label". 
# conseils : la méthode 'colums' peut être utile 
data_comments.columns = ['content','label']

# Afficher les premières lignes de notre ensemble de données pour s'assurer que la colonne "label" a été ajoutée
data_comments.head(10)
data_comments["content"]

0                +447935454150 lovely girl talk to me xxx
1          I always end up coming back to this song<br />
2       my sister just received over 6,500 new <a rel=...
3                                                    Cool
4                               Hello I am from Palastine
5       Wow this video almost has a billion views! Did...
6       Go Sgrout check out my rapping video called Fo...
7                                        Almost 1 billion
8                    sgrout Aslamu Lykum... From Pakistan
9       Eminem is idol for very people in EspaÃ±a and ...
10                            Help me get 50 subs please 
11                                         i love song :)
12      Alright ladies, if you like this song, then ch...
13      The perfect example of abuse from husbands and...
14       The boyfriend was Charlie from the TV show LOST 
15      <a href="https://www.facebook.com/groups/10087...
16                  Take a look at this video on YouTube:
17            

$\textbf{ATTENTION: Ne regardez pas les liens dans les commentaires, ce sont des spams! ;)}$

In [65]:
# Afficher les commentaires de spam dans les données
# N'ALLEZ PAS SUR LES LIENS !!!!! sérieusement, ce sont des spams.... 
print(data_comments["content"][data_comments["label"] == 1])

0                +447935454150 lovely girl talk to me xxx
2       my sister just received over 6,500 new <a rel=...
4                               Hello I am from Palastine
6       Go Sgrout check out my rapping video called Fo...
8                    sgrout Aslamu Lykum... From Pakistan
10                            Help me get 50 subs please 
12      Alright ladies, if you like this song, then ch...
15      <a href="https://www.facebook.com/groups/10087...
16                  Take a look at this video on YouTube:
17                 Check out our Channel for nice Beats!!
19                    Check out this playlist on YouTube:
21                                            like please
24      I shared my first song &quot;I Want You&quot;,...
25      Come and check out my music!Im spamming on loa...
26                    Check out this playlist on YouTube:
27      HUH HYUCK HYUCK IM SPECIAL WHO S WATCHING THIS...
30      Check out this video on YouTube:<br /><br />Lo...
33            

En parcourant les commentaires qui ont été étiquetés comme spam dans ces données, il semble que ces commentaires sont soit sans rapport avec la vidéo, soit comme une forme de publicité. L'expression "check out" semble être très populaire dans ces commentaires.

## Summary Statistics and Data Cleaning

Le tableau ci-dessous montre que cet ensemble de données se compose de $1959$ commentaires youtube, dont environ $49\%$ sont des commentaires légitimes et environ $51\%$ sont du spam. Cette grande variation de classes dans notre ensemble de données nous aidera à tester l'exactitude de nos algorithmes sur l'ensemble des données de test. 

La longueur moyenne de chaque commentaire est d'environ $96$ caractères, ce qui représente environ $15$ mots en moyenne par commentaire.

In [66]:
# Ajouter une nouvelle colonne pour la longueur de chaque commentaire
# conseils: utiliser map et lambda
data_comments['length_as_lambda'] = list(map (lambda x: len(x), data_comments['content']))
data_comments['length_as_len'] = list(map (len, data_comments['content']))
data_comments['length'] = data_comments['content'].apply(len)

# Permet d'afficher un tableau avec plusieurs données statistiques (mean, stdev, min, max)
data_comments[["label","length_as_lambda"]].describe()


data_comments.loc[:,["length_as_lambda","length_as_len",'length']]

Unnamed: 0,length_as_lambda,length_as_len,length
0,40,40,40
1,46,46,46
2,200,200,200
3,4,4,4
4,25,25,25
5,73,73,73
6,65,65,65
7,16,16,16
8,36,36,36
9,69,69,69


Pour notre algorithme de classification Naive Bayes, nous diviserons les données en deux parties: entrainement et tests. La partie d'entrainement sera utilisé pour former l'algorithme de classification du spam, et l'ensemble de test ne sera utilisé que pour tester sa précision. 

En général, La partie d'entrainement devrait être plus grand que La partie de test et les deux devraient provenir de la même population (la population dans notre cas est Youtube commentaires pour les vidéos musicales). 

**Nous sélectionnerons au hasard $75\%$ des données pour la formation et $25\%$ des données pour les tests.**

In [125]:
# Séparons les données en 2 groupes ! (75% training, 25% test)

# Ceci nous permet d'obtenir la même allocation aléatoire pour chaque série de codes. RTFM if you want ;)
np.random.seed(2019)

# Ajout d'un vecteur colonne 'uniform' de nombres générés aléatoirement entre 0 et 1 
# Astuce : dans numpy, il existe une méthode pour prélever un échantillon à partir d'une distribution uniforme.

data_comments["uniform"] = np.random.uniform(0,1, len(data_comments))
#print(data_comments['uniform'])

#a_list_of_unif = np.random.uniform(0,1, len(data_comments))
#a_list_of_unif
# Comme le nombre dans notre colonne 'uniform' est distribué uniformément, 
# environ 75 % de ces chiffres devraient être inférieurs à 0,75 %, prenons ces 75 %.
data_comments_train = data_comments[data_comments["uniform"] < 0.75]

# Même chose pour les 25 % de ces numéros qui sont supérieurs à 0,75
data_comments_test = data_comments[data_comments["uniform"] >= 0.75]

print(data_comments_train.head())
print(data_comments_test.head())

                                             content  label  length_as_lambda  \
1     I always end up coming back to this song<br />      0                46   
2  my sister just received over 6,500 new <a rel=...      1               200   
3                                               Cool      0                 4   
5  Wow this video almost has a billion views! Did...      0                73   
6  Go Sgrout check out my rapping video called Fo...      1                65   

   length_as_len  length   uniform  
1             46      46  0.393081  
2            200     200  0.623970  
3              4       4  0.637877  
5             73      73  0.299172  
6             65      65  0.702198  
                                              content  label  \
0            +447935454150 lovely girl talk to me xxx      1   
4                           Hello I am from Palastine      1   
7                                    Almost 1 billion      0   
8                sgrout Aslamu Lyku

In [126]:
# Vérifiez que les données d'entraînement contiennent à la fois des commentaires de spam et de ham
data_comments_train["label"].describe()

count    1467.000000
mean        0.507157
std         0.500119
min         0.000000
25%         0.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: label, dtype: float64

In [127]:
# Même chose pour le data test
data_comments_test["label"].describe()

count    492.000000
mean       0.528455
std        0.499698
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: label, dtype: float64

Les données d'entrainement et de test ont toutes deux un bon mélange de spam et de ham, nous sommes donc prêts à passer à la formation sur le classificateur Naive Bayes. 

In [128]:
# Joindre tout les commentaires dans une seule et même liste
# Astuce: 'separator'.join(list)
training_list_words = ' '.join(data_comments_train["content"])

# Diviser la liste des commentaires en une liste de mots uniques
# Astuce: set() and sorted()
train_unique_words = set(sorted(str.split(training_list_words)))

# Nombre de mots uniques dans le data training
vocab_size_train = len(train_unique_words)

#print(training_list_words)

In [129]:
# Description résumée des commentaires
print('Unique words in training data: %s' % vocab_size_train)
print('First 5 words in our unique set of words: \n % s' % list(train_unique_words)[1:6])

Unique words in training data: 5491
First 5 words in our unique set of words: 
 ['ILove', '/>Doing', 'original', 'gaming', 'Netherland/']


Cela devrait ressember à quelquechose comme ça:

```Unique words in training data: 5898
First 5 words in our unique set of words: 
['now!!!!!!', 'yellow', 'four', '/>.Pewdiepie', 'Does']```


Actuellement, "now !!" et "Now !!!!!!", ainsi que "DOES", "DoEs", et "does" sont tous considérés comme des mots uniques. Pour la classification du spam, il est probablement préférable de traiter légèrement les données pour en améliorer l'exactitude. Dans notre cas, nous pouvons nous concentrer sur les lettres et les chiffres, ainsi que convertir tous les commentaires en minuscules.

In [201]:
def cleanword(a_list_of_word):
    a_list_of_word = [ re.sub('[^a-zA-Z0-9]*', '', word) for word in a_list_of_word]
    a_list_of_word = set([ word.lower() for word in a_list_of_word])
    return a_list_of_word

In [233]:
# Garder uniquement les chiffres et les lettres
# Astuce: utiliser regex et les list comprehension
#train_unique_words_alt = [ re.sub('\W*', '', word) for word in train_unique_words]
vocab_size_train_before = len(train_unique_words)
train_unique_words_alt = cleanword(train_unique_words_alt)
train_unique_words = [ re.sub('[^a-zA-Z0-9]*', '', word) for word in train_unique_words]

# Convertir toutes les lettres en minuscule
# Astuce: set() ?
train_unique_words_alt = set([ word.lower() for word in train_unique_words_alt])
train_unique_words = set([ word.lower() for word in train_unique_words])
# Nombre de mots uniques dans le data training
vocab_size_train = len(train_unique_words)

# Description résumée des commentaires
#print('Unique words in processed training data: %s' % vocab_size_train)
print('First 5 words in our processed unique set of words: \n % s' % list(train_unique_words)[1:6],vocab_size_train_before, vocab_size_train)
print('alt First 5 words in our processed unique set of words: \n % s' % list(train_unique_words_alt)[1:6], len(train_unique_words_alt))

First 5 words in our processed unique set of words: 
 ['herehttpswwwfacebookcomtlouxmusic', 'shot', 'filled', 'behavior', 'huge'] 3481 3481
alt First 5 words in our processed unique set of words: 
 ['herehttpswwwfacebookcomtlouxmusic', 'shot', 'filled', 'behavior', 'huge'] 3480


## Naive Bayes for Spam Classification

Ok, alors voilà le plan :

- Tout d'abord, nous avons séparé nos données de formation en 2 sous-ensembles : training et test.

- puis nous allons créer plusieurs fonctions pour vérifier combien de fois chaque mot est apparu dans le spam et non dans les commentaires de spam, 
    - et vérifier la probabilité que chaque mot apparaisse dans le spam/non spam

- alors les 2 fonctions les plus importantes : train() et classify()

- Et enfin, nous allons vérifiez l'exactitude de nos prédictions.

Passons au code !

In [234]:
#trainPositive = dict(\
#                     word : count for word in \
#                     set(sorted(str.split(data_comments_train["content"][data_comments_train["label"] == 1])))

In [235]:
# On initialise des dictionnaire avec des mots de commentaires comme "keys", et leur étiquette comme "value".
trainPositive = dict()
trainNegative = dict()

# On initialise ces variables à zéro
positiveTotal = 0
negativeTotal = 0

# Même chose, mais en float ;) 
pSpam = .0
pNotSpam = .0

# Laplace smoothing
alpha = 1

In [236]:
#def initialize_dicts():

# # On initialise les dictionnaires avec 0 comme valeur 
for word in train_unique_words:
    #print(word)
    global trainPositive
    global trainNegative
    # skip empty words('' and ' ')
    if word == '' or word == ' ':
        continue # goes to next word
    
    # Pour le moment, tout est classifier comme ham (légitime) ## NON for the moment nothing is none
    trainPositive[word] = 0 #create word
    trainNegative[word] = 0
print(trainNegative['sgrout'])

0


In [237]:
# Compter le nombre de fois que le mot dans le commentaire apparaît dans les commentaires de spam et ham
def processComment(comment,label):
    global positiveTotal
    global negativeTotal
    global trainNegative
    global trainPositive
    
    # Séparer le commentaire en liste de mots
    comment = cleanword(set(sorted(str.split(comment))))
    
    # Pour chaque mot du commentaire
    for word in comment:
        print(word)
        ##print(type(train_unique_words))
        #Checker si le mot est bien dans la base de donnée
        if not(word in trainPositive):
            print("mot inconnu")
            continue
        
        #Checker si ce n'est pas un '' ou ' '
        if word == '' or word == ' ':
            continue
        
        # Checker si le mot n'est pas du spam (ham)
        if label == 0:
            trainNegative[word] += 1
            negativeTotal += 1
            # Incrémenter le nombre de fois que le mot apparaît dans les commentaires non spam
            
            
        # spam comments
        if label == 1:
            trainPositive[word] += 1
            positiveTotal += 1
            # Incrémenter le nombre de fois que le mot apparaît dans les commentaires spam
            
            
#onedata = data_comments_train.loc[158]
#data_comments.loc[2:3,["label"]]

##processComment(onedata["content"],onedata["label"])
#print(onedata["content"],onedata["label"])

#print(trainPositive['help'],trainNegative['someone'],positiveTotal, negativeTotal)

##METTRE LA FORMULE



In [238]:
# Ici, on a la fonction qui va calculer la Prob(word|spam) et Prob(word|ham)
def conditionalWord(word,label):
   
    # Paramètre de lissage de Laplace (Laplace Smoothing)
    # Rappel : pour avoir accès à une variable globale à l'intérieur d'une fonction 
    # vous devez le spécifier en utilisant le mot 'global'.
    global positiveTotal
    global negativeTotal
    global alpha
    
    if word not in trainPositive:
        return 1.0
    # word in ham comment
    if(label == 0):
        print("trying word[",word,"]")
        # Calculer Prob(word|ham)
        print((trainNegative[word] + alpha) / (float)(negativeTotal + (alpha * vocab_size_train)))
        return (trainNegative[word] + alpha) / (float)(negativeTotal + (alpha * vocab_size_train))
    
    # word in spam comment
    else:
        # Calculer Prob(word|ham)
        ##print(trainPositive[word] + alpha) / (float)(positiveTotal + (alpha * vocab_size_train))
        return (trainPositive[word] + alpha) / (float)(positiveTotal + (alpha * vocab_size_train))
        
       
       

In [239]:
# Ici, on a la fonction qui va calculer la Prob(spam|comment) or Prob(ham|comment)
def conditionalComment(comment,label):
    
    # On initialise la probabilité conditionelle
    prob_label_comment = 1.0
    
    # On sépare le commentaire en liste de mots
    #comment = cleanword(set(sorted(str.split(comment))))
    comment = comment.split(' ')
    
    # Pour chaque mot du commentaire
    for word in comment:
        
        # Calculer la valeur de P(label|comment)
        # On suppose ici qu'on a une independance conditionnelle (p(A) * p(B))
        prob_label_comment = prob_label_comment * conditionalWord(word,label)
    
    return prob_label_comment

In [240]:
# Calculer plusieurs probabilités conditionnelles dans les données d'entraînement
def train():
    # Rappel: on aura besoin de pSpam et pNotSpam ici ;) 
    global pSpam
    global pNotSpam

    # Initialisation de nos variables: le nombre total de commentaires et le nombre de commentaires de spam 
    total = 0.0
    num_spam = 0.0
    
    #print(data_comments_train.head())
    print('Starting training ...')
    
    # Passez en revue chaque commentaire dans les données d'entraînement 
    for index, row in data_comments_train.iterrows():
        #print(type(row))
        print(row)
        total += 1
                
       # Vérifiez si le commentaire est du spam ou non (ham)
        if row.label == 1: #spam
       # Incrémenter les valeurs selon que le commentaire est du spam ou non
            num_spam += 1
       # Mettre à jour le dictionnaire du spam et ham
        
        print("line is [",row.content,"]\n")
        processComment(row.content,row.label)
            
        #conditionalComment(row.content,row.label)
        
    # Calcule des probabilitées a priori, P(spam), P(ham)
    pSpam = num_spam / total
    pNotSpam = 1 - pSpam
    print('Training done')

In [241]:
# Lancer notre fonction train de Naive Bayes
train()

Starting training ...
content             I always end up coming back to this song<br />
label                                                            0
length_as_lambda                                                46
length_as_len                                                   46
length                                                          46
uniform                                                   0.393081
Name: 1, dtype: object
line is [ I always end up coming back to this song<br /> ]


mot inconnu
songbr
end
coming
this
i
always
up
back
to
content             my sister just received over 6,500 new <a rel=...
label                                                               1
length_as_lambda                                                  200
length_as_len                                                     200
length                                                            200
uniform                                                       0.62397
Name: 2, dtype: o

Name: 108, dtype: object
line is [ EVERYONE GO AND SHARE youtu  be/ARkglzjQuP0 ON FB,TWITTER,G+ TO VOTE FOR EMINEM TO BECOME ARTIST OF THE YEAR ON FIRST EVER YOUTUBE MUSIC AWARDS !!!  AND GET THIS METHOD TO CHEAT AT INTERNET ROULETTE OUT OF EMINEMS VIDEO ! SHADY ARTIST OF THE YEAR ! ]

awards
youtu

mot inconnu
ever
become
out
and
everyone
video
roulette
vote
eminems
eminem
for
of
method
at
this
youtube
on
get
first
shady
go
to
cheat
bearkglzjqup0
the
artist
music
year
internet
share
fbtwitterg
content             Check out Em s dope new song monster here: /wa...
label                                                               1
length_as_lambda                                                   77
length_as_len                                                      77
length                                                             77
uniform                                                      0.246428
Name: 109, dtype: object
line is [ Check out Em s dope new song monster here: /w

bring
can
cope
i
song
lyrics
see
and
where
click
about
up
video
real
m
mine
help
ago
christmas
that
suicide
friend
who
past
away
others
would
you
started
dad
pen
of
paper
so
9
today
7k
at
tried
this
produced
just
has
know
on
over
several
my
to
picture
but
nearly
subscribers
views
thumbs
promise
music
yesterday
thanks
it
commit
picked
released
500k
write
professionally
times
passed
content             â¢â¢â¢â¢âºâºMy name is George and let me...
label                                                               1
length_as_lambda                                                  512
length_as_len                                                     512
length                                                            512
uniform                                                      0.216863
Name: 186, dtype: object
line is [ â¢â¢â¢â¢âºâºMy name is George and let me tell u EMINEM is my idol my inspiration, I ve listen to him growing up, I never thou I would love rap this much Bu

line is [ Hey guys im a 17yr old rapper trying to get exposure... I live in belgium where NO ONE speaks english so i have to resort to this gay SPAM...  Check out my 2 latest tracks as they are probably my best.. Audio isnt the best but im gonna invest in some real equipment for my next track..  Please Thumbs this up so others can see.. or hey dont just check me out yourself and leave a response and a like :D  Thanks in advance, you guys will be part of making my dream come TRUE   -Notorious Niko  ]

some
english
me
a
true
yourself
can
niko
i
out
see
no
and
where
isnt
gay
dream
making
old
tracks
up
rapper
check
equipment
real
hey
gonna
guys
leave
advance
im
others
like
17yr
for
one
of
spam
so
come
in
live
be
part
this
trying
audio
are
just
probably
get
will
notorious
have
resort
track
my
latest
to
please
belgium
but
response
d
speaks
thumbs
the
or
best
they
exposure
thanks
as
2
dont
next
you
invest
content             EMINEM&lt;3Â <br />the best rapper ever&lt;3
label                  

Name: 349, dtype: object
line is [ Thumbs up if you listen this in 2015. ]

if
listen
in
thumbs
this
2015
up
you
content             do you want to make some easy money? check out...
label                                                               1
length_as_lambda                                                  113
length_as_len                                                     113
length                                                            113
uniform                                                     0.0710943
Name: 351, dtype: object
line is [ do you want to make some easy money? check out my page tvcmcadavid.weebly . com dont miss out on this opportunity ]


mot inconnu
some
want
tvcmcadavidweebly
check
com
this
do
easy
miss
out
on
money
opportunity
dont
make
page
you
my
to
content             For all you ladies out there......  Check out ...
label                                                               1
length_as_lambda                                        

Name: 436, dtype: object
line is [ This song is like an oreo, the black part is good but the white part is better ]

an
but
white
the
part
this
better
good
black
oreo
like
is
song
content             subscribe to my channel  /watch?v=NxK32i0HkDs
label                                                           1
length_as_lambda                                               45
length_as_len                                                  45
length                                                         45
uniform                                                  0.136086
Name: 438, dtype: object
line is [ subscribe to my channel  /watch?v=NxK32i0HkDs ]

channel
watchvnxk32i0hkds
subscribe
my
to
content              HI IM 14 YEAR RAPPER SUPPORT ME GUY AND CHECK...
label                                                               1
length_as_lambda                                                  153
length_as_len                                                     153
length            

you
love
content             I    loved        it           so       much  ...
label                                                               0
length_as_lambda                                                  129
length_as_len                                                     129
length                                                            129
uniform                                                      0.502372
Name: 530, dtype: object
line is [ I    loved        it           so       much          because         you          get         to         stand            fear . ]

so

mot inconnu
because
much
fear
i
it
get
stand
you
loved
to
content             Hi guys i sell Jack Daniel's Hard Back Cover C...
label                                                               1
length_as_lambda                                                  212
length_as_len                                                     212
length                                                       

Name: 636, dtype: object
line is [ DOWNLOAD RAPID FACEBOOK FOR FREE NOW https://play.google.com/store/apps/details?id=com.rapid.facebook.magicdroid ]

now
facebook
free
rapid
httpsplaygooglecomstoreappsdetailsidcomrapidfacebookmagicdroid
for
download
content             The song is very good ...but the video makes n...
label                                                               0
length_as_lambda                                                  213
length_as_len                                                     213
length                                                            213
uniform                                                      0.711096
Name: 637, dtype: object
line is [ The song is very good ...but the video makes no sense...just a nonsense  video...I mean she is telling her story of being stuck on an island, but  the song doesn't fit in the situation...but nvm...The song is good ]

telling
doesnt
a
nonsense
mean
island
makes
no
video
an
videoi
story
good
she

seriously
its
who
im
katheryns
one
same
happy
sister
today
knows
birthday
know
have
glad
not
my
to
the
they
katyand
content             Katy Perry can't sing for shit. All i hear it ...
label                                                               0
length_as_lambda                                                   55
length_as_len                                                      55
length                                                             55
uniform                                                       0.66797
Name: 709, dtype: object
line is [ Katy Perry can't sing for shit. All i hear it autotune. ]

autotune
cant
i
katy
it
perry
shit
for
sing
all
hear
content             I fucking hate her. Why? Because she don't wri...
label                                                               0
length_as_lambda                                                  182
length_as_len                                                     182
length                               


lt3
content             KATY PERRY, I AM THE "DÃCIO CABELO", "DECIO H...
label                                                               1
length_as_lambda                                                  439
length_as_len                                                     439
length                                                            439
uniform                                                     0.0736188
Name: 798, dtype: object
line is [ KATY PERRY, I AM THE "DÃCIO CABELO", "DECIO HAIR". I AM 60 YEARS OF AGE. I  DON"T HAVE FAMILY. I"M SINGLE. ALONE. HOMELESS. I WAS AN ALCOHOLIC: 15 AT  THE AGE OF 46. I AM AN INVISIBLE COMPOSER. MY DREAM IS TO RECORD MY SONGS.  COULD YOU HELP ME? PLEASE! PLEASE! I TRUST THAT THE YOU WILL GIVE ME A  CHANCE. I HAVE 109 VIDEOS IN THE YOUTUBE: deciocabelo canal. KATY PERRY, I  WAS BORN IN OCTOBER 25, TOO. THANK YOU VERY MUCH!!! DECIO HAIR. ]

october
me
homeless
dcio
videos
record
a
cabelo
i
katy
46
is
composer
dream
invisible
60
an
25
al

Name: 914, dtype: object
line is [ Best song for everððð¢<br /> ]


mot inconnu
everbr
best
for
song
content                 epic
label                      0
length_as_lambda           4
length_as_len              4
length                     4
uniform             0.604345
Name: 915, dtype: object
line is [ epic ]

epic
content             I like so much this music,  good 
label                                               0
length_as_lambda                                   33
length_as_len                                      33
length                                             33
uniform                                      0.535942
Name: 916, dtype: object
line is [ I like so much this music,  good  ]

so
much
this
music
i
good
like
content             Beautiful song
label                            0
length_as_lambda                14
length_as_len                   14
length                          14
uniform                    0.61352
Name: 917, dtype: object
line is

line is [ Check out this video on YouTube: ]

this
youtube
out
on
check
video
content             ALL SCHOOL DROP OUTS I KNEW AS FRIENDS BEFORE ...
label                                                               1
length_as_lambda                                                  375
length_as_len                                                     375
length                                                            375
uniform                                                      0.564606
Name: 1033, dtype: object
line is [ ALL SCHOOL DROP OUTS I KNEW AS FRIENDS BEFORE THEY DECIDED TO DROP SCHOOL THINK THERE IS NO NEED FOR AN ID CARD OR A CERTIFICATION TO PROVE YOU ARE AN EDUCATED CLEAN IN CRIMINAL RECORD TALENTED PERSON TO WORK IN ANY ENTERTAINMENT FIELD WORLDWIDE. THEY THINK THEY COULD BE RICH ENTERTAINERS BY CONSOLIDATING WITH ACTORS / ACTRESSES AS WELL AS SINGERS FOR A SHARE OF PROFIT(S). ]


mot inconnu
knew
record
a
i
profits
work
well
school
no
educated
need
is
person
there


line is [ Check out this video on YouTube:hjalp ]

youtubehjalp
this
out
on
check
video
content             Check out this video on YouTube:
label                                              1
length_as_lambda                                  32
length_as_len                                     32
length                                            32
uniform                                     0.679099
Name: 1135, dtype: object
line is [ Check out this video on YouTube: ]

this
youtube
out
on
check
video
content             Check out this video on YouTube:
label                                              1
length_as_lambda                                  32
length_as_len                                     32
length                                            32
uniform                                     0.459706
Name: 1136, dtype: object
line is [ Check out this video on YouTube: ]

this
youtube
out
on
check
video
content             Check out this video on YouTube:
label          

stay
now
me
once
entire
alive
right
and
day
die
reading
started
subscribe
one
do
so
if
youre
within
want
family
will
have
not
to
you
stop
content             https://twitter.com/GBphotographyGB
label                                                 1
length_as_lambda                                     35
length_as_len                                        35
length                                               35
uniform                                       0.0332211
Name: 1251, dtype: object
line is [ https://twitter.com/GBphotographyGB ]

httpstwittercomgbphotographygb
content             I'm only checking the views
label                                         0
length_as_lambda                             27
length_as_len                                27
length                                       27
uniform                                0.423415
Name: 1255, dtype: object
line is [ I'm only checking the views ]

the
views
only
im
checking
content             http://www.ebay.co

new
am
back
content             This is a weird video.
label                                    0
length_as_lambda                        22
length_as_len                           22
length                                  22
uniform                           0.432132
Name: 1347, dtype: object
line is [ This is a weird video. ]

a
this
weird
is
video
content             EHI GUYS CAN YOU SUBSCRIBE IN MY CHANNEL? I AM...
label                                                               1
length_as_lambda                                                  110
length_as_len                                                     110
length                                                            110
uniform                                                      0.733845
Name: 1349, dtype: object
line is [ EHI GUYS CAN YOU SUBSCRIBE IN MY CHANNEL? I AM A NEW YOUTUBER AND I PLAY  MINECRAFT THANKS GUYS!... SUBSCRIBE! ]

channel
minecraft
in
guys
ehi
a
play
can
i
youtuber
thanks
you
and
subscribe

Name: 1468, dtype: object
line is [ We pray for you Little Psy â¡ ]


mot inconnu
psy
little
we
for
you
pray
content             ''Little Psy, only 5 months left.. Tumor in th...
label                                                               0
length_as_lambda                                                   76
length_as_len                                                      76
length                                                             76
uniform                                                     0.0399905
Name: 1469, dtype: object
line is [ ''Little Psy, only 5 months left.. Tumor in the head :( WE WILL MISS U &lt;3 ]


mot inconnu
psy
lt3
in
months
little
5
the
u
only
we
miss
will
left
head
tumor
content              Follow me on Instagram. _chris_cz  
label                                                  1
length_as_lambda                                      36
length_as_len                                         36
length                                        

Name: 1566, dtype: object
line is [ https://www.facebook.com/FUDAIRYQUEEN?pnref=story ]

httpswwwfacebookcomfudairyqueenpnrefstory
content             Haha its so funny to see the salt of westerner...
label                                                               0
length_as_lambda                                                  133
length_as_len                                                     133
length                                                            133
uniform                                                      0.480671
Name: 1567, dtype: object
line is [ Haha its so funny to see the salt of westerners that top views of youtube  goes to video they dont even understand, keep the salt up! ]

keep
see
top
its
up
video
westerners
that
understand
even
of
so
youtube
funny
salt
to
haha
the
they
views
goes
dont
content             FOLLOW MY COMPANY ON TWITTER  thanks.  https:/...
label                                                               1
length_as_lambda    


mot inconnu
waka
content             Cool song 
label                        0
length_as_lambda            10
length_as_len               10
length                      10
uniform               0.177175
Name: 1692, dtype: object
line is [ Cool song  ]

cool
song
content             i love uÂ  shakira
label                                0
length_as_lambda                    18
length_as_len                       18
length                              18
uniform                       0.686422
Name: 1693, dtype: object
line is [ i love uÂ  shakira ]

shakira
u
love
i
content             Waka waka 
label                        0
length_as_lambda            10
length_as_len               10
length                      10
uniform              0.0957441
Name: 1694, dtype: object
line is [ Waka waka  ]

waka
content             this song sucks
label                             0
length_as_lambda                 15
length_as_len                    15
length                           15
unifor

give
surveys
this
has
home
than
monthly
go
to
fast
working
bucks
being
site
make
content             Hello Guys...I Found a Way to Make Money Onlin...
label                                                               1
length_as_lambda                                                  496
length_as_len                                                     496
length                                                            496
uniform                                                      0.216673
Name: 1783, dtype: object
line is [ Hello Guys...I Found a Way to Make Money Online You Can Get Paid To Mess Around On Facebook And Twitter! GET PAID UPTO $25 to $35 AN HOUR...Only at 4NetJobs.com Work from the Comfort of your Home... They are Currently Hiring People from all Over the World, For a Wide Range of Social Media Jobs on Sites such as Facebook,Twitter and YouTube You don t Need any Prior Skills or Experience and You can Begin Work Immediately! You Can Easily Make $4000 to $5000+ Mont

saints
part
family
confessors
afflicted
most
5
the
holy
virgins
patriarchs
content             Hi.. Everyone.. If anyone after real online wo...
label                                                               1
length_as_lambda                                                  226
length_as_len                                                     226
length                                                            226
uniform                                                       0.10174
Name: 1868, dtype: object
line is [ Hi.. Everyone.. If anyone after real online work. I can help u. Earn lots of money. It s fun. It s real and affiliated company.. U not think u r working. It s easy and enjoyable. For more info contact me .. Neeru105@ gmail.com ]


mot inconnu
me
contact
r
can
i
gmailcom
work
money
and
s
everyone
affiliated
real
info
fun
neeru105
help
hi
u
more
for
of
earn
if
anyone
online
lots
not
working
think
it
enjoyable
company
after
easy
content             Nice song ^_^
label

piano
thank
my
to
life
but
your
the
or
thumbs
music
being
it
year
15
you
content             Love itt and ppl check out my channel!!!
label                                                      1
length_as_lambda                                          40
length_as_len                                             40
length                                                    40
uniform                                            0.0411311
Name: 1925, dtype: object
line is [ Love itt and ppl check out my channel!!! ]

channel
itt
ppl
out
and
love
my
check
content             adf.ly / KlD3Y
label                            1
length_as_lambda                14
length_as_len                   14
length                          14
uniform                   0.468624
Name: 1927, dtype: object
line is [ adf.ly / KlD3Y ]


mot inconnu
kld3y
adfly
content             adf.ly / KlD3Y
label                            1
length_as_lambda                14
length_as_len                   14
length        

In [242]:
pNotSpam

0.49284253578732107

In [243]:
# Classifier les commentaires sont du spam ou ham
def classify(comment):
    
    # get global variables
    global pSpam
    global pNotSpam
    
    # Calculer la valeur proportionnelle à Pr(comment|ham)
    isNegative = pSpam * float(conditionalComment(comment, 0))
    
    # Calculer la valeur proportionnelle à Pr(comment|spam)
    isPositive = pNotSpam * float(conditionalComment(comment, 1))
    
    # Output -> True = spam, False = ham en fonction des 2 variables calculées précédemment (il faut comparer les variables)
    return (isNegative < isPositive)

In [244]:
print(trainNegative)
print(trainPositive)



In [230]:
# Initialiser la prédiction du spam dans les données de test
prediction_test = []

# Obtenez la précision des prédictions sur les données d'essai
for comment in data_comments_test.content:

    # ajouter un commentaire classifié à la liste prediction_test 
    prediction_test.append(classify(comment))

# Checker la précision: 
# D'abord le nombre de prédictions correctes 
correct_labels = np.equal(prediction_test, data_comments_test["label"])
# Ensuite la moyenne des prédictions correctes
test_accuracy = np.mean(correct_labels)

#print prediction_test
print("Proportion of comments classified correctly on test set: %s" % test_accuracy)

trying word[ girl ]
0.0015939243354741925
trying word[ to ]
0.012282593408654072
trying word[ me ]
0.003984810838685481
trying word[ xxx ]
0.00032816089259762784
trying word[ am ]
0.002015845483099714
trying word[ from ]
0.0021564858656415546
trying word[ 1 ]
0.0025784070132670762
trying word[ billion ]
0.004969293516478365
trying word[ sgrout ]
4.6880127513946835e-05
trying word[ perfect ]
0.0008907224227649899
trying word[ of ]
0.008204022314940697
trying word[ from ]
0.0021564858656415546
trying word[ and ]
0.01200131264357039
trying word[ the ]
0.021142937508790025
trying word[ thing ]
0.0007500820402231494
trying word[ is ]
0.019455252918287938
trying word[ m ]
0.0008907224227649899
trying word[ a ]
0.010313628053068305
trying word[ feminist ]
0.00018752051005578734
trying word[ so ]
0.007219539637147813
trying word[ agree ]
4.6880127513946835e-05
trying word[ with ]
0.002859687778350757
trying word[ this ]
0.02761239510571469
trying word[ song ]
0.021283577891331864
trying word[ 

trying word[ as ]
0.0008907224227649899
trying word[ am ]
0.002015845483099714
trying word[ making ]
0.00018752051005578734
trying word[ over ]
0.0007500820402231494
trying word[ per ]
4.6880127513946835e-05
trying word[ month ]
4.6880127513946835e-05
trying word[ at ]
0.002297126248183395
trying word[ and ]
0.01200131264357039
trying word[ check ]
0.0013126435703905114
trying word[ it ]
0.010594908818151985
trying word[ does ]
0.001734564718016033
trying word[ the ]
0.021142937508790025
trying word[ wood ]
4.6880127513946835e-05
trying word[ the ]
0.021142937508790025
trying word[ does ]
0.001734564718016033
trying word[ the ]
0.021142937508790025
trying word[ act ]
4.6880127513946835e-05
trying word[ the ]
0.021142937508790025
trying word[ the ]
0.021142937508790025
trying word[ guys ]
0.00032816089259762784
trying word[ should ]
0.0008907224227649899
trying word[ check ]
0.0013126435703905114
trying word[ out ]
0.0013126435703905114
trying word[ this ]
0.02761239510571469
trying wor

0.0027190473958089165
trying word[ you ]
0.00918850499273358
trying word[ think ]
0.0018752051005578735
trying word[ it ]
0.010594908818151985
trying word[ would ]
0.0008907224227649899
trying word[ be ]
0.0027190473958089165
trying word[ appreciated ]
4.6880127513946835e-05
trying word[ like ]
0.008344662697482538
trying word[ at ]
0.002297126248183395
trying word[ fashion ]
4.6880127513946835e-05
trying word[ store ]
4.6880127513946835e-05
trying word[ in ]
0.010594908818151985
trying word[ the ]
0.021142937508790025
trying word[ out ]
0.0013126435703905114
trying word[ our ]
4.6880127513946835e-05
trying word[ to ]
0.012282593408654072
trying word[ like ]
0.008344662697482538
trying word[ all ]
0.002859687778350757
trying word[ your ]
0.002859687778350757
trying word[ favourite ]
0.00032816089259762784
trying word[ like ]
0.008344662697482538
trying word[ at ]
0.002297126248183395
trying word[ fashion ]
4.6880127513946835e-05
trying word[ store ]
4.6880127513946835e-05
trying word[ 

0.002015845483099714
trying word[ i ]
0.026205991280296284
trying word[ m ]
0.0008907224227649899
trying word[ on ]
0.004125451221227322
trying word[ day ]
0.0011720031878486709
trying word[ 46 ]
4.6880127513946835e-05
trying word[ if ]
0.003281608925976279
trying word[ you ]
0.00918850499273358
trying word[ guys ]
0.00032816089259762784
trying word[ can ]
0.002297126248183395
trying word[ please ]
0.00032816089259762784
trying word[ like ]
0.008344662697482538
trying word[ this ]
0.02761239510571469
trying word[ comment ]
0.0006094416576813089
trying word[ so ]
0.007219539637147813
trying word[ everyone ]
0.00046880127513946836
trying word[ can ]
0.002297126248183395
trying word[ see ]
0.0015939243354741925
trying word[ it ]
0.010594908818151985
trying word[ and ]
0.01200131264357039
trying word[ follow ]
4.6880127513946835e-05
trying word[ me ]
0.003984810838685481
trying word[ on ]
0.004125451221227322
trying word[ my ]
0.005250574281562045
trying word[ me ]
0.003984810838685481
try

trying word[ to ]
0.012282593408654072
trying word[ record ]
0.00018752051005578734
trying word[ and ]
0.01200131264357039
trying word[ put ]
0.00032816089259762784
trying word[ them ]
0.0006094416576813089
trying word[ on ]
0.004125451221227322
trying word[ any ]
0.0007500820402231494
trying word[ of ]
0.008204022314940697
trying word[ you ]
0.00918850499273358
trying word[ could ]
0.0008907224227649899
trying word[ check ]
0.0013126435703905114
trying word[ it ]
0.010594908818151985
trying word[ out ]
0.0013126435703905114
trying word[ and ]
0.01200131264357039
trying word[ it ]
0.010594908818151985
trying word[ would ]
0.0008907224227649899
trying word[ mean ]
0.0007500820402231494
trying word[ so ]
0.007219539637147813
trying word[ much ]
0.002297126248183395
trying word[ to ]
0.012282593408654072
trying word[ us ]
0.0006094416576813089
trying word[ because ]
0.003140968543434438
trying word[ we ]
0.002015845483099714
trying word[ love ]
0.013970277999156159
trying word[ doing ]
0.

trying word[ start ]
0.00018752051005578734
trying word[ hey ]
0.00046880127513946836
trying word[ guys ]
0.00032816089259762784
trying word[ i ]
0.026205991280296284
trying word[ know ]
0.002015845483099714
trying word[ its ]
0.003984810838685481
trying word[ annoying ]
0.00018752051005578734
trying word[ getting ]
0.00032816089259762784
trying word[ sorry ]
0.00046880127513946836
trying word[ bout ]
0.00018752051005578734
trying word[ that ]
0.007782101167315175
trying word[ but ]
0.003984810838685481
trying word[ please ]
0.00032816089259762784
trying word[ take ]
4.6880127513946835e-05
trying word[ a ]
0.010313628053068305
trying word[ moment ]
0.00018752051005578734
trying word[ to ]
0.012282593408654072
trying word[ check ]
0.0013126435703905114
trying word[ out ]
0.0013126435703905114
trying word[ my ]
0.005250574281562045
trying word[ channel ]
0.00018752051005578734
trying word[ with ]
0.002859687778350757
trying word[ i ]
0.026205991280296284
trying word[ want ]
0.00131264357

trying word[ makes ]
0.0007500820402231494
trying word[ me ]
0.003984810838685481
trying word[ want ]
0.0013126435703905114
trying word[ to ]
0.012282593408654072
trying word[ she ]
0.004969293516478365
trying word[ really ]
0.0015939243354741925
trying word[ did ]
0.0007500820402231494
trying word[ this ]
0.02761239510571469
trying word[ there ]
0.002015845483099714
trying word[ sorry ]
0.00046880127513946836
trying word[ was ]
0.003422249308518119
trying word[ being ]
0.00046880127513946836
trying word[ still ]
0.0027190473958089165
trying word[ love ]
0.013970277999156159
trying word[ you ]
0.00918850499273358
trying word[ not ]
0.0030003281608925974
trying word[ a ]
0.010313628053068305
trying word[ big ]
0.0006094416576813089
trying word[ fan ]
0.00032816089259762784
trying word[ of ]
0.008204022314940697
trying word[ the ]
0.021142937508790025
trying word[ song ]
0.021283577891331864
trying word[ but ]
0.003984810838685481
trying word[ this ]
0.02761239510571469
trying word[ vide

trying word[ so ]
0.007219539637147813
trying word[ close ]
0.00046880127513946836
trying word[ to ]
0.012282593408654072
trying word[ up ]
0.0018752051005578735
trying word[ with ]
0.002859687778350757
trying word[ another ]
4.6880127513946835e-05
trying word[ hit ]
0.00046880127513946836
trying word[ like ]
0.008344662697482538
trying word[ this ]
0.02761239510571469
trying word[ and ]
0.01200131264357039
trying word[ it ]
0.010594908818151985
trying word[ will ]
0.001734564718016033
trying word[ happen ]
4.6880127513946835e-05
trying word[ video ]
0.007922741549857016
trying word[ is ]
0.019455252918287938
trying word[ so ]
0.007219539637147813
trying word[ are ]
0.003984810838685481
trying word[ only ]
0.0018752051005578735
trying word[ want ]
0.0013126435703905114
trying word[ to ]
0.012282593408654072
trying word[ win ]
4.6880127513946835e-05
trying word[ the ]
0.021142937508790025
trying word[ check ]
0.0013126435703905114
trying word[ my ]
0.005250574281562045
trying word[ chan

trying word[ see ]
0.0015939243354741925
trying word[ this ]
0.02761239510571469
trying word[ have ]
0.003422249308518119
trying word[ a ]
0.010313628053068305
trying word[ have ]
0.003422249308518119
trying word[ the ]
0.021142937508790025
trying word[ greatest ]
0.00018752051005578734
trying word[ videos ]
0.00046880127513946836
trying word[ or ]
0.0015939243354741925
trying word[ the ]
0.021142937508790025
trying word[ best ]
0.004969293516478365
trying word[ quality ]
4.6880127513946835e-05
trying word[ now ]
0.002015845483099714
trying word[ feel ]
0.00032816089259762784
trying word[ like ]
0.008344662697482538
trying word[ not ]
0.0030003281608925974
trying word[ getting ]
0.00032816089259762784
trying word[ and ]
0.01200131264357039
trying word[ need ]
0.00032816089259762784
trying word[ your ]
0.002859687778350757
trying word[ help ]
0.00046880127513946836
trying word[ you ]
0.00918850499273358
trying word[ could ]
0.0008907224227649899
trying word[ watch ]
0.001031362805306830

trying word[ companions ]
4.6880127513946835e-05
trying word[ of ]
0.008204022314940697
trying word[ the ]
0.021142937508790025
trying word[ whole ]
0.00018752051005578734
trying word[ school ]
4.6880127513946835e-05
trying word[ love ]
0.013970277999156159
trying word[ it ]
0.010594908818151985
trying word[ hate ]
0.0010313628053068304
trying word[ it ]
0.010594908818151985
trying word[ when ]
0.0035628896910599598
trying word[ comes ]
0.00032816089259762784
trying word[ in ]
0.010594908818151985
trying word[ my ]
0.005250574281562045
trying word[ head ]
0.00046880127513946836
trying word[ this ]
0.02761239510571469
trying word[ is ]
0.019455252918287938
trying word[ like ]
0.008344662697482538
trying word[ 2 ]
0.004125451221227322
trying word[ years ]
0.0030003281608925974
trying word[ out ]
0.0013126435703905114
trying word[ this ]
0.02761239510571469
trying word[ video ]
0.007922741549857016
trying word[ on ]
0.004125451221227322
trying word[ is ]
0.019455252918287938
trying word[ 

0.007922741549857016
trying word[ on ]
0.004125451221227322
trying word[ out ]
0.0013126435703905114
trying word[ this ]
0.02761239510571469
trying word[ video ]
0.007922741549857016
trying word[ on ]
0.004125451221227322
trying word[ subscribe ]
4.6880127513946835e-05
trying word[ to ]
0.012282593408654072
trying word[ my ]
0.005250574281562045
trying word[ channel ]
0.00018752051005578734
trying word[ yo ]
4.6880127513946835e-05
trying word[ people ]
0.003281608925976279
trying word[ are ]
0.003984810838685481
trying word[ going ]
0.0007500820402231494
trying word[ for ]
0.0046880127513946835
trying word[ more ]
0.0013126435703905114
trying word[ information ]
4.6880127513946835e-05
trying word[ subscribe ]
4.6880127513946835e-05
trying word[ to ]
0.012282593408654072
trying word[ my ]
0.005250574281562045
trying word[ channel ]
0.00018752051005578734
trying word[ or ]
0.0015939243354741925
trying word[ search ]
4.6880127513946835e-05
trying word[ for ]
0.0046880127513946835
trying w

0.001453283952932352
trying word[ clip ]
0.00032816089259762784
trying word[ in ]
0.010594908818151985
trying word[ and ]
0.01200131264357039
trying word[ give ]
0.00018752051005578734
trying word[ me ]
0.003984810838685481
trying word[ some ]
0.0008907224227649899
trying word[ on ]
0.004125451221227322
trying word[ how ]
0.0035628896910599598
trying word[ my ]
0.005250574281562045
trying word[ video ]
0.007922741549857016
trying word[ was ]
0.003422249308518119
trying word[ and ]
0.01200131264357039
trying word[ how ]
0.0035628896910599598
trying word[ i ]
0.026205991280296284
trying word[ could ]
0.0008907224227649899
trying word[ improve ]
4.6880127513946835e-05
trying word[ be ]
0.0027190473958089165
trying word[ sure ]
0.00018752051005578734
trying word[ to ]
0.012282593408654072
trying word[ go ]
0.0011720031878486709
trying word[ check ]
0.0013126435703905114
trying word[ out ]
0.0013126435703905114
trying word[ the ]
0.021142937508790025
trying word[ about ]
0.00145328395293235

trying word[ while ]
0.0008907224227649899
trying word[ im ]
0.0018752051005578735
trying word[ the ]
0.021142937508790025
trying word[ only ]
0.0018752051005578735
trying word[ one ]
0.0027190473958089165
trying word[ watching ]
0.0018752051005578735
trying word[ here ]
0.0015939243354741925
trying word[ on ]
0.004125451221227322
trying word[ lol ]
0.001453283952932352
trying word[ the ]
0.021142937508790025
trying word[ most ]
0.0015939243354741925
trying word[ liked ]
0.00032816089259762784
trying word[ video ]
0.007922741549857016
trying word[ on ]
0.004125451221227322
trying word[ 2 ]
0.004125451221227322
trying word[ all ]
0.002859687778350757
trying word[ earth ]
0.0007500820402231494
trying word[ population ]
0.00018752051005578734
trying word[ of ]
0.008204022314940697
trying word[ the ]
0.021142937508790025
trying word[ hope ]
4.6880127513946835e-05
trying word[ your ]
0.002859687778350757
trying word[ having ]
4.6880127513946835e-05
trying word[ a ]
0.010313628053068305
tryi

trying word[ getting ]
0.00032816089259762784
trying word[ so ]
0.007219539637147813
trying word[ much ]
0.002297126248183395
trying word[ money ]
0.00032816089259762784
trying word[ lol ]
0.001453283952932352
trying word[ get ]
0.0027190473958089165
trying word[ 100 ]
4.6880127513946835e-05
trying word[ will ]
0.001734564718016033
trying word[ to ]
0.012282593408654072
trying word[ from ]
0.0021564858656415546
trying word[ the ]
0.021142937508790025
trying word[ face ]
0.0006094416576813089
trying word[ of ]
0.008204022314940697
trying word[ earth ]
0.0007500820402231494
trying word[ and ]
0.01200131264357039
trying word[ guys ]
0.00032816089259762784
trying word[ my ]
0.005250574281562045
trying word[ name ]
0.00046880127513946836
trying word[ is ]
0.019455252918287938
trying word[ and ]
0.01200131264357039
trying word[ do ]
0.001734564718016033
trying word[ football ]
0.00018752051005578734
trying word[ videos ]
0.00046880127513946836
trying word[ have ]
0.003422249308518119
trying 

trying word[ the ]
0.021142937508790025
trying word[ if ]
0.003281608925976279
trying word[ not ]
0.0030003281608925974
trying word[ mind ]
4.6880127513946835e-05
trying word[ just ]
0.004266091603769162
trying word[ checking ]
0.00046880127513946836
trying word[ what ]
0.0027190473958089165
trying word[ the ]
0.021142937508790025
trying word[ views ]
0.008625943462566218
trying word[ are ]
0.003984810838685481
trying word[ up ]
0.0018752051005578735
trying word[ to ]
0.012282593408654072
trying word[ everyone ]
0.00046880127513946836
trying word[ joking ]
0.00018752051005578734
trying word[ about ]
0.001453283952932352
trying word[ how ]
0.0035628896910599598
trying word[ he ]
0.001453283952932352
trying word[ to ]
0.012282593408654072
trying word[ get ]
0.0027190473958089165
trying word[ 2 ]
0.004125451221227322
trying word[ billion ]
0.004969293516478365
trying word[ views ]
0.008625943462566218
trying word[ because ]
0.003140968543434438
trying word[ a ]
0.010313628053068305
trying

trying word[ stil ]
4.6880127513946835e-05
trying word[ this ]
0.02761239510571469
trying word[ love ]
0.013970277999156159
trying word[ you ]
0.00918850499273358
trying word[ subscribed ]
4.6880127513946835e-05
trying word[ out ]
0.0013126435703905114
trying word[ this ]
0.02761239510571469
trying word[ playlist ]
4.6880127513946835e-05
trying word[ on ]
0.004125451221227322
trying word[ out ]
0.0013126435703905114
trying word[ this ]
0.02761239510571469
trying word[ video ]
0.007922741549857016
trying word[ on ]
0.004125451221227322
trying word[ out ]
0.0013126435703905114
trying word[ this ]
0.02761239510571469
trying word[ playlist ]
4.6880127513946835e-05
trying word[ on ]
0.004125451221227322
trying word[ subscribe ]
4.6880127513946835e-05
trying word[ my ]
0.005250574281562045
trying word[ out ]
0.0013126435703905114
trying word[ this ]
0.02761239510571469
trying word[ playlist ]
4.6880127513946835e-05
trying word[ on ]
0.004125451221227322
trying word[ guys ]
0.0003281608925976

trying word[ a ]
0.010313628053068305
trying word[ la ]
4.6880127513946835e-05
trying word[ out ]
0.0013126435703905114
trying word[ the ]
0.021142937508790025
trying word[ check ]
0.0013126435703905114
trying word[ out ]
0.0013126435703905114
trying word[ our ]
4.6880127513946835e-05
trying word[ new ]
0.00046880127513946836
trying word[ s ]
0.0015939243354741925
trying word[ a ]
0.010313628053068305
trying word[ check ]
0.0013126435703905114
trying word[ out ]
0.0013126435703905114
trying word[ our ]
4.6880127513946835e-05
trying word[ bands ]
4.6880127513946835e-05
trying word[ page ]
4.6880127513946835e-05
trying word[ on ]
0.004125451221227322
trying word[ youtube ]
0.002297126248183395
trying word[ killtheclockhd ]
4.6880127513946835e-05
trying word[ check ]
0.0013126435703905114
trying word[ out ]
0.0013126435703905114
trying word[ some ]
0.0008907224227649899
trying word[ of ]
0.008204022314940697
trying word[ our ]
4.6880127513946835e-05
trying word[ original ]
4.6880127513946

trying word[ m ]
0.0008907224227649899
trying word[ doing ]
0.00032816089259762784
trying word[ this ]
0.02761239510571469
trying word[ to ]
0.012282593408654072
trying word[ money ]
0.00032816089259762784
trying word[ for ]
0.0046880127513946835
trying word[ people ]
0.003281608925976279
trying word[ who ]
0.002297126248183395
trying word[ can ]
0.002297126248183395
trying word[ t ]
0.001453283952932352
trying word[ experience ]
4.6880127513946835e-05
trying word[ the ]
0.021142937508790025
trying word[ that ]
0.007782101167315175
trying word[ we ]
0.002015845483099714
trying word[ can ]
0.002297126248183395
trying word[ you ]
0.00918850499273358
trying word[ donate ]
4.6880127513946835e-05
trying word[ to ]
0.012282593408654072
trying word[ give ]
0.00018752051005578734
trying word[ them ]
0.0006094416576813089
trying word[ a ]
0.010313628053068305
trying word[ amount ]
0.00018752051005578734
trying word[ would ]
0.0008907224227649899
trying word[ do ]
0.001734564718016033
trying wor

Essayons d'écrire quelques commentaires pour voir s'ils sont classés comme spam ou ham. 

Rappelez-vous que le "True" est pour les commentaires de spam, et "False" est pour les commentaires ham. 
Essayez vous même !

In [245]:
# spam
classify("Guys check out my new chanell")

trying word[ check ]
0.001060332944544587
trying word[ out ]
0.001060332944544587
trying word[ my ]
0.004029265189269431
trying word[ new ]
0.0004241331778178348
trying word[ chanell ]
0.0001060332944544587


True

In [246]:
# spam
classify("I have solved P vs. NP, check my video https://www.youtube.com/watch?v=dQw4w9WgXcQ")

trying word[ have ]
0.0026508323613614673
trying word[ check ]
0.001060332944544587
trying word[ my ]
0.004029265189269431
trying word[ video ]
0.006043897783904146


True

In [247]:
# ham
classify("I liked the video")

trying word[ liked ]
0.0003180998833633761
trying word[ the ]
0.016011027462623263
trying word[ video ]
0.006043897783904146


False

In [248]:
# ham
classify("Its great that this video has so many views")

trying word[ great ]
0.0007422330611812109
trying word[ that ]
0.0059378644894496875
trying word[ this ]
0.020888559007528364
trying word[ video ]
0.006043897783904146
trying word[ has ]
0.002544799066907009
trying word[ so ]
0.005513731311631852
trying word[ many ]
0.001060332944544587
trying word[ views ]
0.0065740642561764396


False

In [253]:
# ??
classify("sgrout your video is interesting but why no update")

trying word[ sgrout ]
0.0001060332944544587
trying word[ your ]
0.0022266991835436325
trying word[ video ]
0.006043897783904146
trying word[ is ]
0.01473862792916976
trying word[ but ]
0.003074965539179302
trying word[ why ]
0.0023327324779980913
trying word[ no ]
0.0011663662389990457
trying word[ update ]
0.0001060332944544587


False

### Pour aller plus loin...
## Extending Bag of Words by Using TF-IDF

Jusqu'à présent, nous avons utilisé le modèle du Bag of Words pour représenter les commentaires en tant que vecteurs. Le "Bag of Words" est une liste de tous les mots uniques trouvés dans les données training, alors chaque commentaire peut être représenté par un vecteur qui contient la fréquence de chaque mot unique qui apparaît dans le commentaire.

Par exemple, si les données training contiennent les mots $(hi, how, how, my, grade, are, you),$ alors le texte "how are you you" peut être représenté par $(0,1,0,0,0,1,2).$ La principale raison pour laquelle nous faisons cela dans notre application est que les commentaires peuvent varier en longueur, mais la longueur des mots uniques reste fixe.

Dans notre contexte, le TF-IDF est une mesure de l'importance d'un mot dans un commentaire par rapport à tous les mots de nos données de formation. Par exemple, si un mot tel que "the" apparaissait dans la plupart des commentaires, le TF-IDF serait petit car ce mot ne nous aide pas à faire la différence entre les commentaires spam et ham. Notez que "TF" signifie "Term Frequency" et "IDF" signifie "Inverse Document Frequency".

En particulier, "TF" indiqué par $tf(w,c)$ est le nombre de fois que le mot $w$ apparaît dans le commentaire donné $c$. Alors que "IDF" est une mesure de la quantité d'informations qu'un mot donné fournit pour différencier les commentaires. PLus précisement, $IDF$ est formulé comme ceci:


>$idf(w, D) = log(\frac{\text{Number of comments in train data $D$}}{\text{Number of comments containing the word $w$}}).$ 


Pour combiner "TF" et "IDF" ensemble, nous prenons simplement le produit, donc:


>$$TFIDF = tf(w,c) \times idf(w, D) = (\text{Number of times $w$ appears in comment $c$})\times log(\frac{\text{Number of comments in train data $D$}}{\text{Number of comments containing the word $w$}}).$$


Maintenant, le $TF-IDF$ peut être utilisé pour pondérer les vecteurs qui résultent de l'approche "Bag of Words".

Par exemple, supposons qu'un commentaire contienne "ceci" 2 fois, donc $tf = 2$. 
Si nous avions alors 1000 commentaires dans nos données de formation, et que le mot "ceci" apparaît dans 100 commentaires, $idf = log(1000/100) = 2.$. 

Par conséquent, dans cet exemple, le poids TF-IDF serait de $2*2 = 4$ pour le mot "ceci" apparaît deux fois dans un commentaire particulier. Pour incorporer TF-IDF dans le réglage des baies naïves, nous pouvons calculer :

>$$Pr(word|spam) = \frac{\sum_{\text{c is spam}}TFIDF(word,c,D)}{\sum_{\text{word in spam c}}\sum_{\text{c is spam}}TFIDF(word,c,D)+ \text{Number of unique words in data}},$$ 

>where $TFIDF(word,c,D) = TF(word,c) \times IDF(word,data).$ 

In [268]:
# Calculer TFIDF(word, comment, data)
def TFIDF(comment, train):
    
    # Diviser le commentaire en une liste de mot
    comment = comment.split(' ')
    
    # Initiailiser tf-idf selon la longueur du commentaire
    tfidf_comment = np.zeros(len(comment))
    print(tfidf_comment)
    # Initiailiser nombre de commentaires contenant un mot
    num_comment_word = 0
    
    # Initialiser l'index pour les mots dans le commentaire
    word_index = 0
    
    # Pour chaque mot du commentaire
    for word in comment:
        
        # Calculer la fréquence des termes (tf)
        # Compter la fréquence du mot dans les commentaires
        tf = comment.count(word)
        
        # Trouver le nombre de commentaires contenant un mot
        for text in train["content"]:
            
            # Incrémenter le compteur de mots si le mot trouvé dans le commentaire
            if text.split(' ').count(word) > 0:
                num_comment_word += 1
        
        # Calculer la fréquence du document inverse (idf)
        # log(Nombre total de commentaires/nombre de commentaires avec mot)
        idf = np.log(len(train)/num_comment_word)
        
        # Mettre a jour le poids tf-idf du mot
        tfidf_comment[word_index] = tf * idf
        
        # Réinitialiser le nombre de commentaires contenant un mot
        num_comment_word = 0
        
        # Passer au mot suivant dans le commentaire
        word_index += 1
        
    return tfidf_comment

In [269]:
TFIDF("Check out my new music video plz",data_comments_train)

[0. 0. 0. 0. 0. 0. 0.]


array([2.1261888 , 1.60739501, 1.83565366, 3.73562672, 3.16384039,
       2.08148863, 5.21153324])

In [None]:
# Et maintenant, implémente TFIDF avec ta fonction de classification
# Have fun :D

# Classifier les commentaires sont du spam ou ham
def classifyTDIDF(comment):
    
    # get global variables
    global pSpam
    global pNotSpam
    
    # Calculer la valeur proportionnelle à Pr(comment|ham)
    isNegative = pSpam * float(conditionalComment(comment, 0))
    
    # Calculer la valeur proportionnelle à Pr(comment|spam)
    isPositive = pNotSpam * float(conditionalComment(comment, 1))
    
    # Output -> True = spam, False = ham en fonction des 2 variables calculées précédemment (il faut comparer les variables)
    return (isNegative < isPositive)

