# PART 1: Data cleaning and preprocessing

## Importing libraries

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np

pd.set_option("display.max_colwidth", None)

## Importing the data

In [35]:
df = pd.read_csv('https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/project/spam.csv', encoding='latin-1')
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives around here though",,,


In [3]:
df.shape

(5572, 5)

In [4]:
df['v1'].value_counts()

ham     4825
spam     747
Name: v1, dtype: int64

In [5]:
print('Proportion of spam:', round(df['v1'].value_counts()['spam']/len(df)*100,2), '%')

Proportion of spam: 13.41 %
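
The class counts printed above show an imbalanced dataset. As a rough sanity check, the sketch below recomputes the spam share from those two counts and derives the accuracy of a trivial classifier that always predicts "ham". The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
ham, spam = 4825, 747
total = ham + spam

# Share of spam messages, in percent
spam_share = round(spam / total * 100, 2)

# Accuracy (in percent) of a classifier that always answers "ham"
majority_baseline = round(ham / total * 100, 2)

print(spam_share, majority_baseline)
```

Any model trained on this data should be compared against that majority-class baseline rather than against 50 %.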


## Data preprocessing

### Recovering the full message

We combine the columns *v2*, *Unnamed: 2*, *Unnamed: 3*, and *Unnamed: 4* to recover the full content of each message.

We store the full message in a new column named *full_message*.

In [36]:
df['full_message'] = df['v2']

for i in range(2,5):
    df['full_message'] += ' ' + df[f'Unnamed: {i}'].fillna('')
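
The loop above appends a space for every extra column, even when that column is empty, so most messages end with trailing spaces (these are removed later by the cleaning step). As an illustration only, here is a minimal pure-Python sketch of the same merge that skips missing fragments. The helper `join_parts` is hypothetical, and `None` stands in for the missing values that pandas represents as NaN.

```python
def join_parts(parts):
    """Join the non-missing, non-empty fragments of a message with single spaces."""
    return " ".join(p.strip() for p in parts if p is not None and p.strip())

# A message split over two columns, with the last two columns missing
row = ["Wen u miss someone", "the person is definitely special", None, None]
merged = join_parts(row)
print(merged)
```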

In [37]:
df[df['Unnamed: 2'].notna()].head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4,full_message
95,spam,"Your free ringtone is waiting to be collected. Simply text the password \MIX\"" to 85069 to verify. Get Usher and Britney. FML",PO Box 5249,"MK17 92H. 450Ppw 16""",,"Your free ringtone is waiting to be collected. Simply text the password \MIX\"" to 85069 to verify. Get Usher and Britney. FML PO Box 5249 MK17 92H. 450Ppw 16"""
281,ham,\Wen u miss someone,the person is definitely special for u..... But if the person is so special,why to miss them,"just Keep-in-touch\"" gdeve..""","\Wen u miss someone the person is definitely special for u..... But if the person is so special why to miss them just Keep-in-touch\"" gdeve.."""
444,ham,\HEY HEY WERETHE MONKEESPEOPLE SAY WE MONKEYAROUND! HOWDY GORGEOUS,"HOWU DOIN? FOUNDURSELF A JOBYET SAUSAGE?LOVE JEN XXX\""""",,,"\HEY HEY WERETHE MONKEESPEOPLE SAY WE MONKEYAROUND! HOWDY GORGEOUS HOWU DOIN? FOUNDURSELF A JOBYET SAUSAGE?LOVE JEN XXX\"""""
671,spam,SMS. ac sun0819 posts HELLO:\You seem cool,"wanted to say hi. HI!!!\"" Stop? Send STOP to 62468""",,,"SMS. ac sun0819 posts HELLO:\You seem cool wanted to say hi. HI!!!\"" Stop? Send STOP to 62468"""
710,ham,Height of Confidence: All the Aeronautics professors wer calld &amp; they wer askd 2 sit in an aeroplane. Aftr they sat they wer told dat the plane ws made by their students. Dey all hurried out of d plane.. Bt only 1 didnt move... He said:\if it is made by my students,"this wont even start........ Datz confidence..""",,,"Height of Confidence: All the Aeronautics professors wer calld &amp; they wer askd 2 sit in an aeroplane. Aftr they sat they wer told dat the plane ws made by their students. Dey all hurried out of d plane.. Bt only 1 didnt move... He said:\if it is made by my students this wont even start........ Datz confidence.."""


In [38]:
cols_to_drop = ['v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
df_full = df.drop(cols_to_drop, axis=1).reset_index(drop=True).rename(columns={'v1': 'label'})
df_clean = df_full.copy()
df_clean.head()

Unnamed: 0,label,full_message
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


## Text cleaning

We use the spaCy library to clean the text.

We remove special characters, extra spaces between words, and any leading or trailing spaces, convert the text to lowercase, and drop stop words.

We then keep the lemma (base form) of each remaining word.

In [9]:
!python -m spacy download en_core_web_sm -q

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 23.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [10]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [11]:
from spacy.lang.en.stop_words import STOP_WORDS

In [39]:
# Keep only alphanumeric characters and spaces
df_clean['message_clean'] = df_clean['full_message'].apply(lambda x: ''.join(ch for ch in x if ch.isalnum() or ch==' '))
# Collapse repeated spaces (str.replace would treat " +" literally, not as a regex), lowercase, and trim
df_clean['message_clean'] = df_clean['message_clean'].apply(lambda x: " ".join(x.split()).lower())
# Lemmatize with spaCy and drop stop words
df_clean['message_clean'] = df_clean['message_clean'].apply(lambda x: " ".join([token.lemma_ for token in nlp(x) if (token.lemma_ not in STOP_WORDS) and (token.text not in STOP_WORDS)]))
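
To make the individual cleaning steps easier to follow, here is a simplified sketch that uses only the standard library. It does not lemmatize, and `DEMO_STOP_WORDS` is a tiny hypothetical stop-word list rather than spaCy's `STOP_WORDS`, so its output will differ from the notebook's. It also shows that `str.replace` matches its argument literally, so collapsing repeated spaces needs a regular expression or a split/join.

```python
import re

# Hypothetical, tiny stop-word list for illustration; spaCy's STOP_WORDS is much larger
DEMO_STOP_WORDS = {"i", "he", "to", "so", "the", "a", "in"}

def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Keep alphanumerics and spaces, normalize whitespace and case, drop stop words."""
    kept = "".join(ch for ch in text if ch.isalnum() or ch == " ")
    normalized = re.sub(r" +", " ", kept).lower().strip()
    return " ".join(w for w in normalized.split(" ") if w and w not in stop_words)

# str.replace looks for the literal two characters " +", so nothing changes here
unchanged = "a  b".replace(" +", " ")

cleaned = clean_text("  Ok lar...   Joking wif u oni... ")
print(repr(unchanged), repr(cleaned))
```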

In [40]:
df_clean.head()

Unnamed: 0,label,full_message,message_clean
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",jurong point crazy available bugis n great world la e buffet cine amore wat
1,ham,Ok lar... Joking wif u oni...,ok lar joke wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,free entry 2 wkly comp win fa cup final tkts 21st 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s
3,ham,U dun say so early hor... U c already then say...,u dun early hor u c
4,ham,"Nah I don't think he goes to usf, he lives around here though",nah think usf live


In [41]:
df_clean['label'] = df_clean['label'].apply(lambda x : 1 if (x == 'spam') else 0)

In [42]:
df_clean.head()

Unnamed: 0,label,full_message,message_clean
0,0,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",jurong point crazy available bugis n great world la e buffet cine amore wat
1,0,Ok lar... Joking wif u oni...,ok lar joke wif u oni
2,1,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,free entry 2 wkly comp win fa cup final tkts 21st 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s
3,0,U dun say so early hor... U c already then say...,u dun early hor u c
4,0,"Nah I don't think he goes to usf, he lives around here though",nah think usf live


We remove the rows where *message_clean* is empty.

In [54]:
df_clean[df_clean['message_clean'] == '']

Unnamed: 0,label,full_message,message_clean
43,0,WHO ARE YOU SEEING?,
959,0,Where @,
1087,0,You can never do NOTHING,
1190,0,We're done...,
1236,0,How much are we getting?,
1407,0,Then we gotta do it after that,
2740,0,Nothing. Can...,
2805,0,Can a not?,
2871,0,See you there!,
2927,0,Anything...,


In [57]:
rows_to_remove = df_clean[df_clean['message_clean'] == ''].index.tolist()
len(rows_to_remove)

17

In [59]:
df_clean.drop(rows_to_remove, axis=0, inplace = True)
df_clean.shape

(5555, 3)

In [60]:
df_clean.rename(columns={'message_clean': 'message'}, inplace=True)
df_clean.drop(['full_message'], axis=1, inplace=True)
df_clean.head()

Unnamed: 0,label,message
0,0,jurong point crazy available bugis n great world la e buffet cine amore wat
1,0,ok lar joke wif u oni
2,1,free entry 2 wkly comp win fa cup final tkts 21st 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s
3,0,u dun early hor u c
4,0,nah think usf live


In [61]:
df_clean.to_csv("messages.csv", index=False)

## Text encoding

Now that the text is clean, we convert it into a numeric format that a model can process.

To do so, we use the text preprocessing tools from the Keras library.

Encoding takes three steps: tokenization -> vocabulary construction -> text vectorization.

Each encoded message is a sequence of integer word indices. With num_words=1000, only the most frequent words (indices below 1000) are kept; all other words are dropped from the sequences.

In [62]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=1000)
tokenizer.fit_on_texts(df_clean["message"])

df_clean["message_encoded"] = tokenizer.texts_to_sequences(df_clean["message"])
df_clean["len_message"] = df_clean["message_encoded"].apply(len)
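
The three encoding steps can be illustrated with a short pure-Python sketch. This is not the Keras implementation, only an approximation of its behavior under simple assumptions: words are split on spaces, ranked by frequency starting at index 1, and any word whose index is not below `num_words` is dropped. The function names mirror the Keras ones for readability but are defined here from scratch.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Rank words by frequency; index 1 is the most frequent word (0 is left free for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace each word by its index, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        seq = [word_index[w] for w in text.split()
               if w in word_index and (num_words is None or word_index[w] < num_words)]
        sequences.append(seq)
    return sequences

corpus = ["free entry win prize", "ok free", "free win"]
word_index = fit_vocabulary(corpus)
encoded = texts_to_sequences(corpus, word_index, num_words=3)
print(word_index)
print(encoded)
```

Because rare words are dropped, an encoded sequence can be shorter than the cleaned message it comes from, and can even be empty.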

In [63]:
df_clean.head()

Unnamed: 0,label,message,message_encoded,len_message
0,0,jurong point crazy available bugis n great world la e buffet cine amore wat,"[233, 447, 462, 944, 35, 52, 205, 945, 78, 946, 58]",11
1,0,ok lar joke wif u oni,"[10, 194, 463, 289, 1]",5
2,1,free entry 2 wkly comp win fa cup final tkts 21st 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s,"[12, 298, 3, 534, 666, 33, 857, 423, 20, 158, 298, 24, 234]",13
3,0,u dun early hor u c,"[1, 125, 150, 1, 85]",5
4,0,nah think usf live,"[711, 22, 667, 129]",4


In [None]:
df_clean.to_csv("messages_encoded.csv", index=False)
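
A CSV file stores every cell as text, so a column of Python lists such as *message_encoded* is written as its string representation. The sketch below, which uses only the standard library and an in-memory buffer as a stand-in for the file, shows one way to turn such a cell back into a list with `ast.literal_eval`. The sample rows are illustrative.

```python
import ast
import csv
import io

rows = [
    {"label": 0, "message_encoded": [1, 125, 150, 1, 85]},
    {"label": 1, "message_encoded": [12, 298, 3]},
]

# Write the rows to an in-memory CSV; each list is stored as text like "[1, 125, 150, 1, 85]"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["label", "message_encoded"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back and parse each text cell into a real list of integers
buffer.seek(0)
restored = []
for record in csv.DictReader(buffer):
    raw = record["message_encoded"]
    restored.append(ast.literal_eval(raw))

print(restored)
```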

In [None]:
tokenizer.index_word

{1: 'u',
 2: 'm',
 3: '2',
 4: 'ur',
 5: 'come',
 6: 's',
 7: '4',
 8: 'know',
 9: 'good',
 10: 'ok',
 11: 'ltgt',
 12: 'free',
 13: 'send',
 14: 'like',
 15: 'want',
 16: 'day',
 17: 'ill',
 18: 'time',
 19: 'love',
 20: 'text',
 21: 'tell',
 22: 'think',
 23: 'need',
 24: 'txt',
 25: 'today',
 26: 'home',
 27: 'lor',
 28: 'stop',
 29: 'reply',
 30: 'd',
 31: 'sorry',
 32: 'r',
 33: 'win',
 34: 'mobile',
 35: 'n',
 36: 'phone',
 37: 'new',
 38: 'work',
 39: 'week',
 40: 'later',
 41: 'hi',
 42: 'ask',
 43: 'da',
 44: 'miss',
 45: 'ì',
 46: 'hope',
 47: 'night',
 48: 'try',
 49: 'claim',
 50: 'wait',
 51: 'thing',
 52: 'great',
 53: 'oh',
 54: 'leave',
 55: 'hey',
 56: 'meet',
 57: 'dear',
 58: 'wat',
 59: 'pls',
 60: 'happy',
 61: 'message',
 62: 'number',
 63: 'friend',
 64: 'feel',
 65: 'thank',
 66: 'way',
 67: 've',
 68: 'late',
 69: 'prize',
 70: 'right',
 71: 'find',
 72: 'let',
 73: 'pick',
 74: 'tomorrow',
 75: 'yes',
 76: 'yeah',
 77: 'min',
 78: 'e',
 79: '1',
 80: 'amp',
 8

## Standardizing vector length

In [None]:
df_clean['len_message'].value_counts()

3     807
2     707
4     581
1     515
5     476
6     401
7     323
8     269
10    220
9     202
12    191
11    181
0     159
13    124
14    100
15     92
16     65
17     48
18     30
20     16
19     14
21     11
25      6
23      6
32      4
47      4
30      4
22      3
27      3
24      3
34      2
26      2
33      1
40      1
37      1
Name: len_message, dtype: int64

In [None]:
df_clean['len_message'].max()

47

The encoded messages do not all have the same length. The longest one contains 47 tokens.

We need to bring them all to the same length.

To do so, we use the pad_sequences() function with padding="post", which appends zeros to the end of the shorter sequences.

In [None]:
messages_pad = tf.keras.preprocessing.sequence.pad_sequences(df_clean['message_encoded'], padding="post")
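
As an illustration of what post-padding does, here is a minimal pure-Python sketch. The helper `pad_post` is hypothetical and only covers the case used above: no maximum length is given, so every sequence is padded with zeros up to the length of the longest one. It returns plain lists rather than a NumPy array.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the length of the longest one."""
    maxlen = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]

encoded = [[1, 125, 150, 1, 85], [711, 22, 667, 129], [10]]
padded = pad_post(encoded)
for row in padded:
    print(row)
```

Index 0 is never assigned to a word by the tokenizer, which is why it can safely serve as the padding value.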

In [None]:
messages_pad

array([[233, 447, 462, ...,   0,   0,   0],
       [ 10, 194, 463, ...,   0,   0,   0],
       [ 12, 298,   3, ...,   0,   0,   0],
       ...,
       [940,   0,   0, ...,   0,   0,   0],
       [113,  14,  30, ...,   0,   0,   0],
       [310,   0,   0, ...,   0,   0,   0]], dtype=int32)

In [None]:
type(messages_pad)

numpy.ndarray

In [None]:
messages_pad.shape

(5572, 47)

## Saving the preprocessed data

In [None]:
np.save('messages_pad.npy', messages_pad)