<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Load-the-data" data-toc-modified-id="1.-Load-the-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>1. Load the data</a></span></li><li><span><a href="#2.-Filtering-out-the-noise" data-toc-modified-id="2.-Filtering-out-the-noise-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>2. Filtering out the noise</a></span></li><li><span><a href="#3.-Even-better-filtering" data-toc-modified-id="3.-Even-better-filtering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>3. Even better filtering</a></span></li><li><span><a href="#4.-Term-frequency-times-inverse-document-frequency" data-toc-modified-id="4.-Term-frequency-times-inverse-document-frequency-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>4. Term frequency times inverse document frequency</a></span></li><li><span><a href="#5.-Utility-function" data-toc-modified-id="5.-Utility-function-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>5. Utility function</a></span></li></ul></div>

This notebook is part of the [Machine Learning class](https://github.com/erachelson/MLclass) by [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en).

License: CC-BY-SA-NC.

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Text data pre-processing</div>

In this exercise, we shall load a database of email messages and pre-format them so that we can design automated classification methods or use off-the-shelf classifiers.

"What is there to pre-process?" you might ask. Well, actually, text data comes in a very noisy form that we, humans, have become accustomed to and filter out effortlessly to grasp the core meaning of the text. It has a lot of formatting (fonts, colors, typography...), punctuation, abbreviations, common words, grammatical rules, etc. that we might wish to discard before even starting the data analysis.

Here are some pre-processing steps that can be performed on text:
1. loading the data, removing attachments, merging title and body;
2. tokenizing - splitting the text into atomic "words";
3. removal of stop-words - very common words;
4. removal of non-words - punctuation, numbers, gibberish;
5. lemmatization - merge together "find", "finds", "finder".
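
As a rough illustration of steps 2 to 5, here is a minimal pure-Python sketch. The stop-word list and the lemma table are tiny hypothetical stand-ins (they are not the NLTK resources used later in this notebook), and the regular expression is only one possible way to tokenize.

```python
import re

# Hypothetical, deliberately tiny stop-word list (real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"finds": "find", "finder": "find", "offers": "offer"}

def preprocess(text):
    # Step 2: lowercase, then split into runs of word characters or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 3: remove stop-words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 4: remove non-words (punctuation, numbers, codes)
    tokens = [t for t in tokens if t.isalpha()]
    # Step 5: map each remaining token to its lemma when the table knows it
    return [LEMMAS.get(t, t) for t in tokens]

example = preprocess("The finder finds 3 offers in the mail!")
print(example)  # ['find', 'find', 'offer', 'mail']
```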

The final goal is to be able to represent a document as a mathematical object, e.g. a vector, that our machine learning black boxes can process.
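
To make the idea of "a document as a vector" concrete, the sketch below builds count vectors by hand for two toy, already tokenized documents (hypothetical data). Each position of a vector corresponds to one word of the vocabulary.

```python
from collections import Counter

# Two toy documents, already tokenized (hypothetical data).
docs = [["cheap", "offer", "offer"], ["meeting", "offer"]]

# Vocabulary: every distinct token, sorted so that the column order is reproducible.
vocabulary = sorted({token for doc in docs for token in doc})

def to_vector(doc):
    # Count each token, then read the counts in vocabulary order.
    counts = Counter(doc)
    return [counts[token] for token in vocabulary]

vectors = [to_vector(doc) for doc in docs]
print(vocabulary)  # ['cheap', 'meeting', 'offer']
print(vectors)     # [[1, 0, 2], [0, 1, 1]]
```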

# 1. Text classification in English

## 1.1. Load the data

Let's first load the emails.

In [None]:
#!git clone https://github.com/SupaeroDataScience/deep-learning
#!mv deep-learning/data .
#!mv deep-learning/NLP/datasets .
#!pip install nltk unidecode

In [None]:
import os
data_switch=1
if(data_switch==0):
    train_dir = 'data/ling-spam/train-mails/'
    email_path = [os.path.join(train_dir,f) for f in os.listdir(train_dir)]
else:
    train_dir = 'data/lingspam_public/bare/'
    email_path = []
    email_label = []
    for d in os.listdir(train_dir):
        folder = os.path.join(train_dir,d)
        email_path += [os.path.join(folder,f) for f in os.listdir(folder)]
        email_label += [f[0:3]=='spm' for f in os.listdir(folder)]
print("number of emails",len(email_path))
email_nb = 8 # try 8 for a spam example
print("email file:", email_path[email_nb])
print("email is a spam:", email_label[email_nb])
print(open(email_path[email_nb]).read())

## 1.2. Filtering out the noise

One nice thing about scikit-learn is that it has lots of preprocessing utilities, such as [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which converts a collection of text documents to a matrix of token counts.

- To remove stop-words, we set: `stop_words='english'`
- To convert all words to lowercase: `lowercase=True`
- The default tokenizer in scikit-learn drops punctuation and only keeps tokens of at least 2 alphanumeric characters.
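
To see what this default tokenization keeps and drops, the sketch below applies the regular expression documented as `CountVectorizer`'s default `token_pattern` with the standard `re` module. The sample sentence is made up, and the call to `lower()` mimics `lowercase=True`.

```python
import re

# Regular expression documented as the default `token_pattern` of CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

sample = "Hi, I won $1000 at 9 a.m. - call me!"
tokens = TOKEN_PATTERN.findall(sample.lower())
print(tokens)  # ['hi', 'won', '1000', 'at', 'call', 'me']
```

Punctuation and single-character tokens disappear, but numbers such as `1000` survive, and common words such as `at` are only removed if a stop-word list is supplied.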

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
countvect = CountVectorizer(input='filename', stop_words='english', lowercase=True)
word_count = countvect.fit_transform(email_path)

In [None]:
print("Number of documents:", len(email_path))
words = countvect.get_feature_names_out()
print("Number of words:", len(words))
print("Document - words matrix:", word_count.shape)
print("First words:", words[0:100])

## 1.3. Even better filtering

That's already quite ok, but this pre-processing does not perform lemmatization, the list of stop-words could be better and we may wish to remove non-English words (misspelled, with numbers, etc.).

A slightly better preprocessing uses the [Natural Language Toolkit](https://www.nltk.org/). The one below:
- tokenizes;
- removes punctuation;
- removes stop-words;
- removes non-English and misspelled words (optional);
- removes 1-character words;
- removes non-alphabetical words (numbers and codes essentially).

In [None]:
import nltk
nltk.download('words')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

In [None]:
from nltk import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import words
from string import punctuation
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    def __init__(self, remove_non_words=True):
        self.wnl = WordNetLemmatizer()
        self.stopwords = set(stopwords.words('english'))
        self.words = set(words.words())
        self.remove_non_words = remove_non_words
    def __call__(self, doc):
        # tokenize words and punctuation
        word_list = wordpunct_tokenize(doc)
        # remove stopwords
        word_list = [word for word in word_list if word not in self.stopwords]
        # remove non words
        if(self.remove_non_words):
            word_list = [word for word in word_list if word in self.words]
        # remove 1-character words
        word_list = [word for word in word_list if len(word)>1]
        # remove non alpha
        word_list = [word for word in word_list if word.isalpha()]
        return [self.wnl.lemmatize(t) for t in word_list]



The LemmaTokenizer defined above will be applied further in this example. The next step is to define the Count Vectorization pipeline using this Tokenizer.

In [None]:
countvect = CountVectorizer(input='filename',tokenizer=LemmaTokenizer(remove_non_words=True))
bow = countvect.fit_transform(email_path)
feat2word = {v: k for k, v in countvect.vocabulary_.items()}

In [None]:
print("Number of documents:", len(email_path))
words = countvect.get_feature_names_out()
print("Number of words:", len(words))
print("Document - words matrix:", bow.shape)
print("First words:", words[0:100])

## 1.4. Using the bag of words (BOW) object to classify spam

Let's start by splitting the data into train and test sets, using 20% of the data for testing.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(bow,email_label,test_size=0.2)

In this simple example we will use a logistic regression classifier. Let's fit it to our training data.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

lr_classifier=LogisticRegression()
lr_classifier.fit(X_train,y_train)

y_predicted = lr_classifier.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_test,y_predicted))
print("Precision :",metrics.precision_score(y_test,y_predicted))
print("Recall :",metrics.recall_score(y_test,y_predicted))

In many cases, Bag of Words can provide sufficient information for classification. In this case, the accuracy reached by our classifier is pretty good.
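
Accuracy alone can look flattering when legitimate emails outnumber spam, which is why precision and recall are printed as well. The sketch below recomputes the three scores by hand on ten hypothetical labels (not results from this dataset) to show what each one measures.

```python
# Hypothetical labels for 10 test emails: True = spam, False = legitimate.
y_true = [True, True, True, False, False, False, False, False, False, False]
y_pred = [True, True, False, True, False, False, False, False, False, False]

pairs = list(zip(y_true, y_pred))
tp = sum(t and p for t, p in pairs)              # spam correctly flagged
fp = sum((not t) and p for t, p in pairs)        # legitimate email flagged as spam
fn = sum(t and (not p) for t, p in pairs)        # spam that was missed
tn = sum((not t) and (not p) for t, p in pairs)  # legitimate email left alone

accuracy = (tp + tn) / len(y_true)  # share of all emails classified correctly
precision = tp / (tp + fp)          # among flagged emails, share that is really spam
recall = tp / (tp + fn)             # among real spam, share that was flagged

print(tp, fp, fn, tn)  # 2 1 1 6
print(accuracy, round(precision, 3), round(recall, 3))  # 0.8 0.667 0.667
```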

## 1.5. Term frequency times inverse document frequency

After this first preprocessing, each document is summarized by a vector of size "number of words in the extracted dictionary". For example, the first email in the list has become:

In [None]:
mail_number = 0
text = open(email_path[mail_number]).read()
print("Original email:")
print(text)

emailBagOfWords = {feat2word[i]: bow[mail_number, i] for i in bow[mail_number, :].nonzero()[1]}
print("Bag of words representation (", len(emailBagOfWords), " words in dict):", sep='')
print(emailBagOfWords)
print("\nVector representation (", bow[mail_number, :].nonzero()[1].shape[0], " non-zero elements):", sep='')
print(bow[mail_number, :])

Counting words is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called `tf` for Term Frequencies.
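
A minimal sketch of this normalization on two hypothetical documents: the second one repeats the first five times, so its raw counts are five times larger while its term frequencies are identical.

```python
from collections import Counter

short_doc = ["cheap", "offer"]
long_doc = ["cheap", "offer"] * 5  # same content, five times longer

def term_frequencies(tokens):
    # Divide each word count by the total number of words in the document.
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

print(Counter(short_doc)["offer"], Counter(long_doc)["offer"])    # 1 5
print(term_frequencies(short_doc) == term_frequencies(long_doc))  # True
```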

Another refinement on top of `tf` is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called `tf–idf` for “Term Frequency times Inverse Document Frequency” and again, scikit-learn does the job for us with the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) class.
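
The sketch below reproduces the computation by hand on a toy count matrix (hypothetical data). It assumes the smoothed formula documented for the default settings of `TfidfTransformer`, namely `idf = ln((1 + n) / (1 + df)) + 1`, followed by an L2 normalization of each row. With these defaults the raw counts are multiplied by `idf`, and the row normalization plays the role of the length correction.

```python
import math

# Toy count matrix: rows are documents, columns follow the order of `words`.
words = ["offer", "cheap", "meeting"]
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]

n_docs = len(counts)
# Document frequency: number of documents that contain each word.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(words))]
# Smoothed inverse document frequency: a word present everywhere gets weight 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    # Weight the counts by idf, then rescale the row to unit Euclidean length.
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm > 0 else weighted

tfidf_rows = [tfidf_row(row) for row in counts]
for row in tfidf_rows:
    print([round(v, 3) for v in row])
```

"offer" appears in every document, so it receives the smallest weight, whereas "cheap" and "meeting" are up-weighted because they each occur in a single document.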

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer().fit_transform(bow)
tfidf.shape

Let's run the classification process again

In [None]:
X_train, X_test, y_train, y_test = train_test_split(tfidf,email_label,test_size=0.2)

#Fitting classifier
lr_classifier.fit(X_train,y_train)

#Testing classifier
y_predicted = lr_classifier.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_test,y_predicted))
print("Precision :",metrics.precision_score(y_test,y_predicted))
print("Recall :",metrics.recall_score(y_test,y_predicted))

In this simple case, the tf-idf reweighting is unnecessary and may even remove some information: there is likely a link between the abundance of words (long emails) and the fact that an email is spam.

# 2. Text classification in French

The previously used dataset is a widely used dataset for introductory text classification.

The field of Natural Language Understanding, and Natural Language Classification in particular, faces two challenges:
- Adapting the features and methodologies to various and more complex datasets
- Adapting the process to languages other than English

Concerning the latter, one has to take into account that most NLU research is currently performed on English. Datasets are rarely available for other languages, and the algorithms proposed for better NLU are often left untested on non-English data.
French, for instance, has less efficient lemmatization (French is a richly inflected language). In the following section, we will reuse the same methodologies on a French dataset.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

#load video games reviews
vgr = pd.read_csv("datasets/jvc.csv")
vgr.head()

In [None]:
#convert rating to numeric values, and plot the histogram of values
rating=vgr.website_rating.apply(lambda k: k[:-3])
vgr['rating']=pd.to_numeric(rating)
vgr.rating.plot.hist()

Most games seem to have a rating between 11 and 16. In this exercise, we will try to determine whether a game is very good (rating of 16 or above) or very bad (rating of 11 or below) based only on the summary of its review.

Let's start by splitting the dataset between good and bad games

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
bad=vgr[(vgr.rating<=11) & (vgr.platform=="PC")].copy()
bad['quality']="bad"
good=vgr[(vgr.rating>=16) & (vgr.platform=="PC")].copy()
good['quality']="good"
selected_games=pd.concat([good,bad]).dropna()

#Keep only the review summaries and the quality labels
game_reviews=selected_games['description']
game_quality=selected_games['quality']


Lemmatization in French is a tricky issue.

One example: the verb "finir" can appear as "finissons", "finirez", "finisse", "finit", etc.
Lemmatization is typically less efficient in French than in English.

An alternative is to use stemming instead. Stemming uses RegEx-like rules to truncate the end of a word that would normally correspond to conjugations, inflections, etc.
Stemming hurts the readability of the words by truncating their end, but runs faster than lemmatization.
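
To give a feel for rule-based truncation, here is a toy suffix-stripping stemmer. It is only an illustration with a handful of hand-picked endings, not the Snowball algorithm implemented by NLTK's `FrenchStemmer`.

```python
import re

# Toy rule: strip a few endings of verbs conjugated like "finir".
# Illustrative only; this is NOT the Snowball algorithm used by FrenchStemmer.
SUFFIXES = re.compile(r"(issons|issez|issent|irez|isse|ir|it)$")

def toy_stem(word):
    stem = SUFFIXES.sub("", word)
    # Keep the original word if stripping would leave fewer than 3 letters.
    return stem if len(stem) >= 3 else word

forms = ["finir", "finissons", "finirez", "finisse", "finit"]
stems = [toy_stem(w) for w in forms]
print(stems)  # ['fin', 'fin', 'fin', 'fin', 'fin']
```

All five forms collapse onto the same truncated stem "fin", which is not itself a form of the verb: this is the loss of readability mentioned above.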

In the next cell, we adapt the LemmaTokenizer that we defined earlier using a FrenchStemmer instead.

In [None]:
from nltk.stem.snowball import FrenchStemmer
from nltk import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.corpus import words
from string import punctuation
class FrenchStemTokenizer(object):
    def __init__(self, remove_non_words=True):
        self.st = FrenchStemmer()
        self.stopwords = set(stopwords.words('french'))
        self.words = set(words.words())
        self.remove_non_words = remove_non_words
    def __call__(self, doc):
        # tokenize words and punctuation
        word_list = wordpunct_tokenize(doc)
        # remove stopwords
        word_list = [word for word in word_list if word not in self.stopwords]
        # remove non words
        if(self.remove_non_words):
            word_list = [word for word in word_list if word in self.words]
        # remove 1-character words
        word_list = [word for word in word_list if len(word)>1]
        # remove non alpha
        word_list = [word for word in word_list if word.isalpha()]
        return [self.st.stem(t) for t in word_list]

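# nltk's `words` corpus is an English word list: filtering against it would discard most French tokens
countvect = CountVectorizer(tokenizer=FrenchStemTokenizer(remove_non_words=False))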
bow_games = countvect.fit_transform(game_reviews)
feat2word = {v: k for k, v in countvect.vocabulary_.items()}

### Classify with BOW

In [None]:
print("Number of documents:", len(game_reviews))
words = countvect.get_feature_names_out()
print("Number of words:", len(words))
print("Document - words matrix:", bow_games.shape)
print("First words:", words[0:100])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(bow_games,game_quality,test_size=0.2)

#Fitting classifier
lr_classifier.fit(X_train,y_train)

#Testing classifier
y_predicted = lr_classifier.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_test,y_predicted))
print("Precision :",metrics.precision_score(y_test,y_predicted,pos_label="good"))
print("Recall :",metrics.recall_score(y_test,y_predicted,pos_label="good"))

### Classify using tf-idf

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_games = TfidfTransformer().fit_transform(bow_games)
tfidf_games.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_games,game_quality,test_size=0.2)

#Fitting classifier
lr_classifier.fit(X_train,y_train)

#Testing classifier
y_predicted = lr_classifier.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_test,y_predicted))
print("Precision :",metrics.precision_score(y_test,y_predicted,pos_label="good"))
print("Recall :",metrics.recall_score(y_test,y_predicted,pos_label="good"))

## Word2Vec

In [None]:
from nltk.stem.snowball import FrenchStemmer
from nltk import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.corpus import words
from string import punctuation
import unidecode

class FrenchTokenizer(object):
    def __init__(self):
        self.stopwords = set(stopwords.words('french'))
        self.words = set(words.words())
    def __call__(self, doc):
        # tokenize words and punctuation
        word_list = wordpunct_tokenize(doc)
        # remove stopwords
        word_list = [word for word in word_list if word not in self.stopwords]
        # remove 1-character words
        word_list = [word for word in word_list if len(word)>1]
        # remove non alpha
        word_list = [word for word in word_list if word.isalpha()]
        return [unidecode.unidecode(t) for t in word_list]

tok=FrenchTokenizer()

text_for_word2vec=[tok(sent) for sent in game_reviews]

The operation above tokenizes all texts and strips accents from the tokens. Please note the following choices:
- unlike `FrenchStemTokenizer`, this tokenizer does not stem; it only removes accents with `unidecode`, so that accented and unaccented spellings of a word share a single token;
- we have removed stop words so that they do not contribute to the learned contexts (depending on the use case, you may want to keep them).

We can now train the Word2Vec model:

In [None]:
from gensim.models import Word2Vec

model=Word2Vec(text_for_word2vec,vector_size=200,window=5,min_count=1)
model.save("word2vec.model")
w2v=dict(zip(model.wv.index_to_key, model.wv.vectors))

Let's check word similarity in our trained data :

In [None]:
model.wv.most_similar(positive="jeu")

Let's now try again to classify our samples using these embeddings.

In [None]:
class MeanEmbeddingVectorizer(object):
    def __init__(self,word2vec,dim):
        self.word2vec=word2vec
        self.dim=dim

    def fit(self,X,y):
        return self

    def transform(self,X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec] or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
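
The averaging performed by `transform` can be sketched in pure Python on hypothetical 3-dimensional embeddings (the model above uses 200 dimensions). The sketch also shows why each document must be given as a list of tokens: a raw string is iterated character by character, so none of its "words" are found.

```python
# Hypothetical 3-dimensional embeddings, for illustration only.
toy_w2v = {
    "jeu": [1.0, 0.0, 2.0],
    "bon": [3.0, 2.0, 0.0],
}
dim = 3

def mean_embedding(tokens):
    # Keep only the tokens that have an embedding.
    vectors = [toy_w2v[t] for t in tokens if t in toy_w2v]
    if not vectors:
        # No known token: fall back to the zero vector.
        return [0.0] * dim
    # Average the vectors component by component.
    return [sum(component) / len(vectors) for component in zip(*vectors)]

print(mean_embedding(["bon", "jeu", "inconnu"]))  # [2.0, 1.0, 1.0]
print(mean_embedding(["inconnu"]))                # [0.0, 0.0, 0.0]
print(mean_embedding("jeu"))                      # [0.0, 0.0, 0.0] (iterates over 'j', 'e', 'u')
```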

In [None]:
from sklearn.pipeline import Pipeline
import numpy as np

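# MeanEmbeddingVectorizer expects lists of tokens, so we split the tokenized reviews rather than the raw strings
X_train, X_test, y_train, y_test = train_test_split(text_for_word2vec,game_quality,test_size=0.2)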

pipe=Pipeline([('vectorizer',MeanEmbeddingVectorizer(w2v,200)),('classifier',lr_classifier)])

pipe.fit(X_train,y_train)

In [None]:
predicted = pipe.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_test,predicted))
print("Precision :",metrics.precision_score(y_test,predicted,pos_label="good"))
print("Recall :",metrics.recall_score(y_test,predicted,pos_label="good"))

What we typically observe here is that the averaged Word2Vec embeddings perform worse than the BOW or TF-IDF features.

In our case, the training corpus for the embeddings was likely not large enough to ensure proper convergence and representation of the words.

It is also common that, for smaller corpora (fewer than roughly 10,000 documents), TF-IDF performs better for classification, whereas Word2Vec produces better results with larger corpora and across domains (e.g. training on data from Wikipedia, and then using the vectors on data from another field).
