**Day 8**: Natural Language Processing 📰 (***live in 1.49/1.50***)

<center><h1 style="color:maroon">Natural Language Processing</h1>
    <img src="https://drive.google.com/uc?id=16bKE6mY9y66IJazCHGXkp3oLkTMC_-Wt" style="width:1300px;">
    <h3><span style="color: #045F5F">Data Science & Machine Learning for Planet Earth Lecture Series</span></h3><h6><i> by Cédric M. John <span style="size:6pts">(2023)</span></i></h6></center>

## Plan for today's Lecture 🗓 

* Overview of text preprocessing
* Text Vectorizing: different approaches
* Text Embedding for NLP

## Intended learning outcomes 👩‍🎓

* Confidently preprocess text
* Know which text vectorizing / embedding approach to choose
* Transfer learning for NLP

# Text Preprocessing
<br>

<center><img src="https://drive.google.com/uc?id=16eNcTP_nqAT6mptg-Xj2u8nVq_D1Afc6" style="width:900px;"><br>
 © Cédric John, 2022; Image generated with <a href="https://openai.com/blog/dall-e/">DALL-E</a><br>
<br>Prompt: Snowflakes looking like pure glass and in the shape of typefont characters falling, digital art, 4k photo, warm golden lighting.</center>


<h1 >Natural Language Processing</h1>



<h3>Language model </h3><p>A language model is a model which attempts to predict the next word or character given an input list of words or characters.</p>
<p><img alt="Embedding" src="https://drive.google.com/uc?id=16gaqOOrYNaY-xIkMhZi3dMSwJh8sfjVM" width="800px"/></p>
(<a href="https://colab.research.google.com/github/dipanjanS/nlp_workshop_odsc19/blob/master/Module05%20-%20NLP%20Applications/Project08%20-%20Text%20Generation%20with%20Transformers.ipynb">Image from Google</a>).
<p>Such models run on your phone whenever the keyboard suggests the next word.</p>
<p>A simple language model is easy to implement, but getting meaningful output from it is much harder!</p>
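
To make the idea concrete, here is a minimal sketch of a count-based "bigram" language model: it predicts the next word as the one that most often followed the current word in a training corpus. The corpus and the helper name `predict_next` are made up for illustration; real language models are far more sophisticated.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical sentences, for illustration only)
corpus = [
    "the climate is changing",
    "the climate is warming",
    "the weather is changing",
]

# Count how often each word follows each previous word
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for previous, following in zip(words, words[1:]):
        next_word_counts[previous][following] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = next_word_counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(predict_next("the"))  # "climate" follows "the" twice, "weather" once
print(predict_next("is"))   # "changing" follows "is" twice, "warming" once
```

Even this tiny model shows why meaningful output is hard to obtain: it only looks one word back and knows nothing about words it has never seen.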



<h3>Text classification, such as sentiment analysis</h3><p>The unit being classified can be a word, a sentence, a paragraph, ...</p>
<p><img src="https://drive.google.com/uc?id=16JczPbBtae-f4kBW2F9A8AsaQdlYw0Xa" width="1000px"/></p>
<p>The typical setting is <strong>sentiment analysis</strong>: classifying a sentence as positive or negative (or by finer-grained emotions such as happiness, sadness, joy, anger, ...).</p>


# Reminder: Data Preprocessing

<img src="https://drive.google.com/uc?id=16CMa7VrgYDUjGlnrGR8t_D0yvJuGGkdd" style="width:1600px">

# Text Preprocessing

<img src="https://drive.google.com/uc?id=16Mwahg3hu2xW8ImaOax8ycGkwmy14283" style="width:1600px">


<h1>NLTK: the Natural Language Toolkit</h1>



<p>Natural Language Toolkit (NLTK) is a library that provides preprocessing and modelling tools for text data.</p>
<p><a href="https://www.nltk.org/">NLTK</a></p>



<h2>Installing NLTK</h2>
<p>To work in Colab, run the following in a notebook cell:</p>
<p><code>!pip install nltk</code></p>
<p><a href="https://www.nltk.org/install.html">Installation Documentation</a></p>


### Dataset

<span style="color:teal">**Today's data is:** </span><a href="https://www.twitter.com">Twitter (now "X") Climate Change Sentiment Analysis</a> (modified)
<img src="https://drive.google.com/uc?id=16mhEA5-Jdi3z5zvWYNky8lzs6ou8hPwo" style="width:1500px"/>

In [None]:
import pandas as pd

data = pd.read_csv('Lecture_data/twitter_sentiment_data_mod.csv')
data

In [None]:
data.isnull().sum()

In [None]:
data = data.dropna()

In [None]:
data.isnull().sum()

In [None]:
from sklearn.model_selection import train_test_split

X = data.message
y = data.sentiment

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=.7, random_state=42)

In [None]:
X_train


<h3>🖥 Lowercase</h3>



<p>For two words to be considered the same they need to have the same casing.</p>


In [None]:
text = X_train.iloc[1018]
text

In [None]:
text = text.lower() 
text


<h3>🖥 Numbers</h3>



<p>Depending on the task, numbers may need to be removed as part of preprocessing.</p>


In [None]:
text

In [None]:
def remove_numbers(txt):
    # Iterate character by character and drop every digit
    txt = ''.join(char for char in txt if not char.isdigit())
    return txt

text = remove_numbers(text)
text


<p>👍 Numbers are useful for date extraction</p>
<p>👎 Not useful for topic modelling (or sentiment analysis)</p>



<h3>🖥 Punctuation</h3>



<p>Like numbers, punctuation may need to be removed as part of preprocessing.</p>


In [None]:
text

In [None]:
import string 

string.punctuation

In [None]:
def remove_punctuation(txt):
    
    for punctuation in string.punctuation:
        txt = txt.replace(punctuation, '') 
    
    return txt
     
text = remove_punctuation(text)
text


<p>👍 Punctuation is useful for authorship attribution (e.g. the style of writing) and text generation</p>
<p>👎 Not useful for topic modelling (or sentiment analysis)</p>



<p>⚠️ Punctuation is used inconsistently in modern text forms (e.g. social media). It is best to remove it if the style is not consistent.</p>
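
As an alternative sketch, punctuation can also be removed in a single pass with `str.translate`. The function name `remove_punctuation_translate` and the sample tweet are hypothetical. Note that `string.punctuation` only covers ASCII characters, so typographic quotes or emoji are left untouched.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_translate(txt):
    """Delete all ASCII punctuation from `txt` in one pass."""
    return txt.translate(PUNCTUATION_TABLE)

cleaned = remove_punctuation_translate("rt @user: climate change is real!!! #science")
print(cleaned)
```

The result is the same as looping over `string.punctuation` with `replace`, but the string is only scanned once.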



<h3>Tokenizing</h3>



<p>Tokenizing means transforming a single string into a list of words, also called word tokens. For preprocessing tasks dealing with entire words, you will need to tokenize your text.</p>
<p><a href="https://www.nltk.org/api/nltk.tokenize.html">NLTK's <code>tokenize</code> module documentation</a></p>


In [None]:
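# word_tokenize needs NLTK's tokenizer models; if they are missing, run
# nltk.download('punkt') once (recent NLTK releases ask for 'punkt_tab' instead)
from nltk.tokenize import word_tokenize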

def tokenize(txt):
    word_tokens = word_tokenize(txt) 
    return word_tokens

text = tokenize(text)
text


<h3>🖥 Stopwords</h3>



<p>"Stopwords" are words that are used so frequently that, for many tasks, they carry little information. NLTK has a built-in corpus of English stopwords that can be loaded and used.</p>


In [None]:
text

In [None]:
from nltk.corpus import stopwords  # requires nltk.download('stopwords') on first use

def remove_stopwords(word_tokens):
    stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
    word_tokens = [w for w in word_tokens if w not in stop_words]
    return word_tokens

text = remove_stopwords(text)
text


<p>👍 Removing stopwords is useful for topic modelling, sentiment analysis</p>
<p>👎 Counterproductive for authorship attribution or text generation</p>



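<h3>🖥 Stemming and Lemmatizing </h3><p>Stemming &amp; Lemmatizing are techniques used to find the root of words, in order to group them by meaning rather than exact form. Stemming strips suffixes using fixed rules and can produce non-words, whereas lemmatizing maps each word to its dictionary form (lemma).</p>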
<p><img src="https://drive.google.com/uc?id=16Yrm2KajEJiTgXkCB50yjf0d2YYdsdv1" style="margin:auto" width="600"/></p>
<a href="https://www.madrasresearch.org/post/stemming-and-lemmatization">Prajey Mehta, 2021</a>


In [None]:
text

In [None]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

stemmed = [stemmer.stem(word) for word in text]

stemmed

In [None]:
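from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet') on first use

lemmatizer = WordNetLemmatizer()

# lemmatize() treats every word as a noun unless a `pos` argument is given
lemmatized = [lemmatizer.lemmatize(word) for word in text]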

lemmatized


<p>👍 Stemming/Lemmatizing is useful for topic modelling, sentiment analysis</p>
<p>👎 Hinders authorship attribution and text generation</p>


In [None]:
from nltk.stem import WordNetLemmatizer

def lemmatize(txt):
    lemmatizer = WordNetLemmatizer()
    txt = [lemmatizer.lemmatize(word) for word in txt]
    
    return txt

## Applying all of these transformations to our X_train

We can use functions to apply the transformation to the entire `X_train` (and `X_test`)

In [None]:
def transform_text(txt):
    txt = txt.lower()
    txt = remove_numbers(txt)
    txt = remove_punctuation(txt)
    txt = tokenize(txt)
    txt = remove_stopwords(txt)
    return lemmatize(txt)

transform_text(X_train.iloc[1018])

In [None]:
# If tokens are needed

def to_tokens(X_origin):
    # apply() returns a new Series, so the original is left untouched
    return X_origin.apply(transform_text)

In [None]:
# If strings are needed

def to_string_tokens(X_origin):
    # Join the cleaned tokens back into a single space-separated string
    return X_origin.apply(lambda txt: ' '.join(transform_text(txt)))

In [None]:
X_train_prep = to_string_tokens(X_train)

In [None]:
X_test_prep = to_string_tokens(X_test)

In [None]:
X_test_prep

# Vectorizing text for NLP
<br>

<center><img src="https://drive.google.com/uc?id=16HcUdbRs9ShtRwJ1h0znHt668zbF8WdK" style="width:900px;"><br>
 © Cédric John, 2022; Image generated with <a href="https://openai.com/blog/dall-e/">DALL-E</a>
<br>Prompt: A self-portrait of DALL-E writing text for their NLP class, digital art.</center>


<h1>Vectorizing</h1>



<p>In NLP, encoding words as numbers is referred to as <strong style="color:teal">vectorizing</strong>: you have already encountered vectorizing in the form of <code>One Hot Encoding</code>, for instance. Three standard vectorizing techniques used with classical (non-neural-network) machine learning approaches are:</p>
<ul>
<li>Bag of Words</li>
<li>Tf-Idf</li>
<li>N-grams</li>
</ul>


In [None]:
# Let's select 5 tweets to illustrate our lesson
text = X_train_prep.iloc[123:128]
text


<h2 >Bag of words representation</h2><p>The Bag-of-words representation consists of counting the occurrences of each word in a text. The count for each word becomes a feature, and a sentence becomes a vector of word counts.</p>
<p><img src="https://drive.google.com/uc?id=16CnVU5VyiX8mOvy6RnbMv2u9N8Xu1uxt" style="margin:auto" width="1000"/></p><br>
<a href="https://www.ronaldjamesgroup.com/blog/grab-your-wine-its-time-to-demystify-ml-and-nlp">RonaldJames consultants</a>
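
The counting step can be sketched in plain Python before reaching for a library. The two documents below are made up for illustration and assumed to be already preprocessed.

```python
from collections import Counter

# Two toy documents (hypothetical, already lowercased and stripped of punctuation)
docs = ["climate change is real", "change is coming change is here"]

# Vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, aligned with the vocabulary
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each column corresponds to one vocabulary word, so the vector length grows with the number of distinct words in the corpus.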



<h3>🖥 Sklearn's <code>CountVectorizer</code> </h3><p>A tool to automatically generate a Bag-of-Words representation of text.</p>


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(text)

X.toarray()

In [None]:

vectorizer.get_feature_names_out()


In [None]:

import pandas as pd

pd.DataFrame(X.toarray(),columns = vectorizer.get_feature_names_out())




<p>Limitations of Bag-of-words representation:</p>
<ul>
<li>Does not take into account document length</li>
<li>Does not capture context</li>
<li>Potentially large dimensionality of vector!</li>
</ul>



<h2 >Tf-Idf representation</h2>



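<p>Term Frequency - Inverse Document Frequency (Tf-Idf) scores each word by how often it appears in a given text, weighted down by how many documents in the collection contain it. Words that are frequent in one document but rare across the collection get the highest scores.</p>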
<p><img src="https://drive.google.com/uc?id=16C2TUNSLIEUpxJBPsH1pMaNAHCURC9ra" style="margin:auto" width="1000"/></p><br>
<a href="https://zcsheng95.github.io/2020/07/20/latent-models/">Jeff's Blog</a>
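
The weighting can be worked through by hand. The sketch below assumes the smoothed variant that scikit-learn documents as its default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each document vector. Other references use slightly different formulas, and the toy documents are invented.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, already preprocessed)
docs = [
    "climate change is real",
    "climate action now",
    "nice weather today",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency (assumed variant, see lead-in)
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(tokens):
    """Return {word: weight} for one document, L2-normalised."""
    tf = Counter(tokens)
    raw = {word: count * idf(word) for word, count in tf.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {word: weight / norm for word, weight in raw.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items()):
    print(f"{word:8s} {weight:.3f}")
```

In the first document, "climate" gets a lower weight than "change" because it also appears in a second document, which is exactly the "importance" effect described above.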



<h3 >🖥 Sklearn's <code>TfidfVectorizer</code></h3><p>A tool to automatically generate a Tf-idf representation of text.</p>


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vectorizer = TfidfVectorizer()

X = tf_idf_vectorizer.fit_transform(text)

In [None]:
pd.DataFrame(X.toarray(),columns = tf_idf_vectorizer.get_feature_names_out())


<p>👍 Advantages of Tf-Idf representation:</p>
<ul>
<li>Using frequency rather than count is robust to document length</li>
<li>Measure of importance</li>
</ul>



<p>👎 Disadvantages of Tf-Idf representation:</p>
<ul>
<li>Does not capture context</li>
<li>Dimensionality is also a potential problem</li>
</ul>



<h2 id="N-Gram-representation">N-Gram representation</h2><p>Instead of considering individual words, N-grams consider sequences of consecutive words. This representation captures local <strong>context</strong>. N is the number of words treated as a single unit.</p>
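
A minimal sketch of how N-grams are built from a list of tokens (the helper name `ngrams` and the example sentence are illustrative, not part of any library used here):

```python
def ngrams(tokens, n):
    """Return the list of n-word sequences found in `tokens`."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "climate change is not a hoax".split()

print(ngrams(tokens, 1))  # unigrams: single words
print(ngrams(tokens, 2))  # bigrams: pairs of consecutive words
```

The bigram "not a" keeps the negation attached to what follows, which single words cannot do. The price is a much larger vocabulary.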



<h3 id="🖥-ngram_range-parameter">🖥 <code>ngram_range</code> parameter</h3><p>A parameter of both vectorizers that specifies the range of sequence lengths to consider, as a <code>(min_n, max_n)</code> tuple.</p>


In [None]:

tf_idf_vectorizer = TfidfVectorizer(ngram_range = (2,2))

X = tf_idf_vectorizer.fit_transform(text)

X.toarray()

pd.DataFrame(X.toarray(),columns = tf_idf_vectorizer.get_feature_names_out())




<h2 >Key parameters of <code>CountVectorizer</code> and  <code>TfidfVectorizer</code> </h2><ul>
<li><code>max_df</code></li>
<li><code>min_df</code></li>
<li><code>max_features</code></li>
</ul>



<h3 >🖥 <code>max_df</code></h3><p>Used to exclude "corpus-specific stopwords", i.e. words that are very frequent in the dataset. The vectorizer ignores words whose document frequency is higher than the specified threshold: a float is read as a proportion of documents, an integer as an absolute number of documents.</p>


In [None]:
text

In [None]:
tf_idf_vectorizer = TfidfVectorizer(max_df = 0.5)

X = tf_idf_vectorizer.fit_transform(text)

X.toarray()

pd.DataFrame(X.toarray(),columns = tf_idf_vectorizer.get_feature_names_out())



<p>👉 Particularly useful to remove words that are so frequent they have little predictive power.</p>
<p>Example: When classifying texts about climate change, the words "climate" and "change" will appear often, but won't be useful to predict sentiment about climate change.</p>



<h3>🖥 <code>min_df</code></h3><p>Used to exclude words that are very infrequent in the dataset. The vectorizer ignores words whose document frequency is lower than the specified threshold: a float is read as a proportion of documents, an integer as an absolute number of documents.</p>


In [None]:
text

In [None]:

tf_idf_vectorizer = TfidfVectorizer(min_df = 0.5)

X = tf_idf_vectorizer.fit_transform(text)

X.toarray()

pd.DataFrame(X.toarray(),columns = tf_idf_vectorizer.get_feature_names_out())




<p>👉 Particularly useful to remove typos or text anomalies missed during preprocessing.</p>
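
The "frequency" used by `max_df` and `min_df` is a document frequency: the share of documents that contain the word. A plain-Python sketch, with made-up documents and thresholds, shows which words would survive both filters:

```python
from collections import Counter

# Toy corpus (hypothetical); "polcy" is a deliberate typo
docs = [
    "climate change real",
    "climate change hoax",
    "climate action now",
    "climate polcy debate",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Fraction of documents containing each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
doc_fraction = {word: count / n_docs for word, count in doc_freq.items()}

max_df, min_df = 0.9, 0.5  # hypothetical thresholds, expressed as proportions
kept = sorted(w for w, frac in doc_fraction.items() if min_df <= frac <= max_df)

print(doc_fraction)
print(kept)
```

Here "climate" appears in every document and is dropped by `max_df`, while the typo and the other rare words are dropped by `min_df`.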



<h3>🖥 <code>max_features</code> </h3><p>Used to specify the maximum number of features to keep when vectorizing. It retains the top features ordered by term frequency across the corpus.</p>


In [None]:

tf_idf_vectorizer = TfidfVectorizer(max_features = 2)

X = tf_idf_vectorizer.fit_transform(text)

X.toarray()

pd.DataFrame(X.toarray(),columns = tf_idf_vectorizer.get_feature_names_out())




<p>👉 Particularly useful to reduce the dimension of the data.</p>



<h1 id="4.-(Multinomial)-Naive-Bayes-Algorithm">(Multinomial) Naive Bayes Algorithm</h1>



<p>A classification algorithm based on Bayes' Theorem of probability.</p>
<p><img src="https://drive.google.com/uc?id=167U6c9JP-zAMX9HyXACuKLWWqYdl0y8U" style="margin:auto" width="600"/></p>
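
Applied to text, Bayes' theorem can be written as follows, where $c$ is a class (e.g. normal or spam) and $w_1, \dots, w_n$ are the words of a message. The notation is chosen here for illustration. The last step relies on the assumption that words are independent of each other given the class, and drops the denominator because it is the same for every class.

```latex
P(c \mid w_1, \dots, w_n)
  = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The predicted class is the one with the largest value of this product.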



<h2 id="Example">Example</h2><p>We want to classify mail as normal or spam according to its content.</p>



<ol>
<li>Count occurrences of all words in normal mail.</li>
</ol>
<p><img align="left" src="https://drive.google.com/uc?id=16VpY64n7jVr9bfi0ceWI-5B2p4IC_i79" style="margin:auto" width="400"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>



<ol start="2">
<li>Calculate the probability of each word being in the mail if it is normal</li>
</ol>
<p><img align="left" src="https://drive.google.com/uc?id=16nW3HNLnwyFSj1SHAXFraQaFF0TDyEAA" style="margin:auto" width="800"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>




<p><img align="left" src="https://drive.google.com/uc?id=164oie98o07Rxk4pkmLOQwpvK_W9LTv77" style="margin:auto" width="800"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>




<ol start="3">
<li>Do the same with the spam mail!</li>
</ol>
<p><img align="left" src="https://drive.google.com/uc?id=16ZaYMJkNvKUQAsD5v4IjiR2OQDorwmqf" style="margin:auto" width="500"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>





<p><img align="left" src="https://drive.google.com/uc?id=16_Su7yMGZwZXm9vkZeryX_ymf2WZuMK_" style="margin:auto" width="800"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>




<p><img align="left" src="https://drive.google.com/uc?id=16S8fpinkbFfY2e8TNYrwaqf8B22l9Cu_" style="margin:auto" width="350"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>




<p>Would a mail containing "Dear friend" be normal or spam?</p>
<p><img align="left" src="https://drive.google.com/uc?id=16PlcmtHebz1sI5Eny4NmWKv-1wBuvK4F" style="margin:auto" width="500"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>




<ol start="4">
<li>Calculate prior probability of mail being normal</li>
</ol>
<p><img align="left" src="https://drive.google.com/uc?id=16BrKEB5hfPKeXGmMsQnPVtQbUFBkJThl" style="margin:auto" width="600"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>




<ol start="5">
<li>Calculate probability of the mail being normal</li>
</ol>
<p><img align="left" src="https://drive.google.com/uc?id=16471ESfFKsvBFts7fp6OpT2bCKqpt9EK" style="margin:auto" width="700"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>




<p><img align="left" src="https://drive.google.com/uc?id=16LCSVaAV_E5s1dgH-3x7WZjV-6XlIl8N" style="margin:auto" width="800"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>




<ol start="6">
<li>Calculate probability of the mail being spam</li>
</ol>
<p><img align="left" src="https://drive.google.com/uc?id=16BWN6-VwKshedJ0fUSjYJ-30Uaft26qE" style="margin:auto" width="800"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>




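<p>In the case of zero counts, we have a problem: a word never seen in one class gets probability zero, which drives the whole product for that class to zero.</p>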
<p><img align="left" src="https://drive.google.com/uc?id=16JcCFsvxudyTQxGHG3BHjwwgFAnVZDvN" style="margin:auto" width="800"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>




<h2 id="Smoothing">Smoothing</h2><p>Smoothing consists of adding a small pseudo-count (typically +1) to the count of every feature (word) to avoid zero counts. The smoothing parameter is most often called alpha.</p>
<p><img align="left" src="https://drive.google.com/uc?id=16j0CWOvuJNXD9BESmpe78NEloKNWTDig" style="margin:auto" width="500"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>



<p><img align="left" src="https://drive.google.com/uc?id=16VfhF1ye3Kl4ekF95lh_PmUXkSpamrlb" style="margin:auto" width="800"/></p><br>
<a href="https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA">From StatQuest video</a>
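
The whole procedure, including smoothing, fits in a few lines of plain Python. The word counts and message counts below are hypothetical numbers chosen for illustration, and the helper names `score` and `classify` are not from any library. The sketch also assumes every word of the message is in the vocabulary.

```python
# Hypothetical word counts per class (illustrative numbers only)
word_counts = {
    "normal": {"dear": 8, "friend": 5, "lunch": 3, "money": 1},
    "spam":   {"dear": 2, "friend": 1, "lunch": 0, "money": 4},
}
# Hypothetical number of training messages per class
n_messages = {"normal": 8, "spam": 4}

def score(words, label, alpha):
    """Prior times the product of smoothed word likelihoods for one class."""
    counts = word_counts[label]
    total = sum(counts.values()) + alpha * len(counts)
    result = n_messages[label] / sum(n_messages.values())  # prior probability
    for word in words:
        result *= (counts[word] + alpha) / total
    return result

def classify(words, alpha):
    """Return the class with the highest (unnormalised) score."""
    return max(word_counts, key=lambda label: score(words, label, alpha))

message = ["lunch", "money", "money", "money", "money"]
for alpha in (0, 1):
    scores = {label: score(message, label, alpha) for label in word_counts}
    print(alpha, scores, classify(message, alpha))
```

Without smoothing (`alpha=0`), the single zero count for "lunch" forces the spam score to zero, so the message is labelled normal despite four occurrences of "money". With `alpha=1` the spam score becomes the larger one.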




<p>"Naive" refers to the assumption that each feature (word) contributes independently of the others given the class: words are treated individually and their order isn't considered.</p>



<p>👍 Advantages of Naive Bayes algorithm:</p>
<ul>
<li>Easy to implement</li>
<li>Outputs probabilities</li>
<li>Not an iterative learning process. Fast!</li>
<li>Works particularly well on text data because it handles high-dimensional features</li>
</ul>



<p>👎 Disadvantage:</p>
<ul>
<li>Assumes feature independence, which is rarely the case in real-life datasets</li>
</ul>



<h1 id="5.-Modelling-Implementation">Modelling Implementation</h1>


In [None]:
data.sentiment.value_counts()

In [None]:
tf_idf_vectorizer = TfidfVectorizer(max_features = 15)

# Vectorize the preprocessed text: fit on the training set only, then reuse on the test set
X_train_vec = tf_idf_vectorizer.fit_transform(X_train_prep)
X_test_vec = tf_idf_vectorizer.transform(X_test_prep)

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()

nb_model.fit(X_train_vec, y_train)

# Mean accuracy on the held-out test set
nb_model.score(X_test_vec, y_test)




<p><a href="https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html">MultinomialNB</a></p>



<h3 id="Tuning-vectorizer-and-model-simultanously">Tuning vectorizer and model simultaneously</h3><p>Different vectorizing hyperparameters will affect model performance. As such, it is important to tune the hyperparameters of both the vectorizer and the model simultaneously. This can be done by wrapping both in a <code>Pipeline</code> and searching over its parameters.</p>


In [None]:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Create Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

# Set parameters to search (keys follow the '<step name>__<parameter>' convention)
parameters = {
    'tfidf__ngram_range': ((1,1), (2,2)),
    'nb__alpha': (0.1,1),
}

# Perform grid search
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, 
                           verbose=1, scoring = "accuracy", 
                           refit=True, cv=5)

grid_search.fit(X_train,y_train)



In [None]:

grid_search.best_params_



In [None]:

grid_search.best_score_
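
Once the search has finished, the refitted pipeline can be evaluated on unseen text. The sketch below is self-contained and uses a tiny made-up dataset and the hypothetical name `toy_search`, so its scores mean nothing in themselves. The same two calls (`score` and `predict`) would apply to a search fitted on the real training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up dataset, only to keep the sketch self-contained
train_texts = [
    "climate action is great news", "love this climate action plan",
    "great progress on clean energy", "love clean energy progress",
    "climate change is a hoax", "this hoax is terrible news",
    "terrible policy and fake science", "fake news about climate change",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great clean energy news", "fake climate hoax"]
test_labels = [1, 0]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
parameters = {"tfidf__ngram_range": [(1, 1), (2, 2)], "nb__alpha": [0.1, 1]}

toy_search = GridSearchCV(pipeline, parameters, scoring="accuracy", cv=2, refit=True)
toy_search.fit(train_texts, train_labels)

# With refit=True the best pipeline is retrained on all training data,
# so the search object can score or predict on raw, unseen text directly.
test_accuracy = toy_search.score(test_texts, test_labels)
predictions = toy_search.predict(test_texts)
print(toy_search.best_params_, test_accuracy, predictions)
```

Because the vectorizer lives inside the pipeline, it is refitted on the training part of each cross-validation fold, which avoids leaking vocabulary statistics from the validation part.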



# Suggested Resources

## 📚 Further Reading 
* 📼 <a href="https://www.nltk.org/book/">A full online book on how to process text</a>, by Steven Bird, Ewan Klein, and Edward Loper (NLTK)
* 📼 <a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and language processing</a>, by Dan Jurafsky and James H. Martin
* 📼 <a href="https://www.youtube.com/watch?v=CMrHM8a3hqw">A simple video on NLP</a>, by Simplilearn




## 💻🐍 Time to Code ! 