# Intro to Natural Language Processing 

> "You shall know a word by the company it keeps." ~ John R. Firth

In [233]:
%%html

<iframe width="768" height="432" src="https://miro.com/app/live-embed/uXjVOlC3sTw=/?moveToViewport=-1354,-1121,2108,1681&embedId=334819522676" frameborder="0" scrolling="no" allowfullscreen></iframe>

## Goal

The goal of this short demo is to walk through preparing and transforming text data in order to build a similarity-based recommender system.

## Table of Contents

1. Libraries
2. The Data
3. Flash NLP Intro
4. Preparation
5. Recommendation System
6. Summary

## 1. Libraries

Install the following libraries if they are not already available. You can check with `!pip list` or with `!conda list` in a new cell.

In [None]:
# !pip install -U spacy panel

In [None]:
import json, re, spacy
from random import choice
import pandas as pd, numpy as np
from pprint import pprint
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
import panel as pn
from concurrent.futures import ProcessPoolExecutor

pn.extension()

%load_ext autoreload
%autoreload 2

## 2. The Data

We have been given a random corpus of news articles, plus some additional information (outlined below), and we want to make a useful product with it.

| Column | Content |
|--------|---------|
|title |Title of article|
|text | Text inside article|
|domain | Domain Url of article|
|date | YYYY-MM-DD Time|
|description | Abstract of article|
|url | Url of article|
|image_url | Image if available|

Here is the full description of the dataset from Hugging Face.

> "CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset has been prepared using news-please - an integrated web crawler and information extractor for news.
It contains 708241 English language news articles published between Jan 2017 and December 2019. It represents a small portion of the English language subset of the CC-News dataset." ~ [Hugging Face cc_news](https://huggingface.co/datasets/cc_news)

Before we do any data cleaning, let's read in the data and explore it a bit.

In [None]:
df = pd.read_parquet("cc_news_sample.parquet")

Let's see how many articles we have and then examine the columns.

In [None]:
df.shape

In [None]:
df.head()

## 3. Flash NLP Intro

Let's pick a random article using `.iloc[row, column]` on our dataframe and examine it.

In [None]:
random_article = df.iloc[choice(range(len(df))), 1]
pprint(random_article)

Notice how the article above looks a bit odd and has a few characters that will not be useful for our analysis. Let's examine a cleaner version of the article by running it through `spaCy`'s tokenizer.

When we tokenize a document, we separate its content into its individual components, i.e. words, numbers, punctuation marks and the like, to make it easier to process, clean, transform, and run computations on.
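To make the idea concrete, here is a minimal sketch of tokenization using only Python's `re` module. It is a deliberately simplified stand-in and not how `spaCy` tokenizes text; the `simple_tokenize` helper and the sample sentence are made up for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word, number, and punctuation tokens (toy example)."""
    # \w+ grabs runs of letters/digits; [^\w\s] grabs single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

# Made-up sentence for illustration
sample = "In 2019, the U.K. economy grew by 1.4%!"
tokens = simple_tokenize(sample)
print(tokens)
```

Note how this naive approach breaks `U.K.` and `1.4` into several pieces. Handling cases like these is one reason to prefer a dedicated tokenizer over a single regular expression.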

For this part, we will load an English model, instantiate it and pass an example article through it. You may need to run the cell below first to download the English model.

In [None]:
# !python -m spacy download en_core_web_md

In [None]:
nlp = spacy.load("en_core_web_md")

In [None]:
parsed_article = nlp(random_article)

In [None]:
parsed_article

Notice how much nicer our article looks now.

We can also grab the sentences and view them one by one using the `.sents` attribute and the built-in Python function `next()`, since `.sents` on a document processed by `spaCy` returns an iterator. Alternatively, we can loop over it and show each of the sentences in an article.

In [None]:
next(enumerate(parsed_article.sents))

In [None]:
for num, sentence in enumerate(parsed_article.sents):
    print(f"Sentence #{num}:\n {sentence}\n")

We can also have a look at the different kinds of entities in an article. These entities can be a person (called PERSON), a number (called CARDINAL), a geopolitical entity (called GPE), etc.

In [None]:
for num, entity in enumerate(parsed_article.ents):
    print(f"Entity #{num}: {entity} -- {entity.label_}\n")

In [None]:
spacy.explain("LOC")

We can also check whether a token is a stop word or a punctuation mark, and we can even lemmatize our articles. Lemmatization reduces a word to its root form so that related words share a common denominator. For example, `was` will become `be` and most plural words will become singular.
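As a rough illustration of both ideas, the sketch below uses a tiny hand-written stop word set and lemma lookup table. Both `TOY_STOP_WORDS` and `TOY_LEMMAS` are hypothetical and far smaller than the resources `spaCy` relies on, so treat this as a toy and not as a description of `spaCy`'s internals.

```python
# Hypothetical, hand-written resources for illustration only
TOY_STOP_WORDS = {"the", "a", "and", "in", "on"}
TOY_LEMMAS = {"was": "be", "were": "be", "cats": "cat", "mice": "mouse", "running": "run"}

def toy_lemmatize(word):
    """Return the lemma from the lookup table, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)

words = ["the", "cats", "were", "running", "in", "circles"]

# One (original, lemma, is_stop_word) tuple per word
rows = [(w, toy_lemmatize(w), w in TOY_STOP_WORDS) for w in words]
for original, lemma, is_stop in rows:
    print(f"{original:10} {lemma:10} stop word: {is_stop}")
```

Words missing from the lookup table, such as `circles`, pass through unchanged, which shows why a pure lookup approach needs a large table or extra rules.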

In [None]:
new_list = []

for token in parsed_article:
    new_list.append(token.text)
    
    
new_list[:10]

In [None]:
new_list = [token.text for token in parsed_article]

new_list[:10]

In [None]:
# here we are taking out of the parsed article each token
token_text = [token.text for token in parsed_article]

# here we are lemmatizing each word possible
token_lemmas = [token.lemma_ for token in parsed_article]

# stop words are very common, so here we record whether
# each token is a stop word or not
token_stop = [token.is_stop for token in parsed_article]

# and here we record whether each token is punctuation or not
token_punc = [token.is_punct for token in parsed_article]

# we will now add all four lists to a dataframe and display it without assigning it to a variable
pd.DataFrame(
    zip(token_text, token_lemmas, token_punc, token_stop),
    columns=['Original Text', 'Lemmatized Text', 'Punctuations', 'stopwords']
).head(50)

## 4. Preparation

Let's start by checking whether our dataset contains any missing values, and then evaluate how much of our machine's memory we are currently using.

In [None]:
df.isna().sum()

In [None]:
df.info(memory_usage='deep')

Depending on the random sample you chose at the beginning, your dataframe may or may not be using a lot of memory. If it is, getting rid of the columns you don't need will help release some of the memory in your machine.

In [None]:
df.drop(['url', 'image_url', 'domain'], axis=1, inplace=True)

Perfect! Let's now extract the `text` column and normalize it. This means we will use a regular expression and `spaCy` to,
- take out anything that is not a word or a number,
- convert to lower case,
- strip the spaces around the words,
- tokenize the articles,
- remove stopwords (we will use spaCy's list of stopwords for this),
- and then join the cleaned tokens back together.

In [None]:
articles = df['text'].values

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS
len(STOP_WORDS), STOP_WORDS

In [None]:
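def normalize_doc(doc):
    """
    This function normalizes a single document by keeping only
    words, numbers, and the spaces in between them. It then filters
    out stop words and lemmatizes the remaining tokens.
    """
    # flags must be passed by keyword: the fourth positional argument of re.sub is `count`
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    tokens = nlp(doc)
    filtered_tokens = [token.lemma_ for token in tokens if not token.is_stop]
    # replace newline tokens with a single space so neighbouring words are not glued together
    doc = ' '.join(filtered_tokens).replace(" \n ", " ")
    return doc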

In [None]:
random_article

In [None]:
normalize_doc(random_article)

Since we have quite a few articles, this operation can take quite some time unless we do the cleaning process concurrently or in parallel. We will do this using the `ProcessPoolExecutor()` from the `concurrent.futures` module.

In [None]:
# %%time

# with ProcessPoolExecutor() as e:
#     processed_articles = list(e.map(normalize_doc, articles))

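Since the cell above is commented out, we load a previously saved copy of the processed articles instead. We will then add the cleaned versions of the documents back into the dataframe and compute the length (in characters) of each article, both raw and cleaned.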

In [None]:
processed_articles = pd.read_csv("processed_articles.csv.gz")
processed_articles.head()

In [None]:
df['clean_text'] = processed_articles.values
df['len_clean_text'] = df['clean_text'].apply(len)
df['len_dirty_text'] = df['text'].apply(len)

Let's now keep only the columns we need for the rest of the analysis. This also releases a bit of memory by dropping data we have carried along since the beginning of the notebook.

In [None]:
df.head(2)

In [None]:
df = df[['title', 'date', 'clean_text', 'len_clean_text', 'len_dirty_text']].reset_index(drop=True)

It wouldn't make sense to feed our algorithms articles with only a handful of characters, so let's examine the distribution of character counts in both the raw and the clean versions of our articles.

In [None]:
df[['len_clean_text', 'len_dirty_text']].describe().T

In [None]:
df[['len_clean_text', 'len_dirty_text']].skew()

![img](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn.analyticsvidhya.com%2Fwp-content%2Fuploads%2F2020%2F06%2Fsk1.png&f=1&nofb=1)

Now that we know we have a skewed distribution of characters, let's deal with it by setting up a rule. We'll use the maximum length of a tweet, 280 characters at the time of writing, as our threshold and filter out all articles that are not longer than that. Let's first check how many articles would remain.

In [None]:
greater_than_a_tweet = df['len_clean_text'] > 280
print(f"Before: {df.shape[0]} --- After: {greater_than_a_tweet.sum()}")

In [None]:
df = df[greater_than_a_tweet].copy()
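For intuition about the skew and the filter above, here is a small sketch on made-up character counts using only the standard library. The `population_skewness` helper is written for this example; `pandas`' `.skew()` uses a bias-adjusted estimator, so its values will differ somewhat from this simpler formula.

```python
from statistics import mean, median

# Made-up character counts: many short texts and a few very long ones
lengths = [120, 150, 200, 250, 300, 320, 400, 900, 5000]

def population_skewness(values):
    """Third central moment divided by the cubed standard deviation."""
    m = mean(values)
    n = len(values)
    variance = sum((v - m) ** 2 for v in values) / n
    third_moment = sum((v - m) ** 3 for v in values) / n
    return third_moment / variance ** 1.5

skew = population_skewness(lengths)
print(f"mean={mean(lengths):.0f}, median={median(lengths)}, skew={skew:.2f}")

# Keep only texts longer than a tweet (280 characters)
kept = [length for length in lengths if length > 280]
print(f"Before: {len(lengths)} --- After: {len(kept)}")
```

A positive skew goes hand in hand with a mean that sits well above the median, because a few very long texts pull the mean upwards.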

## 5. Recommendation System

Recommendation systems can come in many different forms and sizes. We can create a system that takes into account the behaviour of other users, or a system that only looks at similar articles or items to make a recommendation. Both are powerful systems and could cover an entire section of a book in their own right, which is why we will focus on the latter category, the one that makes recommendations based on similar articles.

To create our recommendation system we first need to convert our articles into a numerical representation. We do this with a so-called bag of words (BoW). A BoW is a matrix with the documents along the rows and the terms contained in all documents along the columns. Each value is the frequency with which a term appears in a document, found at the corresponding document-term combination. To create this kind of representation we can use `sklearn`'s `CountVectorizer` or `TfidfVectorizer` classes. The latter is a weighted version of the former: roughly, the frequency of a word in a document is scaled down according to the number of documents in which it appears.
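The sketch below builds both representations by hand on an invented three-document corpus. The tf-idf part is a simplified version (count times `log(n_docs / doc_freq)`); `TfidfVectorizer` uses a smoothed formula and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for this example
docs = ["the cat sat", "the dog sat", "the cat chased the dog"]

# The vocabulary is every distinct term, sorted so that columns are stable
vocabulary = sorted({term for doc in docs for term in doc.split()})

# Bag of words: one row per document, one column per term, values are raw counts
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts[term] for term in vocabulary])

# Simplified tf-idf: scale each count by log(n_docs / number of docs containing the term)
n_docs = len(docs)
doc_freq = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocabulary))]
idf = [math.log(n_docs / df) for df in doc_freq]
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in bow]

print(vocabulary)
for counts_row, weighted_row in zip(bow, tfidf):
    print(counts_row, [round(value, 2) for value in weighted_row])
```

In this toy corpus `the` appears in every document, so its simplified tf-idf weight is zero, while the rarer `chased` receives the largest weight.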

To use these classes we first instantiate them, fit them to the data so that they can learn the vocabulary of our corpus, and then transform the corpus into a sparse matrix. Sparse matrices store only the location and value of the non-zero entries, which makes it cheaper to store the data and compute on it.
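As a minimal sketch of the fit/transform pattern and of sparse storage, the hypothetical `ToyCountVectorizer` below learns a vocabulary and then keeps only the non-zero counts in a dictionary keyed by `(row, column)`. This is an illustration only: scikit-learn returns a SciPy sparse matrix, not a dictionary.

```python
class ToyCountVectorizer:
    """Hypothetical, minimal stand-in for the fit/transform pattern."""

    def fit(self, docs):
        # Learn the vocabulary: map each distinct term to a column index
        terms = sorted({term for doc in docs for term in doc.split()})
        self.vocabulary_ = {term: col for col, term in enumerate(terms)}
        return self

    def transform(self, docs):
        # Store only non-zero counts, keyed by (row, column)
        sparse = {}
        for row, doc in enumerate(docs):
            for term in doc.split():
                col = self.vocabulary_.get(term)
                if col is not None:  # ignore terms unseen during fit
                    sparse[(row, col)] = sparse.get((row, col), 0) + 1
        return sparse

docs = ["the cat sat", "the dog sat"]
vectorizer = ToyCountVectorizer().fit(docs)

# "bird" was not seen during fit, so it is dropped at transform time
sparse_counts = vectorizer.transform(docs + ["the bird sat"])

print(vectorizer.vocabulary_)
print(sparse_counts)
```

Here only 8 of the 12 cells in the 3x4 matrix are stored. With thousands of articles and terms, the share of zero cells is typically much larger, which is where sparse storage pays off.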

In [None]:
%%time

# we first instantiate our class
tf = TfidfVectorizer(min_df=0.035, max_df=0.80)

# we can fit and transform the data in the same step
tfidf_matrix = tf.fit_transform(df['clean_text'].values)

# evaluate the shape of our matrix
tfidf_matrix.shape

We can access our vocabulary with the `.get_feature_names_out()` method (called `.get_feature_names()` in scikit-learn versions before 1.0).

In [None]:
tf.get_feature_names_out()[500:550]

The next step is to measure how similar each pair of documents is, based only on the words they contain. The `cosine_similarity` function we imported earlier can do this for us, and afterwards we can create a dataframe to evaluate our results.

**Note:** this operation can take a few minutes if you are using the entire dataset. Make sure to grab some ☕️ 😎

In [None]:
%%time

doc_sim = cosine_similarity(tfidf_matrix)

In [None]:
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

In [None]:
doc_sim.shape

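The reason we see an X000xX000 matrix is that every document is compared against every other document, so there is one row and one column per article. Both halves on either side of the diagonal are identical, and the diagonal itself holds the similarity of each document with itself.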
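To see where these properties come from, here is a from-scratch sketch of cosine similarity on three made-up vectors. The `cosine` helper is written for this example and assumes no vector is all zeros; in the notebook the computation is done by `sklearn`'s `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up term-weight vectors for three documents
vectors = [[1, 0, 2], [2, 0, 4], [0, 3, 0]]

# Compare every document with every other document
sim = [[cosine(u, v) for v in vectors] for u in vectors]
for row in sim:
    print([round(value, 2) for value in row])
```

The first two vectors point in the same direction, so their similarity is 1 even though their magnitudes differ. The third shares no terms with the others, so its similarity to them is 0.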

In [None]:
articles_list = df['title'].values
articles_list.shape, articles_list

Let's now
1. pick a title at random,
2. get the index of that title,
3. select the corresponding row in our document similarity dataframe,
4. sort the indices of that row by similarity, from highest to lowest,
5. return the titles of the top 5 articles.
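The steps above can be sketched end to end on a tiny, made-up example using plain Python lists. The titles, the similarity values, and the `recommend` helper are all invented for illustration.

```python
# Made-up titles and a made-up, symmetric similarity matrix
titles = ["Markets rally", "Stocks climb", "Local team wins", "Rain expected"]
similarity = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.0, 0.3],
    [0.1, 0.0, 1.0, 0.4],
    [0.2, 0.3, 0.4, 1.0],
]

def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = titles.index(title)   # index of the chosen title
    scores = similarity[idx]    # its row of similarities
    # indices sorted by similarity, from highest to lowest
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [i for i in ranked if i != idx]  # drop the article itself
    return [titles[i] for i in ranked[:top_n]]

print(recommend("Markets rally"))
```

Dropping the chosen article by index, instead of slicing off the first position, is a small design choice that stays correct even when another article ties with it at a similarity of 1.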

In [None]:
a_title = choice(articles_list)
a_title

In [None]:
article_idx = np.where(articles_list == a_title)[0][0]
article_idx

In [None]:
article_similarities = doc_sim_df.iloc[article_idx].values
article_similarities

In [None]:
# we skip the first index because an article's similarity with itself is always 1,
# and then keep the next five, i.e. the top 5 most similar articles
similar_articles_idxs = np.argsort(-article_similarities)[1:6]
similar_articles_idxs

In [None]:
df.head()

In [None]:
doc1 = nlp(df.iloc[1, 2])
doc2 = nlp(df.iloc[2, 2])

In [None]:
doc1.similarity(doc2)

In [None]:
a_title

In [None]:
similar_articles = articles_list[similar_articles_idxs]
pprint(similar_articles.tolist())

Lastly, we will create a mini-dashboard containing,
1. a widget with all of our titles,
2. a function with the steps we followed above,
3. a panel object to store a title, the widget, and the function.

In [None]:
titles = df.title.unique().tolist()
title_widget = pn.widgets.Select(value=choice(titles), options=titles, name='Articles')

In [None]:
@pn.depends(title_widget.param.value)
def article_recommender(title_widget):
    
    article_idx = np.where(articles_list == title_widget)[0][0]
    article_similarities = doc_sim_df.iloc[article_idx].values
    similar_title_idxs = np.argsort(-article_similarities)[1:6]
    similar_titles = articles_list[similar_title_idxs]
    
    return pn.Column(*similar_titles, width=600)

In [None]:
text = pn.pane.Markdown(f"# Small Recommendation Engine", style={"color": "#000000"}, width=600, height=50,
                        sizing_mode="stretch_width", margin=(10,10,10,5))

In [None]:
pn.Column(text, title_widget, article_recommender, align='center', width=600, height=300)

## 6. Summary

Blind Spots

With additional time we could have,
1. further tweaked the parameters of the vectorizers and models;
2. created visualizations of the document similarity to find more interesting patterns;
3. taken the title of an article out of its body to create a better, less biased representation of the words within a document;
4. tried PyTorch's `nn.CosineSimilarity`, which could make the similarity computation of our recommendation system more efficient;
5. added a lemmatization step to the preprocessing stage.

Takeaways,
1. Recommendation systems are an example of unsupervised machine learning;
2. Recommendation systems can be created with or without users' behavioural data;
3. Creating bags of words requires careful attention to the parameters;
4. Where possible, showcase a model or system in a mini-dashboard or data visualization.