# Building NLP Products Tutorial

> "You shall know a word by the company it keeps." ~ John R. Firth

![img](https://cdn.shopify.com/s/files/1/0867/3580/products/vinyl_decal_hello_words_cloud_ig4779_1800x1800.jpg?v=1571439560)

## Learning Outcomes

By the end of this tutorial you will
1. Have a better understanding of natural language processing and some of its applications.
2. Be able to create recommendation systems based on text similarity.
3. Be able to conduct topic modeling on your own corpus.
4. Understand how to put together a simple app using panel.

## Table of Contents

1. Overview
2. The Data
3. Flash NLP Intro
4. Cleaning
5. Recommendation System
6. Topic Modeling
7. Summary

## 1. Overview

We have been given a random corpus of articles taken from Wikipedia, and our task is to come up with two products: a recommendation system and a set of topics that best explains the corpus. This will help you, and anyone else who picks up this notebook, understand the Wikipedia corpus better.

In [None]:
import json, nltk, re, spacy, umap
import pandas as pd, numpy as np
from pprint import pprint
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
import panel as pn
from concurrent.futures import ThreadPoolExecutor

pn.extension()

%load_ext autoreload
%autoreload 2

It is possible that you will need the following packages in order to move forward. Please copy the two lines below, paste them in a new cell and run it.

```python
nltk.download('wordnet')
nltk.download('punkt')
```

## 2. The Data

The data consist of Wikipedia articles plus some additional columns inside a JSON file. Here is the schema.

| Column | Content |
|--------|---------|
|title |Title of article|
|url | Url of article|
|abstract | Abstract of article|
|body_text | Text inside article|
|body_html | Article inside HTML|

Before we do any data cleaning, let's read in the data and explore it a bit.

In [None]:
%%time

data_list = [] # empty list that will hold a line of data for us

with open('data.jsonl', 'r') as f: # the context manager closes the file for us
    for line in f:
        data_list.append(json.loads(line)) # read in line by line

Let's see how many articles we have and then examine the very first one.

In [None]:
len(data_list), data_list[0]

Now that we have a nice list of dictionaries, we can create a pandas DataFrame. You can think of pandas DataFrames as Excel spreadsheets we can use to hold and manipulate our data.

In [None]:
df = pd.DataFrame(data_list)
df.head()

## 3. Flash NLP Intro

We can use the `.loc[index, column]` indexer on our dataframe to select one row and one column, separated by a comma, and then examine a prettier version of the text using the Python function `pprint()`.

In [None]:
random_article = df.loc[10, 'body_text']
pprint(random_article)

Notice how the article above is quite messy and has a lot of characters that, for all intents and purposes, will not be useful for our analysis. Let's examine a cleaner version of the article by running it through spaCy's tokenizer. When we tokenize a document, we separate its content into individual components, i.e. words, numbers, punctuation marks and the like, to make it easier to process and to run computations on it.
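
To make the idea of tokenization concrete, here is a minimal sketch that uses only Python's built-in `re` module. The pattern and the sample sentence are illustrative assumptions, and spaCy's tokenizer applies many more language-specific rules than this.

```python
import re

# Hypothetical sample sentence, not taken from the corpus
sample = "Wikipedia has 6 million articles, right?"

# \w+ matches a run of letters, digits or underscores (a word or a number);
# [^\w\s] matches a single character that is neither a word character
# nor whitespace, i.e. a punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", sample)

print(tokens)
```

Each word, number and punctuation mark ends up as its own item in the list, which is the shape of data the rest of the pipeline works with.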

For this part, we will load an English model, instantiate it, and pass an example article through it. You may need to run the cell below first to download the English model.

In [None]:
# python -m spacy download en_core_web_sm

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
parsed_article = nlp(random_article)

In [None]:
parsed_article

Notice how much nicer our article looks now.

We can also grab sentences and view them one by one, if we wanted to, using the attribute `.sents` and the built-in Python function `next()`. Alternatively, we can use a loop to show each of the sentences in an article.

In [None]:
next(enumerate(parsed_article.sents))

In [None]:
for num, sentence in enumerate(parsed_article.sents):
    print(f"Sentence #{num}:\n {sentence}\n")

We can also have a look at the different kinds of entities in an article. These entities can be a person (called PERSON), a number (called CARDINAL), a geopolitical entity (called GPE), etc.

In [None]:
for num, entity in enumerate(parsed_article.ents):
    print(f"Entity #{num}: {entity} -- {entity.label_}\n")

We can also check whether a word is a stopword or a punctuation mark, and we can even lemmatize our articles. Lemmatization reduces a word to its dictionary form so that related words share a common representation. For example, "was" becomes "be", and most plural words become singular.

In [None]:
# here we are taking out of the parsed article each token
token_text = [token.text for token in parsed_article]

# here we are lemmatizing each word possible
token_lemmas = [token.lemma_ for token in parsed_article]

# stopwords are very common so here we will extract a variable that will tell us whether
# a word is a stopword or not
token_stop = [token.is_stop for token in parsed_article]

token_punc = [token.is_punct for token in parsed_article]

# we will now add all four to a dataframe and display it without assigning it to a variable
pd.DataFrame(zip(token_text, token_lemmas, token_punc, token_stop), columns=['Original Text', 'Lemmatized Text', 'Punctuations', 'stopwords']).head(50)

## 4. Cleaning

Let's start by checking if our dataset contains any missing values, and then evaluate the amount of memory we are currently using on our machine.

In [None]:
df.isna().sum()

In [None]:
df.info(memory_usage='deep')

Over 4 GBs is a lot and it is almost certain that most of that comes from the `body_html` column. Let's get rid of it since we already have the `body_text` column, and then let's evaluate again how much data we are using.

In [None]:
df.drop('body_html', axis=1, inplace=True)

In [None]:
df.info(memory_usage='deep')

Excellent, let's deal with the titles now. It seems that every title starts with `Wikibooks:`, so let's check if this is the case and, if so, let's take that prefix out.

In [None]:
df.title.str.startswith('Wikibooks: ').sum()

In [None]:
df['clean_title'] = df.title.str.replace('Wikibooks: ', '')

Perfect! Let's now extract the `body_text` and `abstract` columns and normalize them. This means we will use the `nltk` library to,
- tokenize the documents,
- take out anything that is not a word or a number,
- convert to lower case,
- strip the spaces around the words,
- remove stopwords (we will use spaCy's list of stopwords for this),
- and then join the cleaned tokens back together.

In [None]:
articles = df['body_text'].values
abstracts = df['abstract'].values

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS
len(STOP_WORDS), STOP_WORDS

In [None]:
def normalize_doc(doc):
    """
    This function normalizes your list of documents by taking only
    words, numbers, and spaces in between them. It then filters out
    stop words.
    """
    # flags must be passed by keyword; the fourth positional argument of re.sub is `count`
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    tokens = nltk.word_tokenize(doc)
    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]
    doc = ' '.join(filtered_tokens)
    return doc

In [None]:
normalize_doc(random_article)

We will also create a version of the same function that neither removes stopwords nor converts to lowercase, and use it to normalize the abstracts.

In [None]:
def normalize_abs(doc):
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.strip()
    tokens = nltk.word_tokenize(doc)
    doc = ' '.join(tokens)
    return doc

In [None]:
normalize_abs(df.loc[10, 'abstract'])

Since we have about 60k articles, this operation can take quite some time unless we run the cleaning process concurrently. We will do this using the `ThreadPoolExecutor()` from the `concurrent.futures` module.

In [None]:
%%time

with ThreadPoolExecutor(max_workers=8) as e:
    processed_articles = list(e.map(normalize_doc, articles))
    processed_abstract = list(e.map(normalize_abs, abstracts))
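
The pattern used in the cell above can be tried on a toy input first. The sketch below assumes a hypothetical `shout` function standing in for a cleaning function such as `normalize_doc`, and three made-up documents. It shows that `map()` hands back results in the same order as the inputs.

```python
from concurrent.futures import ThreadPoolExecutor


def shout(text):
    """Hypothetical stand-in for a cleaning function such as normalize_doc."""
    return text.strip().upper()


# Made-up documents with stray whitespace
docs = ["  first doc ", "second doc", " third doc  "]

with ThreadPoolExecutor(max_workers=2) as executor:
    # map() yields results in input order, regardless of which
    # worker thread happens to finish first
    cleaned = list(executor.map(shout, docs))

print(cleaned)
```

Because of CPython's global interpreter lock, threads help most when the work releases the lock (I/O and some C extensions). For pure-Python, CPU-bound cleaning, `ProcessPoolExecutor` from the same module offers the same `map()` interface and may be worth comparing.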

We will add the cleaned versions of the documents back into the dataframe and compute the length (in characters) of each article, both raw and cleaned.

In [None]:
%%time

df['clean_text'] = processed_articles
df['clean_abstract'] = processed_abstract
df['len_clean_text'] = df['clean_text'].apply(len)
df['len_dirty_text'] = df['body_text'].apply(len)

Let's now save our cleaned dataset in case we need to restart our notebook and begin the analysis again. We will also release a bit of memory by getting rid of all the data and variables we have loaded up since the beginning of the notebook.

In [None]:
%%time

df[['url', 'clean_abstract', 'clean_title', 'clean_text', 'len_clean_text', 'len_dirty_text']].to_parquet('clean_data/clean.parquet', compression='snappy')

In [None]:
del data_list
del df
del articles
del abstracts
del processed_articles
del processed_abstract

In [None]:
df = pd.read_parquet('clean_data/clean.parquet')

In [None]:
df.head()

It wouldn't make any sense to feed our algorithms articles with zero words, so let's examine the distribution of character counts in both the raw and the clean versions of our articles.

In [None]:
df[['len_clean_text', 'len_dirty_text']].describe().T

In [None]:
df[['len_clean_text', 'len_dirty_text']].skew()

Now that we know we have a skewed distribution of character counts, let's deal with the shortest articles by setting up a rule. We'll compare each article against a tweet's maximum character count, 280 at the time of writing, and filter out all articles shorter than that. Let's check how many we have first.

In [None]:
shorter_than_a_tweet = df['len_clean_text'] < 280
shorter_than_a_tweet.sum()

In [None]:
df = df[~shorter_than_a_tweet].copy()

In [None]:
df.shape

## 5. Recommendation System

Recommendation systems can come in many different forms and sizes. We can create a system that takes into account the behaviour of other users, or a system that only looks at similar articles or items to make a recommendation. Both are powerful systems and could cover an entire book in their own right, which is why we will focus on the latter category, the one that makes recommendations based on similar articles.

To create our recommendation system we first need to convert our articles into a numerical representation. We do this with a so-called bag of words (BOW). A BOW is a matrix with the documents in the rows, the terms found across all documents in the columns, and the frequency with which each term appears in each document as the values. To create this kind of representation we can use `sklearn`'s `CountVectorizer` or `TfidfVectorizer` classes. The latter is a reweighted version of the former: each term's count is scaled down according to the number of documents in which the term appears, so that words common to most documents carry less weight.
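
As a rough illustration of what these two classes compute, here is a minimal sketch in plain Python on a made-up three-document corpus. The corpus and variable names are assumptions for illustration. The idf formula is the smoothed one described in scikit-learn's documentation, and the sketch stops before the per-row L2 normalization that `TfidfVectorizer` applies by default.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already cleaned and lower-cased
docs = ["cat sat mat", "dog sat log", "cat cat dog"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})

# Bag of words: one row per document, one column per vocabulary term
bow = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow.append([counts[term] for term in vocab])

# Document frequency: in how many documents does each term appear?
n_docs = len(docs)
doc_freq = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency: rarer terms get larger weights
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

# tf-idf weights (before any row normalization)
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in bow]

print(vocab)
print(bow)
```

Terms that appear in a single document, such as "log" and "mat" here, receive a larger idf weight than terms that appear in two documents.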

To use these classes we first instantiate them, fit them to the data so that they can learn the vocabulary of our corpus, and then transform the corpus into a sparse matrix. A sparse matrix stores only the location and value of its non-zero entries, which makes the data cheaper to store and compute on.
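
The idea behind sparse storage can be sketched with a plain dictionary. This is a simplified coordinate-style layout on a made-up matrix, not the exact format SciPy uses internally.

```python
# Dense toy matrix (hypothetical): mostly zeros, like a real document-term matrix
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]

# Keep only (row, column) -> value for the non-zero cells
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

# Cells that were never stored are implicit zeros
print(sparse)
print(sparse.get((1, 1), 0))
```

Only 3 of the 12 cells are stored here. With a vocabulary of thousands of terms, where each article uses a small fraction of them, the saving is far larger.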

In [None]:
%%time

# if you would rather work with a sample of the dataset to see how it works, use the following one
small_df = df.sample(5_000).copy()

# otherwise, use this one
# small_df = df

small_df.head()

In [None]:
%%time

# we first instantiate our class
tf = TfidfVectorizer(min_df=0.035, max_df=0.80)

# we can fit and transform the data in the same step
tfidf_matrix = tf.fit_transform(small_df['clean_text'].values)

# evaluate the shape of our matrix
tfidf_matrix.shape

We can access our vocabulary with the `.get_feature_names_out()` method (scikit-learn releases before 1.0 expose it as `.get_feature_names()` instead).

In [None]:
tf.get_feature_names_out()

The next step is to measure how similar documents are to one another, based only on the words they contain. The `cosine_similarity` function we imported earlier can do this for us, and afterwards we can create a dataframe to evaluate our results.
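
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for made-up vectors. The function name and the data are illustrative assumptions, and it assumes neither vector is all zeros.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical term-weight vectors for three documents
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, only longer
doc_c = [0.0, 0.0, 5.0]  # shares no terms with doc_a

print(cosine(doc_a, doc_b))
print(cosine(doc_a, doc_c))
```

Documents that use the same terms in the same proportions score close to 1 regardless of their length, and documents with no terms in common score 0.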

**Note:** this operation can take a few minutes if you are using the entire dataset. Grab some ☕️ 😎

In [None]:
%%time

doc_sim = cosine_similarity(tfidf_matrix)

In [None]:
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

In [None]:
doc_sim.shape

We see a 5000x5000 matrix because every document is compared with every other document. The matrix is symmetric, so the two halves on either side of the diagonal mirror each other.

In [None]:
articles_list = small_df['clean_title'].values
abstract_list = small_df['clean_abstract'].values
articles_list.shape, articles_list

Let's now
1. pick a title at random
2. get the index of such title
3. select the corresponding row for such title in our new document similarity dataframe
4. sort the index of such values
5. return the top 5 article titles

In [None]:
from random import choice

In [None]:
a_title = choice(articles_list)
a_title

In [None]:
article_idx = np.where(articles_list == a_title)[0][0]
article_idx

In [None]:
article_similarities = doc_sim_df.iloc[article_idx].values
article_similarities

In [None]:
# we skip the first position because it is the article itself, whose similarity should be 1
similar_articles_idxs = np.argsort(-article_similarities)[1:6]
similar_articles_idxs

In [None]:
similar_articles = articles_list[similar_articles_idxs]
pprint(similar_articles.tolist())

In [None]:
similar_abstracts = abstract_list[similar_articles_idxs]
pprint(similar_abstracts[2])

Lastly, we will create a mini-dashboard containing,
1. a widget with all of our titles,
2. a function with the steps we followed above,
3. a panel object to store a title, the widget, and the function.

In [None]:
titles = small_df.clean_title.unique().tolist()
title_widget = pn.widgets.Select(value=choice(titles), options=titles, name='Articles')

In [None]:
@pn.depends(title_widget.param.value)
def article_recommender(title_widget):
    
    article_idx = np.where(articles_list == title_widget)[0][0]
    article_similarities = doc_sim_df.iloc[article_idx].values
    similar_title_idxs = np.argsort(-article_similarities)[1:6]
    similar_titles = articles_list[similar_title_idxs]
    
    return pn.Column(*similar_titles, width=600)

In [None]:
text = pn.pane.Markdown(f"# Small Recommendation Engine", style={"color": "#000000"}, width=600, height=50,
                        sizing_mode="stretch_width", margin=(10,10,10,5))

In [None]:
pn.Column(text, title_widget, article_recommender, align='center', width=600, height=300)

## 6. Topic Modeling

What is topic modeling?

> "In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both." ~ [Wikipedia](https://en.wikipedia.org/wiki/Topic_model)

As with the recommendation engine, topic modeling requires a bag of words to represent the data. Unlike the recommendation engine, it also requires the number of topics as the key parameter for the model.

In [None]:
vectorizer = CountVectorizer(strip_accents = 'unicode', min_df=0.035, max_df=0.80)

In [None]:
bow = vectorizer.fit_transform(small_df['clean_text'].values)
bow

In [None]:
topics = 20

In [None]:
lda_model = LatentDirichletAllocation(n_components=topics, # number of topics
                                      max_iter=100, # maximum number of passes over the training data
                                      learning_method='online', 
                                      random_state=42, # setting a seed for reproducible results
                                      n_jobs=-1) # this parameter makes sure we use all of the cores in our machine

In [None]:
%%time

lda_model.fit(bow)

We will create a function to explore the topics and their words to see if we can tease apart the main idea of a topic.

In [None]:
def show_topics(vectorizer, lda_model, n_words=15):
    """
    This function takes our vectorizer, our model, and a
    number of words to display the topics from our model.
    """
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

Play around with the number of topics and the number of words displayed to see which values make the most sense to you.

In [None]:
show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=10)

In [None]:
terms = sorted(vectorizer.vocabulary_.keys()) # vocabulary of the vectorizer the LDA model was fitted with

In [None]:
bow_docs = pd.DataFrame(bow.toarray(), columns=terms)
bow_docs.head()

The components of our model can be found in `lda_model.components_` and can help us create two dataframes, namely terms-to-topics and documents-to-topics. The values of the former can be read as pseudo-counts of how often a word was assigned to a topic, and the values of the latter are the estimated proportions of each topic within a document.
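
If you would rather read each topic as a probability distribution over words, you can divide every row of the components by its sum. The sketch below does this in plain Python on made-up numbers laid out like `lda_model.components_` (topics in rows, terms in columns). The values are assumptions for illustration only.

```python
# Hypothetical pseudo-counts for 2 topics over a 3-term vocabulary
components = [
    [8.0, 1.0, 1.0],
    [2.0, 2.0, 4.0],
]

# Divide each row by its total so that every topic sums to 1
topic_word = [[weight / sum(row) for weight in row] for row in components]

for topic in topic_word:
    print(topic)
```

With NumPy arrays the same normalization can be written as `components / components.sum(axis=1, keepdims=True)`.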

In [None]:
topic_term = pd.DataFrame(lda_model.components_.T, index=terms, columns=['topic_' + str(i) for i in range(topics)])
topic_term.tail()

In [None]:
doc_topic = pd.DataFrame(lda_model.transform(bow), index=small_df.clean_title, columns=['topic_' + str(i) for i in range(topics)]) # transform the same counts the model was fitted on
doc_topic.tail(3)

Lastly, a good way to examine the output of an LDA model is to visualize it, and for this we have `pyLDAvis`, a Python library for visualizing topic models. We first load it with its sklearn backend while enabling the notebook setting. Next we use `pyLDAvis.sklearn.prepare` and pass in our model, the bag of words, and the fitted vectorizer to get a nice interactive visualization tool. Note that newer pyLDAvis releases (3.4 and later) rename this backend to `pyLDAvis.lda_model`, so adjust the import if `pyLDAvis.sklearn` is not found.

In [None]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [None]:
pyLDAvis.sklearn.prepare(lda_model, bow, vectorizer)

## 7. Summary

Blind Spots

With additional time we could have,
1. Further tweaked the parameters of the vectorizers and models;
2. Created visualizations of both the best topics and the document similarity to find more interesting patterns;
3. Taken the title of an article out of the body of the article to create a better, less biased representation of the words within a document.

Takeaways,
1. Recommendation systems and topic modeling are both unsupervised methods;
2. Recommendation systems can be created with or without users behavioural data;
3. Topic modeling compresses the data into the most important and meaningful words for the number of topics you set;
4. Creating bags of words requires careful attention to the parameters;
5. Where possible, showcase a model or system in a mini-dashboard or data visualization.