# Building NLP Products Tutorial

> "You shall know a word by the company it keeps." ~ John R. Firth

![img](https://cdn.shopify.com/s/files/1/0867/3580/products/vinyl_decal_hello_words_cloud_ig4779_1800x1800.jpg?v=1571439560)

## Learning Outcomes

By the end of this tutorial you will
1. Have a better understanding of natural language processing and some of its applications.
2. Be able to create recommendation systems based on text similarity.
3. Be able to conduct topic modeling on your own corpus.
4. Understand how to put together a simple app using panel.

## Table of Contents

1. Overview
2. The Data
3. Flash NLP Intro
4. Cleaning
5. Recommendation System
6. Topic Modeling
7. Summary

## 1. Overview

With have been given a random corpus of articles taken from Wikipedia and our task is to come up with two products, a recommendations systems and a set of topic that best explains the model. This will help you and anyone else who picks up this notebook, understand the Wikipedia corpus better.

In [1]:
import json, nltk, re, spacy, umap
import pandas as pd, numpy as np
from pprint import pprint
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
import panel as pn
from concurrent.futures import ThreadPoolExecutor

pn.extension()

%load_ext autoreload
%autoreload 2

It is possible that you will need the following packages in order to move forward. Please copy the two lines below, paste them in a new cell and run it.

```python
nltk.download('wordnet')
nltk.download('punkt')
```

## 2. The Data

The data consist of Wikipedia articles plus some additional columns inside a JSON file. Here is the schema.

| Column | Content |
|--------|---------|
|title |Title of article|
|url | Url of article|
|abstract | Abstract of article|
|body_text | Text inside article|
|body_html | Article inside HTML|

Before we do any data cleaning, let's read in the data and explore it a bit.

In [2]:
%%time

data_list = [] # empty list that will hold a line of data for us

for line in open('data.jsonl', 'r'):
    data_list.append(json.loads(line)) # read in line by line

CPU times: user 8.47 s, sys: 2.83 s, total: 11.3 s
Wall time: 12.1 s


Let's see how many articles we have and then examine the very first one.

In [3]:
len(data_list), data_list[0]

(64844,
 {'title': 'Wikibooks: Romanian/Lesson 9',
  'url': 'https://en.wikibooks.org/wiki/Romanian/Lesson_9',
  'abstract': '==Băuturi/Beverages==',
  'body_text': 'Băuturi/Beverages[edit\xa0| edit source]\nTea\xa0: Ceai\nMilk\xa0: Lapte\nWater\xa0: Apă (If you are in Romania, and want to ask for plain tap water, ask for apă plată.)\nSparkling water\xa0: Apă minerală\nSoda\xa0: Sifon\nBeer\xa0: Bere\nWine\xa0: Vin\nMâncăruri/Foods[edit\xa0| edit source]\nBread\xa0: Pâine\nPotato\xa0: Cartof\nMashed potatoes\xa0: Piure de cartofi\nFrench fries\xa0: Cartofi prăjiți\nCheese (To put on bread)\xa0: Caşcaval\nFeta cheese\xa0: Brânza\nSteak\xa0: Friptură\nSoup\xa0: Supă\nChicken\xa0: Pui\nBeef\xa0: Vacă\nDuck\xa0: Rață\nPork\xa0: Porc\nOranges\xa0: Portocale\nTomatoes\xa0: Roșii\nToast\xa0: Pâine prăjită (lit. "Fried bread".)\nApple\xa0: Măr\nTacâmuri/Eating utensils[edit\xa0| edit source]\nKnife\xa0: Cuţit\nFork\xa0: Furculiţă\nSpoon\xa0: Lingură\nTeaspoon\xa0: Linguriţă\nGlass\xa0: Pahar\n

Now that we have a nice list of dictionaries, we can create a pandas DataFrame. You can think of pandas DataFrames as as Excel spreadsheets we can use to hold and manipulate our data for us.

In [4]:
df = pd.DataFrame(data_list)
df.head()

Unnamed: 0,title,url,abstract,body_text,body_html
0,Wikibooks: Romanian/Lesson 9,https://en.wikibooks.org/wiki/Romanian/Lesson_9,==Băuturi/Beverages==,Băuturi/Beverages[edit | edit source]\nTea : C...,"<div class=""mw-parser-output""><h2><span id=""B...."
1,Wikibooks: Karrigell,https://en.wikibooks.org/wiki/Karrigell,Karrigell is an open Source Python web framewo...,Karrigell is an open Source Python web framewo...,"<div class=""mw-parser-output""><p>Karrigell is ..."
2,Wikibooks: The Pyrogenesis Engine/0 A.D./GuiSe...,https://en.wikibooks.org/wiki/The_Pyrogenesis_...,====setupUnitPanel====,setupUnitPanel[edit | edit source]\nHelper fun...,"<div class=""mw-parser-output""><h4><span class=..."
3,Wikibooks: LMIs in Control/pages/Exterior Coni...,https://en.wikibooks.org/wiki/LMIs_in_Control/...,== The Concept ==,Contents\n\n1 The Concept\n2 The System\n3 The...,"<div class=""mw-parser-output""><div id=""toc"" cl..."
4,Wikibooks: Laptop Computer Models/Dell/Latitud...,https://en.wikibooks.org/wiki/Laptop_Computer_...,= Dell Latitude D830 =,Contents\n\n1 Dell Latitude D830\n\n1.1 CPU\n1...,"<div class=""mw-parser-output""><div id=""toc"" cl..."


## 3. Flash NLP Intro

We can use the `.loc[index, column]` method on our dataframe, select one column and one row using a comma to separate both, and examine a prettier version of the text using the python function `pprint()`.

In [5]:
random_article = df.loc[10, 'body_text']
pprint(random_article)

('This Wikibooks page is a fact sheet and analysis on the article "Habitual '
 'physical activity in children and adolescents with cystic fibrosis" about '
 'how exercise is related to the disease Cystic Fibrosis.\n'
 '\n'
 'Contents\n'
 '\n'
 '1 Background of this research\n'
 '2 Where is the research from\xa0?\n'
 '3 What kind of research was this?\n'
 '4 What did the research involve?\n'
 '\n'
 '4.1 Pulmonary Function testing\n'
 '4.2 Pros / Cons of this test\n'
 '\n'
 '\n'
 '5 What were the basic results?\n'
 '6 What conclusion can we take from this research\xa0?\n'
 '7 Practical Advice\n'
 '8 Further information/ Resources\n'
 '\n'
 '8.1 Cystic Fibrosis Australia\n'
 "8.2 Cystic Fibrosis's National Ambassador Nathan Charles\n"
 '\n'
 '\n'
 '9 References\n'
 '\n'
 '\n'
 '\n'
 'Background of this research[edit\xa0| edit source]\n'
 'The research was about the effects of taking part in exercise constantly or '
 'making it a habit in the population of children and teens that are sever

Notice how the review above is quite messy and it has a lot of characters that, for all intents and purposes, will not be useful for our analysis. Let's examine a cleaner version of the article above by running it through spaCy's tokenizer. When we tokenize a document, we are separating all of its content into each of its components, i.e. words, numbers, punctiations and the like, to make it easier to process and to run computations on it.

For this part, we will load an english model, instantiate it and pass an example article through it. You may need to run the cell below first to download the english model.

In [6]:
# python -m spacy download en_core_web_sm

In [7]:
nlp = spacy.load('en_core_web_sm')

In [8]:
parsed_article = nlp(random_article)

In [9]:
parsed_article

This Wikibooks page is a fact sheet and analysis on the article "Habitual physical activity in children and adolescents with cystic fibrosis" about how exercise is related to the disease Cystic Fibrosis.

Contents

1 Background of this research
2 Where is the research from ?
3 What kind of research was this?
4 What did the research involve?

4.1 Pulmonary Function testing
4.2 Pros / Cons of this test


5 What were the basic results?
6 What conclusion can we take from this research ?
7 Practical Advice
8 Further information/ Resources

8.1 Cystic Fibrosis Australia
8.2 Cystic Fibrosis's National Ambassador Nathan Charles


9 References



Background of this research[edit | edit source]
The research was about the effects of taking part in exercise constantly or making it a habit in the population of children and teens that are severing from the genetic condition cystic Fibrosis.
What is  Cystic Fibrosis
It is a genetic condition, affecting lungs and digestion. Unfortunately, there is no 

Notice how much nicer our article looks like now.

We can also grab sentences and view them one by one we wanted to using the attribute `.sents` and the built in python function `next()`. Conversely, we can add it to a loop and show each of the sentences in an article.

In [10]:
next(enumerate(parsed_article.sents))

(0,
 This Wikibooks page is a fact sheet and analysis on the article "Habitual physical activity in children and adolescents with cystic fibrosis" about how exercise is related to the disease Cystic Fibrosis.)

In [11]:
for num, sentence in enumerate(parsed_article.sents):
    print(f"Sentence #{num}:\n {sentence}\n")

Sentence #0:
 This Wikibooks page is a fact sheet and analysis on the article "Habitual physical activity in children and adolescents with cystic fibrosis" about how exercise is related to the disease Cystic Fibrosis.

Sentence #1:
 

Contents

1 Background of this research
2 Where is the research from ?

Sentence #2:
 
3

Sentence #3:
 What kind of research was this?

Sentence #4:
 
4

Sentence #5:
 What did the research involve?

Sentence #6:
 

4.1 Pulmonary Function testing
4.2 Pros / Cons of this test


5

Sentence #7:
 What were the basic results?

Sentence #8:
 
6

Sentence #9:
 What conclusion can we take from this research ?

Sentence #10:
 
7 Practical Advice
8 Further information/ Resources

8.1 Cystic Fibrosis Australia
8.2 Cystic Fibrosis's National Ambassador Nathan Charles


9 References



Background of this research[edit

Sentence #11:
  | edit source]


Sentence #12:
 The research was about the effects of taking part in exercise constantly or making it a habit in the 

We can also have a look at the different kinds of entities in an article. These entities can be a person (called PERSON), and number (called CARDINAL), a geopolitical entity (called GPE), etc.

In [12]:
for num, entity in enumerate(parsed_article.ents):
    print(f"Entity #{num}: {entity} -- {entity.label_}\n")

Entity #0: 2 -- CARDINAL

Entity #1: 3 -- CARDINAL

Entity #2: 4.2 -- CARDINAL

Entity #3: Pros / Cons -- ORG

Entity #4: 5 -- CARDINAL

Entity #5: 6 -- CARDINAL

Entity #6: 8 -- CARDINAL

Entity #7: Australia -- GPE

Entity #8: 8.2 -- CARDINAL

Entity #9: Nathan Charles -- PERSON

Entity #10: 9 -- CARDINAL

Entity #11: Fibrosis -- PRODUCT

Entity #12: Cystic Fibrosis -- ORG

Entity #13: 1 -- CARDINAL

Entity #14: 3300 -- CARDINAL

Entity #15: American -- NORP

Entity #16: Pittsburgh -- GPE

Entity #17: Two -- CARDINAL

Entity #18: David Michael Orenstein -- PERSON

Entity #19: CF -- GPE

Entity #20: Austria -- GPE

Entity #21: Two -- CARDINAL

Entity #22: David Michael -- PERSON

Entity #23: Patricia -- PERSON

Entity #24: the Journal of Paediatric Pulmonology -- ORG

Entity #25: three -- CARDINAL

Entity #26: 60 -- CARDINAL

Entity #27: 7–17 years -- DATE

Entity #28: 30 -- CARDINAL

Entity #29: 18 -- CARDINAL

Entity #30: 12 -- CARDINAL

Entity #31: 30 -- CARDINAL

Entity #32: 17 --

We can also check weather a word is a stopword or a punctuation, or we can even lemmatize our articles. Lemmatization is a way of taking the root of a word and bringing similar words to a common denominator, for example, was will become be and most plural words will be singular words.

In [13]:
# here we are taking out of the parsed article each token
token_text = [token.text for token in parsed_article]

# here we are lemmatizing each word possible
token_lemmas = [token.lemma_ for token in parsed_article]

# stopwords are very common so here we will extract a variable that will tell us whether
# a word is a stopword or not
token_stop = [token.is_stop for token in parsed_article]

token_punc = [token.is_punct for token in parsed_article]

# we will now add all three to a dataframe and display it without assigning it to a variable
pd.DataFrame(zip(token_text, token_lemmas, token_punc, token_stop), columns=['Original Text', 'Lemmatized Text', 'Punctuations', 'stopwords']).head(50)

Unnamed: 0,Original Text,Lemmatized Text,Punctuations,stopwords
0,This,this,False,True
1,Wikibooks,Wikibooks,False,False
2,page,page,False,False
3,is,be,False,True
4,a,a,False,True
5,fact,fact,False,False
6,sheet,sheet,False,False
7,and,and,False,True
8,analysis,analysis,False,False
9,on,on,False,True


## 4. Cleaning

Let's start by checking if our dataset contains any missin values, and then evaluate the amount of memory we are currently using from our machine.

In [14]:
df.isna().sum()

title        0
url          0
abstract     0
body_text    0
body_html    0
dtype: int64

In [15]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64844 entries, 0 to 64843
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      64844 non-null  object
 1   url        64844 non-null  object
 2   abstract   64844 non-null  object
 3   body_text  64844 non-null  object
 4   body_html  64844 non-null  object
dtypes: object(5)
memory usage: 4.4 GB


Over 4 GBs is a lot and it is almost certain that most of that comes from the `body_html` column. Let's get rid of it since we already have the `body_text` column, and then let's evaluate again how much data we are using.

In [16]:
df.drop('body_html', axis=1, inplace=True)

In [17]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64844 entries, 0 to 64843
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      64844 non-null  object
 1   url        64844 non-null  object
 2   abstract   64844 non-null  object
 3   body_text  64844 non-null  object
dtypes: object(4)
memory usage: 1007.2 MB


Excellent, let's deal with the titles now. It seems that every abstract starts with `Wikibooks:` so let's check if this is the case and if so, let's take that out.

In [18]:
df.title.str.startswith('Wikibooks: ').sum()

64844

In [19]:
df['clean_title'] = df.title.str.replace('Wikibooks: ', '')

Perfect! Let's now extract the `body_text` and `abstract` columns and normalize them. This means we will the `nltk` library to,
- tokenize the documents,
- take out anything that is not a word or a number,
- convert to lower case,
- strip the spaces around the words,
- remove stopwords (we will use spaCy's list of stopwords for this),
- and then join the cleaned tokens back together.

In [20]:
articles = df['body_text'].values
abstracts = df['abstract'].values

In [21]:
from spacy.lang.en.stop_words import STOP_WORDS
len(STOP_WORDS), STOP_WORDS

(326,
 {"'d",
  "'ll",
  "'m",
  "'re",
  "'s",
  "'ve",
  'a',
  'about',
  'above',
  'across',
  'after',
  'afterwards',
  'again',
  'against',
  'all',
  'almost',
  'alone',
  'along',
  'already',
  'also',
  'although',
  'always',
  'am',
  'among',
  'amongst',
  'amount',
  'an',
  'and',
  'another',
  'any',
  'anyhow',
  'anyone',
  'anything',
  'anyway',
  'anywhere',
  'are',
  'around',
  'as',
  'at',
  'back',
  'be',
  'became',
  'because',
  'become',
  'becomes',
  'becoming',
  'been',
  'before',
  'beforehand',
  'behind',
  'being',
  'below',
  'beside',
  'besides',
  'between',
  'beyond',
  'both',
  'bottom',
  'but',
  'by',
  'ca',
  'call',
  'can',
  'cannot',
  'could',
  'did',
  'do',
  'does',
  'doing',
  'done',
  'down',
  'due',
  'during',
  'each',
  'eight',
  'either',
  'eleven',
  'else',
  'elsewhere',
  'empty',
  'enough',
  'even',
  'ever',
  'every',
  'everyone',
  'everything',
  'everywhere',
  'except',
  'few',
  'fifteen',

In [22]:
def normalize_doc(doc):
    """
    This function normalizes your list of documents by taking only
    words, numbers, and spaces in between them. It then filters out
    stop words.
    """
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    tokens = nltk.word_tokenize(doc)
    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]
    doc = ' '.join(filtered_tokens)
    return doc

In [23]:
normalize_doc(random_article)

'wikibooks page fact sheet analysis article habitual physical activity children adolescents cystic fibrosis exercise related disease cystic fibrosis contents 1 background research 2 research 3 kind research 4 research involve 41 pulmonary function testing 42 pros cons test 5 basic results 6 conclusion research 7 practical advice 8 information resources 81 cystic fibrosis australia 82 cystic fibrosiss national ambassador nathan charles 9 references background researchedit edit source research effects taking exercise constantly making habit population children teens severing genetic condition cystic fibrosis cystic fibrosis genetic condition affecting lungs digestion unfortunately cure condition cystic fibrosis cf inherited white population 1 3300 live births diagnosed condition1 research edit edit source research based american childrens hospital pittsburgh cf centre volunteers research included siblings friends hospital employees children condition authors research work department paed

We will also create the same version of the function but without taking the stopwords out or converting to lowecase, to normalize the abstract.

In [24]:
def normalize_abs(doc):
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, re.I|re.A)
    doc = doc.strip()
    tokens = nltk.word_tokenize(doc)
    doc = ' '.join([token for token in tokens])
    return doc

In [25]:
normalize_abs(df.loc[10, 'abstract'])

'This Wikibooks page is a fact sheet and analysis on the article Habitual physical activity in children and adolescents with cystic fibrosis about how exercise is related to the disease Cystic Fibrosis'

Since we have about 60k articles, this operation can take quite some time unless we the cleaning process concurrently. We will do this using the `ThreadPoolExecutor()` from the `concurrent.futures` module.

In [26]:
%%time

with ThreadPoolExecutor(max_workers=8) as e:
    processed_articles = list(e.map(normalize_doc, articles))
    processed_abstract = list(e.map(normalize_abs, abstracts))

CPU times: user 13min 10s, sys: 15.2 s, total: 13min 25s
Wall time: 13min 38s


We will add the cleaned versions of the documents back into the dataframe and loop over these while taking the lenght (in characters terms) of each article.

In [27]:
%%time

df['clean_text'] = processed_articles
df['clean_abstract'] = processed_abstract
df['len_clean_text'] = df['clean_text'].apply(len)
df['len_dirty_text'] = df['body_text'].apply(len)

CPU times: user 92.7 ms, sys: 341 ms, total: 434 ms
Wall time: 959 ms


Let's now save our cleaned dataset in case we need to restart our notebook and begin the analysis again. We will also release a bit of memory by getting rid of all the data and variables we have loaded up since the beginning of the notebook.

In [28]:
%%time

df[['url', 'clean_abstract', 'clean_title', 'clean_text', 'len_clean_text', 'len_dirty_text']].to_parquet('clean_data/clean.parquet', compression='snappy')

CPU times: user 2.07 s, sys: 2.48 s, total: 4.55 s
Wall time: 6.86 s


In [29]:
del data_list
del df
del articles
del abstracts
del processed_articles
del processed_abstract

In [2]:
df = pd.read_parquet('clean_data/clean.parquet')

In [3]:
df.head()

Unnamed: 0,url,clean_abstract,clean_title,clean_text,len_clean_text,len_dirty_text
0,https://en.wikibooks.org/wiki/Romanian/Lesson_9,ButuriBeverages,Romanian/Lesson 9,buturibeveragesedit edit source tea ceai milk ...,632,827
1,https://en.wikibooks.org/wiki/Karrigell,Karrigell is an open Source Python web framewo...,Karrigell,karrigell open source python web framework wri...,953,1250
2,https://en.wikibooks.org/wiki/The_Pyrogenesis_...,setupUnitPanel,The Pyrogenesis Engine/0 A.D./GuiSession,setupunitpaneledit edit source helper function...,146,185
3,https://en.wikibooks.org/wiki/LMIs_in_Control/...,The Concept,LMIs in Control/pages/Exterior Conic Sector Lemma,contents 1 concept 2 system 3 data 4 lmi exter...,3034,11040
4,https://en.wikibooks.org/wiki/Laptop_Computer_...,Dell Latitude D830,Laptop Computer Models/Dell/Latitude D830,contents 1 dell latitude d830 11 cpu 12 memory...,543,617


It wouldn't make any sense to feed to our algorithms articles with zero words, so let's examine the distribution of characters among both, the raw and the clean version of our articles.

In [4]:
df[['len_clean_text', 'len_dirty_text']].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
len_clean_text,64844.0,5727.485195,24572.765181,0.0,436.0,1485.0,4741.25,1260060.0
len_dirty_text,64844.0,8534.831303,36413.955414,0.0,641.0,2235.0,7150.0,1851361.0


In [5]:
df[['len_clean_text', 'len_dirty_text']].skew()

len_clean_text    22.119524
len_dirty_text    21.950325
dtype: float64

Now that we know we have a skewed distribution of characters, let's fix that by setting up a rule. We'll evaluate an article using a tweets' maximum character count, 280 at the time of writing, and filter out all articles with less than that. Let's check how many we have first.

In [6]:
shorter_than_a_tweet = df['len_clean_text'] < 280
shorter_than_a_tweet.sum()

12031

In [7]:
df = df[~shorter_than_a_tweet].copy()

In [8]:
df.shape

(52813, 6)

# 5. Recommendation System

Recommendation systems can come in many different forms and sizes. We can create a system that takes into account the behaviour of other users, or a system that only looks at similar articles or items to make a recommendation. Both are powerful systems and could cover an entire book in their own right, which is why we will focus on the latter category, the one that makes recommendations based on similar articles.

To create our recommendation system we first need to convert our articles into a numerical representation. We do this with a so-called bag of words (bow). BOWs are matrices with the documents in the rows, the terms contained in all documents along the columns, and the frequency with which each term appears in each document along the values. To create this kind of representation we can use `sklearn`'s `CountVectorizer` or `TfidfVectorizer` classes. The latter being the normalized version of the former, i.e. the frequency of a word divided by the amount of documents in which it appears.

To use this classes we first instantiate them, fit the data to them so that they can learn the vocabulary of our corpus, and then we tranform the corpus into a sparse matrix. These sparse matrices hold the location of all non-zero values to make it easier to store the data and compute on it.

In [12]:
%%time

# if you would rather work with a sample of the dataset to see how it works, use the following one
small_df = df.sample(5_000).copy()

# otherwise, use this one
# small_df = df

small_df.head()

CPU times: user 25.6 ms, sys: 789 ms, total: 815 ms
Wall time: 4.62 s


Unnamed: 0,url,clean_abstract,clean_title,clean_text,len_clean_text,len_dirty_text
11275,https://en.wikibooks.org/wiki/Pinyin/Powerful_...,Search,Pinyin/Powerful hurricane hits Haiti (2016-10-05),share thoughts page wikibooksmaintain notice p...,596,803
21344,https://en.wikibooks.org/wiki/Four-Player_Ches...,B00001111111100001,Four-Player Chess/Common openings/1. g3,1g3 queens pawn opening b c d e f g h j k l m ...,745,1136
45500,https://en.wikibooks.org/wiki/Game_Creation_wi...,NOTOC,Game Creation with XNA/2D Development/Menu and...,menu helpedit edit source game needs game menu...,360,574
46502,https://en.wikibooks.org/wiki/Calculus_of_Vari...,CHAPTER IV PROPERTIES OF THE FUNCTION Fxyxy,Calculus of Variations/CHAPTER IV,chapter iv properties function f x y x y displ...,18007,59278
33158,https://en.wikibooks.org/wiki/Solresol/Orthogr...,Solresol being based on 7 syllables has a wide...,Solresol/Orthography,solresol based 7 syllables wide variety ways w...,421,673


In [13]:
%%time

# we first instantiate our class
tf = TfidfVectorizer(min_df=0.035, max_df=0.80)

# we can fit and transform the data in the same step
tfidf_matrix = tf.fit_transform(small_df['clean_text'].values)

# evaluate the shape of our matrix
tfidf_matrix.shape

CPU times: user 3.22 s, sys: 742 ms, total: 3.96 s
Wall time: 5.28 s


(5000, 1502)

We can access our vocabulary with `.get_feature_names()` method.

In [14]:
tf.get_feature_names()

['000',
 '01',
 '05',
 '10',
 '100',
 '11',
 '110',
 '111',
 '112',
 '113',
 '12',
 '121',
 '122',
 '123',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '1994',
 '1997',
 '1998',
 '1999',
 '20',
 '200',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '300',
 '31',
 '32',
 '33',
 '34',
 '35',
 '36',
 '37',
 '38',
 '39',
 '3d',
 '40',
 '41',
 '42',
 '43',
 '44',
 '45',
 '46',
 '47',
 '48',
 '49',
 '50',
 '500',
 '51',
 '52',
 '53',
 '54',
 '55',
 '56',
 '57',
 '58',
 '59',
 '60',
 '61',
 '62',
 '63',
 '64',
 '65',
 '66',
 '67',
 '70',
 '71',
 '72',
 '73',
 '75',
 '80',
 '81',
 '82',
 '83',
 '85',
 '90',
 '95',
 'ability',
 'able',
 'accept',
 'accepted',
 'access',
 'according',
 'account',
 'accurate',
 'achieve',
 'achieved',
 'act',
 'action',
 'actions',
 'active',
 'activities',
 'activity',
 'acts',
 'actual',
 'actually',
 'add',


The next step is to get the distance between documents and words to see how close and how far, based on words only, are two documents from one another. The `cosine_similarity` similarity function we imported earlier can do this for us, and afterwards, we can create a dataframe to evaluate our results.

**Note:** this operation can take a few minutes if you are using the entire dataset. Grab some ☕️ 😎

In [17]:
%%time

doc_sim = cosine_similarity(tfidf_matrix)

CPU times: user 1.37 s, sys: 817 ms, total: 2.18 s
Wall time: 2.44 s


In [18]:
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
0,1.0,0.0,0.0,0.003608,0.0,0.0,0.043375,0.002456,0.010328,0.01671,...,0.006236,0.0,0.020755,0.084016,0.057204,0.030873,0.044144,0.0,0.0,0.014853
1,0.0,1.0,0.020138,0.001491,0.014168,0.03545,0.112138,0.06893,0.004493,0.030633,...,0.015104,0.017927,0.022704,0.008304,0.007442,0.027183,0.087372,0.062765,0.008332,0.021505
2,0.0,0.020138,1.0,0.002222,0.0,0.087215,0.052481,0.110099,0.015816,0.003494,...,0.01621,0.023439,0.014445,0.119008,0.018722,0.029182,0.015016,0.027039,0.015722,0.015842
3,0.003608,0.001491,0.002222,1.0,0.015239,0.020821,0.021069,0.01782,0.091355,0.037829,...,0.023404,0.019649,0.757242,0.136245,0.008237,0.013438,0.007581,0.01592,0.015868,0.030486
4,0.0,0.014168,0.0,0.015239,1.0,0.0,0.0,0.048841,0.005498,0.065113,...,0.073968,0.00384,0.015907,0.024191,0.0,0.009756,0.013106,0.0,0.003825,0.007673


In [19]:
doc_sim.shape

(5000, 5000)

The reason we see a 5000x5000 matrix is because both halfs alonside the diagonal like are completely the same.

In [20]:
articles_list = small_df['clean_title'].values
abstract_list = small_df['clean_abstract'].values
articles_list.shape, articles_list

((5000,),
 array(['Pinyin/Powerful hurricane hits Haiti (2016-10-05)',
        'Four-Player Chess/Common openings/1. g3',
        'Game Creation with XNA/2D Development/Menu and Help', ...,
        'Mambo Open Source/Individual styling for each module position',
        "Down'n'dirty Blacksmithing/Hot Cutting",
        'Freistil/"Freiwillige Ausreise" Unwort des Jahres 2006'],
       dtype=object))

Let's now
1. pick a title at random
2. get the index of such title
3. select the corresponding row for such title in our new document similarity dataframe
4. sort the index of such values
5. return the top 5 article titles

In [23]:
from random import choice

In [25]:
a_title = choice(articles_list)
a_title

'Wiki Dispute Resolution/Introduction'

In [27]:
article_idx = np.where(articles_list == a_title)[0][0]
article_idx

3658

In [28]:
article_similarities = doc_sim_df.iloc[article_idx].values
article_similarities

array([0.02794027, 0.        , 0.04164891, ..., 0.0453757 , 0.01005291,
       0.        ])

In [30]:
# note that we don't select the first one as this should always be one
similar_articles_idxs = np.argsort(-article_similarities)[1:10]
similar_articles_idxs

array([2477, 2985, 2508, 2374, 3973, 1158, 3999, 2454, 3217])

In [31]:
similar_articles = articles_list[similar_articles_idxs]
pprint(similar_articles.tolist())

['Startups in the Philippines/A Survival Guide for Startups in the '
 'Philippines/Startup Stacks in the Philippines/Using Octane AI and other bot '
 'platforms',
 'FOSS Open Content/Limitations of Open Content',
 'Mambo Open Source/Which options available for each usertype',
 'Using Wikibooks/Policy and Guidelines',
 'Principles of Sociology/Cohousing and the New Everyday Life Model',
 'Grsecurity/The RBAC System',
 'Business Analysis Guidebook/User Experience',
 "MediaWiki Administrator's Handbook/Protect",
 'Animal Behavior/Behavioral Genetics']


In [38]:
similar_abstracts = abstract_list[similar_articles_idxs]
pprint(similar_abstracts[2])

('There are a number of different user types in Mambo and it can be useful to '
 'know that privileges and rights each user type has Sometimes you may want a '
 'user to be able to write content but not publish it there are user types for '
 'this')


Lastly, we will create create a mini-dashboard containing,
1. a widget with all of our titles,
2. a function with the steps we followed above,
3. a panel object to store a title, the widget, and the function.

In [43]:
titles = small_df.clean_title.unique().tolist()
title_widget = pn.widgets.Select(value=choice(titles), options=titles, name='Articles')

In [44]:
@pn.depends(title_widget.param.value)
def article_recommender(title_widget):
    
    article_idx = np.where(articles_list == title_widget)[0][0]
    article_similarities = doc_sim_df.iloc[article_idx].values
    similar_title_idxs = np.argsort(-article_similarities)[1:6]
    similar_titles = articles_list[similar_title_idxs]
    
    return pn.Column(*similar_titles, width=600)

In [45]:
text = pn.pane.Markdown(f"# Small Recommendation Engine", style={"color": "#000000"}, width=600, height=50,
                        sizing_mode="stretch_width", margin=(10,10,10,5))

In [46]:
pn.Column(text, title_widget, article_recommender, align='center', width=600, height=300)

## 6. Topic Modeling

What is topic modeling?

> "In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both." ~ [Wikipedia](https://en.wikipedia.org/wiki/Topic_model)

As with the recommendation engine, topic modeling requires a bag of words for the representation of the data and, in contrast, it requires a topic number as the key parameter for the model.

In [47]:
vectorizer = CountVectorizer(strip_accents = 'unicode', min_df=0.035, max_df=0.80)

In [48]:
bow = vectorizer.fit_transform(small_df['clean_text'].values)
bow

<5000x1502 sparse matrix of type '<class 'numpy.int64'>'
	with 563153 stored elements in Compressed Sparse Row format>

In [49]:
topics = 20

In [50]:
lda_model = LatentDirichletAllocation(n_components=topics, # number of topics
                                      max_iter=100, # these are the amount of times the algorithm will run
                                      learning_method='online', 
                                      random_state=42, # setting a seed for reproducible results
                                      n_jobs=-1) # this parameter makes sure we use all of the cores in our machine

In [51]:
%%time

lda_model.fit(bow)

CPU times: user 50 s, sys: 16.1 s, total: 1min 6s
Wall time: 2min 9s


LatentDirichletAllocation(learning_method='online', max_iter=100,
                          n_components=20, n_jobs=-1, random_state=42)

In [52]:
def show_topics(vectorizer, lda_model, n_words=15):
    """
    This function takes our vectorizer, our model, and a
    number of words to display the topics from our model.
    """
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

In [53]:
show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=10)

[array(['data', 'model', 'information', 'map', 'meaning', 'native',
        'million', 'database', 'models', 'shared'], dtype='<U16'),
 array(['edit', 'source', '10', '12', '11', '01', '05', '13', 'contents',
        '14'], dtype='<U16'),
 array(['new', 'government', 'states', 'state', 'world', 'business',
        'american', 'company', 'united', 'economic'], dtype='<U16'),
 array(['number', 'line', 'value', 'items', 'table', 'characters', 'text',
        'id', 'command', 'item'], dtype='<U16'),
 array(['cell', 'cells', 'energy', 'water', 'light', 'system', 'called',
        'different', 'form', 'source'], dtype='<U16'),
 array(['de', 'end', 'board', 'begin', 'count', 'en', 'false', 'al',
        'integer', 'true'], dtype='<U16'),
 array(['question', 'en', 'english', 'words', 'word', 'yes', 'language',
        'lesson', 'like', 'questions'], dtype='<U16'),
 array(['page', 'book', 'document', 'section', 'text', 'chapter',
        'article', 'use', 'wikipedia', 'information'], dtype='<U1

In [64]:
terms = sorted(tf.vocabulary_.keys())

In [65]:
bow_docs = pd.DataFrame(tfidf_matrix.toarray(), columns=terms)
bow_docs.head()

Unnamed: 0,000,01,02,05,10,100,1000,101,11,110,...,writing,written,wrong,www,year,years,yes,york,young,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.012461,0.0,0.0,0.00698,0.053304,0.0,0.0,0.003266,0.0,...,0.0,0.004624,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0788
3,0.0,0.0,0.0,0.0,0.0,0.023378,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.022186,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [66]:
lda_model.components_??

[0;31mType:[0m            ndarray
[0;31mString form:[0m    
[[5.00047649e-02 5.00000006e-02 5.00000009e-02 ... 5.00000006e-02
  5.00000005e-02 5.00359565e-02]
 [5.00000007e-02 5.00000005e-02 5.00000008e-02 ... 5.31203020e+01
  3.57364442e+01 5.00000010e-02]
 [5.00000019e-02 5.00000003e-02 5.00000003e-02 ... 5.00000021e-02
  8.26463338e+00 1.31705620e+01]
 ...
 [9.73559120e+02 5.00000005e-02 5.00000011e-02 ... 6.51924658e+02
  1.42266503e+02 5.00000007e-02]
 [5.00000002e-02 5.00000003e-02 5.00000001e-02 ... 5.00000002e-02
  5.00000002e-02 2.95105585e+02]
 [5.00000044e-02 5.00000005e-02 5.00000018e-02 ... 5.00000005e-02
  5.00000006e-02 8.54510045e+00]]
[0;31mLength:[0m          20
[0;31mFile:[0m            ~/anaconda/envs/vector/lib/python3.9/site-packages/numpy/__init__.py
[0;31mDocstring:[0m       <no docstring>
[0;31mClass docstring:[0m
ndarray(shape, dtype=float, buffer=None, offset=0,
        strides=None, order=None)

An array object represents a multidimensional, homo

In [68]:
topic_term = pd.DataFrame(lda_model.components_.T, index=terms, columns=['topic_' + str(i) for i in range(topics)])
topic_term.tail()

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
years,30.022224,0.929457,0.058155,30.295352,1017.888384,4.965693,0.05,0.05,208.438423,55.894452,151.415479,68.159305,0.050001,20.010137,40.184057,0.05,680.04166,1073.750694,0.05,80.798978
yes,540.188462,65.632107,207.997319,0.05,108.930518,0.05,25.54992,0.05,33.749135,0.05,0.05,0.050009,25.636089,0.05,0.05,0.05,3.831653,0.05,20.317896,106.742375
york,0.05,53.120302,0.05,0.05,8.121724,0.05,0.05,0.05,0.090923,32.927806,0.05,93.241872,0.05,0.05,0.05,0.05,18.12586,651.924658,0.05,0.05
young,0.05,35.736444,8.264633,0.05,243.844635,0.05,0.05,0.05,79.47672,0.05,0.05,0.054446,98.256109,0.05,0.05,0.05,172.737895,142.266503,0.05,0.05
zero,0.050036,0.05,13.170562,52.581053,0.05,359.260702,412.7816,0.05,2.425016,0.05,0.05,0.05,17.477831,0.05,0.05,0.152368,0.05,0.05,295.105585,8.5451


In [71]:
doc_topic = pd.DataFrame(lda_model.transform(tfidf_matrix), index=small_df.clean_title, columns=['topic_' + str(i) for i in range(topics)])
doc_topic.head(3)

Unnamed: 0_level_0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
clean_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
Git,0.619392,0.006328,0.006328,0.258204,0.006328,0.006328,0.006328,0.006328,0.006328,0.006328,0.006328,0.006328,0.006328,0.006328,0.006328,0.006328,0.006328,0.014828,0.006328,0.006328
Japanese Phrasebook/At office,0.011309,0.011309,0.011309,0.011309,0.011309,0.011309,0.011309,0.011309,0.41579,0.011309,0.011309,0.011309,0.011309,0.011309,0.011309,0.011309,0.011309,0.011309,0.011309,0.380641
Advanced Mathematics for Engineers and Scientists/Scale Analysis,0.007549,0.007675,0.00763,0.007559,0.018644,0.430808,0.007945,0.007549,0.007656,0.007553,0.007549,0.007549,0.007549,0.007549,0.007549,0.007613,0.007568,0.007593,0.421366,0.007549
Exercise as it relates to Disease/Effect of Physical Activity on Older Adults with HIV,0.005974,0.012044,0.005975,0.005974,0.005975,0.005982,0.005974,0.005976,0.005974,0.005974,0.005977,0.014692,0.005975,0.005975,0.014758,0.005974,0.862907,0.005974,0.005974,0.005974
Magic: The Gathering/Mental Magic Format,0.355985,0.006207,0.006207,0.006207,0.428685,0.109815,0.006207,0.006207,0.006207,0.006207,0.006207,0.006207,0.006207,0.006207,0.006207,0.006207,0.006207,0.006207,0.006207,0.006207


In [69]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [80]:
pyLDAvis.sklearn.prepare(lda_model, bow, vectorizer)

  default_term_info = default_term_info.sort_values(


## 7. Summary

Blind Spots

With additional time we could have,
1. Further tweak the parameters of the vectorizers and models
2. Create visualizations of both, the best topics and the document similarity to find more interesting patters
3. Take the title of an article out of the body of the article
