> "You shall know a word by the company it keeps." ~ John R. Firth

![img](https://cdn.shopify.com/s/files/1/0867/3580/products/vinyl_decal_hello_words_cloud_ig4779_1800x1800.jpg?v=1571439560)

## Learning Outcomes

By the end of this tutorial you will

1. Have a better understanding of what is natural language processing and what are some of its applications.
2. Learn about the root of a word, what is means, and why we use them.
3. Be able to create a recommendation system based on text similarity.
4. Be able to conduct topic modeling on your own corpus.
5. Understand how to put together a simple app using panel.

Assumptions about you

- Have at least 1 year of coding experience in Python.
- Are comfortable with loops, functions, lists comprehensions, and if-else statements.
- Have some knowledge of pandas and NumPy.
- Have at least 5 GB of free space in your computer.
- While it is not required to have experience using Jupyter Notebooks, this would be very beneficial for the session.

What this tutorial is not

- A deep dive into Natural Language Processing.
- A deep learning tutorial.
- A web application tutorial.

## Table of Contents

1. Libraries
2. The Data
3. Flash NLP Intro
4. Cleaning
5. Recommendation System
6. Topic Modeling (Optional)
7. Summary

## 1. Libraries

In [1]:
import json, re, spacy
import pandas as pd, numpy as np
from pprint import pprint
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
import panel as pn
from concurrent.futures import ProcessPoolExecutor
from datasets import load_dataset
import datasets

pn.extension()

%load_ext autoreload
%autoreload 2

## 2. The Data

With have been given a random corpus of news articles and our task is to come up with a product(s), a recommendations systems (and a set of topic that best explains the model). The data consist of news articles plus some additional columns inside for which you can find more information in the table below.

| Column | Content |
|--------|---------|
|title |Title of article|
|text | Text inside article|
|domain | Domain Url of article|
|date | YYYY-MM-DD Time|
|description | Abstract of article|
|url | Url of article|
|image_url | Image if available|

In addition, here is the full description of the dataset from Huggin Face.

> "CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset has been prepared using news-please - an integrated web crawler and information extractor for news.
It contains 708241 English language news articles published between Jan 2017 and December 2019. It represents a small portion of the English language subset of the CC-News dataset." ~ [Hugging Face cc_news](https://huggingface.co/datasets/cc_news)

Before we do any data cleaning, let's read in the data and explore it a bit.

In [None]:
# this cell should only be used if running the notebook on Binder
# datasets.config.IN_MEMORY_MAX_SIZE = 500_000_000

In [2]:
%%time

# use this one if running in BINDER
# dataset = load_dataset('cc_news', keep_in_memory=500_000_000)

dataset = load_dataset('cc_news')

Reusing dataset cc_news (/home/ramonperez/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/6cdde8d7fdaae3e50fb61b5d08d5387c2f0bbea1ee68755ef954af539a6a3a1b)


CPU times: user 135 ms, sys: 63.3 ms, total: 198 ms
Wall time: 1.55 s


Let's see how many articles we have and then examine the columns.

In [3]:
dataset.shape

{'train': (708241, 7)}

In [4]:
dataset.column_names['train']

['title', 'text', 'domain', 'date', 'description', 'url', 'image_url']

Now that we have a dictionary, we can create a pandas DataFrame.

In [5]:
# df = next(dataset['train'].to_pandas(batched=True, batch_size=2_000)).reset_index(drop=True)

df = dataset['train'].to_pandas().sample(5_000).reset_index(drop=True)

In [6]:
df.head()

Unnamed: 0,title,text,domain,date,description,url,image_url
0,"सड़क हादसे में दरोगा की मौत, कोहराम",Read the latest and breaking Hindi news on ama...,www.amarujala.com,2018-02-04 00:00:00,0 भतीजे की पत्नी की तेरहवीं में शामिल होने गए ...,https://www.amarujala.com/uttar-pradesh/allaha...,https://spiderimg.amarujala.com/assets/images/...
1,A star-studded dinner in the WA Outback,The woman in the red sequined dress and matchi...,www.dailymail.co.uk,2018-05-31 06:42:00,The woman in the red sequined dress and matchi...,http://www.dailymail.co.uk/wires/aap/article-5...,http://i.dailymail.co.uk/i/pix/m_logo_636x382p...
2,Solar eclipse scams to watch out for,× Solar eclipse scams to watch out for\nThe Be...,myfox8.com,2017-08-14 16:01:25,The Better Business Bureau (BBB) of Central No...,http://myfox8.com/2017/08/14/solar-eclipse-sca...,https://localtvwghp.files.wordpress.com/2017/0...
3,Soccer-Lukaku strikes twice as United crush We...,* Manchester United beat West Ham United 4-0 a...,in.reuters.com,2017-08-13 17:01:51,* Manchester United beat West Ham United 4-0 a...,https://in.reuters.com/article/soccer-england-...,https://s4.reutersmedia.net/resources_v2/image...
4,Mitch's Sports Report: Montreal Cans Coach Wit...,"File under ""didn't see this coming."" Yesterday...",digital.vpr.net,,"File under ""didn't see this coming."" Yesterday...",http://digital.vpr.net/post/mitchs-sports-repo...,http://mediad.publicbroadcasting.net/p/vpr/fil...


## 3. Flash NLP Intro

We can use the `.loc[index, column]` method on our dataframe, select one column and one row using a comma to separate both, and examine a prettier version of the text using the python function `pprint()`.

In [7]:
random_article = df.iloc[10, 1]
pprint(random_article)

('NEW KNOXVILLE – At their meeting Monday night, the New Knoxville Board of '
 'Education approved the hiring of a new varsity basketball coach and '
 'congratulated a student achieving an Arts Honors Diploma.\n'
 'Michael Piatt, as the new varsity basketball coach, said he is looking '
 'forward to building a team that not only focuses on sports but achievement. '
 'His most recent position was as assistant varsity coach at Chaminade '
 'Julienne in Dayton. He coached Sidney Lehman Cavaliers from 2002 to 2007, '
 'leading the team to an 18-5 record and a sectional championship in the '
 '2006-07 season.\n'
 'High School senior Brittany Bambauer gave the board a presentation on how '
 'she achieved an Arts Honors Diploma. Jenny Fledderjohann, principal grades 4 '
 '– 12, said Baumbauer, had to achieve a 3.5 grade point average, and an ACT '
 'score of 27 or higher. Baumbauer, who plans to attend Ohio University '
 'pursuing a Music Therapy bachelor’s degree, said it was her field exper

Notice how the review above is quite messy and it has a lot of characters that, for all intents and purposes, will not be useful for our analysis. Let's examine a cleaner version of the article above by running it through spaCy's tokenizer. When we tokenize a document, we are separating all of its content into each of its components, i.e. words, numbers, punctiations and the like, to make it easier to process it, clean it, and to run computations on it.

For this part, we will load an english model, instantiate it and pass an example article through it. You may need to run the cell below first to download the english model.

In [8]:
# !python -m spacy download en_core_web_sm

In [9]:
nlp = spacy.load("en_core_web_sm")

In [10]:
parsed_article = nlp(random_article)

In [11]:
parsed_article

NEW KNOXVILLE – At their meeting Monday night, the New Knoxville Board of Education approved the hiring of a new varsity basketball coach and congratulated a student achieving an Arts Honors Diploma.
Michael Piatt, as the new varsity basketball coach, said he is looking forward to building a team that not only focuses on sports but achievement. His most recent position was as assistant varsity coach at Chaminade Julienne in Dayton. He coached Sidney Lehman Cavaliers from 2002 to 2007, leading the team to an 18-5 record and a sectional championship in the 2006-07 season.
High School senior Brittany Bambauer gave the board a presentation on how she achieved an Arts Honors Diploma. Jenny Fledderjohann, principal grades 4 – 12, said Baumbauer, had to achieve a 3.5 grade point average, and an ACT score of 27 or higher. Baumbauer, who plans to attend Ohio University pursuing a Music Therapy bachelor’s degree, said it was her field experience with music therapist Brittany Scherer that convinc

Notice how much nicer our article looks now.

We can also grab the sentences and view them one by one using the attribute `.sents` and the built in python function `next()`, since the attribute of a document that has been tokenized by spacy will always return an iterator. Conversely, we can add it to a loop and show each of the sentences in an article.

In [12]:
next(enumerate(parsed_article.sents))

(0,
 NEW KNOXVILLE – At their meeting Monday night, the New Knoxville Board of Education approved the hiring of a new varsity basketball coach and congratulated a student achieving an Arts Honors Diploma.)

In [13]:
for num, sentence in enumerate(parsed_article.sents):
    print(f"Sentence #{num}:\n {sentence}\n")

Sentence #0:
 NEW KNOXVILLE – At their meeting Monday night, the New Knoxville Board of Education approved the hiring of a new varsity basketball coach and congratulated a student achieving an Arts Honors Diploma.

Sentence #1:
 


Sentence #2:
 Michael Piatt, as the new varsity basketball coach, said he is looking forward to building a team that not only focuses on sports but achievement.

Sentence #3:
 His most recent position was as assistant varsity coach at Chaminade Julienne in Dayton.

Sentence #4:
 He coached Sidney Lehman Cavaliers from 2002 to 2007, leading the team to an 18-5 record and a sectional championship in the 2006-07 season.

Sentence #5:
 
High School senior Brittany Bambauer gave the board a presentation on how she achieved an Arts Honors Diploma.

Sentence #6:
 Jenny Fledderjohann, principal grades 4 – 12, said Baumbauer, had to achieve a 3.5 grade point average, and an ACT score of 27 or higher.

Sentence #7:
 Baumbauer, who plans to attend Ohio University pursu

We can also have a look at the different kinds of entities in an article. These entities can be a person (called PERSON), and number (called CARDINAL), a geopolitical entity (called GPE), etc.

In [14]:
for num, entity in enumerate(parsed_article.ents):
    print(f"Entity #{num}: {entity} -- {entity.label_}\n")

Entity #0: Monday -- DATE

Entity #1: the New Knoxville Board of Education -- ORG

Entity #2: Michael Piatt -- PERSON

Entity #3: Julienne -- PERSON

Entity #4: Dayton -- GPE

Entity #5: Sidney Lehman Cavaliers -- ORG

Entity #6: 2002 -- DATE

Entity #7: 18-5 -- DATE

Entity #8: 2006-07 season -- DATE

Entity #9: High School -- ORG

Entity #10: Brittany Bambauer -- PERSON

Entity #11: Jenny Fledderjohann -- PERSON

Entity #12: 4 -- CARDINAL

Entity #13: 12 -- CARDINAL

Entity #14: Baumbauer -- ORG

Entity #15: 3.5 -- CARDINAL

Entity #16: 27 -- CARDINAL

Entity #17: Baumbauer -- ORG

Entity #18: Ohio University -- ORG

Entity #19: Music Therapy -- WORK_OF_ART

Entity #20: Brittany Scherer -- PERSON

Entity #21: Fledderjohann -- PERSON

Entity #22: Andrea Ott -- PERSON

Entity #23: New Knoxville -- GPE

Entity #24: April 4 -- DATE

Entity #25: Jon Stammen -- PERSON

Entity #26: first -- ORDINAL

Entity #27: New Knoxville High School Band -- ORG

Entity #28: Troy High School -- ORG

Enti

In [16]:
spacy.explain("LOC")

'Non-GPE locations, mountain ranges, bodies of water'

We can also check weather a word is a stopword or a punctuation, or we can even lemmatize our articles. Lemmatization is a way of taking the root of a word and bringing similar words to a common denominator, for example, `was` will become `be` and most plural words will become singular words.

In [17]:
new_list = []

for token in parsed_article:
    new_list.append(token.text)
    
    
new_list[:10]

['NEW',
 'KNOXVILLE',
 '–',
 'At',
 'their',
 'meeting',
 'Monday',
 'night',
 ',',
 'the']

In [18]:
new_list = [token.text for token in parsed_article]

new_list[:10]

['NEW',
 'KNOXVILLE',
 '–',
 'At',
 'their',
 'meeting',
 'Monday',
 'night',
 ',',
 'the']

In [19]:
# here we are taking out of the parsed article each token
token_text = [token.text for token in parsed_article]

# here we are lemmatizing each word possible
token_lemmas = [token.lemma_ for token in parsed_article]

# stopwords are very common so here we will extract a variable that will tell us whether
# a token is a stopword or not
token_stop = [token.is_stop for token in parsed_article]

# a token is a pinctuation or not
token_punc = [token.is_punct for token in parsed_article]

# we will now add all three to a dataframe and display it without assigning it to a variable
pd.DataFrame(zip(token_text, token_lemmas, token_punc, token_stop), columns=['Original Text', 'Lemmatized Text', 'Punctuations', 'stopwords']).head(50)

Unnamed: 0,Original Text,Lemmatized Text,Punctuations,stopwords
0,NEW,new,False,False
1,KNOXVILLE,KNOXVILLE,False,False
2,–,–,True,False
3,At,at,False,True
4,their,their,False,True
5,meeting,meeting,False,False
6,Monday,Monday,False,False
7,night,night,False,False
8,",",",",True,False
9,the,the,False,True


## 4. Cleaning

Let's start by checking if our dataset contains any missin values, and then evaluate the amount of memory we are currently using from our machine.

In [20]:
df.isna().sum()

title          0
text           0
domain         0
date           0
description    0
url            0
image_url      0
dtype: int64

In [21]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        5000 non-null   object
 1   text         5000 non-null   object
 2   domain       5000 non-null   object
 3   date         5000 non-null   object
 4   description  5000 non-null   object
 5   url          5000 non-null   object
 6   image_url    5000 non-null   object
dtypes: object(7)
memory usage: 23.8 MB


Depending on the random sample you choose at the beginning, you may or may not have a lot. If so, getting rid of the columns you don't need will help release some of the memory in your machine.

In [22]:
df.drop(['url', 'image_url', 'domain'], axis=1, inplace=True)

Perfect! Let's now extract the `text` column and normalize it. This means we will use `spacy` to,
- take out anything that is not a word or a number,
- convert to lower case,
- strip the spaces around the words,
- tokenize the articles,
- remove stopwords (we will use spaCy's list of stopwords for this),
- and then join the cleaned tokens back together.

In [23]:
articles = df['text'].values

In [24]:
from spacy.lang.en.stop_words import STOP_WORDS
len(STOP_WORDS), STOP_WORDS

(326,
 {"'d",
  "'ll",
  "'m",
  "'re",
  "'s",
  "'ve",
  'a',
  'about',
  'above',
  'across',
  'after',
  'afterwards',
  'again',
  'against',
  'all',
  'almost',
  'alone',
  'along',
  'already',
  'also',
  'although',
  'always',
  'am',
  'among',
  'amongst',
  'amount',
  'an',
  'and',
  'another',
  'any',
  'anyhow',
  'anyone',
  'anything',
  'anyway',
  'anywhere',
  'are',
  'around',
  'as',
  'at',
  'back',
  'be',
  'became',
  'because',
  'become',
  'becomes',
  'becoming',
  'been',
  'before',
  'beforehand',
  'behind',
  'being',
  'below',
  'beside',
  'besides',
  'between',
  'beyond',
  'both',
  'bottom',
  'but',
  'by',
  'ca',
  'call',
  'can',
  'cannot',
  'could',
  'did',
  'do',
  'does',
  'doing',
  'done',
  'down',
  'due',
  'during',
  'each',
  'eight',
  'either',
  'eleven',
  'else',
  'elsewhere',
  'empty',
  'enough',
  'even',
  'ever',
  'every',
  'everyone',
  'everything',
  'everywhere',
  'except',
  'few',
  'fifteen',

In [25]:
def normalize_doc(doc):
    """
    This function normalizes your list of documents by taking only
    words, numbers, and spaces in between them. It then filters out
    stop words.
    """
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    tokens = nlp(doc)
    filtered_tokens = [token.lemma_ for token in tokens if not token.is_stop]
    doc = ' '.join(filtered_tokens).replace(" \n ", "")
    return doc

In [26]:
random_article

'NEW KNOXVILLE – At their meeting Monday night, the New Knoxville Board of Education approved the hiring of a new varsity basketball coach and congratulated a student achieving an Arts Honors Diploma.\nMichael Piatt, as the new varsity basketball coach, said he is looking forward to building a team that not only focuses on sports but achievement. His most recent position was as assistant varsity coach at Chaminade Julienne in Dayton. He coached Sidney Lehman Cavaliers from 2002 to 2007, leading the team to an 18-5 record and a sectional championship in the 2006-07 season.\nHigh School senior Brittany Bambauer gave the board a presentation on how she achieved an Arts Honors Diploma. Jenny Fledderjohann, principal grades 4 – 12, said Baumbauer, had to achieve a 3.5 grade point average, and an ACT score of 27 or higher. Baumbauer, who plans to attend Ohio University pursuing a Music Therapy bachelor’s degree, said it was her field experience with music therapist Brittany Scherer that conv

In [27]:
normalize_doc(random_article)

'new knoxville   meeting monday night new knoxville board education approve hiring new varsity basketball coach congratulate student achieve art honor diplomamichael piatt new varsity basketball coach say look forward build team focus sport achievement recent position assistant varsity coach chaminade julienne dayton coach sidney lehman cavalier 2002 2007 lead team 185 record sectional championship 200607 seasonhigh school senior brittany bambauer give board presentation achieve art honor diploma jenny fledderjohann principal grade 4   12 say baumbauer achieve 35 grade point average act score 27 high baumbauer plan attend ohio university pursue music therapy bachelor degree say field experience music therapist brittany scherer convince want work rest life particularly find work hospice dementia program rewardingfledderjohann say andrea ott recognize franklin b walter recipient new knoxville april 4 ott turn recognize teacher jon stamman teacher influencesay time school history new knox

Since we have quite a few articles, this operation can take quite some time unless we do the cleaning process concurrently or in parallel. We will do this using the `ProcessPoolExecutor()` from the `concurrent.futures` module.

In [28]:
%%time

with ProcessPoolExecutor(max_workers=6) as e:
    processed_articles = list(e.map(normalize_doc, articles))

CPU times: user 1.77 s, sys: 586 ms, total: 2.36 s
Wall time: 38 s


We will add the cleaned versions of the documents back into the dataframe and loop over these while taking the lenght (in characters terms) of each article.

In [29]:
%%time

df['clean_text'] = processed_articles
df['len_clean_text'] = df['clean_text'].apply(len)
df['len_dirty_text'] = df['text'].apply(len)

CPU times: user 5.62 ms, sys: 0 ns, total: 5.62 ms
Wall time: 5.12 ms


Let's now save our cleaned dataset in case we need to restart our notebook and begin the analysis again. We will also release a bit of memory by getting rid of all the data and variables we have loaded up since the beginning of the notebook.

In [30]:
df.head(2)

Unnamed: 0,title,text,date,description,clean_text,len_clean_text,len_dirty_text
0,"सड़क हादसे में दरोगा की मौत, कोहराम",Read the latest and breaking Hindi news on ama...,2018-02-04 00:00:00,0 भतीजे की पत्नी की तेरहवीं में शामिल होने गए ...,read late break hindi news amarujalacom live h...,195,296
1,A star-studded dinner in the WA Outback,The woman in the red sequined dress and matchi...,2018-05-31 06:42:00,The woman in the red sequined dress and matchi...,woman red sequined dress matching hat tell goo...,3244,5214


In [31]:
%%time

df[['title', 'date', 'clean_text', 'len_clean_text', 
    'len_dirty_text']].reset_index(drop=True).to_parquet('articles_clean.parquet', compression='snappy')

CPU times: user 51.9 ms, sys: 15.9 ms, total: 67.7 ms
Wall time: 84.9 ms


In [32]:
del dataset
del df
del articles
del processed_articles

In [33]:
df = pd.read_parquet('articles_clean.parquet').reset_index(drop=True)

It wouldn't make any sense to feed to our algorithms articles with a tiny amount of characters, so let's examine the distribution of characters among both, the raw and the clean version of our articles.

In [34]:
df[['len_clean_text', 'len_dirty_text']].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
len_clean_text,5000.0,1491.3632,1750.005216,23.0,472.0,1071.5,1929.0,30622.0
len_dirty_text,5000.0,2350.2048,2628.579089,68.0,711.75,1714.0,3091.0,41090.0


In [35]:
df[['len_clean_text', 'len_dirty_text']].skew()

len_clean_text    5.710948
len_dirty_text    4.706806
dtype: float64

![img](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn.analyticsvidhya.com%2Fwp-content%2Fuploads%2F2020%2F06%2Fsk1.png&f=1&nofb=1)

Now that we know we have a skewed distribution of characters, let's fix that by setting up a rule. We'll evaluate an article using the tweets' maximum character count of 280, at the time of writing, and filter out all articles with less than that. Let's check how many we have first.

In [36]:
greater_than_a_tweet = df['len_clean_text'] > 280
greater_than_a_tweet.sum()

4255

In [37]:
df = df[greater_than_a_tweet].copy()

In [38]:
df.shape

(4255, 5)

# 5. Recommendation System

Recommendation systems can come in many different forms and sizes. We can create a system that takes into account the behaviour of other users, or a system that only looks at similar articles or items to make a recommendation. Both are powerful systems and could cover an entire section of a book in their own right, which is why we will focus on the latter category, the one that makes recommendations based on similar articles.

To create our recommendation system we first need to convert our articles into a numerical representation. We do this with a so-called bag of words (bow). BOWs are matrices with the documents in the rows, the terms contained in all documents along the columns. The frequency with which each term appears in each document along the values can be found in the doc-token combination. To create this kind of representation we can use `sklearn`'s `CountVectorizer` or `TfidfVectorizer` classes. The latter being the normalized version of the former, i.e. the frequency of a word divided by the amount of documents in which it appears.

To use this classes we first instantiate them, fit the data to them so that they can learn the vocabulary of our corpus, and then we tranform the corpus into a sparse matrix. These sparse matrices hold the location of all non-zero values to make it easier to store the data and compute on it.

In [39]:
%%time

# we first instantiate our class
tf = TfidfVectorizer(min_df=0.035, max_df=0.80)

# we can fit and transform the data in the same step
tfidf_matrix = tf.fit_transform(df['clean_text'].values)

# evaluate the shape of our matrix
tfidf_matrix.shape

CPU times: user 621 ms, sys: 13.4 ms, total: 634 ms
Wall time: 661 ms


(4255, 789)

We can access our vocabulary with `.get_feature_names()` method.

In [40]:
tf.get_feature_names()[500:550]

['park',
 'particularly',
 'partner',
 'party',
 'pass',
 'past',
 'pay',
 'people',
 'percent',
 'perform',
 'performance',
 'period',
 'person',
 'personal',
 'phone',
 'photo',
 'pick',
 'picture',
 'piece',
 'place',
 'plan',
 'play',
 'player',
 'pm',
 'point',
 'police',
 'policy',
 'political',
 'popular',
 'position',
 'possible',
 'post',
 'potential',
 'power',
 'practice',
 'prepare',
 'present',
 'president',
 'press',
 'pressure',
 'pretty',
 'previous',
 'previously',
 'price',
 'prime',
 'private',
 'probably',
 'problem',
 'process',
 'produce']

The next step is to get the distance between documents and words to see how close and how far, based on words only, are two documents from one another. The `cosine_similarity` function we imported earlier can do this for us, and afterwards, we can create a dataframe to evaluate our results.

**Note:** this operation can take a few minutes if you are using the entire dataset. Make sure to grab some ☕️ 😎

In [41]:
%%time

doc_sim = cosine_similarity(tfidf_matrix)

CPU times: user 610 ms, sys: 54 ms, total: 664 ms
Wall time: 664 ms


In [42]:
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4245,4246,4247,4248,4249,4250,4251,4252,4253,4254
0,1.0,0.071004,0.066551,0.119413,0.059563,0.118079,0.093205,0.074626,0.031717,0.144751,...,0.127054,0.029444,0.0777,0.061489,0.093591,0.051793,0.059525,0.15409,0.091389,0.091785
1,0.071004,1.0,0.032688,0.112728,0.16247,0.16917,0.078317,0.210689,0.154278,0.092376,...,0.068239,0.095303,0.032358,0.095482,0.037728,0.055266,0.022597,0.188686,0.148438,0.129617
2,0.066551,0.032688,1.0,0.13783,0.089412,0.102357,0.080342,0.051884,0.037688,0.079271,...,0.027563,0.019178,0.098917,0.022088,0.170996,0.0113,0.072202,0.01737,0.042323,0.073465
3,0.119413,0.112728,0.13783,1.0,0.078428,0.189386,0.081748,0.197049,0.051337,0.270715,...,0.031584,0.079312,0.202865,0.033338,0.019398,0.127421,0.127168,0.090455,0.06277,0.04206
4,0.059563,0.16247,0.089412,0.078428,1.0,0.053798,0.143617,0.065483,0.11021,0.095153,...,0.021248,0.085804,0.026069,0.039735,0.059024,0.02339,0.01057,0.089069,0.059263,0.140952


In [43]:
doc_sim.shape

(4255, 4255)

The reason we see a X000xX000 matrix is because both halfs alonside the diagonal line are identical, hence, we have the similarity of all docs vs all docs.

In [44]:
articles_list = df['title'].values
articles_list.shape, articles_list

((4255,),
 array(['A star-studded dinner in the WA Outback',
        'Solar eclipse scams to watch out for',
        'Soccer-Lukaku strikes twice as United crush West Ham', ...,
        'Tory Burch Wants You to Own Your Ambition',
        'No free power to gaushalas in Punjab: Gau Sewa chief questions cow cess collection, writes to power utility',
        'West Sussex woman jailed for threatening horses'], dtype=object))

Let's now
1. pick a title at random
2. get the index of such title
3. select the corresponding row for such title in our new document similarity dataframe
4. sort the index of such values
5. return the top 5 article titles

In [45]:
from random import choice

In [46]:
a_title = choice(articles_list)
a_title

'Hackers release more HBO episode shows: report'

In [47]:
article_idx = np.where(articles_list == a_title)[0][0]
article_idx

3208

In [48]:
article_similarities = doc_sim_df.iloc[article_idx].values
article_similarities

array([0.0790154 , 0.13307936, 0.02699896, ..., 0.06602159, 0.0306015 ,
       0.10460289])

In [49]:
# note that we don't select the first one as this should always be one
similar_articles_idxs = np.argsort(-article_similarities)[1:10]
similar_articles_idxs

array([ 898, 2488, 1730, 1948, 3595, 2758, 2575,  795, 1389])

In [52]:
doc1 = nlp(df.loc[1, "clean_text"])
doc2 = nlp(df.loc[2, "clean_text"])

In [53]:
doc1.similarity(doc2)

  doc1.similarity(doc2)


0.9809930330519058

In [54]:
a_title

'Hackers release more HBO episode shows: report'

In [55]:
similar_articles = articles_list[similar_articles_idxs]
pprint(similar_articles.tolist())

['Hackers Leak More Unaired Episodes Of HBO Shows, Network Refuses To Comment',
 '‘Pokémon Go’ Reaches 800 Million Downloads',
 'In the time it takes to tweet, Roseanne Barr loses her job, keeps tweeting',
 "On the cards: Revenue surge for China's Tencent from popular fantasy game",
 'Cleveland rallies to stun Yankees, takes commanding 2-0 ALDS lead',
 'HQ Trivia will give away its biggest cash prize ever -- depending on the NBA '
 'Finals',
 'Embrace The Night With This Batman Gift Guide',
 'Rangers focusing on playing better in Game 4 vs. Canadiens',
 "TV show Roseanne axed after star's racist tweet sparks outrage"]


Lastly, we will create create a mini-dashboard containing,
1. a widget with all of our titles,
2. a function with the steps we followed above,
3. a panel object to store a title, the widget, and the function.

In [56]:
titles = df.title.unique().tolist()
title_widget = pn.widgets.Select(value=choice(titles), options=titles, name='Articles')
title_widget

In [57]:
@pn.depends(title_widget.param.value)
def article_recommender(title_widget):
    
    article_idx = np.where(articles_list == title_widget)[0][0]
    article_similarities = doc_sim_df.iloc[article_idx].values
    similar_title_idxs = np.argsort(-article_similarities)[1:6]
    similar_titles = articles_list[similar_title_idxs]
    
    return pn.Column(*similar_titles, width=600)

In [58]:
text = pn.pane.Markdown(f"# Small Recommendation Engine", style={"color": "#000000"}, width=600, height=50,
                        sizing_mode="stretch_width", margin=(10,10,10,5))

In [59]:
pn.Column(text, title_widget, article_recommender, align='center', width=600, height=300).show()

Launching server at http://localhost:41901


<bokeh.server.server.Server at 0x7f2ad216b490>











Opening in existing browser session.




## 6. Topic Modeling

What is topic modeling?

> "In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both." ~ [Wikipedia](https://en.wikipedia.org/wiki/Topic_model)

As with the recommendation engine, topic modeling requires a bag of words for the representation of the data and, in contrast, it requires a topic number as the key parameter for the model.

In [60]:
vectorizer = CountVectorizer(strip_accents = 'unicode', min_df=0.035, max_df=0.80)

In [61]:
bow = vectorizer.fit_transform(df['clean_text'].values)
bow

<4255x789 sparse matrix of type '<class 'numpy.int64'>'
	with 261772 stored elements in Compressed Sparse Row format>

What is Latent Dirichlet Allocation?

> "In natural language processing, the Latent Dirichlet Allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning field and in a wider sense to the artificial intelligence field." ~ [Wikipedia](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

In [62]:
topics = 10

In [63]:
lda_model = LatentDirichletAllocation(n_components=topics, # number of topics
                                      max_iter=100, # these are the amount of times the algorithm will run
                                      learning_method='online', 
                                      random_state=42, # setting a seed for reproducible results
                                      n_jobs=2) # this parameter makes sure we use all of the cores in our machine

In [64]:
%%time

lda_model.fit(bow)

CPU times: user 7.44 s, sys: 766 ms, total: 8.21 s
Wall time: 1min 10s


LatentDirichletAllocation(learning_method='online', max_iter=100, n_jobs=2,
                          random_state=42)

We will create a function to explore the topics and their words to see if we can tease apart the main idea of a topic.

In [65]:
def show_topics(vectorizer, lda_model, n_words=15):
    """
    This function takes our vectorizer, our model, and a
    number of words to display the topics from our model.
    """
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

Play around with the topic number and the words evaluated to see which amounts makes most sense to you./

In [66]:
show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=10)

[array(['say', 'trump', 'president', 'government', 'minister', 'party',
        'white', 'election', 'country', 'state'], dtype='<U14'),
 array(['say', 'police', 'fire', 'man', 'car', 'people', 'kill',
        'hospital', 'report', 'attack'], dtype='<U14'),
 array(['market', 'company', 'million', 'percent', 'price', 'share',
        'report', 'year', 'rate', 'bank'], dtype='<U14'),
 array(['game', 'win', 'team', 'play', 'season', 'player', 'second',
        'point', 'league', 'goal'], dtype='<U14'),
 array(['school', 'student', 'child', 'health', 'family', 'university',
        'help', 'care', 'study', 'people'], dtype='<U14'),
 array(['nt', 'say', 'like', 'go', 'time', 'know', 'year', 'think', 'come',
        'good'], dtype='<U14'),
 array(['new', 'june', 'st', '10', 'pm', 'open', 'art', '2017', 'event',
        'center'], dtype='<U14'),
 array(['low', 'high', 'night', 'chance', '2018', '10', '30', 'monday',
        '50', '20'], dtype='<U14'),
 array(['say', 'state', 'court', 'law', '

In [67]:
terms = sorted(vectorizer.vocabulary_.keys())

In [68]:
bow_docs = pd.DataFrame(bow.toarray(), columns=terms)
bow_docs.head()

Unnamed: 0,10,100,11,12,13,14,15,16,17,18,...,word,work,worker,world,worth,write,wrong,year,york,young
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,1,0,0,0,0,0,...,0,1,0,1,0,0,0,5,0,0
4,0,0,0,0,1,0,0,1,0,0,...,0,0,0,1,0,0,0,2,0,1


The components of our model can be found `lda_model.components_` and can help us create different sets of dataframes, namely, terms-to-topics and document-to-topics. The former has as its values the number of times a word is assigned in a topic, and the latter is the probabily of the words in a document being contained in a topic.

In [69]:
topic_term = pd.DataFrame(lda_model.components_.T, index=terms, columns=['topic_' + str(i) for i in range(topics)])
topic_term.tail()

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9
write,74.189873,0.100028,10.225397,6.065622,6.551513,413.710713,0.100046,0.100013,164.19817,4.254628
wrong,17.391785,11.178896,0.100012,1.601129,0.100019,144.933104,0.100016,0.100017,14.702245,0.100012
year,154.857333,59.236479,674.78377,438.706439,263.225941,1284.03018,390.252698,0.100012,480.880944,390.764092
york,58.687343,16.602567,47.021927,16.896222,0.100022,77.728866,263.455047,10.831248,101.901696,2.175126
young,0.100036,95.905954,0.100021,63.053213,157.854388,281.059554,32.529527,0.101015,0.100543,3.443631


In [70]:
doc_topic = pd.DataFrame(lda_model.transform(bow), index=df.title, columns=['topic_' + str(i) for i in range(topics)])
doc_topic.tail(3)

Unnamed: 0_level_0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Tory Burch Wants You to Own Your Ambition,0.128757,0.001316,0.001316,0.001316,0.072724,0.743912,0.04671,0.001316,0.001316,0.001316
"No free power to gaushalas in Punjab: Gau Sewa chief questions cow cess collection, writes to power utility",0.001205,0.001205,0.024036,0.001205,0.001205,0.001205,0.001205,0.039617,0.927911,0.001205
West Sussex woman jailed for threatening horses,0.001667,0.502998,0.001667,0.001667,0.094829,0.001667,0.001667,0.001667,0.390503,0.001667


Lastly, a good way to examine the output of an LDA model is by visulizing it with nice graphs and for this we have, `pyLDAvis`. Which is a python library for visualizing topic modeling. We first load it with it's sklearn backend while enabling the notebook setting. Next we use `pyLDAvis.sklearn.prepare` and pass in our model, the bag of words, and the fitted vectorizer to get a nice interactive visualization tool.

In [71]:
# !pip install pyLDAvis

In [72]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [73]:
pyLDAvis.sklearn.prepare(lda_model, bow, vectorizer)

  default_term_info = default_term_info.sort_values(
  from imp import reload
  from imp import reload
  from imp import reload
  from imp import reload
  from imp import reload
  from imp import reload


## 7. Summary

Blind Spots

With additional time we could have,
1. Further tweak the parameters of the vectorizers and models;
2. Create visualizations of both, the best topics and the document similarity to find more interesting patters;
3. Take the title of an article out of the body of the article to create a better, less bias representation of the words within a document;
4. Using Pytorch's nn.CosineSimilarity would help a lot with increasing the efficiency of our recommendation system;
5. There should have been a lemmatization step in the preprocessing stage.

Takeaways,
1. Recommendation systems and topic modeling are both unsupervised methods;
2. Recommendation systems can be created with or without users behavioural data;
3. Topic modeling compresses the data into the most important and meaninful words set by you;
4. Creating bags of words requires careful attention to the parameters;
5. Where possible, showcase a model or system in a mini-dashboard or data visualization.