# Intro to Natural Language Processing 

> "You shall know a word by the company it keeps." ~ John R. Firth

![img](https://cdn.shopify.com/s/files/1/0867/3580/products/vinyl_decal_hello_words_cloud_ig4779_1800x1800.jpg?v=1571439560)

## Learning Outcomes

By the end of this tutorial you will

1. Have a better understanding of what natural language processing is and what some of its applications are.
2. Know what the root (lemma) of a word is, what it means, and why we use it.
3. Be able to create a recommendation system based on text similarity.
4. Understand how to put together a simple app using Panel.

Assumptions about you

- Have at least 1 year of coding experience in Python.
- Are comfortable with loops, functions, list comprehensions, and if-else statements.
- Have some knowledge of pandas and NumPy.
- Have at least 15 GB of free space on your computer.
- While it is not required to have experience using Jupyter Notebooks, this would be very beneficial for the session.

What this tutorial is not

- A deep dive into Natural Language Processing.
- A deep learning tutorial.
- A web application tutorial.

## Table of Contents

1. Libraries
2. The Data
3. Flash NLP Intro
4. Cleaning
5. Recommendation System
6. Summary

## 1. Libraries

Install the following libraries if they are not already available.

In [3]:
!pip install -U spacy panel datasets scikit-learn pandas pyarrow



In [1]:
import json, re, spacy
import pandas as pd, numpy as np
from pprint import pprint
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
import panel as pn
from concurrent.futures import ProcessPoolExecutor
from datasets import load_dataset
import datasets

pn.extension()

%load_ext autoreload
%autoreload 2



## 2. The Data

We have been given a random corpus of news articles, and our task is to build a product from it: a recommendation system (and a set of topics that best describes the articles). The data consists of news articles plus some additional columns, which are described in the table below.

| Column | Content |
|--------|---------|
|title |Title of article|
|text | Text inside article|
|domain | Domain Url of article|
|date | YYYY-MM-DD Time|
|description | Abstract of article|
|url | Url of article|
|image_url | Image if available|

In addition, here is the full description of the dataset from Hugging Face.

> "CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset has been prepared using news-please - an integrated web crawler and information extractor for news.
It contains 708241 English language news articles published between Jan 2017 and December 2019. It represents a small portion of the English language subset of the CC-News dataset." ~ [Hugging Face cc_news](https://huggingface.co/datasets/cc_news)

Before we do any data cleaning, let's read in the data and explore it a bit.

In [2]:
%%time

dataset = load_dataset('cc_news')

Reusing dataset cc_news (/home/ramonperez/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/6cdde8d7fdaae3e50fb61b5d08d5387c2f0bbea1ee68755ef954af539a6a3a1b)


CPU times: user 137 ms, sys: 252 ms, total: 389 ms
Wall time: 1.85 s


Let's see how many articles we have and then examine the columns.

In [3]:
dataset.shape

{'train': (708241, 7)}

In [4]:
dataset.column_names['train']

['title', 'text', 'domain', 'date', 'description', 'url', 'image_url']

Now that we have a dictionary, we can create a pandas DataFrame.

In [5]:
%%time

df = dataset['train'].to_pandas().sample(5_000).reset_index(drop=True)

In [6]:
df.head()

| | title | text | domain | date | description | url | image_url |
|---|-------|------|--------|------|-------------|-----|-----------|
| 0 | Cristiano Ronaldo banned for 5 games after pus... | (AP Photo/Manu Fernandez). Real Madrid's Crist... | www.wave3.com | 2017-08-14 00:00:00 | | http://www.wave3.com/story/36129496/cristiano-... | http://APMOBILE.images.worldnow.com/images/146... |
| 1 | Why the Philippines is not truly independent | FOR leftists, American assistance in the Maraw... | www.manilatimes.net | 2017-06-15 00:21:39 | | http://www.manilatimes.net/philippines-not-tru... | http://manilatimes.net/wp-content/uploads/2016... |
| 2 | Tyron Charles death: Murder victim 'dumped in ... | Image copyright Family handout Image caption T... | www.bbc.com | 2018-07-04 20:09:54 | Police secretly recorded prison visits to find... | https://www.bbc.com/news/uk-england-leeds-4471... | https://ichef.bbci.co.uk/news/1024/branded_new... |
| 3 | North Korean official accuses U.S. of turning ... | Justice Neil Gorsuch heard his first arguments... | theweek.com | 2017-04-17 19:12:38 | Official site of The Week Magazine, offering c... | http://theweek.com/speedreads/692792/north-kor... | http://api.theweek.com/sites/default/files/sty... |
| 4 | Spring's Sweet Start: Dairy Queen's Free Cone Day | What to Know Tuesday, March 20\nParticipating ... | www.nbclosangeles.com | 2018-03-19 10:44:02 | Find your pay-nothing small vanilla cone on th... | https://www.nbclosangeles.com/news/local/Sprin... | https://media.nbclosangeles.com/images/1200*67... |


In [8]:
df.to_parquet("cc_news_sample.parquet", compression="gzip")

## 3. Flash NLP Intro

We can use the `.iloc[row, column]` indexer on our dataframe to select one row and one column by position, separated by a comma, and print a more readable version of the text with `pprint()` from Python's `pprint` module.

In [9]:
random_article = df.iloc[10, 1]
pprint(random_article)

('Autism is a neurological and developmental diagnosis seen from early '
 'childhood marked by difficulty in communicating, forming relationships and '
 'using languages. In Sri Lanka, one in 93 children have been found to have '
 'autism. Studies have shown that the condition of an autistic child can '
 'improve with early diagnosis. Early diagnosis and provision of further '
 "information to caregivers is largely linked to the level of physicians' "
 'knowledge of autism.\n'
 "A recent study carried out in Sri Lanka to assess doctors' knowledge of "
 'diagnostic features and co-morbidities of childhood autism in a tertiary '
 'care hospital has found that around 50% of the doctors were unaware of some '
 'of the signs and symptoms of it.\n'
 '"Our study has revealed that the knowledge of diagnostic features and '
 'comorbidities of childhood autism among doctors is poor," says Dr Yasodha '
 'Maheshi Rohanachandra, lead author of the research article.\n'
 '"There is a lack of educatio

Notice how the article above is quite messy: it has a lot of characters that, for all intents and purposes, will not be useful for our analysis. Let's examine a cleaner version of it by running it through spaCy. When we tokenize a document, we split its content into its components (words, numbers, punctuation, and the like) to make it easier to process, clean, and run computations on.

For this part, we will load an English model and pass an example article through it. You may need to uncomment and run the cell below first to download the English model.
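
To make the idea of tokenization concrete, here is a deliberately rough sketch that splits text with a regular expression. It is only an illustration: `toy_tokenize` is a made-up helper, and spaCy's tokenizer applies many more rules (for contractions, abbreviations, and so on).

```python
import re

def toy_tokenize(text):
    """Split text into word/number tokens and single punctuation marks."""
    # \w+ grabs runs of letters, digits and underscores; [^\w\s] grabs any
    # single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "In Sri Lanka, one in 93 children have autism."
tokens = toy_tokenize(sample)
print(tokens)
```

Each word, number, and punctuation mark becomes its own item, which is the shape of data the cleaning steps later in the tutorial work with.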

In [10]:
# !python -m spacy download en_core_web_lg

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [12]:
nlp = spacy.load("en_core_web_lg")

In [13]:
parsed_article = nlp(random_article)

In [14]:
parsed_article

Autism is a neurological and developmental diagnosis seen from early childhood marked by difficulty in communicating, forming relationships and using languages. In Sri Lanka, one in 93 children have been found to have autism. Studies have shown that the condition of an autistic child can improve with early diagnosis. Early diagnosis and provision of further information to caregivers is largely linked to the level of physicians' knowledge of autism.
A recent study carried out in Sri Lanka to assess doctors' knowledge of diagnostic features and co-morbidities of childhood autism in a tertiary care hospital has found that around 50% of the doctors were unaware of some of the signs and symptoms of it.
"Our study has revealed that the knowledge of diagnostic features and comorbidities of childhood autism among doctors is poor," says Dr Yasodha Maheshi Rohanachandra, lead author of the research article.
"There is a lack of educational psychologists and teachers trained in specialized autism 

Notice how much nicer our article looks now.

We can also grab the sentences and view them one by one using the `.sents` attribute and the built-in Python function `next()`, since `.sents` on a document processed by spaCy returns a generator. Alternatively, we can loop over it and show each of the sentences in an article.
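
As a small aside on how generators behave with `next()` and with loops, the sketch below uses a made-up generator, `toy_sents`, that naively splits on full stops. It is not how spaCy finds sentence boundaries; it only shows the mechanics of pulling one item versus consuming the rest.

```python
def toy_sents(text):
    """Yield one sentence at a time, lazily."""
    for chunk in text.split(". "):
        yield chunk.strip()

text = "Autism is a diagnosis. Studies show early diagnosis helps. Knowledge matters"
sentences = toy_sents(text)

# next() pulls only the first item and leaves the rest untouched
first = next(sentences)

# a loop (or list()) consumes whatever is left
remaining = list(sentences)

print(first)
print(remaining)
```

Once a generator is exhausted it yields nothing more, so a fresh one has to be created to go over the sentences again.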

In [15]:
next(enumerate(parsed_article.sents))

(0,
 Autism is a neurological and developmental diagnosis seen from early childhood marked by difficulty in communicating, forming relationships and using languages.)

In [16]:
for num, sentence in enumerate(parsed_article.sents):
    print(f"Sentence #{num}:\n {sentence}\n")

Sentence #0:
 Autism is a neurological and developmental diagnosis seen from early childhood marked by difficulty in communicating, forming relationships and using languages.

Sentence #1:
 In Sri Lanka, one in 93 children have been found to have autism.

Sentence #2:
 Studies have shown that the condition of an autistic child can improve with early diagnosis.

Sentence #3:
 Early diagnosis and provision of further information to caregivers is largely linked to the level of physicians' knowledge of autism.

Sentence #4:
 


Sentence #5:
 A recent study carried out in Sri Lanka to assess doctors' knowledge of diagnostic features and co-morbidities of childhood autism in a tertiary care hospital has found that around 50% of the doctors were unaware of some of the signs and symptoms of it.

Sentence #6:
 
"Our study has revealed that the knowledge of diagnostic features and comorbidities of childhood autism among doctors is poor," says Dr Yasodha Maheshi Rohanachandra, lead author of the 

We can also have a look at the different kinds of entities in an article. These entities can be a person (called PERSON), a number (called CARDINAL), a geopolitical entity (called GPE), etc.

In [17]:
for num, entity in enumerate(parsed_article.ents):
    print(f"Entity #{num}: {entity} -- {entity.label_}\n")

Entity #0: Sri Lanka -- GPE

Entity #1: one -- CARDINAL

Entity #2: 93 -- CARDINAL

Entity #3: Sri Lanka -- GPE

Entity #4: around 50% -- PERCENT

Entity #5: Dr Yasodha Maheshi Rohanachandra -- PERSON

Entity #6: Sri Lanka -- GPE

Entity #7: 62% -- PERCENT

Entity #8: 176 -- CARDINAL

Entity #9: 76% -- PERCENT

Entity #10: only 61% -- PERCENT

Entity #11: Sri Lanka -- GPE

Entity #12: Dr Rohanachandra -- PERSON

Entity #13: Rohanachandra -- PERSON

Entity #14: Y.M. et al. -- PERSON

Entity #15: 2017 -- DATE

Entity #16: Sri Lanka Journal of Child Health -- ORG

Entity #17: 46(1 -- CARDINAL

Entity #18: pp.29-32 -- DATE

Entity #19: SLJOL -- ORG

Entity #20: SLJOL -- ORG

Entity #21: the National Science Foundation of -- ORG

Entity #22: Sri Lanka -- GPE



In [18]:
spacy.explain("LOC")

'Non-GPE locations, mountain ranges, bodies of water'

We can also check whether a token is a stopword or a punctuation mark, and we can lemmatize our articles. Lemmatization reduces a word to its root (its lemma), which brings related words to a common form; for example, `was` becomes `be`, and most plural words become singular.

In [19]:
new_list = []

for token in parsed_article:
    new_list.append(token.text)
    
    
new_list[:10]

['Autism',
 'is',
 'a',
 'neurological',
 'and',
 'developmental',
 'diagnosis',
 'seen',
 'from',
 'early']

In [20]:
new_list = [token.text for token in parsed_article]

new_list[:10]

['Autism',
 'is',
 'a',
 'neurological',
 'and',
 'developmental',
 'diagnosis',
 'seen',
 'from',
 'early']

In [21]:
# extract the text of each token in the parsed article
token_text = [token.text for token in parsed_article]

# lemmatize each token where possible
token_lemmas = [token.lemma_ for token in parsed_article]

# stopwords are very common, so flag whether each token is a stopword
token_stop = [token.is_stop for token in parsed_article]

# flag whether each token is a punctuation mark
token_punc = [token.is_punct for token in parsed_article]

# combine all four lists into a dataframe and display it without assigning it to a variable
pd.DataFrame(zip(token_text, token_lemmas, token_punc, token_stop), columns=['Original Text', 'Lemmatized Text', 'Punctuations', 'stopwords']).head(50)

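| | Original Text | Lemmatized Text | Punctuations | stopwords |
|---|---------------|-----------------|--------------|-----------|
| 0 | Autism | autism | False | False |
| 1 | is | be | False | True |
| 2 | a | a | False | True |
| 3 | neurological | neurological | False | False |
| 4 | and | and | False | True |
| 5 | developmental | developmental | False | False |
| 6 | diagnosis | diagnosis | False | False |
| 7 | seen | see | False | False |
| 8 | from | from | False | True |
| 9 | early | early | False | False |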
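
To see what stopword removal and lemmatization do together, here is a toy sketch with a hand-written stopword set and lemma lookup. Both `TOY_STOP_WORDS` and `TOY_LEMMAS` are invented for this example; spaCy's own resources are far larger and context-aware.

```python
# Toy resources for illustration only; spaCy's real lists are far larger
TOY_STOP_WORDS = {"is", "a", "and", "from", "the", "was"}
TOY_LEMMAS = {"is": "be", "was": "be", "seen": "see", "children": "child"}

def toy_normalize(tokens):
    """Lower-case tokens, drop stop words, and map the rest to a lemma."""
    kept = []
    for token in tokens:
        word = token.lower()
        if word in TOY_STOP_WORDS:
            continue
        # fall back to the word itself when no lemma is listed
        kept.append(TOY_LEMMAS.get(word, word))
    return kept

tokens = ["Autism", "is", "a", "diagnosis", "seen", "from", "early", "childhood"]
normalized = toy_normalize(tokens)
print(normalized)
```

The very common words disappear and `seen` collapses to `see`, so documents that use different forms of the same word end up sharing vocabulary.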


## 4. Cleaning

Let's start by checking whether our dataset contains any missing values, and then evaluate how much of our machine's memory we are currently using.

In [22]:
df.isna().sum()

title          0
text           0
domain         0
date           0
description    0
url            0
image_url      0
dtype: int64

In [23]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        5000 non-null   object
 1   text         5000 non-null   object
 2   domain       5000 non-null   object
 3   date         5000 non-null   object
 4   description  5000 non-null   object
 5   url          5000 non-null   object
 6   image_url    5000 non-null   object
dtypes: object(7)
memory usage: 32.8 MB


Depending on the random sample you drew at the beginning, the dataframe may or may not take up a lot of memory. If it does, getting rid of the columns you don't need will help release some of the memory in your machine.

In [24]:
df.drop(['url', 'image_url', 'domain'], axis=1, inplace=True)

Perfect! Let's now extract the `text` column and normalize it. This means we will use `re` and `spacy` to
- take out anything that is not a letter, a number, or whitespace,
- convert to lower case,
- strip the spaces around the text,
- tokenize the articles,
- remove stopwords (we will use spaCy's list of stopwords for this) and lemmatize the remaining tokens,
- and then join the cleaned tokens back together.

In [25]:
articles = df['text'].values

In [26]:
from spacy.lang.en.stop_words import STOP_WORDS
len(STOP_WORDS), STOP_WORDS

(326,
 {"'d",
  "'ll",
  "'m",
  "'re",
  "'s",
  "'ve",
  'a',
  'about',
  'above',
  'across',
  'after',
  'afterwards',
  'again',
  'against',
  'all',
  'almost',
  'alone',
  'along',
  'already',
  'also',
  'although',
  'always',
  'am',
  'among',
  'amongst',
  'amount',
  'an',
  'and',
  'another',
  'any',
  'anyhow',
  'anyone',
  'anything',
  'anyway',
  'anywhere',
  'are',
  'around',
  'as',
  'at',
  'back',
  'be',
  'became',
  'because',
  'become',
  'becomes',
  'becoming',
  'been',
  'before',
  'beforehand',
  'behind',
  'being',
  'below',
  'beside',
  'besides',
  'between',
  'beyond',
  'both',
  'bottom',
  'but',
  'by',
  'ca',
  'call',
  'can',
  'cannot',
  'could',
  'did',
  'do',
  'does',
  'doing',
  'done',
  'down',
  'due',
  'during',
  'each',
  'eight',
  'either',
  'eleven',
  'else',
  'elsewhere',
  'empty',
  'enough',
  'even',
  'ever',
  'every',
  'everyone',
  'everything',
  'everywhere',
  'except',
  'few',
  'fifteen',

In [27]:
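def normalize_doc(doc):
    """
    Normalize a single document: keep only letters, numbers, and
    whitespace, lower-case the text, drop stop words, and lemmatize
    the remaining tokens.
    """
    # flags must be passed by keyword; the fourth positional argument of re.sub is `count`
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    tokens = nlp(doc)
    filtered_tokens = [token.lemma_ for token in tokens if not token.is_stop]
    # note: replacing " \n " with "" glues the words around a line break together
    # (e.g. "autismrecent" in the output below); use " " instead to keep them apart
    doc = ' '.join(filtered_tokens).replace(" \n ", "")
    return doc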
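
A detail worth knowing about `re.sub` when stripping characters: its fourth positional parameter is `count`, not `flags`, so flags are best passed by keyword. The sketch below uses a made-up string to show what happens when the integer value of `re.I | re.A` ends up being used as a replacement cap.

```python
import re

text = "a!" * 300          # 300 letters and 300 exclamation marks
pattern = r"[^a-zA-Z0-9\s]"

# re.I | re.A is an integer flag value; placed in the fourth positional slot
# of re.sub it would be read as `count`, i.e. a cap on replacements
flag_value = int(re.I | re.A)

# what the positional form amounts to: stop after `flag_value` replacements
capped = re.sub(pattern, "", text, count=flag_value)

# flags passed by keyword: every match is replaced
cleaned = re.sub(pattern, "", text, flags=re.I | re.A)

print(flag_value, capped.count("!"), cleaned.count("!"))
```

With the cap in place, long documents keep part of their punctuation, which is easy to miss because short examples look perfectly clean.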

In [28]:
random_article

'Autism is a neurological and developmental diagnosis seen from early childhood marked by difficulty in communicating, forming relationships and using languages. In Sri Lanka, one in 93 children have been found to have autism. Studies have shown that the condition of an autistic child can improve with early diagnosis. Early diagnosis and provision of further information to caregivers is largely linked to the level of physicians\' knowledge of autism.\nA recent study carried out in Sri Lanka to assess doctors\' knowledge of diagnostic features and co-morbidities of childhood autism in a tertiary care hospital has found that around 50% of the doctors were unaware of some of the signs and symptoms of it.\n"Our study has revealed that the knowledge of diagnostic features and comorbidities of childhood autism among doctors is poor," says Dr Yasodha Maheshi Rohanachandra, lead author of the research article.\n"There is a lack of educational psychologists and teachers trained in specialized a

In [29]:
normalize_doc(random_article)

'autism neurological developmental diagnosis see early childhood mark difficulty communicate form relationship language sri lanka 93 child find autism study show condition autistic child improve early diagnosis early diagnosis provision information caregiver largely link level physician knowledge autismrecent study carry sri lanka assess doctor knowledge diagnostic feature comorbiditie childhood autism tertiary care hospital find 50 doctor unaware sign symptomstudy reveal knowledge diagnostic feature comorbiditie childhood autism doctor poor say dr yasodha maheshi rohanachandra lead author research articlelack educational psychologist teacher train specialized autism educational strategy poor awareness autism community service child autism sri lanka limit centralized addaccord research majority 62 176 survey doctor believe lack competence identify autismawareness symptom doctor high impaired social interaction 76 contrast 61 doctor aware restrict repetitive interest behavior potential 

Since we have quite a few articles, this operation can take quite some time unless we run the cleaning process in parallel. We will do this using `ProcessPoolExecutor` from the `concurrent.futures` module.

In [30]:
%%time

with ProcessPoolExecutor(max_workers=12) as e:
    processed_articles = list(e.map(normalize_doc, articles))

CPU times: user 3.48 s, sys: 1.6 s, total: 5.07 s
Wall time: 49.4 s


We will add the cleaned versions of the documents back into the dataframe and record the length (in characters) of each article, both raw and clean.

In [31]:
%%time

df['clean_text'] = processed_articles
df['len_clean_text'] = df['clean_text'].apply(len)
df['len_dirty_text'] = df['text'].apply(len)

CPU times: user 6.35 ms, sys: 0 ns, total: 6.35 ms
Wall time: 6.11 ms


Let's now save our cleaned dataset in case we need to restart our notebook and begin the analysis again. We will also release a bit of memory by getting rid of all the data and variables we have loaded up since the beginning of the notebook.

In [32]:
df.head(2)

| | title | text | date | description | clean_text | len_clean_text | len_dirty_text |
|---|-------|------|------|-------------|------------|----------------|----------------|
| 0 | Cristiano Ronaldo banned for 5 games after pus... | (AP Photo/Manu Fernandez). Real Madrid's Crist... | 2017-08-14 00:00:00 | | ap photomanu fernandez real madrids cristiano ... | 3192 | 4720 |
| 1 | Why the Philippines is not truly independent | FOR leftists, American assistance in the Maraw... | 2017-06-15 00:21:39 | | leftist american assistance marawi battle show... | 3734 | 6017 |


In [33]:
%%time

(
    df[['title', 'date', 'clean_text', 'len_clean_text', 'len_dirty_text']]
     .reset_index(drop=True)
     .to_parquet('articles_clean.parquet', compression='snappy')
)

CPU times: user 60.8 ms, sys: 19.4 ms, total: 80.2 ms
Wall time: 123 ms


In [34]:
del dataset
del df
del articles
del processed_articles

In [36]:
df = pd.read_parquet('articles_clean.parquet').reset_index(drop=True)

It wouldn't make sense to feed our algorithms articles with only a handful of characters, so let's examine the distribution of character counts for both the raw and the clean versions of our articles.

In [37]:
df[['len_clean_text', 'len_dirty_text']].describe().T

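| | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| len_clean_text | 5000.0 | 1471.2924 | 1721.160127 | 16.0 | 491.0 | 1081.5 | 1919.25 | 40389.0 |
| len_dirty_text | 5000.0 | 2325.8164 | 2652.81815 | 70.0 | 749.75 | 1697.5 | 3076.0 | 65754.0 |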


In [38]:
df[['len_clean_text', 'len_dirty_text']].skew()

len_clean_text    6.642526
len_dirty_text    6.399837
dtype: float64
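
To get a feel for what a skewness value means, the following sketch computes the plain (population) skewness of two made-up lists of article lengths. As far as I know, pandas applies a small-sample correction on top of this formula, so its numbers will differ slightly; `skewness` here is an illustrative helper, not the pandas implementation.

```python
def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [100, 200, 300, 400, 500]
long_tail = [100, 200, 300, 400, 5000]   # one very long "article"

print(skewness(symmetric))
print(skewness(long_tail))
```

A symmetric list has a skewness of zero, while a single very long article is enough to push the value well above zero, which is the pattern we see in our corpus.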

![img](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn.analyticsvidhya.com%2Fwp-content%2Fuploads%2F2020%2F06%2Fsk1.png&f=1&nofb=1)

Now that we know the distribution of character counts is heavily right-skewed, with some very short articles at the low end, let's set up a rule for the short ones. We'll use the maximum length of a tweet, 280 characters at the time of writing, as a threshold and filter out all articles whose clean text is not longer than that. Let's first check how many articles pass the threshold.

In [39]:
greater_than_a_tweet = df['len_clean_text'] > 280
greater_than_a_tweet.sum()

4321

In [40]:
df = df[greater_than_a_tweet].copy()

In [41]:
df.shape

(4321, 5)

## 5. Recommendation System

Recommendation systems can come in many different forms and sizes. We can create a system that takes into account the behaviour of other users, or a system that only looks at similar articles or items to make a recommendation. Both are powerful systems and could cover an entire section of a book in their own right, which is why we will focus on the latter category, the one that makes recommendations based on similar articles.

To create our recommendation system we first need to convert our articles into a numerical representation. We do this with a so-called bag of words (BOW). A BOW is a matrix with the documents along the rows and the terms contained in all documents along the columns. Each value is the frequency with which a term appears in a document. To create this kind of representation we can use `sklearn`'s `CountVectorizer` or `TfidfVectorizer` classes. The latter is a re-weighted version of the former: roughly speaking, the frequency of a word in a document is scaled down according to the number of documents in which the word appears.

To use these classes we first instantiate them, fit them to the data so that they can learn the vocabulary of our corpus, and then transform the corpus into a sparse matrix. Sparse matrices store only the non-zero values and their locations, which makes the data cheaper to store and compute on.
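
To make the two representations concrete, here is a dependency-free sketch that builds a bag of words and then TF-IDF weights for three made-up documents. The smoothed idf formula and the L2 normalisation are what I understand scikit-learn's `TfidfVectorizer` to use by default; treat that as an assumption rather than a specification.

```python
import math
from collections import Counter

docs = [
    "snow storm tonight",
    "storm watch posted",
    "sunny hot monday",
]

# vocabulary = every distinct term, sorted so columns have a stable order
vocab = sorted({term for doc in docs for term in doc.split()})

# bag of words: one row per document, one column per term, raw counts as values
counts = [Counter(doc.split()) for doc in docs]
bow = [[c[term] for term in vocab] for c in counts]

# document frequency: in how many documents each term appears
n_docs = len(docs)
doc_freq = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]

# smoothed inverse document frequency, then L2-normalise every row
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
tfidf = []
for row in bow:
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

storm = vocab.index("storm")
snow = vocab.index("snow")
print(vocab)
print(bow[0])
print(round(tfidf[0][storm], 3), round(tfidf[0][snow], 3))
```

In the first document `storm` and `snow` both occur once, but `storm` receives the lower weight because it also appears in another document.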

In [42]:
%%time

# we first instantiate our class
tf = TfidfVectorizer(min_df=0.035, max_df=0.80)

# we can fit and transform the data in the same step
tfidf_matrix = tf.fit_transform(df['clean_text'].values)

# evaluate the shape of our matrix
tfidf_matrix.shape

CPU times: user 697 ms, sys: 0 ns, total: 697 ms
Wall time: 728 ms


(4321, 788)
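
The `min_df` and `max_df` arguments explain why only 788 terms survive: when given as floats they are read as proportions of documents, so a term has to appear in at least 3.5% and at most 80% of the articles. The sketch below mimics that filter on a made-up corpus; `filter_vocabulary` is a hypothetical helper, not a scikit-learn function.

```python
def filter_vocabulary(tokenized_docs, min_df, max_df):
    """Keep terms whose share of documents lies between min_df and max_df."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):          # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )

toy_docs = [
    ["say", "storm", "snow"],
    ["say", "storm", "watch"],
    ["say", "sunny", "hot"],
    ["say", "policy", "vote"],
]
kept = filter_vocabulary(toy_docs, min_df=0.5, max_df=0.8)
print(kept)
```

Here `say` is dropped for being everywhere and the one-off words are dropped for being too rare, which leaves the terms that actually help tell documents apart.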

We can access our vocabulary with the `.get_feature_names_out()` method.

In [43]:
tf.get_feature_names_out()[500:550].tolist()

['pass',
 'past',
 'pay',
 'people',
 'percent',
 'perform',
 'performance',
 'period',
 'person',
 'personal',
 'phone',
 'photo',
 'pick',
 'picture',
 'piece',
 'place',
 'plan',
 'play',
 'player',
 'pm',
 'point',
 'police',
 'policy',
 'political',
 'popular',
 'position',
 'positive',
 'possible',
 'post',
 'potential',
 'power',
 'practice',
 'prepare',
 'present',
 'president',
 'press',
 'pressure',
 'pretty',
 'prevent',
 'previous',
 'previously',
 'price',
 'private',
 'probably',
 'problem',
 'process',
 'produce',
 'product',
 'production',
 'professional']

The next step is to measure how similar each pair of documents is, based only on the words they contain. The `cosine_similarity` function we imported earlier can do this for us, and afterwards we can create a dataframe to evaluate our results.

**Note:** this operation can take a few minutes if you are using the entire dataset. Make sure to grab some ☕️ 😎

In [44]:
%%time

doc_sim = cosine_similarity(tfidf_matrix)

CPU times: user 691 ms, sys: 268 ms, total: 959 ms
Wall time: 961 ms


In [45]:
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 4311 | 4312 | 4313 | 4314 | 4315 | 4316 | 4317 | 4318 | 4319 | 4320 |
|---|---|---|---|---|---|---|---|---|---|---|-----|------|------|------|------|------|------|------|------|------|------|
| 0 | 1.0 | 0.104807 | 0.010998 | 0.041108 | 0.037607 | 0.013447 | 0.019827 | 0.065076 | 0.086727 | 0.076672 | ... | 0.028044 | 0.103528 | 0.01251 | 0.027298 | 0.020729 | 0.431844 | 0.063176 | 0.008133 | 0.030969 | 0.023385 |
| 1 | 0.104807 | 1.0 | 0.051263 | 0.017592 | 0.087388 | 0.006243 | 0.051306 | 0.214864 | 0.079791 | 0.00726 | ... | 0.14182 | 0.162437 | 0.058343 | 0.0264 | 0.120998 | 0.070996 | 0.158498 | 0.032312 | 0.054947 | 0.077151 |
| 2 | 0.010998 | 0.051263 | 1.0 | 0.106262 | 0.03626 | 0.030976 | 0.073106 | 0.049003 | 0.053113 | 0.0 | ... | 0.028965 | 0.123234 | 0.030842 | 0.023701 | 0.041 | 0.016222 | 0.062175 | 0.003205 | 0.029527 | 0.02122 |
| 3 | 0.041108 | 0.017592 | 0.106262 | 1.0 | 0.011884 | 0.033769 | 0.0 | 0.061406 | 0.027842 | 0.0 | ... | 0.138607 | 0.078569 | 0.041286 | 0.0 | 0.030631 | 0.065558 | 0.048351 | 0.019614 | 0.102716 | 0.021051 |
| 4 | 0.037607 | 0.087388 | 0.03626 | 0.011884 | 1.0 | 0.009302 | 0.009201 | 0.090487 | 0.10446 | 0.0 | ... | 0.15873 | 0.078334 | 0.059583 | 0.005564 | 0.065376 | 0.073324 | 0.121543 | 0.013669 | 0.031697 | 0.039799 |


In [46]:
doc_sim.shape

(4321, 4321)

We get a 4321 x 4321 matrix because every document is compared with every other document. The matrix is symmetric, so the two halves on either side of the diagonal are identical, and the diagonal itself is 1 because each document is identical to itself.
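
The shape and symmetry of that matrix are easy to check by hand. Below is a minimal sketch that computes cosine similarity for three made-up vectors in plain Python; `cosine` is an illustrative helper that follows the textbook formula, not scikit-learn's implementation.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vectors = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as the first vector
    [0.0, 0.0, 3.0],   # shares no term with the other two
]

# all documents versus all documents: an n x n matrix
sim = [[cosine(u, v) for v in vectors] for u in vectors]
for row in sim:
    print([round(x, 3) for x in row])
```

Vectors pointing in the same direction score 1 regardless of their length, vectors with no shared terms score 0, and swapping the two documents never changes the score.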

In [47]:
articles_list = df['title'].values
articles_list.shape, articles_list

((4321,),
 array(['Cristiano Ronaldo banned for 5 games after pushing referee - wave3.com-Louisville News, Weather & Sports',
        'Why the Philippines is not truly independent',
        "Tyron Charles death: Murder victim 'dumped in builder's bag'",
        ...,
        'Dental Health May Be Within Reach With Financial Education Benefits Center Resources',
        'Israel says anti-Semitism rising in Poland as Holocaust row simmers',
        "State law: Schools with 'D' or 'F' must inform neighborhoods"],
       dtype=object))

Let's now
1. pick a title at random,
2. get the index of that title,
3. select the corresponding row in our document similarity dataframe,
4. sort the indices of that row by similarity, from highest to lowest,
5. return the titles of the most similar articles.

In [48]:
from random import choice

In [49]:
a_title = choice(articles_list)
a_title

'Storm Team 27: Mostly cloudy, snow tonight'

In [50]:
article_idx = np.where(articles_list == a_title)[0][0]
article_idx

2548

In [51]:
article_similarities = doc_sim_df.iloc[article_idx].values
article_similarities

array([0.01256244, 0.04339975, 0.        , ..., 0.01731104, 0.        ,
       0.07849991])

In [52]:
# skip position 0: the most similar article is normally the article itself (similarity of 1)
similar_articles_idxs = np.argsort(-article_similarities)[1:10]
similar_articles_idxs

array([ 666,  949,  248, 3838, 2793, 3806, 1829, 3692, 3612])

In [53]:
df.head()

| | title | date | clean_text | len_clean_text | len_dirty_text |
|---|-------|------|------------|----------------|----------------|
| 0 | Cristiano Ronaldo banned for 5 games after pus... | 2017-08-14 00:00:00 | ap photomanu fernandez real madrids cristiano ... | 3192 | 4720 |
| 1 | Why the Philippines is not truly independent | 2017-06-15 00:21:39 | leftist american assistance marawi battle show... | 3734 | 6017 |
| 2 | Tyron Charles death: Murder victim 'dumped in ... | 2018-07-04 20:09:54 | image copyright family handout image caption t... | 1173 | 1778 |
| 3 | North Korean official accuses U.S. of turning ... | 2017-04-17 19:12:38 | justice neil gorsuch hear argument supreme cou... | 960 | 1611 |
| 4 | Spring's Sweet Start: Dairy Queen's Free Cone Day | 2018-03-19 10:44:02 | know tuesday march 20participate dairy queensm... | 1165 | 1972 |


In [54]:
doc1 = nlp(df.iloc[1, 2])
doc2 = nlp(df.iloc[2, 2])

In [55]:
doc1.similarity(doc2)

0.8292398212223382
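
The two cells above use spaCy's `Doc.similarity`, which by default compares documents through the cosine similarity of their averaged word vectors. Averaging many words tends to pull long documents towards similar vectors, which may be why two unrelated articles still score about 0.83. The sketch below reproduces the idea with invented three-dimensional vectors; `toy_vectors`, `doc_vector`, and `cosine` are made up for illustration.

```python
import math

# made-up 3-dimensional "word vectors"; real spaCy vectors have hundreds of dimensions
toy_vectors = {
    "snow": [0.9, 0.1, 0.0],
    "storm": [0.8, 0.2, 0.1],
    "sunny": [0.1, 0.9, 0.0],
    "vote": [0.0, 0.1, 0.9],
}

def doc_vector(words):
    """Average the vectors of all words in a document."""
    rows = [toy_vectors[w] for w in words]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

weather_a = doc_vector(["snow", "storm"])
weather_b = doc_vector(["storm", "sunny"])
politics = doc_vector(["vote"])

print(round(cosine(weather_a, weather_b), 3))
print(round(cosine(weather_a, politics), 3))
```

Unlike TF-IDF, this kind of similarity can be high even when two documents share no words, because it works on the meaning encoded in the vectors rather than on exact term matches.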

In [56]:
a_title

'Storm Team 27: Mostly cloudy, snow tonight'

In [57]:
similar_articles = articles_list[similar_articles_idxs]
pprint(similar_articles.tolist())

['TX Lubbock TX Zone Forecast',
 'TX Shreveport LA Zone Forecast',
 'TX Midland/Odessa TX Zone Forecast',
 'Friday Early Forecast',
 'WA Portland OR Zone Forecast',
 'On 1st day of spring, snow in forecast; storm watch posted',
 'Timeline: The disappearance of Stephanie Low - WAOW - Newsline 9, Wausau '
 'News, Weather, Sports',
 'Stop taking drugs for lower back pain and do this instead',
 'Sunny, hot Monday']
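
Slicing the sorted indices from position 1 relies on the query article sorting first. A slightly more defensive variant, sketched here with plain Python lists and a made-up similarity row, drops the query by index instead, which also behaves sensibly when the corpus contains duplicate articles. `top_similar` is a hypothetical helper, not part of the notebook.

```python
def top_similar(similarities, query_idx, top_n=5):
    """Indices of the top_n most similar items, never including the query."""
    ranked = sorted(
        range(len(similarities)),
        key=lambda j: similarities[j],
        reverse=True,
    )
    return [j for j in ranked if j != query_idx][:top_n]

# made-up similarity row for article 2; article 4 is an exact duplicate of it
row = [0.10, 0.40, 1.00, 0.05, 1.00, 0.30]
recommended = top_similar(row, query_idx=2, top_n=3)
print(recommended)
```

The duplicate at index 4 is still recommended, while the query itself is excluded no matter where it lands in the ranking.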


Lastly, we will create a mini-dashboard containing
1. a widget with all of our titles,
2. a function with the steps we followed above,
3. a panel object to store a title, the widget, and the function.

In [58]:
titles = df.title.unique().tolist()
title_widget = pn.widgets.Select(value=choice(titles), options=titles, name='Articles')

In [59]:
@pn.depends(title_widget.param.value)
def article_recommender(title_widget):
    
    article_idx = np.where(articles_list == title_widget)[0][0]
    article_similarities = doc_sim_df.iloc[article_idx].values
    similar_title_idxs = np.argsort(-article_similarities)[1:6]
    similar_titles = articles_list[similar_title_idxs]
    
    return pn.Column(*similar_titles, width=600)

In [60]:
text = pn.pane.Markdown(f"# Small Recommendation Engine", style={"color": "#000000"}, width=600, height=50,
                        sizing_mode="stretch_width", margin=(10,10,10,5))

In [61]:
pn.Column(text, title_widget, article_recommender, align='center', width=600, height=300)

## 6. Summary

Blind Spots

With additional time we could have
1. Further tweaked the parameters of the vectorizers and models;
2. Created visualizations of the document similarity to find more interesting patterns;
3. Taken the title of an article out of its body to create a better, less biased representation of the words within a document;
4. Tried PyTorch's `nn.CosineSimilarity`, which might make the similarity computation more efficient, for example on a GPU;
5. Revisited the lemmatization step in the preprocessing stage, for example by running it before punctuation and casing are stripped.

Takeaways
1. Recommendation systems like the one we built are an example of unsupervised machine learning;
2. Recommendation systems can be created with or without users' behavioural data;
3. Creating bags of words requires careful attention to the parameters;
4. Where possible, showcase a model or system in a mini-dashboard or data visualization.