# Exploration: NLP for Content Based Recommendations

The purpose of this notebook is to do some EDA and NLP on the content data. The results will then be used in the "Content Based" section of the main notebook.

## Import libraries, load data

In [114]:
import pandas as pd
import numpy as np

import re
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'])
from nltk.corpus import stopwords
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(); color='rebeccapurple'
%matplotlib inline  

# display settings
pd.set_option('max_colwidth', 200)
pd.set_option('display.max_columns', None)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [63]:
# load content data

df_content = pd.read_csv('data/articles_community.csv')
del df_content['Unnamed: 0']

## Perform EDA

In [64]:
df_content.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1056 entries, 0 to 1055
Data columns (total 5 columns):
doc_body           1042 non-null object
doc_description    1053 non-null object
doc_full_name      1056 non-null object
doc_status         1056 non-null object
article_id         1056 non-null int64
dtypes: int64(1), object(4)
memory usage: 41.3+ KB


In [130]:
df_content.head()

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
0,"Skip navigation Sign in SearchLoading...\r\n\r\nClose Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE.\r\nWATCH QUEUE\r\nQUEUE\r\nWatch Queue Queue * Remove all\r\n * Disconnect\r\n\r\nThe next ...",Detect bad readings in real time using Python and Streaming Analytics.,Detect Malfunctioning IoT Sensors with Streaming Analytics,Live,0
1,No Free Hunch Navigation * kaggle.com\r\n\r\n * kaggle.com\r\n\r\nCommunicating data science: A guide to presenting your work 4COMMUNICATING DATA SCIENCE: A GUIDE TO PRESENTING YOUR WORK\r\nMegan ...,"See the forest, see the trees. Here lies the challenge in both performing and presenting an analysis. As data scientists, analysts, and machine learning engineers faced with fulfilling business obj…",Communicating data science: A guide to presenting your work,Live,1
2,☰ * Login\r\n * Sign Up\r\n\r\n * Learning Paths\r\n * Courses * Our Courses\r\n * Partner Courses\r\n \r\n \r\n * Badges * Our Badges\r\n * BDU Badge Program\r\n \r\n \r\n * Watson ...,Here’s this week’s news in Data Science and Big Data.,"This Week in Data Science (April 18, 2017)",Live,2
3,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCALE - BOOST THE PERFORMANCE OF YOUR\r\nDISTRIBUTED DATABASE\r\nShare on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Dec 29...","Learn how distributed DBs solve the problem of scaling persistent storage, but introduce latency as data size increases and become I/O bound.",DataLayer Conference: Boost the performance of your distributed database,Live,3
4,"Skip navigation Sign in SearchLoading...\r\n\r\nClose Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE.\r\nWATCH QUEUE\r\nQUEUE\r\nWatch Queue Queue * Remove all\r\n * Disconnect\r\n\r\nThe next ...",This video demonstrates the power of IBM DataScience Experience using a simple New York State Restaurant Inspections data scenario.,Analyze NY Restaurant data using Spark in DSX,Live,4


In [66]:
df_content.loc[:3, 'doc_body'].apply(lambda x : len(x.split()))

0     681
1    3430
2     806
3     285
Name: doc_body, dtype: int64

In [71]:
# analyze mean text length in words by column

def get_mean_length(col):
    
    length_list = []
    
    for row in df_content[col]:
        try:
            length = len(row.split())
            length_list.append(length)
        
        except AttributeError:
            # missing values are NaN floats and have no .split(); skip them
            continue
            
    mean_length = round(np.mean(length_list), 0)
    std_length = round(np.std(length_list), 0)
    
    print(str(col), mean_length, std_length)

In [72]:
# call the function

cols = ['doc_body', 'doc_description', 'doc_full_name']

for col in cols:
    get_mean_length(col)

doc_body 1294.0 996.0
doc_description 28.0 15.0
doc_full_name 7.0 3.0


In [10]:
for text in df_content['doc_body'][:3]:
    print(text)

Skip navigation Sign in SearchLoading...

Close Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE.
WATCH QUEUE
QUEUE
Watch Queue Queue * Remove all
 * Disconnect

The next video is starting stop 1. Loading...

Watch Queue Queue __count__/__total__ Find out why CloseDEMO: DETECT MALFUNCTIONING IOT SENSORS WITH STREAMING ANALYTICS
IBM AnalyticsLoading...

Unsubscribe from IBM Analytics? Cancel UnsubscribeWorking...

Subscribe Subscribed Unsubscribe 26KLoading...

Loading...

Working...

Add toWANT TO WATCH THIS AGAIN LATER?
Sign in to add this video to a playlist. Sign in Share More * ReportNEED TO REPORT THE VIDEO?
   Sign in to report inappropriate content. Sign in
 * Transcript
 * Statistics
 * Add translations

175 views 6LIKE THIS VIDEO?
Sign in to make your opinion count. Sign in 7 0DON'T LIKE THIS VIDEO?
Sign in to make your opinion count. Sign in 1Loading...

Loading...

TRANSCRIPT
The interactive transcript could not be loaded.Loading...

Loa

**Conclusion:** Although `doc_body` has by far the richest content, it is full of page-navigation residue and would need a lot of cleaning before it could be used for NLP and content based recommendations. This is beyond the scope of this project. We will use `doc_description` instead.
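
To give a sense of what that cleaning would involve, here is a minimal sketch that drops lines containing page-chrome fragments seen in the sample output above. The `BOILERPLATE` pattern and the `drop_boilerplate_lines` helper are hypothetical and not part of this notebook; a real cleaning step would need a far more complete, curated pattern list.

```python
import re

# Hypothetical list of page-chrome fragments taken from the sample output;
# this is an assumption for illustration, not a complete cleaning rule set.
BOILERPLATE = re.compile(
    r"(Loading\.\.\.|Working\.\.\.|Sign in|Watch Queue|WATCH QUEUE|QUEUE)"
)

def drop_boilerplate_lines(text):
    """Keep only non-empty lines that contain no known boilerplate fragment."""
    kept = [
        line for line in text.splitlines()
        if line.strip() and not BOILERPLATE.search(line)
    ]
    return "\n".join(kept)

sample = "Skip navigation Sign in SearchLoading...\nWATCH QUEUE\nTRANSCRIPT\nLoading..."
print(drop_boilerplate_lines(sample))
```

Even this crude filter shows the problem: most lines at the start of a scraped body are navigation text, and deciding which lines are real content would require per-source rules.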

## NLP with TF-IDF Vectorizer

In [73]:
def tokenize_text(message):
    """Tokenization function to process text data. """
    
    lemmatizer = WordNetLemmatizer()
    stop_words = stopwords.words('english')
    
    # normalize case and remove punctuation
    message = re.sub(r"[^a-zA-Z0-9]", " ", message.lower())
    # tokenize text
    tokens = word_tokenize(message)
    # lemmatize, strip and remove stop words
    tokens = [lemmatizer.lemmatize(word.strip()) for word in tokens if word not in stop_words]
    # add part-of-speech tags
    tokens = pos_tag(tokens)
    
    return tokens

In [74]:
# check function

for description in df_content['doc_description'][:2]:
    tokens = tokenize_text(description)
    print(description)
    print(tokens, '\n')

Detect bad readings in real time using Python and Streaming Analytics.
[('detect', 'JJ'), ('bad', 'JJ'), ('reading', 'VBG'), ('real', 'JJ'), ('time', 'NN'), ('using', 'VBG'), ('python', 'JJ'), ('streaming', 'VBG'), ('analytics', 'NNS')] 

See the forest, see the trees. Here lies the challenge in both performing and presenting an analysis. As data scientists, analysts, and machine learning engineers faced with fulfilling business obj…
[('see', 'VB'), ('forest', 'JJS'), ('see', 'VB'), ('tree', 'JJ'), ('lie', 'NN'), ('challenge', 'NN'), ('performing', 'VBG'), ('presenting', 'VBG'), ('analysis', 'NN'), ('data', 'NNS'), ('scientist', 'NN'), ('analyst', 'NN'), ('machine', 'NN'), ('learning', 'VBG'), ('engineer', 'NN'), ('faced', 'VBD'), ('fulfilling', 'VBG'), ('business', 'NN'), ('obj', 'NN')] 
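
The normalization step inside `tokenize_text` can be looked at in isolation. The sketch below applies the same regular expression to one of the descriptions from the data; `normalize` is a hypothetical helper name, and a plain whitespace split stands in for `word_tokenize` so that the example needs no NLTK downloads.

```python
import re

def normalize(message):
    """Lowercase and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", message.lower())

text = "Here’s this week’s news in Data Science and Big Data."
tokens = normalize(text).split()  # whitespace split as a stand-in tokenizer
print(tokens)
# ['here', 's', 'this', 'week', 's', 'news', 'in', 'data', 'science', 'and', 'big', 'data']
```

Because apostrophes are replaced by spaces, contractions and possessives leave stray single-letter tokens such as `s`. Whether these survive depends on the stop word list applied afterwards.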



In [108]:
# define function for TF-IDF-Vectorizing

def apply_TfidfVectorizing(corpus):
    # initialize tf-idf vectorizer object
    vectorizer = TfidfVectorizer(tokenizer=tokenize_text)
    # compute bag of word counts and tf-idf values
    X = vectorizer.fit_transform(corpus)
    # X is a sparse matrix; call X.toarray() to view it as a dense NumPy array
    
    return X, vectorizer

In [117]:
# test function

corpus = df_content['doc_description'][:10]
X, vectorizer = apply_TfidfVectorizing(corpus)

# check result
X.shape

(10, 118)
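
To make the values in `X` less opaque, the following sketch computes TF-IDF weights by hand for three toy documents. It uses raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization, which correspond to the `TfidfVectorizer` defaults. The `tfidf` helper and the toy documents are made up for illustration, and whitespace splitting replaces the custom tokenizer used in this notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

# toy documents, not taken from the data set
docs = ["detect bad sensor readings", "analyze restaurant data", "analyze sensor data"]
vocab, rows = tfidf(docs)
for tok, weight in zip(vocab, rows[0]):
    print(tok, round(weight, 3))
```

Terms that occur in fewer documents receive a higher weight: in the first toy document, `bad` outweighs `sensor` because `sensor` also appears in the third document.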

In [110]:
# explore
X.toarray()

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.25622513,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.20503742]])

In [111]:
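# explore
# note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
print(vectorizer.get_feature_names())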

[('4', 'CD'), ('5', 'CD'), ('9', 'CD'), ('analysis', 'NN'), ('analyst', 'NN'), ('analytics', 'NNS'), ('apache', 'NN'), ('appears', 'VBZ'), ('bad', 'JJ'), ('become', 'NN'), ('big', 'JJ'), ('bound', 'IN'), ('browser', 'NN'), ('build', 'VB'), ('business', 'NN'), ('career', 'NN'), ('challenge', 'NN'), ('class', 'NN'), ('collaborate', 'NN'), ('company', 'NN'), ('compete', 'JJ'), ('compose', 'JJ'), ('console', 'NN'), ('data', 'NNS'), ('datascience', 'NN'), ('db', 'JJ'), ('demonstrates', 'VBZ'), ('deployment', 'JJ'), ('detect', 'JJ'), ('distributed', 'VBD'), ('driven', 'RB'), ('ecosystem', 'NN'), ('engineer', 'NN'), ('engineering', 'NN'), ('essential', 'JJ'), ('experience', 'NN'), ('faced', 'VBD'), ('fang', 'NN'), ('forest', 'JJS'), ('fulfilling', 'VBG'), ('home', 'NN'), ('ibm', 'JJ'), ('increase', 'NN'), ('inspection', 'NN'), ('interesting', 'JJ'), ('introduce', 'NN'), ('kaggle', 'NN'), ('latency', 'NN'), ('learn', 'NN'), ('learn', 'VBP'), ('learning', 'NN'), ('learning', 'VBG'), ('lie', 'NN

In [116]:
# generate the cosine similarity matrix

cosine_sim = cosine_similarity(X, X)

# check
cosine_sim

array([[ 1.        ,  0.        ,  0.        ,  0.        ,  0.05152734,
         0.1014903 ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.04329467,  0.01197067,  0.01228819,
         0.02420331,  0.        ,  0.02042251,  0.1502582 ,  0.05241081],
       [ 0.        ,  0.04329467,  1.        ,  0.05020093,  0.05153251,
         0.1015005 ,  0.        ,  0.08564509,  0.1649513 ,  0.13199794],
       [ 0.        ,  0.01197067,  0.05020093,  1.        ,  0.01424837,
         0.02806417,  0.        ,  0.02368026,  0.02680074,  0.02144658],
       [ 0.05152734,  0.01228819,  0.05153251,  0.01424837,  1.        ,
         0.10967299,  0.        ,  0.02430838,  0.07795722,  0.02201545],
       [ 0.1014903 ,  0.02420331,  0.1015005 ,  0.02806417,  0.10967299,
         1.        ,  0.06617805,  0.04787875,  0.05418799,  0.04336252],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.06617805,  1.        ,  0.02792018
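
The structure of this matrix follows directly from the definition of cosine similarity. As a small worked example with made-up vectors (the `cosine` helper is hypothetical and only mirrors the definition, it is not the scikit-learn implementation):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# three toy term-weight vectors, not taken from the data set
vectors = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]]
sim = [[cosine(u, v) for v in vectors] for u in vectors]
for row in sim:
    print([round(s, 3) for s in row])
```

The same three properties are visible in the output above: the diagonal is 1 because every document is identical to itself, the matrix is symmetric, and an entry is 0 whenever two descriptions share no terms.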

In [118]:
# update function to return the similarity_matrix

def create_similarity_matrix(corpus):
     
    # initialize tf-idf vectorizer object
    vectorizer = TfidfVectorizer(tokenizer=tokenize_text)
    # compute bag of word counts and tf-idf values
    X = vectorizer.fit_transform(corpus)
    # create cosine similarity_matrix
    sim_matrix = cosine_similarity(X, X)
    
    return sim_matrix

In [125]:
# fill NaN in description column with full title

df_content['doc_description'] = df_content['doc_description'].fillna(df_content['doc_full_name'])

matrix = create_similarity_matrix(df_content['doc_description'])

In [128]:
matrix.shape

(1056, 1056)
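
One way such a matrix could be used for recommendations is to look up the rows with the highest scores for a given article. The sketch below is an assumption about that downstream use, not code from the main notebook; `most_similar` and the toy matrix are hypothetical.

```python
def most_similar(sim_matrix, idx, top_n=3):
    """Return (row_position, score) pairs most similar to row idx, best first."""
    # skip the article itself, which always has similarity 1
    scores = [(j, float(s)) for j, s in enumerate(sim_matrix[idx]) if j != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

# toy 4 x 4 similarity matrix, not taken from the data set
toy = [
    [1.0, 0.0, 0.5, 0.1],
    [0.0, 1.0, 0.2, 0.7],
    [0.5, 0.2, 1.0, 0.0],
    [0.1, 0.7, 0.0, 1.0],
]
print(most_similar(toy, 0, top_n=2))
# [(2, 0.5), (3, 0.1)]
```

The function also accepts a NumPy array such as `matrix`. The returned positions are row positions in `df_content`, so they would still need to be mapped to `article_id` values before being shown as recommendations.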