# <font color="maroon"> 1.0 NLP Toolkits and Preprocessing Techniques </font>

### NLP Toolkits
1.Python libraries for natural language processing

### Text Preprocessing Techniques
1.Converting text to a meaningful format for analysis<br>
2.Preprocessing and cleaning text

### The Top 10 NLP Tools
1.MonkeyLearn | NLP made simple <br>
2.Aylien | Leveraging news content with NLP<br>
3.IBM Watson | A pioneer AI platform for businesses<br>
4.Google Cloud NLP API | Google technology applied to NLP<br>
5.Amazon Comprehend | An AWS service to get insights from text<br>
<font color="red">6.NLTK | The most popular Python library<br> </font>
7.Stanford Core NLP | Stanford’s fast and robust toolkit<br>
<font color="red">8.TextBlob | An intuitive interface for NLTK<br></font>
9.SpaCy | Super-fast library for advanced NLP tasks<br>
10.GenSim | State-of-the-art topic modeling<br>

## How to Install NLTK?

### Method (i) Command Line
pip install nltk

### Method (ii) Jupyter Notebook
import nltk
nltk.download()

### Method (iii) Anaconda navigator (Environment)
![NLTK.png](attachment:NLTK.png)

### Method (iv) Download Package and Place into Site-package directory
Install nltk toolkit from https://sourceforge.net/projects/nltk/<br>
![](https://imgur.com/yRQpekK.png)

Place the downloaded package in the site-packages directory <br>
(to find the path:<br> import site <br>site.getsitepackages())

## Sample Text Data

Consider this sentence:
**Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers) from the
store. Should I pick up some black-eyed peas as well?**

Text data is messy. To analyze this data, we need to preprocess the text.


In [1]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


![](https://i.imgur.com/pt5p6Hb.png)

# Code: Tokenization (Words)

In [3]:
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

my_text = '''Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers)
from the store. Should I pick up some black-eyed peas as well?'''

print(word_tokenize(my_text)) # print function requires Python 3

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yikso\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


['Hi', 'Mr.', 'Smith', '!', 'I', 'am', 'going', 'to', 'buy', 'some', 'vegetables', '(', '3', 'tomatoes', 'and', '3', 'cucumbers', ')', 'from', 'the', 'store', '.', 'Should', 'I', 'pick', 'up', 'some', 'black-eyed', 'peas', 'as', 'well', '?']
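
For comparison, here is a minimal sketch (standard library only) of what a plain whitespace split does with the same sentence. Punctuation stays attached to the neighbouring words, which is the problem `word_tokenize` solves. The variable name `naive_tokens` is illustrative and not part of the original notebook.

```python
# Naive whitespace tokenization, for comparison with word_tokenize
my_text = ("Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers) "
           "from the store. Should I pick up some black-eyed peas as well?")

naive_tokens = my_text.split()  # splits on runs of whitespace only
print(naive_tokens[:3])         # punctuation remains attached, e.g. 'Smith!'
print(len(naive_tokens))        # fewer tokens than word_tokenize produces
```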


# Code: Tokenization (Sentences)

In [4]:
from nltk.tokenize import sent_tokenize

my_text = '''Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers)
from the store. Should I pick up some black-eyed peas as well?'''

print(sent_tokenize(my_text))

['Hi Mr. Smith!', 'I am going to buy some vegetables (3 tomatoes and 3 cucumbers)\nfrom the store.', 'Should I pick up some black-eyed peas as well?']
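
To see why a trained sentence tokenizer is useful, the sketch below applies a naive rule (standard library only): split wherever `.`, `!` or `?` is followed by whitespace. Under that assumption the abbreviation "Mr." is wrongly treated as the end of a sentence, which `sent_tokenize` avoids in the output above.

```python
import re

my_text = ("Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers) "
           "from the store. Should I pick up some black-eyed peas as well?")

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
naive_sentences = re.split(r'(?<=[.!?])\s+', my_text)
for sentence in naive_sentences:
    print(repr(sentence))
```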


![](https://i.imgur.com/3L6x92C.png)

# Code: Remove Punctuation

In [5]:
import re # Regular expression library
import string
# Alternative: replace each punctuation mark with a space
#clean_text = re.sub('[%s]' % re.escape(string.punctuation), ' ', my_text)
#clean_text
# Remove every character that is neither a word character nor whitespace
s = re.sub(r'[^\w\s]', '', my_text)
s

'Hi Mr Smith I am going to buy some vegetables 3 tomatoes and 3 cucumbers\nfrom the store Should I pick up some blackeyed peas as well'

# Code: Make All Text Lowercase

In [6]:
clean_text = s.lower()
clean_text

'hi mr smith i am going to buy some vegetables 3 tomatoes and 3 cucumbers\nfrom the store should i pick up some blackeyed peas as well'

# Code: Remove Numbers

In [7]:
# Removes every digit character (this does not remove whole words that contain digits)
clean_text = re.sub(r'\d', '', clean_text)
clean_text

'hi mr smith i am going to buy some vegetables  tomatoes and  cucumbers\nfrom the store should i pick up some blackeyed peas as well'
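
The three cleaning steps above can be combined into one helper. This is a minimal sketch, not part of the original notebook: the function name `preprocess` is hypothetical, and the final whitespace-collapsing step is an extra assumption added to tidy the double spaces left behind when digits are removed.

```python
import re

def preprocess(text):
    """Strip punctuation, lowercase, and drop digits, in that order."""
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    text = text.lower()                  # make all text lowercase
    text = re.sub(r'\d', '', text)       # remove digit characters
    text = re.sub(r'\s+', ' ', text)     # extra step: collapse leftover whitespace
    return text.strip()

print(preprocess("Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers)"))
```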

# <font color='blue'>Preprocessing: Stop Words</font>

![](https://i.imgur.com/T5RJXrX.png)

# Code: Stop Words

from nltk.corpus import stopwords <br>
set(stopwords.words('english'))
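
Once a stop-word set is available, removing stop words from a token list is a simple filter. The sketch below uses a tiny hand-picked set so that it runs without any downloads. The set is an illustrative assumption, and NLTK's English list is much longer.

```python
# A tiny hand-picked stop list for illustration only (not NLTK's list)
stop_words = {'i', 'am', 'to', 'some', 'and', 'from', 'the', 'should', 'up', 'as'}

tokens = ['i', 'am', 'going', 'to', 'buy', 'some', 'vegetables', 'from', 'the', 'store']

# Keep only the tokens that are not in the stop-word set
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```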

# Code: Remove Stop Words

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">CountVectorizer</a>

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

my_text = ["Hi Mr. Smith! I’m going to buy some vegetables \
(3 tomatoes and 3 cucumbers from the store. Should I pick up some black-eyed peas as well?"]

# Incorporate stop words when creating the count vectorizer
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(my_text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

#Reference: https://www.geeksforgeeks.org/difference-between-pandas-vs-numpy/



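| | black | buy | cucumbers | eyed | going | hi | mr | peas | pick | smith | store | tomatoes | vegetables |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |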


![](https://i.imgur.com/9qllh8j.png)

# Code: Stemming

In [9]:
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()

# Try some stems
print('drive:{}'.format(stemmer.stem('drive')))
print('drives:{}'.format(stemmer.stem('drives')))
print('driver:{}'.format(stemmer.stem('driver')))
print('drivers:{}'.format(stemmer.stem('drivers')))
print('driven:{}'.format(stemmer.stem('driven')))

drive:driv
drives:driv
driver:driv
drivers:driv
driven:driv
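
Stemmers are rule-based: they chop suffixes off a word without consulting a dictionary. The toy sketch below illustrates that idea only. It is not the Lancaster algorithm, and the suffix list and the function name `toy_stem` are hypothetical choices made for this example.

```python
# Toy suffix-stripping stemmer: NOT the Lancaster algorithm, just the general idea
SUFFIXES = ['ers', 'er', 'es', 'en', 's', 'e']  # hypothetical rule list, longest first

def toy_stem(word):
    for suffix in SUFFIXES:
        # strip the first matching suffix, keeping at least three characters as the stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['drive', 'drives', 'driver', 'drivers', 'driven']:
    print('{}:{}'.format(w, toy_stem(w)))
```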


# Code: Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
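lemmatizer = WordNetLemmatizer()  # requires the WordNet data: nltk.download('wordnet')

input_str = "been had done languages cities mice"
tokens = word_tokenize(input_str)
# lemmatize() treats every word as a noun unless a pos argument is given,
# so the verb forms 'been', 'had' and 'done' come back unchanged
for word in tokens:
    print(lemmatizer.lemmatize(word))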

been
had
done
language
city
mouse


![](https://i.imgur.com/8edVsCR.png)

# Code: Parts of Speech Tagging

In [None]:
from nltk.tag import pos_tag

my_text = "James Smith lives in the United States."

tokens = pos_tag(word_tokenize(my_text))
print(tokens)

#Reference:https://pythonspot.com/nltk-speech-tagging/

[('James', 'NNP'), ('Smith', 'NNP'), ('lives', 'VBZ'), ('in', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('.', '.')]


![ ](https://imgur.com/UOlnpKT.png)  

## Named Entity Recognition

In [None]:
from nltk.chunk import ne_chunk
my_text = "James Smith lives in the United States."
tokens = pos_tag(word_tokenize(my_text)) # this labels each word as a part of speech
entities = ne_chunk(tokens) # this extracts entities from the list of words
print (entities)

# <font color="blue"> Preprocessing: Compound Term Extraction </font>

![](https://i.imgur.com/q1WuWai.png)

# Code: Compound Term Extraction

In [None]:
from nltk.tokenize import MWETokenizer # multi-word expression

my_text = "You all are the greatest students of all time."

mwe_tokenizer = MWETokenizer([('You','all'), ('of', 'all', 'time')])
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(my_text))

mwe_tokens
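
The idea behind multi-word expression tokenization can be sketched in plain Python: scan the token list and join any listed expression into one token. This is an illustrative sketch rather than NLTK's implementation. The helper name `merge_mwes` is hypothetical, and the underscore separator is chosen to mirror the default of `MWETokenizer`.

```python
def merge_mwes(tokens, expressions, separator='_'):
    """Greedily join any listed multi-word expression into a single token."""
    merged, i = [], 0
    while i < len(tokens):
        for expr in sorted(expressions, key=len, reverse=True):  # try longest first
            if tuple(tokens[i:i + len(expr)]) == tuple(expr):
                merged.append(separator.join(expr))
                i += len(expr)
                break
        else:  # no expression starts at position i
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['You', 'all', 'are', 'the', 'greatest', 'students', 'of', 'all', 'time', '.']
print(merge_mwes(tokens, [('You', 'all'), ('of', 'all', 'time')]))
```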

![](https://i.imgur.com/HpgLFOT.png)

# Basic Pandas Functionality

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6,4))
data = pd.read_csv('cookie_reviews.csv')
#data

#Selecting top and bottom rows:
#head() returns the first n rows (n=5 by default).
s = data.head()
#tail() returns the last n rows (n=5 by default).
e = data.tail()

print(s)
print(e)

#Selecting columns:
#data['column_name']
#or data.column_name
col_names=data.columns
print("\n column names = ",col_names)
print ("\n \n")
#Selecting by indexer:
data.iloc[0] #- first row of data frame
#data.iloc[-1] #- last row of data frame
#data.iloc[:,0] #- first column of data frame
#data.iloc[:,-1] #- last column of data frame
#data.iloc[0,1] #- first row, second column of data frame
#data.iloc[0:4, 3:5] #- first 4 rows and 4th, 5th columns of data frame (the end position is excluded)
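
`iloc` follows the same position-based, half-open slicing rules as Python lists, so the end position of a slice is never included. A small standard-library sketch of those rules, using a hypothetical list of column labels:

```python
# Plain-Python illustration of the half-open slices that iloc also uses
columns = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5']  # hypothetical column labels

print(columns[0])    # first item
print(columns[-1])   # last item
print(columns[3:5])  # positions 3 and 4 only; position 5 is excluded
```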

![](https://i.imgur.com/w9gWcfX.png)

In [None]:
# Basic example
square_me=lambda x: x*x

my_numbers=[9, 3, 4, 100, 2, 1]
my_numbers_squared = list(map(square_me, my_numbers))#map=applies a function to all the items in an input_list
print(my_numbers_squared)
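
The same transformation is often written as a list comprehension, which avoids the separate lambda. A minimal equivalent sketch:

```python
my_numbers = [9, 3, 4, 100, 2, 1]

# Same result as list(map(square_me, my_numbers)), written as a list comprehension
my_numbers_squared = [x * x for x in my_numbers]
print(my_numbers_squared)
```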

# <font color=red>Preprocessing Exercise </font>



# Introduction

We will be using review data from Kaggle to practice preprocessing text data. The dataset contains user reviews for many products, but today we'll be focusing on the product in the dataset that had the most reviews - an oatmeal cookie.

The following code will help you load in the data. If this is your first time using nltk, you'll need to pip install it first.


In [None]:
import nltk
# nltk.download() <-- Run this if it's your first time using nltk to download all of the datasets and models

import pandas as pd

In [None]:
df = pd.read_csv('cookie_reviews.csv')
df.head()

**Question 1:**

Determine how many reviews there are in total.
   

**Question 2:**
    
Determine the percentage of 1, 2, 3, 4 and 5 star reviews.

**Question 3:**

(a) Remove stop words

(b) Change to lower case

(c) Perform stemming

# <font color="maroon"> 2.0 Text Similarity Measures </font>

- Measure the distance between two strings

## 2.1 Applications
- Information retrieval
- Text classification
- Document clustering
- Topic Modeling
- Matrix decomposition

To measure the word similarity, we use **<font color="blue"> Levenshtein distance </font>**.
- The minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one word into another.

![](https://i.imgur.com/FkdJmPi.png)

In [None]:
pip install python-Levenshtein

Note: you may need to restart the kernel to use updated packages.





In [None]:
from Levenshtein import distance as lev
lev('party', 'park')

2
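
To make the definition concrete, here is a minimal pure-Python sketch of the standard dynamic-programming computation, assuming unit cost for each insertion, deletion and substitution. The function name `levenshtein` is illustrative, and the `python-Levenshtein` package is implemented differently (as a compiled extension).

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

print(levenshtein('party', 'park'))
```

For 'party' and 'park', one substitution (t to k) and one deletion (y) are enough, which agrees with the library result of 2.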

# TextBlob

### Another toolkit other than NLTK

- Wraps around NLTK and makes it easier to use

### TextBlob capabilities

- Tokenization
- Parts of speech tagging
- Sentiment analysis
- Spell check


# TextBlob Demo: Tokenization

In [None]:
#pip install textblob

from textblob import TextBlob
my_text = TextBlob("We're moving from NLTK to TextBlob. How fun!")
my_text.words

WordList(['We', "'re", 'moving', 'from', 'NLTK', 'to', 'TextBlob', 'How', 'fun'])

# TextBlob Demo: Spell Check

In [None]:
blob = TextBlob("I'm graat at speling.")
print(blob.correct()) # print function requires Python 3

I'm great at spelling.


<font color="blue">
## How does the correct function work?  <br>
    
- Calculates the Levenshtein distance between the word ‘graat’ and all words in its word list <br>
- Of the words with the smallest Levenshtein distance, it outputs the most popular word <br></font>
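
The two steps described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the word list and its usage counts are made up, the helper names are hypothetical, and TextBlob's own implementation may differ in detail.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j - 1] + (char_a != char_b),
                               previous[j] + 1,
                               current[j - 1] + 1))
        previous = current
    return previous[-1]

# Hypothetical word list with made-up usage counts
word_counts = {'great': 900, 'grant': 120, 'goat': 80, 'spelling': 300, 'spewing': 5}

def toy_correct(word):
    if word in word_counts:  # known words are left alone
        return word
    distances = {w: levenshtein(word, w) for w in word_counts}
    best = min(distances.values())
    closest = [w for w, d in distances.items() if d == best]
    return max(closest, key=word_counts.get)  # most frequent among the closest

print(toy_correct('graat'), toy_correct('speling'))
```

Both 'great' and 'grant' are one edit away from 'graat', so the tie is broken by the higher count.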

# TextBlob Demo: Tagging

In [None]:
blob = TextBlob("John hits the ball.")
for word, tag in blob.tags:
    print(word, tag)

John NNP
hits VBZ
the DT
ball NN


# TextBlob Demo: Language Detection and Translation

In [None]:
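# Note: detect_language() and translate() were deprecated in TextBlob 0.16 and dropped from later releases
word = TextBlob("Bonjour, comment allez-vous ")
word.detect_language()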


'fr'

In [None]:
word.translate(from_lang='fr', to='en')

TextBlob("Hello how are you")

# Text Format for Analysis: Count Vectorizer

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.', 'This is the second document.', 'And the third one. One is fun.'] # corpus = a collection of texts
cv = CountVectorizer()
X = cv.fit_transform(corpus)
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

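| | and | document | first | fun | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 0 |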
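
The same document-term matrix can be built by hand, which shows what the count vectorizer does: lowercase each document, extract tokens of two or more word characters (the default token pattern of `CountVectorizer`), and count them against a sorted vocabulary. A minimal standard-library sketch with illustrative names:

```python
import re
from collections import Counter

corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one. One is fun.']

def tokenize(text):
    # lowercase and keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

counts = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set(word for c in counts for word in c))
matrix = [[c[word] for word in vocabulary] for c in counts]  # one row per document

print(vocabulary)
for row in matrix:
    print(row)
```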


![](https://i.imgur.com/OQDeQlb.png)

# Document Similarity: Example

![](https://i.imgur.com/PyirXsy.png)

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['The weather is hot under the sun',
'I make my hot chocolate with milk',
'One hot encoding',
'I will have a chai latte with milk',
'There is a hot sale today']
# create the document-term matrix with count vectorizer
cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(corpus).toarray()
dt = pd.DataFrame(X, columns=cv.get_feature_names_out())
dt

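| | chai | chocolate | encoding | hot | latte | make | milk | sale | sun | today | weather |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |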


# Document Similarity: Example

In [None]:
# calculate the cosine similarity between all combinations of documents
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

# list all of the combinations of 5 take 2 as well as the pairs of phrases
pairs = list(combinations(range(len(corpus)),2)) #sentence (0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), .., (3,4))
print(pairs)
combos = [(corpus[a_index], corpus[b_index]) for (a_index, b_index) in pairs]
print (combos)

# calculate the cosine similarity for all pairs of phrases and sort by most similar
results = [cosine_similarity([X[a_index]], [X[b_index]]) for (a_index, b_index) in pairs]
sorted(zip(results, combos), reverse=True)


[(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
[('The weather is hot under the sun', 'I make my hot chocolate with milk'), ('The weather is hot under the sun', 'One hot encoding'), ('The weather is hot under the sun', 'I will have a chai latte with milk'), ('The weather is hot under the sun', 'There is a hot sale today'), ('I make my hot chocolate with milk', 'One hot encoding'), ('I make my hot chocolate with milk', 'I will have a chai latte with milk'), ('I make my hot chocolate with milk', 'There is a hot sale today'), ('One hot encoding', 'I will have a chai latte with milk'), ('One hot encoding', 'There is a hot sale today'), ('I will have a chai latte with milk', 'There is a hot sale today')]


[(array([[0.40824829]]),
  ('The weather is hot under the sun', 'One hot encoding')),
 (array([[0.40824829]]), ('One hot encoding', 'There is a hot sale today')),
 (array([[0.35355339]]),
  ('I make my hot chocolate with milk', 'One hot encoding')),
 (array([[0.33333333]]),
  ('The weather is hot under the sun', 'There is a hot sale today')),
 (array([[0.28867513]]),
  ('The weather is hot under the sun', 'I make my hot chocolate with milk')),
 (array([[0.28867513]]),
  ('I make my hot chocolate with milk', 'There is a hot sale today')),
 (array([[0.28867513]]),
  ('I make my hot chocolate with milk', 'I will have a chai latte with milk')),
 (array([[0.]]),
  ('The weather is hot under the sun', 'I will have a chai latte with milk')),
 (array([[0.]]), ('One hot encoding', 'I will have a chai latte with milk')),
 (array([[0.]]),
  ('I will have a chai latte with milk', 'There is a hot sale today'))]
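
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below recomputes two of the values above by hand from the count vectors in the document-term matrix. The helper name `cosine` is illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Count vectors copied from the document-term matrix
# columns: chai, chocolate, encoding, hot, latte, make, milk, sale, sun, today, weather
doc0 = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]  # 'The weather is hot under the sun'
doc2 = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 'One hot encoding'
doc3 = [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]  # 'I will have a chai latte with milk'

print(round(cosine(doc0, doc2), 8))  # one shared word: 1 / (sqrt(3) * sqrt(2))
print(cosine(doc0, doc3))            # no shared words
```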

In [None]:
pairs = list(combinations(range(5),2))
pairs

[(0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (1, 2),
 (1, 3),
 (1, 4),
 (2, 3),
 (2, 4),
 (3, 4)]

![](https://i.imgur.com/jrfN6Jj.png)

![](https://i.imgur.com/BI8XP92.png)

![](https://i.imgur.com/3IbfQXT.png)

![](https://i.imgur.com/pnNqzql.png)

In [None]:
import pandas as pd
corpus = ['This is the first document.',
         'This is the second document.',
         'And the third one. One is fun.']
# original Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=cv.get_feature_names_out())



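| | and | document | first | fun | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 0 |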


In [None]:
# new TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
cv_tfidf = TfidfVectorizer()
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names_out())

| | and | document | first | fun | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.450145 | 0.591887 | 0.0 | 0.349578 | 0.0 | 0.0 | 0.349578 | 0.0 | 0.450145 |
| 1 | 0.0 | 0.450145 | 0.0 | 0.0 | 0.349578 | 0.0 | 0.591887 | 0.349578 | 0.0 | 0.450145 |
| 2 | 0.36043 | 0.0 | 0.0 | 0.36043 | 0.212876 | 0.72086 | 0.0 | 0.212876 | 0.36043 | 0.0 |
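
The TF-IDF values above can be reproduced by hand, assuming the default settings of `TfidfVectorizer`: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The sketch below uses only the standard library, and the helper names are illustrative.

```python
import math
import re
from collections import Counter

corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one. One is fun.']

docs = [Counter(re.findall(r'\b\w\w+\b', d.lower())) for d in corpus]
n_docs = len(docs)
vocabulary = sorted(set(w for d in docs for w in d))

# smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + sum(w in d for d in docs))) + 1
       for w in vocabulary}

def tfidf_row(doc):
    weights = {w: doc[w] * idf[w] for w in doc}             # raw count times idf
    norm = math.sqrt(sum(x * x for x in weights.values()))  # L2 length of the row
    return {w: weights[w] / norm for w in weights}

rows = [tfidf_row(d) for d in docs]
print({w: round(x, 6) for w, x in rows[0].items()})
```

Words that appear in every document ('is', 'the') get the lowest weight, while a word unique to one document ('first') gets the highest.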


![](https://i.imgur.com/xlJibKw.png)

## Document Similarity: Example with TF-IDF

In [None]:
corpus = ['The weather is hot under the sun',
'I make my hot chocolate with milk',
'One hot encoding',
'I will have a chai latte with milk',
'There is a hot sale today']

from sklearn.feature_extraction.text import TfidfVectorizer
# create the document-term matrix with TF-IDF vectorizer
cv_tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
dt_tfidf = pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names_out())
dt_tfidf

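| | chai | chocolate | encoding | hot | latte | make | milk | sale | sun | today | weather |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.370086 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6569 | 0.0 | 0.6569 |
| 1 | 0.0 | 0.580423 | 0.0 | 0.327 | 0.0 | 0.580423 | 0.468282 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.871247 | 0.490845 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.614189 | 0.0 | 0.0 | 0.0 | 0.614189 | 0.0 | 0.495524 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.370086 | 0.0 | 0.0 | 0.0 | 0.6569 | 0.0 | 0.6569 | 0.0 |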


In [None]:
# calculate the cosine similarity for all pairs of phrases and sort by most similar
results_tfidf = [cosine_similarity([X_tfidf[a_index]], [X_tfidf[b_index]]) for (a_index, b_index) in pairs]
sorted(zip(results_tfidf, combos), reverse=True)


[(array([[0.23204486]]),
  ('I make my hot chocolate with milk', 'I will have a chai latte with milk')),
 (array([[0.18165505]]),
  ('The weather is hot under the sun', 'One hot encoding')),
 (array([[0.18165505]]), ('One hot encoding', 'There is a hot sale today')),
 (array([[0.16050661]]),
  ('I make my hot chocolate with milk', 'One hot encoding')),
 (array([[0.1369638]]),
  ('The weather is hot under the sun', 'There is a hot sale today')),
 (array([[0.12101835]]),
  ('The weather is hot under the sun', 'I make my hot chocolate with milk')),
 (array([[0.12101835]]),
  ('I make my hot chocolate with milk', 'There is a hot sale today')),
 (array([[0.]]),
  ('The weather is hot under the sun', 'I will have a chai latte with milk')),
 (array([[0.]]), ('One hot encoding', 'I will have a chai latte with milk')),
 (array([[0.]]),
  ('I will have a chai latte with milk', 'There is a hot sale today'))]

![](https://i.imgur.com/mj4J60v.png)

# <font color=red>Text Similarity Exercise</font>

## Introduction

We will be using a song lyric dataset from Kaggle to identify songs with similar lyrics. The data set contains artists, songs and lyrics for 55K+ songs, but today we will be focusing on songs by one group in particular - The Beatles.

The following code will help you load in the data and get set up for this exercise.


In [None]:
import nltk
import pandas as pd

In [None]:
data = pd.read_csv('songdata.csv')
data.head()

## Question 1

Apply the following preprocessing steps:

- Note the '\n' (new line) characters in the lyrics. Remove them using regular expressions.


## Question 2

(a) List all the rows with "Imagine" in the title


## Question 3

(a) Extract the first line of the lyrics from the first song.


(b) Find out the sentiment of the extracted lyric.