# Advanced natural language processing and handling basic scenarios

We will cover various advanced NLP techniques in this notebook and use some popular ML algorithms to extract information from text.

Some of the topics covered include:
* Noun phrase extraction
* Phonetic (sound-based) matching
* Part-of-speech tagging
* Text similarity
* Word sense disambiguation 

We will also go through basics of different scenarios in NLP, including:
* Sentiment analysis using TextBlob
* Classification
* Text Disambiguation

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Before getting into recipes, let’s first understand the NLP pipeline and life cycle. This notebook implements many concepts, and the amount of content can be overwhelming. To make things simpler and smoother, let’s look at the flow we need to follow for an NLP solution.
For example, let’s consider customer sentiment analysis and prediction for a product, brand or service.

* Define the problem: Understand customer sentiment across the products.
* Understand the depth and breadth of the problem: Why are we doing this? What is the business impact?
* Data requirement brainstorming: Hold a brainstorming session to list all possible data points.
* Data collection: Based on the data and the problem, we might have to use different data collection methods. In this case, we can use web scraping and the Twitter APIs.
* Text preprocessing: Data won’t always be clean. We need to spend a significant amount of time processing it with different methods before we can extract insight from it.
* Text to features: Text is made of characters, which machines have a tough time understanding. We have to convert it into features that machines and algorithms can understand, using encoding methodologies (discussed in a separate notebook).
* Machine learning/deep learning: Both fall under the artificial intelligence umbrella and let systems learn patterns in the data automatically, without being explicitly programmed. Most NLP solutions are based on them, and once text is converted to features, we can use machine learning or deep learning algorithms for goals like text classification, natural language generation, etc.
* Insights and deployment: There is no use in building NLP solutions unless proper insights are communicated to the business. Always take time to connect the dots between the model/analysis output and the business, thereby creating the maximum impact.

<h2>Noun phrase extraction</h2>

In [None]:
import nltk
from textblob import TextBlob

In [None]:
blob = TextBlob("Jacob is learning NLP")
for noun in blob.noun_phrases:
    print(noun)

<h2>Text similarity</h2>

This can be achieved in multiple ways:

* Cosine similarity: Calculates the cosine of the angle between the two vectors.
* Jaccard similarity: The score is the size of the intersection of the two word sets divided by the size of their union.

    `Jaccard Index = (the number in both sets) / (the number in either set) * 100`.

* Levenshtein distance: Minimal number of insertions, deletions, and replacements required for transforming string “a” into string “b.”
* Hamming distance: Number of positions at which the two strings have different symbols. It is defined only for strings of equal length.
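
The set- and string-based measures in the list above can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the function names `jaccard_index`, `levenshtein` and `hamming` are chosen here, whitespace tokenization is an assumption, and the Jaccard index is returned as a fraction instead of being multiplied by 100.

```python
def jaccard_index(a, b):
    """Jaccard index of two whitespace-tokenized texts, as a fraction in [0, 1]."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # replacement
        previous = current
    return previous[-1]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(jaccard_index("I like NLP", "I like advanced NLP"))
print(levenshtein("natural", "nutural"))
print(hamming("natural", "nutural"))
```

The Levenshtein function keeps only two rows of the dynamic-programming table, which is enough when just the distance (and not the edit script) is needed.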

We will work with cosine similarity, as it is one of the most widely used and simplest ways to obtain similarity scores:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
documents = (
"I like NLP",
"I am exploring NLP",
"I am a beginner in NLP",
"I want to learn NLP",
"I like advanced NLP"
)

In [None]:
# Fit a TF-IDF vectorizer and transform the documents into a sparse matrix
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

In [None]:
tfidf.shape
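
To make the TF-IDF weighting less of a black box, here is a rough pure-Python sketch. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how I understand scikit-learn's defaults. The helper name `tfidf_vectors` and the whitespace tokenizer are assumptions made for illustration, so the numbers will not match the vectorizer exactly (for instance, scikit-learn's default tokenizer drops single-character tokens such as "I").

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, one L2-normalized TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({tok for doc in tokenized for tok in doc})
    # Document frequency: in how many documents each term occurs
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocab, vectors


docs = ["I like NLP", "I am exploring NLP", "I like advanced NLP"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print([round(w, 3) for w in vectors[0]])
```

Terms that occur in every document (here "nlp") get the lowest possible idf of 1, so rarer terms such as "exploring" carry more weight.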

In [None]:
# Calculating similarity of first sentence with other sentences
cosine_similarity(tfidf[0], tfidf)

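Apart from the first sentence's similarity with itself (which is 1), the highest score is between the first and last sentences, because they share the words "like" and "NLP".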
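
For intuition, the cosine itself can be computed by hand. The sketch below uses raw word counts instead of TF-IDF weights and a whitespace tokenizer (both assumptions made to keep it short), so the values differ from the scikit-learn output above; the helper name `cosine` is illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[tok] * vb[tok] for tok in va)  # missing tokens count as 0
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm_a * norm_b)


print(cosine("I like NLP", "I like advanced NLP"))
print(cosine("I like NLP", "I am exploring NLP"))
```

Even with raw counts, the pair that shares more words scores higher.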

<h2>Phonetic matching</h2>

Phonetic matching encodes a word or sentence as a short string based on how it sounds, so that two texts can be matched roughly even when they are spelled differently. It is useful for searching a corpus of texts, correcting spelling errors, etc.

In [None]:
!pip install soundex

In [None]:
import soundex

In [None]:
# Running the soundex function
worker = soundex.Soundex()

Words that sound the same but are spelled slightly differently get the same code. This helps in identifying typos and formatting variations:

In [None]:
worker.soundex("natural")

In [None]:
worker.soundex("nataral")

In [None]:
worker.soundex("nutural")

In [None]:
worker.soundex("language")

In [None]:
worker.soundex("processing")
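
To see why the misspellings collapse to the same code, here is a minimal sketch of the classic American Soundex rules: keep the first letter, map the remaining consonants to digits, skip vowels, collapse adjacent identical digits, and pad to four characters. The function name `soundex_code` is hypothetical, and the `soundex` package used above may differ in details.

```python
# Consonant groups that share a Soundex digit
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex_code(word, length=4):
    """Classic Soundex code: first letter followed by up to three digits."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = CODES.get(ch, "")  # vowels (and y) have no digit
        if digit and digit != prev:
            encoded += digit
        prev = digit  # a vowel resets prev, so a repeated digit is allowed after it
    return (encoded + "0" * length)[:length]


for word in ("natural", "nataral", "nutural", "language", "processing"):
    print(word, soundex_code(word))
```

Because vowels are dropped, swapping one vowel for another (as in the three spellings of "natural") cannot change the code.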

<h2>Tagging parts of speech</h2>

Part-of-speech (POS) tagging labels each word with its part of speech, such as noun, adjective, verb, etc. It is the foundation for named entity recognition, question answering and word sense disambiguation. There are two main ways to build taggers:

* Rule based - Manually created rules assign a word to a particular POS tag.

* Stochastic based - Captures the sequence of words and the probability of each tag using hidden Markov models.
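
As a rough illustration of the rule-based approach, the sketch below tags tokens with a handful of hand-written regular-expression rules and falls back to `NN` when nothing matches. The rules and the function name `rule_based_tag` are hypothetical and far too crude for real text; a stochastic tagger would instead learn tag probabilities from annotated data.

```python
import re

# Hypothetical hand-written rules, tried in order: (pattern, Penn Treebank tag)
RULES = [
    (r"^(the|a|an)$", "DT"),   # determiners
    (r"^(and|or|but)$", "CC"), # coordinating conjunctions
    (r"^\d+$", "CD"),          # cardinal numbers
    (r".*ing$", "VBG"),        # gerunds / present participles
    (r".*ed$", "VBD"),         # past-tense verbs
    (r".*ly$", "RB"),          # adverbs
    (r".*s$", "NNS"),          # plural nouns
]
COMPILED = [(re.compile(pattern, re.IGNORECASE), tag) for pattern, tag in RULES]


def rule_based_tag(tokens, default="NN"):
    """Tag each token with the first matching rule, or the default tag."""
    tagged = []
    for tok in tokens:
        for pattern, tag in COMPILED:
            if pattern.match(tok):
                tagged.append((tok, tag))
                break
        else:
            tagged.append((tok, default))
    return tagged


print(rule_based_tag("the dogs barked loudly".split()))
```

Rule order matters: a word such as "is" would be caught by the plural-noun rule, which is exactly the kind of error that motivates statistical taggers.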

In [None]:
# Achieving this using NLTK
Text = "I like to eat apples. They are healthy"
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))

POS tagging can return many different tags; here are a few to get you started:

* CC coordinating conjunction
* CD cardinal digit
* DT determiner
* EX existential there (like: “there is” ... think of it like “there exists”)
* FW foreign word
* IN preposition/subordinating conjunction
* JJ adjective ‘big’
* JJR adjective, comparative ‘bigger’
* JJS adjective, superlative ‘biggest’
* LS list marker 1)
* MD modal could, will
* NN noun, singular ‘desk’
* NNS noun plural ‘desks’

In [None]:
# First we do sentence tokenization
sentences = sent_tokenize(Text)
for sentence in sentences:
    # Word-tokenize the sentence and remove stop words
    words = word_tokenize(sentence)
    words = [w for w in words if w not in stop_words]
    # POS tagging
    tags = nltk.pos_tag(words)
    print(tags)

<h2>Entity extraction</h2>

This is also known as named entity recognition (NER). Several libraries and services can help with this task, including the NLTK chunker, spaCy, Watson NLU, AlchemyAPI, Google Cloud, etc.

In [None]:
sent = "John is studying at Stanford University in California"

In [None]:
from nltk import ne_chunk
from nltk import word_tokenize

In the output of the cell below, we can see:
* John is tagged as a person
* Stanford is tagged as an organization 
* California is tagged as a Geopolitical entity (GPE)

In [None]:
s = ne_chunk(nltk.pos_tag(word_tokenize(sent)), binary=False)
print(s)

In [None]:
!python -m spacy download en_core_web_md

In [None]:
import spacy

In [None]:
# spaCy 3 removed the `spacy link` command, so load the model by its full package name
nlp = spacy.load('en_core_web_md')

In [None]:
doc = nlp(u'Samsung is launching their new phone for 1000 dollars in Amsterdam')

Here we see it has recognized:
* Samsung as an organization (ORG)
* 1000 Dollars as money (MONEY)
* Amsterdam as a geopolitical entity (GPE)

In [None]:
for i in doc.ents:
    print(i.text, i.start_char, i.end_char, i.label_)

<h2>Extracting topics from text</h2>

Suppose we have an online library with multiple departments organized by genre; when a new book arrives, we should be able to identify which section it belongs to. Topic modelling can do this: it is a procedure for tagging and clustering documents by the topics they contain.

In [None]:
# Let us work with an example
s1 = "I am learning NLP, it is very interesting and exciting. it includes machine learning and deep learning"
s2 = "My father is a data scientist and he is nlp expert"
s3 = "My sister has experience in android development"

In [None]:
s_final = [s1, s2, s3]
s_final

In [None]:
# Cleaning the data
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
doc_clean = [clean(doc).split() for doc in s_final]
doc_clean

In [None]:
# We will prepare a document term matrix
import gensim
from gensim import corpora

# Create term dictionary of our corpus 
# Each unique term is assigned an index

dictionary = corpora.Dictionary(doc_clean)
print(dictionary)

In [None]:
dictionary[4]

We will now create a document-term matrix: for each document, a list of (term index, frequency) pairs. The indices are mapped from the dictionary.

In [None]:
# Converting list of docs into matrix using dictionary prepared above
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
doc_term_matrix

In the first list element we can see the pair (4, 3): 4 is the index of 'learning' in the dictionary and 3 is the number of times the word occurs in the first sentence. We will now create an LDA model.
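
The bag-of-words conversion can be sketched without gensim. The helper names `build_dictionary` and `to_bow` are hypothetical, and assigning ids in first-seen order is an assumption made for simplicity; gensim may number the terms differently.

```python
from collections import Counter


def build_dictionary(docs):
    """Map every unique token to an integer id, in first-seen order."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert a tokenized document into sorted (token id, frequency) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


docs = [["learning", "nlp", "learning"], ["nlp", "expert"]]
token2id = build_dictionary(docs)
print(token2id)
print([to_bow(doc, token2id) for doc in docs])
```

Each document becomes a sparse list: only the terms that actually occur are stored, together with how often they occur.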

In [None]:
Lda = gensim.models.ldamodel.LdaModel
# Running and Training LDA model on the document term matrix
# for 3 topics.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

In [None]:
print(ldamodel.print_topics())

We can see the weights of the words associated with each topic. They are currently very close to each other because the corpus is tiny. We can apply the same procedure to huge data sets to extract significant topics with high weights.

<h2>Text Disambiguation</h2>

Sentences can be ambiguous because the same word has different meanings in different contexts.

For example, **bank** has different meanings in:

* The bank gave me a loan
* She was at the river bank

We need a method to understand and clear up this ambiguity. The Lesk algorithm can assist us:
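
The core idea behind Lesk is to pick the sense whose dictionary gloss shares the most words with the surrounding context. The sketch below illustrates that idea only: the two glosses, the stop-word list and the function name `simplified_lesk` are hand-written assumptions, whereas real implementations draw their glosses from WordNet.

```python
# Hypothetical, hand-written glosses for two senses of "bank"
SENSES = {
    "bank (financial institution)":
        "a financial institution that accepts deposits and grants a loan or credit",
    "bank (river side)":
        "sloping land beside a river or lake or other body of water",
}
# Small illustrative stop-word list
STOP = {"a", "an", "the", "of", "and", "or", "that", "was", "at", "me", "she", "as", "to", "in"}


def simplified_lesk(sentence, senses=SENSES):
    """Return the sense whose gloss overlaps most with the sentence's words."""
    context = {w.strip(".,").lower() for w in sentence.split()} - STOP
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & (set(gloss.lower().split()) - STOP))
        if overlap > best_overlap:  # on a tie, the first listed sense wins
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("The bank gave me a loan"))
print(simplified_lesk("She was at the river bank"))
```

With exact word matching, "loans" would not overlap with "loan"; practical variants therefore stem or lemmatize both the context and the glosses first.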

In [None]:
!pip install -U wn==0.0.22

In [None]:
!pip install pywsd

In [None]:
from nltk.stem import PorterStemmer
from itertools import chain
from pywsd.lesk import simple_lesk

In [None]:
sentences = ['The river bank was full of fish today',
            'Christopher works at the bank as an accountant']

In [None]:
print("Context-1: ", sentences[0])
preds = simple_lesk(sentences[0], 'bank')
print("Sense: ", preds)
print("Definition: ", preds.definition())

In [None]:
print("Context-2: ", sentences[1])
preds = simple_lesk(sentences[1], 'bank')
print("Sense: ", preds)
print("Definition: ", preds.definition())

<h2>Classifying data - working with spam</h2>

We will put the basic NLP techniques gathered so far to use and build a spam classifier with vanilla machine learning algorithms and TF-IDF vectors.

In [None]:
# Imports
import os
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.feature_extraction.text as text
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from textblob import TextBlob, Word

In [None]:
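# The file is assumed not to be UTF-8 encoded; latin-1 avoids a UnicodeDecodeError
df = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv', encoding='latin-1')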

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
# We can extract the useful columns
df = df[['v1', 'v2']]
df.head()

In [None]:
df.dtypes

In [None]:
# Renaming the columns to descriptive names
df = df.rename(columns={"v1":"label", "v2":"message"})

In [None]:
df.head()

Cleaning up the data

In [None]:
# Converting to lowercase
df['message'] = df['message'].apply(lambda x: " ".join(x.lower() for x in x.split()))

In [None]:
# Removing stop-words (a set makes the membership test fast)
words = set(stopwords.words('english'))
df['message'] = df['message'].apply(lambda x: " ".join(x for x in x.split() if x not in words))

In [None]:
# Stemming
porter = PorterStemmer()
df['message'] = df['message'].apply(lambda x: " ".join(porter.stem(word) for word in x.split()))

In [None]:
# Lemmatization
df['message'] = df['message'].apply(lambda x: " ".join(Word(word).lemmatize() for word in x.split()))

In [None]:
df.head()

In [None]:
# Proceeding to split the data
x_train, x_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2)

In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
# Converting our categorical labels to numerical values
le = preprocessing.LabelEncoder()
y_train = le.fit_transform(y_train)
# Reuse the mapping learned on the training labels instead of refitting
y_test = le.transform(y_test)

In [None]:
y_train

In [None]:
tfidf = TfidfVectorizer(analyzer='word')
# Fit on the training messages only so no test-set vocabulary leaks into training
tfidf.fit(x_train)

In [None]:
x_train_t = tfidf.transform(x_train)
x_test_t = tfidf.transform(x_test)

In [None]:
x_train_t.shape

In [None]:
# Creating a generalized model training function
def model_training(clf, x_train, y_train, x_test, y_test):
    # Fit the classifier on the training data
    clf.fit(x_train, y_train)
    # Predict on the held-out test data and score against the true labels
    preds = clf.predict(x_test)
    return metrics.accuracy_score(y_test, preds)

In [None]:
accuracy = model_training(naive_bayes.MultinomialNB(), x_train_t, y_train, x_test_t, y_test)
print("Accuracy using Naive Bayes:", accuracy)

In [None]:
accuracy = model_training(linear_model.LogisticRegression(), x_train_t, y_train, x_test_t, y_test)
print("Accuracy using logistic regression:", accuracy)

<h2>Basics on Sentiment Analysis</h2>
Sentiment analysis is one of the most widely used techniques across industries for understanding how customers/users feel about products/services. It assigns a sentence/statement a sentiment score that tends toward positive or negative.

We can do this through training our models on data like the spam classification tasks above, but with different sentiments. Here we will proceed with a library implementation. The other notebooks will cover deep-learning implementations using RNNs, LSTMs and transformers.

The TextBlob library will give us two metrics:
* Polarity - A value in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement

* Subjectivity - A value in the range [0, 1], where values near 1 indicate personal opinion and values near 0 indicate factual information
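
As far as I understand, TextBlob's default analyzer is lexicon based: it looks words up in a table of polarity and subjectivity scores and averages them. The sketch below mimics only that averaging idea. The lexicon values are invented for illustration, the function name `lexicon_sentiment` is hypothetical, and refinements such as negation and intensifiers are ignored.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity); values are invented
LEXICON = {
    "amazing": (0.6, 0.9),
    "fast": (0.2, 0.6),
    "strong": (0.4, 0.7),
    "bad": (-0.7, 0.7),
    "annoying": (-0.8, 0.9),
}


def lexicon_sentiment(text):
    """Average the scores of all lexicon words found in the text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    if not scores:
        return 0.0, 0.0  # no known words: treat the text as neutral and objective
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity


print(lexicon_sentiment("The phone is amazing to use. It is fast"))
print(lexicon_sentiment("It has been a bad experience and annoying to use."))
```

Words missing from the lexicon contribute nothing, which is why lexicon-based scores can be unreliable on slang or domain-specific vocabulary.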

In [None]:
# Our sample data
review1 = "The phone is amazing to use. It is fast and has a strong build quality"
review2 = "Using this phone has been a bad experience. It is sloggy, bulky, clunky and annoying to use."

In [None]:
# Pre-processing function
import re
from textblob import Word

# A few common emoticons such as :) ;-) :( :D =P (not an exhaustive list)
EMOTICONS = re.compile(r"(?<!\w)[:;=]-?[)(dpo/|](?!\w)", re.IGNORECASE)

def processRow(row):
    # Lower case (str.lower returns a new string, so keep the result)
    tweet = row.lower()
    # Remove literal unicode escapes like "\u002c" and any non-ASCII characters
    tweet = re.sub(r'\\u[0-9A-Fa-f]+', '', tweet)
    tweet = re.sub(r'[^\x00-\x7f]', '', tweet)
    # Convert any URL to "URL"
    tweet = re.sub(r'(www\.\S+)|(https?://\S+)', 'URL', tweet)
    # Convert any @username to "AT_USER"
    tweet = re.sub(r'@\S+', 'AT_USER', tweet)
    # Remove emoticons while their punctuation is still present
    tweet = EMOTICONS.sub(' ', tweet)
    # Replace #word with word
    tweet = re.sub(r'#(\S+)', r'\1', tweet)
    # Replace every remaining non-alphanumeric symbol with a space
    # (this also covers repeated "!", "?" and ".")
    tweet = re.sub(r'[^\w]', ' ', tweet)
    # Remove numbers
    tweet = re.sub(r'\d+', '', tweet)
    # Lemmatize; split/join also collapses repeated white space
    tweet = " ".join(Word(word).lemmatize() for word in tweet.split())
    return tweet

In [None]:
review1 = processRow(review1)
review1

In [None]:
review2 = processRow(review2)
review2

In [None]:
# We get a positive score - Positive sentiment
blob = TextBlob(review1)
blob.sentiment

In [None]:
# We get a negative score - negative sentiment
blob = TextBlob(review2)
blob.sentiment