# Natural Language Processing 
Natural Language Processing (NLP) is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data efficiently.

In [0]:
# 1 import data set

import numpy as np   
import pandas as pd  

from google.colab import files
uploaded = files.upload()
  
  
 
# Import dataset 
#dataset = pd.read_csv('E:/ML/NLP/Restaurant_Reviews.tsv', delimiter = '\t')  


Saving Restaurant_Reviews.tsv to Restaurant_Reviews.tsv


In [0]:
import io
data = io.BytesIO(uploaded['Restaurant_Reviews.tsv'])    

dataset = pd.read_csv(data,delimiter = '\t')

In [0]:
dataset[:5]

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


# 2  Text Preprocessing/Cleaning

Remove punctuation and numbers: punctuation marks and numbers do not help much in processing the given text. If they are kept, they only increase the size of the bag of words that we create as the last step and decrease the efficiency of the algorithm.   

Stemming: rule-based removal of suffixes (“ing”, “ly”, “es”, “s” etc.) from a word.

Lemmatization: obtaining the root word (lemma) of a word.

Ex:-    Fishing, fisher, fished ----> fish



1) Noise Removal: removal of stop words

2) Lexicon Normalization 
   1) Lemmatization 
   2) Stemming

Convert each word to lower case.

3) Object Standardization 
Text data often contains words or phrases which are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models. 

Examples include acronyms, hashtags with attached words, and colloquial slang. Regular expressions and manually prepared data dictionaries can fix this type of noise. 

In [0]:
# library to clean data 
import re  
  
# Natural Language Tool Kit 
import nltk  
  
nltk.download('stopwords') 
  
# to remove stopword 
from nltk.corpus import stopwords

# for stemming purposes
from nltk.stem.porter import PorterStemmer 




  
# Initialize empty list 
# to append clean text  
cor = [] 

# build the stop-word set and the stemmer once, outside the loop 
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

# clean every review (1000 rows in this dataset) 
for i in range(len(dataset)):  
      
    # column : "Review", row ith; replace every non-letter with a space 
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  
      
    # convert all cases to lower cases 
    review = review.lower()  
      
    # split into a list of words (default delimiter is whitespace) 
    review = review.split()  
      
    # stem each word of the ith review 
    # and drop the stop words     
    review = [ps.stem(word) for word in review 
                if word not in stop_words]  
                  
    # rejoin all string array elements 
    # to create back into a string 
    review = ' '.join(review)   
      
    # append each string to create 
    # array of clean text  
    cor.append(review) 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
cor[0:10]   # after stop-word removal & stemming

['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price',
 'get angri want damn pho',
 'honeslti tast fresh',
 'potato like rubber could tell made ahead time kept warmer',
 'fri great',
 'great touch']
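
The effect of the cleaning steps on a single review can be sketched without NLTK. The snippet below is only an illustration: `STOP_WORDS` is a tiny hypothetical list (the notebook uses NLTK's much larger English list), `clean_review` is a made-up helper name, and stemming is left out.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook itself uses NLTK's English stop-word list.
STOP_WORDS = {"is", "not", "this", "the", "and", "was", "just"}

def clean_review(text):
    # replace every character that is not a letter with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    # lower-case and split on whitespace
    words = letters_only.lower().split()
    # drop stop words (no stemming in this sketch)
    return [w for w in words if w not in STOP_WORDS]

print(clean_review("Crust is not good."))        # ['crust', 'good']
print(clean_review("Wow... Loved this place."))  # ['wow', 'loved', 'place']
```

Note that "not" is treated as a stop word, so the negative review "Crust is not good." is reduced to "crust good". This matches the second entry of the output above and is worth keeping in mind when judging the classifier later.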

# Or: lemmatization with WordNet, compared with stemming

In [0]:
import nltk
nltk.download('wordnet')

from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer 
stem = PorterStemmer()


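word = "multiplying" 
# lemmatize() treats the word as a noun by default (pos='n'),
# so this verb form comes back unchanged
lem.lemmatize(word)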



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


'multiplying'

In [0]:
stem.stem(word)

'multipli'

In [0]:
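# lookup table mapping slang and abbreviations to their standard forms
lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', "awsm": "awesome", "luv": "love"}

def lookup_words(input_text):
    words = input_text.split() 
    new_words = [] 
    for word in words:
        # replace the word if its lower-cased form is in the lookup table
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    # join once, after the loop, so that empty input also returns a string
    return " ".join(new_words)

lookup_words("RT this is a retweeted dm tweet by Shivam Bansal")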

'Retweet this is a retweeted direct message tweet by Shivam Bansal'
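
The lookup table covers acronyms and slang. For hashtags with attached words, a regular expression can help. The following is a minimal sketch, assuming the hashtag is written in CamelCase; `split_hashtag` is a hypothetical helper name, and hashtags written entirely in lower case would need a dictionary-based approach instead.

```python
import re

def split_hashtag(tag):
    # strip the leading '#'
    body = tag.lstrip('#')
    # match, in order: an upper-case run not followed by a lower-case letter
    # (an acronym), a capitalized or lower-case word, or a run of digits
    return re.findall(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+', body)

print(split_hashtag("#NaturalLanguageProcessing"))
# ['Natural', 'Language', 'Processing']
```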

# 3   Tokenization: splitting the body of the text into sentences and words

Text to Features (Feature Engineering on text data)  
To analyse preprocessed data, it needs to be converted into features. Depending upon the usage, text features can be constructed using assorted techniques:
 1) Syntactical parsing,  
 2) Entities / N-grams / word-based features,  
 3) Statistical features, and  
 4) Word embeddings

# 1  Syntactical parsing
Syntactical parsing involves the analysis of the words in a sentence for grammar, and their arrangement in a manner that shows the relationships among the words.


Part of speech tagging –   
Apart from the grammar relations, every word in a sentence is also associated with a part of speech (POS) tag (noun, verb, adjective, adverb etc.). The POS tags define the usage and function of a word in the sentence.


In [0]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from nltk import word_tokenize, pos_tag
text = "I am learning Natural Language Processing using google"
tokens = word_tokenize(text)

print( pos_tag(tokens) )

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('using', 'VBG'), ('google', 'NN')]
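
Once the tokens are tagged, the tags can be used to select words by their role. The sketch below copies the tagged output shown above into a list and filters it; it relies only on the fact that Penn Treebank noun tags start with `NN` and verb tags with `VB`.

```python
# Tagged tokens copied from the pos_tag output above
tagged = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'),
          ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'),
          ('using', 'VBG'), ('google', 'NN')]

# Penn Treebank noun tags all start with 'NN' (NN, NNS, NNP, NNPS),
# verb tags all start with 'VB' (VB, VBD, VBG, VBN, VBP, VBZ)
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]

print(nouns)  # ['Natural', 'Language', 'Processing', 'google']
print(verbs)  # ['am', 'learning', 'using']
```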


# 2  Entity Extraction (Entities as features)

Entities are defined as the most important chunks of a sentence – noun phrases, verb phrases or both. Entity Detection algorithms are generally ensemble models of rule based parsing, dictionary lookups, pos tagging and dependency parsing. The applicability of entity detection can be seen in the automated chat bots, content analyzers and consumer insights.

Topic Modelling & Named Entity Recognition are the two key entity detection methods in NLP.

# A. Named Entity Recognition (NER)

The process of detecting named entities such as person names, location names, company names etc. in text is called NER. For example:

Sentence – Sergey Brin, the manager of Google Inc. is walking in the streets of New York.

Named Entities –  ( “person” : “Sergey Brin” ), (“org” : “Google Inc.”), (“location” : “New York”)

A typical NER model consists of three blocks:

1. Noun phrase identification: 
This step deals with extracting all the noun phrases from a text using dependency parsing and part of speech tagging.


2. Phrase classification: 
This is the classification step, in which all the extracted noun phrases are classified into their respective categories (locations, names etc.). The Google Maps API provides a good way to disambiguate locations, and open databases such as DBpedia and Wikipedia can be used to identify person names or company names. Apart from this, one can curate lookup tables and dictionaries by combining information from different sources.


3. Entity disambiguation: 
Sometimes entities are misclassified, so a validation layer on top of the results is useful. Knowledge graphs can be used for this purpose. Popular knowledge graphs include Google Knowledge Graph, IBM Watson and Wikipedia. 
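
The phrase classification step can be illustrated with a curated lookup table. This is only a sketch of the dictionary-lookup idea, not a full NER model: the `GAZETTEER` table and the `classify_phrases` helper are hypothetical, and the phrases are matched by plain substring search instead of a real noun-phrase extractor.

```python
# Hypothetical lookup table (gazetteer); a real system would curate
# these entries from several sources.
GAZETTEER = {
    "Sergey Brin": "person",
    "Google Inc.": "org",
    "New York": "location",
}

def classify_phrases(sentence, gazetteer=GAZETTEER):
    # return (label, phrase) pairs for every known phrase found in the sentence
    found = []
    for phrase, label in gazetteer.items():
        if phrase in sentence:
            found.append((label, phrase))
    return found

sentence = ("Sergey Brin, the manager of Google Inc. is walking "
            "in the streets of New York.")
print(classify_phrases(sentence))
# [('person', 'Sergey Brin'), ('org', 'Google Inc.'), ('location', 'New York')]
```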


# B.  Topic Modeling
Topic modeling is the process of automatically identifying the topics present in a text corpus. It derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model yields “health”, “doctor”, “patient”, “hospital” for the topic “Healthcare”, and “farm”, “crops”, “wheat” for the topic “Farming”.  


Latent Dirichlet Allocation (LDA) is one of the most popular topic modelling techniques. The following code implements topic modeling using LDA in Python.

In [0]:

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

# requires the gensim package
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.  
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above. 
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

# Results 
print(ldamodel.print_topics())

# C.  N-Grams as Features  
A combination of N words together is called an N-gram. N-grams (N > 1) are generally more informative than single words (unigrams) as features. Bigrams (N = 2) in particular are often considered the most important of these. The following code generates the bigrams of a text.

In [0]:
def generate_ngrams(text, n):
    words = text.split()
    output = []  
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output
  
  
generate_ngrams('this is a sample text', 2)  


[['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]

In [0]:
generate_ngrams('this is a sample text', 3)
  

[['this', 'is', 'a'], ['is', 'a', 'sample'], ['a', 'sample', 'text']]
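
To use N-grams as features, they are usually counted. A minimal sketch, assuming whitespace tokenization as in `generate_ngrams`; the helper name `ngram_counts` and the sample sentence are made up for illustration.

```python
from collections import Counter

def ngram_counts(text, n):
    words = text.split()
    # join each window of n words into one string so it can be used as a key
    grams = [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(grams)

counts = ngram_counts('good food good service good food', 2)
print(counts['good food'])   # 2
print(counts['food good'])   # 1
```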

# 3 Statistical Features 
Text data can also be quantified directly into numbers using several techniques described in this section: 

A.  Term Frequency – Inverse Document Frequency (TF – IDF)   
TF-IDF is a weighted model commonly used for information retrieval problems. It aims to convert text documents into vectors on the basis of the occurrence of words in the documents, without considering their exact ordering. For example, say there is a dataset of N text documents. For any document “D”, TF and IDF are defined as follows:


# Term Frequency (TF)
TF for a term “t” is defined as the count of the term “t” in a document “D”.

# Inverse Document Frequency (IDF)
IDF for a term “t” is defined as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term “t”. The TF-IDF weight of a term in a document is the product of its TF and its IDF.

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
X

<3x8 sparse matrix of type '<class 'numpy.float64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [0]:
print(X)

  (0, 7)	0.5844829010200651
  (0, 2)	0.5844829010200651
  (0, 4)	0.444514311537431
  (0, 1)	0.34520501686496574
  (1, 1)	0.3853716274664007
  (1, 0)	0.652490884512534
  (1, 3)	0.652490884512534
  (2, 4)	0.444514311537431
  (2, 1)	0.34520501686496574
  (2, 6)	0.5844829010200651
  (2, 5)	0.5844829010200651
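
The weights printed above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: natural logarithm, smoothed IDF `ln((1 + N) / (1 + df)) + 1`, and each row scaled to unit Euclidean length. The tokenizer is a simplified stand-in (lower-case tokens of two or more word characters), and `tfidf_row` is a hypothetical helper name.

```python
import math
import re

corpus = ['This is sample document.', 'another random document.',
          'third sample document text']

# lower-case and keep tokens of two or more word characters
docs = [re.findall(r'\b\w\w+\b', d.lower()) for d in corpus]
vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

# document frequency and smoothed IDF: ln((1 + N) / (1 + df)) + 1
df = {t: sum(t in doc for doc in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    # raw term count times IDF, then scale the row to unit Euclidean length
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

row0 = tfidf_row(docs[0])
print(vocab.index('document'), round(row0[vocab.index('document')], 4))  # 1 0.3452
print(vocab.index('this'), round(row0[vocab.index('this')], 4))          # 7 0.5845
```

"document" occurs in all three texts, so it gets the lowest IDF and therefore the smallest weight in every row.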


 Making the bag of words via sparse matrix   

Take all the distinct words of the reviews in the dataset, without repetition.
There is one column for each word, so there are going to be many columns.
Each row is a review.
If a word occurs in a review, its count appears in that review's row of the bag of words, under the column of that word.  
Example: let’s take a dataset of only two reviews.  

Input : "dam good steak", "good food good servic"

Output :

| review | good | food | servic | dam | steak |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 1 | 1 | 0 | 0 |
                  
                  
                  
                  
For this purpose we need the CountVectorizer class from sklearn.feature_extraction.text.
We can also cap the number of features with the “max_features” parameter, which keeps only the most frequent words. “.fit_transform(corpus)” learns the vocabulary from the corpus and applies the transformation to the same corpus in one step; the result is then converted into an array. Whether a review is positive or negative is stored in the second column of the dataset, dataset.iloc[:, 1]: all rows, column index 1 (indexing from zero).                  
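
The two-review example can be reproduced in a few lines of plain Python. This is a sketch of the idea only, not of how CountVectorizer is implemented; the columns are sorted alphabetically here, so their order differs from the table in the example.

```python
from collections import Counter

reviews = ["dam good steak", "good food good servic"]

# vocabulary: every distinct word, one column per word
vocab = sorted({w for r in reviews for w in r.split()})

# one row per review, holding the count of each vocabulary word
rows = []
for r in reviews:
    counts = Counter(r.split())
    rows.append([counts[w] for w in vocab])

print(vocab)  # ['dam', 'food', 'good', 'servic', 'steak']
print(rows)   # [[1, 0, 1, 0, 1], [0, 1, 2, 1, 0]]
```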
 

In [0]:
# Creating the Bag of Words model 
from sklearn.feature_extraction.text import CountVectorizer 
  
# To extract max 1500 feature. 
# "max_features" is attribute to 
# experiment with to get better results 
cv = CountVectorizer(max_features = 1500)  
  
# X contains the bag of words built from the corpus (independent variables / features) 
X = cv.fit_transform(cor).toarray()  
  
# y contains answers if review 
# is positive or negative 
y = dataset.iloc[:, 1].values  

In [0]:
print(X)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [0]:
y[0:19]     # 0 is for negative review and 1 is for positive review

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0])

#  Splitting Corpus into Training and Test set. 
For this we need the train_test_split function from sklearn.model_selection (in old scikit-learn versions it lived in sklearn.cross_validation). The split can be 70/30, 80/20, 85/15 or 75/25; here I choose 75/25 via “test_size”.

X is the bag of words, y is 0 or 1 (negative or positive)

In [0]:
from sklearn.model_selection import train_test_split
  
# experiment with "test_size" 
# to get better results 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) 

#  Fitting a Predictive Model (here random forest)

In [0]:
# Fitting the Random Forest classifier to the Training set 
from sklearn.ensemble import RandomForestClassifier 
  
# n_estimators can be said as number of 
# trees, experiment with n_estimators 
# to get better results  
model = RandomForestClassifier(n_estimators = 501, 
                            criterion = 'entropy') 
                              
model.fit(X_train, y_train)  

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=501, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

#  Predicting Final Results using the .predict() method

In [0]:
# Predicting the Test set results 
y_pred = model.predict(X_test) 
  
y_pred

array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0])

# To judge the accuracy, a confusion matrix is needed.

In [0]:
# Making the Confusion Matrix 
from sklearn.metrics import confusion_matrix 
  
cm = confusion_matrix(y_test, y_pred) 
  
cm 

array([[110,  11],
       [ 56,  73]])
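
Accuracy, precision and recall follow directly from the four cells of this matrix. The sketch below copies the values shown above and assumes scikit-learn's layout: rows are true labels, columns are predicted labels, in the order 0 (negative), 1 (positive).

```python
# Confusion matrix copied from the output above: rows are true labels,
# columns are predicted labels, in the order [0, 1]
cm = [[110, 11],
      [56, 73]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(round(accuracy, 3))   # 0.732
print(round(precision, 3))  # 0.869
print(round(recall, 3))     # 0.566
```

In this run the model rarely calls a negative review positive (11 cases), but it misses 56 of the 129 positive reviews.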

# Uses/ tasks of NLP
# 1) text classification

Text classification is one of the classical problems of NLP. Well-known examples include email spam identification, topic classification of news, sentiment classification and organization of web pages by search engines.

In [0]:
from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob
training_corpus = [
                   ('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]
test_corpus = [
                ("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]


In [0]:
model = NBC(training_corpus) 
print(model.classify("Their codes are amazing."))

Class_A


In [0]:
print(model.classify("I don't like their computer."))

Class_B


In [0]:
print(model.accuracy(test_corpus))

0.8333333333333334


# using scikit learn

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn import svm 

# preparing data for SVM model (using the same training_corpus, test_corpus from naive bayes example)
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = [] 
test_labels = [] 
for row in test_corpus:
    test_data.append(row[0]) 
    test_labels.append(row[1])

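# Create feature vectors 
# (min_df=4 keeps only terms found in at least 4 training documents,
#  max_df=0.9 drops terms found in more than 90% of them)
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)
# Learn the vocabulary and IDF weights from the training data
train_vectors = vectorizer.fit_transform(train_data)
# Transform the test data with the same vocabulary 
test_vectors = vectorizer.transform(test_data)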

# Perform classification with SVM, kernel=linear 
model = svm.SVC(kernel='linear') 
model.fit(train_vectors, train_labels) 
prediction = model.predict(test_vectors)
print(prediction)

['Class_A' 'Class_A' 'Class_B' 'Class_B' 'Class_A' 'Class_A']


In [0]:
print (classification_report(test_labels, prediction))

              precision    recall  f1-score   support

     Class_A       0.50      0.67      0.57         3
     Class_B       0.50      0.33      0.40         3

   micro avg       0.50      0.50      0.50         6
   macro avg       0.50      0.50      0.49         6
weighted avg       0.50      0.50      0.49         6



# 2 Text Matching / Similarity

 Important applications of text matching include   
 automatic spelling correction,     
 data de-duplication and    
 genome analysis etc.   
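
One common measure for text matching is the edit (Levenshtein) distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The following is a minimal sketch of the standard dynamic-programming approach, given as an illustration and not as the method of any particular library; `levenshtein` is a hypothetical helper name.

```python
def levenshtein(a, b):
    # dynamic-programming edit distance, keeping one row at a time
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("book", "back"))       # 2
```

A small distance between a misspelled word and a dictionary word makes that dictionary word a candidate for spelling correction.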