<a href="https://colab.research.google.com/github/ujoshidev/TestRepo/blob/main/NLP_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Basics

Natural Language Processing (NLP) is a branch of data science concerned with systematically analyzing, understanding, and deriving information from text data in a smart and efficient manner.

## Installing library

In [1]:
pip install -U nltk

Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Downloading regex-2022.4.24-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (749 kB)
Installing collected packages: regex, nltk
  Attempting uninstall: regex
    Found existing installation: regex 2019.12.20
    Uninstalling regex-2019.12.20:
      Successfully uninstalled regex-2019.12.20
  Attempting uninstall: nltk
    Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.7 regex-2022.4.24


In [2]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloadin

True

## Text Preprocessing

Text preprocessing is the process of cleaning and standardizing text so that it is noise-free and ready for analysis.

It consists mainly of three steps:
* Noise Removal
* Lexicon Normalization
* Object Standardization

### Noise Removal

Any piece of text that is not relevant to the context of the data and the end output can be treated as noise, for example very common words or hashtags.

In [3]:
noise_list = ["is", "a", "this", "..."] 
def _remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

_remove_noise("this is a sample text")

'sample text'

In [4]:
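import re 

def _remove_regex(input_text, regex_pattern):
    # Delete every substring that matches the pattern.
    # Substituting with the pattern itself (rather than with each matched
    # string) avoids treating matched text as a new regular expression.
    return re.sub(regex_pattern, '', input_text)

# Raw string so that \w reaches the regex engine unchanged
regex_pattern = r"#[\w]*"

_remove_regex("remove this #hashtag from analytics vidhya", regex_pattern)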

'remove this  from analytics vidhya'

### Lexicon Normalization

The most common lexicon normalization practices are:

* Stemming: a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
* Lemmatization: an organized, step-by-step procedure for obtaining the root form of a word. Unlike stemming, it makes use of a vocabulary.

In [5]:
#  # required to run lemmatize
# nltk.download('wordnet')
# nltk.download('omw-1.4')

In [6]:
from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer 
stem = PorterStemmer()

word = "multiplying" 
lem.lemmatize(word, "v")


'multiply'

In [7]:
stem.stem(word)

'multipli'

### Object Standardization

Text data often contains words or phrases that are not present in any standard lexical dictionary. Search engines and models do not recognize these pieces.

Examples include acronyms, hashtags with attached words, and colloquial slang.

In [8]:
lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love"}
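def _lookup_words(input_text):
    words = input_text.split() 
    new_words = [] 
    for word in words:
        # Replace known abbreviations and slang, matching case-insensitively
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word) 
    # Join once after the loop so that empty input returns '' instead of raising
    new_text = " ".join(new_words) 
    return new_text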

_lookup_words("RT this is a retweeted tweet by Shivam Bansal")

'Retweet this is a retweeted tweet by Shivam Bansal'

## Text to Features (Feature Engineering on text data)

To analyze preprocessed data, it must first be converted into features. Depending on the use case, text features can be constructed using assorted techniques: syntactic parsing, entities / N-grams / word-based features, statistical features, and word embeddings.

### Syntactic Parsing

Syntactic parsing involves analyzing the words in a sentence for grammar and arranging them in a manner that shows the relationships among the words.

In [9]:
# # to use word_tokenize
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

In [10]:
from nltk import word_tokenize, pos_tag
text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(pos_tag(tokens))

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]


***LESK Algorithm***

The Lesk algorithm is the seminal dictionary-based method for word sense disambiguation.

This is the definition from Wikipedia: "It is based on the hypothesis that words used together in text are related to each other and that the relation can be observed in the definitions of the words and their senses."
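The hypothesis in that definition can be sketched in a few lines of plain Python. This is a simplified illustration, not the `pywsd` implementation used in the cells that follow: the sense names and glosses in `TOY_SENSES` are invented for the example (they are not WordNet entries), and `toy_lesk` is a hypothetical helper that picks the sense whose gloss shares the most words with the sentence.

```python
# Toy sense inventory. These glosses are made up for illustration;
# a real system would read them from a dictionary such as WordNet.
TOY_SENSES = {
    "bank": {
        "financial_institution": "an institution that accepts deposits of money and makes loans",
        "river_side": "sloping land beside a body of water such as a river",
    }
}

STOPWORDS = {"a", "an", "the", "of", "to", "and", "that", "as", "such", "my", "was", "i", "is"}


def content_words(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def toy_lesk(sentence, word, senses=TOY_SENSES):
    """Return the sense of `word` whose gloss overlaps most with the sentence."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(toy_lesk("I went to the bank to deposit my money", "bank"))
print(toy_lesk("The river bank was full of dead fishes", "bank"))
```

Because the match is on exact word overlap, short sentences with few shared words can easily select the wrong sense, which is worth keeping in mind when reading the `pywsd` results.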

In [11]:
!pip3 install pywsd==1.0.2  
# !pip install pywsd  #throwing error due to dependency of word_net.. Library devs are looking into this.

Collecting pywsd==1.0.2
  Downloading pywsd-1.0.2.tar.gz (8.6 kB)
Building wheels for collected packages: pywsd
  Building wheel for pywsd (setup.py) ... done
  Created wheel for pywsd: filename=pywsd-1.0.2-py3-none-any.whl size=12129 sha256=25b0d7571fbf8d215b8ca7560ebed9e4e5f2d6c5ee1554cacc1d8e859d438930
  Stored in directory: /root/.cache/pip/wheels/eb/e4/aa/ae578589aa3be86e761593f8bf31ff9b9a24beee5aa259a055
Successfully built pywsd
Installing collected packages: pywsd
Successfully installed pywsd-1.0.2


In [12]:
#Import functions  
from pywsd.lesk import simple_lesk  #pywsd - python implementation of Word Sense Disambiguation (WSD)

sentences = ['I went to the bank to deposit my money',  
'The river bank was full of dead fishes']  

# calling the lesk function and printing results for both the sentences  
print ("Context-1:", sentences[0])  

answer = simple_lesk(sentences[0],'bank')  
print ("Sense:", answer)  
print ("Definition : ", answer.definition())  

Context-1: I went to the bank to deposit my money
Sense: Synset('deposit.v.02')
Definition :  put into a bank account


In [13]:
print ("Context-2:", sentences[1])  
answer = simple_lesk(sentences[1],'bank')  
print ("Sense:", answer)  
print ("Definition : ", answer.definition())  

Context-2: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition :  sloping land (especially the slope beside a body of water)


In [14]:
new_sentences = ['The workers at the plant were overworked','The plant was no longer bearing flowers','The workers at the industrial plant were overworked']

In [15]:
print ("Context-1:", new_sentences[0])  
answer = simple_lesk(new_sentences[0],'plant')  
print ("Sense:", answer)  
print ("Definition : ", answer.definition())  

Context-1: The workers at the plant were overworked
Sense: Synset('plant.v.06')
Definition :  put firmly in the mind


**Result -- not exactly as expected**

In [16]:
print ("Context-2:", new_sentences[1])  
answer = simple_lesk(new_sentences[1],'plant')  
print ("Sense:", answer)  
print ("Definition : ", answer.definition())  

Context-2: The plant was no longer bearing flowers
Sense: Synset('plant.v.01')
Definition :  put or set (seeds, seedlings, or plants) into the ground


**Result -- close to expected: a planting sense, though a verb rather than the noun**

In [17]:
print ("Context-3:", new_sentences[2])  
answer = simple_lesk(new_sentences[2],'plant')  
print ("Sense:", answer)  
print ("Definition : ", answer.definition())  

Context-3: The workers at the industrial plant were overworked
Sense: Synset('plant.v.06')
Definition :  put firmly in the mind


**Result -- not as expected (same sense as Context-1)**

### Entity Extraction (Entities as features)

Entities are defined as the most important chunks of a sentence: noun phrases, verb phrases, or both. Entity detection algorithms are generally ensemble models of rule-based parsing, dictionary lookups, POS tagging, and dependency parsing.
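Of the components listed above, the dictionary lookup is the simplest to illustrate. The sketch below is only a toy: `GAZETTEER` and `lookup_entities` are hypothetical names, the entries are hand-written for this example, and the matching is a greedy longest-match over word n-grams. It is not how the spaCy model in the next section works.

```python
# Hypothetical gazetteer mapping lowercase surface forms to entity labels.
GAZETTEER = {
    "indian space research organisation": "ORG",
    "isro": "ORG",
    "india": "GPE",
    "bengaluru": "GPE",
}


def lookup_entities(text, gazetteer=GAZETTEER, max_len=4):
    """Greedy longest-match lookup of word n-grams in the gazetteer."""
    # Very rough tokenization: treat commas and full stops as spaces
    words = text.replace(",", " ").replace(".", " ").split()
    entities = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            label = gazetteer.get(candidate.lower())
            if label:
                entities.append((candidate, label))
                i += n
                break
        else:
            # No candidate starting at position i matched
            i += 1
    return entities


print(lookup_entities("The Indian Space Research Organisation is headquartered in Bengaluru, India."))
```

A lookup like this cannot recognize names that are missing from the dictionary, which is one reason it is usually combined with the other components.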

#### Named Entity Recognition (NER)

In [18]:
import spacy
from spacy import displacy

NER = spacy.load("en_core_web_sm")

In [19]:
raw_text="The Indian Space Research Organisation or is the national space agency of India, headquartered in Bengaluru. It operates under Department of Space which is directly overseen by the Prime Minister of India while Chairman of ISRO acts as executive of DOS as well."

text1= NER(raw_text)
text1

The Indian Space Research Organisation or is the national space agency of India, headquartered in Bengaluru. It operates under Department of Space which is directly overseen by the Prime Minister of India while Chairman of ISRO acts as executive of DOS as well.

In [20]:
for word in text1.ents:
    print(word.text,word.label_)

The Indian Space Research Organisation ORG
the national space agency ORG
India GPE
Bengaluru GPE
Department of Space ORG
India GPE
ISRO ORG
DOS ORG


In [21]:
print(spacy.explain("ORG"))
print(spacy.explain("GPE"))

Companies, agencies, institutions, etc.
Countries, cities, states


In [22]:
displacy.render(text1,style="ent",jupyter=True)

In [23]:
raw_text2='The Mars Orbiter Mission (MOM), informally known as Mangalyaan, was launched into Earth orbit on 5 November 2013 by the Indian Space Research Organisation (ISRO) and has entered Mars orbit on 24 September 2014. India thus became the first country to enter Mars orbit on its first attempt. It was completed at a record low cost of $74 million.'

In [24]:
text2= NER(raw_text2)
for word in text2.ents:
    print(word.text,word.label_)

The Mars Orbiter Mission ORG
Mangalyaan GPE
Earth LOC
5 November 2013 DATE
the Indian Space Research Organisation ORG
ISRO ORG
Mars LOC
24 September 2014 DATE
India GPE
first ORDINAL
Mars LOC
first ORDINAL
$74 million MONEY


In [25]:
displacy.render(text2,style="ent",jupyter=True)

**Scrape data from a news article and apply NER**

In [26]:
from bs4 import BeautifulSoup
import requests
import re

URL='https://www.zeebiz.com/markets/currency/news-cryptocurrency-news-today-june-12-bitcoin-dogecoin-shiba-inu-and-other-top-coins-prices-and-all-latest-updates-158490'

html_content = requests.get(URL).text

soup = BeautifulSoup(html_content, "lxml")
content = soup.find_all('div',{'class':'field-item even'})
# print(len(content))
# print(content[1])

body_content = []
data = ""
body_content.append(content[1].find_all("p"))
# print(body_content)
for i in body_content[0]:
  data = data+i.text

# regex cleaning
data = re.sub(r"[^a-zA-Z0-9]+", ' ', data)

print(data)

Bitcoin and all major top cryptocurrencies were trading in red at 3 45 pm on Saturday June 12 In line with its recent trends overall global crypto market was down by over 15 per cent on the weekend showed CoinSwitch Kuber data World number one cryptocurrency Bitcoin was down by 6 and was trading at Rs 27 28 815 after hitting day s high of Rs 29 00 208 See Zee Business Live TV Streaming Below Ethereum ranked at 2nd position globally was trading at Rs1 84 949 down 3 35 It reached a day high of Rs 1 90 490 and slid up to Rs 1 75 060 Ranked 3 Tether continued to trade in limited space and was marginally up by 0 05 Market price of Tether was RS 77 4716 Meme coins Dogecoin Shiba Inu were down over 5 and 10 Dogecoin was trading at Rs 23 869532 and Shiba Inu at Rs 0 000462 Other coins like Polka Dot and Binanace coin were trading down 9 94 and 6 79 respectively Matic was also trading over 10 per cent lower on Saturday Meanwhile in the latest news related to cryptocurrency China s crackdown on 

In [27]:
text3= NER(data)
displacy.render(text3,style="ent",jupyter=True)

### Topic Modeling

Topic modeling is a process of automatically identifying the topics present in a text corpus. It derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model yields, for example, “health”, “doctor”, “patient”, “hospital” for the topic “Healthcare”, and “farm”, “crops”, “wheat” for the topic “Farming”.

**Latent Dirichlet Allocation (LDA) is the most popular topic modelling technique**
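LDA is not demonstrated with a library here, so the following is a rough, self-contained sketch of one way to fit it: collapsed Gibbs sampling. Library implementations often use other inference methods, so treat this as an illustration of the idea only. The function names `lda_gibbs` and `top_words`, the hyperparameter values, and the tiny corpus are all assumptions made for the example.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on a list of tokenized documents.

    Returns (topic_word, doc_topic): per-topic word counts and
    per-document topic counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized probability of each topic for this token
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Resample the topic and restore the counts
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Return the n most frequent words of each topic."""
    return [sorted(counts, key=counts.get, reverse=True)[:n] for counts in topic_word]


toy_docs = [
    ["health", "doctor", "patient", "hospital"],
    ["doctor", "hospital", "patient"],
    ["farm", "crops", "wheat"],
    ["wheat", "farm", "crops", "farm"],
]
topic_word, doc_topic = lda_gibbs(toy_docs, num_topics=2)
print(top_words(topic_word))
print(doc_topic)
```

The sampler is stochastic, so the topics found depend on the seed and the number of iterations, and a corpus this small gives no guarantee of a clean split.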

### N-Grams as Features

A combination of N words together is called an N-gram. N-grams (N > 1) are generally more informative as features than single words (unigrams).

In [31]:
def generate_ngrams(text, n):
    words = text.split()
    output = []  
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output

generate_ngrams('this is a sample text', 2)

[['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]

## Statistical Features

### Term Frequency – Inverse Document Frequency (TF – IDF)

TF-IDF is a weighted model commonly used for information retrieval problems. It converts text documents into vectors on the basis of the occurrence of words in the documents, without considering their exact ordering.

Term Frequency (TF): TF for a term “t” is defined as the count of the term “t” in a document “D”.

Inverse Document Frequency (IDF): IDF for a term “t” is defined as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing “t”.
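These two definitions can be computed directly. The sketch below follows them literally (raw count times the logarithm of the document ratio) on a three-document corpus. The helper names `tokenize`, `tf`, `idf`, and `tf_idf` are chosen for this example. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalizes each row by default, so its values will differ from these.

```python
import math

corpus = ['This is sample document.', 'another random document.', 'third sample document text']


def tokenize(doc):
    """Lowercase, drop full stops, and split on whitespace."""
    return doc.lower().replace('.', '').split()


docs = [tokenize(d) for d in corpus]
num_docs = len(docs)


def tf(term, doc):
    """Raw count of the term in one tokenized document."""
    return doc.count(term)


def idf(term):
    """log(total documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(num_docs / doc_freq)


def tf_idf(term, doc):
    return tf(term, doc) * idf(term)


# 'document' occurs in every document, so its IDF (and TF-IDF) is zero
print(tf_idf('document', docs[0]))
# 'sample' occurs in two of the three documents
print(tf_idf('sample', docs[0]))
```

A term that appears in every document carries no discriminating information, which is what the zero weight for 'document' reflects.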

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print(X)

  (0, 1)	0.34520501686496574
  (0, 4)	0.444514311537431
  (0, 2)	0.5844829010200651
  (0, 7)	0.5844829010200651
  (1, 3)	0.652490884512534
  (1, 0)	0.652490884512534
  (1, 1)	0.3853716274664007
  (2, 5)	0.5844829010200651
  (2, 6)	0.5844829010200651
  (2, 1)	0.34520501686496574
  (2, 4)	0.444514311537431


The model creates a vocabulary dictionary and assigns an index to each word. Each row of the output contains a tuple (i, j) and the TF-IDF value of the word at index j in document i.

## Word Embedding (text vectors)

Word embedding is the modern way of representing words as vectors. The aim of word embedding is to map high-dimensional word features to low-dimensional feature vectors while preserving contextual similarity in the corpus.

Word2Vec and GloVe are two popular models for creating word embeddings of a text. These models take a text corpus as input and produce word vectors as output.

The **Word2Vec** model is composed of 
1. a preprocessing module, 
2. a shallow neural network model called Continuous Bag of Words, and 
3. another shallow neural network model called skip-gram. 

These models are widely used in other NLP problems. 

Word2Vec first constructs a vocabulary from the training corpus and then learns word embedding representations.
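To make the difference between the two network variants more concrete, the sketch below builds the training examples each one would see: skip-gram predicts a context word from the center word, while Continuous Bag of Words predicts the center word from its context. This only illustrates the data preparation. The helper names `skipgram_pairs` and `cbow_examples` and the window size of 1 are assumptions for the example, and no neural network is trained here.

```python
def context_indices(length, i, window):
    """Indices within `window` positions of i, excluding i itself."""
    lo = max(0, i - window)
    hi = min(length, i + window + 1)
    return [j for j in range(lo, hi) if j != i]


def skipgram_pairs(sentence, window=1):
    """Return (center, context) pairs, one per neighbouring word."""
    pairs = []
    for i, center in enumerate(sentence):
        for j in context_indices(len(sentence), i, window):
            pairs.append((center, sentence[j]))
    return pairs


def cbow_examples(sentence, window=1):
    """Return (context_words, center) examples, one per position."""
    examples = []
    for i, center in enumerate(sentence):
        context = [sentence[j] for j in context_indices(len(sentence), i, window)]
        examples.append((context, center))
    return examples


sentence = ['vidhya', 'science', 'data', 'analytics']
print(skipgram_pairs(sentence))
print(cbow_examples(sentence))
```

Words that share many contexts end up with many similar training examples, which is what pushes their learned vectors close together.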

In [42]:
from gensim.models import Word2Vec
sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus  
model = Word2Vec(sentences, min_count = 1)

print(model.wv.similarity('data', 'science'))

0.040166736


In [45]:
# Look the vector up through model.wv; indexing the model directly is deprecated
print(model.wv['learning'])

[ 0.0020579   0.00436839  0.00204895  0.00053198  0.00327065  0.00409795
 -0.00102926 -0.00427638 -0.00212396  0.00424246  0.0044299   0.00271036
 -0.00049583  0.00038131  0.00496779  0.00214182 -0.00280101  0.00011833
  0.00095929 -0.00129017 -0.00054843  0.0011072   0.0037341  -0.0033985
  0.00356015 -0.00016825 -0.0003     -0.00312034  0.00104331 -0.00400445
  0.00246177  0.00426064 -0.00088089 -0.00097667  0.00019584  0.00208241
  0.00389954  0.00223357 -0.00137875  0.0027438   0.00381098 -0.0005335
  0.00079301  0.00433516  0.00154924 -0.00184159  0.00081882 -0.00401901
  0.00374925 -0.00056443  0.00013093 -0.00244388  0.00162374  0.00465756
 -0.00166573  0.00256206  0.00169609  0.00284494 -0.00194351  0.00367887
 -0.00242371  0.00347863  0.00034231 -0.00405537  0.00393117  0.00113576
  0.0004254  -0.00084362  0.00128815 -0.00348526 -0.00077016  0.00438918
 -0.00049996 -0.00139847  0.00346977 -0.00298098 -0.0033001  -0.00347705
  0.00063246  0.00375366  0.00199477 -0.00137159 -0.0



In [46]:
from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob
training_corpus = [
                   ('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]
test_corpus = [
                ("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]

model = NBC(training_corpus) 
print(model.classify("Their codes are amazing."))

Class_A


In [47]:
print(model.classify("I don't like their computer."))
print(model.accuracy(test_corpus))

Class_B
0.8333333333333334


In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics import classification_report
from sklearn import svm 

# preparing data for SVM model (using the same training_corpus, test_corpus from naive bayes example)
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = [] 
test_labels = [] 
for row in test_corpus:
    test_data.append(row[0]) 
    test_labels.append(row[1])

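# Create the TF-IDF vectorizer
# (min_df=4 keeps only terms found in at least 4 training documents;
#  max_df=0.9 drops terms found in more than 90% of them)
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)
# Fit the vocabulary on the training data and vectorize it
train_vectors = vectorizer.fit_transform(train_data)
# Vectorize the test data with the same fitted vocabulary
test_vectors = vectorizer.transform(test_data)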

# Perform classification with SVM, kernel=linear 
model = svm.SVC(kernel='linear') 
model.fit(train_vectors, train_labels) 
prediction = model.predict(test_vectors)

print(classification_report(test_labels, prediction))

              precision    recall  f1-score   support

     Class_A       0.50      0.67      0.57         3
     Class_B       0.50      0.33      0.40         3

    accuracy                           0.50         6
   macro avg       0.50      0.50      0.49         6
weighted avg       0.50      0.50      0.49         6



### Text Matching / Similarity

One of the important areas of NLP is the matching of text objects to find similarities. Important applications of text matching include automatic spelling correction, data de-duplication, and genome analysis.
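No specific similarity measure is named above, so the sketch below assumes two commonly used ones: Levenshtein (edit) distance, which suits character-level tasks such as spelling correction, and cosine similarity over word counts, which suits comparing whole texts. The function names are chosen for this example.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(levenshtein("kitten", "sitting"))  # 3
print(cosine_similarity("this is a sample text", "this is another sample"))
```

A smaller edit distance means more similar strings, whereas a cosine similarity closer to 1 means more similar texts.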