
# NLP Basics

NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data in a smart and efficient manner.

## Installing the library

In [45]:
pip install -U nltk



In [46]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Pac

True

## Text Preprocessing

Text preprocessing is the process of cleaning and standardizing text so that it is noise-free and ready for analysis.

It consists mainly of three steps:
* Noise Removal
* Lexicon Normalization
* Object Standardization

### Noise Removal

Any piece of text that is not relevant to the context of the data and the end output can be considered noise.

In [47]:
noise_list = ["is", "a", "this", "..."] 
def _remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

_remove_noise("this is a sample text")

'sample text'
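
The function above compares tokens exactly, so `This` or `is,` would slip through. Below is a minimal sketch of a more tolerant variant, assuming that case-insensitive matching and ignoring surrounding punctuation are acceptable for the task. The names `remove_noise_tolerant` and `NOISE_WORDS` are illustrative and not part of the original notebook.

```python
import string

# Hypothetical stop list for illustration; real projects usually use a larger one.
NOISE_WORDS = {"is", "a", "this", "..."}

def remove_noise_tolerant(input_text, noise_words=NOISE_WORDS):
    """Drop noise words, ignoring case and surrounding punctuation."""
    kept = []
    for token in input_text.split():
        # Normalize only for the comparison; the original token is what we keep.
        normalized = token.strip(string.punctuation).lower()
        if token in noise_words or normalized in noise_words:
            continue
        kept.append(token)
    return " ".join(kept)

print(remove_noise_tolerant("This is a sample text, is it?"))
```

Only the comparison is normalized, so tokens that survive keep their original casing and punctuation.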

In [48]:
import re 

def _remove_regex(input_text, regex_pattern):
    # Delete every match of the pattern in a single pass
    return re.sub(regex_pattern, '', input_text)

# Raw string so that \w reaches the regex engine unchanged
regex_pattern = r"#[\w]*"

_remove_regex("remove this #hashtag from analytics vidhya", regex_pattern)

'remove this  from analytics vidhya'

### Lexicon Normalization

The most common lexicon normalization practices are:

* Stemming: a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
* Lemmatization: an organized, step-by-step procedure for obtaining the root form of a word; unlike stemming, it makes use of vocabulary.
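
To make the "rule-based suffix stripping" idea concrete, here is a toy sketch. It is not the Porter algorithm used later in this notebook; the suffix list comes from the bullet above, while `naive_stem` and its `min_stem_length` guard are assumptions made for illustration.

```python
# Suffixes are checked in this order, so "es" is tried before "s".
SUFFIXES = ("ing", "ly", "es", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_length:
            return lowered[: -len(suffix)]
    return lowered

for w in ["multiplying", "quickly", "boxes", "cats", "flies", "is"]:
    print(w, "->", naive_stem(w))
```

A stem produced this way need not be a dictionary word (`flies` becomes `fli`), which is the main practical difference from lemmatization.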

In [49]:
#  # required to run lemmatize
# nltk.download('wordnet')
# nltk.download('omw-1.4')

In [50]:
from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer 
stem = PorterStemmer()

word = "multiplying" 
lem.lemmatize(word, "v")


'multiply'

In [51]:
stem.stem(word)

'multipli'

### Object Standardization

Text data often contains words or phrases that are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.

Examples include acronyms, hashtags with attached words, and colloquial slang.

In [52]:
lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love"}
def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        # Replace known abbreviations and slang, matching case-insensitively
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    # Join once after the loop so that empty input returns '' instead of raising
    new_text = " ".join(new_words)
    return new_text

_lookup_words("RT this is a retweeted tweet by Shivam Bansal")

'Retweet this is a retweeted tweet by Shivam Bansal'
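
The lookup above covers acronyms and slang. Hashtags with attached words need a different treatment; one possible sketch, assuming the hashtag is written in camel case, splits it at capital letters and digit runs. The helper name `split_hashtag` is hypothetical.

```python
import re

def split_hashtag(tag):
    """Split a camel-case hashtag such as '#MachineLearning' into words."""
    body = tag.lstrip("#")
    # Alternatives: an acronym run, a capitalized or lowercase word, or a number.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", body)
    return " ".join(parts)

print(split_hashtag("#MachineLearning"))
print(split_hashtag("#NLPBasics2022"))
```

An all-lowercase hashtag such as `#machinelearning` carries no boundary information, so this approach returns it as a single word; splitting those would need a vocabulary.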

## Text to Features (Feature Engineering on text data)

To analyse preprocessed data, it must first be converted into features. Depending on the use case, text features can be constructed using assorted techniques: syntactic parsing, entities / N-grams / word-based features, statistical features, and word embeddings.
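
As a rough illustration of two of these feature types, the sketch below builds N-grams and a plain TF-IDF weighting using only the standard library. The function names are hypothetical, and the weighting shown (term frequency times `log(N / document frequency)`) is just one common variant; libraries often apply smoothing or normalization on top of it.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

tokens = "I am learning Natural Language Processing".split()
print(ngrams(tokens, 2))

docs = [["sugar", "is", "bad"], ["driving", "is", "stressful"]]
print(tf_idf(docs))
```

A term that occurs in every document (here `is`) gets weight 0, which is how this scheme down-weights uninformative words.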

### Syntactic Parsing

Syntactic parsing involves analysing the words in a sentence for grammar and arranging them in a manner that shows the relationships among the words.

In [53]:
# # to use word_tokenize
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

In [54]:
from nltk import word_tokenize, pos_tag
text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(pos_tag(tokens))

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]


***Lesk Algorithm***

The Lesk algorithm is the seminal dictionary-based method for word sense disambiguation.

This is the definition from Wikipedia: "It is based on the hypothesis that words used together in text are related to each other and that the relation can be observed in the definitions of the words and their senses."
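
The overlap idea in that definition can be sketched in a few lines. This is a toy version under loud assumptions: the `SENSES` inventory, its hand-written glosses, and the stop list are all made up for illustration, whereas real implementations such as pywsd draw senses and glosses from WordNet.

```python
# Hypothetical mini sense inventory; glosses are hand-written for illustration only.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits of money and makes loans",
        "river side": "sloping land beside a river or other body of water",
    }
}
STOPWORDS = {"a", "an", "and", "of", "or", "that", "the", "to", "my", "i", "was", "other"}

def content_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def toy_lesk(sentence, word):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence) - {word}
    scores = {
        sense: len(context & content_words(gloss))
        for sense, gloss in SENSES[word].items()
    }
    # On a tie, max() returns the first sense listed.
    return max(scores, key=scores.get)

print(toy_lesk("I went to the bank to deposit my money", "bank"))
print(toy_lesk("The river bank was full of dead fishes", "bank"))
```

Because the score is a raw word overlap, the result depends heavily on how the glosses are worded, which is one reason Lesk-style methods can pick surprising senses.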

In [55]:
!pip3 install pywsd==1.0.2  
# !pip install pywsd  #throwing error due to dependency of word_net.. Library devs are looking into this.



In [56]:
#Import functions  
from pywsd.lesk import simple_lesk  #pywsd - python implementation of Word Sense Disambiguation (WSD)

sentences = ['I went to the bank to deposit my money',  
'The river bank was full of dead fishes']  

# calling the lesk function and printing results for both the sentences  
print ("Context-1:", sentences[0])  

answer = simple_lesk(sentences[0],'bank')  
print ("Sense:", answer)  
print ("Definition : ", answer.definition())  

Context-1: I went to the bank to deposit my money
Sense: Synset('deposit.v.02')
Definition :  put into a bank account


In [57]:
print ("Context-2:", sentences[1])  
answer = simple_lesk(sentences[1],'bank')  
print ("Sense:", answer)  
print ("Definition : ", answer.definition())  

Context-2: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition :  sloping land (especially the slope beside a body of water)


In [58]:
new_sentences = ['The workers at the plant were overworked','The plant was no longer bearing flowers','The workers at the industrial plant were overworked']

In [59]:
print ("Context-1:", new_sentences[0])  
answer = simple_lesk(new_sentences[0],'plant')  
print ("Sense:", answer)  
print ("Definition : ", answer.definition())  

Context-1: The workers at the plant were overworked
Sense: Synset('plant.v.06')
Definition :  put firmly in the mind


**Result -- not exactly as expected**

In [60]:
print ("Context-2:", new_sentences[1])  
answer = simple_lesk(new_sentences[1],'plant')  
print ("Sense:", answer)  
print ("Definition : ", answer.definition())  

Context-2: The plant was no longer bearing flowers
Sense: Synset('plant.v.01')
Definition :  put or set (seeds, seedlings, or plants) into the ground


**Result -- as expected**

In [61]:
print ("Context-3:", new_sentences[2])  
answer = simple_lesk(new_sentences[2],'plant')  
print ("Sense:", answer)  
print ("Definition : ", answer.definition())  

Context-3: The workers at the industrial plant were overworked
Sense: Synset('plant.v.06')
Definition :  put firmly in the mind


**Result -- not as expected (same verb sense as Context-1)**

### Entity Extraction (Entities as features)

Entities are defined as the most important chunks of a sentence: noun phrases, verb phrases, or both. Entity detection algorithms are generally ensemble models that combine rule-based parsing, dictionary lookups, POS tagging, and dependency parsing.
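
Of the components listed, the dictionary lookup is the simplest to show in isolation. The sketch below is only that one component, not a full entity detector; the `GAZETTEER` contents and the function name `lookup_entities` are hypothetical, with labels borrowed from the spaCy tag names that appear later in this notebook.

```python
import re

# Hypothetical gazetteer for illustration.
GAZETTEER = {
    "Indian Space Research Organisation": "ORG",
    "ISRO": "ORG",
    "India": "GPE",
    "Bengaluru": "GPE",
}

def lookup_entities(text, gazetteer=GAZETTEER):
    """Return (entity text, label) pairs for gazetteer names found in the text."""
    # Try longer names first so "Indian Space Research Organisation" is not cut short.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [(m.group(), gazetteer[m.group()]) for m in pattern.finditer(text)]

sample = "The Indian Space Research Organisation is headquartered in Bengaluru, India."
for entity, label in lookup_entities(sample):
    print(entity, label)
```

A pure lookup cannot recognize names it has never seen, which is why it is usually combined with the statistical and rule-based components mentioned above.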

#### Named Entity Recognition (NER)

In [62]:
import spacy
from spacy import displacy

NER = spacy.load("en_core_web_sm")

In [63]:
raw_text="The Indian Space Research Organisation or is the national space agency of India, headquartered in Bengaluru. It operates under Department of Space which is directly overseen by the Prime Minister of India while Chairman of ISRO acts as executive of DOS as well."

text1= NER(raw_text)
text1

The Indian Space Research Organisation or is the national space agency of India, headquartered in Bengaluru. It operates under Department of Space which is directly overseen by the Prime Minister of India while Chairman of ISRO acts as executive of DOS as well.

In [64]:
for word in text1.ents:
    print(word.text,word.label_)

The Indian Space Research Organisation ORG
the national space agency ORG
India GPE
Bengaluru GPE
Department of Space ORG
India GPE
ISRO ORG
DOS ORG


In [66]:
print(spacy.explain("ORG"))
print(spacy.explain("GPE"))

Companies, agencies, institutions, etc.
Countries, cities, states


In [67]:
displacy.render(text1,style="ent",jupyter=True)

In [69]:
raw_text2='The Mars Orbiter Mission (MOM), informally known as Mangalyaan, was launched into Earth orbit on 5 November 2013 by the Indian Space Research Organisation (ISRO) and has entered Mars orbit on 24 September 2014. India thus became the first country to enter Mars orbit on its first attempt. It was completed at a record low cost of $74 million.'

In [70]:
text2= NER(raw_text2)
for word in text2.ents:
    print(word.text,word.label_)

The Mars Orbiter Mission ORG
Mangalyaan GPE
Earth LOC
5 November 2013 DATE
the Indian Space Research Organisation ORG
ISRO ORG
Mars LOC
24 September 2014 DATE
India GPE
first ORDINAL
Mars LOC
first ORDINAL
$74 million MONEY


In [71]:
displacy.render(text2,style="ent",jupyter=True)

**Scrape data from a news article and apply NER**

In [152]:
from bs4 import BeautifulSoup
import requests
import re

URL='https://www.zeebiz.com/markets/currency/news-cryptocurrency-news-today-june-12-bitcoin-dogecoin-shiba-inu-and-other-top-coins-prices-and-all-latest-updates-158490'

html_content = requests.get(URL).text

soup = BeautifulSoup(html_content, "lxml")
content = soup.find_all('div',{'class':'field-item even'})
# print(len(content))
# print(content[1])

# Concatenate the text of every <p> inside the second matching <div>
data = ""
for paragraph in content[1].find_all("p"):
    data = data + paragraph.text

# regex cleaning
data = re.sub(r"[^a-zA-Z0-9]+", ' ', data)

print(data)

Bitcoin and all major top cryptocurrencies were trading in red at 3 45 pm on Saturday June 12 In line with its recent trends overall global crypto market was down by over 15 per cent on the weekend showed CoinSwitch Kuber data World number one cryptocurrency Bitcoin was down by 6 and was trading at Rs 27 28 815 after hitting day s high of Rs 29 00 208 See Zee Business Live TV Streaming Below Ethereum ranked at 2nd position globally was trading at Rs1 84 949 down 3 35 It reached a day high of Rs 1 90 490 and slid up to Rs 1 75 060 Ranked 3 Tether continued to trade in limited space and was marginally up by 0 05 Market price of Tether was RS 77 4716 Meme coins Dogecoin Shiba Inu were down over 5 and 10 Dogecoin was trading at Rs 23 869532 and Shiba Inu at Rs 0 000462 Other coins like Polka Dot and Binanace coin were trading down 9 94 and 6 79 respectively Matic was also trading over 10 per cent lower on Saturday Meanwhile in the latest news related to cryptocurrency China s crackdown on 

In [153]:
text3= NER(data)
displacy.render(text3,style="ent",jupyter=True)

### Topic Modeling

Topic modeling is the process of automatically identifying the topics present in a text corpus; it derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model yields, for example, “health”, “doctor”, “patient”, “hospital” for the topic “Healthcare”, and “farm”, “crops”, “wheat” for the topic “Farming”.
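
The "co-occurring terms" intuition can be illustrated without any topic-modeling library by counting which word pairs appear in the same document. This is not LDA or any other topic model, only a sketch of the signal such models exploit; the four toy documents and the function name `cooccurrence_counts` are made up for illustration.

```python
from collections import Counter
from itertools import combinations

docs = [
    "doctor patient hospital health",
    "patient doctor health",
    "farm crops wheat",
    "wheat farm crops harvest",
]

def cooccurrence_counts(documents):
    """Count in how many documents each unordered pair of distinct terms appears together."""
    pair_counts = Counter()
    for doc in documents:
        # Sort so that each pair is always stored in the same order.
        terms = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(terms, 2))
    return pair_counts

counts = cooccurrence_counts(docs)
print(counts[("doctor", "patient")], counts[("farm", "wheat")], counts[("doctor", "wheat")])
```

Pairs within the healthcare or farming vocabulary co-occur repeatedly, while cross-topic pairs never do; a topic model turns this kind of pattern into probability distributions over words.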

In [160]:
pip install -U gensim


In [162]:
import nltk.corpus

In [163]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

import gensim
from gensim import corpora  # Dictionary lives in gensim.corpora, not in a standalone "corpora" package

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

# Results
print(ldamodel.print_topics())