# Topics:

1. Tasks in Natural Language Processing
2. Tokenization
3. Stop Words Removal
4. Identifying N-Grams
5. Stemming
6. POS tagging
7. Word Sense Disambiguation
------------------------------------------
8. Auto-Summarizing Text
    a) Downloading the page using BeautifulSoup
    b) Preprocessing the text
    c) Extracting the summary
------------------------------------------
9. Classifying Text using ML
    a) K-Means clustering to find clusters
    b) Using K-nearest neighbors to classify articles into those clusters

# 1. Tasks in Natural Language Processing:


• Tokenization: Breaking text down into words and sentences.
• Stop Word Removal: Filtering out common words. Ex: is, an, the, a, etc.
• N-Grams: Identifying commonly occurring groups of words. 
    Ex: Treating "New" and "York" as one unit - New York (bigram)
• Word Sense Disambiguation: Identifying which sense of a word is meant in a given context. 
    Ex: These are really cool effects.
        Give me a glass of cool water.
• Parts of Speech: Tagging each word with its part of speech (noun, verb, etc.).
• Stemming: Removing the endings of words. Ex: Closed, Closes, Closing -> Close


# 2. Tokenization

In [58]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Run this the first time: nltk.download('punkt')
# (recent NLTK releases may ask for 'punkt_tab' instead)

In [59]:
text = "This is good. We can imporve and do better."
print(sent_tokenize(text))
print(word_tokenize(text))

['This is good.', 'We can imporve and do better.']
['This', 'is', 'good', '.', 'We', 'can', 'imporve', 'and', 'do', 'better', '.']
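
To make the idea behind tokenization concrete, here is a minimal regex-based sketch using only the standard library. It is not how NLTK's tokenizers work internally; the function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the sentence splitter is deliberately naive.

```python
import re

def simple_sent_tokenize(text):
    # Split after ., ! or ? when followed by whitespace.
    # Naive: abbreviations such as "U.S. Air Force" would be split too.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "This is good. We can improve and do better."
sentences = simple_sent_tokenize(sample)
words = simple_word_tokenize(sample)
print(sentences)
print(words)
```

On simple input like this, the sketch separates sentences and splits punctuation into its own tokens, much as the NLTK output above does. A trained tokenizer such as Punkt handles abbreviations and other edge cases far better.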


# 3. Stop Words Removal

In [3]:
from nltk.corpus import stopwords
from string import punctuation

In [10]:
# Run for the first time: 
#nltk.download('stopwords')

In [4]:
customeStopWords = set(stopwords.words('english')+list(punctuation))

In [5]:
Words_WO_StopWords = [word for word in word_tokenize(text) if word not in customeStopWords]
Words_WO_StopWords

['This', 'good', 'We', 'imporve', 'better']
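
Note that 'This' and 'We' survive in the output above: NLTK's English stop word list is lowercase, and the membership test is case-sensitive. The sketch below shows the effect of lowercasing each token before the lookup. `toy_stop_words` is a small hand-picked set used only for illustration, not NLTK's list.

```python
from string import punctuation

# Hypothetical, hand-picked stop words; NLTK's English list is much longer.
toy_stop_words = {"this", "is", "we", "can", "and", "do"} | set(punctuation)

tokens = ['This', 'is', 'good', '.', 'We', 'can', 'imporve', 'and', 'do', 'better', '.']

# Case-sensitive lookup: capitalized stop words slip through.
case_sensitive = [t for t in tokens if t not in toy_stop_words]

# Lowercase each token before the lookup so 'This' matches 'this'.
case_insensitive = [t for t in tokens if t.lower() not in toy_stop_words]

print(case_sensitive)
print(case_insensitive)
```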

# 4. Identifying N-Grams:

In [6]:
from nltk.collocations import BigramCollocationFinder

In [7]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(Words_WO_StopWords)

In [8]:
sorted(finder.ngram_fd.items())

[(('This', 'good'), 1),
 (('We', 'imporve'), 1),
 (('good', 'We'), 1),
 (('imporve', 'better'), 1)]
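
With such a short sentence every bigram occurs exactly once, so the counts are not very informative. As a rough sketch of what the finder counts, the snippet below tallies adjacent word pairs with the standard library on a made-up token list. `count_bigrams` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

def count_bigrams(tokens):
    # Pair each token with its successor and count the pairs.
    return Counter(zip(tokens, tokens[1:]))

# Made-up tokens in which one pair repeats.
tokens = ["new", "york", "is", "big", "and", "new", "york", "is", "busy"]
bigram_counts = count_bigrams(tokens)
print(bigram_counts.most_common(2))
```

In NLTK itself, the `bigram_measures` object created above is usually passed to the finder as a scoring function, for example `finder.nbest(bigram_measures.pmi, 10)`. That ranks bigrams by pointwise mutual information rather than by raw count.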

# 5. Stemming

In [9]:
from nltk.stem.lancaster import LancasterStemmer

In [10]:
text2 = "Mary closed on closing night when she was in the mood to close."
st = LancasterStemmer()
stemmedWords = [st.stem(word) for word in word_tokenize(text2)]
print(stemmedWords)

['mary', 'clos', 'on', 'clos', 'night', 'when', 'she', 'was', 'in', 'the', 'mood', 'to', 'clos', '.']
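
The idea of suffix stripping can be sketched in a few lines. This toy stemmer is far cruder than the Lancaster algorithm used above; the suffix list and the minimum stem length are assumptions chosen just for this example.

```python
# Toy suffix list, checked in this order (longest first). An assumption for illustration.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def naive_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        # Strip the suffix only if enough of the word remains.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

stems = [naive_stem(w) for w in ["Closed", "closes", "closing", "close", "was", "mood"]]
print(stems)
```

As with the Lancaster output, the stem is not necessarily a dictionary word. The goal is only that related forms map to the same string.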


# 6. POS Tagging

In [28]:
#Run this for first time
#nltk.download('averaged_perceptron_tagger')

In [11]:
nltk.pos_tag(word_tokenize(text2))

[('Mary', 'NNP'),
 ('closed', 'VBD'),
 ('on', 'IN'),
 ('closing', 'NN'),
 ('night', 'NN'),
 ('when', 'WRB'),
 ('she', 'PRP'),
 ('was', 'VBD'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mood', 'NN'),
 ('to', 'TO'),
 ('close', 'VB'),
 ('.', '.')]
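
Once tokens are tagged, a common next step is to filter by tag. The sketch below reuses the tagged output shown above, pasted in as a literal so it runs without NLTK, and picks out nouns and verbs by their Penn Treebank tag prefixes.

```python
# Tagged tokens copied from the output above.
tagged = [('Mary', 'NNP'), ('closed', 'VBD'), ('on', 'IN'), ('closing', 'NN'),
          ('night', 'NN'), ('when', 'WRB'), ('she', 'PRP'), ('was', 'VBD'),
          ('in', 'IN'), ('the', 'DT'), ('mood', 'NN'), ('to', 'TO'),
          ('close', 'VB'), ('.', '.')]

# Noun tags start with NN (NN, NNS, NNP, ...); verb tags start with VB.
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]

print(nouns)
print(verbs)
```

Note how the tagger separates the three forms of "close" by context: "closed" and "close" are verbs, while "closing" in "closing night" is tagged as a noun.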

# 7. Word Sense Disambiguation

In [33]:
#Run for the first time:
#nltk.download('wordnet')

In [12]:
from nltk.corpus import wordnet as wn

In [13]:
for ss in wn.synsets('bass'):
    print(ss,ss.definition())

Synset('bass.n.01') the lowest part of the musical range
Synset('bass.n.02') the lowest part in polyphonic music
Synset('bass.n.03') an adult male singer with the lowest voice
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)
Synset('bass.n.06') the lowest adult male singing voice
Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes
Synset('bass.s.01') having or denoting a low vocal or instrumental range


In [14]:
from nltk.wsd import lesk
sensel = lesk(word_tokenize("Sing in a lower tone, along with the bass"),'bass')
print(sensel,sensel.definition())

Synset('bass.n.07') the member with the lowest range of a family of musical instruments


In [15]:
sensel2 = lesk(word_tokenize("This sea bass was really hard to catch"),'bass')
print(sensel2,sensel2.definition())

Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
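
The core idea of the Lesk algorithm is to pick the sense whose dictionary gloss shares the most words with the context. Below is a simplified sketch of that overlap count, using two of the WordNet glosses printed above. The helper `simple_lesk` and both example sentences are made up for illustration; NLTK's `lesk` differs in its details.

```python
import re

# Two glosses copied from the WordNet output above.
glosses = {
    "bass.n.07": "the member with the lowest range of a family of musical instruments",
    "sea_bass.n.01": "the lean flesh of a saltwater fish of the family Serranidae",
}

def word_set(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def simple_lesk(context, glosses):
    context_words = word_set(context)
    # Choose the sense whose gloss overlaps most with the context words.
    return max(glosses, key=lambda sense: len(context_words & word_set(glosses[sense])))

music_sense = simple_lesk("He plays the bass and other musical instruments", glosses)
fish_sense = simple_lesk("We grilled a saltwater fish, a bass", glosses)
print(music_sense, fish_sense)
```

This version counts stop words such as "the" and "a" as overlap too. Filtering them out first would usually make the scores more meaningful.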


# 8. Auto Summarizing Text

In [60]:
# a) Downloading the text:==================================
import urllib.request  # urlopen lives in the urllib.request submodule
from bs4 import BeautifulSoup

In [61]:
staticURL = "https://www.washingtonpost.com/news/the-switch/wp/2016/10/18/the-pentagons-massive-new-telescope-is-designed-to-track-space-junk-and-watch-out-for-killer-asteroids/"

In [62]:
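def getTextfromurl(url):
    # Download the page and decode it, skipping any undecodable bytes.
    page = urllib.request.urlopen(url).read().decode('utf8','ignore')
    soup = BeautifulSoup(page,'lxml')
    # Join the text of every <article> element on the page.
    text = ' '.join(map(lambda p: p.text,soup.find_all('article')))
    # Drop non-ASCII characters, then decode back to str so the result is plain
    # text rather than the "b'...'" repr that str(bytes) would produce.
    return text.encode('ascii',errors='ignore').decode('ascii')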


In [19]:
text = getTextfromurl(staticURL)

In [20]:
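#b) Pre-processing the text:=================================
# Add a space after every period so that sentences run together by the scrape
# are split apart (this also splits abbreviations such as "U.S.").
sents = sent_tokenize(text.replace('.','. '))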
sents[7]

'On Tuesday, the Defense Departmenttook another significant step toward monitoring all of the cosmic junk swirling around in space, by deliveringa gigantic new telescope capable of seeing small objects from very far away.'
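
The run-together words in this sentence ("Departmenttook", "deliveringa") are a side effect of cleaning the scraped text. One plausible cause, stated here as an assumption, is that non-breaking spaces in the page are non-ASCII characters and get dropped along with everything else. The sketch below shows the effect on a made-up string and one way to avoid it; `to_ascii` is a hypothetical helper.

```python
def to_ascii(text):
    # Turn non-breaking spaces into regular spaces first so neighbouring
    # words stay separated, then drop the remaining non-ASCII characters.
    text = text.replace("\u00a0", " ")
    return text.encode("ascii", errors="ignore").decode("ascii")

# Made-up sample containing a non-breaking space and curly quotes.
raw = "the Defense Department\u00a0took another \u201csignificant\u201d step"

dropped = raw.encode("ascii", errors="ignore").decode("ascii")
cleaned = to_ascii(raw)

print(dropped)
print(cleaned)
```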

In [21]:
word_sent = word_tokenize(text.lower())
len(word_sent)

811

In [22]:
word_sent = [word for word in word_sent if word not in customeStopWords]
len(word_sent)

438

In [23]:
# c) Extracting Summary:=====================================
from nltk.probability import FreqDist
freq = FreqDist(word_sent)
freq

FreqDist({'space': 14, 'telescope': 8, 'debris': 7, 'satellites': 6, 'orbit': 6, 'objects': 6, 'air': 6, 'force': 6, 'around': 4, 'small': 4, ...})

In [24]:
# Getting the n most frequently occurring words.
from heapq import nlargest
print(nlargest(10,freq,key=freq.get))

['space', 'telescope', 'debris', 'satellites', 'orbit', 'objects', 'air', 'force', 'around', 'small']


In [25]:
# creating sentence ranking dictionary:
from collections import defaultdict
ranking = defaultdict(int)

for i, sent in enumerate(sents):
    for w in word_tokenize(sent.lower()):
        if w in freq:
            ranking[i] += freq[w]
print(ranking)

defaultdict(<class 'int'>, {0: 52, 1: 8, 2: 33, 3: 16, 4: 1, 5: 2, 6: 44, 7: 65, 8: 56, 9: 51, 10: 24, 11: 12, 12: 12, 13: 16, 14: 23, 15: 43, 16: 24, 17: 41, 18: 67, 19: 31, 20: 59, 21: 42, 22: 38, 23: 27, 24: 25, 25: 35, 26: 21, 27: 20, 28: 79, 29: 27, 30: 12, 31: 7, 32: 3, 33: 8})


In [26]:
# Selecting the top 4 sentences by these scores.
sents_idx = nlargest(4,ranking,key=ranking.get)
sents_idx

[28, 18, 7, 20]

In [27]:
[sents[i] for i in sorted(sents_idx)]

['On Tuesday, the Defense Departmenttook another significant step toward monitoring all of the cosmic junk swirling around in space, by deliveringa gigantic new telescope capable of seeing small objects from very far away.',
 'The telescope is a big improvement over the legacy ground-based optical telescopes that are used by the U. S.  Air Force, because it can search large areas of sky and also track very faint (small) objects in and around GEO, Brian Weeden, a Technical Advisor at the Secure World Foundation, wrote in an email.',
 'The telescope wouldjoin another new space debris tracking technology known as the Space Fence, which is now being built by Bethesda-based Lockheed Martin.',
 'Every military operation that takes place in the world today is critically dependent on space in one way or another, Air Force Gen.  John Hyten said in an interview earlier this year when he was the commander of the Air Force Space Command.']

In [28]:
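# One final function that wraps the steps above.
def summarize(text, n):
    """Return the n highest-scoring sentences of text, in their original order."""
    text = text.replace('.','. ')
    sents = sent_tokenize(text)

    assert n <= len(sents), "n cannot exceed the number of sentences"
    word_sent = word_tokenize(text.lower())
    _stopwords = set(stopwords.words('english') + list(punctuation))

    # Word frequencies over the whole text, ignoring stop words and punctuation.
    word_sent = [word for word in word_sent if word not in _stopwords]
    freq = FreqDist(word_sent)

    # Score each sentence by the summed frequency of the words it contains.
    ranking = defaultdict(int)
    for i, sent in enumerate(sents):
        for w in word_tokenize(sent.lower()):
            if w in freq:
                ranking[i] += freq[w]

    sents_idx = nlargest(n,ranking,key=ranking.get)
    return [sents[j] for j in sorted(sents_idx)]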


In [29]:
summarize(text,3)

['On Tuesday, the Defense Departmenttook another significant step toward monitoring all of the cosmic junk swirling around in space, by deliveringa gigantic new telescope capable of seeing small objects from very far away.',
 'The telescope is a big improvement over the legacy ground-based optical telescopes that are used by the U. S.  Air Force, because it can search large areas of sky and also track very faint (small) objects in and around GEO, Brian Weeden, a Technical Advisor at the Secure World Foundation, wrote in an email.',
 'Every military operation that takes place in the world today is critically dependent on space in one way or another, Air Force Gen.  John Hyten said in an interview earlier this year when he was the commander of the Air Force Space Command.']
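
Because each sentence's score is a sum over its words, long sentences tend to win. One possible variation, not part of the method above, is to divide the score by the number of content words in the sentence. The sketch below illustrates this with regex tokenization and a tiny hand-picked stop word set, both assumptions that keep the example free of NLTK data downloads.

```python
import re
from collections import Counter
from heapq import nlargest

# Hypothetical, tiny stop word list for illustration only.
TOY_STOP_WORDS = {"the", "a", "is", "in", "of", "and", "to", "it"}

def content_words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in TOY_STOP_WORDS]

def summarize_normalized(text, n):
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(content_words(text))
    scores = {}
    for i, sent in enumerate(sents):
        words = content_words(sent)
        # Average word frequency instead of the raw sum.
        scores[i] = sum(freq[w] for w in words) / len(words) if words else 0.0
    top = nlargest(n, scores, key=scores.get)
    return [sents[i] for i in sorted(top)]

doc = ("Space debris is a growing problem. "
       "The telescope tracks debris in space. "
       "It was a sunny day and many people came to watch the long ceremony in the desert.")
summary = summarize_normalized(doc, 2)
print(summary)
```

With raw sums, the long third sentence would score highest even though it contains none of the frequent words. With averaging, the two short on-topic sentences are selected instead.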

# 9. Classifying Text using ML

In [None]:
# We'll scrape Inshorts news articles and cluster them into different categories.

In [6]:
import urllib.request
import requests
from bs4 import BeautifulSoup

In [7]:
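# Fetch the first page (25 posts) and return the offset code needed for the next page.
def addArticles(url):
    # A browser-like User-Agent header is sent with the request.
    req = urllib.request.Request(url,headers={'User-Agent': 'Mozilla/5.0'})
    client = urllib.request.urlopen(req)
    webpage = client.read()
    client.close()

    # Name the parser explicitly so BeautifulSoup does not have to guess one.
    soup = BeautifulSoup(webpage, 'html.parser')
    articles = soup.find_all("div",{"itemprop":"articleBody"})
    for article in articles:
        posts.append(article.text)   # posts is the global list of article texts

    # The offset code is read from a fixed position in the last <script> tag,
    # so this slice is tied to the page layout at the time of scraping.
    sc = list(soup.find_all('script'))
    code = str(sc[-1].text[25:35])
    return code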

In [8]:
# Further pages need the offset code returned by the previous page.
def addArticle2(url,code):
    data = {'category':'','news_offset':code}
    response = requests.post(url, data=data)
    # Parse the JSON once; by position, the first value is the next offset code
    # and the second is the HTML of the new posts.
    values = list(response.json().values())
    soup = BeautifulSoup(values[1], 'html.parser')
    articles = soup.find_all("div",{"itemprop":"articleBody"})
    for article in articles:
        posts.append(article.text)
    return values[0]

In [9]:
#making first call for first page.
posts = []
url = "https://inshorts.com/en/read"
code = addArticles(url)
print(len(posts))

25


In [18]:
# Run this cell n times to fetch n more pages:
url = "https://inshorts.com/en/ajax/more_news"
code = addArticle2(url,code)
print(len(posts))

209


# Now that we have the data, there are two common methods for feature extraction:
a. Term Frequency (TF)
b. TF-IDF (Term Frequency-Inverse Document Frequency), which is used below
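
To show how the two weightings differ, here is a small hand computation using the textbook definitions: TF is a term's share of the words in a document, and IDF is log(N / df). The three toy documents are made up. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will not match this sketch exactly.

```python
import math
from collections import Counter

# Made-up, pre-tokenized toy documents.
docs = [
    "cricket match india england".split(),
    "covid cases india hospital".split(),
    "cricket captain kohli cricket".split(),
]

def term_frequency(doc):
    counts = Counter(doc)
    return {term: count / len(doc) for term, count in counts.items()}

def inverse_document_frequency(docs):
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

idf = inverse_document_frequency(docs)
tfidf = [{t: tf * idf[t] for t, tf in term_frequency(d).items()} for d in docs]

print(tfidf[2])
```

In the third document "cricket" has the highest term frequency, but "kohli" gets the higher TF-IDF weight because it appears in no other document.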

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
vectorizer = TfidfVectorizer(max_df=0.5,min_df=2,stop_words='english')

In [22]:
x = vectorizer.fit_transform(posts)
x

<209x991 sparse matrix of type '<class 'numpy.float64'>'
	with 4018 stored elements in Compressed Sparse Row format>

In [25]:
#print(x[0])

In [26]:
#clustering:
from sklearn.cluster import KMeans
km = KMeans(n_clusters = 3,init = 'k-means++',max_iter=100,n_init=1,verbose=True)


In [27]:
km.fit(x)

Initialization complete
Iteration  0, inertia 387.162
Iteration  1, inertia 199.939
Iteration  2, inertia 199.541
Iteration  3, inertia 199.207
Iteration  4, inertia 198.967
Iteration  5, inertia 198.697
Iteration  6, inertia 198.356
Iteration  7, inertia 198.327
Converged at iteration 7: center shift 0.000000e+00 within tolerance 9.811060e-08


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
       n_clusters=3, n_init=1, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=True)

In [28]:
import numpy as np
np.unique(km.labels_, return_counts=True)

(array([0, 1, 2]), array([60, 76, 73], dtype=int64))
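
The iterations logged above alternate between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. Below is a minimal pure-Python sketch of that loop on made-up 2-D points. Unlike the `k-means++` initialization used above, it simply starts from two hand-picked points, which is an assumption made to keep the example deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid_of(members):
    # Coordinate-wise mean of a non-empty list of points.
    return tuple(sum(coords) / len(members) for coords in zip(*members))

def kmeans(points, centroids, max_iter=100):
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(centroid_of(members) if members else centroids[c])
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```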

In [29]:
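# Concatenate the posts of each cluster into one string per cluster.
text={}
for i,cluster in enumerate(km.labels_):
    oneDocument = posts[i]
    if cluster not in text.keys():
        text[cluster] = oneDocument
    else:
        # Separate posts with a space so adjacent words are not glued together.
        text[cluster] += " " + oneDocument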

In [31]:
# doing analysis on these 3 clusters:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from collections import defaultdict
from string import punctuation
from heapq import nlargest
import nltk

In [32]:
_stopwords = set(stopwords.words('english') + list(punctuation) +
                 ["million","billion","year","millions","billions",
                  "y/y","'s","''"])

In [33]:
keywords = {}
counts={}
for cluster in range(3):
    word_sent = word_tokenize(text[cluster].lower())
    word_sent=[word for word in word_sent if word not in _stopwords]
    freq = FreqDist(word_sent)
    keywords[cluster] = nlargest(100,freq,key=freq.get)
    counts[cluster]=freq

In [34]:
# For each cluster, keep the 10 most frequent keywords that no other cluster shares.
unique_keys={}
for cluster in range(3):
    other_clusters=list(set(range(3)) - set([cluster]))
    keys_other_cluster=set(keywords[other_clusters[0]]).union(set(keywords[other_clusters[1]]))
    unique = set(keywords[cluster]) - keys_other_cluster
    unique_keys[cluster] =nlargest(10,unique,key=counts[cluster].get)

In [46]:
print(unique_keys)

{0: ['indian', 'modi', 'wrote', 'trump', 'kohli', 'captain', 'cricket', 'played', 'england', 'twitter'], 1: ['covid-19', 'cases', 'positive', 'hospital', 'patients', 'reported', 'lockdown', 'lakh', 'total', 'cm'], 2: ['arrested', 'used', 'dubey', 'vikas', 'singh', 'killed', 'sushant', 'house', 'allegedly', 'accused']}


In [None]:
# We will build a model that predicts the cluster (class) of an article.

In [38]:
article = posts[3]

In [40]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier()

# Training: the K-Means cluster labels serve as the target classes.
classifier.fit(x,km.labels_)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [42]:
test=vectorizer.transform([article.encode('ascii',errors='ignore').decode('ascii')])

In [44]:
test

<1x991 sparse matrix of type '<class 'numpy.float64'>'
	with 24 stored elements in Compressed Sparse Row format>

In [45]:
classifier.predict(test)

array([1])
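
For intuition, k-nearest-neighbors prediction can be sketched as: find the k training points closest to the query and take a majority vote over their labels. The pure-Python sketch below uses Euclidean distance on made-up 2-D points; the helper `knn_predict` and the data are hypothetical. Note also that the article classified above comes from the training set, so that prediction is a sanity check rather than an evaluation on unseen data.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Indices of the training points, sorted by distance to the query.
    by_distance = sorted(range(len(train_points)),
                         key=lambda i: math.dist(train_points[i], query))
    # Majority vote among the k nearest neighbours.
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]

train_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
train_labels = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(train_points, train_labels, (2.0, 1.0))
pred_b = knn_predict(train_points, train_labels, (7.0, 8.0))
print(pred_a, pred_b)
```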