# Classification of text using Natural Language Processing

In this small project, we first load a reference dataset. Using all the text and articles in it, we run K-Means clustering to segment the articles into groups and pick out the main keywords of each group.

Then, when a sample input is provided, the algorithm returns the cluster it most likely belongs to.

## Import Libraries

In [1]:
import urllib2
import numpy as np
import nltk
import bs4

In [2]:
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from collections import defaultdict
from string import punctuation
from heapq import nlargest
from sklearn.neighbors import KNeighborsClassifier

## Extract Data
We will load the doxydonkey blog posts as our reference data.

We have a function that, given a URL, follows every "Older Posts" link under the doxydonkey domain and maintains a list of the page URLs it finds.

We crawl through the doxydonkey blog using the fact that all the links are present as anchor tags, in the <a href=...> format.

In [3]:
def get_all_doxydonkey_posts(url, links):
    """Recursively follow each "Older Posts" link, appending page URLs to links."""
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response, 'lxml')
    for a in soup.find_all('a'):
        # .get() returns None for anchors that lack an href or a title,
        # so those anchors are skipped without needing a try/except.
        href = a.get('href')
        title = a.get('title')
        if href and title == "Older Posts":
            links.append(href)
            get_all_doxydonkey_posts(href, links)
    return

In [4]:
blog_url = "http://doxydonkey.blogspot.in/"
links = []
get_all_doxydonkey_posts(blog_url, links)
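
To make the link-following idea concrete without hitting the network, here is a minimal sketch. It is not the notebook's code: it swaps BeautifulSoup for Python 3's standard-library `html.parser`, and the page fragment and the `OlderPostsFinder` class are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class OlderPostsFinder(HTMLParser):
    """Collect the href of every <a> tag whose title is "Older Posts"."""

    def __init__(self):
        super().__init__()
        self.older_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Anchors without a title or an href are simply skipped.
        if attributes.get("title") == "Older Posts" and attributes.get("href"):
            self.older_links.append(attributes["href"])


# Hypothetical page fragment, used here instead of a live request.
sample_html = """
<div class="blog-pager">
  <a href="http://example.com/home">Home</a>
  <a href="http://example.com/page2" title="Older Posts">Older Posts</a>
</div>
"""

finder = OlderPostsFinder()
finder.feed(sample_html)
print(finder.older_links)
```

In a real crawl, each collected link would be fetched and fed to a fresh parser in turn, which is what the recursive call in the function above does.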

After obtaining the list of URLs to visit, we parse each page with the lxml parser and extract the text of the list items found inside each "div" of class "post-body". We append these texts to a list.

In [5]:
def get_doxydonkey_text(test_url):
    request = urllib2.Request(test_url)
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response, 'lxml')
    my_divs = soup.find_all("div", {"class" : "post-body"})

    posts = []
    for div in my_divs:
        # Each <li> is one post. Non-ASCII characters become "?" and are then
        # blanked out; note that genuine question marks are blanked as well.
        for li in div.find_all("li"):
            posts.append(li.text.encode('ascii', errors='replace').replace("?", " "))
    return posts

In [6]:
doxy_donkey_posts = []
links.append("http://doxydonkey.blogspot.in/")
for link in links:
    doxy_donkey_posts += get_doxydonkey_text(link)
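
The ASCII clean-up applied to every post can be tried in isolation. The following is a small sketch in Python 3 (the notebook itself is Python 2); `to_plain_ascii` and the sample sentence are hypothetical names used only for this illustration.

```python
def to_plain_ascii(text):
    """Replace every non-ASCII character (and every literal '?') with a space."""
    # errors="replace" substitutes '?' for each character outside ASCII.
    ascii_bytes = text.encode("ascii", errors="replace")
    # Turning '?' into ' ' hides those substitutions, but it also removes
    # question marks that were genuinely part of the text.
    return ascii_bytes.decode("ascii").replace("?", " ")


sample = "Uber\u2019s move \u2014 really?"
cleaned = to_plain_ascii(sample)
print(repr(cleaned))
```

The string keeps its length, since each replaced character becomes exactly one space. For keyword counting this loss is harmless, because punctuation is discarded later anyway.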

## Data Visualization and Clustering

Then, we vectorize all the texts obtained.

In [7]:
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
X = vectorizer.fit_transform(doxy_donkey_posts)
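
The `max_df=0.5` and `min_df=2` settings prune the vocabulary by document frequency: a term is kept only if it appears in at least 2 documents and in at most 50% of them. The sketch below illustrates that rule in plain Python 3. It assumes a simplistic whitespace tokenizer and a toy corpus, both hypothetical, and it does not reproduce the TF-IDF weighting itself.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the scraped posts.
docs = [
    "company amazon prime delivery",
    "company amazon prime shipping",
    "company facebook ads revenue",
    "facebook search revenue",
]


def filter_vocabulary(documents, max_df=0.5, min_df=2):
    """Return the terms whose document frequency lies within the two bounds."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # set() makes each term count once per document.
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    )


vocabulary = filter_vocabulary(docs)
print(vocabulary)
```

Here "company" is dropped for being too common (3 of 4 documents), while "delivery", "shipping", "ads" and "search" are dropped for appearing only once.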

We will use the K-Means clustering model to segment the data into 3 clusters. We cap the run at 100 iterations (max_iter=100) and use a single initialization (n_init=1). Raising either value can give a better clustering, but at the expense of run time.

In [8]:
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1, verbose=True)
km.fit(X)
np.unique(km.labels_, return_index=True)

Initialization complete
Iteration  0, inertia 5277.433
Iteration  1, inertia 2700.479
Iteration  2, inertia 2692.552
Iteration  3, inertia 2690.240
Iteration  4, inertia 2689.089
Iteration  5, inertia 2688.380
Iteration  6, inertia 2687.954
Iteration  7, inertia 2687.610
Iteration  8, inertia 2687.429
Iteration  9, inertia 2687.327
Iteration 10, inertia 2687.273
Iteration 11, inertia 2687.258
Iteration 12, inertia 2687.249
Iteration 13, inertia 2687.245
Iteration 14, inertia 2687.228
Iteration 15, inertia 2687.212
Iteration 16, inertia 2687.203
Iteration 17, inertia 2687.198
Iteration 18, inertia 2687.195
Iteration 19, inertia 2687.188
Iteration 20, inertia 2687.186
Iteration 21, inertia 2687.185
Iteration 22, inertia 2687.183
Iteration 23, inertia 2687.181
Iteration 24, inertia 2687.176
Iteration 25, inertia 2687.170
Iteration 26, inertia 2687.159
Iteration 27, inertia 2687.140
Iteration 28, inertia 2687.122
Iteration 29, inertia 2687.095
Iteration 30, inertia 2687.077
Iteration 31, i

(array([0, 1, 2], dtype=int32), array([ 1,  0, 22]))
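
The "inertia" printed at each iteration is the sum of squared distances from every point to its assigned centroid. To show what one run does, here is a minimal pure-Python 3 sketch of the K-Means loop. It makes several simplifying assumptions that differ from the library: the starting centroids are given by hand instead of chosen by k-means++, the data is a hypothetical set of 2-D points, and convergence is declared when assignments stop changing.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps; return (labels, centroids, inertia)."""
    centroids = [list(c) for c in initial_centroids]  # do not mutate the input
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we treat this as converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, inertia


# Hypothetical toy data: two well-separated groups.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids, inertia = kmeans(points, [[0, 0], [10, 10]])
print(labels, centroids, inertia)
```

Each pass can only lower the inertia or leave it unchanged, which matches the steadily shrinking values in the log above.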

In [9]:
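text = {}
for i, cluster in enumerate(km.labels_):
    one_document = doxy_donkey_posts[i]
    if cluster not in text:
        text[cluster] = one_document
    else:
        # Join with a space so that the last word of one post does not
        # fuse with the first word of the next.
        text[cluster] += " " + one_document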

We will remove stopwords and punctuation, along with a few very frequent words such as "million" and "billion", since they carry little information about the topic of a cluster.

In [10]:
_stopwords = set(stopwords.words('english') + list(punctuation) + ["million", "billion", "year", "``", "millions", "billions", "'s", "''"])

In [11]:
key_words = {}
counts = {}
for cluster in range(3):
    word_sent = word_tokenize(text[cluster].lower())
    word_sent = [word for word in word_sent if word not in _stopwords]
    freq = FreqDist(word_sent)
    key_words[cluster] = nlargest(100, freq, key=freq.get)
    counts[cluster] = freq
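
The filter-then-count step can be tried without NLTK. The sketch below is an illustration in Python 3 under stated assumptions: it uses a deliberately tiny, hypothetical stop-word set and whitespace splitting in place of `stopwords.words('english')` and `word_tokenize`, and `collections.Counter` in place of `FreqDist`.

```python
from collections import Counter
from string import punctuation

# Hypothetical, deliberately tiny stop-word set; the notebook uses NLTK's list.
stop_words = {"the", "a", "of", "and", "to", "in"}


def top_keywords(text, n=2):
    """Return the n most frequent non-stop-word tokens with their counts."""
    tokens = [token.strip(punctuation) for token in text.lower().split()]
    kept = [token for token in tokens if token and token not in stop_words]
    return Counter(kept).most_common(n)


sample = ("The shares of Amazon rose, and the shares of Facebook "
          "rose in the quarter. Shares!")
print(top_keywords(sample))
```

Without the filter, "the" would have tied with "shares" for the top spot, which is the reason for removing such words before ranking.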

In [12]:
unique_keys = {}
for cluster in range(3):
    other_clusters = list(set(range(3))-set([cluster]))
    keys_other_clusters = set(key_words[other_clusters[0]]).union(set(key_words[other_clusters[1]]))
    unique = set(key_words[cluster]) - keys_other_clusters
    unique_keys[cluster] = nlargest(10, unique, key=counts[cluster].get)

for cluster in range(3):
    print "Cluster {}: {}".format(cluster, unique_keys[cluster])

Cluster 0: ['facebook', 'twitter', 'ads', 'use', 'apps', 'search', 'pay', 'mr.', 'ad', 'social']
Cluster 1: ['quarter', 'shares', 'share', 'stock', 'profit', 'public', 'rose', 'valuation', 'analysts', 'earnings']
Cluster 2: ['amazon', 'prime', 'delivery', 'items', 'amazon.com', 'retailer', 'echo', 'shipping', 'sellers', 'web']
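
The keywords printed for each cluster are those that appear in its top list but in no other cluster's top list. A minimal Python 3 sketch of that set-difference idea follows. The word lists are hypothetical, and unlike the cell above the loop is written to work for any number of clusters.

```python
# Hypothetical ranked keyword lists, most frequent first.
key_words = {
    0: ["facebook", "ads", "company", "shares"],
    1: ["shares", "stock", "company", "profit"],
    2: ["amazon", "prime", "company", "ads"],
}

unique_keys = {}
for cluster, words in key_words.items():
    # Gather every keyword that belongs to some other cluster.
    others = set()
    for other, other_words in key_words.items():
        if other != cluster:
            others.update(other_words)
    # Keep the original ranking order of the words that survive.
    unique_keys[cluster] = [word for word in words if word not in others]

print(unique_keys)
```

Words shared by several clusters, such as "company" here, describe the corpus as a whole and not any one cluster, so they are dropped.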


We see that after running the model, we obtain 3 clusters, described above with their corresponding keywords.

## Prediction

After the clustering, given an article or text, we want to know which cluster it belongs to. This can be applied to many domains, for example categorizing scientific journals, the media industry, and a lot more.

Now, for testing, let us take an example text related to the cab industry and mobile apps/payment, and see what our model predicts.

In [13]:
article = "This weekend, Uber and Lyft — in their reactions to the Trump administration’s immigration order — illustrated how important companies' political views have become to consumers. Lyft took a public stand against the order and, on Sunday, saw more downloads than Uber for the first time ever, according to analysis firm App Annie. Lyft's Sunday downloads also more than doubled its daily average over the previous two weeks. Uber, on the other hand, had a bad weekend. Hundreds of people called for ride-sharers to ditch the company through the hashtag “#deleteUber” after it announced that it would drop surge pricing for John F. Kennedy Airport trips. Many saw Uber’s move as an attempt to undermine the strike that New York City cabdrivers organized to protest the immigration order and capitalize off the controversy — something Uber was quick to deny. It also didn't help Uber's standing among President Trump's critics that its chief executive is on the administration's business advisory committee. The social reaction to the Uber-Lyft divide was immediate. App Annie confirmed that Lyft climbed the app charts on both Apple and Android phones this weekend. It overtook Uber to reach No. 1 on the Apple App Store. That bump came despite the fact that Lyft didn’t suspend its service during the strike either and despite Lyft's ties to another close Trump ally, investor Peter Thiel. Its pledge over the weekend, however, seemed to speak louder than those facts — and louder than a later Uber vow to devote $3 million to help its drivers with immigration legal costs."

In [14]:
classifier = KNeighborsClassifier(n_neighbors=10)
classifier.fit(X, km.labels_)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [15]:
test = vectorizer.transform([article.decode('utf-8').encode('ascii', errors='ignore')])
print "The article belongs to cluster {}".format(classifier.predict(test))

The article belongs to cluster [0]
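
The classifier above assigns the new article the label held by most of its 10 nearest training documents. The sketch below shows that majority vote in plain Python 3. It assumes Euclidean distance (the default shown in the classifier's printed parameters), and the 2-D points, labels, and `knn_predict` function are hypothetical stand-ins for the TF-IDF vectors and K-Means labels.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_points[i], query)),
    )
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: three points labelled 0 and two labelled 1.
train_points = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
train_labels = [0, 0, 0, 1, 1]

print(knn_predict(train_points, train_labels, [0.1, 0.1]))
print(knn_predict(train_points, train_labels, [1.0, 1.05]))
```

An alternative worth trying is to skip the second model and assign the new vector to its nearest centroid, which scikit-learn exposes as `km.predict(test)`; the two approaches may disagree for points near a cluster boundary.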


It seems our model has predicted a plausible cluster: cluster 0 is the one whose keywords include 'apps', 'pay' and 'social'.

Finally, we could use a different reference website, maybe a news site with an exhaustive range of articles on a plethora of topics. The one chosen above is for simplicity. Note that we only have to tweak the extraction part, since the HTML structure of each website may be different, while the rest of the algorithm can remain unchanged to classify an article or text.