### **Python Project - Article Theme Classification**

**Objective:**

To classify the theme of an article.

- Gather a corpus of articles from a given URL.
- Feature extraction: convert the text to numeric features that a machine learning model can be trained on.
- Train the KMeans clustering algorithm.
- Figure out different themes from the clusters.
- Train a K Nearest Neighbors classifier on the articles and their themes.
- Assign a theme to a new article.



## **Gather a corpus of articles from a given URL**

We need the `urllib` and Beautiful Soup libraries to download and parse the pages.

In [None]:
# to download and parse an HTML webpage
# (urllib.request must be imported explicitly; a bare "import urllib" does not load it)
import urllib.request
from bs4 import BeautifulSoup

### **Get all the article URLs from a blog URL**

Below is the URL of a blog where a few technology-related articles are summarized and posted every day.
 

In [None]:
blogUrl = "http://doxydonkey.blogspot.in"

Each page of the blog holds the tech news summaries for seven days (seven daily posts), followed by a link at the bottom to retrieve the previous posts.

That link on the homepage is called Older Posts, and it points to a URL with the summaries for the previous seven days.

The function below follows these Older Posts links recursively and appends the URL of each older page to the variable `links`.

Links are represented in HTML using the `a` tag, so `soup.findAll('a')` finds all the links present on a page. We want the link whose title is Older Posts. For each link, we can read the URL from the attribute `href` and the title from the attribute `title`. If the `title` is equal to `Older Posts`, we keep that link: we append it to the list where we store all the pages that we will later parse for articles, and then repeat the search on that page.

**The below method is customized only to this website and if we want to use it for another website we need to make some changes to the code by inspecting the HTML code of the specific website.**

In [None]:
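# Follows the "Older Posts" link recursively and appends each older page URL to `links`
def getAllBlogPosts(url, links):
    response = urllib.request.urlopen(url)
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(response, "html.parser")
    for a in soup.findAll('a'):
        # .get returns None for anchors without the attribute, so no try/except is needed
        href = a.get('href')
        title = a.get('title')
        if href and title == "Older Posts":
            links.append(href)
            getAllBlogPosts(href, links)
    return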
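
To see the link-matching logic in isolation, without any network access, here is a minimal sketch that runs on an inline HTML fragment. It swaps in the standard library's `html.parser` for Beautiful Soup so that it has no dependencies. The fragment, the `example.com` URLs, and the `OlderPostsFinder` class are assumptions made up for illustration; the real blog markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, not the real blog markup
SAMPLE_HTML = """
<a href="http://example.com/post-1" title="Some post">Post 1</a>
<a href="http://example.com/search?updated-max=2017" title="Older Posts">Older Posts</a>
<a name="anchor-without-href-or-title">plain anchor</a>
"""

class OlderPostsFinder(HTMLParser):
    """Collects the href of every <a> tag whose title is 'Older Posts'."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)  # attrs is a list of (name, value) pairs
        if attributes.get("title") == "Older Posts" and "href" in attributes:
            self.found.append(attributes["href"])

finder = OlderPostsFinder()
finder.feed(SAMPLE_HTML)
older_links = finder.found
print(older_links)
```

Anchors that lack `href` or `title` are simply skipped, which is the same outcome the notebook function aims for.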

We create a list `links` and call the above function, passing `blogUrl` and the list.

In [None]:
#Create a list and store all the article URLs 
links = []
getAllBlogPosts(blogUrl,links)

The `getBlogText` function collects all the articles from one page URL. Each page contains seven daily posts; the function appends the text of every article in them to the `posts` list and returns it.

To understand how to parse the article text, we can go to the browser and inspect the article text element. Right-click on the article text and click `Inspect`, and this will open up a developer tools window which shows you the HTML corresponding to the element that we are looking at. Each article that we want to collect is a bullet point on this page, and bullet points are represented using the tag `li`. Each day's tech news summary is one post, which might contain multiple such bullet points. Each day's summary is contained within a `div` element whose class name is `post-body`. So in order to collect all the articles, we need to find divs which have this class, post-body, and within those divs, we need to find the bullet points represented by the `li` tag. Given a URL containing multiple tech news summaries, this function will collect all the articles from that blog page.

In [None]:
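# Get the text of every article (bullet point) on one blog page
def getBlogText(testUrl):
    response = urllib.request.urlopen(testUrl)
    soup = BeautifulSoup(response, "html.parser")
    mydivs = soup.findAll("div", {"class": 'post-body'})

    posts = []
    for div in mydivs:
        # Encode to ASCII bytes: each non-ASCII character becomes b"?", then every b"?"
        # (genuine question marks included) is replaced by a space
        posts += [li.text.encode('ascii', errors='replace').replace(b"?", b" ")
                  for li in div.findAll("li")]
    return posts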
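
The encode-and-replace step has a side effect that is easy to miss, so here is a small standalone sketch of what it does to one string. The sample sentence is made up for illustration; the calls are plain `str` and `bytes` methods.

```python
# Hypothetical article text with a curly apostrophe and a genuine question mark
raw = "India\u2019s largest startup: is it overvalued?"

# Step 1: non-ASCII characters cannot be encoded, so each one becomes b"?"
as_bytes = raw.encode("ascii", errors="replace")

# Step 2: every b"?" is replaced by a space, genuine question marks included
cleaned = as_bytes.replace(b"?", b" ")

# Step 3: decode back to a regular str for later processing
as_text = cleaned.decode("ascii")

print(as_bytes)
print(as_text)
```

This is why the scraped text contains fragments such as `India s largest`: the curly apostrophe was turned into a space. Real question marks are lost in the same way, which is acceptable for keyword counting but worth knowing.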

Let's create a variable `BlogPosts` to store the corpus of articles.
We iterate over `links`, extract the articles from each page using the `getBlogText` function, and store them in `BlogPosts`.

In [None]:
BlogPosts = []
for link in links:
    BlogPosts += getBlogText(link)

Let's have a look at the first 5 blog posts.

In [None]:
BlogPosts[0:5]

[b"SoftBank's $100 Billion Tech Fund Rankles VCs as Valuations Soar: In the months since Softbank Group Corp. unveiled plans for a $100 billion technology fund, the Japanese company has been making its presence deeply felt across the industry. The Vision Fund closed a few days ago with $93 billion in initial commitments, and already venture firms from London to Silicon Valley are fretting about a behemoth with the resources, clout and name recognition to snatch away the most promising deals. Just last week, SoftBank swooped in and pumped $1.4 billion into Paytm, India s largest digital-payments startup. The deal boosted Paytm's valuation by about 40 percent to $7 billion. That's not outlandish given Paytm's dominant market position, but the valuations of other SoftBank deals have prompted head-scratching and ignited alarm that a funding atmosphere that only recently cooled off will heat up again. there's the concern that SoftBank will ladle out more money than startups need or can abso

The strings above are `bytes` objects (note the `b` prefix), so we now decode all of them to regular strings.

In [None]:
# convert the bytes objects to regular strings
for i, string in enumerate(BlogPosts):
    BlogPosts[i] = string.decode('ascii')

BlogPosts[0:5]

["SoftBank's $100 Billion Tech Fund Rankles VCs as Valuations Soar: In the months since Softbank Group Corp. unveiled plans for a $100 billion technology fund, the Japanese company has been making its presence deeply felt across the industry. The Vision Fund closed a few days ago with $93 billion in initial commitments, and already venture firms from London to Silicon Valley are fretting about a behemoth with the resources, clout and name recognition to snatch away the most promising deals. Just last week, SoftBank swooped in and pumped $1.4 billion into Paytm, India s largest digital-payments startup. The deal boosted Paytm's valuation by about 40 percent to $7 billion. That's not outlandish given Paytm's dominant market position, but the valuations of other SoftBank deals have prompted head-scratching and ignited alarm that a funding atmosphere that only recently cooled off will heat up again. there's the concern that SoftBank will ladle out more money than startups need or can absor

## **Feature extraction**



The process of extracting numeric attributes from text is called feature extraction.

There are two common methods, **Term Frequency** and **TF-IDF**.

Both methods rely on a bag-of-words model: we create a list representing the universe of all words that can appear in any text from the corpus, and each text is then described by one number per word in that list.

<img src='https://www.romainberg.com/wp-content/uploads/TF_IDF-final-1024x399.png'>

We first need to take all the articles and represent them using the TF-IDF representation. 
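
To make the two weightings concrete, here is a minimal pure-Python sketch on a toy corpus. The three documents are invented, and the formulas are the textbook ones (`tf = count / document length`, `idf = log(N / df)`). This is an assumption for illustration: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, not taken from the blog)
docs = [
    "startup raises funding round",
    "startup stock price falls",
    "social media advertising grows",
]
tokenized = [doc.split() for doc in docs]

# Bag of words: the sorted universe of all distinct words in the corpus
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: the number of documents each word appears in
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def term_frequency(tokens):
    """Share of the document taken up by each vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]

def tf_idf(tokens):
    """Term frequency scaled down for words that occur in many documents."""
    n_docs = len(tokenized)
    return [
        tf_value * math.log(n_docs / doc_freq[word])
        for tf_value, word in zip(term_frequency(tokens), vocabulary)
    ]

tf_first = term_frequency(tokenized[0])
tfidf_first = tf_idf(tokenized[0])
for word, tf_value, weight in zip(vocabulary, tf_first, tfidf_first):
    if tf_value > 0:
        print(f"{word:12s} tf={tf_value:.2f} tfidf={weight:.3f}")
```

In the first document, "startup" and "funding" have the same term frequency, but "startup" also appears in the second document, so its TF-IDF weight comes out lower. That is the effect TF-IDF is designed for: words that are common across the corpus say less about any single article.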

`Scikit-learn` is a Python module with a lot of built-in functionality available for machine learning tasks, and this contains a feature extraction module, which allows you to perform TF-IDF representation. 

So we import the `TfidfVectorizer` from `sklearn.feature_extraction.text`, and we instantiate a vectorizer object.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

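We set the `stop_words` parameter to ignore English stop words in the article corpus. In addition, `max_df=0.5` drops words that appear in more than half of the articles, and `min_df=2` drops words that appear in fewer than two articles.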

In [None]:
vectorizer = TfidfVectorizer(max_df=0.5,min_df=2,stop_words='english')
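
The effect of the two document-frequency thresholds can be sketched without scikit-learn. The toy documents below are invented, and the filter is a simplified stand-in for what the vectorizer does internally (stop-word removal is not modeled).

```python
from collections import Counter

# Toy corpus (hypothetical)
docs = [
    "apple launches new phone",
    "apple stock rises",
    "apple phone sales rise",
    "google launches new phone",
]
tokenized = [set(doc.split()) for doc in docs]

# Number of documents in which each word occurs
doc_freq = Counter(word for tokens in tokenized for word in tokens)

max_df = 0.5  # drop words found in more than 50% of the documents
min_df = 2    # drop words found in fewer than 2 documents
n_docs = len(docs)

kept = sorted(
    word for word, df in doc_freq.items()
    if df >= min_df and df / n_docs <= max_df
)
print(kept)
```

Here "apple" and "phone" are too common (3 of 4 documents), and words such as "stock" are too rare (1 document), so only "launches" and "new" survive. On the real corpus, the same pruning keeps the vocabulary focused on words that can actually separate themes.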

This vectorizer has a method called `fit_transform`, which takes a list of strings and returns a two-dimensional sparse matrix in which each row represents one article.

In [None]:
X = vectorizer.fit_transform(BlogPosts)
X.shape

(2804, 13220)

The shape of `X` is (2804, 13220).

Here,
- 2804 represents the number of articles in the corpus.
- 13220 represents the number of distinct words kept in the vocabulary across all articles, after removing stop words and applying `max_df` and `min_df`.

In [None]:
X[0].shape

(1, 13220)

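Each row of `X` stores the TF-IDF weights of one article. In the sparse printout further below, each line has the form `(row, column)  value`: the column is the index of a word in the vocabulary and the value is that word's TF-IDF weight. Note that the text shown next is the first article, `BlogPosts[0]`, while the printout is for the second row, `X[1]`.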

In [None]:
BlogPosts[0]

"SoftBank's $100 Billion Tech Fund Rankles VCs as Valuations Soar: In the months since Softbank Group Corp. unveiled plans for a $100 billion technology fund, the Japanese company has been making its presence deeply felt across the industry. The Vision Fund closed a few days ago with $93 billion in initial commitments, and already venture firms from London to Silicon Valley are fretting about a behemoth with the resources, clout and name recognition to snatch away the most promising deals. Just last week, SoftBank swooped in and pumped $1.4 billion into Paytm, India s largest digital-payments startup. The deal boosted Paytm's valuation by about 40 percent to $7 billion. That's not outlandish given Paytm's dominant market position, but the valuations of other SoftBank deals have prompted head-scratching and ignited alarm that a funding atmosphere that only recently cooled off will heat up again. there's the concern that SoftBank will ladle out more money than startups need or can absorb

In [None]:
print (X[1])

  (0, 1965)	0.0529628954253893
  (0, 12067)	0.055052384601166085
  (0, 4187)	0.0584507917180868
  (0, 8972)	0.07600199422007031
  (0, 3024)	0.05736900975819227
  (0, 9356)	0.05279111734066146
  (0, 6749)	0.03958895210959883
  (0, 1950)	0.043600047428625034
  (0, 2069)	0.043452319046352085
  (0, 4601)	0.054246249916468064
  (0, 7932)	0.07555595918260936
  (0, 3301)	0.08686905713475819
  (0, 2322)	0.0836816938182973
  (0, 4168)	0.05367794768088572
  (0, 11192)	0.03238529318176009
  (0, 13146)	0.019059438993661963
  (0, 210)	0.08686905713475819
  (0, 7579)	0.04294208101613297
  (0, 128)	0.07062282759807993
  (0, 4124)	0.06173325831827569
  (0, 12886)	0.04413318293366655
  (0, 3210)	0.06382995834007718
  (0, 7132)	0.041567635996890104
  (0, 7254)	0.028166537825240253
  (0, 4877)	0.06584998356087884
  :	:
  (0, 5348)	0.05091405493600229
  (0, 6953)	0.08858068218774492
  (0, 5417)	0.03738918655828943
  (0, 9373)	0.0976446354406848
  (0, 11873)	0.10331584492016813
  (0, 1019)	0.05686531099734

## **Train KMeans clustering algorithm**

**Clustering:**

Here we are given a large group of articles, and we need to divide these articles into clusters or groups on the basis of some common attributes. We want all articles which represent some particular theme to be put into one group. This is a classic example of a clustering problem. 

Whenever you encounter a large number of items and the objective is to divide them into groups based on some measure of similarity, you're basically solving a clustering problem. 

The end objective of clustering is to create groups such that items within one group are similar to one another and items which are in different groups are dissimilar to one another. 

In other words, if you have some metric called similarity, which measures how similar items are to one another, you want to maximize the intracluster similarity (the similarity of items within a cluster) and minimize the intercluster similarity (the similarity between items which are in different clusters).
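
As a small illustration of such a similarity metric, the sketch below uses cosine similarity, one common choice for TF-IDF vectors. This is an assumption for illustration only: the text does not name a metric, and K-Means itself works with squared Euclidean distance. The three vectors are invented.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means no shared terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical weights over a 4-word vocabulary: [funding, round, shares, profit]
funding_article_1 = [1.0, 1.0, 0.0, 0.0]
funding_article_2 = [1.0, 0.5, 0.0, 0.0]
stock_article = [0.0, 0.0, 1.0, 1.0]

within_cluster = cosine_similarity(funding_article_1, funding_article_2)
across_clusters = cosine_similarity(funding_article_1, stock_article)
print(round(within_cluster, 3), across_clusters)
```

A good clustering puts pairs like the first one in the same group and pairs like the second one in different groups.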

Scikit-learn has a built-in module for clustering called the `sklearn.cluster` module. 

Within this module, we have a class for the K-Means clustering algorithm. So we can import that class and instantiate a K-Means clustering object. This sets up the algorithm with the following parameters:

`n_clusters` is the parameter that is used to specify the number of clusters, which here is three, because we want to divide our articles into three groups.

The `init` parameter specifies how the initial centroids (means) are chosen. Here we use `'k-means++'`, which picks them in such a way that the relevant clusters can usually be found in fewer iterations.

`max_iter` is the maximum number of iterations: if the algorithm has not converged by the 100th iteration, it stops at that point.

`n_init` is the number of times the k-means algorithm is run with different centroid seeds. The default value has long been 10 (recent scikit-learn releases default to `'auto'`); here we set it to 1, so the algorithm runs only once.

In [None]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 100, n_init = 1, verbose = True)

To perform K-Means clustering, we take our documents represented in TF-IDF, which is the array X, and pass it to the fit method of the K-Means clustering algorithm. 

In [None]:
km.fit(X)

Initialization complete
Iteration 0, inertia 5266.672075908238
Iteration 1, inertia 2685.7228855572725
Iteration 2, inertia 2677.5563626124367
Iteration 3, inertia 2674.145183603303
Iteration 4, inertia 2671.0745860108595
Iteration 5, inertia 2669.3755496569056
Iteration 6, inertia 2668.6988850993375
Iteration 7, inertia 2668.4275202310287
Iteration 8, inertia 2668.266815176653
Iteration 9, inertia 2668.179201175489
Iteration 10, inertia 2668.098142680158
Iteration 11, inertia 2668.0405470171218
Iteration 12, inertia 2667.9643083983756
Iteration 13, inertia 2667.874353005027
Iteration 14, inertia 2667.8199574164273
Iteration 15, inertia 2667.7976527097017
Iteration 16, inertia 2667.7792048889432
Iteration 17, inertia 2667.770560499607
Iteration 18, inertia 2667.7685426847183
Converged at iteration 18: strict convergence.


KMeans(max_iter=100, n_clusters=3, n_init=1, verbose=True)
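
The verbose log above shows the inertia (the sum of squared distances from each point to its centroid) shrinking at every iteration until the cluster assignments stop changing. The loop behind it can be sketched in pure Python. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the 2-D points are invented, the starting centroids are given explicitly instead of being chosen by k-means++, and the helper names are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, max_iter=100):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = list(initial_centroids)
    labels = None
    history = []  # inertia measured at each iteration
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        history.append(sum(squared_distance(p, centroids[label])
                           for p, label in zip(points, new_labels)))
        if new_labels == labels:
            break  # no assignment changed: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids, history

# Two well-separated toy groups of points
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
# Deliberately poor start: both centroids lie inside the first group
labels, centroids, history = kmeans(points, initial_centroids=[(0, 0), (1, 1)])
print(labels)
print(centroids)
print(history)
```

Even from a poor start, the inertia drops sharply after the first update and then settles. The result does depend on the starting centroids, which is why `init` and `n_init` matter in practice.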

Every document in our array X has now been assigned a number, which represents the cluster to which it belongs. These numbers are stored in the array `labels_`, which is an attribute of the K-Means object.

We also have the counts which represent how many articles are present in each cluster.

In [None]:
import numpy as np
np.unique(km.labels_, return_counts=True) 

(array([0, 1, 2], dtype=int32), array([ 674, 1681,  449]))

## **Find out different themes from the clusters**

We have to actually look at the articles which are present in each cluster to identify what meaningful theme is represented by each cluster. 

So let's find some of the important keywords which occur in each cluster to see if we can articulate what those underlying themes might be. 

We'll set up a dictionary called `text` in which the keys will be the cluster numbers and the values will be the aggregated text across all the articles which are present in that cluster. We'll go through the array of labels, which have the cluster numbers assigned to each document and then collect the text for each document into the corresponding cluster. 

The `enumerate` function yields a pair for every entry of the array of labels: the first element is the index of an article and the second is its cluster number. Using the index, we can get the corresponding article from the list `BlogPosts`. Then we append the text of every article to the corresponding value in the `text` dictionary.

In [None]:
km.labels_

array([0, 1, 0, ..., 1, 1, 1], dtype=int32)

In [None]:
text = {}
for i, cluster in enumerate(km.labels_):
    oneDocument = BlogPosts[i]
    if cluster not in text:
        text[cluster] = oneDocument
    else:
        # Add a space so the last word of one article does not fuse with the first word of the next
        text[cluster] += " " + oneDocument
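
The same grouping can also be written by collecting the articles of each cluster in a list and joining them once at the end. The sketch below shows this alternative on invented labels and posts; the names `grouped` and `text_by_cluster` are hypothetical.

```python
from collections import defaultdict

# Hypothetical cluster labels and matching article texts
labels = [0, 1, 0, 2, 1]
posts = ["ads on facebook", "shares fell", "twitter users", "funding round", "profit rose"]

grouped = defaultdict(list)
for label, post in zip(labels, posts):
    grouped[label].append(post)

# Join with a space so that words from neighbouring articles stay separate
text_by_cluster = {label: " ".join(items) for label, items in grouped.items()}
print(text_by_cluster)
```

Joining once avoids building a new string on every iteration, which can matter with thousands of articles.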

In [None]:
text

We can use some NLTK functions to find out the most frequent words within each cluster.

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

from string import punctuation
from heapq import nlargest
import nltk 

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

When we find the most frequent words, we don't want to include stop words, so we'll set up a variable to represent all the stop words that we want to ignore.

Along with the standard list of stop words from English and punctuation, we have some additional words which are common to tech news articles.

In [None]:
_stopwords = set(stopwords.words('english') + list(punctuation)+["million","billion","year","millions","billions","y/y","'s","''","``"])
 

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

The code below takes the text from each cluster and finds the top 100 words that occur within that text.

We iterate through each cluster, and for each cluster, we take the corresponding text and tokenize it into words. We filter out all the stop words and keep only relevant words.

Then we use the `FreqDist` class to compute the frequency distribution of the words.

We use the `nlargest` function to pick the top 100 words from this frequency distribution. Along with the top 100 keywords, we also separately store the complete frequency distribution, which keeps the words along with their counts, in the `counts` dictionary.

In [None]:
# find the 100 most frequent words in each cluster, excluding the stop words
keywords = {}
counts={}
for cluster in range(3):
    word_sent = word_tokenize(text[cluster].lower())
    word_sent=[word for word in word_sent if word not in _stopwords]
    freq = FreqDist(word_sent)
    keywords[cluster] = nlargest(100, freq, key=freq.get)
    counts[cluster]=freq
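
The counting step can be tried out on a single sentence with only the standard library. In this sketch `collections.Counter` stands in for `FreqDist` and `str.split` stands in for `word_tokenize`; the sentence and the small stop-word set are invented for illustration.

```python
from collections import Counter
from heapq import nlargest

# Hypothetical stop words and cluster text
stop_words = {"the", "a", "in", "of", "and", "its"}
cluster_text = ("The startup raised funding and the startup grew "
                "its funding in a new round of funding")

# Lowercase, split on whitespace, and drop the stop words
tokens = [word for word in cluster_text.lower().split() if word not in stop_words]

# Count the remaining words and pick the most frequent ones
freq = Counter(tokens)
top_two = nlargest(2, freq, key=freq.get)
print(top_two, freq["funding"])
```

`nlargest(n, freq, key=freq.get)` iterates over the words and ranks them by their counts, which is the same call pattern the notebook cell uses with 100 instead of 2.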

Now that we have the 100 most important keywords from each cluster, let's find the 10 keywords which are unique to each cluster.

As we iterate through each cluster, we find the list of other clusters. So if we look at cluster zero, we collect all the keywords which are present in the other clusters and we remove those keys from the list of keywords in our cluster. From the remaining keywords, we pick the top 10 most frequently occurring keywords. 

In [None]:
# find the 10 most frequent keywords that are unique to each cluster
unique_keys={}
for cluster in range(3):   
    other_clusters=list(set(range(3))-set([cluster]))
    keys_other_clusters=set(keywords[other_clusters[0]]).union(set(keywords[other_clusters[1]]))
    unique=set(keywords[cluster])-keys_other_clusters
    unique_keys[cluster]=nlargest(10, unique, key=counts[cluster].get)
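
The set arithmetic is easier to follow on a tiny example. The keyword lists and counts below are invented, and the sketch is written so that it works for any number of clusters rather than exactly three.

```python
from heapq import nlargest

# Hypothetical top keywords and word counts per cluster
keywords = {
    0: ["company", "facebook", "ads", "users"],
    1: ["company", "shares", "profit", "revenue"],
    2: ["company", "funding", "round", "users"],
}
counts = {
    0: {"company": 9, "facebook": 7, "ads": 5, "users": 4},
    1: {"company": 9, "shares": 8, "profit": 6, "revenue": 3},
    2: {"company": 9, "funding": 10, "round": 6, "users": 2},
}

unique_keys = {}
for cluster, words in keywords.items():
    # All keywords that appear in any other cluster
    others = set().union(*(set(keywords[c]) for c in keywords if c != cluster))
    # Keep only the words that no other cluster has
    unique = set(words) - others
    # Rank the remaining words by their count within this cluster
    unique_keys[cluster] = nlargest(2, unique, key=counts[cluster].get)

print(unique_keys)
```

"company" appears everywhere and "users" appears in two clusters, so both are discarded; what remains per cluster hints at its theme.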

In [None]:
unique_keys

**One of the clusters** seems to be related to social media and advertising-related keywords. So articles that covered those themes would reside in this cluster.


**Next cluster** seems to have words related to price and profit, which are stock performance-related. So articles about the stock performance might be a part of this cluster.  

**The other cluster** has words like round, capital, funding, and valuation. So this is a cluster that deals with startups and the funding and investments they are receiving. Any news articles related to these topics would reside within this cluster.




## **Train K Nearest Neighbors classifier**

Now that we have different themes that we have identified from our body of articles, we can take any new article and then assign one of these themes to that article. 

Our classification problem statement here is that given an article, we want to pass it to a classifier, and the output should be one of the themes, theme one, two, or three that we have identified from our clustering step.

We take all of our historical data, which is our body of articles, and represent it using the TF-IDF representation. These articles have labels already assigned to them by the clustering step. This training data along with the labels is fed to a standard algorithm in the training step, and that creates the model that can be used in the test step.

We can import the `KNeighborsClassifier` from scikit-learn's `neighbors` module. This is a built-in class that performs the classification.

So we can instantiate a `KNeighborsClassifier` object and use the `fit` method to run the training phase. In the training phase, you need to pass in the complete set of articles for which the labels are already known.

The variable `X` holds our articles represented as TF-IDF vectors, and `km.labels_` is an array with the cluster numbers assigned to those articles.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X,km.labels_)
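
What a K Nearest Neighbors classifier does at prediction time can be sketched in a few lines: find the `k` training points closest to the query and take a majority vote over their labels. The sketch below is a simplified illustration with invented 2-D points and a hypothetical `knn_predict` helper. It uses Euclidean distance, which is also the default metric of `KNeighborsClassifier`, and needs Python 3.8 or newer for `math.dist`.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy training data: two groups with known labels
train_vectors = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 1, 1, 1]

prediction_a = knn_predict(train_vectors, train_labels, (4.5, 5.2))
prediction_b = knn_predict(train_vectors, train_labels, (0.2, 0.3))
print(prediction_a, prediction_b)
```

There is no real training computation here: the "model" is the stored data itself, which is why `fit` is fast and the work happens in `predict`.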

Let's take an article from any website and assign it to a variable article.

In [None]:
article = "With Snapchat, Twitter, Instagram, and even Facebook, teens are in constant communication. While many will argue that it is important to have this connection outside of face-to-face interaction, studies have said that eliminating in-person communication can impair critical social skills. But what does this really mean? When teens receive messages via social media, all they see is a screen. They can’t see the person on the other side. They are oblivious to the other person’s reaction. Body language, facial expression, even tone of voice, are removed from the conversation. This is why teens choose to send that risky message or that mean reply, things they wouldn’t ever say to someone’s face. Of course, this creates a problem. Not only are the simplest parts of communication invisible, but it has become easier and easier to hide behind a mask on social media. Social media allows anyone to be anonymous, identity erased. Cyberbullies can create fakes accounts, which allows the bully to say anything without facing consequences. Statistics say that one half of teens have been victims of bullying online and one third have sent rude or harassing comments via social media, (Teens, Social Media & Technology Overview 2015). When teens don’t know the person harassing them it is harder to ask for help or tell an adult."

Before we pass our test article to the `KNeighborsClassifier`, we need to represent that article using the TF-IDF representation. So we use `vectorizer.transform` to convert the article into the TF-IDF form, reusing the vocabulary learned from the corpus.

The `test` variable will be a matrix which has one row and as many columns as there are words in the vocabulary of our corpus.

In [None]:
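# Drop non-ASCII characters, then decode back to str so the vectorizer receives text like the training data
test = vectorizer.transform([article.encode('ascii', errors='ignore').decode('ascii')])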

Let's check the shape of the test variable.

In [None]:
test.shape

(1, 13220)

Now we can use the predict method of the classifier and that will assign a theme to our article. 

In [None]:
classifier.predict(test)

array([1], dtype=int32)

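Here we can see that the article is assigned to cluster 1, which in this run is the cluster about social media and advertising. Cluster numbers are arbitrary and can change from one run of K-Means to the next, so always check the keywords of the predicted cluster before naming the theme.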