# News Articles clustering

Problem Statement :    
    
Dataset consists of news articles pertaining to business,political,entertainment, sports,technology are mentioned in JSON format. 
There is no segregation of articles in the dataset. All are mixed. We have to segregate the articles and validate which
article pertains to which topic(business,political,entertainment, sports,technology ).

Solution:

    1) To tackle this problem I am implementing the K-means clustering
    
    2) Firstly, I am applying the K-means clustering without dimensionality reduction
    
    3) Secondly, I am applying the K-means clustering with dimensionality reduction (Using PCA), which improves 
     performance of model by feature extraction.
    
    4) In addition to that I am finding the 
    
    a) which cluster has max. articles(before & after implementing PCA)
    b) Listing of top 50 words in entertainment cluster and printing last 50th word (before & after implementing PCA)

In [421]:
# Importing all libraries

In [1]:
import numpy as np
import pandas as pd
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

warnings.filterwarnings('ignore')

In [2]:
# reading the dataset from JSON format and displaying the content
# Article column--> contains the raw data
# Preprocessed-Article column ---> contains the processed data of 'Article' by removing the 'special characters',
#                                 'un-important words like a,an,the---' etc
# Vector --> are the TFIDF vectorizers of Preprocessed-Article

In [3]:
df=pd.read_json('NewsArticles.json')
df.head()

Unnamed: 0,Article,Preprocessed-Article,Vector
161,Iran jails blogger for 14 years\n\nAn Iranian ...,Iran jails blogger 14 years An Iranian weblogg...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
166,UK gets official virus alert site\n\nA rapid a...,UK gets official virus alert site A rapid aler...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
76,O'Sullivan could run in Worlds\n\nSonia O'Sull...,OSullivan could run Worlds Sonia OSullivan ind...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
8,Mutant book wins Guardian prize\n\nA book abou...,Mutant book wins Guardian prize A book evoluti...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
158,Microsoft seeking spyware trojan\n\nMicrosoft ...,Microsoft seeking spyware trojan Microsoft inv...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [4]:
# here already TFIDF vectorizer values are available in 'vector' column.No need to apply FIDF vectorizer again on
#   Preprocessed-Article column

In [5]:
# here i am taking 'vector' columns into another dataframe because on these vector columns values only we have to apply
# algorithms.But all values in a row within list. We have to do some data engineering on vector values to
# place in each rows and columns

df1=pd.DataFrame(df['Vector'])
df1.head()

Unnamed: 0,Vector
161,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
166,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
76,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
8,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
158,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [6]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179 entries, 161 to 21
Data columns (total 1 columns):
Vector    179 non-null object
dtypes: object(1)
memory usage: 2.8+ KB


In [7]:
df.isna().sum()

Article                 0
Preprocessed-Article    0
Vector                  0
dtype: int64

In [8]:
df1['Vector'][0]

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0876407994,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 

In [9]:
# For placing above vector values in rows and columns of dataframe, i am creating dataframe df2
# no. of columns = max. values in list of each row of vector column

df2=pd.DataFrame(columns=range(0,len(df1['Vector'][1])))
df2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8858,8859,8860,8861,8862,8863,8864,8865,8866,8867


In [10]:
df1.shape[0]

179

In [11]:
df1['Vector'].shape

(179,)

In [12]:
# all vector values are placed in rows and columns of dataframe df2

for i in range(0,df1.shape[0]):
    df2.loc[i]=df1['Vector'][i]

In [13]:
df2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8858,8859,8860,8861,8862,8863,8864,8865,8866,8867
0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.000000,0.045344,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0
175,0.0,0.0,0.0,0.0,0.0,0.031037,0.0,0.031037,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0
176,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0
177,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0


# Applying K-Means clustering before dimension reduction

In [14]:
# no. of clusters =5, because article topics are 5(business,political,entertainment, sports,technology)

km=KMeans(n_clusters=5,random_state=42)
km.fit(df2)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)

In [15]:
# predicting the labels of clusters

pred_labels=km.predict(df2)

In [16]:
# cal. SSE --> sum of squares of dist. of each point from its cluster centriod

km.inertia_

164.84866161665246

In [17]:
pred_labels

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 1, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 3, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 2, 4, 4, 1, 4, 4, 2, 4, 4, 4, 2, 0, 2, 2, 2,
       4, 1, 2, 4, 4, 4, 2, 2, 2, 2, 2, 4, 4, 2, 4, 4, 2, 4, 2, 4, 2, 0,
       0, 0, 0, 1, 0, 4, 0, 0, 1, 3, 3, 0, 3, 3, 0, 1, 0, 1, 3, 2, 0, 3,
       3, 0, 3, 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 3, 0, 0, 0, 0, 0,
       3, 0, 0])

In [18]:
# here i am cal. which cluster having max. articles i.e for finding that i have created dataframe 'pred_clusters'
# cluster label 0 -->having 55 articles
# cluster label 1 -->having 49 articles
# cluster label 2 -->having 38 articles
# cluster label 3 -->having 19 articles
# cluster label 4 -->having 18 articles

# that means cluster label 0 i.e cluster1(0+1=1) having max. of 55 aqrticles

In [19]:
pred_clusters=pd.DataFrame(pred_labels)

In [20]:
cluster_counts=pd.value_counts(pred_clusters[0],ascending=False)
cluster_counts

0    55
3    49
1    38
4    19
2    18
Name: 0, dtype: int64

In [21]:
cluster_counts.max()

55

In [22]:
# cluster 1 has max. list of articles

cluster_counts.index[0]+1

1

In [23]:
# here i am combining the cluster labels(i.e 5 labels) to original dataset for knowing which cluster belong to which article

In [24]:
df['pred_clusters']=pred_clusters

In [25]:
df.head()

Unnamed: 0,Article,Preprocessed-Article,Vector,pred_clusters
161,Iran jails blogger for 14 years\n\nAn Iranian ...,Iran jails blogger 14 years An Iranian weblogg...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
166,UK gets official virus alert site\n\nA rapid a...,UK gets official virus alert site A rapid aler...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
76,O'Sullivan could run in Worlds\n\nSonia O'Sull...,OSullivan could run Worlds Sonia OSullivan ind...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
8,Mutant book wins Guardian prize\n\nA book abou...,Mutant book wins Guardian prize A book evoluti...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3
158,Microsoft seeking spyware trojan\n\nMicrosoft ...,Microsoft seeking spyware trojan Microsoft inv...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0


In [26]:
# by verifying the content pertaining to cluster label 0 --> the below data looks similar to 'technology' based on their
# key words like apple iod, play station etc

# that means cluster label 0 --> segregated the data of all 'technology articles' into one cluster based on euclidian distances

In [27]:
df[df['pred_clusters']==0]

Unnamed: 0,Article,Preprocessed-Article,Vector,pred_clusters
161,Iran jails blogger for 14 years\n\nAn Iranian ...,Iran jails blogger 14 years An Iranian weblogg...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
166,UK gets official virus alert site\n\nA rapid a...,UK gets official virus alert site A rapid aler...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
158,Microsoft seeking spyware trojan\n\nMicrosoft ...,Microsoft seeking spyware trojan Microsoft inv...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
145,Ink helps drive democracy in Asia\n\nThe Kyrgy...,Ink helps drive democracy Asia The Kyrgyz Repu...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
171,Solutions to net security fears\n\nFake bank e...,Solutions net security fears Fake bank emails ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
162,Britons fed up with net service\n\nA survey co...,Britons fed net service A survey conducted PC ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
167,PlayStation 3 chip to be unveiled\n\nDetails o...,PlayStation 3 chip unveiled Details chip desig...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
130,Concerns at school diploma plan\n\nFinal appea...,Concerns school diploma plan Final appeals mad...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
172,Intel unveils laser breakthrough\n\nIntel has ...,Intel unveils laser breakthrough Intel unveile...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
153,Wi-fi web reaches farmers in Peru\n\nA network...,Wifi web reaches farmers Peru A network commun...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0


In [29]:
# by verifying the content pertaining to cluster label 1 --> the below data looks similar to 'business' based on their
# key words like US trade , High fuel prices etc

# that means cluster label 1 --> segregated the data of all 'business articles' into one cluster based on euclidian distances

In [30]:
df[df['pred_clusters']==1]

Unnamed: 0,Article,Preprocessed-Article,Vector,pred_clusters
76,O'Sullivan could run in Worlds\n\nSonia O'Sull...,OSullivan could run Worlds Sonia OSullivan ind...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
65,US trade gap hits record in 2004\n\nThe gap be...,US trade gap hits record 2004 The gap US expor...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
40,"Parmalat boasts doubled profits\n\nParmalat, t...",Parmalat boasts doubled profits Parmalat Ital...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
159,US woman sues over cartridges\n\nA US woman is...,US woman sues cartridges A US woman suing Hewl...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
45,UK firm faces Venezuelan land row\n\nVenezuela...,UK firm faces Venezuelan land row Venezuelan a...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
46,Ad sales boost Time Warner profit\n\nQuarterly...,Ad sales boost Time Warner profit Quarterly pr...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
69,"Steel firm 'to cut' 45,000 jobs\n\nMittal Stee...",Steel firm cut 45000 jobs Mittal Steel one w...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
44,Strong demand triggers oil rally\n\nCrude oil ...,Strong demand triggers oil rally Crude oil pri...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
89,O'Sullivan commits to Dublin race\n\nSonia O'S...,OSullivan commits Dublin race Sonia OSullivan ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
142,EU software patent law faces axe\n\nThe Europe...,EU software patent law faces axe The European ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1


In [32]:
# by verifying the content pertaining to cluster label 2 --> the below data looks similar to 'political' based on their
# key words like US tradeHolmes secures comeback victory Britain , Uganda bans Vagina Monologues Uganda etc

# that means cluster label 2 --> segregated the data of all 'political articles' into one cluster based on euclidian distances

In [33]:
df[df['pred_clusters']==2]

Unnamed: 0,Article,Preprocessed-Article,Vector,pred_clusters
85,Collins to compete in Birmingham\n\nWorld and ...,Collins compete Birmingham World Commonwealth ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2
101,El Guerrouj targets cross country\n\nDouble Ol...,El Guerrouj targets cross country Double Olymp...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2
106,Bekele sets sights on world mark\n\nOlympic 10...,Bekele sets sights world mark Olympic 10000m c...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0942512457, 0...",2
83,"Dibaba breaks 5,000m world record\n\nEthiopia'...",Dibaba breaks 5000m world record Ethiopia Tiru...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2
104,Holmes feted with further honour\n\nDouble Oly...,Holmes feted honour Double Olympic champion Ke...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2
87,Edwards tips Idowu for Euro gold\n\nWorld outd...,Edwards tips Idowu Euro gold World outdoor tri...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2
79,Isinbayeva heads for Birmingham\n\nOlympic pol...,Isinbayeva heads Birmingham Olympic pole vault...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2
73,Greene sets sights on world title\n\nMaurice G...,Greene sets sights world title Maurice Greene ...,"[0.0694752607, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0...",2
95,Hansen 'delays return until 2006'\n\nBritish t...,Hansen delays return 2006 British triple jumpe...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2
129,Women MPs reveal sexist taunts\n\nWomen MPs en...,Women MPs reveal sexist taunts Women MPs endur...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2


In [36]:
# by verifying the content pertaining to cluster label 3 --> the below data looks similar to 'entertainment' based on their
# key words like Harry Potter, Famed music director etc

# that means cluster label 3 --> segregated the data of all 'entertainment articles' into one cluster based on euclidian distances

In [37]:
df[df['pred_clusters']==3]

Unnamed: 0,Article,Preprocessed-Article,Vector,pred_clusters
8,Mutant book wins Guardian prize\n\nA book abou...,Mutant book wins Guardian prize A book evoluti...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3
32,New Harry Potter tops book chart\n\nHarry Pott...,New Harry Potter tops book chart Harry Potter ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3
10,Artists' secret postcards on sale\n\nPostcards...,Artists secret postcards sale Postcards artis...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3
122,PM apology over jailings\n\nTony Blair has apo...,PM apology jailings Tony Blair apologised two ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3
164,Virus poses as Christmas e-mail\n\nSecurity fi...,Virus poses Christmas email Security firms war...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3
26,West End to honour finest shows\n\nThe West En...,West End honour finest shows The West End hono...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3
31,Paraguay novel wins US book prize\n\nA novel s...,Paraguay novel wins US book prize A novel set ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3
22,Famed music director Viotti dies\n\nConductor ...,Famed music director Viotti dies Conductor Mar...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3
135,Drink remark 'acts as diversion'\n\nThe first ...,Drink remark acts diversion The first minister...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3
23,Hundreds vie for best film Oscar\n\nA total of...,Hundreds vie best film Oscar A total 267 films...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3


In [39]:
# by verifying the content pertaining to cluster label 4 --> the below data looks similar to 'sports' based on their
# key words like Collins banned landmark case Sprinter, UK Athletics agrees  etc

# that means cluster label 4 --> segregated the data of all 'sports articles' into one cluster based on euclidian distances

In [47]:
df[df['pred_clusters']==4]

Unnamed: 0,Article,Preprocessed-Article,Vector,pred_clusters
92,Collins banned in landmark case\n\nSprinter Mi...,Collins banned landmark case Sprinter Michelle...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
88,Kenya lift Chepkemei's suspension\n\nKenya's a...,Kenya lift Chepkemei suspension Kenya athletic...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
91,Greek pair attend drugs hearing\n\nGreek sprin...,Greek pair attend drugs hearing Greek sprinter...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
105,Greek sprinters suspended by IAAF\n\nGreek spr...,Greek sprinters suspended IAAF Greek sprinters...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
93,Chepkemei hit by big ban\n\nKenya's athletics ...,Chepkemei hit big ban Kenya athletics body sus...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
75,Verdict delay for Greek sprinters\n\nGreek ath...,Verdict delay Greek sprinters Greek athletics ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
100,Chepkemei joins Edinburgh line-up\n\nSusan Che...,Chepkemei joins Edinburgh lineup Susan Chepkem...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
103,London hope over Chepkemei\n\nLondon Marathon ...,London hope Chepkemei London Marathon organise...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
99,2004: An Irish Athletics Year\n\n2004 won't be...,2004 An Irish Athletics Year 2004 wo nt remem...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
77,McIlroy aiming for Madrid title\n\nNorthern Ir...,McIlroy aiming Madrid title Northern Ireland m...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4


In [48]:
# find of top 50 words in entertainment cluster & printing the last 50th word

In [49]:
entr_words=df[df['pred_clusters']==3]['Preprocessed-Article']
entr_words

8      Mutant book wins Guardian prize A book evoluti...
32     New Harry Potter tops book chart Harry Potter ...
10     Artists  secret postcards sale Postcards artis...
122    PM apology jailings Tony Blair apologised two ...
164    Virus poses Christmas email Security firms war...
26     West End honour finest shows The West End hono...
31     Paraguay novel wins US book prize A novel set ...
22     Famed music director Viotti dies Conductor Mar...
135    Drink remark acts diversion The first minister...
23     Hundreds vie best film Oscar A total 267 films...
47     Air passengers win new EU rights Air passenger...
5      Potter director signs Warner deal Harry Potter...
19     Fears raised ballet future Fewer children UK f...
33     Dirty Den demise seen 14m More 14 million peop...
9      Aviator creator  Oscars snub The man said got ...
24     Obituary  Dame Alicia Markova Dame Alicia Mark...
36     DVD review  Harry Potter Prisoner Azkaban This...
27     Public show Reynolds por

In [50]:
df_new=pd.DataFrame(entr_words)
df_new

Unnamed: 0,Preprocessed-Article
8,Mutant book wins Guardian prize A book evoluti...
32,New Harry Potter tops book chart Harry Potter ...
10,Artists secret postcards sale Postcards artis...
122,PM apology jailings Tony Blair apologised two ...
164,Virus poses Christmas email Security firms war...
26,West End honour finest shows The West End hono...
31,Paraguay novel wins US book prize A novel set ...
22,Famed music director Viotti dies Conductor Mar...
135,Drink remark acts diversion The first minister...
23,Hundreds vie best film Oscar A total 267 films...


In [51]:
df_new.shape[0]

49

In [53]:
list_words=[]
for i in range(0,df_new.shape[0]):
    for j in range(0,len(df_new.iloc[i][0].split())):
        list_words.append(df_new.iloc[i][0].split()[j])
df_words=pd.DataFrame(list_words)
pd.value_counts(df_words[0]).index[49]

'Potter'

In [54]:
len(list_words)

9967

# Applying K-Means clustering after dimension reduction (using PCA)

In [55]:
pca=PCA(n_components=100,random_state=10)
df_array=pca.fit_transform(df2)

In [56]:
df_array.shape

(179, 100)

In [57]:
df_array

array([[-0.05771842,  0.14985452, -0.01053767, ...,  0.08605876,
         0.07831081,  0.04537984],
       [ 0.0170382 ,  0.16960945,  0.06771936, ..., -0.02319928,
         0.07656179,  0.00696263],
       [-0.01857515,  0.13824031,  0.01168986, ...,  0.05999801,
        -0.15616272,  0.01660506],
       ...,
       [-0.10963543, -0.06614896,  0.08843217, ...,  0.19385836,
         0.03243798,  0.01408337],
       [-0.11525731, -0.06279326,  0.03127838, ..., -0.04308292,
        -0.12461803,  0.06273078],
       [-0.10463275, -0.02211632,  0.02936004, ...,  0.07321153,
         0.0088811 , -0.01454663]])

In [58]:
df_transform=pd.DataFrame(df_array)
df_transform

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-0.057718,0.149855,-0.010538,0.055879,0.073819,0.020924,-0.088327,0.062011,-0.017641,0.038405,...,-0.107370,-0.011354,0.050030,-0.058879,-0.147383,0.112925,0.085521,0.086059,0.078311,0.045380
1,0.017038,0.169609,0.067719,0.018516,-0.023745,0.018102,0.179873,-0.068721,0.008521,-0.006556,...,-0.022062,0.034130,-0.007859,0.061826,-0.069085,0.007019,-0.058245,-0.023199,0.076562,0.006963
2,-0.018575,0.138240,0.011690,0.023415,0.010054,0.021415,0.093179,0.020482,0.015884,0.022942,...,0.098790,0.079841,-0.088743,0.035277,-0.095565,0.054996,0.060393,0.059998,-0.156163,0.016605
3,0.015676,0.094938,0.031672,0.023826,-0.015752,0.001659,0.008865,-0.020402,0.007528,0.059069,...,-0.047586,-0.109064,-0.049550,-0.022281,-0.062218,0.036305,0.006689,-0.023484,-0.023109,0.100231
4,-0.002960,0.306794,0.091705,0.059190,0.027652,0.033571,0.216625,-0.093902,0.018171,-0.063117,...,-0.003844,0.034067,0.185425,0.022069,-0.052599,-0.170903,0.025400,0.091256,-0.029926,-0.045507
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174,-0.052144,-0.048251,-0.044466,-0.061148,-0.107064,-0.038765,-0.000328,0.220984,0.040850,-0.145379,...,0.057577,0.012355,0.062626,-0.042597,-0.162993,0.024575,0.072990,-0.055350,0.048900,0.009637
175,-0.094286,-0.051516,0.027805,-0.037150,-0.145541,-0.039250,0.029938,0.326823,0.085981,-0.208937,...,-0.016436,-0.023108,-0.101988,-0.056081,0.018524,-0.016387,-0.059413,0.015269,-0.081467,0.067881
176,-0.109635,-0.066149,0.088432,-0.028454,0.056653,0.011069,0.021241,0.035355,0.020381,0.016418,...,0.141905,-0.001271,0.173797,0.035171,-0.057151,0.026128,0.068489,0.193858,0.032438,0.014083
177,-0.115257,-0.062793,0.031278,-0.027699,-0.077921,-0.011479,0.017625,0.275407,0.089307,-0.178332,...,0.029818,0.004159,0.034597,0.110258,0.105112,-0.078621,0.070574,-0.043083,-0.124618,0.062731


In [59]:
kmeans_pca=KMeans(n_clusters=5,random_state=20)
kmeans_pca.fit(df_transform)
pred_clust_labels=kmeans_pca.predict(df_transform)

In [60]:
kmeans_pca.inertia_

110.42243775395272

In [61]:
pred_clust_labels

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
       2, 0, 2, 0, 0, 1, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0,
       1, 1, 2, 1, 2, 2, 1, 4, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 2, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1,
       1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 4,
       4, 4, 4, 4, 0, 1, 0, 4, 4, 0, 1, 0, 4, 4, 4, 4, 0, 4, 4, 4, 4, 0,
       1, 4, 1, 4, 4, 4, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0,
       2, 0, 2, 2, 2, 1, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2])

In [62]:
pred_clusters_pca=pd.DataFrame(pred_clust_labels)
cluster_counts_pca=pd.value_counts(pred_clusters_pca[0],ascending=False)

In [63]:
cluster_counts_pca

1    62
0    48
2    42
4    20
3     7
Name: 0, dtype: int64

In [386]:
#  max. articles in cluster1 

cluster_counts_pca.max()

62

In [65]:
# cluster1 has max. articles 

cluster_counts_pca.index[0]+1

2

In [66]:
df_pca=df.drop(columns=['pred_clusters'])

In [67]:
df_pca

Unnamed: 0,Article,Preprocessed-Article,Vector
161,Iran jails blogger for 14 years\n\nAn Iranian ...,Iran jails blogger 14 years An Iranian weblogg...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
166,UK gets official virus alert site\n\nA rapid a...,UK gets official virus alert site A rapid aler...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
76,O'Sullivan could run in Worlds\n\nSonia O'Sull...,OSullivan could run Worlds Sonia OSullivan ind...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
8,Mutant book wins Guardian prize\n\nA book abou...,Mutant book wins Guardian prize A book evoluti...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
158,Microsoft seeking spyware trojan\n\nMicrosoft ...,Microsoft seeking spyware trojan Microsoft inv...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...,...
148,UK net users leading TV downloads\n\nBritish T...,UK net users leading TV downloads British TV v...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
110,Crisis 'ahead in social sciences'\n\nA nationa...,Crisis ahead social sciences A national body d...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
115,Labour plans maternity pay rise\n\nMaternity p...,Labour plans maternity pay rise Maternity pay ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
53,Indonesians face fuel price rise\n\nIndonesia'...,Indonesians face fuel price rise Indonesia gov...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [68]:
df_pca['pred_clusters_pca']=pred_clusters_pca

In [69]:
df_pca

Unnamed: 0,Article,Preprocessed-Article,Vector,pred_clusters_pca
161,Iran jails blogger for 14 years\n\nAn Iranian ...,Iran jails blogger 14 years An Iranian weblogg...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
166,UK gets official virus alert site\n\nA rapid a...,UK gets official virus alert site A rapid aler...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2
76,O'Sullivan could run in Worlds\n\nSonia O'Sull...,OSullivan could run Worlds Sonia OSullivan ind...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
8,Mutant book wins Guardian prize\n\nA book abou...,Mutant book wins Guardian prize A book evoluti...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
158,Microsoft seeking spyware trojan\n\nMicrosoft ...,Microsoft seeking spyware trojan Microsoft inv...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2
...,...,...,...,...
148,UK net users leading TV downloads\n\nBritish T...,UK net users leading TV downloads British TV v...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2
110,Crisis 'ahead in social sciences'\n\nA nationa...,Crisis ahead social sciences A national body d...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
115,Labour plans maternity pay rise\n\nMaternity p...,Labour plans maternity pay rise Maternity pay ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1
53,Indonesians face fuel price rise\n\nIndonesia'...,Indonesians face fuel price rise Indonesia gov...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1


In [70]:
entr_words_pca=df_pca[df_pca['pred_clusters_pca']==2]['Preprocessed-Article']
df_new_pca=pd.DataFrame(entr_words_pca)


In [71]:
list_words_pca=[]
for i in range(0,df_new_pca.shape[0]):
    for j in range(0,len(df_new_pca.iloc[i][0].split())):
        list_words_pca.append(df_new_pca.iloc[i][0].split()[j])
df_words_pca=pd.DataFrame(list_words_pca)
last50_word_after_PCA=pd.value_counts(df_words_pca[0]).index[49]

In [72]:
list_words_pca

['UK',
 'gets',
 'official',
 'virus',
 'alert',
 'site',
 'A',
 'rapid',
 'alerting',
 'service',
 'tells',
 'home',
 'computer',
 'users',
 'serious',
 'internet',
 'security',
 'problems',
 'launched',
 'UK',
 'government',
 'The',
 'service',
 'IT',
 'Safe',
 'issue',
 'damaging',
 'viruses',
 'software',
 'vulnerabilities',
 'weaknesses',
 'devices',
 'mobile',
 'phones',
 'Alerts',
 'tell',
 'people',
 'threats',
 'affect',
 'avoid',
 'trouble',
 'protect',
 'The',
 'service',
 'free',
 'sign',
 'get',
 'email',
 'text',
 'alerts',
 'The',
 'scheme',
 'aimed',
 'home',
 'users',
 'small',
 'businesses',
 'The',
 'government',
 'estimates',
 'issue',
 'security',
 'alerts',
 'six',
 '10',
 'times',
 'year',
 'based',
 'previous',
 'experience',
 'virus',
 'outbreaks',
 'There',
 'clear',
 'need',
 'easytounderstand',
 'simple',
 'independent',
 'advice',
 'nontechnically',
 'minded',
 'people',
 'use',
 'computers',
 'either',
 'home',
 'work',
 'said',
 'Home',
 'Office',
 'Minis

In [73]:
len(list_words_pca)

10727

In [74]:
last50_word_after_PCA

'last'

Conclusion:

By using K-Means we have clustered the news articles based on the nearest distance philosophy. 
Apart from this we can extend K-Means clustering to any area and it will cluster all relevent items.