### Introduction
In this kernel, I use TF-IDF to vectorize the article data and cluster the articles. Then I make article recommendations.

**1. Input data**

In [1]:
import numpy as np
import pandas as pd 
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
text = pd.read_csv('Final.csv')

In [3]:
text = text.drop(['Descriptions', 'Authors', 'Published_Dates', 'Keywords', 'Summaries', 'Publication', 'category', 'Source_URLs'], axis=1).dropna().reset_index(drop=True)
text.head()

Unnamed: 0,Headlines,Articles
0,Object : Alcoholics Anonymous Badge,The contributors father was a member of Alcoho...
1,Every university 'needs' Alcoholics Anonymous ...,Image copyright Getty Images Alcoholics Anonym...
2,What happens in an Alcoholics Anonymous meeting?,Image caption The true nature of alcoholism is...
3,The many groups that have copied Alcoholics An...,Image copyright Other Image caption An AA meet...
4,Should there be a word for an 'almost alcoholic'?,Image copyright Thinkstock Everybody thinks th...


**2. Using TF-IDF to vectorize the articles**

**Introduction** to TF-IDF
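
As a brief illustration of what the vectorizer computes, the sketch below reproduces scikit-learn's default TF-IDF weighting by hand on a toy corpus. The three sentences and the names `docs`, `counts`, `manual`, and `auto` are made up for illustration; the formula assumes the default settings (`smooth_idf=True`, `norm='l2'`, raw term counts).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration only
docs = [
    "alcohol support group meeting",
    "police report on alcohol sales",
    "students attend school meeting",
]

# Raw term counts (term frequency)
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency, scikit-learn's default
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale every row to unit L2 length
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

# The library result should match the manual computation
auto = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, auto))
```

Terms that occur in few documents get a larger `idf`, so they contribute more to a document's vector than terms that occur everywhere.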

In [4]:
text_content = text['Articles']
vector = TfidfVectorizer(stop_words = 'english')
tfidf = vector.fit_transform(text_content)
text_content.head()

0    The contributors father was a member of Alcoho...
1    Image copyright Getty Images Alcoholics Anonym...
2    Image caption The true nature of alcoholism is...
3    Image copyright Other Image caption An AA meet...
4    Image copyright Thinkstock Everybody thinks th...
Name: Articles, dtype: object

**3. Using the elbow method to choose the number of clusters**

Introduction to **elbow method**
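
The elbow method plots the within-cluster sum of squared errors (SSE) against the number of clusters and looks for the point where adding clusters stops paying off. The sketch below, on synthetic data rather than the articles, checks that scikit-learn's `inertia_` is this SSE. It uses plain `KMeans` and `make_blobs` to keep the example small and deterministic; both are assumptions of the sketch, not part of this kernel.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

sse = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # SSE by hand: squared distance from each point to its assigned center
    manual = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
    assert np.isclose(manual, km.inertia_)
    sse.append(km.inertia_)

# Drop in SSE gained by each additional cluster; the elbow is where it flattens
drops = -np.diff(sse)
print(np.round(drops, 1))
```

Because SSE keeps shrinking as `k` grows, the useful signal is the size of each drop rather than the SSE value itself.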

In [5]:
import matplotlib.pyplot as plt

K = range(1,15)
SSE = []
for k in K:
    # Fit one model per candidate k and record its within-cluster sum of squared errors
    kmeans = MiniBatchKMeans(n_clusters = k,batch_size = 300)
    kmeans.fit(tfidf)
    SSE.append(kmeans.inertia_)

plt.plot(K,SSE,'bx-')
plt.title('Elbow Method')
plt.xlabel('cluster numbers')
plt.ylabel('SSE')
plt.show()

<Figure size 640x480 with 1 Axes>

**5. Using MiniBatchKMeans to cluster**

**Comparison of the K-Means and MiniBatchKMeans clustering algorithms**
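
As a rough illustration of the trade-off, the sketch below fits both estimators to the same synthetic data and reports runtime and SSE. `MiniBatchKMeans` updates the centers from small random batches instead of the full dataset, so it is usually faster on large inputs at the cost of a somewhat higher SSE. The data, sizes, and variable names are assumptions for illustration, and timings will vary by machine.

```python
import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data (illustrative only), not the article vectors
X, _ = make_blobs(n_samples=5000, centers=4, n_features=10, random_state=0)

models = [
    ("KMeans", KMeans(n_clusters=4, n_init=3, random_state=0)),
    ("MiniBatchKMeans",
     MiniBatchKMeans(n_clusters=4, batch_size=300, n_init=3, random_state=0)),
]

results = {}
for name, model in models:
    start = time.perf_counter()
    model.fit(X)
    elapsed = time.perf_counter() - start
    # inertia_ is the within-cluster sum of squared distances (SSE)
    results[name] = (elapsed, model.inertia_)
    print(f"{name}: {elapsed:.3f}s, SSE={model.inertia_:.1f}")
```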

In [6]:
k = 4
kmeans = MiniBatchKMeans(n_clusters = k)
kmeans.fit(tfidf)
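# Term indices for each cluster center, sorted by descending TF-IDF weight
centers = kmeans.cluster_centers_.argsort()[:,::-1]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vector.get_feature_names_out().tolist()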

for i in range(0,k):
    word_list=[]
    print("cluster%d:"% i)
    for j in centers[i,:15]:
        word_list.append(terms[j])
    print(word_list) 

cluster0:
['bbc', 'register', 'sign', 'hymn', 'liquor', 'microscopic', 'oily', 'particles', 'inhale', 'gulf', 'shields', 'iraq', 'beloved', '08', 'fish']
cluster1:
['alcohol', 'said', 'people', 'drinking', 'says', 'drink', 'alcoholics', 'aa', 'help', 'addiction', 'image', 'anonymous', 'like', 'life', 'women']
cluster2:
['chandigarh', 'sector', 'singh', 'students', 'tribune', 'police', 'mr', 'today', 'said', 'mohali', 'rs', 'government', 'punjab', 'administration', 'haryana']
cluster3:
['bmc', 'classrooms', 'said', 'aawc', 'civic', 'municipal', 'ngos', 'œwe', 'eviction', 'organisations', 'school', 'œthey', 'education', 'prostitution', 'keller']


**6. Article Recommendation**

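Since the articles are already vectorized with TF-IDF, we can compare two articles by comparing their vectors. `TfidfVectorizer` normalizes each vector to unit length by default, so the inner product of two vectors is their cosine similarity: the larger it is, the more vocabulary the two articles share.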
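
A minimal sketch of this comparison on a toy corpus: it checks that the inner products of the TF-IDF rows match cosine similarity under the vectorizer's default L2 normalization, and ranks neighbours with a hypothetical helper `recommend` that skips the query article itself. The corpus and the helper name are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy headlines, invented for illustration only
docs = [
    "alcoholics anonymous meeting helps people stop drinking",
    "people share stories at an alcoholics anonymous meeting",
    "police arrest students in the city sector",
    "helpline offers confidential advice on alcohol",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

def recommend(index, n=2):
    """Return indices of the n rows most similar to row `index` (hypothetical helper)."""
    # Sparse row times transposed matrix gives one similarity score per document
    scores = (tfidf[index] @ tfidf.T).toarray().ravel()
    scores[index] = -np.inf  # never recommend the query itself
    return np.argsort(scores)[::-1][:n]

# With unit-length rows, dot products and cosine similarities coincide
dots = (tfidf @ tfidf.T).toarray()
print(np.allclose(dots, cosine_similarity(tfidf)))
print(recommend(0))
```

Masking the query's own score avoids having to guess how many top entries to skip when slicing the sorted indices.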

In [7]:
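# Inner product of article 0 with every article; sparse .dot() is used because np.dot does not reliably support sparse matrices
similarity = tfidf[0].dot(tfidf.T)
x = similarity.toarray()[0]
print(text['Headlines'][0])
print('\nsimilar papers:')
# argsort is ascending, so [-7:-2] selects the 3rd to 7th most similar articles; the top score normally belongs to the query article itself
print('\n'.join(text['Headlines'].loc[np.argsort(x)[-7:-2]]))

Object : Alcoholics Anonymous Badge

similar papers: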
AA: America's Gift to the World
AA: America's Gift to the World
Events in Staffordshire for November 2007
Alcohol misuse: help and advice
AA: America's Gift to the World


**7. Topic modeling**

Here I use NMF instead of LDA because LDA is time-consuming to compute.

The major topics cover drinking and addiction, a drama or comedy show, local news from Chandigarh, and alcohol helpline information. The first and third are quite similar to clusters 1 and 2 in our cluster result.

In [8]:
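from sklearn.decomposition import NMF

n_topics = 4
nmf = NMF(n_components = n_topics)
nmf.fit(tfidf)
# Sort term indices once, by ascending weight within each topic
order = nmf.components_.argsort()
for i in range(n_topics):
    print("Topic%d:"% i)
    # Note: [-16:-1] lists 15 terms in ascending weight and leaves out the single heaviest term
    word_list = [terms[j] for j in order[i,-16:-1]]
    print(word_list)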

Topic0:
['help', 'alcoholics', 'like', 'alcoholic', 'women', 'life', 'years', 'aa', 'image', 'addiction', 'says', 'drink', 'said', 'people', 'drinking']
Topic1:
['fall', 'drama', 'alcoholics', 'laugh', 'comedy', 'recovery', 'love', 'rebecca', 'weekly', 'sue', 'johnston', 'kaye', 'hannah', 'jackson', 'pete']
Topic2:
['city', 'school', 'mr', 'government', 'minister', 'youth', 'mohali', 'august', 'tribune', 'singh', 'said', 'today', 'students', 'police', 'sector']
Topic3:
['0300', 'provides', 'help', 'http', 'https', 'confidential', 'services', 'information', 'alcohol', 'advice', 'phone', 'support', 'helpline', 'uk', 'org']
