## NIPS Conference Papers 1987-2015

+ Clustering research papers based on their keywords (K-Means clustering)

In [1]:
# importing necessary libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [1]:
# pd.read_csv already returns a DataFrame, so no extra wrapping is needed
df = pd.read_csv("../input/nips-conference-papers-19872015/NIPS_1987-2015.csv")
df.shape

In [1]:
df.head()

> To cluster the papers I use K-Means, since it is one of the simplest and best-known clustering algorithms and a good fit for this case. To choose the number of clusters I use the Silhouette method, one of several available options.

> <U>__Silhouette__</U> value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to +1. A high value is desirable and indicates that the point is well matched to its own cluster and poorly matched to the neighbouring ones. <U>_The optimal number of clusters k is the one that maximizes the average silhouette over a range of candidate values of k._</U>
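
> To make the definition concrete, here is a minimal pure-Python sketch of the usual formula s(i) = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the smallest mean distance to any other cluster. The function name `silhouette_values` and the toy data are my own assumptions for illustration; the notebook itself relies on `silhouette_score` from scikit-learn.

```python
from math import dist

def silhouette_values(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every point.

    Illustrative sketch only; it assumes at least two distinct cluster labels.
    """
    clusters = sorted(set(labels))
    values = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            # p is alone in its cluster: s(i) is set to 0 by convention.
            values.append(0.0)
            continue
        a = mean_dist[own]                                    # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)  # separation
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Hypothetical toy data: two tight, well-separated groups on a line.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
toy_scores = silhouette_values(toy_points, toy_labels)
mean_score = sum(toy_scores) / len(toy_scores)
print(round(mean_score, 3))
```

> For the well-separated toy groups the mean comes out close to +1; assigning the same points to mixed-up clusters pushes it towards or below 0.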

In [1]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


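sil = []          # average silhouette score for each candidate k
K = range(2, 11)  # candidate numbers of clusters: 2..10

# Drop the word column and transpose so that each row is one paper
X = df.drop(["Unnamed: 0"], axis=1).T

for k in K:
    model = KMeans(n_clusters=k, max_iter=200).fit(X)
    labels = model.labels_
    sil.append(silhouette_score(X, labels, metric='euclidean'))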


In [1]:
plt.plot(K, sil, '.-')
plt.title("Silhouette Plot")
plt.xlabel("No of clusters")
plt.ylabel("Silhouette Score")
plt.show()

> The Silhouette score peaks at __k = 2__, but __k = 4__ also scores appreciably well, so let's try both.
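
> Reading the best candidates off the plot can also be done in code. The sketch below ranks the candidate k values by their average silhouette; the helper `rank_k_by_silhouette` and the example scores are hypothetical and are not the values from the plot above.

```python
def rank_k_by_silhouette(ks, scores, top=2):
    """Return the `top` (k, score) pairs with the highest average silhouette."""
    ranked = sorted(zip(ks, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top]

# Hypothetical scores for illustration only; NOT the notebook's results.
example_ks = range(2, 7)
example_scores = [0.41, 0.22, 0.35, 0.18, 0.15]
candidates = rank_k_by_silhouette(example_ks, example_scores)
print(candidates)
```

> With the notebook's own variables the call would presumably be `rank_k_by_silhouette(K, sil)`.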

In [1]:
model_1 = KMeans(n_clusters=2, max_iter=200).fit(X)

In [1]:
labels = model_1.labels_
clusterd_df = pd.DataFrame(list(zip(df.columns[1:],labels)),columns=['title','cluster'])
clusterd_df.head(10)

In [1]:
plt.figure(figsize=(3,5))
ax = sns.countplot(x='cluster', data = clusterd_df) 

# Annotate each bar with its count (bar heights are floats, so cast to int)
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x()+0.1, p.get_height()+50))

***

In [1]:
model_2 = KMeans(n_clusters=4, max_iter=200).fit(X)

In [1]:
labels = model_2.labels_
clusterd_df_ = pd.DataFrame(list(zip(df.columns[1:],labels)),columns=['title','cluster'])
clusterd_df_.head(10)

In [1]:
plt.figure(figsize=(7,9))
ax = sns.countplot(x='cluster', data = clusterd_df_) 

# Annotate each bar with its count (bar heights are floats, so cast to int)
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x()+0.1, p.get_height()+50))
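
> The numbers annotated on the bars are simply the cluster sizes. As a plot-free cross-check, they can be tallied directly from the label array; this is a minimal sketch in which the helper `cluster_sizes` and the example labels are hypothetical.

```python
from collections import Counter

def cluster_sizes(labels):
    """Map each cluster label to the number of points assigned to it."""
    # int() also accepts NumPy integer labels such as those in KMeans.labels_
    return dict(sorted(Counter(int(label) for label in labels).items()))

# Hypothetical labels for illustration only.
example_labels = [0, 1, 1, 3, 1, 0, 2]
sizes = cluster_sizes(example_labels)
print(sizes)
```

> On the fitted models this would be called as `cluster_sizes(model_1.labels_)` or `cluster_sizes(model_2.labels_)`.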