<a href="https://colab.research.google.com/github/Nikunjbansal99/ClusteringNIPSConferencePapers1987-2015/blob/main/ClusteringNIPSConferencePapers1987_2015.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Methodology**


*   **Import Some Basic Libraries**
*   **Import Data**
*   **Perform Descriptive Analysis on the dataset**
    *   **Data Description**
    *   **Check null/NAN values**
*   **Plotting Dendrograms at different truncation levels**
*   **Train-Test Splitting**
*   **Apply Dimensionality Reduction Using PCA**
*   **Implementing different clustering algorithms for different numbers of clusters and visualizing the results**
*   **Initialize the Selected Model**
    *   **Get labels of Training Data**
    *   **Get labels of Testing Data**
*   **Evaluate the model on training data using the Silhouette, Calinski-Harabasz and Davies-Bouldin scores**
*   **Evaluate the model on testing data using the Silhouette, Calinski-Harabasz and Davies-Bouldin scores**

# **Importing Some Basic Libraries**

In [None]:
%pip install scikit-learn-extra

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.utils import check_random_state
import sys, os
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import MeanShift
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report, confusion_matrix
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score

# **Importing Data**

In [None]:
input_data_dir = "../input/nips-conference-papers-19872015"
NIPS_full_df = pd.read_csv(os.path.join(input_data_dir, "NIPS_1987-2015.csv"))

# **Descriptive Analysis of the dataset**

In [None]:
print("Size of NIPS DataFrame     : {}".format(NIPS_full_df.shape))

## **Data Description**

In [None]:
print("Total Number of Papers included in NIPS DataFrame     : {}".format(len(NIPS_full_df.columns)-1))

In [None]:
NIPS_full_df.info()

In [None]:
NIPS_full_df.describe().T

In [None]:
NIPS_full_df = NIPS_full_df.transpose()

In [None]:
NIPS_full_df.head()

In [None]:
new_header = NIPS_full_df.iloc[0] #grab the first row for the header
NIPS_full_df = NIPS_full_df[1:] #take the data less the header row
NIPS_full_df.columns = new_header #set the header row as the df header

In [None]:
NIPS_full_df.head()
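
The transpose-and-promote-header step can be illustrated on a toy table. This is a minimal sketch with made-up words and paper IDs (`toy`, `1987_1` and so on are hypothetical), assuming the layout implied above: the first column holds the words and every other column is one paper. The final `astype(int)` is an addition that is not in the notebook; transposing a mixed-type frame leaves `object` values, so casting back to numbers is usually worthwhile.

```python
import pandas as pd

# Hypothetical toy data: one row per word, one column per paper.
toy = pd.DataFrame({
    "word": ["neural", "kernel", "bayes"],
    "1987_1": [3, 0, 1],
    "1987_2": [0, 2, 5],
})

flipped = toy.transpose()      # rows are now "word", "1987_1", "1987_2"
header = flipped.iloc[0]       # first row holds the word names
papers = flipped[1:].copy()    # keep only the data rows
papers.columns = header        # words become the column labels
papers = papers.astype(int)    # transpose left object values; restore integers

print(papers)
```

After this step each row is a paper and each column a word count, which is the shape the clustering code expects.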

## **NULL VALUES:**

In [None]:
NIPS_full_df.isna().sum() 

In [None]:
print("Total Number of Missing Values in NIPS DataFrame     : {}".format(NIPS_full_df.isna().sum().sum()))   

# **Plotting Dendrograms:**

In [None]:
def dendrogramPlot(model, **kwargs):
    # Create the linkage matrix and then plot the dendrogram.
    # First, count the samples under each node.
    counts = np.zeros(model.children_.shape[0])

    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    linkage_matrix = np.column_stack([model.children_, model.distances_,counts]).astype(float)
    dendrogram(linkage_matrix, **kwargs)                                        # Plot the corresponding dendrogram 
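
To see what the `counts` column holds, the loop can be traced on a tiny hand-written merge tree. The sketch below uses NumPy only; the `children` array is a made-up example in the same format as `model.children_` (indices below `n_samples` are leaves, and index `n_samples + i` refers to the cluster formed at merge `i`). The helper name `subtree_counts` is hypothetical.

```python
import numpy as np

def subtree_counts(children, n_samples):
    """Number of original samples under each merge node."""
    counts = np.zeros(children.shape[0])
    for i, merge in enumerate(children):
        total = 0
        for child_idx in merge:
            if child_idx < n_samples:
                total += 1                               # leaf node
            else:
                total += counts[child_idx - n_samples]   # cluster from an earlier merge
        counts[i] = total
    return counts

# Hypothetical tree over 4 samples: merge (0, 1), then (2, 3), then both clusters.
children = np.array([[0, 1], [2, 3], [4, 5]])
counts = subtree_counts(children, n_samples=4)
print(counts)
```

The last entry equals the total number of samples, as expected for the root of the tree.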

In [None]:
ClusteringModel = AgglomerativeClustering(distance_threshold=0, n_clusters=None)# setting distance_threshold=0 ensures we compute the full tree.

In [None]:
ClusteringModel = ClusteringModel.fit(NIPS_full_df)

In [None]:
plt.figure(figsize=(25,10))
plt.title('Clustering Dendrogram for 10 levels')
dendrogramPlot(ClusteringModel, p=10, truncate_mode='level')                    
plt.xlabel("Number of points in node")
plt.show()

**The plot is too dense, so we try a dendrogram truncated to 8 levels.**

In [None]:
plt.figure(figsize=(25,10))
plt.title('Clustering Dendrogram for 8 levels')
dendrogramPlot(ClusteringModel, p=8, truncate_mode='level')                    
plt.xlabel("Number of points in node")
plt.show()

**Still too dense, so we try a dendrogram truncated to 7 levels.**

In [None]:
plt.figure(figsize=(25,10))
plt.title('Clustering Dendrogram for 7 levels')
dendrogramPlot(ClusteringModel, p=7, truncate_mode='level')                    
plt.xlabel("Number of points in node")
plt.show()

**Still too dense, so we try a dendrogram truncated to 6 levels.**

In [None]:
plt.figure(figsize=(25,10))
plt.title('Clustering Dendrogram for 6 levels')
dendrogramPlot(ClusteringModel, p=6, truncate_mode='level')                    
plt.xlabel("Number of points in node")
plt.show()

In [None]:
plt.figure(figsize=(25,10))
plt.title('Clustering Dendrogram for 5 levels')
dendrogramPlot(ClusteringModel, p=5, truncate_mode='level')                    
plt.xlabel("Number of points in node")
plt.show()

In [None]:
plt.figure(figsize=(25,10))
plt.title('Clustering Dendrogram for 4 levels')
dendrogramPlot(ClusteringModel, p=4, truncate_mode='level')                    
plt.xlabel("Number of points in node")
plt.show()

In [None]:
plt.figure(figsize=(25,10))
plt.title('Clustering Dendrogram for 3 levels')
dendrogramPlot(ClusteringModel, p=3, truncate_mode='level')                    
plt.xlabel("Number of points in node")
plt.show()

In [None]:
plt.figure(figsize=(25,10))
plt.title('Clustering Dendrogram for 2 levels')
dendrogramPlot(ClusteringModel, p=2, truncate_mode='level')                    
plt.xlabel("Number of points in node")
plt.show()

# **Train-Test Splitting:**

In [None]:
# Splitting NIPS_full_df into 70% and 30% to construct training dataframe and testing dataframe respectively.
traindf, testdf = train_test_split(NIPS_full_df, test_size=0.3, random_state=11)

In [None]:
print("Size of Training Dataframe       : {}".format(traindf.shape))
print("Size of Testing Dataframe      : {}".format(testdf.shape))
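
Conceptually, the split shuffles the row indices and holds out 30% of them. The following NumPy-only sketch illustrates the idea; it is not scikit-learn's implementation, and the array and seed are made up.

```python
import numpy as np

rng = np.random.default_rng(11)
data = np.arange(20).reshape(10, 2)        # hypothetical dataset with 10 rows

indices = rng.permutation(len(data))       # shuffled row indices
n_test = int(np.ceil(0.3 * len(data)))     # 30% of the rows, rounded up
test_idx, train_idx = indices[:n_test], indices[n_test:]

train, test = data[train_idx], data[test_idx]
print(train.shape, test.shape)
```

Every row ends up in exactly one of the two sets, so the test rows are never seen while fitting.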

In [None]:
traindf.head()

In [None]:
testdf.head()

# **Applying Dimensionality Reduction:**

In [None]:
# Initializing Principal Component Analysis(PCA)
PCA_method = PCA(n_components=2)

In [None]:
# Fit PCA on the training data only, then apply the same projection to the test data
traindf= PCA_method.fit_transform(traindf)
testdf = PCA_method.transform(testdf)
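
The reason for calling `fit_transform` on the training set but only `transform` on the test set is that the projection (the mean and the principal axes) should be learned from training data alone. The sketch below reproduces that idea with a plain SVD in NumPy on random, made-up data; it illustrates the principle and is not scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 5))     # hypothetical training matrix
test = rng.normal(size=(20, 5))      # hypothetical test matrix

# "fit": learn the mean and the top-2 principal axes from the training data only
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:2]                  # shape (2, n_features)

# "transform": project both sets with the parameters learned from train
train_2d = (train - mean) @ components.T
test_2d = (test - mean) @ components.T

print(train_2d.shape, test_2d.shape)
```

Note that the test data is centered with the training mean, not its own, so both sets live in the same 2-D coordinate system.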

# **Visualizing Various Algorithms:**

In [None]:
np.random.seed(42)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = 0.02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min, x_max = traindf[:, 0].min() - 1, traindf[:, 0].max() + 1
y_min, y_max = traindf[:, 1].min() - 1, traindf[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
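
The mesh is later flattened into a list of (x, y) points so that each grid cell can be assigned a cluster, and the labels are then reshaped back for plotting. A small sketch with a coarse, made-up grid shows the shapes involved; the thresholding line is only a stand-in for a real model's `predict`.

```python
import numpy as np

# Coarse hypothetical grid: x in [0, 1), y in [0, 1.5), step 0.5
xx, yy = np.meshgrid(np.arange(0, 1, 0.5), np.arange(0, 1.5, 0.5))

grid_points = np.c_[xx.ravel(), yy.ravel()]   # one (x, y) row per mesh cell

# Stand-in for model.predict: label each point by its side of the line x = 0.25
labels = (grid_points[:, 0] > 0.25).astype(int)
label_image = labels.reshape(xx.shape)        # back to grid shape for imshow

print(xx.shape, grid_points.shape)
print(label_image)
```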

In [None]:
plt.figure(figsize=(30,30))
plt.clf()

plt.suptitle("Comparing multiple clustering algorithms with 8 clusters", fontsize=15)

choosed_models = [(KMedoids(metric="manhattan", n_clusters=8),"KMedoids (Manhattan)",),
                 (KMedoids(metric="euclidean", n_clusters=8),"KMedoids (Euclidean)",),
                 (KMedoids(metric="cosine", n_clusters=8), "KMedoids (Cosine)"),
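                 # Note: KMeans already defaults to init='k-means++', so the two
                 # entries below differ only in their random initialization.
                 (KMeans(n_clusters=8), "KMeans"),
                 (KMeans(n_clusters=8, init='k-means++'),"k-means++")]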

plot_rows = 3
plot_cols = 2

for i, (i_model, description) in enumerate(choosed_models):
    i_model.fit(traindf)
    # Predict a cluster for every mesh point, then reshape back to the grid
    Y = i_model.predict(np.c_[xx.ravel(), yy.ravel()])
    Y = Y.reshape(xx.shape)
    plt.subplot(plot_rows, plot_cols, i + 1)                                    # subplot(nrows, ncols, index)
    # Put the result into a color plot
    plt.imshow(Y, interpolation="nearest", extent=(xx.min(), xx.max(), yy.min(), yy.max()), cmap=plt.cm.Paired, aspect="auto", origin="lower",)
    plt.plot(traindf[:, 0], traindf[:, 1], "k.", markersize=2, alpha=0.3)

    # Mark the cluster centers with a white X
    centroids = i_model.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=169, linewidths=3, color="w", zorder=10,)
    plt.title(description)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
plt.show()

In [None]:
plt.figure(figsize=(30,30))
plt.clf()

plt.suptitle("Comparing multiple clustering algorithms with 16 clusters", fontsize=15)

choosed_models = [(KMedoids(metric="manhattan", n_clusters=16),"KMedoids (Manhattan)",),
                 (KMedoids(metric="euclidean", n_clusters=16),"KMedoids (Euclidean)",),
                 (KMedoids(metric="cosine", n_clusters=16), "KMedoids (Cosine)"),
                 (KMeans(n_clusters=16), "KMeans"),
                 (KMeans(n_clusters=16, init='k-means++'),"k-means++")]

plot_rows = 3
plot_cols = 2

for i, (i_model, description) in enumerate(choosed_models):
    i_model.fit(traindf)
    # Predict a cluster for every mesh point, then reshape back to the grid
    Y = i_model.predict(np.c_[xx.ravel(), yy.ravel()])
    Y = Y.reshape(xx.shape)
    plt.subplot(plot_rows, plot_cols, i + 1)                                    # subplot(nrows, ncols, index)
    # Put the result into a color plot
    plt.imshow(Y, interpolation="nearest", extent=(xx.min(), xx.max(), yy.min(), yy.max()), cmap=plt.cm.Paired, aspect="auto", origin="lower",)
    plt.plot(traindf[:, 0], traindf[:, 1], "k.", markersize=2, alpha=0.3)

    # Mark the cluster centers with a white X
    centroids = i_model.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=169, linewidths=3, color="w", zorder=10,)
    plt.title(description)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
plt.show()

In [None]:
plt.figure(figsize=(30,30))
plt.clf()

plt.suptitle("Comparing multiple clustering algorithms with 32 clusters", fontsize=15)

choosed_models = [(KMedoids(metric="manhattan", n_clusters=32),"KMedoids (Manhattan)",),
                 (KMedoids(metric="euclidean", n_clusters=32),"KMedoids (Euclidean)",),
                 (KMedoids(metric="cosine", n_clusters=32), "KMedoids (Cosine)"),
                 (KMeans(n_clusters=32), "KMeans"),
                 (KMeans(n_clusters=32, init='k-means++'),"k-means++")]

plot_rows = 3
plot_cols = 2

for i, (i_model, description) in enumerate(choosed_models):
    i_model.fit(traindf)
    # Predict a cluster for every mesh point, then reshape back to the grid
    Y = i_model.predict(np.c_[xx.ravel(), yy.ravel()])
    Y = Y.reshape(xx.shape)
    plt.subplot(plot_rows, plot_cols, i + 1)                                    # subplot(nrows, ncols, index)
    # Put the result into a color plot
    plt.imshow(Y, interpolation="nearest", extent=(xx.min(), xx.max(), yy.min(), yy.max()), cmap=plt.cm.Paired, aspect="auto", origin="lower",)
    plt.plot(traindf[:, 0], traindf[:, 1], "k.", markersize=2, alpha=0.3)

    # Mark the cluster centers with a white X
    centroids = i_model.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=169, linewidths=3, color="w", zorder=10,)
    plt.title(description)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
plt.show()

In [None]:
plt.figure(figsize=(30,30))
plt.clf()

plt.suptitle("Comparing multiple clustering algorithms with 64 clusters", fontsize=15)

choosed_models = [(KMedoids(metric="manhattan", n_clusters=64),"KMedoids (Manhattan)",),
                 (KMedoids(metric="euclidean", n_clusters=64),"KMedoids (Euclidean)",),
                 (KMedoids(metric="cosine", n_clusters=64), "KMedoids (Cosine)"),
                 (KMeans(n_clusters=64), "KMeans"),
                 (KMeans(n_clusters=64, init='k-means++'),"k-means++")]

plot_rows = 3
plot_cols = 2

for i, (i_model, description) in enumerate(choosed_models):
    i_model.fit(traindf)
    # Predict a cluster for every mesh point, then reshape back to the grid
    Y = i_model.predict(np.c_[xx.ravel(), yy.ravel()])
    Y = Y.reshape(xx.shape)
    plt.subplot(plot_rows, plot_cols, i + 1)                                    # subplot(nrows, ncols, index)
    # Put the result into a color plot
    plt.imshow(Y, interpolation="nearest", extent=(xx.min(), xx.max(), yy.min(), yy.max()), cmap=plt.cm.Paired, aspect="auto", origin="lower",)
    plt.plot(traindf[:, 0], traindf[:, 1], "k.", markersize=2, alpha=0.3)

    # Mark the cluster centers with a white X
    centroids = i_model.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=169, linewidths=3, color="w", zorder=10,)
    plt.title(description)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
plt.show()

# **Model Selected:**

In [None]:
KMeansplusModel = KMeans(n_clusters=32, init='k-means++')               # Initialize KMeans++ Model

In [None]:
KMeansplusModel = KMeansplusModel.fit(traindf)                          # Fitting traindf to the Model

In [None]:
CenterOfClusters = KMeansplusModel.cluster_centers_
print('Centers Of Clusters: ')
print(CenterOfClusters)

In [None]:
traindf_labels = KMeansplusModel.labels_
print("traindf labels: ",traindf_labels)

### **Perform Prediction on Testing DataFrame:**

In [None]:
testdf_labels = KMeansplusModel.predict(testdf)
print("testdf labels: ",testdf_labels)
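
For k-means, `predict` assigns each new point to the nearest learned cluster center; the centers themselves are not updated. A minimal NumPy sketch of that assignment rule, using made-up centers and points:

```python
import numpy as np

centers = np.array([[0.0, 0.0], [10.0, 10.0]])                  # hypothetical cluster centers
new_points = np.array([[1.0, -1.0], [9.0, 8.0], [4.0, 4.0]])    # hypothetical unseen points

# Squared Euclidean distance from every point to every center,
# shape (n_points, n_centers)
dists = ((new_points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
labels = dists.argmin(axis=1)                                   # index of the closest center

print(labels)
```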

# **Evaluation**

## **On Training Data:**

In [None]:
print("Silhouette Score : ",silhouette_score(traindf, traindf_labels, metric='euclidean'))
print("Calinski Harabasz Score : ",calinski_harabasz_score(traindf, traindf_labels))
print("Davies Bouldin Score : ",davies_bouldin_score(traindf, traindf_labels))

## **On Testing Data:**

In [None]:
print("Silhouette Score : ",silhouette_score(testdf, testdf_labels, metric='euclidean'))
print("Calinski Harabasz Score : ",calinski_harabasz_score(testdf, testdf_labels))
print("Davies Bouldin Score : ",davies_bouldin_score(testdf, testdf_labels))

**We obtained a Silhouette Score of 0.332 on the training data and 0.300 on the testing data. The score ranges from -1 to 1, so these values indicate a reasonable clustering in which some clusters still overlap.**
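
For reference, the silhouette value of a sample is `(b - a) / max(a, b)`, where `a` is its mean distance to the other members of its own cluster and `b` is its mean distance to the members of the nearest other cluster; the reported score is the mean over all samples. Below is a from-scratch NumPy sketch on made-up points (the helper name `silhouette` is hypothetical, and this is not the scikit-learn implementation):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:            # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = D[i, same].sum() / (same.sum() - 1)                  # excludes the point itself
        b = min(D[i, labels == k].mean() for k in np.unique(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated hypothetical clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette(X, np.array([0, 0, 1, 1]))   # labels follow the geometry
bad = silhouette(X, np.array([0, 1, 0, 1]))    # labels cut across the geometry
print(round(good, 3), round(bad, 3))
```

Labels that match the geometry score close to 1, while labels that cut across it score below 0, which is why values around 0.3 suggest partial overlap.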

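**We obtained a Calinski-Harabasz Score of 3078.58 on the training data and 1192.73 on the testing data. Higher is better, but the index grows with the number of samples, so the two values are not directly comparable across the differently sized training and testing sets.**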
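
The Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: `[tr(B) / (k - 1)] / [tr(W) / (n - k)]`. The sketch below (hypothetical helper `calinski_harabasz`, made-up data) implements that formula and shows that duplicating every point, which leaves the cluster geometry unchanged, still raises the score.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """[tr(B) / (k - 1)] / [tr(W) / (n - k)] for a hard clustering."""
    n, k = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - overall_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

small = calinski_harabasz(X, labels)
# Same geometry, twice as many samples
large = calinski_harabasz(np.vstack([X, X]), np.concatenate([labels, labels]))
print(small, large)
```

Because of this dependence on `n`, the index is most useful for comparing different clusterings of the same dataset.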

**We obtained a Davies-Bouldin Score of 0.8173 on the training data and 0.8689 on the testing data. Lower values mean better separation (0 is the minimum), so these scores indicate that the clusters are only moderately separated.**
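
The Davies-Bouldin index averages, over clusters, the worst-case ratio `(s_i + s_j) / d_ij`, where `s_i` is the mean distance of cluster `i`'s points to its centroid and `d_ij` is the distance between centroids `i` and `j`. A NumPy sketch on made-up data (hypothetical helper `davies_bouldin`, not the scikit-learn code):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Mean over clusters of max_j (s_i + s_j) / d_ij; lower is better."""
    ids = np.unique(labels)
    centers = np.array([X[labels == c].mean(axis=0) for c in ids])
    # spread[i]: average distance of cluster i's members to its own centroid
    spread = np.array([
        np.linalg.norm(X[labels == c] - centers[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    worst = []
    for i in range(len(ids)):
        ratios = [
            (spread[i] + spread[j]) / np.linalg.norm(centers[i] - centers[j])
            for j in range(len(ids)) if j != i
        ]
        worst.append(max(ratios))      # most similar (worst-separated) neighbor
    return float(np.mean(worst))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
score = davies_bouldin(X, labels)
print(score)
```

In this toy case each cluster has spread 1 and the centroids are 10 apart, so the ratio is small; values closer to 1, as reported above, mean the spreads are comparable to the distances between centroids.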