# Text clustering 

Done by: Sebastián Sarasti

Import basic libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load the kedro extension

In [3]:
%load_ext kedro.ipython

Load data

In [4]:
train = catalog.load('train')

# Exploratory Data Analysis (EDA)

1. See the type of data

In [6]:
train.dtypes


ID                       int64
TITLE                   object
ABSTRACT                object
Computer Science         int64
Physics                  int64
Mathematics              int64
Statistics               int64
Quantitative Biology     int64
Quantitative Finance     int64
dtype: object

2. See if there are null values

In [7]:
train.isnull().sum()


ID                      0
TITLE                   0
ABSTRACT                0
Computer Science        0
Physics                 0
Mathematics             0
Statistics              0
Quantitative Biology    0
Quantitative Finance    0
dtype: int64

3. Clean the text.

To reduce noise from uninformative words, we define a function that cleans the text. It applies the following steps:
- Lowercase the text.
- Remove all digits.
- Remove the stop words.
- Remove words with three or fewer characters.
- Normalize the remaining words by lemmatizing them.

In [15]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer


Store the English stop words in a list

In [14]:
stop_words = stopwords.words('english')

Create the lemmatizer object

In [17]:
lematizer = WordNetLemmatizer()

In [21]:
import re
import nltk
from nltk.corpus import stopwords

def clean_text(text):
    """
    This function receives a text and returns it cleaned.

    Args:
        text (str): a text to be cleaned
    
    Returns:
        str: the cleaned text
    """
    # lowercase the text
    text = text.lower()
    # remove numbers
    text = re.sub(r'\d+', '', text)
    # get the tokens 
    tokens = word_tokenize(text)
    # remove stop words
    tokens = [word for word in tokens if word not in stop_words]
    # remove words with 3 or fewer characters
    tokens = [word for word in tokens if len(word) > 3]
    # lemmatize
    tokens = [lematizer.lemmatize(word) for word in tokens]
    # join all
    text = ' '.join(tokens)
    # remove blank space
    text = text.strip()
    return text
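
To see what the cleaning steps do without downloading any NLTK resources, here is a minimal, dependency-free sketch. It is an assumption-laden stand-in, not the function above: `clean_text_sketch`, `TOY_STOP_WORDS`, and the sample sentence are hypothetical, tokenization uses a simple regular expression instead of `word_tokenize`, and lemmatization is skipped.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration
TOY_STOP_WORDS = {"the", "of", "a", "in", "for", "and", "with"}


def clean_text_sketch(text):
    """Simplified version of the cleaning steps (no NLTK, no lemmatization)."""
    # lowercase the text
    text = text.lower()
    # remove digits
    text = re.sub(r"\d+", "", text)
    # split into alphabetic tokens, keeping hyphenated words together
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text)
    # remove stop words
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    # keep only words longer than 3 characters
    tokens = [t for t in tokens if len(t) > 3]
    return " ".join(tokens)


# Made-up sentence, not taken from the dataset
example = clean_text_sketch("A Fast Web Crawler for Big Data 2.0")
print(example)
```

Note that the length filter also drops three-letter words such as "web" and "big", because only tokens with more than three characters are kept.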

Apply the cleaning for the title and abstract columns

In [22]:
train["TITLE"] = train["TITLE"].apply(clean_text)
train["ABSTRACT"] = train["ABSTRACT"].apply(clean_text)

# Modeling

The dataset includes the target labels. However, in this application we want to simulate a scenario where the data is unlabeled.

We will discover possible labels with a clustering model, based only on the words in the titles and abstracts.

1. We need to create an NxM matrix, where the N rows are the samples and the M columns are the features. We can build it with TF-IDF.

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
# vectorizer for the titles
vectorizer1 = TfidfVectorizer(max_df=0.99, min_df=0.005)
# vectorizer for the abstracts
vectorizer2 = TfidfVectorizer(max_df=0.99, min_df=0.005)

In [53]:
X_titles = vectorizer1.fit_transform(train["TITLE"])
X_abstracts = vectorizer2.fit_transform(train["ABSTRACT"])

2. Load the NMF model. It approximates the NxM matrix as the product of two non-negative matrices, NxK (1st) * KxM (2nd).

The first matrix, NxK, contains K features (topic weights) for each title or abstract.

The second matrix, KxM, indicates which words carry the most weight in each of the K features.

In [54]:
from sklearn.decomposition import NMF

In [55]:
# model for the titles
nmf1 = NMF(n_components=10, random_state=42)
# model for the abstracts
nmf2 = NMF(n_components=10, random_state=42)

Calculate the matrices

In [56]:
mtitles = nmf1.fit_transform(X_titles)
mabstracts = nmf2.fit_transform(X_abstracts)
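
To make the shapes of the factorization concrete, here is a small sketch on a random non-negative matrix. The matrix `X`, its size, and `K = 2` are assumptions for illustration only; they stand in for a TF-IDF matrix.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative matrix: N = 6 "documents" (rows) by M = 4 "terms" (columns)
rng = np.random.default_rng(0)
X = rng.random((6, 4))

K = 2
model = NMF(n_components=K, init="nndsvda", random_state=42, max_iter=1000)
W = model.fit_transform(X)  # NxK: weight of each topic in each document
H = model.components_       # KxM: weight of each term in each topic
X_approx = W @ H            # NxM: approximation of the original matrix

print(W.shape, H.shape, X_approx.shape)
```

Here `fit_transform` returns the NxK matrix and `components_` holds the KxM matrix, which matches how `mtitles` and `nmf1.components_` are used in this notebook.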

Create two DataFrames to store the results from the NMF models

In [57]:
df_titles = pd.DataFrame(nmf1.components_, columns=vectorizer1.get_feature_names_out())
df_abstracts = pd.DataFrame(nmf2.components_, columns=vectorizer2.get_feature_names_out())

Define a function to print the top words, ranked by weight, for each of the K topics.

In [61]:
def display_topics(components_df, no_top_words):
    for topic in range(components_df.shape[0]):
        tmp = components_df.iloc[topic]
        print(f'For topic {topic+1} the words with the highest value are:')
        print(tmp.nlargest(no_top_words))
        print('\n')

Because this is an iterative process, we define a function that fits a model for a given number of clusters and prints its topics.

To choose the best number of clusters, we have to read the top words of each topic and judge whether they form coherent themes.

In [64]:
def evaluate_clustering(matrix_tf, n_clusters, vectorizer):
    model = NMF(n_components=n_clusters, random_state=42)
    W = model.fit_transform(matrix_tf)
    df = pd.DataFrame(model.components_, columns=vectorizer.get_feature_names_out())
    display_topics(df, 10)
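
Reading the words remains the main criterion. As a possible complement, the fitted model exposes `reconstruction_err_`, which could be compared across candidate values of K. The sketch below uses a random matrix as a hypothetical stand-in for the TF-IDF matrix. The error usually keeps shrinking as K grows, so it cannot pick K on its own.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical stand-in for a TF-IDF matrix (30 documents, 12 terms)
rng = np.random.default_rng(0)
X = rng.random((30, 12))

errors = {}
for k in [3, 4, 5]:
    model = NMF(n_components=k, init="nndsvda", random_state=42, max_iter=1000)
    model.fit(X)
    # Frobenius norm of the difference between X and its reconstruction
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f"K={k}: reconstruction error = {err:.4f}")
```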

For the titles

In [66]:
topics = [3, 4, 5]
for topic in topics:
    print(f'Number of clusters selected was {topic}:')
    evaluate_clustering(X_titles, topic, vectorizer1)
    print('\n')

Number of clusters selected was 3:
For topic 1 the words with the highest value are:
network          5.022102
neural           2.204723
deep             0.732297
convolutional    0.624584
using            0.608490
recurrent        0.359711
adversarial      0.315284
based            0.278882
dynamic          0.266942
analysis         0.264291
Name: 0, dtype: float64


For topic 2 the words with the highest value are:
model       3.317209
system      0.661979
based       0.630218
data        0.519725
using       0.478260
analysis    0.446079
dynamic     0.309342
time        0.270941
linear      0.265327
approach    0.254595
Name: 1, dtype: float64


For topic 3 the words with the highest value are:
learning          3.303337
deep              1.359428
machine           0.635892
reinforcement     0.431207
data              0.359883
based             0.335335
using             0.289307
representation    0.274779
multi             0.233294
approach          0.199918
Name: 2, dtype: float64

For the abstracts

In [68]:
topics = [3, 4, 5]
for topic in topics:
    print(f'Number of clusters selected was {topic}:')
    evaluate_clustering(X_abstracts, topic, vectorizer2)
    print('\n')

Number of clusters selected was 3:
For topic 1 the words with the highest value are:
network      1.416461
model        1.280233
data         1.197793
learning     1.045764
method       0.966062
algorithm    0.697156
approach     0.677682
neural       0.655782
based        0.652174
deep         0.584422
Name: 0, dtype: float64


For topic 2 the words with the highest value are:
field          0.623013
spin           0.615562
system         0.610686
phase          0.598236
magnetic       0.576949
energy         0.561037
temperature    0.515173
state          0.512965
quantum        0.454962
effect         0.407985
Name: 1, dtype: float64


For topic 3 the words with the highest value are:
problem      0.700083
function     0.667108
group        0.607623
mathbb       0.579362
graph        0.577283
space        0.558827
equation     0.539288
solution     0.518846
prove        0.467742
algorithm    0.466846
Name: 2, dtype: float64




Number of clusters selected was 4:


For topic 1 the words with the highest value are:
data            1.360011
algorithm       1.291993
method          1.216066
model           1.206187
problem         0.950513
approach        0.721961
based           0.679956
proposed        0.627078
time            0.610926
distribution    0.524656
Name: 0, dtype: float64


For topic 2 the words with the highest value are:
spin           0.690264
field          0.673796
phase          0.656944
magnetic       0.643786
system         0.624146
energy         0.605902
temperature    0.567205
state          0.565281
quantum        0.491960
effect         0.428543
Name: 1, dtype: float64


For topic 3 the words with the highest value are:
group       0.815078
mathbb      0.772709
space       0.639877
graph       0.638647
function    0.600425
equation    0.590457
prove       0.547539
algebra     0.477220
mathcal     0.462917
solution    0.450999
Name: 2, dtype: float64


For topic 4 the words with the highest value are:
network         2.3407

For topic 1 the words with the highest value are:
data            1.362724
algorithm       1.292264
method          1.256766
model           1.216438
problem         1.013568
approach        0.718877
based           0.679619
proposed        0.644394
time            0.639217
distribution    0.586488
Name: 0, dtype: float64


For topic 2 the words with the highest value are:
spin           0.706848
field          0.676740
phase          0.668944
magnetic       0.659013
system         0.619105
energy         0.612241
temperature    0.579461
state          0.572240
quantum        0.495844
transition     0.434457
Name: 1, dtype: float64


For topic 3 the words with the highest value are:
group       0.968156
mathbb      0.921756
space       0.757286
equation    0.725335
function    0.690063
prove       0.605369
algebra     0.574542
solution    0.525274
mathcal     0.519195
operator    0.480887
Name: 2, dtype: float64


For topic 4 the words with the highest value are:
network         2.2685

After careful consideration, we define 3 clusters for the titles and 4 for the abstracts.

For the titles, the first cluster is about deep learning, the second about math, and the third about RL (reinforcement learning).

For the abstracts, the first cluster is about data, the second about physics, the third about math, and the fourth about machine learning.

# Final Labeling

Based on the previous results, we create the final models for the titles and the abstracts.

In [69]:
# model for the titles
nmf1 = NMF(n_components=3, random_state=42)
# model for the abstracts
nmf2 = NMF(n_components=4, random_state=42)

In [70]:
mtitles = nmf1.fit_transform(X_titles)
mabstracts = nmf2.fit_transform(X_abstracts)

In [80]:
mtitles


array([[0.00136383, 0.0106293 , 0.        ],
       [0.14922867, 0.        , 0.        ],
       [0.00123705, 0.00689299, 0.00511899],
       ...,
       [0.00111711, 0.00701716, 0.00247215],
       [0.0016631 , 0.00933373, 0.00429966],
       [0.00098933, 0.01071782, 0.00215841]])

In [88]:
df_titles = pd.DataFrame(mtitles, columns=["deep learning", "math", "reinforcement learning"])
df_abstracts = pd.DataFrame(mabstracts, columns=["data", "physics", "math", "machine learning"])

Once we have settled on the number of features K of the NxK matrix, we assign each row the label with the highest weight. Note that the NMF weights are non-negative scores, not probabilities.

This is done with a function.

In [89]:
def get_label(row):
    return row.idxmax()

In [90]:
df_titles["class"] = df_titles.apply(get_label, axis=1)
df_abstracts["class"] = df_abstracts.apply(get_label, axis=1)
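
One edge case is worth checking: when every weight in a row is zero, `idxmax` returns the first column, so the row is labeled with the first topic even though it carries no signal. The sketch below shows this on a made-up frame and one possible way to flag such rows. The `"unassigned"` label and the variable names are assumptions, not part of the original pipeline.

```python
import pandas as pd

# Made-up weights; the second row has no weight in any topic
weights = pd.DataFrame(
    {
        "deep learning": [0.149, 0.0, 0.001],
        "math": [0.0, 0.0, 0.007],
        "reinforcement learning": [0.0, 0.0, 0.005],
    }
)

# Same result as applying get_label row by row
labels = weights.idxmax(axis=1)

# Flag rows where all weights are zero instead of defaulting to the first column
no_signal = weights.sum(axis=1) == 0
labels_checked = labels.mask(no_signal, "unassigned")

print(pd.DataFrame({"idxmax": labels, "checked": labels_checked}))
```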

In [85]:
df_titles

Unnamed: 0,deep learning,math,reinforcement learning,class
0,0.001364,0.010629,0.000000,math
1,0.149229,0.000000,0.000000,deep learning
2,0.001237,0.006893,0.005119,math
3,0.000037,0.033415,0.000000,math
4,0.002097,0.017937,0.012046,math
...,...,...,...,...
20967,0.000000,0.000000,0.169086,reinforcement learning
20968,0.000000,0.000000,0.000000,deep learning
20969,0.001117,0.007017,0.002472,math
20970,0.001663,0.009334,0.004300,math


In [91]:
df_abstracts

Unnamed: 0,data,physics,math,machine learning,class
0,0.058167,0.007393,0.000000,0.009623,data
1,0.000000,0.000000,0.002665,0.064351,machine learning
2,0.003349,0.000000,0.036629,0.000564,math
3,0.016493,0.016191,0.057782,0.000000,math
4,0.037817,0.004124,0.000000,0.018692,data
...,...,...,...,...,...
20967,0.030044,0.000000,0.000000,0.068329,machine learning
20968,0.005270,0.040468,0.004971,0.000000,physics
20969,0.009245,0.009795,0.004214,0.051813,machine learning
20970,0.049485,0.000000,0.009753,0.000000,data


Let's compare the clustering results with the titles and abstracts

In [97]:
train.head(10)

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,1,reconstructing subject-specific effect,predictive model allow subject-specific infere...,1,0,0,0,0,0
1,2,rotation invariance neural network,rotation invariance translation invariance gre...,1,0,0,0,0,0
2,3,spherical polyharmonics poisson kernel polyhar...,introduce develop notion spherical polyharmoni...,0,0,1,0,0,0
3,4,finite element approximation stochastic maxwel...,stochastic landau lifshitz gilbert equation co...,0,0,1,0,0,0
4,5,comparative study discrete wavelet transforms ...,fourier-transform infra-red ftir spectrum samp...,1,0,0,1,0,0
5,6,maximizing fundamental frequency complement ob...,\omega \subset \mathbb bounded domain satisfyi...,0,0,1,0,0,0
6,7,rotation period shape hyperbolic asteroid oumu...,observed newly discovered hyperbolic minor pla...,0,1,0,0,0,0
7,8,adverse effect polymer coating heat transport ...,ability metallic nanoparticles supply heat liq...,0,1,0,0,0,0
8,9,calculation mars-scale collision role equation...,model large-scale \approx impact mars-like pla...,0,1,0,0,0,0
9,10,\mathcal fails predict outbreak potential pres...,time varying susceptibility host individual le...,0,0,0,0,1,0


In [98]:
df_titles.head(10)

Unnamed: 0,deep learning,math,reinforcement learning,class
0,0.001364,0.010629,0.0,math
1,0.149229,0.0,0.0,deep learning
2,0.001237,0.006893,0.005119,math
3,3.7e-05,0.033415,0.0,math
4,0.002097,0.017937,0.012046,math
5,0.0,0.0,0.0,deep learning
6,0.0,0.0,0.0,deep learning
7,0.000949,0.00893,0.0,math
8,0.003143,0.020337,0.000151,math
9,0.000316,0.002225,0.0,math


In [99]:
df_abstracts.head(10)

Unnamed: 0,data,physics,math,machine learning,class
0,0.058167,0.007393,0.0,0.009623,data
1,0.0,0.0,0.002665,0.064351,machine learning
2,0.003349,0.0,0.036629,0.000564,math
3,0.016493,0.016191,0.057782,0.0,math
4,0.037817,0.004124,0.0,0.018692,data
5,0.0,0.00154,0.04819,0.0,math
6,0.00425,0.021278,0.003275,0.008485,physics
7,0.0,0.055779,0.0,0.002598,physics
8,0.00212,0.08191,0.0,0.0,physics
9,0.018125,0.029084,0.025437,0.008326,physics


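**Final conclusion**

Judging by a manual read of the titles and abstracts above, the clustering works well: the assigned labels mostly match the content. One caveat is that rows whose weights are all zero, such as title rows 5 and 6, receive the first label by default.

This is a fast way to label large amounts of data that have not been labeled previously.

Disclaimer: the labels provided in the original dataset were ignored because comparing against them is outside the scope of this model.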