# Experimental Problem Statement
Using label propagation, classify a subset of textual data into appropriate classes. Then, using the trained model, assign the remaining datapoints to the appropriate clusters.

Well, since it's just a demo, we are not gonna do manual labelling. Instead, we are gonna:

* Pick a fully labelled dataset of form (text, labels)
* Split this into train, test set
* Retain labels of train set (to verify results)
* Train a clustering algo on train set (assuming, for simplicity, k=#clusters in our data)
* Classify test data to check
* Evaluate results
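
The steps above can be sketched end to end on toy data. This is a minimal sketch, not the notebook's pipeline: it assumes synthetic 2-D points from `make_blobs` in place of text embeddings, hand-picked blob centers, and a majority-vote rule for naming clusters. All variable names are illustrative.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (text, label) data: three well-separated 2-D blobs.
X_toy, y_toy = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],  # assumed centers, chosen for clear separation
    cluster_std=0.5,
    random_state=0,
)

# Split into train and test, keeping class proportions equal in both parts.
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=0, stratify=y_toy
)

# Cluster the train set with k equal to the number of classes.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xtr)

# Name each cluster after the most common true label among its train points.
toy_mapping = {
    c: Counter(ytr[km.labels_ == c].tolist()).most_common(1)[0][0]
    for c in range(3)
}

# Assign test points to clusters, translate cluster ids to labels, and evaluate.
toy_pred = [toy_mapping[c] for c in km.predict(Xte).tolist()]
toy_accuracy = sum(p == t for p, t in zip(toy_pred, yte.tolist())) / len(yte)
print(f"toy accuracy: {toy_accuracy:.2f}")
```

Real text embeddings are far less cleanly separated than these blobs, so the toy accuracy says nothing about what to expect on the BBC Sport data.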

Along the way, we are going to explore some new libraries.



In [102]:
!pip install texthero



In [103]:
# install required packages
!pip install sentence_transformers



# importing libraries

In [104]:
import texthero as hero
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Texthero 
* [Github](https://github.com/jbesomi/texthero)
* [Docs](https://texthero.org/)

## loading data

In [105]:
# import data
file_path = r"https://raw.githubusercontent.com/jbesomi/texthero/master/dataset/bbcsport.csv"
df = pd.read_csv(file_path)
display(df.head())

Unnamed: 0,text,topic
0,Claxton hunting first major medal\n\nBritish h...,athletics
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics
2,Greene sets sights on world title\n\nMaurice G...,athletics
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics


## get distinct labels

In [106]:
# check values of labels
df['topic'].value_counts()

football     265
rugby        147
cricket      124
athletics    101
tennis       100
Name: topic, dtype: int64

In [107]:
# first row before prep
df['text'][0]

'Claxton hunting first major medal\n\nBritish hurdler Sarah Claxton is confident she can win her first major medal at next month\'s European Indoor Championships in Madrid.\n\nThe 25-year-old has already smashed the British record over 60m hurdles twice this season, setting a new mark of 7.96 seconds to win the AAAs title. "I am quite confident," said Claxton. "But I take each race as it comes. "As long as I keep up my training but not do too much I think there is a chance of a medal." Claxton has won the national 60m hurdles title for the past three years but has struggled to translate her domestic success to the international stage. Now, the Scotland-born athlete owns the equal fifth-fastest time in the world this year. And at last week\'s Birmingham Grand Prix, Claxton left European medal favourite Russian Irina Shevchenko trailing in sixth spot.\n\nFor the first time, Claxton has only been preparing for a campaign over the hurdles - which could explain her leap in form. In previous

In [108]:
# clean pipeline - https://texthero.org/docs/api/texthero.preprocessing.clean.html#texthero.preprocessing.clean
# first row after prep
hero.clean(df['text'])[0]


'claxton hunting first major medal british hurdler sarah claxton confident win first major medal next month european indoor championships madrid year old already smashed british record 60m hurdles twice season setting new mark seconds win aaas title quite confident said claxton take race comes long keep training much think chance medal claxton national 60m hurdles title past three years struggled translate domestic success international stage scotland born athlete owns equal fifth fastest time world year last week birmingham grand prix claxton left european medal favourite russian irina shevchenko trailing sixth spot first time claxton preparing campaign hurdles could explain leap form previous seasons year old also contested long jump since moving colchester london focused attentions claxton see new training regime pays dividends european indoors take place march'

## preprocessing

In [109]:
df['text'] = hero.clean(df['text'])
df.head()

Unnamed: 0,text,topic
0,claxton hunting first major medal british hurd...,athletics
1,sullivan could run worlds sonia sullivan indic...,athletics
2,greene sets sights world title maurice greene ...,athletics
3,iaaf launches fight drugs iaaf athletics world...,athletics
4,dibaba breaks 000m world record ethiopia tirun...,athletics


In [110]:
df2 = df.sample(frac=1, random_state=42) # shuffle dataset
df2.head()

Unnamed: 0,text,topic
669,johansson takes adelaide victory second seed j...,tennis
33,athens memories soar lows well goodbye another...,athletics
549,england coach faces rap row england coach andy...,rugby
199,new zealand step security new zealand cricket ...,cricket
264,irish finish home game republic ireland manage...,football


## Stratified sampling

In [111]:
# first split data and labels
X = df2.pop('text')
display(X[:10])
y = df2.pop('topic')
print(y[:10])

669    johansson takes adelaide victory second seed j...
33     athens memories soar lows well goodbye another...
549    england coach faces rap row england coach andy...
199    new zealand step security new zealand cricket ...
264    irish finish home game republic ireland manage...
583    ireland south africa ronan gara scored ireland...
39     radcliffe tackles marathon tasks paula radclif...
554    owen set skipper role wales number eight micha...
585    ireland call uncapped campbell ulster scrum ha...
609    bath faced tindall ultimatum mike tindall agen...
Name: text, dtype: object

669       tennis
33     athletics
549        rugby
199      cricket
264     football
583        rugby
39     athletics
554        rugby
585        rugby
609        rugby
Name: topic, dtype: object


In [112]:
# use stratified sampling
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.4, random_state=42, stratify=y)

print(f"X_train.shape - {X_train.shape}")
print(f"X_test.shape - {X_test.shape}")
print(f"y_train.shape - {y_train.shape}")
print(f"y_test.shape - {y_test.shape}")

X_train.shape - (442,)
X_test.shape - (295,)
y_train.shape - (442,)
y_test.shape - (295,)


In [113]:
y_train.value_counts()

football     159
rugby         88
cricket       74
athletics     61
tennis        60
Name: topic, dtype: int64

Now, data is pretty much ready.
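
To see what `stratify=y` achieved, the class shares in the full dataset can be compared with those in the train split. The sketch below uses only the counts printed by the two `value_counts()` calls above; the dictionary and variable names are illustrative.

```python
# Class counts copied from the value_counts() outputs above.
full_counts = {"football": 265, "rugby": 147, "cricket": 124, "athletics": 101, "tennis": 100}
train_counts = {"football": 159, "rugby": 88, "cricket": 74, "athletics": 61, "tennis": 60}

n_full = sum(full_counts.values())    # rows in the whole dataset
n_train = sum(train_counts.values())  # rows in the train split

# Share of each class before and after the split.
for topic in full_counts:
    full_share = full_counts[topic] / n_full
    train_share = train_counts[topic] / n_train
    print(f"{topic:<10} full={full_share:.3f} train={train_share:.3f}")

# Largest difference between the two shares across all classes.
max_gap = max(
    abs(full_counts[t] / n_full - train_counts[t] / n_train) for t in full_counts
)
print(f"largest gap: {max_gap:.4f}")
```

The shares match to within a fraction of a percentage point, which is what a stratified split is meant to guarantee.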

# BERT Embedding (sentence transformers)
Converting each text to its BERT embedding.

In [114]:
# load sentence transformer
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [115]:
# encode sentences
sentence_embeddings = model.encode(X_train.values)
sentence_embeddings[:1]

array([[-5.55381656e-01,  7.65018642e-01,  8.63958180e-01,
         1.36552021e-01,  7.03095376e-01, -4.67655182e-01,
         1.28361642e+00, -1.95206061e-01,  4.69943136e-02,
         1.62134498e-01, -6.01899326e-02,  1.92754775e-01,
         5.85547507e-01, -5.15420176e-02, -5.37184656e-01,
         2.78839260e-01, -1.15268104e-01, -3.44222933e-02,
         9.87192392e-02, -4.23399717e-01, -4.28273410e-01,
        -1.62947372e-01,  7.91986942e-01,  4.94392157e-01,
         1.12411511e+00,  1.01944113e+00,  3.31209660e-01,
         3.34763043e-02, -1.37123835e+00,  4.80189651e-01,
        -5.98977208e-01,  5.65721631e-01, -4.50971365e-01,
        -4.60465729e-01,  4.60062623e-01,  8.39939892e-01,
         2.11740062e-01, -2.47256160e-02, -2.02038482e-01,
        -3.58699471e-01, -2.92679757e-01,  5.36426961e-01,
        -7.96072856e-02, -3.44232261e-01, -1.17475879e+00,
        -2.33570347e-03, -3.24793249e-01,  3.52797210e-01,
         8.10622036e-01, -9.43653464e-01,  2.11586982e-0

In [116]:
print(type(sentence_embeddings))
print(len(sentence_embeddings))
print(type(sentence_embeddings[0]))
print(len(sentence_embeddings[0]))

<class 'numpy.ndarray'>
442
<class 'numpy.ndarray'>
768


So, each of the 442 training texts is converted to a BERT embedding of dimension 768. Now, data is ready for clustering.

# Clustering

### performing KMeans

In [117]:
from sklearn.cluster import KMeans

num_clusters = 5

clustering_model = KMeans(n_clusters=num_clusters, 
                          init ='k-means++',
                          max_iter=300, 
                          random_state=42, 
                          n_init=10)
clustering_model.fit(sentence_embeddings)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)

### Dimensionality Reduction for Visualisation

In [118]:
# reducing dimension of datapoint from 768 to 3 via
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
X3D_pca = pca.fit_transform(sentence_embeddings)

In [119]:
X3D_pca

array([[-2.848843  , -1.169753  ,  4.499333  ],
       [ 6.2991824 ,  3.263582  , -0.07392542],
       [-5.1077976 ,  1.9794936 ,  2.144274  ],
       ...,
       [ 5.1860175 ,  4.820907  ,  0.7535834 ],
       [-1.2422442 ,  0.1309496 , -0.02445554],
       [ 5.000424  ,  1.6867553 ,  0.30083892]], dtype=float32)
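
Projecting 768 dimensions down to 3 discards information, so it helps to check how much variance the three components keep before reading too much into the plot. The sketch below assumes random data as a stand-in for `sentence_embeddings`; `X_demo` and `pca_demo` are illustrative names. In the notebook itself, the same attribute is available as `pca.explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the embedding matrix (the real one has shape (442, 768)).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 20))

# Same reduction as above: keep the first three principal components.
pca_demo = PCA(n_components=3)
X_demo_3d = pca_demo.fit_transform(X_demo)

# Fraction of the total variance captured by each kept component.
ratios = pca_demo.explained_variance_ratio_
retained = ratios.sum()
print("per component:", ratios)
print(f"retained in 3D: {retained:.1%}")
```

If the retained share is low, points that overlap in the 3D plot may still be well separated in the original space, and vice versa.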

In [120]:
# lets check
clustered_df = pd.DataFrame({
    'sentence' : X_train.values,
    'cluster' : clustering_model.labels_,
    'orig_label' : y_train.values, 
    'dim_X' : X3D_pca[:, 0],
    'dim_Y' : X3D_pca[:, 1],
    'dim_Z' : X3D_pca[:, 2],
})
clustered_df.head()

Unnamed: 0,sentence,cluster,orig_label,dim_X,dim_Y,dim_Z
0,holmes facing fine trials double olympic champ...,1,athletics,-2.848843,-1.169753,4.499333
1,pietersen gives england chance lunch england o...,3,cricket,6.299182,3.263582,-0.073925
2,kluft playing record chance sweden carolina kl...,1,athletics,-5.107798,1.979494,2.144274
3,smith work scottish wonders worst kept secret ...,2,football,-1.413772,-1.783141,-1.828401
4,edgy agassi struggles past dent andre agassi p...,0,tennis,-0.343911,1.056281,3.732847


## Visualising clusters via plotly

In [121]:
import plotly.graph_objects as go

# create new figure
fig = go.Figure()

cluster_label = clustering_model.labels_

# plot data
for cluster in set(cluster_label):
    original_labels = list(clustered_df[clustered_df['cluster']==cluster]['orig_label'].values)
    cl = [cluster] * len(original_labels)
    # print(original_labels)
    fig.add_trace(
        go.Scatter3d(
            x=list(clustered_df[clustered_df['cluster']==cluster]['dim_X'].values), 
            y=list(clustered_df[clustered_df['cluster']==cluster]['dim_Y'].values), 
            z=list(clustered_df[clustered_df['cluster']==cluster]['dim_Z'].values), 
            name=f"Cluster {cluster}",
            mode="markers",
            marker=dict(
                # color=cols,
                size=5,
                # line=dict(width=0.5, color='DarkSlateGrey')
            ), 
            # showlegend=True,
            # to display multiple, variables of custom data
            customdata= tuple(zip(original_labels, cl)),
            # To modify hover data, labels
            hovertemplate='L:%{customdata[0]}<br>C:%{customdata[1]}', 
        )
    )

# 3D plots are configured through an embedded Scene object
from plotly.graph_objs.layout import Scene
from plotly.graph_objs.layout.scene import XAxis, YAxis, ZAxis
fig.update_layout(
    showlegend=True,
    legend=dict(
        y=0.99,
        x=0.01,
    ),
    scene=Scene(
        xaxis=XAxis(title='X'),
        yaxis=YAxis(title='Y'),
        zaxis=ZAxis(title='Z')
    ),
    margin=dict(l=0, r=0, b=0, t=0),  # tight Layout
)
fig.show()

In [122]:
grp_obj = clustered_df.groupby(by='cluster')
for i in range(num_clusters):
    print(f"Cluster {i}:")
    print(grp_obj.get_group(i).value_counts('orig_label'))
    print()

Cluster 0:
orig_label
tennis      57
football     1
dtype: int64

Cluster 1:
orig_label
athletics    41
football      2
dtype: int64

Cluster 2:
orig_label
football    91
rugby       64
tennis       1
dtype: int64

Cluster 3:
orig_label
cricket    73
rugby       1
dtype: int64

Cluster 4:
orig_label
football     65
rugby        23
athletics    20
tennis        2
cricket       1
dtype: int64



So, based on the datapoints in each cluster, it looks like the clusters are as follows:
* C0 - Tennis
* C1 - Athletics
* C2 - Football
* C3 - Cricket
* C4 - Rugby

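Clusters 2 and 4 are mixed, though: football is the most common label in both, so rugby ends up on C4 by elimination rather than by majority. Also verify from `clustered_df`.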
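
The cluster-to-label assignment can also be derived from the printed counts instead of being read off by eye. The sketch below is one possible way to do it, using only the numbers from the groupby output above: a majority vote per cluster, and a brute-force search for the best one-to-one assignment. The names `counts`, `majority` and `best_mapping` are illustrative.

```python
from itertools import permutations

# Label counts per cluster, copied from the groupby output above.
counts = {
    0: {"tennis": 57, "football": 1},
    1: {"athletics": 41, "football": 2},
    2: {"football": 91, "rugby": 64, "tennis": 1},
    3: {"cricket": 73, "rugby": 1},
    4: {"football": 65, "rugby": 23, "athletics": 20, "tennis": 2, "cricket": 1},
}
labels = ["tennis", "athletics", "football", "cricket", "rugby"]

# Majority vote: each cluster takes its most common label (labels may repeat).
majority = {c: max(by_label, key=by_label.get) for c, by_label in counts.items()}


def agreement(perm):
    """Number of train points whose label matches the one given to their cluster."""
    return sum(counts[c].get(label, 0) for c, label in enumerate(perm))


# One-to-one assignment: try every way of giving the clusters distinct labels
# and keep the one that agrees with the most train points (5! = 120 candidates).
best_perm = max(permutations(labels), key=agreement)
best_mapping = dict(enumerate(best_perm))

n_points = sum(sum(by_label.values()) for by_label in counts.values())
print("majority vote:  ", majority)
print("best one-to-one:", best_mapping, f"({agreement(best_perm)}/{n_points} agree)")
```

On these training counts, the mapping listed above agrees with 285 of the 442 points, while the best one-to-one assignment swaps football and rugby between clusters 2 and 4 and agrees with 300. Whether that swap also helps on the test set cannot be told from these counts alone.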

# Label Propagation
Now that we are satisfied with our clusters, we can use the trained model to classify remaining datapoints.
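
For intuition, a fitted KMeans model classifies a new point by assigning it to the cluster whose centroid is closest. The toy sketch below reproduces that rule in plain Python with Euclidean distance; the 2-D centroids and points are made up for illustration and are not the notebook's 768-dimensional ones.

```python
import math

# Made-up 2-D centroids standing in for the fitted cluster centers.
toy_centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


# New, unseen points get the id of the closest centroid.
new_points = [(0.5, -0.2), (4.0, 6.0), (1.0, 4.0)]
toy_labels = [nearest_centroid(p, toy_centroids) for p in new_points]
print(toy_labels)
```

The cluster ids returned this way carry no meaning by themselves; they only become topic names once a cluster-to-label mapping is applied.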

## preparing test data for prediction

In [123]:
# convert test data to BERT Embeddings
test_sentence_embeddings = model.encode(X_test.values)

In [124]:
# predict labels
y_preds = clustering_model.predict(test_sentence_embeddings)

In [125]:
# reducing dimensions for visualisation
X3D_pca_test = pca.transform(test_sentence_embeddings)

In [126]:
# making df of predicted results
# lets check
test_clustered_df = pd.DataFrame({
    'sentence' : X_test.values,
    'cluster' : y_preds,
    'orig_label' : y_test.values, 
    'dim_X' : X3D_pca_test[:, 0],
    'dim_Y' : X3D_pca_test[:, 1],
    'dim_Z' : X3D_pca_test[:, 2],
})
test_clustered_df.head()

Unnamed: 0,sentence,cluster,orig_label,dim_X,dim_Y,dim_Z
0,gerrard happy anfield liverpool captain steven...,2,football,-3.192506,-2.532742,-2.158245
1,gb quartet get cross country call four british...,1,athletics,-5.066533,1.59474,0.943679
2,year remember irish used one subliminal moment...,2,rugby,-0.669114,2.406793,-1.953145
3,hong kong world cup bid hong kong hoping join ...,2,rugby,1.07315,-1.203466,0.65205
4,iranian misses israel match iranian striker va...,4,football,0.361525,-2.985474,2.769322


In [127]:
import plotly.graph_objects as go

# create new figure
fig = go.Figure()

cluster_label = y_preds

# plot data
for cluster in set(y_preds):
    original_labels = list(test_clustered_df[test_clustered_df['cluster']==cluster]['orig_label'].values)
    cl = [cluster] * len(original_labels)
    # print(original_labels)
    fig.add_trace(
        go.Scatter3d(
            x=list(test_clustered_df[test_clustered_df['cluster']==cluster]['dim_X'].values), 
            y=list(test_clustered_df[test_clustered_df['cluster']==cluster]['dim_Y'].values), 
            z=list(test_clustered_df[test_clustered_df['cluster']==cluster]['dim_Z'].values), 
            name=f"Cluster {cluster}",
            mode="markers",
            marker=dict(
                # color=cols,
                size=5,
                # line=dict(width=0.5, color='DarkSlateGrey')
            ), 
            # to display multiple, variables of custom data
            customdata= tuple(zip(original_labels, cl)),
            # To modify hover data, labels
            hovertemplate='L:%{customdata[0]}<br>C:%{customdata[1]}', 
        )
    )

# 3D plots are configured through an embedded Scene object
from plotly.graph_objs.layout import Scene
from plotly.graph_objs.layout.scene import XAxis, YAxis, ZAxis
fig.update_layout(
    showlegend=True,
    legend=dict(
        y=0.99,
        x=0.01,
    ),
    scene=Scene(
        xaxis=XAxis(title='X'),
        yaxis=YAxis(title='Y'),
        zaxis=ZAxis(title='Z')
    ),
    margin=dict(l=0, r=0, b=0, t=0),  # tight Layout
)
fig.show()

## Verify result

In [128]:
# make a label mapper
mapping_ = {
    0: 'tennis',
    1: 'athletics',
    2: 'football',
    3: 'cricket',
    4: 'rugby'
}

In [129]:
correct = sum(test_clustered_df['orig_label'] == test_clustered_df['cluster'].map(mapping_))
correct

188

In [130]:
total = test_clustered_df.shape[0]
total

295

In [131]:
percent_correct = correct/total
percent_correct

0.6372881355932203

Great! We were able to classify 63.7% of the test datapoints correctly.
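
To put that number in context, it can be compared with a trivial baseline that always predicts the most frequent class. The sketch below uses only figures printed earlier in the notebook. The per-class test counts are not printed anywhere, so they are derived here by subtracting the train counts from the full counts, which assumes the split partitions the data exactly. Variable names are illustrative.

```python
# Figures copied from the outputs above.
correct, total = 188, 295
full_counts = {"football": 265, "rugby": 147, "cricket": 124, "athletics": 101, "tennis": 100}
train_counts = {"football": 159, "rugby": 88, "cricket": 74, "athletics": 61, "tennis": 60}

accuracy = correct / total

# Derived, not printed in the notebook: test counts = full counts - train counts.
test_counts = {t: full_counts[t] - train_counts[t] for t in full_counts}

# Baseline: label every test point with the most frequent test class.
top_class = max(test_counts, key=test_counts.get)
baseline = test_counts[top_class] / total

print(f"accuracy:                  {accuracy:.3f}")
print(f"always-{top_class} baseline: {baseline:.3f}")
```

The clustering approach clearly beats the always-football baseline, but the gap to a perfect score comes mostly from the two mixed football and rugby clusters seen on the train set.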