# Experimental Problem Statement
Using label propagation, classify a subset of textual data into appropriate classes. Using the trained model, classify the remaining datapoints into appropriate clusters.

Since this is just a demo, we are not going to do manual labelling. Instead, we are going to:

* Pick a fully labelled dataset of form (text, labels)
* Split this into train and test sets
* Retain the labels of the train set (to verify results)
* Train a clustering algorithm on the train set (assuming, for simplicity, that k equals the number of classes in our data)
* Classify test data to check
* Evaluate results

Along the way, we are going to explore some exciting new libraries.



In [2]:
!pip install texthero

Collecting texthero
  Downloading https://files.pythonhosted.org/packages/1f/5a/a9d33b799fe53011de79d140ad6d86c440a2da1ae8a7b24e851ee2f8bde8/texthero-1.0.9-py3-none-any.whl
Collecting nltk>=3.3
  Downloading https://files.pythonhosted.org/packages/92/75/ce35194d8e3022203cca0d2f896dbb88689f9b3fce8e9f9cff942913519d/nltk-3.5.zip (1.4MB)
Collecting unidecode>=1.1.1
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.5-cp36-none-any.whl size=1434675 sha256=4722af3814dcc61ad354fad23a5bf50cd99468840d8b7df122fb2da502851eff
  Stored in directory: /root/.cache/pip/wheels/ae/8c/3f/b1fe0ba04555b08b57ab52ab7f86023639a

# importing libraries

In [24]:
import texthero as hero
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Texthero 
* [Github](https://github.com/jbesomi/texthero)
* [Docs](https://texthero.org/)

## loading data

In [5]:
# import data
file_path = r"https://raw.githubusercontent.com/jbesomi/texthero/master/dataset/bbcsport.csv"
df = pd.read_csv(file_path)
display(df.head())

Unnamed: 0,text,topic
0,Claxton hunting first major medal\n\nBritish h...,athletics
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics
2,Greene sets sights on world title\n\nMaurice G...,athletics
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics


## get distinct labels

In [10]:
# check values of labels
df['topic'].value_counts()

football     265
rugby        147
cricket      124
athletics    101
tennis       100
Name: topic, dtype: int64

In [11]:
# first row before prep
df['text'][0]

'Claxton hunting first major medal\n\nBritish hurdler Sarah Claxton is confident she can win her first major medal at next month\'s European Indoor Championships in Madrid.\n\nThe 25-year-old has already smashed the British record over 60m hurdles twice this season, setting a new mark of 7.96 seconds to win the AAAs title. "I am quite confident," said Claxton. "But I take each race as it comes. "As long as I keep up my training but not do too much I think there is a chance of a medal." Claxton has won the national 60m hurdles title for the past three years but has struggled to translate her domestic success to the international stage. Now, the Scotland-born athlete owns the equal fifth-fastest time in the world this year. And at last week\'s Birmingham Grand Prix, Claxton left European medal favourite Russian Irina Shevchenko trailing in sixth spot.\n\nFor the first time, Claxton has only been preparing for a campaign over the hurdles - which could explain her leap in form. In previous

In [12]:
# clean pipeline - https://texthero.org/docs/api/texthero.preprocessing.clean.html#texthero.preprocessing.clean
# first row after prep
hero.clean(df['text'])[0]

'claxton hunting first major medal british hurdler sarah claxton confident win first major medal next month european indoor championships madrid year old already smashed british record 60m hurdles twice season setting new mark seconds win aaas title quite confident said claxton take race comes long keep training much think chance medal claxton national 60m hurdles title past three years struggled translate domestic success international stage scotland born athlete owns equal fifth fastest time world year last week birmingham grand prix claxton left european medal favourite russian irina shevchenko trailing sixth spot first time claxton preparing campaign hurdles could explain leap form previous seasons year old also contested long jump since moving colchester london focused attentions claxton see new training regime pays dividends european indoors take place march'
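To get a feel for what the cleaning step does, here is a rough standard-library approximation. It is only a sketch: the step order (lowercase, drop stand-alone digit blocks, drop punctuation, drop diacritics, drop stop words, squeeze whitespace) is my reading of the texthero docs, `simple_clean` is a hypothetical helper, and the stop-word list is a tiny made-up sample, so the output will not match `hero.clean` on arbitrary text.

```python
import re
import string
import unicodedata

# Tiny illustrative stop-word list (an assumption); the real pipeline uses a much larger one.
STOPWORDS = {"the", "a", "is", "in", "at", "of", "to", "has", "she", "her", "can", "for"}

def simple_clean(text):
    """Rough approximation of a lowercase/digits/punctuation/stop-word cleaning pass."""
    text = text.lower()
    # Drop stand-alone digit blocks only, so "60m" survives but "25" does not.
    text = re.sub(r"\b\d+\b", " ", text)
    # Replace every ASCII punctuation character with a space.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # Strip diacritics by decomposing characters and dropping the non-ASCII marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove stop words and collapse whitespace (split() also swallows "\n").
    return " ".join(word for word in text.split() if word not in STOPWORDS)

print(simple_clean("Dibaba breaks 5,000m world record"))
print(simple_clean("The 25-year-old has smashed the record over 60m hurdles"))
```

This also explains why "5,000m" ends up as "000m" in the cleaned text: the lone "5" is a stand-alone digit block, while "000m" is not.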

## preprocessing

In [13]:
df['text'] = hero.clean(df['text'])
df.head()

Unnamed: 0,text,topic
0,claxton hunting first major medal british hurd...,athletics
1,sullivan could run worlds sonia sullivan indic...,athletics
2,greene sets sights world title maurice greene ...,athletics
3,iaaf launches fight drugs iaaf athletics world...,athletics
4,dibaba breaks 000m world record ethiopia tirun...,athletics


In [20]:
df2 = df.sample(frac=1) # shuffle dataset
df2.head()

Unnamed: 0,text,topic
581,dallaglio man end controversy lawrence dallagl...,rugby
635,tindall aiming earn lions spot bath england ce...,rugby
279,legendary dutch boss michels dies legendary du...,football
113,england slump defeat fourth one day internatio...,cricket
443,stars shine tsunami benefit ronaldinho world x...,football


## Stratified sampling

In [22]:
# first split data and labels
X = df2.pop('text')
display(X[:10])
y = df2.pop('topic')
print(y[:10])

581    dallaglio man end controversy lawrence dallagl...
635    tindall aiming earn lions spot bath england ce...
279    legendary dutch boss michels dies legendary du...
113    england slump defeat fourth one day internatio...
443    stars shine tsunami benefit ronaldinho world x...
685    moya emotional davis cup win carlos moya descr...
448    year remember club football south america cont...
168    england claim historic series win fifth test c...
415    clean sweep impossible mourinho chelsea boss j...
119    auckland set fortwenty20 twenty20 internationa...
Name: text, dtype: object

581       rugby
635       rugby
279    football
113     cricket
443    football
685      tennis
448    football
168     cricket
415    football
119     cricket
Name: topic, dtype: object


In [25]:
# use stratified sampling
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.4, random_state=42, stratify=y)

print(f"X_train.shape - {X_train.shape}")
print(f"X_test.shape - {X_test.shape}")
print(f"y_train.shape - {y_train.shape}")
print(f"y_test.shape - {y_test.shape}")

X_train.shape - (442,)
X_test.shape - (295,)
y_train.shape - (442,)
y_test.shape - (295,)


In [31]:
y_train.value_counts()

football     159
rugby         88
cricket       74
athletics     61
tennis        60
Name: topic, dtype: int64
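A quick sanity check that the stratified split kept the class balance: the sketch below just compares the per-topic share in the full dataset with the share in the train set, using the counts printed by the two `value_counts()` calls. Nothing here is new data; the variable names are only for illustration.

```python
# Class counts copied from the value_counts() outputs in this notebook.
full_counts = {"football": 265, "rugby": 147, "cricket": 124, "athletics": 101, "tennis": 100}
train_counts = {"football": 159, "rugby": 88, "cricket": 74, "athletics": 61, "tennis": 60}

full_total = sum(full_counts.values())    # 737 rows in the full dataset
train_total = sum(train_counts.values())  # 442 rows in the train set

proportions = {}
for topic in full_counts:
    full_share = full_counts[topic] / full_total
    train_share = train_counts[topic] / train_total
    proportions[topic] = (full_share, train_share)
    print(f"{topic:<10} full={full_share:.3f} train={train_share:.3f}")
```

The two shares agree to within a fraction of a percentage point for every topic, which is what `stratify=y` is meant to guarantee.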

Now, the data is pretty much ready.

# BERT Embedding (sentence transformers)
Converting each text to its BERT embedding.

In [32]:
# install required packages
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading https://files.pythonhosted.org/packages/d5/23/833e0620753a36cb2f18e2e4a4f72fd8c49c123c3f07744b69f8a592e083/sentence-transformers-0.3.0.tar.gz (61kB)
Collecting transformers>=3.0.2
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting tokenizers==0.8.1.rc1
  Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-man

In [33]:
# load sentence transformer
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

100%|██████████| 405M/405M [00:18<00:00, 22.4MB/s]


In [37]:
# encode sentences
sentence_embeddings = model.encode(X_train.values)
sentence_embeddings[:1]

[array([-6.93976223e-01,  7.83699453e-01,  9.26330566e-01,  2.49521956e-02,
         2.86809921e-01, -7.39871621e-01,  4.10335571e-01, -1.54682100e-01,
         6.01563770e-05, -2.78232336e-01,  1.45814195e-01,  1.92639157e-01,
         9.29617763e-01,  1.28118753e-01, -7.53376722e-01,  2.47169361e-01,
        -2.85266459e-01, -5.84301129e-02,  9.24009308e-02, -4.99616235e-01,
        -5.45259893e-01, -6.68091774e-01,  4.75908309e-01,  4.17963594e-01,
         1.13358235e+00,  1.28798306e+00,  9.82014313e-02, -5.72121292e-02,
        -8.98514390e-01,  3.94042164e-01, -8.15320909e-01,  1.88305810e-01,
        -3.54986459e-01, -3.03160876e-01, -2.80413061e-01,  7.13681698e-01,
         2.02751562e-01,  1.95934057e-01, -5.06866649e-02, -4.95628655e-01,
        -1.95234478e-01,  1.37083769e-01, -1.45521253e-01, -5.30138135e-01,
        -1.91455436e+00,  6.43194690e-02, -1.02247250e+00,  3.94341618e-01,
         9.29148138e-01, -1.13072026e+00,  1.93696782e-01,  5.29716074e-01,
        -3.0

In [40]:
print(type(sentence_embeddings))
print(len(sentence_embeddings))
print(type(sentence_embeddings[0]))
print(len(sentence_embeddings[0]))

<class 'list'>
442
<class 'numpy.ndarray'>
768


So, each of the 442 training texts is converted to a BERT embedding of dimension 768. Now, the data is ready for clustering.
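Since the encoder output here is a plain Python list of 1-D arrays, it can be handy to stack it into a single 2-D matrix before clustering, so that `.shape` shows samples and features at a glance. A minimal sketch, using random vectors as a stand-in for the real embeddings (`fake_embeddings` and `embedding_matrix` are hypothetical names):

```python
import numpy as np

# Stand-in for the encoder output: a list of 1-D vectors (random values, not real embeddings).
rng = np.random.default_rng(0)
fake_embeddings = [rng.normal(size=768) for _ in range(5)]

# Stack the list into a single (n_samples, n_features) matrix.
embedding_matrix = np.vstack(fake_embeddings)
print(embedding_matrix.shape)
```

scikit-learn estimators will also accept the list directly and convert it internally, so this step is a convenience rather than a requirement.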

# Clustering

### performing KMeans

In [122]:
from sklearn.cluster import KMeans

num_clusters = 5

clustering_model = KMeans(n_clusters=num_clusters, 
                          init ='k-means++',
                          max_iter=300, 
                          random_state=42, 
                          n_init=10)
clustering_model.fit(sentence_embeddings)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)
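For intuition about what the fit is doing, here is a bare-bones sketch of the classic Lloyd iteration behind k-means: assign every point to its nearest centre, then move each centre to the mean of its points. It is not scikit-learn's implementation (no k-means++ seeding, no restarts, no tolerance check); `lloyd_kmeans` is a hypothetical helper and the six 2-D points are toy data.

```python
import numpy as np

def lloyd_kmeans(X, init_centers, n_iter=10):
    """Plain Lloyd iterations: assign to nearest centre, then recompute centres."""
    centers = np.array(init_centers, dtype=float)  # copy, so the caller's array is untouched
    for _ in range(n_iter):
        # Distance from every point to every centre, shape (n_points, k).
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster ends up empty
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centres.
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1), centers

# Toy data: two obvious groups in 2-D.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = lloyd_kmeans(X, init_centers=X[[0, 3]])
print(labels)
print(centers)
```

Assigning an unseen point to a cluster is the same nearest-centre step applied once, without updating the centres.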

### Dimensionality Reduction for Visualisation

In [123]:
# reduce each datapoint from 768 to 3 dimensions via PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
X3D_pca = pca.fit_transform(sentence_embeddings)

In [124]:
X3D_pca

array([[ 6.19280641,  0.52879973, -0.25125083],
       [-5.21612761,  0.61272091,  2.618107  ],
       [ 0.85989599, -1.39813148,  1.22471501],
       ...,
       [-3.07752585,  4.31565073, -0.45486551],
       [-0.35382425, -4.04219549, -1.11109172],
       [-2.48603508,  3.58541882, -3.23355054]])
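As a rough picture of what the PCA projection does, the sketch below centres the data, takes an SVD, and projects onto the top three directions. It also shows why new points should reuse the mean and components learned from the original data instead of being fitted again. This is a NumPy illustration on random stand-in data, not the scikit-learn code path; `pca_project` is a hypothetical helper, and the signs of the components may differ from what `PCA` returns.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its top principal components using an SVD of the centred data."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Share of total variance carried by each direction.
    explained_ratio = (singular_values ** 2) / np.sum(singular_values ** 2)
    return X_centered @ components.T, components, mean, explained_ratio[:n_components]

rng = np.random.default_rng(0)
X_fit = rng.normal(size=(50, 8))  # stand-in for the training embeddings
X_new = rng.normal(size=(10, 8))  # stand-in for embeddings seen later

X_fit_3d, components, mean, ratio = pca_project(X_fit, n_components=3)
# New points reuse the mean and components learned from X_fit.
X_new_3d = (X_new - mean) @ components.T
print(X_fit_3d.shape, X_new_3d.shape, ratio.sum())
```

The summed ratio is worth a glance in practice: if three components carry only a small share of the variance, the 3D plot can make clusters look more mixed (or more separated) than they are in the full 768 dimensions.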

In [125]:
# let's check the cluster assignments next to the original labels
clustered_df = pd.DataFrame({
    'sentence' : X_train.values,
    'cluster' : clustering_model.labels_,
    'orig_label' : y_train.values, 
    'dim_X' : X3D_pca[:, 0],
    'dim_Y' : X3D_pca[:, 1],
    'dim_Z' : X3D_pca[:, 2],
})
clustered_df.head()

Unnamed: 0,sentence,cluster,orig_label,dim_X,dim_Y,dim_Z
0,holmes starts gb events kelly holmes start ser...,3,athletics,6.192806,0.5288,-0.251251
1,sri lankans cleared misconduct two sri lanka c...,4,cricket,-5.216128,0.612721,2.618107
2,collins calls chambers return world 100m champ...,2,athletics,0.859896,-1.398131,1.224715
3,liverpool revel night glory liverpool manager ...,0,football,1.242166,-0.552439,-2.502913
4,henman overcomes rival rusedski tim henman sav...,1,tennis,2.078091,3.45122,1.524083


## Visualising clusters via plotly

In [126]:
import plotly.graph_objects as go

# create new figure
fig = go.Figure()

cluster_label = clustering_model.labels_

# plot data
for cluster in set(cluster_label):
    original_labels = list(clustered_df[clustered_df['cluster']==cluster]['orig_label'].values)
    cl = [cluster] * len(original_labels)
    # print(original_labels)
    fig.add_trace(
        go.Scatter3d(
            x=list(clustered_df[clustered_df['cluster']==cluster]['dim_X'].values), 
            y=list(clustered_df[clustered_df['cluster']==cluster]['dim_Y'].values), 
            z=list(clustered_df[clustered_df['cluster']==cluster]['dim_Z'].values), 
            name=f"Cluster {cluster}",
            mode="markers",
            marker=dict(
                # color=cols,
                size=5,
                # line=dict(width=0.5, color='DarkSlateGrey')
            ), 
            # showlegend=True,
            # to display multiple, variables of custom data
            customdata= tuple(zip(original_labels, cl)),
            # To modify hover data, labels
            hovertemplate='L:%{customdata[0]}<br>C:%{customdata[1]}', 
        )
    )

# for 3D plots, an embedded Scene is used
from plotly.graph_objs.layout import Scene
from plotly.graph_objs.layout.scene import XAxis, YAxis, ZAxis
fig.update_layout(
    showlegend=True,
    legend=dict(
        y=0.99,
        x=0.01,
    ),
    scene=Scene(
        xaxis=XAxis(title='X'),
        yaxis=YAxis(title='Y'),
        zaxis=ZAxis(title='Z')
    ),
    margin=dict(l=0, r=0, b=0, t=0),  # tight Layout
)
fig.show()

So, based on the original labels of the datapoints in each cluster, it looks like the clusters are as follows:
* C0 - Rugby/Football
* C1 - Tennis
* C2 - Rugby/Football
* C3 - Athletics
* C4 - Cricket
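Instead of reading the cluster meanings off the hover labels, the majority label per cluster can also be counted directly. A minimal sketch with the standard library, using made-up (cluster, label) pairs rather than the real notebook output:

```python
from collections import Counter, defaultdict

# Hypothetical (cluster id, original label) pairs, not the real notebook output.
pairs = [
    (0, "rugby"), (0, "rugby"), (0, "football"),
    (1, "tennis"), (1, "tennis"),
    (2, "football"), (2, "football"), (2, "rugby"),
    (3, "athletics"),
    (4, "cricket"), (4, "cricket"),
]

# Count how often each original label appears inside each cluster.
labels_per_cluster = defaultdict(Counter)
for cluster, label in pairs:
    labels_per_cluster[cluster][label] += 1

# Pick the most frequent label in every cluster.
majority_label = {
    cluster: counts.most_common(1)[0][0]
    for cluster, counts in sorted(labels_per_cluster.items())
}
print(majority_label)
```

The per-cluster counters are worth printing too: a cluster where the top two labels are close (as with rugby and football here) is a sign that any single-label mapping will lose accuracy.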

# Label Propagation
Now that we are satisfied with our clusters, we can use the trained model to classify the remaining datapoints.

## preparing test data for prediction

In [128]:
# convert test data to BERT Embeddings
test_sentence_embeddings = model.encode(X_test.values)

In [129]:
# predict labels
y_preds = clustering_model.predict(test_sentence_embeddings)

In [130]:
# reducing dimensions for visualisation
X3D_pca_test = pca.transform(test_sentence_embeddings)

In [131]:
# make a dataframe of the predicted results and check it
test_clustered_df = pd.DataFrame({
    'sentence' : X_test.values,
    'cluster' : y_preds,
    'orig_label' : y_test.values, 
    'dim_X' : X3D_pca_test[:, 0],
    'dim_Y' : X3D_pca_test[:, 1],
    'dim_Z' : X3D_pca_test[:, 2],
})
test_clustered_df.head()

Unnamed: 0,sentence,cluster,orig_label,dim_X,dim_Y,dim_Z
0,benitez issues warning gerrard liverpool manag...,0,football,2.798531,-2.126052,-2.31253
1,dibaba breaks 000m world record ethiopia tirun...,3,athletics,4.443296,1.418262,1.953027
2,wilkinson miss ireland match england take irel...,2,rugby,-1.782344,-1.460831,1.227924
3,johnson announces june retirement former engla...,0,rugby,0.800188,1.005502,-2.018701
4,redknapp poised saints southampton set unveil ...,2,football,-1.740087,-3.323201,0.22011


In [132]:
import plotly.graph_objects as go

# create new figure
fig = go.Figure()

cluster_label = y_preds

# plot data
for cluster in set(y_preds):
    original_labels = list(test_clustered_df[test_clustered_df['cluster']==cluster]['orig_label'].values)
    cl = [cluster] * len(original_labels)
    # print(original_labels)
    fig.add_trace(
        go.Scatter3d(
            x=list(test_clustered_df[test_clustered_df['cluster']==cluster]['dim_X'].values), 
            y=list(test_clustered_df[test_clustered_df['cluster']==cluster]['dim_Y'].values), 
            z=list(test_clustered_df[test_clustered_df['cluster']==cluster]['dim_Z'].values), 
            name=f"Cluster {cluster}",
            mode="markers",
            marker=dict(
                # color=cols,
                size=5,
                # line=dict(width=0.5, color='DarkSlateGrey')
            ), 
            # to display multiple, variables of custom data
            customdata= tuple(zip(original_labels, cl)),
            # To modify hover data, labels
            hovertemplate='L:%{customdata[0]}<br>C:%{customdata[1]}', 
        )
    )

# for 3D plots, an embedded Scene is used
from plotly.graph_objs.layout import Scene
from plotly.graph_objs.layout.scene import XAxis, YAxis, ZAxis
fig.update_layout(
    showlegend=True,
    legend=dict(
        y=0.99,
        x=0.01,
    ),
    scene=Scene(
        xaxis=XAxis(title='X'),
        yaxis=YAxis(title='Y'),
        zaxis=ZAxis(title='Z')
    ),
    margin=dict(l=0, r=0, b=0, t=0),  # tight Layout
)
fig.show()

## Verify result

In [133]:
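# make a label mapper (cluster id -> label)
# clusters 0 and 2 both mix rugby and football, so assigning
# 0 to rugby and 2 to football is a judgement call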
mapping_ = {
    0: 'rugby',
    1: 'tennis',
    2: 'football',
    3: 'athletics',
    4: 'cricket'
}

In [134]:
correct = sum(test_clustered_df['orig_label'] == test_clustered_df['cluster'].map(mapping_))
correct

203

In [135]:
total = test_clustered_df.shape[0]
total

295

In [136]:
percent_correct = correct/total
percent_correct

0.688135593220339

Great! We were able to classify 68.8% of the test datapoints correctly.
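A single overall number hides where the mistakes are. A per-class breakdown, sketched below on made-up labels, shows how the same kind of check could be split by topic. The lists are hypothetical and only illustrate the calculation; they are not the notebook's predictions.

```python
from collections import Counter

# Hypothetical true labels and mapped predictions, for illustration only.
true_labels = ["rugby", "rugby", "football", "football", "football", "tennis", "cricket"]
predicted = ["rugby", "football", "football", "rugby", "football", "tennis", "cricket"]

# Count rows per true label, and how many of them were predicted correctly.
totals = Counter(true_labels)
hits = Counter(t for t, p in zip(true_labels, predicted) if t == p)

per_class_accuracy = {label: hits[label] / totals[label] for label in totals}
overall_accuracy = sum(hits.values()) / len(true_labels)

for label, acc in per_class_accuracy.items():
    print(f"{label:<10} {acc:.2f}")
print(f"overall    {overall_accuracy:.2f}")
```

Given that two of the clusters mixed rugby and football, it would be reasonable to expect most of the misclassified test rows to sit in those two topics, but that needs to be checked on the actual `test_clustered_df`.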