**Unsupervised text classification with BERT embeddings**

This code is taken from:

https://github.com/LaurentVeyssier/Unsupervised-text-classification-with-BERT-embeddings/blob/main/unsupervised_text_classification_with_BERT.ipynb

https://towardsdatascience.com/text-classification-with-no-model-training-935fe0e42180

Other reference:
https://github.com/mdipietro09/DataScience_ArtificialIntelligence_Utils/blob/master/natural_language_processing/example_text_classification.ipynb


*   Setup: import packages, read data.
*   Preprocessing: clean text data.
*   Create Target Clusters: use pre-trained GloVe word vectors (loaded with gensim) to build the target variable.
*   Feature Engineering: Word Embedding with transformers and BERT.
*   Model Design & Testing: assign observations to clusters by Cosine Similarity and evaluate the performance.
*   Explainability: understand how the model produces results.




In [2]:
## for data
import json
import pandas as pd
import numpy as np
from sklearn import metrics, manifold
from tqdm import tqdm

In [4]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 24.8 MB/s            
Collecting regex>=2021.8.3
  Downloading regex-2022.3.15-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (764 kB)
     |████████████████████████████████| 764 kB 79.0 MB/s            
Installing collected packages: regex, nltk
Successfully installed nltk-3.7 regex-2022.3.15
You should consider upgrading via the '/usr/local/bin/python3.8 -m pip install --upgrade pip' command.


In [5]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.1.2-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
     |████████████████████████████████| 24.1 MB 21.7 MB/s            
Collecting smart-open>=1.8.1
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)
     |████████████████████████████████| 58 kB 890 kB/s             
Installing collected packages: smart-open, gensim
Successfully installed gensim-4.1.2 smart-open-5.2.1
You should consider upgrading via the '/usr/local/bin/python3.8 -m pip install --upgrade pip' command.


In [6]:
!pip install tensorflow

Collecting six~=1.15.0
  Downloading six-1.15.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: six
  Attempting uninstall: six
    Found existing installation: six 1.16.0
    Uninstalling six-1.16.0:
      Successfully uninstalled six-1.16.0
Successfully installed six-1.15.0
You should consider upgrading via the '/usr/local/bin/python3.8 -m pip install --upgrade pip' command.


In [7]:
## for processing
import re
import nltk

## for plotting
import matplotlib.pyplot as plt
import seaborn as sns

## for w2v
import gensim
import gensim.downloader as gensim_api


In [8]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
     |████████████████████████████████| 3.8 MB 27.1 MB/s            
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
     |████████████████████████████████| 6.5 MB 73.8 MB/s            
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 60.8 MB/s            
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.0-py3-none-any.whl (77 kB)
     |████████████████████████████████| 77 kB 901 kB/s             
Installing collected packages: tokenizers, sacremoses, huggingface-hub, transformers
Successfully installed huggingface-hub-0.5.0 sacremoses-0.0.49 tokenizers-0.11.6 transformers-4.17.0
You should consider upgrading via the '/usr/local/bin/python3.8 -m pip install --upgrade pip' command.


In [9]:
## for BERT
import transformers
import os

In [10]:
import tensorflow as tf
print(tf.__version__)

2.6.2


In [11]:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
print(tf.config.list_physical_devices('GPU'))
print(tf.test.is_gpu_available())
print(tf.test.is_built_with_cuda()) 

[]
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
False
True


In [18]:
lst_dics = []
with open("News_Category_Dataset_v2.json", mode='r', errors='ignore') as json_file:
    for dic in json_file:
        lst_dics.append( json.loads(dic))
#lst_dics[0]
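
The loop above calls `json.loads` on each line, which implies the file stores one JSON object per line rather than a single JSON array. A minimal sketch of the same pattern on an in-memory stream; the two sample records and the helper name `read_json_lines` are made up for illustration:

```python
import io
import json

# Hypothetical two-record sample in the same one-object-per-line layout
sample = io.StringIO(
    '{"category": "TECH", "headline": "Example headline one"}\n'
    '{"category": "POLITICS", "headline": "Example headline two"}\n'
)

def read_json_lines(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]

records = read_json_lines(sample)
print(len(records), records[0]["category"])
```

Skipping blank lines is an extra safety step in this sketch; the original loop would raise on an empty line.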

In [19]:
## create dtf
dtf = pd.DataFrame(lst_dics)
dtf.head()

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26


In [20]:
print(f'the dataset contains {len(dtf)} news articles')

the dataset contains 200853 news articles


In [21]:
print(f'there are {len(set(dtf.category))} categories in the dataset')
print(set(dtf.category))

there are 41 categories in the dataset
{'MEDIA', 'IMPACT', 'SPORTS', 'ENTERTAINMENT', 'MONEY', 'COMEDY', 'WEDDINGS', 'WORLD NEWS', 'WEIRD NEWS', 'GOOD NEWS', 'WOMEN', 'ENVIRONMENT', 'QUEER VOICES', 'PARENTS', 'FOOD & DRINK', 'TECH', 'EDUCATION', 'ARTS', 'TASTE', 'WORLDPOST', 'CRIME', 'STYLE', 'TRAVEL', 'LATINO VOICES', 'DIVORCE', 'HEALTHY LIVING', 'HOME & LIVING', 'WELLNESS', 'STYLE & BEAUTY', 'COLLEGE', 'SCIENCE', 'CULTURE & ARTS', 'THE WORLDPOST', 'RELIGION', 'POLITICS', 'GREEN', 'FIFTY', 'PARENTING', 'ARTS & CULTURE', 'BUSINESS', 'BLACK VOICES'}


In [22]:
dtf['category'].value_counts()

POLITICS          32739
WELLNESS          17827
ENTERTAINMENT     16058
TRAVEL             9887
STYLE & BEAUTY     9649
PARENTING          8677
HEALTHY LIVING     6694
QUEER VOICES       6314
FOOD & DRINK       6226
BUSINESS           5937
COMEDY             5175
SPORTS             4884
BLACK VOICES       4528
HOME & LIVING      4195
PARENTS            3955
THE WORLDPOST      3664
WEDDINGS           3651
WOMEN              3490
IMPACT             3459
DIVORCE            3426
CRIME              3405
MEDIA              2815
WEIRD NEWS         2670
GREEN              2622
WORLDPOST          2579
RELIGION           2556
STYLE              2254
SCIENCE            2178
WORLD NEWS         2177
TASTE              2096
TECH               2082
MONEY              1707
ARTS               1509
FIFTY              1401
GOOD NEWS          1398
ARTS & CULTURE     1339
ENVIRONMENT        1323
COLLEGE            1144
LATINO VOICES      1129
CULTURE & ARTS     1030
EDUCATION          1004
Name: category, 

In [23]:
## filter categories
dtf = dtf[ dtf["category"].isin(['ENTERTAINMENT','POLITICS','TECH'])        ][["category","headline"]]
## rename columns
dtf = dtf.rename(columns={"category":"category", "headline":"text"})
## print 5 random rows
dtf.sample(5)

Unnamed: 0,category,text
167103,ENTERTAINMENT,"'Hansel and Gretel: Witch Hunters' Reviews, 'P..."
73618,POLITICS,Sanders Ramps Up Spending In Effort To Catch U...
97958,POLITICS,"Judge Scott Walker on His Record, Not His Educ..."
70658,TECH,'Trump Filter' Erases The Donald From Your Chr...
122774,ENTERTAINMENT,Priyanka Chopra: An Exclusive Interview


In [24]:
print(f'the sample dataset now contains {len(dtf)} news articles related to selected categories {set(dtf.category)}')

the sample dataset now contains 50879 news articles related to selected categories {'POLITICS', 'ENTERTAINMENT', 'TECH'}
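
The three selected categories are far from balanced, which is worth keeping in mind when reading any accuracy figure. A quick check using only the counts from the `value_counts()` output earlier in the notebook:

```python
# Counts copied from the value_counts() output for the three selected categories
counts = {"POLITICS": 32739, "ENTERTAINMENT": 16058, "TECH": 2082}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:15} {share:.1%}")

# Share of the largest class: the accuracy obtained by always predicting it
majority_baseline = max(shares.values())
print("majority baseline:", round(majority_baseline, 3))
```

The majority-class share is a simple reference point: a classifier that always answers POLITICS would already reach that accuracy.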



**Text preprocessing**

*   remove punctuation, convert to lower case
*   tokenize text to words
*   remove stopwords (nltk)
*   Stemming and lemmatisation

In [25]:
def preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    '''
    Preprocess a string.
    :parameter
        :param text: string - raw text to clean
        :param lst_stopwords: list - list of stopwords to remove
        :param flg_stemm: bool - whether stemming is to be applied
        :param flg_lemm: bool - whether lemmatisation is to be applied
    :return
        cleaned text
    '''

    ## clean (convert to lowercase and remove punctuations and characters and then strip)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())
            
    ## Tokenize (convert from string to list)
    lst_text = text.split()
    
    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in lst_stopwords]
                
    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm == True:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]
                
    ## Lemmatisation (convert the word into root word)
    if flg_lemm == True:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]
            
    ## back to string from list
    text = " ".join(lst_text)
    return text
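
One side effect of this order of operations is easy to miss: punctuation is stripped before stopwords are filtered, so a contraction such as "you're" turns into "youre" and no longer matches its stopword entry. A small sketch with the standard library only; the headline and the three-word stopword set are made up for illustration (the notebook itself uses the NLTK English list):

```python
import re

# Tiny illustrative stopword set (an assumption, not the NLTK list)
toy_stopwords = {"to", "the", "you're"}

def clean_headline(text, stopwords):
    # lowercase, strip, and drop every character that is neither a word character nor whitespace
    text = re.sub(r"[^\w\s]", "", str(text).lower().strip())
    # tokenize on whitespace and filter stopwords
    return " ".join(word for word in text.split() if word not in stopwords)

cleaned = clean_headline("You're Going To Love The Actor's New Film!", toy_stopwords)
print(cleaned)
```

If such leftovers matter, one option is to filter stopwords before removing punctuation, or to add the apostrophe-free forms to the stopword list.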

In [26]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [27]:
lst_stopwords = nltk.corpus.stopwords.words("english")
lst_stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [28]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [30]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

In [31]:
dtf["text_clean"] = dtf["text"].apply(lambda x: preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=lst_stopwords))
dtf.head()

Unnamed: 0,category,text,text_clean
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,smith join diplo nicky jam 2018 world cup offi...
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,hugh grant marries first time age 57
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,jim carrey blast castrato adam schiff democrat...
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,julianna margulies us donald trump poop bag pi...
5,ENTERTAINMENT,Morgan Freeman 'Devastated' That Sexual Harass...,morgan freeman devastated sexual harassment cl...


Create Target Clusters
Load gensim GloVe word embeddings

In [32]:
nlp = gensim_api.load("glove-wiki-gigaword-300")



In [33]:
# gensim provides a convenient function that returns the most similar words in the vocabulary for any given word
nlp.most_similar(["obama"], topn=3)

[('barack', 0.9254721999168396),
 ('mccain', 0.7590768337249756),
 ('bush', 0.7570988535881042)]

In [34]:
def get_similar_words(lst_words, top, nlp):
    ## copy the seed list so the caller's list is not modified
    lst_out = list(lst_words)
    for tupla in nlp.most_similar(lst_words, topn=top):
        lst_out.append(tupla[0])
    return list(set(lst_out))
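
To make the idea behind the keyword expansion concrete, here is a toy stand-in that ranks candidate words by cosine similarity to the mean direction of the seed words. It is not gensim's implementation: the 2-D vectors, the helper names `unit` and `expand_keywords`, and the vocabulary are all hypothetical.

```python
import numpy as np

# Hypothetical 2-D "embeddings"; the GloVe vectors used in the notebook have 300 dimensions
toy_vectors = {
    "movie":    np.array([1.00, 0.10]),
    "cinema":   np.array([0.95, 0.05]),
    "film":     np.array([0.90, 0.20]),
    "senate":   np.array([0.10, 1.00]),
    "congress": np.array([0.05, 0.90]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def expand_keywords(seeds, vectors, top):
    """Return the seeds plus the `top` words closest to the mean seed direction."""
    query = unit(np.mean([unit(vectors[w]) for w in seeds], axis=0))
    candidates = [w for w in vectors if w not in seeds]
    ranked = sorted(candidates, key=lambda w: float(unit(vectors[w]) @ query), reverse=True)
    # build a new list so the caller's seed list stays untouched
    return list(seeds) + ranked[:top]

seeds = ["movie", "cinema"]
expanded = expand_keywords(seeds, toy_vectors, top=1)
print(expanded)
```

The sketch keeps the seeds first and the neighbours in ranked order, whereas converting to a `set` discards that ordering.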

In [35]:
## Create Dictionary {category:[keywords]}
dic_clusters = {}
dic_clusters["ENTERTAINMENT"] = get_similar_words(['celebrity','cinema','movie','music'], top=30, nlp=nlp)
dic_clusters["POLITICS"] = get_similar_words(['gop','clinton','president','obama','republican'], top=30, nlp=nlp)
dic_clusters["TECH"] = get_similar_words(['amazon','android','app','apple','facebook','google','tech'], top=30, nlp=nlp)

In [36]:
## print the first 5 keywords of each cluster (set order, not ranked by similarity)
print('Top words per label:')
for k,v in dic_clusters.items():
    print("{0:15}..... {1}".format(k, v[0:5]))

Top words per label:
ENTERTAINMENT  ..... ['music', 'comedy', 'shows', 'musical', 'documentary']
POLITICS       ..... ['senate', 'administration', 'democratic', 'congress', 'mccain']
TECH           ..... ['online', 'youtube', 'app', 'users', 'ipod']


Visualize the selected topics with keyword clusters

In [38]:
## word embedding
tot_words = [word for v in dic_clusters.values() for word in v]
X = nlp[tot_words]
        
## t-SNE (initialised with PCA) to project the vectors to 2D
pca = manifold.TSNE(perplexity=40, n_components=2, init='pca')
X = pca.fit_transform(X)

## create dtf
dtf_GloVe = pd.DataFrame()
for k,v in dic_clusters.items():
    size = len(dtf_GloVe) + len(v)
    dtf_group = pd.DataFrame(X[len(dtf_GloVe):size], columns=["x","y"], index=v)
    dtf_group["cluster"] = k
    ## DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    dtf_GloVe = pd.concat([dtf_GloVe, dtf_group])
        


In [None]:
## plot
%matplotlib notebook
fig, ax = plt.subplots(figsize=(15,10))
sns.scatterplot(data=dtf_GloVe, x="x", y="y", hue="cluster", ax=ax)
#ax.legend().texts[0].set_text(None)
ax.legend()
ax.set(xlabel=None, ylabel=None, xticks=[], xticklabels=[], yticks=[], yticklabels=[])
for i in range(len(dtf_GloVe)):
    ax.annotate(dtf_GloVe.index[i], xy=(dtf_GloVe["x"].iloc[i], dtf_GloVe["y"].iloc[i]), xytext=(5,2), textcoords='offset points', ha='right', va='bottom')

plt.show()

In [32]:
%matplotlib inline

Load Transformers BERT model to create embeddings

In [39]:
## for BERT
import transformers
## bert tokenizer
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
## bert model
nlp = transformers.TFBertModel.from_pretrained('bert-base-uncased')

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/511M [00:00<?, ?B/s]

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Create average sentence embeddings

Use BERT Word Embedding to represent each text with an array (shape: number of tokens x 768) and then summarize each article into a mean vector
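
The summarising step is plain mean pooling over the token axis. A minimal sketch on a made-up token matrix, using 4 dimensions instead of 768 to keep it readable:

```python
import numpy as np

# Hypothetical token embeddings for a 3-token text (4 dimensions instead of BERT's 768)
token_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 4.0],
    [2.0, 4.0, 4.0, 4.0],
])

# Average over the token axis: (n_tokens, dim) -> (dim,)
text_vector = token_vectors.mean(axis=0)
print(text_vector.shape, text_vector)
```

Every text ends up with a vector of the same length regardless of how many tokens it has, which is what allows the vectors to be stacked into one feature matrix.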




In [40]:
## function to apply
def utils_bert_embedding(txt, tokenizer, nlp):
    '''
    Word embedding with Bert (equivalent to nlp["word"]).
    :parameter
        :param txt: string (or list of tokens)
        :param tokenizer: transformers tokenizer
        :param nlp: transformers bert
    :return
        array of shape (number of tokens x 768), special tokens excluded
    '''
    # tokenize sentence to token ids (integers)
    idx = tokenizer.encode(txt)
    # convert to array of shape (1, num_tokens+2) - [CLS] and [SEP] are added
    idx = np.array(idx)[None,:]
    # generate embeddings for each token - the output can be indexed like a tuple
    embedding = nlp(idx)
    # select the first member (last hidden state), drop the batch dimension of size 1
    # and exclude the [CLS] and [SEP] tokens to get (num_tokens, 768)
    X = np.array(embedding[0][0][1:-1])
    return X

The cell below takes about 2.5 hours to run without a GPU (wall time: 2h 27min).

In [41]:
%%time
## create list of news vector
lst_mean_vecs = [utils_bert_embedding(txt, tokenizer, nlp).mean(0) for txt in tqdm(dtf["text_clean"])]

  ret = um.true_divide(
100%|██████████| 50879/50879 [2:27:30<00:00,  5.75it/s]  

CPU times: user 3h 56s, sys: 4min 29s, total: 3h 5min 25s
Wall time: 2h 27min 30s





In [None]:
lst_mean_vecs

In [42]:
## create the feature matrix (n news x 768)
X = np.array(lst_mean_vecs)
X.shape

(50879, 768)

In [44]:
X

array([[ 0.30438578,  0.07948477,  0.67639226, ...,  0.06270271,
        -0.20794424, -0.33524278],
       [-0.3704354 , -0.13170199,  0.8771482 , ..., -0.81547123,
         0.05773297, -0.32166442],
       [-0.45270362, -0.05354853,  0.08928619, ...,  0.1096567 ,
        -0.00415162, -0.13247356],
       ...,
       [-0.07823105, -0.3392273 ,  0.02913952, ..., -0.16773064,
         0.04754758, -0.050371  ],
       [ 0.1271344 , -0.7259398 ,  0.6615607 , ...,  0.21199965,
        -0.04013522, -0.3946028 ],
       [-0.36915013, -0.13852541,  0.42212173, ..., -0.24667107,
        -0.05770321, -0.2588567 ]], dtype=float32)

In [45]:
# Create y as {label:mean_vector}
dic_y = {k:utils_bert_embedding(v, tokenizer, nlp).mean(0) for k,v in tqdm(dic_clusters.items())}

100%|██████████| 3/3 [00:00<00:00,  4.21it/s]


In [46]:
dic_clusters['ENTERTAINMENT']

['music',
 'comedy',
 'shows',
 'musical',
 'documentary',
 'actors',
 'movies',
 'tv',
 'premiere',
 'genre',
 'feature',
 'drama',
 'audiences',
 'celebrity',
 'theatrical',
 'studio',
 'films',
 'dance',
 'entertainment',
 'hollywood',
 'cinema',
 'screen',
 'movie',
 'bollywood',
 'featured',
 'filmmakers',
 'film',
 'concert',
 'television',
 'video',
 'videos',
 'theater',
 'soundtrack',
 'pop']

In [47]:
dic_y['ENTERTAINMENT'].shape

(768,)

In [48]:
dic_y['ENTERTAINMENT']

array([ 5.81588268e-01,  2.15897441e-01,  5.72971404e-01,  3.22584510e-02,
        2.56296601e-02, -3.86683613e-01,  1.54019356e-01,  2.92288954e-03,
       -5.32868266e-01,  1.96008325e-01, -1.62513077e-01, -1.18611082e-01,
        1.96341872e-01,  2.80803293e-01, -1.64656833e-01,  4.44757044e-01,
       -1.15779988e-01, -5.10795452e-02,  3.17152925e-02,  2.30614111e-01,
        7.61869457e-03, -7.14418888e-02,  3.65447074e-01,  2.37631232e-01,
       -4.16093111e-01, -2.28877351e-01, -1.15503341e-01,  1.98854759e-01,
        9.11936611e-02, -9.20143425e-02,  4.90956545e-01, -7.64795439e-03,
        1.22004539e-01,  4.39825147e-01, -6.10453188e-02, -1.41916022e-01,
        1.87735885e-01, -6.39803410e-02,  1.31476119e-01, -4.27391529e-02,
       -1.45539761e-01, -1.02295840e+00,  3.96826088e-01, -3.33147913e-01,
       -4.70207393e-01, -6.50720060e-01,  1.22366525e-01,  4.41449672e-01,
        1.02106910e-02,  5.03539443e-01,  3.04455429e-01,  4.62208450e-01,
       -2.95029998e-01, -

### Model design & predict text classification

In [49]:
def fix_NAN_inf_values(x):
    '''Replace NaN with zero and infinity with large finite numbers'''
    ## use the argument (not the global X) and always return an array
    return np.nan_to_num(x)
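
One plausible source of NaN values in the feature matrix is a headline that becomes empty after cleaning: averaging zero token vectors divides by zero. The sketch below reproduces that situation on made-up data and shows what `np.nan_to_num` does to it; the shapes and values are assumptions for illustration only.

```python
import warnings
import numpy as np

# An empty cleaned text yields zero token vectors: shape (0, dim)
empty_tokens = np.empty((0, 4))

with warnings.catch_warnings():
    warnings.simplefilter("ignore")           # silence the "Mean of empty slice" warning
    with np.errstate(all="ignore"):
        mean_vec = empty_tokens.mean(axis=0)  # every entry is NaN

# Stack the NaN row with a second hypothetical row that contains an infinity
features = np.vstack([mean_vec, [0.5, -1.0, np.inf, 2.0]])
cleaned = np.nan_to_num(features)             # NaN -> 0, +/-inf -> largest finite float

print(np.isnan(features).any(), np.isnan(cleaned).any())
```

A row of zeros has no direction, so its cosine similarity to every label vector is 0, which is the case the random assignment in the next cell handles.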

In [50]:
## compute cosine similarities
## Output matrix with shape: number of news x number of labels (3, Entertainment/Politics/Tech). To put it another way, each row will represent an article and contain one similarity score for each target cluster.
similarities = np.array([metrics.pairwise.cosine_similarity(fix_NAN_inf_values(X), y.reshape(1,-1)).T.tolist()[0] for y in dic_y.values()]).T
print(similarities.shape)

## adjust and rescale
labels = list(dic_y.keys())
for i in range(len(similarities)):
    ### assign randomly if there is no similarity
    if sum(similarities[i]) == 0:
        similarities[i] = [0]*len(labels)
        similarities[i][np.random.choice(range(len(labels)))] = 1
    ### rescale so they sum = 1
    similarities[i] = similarities[i] / sum(similarities[i])

## classify the label with highest similarity score
predicted_prob = similarities
predicted = [labels[np.argmax(pred)] for pred in predicted_prob]

(50879, 3)
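
The classification rule in this cell can be followed step by step on a toy example: compute cosine similarities between documents and label vectors, rescale each row to sum to 1, then pick the label with the highest score. The sketch uses NumPy only; the 2-D vectors, the label names, and the helper `cosine_similarity_matrix` are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(A, B):
    """Cosine similarity between every row of A and every row of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T

# Hypothetical 2-D document vectors and label vectors
docs = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
label_names = ["A", "B"]
label_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])

sims = cosine_similarity_matrix(docs, label_vecs)   # shape (3 docs, 2 labels)
scores = sims / sims.sum(axis=1, keepdims=True)     # each row now sums to 1
predicted = [label_names[i] for i in scores.argmax(axis=1)]
print(predicted)
```

Dividing a row by a positive sum does not change which label wins. Cosine similarities can be negative, though, so the rescaled scores are better read as relative scores than as true probabilities.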


### Evaluate predictions

In [51]:
y_test = dtf["category"].values
classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values

## Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, predicted)
auc = metrics.roc_auc_score(y_test, predicted_prob, multi_class="ovr")
print("Accuracy:",  round(accuracy,2))
print("Auc:", round(auc,2))
print("Detail:")
print(metrics.classification_report(y_test, predicted))
    
## Plot confusion matrix
cm = metrics.confusion_matrix(y_test, predicted)

fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues, cbar=False)
ax.set(xlabel="Pred", ylabel="True", xticklabels=classes, yticklabels=classes, title="Confusion matrix")
plt.yticks(rotation=0)

fig, ax = plt.subplots(nrows=1, ncols=2,figsize=(20,10))
## Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i], predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3,  label='{0} (area={1:0.2f})'.format(classes[i], metrics.auc(fpr, tpr)))
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05], 
          xlabel='False Positive Rate', 
          ylabel="True Positive Rate (Recall)", 
          title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)
    
## Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3, label='{0} (area={1:0.2f})'.format(classes[i], metrics.auc(recall, precision)))
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall', ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()

Accuracy: 0.59
Auc: 0.83
Detail:
               precision    recall  f1-score   support

ENTERTAINMENT       0.44      0.93      0.59     16058
     POLITICS       0.94      0.44      0.60     32739
         TECH       0.42      0.29      0.35      2082

     accuracy                           0.59     50879
    macro avg       0.60      0.55      0.51     50879
 weighted avg       0.76      0.59      0.59     50879
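
The per-class precision and recall in the report follow directly from counting (true, predicted) pairs. A small sketch on made-up labels, written with the standard library so each number can be checked by hand; E, P and T are hypothetical stand-ins for the three categories:

```python
from collections import Counter

# Hypothetical labels: E = entertainment, P = politics, T = tech
y_true = ["E", "E", "P", "P", "P", "T"]
y_pred = ["E", "E", "E", "P", "P", "E"]

pairs = Counter(zip(y_true, y_pred))   # (true, predicted) -> count

def precision(label):
    """Of everything predicted as `label`, the fraction that really is `label`."""
    predicted_as = sum(n for (t, p), n in pairs.items() if p == label)
    return pairs[(label, label)] / predicted_as if predicted_as else 0.0

def recall(label):
    """Of everything that really is `label`, the fraction predicted as `label`."""
    actually = sum(n for (t, p), n in pairs.items() if t == label)
    return pairs[(label, label)] / actually if actually else 0.0

accuracy = sum(n for (t, p), n in pairs.items() if t == p) / len(y_true)
for label in ["E", "P", "T"]:
    print(label, round(precision(label), 2), round(recall(label), 2))
print("accuracy", round(accuracy, 2))
```

In the toy data, class E absorbs items from the other classes, giving it perfect recall but only 50% precision. The report above shows the same shape for ENTERTAINMENT (high recall, low precision) and the mirror image for POLITICS.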




### Explain predictions

In [52]:
index = 7
txt_instance = dtf["text_clean"].iloc[index]
print("True:", y_test[index], "--> Pred:", predicted[index], "| Similarity:", round(np.max(predicted_prob[index]),2))
print(txt_instance)

True: ENTERTAINMENT --> Pred: ENTERTAINMENT | Similarity: 0.39
mike myers reveals hed like fourth austin power film


In [53]:
#%matplotlib notebook

In [54]:
def embedding_bert(x, tokenizer=None, nlp=None):
    '''
    Creates a feature matrix (num_docs x vector_size)
    :parameter
        :param x: string or list
        :param tokenizer: transformers tokenizer
        :param nlp: transformers bert
    :return
        vector or matrix 
    '''
    tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased') if tokenizer is None else tokenizer
    nlp = transformers.TFBertModel.from_pretrained('bert-base-uncased') if nlp is None else nlp
    
    ## single word --> vec (size,)
    if (type(x) is str) and (len(x.split()) == 1):
        X = utils_bert_embedding(x, tokenizer, nlp).reshape(-1)
    
    ## list of words --> matrix (n, size)
    elif (type(x) is list) and (type(x[0]) is str) and (len(x[0].split()) == 1):
        X = utils_bert_embedding(x, tokenizer, nlp)
    
    ## list of lists of words --> matrix (n mean vectors, size)
    elif (type(x) is list) and (type(x[0]) is list):
        lst_mean_vecs = [utils_bert_embedding(lst, tokenizer, nlp).mean(0) for lst in x]
        X = np.array(lst_mean_vecs)
    
    ## single text --> matrix (n words, size)
    elif (type(x) is str) and (len(x.split()) > 1):
        X = utils_bert_embedding(x, tokenizer, nlp)
        
    ## list of texts --> matrix (n mean vectors, size)
    else:
        lst_mean_vecs = [utils_bert_embedding(txt, tokenizer, nlp).mean(0) for txt in x]
        X = np.array(lst_mean_vecs)
    return X

In [55]:
## create embedding Matrix
y = np.concatenate([embedding_bert(v, tokenizer, nlp) for v in dic_clusters.values()])
X = embedding_bert(txt_instance, tokenizer, nlp).mean(0).reshape(1,-1)
M = np.concatenate([y,X])

## t-SNE (initialised with PCA) to project the vectors to 2D
pca = manifold.TSNE(perplexity=40, n_components=2, init='pca')
M = pca.fit_transform(M)
y, X = M[:len(y)], M[len(y):]

## create dtf clusters
df = pd.DataFrame()
for k,v in dic_clusters.items():
    size = len(df) + len(v)
    df_group = pd.DataFrame(y[len(df):size], columns=["x","y"], index=v)
    df_group["cluster"] = k
    ## DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    df = pd.concat([df, df_group])





In [60]:
M

array([[-4.1497364 , -0.499701  ],
       [-3.8235395 ,  0.03800524],
       [-2.5020509 , -1.8718694 ],
       [-4.028118  , -0.41711697],
       [-1.76736   , -0.9576222 ],
       [-1.182559  , -1.8271803 ],
       [-2.1045501 , -1.6285963 ],
       [-2.808436  , -0.69282365],
       [-0.85447836, -1.2688786 ],
       [-1.1744214 , -0.6683929 ],
       [-1.2859658 , -0.5544553 ],
       [-2.6324418 , -0.21014084],
       [-1.6748973 , -1.8104311 ],
       [-1.1903776 , -1.4961988 ],
       [-2.2046163 , -0.6068827 ],
       [-1.2783847 , -0.81805634],
       [-2.0083647 , -1.6219627 ],
       [-3.306976  ,  0.33565298],
       [-1.5034437 , -1.4557489 ],
       [-1.9614013 , -0.2871102 ],
       [-1.6785443 , -0.19324078],
       [-0.8371223 , -0.9053245 ],
       [-1.83787   , -0.827031  ],
       [-2.3150008 ,  0.24752063],
       [-0.3533445 , -1.2055366 ],
       [-1.8619164 , -2.7481833 ],
       [-2.1064074 , -0.9302307 ],
       [-3.2610724 , -1.2524972 ],
       [-3.038276  ,

In [59]:
X

array([[-0.23419018, -1.9886333 ]], dtype=float32)

In [58]:
y

array([[-4.1497364 , -0.499701  ],
       [-3.8235395 ,  0.03800524],
       [-2.5020509 , -1.8718694 ],
       [-4.028118  , -0.41711697],
       [-1.76736   , -0.9576222 ],
       [-1.182559  , -1.8271803 ],
       [-2.1045501 , -1.6285963 ],
       [-2.808436  , -0.69282365],
       [-0.85447836, -1.2688786 ],
       [-1.1744214 , -0.6683929 ],
       [-1.2859658 , -0.5544553 ],
       [-2.6324418 , -0.21014084],
       [-1.6748973 , -1.8104311 ],
       [-1.1903776 , -1.4961988 ],
       [-2.2046163 , -0.6068827 ],
       [-1.2783847 , -0.81805634],
       [-2.0083647 , -1.6219627 ],
       [-3.306976  ,  0.33565298],
       [-1.5034437 , -1.4557489 ],
       [-1.9614013 , -0.2871102 ],
       [-1.6785443 , -0.19324078],
       [-0.8371223 , -0.9053245 ],
       [-1.83787   , -0.827031  ],
       [-2.3150008 ,  0.24752063],
       [-0.3533445 , -1.2055366 ],
       [-1.8619164 , -2.7481833 ],
       [-2.1064074 , -0.9302307 ],
       [-3.2610724 , -1.2524972 ],
       [-3.038276  ,

In [56]:
## plot clusters
fig, ax = plt.subplots(figsize=(10,7))
sns.scatterplot(data=df, x="x", y="y", hue="cluster", ax=ax)
#ax.legend().texts[0].set_text(None)
ax.legend()
ax.set(xlabel=None, ylabel=None, xticks=[], xticklabels=[], yticks=[], yticklabels=[])
for i in range(len(df)):
    ax.annotate(df.index[i], 
               xy=(df["x"].iloc[i],df["y"].iloc[i]), 
               xytext=(5,2), textcoords='offset points', 
               ha='right', va='bottom')

## add txt_instance
ax.scatter(x=X[0][0], y=X[0][1], c="red", linewidth=10)
ax.annotate("x", xy=(X[0][0],X[0][1]), ha='center', va='center', fontsize=25)

## calculate similarity
sim_matrix = metrics.pairwise.cosine_similarity(X, y)

## add top similarity
for row in range(sim_matrix.shape[0]):
    ### sorted {keyword:score}
    dic_sim = {n:sim_matrix[row][n] for n in range(sim_matrix.shape[1])}
    dic_sim = {k:v for k,v in sorted(dic_sim.items(), key=lambda item:item[1], reverse=True)}
    ### plot lines
    for k in dict(list(dic_sim.items())[0:5]).keys():
        p1 = [X[row][0], X[row][1]]
        p2 = [y[k][0], y[k][1]]
        ax.plot([p1[0],p2[0]], [p1[1],p2[1]], c="red", alpha=0.5)
plt.show()
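
The plotting cell sorts a dictionary of scores to find the five keywords most similar to the text. The same selection can be written with `np.argsort`; a sketch on made-up scores:

```python
import numpy as np

# Hypothetical similarity scores of one text against 8 keywords
sims = np.array([0.10, 0.85, 0.40, 0.95, 0.05, 0.60, 0.30, 0.70])
k = 5

# argsort is ascending, so reverse it and keep the first k indices (best first)
top_k = np.argsort(sims)[::-1][:k]
print(top_k.tolist())
```

The resulting indices can be used directly to look up keyword coordinates, in the same way `k` indexes `y` in the loop above.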
