### Data Set: 
https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv

The AG's news topic classification dataset is constructed by choosing the four largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 test samples, for a total of 120,000 training samples and 7,600 test samples.

The classes are: 

* World
* Sports
* Business
* Science/Technology

#### For more information on how to use Lbl2Vec, visit the [API Guide](https://lbl2vec.readthedocs.io/en/latest/api.html#)

In [1]:
from lbl2vec import Lbl2Vec
import pandas as pd
from gensim.utils import simple_preprocess
from gensim.models.doc2vec import TaggedDocument
from gensim.parsing.preprocessing import strip_tags
from sklearn.metrics import f1_score

### Load data

In [2]:
# load train data
ag_train = pd.read_csv('data/train.csv',sep=',',header=None, names=['class','title','description'])

# load test data
ag_test = pd.read_csv('data/test.csv',sep=',',header=None, names=['class','title','description'])

# load labels with keywords
labels = pd.read_csv('data/labels.csv',sep=';')

# split the space-separated keyword string of each class into a list
labels['keywords'] = labels['keywords'].apply(lambda x: x.split(' '))

# convert the keywords of each class to lowercase
labels['keywords'] = labels['keywords'].apply(lambda description_keywords: [keyword.lower() for keyword in description_keywords])

# get number of keywords for each class
labels['number_of_keywords'] = labels['keywords'].apply(len)
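
The loading code above implies a particular layout for `labels.csv`: semicolon-separated columns named `class_index`, `class_name` and `keywords`, with the keywords of each class stored as one space-separated string. The sketch below parses a small inline sample of that assumed layout using only the standard library. The sample rows and the helper name `parse_labels` are illustrative, not the contents of the real file.

```python
import csv
import io

# Illustrative stand-in for data/labels.csv (assumed layout, not the real file)
SAMPLE_LABELS_CSV = """class_index;class_name;keywords
1;World;Election State President Police Politics
2;Sports;Olympic Football Sport League Baseball
"""

def parse_labels(text):
    """Parse semicolon-separated label rows into dicts with a lowercase keyword list."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        # keywords are stored as one space-separated string per class
        keywords = [kw.lower() for kw in row["keywords"].split(" ")]
        rows.append({
            "class_index": int(row["class_index"]),
            "class_name": row["class_name"],
            "keywords": keywords,
            "number_of_keywords": len(keywords),
        })
    return rows

sample_labels = parse_labels(SAMPLE_LABELS_CSV)
for label in sample_labels:
    print(label["class_name"], label["keywords"])
```

The resulting `keywords` lists have the same shape as the lists later passed to `keywords_list`: one list of lowercase keywords per class, in the same order as the class names.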

In [3]:
labels

| | class_index | class_name | keywords | number_of_keywords |
|---|---|---|---|---|
| 0 | 1 | World | [election, state, president, police, politics,... | 11 |
| 1 | 2 | Sports | [olympic, football, sport, league, baseball, r... | 32 |
| 2 | 3 | Business | [company, market, oil, consumers, exchange, bu... | 10 |
| 3 | 4 | Science/Technology | [laboratory, computers, science, technology, w... | 18 |


### Tokenize data

In [4]:
# doc: document text string
# returns the tokenized document as a list of tokens
# strip_tags removes HTML-style tags (e.g. <b>) from the text
# simple_preprocess converts a document into a list of lowercase tokens, ignoring tokens shorter than min_len or longer than max_len characters
# simple_preprocess also drops numerical values and punctuation characters; deacc=True removes accent marks
def tokenize(doc):
    return simple_preprocess(strip_tags(doc), deacc=True, min_len=2, max_len=15)
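
To make the effect of `tokenize` concrete, here is a rough standard-library approximation of the same pipeline: strip tags, lowercase, remove accents, keep alphabetic tokens within the length bounds. It is a sketch for illustration only, not gensim's implementation. The regular expressions and the names `approx_tokenize` and `strip_accents` are assumptions made for this example.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")            # anything that looks like <tag>
TOKEN_RE = re.compile(r"(?:(?!\d)\w)+")    # runs of word characters, digits excluded

def strip_accents(text):
    """Remove combining accent marks, e.g. 'ü' becomes 'u'."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

def approx_tokenize(doc, min_len=2, max_len=15):
    """Approximate tokenize(): lowercase alphabetic tokens of bounded length."""
    text = strip_accents(TAG_RE.sub("", doc).lower())
    return [
        token for token in TOKEN_RE.findall(text)
        if min_len <= len(token) <= max_len and not token.startswith("_")
    ]

print(approx_tokenize("<b>Oil</b> prices soar 20% in Zürich, a record!"))
# ['oil', 'prices', 'soar', 'in', 'zurich', 'record']
```

The number `20`, the punctuation and the one-letter token `a` are all dropped, which matches the behaviour described in the comments of the cell above.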

In [5]:
# add data set type column
ag_train['data_set_type'] = 'train'
ag_test['data_set_type'] = 'test'

# concat train and test data
ag_full_corpus = pd.concat([ag_train,ag_test]).reset_index(drop=True)

In [6]:
# tokenize the combined title + description of each document and tag it with its row index for Lbl2Vec training
ag_full_corpus['tagged_docs'] = ag_full_corpus.apply(lambda row: TaggedDocument(tokenize(row['title'] + '. ' + row['description']), [str(row.name)]), axis=1)

In [7]:
# add doc_key column
ag_full_corpus['doc_key'] = ag_full_corpus.index.astype(str)

In [8]:
# add class_name column
ag_full_corpus = ag_full_corpus.merge(labels, left_on='class', right_on='class_index', how='left').drop(['class', 'keywords'], axis=1)

In [9]:
ag_full_corpus.head()

| | title | description | data_set_type | tagged_docs | doc_key | class_index | class_name | number_of_keywords |
|---|---|---|---|---|---|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... | train | ([wall, st, bears, claw, back, into, the, blac... | 0 | 3 | Business | 10 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | train | ([carlyle, looks, toward, commercial, aerospac... | 1 | 3 | Business | 10 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... | train | ([oil, and, economy, cloud, stocks, outlook, r... | 2 | 3 | Business | 10 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... | train | ([iraq, halts, oil, exports, from, main, south... | 3 | 3 | Business | 10 |
| 4 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | train | ([oil, prices, soar, to, all, time, record, po... | 4 | 3 | Business | 10 |


# Train Lbl2Vec

Train a new model from scratch with the following parameters:
* keywords_list : iterable list of lists with descriptive keywords for each topic.
* tagged_documents : iterable list of [gensim.models.doc2vec.TaggedDocument](https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument) elements. Each element consists of one document.
* label_names : iterable list of custom names for each label. Label names and keywords of the same topic must have the same index.
* similarity_threshold : only documents whose similarity to the respective description keywords exceeds this threshold are used to calculate the label embedding.
* min_num_docs : minimum number of documents that are used to calculate the label embedding. 
* epochs : number of iterations over the corpus.
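
The interplay of `similarity_threshold` and `min_num_docs` can be pictured with a small pure-Python sketch. It is one plausible reading of the parameter descriptions above, not Lbl2Vec's actual implementation: documents are ranked by cosine similarity to the centroid of the keyword vectors, those above the threshold are kept, and if fewer than `min_num_docs` remain, the top `min_num_docs` documents are used instead (this fallback is an assumption). All function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def label_embedding(keyword_vecs, doc_vecs, similarity_threshold=0.30, min_num_docs=2):
    """Average the document vectors that are close enough to the keyword centroid."""
    query = centroid(keyword_vecs)
    ranked = sorted(doc_vecs, key=lambda d: cosine(query, d), reverse=True)
    selected = [d for d in ranked if cosine(query, d) > similarity_threshold]
    if len(selected) < min_num_docs:
        # assumed fallback: always use at least min_num_docs documents
        selected = ranked[:min_num_docs]
    return centroid(selected)

# toy 2-dimensional vectors, purely illustrative
keyword_vecs = [[1.0, 0.0], [0.8, 0.2]]
doc_vecs = [[0.9, 0.1], [1.0, 0.3], [-1.0, 0.2], [0.0, -1.0]]
label_vec = label_embedding(keyword_vecs, doc_vecs)
print(label_vec)
```

In this toy example only the first two documents point in roughly the same direction as the keywords, so the label vector is their mean. Raising the threshold shrinks the set of contributing documents until the `min_num_docs` floor takes over.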

In [10]:
# init model with parameters
lbl2vec_model = Lbl2Vec(
    keywords_list=list(labels['keywords']),
    tagged_documents=ag_full_corpus['tagged_docs'][ag_full_corpus['data_set_type']=='train'],
    label_names=list(labels['class_name']),
    similarity_threshold=0.30,
    min_num_docs=100,
    epochs=10,
)

In [11]:
# train model
lbl2vec_model.fit()

2021-07-20 08:58:26,472 - Lbl2Vec - INFO - Train document and word embeddings
2021-07-20 09:00:21,881 - Lbl2Vec - INFO - Train label embeddings


# Predict topics of documents used to train Lbl2Vec

Compute the similarity between the learned vector of each document that was used to train the model and each of the learned label vectors. The scores are cosine similarities and therefore lie in the range [-1, 1].
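
As a minimal illustration of how such scores arise, the sketch below computes the cosine similarity of one document vector to a few label vectors and picks the best match. The vectors and the helper `most_similar_label` are made up for this example. Only the idea of taking the label with the highest cosine similarity comes from the description above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; always within [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_label(doc_vec, label_vecs):
    """Return the best label name and the similarity score for every label."""
    scores = {name: cosine_similarity(doc_vec, vec) for name, vec in label_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# toy 2-dimensional label vectors, purely illustrative
toy_label_vecs = {"World": [1.0, 0.0], "Sports": [0.0, 1.0], "Business": [-1.0, 0.0]}
best_label, label_scores = most_similar_label([3.0, 4.0], toy_label_vecs)
print(best_label, label_scores)
```

Because cosine similarity ignores vector length, the document vector `[3.0, 4.0]` scores the same as `[0.6, 0.8]` would. A negative score, as for the toy `Business` vector here, means the document points away from that label.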

In [12]:
# predict similarity scores
model_docs_lbl_similarities = lbl2vec_model.predict_model_docs()

2021-07-20 09:01:58,297 - Lbl2Vec - INFO - Get document embeddings from model
2021-07-20 09:01:58,476 - Lbl2Vec - INFO - Calculate document<->label similarities


In [13]:
model_docs_lbl_similarities.head()

| | doc_key | most_similar_label | highest_similarity_score | World | Sports | Business | Science/Technology |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Sports | 0.435321 | 0.286335 | 0.435321 | 0.416035 | 0.417741 |
| 1 | 1 | Science/Technology | 0.455163 | 0.207111 | 0.295462 | 0.441148 | 0.455163 |
| 2 | 2 | Business | 0.519264 | 0.268941 | 0.348006 | 0.519264 | 0.409859 |
| 3 | 3 | World | 0.548215 | 0.548215 | 0.142131 | 0.358767 | 0.266806 |
| 4 | 4 | Business | 0.40946 | 0.341938 | 0.369328 | 0.40946 | 0.328527 |


## Evaluate prediction of documents used to train Lbl2Vec

In [14]:
# merge DataFrames to compare the predicted and true topic labels
evaluation_train = model_docs_lbl_similarities.merge(ag_full_corpus[ag_full_corpus['data_set_type']=='train'], left_on='doc_key', right_on='doc_key')

In [15]:
y_true_train = evaluation_train['class_name']
y_pred_train = evaluation_train['most_similar_label']
print('F1 score:',f1_score(y_true_train, y_pred_train, average='micro'))

F1 score: 0.8255416666666666
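
For single-label multiclass data such as this, the micro-averaged F1 score coincides with plain accuracy: every misclassified document counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so micro precision and micro recall are equal. The following standard-library sketch shows this on toy labels. The function `micro_f1` is a hypothetical helper written for this example, not part of scikit-learn.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label multiclass predictions."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    # each wrong prediction is one false positive and one false negative
    fp = fn = len(y_true) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

toy_true = ["World", "Sports", "Business", "Sports"]
toy_pred = ["World", "Sports", "Sports", "Business"]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(micro_f1(toy_true, toy_pred), toy_accuracy)
```

Read this way, the score reported above is the fraction of training documents whose most similar label matches the true class.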


# Predict topics of unknown documents

Learn document vectors for new documents that were **not** used to train the model, then compute the similarity of each new document vector to each of the learned label vectors. The scores are cosine similarities and therefore lie in the range [-1, 1].

In [16]:
# predict similarity scores of new test documents (they were not used during Lbl2Vec training)
new_docs_lbl_similarities = lbl2vec_model.predict_new_docs(tagged_docs=ag_full_corpus['tagged_docs'][ag_full_corpus['data_set_type']=='test'])

2021-07-20 09:03:57,434 - Lbl2Vec - INFO - Calculate document embeddings
2021-07-20 09:04:01,631 - Lbl2Vec - INFO - Calculate document<->label similarities


In [17]:
new_docs_lbl_similarities.head()

| | doc_key | most_similar_label | highest_similarity_score | World | Sports | Business | Science/Technology |
|---|---|---|---|---|---|---|---|
| 0 | 120000 | Business | 0.339817 | 0.333314 | 0.241052 | 0.339817 | 0.256396 |
| 1 | 120001 | Science/Technology | 0.370131 | 0.240881 | 0.293863 | 0.305717 | 0.370131 |
| 2 | 120002 | Science/Technology | 0.40708 | 0.241578 | 0.297648 | 0.350758 | 0.40708 |
| 3 | 120003 | Sports | 0.349194 | 0.208357 | 0.349194 | 0.275795 | 0.297635 |
| 4 | 120004 | Science/Technology | 0.339072 | 0.31811 | 0.257425 | 0.325634 | 0.339072 |


## Evaluate prediction of new documents

In [18]:
# merge DataFrames to compare the predicted and true topic labels
evaluation_test = new_docs_lbl_similarities.merge(ag_full_corpus[ag_full_corpus['data_set_type']=='test'], left_on='doc_key', right_on='doc_key')

In [19]:
y_true_test = evaluation_test['class_name']
y_pred_test = evaluation_test['most_similar_label']
print('F1 score:',f1_score(y_true_test, y_pred_test, average='micro'))

F1 score: 0.8196052631578947
