### Data Set: 
https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv

The AG's news topic classification dataset is constructed by choosing the four largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 test samples, for a total of 120,000 training samples and 7,600 test samples.

The classes are: 

* World
* Sports
* Business
* Science/Technology
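
As a quick sanity check, the totals quoted above follow directly from the per-class sizes. The variable names below are illustrative only and are not part of the notebook.

```python
# Class names and per-class sizes as stated in the dataset description above.
classes = ["World", "Sports", "Business", "Science/Technology"]
train_per_class = 30_000
test_per_class = 1_900

# Totals are simply the per-class size times the number of classes.
total_train = len(classes) * train_per_class
total_test = len(classes) * test_per_class

print(total_train, total_test)
```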

#### For more information on how to use Lbl2Vec, visit the [API Guide](https://lbl2vec.readthedocs.io/en/latest/api.html#)

In [1]:
from lbl2vec import Lbl2Vec
import pandas as pd
from gensim.utils import simple_preprocess
from gensim.models.doc2vec import TaggedDocument
from gensim.parsing.preprocessing import strip_tags, STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer
import re

### Load data

In [2]:
# load train data
ag_train = pd.read_csv('data/train.csv',sep=',',header=None, names=['class','title','description'])

# load test data
ag_test = pd.read_csv('data/test.csv',sep=',',header=None, names=['class','title','description'])

# load labels with keywords
labels = pd.read_csv('data/labels.csv',sep=';')

# split keywords by separator and save them as array
labels['keywords'] = labels['keywords'].apply(lambda x: x.split(' '))

# convert description keywords to lowercase
labels['keywords'] = labels['keywords'].apply(lambda description_keywords: [keyword.lower() for keyword in description_keywords])

# get number of keywords for each class
labels['number_of_keywords'] = labels['keywords'].apply(lambda row: len(row))
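
The keyword handling above can be illustrated on a single value without pandas. This is a minimal sketch: the raw string and its capitalization are hypothetical, and only the first five `World` keywords shown in the output below are used.

```python
# Hypothetical raw value of one 'keywords' cell (space-separated, mixed case).
raw_keywords = "Election State President Police Politics"

# Split on single spaces, then lowercase each keyword, mirroring the cell above.
keywords = [keyword.lower() for keyword in raw_keywords.split(" ")]
number_of_keywords = len(keywords)

print(keywords, number_of_keywords)

# Caveat: split(' ') keeps empty strings when two spaces occur in a row,
# whereas split() without arguments collapses runs of whitespace.
with_double_space = "oil  market".split(" ")
without_argument = "oil  market".split()
```

If the keyword file may contain repeated spaces, `split()` without an argument is the more forgiving choice.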

In [3]:
labels

| | class_index | class_name | keywords | number_of_keywords |
|---|---|---|---|---|
| 0 | 1 | World | [election, state, president, police, politics,... | 11 |
| 1 | 2 | Sports | [olympic, football, sport, league, baseball, r... | 32 |
| 2 | 3 | Business | [company, market, oil, consumers, exchange, bu... | 10 |
| 3 | 4 | Science/Technology | [laboratory, computers, science, technology, w... | 18 |


### Tokenize data

In [4]:
# doc: document text string
# returns the tokenized document
# strip_tags removes markup tags from the text
# simple_preprocess converts a document into a list of lowercase tokens, ignoring tokens that are too short or too long
# simple_preprocess also removes numerical values as well as punctuation characters
def tokenize(doc):
    return simple_preprocess(strip_tags(doc), deacc=True, min_len=2, max_len=15)
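
To see roughly what `tokenize` produces without installing gensim, the behaviour described in the comments can be approximated with the standard library. This is only a sketch under stated assumptions: `approx_tokenize` is a hypothetical helper, not gensim's implementation, and edge cases may differ from `simple_preprocess`.

```python
import re
import unicodedata


def approx_tokenize(doc, min_len=2, max_len=15):
    """Rough stand-in for simple_preprocess(strip_tags(doc), deacc=True, ...)."""
    # Drop anything that looks like a markup tag.
    text = re.sub(r"<[^>]+>", "", doc)
    # Remove accents: decompose characters and discard combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Lowercase and keep runs of letters/underscores; digits and punctuation act as separators.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Keep only tokens whose length lies in [min_len, max_len].
    return [token for token in tokens if min_len <= len(token) <= max_len]


print(approx_tokenize("Wall St. Bears Claw Back Into the Black (Reuters)"))
# ['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters']
```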

In [5]:
# add data set type column
ag_train['data_set_type'] = 'train'
ag_test['data_set_type'] = 'test'

# concat train and test data
ag_full_corpus = pd.concat([ag_train,ag_test]).reset_index(drop=True)

In [6]:
# tokenize the combined title + description of each document and tag it with its row index for Lbl2Vec training
ag_full_corpus['tagged_docs'] = ag_full_corpus.apply(lambda row: TaggedDocument(tokenize(row['title'] + '. ' + row['description']), [str(row.name)]), axis=1)
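
The tagging step pairs each token list with the stringified row index, which is what later lets predictions be joined back via `doc_key`. The sketch below shows that structure with a `namedtuple` stand-in that has the same two fields (`words`, `tags`) as gensim's `TaggedDocument`. The rows and the simplified tokenization are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Hypothetical rows: (row index, title, description).
rows = [
    (0, "Oil prices soar", "Crude hits a record"),
    (1, "Team wins final", "Fans celebrate downtown"),
]

tagged = []
for index, title, description in rows:
    # Combine title and description as in the cell above, then tokenize crudely.
    text = title + ". " + description
    words = text.lower().replace(".", "").split()
    # The tag is the row index as a string, matching the later doc_key column.
    tagged.append(TaggedDoc(words=words, tags=[str(index)]))

print(tagged[0])
```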

In [7]:
# add doc_key column
ag_full_corpus['doc_key'] = ag_full_corpus.index.astype(str)

In [8]:
# add class_name column
ag_full_corpus = ag_full_corpus.merge(labels, left_on='class', right_on='class_index', how='left').drop(['class', 'keywords'], axis=1)

In [9]:
ag_full_corpus.head()

| | title | description | data_set_type | tagged_docs | doc_key | class_index | class_name | number_of_keywords |
|---|---|---|---|---|---|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... | train | ([wall, st, bears, claw, back, into, the, blac... | 0 | 3 | Business | 10 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | train | ([carlyle, looks, toward, commercial, aerospac... | 1 | 3 | Business | 10 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... | train | ([oil, and, economy, cloud, stocks, outlook, r... | 2 | 3 | Business | 10 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... | train | ([iraq, halts, oil, exports, from, main, south... | 3 | 3 | Business | 10 |
| 4 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | train | ([oil, prices, soar, to, all, time, record, po... | 4 | 3 | Business | 10 |


### Train Lbl2Vec

Train a new model from scratch with the following parameters:
* keywords_list : iterable list of lists with descriptive keywords for each topic.
* tagged_documents : iterable list of [gensim.models.doc2vec.TaggedDocument](https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument) elements. Each element consists of one document.
* label_names : iterable list of custom names for each label. Label names and keywords of the same topic must have the same index.
* similarity_threshold : only documents whose similarity to the respective description keywords exceeds this threshold are used to calculate the label embedding.
* min_num_docs : minimum number of documents that are used to calculate the label embedding. 
* epochs : number of iterations over the corpus.
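
The interplay of `similarity_threshold` and `min_num_docs` described above can be pictured roughly as follows. This is a conceptual sketch in plain Python, not Lbl2Vec's actual implementation: the helper names (`cosine`, `centroid`, `label_embedding`) and the 2-D toy vectors are hypothetical, and it assumes the label embedding is the centroid of the selected document vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def label_embedding(keyword_vecs, doc_vecs, similarity_threshold, min_num_docs):
    # Represent the label's keywords by the mean of their vectors.
    query = centroid(keyword_vecs)
    # Rank documents from most to least similar to the keywords.
    ranked = sorted(doc_vecs, key=lambda doc: cosine(query, doc), reverse=True)
    # Keep documents above the threshold ...
    selected = [doc for doc in ranked if cosine(query, doc) > similarity_threshold]
    # ... but fall back to the top-ranked ones if too few pass it.
    if len(selected) < min_num_docs:
        selected = ranked[:min_num_docs]
    return centroid(selected)


# Hypothetical 2-D embeddings for illustration only.
keyword_vecs = [[1.0, 0.0], [0.8, 0.2]]
doc_vecs = [[0.9, 0.1], [0.7, 0.3], [0.0, 1.0], [-1.0, 0.0]]
label_vec = label_embedding(keyword_vecs, doc_vecs, similarity_threshold=0.30, min_num_docs=2)
print(label_vec)
```

With these toy vectors, only the first two documents exceed the threshold, so the label vector is their mean. Raising the threshold while requiring three documents would pull in the third-ranked document as well.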

In [10]:
# init model with parameters
lbl2vec_model = Lbl2Vec(keywords_list=list(labels['keywords']), tagged_documents=ag_full_corpus['tagged_docs'][ag_full_corpus['data_set_type']=='train'], label_names=list(labels['class_name']), similarity_threshold=0.30, min_num_docs=100, epochs=10)

In [11]:
# train model
lbl2vec_model.fit()

2021-07-20 09:10:15,016 - Lbl2Vec - INFO - Train document and word embeddings
2021-07-20 09:12:13,075 - Lbl2Vec - INFO - Train label embeddings


### Predict document topics of documents used to train Lbl2Vec

Compute the similarity between the learned vector of each training document and each learned label vector. The scores are cosine similarities and therefore lie in the range [-1, 1].
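
For reference, cosine similarity and the choice of the most similar label can be sketched in a few lines. The vectors and the reduced label set below are hypothetical; the real embeddings have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-D document and label vectors.
doc_vec = [0.2, 0.9]
label_vecs = {
    "World": [0.1, 1.0],
    "Sports": [1.0, 0.0],
    "Business": [-0.2, -0.9],  # points exactly opposite to doc_vec
}

# One score per label; the prediction is the label with the highest score.
scores = {name: cosine_similarity(doc_vec, vec) for name, vec in label_vecs.items()}
most_similar_label = max(scores, key=scores.get)

print(most_similar_label, scores)
```

The `Business` vector is the negation of the document vector, so its score sits at the lower end of the range, while vectors pointing in nearly the same direction score close to 1.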

In [12]:
# predict similarity scores
model_docs_lbl_similarities = lbl2vec_model.predict_model_docs()

2021-07-20 09:13:42,690 - Lbl2Vec - INFO - Get document embeddings from model
2021-07-20 09:13:42,857 - Lbl2Vec - INFO - Calculate document<->label similarities


In [13]:
model_docs_lbl_similarities.head()

| | doc_key | most_similar_label | highest_similarity_score | World | Sports | Business | Science/Technology |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Sports | 0.495201 | 0.283666 | 0.495201 | 0.413946 | 0.483179 |
| 1 | 1 | Science/Technology | 0.501601 | 0.240291 | 0.36248 | 0.47335 | 0.501601 |
| 2 | 2 | Business | 0.527285 | 0.221636 | 0.321899 | 0.527285 | 0.427591 |
| 3 | 3 | World | 0.50849 | 0.50849 | 0.174409 | 0.38905 | 0.273477 |
| 4 | 4 | Business | 0.413479 | 0.4109 | 0.3089 | 0.413479 | 0.309548 |


In [14]:
# merge DataFrames
full_evaluation_df = model_docs_lbl_similarities.merge(ag_full_corpus, left_on='doc_key', right_on='doc_key')

### pyLDAvis visualization

In [15]:
lbl_dict = {'World':1, 'Sports':2, 'Business':3, 'Science/Technology':4}
full_evaluation_df['label_index'] = full_evaluation_df['most_similar_label'].apply(lambda row: lbl_dict[row])

In [16]:
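# create a one-hot document-topic matrix: each document is assigned entirely to its predicted label
# (analyzer='char' works here because every label index is a single character)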
doc_topic_dists_vectorizer = CountVectorizer(analyzer='char')
doc_topic_dists = doc_topic_dists_vectorizer.fit_transform([str(x) for x in list(full_evaluation_df['label_index'])])
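
Counting characters of the stringified label index yields one column per distinct digit, so each row contains a single 1. The plain-Python sketch below mimics that effect. The label indices are hypothetical, and the sorted column order is an assumption about how the vectorizer arranges its vocabulary.

```python
from collections import Counter

# Hypothetical predicted label indices for five documents.
label_indices = [2, 4, 3, 1, 3]

# Character vocabulary in sorted order, one column per distinct character.
vocab = sorted({ch for index in label_indices for ch in str(index)})

# Count each vocabulary character in every stringified index.
rows = [[Counter(str(index))[ch] for ch in vocab] for index in label_indices]

for index, row in zip(label_indices, rows):
    print(index, row)
```

The trick relies on single-digit indices: an index such as 10 would be split into the characters `1` and `0` and would no longer produce a one-hot row.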

In [17]:
# concatenate title + description into one text for each document
full_evaluation_df['full_text'] = full_evaluation_df.apply(lambda row: row['title'] + '. ' + row['description'], axis=1)
# remove punctuation
full_evaluation_df['full_text_processed'] = full_evaluation_df['full_text'].map(lambda row: re.sub(r'[,.!?]', '', row))
# convert the text to lowercase
full_evaluation_df['full_text_processed'] = full_evaluation_df['full_text_processed'].map(lambda row: row.lower())
# split the processed text into a word list
full_evaluation_df['full_text_processed'] = full_evaluation_df['full_text_processed'].apply(tokenize)
# remove stop words and words shorter than three characters
full_evaluation_df['full_text_processed'] = full_evaluation_df['full_text_processed'].apply(lambda row: [word for word in row if word not in STOPWORDS and len(word) > 2])
# join word list into a single string
full_evaluation_df['full_text_processed'] = full_evaluation_df['full_text_processed'].apply(lambda row: ' '.join(row))
# get document lengths (number of remaining words)
full_evaluation_df['doc_length'] = full_evaluation_df['full_text_processed'].apply(lambda x: len(x.split(' ')))

In [18]:
# group full articles (title + text) by classified label
docs_per_class = full_evaluation_df.groupby(['most_similar_label'], as_index=False).agg({'full_text_processed': ' '.join})

In [19]:
# create topic-word distribution with KxM shape, where K is number of topics and M is vocabulary size
topic_vectorizer = CountVectorizer()
topic_term_matrix = topic_vectorizer.fit_transform(list(docs_per_class['full_text_processed']))
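
The K×M matrix built above holds, for each predicted class, the count of every vocabulary term in that class's concatenated text. A minimal standard-library sketch of the same idea follows. The two class texts are hypothetical, and whitespace splitting stands in for the vectorizer's own tokenization, which may treat very short tokens differently.

```python
from collections import Counter

# Hypothetical concatenated texts, one per predicted class.
docs_per_class = {
    "Business": "oil market oil stocks",
    "Sports": "league football league",
}

# Term counts per class.
counts = {label: Counter(text.split()) for label, text in docs_per_class.items()}

# Shared vocabulary in sorted order (M columns).
vocab = sorted({term for counter in counts.values() for term in counter})

# K x M matrix: one row per class, one column per vocabulary term.
topic_term = [[counts[label][term] for term in vocab] for label in docs_per_class]

# Overall frequency of each term across all classes (column sums).
term_frequency = [sum(column) for column in zip(*topic_term)]

print(vocab)
print(topic_term)
print(term_frequency)
```

The column sums correspond to the `term_frequency` argument that is later derived from the matrix.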

In [20]:
import pyLDAvis
%matplotlib inline
from IPython.display import display, HTML
display(HTML("<style>.container { max-width:100% !important; }</style>"))
display(HTML("<style>.output_result { max-width:100% !important; }</style>"))
display(HTML("<style>.output_area { max-width:100% !important; }</style>"))
display(HTML("<style>.input_area { max-width:100% !important; }</style>"))
pyLDAvis.enable_notebook()

# prepare pyLDAvis data
pyLDAvis_data = pyLDAvis.prepare(
    mds='pcoa',
    topic_term_dists=topic_term_matrix.toarray(),
    doc_topic_dists=doc_topic_dists.toarray(),
    doc_lengths=list(full_evaluation_df['doc_length']),
    # get_feature_names_out() requires scikit-learn >= 1.0; older versions provide get_feature_names() instead
    vocab=list(topic_vectorizer.get_feature_names_out()),
    term_frequency=list(pd.DataFrame(topic_term_matrix.toarray()).sum()),
)

# parse topic names
parsedLDAtopics = pyLDAvis_data.topic_coordinates.reset_index().merge(docs_per_class.reset_index(), left_on='topic',right_on='index')[['topics','most_similar_label']]
parsedpyLDAvistopicsdict = dict(zip(list(parsedLDAtopics['topics']),list(parsedLDAtopics['most_similar_label'])))

In [21]:
# display pyLDAvis data
print('Parsed pyLDAvis topics to class names:',parsedpyLDAvistopicsdict)
pyLDAvis.display(pyLDAvis_data)

Parsed pyLDAvis topics to class names: {1: 'Business', 2: 'Science/Technology', 3: 'Sports', 4: 'World'}


