# Shallow-Learning Topic Modelling
- Build a document topic classifier that leverages the topological information of the bipartite graph created in 01_nlp_graph_creation.ipynb
- The topic is the label in our raw dataset

In the following, we show how to create a topic model using a shallow-learning approach. We use the results and the embeddings obtained from the document-document projection of the bipartite graph.

**NOTE: This notebook can only be run after the 01_nlp_graph_creation notebook, because it reuses some of the results computed there.**

### Load Dataset

In [99]:
import pandas as pd
#file_corpus = "/Users/chang/Documents/dev/git/ml/Graph-Machine-Learning/Chapter07/corpus_clean.csv"
#corpus = pd.read_csv(file_corpus, sep='\t')
corpus=pd.read_pickle("corpus.p")
corpus.head(2)

id,clean_text,label,language,parsed
test/14826,ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...,[trade],en,"(ASIAN, EXPORTERS, FEAR, DAMAGE, FROM, U.S.-JA..."
test/14828,CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...,[grain],en,"(CHINA, DAILY, SAYS, VERMIN, EAT, 7, -, 12, PC..."


In [101]:
#corpus['label'] = corpus['label'].apply(lambda label_str: label_str.strip('[]').replace("'", "").split(', '))
corpus['label'].head(10)

id
test/14826                                           [trade]
test/14828                                           [grain]
test/14829                                  [crude, nat-gas]
test/14832    [corn, grain, rice, rubber, sugar, tin, trade]
test/14833                               [palm-oil, veg-oil]
test/14839                                            [ship]
test/14840       [coffee, lumber, palm-oil, rubber, veg-oil]
test/14841                                    [grain, wheat]
test/14842                                            [gold]
test/14843                                             [acq]
Name: label, dtype: object

In [102]:
from collections import Counter
#topics = Counter([label.replace("'", "") for labels in corpus['label'] for label in labels.strip('][').split(', ')]).most_common(10)

topics = Counter([label for labels in corpus['label'] for label in labels]).most_common(10)
topics

[('earn', 3964),
 ('acq', 2369),
 ('money-fx', 717),
 ('grain', 582),
 ('crude', 578),
 ('trade', 485),
 ('interest', 478),
 ('ship', 286),
 ('wheat', 283),
 ('corn', 237)]

In [103]:
topicsList = [topic[0] for topic in topics]
topicsSet = set(topicsList)
topicsSet

{'acq',
 'corn',
 'crude',
 'earn',
 'grain',
 'interest',
 'money-fx',
 'ship',
 'trade',
 'wheat'}

In [104]:
dataset = corpus[corpus["label"].apply(lambda x: len(topicsSet.intersection(x))>0)]
dataset.head(10)

id,clean_text,label,language,parsed
test/14826,ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...,[trade],en,"(ASIAN, EXPORTERS, FEAR, DAMAGE, FROM, U.S.-JA..."
test/14828,CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...,[grain],en,"(CHINA, DAILY, SAYS, VERMIN, EAT, 7, -, 12, PC..."
test/14829,JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...,"[crude, nat-gas]",en,"(JAPAN, TO, REVISE, LONG, -, TERM, ENERGY, DEM..."
test/14832,THAI TRADE DEFICIT WIDENS IN FIRST QUARTER Th...,"[corn, grain, rice, rubber, sugar, tin, trade]",en,"(THAI, TRADE, DEFICIT, WIDENS, IN, FIRST, QUAR..."
test/14839,AUSTRALIAN FOREIGN SHIP BAN ENDS BUT NSW PORTS...,[ship],en,"(AUSTRALIAN, FOREIGN, SHIP, BAN, ENDS, BUT, NS..."
test/14841,SRI LANKA GETS USDA APPROVAL FOR WHEAT PRICE ...,"[grain, wheat]",en,"(SRI, LANKA, GETS, USDA, APPROVAL, FOR, WHEAT,..."
test/14843,SUMITOMO BANK AIMS AT QUICK RECOVERY FROM MERG...,[acq],en,"(SUMITOMO, BANK, AIMS, AT, QUICK, RECOVERY, FR..."
test/14849,BUNDESBANK ALLOCATES 6.1 BILLION MARKS IN TEND...,"[interest, money-fx]",en,"(BUNDESBANK, ALLOCATES, 6.1, BILLION, MARKS, I..."
test/14852,BOND CORP STILL CONSIDERING ATLAS MINING BAIL-...,"[acq, copper]",en,"(BOND, CORP, STILL, CONSIDERING, ATLAS, MINING..."
test/14858,JAPAN MINISTRY SAYS OPEN FARM TRADE WOULD HIT ...,"[carcass, corn, grain, livestock, oilseed, ric...",en,"(JAPAN, MINISTRY, SAYS, OPEN, FARM, TRADE, WOU..."


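Create a class that "simulates" the training of the embeddings: `fit` only loads precomputed embeddings from a file, so the lookup can be used as a regular step in a scikit-learn pipeline.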

In [105]:
from sklearn.base import BaseEstimator

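class EmbeddingsTransformer(BaseEstimator):
    """Pipeline step that looks up precomputed embeddings instead of training them."""

    def __init__(self, embeddings_file):
        self.embeddings_file = embeddings_file

    def fit(self, *args, **kwargs):
        # "Training" only loads the precomputed embeddings from disk.
        self.embeddings = pd.read_pickle(self.embeddings_file)
        return self

    def transform(self, X):
        # Only the index of X is used: one embedding row per document id.
        return self.embeddings.loc[X.index]

    def fit_transform(self, X, y=None):
        # y is optional so the step also works when called without labels.
        return self.fit().transform(X)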



In [106]:
from glob import glob 
files = glob("./embeddings/*")
files

['./embeddings/bipartiteGraphEmbeddings_10_20.p']

In [107]:
graphEmbeddings = EmbeddingsTransformer(files[0]).fit()
graphEmbeddings

EmbeddingsTransformer(embeddings_file='./embeddings/bipartiteGraphEmbeddings_10_20.p')

In [108]:
graphEmbeddings.get_params()

{'embeddings_file': './embeddings/bipartiteGraphEmbeddings_10_20.p'}
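
To make the lookup behaviour concrete, here is a minimal, self-contained sketch of the same idea on toy data. The embedding values, the document ids, and the temporary file name are hypothetical and only serve the illustration; the class is a simplified copy of the one above.

```python
import os
import tempfile

import pandas as pd
from sklearn.base import BaseEstimator


class EmbeddingsTransformer(BaseEstimator):
    """Simplified copy: look up precomputed embeddings by the index of X."""

    def __init__(self, embeddings_file):
        self.embeddings_file = embeddings_file

    def fit(self, X=None, y=None):
        self.embeddings = pd.read_pickle(self.embeddings_file)
        return self

    def transform(self, X):
        return self.embeddings.loc[X.index]


# Hypothetical toy embeddings: three documents, two dimensions.
toy_embeddings = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    index=["training/1", "training/2", "test/1"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.p")
    toy_embeddings.to_pickle(path)
    transformer = EmbeddingsTransformer(path).fit()

# The values of X are ignored; only its index selects embedding rows.
docs = pd.Series(["some text", "other text"], index=["test/1", "training/1"])
vectors = transformer.transform(docs)
print(vectors)
```

Because `transform` relies on `.loc`, an id that is missing from the embeddings raises a `KeyError`, which is why the corpus has to be restricted to embedded documents before it reaches the pipeline.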

### Train/Test split

In [66]:
def get_labels(corpus, topicsList=topicsList):
    return corpus["label"].apply(
        lambda labels: pd.Series({label: 1 for label in labels}).reindex(topicsList).fillna(0)
    )[topicsList]
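
`get_labels` turns each list of topics into a row of 0/1 indicators, one column per selected topic. As a rough sketch of an equivalent encoding, scikit-learn's `MultiLabelBinarizer` can produce the same kind of matrix; the toy topics and document ids below are hypothetical, and this is an alternative technique, not the one used in this notebook.

```python
import warnings

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical subset of topics and a few toy documents.
topics = ["earn", "acq", "grain"]
labels = pd.Series(
    [["earn"], ["grain", "wheat"], ["acq", "earn"]],
    index=["doc/1", "doc/2", "doc/3"],
)

binarizer = MultiLabelBinarizer(classes=topics)
with warnings.catch_warnings():
    # Labels outside `classes` (here "wheat") are dropped with a warning.
    warnings.simplefilter("ignore")
    matrix = binarizer.fit_transform(labels.tolist())

indicator = pd.DataFrame(matrix, columns=topics, index=labels.index)
print(indicator)
```

Each row has a 1 for every selected topic attached to the document and a 0 elsewhere, which is the target format that a multi-output classifier expects.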

In [67]:
def get_features(corpus):
    return corpus["parsed"] #graphEmbeddings.transform(corpus["parsed"])

In [68]:
def get_features_and_labels(corpus):
    return get_features(corpus), get_labels(corpus)

In [111]:
a = corpus.label
a

id
test/14826                                              [trade]
test/14828                                              [grain]
test/14829                                     [crude, nat-gas]
test/14832       [corn, grain, rice, rubber, sugar, tin, trade]
test/14833                                  [palm-oil, veg-oil]
                                      ...                      
training/999                               [interest, money-fx]
training/9992                                            [earn]
training/9993                                            [earn]
training/9994                                            [earn]
training/9995                                            [earn]
Name: label, Length: 10788, dtype: object

In [110]:
b = graphEmbeddings.embeddings.index
b

Index(['said', 'mln', 'net', 'cts', 'dlrs', 'shr', 'year', 'corp', 'u.s.',
       'qtr',
       ...
       'national heritage', 'canadian roxy', 'donners', 'bred', 'andean',
       'debt restructuring', 'imputation', 'chiron', 'prebble',
       'gas reserves'],
      dtype='object', length=25478)

In [112]:
def train_test_split(corpus):
    # Keep only the documents that have a precomputed embedding.
    graphIndex = [index for index in corpus.index if index in graphEmbeddings.embeddings.index]
    print(f'len(graphIndex) = {len(graphIndex)}')
    # The document ids encode the split: "training/<n>" or "test/<n>".
    train_idx = [idx for idx in graphIndex if "training/" in idx]
    test_idx = [idx for idx in graphIndex if "test/" in idx]
    return corpus.loc[train_idx], corpus.loc[test_idx]

In [113]:
train, test = train_test_split(dataset)

len(graphIndex) = 9034
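
The split above filters the corpus to documents that have an embedding and then separates them by id prefix. The following sketch shows the same logic with pandas index operations on toy data; the ids and the `embedded_ids` index are hypothetical, and `startswith` is slightly stricter than the substring test used in the notebook.

```python
import pandas as pd

# Hypothetical toy corpus indexed by document id.
toy_corpus = pd.DataFrame(
    {"label": [["earn"], ["acq"], ["grain"], ["trade"]]},
    index=["training/1", "training/2", "test/1", "test/2"],
)

# Hypothetical embedding index: it lacks "training/2" and also holds a word node.
embedded_ids = pd.Index(["training/1", "test/1", "test/2", "said"])

# Keep only the documents for which an embedding exists.
available = toy_corpus.index.intersection(embedded_ids)

# Separate the remaining ids by their prefix.
is_train = available.str.startswith("training/")
train_part = toy_corpus.loc[available[is_train]]
test_part = toy_corpus.loc[available[~is_train]]

print(len(available), len(train_part), len(test_part))
```

`Index.intersection` avoids the Python-level loop over ids, which can matter for larger corpora.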


In [114]:
train.head(2)

id,clean_text,label,language,parsed
training/10,COMPUTER TERMINAL SYSTEMS &lt;CPML> COMPLETES ...,[acq],en,"(COMPUTER, TERMINAL, SYSTEMS, &, lt;CPML, >, C..."
training/1000,NATIONAL AMUSEMENTS AGAIN UPS VIACOM &lt;VIA> ...,[acq],en,"(NATIONAL, AMUSEMENTS, AGAIN, UPS, VIACOM, &, ..."


### Build the model and cross-validation

In [115]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier 
from sklearn.multioutput import MultiOutputClassifier

In [116]:
model = MultiOutputClassifier(RandomForestClassifier())

In [117]:
pipeline = Pipeline([
    ("embeddings", graphEmbeddings),
    ("model", model)
])

In [118]:
from sklearn.model_selection import GridSearchCV

In [119]:
from sklearn.model_selection import RandomizedSearchCV

In [120]:
files

['./embeddings/bipartiteGraphEmbeddings_10_20.p']

In [121]:
param_grid = {
    "embeddings__embeddings_file": files,
    "model__estimator__n_estimators": [50, 100], 
    "model__estimator__max_features": [0.2,0.3, "auto"], 
    #"model__estimator__max_depth": [3, 5]
}
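
The keys of `param_grid` follow scikit-learn's nested-parameter convention: `<step name>__<parameter>`, with one extra `estimator__` level because `MultiOutputClassifier` wraps the random forest. A small sketch, assuming a `StandardScaler` as a stand-in for the embeddings step, shows how such names resolve:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = Pipeline([
    ("scale", StandardScaler()),  # stand-in for the embeddings step
    ("model", MultiOutputClassifier(RandomForestClassifier())),
])

# "model" is the step name, "estimator" the wrapped classifier,
# and "n_estimators" the random forest parameter.
toy_pipeline.set_params(model__estimator__n_estimators=50)
n_trees = toy_pipeline.get_params()["model__estimator__n_estimators"]

# List every tunable parameter of the wrapped random forest.
tunable = sorted(
    name for name in toy_pipeline.get_params()
    if name.startswith("model__estimator__")
)
print(n_trees)
print(tunable[:3])
```

Calling `get_params()` on the real pipeline is a quick way to check that every key in the grid is spelled correctly before starting a long search.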

In [122]:
features, labels = get_features_and_labels(train)

In [123]:
from sklearn.metrics import f1_score 

In [124]:
# scikit-learn calls a scorer as scorer(estimator, X, y), so a bare
# (y_true, y_pred) lambda fails; the built-in "f1_weighted" scorer
# computes f1_score(y_true, y_pred, average="weighted").
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, n_jobs=-1,
                           scoring="f1_weighted")

In [125]:
model = grid_search.fit(features, labels)

In [126]:
model

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('embeddings',
                                        EmbeddingsTransformer(embeddings_file='./embeddings/bipartiteGraphEmbeddings_10_20.p')),
                                       ('model',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             n_jobs=-1,
             param_grid={'embeddings__embeddings_file': ['./embeddings/bipartiteGraphEmbeddings_10_20.p'],
                         'model__estimator__max_features': [0.2, 0.3, 'auto'],
                         'model__estimator__n_estimators': [50, 100]},
             scoring='f1_weighted')

In [127]:
model.best_params_

{'embeddings__embeddings_file': './embeddings/bipartiteGraphEmbeddings_10_20.p',
 'model__estimator__max_features': 0.2,
 'model__estimator__n_estimators': 50}
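
The `scoring` argument of `GridSearchCV` accepts either a scorer name or a callable that scikit-learn invokes as `scorer(estimator, X, y)`. A metric with the signature `(y_true, y_pred)`, such as `f1_score`, therefore has to be wrapped, for example with `make_scorer`. The sketch below illustrates this on synthetic multi-label data; the dataset, the small grid, and the `zero_division=0` setting are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multi-label data standing in for embeddings and topic indicators.
X, y = make_multilabel_classification(
    n_samples=120, n_features=8, n_classes=3, random_state=0
)

# make_scorer turns a (y_true, y_pred) metric into a scorer(estimator, X, y).
weighted_f1 = make_scorer(f1_score, average="weighted", zero_division=0)

search = GridSearchCV(
    MultiOutputClassifier(RandomForestClassifier(random_state=0)),
    param_grid={"estimator__n_estimators": [5, 10]},
    cv=3,
    scoring=weighted_f1,
)
search.fit(X, y)

mean_scores = search.cv_results_["mean_test_score"]
print(mean_scores)
```

If the scorer fails on every fold, all mean test scores are `nan` and the reported best parameters no longer reflect a real comparison, so inspecting `cv_results_["mean_test_score"]` after a search is a cheap sanity check.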

### Evaluate performance

In [128]:
def get_predictions(model, features):
    return pd.DataFrame(
        model.predict(features), 
        columns=topicsList, 
        index=features.index
    )

In [129]:
preds = get_predictions(model, get_features(test))
labels = get_labels(test)

In [130]:
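# 1 - (mismatched label cells) / (positive labels in the test set);
# despite the variable name, higher values are better.
errors = 1 - (labels - preds).abs().sum().sum() / labels.abs().sum().sum()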

In [131]:
errors

0.7161822748475063
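
The quantity computed above counts every mismatched cell of the indicator matrix (false positives and false negatives alike), divides by the number of positive labels, and subtracts the ratio from 1. A small worked example on hypothetical toy data makes the arithmetic explicit and compares it with scikit-learn's `hamming_loss`, which normalises by the total number of cells instead:

```python
import pandas as pd
from sklearn.metrics import hamming_loss

# Hypothetical ground truth and predictions for three documents, two topics.
toy_labels = pd.DataFrame([[1, 0], [0, 1], [1, 1]], columns=["earn", "acq"])
toy_preds = pd.DataFrame([[1, 0], [0, 0], [1, 1]], columns=["earn", "acq"])

mismatches = (toy_labels - toy_preds).abs().sum().sum()  # 1 missed label
positives = toy_labels.abs().sum().sum()                 # 4 positive labels
score = 1 - mismatches / positives                       # 1 - 1/4

# Hamming loss divides the same mismatch count by all 6 cells.
hamming = hamming_loss(toy_labels, toy_preds)

print(score, hamming)
```

Because the denominator only counts positive labels, this score can drop below zero when a model produces many false positives, which is worth keeping in mind when comparing it with bounded metrics such as the F1 score.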

In [132]:
from sklearn.metrics import classification_report

In [133]:
print(classification_report(labels, preds))

              precision    recall  f1-score   support

           0       0.97      0.93      0.95      1087
           1       0.94      0.85      0.90       719
           2       0.78      0.61      0.69       179
           3       0.95      0.75      0.84       149
           4       0.92      0.70      0.80       189
           5       0.88      0.45      0.60       117
           6       0.89      0.43      0.58       131
           7       0.90      0.31      0.47        89
           8       0.74      0.41      0.53        71
           9       0.50      0.18      0.26        56

   micro avg       0.93      0.77      0.84      2787
   macro avg       0.85      0.56      0.66      2787
weighted avg       0.92      0.77      0.83      2787
 samples avg       0.81      0.80      0.80      2787
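
In the report above, the rows `0` to `9` follow the column order of the label matrix, that is, the order of `topicsList`. As a sketch of how the rows could be labelled with topic names instead, `classification_report` accepts a `target_names` argument; the toy arrays and the three topic names below are hypothetical.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical topics and multi-label indicator matrices.
topic_names = ["earn", "acq", "grain"]
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

# target_names replaces the positional row labels with readable ones.
report = classification_report(
    y_true, y_pred, target_names=topic_names, zero_division=0
)
print(report)

# output_dict=True returns the same numbers as a nested dictionary.
report_dict = classification_report(
    y_true, y_pred, target_names=topic_names, zero_division=0, output_dict=True
)
grain_recall = report_dict["grain"]["recall"]
```

The dictionary form is convenient when per-topic scores need to be collected into a DataFrame or compared across models.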




