# LDA Testing and F1 Wonderings

There was a short discussion in the competition forum about the usefulness of LDA, and topic models in general. Just recently someone also posted a Gensim-based topic model exploration kernel. Since I have used LDA/topic models for other purposes before, I thought it would be interesting to try using the topic distributions of documents (questions in this case) as features. This kernel scores very poorly, but it was an interesting exercise. Let me know if you have ideas on how to improve it.

I took most of the basics from my wine review topic models kernel:
https://www.kaggle.com/donkeys/topic-models-for-wine-reviews

This kernel just uses the topic distributions as features. Got any other ideas on how to use them here?

In [None]:
import pandas as pd
import numpy as np
from gensim import corpora
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

In [None]:
df_train = pd.read_csv("../input/train.csv")
df_train.head()

In [None]:
df_test = pd.read_csv('../input/test.csv')
df_test.head()

Basic stopwords to remove, along with a bit of garbage I picked out from the printed top words of the topics. I did not look into this in much detail yet, since this is intended as a lightweight experiment:

In [None]:
from string import punctuation
stop_words = set(stopwords.words('english')) 
stop_words = stop_words.union(set(punctuation)) 
stop_words.update(["\'s", "n\'t", "``", "\'\'", "“", "”", "\'m", "’"])

Concatenate the train and test data for pre-processing:

In [None]:
question_texts = df_train["question_text"]
test_texts = df_test["question_text"]
question_texts = pd.concat([question_texts, test_texts]).reset_index(drop=True)

Lemmatize to get the base forms of words.

In [None]:
lemmatizer = WordNetLemmatizer()
texts = [[lemmatizer.lemmatize(word) for word in word_tokenize(text.lower()) if word not in stop_words] 
         for text in question_texts]


What does it all look like?

In [None]:
question_texts.head()

In [None]:
question_texts[4]

In [None]:
texts[4]

Bi-grams and tri-grams for the text: combining 2-3 words into one token when they commonly occur together, such as mountain and bike as mountain-bike. *min_count* and *threshold* control how often, and how strongly, words must co-occur before the n-gram is treated as one.

In [None]:
import gensim
#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
# Build the bigram and trigram models
bigram = gensim.models.Phrases(texts, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[texts], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram) 
trigram_mod = gensim.models.phrases.Phraser(trigram)
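
To make *min_count* and *threshold* a bit more concrete, here is a minimal sketch of the default scoring formula as I understand it from the Gensim documentation (the one from Mikolov et al.). The function name `phrase_score` and all the counts are made up for illustration; this is not Gensim's own code, just the arithmetic behind it.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score for merging words a and b: higher means they co-occur more than chance suggests."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# Hypothetical counts: "mountain" seen 40 times, "bike" 50 times, the pair 30 times,
# in a vocabulary of 10000 distinct tokens.
score = phrase_score(40, 50, 30, vocab_size=10000, min_count=5)
print(score)

# The pair is merged into one token only if the score exceeds `threshold`.
print(score > 100)
```

So a higher *threshold* means fewer, more strongly associated phrases, and pairs seen no more often than *min_count* never get a positive score.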

Look mom, it's an example, with the mountain-bike in it, uh:

In [None]:
trigram_mod[bigram_mod[texts[4]]]

Now do the same (bi-gram and tri-gram generation) for all the question texts we have available:

In [None]:
texts = [trigram_mod[bigram_mod[text]] for text in texts]

In [None]:
#id to word mapping for gensim
id2word = corpora.Dictionary(texts)

Define a function to build an LDA model to represent the topics and their distributions. Use the Gensim multicore version to make it faster:

In [None]:
from gensim.models import LdaMulticore

def create_lda(topic_count, seed):
    print("Running for topics:"+str(topic_count))
    print("creating corpus")
    corpus = [id2word.doc2bow(text) for text in texts]
    print("creating lda model")
    lda_model = LdaMulticore(corpus, id2word=id2word, num_topics=topic_count, random_state=seed)

    print("applying lda model to texts")
    #lda_model[bow] leaves out topics below minimum_probability (default 0.01), which gives
    #rows of varying length and shifts weights into the wrong columns.
    #Asking for all topics explicitly keeps every row at exactly topic_count weights.
    rows = [lda_model.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
    cols = ['lda'+str(n) for n in range(1, topic_count+1)] #range is exclusive on top end, so +1 needed
    print("changing lda results to weight only")
    rows = [[weight for _, weight in topics] for topics in rows]
    print("row len:"+str(len(rows[0])))
    print("row 1:"+str(rows[0]))
    df2 = pd.DataFrame(rows, columns=cols)
    return df2, lda_model
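
One thing to keep in mind when turning Gensim's output into feature columns: as far as I know, indexing a model with a bag-of-words returns `(topic_id, weight)` pairs only for topics above the model's `minimum_probability`, so the list can be shorter than the number of topics. Below is a minimal sketch of placing such pairs into a fixed-length row by topic id. The helper name `to_dense_row` and the example weights are hypothetical.

```python
def to_dense_row(topic_weights, topic_count):
    """Place (topic_id, weight) pairs into a fixed-length row, with 0.0 for omitted topics."""
    row = [0.0] * topic_count
    for topic_id, weight in topic_weights:
        row[topic_id] = weight
    return row

# Hypothetical output for one document where topics 1, 2 and 4 fell below the cutoff.
sparse = [(0, 0.77), (3, 0.21)]
dense = to_dense_row(sparse, topic_count=5)
print(dense)
```

Keeping the topic id when filling the row means that column `lda4` always holds the weight of the same topic, no matter how many topics were reported for a given question.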

A function to try different thresholds on prediction probabilities to maximize the target F1 score. Plot the thresholds in relation to precision, recall, accuracy, and F1 score. Just so I can try to understand how the threshold affects it all. For the plots!

In [None]:
import matplotlib.pyplot as plt
import seaborn  as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

def plot_f1(target, pred):
    best_score = 0
    thresh = 0
    thresholds = []
    f1_scores = []
    p_scores = []
    r_scores = []
    a_scores = []
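    for i in np.arange(0.5, 0.99, 0.01):
        #pred holds the probability of class 0, so a row is flagged as class 1
        #when that probability falls below the threshold
        pred_bool = (pred < i)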
        temp_f1_score = f1_score(target, pred_bool)
        thresholds.append(i)
        f1_scores.append(temp_f1_score)
        p_scores.append(precision_score(target, pred_bool))
        r_scores.append(recall_score(target, pred_bool))
        a_scores.append(accuracy_score(target, pred_bool))
        if(temp_f1_score > best_score):
            best_score = temp_f1_score
            thresh = i
            print("threshold:"+str(i)+"="+str(best_score))

    plt.figure(figsize=(8,5))
    plt.title("F1 at thresholds")
    plt.plot(thresholds, f1_scores, label='F1 score')
    plt.plot(thresholds, p_scores, label='Precision score')
    plt.plot(thresholds, r_scores, label='Recall score')
    plt.plot(thresholds, a_scores, label='Accuracy score')
    #https://stackoverflow.com/questions/24988448/how-to-draw-vertical-lines-on-a-given-plot-in-matplotlib
    plt.axvline(x=thresh, color='k', linestyle='--', label='max='+"%.2f" % thresh)
    plt.legend(loc="upper left")
    plt.show()
    print("%.2f" % thresh+"="+"%.2f" % best_score)
    return thresh, best_score


I tried various topic counts and various seeds with the above functions. I settled on 5 topics, as that seemed to work well for prediction features in this dataset. Larger numbers of topics seem to produce much sparser feature sets. Not sure if that has something to do with it. I also tried a few seeds, as the seed seems to have a surprisingly large impact on the score. This one was ok'ish, although I am sure there are much better ones. It seems odd how much difference the seed can make:

In [None]:
df2, lda_model = create_lda(5, 348617921)

I use LGBM here, just because it is simple, popular, and works well enough for many cases, so it suits me fine here. I used the hyperopt library to optimize the hyperparameter values, although it did not seem to make much of a difference in this case. Maybe my features are not that great?

In [None]:
import lightgbm as lgb 

df_train_lda = pd.concat([df_train, df2[:len(df_train)]], axis=1)
df3 = df2[len(df_train):].reset_index(drop=True)
df_test_lda = pd.concat([df_test, df3], axis=1)

X_cols = [col for col in df_train_lda.columns if col not in ["target", "qid", "question_text"]]
print(X_cols)
X = df_train_lda[X_cols]
y = df_train_lda["target"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=42)

clf = lgb.LGBMClassifier(
                    num_leaves=116,
                    learning_rate=0.1,
                    feature_fraction=0.98,
                    subsample=0.93, 
                    subsample_freq=1, #bagging_freq in some docs
                    n_estimators=2000,
                    boosting_type='gbdt',
                    min_child_samples = 250,
                    n_jobs = 4,
                    max_depth=11,
                    verbosity = 0)

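clf.fit(X_train, y_train,
        eval_set=[(X_val, y_val)],
        #eval_metric='accuracy',
        eval_metric='binary_logloss',
        categorical_feature = 'auto',
        #newer LightGBM versions take early stopping as a callback instead of fit() arguments
        callbacks=[lgb.early_stopping(stopping_rounds=5, verbose=False)]
       )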

y_pred_prob = clf.predict_proba(X_val)
y_pred_prob = y_pred_prob[:,0] #column 0 is the probability of class 0 (target=0)
threshold, score = plot_f1(y_val, y_pred_prob)
X_test = df_test_lda[X_cols]

topic_preds = clf.predict_proba(X_test)
topic_preds = topic_preds[:,0] 
best_pred = topic_preds < threshold
best_probs = topic_preds


The figure above plots how the different scores change as the threshold is varied. I was hoping to get a more intuitive feel for how the F1 score changes in relation to precision and recall, and what it really represents. The vertical dashed line marks where the F1 score is highest: a threshold of 81% gives an F1 score of 0.36. I guess the main thing I learned from this plot is that where precision and recall intersect, the F1 curve passes through the same point, which makes sense since F1 is the harmonic mean of the two. Beyond that, I still have to look for the intuition behind the F1 score. And accuracy is something completely different from all these :)
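
To get a feel for these numbers without the plot, here is a small sketch that computes the four scores directly from confusion-matrix counts. The counts are made up for illustration (an imbalanced set with 6% positives) and are not taken from this dataset; the helper name `scores` is also just for this example.

```python
def scores(tp, fp, fn, tn):
    """Precision, recall, F1 (harmonic mean of the two) and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Case 1: precision equals recall, so F1 equals both of them.
p, r, f1, acc = scores(tp=30, fp=30, fn=30, tn=910)
print(p, r, f1, acc)

# Case 2: perfect precision but low recall, so F1 is pulled toward the lower value.
p2, r2, f1_2, acc2 = scores(tp=10, fp=0, fn=50, tn=940)
print(p2, r2, f1_2, acc2)
```

In both cases accuracy stays high because the many true negatives dominate it, while F1 ignores true negatives entirely. That is one way to see why accuracy behaves so differently from the other three curves on imbalanced data.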

What does the output look like?

In [None]:
best_pred

Finally build something that can be submitted:

In [None]:
results = pd.DataFrame()
results["pred"] = best_pred
results["pred"].value_counts()

In [None]:
submission = pd.read_csv('../input/sample_submission.csv')

In [None]:
submission['prediction'] = best_pred
submission.to_csv('submission.csv', index=False)

Now that the results are done, a final look at the topic model and how it maps to some questions. First, print the top 20 words for each of the 5 topics:

In [None]:
lda_model.print_topics(num_words=20)

There seems to be some underlying logic to the topic distributions, but I will let you think about that. LDA topics have always struck me as having some meaning behind some of the topics, with parts of it eluding me. I guess that is their nature as unsupervised learning.

I tried to print the topic weights for the above topics for a few questions in the dataset. Many topics vs questions were a bit hard to interpret. Tried to pick a few slightly reasonable looking mappings below:

In [None]:
question_texts[1]

In [None]:
lda_model[id2word.doc2bow(texts[1])]

With all the seeds fixed, the topic distributions should not change between runs, so: question 1 is about adopted dogs. It has a weight of 77% for the first topic, and 14% for the last topic. That seems to make some sense, looking at the topics.

One more:

In [None]:
question_texts[10]

In [None]:
lda_model[id2word.doc2bow(texts[10])]

This one maps similarly high on the first topic, and spreads the rest evenly across the other topics.

What can I say in the end? This kernel scores poorly, but it was interesting to play with. 

As far as I understand, LDA assumes that each word occurrence in a document is generated by a single topic. Maybe this also limits its ability to represent more complex relationships within short documents such as these? Although I am not entirely sure how Gensim computes the topic weights for a whole document.
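
On how document-level weights can come about: as far as I understand, Gensim's LDA uses variational inference, where each word gets a soft assignment over topics, and the document weights are the prior `alpha` plus the count-weighted sum of those assignments, normalized to sum to one. The sketch below illustrates that idea with made-up numbers. The function `doc_topic_weights` and the `phi` values are hypothetical, and this is a simplification, not Gensim's actual code.

```python
def doc_topic_weights(word_counts, phi, alpha):
    """gamma[k] = alpha[k] + sum over words of count * phi[word][k], normalized to sum to 1."""
    gamma = list(alpha)
    for word, count in word_counts.items():
        for k in range(len(alpha)):
            gamma[k] += count * phi[word][k]
    total = sum(gamma)
    return [g / total for g in gamma]

# Hypothetical two-topic example for a very short document.
word_counts = {"dog": 2, "adopt": 1}            # bag-of-words counts
phi = {"dog": [0.9, 0.1], "adopt": [0.6, 0.4]}  # soft topic assignment per word
alpha = [0.5, 0.5]                              # symmetric prior
weights = doc_topic_weights(word_counts, phi, alpha)
print(weights)
```

With only a handful of words per question, the prior makes up a large share of each weight, which might be part of why short documents are hard for the model to separate.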

One could also investigate the weights of individual words within a topic for more insight, if interested in digging deeper. Or try other combinations, but since there are already many techniques that score well, perhaps not.

Any other ideas?

That's it. Thank you for stopping by :)