# Twitter corpus creation and LDA topic modelling

## Introduction

This notebook has two sections:
1. Prototype code for creating a gensim-compatible corpus from my collection of tweets.
2. Training an LDA topic model on a subset of the corpus.

This is largely prototyping and experimenting with model hyperparameters. When done, I'll create separate scripts for each part. There's a very rudimentary earlier version (probably not committed, oops) that uses a Wikipedia corpus for training, but I wasn't satisfied with it and retraining takes about 12 hours.

## Libraries and setup

In [None]:
# Python libs
import sys, os
from dotenv import find_dotenv, load_dotenv
import re
import logging
from pathlib import Path
from pprint import pprint
import random

# Database
import pymongo

# NLP libs
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet

from gensim.test.utils import common_texts, datapath
from gensim.utils import simple_preprocess
from gensim.models import TfidfModel, CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.models.word2vec import Text8Corpus
from gensim.models.phrases import Phrases, Phraser
from gensim.corpora import Dictionary, MmCorpus

# Visualisation libs
import matplotlib.pyplot as plt

In [None]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

## Data loading

In [None]:
# From src/data/db_handlers.py
def mongodb_connect():
    """
    Establish connection to MongoDB
    db name given in .env file
    """
    dotenv_path = find_dotenv()
    load_dotenv(dotenv_path)

    client = pymongo.MongoClient(os.environ.get("DATABASE_URL"), 27017)
    db = client.tweetbase
    return db

In [None]:
db = mongodb_connect()

The Tweetbase db contains two separate collections of tweets, one from Aberdeen, Scotland and the second from Hammersmith, London.

In [None]:
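n_abdn = db.tweets_abdn.count_documents({})  # Collection.count() was removed in PyMongo 4
n_hsmith = db.tweets_hsmith.count_documents({})
print(f"There are {n_abdn} tweets from Aberdeen "
      f"and {n_hsmith} tweets from Hammersmith in the db")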

In [None]:
# Get full_text field for each tweet
# Only the first 50,000 tweets from each collection are returned for testing purposes
tweets_abdn = db.tweets_abdn.find({}, {"_id": 0, "full_text": 1})[:50000]
tweets_hsmith = db.tweets_hsmith.find({}, {"_id": 0, "full_text": 1})[:50000]

# Convert mongodb cursor objects into lists
tweets_abdn = [doc.get("full_text") for doc in tweets_abdn]
tweets_hsmith = [doc.get("full_text") for doc in tweets_hsmith]

In [None]:
tweets = tweets_abdn + tweets_hsmith

In [None]:
# Get indices for a random sample of 50 tweets for inspection
tweets_sample_idxs = random.sample(range(len(tweets)), 50)
pprint([tweets[i] for i in tweets_sample_idxs])

## Corpus preprocessing

### Cleaning

In [None]:
# Remove links and strip the @ prefix from mentions (the username itself is kept)
tweets = [re.sub(r'@|https?://\S+', '', t) for t in tweets]

In [None]:
pprint([tweets[i] for i in tweets_sample_idxs])
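
As a quick sanity check of the cleaning step, here is a minimal sketch that applies the same pattern to a single made-up tweet. The helper name `clean_tweet` and the example text are hypothetical and only there for illustration.

```python
import re

# Same pattern as in the cleaning cell, written as a raw string
CLEAN_PATTERN = re.compile(r'@|https?://\S+')

def clean_tweet(text):
    """Strip '@' characters and http(s) links from a tweet."""
    return CLEAN_PATTERN.sub('', text)

# Hypothetical tweet, made up for illustration
example = "Thanks @some_user, see https://example.com/page for details"
cleaned = clean_tweet(example)
print(cleaned)
```

Note that the pattern removes every `@` character, not just those at the start of a mention, and that the username text stays in the tweet as an ordinary word.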

### Tokenizing

Tokenization of the tweets was performed with `gensim.utils.simple_preprocess()`, which only produces unigram tokens. Applying `gensim.models.phrases.Phraser` to the tokenized output should merge frequent word pairs into bigrams, but it is unclear at present whether the method used below actually does so for the given input: the `Phrases` model is trained on gensim's bundled `testcorpus.txt` rather than on the tweets themselves.

In [None]:
tweets_tokens = [simple_preprocess(t) for t in tweets]

In [None]:
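# NOTE: this Phrases model is trained on gensim's bundled test corpus,
# not on the tweets, so it can only know word pairs that occur in that corpus
sentences = Text8Corpus(datapath('testcorpus.txt'))
phrases = Phrases(sentences, min_count=1, threshold=1)

# bigram = Phrases(common_texts)
bigram = Phraser(phrases)  # lighter, frozen version of the trained model
tweets_tokens = [bigram[t] for t in tweets_tokens]  # merge detected pairs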

In [None]:
pprint([tweets_tokens[i] for i in tweets_sample_idxs])
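
To see why the choice of training data matters for bigram detection, here is a rough pure-Python sketch. It uses a simplified count threshold, which is an assumption made for illustration and not gensim's scoring function. The names `learn_bigrams` and `apply_bigrams` and the toy data are hypothetical.

```python
from collections import Counter

def learn_bigrams(sentences, min_count=2):
    """Return the set of adjacent word pairs seen at least min_count times."""
    pair_counts = Counter()
    for tokens in sentences:
        pair_counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in pair_counts.items() if n >= min_count}

def apply_bigrams(tokens, bigrams, delimiter="_"):
    """Merge each learned adjacent pair into a single token, left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical tokenized tweets, made up for illustration
toy_tweets = [
    ["union", "street", "closed", "today"],
    ["traffic", "on", "union", "street"],
    ["street", "food", "market"],
]
# Hypothetical unrelated training corpus
unrelated = [["human", "computer", "interface"], ["human", "computer", "survey"]]

from_tweets = learn_bigrams(toy_tweets)
from_unrelated = learn_bigrams(unrelated)

print(apply_bigrams(toy_tweets[0], from_tweets))     # pairs learned from the tweets
print(apply_bigrams(toy_tweets[0], from_unrelated))  # pairs learned elsewhere
```

In this toy setup, pairs learned from an unrelated corpus never match the tweets, so the tokens pass through unchanged. That may be what is happening with the test-corpus model above.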

### Stopword removal

In [None]:
stop = stopwords.words('english')
print(stop)

For some reason the negated forms of *should*, *would* and *might* are included, but not the regular forms. I've added those to the list myself, along with *could*.

In [None]:
stop_additions = ['should', 'would', 'might', 'could']
stop = stop + stop_additions

In [None]:
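whitelist = []
stop_set = set(stop)  # set membership is O(1); list lookups scan the whole list

# Keep a word if it is whitelisted or is not a stopword
tweets_tokens = [[word for word in sentence if word in whitelist or word not in stop_set]
                 for sentence in tweets_tokens]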

In [None]:
pprint([tweets_tokens[i] for i in tweets_sample_idxs])
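
The whitelist is empty for now, so it has no effect yet. A minimal sketch of how it is meant to interact with the stop list, using a made-up stop list and a hypothetical helper `remove_stopwords`:

```python
# Hypothetical stop list and whitelist, made up for illustration
stop_words = {"the", "is", "not", "should", "would"}
keep_anyway = {"not"}  # whitelisted words survive even though they are stopwords

def remove_stopwords(tokens, stop_words, whitelist=frozenset()):
    """Drop stopwords from a token list unless they are whitelisted."""
    return [w for w in tokens if w in whitelist or w not in stop_words]

tokens = ["the", "bus", "is", "not", "late"]
print(remove_stopwords(tokens, stop_words))               # no whitelist
print(remove_stopwords(tokens, stop_words, keep_anyway))  # 'not' is kept
```

Whitelisting negations like this could be worth trying if they turn out to matter for the topics.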

### Lemmatization

Lemmatization groups words under their lemma, or dictionary form: for example, *knows* and *knew* under *know*, or *feet* under *foot*. This requires knowing the part of speech (PoS) of each word.
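
To illustrate why the PoS matters, here is a toy sketch in which a small hand-written lookup table stands in for WordNet. The table `TOY_LEMMAS`, the function `toy_lemmatize` and the single-letter PoS codes are hypothetical, and this is not how the NLTK lemmatizer is implemented.

```python
# Toy lemma table keyed by (word, part of speech); a hand-written stand-in
# for WordNet, covering only the examples used here
TOY_LEMMAS = {
    ("knows", "v"): "know",
    ("knew", "v"): "know",
    ("feet", "n"): "foot",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); unknown pairs are returned unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("knew", "v"))  # verb form of 'know'
print(toy_lemmatize("saw", "v"))   # as a verb, a form of 'see'
print(toy_lemmatize("saw", "n"))   # as a noun, the tool
```

The same surface form can map to different lemmas depending on its PoS, so a wrong tag can produce a wrong lemma.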

Although the corpus was lemmatized, this is not necessarily beneficial. [Schofield and Mimno (2016)](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00099) report that, at best, text preprocessed with the NLTK WordNet lemmatizer showed no meaningful change in LDA topic coherence scores compared with text that had not been stemmed. At worst, some stemming methods decreased LDA topic stability.

TODO: Possibly compare a lemmatized and unlemmatized version of the corpus

In [None]:
# From https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
wnl = WordNetLemmatizer()

tweets_tokens = [[wnl.lemmatize(word, get_wordnet_pos(word)) for word in sentence]
    for sentence in tweets_tokens]

In [None]:
pprint([tweets_tokens[i] for i in tweets_sample_idxs])

In [None]:
# Collect every distinct token across all tweets
set_tmp = {word for sentence in tweets_tokens for word in sentence}
print(f"There are {len(set_tmp)} unique words in the corpus")
print(f"The first 150 words are:\n {list(set_tmp)[:150]}")

Many of the words in the corpus so far are Twitter usernames (from mentions) and hashtags. Creating the dictionary in the next section will filter out those that are not widely used.

## Creating Dictionary and BoW corpus

In [None]:
KEEP_WORDS = 100000 # Max number of words in dictionary
CORPUS_PATH = Path('../../data/corpora') # Location to save corpus and dict

In [None]:
dictionary = Dictionary(tweets_tokens)

# Filter the dictionary:
#    no_below: keep words that appear in at least 40 documents
#    no_above: drop words that appear in more than 5% of all documents
#    keep_n:   then keep at most the KEEP_WORDS most frequent words
dictionary.filter_extremes(no_below=40, no_above=0.05, keep_n=KEEP_WORDS)
print(dictionary.token2id)

Along with the LDA model hyperparameters, the dictionary filtering above will have a considerable impact on the topic model.
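
To get a feel for how the filtering thresholds interact, here is a rough pure-Python sketch of document-frequency filtering as I understand `filter_extremes` to behave. It is not gensim's implementation. The function name `filter_by_doc_freq`, the alphabetical tie-breaking and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering.

    no_below: minimum number of documents a token must appear in (absolute)
    no_above: maximum fraction of documents a token may appear in
    keep_n:   maximum number of surviving tokens, most frequent first
    """
    # Document frequency: count each token once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Sort by descending document frequency (ties broken alphabetically)
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

# Hypothetical tokenized tweets, made up for illustration
toy_docs = [
    ["rain", "aberdeen", "today"],
    ["rain", "london", "today"],
    ["rain", "aberdeen", "harbour"],
    ["rain", "london", "tube"],
]
vocab = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(vocab)
```

In the toy data, *rain* appears in every document and is dropped by `no_above`, while *harbour* and *tube* appear only once and are dropped by `no_below`.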

In [None]:
# Create the corpus: use the dictionary to turn each tweet into a BoW vector
tweet_corpus = [dictionary.doc2bow(tweet) for tweet in tweets_tokens]
# tweet_corpus = [text for text in tweet_corpus if len(text) > 0]
MmCorpus.serialize(str(CORPUS_PATH / 'tweets_bow.mm'), tweet_corpus, progress_cnt=10000)
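
For reference, a bag-of-words vector is just a list of `(token_id, count)` pairs. The sketch below mimics the conversion in plain Python, as I understand `doc2bow` to work. The helper `to_bow` and the toy vocabulary are hypothetical.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Convert a token list to sorted (token_id, count) pairs.

    Tokens missing from token2id are dropped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary mapping, made up for illustration
token2id = {"aberdeen": 0, "london": 1, "rain": 2}

bow = to_bow(["rain", "rain", "aberdeen", "harbour"], token2id)
print(bow)
print(to_bow(["harbour"], token2id))  # no known tokens left
```

A tweet whose tokens were all filtered out of the dictionary ends up as an empty vector, which is what the commented-out line in the cell above would remove.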

In [None]:
dictionary.save_as_text(str(CORPUS_PATH / 'tweets_wordids.txt.bz2'))