# Topic Modelling

In this project, I will try to compare the tweets of Donald Trump, Barrack Obama, and Hillary Clinton to come up with meaningful insights

In this notebook, I will try to classify their tweets under different topics to judge who talks about what

There are 3 CSV files which will be used:
1. DonaldTrumpClean
2. BarackObamaClean
3. HillaryClintonClean

All 3 have the same structure
date,retweet,text,author

In [33]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string

import gensim
from gensim import corpora

Read the clean data

In [34]:
trump = pd.read_csv("data/DonaldTrumpClean.csv")
obama = pd.read_csv("data/BarackObamaClean.csv")
clinton = pd.read_csv("data/HillaryClintonClean.csv")

In [35]:
print(len(trump), len(obama), len(clinton))

8439 2125 3256


## Topic Modelling

In [36]:
trumpTweetList = list(trump.text)

### Cleaning and Preprocessing

Cleaning is an important step before any text mining task, in this step, we will remove the punctuations, stopwords and normalize the corpus.

In [49]:

stop_words = set(stopwords.words('english'))
stop_words.update(['@realdonaldtrump', '@realdonaldtrump.', '.@realdonaldtrump', 'trump', 'donald', 'Donald trump'])
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()

def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop_words])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)    
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return stop_free

doc_clean = [clean(doc).split() for doc in trumpTweetList] 

In [70]:
doc_clean

[['statement.'],
 ['really', 'america?', 'terrible!'],
 ['media',
  'establishment',
  'want',
  'race',
  'badly',
  'never',
  'drop',
  'race,',
  'never',
  'let',
  'supporters',
  'down!',
  '#maga'],
 ['wow,',
  '@cnn',
  'town',
  'hall',
  'questions',
  'given',
  'crooked',
  'hillary',
  'clinton',
  'advance',
  'big',
  'debates',
  'bernie',
  'sanders.',
  'hillary',
  'cnn',
  'fraud!'],
 ['debate', 'polls', 'look', 'great', 'thank', 'you!#maga', '#americafirst'],
 ['saying',
  'clinton',
  'campaigns',
  'anticatholic',
  'bigotry',
  'http//bit.ly/2dcbtvkcrooked'],
 ['thank',
  'u.s.',
  'navy',
  'protecting',
  'country,',
  'times',
  'peace',
  'war.',
  'together,',
  'make',
  'america',
  'safe',
  'great',
  'again!'],
 ['little',
  'pickup',
  'dishonest',
  'media',
  'incredible',
  'information',
  'provided',
  'wikileaks.',
  'dishonest!',
  'rigged',
  'system!'],
 ['thank',
  'florida',
  'movement',
  'never',
  'seen',
  'never',
  'seen',
  'again.

### Preparing Document-Term Matrix


In [52]:
# Creating the term dictionary of our courpus, where every unique term is assigned an index. 
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

### Running LDA Model

In [53]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

In [66]:
# Running and Trainign LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=5, id2word = dictionary, passes=5)

In [71]:
print(ldamodel.print_topics(num_topics=3, num_words=3))

[(1, '0.004*"people" + 0.004*"last" + 0.004*"like"'), (0, '0.010*"new" + 0.009*"would" + 0.008*"like"'), (3, '0.006*"great" + 0.005*"think" + 0.004*"new"')]
