<a href="https://colab.research.google.com/github/rbkhb/NLP_IMC/blob/master/NLP_Workshop_Text_cleaning%2C_topic_modeling_and_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

-------
## AU Interacting Minds Centre, NLP Workshop - November 7th 

**Stance Detection & Topic Modelling of Social Media Users' Content**  

*Rebekah Baglini, Luca Nannini, and Arnault-Quentin Vermillet*



![alt text](https://docs.google.com/drawings/d/e/2PACX-1vTRkUtZJSFMxPWXaidljOqwNDnFTLb4E3GWB6AsqVfcWdYKsI4y9f8EaCz2yozWRe4I8vnvePngC-TM/pub?w=1393&h=614)


### Program

1. Data Preprocessing 
 - Load Dataset 
 - Tokenization/Stopword Removal
 - Clean Tweets Strings with Regular Expressions
 - Lemmatization/Stemming

2. Topic modeling
 - Create, Run, and Train the HDP model via Gensim 
 - Visualize topics through an interactive graph - pyLDAvis 
 - Visualize cosine metrics of topics as a heatmap  
 - HDP and LDA via Gensim Models

3. Supervised text classification with BERT

### Datasets

Vacc_tweets_raw_n5000.csv
  * Random sample of 5000 (out of > 1 million) tweets from 2019 containing string 'vaccin'
  * Collected using [GetOldTweets3 scraper](https://github.com/Jefferson-Henrique/GetOldTweets-python)

#### Additional sets in Data folder 
5-topic_stance_tweets_training_n2814.csv
  * Training set from [SemEval2016 Task 6](http://alt.qcri.org/semeval2016/task6/) for stance detection task
  * Labels: FAVOR, AGAINST, UNKNOWN
  * Topics: 
    - Atheism
    - Climate change is a concern
    - Feminist movement
    - Hillary Clinton
    - Legalization of abortion

Vacc_articles_w_stance_n3303.csv
  * From [Vaccine sentiment project](https://github.com/gloriakang/vax-sentiment), contains 3303 sentences extracted from online articles labelled pro (n=24), neg (n=22), and neu(tral) (n=7).

Vacc_tweets_w_stance_n1131.csv
* 1131 tweets containing string 'vaccin' labeled for stance
* Labels = pro, anti, unknown

**Or upload your own dataset!**
* Using the upload widget in the Colab file. 

### **Code files**
* During the tutorial, we will be working from a Google Colab notebook. This means you will not have to install or load anything locally. 
* If you'd like to run locally, we've included a list of dependencies in Requirements.txt



  


### Libraries & Packages required


In [0]:
import pandas as pd
import numpy as np 
import re
import os 
import string
from string import punctuation
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore',category=DeprecationWarning)

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import gensim
from gensim import corpora, models, similarities
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
from gensim.corpora import Dictionary

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
#import sys
#!{sys.executable} -m pip install spacy
#!{sys.executable} -m spacy download en

In [0]:
!pip install pyLDAvis 
#Note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim



## 1. Data Preprocessing 





### Load Dataset


**Part 1 assumes GOT3 output formatted text.**

Download Raw_vacc_tweets_n5000.csv from [GitHub](https://github.com/rbkhb/NLP_IMC), then upload it from your local drive using the cell below.



In [0]:
from google.colab import files
uploaded = files.upload()


In [0]:
#Check that your filename matches the one assigned by the Upload Widget above. 
df = pd.read_csv('Raw_vacc_tweets_n5000 (3).csv',encoding="utf8")
df.shape

In [0]:
#Let's peek at our dataframe
df.head(20)

## Cleaning up the tweets

### Extracting URLs

External links are often informative, but URLs add unwanted noise to our language models. 

Therefore, we strip out hyperlinks found in tweets and copy them to a new column called 'URLs'. 
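
As a minimal sketch of this step on a single string, the snippet below applies the notebook's URL pattern to a made-up tweet. The sample text and the helper name `split_urls` are assumptions for illustration only.

```python
import re

# Same pattern the notebook uses to locate links in a tweet.
URL_PATTERN = re.compile(
    r"((?:https?:\/\/(?:www\.)?|(?:pic\.|www\.)(?:\S*\.))(?:\S*))"
)

def split_urls(text):
    """Return (text_without_urls, list_of_urls) for one tweet string."""
    urls = URL_PATTERN.findall(text)      # every substring the pattern matches
    stripped = URL_PATTERN.sub("", text)  # the same substrings removed
    return stripped, urls

# Hypothetical tweet, not taken from the dataset.
sample = "Get your flu shot https://example.com/flu pic.twitter.com/abc123 today"
stripped, urls = split_urls(sample)
print(urls)            # ['https://example.com/flu', 'pic.twitter.com/abc123']
print(stripped.split())  # ['Get', 'your', 'flu', 'shot', 'today']
```

Removing a link leaves the surrounding spaces behind, which is why later cleaning collapses repeated whitespace.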



In [0]:
#We first define a function that finds URLs using regex
def find_URLs(tweets):
    return re.findall(r"((?:https?:\/\/(?:www\.)?|(?:pic\.|www\.)(?:\S*\.))(?:\S*))", tweets)

#We apply the function to our text column of our data frame
df['URLs'] = df.text.apply(find_URLs) 
df['URLs'].head(20)

#Then we can get rid of them inside of the tweet
df['clean_text'] = [re.sub(r"((?:https?:\/\/(?:www\.)?|(?:pic\.|www\.)(?:\S*\.))(?:\S*))",'', x) for x in df['text']]


In [0]:
#Let's look at some examples to see what happened
print(df['text'][1])
print(df['clean_text'][1])
print(df['URLs'][1])
print(" \n")
print(df['text'][6])
print(df['clean_text'][6])
print(df['URLs'][6])
print(" \n")

print(df['text'][19])
print(df['clean_text'][19])
print(df['URLs'][19])

### Clean Tweets Strings with Regular Expression 

We now standardize the informal language of tweets: lowercasing, stripping punctuation (including the @ and # symbols of mentions and hashtags), removing numbers and stopwords, and reducing words to their roots. 
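
To make the individual steps easier to follow, here is a simplified sketch that runs without NLTK. It assumes a tiny hand-written stopword set (`TOY_STOPWORDS`) and skips stemming and lemmatization; the helper names are hypothetical and the notebook's own `clean_tweet` function remains the reference.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "is", "a", "to", "and", "of", "my"}

def basic_clean(tweet, stopwords=TOY_STOPWORDS):
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    tweet = tweet.lower()
    tweet = re.sub(r"[^\w\s]", " ", tweet)  # punctuation, incl. @ and #, becomes a space
    tweet = re.sub(r"[0-9]+", " ", tweet)   # digits become a space
    return [tok for tok in tweet.split() if tok not in stopwords]

def add_bigrams(tokens):
    """Append 'left_right' bigrams built from adjacent tokens."""
    return tokens + [a + "_" + b for a, b in zip(tokens, tokens[1:])]

# Hypothetical tweet, not taken from the dataset.
tokens = basic_clean("@DrSmith The #MMR vaccine is 97% effective!!!")
print(tokens)               # ['drsmith', 'mmr', 'vaccine', 'effective']
print(add_bigrams(tokens))
```

Note that only the @ and # symbols disappear; the user name and hashtag words themselves stay in the text as ordinary tokens.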

In [0]:
#We start by defining lists of words to remove
my_stopwords = nltk.corpus.stopwords.words('english') #uninformative common words
my_punctuation = r'!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•…' #punctuation
#We specify the stemmer or lemmatizer we want to use
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
wordnet_lemmatizer = WordNetLemmatizer()

#And we define a cleaning master function to do the heavy lifting
def clean_tweet(tweet, bigrams=False, lemma=False):
    tweet = tweet.lower() # lower case
    # replace punctuation with spaces; this also covers @, # and the "…" left where a link was truncated
    tweet = re.sub(r'[^\w\s]', ' ', tweet)
    tweet = re.sub(r'([0-9]+)', '', tweet) # remove numbers
    tweet = re.sub(r'\s+', ' ', tweet) # collapse repeated whitespace
    # split() without an argument avoids empty tokens from leading/trailing spaces
    tweet_token_list = [word for word in tweet.split()
                        if word not in my_stopwords] # remove stopwords

    if lemma:
        tweet_token_list = [wordnet_lemmatizer.lemmatize(word)
                            for word in tweet_token_list] # apply lemmatizer
    else:
        tweet_token_list = [word_rooter(word)
                            for word in tweet_token_list] # apply stemmer
    if bigrams:
        tweet_token_list = tweet_token_list+[tweet_token_list[i]+'_'+tweet_token_list[i+1]
                                            for i in range(len(tweet_token_list)-1)]
    tweet = ' '.join(tweet_token_list)
    return tweet

#Finally we apply the function to clean tweets (here we use lemmas)
df['clean_text'] = df.clean_text.apply(clean_tweet, lemma=True)

df.head(20)

In [0]:
#Again, Let's take the same examples and look at what happened
print(df['text'][1])
print(df['clean_text'][1])
print(df['URLs'][1])
print(" \n")

print(df['text'][8])
print(df['clean_text'][8])
print(df['URLs'][8])
print(" \n")

print(df['text'][19])
print(df['clean_text'][19])
print(df['URLs'][19])

#### We can now tokenize each tweet in preparation for topic modeling

And before we do so, we might take an extra step.  
To be efficient, we probably should have done this before, but for the sake of clarity, let's do it now.  
  
Depending on your dataset, there might be specific vocabulary that is very frequent but not informative (like "vaccine" in our case). Therefore, we might want to design a custom list of stop words that we will remove from the tweets.
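
One possible way to find candidates for such a list is to look at document frequency, i.e. the share of tweets in which a token appears. The sketch below is an illustration on made-up token lists; the function name, the toy data, and the 0.9 threshold are all assumptions, not part of the workshop code.

```python
from collections import Counter

def document_frequency(token_lists):
    """Fraction of documents in which each token occurs at least once."""
    n_docs = len(token_lists)
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))  # count each token once per document
    return {tok: c / n_docs for tok, c in counts.items()}

# Hypothetical cleaned tweets, already split into tokens.
toy_docs = [
    ["vaccine", "safe", "child"],
    ["vaccine", "mandate", "school"],
    ["flu", "vaccine", "season"],
    ["measles", "outbreak", "vaccine"],
]
doc_freq = document_frequency(toy_docs)

# Tokens present in at least 90% of documents are stop word candidates.
candidates = sorted(tok for tok, ratio in doc_freq.items() if ratio >= 0.9)
print(candidates)  # ['vaccine']
```

A token that appears in nearly every tweet cannot help distinguish topics, but whether to drop it is still a judgment call.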



In [0]:
#We define our list so that we take out those words when tokenising
custom_stop_words = ['vaccine', 'vaccinate']

#We create a new column with tokens
df['token_text'] = [
    [word for word in tweet.split()  if word not in custom_stop_words]
    for tweet in df['clean_text']]
print(df['token_text'])

In [0]:
#One last time, Let's look at the examples:
print(df['text'][1])
print(df['clean_text'][1])
print(df['token_text'][1])
print(" \n")

print(df['text'][6])
print(df['clean_text'][6])
print(df['token_text'][6])
print(" \n")

print(df['text'][19])
print(df['clean_text'][19])
print(df['token_text'][19])

#### Bonus. Other cleaning

##### Clean up mentions and hashtags
We might want to explore the network of tweeps by looking at who is mentioned in a tweet, but GOT3 doesn't always extract mentions or hashtags cleanly. 


In [0]:
#cleans empty @s 
df['mentions'] = [re.sub(r"(@(\s|$))",'', str(x)) for x in df['mentions']]
#cleans URL artifacts in mentions
df['mentions'] = [re.sub(r"((?:https?(?:www)?|pic|www)(?:(?:\s|$)))",'', str(x)) for x in df['mentions']]

#cleans empty #s 
df['hashtags'] = [re.sub(r"(#(\s|$))",'', str(x)) for x in df['hashtags']]
#cleans URL artifacts in hashtags
df['hashtags'] = [re.sub(r"((?:https?(?:www)?|pic|www)(?:(?:\s|$)))",'', str(x)) for x in df['hashtags']]




##### Tweet Length and Bots
The longer a tweet is, the less likely it is that another tweep would tweet the same message word for word. Identical long tweets are more likely to be the work of Tweetbots.
E.g. in a sample of 5000 vaccines tweets, it's likely that two individuals independently tweeted "say no vaccines", but a tweet like "VALIDATE BEFORE YOU VACCINATE Giving vaccines to pets is not a risk-free procedure. It's important therefore, to weigh up the pro's and con's carefully, before deciding on the best way forward" is too specific to be spontaneously replicated.  
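
The idea can be sketched without pandas as follows. This is an illustration on made-up cleaned strings; the helper name `repeated_long_texts` is hypothetical, and the 50-character threshold simply mirrors the one used in the cells below.

```python
from collections import Counter

def repeated_long_texts(texts, min_length=50):
    """Return {text: count} for texts longer than min_length that occur more than once."""
    counts = Counter(t for t in texts if len(t) > min_length)
    return {t: c for t, c in counts.items() if c > 1}

# Hypothetical cleaned tweets.
long_text = ("validate vaccinate giving vaccine pet risk free procedure "
             "important weigh pro con")
toy = [
    "say no vaccine",   # short: plausible coincidence, so it is ignored
    "say no vaccine",
    long_text,          # long and repeated: flagged as a possible bot message
    long_text,
    "flu shot today",
]
flagged = repeated_long_texts(toy)
for text, count in flagged.items():
    print(count, text)
```

Only the long repeated message is flagged; the short repeated one is left alone because independent users could plausibly have written it.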
 
NB: Data does not contain retweets.

In [0]:
#Let's get an extra column with tweet length
df['tweet_length']  = df['clean_text'].str.len() #based on length
#And let's select all tweets that are more than 50 characters long after cleaning
#(.copy() gives us an independent dataframe that is safe to modify later)
dflong = df[df.tweet_length > 50].copy()

In [0]:
#How many do we have?
len(dflong)

In [0]:
#Let's separate the duplicates

df.sort_values("clean_text", inplace = True) 
#.copy() lets us add a 'count' column later without a SettingWithCopyWarning
duplicate_tweet = df[df.duplicated(['clean_text'],keep=False)].copy()
#how many do we have?
len(duplicate_tweet)

In [0]:
#how many unique ones?
len(duplicate_tweet.clean_text.unique())

In [0]:
#and how many times do they appear?
duplicate_tweet['count'] = duplicate_tweet.groupby('clean_text')['clean_text'].transform('count')

In [0]:
#Let's see what they are saying:

#We start by iterating through each unique duplicate
for n in duplicate_tweet.clean_text.unique():
  #Here we use a bit of a trick: it's easier to read a tweet than a cleaned tweet
  #So let's just locate the corresponding tweet for each unique cleaned tweet
  print(duplicate_tweet.loc[duplicate_tweet['clean_text'] == n, 'text'].iloc[0],
  #Wouldn't be nice to see the count too?
        " : ",
        duplicate_tweet.loc[duplicate_tweet['clean_text'] == n, 'count'].iloc[0])

So what do you think? Worth keeping?

In [0]:
#If not, run this. Otherwise, leave this cell alone
dflong = dflong.drop_duplicates(subset="clean_text", keep=False) #drops every copy of a duplicated long tweet

dfshort = df[df.tweet_length <= 50]
df = pd.concat([dfshort, dflong], ignore_index=True)

## Discussion

Can you think of more cleaning we could do? Can you justify doing it? 

Hashtags?

Spelling Correction?

## 2. Topic modeling










**Skip the next cell to proceed to topic modeling with the preprocessed tweets from above.**

Otherwise, upload your own preprocessed file from your local drive.



In [0]:
from google.colab import files
uploaded = files.upload()


### Create, Run, and Train the HDP model via Gensim


In [0]:
tweets = df['token_text']

dictionary = corpora.Dictionary(tweets)
corpus = [dictionary.doc2bow(twt) for twt in tweets]

In [0]:
# Human readable format of corpus (term-frequency)
[[(dictionary[token_id], freq) for token_id, freq in cp] for cp in corpus[:1]]
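
For readers new to the bag-of-words format, the following pure-Python sketch mimics what the dictionary and `doc2bow` steps produce: each document becomes a list of (token id, count) pairs. The helper names and toy documents are assumptions, and the ids Gensim assigns will generally differ from the ones shown here.

```python
def build_vocab(token_lists):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Convert one tokenised document to a sorted list of (token_id, count) pairs."""
    counts = {}
    for tok in tokens:
        if tok in vocab:  # tokens outside the vocabulary are ignored
            counts[vocab[tok]] = counts.get(vocab[tok], 0) + 1
    return sorted(counts.items())

# Hypothetical tokenised tweets.
toy_docs = [["flu", "shot", "flu"], ["shot", "safe"]]
vocab = build_vocab(toy_docs)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)  # {'flu': 0, 'shot': 1, 'safe': 2}
print(bows)   # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; only which tokens occur, and how often, is passed on to the topic models.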

## Latent semantic indexing (LSI)
LSI is a useful topic modeling algorithm because it ranks topics by itself and outputs them in ranked order. It does, however, require a num_topics parameter (200 by default) that sets the number of latent dimensions kept after the SVD.

In [0]:
lsimodel = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)

In [0]:
lsimodel.show_topics(num_topics=5)  # Showing only the top 5 topics

In [0]:

lsitopics = lsimodel.show_topics(formatted=False)

### Discussion question 

Since all of our texts were collected using the keyword "vaccine", this word occurs in almost all topic clusters. Should we remove it? 

## Hierarchical Dirichlet process (HDP)
An HDP model is fully unsupervised. Unlike LSI or LDA, it does not need the number of topics as an input and instead infers it from the data.

In [0]:
hdp = HdpModel(corpus, id2word=dictionary)
hdp.print_topics()

In [0]:
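#hdp_to_lda() returns the (alpha, beta) parameters of an approximately equivalent LDA model, not a model object
lda1 = hdp.hdp_to_lda()
print(lda1)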

In [0]:
dictionary = corpora.Dictionary(tweets)
corpus1 = [dictionary.doc2bow(text) for text in tweets]

In [0]:
total_topics = 7
lda = models.LdaModel(corpus1, id2word=dictionary, num_topics=total_topics)
lda.show_topics(total_topics,10)

### Visualize HDP topics through an interactive graph - pyLDAvis


In [0]:

pyLDAvis.enable_notebook()
panel = pyLDAvis.gensim.prepare(hdp, corpus, dictionary, mds='TSNE')
panel

### Visualize cosine metrics of topics as a heatmap 


In [0]:
from collections import OrderedDict
#choose how many topics to inspect - here the first 6, with their top 20 words each
data_hdp = {i: OrderedDict(hdp.show_topic(i,20)) for i in range(6)}

In [0]:
df_hdp = pd.DataFrame(data_hdp)
df_hdp = df_hdp.fillna(0).T
print(df_hdp.shape)

In [0]:
df_hdp

In [0]:
g=sns.clustermap(df_hdp.corr(), center=0, cmap="RdBu", metric='cosine', linewidths=1, figsize=(10, 12))
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()
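
As background for reading the heatmap, the sketch below computes cosine similarity between two topics represented as word-weight dictionaries, treating words missing from a topic as weight 0 (the same role `fillna(0)` plays above). It is a simplified illustration with made-up weights, not a reproduction of what `clustermap` computes internally.

```python
import math

def cosine_similarity(topic_a, topic_b):
    """Cosine similarity of two {word: weight} dicts; missing words count as 0."""
    words = set(topic_a) | set(topic_b)
    dot = sum(topic_a.get(w, 0.0) * topic_b.get(w, 0.0) for w in words)
    norm_a = math.sqrt(sum(v * v for v in topic_a.values()))
    norm_b = math.sqrt(sum(v * v for v in topic_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty topic is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

# Hypothetical topic-word weights, not output from the models above.
topic_0 = {"flu": 0.5, "shot": 0.5}
topic_1 = {"flu": 0.5, "season": 0.5}
topic_2 = {"autism": 1.0}

print(round(cosine_similarity(topic_0, topic_1), 3))  # 0.5
print(cosine_similarity(topic_0, topic_2))            # 0.0
```

Topics that share heavily weighted words score close to 1, while topics with no words in common score 0.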

### HDP and LDA via Gensim Models

We can also use Latent Dirichlet Allocation (LDA), since the implementation in Gensim is straightforward. As before, we need to create a Dictionary and a Corpus, set the number of topics we want to infer, and finally choose how many keywords to display for each topic.

HDP is a nonparametric extension of LDA: LDA requires the number of topics to be fixed in advance, whereas HDP infers it from the data as part of the model, without any a priori knowledge of the topics.

In the example below, we set 5 topics and show 10 keywords per topic.





In [0]:
total_topics = 5

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=total_topics)
lda.show_topics(total_topics,10)



---

Heatmap of Cos Metrics for LDA



In [0]:
data_lda = {i: OrderedDict(lda.show_topic(i,25)) for i in range(total_topics)}
#data_lda
df_lda = pd.DataFrame(data_lda)
df_lda = df_lda.fillna(0).T
print(df_lda.shape)

In [0]:
g=sns.clustermap(df_lda.corr(), center=0, cmap="RdBu", metric='cosine', linewidths=1, figsize=(10, 12))
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()



---
pyLDAvis for LDA



In [0]:
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
panel = pyLDAvis.gensim.prepare(lda, corpus, dictionary, mds='TSNE')
panel