# Exploratory Spatial Data Analysis of Disaster Tweets with BERTopic


In [None]:
%%capture
# Install packages (the %%capture cell magic has to be the first line of the cell)
!pip install bertopic


## Load dataset and libraries

In [None]:
import pandas as pd

import folium
from folium.plugins import HeatMap

from bertopic import BERTopic
import statistics
import re

from IPython.display import display, HTML

In [None]:
url = 'https://raw.githubusercontent.com/DorianZGIS/geo_ai_data_public/main/Data/napa_tweets2.csv'
df = pd.read_csv(url, sep='\t')
#df = df.drop(['Unnamed: 0'], axis=1)
df.head(5)

Unnamed: 0,time,tweet_text,latitude,longitude
0,24.08.2014 23:07,SMITE vs LOL ??,37.6007,-122.01482
1,24.08.2014 23:12,Random people complimenting you is so nice,38.417359,-122.709612
2,24.08.2014 23:37,Calum is being an annoying,37.353794,-121.863898
3,24.08.2014 23:38,Plan B,37.957866,-122.032392
4,24.08.2014 23:42,Laaawwwwllllll,38.3384,-122.685784


# Topic Modelling



### 1. Assignment: (9 Points)

Perform topic modelling with a BERT-based approach (e.g. BERTopic).

* Read up about BERTopic on https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html. Explain in your own words how this algorithm works! (Or use your favorite NLP algorithm to summarize the text for you ;) ) (2 Points)

* Preprocess the tweet data accordingly. (4 Points)

* Build and train a basic BERTopic model. You will find the Quickstart guide (https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html) helpful! (3 Points)



## <font color=#FB6060> BERTopic summary </font>

<font color=#FB6060> BERTopic starts by embedding the tweets, i.e., transforming each document into a numerical representation. The embeddings then undergo dimensionality reduction and are clustered, so that each cluster becomes a topic. To describe the topics, BERTopic uses c-TF-IDF, an adaptation of the TF-IDF (Term Frequency, Inverse Document Frequency) algorithm for finding relevant words. The adjusted version treats all documents in a cluster as a single document, which makes it possible to compare word importance across clusters. </font>

<font color=#FB6060> (the quickstart is super unhelpful for explaining how the algorithm actually works, I went off of [this](https://medium.com/@angelamarieteng/topic-modeling-with-bert-2e3218723373) instead)  </font>
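
To make the c-TF-IDF step more concrete, here is a minimal pure-Python sketch of the weighting, assuming the form `tf * log(1 + A / f)`, where `tf` is a word's count inside one topic, `f` its count across all topics, and `A` the average number of words per topic. The function name `c_tf_idf` and the toy tweets are made up for illustration; the sketch uses raw counts and leaves out the normalisation that the library applies, so it is not BERTopic's own implementation.

```python
import math
from collections import Counter

def c_tf_idf(docs_per_topic):
    """Score the words of each topic with a class-based TF-IDF weighting.

    docs_per_topic maps a topic id to the list of cleaned tweets in that topic.
    """
    # Treat all tweets of a topic as one long document and count its words.
    tf = {topic: Counter(" ".join(docs).split()) for topic, docs in docs_per_topic.items()}
    # Count how often each word occurs across all topics.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(tf)
    return {
        topic: {word: freq * math.log(1 + avg_words / total[word]) for word, freq in counts.items()}
        for topic, counts in tf.items()
    }

# Hypothetical toy data: two topics with two tweets each.
toy_topics = {
    0: ["earthquake shook napa", "big earthquake tonight"],
    1: ["giants game tonight", "great game"],
}
scores = c_tf_idf(toy_topics)
top_word = {topic: max(words, key=words.get) for topic, words in scores.items()}
print(top_word)  # {0: 'earthquake', 1: 'game'}
```

Words that occur in several topics, such as "tonight" in the toy data, receive a lower weight than words that are specific to one topic.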

### Preprocessing

Think about the steps we discussed in the lecture. Which of them are needed in this example, which are not, and why?
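
One way to judge which steps matter is to watch what each of them does to a single tweet. The sketch below mirrors the three cleaning steps used in the preprocessing cell of this notebook; the helper name `clean_tweet` and the example tweet are made up for illustration.

```python
import re

def clean_tweet(text):
    """Apply the notebook's three cleaning steps to a single tweet."""
    text = re.sub(r"http\S+", "", text).lower()                         # drop links, lowercase
    text = " ".join(w for w in text.split() if not w.startswith("@"))   # drop @mentions
    return " ".join(re.sub("[^a-zA-Z]+", " ", text).split())            # keep letters only

print(clean_tweet("@KQED Woke up to an #earthquake in Napa!! http://t.co/abc123"))
# woke up to an earthquake in napa
```

Note that only the `#` symbol is removed, while the hashtag word itself stays in the text. That may well be desirable here, since hashtags often carry the topic of a tweet.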

In [None]:
# Your preprocessing steps! (3 lines of code)
# In order to ensure a smooth topic modeling and topic depiction over time check out how Maarten preprocesses his text for depicting topics over time: https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-topics-over-time

# Filter out hashtags, usernames and links
# Your code (about 3 lines of code)

# Remove links and convert to lowercase
df.tweet_text = df.tweet_text.apply(lambda text: re.sub(r"http\S+", "", text).lower())
# Drop @username mentions
df.tweet_text = df.tweet_text.apply(lambda text: " ".join(word for word in text.split() if not word.startswith("@")))
# Replace everything except letters ('#', digits, punctuation) with a space and collapse whitespace
df.tweet_text = df.tweet_text.apply(lambda text: " ".join(re.sub("[^a-zA-Z]+", " ", text).split()))

timestamps = df.time.to_list()
filtered_tweets = df.tweet_text.to_list()

timestamps
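
The timestamps above are plain strings in a day-first format. If they need to be sorted, filtered or binned later, parsing them explicitly avoids any ambiguity between day and month. A minimal sketch, assuming the format read off the sample rows (`24.08.2014 23:07`); the names `TIME_FORMAT` and `parse_timestamps` are hypothetical.

```python
from datetime import datetime

# Assumed format, read off the sample rows: day.month.year hour:minute
TIME_FORMAT = "%d.%m.%Y %H:%M"

def parse_timestamps(raw_timestamps):
    """Convert timestamp strings into datetime objects."""
    return [datetime.strptime(raw, TIME_FORMAT) for raw in raw_timestamps]

sample = ["24.08.2014 23:07", "25.08.2014 06:59"]
parsed = parse_timestamps(sample)
print(parsed[1] - parsed[0])  # 7:52:00
```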

**Build and train the model**

This can take a few minutes. If it takes more than 30 min on your local machine, check whether you can use a GPU for accelerated training, use a smaller sample of tweets (e.g. 10k), consider a topic reduction, or switch to the online Google Colab implementation of https://github.com/MaartenGr/BERTopic.

Note: if you work with a random sample of your data, it is easiest to draw the sample directly from your dataframe df, since a later example needs the date of each tweet.
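
A minimal sketch of such a sample, using a made-up stand-in dataframe (`toy_df`) with the same `time` and `tweet_text` columns. Sampling rows instead of separate lists keeps every tweet paired with its own timestamp; the sample size and `random_state` are arbitrary choices.

```python
import pandas as pd

# Toy stand-in for the tweets dataframe (hypothetical rows).
toy_df = pd.DataFrame({
    "time": [f"24.08.2014 23:{minute:02d}" for minute in range(20)],
    "tweet_text": [f"tweet number {minute}" for minute in range(20)],
})

# Sample whole rows so that text and timestamp stay aligned.
sample_df = toy_df.sample(n=5, random_state=42)

sample_tweets = sample_df.tweet_text.to_list()
sample_timestamps = sample_df.time.to_list()
print(len(sample_tweets), len(sample_timestamps))
```

For the real data, something like `df.sample(n=10000, random_state=42)` before building the two lists would follow the same pattern.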

Further, when creating the topic_model instance, set verbose = True. This lets you track the progress of your topic model. Training usually takes 15 to 30 minutes, depending on your machine and on CPU or GPU use.

In [None]:
# Define the topic model for 10 topics and verbose = True
topic_model = BERTopic(verbose=True, nr_topics=10)

# fit the topic model with your filtered tweets
topics, probs = topic_model.fit_transform(filtered_tweets)

2023-05-31 09:17:18,228 - BERTopic - Transformed documents to Embeddings
2023-05-31 09:19:26,214 - BERTopic - Reduced dimensionality
2023-05-31 09:19:37,148 - BERTopic - Clustered reduced embeddings
2023-05-31 09:19:42,188 - BERTopic - Reduced number of topics from 1287 to 10


In [None]:
# This is how you can save your model
topic_model.save("napa_earthquake_bertopic_model")


In [None]:
# This is how you can reload your model
topic_model = BERTopic.load("napa_earthquake_bertopic_model")

## Investigate the results

### 2. Assignment: (6 Points)

* Choose at least one way to illustrate your different topics. (2 Points)
* Visualize Topics over Time. (2 Points) (This can be tricky!)
* Interpret your results in your own words. Are the topics the same as for the LDA algorithm? (2 Points)


Help: https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html

In [None]:
# If you encounter issues with nbformat when visualizing the topics, just upgrade nbformat
# !pip install --upgrade nbformat

### Visualize Topics

In [None]:
# Topic visualisation method 
topic_model.visualize_topics()

In [None]:
#topic_model.'Your code!'

#??? why do we need this

### Visualize Topics over Time

Check out: https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-topics-over-time

In [None]:
topics_over_time = topic_model.topics_over_time(filtered_tweets, timestamps)

1440it [02:09, 11.13it/s]


In [None]:
topic_model.visualize_topics_over_time(topics_over_time, topics=[0, 1, 2, 3, 4, 5, 6, 7])

#### Add your topics to the dataframe

In [None]:
# Add the topic assigned to each tweet to the df
df['topics'] = topics
df


Unnamed: 0,time,tweet_text,latitude,longitude,topics
0,24.08.2014 23:07,smite vs lol,37.600700,-122.014820,0
1,24.08.2014 23:12,random people complimenting you is so nice,38.417359,-122.709612,0
2,24.08.2014 23:37,calum is being an annoying,37.353794,-121.863898,0
3,24.08.2014 23:38,plan b,37.957866,-122.032392,0
4,24.08.2014 23:42,laaawwwwllllll,38.338400,-122.685784,0
...,...,...,...,...,...
90902,25.08.2014 06:59,nobody in that booth is a real baseball fan,37.576299,-122.312694,0
90903,25.08.2014 06:59,it used to be standard that academics and tech...,37.850229,-122.283600,-1
90904,25.08.2014 06:59,cecii sometimes i need someone to tlk to somet...,37.333867,-121.878019,0
90905,25.08.2014 06:59,feliz cumplea os y que dios te bendiga mucho t...,37.744386,-122.475447,2


## <font color=#FB6060> Interpretation </font>

<font color=#FB6060>  In this model the topics are not as evenly distributed as in the LDA model: topic 0 and topic 1 are significantly larger than the other topics. Topic 0 seems to consist largely of stop words, so it is probably a more general topic. The time graph shows that topic 0 spiked along with the earthquake, but it also followed other growth trends in tweeting, which suggests that it captures any increase in tweets regardless of the reason. Topic 1 seems much more closely aligned with the earthquake event, especially based on the temporal graph.</font>
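
A claim about uneven topic sizes can be checked by counting how many tweets fall into each topic. The sketch below uses a made-up list of assignments; in the notebook, the `topics` list returned by `fit_transform` would take its place. The helper name `topic_shares` is hypothetical. Topic -1 is the label BERTopic gives to outliers.

```python
from collections import Counter

def topic_shares(topic_ids):
    """Return each topic's share of all documents, largest topic first."""
    counts = Counter(topic_ids)
    total = len(topic_ids)
    return [(topic, count / total) for topic, count in counts.most_common()]

# Hypothetical assignments standing in for the `topics` list.
example_topics = [0, 0, 0, 0, 0, 1, 1, 1, 2, -1]
for topic, share in topic_shares(example_topics):
    print(topic, f"{share:.0%}")
```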

# Specialist Assignments!
# Try to improve the Topic generation

### 3. Assignment (5 Points)

* Try to use the CountVectorizer from sklearn. Read up about how it works and why it might improve your results. (2 Points)

* Try to use different clustering algorithms for bertopic. (2 Points)

* Maybe try out one or two additional preprocessing steps which we discussed in the lecture? Do they improve the results or not? (1 Point)



## <font color=#FB6060> CountVectorizer </font>

In [None]:

from sklearn.feature_extraction.text import CountVectorizer

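# The idea was to avoid weird words like 'aaaaaannndd' instead of 'and'.
# Note: min_df=1 is scikit-learn's default and filters nothing; a higher value is needed to drop rare tokens.
# max_df=10 ignores terms that occur in more than 10 documents.
vectorizer = CountVectorizer(min_df=1, max_df=10)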
vectorized_topic_model = BERTopic(vectorizer_model=vectorizer, nr_topics=10)
topics3, probs3 = vectorized_topic_model.fit_transform(filtered_tweets)

2023-05-31 10:02:40,256 - BERTopic - Transformed documents to Embeddings
2023-05-31 10:04:58,622 - BERTopic - Reduced dimensionality
2023-05-31 10:05:06,600 - BERTopic - Clustered reduced embeddings
2023-05-31 10:05:12,253 - BERTopic - Reduced number of topics from 1337 to 10
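
To see what the `min_df` and `max_df` thresholds do, here is a small pure-Python sketch of document-frequency filtering with integer thresholds. It only illustrates the idea and is not scikit-learn's implementation (which also accepts proportions as floats); the helper name `filter_vocabulary` and the toy documents are made up.

```python
from collections import Counter

def filter_vocabulary(documents, min_df=1, max_df=None):
    """Keep words whose document frequency lies between min_df and max_df."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))  # count each word once per document
    upper = len(documents) if max_df is None else max_df
    return sorted(word for word, n in doc_freq.items() if min_df <= n <= upper)

docs = ["earthquake in napa", "aaaaaannndd earthquake", "napa earthquake felt"]
print(filter_vocabulary(docs, min_df=1))  # ['aaaaaannndd', 'earthquake', 'felt', 'in', 'napa']
print(filter_vocabulary(docs, min_df=2))  # ['earthquake', 'napa']
```

With `min_df=1` every word survives, including the stretched-out `aaaaaannndd`. Only a threshold of 2 or more removes words that appear in a single document.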


In [None]:
vectorized_topic_model.visualize_topics()

## <font color=#FB6060>  Clustering </font>


In [None]:


from hdbscan import HDBSCAN

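# Note: HDBSCAN is also BERTopic's default clustering algorithm, so this run mainly changes min_cluster_size
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)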
clustered_topic_model = BERTopic(hdbscan_model=hdbscan_model, nr_topics=10)

topics2, probs2 = clustered_topic_model.fit_transform(filtered_tweets)

2023-05-31 09:49:16,888 - BERTopic - Transformed documents to Embeddings
2023-05-31 09:51:17,814 - BERTopic - Reduced dimensionality
2023-05-31 09:51:29,577 - BERTopic - Clustered reduced embeddings
2023-05-31 09:51:33,275 - BERTopic - Reduced number of topics from 929 to 10


In [None]:
clustered_topic_model.visualize_topics()

## <font color=#FB6060> More pre-processing </font>

In [None]:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

cleaner_tweets=[nltk.word_tokenize(tweet) for tweet in filtered_tweets]

reg_expression = r'[^A-Za-z]+'
cleaner_tweets = [[re.sub(reg_expression,'', string) for string in sub_list] for sub_list in cleaner_tweets]

# Remove stop words
stoplist = set(stopwords.words('english'))
cleaner_tweets = [[word for word in document if word not in stoplist] for document in cleaner_tweets]

#Remove empty strings
cleaner_tweets = [[word for word in document if word] for document in cleaner_tweets]

display(cleaner_tweets[:10])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[['smite', 'vs', 'lol'],
 ['random', 'people', 'complimenting', 'nice'],
 ['calum', 'annoying'],
 ['plan', 'b'],
 ['laaawwwwllllll'],
 ['b', 'lol'],
 ['earthquake'],
 ['stits', 'go'],
 ['time'],
 ['hey']]

In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer as Detok
detokenizer = Detok()
cleaner_tweets_detok = [detokenizer.detokenize(tweet) for tweet in cleaner_tweets]

cleaner_topic_model = BERTopic(verbose=True, nr_topics=10)
topics4, probs4 = cleaner_topic_model.fit_transform(cleaner_tweets_detok)




2023-05-31 10:25:33,636 - BERTopic - Transformed documents to Embeddings
2023-05-31 10:27:40,653 - BERTopic - Reduced dimensionality
2023-05-31 10:27:51,722 - BERTopic - Clustered reduced embeddings
2023-05-31 10:27:56,475 - BERTopic - Reduced number of topics from 1483 to 10


In [None]:
cleaner_topic_model.visualize_topics()

In [None]:
cleaner_topics_over_time = cleaner_topic_model.topics_over_time(cleaner_tweets_detok, timestamps)
cleaner_topic_model.visualize_topics_over_time(cleaner_topics_over_time, topics=[0, 1, 2, 3, 4, 5, 6, 7])

1440it [02:56,  8.14it/s]


<font color=#FB6060> Pre-processing the tweets created an even bigger disparity between the first two topics (topic 0 and 1) and the rest of the topics. However, topic 0 still seems to act as a 'general' topic and contains many generic words that are not covered by traditional stopword lists like the one we used. </font>

# Explore the geospatial distribution of different topics


### 4. Assignment: (3 Points)

* Plot the 2 topics you find most suitable and 1 which you do not find suitable at all to describe the Napa earthquake, each on its own basemap. (2 Points)
* Interpret and compare these results. (E.g.: How do these maps differ from the LDA topic maps we created in the lecture?) (1 Point)

In [None]:
def generateBaseMap(default_location=(40.693943, -73.985880), default_zoom_start=12):
    # Build an empty folium map centred on the given (latitude, longitude)
    base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start)
    return base_map

In [None]:
df['count'] = 1
# Centre of the map: mean latitude (y) and mean longitude (x) of all tweets
y = df['latitude'].mean()
x = df['longitude'].mean()

In [None]:

# Create Topic maps
topic_numbers = [0, 1]

base_maps = []
for topic_number in topic_numbers:
    df_topic = df.loc[df['topics'] == topic_number]
    base_map = generateBaseMap([y,x],8)
    HeatMap(data=df_topic[['latitude', 'longitude', 'count']].groupby(['latitude', 'longitude']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)
    base_maps.append(base_map)

htmlmap = HTML('<iframe srcdoc="{}" style="float:left; width: {}px; height: {}px; display:inline-block; width: 50%; margin: 0 auto; border: 2px solid black"></iframe>'
           '<iframe srcdoc="{}" style="float:right; width: {}px; height: {}px; display:inline-block; width: 50%; margin: 0 auto; border: 2px solid black"></iframe>'
           .format(base_maps[0].get_root().render().replace('"', '&quot;'),400,400,
                   base_maps[1].get_root().render().replace('"', '&quot;'),400,400))
display(htmlmap)

## <font color=#FB6060> Interpretation </font>

<font color=#FB6060> The maps honestly aren't too different from the ones in class. The map of topic 1 (on the right) seems to show higher activity not only in the Napa city center but also in some surrounding areas of Napa, which is perhaps indicative of earthquake-specific activity. </font>

# Evaluation


### 5. Exercise (2 Points)

* Think about different ways to assess the quality of these topics!
* Describe your results briefly in 3-5 sentences. (2 Points)
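
One simple quantitative check is topic diversity: the share of unique words among the top words of all topics. Values near 1 mean the topics use distinct vocabularies, values near 0 mean they largely repeat each other. The sketch below uses made-up top-word lists and a hypothetical helper name `topic_diversity`; in the notebook the words could be taken from `topic_model.get_topic(topic_id)`, which returns (word, score) pairs.

```python
def topic_diversity(top_words_per_topic):
    """Share of unique words among the top words of all topics (0 to 1)."""
    all_words = [word for words in top_words_per_topic for word in words]
    return len(set(all_words)) / len(all_words)

# Hypothetical top words for three topics.
example = [
    ["earthquake", "napa", "shaking", "felt"],
    ["game", "giants", "baseball", "felt"],
    ["happy", "birthday", "love", "day"],
]
print(round(topic_diversity(example), 3))  # 0.917
```

Diversity says nothing about whether a topic is meaningful, so it works best next to a manual look at the top words and at a few example tweets per topic.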

<font color=#FB6060> In general, our results do an okay job at the task of identifying earthquake-specific activity in tweets. On the one hand, the earthquake tweets were successfully sorted into their own topic, and the Topics over Time graph shows a spike corresponding to the earthquake event. On the other hand, we would have expected to see more pronounced results in the mapping. Perhaps adjusting the map for population density, as mentioned in class, would be a fair solution to this issue. Overall, we think these results are a good testament to both the power and the limitations of NLP and to the messiness of data such as tweets. </font>
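
A rough way to adjust a topic map for how busy an area is, without external population data, is to divide the number of tweets of one topic in each grid cell by the number of all tweets in that cell. This is an assumption made for illustration: overall tweet volume serves as a stand-in for population density and is not the method from class. The helper names, the 0.1 degree cell size and the coordinates are made up.

```python
import math
from collections import Counter

def cell_of(lat, lon, cell_size=0.1):
    """Index of the grid cell (cell_size in degrees) that contains a coordinate."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))

def topic_share_per_cell(points, topic, min_tweets=1):
    """points: iterable of (lat, lon, topic_id). Returns {cell: share of `topic`}."""
    all_counts, topic_counts = Counter(), Counter()
    for lat, lon, topic_id in points:
        cell = cell_of(lat, lon)
        all_counts[cell] += 1
        if topic_id == topic:
            topic_counts[cell] += 1
    # Skip cells with too few tweets, where a share would be unreliable.
    return {cell: topic_counts[cell] / n for cell, n in all_counts.items() if n >= min_tweets}

# Hypothetical tweets: three in one cell, four in another.
points = [
    (38.25, -122.25, 1), (38.26, -122.27, 1), (38.27, -122.24, 0),
    (37.75, -122.45, 0), (37.76, -122.44, 0), (37.77, -122.43, 0), (37.78, -122.42, 1),
]
shares = topic_share_per_cell(points, topic=1)
print(shares)
```

In the toy data the second cell has more tweets overall, but the first cell has the higher share of topic 1, which is the kind of contrast a raw heatmap hides.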

# Summary: 

Congratulations! You are now able to use state-of-the-art transformer technology to classify natural text into topics!
Further, one could experiment with different BERT models as a base for the BERTopic algorithm, or explore how hyperparameters such as the number of topics influence the results. However, this is not part of the assignment.