## Spatial Data Science (GIS6307/GEO4930)


<br>
Instructor: Yi Qiang (qiangy@usf.edu)<br>
Teaching Assistant: Jinwen Xu (jinwenxu@usf.edu)

---

# Workshop on Spatial Analysis of Twitter (Day 2)

This workshop will help you to get started with the acquisition, processing, and analysis of Twitter data using data science techniques. Specifically, you will learn:

- Streaming real-time tweets using Twitter Developer APIs.
- Processing the raw tweets into an analyzable form.
- Mapping, spatial analysis and natural language processing of Twitter data.

### Prerequisites
- Install Anaconda on your computer.
- Activate a Twitter Developer Account and obtain approved **Elevated Access** before the workshop.
- Basic programming skills are recommended, but not required.


## 1. Install Python Libraries

Please open Anaconda Prompt.

1. Please run the following command in Anaconda Prompt to activate the conda environment `geo` that you created on Day 1. 
    
    - `conda activate geo`
    
2. Launch Jupyter Notebook using the following command:

    `jupyter notebook`

3. Open the downloaded .ipynb file

## 2. Read and Explore Data

Import libraries that are needed for this lab.

In [None]:
import nltk
import string
import tweepy
import pandas as pd
import emoji

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

#from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

# Print full cells in dataframes
pd.set_option('display.max_colwidth', None)

First, let's read the tweets that were streamed in the previous class. If you don't have the data, you can download a sample dataset from [**here**](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/Twitter/tweets_putin.csv) and save it in the default directory of Jupyter Notebook (C:\Users\UserName on Windows and /Users/UserName on macOS).

In [None]:
tweets = pd.read_csv('tweets_putin.csv')

tweets.head()

Print the total number of tweets.

In [None]:
len(tweets)

## 3. Text Cleaning

Raw tweets may include many phrases, symbols and characters that carry no meaning for text analysis and are hard for machines to interpret. Text cleaning is the process of removing these items and preparing raw text for Natural Language Processing (NLP). It is an important step for getting meaningful results from text mining. Text cleaning includes the following basic steps.

- Remove punctuation, URLs, mentions and hashtags
- Tokenization - Converting a sentence into a list of words
- Remove stop words
- Lemmatization/stemming - Transforming any form of a word to its root word

### 3.1 Remove punctuation, URLs, mentions and hashtags

Punctuation, URLs, mentions and hashtags make it harder for machines to recognize meaningful words. The first step of text cleaning is to remove this noise.

The `string.punctuation` attribute is a string of common punctuation characters, which will be removed. We also add some characters that are not included in `string.punctuation`.

In [None]:
# Combining punctuations in string.punctuation and other punctuations
punctuation = list(string.punctuation) + ['’','…','\n']

# Print the combined list of punctuations
punctuation

Emojis are very popular in text messages and tweets. They are encoded as special Unicode characters rather than words, so text mining tools cannot interpret them directly. It is helpful to convert emojis to words (i.e. demojize).

Next, we select the first tweet in the dataset and convert the emojis to words.

In [None]:
text = tweets['tweet'].iloc[0]
text

Use `emoji.demojize` to convert emojis to their text names.

In [None]:
text = emoji.demojize(text, delimiters=(' ', ' '))
text

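Replace the underscores in the emoji names with spaces and drop the word "mark".

In [None]:
# Replace underscores with spaces and remove the word 'mark' from emoji names
text = text.replace("_"," ").replace('mark',"")
# Print the converted text
text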

Next, we combine the steps of removing emojis, mentions, hashtags, URLs and punctuations in a function.

In [None]:
# import the re library for regular expression operations.
import re

# Define a function to remove punctuation in a text string
def remove_punct(text):
    
    # Convert emojis to words
    text = emoji.demojize(text, delimiters=(' ', ' ')).replace("_"," ").replace('mark',"")

    # Remove mentions
    text = re.sub("@[A-Za-z0-9_]+"," ", text)
    
    # Remove hashtags
    text = re.sub("#[A-Za-z0-9_]+"," ", text)
    
    # Remove URLs
    text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)
    
    # Remove punctuations
    for p in punctuation: text = text.replace(p, " ")
        
    return text
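
To make the effect of each cleaning step easier to see, here is a minimal sketch that applies similar patterns to a single made-up tweet. The sample text is hypothetical, the URL pattern is a simplified assumption rather than the one used in `remove_punct`, and the emoji step is left out so that the sketch only needs the standard library.

```python
import re
import string

# Hypothetical sample tweet, made up for illustration only
sample = "RT @user_1: Peace talks resume today! #news https://example.com/a/b?x=1"

# Remove mentions and hashtags with the same kind of patterns as in remove_punct
no_mentions = re.sub(r"@[A-Za-z0-9_]+", " ", sample)
no_hashtags = re.sub(r"#[A-Za-z0-9_]+", " ", no_mentions)

# Remove URLs with a simplified pattern (assumption, not the workshop's regex)
no_urls = re.sub(r"\w+://\S+", " ", no_hashtags)

# Replace every remaining punctuation character with a space
cleaned = no_urls
for p in string.punctuation:
    cleaned = cleaned.replace(p, " ")

# Collapse runs of whitespace so the result is easy to read
cleaned = " ".join(cleaned.split())
print(cleaned)
```

The order matters: mentions, hashtags and URLs are removed before punctuation, because stripping `@`, `#` and `/` first would leave the user names, tags and URL fragments behind as ordinary words.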

Then we use `apply` to map the `remove_punct` function to each tweet in the `tweet` column to remove mentions, hashtags, URLs and punctuation. The `apply` method calls the function on every element of the column, which replaces an explicit `for` loop.

In [None]:
# Apply remove_punct to each row in the dataframe
tweets['tweet_punct'] = tweets['tweet'].apply(lambda x: remove_punct(x))

# Preview the original tweets and processed tweets
tweets[['tweet','tweet_punct']].head(10)

### 3.2 Tokenization

Word tokenization, also known as word segmentation, divides a string of written language into its component words. White space is a good approximation of a word divider in English and many other languages that use some form of the Latin alphabet.

First, we create a function to tokenize text by non-alphanumeric symbols, such as white spaces and symbols not included in the punctuation list.

In [None]:
def tokenization(text):
    # Split on runs of non-word characters (the raw string avoids an invalid escape warning)
    text = re.split(r'\W+', text)
    return text
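
One detail worth knowing about `re.split` is that it returns empty strings when the text starts or ends with a separator. The sketch below uses a made-up string to show this, and one possible way to drop the empty tokens. The filtering step is an optional addition and is not part of the `tokenization` function above.

```python
import re

# Hypothetical cleaned tweet with leading and trailing spaces
text = " Peace talks  resume today "

# Splitting on runs of non-word characters leaves empty strings at both ends
raw_tokens = re.split(r"\W+", text.lower())

# Dropping the empty strings keeps only real words
tokens = [t for t in raw_tokens if t]

print(raw_tokens)
print(tokens)
```

If the empty tokens are kept, they are counted like any other word in later frequency analysis.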

Tokenize the tweets from which punctuation has been removed, and store the tokenized tweets in a new column. We also change all characters to lower case.

In [None]:
# Tokenize the tweets
tweets['tweet_tokenized'] = tweets['tweet_punct'].apply(lambda x: tokenization(x.lower()))

# Preview the tokenized tweets
tweets[['tweet_punct','tweet_tokenized']].head()

### 3.3 Remove stop words

Stop words (e.g. a, the, for...) are frequently used in English text, but carry little information. In text mining, stop words may dilute the words that carry actual meaning. Removing stop words can yield more meaningful results from text mining.

The `nltk` library contains comprehensive lists of stop words in different languages. The following code gets a list of English stop words from `nltk.corpus.stopwords.words` and extends it with a few extra words, which will be removed later.

In [None]:
# Get a list of stopwords from nltk, plus Twitter-specific terms and other common words.
stopword = nltk.corpus.stopwords.words('english') + ['rt', 'via','amp','get',
                                                     'would','go','like','say',
                                                     "don\'t",'dont','need','want','think',
                                                     'show','know','let','putin']

stopword

Create a function to remove stop words

In [None]:
def remove_stopwords(text):
    text = [word for word in text if word not in stopword]
    return text
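
As an illustration of the same filtering idea, the sketch below uses a Python `set` for the stop words, since membership tests on a set are faster than on a list when many tweets are processed. The short stop-word set and the function name `remove_stopwords_fast` are hypothetical and only stand in for the full `nltk` list.

```python
# Small hypothetical stop-word set; the workshop uses the full nltk list
stop_set = {"the", "a", "for", "is", "rt"}

def remove_stopwords_fast(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [word for word in tokens if word not in stop_set]

# Made-up tokenized tweet
tokens = ["rt", "the", "summit", "is", "a", "step", "for", "peace"]
filtered = remove_stopwords_fast(tokens)
print(filtered)
```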

Apply the `remove_stopwords` function to the tokenized tweets.

In [None]:
tweets['tweet_nonstop'] = tweets['tweet_tokenized'].apply(lambda x: remove_stopwords(x))


tweets[['tweet_tokenized','tweet_nonstop']].head()

### 3.4 Word stemming

Word stemming is the process of reducing a word to its stem by stripping affixes such as suffixes and prefixes. A related technique, lemmatization, maps a word to its dictionary form, known as a lemma. Both are widely used in natural language understanding (NLU) and natural language processing (NLP). Here we use the Porter stemmer from `nltk`.

First, we create a function for word stemming.

In [None]:
ps = nltk.PorterStemmer()

def stemming(text):
    text = [ps.stem(word) for word in text]
    return text

Apply the function `stemming` to the `tweet_nonstop` column.

In [None]:
tweets['tweet_stemmed'] = tweets['tweet_nonstop'].apply(lambda x: stemming(x))

tweets[['tweet_nonstop','tweet_stemmed']].head()

Print the original tweets and processed tweets in the four steps. Please compare their differences to learn what has been done at each step.

In [None]:
tweets[['tweet','tweet_punct','tweet_tokenized','tweet_nonstop','tweet_stemmed']].head(10)

## 4. Frequency Analysis

### 4.1 Word Cloud
A word cloud (also known as a tag cloud) is a visual representation of words. A word cloud highlights popular words and phrases based on frequency, providing quick and simple visual insights that can lead to more in-depth analyses.

Before creating word clouds, we need to join the lists in the `tweet_tokenized`, `tweet_nonstop`, and `tweet_stemmed` columns into strings.

In [None]:
import numpy as np

# Join the raw and punctuation-free tweets with spaces so that words from
# adjacent tweets are not glued together
tweets_raw = ' '.join(tweets.tweet)

tweets_punct = ' '.join(tweets.tweet_punct)

# Concatenate the token lists of all tweets, then join the tokens into one string
tweets_tokenized = tweets.tweet_tokenized.sum()
tweets_tokenized = ' '.join([str(tweet) for tweet in tweets_tokenized])

tweets_nonstop = tweets.tweet_nonstop.sum()
tweets_nonstop = ' '.join([str(tweet) for tweet in tweets_nonstop])

tweets_stemmed = tweets.tweet_stemmed.sum()
tweets_stemmed = ' '.join([str(tweet) for tweet in tweets_stemmed])
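
When text from many tweets is combined into one string, the separator matters. The following sketch, using made-up tweets and plain Python lists instead of dataframe columns, shows the difference between concatenating without a separator and joining with a space.

```python
# Two hypothetical tweets
texts = ["peace talks resume", "markets react"]

# Plain concatenation glues the last and first words together
glued = "".join(texts)

# Joining with a space keeps word boundaries intact
spaced = " ".join(texts)

# Token lists can be flattened and joined in one step
token_lists = [["peace", "talks"], ["markets", "react"]]
flat = " ".join(word for tokens in token_lists for word in tokens)

print(glued)
print(spaced)
print(flat)
```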

Create word clouds for the tweets at different processing steps. Comparing the word clouds, you can see how each processing step affects the result.

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20, 20)
from wordcloud import WordCloud

# Create a plot with subplots arranged in 3 rows * 2 columns
fig, ax = plt.subplots(3, 2)

# Create word clouds for tweets at the 5 processing steps.
wordcloud_raw = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweets_raw)
wordcloud_punct = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweets_punct)
wordcloud_tokenized = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweets_tokenized)
wordcloud_nonstop = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweets_nonstop)
wordcloud_stemmed = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweets_stemmed)

# Display the word cloud to the subplots one by one
ax[0,0].imshow(wordcloud_raw, interpolation='bilinear')
ax[0,0].set_title('Raw tweets', fontsize=16)
ax[0,0].axis('off')

ax[0,1].imshow(wordcloud_punct, interpolation='bilinear')
ax[0,1].set_title('Tweets after removing punctuation',fontsize=16)
ax[0,1].axis('off')

ax[1,0].imshow(wordcloud_tokenized, interpolation='bilinear')
ax[1,0].set_title('Tokenized tweets',fontsize=16)
ax[1,0].axis('off')

ax[1,1].imshow(wordcloud_nonstop, interpolation='bilinear')
ax[1,1].set_title('Tweets without stop words',fontsize=16)
ax[1,1].axis('off')

ax[2,0].imshow(wordcloud_stemmed, interpolation='bilinear')
ax[2,0].set_title('Tweets after stemming',fontsize=16)
ax[2,0].axis('off')

# Hide the unused sixth subplot
ax[2,1].axis('off')

### 4.2 Word Frequency in Bar Chart

In addition to a word cloud, a bar chart is a useful graph to show word frequency. To create a bar chart, we need to combine all cleaned tweets and count the occurrences of each word. We can do this using `collections.Counter`. Finally, we will get the 15 most frequent words.

In [None]:
import collections

# Combine all tweets in the tweet_stemmed column
all_tweets = tweets.tweet_stemmed.sum()

# Count the numbers of identical words in the tweets.
counts = collections.Counter(all_tweets)

# Get the 15 most frequent words
most_common_words = counts.most_common(15)
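
As a small illustration of the counting step, the sketch below runs `collections.Counter` on a few made-up lists of stemmed tokens. The token lists are hypothetical. `most_common(n)` returns `(word, count)` pairs sorted from most to least frequent.

```python
from collections import Counter

# Hypothetical stemmed tokens from three tweets
token_lists = [["peac", "talk"], ["peac", "market"], ["talk", "peac"]]

# Flatten the lists into one long list of tokens
all_tokens = [word for tokens in token_lists for word in tokens]

# Count identical tokens and keep the two most frequent
counts = Counter(all_tokens)
top_two = counts.most_common(2)
print(top_two)
```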

Convert `most_common_words` to a dataframe, which is easier for analysis.

In [None]:
# Convert to dataframe
frequent_tweets = pd.DataFrame(most_common_words, columns=['words', 'count'])

# Print the dataframe
frequent_tweets

Create a bar chart to display the most frequent words.

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

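# Plot horizontal bar graph.
# The slice [1:16] skips the first row, which is typically the empty-string
# token left over from tokenization.
frequent_tweets[1:16].sort_values(by='count').plot.barh(x='words',
                      y='count',
                      ax=ax)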

ax.set_title("Most Frequent Words in Tweets")

plt.show()

## 5. Sentiment Analysis

As a common text mining technique, sentiment analysis can be defined as a process that automates the mining of attitudes, opinions, views and emotions from text, speech, tweets and database sources through Natural Language Processing (NLP). Sentiment analysis involves classifying opinions in text into categories like "negative" (score: -1), "neutral" (score: 0) or "positive" (score: 1). Sentiment analysis is also referred to as subjectivity analysis, opinion mining, and appraisal extraction.
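
To make the three categories concrete, here is a minimal sketch that maps a numeric polarity score to a label. The function name `label_sentiment` and its `threshold` parameter are hypothetical, and the scores are made up for illustration.

```python
def label_sentiment(score, threshold=0.0):
    """Map a polarity score in [-1, 1] to a category label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

# Made-up polarity scores
scores = [-0.8, 0.0, 0.35]
labels = [label_sentiment(s) for s in scores]
print(labels)
```

A threshold above zero, for example `label_sentiment(0.05, threshold=0.1)`, would treat weakly positive or weakly negative scores as neutral.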

Before sentiment analysis, we first join the lists in `tweet_stemmed` into plain strings of words separated by spaces, and store the converted tweets in a new column `tweet_stemmed2`.

In [None]:
# Convert the list of words to strings
tweets['tweet_stemmed2'] = tweets['tweet_stemmed'].apply(lambda x: ' '.join([str(tweet) for tweet in x]))

# Preview the converted strings
tweets[['tweet_stemmed','tweet_stemmed2']]

Next, we use the `textblob` package to calculate sentiment scores of the strings. The `polarity` score ranges from -1 (most negative) to 1 (most positive).

In [None]:
from textblob import TextBlob

# Create a TextBlob object for each tweet
sentiment_objects = [TextBlob(tweet) for tweet in tweets.tweet_stemmed2]

# Print the sentiment scores of the first 20 tweets
# (the loop variable is named blob to avoid shadowing the built-in object)
[blob.polarity for blob in sentiment_objects][0:20]

Store the sentiment scores in a new column in the dataframe.

In [None]:
tweets['sentiment'] = [blob.polarity for blob in sentiment_objects]

tweets.head()

Next, we take a look at some tweets with the most negative sentiment (close to -1).

To do so, we sort the dataframe by sentiment in ascending order, and then preview the first 5 tweets.

In [None]:
tweets.sort_values(by='sentiment', ascending=True).head()

Create a histogram to show the distribution of sentiment.

In [None]:
# Create a canvas in a specific size
fig, ax = plt.subplots(figsize=(10, 10))

# Plot histogram 
plt.hist(tweets['sentiment'], bins=20,edgecolor='k', alpha=0.65)
plt.axvline(tweets['sentiment'].mean(), color='red', linewidth=3)

In the sentiment analysis, many words don't have a positive/negative sentiment, and are assigned neutral (0) sentiment. So a large number of tweets have a 0 sentiment, creating a high bar in the middle. 

Let's check how many tweets have 0 sentiment:

In [None]:
print("{} out of {} tweets have neutral (zero) sentiment.".format(len(tweets[tweets['sentiment']==0]), len(tweets)))

Most tweets have a neutral sentiment, so it makes sense to remove neutral tweets and keep only tweets with a positive or negative sentiment. The following code selects tweets with non-zero sentiment and stores them in `tweets2`.

In [None]:
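# Use .copy() so that new columns can be added to tweets2 later without a SettingWithCopyWarning
tweets2 = tweets[tweets['sentiment']!=0].copy()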

Create the histogram with non-zero sentiment tweets.

In [None]:
# Create a canvas in a specific size
fig, ax = plt.subplots(figsize=(10, 10))

# Plot histogram 
plt.hist(tweets2['sentiment'], bins=20,edgecolor='k', alpha=0.65)
plt.axvline(tweets2['sentiment'].mean(), color='red', linewidth=3)


## 6. Create heat map for sentiment

A heat map (also known as a kernel density map) is a common approach to visualise clusters of points. Heat maps use color gradients to display variation in point density. Instead of treating all points equally, a heat map can also use a "population" attribute to weight the points.

In this task, we will create a heat map to display clusters of tweets using sentiment scores as the population field.
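
The idea of a weighted density can be sketched in a few lines. The example below sums Gaussian kernels centred on each point and scales each kernel by the point's weight. It is an unnormalised illustration of the concept with made-up points and a hypothetical `weighted_density` function, not the algorithm that `folium` uses internally.

```python
import math

def weighted_density(x, y, points, bandwidth=1.0):
    """Sum of Gaussian kernels centred on each point, scaled by its weight."""
    total = 0.0
    for px, py, weight in points:
        dist2 = (x - px) ** 2 + (y - py) ** 2
        total += weight * math.exp(-dist2 / (2 * bandwidth ** 2))
    return total

# Made-up points as (x, y, weight); two of them share the same location
points = [(0.0, 0.0, 1.0), (0.0, 0.0, 0.5), (3.0, 4.0, 1.0)]

near = weighted_density(0.0, 0.0, points)    # evaluated on top of two points
far = weighted_density(10.0, 10.0, points)   # evaluated far from all points
print(near, far)
```

Locations close to many points, or close to points with large weights, receive a higher density value and are drawn in a more intense color.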


### 6.1 Calculating centroids of geotags

Heat maps apply to point data, so we use the centroids of the geotags (bounding boxes) to represent the tweet locations.

The `geotag` column stores each bounding box as a string containing a list of corner coordinates. Next, we will use the `json` package to convert the strings to lists, and store the lists in a new column `geotag2`. The lists make it easier to access the coordinates of the bounding boxes.

In [None]:
import json

# Convert the string to lists
tweets2['geotag2'] = tweets2['geotag'].apply(lambda st: json.loads(st))

tweets2.head()

Next, we can access the coordinates in `geotag2` to calculate the centroids of the bounding boxes.

In [None]:
# Calculate coordinates of bounding box centroid
tweets2['point']  = tweets2['geotag2'].apply(lambda s: [(s[0][1]+s[2][1])/2,(s[0][0]+s[2][0])/2])

# Get the latitude of the centroid 
tweets2['lat']  = tweets2['geotag2'].apply(lambda s: (s[0][1]+s[2][1])/2)

# Get the longitude of the centroid 
tweets2['lon']  = tweets2['geotag2'].apply(lambda s: (s[0][0]+s[2][0])/2)

# Preview the geotweets
tweets2.head()
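
The centroid calculation can be checked by hand on a single bounding box. The sketch below assumes the same layout as the code above, a list of `[longitude, latitude]` corners in which the first and third corners are diagonally opposite. The sample string is made up for illustration.

```python
import json

# Hypothetical geotag string: four [longitude, latitude] corners of a bounding box
geotag = "[[-83.0, 27.0], [-83.0, 29.0], [-82.0, 29.0], [-82.0, 27.0]]"

# Parse the string into a nested list
corners = json.loads(geotag)

# Average the two opposite corners to get the centroid
lon = (corners[0][0] + corners[2][0]) / 2
lat = (corners[0][1] + corners[2][1]) / 2
print(lat, lon)
```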

### 6.2 Create Heat Maps for positive and negative sentiment tweets

Split the dataframe into two dataframes with positive and negative sentiment. Ideally, we should create a kernel density map using sentiment as the population parameter (as in the figures below). However, `folium` does not support negative population values. So we will work around it by mapping positive and negative sentiment in separate layers and overlaying them in a map.

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/kernel.jpg)

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/kde.png)

We first separate positive and negative tweets, and then convert negative scores to positive.

In [None]:
# Select tweets with positive (>0) and negative (<0) sentiment
positive = tweets2.loc[tweets2['sentiment']>0,['lat','lon','sentiment']]
negative = tweets2.loc[tweets2['sentiment']<0,['lat','lon','sentiment']].copy()

# Convert the positive tweets to a numpy array
positive = np.array(positive)

# Convert the negative scores to positive values, then convert to a numpy array
negative['sentiment'] = negative['sentiment'].abs()
negative = np.array(negative)
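
The same split can be illustrated with plain Python lists. In the sketch below, each made-up row is `(lat, lon, sentiment)`, and the negative scores are converted to positive weights so that both groups can be passed to a heat map as `[lat, lon, weight]` rows.

```python
# Made-up rows of (latitude, longitude, sentiment)
rows = [(28.0, -82.5, 0.6), (40.7, -74.0, -0.4), (34.1, -118.2, -0.9)]

# Keep positive rows as they are
positive_rows = [[lat, lon, s] for lat, lon, s in rows if s > 0]

# Keep negative rows, using the absolute value of the score as the weight
negative_rows = [[lat, lon, abs(s)] for lat, lon, s in rows if s < 0]

print(positive_rows)
print(negative_rows)
```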

Create heat maps for positive and negative sentiment tweets

In [None]:
import folium
from folium.plugins import HeatMap
import branca.colormap as cm
from collections import defaultdict


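# Center the map on the mean latitude and longitude of the tweets
lat, lon = tweets2['lat'].mean(), tweets2['lon'].mean()
zoom_level = 1

steps = 20


# folium expects the map location as [latitude, longitude].
# Stamen tiles are no longer bundled with recent folium releases,
# so a built-in CartoDB basemap is used instead.
m = folium.Map([lat, lon], tiles='cartodbpositron', zoom_start=zoom_level)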


#colormap_pos=cm.linear.Blues_09.scale(0,1).to_step(steps)
colormap_pos = cm.LinearColormap(colors=['white','blue'], index=[0,1],vmin=0,vmax=1)

gradient_map_pos=defaultdict(dict)
for i in range(steps):
    gradient_map_pos[1/steps*i] = colormap_pos.rgb_hex_str(1/steps*i)
#colormap_pos.add_to(m) #add color bar at the top of the map

colormap = cm.LinearColormap(colors=['red','white','blue'], index=[-1,0,1],vmin=-1,vmax=1, caption='Sentiment score')

data_pos = (positive).tolist()
HeatMap(data_pos,gradient = gradient_map_pos,min_opacity=0.5).add_to(folium.FeatureGroup(name='Positive').add_to(m))
colormap.add_to(m)


#colormap_neg=cm.linear.Reds_09.scale(0,1).to_step(steps)
colormap_neg = cm.LinearColormap(colors=['white','red'], index=[0,1],vmin=0,vmax=1)

gradient_map_neg=defaultdict(dict)
for i in range(steps):
    gradient_map_neg[1/steps*i] = colormap_neg.rgb_hex_str(1/steps*i)
#colormap_neg.add_to(m) #add color bar at the top of the map

data_neg = (negative).tolist()
HeatMap(data_neg,gradient = gradient_map_neg,min_opacity=0.5).add_to(folium.FeatureGroup(name='Negative').add_to(m))
folium.LayerControl().add_to(m)

m
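
The gradient dictionaries above map a position between 0 and 1 to a hex color. The pure-Python sketch below shows roughly how such a dictionary can be built by linear interpolation between two colors. The helper `lerp_hex` is hypothetical and is only a stand-in for what `rgb_hex_str` provides, not the implementation used by `branca`.

```python
def lerp_hex(start, end, t):
    """Linearly interpolate between two RGB colours and return a hex string."""
    channels = [round(s + (e - s) * t) for s, e in zip(start, end)]
    return "#{:02x}{:02x}{:02x}".format(*channels)

# Interpolate from white to blue in four steps
white, blue = (255, 255, 255), (0, 0, 255)
steps = 4
gradient = {i / steps: lerp_hex(white, blue, i / steps) for i in range(steps)}

for position, colour in gradient.items():
    print(position, colour)
```

More steps give a smoother transition between colors, which is why the map code uses 20 steps.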