## Spatial Data Science (GIS6307/GEO4930)


<br>
Instructor: Yi Qiang (qiangy@usf.edu)<br>
Teaching Assistant: Jinwen Xu (jinwenxu@usf.edu)

---

# Workshop on Spatial Analysis of Twitter (Day 1)

This workshop will help you to get started with the acquisition, processing, and analysis of Twitter data using data science techniques. Specifically, you will learn:

- Streaming real-time tweets using Twitter Developer APIs.
- Processing the raw tweets into an analyzable form.
- Basic mapping, spatial analysis and natural language processing for Twitter data.

### Prerequisites
- Install Anaconda on your computer.
- Activate a Twitter Developer account and obtain approved **Elevated Access** before the workshop.
- Basic programming skills are recommended, but not required.




## 1. Install Python Libraries

First, we need to install a few libraries that will be used in this lab. Please follow the steps below to install the libraries.

1. Open or create a new conda environment `geo`. 
- For students in courses GIS6307 and GEO4930, please open Anaconda Prompt, and use the command `conda activate geo` to activate the "geo" environment that you created in the previous lab. 

- For workshop participants, please run the following commands to create a new conda environment `geo` and then activate it.

    - `conda create --name geo`
    
    - `conda activate geo`

2. Install tweepy, folium, Jupyter Lab, and several supporting libraries using the following command (GIS6307/GEO4930 students only need to install tweepy and folium):

    `conda install -c conda-forge tweepy folium jupyterlab matplotlib emoji wordcloud textblob`
    
    
3. Install pandas and nltk using the following command (GIS6307/GEO4930 students only need to install nltk):

    `conda install -c anaconda pandas nltk`
    

4. Launch Jupyter Notebook using the following command:

    `jupyter notebook`
    

5. Open the `Twitter_Workshop_D1.ipynb` notebook that you just downloaded. Run the following code to import the installed libraries. If the code runs without errors, the libraries were installed successfully.

In [None]:
import tweepy
import folium
import pandas as pd
import matplotlib.pyplot as plt
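
If one of the imports above fails, it helps to know which package is missing before reinstalling anything. The sketch below is one way to check, using only the standard library. The helper name `missing_packages` is our own and is not part of any of the installed libraries.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level package names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]

# Packages installed in the steps above
required = ["tweepy", "folium", "pandas", "matplotlib", "nltk"]
print("Missing packages:", missing_packages(required) or "none")
```

Any package listed as missing can be installed again from Anaconda Prompt after activating the `geo` environment.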

## 2. Getting to Know Jupyter Notebook

Keyboard shortcuts will save you lots of time. Jupyter lists its keyboard shortcuts under Help > Keyboard Shortcuts in the top menu; you can also open the list by pressing `H` in command mode (more on that later). It's worth checking this each time you update Jupyter, as more shortcuts are added all the time.

Another way to access keyboard shortcuts, and a handy way to learn them, is to use the command palette: `Cmd + Shift + P` (or `Ctrl + Shift + P` on Linux and Windows). This dialog box helps you run any command by name, which is useful if you don't know the keyboard shortcut for an action or if what you want to do does not have a keyboard shortcut. The functionality is similar to Spotlight search on a Mac, and once you start using it you'll wonder how you lived without it!

Some of my favorite shortcuts (in command mode):

- `ESC`: exit cell editing and enter command mode
- `A`: add a cell above the current cell
- `B`: add a cell below the current cell
- `C`: copy the current cell
- `X`: cut the current cell
- `V`: paste the copied cell below the current cell
- `DD`: delete the current cell
- `M`: change cell to Markdown (must exit editing first)
- `Y`: change cell to Python code (must exit editing first)
- `Ctrl + Enter`: Run code in the current cell.
- `Shift + Enter`: Run code in the current cell and move to the next cell.

## 3. Set-Up Connection to Twitter

Go to the Twitter Developer Portal (https://developer.twitter.com/en/apps). Click the App you created during account activation.

If you didn't create an App when you created the account, you can create one in the project.

Click `Set up` to configure the User authentication settings.

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/signup3.jpg)

Turn on `OAuth 1.0a` and keep `OAuth 2.0` off. 

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/OAuth.jpg)


![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/setting.jpg)

Find the API keys and tokens you saved when you activated your Developer account. If you can't find them, you can **regenerate** the keys and tokens in your Developer Portal. 

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/keys2.jpg)

Copy and paste your API key, API key secret, Access Token and Access Token Secret to replace "......" below.

In [None]:
API_key = '......'
API_key_secret = '......'
access_token = '......'
access_token_secret = '......'
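
Pasting keys directly into a notebook makes it easy to share them by accident. As an alternative sketch, the keys can be read from environment variables. The variable names below (`TWITTER_API_KEY` and so on) and the helper `load_credentials` are our own choices; neither Twitter nor tweepy requires them.

```python
import os

# Hypothetical environment variable names; any names would work
CREDENTIAL_VARS = [
    "TWITTER_API_KEY",
    "TWITTER_API_KEY_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
]

def load_credentials(env=os.environ):
    """Return the four credentials as a tuple, or raise KeyError if any is unset."""
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in CREDENTIAL_VARS)
```

If the four variables are set before Jupyter is launched, `API_key, API_key_secret, access_token, access_token_secret = load_credentials()` would fill in the same names used in this notebook.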

Set up Twitter authentication.

In [None]:
auth = tweepy.OAuthHandler(API_key, API_key_secret)
auth.set_access_token(access_token, access_token_secret)

Create the tweepy API object with `wait_on_rate_limit = True`, so that the program waits automatically whenever the rate limit is reached.

In [None]:
api = tweepy.API(auth, wait_on_rate_limit = True)
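
Twitter reports the state of the rate limit in response headers, including `x-rate-limit-remaining` and `x-rate-limit-reset` (a Unix timestamp). The sketch below illustrates the kind of calculation that waiting on a rate limit involves. It is a simplified illustration, not tweepy's actual implementation, and `seconds_to_wait` is a hypothetical helper with made-up header values.

```python
import time

def seconds_to_wait(headers, now=None):
    """Return how long to pause before the next request, given rate-limit headers."""
    now = time.time() if now is None else now
    remaining = int(headers.get("x-rate-limit-remaining", 1))
    if remaining > 0:
        return 0.0  # requests are still available in this window
    reset_at = int(headers["x-rate-limit-reset"])  # Unix time when the window resets
    return max(0.0, reset_at - now)

# Made-up header values: quota exhausted, window resets shortly after "now"
example = {"x-rate-limit-remaining": "0", "x-rate-limit-reset": "1000090"}
print(seconds_to_wait(example, now=1000000))
```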

---

## 4. Simple Operation with Twitter APIs

Now, your working environment is set up for Twitter analysis. Let's first try a few simple operations to acquire Twitter data in a programmatic way.

Full documentation for the Twitter API and Tweepy can be found at:

- [Twitter APIs](https://developer.twitter.com/en/docs.html)
- [Tweepy documentation](http://docs.tweepy.org/en/v4.8.0/)

### 4.1 Posting/Deleting a Tweet

First, let's post a message in your Twitter account.

**Note**: if you don't want to disturb your followers with a meaningless tweet, don't run the following block of code.

In [None]:
# Post a tweet from Python
test_tweet = api.update_status("DRILL: I'm creating a robot to tweet!")

Check your Twitter account, and you'll see the above message is posted.

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/tweet.jpg)

Tweets are encoded in a JSON (JavaScript Object Notation) format. You can run the following code to check the content of the tweet you just posted.

In [None]:
test_tweet._json

`_json` returns a dictionary object, so you can access specific attributes using its keys. The code below gets the posting time of the tweet.

In [None]:
test_tweet._json['created_at']

Alternatively, you can use the built-in attributes of the tweepy status object. All attributes of a tweet can be found [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet).

In [None]:
test_tweet.created_at
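
The status attribute is already a `datetime` object, whereas the `created_at` value in the raw JSON is a plain string. As a small sketch, the string can be parsed with the standard library. The dictionary below is a made-up stand-in for `test_tweet._json`, and the format string assumes the v1.1 timestamp layout (e.g. `Wed Oct 10 20:19:24 +0000 2018`).

```python
from datetime import datetime

# Made-up stand-in for test_tweet._json (only two keys shown)
sample_json = {
    "created_at": "Wed Oct 10 20:19:24 +0000 2018",
    "text": "DRILL: I'm creating a robot to tweet!",
}

def parse_created_at(value):
    """Parse a v1.1-style timestamp string into a timezone-aware datetime."""
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")

posted = parse_created_at(sample_json["created_at"])
print(posted.isoformat())
```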

You can run the following code to delete the drill tweet you just posted.

In [None]:
api.destroy_status(test_tweet.id_str)

### 4.2 Getting Trending Tweets

In Twitter, you can find trends in different places in the Explore tab.

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/trends.jpg)

Next, we are going to retrieve trends in Python. The following code gets a list of places where trends are available. These places include countries and cities.

In [None]:
place_ls = api.available_trends()

Convert the list (in JSON format) into a dataframe (i.e. a table). Print the number of places where trends are available.

In [None]:
df_place = pd.DataFrame(place_ls)

print(str(len(df_place)) + " places have trends.")

Preview 10 places where trends are available

In [None]:
df_place.head(10)

Get the record of Tampa. The `woeid` is a unique ID for each place.

In [None]:
df_place[df_place['name']=='Tampa']

Store the `woeid` of Tampa in tampa_id.

In [None]:
tampa_id = df_place[df_place['name']=='Tampa']['woeid']
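
The same lookup can be done without pandas, since the list of places is a list of dictionaries. The sketch below assumes each record has `name` and `woeid` keys; the records and ID values are made-up placeholders, and `find_woeid` is a hypothetical helper.

```python
def find_woeid(places, name):
    """Return the woeid of the first place whose name matches, or None if absent."""
    for place in places:
        if place["name"] == name:
            return int(place["woeid"])
    return None

# Made-up records in the same shape as the entries of place_ls (IDs are placeholders)
sample_places = [
    {"name": "Exampleville", "country": "United States", "woeid": 111},
    {"name": "Tampa", "country": "United States", "woeid": 222},
]
print(find_woeid(sample_places, "Tampa"))
```

Returning a plain integer (or `None`) avoids the extra conversion step that a pandas Series needs.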

Return the trends in Tampa.

Note: you need to convert `tampa_id` from a pandas Series into an integer.

In [None]:
# use Tampa as an example; take the first matching woeid and convert it to an integer
trends_tampa = api.get_place_trends(int(tampa_id.iloc[0]))

Print the trends in JSON format.

In [None]:
# print the top 20 trends in Tampa
trends_tampa[0]['trends'][0:20]

Organize the Tampa trends in a dataframe.

In [None]:
# Get trend name, URL, and tweet volume from the JSON data, and store them in a list
trends_tampa_ls = [[trend['name'], trend['url'], trend['tweet_volume']] for trend in trends_tampa[0]['trends']]

# Convert the list to a dataframe
df_trends = pd.DataFrame(trends_tampa_ls,columns=['name','url','tweet_volume'])

Sort the trends by tweet volume in descending order and print the top 10 trends with the highest tweet volume.

In [None]:
# Sort the trends by tweet volume in descending order
df_trends.sort_values("tweet_volume", inplace = True, ascending = False)

# Print the top 10 trends ranked by tweet volume
df_trends.head(10)

The table shows the popular topics people are tweeting about in Tampa.
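
Some trends come back without a `tweet_volume` value. The sketch below shows one way to rank only the trends that report a volume, in plain Python. The sample records are made up, and `top_trends` is a hypothetical helper.

```python
def top_trends(trends, n=10):
    """Return the n trends with the highest tweet_volume, skipping entries without a volume."""
    with_volume = [t for t in trends if t.get("tweet_volume") is not None]
    return sorted(with_volume, key=lambda t: t["tweet_volume"], reverse=True)[:n]

# Made-up trends in the same shape as trends_tampa[0]['trends']
sample_trends = [
    {"name": "#TopicA", "tweet_volume": 12000},
    {"name": "#TopicB", "tweet_volume": None},
    {"name": "#TopicC", "tweet_volume": 45000},
]
for trend in top_trends(sample_trends):
    print(trend["name"], trend["tweet_volume"])
```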

---

## 5. Acquiring Tweets using the Search API

### 5.1 Search Tweets using Keywords

In this step, you will use Python to search for tweets that contain a specific keyword.

You can replace the keyword with something you are interested in. Please try to choose a popular one so that you can collect sufficient tweets.

In [None]:
tweets = api.search_tweets("Football", count=100)

print("Total tweets retrieved: "+ str(len(tweets)))

Store the user name, user location, posting time, and tweet text in a Pandas dataframe.

In [None]:
tweets_pd = pd.DataFrame([[tweet.user.name, tweet.user.location,tweet.created_at, tweet.text] for tweet in tweets], 
                         columns = ['user_name','user_loc','creation_time','text'])

tweets_pd

The `search_tweets` function can retrieve at most 100 tweets at a time. If you want to get more tweets, you need to use a `Cursor`.

The following code uses a cursor to retrieve more tweets containing a keyword. Here we still retrieve 100 tweets to save your search quota for the following steps. You can increase the `num` variable if you want to get more tweets.

> Note: Twitter only allows you to retrieve a limited number of tweets per 15 minutes. If the retrieved tweets exceed the limit, the program will pause until the limit resets. If you can't wait, you can press `I` twice on your keyboard (in command mode) to interrupt the process.

In [None]:
# Number of tweets to be retrieved.
num = 100

# define the search keyword
keyword = "Football"

# use cursor to send your request with parameters
tweets = tweepy.Cursor(api.search_tweets,
                   q = keyword,
                   count = num,
                   lang="en").items(num)

# store the results as a list
tweet_ls = [[tweet.user.name, tweet.user.location,tweet.created_at, tweet.text] for tweet in tweets]

# Store the retrieved tweets in a dataframe
tweets_pd_full = pd.DataFrame(tweet_ls, 
                         columns = ['user_name','user_loc','creation_time','text'])

# Print the dataframe
tweets_pd_full
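
To make the idea of a cursor more concrete, the sketch below pages through a fake data source until the requested number of items has been collected. It is a simplified illustration only: `fake_search` is a hypothetical stand-in for one API request, and the real Search API pages by tweet ID rather than by page number.

```python
from itertools import islice

def fake_search(page, page_size=100, total=250):
    """Hypothetical stand-in for one API request; returns one page of fake tweet IDs."""
    start = page * page_size
    return list(range(start, min(start + page_size, total)))

def iter_items(fetch_page):
    """Yield items one by one, requesting a new page whenever the current one runs out."""
    page = 0
    while True:
        batch = fetch_page(page)
        if not batch:
            return  # no more results
        yield from batch
        page += 1

# Take 150 items: this needs two requests of up to 100 items each
first_150 = list(islice(iter_items(fake_search), 150))
print(len(first_150))
```

Because the generator is lazy, no further pages are requested once enough items have been taken, which is what keeps a cursor from wasting the search quota.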

### 5.2 Filter out Retweets
The retrieved tweets include both original tweets and retweets. Retweets largely duplicate the original content and carry little new information. You can set up a filter to eliminate the retweets and keep only the original tweets.

To filter out retweets, you can simply add `-filter:retweets` in the search keyword.

In [None]:
new_keyword = "Football" + " -filter:retweets"
new_keyword
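
If you build several queries, a small helper keeps the operator syntax in one place. This is a minimal sketch; `build_query` is a hypothetical function name, and it only handles the `-filter:retweets` operator described above.

```python
def build_query(keyword, exclude_retweets=True):
    """Return a search query string, optionally excluding retweets."""
    query = keyword.strip()
    if exclude_retweets:
        query += " -filter:retweets"
    return query

print(build_query("Football"))
```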

The following code retrieves only original tweets that contain the keyword "Football".

In [None]:
# Number of tweets to be retrieved.
num = 100

# use cursor to send your request with parameters
tweets = tweepy.Cursor(api.search_tweets,
                   q = new_keyword,
                   count = num,
                   lang="en").items(num)

# store the results as a list
tweet_ls = [[tweet.user.name, tweet.user.location,tweet.created_at, tweet.text] for tweet in tweets]

# Store the retrieved tweets in a dataframe
tweets_pd_full = pd.DataFrame(tweet_ls, 
                         columns = ['user_name','user_loc','creation_time','text'])

# Print the dataframe
tweets_pd_full

### 5.3 Search Tweets using Locations

You can also search for tweets around a certain location. The spatial query is based on geotags and/or user locations of the tweets.

The following code will retrieve 200 tweets containing "Football" within a 20 mile radius from USF (28.0619,-82.4146).

In [None]:
# Number of tweets to be retrieved.
num = 200

# define the search keyword
keyword = "Football"

# use cursor to send your request with parameters
tweets = tweepy.Cursor(api.search_tweets,
                   q=keyword,
                   geocode = "28.0619,-82.4146,20mi",
                   count = 100,  # at most 100 tweets per request; the cursor requests further pages as needed
                   lang="en").items(num)

# store the results as a list
tweet_ls = [[tweet.user.name, tweet.user.location, tweet.place, tweet.created_at, tweet.text] for tweet in tweets]

# Store the retrieved tweets in a dataframe
tweets_pd_geo = pd.DataFrame(tweet_ls, 
                         columns = ['user_name','user_loc','geotag','creation_time','text'])

# Print the dataframe
tweets_pd_geo
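
The `geocode` value is a single string of latitude, longitude, and radius. The sketch below builds that string and adds a haversine distance function that could be used to check whether a point lies within the search radius. Both helpers are hypothetical, and the Earth radius is an approximate mean value.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MI = 3958.8  # approximate mean Earth radius in miles

def geocode_param(lat, lon, radius, unit="mi"):
    """Build the 'lat,lon,radius' string used by the geocode parameter."""
    return f"{lat},{lon},{radius}{unit}"

def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MI * asin(sqrt(a))

usf = (28.0619, -82.4146)
print(geocode_param(*usf, 20))
```

For example, `haversine_mi(*usf, lat, lon) <= 20` tests whether a point at (`lat`, `lon`) is inside the 20-mile circle around USF.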

Next, we are going to map the tweets using their geotags. However, only a very small proportion (1-2%) of tweets have a geotag. We first check how many of the 200 tweets have a geotag.

In [None]:
n_geo = len(tweets_pd_geo[tweets_pd_geo['geotag'].notna()]) # tweets that actually have geotags
n_all = len(tweets_pd_geo) # all retrieved tweets

print("%s out of the %s retrieved tweets actually have a geotag" % (n_geo, n_all))

#### Copy tweets with geotags to a new dataframe called "geo_tweets"

In [None]:
geo_tweets = tweets_pd_geo.loc[tweets_pd_geo['geotag'].notna()].copy()

Check detailed data in a geotag. The location of the geotag is in a bounding box defined by the four coordinate pairs.

In [None]:
geo_tweets.iloc[0].geotag

Get coordinates of the bounding box

In [None]:
geo_tweets.iloc[0].geotag.bounding_box.coordinates

Create a column in the dataframe to store coordinates of the bounding boxes

In [None]:
# store bounding box coordinates to a new column
geo_tweets['bounding_box'] = geo_tweets.geotag.apply(lambda s:s.bounding_box.coordinates[0])

# print the geotag
geo_tweets.head()

For mapping purposes, we will simplify the bounding boxes to their centroids (points). This way, each tweet will be pinned to the centroid of the bounding box in the geotag.

The following code calculates the latitude and longitude of the centroids and stores them in new columns.

In [None]:
geo_tweets['point']  = geo_tweets['bounding_box'].apply(lambda s: [(s[0][1]+s[2][1])/2,(s[0][0]+s[2][0])/2])

geo_tweets['lat']  = geo_tweets['bounding_box'].apply(lambda s: (s[0][1]+s[2][1])/2)

geo_tweets['lon']  = geo_tweets['bounding_box'].apply(lambda s: (s[0][0]+s[2][0])/2)

Print to see the dataframe again. You'll see the centroids, latitude, and longitude are added as columns in the dataframe.

Note: the point column is redundant with the lat and lon columns. We create all these columns in case you want to convert the dataframe to a geodataframe (with a geometry column).

In [None]:
geo_tweets.head()
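
The centroid calculation can also be written as a standalone function that does not depend on the order of the corners. This is a sketch under the assumption that each corner is a `[lon, lat]` pair, as in the bounding boxes above. The sample box is made up, and `bbox_centroid` is a hypothetical helper.

```python
def bbox_centroid(ring):
    """Return (lat, lon) of the centre of a bounding box given as [lon, lat] corner pairs."""
    lons = [pt[0] for pt in ring]
    lats = [pt[1] for pt in ring]
    return ((min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2)

# Made-up bounding box in the same [lon, lat] order as the geotag coordinates
sample_ring = [[-82.5, 27.75], [-82.5, 28.25], [-82.25, 28.25], [-82.25, 27.75]]
print(bbox_centroid(sample_ring))
```

Using the minimum and maximum of each axis gives the same result as averaging two opposite corners, but it still works if the corners arrive in a different order.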

---

## 6. Mapping Geotagged Tweets

Finally, we will use folium to create an interactive map for the geotagged tweets.

In [None]:
# Create a base map
maptweet = folium.Map()

# Add the tweets into the basemap
for i, row in geo_tweets.iterrows():
    folium.Marker(row.point,popup = row.text).add_to(maptweet)
    
# Zoom to the bounding box including the tweets
maptweet.fit_bounds([[min(geo_tweets.lat),min(geo_tweets.lon)],[max(geo_tweets.lat),max(geo_tweets.lon)]])

# Show the map
display(maptweet)
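
The map cell above assumes that at least one geotagged tweet was retrieved; with an empty dataframe, `min()` and `max()` raise a `ValueError`. The sketch below computes the same `[[south, west], [north, east]]` bounds in plain Python and returns `None` when there is nothing to zoom to. The points are made up, and `map_bounds` is a hypothetical helper.

```python
def map_bounds(points):
    """Return [[south, west], [north, east]] for a list of (lat, lon) points, or None if empty."""
    if not points:
        return None  # nothing to zoom to
    lats = [lat for lat, lon in points]
    lons = [lon for lat, lon in points]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]

# Made-up points for illustration
sample_points = [(28.0, -82.375), (27.75, -82.5), (28.25, -82.25)]
print(map_bounds(sample_points))
```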

## 7. Streaming Tweets

Unlike the Search API, the Streaming API delivers data over a persistent HTTP connection. A single streaming connection is opened between your app and the API, and new results are sent through that connection whenever new matches occur. This results in a low-latency delivery mechanism that can support very high throughput. For further information, see https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data.

Depending on the filter (and thus the speed of retrieval), the Streaming API is also subject to a limit per 15 minutes. If the retrieved tweets exceed the limit, your program will pause until the next 15-minute window.

Run the following code to set up authentication again, so that the streaming section can run on its own. Once streaming starts, the tweets are printed as they arrive, and you can press `I` twice to interrupt it.

In [None]:
import tweepy
import csv

# paste your API keys and tokens to replace ......
API_key = '......'
API_key_secret = '......'
access_token = '......'
access_token_secret = '......'

auth = tweepy.OAuthHandler(API_key, API_key_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth, wait_on_rate_limit = True)

You can run the following code to stream real-time tweets and save them in a CSV file on your computer.

In [None]:
# Define path and name of output file
output_file = "tweets.csv"

# Open a file to write in streamed tweets
with open(output_file, "w", encoding="utf-8",newline='') as f:
    writer = csv.writer(f)
    
    # Write the headers in the first row
    writer.writerow(["username","created_at", "geotag", "user_location", "lang","tweet"])
    
    # Define a subclass of tweepy.Stream
    class stream_csv(tweepy.Stream):
        def on_status(self, status):
            # Skip retweets and collect only geotagged tweets
            if (not status.retweeted) and ('RT @' not in status.text) and (status.place is not None):
                # Organize attributes of a tweet in a list
                line = [status.user.name, status.created_at, status.place.bounding_box.coordinates[0], status.user.location, status.lang, status.text]
                print(line) # Print the line
                writer.writerow(line) # write the line to csv file

    streamer = stream_csv(API_key, API_key_secret, access_token, access_token_secret)
    # filter() keeps running until the stream is interrupted or disconnected
    streamer.filter(languages=["en"], track=["putin"])

You can press `I` twice to stop the streaming (there is no better way to do it...). Before starting a new stream, you need to disconnect the current one.

In [None]:
streamer.disconnect()
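
Once the stream has stopped, the CSV file can be read back for analysis. One thing to keep in mind is that the bounding box is written as text, so it has to be converted back into a list. The sketch below demonstrates this round trip with a made-up row and an in-memory buffer in place of `tweets.csv`.

```python
import ast
import csv
import io

# Made-up row in the same column layout as tweets.csv
header = ["username", "created_at", "geotag", "user_location", "lang", "tweet"]
row = ["example_user", "2022-03-01 12:00:00+00:00",
       [[-82.5, 27.75], [-82.5, 28.25], [-82.25, 28.25], [-82.25, 27.75]],
       "Tampa, FL", "en", "An example tweet"]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)

# Read the rows back; every field is now a string
buffer.seek(0)
records = list(csv.DictReader(buffer))

# Convert the geotag text back into a nested list of [lon, lat] pairs
coords = ast.literal_eval(records[0]["geotag"])
print(type(records[0]["geotag"]).__name__, "->", type(coords).__name__)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text that originated from the internet.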