## Spatial Data Science (GIS6307/GEO4930)


<br>
Instructor: Yi Qiang (qiangy@usf.edu)<br>
Teaching Assistant: Jinwen Xu (jinwenxu@usf.edu)

---

# Workshop on Spatial Analysis of Twitter (Day 1)

This workshop will help you to get started with the acquisition, processing, and analysis of Twitter data using data science techniques. Specifically, you will learn:

- Streaming real-time tweets using Twitter Developer APIs.
- Processing the raw tweets into an analyzable form.
- Basic mapping, spatial analysis and natural language processing for Twitter data.

### Prerequisites
- Install Anaconda on your computer.
- Activate a Twitter Developer Account and have **Elevated Access** approved before the workshop.
- Basic programming skills are recommended, but not required.




## 1. Install Python Libraries

First, we need to install a few libraries that will be used in this lab. Please follow the steps below to install the libraries.

1. Open or create a new conda environment `geo`. 
- For students in courses GIS6307 and GEO4930, please open Anaconda Prompt, and use the command `conda activate geo` to activate the "geo" environment that you created in the previous lab. 

- For workshop participants, please run the following commands to create a new conda environment `geo` and then activate it.

    - `conda create --name geo`
    
    - `conda activate geo`

2. Install tweepy, folium, JupyterLab and matplotlib using the following command (GIS6307/GEO4930 students only need to install tweepy and folium):

    `conda install -c conda-forge tweepy folium jupyterlab matplotlib`
    
    
3. Install pandas and nltk using the following command (GIS6307/GEO4930 students only need to install nltk):

    `conda install -c anaconda pandas nltk`
    

4. Launch Jupyter Notebook using the following command:

    `jupyter notebook`
    

5. Open the Twitter_Workshop_D1.ipynb that you just downloaded. Run the following code to import the installed libraries. If the code runs through, the libraries are installed successfully.

In [4]:
import tweepy
import folium
import pandas as pd
import matplotlib.pyplot as plt
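
If an import fails, it helps to know exactly which library is missing before rerunning `conda install`. The following is a minimal sketch that uses only the standard library; `missing_modules` is an illustrative helper name, not part of any of the installed packages.

```python
from importlib import util


def missing_modules(names):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if util.find_spec(name) is None]


# Libraries installed in the steps above
required = ["tweepy", "folium", "pandas", "matplotlib", "nltk"]

missing = missing_modules(required)
print("Missing libraries:", missing if missing else "none")
```

Any name that is reported as missing can be installed again in the `geo` environment with the commands from steps 2 and 3.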

## 2. Getting to Know Jupyter Notebook

Keyboard shortcuts will save you lots of time. Jupyter stores a list of keyboard shortcuts under the menu at the top: Help > Keyboard Shortcuts, or by pressing `H` in command mode (more on that later). It’s worth checking this each time you update Jupyter, as more shortcuts are added all the time.

Another way to access keyboard shortcuts, and a handy way to learn them is to use the command palette: `Cmd + Shift + P` (or `Ctrl + Shift + P` on Linux and Windows). This dialog box helps you run any command by name – useful if you don’t know the keyboard shortcut for an action or if what you want to do does not have a keyboard shortcut. The functionality is similar to Spotlight search on a Mac, and once you start using it you’ll wonder how you lived without it!

Some of my favorite shortcuts (in command mode):

- `ESC`: exit cell editing and enter command mode
- `A`: add a cell above the current cell
- `B`: add a cell below the current cell
- `C`: copy the current cell
- `X`: cut the current cell
- `V`: paste the copied cell below the current cell
- `DD`: delete the current cell
- `M`: change the cell to a Markdown cell (must exit editing first)
- `Y`: change the cell to a Python code cell (must exit editing first)
- `Ctrl + Enter`: Run code in the current cell.
- `Shift + Enter`: Run code in the current cell and move to the next cell.

## 3. Set-Up Connection to Twitter

Go to the Twitter Developer Portal (https://developer.twitter.com/en/apps). Click the App you created during account activation.

If you didn't create an App when you created the account, you can create one in the project.

Click `Set up` to configure the User authentication settings.

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/signup3.jpg)

Turn on `OAuth 1.0a` and keep `OAuth 2.0` off. 

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/OAuth.jpg)

Select "Read and write and Direct message". You can use "http://127.0.0.1:8080" as Callback URL. Add any website as the website URL (e.g. your personal website or https://twitter.com).

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/setting.jpg)

Find the API keys and tokens you saved when you activated your Developer account. If you can't find them, you can **regenerate** the keys and tokens in your Developer Portal.

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/keys2.jpg)

Copy and paste your API key, API key secret, Access Token and Access Token Secret to replace "......" below.

In [5]:
API_key = '......'
API_key_secret = '......'
access_token = '......'
access_token_secret = '......'

Set up for Twitter authentication.

In [6]:
auth = tweepy.OAuthHandler(API_key, API_key_secret)
auth.set_access_token(access_token, access_token_secret)

Create the tweepy API object. Setting `wait_on_rate_limit = True` makes tweepy wait automatically whenever a rate limit is reached.

In [7]:
api = tweepy.API(auth, wait_on_rate_limit = True)
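
Keys pasted into a notebook are easy to leak when the notebook is shared. One common alternative is to read them from environment variables. The sketch below assumes four hypothetical variable names (`TWITTER_API_KEY` and so on); they are not defined by Twitter or tweepy, so use whatever names you export in your own shell. The demo uses a fake environment so that it runs without real credentials.

```python
import os

# Hypothetical environment variable names (an assumption, not a tweepy convention)
CREDENTIAL_VARS = (
    "TWITTER_API_KEY",
    "TWITTER_API_KEY_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a tuple, or raise KeyError listing what is missing."""
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in CREDENTIAL_VARS)


# Demo with a fake environment; in practice call load_credentials() with no argument
demo_env = {name: "placeholder" for name in CREDENTIAL_VARS}
API_key, API_key_secret, access_token, access_token_secret = load_credentials(demo_env)
print("Loaded", len(CREDENTIAL_VARS), "credentials")
```

The four returned values can then be passed to `tweepy.OAuthHandler` and `set_access_token` exactly as in the cells above.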

---

## 4. Simple Operation with Twitter APIs

Now, your working environment is set up for Twitter analysis. Let's first try a few simple operations to acquire Twitter data in a programmatic way.

Full documentation of the Twitter API and Tweepy can be found at:

- [Twitter APIs](https://developer.twitter.com/en/docs.html)
- [Tweepy documentation](http://docs.tweepy.org/en/v4.8.0/)

### 4.1 Posting/Deleting a Tweet

First, let's post a message in your Twitter account.

**Note**: if you don't want to disturb your followers with a meaningless tweet, don't run the following block of code.

In [14]:
# Post a tweet from Python
test_tweet = api.update_status("DRILL: I'm creating a robot to tweet!")

Check your Twitter account, and you'll see the above message is posted.

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/tweet.jpg)

Tweets are encoded in a JSON (JavaScript Object Notation) format. You can run the following code to check the content of the tweet you just posted.

In [15]:
test_tweet._json

{'created_at': 'Tue Apr 19 20:19:00 +0000 2022',
 'id': 1516511703516823567,
 'id_str': '1516511703516823567',
 'text': "DRILL: I'm creating a robot to tweet!",
 'truncated': False,
 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []},
 'source': '<a href="https://twitter.com" rel="nofollow">twitter_workshop2</a>',
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 1509606550134038533,
  'id_str': '1509606550134038533',
  'name': 'Alex Zhang',
  'screen_name': 'AlexZha37291293',
  'location': '',
  'description': 'Data scientist',
  'url': None,
  'entities': {'description': {'urls': []}},
  'protected': False,
  'followers_count': 0,
  'friends_count': 1,
  'listed_count': 0,
  'created_at': 'Thu Mar 31 19:00:34 +0000 2022',
  'favourites_count': 0,
  'utc_offset': None,
  'time_zone': None,
  'geo_enabled': False,
  'verified': Fa

`_json` returns a dictionary object, so you can access specific attributes using the keys of the dictionary. The code below gets the posting time of the tweet.

In [19]:
test_tweet._json['created_at']

'Tue Apr 19 20:19:00 +0000 2022'

Alternatively, you can use the built-in attributes of the tweepy status object to access the same information. All attributes of a tweet can be found [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet).

In [20]:
test_tweet.created_at

datetime.datetime(2022, 4, 19, 20, 19, tzinfo=datetime.timezone.utc)
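
The raw JSON stores `created_at` as a string, which is awkward to sort or filter by. The sketch below converts that string into a timezone-aware `datetime` with the standard library. The format string is written to match the `created_at` value shown in the JSON above, and `parse_created_at` is an illustrative helper name. Note that `%a` and `%b` assume English day and month names in the active locale.

```python
from datetime import datetime, timezone

# Format of the 'created_at' field in the JSON above, e.g. 'Tue Apr 19 20:19:00 +0000 2022'
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(raw):
    """Convert a raw 'created_at' string into a timezone-aware datetime."""
    return datetime.strptime(raw, TWITTER_TIME_FORMAT)


posted = parse_created_at("Tue Apr 19 20:19:00 +0000 2022")
print(posted.isoformat())
```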

You can run the following code to delete the drill tweet you just posted.

In [12]:
api.destroy_status(test_tweet.id_str)

Status(_api=<tweepy.api.API object at 0x00000239A65E07F0>, _json={'created_at': 'Tue Apr 19 20:14:45 +0000 2022', 'id': 1516510636284551186, 'id_str': '1516510636284551186', 'text': "DRILL: I'm creating a robot to tweet!", 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []}, 'source': '<a href="https://twitter.com" rel="nofollow">twitter_workshop2</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1509606550134038533, 'id_str': '1509606550134038533', 'name': 'Alex Zhang', 'screen_name': 'AlexZha37291293', 'location': '', 'description': 'Data scientist', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 0, 'friends_count': 1, 'listed_count': 0, 'created_at': 'Thu Mar 31 19:00:34 +0000 2022', 'favourites_count': 0, 'utc_offset': None, 'time_zone': None, 'geo_enabled': Fa

### 4.2 Getting Trending Tweets

In Twitter, you can find trends in different places in the Explore tab.

![](https://raw.githubusercontent.com/qiang-yi/spatial_data_science/main/image/twitter/trends.jpg)

Next, we are going to retrieve trends in Python. The following code gets a list of places where trends are available. These places include countries and cities.

In [21]:
place_ls = api.available_trends()

Convert the list (in JSON format) into a dataframe (i.e. a table). Print the number of places where trends are available.

In [23]:
df_place = pd.DataFrame(place_ls)

print (str(len(df_place)) + " places have trends.")

467 places have trends.


Preview 10 places where trends are available

In [24]:
df_place.head(10)

Unnamed: 0,name,placeType,url,parentid,country,woeid,countryCode
0,Worldwide,"{'code': 19, 'name': 'Supername'}",http://where.yahooapis.com/v1/place/1,0,,1,
1,Winnipeg,"{'code': 7, 'name': 'Town'}",http://where.yahooapis.com/v1/place/2972,23424775,Canada,2972,CA
2,Ottawa,"{'code': 7, 'name': 'Town'}",http://where.yahooapis.com/v1/place/3369,23424775,Canada,3369,CA
3,Quebec,"{'code': 7, 'name': 'Town'}",http://where.yahooapis.com/v1/place/3444,23424775,Canada,3444,CA
4,Montreal,"{'code': 7, 'name': 'Town'}",http://where.yahooapis.com/v1/place/3534,23424775,Canada,3534,CA
5,Toronto,"{'code': 7, 'name': 'Town'}",http://where.yahooapis.com/v1/place/4118,23424775,Canada,4118,CA
6,Edmonton,"{'code': 7, 'name': 'Town'}",http://where.yahooapis.com/v1/place/8676,23424775,Canada,8676,CA
7,Calgary,"{'code': 7, 'name': 'Town'}",http://where.yahooapis.com/v1/place/8775,23424775,Canada,8775,CA
8,Vancouver,"{'code': 7, 'name': 'Town'}",http://where.yahooapis.com/v1/place/9807,23424775,Canada,9807,CA
9,Birmingham,"{'code': 7, 'name': 'Town'}",http://where.yahooapis.com/v1/place/12723,23424975,United Kingdom,12723,GB
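
Each record in the place list is a plain dictionary, so simple summaries do not need pandas. As a small illustration, the sketch below counts places per country with `collections.Counter`. The sample records are copied from the preview above and trimmed to three keys; `places_per_country` is an illustrative helper name, and the `"(none)"` label for records without a country is an assumption made for this example.

```python
from collections import Counter

# A few records from the preview above, trimmed to the keys used here
sample_places = [
    {"name": "Worldwide", "country": "", "woeid": 1},
    {"name": "Winnipeg", "country": "Canada", "woeid": 2972},
    {"name": "Ottawa", "country": "Canada", "woeid": 3369},
    {"name": "Birmingham", "country": "United Kingdom", "woeid": 12723},
]


def places_per_country(places):
    """Count how many trend locations each country has."""
    return Counter(place["country"] or "(none)" for place in places)


counts = places_per_country(sample_places)
print(counts.most_common())
```

With the full response you could pass `place_ls` instead of `sample_places`, assuming each record carries a `country` key as in the preview.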


Get the `woeid` of Tampa. The `woeid` is a unique ID for each place.

In [26]:
df_place[df_place['name']=='Tampa']['woeid']

394    2503863
Name: woeid, dtype: int64

Store the `woeid` of Tampa in tampa_id.

In [27]:
tampa_id = df_place[df_place['name']=='Tampa']['woeid']
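
The same lookup can be done directly on the list of place records, which returns a plain integer instead of a pandas Series. This is a minimal sketch with a hypothetical helper `find_woeid`; the sample list stands in for the real response and uses `woeid` values taken from the outputs above.

```python
def find_woeid(places, name):
    """Return the woeid of the first place with the given name, or None if absent."""
    for place in places:
        if place["name"] == name:
            return int(place["woeid"])
    return None


# Stand-in for the list of place records
sample_places = [
    {"name": "Toronto", "woeid": 4118},
    {"name": "Tampa", "woeid": 2503863},
]

tampa_woeid = find_woeid(sample_places, "Tampa")
print(tampa_woeid)
```

Returning `None` for an unknown name makes a typo in the city name easy to detect before the value is passed to the trends request.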

Return the trends in Tampa.

Note: you need to convert `tampa_id` from a pandas Series object into an integer.

In [29]:
# Use Tampa as an example
trends_tampa = api.get_place_trends(int(tampa_id.iloc[0]))

[{'trends': [{'name': 'Disney',
    'url': 'http://twitter.com/search?q=Disney',
    'promoted_content': None,
    'query': 'Disney',
    'tweet_volume': 203846},
   {'name': 'Floridians',
    'url': 'http://twitter.com/search?q=Floridians',
    'promoted_content': None,
    'query': 'Floridians',
    'tweet_volume': 12880},
   {'name': 'Orange County',
    'url': 'http://twitter.com/search?q=%22Orange+County%22',
    'promoted_content': None,
    'query': '%22Orange+County%22',
    'tweet_volume': None},
   {'name': 'Maxey',
    'url': 'http://twitter.com/search?q=Maxey',
    'promoted_content': None,
    'query': 'Maxey',
    'tweet_volume': 25614},
   {'name': '#Vtubers',
    'url': 'http://twitter.com/search?q=%23Vtubers',
    'promoted_content': None,
    'query': '%23Vtubers',
    'tweet_volume': 14420},
   {'name': 'Liverpool',
    'url': 'http://twitter.com/search?q=Liverpool',
    'promoted_content': None,
    'query': 'Liverpool',
    'tweet_volume': 404813},
   {'name': '#LI

Print the trends in JSON format.

In [30]:
# Print the raw response: a one-element list whose 'trends' key holds the trends in Tampa
trends_tampa[0:20]

[{'trends': [{'name': 'Disney',
    'url': 'http://twitter.com/search?q=Disney',
    'promoted_content': None,
    'query': 'Disney',
    'tweet_volume': 203485},
   {'name': 'Floridians',
    'url': 'http://twitter.com/search?q=Floridians',
    'promoted_content': None,
    'query': 'Floridians',
    'tweet_volume': 12880},
   {'name': 'Orange County',
    'url': 'http://twitter.com/search?q=%22Orange+County%22',
    'promoted_content': None,
    'query': '%22Orange+County%22',
    'tweet_volume': None},
   {'name': 'Maxey',
    'url': 'http://twitter.com/search?q=Maxey',
    'promoted_content': None,
    'query': 'Maxey',
    'tweet_volume': 25614},
   {'name': '#Vtubers',
    'url': 'http://twitter.com/search?q=%23Vtubers',
    'promoted_content': None,
    'query': '%23Vtubers',
    'tweet_volume': 14420},
   {'name': 'Liverpool',
    'url': 'http://twitter.com/search?q=Liverpool',
    'promoted_content': None,
    'query': 'Liverpool',
    'tweet_volume': 404813},
   {'name': '#LI

Organize the Tampa trends in a dataframe.

In [31]:
# Get trend name, URL, and tweet volume from the JSON data, and store them in a list
trends_tampa_ls = [[trend['name'], trend['url'], trend['tweet_volume']] for trend in trends_tampa[0]['trends']]

# Convert the list to a dataframe
df_trends = pd.DataFrame(trends_tampa_ls,columns=['name','url','tweet_volume'])

Sort the trends by tweet volume in descending order and print the top 10 trends with the highest tweet volume.

In [34]:
# Sort the trends by tweet volume in descending order
df_trends.sort_values("tweet_volume", inplace = True, ascending = False)

# Print the top 10 trends ranked by tweet volume
df_trends.head(10)

Unnamed: 0,name,url,tweet_volume
5,Liverpool,http://twitter.com/search?q=Liverpool,404813.0
0,Disney,http://twitter.com/search?q=Disney,203485.0
13,Manchester United,http://twitter.com/search?q=%22Manchester+Unit...,193340.0
6,#LIVMUN,http://twitter.com/search?q=%23LIVMUN,150919.0
20,Taylor Lorenz,http://twitter.com/search?q=%22Taylor+Lorenz%22,132042.0
12,Maguire,http://twitter.com/search?q=Maguire,113685.0
18,#MUFC,http://twitter.com/search?q=%23MUFC,113447.0
25,Anfield,http://twitter.com/search?q=Anfield,103953.0
44,nayeon,http://twitter.com/search?q=nayeon,76717.0
9,Johnny Depp,http://twitter.com/search?q=%22Johnny+Depp%22,76268.0


The table shows the popular topics people are tweeting about in Tampa.
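
Some trends, such as "Orange County" in the raw output above, have a `tweet_volume` of `None`, so any ranking has to decide what to do with them. The sketch below ranks a list of trend dictionaries in plain Python and simply drops trends without a volume; that choice, and the helper name `top_trends`, are assumptions made for illustration. The sample values are copied from the output above.

```python
# A few trends copied from the raw output above
sample_trends = [
    {"name": "Disney", "tweet_volume": 203485},
    {"name": "Orange County", "tweet_volume": None},
    {"name": "Liverpool", "tweet_volume": 404813},
    {"name": "Maxey", "tweet_volume": 25614},
]


def top_trends(trends, n=10):
    """Return up to n trends with a known volume, largest volume first."""
    with_volume = [t for t in trends if t["tweet_volume"] is not None]
    return sorted(with_volume, key=lambda t: t["tweet_volume"], reverse=True)[:n]


for trend in top_trends(sample_trends, n=3):
    print(trend["name"], trend["tweet_volume"])
```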

---

## 5. Acquiring Tweets using the Search API

### 5.1 Search Tweets using Keywords

In this step, you will use a Python program to search for tweets that contain a specific keyword.

You can replace the keyword with something you are interested in. Try to choose a popular one so that you can collect enough tweets.

In [35]:
tweets = api.search_tweets("Football", count=100)

print("Total tweets retrieved: "+ str(len(tweets)))

Total tweets retrieved: 100


Store the user name, user location, posting time, and tweet text in a Pandas dataframe.

In [37]:
tweets_pd = pd.DataFrame([[tweet.user.name, tweet.user.location,tweet.created_at, tweet.text] for tweet in tweets], 
                         columns = ['user_name','user_loc','creation_time','text'])

tweets_pd

Unnamed: 0,user_name,user_loc,creation_time,text
0,KH SAKIB,,2022-04-19 20:31:29+00:00,Why the hell Harry Maguire and Marcus overrate...
1,ESSE marley✨,In my blunt,2022-04-19 20:31:29+00:00,RT @artasaster90: When football was a game and...
2,S,TaylorSwift,2022-04-19 20:31:29+00:00,RT @Ludo4PF: Thiago isn’t playing football btw...
3,Very Proud Robot,,2022-04-19 20:31:28+00:00,RT @samsonlawal_: Days 76 of #100DaysOfCode\n\...
4,theffrobot,Earth,2022-04-19 20:31:28+00:00,RT @NBCSEdgeFB: Steelers take chance on Miles ...
...,...,...,...,...
95,Bishoy_FCB💙❤️,,2022-04-19 20:31:16+00:00,@FCBarcelona_es @SergiRoberto10 Out @SergiRobe...
96,Belfast Live Sport,"Belfast, Northern Ireland",2022-04-19 20:31:16+00:00,GOALS!!\n\nColeraine pull one back; Glenavon e...
97,Leo🩺,Le monde,2022-04-19 20:31:16+00:00,"RT @SSS_Promotion: MARCUS RASHFORD, QUIT FOOTB..."
98,GK,,2022-04-19 20:31:15+00:00,RT @MrKartShyam: Just when we started to look ...


The `search_tweets` function can retrieve at most 100 tweets per request. If you want to get more tweets, you need to use a `Cursor`.

The following code uses a cursor to retrieve more tweets containing a keyword. Here we still retrieve 100 tweets to save your search quota for the following steps. You can increase the `num` variable if you want to get more tweets.

> Note: Twitter only allows a limited number of search requests (and therefore tweets) per 15 minutes. If your requests exceed the limit, the program will pause until the limit resets. If you can't wait, press `I` twice in command mode to interrupt the process.
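
To estimate how long a larger download might take, you can work out how many requests it needs. The sketch below assumes 100 tweets per request (the maximum stated above) and a quota of 180 search requests per 15-minute window; the quota figure is an assumption for illustration, so check the limit that applies to your own access level. Both function names are hypothetical.

```python
import math


def requests_needed(num_tweets, per_request=100):
    """Number of search requests needed to page through num_tweets."""
    return math.ceil(num_tweets / per_request)


def windows_needed(num_tweets, per_request=100, requests_per_window=180):
    """Number of 15-minute windows needed, given an assumed request quota per window."""
    return math.ceil(requests_needed(num_tweets, per_request) / requests_per_window)


for n in (100, 250, 20000):
    print(n, "tweets:", requests_needed(n), "request(s),", windows_needed(n), "window(s)")
```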

In [38]:
# Number of tweets to be retrieved.
num = 100

# define the search keyword
keyword = "Football"

# use cursor to send your request with parameters
tweets = tweepy.Cursor(api.search_tweets,
                   q = keyword,
                   count = num,
                   lang="en").items(num)

# store the results as a list
tweet_ls = [[tweet.user.name, tweet.user.location,tweet.created_at, tweet.text] for tweet in tweets]

# Store the retrieved tweets in a dataframe
tweets_pd_full = pd.DataFrame(tweet_ls, 
                         columns = ['user_name','user_loc','creation_time','text'])

# Print the dataframe
tweets_pd_full

Unnamed: 0,user_name,user_loc,creation_time,text
0,Christian,,2022-04-19 20:34:25+00:00,RT @247Sports: Tennessee @Vol_Football is up t...
1,Ayumi,Malaysia,2022-04-19 20:34:24+00:00,RT @Mich8Lee: Manchester United don’t deserve ...
2,LOwke¥🅿️usha,Trenches deep dyoli,2022-04-19 20:34:23+00:00,Mane and Salah playing some beaurrriful football
3,Gavin,,2022-04-19 20:34:23+00:00,RT @CFC_Raf: Marcus Rashford plays football li...
4,Seyi omoOje,United States of Lagos,2022-04-19 20:34:22+00:00,@mrmacaronii football is about winning and manU
...,...,...,...,...
95,🗿,"Islington, London",2022-04-19 20:34:04+00:00,I’ve seen man at powerleague play better footb...
96,1990s Football,"England, United Kingdom",2022-04-19 20:34:04+00:00,Here's Steve Stone with a shocking miss agains...
97,Quinn,"Oklahoma, USA",2022-04-19 20:34:04+00:00,RT @JoshuaBates64: BIG TIME OL’s coming to the...
98,Liam Stewart,,2022-04-19 20:34:04+00:00,RT @AnfieldWatch: THIS. IS. FOOTBALL 😍 \n\nhtt...


### 5.2 Filter out Retweets
The retrieved tweets include both original tweets and retweets. Retweets are almost identical to the tweets they repeat and carry little extra information. You can set up a filter to eliminate the retweets and keep only the original tweets.

To filter out retweets, you can simply add `-filter:retweets` in the search keyword.

In [39]:
new_keyword = "Football" + " -filter:retweets"
new_keyword

'Football -filter:retweets'
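
The query string can be built by a small helper, and tweets that are already downloaded can be screened locally by checking for the `RT @` prefix that retweets carry in the tables above. This is a minimal sketch; `build_query` and `is_retweet_text` are illustrative names, and the prefix check is a heuristic rather than an official retweet flag. The sample texts are shortened from the earlier output.

```python
def build_query(keyword, exclude_retweets=True):
    """Append the -filter:retweets operator to a search keyword when requested."""
    if exclude_retweets:
        return keyword + " -filter:retweets"
    return keyword


def is_retweet_text(text):
    """Heuristic: retweet texts start with 'RT @'."""
    return text.startswith("RT @")


# Texts shortened from the search results above
sample_texts = [
    "RT @247Sports: Tennessee @Vol_Football is up t...",
    "Mane and Salah playing some beaurrriful football",
    "RT @CFC_Raf: Marcus Rashford plays football li...",
]

originals = [text for text in sample_texts if not is_retweet_text(text)]
print(build_query("Football"))
print(originals)
```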

The following code retrieves only original tweets that contain the keyword "Football".

In [40]:
# Number of tweets to be retrieved.
num = 100

# use cursor to send your request with parameters
tweets = tweepy.Cursor(api.search_tweets,
                   q = new_keyword,
                   count = num,
                   lang="en").items(num)

# store the results as a list
tweet_ls = [[tweet.user.name, tweet.user.location,tweet.created_at, tweet.text] for tweet in tweets]

# Store the retrieved tweets in a dataframe
tweets_pd_full = pd.DataFrame(tweet_ls, 
                         columns = ['user_name','user_loc','creation_time','text'])

# Print the dataframe
tweets_pd_full

Unnamed: 0,user_name,user_loc,creation_time,text
0,Mr Lanks,,2022-04-19 20:35:59+00:00,@adedunke_ Lol\nYou people think man like that...
1,ًEllis.,Join my groupchat 👉,2022-04-19 20:35:58+00:00,So sad seeing what’s happened to Rashford he s...
2,Tarila,"Lagos, Nigeria",2022-04-19 20:35:58+00:00,@cosmicdirtbag81 @MistateeT @HarryPapa9 @COYS_...
3,Lonely,Citadel At the End Of Time,2022-04-19 20:35:57+00:00,@prAfricanChild1 Nze am watching football
4,s,Saint Lucia,2022-04-19 20:35:57+00:00,Lol I love when this man talks about football....
...,...,...,...,...
95,waxof,,2022-04-19 20:35:21+00:00,@umarmenya @Uchiha__Yusuf @utdreport Did you o...
96,Anmol rai,India,2022-04-19 20:35:20+00:00,@sportbible This team rn is not playing footba...
97,💆🏿‍♂️Ezek,@whosezek,2022-04-19 20:35:19+00:00,So many problems at this football club
98,Tuovoo,Switzerland,2022-04-19 20:35:19+00:00,@PingPongChong8 Shittest defender I’ve ever se...


### 5.3 Search Tweets using locations

You can also search for tweets around a certain location. The spatial query is based on geotags and/or user locations of the tweets.

The following code will retrieve 200 tweets containing "Football" within a 20 mile radius from USF (28.0619,-82.4146).

In [41]:
# Number of tweets to be retrieved.
num = 200

# define the search keyword
keyword = "Football"

# use cursor to send your request with parameters
tweets = tweepy.Cursor(api.search_tweets,
                   q=keyword,
                   geocode = "28.0619,-82.4146,20mi",
                   count = 100, # at most 100 tweets per request; the cursor pages until num tweets are collected
                   lang="en").items(num)

# store the results as a list
tweet_ls = [[tweet.user.name, tweet.user.location, tweet.place, tweet.created_at, tweet.text] for tweet in tweets]

# Store the retrieved tweets in a dataframe
tweets_pd_geo = pd.DataFrame(tweet_ls, 
                         columns = ['user_name','user_loc','geotag','creation_time','text'])

# Print the dataframe
tweets_pd_geo

Unnamed: 0,user_name,user_loc,geotag,creation_time,text
0,Florida Man - Worlds Worst Superhero,"Tampa, FL",,2022-04-19 20:36:57+00:00,@Brentdoesstuff @Dmaron21 @DocEpcot @rankingth...
1,David Tilton,"Lakeland, Florida",,2022-04-19 20:27:06+00:00,@USFLMemes The crowd matters. It’s football. T...
2,MVP VO 😏,"Tampa, FL",,2022-04-19 20:15:45+00:00,My only regret in football wa snot being in po...
3,Odogwu 🦍,"Gibsonton, FL",,2022-04-19 20:10:10+00:00,Help Me raise money for East Bay Football 2022...
4,Jaylon Key Johnson,"Riverview, FL",,2022-04-19 20:05:15+00:00,Help Jaylon Key raise money for East Bay Footb...
...,...,...,...,...,...
195,Beanie #TeamSprigatito 🇺🇦,"Lakeland, FL",,2022-04-18 00:42:28+00:00,"@EffGeeYT Nope, as he walks by you clearly see..."
196,Phil Harrison,"Tampa, FL",,2022-04-18 00:38:55+00:00,Best photos from the annual Ohio State spring ...
197,Select Training Solutions,"Riverview, Florida",,2022-04-18 00:24:34+00:00,@oswinspk @RaesTake He's a football player tho...
198,Laura Demers,"Tampa, FL",,2022-04-18 00:16:16+00:00,"@MikeManganello Yeah, I was bummed as my earli..."


Next, we are going to map the tweets using their geotags. However, only a very small proportion (1-2%) of tweets have a geotag. We first check how many of the 200 tweets have a geotag.

In [42]:
n_geo = len(tweets_pd_geo[tweets_pd_geo['geotag'].notna()]) # tweets that actually have geotags
n_all = len(tweets_pd_geo) # all retrieved tweets

print("%s out of the %s retrieved tweets actually have a geotag" % (n_geo, n_all))

12 out of the 200 retrieved tweets actually have a geotag


#### Copy tweets with geotags to a new dataframe called "geo_tweets"

In [43]:
geo_tweets = tweets_pd_geo.loc[tweets_pd_geo['geotag'].notna()].copy()

Check the detailed data in a geotag. The location of the geotag is given as a bounding box defined by four coordinate pairs.

In [45]:
geo_tweets.iloc[0].geotag

Place(_api=<tweepy.api.API object at 0x00000239A65E07F0>, id='dc62519fda13b4ec', url='https://api.twitter.com/1.1/geo/id/dc62519fda13b4ec.json', place_type='city', name='Tampa', full_name='Tampa, FL', country_code='US', country='United States', contained_within=[], bounding_box=BoundingBox(_api=<tweepy.api.API object at 0x00000239A65E07F0>, type='Polygon', coordinates=[[[-82.620093, 27.821353], [-82.2652945, 27.821353], [-82.2652945, 28.171836], [-82.620093, 28.171836]]]), attributes={})

Get coordinates of the bounding box

In [46]:
geo_tweets.iloc[0].geotag.bounding_box.coordinates

[[[-82.620093, 27.821353],
  [-82.2652945, 27.821353],
  [-82.2652945, 28.171836],
  [-82.620093, 28.171836]]]

Create a column in the dataframe to store coordinates of the bounding boxes

In [47]:
# store bounding box coordinates to a new column
geo_tweets['bounding_box'] = geo_tweets.geotag.apply(lambda s:s.bounding_box.coordinates[0])

# print the geotag
geo_tweets.head()

Unnamed: 0,user_name,user_loc,geotag,creation_time,text,bounding_box
26,Chris,"Tampa, FL",Place(_api=<tweepy.api.API object at 0x0000023...,2022-04-19 16:18:34+00:00,Solid episode fellas. Brandon had a ton of gre...,"[[-82.620093, 27.821353], [-82.2652945, 27.821..."
31,QB1 Tyler Harrison,Tampa,Place(_api=<tweepy.api.API object at 0x0000023...,2022-04-19 15:56:00+00:00,@CCSFootball813 \n\nI am interested in your mi...,"[[-82.620093, 27.821353], [-82.2652945, 27.821..."
44,Can you still feel the butterflies?,my ex-wife,Place(_api=<tweepy.api.API object at 0x0000023...,2022-04-19 14:34:01+00:00,@CBSSports Michigan football,"[[-82.620093, 27.821353], [-82.2652945, 27.821..."
61,Maggi Cook,"St Petersburg, FL",Place(_api=<tweepy.api.API object at 0x0000023...,2022-04-19 08:15:27+00:00,Woot! @GoBearcatsFB \nhttps://t.co/6jKaDxxqEB,"[[-82.369079, 27.755502], [-82.2443655, 27.755..."
68,getloosejay,"Lakeland, FL",Place(_api=<tweepy.api.API object at 0x0000023...,2022-04-19 02:25:48+00:00,I am blessed to say I have received my 2nd ⭕️f...,"[[-82.056434, 28.095121], [-82.006638, 28.0951..."


For mapping purposes, we will simplify the bounding boxes to their centroids (points). This way, each tweet will be pinned to the centroid of the bounding box in its geotag.

The following code calculates the latitude and longitude of the centroids and stores them in new columns.

In [48]:
geo_tweets['point']  = geo_tweets['bounding_box'].apply(lambda s: [(s[0][1]+s[2][1])/2,(s[0][0]+s[2][0])/2])

geo_tweets['lat']  = geo_tweets['bounding_box'].apply(lambda s: (s[0][1]+s[2][1])/2)

geo_tweets['lon']  = geo_tweets['bounding_box'].apply(lambda s: (s[0][0]+s[2][0])/2)

Print to see the dataframe again. You'll see the centroids, latitude, and longitude are added as columns in the dataframe.

Note: the point column is redundant with the lat and lon columns. We create all these columns just in case you want to convert the dataframe to a geodataframe (geometry).

In [49]:
geo_tweets.head()

Unnamed: 0,user_name,user_loc,geotag,creation_time,text,bounding_box,point,lat,lon
26,Chris,"Tampa, FL",Place(_api=<tweepy.api.API object at 0x0000023...,2022-04-19 16:18:34+00:00,Solid episode fellas. Brandon had a ton of gre...,"[[-82.620093, 27.821353], [-82.2652945, 27.821...","[27.9965945, -82.44269374999999]",27.996595,-82.442694
31,QB1 Tyler Harrison,Tampa,Place(_api=<tweepy.api.API object at 0x0000023...,2022-04-19 15:56:00+00:00,@CCSFootball813 \n\nI am interested in your mi...,"[[-82.620093, 27.821353], [-82.2652945, 27.821...","[27.9965945, -82.44269374999999]",27.996595,-82.442694
44,Can you still feel the butterflies?,my ex-wife,Place(_api=<tweepy.api.API object at 0x0000023...,2022-04-19 14:34:01+00:00,@CBSSports Michigan football,"[[-82.620093, 27.821353], [-82.2652945, 27.821...","[27.9965945, -82.44269374999999]",27.996595,-82.442694
61,Maggi Cook,"St Petersburg, FL",Place(_api=<tweepy.api.API object at 0x0000023...,2022-04-19 08:15:27+00:00,Woot! @GoBearcatsFB \nhttps://t.co/6jKaDxxqEB,"[[-82.369079, 27.755502], [-82.2443655, 27.755...","[27.827367000000002, -82.30672225]",27.827367,-82.306722
68,getloosejay,"Lakeland, FL",Place(_api=<tweepy.api.API object at 0x0000023...,2022-04-19 02:25:48+00:00,I am blessed to say I have received my 2nd ⭕️f...,"[[-82.056434, 28.095121], [-82.006638, 28.0951...","[28.1276105, -82.03153599999999]",28.12761,-82.031536
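
As a quick sanity check on the centroids, you can compute how far each one lies from the search center at USF. The sketch below recomputes the Tampa centroid from its bounding box and measures the great-circle distance with the haversine formula. It assumes a spherical Earth with a mean radius of 3958.8 miles; `bbox_centroid` and `haversine_miles` are illustrative helper names.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, spherical approximation


def bbox_centroid(box):
    """Return (lat, lon) of the center of a bounding box given as [lon, lat] corners."""
    lons = [corner[0] for corner in box]
    lats = [corner[1] for corner in box]
    return (min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Tampa bounding box from the geotag above
tampa_box = [[-82.620093, 27.821353], [-82.2652945, 27.821353],
             [-82.2652945, 28.171836], [-82.620093, 28.171836]]

lat, lon = bbox_centroid(tampa_box)
distance = haversine_miles(28.0619, -82.4146, lat, lon)
print("Centroid:", round(lat, 6), round(lon, 6))
print("Distance from USF (miles):", round(distance, 2))
```

Because the centroid only summarizes a city-sized bounding box, the distance is approximate and should not be read as the exact position of the tweet.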


---

## 6. Mapping Geotagged Tweets

Finally, we will use folium to create an interactive map for the geotagged tweets.

In [50]:
# Create a base map
maptweet = folium.Map()

# Add the tweets into the basemap
for i, row in geo_tweets.iterrows():
    folium.Marker(row.point,popup = row.text).add_to(maptweet)
    
# Zoom to the bounding box including the tweets
maptweet.fit_bounds([[min(geo_tweets.lat),min(geo_tweets.lon)],[max(geo_tweets.lat),max(geo_tweets.lon)]])

# Show the map
display(maptweet)
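
`fit_bounds` takes the south-west and north-east corners as `[[min_lat, min_lon], [max_lat, max_lon]]`. When several tweets share the same centroid, as the Tampa tweets do above, that box can be very small or even a single point. The sketch below computes the corners with a small margin; the helper name `padded_bounds` and the default padding of 0.05 degrees are assumptions chosen for illustration. The coordinates are taken from the dataframe output above.

```python
def padded_bounds(lats, lons, pad=0.05):
    """Return [[south, west], [north, east]] expanded by pad degrees on every side."""
    if not lats or not lons:
        raise ValueError("need at least one coordinate")
    return [[min(lats) - pad, min(lons) - pad],
            [max(lats) + pad, max(lons) + pad]]


# Centroid coordinates from the dataframe above
sample_lats = [27.996595, 27.827367, 28.12761]
sample_lons = [-82.442694, -82.306722, -82.031536]

bounds = padded_bounds(sample_lats, sample_lons)
print(bounds)
```

The result could be passed to `maptweet.fit_bounds(bounds)` in place of the unpadded corners used in the cell above.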

## 7. Streaming Tweets

Unlike the Search API, the Streaming API delivers data over a persistent HTTP connection. A single streaming connection is opened between your app and the API, and new results are sent through that connection whenever new matches occur. This gives a low-latency delivery mechanism that can support very high throughput. For further information, see https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data.

Depending on the filter (and thus the speed of retrieval), the Streaming API is also subject to a limit per 15 minutes. If the retrieved tweets exceed the limit, your program will pause until the next 15-minute cycle.

In [1]:
import tweepy
import csv

# paste your API keys and tokens to replace ......
API_key = '......'
API_key_secret = '......'
access_token = '......'
access_token_secret = '......'

auth = tweepy.OAuthHandler(API_key, API_key_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth, wait_on_rate_limit = True)

Run the following code to start streaming. The tweets are printed and saved in a csv file. To stop, press `I` twice in command mode to interrupt the streaming.

In [None]:
# Define path and name of output file
output_file = "tweets.csv"

# Open a file to write in streamed tweets
with open(output_file, "w", encoding="utf-8",newline='') as f:
    writer = csv.writer(f)
    
    # Write the headers in the first row
    writer.writerow(["username","created_at", "geotag", "user_location", "lang","tweet"])
    
    # Define a subclass of tweepy.Stream
    class stream_csv(tweepy.Stream):
        def on_status(self, status):
            # Skip retweets and collect only geotagged tweets
            if (not status.retweeted) and ('RT @' not in status.text) and (status.place is not None):
                # Organize attributes of a tweet in a list (created_at is the posting time of the tweet)
                line = [status.user.name, status.created_at, status.place.bounding_box.coordinates[0], status.user.location, status.lang, status.text]
                print(line) # Print the line
                writer.writerow(line) # Write the line to the csv file

    streamer = stream_csv(API_key, API_key_secret, access_token, access_token_secret)
    # filter() keeps running until the stream is interrupted or disconnected
    streamer.filter(languages=["en"], track=["zelenskyy"])

Before starting a new stream, you need to disconnect the current one.

In [None]:
streamer.disconnect()
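
When the streamed csv file is read back, the `geotag` column comes back as text, because `csv.writer` stores the list of coordinates as its string form. The sketch below shows one way to turn that text back into a list with `ast.literal_eval`. It writes a single made-up row to an in-memory buffer so that it runs without the real `tweets.csv`; the user name and tweet text in that row are placeholders, while the coordinates are the Tampa bounding box from earlier.

```python
import ast
import csv
import io

# Write one demo row the same way the streaming cell does (in memory instead of tweets.csv)
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["username", "created_at", "geotag", "user_location", "lang", "tweet"])
writer.writerow([
    "demo_user",                      # placeholder user name
    "2022-04-19 16:18:34+00:00",
    [[-82.620093, 27.821353], [-82.2652945, 27.821353],
     [-82.2652945, 28.171836], [-82.620093, 28.171836]],
    "Tampa, FL",
    "en",
    "demo text",                      # placeholder tweet text
])

# Read the rows back and convert the geotag text into a list of [lon, lat] pairs
buffer.seek(0)
rows = []
for row in csv.DictReader(buffer):
    row["geotag"] = ast.literal_eval(row["geotag"])
    rows.append(row)

print(rows[0]["geotag"][0])
```

For the real file, the same loop can run over `open("tweets.csv", encoding="utf-8", newline="")` instead of the in-memory buffer.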