### Make sure you have the TweePy library installed.

In [2]:
# !pip install tweepy

Collecting tweepy
  Downloading https://files.pythonhosted.org/packages/36/1b/2bd38043d22ade352fc3d3902cf30ce0e2f4bf285be3b304a2782a767aec/tweepy-3.8.0-py2.py3-none-any.whl
Collecting requests-oauthlib>=0.7.0 (from tweepy)
  Downloading https://files.pythonhosted.org/packages/c2/e2/9fd03d55ffb70fe51f587f20bcf407a6927eb121de86928b34d162f0b1ac/requests_oauthlib-1.2.0-py2.py3-none-any.whl
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->tweepy)
  Downloading https://files.pythonhosted.org/packages/05/57/ce2e7a8fa7c0afb54a0581b14a65b56e62b5759dbc98e80627142b8a3704/oauthlib-3.1.0-py2.py3-none-any.whl (147kB)
Installing collected packages: oauthlib, requests-oauthlib, tweepy
Successfully installed oauthlib-3.1.0 requests-oauthlib-1.2.0 tweepy-3.8.0


### Import the necessary libraries.

In [3]:
import tweepy
import pandas as pd

### Make sure you have credentials for the Twitter API and set them here! [(See documentation here for help.)](http://docs.tweepy.org/en/v3.8.0/auth_tutorial.html)

In [4]:
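# consumer (API) key and consumer secret from your Twitter developer app
auth = tweepy.OAuthHandler("putyour", "credentialshere")
# access token and access token secret for your account
auth.set_access_token("putyour", "credentialshere")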

api = tweepy.API(auth)
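
Hardcoding credentials in a notebook makes it easy to leak them when the notebook is shared. One possible alternative, sketched below, is to read them from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_credentials` are hypothetical and not part of Tweepy; adapt them to your own setup.

```python
import os
from collections import namedtuple

# Hypothetical environment variable names; use whatever names you export.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}

TwitterCredentials = namedtuple("TwitterCredentials", list(ENV_NAMES))


def load_twitter_credentials(env=None):
    """Read the four credentials from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(sorted(missing)))
    return TwitterCredentials(**{field: env[name] for field, name in ENV_NAMES.items()})


# Usage with the cell above (requires tweepy and real credentials):
# creds = load_twitter_credentials()
# auth = tweepy.OAuthHandler(creds.consumer_key, creds.consumer_secret)
# auth.set_access_token(creds.access_token, creds.access_token_secret)
# api = tweepy.API(auth)
```

Raising on a missing variable surfaces a configuration problem before any request is sent to the API.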

### Main function to process Tweets and pull out relevant information

In [5]:
def process_tweet(status):
    """
    Takes in a Status object and returns a dictionary object of the post's most important information.
    For this use case, that is:
    - the set of hashtags present in the tweet, if any
    - the full text of the tweet
    - the link to the first piece of media embedded in the tweet, if it is a photo
    - the name of the place (from the Place object of the Status), if any
    - the timestamp of the tweet's creation
    """
    data = {}
    tags = []
    for ht in status.entities['hashtags']:
        tags.append(ht['text'])
    data['hashtags'] = tags
    data['full_text'] = status.full_text
    media_link = "n/a"
    if 'media' in status.entities.keys():
        ent = status.entities['media'][0]
        if ent['type'] == 'photo':
            media_link = ent['expanded_url']
    data['media_link'] = media_link
    try:
        data['place'] = status.place.full_name
    except AttributeError:
        # status.place is None when the tweet has no attached Place
        data['place'] = "n/a"
    data['created_at'] = status.created_at
    return data

### Set the number of tweets you want and keywords you want to search for.
#### The keywords do not need to be hashtags. In our case, most content about the wildfires was shared under each fire's hashtag, so we searched for those hashtags.

In [6]:
num_wanted_tweets = 5000

keywords = ["#SaddleridgeFire", "#GettyFire", "#KincadeFire", "#MariaFire", "#EasyFire", "#TickFire", "#HillsideFire"]

#### The search method can handle multiple keywords at the same time, so this cell creates the full query.

In [7]:
query = " OR ".join(keywords)
query

'#SaddleridgeFire OR #GettyFire OR #KincadeFire OR #MariaFire OR #EasyFire OR #TickFire OR #HillsideFire'
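
As the keyword list grows, the joined query can pick up duplicates or become too long. The sketch below wraps the join in a small helper, `build_query`, that removes repeated keywords and checks the length. The helper and `MAX_QUERY_LENGTH` are hypothetical names, and the 500-character limit is an assumption based on our reading of the standard search API documentation; check the current documentation before relying on it.

```python
# Assumed limit for the standard search API's `q` parameter (verify before use).
MAX_QUERY_LENGTH = 500


def build_query(keywords, max_length=MAX_QUERY_LENGTH):
    """Join keywords with OR, dropping blanks and duplicates but keeping order."""
    cleaned = [kw.strip() for kw in keywords if kw.strip()]
    unique = list(dict.fromkeys(cleaned))  # dict preserves insertion order
    query = " OR ".join(unique)
    if len(query) > max_length:
        raise ValueError(f"query is {len(query)} characters; limit is {max_length}")
    return query


keywords = ["#SaddleridgeFire", "#GettyFire", "#KincadeFire", "#MariaFire",
            "#EasyFire", "#TickFire", "#HillsideFire"]
print(build_query(keywords))
```

For the seven hashtags used here, the helper produces the same string as the cell above.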

### The code below takes the number of desired tweets and the search query created above. It loops through full-length English-language tweets and collects every tweet that is not a retweet in a list. Once it has collected the desired number of tweets, the loop breaks.
#### There are a number of parameters that can be set (see documentation here for all information). In this case, we chose to filter on language and on date, because we were aiming to gather information about specific fires (particularly Getty and Kincade, to illustrate certain components of our dashboard demo). Twitter's free API reportedly does not let you search further back than about seven days, so reproducing our exact data pull after the fact would be difficult. Paid tiers of the Twitter API offer more options for handling this.

In [8]:
tweets = []

for tweet in tweepy.Cursor(api.search,q=query,
                           count=100, # the number of tweets to return from each page
                           lang="en", # the language of the tweet
                           since="2019-10-03", # optional field to filter search based on date (see note above)
                           tweet_mode='extended').items():
    if not tweet.full_text.startswith("RT @"):
        tweets.append(process_tweet(tweet))
    if len(tweets) % 100 == 0:
        print(f"pulled {len(tweets)} of {num_wanted_tweets} tweets")
    if len(tweets) >= num_wanted_tweets:  # stop once the target count is reached
        break

pulled 0 of 5000 tweets
pulled 100 of 5000 tweets
pulled 200 of 5000 tweets
pulled 300 of 5000 tweets
pulled 400 of 5000 tweets
pulled 500 of 5000 tweets
pulled 600 of 5000 tweets
pulled 700 of 5000 tweets
pulled 800 of 5000 tweets
pulled 900 of 5000 tweets
pulled 1000 of 5000 tweets
pulled 1100 of 5000 tweets
pulled 1200 of 5000 tweets
pulled 1300 of 5000 tweets
pulled 1400 of 5000 tweets
pulled 1500 of 5000 tweets
pulled 1600 of 5000 tweets
pulled 1700 of 5000 tweets
pulled 1800 of 5000 tweets
pulled 1900 of 5000 tweets
pulled 2000 of 5000 tweets
pulled 2100 of 5000 tweets
pulled 2200 of 5000 tweets
pulled 2300 of 5000 tweets
pulled 2400 of 5000 tweets
pulled 2500 of 5000 tweets
pulled 2600 of 5000 tweets
pulled 2700 of 5000 tweets
pulled 2800 of 5000 tweets
pulled 2900 of 5000 tweets
pulled 3000 of 5000 tweets
pulled 3100 of 5000 tweets
pulled 3200 of 5000 tweets
pulled 3300 of 5000 tweets
pulled 3400 of 5000 tweets
pulled 3500 of 5000 tweets
pulled 3600 of 5000 tweets
pulled 3700 o
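
The skip-retweets-and-stop-at-N logic can also be written as a lazy filter. The sketch below uses stand-in objects instead of real API results so it runs without credentials; `original_tweets` and `collect` are hypothetical helper names, not part of Tweepy.

```python
from itertools import islice
from types import SimpleNamespace


def original_tweets(statuses):
    """Yield only statuses whose text does not start with 'RT @'."""
    for status in statuses:
        if not status.full_text.startswith("RT @"):
            yield status


def collect(statuses, limit):
    """Return at most `limit` non-retweet statuses, consuming the input lazily."""
    return list(islice(original_tweets(statuses), limit))


# Stand-in data shaped like Status objects (only full_text is needed here).
fake_statuses = [
    SimpleNamespace(full_text=text)
    for text in ["RT @a: smoke", "Evacuations lifted", "RT @b: wind",
                 "Road closed", "Shelter open"]
]

collected = collect(fake_statuses, 2)
print([s.full_text for s in collected])  # ['Evacuations lifted', 'Road closed']
```

With real data, the iterable would be the `tweepy.Cursor(...).items()` call from the cell above, and each collected status would then be passed to `process_tweet`. Because `islice` stops pulling items once the limit is reached, no further statuses are consumed after that point.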

### Our dashboard demo (built in ArcGIS Online) is configured so that this dataset of tweets needs a specific `Disaster` column designating which disaster each tweet is associated with.
### In our case, this means each tweet needs to be associated with the correct `fire`. The following code turns the collected tweets into a dataframe, cleans up the hashtags, and creates the necessary column (`fire` in our case).

In [9]:
# providing the information necessary to filter and clean the tweets
disaster_type = 'Fire'
needed_disasters = [disaster.replace(disaster_type, "").replace("#", "") for disaster in keywords]
needed_disasters

['Saddleridge', 'Getty', 'Kincade', 'Maria', 'Easy', 'Tick', 'Hillside']

In [10]:
df = pd.DataFrame(tweets)

# cleaning the hashtags up and creating the necessary disaster column
df = df.explode('hashtags')
df = df.loc[(df['hashtags'].str.contains(disaster_type)) | (df['hashtags'].str.contains(disaster_type.lower()))].copy()
df[disaster_type.lower()] = df['hashtags'].str.replace(disaster_type, "")
df[disaster_type.lower()] = df[disaster_type.lower()].str.replace(disaster_type.lower(), "")

# in order to check for potential new disasters:
# this is relevant in our use case because many new fires were starting even while we were working on this
# df[disaster_type.lower()].value_counts()

df = df.loc[df[disaster_type.lower()].isin(needed_disasters)]
df = df[[disaster_type.lower(), 'full_text', 'media_link', 'place', 'created_at']]
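
To make the cleaning step easier to follow, here is the same per-hashtag logic in plain Python on a few made-up tweets. This is only an illustration of what the pandas cell does row by row, not a replacement for it; `fire_name` and the sample data are hypothetical.

```python
def fire_name(hashtag, disaster_type="Fire", needed=None):
    """Return the fire name for a hashtag, or None if it should be dropped."""
    if not isinstance(hashtag, str):
        return None  # tweets without hashtags become NaN after explode
    if disaster_type not in hashtag and disaster_type.lower() not in hashtag:
        return None  # hashtag does not mention the disaster type
    name = hashtag.replace(disaster_type, "").replace(disaster_type.lower(), "")
    if needed is not None and name not in needed:
        return None  # a fire we are not tracking
    return name


needed = ['Saddleridge', 'Getty', 'Kincade', 'Maria', 'Easy', 'Tick', 'Hillside']

sample_tweets = [
    {"full_text": "From the #SaddleridgeFire to the #KincadeFire",
     "hashtags": ["SaddleridgeFire", "KincadeFire"]},
    {"full_text": "Stay safe #SoCal", "hashtags": ["SoCal"]},
    {"full_text": "No tags here", "hashtags": []},
]

rows = []
for tweet in sample_tweets:
    for tag in tweet["hashtags"]:  # one candidate row per hashtag, like explode
        fire = fire_name(tag, needed=needed)
        if fire is not None:
            rows.append((fire, tweet["full_text"]))

print(rows)
```

A tweet that mentions two tracked fires ends up as two rows, one per fire, which is also what happens in the dataframe after `explode`.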

In [11]:
df.columns

Index(['fire', 'full_text', 'media_link', 'place', 'created_at'], dtype='object')

In [12]:
df.head()

|   | fire | full_text | media_link | place | created_at |
|---|---|---|---|---|---|
| 0 | Saddleridge | From the #SaddleridgeFire to the #KincadeFire,... |  |  | 2019-10-27 19:59:04 |
| 1 | Saddleridge | Our hearts go out to Californians affected by ... | https://twitter.com/EPCFIRM/status/11885645126... |  | 2019-10-27 21:13:56 |
| 2 | Saddleridge | @macayla_nield Hey, #SoCal, too, although, adm... |  |  | 2019-10-28 00:54:08 |
| 3 | Saddleridge | Smoke and flames poured from underground sewer... |  |  | 2019-10-28 02:29:27 |
| 4 | Saddleridge | The #SaddleridgeFire, which has burned about 4... |  |  | 2019-10-28 04:29:21 |


In [13]:
df.tail()

|   | fire | full_text | media_link | place | created_at |
|---|---|---|---|---|---|
| 5345 | Kincade | Good Morning !.....after the #KincadeFire ...... | https://twitter.com/jonigeographer/status/1192... |  | 2019-11-06 16:40:30 |
| 5346 | Kincade | California's Kincade Fire Burn Scar Seen From ... | https://twitter.com/SpaceRef/status/1192121602... |  | 2019-11-06 16:48:33 |
| 5347 | Kincade | U.S. District Judge William Alsup ordered PG&a... |  |  | 2019-11-06 17:04:50 |
| 5348 | Kincade | Kincade fire victims swarm county aid center i... |  |  | 2019-11-06 17:09:21 |
| 5349 | Kincade | @ETSshow Thanks for sharing @ETSshow . Been fe... |  |  | 2019-11-06 17:11:32 |


### A quick check of the fraction of tweets that actually contain media or location information.

In [14]:
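# process_tweet stores "n/a" for missing values; these only become NaN if the
# data is reloaded from CSV, so count a value as present only if it is neither
cols = ['media_link', 'place']
(df[cols].notnull() & df[cols].ne("n/a")).mean()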

media_link    0.341308
place         0.063551
dtype: float64

### At this point, the data is ready to be saved and passed into the ArcGIS Online dashboard.

In [15]:
df.to_csv('final_5k_tweets.csv', index=False)