# Twitter Data Collection

We collected tweets about recent California wildfires for the following locations and time frames:

**Pre-fire tweets**     
San Fernando Valley / 08/20/2019 - 09/06/2019  
**Post-fire tweets**  
1- La Tuna Canyon / 09/01/2017 - 09/01/2017  
2- Kincade / 10/24/2019 - 10/28/2019  
3- Santa Clarita (Tick Fire) / 10/26/2019 - 10/28/2019     
4- Saddleridge / 10/19/2019 - 10/24/2019  
5- Getty Center / 10/28/2019 - 10/29/2019  
6- Santa Paula (Maria Fire) / 10/31/2019 - 11/01/2019  

We then labeled all pre-fire tweets as the negative class (Target = 0) and manually labeled the post-fire tweets one by one. If a tweet is both relevant and informative, its target value is 1 (Target = 1); otherwise it is 0 (Target = 0).
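
The labeling rule can be written as a small function. This is only a sketch: the names `label_tweet`, `is_post_fire`, `relevant`, and `informative` are illustrative and do not appear in the notebook, where the post-fire judgments were made by hand.

```python
def label_tweet(is_post_fire: bool, relevant: bool = False, informative: bool = False) -> int:
    """Return the target class for a single tweet."""
    if not is_post_fire:
        # every pre-fire tweet belongs to the negative class
        return 0
    # a post-fire tweet is positive only when it is both relevant and informative
    return 1 if (relevant and informative) else 0


print(label_tweet(is_post_fire=False))                                   # 0
print(label_tweet(is_post_fire=True, relevant=True, informative=True))   # 1
print(label_tweet(is_post_fire=True, relevant=True, informative=False))  # 0
```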

## 1 - Using Tweepy Library

We used the Tweepy library to build a daily scraper that collects the newest tweets by location and date, because the Twitter search API only returns tweets posted within the last 7 days.
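
The 7-day limit determines which collection windows a live scraper can still reach. The check below is a minimal sketch of that decision; `within_search_window` and `SEARCH_WINDOW_DAYS` are hypothetical names introduced here, not part of Tweepy.

```python
from datetime import date, timedelta

SEARCH_WINDOW_DAYS = 7  # reach of the search API, as stated above


def within_search_window(start: date, today: date) -> bool:
    """True if `start` is no more than 7 days before `today` (and not in the future)."""
    return timedelta(0) <= (today - start) <= timedelta(days=SEARCH_WINDOW_DAYS)


# Maria Fire window (starts 10/31/2019), checked on 11/01/2019
print(within_search_window(date(2019, 10, 31), date(2019, 11, 1)))  # True
# Pre-fire window (starts 08/20/2019), checked on the same day
print(within_search_window(date(2019, 8, 20), date(2019, 11, 1)))   # False
```

Windows that fail this check need a different tool than the live search scraper.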

### Accessing the API

Accessing the API requires a Twitter developer account and a registered app, which provide the consumer keys and access tokens.  
For further information: https://developer.twitter.com/en.html

In [1]:
# Imports
import pandas as pd
import numpy as np

import json
import tweepy
import csv
from datetime import date

In [2]:
# Create a dictionary to store your twitter credentials

twitter_cred = dict()

twitter_cred['CONSUMER_KEY'] = '***********************'
twitter_cred['CONSUMER_SECRET'] = '***********************'
twitter_cred['ACCESS_KEY'] = '***********************'
twitter_cred['ACCESS_SECRET'] = '***********************'

In [3]:
# Save the information to a json so that it can be reused in code

with open('twitter_cred.json', 'w') as secret_info:
    json.dump(twitter_cred, secret_info, sort_keys=True)

Open the twitter_cred.json file in the current directory and enter your consumer key, consumer secret, access key, and access secret. After that, do not re-run the cell above, since it would overwrite the file with placeholder values.
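
One way to avoid the accidental overwrite is to create the file only when it does not exist yet, and to check the values when loading. The following is a sketch under that assumption; `write_template_once`, `load_credentials`, and `REQUIRED_KEYS` are hypothetical helpers, not part of the notebook or of Tweepy.

```python
import json
import os

REQUIRED_KEYS = ('CONSUMER_KEY', 'CONSUMER_SECRET', 'ACCESS_KEY', 'ACCESS_SECRET')


def write_template_once(path):
    """Create a blank credentials file unless one already exists."""
    if os.path.exists(path):
        return False  # keep the file that was already filled in
    with open(path, 'w') as f:
        json.dump({key: '' for key in REQUIRED_KEYS}, f, sort_keys=True, indent=2)
    return True


def load_credentials(path):
    """Load the credentials and fail early if a value is empty or still masked with '*'."""
    with open(path) as f:
        info = json.load(f)
    missing = [k for k in REQUIRED_KEYS if not info.get(k) or set(info[k]) == {'*'}]
    if missing:
        raise ValueError(f'Fill in these keys in {path}: {missing}')
    return info
```

With this guard, re-running the setup cell is harmless: `write_template_once('twitter_cred.json')` returns `False` and leaves an existing file untouched.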

In [7]:
# load Twitter API credentials

with open('twitter_cred.json') as cred_data:
    info = json.load(cred_data)
    consumer_key = info['CONSUMER_KEY']
    consumer_secret = info['CONSUMER_SECRET']
    access_token = info['ACCESS_KEY']
    access_secret = info['ACCESS_SECRET']

In [8]:
# Authorization and initialization

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)

In [9]:
# https://tweepy.readthedocs.io/en/latest/code_snippet.html

# specify a keyword, e.g. 'fire'
# if you assign a blank string, it will pull all tweets
keyword = 'maria fire'

# get today's date
today = str(date.today())

# open/create a file to append data; the with block closes it when the loop ends
with open('../data/maria_fire_20191101.csv', 'a', newline='', encoding='utf-8') as csv_file:

    # define csv writer
    csv_writer = csv.writer(csv_file)

    # api.search is the Tweepy 3.x name; Tweepy 4+ renamed it to api.search_tweets
    for tweet in tweepy.Cursor(api.search,
                               q=keyword,
                               count=100,
                               geocode='34.3542,-119.0593,30mi',  # latitude,longitude,radius without spaces (update for other locations)
                               lang="en",
                               since=today).items():  # .items() is for paging tweets

        # print(tweet.created_at, tweet.text)
        # condition to exclude retweets
        # if not (tweet.retweeted) and ('RT @' not in tweet.text) and ('https://' not in tweet.text):

        # append the tweet to the dataset file
        csv_writer.writerow([tweet.created_at, tweet.text, tweet.retweet_count, tweet.favorite_count])
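
Two parts of the search cell are easy to get wrong when adapting it to another fire: the `geocode` string, which is ordered latitude, longitude, radius, and the retweet filter that is commented out in the loop. The helpers below are a sketch of both; `make_geocode` and `is_original` are hypothetical names, and `is_original` only inspects the text, mirroring the string checks in the commented-out condition.

```python
def make_geocode(latitude, longitude, radius_miles):
    """Build a 'latitude,longitude,radius' string with no spaces."""
    return f'{latitude},{longitude},{radius_miles}mi'


def is_original(text):
    """True for tweets that are neither retweets nor posts containing an https link."""
    return 'RT @' not in text and 'https://' not in text


print(make_geocode(34.3542, -119.0593, 30))              # 34.3542,-119.0593,30mi
print(is_original('RT @KTLA: The #MariaFire exploded'))  # False
print(is_original('Smoke is visible from Santa Paula'))  # True
```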


In [14]:
# create dataframe from csv file 
tweet_df = pd.read_csv('../data/maria_fire_20191101.csv')

In [15]:
tweet_df.shape

(2297, 5)

In [16]:
# assign appropriate column names
tweet_df.columns = ['time', 'text', 'retweet_count', 'favorite_count', 'target']

In [17]:
# convert datetime string to datetime data type
tweet_df['time'] = pd.to_datetime(tweet_df['time'])

In [18]:
tweet_df.drop_duplicates(subset='text', inplace=True)

In [19]:
tweet_df['target'] = 1
# target value to be updated on csv file based on relevance

In [20]:
tweet_df.head()

| | time | text | retweet_count | favorite_count | target |
|---|---|---|---|---|---|
| 0 | 2019-11-01 21:38:55 | RT @Christian_lxpez: MARIA FIRE UPDATE ! https... | 239 | 0 | 1 |
| 1 | 2019-11-01 21:38:09 | RT @KTLA: The #MariaFire exploded across South... | 514 | 0 | 1 |
| 2 | 2019-11-01 21:37:45 | RT @WCKitchen: UPDATE from the #MariaFire in V... | 338 | 0 | 1 |
| 3 | 2019-11-01 21:37:32 | RT @SpecNews1SoCal: UPDATE: More new #MariaFir... | 26 | 0 | 1 |
| 4 | 2019-11-01 21:37:27 | RT @yamphoto: A study in hose action: firefigh... | 192 | 0 | 1 |


In [21]:
tweet_df.to_csv('../data/maria_fire_20191101.csv', index=False)

## 2 - Using GetOldTweets3 Library

We used the GetOldTweets3 library to scrape tweets older than 7 days.

In [22]:
# Imports
import GetOldTweets3 as got
import codecs
import pandas as pd

In [23]:
# https://pypi.org/project/GetOldTweets3/
    
# define a function that takes search criteria and passes them to the getTweets() method,
# then builds a list of tweet dictionaries
def scrape_tweets(scrape_information):
    tweets = got.manager.TweetManager.getTweets(scrape_information)
    tweets_list = []
    for tweet in tweets:
        scraped_tweets = {}
        scraped_tweets['tweet'] = tweet.text
        tweets_list.append(scraped_tweets)
    return tweets_list
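
The function above keeps only the tweet text, so the posting time is lost. A variant that keeps the timestamp could look like the sketch below. It assumes each tweet object exposes a `date` attribute alongside `text`, as the GetOldTweets3 README describes; `tweets_to_records` is a hypothetical name, and the stand-in objects are made up so the example runs without network access.

```python
from datetime import datetime
from types import SimpleNamespace


def tweets_to_records(tweets):
    """Turn tweet objects into dicts that keep the timestamp with the text."""
    return [{'text': t.text, 'time': t.date} for t in tweets]


# stand-in objects with the two assumed attributes (not real scraped tweets)
fake_tweets = [
    SimpleNamespace(text='first example tweet', date=datetime(2019, 8, 21, 14, 5)),
    SimpleNamespace(text='second example tweet', date=datetime(2019, 9, 2, 9, 30)),
]

records = tweets_to_records(fake_tweets)
print(records[0])
```

A list of such records can be passed to `pd.DataFrame` in the same way as the list returned by `scrape_tweets`.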

In [26]:
# keyword = 'fire'
parameters = got.manager.TweetCriteria()\
                .setMaxTweets(10000)\
                .setSince('2019-08-20')\
                .setUntil('2019-09-06')\
                .setNear('34.1826,-118.4397')\
                .setWithin('30mi')
prefire_tweets = scrape_tweets(parameters)

In [27]:
# create a dataframe
prefire_tweets = pd.DataFrame(prefire_tweets)
prefire_tweets.head()

| | tweet |
|---|---|
| 0 | 6626 Norwich Avenuepic.twitter.com/X3ubOVlf8O |
| 1 | This was an awesome find by @MitchTheFort. So ... |
| 2 | #MrKrabsMeme @North Hollywood, California http... |
| 3 | Just posted a photo @North Hollywood, Californ... |
| 4 | Check it out! Carnival Row Orlando Bloom Cara ... |


In [28]:
# assign appropriate column names
prefire_tweets.columns = ['text']

In [29]:
# set the datetime value to the start date of the filter window
prefire_tweets['time'] = '2019-08-20 10:00:00'

In [30]:
# convert datetime string to datetime data type
prefire_tweets['time'] = pd.to_datetime(prefire_tweets['time'])

In [31]:
# drop duplicate tweets 
prefire_tweets.drop_duplicates(subset='text', inplace=True)

In [32]:
# set target value as 0
prefire_tweets['target'] = 0

In [33]:
prefire_tweets.to_csv('../data/prefire_tweets.csv', index=False)