# Notebook 01:  scraping Twitter for incident response

This notebook details the methods used to collect tweets related to traffic incidents in the Los Angeles metropolitan area.  The current configuration is set up to gather data before and after an incident on the interstate 10 freeway in El Monte, CA.  To run a custom query you should update the incident time and query term list.

In [1]:
import GetOldTweets3 as got
import pandas as pd
import numpy as np
from langdetect import detect

#### Tweet fetching function
- for each query, collect tweets between two timestamps

In [2]:
def getweets(query,t0,t1): 
    
    # send get query useing getoldtweets3
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(query)\
                                               .setSince(t0)\
                                               .setUntil(t1)
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    
    # select desired features
    features = {
    'tweet': 'text',
    'username': 'username',
    'date': 'formatted_date',
    }
    
    # instantiate data dictionary for new data
    data = {}

    # loop through 
    for title in features.keys():
        data[title] = []

    for i in tweets:
        for col, attr in features.items():
            var = eval('i.' + attr)
            data[col].append(var)

    return(pd.DataFrame.from_dict(data))  

#### Define incident time, desired time window, and list of query terms
- update this cell for each new incident with the following:
1. incident time as pandas datetime object
2. starting time t0 to collect tweets from
3. ending time t1 to collect tweets until
4. list of query terms to search twitter for
- to do combinations, assign a list of stems and leaves to combine in multiple permutations

In [19]:
# Set incident time
incident_time = pd.to_datetime("2019-02-23 13:00:00-07:00")

# Set collection starting time (recommended: 6 hours vefore incident)
t0 = "2019-02-22T00:00:00-07:00"

# Set Tweet collection ending time (recommended: 6 hours after incident)
t1 = "2019-02-25T00:00:00-07:00"

# SET QUERY TERMS:  use a combination of main artery and intersection/exits

# main artery keywords
stems = [
    'I-10',
    'I-10 Eastbound',
    'I-10 East',
    'I-10 Westbound',
    'I-10 West',
    'i10',
    'i10 Eastbound',
    'i10 East',
    'i10 Westbound',
    'i10 West',
    'Ten eastbound',
    'Ten east',
    'Ten westbound',
    'Ten west',
    'Santa Monica Fwy'
]

# intersection/exit keywords
leaves = [
    'El Monte',
    'Durfee Ave',
    'Durfee'
    'I-605',
    'i605',
    'San Gabriel Valley Fwy',
    'exit 30'
]

# combine stem and leaf terms as individual values
query_terms = stems + leaves

# concatenate combinations of stems and leaves
for s in stems:
    for l in leaves:
        nu_term = s + ' ' + l
        query_terms.append(nu_term)
        
print(f"{len(query_terms)} query items")

111 query items


#### Collect Data

In [4]:
%%time

# initialize query with 3-gram
df_list = []

# for each query term, add a new data frame to the list
for q in query_terms:
    nu_df = getweets(q,t0,t1)
    df_list.append(nu_df)
   
# concatenate list of dataframes, set index, and convert dates to datetime objects
data = pd.concat(df_list)  
data['index'] = np.array(range(0,data.shape[0]))
data = data.set_index('index')
data['date'] = pd.to_datetime(data['date'])

# display data attributes
print(data.shape)
data.head()

(8574, 3)
CPU times: user 47.5 s, sys: 2.22 s, total: 49.7 s
Wall time: 7min 1s


Unnamed: 0_level_0,tweet,username,date
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,Pique jogador,gabriel_agle,2019-02-24 23:59:11+00:00
1,,girodrigs7,2019-02-24 23:58:45+00:00
2,alguém me entende,lauraincg,2019-02-24 23:48:45+00:00
3,歌います。 きいてください…『つばさをください』,i10_oo,2019-02-24 23:47:40+00:00
4,I just got really high and ate a whole bag of ...,_mvndiii,2019-02-24 23:47:23+00:00


#### Detect language and filter out non-English tweets

In [5]:
langs = []

# loop through tweets and detect language, add to list
for tweet in list(data['tweet']):
    try: 
        langs.append(detect(tweet))
    except:
        langs.append('unk')

# add new column with languages
data['langs'] = langs

# filter for just English language Tweets and set corpus
corpus = data[data['langs'] == 'en']

#### Set corpus target by incident time
- Tweets from before the incident are labeled with a 0
- Tweets from after the incident are labeled with a 1

In [21]:
# set labels
corpus['after_incident'] = [0 if time < incident_time else 1 for time in corpus['date']]

# print normalized value counts for each class
corpus['after_incident'].value_counts(normalize=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


0    0.703302
1    0.296698
Name: after_incident, dtype: float64

#### Drop unused columns and export csv

In [23]:
# define file output path
outfile = './data/sample_output.csv'

# drop date and language columns and export to csv
corpus.drop(columns=['date','langs']).to_csv(outfile,index=False)