# Gathering Data

## Table of Contents

1. [Collecting Tweets](01-Gathering-Data.ipynb)
1. [Feature Engineering with TF-IDF](02-Feature-Engineering.ipynb)
1. [Benchmark Model](03-Benchmark-Model.ipynb)
1. [Feature Engineering & Model Tuning with Doc2Vec](04-Model-Tuning.ipynb)
1. [Making Predictions on Test Data](05-Making-Predictions.ipynb)
1. [Visualizing a Disaster Event](06-Time-Series-Analysis.ipynb)

### Install Packages
Uncomment and run these commands once per environment.

In [49]:
# !pip install pyquery
# !pip install -r './lib/got3/requirements.txt'

### Import Libraries

In [51]:
import pandas as pd
import time
import lib.got3 as got


## Collecting Tweets from Twitter using GetOldTweets (GOT)

### Convert GOT tweets to a pandas DataFrame

In [52]:
def tweets_to_df(tweets):
    '''
    Converts tweets acquired using GOT into a pandas DataFrame.
    Index: date
    Columns: text
    '''
    tweets_list = []
    for t in tweets:
        tweet_dict = {}
        tweet_dict['date'] = t.date
        tweet_dict['text'] = t.text
        tweets_list.append(tweet_dict)
        
    tweets_df = pd.DataFrame(tweets_list)
    
    # convert to time series: index by date, in chronological order
    # (sort_index returns a new frame, so the result is reassigned)
    tweets_df = tweets_df.set_index('date').sort_index(ascending = True)
    
    return tweets_df[['text']]
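
The conversion steps can be checked without querying Twitter. The sketch below is an illustration only: `FakeTweet` is a hypothetical stand-in that exposes just the `date` and `text` attributes read during conversion, and it assumes the intended result is a date index in chronological order.

```python
from collections import namedtuple
from datetime import datetime

import pandas as pd

# Hypothetical stand-in for a GOT tweet: only the two attributes used here.
FakeTweet = namedtuple('FakeTweet', ['date', 'text'])

# Deliberately out of chronological order.
fake_tweets = [
    FakeTweet(datetime(2018, 7, 10, 15, 30), 'later tweet'),
    FakeTweet(datetime(2018, 7, 10, 9, 0), 'earlier tweet'),
]

# Same steps as the conversion: list of dicts -> DataFrame -> date index.
demo_df = pd.DataFrame([{'date': t.date, 'text': t.text} for t in fake_tweets])
demo_df = demo_df.set_index('date').sort_index()

print(demo_df)
```

Note that `sort_index` returns a new DataFrame unless `inplace=True` is passed, so the sorted result has to be assigned to a name to take effect.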
    

### Set up search query

In [37]:
query = 'wildfire OR forest+fire'
since = '2018-07-10'
until = '2018-07-11'
count = 20000

tweetCriteria = got.manager.TweetCriteria().setQuerySearch(query)\
                                           .setSince(since)\
                                           .setUntil(until)\
                                           .setMaxTweets(count)

### Search using multiple date ranges, save each search result to separate CSVs

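A single day can return a large number of tweets (1,837 for the date range run below), so it is best to search one day at a time. Each consecutive pair of entries in `dates` defines one search window, and each window's results are saved to a separate CSV.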
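
For longer periods, typing the date list by hand becomes tedious. As a minimal sketch, the list could be generated with `pd.date_range`; the helper name `daily_date_strings` and the example dates are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def daily_date_strings(start, end):
    """Return every calendar day from start to end (inclusive) as 'YYYY-MM-DD' strings."""
    days = pd.date_range(start=start, end=end, freq='D')
    return days.strftime('%Y-%m-%d').tolist()

example_dates = daily_date_strings('2018-07-10', '2018-07-13')

# Consecutive pairs form the one-day (since, until) search windows.
windows = list(zip(example_dates[:-1], example_dates[1:]))
print(windows)
```

A list built this way can be assigned to `dates` and used in the search loop unchanged, since the loop only relies on consecutive entries.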

In [44]:
query = 'wildfire OR forest+fire'
dates = ['2018-07-12', '2018-07-13']

In [45]:
for i in range(len(dates)-1):
    
    # Set up search query
    since = dates[i]
    until = dates[i+1]
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(query)\
                                           .setSince(since)\
                                           .setUntil(until)\
                                           .setMaxTweets(count)

    # Run search
    t0 = time.time()
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    t1 = time.time() - t0
    
    # Print progress
    print(f'Got {len(tweets)} tweets from {since} to {until} in {round(t1, 2)} seconds')
    
    # Convert tweets to dataframe
    tweets_df = tweets_to_df(tweets)
    
    # Save Tweets with date range and query
    tweets_df.to_csv(f'../data/{since}_{until}_{query}.csv', index = True, index_label = 'date')

Got 1837 tweets from 2018-07-12 to 2018-07-13 in 45.47 seconds
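
Because each date range is written to its own CSV, later steps will probably need the files recombined. The following is a sketch under stated assumptions: the helper name `load_tweet_csvs` is hypothetical, and it assumes every CSV in the directory was written by the loop above, with a `date` index column and a `text` column.

```python
import glob
import os

import pandas as pd

def load_tweet_csvs(data_dir):
    """Read every CSV in data_dir and return one DataFrame sorted by date."""
    paths = sorted(glob.glob(os.path.join(data_dir, '*.csv')))
    if not paths:
        raise FileNotFoundError(f'No CSV files found in {data_dir}')
    # Restore the date index that was written with index_label='date'.
    frames = [pd.read_csv(p, index_col='date', parse_dates=['date']) for p in paths]
    return pd.concat(frames).sort_index()

# Example call, assuming the same output directory as the loop above:
# all_tweets = load_tweet_csvs('../data')
```

Sorting after concatenation keeps the combined frame in chronological order regardless of the order in which the files are read.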
