# Scraping Twitter for 50K Tweets
_<b>Author:</b> Raffy Santayana_

---

## Problem Statement:
The aftermath of Hurricane Sandy left over 8 million people in the US without power<sub>[1](https://www.huffpost.com/entry/hurricane-sandy-power-outage-map-infographic_n_2044411)</sub>. According to [energy.gov](https://www.energy.gov/articles/hurricane-sandy-noreaster-situation-reports), every customer able to receive electricity after the storm had power restored by December 3, which still left 26,000 people in New York and New Jersey without power. One method of detecting power outages at the time relied on [smart meters and Advanced Meter Infrastructures (AMI)](https://openei.org/wiki/Definition:Outage_Detection/Reporting). The problem is that these systems will not be fully implemented until 2030 because of their high production cost<sub>[2](http://people.stern.nyu.edu/kbauman/research/papers/2015_KBauman_WITS.pdf)</sub>. As an alternative method of detecting outages, we will use natural language processing techniques such as word embedding to analyze posts from a social media platform, Twitter. By analyzing these posts, we hope to classify the location each tweet was sent from as an area either with or without power.

```python
import pandas as pd
import numpy as np
import GetOldTweets3 as got
import time as t
import datetime
```

```python
max_tweets = 50_000
tweetCriteria = got.manager.TweetCriteria().setQuerySearch("'blackout' OR 'blackouts' OR 'outage' OR 'outages' OR 'power outage' OR 'pwr out' OR 'no pwr' OR 'no power' -filter:retweets")\
                                           .setSince("2014-01-01")\
                                           .setUntil("2019-09-02")\
                                           .setMaxTweets(max_tweets)
```
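
The search string above is easier to maintain when it is assembled from a list of keywords. The following is a minimal sketch of that idea; `build_query` and `keywords` are hypothetical names introduced here for illustration and are not part of GetOldTweets3. It reproduces the same query text used above.

```python
def build_query(keywords, exclude_retweets=True):
    """Join quoted keywords with OR and optionally drop retweets.

    Hypothetical helper: it only builds a string and makes no network calls.
    """
    # Wrap each keyword or phrase in single quotes, matching the query above
    quoted = [f"'{kw}'" for kw in keywords]
    query = " OR ".join(quoted)
    if exclude_retweets:
        # Same search operator used in the original query string
        query += " -filter:retweets"
    return query


keywords = ["blackout", "blackouts", "outage", "outages",
            "power outage", "pwr out", "no pwr", "no power"]
query = build_query(keywords)
print(query)
```

The resulting `query` could then be passed to `setQuerySearch` in place of the hard-coded string.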

```python
# Pull every matching tweet in a single call and time the whole pull.
# getTweets returns a list of tweet objects; the criteria cap it at max_tweets.
start_time = datetime.datetime.now()
tweets = got.manager.TweetManager.getTweets(tweetCriteria)
print(f'Got {len(tweets)} tweets')
print(f'{datetime.datetime.now() - start_time} to complete')
```
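
A pull covering several years can take a long time in one request. One option is to split the date range into smaller windows and run one pull per window. The sketch below only generates the windows; `yearly_windows` is a hypothetical helper, and treating each window's end date as an exclusive bound is an assumption about how the search handles `setUntil`.

```python
from datetime import date


def yearly_windows(since, until):
    """Split [since, until) into consecutive (start, end) ISO date strings,
    one pair per calendar year. Hypothetical helper for illustration."""
    windows = []
    start = since
    while start < until:
        # Each window ends at the next January 1 or at the overall end date,
        # whichever comes first
        end = min(date(start.year + 1, 1, 1), until)
        windows.append((start.isoformat(), end.isoformat()))
        start = end
    return windows


windows = yearly_windows(date(2014, 1, 1), date(2019, 9, 2))
for since, until in windows:
    print(since, until)
```

Each `(since, until)` pair could be passed to `setSince` and `setUntil` on a separate criteria object, with a short pause between pulls.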

```python
# Collect each field of interest into its own list
tweet_id, username, text, date = [], [], [], []
for tweet in tweets:
    tweet_id.append(tweet.id)
    username.append(tweet.username)
    text.append(tweet.text)
    date.append(tweet.date)
```

```python
df_tweets = pd.DataFrame(data={'id': tweet_id,
                               'username': username,
                               'text': text,
                               'timestamp': date})
```

```python
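# Assumes the ./data directory already exists; index=False omits the unnamed index column
df_tweets.to_csv('./data/tweets.csv', index=False)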
```