#### This notebook provides high-level guidelines for creating an arbitrary dataset from Twitter.
#### As an example, a small dataset will be created, similar to the one used in [this challenge](https://www.kaggle.com/c/nlp-getting-started).

Note: you will not be able to run this notebook as is unless you supply your own access tokens (`Access Token`, `Access Token Secret`, `Consumer Key`, `Consumer Secret`).

To access the Twitter API, you need to go through the registration process (free, but with limited access) at https://developer.twitter.com/en.

After that, you will obtain your private keys and tokens.
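
One way to keep those secrets out of the notebook itself is to read them from environment variables. The sketch below is only an illustration: the environment variable names and the helper `load_twitter_credentials` are assumptions made for this example, not something the `twitter` package requires.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
ENV_NAMES = {
    'Consumer Key': 'TWITTER_CONSUMER_KEY',
    'Consumer Secret': 'TWITTER_CONSUMER_SECRET',
    'Access Token': 'TWITTER_ACCESS_TOKEN',
    'Access Token Secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}


def load_twitter_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in ENV_NAMES.values() if not environ.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {label: environ[name] for label, name in ENV_NAMES.items()}
```

The returned dict uses the same four labels as the credentials listed above, so it could stand in for a hard-coded dictionary of keys.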

In [None]:
!pip install twitter

In [None]:
import pandas as pd
pd.options.display.max_colwidth = 100
import twitter

We will start from the end: [here](https://www.kaggle.com/vstepanenko/disaster-tweets) is the dataset that was created by following the steps described in this notebook.
You can find more information about its content and time of creation by following that link.

In [None]:
disaster_tweets_df = pd.read_csv('../input/disaster-tweets/tweets.csv',
                                 usecols=['keyword', 'location', 'text', 'target'])
disaster_tweets_df

It has the same structure as the dataset used in this competition and contains 11370 tweets.

In [None]:
disaster_tweets_df['target'].value_counts()

Now we start building a new, smaller dataset from Twitter from scratch.

In [None]:
# Establish access to the Twitter API.
# Here I used my Twitter credentials (they are now invalidated).
# You will need to plug in yours.

# VERY IMPORTANT!
# Regenerate/revoke your keys, if you decide to publish your version of the notebook

tw={
    'Consumer Key': 'your_consumer_key',
    'Consumer Secret': 'your_consumer_secret',
    'Access Token': 'your_access_token',
    'Access Token Secret': 'your_access_secret',
   }


auth = twitter.oauth.OAuth(tw['Access Token'],
                           tw['Access Token Secret'],
                           tw['Consumer Key'],
                           tw['Consumer Secret'])

twitter_api = twitter.Twitter(auth=auth)

Set the variable `q` to any disaster-related keyword, or to anything else that interests you.

Note: The cell below will not run (a `TwitterHTTPError` will be raised) unless you use your private keys.


In [None]:
q = 'covid19'
number = 10 # number of tweets to query
search_results = twitter_api.search.tweets(q=q, count=number)

`search_results` contains two keys: `statuses` (lots of details about each tweet) and `search_metadata` (info about the search parameters).

From all available data in `statuses`, we will extract only `keyword`, `location`, `text`.

In [None]:
print(search_results.keys())
statuses = search_results['statuses']

Here is how to extract `keyword`, `location`, and `text` to replicate the structure of the dataset used in this competition.
We also add a `target` column and set it to `None`.

In [None]:
example_df = pd.DataFrame(
    data=[[q, s['user']['location'], s['text'], None] for s in statuses],
    columns = ['keyword', 'location', 'text', 'target'],
            )

example_df
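
Because the cell above needs live credentials, the same extraction can be tried offline on a hand-written response. The sketch below is an illustration under stated assumptions: `fake_search_results` is made-up data that mimics only the fields used here, and `status_to_row` is a hypothetical helper. It uses `.get()` so that a status without a `user` or `location` entry does not raise a `KeyError`.

```python
# A hand-written stand-in for search_results; only the fields used here
# are included, and the metadata values are purely illustrative.
fake_search_results = {
    'statuses': [
        {'text': 'Stay safe everyone', 'user': {'location': 'London'}},
        {'text': 'Second example tweet', 'user': {'location': ''}},
        {'text': 'No user info on this one'},
    ],
    'search_metadata': {'query': 'covid19', 'count': 3},
}


def status_to_row(keyword, status):
    """Build one [keyword, location, text, target] row from a status dict."""
    # .get() returns None instead of raising if 'user' or 'location' is absent.
    location = status.get('user', {}).get('location')
    return [keyword, location, status.get('text'), None]


rows = [status_to_row('covid19', s) for s in fake_search_results['statuses']]
for row in rows:
    print(row)
```

The resulting `rows` can be passed to `pd.DataFrame(data=rows, columns=['keyword', 'location', 'text', 'target'])` to obtain the same layout as `example_df`.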

Let's try some more keywords, using a `for` loop to iterate over them.

All keywords that could be found in `train` and `test` datasets (provided in the competition) have been reused to create [Disaster Tweets](https://www.kaggle.com/vstepanenko/disaster-tweets).
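
A keyword list of that kind could be assembled roughly as follows. This is a sketch under assumptions: `train_df` and `test_df` are tiny made-up stand-ins for the competition files, and the `unquote` step only matters if the keywords contain percent-encoded characters (for example `body%20bags`).

```python
from urllib.parse import unquote

import pandas as pd

# Tiny stand-ins for the competition's train and test files (made-up rows).
train_df = pd.DataFrame({'keyword': ['fire', 'body%20bags', None, 'fire']})
test_df = pd.DataFrame({'keyword': ['flood', 'fire', None]})

# Combine both keyword columns, then drop missing values and duplicates.
all_keywords = pd.concat([train_df['keyword'], test_df['keyword']]).dropna().unique()

# Decode percent-encoded characters so 'body%20bags' becomes 'body bags'.
keywords_from_data = sorted(unquote(k) for k in all_keywords)
print(keywords_from_data)  # ['body bags', 'fire', 'flood']
```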

In [None]:
# Just as an example, here I use four topics.
# Feel free to complement/amend the list with yours.
keywords=['pandemic', 'lockdown', 'fire', 'crush']

In [None]:
def collect_tweets(keywords, count=10):
    columns = ['keyword', 'location', 'text', 'target']
    frames = []  # one DataFrame per keyword
    for q in keywords:
        search_results = twitter_api.search.tweets(q=q, count=count)
        frames.append(pd.DataFrame(
            data=[[q, s['user']['location'], s['text'], None] for s in search_results['statuses']],
            columns=columns,
        ))
    # DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat.
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)

In [None]:
# We collect up to 20 tweets in total: 5 tweets for each of the 4 topics.
tweet_collection_df = collect_tweets(keywords, count=5)
tweet_collection_df

The tweets have been collected into the dataframe. It is all done!

You may want to save your work so that you can continue to tune the collected tweets offline.

In [None]:
tweet_collection_df.to_csv('tweet_collection_df.csv', index=False)
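
A saved file can be read back with `pd.read_csv`. The sketch below shows the round trip on made-up rows written to a temporary directory; `sample_df` and the file name `sample.csv` are assumptions for illustration only. Note that the empty `target` cells come back as `NaN` rather than `None`.

```python
import os
import tempfile

import pandas as pd

# Made-up rows with the same columns as tweet_collection_df.
sample_df = pd.DataFrame(
    data=[['fire', 'Paris', 'Example tweet text', None],
          ['flood', None, 'Another example', None]],
    columns=['keyword', 'location', 'text', 'target'],
)

# Write to a temporary directory so that nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.csv')
    sample_df.to_csv(path, index=False)
    reloaded_df = pd.read_csv(path)

# Empty cells are read back as NaN.
print(reloaded_df['target'].isna().all())
print(list(reloaded_df.columns))
```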

In [None]:
!ls -l tweet_collection_df.csv