## Scrape Twitter with Twint

This notebook uses the Python [twint](https://github.com/twintproject/twint/wiki/Module) package to scrape Twitter. Set the options in the Configuration cell before running the remaining cells (see the Python comments for guidance).

<h3 style="color:red;">Important!</h3>

Before you begin, make sure that you have installed both the development version of twint and the nest_asyncio package with the following commands:

```bash
pip install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint

pip install nest_asyncio
```

## Configuration

In [None]:
# Set the value to None for any filters you do not wish to use
handle             = None # e.g. 'sekleinman'
queryterm          = 'humanities' # e.g. 'humanities'
language           = 'en'
limit              = None # e.g. 20
location           = None # e.g. 'London'
near               = None # e.g. 'London'
year               = None # e.g. '2017'
since              = '2018-01-01' # e.g. '2017-12-27'
until              = '2018-01-31' # e.g. '2017-12-27'
output_format      = 'json' # 'json' or 'csv'
# The path to a folder where the tweets will be saved.
# A file called "tweets.json" will be saved in this folder.
# If the file already exists, tweets will be appended to existing data.
output_path        = 'C:/Users/Scott/OneDrive/Mellon/twint/testscrape'

# True/False Options
verified           = False # Include only tweets by verified users
hide_output        = True # Prevents large outputs from displaying in the console
count              = True # Show the total number of tweets fetched
stats              = True # Show the tweet stats in the terminal output

# twint scrapes a large number of metadata properties.
# Specify all properties you wish to include in your output.
# A full list of possible properties is below.
# Note that `location` causes an error for some reason, so
# it is commented out.
output_properties  = [
    'id',
    'date',
    'username',
    'place',
    'tweet',
    'mentions',
    'urls',
#     'location',
    'hashtags',
    'link',
    'retweets_count',
    'likes_count'
]

# Full list: 
# 'id', 'conversation_id', 'created_at', 'date', 'time',
# 'timezone', 'user_id', 'username', 'name', 'place', 'tweet',
# 'mentions', 'urls', 'photos', 'replies_count', 'retweets_count', 
# 'likes_count', 'location', 'hashtags', 'link', 'retweet',
# 'quote_url', 'video'
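
Judging by the examples in the comments above, the `since` and `until` filters are date strings in `YYYY-MM-DD` form. The following is a minimal sketch for checking and normalizing such a value before scraping. `check_date` is a hypothetical helper written for this notebook, not part of twint.

```python
import datetime

def check_date(value):
    """Return value normalized to YYYY-MM-DD, or None if value is None.

    Raises ValueError if value is not a valid year-month-day date.
    """
    if value is None:
        return None
    parsed = datetime.datetime.strptime(value, '%Y-%m-%d')
    return parsed.strftime('%Y-%m-%d')

# A single-digit day is accepted and padded to two digits
print(check_date('2018-01-1'))   # prints 2018-01-01
print(check_date('2018-01-31'))  # prints 2018-01-31
```

Calling `check_date` on both filters at the end of the Configuration cell would surface a malformed date before the scraping cell runs.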

## Load Helper Functions

In [None]:
# Python imports
import datetime
import os
import twint
import nest_asyncio
nest_asyncio.apply()

# Helper Function
def scrape(options):
    """Run a twint search using the settings in the options dict."""
    config = twint.Config()
    config.Format = '{date}: {tweet}'
    config.Verified = options['verified']
    config.Count = options['count']
    config.Stats = options['stats']
    config.Hide_output = options['hide_output']
    if 'handle' in options and options['handle'] is not None:
        config.Username = options['handle']
    if 'queryterm' in options and options['queryterm'] is not None:
        config.Search = options['queryterm']
    if 'limit' in options and options['limit'] is not None:
        config.Limit = options['limit']
    if 'language' in options and options['language'] is not None:
        config.Language = options['language']
    if 'location' in options and options['location'] is not None:
        config.Location = options['location']
    if 'near' in options and options['near'] is not None:
        config.Near = options['near']
    if 'year' in options and options['year'] is not None:
        config.Year = options['year']
    if 'since' in options and options['since'] is not None:
        config.Since = options['since']
    if 'until' in options and options['until'] is not None:
        config.Until = options['until']
    # The output folder and the output format must be set together
    path = options.get('output_path')
    fmt = options.get('output_format')
    if path is not None and fmt is not None:
        config.Output = path
        if fmt == 'csv':
            config.Store_csv = True
        else:
            config.Store_json = True
    elif path is not None or fmt is not None:
        print('\n\nYou must set both the output_path and output_format options.')
    # The full property list has 23 entries, so a custom
    # selection is only needed when fewer are requested
    if 'output_properties' in options and len(options['output_properties']) != 23:
        config.Custom["tweet"] = options['output_properties']
        
    twint.run.Search(config)
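
The helper expects `output_path` and `output_format` to be set together. As a sketch of that rule in isolation, the function below checks the pair without importing twint. `check_output_options` is a hypothetical name, and raising `ValueError` instead of printing a message is an assumption made for illustration.

```python
def check_output_options(options):
    """Return (output_path, output_format) after checking they are consistent."""
    path = options.get('output_path')
    fmt = options.get('output_format')
    # Exactly one of the two being None means the pair is incomplete
    if (path is None) != (fmt is None):
        raise ValueError('Set both output_path and output_format, or neither.')
    if fmt not in (None, 'json', 'csv'):
        raise ValueError("output_format must be 'json' or 'csv'.")
    return path, fmt

# Hypothetical folder name, used only for the demonstration
print(check_output_options({'output_path': 'testscrape', 'output_format': 'json'}))
```

Running a check like this before building the twint configuration stops the process early instead of scraping without saving anything.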

## Begin Scraping

In [None]:
try:
    configs = [
        'handle', 'queryterm', 'limit', 'location', 'near',
        'language', 'year', 'since', 'until', 'verified',
        'count', 'stats', 'output_format', 'output_path',
        'output_properties', 'hide_output'
    ]
    # Collect the settings from the Configuration cell by name
    options = {item: globals()[item] for item in configs}

    # User feedback when the process starts
    start_time = datetime.datetime.now()
    print('\n\nProcess started at ' + start_time.strftime("%Y-%m-%d %H:%M:%S"))
    if hide_output:
        msg = """\n\n
    To check the progress, use your file explorer/finder to navigate
    to your output file and watch the file size update. You will
    receive a notification here when the scraping is finished. If
    you need to stop the process, stop the Jupyter notebook kernel.
    If you resume the process, it should pick up where you left off."""
        print(msg)
    #     print('\n\nYou may receive a warning "CRITICAL:root:twint.get:User:\'NoneType\' object is not subscriptable". This is a bug and you should be able to ignore it.\n\n')

    # Perform the scraping (runs whether or not output is hidden)
    scrape(options)

    # User feedback when the process ends
    time_elapsed = datetime.datetime.now() - start_time
    print('\n\nFinished!')
    print('\n\nTime elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))
except Exception as err:
    print('\n\nError:')
    print('Could not perform the scraping process. Please check your configuration.')
    print(err)

## View the Data

The cell below loads the downloaded data into a pandas DataFrame, where it can be manipulated. Note that very large files may take a while to load. To keep the output manageable, only the first ten rows are displayed; change the slice to show more.

In [None]:
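import os
import ujson as json
import pandas as pd

# The tweets are in "tweets.json" inside the output_path folder, one
# JSON object per line (this cell assumes output_format is 'json')
tweets_file = os.path.join(output_path, 'tweets.json')
with open(tweets_file, encoding='utf-8') as f:
    records = [json.loads(line) for line in f if line.strip()]
df = pd.DataFrame.from_records(records)
sf = df[0:10]
sf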
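
For very large files, one option is to parse only the first few lines instead of the whole file. The sketch below assumes the output file holds one JSON object per line, as the loading cell above implies. `read_first_records` and the sample file are hypothetical, and the standard `json` module stands in for `ujson`.

```python
import itertools
import json
import os
import tempfile

def read_first_records(path, n=10):
    """Parse at most n JSON-lines records from the file at path."""
    with open(path, encoding='utf-8') as f:
        # Skip blank lines, then stop reading after n records
        lines = (line for line in f if line.strip())
        return [json.loads(line) for line in itertools.islice(lines, n)]

# Demonstration with a small throwaway file (not real scraped data)
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, 'tweets.json')
    with open(sample, 'w', encoding='utf-8') as f:
        for i in range(25):
            f.write(json.dumps({'id': i, 'tweet': 'example ' + str(i)}) + '\n')
    first = read_first_records(sample, n=10)

print(len(first))  # prints 10
```

The resulting list can be passed to `pd.DataFrame.from_records` in the same way as the full list of records.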

## Modify the Data

There are lots of things you can do once you get the data into a pandas DataFrame. This is just one example, which moves the "date" and "tweet" columns to the left.

In [None]:
# List the columns in the order in which they should be displayed.
# Every name must be one of the output_properties saved above.
cols = ['date', 'tweet', 'hashtags', 'id', 'likes_count', 'link',
        'mentions', 'place', 'retweets_count', 'urls', 'username']
sf = sf[cols]
sf
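
Typing out every column name is error-prone when the set of saved properties changes. As an alternative sketch, the hypothetical helper `move_to_front` below moves one column to the left and keeps the remaining columns in their existing order. The demonstration DataFrame is made up for illustration.

```python
import pandas as pd

def move_to_front(frame, column):
    """Return a new DataFrame with `column` as the first column."""
    rest = [name for name in frame.columns if name != column]
    return frame[[column] + rest]

# Made-up single-row example, not real scraped data
demo = pd.DataFrame({'date': ['2018-01-01'], 'id': [1], 'tweet': ['hello']})
print(move_to_front(demo, 'tweet').columns.tolist())  # prints ['tweet', 'date', 'id']
```

With the data loaded above, `sf = move_to_front(sf, 'tweet')` would place the tweet text first without listing the other columns.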