# Twitter scraper for a user account
This notebook uses the [Tweepy](https://www.tweepy.org/) library to scrape as many tweets as possible (up to 3,200) for a specified user, and export the text (and optionally, metadata) to a CSV file.

## Prerequisites
- Before you can use this notebook, you need a Twitter account, which you then use to [apply for a developer account](https://developer.twitter.com/). The approval process usually takes a few days.
- Once you've been approved for a developer account, [visit the app dashboard](https://developer.twitter.com/en/apps).
- On the app dashboard, create a new app to use with this scraper. 
- After you've created your app, on the editing screen for the app there's a tab at the top for "Keys and tokens". Click on that tab. 
- On the page that appears, create an access token and access token secret. You'll need those values, along with the Consumer API keys, in the code block in Step 2.

## Step 0. Install Tweepy
If you haven't installed Tweepy already, run the code block below to install it. You should only have to run this the very first time you use this notebook. If you run it again later, you'll just get a message that Tweepy is already installed ("Requirement already satisfied").

In [None]:
import sys
!{sys.executable} -m pip install tweepy

## Step 1. Import libraries
Every time you use this notebook, start by running the code block below to import the necessary libraries: *tweepy* for scraping tweets, *pandas* for storing the tweets in a form where we can easily choose what to export, *re* for cleaning the data, *datetime* for grabbing the current date for the output filename, and *time* for pausing when Twitter's rate limit is reached.

In [None]:
import tweepy
import pandas as pd
import re
import datetime
import time

## Step 2. Twitter API authentication
Put the consumer key, consumer secret, access token, and access secret from the app that you created for use with this notebook between the quotation marks below, then run the code cell.

In [None]:
consumer_key = "put-your-consumer-key-here"
consumer_secret = "put-your-consumer-secret-here"
access_token = "put-your-access-token-here"
access_secret = "put-your-access-secret-here"
 
#Performs authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

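#Defines API using the authentication info
#wait_on_rate_limit=True makes Tweepy pause automatically when Twitter's rate limit is reached
api = tweepy.API(auth, wait_on_rate_limit=True)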
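
Keys typed directly into a notebook are easy to share by accident. A minimal sketch of one alternative, assuming you have saved the four values in environment variables beforehand. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical and are not part of Tweepy:

```python
import os

# Hypothetical environment variable names; use whatever names you prefer
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}

def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any are missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the quoted strings.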

## Step 3. Specify the username
Put the Twitter username of the user whose tweets you want to scrape between the quotes below, replacing `ADHOrg`. Don't use the @ sign. Run the code cell below.

In [None]:
TwitterHandle = 'ADHOrg'
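
If you copy usernames from elsewhere, they may arrive with an @ sign or stray spaces attached. A small sketch of tidying a handle before using it; `normalize_handle` is a hypothetical helper name, not something the notebook defines:

```python
def normalize_handle(raw):
    """Strip surrounding whitespace and any leading @ from a Twitter handle."""
    return raw.strip().lstrip("@")

print(normalize_handle("@ADHOrg"))  # ADHOrg
```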

## Step 4. Scrape tweets
Run the code cell below to start scraping. If you hit Twitter's rate limit before all the tweets have been collected, the code waits 15 minutes and then tries again.

This scraper excludes replies from what it captures. If you want to include replies, change `exclude_replies=True` to `exclude_replies=False` in the line that begins `for tweet in tweepy.Cursor`.

If you don't want the scraped tweets to print to the screen as they're gathered, you can put a `#` in front of the line `print(tempdata)`.

In [None]:
now = datetime.datetime.now()
endtime = now + datetime.timedelta(minutes = 45)

In [None]:
#List for collecting lists containing the tweet information
tweetholder = []

#Keeps retrying until one full pass over the timeline succeeds
#To give up at the 45-minute deadline set above instead, use:
#while datetime.datetime.now() < endtime:
while True:
    try:
        #Starts each attempt with an empty list so a retry doesn't store duplicate tweets
        tweetholder.clear()
        #Looks in the Twitter timeline for the user with the user name you specified above
        for tweet in tweepy.Cursor(api.user_timeline, screen_name=TwitterHandle, exclude_replies=True, tweet_mode="extended").items():
            #Collects the ID, username, timestamp, and full text of each tweet
            tempdata = [tweet.id, tweet.user.screen_name, str(tweet.created_at), tweet.full_text]
            tweetholder.append(tempdata)
            print(tempdata)
        #The whole timeline has been read, so stop looping
        break
    #In Tweepy 4 and later, this exception is named tweepy.TooManyRequests
    except tweepy.RateLimitError:
        #Waits 15 minutes for the rate limit to reset before trying again
        time.sleep(15 * 60)
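
A wait-and-retry loop like the one above can be tried out without contacting Twitter. A minimal sketch, assuming a stand-in exception `FakeRateLimit` and a stand-in data source `flaky_fetch` (both hypothetical) in place of the Tweepy calls, with the pause passed in as an argument so it can be shortened or skipped:

```python
import time

class FakeRateLimit(Exception):
    """Hypothetical stand-in for the rate limit exception raised by Tweepy."""

def collect_with_retry(fetch, wait_seconds=15 * 60, sleep=time.sleep):
    """Call fetch() until one full pass completes without a rate limit error."""
    while True:
        collected = []  # start fresh so a retry never stores duplicates
        try:
            for item in fetch():
                collected.append(item)
            return collected  # one full pass succeeded
        except FakeRateLimit:
            sleep(wait_seconds)  # pause, then try again from the start

# Hypothetical data source that fails on its first call and succeeds afterwards
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    yield "first tweet"
    if calls["count"] == 1:
        raise FakeRateLimit()
    yield "second tweet"

waits = []  # records the requested pauses instead of actually sleeping
result = collect_with_retry(flaky_fetch, wait_seconds=1, sleep=waits.append)
print(result)
```

Restarting from an empty list on each attempt is a design choice made for simplicity here: it avoids duplicates at the cost of fetching the earlier items again.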

## Step 5. Put the tweets in a dataframe
Run the code block below to turn the *tweetholder* list into a dataframe that's easier to export from and modify, and display the dataframe.

In [None]:
df3 = pd.DataFrame(tweetholder, columns=['tweetid','username','timestamp','tweet_text'])
df3

## Step 6. Clean data
If we want to use these tweets for text analysis, it may help to remove URLs, since each one will probably occur only once and tell us little about the text. Run the cell below to remove URLs from the *tweet_text* column of the dataframe.

In [None]:
df3['tweet_text'] = df3['tweet_text'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
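
The URL pattern can be checked on a single string with Python's *re* module before applying it to the whole column. A sketch using a made-up sample tweet; in this version the dot in `www\.` is escaped so that it matches only a literal period, and the last step (collapsing leftover double spaces) is an optional extra that the cell above does not do:

```python
import re

# Matches anything that starts with http or www. and runs up to the next whitespace
URL_PATTERN = re.compile(r'http\S+|www\.\S+', flags=re.IGNORECASE)

# Made-up sample text, for illustration only
sample = "Read our new post https://t.co/abc123 or visit WWW.example.org today"

cleaned = URL_PATTERN.sub('', sample)
# Removing a URL leaves two spaces behind; split and rejoin to collapse them
tidy = ' '.join(cleaned.split())
print(tidy)
```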

## Step 7. Export data
Run the code cells below to export the data in two formats: one with just the text of the tweets, and another with all the metadata.

In [None]:
#Sets up a variable for the current time, to append to the end of the filename
#That makes it easier to avoid a lot of files with duplicate names if you run the code multiple times
#The name "timestamp" is used so the variable doesn't overwrite the time module imported in Step 1
timestamp = datetime.datetime.now().strftime("%Y-%m-%d-%H%M%S")

### 7a. Export only tweet text

In [None]:
#Specifies an output filename based on the Twitter username captured, plus current time
outtweets = TwitterHandle + "-tweets-" + timestamp + ".csv"
#Exports ONLY TWEET TEXT to a CSV file
df3.to_csv(outtweets, mode='w', columns=['tweet_text'], index=False)

### 7b. Export tweets + metadata

In [None]:
#Specifies an output filename based on the Twitter username captured, plus current time
outtweetsmetadata = TwitterHandle + "-tweets-metadata-" + timestamp + ".csv"
#Exports all columns from the dataframe to a CSV file
df3.to_csv(outtweetsmetadata, mode='w', index=False)
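
To see what the output filenames look like without running an export, the naming scheme can be sketched as a small function. `build_filename` is a hypothetical helper, and the fixed date below is only an example value:

```python
import datetime

def build_filename(handle, label, when=None):
    """Return a name of the form handle-label-YYYY-MM-DD-HHMMSS.csv."""
    when = when or datetime.datetime.now()
    return handle + "-" + label + "-" + when.strftime("%Y-%m-%d-%H%M%S") + ".csv"

# A fixed moment makes the result reproducible
fixed = datetime.datetime(2019, 12, 19, 14, 30, 5)
print(build_filename("ADHOrg", "tweets", fixed))  # ADHOrg-tweets-2019-12-19-143005.csv
```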

## Suggested citation
Dombrowski, Quinn. *Twitter scraper for a user account*. Jupyter Notebook. https://github.com/quinnanya/twitter-user-scraper-notebook. December 19, 2019.