## Downloading Data from Twitter

Download tweets using Twitter's [REST API](https://dev.twitter.com/rest/public). 

Tweets can also be collected in real time using a [Stream Listener](https://dev.twitter.com/streaming/overview). In general, the REST API lets you gather a large sample quickly because it searches tweets that have already been posted, while the streaming API delivers tweets as they are posted, so you can keep a database up to date and/or act on new tweets immediately.

In either case, we can use the Python package tweepy to collect tweets from Twitter.

In [6]:
import tweepy # module for downloading tweets
from tweepy import OAuthHandler,AppAuthHandler, Cursor
import pandas as pd
import sqlite3
import os
import json
import sys
import numpy as np

# Enter the values for your app here.
# These are required to obtain authorization from Twitter.
# Never publish real keys; the strings below are placeholders.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"

access_token = "YOUR_ACCESS_TOKEN"
access_secret = "YOUR_ACCESS_SECRET"
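Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the four values are exported as environment variables; the variable names and the `load_credentials` helper are hypothetical, not part of tweepy:

```python
import os

# Hypothetical variable names; use whatever names you export in your shell.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)

def load_credentials(environ=os.environ):
    """Return a dict of credentials, raising if any variable is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

The returned values could then be passed to `OAuthHandler` and `AppAuthHandler` in place of string literals.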

To collect tweets with the REST API, we pass a parameter (*'max_id'*) to the search function. This is the 'id' of the most recent tweet we want to collect, i.e., the search returns only tweets whose 'id' is less than or equal to 'max_id'.

The easiest way to get a starting value is to post a tweet on your own timeline (it will carry the current timestamp), then use *OAuthHandler* to read your timeline and get the 'id' of that latest tweet. That 'id' then serves as the upper bound for searches made through *AppAuthHandler* (REST API).
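To make the role of 'max_id' concrete, here is a small offline sketch of paging backwards through tweet ids. It makes no network calls: `fake_search` is a hypothetical stand-in for a search request, and the ids are made up.

```python
def fake_search(all_ids, count, max_id=None):
    """Stand-in for a search call: up to `count` ids <= max_id, newest first."""
    eligible = sorted(
        (i for i in all_ids if max_id is None or i <= max_id), reverse=True
    )
    return eligible[:count]

def collect(all_ids, start_max_id, per_page, limit):
    """Page backwards from start_max_id until `limit` ids are collected."""
    collected = []
    max_id = start_max_id
    while len(collected) < limit:
        page = fake_search(all_ids, per_page, max_id=max_id)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        # The oldest id on this page, minus one, becomes the next upper bound,
        # so the next request cannot return the same tweet again.
        max_id = page[-1] - 1
    return collected[:limit]

ids = [101, 105, 110, 120, 130, 150]
print(collect(ids, start_max_id=130, per_page=2, limit=5))
# prints [130, 120, 110, 105, 101]
```

The id 150 is never returned because it is newer than the starting 'max_id'.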

In [7]:
# Getting the max_id first
#Getting Authorization first.
auth1 = OAuthHandler(consumer_key,consumer_secret)
auth1.set_access_token(access_token,access_secret)
api1 = tweepy.API(auth1)

#Getting most recent tweet from my timeline using my screen_name
user_recent_tweet = api1.user_timeline(screen_name='ZhenWang211', count=1)
# ResultSet.max_id is the smallest id in the result set minus one
print (user_recent_tweet.max_id)

#Getting Authorization for REST API
auth = AppAuthHandler(consumer_key,consumer_secret)
#Making wait_on_rate_limit TRUE, so that the code will wait and continue when limit is reached.
api = tweepy.API(auth,wait_on_rate_limit=True,wait_on_rate_limit_notify=True)

#Making sure Authorization was obtained.
if (not api):
    print ("Can't Authorize.")
    sys.exit(-1)    

852169544424706047


Collect the tweets into a SQL database using sqlite3. For this project, I stored six variables for each tweet:
1. Text
2. Screen name
3. Id
4. Retweeted
5. Retweet count
6. Number of followers

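A tweet is not saved if its `retweeted` flag is set. Note that this flag reports whether the authenticating user has retweeted the tweet, so tweets whose text begins with "RT @" can still be stored.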
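The `retweeted` field on a tweet reports whether the authenticating user has retweeted it, so it does not by itself identify retweets. A minimal sketch of an alternative check, assuming retweets carry a `retweeted_status` attribute (as they do in the v1.1 API payload). The helper name `is_retweet` is hypothetical, and the objects below are stand-ins rather than real API results:

```python
from types import SimpleNamespace

def is_retweet(tweet):
    """Return True if the status object looks like a retweet."""
    # Retweets carry the original tweet under `retweeted_status`;
    # the "RT @" text prefix is only a fallback heuristic.
    return hasattr(tweet, "retweeted_status") or tweet.text.startswith("RT @")

# Stand-in objects that mimic the attributes used above (made-up data).
original = SimpleNamespace(text="Best workplaces in healthcare", retweeted=False)
retweet = SimpleNamespace(
    text="RT @someone: hello",
    retweeted=False,
    retweeted_status=SimpleNamespace(text="hello"),
)

print(is_retweet(original), is_retweet(retweet))
# prints False True
```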

The following script collects tweets related to "healthcare". 

In [8]:
# Set up the parameters for the tweet collection
searchQuery = ["healthcare"] # sequence of interest to search for
maxTweets = 200 # maximum number of tweets to collect. Set to large number for actual data collection.
tweetsPerQuery = 100 # number of tweets to fetch per query
max_id = user_recent_tweet.max_id # id of my most recent tweet
sinceId = None # We want to collect all possible. So set to None.

print ("Downloading max {0} tweets.".format(maxTweets))

# Opening a connection to the sqlite3 database
if not os.path.exists('test.db'):
    conn = sqlite3.connect('test.db')
    c = conn.cursor()
    # creating a table if it does not exist.
    c.execute('''CREATE TABLE tweets (tweet TEXT, id INTEGER, screenName TEXT, isRT INTEGER, rtCount INTEGER, fCount INTEGER)''')
else:
    conn = sqlite3.connect('test.db')
    c = conn.cursor()

# Collecting tweets
tweetCount = 0
while tweetCount < maxTweets:
    try:
        if max_id <= 0:
            # No upper bound: search from the newest tweets
            if (not sinceId):
                new_tweets = api.search(q = searchQuery,count = tweetsPerQuery)
            else:
                new_tweets = api.search(q = searchQuery, count = tweetsPerQuery, since_id = sinceId)
        else:
            # Only fetch tweets older than the last one already seen
            if (not sinceId):
                new_tweets = api.search(q = searchQuery,count=tweetsPerQuery,max_id=str(max_id-1))
            else:
                new_tweets = api.search(q = searchQuery,count=tweetsPerQuery,max_id=str(max_id-1),since_id=sinceId)
            
        if not new_tweets:
            print ("No more tweets found.")
            break
        for tweet in new_tweets: #Saving tweets to SQL
            # The variables of interest (attributes of the tweepy Status object)
            tweet1 = tweet.text
            id1 = tweet.id
            screenName1 = tweet.user.screen_name
            rtCount1 = tweet.retweet_count
            # True only if the authenticating user has retweeted this tweet
            isRT1 = tweet.retweeted
            foll_count1 = tweet.user.followers_count
            #Saving only if the retweeted flag is not set
            if not isRT1:
                c.execute("INSERT INTO tweets VALUES (?,?,?,?,?,?)",(tweet1,id1,screenName1,isRT1,rtCount1,foll_count1))
                conn.commit()
            
            
        tweetCount += len(new_tweets)
        print ("Downloaded {0} tweets".format(tweetCount))
        max_id = new_tweets[-1].id
    except tweepy.TweepError as e:
        print ("ERROR :" + str(e))
        break
conn.close() #Close connection.
                


Downloading max 200 tweets.
Downloaded 100 tweets
Downloaded 200 tweets
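The table setup and inserts in the script above can be written more compactly. The sketch below is one possible variation, not the code used for this project. It assumes `CREATE TABLE IF NOT EXISTS` in place of the `os.path.exists` check, makes `id` a primary key (a schema change from the original) so that `INSERT OR IGNORE` skips tweets that are already stored, and uses an in-memory database with made-up rows. The helper names are hypothetical.

```python
import sqlite3

def open_tweet_db(path=":memory:"):
    """Open the database and create the tweets table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tweets (
               tweet TEXT, id INTEGER PRIMARY KEY, screenName TEXT,
               isRT INTEGER, rtCount INTEGER, fCount INTEGER)"""
    )
    return conn

def save_tweets(conn, rows):
    """Insert a batch of rows, ignoring ids that are already present."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO tweets VALUES (?,?,?,?,?,?)", rows
        )

conn = open_tweet_db()
rows = [("first", 1, "a", 0, 0, 10), ("second", 2, "b", 0, 3, 20)]
save_tweets(conn, rows)
save_tweets(conn, rows)  # the second call adds nothing
count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print(count)
# prints 2
```

Committing once per batch instead of once per tweet also reduces the number of disk writes when the database lives in a file.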


Check out the tweets we've collected. Connect to the sqlite3 database and load all the saved tweets into a pandas DataFrame.

In [9]:
conn = sqlite3.connect('test.db')
# read_sql_query works directly on the connection, so no cursor is needed
test_sql = pd.read_sql_query("SELECT * FROM tweets", conn)
conn.close()
test_sql.head(10)

| | tweet | id | screenName | isRT | rtCount | fCount |
|---|---|---|---|---|---|---|
| 0 | RT @mediclocator: Stay-SlimSecrets From Dietit... | 852169539181842432 | Frankizuzu | 0 | 1 | 107 |
| 1 | Best workplaces in healthcare includes @Americ... | 852169521716764672 | KrisleighH | 0 | 0 | 3065 |
| 2 | Want to work in #Augusta, GA? View our latest ... | 852169521540485121 | tmj_ga_nursing | 0 | 0 | 451 |
| 3 | RT @mjobarber: @XXXTENTACIMAN @AnaShadows @Nia... | 852169521100193793 | AnaShadows | 0 | 1 | 1306 |
| 4 | RT @mediclocator: How To Eat More, But Smaller... | 852169516691881984 | Frankizuzu | 0 | 1 | 107 |
| 5 | See our latest #Louisville, KY #job and click ... | 852169514317824002 | tmj_ky_health | 0 | 0 | 387 |
| 6 | RT @LimaCharlieNews: OpEd: Democrats need to a... | 852169514108280833 | erintouponso | 0 | 37 | 194 |
| 7 | Interested in a #job in #GoodfellowAirForceBas... | 852169511104987136 | tmj_tx_health | 0 | 0 | 490 |
| 8 | The #Startups using #apps &amp; #Phones to sol... | 852169483867357184 | DigitalHealthEZ | 0 | 0 | 1370 |
| 9 | OK, not the most "fun" topic, but unfortunatel... | 852169475642187777 | Biff_Bruise | 0 | 0 | 2579 |
