# Scraping Tweets with Tweepy in Python

**This is a step-by-step guide to scraping tweets with the Python library Tweepy.**

### **Prerequisites: Setting up a Twitter Developer Account**

Before you start using Tweepy, you need a Twitter Developer Account in order to call Twitter’s APIs. Just follow the application instructions; after some time (only a few hours for me), you will be granted access.

To get the required data, we'll collect the tweets and replies from the official handles of the top banks, telcos, and utility services in Kenya using Tweepy and Twitter's API.

The resulting dataset contains tweets and responses from these companies. The data wrangling process that follows will:
- Pair each customer inquiry with the company's response in a single row.
- Convert the datetime columns to the datetime data type.
- Calculate the response time in minutes.
- Filter out any customer inquiry that took longer than 60 minutes to answer; we only work with requests answered within 60 minutes.
- Create time attributes and a tweet word count.

**Importing the required libraries**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

import tweepy

import sys
import re
import json
from pandas import json_normalize  # pandas >= 1.0; older versions used pandas.io.json

%matplotlib inline
plt.style.use('bmh')
colors = ['#D55E00', '#009E73', '#0072B2', '#348ABD', '#A60628', 
          '#7A68A6', '#467821', '#CC79A7', '#56B4E9', '#F0E442']

In [None]:
# Widen pandas display limits so full tweets are visible
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)

### Authenticating Twitter API

Tweepy really does make OAuth mostly painless. We'll need to get our credentials from our Twitter Developer Account.

In [None]:
# Variables that contain the credentials to access the Twitter API
ACCESS_TOKEN = ''
ACCESS_SECRET = ''
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
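
Hardcoding credentials in a notebook makes it easy to publish them by accident. As a minimal sketch, the four values could be read from environment variables instead. The variable names and the `load_twitter_credentials` helper below are hypothetical, not part of Tweepy.

```python
import os

# Hypothetical environment variable names; use whatever naming you prefer.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_VARS}
```

The returned values can then be assigned to `CONSUMER_KEY`, `CONSUMER_SECRET`, `ACCESS_TOKEN` and `ACCESS_SECRET` before connecting.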

In [None]:
# Setup access to API
def connect_to_twitter_OAuth():
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

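    # Note: wait_on_rate_limit_notify exists only in Tweepy 3.x (it was removed in 4.0); drop it on newer versions
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)  # add proxy='172.30.1.251:6969' if needed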
    return api


# Create API object
api = connect_to_twitter_OAuth()

And now, just to make sure it worked, let's print the most recent tweets from a user.

In [None]:
# tweets from a specific user
my_tweets = api.user_timeline(screen_name='realDonaldTrump', count=10)

for tweet in my_tweets:
    print(tweet.created_at, tweet.text)

**We can use a function to extract the attributes we’re interested in and create a dataframe. Using a loop, we can extract tweets from multiple handles (banks in this case). First, we page through a single handle's timeline with a cursor.**

In [None]:
for page in tweepy.Cursor(api.user_timeline, screen_name='KCBGroup', count=3).pages(2):
    for item in page:
        #print('https://twitter.com/' + item.in_reply_to_screen_name + '/status/' + item.in_reply_to_status_id_str)
        print(item.user.name)
        print(item.text)
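
As a rough sketch of the attribute-extraction function mentioned above, each tweet object can be flattened into a plain dict. The helper names `tweet_to_row` and `tweets_to_rows` are hypothetical, and the `SimpleNamespace` object only stands in for a real tweet so the example runs without API access.

```python
from types import SimpleNamespace

# Attribute names follow the tweet fields used elsewhere in this guide.
FIELDS = ("id", "created_at", "text", "in_reply_to_status_id_str",
          "in_reply_to_screen_name", "retweet_count", "favorite_count")


def tweet_to_row(tweet):
    """Flatten one tweet object into a plain dict (missing attributes become None)."""
    row = {field: getattr(tweet, field, None) for field in FIELDS}
    row["user"] = getattr(getattr(tweet, "user", None), "name", None)
    return row


def tweets_to_rows(tweets):
    """Convert an iterable of tweet objects into a list of dicts."""
    return [tweet_to_row(tweet) for tweet in tweets]


# Stand-in object for demonstration only; real tweets come from api.user_timeline(...)
sample = SimpleNamespace(id=1, created_at="2020-11-26 10:00:00", text="Hello",
                         in_reply_to_status_id_str=None, in_reply_to_screen_name=None,
                         retweet_count=0, favorite_count=2,
                         user=SimpleNamespace(name="Example Bank"))
rows = tweets_to_rows([sample])
```

Passing `rows` to `pd.DataFrame(rows)` would then give one dataframe row per tweet.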

### Batch Scraping

A basic, free developer account allows only a limited number of API calls (about 900 every 15 minutes before access is denied). To handle this, we pass the ***wait_on_rate_limit=True, wait_on_rate_limit_notify=True*** parameters when creating the API object, so Tweepy waits automatically until the limit resets.

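In the example below, we request 20 tweets per page and loop over up to 50 pages per handle, which is at most 1,000 tweets per handle. Note: there are pagination limits; the user timeline endpoint only returns a user's most recent 3,200 tweets.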
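
To check that a scrape fits within the rate limit before running it, the arithmetic can be sketched as below. The `scrape_budget` helper is hypothetical; the 900-call window is the approximate figure quoted above, and the 3,200-tweet cap is the documented limit of the user timeline endpoint.

```python
def scrape_budget(n_handles, tweets_per_page, pages_per_handle,
                  calls_per_window=900, timeline_cap=3200):
    """Estimate API calls and tweets for a paginated user_timeline scrape."""
    calls = n_handles * pages_per_handle  # one request per page
    # Never more than the timeline cap, whatever the page settings
    tweets_per_handle = min(tweets_per_page * pages_per_handle, timeline_cap)
    return {
        "calls": calls,
        "max_tweets_per_handle": tweets_per_handle,
        "fits_in_one_window": calls <= calls_per_window,
    }


# 12 handles, 20 tweets per page, 50 pages per handle
budget = scrape_budget(n_handles=12, tweets_per_page=20, pages_per_handle=50)
print(budget)
# {'calls': 600, 'max_tweets_per_handle': 1000, 'fits_in_one_window': True}
```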

In [None]:
handles = ['KCBGroup','AbsaKenya','coopbankenya','NCBABankKenya','KeEquityBank','StanChartKE','Safaricom_Care','AIRTEL_KE',
           'Zuku_WeCare','KenyaPower_Care','DStv_Kenya','HudumaKenya']

# Open the output file once, instead of once per tweet, so it is closed properly
with open("data/banks_tweeter_data_11262020.txt", "a", encoding='utf-8') as out_file:
    for handle in handles:
        pages = tweepy.Cursor(api.user_timeline, screen_name=handle, count=20).pages(50)

        for page in pages:
            for tweet in page:
                # Collapse tabs and newlines in the text so each tweet stays on one tab-separated line
                text = ' '.join(tweet.text.split())
                print(tweet.user.name, tweet.created_at, tweet.id, tweet.in_reply_to_status_id_str,
                      tweet.in_reply_to_screen_name, text, tweet.retweet_count, tweet.favorite_count,
                      sep='\t', end='\n', file=out_file)
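
Tweet text can itself contain tabs and line breaks, which break a hand-built tab-separated file. One alternative, sketched here with Python's standard `csv` module, is to let a writer quote such fields. The `write_tweet_rows` helper and the sample row are made up for illustration.

```python
import csv
import io


def write_tweet_rows(rows, out_file):
    """Write rows as tab-separated values, quoting fields that need it."""
    writer = csv.writer(out_file, delimiter="\t", quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# In-memory demonstration; with a real file use open(path, "a", newline="", encoding="utf-8")
buffer = io.StringIO()
write_tweet_rows([("Example Bank", "Line one\nline two\twith a tab", 3)], buffer)

# Reading it back yields the original fields, embedded newline and tab included
buffer.seek(0)
parsed = list(csv.reader(buffer, delimiter="\t"))
```

A file written this way should be read back without `quoting=3`, since that option tells pandas to ignore quote characters.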


In [None]:
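col_names=['User', 'Created_at', 'ID', 'Reply_to_status', 'Reply_to_user', 'Tweet', 'RTs','Likes'] 
# Read tweet IDs as strings so they match the string IDs returned by the API when merging later.
# quoting=3 is csv.QUOTE_NONE. On pandas >= 1.3, use on_bad_lines='skip' instead of error_bad_lines=False.
df_t1 = pd.read_csv(r'data/banks_tweeter_data_11262020.txt', 
                    sep='\t', names=col_names, header=None, quoting=3, error_bad_lines=False, encoding='utf-8',
                    dtype={'ID': str, 'Reply_to_status': str})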

In [None]:
df_t1.head()

**Since we are interested in the response time to a tweet, we'll write another function to get the attributes of the original tweet.**

In [None]:
# IDs of the tweets the companies replied to (skip non-replies and repeated IDs)
tweet_ids = df_t1.Reply_to_status.dropna().unique().tolist()

# Look up each original tweet (the customer's tweet to the company)
def extract_tweet2_attributes(tweet_ids):
    rows = []

    for _id in tweet_ids:
        try:
            tweet2 = api.get_status(_id)
        except Exception:
            # Deleted, protected or invalid tweet: skip it rather than reuse the previous result
            continue
        #print(tweet2.id_str, tweet2.created_at, tweet2.user.screen_name, tweet2.text)

        rows.append({'O_ID': tweet2.id_str, 'O_Created_at': tweet2.created_at,
                     'O_User': tweet2.user.screen_name, 'O_Tweet': tweet2.text})

    # Build the dataframe once instead of appending row by row
    return pd.DataFrame(rows, columns=['O_ID', 'O_Created_at', 'O_User', 'O_Tweet'])

df_t2 = extract_tweet2_attributes(tweet_ids)
#print(df_t2)

#Export collected data to csv
#df_t2.to_csv('safaricom_tweets.csv',index=False)

In [None]:
df_t2

**We merge the two dataframes. The unified dataframe contains the time a user tweeted at a company and the time the company responded. This lets us calculate the response time for each company.**

In [None]:
df = pd.merge(df_t1, df_t2, how='outer', left_on=['Reply_to_status'], right_on = ['O_ID'])
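
To illustrate what the merge does, here is a pure-Python sketch that pairs each reply with the tweet it answers. It is closer to `how='left'` than to the outer merge above, and it compares IDs as strings, since a mismatch between numeric and string IDs would prevent any match. The helper name and the sample rows are made up for illustration.

```python
def pair_replies_with_originals(replies, originals):
    """Attach the original tweet to each reply, matching on the tweet ID as a string."""
    originals_by_id = {str(o["O_ID"]): o for o in originals}
    paired = []
    for reply in replies:
        key = reply.get("Reply_to_status")
        original = originals_by_id.get(str(key)) if key is not None else None
        # Keep unmatched replies too; their original-tweet fields stay None
        paired.append({**reply,
                       "O_User": original["O_User"] if original else None,
                       "O_Tweet": original["O_Tweet"] if original else None})
    return paired


# Made-up rows for illustration only
replies = [{"User": "Example Bank", "Reply_to_status": 101, "Tweet": "Sorry about that."},
           {"User": "Example Bank", "Reply_to_status": None, "Tweet": "Good morning!"}]
originals = [{"O_ID": "101", "O_User": "customer_1", "O_Tweet": "My card is blocked"}]
paired = pair_replies_with_originals(replies, originals)
```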

In [None]:
df.head()

**Calculating time between outbound response and inbound message**

In [None]:
# utc=True makes both columns timezone-aware UTC so they can be subtracted from each other
df[["Created_at", "O_Created_at"]] = df[["Created_at", "O_Created_at"]].apply(pd.to_datetime, errors = 'coerce', utc = True)

In [None]:
df.dtypes

In [None]:
df['Response_time'] = df['Created_at'] - df['O_Created_at']

In [None]:
#Convert to minutes, we only need tweets responded to in max 60 minutes
df['Response_time'] = df['Response_time'].dt.total_seconds() / 60
# .copy() avoids a SettingWithCopyWarning when new columns are added to the filtered frame
df = df.loc[df['Response_time'] <= 60].copy()
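
As a small worked example of the same calculation outside pandas, the response time is the difference between two timestamps expressed in minutes. The `response_minutes` helper and the timestamps are made up for illustration.

```python
from datetime import datetime, timezone


def response_minutes(inquiry_time, reply_time):
    """Minutes elapsed between a customer's tweet and the company's reply."""
    return (reply_time - inquiry_time).total_seconds() / 60


# Made-up timestamps for illustration
inquiry = datetime(2020, 11, 26, 9, 0, tzinfo=timezone.utc)
fast_reply = datetime(2020, 11, 26, 9, 12, 30, tzinfo=timezone.utc)
slow_reply = datetime(2020, 11, 26, 11, 0, tzinfo=timezone.utc)

times = [response_minutes(inquiry, r) for r in (fast_reply, slow_reply)]
within_hour = [t for t in times if t <= 60]
print(times)        # [12.5, 120.0]
print(within_hour)  # [12.5]
```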

In [None]:
# Time attributes
df['Created_at_dayofweek'] = df['Created_at'].dt.dayofweek  # Monday=0 ... Sunday=6
df['Created_at_day_of_week'] = df['Created_at'].dt.day_name()
df['Created_at_day'] = df['Created_at'].dt.day
df['Created_at_is_weekend'] = df['Created_at_dayofweek'].isin([5,6]).astype(int)
# Word count of the customer's original tweet
df['Tweet_word_count'] = df.O_Tweet.apply(lambda x: len(str(x).split()))
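
For a single tweet, the same attributes can be derived with the standard library, which makes it easy to check what each column holds. The `time_attributes` helper and the sample values are hypothetical.

```python
from datetime import datetime


def time_attributes(created_at, text):
    """Derive day-of-week, weekend flag and word count for one tweet."""
    day_index = created_at.weekday()  # Monday=0 ... Sunday=6, same as pandas dayofweek
    return {
        "dayofweek": day_index,
        "day_name": created_at.strftime("%A"),
        "day": created_at.day,
        "is_weekend": int(day_index in (5, 6)),
        "word_count": len(str(text).split()),
    }


attrs = time_attributes(datetime(2020, 11, 28, 14, 5), "My card is blocked, please help")
print(attrs)
# {'dayofweek': 5, 'day_name': 'Saturday', 'day': 28, 'is_weekend': 1, 'word_count': 6}
```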

In [None]:
#df[df.duplicated(keep=False)]

In [None]:
#cleanup any duplicates from our loop
df.drop_duplicates(keep='first', inplace=True)

In [None]:
df.head()

**Export our data as a CSV file**

In [None]:
df.to_csv('data/banks_tweets_replies_11262020.csv',index=False)

The Jupyter notebook is available on GitHub; the link is on my blog, thesiliconsavannah.com.