# Twitter extraction


## Setup

In [11]:
PRICE_FOLDER    = "data/price/"
TWITTER_FOLDER  = "data/twitter/"
tweets_raw_file = './data/twitter/bitcoin_tweets_raw.csv'
tweets_clean_file = './data/twitter/bitcoin_tweets_clean.csv'

## 1. Retrieve the tweets from Twitter API

### 1.1 Import Tweepy


In [12]:
import sys
import subprocess
import pkg_resources

# Install any required package that is missing before importing tweepy
required = {'numpy', 'pandas', 'tweepy'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    # Run pip as a subprocess of the current interpreter
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', *missing])

import tweepy

### 1.2 OAuth 1.0a Authentication (*user* context)


In [14]:
consumer_key= # Removed
consumer_secret= # Removed
access_token= # Removed
access_token_secret= # Removed

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token,access_token_secret)
api = tweepy.API(auth)
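
Because the credentials are removed from the notebook, one option is to read them from environment variables instead of hard-coding them. The sketch below is an assumption, not part of the original notebook: the four variable names and the `load_credentials` helper are hypothetical and can be renamed freely.

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials as a tuple, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return tuple(env[name] for name in CREDENTIAL_VARS)
```

The returned tuple can be unpacked into `consumer_key`, `consumer_secret`, `access_token` and `access_token_secret` before the handler is created.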

### 1.3 Query the twitter API
Here we query the Twitter API to get the latest tweets about Bitcoin. We then transform the response and store only the useful data in a pandas DataFrame.

The following fields are retrieved from the response:

- **id** (int) : unique identifier of the tweet
- **text** (string) : UTF-8 textual content of the tweet (in extended mode the full text, up to 280 characters, is available as `full_text`)
- user
  - **name** (string) : display name of the user
  - **followers_count** (int) : Number of followers the user has
- **retweet_count** (int) : Number of times the tweet has been retweeted
- **favorite_count** (int) : Number of likes
- **created_at** (datetime) : creation date and time of the tweet

We also wanted to retrieve the following field, but it is not available through the standard free API; a Premium or Enterprise tier is needed (https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html):

- reply_count (int) : Number of times the Tweet has been replied to
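
The retrieved fields listed above map onto attributes of each status object returned by the search. As a minimal sketch, the hypothetical helper `status_to_row` below flattens one status into a dictionary using the same column names as the DataFrame built later in this notebook. The `example` object is a stand-in with made-up values, not real API output.

```python
from datetime import datetime
from types import SimpleNamespace


def status_to_row(status):
    """Flatten one status object into a dict of the fields we keep."""
    return {
        "ID": status.id,
        # Extended mode exposes the text as full_text; fall back to text otherwise.
        "Tweets": getattr(status, "full_text", None) or getattr(status, "text", ""),
        "Likes": status.favorite_count,
        "Retweets": status.retweet_count,
        "Followers": status.user.followers_count,
        "CreatedAt": status.created_at,
    }


# Stand-in object with made-up values, shaped like a Tweepy status.
example = SimpleNamespace(
    id=1,
    full_text="Bitcoin is up today",
    favorite_count=3,
    retweet_count=1,
    user=SimpleNamespace(name="alice", followers_count=42),
    created_at=datetime(2021, 9, 9, 12, 0, 0),
)
row = status_to_row(example)
```

Building a list of such rows and passing it to `pd.DataFrame` would be an alternative to keeping one list per column.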

The pandas package must be installed using *pip install pandas* from the command line.

We used the search operators explained here (https://lifehacker.com/search-twitter-more-efficiently-with-these-search-opera-1598165519) to search not only by hashtag but also for tweets that contain the currency name or a hashtag with the currency's abbreviation.
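
As an illustration of combining these operators, the sketch below builds a query string from a currency name and its abbreviation. The `build_query` helper is hypothetical, and the way `OR` and `-filter:retweets` combine should be checked against the Twitter search documentation before relying on it. The query actually used in this notebook is the simpler `"Bitcoin -filter:retweets"`.

```python
def build_query(name, abbreviation, exclude_retweets=True):
    """Join the currency name and its hashtags with OR into one search query."""
    terms = [name, "#" + name, "#" + abbreviation]
    query = " OR ".join(terms)
    if exclude_retweets:
        # Same operator as in the notebook's own query string.
        query += " -filter:retweets"
    return query


query = build_query("bitcoin", "btc")
```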

In [15]:
from time import sleep
import json
import pandas as pd
import io
from tqdm import tqdm

In [16]:
tweets_raw_file

'./data/twitter/bitcoin_tweets_raw.csv'

In [17]:
from datetime import datetime, timedelta

# Current time and one hour earlier, formatted as strings
now = datetime.now()
prev_hr = now - timedelta(hours=1)
now = now.strftime("%Y-%m-%d, %H:%M:%S")
prev_hr = prev_hr.strftime("%Y-%m-%d, %H:%M:%S")

# Fixed reference date used for testing
prev_test = datetime(2021, 9, 9, 0, 0, 0)

In [29]:
import os

number_of_tweets = 100
text_query = "Bitcoin -filter:retweets"
language = "en"
tweets = []
likes = []
retweets = []
followers = []
time = []
ID = []

# Write the CSV header only if the file does not exist yet
file_exists = os.path.isfile(tweets_raw_file)

# Collect the most recent tweets matching the query
# (api.search is the Tweepy 3.x name; Tweepy 4 renamed it to api.search_tweets)
for i in tweepy.Cursor(api.search, q=text_query, result_type="recent", lang=language, tweet_mode="extended").items(number_of_tweets):
    ID.append(i.id)
    tweets.append(i.full_text)
    likes.append(i.favorite_count)
    retweets.append(i.retweet_count)
    followers.append(i.user.followers_count)
    time.append(i.created_at)

print('Retrieved tweets, waiting for 1 hour until next query')

# Append this batch to the raw CSV file
df = pd.DataFrame({'ID':ID, 'Tweets':tweets, 'Likes':likes, 'Retweets':retweets, 'Followers':followers, 'CreatedAt':time})
df.to_csv(tweets_raw_file, mode='a', encoding='utf-8', index=True, header=not file_exists)

Retrieved tweets, waiting for 1 hour until next query
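
When a CSV file is appended to across several runs, the header row should be written only once. The sketch below shows one way to do that. It uses only the standard library `csv` module in place of `DataFrame.to_csv` so that it runs without pandas. The `append_rows` helper, the demo path and the row values are all made up for illustration.

```python
import csv
import os
import tempfile

FIELDS = ["ID", "Tweets", "Likes", "Retweets", "Followers", "CreatedAt"]


def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only for a new or empty file."""
    write_header = not os.path.isfile(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary directory with made-up rows: two appends, one header.
demo_path = os.path.join(tempfile.mkdtemp(), "tweets_demo.csv")
append_rows(demo_path, [{"ID": 1, "Tweets": "first", "Likes": 0, "Retweets": 0,
                         "Followers": 10, "CreatedAt": "2021-09-09 00:00:00"}])
append_rows(demo_path, [{"ID": 2, "Tweets": "second", "Likes": 5, "Retweets": 2,
                         "Followers": 20, "CreatedAt": "2021-09-09 01:00:00"}])

with open(demo_path, newline="", encoding="utf-8") as f:
    lines = f.read().splitlines()
```

With pandas, the same effect can be had by passing `header=not os.path.isfile(path)` to `to_csv` together with `mode='a'`.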


In [10]:
!jupyter nbconvert --to script --no-prompt 01_TwitterExtraction.ipynb

[NbConvertApp] Converting notebook 01_TwitterExtraction.ipynb to script
[NbConvertApp] Writing 5015 bytes to 01_TwitterExtraction.py
