# Twitter extraction

This notebook retrieves tweets, cleans them and computes a sentiment score for each one, in order to look for a correlation between cryptocurrencies and the sentiment of tweets about them. It performs the following steps:

- Retrieve tweets with Twython (a Python wrapper for the Twitter API)
- Extract the wanted data (tweet text, number of followers, number of likes, etc.)
- Clean the textual data (remove unnecessary elements such as media, website links, usernames, ...)
- Compute for each tweet a sentiment score with VADER (named compound) and a score that combines the popularity of the tweet with its compound

This notebook is written using Python 3.6.

## Setup

In [25]:
# Define the currency
#CURRENCY = "zilliqa"
#CURRENCY_SYMBOL = "ZIL"
#CURRENCY = "nexo"
#CURRENCY_SYMBOL = "NEXO"
CURRENCY = "bitcoin"
CURRENCY_SYMBOL = "BTC"

## personal config
TWEETS_FOLDER    = f"data/crypto/{CURRENCY_SYMBOL}" # Relative path to historical data
SEP_CHAR         = '~' # character separating the "from" and "to" dates in the filename
ENVS             = ['CRYPTO', 'LINE_COUNT', 'MOST_RECENT_FILE', 'MOST_RECENT_ID'] # Stored in var.csv
MAX_ROW_PER_FILE = 20000 # Each file storing data has a maximum amount of rows

# These paths are used by the extraction and preprocessing cells below
tweets_raw_file = f'data/twitter/{CURRENCY_SYMBOL}/{CURRENCY}_tweets_raw.csv'
tweets_clean_file = f'data/twitter/{CURRENCY_SYMBOL}/{CURRENCY}_tweets_clean.csv'
query = f'#{CURRENCY} OR #{CURRENCY_SYMBOL}' # TODO: put back " OR {CURRENCY} OR ${CURRENCY} OR ${CURRENCY_SYMBOL}"

## 1. Retrieve the tweets from Twitter API

### 1.1 Import Twython
We use the *twython* package as our Python interface to the Twitter API: https://twython.readthedocs.io/en/latest/usage/starting_out.html

The twython package must be installed using *pip install twython* from the command line.

In [8]:
from twython import Twython

### 1.2 OAuth2 Authentication (*app* authentication)
Here we use *OAuth2* with the Twython library to authenticate against the Twitter API.
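
Hard-coding the application key and secret in a notebook makes them easy to leak. As a minimal sketch, assuming the credentials are exported as environment variables, they could be read as follows before being passed to `Twython`. The variable names `TWITTER_APP_KEY` and `TWITTER_APP_SECRET` and the helper `load_credentials` are hypothetical and not part of Twython.

```python
import os


def load_credentials(env=None):
    """Return (app_key, app_secret) read from environment variables.

    The variable names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    names = ("TWITTER_APP_KEY", "TWITTER_APP_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]


# Example with an explicit mapping instead of the real environment
key, secret = load_credentials(
    {"TWITTER_APP_KEY": "my-key", "TWITTER_APP_SECRET": "my-secret"}
)
print(key, secret)
```

The returned pair could then replace the literal `APP_KEY` and `APP_SECRET` values.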

OAuth1 gives you *user* access to the API, whereas OAuth2 gives the *app* access. For academic use the rate limits are generally better for *OAuth2* (app) authentication, with a few exceptions. The API limits for user and app authentication for the various parts of the Twitter API are shown in this chart: https://dev.twitter.com/rest/public/rate-limits

Running the code block below shows that we now have a rate limit of 450 API calls. This means we can make 450 different calls to the API within the current 15-minute window. With the search API we can access 100 tweets per call. This means that, if we were downloading tweets with a specific hashtag, such as *#arnova16*, we could download 450 $\times$ 100 or 45,000 tweets per window. This is much better than the 18,000 tweets we can access using the OAuth1 or user authentication.

In [9]:
APP_KEY = 'mPQKoRwd2Pb9qpQyQmyG5s8KR'
APP_SECRET = 'HLvIhusvfzDLKaRXY8CnZGP143kp3E3f2KqQBIEMfVL5mOxZjq'
twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)
ACCESS_TOKEN = twitter.obtain_access_token()
twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)
twitter.get_application_rate_limit_status()['resources']['search']

{'/search/tweets': {'limit': 450, 'remaining': 450, 'reset': 1527859898}}
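
The figures above can be checked with a little arithmetic. The sketch below assumes 100 tweets per call, as stated earlier, and that the 18,000 figure for user authentication corresponds to 180 calls per window. It also shows how the `reset` value, a Unix timestamp, could be turned into a waiting time. The helper names are made up for this illustration.

```python
import time

TWEETS_PER_CALL = 100  # maximum number of tweets returned by one search call


def tweets_per_window(calls_per_window, tweets_per_call=TWEETS_PER_CALL):
    """Upper bound on the tweets retrievable in one 15-minute window."""
    return calls_per_window * tweets_per_call


def seconds_until_reset(reset_epoch, now=None):
    """Seconds to wait until the rate-limit window resets (never negative)."""
    now = time.time() if now is None else now
    return max(0, reset_epoch - now)


# Values copied from the rate-limit status shown above
status = {'limit': 450, 'remaining': 450, 'reset': 1527859898}

app_auth = tweets_per_window(status['limit'])  # 450 * 100
user_auth = tweets_per_window(180)             # assumed 180 calls per window
print(app_auth, user_auth)
```

Waiting for `seconds_until_reset(status['reset'])` is one possible alternative to sleeping for a fixed duration between windows.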

### 1.3 Query the twitter API
Here we query the Twitter API to get the latest tweets about bitcoin. We then transform the response to keep only the useful data in a pandas DataFrame.

The following fields are retrieved from the response:

- **id** (int) : unique identifier of the tweet
- **text** (string) : UTF-8 textual content of the tweet, max 140 chars
- user
  - **name** (string) : display name of the user
  - **followers_count** (int) : Number of followers the user has
- **retweet_count** (int) : Number of times the tweet has been retweeted
- **favorite_count** (int) : Number of likes
- **created_at** (datetime) : creation date and time of the tweet

We also wanted to retrieve the following field, but it is not available with the standard free API; an Enterprise or Premium plan is needed (https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html):

- reply_count (int) : Number of times the Tweet has been replied to
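
To make the field list concrete, here is a minimal sketch of how one status from the response could be flattened into a row. The helper `status_to_row` and the sample values are made up for illustration. Only the keys listed above are assumed to be present, and the date format is inferred from the `created_at` values the API returns (for example `Tue May 29 12:26:11 +0000 2018`).

```python
from datetime import datetime

CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def status_to_row(status):
    """Keep only the fields of interest from one tweet (status) dict."""
    return {
        "ID": status["id"],
        # Drop line breaks so that each tweet fits on one CSV line
        "Text": status["text"].replace("\n", "").replace("\r", ""),
        "UserName": status["user"]["name"],
        "UserFollowerCount": status["user"]["followers_count"],
        "RetweetCount": status["retweet_count"],
        "Likes": status["favorite_count"],
        "CreatedAt": datetime.strptime(status["created_at"], CREATED_AT_FORMAT),
    }


# Made-up status containing only the fields used above
sample_status = {
    "id": 1,
    "text": "Bitcoin is up\ntoday",
    "user": {"name": "Example User", "followers_count": 42},
    "retweet_count": 3,
    "favorite_count": 5,
    "created_at": "Tue May 29 12:26:11 +0000 2018",
}
row = status_to_row(sample_status)
print(row)
```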

The pandas package must be installed using *pip install pandas* from the command line.

We used the search operators explained here (https://lifehacker.com/search-twitter-more-efficiently-with-these-search-opera-1598165519) to search not only by hashtag but also for tweets that contain the currency name or the hashtag of the currency's abbreviation.
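
As an illustration of how such a query can be assembled with the `OR` operator, here is a small sketch. The function `build_query` and its flags are hypothetical; the terms mirror the hashtag, plain-name and `$` variants mentioned in the setup cell.

```python
def build_query(currency, symbol, include_name=False, include_cashtags=False):
    """Join the search terms for one currency with the OR operator."""
    terms = [f"#{currency}", f"#{symbol}"]  # hashtags for the name and the symbol
    if include_name:
        terms.append(currency)              # plain currency name
    if include_cashtags:
        terms.extend([f"${currency}", f"${symbol}"])  # $-prefixed variants
    return " OR ".join(terms)


hashtags_only = build_query("bitcoin", "BTC")
full_query = build_query("bitcoin", "BTC", include_name=True, include_cashtags=True)
print(hashtags_only)
print(full_query)
```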

In [43]:
from time import sleep
import json
import pandas as pd
import io
from tqdm import tqdm

In [42]:
NUMBER_OF_QUERIES = 450  # API calls allowed per 15-minute window
data = {"statuses": []}
next_id = "998511795781361665"  # Resume from this tweet ID; use None to start from the most recent tweets
with open(tweets_raw_file, "a+", encoding='utf-8') as f:
    if not next_id:
        # Fresh extraction: write the CSV header first
        f.write("ID,Text,UserName,UserFollowerCount,RetweetCount,Likes,CreatedAt\n")
    while True:
        twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)
        last_size = 0
        for i in tqdm(range(NUMBER_OF_QUERIES)):
            if not next_id:
                data = twitter.search(q=query, lang='en', result_type='recent', count=100) # Use since_id for tweets after id
            else:
                # max_id is inclusive, so each page starts with the last tweet of the previous page
                data["statuses"].extend(twitter.search(q=query, lang='en', result_type='mixed', count=100, max_id=next_id)["statuses"])
            if len(data["statuses"]) > 1:
                next_id = data["statuses"][-1]['id']
            # Only the repeated tweet came back: nothing older is left
            if last_size + 1 == len(data["statuses"]):
                break
            else:
                last_size = len(data["statuses"])

        print('Retrieved {0}, waiting for 15 minutes until next queries'.format(len(data["statuses"])))
        # Keep only the fields of interest and strip line breaks from the text
        rows = [[s["id"], s["text"].replace('\n','').replace('\r',''), s["user"]["name"],
                 s["user"]["followers_count"], s["retweet_count"], s["favorite_count"], s["created_at"]]
                for s in data["statuses"]]
        d = pd.DataFrame(rows, columns=('ID', 'Text', 'UserName', "UserFollowerCount", 'RetweetCount', 'Likes', "CreatedAt"))
        d.to_csv(f, mode='a', encoding='utf-8',index=False,header=False)
        if last_size + 1 == len(data["statuses"]):
            print('No more new tweets, stopping...')
            break
        data["statuses"] = []

        # Wait slightly longer than the 15-minute rate-limit window
        sleep(910)

data/twitter/BTC/2018-05-29 12-20-53~2018-05-29 12-26-11.csv
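
The loop above pages backwards through the results with `max_id`. Because `max_id` is inclusive, the last tweet of one page comes back as the first tweet of the next. A common alternative is to pass the lowest ID seen so far minus one. The sketch below illustrates that idea against a fake search function. `fetch_older`, `fake_search` and the IDs are hypothetical, and no request is sent to Twitter.

```python
def fetch_older(search, max_pages, start_max_id=None):
    """Collect statuses page by page, going back in time.

    `search(max_id)` must return a list of status dicts, newest first,
    whose IDs are all <= max_id (or unrestricted when max_id is None).
    """
    collected = []
    max_id = start_max_id
    for _ in range(max_pages):
        page = search(max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Subtract 1 so the oldest tweet of this page is not returned again
        max_id = page[-1]["id"] - 1
    return collected


def fake_search(max_id, ids=(50, 40, 30, 20, 10), page_size=2):
    """Stand-in for the real API call: returns at most `page_size` statuses."""
    eligible = [i for i in ids if max_id is None or i <= max_id]
    return [{"id": i} for i in eligible[:page_size]]


statuses = fetch_older(fake_search, max_pages=10)
print([s["id"] for s in statuses])
```

With Twython, `search` could be a small wrapper around `twitter.search(...)` that returns the `"statuses"` list and leaves out `max_id` when it is `None`.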


## Preprocessing

Now we will clean up the data.

We already restricted the tweets to English in the call to the Twitter API.
We will now remove links, @usernames, images and videos, and strip the `#` from hashtags (#happy -> happy).

We do not convert the text to lower case because VADER takes capital letters into account as emphasis when scoring sentiment.

The tqdm package must be installed using `pip install tqdm`.

In [5]:
import re # regular expressions
from tqdm import tqdm

# Raw strings keep the regex escapes (\w, \d) intact
url_pattern = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')
mention_pattern = re.compile(r'@\w+ *')

d = pd.read_csv(tweets_raw_file)
cleaned = []
for text in tqdm(d['Text']):
    text = text.replace("#", "")          # unhashtag: #happy -> happy
    text = url_pattern.sub('', text)      # remove links
    text = mention_pattern.sub('', text)  # remove @usernames
    cleaned.append(text)
d['Text'] = cleaned  # assign the column once instead of writing row by row with .loc
f = open(tweets_clean_file, 'a+', encoding='utf-8')
d.to_csv(f, header=True, encoding='utf-8',index=False)

100%|██████████| 1120556/1120556 [17:30:30<00:00, 17.78it/s]  
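
The cleaning steps above can be tried on a single made-up tweet. This sketch is not the exact code of the cell: it assumes a broader URL pattern (`https?://\S+`) so that the path of a link is removed along with the domain, and it collapses the leftover whitespace. The helper name `clean_tweet` and the sample text are hypothetical.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+')  # assumption: drop the whole link, path included
MENTION_PATTERN = re.compile(r'@\w+ *')    # @username and any spaces that follow it
SPACES_PATTERN = re.compile(r'\s+')


def clean_tweet(text):
    """Remove '#', links and @usernames while keeping the original case."""
    text = text.replace("#", "")
    text = URL_PATTERN.sub('', text)
    text = MENTION_PATTERN.sub('', text)
    return SPACES_PATTERN.sub(' ', text).strip()


sample = "RT @alice: #Bitcoin to the MOON https://t.co/abc123 $BTC"
cleaned_sample = clean_tweet(sample)
print(cleaned_sample)
```

Capital letters such as `MOON` are left untouched, in line with the decision not to lower-case the text.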


In [6]:
df_clean = pd.read_csv(tweets_clean_file)
df_clean.head(5)

Unnamed: 0,ID,Text,UserName,UserFollowerCount,RetweetCount,Likes,CreatedAt
0,1001439557504692224,RT : $btc bullish continuation,Tamekia Gay,0,548,0,Tue May 29 12:26:11 +0000 2018
1,1001439550860873728,RT : The Woman Powering Bitcoin - Interview Wi...,PiranhadoAmor,53,375,0,Tue May 29 12:26:09 +0000 2018
2,1001439550298828801,RT : Company Creates Decentralized VRWorld Whe...,🇻🇪 José Subero,2635,4,0,Tue May 29 12:26:09 +0000 2018
3,1001439549178961920,RT : Zynga poker chips are a fun currency in T...,Bingaman Sarah,1,890,0,Tue May 29 12:26:09 +0000 2018
4,1001439548877017088,RT Bitcoin: Bitcoin Cash is BitcoinVideo by Cr...,Crypto Br4in,9562,0,0,Tue May 29 12:26:08 +0000 2018


In [None]:
df_clean.min(axis=0)

In [7]:
f.close()