# Twitter extraction

This notebook retrieves tweets, cleans them, and computes a sentiment score for each one, in order to look for a correlation between cryptocurrencies and the sentiment of the tweets about them. The notebook performs the following steps:

- Retrieve tweets with Twython (a Twitter API wrapper for Python)
- Extract the wanted data (tweet text, number of followers, number of likes, etc.)
- Clean the textual data (remove unnecessary elements such as media, website links, usernames, ...)
- Compute for each tweet a sentiment score with VADER (called the compound score) and a score that combines the tweet's popularity with its compound score

This notebook is written using Python 3.6.

## Setup

In [4]:
# Define the currency
CURRENCY = "" # Enter the name of the currency, e.g. cardano, bitcoin, dogecoin
CURRENCY_SYMBOL = "" # Enter the symbol of the currency, e.g. ADA, BTC, DOGE

## personal config
TWEETS_FOLDER    = "data/crypto/%s"%(CURRENCY) # Relative path to historical data
SEP_CHAR         = '~' # Character separating the "from" and "to" dates in a filename
ENVS             = ['CRYPTO', 'LINE_COUNT', 'MOST_RECENT_FILE', 'MOST_RECENT_ID'] # Stored in var.csv
MAX_ROW_PER_FILE = 20000 # Each file storing data has a maximum amount of rows

tweets_raw_file = 'data/twitter/%s/%s_tweets_raw.csv'%(CURRENCY_SYMBOL,CURRENCY)
tweets_clean_file = 'data/twitter/%s/%s_tweets_clean.csv'%(CURRENCY_SYMBOL,CURRENCY)
query = '#%s OR #%s'%(CURRENCY,CURRENCY_SYMBOL) ####TODO PUT BACK  OR {CURRENCY} OR ${CURRENCY} OR ${CURRENCY_SYMBOL}
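
To make the templates above concrete, here is a minimal sketch of what they expand to for one currency. The helper name `build_settings` is hypothetical and not part of the notebook; `cardano` and `ADA` are simply the example values from the comments.

```python
def build_settings(currency, symbol):
    """Expand the notebook's path and query templates for one currency.

    `build_settings` is an illustrative helper, not part of the notebook.
    """
    return {
        "raw_file": "data/twitter/%s/%s_tweets_raw.csv" % (symbol, currency),
        "clean_file": "data/twitter/%s/%s_tweets_clean.csv" % (symbol, currency),
        # Search for either hashtag, e.g. "#cardano OR #ADA"
        "query": "#%s OR #%s" % (currency, symbol),
    }


settings = build_settings("cardano", "ADA")
print(settings["raw_file"])  # data/twitter/ADA/cardano_tweets_raw.csv
print(settings["query"])     # #cardano OR #ADA
```

Note that the files are grouped in a folder named after the symbol, while the file name itself uses the currency name.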

## 1. Retrieve the tweets from the Twitter API

### 1.1 Import Twython


In [5]:
from twython import Twython

### 1.2 OAuth2 Authentication (*app* authentication)


In [3]:
APP_KEY =''  # Enter your API key
APP_SECRET =  '' # Enter API Secret key
twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)
ACCESS_TOKEN = twitter.obtain_access_token()
twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)
twitter.get_application_rate_limit_status()['resources']['search']

{'/search/tweets': {'limit': 450, 'remaining': 450, 'reset': 1628404812}}
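
The `reset` value in this response is a Unix timestamp (seconds since the epoch) marking when the quota is refilled. As a rough sketch, the waiting time could be derived from it as shown below. The helper `seconds_until_reset` is hypothetical, and the sample dictionary reuses the values above with `remaining` set to 0 for illustration.

```python
from datetime import datetime, timezone


def seconds_until_reset(limit_status, now=None):
    """Return how long to wait before the search quota is refilled.

    `limit_status` is one entry of the rate-limit response, e.g. the value
    stored under '/search/tweets'. `reset` is a Unix timestamp in seconds.
    """
    if limit_status["remaining"] > 0:
        return 0  # requests are still available, no need to wait
    if now is None:
        now = datetime.now(timezone.utc).timestamp()
    return max(0, int(limit_status["reset"] - now))


# Illustrative values: quota exhausted, current time 10 minutes before the reset
status = {"limit": 450, "remaining": 0, "reset": 1628404812}
print(seconds_until_reset(status, now=1628404812 - 600))  # 600
```

Waiting for the computed time is an alternative to a fixed pause between batches. It is an assumption of this sketch, not what the notebook does.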

### 1.3 Query the Twitter API
Here we query the Twitter API to get the latest tweets about the configured currency (Cardano in the run shown below). We then keep only the useful fields and store them in a pandas DataFrame.

The following fields are retrieved from the response:

- **id** (int) : unique identifier of the tweet
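- **text** (string) : UTF-8 textual content of the tweet. It is read from `full_text` because the query uses `tweet_mode='extended'`, so it can hold up to 280 characters rather than the truncated 140
- user
  - **name** (string) : display name of the user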
  - **followers_count** (int) : Number of followers the user has
- **retweet_count** (int) : Number of times the tweet has been retweeted
- **favorite_count** (int) : Number of likes
- **created_at** (string) : creation date and time of the tweet (UTC), stored as returned by the API
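
The field list above can be illustrated with a small sketch that flattens one tweet into a row. The helper `status_to_row` and the sample tweet are made up for illustration. The date format string is the one used by the v1.1 API for `created_at`. Parsing the date is optional, since the notebook stores it as returned.

```python
from datetime import datetime

# Date format of the `created_at` field in the v1.1 API
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def status_to_row(status):
    """Flatten one tweet (a dict from the 'statuses' list) into a CSV row."""
    # With tweet_mode='extended' the text is in 'full_text', otherwise in 'text'
    text = status.get("full_text", status.get("text", ""))
    return [
        status["id"],
        text.replace("\n", "").replace("\r", ""),  # keep one tweet per line
        status["user"]["name"],
        status["user"]["followers_count"],
        status["retweet_count"],
        status["favorite_count"],
        datetime.strptime(status["created_at"], CREATED_AT_FORMAT),
    ]


# Made-up tweet for illustration only
sample = {
    "id": 1,
    "full_text": "Cardano to the moon!\n#ADA",
    "user": {"name": "Alice", "followers_count": 42},
    "retweet_count": 3,
    "favorite_count": 7,
    "created_at": "Sun Aug 08 06:40:12 +0000 2021",
}
row = status_to_row(sample)
print(row[1])              # Cardano to the moon!#ADA
print(row[6].isoformat())  # 2021-08-08T06:40:12+00:00
```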


The pandas and tqdm packages must be installed with `pip install pandas tqdm` from the command line.

In [6]:
# Import Libraries
from time import sleep
import json
import pandas as pd
import io
from tqdm import tqdm

In [7]:
tweets_raw_file

'data/twitter/ADA/cardano_tweets_raw.csv'

In [6]:
NUMBER_OF_QUERIES = 450 # Search requests allowed per 15-minute window (see the rate limit above)
data = {"statuses": []}
next_id = "" # Set to a tweet ID (e.g. "1147236962945961984") to resume from older tweets
since_id = "" # Set to a tweet ID to retrieve only tweets posted after it
with open(tweets_raw_file, "a+", encoding='utf-8') as f:
    # Write the CSV header only when starting a new extraction
    if not next_id and not since_id:
        f.write("ID,Text,UserName,UserFollowerCount,RetweetCount,Likes,CreatedAt\n")
    while True:
        twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)
        last_size = 0
        for i in tqdm(range(NUMBER_OF_QUERIES)):
            if not next_id:
                # First request: start from the most recent tweets
                data = twitter.search(q=query, lang='en', result_type='recent', count=100, tweet_mode='extended', since_id=since_id)
            else:
                # Following requests: page backwards from the oldest tweet seen so far.
                # max_id is inclusive, so that tweet is returned again as the first result.
                data["statuses"].extend(twitter.search(q=query, lang='en', result_type='mixed', count=100, max_id=next_id, tweet_mode='extended')["statuses"])
            if len(data["statuses"]) > 1:
                next_id = data["statuses"][-1]['id']
            # Only the repeated boundary tweet came back: nothing older is left
            if last_size + 1 == len(data["statuses"]):
                break
            else:
                last_size = len(data["statuses"])

        print('Retrieved {0}, waiting for 15 minutes until next queries'.format(len(data["statuses"])))
        # Keep only the useful fields and strip line breaks so each tweet stays on one CSV line
        rows = [[s["id"], s["full_text"].replace('\n', '').replace('\r', ''), s["user"]["name"],
                 s["user"]["followers_count"], s["retweet_count"], s["favorite_count"], s["created_at"]]
                for s in data["statuses"]]
        d = pd.DataFrame(rows, columns=('ID', 'Text', 'UserName', 'UserFollowerCount', 'RetweetCount', 'Likes', 'CreatedAt'))
        d.to_csv(f, mode='a', encoding='utf-8', index=False, header=False)
        if last_size + 1 == len(data["statuses"]):
            print('No more new tweets, stopping...')
            break
        data["statuses"] = []

        sleep(910) # Wait a little over 15 minutes for the rate limit window to reset

100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [06:21<00:00,  1.18it/s]


Retrieved 44706, waiting for 15 minutes until next queries


100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [06:31<00:00,  1.15it/s]


Retrieved 43668, waiting for 15 minutes until next queries


 49%|███████████████████████████████████████▎                                        | 221/450 [03:01<02:48,  1.36it/s]

Retrieved 21843, waiting for 15 minutes until next queries
No more new tweets, stopping...
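
The loop above pages backwards through the results with `max_id`. Because `max_id` is inclusive, a common variant passes the oldest ID seen minus one, so the boundary tweet is not fetched twice. The sketch below shows that variant on made-up integer IDs. `fake_search` is a hypothetical stand-in for `twitter.search`, and `fetch_all` is an illustrative helper, not the notebook's code.

```python
def fake_search(all_ids, count, max_id=None):
    """Stand-in for twitter.search: newest-first IDs not greater than max_id."""
    eligible = [i for i in sorted(all_ids, reverse=True)
                if max_id is None or i <= max_id]
    return eligible[:count]


def fetch_all(search, count):
    """Page backwards through results until a page comes back empty."""
    collected = []
    max_id = None
    while True:
        page = search(count=count, max_id=max_id)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        # max_id is inclusive, so continue strictly below the oldest ID seen
        max_id = page[-1] - 1
    return collected


ids = list(range(1, 11))  # ten made-up tweet IDs
result = fetch_all(lambda count, max_id: fake_search(ids, count, max_id), count=4)
print(result)  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

With this variant the stop condition becomes "the page is empty" rather than "only one tweet came back".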


## Preprocessing

Now we will clean up the data.

We already restricted the search to English tweets in the call to the Twitter API.
We will now remove links, @mentions, images, and videos, and unhashtag words (#happy -> happy).

We won't convert the text to lower case, because VADER takes capital letters into account as a way of emphasizing sentiment.
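
The cleaning rules described above can be sketched as a single function. This is a minimal illustration under stated assumptions, not a definitive implementation: the name `clean_tweet` is hypothetical, and the URL pattern assumes that a link runs until the next whitespace character.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")  # links, including the path
MENTION_PATTERN = re.compile(r"@\w+ *")    # @user plus trailing spaces


def clean_tweet(text):
    """Remove links and mentions and unhashtag, without changing letter case."""
    text = URL_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    text = text.replace("#", "")
    return text.strip()


print(clean_tweet("@bob I LOVE #cardano https://t.co/abc123"))  # I LOVE cardano
```

The capital letters in "LOVE" are preserved, which is the property the sentiment step relies on.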

The tqdm package must be installed with `pip install tqdm`.

In [8]:
import re # regular expressions
from tqdm import tqdm

d = pd.read_csv(tweets_raw_file)
for i, s in enumerate(tqdm(d['Text'])):
    text = s.replace("#", "") # unhashtag: #happy -> happy
    text = re.sub(r'https?://\S+', '', text) # remove links, including the path after the domain
    text = re.sub(r'@\w+ *', '', text) # remove @mentions and the spaces that follow them
    d.loc[i, 'Text'] = text
# Open in write mode so that rerunning the cell does not append a second copy of the data
f = open(tweets_clean_file, 'w', encoding='utf-8')
d.to_csv(f, header=True, encoding='utf-8', index=False)

100%|██████████████████████████████████████████████████████████████████████████| 110217/110217 [31:26<00:00, 58.43it/s]


In [11]:
f.close()
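
Because paging with an inclusive `max_id` returns the boundary tweet of each page twice, the saved data may contain repeated IDs. A minimal sketch of removing them with the standard library follows. The helper `drop_duplicate_ids` and the sample CSV content are made up for illustration.

```python
import csv
import io


def drop_duplicate_ids(rows):
    """Keep the first row seen for each tweet ID, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row["ID"] not in seen:
            seen.add(row["ID"])
            unique.append(row)
    return unique


# Made-up CSV content standing in for the saved file
raw = "ID,Text\n3,third\n2,second\n2,second\n1,first\n"
rows = list(csv.DictReader(io.StringIO(raw)))
print([r["ID"] for r in drop_duplicate_ids(rows)])  # ['3', '2', '1']
```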

### End of the first notebook: data extraction and cleaning