<br>
<h1 style="font-family:sans-serif; text-align:center"> 
    <span style='color: white; font-size:100%; text-shadow: 0px 0px 15px black'> Twitter analysis of </span>
    <span class="hr3" style='color:#e40843; font-size:120%; text-shadow: 0px 0px 30px pink'>Canada </span> <span class="hr3" style='color:gray; font-size:100%; text-shadow: 0px 0px 30px pink'>response to Covid-19</span><br>
</h1>

#### — _Using snscrape_ —

### ✅ This jupyter notebook works well!

This notebook retrieves tweets posted from March 1st to April 30th, 2020, so that we can compare the sentiment of tweets before and after Trudeau's [announcement of government policies addressing the impact of Covid-19](https://www.youtube.com/watch?v=1o-tV0A87l8&feature=youtu.be) to support small businesses and their employees.


The **snscrape** library let us retrieve old tweets, which was not possible with the free version of the Twitter API or with the GetOldTweets3 library (which is currently not working). Martin Beck's Medium article _[How to Scrape Tweets With snscrape](https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af)_ explains well how to download tweets with this library.

_Authors: Leo Cuspinera ([cuspime](https://github.com/cuspime)) and Victor Cuspinera ([vcuspinera](https://github.com/vcuspinera))_

## Imports

In [2]:
# # Install development version of snscrape
# !pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git 

# General libraries
import pandas as pd
import numpy as np
import os
import time
from datetime import datetime, timedelta, date
from pytz import timezone
import json

# Preprocess libraries
import re
import spacy
import string
import en_core_web_sm
nlp = en_core_web_sm.load()

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

## Parameters

In [3]:
# dates
today = datetime.now()
init = date.fromisoformat('2020-03-01')

my_dates = list()
for d in range(0, 61, 1):
# for d in range(0, 1, 1):
    aux = init + timedelta(days=d)
    my_dates.append(aux)

# twitter accounts
accounts = ("@JustinTrudeau", "@CanadianPM","@Canada", "@GovCanHealth")

# max number of results
max_results = 100_000

#folder to save information
my_folder = "../tweets/"
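
As a quick sanity check on the parameters above, the 61-day window that starts on 2020-03-01 can be built with a list comprehension and verified to end on April 30th. This is a minimal sketch that uses only the standard library; the names `window_start`, `window_days` and `window` are ours, not part of the notebook.

```python
from datetime import date, timedelta

window_start = date.fromisoformat("2020-03-01")  # same start date as `init`
window_days = 61                                 # 31 days of March + 30 days of April

# One date per day of the collection window
window = [window_start + timedelta(days=d) for d in range(window_days)]

print(window[0], window[-1], len(window))
```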

## Get and save tweets as `json` files
⚠️ **Caution:** Run this code chunk only once. Downloading all the tweets takes a long time (1 h 44 min in the run shown below), and this step writes 244 JSON files that together weigh 10.27 GB.

In [4]:
%%time

# Retrieving tweets with `snscrape`, by using OS library to call CLI commands in Python.
for ac in accounts:
    for dt in my_dates:
        next_day = dt + timedelta(days=1)
        os.system("snscrape --jsonl --max-results " + str(max_results) + " --since " + 
                  dt.strftime("%Y-%m-%d") + " twitter-search '" + ac + " until:" + 
                  next_day.strftime("%Y-%m-%d") + "' > " + my_folder + ac + 
                  "_" + dt.strftime("%Y-%m-%d") + ".json")


CPU times: user 154 ms, sys: 175 ms, total: 329 ms
Wall time: 1h 44min 10s
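
The same call can be made without building a shell string, which avoids quoting problems with the search query. The sketch below is an alternative, not what the notebook ran: it assembles the argument list from the flags used above and passes it to `subprocess.run`, redirecting standard output to the target file. The helper names `build_snscrape_command` and `download_day` are hypothetical.

```python
import subprocess
from datetime import date, timedelta


def build_snscrape_command(account, day, max_results=100_000):
    """Return the snscrape CLI call for one account and one day as an argument list."""
    next_day = day + timedelta(days=1)
    return [
        "snscrape", "--jsonl",
        "--max-results", str(max_results),
        "--since", day.isoformat(),
        "twitter-search", f"{account} until:{next_day.isoformat()}",
    ]


def download_day(account, day, folder="../tweets/", max_results=100_000):
    """Run snscrape for one account and day, saving its output as a JSON-lines file."""
    path = f"{folder}{account}_{day.isoformat()}.json"
    with open(path, "w") as out:
        # No shell is involved, so the query needs no extra quoting
        completed = subprocess.run(build_snscrape_command(account, day, max_results),
                                   stdout=out, check=False)
    return completed.returncode


# Only build and show the command here; download_day() would actually run it
print(build_snscrape_command("@JustinTrudeau", date(2020, 3, 1)))
```

Checking the return code of each call makes it easier to spot days whose download failed, something `os.system` with a redirect hides unless its return value is inspected.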


## Bring and merge tweets by account

In [5]:
%%time

# Columns
my_columns = ['account', 'date', 'content', 'user', 'replyCount', 'retweetCount',
              'likeCount', 'quoteCount', 'lang', 'sourceLabel']

dict_tot = {}

# Read and concatenate the data frames
for ac in accounts:
    # Create an empty pandas dataframe
    df = pd.DataFrame(columns = my_columns)
    t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
    print("Start with " + ac + " files:")

    # Read the JSON files of tweets
    for d in my_dates:
        t00 = time.perf_counter()
        aux = pd.read_json(my_folder + ac + '_' + str(d) + '.json', lines=True)
        aux['account'] = ac
        df = pd.concat([df, aux[my_columns]])
        print("   > date: " + str(d) + ' time ' + str(round(time.perf_counter() - t00, 4)))

    # Keep the merged data frame of each account in a dictionary
    df.reset_index(drop=True, inplace=True)
    dict_tot[ac] = df
    print(ac + " DB ready, time " + str(round(time.perf_counter() - t0, 4)))
    # Optionally, save one CSV file per account
    # df.to_csv(my_folder + 'tweets_db_' + ac + '.csv')
    # print(ac + " tweets saved as .csv file, time " + str(round(time.perf_counter() - t0, 4)))
    print("- - - - - o - - - - -\n")


Start with @JustinTrudeau files:
   > date: 2020-03-01 time 0.4633
   > date: 2020-03-02 time 0.1893
   > date: 2020-03-03 time 0.3233
   > date: 2020-03-04 time 0.303
   > date: 2020-03-05 time 0.3967
   > date: 2020-03-06 time 0.3345
   > date: 2020-03-07 time 0.3268
   > date: 2020-03-08 time 0.2887
   > date: 2020-03-09 time 0.4855
   > date: 2020-03-10 time 0.5087
   > date: 2020-03-11 time 0.4458
   > date: 2020-03-12 time 1.1032
   > date: 2020-03-13 time 1.2411
   > date: 2020-03-14 time 0.9987
   > date: 2020-03-15 time 1.0789
   > date: 2020-03-16 time 2.0332
   > date: 2020-03-17 time 0.9336
   > date: 2020-03-18 time 1.2837
   > date: 2020-03-19 time 0.9919
   > date: 2020-03-20 time 1.0339
   > date: 2020-03-21 time 1.4402
   > date: 2020-03-22 time 1.5794
   > date: 2020-03-23 time 1.1027
   > date: 2020-03-24 time 1.9376
   > date: 2020-03-25 time 1.4543
   > date: 2020-03-26 time 0.8571
   > date: 2020-03-27 time 1.2774
   > date: 2020-03-28 time 0.9282
   > date: 2020-
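
Calling `pd.concat` inside the loop copies the accumulated data frame on every iteration. A common alternative, sketched below with two small invented frames standing in for the daily files, is to collect the pieces in a list and concatenate once at the end.

```python
import pandas as pd

# Toy stand-ins for two daily files (invented data, for illustration only)
daily_frames = [
    pd.DataFrame({"date": ["2020-03-01"], "content": ["first tweet"]}),
    pd.DataFrame({"date": ["2020-03-02"], "content": ["second tweet"]}),
]

pieces = []
for frame in daily_frames:
    frame = frame.copy()
    frame["account"] = "@Canada"   # tag each row with the searched account
    pieces.append(frame[["account", "date", "content"]])

# A single concatenation; ignore_index=True renumbers the rows from 0
merged = pd.concat(pieces, ignore_index=True)
print(merged)
```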

## Merge all tweets in one `json` file

⚠️ Run this section only once, the first time you run the notebook. If you have already run these cells and are reopening the notebook, skip to the next section, _**Open `json` file with all tweets**_.

In [17]:
%%time
df_tot = pd.DataFrame()
# Concatenate all tweets
df_tot = pd.concat([dict_tot['@Canada'], 
                    dict_tot['@CanadianPM'], 
                    dict_tot['@GovCanHealth'], 
                    dict_tot['@JustinTrudeau']])
df_tot.reset_index(drop=True, inplace=True)

CPU times: user 347 ms, sys: 13.6 ms, total: 361 ms
Wall time: 359 ms


In [18]:
def unpack(df, column, fillna=None):
    """Expand a column of dictionaries into one column per key."""
    # Series.iteritems() was removed in pandas 2.0, so build the frame from a list.
    # The result has a default 0..n-1 index, so `df` needs a reset index too.
    expanded = pd.DataFrame(df[column].tolist())
    if fillna is not None:
        expanded = expanded.fillna(fillna)
    ret = pd.concat([df, expanded], axis=1)
    del ret[column]
    return ret
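
To see what `unpack` does, consider a tiny frame whose `user` column holds dictionaries, as in the snscrape output. The sketch below uses invented rows and reproduces the same steps inline (expand the dictionaries, concatenate side by side, drop the original column) so that it runs on its own.

```python
import pandas as pd

# Invented rows: each `user` value is a dictionary, as in the downloaded tweets
toy = pd.DataFrame({
    "content": ["hello", "bonjour"],
    "user": [
        {"username": "alice", "followersCount": 10},
        {"username": "bob", "followersCount": 25},
    ],
})

# One column per dictionary key, aligned on the default 0..n-1 index
expanded = pd.DataFrame(toy["user"].tolist())
unpacked = pd.concat([toy, expanded], axis=1).drop(columns="user")

print(list(unpacked.columns))
```

The side-by-side concatenation matches rows by index, which is why the notebook resets the index of `df_tot` before calling `unpack`.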

In [21]:
%%time

# unpack information from user
df_tot = unpack(df_tot, 'user')

# select only the columns we wanted to save
wanted_columns = ['account', 'date', 'content', 'replyCount', 'retweetCount', 
                  'likeCount', 'quoteCount', 'lang', 'sourceLabel', 'username', 
                  'followersCount', 'friendsCount', 'location']
df_tot = df_tot[wanted_columns]

CPU times: user 8.82 s, sys: 296 ms, total: 9.11 s
Wall time: 9.11 s


In [22]:
%%time

# Save tweets as JSON file
df_tot.to_json(my_folder + 'tweets_db.json')

CPU times: user 6.97 s, sys: 704 ms, total: 7.67 s
Wall time: 8.76 s


## Open `json` file with all tweets
#### 👉 Use this section if you have already run the previous chunks and are reopening the notebook.

In [23]:
%%time

# Open the file
df_tot = pd.read_json(my_folder + 'tweets_db.json')

CPU times: user 7.87 s, sys: 2.16 s, total: 10 s
Wall time: 10.1 s


## Preprocess tweets
#### ⚠️ This process is very slow because it processes more than 3.5 million tweets, one by one. 
Instead of running this section, we recommend using the Python script __[preprocess.py](https://github.com/vcuspinera/Canada_response_covid/blob/master/src/preprocess.py)__, which you can run from the terminal with the following command:  
`$ python src/preprocess.py --input_dir=tweets/ --output_dir=tweets/`

In [24]:
def preprocess(text, irrelevant_pos = ['SPACE'],
              avoid_entities = ['ORG']):
    """
    Identify sensitive information in the texts and remove some of it,
    such as emails and URLs.
    Parameters
    -------------
    text : (list)
        the list of text to be preprocessed
    irrelevant_pos : (list)
        a list of irrelevant 'pos' tags
    avoid_entities : (list)
        a list of entity labels to be avoided

    Returns
    -------------
    (list) list of preprocessed text

    Example
    -------------
    example = ["Contact me at george23@gmail.com",
           "@vcuspinera my webpage is https://vcuspinera.github.io"]
    preprocess(example)
    (output:) ['contact me at',
               'my webpage is']
    """
    result = []

    # Process each text in turn
    for sent in text:
        sent = str(sent).lower()

        result_sent = []
        doc = nlp(sent)
        # Entities with a label to avoid (by default, organization names)
        entities = [str(ent) for ent in doc.ents if ent.label_ in avoid_entities]

        for token in doc:
            # Skip emails, URLs, irrelevant parts of speech and avoided entities
            if (token.like_email or
                token.like_url or
                token.pos_ in irrelevant_pos or
                str(token) in entities
               ):
                continue

            if str(token) in string.punctuation and result_sent:
                # Attach punctuation to the previous token
                result_sent[-1] = str(result_sent[-1]) + str(token)
            else:
                result_sent.append(str(token))
        result.append(" ".join(result_sent))
    return result
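
For a quick look at the email and URL removal step without loading spaCy, the sketch below swaps in two simple regular expressions. It is a rough approximation, not the notebook's method: the patterns and the name `strip_emails_and_urls` are our own simplification, and the sketch does not handle entities, mentions or punctuation the way `preprocess` does.

```python
import re

# Simplified patterns (assumptions): they cover common cases, not every valid form
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def strip_emails_and_urls(text):
    """Lowercase the text, drop URLs and email addresses, and collapse whitespace."""
    text = URL_RE.sub(" ", str(text).lower())
    text = EMAIL_RE.sub(" ", text)
    return " ".join(text.split())


examples = ["Contact me at george23@gmail.com",
            "My webpage is https://vcuspinera.github.io"]
print([strip_emails_and_urls(t) for t in examples])
```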

In [None]:
# Preprocess tweets
df_tot['tweet'] = preprocess(df_tot['content'])

In [None]:
# Save file (`output_dir` was never defined in this notebook, so use `my_folder`)
df_tot.drop('content', axis=1).to_json(my_folder + 'tweets_db_clean.json')

## **Basic analysis** and **EDA**

#### 👉 [Click here](https://github.com/vcuspinera/Canada_response_covid/blob/master/src/eda.ipynb) to see the initial Data Analysis in the EDA jupyter notebook of this project.