<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:10px 5px'> 
Master Thesis Yannik Haller - Sentiment Analysis TEXTBLOB
</h1>
</div>

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
1. Load required packages and the data
</h2>
</div>

In [1]:
# Import required baseline packages
import re
import os
import glob
import time
import sys
import pandas as pd
import numpy as np
from pprint import pprint

# Change pandas' setting to print out long strings
pd.options.display.max_colwidth = 200

# Plotting tools
import matplotlib.pyplot as plt
%matplotlib inline

# TextBlob (for Sentiment Analysis)
from textblob import Blobber
from textblob_fr import PatternTagger, PatternAnalyzer

import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)
warnings.filterwarnings("ignore", category = FutureWarning)


In [2]:
# Set the appropriate working directory
os.chdir('D:\\Dropbox\\MA_data')

In [3]:
# Define a function to read in the fully preprocessed data (note: we are using the preprocessed data in which negations are preserved --> PPII)
def read_preprocessed(language, tokenize = True):
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)
    
    # Set the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data')

    # Define the name of the file to load
    filename = "Preprocessed/Sentiment_Analysis/"+language+"_preprocessed_senti.csv"

    # Read in the dataframe containing the text data
    tx_pp = pd.read_csv(filename, index_col = 0, dtype = {'tx': object})

    # Get the articles' index together with an enumeration to identify their position in the list of precleaned articles
    idx = tx_pp.index
    idx = pd.DataFrame(idx, columns = [language+'_idx'])

    # Reduce the dataframe to a list containing the text data
    tx_pp = tx_pp.tx.to_list()

    # Tokenize the data again if tokenize = True (RAM-saving)
    if tokenize:
        tx_pp = retokenize(tx_pp)

    # Return the preprocessed data
    return tx_pp, idx

# Define a function to retokenize the preprocessed text data (RAM-saving)
def retokenize(article_list):
    # Overwrite each entry in place so that no second copy of the corpus is held in memory
    for i, article in enumerate(article_list):
        article_list[i] = str(article).split()
    return article_list

In [4]:
# Read in the preprocessed data (not tokenized)
fr_tx, fr_idx = read_preprocessed('fr', tokenize = False)

# Take a look at the size of the list object in bytes (note: sys.getsizeof does not include the strings the list references)
sys.getsizeof(fr_tx)

3849360
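
`sys.getsizeof` on a list counts only the list object and its internal array of references, so the figure above understates the memory held by the articles. A minimal sketch of a fuller estimate, assuming a flat list of strings; the helper name `approx_list_size` and the toy data are illustrative and not part of the notebook:

```python
import sys

def approx_list_size(strings):
    """Approximate bytes used by a flat list of strings, including the strings themselves."""
    # Size of the list object (header plus one reference per element)
    shallow = sys.getsizeof(strings)
    # Count each distinct string object once, since repeated references share memory
    unique_objects = {id(s): s for s in strings}
    return shallow + sum(sys.getsizeof(s) for s in unique_objects.values())

toy_articles = ["bourse york terminer hausse", "dow jones avancer point"]
print(sys.getsizeof(toy_articles), approx_list_size(toy_articles))
```

The second number is the more realistic one when judging whether the untokenized corpus fits into RAM.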

In [5]:
# Take a look at the preprocessed data
fr_tx[0]

'bourse york terminer hausse mercredi espoir prochain accord nouveau plan aide économique américain mener dow jones brièvement dessus séance dow jones industrial average avancer point nasdaq gagner point s&p progresser point bourse york clôturer anxieusement léger baisse mardi débat présidentiel dow jones industrial average céder nasdaq mercredi rencontre cheffe démocrate chambre secrétaire américain trésor discuter nouveau aide économique panne mois susciter espoir compromis raisonnable mot steven mnuchin optimisme donner coup fouet action brusquement tempérer chef républicain sénat mitch mcconnell sortir dire position encore très très éloigné expliquer karl haeling lbbw bourse york voir aussi introduction fanfare cotation direct titre discret groupe surveillance donnée palantir prix valoriser plus milliard dollar symbole pltr titre clôturer dollar soit bien dessus prix indicatif dollar donner mardi soir new york stock lire page titre fabricant camion électrique hydrogène nikola repre

In [6]:
# Take a look at the dataframe containing the corresponding index
fr_idx.tail(3)

         fr_idx
481159  2436480
481160  2436481
481161  2436482


In [7]:
# Retrieve the locations of the articles in the preprocessed data using their article ids
article_ids = [2436481, 2436482]
location = fr_idx[fr_idx.fr_idx.isin(article_ids)].index.tolist()

# Access the preprocessed text of the articles with the article ids in [2436481, 2436482]
#list(fr_tx[i] for i in location)

# Look at the locations of the articles with the article ids in [2436481, 2436482]
location

[481160, 481161]
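
When many ids have to be looked up, a dictionary from article id to list position avoids scanning the whole index each time. A minimal sketch with toy ids in place of `fr_idx`; the names `ids` and `position_of` are hypothetical:

```python
# Toy stand-in for the article ids, in the same order as the text list
ids = [2436480, 2436481, 2436482]

# Map each article id to its position in the list of preprocessed articles
position_of = {article_id: pos for pos, article_id in enumerate(ids)}

article_ids = [2436481, 2436482]
# Skip ids that are not present instead of raising a KeyError
location = [position_of[a] for a in article_ids if a in position_of]
print(location)  # [1, 2]
```

Building the dictionary costs one pass over the ids; every later lookup is then a constant-time operation.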

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2. Sentiment Assessment of the Articles
</h2>
</div>

In [8]:
# Define a function that evaluates the polarity of the articles and stores the result in a correctly indexed csv file
def eval_blob_polarity(tx, idx, language = 'fr', batchsize = 100000):
    # Notes: 
    ## tx has to be a list containing the precleaned and NOT tokenized articles
    ## idx has to be a list containing the article ids in the same order as tx
    ## language only sets the prefix of the output file; the analyzer below is always the French one from textblob_fr

    # Initialize a Blobber, which reuses the French PatternTagger and PatternAnalyzer imported above to assess text polarity (sentiment from -1 to 1) and subjectivity
    tb = Blobber(pos_tagger = PatternTagger(), analyzer = PatternAnalyzer())

    # Loop through all articles and evaluate their polarity with TextBlob
    i_last_batch = 0
    n_articles = len(tx)
    pol = []
    t = time.time()
    for i, article in enumerate(tx, start = 1):
        pol.append(tb(article).sentiment[0])
        # Report the processing time after every full batch and once after the last article (never twice for the same batch)
        if i % batchsize == 0 or i == n_articles:
            print("Processing time to evaluate polarity scores of the articles at positions", i_last_batch, "to", i-1, ":", str(round((time.time() - t)/60,2)), "minutes")
            i_last_batch = i
            t = time.time()
    print("DONE! ;)")

    # Create a correctly indexed dataframe
    Blob_tx_polarity = pd.DataFrame(pol, index = idx, columns = ['Blob_polarity'])
    # Save the results to a csv file
    Blob_tx_polarity.to_csv("Sentiment/TextBlob/"+language+"_blob_polarity.csv", index = True)
    # Return the results
    return Blob_tx_polarity

In [9]:
# Apply the previously defined function
Blob_tx_polarity = eval_blob_polarity(fr_tx, fr_idx.fr_idx.values.tolist(), 'fr', 100000)

Processing time to evaluate polarity scores of the articles at positions 0 to 99999 : 1.85 minutes
Processing time to evaluate polarity scores of the articles at positions 100000 to 199999 : 2.02 minutes
Processing time to evaluate polarity scores of the articles at positions 200000 to 299999 : 1.88 minutes
Processing time to evaluate polarity scores of the articles at positions 300000 to 399999 : 2.16 minutes
Processing time to evaluate polarity scores of the articles at positions 400000 to 481161 : 1.49 minutes
DONE! ;)
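
The batch-wise progress reporting used above can be isolated from the scoring itself. A minimal sketch, assuming any callable that maps a text to a number; `score_in_batches` and the stand-in scorer `len` are hypothetical and replace the TextBlob call only for illustration:

```python
import time

def score_in_batches(texts, score, batchsize):
    """Score every text and record (first, last, minutes) for each batch of positions."""
    scores, batches = [], []
    first = 0
    t = time.time()
    for i, text in enumerate(texts, start=1):
        scores.append(score(text))
        # Close a batch when it is full or when the last text has been processed
        if i % batchsize == 0 or i == len(texts):
            batches.append((first, i - 1, (time.time() - t) / 60))
            first = i
            t = time.time()
    return scores, batches

# Stand-in scorer: text length instead of a sentiment polarity
scores, batches = score_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], len, batchsize=2)
for first, last, minutes in batches:
    print("positions", first, "to", last, ":", round(minutes, 2), "minutes")
```

Returning the batch boundaries instead of printing them inside the loop keeps the timing logic testable on small inputs.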


In [10]:
# Take a look at the results
Blob_tx_polarity

         Blob_polarity
0             0.124348
1             0.100000
2             0.163700
3            -0.012364
4             0.207273
...                ...
2436478       0.191667
2436479       0.107407
2436480       0.027500
2436481       0.090000


In [11]:
# Read the results back in
Blob_tx_polarity = pd.read_csv("Sentiment/TextBlob/fr_blob_polarity.csv", index_col = 0, dtype = {'Blob_polarity': float})
# Take a look at the results
Blob_tx_polarity

         Blob_polarity
0             0.124348
1             0.100000
2             0.163700
3            -0.012364
4             0.207273
...                ...
2436478       0.191667
2436479       0.107407
2436480       0.027500
2436481       0.090000


In [12]:
# Take a look at some summary statistics
share_pos = np.round(np.sum(Blob_tx_polarity['Blob_polarity'] > 0) / len(Blob_tx_polarity),2)
share_neg = np.round(np.sum(Blob_tx_polarity['Blob_polarity'] < 0) / len(Blob_tx_polarity),2)
print('The share of articles with a positive sentiment is', 100*share_pos,'%')
print('The share of articles with a negative sentiment is', 100*share_neg,'%')
np.round(Blob_tx_polarity.describe(), 3)

The share of articles with a positive sentiment is 88.0 %
The share of articles with a negative sentiment is 11.0 %


       Blob_polarity
count       481162.0
mean           0.086
std            0.084
min             -1.0
25%            0.042
50%            0.086
75%            0.131
max              1.0
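
The two reported shares sum to 99 %, so the remainder consists of articles whose polarity is exactly zero (rounding makes the 1 % only approximate). A minimal sketch of computing all three shares on toy values; the list `polarity` and the helper `sentiment_shares` are illustrative and not the thesis data:

```python
def sentiment_shares(polarity):
    """Return the shares of positive, negative and zero polarity scores."""
    n = len(polarity)
    pos = sum(p > 0 for p in polarity) / n
    neg = sum(p < 0 for p in polarity) / n
    zero = sum(p == 0 for p in polarity) / n
    return pos, neg, zero

# Toy polarity scores in the range [-1, 1]
polarity = [0.12, 0.10, -0.01, 0.0, 0.21]
pos, neg, zero = sentiment_shares(polarity)
print(round(100 * pos), "% positive,", round(100 * neg), "% negative,", round(100 * zero), "% zero")
```

Reporting the zero share explicitly makes it visible how many articles the lexicon could not score in either direction.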
