<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:10px 5px'> 
Master Thesis Yannik Haller - Data Preprocessing LDA
</h1>
</div>

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
1. Load required packages and the data
</h2>
</div>

In [1]:
# Import required baseline packages
import re
import os
import glob
import time
import sys
import pandas as pd
import numpy as np
from pprint import pprint

# Change pandas' setting to print out long strings
pd.options.display.max_colwidth = 200

# Spacy (for lemmatization)
import spacy

# Plotting tools
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)

In [2]:
# Set the appropriate working directory
os.chdir('D:\\Dropbox\\MA_data')

# Read in the aggregated data
fr_tx = pd.read_csv("agg_csv_sparse_fr.csv", index_col = 0, dtype = {'so': object, 'la': object, 'tx': object})

In [3]:
# Take a look at the shape of the data
fr_tx.shape

(481162, 3)

In [4]:
# Store the article IDs (i.e. index)
fr_idx = fr_tx.index  # French

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2. Preprocess the text data batchwise
</h2>
</div>

# Prerequisite
Run these commands in terminal (cmd) after the appropriate environment has been activated (by running the command "activate Master_Thesis_env") in order to install the required spaCy implementations.

## German
python -m spacy download de_core_news_sm (check)

python -m spacy download de_core_news_md (check)

python -m spacy download de_core_news_lg (check)

## English
python -m spacy download en_core_web_sm (check)

python -m spacy download en_core_web_md (check)

python -m spacy download en_core_web_lg (check)

## French
python -m spacy download fr_core_news_sm (check)

python -m spacy download fr_core_news_md (check)

python -m spacy download fr_core_news_lg (check)

## Italian
python -m spacy download it_core_news_sm (check)

python -m spacy download it_core_news_md (check)

python -m spacy download it_core_news_lg (check)

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.1 Define all required functions to preprocess the data (pre-cleaning, tokenizing, removing stop words and lemmatization)
</h2>
</div>

In [2]:
## Define all required functions for the batchwise data preprocessing

# Define a function to prepare/pre-clean the text data
def pre_clean(articles):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, list)):
        raise ValueError("Invalid input type. Expected a list.")

    # Keep track of the processing time
    t = time.time()

    # Remove any links starting with http:// or https://
    articles = [re.compile('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+').sub('',x) for x in articles]
    # Remove any links starting with www.
    articles = [re.compile('www\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+').sub('',x) for x in articles]
    
    # Replace punctuations which are not followed by a blank with punctuations followed by a blank
    articles = [re.sub(r'[\.]', '. ', x) for x in articles]
    # Separate words in which a lowercase letter is followed by a capital letter, since they usually do not belong together
    articles = [re.sub('(^[a-z]*)+([A-Z])', r'\1 \2', x) for x in articles]
    # Correct manually for those cases where a name like 'McDonalds' was separated to Mc Donalds
    articles = [re.sub('Mc ', 'Mc', x) for x in articles]
    # Replace quotation marks with a blank
    articles = [re.sub('«', ' ', x) for x in articles]
    articles = [re.sub('»', ' ', x) for x in articles]
    # Remove percentage signs
    articles = [re.sub('%', ' ', x) for x in articles]
    # Remove distracting hyphens
    articles = [re.sub("-", " ", x) for x in articles]
    articles = [re.sub("–", " ", x) for x in articles]
    # Replace control characters (e.g. \n or \t) and multiple blanks with a single blank
    articles = [re.sub('\s+', ' ', x) for x in articles]
    # Print out the processing time
    print("Processing time for pre-cleaning: ", str((time.time() - t)/60), "minutes")

    # Return the pre-cleaned text data
    return articles


# Define a function to perform the following tasks at once:
## 1. Tokenize: transform text into a list of words, digits and punctuations
## 2. Remove stopwords
## 3. Filter the tokens, such that only nouns, proper nouns, verbs, adjectives and adverbs are kept, while digits and punctuations are removed
## 4. Lemmatize: transform each word back to its word stem
## 5. Lowercase the entire data
def tokenize_filter_and_lemmatize(articles, nlp):
    # Keep track of the processing time
    t = time.time()
    # Create a list to store the output
    articles_out = []
    # Define the list of allowed postags (pos = part of speech)
    allowed_postags = ['PROPN', 'NOUN', 'ADJ', 'VERB', 'ADV']
    # Create a loop to go through all articles in the input list of articles
    for article in articles:
        # Define the current article as the focal document
        doc = nlp(article)
        # Tokenize, filter and lemmatize the document, while filtering out stop words, punctuations and unused word types (i.e. words with a postag that is not contained in the 'allowed_posttags' variable)
        articles_out.append([token.lemma_.lower() for token in doc if (token.pos_ in allowed_postags and not token.is_stop)])
    # Print out the processing time
    print("Processing time for tokenizing, filtering and lemmatization: ", str((time.time() - t)/60), "minutes")
    return articles_out

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.2 Define a function to preprocess and export the data batchwise
</h2>
</div>

In [7]:
## Define the function to apply batchwise processing of the text data

# Note: "articles" has to be a dataframe with a column tx containing the text files
def process_batchwise(articles, language, batch_size = 100000, first_batch_number = 1):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, pd.DataFrame)):
        raise ValueError("Invalid input type. Expected a pandas DataFrame.")
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)

    # Get the number of batches
    nbatches = int((len(articles)-1)/batch_size) + 1
    # Store the index of the articles
    idx = articles.index
    # Convert the column of the dataframe that contains the articles to a list of articles, while overwriting the variable 'articles' to save RAM
    articles = articles.tx.values.tolist()

    # Initialize the appropriate spacy model depending on the language of the text data, while keeping only the tagger component (for efficiency)
    if language == 'de':
        nlp = spacy.load('de_core_news_lg', disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    elif language == 'en':
        nlp = spacy.load('en_core_web_lg', disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    elif language == 'fr':
        nlp = spacy.load('fr_core_news_lg', disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    elif language == 'it':
        nlp = spacy.load('it_core_news_lg', disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    
    # Set up a loop to process the data batchwise
    for i in range(nbatches):
        print('Processing batch #', i+first_batch_number, '...')
        # Select the data related to the current batch
        batch_min = batch_size * i
        if i == (nbatches - 1):
            batch_max = len(articles)
        else:
            batch_max = batch_size * (i+1)
        batch_tx = articles[batch_min:batch_max]

        # Pre-clean the data
        batch_tx = pre_clean(batch_tx)
        # Tokenize, filter and lemmatize the data
        batch_tx = tokenize_filter_and_lemmatize(batch_tx, nlp)

        ## Save the processed text data to a csv file
        # Generate a list containing the preprocessed data in form of strings in which all lemmatized phrases are contained and separated by a blank (such that it's easy to read in later)
        batch_tx_out = []
        for article in batch_tx:
            batch_tx_out.append(" ".join(article))
        # Create a correctly indexed dataframe containing the preprocessed data in a column and export it as a csv file
        pd.DataFrame(batch_tx_out, index = idx[batch_min:batch_max], columns = ['tx']).to_csv(
            "Preprocessed/Lemmatized/"+language+"_lemmatized_batch_"+str(i+first_batch_number)+".csv", index = True, encoding = 'utf-8-sig'
        )

        # Delete large unused variables to save memory
        del batch_tx_out
    print("DONE! ;)")

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.3 Apply batchwise preprocessing and store the preprocessed data externally as csv files
</h2>
</div>

In [8]:
# Apply batchwise preprocessing by means of the previously defined function
process_batchwise(fr_tx, language = 'fr', batch_size = 100000, first_batch_number = 1)

Processing batch # 1 ...
Processing time for pre-cleaning:  0.4011604468027751 minutes
Processing time for tokenizing, filtering and lemmatization:  76.61191082000732 minutes
Processing batch # 2 ...
Processing time for pre-cleaning:  0.4837564984957377 minutes
Processing time for tokenizing, filtering and lemmatization:  79.22954199314117 minutes
Processing batch # 3 ...
Processing time for pre-cleaning:  0.45687848726908364 minutes
Processing time for tokenizing, filtering and lemmatization:  73.37979460557303 minutes
Processing batch # 4 ...
Processing time for pre-cleaning:  0.5389755050341288 minutes
Processing time for tokenizing, filtering and lemmatization:  105.8040094335874 minutes
Processing batch # 5 ...
Processing time for pre-cleaning:  0.36118507385253906 minutes
Processing time for tokenizing, filtering and lemmatization:  65.33602706988653 minutes
DONE! ;)


<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
3. Read in and inspect the filtered and lemmatized data
</h2>
</div>

In [7]:
# Define a function to read in and concatenate the filtered and lemmatized data
def read_lemmatized(language, tokenize = True):
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)

    # Set the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data\\Preprocessed\\Lemmatized')

    # Get a list of all files to read and concatenate
    extension = 'csv'
    all_filenames = [i for i in glob.glob(language+"_lemmatized_batch_*.{}".format(extension))]
    # Concatenate all files in the list to one dataframe
    batches_aggregated = pd.concat([pd.read_csv(f, index_col = 0, dtype = {'tx': object}) for f in all_filenames])
    # Get the articles' indices together with an enumeration to identify them in the list of filtered and lemmatized articles
    idx = batches_aggregated.index
    idx = pd.DataFrame(idx, columns = [language+'_idx'])
    # Tokenize the data again if tokenize = True
    if tokenize:
        batches_aggregated = retokenize(batches_aggregated.tx.values.tolist())
    else:
        batches_aggregated = batches_aggregated.tx.values.tolist()
    
    # Reset the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data')  

    # Return the precleaned data
    return batches_aggregated, idx

# Define a function to retokenize the filtered and lemmatized text data
def retokenize(articles):
    articles_out = []
    for article in articles:
        articles_out.append(article.split())
    return articles_out

In [8]:
# Read in the filtered and lemmatized data
fr_tx_lemm, fr_idx = read_lemmatized('fr', tokenize = True)

In [9]:
# Take a look at the dataframe containing the according index
fr_idx

Unnamed: 0,fr_idx
0,0
1,1
2,2
3,3
4,4
...,...
481157,2436478
481158,2436479
481159,2436480
481160,2436481


In [10]:
# Take a look at the size of the filtered and lemmatized data
sys.getsizeof(fr_tx_lemm)

4290016

In [11]:
# Take a look at the first few tokens of the first element of filtered and lemmatized data
fr_tx_lemm[0][:6]

['bourse', 'york', 'terminer', 'hausse', 'mercredi', 'espoir']

In [12]:
# Compare it to the initial text
fr_tx.tx.iloc[0]

"La Bourse de New York a terminé en hausse mercredi, sur les espoirs d'un prochain accord sur un nouveau plan d'aide économique américain qui a mené le Dow Jones brièvement au-dessus de 2% en séance.Le Dow Jones Industrial Average a avancé de 1,20% à 27.781,70 points. Le Nasdaq a gagné 0,74% à 11.167,50 points et le S&P 500, a progressé de 1,05% à 3370,53 points.La Bourse de New York avait clôturé anxieusement en légère baisse mardi avant le débat présidentiel. Le Dow Jones Industrial Average, avait cédé 0,48% et le Nasdaq -0,29%.Mercredi, la rencontre entre la cheffe des démocrates à la Chambre et le secrétaire américain au Trésor pour discuter d'une nouvelle aide économique, en panne depuis des mois, a suscité l'espoir d'un «compromis raisonnable», selon les mots de Steven Mnuchin.Cet optimisme a donné un coup de fouet aux actions, qui s'est brusquement tempéré «lorsque le chef des républicains au Sénat Mitch McConnell est sorti et a dit que les positions étaient encore très, très él

In [13]:
# Remove unnecessary variables to save RAM
del fr_tx_lemm

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
4. Supplemental (manual) cleaning of the filtered and lemmatized data
</h2>
</div>

In [14]:
# Read in the filtered and lemmatized data (untokenized)
fr_tx, fr_idx = read_lemmatized('fr', tokenize = False) # Overwrite the variable containing the uncleaned data to save RAM

In [15]:
# Take a look at the dataframe containing the according index
fr_idx

Unnamed: 0,fr_idx
0,0
1,1
2,2
3,3
4,4
...,...
481157,2436478
481158,2436479
481159,2436480
481160,2436481


In [16]:
# Take a look at the first element of the (untokenized) filtered and lemmatized data
fr_tx[0]

'bourse york terminer hausse mercredi espoir prochain accord plan aide économique américain mener dow jones brièvement séance dow jones industrial average avancer 781,70 point nasdaq gagner point s&p progresser 1,05 point bourse york clôturer anxieusement léger baisse mardi débat présidentiel dow jones industrial average céder nasdaq mercredi rencontre cheffe démocrate chambre secrétaire américain trésor discuter nouveau aide économique panne mois susciter espoir compromis raisonnable mot steven mnuchin optimisme donner coup fouet action brusquement tempérer chef républicain sénat mitch mcconnell sortir position éloigné expliquer karl haeling lbbw bourse york introduction fanfare cotation titre discret groupe surveillance donnée palantir prix valoriser milliard dollar symbole pltr titre clôturer dollar prix indicatif dollar donner mardi soir new york stock lire page titre fabricant camion électrique hydrogène nikola reprendre vigueur +14,54 dollar descente enfer marquer perte tiers val

In [17]:
# Define a function to apply the supplemental/manual cleaning to the filtered and lemmatized data
def supp_clean(articles):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, list)):
        raise ValueError("Invalid input type. Expected a list.")

    # Keep track of the processing time
    t = time.time()
    # Remove any instances where 1 to 3 initiating letters are followed by a dot (either once at the end or after each letter), since such cases usually represent abbreviations with low semantic meaning
    articles = [re.compile(' [a-z][\.]?[a-z]?[\.]?[a-z]?\.+').sub(' ', x) for x in articles]
    # Remove any remaining digit
    articles = [re.sub(r'\b\d+\b', '', x) for x in articles]
    # Remove anything except words, spaces and the & sign, since this might appear in certain names
    articles = [re.sub(r'[^\w\s\&]','', x) for x in articles]
    # Remove a list of specific words which appear quite often but do not seem to add any semantic value
    words_to_remove = ['awp','afp']
    for word in words_to_remove:
        articles = [re.sub(' '+word, '', x) for x in articles]   
    # Replace control characters (e.g. \n or \t) and multiple blanks with a single blank
    articles = [re.sub('\s+', ' ', x) for x in articles]
    # Print out the processing time
    print("Processing time for supplemental manual cleaning: ", str((time.time() - t)/60), "minutes")

    # Return the manually cleaned text data
    return articles

In [18]:
# Apply the supplemental/manual cleaning by means of the previously defined function
fr_tx = supp_clean(fr_tx)

Processing time for supplemental manual cleaning:  1.3890565633773804 minutes


In [19]:
# Take a look at the first element of the fully preprocessed data
fr_tx[0]

'bourse york terminer hausse mercredi espoir prochain accord plan aide économique américain mener dow jones brièvement séance dow jones industrial average avancer point nasdaq gagner point s&p progresser point bourse york clôturer anxieusement léger baisse mardi débat présidentiel dow jones industrial average céder nasdaq mercredi rencontre cheffe démocrate chambre secrétaire américain trésor discuter nouveau aide économique panne mois susciter espoir compromis raisonnable mot steven mnuchin optimisme donner coup fouet action brusquement tempérer chef républicain sénat mitch mcconnell sortir position éloigné expliquer karl haeling lbbw bourse york introduction fanfare cotation titre discret groupe surveillance donnée palantir prix valoriser milliard dollar symbole pltr titre clôturer dollar prix indicatif dollar donner mardi soir new york stock lire page titre fabricant camion électrique hydrogène nikola reprendre vigueur dollar descente enfer marquer perte tiers valeur introduction bo

In [20]:
# Compare the fully preprocessed data to the initial text (copy paste from above)

"La Bourse de New York a terminé en hausse mercredi, sur les espoirs d'un prochain accord sur un nouveau plan d'aide économique américain qui a mené le Dow Jones brièvement au-dessus de 2% en séance.Le Dow Jones Industrial Average a avancé de 1,20% à 27.781,70 points. Le Nasdaq a gagné 0,74% à 11.167,50 points et le S&P 500, a progressé de 1,05% à 3370,53 points.La Bourse de New York avait clôturé anxieusement en légère baisse mardi avant le débat présidentiel. Le Dow Jones Industrial Average, avait cédé 0,48% et le Nasdaq -0,29%.Mercredi, la rencontre entre la cheffe des démocrates à la Chambre et le secrétaire américain au Trésor pour discuter d'une nouvelle aide économique, en panne depuis des mois, a suscité l'espoir d'un «compromis raisonnable», selon les mots de Steven Mnuchin.Cet optimisme a donné un coup de fouet aux actions, qui s'est brusquement tempéré «lorsque le chef des républicains au Sénat Mitch McConnell est sorti et a dit que les positions étaient encore très, très éloignées», a expliqué Karl Haeling de LBBW.La Bourse de New York a vu aussi l'introduction en fanfare, via une cotation directe, des titres du discret groupe de surveillance de données Palantir, à un prix le valorisant à plus de 21 milliards de dollars. Sous le symbole PLTR, le titre a clôturé à 9,73 dollars, soit bien au dessus du prix indicatif de 7,25 dollars donné mardi soir par le New York Stock Exchange (lire page 11)Le titre du fabricant de camions électriques et à hydrogène Nikola a repris de la vigueur (+14,54% à 20,48 dollars) après sa descente aux enfers marquée par la perte de deux tiers de sa valeur depuis son introduction en bourse.Nikola a ajourné mercredi un événement au cours duquel il devait présenter en grande pompe son nouveau pick-up Badger.Quasiment tous les secteurs du S&P ont terminé dans le vert, celui de la santé en tête. les laboratoires Pfizer ont pris 1,41%. Des grands noms de la tech ont progressé comme Microsoft et Apple (+1,50%).Sur le marché obligataire, le taux à 10 ans sur la dette américaine augmentait à 0,6840% contre 0,6495% mardi soir. – (afp)"

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
5. Export the fully preprocessed data as one csv file
</h2>
</div>

In [21]:
# Define a function to export the fully preprocessed data
def export_preprocessed(language, articles, idx, data_tokenized = True):
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)

    # Set the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data')

    # Untokenize the data if it is still tokenized
    if data_tokenized:
        # Generate a list containing the fully preprocessed data in form of strings in which all precleaned unigrams are contained and separated by a blank (such that it's easy to read in later)
        articles_out = []
        for article in articles:
            articles_out.append(" ".join(article))
        # Overwrite the variable which stores the tokenized articles
        articles = articles_out
        # Delete the variable articles_out to save RAM
        del articles_out
    
    # Create a correctly indexed dataframe containing the fully preprocessed data in a column and export it as a csv file
    pd.DataFrame(articles, index = idx, columns = ['tx']).to_csv("Preprocessed/"+language+"_preprocessed.csv", index = True, encoding = 'utf-8-sig')

In [22]:
# Export the fully preprocessed data
export_preprocessed('fr', fr_tx, fr_idx.fr_idx.to_list(), False)

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
6. Read in the fully preprocessed data
</h2>
</div>

In [3]:
# Define a function to read in the fully preprocessed data
def read_preprocessed(language, tokenize = True):
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)
    
    # Set the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data')

    # Define the name of the file to load
    filename = "Preprocessed/"+language+"_preprocessed.csv"

    # Read in the dataframe containing the text data
    tx_pp = pd.read_csv(filename, index_col = 0, dtype = {'tx': object})

    # Get the articles' index together with an enumeration to identify their position in the list of precleaned articles
    idx = tx_pp.index
    idx = pd.DataFrame(idx, columns = [language+'_idx'])

    # Reduce the dataframe to a list containing the text data
    tx_pp = tx_pp.tx.to_list()

    # Tokenize the data again if tokenize = True (RAM-saving)
    if tokenize:
        tx_pp = retokenize(tx_pp)

    # Return the preprocessed data
    return tx_pp, idx

# Define a function to retokenize the preprocessed text data (RAM-saving)
def retokenize(article_list):
    for i in range(len(article_list)):
        temp_tx = str(article_list[i]).split()
        article_list[i] = temp_tx
    return article_list

In [4]:
# Read in the fully preprocessed data
fr_tx, fr_idx = read_preprocessed('fr', tokenize = True) # Overwrite the variables used above to save RAM

In [5]:
# Take a look at the dataframe containing the according index
fr_idx

Unnamed: 0,fr_idx
0,0
1,1
2,2
3,3
4,4
...,...
481157,2436478
481158,2436479
481159,2436480
481160,2436481


In [6]:
# Take a look at the first element of the fully preprocessed and tokenized data
fr_tx[0]

['bourse',
 'york',
 'terminer',
 'hausse',
 'mercredi',
 'espoir',
 'prochain',
 'accord',
 'plan',
 'aide',
 'économique',
 'américain',
 'mener',
 'dow',
 'jones',
 'brièvement',
 'séance',
 'dow',
 'jones',
 'industrial',
 'average',
 'avancer',
 'point',
 'nasdaq',
 'gagner',
 'point',
 's&p',
 'progresser',
 'point',
 'bourse',
 'york',
 'clôturer',
 'anxieusement',
 'léger',
 'baisse',
 'mardi',
 'débat',
 'présidentiel',
 'dow',
 'jones',
 'industrial',
 'average',
 'céder',
 'nasdaq',
 'mercredi',
 'rencontre',
 'cheffe',
 'démocrate',
 'chambre',
 'secrétaire',
 'américain',
 'trésor',
 'discuter',
 'nouveau',
 'aide',
 'économique',
 'panne',
 'mois',
 'susciter',
 'espoir',
 'compromis',
 'raisonnable',
 'mot',
 'steven',
 'mnuchin',
 'optimisme',
 'donner',
 'coup',
 'fouet',
 'action',
 'brusquement',
 'tempérer',
 'chef',
 'républicain',
 'sénat',
 'mitch',
 'mcconnell',
 'sortir',
 'position',
 'éloigné',
 'expliquer',
 'karl',
 'haeling',
 'lbbw',
 'bourse',
 'york',
 

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
7. Quantitative summary of the data cleaning
</h2>
</div>

In [7]:
# Count the total number of words contained after the data cleaning
nwords_after = 0
for article in fr_tx:
    nwords_after = nwords_after + len(article)
print('Total number of words contained after data cleaning:', nwords_after)

Total number of words contained after data cleaning: 98303836


In [8]:
# Get the average number of words per article after the data cleaning
avg_nwords_after = nwords_after/len(fr_tx)
avg_nwords_after
print('Average number of words per article after data cleaning:', avg_nwords_after)

Average number of words per article after data cleaning: 204.30506981016788


In [9]:
# Remove unnecessary variables to save RAM
del fr_tx, fr_idx

In [10]:
# Read in the uncleaned data
os.chdir('D:\\Dropbox\\MA_data')
fr_tx_uncleaned = pd.read_csv("agg_csv_sparse_fr.csv", index_col = 0, dtype = {'so': object, 'la': object, 'tx': object})

In [11]:
## Count the total number of words contained before the data cleaning
# Note: to get an appropriate count of the distinct words we must at least apply the sparse preprocessing first, to ensure that all words are separated properly and distracting signs are removed
fr_tx_uncleaned = pre_clean(fr_tx_uncleaned.tx.tolist())
# Count the total number of words
nwords_before = 0
for article in fr_tx_uncleaned:
    nwords_before = nwords_before + len(article.split())
print('Total number of words contained before data cleaning:', nwords_before)

Processing time for pre-cleaning:  2.4118225812911986 minutes
Total number of words contained before data cleaning: 209660898


In [12]:
# Get the average number of words per article before the data cleaning
avg_nwords_before = nwords_before/len(fr_tx_uncleaned)
print('Average number of words per article before data cleaning:', avg_nwords_before)

Average number of words per article before data cleaning: 435.7386867624625


In [13]:
# Get the number of removed words
nwords_rm = nwords_before - nwords_after
print('Number of words removed by the data cleaning:', nwords_rm)

Number of words removed by the data cleaning: 111357062


In [14]:
# Get the ratio of the words that have been removed
ratio_removed = nwords_rm / nwords_before
print(np.round(ratio_removed*100,4), 'percent of the words have been removed by the data cleaning.')

53.1129 percent of the words have been removed by the data cleaning.
