<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:10px 5px'> 
Master Thesis Yannik Haller - Data Preprocessing for the Naïve Sentiment Classifier
</h1>
</div>

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
1. Load required packages and the data
</h2>
</div>

In [1]:
# Import required baseline packages
import re
import os
import glob
import time
import sys
import pandas as pd
import numpy as np
from pprint import pprint

# Change pandas' setting to print out long strings
pd.options.display.max_colwidth = 200

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Spacy (for lemmatization)
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim (optional)
import logging
logging.basicConfig(format = '%(asctime)s : %(levelname)s : %(message)s', level = logging.ERROR)

import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)

  def _figure_formats_changed(self, name, old, new):


In [2]:
# Set the appropriate working directory
os.chdir('D:\\Dropbox\\MA_data')

# Read in the aggregated data
it_tx = pd.read_csv("agg_csv_sparse_it.csv", index_col = 0, dtype = {'so': object, 'la': object, 'tx': object})

In [3]:
# Take a look at the shape of the data
it_tx.shape

(23621, 3)

In [4]:
# Store the article IDs (i.e. index) of the language specific subsets
it_idx = it_tx.index  # Italian

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2. Preprocess the text data batchwise
</h2>
</div>

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.1 Define all required functions to preprocess the data (pre-cleaning, tokenizing, removing stop words and lemmatization)
</h2>
</div>

In [5]:
## Prerequisite

# Run in terminal (cmd) after the appropriate environment has been activated (activate Master_Thesis_env)
# German
#python -m spacy download de_core_news_sm (check)
#python -m spacy download de_core_news_md (check)
#python -m spacy download de_core_news_lg (check)

# English
#python -m spacy download en_core_web_sm (check)
#python -m spacy download en_core_web_md (check)
#python -m spacy download en_core_web_lg (check) --> dir :  C:\Users\Hallk\AppData\Local\Temp\pip-ephem-wheel-cache-tl3c3f5d\wheels\41\75\77\c4a98e18b2c317a2a13931cbbea7e3ca7f3a21efc36adc1d71

# French
#python -m spacy download fr_core_news_sm (check)
#python -m spacy download fr_core_news_md (check)
#python -m spacy download fr_core_news_lg (check)

# Italian
#python -m spacy download it_core_news_sm (check)
#python -m spacy download it_core_news_md (check)
#python -m spacy download it_core_news_lg (check)

In [2]:
## Define all required functions for the batchwise data preprocessing

# Define a function to prepare/pre-clean the text data
def pre_clean(articles):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, list)):
        raise ValueError("Invalid input type. Expected a list.")

    # Keep track of the processing time
    t = time.time()
    # Replace punctuations which are not followed by a blank with punctuations followed by a blank
    articles = [re.sub(r'[\.]', '. ', x) for x in articles]
    # Separate words in which a lowercase letter is followed by a capital letter, since they usually do not belong together
    articles = [re.sub('(^[a-z]*)+([A-Z])', r'\1 \2', x) for x in articles]
    # Correct manually for those cases where a name like 'McDonalds' was separated to Mc Donalds
    articles = [re.sub('Mc ', 'Mc', x) for x in articles]
    # Replace quotation marks with a blank
    articles = [re.sub('«', ' ', x) for x in articles]
    articles = [re.sub('»', ' ', x) for x in articles]
    # Remove percentage signs
    articles = [re.sub('%', ' ', x) for x in articles]
    # Remove distracting hyphens
    articles = [re.sub("-", " ", x) for x in articles]
    articles = [re.sub("–", " ", x) for x in articles]
    # Replace new line characters (i.e. \n) and multiple blanks with a single blank
    articles = [re.sub('\s+', ' ', x) for x in articles]
    # Print out the processing time
    print("Processing time for pre-cleaning: ", str((time.time() - t)/60), "minutes")

    # Return the pre-cleaned text data
    return articles


# Define a function to perform the following tasks at once:
## 1. Tokenize: transform text into a list of words, digits and punctuations
## 2. Filter the tokens, such that only nouns, proper nouns, verbs, adjectives, adverbs and negations are kept, while digits and punctuations are removed
## 3. Lemmatize: transform each word back to its word stem
## 4. Lowercase the entire data
def tokenize_filter_and_lemmatize(articles, nlp):
    # Keep track of the processing time
    t = time.time()
    # Create a list to store the output
    articles_out = []
    # Define the list of allowed postags (pos = part of speech)
    allowed_postags = ['PROPN', 'NOUN', 'ADJ', 'VERB', 'ADV']
    # Define the list of allowed negations (for Italian)
    allowed_negations = ['no', 'non', 'niente', 'nessuno']
    # Create a loop to go through all articles in the input list of articles
    for article in articles:
        # Define the current article as the focal document
        doc = nlp(article)
        # Tokenize, filter and lemmatize the document, while filtering punctuations and unused word types (i.e. words which are not contained in the 'allowed_posttags' variable)
        articles_out.append([token.lemma_.lower() for token in doc if (token.pos_ in allowed_postags) or (token.lemma_.lower() in allowed_negations)])
    # Print out the processing time
    print("Processing time for tokenizing, filtering and lemmatization: ", str((time.time() - t)/60), "minutes")
    return articles_out

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.2 Define a function to preprocess and export the data batchwise
</h2>
</div>

In [7]:
## Define the function to apply batchwise processing of the text data

# Note: "articles" has to be a dataframe with a column tx containing the text files
def process_batchwise(articles, language, batch_size = 100000, first_batch_number = 1):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, pd.DataFrame)):
        raise ValueError("Invalid input type. Expected a pandas DataFrame.")
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)

    # Get the number of batches
    nbatches = int((len(articles)-1)/batch_size) + 1
    # Store the index of the articles
    idx = articles.index
    # Convert the column of the dataframe that contains the articles to a list of articles, while overwriting the variable 'articles' to save RAM
    articles = articles.tx.values.tolist()

    # Initialize the appropriate spacy model depending on the language of the text data, while keeping only the tagger component (for efficiency)
    if language == 'de':
        nlp = spacy.load('de_core_news_lg', disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    elif language == 'en':
        nlp = spacy.load('en_core_web_lg', disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    elif language == 'fr':
        nlp = spacy.load('fr_core_news_lg', disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    elif language == 'it':
        nlp = spacy.load('it_core_news_lg', disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    
    # Set up a loop to process the data batchwise
    for i in range(nbatches):
        print('Processing batch #', i+first_batch_number, '...')
        # Select the data related to the current batch
        batch_min = batch_size * i
        if i == (batch_size - 1):
            batch_max = len(articles)
        else:
            batch_max = batch_size * (i+1)
        batch_tx = articles[batch_min:batch_max]

        # Pre-clean the data
        batch_tx = pre_clean(batch_tx)
        # Tokenize, filter and lemmatize the data
        batch_tx = tokenize_filter_and_lemmatize(batch_tx, nlp)

        ## Save the processed text data to a csv file
        # Generate a list containing the preprocessed data in form of strings in which all lemmatized phrases are contained and separated by a blank (such that it's easy to read in later)
        batch_tx_out = []
        for article in batch_tx:
            batch_tx_out.append(" ".join(article))
        # Create a correctly indexed dataframe containing the preprocessed data in a column and export it as a csv file
        pd.DataFrame(batch_tx_out, index = idx[batch_min:batch_max], columns = ['tx']).to_csv(
            "Preprocessed/Sentiment_Analysis/Lemmatized/"+language+"_lemmatized_senti_batch_"+str(i+first_batch_number)+".csv", index = True, encoding = 'utf-8-sig'
        )

        # Delete large unused variables to save memory
        del batch_tx_out
    print("DONE! ;)")

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.3 Apply batchwise preprocessing and store the preprocessed data externally as csv files
</h2>
</div>

In [8]:
# Apply batchwise preprocessing by means of the previously defined function
process_batchwise(it_tx, language = 'it', batch_size = 100000, first_batch_number = 1)

Processing batch # 1 ...
Processing time for pre-cleaning:  0.06947460969289145 minutes
Processing time for tokenizing, filtering and lemmatization:  11.725524806976319 minutes
DONE! ;)


<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
3. Read in and inspect the filtered and lemmatized data
</h2>
</div>

In [7]:
# Define a function to read in and concatenate the filtered and lemmatized data
def read_lemmatized(language, tokenize = True):
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)

    # Set the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data\\Preprocessed\\Sentiment_Analysis\\Lemmatized')

    # Get a list of all files to read and concatenate
    extension = 'csv'
    all_filenames = [i for i in glob.glob(language+"_lemmatized_senti_batch_*.{}".format(extension))]
    # Concatenate all files in the list to one dataframe
    batches_aggregated = pd.concat([pd.read_csv(f, index_col = 0, dtype = {'tx': object}) for f in all_filenames])
    # Get the articles' indices together with an enumeration to identify them in the list of filtered and lemmatized articles
    idx = batches_aggregated.index
    idx = pd.DataFrame(idx, columns = [language+'_idx'])
    # Tokenize the data again if tokenize = True
    if tokenize:
        batches_aggregated = retokenize(batches_aggregated.tx.values.tolist())
    else:
        batches_aggregated = batches_aggregated.tx.values.tolist()
    
    # Reset the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data')  

    # Return the precleaned data
    return batches_aggregated, idx

# Define a function to retokenize the filtered and lemmatized text data
def retokenize(articles):
    articles_out = []
    for article in articles:
        articles_out.append(article.split())
    return articles_out

In [8]:
# Read in the filtered and lemmatized data
it_tx_lemm, it_idx = read_lemmatized('it', tokenize = True)

In [9]:
# Take a look at the dataframe containing the according index
it_idx

Unnamed: 0,it_idx
0,313578
1,460527
2,460528
3,460529
4,460530
...,...
23616,2425111
23617,2425112
23618,2425113
23619,2425114


In [10]:
# Take a look at the size of the filtered and lemmatized data
sys.getsizeof(it_tx_lemm)

200320

In [11]:
# Take a look at the first few tokens of the first element of filtered and lemmatized data
it_tx_lemm[0][:6]

['fermati', 'obbligare', 'oppresso', 'mondare', 'ginocchio', 'forse']

In [12]:
# Compare it to the initial text
it_tx.tx.iloc[0]

'Fermati obbligato a meditare. Hai oppresso il mondo in ginocchio, forse per pregare …Applaudivamo allo stadio. Ora acclamiamo dal balcone, dai ricchi del pallone a splendidi medici.Ci hai tolto la stretta di mano, la cosa più amicale, il bacio. Senza contatti un solo sfiorarsi, come un soffio di vento. Soli come un solco senza seme.A chiederci i mille perché.Convinti che la vita è breve, non va perso neanche un istante, potrebbe nascere la luna piena, in tutta la sua bellezza e verità.Ci torna felice, compiuto il canto: non camminerete mai da soli.Dubitiamo negli esperti del nulla. Lasciamo volare il pipistrello, maestro della biodiversità, senza colpe.Usciamo dall’inferno che toglie il respiro. Reagiamo al torrente della vita. Dissetiamoci alle fonti naturali, sempre convinti che l’amore ci  salverà, unica risposta e tutto vince.Felicità, serenità e si tornerà a  sorridere.Sicuri che Dio sconfiggerà anche il coronavirus.Rodolfo Fasani, Mesocco'

In [13]:
# Remove unnecessary variables to save RAM
del it_tx_lemm

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
4. Supplemental (manual) cleaning of the filtered and lemmatized data
</h2>
</div>

In [14]:
# Read in the filtered and lemmatized data (untokenized)
it_tx, it_idx = read_lemmatized('it', tokenize = False) # Overwrite the variable containing the uncleaned data to save RAM

In [15]:
# Take a look at the dataframe containing the according index
it_idx

Unnamed: 0,it_idx
0,313578
1,460527
2,460528
3,460529
4,460530
...,...
23616,2425111
23617,2425112
23618,2425113
23619,2425114


In [16]:
# Take a look at the first element of the (untokenized) filtered and lemmatized data
it_tx[0]

'fermati obbligare oppresso mondare ginocchio forse applaudivamo stadio ora acclamare balcone ricco pallone splendido medico togliere stringere mano cosa molto baciare contatto solere sfiorarsi soffiare vento soli solcare seme chiederci perché convinti vita essere non perdere neanche istante lunare pieno bellezza verità tornire compiere cantare non camminare mai solo dubitiamo esperto nullo lasciamo pipistrello maestro biodiversità colpa usciamo inferno togliere respirare reagiamo torrente vita dissetiamoci fonte naturale sempre convinto amore salvare unico rispondere vincere felicità serenità tornare sicuri dio sconfiggere anche coronavirus rodolfo fasani mesocco'

In [17]:
# Define a function to apply the supplemental/manual cleaning to the filtered and lemmatized data
def supp_clean(articles):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, list)):
        raise ValueError("Invalid input type. Expected a list.")

    # Keep track of the processing time
    t = time.time()
    # Remove any links starting with http:// or https://
    articles = [re.compile('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+').sub('',x) for x in articles]
    # Remove any links starting with www.
    articles = [re.compile('www\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+').sub('',x) for x in articles]
    # Remove any instances where 1 to 3 initiating letters are followed by a dot (either once at the end or after each letter), since such cases usually represent abbreviations with low semantic meaning
    articles = [re.compile(' [a-z][\.]?[a-z]?[\.]?[a-z]?\.+').sub(' ', x) for x in articles]
    # Remove any remaining digit
    articles = [re.sub(r'\b\d+\b', '', x) for x in articles]
    # Remove anything except words, spaces and the & sign, since this might appear in certain names
    articles = [re.sub(r'[^\w\s\&]','', x) for x in articles]
    # Remove a list of specific words which appear quite often but do not seem to add any semantic value (and the auxiliary verb essere, as this is not identified as such by the Italian Spacy module)
    words_to_remove = ['awp','afp','essere']
    for word in words_to_remove:
        articles = [re.sub(' '+word, '', x) for x in articles] 
    # Replace new line characters (i.e. \n) and multiple blanks with a single blank
    articles = [re.sub('\s+', ' ', x) for x in articles]
    # Print out the processing time
    print("Processing time for supplemental manual cleaning: ", str((time.time() - t)/60), "minutes")

    # Return the manually cleaned text data
    return articles

In [18]:
# Apply the supplemental/manual cleaning by means of the previously defined function
it_tx = supp_clean(it_tx)

Processing time for supplemental manual cleaning:  0.04911261002222697 minutes


In [19]:
# Take a look at the first element of the fully preprocessed data
it_tx[0]

'fermati obbligare oppresso mondare ginocchio forse applaudivamo stadio ora acclamare balcone ricco pallone splendido medico togliere stringere mano cosa molto baciare contatto solere sfiorarsi soffiare vento soli solcare seme chiederci perché convinti vita non perdere neanche istante lunare pieno bellezza verità tornire compiere cantare non camminare mai solo dubitiamo esperto nullo lasciamo pipistrello maestro biodiversità colpa usciamo inferno togliere respirare reagiamo torrente vita dissetiamoci fonte naturale sempre convinto amore salvare unico rispondere vincere felicità serenità tornare sicuri dio sconfiggere anche coronavirus rodolfo fasani mesocco'

In [20]:
# Compart the fully preprocessed data to the initial text (copy paste from above)

'Fermati obbligato a meditare. Hai oppresso il mondo in ginocchio, forse per pregare …Applaudivamo allo stadio. Ora acclamiamo dal balcone, dai ricchi del pallone a splendidi medici.Ci hai tolto la stretta di mano, la cosa più amicale, il bacio. Senza contatti un solo sfiorarsi, come un soffio di vento. Soli come un solco senza seme.A chiederci i mille perché.Convinti che la vita è breve, non va perso neanche un istante, potrebbe nascere la luna piena, in tutta la sua bellezza e verità.Ci torna felice, compiuto il canto: non camminerete mai da soli.Dubitiamo negli esperti del nulla. Lasciamo volare il pipistrello, maestro della biodiversità, senza colpe.Usciamo dall’inferno che toglie il respiro. Reagiamo al torrente della vita. Dissetiamoci alle fonti naturali, sempre convinti che l’amore ci  salverà, unica risposta e tutto vince.Felicità, serenità e si tornerà a  sorridere.Sicuri che Dio sconfiggerà anche il coronavirus.Rodolfo Fasani, Mesocco'

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
5. Export the fully preprocessed data as one csv file
</h2>
</div>

In [21]:
# Define a function to export the fully preprocessed data
def export_preprocessed(language, articles, idx, data_tokenized = True):
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)

    # Set the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data')

    # Untokenize the data if it is still tokenized
    if data_tokenized:
        # Generate a list containing the fully preprocessed data in form of strings in which all precleaned unigrams are contained and separated by a blank (such that it's easy to read in later)
        articles_out = []
        for article in articles:
            articles_out.append(" ".join(article))
        # Overwrite the variable which stores the tokenized articles
        articles = articles_out
        # Delete the variable articles_out to save RAM
        del articles_out
    
    # Create a correctly indexed dataframe containing the fully preprocessed data in a column and export it as a csv file
    pd.DataFrame(articles, index = idx, columns = ['tx']).to_csv("Preprocessed/Sentiment_Analysis/"+language+"_preprocessed_senti.csv", index = True, encoding = 'utf-8-sig')

In [22]:
# Export the fully preprocessed data
export_preprocessed('it', it_tx, it_idx.it_idx.to_list(), False)

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
6. Read in the fully preprocessed data
</h2>
</div>

In [3]:
# Define a function to read in the fully preprocessed data
def read_preprocessed(language, tokenize = True):
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)
    
    # Set the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data')

    # Define the name of the file to load
    filename = "Preprocessed/Sentiment_Analysis/"+language+"_preprocessed_senti.csv"

    # Read in the dataframe containing the text data
    tx_pp = pd.read_csv(filename, index_col = 0, dtype = {'tx': object})

    # Get the articles' index together with an enumeration to identify their position in the list of precleaned articles
    idx = tx_pp.index
    idx = pd.DataFrame(idx, columns = [language+'_idx'])

    # Reduce the dataframe to a list containing the text data
    tx_pp = tx_pp.tx.to_list()

    # Tokenize the data again if tokenize = True (RAM-saving)
    if tokenize:
        tx_pp = retokenize(tx_pp)

    # Return the preprocessed data
    return tx_pp, idx

# Define a function to retokenize the preprocessed text data (RAM-saving)
def retokenize(article_list):
    for i in range(len(article_list)):
        temp_tx = str(article_list[i]).split()
        article_list[i] = temp_tx
    return article_list

In [4]:
# Read in the fully preprocessed data
it_tx, it_idx = read_preprocessed('it', tokenize = True) # Overwrite the variables used above to save RAM

In [5]:
# Take a look at the dataframe containing the according index
it_idx

Unnamed: 0,it_idx
0,313578
1,460527
2,460528
3,460529
4,460530
...,...
23616,2425111
23617,2425112
23618,2425113
23619,2425114


In [6]:
# Take a look at the first element of the fully preprocessed and tokenized data
it_tx[0]

['fermati',
 'obbligare',
 'oppresso',
 'mondare',
 'ginocchio',
 'forse',
 'applaudivamo',
 'stadio',
 'ora',
 'acclamare',
 'balcone',
 'ricco',
 'pallone',
 'splendido',
 'medico',
 'togliere',
 'stringere',
 'mano',
 'cosa',
 'molto',
 'baciare',
 'contatto',
 'solere',
 'sfiorarsi',
 'soffiare',
 'vento',
 'soli',
 'solcare',
 'seme',
 'chiederci',
 'perché',
 'convinti',
 'vita',
 'non',
 'perdere',
 'neanche',
 'istante',
 'lunare',
 'pieno',
 'bellezza',
 'verità',
 'tornire',
 'compiere',
 'cantare',
 'non',
 'camminare',
 'mai',
 'solo',
 'dubitiamo',
 'esperto',
 'nullo',
 'lasciamo',
 'pipistrello',
 'maestro',
 'biodiversità',
 'colpa',
 'usciamo',
 'inferno',
 'togliere',
 'respirare',
 'reagiamo',
 'torrente',
 'vita',
 'dissetiamoci',
 'fonte',
 'naturale',
 'sempre',
 'convinto',
 'amore',
 'salvare',
 'unico',
 'rispondere',
 'vincere',
 'felicità',
 'serenità',
 'tornare',
 'sicuri',
 'dio',
 'sconfiggere',
 'anche',
 'coronavirus',
 'rodolfo',
 'fasani',
 'mesocco']

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
7. Quantitative summary of the data cleaning
</h2>
</div>

In [7]:
# Count the total number of words contained after the data cleaning
nwords_after = 0
for article in it_tx:
    nwords_after = nwords_after + len(article)
print('Total number of words contained after data cleaning:', nwords_after)

Total number of words contained after data cleaning: 2999296


In [8]:
# Get the average number of words per article after the data cleaning
avg_nwords_after = nwords_after/len(it_tx)
avg_nwords_after
print('Average number of words per article after data cleaning:', avg_nwords_after)

Average number of words per article after data cleaning: 126.97582659497904


In [9]:
# Remove unnecessary variables to save RAM
del it_tx, it_idx

In [10]:
# Read in the uncleaned data
os.chdir('D:\\Dropbox\\MA_data')
it_tx_uncleaned = pd.read_csv("agg_csv_sparse_it.csv", index_col = 0, dtype = {'so': object, 'la': object, 'tx': object})

In [11]:
## Count the total number of words contained before the data cleaning
# Note: to get an appropriate count of the distinct words we must at least apply the very first low-level precleaning, to ensure that all words are separated properly and distracting signs are removed
it_tx_uncleaned = pre_clean(it_tx_uncleaned.tx.tolist())
# Count the total number of words
nwords_before = 0
for article in it_tx_uncleaned:
    nwords_before = nwords_before + len(article.split())
print('Total number of words contained before data cleaning:', nwords_before)

Processing time for pre-cleaning:  0.06998445590337117 minutes
Total number of words contained before data cleaning: 5984264


In [12]:
# Get the average number of words per article before the data cleaning
avg_nwords_before = nwords_before/len(it_tx_uncleaned)
print('Average number of words per article before data cleaning:', avg_nwords_before)

Average number of words per article before data cleaning: 253.3450742982939


In [13]:
# Get the number of removed words
nwords_rm = nwords_before - nwords_after
print('Number of words removed by the data cleaning:', nwords_rm)

Number of words removed by the data cleaning: 2984968


In [14]:
# Get the ratio of the words that have been removed
ratio_removed = nwords_rm / nwords_before
print(np.round(ratio_removed*100,4),'percent of the words have been removed by the data cleaning')

49.8803 percent of the words have been removed by the data cleaning
