### Optimizing the preprocessing script
We compare three ways of building the preprocessing step:
- The original approach: define a single preprocessing function, chosen according to the parameters given to the class
- Multiple `map()` calls with lambda functions
- Multiple `map()` calls with existing module-level functions
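
The second and third options both rely on chaining `map()` calls. The following is a minimal sketch of that idea, using only the standard library and a hypothetical `strip_digits` helper that is not part of the class below. Each `map()` wraps the previous iterator lazily, so every condition is checked once and no document is processed until the final iterator is consumed.

```python
def strip_digits(text):
    """Hypothetical helper: drop ASCII digits from a string."""
    return text.translate(str.maketrans("", "", "0123456789"))


docs = ["Decreto 51", "Ley 20412"]  # invented sample documents
use_lowercase = True
use_strip_digits = True

# Each flag is checked once; map() only wraps the iterator and does no work yet.
steps = iter(docs)
if use_lowercase:
    steps = map(str.lower, steps)
if use_strip_digits:
    steps = map(strip_digits, steps)

# Consuming the iterator runs every selected step on each document in turn.
cleaned = list(steps)
print(cleaned)  # ['decreto ', 'ley ']
```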

In [23]:
import unicodedata
from collections import defaultdict

from bs4 import BeautifulSoup
from nltk import word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class CorpusPreprocess(BaseEstimator, TransformerMixin):
    def __init__(self, language='english', stop_words=None, lowercase=True, strip_accents=False,
                 strip_numbers=False, strip_punctuation=None, stemmer=None, max_df=1.0, min_df=1):
        """Scikit-learn like Transformer for Corpus preprocessing.
        Preprocesses text by applying multiple tasks (e.g. lowercasing, stemming, etc).
        Fits the data for obtaining vocabulary_ (mapping of terms to document frequencies)
         and stop_words_ (terms that were ignored because of either 'max_df', 'min_df' or 'stop_words').

        Args:
            language (str, optional): language of input text. Passed to word tokenizer. Defaults to 'english'.
            stop_words (list, optional): list of stop words to be removed. Defaults to None.
            lowercase (bool, optional): lowercases text if True. Defaults to True.
            strip_accents (bool, optional): strips accents from text if True. Defaults to False.
            strip_numbers (bool, optional): strips numbers from text if True. Defaults to False.
            strip_punctuation (iterable, optional): strips provided punctuation from text if not None.
             Defaults to None.
            stemmer (Stemmer instance, optional): applies the provided Stemmer's stem method to text.
             Defaults to None.
            max_df (float in range [0.0, 1.0] or int, optional): ignore terms with a document frequency higher 
             than the given threshold. If float, the parameter represents a proportion of documents, integer 
             absolute counts. Defaults to 1.0.
            min_df (float in range [0.0, 1.0] or int, optional): ignore terms with a document frequency lower 
             than the given threshold. If float, the parameter represents a proportion of documents, integer 
             absolute counts. Defaults to 1.

        Raises:
            ValueError: if max_df or min_df is negative
        """
        self.language = language
        self.stop_words = stop_words
        self.lowercase = lowercase
        self.strip_accents = strip_accents
        self.strip_numbers = strip_numbers
        self.strip_punctuation = strip_punctuation
        self.stemmer = stemmer
        self.max_df = max_df
        self.min_df = min_df
        if max_df < 0 or min_df < 0:
            raise ValueError("negative value for max_df or min_df")

    def fit(self, X, y=None):
        # Building vocabulary_ and stop_words_
        self.fit_transform(X)

        return self

    def fit_transform(self, X, y=None, tokenize=True):
        # Preprocess and tokenize corpus
        corpus = self._word_tokenizer(X)

        # Build vocabulary document frequencies
        vocab_df = defaultdict(int)
        for doc in corpus:
            for unique in set(doc):
                vocab_df[unique] += 1

        # Find stop_words_ based on max_df and min_df
        if self.stop_words is None:
            self.stop_words_ = set()
        else:
            self.stop_words_ = set(self.stop_words)

        if self.max_df is not None:
            if isinstance(self.max_df, float):
                vocab_rel_df = {k: v / len(X) for k, v in vocab_df.items()}
                self.stop_words_.update(
                    {k for k, v in vocab_rel_df.items() if v > self.max_df})
            else:
                self.stop_words_.update(
                    {k for k, v in vocab_df.items() if v > self.max_df})

        if self.min_df is not None:
            if isinstance(self.min_df, float):
                vocab_rel_df = {k: v / len(X) for k, v in vocab_df.items()}
                self.stop_words_.update(
                    {k for k, v in vocab_rel_df.items() if v < self.min_df})
            else:
                self.stop_words_.update(
                    {k for k, v in vocab_df.items() if v < self.min_df})

        # Remove stop_words_ from vocabulary
        for k in self.stop_words_:
            vocab_df.pop(k, None)

        # Set vocabulary_
        self.vocabulary_ = vocab_df

        # Remove stop_words_ from corpus (same filtering as in transform)
        corpus = [[token for token in doc if token not in self.stop_words_]
                  for doc in corpus]

        # Split vs merged
        if not tokenize:
            corpus = [" ".join(doc) for doc in corpus]

        return corpus

    def transform(self, X, y=None, tokenize=True):
        # Check if fit has been called
        check_is_fitted(self)

        # Preprocess and tokenize corpus
        corpus = self._word_tokenizer(X)

        # Remove stop_words from corpus
        corpus = [[token for token in doc if token not in self.stop_words_]
                  for doc in corpus]

        # Split vs merged
        if not tokenize:
            corpus = [" ".join(doc) for doc in corpus]

        return corpus

    def _word_tokenizer(self, X):
        """Preprocesses and tokenizes each document by applying a 
         preprocessing function.

        Args:
            X (iterable): documents to preprocess

        Returns:
            list: preprocessed and tokenized documents
        """
        # Define the function conditionally so the conditions are evaluated once instead of at every document
        if self.strip_accents and self.lowercase and self.strip_numbers and self.strip_punctuation is not None:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Lowercase
                doc = doc.lower()
                # Remove accentuation
                doc = unicodedata.normalize('NFKD', doc).encode(
                    'ASCII', 'ignore').decode('ASCII')
                # Remove numbers
                doc = doc.translate(str.maketrans('', '', "0123456789"))
                # Remove punctuation
                doc = doc.translate(str.maketrans(
                    '', '', self.strip_punctuation))
                return doc
        elif self.strip_accents and self.lowercase and self.strip_punctuation is not None:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Lowercase
                doc = doc.lower()
                # Remove accentuation
                doc = unicodedata.normalize('NFKD', doc).encode(
                    'ASCII', 'ignore').decode('ASCII')
                # Remove punctuation
                doc = doc.translate(str.maketrans(
                    '', '', self.strip_punctuation))
                return doc
        elif self.strip_accents and self.lowercase and self.strip_numbers:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Lowercase
                doc = doc.lower()
                # Remove accentuation
                doc = unicodedata.normalize('NFKD', doc).encode(
                    'ASCII', 'ignore').decode('ASCII')
                # Remove numbers
                doc = doc.translate(str.maketrans('', '', "0123456789"))
                return doc
        elif self.strip_accents and self.strip_numbers and self.strip_punctuation is not None:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Remove accentuation
                doc = unicodedata.normalize('NFKD', doc).encode(
                    'ASCII', 'ignore').decode('ASCII')
                # Remove numbers
                doc = doc.translate(str.maketrans('', '', "0123456789"))
                # Remove punctuation
                doc = doc.translate(str.maketrans(
                    '', '', self.strip_punctuation))
                return doc
        elif self.lowercase and self.strip_numbers and self.strip_punctuation is not None:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Lowercase
                doc = doc.lower()
                # Remove numbers
                doc = doc.translate(str.maketrans('', '', "0123456789"))
                # Remove punctuation
                doc = doc.translate(str.maketrans(
                    '', '', self.strip_punctuation))
                return doc
        elif self.strip_accents and self.lowercase:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Lowercase
                doc = doc.lower()
                # Remove accentuation
                doc = unicodedata.normalize('NFKD', doc).encode(
                    'ASCII', 'ignore').decode('ASCII')
                return doc
        elif self.strip_accents and self.strip_numbers:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Remove accentuation
                doc = unicodedata.normalize('NFKD', doc).encode(
                    'ASCII', 'ignore').decode('ASCII')
                # Remove numbers
                doc = doc.translate(str.maketrans('', '', "0123456789"))
                return doc
        elif self.strip_accents and self.strip_punctuation is not None:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Remove accentuation
                doc = unicodedata.normalize('NFKD', doc).encode(
                    'ASCII', 'ignore').decode('ASCII')
                # Remove punctuation
                doc = doc.translate(str.maketrans(
                    '', '', self.strip_punctuation))
                return doc
        elif self.lowercase and self.strip_numbers:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Lowercase
                doc = doc.lower()
                # Remove numbers
                doc = doc.translate(str.maketrans('', '', "0123456789"))
                return doc
        elif self.lowercase and self.strip_punctuation is not None:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Lowercase
                doc = doc.lower()
                # Remove punctuation
                doc = doc.translate(str.maketrans(
                    '', '', self.strip_punctuation))
                return doc
        elif self.strip_numbers and self.strip_punctuation is not None:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Remove numbers
                doc = doc.translate(str.maketrans('', '', "0123456789"))
                # Remove punctuation
                doc = doc.translate(str.maketrans(
                    '', '', self.strip_punctuation))
                return doc
        elif self.strip_accents:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Remove accentuation
                doc = unicodedata.normalize('NFKD', doc).encode(
                    'ASCII', 'ignore').decode('ASCII')
                return doc
        elif self.lowercase:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Lowercase
                doc = doc.lower()
                return doc
        elif self.strip_numbers:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Remove numbers
                doc = doc.translate(str.maketrans('', '', "0123456789"))
                return doc
        elif self.strip_punctuation is not None:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                # Remove punctuation
                doc = doc.translate(str.maketrans(
                    '', '', self.strip_punctuation))
                return doc
        else:
            def doc_preprocessing(doc):
                # Removes HTML tags
                doc = BeautifulSoup(doc, features="lxml").get_text()
                return doc

        # Apply cleaning function over X
        corpus = map(doc_preprocessing, X)

        # Word tokenizer
        corpus = [word_tokenize(doc, language=self.language) for doc in corpus]

        if self.stemmer is not None:
            corpus = [[self.stemmer.stem(token)
                       for token in doc] for doc in corpus]

        return corpus
    
    def _word_tokenizer1(self, X):
        """Preprocesses and tokenizes each document by applying a 
         preprocessing function.

        Args:
            X (iterable): documents to preprocess

        Returns:
            list: preprocessed and tokenized documents
        """
        docs = map(lambda x: BeautifulSoup(x, features="lxml").get_text(), X)
        
        if self.lowercase:
            docs = map(str.lower, docs)
        
        if self.strip_accents:
            docs = map(lambda x: unicodedata.normalize('NFKD', x).encode(
                    'ASCII', 'ignore').decode('ASCII'), docs)
        if self.strip_numbers:
            docs = map(lambda x: x.translate(str.maketrans('', '', "0123456789")), docs)
        
        if self.strip_punctuation is not None:
            docs = map(lambda x: x.translate(str.maketrans('', '', self.strip_punctuation)), docs)
        
        # Word tokenizer
        corpus = [word_tokenize(doc, language=self.language) for doc in docs]

        if self.stemmer is not None:
            corpus = [[self.stemmer.stem(token)
                       for token in doc] for doc in corpus]

        return corpus
    
    def _word_tokenizer2(self, X):
        """Preprocesses and tokenizes each document by applying a 
         preprocessing function.

        Args:
            X (iterable): documents to preprocess

        Returns:
            list: preprocessed and tokenized documents
        """

        docs = map(remove_html_tags, X)
        
        if self.lowercase:
            docs = map(str.lower, docs)
        
        if self.strip_accents:
            docs = map(remove_accents, docs)
            
        if self.strip_numbers:
            docs = map(remove_numbers, docs)
        
        if self.strip_punctuation is not None:
            docs = [remove_punctuation(doc, self.strip_punctuation) for doc in docs]
        
        # Word tokenizer
        corpus = [word_tokenize(doc, language=self.language) for doc in docs]
        
        if self.stemmer is not None:
            corpus = [[self.stemmer.stem(token)
                       for token in doc] for doc in corpus]

        return corpus

    
def remove_html_tags(x):
    return BeautifulSoup(x, features="lxml").get_text()

def remove_accents(x):
    return unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore').decode('ASCII')

def remove_numbers(x):
    return x.translate(str.maketrans('', '', "0123456789"))

def remove_punctuation(x, punctuation_list):
    return x.translate(str.maketrans('', '', punctuation_list))
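
To see what the standard-library steps do, here is a small sketch that runs them in the same order as `_word_tokenizer2` on an invented sample sentence. The helper bodies are copied from above; the HTML step is left out and a plain whitespace split stands in for `word_tokenize`, so the snippet has no third-party dependency.

```python
import unicodedata
from string import punctuation


def remove_accents(x):
    return unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore').decode('ASCII')


def remove_numbers(x):
    return x.translate(str.maketrans('', '', "0123456789"))


def remove_punctuation(x, punctuation_list):
    return x.translate(str.maketrans('', '', punctuation_list))


sample = "Artículo 12: producción agrícola."  # invented example text

step = sample.lower()
step = remove_accents(step)                   # "articulo 12: produccion agricola."
step = remove_numbers(step)                   # digits dropped
step = remove_punctuation(step, punctuation)  # ":" and "." dropped

# Whitespace split is only a stand-in for nltk's word_tokenize
tokens = step.split()
print(tokens)  # ['articulo', 'produccion', 'agricola']
```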

In [24]:
from collections import defaultdict
import json
import os
from string import punctuation

from gensim.summarization import keywords, summarize
import matplotlib.pyplot as plt
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
from rake_nltk import Rake
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from wordcloud import WordCloud
from src import PROJECT_ROOT

In [25]:
# Reading data
INPUT_PATH = os.path.join(PROJECT_ROOT, "tasks", "extract_text", "output")
with open(os.path.join(INPUT_PATH, "pdf_files.json")) as json_file:
    data = json.load(json_file)
    
df = pd.DataFrame(
    {
        "filename": data.keys(),
        "country": [i["Country"] for i in data.values()],
        "text": [i["Text"] for i in data.values()]
    }
)

In [26]:
# Creating word count field
df['word_count'] = df['text'].apply(lambda x: len(str(x).split(" ")))
# Removing document without text
df = df.drop(df.index[df['word_count'] == 1].tolist()).reset_index(drop=True)
# Removing badly read documents
bad_docs = ["CreditoGanadero_Mexico", "Ley Especial Cafe_ElSalvador", "Sembrando Vida Report"]
df = df.drop(df.index[df['filename'].isin(bad_docs)].tolist()).reset_index(drop=True)

In [27]:
df.head()

| | filename | country | text | word_count |
|---|---|---|---|---|
| 0 | 2019CVE 1713470_Chile | Chile | CVE 1713470\|Director: Juan Jorge Lazo Rodrígue... | 10424 |
| 1 | Decreto 51_Chile | Chile | Biblioteca del Congreso Nacional de Chile - ww... | 22478 |
| 2 | Decreto 95_Chile | Chile | www.bcn.cl - Biblioteca del Congreso Nacional ... | 6068 |
| 3 | Decreto8_Chile | Chile | Biblioteca del Congreso Nacional de Chile - ww... | 1209 |
| 4 | Ley 20412_Chile | Chile | Biblioteca del Congreso Nacional de Chile - ww... | 4349 |


In [28]:
df.count()

filename      58
country       58
text          58
word_count    58
dtype: int64

In [29]:
spa_stopwords = set(stopwords.words('spanish'))
extra_stopwords = {"ley", "artículo", "ser", "así", "según", "nº"}
spa_stopwords = spa_stopwords.union(extra_stopwords)
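
One thing worth checking with this list: several entries carry accents ("artículo", "así", "según"), while the preprocessing configured in the next cell strips accents (and stems) before tokens are compared against the stop words, so those entries may never match. A hedged sketch of normalizing stop words with the same accent and case steps follows; `normalize_stop_words` is a hypothetical helper, the input is a shortened stand-in for the NLTK list, and stemming is left out.

```python
import unicodedata


def normalize_stop_words(words, lowercase=True, strip_accents=True):
    """Hypothetical helper: apply the corpus text normalization to stop words."""
    normalized = set()
    for word in words:
        if lowercase:
            word = word.lower()
        if strip_accents:
            word = unicodedata.normalize("NFKD", word).encode(
                "ASCII", "ignore").decode("ASCII")
        if word:  # skip entries that become empty after normalization
            normalized.add(word)
    return normalized


extra = {"ley", "artículo", "ser", "así", "según"}  # stand-in list
print(sorted(normalize_stop_words(extra)))
# ['articulo', 'asi', 'ley', 'segun', 'ser']
```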

In [31]:
prep = CorpusPreprocess(
    language='spanish', 
    stop_words=spa_stopwords,
    lowercase=True,
    strip_accents=True,
    strip_numbers=True,
    strip_punctuation=punctuation,
    stemmer=SnowballStemmer('spanish'), 
    max_df=0.9, 
    min_df=2
)
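
To make `max_df=0.9` and `min_df=2` concrete, here is a sketch on an invented, already tokenized toy corpus. It applies the same threshold rule as `fit_transform`: a float is compared against the proportion of documents containing the term, an integer against the raw document count.

```python
from collections import defaultdict

# Invented toy corpus: three already tokenized documents
corpus = [
    ["bosque", "agua", "suelo"],
    ["bosque", "agua"],
    ["bosque", "riego"],
]
max_df, min_df = 0.9, 2

# Document frequency: number of documents each term appears in
doc_freq = defaultdict(int)
for doc in corpus:
    for term in set(doc):
        doc_freq[term] += 1

n_docs = len(corpus)
# Float threshold: compare against the proportion of documents
too_common = {t for t, c in doc_freq.items() if c / n_docs > max_df}
# Integer threshold: compare against the absolute count
too_rare = {t for t, c in doc_freq.items() if c < min_df}

vocabulary = {t: c for t, c in doc_freq.items()
              if t not in too_common and t not in too_rare}
print(too_common)        # {'bosque'}
print(sorted(too_rare))  # ['riego', 'suelo']
print(vocabulary)        # {'agua': 2}
```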

### Performance
We compare performance by calling each word tokenizer function 10 times on the full corpus and printing the average duration of a run:
- Original approach: `_word_tokenizer()`
- Map with lambda: `_word_tokenizer1()`
- Map with existing functions: `_word_tokenizer2()`

In [32]:
import time
from tqdm import tqdm

In [33]:
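def evaluate_func(func, args):
    """Call func(args) 10 times and return the average duration in seconds."""
    durations = []

    for _ in tqdm(range(10)):
        # perf_counter is a monotonic, high-resolution clock suited to timing
        start = time.perf_counter()
        func(args)
        stop = time.perf_counter()
        durations.append(stop - start)

    return sum(durations) / len(durations)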
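
An average alone does not show whether a difference between alternatives exceeds the run-to-run variation. As a sketch, a variant of the timing helper could also report the standard deviation. `time_runs` is a hypothetical name, and `sorted` on a reversed list is only a stand-in workload so the snippet runs without the corpus.

```python
import statistics
import time


def time_runs(func, arg, n_runs=10):
    """Hypothetical helper: return (mean, standard deviation) of run durations."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        func(arg)
        durations.append(time.perf_counter() - start)
    # stdev needs at least two measurements
    return statistics.mean(durations), statistics.stdev(durations)


# Stand-in workload: sort a reversed list of integers
mean_s, std_s = time_runs(sorted, list(range(10000, 0, -1)), n_runs=5)
print(f"{mean_s:.6f}s +/- {std_s:.6f}s")
```

If the gap between two averages is smaller than the spread of the individual runs, the ranking between those alternatives should be read with caution.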

In [34]:
avg_duration_og = evaluate_func(prep._word_tokenizer, df['text'])

100%|██████████| 10/10 [05:30<00:00, 33.10s/it]


In [35]:
print("Avg duration for og option:", avg_duration_og)

Avg duration for og option: 33.09642074108124


In [36]:
avg_duration_first = evaluate_func(prep._word_tokenizer1, df['text'])

100%|██████████| 10/10 [05:22<00:00, 32.29s/it]


In [37]:
print("Avg duration for option 1:", avg_duration_first)

Avg duration for option 1: 32.29024872779846


In [38]:
avg_duration_second = evaluate_func(prep._word_tokenizer2, df['text'])

100%|██████████| 10/10 [05:26<00:00, 32.65s/it]


In [39]:
print("Avg duration for option 2:", avg_duration_second)

Avg duration for option 2: 32.64609534740448


### Results
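After comparing the 3 alternatives, the differences turn out to be small: the original approach averaged 33.10 s per run, `map()` + lambdas 32.29 s, and `map()` + existing functions 32.65 s. Both `map()` variants are mildly faster than the original (by roughly 1 to 2.5%), and with only 10 runs each the gap between them may be within run-to-run variation. We keep the `map()` + existing functions combination because it is the most readable: whenever we want to add a new type of transformation to the text, we just add another separate function and another `if` statement that calls `map()` with it.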
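
As a sketch of that extension pattern, suppose we wanted to collapse repeated whitespace. `remove_extra_whitespace`, `clean` and the `strip_whitespace` flag are hypothetical names used for illustration, not part of `CorpusPreprocess`.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def remove_extra_whitespace(x):
    """Hypothetical new step: collapse whitespace runs into single spaces."""
    return _WHITESPACE.sub(" ", x).strip()


def clean(docs, lowercase=True, strip_whitespace=True):
    """Simplified stand-in for the tokenizer's chain of map() calls."""
    docs = iter(docs)
    if lowercase:
        docs = map(str.lower, docs)
    # Adding a transformation means one new function plus one new `if`
    if strip_whitespace:
        docs = map(remove_extra_whitespace, docs)
    return list(docs)


print(clean(["  Ley   20412 \n de  Chile "]))
# ['ley 20412 de chile']
```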