# Text processing

Textual data must be pre-processed before its words can be turned into numerical features that machine learning algorithms can use. Which steps are needed depends mainly on the domain and on the problem itself, so not every step applies to every problem.

Here, we learn how to apply common text pre-processing steps in Python. Apart from some very basic text processing, we will use the [Natural Language Toolkit (NLTK)](https://www.nltk.org/).

The case study is a dataset of Spanish-language news items, each with a headline (`titular`) and a body (`texto`).

In [1]:
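import json
import re

import pandas as pd

# Load the news items (a JSON list of objects with "titular" and "texto" keys)
with open('./data/noticias.txt') as json_file:
    data = json.load(json_file)

# Build a two-column DataFrame: headline and body of each news item
tuples = list(zip([noticia.get("titular") for noticia in data],
                  [noticia.get("texto") for noticia in data]))
df = pd.DataFrame(tuples, columns=['Titular', 'Noticia'])
print(df.shape)
df.head()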


(5665, 2)


|   | Titular | Noticia |
|---|---|---|
| 0 | Un estudio impulsado por la Universidad de San... | El virus SARS-Cov-2 entró en España por la ciu... |
| 1 | Las claves: qué es Montai y quién está detrás | ¿Qué es Montai? ¿Qué relación guarda con las o... |
| 2 | Robots entregan domicilios en Medellín durante... | Unos 15 robots recorren las calles de Medellín... |
| 3 | Grazón insiste en que un nuevo estado de alarm... | En una entrevista en Radio Euskadi, recogida p... |
| 4 | Vox se sube a la ola de la extrema derecha eur... | "España ha dejado de ser católica", decía Manu... |


## Special and undesired characters
Let's start with a very basic clean-up.
In particular, we remove special characters and other peculiarities of the texts
that could introduce noise in later processing.

In [2]:
def cleaning_text(text):
    """
    Common text cleaning steps.
    """
    # Filter out special characters
    text = re.sub(r'\W', ' ', str(text))
    # Filter out single-letter words (here 1 character; 2-character words are often removed too)
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    # Collapse consecutive whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Convert to lower case
    text = text.lower()
    return text

df["Tokens"] = df.Noticia.apply(cleaning_text)

df.head()

|   | Titular | Noticia | Tokens |
|---|---|---|---|
| 0 | Un estudio impulsado por la Universidad de San... | El virus SARS-Cov-2 entró en España por la ciu... | el virus sars cov 2 entró en españa por la ciu... |
| 1 | Las claves: qué es Montai y quién está detrás | ¿Qué es Montai? ¿Qué relación guarda con las o... | qué es montai qué relación guarda con las otr... |
| 2 | Robots entregan domicilios en Medellín durante... | Unos 15 robots recorren las calles de Medellín... | unos 15 robots recorren las calles de medellín... |
| 3 | Grazón insiste en que un nuevo estado de alarm... | En una entrevista en Radio Euskadi, recogida p... | en una entrevista en radio euskadi recogida po... |
| 4 | Vox se sube a la ola de la extrema derecha eur... | "España ha dejado de ser católica", decía Manu... | españa ha dejado de ser católica decía manuel... |
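
To make the effect of each substitution in `cleaning_text` concrete, the sketch below applies the same four steps one at a time to a single sentence. The sample sentence and the variable names are made up for illustration; only the regular expressions come from the cell above.

```python
import re

# Made-up sample sentence (an assumption for illustration, not taken from the dataset).
sample = "¿Qué es Montai? El virus SARS-Cov-2 y la OMS."

# Step 1: replace every non-word character (punctuation, symbols) with a space.
step1 = re.sub(r"\W", " ", sample)
# Step 2: drop isolated single ASCII letters, such as the conjunction "y".
step2 = re.sub(r"\s+[a-zA-Z]\s+", " ", step1)
# Step 3: collapse runs of whitespace into a single space.
step3 = re.sub(r"\s+", " ", step2)
# Step 4: convert to lower case.
step4 = step3.lower()

for name, value in [("step1", step1), ("step2", step2),
                    ("step3", step3), ("step4", step4)]:
    print(f"{name}: {value!r}")
```

Two details are worth noticing. For Python 3 strings, `\W` is Unicode-aware, so accented letters such as "é" are kept. The digit "2" is not matched by `[a-zA-Z]`, so it survives this stage, and the result keeps a leading and a trailing space unless `.strip()` is applied.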


## Basic text processing pipeline

In general, texts go through the following basic processing steps:
<ul>
    <li>Tokenization, i.e. splitting the text into sentences and the sentences into words.</li>
    <li>Lowercase word conversion.</li>
    <li>Stopword removal.</li>
    <li>Removal of very short words.</li>
    <li>Stemming, i.e. reducing words to their root form.</li>
</ul>
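
The steps listed above can be sketched end to end in plain Python. This is a toy illustration under loud assumptions: `TOY_STOPWORDS` is a tiny hand-picked list, `toy_stem` is a naive suffix stripper (not the Snowball algorithm), and tokenization is a single regular expression that splits into words only, without sentence splitting. None of these names come from NLTK.

```python
import re

# Hypothetical, hand-picked stopword list; a real Spanish list is much longer.
TOY_STOPWORDS = {"el", "la", "los", "las", "de", "en", "por", "y", "que"}
# Toy suffixes for illustration only; this is NOT the Snowball algorithm.
TOY_SUFFIXES = ("ciones", "ción", "es", "as", "os", "a", "o", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_pipeline(text, min_len=2):
    tokens = re.findall(r"\w+", text)                        # 1. tokenization (words only)
    tokens = [t.lower() for t in tokens]                     # 2. lowercase conversion
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]   # 3. stopword removal
    tokens = [t for t in tokens if len(t) >= min_len]        # 4. short-word removal
    return [toy_stem(t) for t in tokens]                     # 5. stemming


result = toy_pipeline("Los robots recorren las calles de Medellín")
print(result)
```

The order of the steps matters: lowercasing before stopword removal lets "Los" match the lowercase entry "los", and stemming comes last so that stopwords are compared in their original form.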

In [3]:
import nltk
from nltk.tokenize import ToktokTokenizer
# Several good tokenizers are available; ToktokTokenizer is just one of the more recent ones.
# It splits a text into tokens, which are individual words most of the time (this is language dependent).
# It is relatively easy to inspect the output and check whether the selected tokenizer behaves as desired.

tokenizer = ToktokTokenizer()
df["Tokens"] = df.Tokens.apply(tokenizer.tokenize)

df.head()


In [9]:

nltk.download("stopwords")
from nltk.corpus import stopwords

# Every language has stopwords, i.e. words that are used very frequently for
# syntactic construction but carry little useful content.
STOPWORDS = set(stopwords.words("spanish"))

def eliminate_stopwords_digits(tokens):
    """
    Eliminates stopwords and digits from the list of tokens.
    """
    return [token for token in tokens if token not in STOPWORDS 
            and not token.isdigit()]

df["Tokens"] = df.Tokens.apply(eliminate_stopwords_digits)
df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Analia\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


|   | Titular | Noticia | Tokens |
|---|---|---|---|
| 0 | Un estudio impulsado por la Universidad de San... | El virus SARS-Cov-2 entró en España por la ciu... | [virus, sars, cov, entró, españa, ciudad, vito... |
| 1 | Las claves: qué es Montai y quién está detrás | ¿Qué es Montai? ¿Qué relación guarda con las o... | [montai, relación, guarda, empresas, quién, de... |
| 2 | Robots entregan domicilios en Medellín durante... | Unos 15 robots recorren las calles de Medellín... | [robots, recorren, calles, medellín, realizar,... |
| 3 | Grazón insiste en que un nuevo estado de alarm... | En una entrevista en Radio Euskadi, recogida p... | [entrevista, radio, euskadi, recogida, europa,... |
| 4 | Vox se sube a la ola de la extrema derecha eur... | "España ha dejado de ser católica", decía Manu... | [españa, dejado, ser, católica, decía, manuel,... |


In [10]:
from nltk.stem import SnowballStemmer

'''
Stemming reduces inflected words to their stem, base, or root form. The stem is a written word form,
but not necessarily a valid word (e.g. "ciudad" becomes "ciud").
Again, several stemmers are available; Snowball is one of the most commonly used.
This process may take a while to complete...
'''

stemmer = SnowballStemmer("spanish")

def stem_palabras(tokens):
    """
    Reduce each word in the given list to its stem.
    """
    return [stemmer.stem(token) for token in tokens]

df["Tokens"] = df.Tokens.apply(stem_palabras)

df.head()


|   | Titular | Noticia | Tokens |
|---|---|---|---|
| 0 | Un estudio impulsado por la Universidad de San... | El virus SARS-Cov-2 entró en España por la ciu... | [virus, sars, cov, entro, españ, ciud, vitori,... |
| 1 | Las claves: qué es Montai y quién está detrás | ¿Qué es Montai? ¿Qué relación guarda con las o... | [montai, relacion, guard, empres, quien, detra... |
| 2 | Robots entregan domicilios en Medellín durante... | Unos 15 robots recorren las calles de Medellín... | [robots, recorr, call, medellin, realiz, entre... |
| 3 | Grazón insiste en que un nuevo estado de alarm... | En una entrevista en Radio Euskadi, recogida p... | [entrev, radi, euskadi, recog, europ, press, g... |
| 4 | Vox se sube a la ola de la extrema derecha eur... | "España ha dejado de ser católica", decía Manu... | [españ, dej, ser, catol, dec, manuel, azañ, co... |


In [11]:
# Preview the first 10 tokens of the first document after preprocessing
print(df.Tokens[0][0:10])


['virus', 'sars', 'cov', 'entro', 'españ', 'ciud', 'vitori', 'torn', 'febrer', 'conclusion']


## Feature extraction: Bag of Words

The cleaned text cannot be passed directly to a classification model,
because the features need to be numeric, not strings.

There are many state-of-the-art approaches to extracting features from text data.
The simplest and best-known method is the Bag-of-Words (BOW) representation.
It transforms each text into a fixed-length vector
by counting how many times each vocabulary word occurs in the document.
These word counts make it possible to compare documents and evaluate their similarity for applications
such as search, document classification, and topic modeling.

The name “Bag-of-Words” reflects the fact that a document is represented as a bag of terms:
the order and structure of the words are ignored,
and only which words appear, and how often, is recorded.
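
As a minimal sketch of the idea, the snippet below turns two made-up, already tokenized documents into fixed-length count vectors using only the standard library. The documents, `vocabulary`, and `to_bow_vector` are illustrative assumptions, not part of the notebook's pipeline.

```python
from collections import Counter

# Two made-up, already tokenized documents (assumption for illustration).
docs = [
    ["virus", "ciudad", "virus", "estudio"],
    ["robots", "ciudad", "calles"],
]

# Fixed vocabulary: every distinct token gets one position in the vector.
vocabulary = sorted({token for doc in docs for token in doc})


def to_bow_vector(tokens, vocabulary):
    """Return token counts in vocabulary order; word order in the text is lost."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]


vectors = [to_bow_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Every vector has the same length as the vocabulary, regardless of document length, and shuffling the tokens of a document leaves its vector unchanged.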


In [12]:
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import random
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
%matplotlib inline

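# Despite its name, doc_term_matrix is a gensim Dictionary: it maps each unique token to an integer id.
doc_term_matrix = Dictionary(df.Tokens)
print(f'Number of tokens: {len(doc_term_matrix)}')

Number of tokens: 47369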


Filter out tokens that appear in fewer than 2 documents (an absolute number) or
in more than 80% of the documents (a fraction of the total corpus size, not an absolute number).

In [13]:
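'''
no_below: keep tokens that appear in at least this many documents (absolute number)
no_above: keep tokens that appear in no more than this fraction of the documents (fraction of corpus size)
keep_n: keep only this many of the most frequent remaining tokens
'''
doc_term_matrix.filter_extremes(no_below=2, no_above=0.8)
print(f'Number of tokens: {len(doc_term_matrix)}')

Number of tokens: 25516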
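
The filtering rule can be illustrated in plain Python. The sketch below is not gensim's implementation: it only mirrors the rule as described in this section (document frequency at least `no_below`, and at most a fraction `no_above` of the corpus) on a made-up five-document corpus. The function name `filter_by_document_frequency` is hypothetical.

```python
from collections import Counter

# Made-up tokenized corpus (assumption for illustration).
docs = [
    ["gobierno", "virus", "ciudad"],
    ["gobierno", "virus", "robots"],
    ["gobierno", "virus", "ciudad"],
    ["gobierno", "calles"],
    ["gobierno", "ministro"],
]


def filter_by_document_frequency(docs, no_below=2, no_above=0.8):
    """Keep tokens found in >= no_below documents and in <= no_above of all documents."""
    # Document frequency: each token is counted at most once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }


kept = filter_by_document_frequency(docs)
print(sorted(kept))
```

Here "gobierno" is dropped because it appears in every document, and "robots", "calles", and "ministro" are dropped because each appears in only one. Tokens at both extremes tend to carry little information for telling documents apart.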


For each document, we build its bag-of-words: a list of (token id, count) pairs recording which dictionary
tokens appear in the document and how many times. We save the result in ‘corpus’ and then inspect one document.

In [14]:
# Corpus construction
corpus = [doc_term_matrix.doc2bow(noticia) for noticia in df.Tokens]

'''
If we want to use TF-IDF scores, we should use the corresponding model:

from gensim.models import TfidfModel
tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
'''

# The BOW of one document (index 6)
# In the output below, the first five elements are [(3, 1), (25, 1), (26, 6), (29, 1), (40, 1) …].
# The tuple (3, 1) tells us that the token with id=3 appears once in this document.
# To see which token it is, print the corresponding dictionary entry, i.e. doc_term_matrix[3].
print(corpus[6])
print(doc_term_matrix[3])

[(3, 1), (25, 1), (26, 6), (29, 1), (40, 1), (41, 3), (44, 1), (48, 7), (52, 2), (67, 1), (68, 1), (77, 1), (86, 1), (94, 1), (96, 1), (108, 4), (116, 1), (118, 1), (121, 1), (131, 2), (146, 2), (149, 2), (164, 1), (172, 1), (176, 2), (178, 1), (193, 1), (204, 1), (210, 1), (222, 1), (235, 3), (236, 4), (238, 1), (245, 1), (268, 1), (276, 1), (283, 1), (295, 1), (299, 1), (311, 3), (312, 2), (339, 1), (349, 1), (367, 11), (372, 1), (394, 12), (407, 1), (413, 1), (431, 1), (436, 2), (439, 1), (440, 1), (450, 1), (454, 2), (462, 2), (475, 3), (478, 2), (492, 1), (498, 1), (502, 2), (513, 2), (525, 1), (531, 3), (549, 1), (561, 1), (574, 1), (587, 2), (615, 1), (631, 1), (640, 1), (650, 3), (653, 2), (656, 1), (660, 1), (677, 1), (680, 1), (683, 1), (684, 1), (686, 2), (694, 1), (732, 1), (784, 2), (793, 2), (794, 1), (804, 2), (817, 3), (830, 1), (839, 1), (840, 1), (842, 1), (852, 1), (866, 2), (913, 1), (915, 1), (917, 2), (921, 2), (933, 1), (958, 1), (1055, 1), (1057, 1), (1060, 1), 
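
The sparse (token id, count) format shown above can be reproduced and decoded by hand. The sketch below is a plain-Python analogue, not gensim code: `token2id` is a hypothetical four-entry mapping standing in for the dictionary, and `to_sparse_bow` is an illustrative helper.

```python
from collections import Counter

# Hypothetical token-to-id mapping, standing in for the gensim dictionary.
token2id = {"ciud": 0, "estudi": 1, "robots": 2, "virus": 3}
id2token = {idx: token for token, idx in token2id.items()}


def to_sparse_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the mapping are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# "entrev" is not in the mapping, so it is ignored.
bow = to_sparse_bow(["virus", "ciud", "virus", "entrev"], token2id)
# Translate ids back into tokens to make the pairs readable.
decoded = [(id2token[idx], count) for idx, count in bow]
print(bow)
print(decoded)
```

Only tokens that occur in the document are stored, which keeps the representation compact when the vocabulary has tens of thousands of entries but each document uses a small part of it.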

In upcoming classes, we will explore alternative feature extraction methods, such as
term frequency-inverse document frequency (TF-IDF) scores.
No single method outperforms the rest; the choice depends on the data and on the objectives
of the analysis. It is often worthwhile to compare model performance across different feature sets.
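
As a first intuition for TF-IDF, the sketch below scores tokens with one common textbook variant, tf × ln(N / df), where tf is the count of the token in the document, N is the number of documents, and df is the number of documents containing the token. This formula is an assumption chosen for illustration; libraries differ in logarithm base, smoothing, and normalization. The corpus is made up.

```python
import math
from collections import Counter

# Made-up tokenized corpus (assumption for illustration).
docs = [
    ["virus", "ciudad", "virus"],
    ["robots", "ciudad"],
    ["virus", "ciudad", "gobierno"],
]

n_docs = len(docs)
# Document frequency: number of documents in which each token occurs.
doc_freq = Counter(token for doc in docs for token in set(doc))


def tfidf(doc):
    """Score each token as count * ln(n_docs / doc_freq); one variant among several."""
    counts = Counter(doc)
    return {
        token: count * math.log(n_docs / doc_freq[token])
        for token, count in counts.items()
    }


scores = tfidf(docs[1])
print(scores)
```

In this toy corpus "ciudad" appears in every document, so its score is zero, while "robots" appears in a single document and receives the highest weight. Raw counts would have treated the two tokens as equally important.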