# DSFM text-as-data workshop

## 1. Text preprocessing and Exploratory Data Analysis

Creator: [Data Science for Managers - EPFL Program](https://www.dsfm.ch)

Source: [https://github.com/dsfm-org/code-bank.git](https://github.com/dsfm-org/code-bank.git)

License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository.

## Workshop

Text data are ubiquitous, and most of us deal with them on a daily basis. In this workshop, we will learn how to work with text data from several angles.

Natural Language Processing (NLP), the study of interactions between computers and human languages, is widely used in fields such as speech recognition, natural language understanding, and natural language generation.

Imagine you want to strengthen the brand identity of your firm and understand what your clients are saying about your brand and your products. You are given a long list of reviews of your company and of other firms. What do you do? In this workshop, we will see how to extract insights from such data and how to tell which reviews are positive and which are negative.

This workshop consists of four Jupyter Notebooks. All four use the same dataset, the [Yelp Dataset](https://www.kaggle.com/yelp-dataset/yelp-dataset). This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge, which gives students a chance to conduct research or analysis on Yelp's data and share their discoveries. In the dataset, you'll find information about businesses across 11 metropolitan areas in four countries.

## Overview

In this notebook you will learn how to:

1. Extract text data from PDF files
2. Analyze both structured and unstructured data, such as text
3. Preprocess and clean text data

## Part 1: PDF Data Extraction

Documents containing text come in many file formats, and PDF is one of the most common. Here, we will extract text from PDF files and store it in a Pandas DataFrame. If you work with other file types, such as PowerPoint presentations, the procedure is much the same.

**Learning objective**: be able to extract text data from PDF files and store it in a Pandas DataFrame.

**Useful resources**:
 - [glob](https://docs.python.org/3.7/library/glob.html), a Python module from the standard library that finds all the pathnames matching a specified pattern
 - [pdfminer.six](https://github.com/pdfminer/pdfminer.six), a third-party Python library for working with PDF files

Q1: Under `data/pdf_reviews` you can find a collection of reviews in PDF format. Using the `glob` module, save the names of these files into a list called `pdf_reviews`. Display the number of PDF files as well as the first three filenames.

> ☝️ The glob module is part of the Python standard library. It is handy for collecting the filenames and paths of many files at once.

> ☝️ Try opening a PDF review in your favorite PDF viewer to get a first look at the raw data.
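
To illustrate how pattern matching with `glob` works, here is a minimal sketch on a throwaway folder. The file names are hypothetical and unrelated to the workshop data.

```python
import glob
import os
import tempfile

# Build a temporary folder with a few empty files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["review_1.pdf", "review_2.pdf", "review_3.pdf", "notes.txt"]:
        open(os.path.join(tmp_dir, name), "w").close()

    # "*" matches any run of characters, so "*.pdf" selects only the PDF files.
    pattern = os.path.join(tmp_dir, "*.pdf")
    # glob returns paths in arbitrary order; sort them for a stable listing.
    matched_files = sorted(glob.glob(pattern))

print(len(matched_files))
print([os.path.basename(path) for path in matched_files])
```

The text file is ignored because it does not match the pattern.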

In [None]:
import glob
pdf_reviews = glob.glob(# YOUR CODE HERE #)

print(f"There are {len(pdf_reviews)} pdf files.")
pdf_reviews[:3]

Q2: With the aid of [pdfminer.six](https://github.com/pdfminer/pdfminer.six), iterate over all PDF reviews found under `data/pdf_reviews/`, extract the text, and store it in a `text_review` list. Look at the first two reviews. What do you notice?

In [None]:
from pdfminer.high_level import extract_text

text_review = # YOUR CODE HERE #
text_review[:2]

**Answer:** The extracted text contains some special symbols that will need to be removed during preprocessing.

Q3: Save the `text_review` list into a Pandas DataFrame named `pdf_review_df` and look at the first 5 rows.

> ☝️ When working with Pandas DataFrames, it is good practice to append `_df` to the variable name.

In [None]:
import pandas as pd
pdf_review_df = # YOUR CODE HERE #
pdf_review_df.head()

## Part 2: Exploratory Data Analysis

Before running any fancy machine learning model, it is useful to have a good understanding of the data we are about to deal with.

The previously created DataFrame contains only 100 reviews.
For this part, we will start with a larger dataset of 500'000 reviews.

**Learning objective**: be able to visualize and analyze both structured (categorical, numerical, etc.) and text data.

### Structured data

Q1: Load all reviews into a Pandas DataFrame named `df` and display the first five rows.

> ☝️ Running operations on 500k reviews might take a while. We suggest sampling about 10k reviews while you write the code and inspect the results, and only at the end running the code again on the whole dataset (Kernel > Restart and Run All). Once the data are loaded, you can use the following snippet to sample ten thousand reviews: `df = df.sample(10000).reset_index(drop=True)`

In [None]:
# Settings for high-quality graphs
import matplotlib.pyplot as plt

import numpy as np
# Fix random seed for reproducibility
np.random.seed(42)

%matplotlib inline
%config InlineBackend.figure_format = "retina"

In [None]:
df = # YOUR CODE HERE #
df = # YOUR CODE HERE # # sample 10k
df.head()

Q2: Even though we are mainly dealing with text data, any structured data is still of interest, as it might help with our task. For instance, the kind of business and its city might be useful. You are given a second dataset (`yelp_business.csv`) containing extra information about each business.

Load the `yelp_business.csv` dataset into a Pandas DataFrame named `df_business` and look at the first rows. Then merge it with the initial `df` DataFrame. Which column do you need to use to _merge_ the two datasets?
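
As a reminder of how a merge behaves, here is a small sketch on toy tables. The column `shop_id` and the values are hypothetical and are not taken from the Yelp data; the point is only that both tables must share a key column.

```python
import pandas as pd

# Toy tables; `shop_id` is a hypothetical key, not a column of the Yelp data.
reviews = pd.DataFrame({
    "shop_id": ["a", "b", "a"],
    "stars": [5, 2, 4],
})
shops = pd.DataFrame({
    "shop_id": ["a", "b"],
    "city": ["Lausanne", "Geneva"],
})

# A left join keeps every review and attaches the matching shop attributes.
merged = reviews.merge(shops, on="shop_id", how="left")
print(merged)
```

With `how="left"`, a review whose key has no match in the second table would be kept, with missing values in the added columns.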

In [None]:
df_business = # YOUR CODE HERE #
df_business.head()

In [None]:
df = # YOUR CODE HERE #
df.head(2)

Q3: The `categories` column of the new DataFrame lists the business types associated with each review. Identify the individual categories, count the occurrences of each one, and plot the counts as a bar chart. How many categories are there, and which are the most common?
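
When a single cell holds several labels, each cell has to be split before counting. The sketch below shows the idea with the standard library on made-up rows; the `;` separator is an assumption, so check how the labels are actually delimited in your data.

```python
from collections import Counter

# Hypothetical rows; the ";" separator is an assumption about the column format.
category_rows = [
    "Restaurants;Pizza",
    "Restaurants;Bars",
    "Dentists",
]

counts = Counter()
for row in category_rows:
    # Split each cell into individual labels and trim stray whitespace.
    counts.update(label.strip() for label in row.split(";"))

print(counts.most_common(2))
print(len(counts))
```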

In [None]:
category_counts = # YOUR CODE HERE #

category_counts[:20]

In [None]:
print(f"There are {len(category_counts)} unique categories.")

In [None]:
title = "Most common categories"
category_counts[:40].sort_values().plot.barh(figsize=(8,8), title=title);

Q4: Display the distribution of the review stars with a bar plot. What are the average and median values?

> ☝️ Keep these results in mind; they will be useful later.
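
The average and the median can differ noticeably when ratings are skewed. A minimal sketch with made-up ratings, using the standard `statistics` module rather than Pandas:

```python
from statistics import mean, median

# Hypothetical star ratings, skewed towards high values.
ratings = [5, 5, 5, 4, 1]

print(mean(ratings))    # arithmetic average
print(median(ratings))  # middle value of the sorted ratings
```

Here the single low rating pulls the average below the median.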

In [None]:
title = "Review ratings"

# YOUR CODE HERE #


print(f"Average rating: {# YOUR CODE HERE #}")
print(f"Median rating: {# YOUR CODE HERE #}")

### Text data

Q5: Add a new column `words_len` containing the number of words in the `text` column. You can use the `str.split()` function to split a string. Then display a histogram of the new column. What is the average number of words per review?
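
As a quick illustration of how `str.split()` behaves on plain Python strings (the sample sentences are made up):

```python
reviews = ["Great pizza and friendly staff!", "Too  slow.", ""]

# str.split() with no argument splits on any run of whitespace,
# so the double space in the second review does not create an empty word.
word_counts = [len(review.split()) for review in reviews]
print(word_counts)
```

Note that punctuation stays attached to the neighbouring word, so this is only a rough word count.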

In [None]:
df['words_len'] = # YOUR CODE HERE #

title = "Number of words for each review"
df['words_len'].plot.hist(title=title, bins=100);

In [None]:
print(f"Average number of words is {df['words_len'].mean()}")

Q6: Print the 20 most common words. What do you notice?

In [None]:
top_words = # YOUR CODE HERE #

NUM_TOP_WORDS = 20
top_words[:NUM_TOP_WORDS]

**Answer:** Most of the common words are stopwords, i.e., words that carry little meaning on their own.
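
The same effect is visible even on a tiny made-up corpus. This sketch counts words with `collections.Counter` from the standard library, which is one possible approach among several:

```python
from collections import Counter

reviews = [
    "The food was great and the service was fast",
    "The staff was rude and the food was cold",
]

word_counts = Counter()
for review in reviews:
    # Lowercase first so that "The" and "the" are counted together.
    word_counts.update(review.lower().split())

print(word_counts.most_common(3))
```

Function words such as "the" and "was" come out on top, well ahead of the words that describe the experience.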

## Part 3: Data Cleaning

When dealing with text data, data cleaning is an essential step. In general, data cleaning depends on both the domain and the task. In this section, we will look at a general-purpose pipeline that works in most scenarios. Data cleaning is also an iterative process.

**Learning objective**: learn the main steps of text preprocessing and experiment with regular expressions.

### Regular expressions

**Useful resources**:
 - https://www.regular-expressions.info - The Premier website about Regular Expressions

Q1: Store the first review of the `pdf_review_df` DataFrame in a variable `r` and display it. Then use regular expressions to extract the `date`, the `categories`, and the rating `stars`.

> ☝️ You can use this snippet of code to visualize the first review: `pdf_review_df.iloc[0]['text']`

In [None]:
r = # YOUR CODE HERE #
r

In [None]:
import re

date = re.findall(# YOUR CODE HERE #, r)
print("date: ", date)

categories = re.findall(# YOUR CODE HERE #, r)
print("categories: ", categories)

stars = re.findall(# YOUR CODE HERE #, r)
print("stars: ", stars)

> ☝️ Regular expressions are a powerful tool. In most cases, there are many ways of solving the same problem.
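
To make this concrete, the sketch below extracts dates from a made-up string with two different patterns that give the same result. The sample text and its `Stars:` label are hypothetical; the real reviews may be laid out differently.

```python
import re

# Hypothetical text; the real reviews may be laid out differently.
sample = "Date: 2019-07-01 Stars: 4 Text: Loved it, back on 2019-08-15."

# Pattern 1: four digits, dash, two digits, dash, two digits.
dates_explicit = re.findall(r"\d{4}-\d{2}-\d{2}", sample)

# Pattern 2: the same idea with a repeated group; (?:...) is non-capturing,
# so findall still returns the whole match.
dates_grouped = re.findall(r"\d{4}(?:-\d{2}){2}", sample)

# A capturing group makes findall return only the part inside the parentheses.
stars = re.findall(r"Stars: (\d)", sample)

print(dates_explicit, dates_grouped, stars)
```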

### Tokenization

Tokenization refers to the act of splitting a sentence into a list of tokens (words). This is an essential step, as it will allow us to later map every word to a number that the computer and machine learning algorithms (regression, classification, deep learning, and so on) can understand.

Q1: Before tokenizing the whole dataset, let's compare different approaches on a single example. Start by splitting the first review `r` on the space character (` `) and look at the result. What do you notice? What is the main issue with this approach?

In [None]:
print(# YOUR CODE HERE #)

**Answer:** Some of the tokens, such as `first.`, should be split further.
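
The issue can be reproduced on a made-up sentence. The sketch below contrasts splitting on spaces with a simple regular expression that separates punctuation; the regular expression is only a simplified stand-in for illustration, not a full tokenizer.

```python
import re

sentence = "I came here first. Wasn't impressed!"

# Splitting on spaces keeps punctuation glued to the neighbouring word.
space_tokens = sentence.split(" ")

# \w+ grabs runs of word characters; [^\w\s] grabs each punctuation mark separately.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(space_tokens)
print(regex_tokens)
```

The regular expression separates the final period from `first`, but it also breaks the contraction `Wasn't` into three pieces, which shows why hand-written rules quickly become fragile.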

Q2: Run the tokenization again, this time using [spaCy](https://spacy.io/) ("Industrial-Strength Natural Language Processing"), a common Python package for NLP tasks. How does the result compare to the previous solution? What are the advantages and disadvantages of the two approaches?

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
review_example_spacy = nlp(# YOUR CODE HERE #)

tokens_spacy = []
for token in review_example_spacy:
    token = token.text.strip() # remove starting and ending space
    tokens_spacy.append(# YOUR CODE HERE #)

print(tokens_spacy)

**Answer:** Tokenization with spaCy is slower, but in most cases it does a better job of separating tokens (for instance `first.`). Note also that with the second approach the date `2019-07-01` has been split further. This might or might not be desirable, depending on the task at hand.

Q3: Tokenize the whole DataFrame `df` using spaCy by adding a new column `tokenized` to the dataset. Display the first rows. This operation might take a while (3-5 minutes); you can monitor the progress using [tqdm](https://pypi.org/project/tqdm/).

In [None]:
%%time
from tqdm import tqdm
tqdm.pandas()

import spacy
nlp = spacy.load("en_core_web_sm")

def tokenize(text):
    
    # For faster processing, we can truncate each text to 500 characters.
    # The cut may fall in the middle of a word, which is sub-optimal.
    #text = text[:500]
    
    nlp_text = nlp(# YOUR CODE HERE #)
    tokens_spacy = []
    for token in nlp_text:
        token = token.text.strip() # remove starting and ending space
        tokens_spacy.append(# YOUR CODE HERE #)
    return tokens_spacy
    
df['tokenized'] = df['text'].progress_apply(tokenize)
df.head(2)

### Removal of stopwords

As seen previously, some of the most common words in the dataset do not add any valuable meaning to the reviews. These words are known as _stopwords_.

There are two main approaches to removing stopwords:
 1. Remove the most frequent words (the top words). In this case, it is important to choose the threshold carefully.
 2. Remove all words that appear in a pre-defined stopwords list.
 
#### Remove all _top_words_

Q1: First, decide which stopwords you want to get rid of. Look at `top_words` and create a Python `set` of _top_words_. Be careful in selecting the right threshold.

In [None]:
top_words[:30]

In [None]:
STOPWORDS_THRESHOLD = # YOUR CODE HERE #
stopwords = top_words[:STOPWORDS_THRESHOLD]

Q2: Define a function `remove_stopwords(tokenized_text, list_stopwords)` that, given a tokenized text, removes every token that appears in `list_stopwords`. Apply this function to the `tokenized` Series and store the results in another column of the DataFrame, `without_stopwords`.

> ☝️In some cases, the tokens in the stopwords list are lowercased. Make sure to take this fact into consideration when developing your own solution.

In [None]:
def remove_stopwords(tokenized_text, list_stopwords):
    return # YOUR CODE HERE #

remove_stopwords(tokenized_text=["is", "Beautiful", "!"], list_stopwords=["is"])
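
The point about lowercasing can be sketched as follows. The helper `drop_tokens` is a hypothetical name used for illustration only, and this is one possible way to filter tokens rather than the reference solution.

```python
def drop_tokens(tokens, unwanted):
    """Return the tokens whose lowercase form is not in `unwanted`."""
    # A set gives fast membership checks, which matters on large datasets.
    unwanted_lower = {word.lower() for word in unwanted}
    return [token for token in tokens if token.lower() not in unwanted_lower]


print(drop_tokens(["The", "pizza", "is", "great"], ["the", "is"]))
```

Comparing lowercase forms removes `The` even though the list only contains `the`, while the casing of the remaining tokens is preserved.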

#### Use a pre-defined stopwords list

A more common approach is to remove all words that appear in a pre-defined stopwords list. In most cases, this is a more stable solution, as it removes stopwords without the risk of removing words that are common but important.

> ☝️ Historically, one of the most common stopwords lists is the one provided by NLTK. A more modern and equally valid alternative is the list of stopwords provided by spaCy.

Q3: Load the spaCy stopwords list and `apply` the function again with this new set, as you did before.

In [None]:
from spacy.lang.en import stop_words as spacy_en_stopword
spacy_stopwords = spacy_en_stopword.STOP_WORDS

In [None]:
df['without_stopwords'] = df['tokenized'].apply(lambda row: # YOUR CODE HERE #)
df['without_stopwords']

Q4: Join the split tokens back together to generate a single string for each cell of the Pandas Series. Look at the head of the DataFrame and finally store it as a CSV file: `review_clean.csv`.
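
As a small illustration of the joining step on plain Python lists, the sketch below uses made-up tokens and the standard `csv` module, writing to an in-memory buffer instead of a file. It stands in for the Pandas workflow and is not a replacement for it.

```python
import csv
import io

token_rows = [["great", "pizza", "!"], ["slow", "service"]]

# " ".join glues the tokens back together with single spaces.
clean_texts = [" ".join(tokens) for tokens in token_rows]

# Write to an in-memory buffer to keep the sketch free of side effects.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text_clean"])
writer.writerows([text] for text in clean_texts)

print(buffer.getvalue())
```

Joining with a space puts a space before punctuation tokens as well, which is usually acceptable for downstream bag-of-words models.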

In [None]:
df['text_clean'] = # YOUR CODE HERE #

(
    df[["text_clean", "text", "stars", "categories", "name", "address"]]
    .to_csv("./data/review_clean.csv", index=False)
)

df.head(3)