# **Natural Language Processing** (NLP)

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

In this exercise, we revisit the EPFL course book data. We would like to understand how similar courses are to one another based on their textual descriptions. Consider this case: you liked a course very much and would now like to take the one most similar to it.

### Business Objective:

* To discover similarity relationships between EPFL courses based on their textual descriptions

### Learning Objectives:

* Getting familiar with text preprocessing facilities in the `nltk` library
* Understanding the intuition behind different vector space models for text data, e.g. TFIDF
* Learning how to transform a raw corpus into the vector space model of choice
* Learning how to query the documents most similar to a focal document in a given space
* Learning how to project text data from a high-dimensional space into low dimensions for visualization

-------

# Part 0: Setup

In [None]:
# Standard imports 
import pandas as pd

# Natural Language Toolkit (NLTK) and spaCy
import nltk
nltk.download('wordnet')
import spacy

# Sklearn TFIDF function and PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# Plotting packages
import matplotlib.pyplot as plt

# Python math package
import math

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Define constant(s)

SEED = 42

# Part 1: Load .csv data

In this part, simply load the EPFL course file from `data/epfl_description.csv`.

**Q 1**: Load EPFL course data. Look at the shape and the first 5 rows. What shape does the data have?

**Q 2**: Concatenate the course title and course description columns. Why is this useful?
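
As a minimal sketch of the concatenation step on toy data: the column names `course_title` and `course_description` and the new column `text` are assumptions made for illustration, so check `df.columns` for the names used in the real file.

```python
import pandas as pd

# Toy stand-in for the course data; column names and rows are hypothetical.
toy = pd.DataFrame({
    "course_title": ["Machine learning", "Linear algebra"],
    "course_description": ["Supervised and unsupervised methods.",
                           "Vector spaces and linear maps."],
})

# Element-wise string concatenation, with a space as separator.
toy["text"] = toy["course_title"] + " " + toy["course_description"]
print(toy["text"].tolist())
```

Note that a missing value in either column makes the concatenated value missing as well, which matters for the cleaning step in Part 2.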

# Part 2: Clean data

**Q 1**: Draw a random sample of 5 course descriptions and look at the entire description. Are there any issues with the text data? If so, what are they?

Hint: Look at the element in row 8. 

Some of the issues that require cleaning include:

- the carriage return character `\r`
- punctuation marks such as `.`, `,`, `?`, `!`
- quotation marks and other symbols such as `$`, `(`, `)`

**Q 2**: Remove the parts of the text identified above. Also remove multiple white spaces. How did the element in row 8 change?

Hint: use a regular expression (regex), which defines a search pattern for strings and is a very handy tool for pre-processing text. You can visit https://regex101.com/ to test your regular expressions.
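
One possible way to apply the hint, sketched with Python's built-in `re` module. The helper name `clean_text` and the sample string are made up for illustration. The pattern keeps only ASCII letters and digits, which is an assumption: it would also strip accented letters, so a pattern such as `[^\w\s]` may suit multilingual text better.

```python
import re

def clean_text(text):
    """Replace non-alphanumeric characters with spaces, then collapse whitespace."""
    # Anything that is not an ASCII letter, digit, or whitespace becomes a space
    # (this covers punctuation, quotation marks, and symbols such as $ ( )).
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # \s+ matches runs of whitespace, including \r and \n.
    return re.sub(r"\s+", " ", text).strip()

sample = 'Intro to "finance":\r\n  pricing (options), $ flows!'
print(clean_text(sample))  # Intro to finance pricing options flows
```

On a dataframe, such a function could be applied to a text column with `Series.apply`.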

**Q 3**: Clean data by removing rows with missing data in any column. How many clean rows are left?

# Part 3: Tokenize and lemmatize course descriptions 

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For example:

- am, are, is $\Rightarrow$ be
- car, cars, car's, cars' $\Rightarrow$ car

For details about lemmatization and stemming visit: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
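
The idea can be illustrated without any external resources. The sketch below uses a tiny hand-written lemma table (`TOY_LEMMAS`, hypothetical) in place of a real lemmatizer such as the WordNet-based one in `nltk`; it only shows the tokenize-then-map pattern, not how a real lemmatizer decides on the base form.

```python
import re

# Toy lemma table standing in for a real lemmatizer; entries are
# hand-written for illustration only.
TOY_LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car",
    "organizes": "organize", "organizing": "organize",
}

def tokenize_and_lemmatize(text, lemmas=TOY_LEMMAS):
    """Lowercase, split into alphabetic tokens, and map each token to its lemma."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Tokens missing from the table are kept unchanged.
    return [lemmas.get(tok, tok) for tok in tokens]

print(tokenize_and_lemmatize("She organizes cars and is organizing more"))
# ['she', 'organize', 'car', 'and', 'be', 'organize', 'more']
```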

**Q 1**: Define a simple function that takes a course description as input and outputs the tokenized and lemmatized text as a list.

**Q 2**: Apply the function to the course description in your Pandas dataframe. 

Hint: Use the Pandas `apply` function. Look at the second row - what changed? Did lemmatization work?

# Part 4: Create a term frequency inverse document frequency (TFIDF) matrix

We now have to ensure that the text description is stored as a string in our dataframe, not as a list. In the code below, replace the variable names with the ones you are using.

In [None]:
# Transform list data to text data
df_clean['description_lemmatized_text'] = df_clean['description_lemmatized'].str.join(' ')
df_clean['description_lemmatized_text'].head()

In [None]:
# Extract all the text data
data = df_clean['description_lemmatized_text']
len(data)

**Q 1**: Fit and transform your text data using TFIDF. 

Hint: use the `TfidfVectorizer()` function in sklearn with the parameter `max_features = 400`.

**Q 2**: What shape does the TFIDF matrix have? What's the meaning of the number of columns? Use the `toarray()` function to show some of the TFIDF entries.
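
To build intuition for what a TFIDF weight measures, here is a hand-rolled sketch on a made-up three-document corpus. It uses the textbook weighting (raw term count times `ln(N / df)`), which is an assumption for illustration: `TfidfVectorizer` by default applies a smoothed IDF and L2-normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

docs = [
    "machine learning course",
    "deep learning course",
    "linear algebra course",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}

def tfidf_row(doc):
    """One matrix row: tf (raw count) times idf (ln(N / df)) per vocabulary term."""
    counts = Counter(doc)
    return [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocab]

matrix = [tfidf_row(doc) for doc in tokenized]

for term, weight in zip(vocab, matrix[0]):
    print(f"{term:10s} {weight:.3f}")
```

In this toy corpus, "course" occurs in every document and therefore gets weight 0, while "machine", which occurs in only one document, gets the highest weight in the first row.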

# Part 5: Apply a principal component analysis (PCA)

We now project the high-dimensional TFIDF matrix onto its first two principal components.
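
For intuition about what this projection does, the sketch below carries out the steps by hand with NumPy on a random toy matrix (a stand-in, not the course data): center the columns, take a singular value decomposition, and project onto the top two directions. The variable names are hypothetical, and the result should agree with sklearn's `PCA` only up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((6, 5))  # toy stand-in for a (documents x terms) matrix

# 1) Center each column so every feature has mean zero.
X_centered = X - X.mean(axis=0)

# 2) SVD: the rows of Vt are the principal directions, ordered by
#    decreasing explained variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3) Project the centered data onto the first two directions.
X_2d = X_centered @ Vt[:2].T

print(X_2d.shape)
```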

**Q 1**: Run a PCA on the TFIDF matrix. Hint: use the `PCA.fit_transform()` function in sklearn.

**Q 2**: What's the shape of the PCA output? Why?

**Q 3**: This is done for you: add the PCA values to the cleaned dataframe. We want a dataframe with the course name and the PCA values.

# Part 6: Visualize how similar EPFL courses are to each other

Now we return to the initial business objective: to discover similarity relationships between EPFL courses based on their textual descriptions.

**Q 1**: Visualize the PCA values using a simple scatter plot. Each dot represents a course.

**Q 2**: What's the most similar course to COM-421? Compute the Euclidean distance between COM-421 and every other course. To do so, you can use the function below:

In [None]:
def euclideanDistance(p1, p2):
    """
    Compute the Euclidean distance between two 2-D points.

    Parameters:
        p1 (list): input point defined as an [x, y] list
        p2 (list): input point defined as an [x, y] list

    Returns:
        float: Euclidean distance between p1 and p2
        
    """   
    
    return math.sqrt( ((p1[0]-p2[0])**2)+((p1[1]-p2[1])**2) )
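
As a sketch of how a nearest-neighbor query could be built on top of such a distance, the example below uses made-up course codes and coordinates. It relies on the standard-library `math.dist` (Python 3.8+), which computes the same quantity as `euclideanDistance` for 2-D points; the helper `nearest` is hypothetical.

```python
import math

# Hypothetical 2-D coordinates; the course codes and values are made up.
toy_points = {
    "COURSE-A": [0.10, 0.20],
    "COURSE-B": [0.15, 0.25],
    "COURSE-C": [0.90, -0.40],
}

def nearest(focal, points):
    """Return (code, distance) of the point closest to `focal`, excluding itself."""
    candidates = (
        (code, math.dist(points[focal], xy))
        for code, xy in points.items()
        if code != focal
    )
    # Pick the candidate with the smallest distance.
    return min(candidates, key=lambda pair: pair[1])

code, dist = nearest("COURSE-A", toy_points)
print(code, round(dist, 4))
```

With the real data, the dictionary would be replaced by the course names and PCA values stored in the dataframe.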