# **Natural Language Processing** (NLP) (SOLUTIONS)

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

In this exercise, we revisit the EPFL course book data. We would like to understand how similar courses are, based on their textual descriptions. Consider this case: you enjoyed a course very much and would now like to take the one most similar to it.

### Business Objective:

* To discover similarity relationships between EPFL courses based on their textual descriptions

### Learning Objectives:

* Getting familiar with text preprocessing facilities in the `nltk` library
* Understanding the intuition behind different vector space models for working with text data, e.g. TFIDF
* Learning how to transform a raw corpus into the vector space model of choice
* Learning how to query the documents most similar to a focal document in a given space
* Learning how to project text data from a high-dimensional space into low dimensions for visualization

-------

# Part 0: Setup

In [None]:
# Standard imports 
import pandas as pd

# Natural Language Toolkit (NLTK) and spaCy
import nltk
nltk.download('wordnet')
import spacy

# Sklearn TFIDF function and PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# Plotting packages
import matplotlib.pyplot as plt

# Python math package
import math

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Define constant(s)

SEED = 42

# Part 1: Load .csv data

In this part, simply load the EPFL course file from `data/epfl_description.csv`.

**Q 1**: Load EPFL course data. Look at the shape and the first 5 rows. What shape does the data have?

In [None]:
# Loading the csv file
df = pd.read_csv('data/epfl_description.csv')
df.head()

In [None]:
df.shape

**Q 2**: Concatenate the course title and the course description column. Why is this useful?

In [None]:
# Adding course titles to descriptions
df['description'] = df['course'] + ' ' + df['description']
df.head()

**Q 3**: Lowercase all the words. Otherwise, the computer thinks that "Finance" and "finance" are two different words.

In [None]:
df['description'] = df['description'].str.lower()

# Part 2: Clean data

**Q 1**: Draw a random sample of 5 course descriptions and look at the entire description. Are there any issues with the text data? If so, what are they?

Hint: Look at the element in row 8. 

In [None]:
# Draw a random sample of descriptions (execute repeatedly to check many examples)
df['description'].sample(5).values

In [None]:
# Inspect a single description
df.iloc[8, 2]

Some of the issues that require cleaning include:

- the carriage-return character `\r`
- punctuation marks such as `.`, `,`, `?`, `!`
- quotation marks and other symbols such as `$`, `(`, `)`

**Q 2**: Remove the parts of the text identified above. Also remove multiple white spaces. How did the element in row 8 change?

Hint: use a "regular expression" (regex), which defines a search pattern for strings and is a very handy tool for pre-processing text. You can visit https://regex101.com/ to test your regular expressions.
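As a minimal sketch of the hint above, the same clean-up can be expressed with two regular expressions using Python's built-in `re` module. The function name `clean_text` and the example string are illustrative assumptions, not part of the course data. Unlike the notebook cells, this sketch also trims leading and trailing spaces.

```python
import re

def clean_text(text):
    """Replace selected characters with spaces, then collapse whitespace."""
    # Inside a character class [...], symbols such as . $ ( ) ? are literal
    text = re.sub(r"[\r.,;$()?!]", " ", text)
    # Collapse runs of whitespace into one space and trim both ends
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical example string (not taken from the EPFL data)
example = "finance (intro).\r  what is risk?  cost: $5!"
print(clean_text(example))  # finance intro what is risk cost: 5
```

Note that the colon survives because it is not listed in the character class; extend the class if more symbols should be removed.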

In [None]:
# Copy the raw dataframe so the original stays intact while we clean the copy
df_dropna = df.copy()

In [None]:
# Replace characters with spaces; regex=False treats each pattern as a literal string
# (otherwise '.', '$', '(', ')' and '?' could be read as regex metacharacters)
df_dropna['description'] = df_dropna['description'].str.replace('\r', ' ', regex=False)
df_dropna['description'] = df_dropna['description'].str.replace('.', ' ', regex=False)
df_dropna['description'] = df_dropna['description'].str.replace(',', ' ', regex=False)
df_dropna['description'] = df_dropna['description'].str.replace(';', ' ', regex=False)
df_dropna['description'] = df_dropna['description'].str.replace('$', ' ', regex=False)
df_dropna['description'] = df_dropna['description'].str.replace('(', ' ', regex=False)
df_dropna['description'] = df_dropna['description'].str.replace(')', ' ', regex=False)
df_dropna['description'] = df_dropna['description'].str.replace('?', ' ', regex=False)
df_dropna['description'] = df_dropna['description'].str.replace('!', ' ', regex=False)


In [None]:
# Remove multiple white spaces
df_dropna['description'] = df_dropna['description'].str.replace(r'\s+', ' ', regex=True)

In [None]:
# Inspect the same description as above
df_dropna.iloc[8, 2]

**Q 3**: Clean data by removing rows with missing data in any column. How many clean rows are left?

In [None]:
# Remove rows with missing values and reset index
# (chaining reset_index returns a fresh dataframe, which is safe to add columns to later)
df_clean = df_dropna.dropna(axis=0, how='any').reset_index(drop=True)
df_clean.shape

In [None]:
df_clean['description'][0]

# Part 3: Tokenize and lemmatize course descriptions 

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For example:

- am, are, is $\Rightarrow$ be
- car, cars, car's, cars' $\Rightarrow$ car

For details about lemmatization and stemming visit: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
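To make the difference between the two approaches concrete, here is a deliberately simplified sketch. Both `toy_stem` and `toy_lemmatize`, as well as the tiny lookup table, are hypothetical stand-ins: real stemmers apply many more rules, and real lemmatizers (such as the spaCy pipeline used below) rely on a full vocabulary and part-of-speech information.

```python
# Toy lookup table standing in for a lemmatizer's dictionary (illustrative only)
TOY_LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car",
    "organizes": "organize", "organizing": "organize",
}

def toy_stem(word):
    """Crude suffix stripping: chops a known ending, result may not be a real word."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words like 'is' stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Dictionary lookup: maps a word to its base form if the table knows it."""
    return TOY_LEMMAS.get(word, word)

for w in ["organizing", "organizes", "cars", "are"]:
    print(w.ljust(12), toy_stem(w).ljust(12), toy_lemmatize(w))
```

The stemmer maps both "organizing" and "organizes" to the non-word "organiz" and leaves "are" unchanged, whereas the lookup returns "organize" and "be".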

**Q 1**: Define a simple function that takes a course description as input and outputs the tokenized and lemmatized text as a list.

In [None]:
# Load spaCy 
# (if spacy cannot load data run this in terminal: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# Tokenization and lemmatization function
def lemmatize_text(text):
    
    """
    Tokenize and lemmatize text
    
    Parameter: 
        text (str): input text
    
    Returns: 
        list: list of tokenized and lemmatized text 
        
    """
        
    # Set up text-processing pipeline
    text = nlp(text)
    text_tokenizedLemmatized = []

    # Tokenize and Lemmatize each word
    for word in text:
        text_tokenizedLemmatized.append(word.lemma_)
        
    return text_tokenizedLemmatized


In [None]:
# Example of lemmatizing a verb with and without POS
lemmatizer = nltk.stem.WordNetLemmatizer()

print('Without POS information:'.ljust(30) + str(lemmatizer.lemmatize('are')))
print('With POS information:'.ljust(30) + str(lemmatizer.lemmatize('are', pos='v')))

**Q 2**: Apply the function to the course description in your Pandas dataframe. 

Hint: Use the Pandas `apply` function. Look at the second row - what changed? Did lemmatization work?

In [None]:
df_clean['description'].iloc[1]

In [None]:
# Apply the lemmatization function
df_clean['description_lemmatized'] = df_clean['description'].apply(lemmatize_text)


In [None]:
' '.join(df_clean['description_lemmatized'].iloc[1])

# Part 4: Create a term frequency inverse document frequency (TFIDF) matrix

We now have to ensure that the text description is stored as a string in our dataframe, not as a list. In the code below, replace the variable names with the ones you are using.

In [None]:
# Transform list data to text data
df_clean['description_lemmatized_text'] = df_clean['description_lemmatized'].str.join(' ')
df_clean['description_lemmatized_text'].head()

In [None]:
# Extract all the text data
data = df_clean['description_lemmatized_text']
len(data)

**Q 1**: Fit and transform your text data using TFIDF. 

Hint: use the `TfidfVectorizer` class in sklearn with the parameter `max_features = 400`.

In [None]:
# Load the vectorizer
vectorizer = TfidfVectorizer(min_df = 1, max_features = 400, stop_words='english')

In [None]:
# Fit and transform the course description data
X = vectorizer.fit_transform(data)

**Q 2**: What shape does the TFIDF matrix have? What's the meaning of the number of columns? Use the `toarray()` function to show some of the TFIDF entries.

In [None]:
# Convert from sparse matrix to regular matrix
X = X.toarray()

In [None]:
# Each column is a unique term/word
X.shape

In [None]:
# Look at some TFIDF values
X
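To build intuition for what the entries of the matrix mean, the following sketch computes TFIDF weights by hand on a toy corpus. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which to our understanding matches sklearn's default settings; the tokenization here is a plain whitespace split with no stop-word removal, so the values will not match `TfidfVectorizer` exactly on real text. The function name `tfidf` and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, matrix) with smoothed idf and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency (assumed to mirror sklearn's default)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit Euclidean length
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix.append([v / norm for v in row])
    return vocab, matrix

# Hypothetical toy corpus: one row per document, one column per unique term
docs = ["linear algebra course", "algebra and geometry course", "neuroscience course"]
vocab, X_toy = tfidf(docs)
print(vocab)
print(len(X_toy), len(X_toy[0]))  # 3 6
```

The term "course" appears in every document, so it receives the lowest weight in each row, while rarer terms such as "linear" score higher.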

# Part 5: Apply a principal component analysis (PCA)

We now project the high-dimensional TFIDF matrix onto its first 2 principal components.

**Q 1**: Run a PCA on the TFIDF matrix. Hint: use the `PCA.fit_transform()` function in sklearn.

In [None]:
# Apply PCA
pca = PCA(n_components = 2, random_state = SEED)
pca_out = pca.fit_transform(X)

**Q 2**: What's the shape of the PCA output? Why?

In [None]:
pca_out.shape
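To see why the output has one row per document and one column per component, the sketch below computes only the first principal component of a tiny matrix in plain Python. It uses power iteration on the covariance matrix, which is a swapped-in teaching technique and not how sklearn's `PCA` is implemented. The function name and the toy data are illustrative assumptions, and the sign of a component is arbitrary, so it may be flipped relative to other implementations.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center each column at zero mean
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration; assumes the start vector is not orthogonal to the answer
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centered row onto the direction: one score per row
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical toy data: 4 rows, 3 columns, all variance along the x = y diagonal
toy = [[0, 0, 5], [1, 1, 5], [2, 2, 5], [3, 3, 5]]
direction, scores = first_principal_component(toy)
print(direction)
print(len(scores))  # 4
```

Three input columns collapse into a single score per row; keeping two components, as in the notebook, yields two such columns.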

**Q 3**: This is done for you: add the PCA values to the cleaned dataframe. We want a dataframe with the course name and the PCA values.

In [None]:
# Convert PCA output to dataframe
pca_df = pd.DataFrame(data = pca_out, columns = ['PCA1', 'PCA2'])
pca_df.shape

In [None]:
# Concatenate dataframes
df_out = pd.concat([df_clean, pca_df], axis = 1)
df_out.head()

# Part 6: Visualize how similar EPFL courses are to each other

We now return to the initial business objective: to discover similarity relationships between EPFL courses based on their textual descriptions.

**Q 1**: Visualize the PCA values using a simple scatter plot. Each dot represents a course.

In [None]:
# Plot the PCA values
plt.figure(figsize=(16,12))
plt.scatter(df_out['PCA1'], df_out['PCA2'], s = 10)
plt.xlabel('PC 1')
plt.ylabel('PC 2')

**Q 2**: What's the most similar course to COM-421? Compute the Euclidean distance between COM-421 and every other course in the 2-dimensional PCA space. To do so, you can use the function below:

```
def euclideanDistance(p1, p2):
    """
    Compute euclidean distance
    
    Parameter: 
        p1 (list): input point defined as a [x,y] list
        p2 (list): input point defined as a [x,y] list
    
    Returns: 
        float: euclidean distance between p1 and p2
        
    """   
    
    return math.sqrt( ((p1[0]-p2[0])**2)+((p1[1]-p2[1])**2) )
```

In [None]:
def euclideanDistance(p1, p2):
    """
    Compute euclidean distance
    
    Parameter: 
        p1 (list): input point defined as a [x,y] list
        p2 (list): input point defined as a [x,y] list
    
    Returns: 
        float: euclidean distance between p1 and p2
        
    """   
    
    return math.sqrt( ((p1[0]-p2[0])**2)+((p1[1]-p2[1])**2) )

In [None]:
# Enter course code & the # of most similar courses to display
CODE = 'COM-421'
TOP  = 30

WIDTH = 80

print('-'*WIDTH)
print('FOCAL COURSE')
print('-'*WIDTH)
print(df_out[df_out['code'] == CODE]['course'].values[0], '\n')

# Find the most similar courses
p1 = df_out[df_out['code'] == CODE][['PCA1', 'PCA2']].values[0]

# Collect all distance metrics 
all_distances = {}
for i, row in df_out.iterrows():
    p2 = row[['PCA1', 'PCA2']].values
    distance = euclideanDistance(p1, p2)
    all_distances[row['course']] = distance

# Sort distances in increasing order of distance 
all_distances = {k: v for k, v in sorted(all_distances.items(), key=lambda item: item[1])}

# Print most similar courses 
i = 0
print('-'*WIDTH)
print('TOP {} COURSES'.format(TOP).ljust(70) + 'DISTANCE')
print('-'*WIDTH)
for k, v in all_distances.items():
    print('{}'.format(k).ljust(70) + '{}'.format(round(v, 4)))
    i+= 1
    if i == TOP: break
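The lookup above can also be sketched compactly on toy data. This variant uses `math.dist` (available from Python 3.8) and, unlike the cell above, excludes the focal course from its own ranking. The function name `most_similar`, the course labels, and the coordinates are hypothetical.

```python
import math

def most_similar(points, focal, top=3):
    """Rank all other keys by Euclidean distance to the focal key."""
    p1 = points[focal]
    # Skip the focal point itself, which would otherwise rank first at distance 0
    distances = {name: math.dist(p1, p2)
                 for name, p2 in points.items() if name != focal}
    return sorted(distances.items(), key=lambda item: item[1])[:top]

# Hypothetical 2D coordinates standing in for the PCA1/PCA2 columns
toy_points = {
    "COURSE-A": (0.0, 0.0),
    "COURSE-B": (3.0, 4.0),
    "COURSE-C": (1.0, 0.0),
    "COURSE-D": (-2.0, 0.0),
}
print(most_similar(toy_points, "COURSE-A", top=2))
# [('COURSE-C', 1.0), ('COURSE-D', 2.0)]
```

Keying the dictionary by a unique identifier such as the course code, not the title, avoids silently overwriting entries when two courses share a name.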

In [None]:
# focal course description (lemmatized)
df_out[df_out['code'] == CODE]['description_lemmatized_text'].values[0]

In [None]:
# similar course description (lemmatized)
course_name = 'Neurosciences III : behavioral and cognitive neuroscience'
df_out[df_out['course'] == course_name]['description_lemmatized_text'].values[0]