# Introduction to NLP

**Goal of the lab**:
Given a set of Shakespeare plays, can we:
- Find similarities across the different plays?
- Find the most frequent words per play?
- Characterize the plays and interpret the results?

## Prerequisites
To run this lab, you need the following installed on your system:
- `pandas`
- `matplotlib`
- `seaborn`
- `spacy` (and the English model, installed by running `python -m spacy download en_core_web_sm`)
- `scikit-learn` (imported as `sklearn`)

In [1]:
import pandas as pd
import seaborn as sns
import spacy
from matplotlib import pyplot as plt

plt.rcParams['figure.figsize'] = [15, 5]



## Loading the dataset

> Reminder from yesterday's session.

In [2]:
df = pd.read_csv("data/dataset.csv")

**Exercise**:
1. Load the dataset located in "data/dataset.csv".
2. Give the number of individuals and list the columns in the dataset.
3. Give the number of unique authors in the dataset.
4. Give the number of unique schools of thought in the dataset.
5. Give the number of authors per school of thought in the dataset.
6. Plot the number of books by author.
7. Plot the number of books by school.

## Cleaning up the data
Textual data should be:

- Lemmatized
- Cleaned of stop words and punctuation.

We will do this using spaCy's built-in features.
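As a minimal sketch of the filtering step, the snippet below uses `spacy.blank("en")`, a tokenizer-only pipeline chosen here (as an assumption, for portability) because it runs without downloading a model. It flags stop words and punctuation but provides no lemmas; with `en_core_web_sm` loaded, `token.lemma_` could be used in place of `token.lower_`. The helper name `clean_tokens` and the example sentence are hypothetical.

```python
import spacy

# Tokenizer-only English pipeline: exposes is_stop / is_punct flags, but no lemmatizer.
nlp_blank = spacy.blank("en")


def clean_tokens(text):
    """Return lowercased tokens that are not stop words, punctuation, or whitespace."""
    doc = nlp_blank(text)
    return [
        token.lower_
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


tokens = clean_tokens("The cat sat on the mat.")
print(tokens)
```

Joining the returned tokens with `" ".join(tokens)` gives a string that can be stored in a dataframe column.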

In [6]:
nlp = spacy.load("en_core_web_sm")
# Take the first sentence from the dataset
test_sentence = df["sentence_str"].iloc[0]
# Run it through spacy nlp function
doc = nlp(test_sentence)

# Iterate over the tokens to access their parsed attributes (only the first token is printed here)
for token in doc[:1]:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

There there PRON EX expl Xxxxx True True


**Exercise**:
1. For each book, compute `uncapitalized_text` (using the `.lower()` method) and `lemmed_text` (using spaCy).
2. Add these as two new columns of the dataframe: `uncapitalized_text` and `lemmed_text`.
3. Add a new column `cleaned_text` which contains the final text: start with the column `lemmed_text` and remove the stop words contained in `spacy.lang.en.stop_words.STOP_WORDS`. You should remove punctuation as well, using the token attribute `is_punct` (stop words can also be detected with `is_stop`).

## Analyzing vocabulary use

Using the `sklearn` library (see slides in lectures), it is easy to compute word counts using the `CountVectorizer` class.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [22]:
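# Initialize the vectorizer with spaCy's English stop-word list
count_vect = CountVectorizer(stop_words=list(spacy.lang.en.STOP_WORDS))

# Fit on the raw sentences (use the `cleaned_text` column instead once you have built it)
count_vect.fit(df.sentence_str)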

# Get count matrix
count_matrix = count_vect.transform(df.sentence_str).todense()

# Create as dataframe
count_df_ = pd.DataFrame(count_matrix, columns=count_vect.get_feature_names_out(), index=df.title)

# Merge with df to recover the metadata columns
# Careful: column names present in both frames get the suffixes _x (counts) and _y (df)
count_df = count_df_.merge(df, left_index=True, right_on="title")



`count_df` is now a dataframe in which each word is a column and each row holds the number of occurrences of that word in the corresponding text.

**Exercise**:
1. Give the 5 most frequent words.
2. For each book, find the number of times the words `man` and `woman` are used.
3. Find the book that uses the word `essence` most often.
4. Find the school of thought that uses the word `god` most often.
5. For each school of thought, print the 25 most frequently used words.
6. Perform the same exercise (you can copy and paste the code) using tf-idf (`TfidfVectorizer`).
7. Create a new list called `columns_to_filter` that contains, for each author, the 25 words with the highest tf-idf.
8. Create a new dataframe called `filtered_df` which contains only the columns listed in `columns_to_filter`.

## Projecting into a lower-dimensional space

We are going to use `PCA` (see slides in lecture) to project the different books into a reduced two-dimensional space and to visually analyze their similarity.

In [23]:
from sklearn.decomposition import PCA

# Reduce to a 2-dimensional matrix
pca = PCA(n_components=2)

# Train and retrieve output
reduced_pca = pca.fit_transform(count_df_)

**Exercise:**
1. Draw a scatter plot of the books in the 2-dimensional space.
2. Color the points by author.
3. Add the title of each book using `plt.annotate`.
4. Can you infer anything from the distribution in the reduced space?
5. Perform the same exercise on `filtered_df`.

**Bonus**:
- Find the explained variance ratio using the attribute `pca.explained_variance_ratio_`.
- Analyze the meaning of each axis by plotting the components (`pca.components_`) and their associated words.
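One possible way to read the axes is sketched below on a hypothetical 6 x 4 count matrix (the numbers and word columns are invented). It prints the share of variance captured by each component and wraps `components_` in a dataframe so each weight can be matched to its word.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical count matrix: 6 documents (rows) x 4 words (columns).
toy_counts = pd.DataFrame(
    [
        [5, 4, 0, 1],
        [6, 5, 1, 0],
        [4, 6, 0, 0],
        [0, 1, 5, 4],
        [1, 0, 6, 5],
        [0, 0, 4, 6],
    ],
    columns=["god", "essence", "man", "state"],
)

toy_pca = PCA(n_components=2)
coords = toy_pca.fit_transform(toy_counts)

# Fraction of the total variance captured by each of the two axes.
print(toy_pca.explained_variance_ratio_)

# Loadings: one row per component, one column per word.
loadings = pd.DataFrame(
    toy_pca.components_, columns=toy_counts.columns, index=["PC1", "PC2"]
)

# Words with the largest absolute weight on the first axis.
print(loadings.loc["PC1"].abs().sort_values(ascending=False).head(2))
```

Words with large weights of opposite sign on the same component pull documents toward opposite ends of that axis.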

## Performing k-means clustering

We are going to use k-means clustering (see lecture slides) to group works that are the most similar in terms of vocabulary use, using the class `KMeans`. 

In [25]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=11, n_init="auto")
kmeans.fit(count_df_)
labels = kmeans.labels_



The `labels` array contains, for each row of `count_df_` (that is, for each title), the index of its assigned cluster.

**Exercise**:
1. Assign `labels` as a new column of the dataframe.
2. List the titles, authors, and schools within each cluster.
3. Give the 10 words with the highest term frequency per cluster.
4. Conclude regarding the overlap in topics.
   
**Bonus**:
1. Select the optimal number of clusters by analyzing the inertia for each value of k.
2. Test using the `DBSCAN` algorithm.
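Both bonus ideas can be sketched on synthetic data. The points below are hypothetical (three well-separated 2D groups generated for illustration), and the `eps` and `min_samples` values are assumptions tuned to this toy data, not recommendations for the lab dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Hypothetical 2D points forming three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Inertia (within-cluster sum of squared distances) for each candidate k.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_
print(inertias)

# DBSCAN infers the number of clusters from density; the label -1 marks noise.
db_labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(points)
n_found = len(set(db_labels) - {-1})
print(n_found)
```

Plotting `inertias` against k and looking for the point where the curve flattens (the "elbow") is a common heuristic for choosing k.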