# Introduction to NLP

**Goal of the lab**:
Given a set of Shakespeare plays, can we:
- Find similarities across the different plays?
- Find the most frequent words per play?
- Characterize the plays and interpret the results?

## Prerequisites
To run this lab, you need the following packages installed on your system:
- `pandas`
- `matplotlib`
- `seaborn`
- `spacy` (plus the English model, installed by running `python -m spacy download en_core_web_sm`)
- `sklearn` (installed under the package name `scikit-learn`)

In [2]:
import pandas as pd
import seaborn as sns
import spacy
from matplotlib import pyplot as plt

plt.rcParams['figure.figsize'] = [15, 5]

## Loading the dataset

The file `dataset.csv` contains the aggregated data of a set of Shakespeare's plays.

> This is a reminder from yesterday's session!

In [3]:
df = pd.read_csv("data/dataset.csv")

**Exercise**:
1. Load the dataset located in "data/dataset.csv".
2. Give the number of rows (individuals) and list the columns in the dataset.
3. Give the number of unique plays in the dataset and output a list of them.
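
As an illustration of the kind of pandas calls involved, here is a minimal sketch on a small made-up frame. The rows below are invented for the example, and only the `Play` and `PlayerLine` column names come from this lab; the real dataset may contain more columns.

```python
import pandas as pd

# Toy stand-in for the real dataset: these rows are invented for illustration
toy_df = pd.DataFrame(
    {
        "Play": ["Hamlet", "Hamlet", "Macbeth"],
        "PlayerLine": ["To be, or not to be", "The rest is silence", "Out, damned spot"],
    }
)

# shape is a (number of rows, number of columns) tuple
n_rows, n_cols = toy_df.shape
# Column labels as a plain Python list
column_names = toy_df.columns.tolist()
# Distinct play titles, in order of first appearance
unique_plays = toy_df["Play"].unique().tolist()
n_unique_plays = toy_df["Play"].nunique()

print(n_rows, column_names)
print(n_unique_plays, unique_plays)
```

The same attributes and methods should apply unchanged to the `df` loaded above.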

## Cleaning up the data
Textual data should be:

- Lemmatized
- Cleaned of stop words and punctuation.
  
We will do this using spaCy's built-in features.

In [4]:
nlp = spacy.load("en_core_web_sm")
# Take the first sentence from the dataset
test_sentence = df["PlayerLine"].iloc[0]
# Run it through spacy nlp function
doc = nlp(test_sentence)

# You can now iterate over the tokens and access their parsed attributes
for token in doc[0:3]:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

A a DET DT det X True True
hall hall NOUN NN ROOT xxxx True False
in in ADP IN prep xx True True


**Exercise**:
1. Compute for each play the uncapitalized text (using the `.lower()` method) and the lemmatized text from spaCy.
2. Add 2 new columns to the dataframe: `uncapitalized_text` and `lemmed_text`.

**Bonus**: Add a new column `cleaned_text` which contains the final text: start with the column `lemmed_text` and remove the stop words contained in `spacy.lang.en.stop_words.STOP_WORDS`.
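
One possible way to approach the stop-word step is sketched below. It only imports the stop-word set, so it does not need the `en_core_web_sm` model. The helper name `remove_stop_words` is hypothetical, the `lemmed_text` values are hand-written stand-ins rather than real spaCy output, and splitting on whitespace is a simplifying assumption.

```python
import pandas as pd
from spacy.lang.en.stop_words import STOP_WORDS


def remove_stop_words(text: str) -> str:
    """Hypothetical helper: drop whitespace-separated tokens that are English stop words."""
    return " ".join(tok for tok in text.split() if tok.lower() not in STOP_WORDS)


# Hand-written stand-ins for already lemmatized text (not actual spaCy output)
toy_df = pd.DataFrame({"lemmed_text": ["a hall in the palace", "the king be dead"]})

# Apply the helper row by row to build the new column
toy_df["cleaned_text"] = toy_df["lemmed_text"].apply(remove_stop_words)
print(toy_df)
```

Working on the `Doc` tokens directly and filtering with `token.is_stop` and `token.is_punct` is an alternative that also handles punctuation.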

## Analyzing vocabulary use

Using the `sklearn` library (see slides in lectures), it is easy to compute word counts using the `CountVectorizer` class.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
# Initialize 
count_vect = CountVectorizer(stop_words=list(spacy.lang.en.STOP_WORDS))

# Fit on the text column (raw `PlayerLine` here; use `cleaned_text` if you built it)
count_vect.fit(df.PlayerLine)

# Get count matrix
count_matrix = count_vect.transform(df.PlayerLine).todense()

# Create as dataframe
count_df_ = pd.DataFrame(count_matrix, columns=count_vect.get_feature_names_out(), index=df.Play)

# Merge with df to get information
# Careful, new column names will have _y appended
count_df = count_df_.merge(df, left_index=True, right_on="Play")



`count_df` is now a dataframe where each column is a word and each cell holds the number of occurrences of that word in the corresponding row.

**Exercise**:
1. Give the 5 most frequent words across all plays, using `count_df_`.
2. Find for each play the number of times the words `love` and `death` are used.
3. Find the play that uses the word `love` most often.
4. Give the most frequent word for each play.
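
The aggregations these questions call for can be sketched on a tiny hand-written count table. The table below is hypothetical and much smaller than the real one; it assumes one row per line of text, indexed by play title.

```python
import pandas as pd

# Hypothetical count table: one row per line of text, indexed by play title
toy_counts = pd.DataFrame(
    {"love": [2, 1, 0], "death": [0, 1, 3], "king": [1, 0, 1]},
    index=pd.Index(["Play A", "Play A", "Play B"], name="Play"),
)

# Total count of each word over all rows, most frequent first
overall = toy_counts.sum().sort_values(ascending=False)

# Sum the rows that share the same play title
per_play = toy_counts.groupby(level="Play").sum()

# For each play, the column (word) holding the largest count
top_word_per_play = per_play.idxmax(axis=1)

# The play whose total for "love" is the largest
play_with_most_love = per_play["love"].idxmax()

print(overall.head(2))
print(per_play[["love", "death"]])
print(top_word_per_play)
print(play_with_most_love)
```

Grouping on the index level is one option among several; resetting the index and grouping on a `Play` column would give the same totals.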


## Projecting into a lower space

We are going to use `PCA` (see slides in lecture) to project the different plays into a reduced two-dimensional space and visually analyze their similarity.

In [16]:
from sklearn.decomposition import PCA

# Reduce to a 2-dimensional matrix
pca = PCA(n_components=2)

# Train and retrieve output
reduced_pca = pd.DataFrame(pca.fit_transform(count_df_), columns=["axis_0","axis_1"])

**Exercise:**
1. Draw a scatter plot of the plays in the two-dimensional space.
2. Add the title of each play using `plt.annotate`.
3. Can you infer anything from the distribution in the reduced space?

**Bonus**:
- Find the explained variance ratio using the attribute `pca.explained_variance_ratio_` and conclude on the relevance of the chosen axes.
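
A rough sketch of the projection, the annotated scatter plot and the explained variance check is given below. It runs on random counts, so the resulting picture carries no meaning; the titles and the `word_i` column names are placeholders, and `ax.annotate` is used as the axes-level equivalent of `plt.annotate`.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA

# Random stand-in for a count matrix: 4 hypothetical plays, 6 hypothetical words
rng = np.random.default_rng(0)
titles = ["Play A", "Play B", "Play C", "Play D"]
toy_counts = pd.DataFrame(
    rng.integers(0, 20, size=(4, 6)),
    index=titles,
    columns=[f"word_{i}" for i in range(6)],
)

# Project every row onto the first two principal components
pca = PCA(n_components=2)
coords = pd.DataFrame(pca.fit_transform(toy_counts), columns=["axis_0", "axis_1"], index=titles)

# Scatter plot with one text label per point
fig, ax = plt.subplots()
ax.scatter(coords["axis_0"], coords["axis_1"])
for title, (x, y) in coords.iterrows():
    ax.annotate(title, (x, y))
ax.set_xlabel("axis_0")
ax.set_ylabel("axis_1")

# Share of the total variance carried by each of the two axes
explained = pca.explained_variance_ratio_
print(explained, explained.sum())
```

If the two ratios add up to a small fraction of 1, most of the variance lies outside the plotted plane and distances in the plot should be read with caution.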

## Performing k-means clustering

We are going to use k-means clustering (see lecture slides) to group the plays that are most similar in terms of vocabulary use, using the `KMeans` class.

In [18]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init="auto")
kmeans.fit(count_df_)
labels = kmeans.labels_

The `labels` array contains the cluster assigned to each row of `count_df_`.

**Exercise**:
1. Add `labels` as a new column of the dataframe `count_df`.
2. Provide the list of plays within each cluster.
3. Give the 10 words with the highest term frequency per cluster.
4. Conclude regarding vocabulary overlap in Shakespeare's plays.
   
**Bonus**:
1. Select the optimum number of clusters by analyzing the inertia for each value of k.
2. Test using the `DBSCAN` algorithm and compare results.
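
The inertia analysis (often called the elbow method) can be sketched as follows. The data here are synthetic 2D points drawn around three made-up centers, not the lab's count matrix, and `n_init=10` with a fixed `random_state` is an arbitrary choice made for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: 20 points around each of 3 made-up centers
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fit k-means for several values of k and record the inertia,
# i.e. the sum of squared distances of the points to their closest centroid
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On data like this, the inertia drops sharply up to the true number of groups and flattens afterwards; the value of k at that bend is a reasonable candidate. On real count data the bend is usually less clear-cut.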

**Bonus**: document embedding using Doc2Vec.

Another possible approach to embedding is the use of *neural networks* (we will see the practice behind these terms tomorrow).
Many models exist for this kind of embedding, and one of the most popular is Doc2Vec.
This model is available in the `gensim` library (you can access the documentation here: https://radimrehurek.com/gensim/models/doc2vec.html).

> Do not forget to install `gensim` if you want this code to run!
> There is a slight hiccup: `gensim` requires an older version of scipy, so you additionally need to run `pip install scipy==1.12`.

In [15]:
# Load required libraries
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# TaggedDocument expects a list of tokens, so split each line into words
documents = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(df.PlayerLine.values.tolist())]

# vector_size controls the dimension of the embeddings
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
# One embedding per row of df, so reuse df's index
embeddings = pd.DataFrame([model.infer_vector(doc.split()) for doc in df.PlayerLine.values.tolist()], index=df.index)

**Exercise**:
1. Perform k-means clustering on the neural network embeddings and compare the result with the count/tf-idf embeddings.
2. Set the `vector_size` argument to 2 in order to get 2-dimensional embeddings.
3. Plot the 2D embeddings, color and annotate them by play name, and draw conclusions about the similarity between plays.