# DSFM text-as-data workshop

## 2. Basics of Natural Language Processing

Creator: [Data Science for Managers - EPFL Program](https://www.dsfm.ch)

Source: [https://github.com/dsfm-org/code-bank.git](https://github.com/dsfm-org/code-bank.git)

License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository.


### Overview

When dealing with text data, we can look at it from different perspectives. We can, for instance, look at a single sentence to study its linguistic structure. This includes finding named entities (named entity recognition) or part-of-speech tags (verb, adverb, ...).

Another approach is to consider the whole document as a single entity and look for similar documents. One simple solution is to count the words that two documents have in common. This is equivalent to representing each document as a vector and computing distances between vectors.
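The link between shared words and vectors can be sketched in a few lines of plain Python. The two sentences below are invented for illustration, and splitting on whitespace stands in for proper tokenization. With binary (present/absent) vectors over a common vocabulary, the dot product equals the number of distinct shared words.

```python
# Two toy documents (invented for illustration).
doc_a = "the battery life of this phone is great"
doc_b = "great phone but the battery drains fast"

# Approach 1: count the distinct words the two documents share.
words_a = set(doc_a.split())
words_b = set(doc_b.split())
shared = words_a & words_b

# Approach 2: represent each document as a binary vector over a common
# vocabulary, then take the dot product of the two vectors.
vocabulary = sorted(words_a | words_b)
vec_a = [1 if word in words_a else 0 for word in vocabulary]
vec_b = [1 if word in words_b else 0 for word in vocabulary]
dot_product = sum(a * b for a, b in zip(vec_a, vec_b))

print(sorted(shared))
print(len(shared), dot_product)
```

Both approaches give the same number here because each vector entry is 1 only when the word is present in the document.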

## Part 1: NLP with spaCy

We already used spaCy ([spacy.io](https://spacy.io/)) in the previous notebook to tokenize the text and find the stopwords. spaCy's catchphrase is "Industrial strength NLP in Python". spaCy is known to be fast and simple to use.

spaCy can help answer questions such as:
 - What is this text about?
 - What do the words mean in this context?
 - What companies and products are mentioned?
    
The main features of spaCy are:
 - Tokenization
 - Part-of-speech (POS) tagging: assigning word types, such as verb or noun, to tokens.
 - Dependency parsing: describing the relations between individual tokens (see below).
 - Named entity recognition: finding entities such as person or company names in a text.

Resources: 
 - [spaCy 101](https://spacy.io/usage/spacy-101), the official getting-started tutorial. 

Q1: Load the `review_clean.csv` CSV file into a Pandas DataFrame `df` and display the first 5 reviews.

In [None]:
import numpy as np
# Fix random seed for reproducibility
np.random.seed(42)

import pandas as pd
df = # YOUR CODE HERE #
df.head(# YOUR CODE HERE #)

Q2: Store the first review in a variable `first_review` and display it.

In [None]:
first_review = # YOUR CODE HERE #
first_review

Part-of-speech (POS) tagging is the process of assigning grammatical categories (noun, verb, adjective, adverb, etc.) to words. spaCy models use both the word itself and its context to determine the right tag.

Q3: Using spaCy, apply POS tagging to the first review and look at the results. What is the part-of-speech tag for the `-` token? What about `phone` and `Jana`?

> ☝️ At line 11 we overwrite the default `print` function with the one from the [rich](https://github.com/willmcgugan/rich) library, which pretty-prints the data.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(# YOUR CODE HERE #)

pos = []

for token in doc:
    pos.append(# YOUR CODE HERE #)

from rich import print
print(pos[-10:])

Q4: Using the [spaCy visualizer](https://spacy.io/usage/visualizers), display the dependency parse tree of the first sentence of the first review.

In [None]:
first_sentence = # YOUR CODE HERE #

print(f"First sentence is: {first_sentence}")

doc = nlp(# YOUR CODE HERE #)

from spacy import displacy
displacy.render(# YOUR CODE HERE #, style="dep")

Q5: Using spaCy's named entity recognizer, look at the named entities in the first and second reviews. What information can you get out of them? Could you use a similar function in your daily job?

In [None]:
doc = nlp(# YOUR CODE HERE #)
displacy.render(# YOUR CODE HERE #, style="ent")

In [None]:
second_review = # YOUR CODE HERE #
doc = nlp(# YOUR CODE HERE #)
displacy.render(# YOUR CODE HERE #, style="ent")

## Part 2: Vector Space

Machines cannot understand human language as we do, so we need to transform text data into a numeric format. The idea is to _map_ every review to a numeric vector.

Q1 (**theory**): You just received ten thousand new contracts. You need to sort them into 10 sub-categories efficiently. What do you do? You have no prior information: the data come with neither metadata nor labels. Describe in layman's terms how you would proceed.

**Answers**

1. Count the word occurrences in each document and build a document-term count matrix.
1. Apply a clustering algorithm such as k-means (with k = 10 in this case) to find the different clusters.
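The two steps can be sketched on a toy corpus. Everything below is an illustrative assumption: the four "contracts" are invented, k is 2 rather than 10, the starting centroids are fixed by hand to keep the run deterministic, and the small `kmeans` helper is a hypothetical plain-Python stand-in for a library implementation.

```python
from collections import Counter

# Toy "contracts" (invented); two obvious themes: leases and employment.
documents = [
    "lease rent tenant landlord",
    "tenant rent deposit lease",
    "salary employee employer notice",
    "employee salary bonus employer",
]

# Step 1: document-term count matrix (one row per document).
vocabulary = sorted({word for doc in documents for word in doc.split()})
matrix = []
for doc in documents:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(rows, initial_centroids, n_iterations=10):
    """Plain k-means: assign each row to its nearest centroid, then update."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(n_iterations):
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(row, centroids[k]))
            for row in rows
        ]
        for k in range(len(centroids)):
            members = [row for row, label in zip(rows, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Step 2: cluster with k = 2 (the exercise would use k = 10).
# Deterministic start for this example: the first and third documents.
labels = kmeans(matrix, initial_centroids=[matrix[0], matrix[2]])
print(labels)
```

Documents with the same label share most of their vocabulary, which is all the algorithm needs in order to group them without any labels.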

Q2: [Texthero](https://texthero.org/) is a simple toolkit to preprocess and analyze text-based datasets. Texthero is still in beta, so some parts might change in future releases.

With the aid of Texthero, represent each review by its word counts. Keep only the 500 most common words.

If you need help, you can have a look at the [getting-started](https://texthero.org/docs/getting-started) tutorial.
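To make the idea of "word counts restricted to the most common words" concrete, here is a small sketch using only the standard library. It is not Texthero's implementation. The three reviews are invented, and `max_features = 3` is an illustrative stand-in for the 500 used in the exercise.

```python
from collections import Counter

# Toy reviews (invented for illustration).
reviews = [
    "good phone good price",
    "bad battery good screen",
    "bad price bad service",
]
max_features = 3  # the exercise uses 500

# Count each word over the whole corpus and keep the most frequent ones.
corpus_counts = Counter(word for review in reviews for word in review.split())
vocabulary = sorted(word for word, _ in corpus_counts.most_common(max_features))

# One count vector per review, restricted to the kept vocabulary.
vectors = []
for review in reviews:
    review_counts = Counter(review.split())
    vectors.append([review_counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Rare words such as `service` are dropped, so every review is described by the same short list of frequent terms.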

In [None]:
import texthero as hero

df['count'] = # YOUR CODE HERE #
df['count']

Q3: By applying principal component analysis, reduce the dimension of the vector space to two.

In [None]:
df['pca'] = # YOUR CODE HERE #
df['pca']

Q4: Visualize the resulting vector space. Can you identify any pattern?

In [None]:
from matplotlib.pyplot import rcParams
rcParams['figure.figsize'] = 10, 8


import seaborn as sns; sns.set()
import seaborn
seaborn.# YOUR CODE HERE #

Q5: Find the reviews most similar to the second review. For this, you will need to compute the similarity between that review and every other review and pick the closest ones. You can use the [cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) function from `scikit-learn`.
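As a reminder of what cosine similarity measures, here is a minimal sketch in plain Python on invented count vectors. The helper `cosine` is a hypothetical name for this example, not the scikit-learn function. It returns the cosine of the angle between two vectors, so it depends on their direction and not on their length.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy word-count vectors (invented), one row per review.
count_vectors = [
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [1, 1, 1, 0],
]

query_index = 1  # the review we want neighbours for
scores = [cosine(count_vectors[query_index], row) for row in count_vectors]

# Rank the other reviews from most to least similar to the query.
ranking = sorted(
    (i for i in range(len(count_vectors)) if i != query_index),
    key=lambda i: scores[i],
    reverse=True,
)
print(ranking)
```

The review at index 2 shares no words with the query, so its similarity is 0 and it ends up last in the ranking.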

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

first_review_vector = # YOUR CODE HERE #
first_review_vector

cosine_similarity(
            np.asarray(list(df['count'])), np.array(# YOUR CODE HERE #).reshape(1, -1)
        ).reshape(1, -1)[0].argsort()[::-1]

In [None]:
df.iloc[# YOUR CODE HERE #]['text']

In [None]:
df.iloc[# YOUR CODE HERE #]['text']

In [None]:
df.iloc[# YOUR CODE HERE #]['text']

## Part 3: Topic modelling


Topic modeling is an unsupervised learning method. The goal is to group documents that share the same "topic". Topic models are useful for uncovering hidden structure in a collection of texts. Two common algorithms are Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).

Several Python libraries can be used for topic modeling; Gensim and scikit-learn are very common. Gensim's documentation is not always crystal clear, and the library can be complex to use in some scenarios. For this part, we will use scikit-learn, in particular [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [LatentDirichletAllocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).


Q1: Store all reviews in a variable `reviews` and compute the "review-term" matrix (`review_term_matrix`) using CountVectorizer. Then, display the shape of the resulting matrix. Does it look like what you expected?

> ☝️ For a faster computation, you can limit the number of terms to 500 (`max_features=500`).

> ☝️ Make sure you use the "text_clean" column with stopwords removed (otherwise stopwords will pollute the topics)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

reviews = list(df['text_clean'])

vectorizer = # YOUR CODE HERE #
review_term_matrix = # YOUR CODE HERE #

review_term_matrix.shape

Q2: Apply the LDA algorithm to the obtained `review_term_matrix`. You will need to specify the number of topics you want to compute as well as the number of iterations for the LDA algorithm. 

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(# YOUR CODE HERE #)
lda.fit(review_term_matrix);

Q3: The function `print_top_words` below prints the top words of each topic ("cluster"). Display the 15 most common words for each cluster. What do you notice?

In [None]:
def print_top_words(scikit_learn_model, feature_names, num_top_words):
    for topic_num, topic in enumerate(scikit_learn_model.components_):
        print(f"Topic #{topic_num}")
        print(" ".join([feature_names[i]
                             for i in topic.argsort()[:-num_top_words - 1:-1]]))
    print()
    
print_top_words(# YOUR CODE HERE #)
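The slice `[:-num_top_words - 1:-1]` used in `print_top_words` can look cryptic. The sketch below reproduces it on invented weights, using a plain Python list of sorted indices as a stand-in for NumPy's `argsort` (the slice behaves the same way on lists and arrays).

```python
# Toy topic weights (invented): one weight per vocabulary term.
feature_names = ["battery", "screen", "price", "camera", "delivery"]
topic_weights = [0.5, 2.0, 0.1, 3.5, 1.2]

# Pure-Python stand-in for argsort: indices ordered by ascending weight.
ascending = sorted(range(len(topic_weights)), key=lambda i: topic_weights[i])

num_top_words = 3
# Same slice as in print_top_words: walk backwards from the last index and
# stop after num_top_words items, so the heaviest terms come first.
top_indices = ascending[:-num_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]
print(top_words)
```

Reading the slice from right to left: the step `-1` reverses the order, and the stop `-num_top_words - 1` cuts the walk after `num_top_words` elements.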

Q4: If you wish, you can explore the resulting topic model by running these lines of code:

> ☝️ [PyLDAvis](https://github.com/bmabey/pyLDAvis) is a beautiful and simple library to visualize topic models.

In [None]:
import pyLDAvis
import pyLDAvis.sklearn  # in pyLDAvis 3.4.0 and later this module is named pyLDAvis.lda_model

# topic_vis_data = pyLDAvis.sklearn.prepare(lda, review_term_matrix, vectorizer)
# pyLDAvis.display(topic_vis_data)