# Hands-On NLP — Class 2

<span style="color:magenta">Group members:</span>

* Name 1
* Name 2
* Name 3

## Outline

- Embeddings from scratch

- Classification with embeddings

In [1]:
from pathlib import Path

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from nltk.tokenize import word_tokenize
from sklearn import (
    decomposition,
    ensemble,
    feature_extraction,
    linear_model,
    metrics,
    model_selection,
    multiclass,
    naive_bayes,
    neighbors,
    svm,
    tree,
)
from sklearn.feature_extraction.text import CountVectorizer
from tqdm.notebook import tqdm

In [3]:
tqdm.pandas()

nltk.download("punkt")

print("sklearn", sklearn.__version__)  # 1.6.0 in the run shown below

sns.set_style("darkgrid")
sns.set_context("notebook")

pd.set_option("display.precision", 2)

sklearn 1.6.0


[nltk_data] Downloading package punkt to C:\Users\Global
[nltk_data]     Computers\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [16]:
TEXT_P = Path("E:/Courses/M1 Data Science/T3/Hands-on NLP/TP2/texts")
print(TEXT_P.exists())

CORPORA = [
    "mythology",
    "woodworking",
    "robotics",
    "hsm",
    "health",
    "portuguese",
]

EPS = np.finfo(float).eps

True


## Getting the data

In [18]:
corpora = {}
stats = []

for corpus in tqdm(CORPORA):
    print(corpus)
    texts = []
    for fp in (TEXT_P / corpus).glob("*.txt"):
        print(fp)
        with fp.open(encoding="utf-8") as f:
            texts.append(f.read())

    # Join with a newline so the last word of one file does not fuse with the next
    corpora[corpus] = "\n".join(texts)

    stats.append(
        {
            "corpus": corpus,
            "files_n": len(texts),
            "chars_n": len(corpora[corpus]),
        }
    )

df = pd.DataFrame.from_records(stats, index=["corpus"])
df["text"] = [corpora[corpus] for corpus in corpora]
df


### Tokenizing

In [8]:
# If your machine is slow, pickling the tokens lets you skip tokenization next time.

tokens_fp = "tokens.pkl"
try:
    tokens = pd.read_pickle(tokens_fp)
except FileNotFoundError:
    tokens = df.text.progress_map(word_tokenize)
    tokens.to_pickle(tokens_fp)

In [9]:
df["tokens"] = tokens
df["tokens_n"] = df.tokens.map(len)
df["types_n"] = df.tokens.map(set).map(len)
df

| corpus | files_n | chars_n | text | tokens | tokens_n | types_n |
|---|---|---|---|---|---|---|
| mythology | 0 | 0 | | [Q, :, What, is, the, meaning, behind, Amatera... | 998926 | 53030 |
| woodworking | 0 | 0 | | [Q, :, Choice, of, marking, knives, I, have, a... | 1620394 | 35469 |
| robotics | 0 | 0 | | [Q, :, hector_mapping, +, imu, issue, Hi, ,, I... | 20494962 | 502398 |
| hsm | 0 | 0 | | [Q, :, What, is, the, origin, of, ``, root, ''... | 1865024 | 74635 |
| health | 0 | 0 | | [Q, :, Are, yawns, and, hiccups, pscyhosomatic... | 1915518 | 71817 |
| portuguese | 0 | 0 | | [Q, :, De, onde, surgiu, a, expressão, ``, vic... | 992726 | 63863 |


## Vectorization

### 🚧 TODO: How to vectorize text?

- Try counting words in the Stack Exchange corpus based on a given vocabulary

- Apply reduction techniques to reduce the dimensionality to 2 dimensions (e.g., PCA)

- Plot the 2D vectors
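
A minimal sketch of the counting step, assuming each corpus is already a list of tokens (as in `df.tokens`). The toy token lists and the `count_words` helper are hypothetical and only illustrate the idea; they are not part of the provided notebook.

```python
from collections import Counter

import pandas as pd


def count_words(tokens, vocabulary):
    """Return how often each vocabulary word occurs in a token list (exact match)."""
    counts = Counter(tokens)
    return {word: counts[word] for word in vocabulary}


# Hypothetical toy data standing in for df.tokens
toy_tokens = {
    "mythology": ["The", "myth", "of", "a", "wood", "nymph", "is", "a", "myth"],
    "robotics": ["A", "robot", "made", "of", "wood", "is", "still", "a", "robot"],
}
toy_words = ["myth", "wood", "robot"]

# One row per corpus, one column per vocabulary word
toy_wc_df = pd.DataFrame.from_dict(
    {name: count_words(toks, toy_words) for name, toks in toy_tokens.items()},
    orient="index",
)
print(toy_wc_df)
```

The same pattern should carry over to the real data by looping over `words` and filling one column of `wc_df` per word.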

In [None]:
words = (
    "myth,wood,robot,history,science,mathematics,health,portuguese,o".split(",")
)
wc_df = pd.DataFrame(index=df.index)
# for w in words:
#     ...

In [None]:
# wc_df

#### Bag of words

### 🚧 TODO: Implement another bag of words vectorizer model on the corpus

*   This time using [sklearn's `CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

*   Try first the provided example in the `CountVectorizer` documentation

    Try with and without the `ngram_range` parameter

*   Then try to vectorize the Stack Exchange corpus using `vocabulary=words`

#### First with a toy example

In [None]:
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

With the whole vocabulary

In [None]:
# vectorizer = feature_extraction...
# xs = vectorizer...

print(vectorizer.get_feature_names_out())
print(xs.toarray())

# vectorizer_2g = ...
# x2gs = ...

print(vectorizer_2g.get_feature_names_out())

In [None]:
cv_df = pd.DataFrame(xs.toarray(), columns=vectorizer.get_feature_names_out())
cv_df.insert(0, "Document", corpus)
cv_df

With a subset of the vocabulary

In [None]:
vocabulary = ["and", "document", "first"]
# vectorizer = ...

#### Reprocess the Stack Exchange corpora with `CountVectorizer`

In [None]:
# vectorizer = ...

### 🚧 TODO: Why is this different?

- Try to explain

- Give a simple example with the toy corpus below (with the same vocabulary)
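
One likely source of the difference is preprocessing: by default `CountVectorizer` lowercases the text and its `token_pattern` only keeps tokens of two or more word characters, whereas exact-match counting on a token list is case-sensitive and keeps one-letter tokens such as "o". The sketch below shows this on a made-up string; `str.split` stands in for `word_tokenize` here to keep the example self-contained, which is a simplification.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer

sample = "Myth myth o O wood"
vocab = ["myth", "wood", "o"]

# Manual, case-sensitive count (str.split stands in for word_tokenize here)
manual = Counter(sample.split())
manual_counts = [manual[w] for w in vocab]

# CountVectorizer defaults: lowercase=True, token_pattern=r"(?u)\b\w\w+\b"
default_cv = CountVectorizer(vocabulary=vocab)
cv_counts = default_cv.transform([sample]).toarray()[0].tolist()

# A pattern that also accepts one-character tokens
short_cv = CountVectorizer(vocabulary=vocab, token_pattern=r"(?u)\b\w+\b")
short_counts = short_cv.transform([sample]).toarray()[0].tolist()

print(manual_counts, cv_counts, short_counts)
```

With the defaults, the column for "o" stays at zero even though the word is in the vocabulary, because the tokenizer never produces a one-character token.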

In [None]:
test_text = (
    "myth wood robot history science mathematics health portuguese o "
    "myth wood robot history science mathematics health portuguese o"
)

# freqs = ...

# test_wc_df = ...

In [None]:
# test_xs = vectorizer...

# test_cv_df = ...

### 🚧 TODO: Is this difference important?

- Visualize the PCAs of both models
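
A possible way to compare the two models is to project each count matrix with `decomposition.PCA` and look at the coordinates side by side. The two matrices below are invented numbers, not results from the corpus, and `to_2d` is a hypothetical helper.

```python
import numpy as np
from sklearn import decomposition

# Hypothetical count matrices: rows are corpora, columns are vocabulary words
manual_xs = np.array([[9, 0, 1], [0, 8, 1], [1, 1, 7], [2, 0, 6]], dtype=float)
cv_xs = np.array([[12, 0, 1], [0, 9, 2], [1, 1, 30], [2, 0, 25]], dtype=float)


def to_2d(xs):
    """Project the rows of xs onto their first two principal components."""
    pca = decomposition.PCA(n_components=2)
    return pca.fit_transform(xs), pca.explained_variance_ratio_


manual_2d, manual_ratio = to_2d(manual_xs)
cv_2d, cv_ratio = to_2d(cv_xs)
print(manual_2d.shape, manual_ratio.round(2))
print(cv_2d.shape, cv_ratio.round(2))
```

Each projected array has one row per corpus and two columns, so it can be passed to a scatter plot, for example `plt.scatter(*manual_2d.T)`.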

-----------

### The corpus as individual documents

In [None]:
data = []

for i, corpus in enumerate(tqdm(CORPORA)):
    print(corpus)
    for fp in (TEXT_P / corpus).glob("*.txt"):
        with fp.open(encoding="utf-8") as f:
            text = f.read()
        data.append(
            {
                "id": fp.stem,
                "text": text,
                "category": corpus,
                "cat_id": i,
            }
        )

In [None]:
doc_df = pd.DataFrame.from_records(data).set_index("id")
doc_df

#### 🚧 TODO: Plot (bar) the number of documents per category
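
One way to get the counts is `value_counts()` on the category column. The small frame below is a hypothetical stand-in for `doc_df`.

```python
import pandas as pd

# Hypothetical stand-in for doc_df
toy_doc_df = pd.DataFrame(
    {"category": ["mythology", "robotics", "robotics", "health", "robotics"]}
)

# Number of documents per category, most frequent first
docs_per_cat = toy_doc_df["category"].value_counts()
print(docs_per_cat)

# docs_per_cat.plot.bar() would draw the bar chart (requires matplotlib)
```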

In [None]:
# doc_df...

#### 🚧 TODO: Boxplot the number of tokens per document

* With and without outliers

* Briefly explain the different values presented in a boxplot ([Wikipedia](https://en.wikipedia.org/wiki/Box_plot))

* Are the texts of significantly different lengths? Justify briefly.
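
To make the boxplot values concrete, the sketch below computes them by hand with NumPy on invented document lengths. It assumes the common convention (the default in Matplotlib and seaborn) that whiskers extend to the most extreme points within 1.5 × IQR of the box.

```python
import numpy as np

# Hypothetical token counts per document
lengths = np.array([120, 150, 180, 200, 220, 260, 300, 5000])

# The box spans Q1 to Q3, with a line at the median
q1, median, q3 = np.percentile(lengths, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = lengths[(lengths < lower_fence) | (lengths > upper_fence)]

print(q1, median, q3, iqr, outliers)
```

Here a single very long document is flagged as an outlier, which is why plotting with and without outliers can give very different pictures.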

In [None]:
# Careful: slow!

# doc_df["tokens_n"] ...

#### 🚧 TODO: How to find the crazy long robotics text?

*   Find the index of the longest text

*   Show the content

*   Explain why this text is so long (what does it contain?)
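
A sketch of the lookup with pandas, using `idxmax` on a length column. The frame is a hypothetical stand-in for `doc_df`, and character length is used as a cheap proxy for the token count.

```python
import pandas as pd

toy_doc_df = pd.DataFrame(
    {
        "text": ["short text", "a somewhat longer text", "x " * 50],
        "category": ["health", "hsm", "robotics"],
    },
    index=["doc_a", "doc_b", "doc_c"],
)

# Character length is a cheap proxy for the number of tokens
toy_doc_df["chars_n"] = toy_doc_df["text"].str.len()

# idxmax returns the index label of the longest document
longest_id = toy_doc_df["chars_n"].idxmax()
print(longest_id)
print(toy_doc_df.loc[longest_id, "text"][:40])  # peek at the start of the text
```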

In [None]:
# longests_df = doc_df[...

### Vectorizing again

#### 🚧 TODO: See how many features we get if we don't restrict their number

* Use `CountVectorizer` again to vectorize the Stack Exchange corpus

  * But use the whole vocabulary of the documents this time (**without** `vocabulary=words`)

  * Tell how many features are obtained

* Then limit the vocabulary to the 5000 most frequent words

* Apply dimensionality reduction to 2 dimensions and plot the result, as previously
  (only on the limited vocabulary)

We want something like this:

```python
# xs: features vectorized from doc_df.text
# ys: labels taken from doc_df.cat_id
```

In [None]:
ys = doc_df.cat_id.values

In [None]:
# unconstrained_cv = CountVectorizer()
# xs = ...

In [None]:
# cv = CountVectorizer(max_features=5000)
# xs = ...

In [None]:
# pca = ...

#### 🚧 TODO: Find that outlier!

* Use pandas to find the document corresponding to the outlier

* Print the corresponding text

* Tell what it contains (if you could figure it out)

* Remove the corresponding row from the dataframe and redo the dimensionality reduction (and plot)
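
One possible approach, sketched on synthetic data: take the point farthest from the origin of the PCA plane, look up its id, then refit without it. The matrix, the ids, and the planted extreme row are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import decomposition

rng = np.random.default_rng(0)

# Hypothetical count matrix: 21 documents, with one extreme row planted at position 7
toy_xs = rng.poisson(3, size=(21, 5)).astype(float)
toy_xs[7] = [3000, 2000, 4000, 1000, 5000]
toy_ids = pd.Index([f"doc_{i}" for i in range(21)], name="id")

xs_2d = decomposition.PCA(n_components=2).fit_transform(toy_xs)

# The outlier is the point farthest from the origin of the PCA plane
distances = pd.Series(np.linalg.norm(xs_2d, axis=1), index=toy_ids)
outlier_id = distances.idxmax()
print(outlier_id)

# Drop that row and refit the projection on the remaining documents
keep = toy_ids != outlier_id
xs_2d_clean = decomposition.PCA(n_components=2).fit_transform(toy_xs[keep])
print(xs_2d_clean.shape)
```

With the real data, the same id could be used with `doc_df.loc[...]` to print the text and with `doc_df.drop(...)` to remove the row.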

#### 🚧 TODO: The reason for this outlier is...

*   Give a short explanation

*   Remove the outlier from the dataframe

*   Redo the dimensional reduction

*   Plot the 2D vectors and color them by category

## Train models to predict text subjects

### Split the data in training and test sets

In [None]:
train_xs, test_xs, train_ys, test_ys = model_selection.train_test_split(
    xs, ys, test_size=0.3, random_state=0, shuffle=True
)
print(train_xs.shape)
print(test_xs.shape)

### 🚧 TODO: Apply different algorithms to try predicting the category

* E.g., Logistic Regression, Multinomial Naive Bayes, Decision Tree Classifier, Random Forest, Support Vector Classifier.

* You could investigate [`SGDClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html), which implements linear classifiers (e.g., SVM, logistic regression) with SGD training (faster).

* Present a table with the results of the different algorithms (e.g., accuracy, precision, recall, F1-score) and their execution time

* (Optional) Analyze one algorithm in detail (e.g., Logistic Regression)

  *   Try different parameters (possibly with a grid search)

  *   Present the [classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

  *   Present the confusion matrix of the best model
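
A minimal sketch of such a comparison loop, assuming a small made-up dataset in place of `doc_df`. The choice of three models, the `stratify` argument, and all variable names are illustrative, not prescribed by the exercise.

```python
import time

import pandas as pd
from sklearn import linear_model, metrics, model_selection, naive_bayes
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for doc_df.text and doc_df.cat_id
toy_texts = [
    "robot arm sensor",
    "robot motor control",
    "wood plane chisel",
    "wood saw joint",
] * 10
toy_ys = [0, 0, 1, 1] * 10

toy_xs = CountVectorizer().fit_transform(toy_texts)
tr_xs, te_xs, tr_ys, te_ys = model_selection.train_test_split(
    toy_xs, toy_ys, test_size=0.3, random_state=0, stratify=toy_ys
)

models = {
    "LogisticRegression": linear_model.LogisticRegression(max_iter=1000),
    "MultinomialNB": naive_bayes.MultinomialNB(),
    "SGDClassifier": linear_model.SGDClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    start = time.perf_counter()  # time fitting and prediction together
    model.fit(tr_xs, tr_ys)
    preds = model.predict(te_xs)
    rows.append(
        {
            "model": name,
            "accuracy": metrics.accuracy_score(te_ys, preds),
            "f1_macro": metrics.f1_score(te_ys, preds, average="macro"),
            "seconds": time.perf_counter() - start,
        }
    )

results_df = pd.DataFrame(rows).set_index("model")
print(results_df)
```

For a per-class breakdown of one model, `metrics.classification_report(te_ys, preds)` and `metrics.confusion_matrix(te_ys, preds)` take the same two arguments.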

### 🚧 TODO: Explain which model seems to work best

____