# Lab 2 - Text analysis

## Environment

We need `spacy` for text analysis, `scikit-learn` for calculations and `matplotlib` for charts and plots. We also need to download `en_core_web_sm`, the spaCy language model we will work with.

`datasets` is a module for easily loading datasets. They come from [HuggingFace](https://huggingface.co/docs/datasets/v1.8.0/loading_datasets.html).

In [None]:
!pip install spacy scikit-learn matplotlib datasets
!python -m spacy download en_core_web_sm

## Tokenization

In the first lab you found out how to tokenize your text into tensors that can be further used for word prediction. If the tokenization result is not meant to be used directly as neural network input, we can use the much friendlier tokenization from the `spacy` package. For example, we may want to split text into word tokens.

First, you need to import and initialize the `spacy` module.

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

Then, you can tokenize the text into words.

In [None]:
text = """The second lab will be exciting, too! 
There are many knowledge for you to gain like part of speech and
named entities recognition or stemming. It will be fun!"""

tokens = nlp(text)
[token.text for token in tokens]


Or sentences:

In [None]:
[sentence.text for sentence in tokens.sents]

### ⭐ Task for you ⭐

It may seem that tokenization is just splitting the text by spaces or dots. But it's smarter than that! Try to tokenize the following text into words.

```
We have been to U.K. before we got to the very special country, i.e. Poland.
```
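
Before trying `spacy` on this sentence, it may help to see how the naive approaches behave. The sketch below uses only the Python standard library; the two splitting strategies are illustrative assumptions, not something `spacy` does internally.

```python
import re

text = "We have been to U.K. before we got to the very special country, i.e. Poland."

# Naive approach 1: split on whitespace only.
# Punctuation stays glued to the neighbouring word ("country,", "Poland.").
by_spaces = text.split()

# Naive approach 2: split on whitespace, dots and commas.
# Abbreviations such as "U.K." and "i.e." fall apart into single letters.
by_punctuation = [part for part in re.split(r"[\s.,]+", text) if part]

print(by_spaces)
print(by_punctuation)
```

Neither result is what we would call a clean list of words, which is why a dedicated tokenizer is worth using.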

In [None]:
# your code

## Part of speech detection

You can use the `spacy` module to fetch information about the part of speech (POS) of every token. We can reuse the `tokens` object created in the previous step.

In [None]:
[(token.text, token.pos_) for token in tokens]

### ⭐ Task for you ⭐

**Now** go ahead and count how many times each POS tag occurs in the given text! We want to know how many verbs, adjectives, pronouns, etc. there are in the text. Extra bonus for a chart 📊 😀

In [None]:
# your code

## Lemmatization

If you want to count how many times a certain word has been mentioned in the text, it is very useful to reduce all of the words to their base forms. This process is called *lemmatization*. The text processed with `spacy` already contains lemmas for every token. We will use this technique further in the lab.

In [None]:
[(token.text, token.lemma_) for token in tokens]
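
To see why base forms matter for counting, here is a minimal sketch that does not use `spacy` at all. The word list and the `toy_lemmas` dictionary are hand-written, hypothetical stand-ins for what `token.lemma_` would give you on a real text.

```python
from collections import Counter

words = ["dog", "dogs", "barked", "barking", "dog", "barks"]

# Hypothetical hand-written lookup standing in for spaCy's token.lemma_
toy_lemmas = {"dogs": "dog", "barked": "bark", "barking": "bark", "barks": "bark"}

# Counting surface forms spreads one word over several entries...
raw_counts = Counter(words)

# ...while counting base forms merges them into a single entry per word.
lemma_counts = Counter(toy_lemmas.get(word, word) for word in words)

print(raw_counts)
print(lemma_counts)
```

With a real text you would build the second counter from `token.lemma_` instead of a lookup table.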

### ⭐ Task for you ⭐

Find lemmas for the following words:

* entities
* was
* mice
* cacti
* octopi

Are they lemmatized correctly with `spacy`?

In [None]:
# your code

## Named entity recognition

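Processing the text with `spacy` also results in recognizing named entities, i.e. real-world objects that can be referred to by name, such as people, organizations, locations, dates or amounts of money.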

### Basics

In [None]:
ner_result = nlp("Questions are swirling around $30M nomination of Andrea Riseborough to Oscar at 30th January 2023 in U.S.")
[(e.text, e.label_, e.start_char, e.end_char) for e in ner_result.ents]

If you wonder what a certain entity label means, you can ask `spacy` for an explanation.

In [None]:
spacy.explain('GPE')

#### ⭐ Task for you ⭐

Try to come up with a text that will contain an entity of `WORK_OF_ART` type.

In [None]:
# your code

### Visualization

You can use the `displacy` module of `spacy` to visualize the NER result. It will be much easier to analyze the text.

In [None]:
spacy.displacy.render(ner_result, style="ent", jupyter=True)

You can also display only specific entity types for better text understanding. Read the docs for the `displacy.render` function to find out more options you can configure here.

In [None]:
spacy.displacy.render(ner_result, style="ent", jupyter=True, options={"ents": ["MONEY", "DATE"]})

#### ⭐ Task for you ⭐

Try to analyze some longer text with `spacy` and visualize the NER result with `displacy`. Use some article found on the web.

Then, count how many times each entity type has been detected in the text and display some stats. Extra bonus for a chart 📊 😀

In [None]:
# your code

## Detecting text similarity

### Bag of words

Let's say we have three texts.

> The quick brown fox jumps over the lazy dog.

> The dog kept barking over the night.

> A lazy fisherman with his dog met a fox last night.

How similar are they to each other? Can we say they are talking about similar topics?

A very common way of finding this out is a technique called *bag of words*. It's based on calculating the frequency of words appearing in all of the texts, selecting the most popular ones and then representing each text as a list of integers containing the number of appearances of these words.

An example is better than a lecture!

We will use the `sklearn` module to calculate the text metrics. The `CountVectorizer` class does all of the calculations for us. The `max_features=5` parameter tells the vectorizer that we want to select at most 5 of the most popular tokens from all of the texts.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog kept barking over the night.",
    "A lazy fisherman with his dog met a fox last night.",
]

count_vector = CountVectorizer(max_features=5)
data_count = count_vector.fit_transform(texts)
data_count.toarray()

Wooow! What does it even mean? Let's see the tokens that were chosen to describe the texts.

In [None]:
# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
count_vector.get_feature_names_out()

Ok, so the chosen tokens are

```
['dog', 'fox', 'lazy', 'night', 'the']
```

and the representation of the texts after creating the bag of words is:

```
array([[1, 1, 1, 0, 2],
       [1, 0, 0, 1, 2],
       [1, 1, 1, 1, 0]])
```

It means that:
* the word `dog` appeared once in each of the texts
* the words `fox` and `lazy` appeared once in the first and the third text
* the word `night` appeared once in the second and the third text
* the word `the` appeared in the first and the second text, twice in each of them
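
The same counts can be reproduced by hand, which may make the representation less magical. This is a minimal sketch in plain Python, not how `CountVectorizer` is implemented. The `vocabulary` list is simply copied from the output above, and the regular expression mimics the vectorizer's default behaviour of lowercasing the text and keeping tokens of two or more characters.

```python
import re

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog kept barking over the night.",
    "A lazy fisherman with his dog met a fox last night.",
]

# Copied from the vectorizer output above, not computed here
vocabulary = ["dog", "fox", "lazy", "night", "the"]


def bag_of_words(text, vocabulary):
    """Count how many times each vocabulary word occurs in the text."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [tokens.count(word) for word in vocabulary]


vectors = [bag_of_words(text, vocabulary) for text in texts]
for vector in vectors:
    print(vector)
```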

Now you should understand the *bag of words* text representation. We can say that the more similar the vectors are, the more similar the texts are, too. We can calculate the distance between them and even visualize them on a chart, but we need a few more exercises and, obviously, more data!
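
As a small illustration of comparing the vectors, the sketch below computes the cosine similarity between each pair of texts. Cosine similarity is only one possible choice of metric (an assumption made here for illustration), and the short labels are just nicknames for the three texts.

```python
import math

# Bag-of-words vectors copied from the array above
vectors = {
    "quick fox": [1, 1, 1, 0, 2],
    "barking dog": [1, 0, 0, 1, 2],
    "lazy fisherman": [1, 1, 1, 1, 0],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


names = list(vectors)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = cosine_similarity(vectors[first], vectors[second])
        print(f"{first} vs {second}: {score:.2f}")

# quick fox vs barking dog: 0.77
# quick fox vs lazy fisherman: 0.57
# barking dog vs lazy fisherman: 0.41
```

Notice that the highest score is between the first two texts, and most of it comes from both of them containing `the` twice. Keep that in mind when choosing the features.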

#### ⭐ Task for you ⭐

Try to experiment with the `max_features` option. Which value of `max_features` gives the best vectors, in your opinion?

In [None]:
# Your code

### Stopwords

As you saw, the word `the` has also been counted although it does not carry any information in the text. This can greatly influence the results of our analysis, so it's very common to remove such words from the text before calculating any metrics. These words are called *stopwords* and the `sklearn` module has built-in mechanisms to remove them. Let's see some of them first.

In [None]:
from sklearn.feature_extraction._stop_words import ENGLISH_STOP_WORDS

list(ENGLISH_STOP_WORDS)[:10]

You don't need to import the stopwords to use them, because they are managed internally within the package (did you notice the `_` in the module name?). However, you may find it interesting to see what's inside!

Now, all you need to do is to tell the vectorizer which built-in list of stopwords to use before calculating the vectors.

In [None]:
count_vector = CountVectorizer(max_features=5, stop_words='english')
data_count = count_vector.fit_transform(texts)
count_vector.get_feature_names_out()



['barking', 'dog', 'fox', 'lazy', 'night']
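
Conceptually, stopword removal is just a filter applied to the tokens before counting. The sketch below shows the idea in plain Python. The `toy_stop_words` set is a tiny hand-picked list assumed for illustration only; the real `sklearn` list is much longer.

```python
import re
from collections import Counter

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog kept barking over the night.",
    "A lazy fisherman with his dog met a fox last night.",
]

# Hand-picked for this example, NOT the sklearn stopword list
toy_stop_words = {"the", "over", "with", "his", "last"}


def tokenize(text):
    """Lowercase the text and keep tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Count every token from every text, skipping the stopwords
counts = Counter(
    token
    for text in texts
    for token in tokenize(text)
    if token not in toy_stop_words
)

print(counts.most_common(4))
```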

### Visualization of the text vectors in the chart

Detecting similar texts if you have a lot of data can be challenging. It's always helpful to visualize the data on the screen, so we could plot the vectors and see if we can detect some groups on the screen. It will be hard for the three texts we are currently operating on, but you will get the idea.

However, the screens are 2D only in 2023. We can now postpone this lab and wait until 2048 when the 5D screens will be available, or use the popular `t-SNE` algorithm to *flatten* the data and then visualize them. We will take the second solution!

Without going too deep into how this algorithm works, it is able to reduce XD vectors to YD vectors, with X>Y, while trying to keep vectors that were close to each other close after the reduction. For our text, we want to reduce 5D vectors (5 features of the text) to 2D vectors (so to the format that can be plotted on the screen).

In [None]:
from sklearn.manifold import TSNE

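# perplexity must be smaller than the number of samples (we only have 3 texts),
# so the default value would raise an error in recent scikit-learn versions
tsne_model = TSNE(n_components=2, perplexity=2)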
tsne_data = tsne_model.fit_transform(data_count.toarray())

tsne_data

As you can see, the algorithm transformed all of the vectors into 2D. We can plot them!

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(tsne_data[:, 0], tsne_data[:, 1])

for i, label in enumerate(["quick fox", "barking dog", "lazy fisherman"]):
    ax.annotate(label, (tsne_data[i, 0], tsne_data[i, 1]))

plt.show()

There are only three data points, so it's hard to tell if the texts can be considered similar to each other or not. However, if we had many more texts, we might expect the data points to form some distinguishable groups, meaning the texts are talking about similar topics.
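
With so few points, one way to sanity-check the plot is to look at the distances in the original 5D space. The sketch below is only an illustration: the vectors were worked out by hand for the features `['barking', 'dog', 'fox', 'lazy', 'night']` listed earlier, and Euclidean distance is an assumed choice of metric.

```python
import math
from itertools import combinations

# Hand-derived counts for the features ['barking', 'dog', 'fox', 'lazy', 'night']
vectors = {
    "quick fox": [0, 1, 1, 1, 0],
    "barking dog": [1, 1, 0, 0, 1],
    "lazy fisherman": [0, 1, 1, 1, 1],
}

# Euclidean distance between every pair of texts in the original 5D space
distances = {
    (first, second): math.dist(vectors[first], vectors[second])
    for first, second in combinations(vectors, 2)
}

for (first, second), distance in distances.items():
    print(f"{first} vs {second}: {distance:.2f}")

# quick fox vs barking dog: 2.00
# quick fox vs lazy fisherman: 1.00
# barking dog vs lazy fisherman: 1.73
```

In the original space the first and the third text are the closest pair. With only three points t-SNE will not necessarily reproduce that faithfully on the plot.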

## Datasets

We need more data for the final task. Luckily, there are many options for us to start with while learning. One option is to use the [HuggingFace](https://huggingface.co/docs/datasets/v1.8.0/loading_datasets.html) `datasets` module to download some texts we can work on.

Let's see what's inside.

In [None]:
import datasets

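# Note: recent releases of `datasets` no longer ship this helper;
# if it is missing, `huggingface_hub.list_datasets()` can be used instead.
datasets.list_datasets()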

As you can see, there are many datasets we can work on. How do we load them?

In [None]:
dataset = datasets.load_dataset('ag_news', split='train')
dataset

As we saw in the previous examples, a list of texts is the easiest structure to work with for now. Since the above dataset has `text` and `label` fields, we can create a list of texts with a simple comprehension.

In [None]:
large_texts = [item['text'] for item in dataset]
large_texts[:10]

## ⭐ A big 🗻 task for you ⭐

You have all the tools!

Collect a large dataset of texts from *XXX* and:

1. Prepare them for analysis, e.g.
   1. Tokenize them.
   1. Transform the tokens into lemmas (so that `dog` and `dogs` are treated as the same feature).
2. Represent the texts as a bag of words, remembering about stopwords. Experiment with the feature count. If you find features that skew the representation, go back to step 1 and take them into account when preparing the data (maybe you want to get rid of numbers?).
3. Visualize the data on a plot (without labels for better performance). Can you distinguish some groups of texts? What are these texts about?
4. Detect named entities in the group representatives. Do the named entities also suggest the topic of the text?


In [None]:
# your code

In [None]:
# your code