<a href="https://colab.research.google.com/github/worldbank/dec-python-course/blob/main/2-advanced-topics/text-analysis/intro-text-analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Text Analysis

Text analysis is the process of extracting meaningful information from text data, uncovering insights that would otherwise remain buried under text corpora.

This session is an **introduction** to text analysis. We'll be covering the following topics:

1. Regex and character patterns in text data
1. Text data pre-processing
1. Counting words
1. Sentiment analysis
1. Text classification

The session assumes previous knowledge of Python and Pandas, and some knowledge of data visualization using seaborn.

We'll use the following libraries in this notebook:

- **pandas** for dataframe operations
- **re** for regular expressions
- **spacy** for text data processing
- **seaborn**, **matplotlib**, and **wordcloud** for data visualization
- **nltk** for sentiment analysis
- **sklearn** for data classification

## (some) Data exploration

We'll start by getting familiarized with our dataset. We'll use a structured tabular dataset of working papers obtained from the WB Documents API.

Run the following line to load the dataset:

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/worldbank/dec-python-course/main/2-advanced-topics/text-analysis/data/papers.csv')

In [None]:
len(df)

In [None]:
df.head()

The data is a corpus of working papers from the WB Policy Research Working Paper series. For each paper, we have:

- A paper identifier
- The Title
- Two URLs
- The topics of the paper, separated by commas
- An abstract
- A text

Let's take a closer look at the columns `url`, `url_text`, and `text`:

In [None]:
df['url'][0]

In [None]:
df['url_text'][0]

In [None]:
df['text'][0]

`url` contains the paper URL, `url_text` is the URL to the actual text content, and `text` is the text of the paper.

Now that we know what the data is about, we can start planning what to do with it. The tasks we'll do cover data preparation, basic description, and classification. This is a summary of what we'll do:

1. Generate new features (columns) based on the text
1. Count the words and most used words
1. Obtain the emotional tone (sentiment) of sentences
1. Build a topic classifier from our corpus

For the first task, we'll expand the columns of the dataset using existing patterns in the text.

## Patterns

Let's take another look at the text. This time we'll use the function `print()`, so that space characters are properly rendered and the text is easier to read.

In [None]:
print(df['text'][0])

Note that there are a number of information elements that seem to follow some patterns in the text:

- The WP number is in the last sequence of non-space characters in the first line
- The authors' names are a series of contiguous lines after the paper title
- Abstract: lines after the word "Abstract" at the beginning of the text. All of them seem to have a big space in the middle of the sentence
- Keywords: separated by semicolons in a line that starts with "Keywords"
- JEL Codes: an uppercase character followed by two digits, separated by commas
- Authors' emails: a sequence of non-space characters containing an "at" sign ("@") and ending in ".org" or ".com"
- Bibliography elements: last lines of the text

We're going to take advantage of the patterns of JEL codes to extract them in a new column and add them to our original dataframe. We'll use regular expressions for this.

**Important:** We're only checking one observation (the first) when inferring these patterns. If you want to create a column for the whole dataframe and not for a single observation, you have to make sure the same pattern exists in the rest of the texts of your corpus. We'll take it for granted in this session for the sake of time, but note that manually exploring different observations of your corpus is needed to infer possible patterns in your texts.

### Regular expressions

In programming, regular expressions are sequences of characters that define a search pattern for text. A simple example:

In [None]:
import re

In [None]:
text = 'The ID number of participant 1 is 30551. They were born on July 01, 1996. Participant 2 has ID 71098.'

# Pattern for capturing IDs in this text: sequences of five digits.
# The r prefix makes this a raw string, so the backslash reaches the regex engine untouched
pattern = r'\d{5}'

# Capturing IDs
ids = re.findall(pattern, text)
print(ids)

Some notes about this code:
- `\d` is a wildcard that represents one digit. For plain ASCII text like this example, it behaves like `[0-9]`
- `{5}` means that the previous element in the pattern is repeated exactly five times
- A variation of this pattern could be `\d{4}`, which could be used to capture years. On its own it would also match four-digit pieces of the IDs, returning `['3055', '1996', '7109']` in the example above. Adding word boundaries, as in `\b\d{4}\b`, returns only `['1996']`
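
To see how repetition and word boundaries interact, here is a short check on the same example sentence. It only uses the standard `re` module; the variable names are ours.

```python
import re

text = 'The ID number of participant 1 is 30551. They were born on July 01, 1996. Participant 2 has ID 71098.'

# Four digits anywhere, including inside longer runs of digits
loose_years = re.findall(r'\d{4}', text)

# Four digits that form a complete "word": \b marks a word boundary
strict_years = re.findall(r'\b\d{4}\b', text)

# The same idea protects the ID pattern from matching inside longer numbers
strict_ids = re.findall(r'\b\d{5}\b', text)

print(loose_years)   # ['3055', '1996', '7109']
print(strict_years)  # ['1996']
print(strict_ids)    # ['30551', '71098']
```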

In regex, there is a wildcard for almost everything. Some examples:

- Character wildcards:
    + `\d` --> digits (0-9)
    + `\w` --> any word character (uppercase and lowercase a-z, digits, and underscore ("_") ). The uppercase version `\W` is the opposite: any non-word character
    + `\n` --> newline characters
    + `\s` --> whitespace characters, including newline
    + `.` --> any character except newline
- Character repetition:
    + `{a}` --> the previous character, repeated "a" times
    + `{a,b}` --> the previous character, repeated between "a" and "b" times
    + `*` --> the previous character, repeated zero or more times
    + `+` --> the previous character, repeated one or more times
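
As a quick illustration of these wildcards, the sketch below runs a few patterns on a made-up two-line string (the address and amounts are invented for the example).

```python
import re

# Made-up sample text with a line break in the middle
sample = 'Contact: jdoe_42@example.org\nBudget: 1500 USD'

words = re.findall(r'\w+', sample)          # runs of word characters
short_numbers = re.findall(r'\d{2,3}', sample)  # two or three digits in a row
emails = re.findall(r'\S+@\S+', sample)     # non-space characters around an "@"
lines = re.findall(r'.+', sample)           # "." stops at the newline

print(words)          # ['Contact', 'jdoe_42', 'example', 'org', 'Budget', '1500', 'USD']
print(short_numbers)  # ['42', '150']
print(emails)         # ['jdoe_42@example.org']
print(lines)          # ['Contact: jdoe_42@example.org', 'Budget: 1500 USD']
```

Note how `\d{2,3}` takes as many digits as it can (three out of `1500`) and leaves the remaining `0` unmatched.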
    
Regex can match almost any pattern we can imagine. However, working with regex can be complex for beginners. For the purpose of this session, we've introduced regex so you know it exists and can be used to create columns in datasets containing a corpus of documents. Don't worry for now if you haven't fully grasped how the patterns work. If you're interested in learning more about regex, we recommend the following resources:

- A nice regex tutorial is [here](https://regexone.com/)
- A great regex visualizer tool is [here](https://jex.im/regulex/#!flags=&re=www%5C.%5Ba-zA-Z0-9-%5D%2B%5C.(%3F%3Acom%7Cnet%7Corg))

### Extracting information using patterns

Remember we said that JEL codes in the text looked like a pattern of one uppercase letter followed by two digits? We'll use this to extract the JEL codes of each paper in a new column in the dataframe.

In [None]:
pattern = r'[A-Z]\d{2}'

This pattern captures one uppercase alphabetic character (`[A-Z]`), followed by exactly two digits (`\d{2}`).

Now we'll define a helper function that looks for this pattern in a text and returns all captures in a list:

In [None]:
def capture_jel(text):
    
    pattern = r'[A-Z]\d{2}'  # raw string: one uppercase letter followed by two digits
    result = re.findall(pattern, text)
    
    return result

Lastly, we'll map this function using Pandas' `apply()` method to create a new column in the dataframe:

In [None]:
df['jel'] = df['text'].apply(capture_jel)

In [None]:
df.head()

Now we have a new column in our dataset. Great!
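
One caveat: the pattern matches anywhere in the text, so it can also pick up pieces of longer tokens and repeated codes. The sketch below shows a stricter variant that uses word boundaries and removes duplicates. The function name `capture_jel_strict` and the sample string are hypothetical, made up for this illustration, and we haven't checked this variant against the whole corpus.

```python
import re

# One uppercase letter plus two digits, standing alone as a complete token
JEL_PATTERN = re.compile(r'\b[A-Z]\d{2}\b')

def capture_jel_strict(text):
    """Return the distinct JEL-like codes in a text, in order of appearance."""
    matches = JEL_PATTERN.findall(text)
    return list(dict.fromkeys(matches))  # drops duplicates, keeps order

# Made-up text: a paper number that contains "S12" and a repeated code
sample = 'Policy Research Working Paper WPS1234\nJEL Codes: O12, I32, O12'

print(re.findall(r'[A-Z]\d{2}', sample))  # ['S12', 'O12', 'I32', 'O12']
print(capture_jel_strict(sample))         # ['O12', 'I32']
```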

For the next part of the session, we'll start properly analyzing and getting insights from the text contents. The final result of the next part will be a count of the most used words in each text and we'll also count the total number of words in each text during the process.

## Text data pre-processing

Before we start, we need to think of the following:

- Our texts are in a very raw state. Shouldn't we "clean" them a bit before counting words?
- Using regex to capture words so we can count them sounds possible, but perhaps there is an easier way?
- Texts in English usually repeat many words that are not very insightful about the content, such as prepositions or pronouns. Can we get rid of some of them before the word count?
- Lastly, shouldn't we count in the same category words that are not exactly the same but have a very similar meaning? For example:
    + different conjugations of the same verb
    + singular and plural forms of the same noun
    
The answer to all of these questions is yes. We'll do this in the data pre-processing. Data pre-processing in text analysis is extremely important: omitting it will give you different, and usually noisier, results in text analysis tasks.

Data pre-processing can consist of multiple tasks. We'll apply the following for our corpus:

- Transform to lowercase
- Tokenization: transform texts into lists of words
- Remove stop words (words that are not very insightful, such as prepositions)
- Lemmatization: transform different forms of words into a common word that conveys a similar meaning. This is useful to "normalize" conjugations of verbs or plural forms of words
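
To make the first three steps concrete, here is a toy sketch that uses only the standard library. The stop-word list is a tiny hand-made set and lemmatization is left out on purpose, so this is an illustration of the idea and not a replacement for a proper NLP model. `toy_preprocess` and `TOY_STOP_WORDS` are names we made up.

```python
import re

# A tiny, hand-made stop-word list (real lists contain hundreds of words)
TOY_STOP_WORDS = {'the', 'of', 'and', 'a', 'in', 'on', 'to', 'is', 'are'}

def toy_preprocess(text):
    """Lowercase a text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r'[a-z]+', text.lower())  # lowercase + tokenization
    return [token for token in tokens if token not in TOY_STOP_WORDS]

print(toy_preprocess('The effects of trade on the incomes of households'))
# ['effects', 'trade', 'incomes', 'households']
```

Without lemmatization, "incomes" and "income" would still be counted as different words, which is one reason to rely on a library for this step.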

Fortunately, there is a very useful Python library we can use for this: [spaCy](https://spacy.io/). SpaCy makes available pre-existing NLP models that tokenize, lemmatize, and detect stop words and non-word characters (such as digits or punctuation), so we can easily transform a text into a list of "meaningful" lemmatized words that we can use for word counts.

### Working with spaCy

First we need to install spaCy. Uncomment the line below, run it, and then comment it again with `#`.

In [None]:
#!pip install spacy

In [None]:
import spacy

Now we need to **download** spaCy's NLP model. Uncomment the line below, run it only once, and then comment it out again to make sure you won't run it again accidentally.

In [None]:
#!python -m spacy download en_core_web_sm

Now we **load** the model so it's available in this Python notebook:

In [None]:
nlp = spacy.load('en_core_web_sm')

Then, we'll build a function that:

1. Reads a text
1. Transforms it to lowercase
1. Loads it into the model
1. Keeps the lemmatized version of every token that is not:
    - A stop word
    - Punctuation
    - A number
    - A space or line break
    - Shorter than three characters
1. Finally, the function returns a list of the lemmatized words

In [None]:
def word_tokenization_normalization(text):
    
    text = text.lower() # lowercase
    doc = nlp(text)     # loading text into model

    words_normalized = []
    for word in doc:
        if word.text != '\n' \
        and not word.is_stop \
        and not word.is_punct \
        and not word.like_num \
        and len(word.text.strip()) > 2:
            word_lemmatized = str(word.lemma_)
            words_normalized.append(word_lemmatized)
    
    return words_normalized

To get a better idea of what the function does, let's take a look at the result for one paper:

In [None]:
text = df['text'][10]
doc_tokenized = word_tokenization_normalization(text)

In [None]:
doc_tokenized

The result is a list of normalized words for the text.

You might have also noticed that this takes some time to run. To avoid having to wait, we'll apply the function to tokenize and normalize only the **abstracts**. We'll again use the Pandas method `apply()`.

In [None]:
df['abstract_tokenized'] = df['abstract'].apply(word_tokenization_normalization)

In [None]:
df.tail()

The downside of having applied the tokenization and normalization on the abstracts and not the texts is that for word counts we might not have abstracts long enough to make word repetition very insightful. In a real project, we would use the full texts, leave the code running while we do other things or go for coffee, and come back to work with the results once the code finishes.

## Counting words

Now that the texts are normalized, we can count words! We'll do two things:

1. Generate a column with the number of words
1. Generate a column with a dictionary where each word is a key and the number of times it appears is the key's value. This will look like `{'word1': n1, 'word2': n2, ...}`

For the first task, we can directly create a new column with the result in the dataframe:

In [None]:
df['n_words_abstract'] = df['abstract_tokenized'].apply(len)

In [None]:
df.head()

Just out of curiosity, let's pause for a minute to see the distribution in the number of words.

In [None]:
# Uncomment and run this line if you don't have seaborn:
#!pip install seaborn

In [None]:
import seaborn as sns

In [None]:
sns.histplot(data=df, x='n_words_abstract');

For the second task, we need to generate a helper function that generates the dictionary from each tokenized abstract.

In [None]:
def word_counts(tokenized_text):
    
    count = {}
    
    for word in tokenized_text:
        if word in count:
            count[word] += 1
        else:
            count[word] = 1
    
    return count

We'll first apply the function to only one text to make sure the result looks correct.

In [None]:
abstract_tokenized = df['abstract_tokenized'][42]
count = word_counts(abstract_tokenized)
count

This looks interesting, but it's not very meaningful unless we spend some time looking at the result. We'll transform this into a barplot for easier interpretation but only keeping the words with more than 2 counts.

In [None]:
count_trimmed = {}
for word, value in count.items():
    if value > 2:
        count_trimmed[word] = value

In [None]:
sorting = sorted(count_trimmed, key=lambda x:count_trimmed[x]) # we add this to see the result sorted (ascending)
sns.barplot(count_trimmed, orient='h', order=sorting[::-1]);   # [::-1] reverses the order of a list, for descending order

Now we'll apply the function `word_counts()` to all the abstracts.

In [None]:
df['abstract_word_count'] = df['abstract_tokenized'].apply(word_counts)

In [None]:
df.tail()

The word count we just generated in `abstract_word_count` would be useful if we wanted to analyze the counts of an individual paper. In a corpus like this, however, it might be more useful to obtain a word count of all the papers we have.

In [None]:
# Appending all lists in abstract_tokenized
all_words = df['abstract_tokenized'].sum()

In [None]:
all_words

Now we'll apply our word-counting function to `all_words` and save the resulting dictionary:

In [None]:
count_complete = word_counts(all_words)

In [None]:
count_complete

In [None]:
len(count_complete)

With this, we can plot the count of words for our entire corpus of papers. We'll do it below for the `n` words most used.

In [None]:
n = 15
values_sorted = sorted(count_complete.values())[::-1] # This returns the values in count_complete, sorted in descending order.
nth_value = values_sorted[n]

In [None]:
nth_value

This means that after sorting our word counts in descending order, 209 is the value in the 16th position (remember that positions in Python are zero-indexed). We'll traverse our dictionary and keep only the counts higher than this value, saving the result in a new dictionary called `count_complete_trimmed`.

In [None]:
count_complete_trimmed = {}
for word, value in count_complete.items():
    if value > nth_value:
        count_complete_trimmed[word] = value

In [None]:
count_complete_trimmed

In [None]:
len(count_complete_trimmed)

Now we can produce our plot!

In [None]:
sorting = sorted(count_complete_trimmed, key=lambda x:count_complete_trimmed[x]) # we add this to see the result sorted
sns.barplot(count_complete_trimmed, orient='h', order=sorting[::-1]);
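
As a side note, Python's standard library can do the counting and the top-`n` selection for us. The sketch below uses `collections.Counter` and `itertools.chain` on a made-up list of tokenized documents that stands in for the column `abstract_tokenized`.

```python
from collections import Counter
from itertools import chain

# Made-up tokenized documents, standing in for df['abstract_tokenized']
tokenized_docs = [
    ['poverty', 'growth'],
    ['poverty', 'trade', 'growth'],
    ['poverty'],
]

# Flatten the list of lists into one long list of words
all_tokens = list(chain.from_iterable(tokenized_docs))

# Counter builds the {word: count} dictionary in one step
counts = Counter(all_tokens)

# most_common(n) returns the n most frequent words with their counts
top_two = counts.most_common(2)

print(counts['poverty'])  # 3
print(top_two)            # [('poverty', 3), ('growth', 2)]
```

One difference with the filter we used: `most_common(n)` always returns exactly `n` words, while keeping counts above a threshold can return fewer when several words tie at the threshold value.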

### Word clouds

What kind of text analysis training would this be without a word cloud example? We'll use our dictionary of word counts for the corpus of papers and the library `wordcloud` for this.

In [None]:
# Activate and run the line below only once to install wordcloud
#!pip install wordcloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
wc = WordCloud(background_color='white', colormap = 'binary').generate_from_frequencies(count_complete)
plt.axis("off")
plt.imshow(wc);

## Sentiment analysis

Sentiment analysis consists of determining the emotional tone of a text. It classifies a text into one of three types: positive, neutral, or negative sentiment.

Sentiment analysis works best when it's applied on sentences. However, the unit of observation of our dataframe and corpus is a paper. Before moving forward, we'll transform our dataframe from a paper level to a sentence level.

### Sentence-level tokenization

To divide texts into sentences, we'll use the sentence segmentation from spaCy.

In [None]:
def sentence_tokenization(text):
    
    text = text.lower() # lowercase
    doc = nlp(text)     # loading text into model
    sentences = [sentence.text for sentence in doc.sents]
    
    return sentences

We'll now try this function with one abstract:

In [None]:
text = df['abstract'][100]

In [None]:
text

In [None]:
sentences = sentence_tokenization(text)

In [None]:
len(sentences)

In [None]:
print(sentences)

The result has text that could be cleaned a bit more (for example: removing line breaks and replacing multiple contiguous spaces with only one space), but we're going to omit that for now. This sentence separation is good enough for sentiment analysis.

Now we'll apply the separation to all abstracts. We use the abstracts for this example because executing the function on all paper texts would take several minutes to run.

In [None]:
df['abstract_sentencized'] = df['abstract'].apply(sentence_tokenization)

In [None]:
df.head()

We said before that we wanted to work with a dataframe at the sentence level. We use Pandas' method `.explode()` to obtain this result easily.

In [None]:
df_sentence = df[['id', 'abstract_sentencized']]  # leaving only the paper ID and the sentencized abstract
df_sentence = df_sentence.explode('abstract_sentencized').reset_index(drop=True) # converting DF to sentence-level
df_sentence = df_sentence.rename({'abstract_sentencized': 'sentence'}, axis='columns') # renaming column

In [None]:
df_sentence

### Obtaining sentiments

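There are several libraries with pre-loaded models that analyze the sentiment of a text in English or other languages. We'll use `nltk` because it's one of the simplest; the analyzer we'll load from it (VADER) is built for English. For texts in other languages, [spaCy's language models](https://spacy.io/usage/models) cover steps such as tokenization and lemmatization, but as far as we know its pretrained pipelines don't include a sentiment component, so you would need a spaCy extension or another library for that step.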

In [None]:
# Activate and run this line to install nltk
#!pip install nltk

In [None]:
import nltk

In [None]:
# This downloads nltk's model for sentiment analysis. Activate this line and run it only once
#nltk.download('vader_lexicon')

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
analyzer = SentimentIntensityAnalyzer()

In [None]:
text = "I'm having a terrible day"
analyzer.polarity_scores(text)

In [None]:
text = 'This is excellent!'
analyzer.polarity_scores(text)

Most sentiment analysis models don't give a final result on whether a text is positive, negative, or neutral; they give scores for each sentiment. We'll apply a simple rule for these scores to determine the tone of a text: the highest score is the tone. 

In [None]:
def get_sentiment(text):
    
    result = analyzer.polarity_scores(text)
    
    if result['neg'] > result['pos'] and result['neg'] > result['neu']:
        return 'negative'
    elif result['pos'] > result['neg'] and result['pos'] > result['neu']:
        return 'positive'
    else:
        return 'neutral'

In [None]:
df_sentence['sentiment'] = df_sentence['sentence'].apply(get_sentiment)

In [None]:
df_sentence.head()

Visualizing the results:

In [None]:
sns.barplot(data=df_sentence['sentiment'].value_counts());

The overwhelming majority of sentences in the abstracts have a neutral tone. This is exactly what we would expect of a corpus like this. We'll tabulate the results to see if there are any positive- or negative-sentiment sentences.

In [None]:
df_sentence['sentiment'].value_counts()
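
The rule in `get_sentiment()` compares the three proportions directly, and `neu` is usually the largest of them, so most sentences end up as neutral. A common alternative is to look only at the `compound` score and apply a threshold; the value of 0.05 below follows the convention usually suggested for VADER, but treat it as an assumption you can tune. The function takes a score dictionary, so it doesn't need `nltk` to run; the dictionaries in the example are hand-written for illustration and are not real VADER output.

```python
def sentiment_from_compound(scores, threshold=0.05):
    """Label a VADER-style score dictionary using only its 'compound' value."""
    compound = scores['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Hand-written score dictionaries (illustrative values, not real VADER output)
example_scores = [
    {'neg': 0.0, 'neu': 0.8, 'pos': 0.2, 'compound': 0.44},
    {'neg': 0.3, 'neu': 0.7, 'pos': 0.0, 'compound': -0.51},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
]

labels = [sentiment_from_compound(scores) for scores in example_scores]
print(labels)  # ['positive', 'negative', 'neutral']
```

With the "highest score" rule, the first two dictionaries would be labeled neutral because `neu` is the largest value in both. In the notebook, this function could be applied to the output of `analyzer.polarity_scores(text)`.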

## Text classification

For the last part of the session, we'll do a couple of simple text classification examples. We're calling them "simple" because there are now far more sophisticated, state-of-the-art text classification techniques, but they are not suitable for a 2-hour session. You can check the links about LLMs at the end of this notebook if you want to explore them.

Simply put, text classification consists of assigning a text to a group. If you're familiar with machine learning, this is exactly a machine learning classification task. For our exercises, we'll show two ways of classifying text:
- **Unsupervised classification:** we'll group texts into groups of similarity, without pre-defining the groups
- **Supervised classification:** we'll group texts into pre-defined groups. The pre-defined groups will be the first topic of the column `topics`.

In [None]:
df['first_topic'] = df['topics'].apply(lambda x: x.split(',')[0].lower())

Now we'll tabulate the first topic:

In [None]:
df['first_topic'].value_counts()

In [None]:
len(df['first_topic'].unique())

In [None]:
len(df)

There are 198 topics for a total of 399 papers (!), which means that a lot of topics have only one or two papers. We'll keep only topics that have at least eight papers so that there are at least some observations in each topic to build a classifier. This will reduce the size of our dataframe.

In [None]:
topics_to_keep = df['first_topic'].value_counts()[df['first_topic'].value_counts() >= 8]

In [None]:
topics_to_keep.sum()

In [None]:
len(topics_to_keep)

Our resulting dataframe will have only 95 observations and 8 topics. This number of observations is not enough to build a good classifier, but we'll still go ahead and use it as an example of how the text classification method is applied.

In [None]:
df2 = df[df['first_topic'].isin(topics_to_keep.index)].reset_index(drop=True)

In [None]:
len(df2)

In [None]:
df2.head()

### Text encoding

Our classifier will be built (trained) using the tokenized and normalized abstracts. However, we first need to convert them into numbers so a classifier can work with them. This operation is called **encoding**.

There are several ways of encoding texts. We'll use term-frequency inverse-document frequency (TF-IDF). TF-IDF transforms a text of words into a numeric vector where each word has a score. It gives a high score to words that show up a lot in a given document, but rarely across documents in the corpus (so they are more distinctive for the document only).
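
To make the idea concrete, here is a minimal sketch of a textbook version of TF-IDF written with the standard library only: term frequency is the share of a document's words taken by a given word, and inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the word. This is a simplified variant; scikit-learn's `TfidfVectorizer` applies smoothing and normalization, so its numbers will differ. The toy documents and the function name are made up.

```python
import math

# Made-up tokenized documents
docs = [
    ['poverty', 'trade', 'trade'],
    ['poverty', 'health'],
    ['poverty', 'education'],
]

def toy_tf_idf(docs):
    """Return one {word: score} dictionary per document (textbook TF-IDF)."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for doc in docs:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    scores = []
    for doc in docs:
        doc_scores = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)           # share of the document
            idf = math.log(n_docs / doc_freq[word])   # rarity across documents
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores

scores = toy_tf_idf(docs)
print(scores[0])
```

"poverty" appears in every document, so its score is zero everywhere, while "trade" is frequent in the first document and absent from the others, so it gets the highest score there.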

We'll start by loading the library we'll use for the encoding and text classification: scikit-learn.

In [None]:
# Uncomment the line below for the installation:
#!pip install scikit-learn

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Generating the encoder
corpus = list(df2['abstract_tokenized'].apply(lambda x: ' '.join(x)))
encoder = TfidfVectorizer(stop_words = ['paper'], max_features=1000) # initializing the encoder
vectors = encoder.fit_transform(corpus)                              # encoding

In [None]:
vectors.shape

The resulting object `vectors` holds the encodings of the 95 abstracts. Each of them is a vector with the TF-IDF scores of the 1000 most frequent words across the corpus. Choosing 1000 is arbitrary. Remember we initially had a total of 4,500+ distinct words (this was printed when we counted words and generated a dictionary with all the word counts). From now on, we'll refer to these 1000 words as our **dictionary**.

For an easier understanding of the text encoding, we'll transform this back into a dataframe:

In [None]:
# Column names for the df
words_encoded = encoder.get_feature_names_out()

In [None]:
# Actual content of the df
vectors_data = vectors.toarray()  # converting the sparse matrix into a regular (dense) NumPy array

In [None]:
df_tfidf = pd.DataFrame(data=vectors_data, columns=words_encoded)
df_tfidf.insert(0, 'id', df2['id']) # inserting the paper ID

In [None]:
df_tfidf.tail()

Two important points on this result:
- The dataframe `df_tfidf` contains the same information as `vectors`, except that it's transformed into a Pandas dataframe with column names and the paper IDs. This step was not really necessary. We added it in order to better understand the encoding result. Most NLP programmers will omit this step and will work with `vectors` directly.
- The result in `df_tfidf` and `vectors` is a matrix with **a lot** of zeroes. This happens because the encoding assigns a score of zero to the words that are part of the dictionary but not present in that paper. This is a usual result in TF-IDF encoding.

The actual information in this data is sparse and spreads across the many dimensions (columns) of the data. We'll reduce the data dimensionality with principal component analysis (PCA) to only two dimensions. This will also allow us to visualize the proximity of each document.

### Principal component analysis

In [None]:
from sklearn.decomposition import PCA

In [None]:
data = df_tfidf.drop(['id'], axis=1)  # dropping the column id
pca = PCA(n_components = 2).fit(data) # initializing and fitting the PCA transformer
reduced_data = pca.transform(data)    # transforming the data

In [None]:
reduced_data.shape

Note that the result of reducing the data dimensions is a NumPy array:

In [None]:
type(reduced_data)

We can visualize the two resulting dimensions. The specific values of dimensions 1 and 2 are actually meaningless, but the **proximity of the observations** means that those abstracts had a close TF-IDF encoding, meaning that they're close in the words they contain.

Also, remember that we obtained the first topic of each paper? We're going to use it as an approximation of the true classes for our classification exercises. The topic determines the color of each point in the plot below.

In [None]:
# Producing figure
fig = plt.figure(figsize = (10,6))
plot = sns.scatterplot(x = reduced_data[:, 0], y = reduced_data[:, 1], hue = df2['first_topic'])

# Aesthetics
plt.legend(title='Topic')
sns.move_legend(plot, "upper left", bbox_to_anchor=(1, 1))
plt.xticks(())
plt.yticks(())
plt.title('True Classes')
plt.axis('off')
plt.show()

### Training a classifier for unsupervised classification

Let's pause for a minute to go over the steps we've followed for text classification until this point:

1. We started with the raw abstracts and did text data preparation:
    - converted texts to lowercase
    - removed stop words and numbers
    - lemmatized words
2. Then we encoded the prepared data using TF-IDF
3. Next, we reduced the dimensions from 1000 to 2 and visualized the result

We're going to continue using the results from (3) to train our unsupervised and supervised classifiers. We'll start with the unsupervised approach, building clusters of the abstracts that are close to each other in the PCA results. We're going to use the class `KMeans` from the module `cluster` of the library `sklearn` for this.

In [None]:
import sklearn.cluster

In [None]:
n = len(df2['first_topic'].unique())                        # same n of clusters as topics: 8
km = sklearn.cluster.KMeans(n_clusters=n, init='k-means++') # initializing the classifier

In [None]:
km.fit(reduced_data)

### Unsupervised classification

Once the classifier is fitted to the data, the cluster assigned to each observation is stored in the attribute `labels_` of the classifier.

In [None]:
km.labels_

We can use this to plot the predicted classes:

In [None]:
# Producing figure
fig = plt.figure(figsize = (10,6))
plot = sns.scatterplot(x = reduced_data[:, 0], y = reduced_data[:, 1], hue = km.labels_, palette="deep")

# Aesthetics
plt.legend(title='Class')
sns.move_legend(plot, "upper left", bbox_to_anchor=(1, 1))
plt.xticks(())
plt.yticks(())
plt.title('Predicted Classes - Clustering')
plt.axis('off')
plt.show()

Note that the predicted classes don't have a label that clearly corresponds to the actual classes --the topics. This is because clustering is an unsupervised method of classification: it only builds groups based on proximity but doesn't label what each group is.
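
One way to see how the clusters relate to the topics is to count how many papers fall in each combination of cluster and topic. The sketch below does this with `collections.Counter` on made-up labels; in the notebook, the inputs would be `km.labels_` and `df2['first_topic']`, and `pd.crosstab()` would give the same counts as a table.

```python
from collections import Counter

# Made-up cluster labels and topics for six papers
cluster_labels = [0, 0, 1, 1, 1, 2]
topics = ['health', 'health', 'trade', 'trade', 'health', 'poverty']

# Count the papers in each (cluster, topic) combination
pair_counts = Counter(zip(cluster_labels, topics))

for (cluster, topic), n_papers in sorted(pair_counts.items()):
    print(f'cluster {cluster} - {topic}: {n_papers}')
```

A cluster dominated by a single topic suggests that the clustering recovered that topic reasonably well; a cluster that mixes many topics suggests it did not.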

### Training a classifier for supervised classification

To assign observations to labeled groups, we need to do supervised classification. We'll use a Gaussian naive Bayes classifier in this example, but other types of classifiers (random forests, for instance) are available in the library we're using (scikit-learn).

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
classifier = GaussianNB()

In [None]:
x = reduced_data          # the PCA result
y = df2['first_topic']    # the true labels (target) the classifier learns to predict
classifier.fit(x, y)

After this, `classifier` has been trained with the data in `x` to know which patterns in it produce the results in `y`.

### Supervised classification

Now we'll classify our texts with the classifier we trained. Given that it was trained on the PCA-reduced TF-IDF encodings of normalized words, the input for any classification must go through the same encoding and reduction. We'll reuse the same data in `x` to produce a classification and compare it with the true values to get a sense of how well this classifier performs.

In [None]:
predictions = classifier.predict(x)

In [None]:
df_predictions = df2[['title', 'first_topic']].copy()  # .copy() avoids pandas' SettingWithCopyWarning when adding columns
df_predictions['predictions'] = predictions

In [None]:
df_predictions.head()

In [None]:
df_predictions['correct'] = False
df_predictions.loc[df_predictions['first_topic'] == df_predictions['predictions'], 'correct'] = True

In [None]:
df_predictions['correct'].value_counts()

Some notes on this result:

- Our classifier is only 43% accurate. This is not a good performance, but we had to work with a dataset small enough to manage in a short training session. In a real setting, you should ideally work with 1,000+ observations and try different types of classifiers.
- We are using our classifier on the same data we used for training it. In a real setting, this is a very bad practice because it hides overfitting: a classifier can look accurate on the data it was trained on and still fail to generalize to out-of-sample cases. The way you avoid this is by separating your data into a training dataset and a test dataset. Then you use the training set for training and the test set for evaluating performance.
- You can add other data with likely predictive power for the variable we classify to the PCA vectors or the TF-IDF matrix. Remember we extracted the JEL codes before? Those are probably good predictors in this case.
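
The training/test separation mentioned above can be sketched with the standard library alone. In practice you would probably use scikit-learn's `train_test_split()` from `sklearn.model_selection`; the functions `holdout_split()` and `accuracy()` below are hypothetical helpers written only to show the idea, and the data is made up.

```python
import random

def holdout_split(items, labels, test_share=0.25, seed=0):
    """Shuffle the observations and set aside a share of them for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    n_test = max(1, round(len(indices) * test_share))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, idx):
        return [sequence[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

def accuracy(true_labels, predicted_labels):
    """Share of predictions that match the true labels."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

# Made-up data: eight observations with two classes
items = list(range(8))
labels = ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']

x_train, x_test, y_train, y_test = holdout_split(items, labels)
print(len(x_train), len(x_test))  # 6 2

# Accuracy of some made-up predictions against made-up true labels
print(accuracy(['a', 'b', 'a', 'b'], ['a', 'b', 'b', 'b']))  # 0.75
```

The classifier would be fitted on `x_train` and `y_train` only, and its accuracy would be computed on the predictions for `x_test` against `y_test`.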

Visualizing the result:

In [None]:
# Figure
fig = plt.figure(figsize = (10,6))
plot = sns.scatterplot(x = reduced_data[:, 0], y = reduced_data[:, 1], hue = df_predictions['predictions'], palette="deep")

# Aesthetics
plt.legend(title='Class')
sns.move_legend(plot, "upper left", bbox_to_anchor=(1, 1))
plt.xticks(())
plt.yticks(())
plt.title('Predicted Classes - Naive Bayes')
plt.axis('off')
plt.show()

## Final notes

### Other languages

These exercises used a corpus in English. However, the principles for working with other languages are just the same for all of these text analysis tasks. SpaCy has NLP models available in other languages; you can check them [here](https://spacy.io/usage/models).

### Other text analysis tasks

This was an overview of possibly the simplest text analysis tasks. Other tasks are:

- Named entity recognition: detecting mentions of a meaningful entity (places, names of people, dates, etc) in texts
- Vector spaces and word embeddings: transforming texts or words into vectors of "meanings". You can then work with them for other tasks, such as comparing the proximity of texts based on meanings
- Generative AI with texts: generating texts based on prompts or previous text.

### Large Language Models (LLMs)

We didn't cover LLMs because they're not part of an introductory session. If you're more interested in learning about them, we recommend these readings:

- BERT was one of the first large pretrained language models to be publicly released. This article explains well how it works: [BERT Explained: State of the art language model for NLP](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)
- This is a tutorial of how to work with BERT to fine-tune it for specific NLP/text analysis tasks: [BERT Fine-Tuning Tutorial with PyTorch](https://mccormickml.com/2019/07/22/BERT-fine-tuning/)