# Exercise Sheet 3 - Text Classification

In [None]:
import nltk
nltk.download(['brown', 'stopwords'])

In [None]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import brown
from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
import string
import pandas as pd
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer



# 1. Preprocessing 

Framework for machine learning and feature extraction: **sklearn** (scikit-learn).

Classes and functions from sklearn used:
1. [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn-feature-extraction-text-countvectorizer)
  Converts a collection of text documents to a matrix of token counts.

2. [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
  Converts a collection of raw documents to a matrix of TF-IDF features.

3. [sklearn.feature_extraction.text.TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)
  Transforms a count matrix to a normalized tf or tf-idf representation.

4. [sklearn.metrics.classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)
  Builds a text report showing the main classification metrics: per-class scores for `precision`, `recall` and `f1`, together with their macro average and weighted average.
  It also displays the support, which is the number of actual occurrences of each class/label in the evaluated data.
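
To make the columns of the report concrete, the following minimal sketch computes precision, recall, f1 and support by hand for each class. The labels in `y_true` and `y_pred` are made-up toy values and the helper `per_class_scores` is a hypothetical name; neither comes from this exercise.

```python
# Toy labels, invented for illustration only (not results from this exercise).
y_true = ["government", "government", "mystery", "mystery", "mystery"]
y_pred = ["government", "mystery", "mystery", "mystery", "government"]


def per_class_scores(y_true, y_pred, label):
    """Return precision, recall, f1 and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support


for label in sorted(set(y_true)):
    p, r, f, s = per_class_scores(y_true, y_pred, label)
    print(f"{label:<12} precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

The macro average is the plain mean of the per-class scores, while the weighted average weights each class by its support.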


In order to feed the text to `CountVectorizer`, each sample must be a single string (here, one sentence), as shown in the following example:

| text | label |
| ---- | ----- |
|The capital expansion programs business firms involve multi-year budgeting true country development programs|government|
|Now Dogtown one places creeps marrow worms get old wood veneer|mystery|
|This claim submitted District Court dismissed 126 F.Supp.235 alleged violation 7 Clayton Act also 1 2 Sherman Act|government|
|Mrs. Meeker struck ready seek anyone's advice least Garth's|mystery|
|Richmond Va.|government|

Essentially what we need:

X: Array of sentences

y: Array of corresponding labels

The corpus we are using is already tokenized and could be used as it is.
In practice, however, a corpus is rarely pre-tokenized, so we prepare the data as sentences and labels before proceeding with the exercise.


<br>
<br>

## Tokenization and Detokenization (instead of `.join()`)

The default word tokenization method in NLTK uses regular expressions to tokenize text as in the Penn Treebank (based on English text). It assumes that the text has already been split into sentences.

This form of tokenization is useful because it encodes several linguistic rules, for example splitting off punctuation and contractions, rather than splitting on whitespace alone.

A detokenizer is required to put a sentence back together from a list of tokens with proper punctuation spacing, which a plain `' '.join()` does not do.
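
As a small illustration of the difference, the sketch below tokenizes one made-up sentence and puts it back together twice, once with a plain join and once with the detokenizer. The sentence and the variable names are assumptions chosen for the example.

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer

sentence = "They won't go, will they?"  # made-up example sentence

tokens = TreebankWordTokenizer().tokenize(sentence)
print(tokens)  # punctuation and the contraction "n't" become separate tokens

# A plain join puts a space before every punctuation token ...
joined = " ".join(tokens)
print(joined)

# ... whereas the detokenizer reattaches punctuation and contractions.
restored = TreebankWordDetokenizer().detokenize(tokens)
print(restored)
```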

In [None]:
detokenizer = TreebankWordDetokenizer()
tokenizer = TreebankWordTokenizer()

# 2. Dataset and Problem Statement

## [Brown Corpus](https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html)
The corpus consists of one million words of American English texts printed in 1961. The texts for the corpus were sampled from 15 different text categories to make the corpus a good standard reference.

From this dataset we select two categories:
1. government: Text from government documents
2. mystery: Text from mystery and detective fiction

We then create our own dataset by filtering, detokenizing and shuffling the sentences of these two categories.

In [None]:
for category in brown.categories():
    corpus_length = len(brown.sents(categories=[category]))
    print(f'Category: {category:<16}, Dataset Size:{corpus_length}')

# Sets give fast membership tests when filtering tokens
english_stopwords = set(stopwords.words('english'))
punctuations = set(string.punctuation)

print('\n\nSelecting `government` and `mystery` categories from brown corpus')

def filter_and_join(sent_arr, lab):
    """Drop stopword and punctuation tokens, then detokenize into one string."""
    # The match is case-sensitive, so capitalized stopwords such as 'The' are kept.
    filtered_tokens = [token for token in sent_arr if (token not in english_stopwords and token not in punctuations)]
    return [detokenizer.detokenize(filtered_tokens), lab]

## Apply filter_and_join to every sentence of the `government` category
government_text = [filter_and_join(sent, 'government') for sent in brown.sents(categories=['government'])]

## Apply filter_and_join to every sentence of the `mystery` category
mystery_text = [filter_and_join(sent, 'mystery') for sent in brown.sents(categories=['mystery'])]

dataset = pd.DataFrame(government_text + mystery_text, columns=['text', 'label'])
dataset = dataset.sample(frac=1)  # shuffle the rows
dataset.head()


## PROBLEM STATEMENT

Use the given corpus to perform the following tasks:

1. Setting Test/Train dataset: Split the dataset into a training set and a test set (10% test, 90% training).

2. Feature Extraction: Extract features from the text, i.e. count vectors and TF-IDF.

3. Train ML model: Use the extracted features to train `Naive Bayes` models (one for each feature type).

4. Evaluation: Calculate the precision, recall and f1 score.
  Hint: Use the classification report.

5. Inference: Use the given strings and the trained models to predict the class/label of the text.

OPTIONAL:
Train any other model of your choice that could do better than the Naive Bayes model.

# 3. Split Data into training and testing sets



## EXERCISE 1
Split the dataset into a training set and a test set. The test set should be 10% of the overall dataset size.

In [None]:
from sklearn.model_selection import train_test_split
train_data, test_data =  ## YOUR CODE GOES HERE

<Details>
<summary>HINT</summary>
Use the function

```python
train_test_split(dataset, test_size=???)
```

</Details>
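
For orientation, here is a minimal sketch of what `train_test_split` returns, run on a hypothetical toy DataFrame rather than on the exercise dataset. The split fraction used here is arbitrary, and `random_state` and `stratify` are optional parameters that the exercise does not require.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, only to show the shape of the result.
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(10)],
    "label": ["government", "mystery"] * 5,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.2,          # fraction of rows that goes to the test set
    random_state=0,         # fixed seed so the split is reproducible
    stratify=toy["label"],  # keep the label proportions equal in both parts
)
print(len(toy_train), len(toy_test))
print(toy_test["label"].value_counts().to_dict())
```

Both returned objects are DataFrames with the same columns as the input, so the `text` and `label` columns can be selected from each part afterwards.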

# 4. Feature Engineering using raw counts and TF-IDF



## Example
The vector representation of the text using counts


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


vectorizer1 = CountVectorizer(analyzer='word', ngram_range=(1, 1))
X2 = vectorizer1.fit_transform(corpus)
print(X2.toarray())
vectorizer1.get_feature_names_out().tolist()

- In the example above, the method `get_feature_names_out()` returns the vocabulary of the corpus, i.e. its unique words in alphabetical order.
- Each document in the corpus is represented with reference to this vocabulary.
- Example: document 1, i.e. **"This is the first document."**, can be aligned with the vocabulary as **[0, "document", "first", "is", 0, 0, "the", 0, "this"]**, which is then turned into a count vector holding the number of times each vocabulary word occurs in the document, i.e. **[0 1 1 1 0 0 1 0 1]**.
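
The same count vectors can be reproduced by hand, which may help to see what the vectorizer does. The sketch below uses only the standard library; the names `docs`, `tokenize` and `count_vector` are hypothetical, and the regular expression mirrors the default token pattern documented for `CountVectorizer` (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

docs = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Vocabulary: the sorted unique tokens of the whole corpus.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})


def count_vector(text):
    # One entry per vocabulary word: how often it occurs in the text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


print(vocabulary)
for doc in docs:
    print(count_vector(doc))
```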



The example below shows the vector representation of the text using tf-idf.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()

vectorizer2 = TfidfVectorizer(analyzer='word', ngram_range=(1, 1))
X2 = vectorizer2.fit_transform(corpus)
print(X2.toarray())
vectorizer2.get_feature_names_out().tolist()

- Similar to the count vector, each index in the tf-idf vector represents a word in the vocabulary.
- Each value is the L2-normalized tf-idf weight of that word in the document.
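
A hand computation can make "L2-normalized tf-idf" more tangible. The sketch below assumes the settings that scikit-learn documents as the defaults of `TfidfVectorizer`: raw term counts as tf, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The names `docs`, `tokenize` and `tfidf_vector` are hypothetical.

```python
import math
import re
from collections import Counter

docs = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


vocabulary = sorted({token for doc in docs for token in tokenize(doc)})
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
df = Counter(word for doc in docs for word in set(tokenize(doc)))
# Smoothed inverse document frequency (assumed default, see lead-in).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}


def tfidf_vector(text):
    counts = Counter(tokenize(text))
    raw = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the document vector
    return [v / norm if norm else 0.0 for v in raw]


first = tfidf_vector(docs[0])
for word, weight in zip(vocabulary, first):
    print(f"{word:<10}{weight:.3f}")
```

Words that occur in every document, such as "this", receive the smallest idf, so rarer words such as "first" end up with a larger weight in the same document.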

## FEATURE EXTRACTION FOR THE DATASET

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

X_train, y_train = train_data["text"], train_data["label"]
X_test, y_test = test_data["text"], test_data["label"]

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train) 


tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)


## EXERCISE 2
The features for the training set have already been generated. Now, generate the features for the test set.

## WARNING: 

Make sure that you do not refit the feature extractors on the test dataset: the vocabulary and the idf weights must come from the training data only.

In [None]:
X_test_counts = ### YOUR CODE GOES HERE
X_test_tfidf = ### YOUR CODE GOES HERE

<Details>
<summary>HINT</summary>
Use the `transform` method of the already fitted `count_vect` and `tfidf_transformer`.

Do NOT use the `fit_transform` function as shown above: it would refit the vocabulary and the idf weights on the testing data, which is not the purpose of the test dataset.

</Details>

# 5. Naive Bayes Classifier

Naive Bayes is a generative classification model.

A generative model learns the joint probability $P(X, Y)$ by estimating $P(Y)$ and $P(X \mid Y)$ (where $X$ are the features and $Y$ are the labels), and then uses Bayes' rule to obtain $P(Y \mid X)$ for prediction.

Prediction with Naive Bayes

$$P(\text{label} \mid \text{features}) = \frac{P(\text{label}) \times P(\text{features} \mid \text{label})}{P(\text{features})}$$

The assumption that all features are conditionally independent given the label modifies the formula to:

$$P(\text{label} \mid \text{features}) = \frac{P(\text{label}) \times P(f_1 \mid \text{label}) \times \dots \times P(f_n \mid \text{label})}{P(\text{features})}$$
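
A worked example with numbers may help here. The sketch below evaluates the formula for two labels and two observed words; all probabilities are hypothetical values invented for illustration, not estimates from the Brown corpus.

```python
import math

# Hypothetical probabilities, invented for illustration only.
priors = {"government": 0.5, "mystery": 0.5}
likelihoods = {
    "government": {"court": 0.20, "gun": 0.01},
    "mystery": {"court": 0.05, "gun": 0.10},
}
features = ["court", "gun"]  # words observed in the text to classify

# Numerator of the formula: P(label) * P(f_1 | label) * ... * P(f_n | label)
scores = {
    label: priors[label] * math.prod(likelihoods[label][f] for f in features)
    for label in priors
}

# Denominator P(features): the sum of the numerators over all labels.
evidence = sum(scores.values())
posteriors = {label: score / evidence for label, score in scores.items()}

print(posteriors)
print(max(posteriors, key=posteriors.get))
```

Since the denominator is the same for every label, it does not affect which label wins; comparing the numerators alone is enough for prediction.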


In [None]:
from sklearn.naive_bayes import MultinomialNB
from tqdm import tqdm
from sklearn.metrics import classification_report

# 5 Training And Evaluation


## 5.1. Naive Bayes

#### Training Multinomial Naive Bayes with word count feature vectors (CountVectorizer)


In [None]:
# Let's train a Multinomial Naive Bayes classifier using counts
# (MultinomialNB accepts sparse matrices directly, so no .toarray() is needed)
NB_classifier_counts = MultinomialNB()
NB_classifier_counts.fit(X_train_counts, y_train)
# evaluation
preds = NB_classifier_counts.predict(X_test_counts)
print(classification_report(y_test, preds))

## EXERCISE 3
Train Multinomial Naive Bayes using the TF-IDF vectors.

In [None]:
NB_classifier_tfidf = ## Your CODE GOES HERE
## CODE FOR TRAINING GOES HERE

## EXERCISE 4
Evaluate the results on the test set.

In [None]:
## CODE FOR EVALUATION GOES HERE

# 6. Random Examples (TV reviews from the internet)

In [None]:
# Adjacent string literals are concatenated, which avoids embedding indentation in the text
citizen_info_ireland = (
    'The Government is chosen by and is collectively responsible to the Dáil. '
    'There must be a minimum of 7 and a maximum of 15 Ministers. '
    'The Taoiseach, the Tanaiste and the Minister for Finance must be members of the Dáil. '
    'It is possible to have 2 Ministers who are members of the Senate but this rarely happens.'
)
gone_girl_review = (
    'Audience Reviews for Gone Girl ... '
    'Mesmerizing performances, tense atmosphere, unexpected plot twists and turns '
    'of events, this movie is a real crime thriller!'
)

sherlock_bbc_review = (
    'Dr Watson, a former army doctor, finds himself sharing a flat with Sherlock Holmes, '
    'an eccentric individual with a knack for solving crimes. Together, they take on the most unusual cases.'
)




## EXERCISE 5.1
Predict the labels for the texts above, using either of the trained models (the count-based model or the TF-IDF model from Exercise 3).


In [None]:
# YOUR CODE GOES HERE

<Details>
<summary>HINT</summary>
This step is similar to the one performed on the test set.
Keep in mind that the feature extraction step must still be performed before the prediction step.

</Details>
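
The order of the steps can be sketched on a tiny stand-alone example. Everything below (the sentences, the labels and the `toy_` names) is made up for illustration; it is not the exercise data and not the models trained above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy training data, not taken from the Brown corpus.
toy_texts = [
    "the court dismissed the tax claim",
    "the act regulates federal budget programs",
    "the detective found a gun",
    "she heard a scream in the dark night",
]
toy_labels = ["government", "government", "mystery", "mystery"]

toy_vect = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vect.fit_transform(toy_texts), toy_labels)

# New text must be wrapped in a list: the vectorizer expects an iterable of strings.
new_texts = ["a detective with a gun"]
new_counts = toy_vect.transform(new_texts)  # feature extraction first ...
toy_pred = toy_model.predict(new_counts)    # ... then prediction
print(toy_pred)
```

Passing a bare string instead of a list to `transform` raises an error, so a single text still needs to be wrapped as `[text]`.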

# [OPTIONAL] Exercise
## 5.2 Train a Classifier of your choice which performs better than the previous one

In [None]:
## YOUR CODE GOES HERE