# Table of contents
- The 20newsgroup dataset 
- Text vectorization: numeric representation of text
- Classification model
- Pipelines for text classification

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

## The 20newsgroup dataset 

Load the 20newsgroups dataset with the dedicated function in the `sklearn.datasets` module (see [what a newsgroup is](https://en.wikipedia.org/wiki/Usenet_newsgroup)).

In [None]:
from sklearn.datasets import fetch_20newsgroups
ng20_train = fetch_20newsgroups(subset='train')


In [None]:
type(ng20_train)

Like other built-in sklearn datasets, 20newsgroups is returned as a `Bunch` data structure with the following attributes.

In [None]:
dir(ng20_train)

In [None]:
print(ng20_train.DESCR)

In [None]:
ng20_train.target_names

Explore the dataset.

In [None]:
len(ng20_train.data),len(ng20_train.target)

In [None]:
target_n = pd.Series(ng20_train.target).apply(lambda x: ng20_train.target_names[x])
target_vc = pd.Series(target_n).value_counts()
target_vc.plot(kind='bar',title = '20newsgroups train: classes distribution')

plt.show()

In [None]:
ng20_train.target[:10]

Most classes have around 600 samples. A few classes are slightly less represented, but each still has more than 300 samples.

Let's have a look at some samples

In [None]:
for idx in [0, 123,2000]:
    print(f'\n\n\t\t RECORD {idx}')
    print(f'category index: {ng20_train.target[idx]} - name: {ng20_train.target_names[ng20_train.target[idx]]}')
    print()
    print(ng20_train.data[idx])

We can plot the histogram of the length of the pieces of text (in terms of number of characters).

In [None]:
plt.hist([len(x) for x in ng20_train.data],bins = 20)
plt.show()

Note that it is possible to load only a subset of the categories by passing the list of categories to the `categories` parameter of the `fetch_20newsgroups` function.


In [None]:
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']

In [None]:
ng4_train = fetch_20newsgroups(subset='train',categories = categories)
print(len(ng4_train.data))
print(ng4_train.target[:10])

In [None]:
for idx in [0, 10, 20]:
    print(f'\n\n\t\t RECORD {idx}')
    print(f'category index: {ng4_train.target[idx]} - name: {ng4_train.target_names[ng4_train.target[idx]]}')
    print()
    print(ng4_train.data[idx])

In [None]:
target_n = pd.Series(ng4_train.target).apply(lambda x: ng4_train.target_names[x])
target_vc = pd.Series(target_n).value_counts()
target_vc.plot(kind='bar',title = '4newsgroups train: classes distribution')

plt.show()

## Text vectorization: numeric representation of text

The machine learning algorithms we will use require numerical input; raw text will not work. This means that we have to transform our texts into some kind of numerical representation without losing too much information. Transforming a text from a sequence of characters into a vector of numbers is called **text vectorization**.

![text_vectorization.png](images/text_vectorization.png)

There are many different ways to vectorize texts, from more sophisticated techniques like [word embeddings](https://en.wikipedia.org/wiki/Word_embedding) and topic models like [latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) to simple [bag-of-words models](https://en.wikipedia.org/wiki/Bag-of-words_model).

The most intuitive way to turn the text content into numerical feature vectors is the **bag of words representation**:

* assign a **fixed integer id** to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).

* for each document $d_i$, count the **number of occurrences** of each word $w$ and store it in $X[i, j]$ as the value of feature $w_j$ where $j$ is the index of word $w$ in the dictionary.

The bag of words representation implies that `n_features` is the number of distinct words in the corpus: this number is typically very large (on the order of $10^5$ to $10^6$).
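The two steps above can be sketched in plain Python. This is a toy illustration only: the whitespace tokenizer and the helper names `build_vocabulary` and `count_matrix` are assumptions made for this example, not part of scikit-learn.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct token to a fixed integer id (sorted for reproducibility)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: j for j, tok in enumerate(tokens)}

def count_matrix(docs, vocab):
    """Return X as a list of rows, with X[i][j] = occurrences of word j in document i."""
    X = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # dicts keep insertion order, so iterating vocab follows the id order
        X.append([counts.get(tok, 0) for tok in vocab])
    return X

toy_docs = ["the car is red", "the cat is red the cat is happy"]
toy_vocab = build_vocabulary(toy_docs)
toy_X = count_matrix(toy_docs, toy_vocab)

print(toy_vocab)
print(toy_X)
```

Each row has one entry per vocabulary word, so most entries become zero as soon as the vocabulary grows beyond what a single document uses.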

Let the number of features be $100,000$. If `n_samples` is $10,000$, storing `X` as a NumPy array of type `float32` would require $10,000 \times 100,000 \times 4$ bytes = 4 GB of RAM, which may still be manageable on today’s computers but is undoubtedly burdensome.

Fortunately, most values in `X` will be zeros, since a given document uses fewer than a couple of thousand distinct words. For this reason we say that bag-of-words representations are typically *high-dimensional sparse datasets*. We can save a lot of memory by *only storing the non-zero parts of the feature vectors in memory*.
The `scipy.sparse` matrices are data structures that do exactly this, and scikit-learn has built-in support for these structures.
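The saving can be estimated with a few lines of arithmetic. The figure of 200 non-zero entries per document is a hypothetical value chosen for illustration, and the sparse estimate assumes a CSR-like layout with 32-bit values and indices.

```python
n_samples, n_features = 10_000, 100_000
bytes_per_value = 4  # float32

# Dense storage: every entry is stored, zeros included
dense_bytes = n_samples * n_features * bytes_per_value

# Hypothetical sparsity: assume 200 non-zero entries per document on average
nnz_per_doc = 200
nnz = n_samples * nnz_per_doc

# CSR-like storage: one float32 value and one int32 column index per non-zero,
# plus one int32 row pointer per row (and one extra)
sparse_bytes = nnz * (4 + 4) + (n_samples + 1) * 4

print(f"dense : {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

Under these assumptions the sparse layout needs roughly 16 MB instead of 4 GB.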

`scikit-learn` provides basic tools to process text using the bag of words representation. To build such a representation we will proceed as follows:

* *tokenize* strings and *assign an integer id* to each possible token, for instance by using whitespace and punctuation as token separators.
* *count* the occurrences of tokens in each document.
* *normalize* and weight the counts, giving diminishing importance to tokens that occur in the majority of samples (i.e. documents).

![tfidf_vectorization.png](images/tfidf_vectorization.png)

This approach is called [**TFIDF**](http://en.wikipedia.org/wiki/Tf–idf):

* **term frequency** (TF): the number of times a term $t$ (word) appears in a document $d$, normalized by the length of the document (the total number of words $t'$ in $d$).
* **inverse document frequency** (IDF): a weight that decreases as the number of documents $n_t$ containing the term $t$ grows relative to the total number of documents $N$; a common definition is $\log(N / n_t)$.
* **term frequency-inverse document frequency** (TFIDF): the product of TF and IDF; it down-weights common words like "the" and gives more weight to rarer words like "software".
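The three definitions can be sketched in plain Python. This is the textbook variant, with TF divided by document length and IDF equal to $\log(N / n_t)$; the function name `tfidf` and the toy corpus are assumptions for this example. scikit-learn's defaults differ in the details: it uses raw counts for TF, a smoothed IDF, $\ln\frac{1 + N}{1 + n_t} + 1$, and L2-normalizes each row.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF for a list of tokenized documents."""
    n_docs = len(docs)
    # n_t: number of documents in which each term appears at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (c / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, c in counts.items()
        })
    return weights

toy_corpus = [
    "the software is new".split(),
    "the car is red".split(),
    "the cat is red".split(),
]
toy_weights = tfidf(toy_corpus)
print(toy_weights[0])
```

Terms that occur in every document ("the", "is") get weight zero here, while a term that occurs in a single document ("software") gets the largest weight.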

To perform this vectorization, scikit-learn provides the `TfidfVectorizer` class.

Usage of `TfidfVectorizer` is equivalent to usage of `CountVectorizer` (*tokenize* and *count*) followed by `TfidfTransformer` (*normalize*).
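The equivalence can be checked on a toy corpus. The three sentences below are made up for this example, and default parameters are assumed for all three classes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

toy_texts = ["the car is red", "the cat is red as well", "software is fun"]

# Two-step route: raw counts first, then TF-IDF weighting and normalization
toy_counts = CountVectorizer().fit_transform(toy_texts)
two_step = TfidfTransformer().fit_transform(toy_counts)

# One-step route
one_step = TfidfVectorizer().fit_transform(toy_texts)

# Both routes should give the same matrix (up to floating-point error)
print(np.allclose(two_step.toarray(), one_step.toarray()))
```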

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

These objects enable many operations. Let's have a look at the documentation.

In [None]:
CountVectorizer?

```python
class sklearn.feature_extraction.text.CountVectorizer(*,
  input='content', 
  encoding='utf-8', 
  decode_error='strict', 
  strip_accents=None,   # Remove accents and perform character normalization, (by default does nothing)
  lowercase=True, 
  preprocessor=None,  # Override strip_accents and lowercase while preserving tokenizing and n-grams
  tokenizer=None, # Override the string tokenization step while preserving the preprocessing and n-grams generation steps. 
  stop_words=None, # If None, no stop words will be used. 
  token_pattern='(?u)\b\w\w+\b', # Regular expression denoting what constitutes a "token"
  ngram_range=(1, 1),  # (lower, upper) boundary of the values of n in n-grams. (1,2) -> unigrams and bigrams
  analyzer='word', # Whether the feature should be made of word n-gram or character n-grams.  If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
  max_df=1.0,  # Build vocabulary ignoring terms that have a document frequency higher than the threshold (corpus-specific stop words)
  min_df=1,  # Build vocabulary ignoring terms that have a document frequency lower than the threshold (cut-off)
  max_features=None,# If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
  vocabulary=None, # Custom vocabulary, otherwise determined from input documents
  binary=False, # If True, all non zero counts are set to 1. 
  dtype=<class 'numpy.int64'>)
```
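To get a feeling for `token_pattern` and `ngram_range`, the default pattern can be applied directly with the standard `re` module. The helper `word_ngrams` is a simplified stand-in written for this example; it ignores stop words and the other preprocessing options of the real analyzer.

```python
import re

# Default token_pattern of CountVectorizer: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, ngram_range=(1, 1)):
    """Lowercase, tokenize with the default pattern, and build word n-grams."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]

print(word_ngrams("I saw a red car."))
print(word_ngrams("I saw a red car.", ngram_range=(1, 2)))
```

Single-character tokens such as "I" and "a" do not match the default pattern, so they never reach the vocabulary.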

In [None]:
#pip install nltk
import nltk 
nltk.download('stopwords')
nltk.download('words')

# keep the stop words as a list: recent scikit-learn versions reject a set for `stop_words`
stopwords = nltk.corpus.stopwords.words('english')
stopwords

In [None]:
count_vect = CountVectorizer(stop_words = stopwords)

In [None]:
X_train_vector = count_vect.fit_transform(ng4_train.data)
X_train_vector

In [None]:
X_train_vector.shape

The sparse matrix has 2257 rows (number of documents) and 35644 columns (number of tokens in vocabulary)

**Word occurrences of a document**


In [None]:
print(count_vect.transform(['the car is red']))

In [None]:
print(count_vect.transform(['the car is red. The cat is red as well']))

In [None]:
print(count_vect.transform(['la macchina è rossa']))

Let's have a look at the vocabulary the vectorizer learned from our data. We read the `vocabulary_` attribute of the fitted vectorizer to retrieve the full vocabulary and then use a little loop to print the first items in the vocabulary:

In [None]:
vocabulary = count_vect.vocabulary_ # it is a mapping of terms to feature indices.



In [None]:
for x in ['the', 'car', 'is', 'red', 'the', 'cat', 'is', 'red', 'as', 'well']:
    print(f'{x}: {vocabulary.get(x,"token not in voc")}')

In [None]:
for x in ['la', 'macchina', 'è', 'rossa']:
    print(f'{x}: {vocabulary.get(x,"token not in voc")}')

In [None]:
vocabulary = count_vect.vocabulary_

# little loop to print the first 10 items in the vocabulary
for count, item in enumerate(vocabulary.items()):
    if count >= 10:
        break
    print(item)

In [None]:
print(f'The vocabulary contains {len(vocabulary)} terms in total')

In [None]:
tokens = count_vect.get_feature_names_out()
len(tokens)

In [None]:
tokens[:100]

In [None]:
tokens[10000:10100]

In [None]:
tokens[-100:]

A glimpse at the extracted features (i.e. tokens) suggests that many of them are meaningless words.
We can perform a more aggressive preprocessing (e.g., removing numbers and other special symbols) to avoid this.


Furthermore, by setting the parameters `max_df` and/or `min_df` and/or `max_features`, we can further reduce the number of words (i.e. number of attributes) in order to:
- apply a sort of feature selection (remove noisy features)
- reduce memory consumption 
- reduce required computational power
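The effect of `min_df` and `max_df` can be sketched in plain Python. The helper `prune_vocabulary` and the toy documents are assumptions for this example: `min_df` is taken as an absolute document count and `max_df` as a fraction of the corpus, whereas `CountVectorizer` accepts either form for both parameters.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency is at least min_df (a count)
    and at most max_df (a fraction of the corpus)."""
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents does each term appear?
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    return sorted(t for t, df in doc_freq.items()
                  if df >= min_df and df / n_docs <= max_df)

toy_tokenized = [
    ["edu", "graphics", "image"],
    ["edu", "god", "faith"],
    ["edu", "graphics", "xq7z"],
    ["edu", "god", "image"],
]
print(prune_vocabulary(toy_tokenized))                         # nothing removed
print(prune_vocabulary(toy_tokenized, min_df=2, max_df=0.9))   # rare and ubiquitous terms removed
```

With `min_df=2` the terms seen in a single document disappear, and with `max_df=0.9` the term present in every document is treated as a corpus-specific stop word.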


In [None]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_vector)

X_train_tfidf

In [None]:
X_train_vector

In [None]:
print(tfidf_transformer.transform(count_vect.transform(['the car is red'])))

Combine `CountVectorizer` and `TfidfTransformer` in a single object:

In [None]:
vectorizer = TfidfVectorizer(stop_words = stopwords)
X_train_tfidf = vectorizer.fit_transform(ng4_train.data)

In [None]:
vectorizer.transform(['the car is red']).data

In [None]:
X_train_tfidf.nnz / float(X_train_tfidf.shape[0])

In [None]:
X_train_tfidf.nnz / (X_train_tfidf.shape[0] * X_train_tfidf.shape[1])  # fraction of non-zero entries

On average, each sample has about 124 non-zero components in a 35644-dimensional space (about 0.35%).

### Import the test samples

Import the files pertaining to the same 4 categories.

In [None]:
ng4_test = fetch_20newsgroups(subset='test',categories = categories)


In [None]:
target_n = pd.Series(ng4_test.target).apply(lambda x: ng4_test.target_names[x])
target_vc = pd.Series(target_n).value_counts()
target_vc.plot(kind='bar',title = '4newsgroups test: classes distribution')

plt.show()

In [None]:
len(ng4_test.data), len(ng4_train.data)

## Classification model

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, ng4_train.target)


**Novel, custom, samples (you can simulate *production* behaviour)**

In [None]:
docs_new = ['is there any afterlife', 'OpenGL on the GPU is fast']


In [None]:
docs_new_count = vectorizer.transform(docs_new)
docs_new_count

In [None]:
predicted_class = clf.predict(docs_new_count)
[ng4_train.target_names[x] for x in predicted_class]

In [None]:
predicted_class_prob = clf.predict_proba(docs_new_count)
predicted_class_prob

**The actual test-set**

In [None]:
X_test_tfidf = vectorizer.transform(ng4_test.data)
y_pred = clf.predict(X_test_tfidf)


### Evaluation metrics

In [None]:
from sklearn.metrics import classification_report,ConfusionMatrixDisplay
print(classification_report(ng4_test.target,y_pred,target_names = ng4_test.target_names))

In [None]:
ConfusionMatrixDisplay.from_predictions(ng4_test.target,y_pred,display_labels = ng4_test.target_names)
plt.show()

The overall accuracy is quite high (around 90%), and so are precision and recall per class.

The only exceptions are the **precision of class "soc.religion.christian"** (0.75) and the **recall of class "alt.atheism"**. Specifically, 80 examples from class alt.atheism are misclassified as soc.religion.christian.
The two classes likely cover semantically similar topics, which makes it harder to discriminate between them.
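The link between the confusion matrix and per-class precision and recall can be made explicit with a small helper. The function `precision_recall` and the 2-class counts below are made up for illustration; they are not the notebook's results.

```python
def precision_recall(confusion, k):
    """Precision and recall of class k from a confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j
    (the convention used by scikit-learn).
    """
    tp = confusion[k][k]
    predicted_k = sum(row[k] for row in confusion)  # column sum: TP + FP
    actual_k = sum(confusion[k])                    # row sum: TP + FN
    return tp / predicted_k, tp / actual_k

# Hypothetical counts: 40 samples of class 0 are predicted as class 1
toy_cm = [[60, 40],
          [5, 95]]

for k in range(2):
    p, r = precision_recall(toy_cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f}")
```

Samples that leak from class 0 into class 1 lower the recall of class 0 and, at the same time, the precision of class 1, which is the pattern observed above for alt.atheism and soc.religion.christian.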

Let's have a look at some misclassified samples.

In [None]:
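atheism_idx = ng4_test.target_names.index('alt.atheism')  # 0 for this selection of categories
misclassified_atheism = np.asarray(ng4_test.data)[(ng4_test.target != y_pred) & (ng4_test.target == atheism_idx)]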
for m_atheism in misclassified_atheism[:3]:
    print(m_atheism)


### Model inspection

Can we get some information about the most informative features?

The `MultinomialNB` object exposes the attribute `feature_log_prob_`, which holds the empirical log probability of each feature given a class, $\log P(x_i|y)$.
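As a rough sketch of what this attribute contains, the smoothed estimate used by multinomial naive Bayes, $\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$, can be computed by hand for a single class. The helper name and the per-feature totals are hypothetical; in this notebook the totals would be sums of TF-IDF weights rather than integer counts.

```python
import math

def feature_log_prob(class_totals, alpha=1.0):
    """Smoothed log P(x_i | y) for one class from per-feature totals."""
    # N_y + alpha * n, where n is the number of features
    denominator = sum(class_totals) + alpha * len(class_totals)
    return [math.log((total + alpha) / denominator) for total in class_totals]

# Hypothetical summed feature values for one class over a 4-term vocabulary
toy_totals = [3, 0, 1, 6]
toy_log_prob = feature_log_prob(toy_totals)

# Back to probabilities for readability
print([round(math.exp(v), 3) for v in toy_log_prob])
```

Thanks to the smoothing term $\alpha$, a feature never seen in the class still gets a small non-zero probability.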
 

In [None]:
# https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset
import numpy as np
def show_top10(clf, vectorizer, categories):
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        top10 = np.argsort(clf.feature_log_prob_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))

show_top10(clf, vectorizer, ng4_train.target_names)


We can notice, for example, that among the most informative features there are some apparently unrelated to the topic (e.g. *edu*, *com*); they most likely come from the e-mail addresses in the message headers.

## Pipelines for text classification

Text processing and classification is a typical scenario where pipelines can be helpful.

In the following, the exercise is repeated on the full 20newsgroups dataset.


In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline

ng20_train = fetch_20newsgroups(subset='train')
ng20_test = fetch_20newsgroups(subset='test')

text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words = stopwords)),
    ('clf', MultinomialNB()),
])

text_clf.fit(ng20_train.data,ng20_train.target)
y_pred = text_clf.predict(ng20_test.data)
print(classification_report(ng20_test.target,y_pred,target_names = ng20_test.target_names))
fig,ax = plt.subplots(figsize = (15,15))
ConfusionMatrixDisplay.from_predictions(ng20_test.target,y_pred,display_labels = ng20_test.target_names,ax=ax,xticks_rotation='vertical')
plt.show()
len(text_clf[0].vocabulary_)

Let's try to add a stemmer to the text processing pipeline.

In [None]:
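# Reuse the default preprocessing, tokenization and stop-word removal of TfidfVectorizer
analyzer = TfidfVectorizer(stop_words = stopwords).build_analyzer()
print(analyzer(ng20_train.data[0])[:10])

from nltk.stem.snowball import SnowballStemmer
snow_stemmer = SnowballStemmer('english')

def snowball_analyzer(doc):
    # stem every token produced by the default analyzer
    return [snow_stemmer.stem(w) for w in analyzer(doc)]

print(snowball_analyzer(ng20_train.data[0])[:10])

# A callable analyzer replaces the built-in preprocessing, tokenization and stop-word filtering
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer = snowball_analyzer)),
    ('clf', MultinomialNB()),
])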

In [None]:
text_clf.fit(ng20_train.data,ng20_train.target)
y_pred = text_clf.predict(ng20_test.data)
print(classification_report(ng20_test.target,y_pred,target_names = ng20_test.target_names))
fig,ax = plt.subplots(figsize = (15,15))
ConfusionMatrixDisplay.from_predictions(ng20_test.target,y_pred,display_labels = ng20_test.target_names,ax=ax,xticks_rotation='vertical')
plt.show()
len(text_clf[0].vocabulary_)

How can we comment on this result?

# <font color='blue'><ins>TASK</ins></font>
- 13_TASK-1: 
    - Consider the 20newsgroup dataset, limited to the 4 categories we have analyzed in this notebook.
    - Can you tune the processing/classification pipeline and improve the performance on the test set?
        - For example (not exhaustive, be creative!)
            - act on the text processing stage to reduce the feature space and remove "noisy" tokens;
            - act on the text processing stage by applying stemming;
            - remove the header (`fetch_20newsgroups` has a parameter `remove` that you can tune);
            - extract only the "subject" of each piece of news and try to infer the topic based only on that; 
            - use other classification models;
        - Note: select your best model (text processing / vectorization / classification) on the training set by using cross-validation, and then evaluate its final performance on the test set

- 13_TASK-2:
    - Consider the TripAdvisor dataset
        - Carry out an exploratory data analysis.        
        - Carry out a classification analysis in the following setting:
            - consider three models, namely SVM - MultinomialNB - DecisionTree.
            - Evaluate the classifiers by applying a 5-fold cross-validation
            - Compare the results achieved by the three models on the test set, in terms of accuracy
            - Comment on the results achieved in terms of precision and recall per class
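
The model-selection note in 13_TASK-1 (choose the pipeline by cross-validation on the training set, then score it once on the test set) could be organized along the following lines. This is only a sketch under stated assumptions: the tiny corpus, the labels, the parameter grid and `cv=2` are placeholders chosen so that the example runs quickly, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for the training split (0 = graphics, 1 = medicine)
toy_train_texts = [
    "render the image with opengl", "3d graphics card and image files",
    "polygon rendering on the gpu", "image format and graphics library",
    "the doctor prescribed a treatment", "patients with chronic disease",
    "medical treatment for the disease", "doctor and patients in hospital",
]
toy_train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Candidate settings; "<step name>__<parameter>" addresses a parameter inside the pipeline
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

# Cross-validation uses the training data only; the test set stays untouched
search = GridSearchCV(toy_pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_train_texts, toy_train_labels)

print(search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.2f}")
```

After the search, `search.best_estimator_` is refitted on the whole training set and can be evaluated once on the held-out test set.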