

# Artificial Intelligence project


### Raffaele Mura 70/90/00312
A.Y 2022/2023

Automated text categorization, or text classification, has gained significant attention in the last decade due to the growing volume of digital text, advancements in machine learning algorithms, and its wide-ranging applications in information retrieval, sentiment analysis, customer support, healthcare, and marketing, among others. It has become a crucial tool for organizing and making sense of the vast amount of unstructured textual data available online and has enabled the development of more accurate and efficient solutions in various industries.

This project explores one of the main approaches to text categorization through machine learning tools. The specific objective is to categorize news articles from the Reuters news agency into predefined categories that describe their content.

The problem is a multi-label classification task, where each news document can belong to one or more categories from a predefined set of approximately 90 categories. The goal is to train a multi-label classifier using the "one-vs-all" approach, where *M* binary classifiers are trained independently, one per category, and the categories assigned by these classifiers are then combined to form the final label set for each document.

In this Python code, we will employ various techniques to perform data preprocessing and evaluation to develop an efficient multi-label text categorization model for the Reuters news dataset. The code will encompass data loading, preprocessing, feature selection, classifier training, and model evaluation.

To achieve this, the initial steps of data preprocessing involve selecting informative features associated with terms present in the documents. The standard approach is to evaluate the discriminatory power of each term present in the training set and select the *N* most discriminative terms. The documents are then represented as feature vectors containing a relevant representation of the selected *N* terms.

Before feature selection, it is essential to eliminate "stop words," which are non-discriminatory terms like articles and prepositions, and to perform stemming to reduce the number of features to be evaluated. Stemming reduces the inflected forms of a word to a common root, e.g., mapping "makes" and "making" to "make."

Finally, in the evaluation phase, the performance of the multi-label classifier will be assessed using the $F_\beta$ measure, a combination of precision and recall.



### Downloading the Dataset

This first section contains the code that downloads the dataset.

In [38]:
import tarfile
import os.path
import pickle
import pandas as pd
from IPython.display import clear_output
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
from collections import Counter




auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [39]:
#code for progress bar
from IPython.display import HTML, display
import time

def progress(value, max=100):
    return HTML("""<progress value='{value}' max='{max}' style='width: 100%'>
            {value}
        </progress>""".format(value=value, max=max))


In [40]:
import requests

url = 'http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz'
archive_path = 'reuters21578.tar.gz'

# Download the archive and stop early if the server returns an HTTP error
response = requests.get(url)
response.raise_for_status()

# Write the downloaded bytes to disk
with open(archive_path, 'wb') as f:
    f.write(response.content)

# Extract the files inside the archive into ./dataset
with tarfile.open(archive_path) as archive:
    archive.extractall('./dataset')



### Extracting data from the dataset

The functions below are used for a first handling of the data in the Reuters dataset.
The dataset is distributed across 22 files: the first 21 contain 1000 documents each, while the last contains 578 documents.
All files are in SGML format and use tags to describe the documents they contain.

Examples of tags are 'TOPICS', which lists the categories of a document, and 'BODY', which contains its main text.

The **minor_preprocess** function takes a file path as input, reads the file, and performs minor preprocessing steps to clean the data. It first opens the file in binary mode, reads its lines, and decodes them as utf-8, ignoring any characters that cannot be decoded. Then, it removes problematic strings containing numeric character references (e.g., &#123;). Finally, it replaces newline characters with spaces. The processed lines are joined to form a single string, which is returned.

In [41]:
import os, re
from bs4 import BeautifulSoup


def minor_preprocess(file):
    with open(file, 'rb') as f:
        lines = f.readlines()
        utf8_safe_lines = [line.decode('utf-8', 'ignore') for line in lines]
        xml_safe_lines = [re.sub(r'&#\d*;', '', line) for line in utf8_safe_lines]  # Get rid of problematic strings
        no_newlines = [line.replace('\n', ' ') for line in xml_safe_lines]

    return ''.join(no_newlines)
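
To make the cleaning step concrete, the short sketch below applies the same regular expression to a single line. The sample line and the helper name strip_char_refs are made up for illustration and do not come from the dataset.

```python
import re

def strip_char_refs(line):
    """Remove numeric character references such as '&#3;' from a line."""
    return re.sub(r'&#\d*;', '', line)

# Hypothetical line, written for this example only
sample = 'Shr loss&#3; six cts\n'
cleaned = strip_char_refs(sample).replace('\n', ' ')
print(repr(cleaned))
```

Only numeric references are removed: named entities such as &lt; are left untouched by this pattern.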

The **compile_data** function compiles data from multiple files within a specified datapath directory. It iterates through all files in the directory and selects those with the '.sgm' extension. For each selected file, it calls the minor_preprocess function to preprocess its content. The preprocessed data is then split into individual 'REUTERS' records, and these records are concatenated into a single list called dataset. The function returns this list containing all the preprocessed records.

In [42]:
def compile_data(datapath='./dataset'):
    dataset = []
    for file in os.listdir(datapath):
        if file.endswith('.sgm'):
            preprocessed_data = minor_preprocess(datapath + '/' + file)
            records = [record + '</REUTERS>' for record in preprocessed_data.split('</REUTERS>') if
                       record]  # Retain all original formatting
            dataset.extend(records)
    return dataset


The **compile_dictionary** function takes a single 'REUTERS' record (as a string) as input. It initializes a dictionary called data_dict with keys corresponding to various fields (the tags of the document) of the record, such as 'REUTERS TOPICS', 'LEWISSPLIT', 'TOPICS', 'TITLE', and 'BODY'. It then extracts information from the 'data' string using string manipulations and the BeautifulSoup library (used for parsing XML data).

*   The 'REUTERS TOPICS' tag contains *YES* or *NO*, indicating whether or not the document has an assigned topic.
*   The 'LEWISSPLIT' tag contains *TRAIN* or *TEST*, indicating whether the document belongs to the training or the testing dataset.
*   Finally, the tags 'TOPICS', 'TITLE' and 'BODY' contain, respectively, the categories, the title and the main text of the document.

After retrieving this data, the function returns data_dict, a dictionary containing the information extracted from the 'REUTERS' record.



In [43]:
def compile_dictionary(data):
    data_dict = {
        'REUTERS TOPICS': '',
        'LEWISSPLIT': '',
        'TOPICS': 'none',
        'TITLE': '',
        'BODY': '',
    }

    # Grab the Reuters Topics between the following tags
    start = data.find('<REUTERS TOPICS="') + len('<REUTERS TOPICS="')
    end = data.find('" LEWISSPLIT=')
    data_dict['REUTERS TOPICS'] = data[start:end]
    start = data.find('LEWISSPLIT="') + len('LEWISSPLIT="')
    end = data.find('" CGISPLIT=')
    data_dict['LEWISSPLIT'] = data[start:end]

    soup = BeautifulSoup(data, 'xml')

    # Use a try/except block to grab Topics, Title, and Body in case a tag is missing
    # (soup.TAG is then None, and accessing .contents raises AttributeError).
    # If a tag is missing or empty, the default value remains unchanged
    try:
        if soup.TOPICS.contents:
            # Each topic is wrapped in <D>...</D>: strip the tags and keep the label text
            data_dict['TOPICS'] = [str(topic).replace('<D>', '').replace('</D>', '')
                                   for topic in soup.TOPICS.contents]

        if soup.TITLE.contents:
            data_dict['TITLE'] = soup.TITLE.contents[0]

        if soup.BODY.contents:
            data_dict['BODY'] = soup.BODY.contents[0]
    except AttributeError:
        pass

    return data_dict
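
The attribute slicing in compile_dictionary relies on the attributes appearing in a fixed order (for instance, 'LEWISSPLIT' directly after 'REUTERS TOPICS'). As a hedged alternative, the sketch below looks for the closing quote of each attribute instead. The helper extract_attribute is hypothetical, and the opening tag is only an illustrative string in the style of the Reuters files.

```python
def extract_attribute(record, name):
    """Return the value of name="..." in a record, or '' if it is absent."""
    marker = name + '="'
    start = record.find(marker)
    if start == -1:
        return ''
    start += len(marker)
    end = record.find('"', start)
    return record[start:end] if end != -1 else ''

# Illustrative opening tag, written for this example
record = '<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">'
print(extract_attribute(record, 'TOPICS'))
print(extract_attribute(record, 'LEWISSPLIT'))
```

With this approach a missing attribute yields an empty string rather than an arbitrary slice of the record.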


Because the categories are heavily imbalanced, the code below retrieves the labels of the categories that occur most often in the dataset.
The variable number_of_classes, defined below, sets how many categories are considered: only the number_of_classes most frequent categories over the entire dataset are kept.
The higher this value, the more samples carry several of the retained labels, and the more the task behaves as a genuine multi-label classification problem.
However, this has a drawback: a sample is kept only if its first label is among the most common ones, and any of its labels outside that set are ignored when the targets are encoded.


In [51]:
number_of_classes = 20

The code segment below performs several actions.
First, the *compile_data* function is called to create the dataset list containing the preprocessed 'REUTERS' records.

Then we iterate through each element of the dataset list and apply the compile_dictionary function to convert each 'REUTERS' record into a dictionary. The resulting dictionaries are stored in the dataset_dicts list.
We then filter the dataset_dicts list, removing the entries whose 'TOPICS' key has the value 'none', i.e. the records without topics.

To find the most common labels, we collect the 'TOPICS' values of all the dictionaries in dataset_dicts into the labels list and flatten it, since each value is itself a list of labels.
We then retrieve the *number_of_classes* most common labels and store them in the most_common_labels list. The *most_common* method of the Counter object returns a list of tuples containing each label and its count, sorted in descending order of count.

Finally, we filter the dataset_dicts list again, keeping only the records whose first 'TOPICS' label is in the most_common_labels list, and we copy the title into the body of the records whose body is empty.

In [52]:
dataset = compile_data()
dataset_dicts = [compile_dictionary(data) for data in dataset]
dataset_dicts = [data for data in dataset_dicts if data['TOPICS'] != 'none']

# Count how often each label occurs over the whole dataset
labels = [data["TOPICS"] for data in dataset_dicts]
flat_labels = [item for sublist in labels for item in sublist]
common_labels = Counter(flat_labels)

# Keep only the names of the number_of_classes most frequent labels
most_common_labels = [label for label, _ in common_labels.most_common(number_of_classes)]

print('the most common labels are:', most_common_labels)

with open('labels.pkl', 'wb') as f:
  pickle.dump(most_common_labels, f)

# Keep only the samples whose first label is in the most common list
dataset_dicts = [data for data in dataset_dicts if data['TOPICS'][0] in most_common_labels]

# For the samples with an empty body, use the title as the body
for data in dataset_dicts:
  if data['BODY'] == '':
    data['BODY'] = data['TITLE']

the most common labels are: ['earn', 'acq', 'money-fx', 'crude', 'grain', 'trade', 'interest', 'wheat', 'ship', 'corn', 'dlr', 'oilseed', 'money-supply', 'sugar', 'gnp', 'coffee', 'veg-oil', 'gold', 'nat-gas', 'soybean']
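
The effect of this filtering can be seen on a small made-up example. The sketch below uses toy label lists (not taken from the dataset) and a variable k that plays the role of number_of_classes.

```python
from collections import Counter

# Toy label lists, invented for illustration only
toy_topics = [['earn'], ['acq'], ['earn'], ['grain', 'earn'], ['earn', 'acq'], ['acq', 'wheat']]
k = 2  # plays the role of number_of_classes

counts = Counter(label for topics in toy_topics for label in topics)
top_labels = [label for label, _ in counts.most_common(k)]

# Same rule as in the cell above: keep a sample only if its FIRST label is among the top k
kept = [topics for topics in toy_topics if topics[0] in top_labels]
print(top_labels)
print(kept)
```

The sample labelled grain and earn is discarded because its first label is not among the top k, even though its second one is. The sample labelled acq and wheat is kept, but its wheat label falls outside the retained set.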


### Pre-Processing of data
The next code segment uses the Natural Language Toolkit (NLTK) library to perform text preprocessing on the previously retrieved text data.
The main tasks carried out by the functions are:
*   the **removing_stop_words** function takes a pandas DataFrame as input and removes stop words from the text data in the "TITLE" and "BODY" columns of the DataFrame. The function iterates through the rows of the DataFrame using the DataFrame index. For each row, it splits the text in the "TITLE" and "BODY" columns into individual words using whitespace as the separator. Then, it checks each word against the set of English stop words obtained from NLTK and keeps only the words that are not in that set. The updated title and body texts are then stored back into the DataFrame. Note that the comparison is case-sensitive while NLTK's stop-word list is lowercase, so capitalized stop words such as "The" are not removed.

*   the **stemming** function takes the DataFrame as input and applies stemming to the text data in the "TITLE" and "BODY" columns. Stemming is a process of reducing words to their base or root form (e.g., "running" to "run" or "jumps" to "jump"). The function initializes a PorterStemmer object from NLTK. It then iterates through the rows of the DataFrame using the DataFrame index. For each row, it tokenizes the text in the "TITLE" and "BODY" columns into individual words using NLTK's word_tokenize method. It applies stemming to each word using the PorterStemmer, and then it reconstructs the new version of the title and body texts with the stemmed words.

In [53]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer


def removing_stop_words(dataframe):
    stop_words = set(stopwords.words('english'))
    for i in dataframe.index:
        # Split on whitespace and keep only the words that are not stop words
        title_words = dataframe.at[i, "TITLE"].split()
        body_words = dataframe.at[i, "BODY"].split()

        new_title = ' '.join(t for t in title_words if t not in stop_words)
        new_body = ' '.join(b for b in body_words if b not in stop_words)

        # .at avoids chained assignment, which pandas may apply to a copy
        dataframe.at[i, "TITLE"] = new_title
        dataframe.at[i, "BODY"] = new_body

    return dataframe


def stemming(dataframe):
    ps = PorterStemmer()
    for i in dataframe.index:
        # Tokenize the text, then stem every token
        title_words = word_tokenize(dataframe.at[i, "TITLE"])
        body_words = word_tokenize(dataframe.at[i, "BODY"])

        new_title = ' '.join(ps.stem(t) for t in title_words)
        new_body = ' '.join(ps.stem(b) for b in body_words)

        # .at avoids chained assignment, which pandas may apply to a copy
        dataframe.at[i, "TITLE"] = new_title
        dataframe.at[i, "BODY"] = new_body

    return dataframe


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
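
The stop-word check in removing_stop_words compares each word exactly as written, and NLTK's English list is lowercase, so a capitalized word such as "The" is kept. A possible variant, sketched below, lowercases each word before the comparison. The small STOP_WORDS set is a hand-picked assumption used only to keep the example self-contained; the notebook uses NLTK's list.

```python
# Small hand-picked stop-word set, used only to keep this example self-contained
STOP_WORDS = {'the', 'of', 'it', 'had', 'a', 'in'}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words, comparing in lowercase so 'The' is treated like 'the'."""
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

sentence = 'The Bank of England said it had provided the market'
print(remove_stop_words(sentence))  # prints: Bank England said provided market
```

Whether this variant improves the final results has not been measured here; it only shows how the comparison could be made case-insensitive.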


The code below creates a pandas dataframe from the list of dictionaries built before and then calls the two functions previously described on it.
Finally, it stores the dataframe as a pickle file.


In [54]:
df = pd.DataFrame(dataset_dicts)
print(df.head())

print("\n after the cleaning of the data \n")
df = removing_stop_words(df)
df = stemming(df)
print(df.head())
df.to_pickle('dataframe.pkl')

  REUTERS TOPICS LEWISSPLIT                TOPICS  \
0            YES      TRAIN                [earn]   
1            YES      TRAIN                [earn]   
2            YES      TRAIN  [money-fx, interest]   
3            YES      TRAIN                [earn]   
4            YES      TRAIN                [earn]   

                                              TITLE  \
0              NATIONAL FSI INC <NFSI> 4TH QTR LOSS   
1      <PRECAMBRIAN SHIELD RESOURCES LTD> YEAR LOSS   
2  U.K. MONEY MARKET GIVEN FURTHER 437 MLN STG HELP   
3     GREASE MONKEY HOLDING CORP <GMHC> YEAR NOV 30   
4     ACCEPTANCE INSURANCE HOLDINGS INC <ACPT> YEAR   

                                                BODY  
0  Shr loss six cts vs profit 19 cts     Net loss...  
1  Shr loss 1.93 dlrs vs profit 16 cts     Net lo...  
2  The Bank of England said it had provided the m...  
3  Shr nil vs nil     Net 130,998 vs 30,732     R...  
4  Oper shr profit 1.80 dlrs vs loss 2.28 dlrs   ...  

 after the cleaning

The output above prints the head of the dataset before and after stop-word removal and stemming.
The effect of these operations is visible in the BODY column of each record, where stop words are removed and the remaining words are reduced to their stems.


### Train and Test Split

The function below splits the DataFrame into training and testing sets. It reads the data from the pickle file and then performs the following steps:

*   It initializes four empty lists, X_train, y_train, X_test, and y_test, which will store the training and testing data.
*   It iterates through the rows of the DataFrame using its index. For each row, if the value of 'LEWISSPLIT' is 'TRAIN', it appends the 'BODY' value to X_train and the 'TOPICS' value to y_train. If the value of 'LEWISSPLIT' is 'TEST', it appends the 'BODY' value to X_test and the 'TOPICS' value to y_test.

Before saving the target arrays, we use the *MultiLabelBinarizer* to convert the multi-label targets into a binary indicator matrix (a multi-hot encoding), suitable for multi-label classification tasks.
We pass the set of most common labels to the MultiLabelBinarizer through its 'classes' parameter.
As a result, the encoding considers only the labels included in the most common list, with one column per label in the order of that list.

In [55]:
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import MultiLabelBinarizer


def train_test_split(labels):
    X_train = []
    y_train = []
    X_test = []
    y_test = []
    df = pd.read_pickle('dataframe.pkl')
    for i in df.index:

        if df['LEWISSPLIT'][i] == 'TRAIN':
            X_train.append(df['BODY'][i])
            y_train.append(df['TOPICS'][i])
        elif df['LEWISSPLIT'][i] == 'TEST':
            X_test.append(df['BODY'][i])
            y_test.append(df['TOPICS'][i])

    print('len x train {}, len y train {}, len x test {}, len y test {}'.format(len(X_train),
                                                                                len(y_train),
                                                                                len(X_test),
                                                                                len(y_test)))

    X_train = np.array(X_train)
    X_test = np.array(X_test)
    y_train = np.array(y_train, dtype='object')
    y_test = np.array(y_test, dtype='object')

    mlb = MultiLabelBinarizer(classes=labels)
    y_train = mlb.fit_transform(y_train)
    # Reuse the binarizer fitted on the training targets for the test targets
    y_test = mlb.transform(y_test)

    with open('X_train.pkl', 'wb') as f:
        pickle.dump(X_train, f)
    with open('X_test.pkl', 'wb') as f:
        pickle.dump(X_test, f)
    with open('y_train.pkl', 'wb') as f:
        pickle.dump(y_train, f)
    with open('y_test.pkl', 'wb') as f:
        pickle.dump(y_test, f)

    return X_train, X_test, y_train, y_test



In [56]:
import warnings
warnings.filterwarnings('ignore')
with open('labels.pkl', 'rb') as f:
    labels = pickle.load(f)
X_train, X_test, y_train, y_test = train_test_split(labels)

print(y_train[0:10])

len x train 7046, len y train 7046, len x test 2713, len y test 2713
[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


As we can see above, the encoding of the targets results in a matrix with one row per sample and one column per label. An entry equal to one in row i and column j means that the i-th sample belongs to the j-th label.
With the 20 most common classes, only a few samples belong to more than one of the retained classes.
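
The encoding can be reproduced by hand on a few toy samples. The function multi_hot below is a hypothetical, hand-written stand-in meant to mirror the matrix that MultiLabelBinarizer produces when it is given a fixed list of classes; the shortened class list and the samples are invented for illustration.

```python
def multi_hot(label_lists, classes):
    """Encode each list of labels as a 0/1 row with one column per class.

    Labels that are not in `classes` are skipped.
    """
    column = {label: j for j, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            if label in column:
                row[column[label]] = 1
        rows.append(row)
    return rows

classes = ['earn', 'acq', 'money-fx', 'interest']   # shortened list for illustration
samples = [['earn'], ['money-fx', 'interest'], ['acq', 'copper']]
for row in multi_hot(samples, classes):
    print(row)
```

The third sample shows the drawback discussed earlier: its second label is not in the class list, so it leaves no trace in the encoded row.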

### Feature Selection
The code segment below performs feature selection and computes the Term Frequency-Inverse Document Frequency (TF-IDF) representation of the most discriminant words in the input text data.

First we set *number_of_features*, the number of words to keep as discriminant. This value was chosen by trial and error: changing it and re-running the sections below shows how performance varies.

The first function **retrieving_most_discriminant_words** performs a feature selection using mutual information between the text features (X_train) and the corresponding target labels (y_train).

Mutual information (MI) is a concept from information theory that measures the dependency between two random variables. For discrete random variables X and Y, the mutual information MI(X, Y) is the reduction in uncertainty about X given the knowledge of Y, or vice versa. In other words, it tells us how much knowing the value of one variable helps us predict the value of the other. In the context of feature selection for machine learning, it is used to quantify the amount of information a feature provides about the target variable.
Higher mutual information indicates a stronger dependency between the feature and the target, suggesting that the feature is more informative for predicting the target variable.

To compute MI, the function first uses CountVectorizer to convert the text data into a document-term matrix, which holds the word occurrence counts for each document.
It then calls mutual_info_classif once per label to compute the mutual information between each word's counts and that label, and averages the resulting scores over all labels. Since mutual_info_classif returns an array of scores, the function pairs it with the vocabulary to build a word-to-mutual-information dictionary.
Finally, it sorts this dictionary in descending order of score and keeps the top *number_of_features* most discriminant words.
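
As a hedged illustration of the definition, the sketch below computes the mutual information between two discrete sequences directly from their empirical frequencies. It is a simplification: the toy features record only whether a word occurs in a document, whereas the function above passes raw word counts to mutual_info_classif. All names and data are invented for the example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy data: label membership of 8 documents and the presence (1) or absence (0) of two words
label         = [1, 1, 1, 1, 0, 0, 0, 0]
informative   = [1, 1, 1, 1, 0, 0, 0, 0]   # occurs exactly in the labelled documents
uninformative = [1, 1, 0, 0, 1, 1, 0, 0]   # occurs equally often in both groups

mi_informative = mutual_information(informative, label)
mi_uninformative = mutual_information(uninformative, label)
print(mi_informative, mi_uninformative)
```

The first word determines the label completely and reaches the maximum possible value for a balanced binary label, ln 2, while the second word is independent of the label and scores zero.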

The function **computing_tfidf** computes the TF-IDF representation of the input text data using the most discriminant words obtained from the previous step.
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical representation of text data. It measures the importance of words in a document with respect to a collection of documents. TF computes the word frequency in a document, while IDF assesses the rarity of words across the collection. The final TF-IDF score for a word is the product of its TF and IDF. High TF-IDF scores indicate words that are frequent in a document but rare in the collection, making them more informative.
The steps in this function are:
- Creating a TfidfVectorizer object with the vocabulary set to the most discriminant words and the maximum number of features set to number_of_features.
- Converting the input text data into a TF-IDF weighted document-term matrix using the TfidfVectorizer.
- Passing this matrix through a TfidfTransformer, which re-weights and normalizes the TF-IDF vectors.
- Converting the sparse TF-IDF matrix into a dense numpy array X.
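
The weighting itself can be sketched in a few lines of plain Python. The code below is a hand-written simplification, not the scikit-learn implementation: it assumes whitespace tokenization, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1 (which, to our understanding, matches scikit-learn's default), and L2 normalization. The vocabulary and documents are invented.

```python
import math

def fit_idf(docs, vocabulary):
    """Smoothed IDF, ln((1 + n) / (1 + df)) + 1, computed on the given documents."""
    n = len(docs)
    tokenized = [set(doc.split()) for doc in docs]
    return {term: math.log((1 + n) / (1 + sum(term in doc for doc in tokenized))) + 1
            for term in vocabulary}

def tfidf_vector(doc, vocabulary, idf):
    """TF-IDF weights of one document over a fixed vocabulary, L2-normalized."""
    words = doc.split()
    weights = [words.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else weights

vocabulary = ['shr', 'loss', 'bank']               # hypothetical selected terms
train_docs = ['shr loss vs profit', 'shr net', 'bank said bank']
idf = fit_idf(train_docs, vocabulary)              # IDF estimated on the training text only

# The test document reuses the IDF weights learned on the training documents
test_vector = tfidf_vector('bank loss loss', vocabulary, idf)
print([round(w, 3) for w in test_vector])
```

One design point worth noting: in this sketch the IDF weights are estimated on the training documents and reused for the test document. The computing_tfidf function, by contrast, calls fit_transform on whatever data it receives, so the test set gets IDF weights estimated from the test documents themselves.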


In [57]:
number_of_features = 1200

In [59]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import pandas as pd
import pickle
import numpy as np
from operator import itemgetter

def retrieving_most_discriminant_words(X_train, y_train):
  #progress bar related code
  out = display(progress(0, y_train.shape[1]), display_id=True)
  kk = 0
  ##########
  cv = CountVectorizer()
  X_vec = cv.fit_transform(X_train)

  # Compute mutual information for each label individually
  mi_per_label = []
  for label_idx in range(y_train.shape[1]):
      mi = mutual_info_classif(X_vec, y_train[:, label_idx], discrete_features=True, n_neighbors=3)
      mi_per_label.append(mi)

      #################
      kk=kk+1
      out.update(progress(kk, y_train.shape[1]))

  # Calculate the average mutual information across all labels
  average_mi = np.mean(mi_per_label, axis=0)

  IG = dict(zip(cv.get_feature_names_out(), average_mi))
  most_discriminant = dict(sorted(IG.items(), key=itemgetter(1), reverse=True)[:number_of_features])
  print('Number of feature to consider:', len(most_discriminant))
  most_discriminant_features = list(most_discriminant.keys())
  print('the 20 most discriminant features:', most_discriminant_features[0:20])
  with open('most_discriminant_features.pkl', 'wb') as f:
      pickle.dump(most_discriminant_features, f)
  return most_discriminant_features

def computing_tfidf(most_discriminant_features, X, verbose=False):
    # The flag is called verbose because a parameter named print would shadow
    # the built-in print() and make the call below fail

    # The vocabulary is fixed to the selected words, so there is one column per word
    cv = TfidfVectorizer(max_features=number_of_features, vocabulary=most_discriminant_features)
    X_vec = cv.fit_transform(X)

    # Second TF-IDF pass over the already weighted matrix, which re-normalizes each row
    tfidf_transformer = TfidfTransformer()
    X_tfidf = tfidf_transformer.fit_transform(X_vec)
    if verbose:
        tf_idf = pd.DataFrame(X_tfidf.toarray(), columns=cv.get_feature_names_out())
        print(tf_idf.head())

    return X_tfidf.toarray()


In [60]:
most_discriminant_words = retrieving_most_discriminant_words(X_train, y_train)
X_train = computing_tfidf(most_discriminant_words, X_train, verbose=True)
X_test = computing_tfidf(most_discriminant_words, X_test)

Number of feature to consider: 2000
the 20 most discriminant features: ['vs', 'ct', 'shr', 'net', 'said', 'wheat', 'rev', 'tonn', 'bank', 'oil', 'export', 'trade', 'the', 'agricultur', 'loss', '000', 'inc', 'rate', 'corn', 'mln', 'dollar', 'grain', 'pct', 'soybean', 'profit', 'market', 'share', 'acquir', 'compani', 'corp', 'note', 'div', 'money', 'import', 'barrel', 'currenc', 'japan', 'countri', 'price', 'dlr', 'product', 'crop', 'offici', 'record', 'would', 'week', 'usda', 'avg', 'sugar', 'yen', 'prior', 'coffe', 'central', 'qtli', 'crude', 'billion', 'last', 'dealer', 'offer', 'govern', 'ship', 'dividend', 'qtr', 'produc', 'exchang', 'acquisit', 'econom', 'depart', 'year', 'minist', 'state', 'today', 'agreement', 'foreign', 'farmer', 'deficit', 'gold', 'stake', 'nation', 'ga', 'treasuri', 'mth', 'reserv', 'england', 'fed', 'merger', 'sharehold', 'cut', 'rise', 'day', 'bill', 'ec', 'polici', 'soviet', 'around', 'maiz', 'suppli', 'sourc', 'program', 'told', 'co', 'buy', 'growth', 'com

The dataframe printed by this cell holds the TF-IDF representation of the training samples: each row corresponds to one sample and contains its weights for the most discriminant features.

### Classification and Results
Below are the functions that train and test some classifiers and evaluate their performance.

The performances function takes the true labels and the predicted ones, and computes some metrics to evaluate the performance of the classifier.
It computes:
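- the accuracy, as the number of correctly classified samples divided by the total number of samples. With multi-label targets, accuracy_score counts a sample as correct only if its whole label set is predicted exactly.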
- the precision, which is a metric that focuses on the positive predictions made by the classifier. It is the proportion of true positive predictions (correctly predicted positive samples) out of all the positive predictions (both true positives and false positives).
- the recall, also known as sensitivity or true positive rate, measures the ability of the classifier to correctly identify positive samples (true positives) out of all the samples that are actually positive (true positives + false negatives).
- The F_beta score is a balanced metric that combines precision and recall using a parameter beta. It allows us to control the emphasis on either precision or recall based on the value of beta. The F1 score is a special case of the F_beta score when beta is set to 1, giving equal importance to precision and recall. The F_beta score is calculated as follows:
$$F_\beta = \frac{(1+\beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$$

Hence, setting β = 1 gives the F1 score, which is commonly used when we want a balance between precision and recall.

In the code, average='macro' is used for precision, recall, and F_beta score calculations. This means that the metrics are calculated for each class, and the macro-average is taken to get an overall score across all classes, giving equal weight to each class regardless of its size.
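
The formula and the macro-average can be checked by hand on a toy example. The sketch below is a plain-Python illustration under the definitions given above; the helper names and the toy matrices are invented, and a label with no predicted or no true positives is given a score of 0.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta from precision and recall; returns 0.0 when both are zero."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0

def macro_f_beta(y_true, y_pred, beta=1.0):
    """Average the per-label F_beta over the columns of two 0/1 matrices."""
    scores = []
    for j in range(len(y_true[0])):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(f_beta(precision, recall, beta))
    return sum(scores) / len(scores)

# Toy ground truth and predictions for 4 samples and 2 labels
y_true = [[1, 0], [1, 0], [0, 1], [1, 1]]
y_pred = [[1, 0], [0, 0], [0, 1], [1, 0]]
single_label_f1 = f_beta(0.8, 0.4)
macro_f1 = macro_f_beta(y_true, y_pred)
print(single_label_f1, macro_f1)
```

In this example the first label has precision 1 and recall 2/3 (F1 = 0.8), the second has precision 1 and recall 1/2 (F1 = 2/3), and the macro-average is their plain mean, regardless of how many samples each label has.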

In [62]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import fbeta_score, precision_score, recall_score

def performances(y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro', zero_division=0.0)
    recall = recall_score(y_test, y_pred, average='macro', zero_division=0.0)
    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F_beta score:", fbeta_score(y_test, y_pred, average='macro', beta=1, zero_division=0.0))



The classify function takes the training and testing data X_train, y_train, X_test, and y_test as input. For each of the two base classifiers, an SVC and an MLP, the code creates an instance and wraps it in OneVsRestClassifier, which allows multi-label classification. Each model is trained on the training data (X_train, y_train) and tested on the testing data (X_test). The performance metrics are then computed and printed using the performances function.

The OneVsRestClassifier implements the One-vs-the-rest (OvR) multiclass strategy. Also known as one-vs-all strategy, this method involves training a separate classifier for each class. Each classifier is designed to distinguish its corresponding class from all the other classes. Therefore, OneVsRestClassifier can be used for multilabel classification tasks. To utilize this feature, the target labels should be presented as a 2D binary matrix, where [i, j] == 1 indicates the presence of label j in sample i (as in our case). When used for multilabel classification, the estimator employs the binary relevance method, training one binary classifier independently for each label.
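
The way independent binary decisions are combined into a label set can be sketched as follows. This is only an illustration of the binary relevance idea with made-up decisions; it does not train any classifier, and the function combine_binary_decisions is hypothetical.

```python
def combine_binary_decisions(decisions, classes):
    """Turn per-label 0/1 decisions into one label set per document.

    decisions[j][i] is the output of the j-th binary classifier for document i.
    """
    n_docs = len(decisions[0])
    return [[classes[j] for j in range(len(classes)) if decisions[j][i] == 1]
            for i in range(n_docs)]

classes = ['earn', 'money-fx', 'interest']   # shortened list for illustration
# Hypothetical outputs of three independent binary classifiers on four documents
decisions = [
    [1, 0, 0, 0],   # "earn" vs rest
    [0, 1, 0, 1],   # "money-fx" vs rest
    [0, 1, 0, 0],   # "interest" vs rest
]
label_sets = combine_binary_decisions(decisions, classes)
print(label_sets)
```

Because the classifiers decide independently, a document can receive several labels (the second one here) or none at all (the third one).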

In [63]:
from sklearn.neural_network import MLPClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC


def classify(X_train, y_train, X_test, y_test):

    ####### SVC #########
    print("Training and testing with a SVC classifier")
    clf1 = SVC()
    clfs1 = OneVsRestClassifier(clf1, verbose=51).fit(X_train, y_train)
    y_pred = clfs1.predict(X_test)
    print("SVC performances")
    performances(y_test, y_pred)
    ####### MLP #########
    print("Training and testing with a MLP classifier")
    clf2 = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
    clfs2 = OneVsRestClassifier(clf2, verbose=51).fit(X_train, y_train)
    y_pred = clfs2.predict(X_test)
    print("MLP performances")
    performances(y_test, y_pred)

    print("y_test:\n", y_test[5:15])
    print("y_pred:\n", y_pred[5:15])


In [64]:
classify(X_train, y_train, X_test, y_test)

Training and testing with a SVC classifier
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:   27.2s
[Parallel(n_jobs=1)]: Done   2 tasks      | elapsed:   56.0s
[Parallel(n_jobs=1)]: Done   3 tasks      | elapsed:  1.2min
[Parallel(n_jobs=1)]: Done   4 tasks      | elapsed:  1.5min
[Parallel(n_jobs=1)]: Done   5 tasks      | elapsed:  1.8min
[Parallel(n_jobs=1)]: Done   6 tasks      | elapsed:  2.0min
[Parallel(n_jobs=1)]: Done   7 tasks      | elapsed:  2.3min
[Parallel(n_jobs=1)]: Done   8 tasks      | elapsed:  2.5min
[Parallel(n_jobs=1)]: Done   9 tasks      | elapsed:  2.7min
[Parallel(n_jobs=1)]: Done  10 tasks      | elapsed:  2.9min
[Parallel(n_jobs=1)]: Done  11 tasks      | elapsed:  3.0min
[Parallel(n_jobs=1)]: Done  12 tasks      | elapsed:  3.2min
[Parallel(n_jobs=1)]: Done  13 tasks      | elapsed:  3.3min
[Parallel(n_jobs=1)]: Done  14 tasks      | elapsed:  3.5min
[Parallel(n_jobs=1)]: Done  15 tasks      | elapsed:  3.6min
[Parallel(n_jobs=1)]: Done  16 tasks      

As can be seen from the results above, the two classifiers achieve comparable performance. Finally, the predictions of the last classifier for 10 samples are printed next to the true labels of the same samples, showing that the classifier performs well.

As a final consideration, we encourage changing the number of classes to be considered, as well as the number of discriminating features, to see how the overall performance of the classifiers varies.

By reducing the number of classes to 5, for instance, performance rises above 95% thanks to the large number of samples with those labels, but very few samples remain multi-labelled, because few samples carry more than one label among the 5 most common.
Moreover, increasing the number of features used to represent the samples beyond approximately 1500 lowers performance, because the additional features are too weakly discriminating.