<a href="https://colab.research.google.com/github/saiteja-ms/DAL-Project/blob/main/ME21B171_Assignment_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Task 1: Construct a Dataset with Roman-Script Based Languages

### Problem Overview:
Amazon has released a multilingual dataset called **MASSIVE** with sentences from 51 languages. The dataset is structured in JSON format and includes various features such as sentence utterances, language codes, partitions, and tokens. Our objective is to build a **language classifier** for a subset of languages that use the **Roman alphabet**. The dataset is available via [Hugging Face](https://huggingface.co/datasets/qanastek/MASSIVE), and we will focus on extracting the sentences for the following **27 locales** (language-country pairs) using **Roman script**:

`af-ZA`, `da-DK`, `de-DE`, `en-US`, `es-ES`, `fr-FR`, `fi-FI`, `hu-HU`, `is-IS`, `it-IT`, `jv-ID`, `lv-LV`, `ms-MY`, `nb-NO`, `nl-NL`, `pl-PL`, `pt-PT`, `ro-RO`, `ru-RU`, `sl-SL`, `sv-SE`, `sq-AL`, `sw-KE`, `tl-PH`, `tr-TR`, `vi-VN`, `cy-GB`

### Task:
1. **Extract Utterances**: Extract the sentence utterances (`utt`) from the **MASSIVE** dataset for each of the 27 Roman-script locales.
2. **Save to Files**: Store the extracted utterances in separate text files, one for each locale. Each file will have one sentence per line.
3. **Maintain Parallel Corpus**: Ensure all files have the same number of lines as the MASSIVE dataset is a **parallel-corpus**.

### Steps to Follow:

1. **Load the Dataset**: Use the **Hugging Face `datasets`** library to load the MASSIVE dataset.
   
2. **Locale Selection**: Programmatically filter the dataset by locale to extract only the Roman-script locales mentioned.

3. **Extract Utterances**: Extract the `utt` field (sentence utterances) from the dataset for each locale.

4. **Save Utterances to Text Files**: For each locale, save the extracted utterances into a corresponding text file, ensuring that each file contains one sentence per line.

In [None]:
# Install the datasets library from Huggingface
!pip install datasets

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m471.6/471.6 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
[2K  

In [None]:
# Required imports
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

In [None]:
# Load the MASSIVE dataset using Huggingface's datasets library
dataset = load_dataset("qanastek/MASSIVE")

# List of selected locales (languages with Roman script)
selected_locales = [
    "af-ZA", "da-DK", "de-DE", "en-US", "es-ES", "fr-FR", "fi-FI", "hu-HU",
    "is-IS", "it-IT", "jv-ID", "lv-LV", "ms-MY", "nb-NO", "nl-NL", "pl-PL",
    "pt-PT", "ro-RO", "ru-RU", "sl-SL", "sv-SE", "sq-AL", "sw-KE", "tl-PH",
    "tr-TR", "vi-VN", "cy-GB"
]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


MASSIVE.py:   0%|          | 0.00/32.3k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/34.1k [00:00<?, ?B/s]

The repository for qanastek/MASSIVE contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/qanastek/MASSIVE.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/39.5M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [None]:
# Directory to save the extracted dataset
import os
output_dir = "datasets"
os.makedirs(output_dir, exist_ok=True) # Create the directory if it doesn't exist

In [None]:
# Function to save utterances (sentences) from the dataset to a file
def save_utterances(file_path, data):
    with open(file_path, "w", encoding="utf-8") as file:
        for utterance in data:
            utt = utterance['utt'] # Extract the 'utt' (sentence)
            file.write(utt + '\n') # Write each sentence on a new line

In [None]:
# Function to extract utterances from the dataset based on selected locales
def extract_and_save_data():
    for locale in selected_locales:
        # Filter the dataset for each language (locale) and partition (train/validation/test)
        train_data = dataset['train'].filter(lambda x: x['locale'] == locale)
        validation_data = dataset['validation'].filter(lambda x: x['locale'] == locale)
        test_data = dataset['test'].filter(lambda x: x['locale'] == locale)

        # Save the filtered data to text files (one file per language and partition)
        train_file_path = os.path.join(output_dir, f"train_{locale}.txt")
        validation_file_path = os.path.join(output_dir, f"validation_{locale}.txt")
        test_file_path = os.path.join(output_dir, f"test_{locale}.txt")

        save_utterances(train_file_path, train_data)
        save_utterances(validation_file_path, validation_data)
        save_utterances(test_file_path, test_data)

        print(f"Data for {locale} saved successfully.")

In [None]:
# Run the data extraction and saving process
extract_and_save_data()

Data for af-ZA saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for da-DK saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for de-DE saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for en-US saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for es-ES saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for fr-FR saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for fi-FI saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for hu-HU saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for is-IS saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for it-IT saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for jv-ID saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for lv-LV saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for ms-MY saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for nb-NO saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for nl-NL saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for pl-PL saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for pt-PT saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for ro-RO saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for ru-RU saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for sl-SL saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for sv-SE saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for sq-AL saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for sw-KE saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for tl-PH saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for tr-TR saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for vi-VN saved successfully.


Filter:   0%|          | 0/587214 [00:00<?, ? examples/s]

Filter:   0%|          | 0/103683 [00:00<?, ? examples/s]

Filter:   0%|          | 0/151674 [00:00<?, ? examples/s]

Data for cy-GB saved successfully.


# Task - 2
To build a Multionomial Naives Bayes Dataset

In [None]:
#Import additional libraries for model training and evaluation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

data_dir = "datasets"
# List of locales used in the dataset (same as before)
locales = ['af-ZA', 'da-DK', 'de-DE', 'en-US', 'es-ES', 'fr-FR', 'fiFI', 'hu-HU', 'is-IS', 'it-IT', 'jv-ID', 'lv-LV',
 'ms-MY', 'nb-NO', 'nl-NL', 'pl-PL', 'pt-PT', 'ro-RO', 'ruRU', 'sl-SL', 'sv-SE', 'sq-AL', 'sw-KE', 'tl-PH',
 'tr-TR', 'vi-VN', 'cy-GB']

# Function to load data from text files for each locale (for a specific partition)
def load_data(partition):
    data = []
    labels = []
    for idx, locale in enumerate(locales):
        file_path = os.path.join(data_dir, f"{partition}_{locale}.txt")
        if os.path.exists(file_path):
            with open(file_path, "r", encoding="utf-8") as file:
                utterances = file.readlines()
                data.extend(utterances) # Append all sentences from the file
                labels.extend([idx] * len(utterances)) # Label the data with corresponding locale index
    return data, labels


In [None]:
# Load training, validation, and test data
train_text, train_labels = load_data("train")
validation_text, validation_labels = load_data("validation")
test_text, test_labels = load_data("test")

In [None]:
# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_text)
validation_vectors = vectorizer.transform(validation_text)
test_vectors = vectorizer.transform(test_text)

In [None]:
# Initialize the Multinomial Naive Bayes classifier
NB = MultinomialNB()

In [None]:
# Hyperparameter tuning using GridSearchCV (searching for the best 'alpha' parameter)
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 0.5, 1.0, 10.0]
}
# Fine-tuning with Grid Search on Validation set for alpha parameter
grid_search = GridSearchCV(NB, param_grid, cv=5, scoring='accuracy')
grid_search.fit(validation_vectors, validation_labels)

# Get the best 'alpha' parameter
best_alpha = grid_search.best_params_['alpha']
print(f"Best alpha: {best_alpha}")


Best alpha: 0.1


In [None]:
#Re-train the model with the best 'alpha' hyper-parameter
NB_best = MultinomialNB(alpha=best_alpha)
NB_best.fit(train_vectors, train_labels)

In [None]:
# Evaluate on train, validation and test sets
train_predictions = NB_best.predict(train_vectors)
validation_predictions = NB_best.predict(validation_vectors)
test_predictions = NB_best.predict(test_vectors)


In [None]:
# Print Performance metrics for all three partitions
print("Performance Metrics on Train Set:")
print(classification_report(train_labels, train_predictions))
print(f"Accuracy:{accuracy_score(train_labels, train_predictions)}")

print("Performance Metrics on Validation Set:")
print(classification_report(validation_labels, validation_predictions))
print(f"Accuracy:{accuracy_score(validation_labels, validation_predictions)}")

print("Performance Metrics on Test Set:")
print(classification_report(test_labels, test_predictions))
print(f"Accuracy:{accuracy_score(test_labels, test_predictions)}")



Performance Metrics on Train Set:
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     11514
           1       0.97      0.97      0.97     11514
           2       1.00      0.99      1.00     11514
           3       0.97      1.00      0.98     11514
           4       0.99      0.99      0.99     11514
           5       1.00      1.00      1.00     11514
           7       1.00      1.00      1.00     11514
           8       1.00      1.00      1.00     11514
           9       0.99      1.00      0.99     11514
          10       1.00      0.99      0.99     11514
          11       1.00      1.00      1.00     11514
          12       0.99      1.00      0.99     11514
          13       0.98      0.97      0.97     11514
          14       0.99      0.98      0.99     11514
          15       0.99      1.00      1.00     11514
          16       0.99      0.99      0.99     11514
          17       1.00      1.00      1.00    

# Task - 3
To build a Regularized Discriminant Model

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, ClassifierMixin

In [None]:
# Define the mapping of languages to continents
continent_map = {
 'af-ZA': 'Africa', 'da-DK': 'Europe', 'de-DE': 'Europe', 'en-US':
'North America',
 'es-ES': 'Europe', 'fr-FR': 'Europe', 'fi-FI': 'Europe', 'hu-HU':
'Europe',
 'is-IS': 'Europe', 'it-IT': 'Europe', 'jv-ID': 'Asia', 'lv-LV':
'Europe',
 'ms-MY': 'Asia', 'nb-NO': 'Europe', 'nl-NL': 'Europe', 'pl-PL':
'Europe',
 'pt-PT': 'Europe', 'ro-RO': 'Europe', 'ru-RU': 'Europe', 'sl-SL':
'Europe',
 'sv-SE': 'Europe', 'sq-AL': 'Europe', 'sw-KE': 'Africa', 'tl-PH':
'Asia',
 'tr-TR': 'Asia', 'vi-VN': 'Asia', 'cy-GB': 'Europe'
}

In [None]:
# Create a directory for continent data
continent_data_dir = "continent_data"
os.makedirs(continent_data_dir, exist_ok=True)

In [None]:
# Helper function to combine language files into continent files
def combine_languages_into_continent():
    continent_files = {continent:[] for continent in set(continent_map.values())}

    for locale, continent in continent_map.items():
        train_file = os.path.join(data_dir, f"train_{locale}.txt")
        validation_file = os.path.join(data_dir, f"validation_{locale}.txt")
        test_file = os.path.join(data_dir, f"test_{locale}.txt")

        if os.path.exists(train_file):
            with open(train_file, "r", encoding="utf-8") as file:
                continent_files[continent].extend(file.readlines())

        if os.path.exists(validation_file):
            with open(validation_file, "r", encoding="utf-8") as file:
                continent_files[continent].extend(file.readlines())

        if os.path.exists(test_file):
            with open(test_file, "r", encoding="utf-8") as file:
                continent_files[continent].extend(file.readlines())

    for continent, lines in continent_files.items():
        continent_file_path = os.path.join(continent_data_dir, f"{continent}.txt")
        with open(continent_file_path, "w", encoding="utf-8") as file:
            file.writelines(lines)

    print("Language files combined into continent files successfully.")

In [None]:
combine_languages_into_continent()

Language files combined into continent files successfully.


In [None]:
def load_continent_info():
    data = []
    labels = []
    for continent in os.listdir(continent_data_dir):
        if continent.endswith(".txt"):
            continent_name = continent.replace("_combined.txt", "")
            file_path = os.path.join(continent_data_dir, continent)
            with open(file_path, "r", encoding="utf-8") as file:
                utterances = file.readlines()
                data.extend(utterances)
                labels.extend([continent_name] * len(utterances))

    return data, labels

In [None]:
Text, Label = load_continent_info()

In [None]:
Vectorizer_ = TfidfVectorizer(max_features = 5000, ngram_range = (1,2))
Text_Vector = Vectorizer_.fit_transform(Text)
Label_Vector = np.array(Label)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(Text_Vector, Label_Vector, test_size=0.2, random_state=42)


In [None]:
class RegularizedDiscriminantAnalysis(BaseEstimator, ClassifierMixin):
    def __init__(self, lambda_param=1.0):
        self.lambda_param = lambda_param
        self.LDA = LogisticRegression(penalty='l2', C = 1/self.lambda_param, solver='liblinear')

    def fit(self, X, y):
        self.LDA.fit(X, y)
        return self

    def predict(self, X):
        return self.LDA.predict(X)

    def get_params(self, deep=True):
        return {'lambda_param': self.lambda_param}

    def set_params(self, **params):
        for param, value in params.items():
            setattr(self, param, value)
        return self


In [None]:
param_grid = {
    'lambda_param': [0.001, 0.01, 0.1, 0.5, 1.0, 10.0]
}

rda = RegularizedDiscriminantAnalysis()
grid_search = GridSearchCV(rda, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 971, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_scorer.py", line 279, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_scorer.py", line 371, in _score
    y_pred = method_caller(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_scorer.py", line 89, in _cached_call
    result, _ = _get_response_values(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_response.py", line 199, in _get_response_values
    classes = estimator.classes_
AttributeError: 'RegularizedDiscriminantAnalysis' object has no attribute 'classes_'

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_valida

In [None]:
best_lambda = grid_search.best_params_['lambda_param']
print(f"Best lambda: {best_lambda}")

Best lambda: 0.001


In [None]:
rda_best = RegularizedDiscriminantAnalysis(lambda_param=best_lambda)
rda_best.fit(X_train, y_train)

In [None]:
y_pred = rda_best.predict(X_test)

In [None]:
print('\n Test Set Metrics: \n')
print(f"Accuracy: {classification_report(y_test, y_pred)}")
print(classification_report(y_test, y_pred))


 Test Set Metrics: 

Accuracy:                    precision    recall  f1-score   support

       Africa.txt       0.98      0.94      0.96      6447
         Asia.txt       1.00      0.95      0.97     16513
       Europe.txt       0.98      1.00      0.99     62902
North America.txt       0.98      0.95      0.96      3352

         accuracy                           0.98     89214
        macro avg       0.98      0.96      0.97     89214
     weighted avg       0.98      0.98      0.98     89214

                   precision    recall  f1-score   support

       Africa.txt       0.98      0.94      0.96      6447
         Asia.txt       1.00      0.95      0.97     16513
       Europe.txt       0.98      1.00      0.99     62902
North America.txt       0.98      0.95      0.96      3352

         accuracy                           0.98     89214
        macro avg       0.98      0.96      0.97     89214
     weighted avg       0.98      0.98      0.98     89214

