
Add multilingual spam classifier #33

Open

yanisdb wants to merge 6 commits into master
Conversation

@yanisdb commented Dec 14, 2022

This PR adds the production code for the multilingual spam classifier. The new README explains the scripts and should probably be reviewed first to understand the rest:

  • adds the sources in ./src,
  • changes requirements.txt to reflect the new dependencies,
  • changes README.md to reflect the changes and explain how to train and use the classifier,
  • changes the Makefile to reflect the changes,
  • adds a few necessary lines to .gitignore.

This PR is based on my previous one (#32) to facilitate merging. If that PR isn't going to be merged, let me know and I will remove its changes from this one.

@slint (Member) left a comment

Many thanks for all the work @yanisdb, @lukasec, and @tecabert!

Don't be spooked: the review comments are not for you to address, but notes for us to act on before merging or to investigate in the future.

chunk = chunk[KEPT_FIELDS].dropna()

chunk_spams = chunk[chunk["spam"] == True]
chunk_spams["description"] = chunk_spams["description"].map(parse_description)
@slint (Member):

The .map() call here could be parallelized to make use of multithreading or multiprocessing.
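
A minimal sketch of one way to do that with the standard library, assuming parse_description is a picklable top-level function (pool size and chunking are left at the defaults):

from multiprocessing import Pool

# Fan parse_description out across worker processes; Pool.map preserves
# input order, so the results line up with the original rows.
with Pool() as pool:
    chunk_spams["description"] = pool.map(
        parse_description, chunk_spams["description"]
    )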


1. Go to the Zenodo Open Metadata record at <https://doi.org/10.5281/zenodo.787062> to access all dataset versions.
@slint (Member):

We can keep the reference to the dataset DOI, to make it easier to get hold of the (test) data for training.

Returns:
pd.DataFrame, pd.DataFrame: Train and test sets.
"""
dataset = dataset.sample(frac=1, random_state=SEED).reset_index(drop=True)
@slint (Member):

This needs a closer look to preserve the ham/spam distribution in the resulting train/test sets.

@yanisdb (Author):

Oh yes! This is indeed not correct: it doesn't keep the exact same distribution. We noticed it when pushing the code to our GitHub Classroom repository and forgot to push the fix to this one.

Here is the correct version of the function:

def split_train_test(
    dataset: pd.DataFrame, test_size=0.2
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split the dataset into train and test. It will keep exactly the same
    distribution of classes in both train and test.

    Args:
        dataset (pd.DataFrame): Dataset to split.
        test_size (float, optional): Percentage of the dataset to use for the
        test set.

    Returns:
        pd.DataFrame, pd.DataFrame: Train and test sets.
    """
    # Separate the two classes so each can be split with the same ratio.
    spams = dataset[dataset["label"] == 1]
    hams = dataset[dataset["label"] == 0]

    # Shuffle each class independently (deterministic via SEED).
    spams = spams.sample(frac=1, random_state=SEED)
    hams = hams.sample(frac=1, random_state=SEED)

    # Take the first (1 - test_size) fraction of each class for training.
    spams_train = spams[: int(len(spams) * (1 - test_size))]
    spams_test = spams[int(len(spams) * (1 - test_size)) :]
    hams_train = hams[: int(len(hams) * (1 - test_size))]
    hams_test = hams[int(len(hams) * (1 - test_size)) :]

    train = pd.concat([spams_train, hams_train])
    test = pd.concat([spams_test, hams_test])

    # Re-shuffle so spam and ham rows are interleaved in each split.
    train = train.sample(frac=1, random_state=SEED)
    test = test.sample(frac=1, random_state=SEED)

    return train, test
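
For reference, the same stratified split can be obtained with scikit-learn's train_test_split, assuming scikit-learn is (or becomes) a dependency:

from sklearn.model_selection import train_test_split

# Stratifying on the label column preserves the ham/spam ratio in both splits.
train, test = train_test_split(
    dataset, test_size=0.2, stratify=dataset["label"], random_state=SEED
)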

@yanisdb (Author) commented Jan 17, 2023

While pushing to our GitHub Classroom repository, we noticed some minor issues and fixed them. We forgot to push the fixes to your repository; I just pushed a new commit with them.

This commit includes 4 small changes:

  • fixes the split_train_test method to ensure that the class distribution remains the same in each split,
  • changes the torch version to 1.13.1 (the version we actually tested with),
  • adds details to the README,
  • changes the saving format from CSV to pickle (we believe saving to CSV was causing issues when reading and writing, because of commas in the strings and encoding).

We've tested this last commit and it doesn't change the results (accuracy, F1-score, ...).
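
For the CSV-to-pickle change, a minimal sketch of the round-trip (the file name is illustrative, not from the PR):

import pandas as pd

df = pd.DataFrame({"description": ["a, b", "é"], "label": [1, 0]})

# Pickle preserves dtypes and raw strings, so embedded commas and
# non-ASCII characters need no quoting or encoding handling.
df.to_pickle("dataset.pkl")
restored = pd.read_pickle("dataset.pkl")
assert restored.equals(df)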
