
Add multilingual spam classifier #33

Open

yanisdb wants to merge 6 commits into master
Conversation

@yanisdb commented Dec 14, 2022

This PR adds the production code for the multilingual spam classifier. The new README explains the scripts and should probably be reviewed first to understand the rest:

  • adds the sources in ./src,
  • changes requirements.txt to reflect the new dependencies,
  • changes README.md to reflect the changes and explain how to train and use the classifier,
  • changes the Makefile to reflect the changes,
  • adds a few necessary lines to .gitignore.

This PR is based on my previous one (#32) to facilitate merging. If that PR isn't going to be merged, let me know and I will remove its changes from this one.

@slint (Member) left a comment

Many thanks for all the work @yanisdb, @lukasec, and @tecabert!

Don't be spooked: the review comments are not for you to address, but notes for us to act on before merging or to investigate in the future.

chunk = chunk[KEPT_FIELDS].dropna()

chunk_spams = chunk[chunk["spam"] == True]
chunk_spams["description"] = chunk_spams["description"].map(parse_description)
@slint (Member):

The .map() call here could be parallelized to make use of multithreading or multiprocessing.
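
A minimal sketch of one way to do that with the standard library, assuming parse_description is a picklable top-level function (pool size and chunking are left at the defaults):

from multiprocessing import Pool

# Fan parse_description out across worker processes; Pool.map preserves
# input order, so the results line up with the original rows.
with Pool() as pool:
    chunk_spams["description"] = pool.map(
        parse_description, chunk_spams["description"]
    )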


1. Go to the Zenodo Open Metadata record at <https://doi.org/10.5281/zenodo.787062> to access all dataset versions.
@slint (Member):

We can keep the reference to the dataset DOI, to make it easier to get hold of the (test) data for training.

Returns:
pd.DataFrame, pd.DataFrame: Train and test sets.
"""
dataset = dataset.sample(frac=1, random_state=SEED).reset_index(drop=True)
@slint (Member):

This needs a closer look to preserve the ham/spam distribution in the resulting train/test sets.

@yanisdb (Author):

Oh yes! This is indeed not correct: it doesn't keep the exact same distribution. We noticed it when pushing the code to our GitHub Classroom repository and forgot to push the fix to this one.

Here is the correct version of the function:

def split_train_test(
    dataset: pd.DataFrame, test_size=0.2
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split the dataset into train and test. It will keep exactly the same
    distribution of classes in both train and test.

    Args:
        dataset (pd.DataFrame): Dataset to split.
        test_size (float, optional): Percentage of the dataset to use for the
        test set.

    Returns:
        pd.DataFrame, pd.DataFrame: Train and test sets.
    """
    # Separate the two classes so each can be split with the same ratio.
    spams = dataset[dataset["label"] == 1]
    hams = dataset[dataset["label"] == 0]

    # Shuffle each class independently (deterministic via SEED).
    spams = spams.sample(frac=1, random_state=SEED)
    hams = hams.sample(frac=1, random_state=SEED)

    # Take the first (1 - test_size) fraction of each class for training.
    spams_train = spams[: int(len(spams) * (1 - test_size))]
    spams_test = spams[int(len(spams) * (1 - test_size)) :]
    hams_train = hams[: int(len(hams) * (1 - test_size))]
    hams_test = hams[int(len(hams) * (1 - test_size)) :]

    train = pd.concat([spams_train, hams_train])
    test = pd.concat([spams_test, hams_test])

    # Re-shuffle so spam and ham rows are interleaved in each split.
    train = train.sample(frac=1, random_state=SEED)
    test = test.sample(frac=1, random_state=SEED)

    return train, test
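
For reference, the same stratified split can be obtained with scikit-learn's train_test_split, assuming scikit-learn is (or becomes) a dependency:

from sklearn.model_selection import train_test_split

# Stratifying on the label column preserves the ham/spam ratio in both splits.
train, test = train_test_split(
    dataset, test_size=0.2, stratify=dataset["label"], random_state=SEED
)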

@yanisdb (Author) commented Jan 17, 2023

While pushing to our GitHub Classroom repository, we noticed some minor issues and fixed them. We forgot to push the fixes to your repository; I just pushed a new commit with them.

This commit includes 4 small changes:

  • fixes the split_train_test method to ensure that the class distribution remains the same in each split,
  • changes the torch version to 1.13.1 (the version we actually tested with),
  • adds details to the README,
  • changes the saving format from CSV to pickle (we believe saving to CSV was causing issues when reading and writing, because of commas in the strings and encoding).

We've tested this last commit and it doesn't change the results (accuracy, F1-score, ...).
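
For the CSV-to-pickle change, a minimal sketch of the round-trip (the file name is illustrative, not from the PR):

import pandas as pd

df = pd.DataFrame({"description": ["a, b", "é"], "label": [1, 0]})

# Pickle preserves dtypes and raw strings, so embedded commas and
# non-ASCII characters need no quoting or encoding handling.
df.to_pickle("dataset.pkl")
restored = pd.read_pickle("dataset.pkl")
assert restored.equals(df)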
