wangcongcong123/MLP-TC

Machine Learning Package for Text Classification

This repository contains code for machine learning algorithms applied to text classification, abbreviated MLP-TC (Machine Learning Package for Text Classification). It was born out of the poor reproducibility of ad-hoc Jupyter modelling across different datasets and algorithms. The package is designed to be especially suitable for researchers conducting comparison experiments and benchmarking analyses for text classification, and lets you explore the performance differences of different ML techniques on your specific datasets. Updated: 2019/12/09.

Highlights

  • Well logged: the whole process of training a text classification model is recorded.
  • Feed different datasets into models quickly, as long as they are formatted as required.
  • Supports single-label or multi-label (only binary relevance at this stage) classification.
  • Supports model saving, loading, training, prediction, evaluation, etc.

Dependencies

To use the package, clone the repository first and then install the following dependencies if you do not already have them:

  • scikit-learn
  • seaborn
  • pandas
  • numpy
  • matplotlib
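If any of these are missing, they can be installed with pip (assuming a standard Python environment; this README does not pin exact versions):

```shell
pip install scikit-learn seaborn pandas numpy matplotlib
```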

Steps of Usage

  1. Data preparation: format your classification dataset as shown below. For the label attribute, labels are separated by "," if a sample has multiple labels. Have a look at the dataset/tweet_sentiment_3 dataset provided in this package to see the required format.

     {"content":"this is the content of a sample in dataset","label":"label1,label2,..."}
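A dataset in this format (one JSON object per line) can be produced with a few lines of standard-library Python. This is only a sketch: the file name example_dataset.json and the sample records are hypothetical, for illustration.

```python
import json

# Two hypothetical samples in the required format; the multi-label
# sample separates its labels with "," in the "label" field.
samples = [
    {"content": "this is the content of a sample in dataset", "label": "label1,label2"},
    {"content": "a single-label sample", "label": "label1"},
]

# Write one JSON object per line.
with open("example_dataset.json", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

# Reading back: split on "," to recover the label list of a sample.
with open("example_dataset.json") as f:
    first = json.loads(f.readline())
print(first["label"].split(","))  # -> ['label1', 'label2']
```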
  2. Configuration for model training: below is an example configuration for model training (the script is in main.py). Important settings are commented.

    import pprint
    from sklearn.svm import LinearSVC
    # from sklearn.linear_model import LogisticRegression
    # from sklearn.svm import SVC
    # from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    
    print("=========configuration for model training======")
    configs = {}
    configs["relative_path"] = "./"  # the path relative to dataset
    configs["data"] = "tweet_sentiment_3/json"  # specify the path of your data that is under the dataset dir
    configs["data_mapping"] = {"content": "content",
                               "label": "label"}  # this is the mapping from the package required attribute names to your json dataset attributes
    
    configs["stemming"] = "true"  # specify whether you want to stem or not in preprocessing
    configs["tokenizer"] = "tweet"  # if it is a tweet-related dataset, it is suggested to use tweet tokenizer, or "string"
    configs["vectorizer"] = "count"  # options: count, tf-idf, embeddings/glove.twitter.27B.100d.txt.gz
    
    configs["type"] = "single"  # single or multi label classification?
    configs["model"] = LinearSVC(C=0.1)  # options: LinearSVC(C=0.1), SVC(), LogisticRegression(solver='lbfgs'), GaussianNB(), RandomForestClassifier(), etc.
  3. Train and save: below is an example of training and saving a model (the script is in main.py). Important calls are commented.

    import ml_package.model_handler as mh
    print("=========model training and save======")
    model = mh.get_model(configs)  # get the specified LinearSVC model from the model handler, with configs passed as the parameter
    model.train()  # train a model
    mh.save_model(model, configs)  # you can save a model after it is trained
  4. Eval and predict: below is an example of evaluating on the train, dev, and test sets, and of predicting a corpus without ground truth (the script is in main.py). Important calls are commented.

    print("=========evaluate on train, dev and test set======")
    model.eval("train")  # classification report for train set
    model.eval("dev")  # classification report for dev set
    model.eval("test", confusion_matrix=True)  # we can let confusion_matrix=True so as to report confusion matrix as well
    
    print("=========predict a corpus without ground truth======")
    corpus2predict = ["i love you", "i hate you"]  # two documents to classify
    data_processor = model.get_data_processor()
    to_predict = data_processor.raw2predictable(corpus2predict)
    features = to_predict["features"]
    predicted = model.predict_in_batch(features.toarray() if hasattr(features, "toarray") else features)
    print("Make predictions for:\n ", to_predict["content"])
    print("The predicted results are: ", predicted)

    The output:

     Make predictions for:
      ['i love you', 'i hate you']
      The predicted results are:  ['positive' 'negative']
    

    Predict with ground truth, e.g. on the first three examples of the test set.

     print("=========predict a corpus with ground truth======")
     train_data, _, test_data = model.get_fit_dataset()
     data = test_data
     to_predict_first = 0
     to_predict_last = 3
     
     if configs["type"] == "multi":
         mlb = model.get_multi_label_binarizer()
     
     features = data["features"][to_predict_first:to_predict_last]
     predicted = model.predict_in_batch(features.toarray() if hasattr(features, "toarray") else features)
     
     print("Make predictions for:\n ", "\n".join(data["content"][to_predict_first:to_predict_last]))
     print("Ground truth are:\n ")
     pprint.pprint(data["labels"][to_predict_first:to_predict_last])
     print("The predicted results are: ")
     pprint.pprint(mlb.inverse_transform(predicted) if configs["type"] == "multi" else predicted)

    The output:

     =========predict a corpus with ground truth======
     Make predictions for:
         Clearwire Is Sought by Sprint for Spectrum: Sprint disclosed on Thursday that it had offered to buy a stake in C... http://t.co/5Ais2S9j
         5 time @usopen champion Federer defeats 29th seed Kohlschreiber and he's through to the last 16 to play 13th seed Isner or Vesely! #USOpen
         Old radio commercials for Grateful Dead albums may just be the best thing I've discovered
         
     Ground truth are:
     ['neutral', 'neutral', 'positive']
     The predicted results are: 
     array(['neutral', 'negative', 'positive'], dtype='<U8')
    
  • After running main.py, you will find output.log and LinearSVCdc26f10760747d1c6d94b3a9679d28cf.pkl under the root of the repository. When you rerun the experiment, the model is loaded from this local file instead of being re-trained from scratch, as long as your configuration is unchanged.

Others

  • More extensions of this package will go into a tutorial (planned). Feedback and bug reports are much appreciated.
