This repository contains machine learning algorithms for text classification, abbreviated as MLP-TC (Machine Learning Package for Text Classification). It was born out of the poor reproducibility of classification experiments run in Jupyter notebooks across different datasets and algorithms. The package is designed especially for researchers conducting comparison experiments and benchmarking analyses for text classification, and lets you explore how different ML techniques perform on your specific datasets. Updated: 2019/12/09.
- Logs the whole process of training a model for text classification.
- Feeds different datasets into models quickly, as long as they are formatted as required.
- Supports single- or multi-label classification (only binary relevance at this stage).
- Supports model save, load, train, predict, eval, etc.
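Binary relevance means training one independent binary classifier per label. The sketch below is not the package's internal code, but a minimal illustration of the idea using scikit-learn's `OneVsRestClassifier` and `MultiLabelBinarizer`; the toy corpus and label names are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy corpus: each sample may carry several comma-separated labels.
docs = ["cheap flights and hotels", "election results tonight",
        "hotel prices drop before the election"]
labels = [l.split(",") for l in ["travel", "politics", "travel,politics"]]

mlb = MultiLabelBinarizer()            # labels -> binary indicator matrix
y = mlb.fit_transform(labels)
X = CountVectorizer().fit_transform(docs)

# Binary relevance: one LinearSVC per label via one-vs-rest.
clf = OneVsRestClassifier(LinearSVC(C=0.1)).fit(X, y)
predicted = mlb.inverse_transform(clf.predict(X))
print(predicted)  # one tuple of predicted labels per document
```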
To use the package, clone the repository first and then install the following dependencies if you do not have them already.
- scikit-learn
- seaborn
- pandas
- numpy
- matplotlib
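All of these are available on PyPI, so a single pip command should cover them (this 2019-era code may need version pinning on newer Python installs):

```shell
pip install scikit-learn seaborn pandas numpy matplotlib
```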
Data preparation: format your classification dataset as shown below. In the label attribute, labels are separated by "," when a sample has multiple labels. See the dataset/tweet_sentiment_3 dataset provided in this package for the required format.
```json
{"content":"this is the content of a sample in dataset","label":"label1,label2,..."}
```
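If your data lives elsewhere (e.g. in lists or a CSV), a few lines of standard-library Python can emit this one-JSON-object-per-line format. The filename and label values below are illustrative:

```python
import json

samples = [
    ("great movie, loved it", ["positive"]),
    ("terrible plot and worse acting", ["negative", "sarcastic"]),
]

# One JSON object per line; multiple labels joined with ","
with open("my_dataset.json", "w", encoding="utf-8") as f:
    for content, labels in samples:
        f.write(json.dumps({"content": content, "label": ",".join(labels)}) + "\n")
```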
Configuration for model training: below is an example configuration for model training (the script is in main.py). The important parts are commented.
```python
import pprint

from sklearn.svm import LinearSVC
# from sklearn.linear_model import LogisticRegression
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

print("=========configuration for model training======")
configs = {}
configs["relative_path"] = "./"  # the path relative to the dataset
configs["data"] = "tweet_sentiment_3/json"  # the path of your data under the dataset dir
configs["data_mapping"] = {"content": "content", "label": "label"}  # maps the package's required attribute names to your JSON dataset's attributes
configs["stemming"] = "true"  # whether to stem during preprocessing
configs["tokenizer"] = "tweet"  # for a tweet-related dataset, the tweet tokenizer is suggested; otherwise "string"
configs["vectorizer"] = "count"  # options: count, tf-idf, embeddings/glove.twitter.27B.100d.txt.gz
configs["type"] = "single"  # single- or multi-label classification
configs["model"] = LinearSVC(C=0.1)  # options: LinearSVC(C=0.1), SVC(), LogisticRegression(solver='lbfgs'), GaussianNB(), RandomForestClassifier(), etc.
```
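The count and tf-idf vectorizer options correspond to scikit-learn's two standard bag-of-words vectorizers. A quick way to see the relationship between them, independent of this package:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["i love this movie", "i hate this movie"]

counts = CountVectorizer().fit_transform(corpus)  # raw term counts
tfidf = TfidfVectorizer().fit_transform(corpus)   # counts reweighted by inverse document frequency

print(counts.shape, tfidf.shape)  # same vocabulary, hence the same shape
```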
Train and save: below is an example of training and saving a model (the script is in main.py). The important parts are commented.
```python
import ml_package.model_handler as mh

print("=========model training and save======")
model = mh.get_model(configs)  # get the specified LinearSVC model from the model handler, with configs passed as the parameter
model.train()  # train the model
mh.save_model(model, configs)  # save the model after it is trained
```
Eval and predict: below is an example of evaluating on the train, dev, and test sets, and of predicting without ground truth (the script is in main.py). The important parts are commented.
```python
print("=========evaluate on train, dev and test set======")
model.eval("train")  # classification report for the train set
model.eval("dev")  # classification report for the dev set
model.eval("test", confusion_matrix=True)  # confusion_matrix=True reports the confusion matrix as well

print("=========predict a corpus without ground truth======")
corpus2predict = ["i love you", "i hate you"]  # two documents to predict
data_processor = model.get_data_processor()
to_predict = data_processor.raw2predictable(corpus2predict)
predicted = model.predict_in_batch(
    to_predict["features"].toarray()
    if hasattr(to_predict["features"], "toarray")
    else to_predict["features"])
print("Make predictions for:\n ", to_predict["content"])
print("The predicted results are: ", predicted)
```
The output:
```
Make predictions for: ['i love you', 'i hate you']
The predicted results are: ['positive' 'negative']
```
Predict with ground truth, e.g. on the first three examples from the test set.
```python
print("=========predict a corpus with ground truth======")
train_data, _, test_data = model.get_fit_dataset()
data = test_data
to_predict_first = 0
to_predict_last = 3
if configs["type"] == "multi":
    mlb = model.get_multi_label_binarizer()
features = data["features"][to_predict_first:to_predict_last]
predicted = model.predict_in_batch(
    features.toarray() if hasattr(features, "toarray") else features)
print("Make predictions for:\n ", "\n".join(data["content"][to_predict_first:to_predict_last]))
print("Ground truth are:\n ")
pprint.pprint(data["labels"][to_predict_first:to_predict_last])
print("The predicted results are: ")
pprint.pprint(mlb.inverse_transform(predicted) if configs["type"] == "multi" else predicted)
```
The output:
```
=========predict a corpus with ground truth======
Make predictions for:
 Clearwire Is Sought by Sprint for Spectrum: Sprint disclosed on Thursday that it had offered to buy a stake in C... http://t.co/5Ais2S9j
 5 time @usopen champion Federer defeats 29th seed Kohlschreiber and he's through to the last 16 to play 13th seed Isner or Vesely! #USOpen
 Old radio commercials for Grateful Dead albums may just be the best thing I've discovered
Ground truth are:
 ['neutral', 'neutral', 'positive']
The predicted results are:
array(['neutral', 'negative', 'positive'], dtype='<U8')
```
- After running main.py, you will find `output.log` and `LinearSVCdc26f10760747d1c6d94b3a9679d28cf.pkl` under the root of the repository. When you rerun the experiment, the model is loaded from this local file instead of being re-trained from scratch, as long as your configuration is unchanged.
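The .pkl filename suggests the cache key combines the model name with a hash of the configuration, which would explain why changing any config triggers retraining. The package's actual hashing scheme is not shown here; the general idea can be sketched with `hashlib` (the key format below is an assumption):

```python
import hashlib

def cache_filename(model_name, configs):
    # Hypothetical sketch: hash the sorted config items so that the same
    # configuration always maps to the same cached model file.
    serialized = repr(sorted(configs.items())).encode("utf-8")
    return model_name + hashlib.md5(serialized).hexdigest() + ".pkl"

configs = {"vectorizer": "count", "type": "single", "stemming": "true"}
name = cache_filename("LinearSVC", configs)
print(name)  # deterministic: rerunning with unchanged configs yields the same name
```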
- More extensions of this package will be covered in a tutorial (planned). Feedback is welcome, and any error/bug reports are much appreciated.