This repository contains machine learning algorithms for text classification, abbreviated as MLP-TC (Machine Learning Package for Text Classification). It was born out of the poor reproducibility of classification experiments run in Jupyter notebooks across different datasets and algorithms. The package is designed especially for researchers conducting comparison experiments and benchmarking analyses for text classification, and lets you explore how different ML techniques perform on your specific datasets. Updated: 2019/12/09.
- Logs the whole process of training a model for text classification.
- Feeds different datasets into models quickly, as long as they are formatted as required.
- Supports single- or multi-label classification (only binary relevance at this stage).
- Supports model save, load, train, predict, eval, etc.
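Binary relevance means training one independent binary classifier per label. The sketch below is not the package's internal code, but a minimal illustration of the idea using scikit-learn's `OneVsRestClassifier` and `MultiLabelBinarizer`; the toy corpus and label names are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy corpus: each sample may carry several comma-separated labels.
docs = ["cheap flights and hotels", "election results tonight",
        "hotel prices drop before the election"]
labels = [l.split(",") for l in ["travel", "politics", "travel,politics"]]

mlb = MultiLabelBinarizer()            # labels -> binary indicator matrix
y = mlb.fit_transform(labels)
X = CountVectorizer().fit_transform(docs)

# Binary relevance: one LinearSVC per label via one-vs-rest.
clf = OneVsRestClassifier(LinearSVC(C=0.1)).fit(X, y)
predicted = mlb.inverse_transform(clf.predict(X))
print(predicted)  # one tuple of predicted labels per document
```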
To use the package, clone the repository first and then install the following dependencies if you do not have them already.
- scikit-learn
- seaborn
- pandas
- numpy
- matplotlib
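All of these are available on PyPI, so a single pip command should cover them (this 2019-era code may need version pinning on newer Python installs):

```shell
pip install scikit-learn seaborn pandas numpy matplotlib
```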
Data preparation: format your classification dataset as shown below. In the label attribute, labels are separated by "," when a sample has multiple labels. See the dataset/tweet_sentiment_3 dataset provided in this package for the required format.
```json
{"content":"this is the content of a sample in dataset","label":"label1,label2,..."}
```
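If your data lives elsewhere (e.g. in lists or a CSV), a few lines of standard-library Python can emit this one-JSON-object-per-line format. The filename and label values below are illustrative:

```python
import json

samples = [
    ("great movie, loved it", ["positive"]),
    ("terrible plot and worse acting", ["negative", "sarcastic"]),
]

# One JSON object per line; multiple labels joined with ","
with open("my_dataset.json", "w", encoding="utf-8") as f:
    for content, labels in samples:
        f.write(json.dumps({"content": content, "label": ",".join(labels)}) + "\n")
```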
Configuration for model training: below is an example configuration for model training (the script is in main.py). The important parts are commented.
```python
import pprint

from sklearn.svm import LinearSVC
# from sklearn.linear_model import LogisticRegression
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

print("=========configuration for model training======")
configs = {}
configs["relative_path"] = "./"  # the path relative to the dataset
configs["data"] = "tweet_sentiment_3/json"  # the path of your data under the dataset dir
configs["data_mapping"] = {"content": "content", "label": "label"}  # maps the package's required attribute names to your JSON dataset's attributes
configs["stemming"] = "true"  # whether to stem during preprocessing
configs["tokenizer"] = "tweet"  # for a tweet-related dataset, the tweet tokenizer is suggested; otherwise "string"
configs["vectorizer"] = "count"  # options: count, tf-idf, embeddings/glove.twitter.27B.100d.txt.gz
configs["type"] = "single"  # single- or multi-label classification
configs["model"] = LinearSVC(C=0.1)  # options: LinearSVC(C=0.1), SVC(), LogisticRegression(solver='lbfgs'), GaussianNB(), RandomForestClassifier(), etc.
```
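The count and tf-idf vectorizer options correspond to scikit-learn's two standard bag-of-words vectorizers. A quick way to see the relationship between them, independent of this package:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["i love this movie", "i hate this movie"]

counts = CountVectorizer().fit_transform(corpus)  # raw term counts
tfidf = TfidfVectorizer().fit_transform(corpus)   # counts reweighted by inverse document frequency

print(counts.shape, tfidf.shape)  # same vocabulary, hence the same shape
```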
Train and save: below is an example of training and saving a model (the script is in main.py). The important parts are commented.
```python
import ml_package.model_handler as mh

print("=========model training and save======")
model = mh.get_model(configs)  # get the specified LinearSVC model from the model handler, with configs passed as the parameter
model.train()  # train the model
mh.save_model(model, configs)  # save the model after it is trained
```
Eval and predict: below is an example of evaluating on the train, dev, and test sets, and of predicting without ground truth (the script is in main.py). The important parts are commented.
```python
print("=========evaluate on train, dev and test set======")
model.eval("train")  # classification report for the train set
model.eval("dev")  # classification report for the dev set
model.eval("test", confusion_matrix=True)  # confusion_matrix=True reports the confusion matrix as well

print("=========predict a corpus without ground truth======")
corpus2predict = ["i love you", "i hate you"]  # two documents to predict
data_processor = model.get_data_processor()
to_predict = data_processor.raw2predictable(corpus2predict)
predicted = model.predict_in_batch(
    to_predict["features"].toarray()
    if hasattr(to_predict["features"], "toarray")
    else to_predict["features"])
print("Make predictions for:\n ", to_predict["content"])
print("The predicted results are: ", predicted)
```
The output:
```
Make predictions for: ['i love you', 'i hate you']
The predicted results are: ['positive' 'negative']
```
Predict with ground truth, e.g. on the first three examples from the test set.
```python
print("=========predict a corpus with ground truth======")
train_data, _, test_data = model.get_fit_dataset()
data = test_data
to_predict_first = 0
to_predict_last = 3
if configs["type"] == "multi":
    mlb = model.get_multi_label_binarizer()
features = data["features"][to_predict_first:to_predict_last]
predicted = model.predict_in_batch(
    features.toarray() if hasattr(features, "toarray") else features)
print("Make predictions for:\n ", "\n".join(data["content"][to_predict_first:to_predict_last]))
print("Ground truth are:\n ")
pprint.pprint(data["labels"][to_predict_first:to_predict_last])
print("The predicted results are: ")
pprint.pprint(mlb.inverse_transform(predicted) if configs["type"] == "multi" else predicted)
```
The output:
```
=========predict a corpus with ground truth======
Make predictions for:
 Clearwire Is Sought by Sprint for Spectrum: Sprint disclosed on Thursday that it had offered to buy a stake in C... http://t.co/5Ais2S9j
 5 time @usopen champion Federer defeats 29th seed Kohlschreiber and he's through to the last 16 to play 13th seed Isner or Vesely! #USOpen
 Old radio commercials for Grateful Dead albums may just be the best thing I've discovered
Ground truth are:
 ['neutral', 'neutral', 'positive']
The predicted results are:
array(['neutral', 'negative', 'positive'], dtype='<U8')
```
- After running main.py, you will find `output.log` and `LinearSVCdc26f10760747d1c6d94b3a9679d28cf.pkl` under the root of the repository. When you rerun the experiment, the model is loaded from this local file instead of being re-trained from scratch, as long as your configuration is unchanged.
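The .pkl filename suggests the cache key combines the model name with a hash of the configuration, which would explain why changing any config triggers retraining. The package's actual hashing scheme is not shown here; the general idea can be sketched with `hashlib` (the key format below is an assumption):

```python
import hashlib

def cache_filename(model_name, configs):
    # Hypothetical sketch: hash the sorted config items so that the same
    # configuration always maps to the same cached model file.
    serialized = repr(sorted(configs.items())).encode("utf-8")
    return model_name + hashlib.md5(serialized).hexdigest() + ".pkl"

configs = {"vectorizer": "count", "type": "single", "stemming": "true"}
name = cache_filename("LinearSVC", configs)
print(name)  # deterministic: rerunning with unchanged configs yields the same name
```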
- More extensions of this package will be covered in a tutorial (planned). Feedback is welcome, and any error/bug reports are much appreciated.