tyiannak/pyTextClassification

pyTextClassification

Training and using classifiers for textual documents

General

pyTextClassification is a simple Python library for training and applying text classifiers. A classifier is trained on a corpus of text documents organized in folders, with each folder corresponding to a different content class.

Installation and dependencies

  • pip dependencies:
pip install numpy matplotlib scipy scikit-learn nltk

Train a classifier

To train a classifier on a dataset, use the following command:

python textClassification.py trainFromDirs -i <datasetPath> --method <svm or knn or randomforest or gradientboosting or extratrees> --methodname <modelFileName>

<datasetPath> is the path of the training corpus. This path must contain a set of folders, each one corresponding to a different content class. Each folder contains the files (no file extension is assumed) of the documents that belong to that class.

<modelFileName> is the path where the trained model is stored.
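The expected corpus layout can be checked before training with a few lines of Python. This is only a sketch of the folder convention described above, not code from the library; the helper name list_corpus is hypothetical.

```python
from pathlib import Path


def list_corpus(dataset_path):
    """Map each class folder name to the sorted list of document paths in it.

    Hypothetical helper: it only mirrors the layout described in this README
    (one sub-folder per class, one file per document).
    """
    corpus = {}
    for class_dir in sorted(Path(dataset_path).iterdir()):
        if not class_dir.is_dir():
            continue  # only folders define classes; stray files are ignored
        corpus[class_dir.name] = sorted(
            p for p in class_dir.iterdir() if p.is_file()
        )
    return corpus
```

Calling list_corpus("moviePlotsSmall/") and printing the number of documents per class is a quick way to spot an empty or misplaced class folder.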

Feature extraction uses a set of predefined (static) dictionaries, stored in the myDicts/ folder. One feature value is extracted per dictionary.
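To make the idea of one feature per dictionary concrete, here is a minimal sketch. It assumes each feature is the fraction of a document's tokens that appear in the dictionary; the README does not state the exact formula used by the library, so treat both the formula and the function name dictionary_features as illustrative assumptions.

```python
import re


def dictionary_features(text, dictionaries):
    """Return one value per dictionary, ordered by dictionary name.

    Assumption (not confirmed by the README): each value is the fraction
    of the document's tokens that occur in that dictionary.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    names = sorted(dictionaries)
    if not tokens:
        return [0.0 for _ in names]
    features = []
    for name in names:
        vocabulary = dictionaries[name]
        hits = sum(1 for token in tokens if token in vocabulary)
        features.append(hits / len(tokens))
    return features
```

With this scheme the feature vector has the same length as the number of dictionaries, regardless of document length.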

Example:

python textClassification.py trainFromDirs -i moviePlotsSmall/ --method svm --methodname svmMoviesPlot7Classes

Apply a classifier

Given a trained model and an unknown document, use the following command to classify the document:

python textClassification.py classifyFile -i <pathToUnknownDocument> --methodname <modelFileName>

This repository already contains a trained SVM model (svmMoviesPlot7Classes) that discriminates between 7 classes of movie plots. The files samples/sample_pulpFiction, samples/sample_forestgump and samples/sample_lordoftherings contain three plot examples that can be used as unknown documents for testing.

To classify these three files using svmMoviesPlot7Classes, run the following commands:

python textClassification.py classifyFile -i samples/sample_pulpFiction --methodname svmMoviesPlot7Classes

python textClassification.py classifyFile -i samples/sample_forestgump --methodname svmMoviesPlot7Classes

python textClassification.py classifyFile -i samples/sample_lordoftherings --methodname svmMoviesPlot7Classes
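The same three commands can also be issued from a short Python script. The sketch below only wraps the command line shown above with the standard subprocess module; it assumes it is run from the repository root, and the helper names build_classify_command and classify_all are hypothetical.

```python
import subprocess
import sys

SAMPLES = [
    "samples/sample_pulpFiction",
    "samples/sample_forestgump",
    "samples/sample_lordoftherings",
]


def build_classify_command(document, model_name):
    """Build the argument list for one classifyFile call."""
    return [
        sys.executable, "textClassification.py", "classifyFile",
        "-i", document, "--methodname", model_name,
    ]


def classify_all(documents, model_name):
    """Run classifyFile once per document; stop on the first failure."""
    for document in documents:
        subprocess.run(build_classify_command(document, model_name), check=True)
```

Calling classify_all(SAMPLES, "svmMoviesPlot7Classes") runs the three classifications in sequence.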

Each of the above commands returns the dominant content classes along with their normalized probabilities, sorted from highest to lowest.
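The shape of that output can be illustrated with a small sketch: raw per-class scores are scaled to sum to 1 and then sorted in descending order. This shows the general idea only and is not the library's own code; the function name rank_classes and the class labels in the usage note are made up.

```python
def rank_classes(scores):
    """Normalize non-negative class scores to sum to 1, highest first."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    normalized = {label: score / total for label, score in scores.items()}
    # Sort by probability, largest first
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)
```

For example, rank_classes({"Drama": 1.0, "Crime": 3.0}) yields Crime with 0.75 followed by Drama with 0.25.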
