pyTextClassification

Training and using classifiers for textual documents

General

pyTextClassification is a simple Python library for training and applying text classifiers. A classifier is trained on a corpus of text documents organized in folders, where each folder corresponds to a different content class.

Installation and dependencies

pip dependencies:

pip install numpy matplotlib scipy scikit-learn nltk

[pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis), used for training and evaluating classifiers

Train a classifier

To train a classifier on a dataset, use the following command:

python textClassification.py trainFromDirs -i <datasetPath> --method <svm or knn or randomforest or gradientboosting or extratrees> --methodname <modelFileName>

<datasetPath> is the path of the training corpus. This path must contain one folder per content class. Each folder contains the documents (files, no extension assumed) that belong to that class.
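
The expected corpus layout can be checked before training with a short script. The following is a minimal sketch, not code from this repository: the helper name list_corpus is hypothetical, and it only assumes the one-folder-per-class layout described above.

```python
import os

def list_corpus(dataset_path):
    """Map each class folder name to the sorted list of document paths inside it."""
    corpus = {}
    for class_name in sorted(os.listdir(dataset_path)):
        class_dir = os.path.join(dataset_path, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level of the corpus
        corpus[class_name] = sorted(
            os.path.join(class_dir, name)
            for name in os.listdir(class_dir)
            if os.path.isfile(os.path.join(class_dir, name))
        )
    return corpus
```

Calling list_corpus on the dataset path and printing the number of documents per class is a quick way to spot empty or misplaced folders.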

<modelFileName> is the path where the trained model is stored.

Feature extraction is done using a set of predefined (static) dictionaries, stored in the myDicts/ folder. For each dictionary, a separate feature value is extracted.
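
To illustrate the idea of one feature value per dictionary, here is a minimal sketch. It is an assumption, not the method used in textClassification.py: the feature here is the fraction of a document's tokens that appear in each dictionary, and the function name, tokenization and example dictionaries are all hypothetical.

```python
import re

def dictionary_features(text, dictionaries):
    """Return one value per dictionary: the fraction of tokens found in it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return [0.0] * len(dictionaries)
    return [sum(t in words for t in tokens) / len(tokens) for words in dictionaries]

# Hypothetical dictionaries, standing in for the files under myDicts/
crime_words = {"gun", "heist", "police"}
fantasy_words = {"ring", "wizard", "dragon"}

features = dictionary_features(
    "The wizard hid the ring from the dragon.", [crime_words, fantasy_words]
)
print(features)  # [0.0, 0.375]
```

With this kind of representation, the feature vector has a fixed length equal to the number of dictionaries, regardless of document length.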

Example:

python textClassification.py trainFromDirs -i moviePlotsSmall/ --method svm --methodname svmMoviesPlot7Classes
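
To compare the supported methods on the same corpus, the training command can be generated once per method. The sketch below only builds and prints the commands using the flags documented above; the helper name and the model file names (for example knnMoviesPlot) are hypothetical.

```python
import shlex

# Method names accepted by --method, as listed in the command syntax above
METHODS = ["svm", "knn", "randomforest", "gradientboosting", "extratrees"]

def build_train_command(dataset_path, method, model_name):
    """Return the trainFromDirs command as an argument list."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    return ["python", "textClassification.py", "trainFromDirs",
            "-i", dataset_path, "--method", method, "--methodname", model_name]

# One command per method; model names are made up for illustration
commands = [build_train_command("moviePlotsSmall/", m, f"{m}MoviesPlot") for m in METHODS]
for command in commands:
    print(shlex.join(command))
```

Each argument list can be passed to subprocess.run(command, check=True) from the repository root to actually run the training.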

Apply a classifier

Given a trained model and an unknown document, the following command classifies the document:

python textClassification.py classifyFile -i <pathToUnknownDocument> --methodname <modelFileName>

This repository already contains a trained SVM model (svmMoviesPlot7Classes) that discriminates between 7 classes of movie plots. The files samples/sample_pulpFiction, samples/sample_forestgump and samples/sample_lordoftherings contain three plot examples that can be used as unknown documents for testing.

To classify these three files using svmMoviesPlot7Classes, run the following commands:

python textClassification.py classifyFile -i samples/sample_pulpFiction --methodname svmMoviesPlot7Classes

python textClassification.py classifyFile -i samples/sample_forestgump --methodname svmMoviesPlot7Classes

python textClassification.py classifyFile -i samples/sample_lordoftherings --methodname svmMoviesPlot7Classes

Each of the above commands returns the dominant content classes along with their normalized probabilities, sorted from highest to lowest.
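
The shape of this output can be sketched as follows. This is an illustrative assumption: how textClassification.py computes and normalizes its probabilities is not described here, and the function name, class names and raw scores below are made up.

```python
def rank_classes(scores):
    """Normalize non-negative class scores to sum to 1 and sort high to low."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    normalized = {name: value / total for name, value in scores.items()}
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw scores for three made-up class names
ranked = rank_classes({"Drama": 2.0, "Comedy": 1.0, "Crime": 5.0})
for name, probability in ranked:
    print(f"{name}\t{probability:.3f}")
```

The first entry of the ranked list is the predicted class; the remaining entries show how close the alternatives were.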
