This repository contains files for building classifiers for file type detection. Initially the project was built for developing Deep Neural Networks in order to spit out neural network parameters for Tika to learn.
However, I have built over other classifiers for performing the same functionality. This repository now supports almost 6 classifiers and preprocessors for creating your on dataset.
I highly encourage contributors to develop their own dataset and try out the different classifiers to update the result.
We are working on scaffolding in the form of a Flask App that employs a decision tree to classify files and that uses the built model files from the library.
- Pandas
- Numpy
- Theano (for leveraging GPU and building deeper netwoks)
- Sklearn
- Decision Tree
- Neural Network
- Gaussian NB
- SVM
- Random Forest Classifier
- K-Nearest Neighbor Classifier
- Gradient Boosting Classifier
Mime type | Test Accuracy | Number of Hidden Layers |
---|---|---|
application/x-grib | 92.34% | 2 |
application/x-grib | 94.33% | 4 |
application/xhtml | 99.5% | 2 |
Mime type | Test Accuracy |
---|---|
application/x-grib | 99.76% |
Mime type | Test Accuracy |
---|---|
application/x-grib | 90.85% |
Mime type | Test Accuracy |
---|---|
application/x-grib | 90.30% |
Mime type | Test Accuracy |
---|---|
application/x-grib | 99.94% |
Mime type | Test Accuracy |
---|---|
application/x-grib | 99.54% |
Mime type | Test Accuracy |
---|---|
application/x-grib | 98.99 |
Mime type | Test Accuracy |
---|---|
application/x-grib | 99.91 |
Each classifier in the classifier package can be used to train your model. Classifiers follow a simple structure which involves three steps:
- build the model
- train the classifier
- test the classifier
The neural network is special and generates a nnmodel file that can be used with Apache Tika in order to train the NN to work on content based detection and not using Magic Numbers.
It is assumed that the input training files have the following format:
- First 256 columns correspond to the byte frequency companded using any function you like
- The last column is the output column.
Wonderng how to go ahead and create dataset like the one used? The preprocessor contains important constructs that will help generate the dataset needed.