Skip to content

This repository contains files of generating tika neural network model using Theano. Ir provides a way for you to build Deep Neural Network and increase Tika's detection capability.

License

rahulravindran0108/fileTypeDetection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

File Type Detection

This repository contains files for building classifiers for file type detection. Initially the project was built for developing Deep Neural Networks in order to spit out neural network parameters for Tika to learn.

However, I have built over other classifiers for performing the same functionality. This repository now supports almost 6 classifiers and preprocessors for creating your on dataset.

I highly encourage contributors to develop their own dataset and try out the different classifiers to update the result.

Current Status

We are working on scaffolding in the form of a Flask App that employs a decision tree to classify files and that uses the built model files from the library.

Dependencies

  • Pandas
  • Numpy
  • Theano (for leveraging GPU and building deeper netwoks)
  • Sklearn

Classifiers Supported

  • Decision Tree
  • Neural Network
  • Gaussian NB
  • SVM
  • Random Forest Classifier
  • K-Nearest Neighbor Classifier
  • Gradient Boosting Classifier

Neural Network Results

Mime type Test Accuracy Number of Hidden Layers
application/x-grib 92.34% 2
application/x-grib 94.33% 4
application/xhtml 99.5% 2

Decision Tree Results

Mime type Test Accuracy
application/x-grib 99.76%

Support Vector Machine

Mime type Test Accuracy
application/x-grib 90.85%

Gaussian Naive Bayes

Mime type Test Accuracy
application/x-grib 90.30%

Random Forest Classifier

Mime type Test Accuracy
application/x-grib 99.94%

KNN Classifier

Mime type Test Accuracy
application/x-grib 99.54%

Stochastic Gradient Descent

Mime type Test Accuracy
application/x-grib 98.99

Gradient Boosting Classifier

Mime type Test Accuracy
application/x-grib 99.91

Running the project

Each classifier in the classifier package can be used to train your model. Classifiers follow a simple structure which involves three steps:

  • build the model
  • train the classifier
  • test the classifier

The neural network is special and generates a nnmodel file that can be used with Apache Tika in order to train the NN to work on content based detection and not using Magic Numbers.

Understanding the input file

It is assumed that the input training files have the following format:

  • First 256 columns correspond to the byte frequency companded using any function you like
  • The last column is the output column.

Wonderng how to go ahead and create dataset like the one used? The preprocessor contains important constructs that will help generate the dataset needed.

About

This repository contains files of generating tika neural network model using Theano. Ir provides a way for you to build Deep Neural Network and increase Tika's detection capability.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages