yooper edited this page Oct 6, 2017 · 14 revisions

PHP Text Analysis

Want to process text using PHP? Well, you picked the right library for the task.

PHP Text Analysis provides a variety of tools for:

  • Analysis
  • Collections - data structures for managing documents during analysis
  • Collocation - helps you find terms that co-occur more often than would be expected by chance
  • Comparisons - Algorithms for comparing text and text documents
  • Console - a command line interface for performing base indexing and text mining analysis with PHP
  • Entity Extraction - helps you find entities such as people, places and dates
  • Downloaders - Downloads 3rd party data files from the web
  • Filters - A set of tools for normalizing the terms and tokens before data analysis begins
  • Phonetics - Phonetic algorithms for fixing data. Helpful when you need to perform record linkage tasks with PHP
  • Ngrams - PHP code for generating NGrams from a given set of tokens or terms
  • Stemmers - Several stemmers are available for normalizing the data sets prior to further analysis
  • Tokenizers - A common set of tokenizers is available for breaking up the corpus into tokens or sentences
  • Utilities - helper utilities for manipulating text data

Beyond Analysis

PHP Text Analysis is a lightweight Information Retrieval and NLP library built using PHP. In addition to the analysis tools, PHP Text Analysis can be used to create a search engine that supports simple and advanced query types. This is especially useful when your data models contain raw text that must be searchable.

  • Adapters
  • Engines
  • Indexes
  • Query

Suggestions on Performance

Performance tuning is always challenging. Here are a couple of suggestions for improving the speed of your code.

  • Use the whitespace tokenizer; it performs better than the general tokenizer
  • Apply the filter classes to the whole text/corpus, and avoid applyTransformation method calls within the TokenDoc class. Reserve those calls for cases where each token must be validated or transformed individually. Many of the filter classes have been rewritten to better support the whole-text approach
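
As a rough illustration of the second suggestion, the sketch below contrasts the two orderings using only PHP built-ins. It does not use the library's tokenizer or filter classes, and the function names normalizeThenTokenize and tokenizeThenNormalize are hypothetical, chosen for this example only. The point it tries to show is that normalizing the whole text first needs a fixed number of calls, while normalizing per token needs calls proportional to the number of tokens.

```php
<?php
// Plain-PHP sketch. These functions are hypothetical and are NOT part of
// PHP Text Analysis; they only illustrate the ordering of the two steps.

/** Normalize the whole text once, then split on whitespace. */
function normalizeThenTokenize(string $text): array
{
    $text = strtolower($text);                               // one call for the whole text
    $text = preg_replace('/[^\p{L}\p{N}\s]+/u', '', $text);  // strip punctuation in one pass
    return preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

/** Split on whitespace first, then normalize every token separately. */
function tokenizeThenNormalize(string $text): array
{
    $tokens = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $tokens = array_map(function (string $token): string {
        // Two function calls per token instead of two per text.
        return preg_replace('/[^\p{L}\p{N}]+/u', '', strtolower($token));
    }, $tokens);
    // Drop tokens that became empty after filtering and reindex the array.
    return array_values(array_filter($tokens, 'strlen'));
}

$text = "The quick, brown fox jumps over the lazy dog!";
$whole = normalizeThenTokenize($text);
$perToken = tokenizeThenNormalize($text);
```

Both functions return the same tokens for this input, so the whole-text version can stand in for the per-token version whenever the transformation does not depend on token boundaries. Actual speed differences depend on the text and the filters involved, so measure with your own data.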

Running the unit tests

To run all the unit tests successfully, you must have Java installed. The following command runs all the tests:

JAVA_HOME=/opt/jdk1.8.0_111/bin/java ./vendor/bin/phpunit