Natural Language Pre-processing Architecture
NLPA is a plugin designed to operate in conjuction with BDP4J (https://github.com/sing-group/bdp4j) and able to extract texts from Twitter, Youtube Comments, text files, raw email files (.eml) or WARC (Web Archive) files. The extracted text can be preprocessed into a Dataset using task (org.bdp4j.pipe.Pipe) definitions. This framework incorporates more than 30 preprocessing tasks to transform the text.
You need this user accounts:
- Babelnet / Babelfy
To properly opearte with NLPA the following configuration files (and options) should be filled:
babelnet.dir- Babelnet path. Directory where are babelnet files .
babelnet.key- Babelnet key. If you log in babelnet website, this key is in RESTful information target.
babelfy.key- The same key as Babelnet
[twitter] ConsumerKey=<YourConsumerKey> ConsumerSecret=<YourConsumerSecret> AccessToken=<YourAccessToken> AccessTokenSecret=<YourAccessTokenSecret> [youtube] APIKey=<YourYoutubeAPIKey>
We are aware that with youtube we are using unsecure API keys. Next versions will include more secure forms to access YouTube services (i.e. using OAuth schemes).
Please see files config/configurations.ini.example, config/babelnet.var.properties.example and config/babelfy.var.properties.example for further configuration details.
As stated before, to execute Text Analytics, big data should be transformed into full-featured datasets ready to be processed using AI techniques. Customizing and executing this transformation is the main functionality provided by NLPA. For this purpose, NLPA contains a list of pre-processing tasks implementations written in Java that can be used as a plugin for BDP4J framework. These task definitions implement efficient processing of the information exchanged through the different protocols and Internet services, using synsets and/or tokens. NLPA provides more than 30 pre-processing tasks applicable over corpora containing e-mails, websites, tweets, YouTube comments, SMS or plain text. The identification of these filetypes is made by using file extensions (.eml, .warc,.twtid, .ytbid, .tsms and .txt, respectively). Given the architecture of BDP4J framework, these pre-processing tasks are applied to
org.bdp4j.types.Instance objects (see Figure 1).
Figure 1. UML representation of BDP4J Instance class
As shown in Figure 1, Instance class brings together for attributes: (i) source, (ii) name, (iii) data (iv) target and (v) props. In detail, source attribute stores the information required to access the source of the information compiled (usually a
java.io.File is enough). Moreover, name stands for any form of unique identification of the instance. The processing of the instance implies a sequence of modifications of data attribute (originally with the same value as source attribute) while pre-processing tasks are applied. Logically, the modifications of data attribute could imply the loss of information from original data and the impossibility of returning to the original state of the data (that could be achieved from source attribute). Besides, target attribute is useful to address classification and prediction problems to include into the instance information about the real prediction/classification. Finally, props contains several properties that are being computed through pre-processing tasks (such as language, text-length, etc).
NLPA pre-processing tasks comprises five input types for data attribute:
org.nlpa.types.FeatureVector. Table 1 brings together the input and output data types of data attribute of all tasks included in our framework.
Table 1. Input and output data types for all tasks
As shown in Table 1, most tasks are designed to use
java.lang.StringBuffer (a mutable representation of Strings) as input data. Some interesting operations such as the processing of abbreviations, slang, interjections, stop-words, URLs, references to users, etc. are made from this representation. Next subsections describe the tasks implemented by our framework classified by the data type required for data attribute of instances being processed.
java.io.File data type
Although streaming sources could be easily used with NLPA, currently it uses locally stored files (i.e. in a local or network file system) represented through
java.io.File objects as the main of reading data instances. Descriptions of each available task for processing instances containing
java.io.File objects as data, are included below.
File2StringBufferPipeis able to transform the data attribute of an instance from
java.lang.StringBuffertype. To this end, the textual content of the target file (only for supported formats) is stored in the data attribute.
GuessDateFromFilePipeis able to find in input file (interpreting their formats) the date of the contents (when available). As result of this process, date is stored as a property of each instance (using “date” as default name).
StoreFileExtensionPipecreates a property to insert the type of content into the instance properties (using “extension” as default name). The value of the property is computed as the extension of the file referenced by the data attribute of the instance.
TargetAssigningFromPathPipecan be applied only for classification purposes. This task finds in the path of the File referenced by the data attribute of the instance if a folder matching the class is found. This task uses a transformation map to connect system folder names with instance categories.
java.io.File is processed into a
java.lang.StringBuffer (i.e. using
File2StringBufferPipe task), a wide variety of operations can be used to pre-process text. Next subsection presents the tasks that can be used for processing
StringBuffer object included as instance data.
java.io.StringBuffer data type
StringBuffer class is used to represent and modify (issued by mutable property provided by
StringBuffer Java class) textual contents. Once the text is extracted, a wide variety of tasks are provided to transform input contents dropping parts (interjections, stopwords, emoticons, etc.) and/or replacing text (slang forms, abbreviations, etc.). The descriptions of available tasks for processing this type of data (alphabetically ordered) are included below.
ComputePolarityFromStringBufferPipeadds the polarity of the text as an instance property. Possible results are in the form of 5-level Likert scale (i.e. 0 to indicate "Very Negative", 1 to "Negative", 2 to "Neutral", 3 to "Positive" and 4 to "Very Positive"). The polarity is computed by taking advantage of Stanford NLP framework.
ComputePolarityTBWSFromStringBufferPipeadds the polarity of the text computed by querying the TextBlob 1 Python library. The polarity score computed by using this library is a float within the range [-1.0, 1.0]. In order to query Python TextBlob library from Java, a REST (Representational State Transfer) web service (TBWS, TextBlob Web Service) has been developed and can be easily launched as a docker container using the scripts provided with the tool.
ContractionsFromStringBufferPipereplaces contractions in the original text using dictionaries (JSON files). To made the replacements, some language-specific dictionary files are included in NLPA. In order to properly select the right contractions dictionary, a property storing the language of text should be previously computed (see
FindEmojiInStringBufferPipefinds and removes (if desired) emojis from text and adds them as a property of the instance. By default, the property name is “emoji”. The process is made by taking advantage of emoji-java library 2.
FindEmoticonInStringBufferPipefinds and removes (if needed) emoticons from text and creating a new property (named “emoticon” by default) for the instance. Emoticons are found through using a complex regular expression over the whole text. As the main limitations,
FindHashtagInStringBufferPipe(see next paragraph) cannot be executed after this task.
FindHashtagInStringBufferPipesearches for hashtags in text and adds them as a property (“hashtag” by default) of the instance. Additionally, the task can be configured to remove the identified hashtags from the original text. Internally, this task uses a regular expression to find hashtags in text.
FindUrlInStringBufferPipefinds URLs from text adding them as a property (“URLs” by default) of the instance. Additionally, removing URLs from the original text is also possible. This task is made by using regular expressions.
FindUserNameInStringBufferPipe(see next paragraph) task cannot be executed after
FindUserNameInStringBufferPipetakes advantage of regular expressions to search and optionally remove tokens in the form “@” from text. Also, it adds the identified user references as a property of the instance (“@userName” by default).
GuessLanguageFromStringBufferPipefinds out the language of the text included in the instance. It adds “language” and “language-reliability” properties (by default) to the instance to store the result of the process. The data of the instance should contain text without HTML tags. To detect both the language and the probability of a successful identification, we take advantage of language-detector library for Java 3 , able to distinguish up to 71 languages.
InterjectionFromStringBufferPipeallows to identify and optionally drop interjections from text using dictionaries (JSON files). It adds them into “interjection” property of the instance. As long as interjections are language-dependant, the language of the instance should be computed before executing it.
MeasureLengthFromStringBufferPipeadds the “length” property (by default) computed by measuring the length of the text included in the data of the instance.
NERFromStringBufferPipeimplements NER adding all identified entities into instance properties and optionally deletes them from the input text. By default, date (property “NERDATE”), money (“NERMONEY”), number (“NERNUMBER”), address (“NERADDRESS”) and location (“NERLOCATION”) are the entities that can be recognized. NER is implemented through Stanford NLP framework.
SlangFromStringBufferPipedetects slang terms in the input text and replaces by its formal term using dictionaries (JSON files). In order to select the appropriate dictionary, the language of the text should be previously computed.
StopWordFromStringBufferPipedrops stopwords from text included in the data attribute of an instance. The language of text should be previously detected to select the appropriate stopword dictionary. Besides,
AbbreviationFromStringBufferPipetask could not be executed after this one.
StringBufferToLowerCasePipetransforms the textual content, included in the data attribute of an instance, to lowercase.
Additionally, an instance containing a
StringBuffercan be transformed into a
TokenSequence. These functionalities are implemented by
StringBuffer2TokenSequencePiperespectively. The former takes advantage of Babelfy API 4 to recognize synsets of each word included in text. The second one implements a tokenizing process using a set of characters as word delimiters.
TeeCSVFromStringBufferPipestores instances in a Comma Separated value(s) (CSV) file containing all computed properties together with the text of each them.
SynsetSequence object brings together a sequence of synsets that are found in the text of an instance. To handle instances with this data type as input, NLPA includes the task
SynsetSequence2FeatureVectorPipe. It is able to transform a
SynsetSequence into a
FeatureVector which compiles together duplicated features and assigns a score for each feature according to a grouping scheme. The grouping scheme can be one of the following ones:
SequenceGroupingStrategy.COUNTthat indicates the number of times that a synset is observed in the sequence
SequenceGroupingStrategy.BOOLEANthat assigns 1 when the synset is included in the content or 0 otherwise
SequenceGroupingStrategy.FREQUENCYthat indicates the frequency of the synset in the text (number of times that the synset is observed divided by the whole amount of synsets)
TokenSequence contains the sequence of tokens that are found in the text of an instance. To handle instances with this data type as input, NLPA includes the tasks described below.
TokenSequence2FeatureVectorPipe, similarly to
SynsetSequence2FeatureVectorPipe, transforms the list of tokens included in the text of the data instance into a
FeatureVectoraccording to a selected grouping scheme (
TokenSequencePorterStemmerPipeapplies the Porter stemmer algorithm to the
TokenSequenceincluded in an instance. This scheme allows to reduce inflected (or sometimes derived) words to their stem form, using a set of language-dependant rules. As long as the rules are defined by language, it should be previously computed (see
GuessLanguageFromStringBufferPipein subsection 3.2).
TokenSequenceStemIrregularPipeapplies irregular stemming (through language- dependant dictionaries) to tokens with the same purpose of the previous one. The irregular stemming task, if applied, should be executed before
TokenSequencePorterStemmerPipeand the language of the text should be computed before its usage.
FeatureVector compiles together a set of features of text properties (synset-based or token-based) identified in the text of an instance and their values. To handle input instances with this data type, NLPA includes
TeeDatasetFromFeatureVectorPipe tasks. They allow to generate a dataset into disk (CSV format) or in memory (
org.bdp4j.types.Dataset), respectively, for their further usage. These datasets can be easily used to execute experiments in Weka Machine Learning Software 5 through the functionalities provided by BDP4J framework.
Next section shows the usage of NLPA through a case study to show the creation and exploitation of a pipeline containing some of the previous described tasks.
NLPA is a plugin for BDP4J that implements a set of natural language processing task definitions. BDP4J is a pipelining Java framework able to combine and orchestrate the execution of pre-processing tasks in sequence or in parallel. The orchestration can be defined in Java source or using XML files. Additionally, BDP4J adds a wide variety of constraint checks that prevents development errors (dependencies and input/output types for tasks). BDP4J also supports the instance invalidation which is required when a NLPA task fails. Moreover, resuming the execution of pipelining processing after a hardware software failure. Finally, BDP4J supports the debugging mode that allows avoid the execution of some tasks by restoring the results achieved by them in a previous execution. This functionality was particularly useful during the development of NLPA plugin.
Figure 2 includes a class diagram and a snippet of source code showing the interaction of BDP4J and NLPA projects. Particularly, Figure 2a specifies some architecture details to facilitate the comprehension of the inner operation of both projects. As we can see from it, NLPA tasks (three of them have been included as example) are created as a subclass of
Figure 2. Interaction of BDP4J and NLPA projects
As we can see in Figure 2b, the pipeline orchestration comprises some tasks executed in sequence. In order to check whether the pipeline has been correctly defined, a dependency check is executed. Additionally, NLPA incorporates mechanisms to automatically load instances from files. And finally, instances are processed.
This project has been conceived and developed by SING research group. The development team is composed by:
- José R. Méndez: Subject Matter Expert
- María Novo: Software Architecture Design and Team Leader
- Rosalía Laza: Developer
- Reyes Pavón: Developer
Copyright (C) 2018 Sing Group (University of Vigo)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.