# Information Extraction from Text

Information extraction is an important step in data science, yet it is often overlooked in favour of tuning machine learning algorithms. <br/>
It remains an essential part of the data analytics pipeline and should not be forgotten. <br/>
In this tutorial, we introduce basic methods for extracting information from text data. <br/>

Information can be extracted from text in multiple ways. <br/>
One of the easiest and most common ways is to take a dictionary of concepts and check whether any of its terms are mentioned in the text. <br/>
In this walkthrough, we guide you through 3 common variants of dictionary-based information extraction and what we have learnt from each. <br/>
A) dictionary search <br/>
B) regular expressions <br/>
C) flashtext <br/>
<br/>


## Dictionary Loading
In order to conduct a dictionary search, we first need a dictionary of terms. <br/>
We will be using a small dictionary of package names from https://pythonwheels.com/ and <br/>
a larger dictionary of package names from https://pypi.org/simple/. <br/>


In [1]:
# let's load up a dictionary of terms
# bs4 is BeautifulSoup, a package for parsing html text
import bs4

popularPythonPackages = []
htmlSnippet = ""
# html snippet taken from https://pythonwheels.com/
with open("popularPythonWheels.txt", "r") as fh:
    htmlSnippet = "".join(fh.readlines())
# process and extract the package names from the html text using BeautifulSoup
soup = bs4.BeautifulSoup(htmlSnippet, "lxml")
spanElems = soup.find_all("span", attrs={"ng-bind":"package.name"})

popularPythonPackages += [spanElem.text for spanElem in spanElems]

In [2]:
popularPythonPackages

['simplejson',
 'setuptools',
 'six',
 'requests',
 'pip',
 'python-dateutil',
 'virtualenv',
 'boto',
 'pyasn1',
 'pbr',
 'docutils',
 'pytz',
 'certifi',
 'botocore',
 'rsa',
 'PyYAML',
 'jmespath',
 'awscli',
 'colorama',
 'Jinja2',
 'wincertstore',
 'nose',
 'MarkupSafe',
 'lxml',
 'cffi',
 'selenium',
 'paramiko',
 'pycrypto',
 'argparse',
 'pycparser',
 'coverage',
 'Django',
 'ecdsa',
 'mock',
 'psycopg2',
 'pika',
 'wheel',
 'httplib2',
 'pep8',
 'Pygments',
 'enum34',
 'redis',
 'SQLAlchemy',
 'futures',
 'Werkzeug',
 'psutil',
 'pymongo',
 'cryptography',
 'Pillow',
 'Flask',
 'supervisor',
 'greenlet',
 'pyOpenSSL',
 'Babel',
 'bcdoc',
 'numpy',
 'py',
 'meld3',
 'MySQL-python',
 'ipaddress',
 'kombu',
 'docopt',
 'zc.buildout',
 'urllib3',
 'Paste',
 'pyparsing',
 'pyflakes',
 'Sphinx',
 'tornado',
 'carbon',
 'jsonschema',
 'zope.interface',
 'anyjson',
 'itsdangerous',
 'decorator',
 'beautifulsoup4',
 'idna',
 'PasteDeploy',
 'Mako',
 'ssl',
 'flake8',
 'mccabe',
 'amqp'

In [3]:
# load up a much larger dictionary of terms
allPipPackages = []
pipPackagesSimpleText = ""
# list taken from https://pypi.org/simple/
with open("pypi_simple.txt", "r") as fh:
    pipPackagesSimpleText = fh.read()
# to control the number of packages, we keep only package names longer than 3 characters
allPipPackages = [packageName for packageName in pipPackagesSimpleText.split(sep=" ") if len(packageName) > 3]



In [4]:
allPipPackages

['0-._.-._.-._.-._.-._.-._.-0',
 '0.0.1',
 '00SMALINUX',
 '01changer',
 '02exercicio',
 '0805nexter',
 '0-core-client',
 '0FELA',
 '0-orchestrator',
 '0wdg9nbmpm',
 '0x10c-asm',
 '100bot',
 '1020-nester',
 '10daysweb',
 '115wangpan',
 '12factor-vault',
 '131228_pytest_1',
 '1332132132132132132132132132131',
 '1337',
 '153957-theme',
 '15five-django-ajax-selects',
 '17MonIP',
 '18-e',
 '199Fix',
 '1and1',
 '1c-utilites',
 '1dyfolabs-test-script',
 '1nester',
 '1pass',
 '1to001',
 '2013007_pyh',
 '2048',
 '2112',
 '2311321-das1d31-2131313213213',
 '23andme-to-vcf',
 '24to25',
 '2969nester',
 '2C.py',
 '2factorcli',
 '2gis',
 '2jjtt6cwa6',
 '2lazy2rest',
 '2mp3',
 '2mp4',
 '2or3',
 '3049bab9',
 '311devs_peewee',
 '32ghghfghfghfhghghghgfh',
 '36ban_commons',
 '36-chambers',
 '3color-Press',
 '3debt',
 '3Dfunctiongrapher',
 '3d-wallet-generator',
 '3lwg',
 '3to2',
 '3to2_py3k',
 '3xsd',
 '40wt-common-tasks',
 '42cc-pystyle',
 '42qucc',
 '440-create-user',
 '4cdl',
 '4chan',
 '4chandownloade

In [5]:
len(popularPythonPackages), len(allPipPackages)

(360, 136993)

## Load a sample JD

In [6]:
sampleJD = ""
with open("sampleJD.txt", "r") as fh:
    sampleJD = "".join(fh.readlines())

In [7]:
print(sampleJD)

 What you will do: Interpret data, analyze results using statistical techniques and provide ongoing reports Develop and maintain data warehouse from multiple data sources Managing and designing the reporting environment, including data sources, security, and metadata Implement solutions and processes for management and governance across data quality metrics, metadata, lineage, data access rights and business definitions Assessing tests and implementing new or upgraded software and assisting with strategic decisions on process improvements Establish effective and adaptable stakeholder working group Work with management to prioritize business and information needs Work with development team to enhance capabilities of internal analytics Use your creativity and intuition to help solve challenging problems faced by the Business users Provide support to users and assist business unit controllers in translating data requirements into deliverables Present results of analysis to team and other 

## A) Dictionary Search 
The main advantage of this approach is that it is the easiest and fastest to implement. <br/>


In [8]:
for aPythonMod in popularPythonPackages:
    if aPythonMod in sampleJD:
        print(aPythonMod)

pandas
scikit-learn
sh


However, this will not always work, as the search is case sensitive.

In [9]:
"python" in sampleJD

False

In [10]:
"Python" in sampleJD

True

In [11]:
# A quick hack:
"python" in sampleJD.lower()

True

Additionally, it cannot tell whether the term is part of another word. <br/>
This causes many false hits, especially for shorter terms. <br/>

In [12]:
"formation" in sampleJD
# False hit due to the word "Information"

True
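Both issues can be mitigated without regular expressions. The sketch below is one possible approach, not part of the original walkthrough: it lowercases the text, splits it into tokens, and checks each dictionary term against the resulting set. The names `TOKEN_RE` and `token_dictionary_search` are illustrative, and the token pattern is an assumption that keeps `.`, `-` and `_` inside tokens so that names like `scikit-learn` survive.

```python
import re

# Assumed token pattern: starts with a letter or digit, may contain . - _
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.\-]*")


def token_dictionary_search(text, terms):
    """Return the terms that appear in the text as whole tokens, ignoring case."""
    # Strip trailing punctuation so "required." becomes "required"
    tokens = {tok.lower().rstrip(".-") for tok in TOKEN_RE.findall(text)}
    return [term for term in terms if term.lower() in tokens]


text = "Information systems experience; Python, pandas and Scikit-Learn required."
terms = ["formation", "python", "pandas", "scikit-learn", "sh"]
hits = token_dictionary_search(text, terms)
print(hits)
```

Because the lookup is per token, "formation" no longer matches inside "Information". Multi-word terms such as "scikit learn" are not handled by this token-level lookup.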

The plain substring search also does not scale well to larger dictionaries. <br/>
Note: <i>%%timeit</i> is a built-in IPython cell magic that times the execution of a cell.

In [13]:
%%timeit -n10

# we get the average time it takes to run this using the smaller dictionary
spottedMods = []
for aPythonMod in popularPythonPackages:
    if aPythonMod in sampleJD:
        spottedMods += [aPythonMod]

637 µs ± 34.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [14]:
%%timeit -n10

# we try this using the larger dictionary
spottedMods = []
for aPythonMod in allPipPackages:
    if aPythonMod in sampleJD:
        spottedMods += [aPythonMod]


166 ms ± 5.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


This may not seem like a lot, until you consider how many documents you will have to run this on.

In [15]:
# Assuming roughly 2 ms per document, running this on 9,000,000 documents will take (in hours):
0.002 * 9000000 / 3600 

5.0
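The same back-of-envelope estimate can be applied to the two timings measured above. This is only a rough sketch: it assumes every document takes as long as the sample JD, and `projected_hours` is an illustrative helper name.

```python
def projected_hours(seconds_per_document, n_documents):
    """Total run time in hours if every document takes the same time."""
    return seconds_per_document * n_documents / 3600


N_DOCUMENTS = 9_000_000

# Timings taken from the %%timeit results above
small_dictionary_hours = projected_hours(637e-6, N_DOCUMENTS)  # 637 µs per document
large_dictionary_hours = projected_hours(166e-3, N_DOCUMENTS)  # 166 ms per document

print(round(small_dictionary_hours, 1), round(large_dictionary_hours, 1))
```

Under these assumptions the smaller dictionary needs roughly 1.6 hours, while the larger one needs about 415 hours, which is more than 17 days.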

### Dictionary Search Summary: 
Main Advantage: 
* Fast and easy <b>to code</b>.

Main Disadvantages:
* Case sensitive (can be mitigated)
* No word boundary
* Does not scale well


## Regular Expressions
Regular expressions are a powerful and flexible way to extract information when the text has a limited number of variations (and you know what they are!).<br/>
However, the syntax requires significant study: https://docs.python.org/3/library/re.html 


In [16]:
# import the regular expressions package re
import re
# a raw string (r"...") keeps backslashes such as \. and \- intact for the regex engine
termRE = re.compile(r"(scikits.learn|(scikit|sk)+[\.\- ]*learn)", re.IGNORECASE)

In [17]:
termRE.findall("Knowledge of Python, pandas and scikit learn necessary")

[('scikit learn', 'scikit')]

In [18]:
termRE.findall("Knowledge of Python, pandas and scikit-learn necessary")

[('scikit-learn', 'scikit')]

In [19]:
termRE.findall("Knowledge of Python, pandas and sklearn necessary")

[('sklearn', 'sk')]
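`findall` returns tuples here because the pattern contains two capture groups, so each result lists the outer match followed by the inner `(scikit|sk)` group. If only the matched text is needed, one option is a non-capturing group combined with `finditer`. The pattern below is a slightly tightened variant written for illustration (the dot is escaped and the `+` on the group is dropped), not the pattern used in the cells above.

```python
import re

# Illustrative variant: non-capturing group, escaped dot
term_re = re.compile(r"scikits\.learn|(?:scikit|sk)[.\- ]*learn", re.IGNORECASE)

texts = [
    "Knowledge of Python, pandas and scikit learn necessary",
    "Knowledge of Python, pandas and scikit-learn necessary",
    "Knowledge of Python, pandas and sklearn necessary",
]

# m.group(0) is always the full matched text, regardless of groups
matches = [m.group(0) for text in texts for m in term_re.finditer(text)]
print(matches)
```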

Although regular expressions are a powerful tool, they have several drawbacks. <br/>
One of the main drawbacks stems from the pattern matching behaviour itself. <br/>
The more complex your pattern is, the higher the chance that matching will take longer. <br/>
We say "higher chance" because the cost of a pattern match depends heavily on the text being matched. <br/>

In [20]:
idealText = "I have learnt about scikit learn in python"
# a similar sentence, but with 5184 spaces between "scikit" and "learn"
sleepyText = "I have learn about scikit" + " " * 5184 + "learn in python"


In [21]:
len(idealText), len(sleepyText)

(42, 5224)

In [22]:
%%timeit -n100
termRE.findall(idealText)

3.27 µs ± 58.8 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [23]:
%%timeit -n100
termRE.findall(sleepyText)

98.5 µs ± 8.03 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Next, we time regular expressions with word boundaries over the smaller dictionary.

In [24]:
kwREDict = {}
for aPythonMod in popularPythonPackages:
    # re.escape keeps characters such as "." in package names (e.g. zc.buildout) literal
    kwREDict[aPythonMod] = re.compile(r"\b{}\b".format(re.escape(aPythonMod)))

In [25]:
%%timeit -n100
spottedMods = []
for aPythonMod in popularPythonPackages:
    if len(kwREDict[aPythonMod].findall(sampleJD)) != 0 :
        spottedMods += [aPythonMod]

12.6 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


['pandas', 'scikit-learn']
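The loop above scans the text once per term. A common alternative, sketched here under our own assumptions rather than measured in this notebook, is to join all escaped terms into a single alternation so the text is scanned once per document. The helper name `build_dictionary_regex` and the sample terms are illustrative; lookarounds are used instead of `\b` so that a term counts only when it is not glued to another word character.

```python
import re


def build_dictionary_regex(terms):
    """Compile one pattern that matches any of the terms as a whole word."""
    # Longest first, so longer terms win over shorter ones sharing a prefix
    escaped = sorted((re.escape(term) for term in terms), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)")


sample_terms = ["pandas", "scikit-learn", "six", "zc.buildout", "py"]
dictionary_re = build_dictionary_regex(sample_terms)

sample_text = "Experience with pandas, scikit-learn and zc.buildout; pySpark is a plus."
found = dictionary_re.findall(sample_text)
print(found)
```

Note that "py" is not reported for "pySpark", and the escaped dot means "zc.buildout" does not match arbitrary characters in place of the dot. With a very large dictionary the alternation itself can become slow, which is part of the motivation for the trie-based approach later on.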

Some regular expressions are also very hard to maintain because of their syntax. <br/>
Imagine debugging the following regular expression, <br/>
not to mention crafting a separate regular expression for each concept we want to extract. <br/>

In [28]:
emailExtractionRE = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")

In [29]:
emailExtractionRE.findall("You can find me at weixuan@jobtech.sg")

['weixuan@jobtech.sg']
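One way to ease the maintenance burden is the `re.VERBOSE` flag, which lets a pattern be spread over several lines with comments. The sketch below lays out the same email pattern piece by piece; the variable name `verbose_email_re` is ours.

```python
import re

verbose_email_re = re.compile(
    r"""
    (                       # capture the whole address
        [a-zA-Z0-9_.+-]+    # local part, before the at sign
        @
        [a-zA-Z0-9-]+       # first label of the domain
        \.                  # literal dot
        [a-zA-Z0-9-.]+      # rest of the domain
        $                   # the address must end the string
    )
    """,
    re.VERBOSE,
)

emails = verbose_email_re.findall("You can find me at weixuan@jobtech.sg")
print(emails)
```

Written this way, it becomes easier to see that the trailing `$` restricts matches to addresses at the very end of the text.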

### Regular Expression Summary:

Main Advantage:
* Powerful. Pattern matching gives a lot of flexibility

Main Disadvantages:
* Non-scalable
 * Code maintainability
 * Pattern match complexity
 
 

## Flashtext
Flashtext is a fairly new Python package that leverages a trie (https://en.wikipedia.org/wiki/Trie)
to improve search timings. <br/>
<i>Computer Science people may recall learning about tries from <b>Data Structures and Algorithms</b></i>

This strategy scales very well for working with larger dictionaries and extraction across larger datasets. <br/>
A very good explanation and analysis can be found in the author's article:
https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f
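To give a feel for why a trie helps, here is a minimal sketch of trie-based term extraction. It is a simplified illustration written for this walkthrough, not flashtext's actual implementation; the names `build_trie`, `extract_terms` and `END` are our own.

```python
END = "_end_"  # multi-character key, so it cannot clash with a single character


def is_word_char(ch):
    return ch.isalnum() or ch == "_"


def build_trie(terms):
    """Store each lowercased term character by character in nested dicts."""
    root = {}
    for term in terms:
        node = root
        for ch in term.lower():
            node = node.setdefault(ch, {})
        node[END] = term  # marks the end of a complete term
    return root


def extract_terms(text, trie):
    """Walk the text once, reporting the longest whole-word term at each start."""
    lowered = text.lower()
    found, i, n = [], 0, len(lowered)
    while i < n:
        # Only start matching at the beginning of a word
        if i > 0 and is_word_char(lowered[i - 1]):
            i += 1
            continue
        node, j, longest = trie, i, None
        while j < n and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            # Accept a term only if it ends at a word boundary
            if END in node and (j == n or not is_word_char(lowered[j])):
                longest = (node[END], j)
        if longest is None:
            i += 1
        else:
            found.append(longest[0])
            i = longest[1]
    return found


toy_trie = build_trie(["pandas", "scikit-learn", "sh", "py"])
toy_hits = extract_terms("Knowledge of Python, Pandas and scikit-learn; bash shell", toy_trie)
print(toy_hits)
```

The work done per character of text depends on how deep the walk goes into the trie, not on how many terms the dictionary holds, which is why adding more terms barely affects extraction time.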

In [30]:
import flashtext

In [31]:
# prepare the flashtext KeywordProcessor by loading in the dictionary of terms.
kwProcessor = flashtext.KeywordProcessor()
for aPythonMod in popularPythonPackages:
    kwProcessor.add_keyword(aPythonMod)

# we do the same for the full listing of python modules too. We will need it later
pypiKWProcessor = flashtext.KeywordProcessor()
for aPythonMod in allPipPackages:
    pypiKWProcessor.add_keyword(aPythonMod)

Let's do a quick comparison of how well this scales on larger term dictionaries.

In [32]:
len(kwProcessor), len(pypiKWProcessor)

(360, 136993)

In [33]:
%%timeit -n100
spottedMods = kwProcessor.extract_keywords(sampleJD)

475 µs ± 19.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [34]:
%%timeit -n100
spottedMods = pypiKWProcessor.extract_keywords(sampleJD)

653 µs ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Looks like it scales very well! There must be some drawback, right?

In [35]:
# before we do that, let's just add in a variant into the keyword processor.
kwProcessor.add_keyword("scikit learn")

True

In [36]:
kwProcessor.extract_keywords("I have learnt about scikit learn from data science 101")

['scikit learn']

In [37]:
kwProcessor.extract_keywords("I have learnt about scikit  learn from data science 101")

[]

Luckily, we have a quick hack solution for this

In [38]:
# we shall use regular expressions to preprocess the data.
whitespaceEliminationRe = re.compile(r"\s+")


In [39]:
rawText = "I have learnt about scikit             learn from data science 101"

# Every run of whitespace is replaced by a single space
cleanedText = whitespaceEliminationRe.sub(" ", rawText)

In [40]:
cleanedText

'I have learnt about scikit learn from data science 101'

In [41]:
kwProcessor.extract_keywords(cleanedText)

['scikit learn']
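The preprocessing step can be wrapped so that it is always applied before extraction. This is a small sketch with illustrative names (`normalise_whitespace`, `extract_normalised`); `toy_extract` is a stand-in extractor used only to keep the example self-contained.

```python
import re

WHITESPACE_RE = re.compile(r"\s+")


def normalise_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return WHITESPACE_RE.sub(" ", text).strip()


def extract_normalised(text, extract):
    """Clean the text first, then hand it to any extractor callable."""
    return extract(normalise_whitespace(text))


def toy_extract(text):
    # Stand-in for a real keyword extractor
    return ["scikit learn"] if "scikit learn" in text else []


raw = "I have learnt about scikit \t\n  learn from data science 101"
result = extract_normalised(raw, toy_extract)
print(result)
```

In this notebook the same wrapper could be called as `extract_normalised(rawText, kwProcessor.extract_keywords)`.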

### Flashtext Summary

Main Advantage:
* Scales well

Main Disadvantage:
* Not as flexible as regular expressions

## FlashText vs Regex Performance

Source: Vikash Singh - https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f


In [42]:
from IPython.display import Image
Image(url= "https://cdn-images-1.medium.com/max/800/1*_wjTfRdsnLKGnbr4VJ4Xqw.png")

# Other Notes and Links

Information extraction can take other forms, such as Named Entity Recognition: https://nlpforhackers.io/named-entity-extraction/ <br/>
However, Named Entity Chunkers require a model to be trained to recognize the concepts in text, <br/>
and annotating a body of documents for training is a very time-consuming and tedious process. <br/>

With the dictionary-based matching introduced here, you will notice that there are many falsely extracted terms. <br/>
Some strategies that check the context of the extracted term work pretty well. Do approach us afterwards to discuss if you face similar challenges! <br/>
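As a rough illustration of a context check, the sketch below keeps a hit only when a cue word appears within a small window around it. Everything here is an assumption made for the example: the cue words, the 40-character window, and the helper names are hypothetical and would need tuning on real data.

```python
import re

# Hypothetical cue words that suggest the text is talking about Python packages
CUE_WORDS = frozenset({"python", "package", "library", "module", "pip"})


def has_supporting_context(text, start, end, window=40, cue_words=CUE_WORDS):
    """Check whether any cue word occurs within `window` characters of the hit."""
    before = text[max(0, start - window):start]
    after = text[end:end + window]
    tokens = set(re.findall(r"[a-z0-9]+", (before + " " + after).lower()))
    return not tokens.isdisjoint(cue_words)


def contextual_hits(text, term):
    """Return whole-word occurrences of the term that have supporting context."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [
        m.group(0)
        for m in pattern.finditer(text)
        if has_supporting_context(text, m.start(), m.end())
    ]


relevant = "Use the six package to support Python 2 and 3."
irrelevant = "We have six open positions in the sales team."

kept = contextual_hits(relevant, "six")
dropped = contextual_hits(irrelevant, "six")
print(kept, dropped)
```

A check like this is most useful for short or ambiguous terms, where a dictionary hit alone says little about whether the package is really meant.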



