# Extracting Data

In this part we cover various sources of text data and ways to extract it, so that it can serve as information or insight for a business.

The following are covered in this chapter:
- 1. Collecting Data from API
- 2. Collecting Data from PDFs
- 3. Collecting Data from Word files
- 4. Collecting Data from JSON
- 5. Collecting Data from HTML
- 6. Parsing text using Regular expressions

# 1. Collecting Data from API

Many free APIs let us collect data and use it to solve problems. Here we focus on the Twitter API, since it exposes a large volume of valuable data.

Once this data is collected and analyzed, it can give a business a great deal of insight about its company, products, services, and so on.

The following credentials are needed to access the Twitter API:
- consumer key: Key associated with the application
- consumer secret: Password used to authenticate the application with the authentication server
- access token: Key given to the client after successful authentication of above keys
- access token secret: Password for the access key

In [2]:
# install tweepy
!pip3 install tweepy



In [3]:
# import required libraries
import numpy as np
import tweepy
import json
import pandas as pd
from tweepy import OAuthHandler

In [14]:
# credentials (placeholders: use the keys of your own Twitter application and never publish them)
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

In [15]:
# calling API
auth = OAuthHandler(consumer_key, consumer_secret)

In [16]:
auth.set_access_token(access_token, access_token_secret)

In [17]:
api = tweepy.API(auth)

In [18]:
# provide the query to search for (e.g. tweets about the mobile phone ABC)
query = "ABC"

In [19]:
# fetching tweets (in Tweepy 4.0 and later this method is named api.search_tweets)
tweets = api.search(query, count=10, lang='en', exclude='retweets', tweet_mode='extended')

The query above pulls up to 10 tweets that match the search term ABC. Only English tweets are returned because the language is set to 'en', and retweets are excluded.

# 2. Collecting Data from PDFs

Text data is often stored in PDF files. We need to extract the text from these files and store it for further analysis. In this part we use the PyPDF2 library to extract text from a PDF.

In [71]:
# install required library
!pip3 install PyPDF2 --user



In [24]:
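# import required libraries
# note: PdfFileReader, numPages, getPage and extractText are the PyPDF2 1.x/2.x names;
# PyPDF2 3.0 replaced them with PdfReader, len(reader.pages), reader.pages[i] and extract_text()
import PyPDF2
from PyPDF2 import PdfFileReader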

For this exercise, I will use a personal PDF file and extract text data from it.

In [32]:
# create a pdf file object
pdf = open('./data/pdf/Text Classification on Social Media Bachelor Thesis.pdf', 'rb')

In [33]:
# create a pdf reader object
pdf_reader = PdfFileReader(pdf)

In [34]:
# check number of pages in the pdf file
pdf_reader.numPages

54

In [35]:
# create a page object
page = pdf_reader.getPage(0)

In [36]:
# finally extract text from the specified page
page.extractText()

'U\nNIVERSITY\nC\nOLLEGEOF\nN\nORTHERN\nB\nACHELOR\nT\nHESIS\nTextonSocialMedia\nAuthor:\nDacianTamasan\nSupervisor:\nHenrikKristianUlrikllgaard\nAthesissubmittedinoftherequirements\nforthedegreeofWebDevelopment\nin\nJanuary20,2019\n'

In [37]:
# close the pdf file
pdf.close()

# 3. Collecting Data from Word files

In this part we look at how to extract data from Word files in Python, using the python-docx library (imported as docx).

Again, we will use a personal .docx file to extract data from.

In [70]:
# install required library
!pip3 install python-docx --user



In [43]:
# import required library
from docx import Document

In [48]:
# create a word file object
file = open('./data/docx/TrustCo System - Analyzing Reviews Report.docx', 'rb')

In [49]:
# create a word reader object
document = Document(file)

In [50]:
# document.paragraphs holds every paragraph of the Word document.
# Start from an empty string, loop over the paragraphs and append the text of each one
# (the paragraphs are joined without a separator).
doc = ""
for p in document.paragraphs:
    doc += p.text

In [51]:
doc

"University College ofNorthern DenmarkTrustCo System Analyzing Reviews from Amazon\t\t\t\t\t\t\t\t                                                        Dacian Tamasan, Curcuta Lucian \t\tdmai0914, Computer Science\tTable of contentsIntroduction.............................................................................................................................................4Problem statement ..............................................................................................................................4Project Learning Goals................................................................................................................................5Crawling Data from Amazon................................................................................................................................5Domain Model.............................................................................................................................................6Relation

# 4. Collecting Data from JSON

The simplest way to read JSON data from a web service in Python is to combine the third-party requests library with the json module from the standard library.

In [52]:
# import required libraries
import requests
import json

In [56]:
# json from 'https://quotes.rest/qod.json'
r = requests.get('https://quotes.rest/qod.json')

In [59]:
# get the result into dictionary
res = r.json()

In [64]:
# check json structure
print(json.dumps(res, indent=4))

{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "When you win, say nothing. When you lose, say less.",
                "length": "51",
                "author": "Paul Brown",
                "tags": [
                    "inspire",
                    "losing",
                    "running",
                    "tod",
                    "winning"
                ],
                "category": "inspire",
                "date": "2019-09-19",
                "permalink": "https://theysaidso.com/quote/paul-brown-when-you-win-say-nothing-when-you-lose-say-less",
                "title": "Inspiring Quote of the day",
                "background": "https://theysaidso.com/img/bgs/man_on_the_mountain.jpg",
                "id": "3dlKxoNAOZsB__Nb61H95weF"
            }
        ],
        "copyright": "2017-19 theysaidso.com"
    }
}


In [65]:
# extract contents
q = res['contents']['quotes'][0]

In [67]:
# print output
q

{'quote': 'When you win, say nothing. When you lose, say less.',
 'length': '51',
 'author': 'Paul Brown',
 'tags': ['inspire', 'losing', 'running', 'tod', 'winning'],
 'category': 'inspire',
 'date': '2019-09-19',
 'permalink': 'https://theysaidso.com/quote/paul-brown-when-you-win-say-nothing-when-you-lose-say-less',
 'title': 'Inspiring Quote of the day',
 'background': 'https://theysaidso.com/img/bgs/man_on_the_mountain.jpg',
 'id': '3dlKxoNAOZsB__Nb61H95weF'}

In [69]:
# extract quote and author
print(q['quote'], '\n--', q['author'])

When you win, say nothing. When you lose, say less. 
-- Paul Brown
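
The response above is a nested dictionary, so a missing key anywhere along the path raises a KeyError. As a minimal sketch, assuming a payload shaped like the one printed above (the `sample` string and the `first_quote` helper are made up for illustration and are not part of the original notebook), the lookups can be made tolerant with `dict.get`:

```python
import json

# Hypothetical payload that mirrors the structure printed above
sample = '''
{
    "success": {"total": 1},
    "contents": {
        "quotes": [
            {"quote": "When you win, say nothing. When you lose, say less.",
             "author": "Paul Brown"}
        ]
    }
}
'''

def first_quote(payload):
    """Return (quote, author) from a parsed payload, or None if the structure is missing."""
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    q = quotes[0]
    return q.get("quote"), q.get("author")

res = json.loads(sample)   # parse a JSON string into a dictionary
result = first_quote(res)
empty = first_quote({})    # no "contents" key at all
print(result)
print(empty)
```

Returning None instead of raising keeps a batch job running when one response has an unexpected shape; whether that is preferable to failing fast depends on the application.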


# 5. Collecting Data from HTML

In this part we look at how to collect data from HTML pages, using the bs4 library, also known as BeautifulSoup.

In [72]:
# install required library
!pip3 install bs4 --user



In [74]:
# import required libraries
import urllib.request as urllib
from bs4 import BeautifulSoup

In [75]:
# fetch html file (e.g. Wikipedia)
response = urllib.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')

In [84]:
# get data into html_doc (binary)
html_doc = response.read()

In [79]:
# parse html file
soup = BeautifulSoup(html_doc, 'html.parser')

In [85]:
# format the parsed html file
strhtml = soup.prettify()

In [93]:
# print the first 1000 characters
print(strhtml[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Natural language processing - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Natural_language_processing","wgTitle":"Natural language processing","wgCurRevisionId":915734112,"wgRevisionId":915734112,"wgArticleId":21652,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","All accuracy disputes","Articles with disputed statements from June 2018","Wikipedia articles with LCCN identifiers","Wikipedia articles with NDL identifiers","Natural language processing","Computational linguistics","Speech recognition","Computational fields of study","Artificial intelligence"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparator

In [88]:
# extract tag values
# 1. extract title
print(soup.title)

<title>Natural language processing - Wikipedia</title>


In [89]:
# 2. extract content between the following tags: <title> </title>
print(soup.title.string)

Natural language processing - Wikipedia


In [90]:
# 3. extract content between the following tags: <a> </a>
# (.string returns None when the tag has no single text child, as for the first <a> on this page)
print(soup.a.string)

None


In [91]:
# 4. extract content between the following tags: <b> </b>
print(soup.b.string)

Natural language processing


In [99]:
# extract all instances of a particular tag (e.g. 'a')
content_tag_a = []
for x in soup.find_all('a'):
    content_tag_a.append(x.string)

In [100]:
# print the first 5 occurrences
content_tag_a[:5]

[None,
 'Jump to navigation',
 'Jump to search',
 'Neuro-linguistic programming',
 'Language processing in the brain']

In [103]:
# extract all text of a particular tag
content_tag_p = []
for x in soup.find_all('p'):
    content_tag_p.append(x.text)

In [106]:
# print the first 5 occurrences
content_tag_p[:5]

['Not to be confused with Non-linear programming (also NLP)',
 'Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.\n',
 'Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.\n',
 'The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.\nIn 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.\n',
 'The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within

# 6. Parsing text using Regular Expressions

In this part we look at how regular expressions help when dealing with text data. They are especially useful for raw data from the web, which may contain HTML tags, long text, and repeated text.

For this we use the "re" module from Python's standard library.

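The basic flags in the "re" module are:
- re.I : ignore case when matching
- re.L : make \w, \W, \b, \B and case-insensitive matching depend on the current locale (bytes patterns only)
- re.M : make ^ and $ match at the start and end of every line, not only of the whole string
- re.S : make the dot (.) match any character, including a newline
- re.U : use Unicode matching rules (the default for str patterns in Python 3)
- re.X : allow whitespace and comments inside the pattern so it can be written in a more readable format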
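
The effect of the most common flags can be seen on a two-line string. This is a small illustrative sketch; the sample text and variable names are invented for the example:

```python
import re

text = "First line\nsecond LINE"

# re.I: case-insensitive matching
ci = re.findall(r"line", text, flags=re.I)

# re.M: ^ matches at the start of every line, not only at the start of the string
starts = re.findall(r"^\w+", text, flags=re.M)

# re.S: the dot also matches the newline character
no_s = re.search(r"First.*LINE", text)                 # no match without re.S
with_s = re.search(r"First.*LINE", text, flags=re.S)   # match spans both lines

# re.X: whitespace and comments inside the pattern are ignored
verbose = re.compile(r"""
    \d{4}   # year
    -
    \d{2}   # month
""", flags=re.X)
ym = verbose.search("released 2019-09-19").group()

print(ci, starts, no_s, with_s is not None, ym)
```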

Regular expression syntax:
- Find a single occurrence of the character a or b:
Regex: [ab]
- Find any character except a and b:
Regex: [^ab]
- Find any character in the range a to z:
Regex: [a-z]
- Find any character outside the range a to z:
Regex: [^a-z]
- Find any character in a to z or A to Z:
Regex: [a-zA-Z]
- Any single character (except a newline, unless re.S is set):
Regex: .
- Any whitespace character:
Regex: \s
- Any non-whitespace character:
Regex: \S
- Any digit:
Regex: \d
- Any non-digit:
Regex: \D
- Any word character (letter, digit, or underscore):
Regex: \w
- Any non-word character:
Regex: \W
- Match either a or b:
Regex: (a|b)
- Zero or one occurrence of a:
Regex: a? ; ? matches zero or one occurrence, but not more
- Zero or more occurrences of a:
Regex: a* ; * matches zero or more occurrences
- One or more occurrences of a:
Regex: a+ ; + matches one or more occurrences
- Exactly three occurrences of a:
Regex: a{3}
- Three or more consecutive occurrences of a:
Regex: a{3,}
- Between three and six consecutive occurrences of a:
Regex: a{3,6}
- Start of the string:
Regex: ^
- End of the string:
Regex: $
- Word boundary:
Regex: \b
- Non-word boundary:
Regex: \B
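
A few of these patterns in action, as a quick sketch with made-up input strings:

```python
import re

s = "aaa bb 42 cat_7!"

either = re.findall(r"[ab]", "abc")                   # characters a or b
neither = re.findall(r"[^ab]", "abc")                 # anything except a and b
digits = re.findall(r"\d+", s)                        # runs of one or more digits
triple = re.findall(r"a{3}", s)                       # exactly three a's in a row
whole = re.findall(r"\bcat\b", "cat concat cat.")     # "cat" as a whole word only
optional = re.findall(r"colou?r", "color colour")     # the u is optional

print(either, neither, digits, triple, whole, optional)
```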

The most commonly used functions are re.match() and re.search(). Both look for a pattern, and the result can then be processed according to the requirements of the application.

- re.match() checks for a match only at the beginning of the string. If the pattern is found there, it returns a match object; otherwise it returns None.
- re.search() checks for a match anywhere in the string and returns the first one it finds. To get all occurrences of the pattern in the input, use re.findall() or re.finditer().
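
The difference between the two functions is easiest to see side by side. A minimal sketch, using an invented sample string:

```python
import re

text = "order 66 shipped, order 67 pending"

m1 = re.match(r"\d+", text)          # None: the string does not start with a digit
m2 = re.match(r"order", text)        # match object: "order" is at position 0
s1 = re.search(r"\d+", text)         # first match only, found anywhere in the string
all_nums = re.findall(r"\d+", text)  # every non-overlapping match, as a list of strings

print(m1, m2.group(), s1.group(), all_nums)
```

Because both re.match() and re.search() return None when nothing is found, the result should be checked before calling .group() on it.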

# Tokenizing

Tokenizing is the process of splitting a sentence into words (tokens). One way to do this is with the re.split function.

In [107]:
# import required libraries
import re

In [109]:
# run the split query (raw string, so that \s is passed to the regex engine unchanged)
re.split(r'\s+', 'I like this book.')

['I', 'like', 'this', 'book.']
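
Splitting on whitespace leaves punctuation attached to the neighbouring word ('book.'). One possible alternative, shown here only as a sketch, is to describe the tokens themselves with re.findall: a run of word characters, or a single punctuation character.

```python
import re

sentence = "I like this book."

# \w+ matches a word; [^\w\s] matches one character that is neither word nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
```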

# Extracting email IDs

The simplest way to do this is with the re.findall function from Python's "re" module.

In [110]:
# 1. read / create the document or sentences
doc = "For more details please mail us at: xyz@abc.com, pqr@mno.com"

In [111]:
# 2. execute the re.findall function
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', doc)

In [112]:
type(addresses)

list

In [113]:
for address in addresses:
    print(address)

xyz@abc.com
pqr@mno.com
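
The pattern above allows dots anywhere after the @, so an address followed by a full stop is captured together with that full stop. The sketch below compares it with a slightly stricter pattern that requires at least one word character after every dot in the domain. The stricter pattern and the sample sentence are illustrative assumptions, not a complete email validator.

```python
import re

doc = "Write to xyz@abc.com. A second contact is pqr@mno.co.uk, if needed."

# pattern from the example above: the sentence-ending dot is swallowed
loose = re.findall(r"[\w\.-]+@[\w\.-]+", doc)

# stricter variant: the domain is made of dot-separated labels, each non-empty
strict = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", doc)

print(loose)
print(strict)
```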


# Replacing email IDs

Here we replace email IDs in sentences or documents with another email ID. The simplest way to do this is with re.sub.

In [118]:
# 1. read / create the document or sentences
doc = "For more details please mail us at xyz@abc.com"

In [119]:
# 2. execute re.sub function
new_doc = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'pqr@mno.com', doc)

In [120]:
new_doc

'For more details please mail us at pqr@mno.com'
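
The pattern above defines two capture groups (the part before and the part after the @), although the replacement does not use them. As a sketch of what they allow, a backreference such as \2 can reuse the captured domain, for example to mask only the user name. The masking scheme is an assumption made for illustration.

```python
import re

doc = "For more details please mail us at xyz@abc.com"

# \2 inserts whatever the second group (the domain) matched
masked = re.sub(r"([\w.-]+)@([\w.-]+)", r"***@\2", doc)
print(masked)
```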

# Extracting data from an ebook with regex

In [126]:
# import required libraries
import re
import requests
import json

In [127]:
# url of the text we want to extract information from
url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

In [147]:
# define function to extract
def get_book(url):
    # send an HTTP request to get the text from Project Gutenberg
    raw_text = requests.get(url).text
    
    # skip metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw_text).end()
    
    # cut at the first occurrence of "II" in the raw text, so everything from that point on is dropped
    end = re.search(r"II", raw_text).start()
    
    # keep relevant text
    text = raw_text[start:end]
    return text
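
Note that re.search returns None when a marker is not present, in which case calling .end() or .start() raises an AttributeError. The following is a hedged sketch of a more defensive variant that works on any string, without a network request. The helper name `strip_markers`, the sample text, and the END marker pattern are assumptions made for illustration and do not come from the notebook.

```python
import re

def strip_markers(raw_text, start_pat, end_pat):
    """Return the text between two marker patterns, tolerating missing markers."""
    start_match = re.search(start_pat, raw_text)
    start = start_match.end() if start_match else 0

    # look for the end marker only after the start marker
    end_match = re.search(end_pat, raw_text[start:])
    end = start + end_match.start() if end_match else len(raw_text)
    return raw_text[start:end]

# hypothetical miniature "book" with header and footer metadata
sample = (
    "header\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n"
    "body text\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n"
    "license"
)
start_pat = r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*"
end_pat = r"\*\*\* END OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*"

body = strip_markers(sample, start_pat, end_pat)
unchanged = strip_markers("plain text", start_pat, end_pat)
print(repr(body), repr(unchanged))
```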

In [149]:
# preprocessing
def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

In [150]:
book = get_book(url)

In [151]:
# apply preprocessing
processed_book = preprocess(book)

In [154]:
processed_book[:2000]

' produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part i i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages fo

In [155]:
# perform exploratory data analysis on data using regex
# Count the number of times "the" appears in the book
# (substring match, so words such as "there" and "other" are counted too)
len(re.findall(r'the', processed_book))

302
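
Since the pattern r'the' also matches inside longer words, the count above includes more than the word "the" itself. A small sketch on an invented sentence shows how the word boundary \b restricts the count to whole words:

```python
import re

text = "the other day they met there at the theatre"

substring_hits = len(re.findall(r"the", text))     # matches inside other, they, there, theatre
word_hits = len(re.findall(r"\bthe\b", text))      # only the standalone word "the"

print(substring_hits, word_hits)
```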

In [156]:
# Replace "i" with "I"
processed_book = re.sub(r'\si\s', " I ", processed_book)

In [158]:
processed_book[:2000]

' produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part I i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages fo

In [160]:
# find all occurrences of text in the format "abc--xyz"
re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)

['ironical--it',
 'malicious--smile',
 'fur--or',
 'astrachan--overcoat',
 'it--the',
 'Italy--was',
 'malady--a',
 'money--and',
 'little--to',
 'No--Mr',
 'is--where',
 'I--I',
 'I--',
 '--though',
 'crime--we',
 'or--judge',
 'gaiters--still',
 '--if',
 'through--well',
 'say--through',
 'however--and',
 'Epanchin--oh',
 'too--at',
 'was--and',
 'Andreevitch--that',
 'everyone--that',
 'reduce--or',
 'raise--to',
 'listen--and',
 'history--but',
 'individual--one',
 'yes--I',
 'but--',
 't--not',
 'me--then',
 'perhaps--',
 'Yes--those',
 'me--is',
 'servility--if',
 'Rogojin--hereditary',
 'citizen--who',
 'least--goodness',
 'memory--but',
 'latter--since',
 'Rogojin--hung',
 'him--I',
 'anything--she',
 'old--and',
 'you--scarecrow',
 'certainly--certainly',
 'father--I',
 'Barashkoff--I',
 'see--and',
 'everything--Lebedeff',
 'about--he',
 'now--I',
 'Lihachof--',
 'Zaleshoff--looking',
 'old--fifty',
 'so--and',
 'this--do',
 'day--not',
 'that--',
 'do--by',
 'know--my',
 'il