# Data Scraping

## Outline
* Access Structured Data
    * accessing ***REST*** APIs
    * Common bus systems
        
* Access Unstructured Data
    * web scraping
    * PDF scraping
    

## Data Science Processing Pipeline
<img src="IMG/workflow.png" width=1200>

## Recall: JSON
### *Data with less structure than tables: sparse entries or flexible schema*
<img SRC="IMG/json-logo.png" width=400>

***JavaScript Object Notation (JSON)*** is an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types. 
JSON is a language-independent data format. It was derived from JavaScript, but as of 2017 many programming languages include code to generate and parse JSON-format data. 
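
As a minimal sketch of what "attribute–value pairs and array data types" look like in practice, the following uses Python's built-in `json` module to serialize a small record and parse it back. The field names (`sensor`, `location`, `readings`, ...) are made up for illustration.

```python
import json

# A record with attribute-value pairs, a nested object, and an array.
# All field names and values here are hypothetical.
record = {
    "sensor": "temp-01",
    "location": {"room": "lab", "floor": 2},
    "readings": [21.5, 21.7, 22.0],
    "calibrated": True,
    "comment": None,
}

text = json.dumps(record, indent=2)  # Python object -> JSON text
restored = json.loads(text)          # JSON text -> Python object

print(text)
print(restored == record)
```

Note how Python's `True` and `None` are written as `true` and `null` in the JSON text, and are mapped back when parsing.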

### JSON Document Tree
<img SRC="IMG/json.png">

### JSON in Python


In [1]:
import pandas as pd

Data = {'Product': ['Desktop Computer','Tablet','iPhone','Laptop'],
        'Price': [700,250,800,1200]
        }

df = pd.DataFrame(Data, columns= ['Product', 'Price'])
 
print (df)


            Product  Price
0  Desktop Computer    700
1            Tablet    250
2            iPhone    800
3            Laptop   1200


In [2]:
# native JSON support in pandas
# to_json writes the file and returns None when a path is given, so no assignment is needed
df.to_json('Export_DataFrame.json')
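
To see what pandas writes without opening the file, one option is to call `to_json` without a path, which returns the JSON text as a string. The sketch below rebuilds the same DataFrame, inspects the default layout and the `orient="records"` layout, and reads the text back with `pd.read_json`. It assumes a pandas installation; the variable names are illustrative.

```python
import json
from io import StringIO

import pandas as pd

df = pd.DataFrame({
    "Product": ["Desktop Computer", "Tablet", "iPhone", "Laptop"],
    "Price": [700, 250, 800, 1200],
})

# Without a path, to_json returns the JSON text instead of writing a file.
json_text = df.to_json()
parsed = json.loads(json_text)
print(parsed["Price"])  # default layout: column -> {index label -> value}

# orient="records" yields one JSON object per row.
records = json.loads(df.to_json(orient="records"))
print(records[0])

# Read the JSON text back into a DataFrame.
restored = pd.read_json(StringIO(json_text))
print(restored)
```

The row-wise `records` layout is often closer to what REST services return, while the default column-wise layout keeps the index labels.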

#### Use ***Colab*** to browse the JSON file.

## REST-APIs: Getting Data from Sensors and Services
### *IoT use cases and mash-ups*

* **REST**: **RE**presentational **S**tate **T**ransfer, the de facto standard for network (HTTP) communication
* designed for performance, scalability, simplicity, and reliability in **client-server** data exchange



Also see: [https://en.wikipedia.org/wiki/Representational_state_transfer](https://en.wikipedia.org/wiki/Representational_state_transfer)

### REST
* **Stateless:** The server won’t maintain any state between requests from the client.
* **Client-server:** The client and server must be decoupled from each other, allowing each to develop independently.
* **Cacheable:** The data retrieved from the server should be cacheable either by the client or by the server.
* **Uniform interface:** The server will provide a uniform interface for accessing resources without defining their representation.
* **Layered system:** The client may access the resources on the server indirectly through other layers such as a proxy or load balancer.

### REST Schema
<img src='IMG/REST.png'>
[image from Wikipedia]

### Example

* ***GitHub*** REST API -> User information

[https://api.github.com/users/keuperj](https://api.github.com/users/keuperj)

### REST communication via HTTP(s) requests

* **GET**: Retrieve an existing resource.
* **POST**: Create a new resource.
* **PUT**: Update an existing resource.
* **PATCH**: Partially update an existing resource.
* **DELETE**: Delete a resource.

#### Data payload -> JSON !
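
The mapping from HTTP methods to operations, with JSON as the payload, can be sketched with the standard library alone. The example below only *builds* the requests and does not send them, so it runs without network access. The helper `build_request` and the payload values are assumptions for illustration; the URL is the public jsonplaceholder test service.

```python
import json
import urllib.request

BASE_URL = "https://jsonplaceholder.typicode.com/todos"  # public test service


def build_request(method, url, payload=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = None
    headers = {"Accept": "application/json"}
    if payload is not None:
        # The payload travels as UTF-8 encoded JSON text in the request body.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


requests_to_send = [
    build_request("GET", f"{BASE_URL}/1"),
    build_request("POST", BASE_URL, {"userId": 1, "title": "Buy milk", "completed": False}),
    build_request("PUT", f"{BASE_URL}/1", {"userId": 1, "title": "Buy milk", "completed": True}),
    build_request("PATCH", f"{BASE_URL}/1", {"completed": True}),
    build_request("DELETE", f"{BASE_URL}/1"),
]

for req in requests_to_send:
    print(req.get_method(), req.full_url, req.data)
```

Only POST, PUT, and PATCH carry a body here; GET and DELETE identify the resource through the URL alone.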

### REST interactions in Python
#### with the REQUESTS lib

<center>
<img src="IMG/requests.png" width=300>
</center>

* [https://docs.python-requests.org/en/master/](https://docs.python-requests.org/en/master/)

In [3]:
# Example: read data from service

import requests
api_url = "https://jsonplaceholder.typicode.com/todos/1" # open REST service for tests
response = requests.get(api_url)
response.json()

{'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False}

We use the open REST test server at: https://jsonplaceholder.typicode.com

In [4]:
# write data to service

api_url = "https://jsonplaceholder.typicode.com/todos"
todo = {"userId": 1, "title": "Buy milk", "completed": False}
response = requests.post(api_url, json=todo)
response.json()

{'userId': 1, 'title': 'Buy milk', 'completed': False, 'id': 201}

In [5]:
#check transaction
response.status_code

201

Status codes:

* 2xx - SUCCESS
* 4xx - Client Error
* 5xx - Server Error
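
A minimal sketch of how these classes can be checked in code, using only the standard library. The helper `classify_status` is a hypothetical name; it also covers the 1xx and 3xx ranges defined by HTTP, which are not listed above.

```python
from http import HTTPStatus


def classify_status(code):
    """Return the class of an HTTP status code based on its first digit."""
    if 100 <= code < 200:
        return "informational"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    raise ValueError(f"not a valid HTTP status code: {code}")


for code in (200, 201, 404, 500):
    # HTTPStatus provides the standard reason phrase for each code.
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

With the `requests` library, `response.ok` and `response.raise_for_status()` offer a similar check directly on the response object.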

#### Large-scale JSON handling and queries -> MongoDB!

#### Other communication libraries are available, e.g. for CAN bus: [https://python-can.readthedocs.io/en/master/](https://python-can.readthedocs.io/en/master/)

## Getting Data from Web Resources
#### *Data Scraping*

* In some cases data is not *provided* via a defined API, but needs to be collected
   * e.g. from unstructured web data

In [6]:
# example using requests
import requests

r = requests.get('https://www.google.com')
print(r.text)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="42ofyfj8a0OV_frCMpY-yA">(function(){window.google={kEI:'9pmQYtyOB6mHxc8P-_y88Ak',kEXPI:'0,1302536,56873,6058,207,4804,926,1390,383,246,5,1354,4013,1123753,1197771,630,380090,16114,17444,11240,17572,4858,1362,12313,17594,4012,978,13228,3847,10622,22741,5968,706,1279,2742,149,1103,840,1983,4314,3514,606,2023,1777,521,14669,3227,2845,7,17450,16320,17607,1,2,346,230,6182,277,149,13975,4,1528,2304,7039,25073,2658,7355,13660,4437,16786,5824,2533,4092,2,4052,3,3541,1,42154,2,14022,6248,1398,6470,11623,5679,3398,28745,4567,6259,23418,1252,5835,14968,4332,6089,1395,11239,15406,1,436,8155,6395,187,98,701,11772,2908,7341,3217,8190,3057,1920,5596,110,917,3156,23,10158,2233,3099,210,6056,5931,5119,781,2544,41,1486,2443,

### How to get structured information from Websites?
#### BeautifulSoup
* Docs: https://beautiful-soup-4.readthedocs.io/en/latest/


* Example for news website: https://news.ycombinator.com

In [7]:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.find_all('tr', class_='athing')  # one table row per story

formatted_links = []

for link in links:
    data = {
        'id': link['id'],
        'title': link.find_all('td')[2].a.text,
        "url": link.find_all('td')[2].a['href'],
        # read the rank from the current row (not from links[0])
        "rank": int(link.td.span.text.replace('.', ''))
    }
    formatted_links.append(data)



In [8]:
for i in range(10):
    print(formatted_links[i])

{'id': '31526370', 'title': 'Today’s JavaScript, from an outsider’s perspective (2020)', 'url': 'https://lea.verou.me/2020/05/todays-javascript-from-an-outsiders-perspective/', 'rank': 1}
{'id': '31525505', 'title': 'Ultra compact GAN ATX power supply delivers up to 250 Watts', 'url': 'https://www.cnx-software.com/2022/05/26/ultra-compact-gan-atx-power-supply-delivers-up-to-250-watts/', 'rank': 2}
{'id': '31525792', 'title': 'Internet drama in Canada', 'url': 'https://www.nytimes.com/2022/05/26/technology/canada-internet-service.html', 'rank': 3}
{'id': '31526231', 'title': 'Organice: An implementation of Org mode without the dependency of Emacs', 'url': 'https://github.com/200ok-ch/organice', 'rank': 4}
{'id': '31526044', 'title': 'NPM security update: Attack campaign using stolen OAuth tokens', 'url': 'https://github.blog/2022-05-26-npm-security-update-oauth-tokens/', 'rank': 5}
{'id': '31524669', 'title': 'What are the odds that some idiot will name his mutex ether-rot-mutex (2017)'

In [9]:
#another example
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [10]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

In [11]:
#get the clean title
soup.title.string

"The Dormouse's story"

In [12]:
#iterate over all links
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [13]:
#get all text
print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
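
Beyond iterating over all tags of one type, elements can also be selected by CSS class or by `id`. The following sketch uses a shortened copy of the example document so that it runs on its own; it assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document, so this block is self-contained.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Select all <a> tags with class "sister" and collect id, text, and target.
sisters = [(a["id"], a.get_text(), a["href"]) for a in soup.find_all("a", class_="sister")]
print(sisters)

# Select a single element by its id attribute.
print(soup.find(id="link3").get_text())

# Select a paragraph by class and read its text.
print(soup.find("p", class_="title").get_text())
```

`class_` carries a trailing underscore because `class` is a reserved word in Python.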



#### get table and convert to pandas
Example: https://www.w3schools.com/html/html_tables.asp

In [14]:
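import bs4 as bs
import urllib.request
from io import StringIO
import pandas as pd

source = urllib.request.urlopen('https://www.w3schools.com/html/html_tables.asp').read()
soup = bs.BeautifulSoup(source, 'lxml')

# collect all <table> elements and let pandas parse them; take the first table
# (recent pandas versions expect a file-like object rather than a raw HTML string)
tables = soup.find_all('table')
df = pd.read_html(StringIO(str(tables)))[0]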

In [15]:
df.head()

                        Company          Contact  Country
0           Alfreds Futterkiste     Maria Anders  Germany
1    Centro comercial Moctezuma  Francisco Chang   Mexico
2                  Ernst Handel    Roland Mendel  Austria
3                Island Trading    Helen Bennett       UK
4  Laughing Bacchus Winecellars  Yoshi Tannamuri   Canada
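
When a table needs cleaning before it becomes a DataFrame, the rows can also be walked by hand. This is a minimal sketch on an inline HTML fragment (two rows copied from the table above), assuming `beautifulsoup4` and `pandas` are installed; it is an alternative to `pd.read_html`, not a replacement for it.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Inline fragment standing in for a downloaded page.
html = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
  <tr><td>Ernst Handel</td><td>Roland Mendel</td><td>Austria</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# The first row holds the column names, the remaining rows hold the data.
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
body = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]

df = pd.DataFrame(body, columns=header)
print(df)
```

Walking the rows manually makes it easy to skip, merge, or convert individual cells before the DataFrame is built.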


## Scaling Web Scraping with Scrapy
#### Crawling the web

<img SRC="IMG/scrapy.jpg">

#### [https://scrapy.org/](https://scrapy.org/)

#### Example

Scraping [http://quotes.toscrape.com/page/1/](http://quotes.toscrape.com/page/1/)

In [16]:
%%writefile myCrwler.py 

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TheFriendlyNeighbourhoodSpider(CrawlSpider):
    name = 'TheFriendlyNeighbourhoodSpider'

    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1/']

    custom_settings = {
    'LOG_LEVEL': 'INFO'
    }

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )


    def parse_item(self, response):
        print('Downloaded... ', response.url)
        filename = "download/"+str(response.url.split("/")[-2]) +  '.html'
        print('Saving as :', filename)
        with open(filename, 'wb') as f:
            f.write(response.body)

Overwriting myCrwler.py


### Run in Shell:

In [17]:
# This produces all the linked HTML files, which we can then use for analysis
!mkdir download
!scrapy runspider myCrwler.py

mkdir: cannot create directory ‘download’: File exists
2022-05-27 11:29:28 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-05-27 11:29:28 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.7.6 (default, Jan  8 2020, 19:59:22) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 2.8, Platform Linux-4.15.0-177-generic-x86_64-with-debian-buster-sid
2022-05-27 11:29:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-05-27 11:29:28 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 'INFO', 'SPIDER_LOADER_WARN_ONLY': True}
2022-05-27 11:29:28 [scrapy.extensions.telnet] INFO: Telnet Password: 79e269613073f4a3
2022-05-27 11:29:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.lo

Downloaded...  http://quotes.toscrape.com/tag/open-mind/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/food/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/chocolate/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/author/J-M-Barrie/
Saving as : download/J-M-Barrie.html
Downloaded...  http://quotes.toscrape.com/tag/understanding/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/wisdom/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/dumbledore/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/knowledge/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/page/5/
Saving as : download/5.html
Downloaded...  http://quotes.toscrape.com/author/Suzanne-Collins/
Saving as : download/Suzanne-Collins.html
Downloaded...  http://quotes.toscrape.com/author/Terry-Pratchett/
Saving as

Downloaded...  http://quotes.toscrape.com/tag/beatles/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/author/George-R-R-Martin/
Saving as : download/George-R-R-Martin.html
Downloaded...  http://quotes.toscrape.com/author/William-Nicholson/
Saving as : download/William-Nicholson.html
Downloaded...  http://quotes.toscrape.com/author/Dr-Seuss/
Saving as : download/Dr-Seuss.html
Downloaded...  http://quotes.toscrape.com/tag/self-indulgence/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/author/Ayn-Rand/
Saving as : download/Ayn-Rand.html
Downloaded...  http://quotes.toscrape.com/tag/lies/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/lying/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/truth/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/insanity/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/

In [18]:
!ls download

10.html			   Eleanor-Roosevelt.html    Jorge-Luis-Borges.html
1.html			   Elie-Wiesel.html	     J-R-R-Tolkien.html
2.html			   Ernest-Hemingway.html     Khaled-Hosseini.html
3.html			   Friedrich-Nietzsche.html  life.html
4.html			   friendship.html	     love.html
5.html			   friends.html		     Madeleine-LEngle.html
6.html			   Garrison-Keillor.html     Marilyn-Monroe.html
7.html			   George-Bernard-Shaw.html  Mark-Twain.html
8.html			   George-Carlin.html	     Martin-Luther-King-Jr.html
9.html			   George-Eliot.html	     Mother-Teresa.html
Albert-Einstein.html	   George-R-R-Martin.html    Pablo-Neruda.html
Alexandre-Dumas-fils.html  Harper-Lee.html	     quotes.toscrape.com.html
Alfred-Tennyson.html	   Haruki-Murakami.html      Ralph-Waldo-Emerson.html
Allen-Saunders.html	   Helen-Keller.html	     reading.html
Andre-Gide.html		   humor.html		     simile.html
Ayn-Rand.html		   inspirational.html	     Stephenie-Meyer.html
Bob-Marley.html		   James-Baldwin.html	     Steve-
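
The crawl log shows many different tag pages being saved as `download/1.html`, because the spider keeps only the second-to-last URL segment, so later pages overwrite earlier ones. One possible way to avoid this, sketched here with the standard library, is to derive the file name from the full URL path. The helper `url_to_filename` is a hypothetical name and not part of Scrapy.

```python
from urllib.parse import urlparse


def url_to_filename(url):
    """Map a URL to a file name that keeps the full path, so pages do not collide."""
    path = urlparse(url).path.strip("/")
    # Use "index" for the site root, otherwise join the path segments with "_".
    name = path.replace("/", "_") if path else "index"
    return name + ".html"


urls = [
    "http://quotes.toscrape.com/tag/food/page/1/",
    "http://quotes.toscrape.com/tag/chocolate/page/1/",
    "http://quotes.toscrape.com/page/5/",
    "http://quotes.toscrape.com/author/J-M-Barrie/",
    "http://quotes.toscrape.com/",
]

for url in urls:
    print(url, "->", url_to_filename(url))
```

Inside `parse_item`, the result could be used as `"download/" + url_to_filename(response.url)`.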

## Scraping PDFs
* using ``tabula-py``
* Docs: https://tabula-py.readthedocs.io/en/latest/

In [19]:
pip install tabula-py

Note: you may need to restart the kernel to use updated packages.


#### Example: 
* a UN report as [PDF](https://www.un.org/en/development/desa/population/publications/pdf/urbanization/the_worlds_cities_in_2016_data_booklet.pdf)

In [20]:
import tabula
url= "https://www.un.org/en/development/desa/population/publications/pdf/urbanization/the_worlds_cities_in_2016_data_booklet.pdf"

# get the tables on page 6
df = tabula.read_pdf(url, pages='6')


In [21]:
#returns list of all tables - get the first one
df[0].head()

Unnamed: 0.1,"Of the world’s 31 megacities (that is,",Unnamed: 0,Population,Unnamed: 1,Population.1
0,,,in 2016,,in 2030
1,cities with 10 million inhabitants or Rank,"City, Country",(thousands),"City, Country",(thousands)
2,"more) in 2016, 24 are located in the 1","Tokyo, Japan",38 140,"Tokyo, Japan",37 190
3,less developed regions or the “global 2 3,"Delhi, India Shanghai, China",26 454 24 484,"Delhi, India Shanghai, China",36 06030 751
4,South”. China alone was home to six 4,"Mumbai (Bombay), India",21 357,"Mumbai (Bombay), India",27 797
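
Tables extracted from PDFs usually need cleaning. In the output above, the population values are strings with a space as thousands separator (for example `38 140`), and some cells contain two values fused together. The sketch below converts well-formed values to integers and rejects anything else, so fused cells are flagged instead of being silently misread. The helper `parse_thousands` is a hypothetical name.

```python
import re

# One to three digits, followed by any number of space-separated 3-digit groups.
GROUPED_NUMBER = re.compile(r"\d{1,3}(?: \d{3})*")


def parse_thousands(text):
    """Convert a string such as '38 140' to the integer 38140."""
    cleaned = text.replace("\u00a0", " ").strip()  # treat non-breaking spaces as spaces
    if not GROUPED_NUMBER.fullmatch(cleaned):
        raise ValueError(f"not a single space-grouped number: {text!r}")
    return int(cleaned.replace(" ", ""))


for cell in ["38 140", "21 357", "26 454 24 484"]:
    try:
        print(repr(cell), "->", parse_thousands(cell))
    except ValueError as err:
        print(repr(cell), "-> needs manual handling:", err)
```

Cells that raise `ValueError` here, such as the fused row for Delhi and Shanghai, have to be split by hand or by re-running the extraction with different settings.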


# Discussion