# DATA ENGINEERING PLATFORMS (MSCA 31012)

## Web Scraping Using Python (Example 1)



References: 

- https://first-web-scraper.readthedocs.io/en/latest/
- http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html
- http://altitudelabs.com/blog/web-scraping-with-python-and-beautiful-soup/

Installation:
`pip install beautifulsoup4` | `pip install requests` | `pip install lxml` (the `"lxml"` parser used below needs the `lxml` package)

In [9]:
import csv
import requests
from bs4 import BeautifulSoup
from IPython.display import HTML

import pandas as pd
import numpy as np

Scraping Rules
--------------
- Check a website's Terms and Conditions before you scrape it, and read the statements about legal use of data carefully. Scraped data usually may not be used for commercial purposes.
- Do not request data too aggressively: a flood of automated requests can overload or break the website. Keep your program's request rate reasonable, closer to that of a human visitor. One request for one web page per second is good practice.
- The layout of a website may change from time to time, so revisit the site and update your code as needed.
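
The one-request-per-second guideline above can be enforced in code rather than left to discipline. The following is a minimal sketch, not part of the original notebook: `Throttle` and its parameters are hypothetical names, and it relies only on the standard library's `time` module.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait().

    Hypothetical helper for illustration; the clock and sleep functions
    are injectable so the behaviour can be checked without real delays.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the previous request: pause for the rest
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=1.0)
throttle.wait()  # the first call returns immediately
```

Calling `throttle.wait()` immediately before each `requests.get(...)` in a loop over pages would keep the requests at least `min_interval` seconds apart.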

In [2]:
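# Scrape the current detainees of the Boone County Jail from the web page into a CSV file
url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url, timeout=30)  # time out rather than hang on an unresponsive server
response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
html = response.content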

In [3]:
HTML('<iframe src="http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp" width="900" height="350"></iframe>')

In [4]:
soup = BeautifulSoup(html, "lxml")
table = soup.find('tbody', attrs={'class': 'stripe'})

In [5]:
tmpRow = table.find_all('tr')[1:]  # every row except the first
print(tmpRow)


[<tr class="even">
<td class="two td_left" data-th="Last Name">AKERS</td>
<td class="two td_left" data-th="First Name">SYDNEY</td>
<td class="two td_left" data-th="Middle Name">RAE</td>
<td class="two td_left" data-th="Sex">F</td>
<td class="two td_left" data-th="Race">W</td>
<td class="two td_right" data-th="Age">21</td>
<td class="two td_left" data-th="City">COLUMBIA</td>
<td class="two td_left" data-th="State">MO</td>
<td class="two td_left" data-th="">
<a class="_lookup btn btn-primary" height="600" href="SH01_MP.I00500s?PERKEP=76028&amp;hover_redir=&amp;height=600&amp;width=950" linkedtype="I" mrc="returndata" target="_lookup" width="860"><i class="fa fa-large fa-fw fa-list-alt"> </i>Details</a>
</td>
</tr>, <tr class="odd">
<td class="one td_left" data-th="Last Name">AL-AQUIL</td>
<td class="one td_left" data-th="First Name">MOHAMMED</td>
<td class="one td_left" data-th="Middle Name">NASSER</td>
<td class="one td_left" data-th="Sex">M</td>
<td class="one td_left" data-th="Race">U

In [6]:
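# Write one CSV row per table row.
# Note: the [2:] slice skips the first two cells (last and first name), so each
# written row starts at the middle name and does not line up with the header.
# The trailing "Details" link cell and the appended cell count fill the last two columns.
with open("./inmates.csv", "w", newline="") as outfile:  # newline="" as the csv module recommends
    writer = csv.writer(outfile)
    writer.writerow(["Last", "First", "Middle", "Gender", "Race", "Age", "City", "State"])
    for row in table.find_all('tr')[1:]:
        list_of_cells = []
        for cell in row.find_all('td')[2:]:
            # BeautifulSoup has already decoded '&nbsp;' to '\xa0', so this replace changes nothing
            text = cell.text.replace('&nbsp;', '')
            list_of_cells.append(text)
        list_of_cells.append(len(list_of_cells))  # number of cells scraped for this row
        writer.writerow(list_of_cells)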
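
The `[2:]` slice above drops the last-name and first-name cells, so the written rows do not line up with the header. One possible correction, sketched under assumptions, is to take the first eight cells of each row instead and strip whitespace and non-breaking spaces. `FIELDS`, `clean_row`, and `write_rows` are hypothetical helper names; the sample row uses the values visible in the printed HTML above, and an in-memory buffer stands in for `inmates.csv`.

```python
import csv
import io

# Column names taken from the header row written in the notebook
FIELDS = ["Last", "First", "Middle", "Gender", "Race", "Age", "City", "State"]


def clean_row(cell_texts, n_fields=len(FIELDS)):
    """Keep the first n_fields cells and strip whitespace, including '\\xa0'."""
    return [text.replace("\xa0", " ").strip() for text in cell_texts[:n_fields]]


def write_rows(rows, out):
    """Write the header and one cleaned CSV row per scraped table row."""
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    for cells in rows:
        writer.writerow(clean_row(cells))


# One row of cell texts, as they appear in the printed HTML (assumed input shape)
sample = [["AKERS", "SYDNEY", "RAE", "F", "W", "21", "COLUMBIA", "MO", "\n\xa0Details\n"]]

buffer = io.StringIO()
write_rows(sample, buffer)
print(buffer.getvalue())
```

With the scraped table, the rows could be supplied as `[[td.get_text() for td in tr.find_all('td')] for tr in table.find_all('tr')[1:]]`, and `out` would be the open CSV file.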

In [7]:
with open("./inmates.csv", "rt", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        if len(row) > 0:  # skip blank lines
            print(row)

['Last', 'First', 'Middle', 'Gender', 'Race', 'Age', 'City', 'State']
['RAE', 'F', 'W', '21', 'COLUMBIA', 'MO', '\n\xa0Details\n', '7']
['NASSER', 'M', 'U', '27', 'ST. LOUIS', 'MO', '\n\xa0Details\n', '7']
['ADRIAN', 'M', 'W', '40', 'COLUMBIA', 'MO', '\n\xa0Details\n', '7']
['ADOLF', 'M', 'B', '30', 'JEFFERSON', 'MO', '\n\xa0Details\n', '7']
['DIANA', 'F', 'W', '41', 'COLUMBIA', 'MO', '\n\xa0Details\n', '7']
['MORGAN', 'M', 'W', '37', 'COLUMBIA', 'MO', '\n\xa0Details\n', '7']
['EDWARD', 'M', 'W', '30', 'THOMPSON', 'MO', '\n\xa0Details\n', '7']
['ANN', 'F', 'W', '31', 'PRAIRIE HOME', 'MO', '\n\xa0Details\n', '7']
['COREY', 'M', 'B', '27', 'COLUMBIA', 'MO', '\n\xa0Details\n', '7']
['EUGENE', 'M', 'B', '39', 'COLUMBIA', 'MO', '\n\xa0Details\n', '7']
['SUE', 'F', 'W', '40', 'ST. LOUIS', 'MO', '\n\xa0Details\n', '7']
['DEVONE', 'M', 'B', '41', 'SPRINGFIELD', 'MO', '\n\xa0Details\n', '7']
[' ', 'M', 'B', '34', 'COLUMBIA', 'MO', '\n\xa0Details\n', '7']
['MARIE', 'F', 'W', '33', 'COLUMBIA', 'M
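
The printed rows show middle names under `Last` and the state code under `Age`, a column shift that is easy to miss by eye. A small sanity check could flag it automatically. This is a sketch under assumptions: `misaligned_rows` is a hypothetical helper, the rule that `Age` is numeric and `State` is a two-letter code is inferred from the sample data, and the `Details` cell is simplified to plain text.

```python
import csv
import io


def misaligned_rows(csv_text):
    """Return the 1-based numbers of data rows that look shifted.

    A row is flagged when Age is not numeric or State is not two letters
    (an assumed rule, inferred from the sample data).
    """
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for number, record in enumerate(reader, start=1):
        age = (record.get("Age") or "").strip()
        state = (record.get("State") or "").strip()
        if not age.isdigit() or not (len(state) == 2 and state.isalpha()):
            bad.append(number)
    return bad


header = "Last,First,Middle,Gender,Race,Age,City,State\n"
shifted = header + "RAE,F,W,21,COLUMBIA,MO,Details,7\n"      # row as recorded above
aligned = header + "AKERS,SYDNEY,RAE,F,W,21,COLUMBIA,MO\n"   # row as intended

print(misaligned_rows(shifted))  # [1]
print(misaligned_rows(aligned))  # []
```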

In [10]:
pd.read_csv('inmates.csv')

| | Last | First | Middle | Gender | Race | Age | City | State |
|---|---|---|---|---|---|---|---|---|
| 0 | RAE | F | W | 21.0 | COLUMBIA | MO | \n Details\n | 7.0 |
| 1 | NASSER | M | U | 27.0 | ST. LOUIS | MO | \n Details\n | 7.0 |
| 2 | ADRIAN | M | W | 40.0 | COLUMBIA | MO | \n Details\n | 7.0 |
| 3 | ADOLF | M | B | 30.0 | JEFFERSON | MO | \n Details\n | 7.0 |
| 4 | DIANA | F | W | 41.0 | COLUMBIA | MO | \n Details\n | 7.0 |
| 5 | MORGAN | M | W | 37.0 | COLUMBIA | MO | \n Details\n | 7.0 |
| 6 | EDWARD | M | W | 30.0 | THOMPSON | MO | \n Details\n | 7.0 |
| 7 | ANN | F | W | 31.0 | PRAIRIE HOME | MO | \n Details\n | 7.0 |
| 8 | COREY | M | B | 27.0 | COLUMBIA | MO | \n Details\n | 7.0 |
| 9 | EUGENE | M | B | 39.0 | COLUMBIA | MO | \n Details\n | 7.0 |
