# Data Engineering - Web Scraping

## Exercise 1: To Scrape dot Com

For this exercise, we will use a site that was actually _made for scraping_: [Web Scraping Sandbox](https://toscrape.com/) 

In [1]:
# 1.1 imports (re, BeautifulSoup, requests, and pandas)
import re  # standard-library regular expressions
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import numpy as np

In [3]:
# 1.2 scrape all urls from https://toscrape.com/
r = requests.get('https://toscrape.com/')  # no leading space inside the URL string
r

<Response [200]>
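
The request above confirms the page is reachable, but it does not yet list any URLs. A minimal sketch of the link-extraction step might look like the following. It parses a small hard-coded HTML string instead of the live page so that it runs offline; `SAMPLE_HTML`, the `/about` link, and the helper name `extract_links` are illustrative assumptions, not part of the exercise.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (r.content in the cell above).
SAMPLE_HTML = """
<html><body>
  <a href="http://books.toscrape.com">Books</a>
  <a href="/about">About</a>
  <a>No href here</a>
  <a href="http://books.toscrape.com">Books again</a>
</body></html>
"""


def extract_links(html, base_url):
    """Return the unique absolute URLs found in <a href="..."> tags, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        # urljoin resolves relative links against the page URL.
        absolute = urljoin(base_url, anchor["href"])
        if absolute not in links:
            links.append(absolute)
    return links


links = extract_links(SAMPLE_HTML, "https://toscrape.com/")
print(links)
```

On the real page, the same helper could presumably be called as `extract_links(r.content, r.url)`.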

In [4]:
# 1.3 scrape all text ('p') from https://toscrape.com/
response = requests.get('https://toscrape.com/')
scrp = bs(response.content, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
# print(scrp.prettify())

In [60]:
body = scrp.find('body')  # first <body> element
div = scrp.find('div')  # first <div> element
header1 = scrp.find('h1').get_text()
header2 = scrp.find('h2').get_text()

print('header1:', header1)
print('header2:', header2)

header1: Web Scraping Sandbox
header2: Books


In [61]:
title = scrp.find('title')  # <title> of the web page
print('title:', title.string)
paragraphs = scrp.find_all('p')  # every <p> tag on the page
para = [p.get_text() for p in paragraphs]  # keep only the text of each paragraph
para


title: Scraping Sandbox


["A fictional bookstore that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: books.toscrape.com",
 'A website that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.']

## Exercise 2: The Office (wikipedia)

For this exercise, scrape the sidebar data (text box only) as a dictionary from [The Office Wikipedia Page](https://en.wikipedia.org/wiki/The_Office_(American_TV_series)).

Convert your dictionary into a DataFrame and print it as shown:

![](../assets/the_office_DF.png)

In [26]:
# exercise 2: fetch the page and confirm it parsed by checking its <title>
wikipedia = requests.get('https://en.wikipedia.org/wiki/The_Office_(American_TV_series)')
wikipedia = bs(wikipedia.content, 'html.parser')
wikipedia.find('title')


<title>The Office (American TV series) - Wikipedia</title>

In [27]:
page = requests.get('https://en.wikipedia.org/wiki/The_Office_(American_TV_series)')  # download the page
str(page.content)[:200]  # preview the first 200 characters of the raw HTML

'b\'<!DOCTYPE html>\\n<html class="client-nojs" lang="en" dir="ltr">\\n<head>\\n<meta charset="UTF-8"/>\\n<title>The Office (American TV series) - Wikipedia</title>\\n<script>document.documentElement.classNa'

In [28]:
soup = bs(page.content, 'html.parser')  # parse the HTML into a navigable BeautifulSoup object
# soup.prettify()

In [58]:
table = soup.find(class_='infobox vevent')  # the infobox table in the sidebar
t_heads = table.find_all('th')  # every header cell in the infobox
rows = [th.text for th in t_heads]  # header text becomes the row labels
titles = ['The Office', 'Production', 'Release', 'Chronology']  # section headings, which have no value cell

for item in titles:  # drop the section headings from the row labels
    rows.remove(item)

rows.insert(0, 'title')  # label for the first row

t_data = table.find_all('tr')  # every row of the infobox

# collect the text of each value cell, split into a list on newlines
values = []
for tr in t_data:
    for td in tr.find_all('td'):
        data = td.text.strip().split('\n')
        values.append(data)

values = values[1:]  # the first cell is empty
values.insert(0, 'The Office')  # value for the 'title' row

# build the table by concatenating the two Series side by side
df = pd.concat([pd.Series(rows), pd.Series(values, name='values')], axis=1, ignore_index=True)
df.columns = [' ', 'values']  # column names as in the given sample
df
# I could not remove the index from the displayed table;
# print(df.to_string(index=False)) would print it without the index.


|   |   | values |
|---|---|---|
| 0 | title | The Office |
| 1 | Genre | [Mockumentary, Workplace comedy, Cringe comedy... |
| 2 | Based on | [The Officeby Ricky GervaisStephen Merchant] |
| 3 | Developed by | [Greg Daniels] |
| 4 | Starring | [Steve Carell, Rainn Wilson, John Krasinski, J... |
| 5 | Theme music composer | [Jay Ferguson] |
| 6 | Country of origin | [United States] |
| 7 | Original language | [English] |
| 8 | No. of seasons | [9] |
| 9 | No. of episodes | [201 (list of episodes)] |
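
The solution above lines up labels and values by position, which relies on removing the section headings by hand. As an alternative, here is a hedged sketch that pairs each `th` with the `td` in the same row, builds the dictionary the exercise asks for, and prints the DataFrame without its index. It runs on a tiny hand-written table rather than the live page; `SAMPLE_INFOBOX`, `infobox_to_dict`, and the column name `field` are assumptions made for illustration, and the real infobox markup may differ.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the Wikipedia infobox markup.
SAMPLE_INFOBOX = """
<table class="infobox vevent">
  <tr><th colspan="2">The Office</th></tr>
  <tr><th>Genre</th><td><ul><li>Mockumentary</li><li>Workplace comedy</li></ul></td></tr>
  <tr><th colspan="2">Production</th></tr>
  <tr><th>Developed by</th><td>Greg Daniels</td></tr>
  <tr><th>No. of seasons</th><td>9</td></tr>
</table>
"""


def infobox_to_dict(html):
    """Map each infobox label to its value, using only rows that have both."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    info = {}
    for row in table.find_all("tr"):
        header = row.find("th")
        cell = row.find("td")
        # Section headings such as "Production" have no <td>, so skip them.
        if header is None or cell is None:
            continue
        key = header.get_text(" ", strip=True)
        # A separator keeps adjacent text nodes from being glued together.
        info[key] = cell.get_text(", ", strip=True)
    return info


info = infobox_to_dict(SAMPLE_INFOBOX)
df = pd.DataFrame(list(info.items()), columns=["field", "values"])

# to_string(index=False) renders the table without the index column.
print(df.to_string(index=False))
```

Passing a separator to `get_text` is also one way to avoid run-together values like `The Officeby Ricky GervaisStephen Merchant` in the output above.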
