# Intro to Web Scraping

## This project will cover: 
1. Introduction to Web Scraping
2. Creating requests using the requests package
3. Accessing elements via BeautifulSoup

#### Web scraping here relies on two packages:
1. requests - Sends the HTTP request and returns the server's response
2. BeautifulSoup - Parses the returned HTML into a BeautifulSoup object that can be searched in later steps

#### Commonly Used Scraping Options
1. soup.select('div') - Select all elements with the div tag
2. soup.select('#some_id') - Select the element whose id is some_id
3. soup.select('.some_class') - Select all elements with the class some_class
4. soup.select('div span') - Select span elements at any depth inside a div
5. soup.select('div > span') - Select span elements that are direct children of a div
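The five selectors above can be tried on a small inline document before touching a live site. This is a minimal sketch: the HTML snippet is made up for illustration, and it uses Python's built-in "html.parser" rather than "lxml" so that nothing beyond bs4 needs installing.

```python
import bs4

# Hypothetical HTML snippet, made up for illustration only.
html = """
<div id="some_id">
  <span class="some_class">direct child</span>
  <p><span>nested deeper</span></p>
</div>
<span>outside any div</span>
"""

# "html.parser" ships with Python (an assumption made here for portability).
soup = bs4.BeautifulSoup(html, "html.parser")

by_tag = soup.select("div")            # every <div> element
by_id = soup.select("#some_id")        # the element with id="some_id"
by_class = soup.select(".some_class")  # elements with class="some_class"
descendants = soup.select("div span")  # <span> at any depth inside a <div>
children = soup.select("div > span")   # <span> that is a direct child of a <div>

print(len(descendants))                    # 2
print(len(children))                       # 1
print([s.get_text() for s in children])    # ['direct child']
```

The last two selectors differ only for the span wrapped in a `<p>`: it is a descendant of the div but not a direct child, so only 'div span' matches it.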

## Imports

In [1]:
import requests
import bs4

## 1. Create Request

In [2]:
import requests
result = requests.get("http://www.example.com")
result.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <
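The request above assumes the server answers promptly and successfully. A slightly more defensive version might look like the sketch below. The helper name `fetch_html`, its `session` parameter, and the 10-second default timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the body of `url` as text, raising on HTTP error statuses.

    `fetch_html` and its parameters are hypothetical names for this sketch.
    Pass a requests.Session as `session` to reuse connections across calls.
    """
    http = session if session is not None else requests
    # Without a timeout, requests can wait indefinitely for a response.
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("http://www.example.com")` should return the same string as `result.text` in the cell above, provided the request succeeds.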

## 2. Parsing using BeautifulSoup

### 2.1 Creating a Soup

In [3]:
# "lxml" names the parser BeautifulSoup uses to build the tree
import bs4
soup = bs4.BeautifulSoup(result.text, "lxml") # Parses the HTML string into a searchable tree
soup

<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples

### 2.2 Selecting from the Soup

In [4]:
# Returns a ResultSet (a list of matching Tag objects)
h1 = soup.select('h1') # Pass in a CSS selector, here the h1 tag
type(h1)

bs4.element.ResultSet

In [5]:
# Returns a string
site_title = soup.select('title')[0].getText()
site_title

'Example Domain'

In [6]:
site_paragraphs = soup.select('p')[0].getText() # getText() drops the tags and returns only the text
site_paragraphs

'This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.'
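Indexing with `[0]` raises an IndexError when nothing matches the selector. One alternative, sketched below, is `select_one`, which returns the first match or None. The helper `first_text` and the inline HTML are assumptions made for this example, and "html.parser" is used so the sketch runs without lxml.

```python
import bs4

# Hypothetical, trimmed-down page used only for this sketch.
html = (
    "<html><head><title>Example Domain</title></head>"
    "<body><h1>Example Domain</h1></body></html>"
)
soup = bs4.BeautifulSoup(html, "html.parser")


def first_text(soup, selector, default=""):
    """Return the text of the first match for `selector`, or `default`.

    `first_text` is a hypothetical helper, not a bs4 function.
    """
    tag = soup.select_one(selector)  # first matching Tag, or None
    return tag.get_text(strip=True) if tag is not None else default


print(first_text(soup, "title"))         # Example Domain
print(first_text(soup, "table", "n/a"))  # n/a
```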

## 3. Common Search Options

### 3.1 Inspecting Elements

In [7]:
## Setting up a generalised method 

res = requests.get("https://en.wikipedia.org/wiki/Jonas_Salk")
res.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Jonas Salk - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"d3b886fe-d39b-4d31-8273-c7e6515a7f3f","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Jonas_Salk","wgTitle":"Jonas Salk","wgCurRevisionId":1019782015,"wgRevisionId":1019782015,"wgArticleId":25709692,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: archived copy as title","Webarchive template wayback links","CS1 Italian-language sources (it)","Articles with short description","Short description is differ

In [8]:
soup = bs4.BeautifulSoup(res.text, "lxml") # Parses the HTML string into a searchable tree
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Jonas Salk - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"d3b886fe-d39b-4d31-8273-c7e6515a7f3f","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Jonas_Salk","wgTitle":"Jonas Salk","wgCurRevisionId":1019782015,"wgRevisionId":1019782015,"wgArticleId":25709692,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: archived copy as title","Webarchive template wayback links","CS1 Italian-language sources (it)","Articles with short description","Short description is different fr

In [9]:
contents = soup.select('.toctext') # Select by CSS class
contents

[<span class="toctext">Early life and education</span>,
 <span class="toctext">Education</span>,
 <span class="toctext">Medical school</span>,
 <span class="toctext">Postgraduate research and early laboratory work</span>,
 <span class="toctext">Polio research</span>,
 <span class="toctext">Becoming a public figure</span>,
 <span class="toctext">Celebrity versus privacy</span>,
 <span class="toctext">Maintaining his individuality</span>,
 <span class="toctext">Establishing the Salk Institute</span>,
 <span class="toctext">AIDS vaccine work</span>,
 <span class="toctext">Salk's "biophilosophy"</span>,
 <span class="toctext">Personal life</span>,
 <span class="toctext">Honors and recognition</span>,
 <span class="toctext">Documentary films</span>,
 <span class="toctext">Salk's book publications</span>,
 <span class="toctext">See also</span>,
 <span class="toctext">References</span>,
 <span class="toctext">Further reading</span>,
 <span class="toctext">External links</span>]

In [10]:
for item in soup.select('.toctext'):
    print(item.text)

Early life and education
Education
Medical school
Postgraduate research and early laboratory work
Polio research
Becoming a public figure
Celebrity versus privacy
Maintaining his individuality
Establishing the Salk Institute
AIDS vaccine work
Salk's "biophilosophy"
Personal life
Honors and recognition
Documentary films
Salk's book publications
See also
References
Further reading
External links
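Printing is fine for inspection, but later processing usually wants the text in a list. The sketch below collects it with a list comprehension. The HTML is a made-up fragment shaped like a table of contents, not the live Wikipedia markup, and `get_text(strip=True)` is used to trim surrounding whitespace.

```python
import bs4

# Hypothetical fragment for illustration; not the live page.
html = """
<ul>
  <li><span class="tocnumber">1</span> <span class="toctext">Early life and education</span></li>
  <li><span class="tocnumber">2</span> <span class="toctext"> Polio research </span></li>
</ul>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# One string per element with class "toctext", whitespace stripped.
headings = [item.get_text(strip=True) for item in soup.select(".toctext")]
print(headings)  # ['Early life and education', 'Polio research']
```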


In [25]:
# Fly by wire - Example 1: Id

# 1. Create request (.text to display results)
# 2. Convert result via BeautifulSoup 
# 3. Select item/items of interest
# 4. Output the results. 

# 1. Create request
fbw_request = requests.get("https://en.wikipedia.org/wiki/Alan_Partridge")

# 2. Convert result via BeautifulSoup
soup = bs4.BeautifulSoup(fbw_request.text, "lxml")  # type(soup) = bs4.BeautifulSoup

# 3. Select item/items of interest
contents = soup.select('#firstHeading')  # Selected by id; type(contents) = bs4.element.ResultSet

# 4. Output the results
contents[0].getText()

'Alan Partridge'

In [94]:
# Fly by wire - Example 2: Images

fbw_request = requests.get("https://en.wikipedia.org/wiki/Alan_Partridge")
fbw_request.text

soup = bs4.BeautifulSoup(fbw_request.text, "lxml")
soup  # type(soup) = bs4.BeautifulSoup object

contents = soup.select('.image')[0]  # First element with class "image"
type(contents)
# contents['src']  # Only valid if the matched tag itself carries a src attribute

bs4.element.Tag

In [95]:
html = '''
<img src="smiley.gif" alt="Smiley face" height="42" width="42">'''

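soup = bs4.BeautifulSoup(html, "lxml") # Name the parser explicitly to avoid a bs4 warning
image = soup.find('img') # find() returns the first matching img tag
print(image['src']) #smiley.gif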

smiley.gif
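Subscripting a tag with `['src']` raises a KeyError when the attribute is absent, for example when the selector matches a wrapping link instead of the image itself. The sketch below shows two ways around that: `.get()` on a tag, and the `img[src]` attribute selector. The HTML is a made-up fragment, including the `image` class on the link.

```python
import bs4

# Hypothetical fragment: a link wrapping an image, plus an <img> with no src.
html = """
<a class="image" href="/wiki/File:smiley.gif">
  <img src="smiley.gif" alt="Smiley face" height="42" width="42">
</a>
<img alt="no source here">
"""
soup = bs4.BeautifulSoup(html, "html.parser")

link = soup.select_one(".image")
print(link.name)        # a
print(link.get("src"))  # None, where link["src"] would raise KeyError

# Only <img> tags that actually have a src attribute.
sources = [img["src"] for img in soup.select("img[src]")]
print(sources)          # ['smiley.gif']

# Reach the image nested inside the link.
nested = soup.select_one("a.image img")
print(nested["alt"])    # Smiley face
```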
