### Intro to Web Scraping


Scraping is not always legal!  

Some rules to consider: 
* Be respectful: do not bombard a website with scraping requests, or your IP address may get blocked
* Check the website's permissions (terms of use and `robots.txt`) before you begin! If there is an API available, use it. Many websites won't let you use their data commercially.
* Each website is unique and may change over time, so you may need to customize your scraping code for each website and update it when the site changes
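
One way to check a site's permissions programmatically is to read its `robots.txt`. The sketch below parses a set of rules with Python's standard `urllib.robotparser`. The sample rules, the `my-scraper` user agent, and the `example.com` URLs are made up for illustration, and `robots.txt` is only one signal: it does not replace reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real site publishes its own at /robots.txt
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/page"))  # False
print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/public/page"))   # True
```

To check a live site instead of a string, `RobotFileParser` also offers `set_url()` and `read()`, which download the file for you.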


When is it a good idea to scrape a website?
* An API is not available, or the information you want is not in the API
* You want to scrape a website anonymously (use a VPN)

Here is a Web Scraping Sandbox where you can practice scraping: 
http://toscrape.com/

Today, we're going to start by scraping Wikipedia (https://en.wikipedia.org) because its content is freely licensed and it is generally *legal* to scrape

Make sure you install requests and Beautiful Soup (imported as `bs4`) via the terminal

* pip install requests
* pip install beautifulsoup4

or, if you're using Anaconda,

* conda install requests
* conda install beautifulsoup4

or install them from inside the notebook

* !pip install requests
* !pip install beautifulsoup4

In [1]:
# The requests library will grab the page
import requests

response = requests.get("https://en.wikipedia.org/wiki/Ada_Lovelace")
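
A plain `requests.get(url)` does not raise an error when the server answers with an error page, and some sites reject requests that lack a descriptive `User-Agent`. The sketch below wraps the request in a hypothetical helper, `fetch_html`, that sends an identifying header, sets a timeout, and raises on HTTP errors. The header value and the 10-second timeout are placeholder assumptions, not values required by any particular site.

```python
import requests

# Hypothetical identifier: replace it with your own project name and contact details
HEADERS = {"User-Agent": "intro-scraping-notebook/0.1 (contact: you@example.com)"}


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page HTML, raising an exception on HTTP errors (4xx/5xx)."""
    response = get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # turns an error status into a requests.HTTPError
    return response.text
```

Calling `fetch_html("https://en.wikipedia.org/wiki/Ada_Lovelace")` would then return the same text as `response.text`, or fail with a clear exception instead of silently returning an error page. The `get` parameter exists only so the function can be tried out without network access.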

In [2]:
response.text

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Ada Lovelace - Wikipedia</title>\n<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled";(function(){var cookie=document.cookie.ma

In [3]:
# The BeautifulSoup library parses the downloaded HTML so it is easier to read and search

import bs4 
soup = bs4.BeautifulSoup(response.text, 'html.parser')

In [4]:
soup

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Ada Lovelace - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled";(function(){var cookie=document.cookie.match(/

In [5]:
# Next, inspect the elements on the wiki page. I want to grab the section headlines,
# which are in <span> tags with class="mw-headline"

soup.select(".mw-headline")

[<span class="mw-headline" id="Biography">Biography</span>,
 <span class="mw-headline" id="Childhood">Childhood</span>,
 <span class="mw-headline" id="Adult_years">Adult years</span>,
 <span class="mw-headline" id="Education">Education</span>,
 <span class="mw-headline" id="Death">Death</span>,
 <span class="mw-headline" id="Work">Work</span>,
 <span class="mw-headline" id="First_computer_program">First computer program</span>,
 <span class="mw-headline" id="Insight_into_potential_of_computing_devices">Insight into potential of computing devices</span>,
 <span class="mw-headline" id="Distinction_between_mechanism_and_logical_structure">Distinction between mechanism and logical structure</span>,
 <span class="mw-headline" id="Controversy_over_contribution">Controversy over contribution</span>,
 <span class="mw-headline" id="In_popular_culture">In popular culture</span>,
 <span class="mw-headline" id="1810s">1810s</span>,
 <span class="mw-headline" id="1970s">1970s</span>,
 <span class="

In [6]:
# Create a list for the scraped headlines

headlines = []
for item in soup.select(".mw-headline"):
    headlines.append(item.text)

In [7]:
headlines

['Biography',
 'Childhood',
 'Adult years',
 'Education',
 'Death',
 'Work',
 'First computer program',
 'Insight into potential of computing devices',
 'Distinction between mechanism and logical structure',
 'Controversy over contribution',
 'In popular culture',
 '1810s',
 '1970s',
 '1990s',
 '2000s',
 '2010s',
 '2020s',
 'Commemoration',
 'Bicentenary',
 'Publications',
 'Publication history',
 'See also',
 'Explanatory notes',
 'References',
 'General and cited sources',
 'Further reading',
 'External links']
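
Because websites change their markup, a selector such as `.mw-headline` can stop matching and silently return an empty list. The sketch below shows one defensive pattern: try the selector used above and fall back to a second one if nothing is found. Both HTML snippets are hand-written, and the second layout (`.mw-heading` wrapping an `<h2>`) is an assumption about how a page might be restructured, so inspect the live page before relying on it. `heading_texts` is a hypothetical helper name.

```python
import bs4

# Hand-written snippets for illustration; the second layout is an assumption
OLD_MARKUP = '<h2><span class="mw-headline" id="Biography">Biography</span></h2>'
NEW_MARKUP = '<div class="mw-heading mw-heading2"><h2 id="Biography">Biography</h2></div>'


def heading_texts(html):
    """Return heading texts, trying the original selector first, then a fallback."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    found = soup.select(".mw-headline")
    if not found:
        # Fallback selector for the assumed alternative layout
        found = soup.select(".mw-heading h2, .mw-heading h3")
    return [tag.get_text(strip=True) for tag in found]


print(heading_texts(OLD_MARKUP))  # ['Biography']
print(heading_texts(NEW_MARKUP))  # ['Biography']
```

Checking that the result is not empty before moving on is a cheap way to notice that a page layout has changed.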

In [8]:
# To save to a CSV, we first want to create a DataFrame for the data

import pandas as pd

headline_df = pd.DataFrame({'headlines': headlines})


In [9]:
headline_df

| | headlines |
|---|---|
| 0 | Biography |
| 1 | Childhood |
| 2 | Adult years |
| 3 | Education |
| 4 | Death |
| 5 | Work |
| 6 | First computer program |
| 7 | Insight into potential of computing devices |
| 8 | Distinction between mechanism and logical stru... |
| 9 | Controversy over contribution |


In [10]:
# Save to csv 
headline_df.to_csv('headline.csv')
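
By default, `to_csv` also writes the DataFrame's index as an unnamed first column, and reading the file back gives that column the name `Unnamed: 0`. The sketch below compares the default with `index=False`, writing to in-memory buffers instead of files so nothing is left on disk. The two-row DataFrame is a stand-in for the scraped headlines.

```python
import io

import pandas as pd

# Small stand-in for the scraped headlines
headline_df = pd.DataFrame({"headlines": ["Biography", "Childhood"]})

with_index = io.StringIO()
headline_df.to_csv(with_index)  # default: the index is written as the first column

without_index = io.StringIO()
headline_df.to_csv(without_index, index=False)  # only the data columns are written

print(with_index.getvalue())
print(without_index.getvalue())

# Reading the default output back shows where "Unnamed: 0" comes from
reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))
print(list(reloaded.columns))  # ['Unnamed: 0', 'headlines']
```

Whether to keep the index is a matter of taste. For a simple list of headlines, `headline_df.to_csv('headline.csv', index=False)` gives a cleaner file.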

In [11]:
# Next, the images: go through the HTML and look for something all the images have in common.
# The class 'thumbimage' applies to all the thumbnail images.

image_info = soup.select('.thumbimage')

# image_info = soup.select('.image')

In [12]:
image_info

[<img alt="Ada Byron, portrait at age four" class="thumbimage" data-file-height="1376" data-file-width="1352" decoding="async" height="224" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Miniature_of_Ada_Byron.jpg/220px-Miniature_of_Ada_Byron.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Miniature_of_Ada_Byron.jpg/330px-Miniature_of_Ada_Byron.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Miniature_of_Ada_Byron.jpg/440px-Miniature_of_Ada_Byron.jpg 2x" width="220"/>,
 <img alt="Ada Byron, portrait at age 7" class="thumbimage" data-file-height="856" data-file-width="974" decoding="async" height="193" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/de/Ada_Lovelace_child_portrait_Somerville_College.jpg/220px-Ada_Lovelace_child_portrait_Somerville_College.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/de/Ada_Lovelace_child_portrait_Somerville_College.jpg/330px-Ada_Lovelace_child_portrait_Somerville_College.jpg 1.5x, //upload.w

In [13]:
len(image_info)

10

In [14]:
# We're creating a list of the links for the thumbnails

links = []
for link in image_info:
    # the link is in the 'src' attribute
    item = link.get('src')
    print(item)
    # this check is needed because some <div> tags also have the class "thumbimage";
    # they have no 'src', so get() returns None for them
    if isinstance(item, str):
        # we're adding "https:" to turn the protocol-relative link into a full URL
        links.append("https:" + item)

//upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Miniature_of_Ada_Byron.jpg/220px-Miniature_of_Ada_Byron.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/d/de/Ada_Lovelace_child_portrait_Somerville_College.jpg/220px-Ada_Lovelace_child_portrait_Somerville_College.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/b/b1/Ada_Byron_aged_seventeen_%281832%29.jpg/200px-Ada_Byron_aged_seventeen_%281832%29.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Ada_Lovelace_portrait.jpg/220px-Ada_Lovelace_portrait.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/1/16/Ada_Lovelace_sonnet_The_Rainbow_Somerville_College.JPG/220px-Ada_Lovelace_sonnet_The_Rainbow_Somerville_College.JPG
//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/Ada_Lovelace_in_1852.jpg/220px-Ada_Lovelace_in_1852.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Ada_Lovelace.jpg/220px-Ada_Lovelace.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Diagram_for_the_computation_of_Bernoulli_numbers.jpg/220px-D

In [15]:
links

['https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Miniature_of_Ada_Byron.jpg/220px-Miniature_of_Ada_Byron.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Ada_Lovelace_child_portrait_Somerville_College.jpg/220px-Ada_Lovelace_child_portrait_Somerville_College.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b1/Ada_Byron_aged_seventeen_%281832%29.jpg/200px-Ada_Byron_aged_seventeen_%281832%29.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Ada_Lovelace_portrait.jpg/220px-Ada_Lovelace_portrait.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/16/Ada_Lovelace_sonnet_The_Rainbow_Somerville_College.JPG/220px-Ada_Lovelace_sonnet_The_Rainbow_Somerville_College.JPG',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/Ada_Lovelace_in_1852.jpg/220px-Ada_Lovelace_in_1852.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Ada_Lovelace.jpg/220px-Ada_Lovelace.jpg',
 'https://upload.wikimedia.org/wikipedia/c
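
Prefixing `"https:"` works for protocol-relative links like the ones above (`//upload.wikimedia.org/...`), but other pages may use relative or already absolute links. As a more general sketch, `urllib.parse.urljoin` from the standard library resolves any of these forms against the page URL. The example `src` values and the `absolute_url` helper are made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Ada_Lovelace"


def absolute_url(src, page_url=PAGE_URL):
    """Resolve a src attribute (relative, protocol-relative or absolute) to a full URL."""
    return urljoin(page_url, src)


# Illustrative inputs, not taken from the real page
print(absolute_url("//upload.wikimedia.org/example/a.jpg"))  # https://upload.wikimedia.org/example/a.jpg
print(absolute_url("/static/images/logo.png"))               # https://en.wikipedia.org/static/images/logo.png
print(absolute_url("https://example.org/b.png"))             # https://example.org/b.png
```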

In [16]:
import urllib.request  # to download the files
import urllib.parse  # to work with URLs
from urllib.error import HTTPError  # to catch HTTP errors
import time  # to pause between downloads
import random as r  # to pause for a random amount of time

In [17]:
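def download(url, full_path):
    try:
        urllib.request.urlretrieve(url, full_path)
    except HTTPError as err:
        # report the HTTP status code (e.g. 404) and move on to the next file
        print(err.code)
    # pause 1-5 seconds after every attempt so we don't bombard the server
    time.sleep(r.randint(1, 5))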
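
Some download failures are temporary, so a retry with a growing pause can help while staying polite to the server. The sketch below is one possible pattern, not part of the original notebook: `with_retries`, its three attempts, and the doubling delay are all assumptions chosen for illustration. It catches `OSError`, the base class of `urllib.error.HTTPError`.

```python
import time


def with_retries(action, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action(); on OSError wait and retry, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return action()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # waits 1s, then 2s, ...
```

It could wrap a download as `with_retries(lambda: urllib.request.urlretrieve(url, full_path))`. The `sleep` parameter is there so the waiting can be replaced when trying the function out.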

In [18]:
# enumerate() takes 2 parameters: an iterable and an optional start index
# it yields (index, item) pairs as it goes through the iterable (in our case, a list of links)

import os

# first, create a folder called "images" (no error if it already exists)
os.makedirs('images', exist_ok=True)

for index, url in enumerate(links):
    print(url)
    file_name = 'img_' + str(index) + '.jpg'
    file_path = 'images/'
    full_path = '{}{}'.format(file_path, file_name)
    print(file_name)
    download(url, full_path)

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Miniature_of_Ada_Byron.jpg/220px-Miniature_of_Ada_Byron.jpg
img_0.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Ada_Lovelace_child_portrait_Somerville_College.jpg/220px-Ada_Lovelace_child_portrait_Somerville_College.jpg
img_1.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/b/b1/Ada_Byron_aged_seventeen_%281832%29.jpg/200px-Ada_Byron_aged_seventeen_%281832%29.jpg
img_2.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Ada_Lovelace_portrait.jpg/220px-Ada_Lovelace_portrait.jpg
img_3.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/1/16/Ada_Lovelace_sonnet_The_Rainbow_Somerville_College.JPG/220px-Ada_Lovelace_sonnet_The_Rainbow_Somerville_College.JPG
img_4.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/Ada_Lovelace_in_1852.jpg/220px-Ada_Lovelace_in_1852.jpg
img_5.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Ada_Lovelace.jpg/220px-Ada_Lovelace.jpg
img_6.jpg
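
The loop above saves every file with a `.jpg` extension, which is fine for these thumbnails but would mislabel a PNG or SVG. As a hedged alternative, the sketch below derives the extension from the URL itself using only the standard library. `file_name_for` is a hypothetical helper, the fallback to `.jpg` is an assumption, and the second and third URLs are made up.

```python
import os
from urllib.parse import unquote, urlparse


def file_name_for(index, url):
    """Build 'img_<index><ext>', taking the extension from the URL path."""
    path = unquote(urlparse(url).path)  # decode %28 etc. and drop any query string
    extension = os.path.splitext(path)[1].lower() or ".jpg"  # assumed fallback
    return "img_{}{}".format(index, extension)


print(file_name_for(0, "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Miniature_of_Ada_Byron.jpg/220px-Miniature_of_Ada_Byron.jpg"))  # img_0.jpg
print(file_name_for(1, "https://example.org/thumb/Diagram.svg/220px-Diagram.svg.png"))  # img_1.png
print(file_name_for(2, "https://example.org/image"))  # img_2.jpg
```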
