### Intro to Web Scraping


Scraping is not always legal!  

Some rules to consider: 
* Be respectful: do not bombard a website with scraping requests, or your IP address may get blocked
* Check the website's permissions (terms of use, robots.txt) before you begin! If there is an API available, use it. Most websites won't let you use their data commercially.
* Each website is unique and may change over time, so you may need to customize your scraping code for each website and update it when the site changes
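
One way to act on the permission rule is to read a site's robots.txt before scraping. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt text, the `my-scraper` user agent, and the example.com URLs are made up for illustration, and the file is inlined so the example runs offline. For a real site you would typically call `set_url(...)` and `read()` on the parser instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_txt):
    """Return a RobotFileParser loaded from a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is a made-up user agent name.
allowed = parser.can_fetch("my-scraper", "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
delay = parser.crawl_delay("my-scraper")  # seconds to wait between requests, or None

print(allowed, blocked, delay)
```

robots.txt only expresses the site owner's wishes for crawlers; it does not replace reading the terms of use.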


When is it a good idea to scrape a website?
* An API is not available, or the information you want is not in the API
* You want to anonymously scrape a website (use a VPN)

Here is a Web Scraping Sandbox where you can practice scraping: 
http://toscrape.com/

If you want to go deeper into scraping using additional libraries, visit this tutorial by Sam Lavigne: 
https://scrapism.lav.io/


Today, we're going to start by scraping Wikipedia (https://en.wikipedia.org) because it is *legal* to scrape.

Make sure you install requests and bs4 via the terminal:

* pip install requests
* pip install bs4

or if you're using Anaconda 

* conda install requests
* conda install bs4

or install them from within the notebook

* !pip install requests
* !pip install bs4 

In [1]:
# The requests library will grab the page
import requests

response = requests.get("https://en.wikipedia.org/wiki/Art")
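
The call above assumes the request succeeds. A slightly more defensive sketch is shown here: it sets a timeout, identifies the script with a `User-Agent` header, and raises an exception on HTTP error codes. The function name `fetch_html`, the header value, and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests

# Hypothetical identification string; replace it with your own project and contact details.
HEADERS = {"User-Agent": "scraping-tutorial/0.1 (contact: you@example.com)"}

def fetch_html(url, timeout=10):
    """Return the page body as text, or raise if the request fails."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text

# Usage (performs a real network request, so it is left commented out here):
# html = fetch_html("https://en.wikipedia.org/wiki/Art")
```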

In [2]:
response.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Art - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"35d6d314-8bd5-4a63-87ed-b62177dfb62c","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Art","wgTitle":"Art","wgCurRevisionId":1040765740,"wgRevisionId":1040765740,"wgArticleId":752,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: long volume value","CS1 maint: archived copy as title","All articles with dead external links","Articles with dead external links from May 2016","Webarchive template wayback links","CS1 Germ

In [3]:
# The BeautifulSoup library parses the HTML so you can search and navigate the extracted page

import bs4 
soup = bs4.BeautifulSoup(response.text, 'html.parser')

In [4]:
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Art - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"35d6d314-8bd5-4a63-87ed-b62177dfb62c","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Art","wgTitle":"Art","wgCurRevisionId":1040765740,"wgRevisionId":1040765740,"wgArticleId":752,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: long volume value","CS1 maint: archived copy as title","All articles with dead external links","Articles with dead external links from May 2016","Webarchive template wayback links","CS1 German-la

In [5]:
# Next, inspect the elements on the wiki page. I want to grab the headlines,
# which are in <span> elements with class="mw-headline"

soup.select(".mw-headline")

[<span class="mw-headline" id="Overview">Overview</span>,
 <span class="mw-headline" id="History">History</span>,
 <span class="mw-headline" id="Forms,_genres,_media,_and_styles">Forms, genres, media, and styles</span>,
 <span class="mw-headline" id="Skill_and_craft">Skill and craft</span>,
 <span class="mw-headline" id="Purpose">Purpose</span>,
 <span class="mw-headline" id="Non-motivated_functions">Non-motivated functions</span>,
 <span class="mw-headline" id="Motivated_functions">Motivated functions</span>,
 <span class="mw-headline" id="Public_access">Public access</span>,
 <span class="mw-headline" id="Controversies">Controversies</span>,
 <span class="mw-headline" id="Theory">Theory</span>,
 <span class="mw-headline" id="Arrival_of_Modernism">Arrival of Modernism</span>,
 <span class="mw-headline" id='New_Criticism_and_the_"intentional_fallacy"'>New Criticism and the "intentional fallacy"</span>,
 <span class="mw-headline" id='"Linguistic_turn"_and_its_debate'>"Linguistic turn" a

In [6]:
# Create a list for the scraped headlines

headlines = []
for item in soup.select(".mw-headline"):
    headlines.append(item.text)
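
The same extraction can be tried on a small HTML fragment, which makes it easier to see what `select` returns. The fragment below is invented for illustration and the `sample_` names are hypothetical, chosen so they do not overwrite the notebook's `soup` and `headlines`. It also shows a list comprehension as an alternative to the append loop.

```python
import bs4

# Hypothetical HTML fragment standing in for a downloaded page.
sample_html = """
<h2><span class="mw-headline" id="Overview"> Overview </span></h2>
<h2><span class="mw-headline" id="History">History</span></h2>
<p class="other">Not a headline</p>
"""

sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector; ".mw-headline" matches every tag with that class.
spans = sample_soup.select(".mw-headline")

# List-comprehension version of the append loop; strip=True trims surrounding whitespace.
sample_headlines = [span.get_text(strip=True) for span in spans]

# Attributes are read with .get(), which returns None if the attribute is missing.
sample_ids = [span.get("id") for span in spans]

print(sample_headlines)  # ['Overview', 'History']
print(sample_ids)        # ['Overview', 'History']
```

If a selector returns an empty list on a real page, the site's markup may differ from what the code expects, so inspect the page again.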

In [7]:
headlines

['Overview',
 'History',
 'Forms, genres, media, and styles',
 'Skill and craft',
 'Purpose',
 'Non-motivated functions',
 'Motivated functions',
 'Public access',
 'Controversies',
 'Theory',
 'Arrival of Modernism',
 'New Criticism and the "intentional fallacy"',
 '"Linguistic turn" and its debate',
 'Classification disputes',
 'Value judgment',
 'Art and law',
 'See also',
 'Notes',
 'Bibliography',
 'Further reading',
 'External links']

In [8]:
# To save to a CSV, we first want to create a dataframe for the data

import pandas as pd

# build the dataframe directly from the list, with one column named 'headlines'
headline_df = pd.DataFrame({'headlines': headlines})



In [9]:
headline_df

                          headlines
0                          Overview
1                           History
2  Forms, genres, media, and styles
3                   Skill and craft
4                           Purpose
5           Non-motivated functions
6               Motivated functions
7                     Public access
8                     Controversies
9                            Theory


In [10]:
# Save to csv 
headline_df.to_csv('headline.csv')
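
By default, `to_csv` also writes the dataframe's index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below illustrates the difference with `index=False`; it writes to in-memory buffers instead of files, and the two-row `sample_df` is made up for the example.

```python
import io

import pandas as pd

# Small stand-in for headline_df.
sample_df = pd.DataFrame({"headlines": ["Overview", "History"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'headlines']
print(cols_without_index)  # ['headlines']
```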

In [11]:
# We're starting by going through the HTML and looking for something all the images have in common 
# the class 'thumbimage' applies to all the images 

image_info = soup.select('.thumbimage')

In [12]:
image_info

[<img alt="" class="thumbimage" data-file-height="800" data-file-width="800" decoding="async" height="330" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/34/Art-portrait-collage_2.jpg/330px-Art-portrait-collage_2.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/34/Art-portrait-collage_2.jpg/495px-Art-portrait-collage_2.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/34/Art-portrait-collage_2.jpg/660px-Art-portrait-collage_2.jpg 2x" width="330"/>,
 <img alt="" class="thumbimage" data-file-height="1280" data-file-width="853" decoding="async" height="255" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Teke_bottle.JPG/170px-Teke_bottle.JPG" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Teke_bottle.JPG/255px-Teke_bottle.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Teke_bottle.JPG/340px-Teke_bottle.JPG 2x" width="170"/>,
 <img alt="" class="thumbimage" data-file-height="2220" data-file-width="1350" decoding="async" heigh

In [13]:
len(image_info)

22

In [14]:
# We're creating a list of the links for the thumbnails

links = []
for link in image_info:
    # the link is in the 'src' attribute; .get() returns None if the tag has no 'src'
    item = link.get('src')
    print(item)
    # this check is needed because some non-image tags (e.g. <div>) can also carry the class
    if isinstance(item, str):
        # the src starts with "//", so we're adding "https:" to format it properly
        links.append("https:" + item)

//upload.wikimedia.org/wikipedia/commons/thumb/3/34/Art-portrait-collage_2.jpg/330px-Art-portrait-collage_2.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Teke_bottle.JPG/170px-Teke_bottle.JPG
//upload.wikimedia.org/wikipedia/commons/thumb/1/19/Venus_of_Willendorf_frontview_retouched_2.jpg/170px-Venus_of_Willendorf_frontview_retouched_2.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/0/07/Oval_basin_or_dish_with_subject_from_Amadis_of_Gaul_MET_DP320592.jpg/220px-Oval_basin_or_dish_with_subject_from_Amadis_of_Gaul_MET_DP320592.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/0/07/Lascaux2.jpg/220px-Lascaux2.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/9/96/Tugra_Mahmuds_II.gif/220px-Tugra_Mahmuds_II.gif
//upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Great_Mosque_of_Kairouan_Panorama_-_Grande_Mosqu%C3%A9e_de_Kairouan_Panorama.jpg/220px-Great_Mosque_of_Kairouan_Panorama_-_Grande_Mosqu%C3%A9e_de_Kairouan_Panorama.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/e
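
The printed values start with `//`, which is a protocol-relative URL: the browser fills in the scheme of the page it came from. Instead of prepending `"https:"` by hand, the standard library's `urljoin` can resolve such values against the page URL, and it also handles site-relative paths. A minimal sketch, where `absolutize` is a hypothetical helper name and `/wiki/Painting` is just an example path:

```python
from urllib.parse import urljoin

# The page the src values were scraped from.
PAGE_URL = "https://en.wikipedia.org/wiki/Art"

def absolutize(src, page_url=PAGE_URL):
    """Resolve a src value (absolute, protocol-relative, or site-relative) to a full URL."""
    return urljoin(page_url, src)

protocol_relative = absolutize(
    "//upload.wikimedia.org/wikipedia/commons/thumb/0/07/Lascaux2.jpg/220px-Lascaux2.jpg"
)
site_relative = absolutize("/wiki/Painting")
already_absolute = absolutize("https://example.com/a.png")

print(protocol_relative)
print(site_relative)
print(already_absolute)
```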

In [15]:
links

['https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Art-portrait-collage_2.jpg/330px-Art-portrait-collage_2.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Teke_bottle.JPG/170px-Teke_bottle.JPG',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Venus_of_Willendorf_frontview_retouched_2.jpg/170px-Venus_of_Willendorf_frontview_retouched_2.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/07/Oval_basin_or_dish_with_subject_from_Amadis_of_Gaul_MET_DP320592.jpg/220px-Oval_basin_or_dish_with_subject_from_Amadis_of_Gaul_MET_DP320592.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/07/Lascaux2.jpg/220px-Lascaux2.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/9/96/Tugra_Mahmuds_II.gif/220px-Tugra_Mahmuds_II.gif',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Great_Mosque_of_Kairouan_Panorama_-_Grande_Mosqu%C3%A9e_de_Kairouan_Panorama.jpg/220px-Great_Mosque_of_Kairouan_Panorama_-_Grande_Mosqu%C3%A9e_de_Ka

In [16]:
import urllib.request  # to download files
import urllib.parse  # to parse and build URLs
from urllib.error import HTTPError  # to catch HTTP errors
import time  # to pause between downloads
import random as r  # to pick a random pause length

In [17]:
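def download(url, full_path):
    """Save the file at url to full_path, then pause 1 to 5 seconds."""
    try:
        urllib.request.urlretrieve(url, full_path)
        # wait a random number of seconds so we don't bombard the server
        time.sleep(r.randint(1, 5))
    except HTTPError as err:
        # print the HTTP status code and carry on with the next file
        print(err.code)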
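
`download` leaves the choice of `full_path` to the caller. The scraped links include `.gif` and `.JPG` files, so if the saved files should keep the extension of the source image, one option is to derive the file name from the URL. A minimal sketch using only the standard library; `local_name` is a hypothetical helper, and falling back to `.jpg` when the URL has no extension is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

def local_name(index, url, prefix="img_", default_suffix=".jpg"):
    """Build a file name such as 'img_5.gif' from an index and an image URL."""
    # keep only the path part of the URL and decode escapes such as %C3%A9
    path = unquote(urlparse(url).path)
    # the suffix is the extension of the last path component, lowercased for consistency
    suffix = PurePosixPath(path).suffix.lower() or default_suffix
    return f"{prefix}{index}{suffix}"

gif_name = local_name(
    5,
    "https://upload.wikimedia.org/wikipedia/commons/thumb/9/96/Tugra_Mahmuds_II.gif/220px-Tugra_Mahmuds_II.gif",
)
jpg_name = local_name(
    1,
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Teke_bottle.JPG/170px-Teke_bottle.JPG",
)

print(gif_name)  # img_5.gif
print(jpg_name)  # img_1.jpg
```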

In [18]:
# enumerate() takes 2 parameters: iterable and start (optional)
# it yields (index, item) pairs from the iterable (in our case, a list of links)

import os

# first, create a folder called "images" (no error if it already exists)
os.makedirs('images', exist_ok=True)

for index, url in enumerate(links):
    print(url)
    file_name = 'img_' + str(index) + '.jpg'
    file_path = 'images/'
    full_path = '{}{}'.format(file_path, file_name)
    print(file_name)
    download(url, full_path)

https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Art-portrait-collage_2.jpg/330px-Art-portrait-collage_2.jpg
img_0.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Teke_bottle.JPG/170px-Teke_bottle.JPG
img_1.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Venus_of_Willendorf_frontview_retouched_2.jpg/170px-Venus_of_Willendorf_frontview_retouched_2.jpg
img_2.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/0/07/Oval_basin_or_dish_with_subject_from_Amadis_of_Gaul_MET_DP320592.jpg/220px-Oval_basin_or_dish_with_subject_from_Amadis_of_Gaul_MET_DP320592.jpg
img_3.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/0/07/Lascaux2.jpg/220px-Lascaux2.jpg
img_4.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/9/96/Tugra_Mahmuds_II.gif/220px-Tugra_Mahmuds_II.gif
img_5.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Great_Mosque_of_Kairouan_Panorama_-_Grande_Mosqu%C3%A9e_de_Kairouan_Panorama.jpg/220px-Great_Mosque_of_Kairouan_Pa