### Intro to Web Scraping


Scraping is not always legal!  

Some rules to consider:
* Be respectful: do not bombard a website with scraping requests, or your IP address may get blocked.
* Check the website's permissions (its terms of use and `robots.txt`) before you begin! If there is an API available, use it. Many websites won't let you use their data commercially.
* Each website is unique and may change over time, so you may need to customize your scraping code for each website and update it when the site changes.
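
One way to check a site's permissions is to read its `robots.txt` file. Here is a minimal sketch using Python's built-in `urllib.robotparser`. The `robots.txt` content, the `lesson-bot` user agent, the `example.com` URLs, and the helper name `is_allowed` are all made up so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only so the example runs offline
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# "lesson-bot" is a hypothetical user agent name
print(is_allowed(SAMPLE_ROBOTS_TXT, "lesson-bot", "https://example.com/public/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "lesson-bot", "https://example.com/private/page"))
```

For a live website you would instead point the parser at the real file with `parser.set_url(...)` and `parser.read()`. Keep in mind that `robots.txt` is only one signal; the site's terms of use still apply.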


When is it a good idea to scrape a website?
* An API is not available, or the information you want is not in the API
* You want to scrape a website anonymously (use a VPN)

Here is a Web Scraping Sandbox where you can practice scraping: 
http://toscrape.com/

Today, we're going to start by scraping www.wikipedia.org because it is *legal* to scrape.

This lesson was adapted from: https://github.com/Pierian-Data/Complete-Python-3-Bootcamp/blob/master/13-Web-Scraping/00-Guide-to-Web-Scraping.ipynb

Make sure you install requests and bs4 from the terminal:

* pip install requests
* pip install bs4

or, if you're using Anaconda:

* conda install requests
* conda install bs4

or install them from within a notebook:

* !pip install requests
* !pip install bs4

In [2]:
# The requests library will grab the page
import requests

response = requests.get("https://en.wikipedia.org/wiki/John_Lennon")
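
Before using the response, it helps to confirm that the request actually succeeded. The sketch below wraps the call in a small helper. The name `fetch_html`, the `REQUEST_HEADERS` dictionary, the User-Agent string, and the 10-second timeout are assumptions for illustration, not part of the original lesson.

```python
import requests

# Hypothetical identification string; replace it with your own details
REQUEST_HEADERS = {"User-Agent": "intro-scraping-lesson/0.1 (contact: you@example.com)"}

def fetch_html(url, timeout=10):
    """Return the page's HTML, raising an error for 4xx/5xx responses."""
    response = requests.get(url, headers=REQUEST_HEADERS, timeout=timeout)
    # raise_for_status() turns error status codes into a requests.HTTPError
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://en.wikipedia.org/wiki/John_Lennon")` should return the same kind of string as `response.text`, but it fails loudly if the page could not be fetched.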

In [3]:
response.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>John Lennon - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"X8@9BgpAICgAAA7h8JsAAACX","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"John_Lennon","wgTitle":"John Lennon","wgCurRevisionId":993067073,"wgRevisionId":993067073,"wgArticleId":15852,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","Articles with short description","Short description is different from Wikidata","Articles to be merged from November 2020","All articles to be merg

In [4]:
# The BeautifulSoup library parses the HTML so you can search and navigate the extracted page

import bs4 
soup = bs4.BeautifulSoup(response.text, 'html.parser')
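
To see what the parsed `soup` object gives you, here is a small sketch on a tiny made-up page (the HTML and the variable names are for illustration only). Once parsed, you can reach tags by name or search for them.

```python
import bs4

# A tiny, made-up page used only for illustration
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="top">Hello</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup_demo = bs4.BeautifulSoup(html, "html.parser")

page_title = soup_demo.title.text               # text inside <title>
first_paragraph = soup_demo.find("p").text      # first <p> on the page
paragraph_count = len(soup_demo.find_all("p"))  # number of <p> tags

print(page_title)
print(first_paragraph)
print(paragraph_count)
```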

In [5]:
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>John Lennon - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"X8@9BgpAICgAAA7h8JsAAACX","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"John_Lennon","wgTitle":"John Lennon","wgCurRevisionId":993067073,"wgRevisionId":993067073,"wgArticleId":15852,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","Articles with short description","Short description is different from Wikidata","Articles to be merged from November 2020","All articles to be merged","

In [6]:
# Next, inspect the elements on the wiki page. I want to grab the headlines,
# which are in <span> elements with class="mw-headline"

soup.select(".mw-headline")

[<span class="mw-headline" id="Biography">Biography</span>,
 <span class="mw-headline" id="1940–1957:_Early_years">1940–1957: Early years</span>,
 <span class="mw-headline" id="1956–1970:_The_Quarrymen_to_the_Beatles">1956–1970: The Quarrymen to the Beatles</span>,
 <span class="mw-headline" id="1956–1966:_Formation,_fame_and_touring">1956–1966: Formation, fame and touring</span>,
 <span class="mw-headline" id="1966–1970:_Studio_years,_break-up_and_solo_work">1966–1970: Studio years, break-up and solo work</span>,
 <span class="mw-headline" id="1970–1980:_Solo_career">1970–1980: Solo career</span>,
 <span class="mw-headline" id="1970–1972:_Initial_solo_success_and_activism">1970–1972: Initial solo success and activism</span>,
 <span class="mw-headline" id='1973–1975:_"Lost_weekend"'>1973–1975: "Lost weekend"</span>,
 <span class="mw-headline" id="1975–1980:_Hiatus_and_return">1975–1980: Hiatus and return</span>,
 <span class="mw-headline" id="8_December_1980:_Murder">8 December 1980: M

In [7]:
# Create a list for the scraped headlines

headlines = []
for item in soup.select(".mw-headline"):
    headlines.append(item.text)
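
The same select-then-collect pattern can be tried on a small made-up snippet, which is handy because a live page's markup may have changed since this lesson was written (in which case the selector simply returns an empty list). The HTML and variable names below are for illustration only.

```python
import bs4

# Made-up HTML that mimics the structure used in this lesson
html = """
<h2><span class="mw-headline" id="Biography">Biography</span></h2>
<h2><span class="mw-headline" id="Legacy">Legacy</span></h2>
<h2><span class="other">Not a headline</span></h2>
"""

soup_demo = bs4.BeautifulSoup(html, "html.parser")

# "span.mw-headline" matches only <span> tags with that class
matches = soup_demo.select("span.mw-headline")

# A list comprehension does the same job as the for-loop with append
demo_headlines = [item.get_text(strip=True) for item in matches]
demo_ids = [item["id"] for item in matches]

print(demo_headlines)
print(demo_ids)
```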

In [8]:
headlines

['Biography',
 '1940–1957: Early years',
 '1956–1970: The Quarrymen to the Beatles',
 '1956–1966: Formation, fame and touring',
 '1966–1970: Studio years, break-up and solo work',
 '1970–1980: Solo career',
 '1970–1972: Initial solo success and activism',
 '1973–1975: "Lost weekend"',
 '1975–1980: Hiatus and return',
 '8 December 1980: Murder',
 'Personal relationships',
 'Cynthia Lennon',
 'Brian Epstein',
 'Julian Lennon',
 'Yoko Ono',
 'May Pang',
 'Sean Lennon',
 'Former Beatles',
 'Political activism',
 'Deportation attempt',
 'FBI surveillance and declassified documents',
 'Writing and art',
 'Musicianship',
 'Instruments played',
 'Vocal style',
 'Legacy',
 'Accolades',
 'Discography',
 'Solo',
 'With Yoko Ono',
 'Filmography',
 'Film',
 'Television',
 'Bibliography',
 'See also',
 'Notes',
 'References',
 'Citations',
 'Sources',
 'Further reading',
 'External links']

In [9]:
# To save to a CSV, we first want to create a dataframe for the data

import pandas as pd

headline_df = pd.DataFrame({'headlines': headlines})

In [10]:
headline_df

,headlines
0,Biography
1,1940–1957: Early years
2,1956–1970: The Quarrymen to the Beatles
3,"1956–1966: Formation, fame and touring"
4,"1966–1970: Studio years, break-up and solo work"
5,1970–1980: Solo career
6,1970–1972: Initial solo success and activism
7,"1973–1975: ""Lost weekend"""
8,1975–1980: Hiatus and return
9,8 December 1980: Murder


In [11]:
# Save to csv 
headline_df.to_csv('headline.csv')
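
By default, `to_csv` also writes the row index as a first column without a name, and reading the file back with pandas labels that column `Unnamed: 0`. The sketch below shows the difference `index=False` makes. It writes to in-memory buffers instead of files, and the two-row DataFrame is made up for illustration.

```python
import io
import pandas as pd

demo_df = pd.DataFrame({"headlines": ["Biography", "Legacy"]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
demo_df.to_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)

# Rewind the buffers and read them back
with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'headlines']
print(cols_without)  # ['headlines']
```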

In [12]:
# Start by going through the HTML and looking for something all the images have in common:
# the class 'thumbimage' applies to all the images

image_info = soup.select('.thumbimage')

In [13]:
image_info

[<img alt="A grey two-storey building, with numerous windows visible on both levels" class="thumbimage" data-file-height="1690" data-file-width="2040" decoding="async" height="182" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Mendipsnationaltrust.JPG/220px-Mendipsnationaltrust.JPG" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Mendipsnationaltrust.JPG/330px-Mendipsnationaltrust.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Mendipsnationaltrust.JPG/440px-Mendipsnationaltrust.JPG 2x" width="220"/>,
 <img alt="" class="thumbimage" data-file-height="329" data-file-width="500" decoding="async" height="171" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/04/Paul%2C_George_%26_John.png/260px-Paul%2C_George_%26_John.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/04/Paul%2C_George_%26_John.png/390px-Paul%2C_George_%26_John.png 1.5x, //upload.wikimedia.org/wikipedia/commons/0/04/Paul%2C_George_%26_John.png 2x" width="260"/>,
 <img al

In [14]:
len(image_info)

26

In [15]:
# We're creating a list of the links for the thumbnails

links = []
for link in image_info:
    # the link is in the 'src' attribute
    item = link.get('src')
    print(item)
    # some <div> tags also have the class "thumbimage" but no 'src', so .get() returns None for them
    if isinstance(item, str):
        # the src starts with "//", so we add "https:" to make it a full URL
        links.append("https:" + item)

//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Mendipsnationaltrust.JPG/220px-Mendipsnationaltrust.JPG
//upload.wikimedia.org/wikipedia/commons/thumb/0/04/Paul%2C_George_%26_John.png/260px-Paul%2C_George_%26_John.png
//upload.wikimedia.org/wikipedia/commons/thumb/9/97/John_Lennon_%28cropped%29.jpg/130px-John_Lennon_%28cropped%29.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/1/1d/The_Beatles_i_H%C3%B6torgscity_1963.jpg/220px-The_Beatles_i_H%C3%B6torgscity_1963.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/a/a0/John_Lennon_passport_photo_%28cropped%29.jpg/144px-John_Lennon_passport_photo_%28cropped%29.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/8/87/The_Beatles_magical_mystery_tour_%28cropped%29.jpg/220px-The_Beatles_magical_mystery_tour_%28cropped%29.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/8/81/John_Lennon_en_echtgenote_Yoko_Ono_vertrekken_van_Schiphol_naar_Wenen_in_de_vert%2C_Bestanddeelnr_922-2496_%28cropped%29.jpg/190px-John_Lennon_en_echtgenote_Yoko_

In [16]:
links

['https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Mendipsnationaltrust.JPG/220px-Mendipsnationaltrust.JPG',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Paul%2C_George_%26_John.png/260px-Paul%2C_George_%26_John.png',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/9/97/John_Lennon_%28cropped%29.jpg/130px-John_Lennon_%28cropped%29.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/The_Beatles_i_H%C3%B6torgscity_1963.jpg/220px-The_Beatles_i_H%C3%B6torgscity_1963.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a0/John_Lennon_passport_photo_%28cropped%29.jpg/144px-John_Lennon_passport_photo_%28cropped%29.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/The_Beatles_magical_mystery_tour_%28cropped%29.jpg/220px-The_Beatles_magical_mystery_tour_%28cropped%29.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/John_Lennon_en_echtgenote_Yoko_Ono_vertrekken_van_Schiphol_naar_Wenen_in_de_vert%2C_Bestanddee
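
Adding `"https:"` by hand works because every `src` here starts with `//`. As an alternative sketch, the standard library's `urljoin` resolves any kind of `src` against the address of the page it came from. The helper name `absolute_url` and the `/static/...` and `example.com` paths are made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/John_Lennon"

def absolute_url(src, page_url=PAGE_URL):
    """Turn a src attribute into a full URL, relative to the page it came from."""
    return urljoin(page_url, src)

# A protocol-relative link, as found on the page in this lesson
thumb = absolute_url("//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Mendipsnationaltrust.JPG/220px-Mendipsnationaltrust.JPG")
# A site-relative link (hypothetical path)
icon = absolute_url("/static/images/icon.png")
# A link that is already complete is returned unchanged
full = absolute_url("https://example.com/a.png")

print(thumb)
print(icon)
print(full)
```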

In [17]:
import urllib.request
import urllib.parse
from urllib.error import HTTPError
import time
import random as r

In [18]:
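def download(url, full_path):
    # Save the file at `url` to `full_path`, then pause 1-5 seconds to be polite to the server
    try:
        urllib.request.urlretrieve(url, full_path)
        time.sleep(r.randint(1, 5))
    except HTTPError as err:
        # Report the HTTP status code and move on to the next file
        print(err.code)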
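
The Python documentation describes `urlretrieve` as a legacy interface, so here is an alternative sketch built on `urlopen` that also sends a User-Agent header. The function name `download_with_agent`, the User-Agent string, and the 30-second timeout are assumptions for illustration, not the lesson's original method.

```python
import shutil
import urllib.request
from urllib.error import URLError

def download_with_agent(url, full_path, user_agent="intro-scraping-lesson/0.1"):
    """Save the file at url to full_path. Return True on success, False on failure."""
    # The user_agent default is a hypothetical string; use your own
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=30) as response, \
                open(full_path, "wb") as out:
            # Copy the response body to disk in chunks
            shutil.copyfileobj(response, out)
        return True
    except URLError as err:
        # HTTPError is a subclass of URLError, so HTTP failures land here too
        print("download failed:", err)
        return False
```

You could still call `time.sleep(...)` between downloads, as in the cell above, to avoid hitting the server too quickly.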

In [21]:
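import os

file_path = 'images/'
# urlretrieve will not create the folder, so make sure it exists first
os.makedirs(file_path, exist_ok=True)

for index, url in enumerate(links):
    print(url)
    # note: every file is saved with a .jpg extension, even if the source is a .png
    file_name = 'img_' + str(index) + '.jpg'
    full_path = '{}{}'.format(file_path, file_name)
    print(file_name)
    download(url, full_path)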

https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Mendipsnationaltrust.JPG/220px-Mendipsnationaltrust.JPG
img_0.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Paul%2C_George_%26_John.png/260px-Paul%2C_George_%26_John.png
img_1.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/9/97/John_Lennon_%28cropped%29.jpg/130px-John_Lennon_%28cropped%29.jpg
img_2.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/The_Beatles_i_H%C3%B6torgscity_1963.jpg/220px-The_Beatles_i_H%C3%B6torgscity_1963.jpg
img_3.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/a/a0/John_Lennon_passport_photo_%28cropped%29.jpg/144px-John_Lennon_passport_photo_%28cropped%29.jpg
img_4.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/The_Beatles_magical_mystery_tour_%28cropped%29.jpg/220px-The_Beatles_magical_mystery_tour_%28cropped%29.jpg
img_5.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/John_Lennon_en_echtgenote_Yoko_Ono_vertrekken_van_Schiphol_n
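
The loop above saves every file with a `.jpg` extension, although some of the thumbnails (for example, the second link) are `.png` files. A minimal sketch of one way to keep the original extension is shown below. The helper name `file_name_for` and the `.jpg` fallback are assumptions for illustration.

```python
import os
from urllib.parse import urlparse, unquote

def file_name_for(index, url, default_ext=".jpg"):
    """Build a name like img_3.png using the extension found in the URL path."""
    # unquote turns escapes such as %2C back into normal characters
    path = unquote(urlparse(url).path)
    ext = os.path.splitext(path)[1].lower()
    # Fall back to default_ext when the URL has no extension
    return "img_{}{}".format(index, ext or default_ext)

name_png = file_name_for(1, "https://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Paul%2C_George_%26_John.png/260px-Paul%2C_George_%26_John.png")
name_jpg = file_name_for(0, "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Mendipsnationaltrust.JPG/220px-Mendipsnationaltrust.JPG")

print(name_png)
print(name_jpg)
```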