### Intro to Web Scraping


Scraping is not always legal!  

Some rules to consider: 
* Be respectful: do not bombard a website with scraping requests, or your IP address may get blocked
* Check the website's permissions (for example, its terms of use) before you begin! If there is an API available, use it. Many websites won't let you use their data commercially.
* Each website is unique and may change over time, so you may need to customize your scraping code for each website and update it when the site changes
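
One way to act on the first two rules is to read a site's robots.txt before scraping. The sketch below uses the standard library's `urllib.robotparser`. The robots.txt text, the example.com URLs, and the `my-scraper` user agent name are all made up for illustration, and the rules are parsed from a string so the example runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# "my-scraper" is a placeholder user agent name.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/articles/1")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")  # seconds to wait between requests, if stated

print(allowed_public, allowed_private, delay)
```

Against a real site you would likely call `set_url()` and `read()` on the parser instead of `parse()`, and then sleep for at least the stated delay between requests.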


When is it a good idea to scrape a website?
* An API is not available, or the information you want is not in the API
* You want to anonymously scrape a website (use a VPN)

Here is a Web Scraping Sandbox where you can practice scraping: 
http://toscrape.com/

Today, we're going to start by scraping Wikipedia (https://en.wikipedia.org) because it is *legal* to scrape

Make sure you install requests and Beautiful Soup (imported as bs4) from the terminal

* pip install requests
* pip install beautifulsoup4

or if you're using Anaconda

* conda install requests
* conda install beautifulsoup4

or install them from a notebook

* !pip install requests
* !pip install beautifulsoup4

In [1]:
# The requests library will grab the page
import requests

response = requests.get("https://en.wikipedia.org/wiki/Loki")

In [2]:
response.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Loki - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"7a897562-1e3d-4b19-8dcd-d09427ae4656","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Loki","wgTitle":"Loki","wgCurRevisionId":1034351440,"wgRevisionId":1034351440,"wgArticleId":18013,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 errors: dates","CS1 Japanese-language sources (ja)","Wikipedia indefinitely semi-protected pages","Articles with short description","Short description matches Wikidata","Articles with tr

In [3]:
# The BeautifulSoup library parses the HTML into a searchable tree, which makes the extracted page easier to read and analyze

import bs4 
soup = bs4.BeautifulSoup(response.text, 'html.parser')

In [4]:
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Loki - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"7a897562-1e3d-4b19-8dcd-d09427ae4656","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Loki","wgTitle":"Loki","wgCurRevisionId":1034351440,"wgRevisionId":1034351440,"wgArticleId":18013,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 errors: dates","CS1 Japanese-language sources (ja)","Wikipedia indefinitely semi-protected pages","Articles with short description","Short description matches Wikidata","Articles with trivia 

In [5]:
# Next, inspect the elements on the wiki page; I want to grab the headlines
# The headlines are in <span> elements with class="mw-headline"

soup.select(".mw-headline")

[<span class="mw-headline" id="Etymology,_and_alternative_names">Etymology, and alternative names</span>,
 <span class="mw-headline" id="Attestations">Attestations</span>,
 <span class="mw-headline" id="Poetic_Edda"><i>Poetic Edda</i></span>,
 <span class="mw-headline" id="Völuspá"><i>Völuspá</i></span>,
 <span class="mw-headline" id="Lokasenna"><i>Lokasenna</i></span>,
 <span class="mw-headline" id="Entrance_and_rejection">Entrance and rejection</span>,
 <span class="mw-headline" id="Re-entrance_and_insults">Re-entrance and insults</span>,
 <span class="mw-headline" id="The_arrival_of_Thor_and_the_bondage_of_Loki">The arrival of Thor and the bondage of Loki</span>,
 <span class="mw-headline" id="Þrymskviða"><i>Þrymskviða</i></span>,
 <span class="mw-headline" id="Reginsmál"><i>Reginsmál</i></span>,
 <span class="mw-headline" id="Baldrs_draumar"><i>Baldrs draumar</i></span>,
 <span class="mw-headline" id="Hyndluljóð"><i>Hyndluljóð</i></span>,
 <span class="mw-headline" id="Fjölsvinnsmá

In [6]:
# Create a list for the scraped headlines

headlines = []
for item in soup.select(".mw-headline"):
    headlines.append(item.text)
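
The same select-then-extract pattern can be tried on a small, known piece of HTML before running it against a live page. The sketch below is an illustration only: the HTML fragment and the names `html_example`, `soup_example`, and `headline_texts` are made up, and it uses a list comprehension with `get_text(strip=True)` as an alternative to the loop above.

```python
import bs4

# Hypothetical HTML fragment standing in for a downloaded page.
html_example = """
<h2><span class="mw-headline" id="Origin">Origin</span></h2>
<h2><span class="mw-headline" id="See_also"><i>See also</i></span></h2>
<p class="note">Not a headline</p>
"""

soup_example = bs4.BeautifulSoup(html_example, "html.parser")

# ".mw-headline" is a CSS class selector: it matches every tag with that class.
# get_text() flattens nested tags such as <i>, and strip=True trims whitespace.
headline_texts = [
    span.get_text(strip=True) for span in soup_example.select(".mw-headline")
]

print(headline_texts)
```

Testing a selector on a fragment like this makes it easier to tell whether an empty result comes from a wrong selector or from a page that has changed.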

In [7]:
headlines

['Etymology, and alternative names',
 'Attestations',
 'Poetic Edda',
 'Völuspá',
 'Lokasenna',
 'Entrance and rejection',
 'Re-entrance and insults',
 'The arrival of Thor and the bondage of Loki',
 'Þrymskviða',
 'Reginsmál',
 'Baldrs draumar',
 'Hyndluljóð',
 'Fjölsvinnsmál',
 'Prose Edda',
 'Gylfaginning',
 "High's introduction",
 'Loki, Svaðilfari, and Sleipnir',
 'Loki, Útgarða-Loki, and Logi',
 'Norwegian rune poem',
 'Archaeological record',
 'Snaptun Stone',
 'Kirkby Stephen Stone and Gosforth Cross',
 'Scandinavian folklore',
 'Origin and identification with other figures',
 'Origin',
 'Identification with Lóðurr',
 'Binding',
 'Modern interpretations and legacy',
 'Modern popular culture',
 'Science',
 'See also',
 'References',
 'Cited sources',
 'External links']

In [8]:
# To save to a CSV, we first create a DataFrame for the data

import pandas as pd

headline_df = pd.DataFrame({'headlines': headlines})

In [9]:
headline_df

   headlines
0  Etymology, and alternative names
1  Attestations
2  Poetic Edda
3  Völuspá
4  Lokasenna
5  Entrance and rejection
6  Re-entrance and insults
7  The arrival of Thor and the bondage of Loki
8  Þrymskviða
9  Reginsmál


In [10]:
# Save to csv 
headline_df.to_csv('headline.csv')
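
By default `to_csv` also writes the row index as a first, unnamed column, and reading such a file back with `read_csv` produces an extra `Unnamed: 0` column. The sketch below shows the difference when `index=False` is passed. It writes to in-memory strings rather than files, and the two-item list and the `example_*` names are stand-ins for illustration.

```python
import io

import pandas as pd

# Stand-in for the scraped list of headlines.
example_headlines = ["Attestations", "Poetic Edda"]
example_df = pd.DataFrame({"headlines": example_headlines})

# With no path argument, to_csv() returns the CSV text as a string.
csv_with_index = example_df.to_csv()
csv_without_index = example_df.to_csv(index=False)

# Read both versions back to compare the resulting columns.
round_trip_with_index = pd.read_csv(io.StringIO(csv_with_index))
round_trip_without_index = pd.read_csv(io.StringIO(csv_without_index))

print(list(round_trip_with_index.columns))
print(list(round_trip_without_index.columns))
```

If the index carries no meaning, as with a plain list of headlines, `headline_df.to_csv('headline.csv', index=False)` is one way to keep the file to a single column.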

In [11]:
# We're starting by going through the HTML and looking for something all the images have in common 
# the class 'thumbimage' applies to all the images 

image_info = soup.select('.thumbimage')

In [12]:
image_info

[<img alt="" class="thumbimage" data-file-height="1125" data-file-width="903" decoding="async" height="224" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/40/Processed_SAM_loki.jpg/180px-Processed_SAM_loki.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/40/Processed_SAM_loki.jpg/270px-Processed_SAM_loki.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/40/Processed_SAM_loki.jpg/360px-Processed_SAM_loki.jpg 2x" width="180"/>,
 <img alt="" class="thumbimage" data-file-height="600" data-file-width="472" decoding="async" height="280" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Loki%2C_by_M%C3%A5rten_Eskil_Winge_1890.jpg/220px-Loki%2C_by_M%C3%A5rten_Eskil_Winge_1890.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Loki%2C_by_M%C3%A5rten_Eskil_Winge_1890.jpg/330px-Loki%2C_by_M%C3%A5rten_Eskil_Winge_1890.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Loki%2C_by_M%C3%A5rten_Eskil_Winge_1890.jpg/440px-Loki%2C_by_M%C3%A5rt

In [13]:
len(image_info)

14

In [14]:
# We're creating a list of the links for the thumbnails

links = []
for link in image_info:
    # the link is in the 'src' attribute
    item = link.get('src')
    print(item)
    # this check is needed because some matched tags (such as <div>) have no 'src',
    # in which case .get() returns None
    if isinstance(item, str):
        # the src starts with // (no scheme), so we add https: to format it properly
        links.append("https:" + item)

//upload.wikimedia.org/wikipedia/commons/thumb/4/40/Processed_SAM_loki.jpg/180px-Processed_SAM_loki.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Loki%2C_by_M%C3%A5rten_Eskil_Winge_1890.jpg/220px-Loki%2C_by_M%C3%A5rten_Eskil_Winge_1890.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Loki_taunts_Bragi.jpg/220px-Loki_taunts_Bragi.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Lokasenna_by_Lorenz_Fr%C3%B8lich.jpg/220px-Lokasenna_by_Lorenz_Fr%C3%B8lich.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/0/02/Loki_leaves_the_hall_and_threatens_the_%C3%86sir_with_fire_by_Fr%C3%B8lich.jpg/220px-Loki_leaves_the_hall_and_threatens_the_%C3%86sir_with_fire_by_Fr%C3%B8lich.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Louis_Huard_-_The_Punishment_of_Loki.jpg/220px-Louis_Huard_-_The_Punishment_of_Loki.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/a/a0/Loki%27s_flight_to_J%C3%B6tunheim.jpg/220px-Loki%27s_flight_to_J%C3%B6tunheim.jpg
//upload.wikimedia.org/wik

In [15]:
links

['https://upload.wikimedia.org/wikipedia/commons/thumb/4/40/Processed_SAM_loki.jpg/180px-Processed_SAM_loki.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Loki%2C_by_M%C3%A5rten_Eskil_Winge_1890.jpg/220px-Loki%2C_by_M%C3%A5rten_Eskil_Winge_1890.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Loki_taunts_Bragi.jpg/220px-Loki_taunts_Bragi.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Lokasenna_by_Lorenz_Fr%C3%B8lich.jpg/220px-Lokasenna_by_Lorenz_Fr%C3%B8lich.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/02/Loki_leaves_the_hall_and_threatens_the_%C3%86sir_with_fire_by_Fr%C3%B8lich.jpg/220px-Loki_leaves_the_hall_and_threatens_the_%C3%86sir_with_fire_by_Fr%C3%B8lich.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Louis_Huard_-_The_Punishment_of_Loki.jpg/220px-Louis_Huard_-_The_Punishment_of_Loki.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a0/Loki%27s_flight_to_J%C3%B6tunheim.jpg/22
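
Prefixing `https:` works here because every `src` on this page starts with `//`. As a more general alternative, `urllib.parse.urljoin` resolves a `src` value against the URL of the page it came from. In the sketch below, the helper name `absolutize` and the second and third example values are hypothetical; only the first `src` is taken from the output above.

```python
from urllib.parse import urljoin

# The page the <img> tags were scraped from.
PAGE_URL = "https://en.wikipedia.org/wiki/Loki"


def absolutize(src, page_url=PAGE_URL):
    """Turn a src attribute into an absolute URL, relative to page_url."""
    return urljoin(page_url, src)


protocol_relative = absolutize(
    "//upload.wikimedia.org/wikipedia/commons/thumb/4/40/"
    "Processed_SAM_loki.jpg/180px-Processed_SAM_loki.jpg"
)
# Hypothetical values, to show the other forms a src can take.
site_relative = absolutize("/static/images/icon.png")
already_absolute = absolutize("https://example.com/picture.jpg")

print(protocol_relative)
print(site_relative)
print(already_absolute)
```

Because `urljoin` handles all three forms, the same helper can be reused on sites that mix relative and absolute image paths.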

In [16]:
import urllib.request  # to download files
import urllib.parse  # to parse and build URLs
from urllib.error import HTTPError  # to catch download errors
import time  # to pause between downloads
import random as r  # to pause for a random amount of time

In [17]:
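def download(url, full_path):
    try:
        urllib.request.urlretrieve(url, full_path)
    except HTTPError as err:
        # report the HTTP status code and move on to the next file
        print(err.code)
    # pause 1-5 seconds after every attempt so we don't bombard the server
    time.sleep(r.randint(1, 5))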
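
Some servers reject requests that do not identify themselves, so a variant of the function above that sends a `User-Agent` header can be useful. The following is a sketch under assumptions, not a replacement for `download`: the function name `download_with_headers`, the `USER_AGENT` string, and the `timeout` default are all made up for illustration. It uses `urllib.request.Request` and `urlopen` instead of `urlretrieve`, because `urlretrieve` takes no headers argument.

```python
import shutil
import urllib.request
from urllib.error import HTTPError, URLError

# Hypothetical identifier; replace it with your own project name and contact.
USER_AGENT = "intro-scraping-tutorial/0.1 (contact: you@example.com)"


def download_with_headers(url, full_path, timeout=10):
    """Save url to full_path. Return True on success, False on failure."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        # The output file is only opened once the server has responded.
        with urllib.request.urlopen(request, timeout=timeout) as response, \
                open(full_path, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
        return True
    except HTTPError as err:
        # The server answered with an error status such as 403 or 404.
        print("HTTP error:", err.code)
    except URLError as err:
        # The request never got a usable answer (bad host, missing file, ...).
        print("Connection problem:", err.reason)
    return False
```

The boolean return value lets the calling loop count or retry failed downloads. The random pause between downloads is left out of this sketch and would still be needed in the loop.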

In [18]:
# enumerate() takes 2 parameters: an iterable and an optional start index
# it yields (index, item) pairs as it goes through the iterable (in our case, a list of links)

import os

# first, create a folder called "images" (no error if it already exists)
os.makedirs('images', exist_ok=True)

for index, url in enumerate(links):
    print(url)
    file_name = 'img_' + str(index) + '.jpg'
    file_path = 'images/'
    full_path = '{}{}'.format(file_path, file_name)
    print(file_name)
    download(url, full_path)

https://upload.wikimedia.org/wikipedia/commons/thumb/4/40/Processed_SAM_loki.jpg/180px-Processed_SAM_loki.jpg
img_0.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Loki%2C_by_M%C3%A5rten_Eskil_Winge_1890.jpg/220px-Loki%2C_by_M%C3%A5rten_Eskil_Winge_1890.jpg
img_1.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Loki_taunts_Bragi.jpg/220px-Loki_taunts_Bragi.jpg
img_2.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Lokasenna_by_Lorenz_Fr%C3%B8lich.jpg/220px-Lokasenna_by_Lorenz_Fr%C3%B8lich.jpg
img_3.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/0/02/Loki_leaves_the_hall_and_threatens_the_%C3%86sir_with_fire_by_Fr%C3%B8lich.jpg/220px-Loki_leaves_the_hall_and_threatens_the_%C3%86sir_with_fire_by_Fr%C3%B8lich.jpg
img_4.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Louis_Huard_-_The_Punishment_of_Loki.jpg/220px-Louis_Huard_-_The_Punishment_of_Loki.jpg
img_5.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/a/a0/Loki%2