# Course: Intro to Python & R for Data Analysis
## Lecture: Webscraping with Anthony

contact: avspinelli@optonline.net



# Goal: Build a dataset of skate decks from: https://laborskateshop.com/

## General structure for webscraping with BS:

### 1. Scrapability
The first thing we want to know is whether BS (BeautifulSoup) can scrape what we want. BS works well with static HTML, but it can't handle content that is rendered by JavaScript.
### 2. Building variables using a point example
If the website is scrapable, it's best to start small and build up.
### 3. Building loops
We then typically build a loop around this concept to flip through items and pull what we want from each.
### 4. Pagination (If needed)
And lastly, we often want to flip through the pages of a website to collect more than just the first page.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import requests
import re
import time

# 1. Scrapability

Let's start by seeing whether the website will allow us to scrape, by making a request to the page with the skate decks:

In [2]:
# Make Request
url = 'https://laborskateshop.com/collections/decks?filter.v.availability=1&page=1'
response = requests.get(url)
response

<Response [200]>
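
A `<Response [200]>` means the server returned the page successfully. Other codes you may run into while scraping include 403 (forbidden) and 429 (too many requests). Here is a minimal sketch that turns a status code into a readable message using the standard library's `http.HTTPStatus`. The helper name `describe_status` is made up for this example and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Not a status code the standard library knows about
        return f"{code}: unknown status code"
    verdict = "OK to parse" if 200 <= code < 300 else "do not parse yet"
    return f"{code} {status.phrase}: {verdict}"

print(describe_status(200))
print(describe_status(429))
```

With a real request you would pass `response.status_code` to the helper. `requests` also offers `response.raise_for_status()`, which raises an error on 4xx and 5xx responses.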

In [3]:
# Get html from the link

soup = bs(response.text, 'html.parser')
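
A 200 response alone does not tell us that the data is in the HTML. If the products are drawn by JavaScript after the page loads, BS will find nothing. One quick check is to count the containers we expect in the raw HTML. The sketch below uses two made-up HTML strings in place of `response.text`, and `count_containers` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

def count_containers(html, tag, class_name):
    """Count elements with the given tag and CSS class in raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(tag, class_=class_name))

# Made-up HTML standing in for response.text (assumption, not the real site)
static_page = '<ul><li class="grid__item">Deck A</li><li class="grid__item">Deck B</li></ul>'
js_page = '<div id="app"></div><script>renderProducts()</script>'

print(count_containers(static_page, "li", "grid__item"))  # 2
print(count_containers(js_page, "li", "grid__item"))      # 0
```

A count of zero on a page that clearly shows products in the browser is a hint that the content is rendered by JavaScript and BS alone will not be enough.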

# 2. Building variables using a point example
Most websites use containers that we can loop through and grab data from. They typically all share the same tag, which makes them easy to loop over. In our case, each container has the tag:

li class="grid__item"

So let's start by getting all the tags that have that structure:

In [4]:
# .find_all()
decks = soup.find_all('li',{'class':'grid__item'})

# After we have all of them, it's best to use just one container to create our variables from, since they all have the same structure
deck = decks[0]

In [5]:
deck # Looking inside each container

<li class="grid__item">
<link href="//laborskateshop.com/cdn/shop/t/25/assets/component-rating.css?v=24573085263941240431663960143" media="all" rel="stylesheet" type="text/css"/>
<div class="card-wrapper underline-links-hover">
<div class="card card--standard card--media" style="--ratio-percent: 100%;">
<div class="card__inner color-background-2 ratio" style="--ratio-percent: 100%;"><div class="card__media">
<div class="media media--transparent media--hover-effect">
<img alt="Hardbody Hjalte Halberg Pro Skateboard Deck - Black" class="motion-reduce" height="4284" sizes="(min-width: 1400px) 317px, (min-width: 990px) calc((100vw - 130px) / 4), (min-width: 750px) calc((100vw - 120px) / 3), calc((100vw - 35px) / 2)" src="//laborskateshop.com/cdn/shop/files/IMG_3875_eea1f56e-738b-4ae0-9c66-72265aee3825.jpg?v=1709231435&amp;width=533" srcset="//laborskateshop.com/cdn/shop/files/IMG_3875_eea1f56e-738b-4ae0-9c66-72265aee3825.jpg?v=1709231435&amp;width=165 165w,//laborskateshop.com/cdn/shop/fil

# Variables to collect

From each container, we want two variables:

### 1. Price
### 2. Link

For each variable, we identify the tag that holds it, extract the value, clean it, and store it.

## Price:

In [6]:
# First we see that price is stored here:
# <span class="price-item price-item--regular" data-uw-rm-sr="">
#         $60.00 USD
#       </span>


# Using that, we find the element:
# deck.find('span',{'class':'price-item price-item--regular'})

# Then use the .get_text() method to get the text of the element: 
# deck.find('span',{'class':'price-item price-item--regular'}).get_text()

# Then we use a bit of regex: re.findall(r'\d+', STRING_HERE) finds every run of digits, and the first one is the dollar amount
# re.findall(r'\d+',deck.find('span',{'class':'price-item price-item--regular'}).get_text())

# Final variable
price = re.findall(r'\d+',deck.find('span',{'class':'price-item price-item--regular'}).get_text())[0]
price

'70'
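
Note that `re.findall(r'\d+', ...)` splits a price like `$60.00` into `['60', '00']`, so taking `[0]` keeps only the dollars, and as a string. If you want the cents or a numeric type, here is a minimal sketch of an alternative. `parse_price` is a hypothetical helper, and the sample strings are made up to mirror the format shown above.

```python
import re

def parse_price(text):
    """Extract the first price-like number from text as a float, or None."""
    # One digit, then more digits or thousands commas, then optional decimals
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

raw = "\n        $60.00 USD\n      "
print(re.findall(r"\d+", raw))   # ['60', '00']
print(parse_price(raw))          # 60.0
print(parse_price("$1,249.50"))  # 1249.5
print(parse_price("Sold out"))   # None
```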

## Link:

In [7]:
# The link is stored here
# <a href="/products/krooked-tom-knox-debut-secret-pro-skateboard-deck?_pos=1&amp;_fid=c59c1ef1d&amp;_ss=c" class="full-unstyled-link" data-uw-rm-brl="false">
#              Krooked Tom Knox Debut Secret Pro Skateboard Deck
#           </a>

# First get it with:
# deck.find('h3',{'class':'card__heading h5'})

# We then put it in a string and split it by href to get the link
# str(deck.find('h3',{'class':'card__heading h5'})).split('href="')[1]

# Clean off the garbage after the link
# str(deck.find('h3',{'class':'card__heading h5'})).split('href="')[1].split('">\n')[0]

# And add the beginning of the URL (https://...) so the link is complete:
#'https://laborskateshop.com/'+str(deck.find('h3',{'class':'card__heading h5'})).split('href="')[1].split('">\n')[0]

# Final variable
link = 'https://laborskateshop.com/'+str(deck.find('h3',{'class':'card__heading h5'})).split('href="')[1].split('">\n')[0]
link

'https://laborskateshop.com//products/hardbody-hjalte-halberg-pro-skateboard-deck-black?_pos=1&amp;_fid=36778eda2&amp;_ss=c&amp;variant=40804088316006'
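
The link above has two small blemishes: a double slash after the domain (the base ends in `/` and the href starts with `/`), and `&amp;` where a real URL has `&`, because we split the raw HTML string. Here is a minimal sketch of a cleanup step using only the standard library. `clean_link` and `BASE_URL` are names made up for this example.

```python
from html import unescape
from urllib.parse import urljoin

BASE_URL = "https://laborskateshop.com/"

def clean_link(raw_href, base=BASE_URL):
    """Decode HTML entities in an href and resolve it against the base URL."""
    return urljoin(base, unescape(raw_href))

# The href exactly as it appears in the raw HTML string
raw_href = "/products/hardbody-hjalte-halberg-pro-skateboard-deck-black?_pos=1&amp;_fid=36778eda2&amp;_ss=c&amp;variant=40804088316006"
print(clean_link(raw_href))
```

Another option is to read the attribute straight off the `a` tag with `.get('href')` instead of splitting strings, and then pass the result to `urljoin`.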

# 3. Building loops
The next thing we want to do is loop through our containers while grabbing our two variables from each:

In [8]:
# Let's make the request again here so that any manipulation we may have done above doesn't carry over
url = 'https://laborskateshop.com/collections/decks?filter.v.availability=1&page=1'
response = requests.get(url)
soup = bs(response.text, 'html.parser')
decks = soup.find_all('li',{'class':'grid__item'})


# first loop
out = []    # create an empty list that we are going to append our data into
for deck in decks:     # "for each container in the list of containers"
    
    # Getting all of our variables from "deck"
    price = re.findall(r'\d+',deck.find('span',{'class':'price-item price-item--regular'}).get_text())[0]
    link = 'https://laborskateshop.com/'+str(deck.find('h3',{'class':'card__heading h5'})).split('href="')[1].split('">\n')[0]
    
    data = {'price' : price,   # here we build the variables into a dictionary
            'link' : link
    }
    
    out.append(data) # and then append it to the list
pd.DataFrame(out)

Unnamed: 0,price,link
0,70,https://laborskateshop.com//products/hardbody-...
1,70,https://laborskateshop.com//products/hardbody-...
2,70,https://laborskateshop.com//products/hardbody-...
3,70,https://laborskateshop.com//products/hardbody-...
4,78,https://laborskateshop.com//products/baker-tys...
5,78,https://laborskateshop.com//products/baker-tri...
6,78,https://laborskateshop.com//products/baker-row...
7,78,https://laborskateshop.com//products/deathish-...
8,78,https://laborskateshop.com//products/deathwish...
9,78,https://laborskateshop.com//products/deathwish...
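
One thing to watch in the loop above: `deck.find(...)` returns `None` when a container has no matching tag, and calling `.get_text()` on `None` raises an `AttributeError` that stops the whole loop. Here is a minimal sketch of a more forgiving extraction. The HTML is made up for the example, and `extract_deck` is a hypothetical helper. It reads the `href` attribute directly instead of splitting strings.

```python
import re
from bs4 import BeautifulSoup

def extract_deck(deck):
    """Return price and href for one container; None marks a missing field."""
    price_tag = deck.find("span", {"class": "price-item price-item--regular"})
    link_tag = deck.find("a", href=True)
    digits = re.findall(r"\d+", price_tag.get_text()) if price_tag is not None else []
    return {
        "price": digits[0] if digits else None,
        "link": link_tag["href"] if link_tag is not None else None,
    }

# Made-up HTML: one normal product and one container with no product inside
html = """
<ul>
  <li class="grid__item">
    <a href="/products/deck-a">Deck A</a>
    <span class="price-item price-item--regular">$70.00 USD</span>
  </li>
  <li class="grid__item"><p>Promo banner with no product</p></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
rows = [extract_deck(deck) for deck in soup.find_all("li", {"class": "grid__item"})]
print(rows)
```

Rows with `None` can be dropped later in pandas, which is usually easier than debugging a scraper that crashed halfway through.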


# 4. Pagination (If needed)

Okay, so great: we got one page of skate decks with their price and link. But that's kind of boring, as it's only 20 or so observations. To get more, we need to go to the next page of the website and keep collecting. Most URLs have a pagination parameter we can manipulate to get the next page. In the case of laborskateshop.com, it's the standard "page=":
https://laborskateshop.com/collections/decks?filter.v.availability=1&page=1

If we change that number, we can pass the loop different pages and it will collect more.

In [9]:
start_time = time.time()

# outer loop
page = 1   # create our page variable which we will increase at the end of the loop
url = 'https://laborskateshop.com/collections/decks?filter.v.availability=1&page='  # We replaced the 'page=1' with just 'page=' because we are going to cycle through pages
final_data = [] # Outer list which will hold the DataFrame from each page


while page < 5:     # "While our page variable is less than 5, keep running." If you don't increase the page number at the bottom of the loop, it will run forever. If that happens, just restart the kernel
    url_get = url+str(page)    # URL Creation

    response = requests.get(url_get)         # Make the request to get the html from the page
    soup = bs(response.text, 'html.parser')            
    decks = soup.find_all('li',{'class':'grid__item'})

    
    
    # Inner Loop
    out_inner = []
    for deck in decks:
        price = re.findall(r'\d+',deck.find('span',{'class':'price-item price-item--regular'}).get_text())[0]
        link = 'https://laborskateshop.com/'+str(deck.find('h3',{'class':'card__heading h5'})).split('href="')[1].split('">\n')[0]
        data = {'price' : price,
                'link' : link
        }
        out_inner.append(data)

    
    
    # Bottom half of outer loop
    inner_data = pd.DataFrame(out_inner).assign(page = page)   # Create a DataFrame of out_inner and assign a page column which = page
    final_data.append(inner_data)   # Append the inner data to the outer list so it is saved before the next page creates a new inner_data
    print('Page '+str(page)+' Complete')    # print statement telling us which page the scraper just finished
    page = page + 1    # Increases the page by one to get the next page at the top of the loop
    time.sleep(2)   # Very important: always sleep between requests so you don't bombard the website (or get blocked)
    # End of outer loop
    
print('Total Time to Run: ' + str(round(time.time() - start_time,2))+' Seconds') # We use the time library to see how long the scraper takes
df = pd.concat(final_data).reset_index(drop=True)
df

Page 1 Complete
Page 2 Complete
Page 3 Complete
Page 4 Complete
Total Time to Run: 10.29 Seconds


Unnamed: 0,price,link,page
0,70,https://laborskateshop.com//products/hardbody-...,1
1,70,https://laborskateshop.com//products/hardbody-...,1
2,70,https://laborskateshop.com//products/hardbody-...,1
3,70,https://laborskateshop.com//products/hardbody-...,1
4,78,https://laborskateshop.com//products/baker-tys...,1
...,...,...,...
91,72,https://laborskateshop.com//products/pass-port...,4
92,72,https://laborskateshop.com//products/pass-port...,4
93,75,https://laborskateshop.com//products/baker-tys...,4
94,75,https://laborskateshop.com//products/baker-bra...,4
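
The loop above stops at a hard-coded `page < 5`. When you don't know how many pages there are, one common pattern is to keep going until a page comes back with no containers. Here is a minimal sketch of that idea. It assumes that a page number past the last one returns no items, which has not been checked for this site. To keep the example runnable offline, the page fetcher is passed in as a function, and `fake_fetch` and `FAKE_SITE` are made up to stand in for the request-and-parse step.

```python
import time

def scrape_all_pages(fetch_page, max_pages=50, delay=2.0):
    """Collect rows page by page until a page is empty or max_pages is reached."""
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if not page_rows:          # empty page: assume we ran past the last one
            break
        rows.extend({**row, "page": page} for row in page_rows)
        time.sleep(delay)          # pause between requests
    return rows

# Made-up data standing in for the website: two pages, then nothing
FAKE_SITE = {1: [{"price": "70"}, {"price": "78"}], 2: [{"price": "72"}]}

def fake_fetch(page):
    return FAKE_SITE.get(page, [])

result = scrape_all_pages(fake_fetch, delay=0)
print(result)
```

The `max_pages` cap is a safety net, so the loop still ends if the site never returns an empty page.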


# You Try!

## Alter the scraper below to collect the first 2 pages of Footwear from:
https://laborskateshop.com/collections/footwear?filter.v.availability=1&page=1

### Variables to Collect:
1. Price of each shoe
2. Link to each shoe
3. Name of each shoe


In [10]:
start_time = time.time()

# outer loop
page = 1   # create our page variable which we will increase at the end of the loop
url = 'https://laborskateshop.com/collections/decks?filter.v.availability=1&page='  # We replaced the 'page=1' with just 'page=' because we are going to cycle through pages
final_data = [] # Outer list which will hold the DataFrame from each page


while page < 5:     # "While our page variable is less than 5, keep running." If you don't increase the page number at the bottom of the loop, it will run forever. If that happens, just restart the kernel
    url_get = url+str(page)    # URL Creation

    response = requests.get(url_get)         # Make the request to get the html from the page
    soup = bs(response.text, 'html.parser')            
    decks = soup.find_all('li',{'class':'grid__item'})

    
    
    # Inner Loop
    out_inner = []
    for deck in decks:
        price = re.findall(r'\d+',deck.find('span',{'class':'price-item price-item--regular'}).get_text())[0]
        link = 'https://laborskateshop.com/'+str(deck.find('h3',{'class':'card__heading h5'})).split('href="')[1].split('">\n')[0]
        data = {'price' : price,
                'link' : link
        }
        out_inner.append(data)

    
    
    # Bottom half of outer loop
    inner_data = pd.DataFrame(out_inner).assign(page = page)   # Create a DataFrame of out_inner and assign a page column which = page
    final_data.append(inner_data)   # Append the inner data to the outer list so it is saved before the next page creates a new inner_data
    print('Page '+str(page)+' Complete')    # print statement telling us which page the scraper just finished
    page = page + 1    # Increases the page by one to get the next page at the top of the loop
    time.sleep(2)   # Very important: always sleep between requests so you don't bombard the website (or get blocked)
    # End of outer loop
    
print('Total Time to Run: ' + str(round(time.time() - start_time,2))+' Seconds') # We use the time library to see how long the scraper takes
df = pd.concat(final_data).reset_index(drop=True)
df

Page 1 Complete
Page 2 Complete
Page 3 Complete
Page 4 Complete
Total Time to Run: 10.8 Seconds


Unnamed: 0,price,link,page
0,70,https://laborskateshop.com//products/hardbody-...,1
1,70,https://laborskateshop.com//products/hardbody-...,1
2,70,https://laborskateshop.com//products/hardbody-...,1
3,70,https://laborskateshop.com//products/hardbody-...,1
4,78,https://laborskateshop.com//products/baker-tys...,1
...,...,...,...
91,72,https://laborskateshop.com//products/pass-port...,4
92,72,https://laborskateshop.com//products/pass-port...,4
93,75,https://laborskateshop.com//products/baker-tys...,4
94,75,https://laborskateshop.com//products/baker-bra...,4
