In [1]:
import pandas as pd
import numpy as np
import requests
import bs4 as bs

### Web Scraping #1: Getting a list of best-selling GPUs from newegg.ca

The site doesn't provide a ready-made table, so I need to build one myself by extracting the information from the page's HTML.

There are over a hundred pages of GPUs, but I'm going to use only 3 pages of the most **popular** GPUs.

I only want to see the 15 most popular GPUs, so scraping just the first page would be sufficient, but the point here is to practice web scraping across multiple pages.

Each page holds 36 items so I should be collecting 108 in total.
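
The page arithmetic above can be sketched as follows. This is only an illustration: `BASE_URL`, `PAGES` and `ITEMS_PER_PAGE` are names introduced here, the URL pattern is the one used in the scraping cell, and the 36-items-per-page figure is taken from the text rather than checked against the site.

```python
# Illustrative names; nothing here touches the network.
BASE_URL = 'https://www.newegg.ca/Desktop-Graphics-Cards/SubCategory/ID-48/Page-{}?Tid=7709&Order=3'
PAGES = range(1, 4)      # pages 1, 2 and 3
ITEMS_PER_PAGE = 36      # assumption taken from the text above

# One URL per page, built by filling the page number into the pattern.
page_urls = [BASE_URL.format(page) for page in PAGES]
expected_items = len(page_urls) * ITEMS_PER_PAGE

for url in page_urls:
    print(url)
print(expected_items)
```

Comparing `expected_items` with the list lengths printed after scraping is a quick way to notice that a page failed to load or that the layout changed.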

In [2]:
# fetch the source code of each results page and collect the fields into lists
page_list = list(range(1, 4))
brand = []
name = []
price_list = []
shipping_list = []

# looping through 3 pages
for page in page_list:
    newegg_url = requests.get('https://www.newegg.ca/Desktop-Graphics-Cards/SubCategory/ID-48/Page-{}?Tid=7709&Order=3'.format(page))
    # turn it into a BeautifulSoup object - lxml is the parser
    soup = bs.BeautifulSoup(newegg_url.text, 'lxml')
    # grab each product 
    containers = soup.findAll('div', {'class':'item-container'})
    price = soup.findAll('li', {'class':'price-current'})
    shipping = soup.findAll('li', {'class':'price-ship'})

    # getting brands and names (containers without a brand logo are skipped)
    for container in containers:
        if container.div.div.a.img is not None:
            brand.append(container.div.div.a.img['title'])
            name.append(container.a.img['title'])

    # getting prices
    for item in price:
        price_list.append(item.strong.text + item.sup.text)

    # getting prices of shipping
    for item in shipping:
        shipping_list.append(item.text.split()[0])

print(len(shipping_list))
print(len(price_list))
print(len(brand))
print(len(name))

108
108
108
108
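
The four counts match, which matters because `pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths. A minimal sketch of making that check explicit is shown below; `check_same_length` is a hypothetical helper, not something the notebook defines, and the sample lists are made up. Equal lengths do not prove that row *i* of every list belongs to the same product, so this is a necessary check rather than a sufficient one.

```python
def check_same_length(**columns):
    """Return the shared length of all lists, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'column lengths differ: {lengths}')
    # All lengths are equal at this point, so any one of them will do.
    return next(iter(lengths.values()))

# Tiny made-up lists standing in for the scraped ones.
n = check_same_length(name=['GPU A', 'GPU B'], price=['499.99', '419.99'])
print(n)
```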


In [3]:
price_list = [float(x.replace(',','')) for x in price_list]
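
The cell above assumes every scraped price is a plain number with optional thousands separators. A slightly more defensive variant is sketched here; `parse_price` is a hypothetical helper and the sample strings are invented, so treat it as one possible approach rather than what the notebook does.

```python
import re

def parse_price(text):
    """Extract the first number from a price string and return it as a float.

    Returns None when the string contains no digits at all.
    """
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    return float(match.group().replace(',', ''))

print(parse_price('499.99'))
print(parse_price('$1,299.99'))
print(parse_price('Free Shipping'))
```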

In [4]:
frame = {'Name':name, 'Brand':brand, 'Price':price_list, 'Shipping':shipping_list}
gpu_df = pd.DataFrame(frame)
print(gpu_df.head(15))

                                                 Name          Brand   Price  \
0   MSI Radeon RX 5700 DirectX 12 RX 5700 MECH GP ...            MSI  499.99   
1   MSI GeForce RTX 2060 DirectX 12 RTX 2060 VENTU...            MSI  419.99   
2   MSI GeForce RTX 2060 SUPER DirectX 12 RTX 2060...            MSI  559.99   
3   GIGABYTE Radeon RX 580 GAMING 8G (rev. 2.0) Gr...       GIGABYTE  249.99   
4   GIGABYTE Radeon RX 5600 XT DirectX 12 GV-R56XT...       GIGABYTE  409.99   
5   EVGA GeForce RTX 2070 SUPER XC HYBRID GAMING, ...           EVGA  819.99   
6   GIGABYTE Radeon RX 5600 XT WINDFORCE OC 6G (Re...       GIGABYTE  399.99   
7   SAPPHIRE NITRO+ Radeon RX 5700 XT DirectX 12 1...  Sapphire Tech  641.99   
8   ZOTAC GAMING GeForce RTX 2060 SUPER MINI 8GB G...          ZOTAC  549.99   
9   ZOTAC GAMING GeForce RTX 2060 6GB GDDR6 192-bi...          ZOTAC  444.99   
10  MSI Radeon RX 5600 XT DirectX 12 RX 5600 XT GA...            MSI  429.99   
11  ASRock Phantom Gaming D Radeon RX 57

These are the 15 best-selling GPUs on Newegg.ca.

### Web Scraping #2: Steam Sale

In [5]:
# getting the source code
steam_url = requests.get('https://store.steampowered.com/search/?specials=1')
soup = bs.BeautifulSoup(steam_url.text, 'lxml')
all_container = soup.findAll('div', {'class':'responsive_search_name_combined'})

game_list = []
discount_list = []
original_price = []
dis_price_list = []

for container in all_container:
    game_list.append(container.findChildren()[1].text)

    # the fourth div holds both the discount badge and the price block
    price_div = container.findChildren('div')[3]

    # percentage of discount
    if price_div.div.span is None:
        discount_list.append('No Discount')
    else:
        discount_list.append(price_div.div.span.text.replace('-', ''))

    # original price (struck through when the game is on sale)
    price_block = price_div.findChildren('div')[1]
    if price_block.strike is None:
        original_price.append('0')
    else:
        original_price.append(price_block.strike.text.split()[1].replace(',', ''))

    # discounted price: the last token of the price block
    line_split = price_block.text.strip().split()
    if not line_split:
        dis_price_list.append('0')
    else:
        dis_price_list.append(line_split[-1].replace(',', ''))

# checking if the number of items match
print(len(dis_price_list))
print(len(original_price))
print(len(discount_list))
print(len(game_list))

50
50
50
50


In [6]:
# convert both price lists to numpy arrays so that the discounted amount
# can be computed with an element-wise subtraction
original_price = np.array([float(x) for x in original_price])
dis_price_list = np.array([float(x) for x in dis_price_list])

frame = {'Game':game_list, 'Original Price':original_price, 'Discounted Amount':original_price-dis_price_list,
         'Discounted Price':dis_price_list, 'Discount %':discount_list}
steam_df = pd.DataFrame(frame)
steam_df.sort_values(by='Discounted Amount', inplace=True, ascending=False)
steam_df.reset_index(drop=True, inplace=True)
steam_df.head(20)

,Game,Original Price,Discounted Amount,Discounted Price,Discount %
0,EA Racing Pack,250.94,176.6,74.34,70%
1,The Deus Ex Collection,98.54,86.9,11.64,88%
2,NBA 2K20,79.99,73.59,6.4,92%
3,RPG Maker MV,88.69,70.96,17.73,80%
4,IL-2 Sturmovik: Battle of Stalingrad,65.99,56.1,9.89,85%
5,Injustice™ 2,69.99,56.0,13.99,80%
6,Need for Speed™ Heat,89.99,54.0,35.99,60%
7,Plants vs. Zombies: Battle for Neighborville™,64.99,45.5,19.49,70%
8,STAR WARS Jedi: Fallen Order Deluxe Edition,89.99,45.0,44.99,50%
9,Dying Light Enhanced Edition,59.99,42.0,17.99,70%


These are the 20 games with the largest absolute discount in the sale.
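
As a sanity check on the scraped columns, the discount percentage can be recomputed from the two prices. The sketch below uses two rows copied from the output above; `discount_percent` is a hypothetical helper, and rounding to the nearest whole percent is an assumption about how the store displays the value.

```python
def discount_percent(original, discounted):
    """Percentage saved, rounded to a whole number; 0 when no original price is known."""
    if original <= 0:
        return 0
    return round((original - discounted) / original * 100)

# (game, original price, discounted price) taken from the table above
rows = [
    ('EA Racing Pack', 250.94, 74.34),
    ('NBA 2K20', 79.99, 6.40),
]
for game, original, discounted in rows:
    print(game, discount_percent(original, discounted))
```

Both rows reproduce the scraped `Discount %` values (70 and 92), which suggests the three scraped columns line up for these games.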

### Web Scraping #3: EPL Table

This time, the website already provides a table, so I can conveniently grab it and make a few adjustments.

In [7]:
epl_url = 'https://www.skysports.com/premier-league-table/2019'
epl_list = pd.read_html(epl_url) # read_html returns a list with every table found at the url
print(len(epl_list)) # number of tables found
epl_df = epl_list[0] # the first table is already a dataframe
print(epl_df)

1
     #                      Team  Pl   W   D   L    F   A  GD  Pts  Last 6
0    1                 Liverpool  38  32   3   3   85  33  52   99     NaN
1    2           Manchester City  38  26   3   9  102  35  67   81     NaN
2    3         Manchester United  38  18  12   8   66  36  30   66     NaN
3    4                   Chelsea  38  20   6  12   69  54  15   66     NaN
4    5            Leicester City  38  18   8  12   67  41  26   62     NaN
5    6         Tottenham Hotspur  38  16  11  11   61  47  14   59     NaN
6    7   Wolverhampton Wanderers  38  15  14   9   51  40  11   59     NaN
7    8                   Arsenal  38  14  14  10   56  48   8   56     NaN
8    9          Sheffield United  38  14  12  12   39  39   0   54     NaN
9   10                   Burnley  38  15   9  14   43  50  -7   54     NaN
10  11               Southampton  38  15   7  16   51  60  -9   52     NaN
11  12                   Everton  38  13  10  15   44  56 -12   49     NaN
12  13          Newcast

The table looks great; however, it needs some polishing.

Let's
* Drop `Last 6` since it's a column of NaN values
* Change names of the columns for clarity
* Set rank as index to avoid redundancy

In [8]:
epl_df.drop(columns='Last 6', inplace=True)
epl_df.rename(columns={'#':'Rank', 'Pl':'Games Played', 'W':'Win', 'D':'Draw', 'L':'Loss',
                       'F':'Goals For', 'A':'Goals Against', 'GD':'Goal Difference'}, inplace=True)
epl_df.set_index(keys='Rank', drop=True, inplace=True)

epl_df

Rank,Team,Games Played,Win,Draw,Loss,Goals For,Goals Against,Goal Difference,Pts
1,Liverpool,38,32,3,3,85,33,52,99
2,Manchester City,38,26,3,9,102,35,67,81
3,Manchester United,38,18,12,8,66,36,30,66
4,Chelsea,38,20,6,12,69,54,15,66
5,Leicester City,38,18,8,12,67,41,26,62
6,Tottenham Hotspur,38,16,11,11,61,47,14,59
7,Wolverhampton Wanderers,38,15,14,9,51,40,11,59
8,Arsenal,38,14,14,10,56,48,8,56
9,Sheffield United,38,14,12,12,39,39,0,54
10,Burnley,38,15,9,14,43,50,-7,54


Everything looks clear and polished. Job done.

### Web Scraping #4: Joining Province and Abbreviation Tables

In [9]:
# getting population of provinces/territories in Canada
canada_2016 = pd.read_html('https://en.wikipedia.org/wiki/Population_of_Canada_by_province_and_territory')
population_df = pd.DataFrame(canada_2016[1])
population_df.set_index('Rank', inplace=True, drop=True)
population_df.drop(index='Total',inplace=True)

# getting abbreviation of provinces/territories in Canada
abbrev = pd.read_html('https://en.wikipedia.org/wiki/Canadian_postal_abbreviations_for_provinces_and_territories')
abbrev_df = abbrev[0][['Province or Territory', 'Postal and ISO 3166‑2:CA abbreviation']]
# rename returns a new dataframe, which avoids modifying a slice in place
abbrev_df = abbrev_df.rename(columns={'Province or Territory':'Province/Territory',
                                      'Postal and ISO 3166‑2:CA abbreviation':'Abbrev'})

# merge/join the dataframes on provinces/territories
population_df = population_df.merge(abbrev_df, on='Province/Territory')
population_df = population_df[['Province/Territory', 'Abbrev', '2016 Census', '2011 Census', 'Change']]
population_df = population_df.sort_values(by='Province/Territory').reset_index(drop=True)
population_df

,Province/Territory,Abbrev,2016 Census,2011 Census,Change
0,Alberta,AB,4067175,3645257,+11.57%
1,British Columbia,BC,4648055,4400057,+5.64%
2,Manitoba,MB,1278365,1208268,+5.80%
3,New Brunswick,NB,747101,751171,−0.54%
4,Newfoundland and Labrador,NL,519716,514536,+1.01%
5,Northwest Territories,NT,41786,41462,+0.78%
6,Nova Scotia,NS,923598,921727,+0.20%
7,Nunavut,NU,35944,31906,+12.66%
8,Ontario,ON,13448494,12851821,+4.64%
9,Prince Edward Island,PE,142907,140204,+1.93%
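
The merge above is an inner join, so any province or territory whose name is spelled differently in the two Wikipedia tables would silently drop out. One way to check for that, sketched here on small stand-in frames, is a left join with `indicator=True`. The frames and the missing-abbreviation scenario are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the two scraped tables (values are illustrative).
population = pd.DataFrame({
    'Province/Territory': ['Alberta', 'Ontario', 'Yukon'],
    '2016 Census': [4067175, 13448494, 35874],
})
abbrev = pd.DataFrame({
    'Province/Territory': ['Alberta', 'Ontario'],
    'Abbrev': ['AB', 'ON'],
})

# how='left' keeps every population row; indicator=True adds a '_merge' column
# that says whether a match was found in the right-hand frame.
merged = population.merge(abbrev, on='Province/Territory', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Province/Territory'].tolist()
print(unmatched)
```

An empty `unmatched` list would mean the inner join lost nothing.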


### Web Scraping #5: Books

In [10]:
books_page = requests.get('http://books.toscrape.com/')
soup = bs.BeautifulSoup(books_page.text, 'lxml')

category_container = soup.findAll('ul',{'class':'nav nav-list'})

category_list = []
category_url_list = []

# Instead of pulling info directly from the homepage, I want to grab books from genre pages
# Adding genres & urls
for category in category_container:
    children = category.li.ul.findChildren('li')
    for child in children:
        # grabbing genres
        category_list.append(child.text.strip())
        # grabbing url of each genre
        category_url_list.append(child.a['href'])

books = []
price_list = []
availability = []
genres = []
books_url = []

for i, url in enumerate(category_url_list):
    books_genre_url = requests.get('http://books.toscrape.com/' + url)
    temp_soup = bs.BeautifulSoup(books_genre_url.text, 'lxml')
    all_container = temp_soup.findAll('ol',{'class':'row'})
    
    # all_container is a list of the top-level <ol> nodes; every book on the page sits underneath one of them
    for container in all_container:
        # children = one <li> per book on the page
        children = container.findChildren('li')
        # child = holds the title, price and availability of a single book
        for child in children:
            # book titles
            books.append(child.article.h3.a['title'])
            # genres
            genres.append(category_list[i])
            # price - a stray character appears in front of the pound sign (likely an encoding mismatch), so slice it off
            price_list.append(child.findChildren('p')[1].text[1:])
            # availability
            availability.append(child.findChildren('p')[2].text.strip())
            # url of each book
            # slice off the leading '../../../' (9 characters) from the relative link
            books_url.append('http://books.toscrape.com/catalogue/' + child.article.h3.a['href'][9:])
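
The stray character in front of the pound sign is what you would expect when UTF-8 bytes are decoded as Latin-1. The sketch below reproduces the effect on a single price string without any network access. It assumes the site serves UTF-8, which the notebook does not verify. With `requests`, setting `response.encoding = 'utf-8'` before reading `response.text` is one way to get the correctly decoded text, in which case the `[1:]` slice would no longer be needed.

```python
# Bytes as a UTF-8 server would send them.
raw = '£45.17'.encode('utf-8')

# Decoding with the wrong charset turns the two-byte pound sign into two characters.
wrong = raw.decode('latin-1')
# Decoding with the right charset gives the original string back.
right = raw.decode('utf-8')

print(repr(wrong))
print(repr(right))
print(wrong[1:] == right)
```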

In [11]:
frame = {'Name':books, 'Price':price_list, 'Genre':genres, 'Availability':availability, 'URL':books_url}
book_df = pd.DataFrame(frame)
book_df

,Name,Price,Genre,Availability,URL
0,It's Only the Himalayas,£45.17,Travel,In stock,http://books.toscrape.com/catalogue/its-only-t...
1,Full Moon over Noahâs Ark: An Odyssey to Mou...,£49.43,Travel,In stock,http://books.toscrape.com/catalogue/full-moon-...
2,See America: A Celebration of Our National Par...,£48.87,Travel,In stock,http://books.toscrape.com/catalogue/see-americ...
3,Vagabonding: An Uncommon Guide to the Art of L...,£36.94,Travel,In stock,http://books.toscrape.com/catalogue/vagabondin...
4,Under the Tuscan Sun,£37.33,Travel,In stock,http://books.toscrape.com/catalogue/under-the-...
...,...,...,...,...,...
512,Why the Right Went Wrong: Conservatism--From G...,£52.65,Politics,In stock,http://books.toscrape.com/catalogue/why-the-ri...
513,Equal Is Unfair: America's Misguided Fight Aga...,£56.86,Politics,In stock,http://books.toscrape.com/catalogue/equal-is-u...
514,Amid the Chaos,£36.58,Cultural,In stock,http://books.toscrape.com/catalogue/amid-the-c...
515,Dark Notes,£19.19,Erotica,In stock,http://books.toscrape.com/catalogue/dark-notes...


In [12]:
# isnull() is an alias of isna(), so a single check is enough
print(book_df.isna().sum())

Name            0
Price           0
Genre           0
Availability    0
URL             0
dtype: int64


There are no missing values in any column.

### Web Scraping #6: IMDb Reviews - "Load More" Button

In [18]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import time

In [14]:
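# raw string so the backslashes in the Windows path are not treated as escape sequences
path = r'C:\Program Files (x86)\chromedriver.exe'
# Selenium 3 style call; Selenium 4 expects webdriver.Chrome(service=Service(path)) instead
driver = webdriver.Chrome(path)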
driver.get('https://www.imdb.com/title/tt8850222/reviews')

# on the IMDb website, reviews aren't divided into pages; instead there is a "load more" button
# to grab the source code of all reviews, use selenium to click the button until the list is fully expanded
while True:
    try:
        # wait up to 10 seconds for the load more button to be present
        load_more = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, 'load-more-trigger'))
        )
        time.sleep(2)
        # clicking load more button
        ActionChains(driver).move_to_element(load_more).click(load_more).perform()
    except Exception:
        # if there is nothing more to expand, store the source code of all reviews and quit the driver
        html = driver.page_source
        driver.quit()
        break

titles = []
ratings = []
reviews = []
users = []
dates = []

# web scraping using BeautifulSoup
soup = bs.BeautifulSoup(html, 'lxml')
all_containers = soup.findAll('div', {'class':'lister-item-content'})
for container in all_containers:
    # title cannot be empty
    titles.append(container.a.text.strip())
    if container.div.span.span is None:
        ratings.append(0)
    else:
        ratings.append(container.div.span.span.text)
    # user name and post date are in the same string
    # note: the day of the month appears to stay attached to the user name
    # (e.g. 'TheFinalGirl1324'), since there is no whitespace between the two
    name_date = container.findChildren('div')[1].text.split()
    users.append(name_date[0])
    # concatenate month and year
    dates.append(name_date[1] + ' ' + name_date[2])
    review_div = container.findChildren('div')[2].div
    if review_div is None:
        reviews.append('None')
    else:
        reviews.append(review_div.text)

In [17]:
frame = {'Title':titles, 'User':users, 'Date':dates, 'Rating':ratings, 'Review':reviews}
review_df = pd.DataFrame(frame)
review_df.head(10)

,Title,User,Date,Rating,Review
0,Everything at once,TheFinalGirl1324,July 2020,6,Peninsula is an extremely fun movie to watch. ...
1,More to humans vs humans than zombies vs humans,forthatusage15,July 2020,3,
2,A lot of valid criticism in the reviews I have...,simonize8508,August 2020,6,I was one of three patrons of a local theatre ...
3,Peninsula: An entertaining yet Disappointing o...,acinemalens15,July 2020,5,"When Train to Busan hit theaters, it became a ..."
4,trash,mingmiinteoh16,July 2020,2,
5,More action than the first one!,fluffset26,July 2020,7,Everyone expect the same amount of emotion and...
6,Absolute Garbage *AVOID THIS MOVIE*,Orgasmo-Erectus17,July 2020,1,"Allowing this movie to air on the big screen, ..."
7,The Worst Movie that Came Out of South Korea,osman_teket8,August 2020,1,I have no idea how this movie takes itself ser...
8,Fast and Furious,alfredloo17,July 2020,6,
9,"Disappointed, feels like its a graphic action ...",shibal-0090216,July 2020,5,Not thrilled at all n not scary at all.\nNot t...
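
The `Date` and `Rating` columns above are still text (with an integer `0` mixed in for missing ratings). A minimal sketch of converting them to proper types is shown below. `parse_month_year` and `parse_rating` are hypothetical helpers, and `%B` relies on English month names, which assumes an English locale.

```python
from datetime import datetime

def parse_month_year(text):
    """Turn a string like 'July 2020' into a datetime set to the first of that month."""
    return datetime.strptime(text, '%B %Y')

def parse_rating(value):
    """Ratings are scraped as strings, with 0 used when a review has no rating."""
    return int(value)

d = parse_month_year('July 2020')
print(d.year, d.month)
print(parse_rating('6'), parse_rating(0))
```

Applied to the dataframe, these could be passed to `review_df['Date'].map(...)` and `review_df['Rating'].map(...)` so that the reviews can be sorted by date or averaged by rating.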
