In [1]:
import pandas as pd
import numpy as np
import requests
import bs4 as bs

### Web Scraping #1: Getting a list of best-selling GPUs from newegg.ca

The site doesn't provide a ready-made table, so I need to build one myself by scraping the product listing pages.

There are over a hundred pages of GPUs, but I'm going to use only 3 pages of the most **popular** GPUs.

I only want to see the top 15, so scraping just the first page would be enough; going through three pages is simply practice at scraping multiple pages.

Each page holds 36 items, so I should collect 108 in total.
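As a quick sanity check on that arithmetic, the page URLs and the expected item count can be sketched as follows. The URL template is the one used in the scraping cell that follows; `URL_TEMPLATE`, `ITEMS_PER_PAGE` and `build_page_urls` are names introduced here for illustration, and 36 items per page is the figure stated above, not something this snippet verifies against the site.

```python
# URL template copied from the scraping cell; {} is the page number.
URL_TEMPLATE = (
    'https://www.newegg.ca/Desktop-Graphics-Cards/SubCategory/ID-48/'
    'Page-{}?Tid=7709&Order=3'
)
ITEMS_PER_PAGE = 36  # assumption: page size as stated in the text


def build_page_urls(first_page, last_page):
    """Return one URL per page, including both end pages."""
    return [URL_TEMPLATE.format(page) for page in range(first_page, last_page + 1)]


urls = build_page_urls(1, 3)
expected_items = len(urls) * ITEMS_PER_PAGE
print(len(urls), expected_items)
```

Building the URLs separately from the requests makes it easy to change the page range without touching the parsing code.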

In [2]:
# fetch the source of each listing page and collect the fields needed for the table
page_list = list(range(1, 4))
brand = []
name = []
price_list = []
shipping_list = []

# looping through 3 pages
for page in page_list:
    response = requests.get('https://www.newegg.ca/Desktop-Graphics-Cards/SubCategory/ID-48/Page-{}?Tid=7709&Order=3'.format(page))
    # turn it into a BeautifulSoup object - lxml is the parser
    soup = bs.BeautifulSoup(response.text, 'lxml')
    # grab each product, plus every price and shipping element on the page
    containers = soup.find_all('div', {'class':'item-container'})
    price = soup.find_all('li', {'class':'price-current'})
    shipping = soup.find_all('li', {'class':'price-ship'})

    # getting brands and names (products without a brand logo are skipped)
    for container in containers:
        if container.div.div.a.img is not None:
            brand.append(container.div.div.a.img['title'])
            name.append(container.a.img['title'])

    # getting prices
    for item in price:
        price_list.append(item.strong.text + item.sup.text)

    # getting prices of shipping
    for item in shipping:
        shipping_list.append(item.text.split()[0])

print(len(shipping_list))
print(len(price_list))
print(len(brand))
print(len(name))

108
108
108
108
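
All four counts match here, but that is not guaranteed: brands and names are skipped when a product has no brand image, while prices and shipping are collected for every product on the page. A small guard along these lines would report which list is short before the table is built. This is only a sketch, and `check_aligned` is a hypothetical helper, not part of the notebook.

```python
def check_aligned(**columns):
    """Return the shared length of the given lists, or raise if they differ."""
    lengths = {label: len(values) for label, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    # All lengths are equal here (or no columns were passed at all).
    return next(iter(lengths.values()), 0)


# Toy data standing in for the scraped lists.
n_rows = check_aligned(name=['a', 'b'], brand=['x', 'y'], price=[1.0, 2.0])
print(n_rows)
```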


In [3]:
price_list = [float(x.replace(',','')) for x in price_list]
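
The conversion above assumes every scraped price has the form `1,299.99`. A slightly more defensive version might also strip a currency symbol and surrounding whitespace. `parse_price` below is a hypothetical helper sketched under that assumption, not something the notebook uses.

```python
def parse_price(text):
    """Convert a price string such as '1,299.99' or '$338.99' to a float."""
    cleaned = text.strip().replace('$', '').replace(',', '')
    return float(cleaned)


prices = [parse_price(p) for p in ['338.99', '1,299.99', ' $519.99 ']]
print(prices)
```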

In [4]:
frame = {'Name':name, 'Brand':brand, 'Price':price_list, 'Shipping':shipping_list}
gpu_df = pd.DataFrame(frame)
print(gpu_df.head(15))

                                                 Name          Brand   Price  \
0   MSI GeForce GTX 1660 SUPER DirectX 12 GTX 1660...            MSI  338.99   
1   MSI GeForce GTX 1660 DirectX 12 GTX 1660 VENTU...            MSI  289.99   
2   MSI GeForce RTX 2060 DirectX 12 RTX 2060 GAMIN...            MSI  519.99   
3   MSI GeForce RTX 2070 SUPER DirectX 12 RTX 2070...            MSI  789.99   
4   MSI GeForce RTX 2060 DirectX 12 RTX 2060 VENTU...            MSI  459.99   
5   GIGABYTE GeForce GTX 1660 SUPER DirectX 12 GV-...       GIGABYTE  349.99   
6   MSI GeForce RTX 2070 DirectX 12 RTX 2070 VENTU...            MSI  569.99   
7   MSI GeForce RTX 2070 SUPER DirectX 12 RTX 2070...            MSI  729.99   
8   ASUS TUF Gaming GeForce GTX 1660 SUPER Overclo...           ASUS  299.99   
9   EVGA GeForce RTX 2060 SUPER SC ULTRA GAMING, 0...           EVGA  569.99   
10  SAPPHIRE PULSE Radeon RX 5700 XT 100416P8GL 8G...  Sapphire Tech  549.99   
11  SAPPHIRE NITRO+ Radeon RX 5700 XT 10

These are the 15 best-selling GPUs on Newegg.ca.
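
The `Shipping` column is still text at this point. If a total cost per card were wanted, one possible approach is sketched below on toy rows. It assumes the scraped shipping values look like `Free` or `$9.99`, which the notebook does not show; `toy_df` and `shipping_to_float` are names made up for this example.

```python
import pandas as pd

# Toy rows; the real values would come from the scraped lists.
toy_df = pd.DataFrame({
    'Name': ['Card A', 'Card B', 'Card C'],
    'Price': [338.99, 289.99, 519.99],
    'Shipping': ['Free', '$9.99', '$4.99'],
})


def shipping_to_float(value):
    """Map 'Free' to 0.0 and '$9.99' to 9.99; anything else becomes NaN."""
    if value.lower() == 'free':
        return 0.0
    try:
        return float(value.replace('$', '').replace(',', ''))
    except ValueError:
        return float('nan')


toy_df['Shipping Cost'] = toy_df['Shipping'].apply(shipping_to_float)
toy_df['Total'] = toy_df['Price'] + toy_df['Shipping Cost']
print(toy_df.sort_values(by='Total').reset_index(drop=True))
```

Unrecognised shipping labels become NaN rather than raising, so they remain visible in the table instead of silently counting as free.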

### Web Scraping #2: Steam Sale

In [5]:
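# getting the source code
# note: the '{}' in the URL is never filled in with .format(); it is left exactly as it was in the original request
steam_url = requests.get('https://store.steampowered.com/search/?specials={}')
soup = bs.BeautifulSoup(steam_url.text, 'lxml')
all_container = soup.find_all('div', {'class':'responsive_search_name_combined'})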

game_list = []
discount_list = []
original_price = []
dis_price_list = []

for container in all_container:
    game = container.findChildren()[1].text
    game_list.append(game)

    # the fourth <div> descendant holds the discount and price information
    price_block = container.findChildren('div')[3]

    # percentage of discount
    if price_block.div.span is None:
        discount_list.append('No Discount')
    else:
        discount_list.append(price_block.div.span.text.replace('-',''))

    # original price (struck through when the game is discounted)
    price_line = price_block.findChildren('div')[1]
    if price_line.strike is None:
        original_price.append('0')
    else:
        original_price.append(price_line.strike.text.split()[1])

    # discounted price: the last whitespace-separated token of the price line
    line_split = price_line.text.strip().split()
    if not line_split:
        dis_price_list.append('0')
    else:
        dis_price_list.append(line_split[-1])

# checking if the number of items match
print(len(dis_price_list))
print(len(original_price))
print(len(discount_list))
print(len(game_list))

50
50
50
50


In [6]:
# to calculate the discounted amount, convert original price list and discounted price list to numpy arrays and do subtraction
original_price = np.array([float(x) for x in original_price])
dis_price_list = np.array([float(x) for x in dis_price_list])

frame = {'Game':game_list, 'Original Price':original_price, 'Discounted Amount':original_price-dis_price_list,
         'Discounted Price':dis_price_list, 'Discount %':discount_list}
steam_df = pd.DataFrame(frame)
steam_df.sort_values(by='Discounted Amount', inplace=True, ascending=False)
steam_df.reset_index(drop=True, inplace=True)
steam_df.head(20)

,Game,Original Price,Discounted Amount,Discounted Price,Discount %
0,Darksiders Blades & Whip Franchise Pack,166.46,131.64,34.82,79%
1,Wolfenstein Alt History Collection,144.96,103.6,41.36,71%
2,Dishonored: Complete Collection,109.99,77.0,32.99,70%
3,Elder Scrolls Summer Bundle,94.97,68.0,26.97,72%
4,RAGE 2,79.99,64.0,15.99,80%
5,The Elder Scrolls V: Skyrim VR,79.99,56.0,23.99,70%
6,Fallout 4: Game of the Year Edition,79.99,56.0,23.99,70%
7,Darksiders III,79.99,53.6,26.39,67%
8,Borderlands 3,79.99,40.0,39.99,50%
9,DOOM Eternal,79.99,40.0,39.99,50%


These are the 20 games with the largest absolute discount (original price minus sale price) in the sale.
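
The `Discount %` column is scraped as text, so it can be cross-checked against the two price columns. The sketch below recomputes the percentage with NumPy for three rows copied from the output above (the first row, RAGE 2 and Borderlands 3). Rounding to the nearest whole percent is an assumption about how the store displays the figure.

```python
import numpy as np

# Original and discounted prices for three rows from the table above.
original = np.array([166.46, 79.99, 79.99])
discounted = np.array([34.82, 15.99, 39.99])

amount_off = original - discounted
percent_off = np.round(amount_off / original * 100).astype(int)
print(percent_off)
```

Games without a discount are stored with an original price of 0 in this notebook, so they would need to be filtered out before dividing.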

### Web Scraping #3: EPL Table

This time the website already provides a table, so I can grab it directly and then clean it up.

In [7]:
epl_url = 'https://www.skysports.com/premier-league-table'
epl_list = pd.read_html(epl_url) # read_html returns a list with one DataFrame per table found at the url
print(len(epl_list)) # number of tables found
epl_df = epl_list[0] # the first (and only) table is already a DataFrame
print(epl_df)

1
     #                      Team  Pl   W   D   L    F   A  GD  Pts  Last 6
0    1                 Liverpool  38  32   3   3   85  33  52   99     NaN
1    2           Manchester City  38  26   3   9  102  35  67   81     NaN
2    3         Manchester United  38  18  12   8   66  36  30   66     NaN
3    4                   Chelsea  38  20   6  12   69  54  15   66     NaN
4    5            Leicester City  38  18   8  12   67  41  26   62     NaN
5    6         Tottenham Hotspur  38  16  11  11   61  47  14   59     NaN
6    7   Wolverhampton Wanderers  38  15  14   9   51  40  11   59     NaN
7    8                   Arsenal  38  14  14  10   56  48   8   56     NaN
8    9          Sheffield United  38  14  12  12   39  39   0   54     NaN
9   10                   Burnley  38  15   9  14   43  50  -7   54     NaN
10  11               Southampton  38  15   7  16   51  60  -9   52     NaN
11  12                   Everton  38  13  10  15   44  56 -12   49     NaN
12  13          Newcast

The table looks great; however, it needs some polishing.

Let's
* Drop `Last 6` since it's a column of NaN values
* Change names of the columns for clarity
* Set rank as index to avoid redundancy

In [8]:
# dropping & renaming columns
epl_df.drop(columns='Last 6', inplace=True)
epl_df.rename(columns={
    '#': 'Rank',
    'Pl': 'Games Played',
    'W': 'Win',
    'D': 'Draw',
    'L': 'Loss',
    'F': 'Goals For',
    'A': 'Goals Against',
    'GD': 'Goal Difference',
}, inplace=True)
# use rank as the index to avoid redundancy
epl_df.set_index(keys='Rank', drop=True, inplace=True)

epl_df

Rank,Team,Games Played,Win,Draw,Loss,Goals For,Goals Against,Goal Difference,Pts
1,Liverpool,38,32,3,3,85,33,52,99
2,Manchester City,38,26,3,9,102,35,67,81
3,Manchester United,38,18,12,8,66,36,30,66
4,Chelsea,38,20,6,12,69,54,15,66
5,Leicester City,38,18,8,12,67,41,26,62
6,Tottenham Hotspur,38,16,11,11,61,47,14,59
7,Wolverhampton Wanderers,38,15,14,9,51,40,11,59
8,Arsenal,38,14,14,10,56,48,8,56
9,Sheffield United,38,14,12,12,39,39,0,54
10,Burnley,38,15,9,14,43,50,-7,54


Everything looks clear and polished. Job done.