# Web Scraping SteamCharts

<a href="https://steamcharts.com/">SteamCharts</a> is a website containing ongoing analysis of Steam's concurrent players. 

## What We'll Accomplish in this Notebook

In this notebook we'll do the following:
- Use `BeautifulSoup` to scrape data from SteamCharts
- Scrape the TopGames pages, and save them in TopGames.csv
- Scrape each game's data page, and save them all in GamesData.csv

In [1]:
## Import base packages we'll use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from seaborn import set_style
set_style("whitegrid")

In [2]:
## Import the packages for scraping
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests

### First we focus on the <a href="https://steamcharts.com/top">Top Games</a> pages.

We begin with some testing on the front page.

In [3]:
## First let's make a soup object
url = "https://steamcharts.com/top"
html = requests.get(url)
soup = BeautifulSoup(html.text,'html')

We create a dataframe to store the table's data for the last 30 days.

In [4]:
## We scrape the entire table and put it into a dataframe
topgames = soup.find('table',{'class':"common-table"} )
df_0 = pd.read_html(str(topgames))[0]

## Drop Unnamed:0 and Last 30 Days, then rename the index to ranking
df_1 = df_0.drop(columns=['Unnamed: 0','Last 30 Days'])
df_1.index.name = 'Ranking'

## For every entry we find the app_id and store it
app_ids = []
for game in soup.find_all('td',{'class':"game-name"}):
    app_id = game.find('a')['href'].replace('/app/','')
    app_ids.append(app_id)

## Add the app_ids to the dataframe
df_1.insert(1,'App_id',app_ids,True)

## Preview the first rows
df_1.head()

Ranking,Name,App_id,Current Players,Peak Players,Hours Played
0,Counter-Strike: Global Offensive,730,724037,1119102,509747009
1,Dota 2,570,449321,672307,302803456
2,PLAYERUNKNOWN'S BATTLEGROUNDS,578080,293717,419509,133453145
3,Apex Legends,1172470,184462,330879,112585996
4,Team Fortress 2,440,93341,109565,62237042
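
The app-id extraction used in the cell above can be tried offline on a small HTML snippet. The markup below is a made-up stand-in for the Top Games table (only the `game-name` class and the `/app/<id>` link pattern are taken from the code above), so this is a minimal sketch rather than the site's real structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking two rows of the Top Games table
sample_html = """
<table class="common-table">
  <tr><td class="game-name"><a href="/app/730">Counter-Strike: Global Offensive</a></td></tr>
  <tr><td class="game-name"><a href="/app/570">Dota 2</a></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Each game-name cell holds a link of the form /app/<app_id>
sample_ids = [
    cell.find("a")["href"].replace("/app/", "")
    for cell in sample_soup.find_all("td", {"class": "game-name"})
]
print(sample_ids)
```

The ids come out as strings, which is why they can be concatenated directly into a URL at this stage.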


#### We make the above test into a function, then scrape every page into a dataframe.

In [5]:
## This function scrapes the table from the given url and appends it to df,
## after cleaning the columns a bit and fixing the index

## Originally had a flag in case the table is empty, but now it's commented out
def scrape_table_0(url,df):
    html = requests.get(url)
    soup = BeautifulSoup(html.text,'html')
    topgames = soup.find('table',{'class':"common-table"} )
    
    ## If there is no table, return 0
    ##if topgames is None:
    ##    return(df,0)
    
    df_tmp_0 = pd.read_html(str(topgames))[0]
    df_tmp_1 = df_tmp_0.drop(columns=['Unnamed: 0','Last 30 Days'])
    
    app_ids = []
    for game in soup.find_all('td',{'class':"game-name"}):
        app_id = game.find('a')['href'].replace('/app/','')
        app_ids.append(app_id)
    df_tmp_1.insert(1,'App_id',app_ids,True)
    
    ## DataFrame.append was removed in pandas 2.0, so we use pd.concat
    df = pd.concat([df,df_tmp_1],ignore_index=True)
    df.index.name = 'Ranking'
    ##return(df,1)
    return(df)
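
The commented-out flag in `scrape_table_0` was meant to handle pages without a table. As a minimal sketch of that check: `soup.find` returns `None` (the object, not the string `'None'`) when nothing matches, so the test should use `is None`. The helper name `has_table` and the inline HTML are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

def has_table(html_text):
    """Return True if the page contains a table with class 'common-table'."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table", {"class": "common-table"})
    # find() gives back None when there is no match
    return table is not None

# Made-up page bodies: one with the table, one without
page_with_table = '<table class="common-table"><tr><td>1</td></tr></table>'
page_without_table = "<h1>Page Not Found</h1>"

print(has_table(page_with_table), has_table(page_without_table))
```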

In [6]:
## Test
page = 1
url = "https://steamcharts.com/top/p." + str(page)
##test,flag = scrape_table_0(url,df_1)
test = scrape_table_0(url,df_1)
##print(flag)
##test

## This is the main loop (watch out, it takes time!)

In [7]:
## The max page gets constantly updated, so we create a function to find it
def find_max_page():
    flag = 1
    ## It's generally above 500, so this saves some time
    page = 500
    while(flag):
        url = "https://steamcharts.com/top/p." + str(page)
        html = requests.get(url)
        soup = BeautifulSoup(html.text,'html')
        header = soup.find('h1').text
        if header == 'Page Not Found':
            flag = 0
            page = page-1
        else:
            page = page+1
    ## We take 10 away as the last few pages are always changing
    return(page-10)
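
Since `find_max_page` requests pages one at a time, a search that first doubles and then bisects could locate the boundary with fewer requests. This is only a sketch, under the assumption that every page up to the last one exists and none after it does. `page_exists` is a hypothetical predicate (in the notebook it would wrap the 'Page Not Found' check); here it is replaced by a fake so the example runs offline.

```python
def find_last_page(page_exists, start=1):
    """Return the largest page number p for which page_exists(p) is True.

    Assumes start >= 1, page_exists(start) is True, and existence is
    monotone (every page up to the last exists, none after it).
    """
    low = start
    high = start * 2
    # Doubling phase: grow `high` until it points past the last page
    while page_exists(high):
        low = high
        high *= 2
    # Bisection phase: page_exists(low) is True and page_exists(high) is False
    while high - low > 1:
        mid = (low + high) // 2
        if page_exists(mid):
            low = mid
        else:
            high = mid
    return low

# Fake predicate standing in for a real request: pretend 537 pages exist
calls = []
def fake_page_exists(page):
    calls.append(page)
    return page <= 537

last_page = find_last_page(fake_page_exists, start=500)
print(last_page, len(calls))
```

With the fake above, the search needs far fewer checks than stepping from page 500 one page at a time.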

In [598]:
## Loop through all pages
## The first page is already in df_1
page = 2   
##max_page = find_max_page()

## Instead of searching for the max page, we rely on it always being bigger than 500.
## The last pages are not informative anyway.
max_page = 500

## flag = 1
## We use df_1 to start the loop
df_old = df_1.copy()

## The loop makes max_page requests, i.e. it scrapes pages 2 to max_page+1
for i in range (0,max_page):
    url = "https://steamcharts.com/top/p." + str(page)
    ##df_new,flag = scrape_table_0(url,df_old)
    df_new = scrape_table_0(url,df_old)
    ##if flag == 0:
    ##    print('Page p.',page,'is empty.')
    ##    df_final = df_new.copy()
    ##else:
    ##   df_old = df_new.copy()
    df_old = df_new.copy()
    page = page + 1
    if page%50==0:
        print(page)

50
100
150
200
250
300
350
400
450
500
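
The loop above copies the growing dataframe on every iteration. A possible alternative, sketched here with a stand-in function instead of real requests, is to collect the per-page frames in a list and concatenate them once at the end. `fake_scrape` and its columns are made up for the example.

```python
import pandas as pd

def fake_scrape(page):
    """Stand-in for scraping one page; returns a tiny dataframe."""
    return pd.DataFrame({"Name": [f"Game {page}a", f"Game {page}b"],
                         "Current Players": [page * 10, page * 10 + 1]})

frames = []
for page in range(1, 4):
    frames.append(fake_scrape(page))

# One concatenation at the end instead of growing a dataframe in the loop
all_pages = pd.concat(frames, ignore_index=True)
all_pages.index.name = "Ranking"
print(all_pages.shape)
```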


In [8]:
df_final_0 = df_new.copy()

In [11]:
## Store the final dataframe to a .csv file
df_final_0.to_csv("TopGames.csv")

### Now we look at the relevant information for each game, for example <a href="https://steamcharts.com/app/730">Counter-Strike: Global Offensive</a>.

Let's begin with a test.

In [8]:
## Read the TopGames.csv dataframe for the rest of the code
df_final_0 = pd.read_csv("TopGames.csv")

In [19]:
## Function to create an empty dataframe in case table is empty
def dummy_frame():
    ## To get all columns we load the first game
    num = 0
    app_id = df_final_0.App_id[num]
    game = df_final_0.loc[df_final_0.App_id == app_id,'Name'].values[0]
    url = "https://steamcharts.com/app/" + str(app_id)
    html = requests.get(url)
    soup = BeautifulSoup(html.text,'html')

    ## Put the data in a dataframe and clean it
    data_tb = soup.find('table',{'class':"common-table"} )
    df_tmp_0 = pd.read_html(str(data_tb))[0]
    df_tmp_1 = df_tmp_0.rename(columns={'Avg. Players':'Avg_Players','% Gain':'Perc_Gain','Peak Players':'Peak_Players'})
    ##df_tmp_1 = df_tmp_0.drop(columns=['Unnamed: 0','Last 30 Days'])

    ## I need to change the columns to make the dataframe better
    cols_0 = ['Name','App_id']
    cols_1 = np.array(df_tmp_1.Month[0:])
    for i in range(0,len(cols_1)):
        cols_1[i] = cols_1[i].replace(' ','_')
    cols = np.concatenate([cols_0,np.array(df_tmp_1.columns[1:]),cols_1])

    ## I want to create a df where the columns are the months, and the other columns of df_tmp_0
    df_tmp_2 = pd.DataFrame(columns = cols)

    ## I fill in the rows with the info, depending on which feature of cols[2:6] I am considering
    if len(df_tmp_2.index) == 0:
        new_ind = 0
    else:
        new_ind = df_tmp_2.index[-1] + 1

    for i in range(0,4):
        row = []
        row.append(game)
        row.append(str(app_id))
        vect = np.zeros(4)
        vect[i] = 1
        row = row + list(vect)
        row = row + list(np.zeros(len(cols)-6))
        df_tmp_2.loc[new_ind] = row
        new_ind = new_ind+1
    return(df_tmp_2)

In [20]:
## First I scrape data from one game
num = 0
app_id = df_final_0.App_id[num]
game = df_final_0.loc[df_final_0.App_id == app_id,'Name'].values[0]
url = "https://steamcharts.com/app/" + str(app_id)
html = requests.get(url)
soup = BeautifulSoup(html.text,'html')

## Put the data in a dataframe and clean it
data_tb = soup.find('table',{'class':"common-table"} )

## If data is empty, then fill with empty rows
if data_tb is None:
    df_tmp_2 = dummy_frame()
## Otherwise we use the data in the table
else:
    df_tmp_0 = pd.read_html(str(data_tb))[0]
    df_tmp_1 = df_tmp_0.rename(columns={'Avg. Players':'Avg_Players','% Gain':'Perc_Gain','Peak Players':'Peak_Players'})
    ##df_tmp_1 = df_tmp_0.drop(columns=['Unnamed: 0','Last 30 Days'])

    ## I need to change the columns to make the dataframe better
    cols_0 = ['Name','App_id']
    cols_1 = np.array(df_tmp_1.Month[0:])
    for i in range(0,len(cols_1)):
        cols_1[i] = cols_1[i].replace(' ','_')
    cols = np.concatenate([cols_0,np.array(df_tmp_1.columns[1:]),cols_1])

    ## I want to create a df where the columns are the months, and the other columns of df_tmp_0
    df_tmp_2 = pd.DataFrame(columns = cols)

    ## I fill in the rows with the info, depending on which feature of cols[2:6] I am considering
    if len(df_tmp_2.index) == 0:
        new_ind = 0
    else:
        new_ind = df_tmp_2.index[-1] + 1

    for i in range(0,4):
        row = []
        row.append(game)
        row.append(app_id)
        vect = np.zeros(4)
        vect[i] = 1
        row = row + list(vect)
        row = row + list(df_tmp_1.loc[:,cols[i+2]])
        df_tmp_2.loc[new_ind] = row
        new_ind = new_ind+1
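
The reshaping above turns the per-month table into one row per metric, with the months as columns. A similar wide layout can be sketched with `set_index` and a transpose. This is a swapped-in alternative, not the notebook's method: it labels each row with a single `Metric` column rather than the four indicator columns, and all values below are made up.

```python
import pandas as pd

# Made-up monthly data in the shape of one game's table
monthly = pd.DataFrame({
    "Month": ["Last 30 Days", "April 2021", "March 2021"],
    "Avg_Players": [100.0, 90.0, 80.0],
    "Peak_Players": [250, 240, 200],
})

# Months become columns, metrics become rows
wide = monthly.set_index("Month").T
wide.columns = [month.replace(" ", "_") for month in wide.columns]
wide = wide.rename_axis("Metric").reset_index()

# Prepend identifying columns (hypothetical values)
wide.insert(0, "App_id", "730")
wide.insert(0, "Name", "Example Game")
print(wide)
```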

Now we transform the above code into a function that can be run for every game (app_id) and returns the dataframe to append.

In [21]:
dummy_df = dummy_frame()

def scrape_table_1(app_id):
    game = df_final_0.loc[df_final_0.App_id == app_id,'Name'].values[0]
    url = "https://steamcharts.com/app/" + str(app_id)
    html = requests.get(url)
    soup = BeautifulSoup(html.text,'html')

    ## Put the data in a dataframe and clean it
    data_tb = soup.find('table',{'class':"common-table"} )
    
    ## If data is empty, then fill with empty rows
    if data_tb is None:
        df_tmp_2 = dummy_df.copy()
        df_tmp_2.Name = str(game)
        df_tmp_2.App_id = str(app_id)
    else:
        df_tmp_0 = pd.read_html(str(data_tb))[0]
        df_tmp_1 = df_tmp_0.rename(columns={'Avg. Players':'Avg_Players','% Gain':'Perc_Gain','Peak Players':'Peak_Players'})
        ##df_tmp_1 = df_tmp_0.drop(columns=['Unnamed: 0','Last 30 Days'])

        ## I need to change the columns to make the dataframe better
        cols_0 = ['Name','App_id']
        cols_1 = np.array(df_tmp_1.Month[0:])
        for i in range(0,len(cols_1)):
            cols_1[i] = cols_1[i].replace(' ','_')
        cols = np.concatenate([cols_0,np.array(df_tmp_1.columns[1:]),cols_1])

        ## I want to create a df where the columns are the months, and the other columns of df_tmp_0
        df_tmp_2 = pd.DataFrame(columns = cols)


        ## I fill in the rows with the info, depending on which feature of cols[2:6] I am considering
        if len(df_tmp_2.index) == 0:
            new_ind = 0
        else:
            new_ind = df_tmp_2.index[-1] + 1

        for i in range(0,4):
            row = []
            row.append(game)
            row.append(str(app_id))
            vect = np.zeros(4)
            vect[i] = 1
            row = row + list(vect)
            row = row + list(df_tmp_1.loc[:,cols[i+2]])
            df_tmp_2.loc[new_ind] = row
            new_ind = new_ind+1
    return(df_tmp_2)

In [32]:
df_final_0.App_id[1000:1100].value_counts()

448510     1
22370      1
528200     1
1546570    1
39500      1
          ..
761830     1
384180     1
468920     1
377530     1
1101190    1
Name: App_id, Length: 100, dtype: int64

In [29]:
scrape_table_1(740250)

Unnamed: 0,Name,App_id,Avg_Players,Gain,Perc_Gain,Peak_Players,Last_30_Days,April_2021,March_2021,February_2021,...,February_2019,January_2019,December_2018,November_2018,October_2018,September_2018,August_2018,July_2018,June_2018,May_2018
0,Neos VR,740250,1.0,0.0,0.0,0.0,165.48,153.29,123,118.11,...,5.12,3.94,3.06,2.25,2.58,1.72,2.68,2.16,1.24,2.73
1,Neos VR,740250,0.0,1.0,0.0,0.0,+12.2,30.29,4.89,8.43,...,1.18,0.89,0.81,-0.34,0.86,-0.96,0.52,0.92,-1.50,-
2,Neos VR,740250,0.0,0.0,1.0,0.0,+7.95%,+24.62%,+4.14%,+7.69%,...,+29.99%,+29.05%,+35.87%,-12.98%,+50.29%,-35.74%,+23.86%,+74.56%,-54.72%,-
3,Neos VR,740250,0.0,0.0,0.0,1.0,317,317,210,218,...,21,22,11,11,15,10,12,8,6,15
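
The output above still holds strings such as `+7.95%`, and `-` where no value is available. As a minimal sketch of how such a column might be converted to numbers before analysis (the helper name `to_number` is an assumption, not part of the notebook):

```python
import pandas as pd

def to_number(series):
    """Convert strings such as '+7.95%' or '-0.34' to floats; '-' becomes NaN."""
    cleaned = (series.astype(str)
                     .str.replace("%", "", regex=False)
                     .str.replace("+", "", regex=False))
    # errors="coerce" turns anything unparseable (like a lone '-') into NaN
    return pd.to_numeric(cleaned, errors="coerce")

raw = pd.Series(["+7.95%", "-12.98%", "-", "30.29"])
numbers = to_number(raw)
print(numbers.tolist())
```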


## This is the main loop (watch out, it takes time!)
Note that most of the games further down the ranking are not very popular, so it might be better to cut them off, as they probably don't hold relevant information.
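
One way to cut off the less popular games before the loop is to filter on a player-count column. The sketch below uses made-up rows with the same column names as TopGames.csv. The cutoff `min_peak_players` is a hypothetical value, and the right threshold depends on the analysis.

```python
import pandas as pd

# Made-up rows with the same columns as TopGames.csv
top_games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "App_id": [10, 20, 30],
    "Peak Players": [50000, 12, 300],
})

# Hypothetical cutoff for what counts as a popular game
min_peak_players = 100
popular = top_games[top_games["Peak Players"] >= min_peak_players]
print(popular["App_id"].tolist())
```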

In [27]:
df_final_0.loc[df_final_0.App_id == app_id,'Name'].values[0]

'Counter-Strike: Global Offensive'

In [65]:
## Loop through all app_id and collect one dataframe per game
## Starting from an empty list avoids duplicating the first game
frames = []
print('Total =', len(df_final_0.App_id))
## for app_id in df_final_0.App_id[1:10]:
for count,app_id in enumerate(df_final_0.App_id):
    frames.append(scrape_table_1(app_id))
    if count%50==0:
        print('Count =',count)

## Concatenate once at the end (DataFrame.append was removed in pandas 2.0)
df_tmp_3 = pd.concat(frames,ignore_index=True)

Total = 12525
Count = 0
Count = 50
Count = 100
Count = 150
Count = 200
Count = 250
Count = 300
Count = 350
Count = 400
Count = 450
Count = 500
Count = 550
Count = 600
Count = 650
Count = 700
Count = 750
Count = 800
Count = 850
Count = 900
Count = 950
Count = 1000
Count = 1050
Count = 1100
Count = 1150
Count = 1200
Count = 1250
Count = 1300
Count = 1350
Count = 1400
Count = 1450
Count = 1500
Count = 1550
Count = 1600
Count = 1650
Count = 1700
Count = 1750
Count = 1800
Count = 1850
Count = 1900
Count = 1950
Count = 2000
Count = 2050
Count = 2100
Count = 2150
Count = 2200
Count = 2250
Count = 2300
Count = 2350
Count = 2400
Count = 2450
Count = 2500
Count = 2550
Count = 2600
Count = 2650
Count = 2700
Count = 2750
Count = 2800
Count = 2850
Count = 2900
Count = 2950
Count = 3000
Count = 3050
Count = 3100
Count = 3150
Count = 3200
Count = 3250
Count = 3300
Count = 3350
Count = 3400
Count = 3450
Count = 3500
Count = 3550
Count = 3600
Count = 3650
Count = 3700
Count = 3750
Count = 3800
Count = 

In [66]:
df_final_1 = df_tmp_3.copy()

In [67]:
## Store the final dataframe to a .csv file
df_final_1.to_csv("GamesData.csv", index=True)

### Finally, let's try to scrape the Steam store: store.steampowered.com.
For example <a href="https://store.steampowered.com/app/730/CounterStrike_Global_Offensive/">Counter-Strike: Global Offensive</a>.

Let's begin with a test.

In [613]:
## First I scrape data from one game
app_id = df_final_0.App_id[0]
game = df_final_0.loc[df_final_0.App_id == app_id,'Name'].values[0]
url = "https://store.steampowered.com/app/" + str(app_id)
html = requests.get(url)
soup = BeautifulSoup(html.text,'html')

## Put the data in a dataframe and clean it
data_tb = soup.find('table',{'class':"common-table"} )
df_tmp_0 = pd.read_html(str(data_tb))[0]
df_tmp_1 = df_tmp_0.rename(columns={'Avg. Players':'Avg_Players','% Gain':'Perc_Gain','Peak Players':'Peak_Players'})
##df_tmp_1 = df_tmp_0.drop(columns=['Unnamed: 0','Last 30 Days'])


In [None]:
## First I scrape data from one game
app_id = df_final_0.App_id[0]
game = df_final_0.loc[df_final_0.App_id == app_id,'Name'].values[0]
url = "https://steamcharts.com/app/" + str(app_id)
html = requests.get(url)
soup = BeautifulSoup(html.text,'html')

## Put the data in a dataframe and clean it
data_tb = soup.find('table',{'class':"common-table"} )
df_tmp_0 = pd.read_html(str(data_tb))[0]
df_tmp_1 = df_tmp_0.rename(columns={'Avg. Players':'Avg_Players','% Gain':'Perc_Gain','Peak Players':'Peak_Players'})
##df_tmp_1 = df_tmp_0.drop(columns=['Unnamed: 0','Last 30 Days'])

## I need to change the columns to make the dataframe better
cols_0 = ['Name','App_id']
cols_1 = np.array(df_tmp_1.Month[0:])
for i in range(0,len(cols_1)):
    cols_1[i] = cols_1[i].replace(' ','_')
cols = np.concatenate([cols_0,np.array(df_tmp_1.columns[1:]),cols_1])

## I want to create a df where the columns are the months, and the other columns of df_tmp_0
df_tmp_2 = pd.DataFrame(columns = cols)


## I fill in the rows with the info, depending on which feature of cols[2:6] I am considering
if len(df_tmp_2.index) == 0:
    new_ind = 0
else:
    new_ind = df_tmp_2.index[-1] + 1

for i in range(0,4):
    row = []
    row.append(game)
    row.append(app_id)
    vect = np.zeros(4)
    vect[i] = 1
    row = row + list(vect)
    row = row + list(df_tmp_1.loc[:,cols[i+2]])
    df_tmp_2.loc[new_ind] = row
    new_ind = new_ind+1