Research Question: Can I predict a game's overall sales based on information such as its publisher, developer, genre, etc.?

This data was generated by scraping vgchartz.com, a website that hosts video game and console data, including sales figures. The original script is available at https://github.com/GregorUT/vgchartzScrape. I had to update and alter it to fit this project, so the data was not easy to collect. Because of rate limiting, the scrape took almost 6 hours to run. The code is in the cells below. It does not reflect all of the code I ran: although Google Colab successfully saved some CSVs to my Drive, the notebook itself did not save (I think it had been running so long that its state was not saved, but I know the run completed successfully). The only code missing is the code that saves to Google Drive and drops some columns.
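As a rough sanity check on that runtime, the pauses between requests account for most of it. The sketch below is a back-of-the-envelope estimate only: it assumes one 13-second pause per game and one 20-second pause per results page (the sleep values used in the scraper), two results pages, and the 1318 rows of the final dataset. Time spent waiting on the network is not included.

```python
# Rough estimate of how long the scraper spends sleeping.
# Assumptions: 1318 games (the row count of the final dataset),
# 2 results pages, and the sleep durations used in the scraper.
games = 1318
result_pages = 2
sleep_per_game_s = 13
sleep_per_page_s = 20

total_sleep_s = games * sleep_per_game_s + result_pages * sleep_per_page_s
total_sleep_h = total_sleep_s / 3600

print(f"Sleeping alone takes about {total_sleep_h:.1f} hours")
```

Under these assumptions the pauses come to roughly 4.8 hours, so the rest of the runtime would presumably be request latency.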


This dataset covers games sold for the Xbox One. VGChartz does not outline how it collects its data, but the figures seem to be accurate, so I treat them as suitable for predicting a game's success. I dropped some columns that I didn't need (such as the region-specific sales columns).
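The column-dropping step was lost with the notebook state, so the following is only a sketch of what it might have looked like. The list of dropped columns is inferred by comparing the scraper's output columns with the columns in the hosted CSV; the function name `drop_unused_columns` and the example row are hypothetical.

```python
import pandas as pd

# Columns present in the scraper's output but absent from the hosted CSV.
DROPPED_COLUMNS = ["Platform", "NA_Sales", "PAL_Sales", "JP_Sales", "Other_Sales"]


def drop_unused_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the platform and regional sales columns."""
    return df.drop(columns=DROPPED_COLUMNS)


# Tiny stand-in for the scraped frame (values are placeholders, not real data).
example = pd.DataFrame({
    "Rank": [1], "Name": ["Example Game"], "Platform": ["XOne"], "Year": [2014],
    "Genre": ["Action"], "Publisher": ["Example Publisher"],
    "Developer": ["Example Developer"], "Critic_Score": [9.0], "User_Score": [9.0],
    "NA_Sales": [1.0], "PAL_Sales": [1.0], "JP_Sales": [0.0],
    "Other_Sales": [0.5], "Global_Sales": [2.5],
})
trimmed = drop_unused_columns(example)
print(list(trimmed.columns))

# Saving would then be a single call, e.g. trimmed.to_csv("df_vg_final.csv")
```

Passing `index=False` to `to_csv` would avoid writing the row index as an extra unnamed column.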

The fields include:
- Rank - Ranking of overall sales
- Name - The game's name
- Year - Year of the game's release
- Genre - Genre of the game
- Publisher - Publisher of the game
- Developer - Developer of the game
- Critic_Score - Critic review score
- User_Score - User review score
- Global_Sales - Total worldwide sales.

There are 1318 rows, each representing a game for the Xbox One.
Of those, 539 have sold more than 10,000 copies (the minimum for sales to be recorded on this website).
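A count like this can be checked directly against the data. The sketch below assumes that `Global_Sales` is stored in millions of units and that games without recorded sales appear as either `NaN` or `0`; both are assumptions about the CSV, and the helper name `count_recorded_sales` is hypothetical. Whether it reproduces 539 on the real data depends on those assumptions.

```python
import numpy as np
import pandas as pd


def count_recorded_sales(df: pd.DataFrame) -> int:
    """Count rows whose Global_Sales is present and greater than zero."""
    sales = df["Global_Sales"]
    return int((sales.notna() & (sales > 0)).sum())


# Placeholder values, not real data: two games with recorded sales, two without.
example = pd.DataFrame({"Global_Sales": [8.72, 0.01, 0.0, np.nan]})
print(count_recorded_sales(example))  # 2
```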

The Xbox One was released in 2013 and has since sold ~46.6m units.

I have hosted the data on GitHub at https://raw.githubusercontent.com/thedanieljk/Data301FinalProject/master/df_vg_final.csv

In [0]:
import time
import urllib.request  # a bare "import urllib" does not reliably expose urllib.request

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup, element


# https://www.vgchartz.com/gamedb/?page=1&console=XOne&region=All&developer=&publisher=&genre=&boxart=Both&ownership=Both&results=1000&order=Sales&showtotalsales=0&showtotalsales=1&showpublisher=0&showpublisher=1&showvgchartzscore=0&shownasales=1&showdeveloper=1&showcriticscore=1&showpalsales=0&showpalsales=1&showreleasedate=1&showuserscore=1&showjapansales=1&showlastupdate=0&showothersales=1&showgenre=1&sort=GL

pages = 3
rec_count = 0
rank = []
gname = []
platform = []
year = []
genre = []
critic_score = []
user_score = []
publisher = []
developer = []
sales_na = []
sales_pal = []
sales_jp = []
sales_ot = []
sales_gl = []

urlhead = 'http://www.vgchartz.com/gamedb/?page='
urltail = '&console=XOne&region=All&developer=&publisher=&genre=&boxart=Both&ownership=Both'
urltail += '&results=1000&order=Sales&showtotalsales=0&showtotalsales=1&showpublisher=0'
urltail += '&showpublisher=1&showvgchartzscore=0&shownasales=1&showdeveloper=1&showcriticscore=1'
urltail += '&showpalsales=0&showpalsales=1&showreleasedate=1&showuserscore=1&showjapansales=1'
urltail += '&showlastupdate=0&showothersales=1&showgenre=1&sort=GL'

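# range(1, pages) stops before `pages`, so pages = 3 fetches result pages 1 and 2
for page in range(1, pages):
    surl = urlhead + str(page) + urltail
    r = urllib.request.urlopen(surl).read()
    time.sleep(20)  # pause between requests because of the site's rate limiting
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(r, "html.parser")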
    print(f"Page: {page}")
    # vgchartz website is really weird so we have to search for
    # <a> tags with game urls
    # .get() avoids a KeyError on <a> tags that have no href attribute
    game_links = [
        a for a in soup.find_all("a")
        if a.attrs.get('href', '').startswith('https://www.vgchartz.com/game/')
    ]
    # discard the first 10 matches because those
    # links are in the navigation bar
    game_tags = game_links[10:]

    for tag in game_tags:

        # add name to list
        gname.append(" ".join(tag.string.split()))
        print(f"{rec_count + 1} Fetch data for game {gname[-1]}")

        # get different attributes
        # traverse up the DOM tree
        data = tag.parent.parent.find_all("td")
        rank.append(np.int32(data[0].string))
        platform.append(data[3].find('img').attrs['alt'])
        publisher.append(data[4].string)
        developer.append(data[5].string)
        critic_score.append(
            float(data[6].string) if
            not data[6].string.startswith("N/A") else np.nan)
        user_score.append(
            float(data[7].string) if
            not data[7].string.startswith("N/A") else np.nan)
        sales_na.append(
            float(data[9].string[:-1]) if
            not data[9].string.startswith("N/A") else np.nan)
        sales_pal.append(
            float(data[10].string[:-1]) if
            not data[10].string.startswith("N/A") else np.nan)
        sales_jp.append(
            float(data[11].string[:-1]) if
            not data[11].string.startswith("N/A") else np.nan)
        sales_ot.append(
            float(data[12].string[:-1]) if
            not data[12].string.startswith("N/A") else np.nan)
        sales_gl.append(
            float(data[8].string[:-1]) if
            not data[8].string.startswith("N/A") else np.nan)
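        # the release date ends in a two-digit year, or is "N/A" when unknown
        release_year = data[13].string.split()[-1]
        if release_year.startswith('N/A'):
            # use NaN rather than the string 'N/A' so the column stays numeric
            year.append(np.nan)
        else:
            # two-digit years of 80 and above are read as 19xx, the rest as 20xx
            if int(release_year) >= 80:
                year_to_add = np.int32("19" + release_year)
            else:
                year_to_add = np.int32("20" + release_year)
            year.append(year_to_add)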

        # go to every individual website to get genre info
        url_to_game = tag.attrs['href']
        site_raw = urllib.request.urlopen(url_to_game).read()
        time.sleep(13)
        sub_soup = BeautifulSoup(site_raw, "html.parser")
        # again, the info box is inconsistent among games so we
        # have to find all the h2 and traverse from that to the genre name
        h2s = sub_soup.find("div", {"id": "gameGenInfoBox"}).find_all('h2')
        # look for the <h2> whose text is "Genre"; its next sibling
        # holds the genre name
        genre_tag = None
        for h2 in h2s:
            if h2.string == 'Genre':
                genre_tag = h2
        # record NaN when no "Genre" heading exists so every list keeps
        # one entry per game
        if genre_tag is not None:
            genre.append(genre_tag.next_sibling.string)
        else:
            genre.append(np.nan)

        rec_count += 1

columns = {
    'Rank': rank,
    'Name': gname,
    'Platform': platform,
    'Year': year,
    'Genre': genre,
    'Critic_Score': critic_score,
    'User_Score': user_score,
    'Publisher': publisher,
    'Developer': developer,
    'NA_Sales': sales_na,
    'PAL_Sales': sales_pal,
    'JP_Sales': sales_jp,
    'Other_Sales': sales_ot,
    'Global_Sales': sales_gl
}
print(rec_count)
df = pd.DataFrame(columns)
print(df.columns)
df = df[[
    'Rank', 'Name', 'Platform', 'Year', 'Genre',
    'Publisher', 'Developer', 'Critic_Score', 'User_Score',
    'NA_Sales', 'PAL_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']]

Page: 1
1 Fetch data for game Grand Theft Auto V
2 Fetch data for game Call of Duty: Black Ops 3


KeyboardInterrupt: ignored
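The two-digit year handling in the scraper can be isolated and checked on its own. This is a minimal sketch rather than part of the original script: the helper name `expand_two_digit_year` is hypothetical, and it returns NaN for missing dates, which is an assumption about how those should be represented.

```python
import math


def expand_two_digit_year(raw: str) -> float:
    """Turn a two-digit year string into a full year, or NaN for 'N/A'."""
    if raw.startswith("N/A"):
        return math.nan
    two_digit = int(raw)
    # Same cutoff as the scraper: 80 and above is 19xx, otherwise 20xx.
    century = 1900 if two_digit >= 80 else 2000
    return century + two_digit


print(expand_two_digit_year("14"))  # 2014
print(expand_two_digit_year("85"))  # 1985
```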

In [0]:
df_vg = pd.read_csv("https://raw.githubusercontent.com/thedanieljk/Data301FinalProject/master/df_vg_final.csv")
df_vg.head()

Unnamed: 0,Rank,Name,Year,Genre,Publisher,Developer,Critic_Score,User_Score,Global_Sales
0,1,Grand Theft Auto V,2014.0,Action,Rockstar Games,Rockstar North,9.0,9.0,8.72
1,2,Call of Duty: Black Ops 3,2015.0,Shooter,Activision,Treyarch,,,7.37
2,3,Call of Duty: WWII,2017.0,Shooter,Activision,Sledgehammer Games,,,6.23
3,4,Red Dead Redemption 2,2018.0,Action-Adventure,Rockstar Games,Rockstar Games,,,5.77
4,5,Minecraft,2014.0,Misc,Microsoft Studios,Mojang,,,5.43
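Most of the score values in this preview are missing, which matters if Critic_Score and User_Score are to be used as predictors. A minimal sketch of how the share of missing values per column could be measured, using only the five rows shown above (on the full data the same call would be `df_vg.isna().mean()`):

```python
import numpy as np
import pandas as pd

# The five rows shown above, restricted to the score and sales columns.
head = pd.DataFrame({
    "Name": ["Grand Theft Auto V", "Call of Duty: Black Ops 3", "Call of Duty: WWII",
             "Red Dead Redemption 2", "Minecraft"],
    "Critic_Score": [9.0, np.nan, np.nan, np.nan, np.nan],
    "User_Score": [9.0, np.nan, np.nan, np.nan, np.nan],
    "Global_Sales": [8.72, 7.37, 6.23, 5.77, 5.43],
})

# Fraction of missing values per column.
missing_share = head.isna().mean()
print(missing_share)
```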
