<h1>Video Game Data EDA</h1>

In [170]:
import pandas as pd
import numpy as np

In [171]:
raw_games_data_df = pd.read_csv('../data/vgchartz_games_webscrape.csv', dtype=str)
expanded_games_data_df = pd.read_csv('../data/games_data_expanded.csv')

<h3>Prepping the data for merging</h3>

We are making necessary modifications in order to perform a merger between the two datasets.

In [172]:
# Reformat the release date column to be YYYY-MM-DD
raw_games_data_df['release_date'] = pd.to_datetime(raw_games_data_df['release_date'], errors='coerce').dt.strftime('%Y-%m-%d')
raw_games_data_df['last_update_date'] = pd.to_datetime(raw_games_data_df['last_update_date'], errors='coerce').dt.strftime('%Y-%m-%d')

# Rename the columns in the expanded dataset to match the raw dataset
expanded_games_data_df = expanded_games_data_df.rename(columns={
    'Name': 'game',
})

# Strip out any whitespace from the 'game' column
raw_games_data_df['game'] = raw_games_data_df['game'].str.strip()
expanded_games_data_df['game'] = expanded_games_data_df['game'].str.strip()

<h3>Merging the 2 datasets</h3>

We are merging the valuable datapoints from the expanded dataset (sourced from Kaggle) into the original dataset (webscraped from VGChartz).

In [173]:
raw_games_data_df['metacritic_count'] = raw_games_data_df.merge(expanded_games_data_df, on='game', how='left')['Critic_Count']
raw_games_data_df['metacritic_score'] = raw_games_data_df.merge(expanded_games_data_df, on='game', how='left')['Critic_Score']
raw_games_data_df['metacritic_user_count'] = raw_games_data_df.merge(expanded_games_data_df, on='game', how='left')['User_Count']
raw_games_data_df['metacritic_user_score'] = raw_games_data_df.merge(expanded_games_data_df, on='game', how='left')['User_Score']
raw_games_data_df['esrb_rating'] = raw_games_data_df.merge(expanded_games_data_df, on='game', how='left')['Rating']

In [174]:
raw_games_data_df[raw_games_data_df['platform'] != 'Series'].head(10)

Unnamed: 0,rank,game,platform,publisher,developer,vgchartz_score,critic_score,user_score,total_shipped,total_sales,...,pal_sales,japan_sales,other_sales,release_date,last_update_date,metacritic_count,metacritic_score,metacritic_user_count,metacritic_user_score,esrb_rating
7,8,Minecraft,All,Mojang,Mojang AB,,,,238.00m,,...,,,,2011-11-18,2020-10-08,,,,,
11,12,Grand Theft Auto V,All,Rockstar Games,Rockstar North,,,,180.00m,,...,,,,2013-09-17,2020-10-08,,,,,
26,27,Wii Sports,Wii,Nintendo,Nintendo EAD,,7.7,,82.90m,,...,,,,2006-11-19,,17.0,84.0,20.0,6.9,T
39,40,PlayerUnknown's Battlegrounds,All,PUBG Corporation,PUBG Corporation,,,,70.00m,,...,,,,2017-12-17,2020-10-24,,,,,
52,53,Mario Kart 8 Deluxe,NS,Nintendo,Nintendo EPD,,9.3,,53.79m,,...,,,,2017-04-28,2018-11-19,16.0,96.0,138.0,8.7,E
54,55,Red Dead Redemption 2,All,Rockstar Games,Rockstar Studios,,,,53.00m,,...,,,,2018-10-26,2020-10-08,51.0,76.0,322.0,8.0,E
60,61,The Witcher 3: Wild Hunt,All,Warner Bros. Interactive Entertainment,CD Projekt Red Studio,,,,50.00m,,...,,,,2015-05-18,2020-10-29,,,,,
65,66,Terraria,All,Re-Logic,Re-Logic,,,,44.50m,,...,,,,2011-05-16,2020-10-10,42.0,87.0,137.0,8.9,E
71,72,Animal Crossing: New Horizons,NS,Nintendo,Nintendo,8.0,,,42.21m,,...,,,,2020-03-20,2020-04-11,,,,,
75,76,Super Mario Bros.,NES,Nintendo,Nintendo EAD,,10.0,8.2,40.24m,,...,,,,1985-10-18,,,,,,


<h3>Exploratory data analysis findings</h3>

&#x2022; `critic_score` and `user_score` are wildly inconsistent in their appearances. Thus, we will ignore and drop these columns.

&#x2022; `total_shipped` represents volume of sales, not revenue or profit.

&#x2022; Unfortunately, the individual breakdown of the sales by geographical region is not available. Perhaps I can supplement this data with a different dataset as a stretch goal.

&#x2022; `last_update_date` is essentially useless for this project. We will drop this column.

&#x2022; With the remaining columns, I will be able to glean enough information to answer the initial question posed - at least on a surface level. More data will be needed to answer the question in more depth.