# Global Sales of Video Games
Author: Jiahang Liu

Course Project, UC Irvine, Math 10, Summer 2023

## Introduction

This project focuses on analyzing video game sales from 1980 to 2020. I hope to find a way to predict the sales of different video games based on the scores they received from critics and users. I will look at the following features in my analysis: "Platform", "Year_of_Release", "Genre", "Global_Sales", "Critic_Score", "Critic_Count", "User_Score", "User_Count".

There are two main problems I aim to solve in this project.
1) Explore the relationship between "Global_Sales" and the other variables.
2) Find ways to predict "Global_Sales" using only "Critic_Score" and "User_Score".

## Exploring the Dataset



### Data Cleaning

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read in the dataset
df_vg=pd.read_csv("/work/Video_Games_Sales_as_at_22_Dec_2016.csv")
df_vg

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16714,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,Tecmo Koei,0.00,0.00,0.01,0.00,0.01,,,,,,
16715,LMA Manager 2007,X360,2006.0,Sports,Codemasters,0.00,0.01,0.00,0.00,0.01,,,,,,
16716,Haitaka no Psychedelica,PSV,2016.0,Adventure,Idea Factory,0.00,0.00,0.01,0.00,0.01,,,,,,
16717,Spirits & Spells,GBA,2003.0,Platform,Wanadoo,0.01,0.00,0.00,0.00,0.01,,,,,,


In [3]:
# Check that "Global_Sales" (the response variable) has no missing values
df_vg["Global_Sales"].isna().any()

False

I am not interested in the variables "Developer" and "Rating". I also do not want to use "NA_Sales", "EU_Sales", "JP_Sales", and "Other_Sales" because they have a direct relationship with "Global_Sales". Thus, I decide to remove these columns.

In [4]:
# Remove unwanted columns
df_vg = df_vg.drop(columns=["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Developer", "Rating"])
df_vg

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count
0,Wii Sports,Wii,2006.0,Sports,Nintendo,82.53,76.0,51.0,8,322.0
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,40.24,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,35.52,82.0,73.0,8.3,709.0
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,32.77,80.0,73.0,8,192.0
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,31.37,,,,
...,...,...,...,...,...,...,...,...,...,...
16714,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,Tecmo Koei,0.01,,,,
16715,LMA Manager 2007,X360,2006.0,Sports,Codemasters,0.01,,,,
16716,Haitaka no Psychedelica,PSV,2016.0,Adventure,Idea Factory,0.01,,,,
16717,Spirits & Spells,GBA,2003.0,Platform,Wanadoo,0.01,,,,


A very small number of rows have missing values in "Name", "Year_of_Release", "Genre", or "Publisher". I will remove these rows.

In [5]:
# Remove rows with missing values in these columns
df_vg = df_vg.dropna(subset=["Name", "Year_of_Release", "Genre", "Publisher"])
df_vg

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count
0,Wii Sports,Wii,2006.0,Sports,Nintendo,82.53,76.0,51.0,8,322.0
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,40.24,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,35.52,82.0,73.0,8.3,709.0
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,32.77,80.0,73.0,8,192.0
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,31.37,,,,
...,...,...,...,...,...,...,...,...,...,...
16714,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,Tecmo Koei,0.01,,,,
16715,LMA Manager 2007,X360,2006.0,Sports,Codemasters,0.01,,,,
16716,Haitaka no Psychedelica,PSV,2016.0,Adventure,Idea Factory,0.01,,,,
16717,Spirits & Spells,GBA,2003.0,Platform,Wanadoo,0.01,,,,


In [6]:
# Look for missing data in numerical variables
df_vg[["Critic_Score","Critic_Count", "User_Score", "User_Count"]].isna().sum()

Critic_Score    8434
Critic_Count    8434
User_Score      6579
User_Count      8955
dtype: int64

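There is quite a lot of missing data, so I will fill in the missing values with each column's median. "User_Score" also contains the string "tbd", which I treat as missing before converting the column to numbers.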

In [7]:
# Change "tbd" entries to NA in "User_Score"
df_vg.loc[df_vg["User_Score"] == "tbd", "User_Score"] = np.nan
# Convert "User_Score" to numeric column
df_vg["User_Score"] = pd.to_numeric(df_vg["User_Score"])
# Impute missing values with each column's median
num_cols = ["Critic_Score", "Critic_Count", "User_Score", "User_Count"]
df_vg[num_cols] = df_vg[num_cols].fillna(df_vg[num_cols].median())
df_vg

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count
0,Wii Sports,Wii,2006.0,Sports,Nintendo,82.53,76.0,51.0,8.0,322.0
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,40.24,71.0,22.0,7.5,24.0
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,35.52,82.0,73.0,8.3,709.0
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,32.77,80.0,73.0,8.0,192.0
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,31.37,71.0,22.0,7.5,24.0
...,...,...,...,...,...,...,...,...,...,...
16714,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,Tecmo Koei,0.01,71.0,22.0,7.5,24.0
16715,LMA Manager 2007,X360,2006.0,Sports,Codemasters,0.01,71.0,22.0,7.5,24.0
16716,Haitaka no Psychedelica,PSV,2016.0,Adventure,Idea Factory,0.01,71.0,22.0,7.5,24.0
16717,Spirits & Spells,GBA,2003.0,Platform,Wanadoo,0.01,71.0,22.0,7.5,24.0
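
The imputation step above can be sketched on a tiny made-up table. This is only an illustration under stated assumptions: the `toy` DataFrame and the `*_was_missing` indicator columns are hypothetical and not part of the project. The indicator columns are one possible way to remember which values were imputed, since after filling, imputed rows are indistinguishable from rows whose real score happens to equal the median.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Critic_Score": [76.0, np.nan, 82.0, np.nan],
    "User_Score": ["8", "tbd", "8.3", None],
})

# errors="coerce" turns "tbd" (and any other non-numeric text) into NaN
toy["User_Score"] = pd.to_numeric(toy["User_Score"], errors="coerce")

score_cols = ["Critic_Score", "User_Score"]
for col in score_cols:
    # Record which rows were originally missing before overwriting them
    toy[col + "_was_missing"] = toy[col].isna()

# Fill each column with its own median
toy[score_cols] = toy[score_cols].fillna(toy[score_cols].median())
print(toy)
```

Keeping the indicator columns makes it possible to repeat later analyses on the observed rows only.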


After removing and imputing missing data, we can now look at the relationship between "Global_Sales" and other variables. 

### Relationship Between Global Sales and Categorical Variables

#### Global Sales and Genre

In [8]:
df_vg["Genre"].unique()

array(['Sports', 'Platform', 'Racing', 'Role-Playing', 'Puzzle', 'Misc',
       'Shooter', 'Simulation', 'Action', 'Fighting', 'Adventure',
       'Strategy'], dtype=object)

In [9]:
df_vg["Genre"].unique().size

12

From the above code, there are 12 different genres of video games in this dataset.

In [10]:
# Look at average sales (in millions) per genre
# Select "Global_Sales" before aggregating so that non-numeric columns are not averaged
df_salesbygenre = df_vg.groupby("Genre")["Global_Sales"].mean()
df_salesbygenre

Genre
Action          0.519389
Adventure       0.180674
Fighting        0.528829
Misc            0.461514
Platform        0.940615
Puzzle          0.422373
Racing          0.590767
Role-Playing    0.627714
Shooter         0.803881
Simulation      0.454058
Sports          0.568252
Strategy        0.256979
Name: Global_Sales, dtype: float64

In [11]:
# Genre with highest average sales
df_salesbygenre.idxmax()

'Platform'

In [12]:
# Genre with lowest average sales
df_salesbygenre.idxmin()

'Adventure'

From the results, Platform games have the highest average global sales (about 0.94 million per title), and Adventure games have the lowest (about 0.18 million per title).
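
Averages can be pulled up by a few blockbuster titles, so it may help to look at the median next to the mean. The sketch below uses a small made-up table (`toy` and its values are assumptions, not the project data) to show how the two statistics can disagree within a genre.

```python
import pandas as pd

# Made-up sales figures: one blockbuster dominates the "Platform" mean
toy = pd.DataFrame({
    "Genre": ["Platform", "Platform", "Platform", "Adventure", "Adventure"],
    "Global_Sales": [40.0, 0.5, 0.3, 0.2, 0.1],
})

# Mean and median of global sales per genre, side by side
summary = toy.groupby("Genre")["Global_Sales"].agg(["mean", "median"])
print(summary)
```

In this toy example the Platform mean is far above its median, which is the pattern to look for in the real data before calling a genre "most popular".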

#### Global Sales and Year of Release

In [13]:
import altair as alt

In [14]:
# Total global sales by year of release
df_totalbyyear = df_vg.groupby("Year_of_Release")["Global_Sales"].sum().to_frame().reset_index()
df_totalbyyear = df_totalbyyear.rename(columns={"Year_of_Release":"Year", "Global_Sales":"Total Global Sales (Millions)"})

In [15]:
# Plot total global sales against year of release
chart_total = alt.Chart(df_totalbyyear).mark_bar().encode(x="Year", y = "Total Global Sales (Millions)", tooltip = ["Year","Total Global Sales (Millions)"])
chart_total

From the bar chart above, games released in 2008 appear to have the highest total global sales. However, more video games may have been released in 2008 than in other years, which would by itself raise the total. Therefore, to compare years from a more objective perspective, we look at the average global sales instead.

In [16]:
# Average global sales by year of release
df_avgbyyear = df_vg.groupby("Year_of_Release")["Global_Sales"].mean().to_frame().reset_index()
df_avgbyyear = df_avgbyyear.rename(columns={"Year_of_Release":"Year", "Global_Sales":"Average Global Sales (Millions)"})

In [17]:
# Plot average global sales against year of release
chart_avg = alt.Chart(df_avgbyyear).mark_bar().encode(x="Year", y = "Average Global Sales (Millions)", tooltip = ["Year","Average Global Sales (Millions)"])
chart_avg

After changing the aggregation from sum to mean, the bar chart looks very different. Average global sales are higher in the 1980s, with the highest in 1989. This could mean that those years were the peak of the video game market, or that few games were released but they all sold well. However, the dataset does not clarify how global sales were calculated, so I cannot conclude whether the year of release has any effect on global sales.
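
To tell the two explanations apart, one could count how many games were released each year alongside the total and the average. The following is a minimal sketch on made-up numbers (the `toy` table and the column names `n_games`, `total_sales`, and `avg_sales` are assumptions for illustration).

```python
import pandas as pd

# Made-up data: few but successful releases in 1989, many releases in 2008
toy = pd.DataFrame({
    "Year_of_Release": [1989.0, 1989.0, 2008.0, 2008.0, 2008.0, 2008.0],
    "Global_Sales": [30.0, 2.0, 5.0, 1.0, 0.5, 0.5],
})

# Number of games, total sales, and average sales per release year
by_year = (
    toy.groupby("Year_of_Release")["Global_Sales"]
    .agg(n_games="count", total_sales="sum", avg_sales="mean")
    .reset_index()
)
print(by_year)
```

A year with a high average but a small `n_games` would support the "few games, all successful" reading.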

#### Global Sales and Platform

In [18]:
# Average global sales by platform
df_salesbypf = df_vg.groupby("Platform")["Global_Sales"].mean()
df_salesbypf

Platform
2600    0.746293
3DO     0.033333
3DS     0.503750
DC      0.307115
DS      0.378761
GB      2.622887
GBA     0.388830
GC      0.363727
GEN     1.050370
GG      0.040000
N64     0.690538
NES     2.561939
NG      0.120000
PC      0.269128
PCFX    0.030000
PS      0.611269
PS2     0.579906
PS3     0.712979
PS4     0.799567
PSP     0.242909
PSV     0.125455
SAT     0.194162
SCD     0.311667
SNES    0.837029
TG16    0.080000
WS      0.236667
Wii     0.693421
WiiU    0.558912
X360    0.780349
XB      0.313935
XOne    0.645506
Name: Global_Sales, dtype: float64

In [19]:
# Platform with highest average sales
df_salesbypf.idxmax()

'GB'

In [20]:
# Platform with lowest average sales
df_salesbypf.idxmin()

'PCFX'

From the results, GB has the highest average global sales and PCFX has the lowest average global sales.

#### Global Sales and Publisher

In [21]:
# Plot average global sales by publisher
df_avgbypl = df_vg.groupby("Publisher")["Global_Sales"].mean().to_frame().reset_index()
df_avgbypl = df_avgbypl.rename(columns={"Global_Sales":"Average Global Sales (Millions)"})
chart_avg2 = alt.Chart(df_avgbypl).mark_bar().encode(x="Publisher", y = "Average Global Sales (Millions)", tooltip = ["Publisher","Average Global Sales (Millions)"])
chart_avg2

A quick look at the bar chart tells us that Palcom has the highest average global sales, with a few other publishers such as Arena Entertainment and Nintendo also ranking high.

### Correlation between Global Sales and Numerical Variables

In [22]:
# Put numerical columns in a dataframe
df_correlation = df_vg[["Global_Sales","Critic_Score", "Critic_Count", "User_Score", "User_Count"]]
df_correlation

Unnamed: 0,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count
0,82.53,76.0,51.0,8.0,322.0
1,40.24,71.0,22.0,7.5,24.0
2,35.52,82.0,73.0,8.3,709.0
3,32.77,80.0,73.0,8.0,192.0
4,31.37,71.0,22.0,7.5,24.0
...,...,...,...,...,...
16714,0.01,71.0,22.0,7.5,24.0
16715,0.01,71.0,22.0,7.5,24.0
16716,0.01,71.0,22.0,7.5,24.0
16717,0.01,71.0,22.0,7.5,24.0


In [23]:
# Calculate correlation coefficient between each numerical variable, get correlation matrix
df_correlation.corr()

Unnamed: 0,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count
Global_Sales,1.0,0.18972,0.262305,0.048682,0.235408
Critic_Score,0.18972,1.0,0.40199,0.477299,0.243784
Critic_Count,0.262305,0.40199,1.0,0.124606,0.388712
User_Score,0.048682,0.477299,0.124606,1.0,-0.006295
User_Count,0.235408,0.243784,0.388712,-0.006295,1.0


In [24]:
df_correlation.corr()["Global_Sales"]

Global_Sales    1.000000
Critic_Score    0.189720
Critic_Count    0.262305
User_Score      0.048682
User_Count      0.235408
Name: Global_Sales, dtype: float64

We can see that none of the numerical variables has a strong correlation with global sales; the largest coefficient is about 0.26, for "Critic_Count". In other words, "Critic_Score", "Critic_Count", "User_Score", and "User_Count" have no obvious linear relationship with global sales. This might be partly because these columns had a lot of missing data that was filled in with the median, which can weaken the measured correlations.
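
The effect of median imputation on a correlation can be illustrated with simulated data. This is a sketch under assumptions: the variables `score` and `sales`, the 50% missing rate, and the noise levels are all invented, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Simulated score and a sales figure that depends on it linearly, plus noise
score = rng.normal(70, 10, size=n)
sales = 0.05 * score + rng.normal(0, 0.5, size=n)
toy = pd.DataFrame({"score": score, "sales": sales})

# Hide about half of the scores at random, then fill them with the median
missing = rng.random(n) < 0.5
toy["score_imputed"] = toy["score"].mask(missing)
toy["score_imputed"] = toy["score_imputed"].fillna(toy["score_imputed"].median())

# Correlation on observed rows only versus on all rows after imputation
r_observed = toy.loc[~missing, "score"].corr(toy.loc[~missing, "sales"])
r_imputed = toy["score_imputed"].corr(toy["sales"])
print(f"observed rows only: {r_observed:.3f}")
print(f"after imputation:   {r_imputed:.3f}")
```

In this simulation the imputed column correlates less strongly with sales than the observed values do, because every imputed row carries the same score regardless of its sales.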

## Machine Learning


### Linear Regression

After exploring the data, I wonder whether "Critic_Score" and "User_Score" alone can predict "Global_Sales", even though the correlation analysis above showed no strong linear relationship with "Global_Sales". Since global sales is a continuous response variable, this is a regression problem, so the machine learning part below focuses on regression rather than classification. I am also curious how accurate the predictions can be with such limited data. In the code below, I try linear regression first.

In [25]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [26]:
# Split dataset into train and test sets
pred_col = ["Critic_Score", "User_Score"]
X_train, X_test, y_train, y_test = train_test_split(df_vg[pred_col], df_vg["Global_Sales"], test_size=0.5, random_state=42)

It is important to have a test set for validating the model and assessing the performance.

In [27]:
# Initialize linear regression model
reg = LinearRegression()

In [28]:
# Fit linear regression model using train set
reg.fit(X_train[pred_col],y_train)

LinearRegression()

In [29]:
# Look at coefficients of the fitted model
reg.coef_

array([ 0.0337092 , -0.07293657])

We interpret the coefficients:

Holding "User_Score" fixed, the predicted mean "Global_Sales" increases by about 0.0337 million for each one-point increase in "Critic_Score". This makes sense. However, contrary to what most people would expect, holding "Critic_Score" fixed, the predicted mean "Global_Sales" decreases by about 0.0729 million for each one-point increase in "User_Score".
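
Written as an equation, the fitted model has the form below. The intercept $\hat\beta_0$ is not shown in the output above (it would be available as `reg.intercept_`), so it is left symbolic here.

$$
\widehat{\text{Global\_Sales}} = \hat\beta_0 + 0.0337 \cdot \text{Critic\_Score} - 0.0729 \cdot \text{User\_Score}
$$

For example, two games with the same user score whose critic scores differ by 10 points would have predicted sales that differ by about $10 \times 0.0337 \approx 0.34$ million.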

In [30]:
# Make predictions on test set using fitted model and calculate mean squared error
from sklearn.metrics import mean_squared_error
y_pred = reg.predict(X_test[pred_col])
mean_squared_error(y_test, y_pred)

2.577231862240738

In [31]:
# Prepare sample for plotting (work on a copy so X_test keeps only the predictors)
df3 = X_test[pred_col].copy()
df3["pred"] = reg.predict(X_test[pred_col])
df3["Global_Sales"] = y_test
df_sample = df3.sample(n=200, random_state=42)

By default, Altair raises an error for datasets with more than 5000 rows, so I took a random sample of 200 rows to plot below.

In [32]:
# Plot regression line against original points for "Critic_Score"
c1 = alt.Chart(df_sample).mark_circle().encode(
    x="Critic_Score",
    y="Global_Sales"
)
c2 = alt.Chart(df_sample).mark_line(color="red").encode(
    x= "Critic_Score",
    y="pred"
)
c1+c2

It is unclear from this chart whether there is overfitting. Because each prediction also depends on "User_Score", the red line need not be perfectly straight when plotted against "Critic_Score" alone. Still, the line shows an upward trend, which means games with higher critic scores tend to have higher predicted global sales. In addition, the model predicts negative global sales when the critic score is below roughly 50, which would not make sense in the real world.
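
One common way to check for overfitting more directly is to compare the error on the training set with the error on the test set. The sketch below does this on simulated data; the data-generating numbers are assumptions chosen only to resemble the setup above, and `X_tr`, `X_te`, `mse_train`, and `mse_test` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Simulated predictors (a critic-like and a user-like score) and a noisy target
X = np.column_stack([rng.normal(70, 10, n), rng.normal(7, 1.5, n)])
y = 0.03 * X[:, 0] - 0.07 * X[:, 1] + rng.normal(0, 1.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)

# Error on the data the model was fitted on versus on held-out data
mse_train = mean_squared_error(y_tr, model.predict(X_tr))
mse_test = mean_squared_error(y_te, model.predict(X_te))
print(f"train MSE: {mse_train:.3f}, test MSE: {mse_test:.3f}")
```

A test error far above the training error would point to overfitting; similar values suggest the model is not fitting noise, even if both errors are large.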

In [33]:
# Plot regression line against original points for "User_Score"
c1 = alt.Chart(df_sample).mark_circle().encode(
    x="User_Score",
    y="Global_Sales"
)
c2 = alt.Chart(df_sample).mark_line(color="red").encode(
    x= "User_Score",
    y="pred"
)
c1+c2

The fitted line fluctuates a lot and shows no obvious trend, which at first glance suggests overfitting. However, each prediction also depends on "Critic_Score", so the line cannot be smooth when plotted against "User_Score" alone, and the fluctuation by itself is not proof of overfitting. If the model were truly too sensitive to noise and random fluctuations in the training data, it would perform noticeably worse on new, unseen data. Besides, the model predicts negative global sales for some games, which would not make sense in the real world.
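
One possible way to avoid negative predictions, which is not used in this project, is to fit the model to the logarithm of sales and transform the predictions back. The sketch below shows the idea on simulated, strictly positive data; all names and numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Simulated score and a strictly positive, right-skewed target
X = rng.normal(70, 10, size=(n, 1))
y = np.exp(0.03 * X[:, 0] - 3 + rng.normal(0, 1, n))

# Model fitted on the raw target versus on the log of the target
raw_model = LinearRegression().fit(X, y)
log_model = LinearRegression().fit(X, np.log(y))

# Predictions for two very low scores
X_low = np.array([[10.0], [30.0]])
pred_raw = raw_model.predict(X_low)          # a straight line can go below zero
pred_log = np.exp(log_model.predict(X_low))  # exp() of any number is positive
print("raw-scale model:", pred_raw)
print("log-scale model:", pred_log)
```

Note that exponentiating a log-scale prediction estimates a typical (median-like) value rather than the mean, so errors from the two models are not directly comparable without care.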

### Lasso Regression (Extra Component)

After using linear regression, I wonder if there is a model with better performance that can avoid overfitting. Thus, I try Lasso regression below.

#### Introduction of Lasso

Lasso stands for Least Absolute Shrinkage and Selection Operator. Lasso regression is often used in machine learning for variable selection and regularization. It adds a penalty proportional to the sum of the absolute values of the coefficients, which shrinks the coefficients toward zero. Some coefficients can become exactly zero, so the model can perform feature selection automatically. This shrinkage can lead to more accurate predictions when the unpenalized model overfits.
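
For reference, the objective that scikit-learn's `Lasso` minimizes can be written as follows, where $n$ is the number of training samples, $w$ is the coefficient vector, and $\alpha$ is the penalty strength (the intercept is fitted separately and is not penalized):

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

With $\alpha = 0$ the penalty term disappears and the problem reduces to ordinary least squares; larger values of $\alpha$ push more coefficients to exactly zero.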

In [34]:
from sklearn.linear_model import Lasso

# Train a lasso regression model and make predictions, calculate mean squared error
reg_lasso = Lasso(alpha=0.1)
reg_lasso.fit(X_train[pred_col],y_train)
y_pred_lasso = reg_lasso.predict(X_test[pred_col])
mean_squared_error(y_test, y_pred_lasso)

2.5837924034855884

The mean squared error of the Lasso regression model (about 2.5838) turned out to be slightly larger than that of the linear regression model (about 2.5772). I still want to see whether the two fits differ visually.

In [35]:
# Prepare sample for plotting (work on a copy so X_test keeps only the predictors)
df3 = X_test[pred_col].copy()
df3["pred"] = reg_lasso.predict(X_test[pred_col])
df3["Global_Sales"] = y_test
df_sample = df3.sample(n=200, random_state=42)

In [36]:
# Plot regression line against original points for "Critic_Score"
c1 = alt.Chart(df_sample).mark_circle().encode(
    x="Critic_Score",
    y="Global_Sales"
)
c2 = alt.Chart(df_sample).mark_line(color="red").encode(
    x= "Critic_Score",
    y="pred"
)
c1+c2

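As the chart shows, the fitted values form a straight line with positive slope. A straight line in this plot suggests that the penalty (alpha=0.1) shrank the "User_Score" coefficient to zero, so that the predictions depend on "Critic_Score" only; printing `reg_lasso.coef_` would confirm this. The penalty reduces the risk of overfitting, but it can increase the risk of underfitting.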
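
How Lasso can drop a weak predictor entirely is easiest to see on simulated data. In the sketch below, the feature names `strong` and `weak` and all numbers are assumptions: the target depends only on `strong`, and the code compares the coefficients from ordinary least squares with those from Lasso.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n = 2000

# "strong" drives the target; "weak" is unrelated to it
strong = rng.normal(70, 10, n)
weak = rng.normal(7, 1, n)
X = np.column_stack([strong, weak])
y = 0.03 * strong + rng.normal(0, 0.5, n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# OLS gives the weak feature a small nonzero coefficient; Lasso can set it to 0
print("OLS coefficients:  ", ols.coef_)
print("Lasso coefficients:", lasso.coef_)
```

When one coefficient is exactly zero, the predictions depend on the remaining feature alone, which produces a straight line when plotted against that feature.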

In [37]:
# Plot regression line against original points for "User_Score"
c1 = alt.Chart(df_sample).mark_circle().encode(
    x="User_Score",
    y="Global_Sales"
)
c2 = alt.Chart(df_sample).mark_line(color="red").encode(
    x= "User_Score",
    y="pred"
)
c1+c2

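As the chart shows, there is no big difference between Lasso and linear regression in terms of the fit for "User_Score". Both fitted lines are jagged when plotted against "User_Score" alone, since the predictions also depend on "Critic_Score", and both include negative predicted global sales for lower user scores.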

In [38]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Refit with alpha=0 (no penalty), the value that gave the lowest mean squared error among those tried
reg_lasso = Lasso(alpha=0)
reg_lasso.fit(X_train[pred_col],y_train)
y_pred_lasso = reg_lasso.predict(X_test[pred_col])
mean_squared_error(y_test, y_pred_lasso)

2.577231862240738

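After trying different alpha values, I found that the smallest mean squared error is achieved when alpha equals 0. With alpha equal to 0 the penalty term vanishes and Lasso reduces to ordinary least squares, which is why the mean squared error above matches the linear regression result exactly. Therefore, in this situation, plain linear regression is the better model.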
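
Choosing alpha by looking at the test error can make the final test error look better than it really is. A common alternative, sketched here on simulated data, is to compare candidate values with cross-validation on the training data only. The grid `alphas`, the dictionary `cv_mse`, and the simulated data are assumptions for illustration, not the values tried in this project.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500

# Simulated training data with two predictors
X = np.column_stack([rng.normal(70, 10, n), rng.normal(7, 1.5, n)])
y = 0.03 * X[:, 0] - 0.07 * X[:, 1] + rng.normal(0, 1.5, n)

# Hypothetical grid of candidate penalty strengths
alphas = [0.001, 0.01, 0.1, 1.0]
cv_mse = {}
for alpha in alphas:
    # scikit-learn reports negated MSE so that larger is better; flip the sign back
    scores = cross_val_score(
        Lasso(alpha=alpha), X, y, scoring="neg_mean_squared_error", cv=5
    )
    cv_mse[alpha] = -scores.mean()

best_alpha = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print("best alpha:", best_alpha)
```

The test set is then used once, at the end, to report the error of the model refitted with `best_alpha`.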

## Summary

In the first part of the project, I cleaned and explored the dataset. After removing some rows with missing values and imputing others, I looked at the relationship between each categorical variable and "Global_Sales" using tables and charts. For the numerical variables, I found that they do not have a strong correlation with global sales. In the second part of the project, I tried to use machine learning to predict "Global_Sales" using only "Critic_Score" and "User_Score". After comparing two different methods, I found that linear regression performed the best in this case, although it may still overfit. In the future, it would be beneficial to look at outliers and their impact on the accuracy of prediction models.

## References

* What is the source of your dataset(s)?
https://www.kaggle.com/datasets/rush4ratio/video-game-sales-with-ratings

* List any other references that you found helpful.
https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/

https://www.mygreatlearning.com/blog/understanding-of-lasso-regression/#:~:text=Lasso%20regression%20is%20a%20regularization,i.e.%20models%20with%20fewer%20parameters).
