# **ETL (Extract, Transform, Load)**

## Objectives
- Load the raw **VGChartz Video Game Sales** dataset and prepare it for analysis and dashboarding.  
- Perform basic data profiling to understand structure and quality.  
- Clean and transform the dataset (handle missing values, unify formats, engineer features such as multi-platform indicator, first-party flag, and release era).  
- Export a cleaned, analysis-ready dataset for use in visualizations and Tableau.

## Inputs
- **Raw data file:** `data/raw/Video_Games_Sales_as_at_22_Dec_2016.csv`  
- **Columns used:**  
  `Name`, `Platform`, `Year_of_Release`, `Genre`, `Publisher`,  
  `NA_Sales`, `EU_Sales`, `JP_Sales`, `Other_Sales`, `Global_Sales`,  
  `Critic_Score`, `Critic_Count`, `User_Score`, `User_Count`, `Developer`, `Rating`  
- **Python libraries:** `pandas`, `numpy`, `matplotlib`, `seaborn` (for quick profiling)

## Outputs
- **Processed dataset:** `data/processed/video_game_sales_clean.csv` — cleaned and feature-engineered for analysis.  
- Summary of data issues and cleaning actions in the ETL notebook (`notebooks/etl.ipynb`).  
- Basic exploratory statistics (row counts, missing values, data types) for reference.

## Additional Comments
- Major cleaning steps include:  
  - Removing rows with no game name or no sales data.  
  - Converting year to integer and handling missing or unrealistic years.  
  - Dropping or flagging games without review scores when needed for hypotheses.  
  - Creating new features:  
    - `Vendor` (Nintendo, Sony, Microsoft, Other)  
    - `is_multiplatform` (1 if game appears on ≥2 platforms)  
    - `is_first_party` (1 if publisher matches platform vendor)  
    - `Era` (pre-2010 vs post-2010 for trend analysis)  
- This notebook produces the single source of truth dataset used throughout the project (analysis, testing, and Tableau dashboard).
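
The engineered features listed above could be derived roughly as follows. This is a minimal sketch on toy data, not the notebook's actual implementation: the `PLATFORM_VENDOR` mapping is a partial, hypothetical lookup, the first-party check assumes a simple substring match between vendor and publisher name, and the `Era` boundary assumes that 2010 itself counts as "post-2010".

```python
import pandas as pd

# Hypothetical, partial platform-to-vendor lookup (an assumption; extend as needed)
PLATFORM_VENDOR = {
    "Wii": "Nintendo", "NES": "Nintendo", "GB": "Nintendo", "DS": "Nintendo",
    "PS2": "Sony", "PS3": "Sony", "PSP": "Sony",
    "X360": "Microsoft",
}

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Platforms absent from the lookup fall back to "Other"
    out["Vendor"] = out["Platform"].astype("object").map(PLATFORM_VENDOR).fillna("Other")
    # Assumed rule: first-party when the vendor name appears in the publisher name
    out["is_first_party"] = [
        int(isinstance(pub, str) and vendor != "Other" and vendor in pub)
        for pub, vendor in zip(out["Publisher"], out["Vendor"])
    ]
    # 1 when the same title appears on two or more distinct platforms
    n_platforms = out.groupby("Name")["Platform"].transform("nunique")
    out["is_multiplatform"] = (n_platforms >= 2).astype(int)
    # Assumed boundary: 2010 and later is "post-2010"; missing years stay missing
    out["Era"] = out["Year_of_Release"].map(
        lambda y: pd.NA if pd.isna(y) else ("post-2010" if y >= 2010 else "pre-2010")
    )
    return out

toy = pd.DataFrame({
    "Name": ["Wii Sports", "Game A", "Game A", "Game B"],
    "Platform": ["Wii", "PS3", "X360", "PC"],
    "Publisher": ["Nintendo", "Activision", "Activision", "Sony Computer Entertainment"],
    "Year_of_Release": pd.array([2006, 2010, 2010, None], dtype="Int64"),
})
features = add_features(toy)
print(features[["Name", "Vendor", "is_first_party", "is_multiplatform", "Era"]])
```

Converting `Platform` to plain objects before mapping avoids category-related errors when filling in the "Other" fallback.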


---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor we first need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/stephenbeese/GitHub/Video-Game-Sales-Analysis/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/stephenbeese/GitHub/Video-Game-Sales-Analysis'
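
As an alternative to changing the working directory, the project root can be derived from the notebook folder and used to build paths directly. The sketch below uses `pathlib`; the helper name `project_root` and the example path are hypothetical.

```python
from pathlib import Path

def project_root(notebook_dir):
    """Return the parent of the notebook folder without changing the working directory."""
    return Path(notebook_dir).parent

# Hypothetical location; in a notebook this would typically be Path.cwd()
root = project_root("/path/to/project/jupyter_notebooks")
raw_dir = root / "data" / "raw"
print(raw_dir)
```

This keeps the process-wide working directory untouched, which can help when cells are re-run out of order.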

Set up the data directories

In [4]:
# Set the file path for the raw data
raw_data_dir = os.path.join(current_dir, 'data', 'raw')

# Set the file path for the processed data
processed_data_dir = os.path.join(current_dir, 'data', 'processed')

In [5]:
print("Raw data directory:", raw_data_dir)
print("Processed data directory:", processed_data_dir)

Raw data directory: /Users/stephenbeese/GitHub/Video-Game-Sales-Analysis/data/raw
Processed data directory: /Users/stephenbeese/GitHub/Video-Game-Sales-Analysis/data/processed


# Imports

Import the necessary packages to perform the ETL process.

In [6]:
import numpy as np
import pandas as pd

# Load the data

In [7]:
df = pd.read_csv(os.path.join(raw_data_dir, 'video_game_sales.csv'))

df.head()

,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


# Data Profiling

Understanding the structure and basic info of the dataframe

In [8]:
df.shape

(16719, 16)

This dataset contains 16,719 rows and 16 columns.

## Check and convert datatypes

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       7590 non-null   float64
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(10), object(6)
memory usage: 2.0+ MB


### Data Type Adjustments

To prepare the dataset for analysis, several columns are converted to more suitable data types:

- **Year_of_Release** → changed from `float64` to `Int64` (nullable integer) to store whole years and handle missing values.
- **Critic_Score** → changed from `float64` to `Int64` (nullable integer) since scores are whole numbers.
- **Platform, Genre, Publisher, Developer, Rating** → converted from `object` to `category` to reduce memory use and speed up grouping/filtering.

These changes make the dataset cleaner, improve performance, and prevent issues when running statistical tests or creating visualisations.


In [10]:
# Convert Year_of_Release to nullable int
df['Year_of_Release'] = df['Year_of_Release'].astype('Int64')

# Convert Critic_Score to nullable int (scores are whole numbers)
df['Critic_Score'] = df['Critic_Score'].astype('Int64')

# Convert categorical columns
cat_cols = ['Platform', 'Genre', 'Publisher', 'Developer', 'Rating']
df[cat_cols] = df[cat_cols].astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Name             16717 non-null  object  
 1   Platform         16719 non-null  category
 2   Year_of_Release  16450 non-null  Int64   
 3   Genre            16717 non-null  category
 4   Publisher        16665 non-null  category
 5   NA_Sales         16719 non-null  float64 
 6   EU_Sales         16719 non-null  float64 
 7   JP_Sales         16719 non-null  float64 
 8   Other_Sales      16719 non-null  float64 
 9   Global_Sales     16719 non-null  float64 
 10  Critic_Score     8137 non-null   Int64   
 11  Critic_Count     8137 non-null   float64 
 12  User_Score       7590 non-null   float64 
 13  User_Count       7590 non-null   float64 
 14  Developer        10096 non-null  category
 15  Rating           9950 non-null   category
dtypes: Int64(2), category(5), float64(8), object(1)
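
To illustrate why the `category` conversion reduces memory, here is a small sketch on a toy column (not the real dataset) with a few distinct values repeated many times, which is the situation for columns such as `Platform` and `Genre`.

```python
import pandas as pd

# Toy column: four distinct values repeated many times
platform = pd.Series(["PS2", "Wii", "X360", "DS"] * 5000)

# deep=True includes the memory used by the string values themselves
as_strings = platform.memory_usage(deep=True)
as_category = platform.astype("category").memory_usage(deep=True)

print(f"strings:  {as_strings:,} bytes")
print(f"category: {as_category:,} bytes")
```

A categorical column stores each distinct value once plus a small integer code per row, so the saving grows with the number of repeated values.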

In [11]:
df.describe(include='all')

,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
count,16717,16719,16450.0,16717,16665,16719.0,16719.0,16719.0,16719.0,16719.0,8137.0,8137.0,7590.0,7590.0,10096,9950
unique,11562,31,,12,582,,,,,,,,,,1696,8
top,Need for Speed: Most Wanted,PS2,,Action,Electronic Arts,,,,,,,,,,Ubisoft,E
freq,12,2161,,3370,1356,,,,,,,,,,204,3991
mean,,,2006.487356,,,0.26333,0.145025,0.077602,0.047332,0.533543,68.967679,26.360821,7.125046,162.229908,,
std,,,5.878995,,,0.813514,0.503283,0.308818,0.18671,1.547935,13.938165,18.980495,1.500006,561.282326,,
min,,,1980.0,,,0.0,0.0,0.0,0.0,0.01,13.0,3.0,0.0,4.0,,
25%,,,2003.0,,,0.0,0.0,0.0,0.0,0.06,60.0,12.0,6.4,10.0,,
50%,,,2007.0,,,0.08,0.02,0.0,0.01,0.17,71.0,21.0,7.5,24.0,,
75%,,,2010.0,,,0.24,0.11,0.04,0.03,0.47,79.0,36.0,8.2,81.0,,


# Check for missing values

In [12]:
df.isna().sum()

Name                  2
Platform              0
Year_of_Release     269
Genre                 2
Publisher            54
NA_Sales              0
EU_Sales              0
JP_Sales              0
Other_Sales           0
Global_Sales          0
Critic_Score       8582
Critic_Count       8582
User_Score         9129
User_Count         9129
Developer          6623
Rating             6769
dtype: int64

**Key observations:**

- **Sales data** (`NA_Sales`, `EU_Sales`, `JP_Sales`, `Other_Sales`, `Global_Sales`) is complete — no missing values.
- **Core identifiers** (`Name`, `Platform`, `Genre`, `Publisher`) are mostly complete, with only a few missing entries.
- **Year_of_Release** has 269 missing values — these may need to be dropped or imputed.
- **Review data** (`Critic_Score`, `Critic_Count`, `User_Score`, `User_Count`) is missing for roughly half of all games: about 51% for the critic fields and 55% for the user fields. This limits sample size for review-based hypotheses but is acceptable if we focus only on reviewed games for those analyses.
- **Developer and Rating** have ~40% missing — these are less critical but should be noted if we use them.
- `Name` and `Genre` each have only 2 missing entries — negligible and can be dropped.
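
The percentages quoted above follow from dividing each missing count by the 16,719 rows (for example, 8582 / 16719 ≈ 51.3% for `Critic_Score`). A minimal sketch of a helper that reports counts and percentages together is shown here on toy data; the function name `missing_summary` is hypothetical.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing counts and percentages per column, largest first."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        "percent": (df.isna().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

toy = pd.DataFrame({
    "Name": ["A", "B", None, "D"],
    "Global_Sales": [1.0, 0.5, 0.2, 0.1],
    "Critic_Score": [80, np.nan, np.nan, 70],
})
summary = missing_summary(toy)
print(summary)
```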

**Implications for cleaning:**

- I will likely **drop rows with missing `Name` or `Global_Sales`** (key identifier and target variable). In practice this removes only the 2 rows with no `Name`, since `Global_Sales` has no missing values.
- For analyses involving reviews, I will use the subset with non-null `Critic_Score` or `User_Score`.
- I will consider dropping or flagging rows with missing `Year_of_Release` if time-based trends matter.
- Missing `Developer` and `Rating` can be ignored for now since they are not central to the chosen hypotheses.
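
The cleaning plan above might look like the following sketch, shown on toy data. The helper names and the `year_missing` flag column are hypothetical, and flagging rather than dropping missing years is just one of the two options mentioned.

```python
import numpy as np
import pandas as pd

def basic_clean(df):
    # Drop rows missing the key identifier or the target variable
    cleaned = df.dropna(subset=["Name", "Global_Sales"]).copy()
    # Flag, rather than drop, rows with no release year
    cleaned["year_missing"] = cleaned["Year_of_Release"].isna()
    return cleaned

def reviewed_subset(df):
    # Keep rows that have at least one review score
    return df[df["Critic_Score"].notna() | df["User_Score"].notna()]

toy = pd.DataFrame({
    "Name": ["A", None, "C", "D"],
    "Global_Sales": [1.0, 0.5, 0.2, 0.1],
    "Year_of_Release": [2006, 2008, np.nan, 2010],
    "Critic_Score": [80, 70, np.nan, np.nan],
    "User_Score": [8.0, np.nan, np.nan, 7.5],
})
cleaned = basic_clean(toy)
reviewed = reviewed_subset(cleaned)
print(len(cleaned), len(reviewed))
```

Keeping the reviewed subset as a separate view leaves the full cleaned dataset available for sales-only analyses.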


# Check for duplicate values

In [16]:
df.duplicated().sum()

0

As we can see above there are no exact duplicate rows.

Next I will check if there are any game titles that are duplicated.

In [None]:
df[df.duplicated(subset=['Name'])].sort_values(by='Name')

,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
3862,Frozen: Olaf's Quest,DS,2013,Platform,Disney Interactive Studios,0.21,0.26,0.00,0.04,0.52,,,,,,
14660,007: Quantum of Solace,PC,2008,Action,Activision,0.01,0.01,0.00,0.00,0.03,70,18.0,6.3,55.0,Treyarch,T
1785,007: Quantum of Solace,PS3,2008,Action,Activision,0.43,0.51,0.02,0.19,1.14,65,42.0,6.6,47.0,Treyarch,T
3120,007: Quantum of Solace,Wii,2008,Action,Activision,0.29,0.28,0.01,0.07,0.65,54,11.0,7.5,26.0,Treyarch,T
4475,007: Quantum of Solace,PS2,2008,Action,Activision,0.17,0.00,0.00,0.26,0.43,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4664,pro evolution soccer 2011,PS2,2010,Sports,Konami Digital Entertainment,0.04,0.21,0.05,0.11,0.41,,,6.7,7.0,Konami,E
2583,pro evolution soccer 2011,PSP,2010,Sports,Konami Digital Entertainment,0.05,0.30,0.29,0.16,0.79,74,10.0,5.8,5.0,Konami,E
7150,pro evolution soccer 2011,Wii,2010,Sports,Konami Digital Entertainment,0.07,0.10,0.03,0.02,0.22,78,9.0,5.4,7.0,Konami,E
15614,uDraw Studio: Instant Artist,X360,2011,Misc,THQ,0.01,0.01,0.00,0.00,0.02,54,5.0,5.7,6.0,THQ,E


As we can see here there are games that have duplicates.

This is because there are games with the same name released on different consoles.

These could be ports or remakes.
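
Note that `duplicated()` defaults to `keep='first'`, so the first occurrence of each title is not marked and does not appear in the result. A short sketch on toy data shows the difference when `keep=False` is passed to mark every occurrence:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game A", "Game B"],
    "Platform": ["PS3", "X360", "PC", "Wii"],
})

# Default keep="first": the first occurrence is not marked as a duplicate
repeats_only = toy[toy.duplicated(subset=["Name"])]
# keep=False: every row whose Name appears more than once is marked
all_occurrences = toy[toy.duplicated(subset=["Name"], keep=False)]

print(len(repeats_only), len(all_occurrences))  # → 2 3
```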

Later in this notebook I will aggregate these rows into a separate dataframe that gives each game's total sales across all the platforms it was released on.
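
One way such an aggregation could be written is sketched below on toy data, using `groupby` with named aggregation. The function name `combine_platforms` and the `n_platforms` column are hypothetical; the notebook's eventual implementation may differ.

```python
import pandas as pd

SALES_COLS = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales"]

def combine_platforms(df):
    """Sum sales per title and count the distinct platforms it appears on."""
    agg = {col: (col, "sum") for col in SALES_COLS}
    agg["n_platforms"] = ("Platform", "nunique")
    return df.groupby("Name", as_index=False).agg(**agg)

toy = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B"],
    "Platform": ["PS3", "Wii", "DS"],
    "NA_Sales": [0.4, 0.3, 0.2],
    "EU_Sales": [0.5, 0.3, 0.3],
    "JP_Sales": [0.0, 0.0, 0.0],
    "Other_Sales": [0.2, 0.1, 0.0],
    "Global_Sales": [1.1, 0.7, 0.5],
})
combined = combine_platforms(toy)
print(combined)
```

Summing review scores would not be meaningful, so this sketch aggregates only the sales columns; scores would need a separate rule such as a mean.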

---


# Push files to Repo

* Create the processed data folder so the cleaned dataset can be exported to it.

In [13]:
import os
try:
    # Create the processed data folder if it does not already exist
    os.makedirs(processed_data_dir, exist_ok=True)
except Exception as e:
    print(e)
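
Exporting the cleaned dataset could then be sketched as follows. The helper name `export_clean` is hypothetical and the example writes toy data to a temporary folder rather than the real `data/processed` directory. Passing `index=False` avoids writing the row index as an extra unnamed column.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_clean(df, out_dir, filename="video_game_sales_clean.csv"):
    """Write df to out_dir/filename, creating the folder if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    df.to_csv(out_path, index=False)
    return out_path

toy = pd.DataFrame({"Name": ["Wii Sports"], "Global_Sales": [82.53]})
with tempfile.TemporaryDirectory() as tmp:
    path = export_clean(toy, Path(tmp) / "data" / "processed")
    # Read the file back to confirm the columns survive the round trip
    round_trip = pd.read_csv(path)

print(round_trip.columns.tolist())
```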