# **ETL (Extract, Transform, Load)**

## Objectives
- Load the raw **VGChartz Video Game Sales** dataset and prepare it for analysis and dashboarding.  
- Perform basic data profiling to understand structure and quality.  
- Clean and transform the dataset (handle missing values, unify formats, engineer features such as multi-platform indicator, first-party flag, and release era).  
- Export a cleaned, analysis-ready dataset for use in visualizations and Tableau.

## Inputs
- **Raw data file:** `data/raw/Video_Games_Sales_as_at_22_Dec_2016.csv`  
- **Columns used:**  
  `Name`, `Platform`, `Year_of_Release`, `Genre`, `Publisher`,  
  `NA_Sales`, `EU_Sales`, `JP_Sales`, `Other_Sales`, `Global_Sales`,  
  `Critic_Score`, `Critic_Count`, `User_Score`, `User_Count`, `Developer`, `Rating`  
- **Python libraries:** `pandas`, `numpy`, `matplotlib`, `seaborn` (for quick profiling)

## Outputs
- **Processed dataset:** `data/processed/video_game_sales_clean.csv` — cleaned and feature-engineered for analysis.  
- Summary of data issues and cleaning actions in the ETL notebook (`notebooks/etl.ipynb`).  
- Basic exploratory statistics (row counts, missing values, data types) for reference.
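
The reference statistics listed above can be gathered in a few lines. The sketch below is a minimal example rather than the notebook's actual profiling code; the helper name `profile_frame` and the three-row sample frame are hypothetical.

```python
import numpy as np
import pandas as pd


def profile_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count and missing share."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })


# Hypothetical sample standing in for the raw dataset
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", None],
    "Year_of_Release": [2006.0, 1985.0, np.nan],
    "Global_Sales": [82.53, 40.24, 0.01],
})

print("Rows:", len(sample))
print(profile_frame(sample))
```

Running the same helper on the raw and the cleaned frame gives a quick before-and-after view of missing values.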

## Additional Comments
- Major cleaning steps include:  
  - Removing rows with no game name or no sales data.  
  - Converting year to integer and handling missing or unrealistic years.  
  - Dropping or flagging games without review scores when needed for hypotheses.  
  - Creating new features:  
    - `Vendor` (Nintendo, Sony, Microsoft, Other)  
    - `is_multiplatform` (1 if game appears on ≥2 platforms)  
    - `is_first_party` (1 if publisher matches platform vendor)  
    - `Era` (pre-2010 vs post-2010 for trend analysis)  
- This notebook produces the single source of truth dataset used throughout the project (analysis, testing, and Tableau dashboard).
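
The cleaning steps and engineered features above could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's final code: the `PLATFORM_VENDOR` mapping is partial and hypothetical, the year bounds (1980 to 2016) are assumed, 2010 itself is counted as post-2010, and "first party" is approximated as the publisher name containing the vendor name.

```python
import numpy as np
import pandas as pd

# Assumed, partial platform-to-vendor mapping; extend it to cover every platform code
PLATFORM_VENDOR = {
    "Wii": "Nintendo", "NES": "Nintendo", "GB": "Nintendo", "DS": "Nintendo",
    "PS2": "Sony", "PS3": "Sony", "PS4": "Sony",
    "X360": "Microsoft", "XOne": "Microsoft",
}


def clean_sales(df: pd.DataFrame, min_year: int = 1980, max_year: int = 2016) -> pd.DataFrame:
    """Drop rows without a name or global sales and convert the year to a nullable integer."""
    out = df.dropna(subset=["Name", "Global_Sales"]).copy()
    year = pd.to_numeric(out["Year_of_Release"], errors="coerce")
    # Assumption: years outside [min_year, max_year] are treated as missing
    year = year.where(year.between(min_year, max_year))
    out["Year_of_Release"] = year.astype("Int64")
    return out


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add Vendor, is_multiplatform, is_first_party and Era columns."""
    out = df.copy()
    out["Vendor"] = out["Platform"].map(PLATFORM_VENDOR).fillna("Other")

    # 1 if the same title appears on two or more platforms
    n_platforms = out.groupby("Name")["Platform"].transform("nunique")
    out["is_multiplatform"] = (n_platforms >= 2).astype(int)

    # Assumption: first party means the publisher name contains the vendor name
    publisher = out["Publisher"].fillna("").str.lower()
    vendor = out["Vendor"].str.lower()
    out["is_first_party"] = [
        int(v != "other" and v in p) for p, v in zip(publisher, vendor)
    ]

    # Assumption: 2010 counts as post-2010; rows without a year are labelled "unknown"
    year = out["Year_of_Release"].astype("float64")
    out["Era"] = np.select(
        [year >= 2010, year < 2010], ["post-2010", "pre-2010"], default="unknown"
    )
    return out


# Hypothetical rows used only to exercise the two helpers
raw = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B", None, "Game C"],
    "Platform": ["PS3", "X360", "Wii", "PC", "PC"],
    "Year_of_Release": [2011.0, 2011.0, 2006.0, 1999.0, 2020.0],
    "Publisher": ["Hypothetical Publisher", "Hypothetical Publisher", "Nintendo",
                  "Hypothetical Publisher", "Hypothetical Publisher"],
    "Global_Sales": [1.2, 0.9, 5.0, 0.1, 0.3],
})
clean = add_features(clean_sales(raw))
print(clean[["Name", "Vendor", "is_multiplatform", "is_first_party", "Era"]])
```

Keeping cleaning and feature engineering in separate functions makes each step easy to rerun top-down and to check on a small frame before applying it to the full dataset.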


---

# Change working directory

* The notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory to the project root.

We need to change the working directory from the notebook's folder to its parent folder.
* `os.getcwd()` returns the current working directory.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/stephenbeese/GitHub/Video-Game-Sales-Analysis/jupyter_notebooks'

We want to make the parent of the current directory the new working directory.
* `os.path.dirname()` returns the parent directory
* `os.chdir()` sets the new working directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/stephenbeese/GitHub/Video-Game-Sales-Analysis'
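
Re-running the cells above moves the working directory up another level each time. One possible safeguard, sketched below, is to search upward for the project root instead of always taking the parent. The helper `find_project_root` and the choice of a `data` folder as the marker are assumptions made for illustration.

```python
import os


def find_project_root(start: str, marker: str = "data") -> str:
    """Walk upward from `start` until a directory containing `marker` is found."""
    current = os.path.abspath(start)
    while True:
        if os.path.isdir(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without a match
            raise FileNotFoundError(f"No '{marker}' folder found above {start}")
        current = parent


# Usage (left commented out so the sketch has no side effects):
# os.chdir(find_project_root(os.getcwd()))
```

Because the search stops at the first folder that contains `data`, calling it again from the project root returns the same path.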

Set up the data directories

In [4]:
# Set the file path for the raw data
raw_data_dir = os.path.join(current_dir, 'data/raw')

# Set the file path for the processed data
processed_data_dir = os.path.join(current_dir, 'data/processed')

In [5]:
print("Raw data directory:", raw_data_dir)
print("Processed data directory:", processed_data_dir)

Raw data directory: /Users/stephenbeese/GitHub/Video-Game-Sales-Analysis/data/raw
Processed data directory: /Users/stephenbeese/GitHub/Video-Game-Sales-Analysis/data/processed


# Imports

Import the necessary packages to perform the ETL process.

In [6]:
import numpy as np
import pandas as pd

# Load the data

Read the raw CSV file into a DataFrame and preview the first five rows.

In [7]:
df = pd.read_csv(os.path.join(raw_data_dir, 'video_game_sales.csv'))

df.head()

| Unnamed: 0 | Name | Platform | Year_of_Release | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.36 | 28.96 | 3.77 | 8.45 | 82.53 | 76.0 | 51.0 | 8.0 | 322.0 | Nintendo | E |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | | | | | | |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.68 | 12.76 | 3.79 | 3.29 | 35.52 | 82.0 | 73.0 | 8.3 | 709.0 | Nintendo | E |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.61 | 10.93 | 3.28 | 2.95 | 32.77 | 80.0 | 73.0 | 8.0 | 192.0 | Nintendo | E |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 | | | | | | |
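
The `Unnamed: 0` column in the preview usually means the CSV was saved together with its row index. If that is the case here (an assumption, since the raw file itself is not inspected in this notebook), the column can be read as the index with `index_col=0`. The inline CSV below is a hypothetical two-row stand-in for the raw file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw file: the first header cell is empty
csv_text = (
    ",Name,Platform,Global_Sales\n"
    "0,Wii Sports,Wii,82.53\n"
    "1,Super Mario Bros.,NES,40.24\n"
)

# Default read: the unnamed first column is kept as a regular column
default_read = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index removes the extra column
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

If the column turns out to carry real information rather than a saved index, it should be kept and renamed instead.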


---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os

try:
    # Create the processed-data folder if it does not already exist
    os.makedirs(processed_data_dir, exist_ok=True)
except Exception as e:
    print(e)
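
The Outputs section names `data/processed/video_game_sales_clean.csv` as the export target. A minimal sketch of that export follows; `df_clean` and the helper `export_clean` are hypothetical names, and a temporary directory stands in for the project folder so the example has no lasting side effects.

```python
import os
import tempfile

import pandas as pd


def export_clean(df: pd.DataFrame, processed_dir: str,
                 filename: str = "video_game_sales_clean.csv") -> str:
    """Write the cleaned frame to `processed_dir` and return the file path."""
    os.makedirs(processed_dir, exist_ok=True)
    out_path = os.path.join(processed_dir, filename)
    # index=False avoids writing the row index as an extra unnamed column
    df.to_csv(out_path, index=False)
    return out_path


# Hypothetical one-row frame standing in for the cleaned dataset
df_clean = pd.DataFrame({"Name": ["Wii Sports"], "Global_Sales": [82.53]})

with tempfile.TemporaryDirectory() as tmp:
    path = export_clean(df_clean, os.path.join(tmp, "data", "processed"))
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
```

In the notebook itself, the call would presumably pass `processed_data_dir` from the setup cells rather than a temporary folder.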
