# ML Process
1. Problem Definition
2. Data Preparation
3. EDA
4. Preprocessing
5. Training Model
6. Evaluation
7. API Services
8. PyTest
9. Deployment

#### Problem Definition

- Predict sales from the available features.
- In this case, we will try to predict `NA_Sales`.

The goal of this project is to get familiar with the ML process.


Source data: https://www.kaggle.com/datasets/gregorut/videogamesales

In [5]:
!pwd

/home/shandytp/Pacmann/ml_process/video_game_sales


In [15]:
import pandas as pd
import numpy as np
import joblib
from src.utils.helper import load_params, read_data

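%load_ext autoreload
# mode 2 reloads every imported module before each cell runs;
# mode 1 only reloads modules registered with %aimport
%autoreload 2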

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
params_dir = "config/params.yaml"

In [3]:
params = load_params(params_dir)

In [4]:
params

{'dataset_dir': 'data/raw/'}

### Data Collection 

In [5]:
data = read_data(params["dataset_dir"], "vgsales.csv")
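The notebook imports `load_params` and `read_data` from `src.utils.helper` without showing them. A minimal sketch of what such helpers might look like follows. The function bodies are an assumption (PyYAML for the config, `pandas.read_csv` for the data), not the project's actual code, and the demo runs on throwaway files rather than the real `config/params.yaml`.

```python
import os
import tempfile

import pandas as pd
import yaml  # PyYAML; assumed to be the YAML parser used by the project


def load_params(param_dir: str) -> dict:
    """Hypothetical helper: parse a YAML config file into a dict."""
    with open(param_dir, "r") as file:
        return yaml.safe_load(file)


def read_data(dataset_dir: str, filename: str) -> pd.DataFrame:
    """Hypothetical helper: read one CSV file from the dataset directory."""
    return pd.read_csv(os.path.join(dataset_dir, filename))


# Demo on temporary files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_params_path = os.path.join(tmp, "params.yaml")
    with open(demo_params_path, "w") as file:
        yaml.safe_dump({"dataset_dir": tmp}, file)

    pd.DataFrame(
        {"Name": ["Wii Sports", "Super Mario Bros."], "NA_Sales": [41.49, 29.08]}
    ).to_csv(os.path.join(tmp, "demo.csv"), index=False)

    demo_params = load_params(demo_params_path)
    demo_data = read_data(demo_params["dataset_dir"], "demo.csv")

print(demo_data.shape)
```

Keeping the dataset location in the YAML file means the path can change without touching notebook code.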

In [6]:
data

|  | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16593 | 16596 | Woody Woodpecker in Crazy Castle 5 | GBA | 2002.0 | Platform | Kemco | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 |
| 16594 | 16597 | Men in Black II: Alien Escape | GC | 2003.0 | Shooter | Infogrames | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 |
| 16595 | 16598 | SCORE International Baja 1000: The Official Game | PS2 | 2008.0 | Racing | Activision | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| 16596 | 16599 | Know How 2 | DS | 2010.0 | Puzzle | 7G//AMES | 0.00 | 0.01 | 0.00 | 0.00 | 0.01 |


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


Drop the columns `Rank`, `EU_Sales`, `JP_Sales`, `Other_Sales`, and `Global_Sales`, so that only the candidate features and the `NA_Sales` target remain.
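One plausible reason to drop the other sales columns (an interpretation, not something the notebook states) is target leakage: `Global_Sales` appears to be the sum of the regional figures, so it already contains `NA_Sales`. The sketch below checks this on the five top rows shown in the table above. The 0.02 tolerance is an assumed allowance for rounding in the published figures.

```python
import pandas as pd

# The five top rows shown in the table above.
sample = pd.DataFrame(
    {
        "NA_Sales": [41.49, 29.08, 15.85, 15.75, 11.27],
        "EU_Sales": [29.02, 3.58, 12.88, 11.01, 8.89],
        "JP_Sales": [3.77, 6.81, 3.79, 3.28, 10.22],
        "Other_Sales": [8.46, 0.77, 3.31, 2.96, 1.00],
        "Global_Sales": [82.74, 40.24, 35.82, 33.00, 31.37],
    }
)

regional_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
regional_sum = sample[regional_cols].sum(axis=1)

# Absolute gap between the regional total and the reported global figure.
gap = (regional_sum - sample["Global_Sales"]).abs()
matches_within_rounding = bool((gap < 0.02).all())
print(matches_within_rounding)
```

Running the same comparison on the full DataFrame before the drop would show whether the pattern holds beyond these five rows.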

In [9]:
cols_to_drop = ["Rank", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales"]

In [12]:
data.drop(columns=cols_to_drop, inplace=True)

In [14]:
data.head()

|  | Name | Platform | Year | Genre | Publisher | NA_Sales |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 |


In [17]:
# save data into .pkl file
joblib.dump(data, "data/processed/data.pkl")

['data/processed/data.pkl']
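A later step can reload the pickled DataFrame with `joblib.load`. The sketch below shows the round trip on a tiny stand-in DataFrame inside a temporary directory. The directory layout mirrors `data/processed/` but is otherwise hypothetical, and the `os.makedirs` call is there because `joblib.dump` does not create missing directories.

```python
import os
import tempfile

import joblib
import pandas as pd

# Stand-in for the real DataFrame.
frame = pd.DataFrame({"Name": ["Wii Sports"], "NA_Sales": [41.49]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data", "processed")
    os.makedirs(out_dir, exist_ok=True)  # create the target folder if it is missing
    out_path = os.path.join(out_dir, "data.pkl")

    joblib.dump(frame, out_path)      # write the DataFrame to disk
    restored = joblib.load(out_path)  # read it back

print(restored.equals(frame))
```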

### Data Definition [todo list]

- `Name`: name of the game [object]
- `Platform`: platform the game was released on (PC, PS4, etc.) [object]
- `Year`: year the game was released [float64]
- `Genre`: genre of the game [object]
- `Publisher`: publisher of the game [object]
- `NA_Sales`: sales in North America (in millions) [float64]

###  Data Validation

#### Check Data Types

In [18]:
# check the data types
data.dtypes

Name          object
Platform      object
Year         float64
Genre         object
Publisher     object
NA_Sales     float64
dtype: object

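It is a bit odd that the `Year` column is stored as float. pandas does this because the column contains missing values (NaN is a float), so converting it to int requires handling those missing values first or using the nullable `Int64` dtype.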
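A minimal sketch of the conversion, assuming the missing years should be kept as missing rather than imputed: pandas' nullable `Int64` dtype stores integers alongside `<NA>`, whereas a plain `astype(int)` raises an error when NaN is present. The small Series here is a stand-in for `data["Year"]`.

```python
import numpy as np
import pandas as pd

# Stand-in for data["Year"]: float values with one missing entry.
year = pd.Series([2006.0, 1985.0, np.nan, 2008.0], name="Year")

# year.astype(int) would raise here, because NaN has no integer representation.
# The nullable Int64 dtype keeps the gap as <NA> instead.
year_int = year.astype("Int64")

print(year_int.dtype)
print(int(year_int.isna().sum()))
```

On the real data, the equivalent would be `data["Year"] = data["Year"].astype("Int64")`. Whether to impute or drop the missing years instead is a separate preprocessing decision.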

#### Range 

In [20]:
data.describe()

|  | Year | NA_Sales |
| --- | --- | --- |
| count | 16327.0 | 16598.0 |
| mean | 2006.406443 | 0.264667 |
| std | 5.828981 | 0.816683 |
| min | 1980.0 | 0.0 |
| 25% | 2003.0 | 0.0 |
| 50% | 2007.0 | 0.08 |
| 75% | 2010.0 | 0.24 |
| max | 2020.0 | 41.49 |
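The observed minimum and maximum can serve as a range check for data that arrives later. The sketch below is one hypothetical way to do that. The helper name `count_out_of_range` and the choice to reuse the observed min/max as bounds are assumptions, and the sample rows are made up to include one bad value per column.

```python
import pandas as pd

# Bounds copied from the min/max rows of describe() above.
expected_range = {"Year": (1980, 2020), "NA_Sales": (0.0, 41.49)}


def count_out_of_range(df: pd.DataFrame, bounds: dict) -> dict:
    """Count non-missing values per column that fall outside [low, high]."""
    counts = {}
    for col, (low, high) in bounds.items():
        values = df[col].dropna()  # missing values are checked separately
        counts[col] = int((~values.between(low, high)).sum())
    return counts


# Made-up rows: 2031 and -0.5 are deliberately out of range.
sample = pd.DataFrame(
    {"Year": [2006.0, 1985.0, None, 2031.0], "NA_Sales": [41.49, 29.08, 0.0, -0.5]}
)
print(count_out_of_range(sample, expected_range))
```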


#### Check data shape

In [21]:
data.shape

(16598, 6)

#### Check missing values 

In [22]:
data.isna().sum()

Name           0
Platform       0
Year         271
Genre          0
Publisher     58
NA_Sales       0
dtype: int64
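To judge how much data is affected, the counts can be expressed as a share of all rows. The sketch below redoes that arithmetic from the numbers printed above (16598 rows, 271 and 58 missing values). On the live DataFrame, `data.isna().mean() * 100` should give the same percentages.

```python
import pandas as pd

n_rows = 16598  # from data.shape
missing_counts = pd.Series({"Year": 271, "Publisher": 58})

# Share of rows with a missing value, in percent.
missing_pct = (missing_counts / n_rows * 100).round(2)
print(missing_pct)
```

Both columns are missing in well under 2% of rows.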