# ML Process
1. Problem Definition
2. Data Preparation
3. EDA
4. Preprocessing
5. Training Model
6. Evaluation
7. API Services
8. PyTest
9. Deployment

### 1. Problem Definition

- Predict global sales from the available information, such as the sales in each region.
- The aim is to identify the games with the most potential based on sales.
- In this case we will try to predict `Global_Sales`.

The goal of this project is to get more familiar and comfortable with the ML process.

Source data: https://www.kaggle.com/datasets/gregorut/videogamesales

In [29]:
import pandas as pd
import joblib
from src.utils.helper import load_params, read_data, check_data

In [30]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [31]:
params_dir = "config/params.yaml"

In [32]:
params = load_params(params_dir)

In [33]:
params

{'dataset_dir': 'data/raw/',
 'int32_columns': ['Year'],
 'float32_columns': ['NA_Sales',
  'EU_Sales',
  'JP_Sales',
  'Other_Sales',
  'Global_Sales'],
 'object_columns': ['Name', 'Platform', 'Genre', 'Publisher'],
 'label': 'Global_Sales',
 'predictors': ['Name',
  'Year',
  'Platform',
  'Genre',
  'Publisher',
  'NA_Sales',
  'EU_Sales',
  'JP_Sales',
  'Other_Sales'],
 'range_Year': [-1, 2020]}
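The helper `load_params` lives in `src.utils.helper` and is not shown in this notebook. As a rough sketch of what such a helper might look like, assuming the config is plain YAML read with PyYAML (the function body and the sample config below are illustrative assumptions, not the project's actual code):

```python
import tempfile
from pathlib import Path

import yaml  # PyYAML; assumed to be the parser behind the real helper


def load_params(param_dir: str) -> dict:
    """Hypothetical stand-in for src.utils.helper.load_params."""
    with open(param_dir, "r") as file:
        # safe_load only builds plain Python objects (dict, list, str, int, ...)
        return yaml.safe_load(file)


# Small sample config mirroring part of the parameters shown above
example_yaml = """
dataset_dir: data/raw/
label: Global_Sales
int32_columns:
  - Year
range_Year:
  - -1
  - 2020
"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "params.yaml"
    path.write_text(example_yaml)
    example_params = load_params(str(path))

print(example_params["range_Year"])
```

Keeping column lists and valid ranges in a config file means later checks can read them instead of hard-coding values.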

### 2. Data Collection 

In [34]:
data = read_data(params["dataset_dir"], "vgsales.csv")

In [35]:
data

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16593 | 16596 | Woody Woodpecker in Crazy Castle 5 | GBA | 2002.0 | Platform | Kemco | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 |
| 16594 | 16597 | Men in Black II: Alien Escape | GC | 2003.0 | Shooter | Infogrames | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 |
| 16595 | 16598 | SCORE International Baja 1000: The Official Game | PS2 | 2008.0 | Racing | Activision | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| 16596 | 16599 | Know How 2 | DS | 2010.0 | Puzzle | 7G//AMES | 0.00 | 0.01 | 0.00 | 0.00 | 0.01 |


In [36]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


Drop the `Rank` column. The regional sales columns and `Global_Sales` stay in the data, since they are used later as predictors and label.

In [37]:
data.drop(["Rank"], axis = 1, inplace = True)

In [38]:
data.head()

| | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.0 |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 |


In [39]:
# save data into .pkl file
joblib.dump(data, "data/processed/data.pkl")

['data/processed/data.pkl']

### 3. Data Definition

Name: Name of the game
    [object]
    
Platform: Platform the game was released on
    [object]
    [PC, PS4, PS2, XOne, WiiU, etc.]

Year: Year the game was released
    [integer]
    [1980 - 2020]

Genre: Genre of the game
    [object]
    [Sports, Platform, Racing, Puzzle, etc.]

Publisher: Publisher of the game
    [object]
    [Mojang, Konami, EA, etc.]

NA_Sales: Sales in North America (in millions)
    [float]
    
EU_Sales: Sales in Europe (in millions)
    [float]
    
JP_Sales: Sales in Japan (in millions)
    [float]
    
Other_Sales: Sales in other regions (in millions)
    [float]

### 4. Data Validation

#### Check Data Types

In [40]:
# check the data types
data.dtypes

Name             object
Platform         object
Year            float64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object

It is a bit odd that the `Year` column is stored as a float. We can convert it to an integer.

#### Range 

In [41]:
data.describe()

| | Year | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|
| count | 16327.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 |
| mean | 2006.406443 | 0.264667 | 0.146652 | 0.077782 | 0.048063 | 0.537441 |
| std | 5.828981 | 0.816683 | 0.505351 | 0.309291 | 0.188588 | 1.555028 |
| min | 1980.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 2003.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06 |
| 50% | 2007.0 | 0.08 | 0.02 | 0.0 | 0.01 | 0.17 |
| 75% | 2010.0 | 0.24 | 0.11 | 0.04 | 0.04 | 0.47 |
| max | 2020.0 | 41.49 | 29.02 | 10.22 | 10.57 | 82.74 |


#### Check data shape

In [42]:
data.shape

(16598, 10)

#### Check missing values 

In [43]:
data.isna().sum()

Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

#### Handling Column "Year"  

In [44]:
# convert float into int

data["Year"] = data["Year"].astype("int").copy()

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

It turns out the column has missing values, so the cast to integer fails.

In [45]:
data["Year"].isnull().sum()

271

We can handle this temporarily by filling the missing values with `-1`.

In [46]:
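# Assign the result back instead of calling fillna(inplace=True) on the selected column,
# which recent pandas versions warn about and which may not update `data`
data["Year"] = data["Year"].fillna(-1)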

In [47]:
data["Year"].isnull().sum()

0

In [48]:
# Try casting to integer again
data["Year"] = data["Year"].astype("int")

In [49]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          16598 non-null  object 
 1   Platform      16598 non-null  object 
 2   Year          16598 non-null  int64  
 3   Genre         16598 non-null  object 
 4   Publisher     16540 non-null  object 
 5   NA_Sales      16598 non-null  float64
 6   EU_Sales      16598 non-null  float64
 7   JP_Sales      16598 non-null  float64
 8   Other_Sales   16598 non-null  float64
 9   Global_Sales  16598 non-null  float64
dtypes: float64(5), int64(1), object(4)
memory usage: 1.3+ MB
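As an alternative to the `-1` sentinel, pandas also offers a nullable integer dtype (`"Int64"`, with a capital I) that keeps missing values as `<NA>`. The sketch below compares both approaches on a toy series; the variable names are illustrative only. Since `range_Year` in the parameters starts at `-1`, the project's checks appear to expect the sentinel, so switching would likely mean adjusting them as well.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Year column: floats with one missing value
year = pd.Series([2006.0, 1985.0, np.nan], name="Year")

# Sentinel approach used in this notebook: fill with -1, then cast
year_sentinel = year.fillna(-1).astype("int64")

# Nullable-integer alternative: missing values stay missing (<NA>)
year_nullable = year.astype("Int64")

print(year_sentinel.tolist())
print(year_nullable.dtype, year_nullable.isna().sum())
```

The sentinel keeps a plain NumPy integer column, at the cost of a fake value that summary statistics on `Year` will pick up unless it is filtered out.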


#### Handling Column "Publisher"

In [50]:
data["Publisher"].isnull().sum()

58

There are two ways to treat these rows:
1. Drop the rows with a missing publisher
2. Fill the missing values with "UNKNOWN"

We can leave the treatment to the preprocessing step, because here we are only checking the data.
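The two options can be sketched on a toy frame as follows. This is only an illustration with made-up rows; the choice itself is deferred to preprocessing.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing publisher (values are illustrative)
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Publisher": ["Nintendo", np.nan, "Kemco"],
})

# Option 1: drop the rows where Publisher is missing
dropped = toy.dropna(subset=["Publisher"])

# Option 2: keep every row and mark the gap with a placeholder category
filled = toy.assign(Publisher=toy["Publisher"].fillna("UNKNOWN"))

print(len(dropped), filled["Publisher"].tolist())
```

Dropping loses those rows' sales figures, while filling keeps them and treats "UNKNOWN" as one more publisher category.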

In [51]:
data[data["Publisher"].isnull()]

| | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 470 | wwe Smackdown vs. Raw 2006 | PS2 | -1 | Fighting | NaN | 1.57 | 1.02 | 0.0 | 0.41 | 3.0 |
| 1303 | Triple Play 99 | PS | -1 | Sports | NaN | 0.81 | 0.55 | 0.0 | 0.1 | 1.46 |
| 1662 | Shrek / Shrek 2 2-in-1 Gameboy Advance Video | GBA | 2007 | Misc | NaN | 0.87 | 0.32 | 0.0 | 0.02 | 1.21 |
| 2222 | Bentley's Hackpack | GBA | 2005 | Misc | NaN | 0.67 | 0.25 | 0.0 | 0.02 | 0.93 |
| 3159 | Nicktoons Collection: Game Boy Advance Video V... | GBA | 2004 | Misc | NaN | 0.46 | 0.17 | 0.0 | 0.01 | 0.64 |
| 3166 | SpongeBob SquarePants: Game Boy Advance Video ... | GBA | 2004 | Misc | NaN | 0.46 | 0.17 | 0.0 | 0.01 | 0.64 |
| 3766 | SpongeBob SquarePants: Game Boy Advance Video ... | GBA | 2004 | Misc | NaN | 0.38 | 0.14 | 0.0 | 0.01 | 0.53 |
| 4145 | Sonic the Hedgehog | PS3 | -1 | Platform | NaN | 0.0 | 0.48 | 0.0 | 0.0 | 0.48 |
| 4526 | The Fairly Odd Parents: Game Boy Advance Video... | GBA | 2004 | Misc | NaN | 0.31 | 0.11 | 0.0 | 0.01 | 0.43 |
| 4635 | The Fairly Odd Parents: Game Boy Advance Video... | GBA | 2004 | Misc | NaN | 0.3 | 0.11 | 0.0 | 0.01 | 0.42 |


#### Data Defense

In [52]:
check_data(data, params)

'Passed data defense'
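The body of `check_data` is defined in `src.utils.helper` and is not shown here. A minimal sketch of what such a data defense could check, assuming it validates the expected columns, the `Year` range, and numeric sales columns against the parameters (the function `check_data_sketch` and the toy inputs are hypothetical, and the real helper may check different things):

```python
import pandas as pd


def check_data_sketch(df: pd.DataFrame, params: dict) -> str:
    """Hypothetical data defense; not the project's actual check_data."""
    # Every predictor and the label must be present
    expected = set(params["predictors"]) | {params["label"]}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    # Year must fall inside the configured range (-1 marks a missing year)
    low, high = params["range_Year"]
    if not df["Year"].between(low, high).all():
        raise ValueError("Year outside the allowed range")

    # Sales columns must hold numeric values
    for col in params["float32_columns"]:
        if not pd.api.types.is_numeric_dtype(df[col]):
            raise ValueError(f"{col} is not numeric")

    return "Passed data defense"


# Toy parameters and data, reduced to a few columns for illustration
toy_params = {
    "predictors": ["Year", "NA_Sales"],
    "label": "Global_Sales",
    "float32_columns": ["NA_Sales", "Global_Sales"],
    "range_Year": [-1, 2020],
}
good = pd.DataFrame({"Year": [2006, -1], "NA_Sales": [41.49, 0.81],
                     "Global_Sales": [82.74, 1.46]})
result = check_data_sketch(good, toy_params)
print(result)
```

Raising an error on the first failed check stops the pipeline early, before bad data reaches splitting or training.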

### 5. Data Splitting 

In [53]:
# Split into X and y
X = data[params["predictors"]].copy()
y = data[params["label"]].copy()

In [54]:
X.head()

| | Name | Year | Platform | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | 2006 | Wii | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 |
| 1 | Super Mario Bros. | 1985 | NES | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 |
| 2 | Mario Kart Wii | 2008 | Wii | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 |
| 3 | Wii Sports Resort | 2009 | Wii | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 |
| 4 | Pokemon Red/Pokemon Blue | 1996 | GB | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 |


In [55]:
y.head()

0    82.74
1    40.24
2    35.82
3    33.00
4    31.37
Name: Global_Sales, dtype: float64

In [56]:
from sklearn.model_selection import train_test_split

In [57]:
# Hold out 20% of the rows; the remaining 80% form the training set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = 42)

In [58]:
# Split the held-out rows in half: one half for validation, one half for testing
X_valid, X_test, y_valid, y_test = train_test_split(X_test,
                                                    y_test,
                                                    test_size = 0.5,
                                                    random_state = 42)
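Chaining the two calls this way gives roughly an 80/10/10 train/validation/test split. The toy example below reproduces the same two-step pattern on 100 dummy rows, so the resulting sizes are easy to verify; the `*_demo` names are illustrative and not part of the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows so the proportions are easy to read off
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# Step 1: 80% train, 20% held out
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Step 2: split the held-out part evenly into validation and test
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42)

sizes = {"train": len(X_tr), "valid": len(X_va), "test": len(X_te)}
print(sizes)  # {'train': 80, 'valid': 10, 'test': 10}
```

Fixing `random_state` makes the split reproducible, so the saved train, validation, and test files stay consistent across reruns.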

In [59]:
joblib.dump(X_train, "data/processed/X_train.pkl")
joblib.dump(y_train, "data/processed/y_train.pkl")
joblib.dump(X_valid, "data/processed/X_valid.pkl")
joblib.dump(y_valid, "data/processed/y_valid.pkl")
joblib.dump(X_test, "data/processed/X_test.pkl")
joblib.dump(y_test, "data/processed/y_test.pkl")

['data/processed/y_test.pkl']
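Later stages can read these files back with `joblib.load`. The round trip below uses a temporary directory and a toy frame rather than the project's `data/processed/` files, purely as an illustration.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd

# Toy frame standing in for one of the saved splits
frame = pd.DataFrame({"Year": [2006, 1985], "NA_Sales": [41.49, 29.08]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "X_train.pkl"
    joblib.dump(frame, path)        # returns the list of file names written
    restored = joblib.load(path)    # reads the object back from disk

print(restored.equals(frame))
```

Because these files are pickles, they should only be loaded from trusted sources.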