# Recommendation System
# Game Recommendation with *content-based filtering*

*by: [Rifqi Novandi](https://github.com/rifqinvnd)*

## Background
In this machine learning project, a recommendation system is built to suggest games a user may like based on other games with similar attributes. It uses a *content-based filtering* technique with features such as platform, year of release, genre, and others.
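
As a rough illustration of the idea, the sketch below describes each game by a small feature vector and recommends the game whose vector is most similar. The game names, the four features, and the helper functions `cosine` and `most_similar` are hypothetical and are not part of the notebook; they only show the principle on toy data.

```python
import math

# Hypothetical toy feature vectors: [is_RPG, is_Racing, on_PS, on_Wii]
games = {
    "Game A": [1, 0, 1, 0],
    "Game B": [1, 0, 1, 0],
    "Game C": [0, 1, 0, 1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(title):
    """Return the other game whose features are closest to `title`."""
    others = (name for name in games if name != title)
    return max(others, key=lambda name: cosine(games[title], games[name]))

print(most_similar("Game A"))
```

Game A and Game B share every feature, so Game B is returned, while Game C shares none and has a similarity of zero.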

## 1. Install and import the required libraries

In [1]:
# Install the required library
!pip install -U scikit-learn
!pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# using os and zipfile library to prepare the dataset
import os
import zipfile
import json

# library for data processing
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.preprocessing import MinMaxScaler

# library to make the recommendation system model
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

# library for evaluating the machine learning model
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

## 2. Prepare the Dataset

### 2.1 Prepare the Kaggle username and key

In [3]:
# prepare the Kaggle credential environment
os.environ['KAGGLE_USERNAME'] = 'rifqinovandi'
os.environ['KAGGLE_KEY'] = '<your-kaggle-api-key>'  # placeholder: never publish a real API key
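
Credentials written into a notebook are visible to anyone who can read it, so one option is to set `KAGGLE_USERNAME` and `KAGGLE_KEY` outside the notebook and only check that they are present. The sketch below assumes that approach; `kaggle_credentials` is a hypothetical helper, not part of the Kaggle package.

```python
import os

def kaggle_credentials(env=os.environ):
    """Return (username, key) from the environment, or raise if either is missing."""
    missing = [name for name in ("KAGGLE_USERNAME", "KAGGLE_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["KAGGLE_USERNAME"], env["KAGGLE_KEY"]

# Demonstration with a stand-in mapping instead of the real environment
print(kaggle_credentials({"KAGGLE_USERNAME": "demo", "KAGGLE_KEY": "demo-key"}))
```

Calling `kaggle_credentials()` with no argument checks the real environment before the download cell runs.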

### 2.2 Download and prepare the dataset

In [4]:
# Download the dataset with Kaggle CLI
!kaggle datasets download -d rush4ratio/video-game-sales-with-ratings

Downloading video-game-sales-with-ratings.zip to /content
  0% 0.00/476k [00:00<?, ?B/s]
100% 476k/476k [00:00<00:00, 127MB/s]


In [5]:
# Extract the zip file to /content (the working directory)
files = "/content/video-game-sales-with-ratings.zip"
with zipfile.ZipFile(files, 'r') as archive:
  archive.extractall('/content')

## 3. Data Understanding

### 3.1 Read data with pandas DataFrame

In [6]:
df = pd.read_csv(files)
df.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


### 3.2 Understand the overall contents of the dataset

In [7]:
df.shape

(16719, 16)

In [8]:
# Check dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       10015 non-null  object 
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(9), object(7)
memory usage: 2.0+ MB


In [9]:
# Check NaN value in columns
df.isna().sum()

Name                  2
Platform              0
Year_of_Release     269
Genre                 2
Publisher            54
NA_Sales              0
EU_Sales              0
JP_Sales              0
Other_Sales           0
Global_Sales          0
Critic_Score       8582
Critic_Count       8582
User_Score         6704
User_Count         9129
Developer          6623
Rating             6769
dtype: int64

In [10]:
# Describe dataset column
df.describe()

Unnamed: 0,Year_of_Release,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Count
count,16450.0,16719.0,16719.0,16719.0,16719.0,16719.0,8137.0,8137.0,7590.0
mean,2006.487356,0.26333,0.145025,0.077602,0.047332,0.533543,68.967679,26.360821,162.229908
std,5.878995,0.813514,0.503283,0.308818,0.18671,1.547935,13.938165,18.980495,561.282326
min,1980.0,0.0,0.0,0.0,0.0,0.01,13.0,3.0,4.0
25%,2003.0,0.0,0.0,0.0,0.0,0.06,60.0,12.0,10.0
50%,2007.0,0.08,0.02,0.0,0.01,0.17,71.0,21.0,24.0
75%,2010.0,0.24,0.11,0.04,0.03,0.47,79.0,36.0,81.0
max,2020.0,41.36,28.96,10.22,10.57,82.53,98.0,113.0,10665.0


## 4. Data Preparation

### 4.1 Drop columns that have missing values or are unused

In [11]:
df.drop(['Global_Sales', 'Critic_Score', 'Critic_Count', 'User_Count'], axis=1, inplace=True)

### 4.2 Clean each column of the data

#### 4.2.1 Name column

In [12]:
# Check missing value
df[df['Name'].isna()]

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,User_Score,Developer,Rating
659,,GEN,1993.0,,Acclaim Entertainment,1.78,0.53,0.0,0.08,,,
14246,,GEN,1993.0,,Acclaim Entertainment,0.0,0.0,0.03,0.0,,,


In [13]:
# Drop rows with a missing Name
for index in df[df['Name'].isna()].index:
  df.drop(index, axis=0, inplace=True)

In [14]:
# make sure the missing value has been deleted
if(df['Name'].isna().sum() == 0):
  print("There is no empty data in the Name column")
else:
  print("Missing value detected")

There is no empty data in the Name column
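
The row-by-row loop used to drop missing values works, but pandas can do the same in a single vectorized call. A minimal sketch on a hypothetical toy frame (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one missing name
toy = pd.DataFrame({
    "Name": ["Wii Sports", None, "Mario Kart Wii"],
    "Platform": ["Wii", "GEN", "Wii"],
})

# Drop every row whose Name is missing in one call
cleaned = toy.dropna(subset=["Name"])

print(cleaned["Name"].tolist())
```

The same pattern should apply to the other columns cleaned later by passing their names in `subset`.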


#### 4.2.2 Platform Column

In [15]:
# use collections Counter to count how often each platform appears
platform_counter = Counter(df['Platform'])
platform_counter

Counter({'2600': 133,
         '3DO': 3,
         '3DS': 520,
         'DC': 52,
         'DS': 2152,
         'GB': 98,
         'GBA': 822,
         'GC': 556,
         'GEN': 27,
         'GG': 1,
         'N64': 319,
         'NES': 98,
         'NG': 12,
         'PC': 974,
         'PCFX': 1,
         'PS': 1197,
         'PS2': 2161,
         'PS3': 1331,
         'PS4': 393,
         'PSP': 1209,
         'PSV': 432,
         'SAT': 173,
         'SCD': 6,
         'SNES': 239,
         'TG16': 2,
         'WS': 6,
         'Wii': 1320,
         'WiiU': 147,
         'X360': 1262,
         'XB': 824,
         'XOne': 247})

In [16]:
# remove rows whose platform has fewer than 350 games
platform_less_than_350 = ['2600', '3DO', 'DC', 'GB', 'GEN', 'GG', 'N64','NES', 'NG',
                          'PCFX', 'SAT', 'SCD', 'SNES', 'TG16', 'WS', 'WiiU', 'XOne']

df = df[~df['Platform'].isin(platform_less_than_350)]
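
Instead of typing the rare platforms by hand, the list could be derived from the counter. The sketch below assumes the same threshold of 350 and uses only a handful of counts copied from the output above; `min_count` and `rare_platforms` are illustrative names.

```python
from collections import Counter

# A few counts copied from the Counter output above
platform_counter = Counter({"PS2": 2161, "DS": 2152, "PS4": 393, "N64": 319, "XOne": 247, "GG": 1})

min_count = 350  # same threshold as the hand-written list
rare_platforms = sorted(p for p, n in platform_counter.items() if n < min_count)

print(rare_platforms)
```

With the full counter, the resulting list can be passed to `isin` exactly like `platform_less_than_350`.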

In [17]:
# Check unique element
df['Platform'].unique()

array(['Wii', 'DS', 'X360', 'PS3', 'PS2', 'GBA', 'PS4', '3DS', 'PS', 'XB',
       'PC', 'PSP', 'GC', 'PSV'], dtype=object)

#### 4.2.3 Genre Column

In [18]:
# Check missing value
df['Genre'].isna().sum()

0

In [19]:
# Check unique element
df['Genre'].unique()

array(['Sports', 'Racing', 'Platform', 'Misc', 'Simulation', 'Action',
       'Role-Playing', 'Puzzle', 'Shooter', 'Fighting', 'Adventure',
       'Strategy'], dtype=object)

In [20]:
# Count how often each genre appears
genre_counter = Counter(df['Genre'])
genre_counter

Counter({'Action': 3082,
         'Adventure': 1229,
         'Fighting': 718,
         'Misc': 1641,
         'Platform': 738,
         'Puzzle': 505,
         'Racing': 1132,
         'Role-Playing': 1359,
         'Shooter': 1182,
         'Simulation': 835,
         'Sports': 2108,
         'Strategy': 624})

In [21]:
# discard rows with the Misc genre, which is too complex
df = df[df['Genre'] != 'Misc']

In [22]:
# check dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13512 entries, 0 to 16718
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             13512 non-null  object 
 1   Platform         13512 non-null  object 
 2   Year_of_Release  13293 non-null  float64
 3   Genre            13512 non-null  object 
 4   Publisher        13485 non-null  object 
 5   NA_Sales         13512 non-null  float64
 6   EU_Sales         13512 non-null  float64
 7   JP_Sales         13512 non-null  float64
 8   Other_Sales      13512 non-null  float64
 9   User_Score       8844 non-null   object 
 10  Developer        8925 non-null   object 
 11  Rating           8792 non-null   object 
dtypes: float64(5), object(7)
memory usage: 1.3+ MB


#### 4.2.4 Publisher Column

In [23]:
# check missing-value
df['Publisher'].isna().sum()

27

In [24]:
# remove every row with a missing value
for index in df[df['Publisher'].isna()].index:
  df.drop(index, axis=0, inplace=True)

In [25]:
# recheck that every missing value in the Publisher column has been removed
if(df['Publisher'].isna().sum() == 0):
  print("There is no empty data in the Publisher column")
else:
  print("Missing value detected")

There is no empty data in the Publisher column


In [26]:
# Check unique element
df['Publisher'].unique()

array(['Nintendo', 'Take-Two Interactive', 'Sony Computer Entertainment',
       'Activision', 'Microsoft Game Studios', 'Bethesda Softworks',
       'Electronic Arts', 'Sega', 'SquareSoft', '505 Games', 'Ubisoft',
       'GT Interactive', 'Konami Digital Entertainment', 'Square Enix',
       'Sony Computer Entertainment Europe', 'Virgin Interactive',
       'LucasArts', 'Capcom', 'Warner Bros. Interactive Entertainment',
       'Universal Interactive', 'Eidos Interactive', 'Atari',
       'Vivendi Games', 'Enix Corporation', 'Hasbro Interactive',
       'Namco Bandai Games', 'THQ', 'Fox Interactive',
       'Acclaim Entertainment', 'Disney Interactive Studios',
       'Codemasters', 'Majesco Entertainment', 'Red Orb', 'Level 5',
       'Midway Games', 'JVC', 'Deep Silver', 'NCSoft', '989 Studios',
       'UEP Systems', 'Maxis', 'Tecmo Koei', 'ASCII Entertainment',
       'Valve Software', 'Unknown', 'Valve', 'Hello Games', 'D3Publisher',
       'Activision Value', 'Infogrames', 'Red S

In [27]:
# Check the unknown element
df[df['Publisher'] == 'Unknown']

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,User_Score,Developer,Rating
944,Gran Turismo Concept 2001 Tokyo,PS2,2001.0,Racing,Unknown,0.00,1.10,0.42,0.33,,,
1650,NASCAR Thunder 2003,PS2,,Racing,Unknown,0.60,0.46,0.00,0.16,8.7,EA Sports,E
2108,Suikoden III,PS2,,Role-Playing,Unknown,0.29,0.23,0.38,0.08,7.7,KCET,T
2224,Teenage Mutant Ninja Turtles,GBA,2003.0,Action,Unknown,0.67,0.25,0.00,0.02,8.8,Konami,E
2321,Blitz: The League,PS2,2005.0,Sports,Unknown,0.74,0.03,0.00,0.12,8,Midway,M
...,...,...,...,...,...,...,...,...,...,...,...,...
16558,"Horse Life 4: My Horse, My Friend, My Champion",3DS,2015.0,Action,Unknown,0.00,0.01,0.00,0.00,,,
16638,The Treasures of Mystery Island 3 Pack - Save ...,PC,2011.0,Puzzle,Unknown,0.01,0.00,0.00,0.00,,,
16653,Real Crimes: The Unicorn Killer,DS,2011.0,Puzzle,Unknown,0.00,0.01,0.00,0.00,,,
16706,STORM: Frontline Nation,PC,2011.0,Strategy,Unknown,0.00,0.01,0.00,0.00,7.2,SimBin,E10+


In [28]:
# remove row with the unknown publisher
for index in df[df['Publisher'] == 'Unknown'].index:
  df.drop(index, axis=0, inplace=True)

#### 4.2.5 Year of Release Column

In [29]:
# check missing value
df['Year_of_Release'].isna().sum()

115

In [30]:
# remove missing value
for index in df[df['Year_of_Release'].isna()].index:
  df.drop(index, axis=0, inplace=True)

In [31]:
# ensure the missing value has been removed
if(df['Year_of_Release'].isna().sum() == 0):
  print("There is no empty data in Year_of_Release column")
else:
  print("Missing value detected")

There is no empty data in Year_of_Release column


In [32]:
# check unique element
df['Year_of_Release'].unique()

array([2006., 2008., 2009., 2005., 2007., 2013., 2004., 2002., 2010.,
       2001., 2011., 2015., 2012., 2014., 1997., 1999., 2016., 2003.,
       1998., 1996., 2000., 1995., 1994., 1992., 2020., 2017., 1985.,
       1988.])

In [33]:
# change column type to string because it is categorical
df['Year_of_Release'] = df['Year_of_Release'].astype('str')

In [34]:
# check missing value of the whole data
df.isna().sum()

Name                  0
Platform              0
Year_of_Release       0
Genre                 0
Publisher             0
NA_Sales              0
EU_Sales              0
JP_Sales              0
Other_Sales           0
User_Score         4559
Developer          4496
Rating             4615
dtype: int64

#### 4.2.6 User Score Column

In [35]:
# remove the missing value in the user score, developer, and rating columns
for index in df[df['User_Score'].isna()].index:
  df.drop(index, axis=0, inplace=True)

for index in df[df['Developer'].isna()].index:
  df.drop(index, axis=0, inplace=True)

for index in df[df['Rating'].isna()].index:
  df.drop(index, axis=0, inplace=True)

In [36]:
# check that all missing values in the dataset have been removed
if df.isna().sum().sum() == 0:
  print('Dataset cleaned')
else:
  print('Missing value detected')

Dataset cleaned


In [37]:
# check unique element
df['User_Score'].unique()

array(['8', '8.3', '8.5', '8.4', '8.6', '7.7', '7.4', '8.2', '9', '8.1',
       '8.7', '7.1', '3.4', '6.3', '5.3', '4.8', '3.2', '8.9', '6.4',
       '7.8', '7.9', '7.5', '2.6', '7.2', '9.2', '7', '4.3', '6.6', '7.6',
       '5.7', '5', '9.1', '6.5', 'tbd', '8.8', '6.9', '7.3', '9.4', '6.8',
       '6.1', '6.7', '4', '5.4', '4.9', '4.5', '9.3', '4.2', '3.7', '5.8',
       '5.6', '5.9', '3.9', '5.5', '6.2', '5.2', '6', '4.1', '4.7', '4.4',
       '5.1', '3.5', '2.5', '3', '3.1', '2.9', '2.7', '2.2', '2', '4.6',
       '9.5', '2.1', '3.6', '2.8', '3.3', '1.8', '3.8', '0', '1.6', '9.6',
       '2.4', '1.7', '1.1', '0.3', '1.5', '0.7', '1.2', '2.3', '1.3',
       '0.2', '0.5', '0.6', '1.4', '0.9', '1.9', '1', '9.7'], dtype=object)

In [38]:
# check tbd element in User_Score column
df[df['User_Score'] == 'tbd']

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,User_Score,Developer,Rating
119,Zumba Fitness,Wii,2010.0,Sports,505 Games,3.45,2.59,0.00,0.66,tbd,"Pipeworks Software, Inc.",E
520,Zumba Fitness 2,Wii,2011.0,Sports,Majesco Entertainment,1.51,1.03,0.00,0.27,tbd,"Majesco Games, Majesco",T
726,Dance Dance Revolution X2,PS2,2009.0,Simulation,Konami Digital Entertainment,1.09,0.85,0.00,0.28,tbd,Konami,E10+
821,The Incredibles,GBA,2004.0,Action,THQ,1.15,0.77,0.04,0.10,tbd,Helixe,E
1047,Tetris Worlds,GBA,2001.0,Puzzle,THQ,1.25,0.39,0.00,0.06,tbd,3d6 Games,E
...,...,...,...,...,...,...,...,...,...,...,...,...
16699,Planet Monsters,GBA,2001.0,Action,Titus,0.01,0.00,0.00,0.00,tbd,Planet Interactive,E
16701,Bust-A-Move 3000,GC,2003.0,Puzzle,Ubisoft,0.01,0.00,0.00,0.00,tbd,Taito Corporation,E
16702,Mega Brain Boost,DS,2008.0,Puzzle,Majesco Entertainment,0.01,0.00,0.00,0.00,tbd,Interchannel-Holon,E
16708,Plushees,DS,2008.0,Simulation,Destineer,0.01,0.00,0.00,0.00,tbd,Big John Games,E


In [39]:
# remove row with tbd user score
for index in df[df['User_Score'] == 'tbd'].index:
  df.drop(index, axis=0, inplace=True)

In [40]:
# change the user score data type to float as a numerical feature
df['User_Score'] = df['User_Score'].astype('float')
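
An alternative to removing the `tbd` rows first and casting afterwards is to let pandas coerce unparseable entries to `NaN` and then drop them. A minimal sketch on a toy Series (the values are a small sample of those listed above):

```python
import pandas as pd

scores = pd.Series(["8", "8.3", "tbd", "7.7"])

# errors="coerce" turns unparseable entries such as "tbd" into NaN
numeric = pd.to_numeric(scores, errors="coerce").dropna()

print(numeric.tolist())
```

This keeps the conversion in one step and would also catch any other non-numeric placeholder, not only `tbd`.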

#### 4.2.7 Developer Column

In [41]:
# check the number of different elements in the developer column
df['Developer'].nunique()

1267

In [42]:
# the column is categorical with too many distinct values to encode, so it is discarded
df.drop('Developer', axis=1, inplace=True)

#### 4.2.8 Rating Column

In [43]:
# check unique element
df['Rating'].unique()

array(['E', 'M', 'T', 'E10+', 'K-A', 'AO', 'EC', 'RP'], dtype=object)

### 4.3 Duplicate data cleaning

In [44]:
df.duplicated().sum()

0

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6662 entries, 0 to 16700
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             6662 non-null   object 
 1   Platform         6662 non-null   object 
 2   Year_of_Release  6662 non-null   object 
 3   Genre            6662 non-null   object 
 4   Publisher        6662 non-null   object 
 5   NA_Sales         6662 non-null   float64
 6   EU_Sales         6662 non-null   float64
 7   JP_Sales         6662 non-null   float64
 8   Other_Sales      6662 non-null   float64
 9   User_Score       6662 non-null   float64
 10  Rating           6662 non-null   object 
dtypes: float64(5), object(6)
memory usage: 624.6+ KB


In [46]:
# The results of the data after the cleaning process
df = df.reset_index(drop=True)
df

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,User_Score,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,8.0,E
1,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,8.3,E
2,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,8.0,E
3,New Super Mario Bros.,DS,2006.0,Platform,Nintendo,11.28,9.14,6.50,2.88,8.5,E
4,New Super Mario Bros. Wii,Wii,2009.0,Platform,Nintendo,14.44,6.94,4.70,2.24,8.4,E
...,...,...,...,...,...,...,...,...,...,...,...
6657,E.T. The Extra-Terrestrial,GBA,2001.0,Action,NewKidCo,0.01,0.00,0.00,0.00,2.4,E
6658,Mortal Kombat: Deadly Alliance,GBA,2002.0,Fighting,Midway Games,0.01,0.00,0.00,0.00,8.8,M
6659,Worms 2,PC,1997.0,Strategy,Microprose,0.00,0.01,0.00,0.00,8.1,K-A
6660,Metal Gear Solid V: Ground Zeroes,PC,2014.0,Action,Konami Digital Entertainment,0.00,0.01,0.00,0.00,7.6,M


In [47]:
df.describe()

Unnamed: 0,NA_Sales,EU_Sales,JP_Sales,Other_Sales,User_Score
count,6662.0,6662.0,6662.0,6662.0,6662.0
mean,0.370982,0.224319,0.059917,0.080369,7.165596
std,0.925752,0.666564,0.275964,0.267998,1.492732
min,0.0,0.0,0.0,0.0,0.0
25%,0.06,0.02,0.0,0.01,6.5
50%,0.14,0.05,0.0,0.02,7.5
75%,0.37,0.2,0.01,0.07,8.2
max,41.36,28.96,6.5,10.57,9.7


### 4.4 Restructure data

#### 4.4.1 Creating a dataframe containing the game name

In [48]:
# save game names on new dataframe
df_game_name = pd.DataFrame({'Game': df['Name']}).reset_index(drop=True)
df_game_name.head()

Unnamed: 0,Game
0,Wii Sports
1,Mario Kart Wii
2,Wii Sports Resort
3,New Super Mario Bros.
4,New Super Mario Bros. Wii


In [49]:
# use name column as index
df.set_index('Name', inplace=True)
df.head()

Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,User_Score,Rating
Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,8.0,E
Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,8.3,E
Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,8.0,E
New Super Mario Bros.,DS,2006.0,Platform,Nintendo,11.28,9.14,6.5,2.88,8.5,E
New Super Mario Bros. Wii,Wii,2009.0,Platform,Nintendo,14.44,6.94,4.7,2.24,8.4,E


#### 4.4.2 Categorical label conversion with one-hot encoding

In [50]:
# select all columns with datatype object
column_object = df.dtypes[df.dtypes == 'object'].keys()
column_object

Index(['Platform', 'Year_of_Release', 'Genre', 'Publisher', 'Rating'], dtype='object')

In [51]:
# convert category data to one-hot encoding
one_hot_label = pd.get_dummies(df[column_object])
one_hot_label.head(3)

Name,Platform_3DS,Platform_DS,Platform_GBA,Platform_GC,Platform_PC,Platform_PS,Platform_PS2,Platform_PS3,Platform_PS4,Platform_PSP,...,Publisher_id Software,Publisher_inXile Entertainment,Rating_AO,Rating_E,Rating_E10+,Rating_EC,Rating_K-A,Rating_M,Rating_RP,Rating_T
Wii Sports,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
Mario Kart Wii,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
Wii Sports Resort,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
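
To make the encoding step concrete, the sketch below applies `pd.get_dummies` to a hypothetical three-row frame. The third game and its values are made up. Passing `dtype=int` is an assumption made here so the result shows 0/1 as in the output above, since newer pandas versions return booleans by default.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Platform": ["Wii", "DS", "Wii"], "Rating": ["E", "E", "T"]},
    index=["Wii Sports", "New Super Mario Bros.", "Hypothetical Game"],
)

# dtype=int keeps the 0/1 representation regardless of pandas version
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded.loc["Wii Sports"].tolist())
```

Each distinct category becomes its own column, which is why the Publisher column alone contributes so many columns in the real data.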


In [52]:
# delete column with data type object
df.drop(column_object,axis=1,inplace=True)
df.head()

Name,NA_Sales,EU_Sales,JP_Sales,Other_Sales,User_Score
Wii Sports,41.36,28.96,3.77,8.45,8.0
Mario Kart Wii,15.68,12.76,3.79,3.29,8.3
Wii Sports Resort,15.61,10.93,3.28,2.95,8.0
New Super Mario Bros.,11.28,9.14,6.5,2.88,8.5
New Super Mario Bros. Wii,14.44,6.94,4.7,2.24,8.4


In [53]:
# unify one-hot encoding data with whole data
df = pd.concat([df,one_hot_label],axis=1)
df.head()

Name,NA_Sales,EU_Sales,JP_Sales,Other_Sales,User_Score,Platform_3DS,Platform_DS,Platform_GBA,Platform_GC,Platform_PC,...,Publisher_id Software,Publisher_inXile Entertainment,Rating_AO,Rating_E,Rating_E10+,Rating_EC,Rating_K-A,Rating_M,Rating_RP,Rating_T
Wii Sports,41.36,28.96,3.77,8.45,8.0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
Mario Kart Wii,15.68,12.76,3.79,3.29,8.3,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
Wii Sports Resort,15.61,10.93,3.28,2.95,8.0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
New Super Mario Bros.,11.28,9.14,6.5,2.88,8.5,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
New Super Mario Bros. Wii,14.44,6.94,4.7,2.24,8.4,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


#### 4.4.3 Numerical column normalization

In [54]:
# select all numeric column
column_numeric = list(df.dtypes[df.dtypes == 'float64'].keys())
column_numeric

['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'User_Score']

In [55]:
# Initialize MinMaxScaler
scaler = MinMaxScaler()

In [56]:
# normalize the numerical columns to the [0, 1] range
scaled = scaler.fit_transform(df[column_numeric])

In [57]:
# write the scaled values back into the dataframe
for i, column in enumerate(column_numeric):
    df[column] = scaled[:, i]
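
With its default range, `MinMaxScaler` maps each value to `(x - min) / (max - min)` per column. The sketch below reproduces that arithmetic by hand for Wii Sports, using the minimum and maximum reported by `df.describe()` after cleaning; `min_max` is a hypothetical helper written only for this check.

```python
def min_max(value, col_min, col_max):
    """Scale `value` into [0, 1] using the column's minimum and maximum."""
    return (value - col_min) / (col_max - col_min)

# Wii Sports, using the min/max reported by df.describe() after cleaning
jp_sales = min_max(3.77, 0.0, 6.5)    # JP_Sales
user_score = min_max(8.0, 0.0, 9.7)   # User_Score

print(round(jp_sales, 6), round(user_score, 6))
```

The results, 0.58 and 0.824742, agree with the scaled values the notebook prints for Wii Sports.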

In [58]:
# check the result of the normalized data
df.head()

Name,NA_Sales,EU_Sales,JP_Sales,Other_Sales,User_Score,Platform_3DS,Platform_DS,Platform_GBA,Platform_GC,Platform_PC,...,Publisher_id Software,Publisher_inXile Entertainment,Rating_AO,Rating_E,Rating_E10+,Rating_EC,Rating_K-A,Rating_M,Rating_RP,Rating_T
Wii Sports,1.0,1.0,0.58,0.799432,0.824742,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
Mario Kart Wii,0.37911,0.440608,0.583077,0.311258,0.85567,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
Wii Sports Resort,0.377418,0.377417,0.504615,0.279092,0.824742,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
New Super Mario Bros.,0.272727,0.315608,1.0,0.272469,0.876289,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
New Super Mario Bros. Wii,0.34913,0.239641,0.723077,0.211921,0.865979,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [59]:
df.describe()

Unnamed: 0,NA_Sales,EU_Sales,JP_Sales,Other_Sales,User_Score,Platform_3DS,Platform_DS,Platform_GBA,Platform_GC,Platform_PC,...,Publisher_id Software,Publisher_inXile Entertainment,Rating_AO,Rating_E,Rating_E10+,Rating_EC,Rating_K-A,Rating_M,Rating_RP,Rating_T
count,6662.0,6662.0,6662.0,6662.0,6662.0,6662.0,6662.0,6662.0,6662.0,6662.0,...,6662.0,6662.0,6662.0,6662.0,6662.0,6662.0,6662.0,6662.0,6662.0,6662.0
mean,0.00897,0.007746,0.009218,0.007604,0.738721,0.023717,0.070249,0.035575,0.050886,0.104473,...,0.00015,0.00015,0.00015,0.311618,0.130742,0.00015,0.0003,0.209397,0.00015,0.347493
std,0.022383,0.023017,0.042456,0.025355,0.15389,0.152176,0.255586,0.185242,0.219781,0.305896,...,0.012252,0.012252,0.012252,0.463189,0.337143,0.012252,0.017325,0.406908,0.012252,0.47621
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.001451,0.000691,0.0,0.000946,0.670103,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.003385,0.001727,0.0,0.001892,0.773196,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.008946,0.006906,0.001538,0.006623,0.845361,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## 5. Create *Content-based Filtering* Recommendation System Model

### 5.1 Using K-Nearest Neighbors

In [60]:
# Initialize the model
model = NearestNeighbors(metric='euclidean')

# Fit model to the data
model.fit(df)

NearestNeighbors(metric='euclidean')

In [61]:
# Create function to get the game recommendation
def GameRecommended(gamename: pd.Series, recommended_games: int = 6):
  # gamename is a one-element Series holding the game title (a row of df_game_name).
  # The nearest neighbor is the game itself, so one fewer game is actually recommended.
  print(f'If user like playing Game: \n{gamename.iloc[0]}\n{recommended_games - 1} Game that the user might like to play:')
  # Find the games closest (smallest Euclidean distance) to the game the user plays
  distances, neighbors = model.kneighbors(df.loc[gamename], n_neighbors=recommended_games)
  # Collect the names of the neighboring games
  similar_game = [name[0] for name in df_game_name.loc[neighbors[0]].values]
  # Turn each distance into a display score (100 - distance); this is a heuristic, not a true percentage
  similar_distance = [f"{round(100 - distance, 2)}%" for distance in distances[0]]
  # Return a dataframe of recommendations, skipping the first entry (the query game itself)
  return pd.DataFrame(data = {"Game" : similar_game[1:], "Similarity" : similar_distance[1:]})

In [62]:
# Give the recommendation to the selected game
GameRecommended(df_game_name.loc[111])

If user like playing Game: 
Final Fantasy IX
5 Game that the user might like to play:


Unnamed: 0,Game,Similarity
0,Final Fantasy VIII,98.58%
1,Final Fantasy Tactics,98.57%
2,Xenogears,98.55%
3,Tales of Destiny II,98.55%
4,Chrono Cross,98.55%
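
To show what the neighbor search does under the Euclidean metric, here is a brute-force sketch in NumPy on a hypothetical four-game feature matrix. The names, the values, and the `nearest` helper are illustrative; this is not how scikit-learn implements the search internally.

```python
import numpy as np

# Hypothetical feature matrix: one row per game
names = ["A", "B", "C", "D"]
features = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [1.0, 1.0],
    [0.0, 0.3],
])

def nearest(query_idx, k):
    """Return (name, distance) pairs for the k rows closest to row `query_idx`."""
    dists = np.linalg.norm(features - features[query_idx], axis=1)
    order = np.argsort(dists, kind="stable")
    order = order[order != query_idx][:k]  # skip the query game itself
    return [(names[i], float(dists[i])) for i in order]

print(nearest(0, 2))
```

Smaller distances mean more similar games. Because Euclidean distance is not bounded by 100, a score of the form `100 - distance` is best read as a ranking aid rather than a percentage.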


### 5.2 Using Cosine Similarity

In [63]:
# Calculate the cosine similarity of the dataframe
cosine_sim = cosine_similarity(df)

# Keep the result of the calculation dataframe
cosine_sim_df = pd.DataFrame(cosine_sim, index=df_game_name['Game'], columns=df_game_name['Game'])
cosine_sim_df.head(3)

Game,Wii Sports,Mario Kart Wii,Wii Sports Resort,New Super Mario Bros.,New Super Mario Bros. Wii,Mario Kart DS,Wii Fit,Wii Fit Plus,Grand Theft Auto V,Grand Theft Auto: San Andreas,...,Trine,Karnaaj Rally,Hospital Tycoon,Ben 10 Omniverse 2,Bookworm Deluxe,E.T. The Extra-Terrestrial,Mortal Kombat: Deadly Alliance,Worms 2,Metal Gear Solid V: Ground Zeroes,Breach
Wii Sports,1.0,0.681225,0.806006,0.65558,0.652319,0.504198,0.774904,0.769941,0.217456,0.238192,...,0.099181,0.251738,0.048428,0.064216,0.382273,0.181952,0.105425,0.098121,0.092749,0.072453
Mario Kart Wii,0.681225,1.0,0.686753,0.541707,0.686951,0.682935,0.677577,0.66904,0.180855,0.184008,...,0.118648,0.455775,0.057913,0.076806,0.280582,0.211163,0.126127,0.117379,0.11095,0.086671
Wii Sports Resort,0.806006,0.686753,1.0,0.530365,0.838024,0.520196,0.835151,0.995444,0.172429,0.176879,...,0.282934,0.295104,0.056737,0.075247,0.280941,0.213287,0.123572,0.114998,0.1087,0.084915


In [64]:
# Create function to get the game recommendation
def CosineGameRecommended(gamename: pd.Series, recommended_games: int = 5):
  # gamename is a one-element Series holding the game title (a row of df_game_name)
  print(f'If user like playing Game: \n{gamename.iloc[0]}\n{recommended_games} Game that the user might like to play:')
  # Take the row of cosine similarities for the game the user likes.
  # np.unique returns the distinct scores (arr) sorted from small to large together with
  # the position of their first occurrence (ind), so games with tied scores are merged into one entry
  arr, ind = np.unique(cosine_sim_df.loc[gamename.iloc[0]], return_index=True)
  # Collect the game names from the nth-last to the second-last position; the last one is normally the game itself
  similar_game = [df_game_name.loc[index].iloc[0] for index in ind[-(recommended_games+1):-1]]
  # Collect the matching cosine scores for the same positions
  cosine_score = list(arr[-(recommended_games+1):-1])
  # Return a dataframe with the most recommended games, highest similarity first
  return pd.DataFrame(data = {"Game" : similar_game, "Cosine Similarity" : cosine_score}).sort_values(by='Cosine Similarity',ascending=False)

In [65]:
# provides recommendations with cosine similarity on selected games
CosineGameRecommended(df_game_name.loc[111])

If user like playing Game: 
Final Fantasy IX
5 Game that the user might like to play:


Unnamed: 0,Game,Cosine Similarity
4,Final Fantasy VIII,0.833562
3,Final Fantasy Tactics,0.825829
2,Xenogears,0.823134
1,Tales of Destiny II,0.822043
0,Chrono Cross,0.820439
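
Because `np.unique` merges equal scores, two games with exactly the same similarity would be reported as one. A possible alternative is to rank with `np.argsort`, which keeps ties. The sketch below uses hypothetical scores, and `top_k` is an illustrative helper rather than part of the notebook.

```python
import numpy as np

names = ["Query", "A", "B", "C", "D"]
# Hypothetical similarity scores of "Query" against every game (including itself)
scores = np.array([1.0, 0.8, 0.8, 0.5, 0.9])

def top_k(scores, query_idx, k):
    """Indices of the k highest scores, excluding the query row, ties preserved."""
    order = np.argsort(-scores, kind="stable")
    order = order[order != query_idx]
    return order[:k].tolist()

print([names[i] for i in top_k(scores, 0, 3)])
```

Here both games scoring 0.8 appear in the result, and the query is excluded by position rather than by assuming it holds the single highest score.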


## 6. Recommendation System Model Evaluation

### 6.1 Calinski Harabasz Score

In [66]:
calinski_harabasz_score(df, df_game_name).round(2)

  y = column_or_1d(y, warn=True)


5.09

### 6.2 Davies Bouldin Score

In [67]:
davies_bouldin_score(df, df_game_name).round(2)

  y = column_or_1d(y, warn=True)


2.93

## Closing
A game recommendation model using *content-based filtering* has been completed. In testing, the model works reasonably well at providing a top 5 list of games that a user might like to play; for Final Fantasy IX, the KNN and cosine similarity approaches recommend the same five games. However, the model still has shortcomings, as reflected in the Calinski-Harabasz (5.09) and Davies-Bouldin (2.93) scores. To improve on it, other recommendation approaches such as deep learning or *collaborative filtering* could be built and their performance compared with the current KNN model.

### References
- Scikit-learn Documentation: [https://scikit-learn.org/stable/modules/classes.html](https://scikit-learn.org/stable/modules/classes.html)
- Report References: [Contoh Algoritma Sistem Rekomendasi dengan Dokumentasi](https://github.com/fahmij8/ML-Exercise/blob/main/MLT-2/MLT_Proyek_Submission_2.ipynb)
- Dataset: [Game Sales with Rating Dataset](https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings)