# CDS503 - Machine Learning Final Project

Semester 2, Academic Session 2019/2020

Group 6 - **Data Masters**

Members:
- Lee Yong Meng
- Lee Kar Choon
- Tan Wei Chean
- Yee Hoong Yip

## Overview

- [Data Preparation](#Data-Preparation)
- Experiment Set 1: Machine Learning Algorithm
- Experiment Set 2: Feature Selection
- Experiment Set 3: Ensemble Learning
- Experiment Set 4: Training Sample Size

# Data Preparation

Before working on the experiment sets, we import the libraries needed for the data pre-processing stage.

In [17]:
# Import necessary libraries
import pandas as pd                  # Use pandas.DataFrame to manipulate data
import matplotlib.pyplot as plt      # Standard plotting library
import numpy as np                   # Standard Python library for numerical operations

# Import sklearn modules
from sklearn import preprocessing    # Data preprocessing

Next, we read in the data.

In [8]:
# Read in data
df = pd.read_csv('AppleStore.csv')

# Quick view on the data
df.head()

,Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
1,2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
2,3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
3,4,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0.0,262241,649,4.0,4.5,5.10.0,12+,Shopping,37,5,9,1
4,5,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1


Several columns in the dataset are not helpful for our work, so we will remove them:

- `id`: App ID
- `track_name`: App name
- `Unnamed: 0`: the first column, which is only a running count of the records.

We use the method `.drop()` to remove the specified columns.

In [10]:
# Define the named columns to drop
columns_drop = ['id', 'track_name']

# Drop the named columns
df.drop(columns_drop, axis = 1, inplace = True)

# Drop every column whose name contains "unnamed" (case-insensitive)
df.drop(df.columns[df.columns.str.contains('unnamed', case = False)], axis = 1, inplace = True)
df.head()

,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
1,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
2,100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
3,128512000,USD,0.0,262241,649,4.0,4.5,5.10.0,12+,Shopping,37,5,9,1
4,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1
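
An unnamed first column like the one dropped above typically appears when a data frame's index was written into the CSV file. The following is a minimal sketch, assuming that is how the column got into `AppleStore.csv`, of absorbing it as the index at load time with `index_col = 0`. The inline CSV text and the names `csv_text`, `raw`, `indexed` and `sample` are illustrative stand-ins, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for AppleStore.csv; the empty first header
# cell is what pandas would otherwise name "Unnamed: 0".
csv_text = (
    ",id,track_name,price\n"
    "1,281656475,PAC-MAN Premium,3.99\n"
    "2,281796108,Evernote - stay organized,0.0\n"
)

# Default read: the first column shows up as "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index, so there is no unnamed column to drop
indexed = pd.read_csv(io.StringIO(csv_text), index_col = 0)

# Non-in-place drop of the identifier columns; `indexed` itself is unchanged
sample = indexed.drop(columns = ['id', 'track_name'])

print(list(raw.columns))
print(list(sample.columns))
```

Returning a new data frame instead of using `inplace = True` makes the cell safe to re-run, because the original frame still contains the columns being dropped.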


For our business problem, we would like to group the column `user_rating` (i.e. our target column) into three groups, namely "Low", "Medium" and "High". We use the function `pd.cut()` to perform the binning; with `bins = 3`, it splits the observed range of `user_rating` into three equal-width intervals. The result is stored in a new column `user_rating_label`, which appears as the last column of the data frame preview.

In [13]:
df['user_rating_label'] = pd.cut(df['user_rating'], bins = 3, labels = ['Low', 'Medium', 'High'])
df.head()

,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,user_rating_label
0,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1,High
1,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1,High
2,100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1,High
3,128512000,USD,0.0,262241,649,4.0,4.5,5.10.0,12+,Shopping,37,5,9,1,High
4,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1,High
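
To make the equal-width binning concrete, the sketch below asks `pd.cut()` to return the bin edges it computed by passing `retbins = True`. The ratings used here are hypothetical values between 0 and 5, not the real data, and the names `ratings`, `groups` and `edges` are illustrative.

```python
import pandas as pd

# Hypothetical ratings between 0 and 5 (not taken from AppleStore.csv)
ratings = pd.Series([0.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5, 5.0])

# retbins = True additionally returns the bin edges computed by pd.cut
groups, edges = pd.cut(
    ratings,
    bins = 3,
    labels = ['Low', 'Medium', 'High'],
    retbins = True,
)

print(edges)             # four edges delimit the three equal-width intervals
print(groups.tolist())   # label assigned to each rating
```

Under these assumed ratings, 3.5 falls into "High", which agrees with the preview above. If different cut points are wanted, `bins` also accepts an explicit list of edges.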


Next, we will drop two other columns:
- `user_rating`: not needed because we will be using the new column `user_rating_label` as the target of classification.
- `currency`: not helpful because it only has one unique value "USD" for all examples.

In [14]:
# Define columns to drop
columns_drop = ['user_rating', 'currency']

# Drop columns
df.drop(columns_drop, axis = 1, inplace = True)
df.head()

,size_bytes,price,rating_count_tot,rating_count_ver,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,user_rating_label
0,100788224,3.99,21292,26,4.5,6.3.5,4+,Games,38,5,10,1,High
1,158578688,0.0,161065,26,3.5,8.2.2,4+,Productivity,37,5,23,1,High
2,100524032,0.0,188583,2822,4.5,5.0.0,4+,Weather,37,5,3,1,High
3,128512000,0.0,262241,649,4.5,5.10.0,12+,Shopping,37,5,9,1,High
4,92774400,0.0,985920,5320,5.0,7.5.1,4+,Reference,37,5,45,1,High
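
The claim that a column such as `currency` holds a single value can be checked with `DataFrame.nunique()`. This is a minimal sketch on a hypothetical three-row miniature of the data; the names `sample`, `unique_counts` and `constant_columns` are illustrative.

```python
import pandas as pd

# Hypothetical miniature of the data frame
sample = pd.DataFrame({
    'currency': ['USD', 'USD', 'USD'],
    'price': [3.99, 0.0, 0.0],
    'cont_rating': ['4+', '4+', '12+'],
})

# Number of distinct values in each column
unique_counts = sample.nunique()

# Columns with exactly one distinct value cannot help separate the classes
constant_columns = unique_counts[unique_counts == 1].index.tolist()

print(unique_counts)
print(constant_columns)
```

Running the same two lines on the full data frame before dropping columns would confirm whether `currency` is the only constant column.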


We further inspect the data by calling the method `.info()`.

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7197 entries, 0 to 7196
Data columns (total 13 columns):
size_bytes           7197 non-null int64
price                7197 non-null float64
rating_count_tot     7197 non-null int64
rating_count_ver     7197 non-null int64
user_rating_ver      7197 non-null float64
ver                  7197 non-null object
cont_rating          7197 non-null object
prime_genre          7197 non-null object
sup_devices.num      7197 non-null int64
ipadSc_urls.num      7197 non-null int64
lang.num             7197 non-null int64
vpp_lic              7197 non-null int64
user_rating_label    7197 non-null category
dtypes: category(1), float64(2), int64(7), object(3)
memory usage: 682.0+ KB


We see that there are no missing values in our data: every column reports 7197 non-null entries, matching the 7197 rows.
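
Besides reading the non-null counts from `.info()`, missing values can be counted directly. The sketch below uses a hypothetical frame with one deliberately missing value so that the check has something to find; the names `sample`, `missing_per_column` and `has_no_missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one deliberately missing price
sample = pd.DataFrame({
    'price': [3.99, 0.0, np.nan],
    'prime_genre': ['Games', 'Productivity', 'Weather'],
})

# Count the missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# True only when no column contains a missing value
has_no_missing = not sample.isna().any().any()
print(has_no_missing)
```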

Some of the columns contain string values, which might not be compatible with certain machine learning algorithms that will be implemented in the subsequent sections. Therefore, we need to transform these values into labels encoded by numeric values (i.e., 0, 1, 2, ...).

We use `sklearn.preprocessing.LabelEncoder()` to transform the following columns into numeric labels:
- `prime_genre`: contains 22 unique string values.
- `user_rating_label`: contains 3 unique string values ("Low", "Medium", "High").

In [16]:
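# Uses the `preprocessing` module imported at the top of the notebook
le = preprocessing.LabelEncoder()

# Encode each genre as an integer code, then cast the codes to strings
# (the column dtype stays object)
df['prime_genre'] = le.fit_transform(df['prime_genre'])
df['prime_genre'] = df['prime_genre'].astype(str)

# Refit the same encoder on the target; this overwrites the genre mapping held in `le`
df['user_rating_label'] = le.fit_transform(df['user_rating_label'])
df['user_rating_label'] = df['user_rating_label'].astype(str)

df.info()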

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7197 entries, 0 to 7196
Data columns (total 13 columns):
size_bytes           7197 non-null int64
price                7197 non-null float64
rating_count_tot     7197 non-null int64
rating_count_ver     7197 non-null int64
user_rating_ver      7197 non-null float64
ver                  7197 non-null object
cont_rating          7197 non-null object
prime_genre          7197 non-null object
sup_devices.num      7197 non-null int64
ipadSc_urls.num      7197 non-null int64
lang.num             7197 non-null int64
vpp_lic              7197 non-null int64
user_rating_label    7197 non-null object
dtypes: float64(2), int64(7), object(4)
memory usage: 731.1+ KB


In [8]:
df.head()

,size_bytes,price,rating_count_tot,rating_count_ver,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,user_rating_label
0,100788224,3.99,21292,26,4.5,6.3.5,4+,20,38,5,10,1,0
1,158578688,0.0,161065,26,3.5,8.2.2,4+,7,37,5,23,1,0
2,100524032,0.0,188583,2822,4.5,5.0.0,4+,15,37,5,3,1,0
3,128512000,0.0,262241,649,4.5,5.10.0,12+,9,37,5,9,1,0
4,92774400,0.0,985920,5320,5.0,7.5.1,4+,8,37,5,45,1,0
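
In the preview above, "High" is encoded as 0. This is because `LabelEncoder` sorts the class names before assigning codes, so the codes follow alphabetical order rather than the Low < Medium < High order. The sketch below illustrates this on a hypothetical label column and shows an explicit mapping as one possible alternative when the order matters; `rating_encoder` and `ordinal_map` are illustrative names, and the mapping is an assumption, not the approach used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical label column using the three rating groups
labels = pd.Series(['Low', 'Medium', 'High', 'High'])

# LabelEncoder sorts the classes, so the codes follow alphabetical order
rating_encoder = LabelEncoder()
codes = rating_encoder.fit_transform(labels)
print(list(rating_encoder.classes_))   # position in this list is the code
print(codes.tolist())

# A dedicated encoder per column keeps the mapping available for decoding
decoded = rating_encoder.inverse_transform(codes)
print(decoded.tolist())

# Assumed alternative: an explicit mapping that preserves Low < Medium < High
ordinal_map = {'Low': 0, 'Medium': 1, 'High': 2}
ordinal_codes = labels.map(ordinal_map)
print(ordinal_codes.tolist())
```

For a classifier that treats the target as unordered classes, either encoding works; the explicit mapping mainly makes the codes easier to read back.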
