# Using Machine Learning Methods to Predict the Number of Installations of Games on the Google Play Store

# Introduction

Hello Kaggle community, or probably just Mr. Nolan in this case,

In this workbook we will use two machine learning models to make predictions on our android-games data set, which contains data such as the number of times a game has been installed from the store, the ratings it has received, its growth over a certain period of time, the number of 5 star ratings it has received, and so on.


***Our Variables***

Machine learning methods are used by data scientists to predict outcomes from information that is already known.
In data science, the values used to make predictions are referred to as the "X" values, also known as independent variables or features, while the value to be predicted is the "y" value, or dependent variable.

In our case our "X" values are:
* Average rating of the game, out of 5 stars
* Price of the game in Australian dollars
* Number of 5 star ratings
* Growth over 60 days

And our "y" value is:
* Number of installations of the game (millions)
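
As a rough illustration of how a table is separated into features and a target, here is a minimal sketch. The column names follow the ones used later in this workbook, but the `toy_games` DataFrame and every value in it are invented for the example.

```python
import pandas as pd

# Toy data: invented values, NOT taken from the real android-games data set
toy_games = pd.DataFrame({
    "average rating": [4, 3, 5],
    "price": [0, 2, 0],
    "5 star ratings": [1200, 300, 5000],
    "growth (60 days)": [1, 0, 3],
    "installs": [10, 1, 50],
})

# "y" is the single column we want to predict
toy_y = toy_games["installs"]

# "X" is every remaining column (our four features)
toy_X = toy_games.drop("installs", axis=1)

print(toy_X.shape, toy_y.shape)  # prints (3, 4) (3,)
```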


**Issues Within The Data Set**

Out of our four features, I expect the price of the game to have the biggest impact on our "y" value, because a more costly game is generally less appealing to Android users.
This price factor may produce small outliers in our installation values: paid games are likely to have slightly fewer downloads, so predictions for them may be off by fractions of a million, which in turn raises our mean absolute error (MAE) by a small amount.
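
Since the mean absolute error (MAE) is the yardstick used throughout this workbook, a small worked example may help. The install numbers below are invented purely to show the arithmetic.

```python
from sklearn.metrics import mean_absolute_error

# Invented values (in millions), only to show how the MAE is calculated
actual_installs = [10, 50, 100, 5]
predicted_installs = [12, 45, 100, 9]

# Absolute errors are 2, 5, 0 and 4, so the mean is 11 / 4 = 2.75
toy_mae = mean_absolute_error(actual_installs, predicted_installs)
print(toy_mae)  # prints 2.75
```

A single badly predicted game adds its whole error to the total before the division, which is why a few outliers can nudge the MAE upwards.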

**Discussion Of Machine Learning Model Usage**

In order to effectively compare and contrast different machine learning methods, we will implement multivariate linear regression as well as a random forest.

**Multivariant Linear Regression**

Linear regression is a machine learning method that fits a line of best fit in order to approximate the relationship between our "X" and "y" variables.
The term "multivariate" here means that multiple features are used to predict the "y" value, which can give a more accurate prediction than a single feature would.
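
To make the idea of a line of best fit over several features more concrete, here is a minimal sketch on invented data. The target is built from a known formula (2 × first feature + 3 × second feature + 1) so that we can check the fitted model recovers it; real data will never be this tidy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two invented features per row
demo_features = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 5]], dtype=float)

# Target built from a known rule so the result can be checked
demo_target = 2 * demo_features[:, 0] + 3 * demo_features[:, 1] + 1

demo_regressor = LinearRegression()
demo_regressor.fit(demo_features, demo_target)

# The fitted coefficients should be close to 2 and 3, the intercept close to 1
print(demo_regressor.coef_, demo_regressor.intercept_)

# Predicting an unseen row: 2*6 + 3*4 + 1 = 25
demo_prediction = demo_regressor.predict(np.array([[6.0, 4.0]]))
print(demo_prediction)
```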

**Random Forest Regressors**

A random forest is a machine learning method made up of multiple decision tree regressors, hence the term "forest".
A decision tree regressor is a machine learning method in which the data set is broken down into smaller and smaller subsets until a prediction can be made for each subset. The term "regressor" implies that the tree predicts numerical values, and not categories such as "hot" or "cold".
In a random forest, multiple decision trees are constructed from bootstrapped data sets, each built by randomly sampling rows from the original data set with replacement, so the same row can appear more than once.
Each individual bootstrapped set is then used to build a regression tree, and after a large number of trees have been constructed, their predictions are averaged, leaving us with a final result.
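
The way a forest combines its trees can be checked directly in scikit-learn. The sketch below uses invented random data and a deliberately small forest of 10 trees. It compares the average of the individual trees' predictions with the forest's own prediction for one made-up row.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Invented data: 40 rows, 2 features, with a non-linear target
rng = np.random.default_rng(0)
forest_demo_X = rng.uniform(0, 10, size=(40, 2))
forest_demo_y = forest_demo_X[:, 0] * 2 + forest_demo_X[:, 1] ** 2

# A small forest; each tree is trained on its own bootstrapped sample (the default)
toy_forest = RandomForestRegressor(n_estimators=10, random_state=0)
toy_forest.fit(forest_demo_X, forest_demo_y)

# One made-up row to predict
new_game = np.array([[3.0, 4.0]])

# Ask every tree for its own prediction, then average them by hand
per_tree = np.array([tree.predict(new_game)[0] for tree in toy_forest.estimators_])
print(per_tree.mean(), toy_forest.predict(new_game)[0])
```

The two printed numbers should agree, because a regression forest reports the mean of its trees' predictions.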

**Least MAE Method?**
 
I believe that multivariate linear regression will be the most effective machine learning method (the one with the lowest MAE), as what we are attempting to predict is a single numerical value, from features that are also all numerical, discrete data.






In [None]:
# Importing necessary libraries and accessing files

import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # data visualisation
from sklearn import linear_model # importing our linear regression
from sklearn.model_selection import train_test_split # our tool to split the data in order to perform a validation
from sklearn.metrics import mean_absolute_error # MAE calculations
from sklearn.ensemble import RandomForestRegressor # importing our random forest

# loading from kaggle
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Exploring Our Data

**Why Was This Data Chosen**

The main reasoning behind the choice of this particular data set is that it contains only numerical values, apart from the name of each game and its category, meaning it will be easy to work with efficiently and to make predictions from.

**Recap of Variables**

Our Features:

* The average Android user rating of the game, out of 5 stars
* The price of the game in Australian dollars
* The number of 5 star ratings that the game possesses
* The % growth of the game over a 60 day period

What we Wish to Predict:
* The number of installations that a game possesses, measured in millions

In [None]:
# Getting CSV file for data
train_file_path = '../input/ggldataset/android-games.csv'

# Create a new Pandas DataFrame with our training data
google_train_data = pd.read_csv(train_file_path)

# Taking a squiz at our data prior to cleaning
google_train_data.describe(include='all')

# Cleaning Our Data

**Why Do We Prepare Our Data?**

In machine learning it is important to have a clean and functional data set in order to implement machine learning methods effectively.

**Our Preparations**

With this data I have chosen to drop every row that contains a missing value instead of filling the gap with an average, as the cause of a missing value may be related to outliers in that row, in which case an average would misrepresent it.
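
To show the difference between the two options, here is a minimal sketch on an invented four-row table. The `cleaning_demo` DataFrame and its values are made up for the example and are not part of the real data set.

```python
import numpy as np
import pandas as pd

# Invented table with one missing price and one missing install count
cleaning_demo = pd.DataFrame({
    "price": [0.0, 2.0, np.nan, 4.0],
    "installs": [10.0, 1.0, 50.0, np.nan],
})

# Option 1 (the approach chosen here): drop every row with a missing value
dropped = cleaning_demo.dropna(axis=0)

# Option 2: fill each gap with the mean of its column
imputed = cleaning_demo.fillna(cleaning_demo.mean())

print(len(dropped), len(imputed))  # prints 2 4
```

Dropping loses two of the four rows, while filling keeps all four but invents a price of 2.0 for a game whose real price we do not know.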

In [None]:
# Selecting the independent variables ("X" values) we will use to predict our dependent variable ("y" value)
# Note: the "y" value is kept in the data set in order to be separated out later; this is done for easier data preparation
selected_columns = ['average rating', 'price', 'growth (60 days)', '5 star ratings', 'installs']

# Create our new training set containing only the columns we want
prepared_data = google_train_data[selected_columns]

# Drop rows (axis=0) that contain missing values
prepared_data = prepared_data.dropna(axis=0)

# Converting all values to integers, in order to avoid issues with the data set later
# Note: this truncates any decimal part (for example, a rating of 4.5 becomes 4)
prepared_data = prepared_data.astype(int)

# Taking a look at our cleaned data
prepared_data.describe()

In [None]:
# Header of cleaned data
prepared_data.head()

# Splitting and Fitting Our Data

**Multivariate Linear Regression vs Random Forest Regressor**

As described in the introduction, multivariate linear regression fits a line of best fit through several features at once in order to predict the "y" value.
A random forest regressor instead builds many decision trees, each from a bootstrapped copy of the data set, and averages their predictions to produce a final result.

**Why Do We Split Our Data Set?**

In data science we use a train-test split, meaning we split our whole data set into two sets: one for training our model (the training set), and another to compare our predictions against (the testing, or validation, set).
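
As a small illustration of what the split does, the sketch below runs `train_test_split` on eight invented rows. With no `test_size` given, scikit-learn holds back 25% of the rows for validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Eight invented rows with two features each, plus a matching target
split_demo_X = np.arange(16).reshape(8, 2)
split_demo_y = np.arange(8)

# random_state fixes the shuffle so the split is the same on every run
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    split_demo_X, split_demo_y, random_state=0
)

print(len(demo_train_X), len(demo_val_X))  # prints 6 2
```

Every row ends up in exactly one of the two sets, so the model is never scored on a row it was trained on.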

In [None]:
# Separate out the prediction target
y = prepared_data.installs

# Dropping the "y" value from the original dataframe and keeping the rest of the data as our "X" values
X = prepared_data.drop('installs', axis=1)

# Splitting data into training data and validation data, for the "X" and the "y"
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Defining forest
forest_model = RandomForestRegressor(random_state=1)

# Fitting forest
forest_model.fit(train_X, train_y)

# Getting predictions via random forest
forest_prediction = forest_model.predict(val_X)

# Looking at our MAE for our random forest
fmae = mean_absolute_error(val_y, forest_prediction)

# -------------------------------------------------------------------------------------------------------------- #

# Defining regressor
installs_predictor = linear_model.LinearRegression()

# Fitting regressor
installs_predictor.fit(train_X, train_y)

# Get predicted installations via linear regression
linear_predictions = installs_predictor.predict(val_X)

# Looking at our MAE for our linear regression
lmae = mean_absolute_error(val_y, linear_predictions)

# Printing Our Predictions
print(f"""                        linear regression predictions 
      {linear_predictions}
                  Random forest predictions
      {forest_prediction}
      """)

# Printing our MAE
print(f"""       linear regression MAE
       {lmae}
       random forest regressor MAE
       {fmae}""")




# Performance of Our Models

**What Have we Observed?**

After running both our models in the same cell, on the same training and validation data, we can see that the MAE of the random forest model is lower than that of the linear regression model.
This difference in accuracy could be because the forest averages many trees built from bootstrapped data sets, which lets it capture non-linear relationships, while the linear regression fits a single straight-line relationship to our one set of training data.

# Tuning Our Hyper Parameters

**What are Hyperparameters?**

In data science, the hyperparameters of a model are the settings that we choose before training, rather than values the model learns from the data; adjusting them can lead to a better MAE.
The hyperparameters of our models are as follows:

**Linear Regressor Tuning**

A plain linear regression model has very few hyperparameters of its own, so when tuning it we can instead consider adding or removing some of the features that we use in order to make predictions. For example, the 30 day growth could be added as a 5th variable, or used as a replacement for the 60 day growth.
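
One way to compare feature choices is to fit the same model on each candidate set of columns and compare the validation MAE. The sketch below does this on invented data. The column name `growth (30 days)`, the helper `validation_mae`, and the made-up rule linking installs to price and 30 day growth are all assumptions for the example, so the result says nothing about the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Invented data in which installs depend on price and 30 day growth only
rng = np.random.default_rng(0)
n_rows = 200
feature_demo = pd.DataFrame({
    "price": rng.uniform(0, 10, n_rows),
    "growth (30 days)": rng.uniform(0, 5, n_rows),
    "growth (60 days)": rng.uniform(0, 5, n_rows),
})
feature_demo["installs"] = (
    50 - 3 * feature_demo["price"]
    + 4 * feature_demo["growth (30 days)"]
    + rng.normal(0, 1, n_rows)
)

def validation_mae(feature_names):
    """Fit a linear regression on the given columns and return its validation MAE."""
    X = feature_demo[feature_names]
    y = feature_demo["installs"]
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

mae_with_60 = validation_mae(["price", "growth (60 days)"])
mae_with_30 = validation_mae(["price", "growth (30 days)"])
print(mae_with_60, mae_with_30)
```

In this invented setup the 30 day growth is the column that actually drives installs, so swapping it in gives the lower MAE; on the real data the comparison would need to be run to find out.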

**Random Forest Regressor Tuning**

With a random forest there are also hyperparameters available for us to tune.
For example, we can look at the maximum number of leaf nodes in each tree (`max_leaf_nodes` in scikit-learn), and choose the value that gives the lowest validation MAE.
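
A simple way to tune the leaf nodes is to try a few values and keep the one with the lowest validation MAE. The sketch below does this on invented data; the candidate values 5, 50 and 500 are arbitrary choices for the example, not recommendations for the real data set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Invented data: 200 rows, 4 features, with a noisy non-linear target
rng = np.random.default_rng(0)
tuning_X = rng.uniform(0, 10, size=(200, 4))
tuning_y = tuning_X[:, 0] * 3 + tuning_X[:, 1] ** 2 + rng.normal(0, 1, size=200)

tune_train_X, tune_val_X, tune_train_y, tune_val_y = train_test_split(
    tuning_X, tuning_y, random_state=0
)

# Fit one forest per candidate value and record its validation MAE
mae_by_leaves = {}
for max_leaf_nodes in [5, 50, 500]:
    candidate = RandomForestRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    candidate.fit(tune_train_X, tune_train_y)
    candidate_predictions = candidate.predict(tune_val_X)
    mae_by_leaves[max_leaf_nodes] = mean_absolute_error(tune_val_y, candidate_predictions)

# Keep the setting with the lowest MAE
best_leaves = min(mae_by_leaves, key=mae_by_leaves.get)
print(mae_by_leaves, best_leaves)
```

Too few leaf nodes tends to underfit and too many can overfit, so the best value usually sits somewhere in between and depends on the data.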
    

# Conclusion

**Summary**

In this investigation we have looked at the effectiveness of different machine learning methods at predicting the number of installations of games on the Google Play Store.
We have implemented both multivariate linear regression and random forest regression to make predictions from our data.

**What Did We Find**

After performing both methods of machine learning, I have found that, disproving my hypothesis, the random forest is the more effective method for prediction with our data set.
I also expected the price of a game to affect the installation count the most, which would pull the overall spread of the data toward the lower end of the range; confirming this would require checking feature importances, which this workbook did not do.




