# CMSC471 Artificial Intelligence

# Project Final Submission

## Project Title

*Matt Manzi, CL69490*

## Problem Description

This application of machine learning will attempt to classify the category of a mobile application using various app store data.  This is a multi-class classification problem since the goal is to correctly identify the category of an app given a set of category options.

Every mobile app shares a subset of features like a name, file size, number of downloads, etc.  The intention is to draw (or disprove) correlation between these shared features and the category of an app.

For example, if a majority of utility apps (compared to those of other categories) with over 100,000 downloads have a file size less than 10 MB, that may indicate that it is more important to consumers that utility apps be small in file size.  On the other hand, if no correlation emerges between the features in this dataset, that may indicate that users do not concern themselves with these aspects and neither should the developer.

## Motivation

I develop my own mobile applications and I'm often very curious about what makes certain apps go viral.  I've heard plenty of anecdotal advice, such as that I should "develop apps that the developer would themselves want to use."  To help guide some of my decisions, I reason that if app categories can be classified from these features, then there is a consistent expectation for apps in a given category.  Thus, based on the categories of the apps that I want to develop, I will be able to make more informed decisions.

## Dataset

- Link to dataset source: [https://www.kaggle.com/lava18/google-play-store-apps/data](https://www.kaggle.com/lava18/google-play-store-apps/data)

- Target Feature: `Category`
  - _The category that the app is sold under_

- Features include
  - `Rating`: _average rating out of 5.0_
  - `Reviews`: _number of reviews for the app_
  - `Installs`: _number of times the app has been installed to a device_
  - `Price`: _the initial purchase price of the app_
  - `NameLen`: _length of the app's name (computed)_
  - `Bytes`: _number of bytes of the app package (computed)_



In [1]:
import pandas as pd

# if running from personal Colab account, get the CSV from Google Drive
data_file = "googleplaystore.csv"
try:
    from google.colab import drive
    drive.mount("/content/drive/")
    data_file = "/content/drive/My Drive/Year 4 - 2019-20/Semester 7 - Fall 2019/CMSC 471 - 01/Project/datasets/google-play-store-apps/" + data_file
except ImportError:
    # not running on Colab; fall back to the local CSV
    pass

app_data = pd.read_csv(data_file)
print(app_data.shape)

(10841, 13)


In [2]:
app_data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


## Data Preprocessing

- Preprocessing Steps:
  - Compute the `NameLen` column
  - Compute the `Bytes` column
  - Drop the `App`, `Type`, `Size`, `Content Rating`, `Genres`, `Last Updated`, `Current Ver`, and `Android Ver` columns
  - Convert the `Rating` and `Price` columns to floats
  - Convert the `Reviews` column to integers
  - Numerically encode the `Installs` column values
  - Numerically encode the `Category` column values

- Data Splitting:
  - Train/test split: 80% train, 20% test
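
The string-to-number conversions in the steps above can be sketched without pandas. This is a minimal illustration rather than the notebook's actual code: the helper names `size_to_bytes`, `installs_to_int`, and `price_to_float` are hypothetical, and `"$4.99"` is a made-up sample value (the other samples follow the format of the preview rows).

```python
# Hypothetical helpers mirroring the preprocessing steps; names are illustrative only.
UNIT_BYTES = {"M": 1024**2, "k": 1024}

def size_to_bytes(size):
    """Convert a size string such as '19M' or '512k' to bytes; None if the unit is unknown."""
    unit = size[-1:]
    if unit not in UNIT_BYTES:
        return None
    return float(size[:-1]) * UNIT_BYTES[unit]

def installs_to_int(installs):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(installs.strip("+").replace(",", ""))

def price_to_float(price):
    """Convert a price such as '0' or '$4.99' to a float."""
    return float(price.strip("$"))

print(size_to_bytes("19M"))
print(installs_to_int("5,000,000+"))
print(price_to_float("$4.99"))  # '$4.99' is a made-up sample value
```

Returning `None` for an unrecognized unit is one possible design; the notebook instead filters such rows out before converting.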

In [9]:
import sklearn
from sklearn.model_selection import train_test_split
from keras.utils import np_utils

import tensorflow as tf

sklearn.__version__

'0.21.3'

In [4]:
# preview dataset
# print("Original:")
# print(app_data.dtypes)
# print(app_data.isna().sum())
# print()

# compute NameLen
app_data["NameLen"] = app_data.App.str.len()

# compute Bytes
multiplier = {"M": 1024**2, "k": 1024}
def size_convert(value, sclass, multiplier):
    return float(value) * multiplier[sclass]
# keep only rows whose Size ends in a known unit; copy so the column
# assignment below does not write to a view of the original frame
app_data = app_data[app_data.Size.str.endswith(tuple(multiplier.keys()))].copy()
# print("Unique Size classes:", app_data.Size.str[-1].unique())
app_data["Bytes"] = app_data.apply(lambda row: size_convert(row.Size[:-1], row.Size[-1], multiplier), axis=1)

# drop unneeded columns
drop = ["App", "Size", "Type", "Content Rating", "Genres", "Last Updated",
        "Current Ver", "Android Ver"]
app_data.drop(columns=drop, inplace=True)

# convert Rating values to floats
app_data.Rating = app_data.Rating.astype(float)
# keep valid ratings only; rows with a missing Rating are dropped too, since NaN <= 5.0 is False
app_data = app_data[app_data.Rating <= 5.0]
# print("Unique Rating values:", app_data.Rating.unique())

# convert Price values to floats
app_data.Price = app_data.Price.str.strip("$").astype(float)

# convert Reviews values to integers
app_data.Reviews = app_data.Reviews.astype(int)

# encode Installs values
app_data.Installs = app_data.Installs.str.strip("+").str.replace(",", "").astype(int)
# print("Unique Installs values:", sorted(app_data.Installs.unique()))
app_data.Installs = app_data.Installs.astype("category").cat.codes

# encode Category values (numerical)
cats = app_data.Category.astype("category").cat
app_data.Category = cats.codes
categories = dict(enumerate(cats.categories))  # code -> category name
num_cats = len(categories)
# print(num_cats)
# print("Categories:", categories)

# summarize new dataset
# print("\nRevised:")
# print(app_data.dtypes)
# print(app_data.isna().sum())
# print(app_data.Category.unique())
app_data.head()

Unnamed: 0,Category,Rating,Reviews,Installs,Price,NameLen,Bytes
0,0,4.1,159,8,0.0,46,19922944.0
1,0,3.9,967,11,0.0,19,14680064.0
2,0,4.7,87510,13,0.0,50,9122611.2
3,0,4.5,215644,15,0.0,21,26214400.0
4,0,4.3,967,10,0.0,37,2936012.8
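
The category encoding in the cell above assigns each distinct category an integer code (pandas orders string categories alphabetically), and the later `to_categorical` call turns those codes into one-hot rows for the neural network. A plain-Python sketch of both steps follows; `encode_labels` and `one_hot` are hypothetical names, and apart from `ART_AND_DESIGN` the sample labels are invented for illustration.

```python
def encode_labels(values):
    """Map each distinct label to an integer code, with labels taken in sorted order."""
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in values], classes

def one_hot(codes, num_classes):
    """Turn integer codes into one-hot rows of length num_classes."""
    return [[1.0 if c == k else 0.0 for k in range(num_classes)] for c in codes]

# Sample labels; only ART_AND_DESIGN appears in the preview, the rest are made up.
labels = ["TOOLS", "ART_AND_DESIGN", "GAME", "TOOLS"]
codes, classes = encode_labels(labels)
print(codes)
print(one_hot(codes, len(classes)))
```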


In [5]:
# separate
X = app_data[["Rating", "Reviews", "Installs", "Price", "NameLen", "Bytes"]]
y = app_data["Category"]
y_cat = np_utils.to_categorical(app_data["Category"], num_classes=num_cats)

# normalize
X = (X - X.mean()) / X.std()

# split to train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69)

# do more for NN datasets
_, _, y_train_cat, y_test_cat = train_test_split(X, y_cat, test_size=0.2, random_state=69)
train_tensor = tf.data.Dataset.from_tensor_slices((X_train.values, y_train_cat))
test_tensor = tf.data.Dataset.from_tensor_slices((X_test.values, y_test_cat))

train_batch = train_tensor.shuffle(len(X_train)).batch(5)
test_batch = test_tensor.shuffle(len(X_test)).batch(5)

# summarize
X.head()

Unnamed: 0,Rating,Reviews,Installs,Price,NameLen,Bytes
0,-0.135617,-0.158066,-0.622296,-0.064801,1.892152,-0.168784
1,-0.502884,-0.157633,0.31151,-0.064801,-0.32238,-0.382041
2,0.966185,-0.111185,0.934048,-0.064801,2.220231,-0.608095
3,0.598918,-0.042415,1.556585,-0.064801,-0.158341,0.087126
4,0.231651,-0.157633,0.000242,-0.064801,1.153974,-0.859739
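
The cell above standardizes each column as (x - mean) / std using statistics from the full dataset before splitting. A common alternative, sketched here on the assumption that the test set should stay completely unseen, is to compute the mean and standard deviation from the training rows only and reuse them for the test rows. The names `fit_standardizer` and `standardize` and the toy numbers are hypothetical; `statistics.stdev` is the sample standard deviation, which matches the pandas `.std()` default.

```python
from statistics import mean, stdev

def fit_standardizer(train_column):
    """Return (mean, sample std) computed from training values only."""
    return mean(train_column), stdev(train_column)

def standardize(column, mu, sigma):
    """Apply z-score scaling with previously fitted statistics."""
    return [(x - mu) / sigma for x in column]

train = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy training values
test = [3.0, 6.0]                   # toy held-out values

mu, sigma = fit_standardizer(train)
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)  # scaled with the training statistics
print(train_z)
print(test_z)
```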



## Methods

Sklearn Method No.1: Random Forest Classification

Sklearn Method No.2: Logistic Regression

Tensorflow Method: Multi-Layer Perceptron Neural Network

In [6]:
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import tensorflow as tf

# for tuning
import matplotlib.pyplot as plt
from statistics import mean

# easy testing flag to see verbose output
VERBOSE=0

### Random Forest Classification Model

In [7]:
def run_rf(n_estimators=num_cats):
    rf_model = RandomForestClassifier(n_estimators=n_estimators, random_state=69, verbose=VERBOSE)
    rf_model.fit(X_train, y_train)
    return rf_model

# sample run
m = run_rf()
prob = m.predict_proba(X_test)
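# roc_auc_score gained multi_class in scikit-learn 0.22; compare (major, minor) so 1.x releases also qualify
if tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 22):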
    print("RF Accuracy:", m.score(X_test, y_test), "RF AUC:", roc_auc_score(y_test, prob, multi_class="ovr"))
else:
    print("RF Accuracy:", m.score(X_test, y_test))

RF Accuracy: 0.315006468305304


#### Estimator Count Tuning

In [8]:
range_est = range(num_cats, num_cats*11, num_cats)
rf_acc = []
maxx_rf = (0, 0)
for est in range_est:
    m = run_rf(est)
    a = m.score(X_test, y_test)
    rf_acc.append(a)
    if a > maxx_rf[0]:
        maxx_rf = (a, est)
print("Best Run:", maxx_rf)
plt.title("Accuracy vs. Number of Estimators")
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.plot(range_est, rf_acc)


### Logistic Regression Classification Model

In [None]:
def run_lr(max_iter=100):
    lr_model = LogisticRegression(penalty="l2", solver="saga", max_iter=max_iter, multi_class="auto", verbose=VERBOSE)
    lr_model.fit(X_train, y_train)
    return lr_model

# sample run
m = run_lr()
prob = m.predict_proba(X_test)
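# roc_auc_score gained multi_class in scikit-learn 0.22; compare (major, minor) so 1.x releases also qualify
if tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 22):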
    print("LR Accuracy:", m.score(X_test, y_test), "LR AUC:", roc_auc_score(y_test, prob, multi_class="ovr"))
else:
    print("LR Accuracy:", m.score(X_test, y_test))

#### Maximum Iteration Tuning

In [None]:
range_iter = range(100, 201, 10)
lr_acc = []
maxx_lr = (0, 0)
for m_iter in range_iter:
    m = run_lr(m_iter)
    a = m.score(X_test, y_test)
    lr_acc.append(a)
    if a > maxx_lr[0]:
        maxx_lr = (a, m_iter)
print("Best Run:", maxx_lr)
plt.title("Accuracy vs. Maximum Iterations")
plt.xlabel("max_iter")
plt.ylabel("Accuracy")
plt.plot(range_iter, lr_acc)

### Neural Network Classification Model

In [None]:
def run_nn(epochs=10):
    nn_model = tf.keras.Sequential([
        tf.keras.layers.Dense(num_cats*2, input_dim=6, activation="relu"),
        tf.keras.layers.Dense(num_cats*3, activation="relu"),
        tf.keras.layers.Dense(num_cats*2, activation="relu"),
        tf.keras.layers.Dense(num_cats, activation="softmax")
        ])

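    # name the accuracy metric explicitly so the history key is "acc" across TensorFlow versions
    nn_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=[tf.keras.metrics.CategoricalAccuracy(name="acc"), tf.keras.metrics.AUC(name="epoch_auc")])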
    return nn_model.fit(train_batch, epochs=epochs, verbose=VERBOSE)

# sample run (metrics are training-set values averaged over epochs, not test-set scores)
m = run_nn()
print("NN Accuracy:", mean(m.history["acc"]), "NN AUC:", mean(m.history["epoch_auc"]))

#### Epoch Tuning

In [None]:
range_epoch = range(10, 21, 1)
nn_acc = []
maxx_nn = (0, 0)
for epoch in range_epoch:
    m = run_nn(epoch)
    a = mean(m.history["acc"])
    nn_acc.append(a)
    if a > maxx_nn[0]:
        maxx_nn = (a, epoch)
print("Best Run:", maxx_nn)
plt.title("Average Accuracy vs. Number of Epochs")
plt.xlabel("epochs")
plt.ylabel("Avg. Accuracy")
plt.plot(range_epoch, nn_acc)

## Results

In [None]:
# run each model with the best value for the tuned hyperparameter
rf_m = run_rf(maxx_rf[1])
lr_m = run_lr(maxx_lr[1])
nn_m = run_nn(maxx_nn[1])

rf_prob = rf_m.predict_proba(X_test)
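# roc_auc_score gained multi_class in scikit-learn 0.22; compare (major, minor) so 1.x releases also qualify
if tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 22):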
    print("RF Accuracy:", rf_m.score(X_test, y_test), "RF AUC:", roc_auc_score(y_test, rf_prob, multi_class="ovr"))
else:
    print("RF Accuracy:", rf_m.score(X_test, y_test))
print()
lr_prob = lr_m.predict_proba(X_test)
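# same version gate as above: AUC is only computed on scikit-learn 0.22 or newer
if tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 22):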
    print("LR Accuracy:", lr_m.score(X_test, y_test), "LR AUC:", roc_auc_score(y_test, lr_prob, multi_class="ovr"))
else:
    print("LR Accuracy:", lr_m.score(X_test, y_test))
print()
print("NN Accuracy:", mean(nn_m.history["acc"]), "NN AUC:", mean(nn_m.history["epoch_auc"]))

### Comparison Table

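Let's compare the accuracy and AUC values of the different models.  The random forest and logistic regression accuracies are measured on the test set, while the neural network figures are training metrics averaged over epochs, so the columns are not strictly like-for-like.  AUC was not computed for the two `sklearn` models because the installed version (0.21.3) predates the `multi_class` option of `roc_auc_score`.


|                | Random Forest | Logistic Regression | Neural Network |
|:--------------:|:-------------:|:-------------------:|:--------------:|
|    Accuracy    |    33.118%    |      26.326%        |    26.160%     |
|    AUC         |      n/a      |        n/a          |     0.840      |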
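
For reference, the `multi_class="ovr"` option used in the cells above treats each category in turn as the positive class, computes a binary AUC from that category's predicted probability, and averages the results (a macro average by default). The sketch below reimplements that idea in plain Python for intuition only; `binary_auc` and `ovr_auc` are hypothetical names, the data is a toy example, and scikit-learn's own implementation should be preferred in practice.

```python
def binary_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ovr_auc(y_true, prob):
    """Macro-averaged one-vs-rest AUC; prob[i][k] is the predicted probability of class k for sample i.

    Assumes every class appears at least once in y_true.
    """
    n_classes = len(prob[0])
    aucs = []
    for k in range(n_classes):
        labels = [1 if y == k else 0 for y in y_true]
        scores = [row[k] for row in prob]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / n_classes

# Toy data: four samples, three classes.
y_true = [0, 1, 2, 2]
prob = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
]
print(ovr_auc(y_true, prob))
```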

### Plots

Above, you'll see plots for each of the hyperparameter tunings that were completed, placed with their respective models for clarity.

Next, there is a plot of the history of the best-tuned hyperparameter for the neural network.

In [None]:
# neural network history for best tuning of hyperparameter
pd.DataFrame(nn_m.history).plot(figsize=(15, 6))
plt.xticks(range(0, maxx_nn[1]))
plt.title("Progress of Neural Network (Tuned)")
plt.ylabel("Accuracy % / Loss Value / AUC Value")
plt.xlabel("Epoch")
plt.grid(True)

## Discussion

Although we all love a happy ending, that's not always what we get.  In this case, since none of these models has broken 50% accuracy in a reasonable amount of training time, I believe that it is fair to say that one of two reasons may be at fault here:
1. **The dataset is not specific enough.**  There are only a few features, and several of them take values that are very common across the data (e.g. the `Installs` column), so it is very possible that the dataset simply does not have enough detail about each entry to separate the categories.
2. **The data is simply not correlated in the way the hypothesis predicted.**  This is not uncommon, although it is not flashy either.  Unfortunately, there may be no correlation between the popularity, file size, category, etc. of an app.  All of these aspects may very well range across the spectrum of good apps and bad ones, and rather than a pattern emerging, it seems that an "anti-pattern" of sorts has been found.

Nonetheless, the results were still disappointing.  The low accuracy is not promising, and it seems that any further gains from hyperparameter tuning would simply cause the models to overfit the training data, although I would still be curious to see how well these models would perform given a week's worth of CPU time.  Early on, I ran the models on data that was not normalized, which produced even worse results; normalizing mainly benefited the neural network, bringing its accuracy back to the same caliber that the `sklearn` models were achieving.  It is also interesting to note the high AUC value (0.840 for the neural network).  This suggests that the model is not simply "guessing" at the answer: the correct category tends to receive a relatively high predicted probability even when it is not the top prediction.
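
One way to probe the gap between a high AUC and a low accuracy is top-k accuracy: how often the true category is among the k highest-probability predictions. This was not measured in the notebook; the sketch below uses a hypothetical `top_k_accuracy` helper and toy probabilities purely to illustrate the idea.

```python
def top_k_accuracy(y_true, prob, k):
    """Fraction of samples whose true class is among the k highest-probability classes."""
    hits = 0
    for y, row in zip(y_true, prob):
        # class indices sorted from most to least probable
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if y in ranked[:k]:
            hits += 1
    return hits / len(y_true)

# Toy data: the true class is usually ranked second rather than first.
y_true = [0, 1, 2, 1]
prob = [
    [0.50, 0.30, 0.20],  # correct at top-1
    [0.40, 0.35, 0.25],  # true class ranked second
    [0.45, 0.15, 0.40],  # true class ranked second
    [0.20, 0.30, 0.50],  # true class ranked second
]
print(top_k_accuracy(y_true, prob, k=1))
print(top_k_accuracy(y_true, prob, k=2))
```

If top-k accuracy on the real predictions were much higher than plain accuracy, that would support the reading that the models rank the right category highly without placing it first.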

As a mobile app developer, I was hoping that these results would give me some direction in how I develop and market my apps.  Although I did not receive advice on specifics, the results did reinforce my beliefs about, and techniques for, good app development.

## Grading

Project grading rubric (total 100 points - 20% of the final grade):

- Project proposal: 10 points

- Final submission: 70 points - Breakdown as follows

    - 30 points: Methods, hyperparameter tuning and comparison table
    
    - 20 points: Plots

    - 20 points: Discussion (2 paragraphs)
    
- Project complexity and intellectual efforts judged by the instructor: 20 points
    
<b>Notice:</b> Similar to the assignments, up to 10 points may be deducted if your notebook is not easy to read and/or has spelling/grammatical errors, so proofread your notebook!

## How to Submit and Due Date - Late Penalty Will be Strictly Applied!

Name your final project notebook ```Lastname-Project.ipynb```. Submit the notebook file with your dataset file in a zip file named EXACTLY as `Lastname-Project.zip` using the ```Final Project``` link on Blackboard. For groups, only one submission is required.

<font color=red><b>Project Final Submission Due Date: Monday Dec 9th 11:59PM.</b></font>