<a href="https://colab.research.google.com/github/rthorst/Machine_Learning/blob/master/mobile_games/Clustering_Mobile_Strategy_Games.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# There Are 8 Main Types of Strategy Games on the Apple App Store


## Findings

I find that there are 8 main types of strategy games on the App Store (an interactive version of the figure is at the bottom of the notebook):

![alt text](https://raw.githubusercontent.com/rthorst/Machine_Learning/master/mobile_games/clusters.png)

* **Classic Strategy Games** (Sudoku, Chess, Tic-Tac-Toe) - Dark Blue
* **Combat Games** (Merc Wars, Jet Battle Combat, Vicious Tanks) - Light Blue
* **Lifestyle Games** (Pocket Clothier, Crazy BBQ, Ada's Fitness Center) - Purple
* **Mind Games** (Master-mind, Guess Who, Number Enigma) - Dark Pink
* **Tower Defense Games** (Bloons TD 4, Tiny Defense, Zombie Tower Shooting Defense) - Salmon
* **Puzzle Games** (2048 Jewels, Soda Pop Match 3, Shape Slide) - Dark Orange
* **Arcade-Strategy Games** (Crazy Pizzeria Kitchen Chef, Bouncy Fat Hungry Panda Jump) - Light Orange
* **Idle/Tycoon Games** (Idle 3Q, Transit King Tycoon) - Yellow

## Problem

There are hundreds of thousands of mobile games - more than any human can understand and summarize. When developing a game, it would be useful to know about the **types of games that already exist** to inform decisions about game design, marketing, and pricing. 

## Approach

I **build a model** to learn the types of games that already exist. I do this **by analyzing more than 17,000 strategy games** on the Apple App Store.

## Techniques

I use **several machine learning techniques** including:

*   Word embeddings (to represent text in terms of its high-level semantics)
*   Dimensionality reduction (using a **new neural-network-based technique**: Ivis https://www.nature.com/articles/s41598-019-45301-0)
*   Clustering (k-means with k-means++ initialization)




In [0]:
import pandas as pd
import nltk
import numpy as np
#!python -m spacy download en_core_web_md # uncomment to download model. 
import spacy
#!pip install ivis # uncomment to download ivis package. 
import ivis
import seaborn as sns
from sklearn.cluster import DBSCAN, KMeans
import plotly.express as px
import plotly.graph_objects as go

## Load Game Data 

I use a Kaggle dataset of over 17,000 strategy games on the Apple App Store: https://www.kaggle.com/tristan581/17k-apple-app-store-strategy-games. Each record includes a title and a text description, which look like this:

---

>>> **Reversi** 

>>> The classic game of Reversi, also known as Othello, is a much-loved strategy board game. It is often described as taking only a minute to learn but a lifetime to master. ...

---








In [0]:
# Load game descriptions and titles.
data_p = "https://raw.githubusercontent.com/rthorst/Machine_Learning/master/mobile_games/appstore_games.csv"
df = pd.read_csv(data_p)

descriptions = df["Description"].values # shape (n_games,)
titles = df["Name"].values # shape (n_games, )

## Represent the Meaning of Text Descriptions

To identify similar descriptions, a good representation should identify descriptions with similar high-level semantic meaning, even if these descriptions do not use the exact same language. 

**Word embeddings** provide this kind of representation: they map each word to a vector that encodes its meaning, so that words with related meanings end up close together.

![alt text](https://raw.githubusercontent.com/rthorst/Machine_Learning/master/mobile_games/word_embeddings.png)
Image credit: https://bit.ly/2PBzYVs

We use the pre-trained 300-dimensional word embeddings from spaCy's `en_core_web_md` model, and represent each description as the mean of its token vectors.
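
As a rough illustration of why averaged word vectors can capture similarity between descriptions, here is a minimal sketch using a tiny hand-made vocabulary. The 3-dimensional vectors and the names `toy_vectors`, `mean_vector`, and `cosine_similarity` are invented for this example; they are not spaCy's vectors or API.

```python
import math

# Hypothetical 3-d word vectors, invented for illustration only.
toy_vectors = {
    "chess":  [0.9, 0.1, 0.0],
    "board":  [0.8, 0.2, 0.1],
    "sudoku": [0.7, 0.3, 0.0],
    "tank":   [0.1, 0.9, 0.2],
    "battle": [0.0, 0.8, 0.3],
}

def mean_vector(words):
    """Average the vectors of the known words in a description."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    if not known:
        raise ValueError("description contains no known words")
    dims = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dims)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

board_game = mean_vector(["chess", "board"])
puzzle = mean_vector(["sudoku", "board"])
combat = mean_vector(["tank", "battle"])

# Descriptions about related topics should score higher than unrelated ones.
sim_related = cosine_similarity(board_game, puzzle)
sim_unrelated = cosine_similarity(board_game, combat)
print(sim_related > sim_unrelated)
```

The two board-game-like descriptions share no exact wording beyond "board", yet their mean vectors point in nearly the same direction, which is the property the clustering step relies on.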

In [0]:
# Load spacy model (may take a few minutes)
nlp = spacy.load("en_core_web_md")

In [0]:
# Embed descriptions (takes ~15 minutes for 17k descriptions)
embedded_descriptions = []
n = len(descriptions)

for idx, description in enumerate(descriptions):

  # counter.
  if idx % 100 == 0:
    print("{} / {}".format(idx, n))

  # represent description as a single mean word vector, shape (300, )
  doc = nlp(description) # spaCy Doc containing n_tokens tokens
  vectors = np.vstack([token.vector for token in doc]) # shape (n_tokens, 300)
  M_vector = np.mean(vectors, axis=0) # shape (300, )

  # add embedded description to data structure.
  embedded_descriptions.append(M_vector)

embedded_descriptions = np.array(embedded_descriptions)
  


In [0]:
# Save embeddings so that later runs can reload them with np.load instead of re-embedding.
np.save(file="embedded_descriptions.npy", arr=embedded_descriptions)

## Reduce Dimensionality for Visualization

Machines can represent data in high dimensions, but people need low-dimensional representations to see structure. Here, I use **Ivis, a recent neural-network-based dimensionality reduction technique**, to reduce the 300-dimensional embeddings to 3 dimensions. The basic idea is to train a **siamese neural network** with a **triplet loss function** so that points that are neighbors in the high-dimensional space {anchor, positive point} end up closer together in the low-dimensional embedding than points that are distant in the high-dimensional space {anchor, negative point}.


![alt text](https://raw.githubusercontent.com/rthorst/Machine_Learning/master/mobile_games/ivis.PNG)

Image credit: Szubert et al. 2019, Scientific Reports, https://www.nature.com/articles/s41598-019-45301-0
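
To make the triplet idea concrete, here is a minimal sketch of the generic triplet margin loss. It is not Ivis's implementation: the exact loss Ivis uses may differ in its details, and the `margin` value and the example points are assumptions chosen for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is; otherwise it is positive.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)

anchor = [0.0, 0.0, 0.0]
positive = [0.1, 0.0, 0.0]        # a neighbor of the anchor

far_negative = [3.0, 0.0, 0.0]    # already well separated
near_negative = [0.2, 0.0, 0.0]   # too close to the anchor

loss_far = triplet_loss(anchor, positive, far_negative)
loss_near = triplet_loss(anchor, positive, near_negative)
print(loss_far, loss_near)
```

A network trained on this kind of loss is only penalized for triplets like the second one, so training pushes non-neighbors apart until neighbors are clearly closer.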

In [0]:
# Reduce dimensions using ivis 
# (takes ~2 minutes on CPU and requires Ivis package installed.)
embedded_descriptions = np.array(embedded_descriptions)
reducer = ivis.Ivis(embedding_dims=3, k=15)
embeddings_reduced = reducer.fit_transform(embedded_descriptions)

## Find Groups of Games Using Clustering

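Given this many games, it is not feasible for humans to detect groups in the data by hand. Instead, the groups are detected with a class of algorithms known as clustering. Here, we use **k-means** with **k-means++** initialization, which partitions the data into a fixed number of clusters (8 here) by minimizing each point's squared distance to its cluster center, and so works best when clusters are roughly spherical and similar in size. Other approaches would also be reasonable (e.g., DBSCAN does not require the number of clusters to be specified in advance).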

![alt text](https://raw.githubusercontent.com/rthorst/Machine_Learning/master/mobile_games/kmeans.png)

Image credit: http://blog.mpacula.com/wp-content/uploads/2011/04/kmeans1.png
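
The "++" in k-means++ refers to how the starting centroids are chosen. The sketch below shows that seeding step in plain Python on a handful of made-up 2-D points; it is an illustration of the idea, not scikit-learn's implementation, and the function names and the `seed` parameter are assumptions for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, seed=0):
    """Choose k starting centroids in the k-means++ style.

    The first centroid is drawn uniformly at random. Each later one is
    drawn with probability proportional to its squared distance from
    the nearest centroid chosen so far, which spreads the seeds out.
    """
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]

# Made-up data: two tight groups far apart.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids = kmeans_pp_init(points, k=2)
labels = assign(points, centroids)
print(centroids, labels)
```

After seeding, k-means alternates between assigning points to their nearest centroid and moving each centroid to the mean of its assigned points until the assignments stop changing.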

In [0]:
# Cluster.
cluster_model = KMeans(n_clusters=8, init="k-means++") # flexible: DBSCAN may be appropriate to detect an unknown number of groups. 
cluster_labels = cluster_model.fit_predict(X=embeddings_reduced)

## Visualize Clusters

Finally, we visualize the groups detected by the model. The plot produced by the code below is interactive: hovering over a point shows the game's title, which makes it easy to explore what each cluster contains. A static snapshot is shown here.

![alt text](https://raw.githubusercontent.com/rthorst/Machine_Learning/master/mobile_games/clusters.png)

In [0]:
# Cast data to a pandas dataframe, for easy plotting. 
df = pd.DataFrame(data = {"Description Dimension #1" : embeddings_reduced[:, 0], 
                          "Description Dimension #2" : embeddings_reduced[:, 1],
                          "Description Dimension #3" : embeddings_reduced[:, 2],
                          "cluster" : cluster_labels,
                          "title" : titles})

In [0]:
# Plot
fig = px.scatter_3d(df, 
                    x='Description Dimension #1', 
                    y='Description Dimension #2', 
                    z='Description Dimension #3', 
                    color="cluster", 
                    hover_data=["title"])
fig.show()
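
Besides hovering over points, one plausible way to get a feel for what each cluster contains is to list a few titles per cluster. The helper below is a hypothetical sketch, not part of the original analysis; `example_titles` and `example_labels` are small made-up stand-ins for the `titles` and `cluster_labels` arrays.

```python
from collections import defaultdict

def sample_titles_by_cluster(titles, cluster_labels, n_samples=3):
    """Group titles by cluster label, keeping the first few per cluster."""
    grouped = defaultdict(list)
    for title, label in zip(titles, cluster_labels):
        if len(grouped[label]) < n_samples:
            grouped[label].append(title)
    return dict(grouped)

# Hypothetical stand-ins for the notebook's `titles` and `cluster_labels`.
example_titles = ["Sudoku", "Chess", "Merc Wars", "Vicious Tanks", "Tic-Tac-Toe"]
example_labels = [0, 0, 1, 1, 0]

samples = sample_titles_by_cluster(example_titles, example_labels, n_samples=2)
for label in sorted(samples):
    print(label, samples[label])
```

In the notebook, the same function could be called as `sample_titles_by_cluster(titles, cluster_labels)` once the clustering cell has run.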