## Video Games, Vectors and TSNE
This is a very simple notebook for visually inspecting video game sales and clusters of similar video games. We do this primarily through the TSNE algorithm, which lets us see which games group together.

### First import libraries

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.manifold
sns.set_style("darkgrid")

### Read in the data
We will only use 'Name', 'Platform', 'Year', 'Genre', 'Publisher' and 'Global_Sales' in our analysis

In [None]:
df = pd.read_csv('../input/vgsales.csv',usecols=['Name','Platform','Year','Genre','Publisher','Global_Sales'])
df.head(10)

### We want to One Hot Encode the categorical features and normalize sales by year.
OHE transforms the distinct categories of one feature (for example 'X360' and 'PS3' from 'Platform') into separate binary features, effectively turning each categorical feature into a vector of 0s and 1s. The pandas get_dummies() function does this for us.
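To make the encoding concrete, here is a minimal sketch on a made-up three-row frame. The game names and values are illustrative assumptions and do not come from vgsales.csv; `dtype=int` is only there so the output prints as 0/1.

```python
import pandas as pd

# Toy data: illustrative values only, not taken from vgsales.csv
toy = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C'],
    'Platform': ['X360', 'PS3', 'X360'],
    'Genre': ['Shooter', 'Action', 'Sports'],
})

# Each distinct category becomes its own 0/1 column named <feature>_<category>
toy_ohe = pd.get_dummies(toy, columns=['Platform', 'Genre'], dtype=int)
print(toy_ohe)
```

Two platforms and three genres give five new binary columns, which is why a feature with many distinct values, like 'Publisher', produces so many columns on the real data.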

In [None]:
def ohe_features_normalize_sales(data,cols):
    new_data = pd.get_dummies(data,columns=cols)
    new_data.dropna(inplace=True)
    new_data.reset_index(drop=True,inplace=True)
    new_data['Global_Sales'] = new_data['Global_Sales'] / new_data.groupby('Year')['Global_Sales'].transform('sum')
    new_data['Year'] = new_data['Year'].astype(int) # convert year to int rather than float
    return new_data

In [None]:
# choose what columns we want to OHE
use_cols = ['Platform','Genre','Publisher']
df_dummies = ohe_features_normalize_sales(df,use_cols)
df_dummies.head(10)

In [None]:
df_dummies.shape

ok, so the above turns 3 categorical features into 621 OHE features :)

## Let's plot some of the data to understand it better

In [None]:
class Plot:
    """
    A class that takes in a dataframe, groups it by one column and sums another column.
    It then turns the result into a small dataframe and makes a seaborn plot from it.
    We can specify the type of plot, 'pointplot' or 'barplot', through *plot_style*
    """
    def __init__(self, data, group_col, sum_col, plot_style, n_largest = None):
        self.data = data
        self.group_col = group_col
        self.sum_col = sum_col
        self.plot_style = plot_style
        self.n_largest = n_largest
        
    # Transform dataframe into grouped + summed dataframe (e.g. Group by Year and Sum all the sales in that year)
    def get_new_dataframe(self):
        d = dict(self.data.groupby([self.group_col])[self.sum_col].sum())
        d = pd.DataFrame.from_dict(d,orient='index')
        d = d.reset_index()
        d.columns = [self.group_col, self.sum_col]
        if self.n_largest:
            d = d.nlargest(n=self.n_largest,columns=self.sum_col)
        return d
    
    # Plot all the data in the new data frame
    def get_plot(self):
        d = self.get_new_dataframe()
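        if self.plot_style == 'pointplot':
            g = sns.pointplot(x=self.group_col, y=self.sum_col, data=d)
        elif self.plot_style == 'barplot':
            g = sns.barplot(x=self.group_col, y=self.sum_col, data=d)
        else:
            raise ValueError("plot_style must be 'pointplot' or 'barplot'")
        # seaborn returns a matplotlib Axes, so labels are set with set_xlabel / set_ylabel
        g.set_xlabel(self.group_col)
        g.set_ylabel(self.sum_col)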
        for item in g.get_xticklabels():
            item.set_rotation(80)
        return g

### Time Series of All Sales
We plot all sales for a year, year by year

In [None]:
g = Plot(df, 'Year', 'Global_Sales', 'pointplot')
g.get_plot()

Game sales grow hugely over time. This makes sense with gaming becoming far more popular in the last 3 decades. The massive difference between sales in the 80's and the 00's is why we normalise each year's sales in the df_dummies dataframe.
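As a small sketch of what that per-year normalisation does, the toy frame below (made-up numbers, and a hypothetical `Sales_Share` column name) divides each game's sales by the total for its year, using the same `groupby(...).transform('sum')` pattern as the function above.

```python
import pandas as pd

# Toy sales figures: illustrative only
toy = pd.DataFrame({
    'Year': [1985, 1985, 2008, 2008],
    'Global_Sales': [1.0, 3.0, 20.0, 60.0],
})

# transform('sum') broadcasts each year's total back onto every row of that year
yearly_total = toy.groupby('Year')['Global_Sales'].transform('sum')
toy['Sales_Share'] = toy['Global_Sales'] / yearly_total
print(toy)
```

After normalising, a game with 3 units in a small year and a game with 60 units in a big year both hold the same 75% share of their year, so eras become comparable.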

### Which Platforms Sell the Most Games Over All Time?

In [None]:
g = Plot(df, 'Platform', 'Global_Sales', 'barplot')
g.get_plot()

Clear winners are PS2, PS3, Wii, Xbox360 and DS

### What Genres Sell Most?

In [None]:
g = Plot(df, 'Genre', 'Global_Sales', 'barplot')
g.get_plot()

Action wins here, with Sports in second

### Which Publishers Sell the Most?

In [None]:
g = Plot(df, 'Publisher', 'Global_Sales', 'barplot', n_largest=10)
g.get_plot()

Nintendo is the Publisher with the highest sales over all time!
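The `n_largest=10` argument simply keeps the top groups after summing. A minimal sketch of the same idea on toy data (the publisher labels 'P1' to 'P3' and the figures are made up):

```python
import pandas as pd

# Toy data: hypothetical publishers and sales
toy = pd.DataFrame({
    'Publisher': ['P1', 'P2', 'P1', 'P3', 'P2', 'P1'],
    'Global_Sales': [5.0, 2.0, 1.0, 0.5, 4.0, 3.0],
})

# Sum sales per publisher, then keep only the two largest totals
top2 = toy.groupby('Publisher')['Global_Sales'].sum().nlargest(2)
print(top2)
```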

### Let's merge some plots to gain a deeper understanding
We saw that the PS2, PS3, Wii and Xbox360 were among the most popular platforms by sales in our second plot, so let's combine the Wii, PS3 and Xbox 360 in one plot (we exclude the PS2 to have some variety between platforms)

In [None]:
top_3_consoles = df[(df['Platform'] == 'Wii') | (df['Platform'] == 'PS3') | (df['Platform'] == 'X360')]
# group by Genre and Platform and sum by Global_Sales
genre_platform_sales = top_3_consoles.groupby(['Genre','Platform'])['Global_Sales'].sum()
genre_platform_sales.unstack().plot(kind='bar',stacked=True,  colormap='Blues', grid=False, figsize=(13,5));
plt.title('Stacked Bar Plot of Sales per Genre for 3 Platforms', fontsize=15)
plt.xlabel('Genre', fontsize=15)
plt.ylabel('Global_Sales', fontsize=15)
plt.xticks(fontsize=12,rotation=70);

The results make sense, if you know your games :) Wii is known for its sports games (Wii Sports), so it outperforms the other platforms in sports. The Xbox360 is known for its shooters (Call of Duty, Gears of War), so it outperforms its peers here. PS3 has the most action game sales... and Wii has the most Misc game sales, because why not ;)
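The stacked bars come from grouping by two columns and then unstacking one of them into columns. Here is a hedged sketch of that reshaping step on made-up numbers; `fill_value=0` is an addition for the toy case where a genre/platform pair has no rows.

```python
import pandas as pd

# Toy data: illustrative values only
toy = pd.DataFrame({
    'Genre': ['Sports', 'Sports', 'Shooter', 'Shooter', 'Sports'],
    'Platform': ['Wii', 'PS3', 'X360', 'PS3', 'Wii'],
    'Global_Sales': [10.0, 2.0, 8.0, 3.0, 5.0],
})

# Sum per (Genre, Platform) pair, then pivot Platform into columns:
# one row per genre, one column per platform
table = toy.groupby(['Genre', 'Platform'])['Global_Sales'].sum().unstack(fill_value=0)
print(table)
```

Calling `.plot(kind='bar', stacked=True)` on a table of this shape draws one bar per row and one coloured segment per column.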

## In summary
There are some definite patterns in the data, in terms of the most popular platforms, genres and publishers. TSNE should pick up on these using the OHE features

## Now we'll use TSNE to reduce the dimensionality of the OHE features from 621 to 2 so we can visualise game clusters

In [None]:
# columns we want to use from dataframe
cols_to_use = list(df_dummies.columns)
cols_to_use.remove('Name') # this is the label

In [None]:
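# transform dataframe to matrix. Each row is a game (observation), each column is a feature
# (DataFrame.as_matrix() was removed in pandas 1.0, so we select the columns and use to_numpy())
matrix = df_dummies[cols_to_use].to_numpy(dtype=float)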
matrix

In [None]:
matrix.shape # everything but the 'Name' label is in the matrix

### Train TSNE model
Let's spend a few minutes explaining TSNE, since it has become a hot topic in machine learning and visualisation.

T-Distributed Stochastic Neighbour Embedding (TSNE) takes each data point in the high dimensional space, centres a Gaussian on it and converts the distances to its neighbouring points into probabilities, so that nearby points get a high probability and distant points a low one. It then places the points in a lower dimensional space (usually 2D), where similarity is measured with a heavy-tailed Student's t-distribution, and tries to make the 2D probabilities match the high dimensional ones as closely as possible. It does this by gradient descent on the Kullback–Leibler divergence between the high dimensional and low dimensional probabilities. The 2D output coordinates don't represent any of the features in the original dataset and are used purely for visualisation, so they are usually just labelled 'X' and 'Y'.
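The quantity being minimised can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it assumes one fixed Gaussian width `sigma` for every point (real TSNE tunes the width per point to match the perplexity), uses the Student-t kernel in 2D as in the standard formulation, and only evaluates the Kullback–Leibler divergence for a random candidate embedding rather than optimising it. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))   # 6 toy points in 10 dimensions
Y = rng.normal(size=(6, 2))    # a candidate 2D embedding of the same points

def squared_distances(A):
    # Pairwise squared Euclidean distances between the rows of A
    diff = A[:, None, :] - A[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_affinities(X, sigma=1.0):
    # Gaussian centred on each point; fixed sigma is a simplifying assumption
    logits = -squared_distances(X) / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbour
    cond = np.exp(logits - logits.max(axis=1, keepdims=True))
    cond /= cond.sum(axis=1, keepdims=True)      # conditional p(j|i), rows sum to 1
    return (cond + cond.T) / (2 * len(X))        # symmetrised joint p_ij

def low_dim_affinities(Y):
    # Student-t kernel with one degree of freedom (heavy tails)
    kernel = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum()

def kl_divergence(P, Q):
    # KL(P || Q), summed over pairs with non-zero P
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

P = high_dim_affinities(X)
Q = low_dim_affinities(Y)
print(kl_divergence(P, Q))
```

Gradient descent then moves the rows of `Y` to push this number down; a perfect match of the two probability tables would give a divergence of zero.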

TSNE preserves local structure, but not global structure. Points that end up close together in 2D were most probably close together in the higher dimensions, but the distances between well-separated clusters, and the sizes of the clusters, should not be read too literally. The global structure (the bigger picture shape) of the higher dimensional object will most probably be distorted in lower dimensions. This kind of makes sense, because how can you visualise 600 dimensional shapes in just 2 dimensions? It's like trying to draw a 3D sphere on a 1D line: it's hard.

The Barnes-Hut approximation is used so that the interactions need not be calculated between every single pair of points; far-away points are treated as groups instead, reducing the complexity from $O(N^2)$ to $O(N\log N)$
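In scikit-learn this choice is exposed through the `method` parameter of `TSNE`. The sketch below runs both variants on small random data (an assumption made purely so that it finishes quickly; `perplexity=10` is also chosen just to suit 60 points).

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))   # toy data: 60 points, 5 features

# 'barnes_hut' is the default; angle trades accuracy for speed
approx = TSNE(n_components=2, method='barnes_hut', angle=0.5,
              perplexity=10, random_state=0)
# 'exact' evaluates every pair of points, which gets slow for large N
exact = TSNE(n_components=2, method='exact', perplexity=10, random_state=0)

emb_approx = approx.fit_transform(X)
emb_exact = exact.fit_transform(X)
print(emb_approx.shape, emb_exact.shape)
```

On a dataset the size of the one in this notebook, the exact method would likely be much slower, which is why the default is left in place below.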

p.s. TSNE should only ever be used for visualisation, not as input for training clustering models, as discussed here :)
https://stats.stackexchange.com/questions/263539/k-means-clustering-on-the-output-of-t-sne/264647#264647

In [None]:
# this may take 5 minutes
tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)
matrix_2d = tsne.fit_transform(matrix)

In [None]:
df_tsne = pd.DataFrame(matrix_2d)
df_tsne['Name'] = df_dummies['Name']
df_tsne.columns = ['x','y', 'Name']
# rearrange columns
cols = ['Name','x','y']
df_tsne = df_tsne[cols]
# show the 2D coordinates of the TSNE output
df_tsne.head(10)

In [None]:
g = df_tsne.plot.scatter("x", "y", s=10, figsize=(20, 12), fontsize=20)
g.set_ylabel('Y',size=20)
g.set_xlabel('X',size=20)

Wow! There are some really nice looking clusters above!

# Let's inspect some clusters :)

Here we'll further inspect some of the clusters in the map above.

We'll see that the clusters are quite well defined. Pretty much every cluster in the map above has games from a similar era, genre and platform! We'll inspect 4 in detail.

When zooming in on areas of the graph the names become a bit messy, so we also print each cluster's game data directly from the dataframe for a clearer understanding.
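Zooming in on an area amounts to a bounding-box filter on the 2D coordinates. A minimal sketch on a made-up frame with the same 'Name', 'x', 'y' layout as df_tsne (the names `toy_tsne`, `in_region` and `toy_region` are hypothetical):

```python
import pandas as pd

# Toy coordinates: illustrative only, not real TSNE output
toy_tsne = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C', 'Game D'],
    'x': [81.0, 85.5, 60.0, 89.0],
    'y': [-10.0, -3.0, -5.0, 4.0],
})

x_bounds, y_bounds = (80, 90), (-15, 0)

# Series.between is inclusive on both ends by default
in_region = toy_tsne['x'].between(*x_bounds) & toy_tsne['y'].between(*y_bounds)
toy_region = toy_tsne[in_region]
print(toy_region)
```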

### Define functions to look at specific regions

In [None]:
class PlotTsneRegion:
    def __init__(self, data, x_bounds, y_bounds, rand_points=None):
        self.data = data
        self.x_bounds = x_bounds
        self.y_bounds = y_bounds
        self.rand_points = rand_points
        
    def get_slice(self):
        # keep only the points inside the bounding box (bounds are inclusive)
        region = self.data[
            (self.x_bounds[0] <= self.data.x) &
            (self.data.x <= self.x_bounds[1]) &
            (self.y_bounds[0] <= self.data.y) &
            (self.data.y <= self.y_bounds[1])
        ]
        return region
    
    def plot_region(self):
        region = self.get_slice()
        # sample a fraction rand_points of the region in case it is too dense with points
        if self.rand_points:
            region = region.sample(frac=self.rand_points)
        ax = region.plot.scatter("x", "y", s=35, figsize=(10, 8))
        for i, point in region.iterrows():
            ax.text(point.x + 0.02, point.y + 0.02, point.Name, fontsize=11)

### Early 80's 'Atari 2600' Platform / Shooters

In [None]:
x_bounds, y_bounds = (80,90), (-15,0)
region = PlotTsneRegion(df_tsne,x_bounds=x_bounds, y_bounds=y_bounds, rand_points=0.6)
region.plot_region()

In [None]:
df[df.Name.isin(list(region.get_slice()['Name']))].head(10)

### Late 00's DS Puzzlers

In [None]:
x_bounds,y_bounds = (65,75),(-30,-15)
region = PlotTsneRegion(df_tsne,x_bounds=x_bounds, y_bounds=y_bounds, rand_points=0.3)
region.plot_region()

In [None]:
df[df.Name.isin(list(region.get_slice()['Name']))].head(10)

### Mid 00's Adventure (mainly PS2)

In [None]:
x_bounds,y_bounds = (-80,-75),(-10,5)
region = PlotTsneRegion(df_tsne,x_bounds=x_bounds, y_bounds=y_bounds, rand_points=0.4)
region.plot_region()

In [None]:
df[df.Name.isin(list(region.get_slice()['Name']))].head(10)

### Mid 00's Action Movie (Star Wars, Spiderman, Simpsons, Lego)

In [None]:
x_bounds,y_bounds = (14,24),(-2,15)
region = PlotTsneRegion(df_tsne,x_bounds=x_bounds, y_bounds=y_bounds, rand_points=0.3)
region.plot_region()

In [None]:
df[df.Name.isin(list(region.get_slice()['Name']))].head(10)

### Conclusion
TSNE is awesome! I never thought the games would cluster that well, but there you go. Every cluster has well-defined games in it, from specific eras, genres and platforms.

As tempting as it may be, one should not attempt to train a clustering algorithm on TSNE output, because the output can change depending on TSNE input parameters, dimensional mappings and other attributes.
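A quick way to see this sensitivity is to embed the same data several times with different settings. The sketch below uses small random data and explicitly sets `init='random'` so that the seed matters; both choices are assumptions made for illustration, not the settings used earlier in this notebook.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))   # toy data: 60 points, 5 features

def embed(seed, perplexity):
    # Hypothetical helper: one TSNE run with a given seed and perplexity
    model = TSNE(n_components=2, init='random',
                 perplexity=perplexity, random_state=seed)
    return model.fit_transform(X)

run_a = embed(seed=0, perplexity=10)
run_b = embed(seed=1, perplexity=10)   # same data and settings, different seed
run_c = embed(seed=0, perplexity=25)   # same seed, different perplexity

# Compare the raw coordinates of the three runs
print(np.allclose(run_a, run_b), np.allclose(run_a, run_c))
```

Since the coordinates themselves are not stable from run to run, any cluster labels fitted on them would inherit that instability.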