# Videogame Sales
This dataset contains games released from 1980 to present day and ranks them in terms of global sales. This data was scraped from vgchartz.com. Dataset variables are rank, name, platform, year, genre, publisher, North American sales, EU sales, Japan sales, sales from other regions, and Global sales.

Quick note: sales figures are given in millions.

The purpose of this notebook is to practice my Python/pandas skills by asking some questions and looking for descriptive stats in the dataset.

## 1. Imports

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

# Plotting - matplotlib, seaborn

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

## 2. Load the dataset and get some basic stats

In [None]:
df = pd.read_csv("../input/vgsales.csv")
df.head()

In [None]:
df.info()

At a glance, you can see that the Year and Publisher columns have missing values. Those would need to be filled in or dropped before any analysis that depends on them.

Let's get some basic quantitative stats to better understand the data variables:

In [None]:
print('--- BASIC STATS ---')

# Years covered? (min/max skip NaN; Year is stored as a float because of the missing values)
print('Dataset has games from %d - %d' % (df['Year'].min(), df['Year'].max()))

# How many unique games?
print('Number of Unique Games listed: ' + str(df['Name'].nunique()))

# How many game publishers? (nunique() leaves NaN out of the count, unlike len(unique()))
print('Number of Publishers listed: ' + str(df['Publisher'].nunique()))

# How many game platforms?
print('Number of Platforms listed: ' + str(df['Platform'].nunique()))

# How many game genres?
print('Number of Genres listed: ' + str(df['Genre'].nunique()))
print(df['Genre'].unique())

Games from 2020? Already? Those values would need to be cleaned in the data as well.

In [None]:
# Exactly how many NaN (missing) values?

print('Amount of NaN values for each column:')
print(df.isnull().sum())   # isnull() flags missing cells, sum() counts them per column
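
The cleanup mentioned above (missing Year/Publisher values and release years that look too far in the future) could look something like the sketch below. It runs on a tiny made-up frame rather than the real data. The helper name `clean_sales`, the `max_year=2016` cutoff, and the `'Unknown'` placeholder are all my own assumptions, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the problems seen in the real data
toy = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C', 'Game D'],
    'Year': [2006.0, np.nan, 2020.0, 1999.0],
    'Publisher': ['Pub X', 'Pub Y', 'Pub Z', np.nan],
})

def clean_sales(frame, max_year=2016):
    '''
    Hypothetical helper: drops rows with a missing or implausibly late Year
    and labels missing publishers as 'Unknown'.
    '''
    # NaN <= max_year is False, but the explicit notna() makes the intent obvious
    keep = frame['Year'].notna() & (frame['Year'] <= max_year)
    cleaned = frame[keep].copy()
    cleaned['Year'] = cleaned['Year'].astype(int)   # safe now that no NaN is left
    cleaned['Publisher'] = cleaned['Publisher'].fillna('Unknown')
    return cleaned

cleaned = clean_sales(toy)
print(cleaned)
```

Dropping rows is the blunt option; looking up the missing years by hand would keep more data, but it is a lot more work.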

## 3. Correlation between variables
To get an idea of how things relate with each other, let's do a basic test for correlation between the dataset variables.

In [None]:
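# Only numeric columns can be correlated; selecting them explicitly avoids errors on text columns in newer pandas
correlation = df.select_dtypes(include='number').corr()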
correlation

In [None]:
# Let's visualize the correlations with seaborn
plt.figure(figsize=(10,10))
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')
plt.title('Correlation between different features')

### Some insights from the correlation

* There should be a direct relationship between rank and global sales, since the dataset is ranked by global sales, so the weak correlation here surprised me. One likely explanation is that `corr()` computes Pearson correlation by default, which only measures linear association, while sales fall off very steeply with rank.

* There are extremely strong correlations between both North American and European sales and global sales. This suggests that these markets make up most of the sales for the majority of games in the dataset.

* JP has the lowest correlation with global sales (although it's still strong), which suggests that, after NA and EU, the other regions collectively account for more sales than JP. This makes sense, given that other markets like South America and Australia import videogames.
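
To illustrate the rank versus global sales point, here is a small sketch on made-up data (not the real dataset). It assumes sales decay steeply with rank, which is my guess at the general shape, and compares the Pearson coefficient with a rank-based (Spearman) one. Spearman is computed here as the Pearson correlation of the ranked values, which is its definition.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: sales that drop off steeply as rank increases
toy = pd.DataFrame({'Rank': np.arange(1, 101)})
toy['Global_Sales'] = 80.0 / toy['Rank'] ** 1.5

# Pearson: measures how well a straight line describes the relationship
pearson = toy['Rank'].corr(toy['Global_Sales'])

# Spearman: Pearson correlation of the ranked values, so it only cares about ordering
spearman = toy['Rank'].rank().corr(toy['Global_Sales'].rank())

print('Pearson:  %.2f' % pearson)
print('Spearman: %.2f' % spearman)
```

On this toy data the ordering is perfectly monotonic, so the rank-based coefficient sits at -1 while Pearson is much weaker. Whether the real data behaves the same way would need to be checked on `df` itself.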

## 4. Publisher Analysis

Let's take a look at the top publishers in the dataset. I'm interested in ranking the publishers on three different criteria:
total global sales, most games published, and highest average global sales per game.
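
As a compact alternative, all three criteria can be computed in one pass with a single `groupby` and named aggregation. This is only a sketch on invented numbers; the publishers 'A', 'B', 'C' and the output column names `total_sales`, `releases`, and `avg_sales` are placeholders I made up.

```python
import pandas as pd

# Invented sales figures, one row per release
toy = pd.DataFrame({
    'Publisher': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Global_Sales': [1.0, 2.0, 3.0, 10.0, 20.0, 0.5],
})

# One row per publisher with total sales, number of releases, and average sales per release
summary = toy.groupby('Publisher')['Global_Sales'].agg(
    total_sales='sum',
    releases='count',
    avg_sales='mean',
)

# Sorting by a different column gives each of the three rankings
print(summary.sort_values('total_sales', ascending=False))
print(summary.sort_values('releases', ascending=False))
print(summary.sort_values('avg_sales', ascending=False))
```

Note that this averages over every publisher, so a publisher with one lucky hit can top the `avg_sales` ranking. Restricting to the top sellers first avoids that.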

In [None]:
# Top 25 Publishers by Global Sales

# Sum only the sales column; summing every column would also add up Rank and Year, which is meaningless
publishers = df.groupby('Publisher')[['Global_Sales']].sum()
top25_publishers = publishers.sort_values(by='Global_Sales', ascending=False)[:25]

plt.figure(figsize=(8,6))
sns.barplot(y=top25_publishers.index, x=top25_publishers.Global_Sales)
plt.ylabel("Publisher")
plt.xlabel("Global Sales")
plt.title('Top 25 Publishers by Global Sales')
plt.show()

In [None]:
# Top 25 Publishers with most releases

# Count rows per publisher; each platform release of a game counts separately
mostgames_sum = df.groupby('Publisher').size()
top25_games = mostgames_sum.sort_values(ascending=False)[:25]
plt.figure(figsize=(8,6))
sns.barplot(y=top25_games.index, x=top25_games.values, orient="h")
plt.ylabel("Publisher")
plt.xlabel("Number of Games")
plt.title('Top 25 Publishers by # of Games Released')
plt.show()

In [None]:
# Top 25 Publishers by Avg. Sales per Game Release
# Take the Top 25 Publishers by Global Sales and order them based on the average sales per game release

sales_per_game = (top25_publishers['Global_Sales']/mostgames_sum[top25_publishers.index]).sort_values(ascending=False)
plt.figure(figsize=(8,6))
sns.barplot(y=sales_per_game.index, x=sales_per_game.values, orient="h")
plt.ylabel("Publishers")
plt.xlabel("Avg. Sales per Game Release")
plt.title('Top 25 Publishers Globally by Avg. Sales per Game Release')
plt.show()

In this dataset, it's apparent that Nintendo is the big winner. With the most global sales and the highest global sales per release, the company clearly has a track record of putting out globally popular titles.

Conversely, publishers like EA and Activision seem to rely on sheer output for their success. They are 2nd and 3rd in global sales and 1st and 2nd in number of titles released, respectively, flooding the market with annual game franchises. At 8th and 9th place for global sales per release, it's clear that not all of their releases are hits the way Nintendo's titles are.

## 5. Franchise Sales

I'm interested in determining the total global sales for some popular game franchises. To do this, I'll create a function that grabs all titles beginning with the same words (e.g. Uncharted, FIFA, Fire Emblem) and sums their global sales. Most game franchises use these trademarked naming conventions to maintain a consistent brand identity with customers.

In [None]:
def sum_globalsales(keyword):
    '''
    Prints every title starting with the given
    word/phrase (one line per platform release)
    and the total Global Sales for the series
    '''
    # Boolean mask over all rows; na=False treats a missing name as "no match"
    series = df[df['Name'].str.startswith(keyword, na=False)]
    print("'" + keyword + "'" + ' Series')
    print('---TITLES---')
    # Games released on multiple platforms have one row per platform, so each row is listed
    for _, row in series.iterrows():
        print(row['Name'] + ': ' + str(row['Global_Sales']) + ' [' + row['Platform'] + ']')
    print('-'*len('---TITLES---'))
    print(series['Global_Sales'].sum())

Let's see what the global sales are for three of my most beloved franchises: **The Legend of Zelda**, **Gears of War**, and **Halo**. These examples lend themselves well to the function I've made, but the function could be improved by matching a keyword anywhere in the title (as opposed to only at the start). A use case for this would be searching for games with 'Mario' in the name: New Super Mario Bros., Super Mario 64, Mario & Luigi: Paper Jam, etc.
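
The "keyword anywhere in the title" idea could be sketched roughly like this, using `Series.str.contains`. The function name `sum_sales_containing` is hypothetical, and the sales numbers in the toy frame are invented placeholders, not real figures.

```python
import pandas as pd

# Toy rows with made-up sales values (NOT the real figures)
toy = pd.DataFrame({
    'Name': ['New Super Mario Bros.', 'Super Mario 64', 'Mario Kart Wii',
             'Halo 3', 'Super Mario 64'],
    'Platform': ['DS', 'N64', 'Wii', 'X360', 'DS'],
    'Global_Sales': [1.0, 2.0, 4.0, 8.0, 16.0],
})

def sum_sales_containing(frame, keyword):
    '''
    Hypothetical variant of sum_globalsales: sums Global_Sales for every row
    whose Name contains the keyword anywhere, ignoring case.
    '''
    # regex=False treats the keyword as plain text; na=False skips missing names
    mask = frame['Name'].str.contains(keyword, case=False, regex=False, na=False)
    return frame.loc[mask, 'Global_Sales'].sum()

total = sum_sales_containing(toy, 'mario')
print(total)
```

The trade-off is that a substring match can pull in unrelated titles that happen to contain the keyword, so the matched names are worth eyeballing before trusting the total.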

In [None]:
# The Legend of Zelda

sum_globalsales('The Legend of Zelda')

In [None]:
# Gears of War

sum_globalsales('Gears of War')

In [None]:
# Halo

sum_globalsales('Halo')

## 6. Genre Analysis
Let's start by seeing what genres are most popular globally and in each specific region.

In [None]:
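# Sum only the sales columns; Rank and Year totals are meaningless and text columns can't be summed sensibly
data = df.groupby('Genre')[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']].sum()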

# Top Genres Globally
plt.figure(figsize=(12,6))
sns.barplot(y=data.index, x=data.Global_Sales, orient="h")
plt.ylabel("Genre")
plt.xlabel("Global Sales")
plt.title('Top Genres Globally')

# Top Genres for NA, EU, JP, Other
fig, (axis1, axis2) = plt.subplots(1,2,figsize=(16,5))
sns.barplot(y=data.index, x=data.NA_Sales, orient="h", ax=axis1)
sns.barplot(y=data.index, x=data.EU_Sales, orient="h", ax=axis2)

fig, (axis1, axis2) = plt.subplots(1,2,figsize=(16,5))
sns.barplot(y=data.index, x=data.JP_Sales, orient="h", ax=axis1)
sns.barplot(y=data.index, x=data.Other_Sales, orient="h", ax=axis2)

Adventure is consistently one of the least popular genres across all regions, and it also seems like a vague genre to me. What's the difference between an adventure game and an action game? I would assume most adventures have some kind of action in them. Let's see the top 10 adventure games.

In [None]:
df[df.Genre == 'Adventure']['Name'].head(10)

This doesn't really clear up the genre's description for me. Of course, in all these games there's some sort of adventure going on, but if I had to gather a theme from the top titles, it seems that perhaps they focus less on the player doing any fighting (Super Mario Land, Rugrats, Club Penguin, L.A. Noire). Then again, Assassin's Creed kind of throws a wrench in that idea.

Anyways, let's see which genres have the most games published.

In [None]:
# value_counts() counts the rows per genre and already sorts in descending order
genreGameSum = df['Genre'].value_counts()
plt.figure(figsize=(12,6))
sns.barplot(y=genreGameSum.index, x=genreGameSum.values, orient="h")
plt.ylabel("Genre")
plt.xlabel("Number of Games")
plt.title("Genres with the Most Releases")
plt.show()

Action and Sports seem to have the most published games; this makes sense to me as many of them are franchises (e.g. Grand Theft Auto, Legend of Zelda, FIFA, Madden).