# INTRODUCTION

In this notebook, we will analyze a dataset that contains **video game sales from 1980 to 2016.**<br> 
The objective is to exploit the information contained in the data to provide a *historical analysis* of the video game industry, uncover *emerging trends* and provide *insights*, so that we can get a better idea of the evolution of this market.
<br><br>

In [None]:
# print(plt.style.available)

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy import stats 

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff

pd.set_option('display.max_rows', None)  # or 1000
%matplotlib inline
plt.rcParams['figure.figsize']=20,10
# plt.style.use('ggplot')
# plt.style.use('Solarize_Light2')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


The dataset covers the video games released between 1980 and 2016, and for each one we have information about:
* **Rank**: Ranking of the video game by global sales, with 1 being the best-selling game.
* **Name**: Name of the video game.
* **Platform**: Console for which the video game was created.
* **Year**: Year the video game was released.
* **Genre**: Category based on the way players interact with the game.
* **Publisher**: Company that owns the rights to the video game.
* **NA_Sales**: Sales of the video game in North America (in millions of copies).
* **EU_Sales**: Sales of the video game in Europe (in millions of copies).
* **JP_Sales**: Sales of the video game in Japan (in millions of copies).
* **Other_Sales**: Sales of the video game in other regions (in millions of copies).
* **Global_Sales**: Total sales of the video game (in millions of copies).

Because of the structure of the dataset, there are **some caveats** we should take into account:

- We don't have sales over time for each game, **only the year it was released**. This matters because we lack some insightful information, for example: which kinds of games sell mostly in their release year and which sustain their sales longer, what the lifetime of a game looks like, and how discounts on older games affect their sales.
- New consoles are often backwards compatible with prior consoles, so older games can be played on the new platform. This kind of information is not available in the dataset.
- It would be interesting to know which games were bundled with a console, as that increases their number of sales.
- Data about the price of each game would have given us a lot of interesting information: the revenue collected by publisher or console, the importance of price for sales, whether price depends on genre, and so on.
- Often, the launch date of a game differs by region. We have just one date for all regions, so we can't analyze how that difference affects sales.

In [None]:
data = pd.read_csv("../input/videogamesales/vgsales.csv")

In [None]:
data.info()

Looking at the structure of the data, we see some issues we need to fix before starting the analysis:
- **Year** is recorded as a **float**. We will convert it to an integer, in order to present a cleaner format in the visualizations.
- We can see there are at least two variables with **missing values**: **Year** and **Publisher**, as they have fewer than 16,598 non-null entries.

When working with data, we need to understand and **handle missing data with care**, as they can have a significant effect on the conclusions drawn from the data. Some of the problems they can cause are:
- They can reduce the statistical power of the analysis, making a test more likely to fail to reject the null hypothesis when it is actually false. 
- They can bias the estimation of parameters. 
- They can reduce the representativeness of the samples. 

Also, when looking for **missing values, it is important to keep in mind that they can come in different formats**. Sometimes missing values have been replaced by zeros, N/A or other terms. That's why it is important to understand our variables and evaluate the different values they take.
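
As an illustration of the point above, here is a minimal sketch of how disguised missing values could be detected. The toy DataFrame, the game names and the placeholder strings (`"N/A"`, `"Unknown"`) are assumptions made for the example, not values confirmed to exist in `vgsales.csv`.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical titles; the placeholders are assumptions
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2003.0, np.nan, 0.0, 2010.0],
    "Publisher": ["Nintendo", "N/A", "Unknown", None],
})

# Real NaN/None values are caught by isnull() ...
explicit_missing = toy.isnull().sum()

# ... but placeholder terms are not, so we map them to NaN first
placeholders = ["N/A", "Unknown", ""]
cleaned = toy.replace(placeholders, np.nan)
cleaned.loc[cleaned["Year"] == 0, "Year"] = np.nan  # a release year of 0 is not plausible

all_missing = cleaned.isnull().sum()
print(explicit_missing)
print(all_missing)
```

Comparing the two counts shows how many missing values would go unnoticed if only `isnull()` were used.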

After exploring our variables, **the only missing values we have are:**

In [None]:
data[['Year', 'Publisher']].isnull().sum()

Since some games are missing both Year and Publisher, there are **307 games with at least one missing value** in total. This represents **less than 2%** of the 16,598 games in the dataset.

In [None]:
print("Number of games with a missing value: ", len(data[(data['Year'].isnull()) | (data['Publisher'].isnull())]))
print("Total global sales for those games with any missing value", data[(data['Year'].isnull()) | (data['Publisher'].isnull())]['Global_Sales'].sum())

**In terms of sales**, the 108 million copies sold by games with **missing values represent around 1%** of total sales.
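
The two shares quoted above (share of games and share of sales affected by missing values) can be computed with a boolean mask. This is a minimal sketch on made-up numbers, so the resulting percentages are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names match the dataset
toy = pd.DataFrame({
    "Year": [2006.0, np.nan, 2008.0, 2010.0, np.nan],
    "Publisher": ["A", "B", None, "C", None],
    "Global_Sales": [40.0, 1.0, 0.5, 8.0, 0.5],
})

# True for every row where Year or Publisher is missing
has_missing = toy[["Year", "Publisher"]].isnull().any(axis=1)

# The mean of a boolean mask is the fraction of True values
share_of_games = has_missing.mean() * 100
share_of_sales = toy.loc[has_missing, "Global_Sales"].sum() / toy["Global_Sales"].sum() * 100

print(f"Games with a missing value: {share_of_games:.1f}%")
print(f"Sales from those games:     {share_of_sales:.1f}%")
```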

In [None]:
data[(data['Year'].isnull()) | (data['Publisher'].isnull())].head(10)

In order to **improve our dataset**, we are going to **fill in the missing information** for the most important games in terms of sales. Only 10 games, those with global sales above 2 million, account for around 30% of all the sales made by games with missing information, so we will fill in the missing values for these games by hand.

In [None]:
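# Fill in the missing values of the 10 best-selling games with incomplete data.
# Positions assume the original row order of the CSV; column 3 is Year and column 5 is Publisher.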
data.iloc[179, 3] = 2003
data.iloc[377, 3] = 2003
data.iloc[431, 3] = 2008
data.iloc[470, 3] = 2005
data.iloc[470, 5] = 'THQ'
data.iloc[607, 3] = 1980
data.iloc[624, 3] = 2007
data.iloc[649, 3] = 2001
data.iloc[652, 3] = 2008
data.iloc[711, 3] = 2006
data.iloc[782, 3] = 2008

# Drop the other rows with missing values
data.dropna(inplace=True)

# I also change the type for year
data['Year'] = data['Year'].astype('int64')
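
Positional fixes with `iloc` depend on the exact row order of the file. A more defensive variant is to look rows up by name and platform. The sketch below uses hypothetical titles and a hypothetical `known_years` dictionary to show the idea, and it is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the games with a missing Year
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Platform": ["PS2", "Wii", "PS2"],
    "Year": [np.nan, 2008.0, np.nan],
})

# Hypothetical lookup of manually researched release years
known_years = {("Game A", "PS2"): 2003}

for (name, platform), year in known_years.items():
    mask = (toy["Name"] == name) & (toy["Platform"] == platform)
    # Only fill rows that are actually missing, never overwrite existing values
    toy.loc[mask & toy["Year"].isnull(), "Year"] = year

# Drop what could not be fixed, then convert Year to an integer type
toy = toy.dropna(subset=["Year"]).copy()
toy["Year"] = toy["Year"].astype("int64")
print(toy)
```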

Let's now see some very **basic numbers** to get an idea of the data we will work with:

In [None]:
print("Total number of sales:", data['Global_Sales'].sum())
print("Num. of games: ", len(data))
print("Num. of different titles: ", data['Name'].nunique())
print("Num. of publishers: ", data['Publisher'].nunique())
print("Num. of platforms: ", data['Platform'].nunique())
print("Num. of genres: ", data['Genre'].nunique())

Notes:
* The first clarification needed to understand the data is that sales are given in millions. We have 8,920 million copies sold for all the 16,598 games released between 1980 and 2016.<br>
* There are 11,493 games with different names, while the remaining 5,105 games share a name with a game released on a different platform.
* The data were collected in October 2016, so for that year we only have information until the end of September.

In [None]:
# There are 4 games with Year>2016. As this is an error, and their sales insignificant, we drop them.
df = data[data['Year']<=2016]

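# There is also a problem with the DS game 'Strongest Tokyo University Shogi DS', which is tagged as released in 1985.
# .copy() makes df an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning.
df = df[df['Name'] != 'Strongest Tokyo University Shogi DS'].copy()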

# **EDA**

My first intuition when I think about videogames is that the industry has been growing steadily since the 90's, so I would expect sales to behave the same way.<br>
Instead, we can see that this is not completely true. There was a **peak of sales around 2008**, and since then the releases of later years have not been able to reach that level of sales.<br> 
*The dashed line separates 2016 because the data for that year are incomplete (they were collected in October 2016).*<br><br>

Let's see how the total 8,920 million sales are distributed across the years:

In [None]:
ax = df.groupby('Year')[['NA_Sales', 'JP_Sales', 'EU_Sales', 'Other_Sales']].sum().plot(kind="bar", stacked=True)
plt.title("Global Sales by Year and Region")
plt.xlabel("Year")
plt.ylabel("Global Sales")
L = plt.legend(loc="upper left")
L.get_texts()[0].set_text('North America')
L.get_texts()[1].set_text('Japan')
L.get_texts()[2].set_text('Europe')
L.get_texts()[3].set_text('Other countries')
sns.despine()
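# Dashed line just before the 2016 bar; bars sit at positions 0..n-1 in Year order
ax.axvline(sorted(df['Year'].unique()).index(2016) - 0.5, color='grey', linestyle='--', lw=2);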

*Note: to understand the graph, we have to take into account that all the sales of each videogame are attributed to its release year.* <br><br>
It seems that videogame sales have behaved very similarly in all regions. The videogame industry started to be a thing in the 80's, but it was not until the 90's that it really took off, reaching its peak in 2008. Nevertheless, **with the information we have, we cannot say whether sales have declined since 2008, only that the games released that year generated the highest sales.**

We see that **the games that generated such good sales in 2008 are, in particular, Mario Kart and Grand Theft Auto IV:**

In [None]:
df[df['Year']==2008].head(10)

However, if we look at the data we see that the variation in sales has more to do with the number of releases per year than with the number of sales per game:

In [None]:
fig, ax1 = plt.subplots(figsize=(12,6))

data1 = df.groupby('Year')['Name'].count()
color = 'tab:grey'
ax1.set_xlabel('Year')
ax1.xaxis.label.set_size(14)
ax1.set_ylabel('Number of releases', color=color)
ax1.yaxis.label.set_size(16)
ax1.plot(data1, color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
data2 = df.groupby('Year')[ 'Global_Sales'].sum()/df['Year'].value_counts().sort_index()
color = 'tab:blue'
ax2.set_ylabel('Average sales by release', color=color)  # we already handled the x-label with ax1
ax2.yaxis.label.set_size(16)
ax2.plot(data2, color=color)
ax2.set_ylim([0, 8])
ax2.tick_params(axis='y', labelcolor=color)

plt.title("Number of releases vs. Average sales by release (by Year)", size=16)
fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.show()  # the extra plt.figure() call was removed, as it only produced an empty figure

The **grey line**, which represents the **number of releases by year**, shows practically the **same distribution** as the one we saw before for **Global Sales by year**.<br>
The **blue line, average sales by release**, confirms the idea that the **drop in sales from 2008** has more to do with a **smaller number of releases every year** than with a strong decline in sales per videogame. Except for the first years, when just a few games were available, the average sales by release have remained quite constant since the videogame industry started to grow in the 90's.
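
The relationship between the two lines can be made explicit: total sales per year equal the number of releases times the average sales per release. Here is a minimal sketch with made-up numbers, in which total sales change from year to year only because the number of releases changes.

```python
import pandas as pd

# Hypothetical releases: every game sells about the same, but the count per year varies
toy = pd.DataFrame({
    "Year": [2007, 2007, 2008, 2008, 2008, 2009],
    "Global_Sales": [4.0, 2.0, 3.0, 3.0, 3.0, 3.0],
})

# Named aggregation: one column per statistic
per_year = toy.groupby("Year")["Global_Sales"].agg(
    releases="count", total_sales="sum", avg_sales="mean"
)
print(per_year)
```

In this toy example the average stays at 3.0 every year while the total follows the number of releases, which is the pattern described above.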

With respect to this information, two questions naturally arise: 

- Did something change in publishers' business models?

In fact, in recent years the **video games industry has moved towards** what has been called **Gaming-as-a-Service**. In this model, publishers try to diversify their revenue so that it does not come only from the initial sale of the video game, and to extend the lifetime of their games. Some of the ways they have found to do this are in-game purchases (for example additional items such as weapons or clothing), paid online multiplayer, and new maps or levels.

- Have the sales behaved the same way in the four different regions?

Global sales are not equally distributed across regions. About **half of historical global sales were made in North America**, followed by **Europe with 27%** of global sales, while **Japan only represents 14.5%**.

In [None]:
round((df[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum())/df['Global_Sales'].sum()*100, 1)

The following graph will help us analyze the relative contribution of the four regions over the years: 

In [None]:
df_regions = df[['Global_Sales', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Year']]
df_regions = df_regions.groupby('Year').sum()
df_regions['NA_Sales%'] = df_regions['NA_Sales']/df_regions['Global_Sales']
df_regions['EU_Sales%'] = df_regions['EU_Sales']/df_regions['Global_Sales']
df_regions['JP_Sales%'] = df_regions['JP_Sales']/df_regions['Global_Sales']
df_regions['Other_Sales%'] = df_regions['Other_Sales']/df_regions['Global_Sales']

# DataFrame.plot creates its own figure, so the extra plt.figure() calls (which only produced empty figures) are not needed
df_regions.iloc[:, 5:].plot()
plt.title("Region Sales proportion (by Year)", size=16)
sns.despine()
plt.show()
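
The per-year shares computed above can also be obtained in one step with `DataFrame.div`. This sketch uses made-up numbers and divides by the sum of the four regional columns instead of `Global_Sales`, an assumption that can differ slightly from the notebook's result if the published totals are rounded.

```python
import pandas as pd

# Hypothetical regional sales (in millions)
toy = pd.DataFrame({
    "Year": [1990, 1990, 2015],
    "NA_Sales": [2.0, 1.0, 1.0],
    "EU_Sales": [0.5, 0.5, 1.0],
    "JP_Sales": [1.0, 1.0, 0.25],
    "Other_Sales": [0.0, 0.0, 0.25],
})
region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

yearly = toy.groupby("Year")[region_cols].sum()

# Divide each row by its own total, so every row sums to 1
shares = yearly.div(yearly.sum(axis=1), axis=0)
print(shares.round(3))
```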

During the first 15 years, North America and Japan took turns leading the sales, but in the mid-90's North America consolidated its position as the leading region. <br>
It is interesting to notice how **since 2008 the weight of North America in global sales has diminished steadily** year after year, while **Europe, conversely,** has been increasing its importance slowly but steadily since 1980, to the point that in 2015 it equaled the weight of North America. In fact, it seems very likely that in 2016 Europe will be the leading region. <br>

The case of **Japan** is quite intriguing, as its weight in global sales has been **very erratic**. After competing with North America for the leadership in sales during the first 15 years, its share went down noticeably in 1995.

It seems relevant to link the information in this graph to the insights we got from the previous graphs. Even though sales in all regions have diminished since 2008, this reduction has not been proportional, as **in North America the decrease has been proportionally larger**.

### PLATFORMS

Let's now deepen our knowledge of the videogame industry by looking at the consoles that led the market during all these years.<br>
What does the distribution of the number of games by platform look like?

In [None]:
plt.figure(figsize=(12, 6))
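# Pass the column by keyword: recent seaborn versions no longer accept it as a positional argument
sns.countplot(x='Platform', data=df, order=df['Platform'].value_counts().index, color='darkblue')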
plt.xticks(rotation=90)
plt.title("Number of games by Platform")
sns.despine();

**PlayStation 2** and **Nintendo DS** are clearly the consoles that **received the most games**, while only a few games were released for the oldest platforms.

Could this be related to how long the consoles were in production and receiving games?<br> 
Let's look at the number of **years** that **each platform received video games**:

In [None]:
df.groupby('Platform')['Year'].nunique().sort_values(ascending=False).head(10)

We see that the **number of years Nintendo DS and PS2 were active does not explain the difference in the number of games** with respect to the other platforms.

In the next graph, we have ordered all the platforms according to their total sales.

In [None]:
plt.figure(figsize=(18, 10))
# sum() takes no column argument here; the column is already selected
df.groupby('Platform')['Global_Sales'].sum().sort_values().plot(kind='barh')
plt.xlabel("Historical Global Sales")
plt.title("Historical Global Sales by Platform")
sns.despine();
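
To quantify how concentrated these sales are, a cumulative share can be computed over the ranked platforms. This is a minimal sketch on a hypothetical Series (`P1` to `P5` are invented platform names); on the real data the same steps would be applied to the per-platform totals.

```python
import pandas as pd

# Hypothetical total sales per platform (in millions)
platform_sales = pd.Series({"P1": 50.0, "P2": 30.0, "P3": 10.0, "P4": 6.0, "P5": 4.0})

# Rank platforms from best to worst seller, then accumulate their share of the total
ranked = platform_sales.sort_values(ascending=False)
cumulative_share = ranked.cumsum() / ranked.sum() * 100

top_n = 2
print(f"Top {top_n} platforms: {cumulative_share.iloc[top_n - 1]:.1f}% of sales")
```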

It is interesting to notice that **just 6 out of the 31 platforms account for around 64% of the historical global sales.** <br>
Also, we see that in general the order of the platforms in both graphs, by global sales and by number of releases, coincides quite a lot. However, we can see some differences for platforms such as Nintendo DS, X360 or Nintendo, which leads us to think that it would be interesting to compare the platforms by their average sales per game:<br>

In [None]:
# Select only the numeric column: the grouping key 'Platform' already becomes the index
a = df.groupby('Platform')[['Global_Sales']].sum().sort_values('Global_Sales', ascending=False)
b = pd.DataFrame(df['Platform'].value_counts())
b.columns = ['NGames']
a = a.merge(b, how='inner', left_index=True, right_index=True)
a['Sales_NGames'] = a['Global_Sales']/a['NGames']
a['Sales_NGames'].sort_values(ascending=False).plot(kind='bar')
plt.ylabel('Copies sold per game (in M.)')
plt.xlabel('Platform')
plt.title('Number of copies sold (in M.) by game released')
sns.despine();

Indeed, we see some interesting insights. The 4 **platforms with the best ratio of sales to number of releases are Game Boy, Nintendo, Sega Genesis and Super Nintendo**, all of them consoles from the late 80's/early 90's. <br>
After all these old consoles comes the PlayStation 4. This is Sony's latest console and, even though it has not yet reached the sales of its predecessors, its ratio of sales to number of releases is quite good.<br>
The ratio of the Xbox 360 is also notably good. The X360 is the second console in history by number of sales, just behind the PS2, yet its ratio of sales per game is much better than that of the PS2.

If we just focus on global sales, three of these top-selling platforms belong to the PlayStation family: PlayStation, PlayStation 2 and PlayStation 3. For the moment, the PlayStation 4 is still far from the numbers of its predecessors, but the same is true of the XOne and the other consoles of its generation.<br><br>
We can group all the consoles depending on the year they were released. We are going to focus on the two latest generations:
* **Generation 3**: PS3, Wii, X360, PSP, DS 
* **Generation 4**: PS4, WiiU, XOne, PSV, 3DS 

Will the videogame sales of the 4th generation consoles get close to those of the 3rd generation?

In [None]:
platforms = df.groupby('Platform').agg(
    year_min = pd.NamedAgg(column="Year", aggfunc="min"),
    year_max = pd.NamedAgg(column="Year", aggfunc="max"))

bins=[1979,2000,2005,2011,2020]

labels=['Gen1','Gen2','Gen3','Gen4']
platforms['Generation']= pd.cut(platforms["year_min"] + ((platforms["year_max"] - platforms["year_min"]) / 2), bins , labels=labels)

platform_gsales = df.groupby('Platform')['Global_Sales'].sum()
platforms = platforms.join(platform_gsales, how='left').reset_index()
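
The generation labels above come from binning the midpoint of each platform's active years with `pd.cut`, whose intervals are closed on the right by default. Below is a minimal sketch on hypothetical platforms (`PlatA` to `PlatD` and their years are invented) that reuses the same bins and labels.

```python
import pandas as pd

# Hypothetical first and last release year per platform
platforms_toy = pd.DataFrame(
    {"year_min": [1994, 2000, 2006, 2013], "year_max": [2003, 2011, 2016, 2016]},
    index=["PlatA", "PlatB", "PlatC", "PlatD"],
)

bins = [1979, 2000, 2005, 2011, 2020]
labels = ["Gen1", "Gen2", "Gen3", "Gen4"]

# Midpoint of the years in which each platform received games
midpoint = platforms_toy["year_min"] + (platforms_toy["year_max"] - platforms_toy["year_min"]) / 2

# Intervals are (1979, 2000], (2000, 2005], (2005, 2011], (2011, 2020]
platforms_toy["Generation"] = pd.cut(midpoint, bins, labels=labels)
print(platforms_toy)
```

Note how `PlatB`, active from 2000 to 2011, has a midpoint of 2005.5 and lands in `Gen3`. A platform with a long tail of late releases can be pushed into a later bin this way, which is presumably why the PS2 is reassigned to `Gen2` by hand in the next cell.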

The next graph shows the evolution of sales for the 3rd generation consoles:

In [None]:
last_10years = df[df['Year']>=2005]
last_10years = pd.merge(last_10years,platforms[['Platform','Generation']],on='Platform', how='left')
last_10years.loc[last_10years['Platform']=='PS2', 'Generation'] = 'Gen2'
last_10years_3gen = last_10years[last_10years['Generation']=='Gen3']

last_15years = df[df['Year']>=2000]
last_15years = pd.merge(last_15years,platforms[['Platform','Generation']],on='Platform', how='left')
last_15years.loc[last_15years['Platform']=='PS2', 'Generation'] = 'Gen2'
last_15years_3gen = last_15years[last_15years['Generation']=='Gen3']

top5_platform_list_3gen = last_15years_3gen.groupby(['Platform'])['Global_Sales'].sum().sort_values(ascending=False).head(5).index

top5_platform_df_3gen = last_15years_3gen[last_15years_3gen.Platform.isin(top5_platform_list_3gen)]
fig, (ax0,ax1) = plt.subplots(2,2, figsize=(17,10))

fig.suptitle('Top 5 Platform Sales (in Millions) by Region - 3rd Generation', fontsize=14, fontweight = 'bold')

sns.lineplot(x='Year', y='NA_Sales', hue='Platform', data=top5_platform_df_3gen, ci=None, ax=ax0[0])
sns.lineplot(x='Year', y='EU_Sales', hue='Platform', data=top5_platform_df_3gen, ci=None, ax=ax0[1])
sns.lineplot(x='Year', y='JP_Sales', hue='Platform', data=top5_platform_df_3gen, ci=None, ax=ax1[0])
sns.lineplot(x='Year', y='Other_Sales', hue='Platform', data=top5_platform_df_3gen, ci=None, ax=ax1[1])

ax0[0].legend(loc='upper left')
ax0[1].legend(loc='upper left')
ax1[0].legend(loc='upper left')
ax1[1].legend(loc='upper left')

# ax1[1].set_ylim(-0.1,1.6)

ax0[0].set_ylabel('NA Sales (in Millions)', fontsize=16)
ax0[1].set_ylabel('EU Sales (in Millions)', fontsize=16)
ax1[0].set_ylabel('Japan Sales (in Millions)', fontsize=16)
ax1[1].set_ylabel('Other Sales (in Millions)', fontsize=16)

ax0[0].set_xlabel('Year', fontsize=16)
ax0[1].set_xlabel('Year', fontsize=16)
ax1[0].set_xlabel('Year', fontsize=16)
ax1[1].set_xlabel('Year', fontsize=16)

sns.despine()

plt.show()

All the consoles of the **3rd generation** have shown **good and steady behaviour**, maintaining a good pace of sales during the last 10 years. However, we see that they are nearing the end of their lifetime, as in 2015 their sales dropped considerably.<br><br>

There is a big peak in sales for the Nintendo Wii in 2006 due to the Wii Sports game. As the *Wii console was sold in a bundle with Wii Sports*, one sale of this game is counted for every console sold. We could treat the sales of this game as an outlier.
<br><br>

We can also see that the **distribution of sales is quite similar across regions, except in Japan**. For example, the X360 sells a good number of games in North America and Europe, but in Japan it sells practically nothing. The opposite happens with the PSP: it sells quite well in Japan but not much in the other regions.
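
A note on reading these line plots: when several rows share the same x value, seaborn's `lineplot` aggregates them with the mean by default, so each point is the average sales per game for that platform and year, not the yearly total. A single very large title, such as a bundled game, can therefore dominate a point. The sketch below illustrates the difference with pandas on made-up numbers; passing `estimator='sum'` to `lineplot` should give totals instead.

```python
import pandas as pd

# Hypothetical sales: one bundled blockbuster and two small titles in the same year
toy = pd.DataFrame({
    "Year": [2006, 2006, 2006, 2007],
    "Platform": ["Wii", "Wii", "Wii", "Wii"],
    "NA_Sales": [41.0, 0.5, 0.5, 3.0],
})

grouped = toy.groupby(["Platform", "Year"])["NA_Sales"]
mean_per_game = grouped.mean()   # what lineplot shows by default
total_per_year = grouped.sum()   # what a plot of total sales would show

print(pd.DataFrame({"mean_per_game": mean_per_game, "total_per_year": total_per_year}))
```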


In [None]:
last_10years_4gen = last_10years[last_10years['Generation']=='Gen4']

top5_platform_list_4gen = last_10years_4gen.groupby(['Platform'])['Global_Sales'].sum().sort_values(ascending=False).head(5).index

top5_platform_df_4gen = last_10years_4gen[last_10years_4gen.Platform.isin(top5_platform_list_4gen)]
fig, (ax0,ax1) = plt.subplots(2,2, figsize=(17,10))

fig.suptitle('Top 5 Platform Sales (in Millions) by Region - 4th Generation', fontsize=14, fontweight = 'bold')

sns.lineplot(x='Year', y='NA_Sales', hue='Platform', data=top5_platform_df_4gen, ci=None, ax=ax0[0])
sns.lineplot(x='Year', y='EU_Sales', hue='Platform', data=top5_platform_df_4gen, ci=None, ax=ax0[1])
sns.lineplot(x='Year', y='JP_Sales', hue='Platform', data=top5_platform_df_4gen, ci=None, ax=ax1[0])
sns.lineplot(x='Year', y='Other_Sales', hue='Platform', data=top5_platform_df_4gen, ci=None, ax=ax1[1])

ax0[0].legend(loc='upper left')
ax0[1].legend(loc='upper left')
ax1[0].legend(loc='upper left')
ax1[1].legend(loc='upper left')

# ax1[1].set_ylim(-0.1,1.6)

ax0[0].set_ylabel('NA Sales (in Millions)', fontsize=16)
ax0[1].set_ylabel('EU Sales (in Millions)', fontsize=16)
ax1[0].set_ylabel('Japan Sales (in Millions)', fontsize=16)
ax1[1].set_ylabel('Other Sales (in Millions)', fontsize=16)

ax0[0].set_xlabel('Year', fontsize=16)
ax0[1].set_xlabel('Year', fontsize=16)
ax1[0].set_xlabel('Year', fontsize=16)
ax1[1].set_xlabel('Year', fontsize=16)

sns.despine()

plt.show()

The first thing we see is that it doesn't seem possible for the 4th generation to reach the number of sales made by its predecessor, as its sales are also falling sharply.<br>
Even though the 4th generation achieved good figures in the beginning, it has been suffering a steep decline since then and its future does not seem to be very bright.
<br><br>
At this moment, the gap in sales between the two is striking. The 3rd generation has sold around 3,914 million videogames, while the 4th generation has so far sold just 809 million.

Nevertheless, it would be unfair to compare total sales, as the 3rd generation has been on the market for more years. So, in order to compare their behaviour properly, let's compare both generations over their first 6 years after launch.

In [None]:
gen3 = ['DS', 'PS3', 'X360', 'Wii']
gen4 = ['3DS', 'PS4', 'XOne', 'WiiU']
# .copy() avoids a SettingWithCopyWarning when the new column is added
df_gen3_gen4 = df[(df['Platform'].isin(gen3)) | (df['Platform'].isin(gen4))].copy()
df_gen3_gen4['Generation'] = np.where(df_gen3_gen4['Platform'].isin(gen3), 'Gen3', 'Gen4')
# df_gen3_gen4.groupby('Generation').sum('Global_Sales')

In [None]:
# We keep only the first 6 years of each generation (Gen3 starts in 2004, Gen4 in 2011):
df_gen3_gen4_6years = df_gen3_gen4.drop(df_gen3_gen4[(df_gen3_gen4['Generation']=='Gen3') & (df_gen3_gen4['Year']>2009)].index)
df_gen3_gen4_6years.loc[:, 'Relative_Year'] = df_gen3_gen4_6years.apply(lambda x: (x['Year'] - 2004) if x['Generation'] == 'Gen3' else (x['Year'] - 2011), axis=1)

fig, (ax0,ax1) = plt.subplots(2,2, figsize=(17,10))

fig.suptitle('Generation comparative in their 6 first years', fontsize=14, fontweight = 'bold')

sns.lineplot(x='Relative_Year', y='NA_Sales', hue='Generation', data=df_gen3_gen4_6years, ci=None, ax=ax0[0])
sns.lineplot(x='Relative_Year', y='EU_Sales', hue='Generation', data=df_gen3_gen4_6years, ci=None, ax=ax0[1])
sns.lineplot(x='Relative_Year', y='JP_Sales', hue='Generation', data=df_gen3_gen4_6years, ci=None, ax=ax1[0])
sns.lineplot(x='Relative_Year', y='Other_Sales', hue='Generation', data=df_gen3_gen4_6years, ci=None, ax=ax1[1])

ax0[0].legend(loc='upper right')
ax0[1].legend(loc='upper right')
ax1[0].legend(loc='upper right')
ax1[1].legend(loc='upper right')

# ax1[1].set_ylim(-0.1,1.6)

ax0[0].set_ylabel('NA Sales (in Millions)', fontsize=16)
ax0[1].set_ylabel('EU Sales (in Millions)', fontsize=16)
ax1[0].set_ylabel('Japan Sales (in Millions)', fontsize=16)
ax1[1].set_ylabel('Other Sales (in Millions)', fontsize=16)

ax0[0].set_xlabel('Years after launch', fontsize=16)
ax0[1].set_xlabel('Years after launch', fontsize=16)
ax1[0].set_xlabel('Years after launch', fontsize=16)
ax1[1].set_xlabel('Years after launch', fontsize=16)

plt.show()

It is interesting that it took one more year for the 4th generation to show noticeable growth in sales. This is a result of the PS4 and XOne launching later than the other consoles of their generation. In North America, the leading region in sales, we can see that the 4th generation did not reach the same level of sales and also lost momentum much sooner than the 3rd generation did. 
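
The alignment used for this comparison, counting years from each generation's first year (2004 and 2011 in the cell above), can be sketched in a vectorized way with a lookup dictionary. The rows below are hypothetical and `base_year` is a name introduced only for this example.

```python
import numpy as np
import pandas as pd

gen3 = ["DS", "PS3", "X360", "Wii"]
gen4 = ["3DS", "PS4", "XOne", "WiiU"]

# Hypothetical releases; only Platform and Year matter here
toy = pd.DataFrame({
    "Platform": ["DS", "Wii", "PS4", "3DS", "X360"],
    "Year": [2004, 2009, 2013, 2011, 2012],
})

toy["Generation"] = np.where(toy["Platform"].isin(gen3), "Gen3", "Gen4")

# First year of each generation, as used in the notebook
base_year = {"Gen3": 2004, "Gen4": 2011}
toy["Relative_Year"] = toy["Year"] - toy["Generation"].map(base_year)

# Keep the first 6 years (relative years 0 to 5) of each generation
first_six = toy[toy["Relative_Year"].between(0, 5)]
print(first_six)
```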

In [None]:
plt.figure(figsize=(12,6))
sns.lineplot(x='Year', y='Global_Sales', data=df_gen3_gen4[df_gen3_gen4['Generation']=='Gen3'], ci=None)
plt.show();

Nevertheless, if we analyze the complete graph for the third generation, we notice that it had a resurgence in sales in 2013 due to the launch of the videogame "Grand Theft Auto V". In fact:

In [None]:
print("Sales 3rd Gen first 6 years: ", df_gen3_gen4[(df_gen3_gen4['Generation']=='Gen3') & (df_gen3_gen4['Year']<=2009)]['Global_Sales'].sum())
print("Sales 3rd Gen after 2009: ", df_gen3_gen4[(df_gen3_gen4['Generation']=='Gen3') & (df_gen3_gen4['Year']>2009)]['Global_Sales'].sum())

Around 42% of the global videogame sales of the 3rd generation correspond to games released after the first 6 years. So, we can't rule out similar behaviour in the 4th generation. 

### PUBLISHERS

We already saw that there are quite a lot of publishers, 576, and as we can see in the next plot, video game releases are very unequally distributed among them.

In [None]:
plt.figure(figsize=(12, 6))
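# Pass the column by keyword: recent seaborn versions no longer accept it as a positional argument
sns.countplot(x='Publisher', data=df, order=df['Publisher'].value_counts().index, color='darkblue')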
plt.title('Number of games by Publisher')
plt.xticks(rotation=90)
sns.despine();

So let's find out which are the 20 main videogame publishers in history by looking at their global sales:

In [None]:
plt.figure(figsize=(18, 10))
platforms_global = df.groupby('Publisher')['Global_Sales'].sum().sort_values(ascending=False).head(20)
platforms_global.sort_values().plot(kind='barh')
plt.xlabel("Historical Global Sales")
plt.title("Historical Global Sales by Publisher")
sns.despine();

- **Nintendo** is by far the **best-selling** publisher, almost doubling the second best, Electronic Arts.
- We can see how the number of releases per publisher falls off in a roughly exponential way.
- In fact, just these **top 20 publishers** account for around **86% of the historical global sales**.

In [None]:
df.groupby('Publisher')['Global_Sales'].sum().sort_values(ascending=False).head(20).sum()/df['Global_Sales'].sum()

Turning our attention to the small publishers, more than half of them, 325, have released only 1 or 2 video games. I wonder whether this is due to an increase in small new publishers in recent years or whether it has been a constant over time. In the next graph, we can see that this phenomenon has always occurred:

In [None]:
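# Publishers with fewer than 3 releases; count the distinct ones that released a game each year
release_counts = df['Publisher'].value_counts()
small_publishers = release_counts[release_counts < 3].index
df[df['Publisher'].isin(small_publishers)].groupby('Year')['Publisher'].nunique().plot(kind='bar')
plt.ylabel("Number of Publishers")
sns.despine();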


Going back to the top publishers, how are their sales distributed over the years and across the different regions?

In [None]:
top5_publisher_list = last_10years.groupby(['Publisher'])['Global_Sales'].sum().sort_values(ascending=False).head(5).index

top5_publisher_df = last_10years[last_10years.Publisher.isin(top5_publisher_list)]
fig, (ax0,ax1) = plt.subplots(2,2, figsize=(17,10))

fig.suptitle('Top 5 Publisher Sales (in Millions) by Region', fontsize=14, fontweight = 'bold')

sns.lineplot(x='Year', y='NA_Sales', hue='Publisher', data=top5_publisher_df, ci=None, ax=ax0[0])
sns.lineplot(x='Year', y='EU_Sales', hue='Publisher', data=top5_publisher_df, ci=None, ax=ax0[1])
sns.lineplot(x='Year', y='JP_Sales', hue='Publisher', data=top5_publisher_df, ci=None, ax=ax1[0])
sns.lineplot(x='Year', y='Other_Sales', hue='Publisher', data=top5_publisher_df, ci=None, ax=ax1[1])

ax0[0].legend(loc='upper left')
ax0[1].legend(loc='upper left')
ax1[0].legend(loc='upper left')
ax1[1].legend(loc='upper left')

# ax1[1].set_ylim(-0.1,1.6)

ax0[0].set_ylabel('NA Sales (in Millions)', fontsize=16)
ax0[1].set_ylabel('EU Sales (in Millions)', fontsize=16)
ax1[0].set_ylabel('Japan Sales (in Millions)', fontsize=16)
ax1[1].set_ylabel('Other Sales (in Millions)', fontsize=16)

ax0[0].set_xlabel('Year', fontsize=16)
ax0[1].set_xlabel('Year', fontsize=16)
ax1[0].set_xlabel('Year', fontsize=16)
ax1[1].set_xlabel('Year', fontsize=16)

sns.despine()

plt.show()

While most of the publishers show many ups and downs, we see that Electronic Arts has maintained a steady rise. The big peaks correspond to the launches of games that sold especially well. For example, the big peak of Take-Two in 2013 is due to the launch of Grand Theft Auto V. <br>
It is also quite striking that in Japan it seems that only Nintendo has sold games. Comparing the evolution of the different companies in the last 10 years, we see that **the market in Japan has been practically monopolized by Nintendo.** <br>
**In the other regions**, the videogames market is **quite fragmented**, with companies such as Electronic Arts, Ubisoft, Nintendo, Activision and Take-Two Interactive fighting to lead the market.

Focusing on the Japanese market to try to understand its particularities, we can find an interesting insight in the next table, which shows the best-selling publishers of the last 10 years. In addition to the high concentration of sales by Nintendo, we can see that almost all of the top-selling companies are Japanese.

In [None]:
last_10years.groupby('Publisher')[['JP_Sales']].sum().sort_values('JP_Sales', ascending=False).head(10)

This kind of insularity in the Japanese market is not the only aspect that sets it apart from the other markets. We'll see later how genre preferences in Japan are very different from those in the other regions.

If we list the top-selling games of these publishers, we see that in general the publishers have a small number of original games with a big success. The publishers then try to exploit the success of these games by launching sequels and successive instalments of the same game.<br>
For example, grouping these franchises, we can see this effect for the three main historical publishers.

- **NINTENDO**:

In [None]:
# Start from the original title, then overwrite it with a franchise label where a pattern matches
df['Name_grouped'] = df['Name']
df.loc[(df['Name'].str.contains("Mario")) & (df['Publisher']=='Nintendo'), 'Name_grouped'] = "Mario Bros - FRANCHISE"
# Match the name with or without the accent
df.loc[(df['Name'].str.contains("Pok[eé]mon")) & (df['Publisher']=='Nintendo'), 'Name_grouped'] = "Pokemon - FRANCHISE"
df.loc[df['Name'].str.contains("Wii Sports"), 'Name_grouped'] = 'Wii Sports - FRANCHISE'
df.loc[df['Name'].str.contains("Wii Fit"), 'Name_grouped'] = 'Wii Fit - FRANCHISE'
df[df['Publisher']=='Nintendo'].groupby('Name_grouped')[['Global_Sales']].sum().sort_values('Global_Sales', ascending=False).head(10)

- **ELECTRONIC ARTS**:

In [None]:
df.loc[df['Name'].str.contains("Need for Speed"), 'Name_grouped'] = 'Need for Speed - FRANCHISE'
df.loc[df['Name'].str.contains("Need For Speed"), 'Name_grouped'] = 'Need for Speed - FRANCHISE'
df.loc[df['Name'].str.contains("Battlefield"), 'Name_grouped'] = 'Battlefield - FRANCHISE'
df.loc[df['Name'].str.contains("Medal of Honor"), 'Name_grouped'] = 'Medal of Honor - FRANCHISE'
df.loc[df['Name'].str.contains("Madden NFL"), 'Name_grouped'] = 'Madden NFL - FRANCHISE'
df.loc[df['Name'].str.contains("The Sims"), 'Name_grouped'] = 'The Sims - FRANCHISE'
df.loc[(df['Name'].str.contains("FIFA")) & (df['Publisher']=='Electronic Arts'), 'Name_grouped'] = "FIFA - FRANCHISE"
df.loc[(df['Name'].str.contains("Harry Potter")) & (df['Publisher']=='Electronic Arts'), 'Name_grouped'] = "Harry Potter - FRANCHISE"
df[df['Publisher']=='Electronic Arts'].groupby('Name_grouped')[['Global_Sales']].sum().sort_values('Global_Sales', ascending=False).head(10)

- **ACTIVISION**:

In [None]:
df.loc[(df['Name_grouped'].str.contains("Call of Duty")) & (df['Publisher']=='Activision'), 'Name_grouped'] = "Call of Duty - FRANCHISE"
df.loc[(df['Name_grouped'].str.contains("Tony Hawk's")) & (df['Publisher']=='Activision'), 'Name_grouped'] = "Tony Hawk's - FRANCHISE"
df.loc[(df['Name_grouped'].str.contains("Guitar Hero")) & (df['Publisher']=='Activision'), 'Name_grouped'] = "Guitar Hero - FRANCHISE"
df.loc[(df['Name_grouped'].str.contains("Spider-Man")) & (df['Publisher']=='Activision'), 'Name_grouped'] = "Spider-Man - FRANCHISE"
df[df['Publisher']=='Activision'].groupby('Name_grouped')[['Global_Sales']].sum().sort_values('Global_Sales', ascending=False).head(10)

So, it seems that each publisher has a small group of franchises that account for the vast majority of its sales. To check this hypothesis, I'm going to use the Gini index.

The Gini Index is a measure of statistical dispersion that measures the inequality among values of a frequency distribution. It has been frequently used in economics to measure how far a country's wealth or income distribution deviates from a totally equal distribution.<br>

Gini's coefficient ranges between 0 and 1, where zero expresses perfect equality and a coefficient of one means maximal inequality among values.<br>

Here we are going to use this coefficient to express the uneven contribution of sales among the different games for each Publisher. 
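
To make the definition concrete, here is a minimal sketch of the Gini coefficient computed from the area under the Lorenz curve. The function name `gini_coefficient` and the two toy sales lists are illustrative assumptions, not part of the dataset.

```python
def gini_coefficient(values):
    """Gini coefficient via the trapezoidal area under the Lorenz curve."""
    xs = sorted(float(v) for v in values)  # smallest to largest
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive value")

    # Lorenz curve: cumulative share of the total, starting at 0
    lorenz = [0.0]
    running = 0.0
    for v in xs:
        running += v
        lorenz.append(running / total)

    # Area under the curve with n trapezoids of width 1/n
    area = sum(lorenz[i] + lorenz[i + 1] for i in range(n)) / (2 * n)
    return 1.0 - 2.0 * area


# Toy data (made up): four games with identical sales vs. a single hit
equal_sales = [5.0, 5.0, 5.0, 5.0]
one_hit = [0.0, 0.0, 0.0, 20.0]

print(gini_coefficient(equal_sales))
print(gini_coefficient(one_hit))
```

With a finite number of games the coefficient cannot quite reach 1: in this formulation, a catalogue of `n` games where one title makes all the sales scores `1 - 1/n`.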

In [None]:
df.loc[(df['Name'].str.contains("Gran Turismo")) & (df['Publisher']=='Sony Computer Entertainment'), 'Name_grouped'] = 'Gran Turismo - FRANCHISE'
df.loc[(df['Name'].str.contains("Final Fantasy")) & (df['Publisher']=='Sony Computer Entertainment'), 'Name_grouped'] = 'Final Fantasy - FRANCHISE'
df.loc[(df['Name'].str.contains("Crash Bandicoot")) & (df['Publisher']=='Sony Computer Entertainment'), 'Name_grouped'] = 'Crash Bandicoot - FRANCHISE'
df.loc[(df['Name'].str.contains("Uncharted")) & (df['Publisher']=='Sony Computer Entertainment'), 'Name_grouped'] = 'Uncharted - FRANCHISE'
df.loc[(df['Name'].str.contains("God of War")) & (df['Publisher']=='Sony Computer Entertainment'), 'Name_grouped'] = 'God of War - FRANCHISE'
df.loc[(df['Name'].str.contains("Just Dance")) & (df['Publisher']=='Ubisoft'), 'Name_grouped'] = 'Just Dance - FRANCHISE'
df.loc[(df['Name'].str.contains("Assassin's Creed")) & (df['Publisher']=='Ubisoft'), 'Name_grouped'] = "Assassin's Creed - FRANCHISE"
df.loc[(df['Name'].str.contains("Star Wars")) & (df['Publisher']=='LucasArts'), 'Name_grouped'] = "Star Wars - FRANCHISE"
df.loc[(df['Name'].str.contains("Far Cry")) & (df['Publisher']=='Ubisoft'), 'Name_grouped'] = "Far Cry - FRANCHISE"
df.loc[df['Name'].str.contains("Rugrats"), 'Name_grouped'] = 'Rugrats - FRANCHISE'
df.loc[df['Name'].str.contains("WWF SmackDown"), 'Name_grouped'] = 'WWF SmackDown - FRANCHISE'
df.loc[df['Name'].str.contains("Metal Gear Solid"), 'Name_grouped'] = 'Metal Gear Solid - FRANCHISE'
df.loc[df['Name'].str.contains("World Soccer Winning Eleven"), 'Name_grouped'] = 'World Soccer Winning Eleven - FRANCHISE'
df.loc[df['Name'].str.contains("Pro Evolution Soccer"), 'Name_grouped'] = 'World Soccer Winning Eleven - FRANCHISE'
df.loc[(df['Name'].str.contains("Sonic")) & (df['Publisher']=='Sega'), 'Name_grouped'] = 'Sonic - FRANCHISE'
df.loc[(df['Name'].str.contains("Tekken")) & (df['Publisher']=='Namco Bandai Games'), 'Name_grouped'] = 'Tekken - FRANCHISE'
df.loc[df['Name'].str.contains("Halo"), 'Name_grouped'] = 'Halo - FRANCHISE'
df.loc[df['Name'].str.contains("Gears of War"), 'Name_grouped'] = 'Gears of War - FRANCHISE'
df.loc[df['Name'].str.contains("Forza Motorsport"), 'Name_grouped'] = 'Forza Motorsport - FRANCHISE'
df.loc[(df['Name'].str.contains("Street Fighter")) & (df['Publisher']=='Capcom'), 'Name_grouped'] = 'Street Fighter - FRANCHISE'
df.loc[(df['Name'].str.contains("Monster Hunter")) & (df['Publisher']=='Capcom'), 'Name_grouped'] = 'Monster Hunter - FRANCHISE'
df.loc[(df['Name'].str.contains("Resident Evil")) & (df['Publisher']=='Capcom'), 'Name_grouped'] = 'Resident Evil - FRANCHISE'
df.loc[(df['Name'].str.contains("Dragon Ball Z")) & (df['Publisher']=='Atari'), 'Name_grouped'] = 'Dragon Ball Z - FRANCHISE'
df.loc[(df['Name'].str.contains("Final Fantasy")) & (df['Publisher']=='Square Enix'), 'Name_grouped'] = 'Final Fantasy - FRANCHISE'
df.loc[(df['Name'].str.contains("Kingdom Hearts")) & (df['Publisher']=='Square Enix'), 'Name_grouped'] = 'Kingdom Hearts - FRANCHISE'
df.loc[(df['Name'].str.contains("LEGO")) & (df['Publisher']=='Warner Bros. Interactive Entertainment'), 'Name_grouped'] = 'LEGO - FRANCHISE'
df.loc[(df['Name'].str.contains("Mortal Kombat")) & (df['Publisher']=='Warner Bros. Interactive Entertainment'), 'Name_grouped'] = 'Mortal Kombat - FRANCHISE'
df.loc[(df['Name'].str.contains("Batman")) & (df['Publisher']=='Warner Bros. Interactive Entertainment'), 'Name_grouped'] = 'Batman - FRANCHISE'
df.loc[(df['Name'].str.contains("Fallout")) & (df['Publisher']=='Bethesda Softworks'), 'Name_grouped'] = 'Fallout - FRANCHISE'
df.loc[(df['Name'].str.contains("The Elder Scrolls")) & (df['Publisher']=='Bethesda Softworks'), 'Name_grouped'] = 'The Elder Scrolls - FRANCHISE'
df.loc[(df['Name'].str.contains("Mortal Kombat")) & (df['Publisher']=='Midway Games'), 'Name_grouped'] = 'Mortal Kombat - FRANCHISE'
df.loc[(df['Name'].str.contains("Game Party")) & (df['Publisher']=='Midway Games'), 'Name_grouped'] = 'Game Party - FRANCHISE'
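
The long run of `df.loc[...]` assignments above could also be expressed as a table of rules applied in a loop. The following is only a sketch under assumptions: the miniature `games` frame, the `FRANCHISE_RULES` list and the `group_franchises` helper are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only
games = pd.DataFrame({
    "Name": ["Super Mario Bros.", "Pokémon Red", "Pokemon Gold", "FIFA 15", "Tetris"],
    "Publisher": ["Nintendo", "Nintendo", "Nintendo", "Electronic Arts", "Nintendo"],
})

# (regex pattern, publisher or None for any publisher, franchise label)
FRANCHISE_RULES = [
    (r"Mario", "Nintendo", "Mario Bros - FRANCHISE"),
    (r"Pok[eé]mon", "Nintendo", "Pokemon - FRANCHISE"),
    (r"FIFA", "Electronic Arts", "FIFA - FRANCHISE"),
]


def group_franchises(frame, rules):
    """Return a Series with franchise labels where a rule matches, else the title."""
    grouped = frame["Name"].copy()
    for pattern, publisher, label in rules:
        # na=False treats missing names as "no match" instead of propagating NaN
        mask = frame["Name"].str.contains(pattern, regex=True, na=False)
        if publisher is not None:
            mask &= frame["Publisher"] == publisher
        grouped[mask] = label
    return grouped


games["Name_grouped"] = group_franchises(games, FRANCHISE_RULES)
print(games)
```

Keeping the rules in one list makes it easier to spot inconsistent labels and to add a franchise without copying a whole line.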

In [None]:
def GRLC(values):
    '''
    Calculate Gini index, Gini coefficient, Robin Hood index, and points of 
    Lorenz curve based on the instructions given in 
    www.peterrosenmai.com/lorenz-curve-graphing-tool-and-gini-coefficient-calculator
    Lorenz curve values are given as lists of x & y points [[x1, x2], [y1, y2]]
    @param values: List of values
    @return: [Gini index, Gini coefficient, Robin Hood index, [Lorenz curve]] 
    '''
    n = len(values)
    assert(n > 0), 'Empty list of values'
    sortedValues = sorted(values) # Sort smallest to largest

    # Find cumulative totals with a running sum (avoids re-summing the list on every step)
    cumm = [0]
    for value in sortedValues:
        cumm.append(cumm[-1] + value)

    # Calculate Lorenz points
    LorenzPoints = [[], []]
    sumYs = 0           # Sum of all y values
    robinHoodIdx = -1   # Robin Hood index: max(x_i - y_i)
    for i in range(1, n + 2):
        x = 100.0 * (i - 1)/n
        y = 100.0 * (cumm[i - 1]/float(cumm[n]))
        LorenzPoints[0].append(x)
        LorenzPoints[1].append(y)
        sumYs += y
        maxX_Y = x - y
        if maxX_Y > robinHoodIdx: robinHoodIdx = maxX_Y

    giniIdx = 100 + (100 - 2 * sumYs)/n # Gini index, in percent

    return [giniIdx, giniIdx/100, robinHoodIdx, LorenzPoints]

In [None]:
# Top 20 publishers by historical global sales
top20Publishers = df.groupby('Publisher')['Global_Sales'].sum().sort_values(ascending=False).head(20).index
df_publishers_gini = pd.DataFrame(index=top20Publishers)
df_publishers_gini['Gini'] = 0.0  # float column, so the Gini values are stored without a dtype change

for publisher in top20Publishers:
    # Sales per game/franchise for this publisher (GRLC sorts the values itself)
    publisher_sales = df[df['Publisher']==publisher].groupby('Name_grouped')['Global_Sales'].sum()
    publisher_Gini = GRLC(list(publisher_sales))
    df_publishers_gini.loc[publisher, "Gini"] = publisher_Gini[0]

In [None]:
df_publishers_gini

We get quite high indices for all the publishers, with an overall average of 73%. For example, Nintendo, with an index of 83%, depends very heavily on the sales of a few games/franchises (Mario Bros, Pokemon and Wii Sports). Activision also has an especially high index, 80%, which reflects its dependence on the different Call of Duty games.<br><br>
Thus, the Gini index supports our intuition that a small number of games or franchises concentrate most of each publisher's sales.
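
A more direct way to read the same concentration is the share of sales contributed by a publisher's top few entries. This is a hedged sketch: `top_k_share` is a hypothetical helper and the `toy` figures are invented, not taken from the dataset.

```python
import pandas as pd


def top_k_share(sales, k=3):
    """Fraction of total sales contributed by the k best-selling entries."""
    ordered = sales.sort_values(ascending=False)
    return ordered.head(k).sum() / ordered.sum()


# Invented sales per game/franchise for one imaginary publisher (in millions)
toy = pd.Series({
    "Franchise A": 50.0,
    "Franchise B": 30.0,
    "Game C": 10.0,
    "Game D": 6.0,
    "Game E": 4.0,
})

print(top_k_share(toy, k=2))
```

On the real data, the same helper could be applied to the per-publisher sums of `Global_Sales` grouped by `Name_grouped`.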

## GENRE

The distribution of games by the different Genres:

In [None]:
plt.figure(figsize=(14, 8))
# Pass the column as a keyword: recent seaborn versions no longer accept it positionally
sns.countplot(x=df['Genre'], order=df['Genre'].value_counts().index, color='darkblue')
plt.title('Number of games by Genre')
plt.xticks(rotation=90)
sns.despine();

Action and Sports are both the best-selling genres and the ones with the most titles, while Puzzle, Strategy and Adventure seem to be the least popular.

In [None]:
plt.figure(figsize=(18, 10))
genre_global = df.groupby('Genre')['Global_Sales'].sum().sort_values(ascending=False).head(15)
genre_global.sort_values().plot(kind='barh')
plt.xlabel("Historical Global Sales")
plt.title("Historical Global Sales by Genre")
sns.despine();

If we split the sales of the different genres by region, would we find big differences in the Japanese tastes as we saw with the platforms?

In [None]:
top5_genres_list = last_10years.groupby(['Genre'])['Global_Sales'].sum().sort_values(ascending=False).head(5).index

top5_genre_df = last_10years[last_10years.Genre.isin(top5_genres_list)]
fig, (ax0,ax1) = plt.subplots(2,2, figsize=(17,10))

fig.suptitle('Top 5 Genres Sales (in Millions) by Region', fontsize=14, fontweight = 'bold')

sns.lineplot(x='Year', y='NA_Sales', hue='Genre', data=top5_genre_df, ci=None, ax=ax0[0])
sns.lineplot(x='Year', y='EU_Sales', hue='Genre', data=top5_genre_df, ci=None, ax=ax0[1])
sns.lineplot(x='Year', y='JP_Sales', hue='Genre', data=top5_genre_df, ci=None, ax=ax1[0])
sns.lineplot(x='Year', y='Other_Sales', hue='Genre', data=top5_genre_df, ci=None, ax=ax1[1])

ax0[0].legend(loc='upper left')
ax0[1].legend(loc='upper left')
ax1[0].legend(loc='upper left')
ax1[1].legend(loc='upper left')

# ax1[1].set_ylim(-0.1,1.6)

ax0[0].set_ylabel('NA Sales (in Millions)', fontsize=16)
ax0[1].set_ylabel('EU Sales (in Millions)', fontsize=16)
ax1[0].set_ylabel('Japan Sales (in Millions)', fontsize=16)
ax1[1].set_ylabel('Other Sales (in Millions)', fontsize=16)

ax0[0].set_xlabel('Year', fontsize=16)
ax0[1].set_xlabel('Year', fontsize=16)
ax1[0].set_xlabel('Year', fontsize=16)
ax1[1].set_xlabel('Year', fontsize=16)

sns.despine()

plt.show()

We see a **big peak** in the **shooter** genre in Europe, North America and Other countries for the **year 2015**. This effect is due to the release of **"Call of Duty: Black Ops 3"**, which *sold more than 25M copies* across all platforms. <br>There is **another big increase** in shooter sales in the **year 2012**, which is related to its predecessor in the series, **"Call of Duty: Black Ops II"**. <br> <br>
Interestingly, in the Japanese market these increases are barely noticeable. It seems that **Japan is a very particular market**, quite unrelated to the others. In this country, **the biggest increases** in sales in the last ten years come from the **Role-Playing genre**, in particular because of the launch of different Pokemon games: **"Pokemon Black/Pokemon White" in 2010** for Nintendo DS and **"Pokemon Omega Ruby/Pokemon Alpha Sapphire"** for Nintendo 3DS.

In the next graph, we compare the historical proportion of sales by Genre in each Region. It shows very clearly how concentrated Japanese tastes are:

In [None]:
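# Sum only the sales columns, so the text columns are never aggregated
Genre_Region = df.groupby('Genre')[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']].sum()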
Genre_Region['NA_Sales%'] = Genre_Region['NA_Sales']/Genre_Region['Global_Sales']
Genre_Region['EU_Sales%'] = Genre_Region['EU_Sales']/Genre_Region['Global_Sales']
Genre_Region['JP_Sales%'] = Genre_Region['JP_Sales']/Genre_Region['Global_Sales']
Genre_Region['Other_Sales%'] = Genre_Region['Other_Sales']/Genre_Region['Global_Sales']
sns.heatmap(Genre_Region.loc[:,'NA_Sales%':'Other_Sales%'], vmax =1, vmin=0, annot=True, fmt = '.2%')
plt.title("Proportion of Genre by Region")
plt.show()

In **Japan**, the most popular Genre by far in the last 10 years has been **Role-Playing**, *followed by Action and Miscellaneous*. Conversely, **the other regions** coincide in **Action, Sports and Shooter** as the most appreciated Genres by the gamers.<br>
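
The heatmap above divides each genre's regional sales by that genre's global sales, so each row describes where a genre sells. To see which genre dominates inside a region, the table can instead be normalized by column. The sketch below uses an invented two-region table; the `toy` numbers and variable names are assumptions for illustration, and the rows are normalized by the sum of the two listed regions instead of `Global_Sales`.

```python
import pandas as pd

# Invented sales by genre and region (in millions), for illustration only
toy = pd.DataFrame(
    {"NA_Sales": [60.0, 30.0, 10.0], "JP_Sales": [10.0, 10.0, 30.0]},
    index=pd.Index(["Action", "Shooter", "Role-Playing"], name="Genre"),
)

# Where does each genre sell? Every row sums to 1.
share_of_genre = toy.div(toy.sum(axis=1), axis=0)

# Which genres dominate each region? Every column sums to 1.
share_of_region = toy.div(toy.sum(axis=0), axis=1)

print(share_of_genre)
print(share_of_region)
```

Both views are valid; they simply answer different questions, so it helps to state which normalization a chart uses.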

If we apply the same inequality analysis we did for Platforms, we see that it is also true for Genre that a few games are responsible for the majority of the sales:

In [None]:
# Genres ordered by historical global sales
Genres_ordered = df.groupby('Genre')['Global_Sales'].sum().sort_values(ascending=False).head(20).index
df_genres_gini = pd.DataFrame(index=Genres_ordered)
df_genres_gini['Gini'] = 0.0  # float column for the Gini values

for genre in Genres_ordered:
    # Sales per game/franchise within this genre (GRLC sorts the values itself)
    genre_sales = df[df['Genre']==genre].groupby('Name_grouped')['Global_Sales'].sum()
    genre_Gini = GRLC(list(genre_sales))
    df_genres_gini.loc[genre, "Gini"] = genre_Gini[0]

In [None]:
df_genres_gini

For example, if we focus on "Role-Playing" with a Gini coefficient of 80.6%, we see that just the Pokemon saga by itself represents 20% of the total sales.

This leads me to another question: how are Genre and Platform related to each other?<br>
To better understand their relationship, we can draw a heatmap of the genre mix on each platform:

In [None]:
# Ten best-selling platforms by historical global sales
top10_platforms = list(df.groupby('Platform')['Global_Sales'].sum().sort_values(ascending=False).head(10).index)

# Share of titles per genre within each platform (each row sums to 100%)
Platform_Genre = pd.crosstab(df['Platform'], df['Genre'], normalize='index')
top_Platforms = Platform_Genre.loc[top10_platforms]

sns.heatmap(top_Platforms, annot=True, fmt=".1%")
plt.xticks(rotation = 45)
plt.show()

**Action** video games sell quite well on **all platforms**, a behaviour they **share** with **Sports** games, although to a lesser extent.
On the other hand, the **Shooter** genre is more popular on PC and on the later generations of **PlayStation and Xbox** platforms, while shooters rarely sell on portable consoles.<br>
**Miscellaneous** games are strongly associated with **Nintendo** platforms (DS and Wii). <br>
**Strategy** games are practically exclusive to PC.<br>
Another interesting fact is how well **Adventure** games sell on portable consoles (PSP, DS).
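
Note that the heatmap above is built from a crosstab of titles, so it shows how many games each genre has on a platform, not how many copies they sold. A sales-weighted version could be sketched as follows; the `toy` frame and its numbers are invented for illustration.

```python
import pandas as pd

# Invented miniature of the dataset, for illustration only
toy = pd.DataFrame({
    "Platform": ["PC", "PC", "DS", "DS"],
    "Genre": ["Strategy", "Action", "Action", "Adventure"],
    "Global_Sales": [3.0, 1.0, 2.0, 2.0],
})

# Share of titles per genre within each platform
by_count = pd.crosstab(toy["Platform"], toy["Genre"], normalize="index")

# Share of sales per genre within each platform
by_sales = pd.crosstab(
    toy["Platform"],
    toy["Genre"],
    values=toy["Global_Sales"],
    aggfunc="sum",
    normalize="index",
)

print(by_count)
print(by_sales)
```

In this toy example Strategy is half of the PC titles but three quarters of the PC sales, which is the kind of gap the two views can reveal.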

We can get an intuition in the heatmap that there is a relationship between the Genre and the Platform. To move from that intuition to a statistical certainty, let's conduct a **chi-square independence hypothesis test** in order to **determine whether the genre and platform variables are independent or not.**<br>

The **Chi-Square test of independence** is used to determine if there is a **significant relationship between two categorical variables.**<br>
Basically, it compares the frequency of each category of one variable across the categories of the other. To do so, we will create a **contingency table** where each row represents a category of one variable and each column represents a category of the other. Using the chi-square test of independence we will examine this relationship, where the **null hypothesis means that there is no relationship** between the two variables. Conversely, the alternative hypothesis is that there is a relationship between them.

Our null hypothesis is that the variables are independent, and the alternative hypothesis states that they are not independent:<br><br>

$H_0$: The number of games by genre is independent of the platform.<br>
$H_1$: The number of games by genre depends on the platform.

In [None]:
df['Genre'].value_counts()

In [None]:
df['Platform'].value_counts()

In [None]:
pd.crosstab(df['Genre'], df['Platform'], margins=True).T

Many combinations of the two variables have zero games. This is a problem because of how the chi-square statistic (X²) is calculated:<br>

$X^2 = \sum \frac{(observed-expected)^2}{expected}$<br>

where ‘observed’ refers to the numbers we see in the contingency table above, and ‘expected’ refers to the numbers we would expect if the null hypothesis were true.

As a rule of thumb, the test requires the expected value of each cell to be at least five.
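
The rule of thumb can be checked directly: under independence, the expected count of a cell is its row total times its column total divided by the grand total. The sketch below uses an invented 2x2 table; `expected_counts` is a hypothetical helper name.

```python
import numpy as np


def expected_counts(observed):
    """Expected cell counts under independence for a 2-D table without margins."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)   # shape (rows, 1)
    col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, cols)
    return row_totals @ col_totals / observed.sum()


# Invented observed counts, for illustration only
toy_observed = np.array([[30, 10],
                         [20, 40]])

toy_expected = expected_counts(toy_observed)
print(toy_expected)
print("All expected counts >= 5:", bool((toy_expected >= 5).all()))
```

Applied to the real contingency table, the same check would show which platform and genre combinations are too sparse for the test.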

So let's restrict the test to the main platforms:

In [None]:
# The 12 main platforms
top12platforms = ['DS', 'PS2', 'PS3', 'Wii', 'X360', 'PSP', 'PS', 'PC', 'XB', 'GBA', 'GC', '3DS']
df_top12platforms = df[df['Platform'].isin(top12platforms)]
ct = pd.crosstab(df_top12platforms['Genre'], df_top12platforms['Platform'], margins=True).T
ct

So, under the null hypothesis that there is no significant relationship between Platform and Genre, the percentage of games for each genre should be consistent across the 12 different platforms.<br>
Now that we have the contingency table, we can prepare the observed values:

In [None]:
np.set_printoptions(suppress=True)
# Observed counts, flattened row by row, without the 'All' margins
obs = ct.iloc[:-1, :-1].to_numpy(dtype=float).ravel()

And calculate the expected values:

In [None]:
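row_sum = ct.iloc[0:-1,-1].values   # total games per platform
col_sum = ct.iloc[-1,0:-1].values   # total games per genre
# Expected count per cell under independence: row total * column total / grand total
exp = (np.outer(row_sum, col_sum) / ct.loc['All', 'All']).ravel()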

In order to compute X²:

In [None]:
chi_sq_stats = ((obs - exp)**2/exp).sum()
chi_sq_stats

In [None]:
dof = (ct.shape[0]-2)*(ct.shape[1]-2)
dof

Apart from X², to calculate the p-value we also need the degrees of freedom, computed as: (number of categories in the Genre variable - 1) * (number of categories in the Platform variable - 1).<br>

In summary, for the given degrees of freedom, the "observed" values should be close to the "expected" ones under the null hypothesis, which means that the X² statistic should be relatively small. If that is not the case and X² exceeds a certain threshold, the p-value will be very low and we reject the null hypothesis.

In [None]:
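# p-value: upper-tail probability of the chi-square distribution
# (sf avoids the rounding loss of 1 - cdf when the p-value is tiny)
stats.chi2.sf(chi_sq_stats, dof)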

In our case we have a p-value really close to 0, clearly lower than the typical significance level of 0.05. This means that the data statistically support our intuition, and we can reject the null hypothesis that Genre and Platform are independent.
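
As a cross-check on the manual computation, SciPy provides `scipy.stats.chi2_contingency`, which returns the statistic, the p-value, the degrees of freedom and the expected counts in one call. The sketch below runs it on an invented 2x3 table and compares it with the manual formula; the `observed` numbers are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Invented observed counts (rows and columns without margins)
observed = np.array([[30, 10, 20],
                     [20, 40, 30]])

# correction=False disables Yates' correction, which SciPy only applies when dof == 1
chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

# Same statistic computed by hand from the returned expected counts
manual_chi2 = ((observed - expected) ** 2 / expected).sum()
manual_p = stats.chi2.sf(manual_chi2, dof)

print(chi2, manual_chi2)
print(p_value, manual_p)
print(dof)
```

On the real data, the function expects the contingency table without the `All` row and column.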

## GAMES

We can't finish our look at the video game industry without paying attention to the best-selling video games in history. If we analyze the top 3 games by year:

In [None]:
top3_year = df.sort_values(['Year', 'Global_Sales'], ascending=False).groupby('Year').head(3)
pd.pivot_table(df[df['Rank'].isin(top3_year['Rank'])], values='Global_Sales', index=['Year', 'Name','Platform', 'Publisher', 'Genre'], aggfunc='sum')

The top 15 games with the highest sales in history are:

In [None]:
top_vgames = df.copy()
top_vgames['Name_year'] = top_vgames['Name'] + ' (' + top_vgames['Year'].astype(str) + ')'
top_vgames = top_vgames.groupby(['Name_year', 'Genre'])['Global_Sales'].sum().sort_values(ascending=False).reset_index().head(15)
top_vgames.sort_values('Global_Sales').set_index('Name_year').plot(kind='barh', legend=False) #, color = [colors[i] for i in top_vgames['Genre']]
plt.xlabel("Sales, M. copies")
plt.ylabel("Videogame, year released")
plt.title("Top 15 best-seller games");

As we have seen throughout this analysis, most of these top 15 games belong to a few franchises. The best-selling game by far is Wii Sports, doubling the sales of Super Mario Bros, which is second in the list. A probable explanation for its high sales is that Wii Sports was bundled with the Wii console.

How much of an impact did Wii Sports have on Nintendo sales?

In [None]:
plt.figure(figsize=(12,6))
# Nintendo sales per game and year
df_wii_sports_year = df[df['Publisher']=='Nintendo'].groupby(['Name', 'Year'])['Global_Sales'].sum().sort_values(ascending=False).reset_index()
df_wii_sports_year['G_Sales_noWiiSports'] = df_wii_sports_year['Global_Sales']
df_wii_sports_year.loc[df_wii_sports_year['Name']=='Wii Sports', 'G_Sales_noWiiSports'] = 0

# estimator='sum' plots the yearly total; the default would plot the mean per game
sns.lineplot(x='Year', y='Global_Sales', data=df_wii_sports_year, estimator='sum', ci=None, label='Total Nintendo Sales')
sns.lineplot(x='Year', y='G_Sales_noWiiSports', data=df_wii_sports_year, estimator='sum', ci=None, label='Nintendo Sales w/o Wii Sports')

plt.title('Effect of Wii Sports in Nintendo sales', fontsize=16)
plt.ylabel('Global Sales')

sns.despine()

plt.show()

In [None]:
WiiSports_Sales = df[df['Name']=='Wii Sports']['Global_Sales'].sum()
Nintendo_Sales = df[df['Publisher']=='Nintendo']['Global_Sales'].sum()
print("Wii Sports Global Sales: ", WiiSports_Sales)
print("Wii Sports sales over total Nintendo sales (%): ", "{:.2f}".format(WiiSports_Sales/Nintendo_Sales*100))

Even though it is by far Nintendo's best-selling video game, Wii Sports accounts for only 4.6% of the publisher's total sales.

This is because Nintendo owns the Super Mario saga, which has been delivering a good amount of sales every year since the very beginning:

In [None]:
# TODO: fix the axis names and add a legend
df[df['Publisher']=='Nintendo'][['Name_grouped', 'Global_Sales']].groupby('Name_grouped').sum().sort_values('Global_Sales', ascending=False).head(15)

In [None]:
plt.figure(figsize=(12,6))
df_SuperMario_year = df[df['Publisher']=='Nintendo'].groupby(['Name_grouped', 'Year'])['Global_Sales'].sum().sort_values(ascending=False).reset_index()
df_SuperMario_year['G_Sales_SuperMario'] = df_SuperMario_year['Global_Sales']
df_SuperMario_year.loc[df_SuperMario_year['Name_grouped']=='Mario Bros - FRANCHISE', 'G_Sales_SuperMario'] = 0

# estimator='sum' plots the yearly total; the default would plot the mean per game
sns.lineplot(x='Year', y='Global_Sales', data=df_SuperMario_year, estimator='sum', ci=None, label='Total Nintendo Sales')
sns.lineplot(x='Year', y='G_Sales_SuperMario', data=df_SuperMario_year, estimator='sum', ci=None, label='Nintendo Sales w/o Mario games')

plt.legend(loc='upper right')
plt.title('Effect of Mario games in Nintendo sales', fontsize=16)
plt.ylabel("Global Sales")

sns.despine()

plt.show()

The graph above shows very clearly the great impact Super Mario games have on Nintendo sales practically every year.

# CONCLUSIONS

* Of the 8,920M video games sold from 1980 to September 2016, 49% were sold in North America, 27% in Europe, 14% in Japan and just 9% in other regions.
* With the arrival of the 90's, sales started to grow quite steadily until 2008, when the trend reversed, and they have been falling ever since. This drop in sales has been especially strong in North America.
* The 3rd generation of consoles, formed by PS3, Wii, X360, PSP and DS, has been the most prolific both in number of launches and in video game sales. The current downward sales trend of the 4th generation makes it difficult to be optimistic about its future.
* Nintendo is the clear leader among publishers in sales, especially thanks to the Super Mario Bros saga, which represents around 30% of Nintendo's total sales.
* Electronic Arts has established itself in second place, increasing its sales slowly but steadily.
* Using the Gini coefficient, we showed that a small group of video games makes most of the sales, both by publisher and by genre.
* Japan is a very particular market, with different genre preferences. It also seems very difficult for non-Japanese publishers to sell games there.
* We showed statistically that Platform and Genre are not independent of each other.
