## Importing Libraries


In [None]:
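import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option("display.precision", 2)
sns.set_context(
    "notebook",
    font_scale=1.5,
    rc={"axes.titlesize": 18},
)
# set_context only applies its own text and line scaling keys,
# so the default figure size is set through matplotlib's rcParams
plt.rcParams["figure.figsize"] = (11, 8)

# Suppress warnings to keep the notebook output clean
warnings.filterwarnings('ignore')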

## Importing the Data

In [None]:
df = pd.read_csv('../input/videogamesales/vgsales.csv')

## Data Overview

With the following commands we can see that the dataset has 11 columns and 16598 rows. Each row is one game on one platform, so a title can appear more than once. The output of `info()` also shows that the Year column contains null values.

In [None]:
print(df.head())
print(df.shape)

In [None]:
# info() prints its report directly and returns None, so no print() is needed
df.info()
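
As a complement to `info()`, the missing values can be counted per column directly. The snippet below is a minimal sketch on a hypothetical four-row table (the names and values are invented for illustration, not taken from the dataset). It also shows why Year is read as a float column: a numeric column containing missing values cannot be stored as a plain integer column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the sales table (values invented for illustration)
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2006.0, np.nan, 2010.0, np.nan],
    "Publisher": ["Pub X", "Pub Y", None, "Pub Z"],
    "Global_Sales": [1.20, 0.45, 0.30, 0.08],
})

# Count missing values per column and show only the columns that have any
null_counts = toy.isna().sum()
print(null_counts[null_counts > 0])

# Year holds NaN, so pandas stores it as float64 rather than int64
print(toy["Year"].dtype)
```

On the real dataframe the same call would be `df.isna().sum()`.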

# Cleaning the Data

To convert the Year column to int for easier processing, rows with missing values were removed from the dataset. A total of 307 rows were removed. Note that `dropna()` drops a row if any of its columns is missing, not only Year.

In [None]:
df = df.dropna()
df['Year'] = df['Year'].astype('int64')
df.shape
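
To see how many rows a cleaning step removes, the row counts before and after can be compared. The sketch below uses a hypothetical five-row table (invented values) to contrast dropping rows with any missing value against dropping only rows with a missing Year, using the `subset` argument of `dropna`. Which of the two is appropriate depends on whether the other columns are needed later.

```python
import numpy as np
import pandas as pd

# Hypothetical table: some rows lack a Year, some lack a Publisher
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D", "Game E"],
    "Year": [2006.0, np.nan, 2010.0, 2012.0, np.nan],
    "Publisher": ["Pub X", "Pub Y", None, "Pub Z", None],
})

rows_before = len(toy)
# Rows lost when a missing value in ANY column disqualifies the row
dropped_any = rows_before - len(toy.dropna())
# Rows lost when only a missing Year disqualifies the row
dropped_year = rows_before - len(toy.dropna(subset=["Year"]))
print(dropped_any, dropped_year)

# With no missing years left, the column can be converted to int
clean = toy.dropna(subset=["Year"]).copy()
clean["Year"] = clean["Year"].astype("int64")
print(clean["Year"].dtype)
```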

After sorting the data by Year in descending order, we can see that the dataset has 1 record for 2020 and 3 records for 2017. These rows are removed because they are too few to support any inference about years after 2016. The data for 2016 is also incomplete: it is missing sales for November and December, when sales are highest. The filter below keeps the 2016 rows, so the 2016 values in the time-based analyses are partial-year figures and should not be compared directly with earlier years.

In [None]:
df = df[df['Year'] <= 2016]
df.head()
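
One way to spot sparsely populated years is to count records per year, newest first. This is a sketch on a hypothetical Year column whose counts for 2020 and 2017 mirror the ones described above; the remaining values are invented.

```python
import pandas as pd

# Hypothetical Year column: 1 record for 2020, 3 for 2017, the rest invented
toy = pd.DataFrame(
    {"Year": [2020, 2017, 2017, 2017, 2016, 2016, 2015, 2015, 2015, 2015]}
)

# Number of records per year, sorted from the most recent year down
per_year = toy["Year"].value_counts().sort_index(ascending=False)
print(per_year.head())

# Keep only the years up to and including 2016
kept = toy[toy["Year"] <= 2016]
print(len(kept))
```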

Now that the dataset has been cleaned and the Year column converted to int, we can take a closer look. Sales are in millions of units, and Rank orders the games by global sales. The table below gives some quick insights into the data. On average, sales are highest in the NA region. The dataset must also contain large outliers, as the mean global sales (0.54) is far above the median global sales (0.17). 75% of games sold fewer than 480,000 units globally.

In [None]:
df.describe()
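
The gap between mean and median is what a few very large values produce. The following sketch uses seven invented sales figures (in millions), one of them an extreme outlier, to show the mean being pulled upwards while the median and the 75th percentile stay close to the bulk of the data.

```python
import pandas as pd

# Invented sales figures in millions; the last value is a deliberate outlier
sales = pd.Series([0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 40.0], name="Global_Sales")

mean_sales = sales.mean()      # dominated by the single outlier
median_sales = sales.median()  # middle value, unaffected by the outlier
q75 = sales.quantile(0.75)     # 75% of the values fall at or below this

print(f"mean={mean_sales:.2f}, median={median_sales:.2f}, 75th percentile={q75:.3f}")
```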

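Because some rows have been removed, the Rank column needs to be rebuilt. We sort the dataframe by global sales in descending order, reset the index so that it runs from 0 without gaps, and add 1 to it.

In [None]:
# Without reset_index the old index (and therefore the old rank) would be reused
df = df.sort_values(by='Global_Sales', ascending=False).reset_index(drop=True)
df['Rank'] = df.index + 1
df.head()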
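
An alternative that does not depend on the index at all is `Series.rank`. This is only a sketch on a hypothetical four-row table with a gappy index, as filtering would leave behind. With `method="first"`, tied values are ranked in the order they appear, so every rank is unique.

```python
import pandas as pd

# Hypothetical table; the index has gaps, as it would after rows are dropped
toy = pd.DataFrame(
    {
        "Name": ["Game A", "Game B", "Game C", "Game D"],
        "Global_Sales": [0.30, 5.10, 0.12, 1.75],
    },
    index=[0, 3, 7, 9],
)

# Rank 1 goes to the highest global sales, regardless of row order or index
toy["Rank"] = toy["Global_Sales"].rank(method="first", ascending=False).astype(int)
toy = toy.sort_values("Rank")
print(toy)
```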

# Visualizing the Data

## What is the Relationship Between Rank and Sales?

As shown in the scatterplot below, the relationship between rank and global sales is far from linear. The first few games have very high sales, but sales drop off quickly and the vast majority of games perform relatively poorly, as expected from the summary table above.

In [None]:
ax = sns.scatterplot(x = 'Rank', y = 'Global_Sales', data=df);
ax.set_ylabel("Global Sales");

## How do Global Sales Change Over Time?

To track sales over time, we create a line graph showing the median global sales of the games released in each year. We use the median instead of the mean because it is more robust to large outliers and so better reflects typical performance. Median sales have declined since 1990, and from 1993 onwards they have remained roughly constant at around 250K.

In [None]:
# ci=None disables the confidence band (seaborn >= 0.12 spells this errorbar=None)
ax = sns.lineplot(x='Year', y='Global_Sales', data=df, estimator=np.median, ci=None)
ax.set_ylabel("Median Global Sales (Millions)");
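
The numbers behind such a line can be computed with a `groupby`. The sketch below uses two invented years, one of which contains a single blockbuster, to illustrate the choice of the median: the mean for that year is inflated by the outlier, while the median is not.

```python
import pandas as pd

# Invented data: 2005 contains one blockbuster, 2006 does not
toy = pd.DataFrame({
    "Year": [2005, 2005, 2005, 2006, 2006, 2006],
    "Global_Sales": [0.10, 0.20, 30.00, 0.15, 0.25, 0.35],
})

# Median and mean global sales per release year, side by side
by_year = toy.groupby("Year")["Global_Sales"].agg(["median", "mean"])
print(by_year)
```

On the real dataframe, `df.groupby("Year")["Global_Sales"].median()` would give the values the line plot draws.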

## How do Sales Change Over Time for Each Region?

Now we take a closer look at the performance of the individual regions. Median sales per game are highest in NA and JP, with EU third. NA sales overtake JP sales in 1995 and keep the lead.

In [None]:
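df_sales = df[['Year', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]
# Reshape to long format; naming the variable column "Region" labels the legend
# directly, without overwriting a legend entry by position
df_long = pd.melt(df_sales, id_vars='Year', var_name='Region', value_name='Sales')
ax = sns.lineplot(x='Year', y='Sales', hue='Region', data=df_long,
                  estimator=np.median, ci=None)
ax.set_ylabel("Median Sales (Millions)");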
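
The reshaping step is what lets one plot call draw a line per region. As a sketch on a hypothetical two-region table (invented values), `pd.melt` stacks the regional columns into one long column, and a `groupby` then yields the median per year and region, which is what the lines show.

```python
import pandas as pd

# Hypothetical wide table with one sales column per region
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "NA_Sales": [0.10, 0.30, 0.50, 0.20, 0.40, 0.60],
    "JP_Sales": [0.00, 0.05, 0.40, 0.00, 0.10, 0.20],
})

# Wide to long: one row per (Year, Region) observation
long = pd.melt(toy, id_vars="Year", var_name="Region", value_name="Sales")

# Median sales per year and region, with regions spread back into columns
medians = long.groupby(["Year", "Region"])["Sales"].median().unstack("Region")
print(medians)
```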

## How Does Genre Affect Sales?

The bar graph below shows that genre is a major factor in sales, with Platform games being the most successful and Adventure games the least.

In [None]:
# ci=None disables the error bars (seaborn >= 0.12 spells this errorbar=None)
ax = sns.barplot(x="Genre", y="Global_Sales", data=df, estimator=np.median, ci=None)
ax.set_ylabel("Median Global Sales (Millions)")
plt.xticks(rotation=90);

We can also track how each genre's sales change over time. After 2010, most genres perform similarly. However, Platform games see a peak in sales in 2014, and Shooter games see large growth in 2015. The second plot zooms in on the years 2000 to 2016.

In [None]:
plt.figure(figsize=(20,10))
ax = sns.lineplot(x='Year', y='Global_Sales', data=df, estimator=np.median, ci=None, hue="Genre")
ax.set_ylabel("Median Global Sales (Millions)");

In [None]:
plt.figure(figsize=(20,10))
ax = sns.lineplot(x='Year', y='Global_Sales', data=df, estimator=np.median, ci=None, hue="Genre")
ax.set_ylabel("Median Global Sales (Millions)")
# Zoom in on recent years and on the lower range of sales
ax.set(xlim=(2000, 2016), ylim=(0, 1));
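
With many overlapping lines, a table of the same medians can be easier to read than the plot. The sketch below builds one with `pivot_table` on invented data for two genres. A second table counts the games behind each cell, which helps judge whether a peak rests on many releases or just a handful.

```python
import pandas as pd

# Invented data for two genres over two years
toy = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015, 2015, 2015, 2015],
    "Genre": ["Platform", "Platform", "Shooter",
              "Platform", "Shooter", "Shooter", "Shooter"],
    "Global_Sales": [0.40, 0.80, 0.20, 0.30, 0.50, 0.70, 0.90],
})

# Median global sales for every (Year, Genre) combination
genre_by_year = toy.pivot_table(
    index="Year", columns="Genre", values="Global_Sales", aggfunc="median"
)
# Number of games each median is based on
games_per_cell = toy.pivot_table(
    index="Year", columns="Genre", values="Global_Sales", aggfunc="count"
)
print(genre_by_year)
print(games_per_cell)
```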

## How Does the Platform Affect Sales?

We can also see which platforms had the largest median sales per game, as well as the total sales per platform. The high median sales per game for the NES and GB can be attributed to the small number of games released for those consoles. The PS2, on the other hand, has the highest total sales of any console, but its median sales per game are relatively low.

In [None]:
plt.figure(figsize=(20,10))
ax = sns.barplot(x="Platform", y="Global_Sales", data=df, estimator=np.median, ci=None)
ax.set_ylabel("Median Global Sales (Millions)");

In [None]:
plt.figure(figsize=(20,10))
# Select the sales column before summing so only that column is aggregated
quantity_sold = df.groupby("Platform")['Global_Sales'].sum()
# The group labels are the index of the result, so no separate loop is needed
plt.bar(quantity_sold.index, quantity_sold.values)
plt.ylabel("Total Global Sales (Millions)")
plt.xlabel('Platform');
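
The link between a high median and a small catalogue can be checked by putting the game count, median, and total side by side. The following is a sketch using named aggregation on invented figures for two platforms. The column names `games`, `median_sales`, and `total_sales` are chosen here for illustration.

```python
import pandas as pd

# Invented figures: a small catalogue of strong sellers vs. a large mixed one
toy = pd.DataFrame({
    "Platform": ["NES", "NES", "PS2", "PS2", "PS2", "PS2"],
    "Global_Sales": [2.00, 4.00, 0.10, 0.20, 0.30, 8.00],
})

# One row per platform: number of games, median sales per game, total sales
summary = (
    toy.groupby("Platform")["Global_Sales"]
    .agg(games="count", median_sales="median", total_sales="sum")
    .sort_values("total_sales", ascending=False)
)
print(summary)
```

In this toy table the platform with the larger catalogue leads on total sales while trailing on median sales per game, the same pattern described above for the PS2.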