# Exploratory Data Analysis

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns 

from collections import Counter

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

The dataset contains a list of video games with sales greater than 100,000 copies.

* NA_Sales = Sales in North America (in millions)
* EU_Sales = Sales in Europe (in millions)
* JP_Sales = Sales in Japan (in millions)

In [None]:
videogames = pd.read_csv('/kaggle/input/videogamesales/vgsales.csv')
videogames

In [None]:
videogames.isnull().sum()

In [None]:
# Let's drop the null values 
videogames = videogames.dropna()
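
Before dropping rows, it can be worth checking how much data `dropna()` actually removes. The following is a minimal sketch on a made-up three-row frame; the `toy` values are invented for illustration and are not taken from `vgsales.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all values are invented.
toy = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C'],
    'Year': [2006.0, np.nan, 2009.0],
    'Publisher': ['Pub X', 'Pub Y', None],
})

missing_per_column = toy.isnull().sum()  # number of nulls in each column
cleaned = toy.dropna()                   # drops every row with at least one null
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"dropped {rows_dropped} of {len(toy)} rows")
```

If the dropped share turned out to be large, filling or flagging the missing values might be a better choice than dropping them.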

Let's summarize the data before analysis: 

In [None]:
videogames.describe()

What can we make of this? 
* This dataset covers videogames released from 1980 up to 2020
* 25% of the games were released in or before 2003 
* 50% of the games were released in or before 2007 
* 75% of the games were released in or before 2010 
* Sales per game are, on average, noticeably higher in North America than in Europe or Japan
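
The release-year bullets above come from the 25%, 50%, and 75% rows of `describe()`. As a small sketch with made-up years (not the real data), those rows are the same numbers `quantile()` returns, and a quantile of q means that at least a fraction q of the values fall at or below it:

```python
import pandas as pd

# Invented release years, for illustration only.
toy_years = pd.Series([1985, 1999, 2003, 2005, 2007, 2008, 2010, 2011, 2015],
                      name='Year')

summary = toy_years.describe()                        # includes '25%', '50%', '75%'
quartiles = toy_years.quantile([0.25, 0.5, 0.75])     # same values, computed directly

# Share of games released in or before the first-quartile year
share_at_or_before_q1 = (toy_years <= quartiles[0.25]).mean()

print(summary[['min', '25%', '50%', '75%', 'max']])
print(share_at_or_before_q1)
```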

Just by describing the data, we have obtained facts that raise important questions: 
* Has the videogame industry, or the popularity of videogames, been waning since 2010? 
* Do North Americans play more videogames than the rest of the world, or is the dataset missing variables that would explain the gap? 
* In what year were videogames most popular, and why? 

Getting into EDA: 

Let's address the questions above about the popularity of videogames per year. Which year was the most successful for the videogame industry? 

In [None]:
videogames.Year.value_counts().sort_index()

In [None]:
plt.figure(figsize=(12,4))
videogames.Year.value_counts().sort_index().plot(kind='bar')

From this we gather that the number of videogame titles started rising around 1993 to 1994. The rise continued and peaked in 2008 and 2009. The following years (from 2010 to 2020) saw a decline. Why? Could social media have become a substitute? Facebook launched in 2004 and opened to the general public in 2006, so it was already widely used by 2010. 
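
The peak can also be read off programmatically rather than from the chart. This is a sketch on invented years, assuming only a `Year` column like the one in the dataset:

```python
import pandas as pd

# Invented rows, one per game title.
toy = pd.DataFrame({'Year': [2007, 2008, 2008, 2008, 2009, 2009, 2010]})

titles_per_year = toy['Year'].value_counts().sort_index()  # titles released per year
peak_year = titles_per_year.idxmax()                       # year with the most titles
yoy_change = titles_per_year.diff()                        # change versus the previous year

print(titles_per_year)
print("peak year:", peak_year)
print(yoy_change)
```

Negative values in `yoy_change` mark the years in which fewer titles appear than in the year before.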

In [None]:
# Get each unique year in the videogames dataframe
uniqueYears = videogames.Year.unique()

# Build a dictionary mapping each year to the dataframe of games released that year
dataFrameDict = {year: videogames[videogames.Year == year] for year in uniqueYears}


# With this we can get the sales for each year and see how they compare across
# North America, Europe, and Japan

Using the dataFrameDict above, we can retrieve a dataframe containing all the information for each individual year. From this we can compare how videogame sales have changed over time: is the industry selling more now than it was in 1980? The years 1999, 2000, and 2016 are good ones to compare because a similar number of titles was released in each.
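
An alternative to looking up one dataframe per year is to aggregate every year at once with `groupby`. The sketch below uses invented numbers and assumes the regional column names from the dataset:

```python
import pandas as pd

sales_cols = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']

# Invented sales figures (in millions), for illustration only.
toy = pd.DataFrame({
    'Year':        [1999, 1999, 2016, 2016],
    'NA_Sales':    [1.0, 2.0, 0.5, 0.25],
    'EU_Sales':    [0.5, 0.5, 0.5, 0.25],
    'JP_Sales':    [0.25, 0.25, 0.0, 0.25],
    'Other_Sales': [0.25, 0.0, 0.25, 0.0],
})

# One row per year, one column per region, each cell the summed sales
regional_totals = toy.groupby('Year')[sales_cols].sum()

print(regional_totals.loc[[1999, 2016]])
```

The resulting table can be sliced by year with `.loc`, which avoids building a separate dataframe for each year.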

In [None]:
df1999 = dataFrameDict[1999]
df2016 = dataFrameDict[2016]
df2000 = dataFrameDict[2000]

In [None]:
regions = ['North America', 'Europe', 'Japan', 'Other']
sales_cols = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']

# Build one long-format frame: total sales per region for each year of interest
frames = []
for year, df in [(1999, df1999), (2000, df2000), (2016, df2016)]:
    frames.append(pd.DataFrame({'x': regions,
                                'y': df[sales_cols].sum().values,
                                'hue': year}))
res = pd.concat(frames)
sns.barplot(x='x', y='y', data=res, hue='hue')
plt.title('Sales by region in 1999, 2000, and 2016')
plt.show()

This is interesting: 1999 had 338 game titles, 2000 had 349 game titles, and 2016 had 342 game titles. 1999 has the lowest number of game titles yet the highest total sales (in millions). So can we conclude that an individual videogame sold far more in 1999, and less and less as time went on? 
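
One way to probe this question is to divide each year's total sales by its number of titles. The sketch below uses invented numbers and assumes a `Global_Sales` column (in millions), which is not used elsewhere in this notebook:

```python
import pandas as pd

# Invented rows, one per title; Global_Sales is an assumed column name.
toy = pd.DataFrame({
    'Year':         [1999, 1999, 2016, 2016, 2016, 2016],
    'Global_Sales': [4.0, 2.0, 1.0, 0.5, 0.25, 0.25],
})

# Titles and total sales per year, then average sales per title
per_year = toy.groupby('Year')['Global_Sales'].agg(titles='count', total_sales='sum')
per_year['sales_per_title'] = per_year['total_sales'] / per_year['titles']

print(per_year)
```

A higher `sales_per_title` in one year would support the idea that a typical game sold more that year, though a few blockbuster titles can dominate the average, so the median may be worth checking too.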

Now we can shift gears and look at other variables in the data. What seems important? What genres do people like? Has that changed over time? 

In [None]:
videogames.Genre.unique()

So there are 12 genres. What was the leading genre in the most popular years? 

In [None]:
df2008 = dataFrameDict[2008]
df2009 = dataFrameDict[2009]

In [None]:
# The 12 genres, in alphabetical order
genres = ['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle', 'Racing', 
          'Role-Playing', 'Shooter', 'Simulation', 'Sports', 'Strategy']
# Look counts up by genre label (not by integer position); a genre absent from a year gets 0
df1 = pd.DataFrame({'x': genres, 
                    'y': df2008.Genre.value_counts().reindex(genres, fill_value=0).values})
df2 = pd.DataFrame({'x': genres, 
                    'y': df2009.Genre.value_counts().reindex(genres, fill_value=0).values})
df1['hue']=2008
df2['hue']=2009
res=pd.concat([df1,df2])
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(x='x', y='y', ax=ax, data=res, hue='hue')
plt.show()

Interesting, it seems as though action became a bit more popular in 2009 than it was in 2008. 
Should we also compare 1999 and 2016 as we did previously to see how genre preferences changed over the years? 

In [None]:
df1999.Genre.value_counts().sort_index()

In [None]:
df2016.Genre.value_counts().sort_index()

In [None]:
# The 12 genres, in alphabetical order
genres = ['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle', 
          'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports', 'Strategy']
# Look counts up by genre label; reindex gives 0 for a genre with no titles that year
# (2016 has no Puzzle titles)
df1 = pd.DataFrame({'x': genres, 
                    'y': df1999.Genre.value_counts().reindex(genres, fill_value=0).values})
df2 = pd.DataFrame({'x': genres, 
                    'y': df2016.Genre.value_counts().reindex(genres, fill_value=0).values})
df1['hue']=1999
df2['hue']=2016
res=pd.concat([df1, df2])
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(x='x', y='y', ax=ax, data=res, hue='hue')
plt.show()

Wow, it seems like action became a lot more popular over the years, while sports was the most popular in 1999.
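
Raw title counts depend on how many games each year contains, so genre shares may be easier to compare across years. This sketch uses `pd.crosstab` with `normalize='index'` on invented rows:

```python
import pandas as pd

# Invented rows, one per title.
toy = pd.DataFrame({
    'Year':  [1999, 1999, 1999, 1999, 2016, 2016, 2016, 2016],
    'Genre': ['Sports', 'Sports', 'Action', 'Puzzle',
              'Action', 'Action', 'Action', 'Sports'],
})

# Rows are years, columns are genres; each row sums to 1
genre_share = pd.crosstab(toy['Year'], toy['Genre'], normalize='index')
top_genre = genre_share.idxmax(axis=1)  # leading genre for each year

print(genre_share)
print(top_genre)
```

A genre with no titles in a given year shows up as a share of 0, so no genre has to be removed by hand.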

What else might we be able to analyze about this data? We could now go in the direction of the most popular publishers and platforms.

In [None]:
# we can use Counter to find the most common publishers 
publishers = Counter(videogames['Publisher'].tolist()).most_common(10)
labels = [i[0] for i in publishers]
counts = [i[1] for i in publishers]

fig,ax = plt.subplots(figsize=(12, 6))
sns.barplot(x=labels, y=counts, ax=ax)
plt.xticks(rotation=90)

In [None]:
# We can now use Counter to find the most common platforms
platforms = Counter(videogames['Platform'].tolist()).most_common(10)
labels=[i[0] for i in platforms]
counts = [i[1] for i in platforms]

fig,ax = plt.subplots(figsize=(12, 6))
sns.barplot(x=labels, y=counts, ax=ax)
plt.xticks(rotation=90)
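
The same top-N counts can also be obtained without `Counter`, using pandas' own `value_counts()`. The sketch below compares the two approaches on invented publisher names:

```python
from collections import Counter

import pandas as pd

# Invented publisher column, for illustration only.
toy = pd.DataFrame({'Publisher': ['Pub A', 'Pub A', 'Pub A', 'Pub B', 'Pub B', 'Pub C']})

# Approach used in this notebook: Counter over a plain list
top_counter = Counter(toy['Publisher'].tolist()).most_common(2)

# pandas alternative: value_counts() sorts by count in descending order
top_vc = toy['Publisher'].value_counts().head(2)

print(top_counter)
print(top_vc)
```

`top_vc.index` and `top_vc.values` can be passed directly as the `x` and `y` arguments of `sns.barplot`. When two entries have the same count, the two approaches may order them differently.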