# **Exploratory Analysis of Video Game Sales**

Below are insights I gathered as a freshman on my first ever data science project (now republished for general viewing). 

As a novice, I focused only on exploratory analysis -- that is, assessing and interpreting the dataset provided and drawing conclusions from the data before me. Advanced techniques such as machine learning and predictive analysis were not used in this project; if you wish to see those, I welcome you to view my other projects, which focus more on those areas.

# **Table of Contents**
* **a)** Code Library Set-Up
* **b)** Preliminary Questions
* **c)** Data Visualizations
* **d)** Conclusion


# **a) Code Library Set-Up**

In [None]:

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
data = pd.read_csv("../input/videogamesales/vgsales.csv")



# **b) Preliminary Questions**

As a novice data scientist, I wanted to ask myself some key questions about this dataset before beginning. Such questions were:

* Where did the data come from?
* What does it measure?
* What do the columns consist of?
* Are any discrepancies immediately apparent?
* Does it need to be cleaned, or at least, considered differently?


> Where did the data come from?

As noted by the dataset's author, the data was scraped from the video game website vgchartz.com.

> What does it measure? What do the columns consist of?

Using the simple command "data.columns", you can see each of the fields the data is organized under. Each entry carries a numerical rank, and the remaining columns (seen below) cover familiar attributes such as name, platform, genre, and sales in each targeted region.

 

In [None]:
print(data.columns)
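
Beyond listing the column names, a quick look at dtypes and missing values helps with the cleaning question asked above. This is a minimal sketch on a small stand-in frame: `sample` and every value in it are hypothetical and only mimic a few of the dataset's columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the values are made up for illustration only.
sample = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Platform": ["Wii", "PS2", "DS"],
    "Year": [2006.0, np.nan, 2008.0],
    "Genre": ["Sports", "Misc", "Puzzle"],
    "Global_Sales": [10.0, 2.5, 1.2],
})

# Column dtypes show which fields are numeric and which are text.
dtypes = sample.dtypes

# Missing values per column; non-zero counts flag potential cleaning work.
missing = sample.isna().sum()

print(dtypes)
print(missing)
```

Running the same two calls on `data` would show whether fields such as Year need attention before grouping on them.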

> Are any discrepancies immediately apparent?

One of my first instincts was to dig around in the data using simple commands. Inspecting the Genre column (using the simple command data['Genre'].unique()) revealed an entire Misc category.

In [None]:
genre = data.groupby('Genre')
misc = genre.get_group('Misc')
misc = misc.sort_values(by='Global_Sales',ascending = False)
misc.head(10)

As seen above, I believe this Misc category could have unseen implications, *especially* if the goal were to provide actionable insights to clients or investors. For instance, a game such as "Brain Age" could easily be understood as a Puzzle game but isn't labeled as such. Even the huge gaming phenomenon "Minecraft" is relegated to this throwaway Misc category.

Although I took no action to alter any of these categories, this whole exercise was a helpful reminder to a novice such as myself to always be critical of the data given and to assess the implications of presenting analyses that may not be an accurate representation of the big picture.
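
To put a number on how much weight a catch-all label carries, one could compute each genre's share of total sales. This is a rough sketch on made-up rows: `sample` and its figures are hypothetical, not values from vgsales.csv.

```python
import pandas as pd

# Hypothetical rows with made-up sales figures, for illustration only.
sample = pd.DataFrame({
    "Genre": ["Misc", "Puzzle", "Misc", "Action"],
    "Global_Sales": [4.0, 1.0, 2.0, 3.0],
})

# Distinct genre labels present in the data.
genres = sample["Genre"].unique()

# Fraction of total sales attributed to each genre.
share = sample.groupby("Genre")["Global_Sales"].sum() / sample["Global_Sales"].sum()
misc_share = share["Misc"]

print(sorted(genres))
print(f"Misc share of sales: {misc_share:.0%}")
```

A large share for Misc in the real data would suggest that relabeling its biggest titles could noticeably shift any genre-level conclusion.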

# **c) Data Visualizations**

Below, I begin the process of visualizing the data within the dataset.

# ***Global Sales by Market***

In [None]:
country_list = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']
labels = country_list
plt.figure(figsize=(10,10))
sizes = data[country_list].sum()
plt.pie(sizes, labels=labels, autopct='%1.1f%%',shadow=True, startangle=90)
plt.show()

**Insight:** Unsurprisingly, North America accounts for nearly half of the global sales recorded in this dataset.
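
The percentages drawn on the pie can also be computed directly, which makes them easier to quote. This is a small sketch under the assumption of a made-up frame: `sample` and its numbers are hypothetical, and only the column names come from the notebook.

```python
import pandas as pd

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Hypothetical rows with made-up figures, for illustration only.
sample = pd.DataFrame({
    "NA_Sales": [4.0, 1.0],
    "EU_Sales": [2.0, 1.0],
    "JP_Sales": [1.0, 0.5],
    "Other_Sales": [0.3, 0.2],
})

# Total sales per region, then each region's fraction of the overall total.
totals = sample[region_cols].sum()
shares = totals / totals.sum()

print(shares.round(3))
```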

# ***Global Sales in Market by Time***

In [None]:
plt.figure(figsize=(10,10))

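# Sum only the regional sales columns per year; summing every column
# would also try to add up text fields such as Name and Publisher.
yearly = data.groupby('Year')[country_list].sum()
# DataFrame.plot draws one labeled line per region and adds a legend.
yearly.plot(ax=plt.gca())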
plt.savefig('timeplot.png', dpi=300)
plt.show()

plt.figure(figsize=(10,10))
for country in country_list: 
    plt.bar(yearly.index, yearly[country], label = country)

plt.legend()
plt.show()

**Insight:** one of the key components of data science is *domain expertise* -- and after a childhood of playing video games, I believe I can add insight into a seemingly deceptive takeaway of this dataset.

According to the data as presented, video game sales peaked in 2008 before declining every year thereafter; however, I believe this to be an inaccurate assessment and yet another example of why you must critically assess how your data was collected and what exactly it represents. Although not explicitly noted, looking further into the website vgchartz.com revealed that it only measures ***physical*** copies of game sales.

Around 2008 and thereafter, digital (downloadable) versions of titles slowly became more common thanks to the increased storage capacities of hardware and faster internet download speeds. Billion-dollar contemporary video games such as "Fortnite" are distributed mainly as free downloads, making money mostly through in-game purchases, and thus have no initial "sale". This dataset, as collected, does not account for such things, and that limitation should be explicitly acknowledged when drawing conclusions.
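
The peak year can be read from the numbers rather than off the plot. This is a minimal sketch on made-up data: `sample` and its figures are hypothetical, chosen only to show the mechanics, and the physical-sales caveat above applies to any real result.

```python
import pandas as pd

# Hypothetical yearly rows with made-up sales, for illustration only.
sample = pd.DataFrame({
    "Year": [2006, 2007, 2008, 2008, 2009, 2010],
    "Global_Sales": [5.0, 6.0, 4.0, 3.5, 6.5, 5.5],
})

# Total recorded sales per year.
yearly_total = sample.groupby("Year")["Global_Sales"].sum()

# Year with the highest total.
peak_year = yearly_total.idxmax()

# Year-over-year change; negative values mark declining years.
change = yearly_total.diff()

print(peak_year)
print(change)
```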

# ***Top Global Games***

In [None]:
top = data.groupby(by=['Name'])['Global_Sales'].sum()
top = top.nlargest(10)  # nlargest already returns values in descending order

plt.figure(figsize=(10,10))
# One color per bar, from the best-selling title downward
top_colors = ["midnightblue", "navy", "darkblue", "mediumblue", "blue",
              "slateblue", "mediumpurple", "darkorchid", "mediumorchid", "plum"]
topBar = plt.bar(top.index, top, color=top_colors)

plt.title('Top Games by Global Sales')
plt.xlabel('Game Names')
plt.ylabel('Global Sales')
plt.xticks(rotation=55)
plt.show()


**Insight**: once again drawing on domain expertise as a crucial part of data science, I was immediately drawn to "Wii Sports" being the top game of all time in terms of sales. Knowing that "Wii Sports" was included with the sale of a Wii console, I had to question whether these figures reflected standalone copies of "Wii Sports" sold separately, the copies included in the console bundle, or an amalgamation of both.

# ***Top Global Console***

In [None]:

plt.figure(figsize=(10,10))

platform = data.groupby(by=['Platform'])['Global_Sales'].sum()
platform = platform.sort_values(ascending = False)
platformBar = plt.bar(platform.index, platform)
plt.title('Console Sales Globally')
plt.xlabel('Consoles')
plt.ylabel('Global Sales')
plt.xticks(rotation=55)
plt.show()

**Insight**: I was genuinely surprised to see the Wii so far down considering how well "Wii Sports" sold. More surprising still was the older-generation PS2 ranking highest, and by a considerable margin too. Note that this chart totals game sales per platform rather than console units sold. Later on, I discovered why this was so.
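
One way to probe a ranking like this (not necessarily the route taken in this project) is to compare each platform's total sales with the number of titles behind it. This sketch uses made-up rows: `sample`, the titles, and the figures are all hypothetical.

```python
import pandas as pd

# Hypothetical rows with made-up figures, for illustration only.
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Platform": ["PS2", "PS2", "PS2", "Wii", "Wii"],
    "Global_Sales": [3.0, 2.0, 2.5, 6.0, 1.0],
})

# Total sales, number of titles, and average sales per title for each platform.
summary = sample.groupby("Platform")["Global_Sales"].agg(
    total_sales="sum", titles="count", mean_per_title="mean"
).sort_values("total_sales", ascending=False)

print(summary)
```

In this toy data the platform with more titles leads on total sales even though its average title sells less, which is the kind of pattern worth checking for in the real dataset.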

# ***Top Global Genre***

In [None]:
genre = data.groupby(by=['Genre'])['Global_Sales'].sum()
plt.figure(figsize=(10,10))
# One color per genre bar, in the order returned by groupby
genre_colors = ["r", "darkred", "firebrick", "lightcoral", "coral", "orangered",
                "darkorange", "bisque", "turquoise", "lightseagreen", "teal", "cadetblue"]
genreBar = plt.bar(genre.index, genre, color=genre_colors)
plt.title('Genre Sales Globally')
plt.xlabel('Genres')
plt.ylabel('Global Sales')
plt.xticks(rotation=55)

plt.show()

**Insight:** recalling my apprehension about the Misc genre, I immediately recognized how substantial the genre's sales are in comparison to the others.

# ***Top Global Publishers***

In [None]:
publisher = data.groupby(by=['Publisher'])['Global_Sales'].sum()
publisher = publisher.nlargest(10)  # nlargest already returns values in descending order

plt.figure(figsize=(10,10))
# One color per bar, lightest for the top publisher and darkest for the tenth
pub_colors = ["plum", "mediumorchid", "darkorchid", "mediumpurple", "slateblue",
              "blue", "mediumblue", "darkblue", "navy", "midnightblue"]
pubBar = plt.bar(publisher.index, publisher, color=pub_colors)



plt.title('Top Publisher by Global Sales')
plt.xlabel('Publisher')
plt.ylabel('Global Sales')
plt.xticks(rotation=55)

plt.show()

**Insight:** I was unsurprised that Nintendo is the top publisher considering "Wii Sports" sales. Electronic Arts also publishes numerous Action and Sports titles -- top genres, as already seen -- so no surprise there either.

# ***Market Trends with Pies***

# **North America**

**North American Top Games**

In [None]:
top = data.groupby(by=['Name'])['NA_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0, 0)
colors = ['darkred', 'maroon', 'firebrick', 'indianred', 'white', 'whitesmoke', 'lightsteelblue', 'cornflowerblue', 'royalblue', 'navy']



plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**Insight**: I was extremely surprised to see such old titles as "Super Mario Bros.", "Duck Hunt", and "Tetris" still holding such large shares. Because the chart sums every entry that shares a name, this also made me question whether the dataset counted re-releases, remasters, or otherwise updated versions under the original release title.

**North American Top Platforms**

In [None]:
top = data.groupby(by=['Platform'])['NA_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
colors = ['darkred', 'maroon', 'firebrick', 'indianred', 'white', 'whitesmoke', 'lightsteelblue', 'cornflowerblue', 'royalblue', 'navy']
colors.reverse()


plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**North American Top Genres**

In [None]:
top = data.groupby(by=['Genre'])['NA_Sales'].sum()
top = top.sort_values(ascending = False)

plt.figure(figsize=(10, 10))
labels = top.index
sizes = top
explode = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 )
colors = [ 'lightcoral', 'indianred', 'firebrick', 'maroon', 'darkred', 'white', 'whitesmoke', 'navy', 'royalblue', 'cornflowerblue', 'lightsteelblue', 'aliceblue']


plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)


plt.show()

**North American Top Publisher**

In [None]:
top = data.groupby(by=['Publisher'])['NA_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(8,8))
labels = top.index
sizes = top
explode = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
colors = ['indianred', 'firebrick', 'maroon', 'darkred', 'white', 'whitesmoke', 'navy', 'royalblue', 'cornflowerblue', 'lightsteelblue']
colors.reverse()

plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)
plt.show()
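
The pie-chart cells in these regional sections all repeat the same steps: group, sum, take the top entries, draw. Those steps could be wrapped in a small helper. This is only a sketch: `top_n` and `top_n_pie` are hypothetical names, not part of the original notebook, and `sample` holds made-up figures.

```python
import matplotlib.pyplot as plt
import pandas as pd


def top_n(df, group_col, sales_col, n=10):
    """Return the n largest group totals of sales_col, in descending order."""
    return df.groupby(group_col)[sales_col].sum().nlargest(n)


def top_n_pie(df, group_col, sales_col, n=10, colors=None):
    """Draw a pie chart of the top-n groups and return (figure, totals)."""
    totals = top_n(df, group_col, sales_col, n)
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.pie(totals, labels=totals.index, colors=colors,
           autopct="%1.1f%%", startangle=90)
    return fig, totals


# Hypothetical rows with made-up figures, for illustration only.
sample = pd.DataFrame({
    "Platform": ["X360", "PS2", "Wii", "PS2"],
    "NA_Sales": [3.0, 2.0, 2.5, 1.5],
})

fig, totals = top_n_pie(sample, "Platform", "NA_Sales", n=2)
print(totals)
```

With such a helper, each regional cell would reduce to a single call that varies only the grouping column, the sales column, and the color list.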

# **Europe**

**European Top Games**

In [None]:
top = data.groupby(by=['Name'])['EU_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0, 0, 0, 0.1, 0, 0.1, 0.1, 0, 0, 0)
colors = ['blue', 'mediumblue', 'royalblue','cornflowerblue', 'lightsteelblue','honeydew','lightgoldenrodyellow','lightyellow','yellow','gold']

plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**Insight:** initially I was amused by how predictable it was to see a soccer franchise feature so prominently among Europe's top sellers. However, I then realized this dataset has no way of representing a game *franchise* -- which could be useful for analysis. Many games are part of a larger franchise. Since this dataset lists each title individually, I believe more insights could have been gained with this added perspective.
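
A franchise view could be approximated by mapping titles to a franchise keyword and summing over that. This is a rough sketch under clear assumptions: the `FRANCHISES` list, the `franchise_of` helper, and the rows in `sample` are all hypothetical, and simple keyword matching would need manual curation on real titles.

```python
import pandas as pd

# Hypothetical franchise keywords; a real list would need manual curation.
FRANCHISES = ["FIFA", "Call of Duty", "Pokemon"]

# Hypothetical rows with made-up sales figures, for illustration only.
sample = pd.DataFrame({
    "Name": ["FIFA 15", "FIFA 16", "Call of Duty: Ghosts", "Tetris"],
    "EU_Sales": [4.0, 3.5, 2.0, 1.0],
})


def franchise_of(name):
    """Return the first franchise keyword found in name, else the name itself."""
    for keyword in FRANCHISES:
        if keyword.lower() in name.lower():
            return keyword
    return name


# Tag each title, then total sales per franchise (or per standalone title).
sample["Franchise"] = sample["Name"].map(franchise_of)
by_franchise = (
    sample.groupby("Franchise")["EU_Sales"].sum().sort_values(ascending=False)
)

print(by_franchise)
```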

**European Top Platforms**

In [None]:
top = data.groupby(by=['Platform'])['EU_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0.1, 0.1, 0, 0, 0, 0, 0, 0, 0, 0)
colors = ['blue', 'mediumblue', 'royalblue','cornflowerblue', 'lightsteelblue','honeydew','lightgoldenrodyellow','lightyellow','yellow','gold']
colors.reverse()
plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**Insight:** I was intrigued to see Europe's preference for Sony's PlayStation platforms.

**European Top Genres**

In [None]:
top = data.groupby(by=['Genre'])['EU_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10, 10))
labels = top.index
sizes = top
explode = (0, 0.1, 0, 0, 0, 0, 0, 0, 0, 0)
colors = ['blue', 'mediumblue', 'royalblue','cornflowerblue', 'lightsteelblue','honeydew','lightgoldenrodyellow','lightyellow','yellow','gold']

plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**Insight:** considering Europe's love of soccer (and sales of the FIFA franchise), seeing Sports weigh so much was unsurprising.

**European Top Publishers**

In [None]:
top = data.groupby(by=['Publisher'])['EU_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0, 0.1, 0, 0, 0, 0, 0, 0, 0, 0)
colors = ['blue', 'mediumblue', 'royalblue','cornflowerblue', 'lightsteelblue','honeydew','lightgoldenrodyellow','lightyellow','yellow','gold']
colors.reverse()
plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)


plt.show()

**Insight:** since Electronic Arts publishes the FIFA franchise, I see no surprises here.

# **Japan**

**Japanese Top Games**

In [None]:
top = data.groupby(by=['Name'])['JP_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0.1, 0.1, 0, 0, 0.1, 0, 0.1, 0, 0.1, 0)
colors = ['darkred','maroon','firebrick','brown','indianred','lightcoral','peachpuff','seashell','oldlace','whitesmoke']



plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**Insight:** it is immediately clear that Japan's gaming tastes are distinctly different from those of the rest of the world market. Note the repeated occurrence of the Pokemon franchise.

**Japanese Top Genres**

In [None]:
top = data.groupby(by=['Genre'])['JP_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
colors = ['darkred','maroon','firebrick','brown','indianred','lightcoral','peachpuff','seashell','oldlace','whitesmoke']



plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**Insight:** Pokemon titles would be labeled under the Role-Playing genre, so this is unsurprising.

**Japanese Top Platforms**

In [None]:
top = data.groupby(by=['Platform'])['JP_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0.1, 0.1, 0, 0, 0, 0, 0, 0, 0, 0)
colors = ['darkred','maroon','firebrick','brown','indianred','lightcoral','peachpuff','seashell','oldlace','whitesmoke']



plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**Insight:** although I expected a handheld console to weigh the most, I was thoroughly surprised to see the first-generation PlayStation come second.

**Japanese Top Publishers**


In [None]:
top = data.groupby(by=['Publisher'])['JP_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
colors = ['darkred','maroon','firebrick','brown','indianred','lightcoral','peachpuff','seashell','oldlace','whitesmoke']



plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**Insight:** seeing Nintendo take up such a huge percentage in Japan is probably the least surprising analysis made in this project.

# **"Other"**

**"Other" Top Games**

In [None]:
top = data.groupby(by=['Name'])['Other_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
colors = ['navy','royalblue','cornflowerblue','powderblue','mediumspringgreen','springgreen','mediumseagreen','forestgreen', 'green','darkgreen']



plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**Insight:** I had become conditioned to seeing Grand Theft Auto V as a top game, so I was surprised to see a much older title as the top game for the rest of the world. It should be noted that this is a PlayStation 2 title (the PS2 being the global leader among platforms). One possible explanation is that the rest of the world lags technologically when it comes to gaming hardware and takes time to catch up to contemporary standards.

**"Other" Top Genres**

In [None]:
top = data.groupby(by=['Genre'])['Other_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0.1, 0.1, 0, 0, 0, 0, 0, 0, 0, 0.1)
colors = ['navy','royalblue','cornflowerblue','powderblue','mediumspringgreen','springgreen','mediumseagreen','forestgreen', 'green','darkgreen']



plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**"Other" Top Platform**

In [None]:
top = data.groupby(by=['Platform'])['Other_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
colors = ['navy','royalblue','cornflowerblue','powderblue','mediumspringgreen','springgreen','mediumseagreen','forestgreen', 'green','darkgreen']
colors.reverse()


plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

**Insight:** the data reiterates that the older-generation PS2 is the dominant console in the rest of the world. It should be noted, however, that the newer PS3 isn't far behind.

**"Other" Top Publisher**

In [None]:
top = data.groupby(by=['Publisher'])['Other_Sales'].sum()
top = top.sort_values(ascending = False).nlargest(10)

plt.figure(figsize=(10,10))
labels = top.index
sizes = top
explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
colors = ['navy','royalblue','cornflowerblue','powderblue','mediumspringgreen','springgreen','mediumseagreen','forestgreen', 'green','darkgreen']
colors.reverse()


plt.pie(sizes, explode = explode, labels=labels, colors = colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

plt.show()

# **d) Conclusion**

* Could inaccurate genre definition affect analysis?

* How much does the shifting market (physical to digital sales) impact interpretation?

* Data favors single-release titles
    * Yearly franchise releases (like FIFA/Call of Duty) are segmented and could more accurately be viewed as one entire franchise

* Market breakdowns show divergent trends
    * Europe prefers sports titles
    * Japan vastly favors handheld gaming
        * (Shooter doesn't even crack their Top 10)
    * The rest of the world seems to lag behind in terms of tech
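
The market divergences listed above could be compared side by side by ranking each genre within every region. This is a minimal sketch, assuming a made-up frame: `sample` and its figures are hypothetical, and only the column names follow the notebook.

```python
import pandas as pd

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Hypothetical rows with made-up figures, for illustration only.
sample = pd.DataFrame({
    "Genre": ["Shooter", "Role-Playing", "Sports", "Shooter"],
    "NA_Sales": [3.0, 1.0, 2.0, 1.0],
    "EU_Sales": [2.0, 1.0, 3.0, 0.5],
    "JP_Sales": [0.1, 3.0, 0.5, 0.1],
    "Other_Sales": [0.5, 0.2, 0.7, 0.1],
})

# Total sales per genre in each region.
genre_totals = sample.groupby("Genre")[region_cols].sum()

# Each genre's share of its region's sales; every column sums to 1.
genre_share = genre_totals / genre_totals.sum()

# Rank 1 marks the best-selling genre within that region.
genre_rank = genre_totals.rank(ascending=False, method="min").astype(int)

print(genre_rank)
```

A single rank table like this makes a claim such as a genre missing from one region's top ten directly checkable across all four markets.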