# Data Analysis of Video Game Sales
Authors: Isaac Huang, Keyi Cheng, and Shihua (Jimmy) Zhang (UCLA Statistics and Math)

In this project, we use pandas, SQLite3, and several visualization tools to analyze a large dataset of video game sales from 1980 to 2020 and to build interactive data visualizations.

#### (package and data import)

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import sqlite3
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

vgsales = pd.read_csv('../input/videogamesales/vgsales.csv')

## 0. Data set introduction and pre-cleaning

#### (a) dataset description

*This dataset contains a list of video games with sales greater than 100,000 copies. It was
generated by a scrape of vgchartz.com.    
Fields include <mark>(11 columns)</mark>:*

<mark>***Rank***</mark>: Ranking of overall sales <br>
<mark>***Name***</mark>: The game's name <br>
<mark>***Platform***</mark>: Platform of the game's release (e.g. PC, PS4, etc.) <br>
<mark>***Year***</mark>: Year of the game's release <br>
<mark>***Genre***</mark>: Genre of the game <br>
<mark>***Publisher***</mark>: Publisher of the game <br>
<mark>***NA_Sales***</mark>: Sales in North America (in millions) <br>
<mark>***EU_Sales***</mark>: Sales in Europe (in millions) <br>
<mark>***JP_Sales***</mark>: Sales in Japan (in millions) <br>
<mark>***Other_Sales***</mark>: Sales in the rest of the world (in millions) <br>
<mark>***Global_Sales***</mark>: Total worldwide sales (in millions) <br>
   

**( All the information above is from https://www.kaggle.com/gregorut/videogamesales )**

#### (b) quick preview of the dataset

In [None]:
#view the size of the dataset
vgsales.shape

In [None]:
# Inspect the dataset
vgsales.head()

In [None]:
#summary about the dataset
vgsales.info()

In [None]:
# Count the missing (NA) values in each column
vgsales.isna().sum()

We can see that all the `NA` values are in either `Year` or `Publisher`. One possible reason is that some small games were made by individual developers, so detailed information about them is not available. Let's drop the rows that contain `NA` values.

In [None]:
vgsales = vgsales.dropna()
vgsales.isna().sum()
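
Before dropping rows, it can help to check how much data is lost. The sketch below does this on a small made-up frame (the rows are placeholders, not values from `vgsales.csv`), and the helper name `dropna_report` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from vgsales.csv
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [2001.0, np.nan, 2005.0, 2010.0],
    "Publisher": ["P1", "P2", None, "P1"],
    "Global_Sales": [1.0, 0.5, 0.2, 3.0],
})

def dropna_report(df):
    """Return (rows_before, rows_after, share_dropped) for a plain dropna()."""
    before = len(df)
    after = len(df.dropna())
    return before, after, (before - after) / before

report = dropna_report(toy)
print(report)
```

Running the same helper on the real frame before `dropna()` would show whether the dropped share is small enough to ignore.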

Now the data is ready for analysis. Before we start, we load it into a SQLite database.

In [None]:
conn = sqlite3.connect("game.db")
# "replace" overwrites the table, so re-running this cell does not duplicate rows
vgsales.to_sql("vgsales", conn, if_exists = "replace", index = False)

all_query = '''
SELECT * FROM vgsales
'''
df = pd.read_sql_query (all_query, conn)
df
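
Because `game.db` is a file on disk, the `if_exists` argument of `to_sql` decides what happens when the cell is run more than once. The sketch below compares `"append"` and `"replace"` on an in-memory database with made-up rows; the table name `toy_sales` and the helper `rows_after_two_writes` are hypothetical.

```python
import sqlite3
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({"Name": ["A", "B"], "Global_Sales": [1.0, 0.5]})

def rows_after_two_writes(if_exists):
    """Write the same frame twice and return the resulting row count."""
    conn_demo = sqlite3.connect(":memory:")
    toy.to_sql("toy_sales", conn_demo, if_exists=if_exists, index=False)
    toy.to_sql("toy_sales", conn_demo, if_exists=if_exists, index=False)
    count = conn_demo.execute("SELECT COUNT(*) FROM toy_sales").fetchone()[0]
    conn_demo.close()
    return count

append_rows = rows_after_two_writes("append")    # rows are duplicated
replace_rows = rows_after_two_writes("replace")  # table is overwritten
print(append_rows, replace_rows)
```

With `"append"`, every aggregate computed later would be inflated by the number of times the cell was executed.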

## 1.  Analysis of the electronic gaming market

### (a) Overview of the electronic gaming market

Before analyzing each market in detail, we look at how sales in each market have changed over time.

In [None]:
# Filter rows with WHERE before grouping, and test the aggregated sum in HAVING
query = '''
SELECT sum(NA_Sales) AS NA_total_sells, Year
FROM vgsales
WHERE Year BETWEEN 1980 AND 2010
GROUP BY Year
HAVING sum(NA_Sales) > 0
ORDER BY Year
'''
df1 = pd.read_sql_query (query, conn)

query = '''
SELECT sum(JP_Sales) AS JP_total_sells, Year
FROM vgsales
WHERE Year BETWEEN 1980 AND 2010
GROUP BY Year
HAVING sum(JP_Sales) > 0
ORDER BY Year
'''
df2 = pd.read_sql_query (query, conn)

query = '''
SELECT sum(EU_Sales) AS EU_total_sells, Year
FROM vgsales
WHERE Year BETWEEN 1980 AND 2010
GROUP BY Year
HAVING sum(EU_Sales) > 0
ORDER BY Year
'''
df3 = pd.read_sql_query (query, conn)

plt.figure(figsize=(16,10))
plt.plot(df1["Year"],df1["NA_total_sells"],label='North America')
plt.plot(df2["Year"],df2["JP_total_sells"],label='Japan')
plt.plot(df3["Year"],df3["EU_total_sells"],label='Europe')
plt.xticks(df1["Year"],rotation='vertical',size=8)
plt.ylabel('Sales [in millions]')
plt.xlabel('Year')
plt.legend()
plt.title('Game sales in the three main global markets between 1980 and 2010')

From the sales history of these three markets, we can see that all three have similar starting points (1980 sales) but different endpoints (2010 sales). The overall trend is the same in all three: sales grow in most periods, though at different speeds. All three markets also show small declines in certain periods. These may be related to external conditions, such as periodic financial crises (roughly every 10-15 years).
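
The three yearly totals can also be computed in one pass. The sketch below is one possible way to do it, run against an in-memory table with made-up rows; the connection name `conn_demo` and the output column names are assumptions for illustration.

```python
import sqlite3
import pandas as pd

conn_demo = sqlite3.connect(":memory:")
# Made-up rows for illustration only; not taken from vgsales.csv
pd.DataFrame({
    "Year": [1985.0, 1985.0, 1990.0, 2015.0],
    "NA_Sales": [1.0, 2.0, 0.5, 9.0],
    "EU_Sales": [0.5, 0.5, 0.25, 9.0],
    "JP_Sales": [0.25, 0.0, 0.5, 9.0],
}).to_sql("vgsales", conn_demo, index=False)

# One query returns all three regional totals per year
query = '''
SELECT Year,
       SUM(NA_Sales) AS NA_total,
       SUM(EU_Sales) AS EU_total,
       SUM(JP_Sales) AS JP_total
FROM vgsales
WHERE Year BETWEEN 1980 AND 2010
GROUP BY Year
ORDER BY Year
'''
yearly = pd.read_sql_query(query, conn_demo)
print(yearly)
```

A single frame like `yearly` guarantees that all three lines share the same set of years on the x-axis.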

In [None]:
query = '''
SELECT sum(Global_Sales) AS Global_total_sells, Platform, Year
FROM vgsales
GROUP BY Year, Platform
HAVING Year BETWEEN 1980.0 AND 2010.0
'''

df = pd.read_sql_query (query, conn)
fig = px.histogram(df, x="Year", y="Global_total_sells", color="Platform", marginal="rug",
                   hover_data=df.columns,
                   title = "The Top Sales Platform in 1980-2010")
fig.show()

From the platform sales in 1980-2010, we can see that many platforms have appeared on the market. Before 1990 there were only a few platforms, since electronic gaming had just been born. After 1990, more and more platforms came out, which led to increasingly fierce competition. After 2000, the number of platforms started to shrink; in other words, the remaining platforms are the temporary winners of the competition.    
Later on, we discuss the market in terms of the best-selling games, the most popular platforms, and the game genres.

### (b) Top-selling games in different periods

From the graph above, we can see that the overall trend of electronic game sales is upward.       
Let's take a look at the top-selling games in `1980-1990`, `1990-2000`, and `2000-2010`.

In [None]:
# Top 5 games per market and period, ranked by summed regional sales
regions = [("NA_Sales", "North America"), ("JP_Sales", "Japan"), ("EU_Sales", "Europe")]
periods = [(1980, 1990), (1990, 2000), (2000, 2010)]

def top_games(region_col, start, end, top_n=5):
    # region_col only ever comes from the fixed `regions` list above,
    # so formatting it into the SQL text is safe here
    query = f'''
    SELECT Name, Sales, ranking
    FROM
    (SELECT Name, Sales,
    dense_rank() OVER (ORDER BY Sales DESC) AS ranking
    FROM
    (SELECT Name, sum({region_col}) AS Sales
    FROM vgsales
    WHERE Year BETWEEN ? AND ?
    GROUP BY Name)
    WHERE Sales > 0)
    WHERE ranking <= ?
    ORDER BY ranking
    '''
    return pd.read_sql_query(query, conn, params=(start, end, top_n))

# One column per market, one row per period
fig = make_subplots(rows=3, cols=3,
                    subplot_titles=[title for _, title in regions])

for col, (region_col, _) in enumerate(regions, start=1):
    for row, (start, end) in enumerate(periods, start=1):
        top = top_games(region_col, start, end)
        fig.add_trace(go.Funnel(y = top["Name"], x = top["Sales"]), row=row, col=col)

fig.update_layout(showlegend=False, height=600, width=3000)
fig.show()


(From top to bottom, the periods are `1980-1990`, `1990-2000`, and `2000-2010`.)

We can see that almost all of the top 5 best-selling games in each period and each market are aimed at all ages (no violence or blood), which suggests that they are more likely to be preferred by parents and families.    
People in North America often throw parties for social functions and celebrations, and we can indeed find some party-friendly games in the top 5, such as Wii Sports and Super Mario Land;    
In Japan and Europe, the Pokemon series was the most popular in the 1990-2000 period.    
    
In other words, the most popular games are aimed at all ages. These games can reach more customers because younger players can buy them too.
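
A ranking like this can be cross-checked in pandas. The sketch below sums sales per game name within a period and keeps the largest totals; the rows are made up, and the helper name `top_games_pandas` is hypothetical. Note that `nlargest` cuts off at exactly `n` rows, whereas `dense_rank` keeps ties.

```python
import pandas as pd

# Made-up rows for illustration only; G1 appears on two platforms
toy = pd.DataFrame({
    "Name": ["G1", "G1", "G2", "G3", "G4"],
    "Platform": ["X", "Y", "X", "X", "Y"],
    "Year": [1985.0, 1986.0, 1988.0, 1995.0, 1989.0],
    "NA_Sales": [2.0, 3.0, 4.0, 10.0, 0.0],
})

def top_games_pandas(df, region_col, start, end, n=5):
    """Sum regional sales per game inside [start, end] and keep the n largest."""
    in_period = df[df["Year"].between(start, end)]
    totals = in_period.groupby("Name")[region_col].sum()
    return totals[totals > 0].nlargest(n)

top = top_games_pandas(toy, "NA_Sales", 1980, 1990, n=2)
print(top)
```

Summing before ranking matters for games released on several platforms, since each platform is a separate row in the table.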

### (c) The most popular platforms

Let's visualize this using total sales over time:

In [None]:
query = '''
SELECT Platform, Year, sum(NA_Sales) AS Sales
FROM vgsales
GROUP BY Platform, Year
HAVING Sales > 0
ORDER BY Year
'''
df = pd.read_sql_query (query, conn)
fig = px.scatter(df, x="Year", y="Sales",        
                 size="Sales", color="Platform",
                 hover_name="Platform", log_x=True, size_max=60,
                 title = "North America Platform Sales change in history")
fig.show()


query = '''
SELECT Platform, Year, sum(JP_Sales) AS Sales
FROM vgsales
GROUP BY Platform, Year
HAVING Sales > 0
ORDER BY Year
'''
df = pd.read_sql_query (query, conn)
fig = px.scatter(df, x="Year", y="Sales",        
                 size="Sales", color="Platform",
                 hover_name="Platform", log_x=True, size_max=60,
                 title = "Japan Platform Sales change in history")
fig.show()


query = '''
SELECT Platform, Year, sum(EU_Sales) AS Sales
FROM vgsales
GROUP BY Platform, Year
HAVING Sales > 0
ORDER BY Year
'''
df = pd.read_sql_query (query, conn)
fig = px.scatter(df, x="Year", y="Sales",        
                 size="Sales", color="Platform",
                 hover_name="Platform", log_x=True, size_max=60,
                 title = "Europe Platform Sales change in history")
fig.show()

We can see that the top 5 platforms all come from `Nintendo`, `Sony`, or `Microsoft` in all three markets. This indicates that these companies are the temporary winners and form an oligopoly in these markets, since their sales are far higher than those of other companies.       
With its large population, the North American market has had higher sales than the other two markets on all platforms since 2000. For example, the PS2 had close to 1000 million sales in North America around 2005, while in the same period the same platform had only about 200 million sales in Japan and about 700 million in Europe.      
We can also see that Japanese customers prefer handheld game consoles, while North American and European customers prefer traditional home consoles. This may be due to cultural differences.
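
The handheld preference can be put into numbers by computing the share of each market's sales that comes from handheld platforms. The sketch below uses made-up rows, and the `HANDHELD` set is an assumption about which platform codes to count as handheld; it would need to be checked against the codes in the actual data.

```python
import pandas as pd

# Assumption: platform codes treated as handheld; adjust to the real data
HANDHELD = {"GB", "GBA", "DS", "3DS", "PSP", "PSV"}

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Platform": ["DS", "PS2", "PSP", "X360"],
    "NA_Sales": [1.0, 3.0, 0.0, 4.0],
    "JP_Sales": [3.0, 1.0, 1.0, 0.0],
})

def handheld_share(df, region_col):
    """Fraction of a region's total sales that comes from handheld platforms."""
    is_handheld = df["Platform"].isin(HANDHELD)
    return df.loc[is_handheld, region_col].sum() / df[region_col].sum()

na_share = handheld_share(toy, "NA_Sales")
jp_share = handheld_share(toy, "JP_Sales")
print(na_share, jp_share)
```

A share is easier to compare across markets than raw totals, because the markets differ a lot in overall size.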

### (d) The most popular genres

In [None]:
#NA
query = '''
SELECT Genre, SUM(NA_Sales) AS Sales
FROM vgsales
GROUP BY Genre
HAVING SUM(NA_Sales) > 0
'''

df = pd.read_sql_query (query, conn)
fig = px.pie(df, values='Sales', names='Genre',
             title='Game genres preferred by North American players',
             hole=.2)
fig.update_traces(textposition='inside')
fig.show()

#JP
query = '''
SELECT Genre, SUM(JP_Sales) AS Sales
FROM vgsales
GROUP BY Genre
HAVING SUM(JP_Sales) > 0
'''

df = pd.read_sql_query (query, conn)
fig = px.pie(df, values='Sales', names='Genre',
             title='Game genres preferred by Japanese players',
            hole=.2)
fig.update_traces(textposition='inside')
fig.show()

#EU
query = '''
SELECT Genre, SUM(EU_Sales) AS Sales
FROM vgsales
GROUP BY Genre
HAVING SUM(EU_Sales) > 0
'''

df = pd.read_sql_query (query, conn)
fig = px.pie(df, values='Sales', names='Genre',
             title='Game genres preferred by European players',
            hole=.2)
fig.update_traces(textposition='inside')
fig.show()

Combining the most popular platforms and genres, we can identify several features of the customers in these three markets:    
`North America`: They prefer to play shooter, sports, and action games on traditional consoles;    
`Japan`: They love role-playing and action games and like to play them on handheld consoles. They are not very interested in shooter games, but they like puzzle games more than customers in the other markets;      
`Europe`: They are similar to North American players, but they love action and sports games much more than North American players do.     

     
In all three markets, strategy and adventure games are the least popular.   
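
To compare genre preferences across markets of different sizes, the totals can be converted to percentage shares in one table. The sketch below does this on made-up rows; the variable names `totals` and `shares` are only for illustration.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Genre": ["Action", "Shooter", "Role-Playing", "Action"],
    "NA_Sales": [2.0, 6.0, 1.0, 1.0],
    "JP_Sales": [1.0, 0.0, 3.0, 0.0],
    "EU_Sales": [3.0, 1.0, 0.0, 1.0],
})
regions = ["NA_Sales", "JP_Sales", "EU_Sales"]

# Total sales per genre and region, then each column scaled to sum to 100
totals = toy.groupby("Genre")[regions].sum()
shares = totals / totals.sum() * 100
print(shares.round(1))
```

Each column of `shares` corresponds to one of the pie charts, so the three markets can be read side by side.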

## 2. Best game companies

In [None]:
#reload the whole dataset
dfInitial=pd.read_sql_query (all_query, conn)
dfInitial

Top 10 publishers by global sales

In [None]:
dfPublisher=dfInitial.groupby("Publisher")["Global_Sales"].aggregate(["sum"]).sort_values(by="sum",ascending=False)
dfPublisher=dfPublisher.reset_index()
dfPublisher2=dfPublisher.head(10)
figPublisher = px.bar(dfPublisher2, x='Publisher', y="sum", color="Publisher",log_y=True)
figPublisher.show()

We can see that Nintendo and Electronic Arts are the top two publishers by global sales.   
Next, we analyze game sales and game markets for these two publishers.   

### 1. Nintendo

First, we want to see the best-selling games in its history.

In [None]:
dfNintendo=dfInitial[dfInitial.Publisher=="Nintendo"]
dfNintendo

In [None]:
dfNintendoGameName=dfNintendo.groupby("Name")["Global_Sales"].aggregate(["sum"]).sort_values(by="sum",ascending=False)
dfNintendoGameName=dfNintendoGameName.reset_index()
dfNintendoGameName2=dfNintendoGameName.head(10)
figNintendoGame = px.bar(dfNintendoGameName2, x='Name', y="sum", color="Name",log_y=True)
figNintendoGame.show()

Wii Sports is the best selling game for Nintendo   

In [None]:
NintendoGenre= px.pie(dfNintendo, values='Global_Sales', names='Genre',
             title='Global sales distribution for different genre (Nintendo publisher)',
            hole=.2)
NintendoGenre.show()

Among all the games published by Nintendo, Platform, Role-Playing, and Sports are the three most popular genres.

The analysis above is based on global sales. Next, we want to see how each individual market (NA, EU, JP) behaves for Nintendo.

In [None]:
# Each market is filtered from the full Nintendo frame, so one filter
# does not carry over into the next market
#NA
dfNintendoNA=dfNintendo[dfNintendo.NA_Sales>0].groupby("Year")["NA_Sales"].sum().reset_index()

#EU
dfNintendoEU=dfNintendo[dfNintendo.EU_Sales>0].groupby("Year")["EU_Sales"].sum().reset_index()

#JP
dfNintendoJP=dfNintendo[dfNintendo.JP_Sales>0].groupby("Year")["JP_Sales"].sum().reset_index()

plt.figure(figsize=(16,10))
plt.grid()
plt.plot(dfNintendoNA["Year"],dfNintendoNA["NA_Sales"])
plt.plot(dfNintendoEU["Year"],dfNintendoEU["EU_Sales"])
plt.plot(dfNintendoJP["Year"],dfNintendoJP["JP_Sales"])
plt.xlabel('Year')
plt.ylabel('Sales (in millions)')
plt.title('Game sales in the three markets for Nintendo')
plt.legend(["NA_market","EU_market","JP_market"])

In general, Nintendo's game sales in all three markets fluctuate a lot from year to year and peak around 2006-2007. After the peak, sales keep declining, and by around 2017 they return to the level seen before 1985.        
In particular, NA sales are larger than sales in the other two markets, which means that Nintendo-published games sell more in the NA market. Sales in the EU market are in general approximately the same as in the JP market, but they reach a higher peak, so the peak sales of the three markets are ordered NA_market > EU_market > JP_market.     
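
The peak year and peak value for each market can be read off programmatically rather than from the plot. The sketch below uses `idxmax` on yearly totals built from made-up rows; the frame name `peaks` is only for illustration.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Year": [2005.0, 2005.0, 2006.0, 2007.0],
    "NA_Sales": [1.0, 2.0, 5.0, 4.0],
    "EU_Sales": [0.5, 0.5, 2.0, 3.0],
    "JP_Sales": [1.0, 0.0, 0.5, 0.25],
})

# Yearly totals per market, indexed by Year
yearly = toy.groupby("Year")[["NA_Sales", "EU_Sales", "JP_Sales"]].sum()

# idxmax returns the index label (the year) of each column's maximum
peaks = pd.DataFrame({"peak_year": yearly.idxmax(), "peak_sales": yearly.max()})
print(peaks)
```

Sorting `peaks` by `peak_sales` would give the ordering of the markets directly.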

### 2. Electronic Arts

In [None]:
dfEA=dfInitial[dfInitial.Publisher=="Electronic Arts"]
dfEA[dfEA.Year==2015.0]

In [None]:
dfEA=dfInitial[dfInitial.Publisher=="Electronic Arts"]
dfEAGameName=dfEA.groupby("Name")["Global_Sales"].aggregate(["sum"]).sort_values(by="sum",ascending=False)
dfEAGameName=dfEAGameName.reset_index()
dfEAGameName2=dfEAGameName.head(10)
figEAGame = px.bar(dfEAGameName2, x='Name', y="sum", color="Name",log_y=True)
figEAGame.show()

"FIFA 15" is the best-selling game for Electronic Arts. Among the top 10 best-selling EA games, 6 are "FIFA" titles, which suggests that Electronic Arts focuses mainly on publishing "FIFA"-type games.

In [None]:
EAGenre= px.pie(dfEA, values='Global_Sales', names='Genre',
             title='Global sales distribution for different genre (EA publisher)',
            hole=.2)
EAGenre.show()

Nearly 50% of EA's global sales come from the sports genre, which makes sense since the most popular games that EA publishes belong to the "FIFA" series of football video games.

Next, we want to see how each individual market behaves for Electronic Arts.
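
The number of "FIFA" titles among the best sellers can be counted instead of read from the chart. The sketch below uses placeholder game names and made-up sales, and sums sales per name first, since a game released on several platforms has several rows.

```python
import pandas as pd

# Placeholder names and made-up sales, for illustration only
toy = pd.DataFrame({
    "Name": ["FIFA Demo 1", "FIFA Demo 1", "FIFA Demo 2", "Other A", "Other B"],
    "Global_Sales": [4.0, 2.0, 5.0, 3.0, 1.0],
})

# Sum across platforms per game, then keep the three best sellers
by_name = toy.groupby("Name")["Global_Sales"].sum().nlargest(3)

# Count how many of those names contain "FIFA"
fifa_count = int(by_name.index.str.contains("FIFA", case=False).sum())
print(by_name)
print(fifa_count)
```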

In [None]:
#NA
dfEANA=dfEA[dfEA.NA_Sales>0].groupby("Year")["NA_Sales"].sum().reset_index()

#EU
dfEAEU=dfEA[dfEA.EU_Sales>0].groupby("Year")["EU_Sales"].sum().reset_index()

#JP
dfEAJP=dfEA[dfEA.JP_Sales>0].groupby("Year")["JP_Sales"].sum().reset_index()

plt.figure(figsize=(16,10))
plt.grid()
plt.plot(dfEANA["Year"],dfEANA["NA_Sales"])
plt.plot(dfEAEU["Year"],dfEAEU["EU_Sales"])
plt.plot(dfEAJP["Year"],dfEAJP["JP_Sales"])
plt.xlabel('Year')
plt.ylabel('Sales (in millions)')
plt.title('Game sales in the three markets for Electronic Arts')
plt.legend(["NA_market","EU_market","JP_market"])

Electronic Arts barely participates in the JP market. Its main market is NA, where sales peak in 2005.

Both Nintendo and Electronic Arts (the publishers with the top two global sales) reach their sales peaks between 2005 and 2010; after 2010, sales keep going down. This holds in all three markets. We can guess that strong competitors may have appeared after 2010, or that substitute goods more attractive than video games may have emerged.