# Communicate Data Findings Project: Video Game Sales

## Table of Contents
<ul> 
    <li><a href='#intro'>1. Introduction</a></li> 
    <li><a href='#overview'>2. DataSet General Overview</a></li>  
    <li><a href='#inquiry'>3. Important Inquiries For Research</a></li> 
    <li><a href='#clean'>4. Data Cleaning & Wrangling</a></li> 
    <li><a href='#explore'>5. Exploratory Data Analysis & Visualizations</a></li> 
    <li><a href='#final'>6. Final Report + Explanatory Data Visualization</a></li> 
</ul>

<a id='intro'></a>
## 1. Introduction

This dataset holds records for more than 16,000 video games that each sold over 100,000 copies worldwide, from 1980 to 2020.

This notebook gives an overview of the dataset, then cleans and visualizes it to answer important inquiries about video game sales, both worldwide and by region.

Python libraries (pandas, matplotlib, seaborn) are used throughout the notebook for data wrangling, cleaning, and building exploratory and explanatory visualizations for storytelling.

<a id='overview'></a>
## 2. DataSet General Overview 

We start with a general overview of the dataset to identify its structure and the modifications it needs (wrangling, cleaning, adding more variables). Exploring the data for first insights will also help guide the exploratory and explanatory visualizations and the analysis.

In [None]:
# Import libraries for data wrangling 
import numpy as np 
import pandas as pd 
# Import libraries for data visualizations 
import matplotlib.pyplot as plt 
import seaborn as sns 
%matplotlib inline 
# Import other libraries (may be needed)
import warnings, os, time 
# Load the data 
df = pd.read_csv('../input/videogamesales/vgsales.csv')
df.head()

In [None]:
# More viewing of the data 
df.info()

In [None]:
df.describe()

In [None]:
# Check for duplicated records and drop them if any are found
def find_remove_duplicates(df): 
    df_no_dup = df.drop_duplicates()
    len_dup = len(df) - len(df_no_dup)
    if len_dup > 0: 
        print(f'There are {len_dup} duplicated records in the dataset')
        print('Removing duplicated records....')
    else: 
        print('There are no duplicated records')
    # Return the result: reassigning `df` inside the function does not change the caller's frame
    return df_no_dup

In [None]:
df = find_remove_duplicates(df)

In [None]:
# Get overview of global sales over the years
base_color = sns.color_palette()[0]
sns.scatterplot(x=df['Year'], y=df['Global_Sales'], alpha=0.5, color=base_color);

In [None]:
# Get distribution of games over the years
sns.boxplot(x=df['Year']);

In [None]:
# Checking ratio of games sold before 1995 to the total games 
before_1995_ratio = (df['Year']<1995).mean()
print (f'Ratio of games sold before 1995 is {(before_1995_ratio*100):.2f} % of the total data.')

Ratio of games sold before 1995 is 2.93 % of the total data.

#### From the general overview of the data, the following was found:
1. Missing values were found in the Year and Publisher columns. These records are not a significant percentage of the data, so they can be removed.
2. No duplicate values were found in the dataset.
3. Grouping the data will be needed in the data wrangling step to create more insightful visualizations.
4. Outlier values were found in the Year column, which lead to more outlier values in the sales columns. Removal won't be necessary, since we want a complete overview of the data that takes into account changes in the game industry's size, sales, and the purchasing power of individuals.
5. Column names should be modified to make further wrangling and visualization easier.
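
The first point can be backed with numbers by computing the share of missing values per column. The following is a minimal sketch on a small made-up frame; `toy` and the helper `missing_share` are illustrative names that do not appear elsewhere in this notebook, and the same helper could be called on `df`.

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values in each column."""
    return frame.isna().mean().mul(100).round(2)

# Hypothetical toy data that mimics the two columns with gaps
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Year': [2006.0, np.nan, 2008.0, 2009.0],
    'Publisher': ['X', 'Y', None, None],
})

shares = missing_share(toy)
print(shares)
```

If the percentages are small, as the overview suggests for this dataset, dropping the affected rows loses little information.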

<a id='inquiry'></a>
## 3. Important Inquiries For Research 

### The most important inquiries for research are:
1. Which parameters most affect global sales?
2. Does the effect of these parameters differ across regions? If so, why?

<a id='clean'></a>
## 4. Data Cleaning & Wrangling 

In [None]:
# Remove all missing values 
df = df.dropna()
df.info()

In [None]:
# Convert column names to lower case so they are easier to work with
df.columns = df.columns.str.lower()
df.info()

In [None]:
# Get final overview 
df.head()

<a id='explore'></a>
## 5. Exploratory Data Analysis & Visualizations 

### 5.1 Univariate Exploration 

In [None]:
# View distribution of the number of games over the years
fig, ax = plt.subplots(figsize=(12,6))
sns.histplot(df['year'], kde=True)
plt.xticks(rotation=60)
plt.title('Number of Games Over The Years');

In [None]:
# View count of games on different gaming platforms
# (platform and genre are categorical, so no KDE curve is drawn)
fig, ax = plt.subplots(figsize=(12,6))
sns.histplot(df['platform'])
plt.xticks(rotation=60)
plt.title('Number of Games on Different Platforms');
# View count of games produced in different genres
fig, ax = plt.subplots(figsize=(12,6))
sns.histplot(df['genre'])
plt.xticks(rotation=60)
plt.title('Number of Games Produced in Each Genre');

### 5.2 Bivariate Exploration of the data

In [None]:
# Viewing sales amount over the years
# numeric_only=True keeps text columns (name, platform, ...) out of the sum
sales_years = df.groupby(by='year').sum(numeric_only=True)
sales_years.drop(columns='rank', inplace=True)
fig, ax = plt.subplots(figsize=(10,5))
ax = sns.lineplot(data=sales_years, dashes=False)
plt.ylabel('Sales (M$)')
plt.title('Sales Amount (Million $) Over the Years');

In [None]:
# Viewing sales amount for different genres 
sales_genres = df.groupby(by='genre').sum(numeric_only=True)
sales_genres.drop(columns=['rank','year'], inplace=True)
sales_genres.sort_values(by='global_sales', inplace=True)
fig, ax = plt.subplots(figsize=(10,5))
ax = sns.lineplot(data=sales_genres, dashes=False)
plt.xticks(rotation=45)
plt.ylabel('Sales (M$)')
plt.title('Sales Amount For Different Genres'); 

It appears that Action and Sports games get the most sales in America and Europe, but the case is different in Japan, where Role-Playing games top the sales. That will require more investigation.
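
Reading the leading genre off a chart can be cross-checked numerically. The following is a minimal sketch, assuming a frame laid out like `sales_genres` (genres as the index, one sales column per region); `toy_genres` and its numbers are made up for illustration only.

```python
import pandas as pd

# Hypothetical per-genre totals, same layout as sales_genres (made-up numbers)
toy_genres = pd.DataFrame(
    {
        'na_sales': [8.0, 6.0, 3.0],
        'eu_sales': [5.0, 4.0, 2.0],
        'jp_sales': [1.5, 1.0, 3.5],
    },
    index=['Action', 'Sports', 'Role-Playing'],
)

# idxmax returns, for each region column, the genre with the highest total
top_genre_by_region = toy_genres.idxmax()
print(top_genre_by_region)
```

Applied to the real grouped data, this would give one genre name per region, which makes a difference such as the one seen for Japan explicit.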

In [None]:
# Viewing sales amount for different platforms 
sales_platforms = df.groupby(by='platform').sum(numeric_only=True)
sales_platforms.drop(columns=['rank','year'], inplace=True)
sales_platforms.sort_values(by='global_sales', inplace=True)
fig, ax = plt.subplots(figsize=(10,5))
ax = sns.lineplot(data=sales_platforms, dashes=False)
plt.xticks(rotation=60)
plt.ylabel('Sales (M$)')
plt.title('Sales Amount (Million $) on Different Platforms');

Here it's clear that the regions have very different favorite gaming platforms:

* North America: prefers the Xbox 360
* Europe: prefers the PS2 and PS3
* Japan: different as always, with the Nintendo DS on top
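
Because the regional markets differ in size, absolute totals can hide how strongly a region favors a platform. One possible way to compare regions is to convert each region's column to percentage shares. This is a sketch under that assumption, using a made-up frame `toy_platforms` rather than the notebook's data.

```python
import pandas as pd

# Hypothetical per-platform totals (made-up numbers)
toy_platforms = pd.DataFrame(
    {
        'na_sales': [60.0, 30.0, 10.0],
        'jp_sales': [1.0, 4.0, 5.0],
    },
    index=['X360', 'PS3', 'DS'],
)

# Divide each region column by its own total so every region sums to 100 %
platform_share = toy_platforms.div(toy_platforms.sum(axis=0), axis=1).mul(100)
print(platform_share.round(1))
```

With shares, a small market such as Japan can be compared directly with a large one such as North America.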

In [None]:
# Viewing sales amount for publishers above 50 million global sales
sales_publishers = df.groupby(by='publisher').sum(numeric_only=True)
sales_publishers.drop(columns=['rank','year'], inplace=True)
sales_publishers.sort_values(by='global_sales', ascending=False, inplace=True)
fig, ax = plt.subplots(figsize=(10,5))
ax = sns.lineplot(data=sales_publishers[sales_publishers['global_sales']>50], dashes=False)
plt.xticks(rotation=90)
plt.ylabel('Sales (M$)')
plt.title('Sales Amount For Publishers Making over 50 Million $');

In [None]:
# So what are the best selling games in each region .. Starting with North America 
plt.subplots(figsize=(16,8))
top_games_na = df[['name', 'genre', 'platform','year','na_sales']].sort_values(by='na_sales', ascending=False)[:10]
sns.lineplot(x=top_games_na['name'], y=top_games_na['na_sales'])
plt.xticks(rotation=60)
plt.title('Best Selling Games in North America');

In [None]:
# what are the best selling games in Europe then 
plt.subplots(figsize=(16,8))
top_games_eu = df[['name', 'genre', 'platform','year','eu_sales']].sort_values(by='eu_sales', ascending=False)[:10]
sns.lineplot(x=top_games_eu['name'], y=top_games_eu['eu_sales'])
plt.xticks(rotation=60)
plt.title('Best Selling Games in Europe');

In [None]:
# what are the best selling games in Japan then 
plt.subplots(figsize=(16,8))
top_games_jp = df[['name', 'genre', 'platform','year','jp_sales']].sort_values(by='jp_sales', ascending=False)[:10]
sns.lineplot(x=top_games_jp['name'], y=top_games_jp['jp_sales'])
plt.xticks(rotation=60)
plt.title('Best Selling Games in Japan');

### 5.3 Multivariate Exploration

In [None]:
# There is a very big gap between the top-selling publishers and the rest, so let's view the top 5 publishers' sales over the years
top_publishers = sales_publishers[sales_publishers['global_sales']>50][:5].index
df_top_publishers = df[df['publisher'].isin(top_publishers)].groupby(by=['publisher','year']).sum(numeric_only=True)
fig, ax = plt.subplots(figsize=(16,8))
x=df_top_publishers.index.get_level_values('year')
y = df_top_publishers['global_sales']
hue = df_top_publishers.index.get_level_values('publisher')
sns.lineplot(x=x,y=y,hue=hue)
plt.title('Sales For Top 5 Publishers Over The Years');

In [None]:
# Now, Let's view development of top 5 game platforms over the years
top_5_platforms = df.groupby(by='platform').sum(numeric_only=True).sort_values(by='global_sales', ascending=False).index[:5]
df_top_platforms = df[df['platform'].isin(top_5_platforms)].groupby(by=['platform','year']).sum(numeric_only=True)
fig, ax = plt.subplots(figsize=(16,8))
x=df_top_platforms.index.get_level_values('year')
y = df_top_platforms['global_sales']
hue = df_top_platforms.index.get_level_values('platform')
sns.lineplot(x=x,y=y,hue=hue)
plt.title('Sales For Top 5 Platforms Over The Years');

<a id='final'></a>
## 6. Final Report + Explanatory Data Visualization

### How did game sales develop in different regions over the years?

In [None]:
plt.subplots(figsize=(20,10))
sales_years = df.groupby(by='year').sum(numeric_only=True).drop(columns=['rank', 'global_sales'])
sns.lineplot(data=sales_years)
plt.title('Sales in Different Regions Along the Years', fontsize=20)
plt.legend(['North America', 'Europe', 'Japan', 'Other'], fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Sales (Million $)', fontsize=16);

### Who are the Top 10 Game Publishers Globally, and How Do Their Sales Differ Across Regions of the World?

In [None]:
top_publishers = df.groupby(by='publisher').sum(numeric_only=True)
top_publishers = top_publishers.sort_values(by='global_sales', ascending=False)[:10]
plt.subplots(figsize=(20,10))
sns.barplot(x=top_publishers.index, y=top_publishers['global_sales'], color=sns.color_palette()[0])
plt.title('Top 10 Game Publishers Around the Globe', fontsize=20)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel(' ')
plt.ylabel('Global Sales (Million $)', fontsize=16);

In [None]:
top_publishers = df.groupby(by='publisher').sum(numeric_only=True)
top_publishers = top_publishers.sort_values(by='global_sales', ascending=False)[:10]
top_publishers = top_publishers[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']]
plt.subplots(figsize=(20,10))
# One line per region; the colors come from the default palette
sns.lineplot(data=top_publishers)
plt.title('Top 10 Game Publishers Sales in Different World Regions', fontsize=20)
plt.legend(['North America', 'Europe', 'Japan', 'Other'], fontsize=14)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel(' ')
plt.ylabel('Sales (Million $)', fontsize=16);

### How Did the Sales of the Top 5 Game Publishers Develop Over the Years?

In [None]:
top_5_publishers = df.groupby(by='publisher').sum(numeric_only=True)
top_5_publishers = top_5_publishers.sort_values(by='global_sales', ascending=False)['global_sales'][:5].index
df_top_publishers = df[df['publisher'].isin(top_5_publishers)].groupby(by=['publisher', 'year']).sum(numeric_only=True)
plt.subplots(figsize=(20,10))
x = df_top_publishers.index.get_level_values('year')
y = df_top_publishers['global_sales']
hue = df_top_publishers.index.get_level_values('publisher')
sns.lineplot(x=x,y=y,hue=hue)
plt.title('Sales of Top 5 Game Companies Over The Years', fontsize=20)
plt.legend(fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Global Sales (Million $)', fontsize=16);

### What are The Most Popular Game Genres in Different Regions of the World?

In [None]:
top_genres = df.groupby(by='genre').sum(numeric_only=True).drop(columns=['year', 'rank', 'global_sales'])
plt.subplots(figsize=(20,10))
sns.lineplot(data=top_genres)
plt.title('Genres Popularity in Different Regions Across The World', fontsize=20)
plt.xlabel(' ')
plt.ylabel('Sales (Million $)', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(['North America', 'Europe', 'Japan', 'Others'], fontsize=14);

### What are The Most Popular Game Platforms in Different Regions of the World?

In [None]:
top_platforms = df.groupby(by='platform').sum(numeric_only=True).drop(columns=['year', 'rank', 'global_sales'])
plt.subplots(figsize=(20,10))
sns.lineplot(data=top_platforms)
plt.title('Platforms Popularity in Different Regions Across The World', fontsize=20)
plt.xlabel(' ')
plt.ylabel('Sales (Million $)', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(['North America', 'Europe', 'Japan', 'Others'], fontsize=14);

### Finally, to Confirm All of the Above: What are The Best Selling Games in Different Regions of The World?

In [None]:
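# Sort explicitly by global sales instead of relying on the row order of the file
top_games = df.sort_values(by='global_sales', ascending=False)[['name', 'na_sales', 'eu_sales', 'jp_sales', 'other_sales']][:15]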
top_games.set_index(keys='name', inplace=True)
plt.subplots(figsize=(20,10))
sns.lineplot(data=top_games)
plt.title('Top Games Preferences in Different Regions Across The World', fontsize=20)
plt.xlabel(' ')
plt.ylabel('Sales (Million $)', fontsize=16)
plt.xticks(rotation=60, fontsize=12)
plt.yticks(fontsize=12)
plt.legend(['North America', 'Europe', 'Japan', 'Others'], fontsize=14);

### Findings From the Exploratory Analysis Shown in the Explanatory Presentation
1. The period from 2007 to 2010 appears to have been a golden period for game sales across publishers and regions, possibly due to the rise of Wii games.
2. Among game publishers, Nintendo takes the lead as the No. 1 publisher in all regions.
3. Although Nintendo is the No. 1 publisher, others such as Ubisoft and Activision are more popular in America and Europe than in Japan or other regions, while publishers such as Namco Bandai and Konami are most popular in Japan, even more than in America and Europe, the largest consumer markets.
4. Over the years, Nintendo was usually in the lead. Other companies such as Electronic Arts and Activision beat Nintendo in some years, especially in America and Europe, but Nintendo kept coming back, outselling the other companies by more than 150 million dollars in some years.
5. For game genres, Action and Sports are in the lead, with more than 600 million dollars each in America and more than 350 million dollars each in Europe. Japan is different: Role-Playing games lead with nearly 400 million dollars, while Action and Sports each stay below 200 million.
6. Platform preferences vary widely across the world. In America, the Xbox 360 was the most popular, followed by the Wii and PlayStation, probably because of Microsoft's strong position in the American market.
7. In Europe, the PlayStation is the most popular, followed by the Xbox. As always, Japan has its unique preferences: the Nintendo DS was the No. 1 platform there, followed by the PlayStation.
8. The best-selling games show a spike that coincides with the arrival of Wii games, which explains the 2007 spike mentioned in the first point. In fact, all of the top 15 best-selling games were made by Nintendo, which explains its huge lead over other game sellers.
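
The last point can be verified by counting the publishers among the best-selling titles rather than reading them off the chart. The following is a minimal sketch on made-up data; `toy_games` and its values are illustrative, and with the real data the same two calls would be applied to `df` with 15 instead of 3.

```python
import pandas as pd

# Hypothetical game records (made-up names and numbers)
toy_games = pd.DataFrame({
    'name': ['G1', 'G2', 'G3', 'G4'],
    'publisher': ['Nintendo', 'Nintendo', 'Activision', 'Nintendo'],
    'global_sales': [80.0, 40.0, 10.0, 35.0],
})

# Take the 3 best-selling games, then count how often each publisher appears
top_n = toy_games.nlargest(3, 'global_sales')
publisher_counts = top_n['publisher'].value_counts()
print(publisher_counts)
```

A single publisher in the result means that publisher made every game in the top group.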

### Key Insights For Presentation
In the game industry, many variables control the sales and success of a company or platform. Based on the analysis of the data, we can mention the following points:

1. Gaming platforms need continuous development to keep selling. For example, PlayStation kept upgrading to the PS2 and PS3 and maintained steady sales instead of a single peak like the Wii.
2. You need to study your target region very well before putting a game on sale: America, Europe, and Japan show very different preferences for game genres and platforms.
3. However, as in any industry, some companies, like Nintendo, stay on top.