# Video Games Sales Analysis

This notebook analyses video game sales. The dataset comes from Kaggle, and we use the pandas, seaborn and matplotlib libraries to explore it and look for patterns that could help inform future sales decisions.

## Data Preparation and Cleaning

### Loading the dataset

We load the dataset with the pandas library and store it in a variable so that we can access it throughout the notebook.


In [None]:
# First we import the numpy and pandas libraries
import numpy as np
import pandas as pd

In [None]:
# Now we load the video game sales dataset
vgsales_df = pd.read_csv('../input/videogamesales/vgsales.csv')

### We will check which columns hold categorical data and which hold numerical data

In [None]:
vgsales_df.info()
# The text (non-numeric) columns are Name, Platform, Genre and Publisher; the remaining columns are numeric
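
As a complement to reading the `info()` output by eye, the split between text and numeric columns can also be listed programmatically. The following is a minimal sketch on a made-up frame (`toy_df` and its values are assumptions for illustration, only the column names follow the dataset); the same two `select_dtypes` calls should work on `vgsales_df`.

```python
import pandas as pd

# Toy frame: column names follow the dataset, the values are invented.
toy_df = pd.DataFrame({
    "Name": ["Game A", "Game B"],
    "Platform": ["Wii", "DS"],
    "Year": [2006.0, 2008.0],
    "Genre": ["Sports", "Action"],
    "Publisher": ["Pub X", "Pub Y"],
    "Global_Sales": [1.5, 0.7],
})

# Columns stored as text versus columns stored as numbers
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
numerical_cols = toy_df.select_dtypes(include="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```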

## Checking for missing values

Next we check the data frame for missing values with the pandas `isnull` function. Because the dataset is large, a heatmap helps us see where most of the missing values lie. First, a quick look at the first few rows:

In [None]:
vgsales_df.head()

Next we import `matplotlib.pyplot` and `seaborn` and set some plot defaults.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (20, 10)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [None]:
# Now we check for missing values in the data frame (True marks a missing cell)
vgsales_df.isnull()

### The missing values are easier to see graphically, so let's plot a `heatmap`

In [None]:
sns.heatmap(vgsales_df.isnull(), cmap="YlGnBu");

The heatmap shows some missing values in the `Publisher` and `Year` columns, so let's count them and then fill them with suitable values.

In [None]:
vgsales_df.isnull().sum()

We could delete the rows with missing values, but that would lose data, so we fill them instead.
There are `271` missing values in the `Year` column and `58` missing values in the `Publisher` column.
We fill the missing years with the mean of the `Year` column.
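
One thing to keep in mind with mean imputation is that the mean of a year column is usually fractional, so the filled rows end up with a non-integer year. The sketch below, on a small invented series (`years` is hypothetical, not taken from the dataset), compares the mean with the median, which is a common alternative for skewed columns; it is an illustration of the trade-off, not a replacement for the approach used in this notebook.

```python
import pandas as pd

# Made-up release years with two gaps, standing in for the `Year` column.
years = pd.Series([2006.0, 2008.0, None, 2009.0, None, 1984.0], name="Year")

# Mean imputation: the fill value is fractional (2001.75 for this toy data)
mean_filled = years.fillna(years.mean())

# Median imputation: less sensitive to the early outlier (2007.0 here)
median_filled = years.fillna(years.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

With the toy values, the single early year pulls the mean several years below the median, which is the kind of effect a histogram of the column helps to spot.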

In [None]:
# First we check the distribution of the Year column
vgsales_df.hist(column="Year");

In [None]:
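# Replace missing years with the mean of the Year column
mean = vgsales_df['Year'].mean()
# Assign the result back; an in-place fillna on a selected column may not update the data frame in recent pandas versions
vgsales_df['Year'] = vgsales_df['Year'].fillna(mean)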

Now we plot the heatmap again to check that the missing values in `Year` have been filled.

In [None]:
sns.heatmap(vgsales_df.isnull(), cmap="YlGnBu");

The plot above shows that the missing values in `Year` are filled. We fill the `Publisher` column in the same way, but with an `Unknown` category, because we do not know the true publisher.

In [None]:
# Assign the result back rather than filling in place on a selected column
vgsales_df['Publisher'] = vgsales_df['Publisher'].fillna('Unknown')

In [None]:
# Now we plot the heatmap again to check that all missing values are filled
sns.heatmap(vgsales_df.isnull(), cmap="YlGnBu");

### All missing values are now handled and the data frame is ready for further processing

## Exploratory Analysis and Visualization

We will go through the columns one by one, look for relationships between them and extract as much information as we can.



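### Number of games released every year

In [None]:
# countplot counts rows, so this shows how many titles were released each year, not how many units were sold
x_axis = vgsales_df['Year'].astype(int)
sns.countplot(x= x_axis, data = vgsales_df)
# Rotate the tick labels after the plot has been drawn
plt.xticks(rotation = 75)
plt.title('Number of Games Released Each Year')
plt.show()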

### Now we will compare yearly sales between regions

In [None]:
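# Total sales per year, restricted to the sales columns so that text columns are not summed
sales_cols = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
sales_year = vgsales_df.groupby('Year', as_index = False)[sales_cols].sum()
# sales_year
# Next we plot the yearly sales in each region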

In [None]:
plt.plot(sales_year['Year'], sales_year['NA_Sales'],color='blue')
plt.plot(sales_year['Year'], sales_year['EU_Sales'],color='red')
plt.plot(sales_year['Year'], sales_year['JP_Sales'],color='yellow')
plt.plot(sales_year['Year'], sales_year['Other_Sales'],color='indigo')
plt.plot(sales_year['Year'], sales_year['Global_Sales'],color='green')
plt.legend(["North America", "European Union","Japan","Other","Global"])
plt.xlabel("Year")
plt.ylabel("Total sales (millions)")
plt.title("Comparison of yearly sales by region")
plt.show()
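
Absolute totals make it hard to see how the regional mix changes over time, so one option is to convert each year's totals into shares. The sketch below uses an invented frame (`toy_df` and its numbers are assumptions; only the column names match the dataset) to show the groupby-then-normalise step; on the real data the same steps could be applied to `vgsales_df`.

```python
import pandas as pd

# Made-up sales rows; column names follow the dataset.
toy_df = pd.DataFrame({
    "Year": [2006, 2006, 2007],
    "NA_Sales": [2.0, 2.0, 3.0],
    "EU_Sales": [1.0, 1.0, 1.0],
    "JP_Sales": [0.5, 0.5, 0.0],
    "Other_Sales": [0.5, 0.5, 1.0],
})
region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Total sales per year and region
yearly = toy_df.groupby("Year")[region_cols].sum()

# Divide each row by its own total so that every year sums to 1
share = yearly.div(yearly.sum(axis=1), axis=0)
print(share)
```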

### Top 5 games in North America

In [None]:
top_games_NA = vgsales_df.sort_values('NA_Sales',ascending = False).head(5)

matplotlib.rcParams['figure.figsize'] = (20, 10)

explode = np.zeros(len(top_games_NA['NA_Sales']), dtype = float)

explode[0] = 0.1
exploded = tuple(explode)


plt.pie(top_games_NA['NA_Sales'], labels = top_games_NA['Name'], 
        autopct='%1.0f%%', 
        pctdistance=1.1, 
        labeldistance=1.2,explode=exploded,shadow=True)

plt.legend(bbox_to_anchor=(1,0.5), loc="center right", fontsize=15, 
           bbox_transform=plt.gcf().transFigure)

plt.show()


### Now we will find the top 5 games globally

In [None]:
top_games_G = vgsales_df.sort_values('Global_Sales',ascending = False).head(5)

matplotlib.rcParams['figure.figsize'] = (20, 10)

explode = np.zeros(len(top_games_G['Global_Sales']), dtype = float)

explode[0] = 0.1
exploded = tuple(explode)


# Label the slices with the names of the global top 5, not the North American top 5
plt.pie(top_games_G['Global_Sales'], labels = top_games_G['Name'], 
        autopct='%1.0f%%', 
        pctdistance=1.1, 
        labeldistance=1.2,explode=exploded,shadow=True)

plt.legend(bbox_to_anchor=(1,0.5), loc="center right", fontsize=15, 
           bbox_transform=plt.gcf().transFigure)

plt.show()

Globally, the top-selling game is Wii Sports.
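
The ranking above is per row of the dataset, and a title released on several platforms appears as one row per platform. If a per-title ranking is wanted, sales can be summed by name first. The sketch below uses made-up rows (`toy_df` is hypothetical) to show how the two rankings can differ.

```python
import pandas as pd

# Made-up rows: "Game B" was released on two platforms.
toy_df = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game B", "Game C"],
    "Platform": ["Wii", "PS3", "X360", "DS"],
    "Global_Sales": [10.0, 6.0, 5.0, 8.0],
})

# Ranking individual rows, as in the cell above
per_row_top = (
    toy_df.sort_values("Global_Sales", ascending=False).head(2)["Name"].tolist()
)

# Ranking titles after adding up their platform releases
per_title_top = toy_df.groupby("Name")["Global_Sales"].sum().nlargest(2)

print(per_row_top)
print(per_title_top)
```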

### Top 5 platforms in Japan

In [None]:
# Sum Japanese sales per platform, then keep the five largest totals
japan_sales = vgsales_df.groupby('Platform')['JP_Sales'].sum().nlargest(5)

matplotlib.rcParams['figure.figsize'] = (20, 10)

# Pull the first (largest) slice slightly out of the pie
explode = np.zeros(len(japan_sales), dtype = float)

explode[0] = 0.1
exploded = tuple(explode)


plt.pie(japan_sales, labels = japan_sales.index, 
        autopct='%1.0f%%', 
        pctdistance=1.1, 
        labeldistance=1.2,explode=exploded,shadow=True)

plt.legend(bbox_to_anchor=(1,0.5), loc="center right", fontsize=15, 
           bbox_transform=plt.gcf().transFigure)

plt.show()

The exploded slice marks the platform with the highest total sales in Japan. Note that ranking individual games by `JP_Sales` instead puts a Game Boy (GB) title first, but that describes a single game rather than the platform as a whole.

### Top 5 publishers by number of titles

In [None]:
publisher = vgsales_df['Publisher'].value_counts().head(5)

sns.barplot(x=publisher.index, y=publisher,palette="husl");

The chart above shows that Electronic Arts has published the most titles.

## Asking and Answering Questions

Now we will try to extract more information from the dataset by asking the questions below.



#### Q1: In which year were the most games published?

In [None]:
data = vgsales_df['Year']
sns.set_color_codes()
# discrete=True draws one bar per year; a bin width of 3 would merge neighbouring years
sns.displot(data, color='y', discrete=True, kde = True)
plt.show()

#### The chart above shows that the most games were published in 2008 and 2009

#### Q2: Which genre has been the most popular so far (by number of titles)?

In [None]:
hist_values = vgsales_df['Genre'].value_counts()
Xaxis = hist_values.index
Yaxis = hist_values.values

In [None]:
sns.barplot(x=Xaxis, y=Yaxis);

#### The chart above shows that Action has the most titles, which suggests it is the most popular genre so far.
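
Counting titles measures how many games exist in a genre, which is not necessarily the same as how much they sell. A quick cross-check is to sum sales per genre and compare the two rankings. The sketch below uses an invented frame (`toy_df` and its values are assumptions) in which the two measures disagree on purpose; whether they agree on the real dataset has to be checked by running the same two lines on `vgsales_df`.

```python
import pandas as pd

# Made-up rows: Action has more titles, Sports has more sales.
toy_df = pd.DataFrame({
    "Genre": ["Action", "Action", "Action", "Sports"],
    "Global_Sales": [1.0, 0.5, 0.5, 5.0],
})

# Number of titles per genre (what the bar chart above shows)
titles_per_genre = toy_df["Genre"].value_counts()

# Total sales per genre, largest first
sales_per_genre = (
    toy_df.groupby("Genre")["Global_Sales"].sum().sort_values(ascending=False)
)

print(titles_per_genre)
print(sales_per_genre)
```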

#### Q3: Which is the best-selling game in each genre?

In [None]:
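# For each genre, keep the row with the highest Global_Sales (the most frequent name is not the best seller)
game_sales = vgsales_df.loc[vgsales_df.groupby('Genre')['Global_Sales'].idxmax(), ['Genre', 'Name', 'Global_Sales']].set_index('Genre')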

In [None]:
game_sales

#### The table above shows each genre and its best-selling game
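
Selecting the best seller per group comes down to finding, within each genre, the row with the largest sales value. The sketch below shows that pattern with `groupby(...).idxmax()` on a made-up frame (`toy_df` and its values are assumptions for illustration).

```python
import pandas as pd

# Made-up rows; column names follow the dataset.
toy_df = pd.DataFrame({
    "Genre": ["Action", "Action", "Sports", "Sports"],
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Global_Sales": [3.0, 7.0, 9.0, 2.0],
})

# Row label of the highest-selling game within each genre
best_rows = toy_df.groupby("Genre")["Global_Sales"].idxmax()

# Look those rows up and index the result by genre
best_sellers = toy_df.loc[best_rows, ["Genre", "Name", "Global_Sales"]].set_index("Genre")
print(best_sellers)
```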

#### Q4: What are the average sales of publishers per region?

In [None]:
# Mean sales per title for each publisher in each region (the sales columns are in millions)
regions = {'EU_Sales': 'European Union', 'NA_Sales': 'North America',
           'JP_Sales': 'Japan', 'Other_Sales': 'Other Regions'}
plot_publishers = vgsales_df.groupby('Publisher')[list(regions)].mean()

fig, axes = plt.subplots(1, 4, figsize=(12, 8), sharey=True)
fig.suptitle('Mean sales per title by region', size=22)

for ax, (column, title) in zip(axes, regions.items()):
    # Take labels and values from the same sorted series so that they stay aligned
    top5 = plot_publishers[column].nlargest(5)
    sns.barplot(x=top5.index, y=top5.values, ax=ax)
    ax.set_title(title)
    ax.set_ylabel('Mean sales per title (millions)')
    # Publisher names are long, so rotate them
    ax.tick_params(axis='x', labelrotation=90)



#### Q5: Which publishers published the most games?

In [None]:
number_df = vgsales_df.groupby('Publisher')[['Name']].count().sort_values('Name', ascending = False).head(50)
number_df

In [None]:
number_df.plot(kind = 'bar', figsize = (50, 20));
# In a vertical bar chart the publishers are on the x axis and the counts on the y axis
plt.xlabel('Publisher', fontsize = 20);
plt.ylabel('Number of video games released', fontsize = 20);
plt.title('Top Publishers of Games', fontsize = 40);

## Inferences and Conclusion

In conclusion, Action is the genre with the most titles in this dataset, which suggests that action games are the most popular category overall. One possible explanation, which this analysis does not test, is that players use fast-paced, energetic action games as an outlet for frustration.