Assume we are a game developer. We don't know what type of game we want to develop for our next project, and we also don't know through which publisher we want to release it. Let's analyze video game sales data to help answer these questions.

# Load Library and Dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
a_data = pd.read_csv('/kaggle/input/videogamesales/vgsales.csv')
a_data.head()

# Dataset Description

In [None]:
a_data.info()

In [None]:
a_data.describe()

# Preprocessing

## Dropping the 'Rank' and 'Year' Columns

In [None]:
a_data.drop(['Rank', 'Year'], axis=1, inplace=True)
a_data.head()

## Handling Descriptions in the 'Name' Column

In [None]:
for nama in a_data['Name'].unique():
    print(nama)

If you look at the list of game names above, you will see that some names contain a description in parentheses (a 'sales'-like string). Let's remove it!

In [None]:
def fix_name(a_str):
    # Keep only the text before the first '(' and drop trailing whitespace,
    # so names without a space before '(' do not lose their last character.
    return a_str.split('(')[0].rstrip()

a_data['Name'] = a_data['Name'].apply(fix_name)

Let's check the result: no name should contain 'sales' anymore.

In [None]:
a_data[a_data['Name'].str.contains('sales')]
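The same cleanup could also be written without a helper function, using pandas string methods. The following is only a sketch on made-up names (they are not taken from the dataset), assuming every description starts at the first '(' in the name.

```python
import pandas as pd

# Made-up names for illustration; they are not taken from the dataset.
names = pd.Series(['Game A (JP sales)', 'Game B', 'Game C(old sales)'])

# Remove everything from the first '(' onward, plus any whitespace before it.
cleaned = names.str.replace(r'\s*\(.*$', '', regex=True)

print(cleaned.tolist())
```

Names without parentheses are left unchanged, which matches what `fix_name` is meant to do.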

## Handling Duplicate 'Platform' Values for the Same Game

As a result of the 'Name' column treatment, some rows now share the same 'Name' and 'Platform' values. For example, 'Need for Speed: Most Wanted' has two rows with 'PC' as its 'Platform' value.

In [None]:
a_data[a_data['Name'] == 'Need for Speed: Most Wanted']

Let's solve that problem by grouping the rows and summing their sales!

In [None]:
a_data = a_data.groupby(['Name', 'Publisher', 'Genre', 'Platform']).sum()
a_data.reset_index(inplace= True)
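To see what this grouping does, here is a small sketch on a made-up table (the names, publishers, and numbers are invented). It also shows one thing to be aware of: by default, `groupby` drops rows whose grouping keys contain missing values, so any game with a missing 'Publisher' would disappear unless `dropna=False` is passed.

```python
import pandas as pd

# Invented rows: 'Game A' appears twice on 'PC', 'Game C' has no publisher.
toy = pd.DataFrame({
    'Name': ['Game A', 'Game A', 'Game B', 'Game C'],
    'Publisher': ['Pub X', 'Pub X', 'Pub Y', None],
    'Genre': ['Racing', 'Racing', 'Puzzle', 'Puzzle'],
    'Platform': ['PC', 'PC', 'DS', 'DS'],
    'Global_Sales': [0.5, 0.25, 1.0, 0.1],
})
keys = ['Name', 'Publisher', 'Genre', 'Platform']

# Default behaviour: duplicates are merged, the row with a missing key is dropped.
merged = toy.groupby(keys).sum().reset_index()

# Keeping rows with missing keys requires dropna=False.
kept = toy.groupby(keys, dropna=False).sum().reset_index()

print(merged)
print(len(merged), len(kept))
```

Whether dropping those rows is acceptable depends on the analysis; for the publisher plots later on, games without a publisher would not contribute anyway.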

## Add Up Sales of Games with the Same Name

I'm curious whether multiplatform games have better sales numbers than exclusive games, so I turn the 'Platform' column into 'Platform_Count' and group the data again, this time per game. I also set 'Name' as the index. The 'Publisher' column stays as a grouping key because the publisher analysis below needs it.

In [None]:
a_data['Platform'] = 1
a_data.rename(columns = {'Platform': 'Platform_Count'}, inplace=True)
a_data = a_data.groupby(['Name', 'Publisher', 'Genre']).sum()
a_data.reset_index(inplace = True)
a_data.set_index('Name', inplace = True)
a_data.head()
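The counting trick above (set 'Platform' to 1, then sum) can also be expressed with named aggregation. This is a sketch on invented data, not the notebook's actual code path, and it counts distinct platforms with `nunique` instead of summing ones.

```python
import pandas as pd

# Invented rows: 'Game A' is on two platforms, 'Game B' on one.
toy = pd.DataFrame({
    'Name': ['Game A', 'Game A', 'Game B'],
    'Platform': ['PC', 'PS2', 'DS'],
    'Global_Sales': [0.5, 0.25, 1.0],
})

# One row per game: number of distinct platforms and total sales.
summary = toy.groupby('Name').agg(
    Platform_Count=('Platform', 'nunique'),
    Global_Sales=('Global_Sales', 'sum'),
)

print(summary)
```

Both approaches give the same count here because duplicate 'Name' and 'Platform' pairs were already merged in the previous step.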

## Generate 'Sales_Region' Column

I'm also curious about how the regions a game is sold in relate to its sales. Assume that a game is sold in a region if its sales number there is greater than 0. Now let's generate the 'Sales_Region' column.

In [None]:
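def sales_region(cols):
    # Each digit stands for one region with non-zero sales:
    # 0 = NA, 1 = EU, 2 = JP, 3 = Others.
    type_dict= {'0': 'NA',
                '1': 'EU',
                '2': 'JP',
                '3': 'Others',
                '01': 'NA & EU',
                '02': 'NA & JP',
                '03': 'NA & Others',
                '12': 'EU & JP',
                '13': 'EU & Others',
                '23': 'JP & Others',
                '012': 'NA, EU, & JP',
                '013': 'NA, EU, & Others',
                '023': 'NA, JP, & Others',
                '123': 'EU, JP, & Others',
                '0123': 'All Region'
               }
    type_str = ''
    # Access the sales columns by label so the function does not depend
    # on the column order of the DataFrame.
    if cols['NA_Sales'] > 0:
        type_str += '0'

    if cols['EU_Sales'] > 0:
        type_str += '1'

    if cols['JP_Sales'] > 0:
        type_str += '2'

    if cols['Other_Sales'] > 0:
        type_str += '3'

    # A game with no regional sales at all gets the label 'None'
    # instead of raising a KeyError.
    return type_dict.get(type_str, 'None')

a_data['Sales_Region'] = a_data.apply(sales_region, axis=1)
a_data.head()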
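The fifteen labels follow a regular pattern, so they could also be built from a list of regions instead of a lookup table. The sketch below is one possible way to do that on invented numbers; `region_label` and `REGIONS` are hypothetical names, and the 'None' label for games without any regional sales is my own assumption.

```python
import pandas as pd

# (column, label) pairs in the order used by the labels above.
REGIONS = [('NA_Sales', 'NA'), ('EU_Sales', 'EU'),
           ('JP_Sales', 'JP'), ('Other_Sales', 'Others')]

def region_label(row):
    # Collect the labels of all regions with sales greater than 0.
    sold = [label for col, label in REGIONS if row[col] > 0]
    if not sold:
        return 'None'  # assumed fallback for rows without regional sales
    if len(sold) == len(REGIONS):
        return 'All Region'
    if len(sold) == 3:
        return ', '.join(sold[:2]) + ', & ' + sold[2]
    return ' & '.join(sold)

# Invented sales figures for illustration only.
toy = pd.DataFrame({
    'NA_Sales':    [1.0, 0.0, 0.5, 0.2, 0.0],
    'EU_Sales':    [0.5, 0.0, 0.0, 0.1, 0.0],
    'JP_Sales':    [0.2, 0.3, 0.1, 0.1, 0.0],
    'Other_Sales': [0.1, 0.0, 0.0, 0.0, 0.0],
}, index=['Game A', 'Game B', 'Game C', 'Game D', 'Game E'])

toy['Sales_Region'] = toy.apply(region_label, axis=1)
print(toy['Sales_Region'].tolist())
```

The explicit dictionary is easier to read at a glance; the generated version is shorter and harder to mistype.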

# Visualization

## Genre

In [None]:
plt.figure(figsize = (12, 8))
sns.countplot(x='Genre', data=a_data, order=a_data['Genre'].value_counts().index)
plt.xticks(rotation=45, ha='right')
plt.title('Relationship between genre and the number of games')
plt.ylabel('Number of Games')
plt.show()

In [None]:
plt.figure(figsize = (12, 8))
sns.barplot(x='Genre', y='Global_Sales', data=a_data, estimator=np.median, order=a_data['Genre'].value_counts().index)
plt.title('Relationship between genre and median global sales')
plt.ylabel('Global Sales')
plt.xticks(rotation=45, ha='right')
plt.show()

Action is the most common genre in this dataset, but its sales are not that good. I think it's better to consider Platform as the genre of our next game.
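The two charts can be backed by a small table with the number of games and the median sales per genre. The sketch below uses invented numbers only to show the shape of the calculation; on the real data, `toy` would be replaced by `a_data`.

```python
import pandas as pd

# Invented numbers, not taken from the dataset.
toy = pd.DataFrame({
    'Genre': ['Action', 'Action', 'Action', 'Platform', 'Platform'],
    'Global_Sales': [0.1, 0.2, 0.3, 0.5, 0.9],
})

# Number of games and median global sales per genre, most common genre first.
genre_summary = (
    toy.groupby('Genre')['Global_Sales']
       .agg(['count', 'median'])
       .sort_values('count', ascending=False)
)

print(genre_summary)
```

Reading count and median side by side makes it easier to spot genres that are crowded but sell little per game.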

## Platform_Count

In [None]:
a_data['Platform_Count'].value_counts()

In [None]:
plt.figure(figsize = (12, 8))
sns.countplot(x='Platform_Count', data = a_data)

Platform-exclusive games seem to be the most common type in this dataset, but are Platform-genre games mostly platform-exclusive too?

## Platform_Count & Genre

In [None]:
plt.figure(figsize = (12, 8))
# fmt='d' prints the counts as plain integers instead of scientific notation.
sns.heatmap(pd.crosstab(a_data['Platform_Count'], a_data['Genre']), cmap='coolwarm', annot=True, fmt='d')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Looking only at the Platform genre, most games are released on a single platform, so it seems better for our Platform-genre game to be platform-exclusive too.
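Because genres differ a lot in size, raw counts can be hard to compare across columns. One option is to normalize the cross-tabulation per genre, so each column shows the share of games at each platform count. This is a sketch on invented rows, not a result from the dataset.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'Platform_Count': [1, 1, 2, 3, 1, 1, 1],
    'Genre': ['Action', 'Action', 'Action', 'Action',
              'Platform', 'Platform', 'Platform'],
})

# normalize='columns' divides each column by its total,
# so every genre column sums to 1.
share = pd.crosstab(toy['Platform_Count'], toy['Genre'], normalize='columns')

print(share)
```

The resulting table can be passed to `sns.heatmap` in the same way as the count table, with a float format such as `fmt='.2f'`.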

## Publisher

In [None]:
platform_data = a_data[a_data['Genre'] == 'Platform']

plt.figure(figsize = (12, 8))
sns.countplot(x='Publisher', data=platform_data, order=platform_data['Publisher'].value_counts().index[:10])
plt.title('Top 10 Publishers of Platform-genre Games')
plt.ylabel('Number of Games')
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
plt.figure(figsize = (12, 8))
sns.barplot(x='Publisher', y='Global_Sales', data=platform_data, estimator=np.median, order=platform_data['Publisher'].value_counts().index[:10])
plt.title('Top 10 Publishers of Platform-genre Games and their median global sales')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Publisher')
plt.ylabel('Global Sales')
plt.show()

Nintendo has published the most Platform-genre games, and its global sales are also good. It would be best if we could publish our Platform-genre game through Nintendo.

## Sales_Region

In [None]:
plt.figure(figsize = (12, 8))
sns.countplot(x='Sales_Region', data=a_data, order=a_data['Sales_Region'].value_counts().index)
plt.xticks(rotation=45, ha='right')
plt.title('Relationship between sales region and the number of games')
plt.xlabel('Sales Region')
plt.ylabel('Number of Games')
plt.show()

In [None]:
plt.figure(figsize = (12, 8))
sns.barplot(x='Sales_Region', y='Global_Sales', data=a_data, estimator=np.median, order=a_data['Sales_Region'].value_counts().index)
plt.xticks(rotation=45, ha='right')
plt.title('Relationship between sales region and median global sales')
plt.xlabel('Sales Region')
plt.ylabel('Global Sales')
plt.show()

Even though games sold in all regions rank only third by number of games, their sales are better than those of the first- and second-ranked groups. It seems that it's better to sell in all regions.

# Conclusion

We will develop a Platform-genre game that is exclusive to one platform. We will publish it through Nintendo and sell it in all regions.