# Identifying the patterns that determine a game's success

![](https://i.ibb.co/h2wfnr3/Games-2000x1125.jpg)

Historical data on game sales, genres and platforms (for example, **Xbox** or **PlayStation**) are available from open sources. We need to identify the patterns that determine the success of the game. 


## General information about the data

### Loading and previewing data

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import seaborn as sns
from scipy import stats as st
sns.set_style("darkgrid")
import warnings
warnings.filterwarnings("ignore")


In [None]:
game_df = pd.read_csv('../input/videogamesales/vgsales.csv')
game_df.head()

In [None]:
game_df.info()

In [None]:
game_df.describe().T

We observe NaN values in the `Year` and `Publisher` columns.

### Data description
- `Rank` - Ranking of overall sales

- `Name` - The game's name

- `Platform` - Platform of the game's release (e.g. PC, PS4)

- `Year` - Year of the game's release

- `Genre` - Genre of the game

- `Publisher` - Publisher of the game

- `NA_Sales` - Sales in North America (in millions)

- `EU_Sales` - Sales in Europe (in millions)

- `JP_Sales` - Sales in Japan (in millions)

- `Other_Sales` - Sales in the rest of the world (in millions)

- `Global_Sales` - Total worldwide sales.


## Summary

- at first glance, no anomalies were found
- the `Year` and `Publisher` columns contain missing values; we will study them and decide how to handle them
- the `Year` column should be converted from `float` to `int`; a `datetime` type is not needed
- the remaining columns need no further study: their data types are appropriate

## Data preprocessing

### Replacing column names.

In [None]:
game_df.columns

Let's convert the column names to lowercase for ease of use

In [None]:
game_df.columns = game_df.columns.str.lower()
game_df.columns

### The presence of duplicates in the data


In [None]:
game_df.duplicated().sum()

No duplicates were found. Next, we move on to missing values and how to handle them.
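Called without arguments, `duplicated()` only flags rows that match in every column, and a unique `rank` value makes full matches unlikely. A narrower check on a subset of columns can be sketched as follows. The toy dataframe and the choice of key columns (`name`, `platform`, `year`) are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy data standing in for game_df (hypothetical values)
toy = pd.DataFrame({
    'rank': [1, 2, 3, 4],
    'name': ['Game A', 'Game A', 'Game B', 'Game A'],
    'platform': ['PS2', 'PS2', 'PC', 'X360'],
    'year': [2004.0, 2004.0, 2010.0, 2004.0],
})

# Rows identical in every column (rank differs, so none are found)
full_dups = toy.duplicated().sum()

# Rows that repeat the same name / platform / year combination
key_dups = toy.duplicated(subset=['name', 'platform', 'year']).sum()

print(full_dups, key_dups)
```

Whether such key-level repeats are true duplicates or legitimate re-releases has to be decided by looking at the rows themselves.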

### Counting and Handling NaNs

In [None]:
game_df.isna().mean().sort_values(ascending=False)

In [None]:
game_df.isna().mean().sort_values(ascending=False).plot(
    kind='bar', figsize=(15, 5),
    grid=True, color='steelblue',
    edgecolor='black', linewidth=2
)
plt.title('Visualisation of NaNs')
plt.xlabel('Name of features')
plt.ylabel('Share of NaN')
plt.show()

In [None]:
plt.figure(figsize=(15, 5))
plt.hist(game_df.loc[game_df['publisher'].isna(), 'year'], 
         color='steelblue', edgecolor='black', linewidth=2
        )
plt.title('Distribution of Nan by Year')
plt.xlabel('')
plt.show()

Unfortunately, most of the gaps occur in games released since 2000, so they cannot be attributed to the absence of the internet or similar issues.

#### Feature `year`

Let's count the NaNs in the `year` column:

In [None]:
len(game_df[game_df['year'].isna()])

In [None]:
game_df['year'].unique()

There are 271 NaNs in the `year` column. We could do some research and restore the gaps, but that's 271 rows. Instead, we replace them with `0` and, after processing, create a new dataframe that excludes these rows.

In [None]:
game_df['year'] = game_df['year'].fillna(0)
print('NaNs in year - {}'.format(game_df['year'].isna().sum()))

#### Feature `publisher`

In [None]:
print('Nan in publisher', len(game_df[game_df['publisher'].isna()]))


In [None]:
game_df[game_df['publisher'].isna()].head()

I do not like NaN. Let us change it to `unknown`

In [None]:
game_df['publisher'] = game_df['publisher'].fillna('unknown')

We will not delete rows. Instead, we exclude them from the analysis where needed, for example via `query()`.
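A minimal sketch of that kind of exclusion, assuming the placeholder values used in this notebook (`0` for a missing year, `'unknown'` for a missing publisher). The toy dataframe is hypothetical.

```python
import pandas as pd

# Toy data with the same placeholders as game_df (hypothetical values)
toy = pd.DataFrame({
    'name': ['Game A', 'Game B', 'Game C'],
    'year': [2006, 0, 2013],                      # 0 marks a missing year
    'publisher': ['Nintendo', 'unknown', 'unknown'],
})

# Keep only rows with a known year; the original frame is left untouched
known_year = toy.query('year != 0')

# Conditions can be combined when both fields must be known
known_both = toy.query('year != 0 and publisher != "unknown"')

print(len(toy), len(known_year), len(known_both))
```

Because `query()` returns a new dataframe, the rows with placeholders stay available in the source frame for any analysis that does not depend on those columns.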


#### Converting data to other types

In [None]:
game_df.info()


Let's convert the `year` column to an integer type.

In [None]:
game_df['year'] = game_df['year'].astype('int64')  # NaNs were already replaced with 0 above
game_df.info()
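As an alternative to the sentinel value `0`, pandas offers a nullable integer dtype, `Int64` (capital `I`), which stores integers and missing values side by side. The sketch below is not what this notebook does; it only illustrates the option on a hypothetical toy series.

```python
import numpy as np
import pandas as pd

# Toy year column with a gap (hypothetical values)
years = pd.Series([2006.0, np.nan, 2013.0], name='year')

# Nullable integer dtype: whole numbers are kept, NaN becomes <NA>
years_int = years.astype('Int64')

print(years_int.dtype)
print(years_int.isna().sum())   # the gap is still visible as missing
print(years_int.min())          # aggregations skip <NA> by default
```

With this dtype the missing years do not distort statistics such as the minimum, whereas a `0` placeholder has to be filtered out before every aggregation.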

#### Cumulative sales across all regions

In [None]:
game_df['total_sales'] = (
    game_df['na_sales']
    + game_df['eu_sales']
    + game_df['jp_sales']
    + game_df['other_sales']
)
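Since the data already has a `global_sales` column, the computed sum can be compared against it as a sanity check. The sketch below uses hypothetical toy values and an assumed tolerance of 0.02 (the regional figures are rounded to two decimals, so small differences are expected).

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values); the last one is deliberately inconsistent
toy = pd.DataFrame({
    'na_sales': [41.49, 29.08, 0.01, 1.00],
    'eu_sales': [29.02, 3.58, 0.00, 1.00],
    'jp_sales': [3.77, 6.81, 0.00, 0.00],
    'other_sales': [8.46, 0.77, 0.00, 0.00],
    'global_sales': [82.74, 40.24, 0.01, 2.50],
})

region_cols = ['na_sales', 'eu_sales', 'jp_sales', 'other_sales']
toy['total_sales'] = toy[region_cols].sum(axis=1)

# True where the regional sum agrees with the reported global figure
matches = np.isclose(toy['total_sales'], toy['global_sales'], atol=0.02)

print(matches)
print(toy.loc[~matches, ['total_sales', 'global_sales']])
```

Rows that fail the check are worth inspecting before `total_sales` is used in place of `global_sales`.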

In [None]:
game_df_upd = game_df[game_df['year']!= 0]
game_df_upd.head(15)

In [None]:
game_df_upd.info()

#### Summary

Prepared the data.

Column names were converted to lower case, data types were replaced, and NaNs and duplicates were examined. Rows with a missing year were not deleted from `game_df`; they are excluded from the analysis through the filtered dataframe `game_df_upd`. An additional column, `total_sales`, was added to the dataframe.

## Exploratory data analysis


### Analysis of the number and sales of released games for the entire period

Let's take a look at the general information on released games for different platforms by year of release:

In [None]:

game_cross = pd.crosstab(game_df_upd['platform'], 
            game_df_upd['year'], margins=True, 
            margins_name="Total", 
           ).T
game_cross

If we look at the `Total` column, we see that the gaming industry has been actively developing since 1994.  

The peaks are in 2006 - 2011, then we see a decline and some leveling off since 2012 (from 500 to 652 games per year - close to the level of 2001 - 2006). This may be due to the growth of games for mobile devices (Android or iOS *mobile phones*), which are not in the list of platforms.

To make this easier to see, we plot a bar chart of the number of games released per year:

In [None]:
game_df_upd.groupby('year')['name'].count().plot(
            kind='bar', y='name', figsize=(15,5), edgecolor='black'
)
plt.title('Number of games released from 1980 to 2016')
plt.xticks(rotation=42)
plt.xlabel('')
plt.show()
    

Let's look at the sales of games on various platforms in the period under review.

In [None]:
game_df_upd.groupby('platform')['total_sales'].sum().sort_values(ascending=True).plot(
            kind='barh', y='total_sales', figsize=(15,10), edgecolor='black'
)
plt.title('General sales of games on different platforms')
plt.xticks(rotation=42)
plt.xlabel('')
plt.ylabel('')
plt.show()
    

For convenience, let's highlight the top 10 sales platforms:

In [None]:
(
    game_df_upd.groupby('platform')['total_sales'].sum()
    .to_frame('total_sales')
    .sort_values(by='total_sales', ascending=False)
    .head(10)
)

The leaders are PS2, PS3 and Xbox 360 (`X360`). Wii and DS are not far behind.

Let us find the lifetime of each platform: the number of years between its first and last game release in the data.

In [None]:
list_of_platform = ['PS4', 'PC', '3DS', 'XOne']
games_not_new = game_df_upd.query('platform not in @list_of_platform').copy()
born_year = games_not_new.groupby('platform')['year'].min()   # first release year per platform
deadline = games_not_new.groupby('platform')['year'].max()    # last release year per platform
life_time = deadline - born_year
games_not_new['life_time'] = games_not_new['platform'].map(life_time)
games_not_new.head()


In [None]:
# Scalar quantiles: passing a list would return Series with different
# index labels, and their difference would be NaN
q75 = games_not_new['life_time'].quantile(.75)
q25 = games_not_new['life_time'].quantile(.25)
iqr = q75 - q25
low_range = q25 - (1.5 * iqr)
high_range = q75 + (1.5 * iqr)
plt.figure(figsize=(15, 5))
sns.boxplot(x=games_not_new['life_time'], color='steelblue')
plt.xlim(low_range, high_range)  # call xlim; assigning to it would overwrite the function
plt.title('Spread of gaming platform lifetimes, years')
plt.xlabel('')
plt.show()
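The whisker limits used for the boxplot follow the usual 1.5 x IQR rule. A small worked example on a hypothetical series of lifetimes shows how the fences are computed and which values fall outside them:

```python
import pandas as pd

# Hypothetical platform lifetimes in years
life = pd.Series([1, 5, 6, 9, 10, 11, 12, 28], name='life_time')

q25 = life.quantile(0.25)   # scalar, linear interpolation by default
q75 = life.quantile(0.75)
iqr = q75 - q25

low = q25 - 1.5 * iqr       # lower fence
high = q75 + 1.5 * iqr      # upper fence

# Values beyond the fences are treated as outliers
outliers = life[(life < low) | (life > high)]

print(q25, q75, iqr, low, high)
print(outliers.tolist())
```

Here only the value 28 lies beyond the upper fence, so it is the one a boxplot would draw as a separate point.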


Eliminate outliers and look at the mean and median.

In [None]:
games_pivot = (
    games_not_new.query('5 <= life_time <= 15')
    .pivot_table(index='platform', values='life_time')
    .sort_values(by='life_time', ascending=False)
)
games_pivot.head(10)

In [None]:
print('Median lifetime: ', games_pivot['life_time'].median(), 'years')
print('Mean lifetime: {:.1f}'.format(games_pivot['life_time'].mean()), 'years')


Long-lived platforms include DS, Xbox 360, PS2, PS3 and Wii. Based on this estimate, the lifespan of a platform is 9 years. Note that this is not the interval between new [generations of consoles](https://ru.wikipedia.org/wiki/%D0%98%D0%B3%D1%80%D0%BE%D0%B2%D0%B0%D1%8F_%D0%BF%D1%80%D0%B8%D1%81%D1%82%D0%B0%D0%B2%D0%BA%D0%B0): it is the period during which games are released for a platform. On average, generations change once every 7 years.

The DS appeals with its mobility and possibly plays on the nostalgia of grown-up gamers. In the calculation we excluded the PC and modern consoles such as PS4, Xbox One and 3DS. The PC exists as a platform in its own right: only the system requirements for games change.
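The lifetime estimate is the difference between the last and the first release year recorded for each platform. A compact sketch of the same idea, using named aggregation on a hypothetical toy dataframe (the column names `first_year` and `last_year` are introduced here for illustration only):

```python
import pandas as pd

# Toy release records (hypothetical values)
toy = pd.DataFrame({
    'platform': ['PS2', 'PS2', 'PS2', 'DS', 'DS', 'GB'],
    'year': [2000, 2005, 2011, 2004, 2013, 1989],
})

# First and last release year per platform in a single pass
span = toy.groupby('platform')['year'].agg(first_year='min', last_year='max')
span['life_time'] = span['last_year'] - span['first_year']

print(span)
```

A platform with releases in a single year gets a lifetime of 0 under this definition, which is one reason very short lifetimes are filtered out before averaging.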

Let's see how sales have changed in relation to the number of games released.

In [None]:
plt.figure(figsize=(15, 5))
ax = plt.gca()
games_not_new.groupby('year')['total_sales'].sum().plot(
    legend=True,
    title='Sales and number of releases by year'
)
games_not_new.groupby('year')['name'].count().plot(legend=True, grid=True)

plt.ylabel('Number of games released / Sales, mln $')
ax.vlines(x=2012, linestyle='--', color='black', ymin=0, ymax=1600)
plt.xticks(rotation=42)
plt.xlabel('')
plt.show()

#### Conclusion
We see that modern game production has been actively developing since 2001, and this affects the revenue. We assume that this is due to the active development of console platforms, an increase in the performance of gaming hardware, which causes an increased demand for video entertainment.

Let's take the period from 2012 to 2016. It starts with a sharp drop in sales and a downward trend in the number of games produced, and 2012 is the year the first [eighth generation](https://ru.wikipedia.org/wiki/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D0%B8%D0%B3%D1%80%D0%BE%D0%B2%D1%8B%D1%85_%D0%BA%D0%BE%D0%BD%D1%81%D0%BE%D0%BB%D0%B5%D0%B9#%D0%92%D0%BE%D1%81%D1%8C%D0%BC%D0%BE%D0%B5_%D0%BF%D0%BE%D0%BA%D0%BE%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5_(%D1%81_2012)) consoles appeared.



### Analysis of the number and sales of released games for 2012-2016 (2017-2020 contain many NaNs and are not of interest to us)

In [None]:
games_12_16 = game_df_upd.query('2012 <= year <= 2016').reset_index(drop=True)
games_12_16.head()

In [None]:
games_12_16.info()

In [None]:
pd.crosstab(games_12_16['platform'], 
            games_12_16['year'],
            margins=True, margins_name="total"
           ).sort_values(by='total', ascending=False)

We see a decrease in the number of games created for the previous generation of consoles and an increase in production for the new ones: PS4, XOne, WiiU.

Among the portable consoles, the 3DS stands out, while PS Vita (PSV) and PSP are in decline.

In [None]:
lead_platforms = ['PS4', 'PC', 'XOne', '3DS']
lead_games = games_12_16.query('platform in @lead_platforms')

In [None]:
plt.figure(figsize=(15, 5))
sns.barplot(y='total_sales', 
            x='year', 
            hue='platform', 
            data = lead_games,
            hue_order = lead_platforms
           )

plt.title('Sales of games of potentially profitable platforms, mln $ ')
plt.xticks(rotation=42)
plt.xlabel('')
plt.ylabel('')
plt.show()

In [None]:
plt.figure(figsize=(15,10))

sns.boxplot(y='platform', x='total_sales',
            data=lead_games.query('total_sales < 4'),
            order=lead_platforms, orient='h'
           )
plt.title('Spread of global sales of potentially profitable platforms, mln $ ')
plt.xlabel('')
plt.ylabel('')
plt.show()

#### Conclusion

Promising platforms are PS4, XOne, 3DS and PC. Although we see a general decline, the leaders and the distribution of profits do not change. The PS4 and XOne are developing in about the same way, with the PS4 selling slightly better thanks to its exclusive games, whereas Microsoft's policy is to release its games on the PC as well. The 3DS platform is also selling well. The PC is inferior to the consoles, which is associated with the need to upgrade the computer's *hardware*, which is much more expensive than buying a console.

To eliminate outliers when comparing the platforms, we limit total sales per game (below 4 million in the boxplot above).

### Analysis of games by genre

In [None]:
games_12_16

In [None]:
top_games = games_12_16.pivot_table(
    index='genre', columns='year',
    values='total_sales', aggfunc='sum', margins=True
)
top_games.sort_values(by='All', ascending=False)

The most popular genres are action and shooter, while puzzle is disappearing completely. Other popular genres include RPG, platform and sports. The action genre saw the largest drop in revenue, while shooter and RPG are developing more steadily.
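The size of the drop per genre can be made explicit by comparing the first and last year of the period. The sketch below uses hypothetical toy sales figures, not the values from the table above, and measures the change as a percentage of the 2012 level.

```python
import pandas as pd

# Toy sales per genre and year (hypothetical values)
toy = pd.DataFrame({
    'genre': ['Action', 'Action', 'Shooter', 'Shooter', 'Puzzle'],
    'year': [2012, 2016, 2012, 2016, 2012],
    'total_sales': [100.0, 20.0, 70.0, 35.0, 2.0],
})

# Genre x year table; a genre with no releases in a year gets 0
by_year = toy.pivot_table(index='genre', columns='year',
                          values='total_sales', aggfunc='sum', fill_value=0)

# Relative change from 2012 to 2016, in percent
change_pct = (by_year[2016] - by_year[2012]) / by_year[2012] * 100

print(change_pct.sort_values())
```

A value of -100 means the genre has no sales at all in the last year, which is how a disappearing genre would show up in this measure.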

### Conclusion

The gaming industry has been actively developing since 1994. It should be noted that the ESRB rating system has also been in use since 1994. The peaks are in 2006 - 2011, then we see a decline and some leveling off since 2012 (from 500 to 652 games per year - close to the level of 2001 - 2006). This may be due to the growth of games on mobile devices running Android or iOS. In recent years, the rate of development has been falling, and so have profits.

The median lifespan of the platform is 10 years, while the time cycle for the announcement of a new console is approximately 6-7 years


Promising platforms are PS4, XOne, 3DS and PC. Although we see a general decline, the leaders and the distribution of profits do not change. PS4 and XOne are developing in a similar fashion, with PS4 selling slightly better thanks to its exclusive games, whereas Microsoft's policy is to release its games on PC as well. The 3DS platform is also selling well. The PC is inferior to the consoles, which is associated with the need to upgrade the computer's hardware, which is much more expensive than buying a console.

The biggest drop in revenue was seen in the action genre, while shooter and RPG are developing more steadily. At the moment, shooter is among the best sellers and roughly equal to action. Thus, users are interested in action, RPG and shooter games.

## Exploring video game users by region


For convenience, let's display our dataframe

In [None]:
games_12_16.head()

Let's write a charting function for popular genres and platforms by region:

In [None]:
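def diag_plot(data, column, region):
    # Top 5 values of `column` by total sales in `region`, computed from
    # the dataframe passed in (not from the global games_12_16)
    region_data = data.groupby(column)[region].sum().sort_values(ascending=False).head()

    region_data.plot(kind='bar', figsize=(15, 5),
                     color=['red', 'steelblue', 'violet', 'lightgreen', 'lightblue'],
                     edgecolor='black'
                    )

    # The title reflects the arguments, since the function is used for genres too
    plt.title('Distribution of {} by {}'.format(column, region))
    plt.ylabel('')
    plt.xlabel('')
    plt.show()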

### Region NA

In [None]:
diag_plot(games_12_16, 'platform', 'na_sales')

In North America, the most popular platform is the X360, which is on its home market. At the same time, with the release of the new generation of consoles, the lead passes to the PS4. The 3DS is the least popular console among the leaders.

In [None]:
diag_plot(games_12_16, 'genre', 'na_sales')

As for genres, the most popular are action and shooter, and the sports genre rounds out the top three.

### Region EU

In [None]:
diag_plot(games_12_16, 'platform', 'eu_sales')

Among European users, PS4 and PS3 are the leaders, and Microsoft's console is much less successful than in North America.

In [None]:
diag_plot(games_12_16, 'genre', 'eu_sales')

### Region JP

In [None]:
diag_plot(games_12_16, 'platform', 'jp_sales')

Japanese users prefer consoles from Nintendo and Sony, that is, local console makers. Notably, as many as two portable consoles made the top 5.

In [None]:
diag_plot(games_12_16, 'genre', 'jp_sales')

As expected, Japanese users prefer RPGs (many famous RPGs originated in Japan, and there is a separate subgenre, [jRPG](https://ru.wikipedia.org/wiki/%D0%AF%D0%BF%D0%BE%D0%BD%D1%81%D0%BA%D0%B0%D1%8F_%D1%80%D0%BE%D0%BB%D0%B5%D0%B2%D0%B0%D1%8F_%D0%B8%D0%B3%D1%80%D0%B0)); action comes next.

While Europe and America are more or less similar in their preferences, Japan clearly stands out against them.
The Japanese often work long hours (over [60](https://rb.ru/story/karoshi/) hours per week), which leaves little time for a stationary home console; perhaps that is why portable consoles are in the lead. In addition, Japanese housing is often small, and a portable platform saves space (there is no need to buy a monitor or TV, and the console itself takes up no room).

As for RPGs, role-playing games offer long-lasting immersion and identification with the character, unlike shooters or action games, which helps to take the player's mind off fatigue.


### Conclusion

In North America, the most popular platform is the X360, which is on its home market. At the same time, with the release of the new generation of consoles, the lead passes to the PS4. The 3DS is the least popular console among the leaders.

Among European users, PS4 and PS3 are the leaders, and Microsoft's console is much less successful than in North America. Both regions are similar in that the 3DS is the least popular of the leaders.


Japanese users prefer consoles from Nintendo and Sony, that is, local console makers. Notably, as many as two portable consoles made the top 5.


In terms of genres, American and European users prefer action games and shooters, while Japanese users prefer RPGs and games for the 3DS.
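The regional comparison can also be condensed into one table of genre shares per region, which makes the contrast between Japan and the other markets easier to read than separate charts. This is a sketch on hypothetical toy data, not the notebook's actual figures.

```python
import pandas as pd

# Toy regional sales per game (hypothetical values)
toy = pd.DataFrame({
    'genre': ['Action', 'Shooter', 'Role-Playing', 'Action', 'Role-Playing'],
    'na_sales': [5.0, 4.0, 1.0, 3.0, 0.5],
    'eu_sales': [4.0, 3.0, 1.0, 2.0, 0.5],
    'jp_sales': [0.5, 0.1, 2.0, 0.5, 1.5],
})

regions = ['na_sales', 'eu_sales', 'jp_sales']

# Total sales per genre in each region
totals = toy.groupby('genre')[regions].sum()

# Share of each genre within its region (each column sums to 1)
shares = totals / totals.sum()

# Leading genre per region
top_genre = shares.idxmax()

print(shares.round(2))
print(top_genre)
```

Working with shares rather than absolute sales keeps the regions comparable even though their market sizes differ.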

## Conclusion



**Promising platforms for 2017 and beyond are PlayStation 4 and Xbox One**. Nintendo's 3DS should only be considered when targeting the Japanese market. For the domestic market, the results for Europe are applicable. Accordingly, action and shooter are the promising genres: such games will most likely be in demand. Sports games are also popular and should be considered for marketing as well.

PC is less promising than consoles at the moment, which may be due to the growth of online services for buying games, which remove the need for physical media.

In summary, the focus should be on the 8th generation consoles, PS4 and Xbox One, with the PS4 as the priority, and on action and shooter games.