>__Original question:__ Is there a relative difference between the three regions (North America, Europe, Japan) in video game sales when it comes to different genres?

### Import Libraries

In [1]:
import pandas as pd

### Load the dataset

In [2]:
# The csv contains a 'Rank' column which is unique, therefore can be used as index
vgsales = pd.read_csv('vgsales.csv', index_col=0)

In [3]:
vgsales

Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
...,...,...,...,...,...,...,...,...,...,...
16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,Platform,Kemco,0.01,0.00,0.00,0.00,0.01
16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.00,0.00,0.00,0.00,0.01
16599,Know How 2,DS,2010.0,Puzzle,7G//AMES,0.00,0.01,0.00,0.00,0.01


For ease of interpretation and coding, I'm transforming all column names to lowercase.

In [4]:
vgsales.columns = vgsales.columns.str.lower()

### Check for duplicated values

In [5]:
vgsales[vgsales.duplicated(keep=False)]

Rank,name,platform,year,genre,publisher,na_sales,eu_sales,jp_sales,other_sales,global_sales
15000,Wii de Asobu: Metroid Prime,Wii,,Shooter,Nintendo,0.0,0.0,0.02,0.0,0.02
15002,Wii de Asobu: Metroid Prime,Wii,,Shooter,Nintendo,0.0,0.0,0.02,0.0,0.02


There is one duplicated record: _Wii de Asobu: Metroid Prime_ appears at both rank 15,000 and rank 15,002 with identical values in every column. The same game on the same platform cannot hold two ranks, so I decided to drop the second appearance.

In [6]:
vgsales.drop_duplicates(keep='first', inplace=True)

_Note:_ The kept record is later dropped due to missing year information.
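
Because `Rank` is the index, `duplicated()` compares only the column values, which is why the two rows match despite their different ranks. A minimal sketch on a made-up frame (the `toy` data is hypothetical, not taken from `vgsales.csv`):

```python
import pandas as pd

# Hypothetical toy frame: ranks 2 and 3 hold identical column values
toy = pd.DataFrame(
    {"name": ["A", "B", "B"], "jp_sales": [0.50, 0.02, 0.02]},
    index=pd.Index([1, 2, 3], name="Rank"),
)

# keep=False flags every member of a duplicate group (useful for inspection)
flagged = toy[toy.duplicated(keep=False)]

# keep='first' retains the first occurrence and drops the later ones
deduped = toy.drop_duplicates(keep="first")

print(flagged.index.tolist())
print(deduped.index.tolist())
```

If the rank should count as part of a record's identity, it would have to be a regular column rather than the index.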

### Check for missing values

In [7]:
vgsales_columns_with_missing_data = vgsales.columns[vgsales.isnull().any()].tolist()
print(f"Columns with missing data in the vgsales dataframe are:\n{', '.join(vgsales_columns_with_missing_data)}\n")

for c in vgsales_columns_with_missing_data:
    print(f'The number of missing values in {c} is {vgsales[c].isnull().sum()}')

Columns with missing data in the vgsales dataframe are:
year, publisher

The number of missing values in year is 270
The number of missing values in publisher is 58


__`year`__ is the year of the game's release, while __`publisher`__ is the publisher of the game.<br>
I can't impute `year` with the average, as that would assign incorrect release years, and since `publisher` is a text column there is no sensible value to fill in, so I decided to drop every row where either the year or the publisher is missing. With 270 missing years and 58 missing publishers, this removes at most 328 rows, which is only a small portion of the data.

In [8]:
vgsales.dropna(subset=['year', 'publisher'], inplace=True)
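
The number of rows actually removed can be lower than the sum of the per-column counts, because a single row may be missing both values. A small sketch of that check, using a hypothetical `toy` frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in both columns
toy = pd.DataFrame(
    {
        "year": [2006.0, np.nan, 2008.0, np.nan],
        "publisher": ["Nintendo", "Nintendo", None, None],
    }
)

cols = ["year", "publisher"]

# Rows with a gap in at least one of the two columns
rows_affected = int(toy[cols].isnull().any(axis=1).sum())

# Sum of per-column counts; the last row is counted twice here
per_column_total = int(toy[cols].isnull().sum().sum())

# Same call as in the notebook: drop rows missing either value
cleaned = toy.dropna(subset=cols)

print(rows_affected, per_column_total, len(cleaned))
```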

### Describe the data

In [9]:
vgsales.dtypes

name             object
platform         object
year            float64
genre            object
publisher        object
na_sales        float64
eu_sales        float64
jp_sales        float64
other_sales     float64
global_sales    float64
dtype: object

__`year`__ is `float64` because it contained missing values (`NaN` is a floating-point value). Now that those rows are dropped, it can be cast to `int`.

In [10]:
vgsales['year'] = vgsales['year'].astype(int)
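
To illustrate why a column with gaps ends up as `float64`, here is a minimal sketch on a made-up series (`years` is hypothetical). It also shows pandas' nullable `Int64` dtype as an alternative that keeps missing entries; that option is an assumption for illustration and is not what this notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical series: one missing release year
years = pd.Series([2006, np.nan, 1985])
print(years.dtype)  # NaN forces a floating-point dtype

# Approach used in this notebook: drop the gaps, then cast to int
as_int = years.dropna().astype(int)

# Alternative (not used here): nullable integers keep the gap as <NA>
nullable = years.astype("Int64")

print(as_int.tolist())
print(nullable.isna().sum())
```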

#### Describe numeric columns

In [11]:
vgsales.describe()

,year,na_sales,eu_sales,jp_sales,other_sales,global_sales
count,16291.0,16291.0,16291.0,16291.0,16291.0,16291.0
mean,2006.405561,0.265647,0.147731,0.078833,0.048426,0.54091
std,5.832412,0.822432,0.509303,0.311879,0.190083,1.567345
min,1980.0,0.0,0.0,0.0,0.0,0.01
25%,2003.0,0.0,0.0,0.0,0.0,0.06
50%,2007.0,0.08,0.02,0.0,0.01,0.17
75%,2010.0,0.24,0.11,0.04,0.04,0.48
max,2020.0,41.49,29.02,10.22,10.57,82.74


- Based on __`year`__, the earliest release in the list is from 1980 and the latest is from 2020.
- The regional sales minimums are 0, which is expected, since a game does not have to sell in every region to appear in the list. All sales data is in millions of copies, so the __`global_sales`__ minimum of 0.01 corresponds to 10,000 copies.
- Based on the mean and max values, most sales come from North America (__`na_sales`__), followed by Europe (__`eu_sales`__) and Japan (__`jp_sales`__).
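
Means and maximums are pulled up by a few blockbuster titles, so regional totals and shares are another way to compare the regions. The sketch below uses only the five top-ranked rows displayed earlier in this notebook, so its numbers say nothing about the full dataset; `top5` is a name introduced here for illustration.

```python
import pandas as pd

# The five top-ranked rows shown earlier (sales in millions of copies)
top5 = pd.DataFrame(
    {
        "na_sales": [41.49, 29.08, 15.85, 15.75, 11.27],
        "eu_sales": [29.02, 3.58, 12.88, 11.01, 8.89],
        "jp_sales": [3.77, 6.81, 3.79, 3.28, 10.22],
    }
)

# Total sales per region, and each region's share of the three-region sum
regional_totals = top5.sum()
regional_share = regional_totals / regional_totals.sum()

# Regions ordered from largest to smallest total
ranking = regional_totals.sort_values(ascending=False).index.tolist()

print(regional_share.round(3))
print(ranking)
```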

In [12]:
vgsales[vgsales['na_sales']==vgsales['na_sales'].max()]

Rank,name,platform,year,genre,publisher,na_sales,eu_sales,jp_sales,other_sales,global_sales
1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74


In [13]:
vgsales[vgsales['eu_sales']==vgsales['eu_sales'].max()]

Rank,name,platform,year,genre,publisher,na_sales,eu_sales,jp_sales,other_sales,global_sales
1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74


In [14]:
vgsales[vgsales['jp_sales']==vgsales['jp_sales'].max()]

Rank,name,platform,year,genre,publisher,na_sales,eu_sales,jp_sales,other_sales,global_sales
5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


Based on the dataset, _Wii Sports_ is the best-selling video game in both North America and Europe, while in Japan it's _Pokemon Red/Pokemon Blue_ from 1996. Interestingly, all of these were published by Nintendo (GB stands for Game Boy).
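
The three lookups above can also be done in one step with `idxmax`, which returns the index label (here the rank) of the maximum in each column. A sketch on the five top-ranked rows shown earlier; `top5`, `best_rank`, and `best_name` are names introduced here for illustration.

```python
import pandas as pd

# The five top-ranked rows shown earlier, indexed by rank
top5 = pd.DataFrame(
    {
        "name": [
            "Wii Sports",
            "Super Mario Bros.",
            "Mario Kart Wii",
            "Wii Sports Resort",
            "Pokemon Red/Pokemon Blue",
        ],
        "na_sales": [41.49, 29.08, 15.85, 15.75, 11.27],
        "eu_sales": [29.02, 3.58, 12.88, 11.01, 8.89],
        "jp_sales": [3.77, 6.81, 3.79, 3.28, 10.22],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Rank"),
)

regions = ["na_sales", "eu_sales", "jp_sales"]

# Rank of the best-selling game in each region
best_rank = top5[regions].idxmax()

# Look up the game name for each of those ranks
best_name = best_rank.map(top5["name"])

print(best_name)
```

Note that `idxmax` returns only the first match when several rows tie for the maximum, whereas the boolean-mask approach used in the cells above returns every tied row.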

#### Describe non-numeric columns

In [15]:
vgsales['platform'].value_counts()[:5]

DS      2131
PS2     2127
PS3     1304
Wii     1290
X360    1234
Name: platform, dtype: int64

In [16]:
vgsales['genre'].value_counts()[:5]

Action          3251
Sports          2304
Misc            1686
Role-Playing    1470
Shooter         1282
Name: genre, dtype: int64

In [17]:
vgsales['publisher'].value_counts()[:5]

Electronic Arts                 1339
Activision                       966
Namco Bandai Games               928
Ubisoft                          918
Konami Digital Entertainment     823
Name: publisher, dtype: int64

- Based on the dataset, the most games were released for the Nintendo DS, followed closely by the PlayStation 2. While Nintendo and Sony each have two platforms in the top four, Microsoft's Xbox 360 is only in fifth place.
- Most of the games in this list are either action or sports games. "Misc", in third place, covers games that don't fit into any of the other genres.
- The publishers with the most games in this list are Electronic Arts and Activision.
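
Raw counts can be complemented with relative frequencies by passing `normalize=True` to `value_counts`. A minimal sketch on a made-up series (`genres` is hypothetical and far smaller than the real column):

```python
import pandas as pd

# Hypothetical genre column with six games
genres = pd.Series(
    ["Action", "Action", "Action", "Sports", "Sports", "Misc"], name="genre"
)

# Absolute counts, most frequent first (head(2) keeps the top two)
counts = genres.value_counts().head(2)

# Same ranking expressed as a share of all rows
shares = genres.value_counts(normalize=True).head(2)

print(counts)
print(shares.round(3))
```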

### Save the dataframe

In [18]:
# index=False leaves the 'Rank' index out of the saved file
vgsales.to_csv('sales.csv', index=False)
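
Since `Rank` was set as the index when loading, saving with `index=False` means the rank is not written to `sales.csv`. The sketch below shows the difference on a hypothetical `toy` frame, writing to in-memory buffers instead of files. Whether the rank is needed downstream is an open assumption; the notebook's choice to omit it is left unchanged.

```python
import io

import pandas as pd

# Hypothetical toy frame indexed by rank, like vgsales after loading
toy = pd.DataFrame(
    {"name": ["A", "B"], "global_sales": [1.5, 0.7]},
    index=pd.Index([1, 2], name="Rank"),
)

# As in the notebook: the index is left out of the output
without_rank = io.StringIO()
toy.to_csv(without_rank, index=False)

# Default behaviour (index=True): the index is written as the first column
with_rank = io.StringIO()
toy.to_csv(with_rank)

header_without = without_rank.getvalue().splitlines()[0]
header_with = with_rank.getvalue().splitlines()[0]

print(header_without)
print(header_with)
```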