In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import plotly


pio.templates.default = 'ggplot2'

pd.options.display.max_columns = 500

my_colors = [
    "#FC2E00", "#FFA859", "#F1E051", "#89E894",
    "#FEDC28", "#FAD66A", "#B1D8B7", "#008AFA",
    "#E7409E", "#F2B7E4", "#DB1B1B", "#808080",
    "#F47D4A", "#FFF4D9", "#65B765", "#69A6FF",
    "#D72827", "#FFC2C2", "#3B9D57", "#C6DDF1",
    "#E93F33", "#FFF1A8", "#57AB27", "#BBF2F5",
    "#B81F10", "#FFC18A", "#93D5C4", "#0066FF",
    "#F15604", "#FFDFA8", "#C7E0E9", "#F7B2B2",
]

In [2]:
df = pd.read_csv('/kaggle/input/video-games-sales/video_games_sales.csv')

df.head()

Unnamed: 0,rank,name,platform,year,genre,publisher,na_sales,eu_sales,jp_sales,other_sales,global_sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


### Clean and Organize the Data

In [3]:
df.isnull().sum()

rank              0
name              0
platform          0
year            271
genre             0
publisher        58
na_sales          0
eu_sales          0
jp_sales          0
other_sales       0
global_sales      0
dtype: int64

In [4]:
# There aren't many compared to the total number of rows, so drop the NA values
df = df.dropna()

In [5]:
# Capture the range of years in the data set
unique_years = df['year'].unique()
sorted_years = sorted(unique_years, reverse=True)

print(f'The data set ranges from {sorted_years[-1]} - {sorted_years[0]}')

df['year'].value_counts()

The data set ranges from 1980.0 - 2020.0


2009.0    1431
2008.0    1428
2010.0    1257
2007.0    1201
2011.0    1136
2006.0    1008
2005.0     936
2002.0     829
2003.0     775
2004.0     744
2012.0     655
2015.0     614
2014.0     580
2013.0     546
2001.0     482
1998.0     379
2000.0     349
2016.0     342
1999.0     338
1997.0     289
1996.0     263
1995.0     219
1994.0     121
1993.0      60
1981.0      46
1992.0      43
1991.0      41
1982.0      36
1986.0      21
1989.0      17
1983.0      17
1990.0      16
1987.0      16
1988.0      15
1985.0      14
1984.0      14
1980.0       9
2017.0       3
2020.0       1
Name: year, dtype: int64

In [6]:
# Drop 2017 and 2020, which have too few entries (3 and 1) to be representative
df = df[df['year'] < 2017]

In [7]:
# Check data types for any unexpected values
df.dtypes

rank              int64
name             object
platform         object
year            float64
genre            object
publisher        object
na_sales        float64
eu_sales        float64
jp_sales        float64
other_sales     float64
global_sales    float64
dtype: object

## Exploration and Visualizations

Questions that come to mind:
- Who are the biggest publishers, and are they growing?
- Are there more games released every year?
- What are the most popular genres? Per region?
- What are the most popular platforms? Per region?
- What are the most popular games? Per region?
- What are the most popular publishers? Per region?

In [8]:
# Quick look at the numbers
df.describe()

Unnamed: 0,rank,year,na_sales,eu_sales,jp_sales,other_sales,global_sales
count,16287.0,16287.0,16287.0,16287.0,16287.0,16287.0,16287.0
mean,8288.969853,2006.402775,0.265695,0.147768,0.078849,0.048437,0.541022
std,4792.138597,5.830382,0.822525,0.50936,0.311916,0.190105,1.56752
min,1.0,1980.0,0.0,0.0,0.0,0.0,0.01
25%,4131.5,2003.0,0.0,0.0,0.0,0.0,0.06
50%,8291.0,2007.0,0.08,0.02,0.0,0.01,0.17
75%,12437.5,2010.0,0.24,0.11,0.04,0.04,0.48
max,16600.0,2016.0,41.49,29.02,10.22,10.57,82.74


#### How many games were released each year? 
A simple histogram of releases per year gives an overall sense of how the number of games released has changed over time. The colors represent the different platforms, which is a neat way to see the life span and influence of each one. Few games were released in the 80s, with Atari and Nintendo leading the way in those early years. The 90s saw steady growth in the number of games released each year, and then in 2002, BOOM! Releases jumped 72%, from 482 in 2001 to 829. This coincides with the Xbox hitting the scene and the PlayStation 2 being in full force. The all-time best year for gamers was 2009, when 1,431 games were released!

In [9]:
# Histogram to show how many games were released each year in the dataset
fig = px.histogram(df, x='year', color_discrete_sequence=my_colors, title='Number of Games Released Each Year',\
    color='platform', labels={'year':'Year', 'count':'Number of Games Released'})

fig.update_layout(xaxis_title='Year', yaxis_title='Number of Games Released', title='Number of Games Released Each Year')

fig.show()
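
As a quick sanity check on the 72% figure, here is a minimal sketch that recomputes it from the release counts printed by `value_counts()` earlier. The `releases` Series is typed in by hand from that output (only a few years are included), so it is an illustration rather than part of the original pipeline.

```python
import pandas as pd

# Release counts copied from the value_counts() output above
# (rows with missing values were already dropped at that point)
releases = pd.Series({2000: 349, 2001: 482, 2002: 829, 2008: 1428, 2009: 1431})

# Percent change in releases between 2001 and 2002
jump_2002 = (releases.loc[2002] - releases.loc[2001]) / releases.loc[2001] * 100
print(f"2001 -> 2002: {jump_2002:.0f}% more releases")

# Year with the most releases among the years listed here
peak_year = releases.idxmax()
print(f"Peak year: {peak_year} ({releases.loc[peak_year]} releases)")
```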

#### What do the sales #s look like by region? 
When we add up regional sales across the whole dataset, North America has the highest share at 49.1%, followed by the European Union with 27.3%. Japan holds a significant 14.6%, showing its love of gaming. Finally, the remaining regions collectively account for 9.0% of total sales, highlighting the universal appeal of gaming that transcends geographic boundaries!

In [10]:
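# Calculate the total sales for each year
# (select the sales columns explicitly so that text columns such as name and platform are not summed)
sales_cols = ['na_sales', 'eu_sales', 'jp_sales', 'other_sales', 'global_sales']
yearly_sales = df.groupby('year')[sales_cols].sum()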

# Create a figure with a line chart for each sales market
fig = go.Figure()
for i, market in enumerate(yearly_sales.columns[:-1]):
    fig.add_trace(go.Scatter(x=yearly_sales.index, y=yearly_sales[market], mode='lines', name=market, line=dict(color=my_colors[i])))

# Add a line chart for the global sales
fig.add_trace(go.Scatter(x=yearly_sales.index, y=yearly_sales['global_sales'], mode='lines', name='Global Sales', line=dict(color=my_colors[-1])))

# Update the x and y labels and title
fig.update_layout(xaxis_title='Year', yaxis_title='Sales (in millions of copies sold)', title='Total Sales by Year')

# Show the plot
fig.show()


In [11]:
# Sum the sales for all years by region
region_cols = ['na_sales', 'eu_sales', 'jp_sales', 'other_sales']
region_sales = df[region_cols].sum()

# Convert region_sales to a DataFrame
region_sales = pd.DataFrame({'region_sales': region_sales.values}, index=region_sales.index)

# Plotly pie plot of region_sales
fig = px.pie(data_frame=region_sales,
             values='region_sales', names=region_sales.index,
             title='Total Sales by Region',
             hover_data=['region_sales'],
             color_discrete_sequence=my_colors)

# Configure hover pop-out (values are in millions of copies)
fig.update_traces(hovertemplate='%{label}: %{value:.2f}M copies sold')

fig.show()
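
The regional shares quoted above can be cross-checked without the plot. The sketch below uses the per-title means from the `df.describe()` output; because every region is averaged over the same number of rows, the ratio of means equals the ratio of totals. The `mean_sales` dictionary is copied by hand from that table and is not a variable in the notebook.

```python
# Mean sales per title (millions of copies), copied from the df.describe() output above
mean_sales = {
    'na_sales': 0.265695,
    'eu_sales': 0.147768,
    'jp_sales': 0.078849,
    'other_sales': 0.048437,
}

# Each region's share of the summed regional sales, in percent
total = sum(mean_sales.values())
shares = {region: round(value / total * 100, 1) for region, value in mean_sales.items()}

print(shares)
```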


#### What are the best selling games of all time?

In the stacked bar plot, Wii Sports is the clear winner, followed by Super Mario Bros. and Mario Kart Wii. Nintendo makes a strong showing here.

In [12]:
top_games = df.sort_values('global_sales', ascending=False)[0:10]

# Create the stacked bar chart of regional sales
trace1 = go.Bar(x=top_games['name'], y=top_games['na_sales'], name='NA Sales',marker=dict(color=my_colors[0]))
trace2 = go.Bar(x=top_games['name'], y=top_games['eu_sales'], name='EU Sales',marker=dict(color=my_colors[1]))
trace3 = go.Bar(x=top_games['name'], y=top_games['jp_sales'], name='JP Sales',marker=dict(color=my_colors[2]))
trace4 = go.Bar(x=top_games['name'], y=top_games['other_sales'], name='Other Sales',marker=dict(color=my_colors[3]))

# Add all four sets of bars to the same figure
fig = go.Figure(data=[trace1, trace2, trace3, trace4])

# Customize the layout
fig.update_layout(barmode='stack',xaxis_title='Game Title', yaxis_title='Sales (in millions of copies sold)', 
title='Top 10 Most Sold Video Games of All Time')

# Show the plot
fig.show()
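
A stacked bar only reads as "global sales" if the four regional columns add up to `global_sales`. The following sketch checks that on the five rows shown by `df.head()` earlier; the `top` frame is rebuilt by hand from that output, and the 0.02 tolerance is an assumption chosen to absorb the rounding of the published figures.

```python
import numpy as np
import pandas as pd

# The five rows from the df.head() output above
top = pd.DataFrame({
    'name': ['Wii Sports', 'Super Mario Bros.', 'Mario Kart Wii',
             'Wii Sports Resort', 'Pokemon Red/Pokemon Blue'],
    'na_sales': [41.49, 29.08, 15.85, 15.75, 11.27],
    'eu_sales': [29.02, 3.58, 12.88, 11.01, 8.89],
    'jp_sales': [3.77, 6.81, 3.79, 3.28, 10.22],
    'other_sales': [8.46, 0.77, 3.31, 2.96, 1.0],
    'global_sales': [82.74, 40.24, 35.82, 33.0, 31.37],
})

# Height of each stacked bar = sum of the regional segments
regional_total = top[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum(axis=1)

# Compare against the reported global figure, allowing for rounding
matches = np.isclose(regional_total, top['global_sales'], atol=0.02)
print(pd.DataFrame({'name': top['name'], 'stacked': regional_total.round(2),
                    'global': top['global_sales'], 'match': matches}))
```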


#### What were the most popular games of each decade?
As a multi-decade gamer myself, I thought it would be fun to explore the best-selling game of each year, decade by decade. There are a few surprises, both in the names that appear and in the ones that are missing.

In [13]:
# Create a dataframe with the best selling games for each year
best_sellers = df.sort_values('global_sales', ascending=False).drop_duplicates(['year']).sort_values('year')

# Combine the game name and year into a single string for the y-axis labels
# (cast the float year to int so the label reads "(2006)" rather than "(2006.0)")
best_sellers['label'] = best_sellers['name'] + ' (' + best_sellers['year'].astype(int).astype(str) + ')'

# Add a column for the decade
best_sellers['decade'] = (best_sellers['year'] // 10) * 10

# Define the dropdown options
dropdown_options = [{'label': '1980s', 'value': 1980},
                    {'label': '1990s', 'value': 1990},
                    {'label': '2000s', 'value': 2000},
                    {'label': '2010s', 'value': 2010}]

# Initialize the figure
fig = go.Figure()

# Add traces for each decade
for option in dropdown_options:
    visible = option['value'] == 1980
    trace = go.Bar(x=best_sellers.loc[best_sellers['decade'] == option['value'], 'global_sales'],
                   y=best_sellers.loc[best_sellers['decade'] == option['value'], 'label'],
                   orientation='h', name=option['label'], visible=visible, marker={'color': my_colors})
    fig.add_trace(trace)

# Customize the layout
fig.update_layout(xaxis_title='Global Sales (in millions of copies sold)', yaxis_title='Game Title and Year',
                  title='Best Selling Game of Each Year - 1980s')

# Add the dropdown menu
fig.update_layout(updatemenus=[{'type': 'dropdown', 'direction': 'down', 'showactive': True,
                                'x': 2, 'y': 1.2, 'xanchor': 'right', 'yanchor': 'top',
                                'buttons': [{'label': option['label'], 'method': 'update',
                                             'args': [{'visible': [opt['value'] == option['value'] for opt in dropdown_options]},
                                                      {'title': 'Best Selling Game of Each Year - {}'.format(option['label'])}]}
                                           for option in dropdown_options]}])

# Show the plot
fig.show()
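
The two pandas steps behind this chart, picking the best seller per year and bucketing years into decades, can be seen in isolation in the sketch below. It uses the five rows from `df.head()` plus one made-up row (`'Hypothetical Game'`) that exists only to show a lower-selling title from the same year being discarded.

```python
import pandas as pd

games = pd.DataFrame({
    'name': ['Wii Sports', 'Super Mario Bros.', 'Mario Kart Wii', 'Wii Sports Resort',
             'Pokemon Red/Pokemon Blue', 'Hypothetical Game'],  # last row is made up
    'year': [2006.0, 1985.0, 2008.0, 2009.0, 1996.0, 2006.0],
    'global_sales': [82.74, 40.24, 35.82, 33.0, 31.37, 1.0],
})

# Sort by sales so the best seller comes first, then keep the first row per year
best = (games.sort_values('global_sales', ascending=False)
             .drop_duplicates('year')
             .sort_values('year'))

# Floor division by 10 drops the last digit: 1985 -> 198 -> 1980
best['decade'] = (best['year'] // 10 * 10).astype(int)

print(best[['name', 'year', 'decade']])
```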


#### What are the most popular games for each region?
Each region's top-selling games were mostly unique, with the exception of some Wii titles. This dataset really shows how strong a performer the Wii was in its day. Interestingly, Duck Hunt was particularly popular in North America, Nintendogs ranked among the top 5 in the European Union, and Japan had 3 Pokemon games in its top 5, whereas Pokemon does not appear in any other region's top 5.

In [14]:
# Plots for most popular games per region
na_games = df.sort_values('na_sales', ascending=False)[0:5]
eu_games = df.sort_values('eu_sales', ascending=False)[0:5]
jp_games = df.sort_values('jp_sales', ascending=False)[0:5]
other_games = df.sort_values('other_sales', ascending=False)[0:5]

# Create a 2x2 grid of subplots
fig = make_subplots(rows=2, cols=2, subplot_titles=('NA Sales', 'EU Sales', 'JP Sales', 'Other Sales'),
vertical_spacing=0.20, horizontal_spacing=0.30)

# Create one bar chart per region
fig.add_trace(go.Bar(x=na_games['name'], y=na_games['na_sales'], showlegend=False, marker=dict(color=my_colors)), row=1, col=1)
fig.add_trace(go.Bar(x=eu_games['name'], y=eu_games['eu_sales'], showlegend=False, marker=dict(color=my_colors)), row=1, col=2)
fig.add_trace(go.Bar(x=jp_games['name'], y=jp_games['jp_sales'], showlegend=False, marker=dict(color=my_colors)), row=2, col=1)
fig.add_trace(go.Bar(x=other_games['name'], y=other_games['other_sales'], showlegend=False, marker=dict(color=my_colors)), row=2, col=2)


# Update the layout to improve the look
fig.update_layout(height=900, width=800, title_text="Top 5 Most Sold Video Games per Region",
                  legend=dict(x=0.75, y=1.1, orientation='h', bgcolor='rgba(0,0,0,0)'))

# # Update the x-axis labels for the entire figure
# fig.update_xaxes(title_text='Video Game')

# Update the y-axis labels for the entire figure
fig.update_yaxes(title_text='Sales (millions of copies sold)')       

# Show the plot
fig.show()


#### What was the biggest platform in total global sales?
When we look at total global sales by platform, the PS2 (PlayStation 2) emerges as the top performer with a 14.5% share, making it a true gaming legend and a beloved classic among gamers worldwide. The Xbox 360, which offered a next-level immersive experience at the time, comes in a close second. The PS3 comes next with 11.2%, and the Wii rounds out the list with 10.7%. The world loves the Wii. Note that the pie only includes platforms with more than 100 million copies sold, so each percentage is a share of that subset rather than of all sales.

In [15]:
# Group the data by platform and sum global sales
platform_sales = df.groupby('platform')['global_sales'].sum()

# Sort by global sales in descending order
platform_sales = platform_sales.sort_values(ascending=False)

# Keep only platforms with more than 100 million copies sold
platform_sales = platform_sales.loc[platform_sales > 100].to_frame()

fig = px.pie(data_frame=platform_sales, values='global_sales', names=platform_sales.index,
             title='Total Global Sales by Platform',
             hover_data=['global_sales'], 
             labels={'global_sales': 'Sales (in millions of copies)', 'platform': 'Platform'},
             color_discrete_sequence=my_colors)


# Configure hover pop-out
fig.update_traces(hovertemplate='%{label}: %{value:.2f}M copies sold', pull=0.01)


fig.show()
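
Filtering out small platforms before drawing a pie changes what the percentages mean: they become shares of the slices shown, not of the whole market. The sketch below illustrates the effect with entirely made-up totals (`Platform A` to `Platform D` are hypothetical and do not come from the dataset).

```python
import pandas as pd

# Hypothetical platform totals in millions of copies (not real data)
totals = pd.Series({'Platform A': 600.0, 'Platform B': 300.0,
                    'Platform C': 50.0, 'Platform D': 50.0})

# Share of every platform relative to all sales
share_of_all = totals / totals.sum() * 100

# Share relative to only the platforms that pass the 100M threshold,
# which is what a pie drawn from the filtered data displays
shown = totals[totals > 100]
share_of_shown = shown / shown.sum() * 100

print(share_of_all.round(1))
print(share_of_shown.round(1))
```

If shares of the full market are wanted instead, one option is to compute the percentages before filtering and pass them to the chart as the values.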


#### What genre of game is the most popular? 
To gauge the popularity of each genre, I used two line plots: total global sales per genre over time, and the number of titles released per genre over time. The sales plot shows that, of the 12 genres, Sports and Action have had the highest yearly global sales since the mid-2000s, consistently outperforming the others. The release-count plot shows that the Action genre grew significantly from 2004 to 2009. Taking both plots together, Action has been the #1 genre since the mid-2000s.

In [16]:
# total sales by genre
print(f"There are {df['genre'].nunique()} genres in the dataset")

# Group the data by genre and year and sum the sales columns
sales_cols = ['na_sales', 'eu_sales', 'jp_sales', 'other_sales', 'global_sales']
genre_grouped = df.groupby(['genre', 'year'])[sales_cols].sum()

# groupby already returns the rows sorted by genre and year; flatten the index for plotting
genre_grouped = genre_grouped.reset_index()

# Create the line plot
fig = px.line(genre_grouped, x='year', y='global_sales', color='genre', title='Global Sales by Genre and Year',
              labels={'year': 'Year', 'global_sales': 'Sales (in millions of copies)', 'genre': 'Genre'},
              color_discrete_sequence=my_colors)

# Show the plot
fig.show()

There are 12 genres in the dataset


In [17]:
# line plot of count of genre by year
genre_year_grouped = df.groupby(['genre', 'year']).size()
genre_year_grouped = genre_year_grouped.reset_index(name='count')

# Create the line plot
fig = px.line(genre_year_grouped, x='year', y='count', color='genre', title='Game Releases by Genre and Year',
                labels={'year': 'Year', 'count': '# of Releases', 'genre': 'Genre'}, color_discrete_sequence=my_colors)

# update the layout
fig.update_layout(legend=dict(xanchor='right',x=1.25, y=1, orientation='v', bgcolor='rgba(0,0,0,0)'),
                  margin=dict(r=80, t=100, b=80, l=80))
# Show the plot
fig.show()




#### What genres are most popular per region?
In North America, the European Union, and other regions, Action, Sports, and Shooter games have been the most popular genres in that order. In Japan, Role-playing games have emerged as the most popular, followed by Action and Sports games. This is not surprising, given the immense popularity of games like Pokemon in Japan. It's neat to see how different regions have their unique preferences when it comes to gaming genres, showcasing the vibrant diversity of gamers worldwide!

In [18]:
# Total sales per genre for each region, sorted from most to least popular
na_genres = df.groupby('genre', as_index=False)['na_sales'].sum().sort_values('na_sales', ascending=False)
eu_genres = df.groupby('genre', as_index=False)['eu_sales'].sum().sort_values('eu_sales', ascending=False)
jp_genres = df.groupby('genre', as_index=False)['jp_sales'].sum().sort_values('jp_sales', ascending=False)
other_genres = df.groupby('genre', as_index=False)['other_sales'].sum().sort_values('other_sales', ascending=False)

# Map each genre to a fixed color so it looks the same in every subplot
genres_list = sorted(df['genre'].unique())
genre_color_mapping = dict(zip(genres_list, my_colors))

# Generating the colors for each subplot
na_colors = [genre_color_mapping[genre] for genre in na_genres['genre']]
eu_colors = [genre_color_mapping[genre] for genre in eu_genres['genre']]
jp_colors = [genre_color_mapping[genre] for genre in jp_genres['genre']]
other_colors = [genre_color_mapping[genre] for genre in other_genres['genre']]

# Create a 2x2 grid of subplots
fig = make_subplots(rows=2, cols=2, subplot_titles=('NA Sales', 'EU Sales', 'JP Sales', 'Other Sales'),
    vertical_spacing=0.20, horizontal_spacing=0.20)

# Create the bar charts
fig.add_trace(go.Bar(x=na_genres['genre'], y=na_genres['na_sales'], showlegend=False, marker=dict(color=na_colors)), row=1, col=1)
fig.add_trace(go.Bar(x=eu_genres['genre'], y=eu_genres['eu_sales'], showlegend=False, marker=dict(color=eu_colors)), row=1, col=2)
fig.add_trace(go.Bar(x=jp_genres['genre'], y=jp_genres['jp_sales'], showlegend=False, marker=dict(color=jp_colors)), row=2, col=1)
fig.add_trace(go.Bar(x=other_genres['genre'], y=other_genres['other_sales'], showlegend=False, marker=dict(color=other_colors)), row=2, col=2)

# Update the layout to improve the look
fig.update_layout(height=900, width=800, title_text="Genre Popularity by Region",
                    legend=dict(x=0.75, y=1.1, orientation='h', bgcolor='rgba(0,0,0,0)'))

# Update the y-axis labels for the entire figure
fig.update_yaxes(title_text='Sales (millions of copies sold)')

# Show the plot
fig.show()



#### Who are the biggest publishers?
I used a line plot to examine each publisher's sales over time. It was interesting to note that Nintendo was the only one around in the early days, and that EA overtook it in 2002 as the publisher with the greatest total global sales, holding the lead for about two years. Nintendo made a comeback with the release of Wii Sports in 2006, a massive hit that propelled it to the top of the plot once again. 2006 also saw Activision and Ubisoft take off, each more than doubling its total global sales over the course of that year.

In [19]:
print(f'There are {len(df["publisher"].unique())} unique publishers in the dataset.')

# Create dataframe of top 5 publishers
top_publishers = df.groupby('publisher', as_index=False).agg({'global_sales': 'sum'})
top_5_publishers = top_publishers.nlargest(5, 'global_sales')

# Filter the original dataframe down to the games released by the top 5 publishers
top_5_publishers_games = df[df['publisher'].isin(top_5_publishers['publisher'])]

# Total global sales by year for each of the top 5 publishers
top_5_publishers_by_year = top_5_publishers_games.groupby(['publisher', 'year'], as_index=False).agg({'global_sales': 'sum'})

# Create a line plot
fig = px.line(top_5_publishers_by_year, x='year', y='global_sales', color='publisher',
              title='Top 5 Publishers Global Sales Over Time',
              labels={'year': 'Year', 'global_sales': 'Global Sales (millions of copies sold)'})

# update the layout to improve the look
fig.update_layout(height=600, width=800, title_text="Top 5 Publishers Global Sales Over Time",  
                    legend=dict(xanchor='left', yanchor= 'top', x=0.75, y=1.1, orientation='v', bgcolor='rgba(0,0,0,0)'))

# Show the plot
fig.show()

There are 576 unique publishers in the dataset.
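
Reading lead changes off a line plot can be fiddly, so here is a minimal sketch of how the yearly leader could be extracted from a table shaped like `top_5_publishers_by_year`. The publishers (`'A'`, `'B'`) and their sales figures are made up for illustration and do not reflect the dataset.

```python
import pandas as pd

# Hypothetical yearly totals in the same long format as top_5_publishers_by_year
sales = pd.DataFrame({
    'publisher': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2001, 2002, 2003, 2001, 2002, 2003],
    'global_sales': [50.0, 40.0, 60.0, 30.0, 45.0, 35.0],
})

# One row per year, one column per publisher
wide = sales.pivot(index='year', columns='publisher', values='global_sales')

# Publisher with the highest sales in each year
leader = wide.idxmax(axis=1)

# Years in which the leader differs from the previous year's leader
changes = leader[leader != leader.shift()].iloc[1:]

print(leader)
print(changes)
```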


#### Exploring the biggest publishers
Nintendo, EA, and Activision lead the way as the world's largest video game publishers. To get a closer look at the powerhouse games behind these names, I used a treemap of the top five publishers to show the scale and depth of their catalogs. The treemap adds a fun dimension to exploring the data: you can click into each publisher to see its titles more clearly.

In [20]:
# Create dataframe of top 5 publishers
top_publishers = df.groupby('publisher', as_index=False).agg({'global_sales': 'sum'})
top_5_publishers = top_publishers.nlargest(5, 'global_sales')

# Filter the original dataframe down to the games released by the top 5 publishers
top_5_publishers_games = df[df['publisher'].isin(top_5_publishers['publisher'])]

my_color_scale = ["#008AFA", "#FEDC28", "#E7409E"]

# Create a treemap plot
fig = px.treemap(top_5_publishers_games, 
                 path=['publisher', 'name'],
                 values='global_sales',
                 color='global_sales',
                 color_continuous_scale=my_color_scale,
                 title='Exploring the Top 5 Video Game Publishers and Their Best-Selling Titles')

# Show the plot
fig.show()


### Conclusion

Working through this analysis of the gaming industry has uncovered fascinating insights into the ever-evolving gaming market. Since the early days, the number of games released each year has skyrocketed, with 2009 witnessing an unprecedented high. Platforms have come a long way, from the Atari 2600 to the likes of the PS2, Xbox 360, and Wii. Additionally, genres such as action and sports have dominated sales across various regions, showcasing the universal allure of gaming.

Distinct regional preferences have emerged, with each area boasting unique best-selling games and Japan's undeniable love for Pokemon. Furthermore, our exploration revealed that industry giants Nintendo, EA, and Activision drive the sector's growth as the largest video game publishers.

Examining sales data, genre preferences, and the ascent of prominent platforms allows us to better understand the gaming landscape and appreciate its remarkable journey over the years. This analysis enables us to acknowledge the industry's dynamic nature and anticipate the thrilling developments in store for gamers worldwide.