## Questions 

* Which publisher has been most successful?
* Which console has been most successful?
* What is the percentage of sales by region?
* What is the percentage of global sales by genre?
* What is the percentage of global sales by year?
* What is the correlation between global sales and all other attributes?




In [None]:
# Importing packages that will be useful in my analysis
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import plotly
import plotly.express as px
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)
%matplotlib inline
sns.set()

In [None]:
#First, import the dataset and set index to 'Rank'
sales = pd.read_csv('../input/videogamesales/vgsales.csv', index_col = 'Rank')

In [None]:
#Let's see what this data looks like
sales.head()

In [None]:
# Rows and columns?
sales.shape

In [None]:
# General info
sales.info()

### Looks like there are some null values in the year, genre and publisher columns. We're going to be using this data, so let's clean those up.

In [None]:
# Remove rows with null values
sales.dropna(inplace= True)
sales.info()
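To see how much data a blanket `dropna` costs, it can help to count missing values per column. The helper below is a minimal sketch; `null_report` is a hypothetical name, not something from the original notebook. Run on the raw frame it shows where the gaps are; run after `dropna` it should come back empty.

```python
import pandas as pd

def null_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)
```

Usage would be `null_report(sales)`, ideally once before and once after the `dropna` cell.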

### We also have the opportunity to reduce memory usage by changing some columns to data type 'category'. First, let's find which of these columns are good candidates for this.

In [None]:
# Number of unique values for each column
sales.nunique()

### Looks like Genre and Platform might be good candidates for changing to datatype of 'category'

In [None]:
sales['Platform'] = sales['Platform'].astype('category')
sales['Genre'] = sales['Genre'].astype('category')
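One way to check whether the conversion actually pays off is to compare memory usage with and without it. This is a rough sketch; `category_savings` is a made-up helper name, and it assumes the columns passed in have not been converted yet (on an already-converted column both figures will match).

```python
import pandas as pd

def category_savings(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Compare deep memory usage of columns as-is versus as 'category'."""
    rows = []
    for col in columns:
        # deep=True counts the actual string payload, not just the pointers
        before = df[col].memory_usage(deep=True, index=False)
        after = df[col].astype("category").memory_usage(deep=True, index=False)
        rows.append({"column": col, "bytes_before": before, "bytes_after": after})
    return pd.DataFrame(rows).set_index("column")
```

Calling `category_savings(sales, ['Platform', 'Genre'])` before the conversion cell would give a side-by-side byte count for each column.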

### Now we've cleaned our data up a little, let's look at our first question - which publisher has been the most successful?

First - let's get our total number of global sales. This is going to be useful for most of our questions

In [None]:
total_sales = sales['Global_Sales'].sum()
total_sales

### 8,811.97 million sales. That's a lot of video games! Let's make a note of this number, as it's going to come in useful.

Now we'll take a look at the total sales by publisher, using `groupby` and `sum`.

In [None]:
sales_by_publisher = sales.groupby('Publisher').sum(numeric_only=True) # create new df with sales by publisher (numeric columns only; the category columns cannot be summed)
sales_by_publisher.sort_values('Global_Sales', ascending = False, inplace = True) # Sort by global sales, descending
sales_by_publisher.head(10) # Check out the top 10
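The raw totals can also be expressed as a share of all global sales, which puts a number on how far ahead the leader is. A minimal sketch follows; `sales_share` and its parameters are illustrative names, and the column names are assumed to match the dataset loaded above.

```python
import pandas as pd

def sales_share(df: pd.DataFrame, by: str = "Publisher",
                value: str = "Global_Sales", top: int = 10) -> pd.Series:
    """Percentage of total `value` contributed by each group, largest first."""
    # observed=True skips unused categories if `by` is a categorical column
    totals = df.groupby(by, observed=True)[value].sum().sort_values(ascending=False)
    share = (totals / totals.sum() * 100).round(2)
    return share.head(top).rename("pct_of_total")
```

`sales_share(sales)` would list the top 10 publishers by share; the same function could be pointed at `by='Genre'` or `value='NA_Sales'` for the other questions.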


### Looking at the dataframe, we can clearly see that Nintendo is by far the most successful publisher. Just how much more successful are they, though? Let's see if we can use a visualisation to figure this out.

First, a simple pie chart

In [None]:
# For simplicity, we're going to just look at the top 10 publishers, as the number of 
# global sales tails off sharply as you go down the list

# Copy the index to a separate column so we can use it as a label on the pie chart
sales_by_publisher['Publisher_col'] = sales_by_publisher.index

# Let's use Plotly to draw a simple pie chart
pie_chart = px.pie(
            data_frame = sales_by_publisher.head(10),
            values = 'Global_Sales',
            names = 'Publisher_col')
pie_chart

### So Nintendo have close to 30% of all sales generated by the top 10 publishers. Some indications of dominance here, but it's not the whole picture. Nintendo are a massive company, so maybe they just put out more games than everyone else, and that accounts for the extra sales? Let's see if we can adjust for this by finding the average number of sales per game for each publisher.

In [None]:
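# Create a new df with the mean and count of each numeric column for each publisher
# (text and category columns are left out, since they cannot be averaged)
numeric_cols = sales.select_dtypes('number').columns.tolist()
avg_sales_by_pub = sales.groupby('Publisher')[numeric_cols].agg(['mean', 'count'])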


# Discard smaller publishers that have published 50 games or fewer
avg_sales_by_pub = avg_sales_by_pub[avg_sales_by_pub[('Global_Sales', 'count')] > 50]


# Sort by highest mean
avg_sales_by_pub.sort_values(('Global_Sales', 'mean'), ascending = False, inplace= True)
 
# Lets see who's at the top!
avg_sales_by_pub.head(10)
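If only the global figures matter, the same table can be built with flat column names instead of a two-level column index. This is an alternative sketch using pandas named aggregation, not the approach taken in the cell above; `per_game_stats` and `min_games` are hypothetical names.

```python
import pandas as pd

def per_game_stats(df: pd.DataFrame, min_games: int = 50) -> pd.DataFrame:
    """Mean global sales per game and game count for each publisher."""
    stats = df.groupby("Publisher")["Global_Sales"].agg(
        mean_sales="mean",
        n_games="count",
    )
    # Keep publishers with more than `min_games` titles, as in the cell above
    stats = stats[stats["n_games"] > min_games]
    return stats.sort_values("mean_sales", ascending=False)
```

`per_game_stats(sales).head(10)` should give the same ranking as the cell above, with columns that can be referenced as plain strings.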


### Remarkable! Nintendo not only have the highest total sales (by a wide margin), they also sell, on average, almost twice as many copies per game as their nearest rival, Microsoft.

#### It's looking like Nintendo are the undisputed rulers of the video game industry, but before we make that final conclusion, let's look a little more deeply.

In [None]:
sales.head(20)
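Rather than counting rows by eye, the publishers of the top-selling games can be tallied directly. A small sketch, with `top_n_publisher_counts` as an illustrative helper name:

```python
import pandas as pd

def top_n_publisher_counts(df: pd.DataFrame, n: int = 20) -> pd.Series:
    """How many of the n best-selling games each publisher accounts for."""
    top = df.nlargest(n, "Global_Sales")
    return top["Publisher"].value_counts()
```

`top_n_publisher_counts(sales, 20)` would return one count per publisher appearing in the top 20.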

## Whoa, 17 of the top 20 games by global sales are Nintendo titles. Pretty impressive, but are these data pointing at something else?

#### Bonus question: are Nintendo a 'Big Game' only company? That is, are Nintendo's biggest-selling games massively pulling up the average sales number? Could it be that the majority of Nintendo games sell only as well as (or even worse than) games from rival companies?

In [None]:
# First, let's create a new df that only has Nintendo games
only_nintendo = sales[sales['Publisher'] == 'Nintendo']

# Now, let's plot a histogram
fig = px.histogram(only_nintendo, x="Global_Sales", nbins=20)
fig.show()

### So there we go. Nintendo may have the highest average sales per game, but the vast majority of their games sold UNDER that average, with their big games dragging that average up a huge amount. Conclusion: Nintendo ARE the most successful game publisher; however, a randomly picked Nintendo game is likely to be no more successful than a randomly picked game from one of their top competitors.
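The histogram only covers Nintendo, so one way to test the comparison with competitors is to put the mean, the median and the share of games below the mean side by side for several publishers. The sketch below assumes the `Publisher` and `Global_Sales` columns used throughout; `skew_summary` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

def skew_summary(df: pd.DataFrame, publishers: list) -> pd.DataFrame:
    """Mean, median and share of games selling below the mean, per publisher."""
    rows = []
    for pub in publishers:
        s = df.loc[df["Publisher"] == pub, "Global_Sales"]
        rows.append({
            "Publisher": pub,
            "mean": s.mean(),
            "median": s.median(),
            # Fraction of this publisher's games that sold less than its own mean
            "pct_below_mean": (s < s.mean()).mean() * 100,
        })
    return pd.DataFrame(rows).set_index("Publisher")
```

For example, `skew_summary(sales, ['Nintendo', 'Microsoft Game Studios'])` (assuming those are the publisher labels in the data) would show whether the typical, median game differs much between the two, which is what the "randomly picked game" claim depends on.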