# **Video Game Sales**

Display an image with the **Image** function from the *IPython.display* module.

In [None]:
# from IPython.display import Image
# Image("videogames.jpg")

## Import packages

We must import the packages that we will use.
<br>
- **pandas**: Read and manipulate data.
- **numpy**: Numerical calculations, if we need them.
- **os**: Interact with the operating system.
- **matplotlib.pyplot**: Data visualization.
- **plotly**: Interactive data visualization.
- **seaborn**: Statistical data visualization.

In [None]:
import pandas as pd
import numpy as np

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

%matplotlib inline
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

First of all, we can get our *current working directory (cwd)*, if our dataset is located there.
<br>
After that, we join the path and the file name (the dataset in this case) in a single variable. On Kaggle, we simply write the full path of the input file directly.

In [None]:
# path = os.getcwd()
# file = "vgsales.csv"

fullpath = r"/kaggle/input/videogamesales/vgsales.csv"
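
The commented-out lines above stop before the join itself. As a minimal sketch, assuming the CSV file sits in the current working directory (the name `local_fullpath` is only illustrative), the path and the file name can be joined like this:

```python
import os

# Assumption: the dataset is stored in the current working directory.
path = os.getcwd()
file = "vgsales.csv"

# os.path.join inserts the correct separator for the operating system.
local_fullpath = os.path.join(path, file)
print(local_fullpath)
```

The result can be passed to `pd.read_csv()` in the same way as the hard-coded Kaggle path.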

Now we have the full path, and we can start reading our dataset.
<br>
With *.head()* we can see the first 5 rows of our dataset, to confirm that it was loaded correctly. If needed, **pd.read_csv()** also accepts options to specify a separator, set the header row, and so on.

In [None]:
videogames = pd.read_csv(fullpath)
videogames.head()
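
Our file needs no extra options, but as a small sketch of the separator and header options, here is a made-up, semicolon-separated sample without a header row (the data and the name `sample` are invented for illustration):

```python
import io
import pandas as pd

# Made-up sample data: semicolon-separated, with no header row.
raw = "Wii Sports;Wii;2006\nTetris;GB;1989\n"

sample = pd.read_csv(
    io.StringIO(raw),                    # any file path works here as well
    sep=";",                             # column separator
    header=None,                         # the data has no header line
    names=["Name", "Platform", "Year"],  # column names to assign
)
print(sample)
```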

In the next cells, we look at the values of different columns. In this case, I have chosen the following variables (columns):
- Platform
- Genre
- Publisher

I have chosen these variables because I think they can be important to show which is the most popular, which has the most sales, and so on.

In [None]:
videogames.info()

With **.info()** we can see general information about our dataframe, such as the data type of each column. For example, the *Year* variable is a float.

In [None]:
videogames.shape

**.shape** shows us the number of rows and columns of our dataframe.

## Clean data

**.isna()** combined with **.sum()** shows the number of NaN values in each column of our dataframe.

In [None]:
videogames.isna().sum()
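
To see what this count looks like on a small scale, here is a sketch with an invented three-row dataframe (`toy` and its values are made up):

```python
import numpy as np
import pandas as pd

# Invented data with one missing value in each column.
toy = pd.DataFrame({
    "Year": [2006.0, np.nan, 1989.0],
    "Publisher": ["Nintendo", None, "Nintendo"],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing = toy.isna().sum()
print(missing)
```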

The *Year* and *Publisher* columns contain NaN values.
<br>
We have 2 options in this case:
- Remove NaN values.
- Replace NaN values.
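
The two options can be compared on a small invented dataframe. This is only a sketch: the replacement values (the median year and the label "Unknown") are assumptions chosen for illustration, not something the dataset requires.

```python
import numpy as np
import pandas as pd

# Invented data with one missing year and one missing publisher.
toy = pd.DataFrame({
    "Year": [2006.0, np.nan, 1989.0, 2006.0],
    "Publisher": ["Nintendo", "Sega", None, "Nintendo"],
})

# Option 1: remove every row that has at least one NaN.
dropped = toy.dropna(how="any")

# Option 2: replace NaN values instead of removing rows.
filled = toy.fillna({
    "Year": toy["Year"].median(),  # numeric column: median of the known years
    "Publisher": "Unknown",        # text column: placeholder label
})

print(len(dropped), "rows left after dropping")
print(filled)
```

Removing keeps only complete rows, while replacing keeps every row but introduces values that were not in the original data.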

### Delete NaN

**.dropna()** removes NaN values. Inside the parentheses, *how='any'* drops every row that contains at least one NaN, and *inplace=True* applies the change directly to the dataframe.

In [None]:
videogames.dropna(how="any", inplace = True)

videogames.isna().sum()

We removed all rows with NaN values because, in this case, the number of NaN values is very small compared with the size of the dataframe, so we can remove those rows without a problem.

### Convert values

We must convert the *Year* column from *float* to *int* to remove the decimals from the years.

In [None]:
videogames['Year'] = videogames['Year'].astype(int)

videogames.head()
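
The order of the steps matters here: the conversion works because the NaN values were removed first. A small sketch with invented values shows what happens otherwise, plus pandas' nullable `Int64` type as one possible alternative:

```python
import numpy as np
import pandas as pd

years = pd.Series([2006.0, np.nan, 1989.0])  # invented values

# A plain int column cannot hold NaN, so this conversion raises a ValueError.
failed = False
try:
    years.astype(int)
except ValueError as err:
    failed = True
    print("Conversion failed:", err)

# After dropping the NaN values the conversion succeeds.
clean = years.dropna().astype(int)

# Alternative: the nullable integer type keeps the missing entry as <NA>.
nullable = years.astype("Int64")
print(clean.tolist(), nullable.tolist())
```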

### Remove values

In [None]:
print(videogames['Publisher'].value_counts().keys().tolist()[:30])

In [None]:
videogames[videogames['Publisher'] == "Unknown"].value_counts().sum()

In the *Publisher* column, some rows have the value *Unknown*. We are going to remove them.

In [None]:
videogames = videogames.drop(videogames[videogames['Publisher'] == "Unknown"].index)

videogames[videogames['Publisher'] == "Unknown"].value_counts().sum()
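
As a sketch of an equivalent, shorter way to write the same two steps, a boolean mask can both count and filter the rows (the dataframe `toy` below is invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Publisher": ["Nintendo", "Unknown", "Sega"],
})

# The comparison gives a True/False Series; summing it counts the True values.
unknown_count = (toy["Publisher"] == "Unknown").sum()

# Keeping the rows where the condition is False removes the "Unknown" rows.
filtered = toy[toy["Publisher"] != "Unknown"]

print(unknown_count)
print(filtered)
```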

## Plots

Now that we have finished cleaning our dataframe, we are going to start with the plots: *data visualization*.

### Genre

In [None]:
videogames['Genre'].value_counts()

Now we can see the number of games for each *Genre* value. We will show these values in the next plot, combining **seaborn** and **matplotlib.pyplot**.

- **plt.figure(figsize=(15, 10))** - Set figure or plot size.
- **sns.countplot(x="column_name", data=df, order=df["column_name"].value_counts().index)** - Draw a count plot with *seaborn*, with the bars ordered from most to least frequent.
- **plt.title('title', size=)** - Set the plot title.
- **plt.xticks(rotation=)** - Rotate the tick labels on the X axis.
- **plt.show()** - Show up the plot directly.
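
The `order` argument deserves a closer look. A minimal sketch with invented genres shows what `.value_counts().index` contains before it is handed to the plot:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Genre": ["Action", "Sports", "Action", "Puzzle", "Action", "Sports"]}
)

# value_counts() sorts the categories from most to least frequent.
counts = toy["Genre"].value_counts()

# The index of that result is the list of categories in plotting order.
order = counts.index.tolist()
print(order)
```

Passing this list as `order=` makes the bars appear from the most to the least frequent category.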

In [None]:
plt.figure(figsize=(15, 10))
sns.countplot(x="Genre", data=videogames, order = videogames['Genre'].value_counts().index)
plt.title('Genres with the most game releases', size=12)
plt.xticks(rotation=45)
plt.show()

In the plot we can see that the *Action* and *Sports* genres have the most game releases.

### Publisher

**.iloc[:n]** - Select the first *n* values. Here, *.iloc[:20]* keeps the 20 publishers with the most releases.

In [None]:
plt.figure(figsize=(15, 10))
sns.countplot(x="Publisher", data=videogames, order = videogames['Publisher'].value_counts().iloc[:20].index)
plt.title('Publishers with the most game releases', size=12)
plt.xticks(rotation=70)
plt.show()
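
The slice `.iloc[:20]` used for the `order` argument can be checked on a smaller scale. This sketch uses invented publisher names and keeps only the top 2:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Publisher": ["EA", "Activision", "EA", "Ubisoft", "EA", "Activision"]}
)

counts = toy["Publisher"].value_counts()

# The counts are already sorted, so the first 2 positions are the top 2.
top2 = counts.iloc[:2].index.tolist()

# .head(2) selects the same rows.
top2_head = counts.head(2).index.tolist()
print(top2)
```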

The publishers with the most game releases are *Electronic Arts*, followed by *Activision*, *Namco Bandai* and *Ubisoft*.

### Year

Now, we will see which years had the most game releases.

In [None]:
plt.figure(figsize=(15, 10))
sns.countplot(x="Year", data=videogames, order = videogames['Year'].value_counts().index)
plt.title('Years with the most game releases', size=12)
plt.xticks(rotation=45)
plt.show()

Looking at the plot, I have realized that *2017* and *2020* have almost no game releases in our dataframe, so we can remove them if we want.
<br>
On the other hand, the years with the most game releases are *2009* and *2008*.
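
If we decide to remove the sparse years, one possible approach is to keep only the years with at least a minimum number of releases. This is a sketch on invented data, and the threshold `min_releases` is a hypothetical value that should be chosen by looking at the plot:

```python
import pandas as pd

toy = pd.DataFrame({"Year": [2008, 2008, 2009, 2009, 2009, 2017, 2020]})

min_releases = 2  # hypothetical threshold

# Count the releases per year and keep the years that reach the threshold.
per_year = toy["Year"].value_counts()
kept_years = per_year[per_year >= min_releases].index

# Keep only the rows whose year is in the list of kept years.
trimmed = toy[toy["Year"].isin(kept_years)]
print(trimmed["Year"].unique())
```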

### Platform

In [None]:
plt.figure(figsize=(15, 10))
sns.countplot(x="Platform", data=videogames, order = videogames['Platform'].value_counts().index)
plt.title('Most used platforms', size=12)
plt.xticks(rotation=45)
plt.show()

The platforms with the most game releases are *PS2* and *DS*.

## Total revenue by region

- NA_Sales - North America
- EU_Sales - Europe
- JP_Sales - Japan
- Other_Sales - Other regions

First, in the variable *top_sales* we select all the regional sales columns.
<br>
Then, we sum each column and use *.reset_index()* to turn the result back into a dataframe with a fresh index.
<br>
Finally, we rename the columns with *.rename(columns={})*, writing the current column name followed by the new column name.

In [None]:
top_sales = videogames[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]
top_sales = top_sales.sum().reset_index()
top_sales = top_sales.rename(columns={"index": "Region", 0: "Sales"})
top_sales.head()
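
The rename uses the unusual key `0` because of how *.reset_index()* names its columns. A step-by-step sketch on an invented two-column dataframe makes this visible:

```python
import pandas as pd

toy = pd.DataFrame({
    "NA_Sales": [1.0, 2.0],
    "EU_Sales": [0.5, 0.5],
})

# Summing a dataframe gives a Series indexed by the column names.
summed = toy.sum()

# reset_index() moves that index into a column called "index";
# the values of the unnamed Series land in a column called 0.
table = summed.reset_index()
print(table.columns.tolist())

# Rename both columns to something readable.
table = table.rename(columns={"index": "Region", 0: "Sales"})
print(table)
```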

In [None]:
plt.figure(figsize=(15, 10))

labels = top_sales['Region']
sizes = top_sales['Sales']

plt.title('Top revenue by region', size=12)
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.show()
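
The percentages that *autopct* prints on the pie chart can also be computed directly. The numbers below are invented for illustration and are not the real totals of the dataset:

```python
import pandas as pd

# Invented regional totals, only to show the calculation.
region_totals = pd.Series({
    "NA_Sales": 50.0,
    "EU_Sales": 30.0,
    "JP_Sales": 15.0,
    "Other_Sales": 5.0,
})

# Share of each region as a percentage of the total, rounded to 1 decimal.
share = (region_totals / region_totals.sum() * 100).round(1)
print(share)
```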

We can see that *North America* has almost **50%** of sales.

## Top game sales

In [None]:
top_game_sale = videogames.head(20)
top_game_sale = top_game_sale[['Name', 'Global_Sales']]
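
Taking the first 20 rows gives the best sellers only if the rows are already sorted by *Global_Sales*. As a sketch that does not depend on the row order, **.nlargest()** selects the top rows explicitly (the dataframe `toy` is invented):

```python
import pandas as pd

# Invented data, deliberately not sorted by sales.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Global_Sales": [1.0, 9.0, 4.0, 7.0],
})

# nlargest returns the rows with the highest values, in descending order.
top2 = toy.nlargest(2, "Global_Sales")[["Name", "Global_Sales"]]
print(top2)
```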

In [None]:
plt.figure(figsize=(15, 10))
sns.barplot(x='Name', y='Global_Sales', data=top_game_sale)
plt.title('Games with the most global sales')
plt.xticks(rotation=70)
plt.show()

*Wii Sports* is the leader in global sales, with a huge difference over the others! It is followed by *Super Mario Bros* and *Mario Kart Wii*.