# Case Study on Video Game Sales

## Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# maybe there are other libraries you'd like to add(?)

For this notebook, we will work on a dataset called `video game sales`. This dataset contains a list of video games with sales greater than 100,000 copies. It has over 16,500 records and was generated by a scrape of vgchartz.com.

If you view the `.csv` file in Excel, you can see that our dataset contains 16,598 **observations** (rows) across 11 **variables** (columns). The following are the descriptions of each variable in the dataset.

- **`Rank`**: Ranking of overall sales
- **`Name`**: The game's name
- **`Platform`**: Platform of the game's release (e.g. `PC`, `PS4`, etc.)
- **`Year`**: Year of the game's release
- **`Genre`**: Genre of the game
- **`Publisher`**: Publisher of the game
- **`NA_Sales`**: Sales in North America (in millions)
- **`EU_Sales`**: Sales in Europe (in millions)
- **`JP_Sales`**: Sales in Japan (in millions)
- **`Other_Sales`**: Sales in the rest of the world (in millions)
- **`Global_Sales`**: Total worldwide sales (in millions)

Let's read the dataset.

In [2]:
vgsales_df = pd.read_csv("vgsales.csv")

Now let's display general information about the dataset with the [`info`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) function.

In [3]:
vgsales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB
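
The non-null counts above imply missing values in `Year` (16598 - 16327 = 271) and in `Publisher` (16598 - 16540 = 58). As a minimal sketch, such counts can also be read off directly with `isna().sum()`. The frame `toy_df` and its values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for vgsales_df
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [2006.0, np.nan, 2010.0, np.nan],
    "Publisher": ["X", "Y", None, "Z"],
})

# isna() marks missing cells; sum() counts them per column
missing_counts = toy_df.isna().sum()
print(missing_counts)
```

Running the same call on `vgsales_df` should report the per-column gaps that `info()` only shows indirectly.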


## Cleaning the Dataset
The next step in Exploratory Data Analysis is cleaning the data.

Let's first check whether the values of the relevant variables in the dataset are within the range of acceptable values.

In [4]:
# new_df will be used for testing for now; copy so vgsales_df stays unchanged
new_df = vgsales_df.copy()
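
Plain assignment of a DataFrame (`b = a`) only binds a second name to the same object, so edits made through either name show up in both, whereas `.copy()` produces an independent object. The sketch below illustrates the difference. The names `original`, `alias`, and `independent` are hypothetical and used only for this example.

```python
import pandas as pd

original = pd.DataFrame({"Year": [2006.0, 2020.0]})

alias = original                # same object under a second name
independent = original.copy()   # separate object with its own data

# Changing the alias also changes the original
alias.loc[1, "Year"] = 1999.0

# Changing the copy leaves the original untouched
independent.loc[0, "Year"] = 1980.0

print(original["Year"].tolist())
print(independent["Year"].tolist())
```

For a scratch frame meant for testing, the copy is usually the safer choice, since the source data stays available for comparison.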

### `Year` variable
Because this dataset was released in the year 2016, we should check to see if any of the records go beyond that.

In [5]:
vgsales_df[vgsales_df['Year'] > 2016]

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5957 | 5959 | Imagine: Makeup Artist | DS | 2020.0 | Simulation | Ubisoft | 0.27 | 0.0 | 0.0 | 0.02 | 0.29 |
| 14390 | 14393 | Phantasy Star Online 2 Episode 4: Deluxe Package | PS4 | 2017.0 | Role-Playing | Sega | 0.0 | 0.0 | 0.03 | 0.0 | 0.03 |
| 16241 | 16244 | Phantasy Star Online 2 Episode 4: Deluxe Package | PSV | 2017.0 | Role-Playing | Sega | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 |
| 16438 | 16441 | Brothers Conflict: Precious Baby | PSV | 2017.0 | Action | Idea Factory | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 |


For now, we can change these `Year` values to `NaN`, since the other variables in these rows can still be used for our analysis.

In [6]:
new_df['Year'].replace(new_df[new_df['Year'] > 2016], np.nan, inplace=True)
new_df['Year'].unique()

array([2006., 1985., 2008., 2009., 1996., 1989., 1984., 2005., 1999.,
       2007., 2010., 2013., 2004., 1990., 1988., 2002., 2001., 2011.,
       1998., 2015., 2012., 2014., 1992., 1997., 1993., 1994., 1982.,
       2003., 1986., 2000.,   nan, 1995., 2016., 1991., 1981., 1987.,
       1980., 1983., 2020., 2017.])
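
The array above still contains `2020.` and `2017.`, so the call does not appear to have replaced anything. A likely reason is that the first argument passed to `replace` is a DataFrame of matching rows rather than the year values themselves. One approach that should work is assigning through a boolean mask with `.loc`. This is a minimal sketch on a made-up frame (`toy_df` and `too_late` are illustrative names), not the notebook's own fix.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for new_df
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [2006.0, 2020.0, 2017.0, 2016.0],
})

# True for rows whose year is later than the dataset's release year
too_late = toy_df["Year"] > 2016

# Overwrite only those cells; the other columns are left as they are
toy_df.loc[too_late, "Year"] = np.nan

print(toy_df["Year"].unique())
```

The same two lines, applied to `new_df`, would keep the affected rows and their sales figures while marking only the year as missing.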

In [10]:
new_df.sort_values(by=['Year'],ascending = False)

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5957 | 5959 | Imagine: Makeup Artist | DS | 2020.0 | Simulation | Ubisoft | 0.27 | 0.00 | 0.00 | 0.02 | 0.29 |
| 14390 | 14393 | Phantasy Star Online 2 Episode 4: Deluxe Package | PS4 | 2017.0 | Role-Playing | Sega | 0.00 | 0.00 | 0.03 | 0.00 | 0.03 |
| 16241 | 16244 | Phantasy Star Online 2 Episode 4: Deluxe Package | PSV | 2017.0 | Role-Playing | Sega | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 |
| 16438 | 16441 | Brothers Conflict: Precious Baby | PSV | 2017.0 | Action | Idea Factory | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 |
| 8293 | 8295 | Shin Megami Tensei IV: Final | 3DS | 2016.0 | Role-Playing | Deep Silver | 0.03 | 0.00 | 0.14 | 0.00 | 0.17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16307 | 16310 | Freaky Flyers | GC | NaN | Racing | Unknown | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 |
| 16327 | 16330 | Inversion | PC | NaN | Shooter | Namco Bandai Games | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 |
| 16366 | 16369 | Hakuouki: Shinsengumi Kitan | PS3 | NaN | Adventure | Unknown | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 |
| 16427 | 16430 | Virtua Quest | GC | NaN | Role-Playing | Unknown | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 |
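
In the sorted view above, the rows with a missing `Year` sit at the bottom even though the sort is descending, because `sort_values` places `NaN` last by default. When the goal is to inspect the missing years first, `na_position="first"` can be passed. This is a small sketch on a made-up frame (`toy_df` and `sorted_df` are illustrative names).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for new_df
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Year": [2006.0, np.nan, 2016.0],
})

# Descending by year, with rows lacking a year listed before the rest
sorted_df = toy_df.sort_values(by="Year", ascending=False, na_position="first")
print(sorted_df)
```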
