# Analyzing Video Games Across Genres

**Participant ID:**  P3

**Date / Time:**

# Introduction
Welcome to our data analysis study. For this study, you'll be working with a dataset sourced from the [Corgis Datasets Project](https://corgis-edu.github.io/corgis/).

The data was originally published in [“What makes a blockbuster video game? An empirical analysis of US sales data.” Managerial and Decision Economics](https://researchportal.port.ac.uk/en/publications/what-makes-a-blockbuster-video-game-an-empirical-analysis-of-us-s) by Dr Joe Cox.

The dataset has information about the sales and playtime of over a thousand video games released between 2004 and 2010. The playtime information was collected from crowd-sourced data on ["How Long to Beat"](https://howlongtobeat.com/).

- You will use an extension called PersIst to complete **data cleanup and manipulation** tasks. 
- Interactive charts and tables have been pre-created for your convenience. These can be directly utilized by running the corresponding cells.
- Focus on leveraging the interactive capabilities of Persist for your analysis.
- Carefully follow the step-by-step instructions provided for each task.
- In some cases, you will be asked to document your findings. Please do this in writing in a markdown cell.
- As you work through the tasks, take note of any interesting findings or challenges with the software or pandas that you may encounter, either by speaking your thoughts out loud or taking notes in a markdown cell.
- Feel free to add new code and markdown cells in the notebook as necessary to complete the tasks, but please do attempt the tasks with the PersIst functionality.

In [2]:
import helpers as h
import pandas as pd
import altair as alt

import persist_ext as PR

## Data Description

The table below describes the different columns in the dataset. Each row in the dataset represents a video game.

| Column        | Description                                                                                                                                                           |
|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Title         | Full title of the game.                                                                                                                                               |
| Max Players   | The maximum number of players that can play this game.                                                                                                                |
| Multiplatform | Whether this game is available on multiple platforms.                                                                                                                 |
| Online        | Whether this game supports online play.                                                                                                                               |
| Genres        | The main genre that this game belongs to.                                                                                                                             |
| Licensed      | Whether this game was based off a previously licensed entity.                                                                                                         |
| Publishers    | The publishers who created this game.                                                                                                                                 |
| Review Score  | A review score for this game, out of 100.                                                                                                                             |
| Re-release    | Whether this game is a re-release of an earlier one.                                                                                                                  |
| Year          | The year that this game was released.                                                                                                                                 |
| Comp_Time_All | The median time that players reported completing the game in any way, in hours.                                                                                       |
| Comp_Time_Main| The median time that players reported completing the main game storyline, in hours.                                                                                   |

In [3]:
df = pd.read_csv('video_games.csv')
df.head()

Unnamed: 0,Title,Max Players,Multiplatform?,Online?,Genres,Licensed?,Comp_Time_All,Re-release?,Comp_Time_Main,Review Score,Year,Publishers
0,Super Mario 64 DS,1,True,True,Action,True,24.48,True,14.5,85,2004,Nintendo
1,Lumines: Puzzle Fusion,1,True,True,Strategy,True,10.0,True,10.0,89,2004,Ubisoft
2,WarioWare Touched!,2,True,True,Action,True,2.5,True,1.83,81,2004,Nintendo
3,Hot Shots Golf: Open Tee,1,True,True,Sports,True,-100.0,True,-100.0,81,2004,Sony
4,Spider-Man 2,1,True,True,Action,True,10.0,True,8.0,61,2004,Activision


# Task 1: Column Names and Data Types

In the first task we will perform some basic data cleaning operations to get our dataset ready for further tasks.

### Task 1a: Remove Columns

Remove the following columns to streamline the dataset for further analysis:

- **_Re-release?:_** Boolean flag indicating if the game was a new release or a re-release.
- **_Comp_Time_All:_** Average of all other completion times; we will use one of the others directly.

#### **Instructions**
1. **Column Removal:**
	- Use the interactive table feature in PersIst to remove the specified columns.
2. **Verify the Output:**
	- Print the head of the generated dataframe to verify the changes.
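
For cross-checking the interactive result, the same column removal can be sketched in plain pandas. This is only a reference sketch: the `sample` frame is a small stand-in built from the preview rows above, not the full dataset, and the names `sample` and `sample_1a` are illustrative.

```python
import pandas as pd

# Small stand-in for video_games.csv (values copied from the preview above).
sample = pd.DataFrame({
    "Title": ["Super Mario 64 DS", "Spider-Man 2"],
    "Comp_Time_All": [24.48, 10.0],
    "Re-release?": [True, True],
    "Comp_Time_Main": [14.5, 8.0],
})

# Drop the two columns named in Task 1a; drop() returns a new frame
# and leaves the original untouched.
sample_1a = sample.drop(columns=["Re-release?", "Comp_Time_All"])
print(sample_1a.head())
```

The remaining columns should match those shown by `df_task_1a.head()` after the interactive step.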

In [4]:
PR.PersistTable(df, df_name="df_task_1a")

PersistWidget(data_values=[{'__id_column': '1', 'Title': 'Super Mario 64 DS', 'Max Players': '1', 'Multiplatfo…

In [5]:
df_task_1a.head()

Unnamed: 0,Title,Max Players,Multiplatform?,Online?,Genres,Licensed?,Comp_Time_Main,Review Score,Year,Publishers
0,Super Mario 64 DS,1,True,True,Action,True,14.5,85,2004,Nintendo
1,Lumines: Puzzle Fusion,1,True,True,Strategy,True,10.0,89,2004,Ubisoft
2,WarioWare Touched!,2,True,True,Action,True,1.83,81,2004,Nintendo
3,Hot Shots Golf: Open Tee,1,True,True,Sports,True,-100.0,81,2004,Sony
4,Spider-Man 2,1,True,True,Action,True,8.0,61,2004,Activision


### Task 1b: Fix Column Names

It looks like our dataset header went wrong when reading the file and some column headers end with a `?`. **Please remove the question marks from all headers**. 

#### **Instructions**
1. **Rename Columns:**
    - Use the interactive table in Persist to remove the trailing `?` from the following column names:
        - _Licensed?_ → _Licensed_
        - _Multiplatform?_ → _Multiplatform_
        - _Online?_ → _Online_
2. **Verify the Output:**
	- Print the head of the generated dataframe to verify the changes.
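
As a reference for checking the renamed headers, a pandas-only sketch of the same cleanup is shown here. It assumes the only unwanted character is a trailing `?`; the `sample` frame and the name `sample_1b` are illustrative stand-ins.

```python
import pandas as pd

# Stand-in frame with the three affected headers (row taken from the data preview).
sample = pd.DataFrame({
    "Title": ["Halo 3"],
    "Multiplatform?": [True],
    "Online?": [True],
    "Licensed?": [True],
})

# rename() accepts a function that is applied to every column label;
# headers without a trailing "?" pass through unchanged.
sample_1b = sample.rename(columns=lambda name: name.rstrip("?"))
print(sample_1b.columns.tolist())
```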

In [7]:
PR.PersistTable(df_task_1a, df_name="df_task_1b")

PersistWidget(data_values=[{'__id_column': '1', 'Title': 'Super Mario 64 DS', 'Max Players': '1', 'Multiplatfo…

In [8]:
df_task_1b.head()

Unnamed: 0,Title,Max Players,Multiplatform,Online,Genres,Licensed,Comp_Time_Main,Review Score,Year,Publishers
0,Super Mario 64 DS,1,True,True,Action,True,14.5,85,2004,Nintendo
1,Lumines: Puzzle Fusion,1,True,True,Strategy,True,10.0,89,2004,Ubisoft
2,WarioWare Touched!,2,True,True,Action,True,1.83,81,2004,Nintendo
3,Hot Shots Golf: Open Tee,1,True,True,Sports,True,-100.0,81,2004,Sony
4,Spider-Man 2,1,True,True,Action,True,8.0,61,2004,Activision


### Task 1c: Correcting Data Type of 'Max Players'

There is a data type issue in the `Max Players` column of our dataframe. The column holds categorical values and should have the pandas dtype `category`.

In [9]:
df_task_1b.dtypes

Title             string[python]
Max Players       string[python]
Multiplatform            boolean
Online                   boolean
Genres            string[python]
Licensed                 boolean
Comp_Time_Main           Float64
Review Score               Int64
Year                       Int64
Publishers        string[python]
dtype: object

#### **Instructions**
1. **Convert `Max Players` column to category:**
    - Use the column header to switch data type of the `Max Players` column.
2. **`Edit Categories` pop-up:**
    - Inspect the `Edit Categories` pop-up in the toolbar. Look for any redundant categories (e.g., both `1` and `1P`).
3. **Edit and Correct Entries:**
    - Search for the cells having the redundant option using the search box.
    - Edit the cells to remove the trailing `P` (e.g. `1P` to `1`)
4. **Verify the Output:**
    - Print the head of the generated dataframe to verify the changes.
    - Also print the dtypes of the dataframe
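
The steps above can also be approximated in pandas to double-check the result. This sketch assumes the redundant entries differ only by a trailing `P`, as in the `1P` example; the values in `sample` are made up for illustration.

```python
import pandas as pd

# Hypothetical values mixing clean entries with the redundant "P" variants.
sample = pd.DataFrame({"Max Players": ["1", "1P", "2", "4", "2P"]})

# Remove a single trailing "P", then convert the column to the category dtype.
cleaned = sample["Max Players"].str.replace(r"P$", "", regex=True)
sample_1c = sample.assign(**{"Max Players": cleaned.astype("category")})

print(sample_1c["Max Players"].cat.categories.tolist())
print(sample_1c.dtypes)
```

Cleaning the strings before the conversion means the redundant labels never become categories in the first place.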

In [10]:
PR.PersistTable(df_task_1b, df_name="df_task_1c")

PersistWidget(data_values=[{'__id_column': '1', 'Title': 'Super Mario 64 DS', 'Max Players': '1', 'Multiplatfo…

In [12]:
df_task_1c.dtypes

Title             string[python]
Max Players             category
Multiplatform            boolean
Online                   boolean
Genres            string[python]
Licensed                 boolean
Comp_Time_Main           Float64
Review Score               Int64
Year                       Int64
Publishers        string[python]
is_selected              boolean
dtype: object

# Task 2: Filtering data

In Task 2, we further clean the data by removing outliers and dropping sparsely populated years so that the remaining records are more consistent.

## **Task 2a: Remove Outliers**

In this task, we address data accuracy by filtering out anomalies in the completion time for the main story of the game. Some records have negative values for completion time, which cannot be valid data.

**Remove records with negative completion time.**

#### **Instructions**
1. **Identify and Remove Anomalies:**
    - Interactively select the data points with a negative value for `Comp_Time_Main`.
    - Use Persist's interactive features to remove these anomalous records.
2. **Verify the Output:**
    - Print the head of the generated dataframe to verify the changes.
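
For comparison with the interactive selection, a boolean-mask filter in pandas might look like the sketch below. The rows in `sample` come from the data preview; keeping values equal to zero is an assumption, since the task only asks for negative values to be removed.

```python
import pandas as pd

# Stand-in rows from the preview, including the -100.0 placeholder record.
sample = pd.DataFrame({
    "Title": ["WarioWare Touched!", "Hot Shots Golf: Open Tee", "Spider-Man 2"],
    "Comp_Time_Main": [1.83, -100.0, 8.0],
})

# Keep only rows whose completion time is not negative, then renumber the index.
sample_2a = sample[sample["Comp_Time_Main"] >= 0].reset_index(drop=True)
print(sample_2a)
```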

In [13]:
PR.plot.scatterplot(df_task_1c, "Comp_Time_Main:Q", "Review Score:Q", df_name="df_task_2a")

PersistWidget(data_values=[{'__id_column': '1', 'Title': 'Super Mario 64 DS', 'Max Players': 1, 'Multiplatform…

In [14]:
df_task_2a.head()

Unnamed: 0,Title,Max Players,Multiplatform,Online,Genres,Licensed,Comp_Time_Main,Review Score,Year,Publishers,is_selected
0,Super Mario 64 DS,1,True,True,Action,True,14.5,85,2004,Nintendo,False
1,Lumines: Puzzle Fusion,1,True,True,Strategy,True,10.0,89,2004,Ubisoft,False
2,WarioWare Touched!,2,True,True,Action,True,1.83,81,2004,Nintendo,False
3,Spider-Man 2,1,True,True,Action,True,8.0,61,2004,Activision,False
4,The Urbz: Sims in the City,1,True,True,Simulation,True,15.5,67,2004,EA,False


## Task 2b: Filtering Out Old Data

The interactive bar chart below shows the data aggregated by year. There are noticeably fewer records for `2004` and `2005`.

In this subtask we will remove these older records, keeping only the records from 2006 onward.

#### **Instructions**
1. **Analyze the Bar Chart:**
    - Identify the bars that have fewer than 200 records.
2. **Filter Year:**
    - Select and remove the years with few records.
3. **Verify the Output:**
    - Print the head of the generated dataframe to verify the changes.
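
A pandas sketch of the same count-based filter is given here for reference. The `min_records` threshold is a hypothetical parameter, set to 3 so that the tiny made-up `sample` behaves like the real data would with a threshold of 200.

```python
import pandas as pd

# Made-up years: 2004 and 2005 are deliberately sparse.
sample = pd.DataFrame({"Year": [2004, 2005, 2007, 2007, 2007, 2008, 2008, 2008]})

min_records = 3  # hypothetical; the task uses 200 on the full dataset

# Count records per year and collect the years that fall below the threshold.
counts = sample["Year"].value_counts()
sparse_years = counts[counts < min_records].index

# Keep only rows whose year is not in the sparse set.
sample_2b = sample[~sample["Year"].isin(sparse_years)].reset_index(drop=True)
print(sample_2b["Year"].value_counts().sort_index())
```

Filtering on counts rather than hard-coding the years mirrors the bar-chart instruction; a direct `Year >= 2006` comparison would be an alternative if the cutoff is already known.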

In [15]:
PR.plot.barchart(df_task_2a, "Year:O", "count()", selection_type="interval", df_name="df_task_2b")

PersistWidget(data_values=[{'__id_column': '1', 'Title': 'Super Mario 64 DS', 'Max Players': 1, 'Multiplatform…

In [16]:
df_task_2b.head()

Unnamed: 0,Title,Max Players,Multiplatform,Online,Genres,Licensed,Comp_Time_Main,Review Score,Year,Publishers,is_selected
0,Wii Fit,1,True,True,Educational,True,3.93,80,2007,Nintendo,False
1,Halo 3,4,True,True,Action,True,9.0,94,2007,Microsoft,False
2,Call of Duty 4: Modern Warfare,4,True,True,Action,True,7.0,94,2007,Activision,False
3,Super Mario Galaxy,2,True,True,Action,True,15.0,97,2007,Nintendo,False
4,Mario Party DS,1,True,True,Action,True,6.87,72,2007,Nintendo,False


# Task 3: Data Wrangling

### Task 3a: Creating and assigning `'Length'` category

Next, we'll introduce a new categorical variable named `Length` into our dataset. It classifies each game as `Short`, `Average`, or `Long` based on its `Comp_Time_Main` value.

#### **Instructions**
1. **Define Length Categories:**
    - The `Comp_Time_Main` column represents the median completion time for the main story of the game.
    - Based on median completion time, create a new category called `Length`.
    - Add three options for this category -- `Short`, `Average`, `Long`.
2. **Interactive Assignment:**
    - Use Persist's interactive features to select games and assign them to one of the `Length` values (Short, Average, Long).
    - You should use the following ranges for assigning proper categories:
        - `Short`: 0 - 20 hours
        - `Average`: 21 - 40 hours
        - `Long`: more than 40 hours
3. **Verify the Output:**
    - Print the head of the generated dataframe to verify the changes.
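
To sanity-check the interactive assignment, the binning can be sketched with `pd.cut`. The ranges above leave values between 20 and 21 hours unspecified; this sketch assumes the cut points sit at exactly 20 and 40 hours, with each upper bound inclusive. The completion times in `sample` are made up to exercise the boundaries.

```python
import pandas as pd

# Made-up completion times chosen to hit the bin edges.
sample = pd.DataFrame({"Comp_Time_Main": [3.93, 20.0, 20.5, 40.0, 60.0]})

# Bins are (0, 20], (20, 40], (40, inf); include_lowest also admits exactly 0.
# Placing the edges at 20 and 40 is an assumption, not part of the task text.
sample["Length"] = pd.cut(
    sample["Comp_Time_Main"],
    bins=[0, 20, 40, float("inf")],
    labels=["Short", "Average", "Long"],
    include_lowest=True,
)
print(sample)
```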

In [17]:
PR.plot.scatterplot(df_task_2b, "Comp_Time_Main:Q", "Review Score:Q", df_name="df_task_3a")

PersistWidget(data_values=[{'__id_column': '1', 'Title': 'Wii Fit', 'Max Players': 1, 'Multiplatform': True, '…

In [18]:
df_task_3a.head()

Unnamed: 0,Length,Title,Max Players,Multiplatform,Online,Genres,Licensed,Comp_Time_Main,Review Score,Year,Publishers,is_selected
0,Short,Wii Fit,1,True,True,Educational,True,3.93,80,2007,Nintendo,False
1,Short,Halo 3,4,True,True,Action,True,9.0,94,2007,Microsoft,False
2,Short,Call of Duty 4: Modern Warfare,4,True,True,Action,True,7.0,94,2007,Activision,False
3,Short,Super Mario Galaxy,2,True,True,Action,True,15.0,97,2007,Nintendo,False
4,Short,Mario Party DS,1,True,True,Action,True,6.87,72,2007,Nintendo,False


### Task 3b: Finding Top Genre for each `Length`

Now we will analyze which genre is most prevalent for games in each length category.

#### **Instructions**
1. **Context:**
    - We have a faceted bar chart. The `x` axis encodes the `Genres` column in the data and the columns encode the newly added category `Length`.
2. **Analyze Genres:**
    - Observe the bar charts to identify the top genre for each length.
    - You can hover on the bars to get the exact frequency.
3. **Document Findings:**
    - Note down the most common Genre for each length based on your interactive analysis in a new markdown cell.
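
The chart reading can be cross-checked numerically with a group-by. This is a sketch on made-up rows: the genre labels and counts in `sample` are illustrative only, and ties (absent here) would resolve to whichever genre `value_counts` lists first.

```python
import pandas as pd

# Made-up rows; each Length group has one clearly dominant genre.
sample = pd.DataFrame({
    "Length": ["Short", "Short", "Short", "Average", "Average", "Average",
               "Long", "Long", "Long"],
    "Genres": ["Action", "Action", "Sports", "Action", "Action", "Strategy",
               "Role-Playing", "Role-Playing", "Action"],
})

# For each Length, count the genres and take the label with the highest count.
top_genre = sample.groupby("Length")["Genres"].agg(
    lambda genres: genres.value_counts().idxmax()
)
print(top_genre)
```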

In [19]:
NEW_COLUMN = "Length"

chart = alt.Chart(df_task_3a).mark_bar().encode(
    x="Genres:N",
    y="count():Q",
    color=f"{NEW_COLUMN}:N",
    column=f"{NEW_COLUMN}:N",
    tooltip="count()"
)
chart
PR.PersistChart(chart)

PersistWidget(data_values=[{'__id_column': '1', 'Length': 'Short', 'Title': 'Wii Fit', 'Max Players': 1, 'Mult…

**Task 3b Notes:**

- Top genre for `Short` games: Action (385)
- Top genre for `Average` games: Action (26)
- Top genre for `Long` games: Role Playing RPG (9)