# Analyzing Video Games Across Genres

**Participant ID:**  
**Date / Time:**

# Introduction
Welcome to our data analysis study. For this study, you'll be working with a dataset sourced from the [Corgis Datasets Project](https://corgis-edu.github.io/corgis/).

The data was originally published in [“What makes a blockbuster video game? An empirical analysis of US sales data.” Managerial and Decision Economics](https://researchportal.port.ac.uk/en/publications/what-makes-a-blockbuster-video-game-an-empirical-analysis-of-us-s) by Dr Joe Cox.

The dataset has information about the sales and playtime of over a thousand video games released between 2004 and 2010. The playtime information was collected from crowd-sourced data on ["How Long to Beat"](https://howlongtobeat.com/).

- The PersIst extension is already installed and enabled in this notebook.
- To familiarize yourself with its functionalities, please refer to the provided [tutorial notebook](../tutorial.ipynb).
- Interactive charts and tables have been pre-created for your convenience. These can be directly utilized by running the corresponding cells.
- Focus on leveraging the interactive capabilities of Persist for your analysis.

## Tasks Overview

In this study, you are presented with three fundamental data analysis tasks. Each task is designed to test different aspects of data analysis and manipulation.

- Carefully follow the step-by-step instructions provided for each task.
- As you work through the tasks, take note of any interesting findings or challenges you encounter.
- Feel free to add new code and markdown cells in the notebook as necessary to complete the tasks.
- Document your findings and any challenges faced during the analysis in markdown cells. This can include observations about the data, any issues encountered, and your overall experience with the task/method.

**Support**
- If you require assistance or need further clarification on any of the tasks, please let us know.
- If you find yourself stuck on a task and feel that you will not make any progress, you have the option to skip the task.
- For tasks that build upon the outputs of previous tasks, skipping a task might affect your ability to proceed. If you choose to skip a task, we can assist you by providing the necessary dataset or outputs required for the consecutive tasks.

In [None]:
import helpers as h
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt

import persist_ext as PR

## Data Description

The table below describes the different columns in the dataset. Each row in the dataset represents a video game.

| Column        | Description                                                                                                                                                           |
|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Title         | Full title of the game.                                                                                                                                               |
| Handheld      | Whether this is a hand-held game.                                                                                                                                     |
| Max Players   | The maximum number of players that can play this game.                                                                                                                |
| Multiplatform | Whether this game is available on multiple platforms.                                                                                                                 |
| Online        | Whether this game supports online play.                                                                                                                               |
| Genres        | The main genre that this game belongs to.                                                                                                                             |
| Licensed      | Whether this game was based off a previously licensed entity.                                                                                                         |
| Publishers    | The publishers who created this game.                                                                                                                                 |
| Sequel        | Whether this game is a sequel to another game.                                                                                                                        |
| Review Score  | A review score for this game, out of 100.                                                                                                                             |
| Sales         | The total sales made on this game, measured in millions of dollars.                                                                                                   |
| Used Price    | A typical "used" price for this game (i.e. previously returned and sold), measured in dollars.                                                                        |
| Console       | The name of the console that this particular game was released for. Note that the dataset contains multiple copies of the same game, released for different consoles. |
| Rating        | The ESRB rating for this game, either E (for Everyone), T (for Teen), or M (for Mature).                                                                              |
| Re-release    | Whether this game is a re-release of an earlier one.                                                                                                                  |
| Year          | The year that this game was released.                                                                                                                                 |
| CT_All        | The median time that players reported completing the game in any way, in hours. This is the median over all the other categories.                                     |
| CT_Comp       | The median time that players reported completing everything in the game, in hours.                                                                                    |
| CT_MainExtra  | The median time that players reported completing the main game and major extra parts of the game, in hours.                                                           |
| CT_MainOnly   | The median time that players reported completing the main game storyline, in hours.                                                                                   |

In [None]:
df = pd.read_csv('video_games.csv')
df.head()

# Task 1: Column Names and Data Types

In the first task we will perform some basic data cleaning operations to get our dataset ready for further tasks.

### **Task 1a: Remove Columns**

#### **Objective**
Remove certain columns to streamline the dataset for further analysis.
- **_Re-release?:_** Boolean flag indicating if the game was a new release or a re-release.
- **_CT_All:_** Median over all the other completion times; we will use one of the others directly.

#### **Instructions**
1. **Column Removal:**
	- Use the interactive table feature in PersIst to remove the specified columns.
2. **Generate dataframe:**
	- Assign the modified dataframe to variable `df_task_1a`
3. **Show Output:**
	- Print the head of `df_task_1a` to show the changes.
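
As an optional cross-check of the interactive result, the column removal can be sketched in plain pandas. This is a minimal sketch on a toy frame: the rows are invented for illustration, and the column name `Re-release?` is assumed to match the raw CSV header.

```python
import pandas as pd

# Toy stand-in for the study dataset; values are invented for illustration.
toy = pd.DataFrame({
    "Title": ["Game A", "Game B"],
    "Re-release?": [False, True],   # assumed raw column name
    "CT_All": [10.0, 25.5],
    "CT_MainOnly": [8.0, 20.0],
})

# Drop the two columns named in the objective.
expected_1a = toy.drop(columns=["Re-release?", "CT_All"])
print(expected_1a.columns.tolist())
```

The column list of `df_task_1a` should differ from that of `df` in the same way.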

In [None]:
PR.PersistTable(df)

### **Task 1b: Fix Column Names**

#### **Objective**
For this subtask, we will focus on fixing column names to ensure consistency and clarity. We'll start by identifying the issues with the column names, specifically targeting those with a `?` suffix that needs removal.

#### **Instructions**
1. **Rename Columns:**
    - Use the interactive table in Persist to correct the column names by removing the trailing `?` from their names:
        - _Handheld?_ → _Handheld_
        - _Licensed?_ → _Licensed_
        - _Multiplatform?_ → _Multiplatform_
        - _Online?_ → _Online_
2. **Generate dataframe:**
    - Assign the revised dataframe to the variable `df_task_1b`.
3. **Show Output:**
    - Display the head of `df_task_1b` to verify the changes.
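
For comparison with the interactive renaming, here is a minimal pandas sketch of the same mapping. The toy frame and its values are hypothetical; only the four column names come from the instructions above.

```python
import pandas as pd

# Toy frame with the four '?'-suffixed columns; values are invented.
toy = pd.DataFrame({
    "Title": ["Game A"],
    "Handheld?": [True],
    "Licensed?": [False],
    "Multiplatform?": [True],
    "Online?": [False],
})

# Explicit old-name -> new-name mapping, mirroring the list in the instructions.
rename_map = {
    "Handheld?": "Handheld",
    "Licensed?": "Licensed",
    "Multiplatform?": "Multiplatform",
    "Online?": "Online",
}
expected_1b = toy.rename(columns=rename_map)
print(expected_1b.columns.tolist())
```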

In [None]:
PR.PersistTable(df_task_1a)

## **Task 1c: Correcting Data Type of 'Max Players'**

#### **Objective**
In this task, we will address a data type issue in the `Max Players` column of our dataframe. We want to convert the data type of the column to `category`. However, the dataset has duplicate category values due to typos.

For example, some rows in `Max Players` have the value `1` and others have `1P`, which mean the same thing.

Remove any trailing `P`s from the `Max Players` column.

In [None]:
df_task_1b.dtypes

#### **Instructions**
1. **Convert `Max Players` column to category:**
    - Use the column header to switch the data type of the `Max Players` column.
2. **`Edit Categories` pop-up:**
    - Inspect the `Edit Categories` pop-up in the toolbar. Note the incorrect values (e.g. `1P`).
3. **Edit and Correct Entries:**
    - Search for the cells having an incorrect option using the search box.
    - Edit the cells to remove the trailing `P` (e.g. `1P` to `1`)
4. **Generate Dataframe:**
    - Assign the modified dataframe to a variable `df_task_1c`.
5. **Show Output:**
    - Display the dtypes of `df_task_1c` to verify the data type correction.
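
The intended end state can also be sketched in pandas to sanity-check the interactive edits. This sketch assumes the raw `Max Players` values are strings such as `1` and `1P`; the toy values are invented.

```python
import pandas as pd

# Toy column mixing clean values and values with a trailing 'P'.
toy = pd.DataFrame({"Max Players": ["1", "1P", "2", "4P", "2"]})

# Strip trailing 'P' characters, then convert to a categorical dtype.
cleaned = (
    toy["Max Players"]
    .astype(str)
    .str.rstrip("P")
    .astype("category")
)
print(cleaned.dtype)
print(list(cleaned.cat.categories))
```

After the correction, no category in `df_task_1c` should end in `P`.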

In [None]:
PR.PersistTable(df_task_1b)

# Task 2: Filtering data

In Task 2, we further improve our data by removing outliers and removing certain records to have more consistent data. 

We will also take a brief look at the relationship between the genre of a game (`Genres`) and its publisher (`Publishers`).

## **Task 2a: Remove Outliers**

#### **Objective**
In this task, we address data accuracy by filtering out anomalies in the completion time for the main story of each game.

We observe some records with negative values for completion time, which is obviously incorrect data.

Remove records with negative completion time.

#### **Instructions**
1. **Identify and Remove Anomalies:**
    - Interactively select the data points with a negative value for `CT_MainOnly`.
    - Use Persist's interactive features to remove these anomalous records.
2. **Generate Dataframe:**
    - Assign the cleaned dataframe to a variable `df_task_2a`.
3. **Show Output:**
    - Display the head of `df_task_2a`.
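
A rough pandas equivalent of the outlier removal, useful for checking the row count after the interactive step. The toy rows are invented, and the sketch assumes that a completion time of exactly zero is kept.

```python
import pandas as pd

# Toy rows; the middle record has an impossible negative completion time.
toy = pd.DataFrame({
    "Title": ["Game A", "Game B", "Game C"],
    "CT_MainOnly": [8.0, -1.0, 30.0],
    "Review Score": [75, 60, 88],
})

# Keep only records whose main-story completion time is not negative.
kept = toy[toy["CT_MainOnly"] >= 0].reset_index(drop=True)
print(len(toy) - len(kept), "record(s) removed")
```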

In [None]:
PR.plot.scatterplot(df_task_1c, "CT_MainOnly:Q", "Review Score:Q")

## **Task 2b: Filtering Out Old Data**

The interactive bar chart below shows the data aggregated by year. There are noticeably fewer records for `2004` and `2005`.

During this subtask we will remove these older records, keeping only records from 2006 onward.

#### **Instructions**
1. **Create and Analyze Bar Chart:**
    - Looking at an interactive bar chart in Persist showing the number of video games released each year, identify the bars showing data we want.
2. **Interactive Year Selection:**
    - Use a brush to interactively select and remove appropriate records.
3. **Generate Dataframe:**
    - Assign the refined dataframe to a variable `df_task_2b`. 
4. **Show Output:**
    - Display the head of `df_task_2b` to verify the removal of earlier years.
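
The year filter can be sketched in pandas as a cross-check. This sketch assumes `Year` is stored as an integer and that dropping `2004` and `2005` means keeping 2006 and later; the toy rows are invented.

```python
import pandas as pd

# Toy rows spanning the years discussed above.
toy = pd.DataFrame({
    "Title": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2004, 2005, 2006, 2008],
})

# Drop the sparse early years by keeping 2006 and later.
recent = toy[toy["Year"] >= 2006].reset_index(drop=True)
print(recent["Year"].tolist())
```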

In [None]:
PR.plot.barchart(df_task_2a, "Year:O", "count()", selection_type="interval")

## **Task 2c: Identifying top `Publishers` for `Genres` _'Racing'_ and _'RPG'_**

#### **Instructions**
1. **Linked Bar Charts:**
    - You will start with two linked interactive bar charts: one for `Genres` and another for `Publishers`.
    - Both bar charts show `count` for their respective category.
    - You can click on a bar in the `Genres` bar chart and the `Publishers` bar chart dynamically updates to show only games corresponding to the selected genre.
2. **Interactive Selection:**
    - Interactively select genres and use the updated `Publishers` bar chart.
3. **Identify the top publishers:**
    - Analyze the filtered `Publishers` bar chart to determine the top publisher for the selected genre, and note both the name of the publisher and the number of games published in a markdown cell. If the top publisher is `Unknown`, note the next highest.
4. **Generate Dataframe and Output:**
    - You will not generate any dataframe for this task. NOTE: Please save the notebook after you are finished with the interactions.
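
To verify the numbers read off the chart, the same count can be sketched in pandas. The helper `top_publisher` is hypothetical, the toy publisher names are placeholders, and the sketch assumes each `Genres` cell holds exactly one genre label.

```python
import pandas as pd

# Toy rows with placeholder publisher names.
toy = pd.DataFrame({
    "Genres": ["Racing", "Racing", "Racing", "Racing", "Racing", "RPG", "RPG"],
    "Publishers": ["Unknown", "Unknown", "Unknown", "Publisher X",
                   "Publisher X", "Publisher Y", "Unknown"],
})

def top_publisher(frame, genre):
    """Return (publisher, count) for the genre, skipping 'Unknown'."""
    subset = frame[(frame["Genres"] == genre) & (frame["Publishers"] != "Unknown")]
    counts = subset["Publishers"].value_counts()  # sorted by count, descending
    return counts.index[0], int(counts.iloc[0])

racing_top = top_publisher(toy, "Racing")
rpg_top = top_publisher(toy, "RPG")
print(racing_top, rpg_top)
```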

In [None]:
pts = alt.selection_point(name="selector", encodings=['x'])

base = alt.Chart(df_task_2b).encode(y="count()")


publishers = base.mark_bar().encode(
    x="Publishers:N",
    color="Publishers:N",
    tooltip="count()"
).transform_filter(pts)

genre = base.mark_bar().encode(
    x="Genres:N",
    color=alt.condition(pts, "Genres:N", alt.value("#ddd")),
    tooltip="count()"
).add_params(pts)

chart = alt.hconcat(
 genre , publishers
).resolve_scale(
    color="independent",
)

PR.PersistChart(chart, data=df_task_2b)

**Task 2c Notes:**

## Task 3: Data Wrangling

### **Task 3a: Creating and assigning `'Length'` category**

#### **Objective**

In this subtask, we'll introduce a new categorical variable named `Length` into our dataset. We already have `CT_MainOnly`, but it would be useful to have the games grouped into `Short`, `Average` and `Long` categories.

We will create a new category `Length` in the dataset and assign each record to `Short`, `Average` and `Long` based on its `CT_MainOnly` value.

#### **Instructions**
1. **Visualization**
    - We will work with an interactive scatterplot in Persist showing the `CT_MainOnly` and `Review Score`.
2. **Define Length Categories:**
    - You will first create a new category called `Length` using the `Edit Categories` button in the header.
    - In the same menu you will add three options for this category: `Short`, `Average` and `Long`.
3. **Interactive Assignment:**
    - Use Persist's interactive features to select games and assign them to one of the `Length` values (`Short`, `Average`, `Long`).
    - You should use the following ranges for assigning proper categories:
        - `Short`: 0 - 20 hours
        - `Average`: 21 - 40 hours
        - `Long`: more than 40 hours
4. **Generate Dataframe:**
    - Assign the updated dataset to a new variable: `df_task_3a`.
5. **Show Output:**
    - Print the head of the dataframe.
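
The binning described above can be sketched with `pd.cut` to compare against the interactive assignment. The toy values are invented, and the sketch makes one assumption the ranges leave open: fractional values between 20 and 21 hours are placed in `Average`.

```python
import pandas as pd

# Toy completion times, including the boundary values 20 and 40.
toy = pd.DataFrame({"CT_MainOnly": [0.0, 5.5, 20.0, 20.5, 40.0, 62.0]})

# Bins are closed on the right: [0, 20], (20, 40], (40, inf).
toy["Length"] = pd.cut(
    toy["CT_MainOnly"],
    bins=[0, 20, 40, float("inf")],
    labels=["Short", "Average", "Long"],
    include_lowest=True,  # so that exactly 0 hours falls into 'Short'
)
print(toy)
```

Comparing `value_counts()` of this column with the `Length` column in `df_task_3a` is a quick way to spot mis-assigned points near the boundaries.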

In [None]:
PR.plot.scatterplot(df_task_2b, "CT_MainOnly:Q", "Review Score:Q")

### **Task 3b: Finding Top Genre for each `Length`**

#### **Objective**
In this subtask, we'll analyze which genre is most prevalent among games of each length (`Short`, `Average`, `Long`) using the `Length` category created in Task 3a.

#### **Instructions**
1. **Visualization:**
    - We have two linked interactive bar charts: one for `Length` and another for `Genres`.
    - You can select a length to highlight using the **legend** for `Length` bar chart. The `Genres` bar chart will dynamically update in response to your selections.
2. **Analyze Genres Data:**
    - Observe the filtered `Genres` bar chart to identify the top genre for the selected length.
    - You can hover on the bars to get the exact frequency.
3. **Document Findings:**
    - Note down the most common genre for each length based on your interactive analysis in a new markdown cell.
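
The counts read from the chart can be double-checked with a short grouped count. This is a minimal sketch on invented rows, assuming `Length` and `Genres` are plain string columns; ties are not handled.

```python
import pandas as pd

# Toy rows; genre labels are placeholders.
toy = pd.DataFrame({
    "Length": ["Short", "Short", "Short", "Long", "Long", "Average"],
    "Genres": ["Action", "Action", "Sports", "RPG", "RPG", "Action"],
})

# Count games per (Length, Genres) pair.
counts = toy.groupby(["Length", "Genres"]).size().reset_index(name="n")

# For each length, keep the genre with the highest count.
top_genre = (
    counts.sort_values("n", ascending=False)
    .drop_duplicates("Length")
    .set_index("Length")["Genres"]
)
print(top_genre.to_dict())
```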

In [None]:
select = alt.selection_point(name="s", fields=["Length"], bind="legend")

base = alt.Chart(df_task_3a).mark_bar()

length = base.encode(
    x=alt.X("Length:N").sort(["Short", "Average", "Long"]),
    y="count()",
    color=alt.condition(select, alt.Color("Length:N").sort(["Short", "Average", "Long"]), alt.value("gray"))
).add_params(select).properties(width=300)

genres = base.encode(
    x="Genres:N",
    y="count()",
    color="Genres:N",
    tooltip="count()"
).transform_filter(select)

chart = length | genres

chart = chart.resolve_scale(
    color="independent"
)

PR.PersistChart(chart, data=df_task_3a)

**Task 3b Notes:**
