# Final Project

First Message

## A Brief Recap of My Data

For this final project, I use the "Video Game Sales" dataset from Kaggle: https://www.kaggle.com/datasets/gregorut/videogamesales

This dataset lists games that have sold more than 100,000 copies globally. The fields include title, platform, year of release, genre, publisher, and sales, which are broken down by region: North America, Europe, Japan, and others. Video games transcend national borders and are loved around the world, so I thought it would be interesting to analyze their sales and people's preferences by region.

## Goals and Tasks

### Goals
I set three goals:
1. Compare sales in Japan and globally by genre to identify genres that are well received in the Japanese market and, conversely, genres that are well received globally but not in Japan.
2. Explore how sales by genre change over time in Japan and globally.
3. Let users explore and filter the best-selling titles within each genre to support detailed analysis.
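
As a minimal sketch of the comparison behind goal 1, Japan's share of global sales can be computed per genre. The rows below are toy values for illustration, not figures from the dataset, and the `JP_Share` column name is my own assumption.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from vgsales.csv.
toy = pd.DataFrame({
    "Genre": ["Role-Playing", "Role-Playing", "Shooter", "Racing"],
    "JP_Sales": [3.0, 1.0, 0.5, 0.5],
    "Global_Sales": [8.0, 2.0, 10.0, 5.0],
})

# Sum sales per genre, then take Japan's share of the global total.
by_genre = toy.groupby("Genre", as_index=False)[["JP_Sales", "Global_Sales"]].sum()
by_genre["JP_Share"] = by_genre["JP_Sales"] / by_genre["Global_Sales"]

# Genres with the highest Japanese share come first.
by_genre = by_genre.sort_values("JP_Share", ascending=False).reset_index(drop=True)
print(by_genre)
```

A genre near the top of this table is relatively stronger in Japan; one near the bottom sells mostly outside Japan.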

### Tasks
To address these goals, I define the tasks as follows.

- Why is a task pursued?
    - To explore the characteristics of game genres that sell well in Japan and globally, and to identify trends.
- How is a task conducted?
    - Examine the relationship between sales in Japan and globally at a high level, then use interactive features to inspect the time series trends.
- What does a task seek to learn about the data?
    - High-level characteristics such as sales by genre and trends over time.
- Where does the task operate?
    - Between sales in Japan and global sales (a relative reference frame, in the terminology of this course's video).
- When is the task performed? (Workflow)
    - I anticipate that tasks will be repeated using the interactive features.
- Who is executing the task?
    - Stakeholders in the gaming industry, analysts, or people interested in this field.

## Key Elements of Visualization

I created three visualizations, one for each of the three goals. The key points of each visualization are as follows:

- Scatter Plot
    - I plotted sales in Japan against global sales by genre to show which genres have a higher share of sales in Japan (or globally). Genres above the regression line sell more in Japan than their global sales would suggest, while genres below the line sell less. For example, role-playing games sell well in Japan, whereas shooter and racing games sell comparatively little there.
- Area Chart
    - This area chart is linked to the interactive selection in the scatter plot, so users can view the time series trend of the selected genre. Sales in Japan and global sales are shown in different colors.
- Bar Chart
    - This bar chart lets users find the top 5 best-selling titles for a chosen genre and year, in Japan and globally. While the scatter plot and area chart provide high-level insights, this chart gives detailed information at the level of individual game titles.

Together, these three visualizations are designed to convey sales by video game genre in Japan and globally, letting users see both the high-level picture and the detailed data.
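
To make the "above or below the regression line" reading concrete, here is a small sketch with hypothetical genre totals (the labels and numbers are invented for illustration). It fits a straight line with `numpy.polyfit`, which I assume is comparable to the linear fit drawn in the scatter plot, and checks the sign of each residual.

```python
import numpy as np

# Hypothetical genre totals (millions of units), for illustration only.
genres = ["Genre A", "Genre B", "Genre C", "Genre D"]
global_sales = np.array([100.0, 200.0, 300.0, 400.0])
jp_sales = np.array([20.0, 15.0, 40.0, 35.0])

# Least-squares straight line: jp = slope * global + intercept
slope, intercept = np.polyfit(global_sales, jp_sales, 1)

# Residual > 0 means the point lies above the fitted line,
# i.e. the genre sells more in Japan than the line predicts.
residuals = jp_sales - (slope * global_sales + intercept)
above_line = [g for g, r in zip(genres, residuals) if r > 0]
print(above_line)
```

In this toy example, the genres returned in `above_line` are the ones that would be read as relatively strong in Japan.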

### Code and Visualization

In [3]:
import pandas as pd
import altair as alt

# Suppress all warnings (including FutureWarning) to keep the output readable
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("vgsales.csv")

# Preprocessing
# Calculate global sales by adding up sales in each region
df["Global_Sales"] = df["NA_Sales"] + df["EU_Sales"] + df["JP_Sales"] + df["Other_Sales"]

# Aggregate data by genre. This will be used for drawing scatter plot
genre_df = df.groupby("Genre", as_index=False).agg({"NA_Sales": "sum", "EU_Sales": "sum", "JP_Sales":"sum", "Other_Sales": "sum", "Global_Sales": "sum", "Name": "size"})
genre_df = genre_df.rename(columns={"Name": "Num_of_Game_Titles"})

# Aggregate data by genre and year. This will be used for time series analysis
year_genre_sales = df.groupby(['Year', 'Genre'], as_index=False).agg({'JP_Sales': 'sum', 'Global_Sales': 'sum'})


# Implementing selection
selection = alt.selection_multi(fields=["Genre"])

# Create a scatter plot
scatter_plot = alt.Chart(genre_df).mark_circle().encode(
    x="Global_Sales", 
    y="JP_Sales", 
    color="Genre",
    size="Num_of_Game_Titles",
    tooltip=["Genre", "Num_of_Game_Titles"],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_selection(selection)

# Add regression line 
regression_line = scatter_plot.transform_regression(
    "Global_Sales", "JP_Sales"
).mark_line(color="red")

scatter_plot = alt.layer(
    scatter_plot,
    regression_line
).properties(
    title='Global Sales vs JP Sales'
)


# Reshape year_genre_sales from wide to long format so that the sales type (JP vs. global) can be encoded as color
long_df = year_genre_sales.melt(id_vars=['Year', 'Genre'], 
                                value_vars=['JP_Sales', 'Global_Sales'], 
                                var_name='Sales_Type', 
                                value_name='Sales')

# Create an area chart
area_chart = alt.Chart(long_df).mark_area(opacity=0.5).encode(
    x='Year:O',
    y='Sales:Q',
    color='Sales_Type:N',
    tooltip=['Year', 'Genre', 'Sales_Type', 'Sales']
).transform_filter(
    selection
).add_selection(
    selection
).properties(
    title='Time Series Analysis'
)

# Displaying two interactive graphs
scatter_plot | area_chart
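
The reshaping step above may be easier to follow on a tiny example. The sketch below applies the same `melt` call to two made-up rows (the values are placeholders, not real sales) to show what the long format looks like.

```python
import pandas as pd

# Two made-up rows in the same wide layout as year_genre_sales.
wide = pd.DataFrame({
    "Year": [2000, 2001],
    "Genre": ["Action", "Action"],
    "JP_Sales": [1.0, 2.0],
    "Global_Sales": [5.0, 8.0],
})

# Each (Year, Genre) row becomes two rows: one per sales type.
long = wide.melt(
    id_vars=["Year", "Genre"],
    value_vars=["JP_Sales", "Global_Sales"],
    var_name="Sales_Type",
    value_name="Sales",
)
print(long)
```

With one row per sales type, a single `color='Sales_Type:N'` encoding is enough to draw both series in the area chart.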

In [4]:
# Retrieve the top 5 titles with the highest sales in Japan by genre and year
# (.copy() so that adding the 'Type' column below does not modify a view of df)
jp_top5 = df.sort_values(['Year', 'Genre', 'JP_Sales'], ascending=[True, True, False]).groupby(['Year', 'Genre']).head(5).copy()

# Retrieve the top 5 titles with the highest sales globally by genre and year
global_top5 = df.sort_values(['Year', 'Genre', 'Global_Sales'], ascending=[True, True, False]).groupby(['Year', 'Genre']).head(5).copy()

# Merge
jp_top5['Type'] = 'JP'
global_top5['Type'] = 'Global'
top5_df = pd.concat([jp_top5, global_top5])
top5_df = top5_df[['Year', 'Genre', 'Name', 'JP_Sales', 'Global_Sales', 'Type']]

# Implement a filter based on genre and year
year_genre_selection = alt.selection(
    type='single', 
    fields=['Year', 'Genre'], 
    bind={'Year': alt.binding_select(options=sorted(top5_df['Year'].unique())),
          'Genre': alt.binding_select(options=sorted(top5_df['Genre'].unique()))},
    name='Select'
)

# Visualize the top 5 sales in Japan (horizontal orientation)
jp_chart = alt.Chart(top5_df[top5_df['Type'] == 'JP']).mark_bar().encode(
    y=alt.Y('Name:N', sort='-x', title='Game Title'),
    x=alt.X('JP_Sales:Q', title='JP Sales'),
    color='Genre:N',
    tooltip=['Year', 'Genre', 'Name', 'JP_Sales']
).transform_filter(
    year_genre_selection
).properties(
    width=300,
    height=400,
    title='Top 5 JP Sales by Year and Genre'
)

# Visualize the top 5 sales globally (horizontal orientation)
global_chart = alt.Chart(top5_df[top5_df['Type'] == 'Global']).mark_bar().encode(
    y=alt.Y('Name:N', sort='-x', title='Game Title'),
    x=alt.X('Global_Sales:Q', title='Global Sales'),
    color='Genre:N',
    tooltip=['Year', 'Genre', 'Name', 'Global_Sales']
).transform_filter(
    year_genre_selection
).properties(
    width=300,
    height=400,
    title='Top 5 Global Sales by Year and Genre'
)

# Combine two charts
combined_chart = alt.hconcat(
    jp_chart,
    global_chart
).add_selection(
    year_genre_selection
).resolve_scale(
    x='independent'
).properties(
    title='Top 5 Game Titles by Sales in JP and Global Markets'
)

combined_chart.display()
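
As a sanity check on the "top N per group" step, the same sort-then-`head` pattern can be tried on a toy frame. The titles and sales values below are invented, and I use the top 2 instead of the top 5 to keep the example short.

```python
import pandas as pd

# Invented titles and sales, for illustration only.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2000, 2001, 2001],
    "Genre": ["Action"] * 6,
    "Name": ["a", "b", "c", "d", "e", "f"],
    "JP_Sales": [0.1, 0.4, 0.3, 0.2, 0.5, 0.6],
})

# Sort so the best sellers come first within each (Year, Genre) group,
# then keep the first 2 rows of every group.
top2 = (
    toy.sort_values(["Year", "Genre", "JP_Sales"], ascending=[True, True, False])
       .groupby(["Year", "Genre"])
       .head(2)
       .copy()  # copy so that adding a column does not touch the original frame
)
top2["Type"] = "JP"
print(top2)
```

Because `head` keeps the sorted order, the result is already ranked within each year and genre.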

## Evaluation Approach

I adopted a journaling study to evaluate the effectiveness of the visualizations I created.

### Procedure
The journaling study will be conducted in two parts.
In the first part, participants will receive minimal instructions and immediately begin using the tool.
During this process, I will record the following:

1. The sequence of operations performed
2. The insights gained from these operations
3. Any additional information the participants wished to know

This will allow me to evaluate how users derive insights and whether the tool meets their expectations during these processes.
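
One possible way to keep these three kinds of notes in a consistent shape is a small record per participant. This is only a sketch: the class name, field names, and example strings are all hypothetical placeholders, not part of the actual study.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JournalEntry:
    """Notes for one participant session; every field name is hypothetical."""
    participant_id: str
    operations: List[str] = field(default_factory=list)      # 1. sequence of operations
    insights: List[str] = field(default_factory=list)        # 2. insights gained
    open_questions: List[str] = field(default_factory=list)  # 3. information they wished for


# Placeholder content, not real study data.
entry = JournalEntry(participant_id="P1")
entry.operations.append("clicked a genre point in the scatter plot")
entry.insights.append("placeholder insight")
entry.open_questions.append("placeholder question")
print(entry)
```

Keeping the three lists separate mirrors the three items recorded above and makes it easy to compare sessions afterwards.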

In the second part, I will ask users questions to determine whether they can obtain accurate information using the tool. In this report, I set goals to compare sales in Japan and globally by genre and to enable in-depth analysis, so the purpose of this part is to confirm whether the tool actually fulfills that role.

To be added later.

### People Recruited and Results
To be added later.

## Summary