In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(lubridate)  # provides dmy_hm(); only attached automatically by tidyverse >= 2.0.0

In [None]:
players <- read_csv("data/players.csv")
players

sessions <- read_csv("data/sessions.csv")
sessions

In [None]:
sessions <- sessions |>
    mutate(start_time = dmy_hm(start_time),
           end_time = dmy_hm(end_time))

sessions

In [None]:
sessions_playtime <- sessions |>
    # units = "mins" fixes the unit; by default difftime() chooses one automatically
    mutate(play_time = as.numeric(difftime(end_time, start_time, units = "mins")))

sessions_playtime

In [None]:
player_playtime <- sessions_playtime |>
    group_by(hashedEmail) |>
    summarize(total_minutes = sum(play_time, na.rm = TRUE)) |>
    arrange(desc(total_minutes))


top20_player_ids <- player_playtime |>
    slice(1:20) |>
    pull(hashedEmail)

top20_players <- players |>
    filter(hashedEmail %in% top20_player_ids)

top20_players

player_avg_age <- players |>
    summarise(mean_age = round(mean(Age, na.rm = TRUE), 2))

player_avg_played_hours <- players |>
    summarise(mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2))
player_avg_age
player_avg_played_hours 

To answer question 2, I extract the top 20 players by total play time and analyze their characteristics, such as experience level, age, and gender. By visualizing these patterns, I check whether certain types of players are more likely to become heavy contributors who produce large amounts of gameplay data.

**Data Description**

For this project, I am using two datasets collected from a Minecraft research server. Since my project focuses on identifying which kinds of players contribute the most gameplay data, the sessions dataset plays a central role. A single player may appear many times in session data, so I computed the total play time for each player by adding the duration of all their sessions. After creating this aggregated measure, I identified the top 20 players with the highest total play time. These players are used to check whether characteristics such as experience, age, or gender are associated with being a heavy contributor.

Below are the size of each dataset, summary statistics, and the key variables I use from each dataset.

**Number of Observations:**
* **players.csv:** 68 observations
* **sessions.csv:** 1535 observations

**Number of Variables:**
* **players.csv:** 7 variables
* **sessions.csv:** 6 variables

**Summary Statistics (Mean, 2 decimals)**
* **Age:** 21.14
* **played_hours:** 5.85

**From sessions.csv**
* hashedEmail (Character): Player identifier
* start_time (Datetime): When the session started, in DD/MM/YYYY HH:MM format
* end_time (Datetime): When the session ended, in DD/MM/YYYY HH:MM format
* play_time (Numeric): Length of the session in minutes (derived from start_time and end_time)

**From players.csv**
* experience (Character): Experience level (Beginner, Amateur, Regular, Pro, or Veteran)
* gender (Character): Gender
* Age (Numeric): Age in years
* played_hours (Numeric): Self-reported total hours played

**Data Collection**

* Session timestamps were automatically recorded by the Minecraft server.
* Player demographics were collected through an optional player survey.


**Potential Issues:**

* Some sessions are extremely long, which may indicate that players were away from the keyboard (AFK) rather than actively playing.
* Several variables in the players dataset, such as played_hours and experience level, are self-reported and may not be fully accurate.
* The overall sample size is small (68 players)
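
One possible way to screen for AFK sessions is to flag sessions above a duration cutoff. The sketch below uses made-up session lengths, and the 240-minute cutoff (`afk_cutoff_minutes`) is a hypothetical value chosen only for illustration, not a threshold used in this project.

```r
library(dplyr)

# Toy session durations in minutes (illustrative values, not the real data)
toy_sessions <- tibble(
    hashedEmail = c("a", "a", "b", "c"),
    play_time = c(30, 600, 45, 250)
)

# Hypothetical cutoff: treat anything over 4 hours as a possible AFK session
afk_cutoff_minutes <- 240

# Add a logical flag instead of dropping rows, so the decision stays visible
flagged_sessions <- toy_sessions |>
    mutate(possible_afk = play_time > afk_cutoff_minutes)

flagged_sessions
```

Flagging rather than filtering keeps the original sessions intact, so totals can be compared with and without the suspect sessions.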

**Questions**

**Broad Question:** 
* What characteristics of players are associated with contributing a large amount of gameplay data? 

To explore this, I follow with a specific research question:
* Do player characteristics, such as experience level, age, and gender, help explain which players accumulate the highest total play time on the Minecraft research server?

For this question:

**Response variable:**
* Total play time per player (minutes), calculated by adding all individual gameplay sessions.

**Explanatory variables:**
* Experience level (categorical)
* Age (numeric)
* Gender (categorical)

To address this question, I first aggregated the session data by player to compute each player's total number of minutes played. Then, I identified the top 20 players with the largest total play time and extracted their information from the players.csv file. These datasets allow me to investigate whether certain groups are more likely to be the heavy contributors.
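
The aggregate-then-look-up step can also be written as a join, which keeps each player's total minutes next to their characteristics in one table. This is a minimal sketch on made-up data; `toy_playtime`, `toy_players`, and `toy_top` are hypothetical names standing in for the real objects, and only two players are kept to keep the example short.

```r
library(dplyr)

# Toy stand-ins for the per-player totals and the players table (illustrative values)
toy_playtime <- tibble(
    hashedEmail = c("a", "b", "c"),
    total_minutes = c(500, 120, 60)
)
toy_players <- tibble(
    hashedEmail = c("a", "b", "c", "d"),
    experience = c("Regular", "Amateur", "Pro", "Beginner"),
    Age = c(17, 21, 25, 19)
)

# Keep the top 2 players by total minutes, then attach their characteristics
toy_top <- toy_playtime |>
    slice_max(total_minutes, n = 2) |>
    left_join(toy_players, by = "hashedEmail")

toy_top
```

Compared with filtering by `%in%`, the join carries `total_minutes` into the result, which is useful if the response variable is the session-based total rather than `played_hours`.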

**Exploratory Data Analysis and Visualization:**

* Before performing any modelling, I conducted exploratory data analysis to better understand the dataset and identify potential patterns related to my research question. I limited the wrangling to the minimum necessary: converting the timestamps to a datetime format so that each session's play time could be calculated in minutes.
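
As a small worked example of this wrangling step, the sketch below parses two made-up timestamps in DD/MM/YYYY HH:MM format and computes the session length in minutes. The timestamp values are invented for illustration.

```r
library(lubridate)

# Two example timestamps in DD/MM/YYYY HH:MM format (made-up values)
start_time <- dmy_hm("30/06/2024 18:12")
end_time <- dmy_hm("30/06/2024 18:42")

# Requesting units = "mins" keeps the result in minutes regardless of session length
play_time <- as.numeric(difftime(end_time, start_time, units = "mins"))

play_time
```

Stating the unit explicitly matters because `difftime()` otherwise selects a unit based on the size of the differences, which could silently turn minutes into hours or seconds.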

**Summary Statistics (players.csv)**

I computed basic summary statistics for the quantitative variables in the players dataset:
* **Age:** 21.14 (mean)
* **played_hours:** 5.85 (mean)

These summary values give a rough overview of the typical player in the dataset.

I created 4 visualizations to better understand the characteristics of the top 20 heavy contributors.

In [None]:

top_20_session_playtime <- player_playtime |>
    slice(1:20)
top_20_player_total_playtime <- top_20_session_playtime |>
    ggplot(aes(x = total_minutes, y = reorder(hashedEmail, total_minutes))) +
    geom_col() +  # equivalent to geom_bar(stat = "identity")
    labs(title = "Top 20 Players by Total Play Time", 
        x = "Total Minutes Played",
        y = "Player (hashedEmail)")
top_20_player_total_playtime

This bar graph shows the 20 players with the highest total minutes played.
It indicates that the top 4 players contribute most of the gameplay data.

In [None]:

experience_vs_played_hours <- top20_players |>
    ggplot(aes(x = experience, y = played_hours, fill = experience)) +
    geom_col() +  # stacks each player's hours, so bar height is the total per level
    labs(title = "Total Played Hours by Experience Level (Top 20 Players)", 
        x = "Experience Level",
        y = "Played Hours")
experience_vs_played_hours

This bar graph compares the played hours across different experience levels among the top 20 players. Regular players clearly contribute the most play time compared to Pro, Beginner, and Veteran players. This suggests that Regular and Amateur players are the main contributors of gameplay data.
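
Because a bar chart built from one row per player stacks each player's hours, bar heights reflect group totals rather than group averages. The sketch below, using made-up numbers and the hypothetical name `toy_top20`, shows how the two summaries can be computed side by side so that a level with many players is not mistaken for a level with highly active players.

```r
library(dplyr)

# Toy stand-in for the top-20 table (illustrative values only)
toy_top20 <- tibble(
    experience = c("Regular", "Regular", "Amateur", "Pro"),
    played_hours = c(100, 50, 30, 10)
)

# Total, mean, and player count per experience level
hours_by_experience <- toy_top20 |>
    group_by(experience) |>
    summarise(
        total_hours = sum(played_hours),
        mean_hours = mean(played_hours),
        n_players = n()
    )

hours_by_experience
```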

In [None]:
age_vs_played_hours <- top20_players |>
    ggplot(aes(x = Age, y = played_hours)) +
    geom_point() +
    labs(title = "Age vs Played Hours (Top 20 Players)", 
        x = "Age",
        y = "Played Hours")
age_vs_played_hours

This scatter plot compares age and played hours for the top 20 players. The highest play times mostly come from players between 15 and 20 years old, while older players tend to show lower total hours. This suggests that younger players are more active contributors.

In [None]:
gender_vs_played_hours <- top20_players |>
    ggplot(aes(x = gender, fill = gender)) +
    geom_bar() +
    labs(title = "Gender Distribution of Top 20 Players", 
        x = "Gender",
        y = "Count")
gender_vs_played_hours

This bar graph shows that most of the top 20 high-playtime players are male, with fewer female, agender, and non-binary players. Because the plot counts players rather than hours, it indicates that male players make up the largest share of heavy contributors in this dataset.

**Methods and Plan**

To address my research question of whether player characteristics such as age, gender, and experience level help explain which players contribute the most gameplay time, I plan to model how these factors jointly relate to a single numeric outcome, total play time. This approach is appropriate because it shows which factors matter most and how each one affects total play time when the others are taken into account.

**Assumptions:**
* Each player's data should be separate and independent
* The predictors should not be too similar to each other

**Limitations and Weaknesses**
* The dataset is small since I focus only on the top 20 players. A small dataset can make the model less reliable.
* Some information, such as experience level or self-reported hours, might not be perfectly accurate.

To understand which player characteristics relate to total play time, I will clean the data by adding up each player's session times and merging that with their information, ensuring that categories like gender and experience level are stored correctly. I will split the data into an 80% training set and a 20% test set so I can build the model on one part and fairly evaluate it on the other. Since the dataset is small, I will use 5-fold cross-validation to reduce overfitting and get a more reliable estimate of how well the model works.
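
The split and cross-validation steps could be sketched with tidymodels roughly as follows. Two things here are assumptions rather than part of the plan above: the data are randomly generated stand-ins (`toy`), and the model is a linear regression fitted with the `lm` engine, chosen only because the plan describes relating several predictors to one numeric outcome without naming a specific model.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split and folds are reproducible

# Synthetic stand-in data (randomly generated, not the real players)
toy <- tibble(
    total_minutes = runif(100, min = 10, max = 1000),
    Age = sample(15:40, 100, replace = TRUE),
    gender = factor(sample(c("Male", "Female"), 100, replace = TRUE)),
    experience = factor(sample(c("Amateur", "Regular", "Pro"), 100, replace = TRUE))
)

# 80% training / 20% test split
toy_split <- initial_split(toy, prop = 0.8)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# 5-fold cross-validation on the training set only
toy_folds <- vfold_cv(toy_train, v = 5)

# Assumed model: linear regression of total play time on the three characteristics
lm_workflow <- workflow() |>
    add_formula(total_minutes ~ Age + gender + experience) |>
    add_model(linear_reg() |> set_engine("lm"))

# Fit on each fold and average the performance metrics across folds
cv_metrics <- lm_workflow |>
    fit_resamples(resamples = toy_folds) |>
    collect_metrics()

cv_metrics
```

The test set is left untouched until a final model is chosen, so the cross-validated metrics serve as the working estimate of performance during model building.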