In [None]:
library(tidyverse)
library(lubridate)   # provides dmy_hm(); only attached by tidyverse from version 2.0.0 onward
library(repr)
library(tidymodels)

In [None]:
players <- read_csv("data/players.csv")
players

sessions <- read_csv("data/sessions.csv")
sessions

In [None]:
sessions <- sessions |>
    mutate(start_time = dmy_hm(start_time),
           end_time = dmy_hm(end_time))

sessions

In [None]:
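# Session length in minutes; units are set explicitly because difftime()
# otherwise chooses them automatically based on the size of the differences
sessions_playtime <- sessions |>
    mutate(play_time = as.numeric(difftime(end_time, start_time, units = "mins")))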

sessions_playtime

In [None]:
player_playtime <- sessions_playtime |>
    group_by(hashedEmail) |>
    summarize(total_minutes = sum(play_time, na.rm = TRUE)) |>
    arrange(desc(total_minutes))

top20_player_ids <- player_playtime |>
    slice(1:20) |>
    pull(hashedEmail)

top20_players <- players |>
    filter(hashedEmail %in% top20_player_ids)

top20_players

To answer Question 2, I extract the 20 players with the highest total play time and analyze their characteristics, such as experience level, age, and gender. By visualizing these patterns, I check whether certain types of players are more likely to become heavy contributors who produce large amounts of gameplay data.
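One way to visualize such a pattern is a bar chart of how many top players fall into each experience level. The sketch below is only illustrative: `top_players_demo` is a small made-up stand-in for `top20_players`, and the experience labels and values in it are assumptions, not values from the real data.

```r
library(tidyverse)

# Made-up stand-in for top20_players (labels and values are illustrative only)
top_players_demo <- tibble(
    experience = c("Amateur", "Regular", "Amateur", "Pro", "Regular", "Amateur"),
    gender = c("Male", "Female", "Male", "Male", "Non-binary", "Female"),
    Age = c(17, 21, 19, 25, 22, 18)
)

# Count how many heavy contributors fall into each experience level
experience_counts <- top_players_demo |>
    count(experience, name = "n_players")

# Bar chart of the counts per experience level
experience_plot <- experience_counts |>
    ggplot(aes(x = experience, y = n_players)) +
    geom_col() +
    labs(x = "Experience level", y = "Number of top players")

experience_plot
```

Swapping `experience` for `gender`, or replacing the bar chart with a histogram of `Age`, would give the corresponding views for the other characteristics.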

**Data Description**

For this project, I am using two datasets collected from a Minecraft research server. Since my project focuses on identifying which kinds of players contribute the most gameplay data, the sessions dataset plays a central role. A single player may appear many times in the sessions data, so I computed each player's total play time by summing the durations of all their sessions. After creating this aggregated measure, I identified the 20 players with the highest total play time. These players are used to check whether characteristics such as experience, age, or gender are associated with being a heavy contributor.

Below are the key variables I use from each dataset.

**Number of Observations:**
* The players.csv dataset contains 68 players, each identified by an anonymized hashed email.
* The sessions.csv dataset contains 1535 observations, each a detailed log of one gameplay session.

**From sessions.csv**
* hashedEmail (Character): Player identifier
* start_time (Datetime): When the session started, in DD/MM/YYYY HH:MM format
* end_time (Datetime): When the session ended, in DD/MM/YYYY HH:MM format
* play_time (Numeric): Length of the session in minutes (derived as end_time minus start_time)
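
As a small worked example of how `play_time` follows from the two timestamps, the sketch below parses a pair of made-up times in DD/MM/YYYY HH:MM format and takes their difference. The timestamps are illustrative and not taken from the data; the units are pinned to minutes because `difftime()` otherwise picks them automatically.

```r
library(lubridate)

# Illustrative timestamps in DD/MM/YYYY HH:MM format (not taken from the data)
start_demo <- dmy_hm("30/06/2024 18:12")
end_demo <- dmy_hm("30/06/2024 18:24")

# Fixing units = "mins" keeps the result in minutes regardless of session length
play_time_demo <- as.numeric(difftime(end_demo, start_demo, units = "mins"))
play_time_demo
```

Here the session runs from 18:12 to 18:24, so the computed length is 12 minutes.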

**From players.csv**
* experience (Character): Player's experience level
* gender (Character): Gender
* Age (Numeric): Age in years
* played_hours (Numeric): Self-reported total hours played

**Potential Issues:**

* Some sessions are extremely long, which may indicate that the player was away from the keyboard (AFK) rather than actively playing
* Several variables in the players dataset, such as played_hours and experience, are self-reported and may not be fully accurate
* The overall sample size is small (68 players)
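
The first issue could be examined by flagging unusually long sessions before aggregating. The sketch below is a rough illustration on made-up session lengths; the 240-minute cutoff (`afk_cutoff_minutes`) is a hypothetical threshold chosen for the example, not a value derived from the data.

```r
library(tidyverse)

# Made-up session lengths in minutes (illustrative only)
sessions_demo <- tibble(
    hashedEmail = c("a", "a", "b", "b", "c"),
    play_time = c(15, 600, 42, 30, 95)
)

# Hypothetical cutoff: treat anything over 4 hours as a possible AFK session
afk_cutoff_minutes <- 240

# Mark each session as a possible AFK session or not
flagged_sessions <- sessions_demo |>
    mutate(possible_afk = play_time > afk_cutoff_minutes)

# Show only the sessions that exceed the cutoff
flagged_sessions |>
    filter(possible_afk)
```

Comparing the top 20 ranking with and without the flagged sessions would show how sensitive the results are to such outliers.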