Data Science Project – Planning Stage (Individual)

Author: Seerat Waraich

(1) Data Description:

Dataset 1: players.csv
Number of observations: 196
Number of variables: 7
Each row represents one player.

Variables:
experience: categorical — indicates the player’s experience level (Amateur, Regular, Veteran, Pro).
subscribe: logical — TRUE/FALSE variable showing if the player subscribed to the newsletter.
hashedEmail: string — unique hashed identifier for each player.
played_hours: numeric — total number of hours each player has played.
name: string — player’s chosen username.
gender: categorical — self-reported gender of the player.
Age: numeric — age of the player in years.

Summary statistics (rounded to 2 decimal places):
Mean played_hours: 7.21 hours
Standard deviation of played_hours: 9.84 hours
Minimum played_hours: 0.00 hours
Maximum played_hours: 50.40 hours
Mean Age: 18.74 years
Standard deviation of Age: 5.21 years
Minimum Age: 8.00 years
Maximum Age: 35.00 years
Missing values: 2 missing Age entries
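
One way the summary statistics above could be computed is sketched below. The data frame players_toy and its values are made up for illustration and are not the real players.csv; only the column names played_hours and Age come from the dataset description.

```r
library(dplyr)

# Toy stand-in for players.csv (illustrative values, not the real data)
players_toy <- tibble(
  played_hours = c(0, 1.5, 3.0, 10.5),
  Age = c(17, 21, NA, 19)
)

# Mean, standard deviation, minimum, and maximum for each numeric column,
# rounded to 2 decimal places, plus a count of missing Age values
summary_stats <- players_toy |>
  summarise(
    mean_hours = round(mean(played_hours, na.rm = TRUE), 2),
    sd_hours = round(sd(played_hours, na.rm = TRUE), 2),
    min_hours = min(played_hours, na.rm = TRUE),
    max_hours = max(played_hours, na.rm = TRUE),
    mean_age = round(mean(Age, na.rm = TRUE), 2),
    sd_age = round(sd(Age, na.rm = TRUE), 2),
    min_age = min(Age, na.rm = TRUE),
    max_age = max(Age, na.rm = TRUE),
    n_missing_age = sum(is.na(Age))
  )

summary_stats
```

Setting na.rm = TRUE keeps the missing Age entries from turning every Age statistic into NA.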

Data quality issues:
Some players have 0 hours of playtime, which could mean they never logged in or data was not recorded correctly.
Two Age values are missing.
experience and gender are categorical and may need conversion to factor type for analysis.
The method used to assign experience level is not specified (self-reported or system-defined).

Potential unseen issues:
Possible inconsistencies in how played_hours was measured (manual entry vs. automated logging).
Shared or duplicate accounts could distort individual statistics.
Survey-based information (such as gender or experience) could include bias or reporting errors.

How data were collected:
Likely gathered through a combination of in-game activity tracking and player registration information.
Each record corresponds to a unique player profile recorded on the server.
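
Two of the checks mentioned for players.csv, converting the categorical columns to factors and looking for repeated identifiers, could be sketched as follows. The toy data and the ordering of experience_levels are assumptions; the ordering simply follows the order in which the levels are listed in the variable description.

```r
library(dplyr)

# Toy stand-in for players.csv (illustrative values, not the real data)
players_toy <- tibble(
  hashedEmail = c("a1", "b2", "c3", "a1"),
  experience = c("Pro", "Amateur", "Veteran", "Pro"),
  gender = c("Male", "Female", "Male", "Male")
)

# Assumed ordering of experience levels (not confirmed by the data source)
experience_levels <- c("Amateur", "Regular", "Veteran", "Pro")

# Convert the categorical columns from character to factor
players_toy <- players_toy |>
  mutate(
    experience = factor(experience, levels = experience_levels),
    gender = as.factor(gender)
  )

# List any hashedEmail that appears in more than one row
duplicate_ids <- players_toy |>
  count(hashedEmail, name = "n_rows") |>
  filter(n_rows > 1)

duplicate_ids
```

Any experience value that is not in experience_levels would become NA after the conversion, so counting NA values afterwards is a quick way to catch unexpected categories.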

Dataset 2: sessions.csv

Number of observations: 1,535
Number of variables: 5
Each row represents one gameplay session (a single login–logout event).

Variables:
hashedEmail: string — hashed identifier used to match sessions with players.
start_time: string — time when a session started (format: “DD/MM/YYYY HH:MM”).
end_time: string — time when a session ended (same format as start_time).
original_start_time: numeric — UNIX timestamp version of session start time.
original_end_time: numeric — UNIX timestamp version of session end time.

Summary statistics and structure:
Total sessions: 1,535
Missing values: 2 missing end times (end_time and original_end_time).
Each player can appear multiple times (one row per session).
Timestamps recorded between May and July 2024 (approximate range).

Data quality issues:
A few sessions have missing or incomplete end times, which will need cleaning.
Potential for negative or zero durations if times were logged incorrectly.
Time zone information is not provided.
Players with no session records will not appear in this dataset.
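
The last two points can be checked directly. The sketch below counts sessions with a missing end time and lists players who have no session records, using made-up toy tables (players_toy and sessions_toy are hypothetical names, not the real files).

```r
library(dplyr)

# Toy stand-ins for players.csv and sessions.csv (illustrative values only)
players_toy <- tibble(hashedEmail = c("a1", "b2", "c3"))
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  end_time = c("30/06/2024 18:24", NA, "17/06/2024 23:46")
)

# Number of sessions whose end time was never recorded
n_missing_end <- sessions_toy |>
  filter(is.na(end_time)) |>
  nrow()

# Players that never appear in the sessions table
players_without_sessions <- players_toy |>
  anti_join(sessions_toy, by = "hashedEmail")

n_missing_end
players_without_sessions
```

anti_join() keeps only the rows of the first table that have no match in the second, which is why it suits the "players with no session records" check.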

Potential unseen issues:
Sessions may include idle or AFK (away-from-keyboard) time, inflating playtime duration.
Server time drift or resets could slightly misalign timestamps.
Some sessions might represent reconnections after disconnections rather than new gameplay.

How data were collected:
Automatically recorded from the server’s login and logout logs.
start_time and end_time represent when a player connects and disconnects.
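original_start_time and original_end_time are automatically generated UNIX timestamps; their magnitude in the data preview (about 1.7 × 10^12) suggests milliseconds, not seconds, since the epoch. In the preview the start and end values within a session are identical, so these columns may be too coarsely rounded to compute session durations.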
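
As a minimal sketch, a UNIX timestamp stored in milliseconds can be converted to a datetime in base R as shown below. The values are the ones printed for the first session in the data preview; treating them as milliseconds and using UTC are both assumptions, since the source does not document the unit or the time zone.

```r
# Values as printed in the preview for the first session (assumed milliseconds)
original_start_ms <- 1719770000000
original_end_ms <- 1719770000000

# Divide by 1000 to get seconds since 1970-01-01, then convert (UTC assumed)
start_utc <- as.POSIXct(original_start_ms / 1000, origin = "1970-01-01", tz = "UTC")
end_utc <- as.POSIXct(original_end_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Duration in minutes based on the numeric columns alone
duration_mins <- as.numeric(difftime(end_utc, start_utc, units = "mins"))

start_utc
duration_mins
```

With these printed values the duration comes out as zero even though start_time and end_time differ by several minutes, which is why the string columns look like the safer basis for session length.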

Broad Question:
Which types of players contribute the most overall playtime on the Minecraft research server?

Specific Question:
Do more experienced players tend to spend more total time playing than less experienced ones?

Variables
Response variable (Y): total playtime (in minutes) - represents how long each player spends playing overall.
Explanatory variables (X): player experience level (Amateur, Regular, Veteran, Pro), and possibly demographic factors such as age and gender.

How the Data Will Address the Question:
The dataset includes both player demographic information (players.csv) and detailed session logs (sessions.csv). By combining these datasets using the shared hashedEmail identifier, total playtime for each player can be calculated by summing all session durations. Once the total playtime is computed, it can be compared across different experience levels to see whether higher-experience players tend to play longer overall. Additional variables such as Age and gender can be used to explore whether demographic factors also relate to total playtime.
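
The comparison across experience levels described above might look like the following sketch. The table tidy_toy and its numbers are invented for illustration; the column names experience and total_play_minutes match the planned tidy dataset.

```r
library(dplyr)

# Toy stand-in for the planned one-row-per-player dataset (illustrative values)
tidy_toy <- tibble(
  experience = c("Amateur", "Amateur", "Pro", "Pro", "Veteran"),
  total_play_minutes = c(10, 30, 120, 60, 0)
)

# Number of players and typical total playtime within each experience level
playtime_by_experience <- tidy_toy |>
  group_by(experience) |>
  summarise(
    n_players = n(),
    mean_minutes = mean(total_play_minutes),
    median_minutes = median(total_play_minutes),
    .groups = "drop"
  ) |>
  arrange(desc(mean_minutes))

playtime_by_experience
```

Reporting the median next to the mean is a deliberate choice here, because a few very heavy players could pull the mean far above what a typical player in that group does.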

Planned Data Wrangling Steps
Convert start_time and end_time in sessions.csv to proper datetime format.
Calculate session_length as the time difference between start and end times (in minutes).
Remove any sessions with missing or invalid durations.
Sum the total session lengths for each player to find their total playtime.
Merge the resulting totals with the player demographic data from players.csv.
The final tidy dataset will contain one row per player with their total playtime, experience level, and demographic information, ready for analysis.

In [4]:
# Data loading and initial wrangling
# Load the tidyverse (includes readr, dplyr, tidyr, and ggplot2)
library(tidyverse)

# Read both datasets using relative paths
sessions <- read_csv("sessions.csv")
players <- read_csv("players.csv")

# Inspect the data
head(players)
head(sessions)

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


In [5]:
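# Minimum wrangling to create a tidy dataset
# start_time and end_time are read as "DD/MM/YYYY HH:MM" strings, so parse them
# into datetimes first; the time zone is not documented, so UTC is assumed
sessions <- sessions |>
  mutate(
    start_time = as.POSIXct(start_time, format = "%d/%m/%Y %H:%M", tz = "UTC"),
    end_time = as.POSIXct(end_time, format = "%d/%m/%Y %H:%M", tz = "UTC"),
    # Session length in minutes
    session_length = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

# Remove missing, zero, or negative durations
sessions <- sessions |>
  filter(!is.na(session_length) & session_length > 0)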

# Compute total playtime per player
total_time <- sessions |>
  group_by(hashedEmail) |>
  summarize(total_play_minutes = sum(session_length, na.rm = TRUE))

# Merge with players data
tidy_data <- players |>
  left_join(total_time, by = "hashedEmail") |>
  mutate(total_play_minutes = replace_na(total_play_minutes, 0))
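
After the merge, a few checks can confirm that the result really has one row per player and no missing totals. The sketch below repeats the join on small made-up tables (players_toy and total_time_toy are hypothetical) so that it runs on its own.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for players and the per-player totals (illustrative values)
players_toy <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  experience = c("Pro", "Amateur", "Veteran")
)
total_time_toy <- tibble(
  hashedEmail = c("a1", "b2"),
  total_play_minutes = c(48, 13)
)

# Same join pattern as above: keep every player, fill absent totals with 0
tidy_toy <- players_toy |>
  left_join(total_time_toy, by = "hashedEmail") |>
  mutate(total_play_minutes = replace_na(total_play_minutes, 0))

# Checks: one row per player, no duplicated ids, no missing totals
stopifnot(
  nrow(tidy_toy) == nrow(players_toy),
  anyDuplicated(tidy_toy$hashedEmail) == 0,
  !anyNA(tidy_toy$total_play_minutes)
)

tidy_toy
```

If hashedEmail were duplicated in the totals table, the left join would add rows and the first check would fail, so this doubles as a guard against accidental row duplication.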
