# DSCI 100 – Data Science Project: Planning Stage

## Predicting which players contribute large amounts of data

This notebook is my individual planning report for the DSCI 100 project. I use the provided `players.csv` and `sessions.csv` files from the Minecraft research server to explore the data, describe the variables, state my question, and outline a plan for modelling.

I focus on **Question 2** from the project description:

> We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.


In [None]:
# Load required packages
library(tidyverse)
library(ggplot2)
library(lubridate)

# Make printing cleaner
options(dplyr.width = Inf)

## Data Description

### File 1 – `players.csv`

This file contains one row per unique player, with information about their demographics and how much they have played on the server.

- **Number of observations:** 196 players  
- **Number of variables:** 9  
- Note: 2 of the 9 variables are completely empty (all values are `NA`).

In [None]:
# Variable description table for players.csv
dataset <- matrix(c(
  "experience",       "Character",        "Experience level of players",
  "subscribe",        "Boolean/Logical",  "Whether player is subscribed",
  "hashedEmail",      "Character",        "Hashed email address of player (identifier)",
  "played_hours",     "Double/Numeric",   "Number of hours played",
  "name",             "Character",        "Name of player",
  "gender",           "Character",        "Gender of player",
  "age",              "Integer",          "Age of player",
  "individualId",     "Logical",          "Unique player identification (empty)",
  "organizationName", "Logical",          "Name of player's organization (empty)"
), ncol = 3, byrow = TRUE)

colnames(dataset) <- c("Variable Name", "Type of Variable", "Description of Data inside Variable")
rownames(dataset) <- 1:9

table_players <- as.table(dataset)
table_players

### File 2 – `sessions.csv`

This file contains one row per individual play session for players on the server.

- **Number of observations:** 1535 sessions  
- **Number of variables:** 5  

In [None]:
# Variable description table for sessions.csv
dataset2 <- matrix(c(
  "hashedEmail",         "Character",      "Hashed email address of player (identifier)",
  "start_time",          "Character",      "Start time of playing session",
  "end_time",            "Character",      "End time of playing session",
  "original_start_time", "Double/Numeric", "Start time as a numeric timestamp (likely UNIX milliseconds)",
  "original_end_time",   "Double/Numeric", "End time as a numeric timestamp (likely UNIX milliseconds)"
), ncol = 3, byrow = TRUE)

colnames(dataset2) <- c("Variable Name", "Type of Variable", "Description of Data inside Variable")
rownames(dataset2) <- 1:5

table_sessions <- as.table(dataset2)
table_sessions

## Issues and Potential Problems

### Direct issues visible in the data

- **Empty columns in `players.csv`:**  
  The variables `individualId` and `organizationName` are completely `NA`. Based on their names, they likely originally contained identifiers or organization details that were removed to protect player privacy.

- **Missing values in `sessions.csv`:**  
  Some rows are missing `end_time` even though other variables are filled in. This might happen if players did not properly log out. These missing end times can create noise or bias when computing session durations.
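
Both issues above can be checked directly. The sketch below uses small made-up tibbles (`toy_players`, `toy_sessions`, and all of their values are hypothetical stand-ins, not the project data) to show one possible way to drop columns that are entirely `NA` and to count sessions without an end time.

```r
library(dplyr)

# Hypothetical stand-in for players.csv: two columns are entirely NA
toy_players <- tibble(
  hashedEmail      = c("a1", "b2", "c3"),
  played_hours     = c(0.5, 12.0, 3.2),
  individualId     = c(NA, NA, NA),
  organizationName = c(NA, NA, NA)
)

# Keep only columns that contain at least one non-missing value
toy_players_clean <- toy_players |>
  select(where(~ !all(is.na(.x))))

# Hypothetical stand-in for sessions.csv: one session has no end_time
toy_sessions <- tibble(
  hashedEmail = c("a1", "b2", "b2"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", NA, "25/07/2024 17:57")
)

# Count sessions with a missing end_time
n_missing_end <- toy_sessions |>
  filter(is.na(end_time)) |>
  nrow()

names(toy_players_clean)
n_missing_end
```

The same two steps should apply unchanged to `players_data` and `sessions_data` once they are loaded.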

### Less obvious / contextual issues

- **Method of data collection not described:**  
  For `sessions.csv`, we are not told exactly how session times were measured. There could be a gap between actual gameplay time and the time someone is logged into the server, which means session duration may not perfectly reflect active play.

## Question

My group chose the project question about which "kinds" of players contribute a large amount of data. This is **Question 2** in the project handout:

> We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

### Response variable of interest

- **`played_hours`** (from `players.csv`): numeric, total hours played for each player.  
  This is my measure of how much data a player contributes.

### Explanatory variables (main focus)

From `players.csv`:

- **`experience`** (`chr`): main variable to define “kinds” of players (e.g., Beginner, Amateur, Regular, Veteran, Pro).  
- **`age`** (`int`): age of the player.  
- **`gender`** (`chr`): gender identity.  
- **`subscribe`** (`lgl`): whether the player is subscribed.

From `sessions.csv` (after wrangling):

- Aggregated features such as:
  - number of sessions per player  
  - average session duration  
  - total session time  
  - typical times/days of activity

These session-level features, combined with `players.csv`, will help refine what kinds of players generate lots of data.
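
As a rough sketch of how these aggregated features might be built, the example below summarises a made-up sessions table per player and joins the result onto a made-up players table. The tibbles, their values, and the feature names (`n_sessions`, `mean_session_hours`, `total_session_hours`, `median_start_hour`) are assumptions for illustration only.

```r
library(dplyr)
library(lubridate)

# Hypothetical sessions, already parsed as date-times (values are made up)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = dmy_hm(c("01/06/2024 10:00", "02/06/2024 20:00", "03/06/2024 09:30")),
  end_time    = dmy_hm(c("01/06/2024 11:00", "02/06/2024 20:30", "03/06/2024 10:00"))
)

# Hypothetical player table with the recorded total hours
toy_players <- tibble(
  hashedEmail  = c("a1", "b2", "c3"),
  played_hours = c(1.5, 0.5, 0.0)
)

# One row per player: session count, mean and total duration, typical start hour
session_features <- toy_sessions |>
  mutate(duration_hours = as.numeric(difftime(end_time, start_time, units = "hours"))) |>
  group_by(hashedEmail) |>
  summarise(
    n_sessions          = n(),
    mean_session_hours  = mean(duration_hours),
    total_session_hours = sum(duration_hours),
    median_start_hour   = median(hour(start_time))
  )

# Keep every player; players with no sessions get NA in the feature columns
players_with_features <- toy_players |>
  left_join(session_features, by = "hashedEmail")

players_with_features
```

Comparing `total_session_hours` with `played_hours` in the joined table is also one way to check how closely logged session time tracks recorded play time.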


## Exploratory Data Analysis and Wrangling

In this section I load the data, check basic summaries (dimensions, types, missing values), clean up the date/time columns for sessions, and merge the player and session data using `hashedEmail`. Then I create a few simple visualizations to explore how `played_hours` relates to experience level, age, and session behaviour.

In [None]:
# Read in the two datasets directly from the provided URLs
players_data <- read_csv(
  "https://raw.github.students.cs.ubc.ca/vdixit20/DSCI-100-Planning-Stage-Individual-/refs/heads/main/players.csv?token=GHSAT0AAAAAAAAAIXAT2NPXQ3JDPTAO56F22JAWH4A"
)

sessions_data <- read_csv(
  "https://raw.github.students.cs.ubc.ca/vdixit20/DSCI-100-Planning-Stage-Individual-/refs/heads/main/sessions.csv?token=GHSAT0AAAAAAAAAIXATC5WQ5DNTKELIEH5A2JAWFRQ"
)

players_data
sessions_data


In [None]:
# Summary for players_data
players_data_summary <- list(
  "Number of Observations"                  = nrow(players_data),
  "Number of Variables"                     = ncol(players_data),
  "Variable Name and Datatype"              = sapply(players_data, class),
  "Number of Missing Values in each column" = colSums(is.na(players_data))
)

players_data_summary

In [None]:
# Summary for sessions_data
sessions_data_summary <- list(
  "Number of Observations"                  = nrow(sessions_data),
  "Number of Variables"                     = ncol(sessions_data),
  "Variable Name and Datatype"              = sapply(sessions_data, class),
  "Number of Missing Values in each column" = colSums(is.na(sessions_data))
)

sessions_data_summary

In [None]:
# Convert start_time and end_time to proper date-time format
# The timestamps are in "dd/mm/yyyy hh:mm" format (no seconds), so use dmy_hm()
sessions_simplified <- sessions_data |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time   = dmy_hm(end_time)
  )

sessions_simplified

In [None]:
# Join player and session data on hashedEmail and drop rows with missing end_time
data_merged <- players_data |>
  inner_join(sessions_simplified, by = "hashedEmail") |>
  filter(!is.na(end_time)) |>
  arrange(name)

data_merged

## Visualizations of the Original Data

I now create three basic plots:

1. A scatterplot of **experience level vs played hours**.  
2. A scatterplot of **age vs played hours**.  
3. A histogram of **session durations** (based on start and end times).

These are not final models, but they give an initial sense of which groups of players appear to contribute more hours and how session behaviour varies.


In [None]:
# Scatterplot: Experience vs Played Hours
players_plot_1 <- players_data |>
  ggplot(aes(x = experience, y = played_hours)) +
  geom_jitter(width = 0.2, color = "purple", alpha = 0.6) +
  labs(
    title = "Scatterplot of Experience vs Played Hours",
    x = "Experience level",
    y = "Played hours"
  )

players_plot_1

In [None]:
# Scatterplot: Age vs Played Hours

# If the column is called "Age", rename it once to "age"
if ("Age" %in% names(players_data)) {
  players_data <- players_data |>
    dplyr::rename(age = Age)
}

players_plot_2 <- players_data |>
  ggplot(aes(x = age, y = played_hours)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Scatterplot of Age vs Played Hours",
    x = "Age (years)",
    y = "Played hours"
  )

players_plot_2

In [None]:
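# Histogram of session durations (in hours)
# Sessions with a missing start or end time are dropped before computing durations
sessions_plot <- sessions_simplified |>
  filter(!is.na(start_time), !is.na(end_time)) |>
  mutate(duration_hours = as.numeric(difftime(end_time, start_time, units = "hours"))) |>
  ggplot(aes(x = duration_hours)) +
  # 15-minute bins; a 5-hour bin would likely lump most sessions into one bar
  geom_histogram(binwidth = 0.25, fill = "skyblue", color = "black") +
  labs(
    title = "Distribution of Session Durations",
    x = "Session duration (hours)",
    y = "Number of sessions"
  )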

sessions_plot

## Methods and Modelling Plan

### Proposed approach

Because the response variable `played_hours` is numeric but the project question is phrased in terms of "kinds" of players, I plan to:

- Treat `played_hours` as a measure of how much data a player contributes.
- Use **classification and/or clustering** to group players into types based on their characteristics and play patterns.

Some possible approaches:

- **Classification:**  
  Create categories such as "high data contributor" vs "low data contributor" based on thresholds of `played_hours`, and then build a classifier using predictors like `experience`, `age`, `gender`, `subscribe`, and aggregated session features (total session duration, number of sessions, average session length).

- **Clustering:**  
  Cluster players using the same variables to see if natural groups emerge (e.g., highly active veterans, casual beginners, etc.). This can help interpret what “kinds” of players correspond to heavy data usage.
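
To make the classification idea concrete, here is a minimal sketch of turning `played_hours` into a two-level label. The data are made up, and the cutoff (the 75th percentile) and the column name `contributor` are assumptions chosen for illustration; the actual threshold still needs to be decided from the real distribution.

```r
library(dplyr)

# Hypothetical player data (values are made up for illustration)
toy_players <- tibble(
  experience   = c("Beginner", "Amateur", "Regular", "Veteran",
                   "Pro", "Amateur", "Regular", "Beginner"),
  played_hours = c(0.0, 0.1, 0.4, 1.2, 3.0, 15.5, 48.0, 0.2)
)

# Assumed cutoff: the 75th percentile of played_hours
cutoff <- quantile(toy_players$played_hours, probs = 0.75, names = FALSE)

# Label each player as a "high" or "low" data contributor
toy_labelled <- toy_players |>
  mutate(
    contributor = factor(
      if_else(played_hours > cutoff, "high", "low"),
      levels = c("low", "high")
    )
  )

count(toy_labelled, contributor)
```

Because play time tends to be heavily right-skewed, a percentile-based cutoff keeps the "high" class from being almost empty, though it also fixes the class balance by construction.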

### Data processing and model comparison

- Split the data into **training (80%)** and **test (20%)** sets.
- On the training set, use **cross-validation** (e.g., 5-fold) to compare models and tune hyperparameters.
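- Evaluate the final classifier on the held-out test set (e.g., accuracy, precision, and recall). Clustering has no labels to test against, so it would instead be assessed with measures such as the total within-cluster sum of squares across different numbers of clusters.
- Compare which predictors are most useful for identifying players who generate large amounts of data, for example by comparing cross-validated performance with and without each predictor.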
- Use the final model(s) and exploratory plots to answer Question 2: which combinations of experience level, demographics, and session behaviour describe the players who contribute the most data.
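
The split and cross-validation steps above could be set up roughly as follows. This sketch assumes the `rsample` package (part of tidymodels) is installed, and it runs on a randomly generated table (`toy_model_data`, hypothetical) rather than the project data.

```r
library(rsample)

set.seed(2024)  # make the random split reproducible

# Hypothetical modelling table: 50 players with a made-up binary label
toy_model_data <- data.frame(
  age         = sample(10:50, 50, replace = TRUE),
  n_sessions  = sample(1:40, 50, replace = TRUE),
  contributor = factor(sample(c("low", "high"), 50, replace = TRUE))
)

# 80% training / 20% test split
data_split <- initial_split(toy_model_data, prop = 0.8)
train_data <- training(data_split)
test_data  <- testing(data_split)

# 5-fold cross-validation folds built from the training set only
cv_folds <- vfold_cv(train_data, v = 5)

nrow(train_data)
nrow(test_data)
cv_folds
```

Both `initial_split()` and `vfold_cv()` accept a `strata` argument, which may be worth using on the real data if the "high" class turns out to be small.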

## GitHub Repository

The project repository is linked below. It will contain at least five meaningful commits documenting data loading, wrangling, visualizations, and report development.

https://github.students.cs.ubc.ca/vdixit20/DSCI-100-Planning-Stage-Individual-.git
