# Utilizing player age & experience level to predict average playtime

## Introduction

Video games have exploded in popularity in recent years, becoming a dominant form of entertainment and fostering massive global communities. With the rise of multiplayer online games like Minecraft, understanding player behavior has become crucial for managing servers and optimizing resources. This project leverages data on player activity to uncover patterns and inform strategies for sustaining engagement.


In this project, we address the question: **Can we predict the average hours a player will play?**

We will use two datasets to determine which player characteristics are most predictive of high data contribution. The data were collected by the Pacific Laboratory for Artificial Intelligence (PLAI) at UBC, which set up a Minecraft server that records players' actions as they navigate through the world.

---

### Dataset #1: players.csv (A list of all unique players, including data about each player)

- Number of observations: 196
- Number of variables: 9

| Variable | Type | Description |
|:--------:|:--------:|:--------:|
|  experience   |  categorical   |  Player's experience level (e.g., Beginner, Amateur, Regular, Veteran, Pro)  |
|  subscribe   |  categorical  |  Indicates whether the player subscribes to a service (TRUE/FALSE)   |
|  hashedEmail   |  categorical   |  Unique identifier for each player  |
|  played_hours   |  quantitative   |  Total hours played by the player  |
|  name   |  categorical   |  Name of the player  |
|  gender   |  categorical   |  Gender of the player (e.g., Male, Female, Non-binary, Prefer not to say)  |
|  age   |  quantitative   |  Age of the player  |
|  individualId   |  categorical   |  Unique individual ID  |
|  organizationName   |  categorical   |  Identifier for the organization the player may be associated with (if any)  |

---

### Dataset #2: sessions.csv (A list of individual play sessions by each player, including data about the session)

- Number of observations: 1535
- Number of variables: 5

| Variable | Type | Description |
|:--------:|:--------:|:--------:|
|  hashedEmail   |  categorical   |  Unique identifier for each player  |
|  start_time   |  quantitative  |  The start date and time of each play session (DD/MM/YYYY HH:MM)  |
|  end_time   |  quantitative   |  The end date and time of each play session (DD/MM/YYYY HH:MM)  |
|  original_start_time   |  quantitative   |  Scheduled start time (UNIX timestamp format, in milliseconds)  |
|  original_end_time   |  quantitative   |  Scheduled end time (UNIX timestamp format, in milliseconds)  |

---

We will use **age** and **experience** from the players.csv dataset as predictors of the average hours a player will play in Minecraft.

## Methods & Results

#### Outline of the preliminary exploratory data analysis:

Step 1) Imported libraries along with player & session data from Google Drive links

Step 2) Cleaned and tidied data by removing irrelevant variables from datasets and adding individual date & time columns

Step 3) Split the data into training and testing sets (only working with the training set until the very end)

Step 4) Summarized the training set to inform how we want our predictive model to operate

Step 5) Visualized the training dataset

**.** **.** **.**

### Preliminary exploratory data analysis:

#### Importing libraries

```r
# Load the libraries used throughout the analysis
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)
```

#### Importing Player & Session Datasets

We used **read_csv** to import the player & session datasets from their Google Drive URLs, then previewed the first rows and counted the observations in each.

```r
# Direct-download links to the two CSV files
players_url <- "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
sessions_url <- "https://drive.google.com/uc?export=download&id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB"

# Read each file into a tibble
players <- read_csv(players_url)
sessions <- read_csv(sessions_url)

# Preview the first six rows of each dataset
head(players)
head(sessions)

# Count the observations in each dataset
nrow(players)
nrow(sessions)
```

```
Rows: 196 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, age
lgl (3): subscribe, individualId, organizationName

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

| experience `<chr>` | subscribe `<lgl>` | hashedEmail `<chr>` | played_hours `<dbl>` | name `<chr>` | gender `<chr>` | age `<dbl>` | individualId `<lgl>` | organizationName `<lgl>` |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 | NA | NA |
| Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 | NA | NA |
| Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 | NA | NA |
| Amateur | TRUE | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 | NA | NA |
| Regular | TRUE | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 | NA | NA |
| Amateur | TRUE | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 | NA | NA |

| hashedEmail `<chr>` | start_time `<chr>` | end_time `<chr>` | original_start_time `<dbl>` | original_end_time `<dbl>` |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000.0 | 1719770000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 17/06/2024 23:33 | 17/06/2024 23:46 | 1718670000000.0 | 1718670000000.0 |
| f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc | 25/07/2024 17:34 | 25/07/2024 17:57 | 1721930000000.0 | 1721930000000.0 |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 25/07/2024 03:22 | 25/07/2024 03:58 | 1721880000000.0 | 1721880000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 25/05/2024 16:01 | 25/05/2024 16:12 | 1716650000000.0 | 1716650000000.0 |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 23/06/2024 15:08 | 23/06/2024 17:10 | 1719160000000.0 | 1719160000000.0 |



#### Figure 1: First six rows of the player and session datasets

The player dataset contains both categorical and quantitative columns. Moreover, the **start_time** and **end_time** variables in the session dataset each contain both a date and a time, so they will need to be separated.

### Cleaning and tidying the data

Some columns in the player & session datasets are irrelevant to predicting the average hours a player will play, so they will be removed. Additionally, we will separate **start_time** and **end_time** into individual date & time columns in the session dataset.
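
A minimal sketch of this cleaning step is shown here on a few rows copied from the previews above, not on the full data. The `_demo` and `_tidy` object names, the placeholder IDs, the new column names (`start_date`, `start_clock`, `end_date`, `end_clock`), and the choice of which columns to keep are all assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy rows copied from the previews above; the IDs are short placeholders,
# not real hashed emails.
players_demo <- tibble(
  experience   = c("Pro", "Veteran", "Amateur"),
  subscribe    = c(TRUE, TRUE, TRUE),
  hashedEmail  = c("id_a", "id_b", "id_c"),
  played_hours = c(30.3, 3.8, 0.7),
  name         = c("Morgan", "Christian", "Flora"),
  gender       = c("Male", "Male", "Female"),
  age          = c(9, 17, 21)
)
sessions_demo <- tibble(
  hashedEmail = c("id_a", "id_b"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46")
)

# Keep only the response and the two predictors (an assumed selection),
# and store experience as a factor.
players_tidy <- players_demo %>%
  select(experience, age, played_hours) %>%
  mutate(experience = factor(experience))

# Split each "DD/MM/YYYY HH:MM" string at the space into a date part and a
# clock-time part, then parse the date parts with lubridate::dmy().
sessions_tidy <- sessions_demo %>%
  separate(start_time, into = c("start_date", "start_clock"), sep = " ") %>%
  separate(end_time, into = c("end_date", "end_clock"), sep = " ") %>%
  mutate(start_date = dmy(start_date), end_date = dmy(end_date))

players_tidy
sessions_tidy
```

Keeping the clock time as text is the simplest option for this sketch; it could be parsed further (for example with `lubridate::hm()`) if session durations are needed.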

### Splitting the data into training & testing sets
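
One way the split could look is sketched below with `initial_split()` from rsample (part of tidymodels). The 75/25 proportion, the seed value, the object names, and the synthetic 20-row data frame are assumptions for illustration only; the real analysis would split the cleaned players data.

```r
library(dplyr)
library(rsample)  # part of tidymodels; provides initial_split()

# Synthetic stand-in for the cleaned players data (20 made-up rows).
players_demo <- tibble(
  experience   = rep(c("Amateur", "Beginner", "Pro", "Regular", "Veteran"), 4),
  age          = rep(c(17, 19, 21, 23), each = 5),
  played_hours = seq(0, 9.5, by = 0.5)
)

set.seed(2024)  # arbitrary seed so the random split is reproducible

# Hold out 25% of the rows for testing (the proportion is an assumption).
players_split <- initial_split(players_demo, prop = 0.75)
players_train <- training(players_split)
players_test  <- testing(players_split)

nrow(players_train)
nrow(players_test)
```

Only `players_train` would be used for summaries, plots, and model tuning, so the testing set stays untouched until the final evaluation.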

### Summarizing the data
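
As a rough sketch, the training set could be summarized per experience level as follows. The six rows are copied from the players preview above and stand in for the training set; the object and column names in the summary are hypothetical.

```r
library(dplyr)

# Rows copied from the players preview above (stand-in for the training set).
players_demo <- tibble(
  experience   = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  age          = c(9, 17, 17, 21, 21, 17)
)

# Count players and average the response and the age within each level.
experience_summary <- players_demo %>%
  group_by(experience) %>%
  summarise(
    n_players         = n(),
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    mean_age          = mean(age, na.rm = TRUE),
    .groups           = "drop"
  )

experience_summary
```

Setting `na.rm = TRUE` guards against missing ages or hours, which would otherwise turn a group mean into `NA`.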

### Visualization
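
A possible starting point for this visualization is a scatter plot of played hours against age, coloured by experience level. This is only a sketch: the plot type, the `Set2` palette, the axis labels, and the object names are assumptions, and the six rows are copied from the players preview above as a stand-in for the training set.

```r
library(ggplot2)

# Rows copied from the players preview above (stand-in for the training set).
players_demo <- data.frame(
  experience   = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  age          = c(9, 17, 17, 21, 21, 17)
)

# One point per player: age on the x-axis, played hours on the y-axis,
# with a ColorBrewer palette distinguishing experience levels.
age_hours_plot <- ggplot(
  players_demo,
  aes(x = age, y = played_hours, colour = experience)
) +
  geom_point(size = 3, alpha = 0.7) +
  scale_colour_brewer(palette = "Set2") +
  labs(
    x = "Age (years)",
    y = "Played hours",
    colour = "Experience level"
  )

age_hours_plot
```

Because played hours appear heavily right-skewed in the preview, a log-scaled y-axis might be worth trying on the full data, although zero-hour players would need separate handling.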

### Data analysis

### Visualization of data analysis

## Discussion

### Summarizing what we found
...

### Expected findings vs outcome
...

### What impact could such findings have?
...

### What future questions could this lead to?
...