# Data Science Final Report Project

## Introduction

A research group at UBC, led by Frank Wood, has created a Minecraft server to collect data on how people play video games. Each player's activity is logged as they play and interact with the world, producing a rich dataset. This report uses that data to address the second area of interest: which types of players are most likely to contribute a large amount of data to the Minecraft server. To do this, I examine the relationship between Age and played hours, to see whether an individual's age affects the number of hours they play and, therefore, how much data they contribute to the dataset.

## Question
This report addresses the question:
Can a player's age predict how many hours they contribute to the dataset? My goal is to identify the typical player, based on age and experience, who contributes the most, so that the research team can focus its recruitment efforts on the most valuable participants.




## Dataset Description: 
The dataset collected by the research group consists of two files, players.csv and sessions.csv. My analysis uses only players.csv, which contains 196 observations of the 7 variables listed below.

| Variable      | Data Type            | Description                                                       |
| ------------- | -------------------- | ----------------------------------------------------------------- |
| experience    | Categorical (chr)    | Player’s self-reported experience level (e.g., “Beginner”, “Pro”) |
| subscribe     | Logical (TRUE/FALSE) | Indicates whether the player subscribed to updates                |
| hashedEmail   | Categorical (chr)    | Anonymized player email (not used in analysis)                    |
| played\_hours | Numeric (dbl)        | Total hours spent on the server (used as the response variable)   |
| name          | Categorical (chr)    | Player’s first name (not used in analysis)                        |
| gender        | Categorical (chr)    | Player’s self-identified gender                                   |
| Age           | Numeric (dbl)        | Player’s age in years (used as a predictor variable)              |





In [None]:
library(tidyverse)
library(repr)
library(dplyr)
library(tidymodels) 

## Data Wrangling

In [None]:
players_data<-read_csv("data/players.csv")

players_data


In [None]:
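# Keep only the two variables of interest, coerce them to numeric,
# and drop rows where either value is missing
players_data <- players_data |>
  select(Age, played_hours) |>
  mutate(Age = as.numeric(Age), played_hours = as.numeric(played_hours)) |>
  filter(!is.na(Age), !is.na(played_hours))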

players_data

In [None]:
options(repr.plot.width = 15, repr.plot.height = 15)
players_plot<- players_data|>
ggplot(aes(x=Age,  y= played_hours)) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  labs(title = "Figure 1: Relationship between Age and Played Hours",
    x = "Age (in years)",
    y = "Played Hours (in hours)")

players_plot

I started by importing the CSV with read_csv("data/players.csv") to load my dataset into R. Next, I called select(Age, played_hours) from dplyr to keep only the two variables I need, used mutate(Age = as.numeric(Age), played_hours = as.numeric(played_hours)) to make both variables numeric so that R treats them as continuous for plotting and modeling, and used filter(!is.na(Age), !is.na(played_hours)) to drop rows with missing values.

Next, I visualised my data. To give my scatterplot plenty of room, I set options(repr.plot.width = 15, repr.plot.height = 15), which expands the output area. For the plot itself, I built a ggplot(aes(x = Age, y = played_hours)) object, added geom_point(alpha = 0.5, na.rm = TRUE) so that overlapping points become semi-transparent and any missing values are dropped without a warning, and finished with labs() to supply a clear title and axis labels.

The main issue with my visualisation is how densely clustered some of the points are, which makes it difficult to interpret. I tried to mitigate this by increasing the plot height and width, but the clustering remains, and I would want to find a better way to address it. Since I didn't notice a strong linear relationship, I will next use KNN regression, and also fit a linear regression for comparison.
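One possible way to reduce the clustering is sketched below. It assumes the clustering comes from many players having close to zero played hours while a few have very many, which I have not verified here. Under that assumption, plotting log(1 + played hours) instead of raw hours spreads out the low values, and a small horizontal jitter separates players of the same age. The toy_players data frame is made up purely for illustration and is not the real data.

```r
library(ggplot2)

# Hypothetical toy data standing in for the cleaned players_data tibble
toy_players <- data.frame(
  Age          = c(17, 17, 19, 21, 21, 22, 25, 30),
  played_hours = c(0, 0.1, 0.3, 0, 1.2, 30.3, 0.5, 150)
)

# log1p(x) computes log(1 + x), so players with 0 hours stay on the plot
toy_players$log_hours <- log1p(toy_players$played_hours)

toy_plot <- ggplot(toy_players, aes(x = Age, y = log_hours)) +
  # Jitter horizontally only, so the plotted hours are not distorted
  geom_jitter(alpha = 0.5, width = 0.2, height = 0) +
  labs(x = "Age (in years)", y = "log(1 + played hours)")

toy_plot
```

The same two steps could be applied to players_data in place of toy_players. Note that the y-axis would then be on a log scale, so it would need to be read differently from Figure 1.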


## KNN Regression

In [None]:
set.seed(123)
split         <- initial_split(players_data, prop = 0.8)
players_train <- training(split)
players_test  <- testing(split)

knn_recipe <- recipe(played_hours ~ Age, data = players_train) |>
  step_scale(all_predictors())|>
  step_center(all_predictors())


knn_spec <- nearest_neighbor(
    weight_func = "rectangular",
    neighbors   = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_workflow <- workflow() |>
  add_model(knn_spec) |>
  add_recipe(knn_recipe)

knn_folds <- vfold_cv(players_train, v = 5)

knn_results <- tune_grid(
  knn_workflow,
  resamples = knn_folds,
  grid      = tibble(neighbors = 1:10),
  metrics   = metric_set(rmse)
)

# Pick the number of neighbors with the lowest cross-validated RMSE
best_k <- knn_results |>
  select_best(metric = "rmse") |>
  pull(neighbors)
# Best K = 10

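# Note: weight_func is not set here, so the engine's default weighting is used
# rather than the "rectangular" weighting used during tuning.
final_knn_spec <- nearest_neighbor(
    mode      = "regression",
    neighbors = best_k
  ) |>
  set_engine("kknn")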

final_knn_workflow <- workflow() |>
  add_model(final_knn_spec) |>
  add_recipe(knn_recipe)

final_knn_fit <- final_knn_workflow |>
  fit(data = players_train)

knn_rmse <- predict(final_knn_fit, players_test) |>
  bind_cols(players_test) |>
  metrics(truth = played_hours, estimate = .pred) |>
  filter(.metric == "rmse") |>
  pull(.estimate)

knn_rmse


## Linear Regression


In [None]:
set.seed(123)
split         <- initial_split(players_data, prop = 0.8)
players_train <- training(split)
players_test  <- testing(split)

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

lm_wf <- workflow() |>
  add_model(lm_spec) |>
  add_formula(played_hours ~ Age)

lm_fit <- lm_wf |>
  fit(data = players_train)

lm_rmse <- lm_fit |>
  predict(players_test) |>
  bind_cols(players_test) |>
  rmse(truth = played_hours, estimate = .pred) |>
  pull(.estimate)

lm_rmse

Comparing the test RMSE of the two models, linear regression performs slightly better on this dataset: it has an RMSE of 35.9 hours, while KNN regression has an RMSE of 36.9 hours. RMSE is the square root of the average squared prediction error, so a lower value means the model’s predictions are, on average, closer to the true values.
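For reference, with observed played hours $y_i$ and predicted played hours $\hat{y}_i$ over $n$ test observations, RMSE is:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

A minimal sketch of the same calculation in base R is shown below. The truth and pred vectors are made-up numbers chosen only to make the arithmetic easy to follow; they are not taken from the players data.

```r
# Hypothetical observed and predicted values (not from the real dataset)
truth <- c(1, 2, 3, 10)
pred  <- c(1, 2, 3, 4)

errors <- truth - pred                 # 0, 0, 0, 6

# Square the errors, average them, then take the square root
rmse_by_hand <- sqrt(mean(errors^2))   # sqrt(36 / 4) = 3

# Mean absolute error on the same predictions, for comparison
mae_by_hand <- mean(abs(errors))       # 6 / 4 = 1.5

rmse_by_hand
mae_by_hand
```

In this toy example three of the four predictions are exact, yet the single large miss pushes RMSE to 3, twice the mean absolute error. Because the errors are squared before averaging, a few badly predicted players can account for much of a model's RMSE.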

In [None]:
library(ggplot2)

linear_reg_plot<-ggplot(players_data, aes(x = Age, y = played_hours)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Figure 2: Linear Regression of Played Hours on Age",
    x     = "Age (years)",
    y     = "Played Hours") 

linear_reg_plot