In [None]:
library(tidyverse)

In [None]:
url <- "https://github.com/user-attachments/files/23466468/players.csv"

In [None]:
players_data <- read_csv(url)
players_data

In [None]:
num_observations <- nrow(players_data)
print(num_observations)

In [None]:
# 1) Data Description: 
# (i) Players dataset: 
# Number of observations = 196
# Number of variables = 7
# Variable summary:

# Each variable is listed in this format: Variable name = Type; Description

# •experience = Character; Player experience level (e.g., Pro, Regular)
# •subscribe = Logical; Whether the player subscribes to the game-related newsletter (TRUE/FALSE)
# •hashedEmail = Character; Anonymized, unique ID of each player
# •played_hours = Double; Total number of hours played by each player
# •name = Character; Name of the player
# •gender = Character; Gender of the player (can be converted to a factor)
# •Age = Double; Age of the player
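
The counts and types listed above can be checked directly rather than read off the printed table. The sketch below is only an illustration: `players_toy` is a hypothetical stand-in with the same column names as `players_data`, and every value in it is made up.

```r
library(dplyr)

# Hypothetical stand-in for players_data (same column names, made-up values)
players_toy <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  hashedEmail  = c("a1", "b2", "c3", "d4"),
  played_hours = c(30.3, 0.0, 1.2, 5.5),
  name         = c("P1", "P2", "P3", "P4"),
  gender       = c("Male", "Female", "Prefer not to say", "Male"),
  Age          = c(17, 21, NA, 19)
)

n_obs  <- nrow(players_toy)  # number of observations (rows)
n_vars <- ncol(players_toy)  # number of variables (columns)

# One row per column with its storage type ("character", "logical", "double")
column_types <- tibble(
  variable = names(players_toy),
  type     = unname(vapply(players_toy, typeof, character(1)))
)
column_types
```

Running the same three steps on `players_data` should reproduce the observation count, the variable count, and the types in the summary.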

In [None]:
# Issues in the dataset: Some values are missing: a few players did not state their gender, and the Age column contains some NAs, which may affect the results.
# Missing values leave us with less information about those players and could bias the answer to a research question.
# A less visible issue is that some players have 0 played hours, which might reflect logging errors rather than true inactivity.
# If a player could not log in or use the game platform, a value of 0 should not be interpreted as the player choosing not to play. The data may have been collected using personal information from the players' profiles.
# Data about session start and end times were likely collected from the log-in history of the players' accounts.
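
Both issues can be quantified before any analysis. This is a minimal sketch on a hypothetical `players_toy` tibble (made-up values, column names borrowed from the players dataset); it counts the NAs in each column and the players with exactly 0 played hours.

```r
library(dplyr)

# Hypothetical stand-in for players_data; all values are made up
players_toy <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular", "Beginner"),
  played_hours = c(30.3, 0, 1.2, 0, 5.5),
  gender       = c("Male", "Female", "Prefer not to say", "Male", NA),
  Age          = c(17, 21, NA, 19, NA)
)

# Number of missing values in every column
na_counts <- players_toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Number of players who recorded exactly zero played hours
n_zero_hours <- players_toy |>
  filter(played_hours == 0) |>
  nrow()
n_zero_hours
```

Note that a response such as "Prefer not to say" is a recorded value, not an NA, so it would need to be counted separately if it matters for the question.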

In [None]:
print("Our group will base our work on the question of which kinds of players are most likely to contribute a large amount of data, so that we can target those players in our recruiting efforts. Specifically, we will focus on the effect of skill level on the amount of data a player contributes. Our goals are to determine the relationship between skill and the amount of data contributed, and to identify possible ways to target those players in recruitment.")


In [None]:
# To study my proposed question, I first need to select the columns I am interested in.
# These are experience and played_hours from the players dataset.
# Then, I will use the data to explain how players' experience might predict the total number of hours they play.

In [None]:
# Note: 
The players dataset is already tidy and does not need any further wrangling. 
Each row is a single observation, each column is a single variable, and each value is a single cell, meaning its entry in the data frame is not shared with another value.

In [None]:
# Selecting the columns needed for the analysis and sorting by played hours (highest first)
players_select <- players_data |>
  select(experience, played_hours, gender, Age) |>
  arrange(desc(played_hours))
players_select

In [None]:
# Computing mean values for played hours in players.csv data
# na.rm must be an argument of mean() itself, not of as.numeric(), to drop missing values
mean_played_hours <- mean(players_select |> pull(played_hours), na.rm = TRUE)
mean_played_hours

In [None]:
#Computing mean values for age in players.csv data
mean_age <- mean(players_data |> pull(Age), na.rm = TRUE)
mean_age

In [None]:
# Collecting the mean values in a table
table_mean_values <- tibble(
  Variable = c("played_hours", "Age"),
  Mean_values = c(mean_played_hours, mean_age)
)
table_mean_values

In [None]:
# Summary statistics for players’ dataset:
For my project, the mean, median, and standard deviation of played_hours, both overall and within each experience level, will be useful, as I am exploring how a player’s skill level (represented by experience) relates to the amount of data they might contribute. I will compare the average played_hours across the different levels of experience to see whether more experienced players tend to spend more time in the game. This analysis will help reveal whether players with higher skill levels are more likely to generate larger amounts of data, supporting our goal of identifying which kinds of players to target in recruitment.

In [None]:
# Computing median and SD values (na.rm is passed to median() so that NAs are dropped)
median_played_hours <- median(players_select |> pull(played_hours), na.rm = TRUE)
median_played_hours


In [None]:
SD_played_hours <- sd(players_select |> pull(played_hours), na.rm = TRUE)
SD_played_hours


In [None]:
#Computing mean values according to experience level 
players_pro <- players_select |>filter(experience == "Pro")
players_pro

In [None]:
mean_pro <- mean(players_pro |> pull(played_hours), na.rm = TRUE)
mean_pro

In [None]:
players_veteran <- players_select |>
filter(experience == "Veteran")
players_veteran
mean_veteran <- mean(players_veteran |> pull(played_hours), na.rm = TRUE)
mean_veteran

In [None]:
players_amateur <- players_select |>
filter(experience == "Amateur")
players_amateur
mean_amateur <- mean(players_amateur |> pull(played_hours), na.rm = TRUE)
mean_amateur

In [None]:
players_beginner <- players_select |>
filter(experience == "Beginner")
players_beginner

mean_beginner <- mean(players_beginner |> pull(played_hours), na.rm = TRUE)
mean_beginner
                                 

In [None]:
players_regular <- players_select |>
filter(experience == "Regular")
players_regular
mean_regular <- mean(players_regular |> pull(played_hours), na.rm = TRUE)
mean_regular 

In [None]:
experiences_mean <- tibble(
  Experience_level = c("Pro", "Veteran", "Amateur", "Beginner", "Regular"),
  Mean_values = c(mean_pro, mean_veteran, mean_amateur, mean_beginner, mean_regular)
)

experiences_mean
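
The per-level means above can also be computed in a single pipeline with `group_by()` and `summarise()`, which avoids one filter per experience level. The sketch below runs on a hypothetical `players_toy` tibble with made-up values; the names `players_toy` and `experience_means_toy` are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for players_select; all values are made up
players_toy <- tibble(
  experience   = c("Pro", "Pro", "Regular", "Regular", "Amateur"),
  played_hours = c(2, 4, 10, NA, 1)
)

# One mean per experience level, sorted from highest to lowest
experience_means_toy <- players_toy |>
  group_by(experience) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE)) |>
  arrange(desc(mean_hours))
experience_means_toy
```

Replacing `players_toy` with `players_select` should give the same values as the `experiences_mean` table, and the pipeline keeps working if a new experience level appears in the data.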

In [None]:
# As seen in the table, players at the Regular experience level have the highest mean played hours.
# Next, I will create a bar graph of total played hours to visualize the differences among the experience groups.

In [None]:
new_players <- players_select |>
  group_by(experience) |>
  summarise(played_hours = sum(played_hours, na.rm = TRUE)) |>
  mutate(experience = fct_reorder(experience, played_hours, .desc = TRUE))

played_hours_plot <- ggplot(new_players, aes(x = experience, y = played_hours)) +
  geom_col(fill = "lightblue") +  # geom_col() plots the summed values as given
  labs(
    x = "Experience level",
    y = "Total played hours",
    title = "Total played hours by player experience level"
  )

options(repr.plot.width = 10, repr.plot.height = 7)

played_hours_plot <- played_hours_plot +
  theme(text = element_text(size = 14))

played_hours_plot

In [None]:
# INSIGHTS: Players at the Regular experience level have the highest total number of played hours.
# The differences between the groups are large, suggesting that experience level might be related to the number of hours played.
# However, we cannot be sure, because the number of players in each group might differ, and group size alone could produce the differences seen above.
# For example, if only a few players are at the Pro or Veteran level, those groups can have fewer total played hours than the other levels.
# Therefore, mean values are more useful for interpreting the data accurately. In this case, the means show the same result as the graph: Regular players have the highest average number of played hours, followed by Amateur, Beginner, Pro, and Veteran.
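
The effect of group size on totals can be made concrete by putting the number of players, the total, and the mean side by side. This sketch uses a hypothetical `players_toy` tibble whose values were made up so that the larger group has the larger total but the smaller mean.

```r
library(dplyr)

# Hypothetical data: four Regular players with 2 hours each, one Pro player with 6 hours
players_toy <- tibble(
  experience   = c("Regular", "Regular", "Regular", "Regular", "Pro"),
  played_hours = c(2, 2, 2, 2, 6)
)

# Group size, total hours, and mean hours per experience level
group_comparison <- players_toy |>
  group_by(experience) |>
  summarise(
    n_players   = n(),
    total_hours = sum(played_hours, na.rm = TRUE),
    mean_hours  = mean(played_hours, na.rm = TRUE)
  )
group_comparison
```

In this toy case the Regular group leads on total hours only because it has more players, which is exactly the situation the mean guards against.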

In [None]:
(4) Methods and planning:
For my project, I will use a K-nearest neighbours (KNN) classifier to model the relationship between players’ experience and played hours, with played_hours as the predictor and experience level as the class to be predicted.
Why is this method appropriate?
A KNN classifier is appropriate because:
•It is simple and interpretable: a player is classified by similarity to other players, taking the experience level most common among the players with the closest played hours.
•It works well when the data form clear clusters (e.g., players with similar playtime belong to the same experience level).
•It makes no assumption about the shape of the relationship, so it can capture non-linear relationships between experience levels and played hours.
Which assumptions are required, if any, to apply the method selected?
•The predictor (played hours) should differ meaningfully between experience levels.
•Proper scaling: Since KNN relies on distances, all numerical features (played hours) should be standardized.
•The dataset should have a roughly equal number of players in each experience category; otherwise, predictions are biased toward the largest categories.
•No outliers: Extreme values can distort distance calculations, so outliers should be addressed.
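
A minimal sketch of the classifier described above, assuming played hours as the single predictor and experience level as the class. It uses `knn()` from the `class` package (a recommended package that normally ships with R) purely for illustration; the project may use a different modelling framework. The data frames `train` and `unseen_players`, the choice of k = 3, and all values are hypothetical.

```r
library(class)  # provides knn()

set.seed(1)  # knn() breaks voting ties at random, so fix the seed

# Hypothetical training data: made-up hours for two experience levels
train <- data.frame(
  played_hours = c(0.1, 0.3, 0.5, 0.4, 20, 22, 25, 24),
  experience   = factor(c(rep("Beginner", 4), rep("Regular", 4)))
)
unseen_players <- data.frame(played_hours = c(0.2, 23))

# Standardize using the training mean and SD, then apply the same
# transformation to the unseen observations
hours_mean <- mean(train$played_hours)
hours_sd   <- sd(train$played_hours)
train_scaled  <- (train$played_hours - hours_mean) / hours_sd
unseen_scaled <- (unseen_players$played_hours - hours_mean) / hours_sd

# Each unseen player gets the majority class of its 3 nearest training players
predicted <- knn(
  train = matrix(train_scaled, ncol = 1),
  test  = matrix(unseen_scaled, ncol = 1),
  cl    = train$experience,
  k     = 3
)
predicted
```

With a single predictor, standardizing does not change which neighbours are nearest; the step is kept because it becomes necessary as soon as a second numeric predictor on a different scale is added. In practice, k would be chosen by cross-validation rather than fixed in advance.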