# Individual Planning Report

Name: Faye Shipp

Section: 002

In [None]:
library(tidyverse)
library(readxl)

In [None]:
player_data <- read_csv("https://raw.githubusercontent.com/vel-lapel/dsci100-2025/refs/heads/main/players.csv")
head(player_data)
session_data <- read_csv("https://raw.githubusercontent.com/vel-lapel/dsci100-2025/refs/heads/main/sessions.csv")
head(session_data)
slice(session_data, 11, 13)

## 1. Data summary

### Player Dataset
- There are 196 rows -> Data was collected on 196 participants
- There are 7 columns -> Data was collected on 7 different variables
- The columns are as follows:
	- experience (chr) -> a categorical variable ranking each player's experience. The options are Pro, Veteran, Amateur, Regular, Beginner
	- hashedEmail (chr) -> shows each player's unique hashed email. Used to identify players. 
	- name (chr) -> the name of each player. Also used to identify players
	- gender (chr) -> a categorical variable showing each player's gender. The options are Agender, Female, Male, Non-Binary, Other, Prefer Not to Say, Two-Spirited
	- played_hours (dbl) -> a numerical value of how many total hours each player has played.
	- Age (dbl) -> a numerical value corresponding to the age of each player. 
	- subscribe (lgl) -> a logical (true/false) value indicating whether or not a player has subscribed to the data collection

#### Issues with the data
- There are two observations where Age is missing (NA)
- There are multiple observations where played_hours is 0
- It only gives the total hours played, not the number of sessions played.
- Otherwise, the data is in a tidy format.
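A minimal sketch of how the first two issues could be counted. The `toy_players` rows are made-up stand-ins for `player_data`, so the counts shown here are not the real ones.

```r
library(tidyverse)

# Hypothetical toy rows standing in for player_data (values are made up)
toy_players <- tibble(
  hashedEmail  = c("a1", "b2", "c3", "d4"),
  played_hours = c(0, 1.5, 0, 30.3),
  Age          = c(17, NA, 21, 19)
)

# Count missing ages and zero-hour players in one pass
issue_counts <- toy_players |>
  summarize(
    n_missing_age = sum(is.na(Age)),
    n_zero_hours  = sum(played_hours == 0)
  )

issue_counts
```

Running the same `summarize()` call on `player_data` would give the actual counts for this dataset.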

### Session Data
- There are 1535 rows -> There are 1535 unique play sessions
- There are 5 columns -> data was collected on 5 different variables for each session
- The columns are as follows:
	- hashedEmail (chr) -> shows each player's unique hashed email. Used to identify players.
	- start_time (chr) -> the starting time of each session, in character form
	- end_time (chr) -> the ending time of each session, in character form
	- original_start_time (dbl) -> presumably a starting time of something, in a numerical form
	- original_end_time (dbl) -> presumably an ending time of something, in a numerical form

#### Issues with the data
- original_start_time and original_end_time are not well labeled. It is unclear what they represent, and there is very little variation between the two for each session.
- start_time and end_time are character variables, so date-time operations (sorting by time, computing durations, plotting over time) cannot be applied to them directly. This is not very tidy.
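As a rough illustration of what becomes possible once the two time columns are parsed, the sketch below computes a session length in minutes. The `toy_sessions` rows are hypothetical, and the day/month/year hour:minute layout and the `UTC` time zone are assumptions, not something the dataset documents.

```r
library(tidyverse)

# Hypothetical session rows; the "dd/mm/yyyy HH:MM" layout is an assumption
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

toy_sessions_parsed <- toy_sessions |>
  mutate(
    # Convert the character columns to date-times (time zone assumed to be UTC)
    start_time = as.POSIXct(start_time, format = "%d/%m/%Y %H:%M", tz = "UTC"),
    end_time   = as.POSIXct(end_time, format = "%d/%m/%Y %H:%M", tz = "UTC"),
    # With real date-times, a duration is just a difference
    duration_mins = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

toy_sessions_parsed
```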


## 2. Questions

We would like to know what "kind" of player contributes a large amount of data to target those players in recruitment efforts.

-> Given a player's age and experience, how many sessions will they play?

We can use the "sessions" dataset to count the number of sessions per player, then join those counts to the "players" dataset by hashedEmail to get a dataframe showing how many sessions each player played. From there, we can create a KNN model that predicts the number of sessions from age and experience.
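The counting and combining step could be sketched roughly as follows. The two toy tibbles and the column name `n_sessions` are hypothetical, and treating players with no recorded sessions as having 0 sessions is an assumption about how they should be handled.

```r
library(tidyverse)

# Hypothetical stand-ins for the players and sessions dataframes
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  experience  = c("Pro", "Amateur", "Beginner"),
  Age         = c(17, 21, 19)
)
toy_sessions <- tibble(hashedEmail = c("a1", "a1", "a1", "b2"))

# One row per player with the number of sessions they played
sessions_per_player <- toy_sessions |>
  count(hashedEmail, name = "n_sessions")

# Attach the counts to the player rows; players without sessions get 0 (assumption)
players_with_sessions <- toy_players |>
  left_join(sessions_per_player, by = "hashedEmail") |>
  mutate(n_sessions = replace_na(n_sessions, 0L))

players_with_sessions
```

A `left_join()` keeps every player, whereas an `inner_join()` would silently drop players who never started a session.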

## 3. Exploratory Data Analysis and Visualization

In [None]:
# Note that the data has already been loaded into this notebook.

In [None]:
# Calculate the mean played hours and mean age from the player dataframe
mean_player_data <- player_data |> 
                    summarize(
                        mean_played_hours = mean(played_hours),
                        mean_age = mean(Age, na.rm = TRUE)) # only 2 observations have a missing Age, so we drop them just for this mean

# Drop the rows with a missing Age before plotting
player_data_no_na <- player_data |> filter(!is.na(Age))

plot1 <- player_data_no_na |> ggplot(aes(y=played_hours, x=Age)) + 
                        geom_point(alpha = 1) + # alpha is a fixed setting, so it goes outside aes()
                        labs(x= "Age", y= "Total Playtime (Hours)", title = "Age and Playtime of Players")



# tidy the data --> convert the <chr> date variables into <dttm>
session_data1 <- session_data |> 
                    mutate(
                        start_time = as.POSIXct(start_time, format = "%d/%m/%Y %H:%M"),
                        end_time = as.POSIXct(end_time, format = "%d/%m/%Y %H:%M"))

plot2 <- session_data1 |> ggplot(aes(x=start_time))+
                        geom_histogram() +
                        labs(x = "Date", y = "Number of sessions", title = "Frequency of sessions played over time")

plot2
plot1
mean_player_data

From the table and plots above, we see that the mean total playtime is around 5.8 hours, and that the average player is around 20 years old. 

From the scatter plot of total playtime against age, we see that players aged roughly 15 to 25 typically play the most hours. 

Looking at the distribution of sessions over time, it appears the data was collected from April to October of 2024, and peak data collection happened around the end of June into early July. The distribution is slightly left-skewed. 

## 4. Methods and Plan

##### I propose creating a KNN regression model that predicts the number of sessions played from the variables age and experience.

Since number o

To do this, we will:
- from "players", filter out those with played_hours = 0
- group_by and summarize the "sessions" data to get a count of how many sessions per player
- sort the data 
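
A rough sketch of how the proposed KNN regression could look with tidymodels, starting from an already combined table. Everything in `toy_data` is made up, `n_sessions` and `experience_level` are hypothetical column names, `neighbors = 3` is an arbitrary choice, and the ordering of the experience levels is an assumption (KNN needs numeric predictors, so experience is mapped to 1 through 5 here).

```r
library(tidyverse)
library(tidymodels)

# Assumed ordering of experience, lowest to highest (not confirmed by the data source)
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Hypothetical combined player/session table (values are made up)
toy_data <- tibble(
  Age        = c(17, 21, 19, 25, 30, 16, 22, 18),
  experience = c("Pro", "Amateur", "Beginner", "Regular",
                 "Veteran", "Amateur", "Pro", "Regular"),
  n_sessions = c(27, 3, 1, 5, 2, 10, 40, 4)
) |>
  mutate(experience_level = as.numeric(factor(experience, levels = experience_levels)))

# Scale and center the predictors so neither one dominates the distance
knn_recipe <- recipe(n_sessions ~ Age + experience_level, data = toy_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN regression specification; k = 3 is arbitrary and would normally be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_data)

# Predict the number of sessions for a hypothetical 20-year-old Amateur
new_player <- tibble(Age = 20, experience_level = 2)
knn_prediction <- predict(knn_fit, new_player)
knn_prediction
```

In a full analysis the data would first be split into training and testing sets, and the number of neighbors would be chosen by cross-validation rather than fixed in advance.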