# Data Science Project
Individual Planning Report by Simon San

## Data Description
The Pacific Laboratory for Artificial Intelligence (PLAI), under Frank Wood, has set up a Minecraft server to investigate how people play games. The group is also interested in how to recruit players and in making sure it has enough resources for them. It has therefore collected data to help address these issues.


### players.csv
This data set includes information about each player.
It consists of the following seven variables:

- `experience` - categorized level of experience of player
- `subscribe` - whether the player is subscribed to a game-related newspaper
- `hashedEmail` - hashed email of players (to protect privacy)
- `played_hours` - total amount of server play time (in hours)
- `name` - name of player
- `gender` - gender of player
- `Age` - age of player

In [28]:
library(tidyverse)
players <- read_csv(file = "data/players.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


The dataset has a total of 196 observations. The variables `experience`, `hashedEmail`, `name`, and `gender` are character vectors. The variables `played_hours` and `Age` are numeric vectors. The variable `subscribe` is a logical vector.
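The types reported above can also be declared when the file is read, which silences the column-specification message and stores the categorical columns as factors. The following is a minimal sketch, assuming readr 2.0 or later (for literal input via `I()`). The three inline rows are made up for illustration and only stand in for `data/players.csv`; `players_typed` is a hypothetical name.

```r
library(readr)

# Hypothetical stand-in for data/players.csv (made-up rows, same seven columns)
toy_csv <- I("experience,subscribe,hashedEmail,played_hours,name,gender,Age
Pro,TRUE,abc123,30.3,Alex,Male,9
Amateur,FALSE,def456,0.0,Sam,Female,17
Veteran,TRUE,ghi789,3.8,Kit,Non-binary,21")

players_typed <- read_csv(
    toy_csv,
    col_types = cols(
        experience   = col_factor(),    # levels taken in order of appearance
        subscribe    = col_logical(),
        hashedEmail  = col_character(),
        played_hours = col_double(),
        name         = col_character(),
        gender       = col_factor(),
        Age          = col_double()
    )
)
players_typed
```

With the real file, `toy_csv` would be replaced by the path `"data/players.csv"`. Whether `experience` should instead be an ordered factor is a modelling choice this sketch leaves open.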

Below are relevant summary statistics for the different variables:

In [70]:
# Proportion of players with subscribe == TRUE
total_obs <- nrow(players)
subscribed <- players |>
    group_by(subscribe) |>
    summarize(count = n()) |>
    filter(subscribe == TRUE) |>
    mutate(pct_true = count/total_obs) |>
    pull(pct_true)
# Convert to percentages rounded to two decimals for display
percentage_subscribed <- round(subscribed*100, 2)
percentage_subscribed_display <- paste0(percentage_subscribed, "%")
print("Percentage of subscribed players:")
percentage_subscribed_display
print("Percentage of non-subscribed players:")
percentage_non_subscribed_display <- paste0(100-percentage_subscribed, "%")
percentage_non_subscribed_display

print("Summary statistics of hours played:")
sum_played_hours <- summary(players$played_hours)  # summary() excludes NAs and reports their count
sum_played_hours_display <- format(round(sum_played_hours, 2))
sum_played_hours_display

print("Summary statistics of age:")
sum_Age <- summary(players$Age)  # summary() excludes NAs and reports their count
sum_Age_display <- format(round(sum_Age, 2))
sum_Age_display

print("Percentage of different experience levels of players:")
fct_exp <- players |>
    mutate(experience = as.factor(experience)) |>
    group_by(experience) |>
    summarize(percentage_experience = n()/total_obs*100) |>
    mutate(percentage_experience = round(percentage_experience, 2)) |>
    arrange(desc(percentage_experience))
fct_exp

print("Percentage of different gender identifying players:")
fct_gender <- players |>
    mutate(gender = as.factor(gender)) |>
    group_by(gender) |>
    summarize(percentage_gender = n()/total_obs*100) |>
    mutate(percentage_gender = round(percentage_gender, 2)) |>
    arrange(desc(percentage_gender))
fct_gender

[1] "Percentage of subscribed players:"


[1] "Percentage of non-subscribed players:"


[1] "Summary statistics of hours played:"


[1] "Summary statistics of age:"


[1] "Percentage of different experience levels of players:"


| experience (`<fct>`) | percentage_experience (`<dbl>`) |
|---|---|
| Amateur | 32.14 |
| Veteran | 24.49 |
| Regular | 18.37 |
| Beginner | 17.86 |
| Pro | 7.14 |


[1] "Percentage of different gender identifying players:"


| gender (`<fct>`) | percentage_gender (`<dbl>`) |
|---|---|
| Male | 63.27 |
| Female | 18.88 |
| Non-binary | 7.65 |
| Prefer not to say | 5.61 |
| Two-Spirited | 3.06 |
| Agender | 1.02 |
| Other | 0.51 |
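The experience and gender breakdowns repeat the same steps (count each level, divide by the total, round, sort), so they could share one helper. The following is a sketch under stated assumptions: it needs dplyr 1.0 or later, `pct_breakdown()` is a hypothetical name rather than anything from the original analysis, and the four-row tibble is made-up data.

```r
library(dplyr)

# Hypothetical helper: percentage of rows falling in each level of one column
pct_breakdown <- function(data, column) {
    data |>
        count({{ column }}, name = "n") |>
        mutate(percentage = round(n / sum(n) * 100, 2)) |>
        select(-n) |>
        arrange(desc(percentage))
}

# Made-up rows for illustration only
toy_players <- tibble(
    experience = c("Amateur", "Amateur", "Pro", "Veteran"),
    gender     = c("Male", "Female", "Male", "Male")
)

exp_pct <- pct_breakdown(toy_players, experience)
gender_pct <- pct_breakdown(toy_players, gender)
exp_pct
gender_pct
```

Dividing by `sum(n)` gives the same denominator as `nrow()` on the input, so the helper should agree with the percentages computed cell by cell.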


### sessions.csv
This data set includes information about every play session. It consists of the following five variables:

- `hashedEmail` - hashed email of player (same as in `players.csv`, to identify player)
- `start_time` - start time of player session (dd/mm/yyyy hh:mm)
- `end_time` - end time of player session (dd/mm/yyyy hh:mm)
- `original_start_time`
- `original_end_time`

In [74]:
sessions <- read_csv(file = "data/sessions.csv")

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


The dataset has a total of 1535 observations. The variables `hashedEmail`, `start_time`, and `end_time` are character vectors. The variables `original_start_time` and `original_end_time` are numeric vectors.
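Because `hashedEmail` appears in both files, session-level records can be linked back to player-level variables. The following is a minimal sketch of that link, assuming dplyr 1.0 or later. The rows and the name `sessions_per_player` are made up for illustration, and only the columns needed for the join are included.

```r
library(dplyr)

# Made-up rows; hashedEmail is the shared key between the two tables
toy_players <- tibble(
    hashedEmail = c("abc123", "def456"),
    experience  = c("Pro", "Amateur")
)
toy_sessions <- tibble(
    hashedEmail = c("abc123", "abc123", "def456"),
    start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 20:15")
)

# Count sessions per player, then attach the player-level variables
sessions_per_player <- toy_sessions |>
    count(hashedEmail, name = "n_sessions") |>
    left_join(toy_players, by = "hashedEmail")
sessions_per_player
```

Starting from the sessions table drops any player with no recorded session; starting from the players table instead would keep them with a missing count.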

This dataset seems to present some issues:
1. It is unclear how `original_start_time` and `original_end_time` are measured.
2. `original_start_time` and `original_end_time` may not be sufficient for calculating relevant values like play time because of their very large values. For example, in observation 1 a difference between start and end can be calculated from `start_time` and `end_time`, but not from `original_start_time` and `original_end_time`, which are identical as displayed.

In [75]:
obs_1 <- sessions |>
    select(-hashedEmail) |>
    slice(1)
obs_1

| start_time (`<chr>`) | end_time (`<chr>`) | original_start_time (`<dbl>`) | original_end_time (`<dbl>`) |
|---|---|---|---|
| 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000.0 | 1719770000000.0 |


3. The data is untidy: each value of `start_time` and `end_time` combines the day, month, year, hour, and minute in a single character string, so the values must be parsed before any time calculation.
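One possible way to work around these issues is sketched below, using the first observation as displayed. It assumes lubridate and dplyr are installed and that the session times can be treated as UTC, which the data does not state. It also assumes, based only on the magnitude of the numbers, that the `original_*` columns are Unix timestamps in milliseconds; that reading is a guess, not something the data set documents.

```r
library(dplyr)
library(lubridate)

# Observation 1 of sessions.csv, copied from the displayed (rounded) values
toy_sessions <- tibble(
    start_time          = "30/06/2024 18:12",
    end_time            = "30/06/2024 18:24",
    original_start_time = 1719770000000,
    original_end_time   = 1719770000000
)

parsed <- toy_sessions |>
    mutate(
        # Parse "dd/mm/yyyy hh:mm" strings; the UTC time zone is an assumption
        start_dt = dmy_hm(start_time, tz = "UTC"),
        end_dt   = dmy_hm(end_time, tz = "UTC"),
        # Session length in minutes from the parsed date-times
        duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins")),
        # ASSUMPTION: original_start_time is a Unix timestamp in milliseconds
        original_start_dt = as_datetime(original_start_time / 1000, tz = "UTC")
    )
parsed$duration_mins
parsed$original_start_dt
```

For this row the parsed strings give a 12-minute session, while the two displayed `original_*` values are equal and would give zero. Under the millisecond assumption the displayed value falls on 30 June 2024, the same day as `start_time`, which is consistent with that reading but does not confirm it.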

## Questions

## Exploratory Data Analysis and Visualization

## Methods and Plan