Each report should include the following sections:

### Title
### Introduction:
- provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
- clearly state the question you tried to answer with your project
- identify and fully describe the dataset that was used to answer the question
### Methods & Results:
- describe the methods you used to perform your analysis from beginning to end, in a way that narrates the analysis code
- your report should include code which:
    - loads data
    - wrangles and cleans the data to the format necessary for the planned analysis
    - performs a summary of the dataset that is relevant for exploratory data analysis related to the planned analysis
    - creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
    - performs the data analysis
    - creates a visualization of the analysis
- note: all figures should have a figure number and a legend
### Discussion:
- summarize what you found
- discuss whether this is what you expected to find
- discuss what impact such findings could have
- discuss what future questions this could lead to
### References
You may include references if necessary, as long as they all have a consistent citation style.

Predicting `experience` from `Age` and `average_session_length`.


### DSCI 100 Final Report Group Project

Members: Ewan Maclachlan, Islam Soliman, Sun Lo, Vicky Tan

Group: 004-27

TA: Jordan Yu

In [2]:
library(tidyverse)
library(lubridate)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


### Part 1: Introduction

#### Background Information
This report explores the relationship between the attributes of Minecraft players and their expertise level in the game. The raw datasets `players_data` and `sessions_data` are used in the analysis. The data were collected by the Pacific Laboratory of Artificial Intelligence (PLAI), a research group in Computer Science at the University of British Columbia led by Frank Wood.

#### Question of Interest
- Can `average_session_length` and `Age` predict the `experience` level of a Minecraft player?

#### Data Description of `players_data`
- describes each player; the data were collected by recording players' actions on Minecraft servers
- 196 observations
- 7 variables

##### Name of Variables and Types
- `experience` is character type, meaning it represents text values
- `subscribe` is logical type, meaning it contains boolean values, TRUE or FALSE, for each corresponding observation
- `hashedEmail` is character type, meaning it stores text values
- `played_hours` is double (numeric) type, meaning it records decimal numbers
- `name` is character type, meaning it displays text values
- `gender` is character type, meaning it shows text values
- `Age` is double (numeric) type, meaning it stores numbers, which in this case are whole numbers
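
The types listed above can be checked directly in R. The following is a minimal sketch that assumes a small stand-in tibble, `players_example`, which is hypothetical and only imitates a few columns of `players_data`. It also shows one possible way to convert `experience` to a factor, which is an assumption about a later modelling step rather than something done elsewhere in this report.

```r
library(tibble)
library(dplyr)

# Hypothetical three-row stand-in for players_data (illustrative values only)
players_example <- tibble(
    experience = c("Pro", "Veteran", "Amateur"),
    subscribe = c(TRUE, FALSE, TRUE),
    played_hours = c(30.3, 3.8, 0.7),
    Age = c(9, 17, 21)
)

# Report the storage type of every column as a named character vector
column_types <- sapply(players_example, typeof)
print(column_types)

# Assumption: a categorical response is usually stored as a factor for modelling,
# so the text labels are converted here
players_example <- players_example |>
    mutate(experience = as.factor(experience))
```

Calling `glimpse(players_data)` on the real dataset would give a similar overview of the column types in a single step.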

##### Variable Description
- `experience` is the level of expertise
- `subscribe` is whether or not the player has a subscription
- `hashedEmail` is their email identifier
- `played_hours` is the number of hours spent on the game
- `name` is the player's name
- `gender` is the player's gender
- `Age` is the player's age in years

#### Data Description of `sessions_data`
- records the start and end times, including dates, of each player's Minecraft gaming session, formatted both as DD/MM/YYYY HH:MM and as UNIX time
- 1535 observations
- 5 variables

##### Name of Variables and Types
- `hashedEmail` is character type, meaning it stores text values
- `start_time` and `end_time` are character types, meaning they contain text values
  - however, they can be converted into POSIXct, a date-time class in R
- `original_start_time` and `original_end_time` are double (numeric) type, meaning they store numbers, which here represent milliseconds
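
As a rough illustration of the two time formats described above, the sketch below parses one text timestamp and one UNIX timestamp with `lubridate`. The example values are written in the same formats as `sessions_data`. The variable names are hypothetical, and the use of UTC (the `lubridate` default) is an assumption, since the time zone of the recordings is not stated.

```r
library(lubridate)

# Example values in the same formats as sessions_data
start_text <- "30/06/2024 18:12"
end_text <- "30/06/2024 18:24"
unix_ms <- 1719770000000

# dmy_hm() parses day/month/year hour:minute text into POSIXct (UTC unless tz is given)
start_parsed <- dmy_hm(start_text)
end_parsed <- dmy_hm(end_text)

# Difference between the two parsed times, expressed in minutes
session_minutes <- as.numeric(difftime(end_parsed, start_parsed, units = "mins"))

# UNIX time here is in milliseconds, so divide by 1000 to get seconds since the epoch
unix_parsed <- as_datetime(unix_ms / 1000)
```

Here `session_minutes` evaluates to 12, matching the gap between 18:12 and 18:24.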

##### Variable Description
- `hashedEmail` is their email identifier
- `start_time` is the time when the player started gaming, down to the exact minute of the date
- `end_time` is the time when the player stopped gaming, down to the exact minute of the date
- `original_start_time` is the session's start in UNIX time (milliseconds)
- `original_end_time` is the session's end in UNIX time (milliseconds)

### Part 2: Methods & Results

The datasets, `players_data` and `sessions_data`, have been loaded into R below.

In [3]:
url_players <- "https://raw.githubusercontent.com/vckytn22/DSCI-100-Final-Report-Group-Project-004-27-/refs/heads/main/players.csv"
players_data <- read_csv(url_players)
head(players_data)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


In [4]:
url_sessions <- "https://raw.githubusercontent.com/vckytn22/DSCI-100-Final-Report-Group-Project-004-27-/refs/heads/main/sessions.csv"
sessions_data <- read_csv(url_sessions)
head(sessions_data)

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


#### Wrangling Data
- `sessions_data` is wrangled to produce the `average_session_length` (in minutes) of each player
- `players_data` and `average_session_length_data` are then joined by matching players' hashed emails
    - this ensures that each `average_session_length` is paired with the correct email identifier and thus the correct `experience` category

In [10]:
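# Parse the day/month/year hour:minute text into date-times
sessions_data_converted <- sessions_data |>
    mutate(
        start_time = dmy_hm(start_time),
        end_time = dmy_hm(end_time))

# Compute the length of each session in minutes
session_length_data <- sessions_data_converted |>
    mutate(session_length = as.numeric(difftime(end_time, start_time, units = "mins")))

# Average the session lengths for each player, ignoring missing values
average_session_length_data <- session_length_data |>
    group_by(hashedEmail) |>
    summarize(average_session_length = mean(session_length, na.rm = TRUE))

# Attach each player's average to their record and keep only the variables of interest;
# players with no matching sessions get NA for average_session_length
tidy_data <- players_data |>
    left_join(average_session_length_data, by = "hashedEmail") |>
    select(experience, Age, average_session_length)
head(tidy_data)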

experience,Age,average_session_length
<chr>,<dbl>,<dbl>
Pro,9,74.77778
Veteran,17,85.0
Veteran,17,5.0
Amateur,21,50.0
Regular,21,9.0
Amateur,17,
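
The last row of the output above has no `average_session_length`, most likely because `left_join()` keeps players who have no matching sessions and fills the missing value with `NA`. The sketch below shows one possible way to count and drop such rows. It uses a hypothetical stand-in tibble, `tidy_example`, that mirrors the six rows printed above. Whether to drop or otherwise handle these players is a choice the analysis would need to justify, so this is an assumption and not the method of this report.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical stand-in mirroring the six rows of head(tidy_data)
tidy_example <- tibble(
    experience = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"),
    Age = c(9, 17, 17, 21, 21, 17),
    average_session_length = c(74.77778, 85, 5, 50, 9, NA)
)

# Count the players with no average session length
n_missing <- tidy_example |>
    filter(is.na(average_session_length)) |>
    nrow()

# Keep only the rows where both predictors are present
complete_example <- tidy_example |>
    drop_na(Age, average_session_length)
```

On this stand-in, one row is missing a value and five rows remain after `drop_na()`.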
