# Individual Planning Report

DSCI 100 003  
Wendy Liao, Group 16

In [1]:
library(tidyverse) 
library(repr)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## 1. Data Description 

- The data come in two files: `sessions.csv` (Sessions) and `players.csv` (Players).

### 1. **Sessions**

- 5 variables
- 1535 observations
- every observation is complete (no `NA` values)

| Variable            | Data Type | Meaning                                                       |
| ------------------- | --------- | ------------------------------------------------------------- |
| hashedEmail         | chr       | hashed participant email                                      |
| start_time          | chr       | session start, as a dd/mm/yyyy date and a 24-hour clock time  |
| end_time            | chr       | session end, as a dd/mm/yyyy date and a 24-hour clock time    |
| original_start_time | dbl       | session start as a UNIX timestamp                             |
| original_end_time   | dbl       | session end as a UNIX timestamp                               |

- **Issues**
    1. variable name hashedEmail is not in snake_case
    2. start_time holds two variables, date and time, in a single column (e.g. 30/06/2024 18:12)
    3. end_time holds two variables, date and time, in a single column (e.g. 30/06/2024 18:24)
    4. start_time and end_time are chr; depending on the question, they may need to be converted to a numeric (dbl) or date-time type
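
The issues above could be addressed along the following lines. This is only a sketch on a two-row toy tibble, not the wrangling actually used in the project: `sessions_demo`, `sessions_tidy`, and the new column names (`hashed_email`, `start_date`, `start_hour`, `session_minutes`) are hypothetical, the second row is made up, and `dmy_hm()` assumes every value follows the dd/mm/yyyy hh:mm pattern and is parsed as UTC.

```r
library(dplyr)
library(lubridate)

# Toy rows in the same shape as sessions.csv.
# The first row reuses the example values from the issue list; the second is made up.
sessions_demo <- tibble::tibble(
  hashedEmail = c("abc123", "def456"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46")
)

sessions_tidy <- sessions_demo |>
  rename(hashed_email = hashedEmail) |>      # issue 1: snake_case name
  mutate(
    start_time = dmy_hm(start_time),         # issue 4: chr -> date-time (UTC assumed)
    end_time   = dmy_hm(end_time),
    start_date = as_date(start_time),        # issues 2-3: date in its own column
    start_hour = hour(start_time),           # ...and hour of day in its own column
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

sessions_tidy
```

Whether the date and time really need separate columns depends on the question being asked; keeping a single date-time column is often enough.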

### 2. **Players**

- 7 variables
- 196 observations (participants)
- at least one Age value is missing (`NA`)

  
| Variable     | Data Type | Meaning                                                                                  |
| ------------ | --------- | ---------------------------------------------------------------------------------------- |
| experience   | chr       | participant level of experience in Minecraft (Beginner, Amateur, Regular, Veteran, Pro)  |
| subscribe    | lgl       | whether or not (True/False) the participant is subscribed to a game-related newsletter   |
| hashedEmail  | chr       | hashed participant email                                                                 |
| played_hours | dbl       | hours played in Minecraft                                                                |
| name         | chr       | participant name                                                                         |
| gender       | chr       | participant gender                                                                       |
| Age          | dbl       | participant age                                                                          |

- **Issues**
    1. variable name hashedEmail is not in snake_case
    2. variable name Age is capitalized, which is inconsistent with the other snake_case names (a minor issue)
    3. missing value (`NA`) in the Age variable
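
One possible way to handle these issues is sketched below on a three-row toy tibble. The names `players_demo`, `players_tidy`, `n_missing_age`, and `players_with_age` are hypothetical, the values are made up, and dropping rows with a missing age is only one option; it may not suit every analysis.

```r
library(dplyr)

# Toy rows in the same shape as players.csv (values are made up for illustration)
players_demo <- tibble::tibble(
  hashedEmail  = c("abc123", "def456", "ghi789"),
  played_hours = c(0, 1.5, 223.1),
  Age          = c(9, NA, 58)
)

# Issues 1-2: make the column names consistently snake_case
players_tidy <- players_demo |>
  rename(hashed_email = hashedEmail, age = Age)

# Issue 3: count the missing ages before deciding how to treat them
n_missing_age <- players_tidy |>
  summarize(n_missing = sum(is.na(age))) |>
  pull(n_missing)

# One option: keep only rows with a known age, for analyses that use age
players_with_age <- players_tidy |>
  filter(!is.na(age))

n_missing_age
players_with_age
```

Running the same `sum(is.na(...))` check on the full players data would confirm exactly how many Age values are missing.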
       
<br>

- **Summary Statistics**


| Variable      | Mean    | Min | Max |
| ------------- | ------- | --- | --- |
| played_hours  | 5.85    |0    |223.1|
| Age           |  21.14  | 9   | 58  |

| Variable | # of True | # of False |
| -------- | --------- | ---------- |
| subscribe| 144       | 52         |

| Variable                  | Beginner | Amateur | Regular | Pro | Veteran |
| ------------------------- | -------- | ------- | ------- | --- | ------- |
| experience (# of players) | 35       | 63      | 36      | 14  | 48      |

In [5]:
sessions <- read_csv("sessions.csv")
players <- read_csv("players.csv")

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [23]:
# Min, max, and mean of hours played
played_hours_summary <- players |>
    select(played_hours) |>
    summarize(min_played_hours = min(played_hours),
              max_played_hours = max(played_hours),
              mean_played_hours = mean(played_hours))
played_hours_summary

# Min, max, and mean of age, ignoring missing values
ages_summary <- players |>
    select(Age) |>
    summarize(min_age = min(Age, na.rm = TRUE),
              max_age = max(Age, na.rm = TRUE),
              mean_age = mean(Age, na.rm = TRUE))
ages_summary

# Number of players by newsletter subscription status
subscribe_summary <- players |>
    select(subscribe, hashedEmail) |>
    group_by(subscribe = as_factor(subscribe)) |>
    summarize(count = n())
subscribe_summary

# Number of players at each experience level
experience_summary <- players |>
    select(experience, hashedEmail) |>
    group_by(experience = as_factor(experience)) |>
    summarize(count = n())
experience_summary

min_played_hours,max_played_hours,mean_played_hours
<dbl>,<dbl>,<dbl>
0,223.1,5.845918


min_age,max_age,mean_age
<dbl>,<dbl>,<dbl>
9,58,21.13918


subscribe,count
<fct>,<int>
False,52
True,144


experience,count
<fct>,<int>
Pro,14
Veteran,48
Amateur,63
Regular,36
Beginner,35
