## Rama Almoustadi's Individual Planning Report

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [36]:
# load both datasets (players must be read in, since it is used below)
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

#used this code to calculate the summary statistics for players.csv
summ_stats_players <- players |>
    summary()

# count the players at each experience level by filtering and counting rows
amateur_count <- players |>
    filter(experience == "Amateur") |>
    nrow()
amateur_count

beginner_count <- players |>
    filter(experience == "Beginner") |>
    nrow()
beginner_count 

regular_count <- players |>
    filter(experience == "Regular") |>
    nrow()
regular_count 

pro_count <- players |>
    filter(experience == "Pro") |>
    nrow()
pro_count 

veteran_count <- players |>
    filter(experience == "Veteran") |>
    nrow()
veteran_count 

[1mRows: [22m[34m1535[39m [1mColumns: [22m[34m5[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (3): hashedEmail, start_time, end_time
[32mdbl[39m (2): original_start_time, original_end_time

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience
<chr>
Pro
Veteran
Amateur
Regular
Beginner


### 1) Data Descriptions - two files

#### a) players.csv

Contains 196 observations (rows) and 7 variables (columns)

Variables, their types, and descriptions:

- `experience` - character type, describing the player's experience level: amateur, beginner, regular, pro, or veteran
- `subscribe` - logical type, whether the player is subscribed to the game-related newsletter
- `hashedEmail` - character type, a unique string of characters that stands in for the player's email address
- `played_hours` - double type, the total game time (in hours) by a player
- `name` - character type, the players' names
- `gender` - character type, the players' genders
- `Age` - double type, the age of the players

These data were likely collected through self-reports by each player (experience, name, gender, age, email, subscribe), while the remaining variable, `played_hours`, was likely derived from recorded play-time data.

The data are in a tidy format, but some variables may have unsuitable data types depending on the analysis that follows. For instance, `Age` would be better stored as an integer, and `experience` as a factor.
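The type changes suggested above could be sketched as follows. This is a minimal illustration on a made-up `players_toy` tibble (a hypothetical stand-in for players.csv), and the ordering of the factor levels is an assumption, since the data set itself does not state one.

```r
library(tidyverse)

# Hypothetical toy rows standing in for players.csv (values are made up)
players_toy <- tibble(
  experience = c("Pro", "Veteran", "Amateur", "Regular", "Beginner", "Amateur"),
  Age = c(9, 17, 21, 21, 58, 17)
)

players_typed <- players_toy |>
  mutate(
    # whole-number ages fit an integer column
    Age = as.integer(Age),
    # assumed level order; adjust if a different ordering is intended
    experience = factor(
      experience,
      levels = c("Beginner", "Amateur", "Regular", "Pro", "Veteran")
    )
  )

players_typed
```

Storing `experience` as a factor keeps the set of allowed levels explicit, which helps when grouping or plotting by experience level.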

#### Summary statistics

| Summary Statistic | Age | Played Hours |
|-------------------|-----|---------------|
| Maximum | 58 | 223.1 |
| Minimum | 9 | 0 |
| Mean | 21 | 5.8 |



| Experience Level | # of Players |
|------------------|--------------|
| Amateur | 63 |
| Beginner | 35 |
| Regular | 36 |
| Pro | 14 |
| Veteran | 48 |
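
As an alternative to filtering once per experience level, tables like the two above could be produced in a single pass with `summarise()` and `count()`. The sketch below runs on a made-up `players_toy` tibble (hypothetical values, not the real data), so its numbers will not match the tables.

```r
library(tidyverse)

# Hypothetical toy rows standing in for players.csv (values are made up)
players_toy <- tibble(
  experience = c("Amateur", "Amateur", "Beginner", "Pro", "Veteran", "Regular"),
  Age = c(17, 21, NA, 25, 19, 22),
  played_hours = c(0.0, 1.5, 0.2, 30.3, 3.8, 0.1)
)

# maximum, minimum, and mean of both numeric columns, ignoring missing values
summary_stats <- players_toy |>
  summarise(
    across(
      c(Age, played_hours),
      list(
        max = \(x) max(x, na.rm = TRUE),
        min = \(x) min(x, na.rm = TRUE),
        mean = \(x) mean(x, na.rm = TRUE)
      )
    )
  )

# one row per experience level with the number of players in it
experience_counts <- players_toy |>
  count(experience, name = "n_players")

summary_stats
experience_counts
```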


#### b) sessions.csv

Contains 1535 observations (rows) and 5 variables (columns)

Variables, their types, and descriptions: 

- `hashedEmail` - character type, a unique string of characters that stands in for each player's email address
- `start_time` - character type, the start time of a play session
- `end_time` - character type, the end time of a play session
- `original_start_time` - double type, the value of the start time of a play session in milliseconds
- `original_end_time` - double type, the value of the end time of a play session in milliseconds

These data were collected by logging the players' gameplay on the Minecraft server, recording the start and end time of each play session.

The format is untidy: the columns `start_time` and `end_time` each hold both a date and a time in a single column. These values should be separated into different columns. Once they are split, the date columns should be converted to the `Date` type.
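
A minimal sketch of that split is shown here. It assumes the timestamps are stored as `"dd/mm/yyyy hh:mm"` strings, which should be checked against the real file; `sessions_toy` and the new column names are hypothetical.

```r
library(tidyverse)
library(lubridate)

# Hypothetical toy rows; the "dd/mm/yyyy hh:mm" format is an assumption
sessions_toy <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46")
)

sessions_tidy <- sessions_toy |>
  # split each timestamp at the space into a date part and a clock part
  separate(start_time, into = c("start_date", "start_clock"), sep = " ") |>
  separate(end_time, into = c("end_date", "end_clock"), sep = " ") |>
  # parse the day/month/year strings into Date values
  mutate(
    start_date = dmy(start_date),
    end_date = dmy(end_date)
  )

sessions_tidy
```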

#### Summary statistics

Since the data are in an untidy format and have not yet been converted to suitable data types, no meaningful summary statistics can be computed at this stage.
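
Once the timestamps are parsed, a statistic such as session length would become available. The sketch below is only an illustration under the same assumed `"dd/mm/yyyy hh:mm"` format, on made-up rows in a hypothetical `sessions_toy` tibble.

```r
library(tidyverse)
library(lubridate)

# Hypothetical toy rows; the timestamp format is an assumption
sessions_toy <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46")
)

session_summary <- sessions_toy |>
  mutate(
    # parse the strings into date-times
    start = dmy_hm(start_time),
    end = dmy_hm(end_time),
    # session length in minutes
    session_mins = as.numeric(difftime(end, start, units = "mins"))
  ) |>
  summarise(
    mean_mins = mean(session_mins),
    max_mins = max(session_mins)
  )

session_summary
```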

### 2) Questions

**Broad:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific:** Can the age of a player predict if they will subscribe to a game-related newsletter?

Using the `Age` and `subscribe` variables from players.csv, a k-nn classification model could be used to explore this question. The data must be wrangled first: `Age` should be converted to an integer, and `subscribe` (the categorical response variable) should be converted to a factor with renamed levels.
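
One way this plan might look in code is sketched below, using the tidymodels framework with the kknn engine. That choice is an assumption, not something the report specifies. The data are simulated (`players_toy` is hypothetical), and the level names, the 75/25 split, and `neighbors = 5` are illustrative choices rather than tuned values.

```r
library(tidyverse)
library(tidymodels)

set.seed(1)

# Simulated stand-in for players.csv (values are made up)
players_toy <- tibble(
  Age = sample(10:50, 60, replace = TRUE),
  subscribe = sample(c(TRUE, FALSE), 60, replace = TRUE, prob = c(0.7, 0.3))
)

# wrangle: integer age, factor response with renamed levels
players_model <- players_toy |>
  mutate(
    Age = as.integer(Age),
    subscribe = factor(
      subscribe,
      levels = c(TRUE, FALSE),
      labels = c("Subscribed", "Not subscribed")
    )
  )

# hold out 25% of the rows for evaluation
players_split <- initial_split(players_model, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# k-nn classifier with an illustrative k of 5
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# standardise the predictor so distances are on a common scale
knn_recipe <- recipe(subscribe ~ Age, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_train)

# predicted class for each held-out player, next to the true label
preds <- predict(knn_fit, players_test) |>
  bind_cols(players_test)

preds |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")
```

In a full analysis, the number of neighbours would normally be chosen by cross-validation on the training set rather than fixed in advance.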
