# Individual Plan

https://github.com/sylsh7/dsci-100-individualplan.git

## Importing Data

In [9]:
library(tidyverse)

# Load the play-session records (one row per session).
url_sessions <- "https://drive.google.com/uc?export=download&id=1_aMpg_TZTyEj9sJu9264Zt4foR91Wcai"
sessions <- read_csv(url_sessions)
print("Sessions:")
head(sessions)

# Load the player records (one row per player).
url_players <- "https://drive.google.com/uc?export=download&id=1ukKjI8m_bL_16jLSki9lDcT1dMNaPLYL"
players <- read_csv(url_players)
print("Players:")
head(players)

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


[1] "Sessions:"


| hashedEmail `<chr>` | start_time `<chr>` | end_time `<chr>` | original_start_time `<dbl>` | original_end_time `<dbl>` |
|---|---|---|---|---|
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000.0 | 1719770000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 17/06/2024 23:33 | 17/06/2024 23:46 | 1718670000000.0 | 1718670000000.0 |
| f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc | 25/07/2024 17:34 | 25/07/2024 17:57 | 1721930000000.0 | 1721930000000.0 |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 25/07/2024 03:22 | 25/07/2024 03:58 | 1721880000000.0 | 1721880000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 25/05/2024 16:01 | 25/05/2024 16:12 | 1716650000000.0 | 1716650000000.0 |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 23/06/2024 15:08 | 23/06/2024 17:10 | 1719160000000.0 | 1719160000000.0 |


Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


[1] "Players:"


| experience `<chr>` | subscribe `<lgl>` | hashedEmail `<chr>` | played_hours `<dbl>` | name `<chr>` | gender `<chr>` | Age `<dbl>` |
|---|---|---|---|---|---|---|
| Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | TRUE | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | TRUE | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | TRUE | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


In [10]:
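# List the distinct categories of the two categorical variables.
distinct(players, experience)
distinct(players, gender)

# Show the ten most common ages.
players |>
  count(Age, sort = TRUE) |>
  head(10)

# Find the range of hours played.
players |>
  summarize(
    min_hours = min(played_hours, na.rm = TRUE),
    max_hours = max(played_hours, na.rm = TRUE)
  )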

| experience `<chr>` |
|---|
| Pro |
| Veteran |
| Amateur |
| Regular |
| Beginner |


| gender `<chr>` |
|---|
| Male |
| Female |
| Non-binary |
| Prefer not to say |
| Agender |
| Two-Spirited |
| Other |


| Age `<dbl>` | n `<int>` |
|---|---|
| 17 | 73 |
| 21 | 18 |
| 22 | 15 |
| 20 | 14 |
| 23 | 13 |
| 24 | 10 |
| 18 | 7 |
| 19 | 7 |
| 26 | 4 |
| 16 | 3 |


| min_hours `<dbl>` | max_hours `<dbl>` |
|---|---|
| 0 | 223.1 |


## Data Description
### Sessions
The `sessions` data set has 1535 observations and 5 variables:  
1. `hashedEmail`: character; the hashed email address of the player. The same identifier appears in `players`.
2. `start_time`: character; the cleaned and standardized start time of each play session, formatted as `DD/MM/YYYY HH:MM`
3. `end_time`: character; the cleaned and standardized end time of each play session, in the same format
4. `original_start_time`: double; the raw, unedited timestamp recorded by the system when each play session began. The values appear to be Unix epoch times in milliseconds.
5. `original_end_time`: double; the raw, unedited timestamp recorded by the system when each play session ended, in the same units
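
Because `start_time` and `end_time` are stored as text, they would need to be parsed before any time arithmetic. The following is a minimal base R sketch on two rows copied from the preview above. The `sessions_sample` data frame, the UTC time zone, and the reading of the raw timestamps as milliseconds since 1970-01-01 are all assumptions made for illustration.

```r
# Hypothetical two-row sample copied from the `sessions` preview.
sessions_sample <- data.frame(
  start_time = c("30/06/2024 18:12", "23/06/2024 15:08"),
  end_time   = c("30/06/2024 18:24", "23/06/2024 17:10"),
  original_start_time = c(1719770000000, 1719160000000),
  stringsAsFactors = FALSE
)

# Parse the day/month/year strings into date-times (UTC is an assumption).
start <- as.POSIXct(sessions_sample$start_time, format = "%d/%m/%Y %H:%M", tz = "UTC")
end   <- as.POSIXct(sessions_sample$end_time,   format = "%d/%m/%Y %H:%M", tz = "UTC")

# Session length in minutes.
sessions_sample$duration_min <- as.numeric(difftime(end, start, units = "mins"))

# Interpret the raw values as milliseconds since 1970-01-01 (assumption).
sessions_sample$original_start <- as.POSIXct(
  sessions_sample$original_start_time / 1000,
  origin = "1970-01-01", tz = "UTC"
)

print(sessions_sample[, c("start_time", "duration_min", "original_start")])
```

Under these assumptions the converted raw timestamps fall on the same calendar dates as the parsed `start_time` values, which is consistent with the two pairs of columns describing the same events.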

### Players
The `players` data set has 196 observations and 7 variables:
1. `experience`: character; the experience level of each player, one of `Pro`, `Veteran`, `Amateur`, `Regular`, or `Beginner`
2. `subscribe`: logical; whether the player has subscribed to the game info (`TRUE` or `FALSE`)
3. `hashedEmail`: character; the hashed email address of the player
4. `played_hours`: double; the total play time of each player in hours, ranging from 0 to 223.1
5. `name`: character; the name of each player
6. `gender`: character; the gender of each player, one of `Male`, `Female`, `Non-binary`, `Prefer not to say`, `Agender`, `Two-Spirited`, or `Other`
7. `Age`: double; the age of each player, ranging from 9 to 58, with 2 missing values

## Potential Problems

In [3]:
players |>
  select(played_hours, Age) |>
  summary()

  played_hours          Age       
 Min.   :  0.000   Min.   : 9.00  
 1st Qu.:  0.000   1st Qu.:17.00  
 Median :  0.100   Median :19.00  
 Mean   :  5.846   Mean   :21.14  
 3rd Qu.:  0.600   3rd Qu.:22.75  
 Max.   :223.100   Max.   :58.00  
                   NA's   :2      

+ The numeric columns `played_hours` and `Age` are on different scales: `played_hours` ranges from 0 to 223.1, while `Age` ranges from 9 to 58. Their means and standard deviations also differ considerably. If both were used as predictors in a distance-based model such as k-NN, the variable with the larger scale (`played_hours`) would dominate the distance calculation and be treated as more important than it should be, so the variables should be standardized first.
+ The `original_start_time`/`original_end_time` columns and the `start_time`/`end_time` columns represent the same information in different formats. Keeping both pairs is redundant, so only one pair should be used in the analysis.
+ The column `experience` is categorical rather than numerical, so it must be encoded (for example with one-hot encoding) before it can be used in a model.
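
The scale problem in the first point can be handled by standardizing each numeric column. The snippet below is a minimal sketch using base R's `scale()` on six rows copied from the `players` preview; the `players_sample` and `players_scaled` names are hypothetical.

```r
# Hypothetical six-row sample copied from the `players` preview.
players_sample <- data.frame(
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  Age          = c(9, 17, 17, 21, 21, 17)
)

# scale() centres each column to mean 0 and rescales it to standard deviation 1.
players_scaled <- as.data.frame(scale(players_sample))

# After scaling, both columns are on a comparable scale.
print(round(colMeans(players_scaled), 10))
print(apply(players_scaled, 2, sd))
```

In a train/test workflow, the centre and scale would normally be estimated on the training set only and then applied to the testing set.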

## Questions
I will address broad question 2: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts. The specific question I'm focusing on is **Can a player's characteristics (experience) predict the total hours of data they contribute?**

This is a **regression** problem, because the response variable `played_hours` is numerical. The predictor `experience` is categorical, so I need to convert it into a numerical form before performing regression.  
  
I can think of two ways to process `experience`, and each has its own limitations.  
+ **One-hot encoding** converts `experience` into 5 columns, `Pro`, `Veteran`, `Amateur`, `Regular`, and `Beginner`, each taking the value 0 or 1. The limitation is that it creates extra columns and treats every category as completely separate, so the model can't tell that the categories follow an order.  
+ **Ordinal encoding** keeps `experience` as one column and assigns an integer to each level: `Pro`=5, `Veteran`=4, `Amateur`=3, `Regular`=2, `Beginner`=1. The limitation is that it forces a numerical order that may not be real. Treating categories as numbers also assumes equal spacing between them, which can mislead the model and create bias. 
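
The two encodings can be compared side by side. This is a minimal base R sketch, not the final preprocessing code: the `experience` vector is a hypothetical sample (the six values from the `players` preview plus one `Beginner`), and the level order is the one assumed in the ordinal-encoding bullet above.

```r
# Hypothetical sample of experience levels.
experience <- c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur", "Beginner")

# Fix the level order assumed above: Beginner < Regular < Amateur < Veteran < Pro.
exp_factor <- factor(
  experience,
  levels = c("Beginner", "Regular", "Amateur", "Veteran", "Pro")
)

# One-hot encoding: one 0/1 indicator column per level.
# "- 1" drops the intercept so that all five levels get a column.
one_hot <- model.matrix(~ exp_factor - 1)
colnames(one_hot) <- levels(exp_factor)

# Ordinal encoding: the position of each level in the assumed order (1 to 5).
ordinal <- as.integer(exp_factor)

print(one_hot)
print(ordinal)
```

Each row of `one_hot` contains exactly one 1, while `ordinal` collapses the same information into a single integer column, which is where the equal-spacing assumption comes from.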

I will perform both **k-NN** regression and **linear** regression, and compare their prediction errors to decide which model is better.  
I will split the data into *training* and *testing* sets.  
I will first use **cross-validation** on the *training* set to tune k and choose the best value for k-NN regression. After fitting the k-NN model with the best k on the *training* set, I will use the *testing* set to measure its error, assess how well the model performs, and describe what might be done to improve it.  
I will also train a **linear** regression model on the same scaled *training* set.  
Then, I will use the *testing* set to evaluate both models by calculating the **Root Mean Squared Error (RMSE)** for the k-NN model and for the linear regression model.  
Finally, I will compare the two RMSE values. The model with the **lower** RMSE will be considered the better predictor.
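
The plan above can be sketched end to end in base R. This is only an illustration under stated assumptions: the `toy` data are synthetic, not the real `players` data, `experience_num` stands for an ordinal-encoded `experience`, and the 75/25 split, the 5 folds, the grid of k values, and the helper names `rmse` and `knn_predict` are all hypothetical choices.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Synthetic stand-in for `players`: ordinal-encoded experience (1-5) and hours.
n <- 200
toy <- data.frame(experience_num = sample(1:5, n, replace = TRUE))
toy$played_hours <- 2 * toy$experience_num + rnorm(n, sd = 3)

# 75/25 train/test split (the proportion is an assumption).
train_idx <- sample(n, size = 0.75 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# k-NN regression: predict the mean response of the k nearest training rows.
# Ties in distance are broken by row order, which is a simplification.
knn_predict <- function(train_x, train_y, new_x, k) {
  sapply(new_x, function(x0) {
    nearest <- order(abs(train_x - x0))[seq_len(k)]
    mean(train_y[nearest])
  })
}

# 5-fold cross-validation on the training set to choose k.
folds <- sample(rep(1:5, length.out = nrow(train)))
k_grid <- c(1, 5, 10, 20, 40)
cv_rmse <- sapply(k_grid, function(k) {
  mean(sapply(1:5, function(f) {
    fit_rows <- train[folds != f, ]
    val_rows <- train[folds == f, ]
    pred <- knn_predict(fit_rows$experience_num, fit_rows$played_hours,
                        val_rows$experience_num, k)
    rmse(val_rows$played_hours, pred)
  }))
})
best_k <- k_grid[which.min(cv_rmse)]

# Evaluate both models once on the held-out test set.
knn_test_pred <- knn_predict(train$experience_num, train$played_hours,
                             test$experience_num, best_k)
lm_fit <- lm(played_hours ~ experience_num, data = train)
lm_test_pred <- predict(lm_fit, newdata = test)

results <- c(knn = rmse(test$played_hours, knn_test_pred),
             linear = rmse(test$played_hours, lm_test_pred))
print(best_k)
print(results)
```

The testing set is used only once, after k has been chosen, so the reported RMSE values are not influenced by the tuning step.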