In [None]:
library(tidyverse)
library(repr)
library(dplyr) 
library(tidymodels)
options(repr.matrix.max.rows = 6)

We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.



Sessions (sessions.csv): This data was likely collected by an automated logging system embedded in the server.

Variable summary: sessions.csv has 5 variables and 1535 observations.

| Variable name       | Data type                     | Description                                         |
| :--------           | :-------                      | :---------                                          |
| hashedEmail         | Categorical                   | Hashed email address identifying the player         |
| start_time          | Date-time (read as character) | Start of a session, formatted dd/mm/yyyy hh:mm      |
| end_time            | Date-time (read as character) | End of a session, formatted dd/mm/yyyy hh:mm        |
| original_start_time | Numerical                     | Session start as a Unix timestamp in milliseconds   |
| original_end_time   | Numerical                     | Session end as a Unix timestamp in milliseconds     |

Summary statistics:

Session duration (minutes):

| Count | Mean  | Median | Standard deviation | Variance | Minimum | Maximum |
| :---- | :---  | :----  | :----              | :----    | :----   | :---    |
| 1533  | 48.49 | 0      | 79.69              | 6350.52  | 0       | 333.33  |


Sessions per user: 


| Mean  | Minimum | Maximum |
| :---- | :---    | :----   |
| 12.28 | 1       | 310     |


Issues: 
1. End times are missing for two rows (681 and 1019).
2. It is unclear what constitutes a session. A long playtime can mean someone genuinely playing or someone forgetting to close the game. Are there any built-in activity trackers?
3. The period over which each individual played is unknown. A player could have 50 sessions all within the same month, or 50 sessions spread across all 4 months. Session count does not say much on its own; more analysis is needed to find each player's sessions and their frequency.
4. High variance in session duration (6350.52, with a standard deviation of 79.69 minutes), which makes duration a noisy basis for prediction.
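
The missing end times in issue 1 can be located programmatically rather than by scrolling. The following is a minimal sketch on a made-up stand-in for sessions.csv; `toy_sessions` and its values are hypothetical, and only the column names come from the real file.

```r
library(dplyr)

# Hypothetical toy data standing in for sessions.csv (values are made up).
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b", "c"),
  start_time  = c("01/07/2024 10:00", "02/07/2024 11:00",
                  "03/07/2024 12:00", "04/07/2024 13:00"),
  end_time    = c("01/07/2024 10:30", NA,
                  "03/07/2024 12:45", NA)
)

# Record the row index before filtering so it refers to the original row order.
missing_end <- toy_sessions |>
  mutate(row = row_number()) |>
  filter(is.na(end_time))

missing_end$row
```

Running the same pipeline on `sessions` should list the affected rows directly, which makes it easy to confirm that only two sessions are incomplete.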


Players (players.csv): This data was likely collected through a combination of surveys and an automated system embedded in the server.

Variable summary: players.csv has 7 variables and 196 observations.

| Variable name | Data type   | Description                                                           | Source        |
| :--------     | :-------    | :---------                                                            | :----         |
| experience    | Categorical | Player skill classification: beginner, amateur, regular, veteran, pro | self-reported |
| subscribe     | Boolean     | True or false; whether the player has subscribed to something         | from system   |
| hashedEmail   | Categorical | Hashed email address identifying the player                           | from system   |
| played_hours  | Numerical   | Hours of gameplay for each player                                     | from system   |
| name          | Categorical | Name or chosen name of the player                                     | self-reported |
| gender        | Categorical | Player's gender identity                                              | self-reported |
| Age           | Numerical   | Player's age in years                                                 | self-reported |

Summary statistics:


| Stat               | Age (years) | played_hours (hours) |
| :----              | :---        | :----                |
| Mean               | 21.14       | 5.85                 |
| Median             | 19          | 0.1                  |
| Standard deviation | 7.39        | 28.36                |
| Variance           | 54.61       | 804.14               |
| Minimum            | 9           | 0                    |
| Maximum            | 58          | 223.1                |

Issues: 
1. Age is missing for 2 of the 196 players.
2. Extreme outliers. The six highest values of played_hours range from 53.9 to 223.1 hours, against a median of 0.1. These outliers make the distribution heavily right-skewed, which limits the variable's usefulness in prediction. 
3. A quick scroll through the data shows a lot of players with 0 hours, including a supposed Veteran with 0 hours of playtime (row 103).
4. Much of the data here appears to be self-reported, which introduces bias and misrepresentation, since each player's definition of the 5 experience classifications will inevitably differ.
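
Issues 2 and 3 can be quantified instead of eyeballed. The sketch below runs on a made-up stand-in for players.csv (`toy_players` and its values are hypothetical). It counts the zero-hour players, breaks them down by experience level, and compares the mean with the median. The `log1p` summary is one possible way to dampen the outliers, shown here as an assumption rather than as the method used in this analysis.

```r
library(dplyr)

# Hypothetical toy data standing in for players.csv (values are made up).
toy_players <- tibble(
  experience   = c("Veteran", "Amateur", "Pro", "Beginner", "Regular", "Amateur"),
  played_hours = c(0, 0.1, 0, 1.5, 200, 0.3)
)

# How many players have no playtime, and how far apart are mean and median?
hours_summary <- toy_players |>
  summarise(
    n            = n(),
    n_zero       = sum(played_hours == 0),
    share_zero   = mean(played_hours == 0),
    mean_hours   = mean(played_hours),
    median_hours = median(played_hours),
    # log1p(x) = log(1 + x) is defined at 0 and compresses large values
    median_log1p = median(log1p(played_hours))
  )

# Which self-reported experience levels do the zero-hour players claim?
zero_by_experience <- toy_players |>
  filter(played_hours == 0) |>
  count(experience, name = "n_zero")

hours_summary
zero_by_experience
```

A mean far above the median, as in the real summary table (5.85 against 0.1), is the usual sign that a few heavy players dominate the average.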



Further notes on the sessions data: First, the variance in session duration is extremely high (6350.52, against a mean of 48.49 minutes), which indicates that the data is inconsistent and weakens its predictive power. That said, you can't really expect consistency in people's gaming times. I suppose this suggests that time spent won't be a good variable for answering the question.

Secondly, rows 681 and 1019 have no end time. The file also stores times in two different formats: a dd/mm/yyyy hh:mm string and a Unix timestamp. The Unix timestamps are displayed in scientific notation, which is jarring and unreadable for humans. 
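
Both formats can be converted to proper date-times, and the string columns give a second way to compute durations. This matters because the six longest sessions in the output are all listed at 333.33 minutes, even though the first one runs from 16:21 to 20:32, which is about 251 minutes by the clock. That suggests the millisecond columns may be stored at reduced precision, although the notebook does not confirm it. The sketch below assumes the strings are in UTC (the file does not state a time zone) and reuses two start and end pairs from that output. `example_ms` is a hypothetical value, not one read from the file.

```r
library(dplyr)
library(lubridate)

# Two start/end pairs copied from the "longest sessions" output.
toy_sessions <- tibble(
  start_time = c("30/06/2024 16:21", "01/07/2024 21:53"),
  end_time   = c("30/06/2024 20:32", "02/07/2024 02:05")
)

# Parse the dd/mm/yyyy hh:mm strings (assumed UTC) and take the difference in minutes.
durations <- toy_sessions |>
  mutate(
    start_dt      = dmy_hm(start_time),
    end_dt        = dmy_hm(end_time),
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

durations$duration_mins

# A Unix timestamp in milliseconds becomes readable after dividing by 1000.
example_ms <- 1.72e12            # hypothetical value for illustration
example_dt <- as_datetime(example_ms / 1000)
example_dt
```

If the two ways of computing duration disagree on the real data, the string-based durations are probably the safer basis for the summary statistics, at the cost of minute-level resolution.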

Thirdly, there's no clear indicator of what constitutes a session. Someone could leave the game open without doing anything in it, and it would still count as a session. I think this is a big issue if we want to use a large playtime as a predictor, as it could indicate either engagement or forgetfulness, the latter being detrimental to any analysis of the target audience. Forgetful players probably won't even notice that they are being targeted in recruiting efforts. 

Lastly, some users have an extremely high number of sessions, but how extreme it is depends on how long the observation period was. For example, if the observation period consisted of only 10 days, 310 sessions would most likely indicate bot activity, regardless of how long each session lasts. But if it were over 4 years (I doubt it), 310 sessions would be very casual and entirely human; it averages out to around 1.5 sessions per week. So bot activity is very hard to identify here, which is also an obstacle to finding the target audience: a player with 310 sessions could be averaging anywhere from 217 sessions per week down to 1.5, depending on the observation period, making it impossible to tell whether they are an active player, a casual player, or a bot. In other words, the session count is effectively meaningless without time bounds.
A high count could also be a one-off: a player might play for hours on end for a couple of months and then rapidly lose interest.
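
One way to give the session count time bounds is to compute, for each player, the span between their first and last session and a sessions-per-week rate. The sketch below runs on made-up data (`toy_sessions` and its values are hypothetical). It also assumes that a span shorter than one week should be treated as one week, so that a player seen on a single day does not get an inflated rate; that floor is a design choice, not something taken from the dataset.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data standing in for sessions.csv (values are made up).
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "a", "b", "b"),
  start_time  = c("01/07/2024 10:00", "08/07/2024 10:00", "15/07/2024 10:00",
                  "01/07/2024 09:00", "02/07/2024 09:00")
)

player_activity <- toy_sessions |>
  mutate(start_dt = dmy_hm(start_time)) |>
  group_by(hashedEmail) |>
  summarise(
    session_count = n(),
    first_session = min(start_dt),
    last_session  = max(start_dt),
    # Span between first and last session, floored at one week (assumption)
    span_weeks = pmax(
      as.numeric(difftime(last_session, first_session, units = "weeks")), 1
    ),
    sessions_per_week = session_count / span_weeks
  )

player_activity

# The two extremes from the text: 310 sessions over 10 days vs. over 4 years.
rate_10_days <- 310 / (10 / 7)   # sessions per week
rate_4_years <- 310 / (4 * 52)   # sessions per week
```

Sorting the real result by `sessions_per_week` rather than by `session_count` would separate short, intense bursts from steady long-term play.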

For this dataset, session count and session duration are the only possible predictors for the question posed. Given the multiple drawbacks above, I highly doubt this dataset will be used much in the model.



In [42]:
sessions <- read_csv("data/sessions.csv")
players <- read_csv("data/players.csv")

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [49]:
# Six highest played_hours values, to inspect the extreme outliers
players_hours <- players |>
    select(played_hours) |>
    arrange(desc(played_hours)) |>
    slice(1:6)
players_hours

player_age_stats<-players|>
    filter(!is.na(Age))|>
    summarise(
        count = n(),
        mean = round(mean(Age), 2),
        median = round(median(Age), 2),
        sd = round(sd(Age), 2),
        variance = round(var(Age), 2),
        min = round(min(Age),2),
        max = round(max(Age),2))
player_age_stats

player_time_stats<-players|>
    summarise(
        count = n(),
        mean = round(mean(played_hours), 2),
        median = round(median(played_hours), 2),
        sd = round(sd(played_hours), 2),
        variance = round(var(played_hours), 2),
        min = round(min(played_hours),2),
        max = round(max(played_hours),2))
player_time_stats

played_hours
<dbl>
223.1
218.1
178.2
150.0
56.1
53.9


count,mean,median,sd,variance,min,max
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
194,21.14,19,7.39,54.61,9,58


count,mean,median,sd,variance,min,max
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
196,5.85,0.1,28.36,804.14,0,223.1


In [43]:
session_stats<-sessions|>
    filter(!is.na(original_end_time))|>
    mutate(session_duration_mins = (original_end_time - original_start_time) / 60000)|>
    summarise(
        count = n(),
        mean = round(mean(session_duration_mins), 2),
        median = round(median(session_duration_mins), 2),
        sd = round(sd(session_duration_mins), 2),
        variance = round(var(session_duration_mins), 2),
        min = round(min(session_duration_mins),2),
        max = round(max(session_duration_mins),2))
session_stats

# Count sessions per player, most active first
user_session_count <- sessions |>
    count(hashedEmail, name = "session_count") |>
    arrange(desc(session_count))

head(user_session_count)

# Mean, maximum, and minimum number of sessions per player
mean_sessions_per_user <- user_session_count |>
    summarise(mean = mean(session_count),
              max = max(session_count),
              min = min(session_count))
mean_sessions_per_user

# Six longest sessions, to check for implausibly long durations
session_duration <- sessions|>
    filter(!is.na(original_end_time))|>
    mutate(session_duration_mins = (original_end_time - original_start_time) / 60000)|>
    arrange(desc(session_duration_mins))|>
    slice(1:6)|>
    select(hashedEmail, start_time, end_time, session_duration_mins)
session_duration

count,mean,median,sd,variance,min,max
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1533,48.49,0,79.69,6350.52,0,333.33


hashedEmail,session_count
<chr>,<int>
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,310
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,219
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,159
ad6390295640af1ed0e45ffc58a53b2d9074b0eea694b16210addd44d7c81f83,147
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,130
b622593d2ef8b337dc554acb307d04a88114f2bf453b18fb5d2c80052aeb2319,95


mean,max,min
<dbl>,<int>,<int>
12.28,310,1


hashedEmail,start_time,end_time,session_duration_mins
<chr>,<chr>,<chr>,<dbl>
b622593d2ef8b337dc554acb307d04a88114f2bf453b18fb5d2c80052aeb2319,30/06/2024 16:21,30/06/2024 20:32,333.3333
b622593d2ef8b337dc554acb307d04a88114f2bf453b18fb5d2c80052aeb2319,01/07/2024 21:53,02/07/2024 02:05,333.3333
b622593d2ef8b337dc554acb307d04a88114f2bf453b18fb5d2c80052aeb2319,29/08/2024 01:17,29/08/2024 05:32,333.3333
b622593d2ef8b337dc554acb307d04a88114f2bf453b18fb5d2c80052aeb2319,03/08/2024 04:59,03/08/2024 09:12,333.3333
b622593d2ef8b337dc554acb307d04a88114f2bf453b18fb5d2c80052aeb2319,30/08/2024 21:36,31/08/2024 01:14,333.3333
b622593d2ef8b337dc554acb307d04a88114f2bf453b18fb5d2c80052aeb2319,03/08/2024 21:36,04/08/2024 01:51,333.3333
