In [None]:
library(tidyverse)
library(repr)
library(dplyr) 
library(tidymodels)
options(repr.matrix.max.rows = 6)

Session: This is likely collected via an automated system that they embedded within the server.

Variable summary:
`sessions.csv` has 5 variables and 1535 observations.

| Variable name       | Data type   | Description                                 |
| :------------------ | :---------- | :------------------------------------------ |
| hashedEmail         | Categorical | User identification via email               |
| start_time          | Date-time   | Beginning of a session in dd/mm/yyyy, hh:mm |
| end_time            | Date-time   | End of a session in dd/mm/yyyy, hh:mm       |
| original_start_time | Numerical   | Start time recorded in milliseconds         |
| original_end_time   | Numerical   | End time recorded in milliseconds           |

Summary statistics:

Session duration (minutes):

| Count | Mean  | Median | Standard deviation | Variance | Minimum | Maximum |
| :---- | :---- | :----- | :----------------- | :------- | :------ | :------ |
| 1533  | 48.49 | 0      | 79.69              | 6350.52  | 0       | 333.33  |


Sessions per user: 


| Mean  | Minimum | Maximum |
| :---- | :------ | :------ |
| 12.28 | 1       | 310     |


Issues: 
1. End times are missing for some rows, such as 681 and 1019.
2. It is unclear what constitutes a session. A long playtime could be someone genuinely playing or someone who forgot to close the game. Does the server have any built-in activity trackers?
3. The period over which each individual played is unknown. One player's 50 sessions could all fall within the same month, while another's 50 sessions could be spread across all 4 months. Session count does not say much on its own; more analysis is needed to find each player's session frequency.
4. The variance is high, which makes this dataset hard to use for prediction.
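Issue 3 could be explored by computing, for each player, the span between their first and last session and a rough sessions-per-week rate. The following is a minimal sketch on made-up toy rows (the emails and dates are invented; only the column names `hashedEmail` and `start_time` come from `sessions.csv`), assuming `start_time` is stored as a dd/mm/yyyy hh:mm string.

```r
library(dplyr)
library(lubridate)

# Toy data with invented values; column names mirror sessions.csv
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "a", "b", "b"),
  start_time = c("01/05/2024 10:00", "08/05/2024 10:00", "15/05/2024 10:00",
                 "01/05/2024 09:00", "29/08/2024 09:00")
)

player_activity <- toy_sessions |>
  mutate(start_parsed = dmy_hm(start_time)) |>
  group_by(hashedEmail) |>
  summarise(
    session_count = n(),
    first_session = min(start_parsed),
    last_session = max(start_parsed),
    # Days between a player's first and last session
    active_days = as.numeric(difftime(last_session, first_session, units = "days")),
    .groups = "drop"
  ) |>
  # Treat any span shorter than one week as one week to avoid dividing by zero
  mutate(sessions_per_week = session_count / pmax(active_days / 7, 1))

player_activity
```

Flooring the span at one week is an arbitrary choice made for this sketch; players with a single session would otherwise have a span of zero.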


Player: This is likely collected via a combination of surveys and an automated system that they embedded within the server.

Variable summary: `players.csv` has 7 variables and 196 observations.

| Variable name | Data type   | Description                                                           | Source        |
| :------------ | :---------- | :-------------------------------------------------------------------- | :------------ |
| experience    | Categorical | Player skill classification: beginner, amateur, regular, veteran, pro | self-reported |
| subscribe     | Boolean     | True or false, whether a player has subscribed to something           | from system   |
| hashedEmail   | Categorical | User identification via email                                         | from system   |
| played_hours  | Numerical   | Hours of gameplay for each player                                     | from system   |
| name          | Categorical | Name/chosen name of player                                            | self-reported |
| gender        | Categorical | Player's identity                                                     | self-reported |
| Age           | Numerical   | Player's age in years                                                 | self-reported |

Summary statistics:


| Stat               | Age   | played_hours |
| :----------------- | :---- | :----------- |
| Mean               | 21.14 | 5.85         |
| Median             | 19    | 0.1          |
| Standard deviation | 7.39  | 28.36        |
| Variance           | 54.61 | 804.14       |
| Minimum            | 9     | 0            |
| Maximum            | 58    | 223.1        |

Issues: 
1. Some ages are missing.
2. There are extreme outliers. The six highest played_hours values range from 53 to 223 hours. These outliers heavily skew the distribution and thus limit the data's use in prediction.
3. A quick scroll through the data shows a lot of players with 0 hours, including a supposed Veteran with 0 hours of playtime (103).
4. Much of the data here appears to be self-reported, which introduces bias and misrepresentation: everyone's definition of the 5 experience classifications will inevitably differ unless a very specific metric was provided for that question. Players could also lie about their age, and nobody would know unless the survey asked for direct identification. So the reliability of the data is highly uncertain.
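Issues 2 and 3 can be quantified rather than eyeballed. The sketch below counts zero-hour players and compares the mean with the median on a made-up toy table (the rows are invented for illustration; only the column names `experience` and `played_hours` come from `players.csv`).

```r
library(dplyr)

# Toy data with invented values; column names mirror players.csv
toy_players <- tibble(
  experience = c("Veteran", "Pro", "Amateur", "Beginner", "Regular", "Amateur"),
  played_hours = c(0, 0.1, 223.1, 0, 1.5, 53.9)
)

hours_summary <- toy_players |>
  summarise(
    n_players = n(),
    n_zero_hours = sum(played_hours == 0),     # players who never played
    share_zero = mean(played_hours == 0),      # proportion of zero-hour players
    median_hours = median(played_hours),
    mean_hours = mean(played_hours)            # pulled upward by the outliers
  )

hours_summary
```

A mean far above the median, as in the real summary table (5.85 versus 0.1), is a sign of a strongly right-skewed distribution.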



Issues: First, the variance in the data is extremely high, which indicates that session times are inconsistent and weakens their predictive power. That said, you can't really expect consistency in people's gaming times. This suggests that time spent won't be a good variable for answering the question.

Secondly, rows 681 and 1019 have no end time. The dataset also mixes time formats: dd/mm/yyyy hh:mm strings and Unix timestamps in milliseconds. The Unix values are displayed in scientific notation, which is jarring and unreadable for humans.
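Both formats can be brought into the same date-time type for comparison. This is a minimal sketch on made-up values (the two rows are invented; the column names come from `sessions.csv`), assuming the strings are dd/mm/yyyy hh:mm and the Unix values are milliseconds since 1970-01-01 UTC. The time zone of `start_time` is not documented, so UTC here is an assumption.

```r
library(dplyr)
library(lubridate)

# Toy data with invented values; column names mirror sessions.csv
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  original_start_time = c(1.71977e12, 1.71867e12)
)

toy_readable <- toy_sessions |>
  mutate(
    # Parse the day/month/year hour:minute string
    start_parsed = dmy_hm(start_time, tz = "UTC"),
    # Unix milliseconds -> seconds -> date-time
    start_from_unix = as_datetime(original_start_time / 1000, tz = "UTC")
  )

toy_readable
```

If the two columns disagree by more than the time-zone offset, that would suggest the Unix values were rounded when the file was written.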

Thirdly, there's no clear indicator of what constitutes a session. Someone could leave the game open without doing anything, and it would still count as a session. I think this is a big issue if we want to use large playtime as a predictor, as it could indicate either engagement or forgetfulness, the latter being detrimental to any analysis of the target audience. Forgetful players probably won't even notice that you are targeting them in recruiting efforts.

Lastly, some users have an insane number of sessions. But I feel the exact "insanity" depends on how long the observation period was. For example, if the observation period was only 10 days, 310 sessions would most likely indicate bot activity, regardless of how long each session lasts. But if it was over 4 years (I doubt it), 310 is very, very casual, human even; it averages out to around 1.5 sessions per week. So bot activity is very hard to detect here, and this is also an obstacle to finding the target audience: a person with 310 sessions could be averaging anywhere from 217 sessions per week down to 1.5, depending on the observation period, making it impossible to tell whether they are an active player, a casual player, or a bot. In other words, the session count is effectively meaningless without time bounds.
The high count could also be a one-off; for example, a player might have played for hours on end for a couple of months and then rapidly lost interest.
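The arithmetic behind the 310-session example can be checked directly. The observation windows below are hypothetical, since the true window is not stated; the 120-day entry is only a rough stand-in for 4 months.

```r
# Hypothetical observation windows in days; the real window is unknown
sessions_total <- 310
window_days <- c(ten_days = 10, four_months = 120, four_years = 4 * 365)

# Sessions per week = total sessions / number of weeks in the window
sessions_per_week <- sessions_total / (window_days / 7)
round(sessions_per_week, 1)
```

The same count gives very different weekly rates depending on the window, which is why the session count needs time bounds to be interpretable.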

For this dataset, the session amount and time are the only possible predictors for the question posed. So given the multiple drawbacks of this dataset, I highly doubt it would be used much in the model.



In [None]:
sessions <- read_csv("data/sessions.csv")
players <- read_csv("data/players.csv")

In [None]:
players_hours <- players|>
    select(played_hours)|>
    arrange(desc(played_hours))|>
    slice(1:6)
players_hours #weird outliers

player_age_stats<-players|>
    filter(!is.na(Age))|>
    summarise(
        count = n(),
        mean = round(mean(Age), 2),
        median = round(median(Age), 2),
        sd = round(sd(Age), 2),
        variance = round(var(Age), 2),
        min = round(min(Age),2),
        max = round(max(Age),2))
player_age_stats

player_time_stats<-players|>
    summarise(
        count = n(),
        mean = round(mean(played_hours), 2),
        median = round(median(played_hours), 2),
        sd = round(sd(played_hours), 2),
        variance = round(var(played_hours), 2),
        min = round(min(played_hours),2),
        max = round(max(played_hours),2))
player_time_stats

In [None]:
session_stats<-sessions|>
    filter(!is.na(original_end_time))|>
    mutate(session_duration_mins = (original_end_time - original_start_time) / 60000)|>
    summarise(
        count = n(),
        mean = round(mean(session_duration_mins), 2),
        median = round(median(session_duration_mins), 2),
        sd = round(sd(session_duration_mins), 2),
        variance = round(var(session_duration_mins), 2),
        min = round(min(session_duration_mins),2),
        max = round(max(session_duration_mins),2))
session_stats

#check for players with multiple sessions
user_session_count <- sessions|>
    count(hashedEmail, name = "session_count")|>
    arrange(desc(session_count))

head(user_session_count)

mean_sessions_per_user <- sessions |>
    count(hashedEmail, name = "session_count")|>
    summarise(mean = mean(session_count),
              max = max(session_count),
              min = min(session_count))
mean_sessions_per_user

#check for egregious hours
session_duration <- sessions|>
    filter(!is.na(original_end_time))|>
    mutate(session_duration_mins = (original_end_time - original_start_time) / 60000)|>
    arrange(desc(session_duration_mins))|>
    slice(1:6)|>
    select(hashedEmail, start_time, end_time, session_duration_mins)
session_duration

Clearly state one broad question that you will address, and the specific question that you have formulated. Your question should involve one response variable of interest and one or more explanatory variables, and should be stated as a question. One common question format is: “Can [explanatory variable(s)] predict [response variable] in [dataset]?”, but you are free to format your question as you choose so long as it is clear. Describe clearly how the data will help you address the question of interest. You may need to describe how you plan to wrangle your data to get it into a form where you can apply one of the predictive methods from this class.

Broad question: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Specific question: Can **experience** and **subscription** predict the **total session hours** in the players dataset? 

To answer this, I would merge the two datasets and compute each player's played hours from the sessions dataset, since that data falls within the observation period. For each session, I would subtract original_start_time from original_end_time to get the time spent, then convert the result from Unix milliseconds to minutes, because the raw values are unreadable. To find each player's total, I would group the sessions by email and sum the session minutes. For further insight, I could also find how long each player's typical session lasts, for example to see whether subscribed players usually play long sessions.


In [None]:
combined_data <- sessions|>
    filter(!is.na(original_end_time))|>
    mutate(session_duration_mins = (original_end_time - original_start_time) / 60000)|>
    group_by(hashedEmail)|>
    summarise(
        total_sessions = n(),
        avg_session_mins = mean(session_duration_mins),
        total_session_mins = sum(session_duration_mins))|>
    left_join(players, by = "hashedEmail")|>
    mutate(total_session_hrs = total_session_mins / 60)|>
    mutate(avg_session_hrs = avg_session_mins / 60)
combined_data
# avg_session_hrs is within the observation period. Use that instead. Not sure about the timeframe of played_hours or what it actually represents.

In [None]:
# Count players and average total session hours by subscription status
subscribed_players <- combined_data|>
    group_by(subscribe)|>
    summarise(
        count = n(),
        avg_total_sesh_hrs = round(mean(total_session_hrs, na.rm = TRUE), 2))|>
    mutate(percentage = round(count / sum(count) * 100, 1))
subscribed_players

In [None]:
options(repr.plot.height = 10, repr.plot.width = 10)

time_vs_sub_plot <- combined_data|>
    ggplot(aes(x = total_session_hrs, fill = subscribe))+
    geom_histogram()+
    labs(x = "Total time spent on game within observation period (hours)", fill = "Subscription status")+
    theme(text = element_text(size = 20))+
    ggtitle("Time spent on game versus Subscription status")
time_vs_sub_plot

class_vs_sub_plot <- combined_data |>
     ggplot(aes(x = experience, fill = subscribe))+
    geom_bar()+
    labs(x = "Experience level", fill = "Subscription")+
    theme(text = element_text(size = 20))+
    ggtitle("User's experience level versus Subscription status")
class_vs_sub_plot

time_vs_class_plot <- combined_data |>
    ggplot(aes(x = experience, y = total_session_hrs))+
    geom_bar(stat="identity")+
    labs(x = "Experience level", y = "Total time spent on game within observation period (hours)")+
    theme(text = element_text(size = 20))+
    ggtitle("Time spent on game versus User's experience level")
time_vs_class_plot

cd_filtered <- combined_data|>
    filter(total_session_hrs <= 10)|>
    filter(total_session_mins <= 500)
cd_filtered

time_vs_sub_plot2 <- cd_filtered|>
    ggplot(aes(x = total_session_hrs, fill = subscribe))+
    geom_histogram()+
    labs(x = "Total time spent on game within observation period (hours)", fill = "Subscription status")+
    theme(text = element_text(size = 20))+
    ggtitle("Time spent on game versus Subscription status")
time_vs_sub_plot2


time_vs_sub_plot3 <- cd_filtered|>
    ggplot(aes(x = total_session_mins, fill = subscribe))+
    geom_histogram()+
    labs(x = "Total time spent on game within observation period (minutes)", fill = "Subscription status")+
    theme(text = element_text(size = 20))+
    ggtitle("Time spent on game versus Subscription status")
time_vs_sub_plot3

From the time vs experience plot, I infer that experience level is likely self-declared, because of how irregular the data appears. One would usually expect the players with the most hours to be veterans and those with the fewest to be beginners. So experience might not be a reliable metric for finding a target audience, since it is very likely to be biased.

I made a second version of the time vs subscription plot because of the outliers, all of which appear to be subscribed anyway. However, there is a jarring number of people who spent 0 minutes on the game and still subscribed. Looking back at the sessions data, I saw that many original_start_time and original_end_time values are exactly the same, making their difference 0. This suggests I may need to switch to the start_time and end_time columns, converting the dd/mm/yyyy hh:mm strings into minutes or seconds, for this analysis to work. If even that doesn't work, I'll have to scrap the idea of a combined dataset and use played_hours from the players dataset instead.
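Switching to the string columns might look like the sketch below. It uses made-up toy rows (the emails and times are invented; the column names come from `sessions.csv`) and assumes both columns are dd/mm/yyyy hh:mm strings. Durations computed this way are only accurate to the minute.

```r
library(dplyr)
library(lubridate)

# Toy data with invented values; column names mirror sessions.csv
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46", NA)
)

toy_durations <- toy_sessions |>
  filter(!is.na(end_time)) |>   # drop sessions with no recorded end
  mutate(
    session_duration_mins = as.numeric(
      difftime(dmy_hm(end_time), dmy_hm(start_time), units = "mins")
    )
  )

toy_durations
```

The resulting `session_duration_mins` column could replace the one derived from the Unix columns in the `combined_data` pipeline, leaving the rest of that pipeline unchanged.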

It appears that at every experience level, subscribed players outnumber those who haven't subscribed.

There also appears to be a weak relationship between subscription status and play time, despite the data issues noted earlier.

In [None]:
subscribed_players <- combined_data|>
    group_by(subscribe)|>
    summarise(
        count = n(),
        avg_total_sesh_hrs = round(mean(total_session_hrs, na.rm = TRUE), 2))|>
    mutate(percentage = round(count / sum(count) * 100, 1))
subscribed_players