Data is from a research group in Computer Science at UBC exploring how people play video games. Players' navigation of the world is recorded. The team created two CSV files: players.csv, with one row per unique player and data about them, and sessions.csv, with one row per individual session and information about that session. Purpose of study: to target recruitment efforts.

Sessions.csv (1535 rows) 
- 5 columns: 
    - 2 decimal:
        - original_start_time and original_end_time -  indicating raw timestamp of session
    - 3 character:
        - hashedEmail
        - start_time + end_time (human-readable date-time format, reporting when a user logged on/off for a session). 
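
The two timestamp representations can be converted to proper date-time values before computing session lengths. A minimal sketch, assuming the `start_time`/`end_time` strings are day/month/year hour:minute and that the `original_*` values are milliseconds since the Unix epoch (an assumption based on their magnitude, not something the data description confirms). The one-row tibble mirrors the first session shown in the `head(sessions)` output with a shortened hash, and `sessions_example`, `original_start`, and `session_minutes` are hypothetical names:

```r
library(dplyr)
library(lubridate)

# One illustrative row (hash shortened; not read from the real file)
sessions_example <- tibble(
  hashedEmail = "bfce39c8",
  start_time = "30/06/2024 18:12",
  end_time = "30/06/2024 18:24",
  original_start_time = 1719770000000
)

sessions_parsed <- sessions_example |>
  mutate(
    # Parse day/month/year hour:minute strings into date-times (UTC by default)
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    # Assumption: the raw value is milliseconds since 1970-01-01 UTC
    original_start = as_datetime(original_start_time / 1000),
    # Session length in minutes
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

sessions_parsed
```

Once the columns are date-times, session length can be summarised per player instead of being read off the character strings.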

Players.csv (196 rows) 
- 7 columns: 
  - 2 decimal:
    - Age (of user) 
    - played_hours (indicating users' hours played). 
  - 4 character:
    - experience (specifies user gaming experience - pro, veteran, amateur, regular, beginner)
    - hashedEmail
    - name
    - gender (male, female, prefer not to say, non-binary, other, two-spirited, agender). 
  - 1 logical:  
    - subscribe (indicates whether user is subscribed to game newsletter - TRUE/FALSE). 

Players.csv has unique hashed emails, while sessions.csv repeats hashedEmail per session. 
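
Because `hashedEmail` appears once per player but once per session, the two tables can be linked on it. A minimal sketch on made-up toy data (the hashes, values, and the names `players_toy`, `sessions_toy`, and `n_sessions` are hypothetical), counting sessions per player and attaching the result to the player table:

```r
library(dplyr)

# Toy stand-ins for players.csv and sessions.csv (values are invented)
players_toy <- tibble(
  hashedEmail = c("aaa", "bbb", "ccc"),
  experience = c("Pro", "Amateur", "Regular")
)
sessions_toy <- tibble(
  hashedEmail = c("aaa", "aaa", "bbb"),
  start_time = c("30/06/2024 18:12", "25/07/2024 03:22", "17/06/2024 23:33")
)

# One row per player that has at least one session
sessions_per_player <- sessions_toy |>
  count(hashedEmail, name = "n_sessions")

# Keep every player; players without sessions get 0 instead of NA
players_with_sessions <- players_toy |>
  left_join(sessions_per_player, by = "hashedEmail") |>
  mutate(n_sessions = coalesce(n_sessions, 0L))

players_with_sessions
```

A left join is used here so that players who never logged a session stay in the result rather than being dropped.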


In [5]:
library(tidyverse)

In [6]:
sessions <- read_csv("sessions.csv")
players <- read_csv("players.csv")
head(sessions)
head(players)

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


Part 2 of this step: calculating summary counts for organization and understanding. The counts cover the non-quantitative variables (those that are not dbl), plus Age as a demographic, so that relationships among them can be examined later. Each count shows how many observations fall into each category of a variable. This matters later because it tells me whether a specific group within a variable needs to be oversampled, and it keeps each group's size in mind when making comparisons. Overall, it helps organize the data. The comment above each step explains what the count does and why it was applied to that variable.
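
To judge whether a class is rare enough to warrant oversampling, it can help to look at proportions alongside the raw counts. A minimal sketch on made-up data (the six values in `subscribe_toy` are hypothetical, not the study's counts, and `prop` is an illustrative column name):

```r
library(dplyr)

# Toy class column (invented values, not the real players data)
subscribe_toy <- tibble(subscribe = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE))

subscribe_prop <- subscribe_toy |>
  count(subscribe) |>
  # Share of all rows that falls in each class
  mutate(prop = n / sum(n))

subscribe_prop
```

The same `mutate(prop = n / sum(n))` step can follow any of the `count()` calls in the next cell.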

In [None]:
# Count subscribers and non-subscribers. This turns the logical TRUE/FALSE variable into
# numeric counts that can be examined and worked with differently down the line.
# This is the 'class' variable that will likely be used later in KNN, so the counts
# show whether the 'rare' class must be oversampled.

subscribe_count <- players |>
  count(subscribe)
subscribe_count

# Count each experience level (i.e., beginner, amateur, pro, veteran, etc.). This turns the character column into clear counts
# so I can easily see how many players have each experience level.
experience_count <- players |>
  count(experience)

experience_count

# Count each gender (i.e., male, female, non-binary, agender, etc.). This turns the character column into clear counts
# so I can easily see how many players are in each gender category.
gender_count <- players |>
  count(gender)

gender_count

# Count each email in the players dataset. This turns the character column into clear counts
# so I can easily see how many times each user appears (checking for duplicates, as the players dataset
# should not contain the same hashed email more than once). Result: all clear, only 1 per person.
players_email_count <- players |>
  count(hashedEmail)

players_email_count


# Count each email in the sessions dataset. This turns the character column into clear counts
# so I can easily see how many sessions each user logged (checking whether users logged in and played
# multiple times). There are more rows in this dataset (1535 > 196), so multiple sessions per person are highly likely.
sessions_email_count <- sessions |>
  count(hashedEmail)

sessions_email_count

# Count each name in the players dataset. This turns the character column into clear counts
# so I can easily see how many times each name appears (checking for duplicates, as the players dataset
# should not list the same name more than once). Result: all clear, only 1 per person.
name_count <- players |>
  count(name)

name_count

# Count each age in the dataset. Age is a demographic, so it will be needed later on.
age_count <- players |>
  count(Age)

age_count
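
Scanning a full count table for duplicates is error-prone with nearly 200 rows. A minimal sketch of a more direct check, on made-up toy data (all values and the names `players_dup_toy`, `duplicated_emails`, and `duplicated_names` are hypothetical): keep only the values that occur more than once, so an empty result means no duplicates.

```r
library(dplyr)

# Toy player table with one repeated name (invented values)
players_dup_toy <- tibble(
  hashedEmail = c("aaa", "bbb", "ccc", "ddd"),
  name = c("Morgan", "Blake", "Flora", "Blake")
)

# Emails that appear more than once; zero rows means every email is unique
duplicated_emails <- players_dup_toy |>
  count(hashedEmail) |>
  filter(n > 1)

# Names that appear more than once
duplicated_names <- players_dup_toy |>
  count(name) |>
  filter(n > 1)

duplicated_emails
duplicated_names
```

Two different players can share a first name, so a repeated `name` is not by itself evidence of a repeated player; `hashedEmail` is the more reliable key for this check.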

analyze what the means mean.