
The data come from a Computer Science research group at UBC that studies how people play video games. Players' navigation of the game world is recorded. The team produced two CSV files: players.csv, with one row per unique player and data about them, and sessions.csv, with one row per play session and information about that session. Purpose of the study: to target recruitment efforts.

Sessions.csv (1535 rows) 
- 5 columns: 
    - 2 decimal:
        - original_start_time and original_end_time - raw timestamps of the session
    - 3 character:
        - hashedEmail
        - start_time and end_time (human-readable date-time format, reporting when a user logged on/off for a session). 

Players.csv (197 rows) 
- 7 columns: 
  - 2 decimal:
    - Age (of user) 
    - played_hours (total hours the user has played). 
  - 4 character:
    - hashedEmail (identifies the player)
    - experience (specifies user gaming experience - pro, veteran, amateur, regular, beginner)
    - name
    - gender (male, female, prefer not to say, non-binary, other, two-spirited, agender). 
  - 1 logical:  
    - subscribe (indicates whether user is subscribed to game newsletter - TRUE/FALSE). 

Players.csv has one row per hashedEmail, while sessions.csv repeats hashedEmail once per session. 
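
The key relationship between the two files can be checked directly. The sketch below uses made-up rows rather than the real CSVs (`players_toy`, `sessions_toy` and their values are assumptions for illustration) and plain base R, so it runs without loading the data.

```r
# Toy stand-ins for players.csv and sessions.csv (values are hypothetical)
players_toy <- data.frame(
  hashedEmail = c("a1", "b2", "c3"),
  played_hours = c(0, 1.5, 30.3)
)
sessions_toy <- data.frame(
  hashedEmail = c("b2", "c3", "c3", "c3")
)

# TRUE when every player appears exactly once in the players table
players_unique <- anyDuplicated(players_toy$hashedEmail) == 0

# Sessions per player; players with no recorded session get a count of 0
sessions_per_player <- table(factor(sessions_toy$hashedEmail,
                                    levels = players_toy$hashedEmail))

# TRUE when every session belongs to a player listed in the players table
all_matched <- all(sessions_toy$hashedEmail %in% players_toy$hashedEmail)

players_unique
sessions_per_player
all_matched
```

Running the same three checks on `players` and `sessions` would confirm whether hashedEmail can safely be used as a join key.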


In [None]:
library(tidyverse)

In [None]:
sessions <- read_csv("sessions.csv")
players <- read_csv("players.csv")
head(sessions)
head(players)

# **Summary Statistics:**

## **Count Findings:**

**Subscribe:** Will need to oversample rare class (FALSE) in prediction

In [None]:
subscribe_count <- players |>
  count(subscribe)

subscribe_count

**Played Hours:** 85/197 played 0 hours - raises concern. 

In [None]:
played_hours_count <- players |>
  count(played_hours)

played_hours_count

**Experience:** Relatively equal - lower pro count.

In [None]:
experience_count <- players |>
  count(experience)

experience_count

**Gender:** Mostly males, followed by females. Unequal count distribution

In [None]:
gender_count <- players |>
  count(gender)

gender_count

**Player Hashed Email:** only 1 per player

In [None]:
players_email_count <- players |>
  count(hashedEmail)

head(players_email_count)

**Sessions Hashed Email:** Some players play more often (range from 1-310)

In [None]:
sessions_email_count <- sessions |>
  count(hashedEmail)

(sessions_email_count)

**Age:**
- Range = 9-58 
- mode = 17
- majority players <25

In [None]:
age_count <- players |>
  count(Age)

age_count

## **Standard Deviation Findings:**

**Played Hours:** SD is about 28 hours, far larger than the mean of 5.85 hours, so playtime varies widely between players (high SD)

In [None]:
sd_hours_played <- players |>
    summarize(sd_played_hours = sd(played_hours)) |>
    round(digits = 2)
sd_hours_played

**Age:** 7.4 years (High SD)

In [None]:
sd_age<- players |>
    summarize(sd_age = sd(Age, na.rm =TRUE)) |>
    round(digits = 2)
sd_age

**OST:** SD of 3557491589 ms (about 41 days) around the mean timestamp; the raw timestamps are in milliseconds since the Unix epoch, not seconds

In [None]:
sd_original_start_time <- sessions |>
    summarize(original_start_time_sd = sd(original_start_time, na.rm=TRUE)) |>
    round(digits = 2)
sd_original_start_time

**OET:** SD of 3552813134 ms (about 41 days) around the mean timestamp; the raw timestamps are in milliseconds since the Unix epoch, not seconds

In [None]:
sd_original_end_time <- sessions |>
    summarize(original_end_time_sd = sd(original_end_time,na.rm=TRUE)) |>
    round(digits = 2)
sd_original_end_time
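
The raw timestamp columns are easier to interpret once converted. The sketch below assumes the values are Unix epoch milliseconds (values near 1.7e12 are consistent with that, but the dataset documentation should confirm it). `epoch_ms_to_utc` and `example_ms` are hypothetical names and values introduced for illustration.

```r
# Convert a spread measured in milliseconds into days
ms_per_day <- 1000 * 60 * 60 * 24
sd_start_ms <- 3557491589            # SD of original_start_time reported above
sd_start_days <- sd_start_ms / ms_per_day
round(sd_start_days, 1)

# Convert an epoch-millisecond timestamp into a UTC date-time
epoch_ms_to_utc <- function(ms) {
  as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")
}
example_ms <- 1719187200000          # hypothetical timestamp, not taken from the data
format(epoch_ms_to_utc(example_ms), "%Y-%m-%d %H:%M:%S")
```

Expressed in days, the SD describes how spread out the sessions are over the collection period, which is easier to reason about than a ten-digit number.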

## **Mean Findings:**

- **Hours Played:** 5.85 (low-end)
- **Age:** Young
- **OST:** 1.719201e+12
- **OET:** 1.719196e+12

In [None]:
mean_hours_played <- players |>
    summarize(mean_played_hours = mean(played_hours)) |>
    round(digits = 2)
mean_hours_played

**Age:** Relatively young (21) considering the range (9-58)

In [None]:
mean_age<- players |>
    summarize(age_mean = mean(Age, na.rm =TRUE)) |>
    round(digits = 2)
mean_age

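**OST:** the mean of 1.719201e+12 ms since the Unix epoch falls in late June 2024. A mean of raw timestamps is an average calendar date, not a typical time of day, so it does not show when players tend to start.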

In [None]:
mean_original_start_time <- sessions |>
    summarize(original_start_time_mean = mean(original_start_time, na.rm=TRUE)) |>
    round(digits = 2)
mean_original_start_time

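**OET:** the mean of 1.719196e+12 ms since the Unix epoch falls in late June 2024. As an average calendar date, it does not indicate the time of day at which sessions tend to end.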

In [None]:
mean_original_end_time <- sessions |>
    summarize(original_end_time_mean = mean(original_end_time,na.rm=TRUE)) |>
    round(digits = 2)
mean_original_end_time
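
To learn when in the day players start, the hour of each session has to be extracted before summarizing. The sketch below uses four made-up start times (`start_ms` is hypothetical, not taken from sessions.csv) and assumes epoch milliseconds in UTC.

```r
# Hypothetical session start times in epoch milliseconds
start_ms <- c(1719187200000,   # 2024-06-24 00:00 UTC
              1719190800000,   # 2024-06-24 01:00 UTC
              1719277200000,   # 2024-06-25 01:00 UTC
              1719360000000)   # 2024-06-26 00:00 UTC

start_utc <- as.POSIXct(start_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Hour of day (0-23) for each session, then sessions counted per hour
start_hour <- as.integer(format(start_utc, "%H"))
table(start_hour)

# The mean timestamp is a calendar position, not a typical hour:
# every toy session starts at hour 0 or 1, yet the mean is 18:30 on 2024-06-24
format(mean(start_utc), "%Y-%m-%d %H:%M")
```

The per-hour table is the quantity that supports a statement about time of day; the mean timestamp only locates the centre of the collection period.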

## **Min Findings:**

**Played Hours:** some did not play

In [None]:
min_hours_played <- players |>
    summarize(min_played_hours = min(played_hours)) |>
    round(digits = 2)
min_hours_played

**Age:** young players

In [None]:
min_age<- players |>
    summarize(age_min = min(Age, na.rm =TRUE)) |>
    round(digits = 2)
min_age

**OST + OET:**
- Earliest session started and ended at same time

In [None]:
min_original_start_time <- sessions |>
    summarize(original_start_time_min = min(original_start_time, na.rm=TRUE)) |>
    round(digits = 2)
min_original_start_time

min_original_end_time <- sessions |>
    summarize(original_end_time_min = min(original_end_time,na.rm=TRUE)) |>
    round(digits = 2)
min_original_end_time

## **Max Findings:**

**Played Hours:** 
- wide variation across min and max hours played (0-223.1)

In [None]:
max_hours_played <- players |>
    summarize(max_played_hours = max(played_hours)) |>
    round(digits = 2)
max_hours_played

**Age:** 58 (old players) 
- wide variation

In [None]:
max_age<- players |>
    summarize(age_max = max(Age, na.rm =TRUE)) |>
    round(digits = 2)
max_age

**OST:** 1.72733e+12 ms since the Unix epoch - the latest recorded session start, in late September 2024 

In [None]:
max_original_start_time <- sessions |>
    summarize(original_start_time_max = max(original_start_time, na.rm=TRUE)) |>
    round(digits = 2)
max_original_start_time

**OET:** 1.72734e+12 ms since the Unix epoch - the latest recorded session end, in late September 2024 

In [None]:
max_original_end_time <- sessions |>
    summarize(original_end_time_max = max(original_end_time,na.rm=TRUE)) |>
    round(digits = 2)
max_original_end_time

## **Questions:**
- **Broad:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

- **Subquestion:** Can Gender and Age predict Subscription in players.csv?

# **Data Analysis - Exploratory**

Proportions are computed alongside the counts because they account for category imbalances, which matters when reading the graphs. 

In [None]:
ratio_gender_subscribed <- players |>
  group_by(gender) |>
  summarize(
    total = n(),
    subscribed = sum(subscribe == TRUE),
    percentage_subscribed = (subscribed / total)*100)


ratio_gender_subscribed

ratio_experience_subscribed <- players |>
  group_by(experience) |>
  summarize(
    total = n(),
    subscribed = sum(subscribe == TRUE),
    percentage_subscribed = (subscribed / total)*100)


ratio_experience_subscribed

**Quantitative means - players.csv**
- **Played Hours:** fairly low (5.85 hours) 
- **Age:** young age (21.14)

In [None]:
#mean of the quantitative variable played hours
mean_hours_played <- players |>
    summarize(mean_played_hours = mean(played_hours)) |>
    round(digits = 2)
mean_hours_played


#mean of the quantitative variable age
mean_age<- players |>
    summarize(age_mean = mean(Age, na.rm =TRUE)) |>
    round(digits = 2)
mean_age

# **Graphs**

Subscribed count (TRUE) vs not-subscribed count (FALSE). 
- Shows the need to oversample the rare class (FALSE) in future steps. 

In [None]:
subscribed_count_bargraph <- subscribe_count|>ggplot(aes(x=subscribe, y=n, fill=subscribe))+
geom_bar(stat="identity") + 
labs(x="Subscribed to Newsletter", y="Count of Subscribed or Not", fill="Subscribed?")+
ggtitle("Subscribed to Newsletter VS Not: Count Visualized")

subscribed_count_bargraph


Age vs played_hours, colored by subscription. 
- Younger players tend to have higher played hours (a weak, somewhat negative, nonlinear relationship). 
- More data points for younger players (<25). Higher played hours and lower age appear to go with subscription.
- Played hours and age are likely predictive of subscription. 


In [None]:
age_vs_played_hours <- players|>ggplot(aes(x=Age, y=played_hours, color=subscribe))+
geom_point() +
scale_y_log10() +
labs(x="User Age", y="Hours Played by User", color="Subscribed?")+
ggtitle("Relationship between Age and Played Hours: Colored by Subscription")

age_vs_played_hours
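
One caveat about the log-scaled axis: 85 players logged 0 hours, and log10(0) is -Inf, so those players have no finite position on a log10 axis. The sketch below shows the effect on made-up values (`played_hours_toy` is hypothetical) and one possible alternative, the log(1 + x) transform, which the scales package also offers for ggplot2 axes as a `log1p` transformation.

```r
# Hypothetical played_hours values, including players who never played
played_hours_toy <- c(0, 0, 0.1, 1.5, 30.3, 223.1)

log10(played_hours_toy)   # zeros become -Inf
log1p(played_hours_toy)   # log(1 + x): zeros stay at 0

# Number of values a log10 axis cannot place at a finite position
n_not_finite <- sum(!is.finite(log10(played_hours_toy)))
n_not_finite
```

Whether zero-hour players should be shown, transformed, or reported separately is a design choice; the point is only that a plain log10 axis does not place them.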

Gender counts show males, then females, have the most subscriptions. 
- Likely because these groups have the most data points. 

In [None]:
gender_subscribed_bar <- ratio_gender_subscribed|>ggplot(aes(x=gender, y=subscribed, fill=gender))+
geom_bar(stat="identity") + 
labs(x="User Gender", y="Number of Subscribed Users", fill="Gender")+
ggtitle("Subscription Count by Gender")

gender_subscribed_bar

Percentage subscribed by gender - females have the highest subscription rate.
- Small category sizes limit inference.
- Data limitations are evident (few observations for agender and other).
- The leading gender changes when looking at proportion instead of count.
- Likely a predictive variable. 

In [None]:
gender_subscribed_ratio_bar <- ratio_gender_subscribed|>ggplot(aes(x=gender, y=percentage_subscribed, fill=gender))+
geom_bar(stat="identity") +
labs(x="User Gender", y="Percentage Subscribed", fill="Gender")+
ggtitle("Percentage Subscribed by Gender")


gender_subscribed_ratio_bar


Experience vs subscribed count. Amateurs have the highest subscribed count and also the highest total count. 
- Relatively even distribution. 


In [None]:
experience_subscribed_bar <- ratio_experience_subscribed|>ggplot(aes(x=experience, y=subscribed, fill=experience))+
geom_bar(stat="identity") +
labs(x="User Experience", y="Count of Subscribed or Not", fill="Experience")+
ggtitle("Subscription Count by Experience")

experience_subscribed_bar

Regulars have the highest subscribed percentage across the experience categories, but the differences are small, so experience may be less predictive than gender or played hours.

In [None]:
experience_subscribed_ratio_bar <- ratio_experience_subscribed|>ggplot(aes(x=experience, y=percentage_subscribed, fill=experience))+
geom_bar(stat="identity") +
labs(x="User Experience", y="Percentage of Subscribed or Not", fill="Experience")+
ggtitle("Percentage Subscribed by Experience")
experience_subscribed_ratio_bar

# **Model**

K-Nearest Neighbors (K-NN) classification will be applied. This is a classification problem because the response variable is binary: subscribed vs not subscribed (recorded as TRUE or FALSE in the dataset).

K-NN is suitable because it makes minimal assumptions about the data and can capture non-linear relationships between the predictors (in this project, Age and Gender) and the likelihood of subscribing. K-NN assumes that players with similar predictor values have similar outcomes (response variable), which suits this project. 

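I will need to impute missing data (NAs) and oversample the rare class (FALSE; nearly 3x as many people are subscribed as not). The data will be split into 70% training and 30% testing sets, and oversampling will be applied to the training set only so the testing set keeps the original class balance. 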
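
The split and oversampling steps can be sketched as follows. This is a minimal base R illustration on synthetic data, not the project's pipeline: the `toy` data frame, the 30/10 class counts and the seed are assumptions, and the rare class is balanced by simple resampling with replacement.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for players.csv: 30 subscribed, 10 not subscribed
toy <- data.frame(
  Age = round(runif(40, 9, 58)),
  subscribe = factor(rep(c("TRUE", "FALSE"), times = c(30, 10)))
)

# Stratified 70/30 split: sample 70% of the rows within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$subscribe)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Oversample the rare class in the training set only, resampling its rows
# with replacement until both classes have the same number of rows
counts <- table(train$subscribe)
rare <- names(which.min(counts))
rare_rows <- which(train$subscribe == rare)
extra_rows <- sample(rare_rows, size = max(counts) - min(counts), replace = TRUE)
train_balanced <- train[c(seq_len(nrow(train)), extra_rows), ]

table(train$subscribe)
table(train_balanced$subscribe)
nrow(test)
```

Leaving the testing set untouched means its accuracy reflects the class balance the model would meet in practice.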

All numerical predictors need to be standardized within the recipe stage so that none dominates the distance calculation in K-NN. Gender is categorical, so it will need to be encoded as indicator variables before it can enter the distance calculation. Next comes model tuning: the optimal value of K (the number of neighbors) will be determined using 5-fold cross-validation on the training data. The K with the highest cross-validated accuracy will be chosen for the final model, which is refit on the training data and applied to the testing data for prediction.
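
The tuning procedure can be sketched as below. This is not the recipe-based workflow described above: it is a plain R illustration using `knn()` from the `class` package (which ships with standard R installs) on synthetic data. The data, the two numeric predictors, the seed and the grid of K values are all assumptions chosen for illustration.

```r
library(class)  # provides knn()

set.seed(2024)  # arbitrary seed for reproducibility

# Hypothetical training data: two numeric predictors and a binary response
n <- 100
train <- data.frame(
  Age = round(runif(n, 9, 58)),
  played_hours = rexp(n, rate = 1 / 6)
)
train$subscribe <- factor(ifelse(train$played_hours + rnorm(n) > 3, "TRUE", "FALSE"))

predictors <- c("Age", "played_hours")
folds <- sample(rep(1:5, length.out = n))   # random 5-fold assignment
k_grid <- seq(1, 15, by = 2)                # candidate numbers of neighbors

cv_accuracy <- sapply(k_grid, function(k) {
  fold_acc <- sapply(1:5, function(f) {
    fit_rows <- folds != f
    # Standardize using the mean and SD of the fitting folds only, then
    # apply the same centring and scaling to the held-out fold
    centre <- colMeans(train[fit_rows, predictors])
    spread <- apply(train[fit_rows, predictors], 2, sd)
    x_fit  <- scale(train[fit_rows, predictors], center = centre, scale = spread)
    x_hold <- scale(train[!fit_rows, predictors], center = centre, scale = spread)
    pred <- knn(x_fit, x_hold, cl = train$subscribe[fit_rows], k = k)
    mean(pred == train$subscribe[!fit_rows])
  })
  mean(fold_acc)
})

results <- data.frame(k = k_grid, accuracy = cv_accuracy)
best_k <- results$k[which.max(results$accuracy)]
results
best_k
```

Computing the scaling parameters inside each fold keeps information from the held-out rows out of the fit, which is the same reason the standardization belongs in the recipe rather than being applied to the whole dataset up front.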

Limitation: K-NN is slow on large datasets.
