# Individual Plan

https://github.com/yihui777c/dsci-100-individual-plan.git

## Importing Data

In [2]:
import pandas as pd
file_id_1 = '1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz'
url_1 = f'https://drive.google.com/uc?export=download&id={file_id_1}'
players = pd.read_csv(url_1)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [3]:
file_id_2 = '14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB'
url_2 = f'https://drive.google.com/uc?export=download&id={file_id_2}'
sessions = pd.read_csv(url_2)
sessions

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24,1.719770e+12,1.719770e+12
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46,1.718670e+12,1.718670e+12
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57,1.721930e+12,1.721930e+12
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58,1.721880e+12,1.721880e+12
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12,1.716650e+12,1.716650e+12
...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,10/05/2024 23:01,10/05/2024 23:07,1.715380e+12,1.715380e+12
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,01/07/2024 04:08,01/07/2024 04:19,1.719810e+12,1.719810e+12
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,28/07/2024 15:36,28/07/2024 15:57,1.722180e+12,1.722180e+12
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,25/07/2024 06:15,25/07/2024 06:22,1.721890e+12,1.721890e+12


In [4]:
players['experience'].unique()

array(['Pro', 'Veteran', 'Amateur', 'Regular', 'Beginner'], dtype=object)

In [5]:
players[['experience']].drop_duplicates()

Unnamed: 0,experience
0,Pro
1,Veteran
3,Amateur
4,Regular
12,Beginner


In [6]:
players[['played_hours', 'age']].agg(['max', 'min', 'mean'])

Unnamed: 0,played_hours,age
max,223.1,99.0
min,0.0,8.0
mean,5.845918,21.280612


## Data Description
1. players.csv: there are 196 observations and 9 variables. The variables are:
+ `experience`- character: describes the experience level of the player, one of `Pro`, `Veteran`, `Amateur`, `Regular`, and `Beginner`.
+ `subscribe`- logical: describes whether or not the player has subscribed to the game-related newsletter, as `True` or `False`.
+ `hashedEmail`- character: the hashed (anonymized) email address of the player.
+ `played_hours`- double: records the total game playtime in hours, with a range from 0 to 223.1 (mean ≈ 5.85).
+ `name`- character: the first name or nickname of the player.
+ `gender`- categorical: gender identity of the player (e.g., Male, Female, Other, Prefer not to say).
+ `age`- integer: player age, with a range from 8 to 99 years (mean ≈ 21.3).
+ `individualId`- character: internal player identifier; mostly missing (NaN).
+ `organizationName`- character: organization affiliation; mostly missing (NaN).

2. sessions.csv: there are 1535 observations and 5 variables. The variables are:
+ `hashedEmail`- character: hashed email identifying the player; matches `hashedEmail` in players.csv.
+ `start_time`- timestamp: session start time, stored as a day-first string (DD/MM/YYYY HH:MM).
+ `end_time`- timestamp: session end time, in the same format.
+ `original_start_time`- numeric: UNIX-like timestamp for the start time.
+ `original_end_time`- numeric: UNIX-like timestamp for the end time.
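
To make the session variables usable, the time strings have to be parsed and linked back to players. The following is a minimal sketch on a few toy rows written in the same format as sessions.csv; the shortened hashes (`aaa`, `bbb`), the name `sessions_demo`, and the derived columns are hypothetical. It also assumes the `original_*` columns are milliseconds since the Unix epoch, which is consistent with the displayed values but not stated in the data.

```python
import pandas as pd

# Toy rows in the same layout as sessions.csv (hashes shortened, illustration only)
sessions_demo = pd.DataFrame({
    "hashedEmail": ["aaa", "bbb", "aaa"],
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 03:22"],
    "end_time": ["30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 03:58"],
})

# The strings are day-first, so give the format explicitly; otherwise a value
# such as 10/05/2024 could be read as October 5 instead of May 10
fmt = "%d/%m/%Y %H:%M"
sessions_demo["start_time"] = pd.to_datetime(sessions_demo["start_time"], format=fmt)
sessions_demo["end_time"] = pd.to_datetime(sessions_demo["end_time"], format=fmt)

# Session length in minutes
sessions_demo["duration_min"] = (
    sessions_demo["end_time"] - sessions_demo["start_time"]
).dt.total_seconds() / 60

# Per-player totals keyed by hashedEmail, ready to merge onto players
per_player = sessions_demo.groupby("hashedEmail", as_index=False).agg(
    n_sessions=("duration_min", "size"),
    total_min=("duration_min", "sum"),
)

# Assumption: the numeric columns are milliseconds since the Unix epoch
approx_start = pd.to_datetime(1.719770e12, unit="ms")
```

The per-player table could then be joined to the player data with `players.merge(per_player, on="hashedEmail", how="left")`.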


### Potential Issues
+ Missing values in individualId and organizationName.
+ Anonymized emails are not interpretable.
+ Categories like Agender, Other, and Two-Spirited have very few records; they can be merged into an “Other” group if using gender as a variable.
+ Players with 0 played hours may distort the analysis.
+ Categorical variables like experience require encoding (e.g., one-hot).
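
The issues above can be checked and handled with a few pandas calls. This is a sketch on a made-up frame (`players_demo`) that only mimics the column names; the rare-category threshold `min_count` is a hypothetical choice, and whether zero-hour players should be flagged or dropped is left open.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the relevant columns of players.csv
players_demo = pd.DataFrame({
    "gender": ["Male", "Female", "Agender", "Male",
               "Two-Spirited", "Female", "Male", "Other"],
    "played_hours": [30.3, 0.0, 0.7, 3.8, 0.0, 0.1, 0.0, 2.3],
    "individualId": [np.nan] * 8,
    "organizationName": [np.nan] * 8,
})

# Share of missing values per column
missing_share = players_demo.isna().mean()

# Drop columns that contain no values at all
cleaned = players_demo.dropna(axis="columns", how="all")

# Merge gender categories with fewer than min_count records into "Other"
min_count = 2  # hypothetical threshold
counts = cleaned["gender"].value_counts()
rare = counts[counts < min_count].index
cleaned = cleaned.assign(
    gender=cleaned["gender"].where(~cleaned["gender"].isin(rare), "Other")
)

# Flag zero-hour players instead of silently dropping them
cleaned["zero_hours"] = cleaned["played_hours"].eq(0)
```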

### How the Data Were Collected
The datasets were collected from an online gaming platform that tracks player activity.
+ players.csv was generated from player registration and profile information, including experience level, subscription status, and demographic attributes.
+ sessions.csv was automatically recorded by the game system, logging each player’s session start and end times.


## Question

Answer question 1: **What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?**

+ The response variable: subscribe
+ Predictors: experience, played_hours, and age.
+ This is a k-nearest neighbors (kNN) classification problem.
+ There are two ways to encode experience:
   + one-hot encoding
   + ordinal encoding
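
The two encodings can be compared side by side. In this sketch the ordering used for the ordinal version (`assumed_order`) is my own assumption; the data do not define how the experience levels rank against each other.

```python
import pandas as pd

# The five experience levels found in players.csv
experience = pd.Series(
    ["Pro", "Veteran", "Amateur", "Regular", "Beginner"], name="experience"
)

# One-hot encoding: one indicator column per level, no ordering implied
one_hot = pd.get_dummies(experience, prefix="experience")

# Ordinal encoding: a single integer column; the ordering below is an assumption
assumed_order = ["Beginner", "Amateur", "Regular", "Veteran", "Pro"]
ordinal = pd.Categorical(experience, categories=assumed_order, ordered=True).codes
```

One-hot encoding avoids imposing an order but adds five columns to the distance calculation, while ordinal encoding keeps one column at the cost of assuming evenly spaced levels.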

## Exploratory Data Analysis and Visualization

In [18]:
import altair as alt

In [19]:
filtered_players = players[
    (players['played_hours'] <= 10) &
    (players['age'] <= 50)
]

alt.Chart(filtered_players).mark_point(filled=True).encode(
    x=alt.X('played_hours', title='Played Hours (≤10)'),
    y=alt.Y('age', title='Player Age (≤50)'),
    color=alt.Color('subscribe:N', title='Subscribed'),
    tooltip=['age', 'played_hours', 'experience', 'subscribe']
).properties(
    title='Relationship between Played Hours, Age, and Subscription'
)


The scatter plot shows a concentration of subscribers among players with few played hours, with no clear relation between age and playtime.

In [22]:
alt.Chart(players).mark_bar().encode(
    x='subscribe',
    y='count()'
).facet('experience', columns=5)

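Subscription patterns differ by experience level. “Pro” and “Veteran” players show relatively higher subscription counts, suggesting that experienced players are more engaged with game-related content. Because the bars show counts rather than proportions, this comparison also depends on how many players are in each group.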
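
To compare experience levels on equal footing, counts can be converted to rates. The sketch below uses a made-up frame (`demo`), so the numbers it produces say nothing about the real data; applied to `players`, the same `groupby` would give the per-level subscription rate.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "experience": ["Pro", "Pro", "Veteran", "Veteran",
                   "Amateur", "Amateur", "Amateur", "Amateur"],
    "subscribe": [True, False, True, True, True, False, False, True],
})

# Group size, number of subscribers, and subscription rate per experience level
summary = demo.groupby("experience", as_index=False).agg(
    n_players=("subscribe", "size"),
    n_subscribed=("subscribe", "sum"),
    rate=("subscribe", "mean"),
)
```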

In [23]:
# clip=True hides bars outside the 0-50 domain; without it they are drawn past the axis
alt.Chart(players).mark_bar(clip=True).encode(
    x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=30), title='Played Hours (0–50)', scale=alt.Scale(domain=[0, 50])),
    y=alt.Y('count()', title='Number of Players'),
    color=alt.Color('subscribe:N', title='Subscribed')
).properties(
    title='Distribution of Played Hours and Subscription Status'
)

The majority of players played fewer than 10 hours in total. Among this group, most have subscribed, indicating that light players form the largest share of subscribers.
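
The claim about light players can be backed with a table of playtime bands against subscription status. This is a sketch on made-up values; the band edges (10 and 50 hours) and the names `demo` and `hours_band` are hypothetical choices.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "played_hours": [0.0, 0.3, 2.3, 9.5, 30.3, 223.1],
    "subscribe": [False, False, True, True, True, True],
})

# Left-closed bands: [0, 10), [10, 50), [50, inf)
bins = [0, 10, 50, float("inf")]
labels = ["<10 h", "10-50 h", ">=50 h"]
demo["hours_band"] = pd.cut(demo["played_hours"], bins=bins, labels=labels, right=False)

# Counts of subscribers and non-subscribers within each band
table = pd.crosstab(demo["hours_band"], demo["subscribe"])

# Share of all players who played fewer than 10 hours
share_light = (demo["played_hours"] < 10).mean()
```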

## Methods and Plan

I will use a k-nearest neighbors (kNN) classification model to predict whether a player subscribes to a game-related newsletter based on experience, played_hours, and age.

+ kNN is suitable for classification problems and, once categorical predictors such as experience are encoded numerically, can be applied to mixed numerical and categorical data. It makes no assumption about the shape of the decision boundary, so it can capture nonlinear relationships, and its predictions are easy to interpret in terms of player similarity.
+ I assume that players with similar experience, age, and playtime will show similar subscription behavior, and that distance metrics can meaningfully reflect similarity among players.
+ kNN can be sensitive to outliers and the scale of variables. It may perform poorly on imbalanced data or when irrelevant predictors are included. It is also computationally expensive for large datasets.
+ I will use cross-validation to tune the parameter k and compare models with different k values. The model with the highest validation accuracy will be selected.
+ I will clean missing values, normalize numerical variables, and encode categorical features (experience) using one-hot or ordinal encoding. I will split the data into 70% training and 30% testing sets. Cross-validation will be applied during training to ensure reliability.
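
The plan above could be sketched with scikit-learn as follows. The data here are synthetic stand-ins generated at random (`demo`), so the resulting accuracy is meaningless; the candidate range for k, the 5-fold setting, and the random seeds are assumptions rather than decisions made in this plan.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the players data (random values, illustration only)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "experience": rng.choice(
        ["Pro", "Veteran", "Amateur", "Regular", "Beginner"], size=n
    ),
    "played_hours": rng.exponential(scale=5.0, size=n),
    "age": rng.integers(8, 50, size=n),
    "subscribe": rng.random(n) < 0.7,
})

X = demo[["experience", "played_hours", "age"]]
y = demo["subscribe"]

# 70% training / 30% testing, stratified to keep the class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=1
)

# Standardize numeric predictors and one-hot encode experience
preprocess = make_column_transformer(
    (StandardScaler(), ["played_hours", "age"]),
    (OneHotEncoder(handle_unknown="ignore"), ["experience"]),
)
pipeline = make_pipeline(preprocess, KNeighborsClassifier())

# Tune k with 5-fold cross-validation on the training set only
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 26, 2))}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = grid.score(X_test, y_test)  # evaluated once, on held-out data
```

Putting the scaler and encoder inside the pipeline means they are refit within each cross-validation fold, so no information from the validation folds or the test set leaks into preprocessing.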