# Plaicraft Research Project

[Plaicraft.ai](https://plaicraft.ai/) collects data about how people play Minecraft. 

## (1) Data Description:

Two datasets are provided:

`players.csv` - Contains individual player demographics after signing up and playing (196 players, 9 variables):
* `experience`: exposure to Minecraft (Veteran, Pro, Regular, Amateur, Beginner)
* `subscribe`: subscription status (True/False)
* `hashedEmail`: unique coded email
* `played_hours`: total hours played per player
* `name`: player name
* `gender`: gender identity (Male, Female, Non-binary, Agender, Two-Spirited, Other, Prefer not to say)
* `age`: age in years (7-99 years)
* `individualId` and `organizationName`: Unused variables

Note: `played_hours` and `age` are quantitative variables, while the others are qualitative

`sessions.csv` - Lists individual session times when a player starts and ends a session (1535 sessions, 5 variables):

* `hashedEmail`: unique coded email
* `start_time`: session start time as `DD/MM/YYYY HH:MM` (24-hour clock)
* `end_time`: session end time as `DD/MM/YYYY HH:MM` (24-hour clock)
* `original_start_time`: session start time as UNIX time (milliseconds since Jan 1, 1970)
* `original_end_time`: session end time as UNIX time (milliseconds)

Note: `hashedEmail` is a qualitative variable, while the others are quantitative

Several issues exist:
* `experience`, `gender`, and `age` are self-reported, so players may have responded untruthfully.
* `individualId` is blank, but `hashedEmail` can identify players.
* The UNIX start and end times appear identical at the displayed precision even when the formatted `start_time` and `end_time` values differ, so the formatted times should be used instead.
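The sketch below shows one way to parse the formatted session times and derive a session length. It assumes the strings follow the day-first `DD/MM/YYYY HH:MM` layout seen in the `sessions` preview and that the UNIX columns are in milliseconds (values near 1.7e12). The `sample` frame is made up for illustration and is not read from `sessions.csv`.

```python
import pandas as pd

# Hypothetical rows mimicking the layout of sessions.csv (not real data)
sample = pd.DataFrame({
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33"],
    "end_time": ["30/06/2024 18:24", "17/06/2024 23:46"],
    "original_start_time": [1.71977e12, 1.71867e12],
})

# Parse the day-first strings into datetimes with an explicit format
fmt = "%d/%m/%Y %H:%M"
sample["start"] = pd.to_datetime(sample["start_time"], format=fmt)
sample["end"] = pd.to_datetime(sample["end_time"], format=fmt)

# Session length in minutes
sample["duration_min"] = (sample["end"] - sample["start"]).dt.total_seconds() / 60

# Assumption: the UNIX columns are in milliseconds, so unit="ms"
sample["start_from_unix"] = pd.to_datetime(sample["original_start_time"], unit="ms")
```

Passing an explicit `format` avoids pandas guessing between day-first and month-first dates.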

In [1]:
# Import packages
import altair as alt
import pandas as pd

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

DataTransformerRegistry.enable('vegafusion')

In [2]:
# Load players.csv into a DataFrame from its URL
url_players = "https://drive.google.com/uc?id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(url_players)

players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [3]:
# Load sessions.csv into a DataFrame from its URL
url_sessions = "https://drive.google.com/uc?id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB"
sessions = pd.read_csv(url_sessions)

sessions

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24,1.719770e+12,1.719770e+12
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46,1.718670e+12,1.718670e+12
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57,1.721930e+12,1.721930e+12
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58,1.721880e+12,1.721880e+12
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12,1.716650e+12,1.716650e+12
...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,10/05/2024 23:01,10/05/2024 23:07,1.715380e+12,1.715380e+12
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,01/07/2024 04:08,01/07/2024 04:19,1.719810e+12,1.719810e+12
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,28/07/2024 15:36,28/07/2024 15:57,1.722180e+12,1.722180e+12
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,25/07/2024 06:15,25/07/2024 06:22,1.721890e+12,1.721890e+12


## (2) Question:

To help target recruitment, **which "kinds" of players are most likely to contribute a large amount of data?**

Specifically, can `age` predict `played_hours`, the amount of data a player is likely to contribute?

Finding the ages that predict the highest `played_hours` would allow researchers to market in locations where that age group is prevalent (e.g. elementary school vs. workplace).

Both variables are present in `players.csv`, so neither `sessions.csv` nor additional wrangling is needed to obtain them.

## (3) Exploratory Visualization

In [4]:
players_age = (
    alt.Chart(players, title="Number of Players per Age on Plaicraft").mark_bar().encode(
        x = alt.X("age").title("Age (years)").bin(maxbins=20),
        y = alt.Y("count()").title("Number of players"),
    )
    .configure_axis(titleFontSize=12)
)

players_age

Most participants are 15-25 years old (around 70+95=165 participants). K-NN may predict poorly at other ages, where there are few or no data points.
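A count read off a histogram can be cross-checked with a small table. The sketch below bins ages with `pd.cut`; the ages in `toy_ages`, the bin edges, and the labels are all illustrative assumptions, not values taken from `players.csv`.

```python
import pandas as pd

# Hypothetical ages for illustration only (not the real players.csv values)
toy_ages = pd.Series([9, 17, 17, 21, 21, 17, 22, 17, 45], name="age")

# Left-closed bins: [5, 15), [15, 25), [25, 100)
age_group = pd.cut(
    toy_ages,
    bins=[5, 15, 25, 100],
    labels=["under 15", "15-24", "25 and over"],
    right=False,
)

# Count players per bin, ordered by bin rather than by frequency
age_counts = age_group.value_counts().sort_index()

# Share of players who fall in the densest bin
share_15_24 = age_counts["15-24"] / age_counts.sum()
```

On the real data, replacing `toy_ages` with `players["age"]` would give the same kind of table.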

In [5]:
players_scatterplot = (
    alt.Chart(players, title="Hours Contributed by Each Player with Age").mark_circle(opacity=0.40).encode(
        x = alt.X("age").title("Age (years)"),
        y = alt.Y("played_hours").title("Player's Total Time Played (hours)")
    )
    .configure_axis(titleFontSize=12)
)

players_scatterplot

Some younger players (ages 10-30) contributed many hours (20+), but most individuals contributed fewer than 10 hours.

In [6]:
# Zoom in on hours 0-8
minimum_hours = 0
maximum_hours = 8

players_played = (
    players[
        (players["played_hours"] >= minimum_hours) &
        (players["played_hours"] <= maximum_hours)
    ]
)

players_scatterplot = (
    alt.Chart(players_played, 
              title="Player Time per Age and Experience between 0-8 hours in Plaicraft")
    .mark_point(opacity=0.50)
    .encode(
        x = alt.X("age").title("Age (years)"), #.scale(type="log")
        y = alt.Y("played_hours")
            .title("Player's Total Time Played (hours)")
            .scale(domain=[minimum_hours, maximum_hours]),
        color = alt.Color("experience").title("Experience Level")
    )
    .configure_axis(titleFontSize=12)
)
players_scatterplot

Many players play 0-1 hours. Researchers should define what counts as a "large" amount of data (e.g. more than 1 hour).
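Once a threshold is chosen, flagging "large" contributors is a one-line comparison. The sketch below uses a hypothetical `LARGE_HOURS` cut-off of 1 hour (only the example value from the text, not a recommendation) and made-up `played_hours` values.

```python
import pandas as pd

# Hypothetical played_hours values (not taken from players.csv)
toy = pd.DataFrame(
    {"played_hours": [30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.3, 0.0, 2.3]}
)

# Assumed cut-off for a "large" contribution, in hours
LARGE_HOURS = 1.0

# Boolean flag per player, then the count and share of large contributors
toy["large_contributor"] = toy["played_hours"] > LARGE_HOURS
n_large = int(toy["large_contributor"].sum())
share_large = toy["large_contributor"].mean()
```

Trying a few values of `LARGE_HOURS` shows how sensitive the count of large contributors is to the definition.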

In [7]:
players_mean = (
    # Average played_hours per age; select the column explicitly before aggregating
    players.groupby("age")["played_hours"].mean().reset_index()
)

players_scatterplot = (
    alt.Chart(players_mean, title="Average Total Time Played Per Age").mark_line().encode(
        x = alt.X("age").title("Age (years)").scale(domain=[7, 100]),
        y = alt.Y("played_hours").title("Average Total Time Played (hours)")
    )
    .configure_axis(titleFontSize=12)
)
players_scatterplot

As age increases, the average `played_hours` fluctuates, suggesting a non-linear pattern with `age`.
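Some of that fluctuation may simply reflect ages represented by very few players. A minimal sketch, using made-up rows rather than `players.csv`, reports the per-age mean alongside the number of players behind it (the column names `mean_hours` and `n_players` are illustrative):

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "age": [17, 17, 17, 21, 21, 45],
    "played_hours": [3.8, 0.0, 2.2, 0.7, 0.1, 12.0],
})

# Mean hours per age together with the group size behind each mean
summary = (
    toy.groupby("age")["played_hours"]
    .agg(mean_hours="mean", n_players="count")
    .reset_index()
)
```

In this toy data the highest mean comes from a single 45-year-old, which illustrates why a mean should be read together with its group size.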

In [8]:
players_experience = alt.Chart(players, title="Total Time Played per Experience Level in Plaicraft").mark_bar().encode(
    # Sum hours within each level so the bar length matches the "total" in the title
    x = alt.X("sum(played_hours)").title("Total Time Played (hours)"),
    y = alt.Y("experience").title("Experience Level")
    .sort(["Pro", "Veteran", "Amateur", "Regular", "Beginner"]),
).configure_axis(titleFontSize=12)

players_experience

In [9]:
players_gender = alt.Chart(players, title="Total Time Played per Gender in Plaicraft").mark_bar().encode(
    # Sum hours within each gender so the bar length matches the "total" in the title
    x = alt.X("sum(played_hours)").title("Total Time Played (hours)"),
    y = alt.Y("gender").title("Gender").sort("-x"),
).configure_axis(titleFontSize=12)

players_gender

K-NN regression is one way to predict `played_hours` from `age`: `played_hours` is quantitative, so this is a regression rather than a classification problem, and K-NN makes no assumption that the relationship is linear.

However, K-NN predicts poorly where few or no data points exist, which here means ages outside 15-25 years. It can also become computationally expensive as the dataset grows, and its predictions are harder to interpret than those of an equation-based model such as linear regression.

Model selection:
* Split the data 70%-30% into training and test sets with a fixed random seed.
* Use 5-fold cross-validation on the training set to select the K with the lowest cross-validation RMSPE.
* Fit K-NN with the chosen K on the training data.
* Predict `played_hours` for the test data.
* Calculate the model's test RMSPE.
* Fit a linear regression on the same training data and calculate its test RMSPE.
* Compare the two models and select the one with the lowest test RMSPE.