# DSCI 100 Project Proposal 

In [2]:
import pandas as pd 
import altair as alt

# (1) Data Description:

**players.csv Data**

In [3]:
players_data = pd.read_csv("https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz") 
players_data 

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


**Data Description of players_data**

There are 196 individuals recorded in this dataset and there are 9 variables.

In [4]:
data = {
    "Variable Name": [
        "experience", "subscribe", "hashedEmail", "played_hours",
        "name", "gender", "age", "individualId", "organizationName"
    ],
    "Data Type": [
        "Categorical", "Boolean", "Text/String", "Numeric (Float)",
        "Text/String", "Categorical", "Numeric (Integer)", "Text/String", "Text/String"
    ],
    "Description": [
        "User’s experience level/skill",
        "Subscription Status",
        "Anonymized email identifier",
        "Total Hours Played",
        "User’s first name",
        "User’s Gender identity",
        "User's Age in years",
        "User’s ID",
        "Associated Organization"
    ]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Variable Name,Data Type,Description
0,experience,Categorical,User’s experience level/skill
1,subscribe,Boolean,Subscription Status
2,hashedEmail,Text/String,Anonymized email identifier
3,played_hours,Numeric (Float),Total Hours Played
4,name,Text/String,User’s first name
5,gender,Categorical,User’s Gender identity
6,age,Numeric (Integer),User's Age in years
7,individualId,Text/String,User’s ID
8,organizationName,Text/String,Associated Organization


### Summary Statistics

**Numerical Variables:**
- **played_hours**: Ranges from 0 to 240.0 hours. The distribution is highly right-skewed, with the majority of players having played fewer than 30 hours.
- **age**: Ranges from 9 to 91 years. The majority of players fall within the 15-25 age range.

**Categorical Variables:**
- **experience**: Amateur (63 players), Beginner (35 players), Pro (14 players), Regular (36 players), Veteran (48 players)
- **gender**: Male (124 players), Female (37 players), Other (1 player), Prefer not to say (11 players), Non-Binary (15 players), Agender (2 players), Two spirited (6 players).
- **subscribe**: True (144 players), False (52 players)

**Data Quality Issues Quantified:**
- 2 columns (individualId, organizationName) are completely empty (100% NaN values)
- Approximately 83 players have 0 hours played, which may indicate they registered but never engaged
- 2 age outliers detected: one player aged 9 and one aged 91, which warrant verification or careful handling
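
The figures above could be checked with a few pandas calls. The following is a minimal sketch, assuming a small made-up stand-in called `sample` (its rows are invented for illustration and are not taken from players.csv); in the notebook the same calls would be applied to `players_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for players_data; the values are invented.
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur", "Amateur", "Regular"],
    "played_hours": [30.3, 0.0, 0.7, 0.0, 2.3],
    "age": [9, 17, 21, 17, 22],
    "individualId": [np.nan] * 5,
    "organizationName": [np.nan] * 5,
})

# Minimum and maximum of the numeric variables.
numeric_ranges = sample[["played_hours", "age"]].agg(["min", "max"])

# Number of players in each experience category.
experience_counts = sample["experience"].value_counts()

# Columns in which every value is missing.
empty_columns = [col for col in sample.columns if sample[col].isna().all()]

# Number of players with no recorded play time.
zero_hour_players = int((sample["played_hours"] == 0).sum())

print(numeric_ranges)
print(experience_counts)
print(empty_columns, zero_hour_players)
```

Running the same checks on the full data would also confirm whether the category counts add up to the number of rows.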

**Data Collection Method**

User registration/profile data
- Signup forms
- Demographics entered by users
- Experience level, either self-selected or system-assigned

Session logging
- System timestamps when users start and end activities
- Linked to players via hashed email for privacy

# (2) Question:

The research question builds on Q2: we would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Specific Predictive Question: "Can we predict a player's data contribution level (High or Low) from age, gender, and experience level?"

The **response variable** will be "played_hours". The **explanatory/predictor variables** will be "experience", "gender", and "age". Little wrangling is needed for the players.csv data set. However, the columns "subscribe", "name", "individualId", and "organizationName" will be removed for lack of relevance. One wrangling step that may be needed later is splitting "played_hours" into "High" and "Low" contribution levels.
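
The split of "played_hours" into two contribution levels could look like the sketch below. The cutoff is an assumption: the proposal does not fix one, so the median of the sample is used here as a placeholder. The `sample` frame (invented values) and the `contribution` column name are also hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the players data; the values are invented.
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"],
    "gender": ["Male", "Male", "Male", "Female", "Male", "Male"],
    "age": [9, 17, 17, 21, 21, 17],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1, 2.3],
})

# Assumed cutoff: the sample median. The proposal does not specify a threshold.
cutoff = sample["played_hours"].median()

# Label players above the cutoff "High" and everyone else "Low".
sample["contribution"] = (sample["played_hours"] > cutoff).map(
    {True: "High", False: "Low"}
)

print(cutoff)
print(sample[["played_hours", "contribution"]])
```

Because many players have 0 hours, a median cutoff on the full data may sit very close to zero, so a different threshold might be more useful in practice.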

# (3) Exploratory Data Analysis and Visualization

In [7]:
players_data = players_data.rename(columns={"hashedEmail": "hashed_Email"})
# Keep the identifier, the predictors, and the response; this drops subscribe, name, individualId, and organizationName.
players_data_clean = players_data[["hashed_Email","experience","played_hours","gender","age"]]

players_d = players_data_clean
players_d

Unnamed: 0,hashed_Email,experience,played_hours,gender,age
0,f6daba428a5e19a3d47574858c13550499be23603422e6...,Pro,30.3,Male,9
1,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,Veteran,3.8,Male,17
2,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,Veteran,0.0,Male,17
3,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,Amateur,0.7,Female,21
4,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,Regular,0.1,Male,21
...,...,...,...,...,...
191,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,Amateur,0.0,Female,17
192,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,Veteran,0.3,Male,22
193,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,Amateur,0.0,Prefer not to say,17
194,f19e136ddde68f365afc860c725ccff54307dedd13968e...,Amateur,2.3,Male,17


**Exploratory Visualizations**

In [8]:
exp_vs_hours_plot_scatter = alt.Chart(players_d).mark_point(opacity=0.6, size=60).encode(

    x=alt.X("experience").title("Level of Experience"),
    y=alt.Y("played_hours").title("Hours Played"),
    color=alt.Color("experience").title("Experience Level")
).properties(
    title="Experience VS Hours of Gameplay"
)
exp_vs_hours_plot_scatter


**Insight Gained From Plot**:
This plot shows which experience levels (Amateur, Beginner, Pro, Regular, Veteran) tend to have the most hours played. The highest hour counts belong to Amateur and Regular players. Players at the other experience levels do play the game, just not for long: the hours played by Beginners, Pros, and Veterans fall between approximately 0 and 30 hours.
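
A grouped summary could back up what the scatter plot suggests. The following is a minimal sketch on an invented `sample` frame (the values are illustrative only); in the notebook the same chain would run on `players_d`.

```python
import pandas as pd

# Hypothetical stand-in for players_d; the values are invented.
sample = pd.DataFrame({
    "experience": ["Amateur", "Amateur", "Regular", "Regular", "Pro", "Veteran"],
    "played_hours": [0.7, 150.0, 0.1, 218.1, 30.3, 3.8],
})

# Count, median, and maximum hours per experience level, largest maximum first.
hours_by_experience = (
    sample.groupby("experience")["played_hours"]
    .agg(["count", "median", "max"])
    .sort_values("max", ascending=False)
)

print(hours_by_experience)
```

Comparing the median with the maximum helps separate a group that plays a lot overall from a group with one or two extreme players.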

In [9]:
gender_vs_hours_plot = alt.Chart(players_d).mark_point(opacity=0.6, size=60).encode(

    x=alt.X("gender").title("Gender"),
    y=alt.Y("played_hours").title("Hours Played"),
    color=alt.Color("gender").title("Gender")
).properties(
    title="Gender VS Hours of Gameplay"
)
gender_vs_hours_plot

**Insight Gained From Plot**: This plot shows the distribution of hours spent playing Minecraft by gender. Male and female players reach noticeably higher hours of gameplay than players in the other gender categories. The plot also shows that only a small number of individuals have played the game for over 60 hours.

In [11]:
age_vs_hours_plot = alt.Chart(players_d).mark_bar().encode(
    # Sum the hours within each age so every bar shows the total for that age.
    x=alt.X("age").title("Age"),
    y=alt.Y("sum(played_hours)").title("Total Hours Played"),
).properties(
    title="Age VS Hours of Gameplay"
)
age_vs_hours_plot

**Insight Gained From Plot**: This bar chart displays the total hours played aggregated by player age. The data shows that the 15-25 age range accounts for the majority of total gameplay hours, with peak contributions around ages 18-20. While the server attracts players across a broad age spectrum (9-91 years), engagement is heavily concentrated in the young adult demographic. The steep decline after age 25 indicates that older players contribute substantially less total gameplay time.
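
The share of total hours coming from the 15-25 range could be quantified with `pd.cut`. This is a minimal sketch, assuming invented data and assumed bracket edges (`bins` and `labels` are choices made here for illustration, not part of the proposal).

```python
import pandas as pd

# Hypothetical stand-in for players_d; the values are invented.
sample = pd.DataFrame({
    "age": [9, 17, 17, 21, 22, 40, 91],
    "played_hours": [30.3, 3.8, 56.1, 150.0, 0.3, 1.2, 0.2],
})

# Assumed brackets: intervals (0, 14], (14, 25], (25, 120].
bins = [0, 14, 25, 120]
labels = ["under 15", "15-25", "over 25"]
sample["age_group"] = pd.cut(sample["age"], bins=bins, labels=labels)

# Total hours per bracket and each bracket's share of all hours.
hours_by_group = sample.groupby("age_group", observed=False)["played_hours"].sum()
share_by_group = hours_by_group / hours_by_group.sum()

print(share_by_group.round(3))
```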

In [12]:
hours_distribution_plot = alt.Chart(players_d).mark_bar().encode(
    x=alt.X("played_hours:Q")
        .bin(maxbins=20)
        .title("Hours Played"),
    y=alt.Y("count()")
        .title("Number of Players")
).properties(
    title="Distribution of Total Hours Played by Players"
)

hours_distribution_plot

**Insight Gained From Plot**: This histogram reveals that the distribution of hours played is highly right-skewed, with the majority of players (roughly 150 or more) having played fewer than 10 hours. Most players contribute minimal data to the research project, with engagement concentrated in the 0-10 hour range. However, there are notable outliers: several players exceed 60 hours, and a few exceptional cases exceed 150 and even 220 hours. This skewed distribution is important for our modeling approach, as these high-engagement outliers could significantly influence KNN predictions. The pattern suggests that while most players try the server briefly, a small subset becomes highly engaged, which directly addresses our research question about identifying player types who contribute large amounts of data.
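
The skew described above could be put into numbers before modeling. The following is a minimal sketch on an invented `hours` series (not values from players.csv); the 10-hour and 60-hour thresholds simply mirror the ones mentioned in the text.

```python
import pandas as pd

# Hypothetical stand-in for the played_hours column; the values are invented.
hours = pd.Series(
    [0.0, 0.0, 0.1, 0.3, 0.7, 2.3, 3.8, 30.3, 150.0, 223.1],
    name="played_hours",
)

# Fraction of players below 10 hours.
share_under_10 = (hours < 10).mean()

# Number of players above 60 hours.
n_over_60 = int((hours > 60).sum())

# Sample skewness; a positive value indicates a right-skewed distribution.
skewness = hours.skew()

print(share_under_10, n_over_60, skewness)
```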