## Project Planning ##

### Data Descriptions ###

For this project we were given two datasets: the players dataset and the sessions dataset. I chose the former for this proposal:

In [38]:
import pandas as pd
import altair as alt

players = pd.read_csv("data/players.csv")

# If you can't read the datasets try these lines
# url= "https://raw.githubusercontent.com/sydlpeters/dsci-group-2025w1-group-101-1/refs/heads/main/data/players.csv"
# players = pd.read_csv(url)

Let's first get a look at the players dataset:

In [39]:
print(players.shape)
players.head()

(196, 9)


Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,


From the shape, we know that the dataset has 196 observations. Since each row describes one player, this means 196 players have been recorded on the server.

We also have 9 fields, which represent the following:

| Field | Data Type | Description |
| :------- | :------: | :------- |
| experience | String (Nominal) | Classifies each player under one of 5 labels according to the player's experience (Pro, Veteran, Amateur, Regular, and Beginner). |
| subscribe | Bool (Nominal) | Tells us whether a player is subscribed to the server. |
| hashedEmail | String (Nominal) | Hashed email of each player. It can serve as an individual ID for each player. |
| played_hours | Float (Quantitative) | The number of hours a player has played on the server. It also represents the amount of data generated by a player. Minutes are only accessible through conversion. |
| name | String (Nominal) | A player's name. |
| gender | String (Nominal) | Describes a player's gender. |
| age | Int (Quantitative/Nominal) | Describes the player's age. |
| individualId | N/A | No associated data was found in the dataset. |
| organizationName | N/A | No associated data was found in the dataset. |

There were some problems with the data. The name field gives us no relevant information for answering any of the questions, and it is redundant because we already have the hash of an email as an identifier. We also see that the individualId and organizationName fields are empty. To clean the data, we will have to drop these fields in a future wrangling task.
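One way to confirm which fields are empty before dropping them is to measure the share of missing values per column. The sketch below uses a small stand-in frame called `sample`, whose rows are made up for illustration and are not taken from players.csv. On the real data the same calls would be applied to `players`.

```python
import pandas as pd

# Stand-in for the players table; these rows are illustrative only.
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "played_hours": [30.3, 3.8, 0.7],
    "individualId": [None, None, None],
    "organizationName": [None, None, None],
})

# Share of missing values in each column (1.0 means the column is entirely empty).
missing_share = sample.isna().mean()

# Columns with no data at all are candidates for dropping.
empty_columns = missing_share[missing_share == 1.0].index.tolist()
print(empty_columns)
```

Checking the share rather than only eyeballing `head()` guards against a column that looks empty in the first rows but has values further down.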

It is also important to note that, while most of our data comes from the players dataset, the sessions dataset holds important information as well. It can help us determine our players' habits, which are a useful basis for segmentation. A player's relationship with the server might point to more specific causes that lead them to play more hours.

### Question ###

As a team, we agreed on solving question 2: 
>We would like to know which "kinds" of players are most likely to contribute a large amount of data?

I personally will try to answer the question: 

>"Which type of player is more likely to be a consistent player in the server?"

Variables of Interest:

**Explanatory Variables**: any variable that segments players into groups.

**Response Variable**: total_hours, which corresponds to the played_hours field in the players dataset. It represents the amount of data generated by our players, so we should seek to _maximize_ it.

In general, the idea is to use the explanatory variables to find the "consistent player", that is, the player who plays many sessions over time. We are looking for the player who will maximize total_hours. Consistent players are likely to generate more data because they connect to the server regularly. With this in mind, I would expect Regular, Veteran, and Pro players to be more consistent and thus to have more hours. I would also expect younger players to have more total hours.

### Exploratory Data Analysis ###

Let's first clean the data to make it tidy.

In [42]:
# Drop the unused fields
players_wrangled = players.drop(columns=["name", "individualId", "organizationName"])
players_wrangled

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Male,17
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Male,21
...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Female,17
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Male,22
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Prefer not to say,17
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Male,17


In [61]:
alt.Chart(players_wrangled).mark_bar().encode(
    # count() tallies players per group, so the axis is labelled as a count
    x=alt.X("count():Q", title="Number of Players"),
    y=alt.Y("experience:O", title="Experience"),
    color=alt.Color("experience").title("Experience")
)

In [58]:
alt.Chart(players_wrangled).mark_bar().encode(
    # count() tallies players per group, so the axis is labelled as a count
    x=alt.X("count():Q", title="Number of Players"),
    y=alt.Y("subscribe:O", title="Subscribed"),
    color=alt.Color("subscribe").title("Subscribed")
)

In [57]:
alt.Chart(players_wrangled).mark_bar().encode(
    # count() tallies players per group, so the axis is labelled as a count
    x=alt.X("count():Q", title="Number of Players"),
    y=alt.Y("gender:O", title="Gender"),
    color=alt.Color("gender").title("Gender")
)

In [62]:
alt.Chart(players_wrangled).mark_point().encode(
    alt.X("age").title("Age"),
    alt.Y("played_hours").title("Hours Played"),
    alt.Color("experience").title("Experience")
)

According to this exploration, the groups that contribute the most are male players, players around 20 years of age, and players with Amateur experience. We can watch for these groups as we continue our analysis as a way to get more data.
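Because "contributes the most" can mean the most players, the highest average hours, or the most hours in total, it may help to tabulate all three side by side. The following is a minimal sketch on made-up rows. The frame `sample` and its values are illustrative assumptions, and on the real data `players_wrangled` would take its place.

```python
import pandas as pd

# Illustrative rows only; replace `sample` with players_wrangled on the real data.
sample = pd.DataFrame({
    "experience": ["Pro", "Amateur", "Amateur", "Veteran", "Amateur"],
    "played_hours": [30.3, 0.7, 0.0, 3.8, 2.3],
})

# Per-group player count, mean hours, and total hours, sorted by total hours.
summary = (
    sample.groupby("experience")["played_hours"]
    .agg(players="count", mean_hours="mean", total_hours="sum")
    .sort_values("total_hours", ascending=False)
)
print(summary)
```

In this toy example the largest group (Amateur) is not the group with the most hours (Pro), which is why the three measures are worth separating before drawing conclusions.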

### Proposals ###

**1. Wrangling and using the Sessions Dataset**

I propose we explore the data further by using the sessions dataset to learn how players behave in terms of session length and number of sessions. This will address a problem we saw in our exploration: we don't know who will generate data consistently. The exploration above showed us which type of player _looks at_ the server, but not who _stays in_. The sessions data will let us identify the consistent player and thus who is more likely to generate more data in the long run, which is the overall objective of this analysis.
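As a rough sketch of this wrangling step, the code below computes the number of sessions and the average session length per player. The sessions schema is not described in this proposal, so the column names `start_time` and `end_time`, the timestamp format, and all values are assumptions made for illustration. Only `hashedEmail` is known from the players dataset, and it is assumed here to be the shared key.

```python
import pandas as pd

# Hypothetical sessions table: column names and values are assumptions.
sessions = pd.DataFrame({
    "hashedEmail": ["a", "a", "b"],
    "start_time": ["2024-05-01 10:00", "2024-05-02 18:30", "2024-05-01 09:00"],
    "end_time": ["2024-05-01 11:30", "2024-05-02 19:00", "2024-05-01 09:15"],
})

# Parse the timestamps and compute each session's length in hours.
start = pd.to_datetime(sessions["start_time"])
end = pd.to_datetime(sessions["end_time"])
sessions["session_hours"] = (end - start).dt.total_seconds() / 3600

# One row per player: number of sessions and average session length.
per_player = (
    sessions.groupby("hashedEmail")["session_hours"]
    .agg(n_sessions="count", hours_per_session="mean")
    .reset_index()
)
print(per_player)
```

If the key matches, the result could then be attached to the players table with something like `players_wrangled.merge(per_player, on="hashedEmail", how="left")`, so that each player's demographics sit next to their session habits.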

**2. Linear Regression and Segmentation Analysis**

After we know what each player's relationship with the server looks like, we can apply linear regression with a 70/30 train/test split, with the goal of maximizing the total number of hours. To do so, we can fit regressions on different subsets of the data. We will regress two key variables, number of sessions and hours per session, and then join the two resulting regressions into a prediction of how many total hours one player of a subset is likely to contribute. Then we can pick the subset that the regression shows to have the highest hours-per-session and focus on recruiting those players as a way to maximize total_hours.
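The split-and-fit step could look roughly like the sketch below. It fits one of the two regressions (number of sessions) with ordinary least squares from NumPy, as a stand-in for whichever regression library the team ends up using. The data are synthetic, and the choice of `age` as the single predictor and the name `n_sessions` are assumptions made only to keep the example runnable.

```python
import numpy as np
import pandas as pd

# Synthetic data, generated only to make the sketch runnable.
rng = np.random.default_rng(0)
n = 100
age = rng.integers(9, 40, size=n)
n_sessions = 20 - 0.3 * age + rng.normal(0, 1, size=n)
data = pd.DataFrame({"age": age, "n_sessions": n_sessions})

# 70/30 train/test split: sample 70% of rows, keep the rest for testing.
train = data.sample(frac=0.7, random_state=0)
test = data.drop(train.index)

# Fit n_sessions = intercept + slope * age by ordinary least squares.
X_train = np.column_stack([np.ones(len(train)), train["age"].to_numpy()])
coef, *_ = np.linalg.lstsq(X_train, train["n_sessions"].to_numpy(), rcond=None)

# Evaluate on the held-out 30% with root mean squared error.
X_test = np.column_stack([np.ones(len(test)), test["age"].to_numpy()])
pred = X_test @ coef
rmse = float(np.sqrt(np.mean((test["n_sessions"].to_numpy() - pred) ** 2)))
print(coef, rmse)
```

A second fit for hours per session would follow the same pattern. Multiplying the two predictions gives a rough estimate of total hours per player, though that product is only an approximation when the two quantities are correlated.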

