## Data Science 100 - Final Project

An investigation into the popularity of play times using data from a Minecraft server. 

By Group #4: Chloe Glesby, Ryann Wilson, Victoria Levner, Joyce He

### Introduction
Minecraft, created by Mojang Studios, is a video game in which users place ‘blocks’ to build structures. The game has a multiplayer mode where users can interact and collaborate. The Pacific Laboratory for Artificial Intelligence (PLAI), founded by Professor Frank Wood, created a Minecraft server, PLAICraft, to conduct research and data analysis about the video game and its users. Once players log onto this server, several kinds of data are collected from them, including the kind of player they are, the time of day they log on and off, and what they say while playing. This information is useful to the PLAI team because it can reveal patterns that help formulate predictive models. Such predictions can be used in many aspects of video game research, including estimating what the most popular times will be and how many people the server needs to accommodate, which is the goal of our data analysis.

Using the PLAI team’s Minecraft data, an analysis will be conducted in order to answer:
- How does the popularity of playing Minecraft on the PLAICraft server differ throughout the day?

Popularity is determined by:
  - The number of players on the server simultaneously
  - The number of minutes played per session


To begin the data analysis, some Python packages must first be imported into the notebook:

In [1]:
import numpy as np
import pandas as pd
import altair as alt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import set_config
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from IPython.display import display, Markdown
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

# Output dataframes instead of arrays
set_config(transform_output="pandas")

Next, the data was loaded as a data frame using the `pandas` package:

In [2]:
sessions_data = pd.read_csv(
    'https://drive.google.com/uc?id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB'
) # URL of the sessions CSV file
sessions_data.head(10)

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
5,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0
6,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,15/04/2024 07:12,15/04/2024 07:21,1713170000000.0,1713170000000.0
7,ad6390295640af1ed0e45ffc58a53b2d9074b0eea694b1...,21/09/2024 02:13,21/09/2024 02:30,1726880000000.0,1726890000000.0
8,96e190b0bf3923cd8d349eee467c09d1130af143335779...,21/06/2024 02:31,21/06/2024 02:49,1718940000000.0,1718940000000.0
9,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,16/05/2024 05:13,16/05/2024 05:52,1715840000000.0,1715840000000.0


### The Data 
The Sessions dataframe has 1535 observations and 5 variables which provide information about the time frames that players of the PLAICraft server logged on and off. 

https://drive.google.com/file/d/14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB/edit

- The column `hashedEmail` holds each player's email in an encoded form, as a string of numbers and letters. The observations in this column are not useful in this data analysis, as the identity of the players is not important to our question.
- The columns `start_time` and `end_time` provide the exact date and time that players logged on and off the server in day/month/year hour:minute format. This makes the data untidy, though, as there are multiple variables within one cell.
- The columns `original_start_time` and `original_end_time` record the log-on and log-off times in Unix time format. These columns are not useful for this data analysis, as they are in a time format that is difficult to manipulate and interpret.

Using the `count` (a simple numbered list of players) and the `start_time` and `end_time` columns, some preliminary data wrangling and filtering will be conducted to better visualize any patterns in the Sessions data set.

The start and end time format is easier to use than Unix time, but it will be more efficient to organize it into a format that Python recognizes and classifies as tidy:

In [3]:
# Convert the times into datetime format and split the components into separate columns

sessions_data['start_time'] = pd.to_datetime(sessions_data['start_time'],format='%d/%m/%Y %H:%M')
sessions_data['end_time'] = pd.to_datetime(sessions_data['end_time'],format='%d/%m/%Y %H:%M')
sessions_data["start_hour"] = sessions_data["start_time"].dt.hour
sessions_data["start_month"] = sessions_data["start_time"].dt.month
sessions_data["start_year"] = sessions_data["start_time"].dt.year 

# Drop unnecessary columns and any rows with NaN values
sessions_data = sessions_data.drop(columns = ['hashedEmail','original_start_time','original_end_time'])
sessions_data = sessions_data.dropna()

# Show the first rows of the new data set
sessions_data.head(10)

Unnamed: 0,start_time,end_time,start_hour,start_month,start_year
0,2024-06-30 18:12:00,2024-06-30 18:24:00,18,6,2024
1,2024-06-17 23:33:00,2024-06-17 23:46:00,23,6,2024
2,2024-07-25 17:34:00,2024-07-25 17:57:00,17,7,2024
3,2024-07-25 03:22:00,2024-07-25 03:58:00,3,7,2024
4,2024-05-25 16:01:00,2024-05-25 16:12:00,16,5,2024
5,2024-06-23 15:08:00,2024-06-23 17:10:00,15,6,2024
6,2024-04-15 07:12:00,2024-04-15 07:21:00,7,4,2024
7,2024-09-21 02:13:00,2024-09-21 02:30:00,2,9,2024
8,2024-06-21 02:31:00,2024-06-21 02:49:00,2,6,2024
9,2024-05-16 05:13:00,2024-05-16 05:52:00,5,5,2024
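
The explicit `format` string matters because the dates in this file are written day-first. The following is a minimal sketch on a made-up timestamp (`toy_times` and `parsed` are hypothetical names, not part of the analysis) showing how `%d/%m/%Y %H:%M` is interpreted and what the `.dt` accessors return:

```python
import pandas as pd

# Hypothetical day-first timestamp, made up for illustration
toy_times = pd.Series(["05/06/2024 18:12"])

# With the explicit format, "05/06" is read as 5 June rather than 6 May
parsed = pd.to_datetime(toy_times, format="%d/%m/%Y %H:%M")

# The .dt accessors pull individual components out of each timestamp
print(parsed.dt.day[0], parsed.dt.month[0], parsed.dt.hour[0])
```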


Next, we created a column in the dataframe that calculates the total minutes played in each playing session. This will help us in determining how long playing sessions are during different time slots in the day.

In [4]:
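# Note: this compares clock times (hour and minute) only, so a session that
# crosses midnight or lasts longer than a day is not measured as elapsed time
sessions_data['time_played_mins'] = ((sessions_data['start_time'].dt.hour * 60 + sessions_data['start_time'].dt.minute) -
(sessions_data['end_time'].dt.hour * 60 + sessions_data['end_time'].dt.minute)).abs()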
sessions_data.head(10)

Unnamed: 0,start_time,end_time,start_hour,start_month,start_year,time_played_mins
0,2024-06-30 18:12:00,2024-06-30 18:24:00,18,6,2024,12
1,2024-06-17 23:33:00,2024-06-17 23:46:00,23,6,2024,13
2,2024-07-25 17:34:00,2024-07-25 17:57:00,17,7,2024,23
3,2024-07-25 03:22:00,2024-07-25 03:58:00,3,7,2024,36
4,2024-05-25 16:01:00,2024-05-25 16:12:00,16,5,2024,11
5,2024-06-23 15:08:00,2024-06-23 17:10:00,15,6,2024,122
6,2024-04-15 07:12:00,2024-04-15 07:21:00,7,4,2024,9
7,2024-09-21 02:13:00,2024-09-21 02:30:00,2,9,2024,17
8,2024-06-21 02:31:00,2024-06-21 02:49:00,2,6,2024,18
9,2024-05-16 05:13:00,2024-05-16 05:52:00,5,5,2024,39
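
The calculation above works from the hour and minute of each timestamp, which gives the right answer when a session starts and ends on the same day. As a hedged sketch of an alternative, subtracting the two datetime columns directly yields elapsed time even across midnight. The two sessions below are made up for illustration, and `toy`, `clock_minutes`, and `elapsed_minutes` are hypothetical names:

```python
import pandas as pd

# Hypothetical sessions (not PLAICraft data); the second one crosses midnight
toy = pd.DataFrame({
    "start_time": pd.to_datetime(["2024-06-30 18:12", "2024-06-30 23:50"]),
    "end_time": pd.to_datetime(["2024-06-30 18:24", "2024-07-01 00:10"]),
})

# Clock-time difference, as in the cell above
clock_minutes = (
    (toy["start_time"].dt.hour * 60 + toy["start_time"].dt.minute)
    - (toy["end_time"].dt.hour * 60 + toy["end_time"].dt.minute)
).abs()

# Elapsed time from subtracting the timestamps themselves
elapsed_minutes = (toy["end_time"] - toy["start_time"]).dt.total_seconds() / 60

print(toy.assign(clock_minutes=clock_minutes, elapsed_minutes=elapsed_minutes))
```

In this toy example the two approaches agree on the first session (12 minutes) but differ on the second: the clock-time difference is 1420 minutes, while the elapsed time is 20 minutes. Switching approaches in the notebook would change the downstream figures and error values, so the results reported here would need to be recomputed.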


To complete our wrangling, we created another dataframe with the start hours and the total count of playing sessions that began in each hour. From this dataframe, we will be able to easily visualize the distribution of playing hours and perform our data analysis later.

In [5]:
start_hour_count = sessions_data["start_hour"].value_counts()
start_hour_data = pd.DataFrame(start_hour_count).reset_index()
start_hour_data.head(10)

Unnamed: 0,start_hour,count
0,2,152
1,4,150
2,3,131
3,0,128
4,23,122
5,21,91
6,22,89
7,5,88
8,1,79
9,20,74
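
`value_counts` orders its result by count and only lists hours in which at least one session started. If a table with every hour from 0 to 23 in order is wanted, one possible approach is to reindex the counts. This is a sketch on made-up start hours (`toy_hours` and `hour_counts` are hypothetical names), not a step the analysis above performs:

```python
import pandas as pd

# Hypothetical start hours, made up for illustration
toy_hours = pd.Series([2, 2, 4, 23, 2, 4], name="start_hour")

hour_counts = (
    toy_hours.value_counts()
    .reindex(range(24), fill_value=0)   # add missing hours with a count of 0
    .rename_axis("start_hour")          # name the index so it becomes a column
    .reset_index(name="count")
)

print(hour_counts.head())
```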


### Visualizations

Below are visualizations of the relationship between hour of the day and number of playing sessions (**Figure 1.0**, `Number of playing sessions vs hour of the day`) and the relationship between hour of the day and minutes played per session (**Figure 1.1**, `Total minutes per playing session vs hour of the day`). **Figure 1.0** will help us gauge the number of players on the server simultaneously, and **Figure 1.1** will help us gauge the number of minutes played per session.

In [6]:
hours_scatter = alt.Chart(start_hour_data).mark_point().encode(
    x=alt.X('start_hour').title('Hour of the day'),
    y=alt.Y('count').title('Number of players'),
).properties(title='Number of playing sessions vs hour of the day')

display(hours_scatter)
display(Markdown("**Figure 1.0:** Exploratory visualization of number of players active throughout the day."))

**Figure 1.0:** Exploratory visualization of number of players active throughout the day.

In [7]:
minutes_played_scatter = alt.Chart(sessions_data).mark_point().encode(
    x=alt.X("start_hour").title("Hour of the day"),
    y=alt.Y("time_played_mins").title("Total minutes per playing session")
).properties(title="Total minutes per playing session vs hour of the day")

display(minutes_played_scatter)
display(Markdown("**Figure 1.1:** Exploratory visualization of playing time throughout the day."))

**Figure 1.1:** Exploratory visualization of playing time throughout the day.

### Data Summary

For our question, we are most interested in the number of players who began a playing session in each hour, as well as the relationship between the length of each playing session in minutes and the hour in which the session began. For the first measure, we can see from our `start_hour_data` dataframe and the `Number of playing sessions vs hour of the day` visualization that 2 AM, 4 AM, and 3 AM are the most popular times for players to begin playing and 11 AM, 1 PM, and 12 PM are the least popular. For the second measure, the relationship between the variables is clearest in the visualization `Total minutes per playing session vs hour of the day`. Long playing sessions (> 1000 minutes) most commonly begin late in the day, between 8 PM and 11 PM.

In these visualizations, we can begin to see some clusters, so we expect clustering to be an effective method of analysis for grouping our data initially. In **Figure 1.0**, we can see a cluster early in the day with a high number of playing sessions, a cluster mid-day with a low number of playing sessions, and a cluster late in the day with a high number of playing sessions. In **Figure 1.1**, we can see a cluster early in the day with few total minutes, a cluster late in the day with few total minutes, and a cluster late in the day with high total minutes.

Additionally, these visualizations provide us with some information to use for our predictive analysis. In order to predict the amount of time played during one session, we can use KNN Regression, as **Figure 1.1** shows a non-linear relationship between our variables. Although it may seem counterintuitive when looking at **Figure 1.0** as it looks like the relationship follows the pattern of a parabola, we will perform linear regression to predict the number of players playing on the server at one time. To make linear regression work well on our data, we will make a few tweaks in the analysis so, instead of producing a straight line like linear regression normally would, it will produce a parabola.

### Method of Analysis

#### Exploratory Analysis: K-Means Clustering

Since we are interested in analyzing how the popularity of playing the game differs throughout the day, we will use clustering as our first method of analysis to group similar behaviours within the data. Because the data is unlabeled, the unsupervised nature of clustering offers us an easy way to handle data that isn't already categorized. With clustering, we can identify time-based patterns in player activity and distinguish which hours are the most and least popular to play during. We will use clustering to determine **(1)** groupings of hours based on the number of people playing and **(2)** groupings of hours based on the number of minutes played per session. We will also use visualizations to display clustered activity trends throughout the hours of the day. With these clusters, we can highlight key time periods such as "high-activity hours," "moderate-activity hours," and "low-activity hours," which could prove useful in any future game development.

Starting with **(1)**, the number of people playing throughout different hours of the day, we scaled the data so that all features contribute equally to the distance calculations that K-means relies on. We also set a range of cluster counts to choose from later and computed the within-cluster sum of squared distances (WSSD) for each k-value in the range.

In [8]:
# Scale the data
players_preprocessor = make_column_transformer(
    (StandardScaler(), ['start_hour', 'count']),
    remainder = 'drop',
    verbose_feature_names_out=False,
)

# Range of clusters to choose from
players_ks = range(1, 11)

# Compute WSSD for each k-value
players_wssd = [
    make_pipeline(
        players_preprocessor,
        KMeans(n_clusters=k, random_state=1234) # Always set this seed to reproduce data
    ).fit(start_hour_data)[1].inertia_ # Fit the pipeline and compute its WSSD
    for k in players_ks
]
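
To illustrate why the scaling step is there: `start_hour` runs from 0 to 23 while `count` can exceed 100, so unscaled distances would be dominated by `count`. The sketch below uses made-up values (`toy` and `scaled` are hypothetical names) to show that `StandardScaler` puts both columns on a comparable scale with mean 0 and standard deviation 1:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical hourly counts, made up for illustration
toy = pd.DataFrame({"start_hour": [0, 6, 12, 18], "count": [120, 40, 10, 60]})

# Standardize each column: subtract its mean, divide by its standard deviation
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(scaled.round(3))
print(scaled.mean().round(3).tolist(), scaled.std(ddof=0).round(3).tolist())
```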

Next, we created an elbow plot to choose the most suitable number of clusters.

In [9]:
# Create a new data frame with cluster data
players_model_stats = pd.DataFrame({
    'k' : players_ks,
    'wssd' : players_wssd,
})

# Elbow plot
players_elbow_plot = alt.Chart(players_model_stats).mark_line(point=True).encode(
    x=alt.X('k').title('Number of Clusters'),
    y=alt.Y('wssd').title('Within-cluster Sum of Squares'),
).properties(title='Elbow Plot (players)')

# Display plot and figure title
display(players_elbow_plot)
display(Markdown("**Figure 2.1:** Total WSSD for K clusters ranging from 1 to 10."))

**Figure 2.1:** Total WSSD for K clusters ranging from 1 to 10.

This visualization shows that 3 is the best number of clusters, because it is the last value of k that produces a substantial decrease in WSSD. If we were to pick fewer clusters, our results would be overgeneralized. If we were to pick more clusters, we would risk subdividing meaningful clusters.

Lastly, we created a new visualization that is colored according to the three clusters.
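
For readers unfamiliar with WSSD, it is the sum of squared distances from each point to the centre of the cluster it is assigned to, which scikit-learn exposes as `inertia_`. The sketch below checks this on synthetic points; the data and the names `points`, `km`, and `manual_wssd` are assumptions made for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points around three centres (not the PLAICraft data)
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0, 3, 6)])

km = KMeans(n_clusters=3, n_init=10, random_state=1234).fit(points)

# Look up the centre assigned to each point, then sum the squared distances
assigned_centres = km.cluster_centers_[km.labels_]
manual_wssd = ((points - assigned_centres) ** 2).sum()

print(manual_wssd, km.inertia_)
```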

In [10]:
# Perform clustering with 3 clusters
players_cluster_k3 = KMeans(n_clusters=3, random_state=1234)

# Create new pipeline
players_pipe = make_pipeline(players_preprocessor, players_cluster_k3).fit(start_hour_data)

# Assign clusters
clustered_players = start_hour_data.assign(
    cluster=players_pipe[1].labels_
)

# Visualization
players_cluster_plot = alt.Chart(clustered_players).mark_circle().encode(
    x=alt.X('start_hour').title('Hour of the Day'),
    y=alt.Y('count').title('Number of Players'),
    color=alt.Color('cluster:N').title('Cluster'),
).properties(title='Number of Players vs Hour of the Day')

# Display plots and figure titles 
display(players_cluster_plot)
display(Markdown("**Figure 2.2:** Clustering of session start times based on number of players active."))

**Figure 2.2:** Clustering of session start times based on number of players active.

Here, we can see that we have clusters for hours (0-9), (10-17), and (18-23).

We followed the same process for **(2)**, clustering hours based on the number of minutes played per session.
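
Hour ranges like these can also be read off programmatically rather than from the plot. The following sketch clusters a made-up, roughly parabolic set of hourly counts and summarizes the hours covered by each cluster; `toy`, `pipe`, and `summary` are hypothetical names, and the resulting ranges will not match the real data:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical hourly counts shaped roughly like a parabola (not the real data)
toy = pd.DataFrame({"start_hour": range(24)})
toy["count"] = (toy["start_hour"] - 12) ** 2 + 10

pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=1234),
)
pipe.fit(toy[["start_hour", "count"]])

# First hour, last hour, and number of hours in each cluster
summary = (
    toy.assign(cluster=pipe[-1].labels_)
    .groupby("cluster")["start_hour"]
    .agg(first_hour="min", last_hour="max", n_hours="size")
)
print(summary)
```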

In [11]:
# Scale the data
minutes_preprocessor = make_column_transformer(
    (StandardScaler(), ["start_hour", "time_played_mins"]),
    remainder="drop",
    verbose_feature_names_out=False
)

# Range of clusters to choose from
minutes_ks = range(1, 11)

# Compute WSSD for each k-value
minutes_wssds = [
    make_pipeline(
        minutes_preprocessor,
        KMeans(n_clusters=k, random_state=1234) # Always set this seed to reproduce data
    ).fit(sessions_data)[1].inertia_ # Fit the pipeline and compute its WSSD
    for k in minutes_ks
]

# Create a new data frame
minutes_model_stats = pd.DataFrame({
    "k":minutes_ks,
    "wssd":minutes_wssds
})

# Elbow plot
minutes_elbow_plot = alt.Chart(minutes_model_stats).mark_line(point=True).encode(
    x=alt.X("k").title("Number of Clusters"),
    y=alt.Y("wssd").title("Total Within-cluster Sum of Squares")
).properties(title = 'Elbow Plot (minutes)')

# Perform clustering with 3 clusters
minutes_kmeans = KMeans(n_clusters=3, random_state=1234)

# Create a new pipeline and fit data
minutes_pipe = make_pipeline(minutes_preprocessor, minutes_kmeans)
minutes_pipe.fit(sessions_data)

# Assign clusters
minutes_clustered = sessions_data.assign(
    cluster = minutes_pipe[1].labels_
)

# Visualization
minutes_scatter_clustered = alt.Chart(minutes_clustered).mark_circle().encode(
    x=alt.X('start_hour').title('Hour of The Day'),
    y=alt.Y('time_played_mins').title('Total Playing Time per Session (minutes)'),
    color=alt.Color('cluster:N').title('Cluster'),
).properties(title='Playing Time vs. Hour of the Day')

# Display plots and figure titles
display(minutes_elbow_plot)
display(Markdown("**Figure 2.3:** Total WSSD for K clusters ranging from 1 to 10."))

display(minutes_scatter_clustered)
display(Markdown("**Figure 2.4:** Clustering of session start times based on time played."))

**Figure 2.3:** Total WSSD for K clusters ranging from 1 to 10.

**Figure 2.4:** Clustering of session start times based on time played.

With `Elbow Plot (minutes)` we can see that 3 is the best choice for the number of clusters when grouping hours based on playing time. In the `Playing Time vs. Hour of the Day` plot, we can see clusters for the hours (0-10), (12-23), and (20-23); the last two overlap in time and are separated by session length.

#### Predictive Analysis: KNN Regression

During the exploratory analysis, we performed clustering to understand the inherent patterns in the distribution of player activity throughout the day. By grouping sessions based on their start hour and the time played, we were able to observe trends in how players engage with the game at different times. This gives us insight into the behavior of players and can lead to speculation about why these patterns are occurring. With the clustering analysis offering a view of how sessions are distributed throughout the day, we now move on to predicting the total playing time of a session from the hour of the day in which it starts, using K-Nearest Neighbors (KNN) regression. While clustering helped us identify patterns, KNN regression allows us to build a predictive model that can estimate time played for any given hour of the day. This transition from clustering to regression matters because, while clustering reveals groups or patterns, KNN regression provides a means to quantify these trends and make predictions for future sessions, which can be helpful to PLAICraft researchers. We use KNN in this case because the relationship in the data is non-linear.
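
As a brief illustration of how KNN regression forms a prediction, the sketch below uses a made-up one-feature data set: the prediction for a new point is the average target value of its `k` nearest training points. The values and the names `X`, `y`, and `knn` are assumptions for illustration, not part of the analysis:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: one feature and one target
X = np.array([[0], [1], [2], [10], [11]])
y = np.array([10, 20, 30, 100, 110])

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# The two nearest neighbours of 1.4 are x=1 and x=2, so the
# prediction is the mean of their targets: (20 + 30) / 2
prediction = knn.predict([[1.4]])[0]
print(prediction)
```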

In [12]:
# Split the data into training and testing sets
sessions_train, sessions_test = train_test_split(
    sessions_data, train_size=0.75, random_state=1234
)

# Identify feature data and target data in testing and training sets
X_train = sessions_train[['start_hour']] 
y_train = sessions_train['time_played_mins'] 

X_test = sessions_test[['start_hour']]
y_test = sessions_test['time_played_mins'] 

# Preprocess the data and make a pipeline
sessions_preprocessor = make_column_transformer((StandardScaler(), ['start_hour']))
sessions_pipe = make_pipeline(sessions_preprocessor,KNeighborsRegressor())

# Create 5-fold GridSearchCV object
param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 201, 3),
}
sessions_gridsearch = GridSearchCV(
    estimator=sessions_pipe,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Fit the GridSearchCV object
sessions_gridsearch.fit(X_train,y_train)

# Find which parameter value results in the lowest RMSPE
sessions_gridsearch.best_params_


{'kneighborsregressor__n_neighbors': 49}

From the `best_params_` attribute of the fitted GridSearchCV object, we can see that the best value of `k` for the KNN model is 49. By default, GridSearchCV refits the model with the best parameter, so that value is used whenever we make predictions.
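
Beyond the single best value, the cross-validation error for every candidate `k` is available in the `cv_results_` attribute. The sketch below shows one way to tabulate it on synthetic data; the data and the names `search` and `summary` are hypothetical. Because the scoring is `neg_root_mean_squared_error`, the scores are negated to recover the RMSPE:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (not the PLAICraft sessions)
rng = np.random.default_rng(1)
X = pd.DataFrame({"start_hour": rng.integers(0, 24, size=200)})
y = 30 + 5 * np.abs(X["start_hour"] - 12) + rng.normal(0, 10, size=200)

search = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsRegressor()),
    param_grid={"kneighborsregressor__n_neighbors": range(1, 52, 5)},
    cv=5,
    scoring="neg_root_mean_squared_error",
).fit(X, y)

# One row per candidate k; flip the sign to get a positive error
results = pd.DataFrame(search.cv_results_)
results["mean_rmspe"] = -results["mean_test_score"]
summary = results[["param_kneighborsregressor__n_neighbors", "mean_rmspe"]]
print(summary)
```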

In [13]:
# Evaluate on the test set; assign returns a new dataframe instead of
# modifying a slice in place
sessions_test = sessions_test.assign(
    predicted=sessions_gridsearch.predict(X_test)
)
RMSPE = mean_squared_error(
    y_true=y_test,
    y_pred=sessions_test["predicted"]
)**(1/2)
RMSPE

np.float64(262.26607833218856)

In [14]:
# Assign predictions to data 
sessions_preds = sessions_train.assign(
    predictions= sessions_gridsearch.predict(sessions_train[['start_hour']])
)

# Plotting the data
base_plot = alt.Chart(sessions_preds).mark_circle(opacity=0.4).encode(
    x=alt.X('start_hour').title('Hour of the Day'),
    y=alt.Y('time_played_mins').title('Total Playing Time per Session (minutes)'),
)

# Plotting the predictions
pred_line = (
    alt.Chart(sessions_preds)
    .mark_line(color='black')
    .encode(
        x='start_hour',
        y='predictions'
    )
)

# Visualizing the data and predictions together
sessions_plot = alt.layer(base_plot, pred_line).properties(
    title='Total time played vs. Hour of the Day'
)
# Create a figure title and display the plot
display(sessions_plot)
display(Markdown("**Figure 5:** Predicted values of playing time (black line) for the KNN regression model."))

**Figure 5:** Predicted values of playing time (black line) for the KNN regression model.

#### Predictive Analysis: Linear Regression

Next, we will use linear regression to predict the number of players at different hours of the day. Linear regression analyzes the relationship between a dependent variable (in our case, `Number of players`) and one or more independent variables (`Hour of the day`) and finds the best-fit line that can be used to predict new values. Linear regression normally only works on relationships that have a clear linear association, and the relationship we are looking at does not, but there is a way around this. As seen in **Figure 1.0**, the relationship between `Number of players` and `Hour of the day` appears to follow a parabola. We will use `PolynomialFeatures`, a preprocessing tool in scikit-learn, to build our linear regression model ("PolynomialFeatures," n.d.; Sucky, 2023).

First, we will split the `start_hour_data` into a training set and a testing set, with 70% of the data used for training and 30% for testing. This allows us to assess the model's performance and prevents the model from being evaluated on the same data used to train it.

In [15]:
# Split data into training and testing sets
hour_train, hour_test = train_test_split(
    start_hour_data, test_size=0.3, random_state=1234
)

# Identify feature data and target data in testing and training sets
X_train_hour = hour_train[['start_hour']] 
y_train_hour = hour_train['count']
X_test_hour = hour_test[['start_hour']]
y_test_hour = hour_test['count']
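
With `test_size=0.3`, about 30% of the rows go to the test set and the remainder to the training set. The sketch below checks the resulting sizes on a hypothetical 24-row frame (one row per hour); `toy`, `toy_train`, and `toy_test` are made-up names, and the real `start_hour_data` may have a different number of rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with one row per hour of the day
toy = pd.DataFrame({"start_hour": range(24), "count": range(24)})

toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=1234)

# The test size is rounded up, so 24 rows split into 16 training and 8 test rows
print(len(toy_train), len(toy_test))
```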

Next, we will create a pipeline to chain together our `PolynomialFeatures` and `LinearRegression` steps. By setting the `degree` parameter of `PolynomialFeatures` to 2, we tell it to generate polynomial terms of `start_hour` up to degree 2: a constant column, `start_hour` itself, and `start_hour` squared. This allows the best-fit line to take the shape of a parabola when we fit the regression. After we create our pipeline, we will fit it to our training data.

In [16]:
# Create pipeline with PolynomialFeatures and LinearRegression
hour_pipeline = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("lm", LinearRegression())
])

# Fit pipeline to training data
hour_fit = hour_pipeline.fit(X_train_hour, y_train_hour)
hour_fit

Then, we will use our fitted pipeline to predict values. We will use the `predict` method to create predictions for the training data and the `assign` method to create a new column in the dataframe with those values. Next, we will create a visualization of the model predictions by overlaying a line of the predictions on the training data.

In [17]:
# Creating predictions
hour_preds = hour_train.assign(
    predictions = hour_fit.predict(X_train_hour)
)

# Plotting the data
hour_base_plot = alt.Chart(hour_preds).mark_circle().encode(
    x=alt.X('start_hour').title('Hour of the Day'),
    y=alt.Y('count').title('Number of Players'),
)

# Plotting the predictions
hour_pred_line = (
    alt.Chart(hour_preds)
    .mark_line(color='black')
    .encode(
        x='start_hour',
        y='predictions'
    )
)

# Visualizing the data and predictions together
hour_plot = alt.layer(hour_base_plot, hour_pred_line).properties(
    title='Number of Players vs Hour of the Day'
)

# Create a figure title and display the plot
display(hour_plot)
display(Markdown("**Figure 6.0:** Predicted number of players (black line) for the linear regression model on training data."))

**Figure 6.0:** Predicted number of players (black line) for the linear regression model on training data.

Now we will move on to the test data. Once again, we will use the `predict` and `assign` methods to create predictions and then create a visualization of the predictions on top of the test data.

In [18]:
# Creating predictions
hour_test_preds = hour_test.assign(
    predictions = hour_fit.predict(X_test_hour)
)

# Plotting the data
hour_test_base_plot = alt.Chart(hour_test).mark_circle().encode(
    x=alt.X('start_hour').title('Hour of the Day'),
    y=alt.Y('count').title('Number of Players'),
)

# Plotting the predictions
hour_test_pred_line = (
    alt.Chart(hour_test_preds)
    .mark_line(color='black')
    .encode(
        x='start_hour',
        y='predictions'
    )
)

# Visualizing the data and predictions together
hour_test_plot = alt.layer(hour_test_base_plot, hour_test_pred_line).properties(
    title='Number of Players vs Hour of the Day'
)

# Create a figure title and display the plot
display(hour_test_plot)
display(Markdown("**Figure 6.1:** Predicted number of players (black line) for the linear regression model on test data."))


**Figure 6.1:** Predicted number of players (black line) for the linear regression model on test data.

Next, we will obtain the slopes and intercept of our linear regression model from its `coef_` and `intercept_` attributes and create an equation for our model. Because we used a degree-two polynomial for linear regression, we will get two slopes: one for the degree-two term and one for the degree-one term. From this, we will be able to construct an equation to predict the number of players at one time based on the hour of the day.

In [19]:
# Access the linear regression model from the pipeline
model = hour_fit.named_steps['lm']

# Create dataframe with slopes and intercept
pd.DataFrame({"slope for $x^2$":[model.coef_[2]], "slope for $x$":[model.coef_[1]], "intercept":[model.intercept_]})

Unnamed: 0,slope for $x^2$,slope for $x$,intercept
0,0.885244,-21.193246,141.647599
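
The indexing into `coef_` above follows the column order produced by `PolynomialFeatures`: position 0 belongs to the constant column, position 1 to `x`, and position 2 to `x` squared. As a hedged check of that layout, the sketch below fits the same kind of pipeline to made-up data generated from a known parabola, `y = 2x^2 - 3x + 5` (`toy_fit` and `lm` are hypothetical names):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data lying exactly on y = 2x^2 - 3x + 5
x = np.arange(10).reshape(-1, 1)
y = 2 * x.ravel() ** 2 - 3 * x.ravel() + 5

toy_fit = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("lm", LinearRegression()),
]).fit(x, y)

lm = toy_fit.named_steps["lm"]

# coef_[0] pairs with the constant column; the constant itself is in intercept_
print(lm.coef_.round(6), round(float(lm.intercept_), 6))
```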


$\text{Predicted Number of Players} = 0.885244 \times (\text{Hour of the Day})^2 - 21.193246 \times (\text{Hour of the Day}) + 141.647599$

We can draw a few conclusions from this equation. At 12 AM (hour of the day = 0), we can expect to see 142 players on the server (intercept = 141.647599, rounded to the nearest whole number because we can't have fractions of people). Each additional hour of the day initially decreases the expected number of players by about 21 ($x$ coefficient = -21.193246). The positive $x^2$ coefficient (0.885244) means the curve opens upward, so this decrease slows as the day goes on and the predicted number of players eventually rises again. We can generalize these findings: at early and late hours of the day, there will be a high number of players on the server; during midday, the number of players on the server will be low.

Finally, we will calculate the root mean square prediction error (RMSPE) on the test data. RMSPE is the square root of the average squared difference between the predicted and true values across the observations in our test set, and it estimates how well our model predicts target values for unseen data. We will compute it from the test predictions using the `mean_squared_error` function and then taking the square root.
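
The fitted equation can be evaluated directly to see where the predicted number of players is lowest. The sketch below uses the coefficients reported above; the function name `predicted_players` and the variable `vertex_hour` are hypothetical. For a parabola $ax^2 + bx + c$ with $a > 0$, the minimum occurs at $x = -b / (2a)$.

```python
# Coefficients taken from the fitted model reported above
a, b, c = 0.885244, -21.193246, 141.647599

def predicted_players(hour):
    """Evaluate the fitted quadratic at a given hour of the day."""
    return a * hour ** 2 + b * hour + c

# Hour at which the parabola reaches its minimum
vertex_hour = -b / (2 * a)

for hour in (0, 6, 12, 18, 23):
    print(hour, round(predicted_players(hour), 1))
print("minimum near hour", round(vertex_hour, 2))
```

The minimum falls close to hour 12, where the prediction is roughly 15 players, which agrees with the midday dip described above.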

In [20]:
hour_rmspe = mean_squared_error(
    y_true=y_test_hour,
    y_pred=hour_test_preds["predictions"]
)**(1/2)
hour_rmspe

np.float64(37.39855464128995)

Our model's test error is approximately 37.4 players.
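
For reference, the RMSPE can be written as $\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ is the true value and $\hat{y}_i$ the predicted value for observation $i$. The sketch below computes it by hand on made-up numbers and compares the result with the `mean_squared_error` route used in this notebook; `y_true`, `y_pred`, and the helper `rmspe` are hypothetical:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmspe(y_true, y_pred):
    """Square root of the mean squared difference between truth and prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up true values and predictions, for illustration only
y_true = [12, 13, 23, 36, 11]
y_pred = [20, 20, 20, 20, 20]

manual = rmspe(y_true, y_pred)
via_sklearn = mean_squared_error(y_true=y_true, y_pred=y_pred) ** (1 / 2)

print(manual, via_sklearn)
```

Because the error is expressed in the same units as the target, it can be compared directly with the range of observed values to judge how large it is.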

### Discussion
Using the PLAI team’s Sessions data, our group aimed to analyze how the popularity of playing on the PLAICraft server differs throughout the day. We defined popularity by the number of players on the PLAICraft server at once and the number of minutes played per session. Using the K-means clustering algorithm on session length (Figure 2.4), we found three clear clusters: one showing shorter playing sessions early in the day, one showing shorter playing sessions late in the day, and one showing longer playing sessions late in the day. These clusters help identify the times of day when players play the longest. Using the K-means clustering algorithm again (Figure 2.2), we found three clear clusters in the relationship between the hour of the day and the number of players playing. The first cluster showed a high number of players early in the day, the second cluster showed a low number of players midday, and the last cluster showed a high number of players late in the day. These clusters help identify the times of day when there is a high number of simultaneous players.

A KNN regression model was fit to the relationship between hour of the day and playing session length (Figure 5). The model predicts the length of a playing session starting at any given hour of the day. A second-degree polynomial regression model was fit to the relationship between hour of the day and number of players (Figure 6.1). The model predicts how many simultaneous players are on the server at a given hour of the day.

These four analyses describe the popularity of the PLAICraft server throughout the day by indicating when the server will be busiest and for how long: late in the day is the most popular time. This conclusion was expected, as there is some research on when people play video games the most (Peracchia & Curcio, 2018). Unexpectedly, though, our model predicted that early morning is also popular and that the afternoon has a clear drop in playing sessions. These findings could be useful to the PLAICraft team because they give clear information on the popularity of the server, so the team can be better equipped to accommodate peaks in players and playtime duration. They can also help the PLAI team allocate funds and resources to improve the efficiency of data collection. Additionally, further analysis could focus on a single cluster, as there might be patterns within each one. These are just some of the ways to use this analysis to interpret the results of the PLAICraft project, and there are many other methods that could lead to new conclusions. Future analyses of this data set could ask: why is there a spike in the number of players in the morning and at night, what could this analysis reveal about the other data frame that PLAICraft provided, and how could the PLAI team attract more players during the ‘slow’ hours? These are just a handful of the questions that data sets like Sessions could help answer. With all that said, this data analysis was successful in showcasing how popularity differs over time and provides a starting point for answering many more future questions.

### References

Peracchia, S., & Curcio, G. (2018). Exposure to video games: Effects on sleep and on post-sleep cognitive abilities. A systematic review of experimental evidence. *Frontiers in Human Neuroscience*, *12*, 320. https://doi.org/10.3389/fnhum.2018.00320

PolynomialFeatures. (n.d.). *scikit-learn*. https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

Sucky, R. N. [RegenerativeToday]. (2023, June 12). *Polynomial Regression in Python - sklearn* [Video]. YouTube. https://www.youtube.com/watch?v=nqNdBlA-j4w