### Data Science 100 - Final Project

An investigation of the popularity of certain play-times using data from a Minecraft server.

By Group #4: Chloe Glesby, 

### Introduction:
Minecraft, published by Mojang Studios, is an online video game in which users place ‘blocks’ to build different structures. The game has a multiplayer mode where users can interact and collaborate. The Pacific Laboratory for Artificial Intelligence (PLAI), founded by Professor Frank Wood, created a Minecraft server to conduct research and data analysis on the game and its users. Once players log onto this server, several kinds of data are collected, including the kind of player they are, the time of day they log on and off, and what they say while playing. This information is useful to the PLAI team because it can reveal patterns that help build predictive models. Such predictions can support many aspects of video game research, including which times will be most popular and how many people the server needs to accommodate, which is the goal of this data analysis.

Using the PLAI team’s Minecraft data, an analysis will be conducted to predict:

a) What time windows are most likely to have a large number of simultaneous players?

b) test

To begin the data analysis, some Python packages must first be imported into the notebook:

In [1]:
import numpy as np
import pandas as pd
import altair as alt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn import set_config


# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

# Output dataframes instead of arrays
set_config(transform_output="pandas")

Next, the data was loaded as a simple data frame using the 'pandas' package:

In [4]:
sessions_data = pd.read_csv('sessions.csv') # the file path to the CSV goes inside the parentheses
sessions_data

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24,1.719770e+12,1.719770e+12
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46,1.718670e+12,1.718670e+12
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57,1.721930e+12,1.721930e+12
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58,1.721880e+12,1.721880e+12
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12,1.716650e+12,1.716650e+12
...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,10/05/2024 23:01,10/05/2024 23:07,1.715380e+12,1.715380e+12
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,01/07/2024 04:08,01/07/2024 04:19,1.719810e+12,1.719810e+12
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,28/07/2024 15:36,28/07/2024 15:57,1.722180e+12,1.722180e+12
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,25/07/2024 06:15,25/07/2024 06:22,1.721890e+12,1.721890e+12


### The Data 
The Sessions dataframe has 1535 observations and 5 variables, which provide information about when players of the PLAICraft server logged on and off.

https://drive.google.com/file/d/14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB/edit

- The column 'hashedEmail' contains each individual player's email address in an encoded form, as a string of numbers and letters. This column is not used in this data analysis, because the identity of the players is not relevant to the predictive questions asked here.
- The columns 'start_time' and 'end_time' give the exact date and time that players logged on and off the server, in day/month/year hour:minute format. This makes the data untidy, though, as each cell holds multiple variables.
- The columns 'original_start_time' and 'original_end_time' give the log-on and log-off times in Unix time format. These columns are not used in this data analysis, as that format is harder to interpret and manipulate directly.
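
For reference, Unix-time values like these can still be turned into readable dates if they are ever needed. The sketch below is only an illustration: it assumes the values are milliseconds since 1970-01-01 UTC (which their magnitude suggests), and it uses two numbers copied from the rounded display above, so the converted times are approximate rather than exact session times.

```python
import pandas as pd

# Hypothetical sample: two values copied from the rounded notebook display,
# assumed to be milliseconds since 1970-01-01 (UTC).
unix_ms = pd.Series([1.719770e12, 1.718670e12])

# unit="ms" tells pandas how to interpret the numbers.
as_datetime = pd.to_datetime(unix_ms, unit="ms")
print(as_datetime)
```

Because the display rounds the values, the result only lands near the matching 'start_time' entries; the full-precision values in the CSV file would be needed for an exact comparison.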

Using the 'count' (a simple numbered list of players) and the 'start_time' and 'end_time' columns, some preliminary data wrangling and filtering will be conducted to better visualize any patterns in the Sessions data set.






The start and end time format is easier to use than Unix time, but it will be more efficient to organize it into a format that Python recognizes and classifies as tidy:

In [3]:
# convert the times into 'datetime' format and split each variable into a separate column

sessions_data['start_time'] = pd.to_datetime(sessions_data['start_time'],format='%d/%m/%Y %H:%M')
sessions_data['end_time'] = pd.to_datetime(sessions_data['end_time'],format='%d/%m/%Y %H:%M')
sessions_data["start_hour"] = sessions_data["start_time"].dt.hour
sessions_data["start_month"] = sessions_data["start_time"].dt.month
sessions_data["start_year"] = sessions_data["start_time"].dt.year 

# drop unnecessary columns 
sessions_data = sessions_data.drop(columns = ['hashedEmail','original_start_time','original_end_time'])

# display the new data set
sessions_data

Unnamed: 0,start_time,end_time,start_hour,start_month,start_year
0,2024-06-30 18:12:00,2024-06-30 18:24:00,18,6,2024
1,2024-06-17 23:33:00,2024-06-17 23:46:00,23,6,2024
2,2024-07-25 17:34:00,2024-07-25 17:57:00,17,7,2024
3,2024-07-25 03:22:00,2024-07-25 03:58:00,3,7,2024
4,2024-05-25 16:01:00,2024-05-25 16:12:00,16,5,2024
...,...,...,...,...,...
1530,2024-05-10 23:01:00,2024-05-10 23:07:00,23,5,2024
1531,2024-07-01 04:08:00,2024-07-01 04:19:00,4,7,2024
1532,2024-07-28 15:36:00,2024-07-28 15:57:00,15,7,2024
1533,2024-07-25 06:15:00,2024-07-25 06:22:00,6,7,2024
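
As a small aside on the parsing step, the sketch below shows how the same format string behaves on a tiny, made-up sample, and how other calendar fields could be pulled out through the `.dt` accessor. The `sample` frame, the `errors="coerce"` option, and the `start_weekday` and `start_minute_of_day` columns are illustrative assumptions and are not part of the analysis above.

```python
import pandas as pd

# Hypothetical two-row sample in the same day/month/year layout as sessions.csv;
# the second row is deliberately malformed.
sample = pd.DataFrame({"start_time": ["30/06/2024 18:12", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising an error.
sample["start_time"] = pd.to_datetime(
    sample["start_time"], format="%d/%m/%Y %H:%M", errors="coerce"
)

# Other calendar fields come from the same .dt accessor used for the hour.
sample["start_weekday"] = sample["start_time"].dt.day_name()
sample["start_minute_of_day"] = (
    sample["start_time"].dt.hour * 60 + sample["start_time"].dt.minute
)
print(sample)
```

Coercing bad values to NaT makes it easy to count and inspect problem rows with `isna()` before deciding whether to drop them.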


In [4]:
# subtract the full timestamps so that sessions crossing midnight are measured correctly
sessions_data['time_played_mins'] = (sessions_data['end_time'] - sessions_data['start_time']).dt.total_seconds() / 60
sessions_data

Unnamed: 0,start_time,end_time,start_hour,start_month,start_year,time_played_mins
0,2024-06-30 18:12:00,2024-06-30 18:24:00,18,6,2024,12.0
1,2024-06-17 23:33:00,2024-06-17 23:46:00,23,6,2024,13.0
2,2024-07-25 17:34:00,2024-07-25 17:57:00,17,7,2024,23.0
3,2024-07-25 03:22:00,2024-07-25 03:58:00,3,7,2024,36.0
4,2024-05-25 16:01:00,2024-05-25 16:12:00,16,5,2024,11.0
...,...,...,...,...,...,...
1530,2024-05-10 23:01:00,2024-05-10 23:07:00,23,5,2024,6.0
1531,2024-07-01 04:08:00,2024-07-01 04:19:00,4,7,2024,11.0
1532,2024-07-28 15:36:00,2024-07-28 15:57:00,15,7,2024,21.0
1533,2024-07-25 06:15:00,2024-07-25 06:22:00,6,7,2024,7.0
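
A session that crosses midnight is a useful check for any duration calculation. The sketch below uses a made-up pair of timestamps (not taken from the data) to compare subtracting full timestamps with comparing clock times only.

```python
import pandas as pd

# Hypothetical session that starts before midnight and ends after it.
start = pd.Timestamp("2024-07-01 23:50")
end = pd.Timestamp("2024-07-02 00:10")

# Subtracting full timestamps gives a Timedelta that respects the date change.
elapsed_mins = (end - start).total_seconds() / 60

# Comparing only hours and minutes ignores the date and overstates the gap.
clock_diff_mins = abs(
    (start.hour * 60 + start.minute) - (end.hour * 60 + end.minute)
)

print(elapsed_mins)     # 20.0
print(clock_diff_mins)  # 1420
```

For sessions that start and end on the same day, both approaches agree, which is why the rows displayed above look the same either way.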


In [5]:
time_played_plot = alt.Chart(sessions_data).mark_bar().encode(
    x=alt.X('time_played_mins').title('Time played (min)').bin(maxbins=20),
    y=alt.Y('count()').title('Number of sessions')
).properties(title='Distribution of session count based on time played')
time_played_plot 
# shows how long players typically play for

In [6]:
hours_plot = alt.Chart(sessions_data).mark_bar().encode(
    x=alt.X('start_hour').title('Hour of the day').bin(maxbins=20),
    y=alt.Y('count()').title('Number of sessions')
).properties(title='Distribution of session count throughout the day')
hours_plot

In [7]:
hours_scatter = alt.Chart(sessions_data).mark_point().encode(
    x=alt.X('start_hour').title('Hour of the day'),
    y=alt.Y('count()').title('Number of sessions')
).properties(title='Distribution of session count throughout the day')
hours_scatter

In [8]:
hours_line = alt.Chart(sessions_data).mark_line().encode(
    x=alt.X('start_hour').title('Hour of the day').bin(maxbins=20),
    y=alt.Y('count()').title('Number of sessions')
).properties(title='Distribution of session count throughout the day')
hours_line
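
The charts above count rows per starting hour. The same numbers can be tabulated directly in pandas, which makes it easier to read off the busiest hours. The sketch below is a minimal illustration on a made-up frame holding only the nine 'start_hour' values visible in the displayed rows; the names `sample`, `sessions_per_hour`, `n_sessions`, and `busiest` are assumptions for this example.

```python
import pandas as pd

# Hypothetical sample: the start_hour values of the nine rows displayed above.
sample = pd.DataFrame({"start_hour": [18, 23, 17, 3, 16, 23, 4, 15, 6]})

# Count sessions per starting hour, filling hours with no sessions with 0.
sessions_per_hour = (
    sample.groupby("start_hour")
    .size()
    .reindex(range(24), fill_value=0)
    .rename_axis("start_hour")
    .reset_index(name="n_sessions")
)

# The three hours with the most session starts.
busiest = sessions_per_hour.sort_values("n_sessions", ascending=False).head(3)
print(busiest)
```

Note that this counts session starts, so one player who logs on several times in the same hour is counted once per session.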

In [9]:
sessions_data

Unnamed: 0,start_time,end_time,start_hour,start_month,start_year,time_played_mins
0,2024-06-30 18:12:00,2024-06-30 18:24:00,18,6,2024,12.0
1,2024-06-17 23:33:00,2024-06-17 23:46:00,23,6,2024,13.0
2,2024-07-25 17:34:00,2024-07-25 17:57:00,17,7,2024,23.0
3,2024-07-25 03:22:00,2024-07-25 03:58:00,3,7,2024,36.0
4,2024-05-25 16:01:00,2024-05-25 16:12:00,16,5,2024,11.0
...,...,...,...,...,...,...
1530,2024-05-10 23:01:00,2024-05-10 23:07:00,23,5,2024,6.0
1531,2024-07-01 04:08:00,2024-07-01 04:19:00,4,7,2024,11.0
1532,2024-07-28 15:36:00,2024-07-28 15:57:00,15,7,2024,21.0
1533,2024-07-25 06:15:00,2024-07-25 06:22:00,6,7,2024,7.0


I kept the datetime columns because I think they might be useful if we want to pull out other values later.
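
One value the datetime columns could provide is the number of sessions that overlap at any moment, which relates to question (a). The sketch below is one possible approach, not the method used in this project: it treats each log-on as +1 and each log-off as -1 and takes a running total. The four sessions in `sample` are made up, and the approach assumes no missing 'start_time' or 'end_time' values.

```python
import pandas as pd

# Hypothetical sessions; the real data would use sessions_data instead.
sample = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2024-07-25 17:30", "2024-07-25 17:34",
        "2024-07-25 17:50", "2024-07-25 19:00",
    ]),
    "end_time": pd.to_datetime([
        "2024-07-25 17:45", "2024-07-25 17:57",
        "2024-07-25 18:10", "2024-07-25 19:20",
    ]),
})

# Each log-on adds one active session, each log-off removes one.
events = pd.concat([
    pd.DataFrame({"time": sample["start_time"], "change": 1}),
    pd.DataFrame({"time": sample["end_time"], "change": -1}),
])

# Sort by time; at tied timestamps, log-offs (-1) are processed before log-ons.
events = events.sort_values(["time", "change"]).reset_index(drop=True)

# The running total is the number of sessions active after each event.
events["online"] = events["change"].cumsum()

# First moment at which the maximum overlap is reached.
peak = events.loc[events["online"].idxmax()]
print(events)
print(peak)
```

Since 'hashedEmail' was dropped, this counts overlapping sessions rather than distinct players; keeping that column would be needed to tell the two apart.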