# Methods for Analysis

Before we can conduct any analysis on the datasets, all of the necessary packages must be loaded.

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

The dataset, players.csv, is loaded directly from the web to ensure that our methods and findings are reproducible. The dataset is saved as "player_data".

In [None]:
URL <- "https://drive.google.com/uc?export=download&id=1w_vUI6QgOW2d9bF07o1XM4MAaSF3dpea"
player_data <- read_csv(URL)
player_data

This dataset includes information for seven variables, as seen above, relating to 196 players. While this data is tidy, it contains variables that are irrelevant to the research question. To keep the analysis accurate and easy to follow, the columns for name and hashedEmail will be excluded using the select function, since they do not provide insight into the "kind" of players present in the server.

In [None]:
player_data_filtered <- player_data |> select(experience, subscribe, played_hours, gender, Age)
player_data_filtered

Now the dataset has been refined to include only the data relevant to answering the question, "Which types of players are most likely to contribute a large amount of data?" We can now conduct an analysis focused on identifying these "kinds" of players. This can be done effectively by determining how different variables influence the played_hours variable, since players with the highest playtime will generate the most data and should therefore be prioritized in recruiting efforts.

Before doing this, however, we note a key issue in the dataset: some players appear to have misreported their age on the server. For instance, the first data point records an age of 9 years. Given that this server is open only to the people in our class, this is clearly an unrepresentative data point. Including such data would lead to misleading conclusions about the relationship between age group and playtime. To mitigate this, only data for players aged 15-30 will be considered when analyzing the relationship between age and playtime.
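The age restriction described above could be applied along the following lines. This is a minimal sketch, not the final analysis code: toy_players and its values are made up for illustration, and treating both 15 and 30 as inclusive bounds is an assumption, since the text only states "15-30".

```r
library(dplyr)

# Hypothetical toy data standing in for player_data_filtered
# (values are invented purely for illustration)
toy_players <- tibble(
  experience = c("Pro", "Amateur", "Veteran", "Regular"),
  played_hours = c(30.3, 0.7, 3.8, 0.1),
  Age = c(9, 17, 21, 45)
)

# Keep only players aged 15 to 30; between() includes both endpoints
# (inclusive bounds are an assumption, see the note above)
toy_age_filtered <- toy_players |>
  filter(between(Age, 15, 30))

toy_age_filtered
```

Note that filter also drops rows where Age is missing, so any players with no recorded age would be excluded from the age analysis as well.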

Playtime will also be examined in relation to experience, gender, and subscription status. Bar graphs will be used to explore these relationships in this stage to determine the relevant pieces of data prior to the group project.

The first graph we looked at plotted experience vs mean hours played. We grouped the data by experience and calculated the mean of hours played using group_by and summarize from the dplyr package. We then used geom_bar from the ggplot2 package with stat = "identity" to generate a bar plot, placing experience level on the X-axis and average hours played on the Y-axis.
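The steps for this first graph can be sketched as follows. The toy data, the experience labels, and the object names (toy_players, mean_hours_by_experience, experience_plot) are hypothetical and chosen only for illustration; the real analysis would start from player_data_filtered instead.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with invented values and labels
toy_players <- tibble(
  experience = c("Pro", "Pro", "Amateur", "Amateur", "Veteran"),
  played_hours = c(2, 4, 1, 3, 5)
)

# Group by experience level and compute the mean hours played per group
mean_hours_by_experience <- toy_players |>
  group_by(experience) |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE))

# stat = "identity" makes bar heights equal to the precomputed means
# instead of the default row counts
experience_plot <- ggplot(mean_hours_by_experience,
                          aes(x = experience, y = mean_hours)) +
  geom_bar(stat = "identity") +
  labs(x = "Experience level", y = "Mean hours played")

experience_plot
```

In ggplot2, geom_col() is shorthand for geom_bar(stat = "identity"), so either form should produce the same plot.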