# DSCI 100 Haiyu Long Individual Stage
# Study Question: Are players' gaming experience, subscription to the game-related newsletter, and age useful predictors of their total hours played?

We will start by loading the most relevant dataset, players.csv.

In [None]:
### Run this cell before continuing.
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

In [None]:
players <- read_csv("")
write_csv(players, '')

Each row of this dataset represents a single player. There are 196 players (rows) in the dataset, each described by seven columns.

- The 'experience' column indicates the player's level of proficiency in the game, divided into five categories: "Beginner," "Regular," "Amateur," "Veteran," and "Pro."
- The 'subscribe' column shows whether the player subscribes to the game-related newsletter.
- The 'hashedEmail' column holds the player's hashed email address.
- The 'played_hours' column records each player's total play time in hours.
- The last three columns record the player's name, gender, and age.

Except for 'played_hours' and 'Age', which are stored as doubles, and 'subscribe', which is stored as a logical vector, the columns are stored as strings.
Of the columns above, we focus on the candidate predictors 'experience', 'subscribe', and 'Age' and on the response variable 'played_hours'. We want to assess whether the three candidates are related to the response strongly enough to qualify as predictors.

Next, we will tidy the dataset and look for potential problems that may hinder our analysis.

In [None]:
players_mdfied <- players |>
    select(-hashedEmail) |> # drop the hashed email, which carries no predictive information
    mutate(experience = as_factor(experience)) |> # convert the two categorical columns to factors for later analysis
    mutate(subscribe = as_factor(subscribe)) |>
    mutate(subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE")) # relabel TRUE/FALSE as Yes/No

players_na_sum <- colSums(is.na(players_mdfied))

players_mdfied
players_na_sum

After tidying the data, we found two NA values in the Age column. Because there are so few, we can simply exclude these two rows for now; that is, we will set na.rm = TRUE in all of the summaries and visualizations that follow.
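As a minimal sketch of what excluding those rows means in practice, the example below uses a small made-up table (the `toy` tibble and its values are hypothetical, not taken from players.csv). It shows that passing `na.rm = TRUE` to a summary function gives the same result as first removing the incomplete rows with `drop_na()`.

```r
library(tidyverse)

# Hypothetical toy data; the values are invented for illustration only
toy <- tibble(
  Age = c(17, NA, 21, NA, 19),
  played_hours = c(0.5, 1.2, 0, 3.4, 0.1)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))

# Option 1: ignore NA values inside the summary function
mean_na_rm <- mean(toy$Age, na.rm = TRUE)

# Option 2: drop the incomplete rows first, then summarize
toy_complete <- toy |> drop_na(Age)
mean_dropped <- mean(toy_complete$Age)

na_counts
mean_na_rm
mean_dropped
```

Either option is reasonable for summaries; dropping the rows explicitly has the advantage that every later step sees the same set of players.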
Next, we will summarize each column in the dataset.

In [None]:
hours_played_summary <- players_mdfied |> summarize(played_hours_min = min(played_hours, na.rm = TRUE),
    played_hours_max = max(played_hours, na.rm = TRUE),
    played_hours_mean = mean(played_hours, na.rm = TRUE),
    played_hours_median = median(played_hours, na.rm = TRUE),
    played_hours_sd = sd(played_hours, na.rm = TRUE))

age_summary <- players_mdfied |> summarize(age_min = min(Age, na.rm = TRUE),
    age_max = max(Age, na.rm = TRUE),
    age_mean = mean(Age, na.rm = TRUE),
    age_median = median(Age, na.rm = TRUE))

ctg_summary <- players_mdfied |>
  select(experience, subscribe, gender) |>
  pivot_longer(cols = experience:gender, names_to = "Column", values_to = "categories") |>
  group_by(Column, categories) |>
  summarize(Count = n(), .groups = "drop") # drop the grouping so the result is a plain tibble

hours_played_summary
age_summary
print(ctg_summary)

The statistics above reveal several interesting points.

The minimum play time is 0, which means that some players registered an account but never played. Comparing the maximum with the mean and the median suggests a right-skewed distribution: a large share of players may have played less than two hours, or even less than one. The age statistics indicate that most players are probably young adults (15-35).

Finally, in terms of experience, Pro players are relatively few (which is to be expected), while players are fairly evenly distributed across the remaining levels. More than half of the rows in the dataset come from male players, and most players have subscribed to the game-related newsletter.

These statistics offer hints for our subsequent analysis and suggest that our conclusions may be limited. For example, because the data consist mostly of male players, our conclusions may apply only to male players.
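The claim about short play times can be checked directly rather than inferred from the mean and median. The sketch below does this on a made-up vector of play times (`toy_hours` and its values are hypothetical, not taken from players.csv); the same `summarize()` call could be applied to players_mdfied.

```r
library(tidyverse)

# Hypothetical play times in hours; invented for illustration only
toy_hours <- tibble(
  played_hours = c(0, 0, 0.1, 0.3, 0.5, 0.8, 1.5, 2.0, 12.4, 150.0)
)

toy_skew <- toy_hours |>
  summarize(
    mean_hours = mean(played_hours),
    median_hours = median(played_hours),
    # the mean of a logical vector is the proportion of TRUE values
    prop_under_1h = mean(played_hours < 1),
    prop_under_2h = mean(played_hours < 2)
  )

toy_skew
```

A mean well above the median, together with a large proportion of players under one hour, is the pattern one would expect from a right-skewed distribution.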

In [None]:
options(repr.plot.height = 6, repr.plot.width = 7)

age_hours_plot <- ggplot(players_mdfied, aes(x = Age, y = played_hours)) +
  geom_point(na.rm = TRUE) + # skip the two rows with missing Age without a warning
  labs(x = "Age", y = "Total Hours Played")

exp_hours_plot <- ggplot(players_mdfied, aes(x = experience, y = played_hours)) +
  geom_boxplot() +
  labs(x = "Gaming Experience", y = "Total Hours Played")

subs_hours_plot <- ggplot(players_mdfied, aes(x = subscribe, y = played_hours)) +
  geom_boxplot() +
  labs(x = "Subscribed to Newsletter", y = "Total Hours Played")

age_hours_plot
exp_hours_plot
subs_hours_plot

# Methods and plans

We plan to use a multivariable linear regression model as a preliminary prediction tool, using players' gaming experience, whether they subscribe to the newsletter, and age to predict the total number of hours a player will play.

Linear regression is suitable for predicting a continuous response variable (such as total play time), and it can handle both numerical predictors (age) and categorical predictors (experience and subscription, once they are converted to factors and encoded as dummy variables).

Assumptions
- We assume a linear relationship between each predictor and the response variable.
- Character variables need to be converted to factors or to numeric indicators (0 or 1).
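To make the second point concrete, the sketch below shows how R encodes factors as 0/1 indicator (dummy) columns when it builds a regression design matrix. The three-row `toy` data frame is hypothetical and only illustrates the encoding; it is not drawn from players.csv.

```r
# Hypothetical three-player example; the values are invented for illustration
toy <- data.frame(
  experience = factor(c("Amateur", "Beginner", "Pro")),
  subscribe = factor(c("Yes", "No", "Yes")),
  Age = c(17, 21, 25)
)

# model.matrix() expands each factor into indicator columns,
# leaving out the first level of each factor as the reference category
design <- model.matrix(~ experience + subscribe + Age, data = toy)

colnames(design)
design
```

Each factor with k levels contributes k - 1 indicator columns, and the omitted level is absorbed into the intercept.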

Potential limitations or weaknesses
- If the predictors are related to the response nonlinearly, or if there are significant interactions among the predictors, a simple linear model may not be sufficient to capture these relationships.
- The current dataset may not include all of the important factors that affect play time, which would reduce the model's explanatory power.
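One possible way to probe the first limitation is to compare an additive model with one that includes an interaction term. The sketch below does this on simulated data in which an Age-by-subscribe interaction is deliberately built in; the data frame `toy`, the coefficients, and the seed are all assumptions made for illustration, not results from players.csv.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Simulated data; invented for illustration only
n <- 120
toy <- data.frame(
  Age = sample(15:35, n, replace = TRUE),
  subscribe = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

# Hypothetical response: the effect of Age is larger for subscribers
toy$played_hours <- 2 + 0.1 * toy$Age +
  ifelse(toy$subscribe == "Yes", 0.3 * toy$Age, 0) +
  rnorm(n)

# Additive model versus model with an Age-by-subscribe interaction
additive_fit <- lm(played_hours ~ Age + subscribe, data = toy)
interaction_fit <- lm(played_hours ~ Age * subscribe, data = toy)

# F-test comparing the two nested models
comparison <- anova(additive_fit, interaction_fit)
comparison
```

If the interaction model reduces the residual sum of squares substantially, the additive model is probably missing structure in the data.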

Planned steps
1. Check and clean the data (handle missing values and outliers).
2. Convert character variables (such as experience and subscribe) into factors and construct dummy variables if necessary.
3. Split the dataset into a training set and a test set. The recommended ratio is 70% training and 30% test.
4. During training, use cross-validation to adjust any model parameters so that the model generalizes well to unseen data.
5. Use k-fold cross-validation (5 or 10 folds) on the training set to evaluate the model's performance and to detect overfitting or underfitting.
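The steps above can be sketched with tidymodels as follows. This is a minimal sketch under stated assumptions, not a final implementation: `toy_players` is simulated stand-in data with the same column names as the cleaned players table, and the seed and the choice of 5 folds are arbitrary. Note that ordinary least squares has no tuning parameter, so cross-validation here only estimates prediction error.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for the cleaned players data; NOT the real values
n <- 200
toy_players <- tibble(
  experience = factor(sample(
    c("Beginner", "Regular", "Amateur", "Veteran", "Pro"), n, replace = TRUE
  )),
  subscribe = factor(sample(c("Yes", "No"), n, replace = TRUE)),
  Age = sample(15:35, n, replace = TRUE),
  played_hours = rexp(n, rate = 0.2)
)

# Step 3: 70% training, 30% test
players_split <- initial_split(toy_players, prop = 0.7)
players_train <- training(players_split)
players_test <- testing(players_split)

# Step 2: turn the factor predictors into dummy variables
players_recipe <- recipe(
  played_hours ~ experience + subscribe + Age,
  data = players_train
) |>
  step_dummy(all_nominal_predictors())

# Linear regression fitted with lm()
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

players_workflow <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(lm_spec)

# Steps 4-5: 5-fold cross-validation on the training set only
players_folds <- vfold_cv(players_train, v = 5)
cv_metrics <- players_workflow |>
  fit_resamples(resamples = players_folds) |>
  collect_metrics()

# Fit on the full training set, then evaluate once on the test set
players_fit <- fit(players_workflow, data = players_train)
test_metrics <- predict(players_fit, players_test) |>
  bind_cols(players_test) |>
  metrics(truth = played_hours, estimate = .pred)

cv_metrics
test_metrics
```

Comparing the cross-validated RMSE with the test-set RMSE gives a rough check on whether the model generalizes beyond the training data.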