(1) Data Description:
Provide a full descriptive summary of the dataset, including information such as the number of observations, summary statistics, number of variables, name and type of variables, what the variables mean, any issues you see in the data, any other potential issues related to things you cannot directly see, how the data were collected, etc. Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format.

Note that the selected dataset(s) will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project. You need to summarize the full data regardless of which variables you may choose to use later on.

There are two datasets available: players and sessions. The players dataset has 7 columns (experience, subscribe, hashedEmail, played_hours, name, Age, and gender) and 196 rows. The sessions dataset has 5 columns (hashedEmail, start_time, end_time, original_start_time, and original_end_time) and 1535 rows.
Players dataset:
Experience: The experience level of the player. There are 5 levels: beginner, amateur, regular, pro, and veteran.
Subscribe: Whether the player has subscribed to a game-related newsletter. The value is TRUE if they have subscribed and FALSE otherwise.
Hashed Email: A hashed version of the player's email address. It acts as an anonymized identifier for each player.
Played Hours: The number of hours each player has played the game.
Name: The name of each player, which identifies who is playing the game.
Gender: The gender of each player.
Age: The age of each player.

Sessions dataset:
Hashed Email: The same hashed identifier as in the players dataset. It links each session to a player.
Start Time: The date and time at which the player started the session.
End Time: The date and time at which the player ended the session.
Original Start Time: The original start time of the session, stored as a large number. Values of this size (about 1.7e+12) appear to be Unix timestamps in milliseconds. The min is 1.712e+12, the 1st quartile is 1.716e+12, the median is 1.719e+12, the mean is 1.719e+12, the 3rd quartile is 1.722e+12, and the max is 1.727e+12.
Original End Time: The original end time of the session, stored in the same numeric form. The min is 1.712e+12, the 1st quartile is 1.716e+12, the median is 1.719e+12, the mean is 1.719e+12, the 3rd quartile is 1.722e+12, and the max is 1.727e+12.
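
The size of the original time values is consistent with Unix timestamps in milliseconds, but that is an assumption and not something the dataset states. A minimal sketch of converting such values to readable dates, using the reported min and max as example inputs:

```r
# Assumption: original_start_time / original_end_time count milliseconds
# since 1970-01-01 UTC. The two inputs are the rounded min and max from summary().
original_ms <- c(1.712e12, 1.727e12)

# as.POSIXct() expects seconds, so divide by 1000 first.
converted <- as.POSIXct(original_ms / 1000, origin = "1970-01-01", tz = "UTC")

format(converted, "%Y-%m-%d")
```

Under that assumption the sessions would span roughly April to September 2024. The inputs are rounded to four significant digits, so the converted dates are only approximate.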

If these data are based on what the players choose to share (such as age, gender, and name), they might not be correct. Some players might lie about their age or gender when self-reporting. In addition, it is unclear whether played hours count active playing time or the total time the game is open. Some people may leave the game open and come back to it later, which would inflate their played hours.
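
One way to probe the played-hours question would be to total the session lengths per player and compare them with played_hours. The sketch below assumes start_time and end_time are stored as text in day/month/year hour:minute form, which the description above does not confirm, and the rows are invented examples.

```r
# Hypothetical sessions in the assumed "dd/mm/YYYY HH:MM" text format.
# The identifiers and times are made up for illustration.
sessions_example <- data.frame(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 20:30"),
  end_time    = c("30/06/2024 18:42", "01/07/2024 10:30", "02/07/2024 21:00")
)

# Parse the text into date-times.
start <- as.POSIXct(sessions_example$start_time, format = "%d/%m/%Y %H:%M", tz = "UTC")
end   <- as.POSIXct(sessions_example$end_time,   format = "%d/%m/%Y %H:%M", tz = "UTC")

# Length of each session in hours.
sessions_example$session_hours <- as.numeric(difftime(end, start, units = "hours"))

# Total session hours per player, for comparison with played_hours.
session_totals <- aggregate(session_hours ~ hashedEmail, data = sessions_example, FUN = sum)
session_totals
```

If the totals for real players were close to played_hours, that would suggest played_hours reflects logged session time. This check alone would not separate idle time from active play.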

In [None]:
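library(tidyverse)  # read_csv() comes from readr, which tidyverse loads

sessions <- read_csv("data/sessions.csv")

# Summary statistics for each column, including the original_* time columns
summary(sessions)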

(2) Questions:
Clearly state one broad question that you will address, and the specific question that you have formulated. Your question should involve one response variable of interest and one or more explanatory variables, and should be stated as a question. One common question format is: “Can [explanatory variable(s)] predict [response variable] in [dataset]?”, but you are free to format your question as you choose so long as it is clear. Describe clearly how the data will help you address the question of interest. You may need to describe how you plan to wrangle your data to get it into a form where you can apply one of the predictive methods from this class.


Broad question: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts. Specific question: Which gender of players is most likely to contribute the most playing time (in hours), based on age? Age, gender, and hours played are all variables in the players dataset, so I can use the group_by and summarize functions to summarize ages and playing time (in hours) and then predict the gender of a player.
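
As a rough sketch of the group_by and summarize step described above, using a small invented data frame in place of the real players data (the rows are hypothetical; only the column names Age, gender, and played_hours come from the dataset):

```r
library(dplyr)

# Hypothetical rows with the three columns used in the question.
players_example <- data.frame(
  Age          = c(17, 17, 21, 21, 30),
  gender       = c("Male", "Female", "Male", "Female", "Male"),
  played_hours = c(2.0, 4.0, 1.0, 0.0, 3.0)
)

# One row per gender: mean age, total hours played, and number of players.
hours_by_gender <- players_example |>
  group_by(gender) |>
  summarize(mean_age = mean(Age, na.rm = TRUE),
            total_played_hours = sum(played_hours, na.rm = TRUE),
            n_players = n())

hours_by_gender
```

A summary like this describes the groups but does not itself predict anything; it would only be a first wrangling step.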

(3) Exploratory Data Analysis and Visualization
In this assignment, you will:

Demonstrate that the dataset can be loaded into R.
Do the minimum necessary wrangling to turn your data into a tidy format. Do not do any additional wrangling here; that will happen later during the group project phase.
Compute the mean value for each quantitative variable in the players.csv data set. Report the mean values in a table format.
Make a few exploratory visualizations of the data to help you understand it.
Use our visualization best practices to make high-quality plots (make sure to include labels, titles, units of measurement, etc)
Explain any insights you gain from these plots that are relevant to address your question
Note: do not perform any predictive analysis here. We are asking for an exploration of the relevant variables to demonstrate that you understand them well before performing any additional modelling, and to identify potential problems you anticipate encountering.

In [None]:
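library(tidyverse)

# Load the players data
players <- read_csv("data/players.csv")

# Keep only the variables relevant to the question
players_wrangled <- players |>
  select(Age, gender, played_hours)

# Mean of each quantitative variable, reported as a one-row table
mean_values <- players_wrangled |>
  summarize(mean_age = mean(Age, na.rm = TRUE),
            mean_played_hours = mean(played_hours, na.rm = TRUE))
mean_values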


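# Scatter plot of hours played against age, coloured by gender
scatter_plot <- ggplot(players_wrangled, aes(x = Age, y = played_hours)) +
        geom_point(aes(colour = gender), alpha = 0.4) +
        labs(x = "Age of players (years)",
             y = "Hours played (hours)",
             colour = "Gender",
             title = "Age of Players vs. Hours Played") +
        # gender is mapped to colour, not fill, so a colour scale is needed;
        # "Dark2" is used because pastel points are hard to see at alpha = 0.4
        scale_colour_brewer(palette = "Dark2") +
        theme(text = element_text(size = 12))
scatter_plot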



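# Stacked bar plot of the gender share at each age.
# position = "fill" scales every bar to 1, so the y-axis is a proportion of
# players, not hours played.
bar_plot <- players_wrangled |>
   ggplot(aes(x = Age, fill = gender)) +
   geom_bar(position = "fill") +
   xlab("Age (years)") +
   ylab("Proportion of players") +
   labs(fill = "Gender") +
   ggtitle("Proportion of Players by Gender at Each Age") +
   #scale_fill_brewer(palette = "Pastel1") +
   theme(text = element_text(size = 12))
bar_plot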

The scatterplot is a little hard to read because a few outliers stretch the scale. If the outliers were removed, the scale would change and the plot might be easier to read. The bar plot is a little hard to read because of the colours. The bar plot may not be the best choice s
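
One possible way to make the scatterplot easier to read without deleting the outliers is to zoom the y-axis. The sketch below uses invented rows and an arbitrary cutoff (the 80th percentile of played_hours); both are assumptions for illustration, not choices taken from the analysis above.

```r
library(ggplot2)

# Hypothetical data with one extreme played_hours value.
players_example <- data.frame(
  Age          = c(17, 19, 21, 22, 25, 30),
  gender       = c("Male", "Female", "Male", "Female", "Male", "Female"),
  played_hours = c(0.5, 1.2, 0.0, 3.4, 2.1, 150.0)
)

# Assumed cutoff: the 80th percentile of played_hours (an arbitrary choice).
y_cutoff <- quantile(players_example$played_hours, 0.80, na.rm = TRUE)

zoomed_scatter <- ggplot(players_example, aes(x = Age, y = played_hours)) +
  geom_point(aes(colour = gender), alpha = 0.4) +
  # coord_cartesian() zooms the view without dropping rows from the data.
  coord_cartesian(ylim = c(0, y_cutoff)) +
  labs(x = "Age (years)", y = "Hours played (hours)", colour = "Gender",
       title = "Age vs. Hours Played (zoomed)")
zoomed_scatter
```

Zooming keeps every player in the data, so the outliers remain available for later steps; they are simply outside the visible range.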

(4) Methods and Plan
Propose one method to address your question of interest using the selected dataset and explain why it was chosen. Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

In your explanation, respond to the following questions:

Why is this method appropriate?
Which assumptions are required, if any, to apply the method selected?
What are the potential limitations or weaknesses of the method selected?
How are you going to compare and select the model?
How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
 