# Group 18 Project Proposal
By: Sandra Radic, Charlie Sushams, Alex Grinius, & Clare Vu

## Introduction

In tennis, as in most sports, events and tournaments are divided into men's and women's categories for fairness (mixed doubles being the exception). As a result, men's and women's statistics are rarely compared within a single group. However, there are many outstanding female players who could pose serious competition in the men's game, and 23-time Grand Slam singles champion Serena Williams is one of them. Williams has been ranked world No. 1 in women's singles and began her professional career in her early teens, which reflects her exceptional skill. This leads us to our question: how would Serena Williams rank among the top 500 male players? Our group intends to analyze factors that distinguish men's and women's tennis and use them in a classification model to predict which rank category (intervals of "best rank" among the top 500 players) Williams would fall under if she were to take part in a men's-only tournament. To conduct this analysis, we will use the Player Stats for Top 500 Players dataset from https://www.ultimatetennisstatistics.com/. We will consider three factors: height, number of seasons played, and the age at which the player turned professional. This will allow us to examine whether physical stature acts as an advantage for male players, while the other two factors serve as proxies for experience and skill. We will create the categories from the "best rank" column and use them to assign a predicted rank category to Williams.

## Preliminary Exploratory Data Analysis

Instructions from Sandra:

* Read `player_stats.csv`.
* Read what I wrote in "Methods" to see what we need to do to create a smaller tidy dataset.
* Find Serena Williams' height, age when she turned pro, and number of seasons played, and add them to the data with a mutation.
* Use only the TRAINING DATA to make a table and visualization.
* Add to the Methods section how we can visualize our data (e.g. scatterplots to compare variables, maybe?).

In [1]:
library(tidyverse)
library(cowplot)
library(scales)
library(stringr)

In [22]:
player_stats <- read_csv("../player_stats.csv")

# select only the necessary columns and drop players whose height is unknown

player_stats_filtered <- player_stats %>%
    select(Height, Seasons, Name, `Turned Pro`, Age) %>%
    filter(!is.na(Height))

# tidy up dataframe (ensure correct datatypes and appropriate column names)
# Age is stored as "<age> (<birthdate>)" and Height as "<number> cm"

player_stats_tidy <- player_stats_filtered %>%
    separate(col = Age, into = c("Age", "Birthdate"), sep = " ") %>%
    mutate(Age = as.numeric(Age),
           Height = as.numeric(str_replace(Height, " cm", "")))

# create new column that lists a player's age when they turned pro
# (approximated here as current age minus number of seasons played)

player_stats_age_pro <- player_stats_tidy %>%
    mutate(Age_turned_Pro = Age - Seasons)

player_stats_age_pro

# create a dataframe for Serena Williams - currently contains junk data
serena_data <- data.frame(Name = "Serena Williams", Age = 99, Birthdate = "(30-30-1900)", "Turned Pro" = 1900,
    Age_turned_Pro = 99, Seasons = 99, Height = "199 cm")

serena_data

"Missing column names filled in: 'X1' [1]"

-- Column specification ------------------------------------------------------------------------------------------------
cols(
  .default = col_character(),
  X1 = col_double(),
  `Turned Pro` = col_double(),
  Seasons = col_double(),
  Titles = col_double(),
  `Best Season` = col_double(),
  Retired = col_double(),
  Masters = col_double(),
  `Grand Slams` = col_double(),
  `Davis Cups` = col_double(),
  `Team Cups` = col_double(),
  Olympics = col_double(),
  `Weeks at No. 1` = col_double(),
  `Tour Finals` = col_double()
)
i Use `spec()` for the full column specifications.




| Height | Seasons | Name | Turned Pro | Age | Birthdate | Age_turned_Pro |
|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` |
| 185 | 14 | Lukas Lacko | 2005 | 32 | (03-11-1987) | 18 |
| 193 | 11 | Bernard Tomic | 2008 | 27 | (21-10-1992) | 16 |
| 198 | 14 | Juan Martin Del Potro | 2005 | 31 | (23-09-1988) | 17 |
| 190 | 14 | Marcel Granollers | 2003 | 33 | (12-04-1986) | 19 |
| 198 | 15 | Sam Querrey | 2006 | 32 | (07-10-1987) | 17 |
| 180 | 7 | Andrej Martin | 2005 | 30 | (20-09-1989) | 23 |
| 178 | 16 | Fabio Fognini | 2004 | 32 | (24-05-1987) | 16 |
| 180 | 11 | Dusan Lajovic | 2007 | 29 | (30-06-1990) | 18 |
| 175 | 13 | Daniel Evans | 2006 | 29 | (23-05-1990) | 16 |
| 183 | 5 | Gregoire Barrere | 2012 | 25 | (16-02-1994) | 20 |


| Name | Age | Birthdate | Turned.Pro | Age_turned_Pro | Seasons | Height |
|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| Serena Williams | 99 | (30-30-1900) | 1900 | 99 | 99 | 199 cm |
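
The instruction to explore only the training data could be sketched as follows. This is a minimal illustration, not our final split: the toy `players` table copies a few rows from the output above, and the 75/25 proportion and the seed are assumptions chosen only for the example.

```r
library(dplyr)

# Toy stand-in for the tidied player table (values copied from the output above)
players <- tibble(
  Name = c("Lukas Lacko", "Bernard Tomic", "Sam Querrey", "Fabio Fognini",
           "Dusan Lajovic", "Daniel Evans", "Marcel Granollers", "Andrej Martin"),
  Height = c(185, 193, 198, 178, 180, 175, 190, 180),
  Seasons = c(14, 11, 15, 16, 11, 13, 14, 7)
)

set.seed(2021)  # arbitrary seed (assumption) so the split is reproducible

# Keep 75% of the players for training; the rest are held out for testing
players_train <- players %>% slice_sample(prop = 0.75)
players_test  <- players %>% anti_join(players_train, by = "Name")

nrow(players_train)
nrow(players_test)
```

Tables and plots for the exploratory analysis would then be built from `players_train` only, so that the held-out players do not influence any modelling decisions.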


## Methods

To tidy our data, we will select only the columns relevant to our predictors, "Height" and "Seasons", along with the columns needed to derive the third. We will create an additional column, "Career Start Age", by mutating the "Age" column (current age as of 2019) and the "Turned Pro" column (year turned pro): subtracting the number of years since a player turned pro from their current age gives their age at the start of their career. We will then manually add a row containing Serena Williams' data (her height, seasons played, and career start age).
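
A minimal sketch of the "Career Start Age" calculation described above, assuming the ages in the dataset are as of 2019. The column name `Career_Start_Age`, the variable `data_year`, and the `NA` placeholder row for Serena Williams are illustrative assumptions, not final values.

```r
library(dplyr)

# Two players copied from the output above
players <- tibble(
  Name = c("Lukas Lacko", "Andrej Martin"),
  Height = c(185, 180),
  Seasons = c(14, 7),
  `Turned Pro` = c(2005, 2005),
  Age = c(32, 30)
)

data_year <- 2019  # assumption: year in which the ages were recorded

# Age at career start = current age minus years since turning pro
players_start <- players %>%
  mutate(Career_Start_Age = Age - (data_year - `Turned Pro`))

# Placeholder row for Serena Williams; the NA values still need real stats
serena <- tibble(
  Name = "Serena Williams", Height = NA_real_, Seasons = NA_real_,
  `Turned Pro` = NA_real_, Age = NA_real_, Career_Start_Age = NA_real_
)

players_with_serena <- bind_rows(players_start, serena)
players_with_serena
```

For Andrej Martin this gives a career start age of 16, whereas subtracting seasons played from age gives 23, so the two definitions can disagree when the number of seasons played differs from the number of years since turning pro.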

As mentioned in our introduction, we will build a classification model whose categories are intervals of the "Best Rank" column. Players with a best rank in the top 10 will be defined as Top players, those ranked 11-30 as Competitive players, those ranked 31-50 as Great players, those ranked 51-100 as Good players, and all remaining players as Average players. The goal is to place Serena Williams in one of these categories by the end of our analysis.
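
The rank categories above could be assigned along these lines. This is a sketch under assumptions: the player names and ranks are made up, and the numeric column `Best_Rank` is hypothetical (the raw "Best Rank" column may need to be converted to a number first).

```r
library(dplyr)

# Made-up players with a numeric best rank (hypothetical data)
ranked <- tibble(
  Name = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  Best_Rank = c(3, 25, 50, 51, 240)
)

# Right-closed intervals: (0,10], (10,30], (30,50], (50,100], (100,Inf]
ranked_labelled <- ranked %>%
  mutate(Category = cut(
    Best_Rank,
    breaks = c(0, 10, 30, 50, 100, Inf),
    labels = c("Top", "Competitive", "Great", "Good", "Average"),
    right = TRUE
  ))

ranked_labelled
```

Because `cut()` returns a factor, `Category` can be used directly as the class label for a classification model.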

We plan to visualize our data by ___. 


## Expected Outcomes and Significance