# Tennis analysis

### Introduction
Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal

Clearly state the question you will try to answer with your project

Identify and describe the dataset that will be used to answer the question

### Preliminary exploratory data analysis
Demonstrate that the dataset can be read from the web into R 

Clean and wrangle your data into a tidy format

**Using only training data**, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 

**Using only training data**, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

In [34]:

# loading libraries

library(repr)
library(tidyverse)
library(tidymodels)
library(RColorBrewer)
library(stringr)

set.seed(420)
options(repr.matrix.max.rows = 10)

In [35]:
# chosen dataset
tennis_stats_data <- read_csv("https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS")

# rename the columns so that the names contain no spaces
colnames(tennis_stats_data) <- make.names(colnames(tennis_stats_data))
tennis_stats_data

New names:
• `` -> `...1`
Rows: 500 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (25): Age, Country, Plays, Wikipedia, Current Rank, Best Rank, Name, Bac...
dbl (13): ...1, Turned Pro, Seasons, Titles, Best Season, Retired, Masters, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


...1,Age,Country,Plays,Wikipedia,Current.Rank,Best.Rank,Name,Backhand,Prize.Money,⋯,Facebook,Twitter,Nicknames,Grand.Slams,Davis.Cups,Web.Site,Team.Cups,Olympics,Weeks.at.No..1,Tour.Finals
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
0,26 (25-04-1993),Brazil,Right-handed,Wikipedia,378 (97),363 (04-11-2019),Oscar Jose Gutierrez,,,⋯,,,,,,,,,,
1,18 (22-12-2001),United Kingdom,Left-handed,Wikipedia,326 (119),316 (14-10-2019),Jack Draper,Two-handed,"$59,040",⋯,,,,,,,,,,
2,32 (03-11-1987),Slovakia,Right-handed,Wikipedia,178 (280),44 (14-01-2013),Lukas Lacko,Two-handed,"US$3,261,567",⋯,,,,,,,,,,
3,21 (29-05-1998),"Korea, Republic of",Right-handed,Wikipedia,236 (199),130 (10-04-2017),Duck Hee Lee,Two-handed,"$374,093",⋯,,,,,,,,,,
4,27 (21-10-1992),Australia,Right-handed,Wikipedia,183 (273),17 (11-01-2016),Bernard Tomic,Two-handed,"US$6,091,971",⋯,,,,,,,,,,
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
495,20 (13-04-1999),France,Right-handed,Wikipedia,382 (95),380 (11-11-2019),Dan Added,Two-handed,"$57,943",⋯,,,,,,,,,,
496,26 (03-09-1993),Austria,Right-handed,Wikipedia,5 (5890),4 (06-11-2017),Dominic Thiem,One-handed,"$22,132,368 15th all-time leader in earnings",⋯,1.Dominic.Thiem,@ThiemDomi,Dominator,,,dominicthiem.tennis,,,,
497,23 (14-03-1996),Netherlands,Left-handed,Wikipedia,495 (60),342 (05-08-2019),Gijs Brouwer,,,⋯,,,,,,,,,,
498,24 (17-05-1995),Ukraine,,Wikipedia,419 (81),419 (20-01-2020),Vladyslav Orlov,,,⋯,,,,,,,,,,


In [12]:
colnames(tennis_stats_data)

In [57]:

tennis_cleaned_data <- tennis_stats_data |>
                select(Prize.Money, Age, Country, Plays, Backhand, Weight, Current.Elo.Rank, Best.Elo.Rank, Current.Rank) |>
                # keep only the first space-delimited token of each cell, e.g. "26 (25-04-1993)" -> "26"
                # (this also shortens multi-word countries, e.g. "United Kingdom" -> "United")
                mutate(across(Prize.Money:Current.Rank, function(col) {str_extract(col, "^[^ ]+")})) |>
                # convert prize money to a number by dropping currency symbols and commas
                mutate(Prize.Money = as.numeric(gsub("[^0-9.]+", "", Prize.Money))) |>
                mutate(Age = as.numeric(Age)) |>
                # convert the remaining chr columns to dbl
                mutate(across(Weight:Current.Rank, as.numeric)) |>
                mutate(across(Country:Backhand, as.factor))

tennis_cleaned_data

Prize.Money,Age,Country,Plays,Backhand,Weight,Current.Elo.Rank,Best.Elo.Rank,Current.Rank
<dbl>,<dbl>,<fct>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
,26,Brazil,Right-handed,,,,,378
59040,18,United,Left-handed,Two-handed,,,,326
3261567,32,Slovakia,Right-handed,Two-handed,,144,60,178
374093,21,"Korea,",Right-handed,Two-handed,,,,236
6091971,27,Australia,Right-handed,Two-handed,,100,21,183
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
57943,20,France,Right-handed,Two-handed,,,,382
22132368,26,Austria,Right-handed,One-handed,82,6,5,5
,23,Netherlands,Left-handed,,,,,495
,24,Ukraine,,,,,,419
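
The cleaned table above shows `United` and `Korea,` in the `Country` column, because the first-token extraction is applied to every selected column. A minimal sketch of one way to avoid that, assuming only `Age`, `Prize.Money` and `Current.Rank` carry trailing annotations: restrict the extraction to those columns. The `raw_sample` tibble is a hypothetical stand-in built from three rows of the raw table, not the full dataset.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in: three rows copied from the raw table shown earlier
raw_sample <- tibble::tibble(
  Age          = c("18 (22-12-2001)", "21 (29-05-1998)", "26 (03-09-1993)"),
  Country      = c("United Kingdom", "Korea, Republic of", "Austria"),
  Prize.Money  = c("$59,040", "$374,093",
                   "$22,132,368 15th all-time leader in earnings"),
  Current.Rank = c("326 (119)", "236 (199)", "5 (5890)")
)

cleaned_sample <- raw_sample |>
  mutate(
    # keep the first token only in columns that have trailing annotations
    across(c(Age, Prize.Money, Current.Rank), \(col) str_extract(col, "^[^ ]+")),
    # drop currency symbols and thousands separators, then convert to numeric
    Prize.Money = as.numeric(str_remove_all(Prize.Money, "[^0-9.]")),
    across(c(Age, Current.Rank), as.numeric),
    # Country is left untouched, so multi-word names survive
    Country = as.factor(Country)
  )

cleaned_sample
```

With this variant, `Country` keeps values such as `United Kingdom` intact while the numeric columns are parsed as before.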


In [59]:
# splitting the data into a training set and a testing set
tennis_split <- initial_split(tennis_cleaned_data, prop = 0.75, strata = Current.Rank)

tennis_train <- training(tennis_split)
tennis_test <- testing(tennis_split)


tennis_train
     

Prize.Money,Age,Country,Plays,Backhand,Weight,Current.Elo.Rank,Best.Elo.Rank,Current.Rank
<dbl>,<dbl>,<fct>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
1517157,22,Poland,Right-handed,Two-handed,,33,33,31
1893476,19,Canada,Right-handed,Two-handed,,51,30,22
25889586,31,Argentina,Right-handed,Two-handed,97,4,3,121
1285541,20,Serbia,Right-handed,Two-handed,,61,60,54
2722314,22,United,Right-handed,Two-handed,,56,29,34
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
,24,United,,,,,,493
,21,Italy,,,,,,485
,25,France,Right-handed,,,,,486
57943,20,France,Right-handed,Two-handed,,,,382
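
As a rough illustration of what a stratified split does, the sketch below runs `initial_split()` on made-up data and checks the result. `toy_players` and its values are hypothetical; only the column names mirror the notebook. Because sampling happens within each stratum, the realised training share may differ slightly from `prop`.

```r
library(rsample)

set.seed(420)

# Hypothetical toy data: 200 players with made-up ranks and ages
toy_players <- data.frame(
  Current.Rank = 1:200,
  Age = sample(18:35, 200, replace = TRUE)
)

# A numeric strata column is binned before sampling
toy_split <- initial_split(toy_players, prop = 0.75, strata = Current.Rank)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Realised share of rows in the training set
nrow(toy_train) / nrow(toy_players)

# Both sets should cover a similar range of ranks
summary(toy_train$Current.Rank)
summary(toy_test$Current.Rank)
```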


In [103]:
# number of rows in the training set (373 for this split)
n_train <- nrow(tennis_train)

# mean and percentage of missing values for each numeric variable
observation_num <- tennis_train |>
                select(where(is.numeric)) |>
                pivot_longer(everything(), names_to = "var", values_to = "val") |>
                filter(!is.na(val)) |>
                group_by(var) |>
                summarise(count = n(), mean = mean(val, na.rm = TRUE)) |>
                mutate(mean = round(mean, digits = 2), percentage_missing = (1 - count / n_train) * 100) |>
                select(-count)
observation_num

# number of missing and non-missing values for each factor variable
observation_factor <- tennis_train |>
                    select(where(is.factor)) |>
                    pivot_longer(everything(), names_to = "var", values_to = "val") |>
                    filter(is.na(val)) |>
                    group_by(var) |>
                    summarise(na_count = n()) |>
                    mutate(observation_count = n_train - na_count)

observation_factor

var,mean,percentage_missing
<chr>,<dbl>,<dbl>
Age,25.88,0.2680965
Best.Elo.Rank,78.75,53.6193029
Current.Elo.Rank,95.55,63.2707775
Current.Rank,248.81,1.3404826
Prize.Money,2356148.3,20.6434316
Weight,83.76,95.4423592


var,na_count,observation_count
<chr>,<int>,<dbl>
Backhand,64,309
Country,1,372
Plays,31,342


### Methods
Explain how you will conduct your data analysis and which variables/columns you will use.

<u> Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction? </u>

Describe at least one way that you will visualize the results

### Expected outcomes and significance:

What do you expect to find?

What impact could such findings have?

What future questions could this lead to?