# Predicting Tennis Match Outcomes

### INTRODUCTION 

**Background:**

Tennis originated in Europe in the 12th century and is now played both competitively and recreationally around the world. In a match, two players or two pairs use rackets to hit a ball back and forth over a net that divides the court. A player wins a point when the opponent is unable to return the incoming ball across the court, or when the ball bounces twice on the opponent's side.

**Research question:**

Can the outcome of a match between two players be predicted from their previous match statistics?

**Data Set:**

The analysis uses the "Game results for Top 500 Players from 2017-2019" data set [1], collected by the Association of Tennis Professionals (ATP). Each observation (row) in the data set is a single match, and each variable (column) is a match or player statistic.

Variable definitions: [2] 
- tourney_id = unique identifier for each tournament 
- tourney_name = tournament name
- surface = court surface
- draw_size = total tournament draw size
- tourney_level = tour events
- tourney_date = eight digits (YYYYMMDD), usually the Monday of the tournament week
- match_num = a match-specific identifier
- winner_id/loser_id = player_id of the match winner/loser
- winner_seed/loser_seed = seed of match winner/loser
- winner_entry/loser_entry = 'WC' - wild card, 'Q' - qualifier, 'LL' - lucky loser, 'PR' - protected ranking, 'ITF' - ITF entry
- winner_name/loser_name = name of winner/loser
- winner_hand/loser_hand = dominant hand of winner/loser
- winner_ht/loser_ht = height in cm
- winner_ioc/loser_ioc = 3-character country code
- winner_age/loser_age = age in years
- score = final score
- best_of = '3' or '5' indicating the number of sets for this match
- round = round of tournament
- minutes = match length
- w_ace/l_ace = ace count
- w_df/l_df = double fault counts
- w_svpt/l_svpt = serve points
- w_1stIn/l_1stIn = first serves made
- w_1stWon/l_1stWon = first serve points won
- w_2ndWon/l_2ndWon = second serve points won
- w_SvGms/l_SvGms = service games won
- w_bpSaved/l_bpSaved = break points saved
- w_bpFaced/l_bpFaced = break points faced
- winner_rank/loser_rank = ATP or WTA rank, as of the tourney_date or most recent ranking date before tourney_date
- winner_rank_points/loser_rank_points = number of ranking points
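
Because `tourney_date` is stored as eight digits, `read_csv` parses it as a number rather than a date. The analysis here does not use the date, but if it were needed (for example, to order a player's previous matches), a conversion along the following lines could work. This is a minimal sketch in base R; the variable names `parsed_date` and `days_apart` are illustrative, and the second date value is made up for the example.

```r
# tourney_date as read_csv parses it: a plain number, not a Date.
# 20181231 appears in the data preview; 20190107 is an illustrative value.
tourney_date <- c(20181231, 20190107)

# Convert number -> string -> Date using the YYYYMMDD layout
parsed_date <- as.Date(as.character(tourney_date), format = "%Y%m%d")

# Differences between Date values are measured in days
days_apart <- as.numeric(diff(parsed_date))

print(parsed_date)
print(days_apart)
```

Once converted, the column can be sorted, filtered by range, or subtracted like any other `Date` vector.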

### METHODS AND RESULTS

The variables used for data analysis (for both winner and loser):
- Country of representation
- Age (years)
- Aces 
- Double fault counts 
- Serve points
- First serves made
- First serve points won
- Second serve points won
- Service games won
- Break points saved
- Break points faced

NOTE: Rows with W/O (walkover) in the score column were filtered out.

Quantitative match statistics are the focus of our research question, thus all such variables were used for data analysis. Additional variables such as age and country of representation were chosen based on their perceived influence on likelihood to win or lose a match [3], but they don't appear to be relevant in our preliminary analysis.

A K-nearest neighbors classification model will be built to predict the outcome of a match between two tennis players from their previous match data, using the variables above as predictors. The number of neighbors, K, will be tuned on the training data, and the final model's performance will be assessed on the testing data. The results will be visualized by plotting estimated accuracy against K to show how the choice of K affects the model's predictions.
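
The workflow described above can be sketched with `tidymodels` as follows. This is only an illustration under stated assumptions, not the final analysis: the `toy` data frame is synthetic (randomly generated, not real match statistics), the grid of K values and the 5-fold cross-validation are arbitrary choices, and all object names (`knn_recipe`, `k_results`, `best_k`, and so on) are hypothetical. The `kknn` package must be installed for the model engine.

```r
library(tidymodels)
set.seed(1)

# Synthetic stand-in for the long-format data (NOT real match statistics):
# one row per player per match, two numeric predictors, and the outcome label
n <- 200
toy <- tibble(
  ace = c(rnorm(n, mean = 8, sd = 3), rnorm(n, mean = 5, sd = 3)),
  df = c(rnorm(n, mean = 2, sd = 1), rnorm(n, mean = 4, sd = 1)),
  Outcome = factor(rep(c("W", "L"), each = n))
)

toy_split <- initial_split(toy, prop = 0.75, strata = Outcome)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardize predictors so that no single statistic dominates the distance
knn_recipe <- recipe(Outcome ~ ., data = toy_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# Model specification with K left open for tuning
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Cross-validate each candidate K on the training set only
folds <- vfold_cv(toy_train, v = 5, strata = Outcome)
k_grid <- tibble(neighbors = seq(from = 1, to = 21, by = 4))

k_results <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_tune_spec) %>%
  tune_grid(resamples = folds, grid = k_grid) %>%
  collect_metrics() %>%
  filter(.metric == "accuracy")

# Accuracy versus K
accuracy_vs_k <- ggplot(k_results, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors (K)", y = "Cross-validated accuracy estimate")

# Refit with the best K and evaluate once on the held-out test set
best_k <- k_results %>%
  arrange(desc(mean)) %>%
  slice(1) %>%
  pull(neighbors)

knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_fit <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_final_spec) %>%
  fit(data = toy_train)

test_accuracy <- predict(knn_fit, toy_test) %>%
  bind_cols(toy_test) %>%
  accuracy(truth = Outcome, estimate = .pred_class)
```

Printing `accuracy_vs_k` draws the tuning curve, and `test_accuracy` holds the single test-set estimate. Tuning on cross-validation folds of the training set keeps the testing data untouched until the final evaluation.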

In [3]:
library(plyr)
library(tidyverse)
library(repr)
library(tidymodels)
set.seed(1)

**Exploring the Data:**

In [4]:
# reading the data frame from a URL link
tennis <- read_csv("https://drive.google.com/uc?export=download&id=1fOQ8sy_qMkQiQEAO6uFdRX4tLI8EpSTn")
head(tennis)

“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  .default = col_double(),
  tourney_id = col_character(),
  tourney_name = col_character(),
  surface = col_character(),
  tourney_level = col_character(),
  winner_seed = col_character(),
  winner_entry = col_character(),
  winner_name = col_character(),
  winner_hand = col_character(),
  winner_ioc = col_character(),
  loser_seed = col_character(),
  loser_entry = col_character(),
  loser_name = col_character(),
  loser_hand = col_character(),
  loser_ioc = col_character(),
  score = col_character(),
  round = col_character()
)

See spec(...) for full column specifications.



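| X1 | tourney_id | tourney_name | surface | draw_size | tourney_level | tourney_date | match_num | winner_id | winner_seed | ⋯ | l_1stIn | l_1stWon | l_2ndWon | l_SvGms | l_bpSaved | l_bpFaced | winner_rank | winner_rank_points | loser_rank | loser_rank_points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 300 | 105453 | 2.0 | ⋯ | 54 | 34 | 20 | 14 | 10 | 15 | 9 | 3590 | 16 | 1977 |
| 1 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 299 | 106421 | 4.0 | ⋯ | 52 | 36 | 7 | 10 | 10 | 13 | 16 | 1977 | 239 | 200 |
| 2 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 298 | 105453 | 2.0 | ⋯ | 27 | 15 | 6 | 8 | 1 | 5 | 9 | 3590 | 40 | 1050 |
| 3 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 297 | 104542 |  | ⋯ | 60 | 38 | 9 | 11 | 4 | 6 | 239 | 200 | 31 | 1298 |
| 4 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 296 | 106421 | 4.0 | ⋯ | 56 | 46 | 19 | 15 | 2 | 4 | 16 | 1977 | 18 | 1855 |
| 5 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 295 | 104871 |  | ⋯ | 54 | 40 | 18 | 15 | 6 | 9 | 40 | 1050 | 185 | 275 |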


In [None]:
# wrangling and tidying the data
# drop walkovers (score recorded as "W/O")
tennis <- filter(tennis, score != "W/O")
# number of remaining matches, computed rather than hard-coded
n_matches <- nrow(tennis)
tennis <- select(tennis, winner_ioc, loser_ioc, winner_age, loser_age, w_ace, w_df, w_svpt, w_1stIn, w_1stWon, 
    w_2ndWon, w_SvGms, w_bpSaved, w_bpFaced, l_ace, l_df, l_svpt, l_1stIn, l_1stWon, l_2ndWon, l_SvGms, 
    l_bpSaved, l_bpFaced)

# separate the statistics for winning and losing players
tennis_w <- select(tennis, starts_with("w"))
tennis_w["Outcome"] <- "W"
# strip the prefixes (anchored with ^ so only the start of a name is changed)
colnames(tennis_w) <- gsub("^w_", "", colnames(tennis_w))
colnames(tennis_w) <- gsub("^winner_", "", colnames(tennis_w))
# "Index" column is added to keep track of matches and therefore players
tennis_w["Index"] <- seq_len(n_matches)

tennis_L <- select(tennis, starts_with("l"))
tennis_L["Outcome"] <- "L"
colnames(tennis_L) <- gsub("^l_", "", colnames(tennis_L))
colnames(tennis_L) <- gsub("^loser_", "", colnames(tennis_L))
tennis_L["Index"] <- seq_len(n_matches)

# rejoin the statistics for winning and losing players (one row per player per match)
tennis <- rbind(tennis_w, tennis_L)

# changing the column names that start with numbers
tennis <- rename(tennis, firstIn = `1stIn`, firstWon = `1stWon`, secondWon = `2ndWon`)

# the class label must be a factor for classification
tennis <- mutate(tennis, Outcome = as_factor(Outcome))

In [None]:
# split the data set into training and testing sets
tennis_split <- initial_split(tennis, prop = 0.75, strata = Outcome)
tennis_train <- training(tennis_split)
tennis_test <- testing(tennis_split) 
tennis_train

# the following exploratory data analysis only uses the training set

In [None]:
# exploratory data analysis table

means_table <- tennis_train %>%
    select(-ioc, -Outcome, -Index) %>%
    map_df(mean, na.rm = TRUE)
means_table
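
The table above averages each statistic over winners and losers together. Because the goal is to tell the two classes apart, it may also help to compare the means per outcome. The following is a minimal sketch of that idea, assuming the long format with an `Outcome` column produced earlier; the values in `toy_train` are made up for illustration and are not taken from the data set, and `means_by_outcome` is a hypothetical name.

```r
library(dplyr)

# Toy long-format rows with made-up values (NOT taken from the data set)
toy_train <- tibble(
  Outcome = c("W", "W", "L", "L"),
  ace = c(10, 6, 4, NA),
  df = c(2, 1, 3, 5)
)

# Mean of every numeric column within each outcome, ignoring missing values
means_by_outcome <- toy_train %>%
  group_by(Outcome) %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

print(means_by_outcome)
```

A large gap between the "W" and "L" rows for a statistic would suggest that it is a useful predictor, which the plots below examine in more detail.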

In [None]:
# exploratory data analysis plot 1:

options(repr.plot.width = 17, repr.plot.height = 9) 

ioc_plot <- tennis_train %>% 
   ggplot(aes(x = ioc, fill = Outcome)) + 
   geom_bar(position = position_dodge(), width = .5) + 
   xlab("Country of Representation") +
   ylab("Count") +
   theme(text = element_text(size = 18)) +
   theme(axis.text.x = element_text(angle = 50, hjust = 1)) + 
   ggtitle("Players' Country of Representation") +
   theme(plot.title = element_text(hjust = 0.5))
ioc_plot

In [None]:
# INSIGHTS FROM GRAPH

In [None]:
# exploratory data analysis plot 2:

options(repr.plot.width = 15, repr.plot.height = 8) 

age_plot <- tennis_train %>% 
   ggplot(aes(x = round(age), fill = Outcome)) + 
   geom_bar(stat = "count", position = position_dodge(), width = 0.5) + 
   xlab("Age (Rounded to the nearest Year)")+
   ylab("Count") +
   theme(text = element_text(size = 18)) +
   ggtitle("Players' Ages") + 
   theme(plot.title = element_text(hjust = 0.5))
age_plot

In [None]:
# INSIGHTS FROM GRAPH

In [None]:
# exploratory data analysis plot 3:

options(repr.plot.width = 15, repr.plot.height = 8) 

ace_plot <- tennis_train %>% 
   ggplot(aes(x = ace, fill=Outcome)) + 
   geom_bar(position = position_dodge(), width = 0.5) + 
   xlab("Aces") +
   ylab("Count") +
   theme(text = element_text(size = 18)) +
   ggtitle("Frequency of Aces") + 
   theme(plot.title = element_text(hjust = 0.5))
ace_plot

In [None]:
# INSIGHTS FROM GRAPH

In [None]:
# exploratory data analysis plot 4:

options(repr.plot.width = 13, repr.plot.height = 7) 

df_plot <- tennis_train %>% 
   ggplot(aes(x = df, fill = Outcome)) + 
   geom_bar(position = position_dodge(), width = 0.5) + 
   xlab("Double Faults") +
   ylab("Count") +
   theme(text = element_text(size = 17)) +
   ggtitle("Frequency of Double Faults") + 
   theme(plot.title = element_text(hjust = 0.5))
df_plot

In [None]:
# INSIGHTS FROM GRAPH

In [None]:
# exploratory data analysis plot 5: 

options(repr.plot.width = 18, repr.plot.height = 9)

svpt_plot <- tennis_train %>% 
   ggplot(aes(x = svpt, fill = Outcome)) + 
   geom_bar(position = position_dodge(), width = 0.5) + 
   xlab("Serve Points") +
   ylab("Count") +
   theme(text = element_text(size = 18)) +
   ggtitle("Frequency of Serve Points") + 
   theme(plot.title = element_text(hjust = 0.5))
svpt_plot

In [None]:
# INSIGHTS FROM GRAPH

In [None]:
# exploratory data analysis plot 6:

options(repr.plot.width = 18, repr.plot.height = 9)

firstIn_plot <- tennis_train %>% 
   ggplot(aes(x = firstIn, fill=Outcome)) + 
   geom_bar(position = position_dodge(), width = .5) + 
   xlab("First Serves Made") +
   ylab("Count") +
   theme(text = element_text(size = 19)) +
   ggtitle("Frequency of First Serves Made") + 
   theme(plot.title = element_text(hjust = 0.5))
firstIn_plot

In [None]:
# INSIGHTS FROM GRAPH

In [None]:
# exploratory data analysis plot 7:

options(repr.plot.width = 18, repr.plot.height = 10)

firstWon_plot <- tennis_train %>% 
   ggplot(aes(x = firstWon, fill = Outcome)) + 
   geom_bar(position = position_dodge()) + 
   xlab("First Serve Points Won") +
   ylab("Count") +
   theme(text = element_text(size = 20)) +
   ggtitle("Frequency of First Serve Points Won") + 
   theme(plot.title = element_text(hjust = 0.5))
firstWon_plot

In [None]:
# INSIGHTS FROM GRAPH

In [None]:
# exploratory data analysis plot 8:

options(repr.plot.width = 18, repr.plot.height = 9)

secondWon_plot <- tennis_train %>% 
   ggplot(aes(x = secondWon, fill=Outcome)) + 
   geom_bar(position = position_dodge(), width = .5) + 
   xlab("Second Serve Points Won") +
   ylab("Count") +
   theme(text = element_text(size = 19)) +
   theme(axis.text.x = element_text(angle = 40, hjust = 1)) + 
   ggtitle("Frequency of Second Serve Points Won") +
   theme(plot.title = element_text(hjust = 0.5))
secondWon_plot

In [None]:
# INSIGHTS FROM GRAPH

In [None]:
# exploratory data analysis plot 9:

options(repr.plot.width = 15, repr.plot.height = 8)

SvGms_plot <- tennis_train %>% 
   ggplot(aes(x = SvGms, fill=Outcome)) + 
   geom_bar(position = position_dodge(), width = .7) + 
   xlab("Service Games Won") +
   ylab("Count") +
   theme(text = element_text(size = 18)) +
   theme(axis.text.x = element_text(angle = 40, hjust = 1)) + 
   ggtitle("Frequency of Service Games Won") +
   theme(plot.title = element_text(hjust = 0.5))
SvGms_plot

In [None]:
# INSIGHTS FROM GRAPH

In [None]:
# exploratory data analysis plot 10:

options(repr.plot.width = 14, repr.plot.height = 7)

bpFaced_plot <- tennis_train %>% 
   ggplot(aes(x = bpFaced, fill=Outcome)) + 
   geom_bar(position = position_dodge(), width = .5) + 
   xlab("Break Points Faced") +
   ylab("Count") +
   theme(text = element_text(size = 16)) +
   theme(axis.text.x = element_text(angle = 40, hjust = 1)) + 
   ggtitle("Frequency of Break Points Faced") +
   theme(plot.title = element_text(hjust = 0.5))
bpFaced_plot

In [None]:
# INSIGHTS FROM GRAPH

In [None]:
# exploratory data analysis plot 11:

options(repr.plot.width = 15, repr.plot.height = 8)

bpSaved_plot <- tennis_train %>% 
   ggplot(aes(x = bpSaved, fill=Outcome)) + 
   geom_bar(position = position_dodge(), width = .5) + 
   xlab("Break Points Saved") +
   ylab("Count") +
   theme(text = element_text(size = 17)) +
   theme(axis.text.x = element_text(angle = 40, hjust = 1)) + 
   ggtitle("Frequency of Break Points Saved") +
   theme(plot.title = element_text(hjust = 0.5))
bpSaved_plot

In [None]:
# INSIGHTS FROM GRAPH

**Classification:**

In [None]:
# KNN classification

In [None]:
# visualization 

### DISCUSSION

- summarize what you found
- discuss whether this is what you expected to find?
- discuss what impact could such findings have?
- discuss what future questions could this lead to?

FROM PROPOSAL:

We expect to find a correlation between match statistics and match outcome. For example, a player who hits a high number of aces is more likely to win the match, while a player who hits few aces is more likely to lose.

This data analysis could have an impact on the training of tennis players. If certain match statistics are found to increase the chance of winning a match, players could focus on training those skills. The analysis could also benefit audiences who participate in betting: using previous match statistics or data from early matches, participants could more accurately place money on the winning player.

**Future questions that could be investigated:**
- Could these predictions be used to find rank difference, winner rank and loser rank?
- Are certain match statistics more influential in the match outcome than others?

### REFERENCES
[1] https://drive.google.com/uc?export=download&id=1fOQ8sy_qMkQiQEAO6uFdRX4tLI8EpSTn

[2] https://count.co/notebook/j0OYDOaWDmn

[3] De Seranno, A. (2020). Predicting Tennis Matches Using Machine Learning (dissertation). Ghent University, Ghent. 