## 2017 ATP Tour Ranking Predictions

### Introduction:

The Association of Tennis Professionals (ATP) is a governing body of men’s professional tennis. It runs an annual international tour in which the world’s professional male players compete against opponents from many countries. Tournaments are held around the world, and winning matches in them earns players ranking points. Each week, the ATP updates the player rankings, adding the points gained in the preceding 7-day period to the total each player has accumulated over the season (Association of Tennis Professionals, 2019).

Because so many tournaments are held in a single season, shifts in the rankings are common. The goal of this project is to predict the rankings of these professional tennis players from a number of variables. The analysis focuses on the 2017 ATP World Tour (Sackmann, 2017), using data gathered from tournaments held between January 2017 and November 2017. The data include the names of the winner and loser of every match in each round of national and international tournaments, along with multiple match statistics. These statistics let us address our question: how well do variables describing a player, such as win percentage, first-serve wins, and age, predict the player’s ATP ranking in later seasons?

### Preliminary exploratory data analysis:

#### Downloading the dataset:

In [2]:
import pandas as pd

In [4]:
url = "https://drive.google.com/uc?export=download&id=1fOQ8sy_qMkQiQEAO6uFdRX4tLI8EpSTn"
tennis_data = pd.read_csv(url)

tennis_data

Unnamed: 0.1,Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,0,2019-M020,Brisbane,Hard,32,A,20181231,300,105453,2,...,54.0,34.0,20.0,14.0,10.0,15.0,9.0,3590.0,16.0,1977.0
1,1,2019-M020,Brisbane,Hard,32,A,20181231,299,106421,4,...,52.0,36.0,7.0,10.0,10.0,13.0,16.0,1977.0,239.0,200.0
2,2,2019-M020,Brisbane,Hard,32,A,20181231,298,105453,2,...,27.0,15.0,6.0,8.0,1.0,5.0,9.0,3590.0,40.0,1050.0
3,3,2019-M020,Brisbane,Hard,32,A,20181231,297,104542,,...,60.0,38.0,9.0,11.0,4.0,6.0,239.0,200.0,31.0,1298.0
4,4,2019-M020,Brisbane,Hard,32,A,20181231,296,106421,4,...,56.0,46.0,19.0,15.0,2.0,4.0,16.0,1977.0,18.0,1855.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6861,2881,2017-0605,Tour Finals,Hard,8,F,20171113,300,105777,6.0,...,54.0,42.0,22.0,15.0,11.0,15.0,6.0,3650.0,8.0,2975.0
6862,2882,2017-M-DC-2017-WG-M-BEL-FRA-01,Davis Cup WG F: BEL vs FRA,Hard,4,D,20171124,1,105676,,...,53.0,33.0,21.0,14.0,6.0,11.0,7.0,3775.0,18.0,2235.0
6863,2883,2017-M-DC-2017-WG-M-BEL-FRA-01,Davis Cup WG F: BEL vs FRA,Hard,4,D,20171124,2,104542,,...,54.0,30.0,12.0,12.0,5.0,11.0,15.0,2320.0,76.0,667.0
6864,2884,2017-M-DC-2017-WG-M-BEL-FRA-01,Davis Cup WG F: BEL vs FRA,Hard,4,D,20171124,4,105676,,...,54.0,44.0,13.0,14.0,7.0,10.0,7.0,3775.0,15.0,2320.0


#### Filtering the Dataset:

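In this step, we filter the original tennis dataset to include only matches held in 2017. The `tourney_date` column stores each date as an integer in YYYYMMDD form (for example, 20170102), so integer division by 10000 leaves just the year.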

In [21]:
tennis_data_2017 = tennis_data[tennis_data["tourney_date"]//10000 == 2017]

tennis_data_2017

Unnamed: 0.1,Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
4906,0,2017-M020,Brisbane,Hard,32,A,20170102,271,104678,,...,53.0,33.0,13.0,11.0,6.0,10.0,29.0,1385.0,100.0,604.0
4907,1,2017-M020,Brisbane,Hard,32,A,20170102,272,106378,,...,67.0,39.0,27.0,12.0,9.0,10.0,45.0,1001.0,141.0,443.0
4908,2,2017-M020,Brisbane,Hard,32,A,20170102,273,106298,6.0,...,42.0,29.0,16.0,12.0,0.0,4.0,15.0,2156.0,25.0,1585.0
4909,4,2017-M020,Brisbane,Hard,32,A,20170102,276,111442,,...,43.0,23.0,13.0,9.0,10.0,15.0,79.0,689.0,160.0,372.0
4910,6,2017-M020,Brisbane,Hard,32,A,20170102,278,105777,7.0,...,36.0,21.0,7.0,8.0,4.0,8.0,17.0,2035.0,33.0,1320.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6861,2881,2017-0605,Tour Finals,Hard,8,F,20171113,300,105777,6.0,...,54.0,42.0,22.0,15.0,11.0,15.0,6.0,3650.0,8.0,2975.0
6862,2882,2017-M-DC-2017-WG-M-BEL-FRA-01,Davis Cup WG F: BEL vs FRA,Hard,4,D,20171124,1,105676,,...,53.0,33.0,21.0,14.0,6.0,11.0,7.0,3775.0,18.0,2235.0
6863,2883,2017-M-DC-2017-WG-M-BEL-FRA-01,Davis Cup WG F: BEL vs FRA,Hard,4,D,20171124,2,104542,,...,54.0,30.0,12.0,12.0,5.0,11.0,15.0,2320.0,76.0,667.0
6864,2884,2017-M-DC-2017-WG-M-BEL-FRA-01,Davis Cup WG F: BEL vs FRA,Hard,4,D,20171124,4,105676,,...,54.0,44.0,13.0,14.0,7.0,10.0,7.0,3775.0,15.0,2320.0


#### Adding a 'Win Percentage' Column:

In this step, we add another predictor column called 'win_percent', which is each player's winning percentage across the tournaments. Win percentage is a player's total number of wins divided by the total number of matches they have played.
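The win-percentage definition above can be sketched as follows. This is a minimal illustration, not the notebook's actual cell: it runs on a small made-up DataFrame rather than `tennis_data_2017`, and it assumes a `loser_id` column exists alongside `winner_id` (only `winner_id` is visible in the truncated output above). The names `record`, `matches_played`, and `player_id` are introduced here for illustration only.

```python
import pandas as pd

# Made-up stand-in for tennis_data_2017; only the two ID columns matter here.
# ASSUMPTION: `loser_id` exists in the real data (it is hidden by the "..." above).
matches = pd.DataFrame({
    "winner_id": [101, 101, 102, 103, 101],
    "loser_id":  [102, 103, 103, 102, 102],
})

# Count wins and losses per player, then align the two counts on player ID.
wins = matches["winner_id"].value_counts()
losses = matches["loser_id"].value_counts()
record = pd.DataFrame({"wins": wins, "losses": losses}).fillna(0).astype(int)

# Win percentage = wins / matches played, expressed on a 0-100 scale.
record["matches_played"] = record["wins"] + record["losses"]
record["win_percent"] = 100 * record["wins"] / record["matches_played"]
record.index.name = "player_id"

print(record)
```

Filling missing counts with zero matters here: a player who never lost (or never won) is absent from one of the two counts and would otherwise get a missing win percentage.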

### Methods:

During our preliminary data analysis, we first tidied and scaled the data and then narrowed it down to the variables relevant to our analysis. Every variable used in the analysis was averaged per player: each statistic was collected for every match a player played during the season, and the per-match values were then averaged. As a result, each player’s ranking, service points, aces, and other predictor variables can be compared directly with those of other players, and the data are tidy, with one value of each statistic per player rather than several cells of the same statistic attached to a single player’s name.
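One way the per-player averaging described above could be done is sketched below. It is an illustration under stated assumptions rather than the analysis code itself: the rows are made up, `winner_id`, `winner_rank`, `loser_rank`, and `l_1stWon` appear in the output shown earlier, while `loser_id` and `w_1stWon` are assumed counterparts hidden by the truncated columns. The shared names `player_id`, `rank`, and `first_serve_won` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only.
# ASSUMPTION: `loser_id` and `w_1stWon` exist in the real data (elided above).
matches = pd.DataFrame({
    "winner_id":   [101, 101, 102],
    "loser_id":    [102, 103, 103],
    "winner_rank": [5.0, 5.0, 20.0],
    "loser_rank":  [20.0, 40.0, 40.0],
    "w_1stWon":    [40.0, 30.0, 35.0],
    "l_1stWon":    [25.0, 20.0, 15.0],
})

shared = ["player_id", "rank", "first_serve_won"]

# Rename the winner-side and loser-side columns to one shared set of names.
winner_side = matches.rename(columns={
    "winner_id": "player_id", "winner_rank": "rank", "w_1stWon": "first_serve_won",
})[shared]
loser_side = matches.rename(columns={
    "loser_id": "player_id", "loser_rank": "rank", "l_1stWon": "first_serve_won",
})[shared]

# Stack both sides so each row is one player's statistics from one match.
per_match = pd.concat([winner_side, loser_side], ignore_index=True)

# Average over matches: the result has exactly one row per player.
player_means = per_match.groupby("player_id").mean()

print(player_means)
```

Stacking the winner and loser sides before grouping means a player's average reflects every match they played, not only the ones they won.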

While tidying the data, we filtered the tennis dataset so that it contains only matches played in 2017.

### Expected outcomes and significance: