# Lab 3: Data Manipulation


## Preliminaries

1. Piazza reminders
    - When asking a homework question, include all the information relevant to your problem so that it is *reproducible*
        - What code did you run before running into your problem?
        - What does the output say?
        - Take a screenshot or picture for best results
2. Google Colab demonstration
    - Download the homework/lecture from GitHub (.ipynb file).
    - Open Google Colab: https://colab.research.google.com/notebooks/, and choose the "upload" option. Choose the file you've downloaded from GitHub.
    - If you want to make a new notebook from scratch, you can use this link: https://colab.research.google.com/#create=true&language=r
    - Save as PDF: click File > Print > Save as PDF.
    

In [None]:
library(tidyverse) # load tidyverse

In [None]:
tennis_data <- read.csv('https://raw.githubusercontent.com/bmanzo/stats306_labs/master/lab03/FrenchOpen-men-2013.csv')

In [None]:
head(tennis_data)

Data source (UCI Machine Learning Repository): https://archive.ics.uci.edu/ml/datasets/Tennis+Major+Tournament+Match+Statistics#

## `dplyr` functions

The `filter` function retrieves the subset of rows in a dataset that satisfy a condition.

Roger Federer is a very famous tennis player. Let's use `filter` to find all the matches in which he played in the 2013 French Open.

In [None]:
(federer <- tennis_data %>% 
            filter(Player1 == 'Roger Federer' | Player2 == 'Roger Federer')) 

If you want to assign as well as print the variable, enclose the command in parentheses.
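As a minimal sketch of the parentheses idiom on its own, here is the same pattern applied to a toy vector (the name `sets_won` and its values are made up for illustration):

```r
# Plain assignment stores the value but prints nothing
sets_won <- c(3, 1, 3)

# Wrapping the assignment in parentheses stores the value and also prints it
(sets_won <- c(3, 1, 3))
```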

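The above table is useful, but we don't need all of the columns. We can use `select` to show only a subset of the columns. Create a new table, `fed_select`, which shows only the fields `Player1`, `Player2`, `Round`, and `Result`. Check the column order with `names(federer)` first: when the columns you want sit next to each other, a range such as `Player1:Result` selects all of them at once.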

In [None]:
names(federer)

In [None]:
(fed_select <- tennis_data %>% 
                 filter(Player1 == 'Roger Federer' | Player2 == 'Roger Federer') %>% 
                 select(Player1:Result))

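Inside `filter`, we can also use helpers such as `between()` and the `%in%` operator. `x %in% values` is `TRUE` wherever `x` matches any element of `values`, and `between(x, left, right)` is shorthand for `x >= left & x <= right`.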

In [None]:
top_three <- tennis_data %>% 
                filter(Player1 %in% c('Roger Federer', 'Novak Djokovic', 'Rafael Nadal') | 
                       Player2 %in% c('Roger Federer', 'Novak Djokovic', 'Rafael Nadal'))

In [None]:
middle_round <- tennis_data %>% 
                filter(between(Round, 3, 5))
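To see what these two helpers return before they are passed to `filter`, it can help to run them on small vectors. The following is only a sketch: the vectors `players` and `rounds` are made-up values, not taken from the tennis data.

```r
library(dplyr)

# Made-up values for illustration only
players <- c('Roger Federer', 'Marin Cilic', 'Rafael Nadal')
rounds <- c(1, 3, 5, 7)

# TRUE wherever the element matches any value on the right-hand side
in_top <- players %in% c('Roger Federer', 'Novak Djokovic', 'Rafael Nadal')

# between() includes both endpoints: rounds >= 3 & rounds <= 5
in_middle <- between(rounds, 3, 5)

print(in_top)
print(in_middle)
```

Both calls return a logical vector with one entry per element, which is the form `filter` uses to decide which rows to keep.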

Suppose we are interested in the later rounds of the tournament. We can use the `arrange` function to order rows instead of filtering for a subset of them. 

In [None]:
tennis_data %>% arrange(desc(Round))

Notice how in the above code, we use `desc()` to sort from largest to smallest. 
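`arrange` also accepts several sort keys, with later keys breaking ties in earlier ones. A minimal sketch on a toy data frame (the column names mirror the tennis data, but the values are made up):

```r
library(dplyr)

# Toy data; values are made up for illustration
toy <- data.frame(
    Round = c(1, 3, 3, 2),
    UFE.1 = c(40, 25, 10, 30)
)

# Latest round first; within a round, fewest unforced errors first
sorted <- toy %>% arrange(desc(Round), UFE.1)
print(sorted)
```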

Unforced errors are bad, so we might be interested in finding matches with the fewest unforced errors.  
Again we'll use the `select` function because we are only interested in some of the columns. 

In [None]:
tennis_data %>% 
    arrange(UFE.1 + UFE.2) %>% 
    select(Player1:Result, UFE.1, UFE.2) %>% 
    head() # use head so the whole table doesn't print out

Remember that `select` has some helper functions. How could we rewrite the above code using `starts_with`?

In [None]:
tennis_data %>% 
    arrange(UFE.1+UFE.2) %>% 
    select(Player1:Result, starts_with('UFE')) %>% 
    head()

We can also use `contains()`, which matches the string anywhere in the column name.

In [None]:
tennis_data %>% 
    arrange(UFE.1+UFE.2) %>% 
    select(Player1:Result, contains('UFE')) %>% 
    head()

Notice that variables corresponding to `Player1` end in `1`. How would we select all the player 1 variables?

In [None]:
tennis_data %>% 
    select(ends_with('1')) %>% 
    head() 
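One way to check which columns a helper matches, without printing a whole table, is to pipe the result into `names()`. The sketch below uses a made-up one-row data frame that borrows a few column names from the tennis data; the values are placeholders.

```r
library(dplyr)

# Toy one-row table; values are placeholders for illustration
toy <- data.frame(
    Player1 = 'A', Player2 = 'B',
    UFE.1 = 10, UFE.2 = 12,
    ACE.1 = 5, ACE.2 = 7
)

# names() on the result shows exactly which columns each helper picked
ufe_cols <- toy %>% select(starts_with('UFE')) %>% names()
p1_cols  <- toy %>% select(ends_with('1')) %>% names()
p1_stats <- toy %>% select(ends_with('.1')) %>% names()

print(ufe_cols)
print(p1_cols)
print(p1_stats)
```

In this toy table, `ends_with('1')` also picks up `Player1` itself, while matching on `'.1'` keeps only the statistics columns.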

### `mutate`

We are likely interested in some aggregate statistics, i.e., combining the results of players 1 and 2 in a match. We'll use `mutate` to create new variables to analyze these statistics.  

Suppose we're interested in looking at the length of matches (how many sets are played). One way to do this is to add `FNL.1` (total number of sets won by player 1) to `FNL.2` (total for player 2). 

In [None]:
tennis_data_2 <- tennis_data %>% 
                    mutate(total_sets = FNL.1 + FNL.2)

Now we can sort the matches from longest to shortest. 

In [None]:
tennis_data_2 %>% 
    arrange(desc(total_sets)) %>% 
    head()
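A single `mutate` call can create several variables, and later ones may refer to earlier ones defined in the same call. The following sketch shows this on a toy data frame; the values are made up, and `total_ufe` and `ufe_per_set` are hypothetical names chosen for illustration.

```r
library(dplyr)

# Toy data; column names mirror the tennis data but the values are made up
toy <- data.frame(
    FNL.1 = c(3, 3, 2),
    FNL.2 = c(0, 2, 3),
    UFE.1 = c(20, 45, 50),
    UFE.2 = c(31, 40, 50)
)

toy_2 <- toy %>%
    mutate(
        total_sets  = FNL.1 + FNL.2,
        total_ufe   = UFE.1 + UFE.2,
        ufe_per_set = total_ufe / total_sets  # uses the two columns defined just above
    )
print(toy_2)
```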

## Exercises

1. A better measure of match length might be to measure the total number of points played. Compute `total_points` from the variables `TPW.1` and `TPW.2`. Add this to `tennis_data_2`.
2. Create a variable `ace_rate` which is the total number of aces in a match divided by the total number of points played. Add this to `tennis_data_2`.
3. Create a variable `cilic` that is `TRUE` for all matches in which Marin Cilic played and `FALSE` otherwise.
4. Sort the data by `Round`, then by `ace_rate`.
5. Create a table containing all matches before the 6th round in which both players had a first serve percentage above 65%.
6. A player wins in straight sets if his opponent does not win a single set. How many matches were not won in straight sets?

## `summarise`

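You'll learn more about data summaries in this week's lecture, but we'll introduce the concept here. `summarise` collapses a data frame into a single row of summary statistics, or into one row per group when combined with `group_by`. The cells below use `total_points` and `ace_rate`, so complete Exercises 1 and 2 first.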

In [None]:
tennis_data_2 %>% 
    summarise(total_matches = n(), 
              avg_points = mean(total_points), 
              avg_sets = mean(total_sets))

We can combine `summarise` with other `dplyr` operations such as `group_by`, which computes the summary separately for each group.

In [None]:
tennis_data_2 %>% 
    group_by(Round) %>% 
    summarise(total_matches=n(), avg_points = mean(total_points))

In [None]:
usa_players <- c('Sam Querrey', 'John Isner')
tennis_data_2 %>% 
    group_by(Player1 %in% usa_players | Player2 %in% usa_players) %>% 
    summarise(avg_ace = mean(ace_rate))
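When grouping by an expression, the grouping column in the output is labelled with the full expression text. One option is to name the expression inside `group_by`. The sketch below does this on a toy data frame; the values are made up and `has_usa_player` is a hypothetical column name.

```r
library(dplyr)

# Toy data; column names mirror the tennis data but the values are made up
toy <- data.frame(
    Player1 = c('Sam Querrey', 'Roger Federer', 'John Isner', 'Rafael Nadal'),
    Player2 = c('Marin Cilic', 'Rafael Nadal', 'Roger Federer', 'Marin Cilic'),
    ace_rate = c(0.10, 0.04, 0.12, 0.02)
)
usa_players <- c('Sam Querrey', 'John Isner')

# Naming the grouping expression gives the output column a readable label
usa_summary <- toy %>%
    group_by(has_usa_player = Player1 %in% usa_players | Player2 %in% usa_players) %>%
    summarise(total_matches = n(), avg_ace = mean(ace_rate))
print(usa_summary)
```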

We can even sort the summary table based on the results of the summary statistics.

In [None]:
tennis_data_2 %>% 
    filter(Round < 5) %>% 
    group_by(Round) %>% 
    summarise(avg_FSP = mean((FSP.1 + FSP.2)/2)) %>% 
    arrange(desc(avg_FSP))

We can assign summary tables to variables and then plot them.

In [None]:
# named round_summary so the variable does not mask base R's round() function
round_summary <- tennis_data_2 %>% 
        filter(total_sets > 2) %>%
        group_by(Round) %>%
        summarise(avg_ace = mean(ace_rate), avg_points = mean(total_points))

In [None]:
ggplot(round_summary) + 
    geom_bar(aes(x = Round, y = avg_points), stat = 'identity', fill = 'green')
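As an alternative sketch, `geom_col()` draws bars whose heights come directly from the `y` values, so it can stand in for `geom_bar(stat = 'identity')`. The data frame `toy_summary` below is made up for illustration and is not the summary table computed above.

```r
library(ggplot2)

# Toy summary table; values are made up for illustration
toy_summary <- data.frame(
    Round = 1:4,
    avg_points = c(210, 225, 240, 260)
)

# geom_col() uses the y values as bar heights, like geom_bar(stat = 'identity')
p <- ggplot(toy_summary) +
    geom_col(aes(x = Round, y = avg_points), fill = 'green')
print(p)
```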

## Feedback

Please fill out the form: https://forms.gle/BoJeoQUwYBorEZaTA 