<a href="https://colab.research.google.com/github/vaibhavjha06/ds1002-fhy5uh/blob/main/vaibhav-jha_lab2_race_results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DS1002 Lab 2: Determine Race Results with R

In this lab you will work with a dataset, writing R to generate the deliverables specified in the cells below.

The dataset for this lab is made up of fictitious results from a road race. Runner information and results are provided in the data.

Answer the questions below with the appropriate R code. Point assignments are indicated for each section. There are 10 total points possible for this lab.

Useful reference material (check all R modules within the Canvas site for more help):
- [R Reference Material](https://canvas.its.virginia.edu/courses/78571/modules#module_219810)
- [Plots Samples](https://colab.research.google.com/github/nmagee/ds1002/blob/main/notebooks/25-plots-in-r.ipynb)

## Group Submissions

If you are working in a group to complete this lab, you may have no more than 3 members to a group. Group members should be indicated in the cell below -- list both names and UVA computing IDs.

Each student should then submit **the same URL** for the lab in Canvas. (If a group has Member1, Member2, and Member3, only one member needs to save the completed work back to GitHub and all members should submit that URL for grading.)

In [None]:
# List group members (if applicable). Identify names and computing IDs
#
# Name                    Computing ID
# Vaibhav Jha             fhy5uh

## 1. Load Libraries & Data (1 pt)

https://raw.githubusercontent.com/nmagee/ds1002/main/data/road-race.csv

Import any necessary libraries and load the remote CSV file below into a data frame.

In [None]:
#
library(tidyverse)
library(dplyr)
library(ggplot2)

df <- read.csv('https://raw.githubusercontent.com/nmagee/ds1002/main/data/road-race.csv')
df


## 2. Get Summary Data (1 pt)

In code, display how many rows and columns are in the raw dataset.
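
As a minimal sketch of the functions involved, the example below uses a small made-up data frame (`toy` and its values are assumptions, not the race data). `dim()` returns the row count followed by the column count, while `nrow()` and `ncol()` return each one separately.

```r
# Toy data frame standing in for the race data (values are made up)
toy <- data.frame(
  runner_age    = c(31, 45, 27),
  runner_gender = c("F", "M", "F")
)

dim(toy)   # rows first, then columns
nrow(toy)  # number of rows only
ncol(toy)  # number of columns only
```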

In [None]:
#
str(df)
# gives observations by variables (764,5)

nrow(df)
# number of rows is 764
ncol(df)
# number of columns is 5

## 3. Clean and Organize the Data (2 pts)

Check for data quality.

- Resolve any duplicate rows.
- If a runner does not have a finish time, they are DNF and should not be counted in the dataset.
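
The two cleaning steps above can be sketched on a toy data frame. The column `runner_id` and all values here are hypothetical, chosen only to show one duplicate row and one blank finish time; the real dataset may look different.

```r
# Toy data: row 2 duplicates row 1, and row 4 has a blank finish time (DNF)
toy <- data.frame(
  runner_id   = c(1, 1, 2, 3),  # hypothetical column for illustration
  finish_time = c("2023-10-01T12:45", "2023-10-01T12:45",
                  "2023-10-01T13:02", ""),
  stringsAsFactors = FALSE
)

# Keep only the first occurrence of each fully identical row
deduped <- toy[!duplicated(toy), ]

# Treat blank finish times as missing, then drop those runners
deduped$finish_time[deduped$finish_time == ""] <- NA
finishers <- deduped[!is.na(deduped$finish_time), ]

nrow(finishers)
```

Filtering on the `finish_time` column alone, rather than calling `na.omit()` on the whole data frame, avoids dropping runners who finished but have a missing value in some unrelated column.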



In [None]:
#

nrow(df[duplicated(df), ])
# 124 duplicated rows

df2 <- df[!duplicated(df), ]
df2
# cleaned, 640 rows now



df2[df2==""] <- NA
df2
# filled blanks with NA

df3 <- na.omit(df2)
df3

# df3 is without duplicates and NAs


Now display the first 10 rows of the cleaned dataset.

In [None]:
#

head(df3, 10)

## 4. Calculate Elapsed Time (3 pts)

Using R, add a new column named `finish_minutes` to the data frame that holds the number of minutes each runner took to complete the race. Ideally this column consists of plain integers.

The starting gun was fired at precisely 12:00pm that day.

Note: This is calculated using a built-in function of R, `difftime()` which takes 3 parameters:

- End time
- Start time
- Units

The result is the difference between the two, reported in the requested units: `3 days`, `112 mins`, etc.

The syntax for that function is below. Take care to use the proper order of parameters. The `as.POSIXct` casting makes it possible to read a long datetime in the `YYYY-MM-DDTHH:MM` format, a common `datetime` value. The `format` parameter specifies the pattern you are trying to read.

```
df$new_column <- difftime(as.POSIXct(df$end_column, format="%Y-%m-%dT%H:%M"),
                          as.POSIXct(df$start_column, format="%Y-%m-%dT%H:%M"),
                          units="mins")
```
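
A worked example may make the parameter order easier to see. The timestamps below are made up and assume the `YYYY-MM-DDTHH:MM` pattern described above; the `tz = "UTC"` argument is an added assumption that keeps the result independent of the local time zone.

```r
# Made-up timestamps in the YYYY-MM-DDTHH:MM pattern
toy <- data.frame(
  start_time  = c("2023-10-01T12:00", "2023-10-01T12:00"),
  finish_time = c("2023-10-01T12:47", "2023-10-01T13:05")
)

fmt <- "%Y-%m-%dT%H:%M"

# End time first, start time second, then the units
elapsed <- difftime(as.POSIXct(toy$finish_time, format = fmt, tz = "UTC"),
                    as.POSIXct(toy$start_time,  format = fmt, tz = "UTC"),
                    units = "mins")

# Convert the difftime values to plain integers
toy$finish_minutes <- as.integer(elapsed)
toy$finish_minutes  # 47 and 65
```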

In [None]:
#
df3$start_time <- c('12:00')
# added a start time column

df3$finish_minutes <- (difftime(
    as.POSIXct(df3$finish_time, format="%H:%M"),
    as.POSIXct(df3$start_time, format="%H:%M"),
    units="min"))
# calculated the finish_minutes column

df3$finish_minutes <- as.integer(df3$finish_minutes)
df3
# set the finish_minutes column as an integer

## 5. Identify Winners by Gender (2 pts)

Based on the minutes it took each runner to complete the race, identify the top three places for each gender.

There are several ways to do this, some of which require less code than others. You will only be graded for producing the correct output, not on how elegant/advanced your programming is.
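
One of those ways, sketched here on made-up data, is `slice_min()` from `dplyr` (available in dplyr 1.0.0 and later), which picks the rows with the smallest values within each group. The gender codes and times below are assumptions for illustration only.

```r
library(dplyr)

# Made-up finish times for two groups of four runners
toy <- data.frame(
  runner_gender  = c("F", "F", "F", "F", "M", "M", "M", "M"),
  finish_minutes = c(52, 47, 60, 49, 45, 58, 51, 44)
)

# Three fastest runners per gender; with_ties = FALSE caps each group at 3 rows
top3 <- toy %>%
  group_by(runner_gender) %>%
  slice_min(finish_minutes, n = 3, with_ties = FALSE) %>%
  ungroup()

top3
```

With `with_ties = TRUE` (the default), runners tied for third place would all be returned, so a group could contain more than three rows.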

In [None]:
#
df3 %>%
  arrange(finish_minutes) %>%
  group_by(runner_gender) %>%
  slice(1:3)

# Got the top 3 times for each gender by arranging for finish_minutes, grouping
# by gender, and slicing for the first three in each category


## 6. Plot the Data (3 pts)

Finally, using `ggplot2` create two plots of the data -- density plots of race finishers.

- In the first plot use `finish_minutes` as the x axis.
- In the second plot use `runner_age` as the x axis.
- Use `runner_gender` as the fill.
- We suggest using a `geom_density(alpha=0.2)` or thereabouts to see layers through one another.
- Use the `gridExtra` library's `grid.arrange()` method to plot them both.

You will note that since this is artificial data you will be able to see the gender layers clearly enough but they will not be statistically meaningful.
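
The steps above can be sketched as follows, using randomly generated stand-in data (the values and gender codes are assumptions, not the race results). If `library(gridExtra)` reports that the package cannot be found, it likely needs to be installed first with `install.packages("gridExtra")`.

```r
# install.packages("gridExtra")  # uncomment if the package is not installed
library(ggplot2)
library(gridExtra)

# Randomly generated stand-in data
set.seed(1)
toy <- data.frame(
  finish_minutes = round(rnorm(200, mean = 60, sd = 10)),
  runner_age     = sample(18:70, 200, replace = TRUE),
  runner_gender  = sample(c("F", "M"), 200, replace = TRUE)
)

# Store each plot in a variable instead of printing it right away
p1 <- ggplot(toy, aes(x = finish_minutes, fill = runner_gender)) +
  geom_density(alpha = 0.2)
p2 <- ggplot(toy, aes(x = runner_age, fill = runner_gender)) +
  geom_density(alpha = 0.2)

# Draw both plots in one figure, stacked vertically
grid.arrange(p1, p2, ncol = 1)
```

Setting `ncol = 2` instead would place the two plots side by side.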

In [None]:
#


ggplot(df3, aes(x = finish_minutes, fill = runner_gender)) +
geom_density(alpha=0.2)

ggplot(df3, aes(x = runner_age, fill = runner_gender)) +
geom_density(alpha=0.2)


# Plotted both plots by gender, first is finish_minutes second is runner_age
# Could not work the grid.arrange() function.
# I tried importing the gridExtra library but it said no such thing exists
# I couldn't figure it out
