analytics.Rmd

---
title: "Baseball Analytics"
author: "Ryan Mavilia"
date: "April 1, 2018"
output:
  html_document:
    df_print: paged
  pdf_document:
    fig_height: 10
    fig_width: 12
---


```{r connectdb}
# make sure you write the path to your sqlite path here
db <- DBI::dbConnect(RSQLite::SQLite(), "lahman2016.sqlite")
library(dplyr)
library(tidyr)
library(ggplot2)
```
Here we connect to the baseball database and import libraries which will be needed later on.
``` {sql run_query, connection=db, output.var="payroll_df"}
SELECT MAX(100.0 * team.W / team.G) AS winPercentage, SUM(sal.salary) AS payroll, team.W, team.G, team.yearID, team.teamID
FROM Teams AS team, Salaries AS sal 
INNER JOIN teamsFranchises AS teamfran ON
      team.yearID = sal.yearID 
      AND team.teamID = sal.teamID
GROUP BY sal.yearID, sal.teamID
```
I run an SQL query which will ask the database to return a table with the calculated payroll & win percentages for the different teams on a per year basis.
```{r}
payroll_df %>%
    head()
```


#Payroll distribution

```{r}
payroll_df %>%
  group_by(yearID) %>%
    filter(1990 <= yearID & yearID <= 2014) %>%
    ggplot(mapping=aes(y=payroll, x=factor(yearID))) +
    geom_boxplot() +
    xlab("Year") +
    ylab("Payroll") +
    ggtitle("Payroll vs Year 1990 - 2014") +
    scale_y_continuous(labels = scales::dollar) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

In order to graph the payroll in regards to the year we must first filter out years we don't want which is anything outside of 1990-2014. After that we create a boxplot with payroll as our Y-axis and yearID as our X-axis (using the factor because R reads yearID as a continuous variable). We also label the axes, title, add dolar signs to the payroll labels, and rotate the years to make them easier to read.

As time goes on payrolls for the league have increased in both mean and spread. We can see how large the boxes get towards the the end of the 2000's and that there is an upward trend.  I would say that payroll increased greatly from 1990 to 2007 and then evened out which reflects the projects reflection on how things evened out after 2005 when other teams "caught up". 


```{r fig.height=10, fig.width=12}
payroll_df <- filter(payroll_df, 1990 <= yearID & yearID <= 2014)
plot1 <- payroll_df %>%
  mutate(yearRange = cut(payroll_df$yearID, breaks = 5)) %>%
  group_by(yearRange) %>%
  ggplot(mapping = aes(x=payroll)) +
  geom_histogram(bins = 100) +
  scale_x_continuous(labels = scales::dollar) +
  xlab(label = "Payroll") +
  ylab(label="Count")+
  ggtitle("Mode of Payroll for 5 Year Periods of 1990-2014")
plot1 + facet_wrap(~yearRange)
```

As we can see from these graphs the spread is increasing and the amount being payed annually by teams has a large increase during the 1990's. I used the mean of the payrolls in each range to create the bars. We can see from here that the spread increases dramatically from 1990-2004 and pay increases greatly as well.


##Correlation between payroll and winning percentage

```{r fig.height=10, fig.width=12}
payroll_df$yearRange <- cut(payroll_df$yearID, breaks=5) 

plot1 <- payroll_df %>%
  group_by(teamID, yearRange) %>%
  summarize(m = mean(payroll), n=mean(winPercentage))%>%
    ggplot(mapping=aes(x=m, y=n, col = ifelse(teamID=="OAK", "Oakland", "Other"))) +
      geom_point() + 
      geom_smooth(method=lm) +
     labs(color='Team Colors') +
  theme(axis.text.x = element_text(angle = 20, hjust = 1)) +
    scale_x_continuous(labels = scales::dollar) +
  ylab(label="Win Percentage") +
  xlab(label="Average Payroll") +
    ggtitle("Average Payroll vs Win Percentage")
plot1+facet_wrap(~yearRange, nrow = 3, ncol = 2)
```
I used geom_smooth and scatter plots to show the regression lines of pay vs winningness over the 5 periods between 1990-2014.

The Yankees are the most consistent in terms of spending more and landing near the top of the win percentage axis. As for Oakland their performance varied.
Oakland across the time periods:
90-95' high pay low win
95-00' low pay low win
00-04' low pay very high win
05-09' low pay low win
09-14' low pay high win

#Data transformations

##Standardization across years

```{r fig.height=10, fig.width=12}

payroll_df <- payroll_df %>%
  group_by(yearID) %>%
  mutate(standardized_payroll = (payroll-mean(payroll))/sd(payroll))
payroll_df[,c(5:8)]
```

```{r fig.height=10, fig.width=12}
plot5 <- payroll_df %>%
  group_by(teamID, yearRange) %>%
  summarise(n = mean(standardized_payroll), m = mean(winPercentage)) %>%
    ggplot(aes(x=n, y=m)) +
      geom_point() + 
      geom_text(aes(label=teamID),hjust=0, vjust=0) +
      geom_smooth(method=lm) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1)) +
  xlab(label="Standardized Payroll") +
  ylab(label="Win Percentage") +
    ggtitle("Standardized Payroll vs Win Percentage for 5 Time Periods in 1990-2014")
plot5+facet_wrap(~yearRange)
```
Here I have recomoputed the same plots from problem 4 with the new standardized payroll data.

This shows us which teams were really spending the most vs least and how that affected their wins. For example in 2000-2004 Oakland is spending way less than other teams but still winning a lot. We can see this because Oakland is far to the left denoting a they are to the left of the continuous payroll plot so they are spending the least. It works the same as a line graph where the farther left the lower your number (payroll) is.


```{r}
payroll_df %>%
  ggplot(mapping = aes(x=standardized_payroll, y = winPercentage)) +
  geom_point() +
  geom_smooth(method = lm)+
  xlab(label="Standardized Payroll") +
  ylab(label="Win Percentage") +
    ggtitle("Standardized Payroll vs Win Percentage for All Teams & All Years")
```
Here I created a plot using all of the points and we can see that as spending increases the win percentage increases as well.

```{r}
payroll_df <- payroll_df %>%
  mutate(expected_win_pct = (50 + 2.5 * standardized_payroll)) %>%
  mutate(efficiency = winPercentage - expected_win_pct)

payroll_df
```
I've created the expected win percentage and efficiency calculations based on the formulas given using the dplyr mutate function.

##Spending efficiency

```{r}
payroll_df %>%
  filter(teamID %in% c("OAK", "BOS", "BAL", "NYA", "ATL", "TBA")) %>%
  ggplot(mapping = aes(x=yearID, y=efficiency, color = teamID)) +
  geom_smooth(method=lm) +
  geom_point() +
  labs(color='Team Colors') +
  xlab(label="Year") +
  ylab(label="Efficiency") +
  ggtitle("Efficiency vs Year for Different Teams")
```
Here I've created a plot with the efficiency mapped against the year for the 5 teams mentioned in the documentation as well as the Baltimore Orioles since they are my favorite team. I used geom_smooth as specified and also provided coloring.


The Oakland Athletics did extremely well during the MoneyBall year. This shows just how well their strategy worked for them before others caught up such as Tampa Bay in 2010. I think this would do really well as a 3D graph with another factor added in such as team size, batting average, etc. so that we can see how efficiency interleaves with other factors.