
Added reworked version of Case Study 2 (basketball)

1 parent 0ae922e commit ba60b8e093392d2c05f7bca5422dea32f3ee9517 @garrettgman garrettgman committed Apr 7, 2010
@@ -0,0 +1,28 @@
+# making data set
+
+load("08-lal.rdata")
+head(lal)
+lal$home_team <- substr(lal$.id, 13, 15)
+lal$away_team <- substr(lal$.id, 10, 12)
+lakers <- lal[,c(13, 35, 34, 12, 15, 14, 25, 29, 26, 31, 32, 33)]
+write.table(lakers, file = "lakers.csv", sep = ",")
+
+# running case study
+
+library(lubridate)  # provides ymd(), eminutes(), eseconds()
+
+lakers <- read.csv("lakers.csv")
+
+# separating date and time
+lakers$date <- substr(lakers$time, 1, 10)
+lakers$date <- ymd(lakers$date)
+
+time_minutes <- as.numeric(substr(lakers$time, 15, 16))
+time_secs <- as.numeric(substr(lakers$time, 18, 19))
+game_clock <- eminutes(time_minutes) + eseconds(time_secs)
+lakers$game_time <- eminutes(12) * lakers$period - game_clock
+
+# histogram of plays from start of game
+library(ggplot2)
+qplot(as.integer(game_time), data = lakers, geom = "histogram")
+
+lakers$demo <- ymd("2010-01-01") + lakers$game_time
+qplot(demo, data = lakers, geom = "histogram", binwidth = 120)
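The last two lines of this script turn elapsed game time into date-times by adding it to an arbitrary starting instant, so that the plot gets date-time axis labels. A minimal base R sketch of the same idea follows. The `elapsed_secs` values are made up for illustration, and plain `POSIXct` arithmetic stands in for the lubridate duration objects used in the script.

```r
# Hypothetical elapsed times, in seconds from the start of a game
elapsed_secs <- c(0, 95, 720, 2880)

# An arbitrary origin; only the offset from it matters for the plot
origin <- as.POSIXct("2010-01-01 00:00:00", tz = "UTC")

# Adding a number to a POSIXct value shifts it by that many seconds
demo <- origin + elapsed_secs

# Show only the clock part of each resulting date-time
demo_labels <- format(demo, "%H:%M:%S")
print(demo_labels)
```

Because the origin is midnight, the clock part of each date-time reads directly as time elapsed in the game.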
@@ -0,0 +1,90 @@
+# getting and cleaning data
+
+file.names <- list.files()
+file.names2 <- file.names[substr(file.names, 10, 12) == "LAL" | substr(file.names, 13, 15) == "LAL"]
+
+options(stringsAsFactors = F)
+
+data <- list()
+for (i in seq_along(file.names2)){
+ data[[i]]<- read.csv(file.names2[i])
+ data[[i]]$date <- rep(substr(file.names2[i], 1, 8), nrow(data[[i]]))
+ data[[i]]$home <- rep(substr(file.names2[i], 13, 15), nrow(data[[i]]))
+ data[[i]]$away <- rep(substr(file.names2[i], 10, 12), nrow(data[[i]]))
+}
+
+# stack the per-game data frames into one
+lal <- do.call(rbind, data)
+
+lakers <- lal[,c(33, 19, 16, 12, 11, 14, 13, 24, 28, 25, 30, 31, 32)]
+
+to_expand <- which(nchar(lakers$time) < 5)
+lakers$time[to_expand] <- paste("0", lakers$time[to_expand], sep = "")
+
+frees <- which(lakers$etype == "free throw" & lakers$result == "made")
+lakers$points[frees] <- 1
+lakers$points[is.na(lakers$points)] <- 0
+
+
+write.table(lakers, file = "lakers.csv", sep = ",")
+
+
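+# running case study
+
+library(plyr)       # provides ddply()
+library(lubridate)  # provides ymd(), wday(), eminutes(), eseconds()
+library(ggplot2)    # provides qplot(), ggsave()
+
+lakers <- read.csv("lakers.csv", stringsAsFactors = F)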
+
+# histogram of dates played
+# parsing dates
+lakers$date <- ymd(lakers$date)
+qplot(date, 0, data = lakers, geom = "point", colour = lakers$home == "LAL") + scale_colour_discrete(name = "Venue", labels = c("away game", "home game"))
+ggsave("dates-points.png", width = 8, height = 4)
+
+qplot(wday(date, label = T, abbr = F), data = lakers, geom = "histogram")
+ggsave("weekdays-histogram.png", width = 8, height = 4)
+
+
+# creating time variable with durations
+time_min <- as.numeric(substr(lakers$time, 1 , 2))
+time_sec <- as.numeric(substr(lakers$time, 4 , 5))
+lakers$time <- eminutes(time_min) + eseconds(time_sec)
+lakers$time <- eminutes(12) * lakers$period - lakers$time
+
+# histogram of plays from start of game
+library(ggplot2)
+qplot(as.integer(time), data = lakers, geom = "histogram", binwidth = 60)
+ggsave("play-time-histogram.png", width = 8, height = 4)
+
+lakers$demo <- ymd("2008-01-01") + lakers$time
+qplot(demo, data = lakers, geom = "histogram", binwidth = 60)
+ggsave("play-time-histogram2.png", width = 8, height = 4)
+
+# picking out one game to examine
+game1 <- lakers[lakers$date == ymd("20081028"),]
+
+
+# histogram of waiting times between plays for one game
+attempts <- game1[game1$etype == "shot",]
+attempts$wait <- attempts$time - c(eseconds(0), attempts$time[-nrow(attempts)])
+
+qplot(as.integer(wait), data = attempts, geom = "histogram", binwidth = 2)
+ggsave("wait-histogram.png", width = 8, height = 4)
+
+# histogram of seconds until first score
+scores <- lakers[lakers$etype == "jump ball" | lakers$result == "made",]
+start <- which(scores$time == eseconds(0))
+first <- scores$time[start + 1]
+qplot(as.integer(first), geom = "histogram", binwidth = 2)
+ggsave("first-histogram.png", width = 8, height = 4)
+
+# cumulative score graph
+
+game1_scores <- ddply(game1, "team", transform, score = cumsum(points))
+game1_scores <- game1_scores[game1_scores$team != "OFF",]
+
+qplot(ymd("2008-01-01") + time, score, data = game1_scores, geom = "line", colour = team)
+ggsave("score-comparison.png", width = 8, height = 4)
+
+
+
+
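The script above converts the game clock into time elapsed since tip-off, then derives waiting times and running scores from it. The same steps can be sketched in base R as follows. The `plays` data frame and the `clock_to_elapsed()` helper are hypothetical and exist only for this illustration. Plain seconds replace the lubridate durations, and `ave()` replaces the `ddply()` call.

```r
# Hypothetical play-by-play rows; column names mirror those used in the script
plays <- data.frame(
  period = c(1, 1, 1, 2, 2),
  time   = c("12:00", "11:41", "10:05", "11:30", "00:02"),
  team   = c("OFF", "LAL", "POR", "LAL", "POR"),
  points = c(0, 2, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Convert an "MM:SS" countdown clock to seconds since the start of the game,
# assuming every period lasts period_minutes minutes
clock_to_elapsed <- function(clock, period, period_minutes = 12) {
  mins <- as.numeric(substr(clock, 1, 2))
  secs <- as.numeric(substr(clock, 4, 5))
  period_minutes * 60 * period - (mins * 60 + secs)
}

plays$elapsed <- clock_to_elapsed(plays$time, plays$period)

# Seconds since the previous play; the first wait is measured from tip-off
plays$wait <- diff(c(0, plays$elapsed))

# Running score for each team, in play order
plays$score <- ave(plays$points, plays$team, FUN = cumsum)

print(plays)
```

Like the script, this sketch assumes that every period lasts 12 minutes. Overtime periods are shorter, so their offsets would need separate handling.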
@@ -440,7 +440,7 @@ \section{Daylight Savings Time}
\section{Case Study 1}
-The next two sections will work through some techniques using \pkg{lubridate}. First, we will use \pkg{lubridate} to calculate the dates of holidays. Then we'll use \pkg{lubridate} to explore an example data set (\code{basketball}).
+The next two sections will work through some techniques using \pkg{lubridate}. First, we will use \pkg{lubridate} to calculate the dates of holidays. Then we'll use \pkg{lubridate} to explore an example data set (\code{lakers}).
\subsection{Thanksgiving}
Some holidays, such as Thanksgiving (U.S.) and Memorial Day (U.S.), do not occur on fixed dates. Instead, they are celebrated according to a common rule. For example, Thanksgiving is celebrated on the fourth Thursday of November. To calculate when Thanksgiving will be held in 2010, we can start with the first day of 2010.\\
@@ -488,53 +488,111 @@ \subsection{Memorial Day}
\section{Case Study 2}
-Now let's explore the \code{basketball} data set. The \code{basketball} data set contains play by play statistics of every major league basketball game played in the 2008-2009 season. This data is from \url{http://www.basketballgeek.com/downloads/2009-2010/}. \\
+Now let's explore the \code{lakers} data set. The \code{lakers} data set contains play-by-play statistics for every major league basketball game played by the Los Angeles Lakers during the 2008-2009 season. This data is from \url{http://www.basketballgeek.com/downloads/2008-2009/}. \\
-\code{R> head(basketball)}\\
+\code{R> head(lakers)}\\
-First we'll examine when during the year basketball games are held. We choose to use the \pkg{ggplot2} package to create our graphs. Please see \url{http://had.co.nz/ggplot2/} for more information about \pkg{ggplot2}.\\
+First we'll examine when during the year the Lakers have games. We choose to use the \pkg{ggplot2} package to create our graphs. Please see \url{http://had.co.nz/ggplot2/} for more information about \pkg{ggplot2}. \\
-\code{R> qplot(date, data = basketball, geom = "histogram", binwidth = 86400)}\\
+\code{R> str(lakers$date[1])}\\
+\code{int 20081028}\\
+
+\proglang{R} recognizes the dates in the \code{lakers} data set as integers. So our first task is to parse the dates, or read them into \proglang{R} as date-time objects. We recognize that the dates include the year element first, followed by the month element, and then the day element. Hence, we should use the \code{ymd()} parsing function.\\
+
+\code{R> lakers$date <- ymd(lakers$date)}\\
+\code{R> qplot(date, 0, data = lakers, geom = "point", colour = lakers$home == "LAL") + scale_colour_discrete(name = "Venue", labels = c("away game", "home game"))}\\
\begin{figure}[htpb]
\centering
- \includegraphics[width=.5\textwidth]{game-dates-histogram.png}
- \caption{Number of games played per date}
+ \includegraphics[width=\textwidth]{dates-points.png}
+ \caption{Dates of Lakers games for 2008-2009 season}
\label{fig:games-date}
\end{figure}
-Figure~\ref{fig:games-date} shows that games are played continuously throughout the season with one short break and a few one day breaks, which may be holidays. The histogram also reveals a cyclic pattern to the data. We can investigate this pattern by looking at the weekdays on which each game is played. \\
+Figure~\ref{fig:games-date} shows that games are played continuously throughout the season with a few short breaks. The frequency of games seems lower at the start of the season and games appear to be grouped into clusters of home games and away games. Notice the tick marks on the x axis; the labels and breaks are automatically generated by \code{pretty.date()}, which is in the \pkg{lubridate} package. Next we'll examine how Lakers games are distributed throughout the week.\\
-\code{R> qplot(wday(date), data = basketball, geom = "histogram")}\\
+\code{R> qplot(wday(date, label = T, abbr = F), data = lakers, geom = "histogram")}\\
\begin{figure}[htpb]
- \centering
- \includegraphics[width=.49\textwidth]{game-days-histogram.png}
- \includegraphics[width=.49\textwidth]{weekdays-histogram.png}
+ \centering
+ \includegraphics[width=\textwidth]{weekdays-histogram.png}
\caption{Number of games played per weekday}
\label{fig:games-days}
\end{figure}
-\emph{Note: this is better accomplished with the non-lubridate function qplot(weekdays(date), data = basketball, geom = ``histogram"), shown on right}\\
+The frequency of basketball games appears to vary throughout the week (Figure~\ref{fig:games-days}). Surprisingly, the highest number of games is played on Tuesdays.
+
+Now let's look at the games themselves. In particular, let's look at the distribution of plays throughout the game. The \code{lakers} data set lists the time that appeared on the game clock for each play. These times begin at 12:00 at the beginning of each period and then count down to 00:00, which marks the end of the period. The first two digits refer to the number of minutes left in the period. The second two digits refer to the number of seconds.
+
+The times have not been parsed as date-time data by \proglang{R}, but we can extract the minutes and seconds with simple substring operations using \code{substr()}. This returns each piece as a character string, which we then convert to a number.\\
+
+\code{R> time_min <- as.numeric(substr(lakers$time, 1 , 2))}\\
+\code{R> time_sec <- as.numeric(substr(lakers$time, 4 , 5))}\\
+
+It would be difficult to record the time data as a date-time object because the data is incomplete: minutes and seconds alone are not sufficient to identify a unique date-time. We can instead capture this information as a \emph{duration} object, as defined in Section~\ref{sec:durations}. This allows us to compare different durations directly. It would also allow us to determine exactly when each play occurred by adding the duration to the \emph{instant} the game began. (Unfortunately, the starting time for each game is not available in the data set.) We can now use our time information to create a \emph{duration} object whose minutes and seconds are equal to the original time data. We can then subtract this from a duration of 12, 24, 36, or 48 minutes (depending on the period of play) to create a new duration that records exactly how far into the game each play occurred.\\
+
+\code{R> lakers$time <- eminutes(time_min) + eseconds(time_sec)}\\
+\code{R> lakers$time <- eminutes(12) * lakers$period - lakers$time}\\
+
+Unfortunately, \pkg{ggplot2} does not support plotting durations, or difftimes, the object class used by durations. To plot our data, we can extract the integer value of each duration, which equals the number of seconds it spans.\\
+
+\code{R> qplot(as.integer(time), data = lakers, geom = "histogram", binwidth = 60)}\\
+
+\begin{figure}[htpb]
+ \centering
+ \includegraphics[width=\textwidth]{play-time-histogram.png}
+ \caption{Distribution of plays within game}
+ \label{fig:plays}
+\end{figure}
+
+Alternatively, we can create date-times, which \pkg{ggplot2} does support, by adding each of our durations to the same starting instant. This creates a plot whose tick marks are determined by \code{pretty.date()}. This helper function recognizes the most intuitive binning and labeling of date-time data, which further enhances our graph.\\
+
+\code{R> lakers$demo <- ymd("2008-01-01") + lakers$time}\\
+\code{R> qplot(demo, data = lakers, geom = "histogram", binwidth = 60)}\\
+
+\begin{figure}[htpb]
+ \centering
+ \includegraphics[width=\textwidth]{play-time-histogram2.png}
+ \caption{Distribution of plays within game, plotted as date-times}
+ \label{fig:plays-datetime}
+\end{figure}
+
+
+We see that the number of plays peaks within each of the four periods and then plummets at the beginning of the next period (Figure~\ref{fig:plays}). Observations that occur after 48 minutes suggest games that were decided in overtime.
+
+Now let's look more closely at just one basketball game: the first game of the season. This game was played on October 28, 2008. For this game, we can easily compute the amount of time that elapsed between successive shot attempts.\\
+
+\code{R> game1 <- lakers[lakers$date == ymd("20081028"),]}\\
+\code{R> attempts <- game1[game1$etype == "shot",]}\\
+
+The waiting time between shots is just the time span between one shot attempt and the next. Since we've recorded the time of each shot attempt as a duration (above), we can find each waiting time by subtracting one duration from the other. This automatically creates a new duration whose length is equal to the difference between the two original durations.\\
-The frequency of basketball games appears to vary throughout the week, figure~\ref{fig:games-days}. Surprisingly, the most games are played on Fridays and on Wednesdays.\\
+\code{R> attempts$wait <- attempts$time - c(eseconds(0), attempts$time[-nrow(attempts)])}\\
+\code{R> qplot(as.integer(wait), data = attempts, geom = "histogram", binwidth = 2)}\\
-Now let's look at the games themselves. In particular, let's look at the time until the first score is made.
+\begin{figure}[htpb]
+ \centering
+ \includegraphics[width=\textwidth]{wait-histogram.png}
+ \caption{Distribution of wait times between shot attempts}
+ \label{fig:waits}
+\end{figure}
-\emph{Note this required manipulating the data set so much with plyr, that it would appear rather complicated if I listed all of the script here.}
+We plot this information in Figure~\ref{fig:waits}. We see that 30 seconds rarely go by without at least one shot attempt, but on occasion up to 60 seconds will pass without an attempt.
+We can also examine changes in the score throughout the game. This reveals that the first game of the season was uneventful: the Lakers maintained a lead for the entire game (Figure~\ref{fig:scores}). Note that the necessary calculations are made simpler by the \code{ddply()} function from the \pkg{plyr} package, which \pkg{lubridate} automatically loads. For more information about \pkg{plyr}, see \url{http://had.co.nz/plyr/}.\\
-\code{R> qplot(time, data = first_play, geom = "histogram", binwidth = 2)}\\
+\code{R> game1_scores <- ddply(game1, "team", transform, score = cumsum(points))}\\
+\code{R> game1_scores <- game1_scores[game1_scores$team != "OFF",]}\\
+\code{R> qplot(ymd("2008-01-01") + time, score, data = game1_scores, geom = "line", colour = team)}\\
\begin{figure}[htpb]
\centering
- \includegraphics[width=.5\textwidth]{seconds-til-first-score.png}
- \caption{Seconds until first score of game}
- \label{fig:first-score}
+ \includegraphics[width=\textwidth]{score-comparison.png}
+ \caption{Scores over time for first game of season}
+ \label{fig:scores}
\end{figure}
-We see that the first points of each game are usually made within the first 30 seconds, figure~\ref{fig:first-score}. The longest time until the first score was 50 seconds. Moreover, the distribution of time until the first score is bimodal. Perhaps the first mode shows games where the first team to control the ball scored and the second mode shows games where the first team to control the ball missed and the second team scored.
\section{Conclusion}
Date-times create technical difficulties that other types of data do not. They must be specifically identified as date-time data, which can be difficult due to the overabundance of date-time classes. It can also be difficult to access and manipulate the individual pieces of data contained within a date-time. Math with date-times is often appropriate, but must follow different rules than math with ordinal numbers. Finally, date-related conventions such as daylight savings time and time zones make it difficult to compare and recognize different moments of time.