# Chapter 2 - Discovering patterns in Data

Aim:
Identify factors determining high-pressure game situations and factors affecting scoring probability

## 2.1 - Quantifying associations between variables

Association (or dependence): any statistical relationship between two variables, causal or otherwise

Types of variables:
1. Categorical variables
2. Numerical variables

__Categorical variables__ assign each unit of observation to a particular group or category on the basis of some qualitative property; their possible values are attributes or categories.
E.g.:
- Conference to which a team belongs - East or West
- Result of a shot - made or missed
- Player's role - with 5 possible values: point guard, shooting guard, small forward, power forward and center


__Numerical variables__ take numeric values and can be discrete or continuous: the total number of points made is a __discrete numerical variable__, while the time played in the quarter is a __continuous numerical variable__.

__*Why is the distinction important?*__

Because statistical methods conceived for variables of one kind cannot be appropriately used to analyze variables of the other kind.
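A minimal R sketch of this point, using a small hypothetical data frame (the values and the column names `result`, `points` and `playtime` are assumptions made for illustration): frequencies suit the categorical variable, while a mean is meaningful only for the numerical ones.

```r
# Toy data (hypothetical values, for illustration only)
shots <- data.frame(
  result   = factor(c("made", "missed", "made", "made", "missed")),  # categorical
  points   = c(2L, 0L, 3L, 2L, 0L),                                   # discrete numerical
  playtime = c(11.5, 3.2, 7.8, 9.1, 0.4)                              # continuous numerical
)

# Frequencies are the natural summary for a categorical variable
result_counts <- table(shots$result)

# Means are meaningful for numerical variables
mean_points   <- mean(shots$points)
mean_playtime <- mean(shots$playtime)

# A numerical method applied to a factor does not work:
# mean() returns NA (with a warning, suppressed here)
mean_result <- suppressWarnings(mean(shots$result))
```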

Focusing on **bivariate analysis**, there are 3 common definitions of association:
1. Statistical dependence
2. Mean dependence
3. Correlation

__Statistical dependence__ can be evaluated for any pair of variables: both categorical, one categorical and one numerical, or both numerical.

__Mean dependence__ requires at least one numerical variable

__Correlation analysis__ requires that both variables are numerical


_This book is devoted to the study of correlation, because in basketball analytics the variables of interest are often numerical_

### 2.1.1 - Statistical dependence

Statistical dependence can be studied using a two-way cross-table and investigating the general relationship between the two variables.
This method compares the observed frequencies with the frequencies expected under the independence hypothesis.
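As a sketch of the comparison described above, the expected frequencies under independence can be derived from the row and column totals of a cross-table. The table `obs` below is hypothetical and its labels are assumptions made for illustration.

```r
# Hypothetical 2 x 3 cross-table of observed frequencies
obs <- matrix(c(30, 20, 10,
                20, 40, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(group   = c("A", "B"),
                              outcome = c("x", "y", "z")))

N <- sum(obs)

# Under independence: expected = (row total * column total) / N
expected <- outer(rowSums(obs), colSums(obs)) / N

# Differences between observed and expected frequencies
diffs <- obs - expected
```

The differences in `diffs` always sum to zero, which is why association indexes summarize them after squaring.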

Many association indexes proposed in the literature are based on a summary of the differences between the observed and the expected frequencies, that is, on whether the two differ by more than random chance would explain.

The most famous association index based on the differences between observed and expected frequencies is the Chi-square ($X^2$, sometimes indicated as $\chi^2$).

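Suppose a table of *r* rows x *c* columns, and let $n_{ij}$ and $\hat{n}_{ij}$ (*i* = 1,..., *r* ; *j* = 1,...,*c*) be the observed and expected frequencies, respectively, in a generic cell (*i*, *j*). Under independence, $\hat{n}_{ij}$ is the product of the totals of row *i* and column *j*, divided by the sample size $N$. $X^2$ can be computed as $X^2 = \sum_{i} \sum_{j} \frac {(n_{ij} - \hat{n}_{ij})^2} {\hat{n}_{ij}}$.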
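The formula can be checked numerically. The sketch below uses a hypothetical table `obs` (an assumption for illustration), computes $X^2$ term by term, and compares it with the statistic returned by base R's `chisq.test()`.

```r
# Hypothetical 2 x 3 cross-table of observed frequencies n_ij
obs <- matrix(c(30, 20, 10,
                20, 40, 30),
              nrow = 2, byrow = TRUE)

# Expected frequencies under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# X^2 = sum over all cells of (observed - expected)^2 / expected
X2_manual <- sum((obs - expected)^2 / expected)

# Same statistic from base R (no continuity correction)
X2_builtin <- unname(chisq.test(obs, correct = FALSE)$statistic)
```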

Since $X^2$ measures the intensity of the relationship between the two variables but also depends on the sample size $N$ and on the number of rows $r$ and columns $c$, it has been adjusted in several ways, giving rise to a number of related measures of association, such as:
1. Phi $\phi$ ($\phi = \sqrt{X^2 / N}$, also known as $M_{2}(D)$)
2. Mean Squared Contingency ($\phi^2$)
3. Pearson's contingency coefficient $P$ ($P = \sqrt{\phi^2 / (\phi^2 + 1)}$)
4. Cramer's $V$, also known as the normalized index $C$ ($V = \phi / \sqrt{k - 1}$, where $k = \min(r, c)$)
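The four measures listed above can be derived from $X^2$ in a few lines. This is a minimal sketch: `assoc_measures()` is a hypothetical helper name, and the table `obs` is made up for illustration.

```r
# Hypothetical helper: association measures derived from X^2
assoc_measures <- function(tab) {
  N    <- sum(tab)
  X2   <- unname(chisq.test(tab, correct = FALSE)$statistic)
  k    <- min(dim(tab))                 # k = min(r, c)
  phi2 <- X2 / N                        # mean squared contingency
  c(X2   = X2,
    phi  = sqrt(phi2),                  # Phi
    phi2 = phi2,
    P    = sqrt(phi2 / (phi2 + 1)),     # Pearson's contingency coefficient
    V    = sqrt(phi2 / (k - 1)))        # Cramer's V
}

# Hypothetical 2 x 3 cross-table
obs <- matrix(c(30, 20, 10,
                20, 40, 30),
              nrow = 2, byrow = TRUE)

m <- assoc_measures(obs)
```

Because this table has $k = 2$, Cramer's $V$ coincides with $\phi$ here; the two differ once both dimensions exceed 2.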

NOTE: When there is no association between the two variables, each of the above measures has a value of 0

As the intensity of association increases, the value of each of these measures increases. **Cramer's $V$ is the preferred measure** because it is the only one that equals 1 in the case of a perfect association between the two variables and so can be easily interpreted as a percentage.
In addition, since Cramer's $V$ formula considers the dimensions of the table, it can be used for comparisons among tables of different dimensions (size of the tables doesn't have to be the same).
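The behavior under perfect association can be illustrated with two hypothetical diagonal tables of different sizes. The helper names `x2_stat()` and `measures()` are assumptions introduced only for this sketch.

```r
# Hypothetical helper: Pearson's X^2 for a matrix of counts
x2_stat <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Hypothetical helper: phi, Pearson's P and Cramer's V
measures <- function(tab) {
  phi2 <- x2_stat(tab) / sum(tab)
  k    <- min(dim(tab))
  c(phi = sqrt(phi2),
    P   = sqrt(phi2 / (phi2 + 1)),
    V   = sqrt(phi2 / (k - 1)))
}

# Perfect association: each row category maps to exactly one column category
perfect_2x2 <- diag(c(50, 50))
perfect_3x3 <- diag(c(30, 30, 30))

m2 <- measures(perfect_2x2)
m3 <- measures(perfect_3x3)
```

In both tables $V$ equals 1, whereas $P$ stays below 1 and changes with the table size, and $\phi$ exceeds 1 in the 3 x 3 case.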

A significance test is usually performed on $X^2$ (the *Chi-square test of independence*, based on the $\chi^2$ distribution) to test whether the $X^2$ value can be considered statistically different from zero, which would indicate a significant association between the two variables. The tests of significance for the other association measures derived from $X^2$ lead to the same conclusion as the $\chi^2$ test of independence.
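A sketch of the test on a hypothetical table: the *p*-value is the upper-tail probability of a $\chi^2$ distribution with $(r-1)(c-1)$ degrees of freedom, evaluated at the observed $X^2$. The table `obs` is an assumption made for illustration.

```r
# Hypothetical 2 x 3 cross-table
obs <- matrix(c(30, 20, 10,
                20, 40, 30),
              nrow = 2, byrow = TRUE)

# Chi-square test of independence from base R
test <- chisq.test(obs, correct = FALSE)

# Same p-value computed by hand
df       <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_manual <- pchisq(unname(test$statistic), df = df, lower.tail = FALSE)

# Reject independence at the 5% level when the p-value is below 0.05
reject_independence <- p_manual < 0.05
```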

For example, we may be interested in analyzing whether some game statistics of the Golden State Warriors depend on the opponent team. To do this, we can compute some association measures between the two variables crossed in table `T` below, which reports the number of free throws, missed shots, rebounds and made shots by the Golden State Warriors in the games played against each of the opponent teams shown in the table rows.



library(BasketballAnalyzeR)

In [42]:
#data(package="BasketballAnalyzeR")

In [9]:
PbP <- PbPmanipulation(PbP.BDB)

In [10]:
PbP.GSW <- subset(PbP, team=="GSW")

In [11]:
ev <- c("ejection", "end of period", "jump ball",
        "start of period", "unknown", "violation",
        "timeout", "sub", "foul", "turnover")

In [12]:
ev

In [13]:
event.unsel <- which(PbP.GSW$event_type %in% ev)

In [14]:
PbP.GSW.ev <- PbP.GSW[-event.unsel,]

In [15]:
# PbP.GSW.ev

The cross-table `T` is given by:

In [16]:
attach(PbP.GSW.ev)

In [17]:
T <- table(oppTeam, event_type, exclude=ev)

In [18]:
T

       event_type
oppTeam free throw miss rebound shot
    ATL         33   88      81   84
    BKN         34   80      98   93
    BOS         45   95      90   71
    CHA         26   91      90   80
    CHI         46   80      98   95
    CLE         47   88      95   79
    DAL         74  155     188  188
    DEN         78  172     164  173
    DET         34   75      85   83
    HOU         56  118     119  131
    IND         33   97      90   72
    LAC        127  161     166  176
    LAL        104  190     202  176
    MEM         77  126     128  117
    MIA         48   92      92   79
    MIL         33   70      74   85
    MIN         54  132     142  133
    NOP         85  183     175  180
    NYK         46   78      79   90
    OKC         86  176     179  153
    ORL         27   77      92   99
    PHI         46   76      88   98
    PHX         59  166     178  197
    POR         60  123     116  125
    SAC         76  165     169  159
    SAS         70  

In [19]:
detach(PbP.GSW.ev)
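One possible alternative to the `attach()`/`detach()` pair is `with()`, which evaluates an expression inside a data frame without changing the search path. The sketch below runs on a hypothetical miniature data frame `toy` (not taken from `PbP.BDB`), and `ev_toy` is an assumed stand-in for the `ev` vector.

```r
# Hypothetical miniature play-by-play data
toy <- data.frame(
  oppTeam    = c("ATL", "ATL", "BOS", "BOS", "BOS", "ATL"),
  event_type = c("shot", "foul", "miss", "rebound", "shot", "timeout"),
  stringsAsFactors = FALSE
)

# Events to leave out (stand-in for the `ev` vector)
ev_toy <- c("foul", "timeout")

# Drop the unwanted events, then cross-tabulate inside with()
toy_sel <- toy[!(toy$event_type %in% ev_toy), ]
tab_toy <- with(toy_sel, table(oppTeam, event_type))
```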

Some association measures can be obtained directly with the function `assocstats` in the `vcd` library, as follows:

In [20]:
# install.packages("vcd", repos='http://cran.us.r-project.org')

In [21]:
library(vcd)



In [22]:
assocstats(T)

                    X^2 df P(> X^2)
Likelihood Ratio 115.26 84 0.013396
Pearson          116.25 84 0.011421

Phi-Coefficient   : NA 
Contingency Coeff.: 0.097 
Cramer's V        : 0.056 

As the above results show:
- the association between the two variables is low (Cramer's $V$=~0.06), but significantly different from 0.
- Pearson's $\chi^2$ = 116.25 has *p*-value=~0.011. An association is usually considered significant when the *p*-value is lower than the conventional thresholds 0.05 or 0.10. Ergo, significant.
- The likelihood ratio test, a.k.a. *G*-test, gives the same indication as the Chi-square test of independence (*p*-value=~0.013): significant.

**We may conclude that the number of game events (shots, missed shots, rebounds and free throws) in GSW's games depends on the opponent team, but the association is weak**. Pay attention to the fact that teams do not play against all other teams an equal number of times.
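The *p*-values printed by `assocstats` can be reproduced from the reported statistics and degrees of freedom alone. The sketch below copies the rounded values from the output above, so small rounding differences are to be expected.

```r
# Values copied from the assocstats() output (rounded as printed)
x2_pearson <- 116.25
x2_lr      <- 115.26
df         <- 84
n_cols     <- 4      # free throw, miss, rebound, shot

# Upper-tail probabilities of the chi-square distribution
p_pearson <- pchisq(x2_pearson, df = df, lower.tail = FALSE)
p_lr      <- pchisq(x2_lr,      df = df, lower.tail = FALSE)

# df = (r - 1) * (c - 1), so the number of rows (opponent teams) is
n_rows <- df / (n_cols - 1) + 1

# Decision at the conventional 0.05 level
significant <- c(pearson = p_pearson < 0.05, lr = p_lr < 0.05)
```

Both *p*-values fall below 0.05, and the degrees of freedom imply a table with 29 opponent teams.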


In [23]:
library(dplyr)



In [24]:
# install.packages("lsr", repos='http://cran.us.r-project.org')

In [25]:
library(lsr)


In [26]:
 R.version

               _                           
platform       x86_64-apple-darwin17.0     
arch           x86_64                      
os             darwin17.0                  
system         x86_64, darwin17.0          
status                                     
major          4                           
minor          0.1                         
year           2020                        
month          06                          
day            06                          
svn rev        78648                       
language       R                           
version.string R version 4.0.1 (2020-06-06)
nickname       See Things Now              

In [38]:
# install.packages("tibble", repos='http://cran.us.r-project.org')

In [1]:
# remove.packages("ggplot2")


In [28]:
# install.packages("ggplot2")

In [29]:
library(tibble)

In [30]:
FF <- fourfactors(Tbox, Obox)

In [31]:
attach(Tbox)

In [32]:
attach(FF)

The following object is masked from Tbox:

    Team




In [33]:
X <- data.frame(PTS, P2M, P3M, FTM, REB=OREB+DREB, AST,
                STL, BLK, ORtg, DRtg)

In [34]:
detach(Tbox)

In [35]:
detach(FF)

In [36]:
Playoff <- Tadd$Playoff

In [41]:
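# For each statistic in X: mean by playoff status (N/Y) and eta squared (in %)
eta <- sapply(X, function(Y){
    cm <- round(tapply(Y, Playoff, mean), 1)      # conditional means given Playoff
    eta2 <- etaSquared(aov(Y~Playoff))[1]*100     # share of variance explained by Playoff
    c(cm, round(eta2, 2))
}) %>%
    t() %>%                                       # one row per statistic
    as.data.frame() %>%
    rename(No=N, Yes=Y, eta2=V3) %>%
    tibble::rownames_to_column("rowname") %>%     # keep row names through arrange()
    arrange(-eta2) %>%                            # sort by decreasing eta squared
    tibble::column_to_rownames('rowname')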
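The `eta2` column measures mean dependence: eta squared is the share of the total variability of a numerical variable that is explained by the differences between group means. A minimal base R sketch on hypothetical data (the vectors `y` and `g` are made up and do not come from `Tbox`):

```r
# Hypothetical data: a numerical statistic and a two-level grouping factor
y <- c(100, 104, 98, 110, 112, 108)
g <- factor(c("N", "N", "N", "Y", "Y", "Y"))

# Group sizes and conditional means
n_g    <- as.numeric(table(g))
mean_g <- as.numeric(tapply(y, g, mean))

# Between-group and total sums of squares
ss_between <- sum(n_g * (mean_g - mean(y))^2)
ss_total   <- sum((y - mean(y))^2)

# Eta squared: 0 = no mean dependence, 1 = perfect mean dependence
eta2_manual <- ss_between / ss_total

# Cross-check against the one-way ANOVA decomposition
ss       <- summary(aov(y ~ g))[[1]][["Sum Sq"]]
eta2_aov <- ss[1] / sum(ss)
```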