# Chapter 2 - Discovering patterns in Data

Aim:
Identify the factors that determine high-pressure game situations and the factors that affect scoring probability

----
## 2.1 - Quantifying associations between variables

----
Association, or dependence: any statistical relationship between two variables, causal or otherwise

Types of variables:
1. Categorical variables
2. Numerical variables

__Categorical variables__ assign each unit of observation to a particular group or category on the basis of some qualitative property; possible values for categorical variables can be attributes or categories
E.g.:
- Conference to which a team belongs - East or West
- Result of a shot - made or missed
- Player's role - with 5 possible values: point guard, shooting guard, small forward, power forward and center


__Numerical variables__ can be discrete or continuous: for example, the total number of points made is a __discrete numerical variable__, while the time played in the quarter is a __continuous numerical variable__

__*Why is the distinction important?*__

Because statistical methods conceived for variables of one kind cannot be appropriately used to analyze variables of the other kind
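The distinction can be made concrete in R, where the three kinds of variables map naturally to different types. The snippet below is only a sketch on hypothetical toy values (the vectors `conference`, `points` and `minutes` are made up for illustration and are not taken from the BasketballAnalyzeR datasets).

```r
# Hypothetical toy data for three teams (illustrative values only)
conference <- factor(c("East", "West", "West"))  # categorical variable
points     <- c(102L, 98L, 110L)                 # discrete numerical variable (counts)
minutes    <- c(11.5, 12.0, 9.75)                # continuous numerical variable

# Frequencies are the natural summary of a categorical variable ...
conf_counts <- table(conference)

# ... while means are meaningful only for numerical variables
mean_points <- mean(points)

is.factor(conference)
is.integer(points)
is.double(minutes)
```

Calling `mean()` on the factor `conference` would return `NA` with a warning, which is a small reminder that a method designed for numerical variables does not carry over to categorical ones.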

Focusing on **bivariate analysis**, there are 3 common notions of association:
1. Statistical dependence
2. Mean dependence
3. Correlation

__Statistical dependence__ can be evaluated for any pair of variables: both categorical, one categorical and one numerical, or both numerical.

__Mean dependence__ requires at least one numerical variable

__Correlation analysis__ requires that both variables are numerical


_This book is devoted to the study of correlation, because in basketball analytics the variables of interest are often numerical_

____

### 2.1.1 - Statistical dependence

Statistical dependence can be studied using a two-way cross-table and investigating the general relationship between the two variables.
This method compares the observed frequencies with the frequencies expected under the independence hypothesis.

Many association indexes proposed in the literature are based on a summary of the differences between the observed and the expected frequencies, and assess whether these differences go beyond what random chance would produce.

The most famous association index based on the differences between observed and expected frequencies is the Chi-square ($X^2$, sometimes indicated as $\chi^2$).

Suppose we have a table of *r* rows x *c* columns and let $n_{ij}$ and $\hat{n}_{ij}$ (*i* = 1,..., *r* ; *j* = 1,...,*c*) be the observed and expected frequency, respectively, in a generic cell (*i*, *j*). $X^2$ can be computed as $\sum_{i} \sum_{j} \frac {(n_{ij} - \hat{n}_{ij})^2} {\hat{n}_{ij}}$.
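The formula can be checked by hand on a small table. The following is a minimal sketch on a hypothetical 2 x 3 table of counts (the numbers and labels are invented for illustration). It assumes the usual definition of the expected frequency under independence, $\hat{n}_{ij} = n_{i\cdot} n_{\cdot j} / N$ (row total times column total, divided by the grand total), and cross-checks the result against base R's `chisq.test`.

```r
# Hypothetical 2 x 3 table of counts (rows: two opponents, columns: three event types)
obs <- matrix(c(30, 50, 20,
                20, 40, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(opp = c("A", "B"),
                              event = c("free throw", "miss", "shot")))

N <- sum(obs)

# Expected frequencies under independence: (row total * column total) / N
expected <- outer(rowSums(obs), colSums(obs)) / N

# X^2 = sum over all cells of (observed - expected)^2 / expected
X2 <- sum((obs - expected)^2 / expected)

# Cross-check against base R (no continuity correction)
chk <- chisq.test(obs, correct = FALSE)
all.equal(X2, unname(chk$statistic))
```

Here both rows sum to 100, so each row of `expected` is simply half of the column totals (25, 45, 30).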

Since the above measures the intensity of the relationship between the two variables, but also depends on the sample size $N$ and on the number of rows $r$ and columns $c$, this statistic has been adjusted in several ways, giving rise to a number of related measures of association, such as:
1. Phi $\phi$ ($\phi = \sqrt{X^2/N}$, also known as $M_{2}(D)$)
2. Mean Squared Contingency ($\phi^2$)
3. Pearson's contingency coefficient $P$ ($P = \sqrt{\phi^2 / (\phi^2 + 1)}$)
4. Cramer's *V*, also known as the normalized index $C$ ($V = \phi / \sqrt{k - 1}$, where $k = \min(r, c)$)

NOTE: When there is no association between the two variables, each of these measures has a value of 0

As the intensity of association increases, the value of each of these measures increases. **Cramer's $V$ is the preferred measure** because it is the only one that equals 1 in the case of a perfect association between the two variables and so can be easily interpreted as a percentage.
In addition, since Cramer's $V$ formula considers the dimensions of the table, it can be used for comparisons among tables of different dimensions (size of the tables doesn't have to be the same).
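The behaviour of these measures at the two extremes can be illustrated with a short sketch. The helper `assoc_measures` below is a hypothetical function written for this note (it is not part of `vcd`), and both tables are invented: one with perfect association, one with exact independence.

```r
# Hypothetical helper (not part of vcd): association measures derived from X^2
assoc_measures <- function(tab) {
  N <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / N
  X2 <- sum((tab - expected)^2 / expected)
  k <- min(dim(tab))
  phi <- sqrt(X2 / N)
  c(X2 = X2,
    phi = phi,
    phi2 = phi^2,
    P = sqrt(phi^2 / (phi^2 + 1)),
    V = phi / sqrt(k - 1))
}

# Perfect association in a 3 x 3 table: each row category maps to one column category
perfect <- diag(c(10, 20, 30))
m_perfect <- assoc_measures(perfect)

# Exact independence: every row has the same column proportions
indep <- outer(c(10, 20, 30), c(1, 2, 3))
m_indep <- assoc_measures(indep)

round(rbind(perfect = m_perfect, independent = m_indep), 3)
```

In the perfect-association table $\phi$ exceeds 1 and $P$ stays below 1, while $V$ equals 1; in the independent table all measures are 0.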

A significance test is usually performed on $X^2$ (called the *Chi-square test of independence*, based on the $\chi^2$ distribution) in order to test whether the $X^2$ value can be considered statistically different from zero, indicating that there exists a significant association between the two variables. For the other association measures derived from $X^2$, the tests of significance lead to the same conclusion as the $\chi^2$ test of independence.

For example, we may be interested in analyzing whether some game statistics of the Golden State Warriors depend on the opponent team they play. To do this, we can compute some association measures between the two variables crossed in table `T` below, which reports the number of free throws, missed shots, rebounds and shots by the Golden State Warriors in the matches played against each of the opponent teams shown in the table rows



In [44]:
library(BasketballAnalyzeR)

In [45]:
#data(package="BasketballAnalyzeR")

In [46]:
PbP <- PbPmanipulation(PbP.BDB)

In [47]:
PbP.GSW <- subset(PbP, team=="GSW")

In [48]:
ev <- c("ejection", "end of period", "jump ball",
        "start of period", "unknown", "violation",
        "timeout", "sub", "foul", "turnover")

In [49]:
ev

In [50]:
event.unsel <- which(PbP.GSW$event_type %in% ev)

In [51]:
PbP.GSW.ev <- PbP.GSW[-event.unsel,]

In [52]:
# PbP.GSW.ev

The cross-table `T` is given by:

In [53]:
attach(PbP.GSW.ev)

In [54]:
T <- table(oppTeam, event_type, exclude=ev)

In [55]:
T

       event_type
oppTeam free throw miss rebound shot
    ATL         33   88      81   84
    BKN         34   80      98   93
    BOS         45   95      90   71
    CHA         26   91      90   80
    CHI         46   80      98   95
    CLE         47   88      95   79
    DAL         74  155     188  188
    DEN         78  172     164  173
    DET         34   75      85   83
    HOU         56  118     119  131
    IND         33   97      90   72
    LAC        127  161     166  176
    LAL        104  190     202  176
    MEM         77  126     128  117
    MIA         48   92      92   79
    MIL         33   70      74   85
    MIN         54  132     142  133
    NOP         85  183     175  180
    NYK         46   78      79   90
    OKC         86  176     179  153
    ORL         27   77      92   99
    PHI         46   76      88   98
    PHX         59  166     178  197
    POR         60  123     116  125
    SAC         76  165     169  159
    SAS         70  

In [56]:
detach(PbP.GSW.ev)

Some association measures can be obtained directly by resorting to the function `assocstats` in the library `vcd`, as follows:

In [57]:
# install.packages("vcd", repos='http://cran.us.r-project.org')

In [58]:
library(vcd)

In [59]:
assocstats(T)

                    X^2 df P(> X^2)
Likelihood Ratio 115.26 84 0.013396
Pearson          116.25 84 0.011421

Phi-Coefficient   : NA 
Contingency Coeff.: 0.097 
Cramer's V        : 0.056 

As the above results show:
- the association between the two variables is low (Cramer's $V \approx 0.06$), but significantly different from 0.
- Pearson's $X^2 = 116.25$ has *p*-value $\approx 0.011$. An association is usually considered significant when the *p*-value is lower than a conventional threshold such as 0.05 or 0.10. Since 0.011 is below both, the association is significant.
- The likelihood ratio test, a.k.a. *G*-test, gives here the same indication as the Chi-square test of independence (*p*-value $\approx 0.013$): the association is significant.

**We may conclude that the number of game events (shots, missed shots, rebounds and free throws) in GSW's play depends on the opponent team, but that the association is weak**. Pay attention to the fact that teams do not play against all other teams an equal number of times
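The figures printed by `assocstats` can be cross-checked with base R. The sketch below copies the rounded values from the output above, so the results agree with the printed ones only up to rounding. It assumes the table has 4 columns (the four event types shown), so that the 84 degrees of freedom, $(r-1)(c-1)$, imply 29 opponent rows.

```r
# Values copied from the assocstats output above (rounded as printed)
X2 <- 116.25   # Pearson statistic
df <- 84       # degrees of freedom
P  <- 0.097    # contingency coefficient

# df = (r - 1) * (c - 1) with c = 4 event types implies r = 29 opponents
n_cols <- 4
n_rows <- df / (n_cols - 1) + 1

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(X2, df = df, lower.tail = FALSE)

# Invert P = sqrt(phi^2 / (phi^2 + 1)) to recover phi, then compute Cramer's V
phi <- sqrt(P^2 / (1 - P^2))
V <- phi / sqrt(min(n_rows, n_cols) - 1)

c(p_value = p_value, V = V)
```

The recomputed *p*-value is below 0.05 and the recomputed $V$ rounds to 0.056, consistent with a weak but significant association.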


____

### 2.1.2 - Mean dependence

Mean dependence allows us to examine, e.g., whether the average number of points scored by the NBA teams differs between the East and West Conferences, among the different Divisions, or between teams that did and did not qualify for the Playoffs, or to study whether the average number of fouls (or assists, rebounds, ...) of one team differs across the quarters.

In those situations, we want to analyze if and how the average of a numerical variable (e.g., the points made) varies across the classes defined by another variable, which can be categorical (e.g., Playoff, with values Yes or No)

In EDA, a variable *Y* is said to be **mean independent** from another variable *X* if and only if the conditional means of *Y* (i.e., the means computed within each group or class defined by *X*) are all equal and, consequently, equal to the unconditional mean of *Y* (computed over all the observation units, without considering their classification according to *X*).
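As a small illustration of the definition, conditional and unconditional means can be compared with `tapply`. This is only a sketch on hypothetical toy data (six invented teams); the names `pts` and `playoff` are made up for the example.

```r
# Hypothetical toy data: points scored by six teams and their Playoff status
pts     <- c(100, 104, 108, 96, 100, 92)
playoff <- factor(c("Yes", "Yes", "Yes", "No", "No", "No"))

cond_means  <- tapply(pts, playoff, mean)  # conditional means of Y within each class of X
uncond_mean <- mean(pts)                   # unconditional mean of Y

cond_means
uncond_mean

# Y is mean independent from X only if all conditional means coincide
mean_independent <- all(abs(cond_means - uncond_mean) < 1e-12)
mean_independent
```

In this toy example the conditional means (104 for Yes, 96 for No) differ from the overall mean of 100, so `pts` is not mean independent from `playoff`.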

----
Note on ***Mean Independence***:

Given two real random variables $X$ and $Y$, we say that:

1. $X$ and $Y$ are independent if the events $\{X \le x\}$ and $\{Y \le y\}$ are independent for any $x, y$,
2. $X$ is **mean independent** from $Y$ if its conditional mean $\mathbb{E}(X \mid Y=y)$ equals its (unconditional) mean $\mathbb{E}(X)$ for all $y$ such that the probability that $Y=y$ is not zero,
3. $X$ and $Y$ are uncorrelated if $\mathbb{E}(XY)=\mathbb{E}(X)\mathbb{E}(Y)$.

Assuming the necessary integrability hypotheses, we have the implications 1. ⟹ 2. ⟹ 3.
The second implication follows from the law of iterated expectations:
$\mathbb{E}(XY)=\mathbb{E}(\mathbb{E}(XY \mid Y))=\mathbb{E}(\mathbb{E}(X \mid Y)\,Y)=\mathbb{E}(\mathbb{E}(X)\,Y)=\mathbb{E}(X)\mathbb{E}(Y)$
____

A well-known index able to measure the level of mean dependence of $Y$ with respect to $X$ is **Pearson's correlation ratio $\eta^2_{Y/X}$**, which is the ratio of the between deviance over the total deviance $(BD/TD)$ (based on the Total Deviance decomposition explained in the next notebook).

Pearson's correlation ratio $\eta^2_{Y/X}$ ranges from 0 (when the conditional means of *Y* are all equal and *Y* is mean independent from *X*) to 1 (when *Y* is perfectly mean dependent on *X*). The ratio finds a very useful application in cluster analysis, where it helps to decide the number of clusters to retain in the solution of a $k$-means clustering.

To investigate the mean dependence of some game variables on Playoff qualification, we compute the conditional means of each variable, that is, averaging over teams qualified *(Playoff=Yes)* and not qualified *(Playoff=No)* separately, and the values of Pearson's correlation ratio $\eta^2$, in %.
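The ratio $BD/TD$ can be computed by hand and compared with the sums of squares of a one-way ANOVA. The sketch below uses hypothetical toy data and assumes the usual definitions: total deviance as the sum of squared deviations of *Y* from its overall mean, and between deviance as the group-size-weighted sum of squared deviations of the group means from the overall mean.

```r
# Hypothetical toy data: points scored by six teams and their Playoff status
pts     <- c(100, 104, 108, 96, 100, 92)
playoff <- factor(c("Yes", "Yes", "Yes", "No", "No", "No"))

grand_mean  <- mean(pts)
group_means <- tapply(pts, playoff, mean)
group_sizes <- tapply(pts, playoff, length)

TD <- sum((pts - grand_mean)^2)                        # total deviance
BD <- sum(group_sizes * (group_means - grand_mean)^2)  # between deviance
eta2 <- BD / TD

# Cross-check with the sums of squares of a one-way ANOVA
ss <- summary(aov(pts ~ playoff))[[1]][["Sum Sq"]]
eta2_aov <- ss[1] / sum(ss)

c(by_hand = eta2, from_aov = eta2_aov)
```

With these toy numbers $TD = 160$ and $BD = 96$, so $\eta^2 = 0.6$, i.e. 60% of the total deviance is explained by the grouping.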

In [60]:
library(dplyr)

In [61]:
# install.packages("lsr", repos='http://cran.us.r-project.org')

In [62]:
library(lsr)

In [63]:
# R.version

In [64]:
# install.packages("tibble", repos='http://cran.us.r-project.org')

In [65]:
# remove.packages("ggplot2")

In [66]:
# install.packages("ggplot2")

In [67]:
library(tibble)

In [68]:
FF <- fourfactors(Tbox, Obox)

In [69]:
attach(Tbox)

In [70]:
attach(FF)

The following object is masked from Tbox:

    Team




In [71]:
X <- data.frame(PTS, P2M, P3M, FTM, REB=OREB+DREB, AST,
                STL, BLK, ORtg, DRtg)

In [72]:
detach(Tbox)

In [73]:
detach(FF)

In [74]:
Playoff <- Tadd$Playoff

In [75]:
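# For each game variable: conditional means by Playoff status and eta squared (in %)
eta <- sapply(X, function(Y){
    cm <- round(tapply(Y, Playoff, mean), 1)    # conditional means (Playoff = N / Y)
    eta2 <- etaSquared(aov(Y~Playoff))[1]*100   # Pearson's correlation ratio, in %
    c(cm, round(eta2, 2))
}) %>%
    t() %>%                                     # one row per game variable
    as.data.frame() %>%
    rename(No=N, Yes=Y, eta2=V3) %>%
    tibble::rownames_to_column("rowname") %>%
    arrange(-eta2) %>%                          # sort by decreasing eta squared
    tibble::column_to_rownames('rowname')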

eta

| Variable | No | Yes | eta2 (%) |
|----------|-------:|-------:|------:|
| DRtg | 107.9 | 104.6 | 42.53 |
| ORtg | 104.0 | 108.1 | 40.25 |
| STL | 601.9 | 659.6 | 28.77 |
| PTS | 8576.0 | 8844.8 | 19.28 |
| BLK | 365.6 | 420.4 | 18.12 |
| FTM | 1328.0 | 1394.4 | 5.58 |
| P2M | 2353.7 | 2417.2 | 3.28 |
| AST | 1875.5 | 1931.6 | 3.17 |
| P3M | 846.9 | 871.9 | 1.07 |
| REB | 3558.1 | 3577.5 | 0.49 |


^ Game variables' averages (conditional to Playoff) and values of $\eta^2$ of mean dependence of game variables on Playoff qualification

All the conditional means differ between qualified and non-qualified teams. Ergo, we can conclude that all these game variables are, to some extent, dependent on the Playoff qualification.

However, as the values of $\eta^2$ in the table above show, in several cases the degree of mean dependence is very low.
For example, the number of rebounds and assists or the number of shots made (2-pt, 3-pt and free throws) are substantially not dependent on the Playoff qualification (they all have $\eta^2$ lower than 6%).

On the contrary, Defensive and Offensive Ratings are moderately to highly dependent on Playoff ($\eta^2 = 42.53\%$ and $40.25\%$, respectively).

This shows that it is the game as a whole that counts, not the game variables taken individually. Single variables don't tell us the whole story.

____
### 2.1.3 - Correlation

Correlation is a specific kind of statistical association which refers to the **linear relationship between two numerical variables**. When numerical variables are available, measuring the degree of association by using statistical dependence or mean dependence methods degrades both variables (in the case of statistical dependence) or one of them (in case of mean dependence) to fill the role of categorical variables.

Instead, correlation analysis allows the optimal use of the information available in the **numerical variables, which prevail over categorical ones**.

In more detail, correlation analysis is based on concordance indices that are **positive** when the highest values of one variable are associated with the highest values of the other variable, and **negative** when the highest values of one variable are associated with the lowest values of the other variable.

The most widespread concordance index is **Pearson's correlation coefficient**, which is designed to measure both the intensity and the direction of a linear relationship between the two variables. Since neither variable is treated as depending on the other, we can think of an interdependence between the variables. Other concordance indexes measuring nonlinear association between variables have been proposed, for example the well-known Kendall's $\tau$.

Given two variables $X$ and $Y$, the value of Pearson's correlation coefficient $\rho$ between $X$ and $Y$ ranges from $-1$ to $1$, with the extremes meaning perfect (negative or positive, respectively) correlation and values near 0 denoting absence of linear correlation (not necessarily absence of any kind of association)
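A minimal sketch of how these indices can be obtained in base R, on two hypothetical toy vectors: Pearson's coefficient is the covariance divided by the product of the standard deviations, and Kendall's $\tau$ is available through the `method` argument of `cor`.

```r
# Hypothetical toy vectors (illustrative values only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson: covariance divided by the product of the standard deviations
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)                     # method = "pearson" is the default

# Kendall's tau, based on concordant and discordant pairs
tau <- cor(x, y, method = "kendall")

c(pearson_by_hand = r_manual, pearson = r_builtin, kendall = tau)
```

For these vectors 8 of the 10 pairs are concordant and 2 are discordant, giving $\tau = (8 - 2)/10 = 0.6$, while Pearson's coefficient is 0.8.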

![Pearson correlation coefficient](../static/2.1.3.correlation-coefficient.png)
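That a correlation near 0 does not rule out other kinds of association can be seen on a deliberately artificial example, where `y` is an exact function of `x` and yet the linear correlation vanishes. The vectors are hypothetical and chosen only to make the point.

```r
# Hypothetical example: y is completely determined by x, but not linearly
x <- -3:3
y <- x^2

rho <- cor(x, y)
rho
```

Because `x` is symmetric around 0, positive and negative contributions to the covariance cancel out and `rho` is 0, even though knowing `x` determines `y` exactly.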

The following code computes the value of $\rho$ between the number of assists (AST) and turnovers (TOV) per minute played for players who have played at least 500 minutes

In [76]:
data <- subset(Pbox, MIN>=500)

In [77]:
attach(data)

In [78]:
X <- data.frame(AST, TOV)/MIN

In [79]:
detach(data)

In [80]:
cor(X$AST, X$TOV)

^ The Pearson correlation coefficient equals $0.687$: as expected, there exists a positive linear relationship between assists and turnovers and the intensity of the association is rather strong

In [81]:
cor(rank(X$AST), rank(X$TOV))

^ Pearson correlation between ranking based on assists and turnovers -- equivalent to computing Spearman's correlation coefficient

**Spearman's correlation coefficient** is one of the most common rank-correlation measures and ranges from $-1$ to $1$: it equals **1** when the players' positions are identical in the two rankings (perfect rank-agreement) and **-1** when one ranking is the reverse of the other (perfect disagreement). Values close to **0** suggest no association between rankings, and increasing values imply increasing agreement between rankings

In [82]:
cor(X$AST, X$TOV, method="spearman")

^ value of $0.668$ denotes a positive and strong association between the two rankings: players in top positions in the ranking of assists tend to rank high on turnovers
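The difference between the two coefficients is easiest to see on a relationship that is monotone but not linear. In the hypothetical sketch below the two rankings agree perfectly, so Spearman's coefficient reaches 1 while Pearson's does not.

```r
# Hypothetical example: strictly increasing but strongly nonlinear relationship
x <- 1:10
y <- exp(x)

pearson  <- cor(x, y)                       # linear association
spearman <- cor(x, y, method = "spearman")  # rank association
by_ranks <- cor(rank(x), rank(y))           # Pearson correlation computed on the ranks

c(pearson = pearson, spearman = spearman, by_ranks = by_ranks)
```

The last two values coincide, which is the same equivalence used above with `rank(X$AST)` and `rank(X$TOV)`.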

**If we have more than two variables, we can compute all the Pearson correlation coefficients between each pair of variables and collect them in a matrix, called the *correlation matrix***. It is a square matrix (number of rows equal to the number of columns) with dimension given by the number of analyzed variables.

In [83]:
cor(X) # correlation matrix of AST and TOV (per minute played)

|     | AST | TOV |
|-----|----------:|----------:|
| AST | 1.0 | 0.6873883 |
| TOV | 0.6873883 | 1.0 |
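With more than two variables the same call returns a larger matrix. The sketch below uses simulated, hypothetical per-minute statistics (random numbers, not real player data) only to show the structural properties of a correlation matrix: it is square, symmetric, and has ones on the diagonal.

```r
set.seed(1)

# Hypothetical per-minute statistics for 50 simulated players (random values)
stats <- data.frame(AST = runif(50), TOV = runif(50), STL = runif(50))

R <- cor(stats)   # 3 x 3 matrix of pairwise Pearson correlations

dim(R)            # number of rows and columns equals the number of variables
isSymmetric(R)    # cor(X, Y) is the same as cor(Y, X)
diag(R)           # each variable is perfectly correlated with itself
```

Because the matrix is symmetric, only the entries above (or below) the diagonal carry distinct information.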
