# Multivariate Statistical Tests

## Hotelling's T² Test

In [1]:
!pip install -q -U watermark

In [2]:
%reload_ext watermark
%watermark -a "Zelly Irigon"

Author: Zelly Irigon



In [4]:
# Imports
import scipy
import statsmodels
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA
from scipy.stats import f
from numpy.linalg import inv, det

### Multivariate Statistical Tests
Hypothesis tests for multivariate means determine whether the means of two or more groups differ significantly when several variables are considered at the same time.

Below are some of the most common tests in this category.

### Hotelling's T² Test
**Definition**

Hotelling's T² Test, named after Harold Hotelling, is the multivariate extension of Student's two-sample t-test. Where the t-test compares the means of two groups on a single dependent variable, Hotelling's T² Test compares two groups on two or more dependent variables at once. It requires that the samples are independent and that the data follow a multivariate normal distribution.

**Comparison of Multivariate Means**

The test is used to determine if there is a significant difference between the means of two groups across multiple dependent variables simultaneously.

**Hypotheses** 

* Null Hypothesis (H0): The multivariate means of the two groups are equal.

* Alternative Hypothesis (H1): The multivariate means of the two groups are different.
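
In symbols, writing $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ for the population mean vectors of the two groups (notation introduced here for illustration, with one entry per dependent variable), the hypotheses can be stated as:

$$
H_0 : \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2
\qquad \text{versus} \qquad
H_1 : \boldsymbol{\mu}_1 \neq \boldsymbol{\mu}_2
$$

The vectors are unequal as soon as they differ in at least one component, so H1 does not say which variable is responsible for the difference.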

**Data Requirements**

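The test assumes that the data from each group are independent random samples from a multivariate normal distribution. The pooled-covariance form of the test used here additionally assumes that the two groups share the same population covariance matrix.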
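
A rough, informal screen of these requirements might look like the sketch below. It is only an assumption-laden illustration: the samples `sample_a` and `sample_b` are hypothetical, the Shapiro-Wilk test is applied to each variable separately (marginal normality is necessary for multivariate normality but does not guarantee it), and the covariance matrices are simply printed side by side for a visual comparison rather than formally tested.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Hypothetical samples: 50 observations (rows) of 2 variables (columns) per group.
sample_a = rng.multivariate_normal([170, 60], [[10, 2], [2, 5]], size=50)
sample_b = rng.multivariate_normal([175, 65], [[10, 2], [2, 5]], size=50)

pvals = []
for name, sample in [("A", sample_a), ("B", sample_b)]:
    # Shapiro-Wilk on each column: a screen of the marginal distributions only.
    for j in range(sample.shape[1]):
        stat, p = shapiro(sample[:, j])
        pvals.append(p)
        print(f"group {name}, variable {j}: W = {stat:.3f}, p = {p:.3f}")
    # Sample covariance matrix of the group, for an informal comparison.
    print(np.cov(sample, rowvar=False).round(2))
```

Small Shapiro-Wilk p-values, or covariance matrices that look very different between groups, would be a reason to treat the test result with caution.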

**Covariance Matrix**

The analysis involves using the covariance matrix of the dependent variables. The test takes into account not only the variables themselves but also the relationship (covariance) between them.
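
As a small illustration of what the covariance matrix contains, the sketch below computes it for a hypothetical toy data set (the values in `X` are made up for the example) and rescales it into a correlation matrix.

```python
import numpy as np

# Hypothetical toy data: 5 observations (rows) of 2 variables (columns).
X = np.array([
    [170.0, 60.0],
    [175.0, 66.0],
    [168.0, 59.0],
    [180.0, 70.0],
    [172.0, 63.0],
])

# rowvar=False tells NumPy that each column is a variable.
S = np.cov(X, rowvar=False)

# Diagonal entries are the variances; off-diagonal entries are the covariances.
variances = np.diag(S)

# Dividing by the product of standard deviations gives the correlation matrix.
std = np.sqrt(variances)
R = S / np.outer(std, std)

print(S)
print(R)
```

The off-diagonal entries are what distinguish a multivariate test from running one univariate test per variable.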

**Test Statistic**

The T² test statistic is calculated based on the group means, covariance matrices, and sample sizes. This statistic is then converted into an F-statistic to facilitate obtaining the p-value.
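
For reference, the pooled-covariance form of the statistic can be written as follows. The symbols are introduced here for illustration: $\bar{\mathbf{x}}_1, \bar{\mathbf{x}}_2$ are the sample mean vectors, $S_1, S_2$ the sample covariance matrices, $n_1, n_2$ the sample sizes, and $p$ the number of dependent variables.

$$
S_p = \frac{(n_1 - 1)\,S_1 + (n_2 - 1)\,S_2}{n_1 + n_2 - 2},
\qquad
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,
(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)^\top S_p^{-1} (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)
$$

$$
F = \frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\; T^2
\;\sim\; F_{\,p,\; n_1 + n_2 - p - 1}
\quad \text{under } H_0
$$

The p-value is the upper-tail probability of this F distribution at the observed value of $F$.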

**Interpretation**

If the calculated p-value is less than the chosen significance level (usually 0.05), the null hypothesis is rejected, indicating that there is a significant difference between the group means.

**Applications of Hotelling's T² Test**

It is widely used in fields such as biomedical research, social sciences, marketing, and others, where the comparison of multiple variables is crucial.

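It can be used to compare group means between treatment and control groups, between different demographic groups, and in similar designs with independent samples. Pre- and post-treatment measurements taken on the same subjects are paired rather than independent, so they call for the paired form of the test, applied to the within-subject differences.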

Hotelling's T² Test is a powerful tool for multivariate analysis: it lets data scientists compare two groups on several variables at once while accounting for the correlations among those variables.

### Hotelling's T² Test in Practice with Python
**Analysis Objective:**

The study aims to investigate if there are significant differences in physical characteristics (height and weight) between athletes of two different sports. Hotelling's T² Test will be used to determine if the multivariate means of height and weight are significantly different between the two groups of athletes.

**Study Groups:**

* Group 1: Basketball players. They are generally expected to be taller and heavier due to the nature of the sport.
* Group 2: Artistic gymnasts. These athletes tend to be lighter and shorter, characteristics beneficial for their sport.

**Dependent Variables:**

* Height (cm): Measure of the athletes' height.

* Weight (kg): Measure of the athletes' weight.

**Generation of Synthetic Data:**

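Data for basketball players (Group 1) and artistic gymnasts (Group 2) are generated synthetically from multivariate normal distributions that share one covariance matrix. The parameters are illustrative rather than estimates from real athletes: in the code below, Group 1 is centred on (170 cm, 60 kg) and Group 2 on (175 cm, 65 kg), so the simulated gymnasts come out slightly taller and heavier than the simulated basketball players, the opposite of the pattern described above.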

**Application of Hotelling's T² Test:**

Hotelling's T² Test is applied to compare the means of height and weight between the two groups of athletes. This test was chosen for its ability to simultaneously evaluate the difference in the means of two dependent variables (height and weight), considering the correlation between them.

**Importance of the Analysis:**

This study can provide valuable information on how different sports are associated with different physical characteristics. These insights can be useful for coaches and sports scientists in identifying talent and developing sport-specific training programmes. Additionally, it can contribute to understanding how the physical demands of each sport shape the athletes' body characteristics.

**Hypotheses Definition:**

* Null Hypothesis (H0)

The null hypothesis in a statistical test is generally a statement of "no difference" or "no effect". In this study it is:

H0: The population mean vectors of height and weight are equal for basketball players (Group 1) and artistic gymnasts (Group 2). Under the null hypothesis, any difference observed between the sample means of the two groups is attributed to sampling variation.

* Alternative Hypothesis (H1)

The alternative hypothesis is the claim weighed against the null hypothesis; it usually states that there is an effect or difference. In this study it is:

H1: The population mean vectors of height and weight differ between basketball players (Group 1) and artistic gymnasts (Group 2). The data support the alternative hypothesis when the observed differences are too large to be plausibly explained by sampling variation alone.

Applying Hotelling's T² Test makes it possible to evaluate these hypotheses on height and weight simultaneously, taking into account the correlation between the two variables.

In [12]:
# Generating synthetic data
np.random.seed(0)
group1 = np.random.multivariate_normal([170, 60], [[10, 2], [2, 5]], 50)  # height (cm) and weight (kg) for Group 1
group2 = np.random.multivariate_normal([175, 65], [[10, 2], [2, 5]], 50)  # height (cm) and weight (kg) for Group 2

In [6]:
group1[1:5]

array([[165.44087123,  63.32426187],
       [164.90568127,  56.06580982],
       [167.17102989,  58.67510047],
       [170.03684798,  60.91506097]])

In [8]:
group2[1:5]

array([[178.25661423,  68.27224235],
       [177.28749029,  70.07278299],
       [176.78975905,  63.98556345],
       [168.04801617,  65.81424878]])

In [13]:
# axis=0 averages down the rows, giving one mean per column (per variable):
# here, the mean height and the mean weight across all observations in the group.
np.mean(group1, axis=0)

array([169.87756591,  60.18293836])

In [11]:
np.mean(group2, axis = 0)

array([174.93237839,  65.37779828])

The function below implements Hotelling's T² test, which is used to compare the means of two multivariate groups. The function first calculates the means and covariances for each group, then computes the pooled covariance and uses these values to calculate Hotelling's T² statistic. This statistic is then converted into an F-statistic, from which the p-value is derived. The p-value is used to determine if the differences between the group means are statistically significant.

In [14]:
# Function to calculate Hotelling's T² test
def hotelling_t2_test(group1, group2):

    # Calculates the mean of each variable for group 1
    mean1 = np.mean(group1, axis=0)

    # Calculates the mean of each variable for group 2
    mean2 = np.mean(group2, axis=0)

    # Determines the number of observations in each group
    n1, n2 = len(group1), len(group2)

    # Calculates the covariance matrix of group 1
    # (the covariance matrix describes how the variables vary together)
    cov1 = np.cov(group1.T)

    # Calculates the covariance matrix of group 2
    cov2 = np.cov(group2.T)

    # Calculates the pooled covariance of the two groups
    pooled_cov = ((n1 - 1) * cov1 + (n2 - 1) * cov2) / (n1 + n2 - 2)

    # Calculates the difference in means between the two groups
    mean_diff = mean1 - mean2

    # Calculates Hotelling's T² statistic
    t2_stats = n1 * n2 / (n1 + n2) * mean_diff.dot(inv(pooled_cov)).dot(mean_diff)

    # Determines the degrees of freedom for the numerator (number of variables)
    df1 = len(mean1)

    # Determines the degrees of freedom for the denominator
    df2 = n1 + n2 - df1 - 1

    # Converts the T² statistic into the F statistic
    f_stats = t2_stats * (df2 / (n1 + n2 - 2)) / df1

    # Calculates the p-value associated with the F statistic
    p_value = 1 - f.cdf(f_stats, df1, df2)

    # Returns the T² statistic and the p-value
    return t2_stats, p_value
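
One way to sanity-check an implementation like this is the single-variable case: with one dependent variable, T² should equal the square of the pooled two-sample t statistic, and the two p-values should agree. The sketch below is a self-contained illustration, not the notebook's own function. The helper name `t2_two_sample` and the samples `a` and `b` are hypothetical, and it uses the survival function `f.sf` for the upper tail.

```python
import numpy as np
from numpy.linalg import inv
from scipy.stats import f, ttest_ind

def t2_two_sample(x, y):
    """Pooled-covariance Hotelling's T² for two samples (rows = observations)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Treat 1-D input as a single variable stored in one column.
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    n1, n2 = len(x), len(y)
    p = x.shape[1]
    diff = x.mean(axis=0) - y.mean(axis=0)
    # np.atleast_2d keeps the p = 1 case as a 1x1 matrix so inv() still works.
    s1 = np.atleast_2d(np.cov(x, rowvar=False))
    s2 = np.atleast_2d(np.cov(y, rowvar=False))
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    t2 = n1 * n2 / (n1 + n2) * diff @ inv(pooled) @ diff
    df2 = n1 + n2 - p - 1
    f_stat = t2 * df2 / (p * (n1 + n2 - 2))
    # f.sf is the upper-tail probability of the F distribution.
    return t2, f.sf(f_stat, p, df2)

rng = np.random.default_rng(1)
a = rng.normal(170, 3, size=40)  # hypothetical single-variable samples
b = rng.normal(172, 3, size=45)

t2, p_t2 = t2_two_sample(a, b)
t_stat, p_t = ttest_ind(a, b, equal_var=True)  # pooled-variance t-test

print(t2, t_stat ** 2)  # the two statistics should match
print(p_t2, p_t)        # and so should the p-values
```

Agreement in this special case does not prove the multivariate code correct, but a mismatch would point to an error in the pooling or in the degrees of freedom.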

In [15]:
# Performing the test
t2_stats, p_value = hotelling_t2_test(group1, group2)

In [18]:
print('T² Statistics: ', t2_stats)
print('P Value: ', p_value)

T² Statistics:  146.68909049022432
P Value:  1.1102230246251565e-16


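The most important value above is the p-value. It is printed in scientific notation: 1.1102230246251565e-16 means roughly 1.11 × 10⁻¹⁶, an extremely small number. This particular value equals 2⁻⁵³, the gap between 1 and the largest double-precision number below it, so it reflects the floating-point limit of computing 1 - f.cdf(...) rather than the exact tail probability; the true p-value is smaller still. The cell below confirms that the p-value is less than 0.05.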

In [19]:
p_value < 0.05

True
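
When the F statistic is very large, subtracting the CDF from 1 loses precision. SciPy's `f.sf` (the survival function) computes the upper tail directly and is one way around this. The sketch below is only an illustration: it re-enters the T² value and sample sizes reported above as constants instead of reusing the notebook's variables.

```python
import numpy as np
from scipy.stats import f

# Values taken from the test above: T² as printed, p = 2 variables, n1 = n2 = 50.
t2, p, n1, n2 = 146.68909049022432, 2, 50, 50

df2 = n1 + n2 - p - 1
f_stat = t2 * df2 / (p * (n1 + n2 - 2))

p_via_cdf = 1 - f.cdf(f_stat, p, df2)  # limited by rounding of the CDF near 1
p_via_sf = f.sf(f_stat, p, df2)        # upper tail computed directly

print(p_via_cdf)
print(p_via_sf)
print(np.finfo(float).eps / 2)         # 2**-53, the float spacing just below 1
```

Both versions lead to the same decision at the 0.05 level; the difference matters only when the magnitude of a very small p-value is itself reported or compared.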

### Interpretation:

**T² Statistic:**

* The T² statistic measures how far apart the mean vectors of the two groups are, relative to the pooled covariance of the data. Larger values indicate greater separation; whether that separation is statistically significant is judged from the F conversion and its p-value.
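
A complementary way to read T² is as a scaled squared Mahalanobis distance between the two sample mean vectors, computed with the pooled covariance. The sketch below checks this identity on hypothetical data (the arrays `g1` and `g2` are generated only for this example), using `scipy.spatial.distance.mahalanobis`.

```python
import numpy as np
from numpy.linalg import inv
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(2)
cov = [[10, 2], [2, 5]]
g1 = rng.multivariate_normal([170, 60], cov, size=50)  # hypothetical group 1
g2 = rng.multivariate_normal([175, 65], cov, size=50)  # hypothetical group 2

n1, n2 = len(g1), len(g2)
m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
pooled = ((n1 - 1) * np.cov(g1, rowvar=False)
          + (n2 - 1) * np.cov(g2, rowvar=False)) / (n1 + n2 - 2)

# mahalanobis() expects the inverse covariance matrix as its third argument.
d = mahalanobis(m1, m2, inv(pooled))

# T² from the distance, and T² from the quadratic form directly.
t2_from_distance = n1 * n2 / (n1 + n2) * d ** 2
diff = m1 - m2
t2_direct = n1 * n2 / (n1 + n2) * diff @ inv(pooled) @ diff

print(t2_from_distance, t2_direct)
```

Seen this way, T² grows both with the distance between the group centres and with the sample sizes, which is why even a modest separation can be significant in large samples.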

**p-value:**

* The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis (that the group means are equal) is true.
  
* A very small p-value suggests that, if the null hypothesis were true, it would be very unlikely to observe such a large difference or larger by chance.

* A low p-value (less than 0.05) indicates strong evidence against the null hypothesis and in favour of the alternative hypothesis (that the means are different).
  
In summary, if the p-value is small (typically less than 0.05), it suggests that the differences in the means of the groups are statistically significant, leading to the rejection of the null hypothesis.

# End