<center><h2>Hotelling's T²</h2></center>

## One-Sample **Hotelling's T²**

The one-sample **Hotelling's T²** statistic is defined by the formula below. Let’s break it down:

$$
T^2 = n(\bar{X} - \mu)^T S^{-1} (\bar{X} - \mu)
$$

### Explanation of Terms:

1. $n$: The sample size (number of observations).
   
2. $\bar{X}$ : The vector of sample means for each variable. It's a \( p \)-dimensional vector where \( p \) is the number of variables. This is computed by averaging each column (variable) in the dataset.

3. $\mu$: The hypothesized mean vector. This is what you're testing against. It could be a vector of historical means, industry standards, control group means, etc.

4. $S$: The sample covariance matrix. It is a \( p \times p \) matrix that captures the variances of and covariances between the variables in the dataset.

5. $S^{-1}$: The inverse of the sample covariance matrix. It accounts for the relationships between the variables.

6. $\bar{X} - \mu$: This is the difference between the sample mean vector and the hypothesized mean vector. It represents the deviation from the expected values for each variable.

7. $(\bar{X} - \mu)^T$: This is the transpose of the difference vector, turning the column vector into a row vector, so it can be multiplied by the inverse covariance matrix.

### Interpretation:
- Hotelling’s T² is a **multivariate** generalization of the t-test that allows you to test whether a set of sample means (for multiple variables) significantly differs from the hypothesized mean vector \( \mu \).
  
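- The larger the value of \( T^2 \), the farther the sample mean vector lies from the hypothesized mean vector, measured relative to the variances and covariances in \( S \) (a scaled squared Mahalanobis distance).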
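
To make the formula concrete, here is a minimal worked sketch on a tiny, made-up dataset (5 observations, 2 variables). The data and the hypothesized mean vector `mu` are assumptions chosen only so the arithmetic is easy to follow by hand; they do not come from any dataset used later.

```python
import numpy as np

# Hypothetical toy data: 5 observations (rows) of 2 variables (columns)
X = np.array([
    [2.0, 4.0],
    [3.0, 5.0],
    [4.0, 4.0],
    [5.0, 7.0],
    [6.0, 5.0],
])
mu = np.array([3.0, 4.0])  # hypothesized mean vector (assumed for illustration)

n, p = X.shape
x_bar = X.mean(axis=0)            # sample mean vector -> [4., 5.]
S = np.cov(X, rowvar=False)       # sample covariance (n - 1 divisor) -> [[2.5, 1.], [1., 1.5]]
diff = x_bar - mu                 # deviation from the hypothesized means -> [1., 1.]

# T^2 = n * diff^T S^{-1} diff; solving S z = diff avoids forming S^{-1} explicitly
T2 = n * diff @ np.linalg.solve(S, diff)
print("T^2 =", T2)                # 40/11, approximately 3.636
```

By hand: \( S^{-1} = \frac{1}{2.75}\begin{pmatrix} 1.5 & -1 \\ -1 & 2.5 \end{pmatrix} \), so the quadratic form is \( 2 / 2.75 \) and \( T^2 = 5 \times 2 / 2.75 = 40/11 \).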


### Why do we need to perform a one-sample Hotelling's T² test?

The one-sample **Hotelling's T²** test is used in multivariate statistics to test whether the mean vector of a multivariate population differs from a specified vector. It is a multivariate extension of the one-sample t-test, which applies when analyzing a single variable.
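
One quick way to see the "extension of the t-test" claim: with a single variable (\( p = 1 \)), the T² formula collapses to the square of the ordinary one-sample t-statistic. The sketch below checks this on made-up numbers; the sample `x` and the hypothesized mean `mu0` are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical univariate sample and hypothesized mean
x = np.array([5.1, 4.9, 5.6, 5.8, 5.0, 5.3])
mu0 = 5.0

n = x.size
s2 = x.var(ddof=1)                       # sample variance (the 1x1 "covariance matrix")
T2 = n * (x.mean() - mu0) ** 2 / s2      # T^2 formula with p = 1

t_stat = stats.ttest_1samp(x, popmean=mu0).statistic
print(T2, t_stat ** 2)                   # the two values agree up to rounding
```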

Here are the key reasons to perform a **one-sample Hotelling's T²** test:

### 1. **Multivariate Nature of the Data**:
   When you're dealing with **multiple correlated dependent variables**, running a separate t-test for each variable inflates the overall risk of Type I errors (false positives) and ignores how the variables move together. Hotelling's T² analyzes these variables jointly, taking their correlations into account.
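
As a rough illustration of that inflation, the sketch below computes the chance of at least one false positive when each of `k` tests is run at level `alpha`. It assumes the tests are independent, which is a simplification (correlated variables give a different number), and the helper name `familywise_error_rate` is made up for this example.

```python
def familywise_error_rate(alpha, k):
    """Probability of at least one false positive across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 2, 5, 10):
    print(f"{k:2d} separate tests at alpha = 0.05 -> {familywise_error_rate(0.05, k):.3f}")
```

With 5 separate tests the rate is already about 0.226 under this independence assumption, far above the nominal 0.05.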

### 2. **Simultaneous Testing**:
   It tests whether the vector of sample means differs from a known vector of population means (often a vector of zeros). Instead of testing each variable's mean individually, Hotelling's T² **tests them simultaneously**, controlling for relationships between the variables.

### 3. **More Accurate Inference**:
   By considering the covariance among the variables, the test provides a more accurate assessment of whether the mean differences are statistically significant. This is especially useful when the variables are not independent.
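
The effect of the covariance can be seen in a small sketch. It evaluates the quadratic form \( d^T S^{-1} d \) for the same deviation vector `diff` under three hypothetical 2×2 covariance matrices that differ only in their correlation `rho`. All numbers and the helper name `scaled_distance` are assumptions for illustration.

```python
import numpy as np

def scaled_distance(diff, rho):
    """Return diff^T S^{-1} diff for a 2x2 covariance with unit variances and correlation rho."""
    S = np.array([[1.0, rho], [rho, 1.0]])
    return float(diff @ np.linalg.solve(S, diff))

diff = np.array([1.0, 1.0])  # both variables deviate upward by the same amount

for rho in (-0.8, 0.0, 0.8):
    print(f"rho = {rho:+.1f}: {scaled_distance(diff, rho):.3f}")
# rho = -0.8: 10.000
# rho = +0.0: 2.000
# rho = +0.8: 1.111
```

The same deviation counts as much more unusual when the variables are negatively correlated (they rarely move up together) than when they are positively correlated. Separate univariate tests cannot see this difference.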

### 4. **Applications**:
   - **Quality Control**: In industries, when assessing the quality of a product with multiple dimensions (e.g., strength, durability, appearance), Hotelling's T² can check if the overall quality differs from a standard.
   - **Medical Research**: In clinical trials or studies involving multiple health metrics (e.g., blood pressure, cholesterol levels), it tests whether the overall health status differs from a control or norm.
   - **Survey Analysis**: When analyzing survey responses across multiple categories (e.g., satisfaction, comfort, reliability), the test can determine if the responses deviate from an expected value.

In summary, we perform a one-sample Hotelling's T² test when we need to make inferences about the means of multiple variables simultaneously, especially when these variables are correlated. This test provides a multivariate framework for hypothesis testing that accounts for the relationships between variables, making it more robust than running multiple univariate tests.


<center><h2>CODE</h2></center>

## Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets
from scipy.stats import f

### One-Sample **Hotelling's T²**
$$
T^2 = n(\bar{X} - \mu)^T S^{-1} (\bar{X} - \mu)
$$
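
To turn \( T^2 \) into a p-value, the function that follows relies on the standard link to the F-distribution. Under the null hypothesis, assuming independent observations from a multivariate normal population and \( n > p \):

$$
F = \frac{n - p}{p\,(n - 1)}\, T^2 \sim F_{p,\; n - p}
$$

The numerator degrees of freedom are \( p \) and the denominator degrees of freedom are \( n - p \).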

In [2]:

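def hotellings_t_squared_test(X, mu):
    """One-sample Hotelling's T-squared test of H0: population mean vector equals mu."""
    # Accept lists or arrays; rows are observations, columns are variables
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu, dtype=float)

    # Number of observations (n) and number of variables (p)
    n, p = X.shape
    if mu.shape != (p,):
        raise ValueError("mu must contain one hypothesized mean per variable (column) of X.")
    if n <= p:
        raise ValueError("The test requires more observations than variables (n > p).")

    # Calculate sample mean vector (X_bar)
    X_bar = np.mean(X, axis=0)

    # Calculate sample covariance matrix (S)
    S = np.cov(X, rowvar=False)

    # Calculate Hotelling's T-squared statistic.
    # Solving S z = diff is numerically safer than inverting S explicitly.
    diff = X_bar - mu
    T_squared = n * diff @ np.linalg.solve(S, diff)

    # Calculate F-statistic equivalent
    F_stat = (n - p) / (p * (n - 1)) * T_squared

    # Degrees of freedom for F-distribution
    df1 = p  # Degrees of freedom 1
    df2 = n - p  # Degrees of freedom 2

    # p-value: upper-tail probability (sf is more accurate than 1 - cdf for small values)
    p_value = f.sf(F_stat, df1, df2)

    return T_squared, F_stat, p_value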


### Load your data

In [3]:
data = pd.read_excel('Footwear_Quality_Assessment.xlsx')
print("shape of dataframe", data.shape)

dash_line = '-' * 99  # separator line for printed output

print(dash_line)
print(data.columns)

shape of dataframe (25, 6)
---------------------------------------------------------------------------------------------------
Index(['Subject', 'Style', 'Comfort', 'Stability', 'Cushion', 'Durability'], dtype='object')


In [4]:
assumed_mean_vector = [7, 8, 5, 6, 11]  # hypothesized mean vector, assumed based on the given data (one value per column after dropping 'Subject')
result = hotellings_t_squared_test(data.drop(columns=['Subject']).values, assumed_mean_vector)
print("Hotelling's T-squared statistic:", result[0])
print(dash_line)
print("F-statistic:", result[1])
print(dash_line)
print("p-value:",round(result[2], 16))

Hotelling's T-squared statistic: 21.999714884305625
---------------------------------------------------------------------------------------------------
F-statistic: 3.666619147384271
---------------------------------------------------------------------------------------------------
p-value: 0.016201732088495


### Conclusion
We reject the null hypothesis that the population mean vector equals the assumed mean vector, since the p-value (0.016) is less than the significance level alpha (0.05).
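
The same decision can be reached by comparing \( T^2 \) with a critical value obtained from the F-distribution. The sketch below reuses the numbers from the run above (25 observations, 5 variables, the printed statistic) and the significance level 0.05; it assumes the usual multivariate normality conditions behind the F relationship.

```python
from scipy.stats import f

n, p, alpha = 25, 5, 0.05            # sample size, number of variables, significance level
T2_observed = 21.999714884305625     # statistic printed in the output above

# Critical F value, then rescale it to the T^2 scale
F_crit = f.ppf(1 - alpha, p, n - p)
T2_crit = p * (n - 1) / (n - p) * F_crit

print("Critical T^2:", T2_crit)
print("Reject H0:", T2_observed > T2_crit)
```

Rejecting when \( T^2 \) exceeds the critical value is equivalent to rejecting when the p-value is below alpha.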

## Two-Sample **Hotelling's T²**

Two-Sample **Hotelling's T²** Formula:

$$
T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{X}_1 - \bar{X}_2)^T S_W^{-1} (\bar{X}_1 - \bar{X}_2)
$$

- $n_1$ and $n_2$: The sample sizes of the two groups.
- $\bar{X}_1$ and $\bar{X}_2$: The mean vectors of the two groups.
- $S_W$: The pooled within-group covariance matrix, which accounts for the variability within each group.
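
For reference, the pooled covariance matrix is the degrees-of-freedom weighted average of the two group covariance matrices. Here \( S_1 \) and \( S_2 \) are symbols introduced for this note to denote the sample covariance matrices of groups 1 and 2:

$$
S_W = \frac{(n_1 - 1)\, S_1 + (n_2 - 1)\, S_2}{n_1 + n_2 - 2}
$$

Under the null hypothesis of equal mean vectors, and assuming both groups are multivariate normal with a common covariance matrix, the statistic relates to the F-distribution with \( p \) variables as follows:

$$
F = \frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\, T^2 \sim F_{p,\; n_1 + n_2 - p - 1}
$$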




A **two-sample Hotelling's T² test** is used to compare the means of two groups across multiple variables, extending the concept of a two-sample t-test to a multivariate setting. Here’s why you would need to perform this test:

### 1. **Multivariate Comparison**:
   - You have **multiple dependent variables** and want to test whether the mean vectors of these variables are significantly different between two groups. 
   - Example: Comparing two groups of people on several health indicators (e.g., blood pressure, cholesterol levels, glucose levels). A two-sample Hotelling's T² would assess whether there is a statistically significant difference in the overall health profile between the groups.

### 2. **Correlation between Variables**:
   - In many multivariate datasets, the variables are not independent of each other, and there may be **correlations** between the variables.
   - A two-sample Hotelling's T² test takes this correlation into account. Running an individual t-test on each variable ignores the correlations between variables and increases the chance of a Type I error (false positive) across the set of tests.

### 3. **Control Over Multiple Testing**:
   - If you run several individual t-tests (one for each variable), it increases the risk of committing a Type I error (incorrectly rejecting the null hypothesis). The Hotelling’s T² test helps **control the error rate** by considering all variables simultaneously in one test.

### 4. **Test of Overall Mean Difference**:
   - The test is used when you want to determine if there is an **overall difference** in the mean vectors between the two groups. 
   - It answers the question: **Is there a significant difference between the groups across the set of variables together**, rather than looking at each variable in isolation.

### 5. **Applications**:
   - **Medical Research**: Comparing two treatment groups across multiple health indicators.
   - **Quality Control**: Testing whether two batches of products differ significantly in several characteristics (e.g., weight, size, and durability).
   - **Social Sciences**: Comparing the psychological profiles of two different populations (e.g., using multiple personality traits).

### Conclusion:
Performing a two-sample Hotelling's T² test is essential when comparing groups on multiple correlated variables and ensuring that the analysis accounts for the relationships between these variables while controlling for multiple comparisons.
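
As a sanity check on the link to the two-sample t-test: with a single variable (\( p = 1 \)), the two-sample T² equals the square of the pooled-variance t-statistic. The sketch below verifies this on made-up samples `a` and `b`, which are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical univariate samples for two groups
a = np.array([12.1, 11.4, 13.0, 12.7, 11.9])
b = np.array([10.8, 11.2, 12.0, 10.5, 11.6, 11.1])

n1, n2 = a.size, b.size
# Pooled variance: the 1x1 version of the pooled covariance matrix S_W
s2_pooled = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
T2 = (n1 * n2) / (n1 + n2) * (a.mean() - b.mean()) ** 2 / s2_pooled

# Classic two-sample t-test with equal variances assumed
t_stat = stats.ttest_ind(a, b, equal_var=True).statistic
print(T2, t_stat ** 2)  # the two values agree up to rounding
```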


<center><h2>CODE</h2></center>

### Two-Sample **Hotelling's T²**
$$
T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{X}_1 - \bar{X}_2)^T S_W^{-1} (\bar{X}_1 - \bar{X}_2)
$$


In [5]:

def TwoSampleT2Test(X, Y):
    """
    Performs a two-sample Hotelling's T-squared test.

    Parameters:
    X (numpy.ndarray): Matrix of data for group 1, with observations as rows and variables as columns.
    Y (numpy.ndarray): Matrix of data for group 2, with observations as rows and variables as columns.

    Returns:
    tuple: Contains Hotelling's T-squared statistic, F-statistic, and p-value.
    """
    # Accept DataFrames or arrays; rows are observations, columns are variables
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)

    # Number of observations and variables in each sample (n1, p1 for X; n2, p2 for Y)
    n1, p1 = X.shape
    n2, p2 = Y.shape

    # Ensure the number of variables is the same in both samples
    if p1 != p2:
        raise ValueError("Both groups must have the same number of variables.")
    p = p1  # Number of variables (dimensions)

    # Calculate mean vectors for each sample
    X_bar = np.mean(X, axis=0)
    Y_bar = np.mean(Y, axis=0)

    # Calculate pooled covariance matrix (S_pooled)
    S_X = np.cov(X, rowvar=False)
    S_Y = np.cov(Y, rowvar=False)
    S_pooled = ((n1 - 1) * S_X + (n2 - 1) * S_Y) / (n1 + n2 - 2)

    # Calculate the difference in the mean vectors
    mean_diff = X_bar - Y_bar

    # Calculate Hotelling's T-squared statistic.
    # Solving S_pooled z = mean_diff is numerically safer than inverting S_pooled.
    T_squared = (n1 * n2) / (n1 + n2) * mean_diff @ np.linalg.solve(S_pooled, mean_diff)

    # Calculate F-statistic equivalent
    F_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T_squared

    # Degrees of freedom for F-distribution
    df1 = p  # Degrees of freedom 1
    df2 = n1 + n2 - p - 1  # Degrees of freedom 2

    # p-value: upper-tail probability (sf is more accurate than 1 - cdf for small values)
    p_value = f.sf(F_stat, df1, df2)

    return T_squared, F_stat, p_value

### Loading Data

In [6]:
data = pd.read_csv("Real-estate.csv")
data.shape

(414, 8)

##### Splitting the data into two groups as two samples

In [7]:
group1 = data.iloc[:15, 1:]
group2 = data.iloc[15:30, 1:]  # two small samples of 15 rows each (fewer than 30), split by row position for illustration

In [8]:
# Run the two-sample Hotelling's T-squared test
T_squared, F_stat, p_value = TwoSampleT2Test(group1, group2)

print("Hotelling's T-squared statistic:", T_squared)
print(dash_line)
print("F-statistic:", F_stat)
print(dash_line)
print("p-value:", p_value)

Hotelling's T-squared statistic: 15.57818305545573
---------------------------------------------------------------------------------------------------
F-statistic: 1.7485715674491125
---------------------------------------------------------------------------------------------------
p-value: 0.14929486734301478


### Conclusion
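We fail to reject the null hypothesis of equal mean vectors, since the p-value (0.149) is greater than the significance level alpha (0.05). This means the data do not provide sufficient evidence of a difference; it does not prove that the two mean vectors are equal.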