# Factor Analysis Basic Example - MA2003B Multivariate Statistics Course

This notebook demonstrates the fundamental concepts of Factor Analysis (FA) using a simulated three-variable dataset driven by a single latent factor. Factor Analysis models observed variables as linear combinations of underlying latent factors plus unique error terms.
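
For reference, the single-factor model described above can be written out explicitly. The notation ($\lambda_i$ for the loading, $f$ for the latent factor, $\varepsilon_i$ for the unique error with variance $\sigma_i^2$) is introduced here for illustration, and the sketch assumes a unit-variance factor that is uncorrelated with the errors:

$$
x_i = \lambda_i f + \varepsilon_i, \qquad \operatorname{Var}(f) = 1, \quad \operatorname{Var}(\varepsilon_i) = \sigma_i^2, \quad \operatorname{Cov}(f, \varepsilon_i) = 0
$$

$$
\operatorname{Var}(x_i) = \lambda_i^2 + \sigma_i^2, \qquad h_i^2 = \frac{\lambda_i^2}{\lambda_i^2 + \sigma_i^2}, \qquad \psi_i = \frac{\sigma_i^2}{\lambda_i^2 + \sigma_i^2} = 1 - h_i^2
$$

Here $h_i^2$ is the communality (the share of variance due to the common factor) and $\psi_i$ is the uniqueness (the share specific to variable $i$).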

## Learning Objectives:
- Understand the difference between PCA and Factor Analysis
- Interpret factor loadings as correlations between variables and factors
- Distinguish communalities (common variance) from uniqueness (unique variance)
- See how FA focuses on shared variance rather than total variance

**Data**: Simulated dataset of 1000 observations on 3 variables that share one latent factor

**Expected Output**:
- Single factor loading for each variable
- Communalities showing proportion of variance explained by the factor
- Uniqueness showing variable-specific variance

In [46]:
# Import Required Libraries
import numpy as np
from factor_analyzer import FactorAnalyzer

# Set random seed for reproducibility
np.random.seed(42)

In [47]:
# Generate Sample Data
# Create 1000 observations of 3 variables that share a common underlying factor

n_samples = 1000
latent_factor = np.random.randn(n_samples)

# Each variable is influenced by the latent factor plus unique noise
X = np.column_stack(
    [
        0.8 * latent_factor + 0.6 * np.random.randn(n_samples),  # Variable 1
        0.9 * latent_factor + 0.4 * np.random.randn(n_samples),  # Variable 2
        0.7 * latent_factor + 0.7 * np.random.randn(n_samples),  # Variable 3
    ]
)

In [48]:
# Display Dataset
print("Factor Analysis: Basic Single-Factor Model")
print("=" * 50)
print("\nDataset X:")
print(f"Shape: {X.shape} ({X.shape[0]} observations, {X.shape[1]} variables)")
print("\nFirst 5 observations:")
print(X[:5])
print("\nBasic statistics:")
print(f"Means: {X.mean(axis=0)}")
print(f"Stds:  {X.std(axis=0)}")

Factor Analysis: Basic Single-Factor Model

Dataset X:
Shape: (1000, 3) (1000 observations, 3 variables)

First 5 observations:
[[ 1.23698458  0.17697143 -0.98776538]
 [ 0.44416877 -0.18224534 -0.69905452]
 [ 0.55392905  0.26595172  0.1638581 ]
 [ 0.83026182  1.24754226  2.38750226]
 [ 0.23161129 -0.9681839   0.22567982]]

Basic statistics:
Means: [0.05796739 0.01973254 0.00042899]
Stds:  [0.96593337 0.97253566 0.98625049]


In [49]:
# How the data was generated:
# - latent_factor: hidden common factor affecting all variables
# - Each variable = (factor_loading * latent_factor) + unique_noise
# - Factor analysis will try to recover these factor loadings

In [50]:
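# Initialize Factor Analysis
# Note: method="principal" extracts principal components of the standardized data,
# so the loadings reflect total variance. The "minres" and "ml" methods fit the
# common factor model instead.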
fa = FactorAnalyzer(n_factors=1, method="principal", rotation=None)

In [51]:
# Fit Factor Analysis
fa.fit(X)



FactorAnalyzer(n_factors=1, rotation=None, method='principal', use_smc=True,
               is_corr_matrix=False, bounds=(0.005, ...), impute='median',
               svd_method='randomized', rotation_kwargs={})


In [52]:
# Extract Results
loadings = fa.loadings_
communalities = fa.get_communalities()
uniqueness = fa.get_uniquenesses()

In [53]:
# Display Factor Loadings
print("\nFactor Loadings:")
for i, loading in enumerate(loadings.flatten(), 1):
    print(f"Variable {i}: {loading:.3f}")


Factor Loadings:
Variable 1: 0.862
Variable 2: 0.912
Variable 3: 0.819


In [54]:
# Display Communalities
print("\nCommunalities (variance explained by common factor):")
for i, comm in enumerate(communalities, 1):
    print(f"Variable {i}: {comm:.3f}")


Communalities (variance explained by common factor):
Variable 1: 0.743
Variable 2: 0.832
Variable 3: 0.671


In [55]:
# Display Uniqueness
print("\nUniqueness (variable-specific variance):")
for i, uniq in enumerate(uniqueness, 1):
    print(f"Variable {i}: {uniq:.3f}")


Uniqueness (variable-specific variance):
Variable 1: 0.257
Variable 2: 0.168
Variable 3: 0.329
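
The three result tables above are linked: with a single unrotated factor, each communality is the squared loading, and each uniqueness is one minus the communality. A minimal check, using the rounded loadings printed above as hard-coded inputs (so the last digit could in principle differ from the unrounded library values):

```python
import numpy as np

# Rounded loadings copied from the printed output above
loadings = np.array([0.862, 0.912, 0.819])

# With one factor, the communality h² is the squared loading
communalities = loadings ** 2

# Uniqueness is the share of standardized variance left over
uniqueness = 1.0 - communalities

for i, (h2, psi) in enumerate(zip(communalities, uniqueness), 1):
    print(f"Variable {i}: h2 = {h2:.3f}, uniqueness = {psi:.3f}")
```

Rounded to three decimals, these values agree with the communalities and uniqueness reported by `get_communalities()` and `get_uniquenesses()`.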


## Comparison with True Structure

Let's compare what factor analysis recovered versus how we actually generated the data.
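
One caveat for this comparison: fitted loadings are on a standardized (correlation) scale, whereas 0.8, 0.9, and 0.7 are raw generating coefficients. The sketch below shows the conversion, assuming a unit-variance latent factor and independent noise. The variable names are illustrative.

```python
import numpy as np

true_loadings = np.array([0.8, 0.9, 0.7])     # raw coefficients on the latent factor
true_noise_std = np.array([0.6, 0.4, 0.7])    # standard deviations of the unique noise

# Population standard deviation of each observed variable
total_std = np.sqrt(true_loadings ** 2 + true_noise_std ** 2)

# Standardized loading = correlation between the variable and the factor
standardized_loadings = true_loadings / total_std

print("Total std:            ", np.round(total_std, 3))
print("Standardized loadings:", np.round(standardized_loadings, 3))
```

Under these assumptions the standardized loadings are about 0.800, 0.914, and 0.707, so the raw and standardized targets differ only slightly for this dataset.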

In [56]:
# True structure used to generate X
true_loadings = np.array([0.8, 0.9, 0.7])
true_noise_std = np.array([0.6, 0.4, 0.7])

# Recovered structure from factor analysis
recovered_loadings = loadings.flatten()

print("\nComparison of True vs Recovered Structure:")
print("=" * 60)
print(f"{'Variable':<12} {'True Loading':<15} {'Recovered':<15} {'Difference':<12}")
print("-" * 60)
for i in range(3):
    diff = abs(true_loadings[i] - abs(recovered_loadings[i]))
    print(
        f"Variable {i+1:<3} {true_loadings[i]:<15.3f} {abs(recovered_loadings[i]):<15.3f} {diff:<12.3f}"
    )

print("\n\nVariance Decomposition:")
print("-" * 60)
print(f"{'Variable':<12} {'True h²':<15} {'Recovered h²':<15} {'Difference':<12}")
print("-" * 60)
for i in range(3):
    # True communality: loading² / (loading² + noise²)
    true_h2 = true_loadings[i] ** 2 / (true_loadings[i] ** 2 + true_noise_std[i] ** 2)
    recovered_h2 = communalities[i]
    diff = abs(true_h2 - recovered_h2)
    print(f"Variable {i+1:<3} {true_h2:<15.3f} {recovered_h2:<15.3f} {diff:<12.3f}")

print("\n\nUniqueness Comparison:")
print("-" * 60)
print(f"{'Variable':<12} {'True ψ':<15} {'Recovered ψ':<15} {'Difference':<12}")
print("-" * 60)
for i in range(3):
    # True uniqueness: noise² / (loading² + noise²)
    true_psi = true_noise_std[i] ** 2 / (true_loadings[i] ** 2 + true_noise_std[i] ** 2)
    recovered_psi = uniqueness[i]
    diff = abs(true_psi - recovered_psi)
    print(f"Variable {i+1:<3} {true_psi:<15.3f} {recovered_psi:<15.3f} {diff:<12.3f}")


Comparison of True vs Recovered Structure:
Variable     True Loading    Recovered       Difference  
------------------------------------------------------------
Variable 1   0.800           0.862           0.062       
Variable 2   0.900           0.912           0.012       
Variable 3   0.700           0.819           0.119       


Variance Decomposition:
------------------------------------------------------------
Variable     True h²         Recovered h²    Difference  
------------------------------------------------------------
Variable 1   0.640           0.743           0.103       
Variable 2   0.835           0.832           0.003       
Variable 3   0.500           0.671           0.171       


Uniqueness Comparison:
------------------------------------------------------------
Variable     True ψ          Recovered ψ     Difference  
------------------------------------------------------------
Variable 1   0.360           0.257           0.103       
Variable 2   0.165           0.168           0.003       
Variable 3   0.500           0.329           0.171       

## Interpretation

**How we built X:**
- Variable 1 = 0.8 * latent_factor + 0.6 * noise
- Variable 2 = 0.9 * latent_factor + 0.4 * noise  
- Variable 3 = 0.7 * latent_factor + 0.7 * noise

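**What factor analysis recovered:**
- Variable 2 is recovered closely (loading 0.912 vs 0.900, communality 0.832 vs 0.835)
- Variables 1 and 3 are overestimated: loadings of 0.862 and 0.819 against true values of 0.800 and 0.700, and communalities of 0.743 and 0.671 against 0.640 and 0.500
- Gaps of this size are too large to attribute to sampling variability with 1000 observations. The more likely cause is the extraction method: `method="principal"` extracts a principal component, which accounts for total variance, so part of a variable's unique variance is absorbed into its loading, most visibly for the noisier variables
- The true loadings (0.8, 0.9, 0.7) are raw coefficients, while the fitted loadings are standardized. Dividing by each variable's standard deviation gives approximately 0.800, 0.914, and 0.707, which changes the comparison only slightly

**Key insight:**
The fit identifies a single dominant factor and ranks the variables correctly by how strongly they relate to it (Variable 2, then Variable 1, then Variable 3), but it overstates the shared variance of the noisier variables. Two quantities are being estimated:
1. How strongly each variable relates to the hidden factor (loadings)
2. What proportion of variance is shared (communalities) vs unique (uniqueness)

Recovering these accurately requires that the factor model assumptions hold and that the extraction method targets shared variance rather than total variance, which is what the `minres` and `ml` methods of `FactorAnalyzer` are designed to do.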
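
To see how the choice of extraction can matter, the sketch below works from the population correlation matrix implied by the generating model rather than from sample data. It compares first-principal-component loadings with loadings obtained from the classical triad identity for three variables and one factor, $\lambda_1^2 = r_{12} r_{13} / r_{23}$. This is an illustration under the stated generating assumptions, not a reproduction of what `FactorAnalyzer` does internally.

```python
import numpy as np

# Standardized loadings implied by the generating model
raw = np.array([0.8, 0.9, 0.7])
noise = np.array([0.6, 0.4, 0.7])
lam = raw / np.sqrt(raw ** 2 + noise ** 2)

# Population correlation matrix under a one-factor model
R = np.outer(lam, lam)
np.fill_diagonal(R, 1.0)

# First principal component: loadings = sqrt(eigenvalue) * eigenvector
eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalues in ascending order
v1 = eigvecs[:, -1]
v1 = v1 * np.sign(v1.sum())               # fix the arbitrary sign
pc_loadings = np.sqrt(eigvals[-1]) * v1

# Triad identity: uses only off-diagonal correlations (shared variance)
r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
triad_loadings = np.sqrt([r12 * r13 / r23, r12 * r23 / r13, r13 * r23 / r12])

print("Standardized true:  ", np.round(lam, 3))
print("Principal component:", np.round(pc_loadings, 3))
print("Triad (common):     ", np.round(triad_loadings, 3))
```

In this population calculation the triad loadings equal the standardized true loadings, while the principal-component loadings for Variables 1 and 3 come out higher, the same direction as the gaps in the comparison tables above.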

In [57]:
# Compute Factor Scores
factor_scores = fa.transform(X)
print("\nFactor Scores (first 10 observations):")
print(factor_scores[:10].flatten())


Factor Scores (first 10 observations):
[ 0.16882527 -0.18954738  0.36040226  1.70257151 -0.26030402 -0.47076761
  1.8126552   0.31413228  0.16198443  0.39251693]
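
Factor scores can be computed in several ways. The sketch below shows one common choice, the regression method, in plain NumPy: standardize the data, then apply weights $W = R^{-1}\Lambda$, where $R$ is the sample correlation matrix and $\Lambda$ holds the loadings. It regenerates data with the same recipe as above but with its own random generator, and it uses first-principal-component loadings as a stand-in for fitted loadings. Whether `fa.transform` follows exactly these steps is an assumption, not something this notebook establishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
f = rng.standard_normal(n)                # latent factor (known only in simulation)
X = np.column_stack([
    0.8 * f + 0.6 * rng.standard_normal(n),
    0.9 * f + 0.4 * rng.standard_normal(n),
    0.7 * f + 0.7 * rng.standard_normal(n),
])

# Standardize columns (ddof=0) and compute the sample correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(X, rowvar=False)

# Stand-in loadings: first principal component of R
eigvals, eigvecs = np.linalg.eigh(R)
v1 = eigvecs[:, -1] * np.sign(eigvecs[:, -1].sum())
loadings = (np.sqrt(eigvals[-1]) * v1).reshape(-1, 1)

# Regression-method weights and scores
weights = np.linalg.solve(R, loadings)
scores = Z @ weights

print(scores[:5].ravel())
print("corr(scores, latent factor):", np.corrcoef(scores.ravel(), f)[0, 1])
```

Because the latent factor is known in a simulation, the final line gives a direct check of how well the scores track it. With real data that comparison is not available.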




## Installation Note

To run this notebook, install the factor_analyzer package:

```bash
pip install factor_analyzer
```