# Factor Analysis on Big Five Personality Test Dataset
## Confirmatory Factor Analysis of the Five-Factor Model

**Dataset Overview:**
- 50 personality questions based on IPIP-FFM
- Expected 5 factors: Extroversion, Neuroticism, Agreeableness, Conscientiousness, Openness
- Large sample size (often 10,000+ responses)
- Well suited to testing a hypothesized factor structure, in the spirit of Confirmatory Factor Analysis (CFA)

**Focus:** Testing theoretical factor structure and cross-loadings

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from factor_analyzer import FactorAnalyzer, calculate_bartlett_sphericity, calculate_kmo
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)
print('Libraries imported!')

## 1. Data Loading

Download from: https://www.kaggle.com/datasets/tunguz/big-five-personality-test

In [None]:
try:
    df = pd.read_csv('data.csv', sep='\t')
    print('Real Big Five dataset loaded!')
except FileNotFoundError:
    print('Creating synthetic Big Five personality data...')
    np.random.seed(42)
    n = 5000
    
    # Generate 5 latent factors
    extroversion = np.random.randn(n)
    neuroticism = np.random.randn(n)
    agreeableness = np.random.randn(n)
    conscientiousness = np.random.randn(n)
    openness = np.random.randn(n)
    
    def likert(factor, loading, reverse=False, noise=0.6):
        score = factor * loading + np.random.randn(n) * noise
        if reverse:
            score = -score
        score = (score - score.min()) / (score.max() - score.min())
        return np.clip(np.round(score * 4 + 1), 1, 5).astype(int)
    
    # Create 50 questions (10 per factor)
    data = {}
    
    # Extroversion (E1-E10)
    for i in range(1, 11):
        reverse = i % 2 == 0
        data[f'E{i}'] = likert(extroversion, 0.8, reverse)
    
    # Neuroticism (N1-N10)
    for i in range(1, 11):
        reverse = i % 2 == 0
        data[f'N{i}'] = likert(neuroticism, 0.8, reverse)
    
    # Agreeableness (A1-A10)
    for i in range(1, 11):
        reverse = i % 2 == 0
        data[f'A{i}'] = likert(agreeableness, 0.8, reverse)
    
    # Conscientiousness (C1-C10)
    for i in range(1, 11):
        reverse = i % 2 == 0
        data[f'C{i}'] = likert(conscientiousness, 0.8, reverse)
    
    # Openness (O1-O10)
    for i in range(1, 11):
        reverse = i % 2 == 0
        data[f'O{i}'] = likert(openness, 0.8, reverse)
    
    df = pd.DataFrame(data)
    print('Synthetic dataset created!')

print(f'\nShape: {df.shape}')
print(f'\nFirst 5 rows:')
display(df.head())

## 2. Data Exploration and Preprocessing

In [None]:
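# Handle reverse-scored items.
# Assumption: even-numbered items are reverse-keyed, which matches the synthetic
# data above. Check the official scoring key before applying this to real responses.
item_cols = df.columns[df.columns.str.match(r'^[ENACO]\d+$')]  # skip non-item columns
reverse_items = [col for col in item_cols if int(col[1:]) % 2 == 0]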

print(f'Reverse-scored items: {len(reverse_items)}')
for item in reverse_items[:5]:
    print(f'  {item}: before reverse - mean={df[item].mean():.2f}')

# Reverse scoring (6 - value for 1-5 scale)
for item in reverse_items:
    df[item] = 6 - df[item]

print('\nAfter reversing:')
for item in reverse_items[:5]:
    print(f'  {item}: after reverse - mean={df[item].mean():.2f}')
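
The reverse-scoring rule above, `6 - value`, is a special case of `(scale_min + scale_max) - value`. The sketch below shows the mapping on a few made-up responses; the helper name `reverse_score` and the sample values are illustrative and not part of the dataset.

```python
import numpy as np

def reverse_score(values, scale_min=1, scale_max=5):
    """Flip Likert responses so that the highest option becomes the lowest."""
    values = np.asarray(values)
    return (scale_min + scale_max) - values

# Hypothetical responses on a 1-5 scale
responses = np.array([1, 2, 3, 4, 5])
flipped = reverse_score(responses)
print(flipped)  # [5 4 3 2 1]
```

Applying the function twice returns the original responses, which is why re-running the reversal cell would undo the correction.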

In [None]:
# Calculate scale scores.
# Match only item columns (prefix letter followed by digits) so that re-running
# this cell does not pick up the *_Score columns created below.
def item_columns(prefix):
    return [c for c in df.columns if c.startswith(prefix) and c[1:].isdigit()]

extroversion_cols = item_columns('E')
neuroticism_cols = item_columns('N')
agreeableness_cols = item_columns('A')
conscientiousness_cols = item_columns('C')
openness_cols = item_columns('O')

df['Extroversion_Score'] = df[extroversion_cols].mean(axis=1)
df['Neuroticism_Score'] = df[neuroticism_cols].mean(axis=1)
df['Agreeableness_Score'] = df[agreeableness_cols].mean(axis=1)
df['Conscientiousness_Score'] = df[conscientiousness_cols].mean(axis=1)
df['Openness_Score'] = df[openness_cols].mean(axis=1)

print('Big Five Scores:')
scores = df[['Extroversion_Score', 'Neuroticism_Score', 'Agreeableness_Score',
             'Conscientiousness_Score', 'Openness_Score']]
display(scores.describe())

## 3. Suitability Tests

In [None]:
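# Select only the 50 personality items (E1-E10, N1-N10, A1-A10, C1-C10, O1-O10).
# Matching on the name pattern avoids picking up other short column names.
items_df = df.loc[:, df.columns.str.match(r'^[ENACO]\d+$')]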

chi2, p = calculate_bartlett_sphericity(items_df)
kmo_all, kmo_model = calculate_kmo(items_df)

print('SUITABILITY TESTS')
print('='*70)
print(f'\nBartlett\'s Test: Chi2={chi2:.2f}, p={p:.10f}')
print(f'KMO: {kmo_model:.4f}')

if p < 0.05 and kmo_model > 0.6:
    print('\n✓ Data is suitable for factor analysis')
else:
    print('\n✗ Data may not be suitable for factor analysis (check p-value and KMO)')
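
To make the two suitability checks less of a black box, here is a minimal NumPy sketch of the textbook formulas: Bartlett's statistic is computed from the log-determinant of the correlation matrix, and the overall KMO compares squared correlations with squared partial correlations. The function names and the simulated one-factor data are assumptions for illustration, and the library's exact numerical details may differ.

```python
import numpy as np

def bartlett_statistic(data):
    """Return (chi-square, degrees of freedom) for Bartlett's test of sphericity."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # slogdet is numerically safer than log(det(...)) when the determinant is tiny
    _, logdet = np.linalg.slogdet(corr)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    return chi2, dof

def kmo_overall(data):
    """Return the overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations come from the scaled inverse correlation matrix
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    p2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + p2)

# Simulated data: 6 items driven by one latent factor (illustrative only)
rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 1))
sim_items = 0.8 * latent + 0.6 * rng.standard_normal((500, 6))

chi2_sim, dof_sim = bartlett_statistic(sim_items)
kmo_sim = kmo_overall(sim_items)
print(f'Chi2={chi2_sim:.1f}, df={dof_sim:.0f}, KMO={kmo_sim:.3f}')
```

A large chi-square relative to its degrees of freedom means the correlation matrix is far from the identity, and a KMO close to 1 means the correlations are mostly shared rather than pairwise-specific.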

## 4. Confirmatory Factor Analysis with 5 Factors

In [None]:
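# Standardize (optional here: the model is fit to the correlation matrix,
# which does not change when items are rescaled)
scaler = StandardScaler()
items_scaled = pd.DataFrame(scaler.fit_transform(items_df), columns=items_df.columns)

# Fit 5-factor model. FactorAnalyzer performs an exploratory fit; fixing n_factors=5
# lets us check whether the items recover the theoretical Big Five structure.
fa = FactorAnalyzer(n_factors=5, rotation='varimax')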
fa.fit(items_scaled)

loadings = fa.loadings_
loadings_df = pd.DataFrame(loadings, index=items_df.columns,
                          columns=['Factor_1', 'Factor_2', 'Factor_3', 'Factor_4', 'Factor_5'])

print('Factor Loadings (5-Factor Solution):')
display(loadings_df.round(3))
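
The `rotation='varimax'` option rotates the factors so that each one has a few large loadings and many near-zero ones. As a rough illustration of the idea, the sketch below implements a plain SVD-based varimax iteration in NumPy. It omits Kaiser normalization, so it is not a reproduction of the library's routine, and the loading matrix is made up.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Rotate a loading matrix orthogonally toward simple structure."""
    p, k = loadings.shape
    rotation = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotation
        grad = loadings.T @ (
            rotated ** 3
            - (gamma / p) * rotated @ np.diag(np.sum(rotated ** 2, axis=0))
        )
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt  # nearest orthogonal matrix to the gradient
        d_new = s.sum()
        if d != 0 and d_new / d < 1 + tol:
            break
        d = d_new
    return loadings @ rotation

def varimax_criterion(loadings):
    """Sum over factors of the variance of squared loadings."""
    return np.var(loadings ** 2, axis=0).sum()

# Hypothetical simple structure, deliberately blurred by a 30-degree rotation
simple = np.array([[0.8, 0.0], [0.8, 0.0], [0.8, 0.0],
                   [0.0, 0.7], [0.0, 0.7], [0.0, 0.7]])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ mix

rotated = varimax(unrotated)
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, each item's communality is unchanged; only the way the variance is divided among factors moves.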

In [None]:
# Identify which items load on which factor
for i in range(5):
    factor_col = f'Factor_{i+1}'
    print(f'\n{factor_col}:')
    high_loaders = loadings_df[factor_col].abs().sort_values(ascending=False).head(10)
    for item, loading in high_loaders.items():
        original_loading = loadings_df.loc[item, factor_col]
        print(f'  {item}: {original_loading:.3f}')
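
Extracted factors come out in arbitrary order, so `Factor_1` is not necessarily Extroversion. One way to check the structure against theory is to assign each item to its dominant factor and label each factor by the most common item prefix. The sketch below does this on a small hypothetical loading matrix; the item names and values are invented for illustration.

```python
import numpy as np

item_names = ['E1', 'E2', 'N1', 'N2', 'A1', 'A2']
# Hypothetical rotated loadings: rows are items, columns are factors
example_loadings = np.array([
    [ 0.72,  0.05,  0.10],
    [-0.68,  0.12, -0.03],
    [ 0.08,  0.75,  0.02],
    [ 0.15, -0.70,  0.06],
    [ 0.04,  0.09,  0.66],
    [ 0.45,  0.02,  0.30],   # A2 loads mainly on the "wrong" factor
])

# Dominant factor = column with the largest absolute loading
dominant = np.abs(example_loadings).argmax(axis=1)

# Label each factor by the most frequent first letter among its items
factor_labels = {}
for f in np.unique(dominant):
    prefixes = [item_names[i][0] for i in np.where(dominant == f)[0]]
    factor_labels[int(f)] = max(set(prefixes), key=prefixes.count)

# An item "matches" when its own prefix equals its dominant factor's label
matches = [factor_labels[int(f)] == name[0]
           for name, f in zip(item_names, dominant)]

print(factor_labels)
print(f'Items on intended factor: {sum(matches)}/{len(item_names)}')
```

The same logic applies to the 50 x 5 `loadings_df` by passing its values and index in place of the made-up arrays.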

In [None]:
# Check for cross-loadings
print('\nCROSS-LOADINGS (items loading >0.3 on multiple factors):')
for item in loadings_df.index:
    loadings_abs = loadings_df.loc[item].abs()
    high_loadings = loadings_abs[loadings_abs > 0.3]
    if len(high_loadings) > 1:
        print(f'{item}: {high_loadings.to_dict()}')

## 5. Variance Explained

In [None]:
variance = fa.get_factor_variance()
variance_df = pd.DataFrame(variance, 
                           columns=['Factor_1', 'Factor_2', 'Factor_3', 'Factor_4', 'Factor_5'],
                           index=['SS Loadings', 'Proportion Var', 'Cumulative Var'])

print('Variance Explained:')
display(variance_df.round(4))

plt.figure(figsize=(12, 6))
plt.bar(range(1, 6), variance[1], color='steelblue', alpha=0.7)
plt.xlabel('Factor')
plt.ylabel('Proportion of Variance')
plt.title('Variance Explained by Big Five Factors')
plt.xticks(range(1, 6))
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
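
The three rows of the variance table follow directly from the loading matrix: the sum of squared loadings per factor, that sum divided by the number of items, and the running total. A minimal sketch with made-up loadings, assuming standardized items so that each contributes one unit of variance:

```python
import numpy as np

# Hypothetical rotated loadings: 6 items (rows) on 2 factors (columns)
demo_loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.0],
    [0.6, 0.2],
    [0.1, 0.7],
    [0.0, 0.8],
    [0.2, 0.6],
])

n_items = demo_loadings.shape[0]
ss_loadings = (demo_loadings ** 2).sum(axis=0)   # variance captured by each factor
proportion_var = ss_loadings / n_items           # share of total item variance
cumulative_var = np.cumsum(proportion_var)       # running total across factors

print('SS Loadings:   ', np.round(ss_loadings, 3))
print('Proportion Var:', np.round(proportion_var, 3))
print('Cumulative Var:', np.round(cumulative_var, 3))
```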

## 6. Model Fit Assessment

In [None]:
communalities = fa.get_communalities()
comm_df = pd.DataFrame({'Item': items_df.columns, 'Communality': communalities,
                       'Uniqueness': 1-communalities}).sort_values('Communality', ascending=False)

print('Communalities:')
display(comm_df.head(20))

print(f'\nMean communality: {communalities.mean():.3f}')
print(f'Items with low communality (<0.3): {(communalities < 0.3).sum()}')
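
A communality is the sum of an item's squared loadings across all factors, and uniqueness is what remains of its unit variance. The sketch below computes both on a hypothetical loading matrix, assuming standardized items, and shows that the mean communality equals the total proportion of variance explained.

```python
import numpy as np

# Hypothetical rotated loadings: 6 items (rows) on 2 factors (columns)
toy_loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.0],
    [0.6, 0.2],
    [0.1, 0.7],
    [0.0, 0.8],
    [0.2, 0.6],
])

toy_communalities = (toy_loadings ** 2).sum(axis=1)  # variance shared with the factors
toy_uniqueness = 1 - toy_communalities               # item-specific variance plus error

# Total variance explained, computed from the factor side
total_explained = (toy_loadings ** 2).sum() / toy_loadings.shape[0]

print('Communalities:', np.round(toy_communalities, 2))
print('Uniqueness:   ', np.round(toy_uniqueness, 2))
print(f'Mean communality: {toy_communalities.mean():.3f}')
print(f'Total proportion of variance explained: {total_explained:.3f}')
```

Items with low communality share little variance with any factor, which is why the count of items below 0.3 is a useful screening number.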

## 7. Summary

This notebook tested the Big Five personality model by fitting a 5-factor solution with varimax rotation and comparing the loadings against the theoretical structure. The analysis revealed:

1. Strong evidence for 5-factor structure
2. Items generally load on their intended factors
3. Some cross-loadings indicate complexity in personality measurement
4. Model explains substantial variance in personality responses

The Big Five dataset is ideal for learning FA because:
- Clear theoretical structure (5 factors)
- Large sample size
- Well-validated measurement model
- Demonstrates both convergent and discriminant validity