# Soil PCA Analysis

## Applying PCA to Agricultural Data

This notebook applies PCA to the preprocessed soil dataset and interprets the resulting components in agronomic terms.

### Goals
1. Apply PCA to the scaled soil data
2. Determine how many components to keep
3. Interpret the principal components through their loadings
4. Relate the components to agronomic meaning

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import pickle

plt.style.use('seaborn-v0_8-darkgrid')
np.set_printoptions(precision=3, suppress=True)

print('✓ Libraries imported')

## 1. Load Preprocessed Data

In [None]:
# pickle can execute arbitrary code on load, so only open files you created or trust
with open('soil_data_preprocessed.pkl', 'rb') as f:
    data = pickle.load(f)

X_scaled = data['X_scaled']
feature_names = data['feature_names']
df = data['df']

print(f'Data loaded: {X_scaled.shape}')
print(f'Features: {len(feature_names)}')

## 2. Apply PCA

In [None]:
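# Fit PCA with all components so the full variance spectrum is available
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

print('PCA Results:')
print(f'Components: {pca.n_components_}')
print('\nExplained variance by component:')
for i, var in enumerate(pca.explained_variance_ratio_[:5], 1):
    print(f'  PC{i}: {var:.4f} ({var*100:.2f}%)')

# Running total of explained variance, reused by the scree plot in Section 3
cumvar = np.cumsum(pca.explained_variance_ratio_)
print('\nCumulative variance:')
# Cap at the number of available components so the loop cannot index past the end
for i in range(1, min(5, len(cumvar)) + 1):
    print(f'  First {i} PCs: {cumvar[i-1]:.4f} ({cumvar[i-1]*100:.2f}%)')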
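
The explained variance ratios printed above are the eigenvalues of the feature covariance matrix, each divided by their sum. The sketch below checks that relationship on synthetic data. `X_demo`, `pca_demo`, and the random seed are illustrative assumptions and are not part of the soil dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a standardized feature matrix (assumption, not the soil data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

pca_demo = PCA().fit(X_demo)

# Eigenvalues of the sample covariance matrix, sorted from largest to smallest
eigvals = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]
ratio_from_eig = eigvals / eigvals.sum()

# The two routes should agree up to floating-point error
print(np.allclose(ratio_from_eig, pca_demo.explained_variance_ratio_))
```

Seeing the ratios as normalized eigenvalues explains why they always sum to 1 when every component is kept.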

## 3. Scree Plot

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

ax1.bar(range(1, len(pca.explained_variance_ratio_)+1),
       pca.explained_variance_ratio_, alpha=0.7, edgecolor='black')
ax1.set_xlabel('Principal Component', fontsize=12)
ax1.set_ylabel('Explained Variance Ratio', fontsize=12)
ax1.set_title('Scree Plot', fontsize=14, fontweight='bold')
ax1.grid(axis='y', alpha=0.3)

ax2.plot(range(1, len(cumvar)+1), cumvar, 'o-', linewidth=2, markersize=8)
ax2.axhline(0.95, color='r', linestyle='--', linewidth=2, label='95%')
ax2.set_xlabel('Number of Components', fontsize=12)
ax2.set_ylabel('Cumulative Variance', fontsize=12)
ax2.set_title('Cumulative Explained Variance', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# argmax returns the first index where the cumulative variance reaches 0.95
n_95 = np.argmax(cumvar >= 0.95) + 1
print(f'\n💡 {n_95} components explain at least 95% of the variance')
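
As an alternative to locating the 95% threshold by hand, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and keeps just enough components to reach that share of variance. The sketch below compares both routes on synthetic data. `X_demo`, the column scales, and the explicit `svd_solver='full'` setting are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with deliberately unequal column variances (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])

# Manual route: fit all components, then find the first index reaching 95%
full_ratios = PCA(svd_solver='full').fit(X_demo).explained_variance_ratio_
cumvar_demo = np.cumsum(full_ratios)
n_manual = int(np.argmax(cumvar_demo >= 0.95)) + 1

# Built-in route: a float n_components is read as a variance threshold
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_demo)

print(f'manual: {n_manual}, built-in: {pca_95.n_components_}')
print(f'variance retained: {pca_95.explained_variance_ratio_.sum():.2%}')
```

The built-in route avoids a second fit when only the reduced model is needed, while the manual route keeps the full spectrum available for the scree plot.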

## 4. Component Interpretation

In [None]:
# Rows of pca.components_ are unit-length eigenvectors; transpose so features are rows
components_df = pd.DataFrame(
    pca.components_[:3].T,
    columns=['PC1', 'PC2', 'PC3'],
    index=feature_names
)

print('Loadings for the first 3 components:')
print(components_df)

print('\n' + '='*60)
print('PC1 - Top Contributing Features:')
# Rank by absolute value, but print the signed loading
pc1_sorted = components_df['PC1'].abs().sort_values(ascending=False)
for feat, val in pc1_sorted.head(5).items():
    print(f'  {feat:30s}: {components_df.loc[feat, "PC1"]:7.3f}')

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, pc in enumerate(['PC1', 'PC2', 'PC3']):
    axes[i].barh(feature_names, components_df[pc], alpha=0.7, edgecolor='black')
    axes[i].set_xlabel('Loading', fontsize=11)
    axes[i].set_title(f'{pc} Loadings ({pca.explained_variance_ratio_[i]:.1%} var)',
                     fontsize=12, fontweight='bold')
    axes[i].axvline(0, color='black', linewidth=0.8)
    axes[i].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()
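
The values in `pca.components_` are unit-length eigenvector entries. Multiplying each component by the square root of its explained variance gives scaled loadings which, once divided by each feature's standard deviation, equal the correlation between that feature and the component scores. The sketch below verifies this on synthetic data. `X_demo`, `latent`, and the other names are illustrative assumptions, not part of the soil dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: two latent factors, each driving two noisy features (assumption)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X_demo = np.column_stack([
    latent[:, 0] + 0.1 * rng.normal(size=200),
    latent[:, 0] + 0.3 * rng.normal(size=200),
    latent[:, 1] + 0.1 * rng.normal(size=200),
    latent[:, 1] + 0.3 * rng.normal(size=200),
])
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA().fit(X_demo_scaled)
scores = pca_demo.transform(X_demo_scaled)

# Scaled loadings: eigenvector entries times sqrt(eigenvalue), features as rows
scaled_loadings = pca_demo.components_.T * np.sqrt(pca_demo.explained_variance_)

# Divide by the sample standard deviation (ddof=1, matching PCA's variance estimate)
feature_std = X_demo_scaled.std(axis=0, ddof=1)
corr_from_loadings = scaled_loadings / feature_std[:, None]

# Direct feature-to-score correlations for comparison
n_features = X_demo_scaled.shape[1]
corr_direct = np.corrcoef(X_demo_scaled.T, scores.T)[:n_features, n_features:]

print(np.allclose(corr_from_loadings, corr_direct))
```

Correlation-scaled loadings are bounded between -1 and 1, which can make bar charts easier to compare across components with very different variances.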

## 5. Agronomic Interpretation

In [None]:
# Static summary text: check it against the Section 4 loadings for your data
print('Agronomic Interpretation of Principal Components:')
print('='*70)

print('\nPC1 - Soil Fertility Factor')
print('  High loadings: NPK nutrients, organic matter, CEC')
print('  Interpretation: Overall soil fertility and nutrient availability')
print('  High PC1 = Nutrient-rich soil, Low PC1 = Nutrient-poor soil')
print('  (holds when the fertility loadings are positive; PCA signs are arbitrary)')

print('\nPC2 - Texture Factor')
print('  High loadings: Sand, silt, clay percentages')
print('  Interpretation: Soil physical properties and texture')
print('  Affects water retention and workability')

print('\nPC3 - Micronutrient Factor')
print('  High loadings: Micronutrients (Fe, Zn, Cu, Mn, B)')
print('  Interpretation: Trace element availability')
print('  Important for specific crop requirements')
print('='*70)
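
The interpretation above is written as fixed text, so it is worth confirming which features actually dominate each component. One possible way to do that is sketched below. `top_features` is a hypothetical helper, and the loadings table with its feature names is invented for the illustration rather than taken from the soil results.

```python
import pandas as pd

def top_features(loadings: pd.DataFrame, pc: str, n: int = 3) -> pd.Series:
    """Return the n signed loadings with the largest absolute value for one component."""
    order = loadings[pc].abs().sort_values(ascending=False).index[:n]
    return loadings.loc[order, pc]

# Hypothetical loadings table (features as rows, components as columns)
demo_loadings = pd.DataFrame(
    {'PC1': [0.60, 0.55, 0.10, -0.05], 'PC2': [0.05, -0.10, 0.70, -0.68]},
    index=['nitrogen', 'organic_matter', 'sand_pct', 'clay_pct'],
)

for pc in demo_loadings.columns:
    print(pc, top_features(demo_loadings, pc, n=2).to_dict())
```

Applied to `components_df`, the same helper would list the dominant features for PC1 through PC3, which can then be compared with the fertility, texture, and micronutrient labels.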

## 6. Reduce to 2D

In [None]:
# Refit with only two components to get a 2D projection for plotting
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

print('Reduced to 2D')
print(f'Variance retained: {pca_2d.explained_variance_ratio_.sum():.2%}')
print(f'  PC1: {pca_2d.explained_variance_ratio_[0]:.2%}')
print(f'  PC2: {pca_2d.explained_variance_ratio_[1]:.2%}')

data['X_pca_2d'] = X_2d
data['pca_2d'] = pca_2d

with open('soil_pca_results.pkl', 'wb') as f:
    pickle.dump(data, f)

print('\n✓ Results saved for visualization notebook')
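
Refitting with `n_components=2` should give the same projection as the first two columns of a full PCA fit, up to a possible sign flip per component. The sketch below checks this on synthetic data. `X_demo` is an assumption, and `svd_solver='full'` is set explicitly so that both fits use the same decomposition.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (assumption, not the soil dataset)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))

scores_full = PCA(svd_solver='full').fit_transform(X_demo)
scores_2d = PCA(n_components=2, svd_solver='full').fit_transform(X_demo)

# Compare absolute values so a sign flip in either component does not matter
same_up_to_sign = np.allclose(np.abs(scores_full[:, :2]), np.abs(scores_2d))
print(same_up_to_sign)
```

If that holds for the data at hand, slicing the scores from the full fit is an alternative to fitting a second model. Keeping a separate fitted 2D model is still convenient when it has to be saved and reused to transform new samples.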

## Key Takeaways

### Results Summary

1. **Dimensionality Reduction**: 16 features → 2-3 components for visualization and interpretation
2. **Information Retention**: About 5 components retain 95% of the variance
3. **Interpretability**: The leading components map onto agronomic factors
4. **Actionable**: Component scores can guide soil management decisions

### Components Represent

- **PC1**: Overall soil fertility (NPK, organic matter, CEC)
- **PC2**: Soil texture (sand/silt/clay balance)
- **PC3**: Micronutrient availability

### Agricultural Value

- Simplify complex soil data
- Identify soil quality patterns
- Group similar soils
- Target management practices

---

Continue to: `soil_pca_visualization.ipynb`