# Soil PCA Visualization

## Advanced Visualizations for Agricultural Insights

Final step: visualize the PCA results from several angles and read agronomic patterns off the plots.

### Visualizations
1. 2D scatter plots with different colorings
2. 3D PCA space
3. Biplots showing features and samples
4. Regional comparisons
5. Soil type clustering

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
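from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed to register the 3D projection on Matplotlib < 3.2)
import pickle

# The 'seaborn-v0_8-*' style names require Matplotlib >= 3.6
plt.style.use('seaborn-v0_8-darkgrid')

# Load the results saved earlier in the pipeline (only unpickle files you trust)
with open('soil_pca_results.pkl', 'rb') as f:
    data = pickle.load(f)

X_2d = data['X_pca_2d']              # PCA scores, one row per sample, two columns
df = data['df']                      # soil samples with metadata columns
feature_names = data['feature_names']
pca_2d = data['pca_2d']              # fitted 2-component PCA object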

print('✓ Data loaded for visualization')

## 1. Basic 2D PCA Plot

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(X_2d[:, 0], X_2d[:, 1], alpha=0.6, s=50, edgecolors='k', linewidths=0.5)
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)', fontsize=12)
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)', fontsize=12)
plt.title('Soil Samples in PCA Space', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

## 2. Color by Region

In [None]:
plt.figure(figsize=(12, 8))
regions = df['region'].unique()
colors = plt.cm.Set1(np.linspace(0, 1, len(regions)))

for region, color in zip(regions, colors):
    mask = df['region'] == region
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], 
               label=region, alpha=0.6, s=60, 
               edgecolors='k', linewidths=0.5, color=color)

plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%})', fontsize=12)
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%})', fontsize=12)
plt.title('Soil Samples by Region', fontsize=14, fontweight='bold')
plt.legend(title='Region', fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print('💡 Check whether samples from the same region cluster together in PCA space')
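
To back the visual impression with numbers, one option is to compare each region's centroid and spread in PCA space. The sketch below runs on synthetic stand-in data: the `scores` array and the region names "North" and "South" are made up for illustration. In this notebook the inputs would be `X_2d` and `df['region']`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for X_2d and df['region'] (illustrative values only)
rng = np.random.default_rng(0)
scores = np.vstack([
    rng.normal(loc=[2.0, 0.0], scale=0.5, size=(50, 2)),   # hypothetical "North" samples
    rng.normal(loc=[-2.0, 1.0], scale=0.5, size=(50, 2)),  # hypothetical "South" samples
])
region = np.array(['North'] * 50 + ['South'] * 50)

# Mean and standard deviation of the PC scores within each region
pc = pd.DataFrame(scores, columns=['PC1', 'PC2'])
pc['region'] = region
centroids = pc.groupby('region')[['PC1', 'PC2']].mean()
spread = pc.groupby('region')[['PC1', 'PC2']].std()

print(centroids.round(2))
print(spread.round(2))
```

Centroids that sit far apart relative to the within-region spread suggest a real regional difference; overlapping centroids suggest the colors in the plot are mostly noise.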

## 3. Color by Soil Type

In [None]:
plt.figure(figsize=(12, 8))
soil_types = df['soil_type'].unique()
colors_soil = plt.cm.tab10(np.linspace(0, 1, len(soil_types)))

for stype, color in zip(soil_types, colors_soil):
    mask = df['soil_type'] == stype
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1],
               label=stype, alpha=0.6, s=60,
               edgecolors='k', linewidths=0.5, color=color)

plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%})', fontsize=12)
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%})', fontsize=12)
plt.title('Soil Samples by Soil Type', fontsize=14, fontweight='bold')
plt.legend(title='Soil Type', fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

## 4. Color by pH

In [None]:
plt.figure(figsize=(11, 8))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], 
                     c=df['pH'], cmap='RdYlGn', 
                     s=70, alpha=0.7, edgecolors='k', linewidths=0.5)
plt.colorbar(scatter, label='pH')
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%})', fontsize=12)
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%})', fontsize=12)
plt.title('Soil pH Distribution in PCA Space', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
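
A color gradient along one axis hints that pH is tied to that component. A quick way to check is the Pearson correlation between pH and each PC score. The sketch below uses synthetic values in which pH is deliberately built to follow PC1; in this notebook the inputs would be `df['pH']`, `X_2d[:, 0]` and `X_2d[:, 1]`.

```python
import numpy as np

# Synthetic stand-ins: pH is constructed to depend on PC1 but not on PC2
rng = np.random.default_rng(1)
pc1 = rng.normal(size=200)
pc2 = rng.normal(size=200)
ph = 6.5 + 0.8 * pc1 + rng.normal(scale=0.3, size=200)

# Pearson correlation of pH with each component's scores
r_pc1 = np.corrcoef(ph, pc1)[0, 1]
r_pc2 = np.corrcoef(ph, pc2)[0, 1]

print(f'pH vs PC1: r = {r_pc1:.2f}')
print(f'pH vs PC2: r = {r_pc2:.2f}')
```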

## 5. Biplot - Features and Samples

In [None]:
plt.figure(figsize=(14, 10))

plt.scatter(X_2d[:, 0], X_2d[:, 1], alpha=0.5, s=40, edgecolors='k', linewidths=0.3)

# Loadings are small relative to the scores, so stretch the arrows to make them visible
scale = 4.0
for i, feature in enumerate(feature_names):
    plt.arrow(0, 0, 
             scale * pca_2d.components_[0, i],
             scale * pca_2d.components_[1, i],
             head_width=0.15, head_length=0.15,
             fc='red', ec='red', alpha=0.7, linewidth=2)
    plt.text(scale * pca_2d.components_[0, i] * 1.15,
            scale * pca_2d.components_[1, i] * 1.15,
            feature, fontsize=9, ha='center',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.7))

plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%})', fontsize=13)
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%})', fontsize=13)
plt.title('Biplot: Soil Samples and Features', fontsize=15, fontweight='bold')
plt.axhline(0, color='gray', linestyle='--', alpha=0.3)
plt.axvline(0, color='gray', linestyle='--', alpha=0.3)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print('💡 Arrow direction and length show how strongly each feature loads on PC1 and PC2')
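
With many features the arrows overlap, so it can help to rank the loadings numerically as well. The helper below is a minimal sketch: `top_loadings` is a hypothetical name, the loading values are invented, and `sand_percent` is an assumed column that may not exist in the dataset. In this notebook the inputs would be `pca_2d.components_` and `feature_names`.

```python
import numpy as np

def top_loadings(components, feature_names, pc=0, k=3):
    """Return the k features with the largest absolute loading on one PC."""
    loadings = np.asarray(components)[pc]
    order = np.argsort(np.abs(loadings))[::-1][:k]  # largest magnitude first
    return [(feature_names[i], float(loadings[i])) for i in order]

# Invented loadings for four features on two components
components = np.array([
    [0.60, 0.55, 0.10, -0.57],
    [0.05, -0.20, 0.95, 0.23],
])
names = ['nitrogen_ppm', 'phosphorus_ppm', 'pH', 'sand_percent']

print(top_loadings(components, names, pc=0, k=2))
print(top_loadings(components, names, pc=1, k=1))
```

Ranking by absolute value keeps strongly negative loadings in view; the sign then tells you in which direction the feature pulls along the component.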

## 6. 3D PCA Visualization

In [None]:
from sklearn.decomposition import PCA

# Refit with three components on the scaled features saved earlier
pca_3d = PCA(n_components=3)
X_3d = pca_3d.fit_transform(data['X_scaled'])

fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')

regions = df['region'].unique()
colors = plt.cm.Set1(np.linspace(0, 1, len(regions)))

for region, color in zip(regions, colors):
    mask = df['region'] == region
    ax.scatter(X_3d[mask, 0], X_3d[mask, 1], X_3d[mask, 2],
              label=region, alpha=0.6, s=40, color=color, edgecolors='k', linewidths=0.3)

ax.set_xlabel(f'PC1 ({pca_3d.explained_variance_ratio_[0]:.1%})', fontsize=11)
ax.set_ylabel(f'PC2 ({pca_3d.explained_variance_ratio_[1]:.1%})', fontsize=11)
ax.set_zlabel(f'PC3 ({pca_3d.explained_variance_ratio_[2]:.1%})', fontsize=11)
ax.set_title('3D PCA: Soil Samples by Region', fontsize=14, fontweight='bold')
ax.legend(title='Region', fontsize=9)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f'3 PCs explain {pca_3d.explained_variance_ratio_.sum():.1%} variance')
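
Whether two or three components are enough depends on how much variance you want to retain. The sketch below finds the smallest number of leading components that reaches a chosen threshold. `components_needed` is a hypothetical helper and the ratios are invented; real values would come from `explained_variance_ratio_` of a PCA fitted with all components.

```python
import numpy as np

def components_needed(explained_variance_ratio, threshold=0.90):
    """Smallest number of leading PCs whose cumulative variance ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    if cumulative[-1] < threshold:
        raise ValueError('threshold is not reached even with all supplied components')
    # Index of the first cumulative value >= threshold, converted to a count
    return int(np.searchsorted(cumulative, threshold) + 1)

# Invented variance ratios for a five-component fit
ratios = np.array([0.45, 0.25, 0.15, 0.10, 0.05])

print(components_needed(ratios, threshold=0.80))
print(components_needed(ratios, threshold=0.90))
```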

## 7. Identify Extreme Soils

In [None]:
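print('Extreme Soil Samples:')
print('='*70)

# Row positions of the samples at each end of PC1 and PC2
idx_high_pc1 = X_2d[:, 0].argmax()
idx_low_pc1 = X_2d[:, 0].argmin()
idx_high_pc2 = X_2d[:, 1].argmax()
idx_low_pc2 = X_2d[:, 1].argmin()

id_cols = ['sample_id', 'region', 'soil_type']
nutrient_cols = ['nitrogen_ppm', 'phosphorus_ppm', 'organic_matter_percent']

# The sign of a PC is arbitrary: "most fertile" holds only if the nutrient loadings on PC1 are positive
print('\nHighest PC1 (most fertile if nutrient loadings are positive):')
print(df.iloc[idx_high_pc1][id_cols + nutrient_cols])

print('\nLowest PC1 (least fertile under the same assumption):')
print(df.iloc[idx_low_pc1][id_cols + nutrient_cols])

print('\nHighest PC2:')
print(df.iloc[idx_high_pc2][id_cols])

print('\nLowest PC2:')
print(df.iloc[idx_low_pc2][id_cols])

print('\n💡 If PC1 is a fertility gradient, its two extremes should differ sharply in nutrient levels')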
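
Reading the highest PC1 score as "most fertile" assumes the nutrient features load positively on PC1. The sign of a principal component carries no meaning, so it is worth checking and, if needed, flipping the component before labeling its ends. A minimal sketch with a hypothetical helper `orient_component` and invented values; in this notebook the inputs would be `X_2d[:, 0]`, `pca_2d.components_[0]` and the position of a nutrient feature in `feature_names`.

```python
import numpy as np

def orient_component(scores, loadings, reference_idx):
    """Flip a PC's scores and loadings so the reference feature loads positively."""
    sign = -1.0 if loadings[reference_idx] < 0 else 1.0
    return scores * sign, loadings * sign

# Invented PC1 in which the reference nutrient (index 0) happens to load negatively
pc1_scores = np.array([-2.1, 0.3, 1.8])
pc1_loadings = np.array([-0.62, -0.58, 0.10])

scores_fixed, loadings_fixed = orient_component(pc1_scores, pc1_loadings, reference_idx=0)

print(scores_fixed)
print(loadings_fixed)
```

Flipping scores and loadings together leaves the decomposition unchanged; it only makes "high PC1" line up with "high in the reference feature".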

## Final Summary

### What We Accomplished

1. **Complete PCA Pipeline**: From raw data to insights
2. **Dimensionality Reduction**: 16 features → 2-3 components
3. **Interpretation**: Components have agricultural meaning
4. **Visualization**: Multiple views reveal patterns
5. **Actionable results**: PC scores can inform soil management decisions

### Key Insights

- **PC1**: Soil fertility gradient (nutrient-rich to nutrient-poor)
- **PC2**: Texture variations (sandy to clay)
- **Regional patterns**: Some regions cluster together
- **Soil type separation**: Different types occupy different PCA regions

### Agricultural Applications

1. **Soil Classification**: Group similar soils automatically
2. **Recommendation Systems**: Target fertilizer by PC scores
3. **Monitoring**: Track soil changes over time in PCA space
4. **Anomaly Detection**: Identify unusual soil samples
5. **Variable Rate Application**: Use PCs for precision agriculture
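
As one illustration of the anomaly detection idea above, samples can be flagged by their standardized distance from the center of the PCA scores. This is a sketch under stated assumptions, not a tuned detector: `flag_outliers` is a hypothetical helper, the cutoff of 3.0 is an arbitrary choice, and the data are synthetic with one planted unusual sample. In this notebook the input would be `X_2d`.

```python
import numpy as np

def flag_outliers(scores, threshold=3.0):
    """Flag samples whose standardized distance from the score center exceeds threshold."""
    scores = np.asarray(scores, dtype=float)
    centered = scores - scores.mean(axis=0)
    z = centered / scores.std(axis=0, ddof=1)   # scale each PC to unit variance
    distance = np.sqrt((z ** 2).sum(axis=1))
    return distance > threshold, distance

# Synthetic scores with one planted unusual sample at row 0
rng = np.random.default_rng(42)
scores = rng.normal(size=(200, 2)) * [2.0, 1.0]
scores[0] = [12.0, -6.0]

is_outlier, distance = flag_outliers(scores)
print('Flagged rows:', np.flatnonzero(is_outlier))
print(f'Distance of planted sample: {distance[0]:.1f}')
```

Dividing by each component's standard deviation keeps PC1, which has the largest variance by construction, from dominating the distance.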

### Next Steps for Your Projects

- Apply to your own soil/crop datasets
- Combine with crop yield data
- Use PCA components as ML features
- Build soil quality indices from PCs
- Integrate with GIS for spatial analysis

---

**Congratulations!** 🎉

You've mastered PCA from theory to practice!

### Your Learning Journey
1. ✓ Understanding variance
2. ✓ Covariance and eigenvectors
3. ✓ 2D manual calculation
4. ✓ Implementation from scratch
5. ✓ sklearn PCA
6. ✓ Real agricultural application

**You're now ready to apply PCA to any dataset!**