# Soil Data Exploration - Agricultural PCA Application

## Introduction

Welcome to the practical application! In this notebook we explore a real soil dataset and prepare it for principal component analysis (PCA).

### What You'll Learn
1. Load and inspect the soil dataset
2. Perform exploratory data analysis (EDA)
3. Understand feature correlations
4. Preprocess the data for PCA
5. See why PCA is useful for soil data

### The Dataset
200 soil samples with:
- Physical properties (texture, moisture)
- Chemical properties (pH, NPK, micronutrients)
- Derived properties (CEC, EC)

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
pd.set_option('display.precision', 2)

print('✓ Libraries imported successfully!')

## 1. Load the Data

In [None]:
# Load soil data
data_path = '../../../datasets/soil/sample_soil_data.csv'
df = pd.read_csv(data_path)

print('Soil Dataset Loaded')
print('=' * 60)
print(f'Samples: {len(df)}')
print(f'Features: {len(df.columns)}')
print('\nColumn names:')
for col in df.columns:
    print(f'  - {col}')

print('\n' + '=' * 60)
print('First 5 samples:')
df.head()

In [None]:
# Data info
print('Dataset Information:')
print('=' * 60)
df.info()

print('\n' + '=' * 60)
print('Missing values:')
missing = df.isnull().sum()
if missing.sum() == 0:
    print('✓ No missing values!')
else:
    print(missing[missing > 0])
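If the check above does report missing values, they have to be dealt with before PCA, because scikit-learn's `PCA` rejects input containing NaNs. The following is a minimal sketch, assuming median imputation is acceptable for the analysis. The `demo` frame and its values are invented for illustration and are not taken from the soil dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only
demo = pd.DataFrame({
    'pH': [6.1, np.nan, 7.0, 6.5],
    'nitrogen_ppm': [40.0, 55.0, np.nan, 47.0],
})

# Share of missing values per column, as a percentage
missing_pct = demo.isnull().mean() * 100
print(missing_pct)

# Fill each numeric column with its own median
demo_filled = demo.fillna(demo.median(numeric_only=True))
print(demo_filled)
```

Median imputation is only one option. Dropping incomplete rows with `dropna()` may be preferable when very few samples are affected.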

## 2. Basic Statistics

In [None]:
# Separate numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print('Column Types:')
print(f'Numeric: {len(numeric_cols)}')
print(f'Categorical: {len(categorical_cols)}')

print('\nStatistical Summary:')
df[numeric_cols].describe()

In [None]:
# Categorical distributions
print('Categorical Variable Distributions:')
print('=' * 60)

for col in categorical_cols:
    print(f'\n{col}:')
    print(df[col].value_counts())

## 3. Distribution Visualizations

In [None]:
# Plot distributions of key nutrients
key_features = ['pH', 'nitrogen_ppm', 'phosphorus_ppm', 'potassium_ppm', 
               'organic_matter_percent', 'moisture_percent']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, col in enumerate(key_features):
    axes[idx].hist(df[col], bins=30, edgecolor='black', alpha=0.7, color='steelblue')
    axes[idx].axvline(df[col].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
    axes[idx].axvline(df[col].median(), color='green', linestyle='--', linewidth=2, label='Median')
    axes[idx].set_xlabel(col, fontsize=11)
    axes[idx].set_ylabel('Frequency', fontsize=11)
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].legend()
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print('💡 These distributions help us understand the typical ranges for each soil property')

## 4. Correlation Analysis

In [None]:
# Calculate correlation matrix
correlation_matrix = df[numeric_cols].corr()

# Plot heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
           center=0, square=True, linewidths=1, cbar_kws={'shrink': 0.8})
plt.title('Soil Features Correlation Matrix', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print('\n💡 Key Observations:')
print('  • Strong correlations suggest redundancy - perfect for PCA!')
print('  • NPK nutrients often correlate (soil fertility factor)')
print('  • Texture components are related (constrained to sum to 100%)')
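The observation that texture components are constrained to sum to 100% has a concrete consequence: the columns are linearly dependent, so at least one pairwise correlation is forced to be negative and the correlation matrix is singular. A small sketch with synthetic data illustrates this. The column names `sand_percent` and `silt_percent` are assumptions (only `clay_percent` appears in this notebook), and the values are randomly generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical texture fractions that sum to 100% for every sample
fractions = rng.dirichlet([4, 3, 2], size=200) * 100
texture = pd.DataFrame(
    fractions, columns=['sand_percent', 'silt_percent', 'clay_percent']
)

# Confirm the closure constraint holds row by row
row_sums = texture.sum(axis=1)
print(np.allclose(row_sums, 100))

# The sum constraint induces negative correlations between the parts
texture_corr = texture.corr()
print(texture_corr.round(2))

# A (numerically) zero eigenvalue signals the linear dependence
smallest_eig = np.linalg.eigvalsh(texture_corr.values).min()
print(smallest_eig)
```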

In [None]:
# Find highly correlated pairs
print('Highly Correlated Feature Pairs (|r| > 0.7):')
print('=' * 60)

high_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.7:
            high_corr.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

high_corr.sort(key=lambda x: abs(x[2]), reverse=True)
for feat1, feat2, corr in high_corr[:10]:  # Top 10
    print(f'{feat1:25} <-> {feat2:25} : {corr:6.3f}')

print(f'\nTotal pairs with |r| > 0.7: {len(high_corr)}')
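The nested loop above can also be written in vectorized form by masking the upper triangle of the correlation matrix. This is an alternative sketch rather than a replacement for the cell above. The `toy` frame is synthetic and its column names are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical data: 'b' is built to track 'a' closely, 'c' is independent noise
a = rng.normal(size=300)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=300),
    'c': rng.normal(size=300),
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Filter by absolute correlation and sort strongest first
strong = pairs[pairs.abs() > 0.7].sort_values(key=np.abs, ascending=False)
print(strong)
```

The result is a Series indexed by feature pairs, which is convenient for further filtering or for export with `to_frame()`.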

## 5. Scatter Plots of Key Relationships

In [None]:
# Visualize key relationships
fig, axes = plt.subplots(2, 2, figsize=(16, 14))

# N vs P
axes[0, 0].scatter(df['nitrogen_ppm'], df['phosphorus_ppm'], 
                  alpha=0.6, s=50, edgecolors='k', linewidths=0.5)
axes[0, 0].set_xlabel('Nitrogen (ppm)', fontsize=12)
axes[0, 0].set_ylabel('Phosphorus (ppm)', fontsize=12)
axes[0, 0].set_title('Nitrogen vs Phosphorus', fontsize=13, fontweight='bold')
axes[0, 0].grid(alpha=0.3)

# P vs K
axes[0, 1].scatter(df['phosphorus_ppm'], df['potassium_ppm'],
                  alpha=0.6, s=50, edgecolors='k', linewidths=0.5, color='coral')
axes[0, 1].set_xlabel('Phosphorus (ppm)', fontsize=12)
axes[0, 1].set_ylabel('Potassium (ppm)', fontsize=12)
axes[0, 1].set_title('Phosphorus vs Potassium', fontsize=13, fontweight='bold')
axes[0, 1].grid(alpha=0.3)

# Organic Matter vs CEC
axes[1, 0].scatter(df['organic_matter_percent'], df['cec_meq_100g'],
                  alpha=0.6, s=50, edgecolors='k', linewidths=0.5, color='green')
axes[1, 0].set_xlabel('Organic Matter (%)', fontsize=12)
axes[1, 0].set_ylabel('CEC (meq/100g)', fontsize=12)
axes[1, 0].set_title('Organic Matter vs CEC', fontsize=13, fontweight='bold')
axes[1, 0].grid(alpha=0.3)

# Clay vs CEC
axes[1, 1].scatter(df['clay_percent'], df['cec_meq_100g'],
                  alpha=0.6, s=50, edgecolors='k', linewidths=0.5, color='brown')
axes[1, 1].set_xlabel('Clay (%)', fontsize=12)
axes[1, 1].set_ylabel('CEC (meq/100g)', fontsize=12)
axes[1, 1].set_title('Clay vs CEC', fontsize=13, fontweight='bold')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print('💡 These correlations show redundancy - we can reduce dimensions with PCA!')

## 6. Feature Scaling Analysis

In [None]:
# Check feature scales
print('Feature Scale Comparison:')
print('=' * 60)

scales = df[numeric_cols].describe().loc[['mean', 'std', 'min', 'max']]
print(scales.T)

print('\n💡 Notice the huge scale differences!')
print('  • pH: 5-8 range')
print('  • Potassium: 50-400 range')
print('  • Iron: 20-200 range')
print('\n  → We MUST standardize before PCA!')
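To see why scale differences matter, the sketch below runs PCA on two independent synthetic features, once on raw values and once after standardization. The means and spreads are assumptions chosen to mimic a narrow-range and a wide-range soil property. They are not derived from the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical, independent features on very different scales
ph = rng.normal(loc=6.5, scale=0.5, size=200)        # narrow range
potassium = rng.normal(loc=200, scale=80, size=200)  # wide range
X_demo = np.column_stack([ph, potassium])

# PCA on raw values: the wide-range feature dominates the first component
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# PCA after standardization: both features contribute comparably
scaled_ratio = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X_demo)
).explained_variance_ratio_

print('Raw:   ', raw_ratio.round(4))
print('Scaled:', scaled_ratio.round(4))
```

On raw data the first component mostly reflects the unit of measurement, not any structure in the data, which is why standardization comes first.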

## 7. Prepare Data for PCA

In [None]:
from sklearn.preprocessing import StandardScaler

# Extract features for PCA (exclude IDs and categorical)
features_for_pca = [col for col in numeric_cols 
                   if col not in ['sample_id']]

X = df[features_for_pca].values
feature_names = features_for_pca

print(f'Features for PCA: {len(feature_names)}')
print('\nFeature list:')
for i, name in enumerate(feature_names, 1):
    print(f'  {i:2d}. {name}')

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f'\nOriginal data shape: {X.shape}')
print(f'Scaled data shape: {X_scaled.shape}')
print('\n✓ Data standardized (mean=0, std=1)')

# Verify standardization
print(f'\nMeans after scaling: {X_scaled.mean(axis=0)}')
print(f'Stds after scaling:  {X_scaled.std(axis=0)}')
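The printed means will typically be tiny nonzero numbers instead of exact zeros because of floating-point rounding, so a tolerance-based check is more reliable than reading the output by eye. The sketch below also shows that `StandardScaler` matches a manual z-score computed with the population standard deviation (`ddof=0`). The data is synthetic and the chosen means and spreads are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical features with different means and spreads
X_demo = rng.normal(loc=[6.5, 200.0, 50.0], scale=[0.5, 80.0, 20.0], size=(200, 3))

X_std = StandardScaler().fit_transform(X_demo)

# Manual z-score; StandardScaler uses the population std (ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=0)

print(np.allclose(X_std, X_manual))
print(np.allclose(X_std.mean(axis=0), 0.0))
print(np.allclose(X_std.std(axis=0), 1.0))
```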

## 8. Save Preprocessed Data

In [None]:
# Save for next notebooks
import pickle

data_dict = {
    'X_scaled': X_scaled,
    'X_original': X,
    'feature_names': feature_names,
    'scaler': scaler,
    'df': df,
    'categorical_cols': categorical_cols
}

with open('soil_data_preprocessed.pkl', 'wb') as f:
    pickle.dump(data_dict, f)

print('✓ Preprocessed data saved to: soil_data_preprocessed.pkl')
print('\nThis will be used in the next notebooks for PCA analysis')
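For reference, a later notebook could read the file back roughly as follows. This is a sketch of a save-and-load round trip using a temporary directory and a small stand-in dictionary, so it runs without the soil data. The `payload` contents are placeholders.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical stand-in for the dictionary saved above
payload = {
    'X_scaled': np.zeros((3, 2)),
    'feature_names': ['pH', 'nitrogen_ppm'],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / 'soil_data_preprocessed.pkl'

    with open(pkl_path, 'wb') as f:
        pickle.dump(payload, f)

    # Only unpickle files you created yourself or otherwise trust
    with open(pkl_path, 'rb') as f:
        restored = pickle.load(f)

print(restored['feature_names'])
print(restored['X_scaled'].shape)
```

Because the pickle stores the fitted `scaler` alongside the data, loading it with a different scikit-learn version than the one used for saving may produce warnings or errors.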

## Key Takeaways

### What We Learned

1. **Dataset Structure**: 200 soil samples, 16 numeric features
2. **High Correlations**: Many features are correlated (redundancy)
3. **Scale Differences**: Features span very different ranges
4. **Perfect for PCA**: Correlations + many features = great PCA candidate

### Why PCA is Useful Here

- **Reduce complexity**: 16 features → 2-3 components
- **Remove redundancy**: Correlated nutrients combined
- **Enable visualization**: Can plot in 2D/3D
- **Reveal patterns**: Find underlying soil quality factors
- **Interpretability**: Components may represent fertility, texture, etc.
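How many components are actually enough is usually judged from the cumulative explained variance. The sketch below uses synthetic data in which six observed features are driven by two latent factors. The loadings, noise level, and the 90% threshold are all assumptions for illustration, and the numbers say nothing about the soil dataset itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Hypothetical data: 6 observed features driven by 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
loadings = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],  # factor 1 drives the first three features
    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],  # factor 2 drives the last three features
])
X_demo = latent @ loadings + rng.normal(scale=0.2, size=(200, 6))

# Standardize, then fit PCA with all components retained
pca = PCA().fit(StandardScaler().fit_transform(X_demo))
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that explains at least 90% of the variance
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3))
print(n_keep)
```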

### Next Steps

Now we'll:
1. Apply PCA to this soil data
2. Interpret the components
3. Visualize in reduced dimensions
4. Extract agronomic insights

---

**Great work!** Data is explored and ready for PCA.

Continue to: `soil_pca_analysis.ipynb`