# Week 5 Workshop: Univariate Analysis - Agricultural Evaluations
## EDA Part 1: Analyzing Crop Production Data

**Student Name:** (Your name here)  
**Date:** (Today's date)  
**Dataset:** Evaluaciones Agropecuarias (EVA) from datos.gov.co

---

### Instructions

Apply the **5-step univariate analysis framework** to each of the 3 variables below:

1. **producci_n_t** - Agricultural production in tons
2. **rea_sembrada_ha** - Area planted in hectares
3. **rendimiento_t_ha** - Yield in tons per hectare

For each variable, complete all 5 steps:
- Identify (data type, missing values)
- Summarize (mean, median, mode)
- Spread (std, IQR)
- Visualize (histogram, boxplot)
- Detect (outliers using IQR method)
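
The Detect step uses the 1.5 × IQR rule: any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is flagged. A minimal sketch of the arithmetic on a made-up series (the values are illustrative and not taken from the EVA dataset):

```python
import pandas as pd

# Hypothetical toy values (not EVA data); 250 is an obvious extreme.
toy = pd.Series([10, 12, 13, 15, 18, 20, 22, 250])

q1 = toy.quantile(0.25)   # 25th percentile
q3 = toy.quantile(0.75)   # 75th percentile
iqr = q3 - q1             # width of the middle 50% of the data

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values that fall outside the two fences.
outliers = toy[(toy < lower_bound) | (toy > upper_bound)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Bounds: {lower_bound} to {upper_bound}")
print(f"Outliers: {outliers.tolist()}")
```

Here only 250 falls outside the fences, even though it pulls the mean far above the median.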

---

## Setup

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.2f}'.format)

# Plot style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

In [None]:
# Load the dataset
df = pd.read_csv('../data/evaluaciones_agropecuarias.csv')

# Quick overview
print(f"Dataset shape: {df.shape}")
print(f"\nColumns:\n{df.columns.tolist()}")
df.head()

In [None]:
# Check data types and missing values
df.info()
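
If `df.info()` reports one of the three variables as text rather than a numeric type, the statistics in the later steps will fail or mislead. One possible fix, sketched below, assumes the cause is stray text in otherwise numeric cells. The toy column is hypothetical, so check the real data before coercing anything:

```python
import pandas as pd

# Hypothetical toy frame (not EVA data): a numeric column that was read in as text.
toy = pd.DataFrame({"producci_n_t": ["12.5", "7", "n/a", None]})

# errors="coerce" converts anything that cannot be parsed into NaN instead of raising.
toy["producci_n_t"] = pd.to_numeric(toy["producci_n_t"], errors="coerce")

print(toy["producci_n_t"].dtype)
print(f"Missing after coercion: {toy['producci_n_t'].isna().sum()}")
```

Coercion increases the missing-value count, so compare it before and after to see how many cells were unparseable.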

---

## Helper Function (Use this for all variables)

Run this cell to define a reusable analysis function.

In [None]:
def univariate_analysis(df, var):
    """
    Perform complete univariate analysis on a variable.
    
    Parameters:
    -----------
    df : pandas DataFrame
        The dataset
    var : str
        The column name to analyze
        
    Returns:
    --------
    dict : Summary statistics
    """
    print("=" * 60)
    print(f"UNIVARIATE ANALYSIS: {var}")
    print("=" * 60)
    
    # Step 1: Identify
    print(f"\n--- STEP 1: IDENTIFY ---")
    print(f"Data type: {df[var].dtype}")
    print(f"Total records: {len(df):,}")
    print(f"Non-null values: {df[var].count():,}")
    print(f"Missing values: {df[var].isna().sum():,} ({df[var].isna().sum()/len(df)*100:.2f}%)")
    
    # Step 2: Summarize (Central Tendency)
    print(f"\n--- STEP 2: CENTRAL TENDENCY ---")
    mean_val = df[var].mean()
    median_val = df[var].median()
    # Compute the mode once; mode() returns an empty Series if the column is all-missing
    modes = df[var].mode()
    mode_val = modes.iloc[0] if not modes.empty else np.nan
    ratio = mean_val / median_val if median_val != 0 else np.nan
    
    print(f"Mean: {mean_val:,.2f}")
    print(f"Median: {median_val:,.2f}")
    print(f"Mode: {mode_val:,.2f}")
    print(f"Mean/Median ratio: {ratio:.2f}")
    
    # Step 3: Spread (Dispersion)
    print(f"\n--- STEP 3: DISPERSION ---")
    std_val = df[var].std()
    var_val = df[var].var()
    q1 = df[var].quantile(0.25)
    q3 = df[var].quantile(0.75)
    iqr = q3 - q1
    
    print(f"Standard Deviation: {std_val:,.2f}")
    print(f"Variance: {var_val:,.2f}")
    print(f"Q1 (25th): {q1:,.2f}")
    print(f"Q3 (75th): {q3:,.2f}")
    print(f"IQR: {iqr:,.2f}")
    print(f"Range: {df[var].min():,.2f} to {df[var].max():,.2f}")
    
    # Step 4: Visualize
    print(f"\n--- STEP 4: VISUALIZATION ---")
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Histogram
    axes[0].hist(df[var].dropna(), bins=50, edgecolor='black', alpha=0.7, color='steelblue')
    axes[0].axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:,.0f}')
    axes[0].axvline(median_val, color='green', linestyle='-', linewidth=2, label=f'Median: {median_val:,.0f}')
    axes[0].set_xlabel(var, fontsize=12)
    axes[0].set_ylabel('Frequency', fontsize=12)
    axes[0].set_title(f'Distribution of {var}', fontsize=14)
    axes[0].legend()
    
    # Boxplot (vertical is the default orientation, so no orientation argument is needed)
    axes[1].boxplot(df[var].dropna(), patch_artist=True,
                    boxprops=dict(facecolor='steelblue', alpha=0.7))
    axes[1].set_ylabel(var, fontsize=12)
    axes[1].set_title(f'Box Plot of {var}', fontsize=14)
    
    plt.tight_layout()
    plt.show()
    
    # Step 5: Detect Outliers
    print(f"\n--- STEP 5: OUTLIER DETECTION (IQR Method) ---")
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = df[(df[var] < lower_bound) | (df[var] > upper_bound)]
    lower_outliers = df[df[var] < lower_bound]
    upper_outliers = df[df[var] > upper_bound]
    
    print(f"Lower bound: {lower_bound:,.2f}")
    print(f"Upper bound: {upper_bound:,.2f}")
    print(f"Total outliers: {len(outliers):,} ({len(outliers)/len(df)*100:.2f}%)")
    print(f"  - Lower outliers: {len(lower_outliers):,}")
    print(f"  - Upper outliers: {len(upper_outliers):,}")
    
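    # Distribution type (rule of thumb based on the mean/median ratio)
    if pd.isna(ratio):
        # Ratio is undefined (median is 0 or the column has no data),
        # so do not fall through to "symmetric" by accident
        dist_type = "Undetermined (mean/median ratio undefined)"
        best_measure = "Median"
    elif ratio > 1.2:
        dist_type = "Right-skewed"
        best_measure = "Median"
    elif ratio < 0.8:
        dist_type = "Left-skewed"
        best_measure = "Median"
    else:
        dist_type = "Approximately symmetric"
        best_measure = "Mean"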
    
    print(f"\n--- SUMMARY ---")
    print(f"Distribution type: {dist_type}")
    print(f"Recommended central measure: {best_measure}")
    print("=" * 60)
    
    return {
        'variable': var,
        'mean': mean_val,
        'median': median_val,
        'mode': mode_val,
        'std': std_val,
        'iqr': iqr,
        'distribution': dist_type,
        'outliers_pct': len(outliers)/len(df)*100,
        'best_measure': best_measure
    }
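
The 1.2 and 0.8 thresholds on the mean/median ratio in the helper are a rule of thumb, not a formal test. As a cross-check, pandas exposes the sample skewness through `Series.skew()`: positive values suggest a right tail, negative values a left tail. A minimal sketch on synthetic data (the two samples and the seed are assumptions chosen for illustration, not EVA values):

```python
import numpy as np
import pandas as pd

# Synthetic samples with a fixed seed (illustrative only, not EVA data).
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))
symmetric = pd.Series(rng.normal(loc=100.0, scale=10.0, size=1000))

for name, s in [("skewed", skewed), ("symmetric", symmetric)]:
    ratio = s.mean() / s.median()
    print(f"{name}: mean/median = {ratio:.2f}, skewness = {s.skew():.2f}")
```

When the two indicators disagree, or the median is close to zero, rely on the histogram rather than on either number alone.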

---

# Variable 1: producci_n_t

**Description:** Agricultural production in tons

In [None]:
# YOUR CODE HERE: Run the univariate analysis for producci_n_t
# Hint: results_production = univariate_analysis(df, 'producci_n_t')


### Your Interpretation for producci_n_t

**Distribution type:** (Write your answer - Normal, Right-skewed, Left-skewed, or Bimodal)

**Explanation:** (Why does it have this distribution? What does it mean in the context of agricultural production?)

**Outlier investigation:** (What might cause the outliers? Which crops or regions could produce extremely high tonnage?)

**One-sentence summary:** (Summarize your key finding)

---

# Variable 2: rea_sembrada_ha

**Description:** Area planted in hectares

In [None]:
# YOUR CODE HERE: Run the univariate analysis for rea_sembrada_ha
# Hint: results_area = univariate_analysis(df, 'rea_sembrada_ha')


### Your Interpretation for rea_sembrada_ha

**Distribution type:** (Write your answer)

**Explanation:** (Why does it have this distribution? Think about farm sizes in Colombia.)

**Comparison to producci_n_t:** (Are the patterns similar? Why or why not?)

**One-sentence summary:** (Summarize your key finding)

---

# Variable 3: rendimiento_t_ha

**Description:** Yield in tons per hectare (production / area)

In [None]:
# YOUR CODE HERE: Run the univariate analysis for rendimiento_t_ha
# Hint: results_yield = univariate_analysis(df, 'rendimiento_t_ha')


### Your Interpretation for rendimiento_t_ha

**Distribution type:** (Write your answer)

**Explanation:** (This is a ratio variable - production per hectare. How does that affect the distribution compared to the raw production variable?)

**Outlier investigation:** (What might high-yield outliers represent? Think about crop efficiency.)

**One-sentence summary:** (Summarize your key finding)

---

# GroupBy: Compare Statistics Across Groups

Now that you have analyzed each variable individually, use **GroupBy** to see how the statistics change across groups.

**Concept:** GroupBy splits your data by a category, applies a statistic to each group, and combines the results. Think of sorting M&Ms by color and counting each pile.

**Pattern:** `df.groupby('GROUP_COLUMN')['VALUE_COLUMN'].operation()`
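
To make the split-apply-combine idea concrete, here is a small sketch on a made-up table. The column names mirror the EVA dataset, but the group labels and values are hypothetical:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the EVA dataset, values are made up.
toy = pd.DataFrame({
    "grupo_de_cultivo": ["CEREALES", "FRUTALES", "CEREALES", "FRUTALES", "HORTALIZAS"],
    "producci_n_t": [100.0, 40.0, 300.0, 60.0, 20.0],
})

# Split rows by group, apply mean() within each group, combine into one Series.
group_means = toy.groupby("grupo_de_cultivo")["producci_n_t"].mean()

print(group_means.sort_values(ascending=False))
```

The result is a Series indexed by group label, with one mean per group instead of one mean for the whole column.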

### Task 1: Mean Production by Crop Group (grupo_de_cultivo)

Calculate the mean producci_n_t for each crop group. Which crop group has the highest mean production per record?

In [None]:
# YOUR CODE HERE: Calculate mean production by crop group
# Hint: df.groupby('grupo_de_cultivo')['producci_n_t'].mean().sort_values(ascending=False)


### Task 2: Median Yield by Crop Cycle (ciclo_de_cultivo)

Calculate the median rendimiento_t_ha for each crop cycle. Do permanent crops (PERMANENTE) have different yields than transitory crops (TRANSITORIO)?

In [None]:
# YOUR CODE HERE: Calculate median yield by crop cycle
# Hint: df.groupby('ciclo_de_cultivo')['rendimiento_t_ha'].median()


### Task 3: Count and Total Production by Department

How many records per department? Which department has the highest total production?

In [None]:
# YOUR CODE HERE: Count records per department (top 10)
# Hint: df.groupby('departamento')['producci_n_t'].count().sort_values(ascending=False).head(10)
# Note: count() skips missing values in producci_n_t; use .size() to count every row


# YOUR CODE HERE: Total production per department (top 5)
# Hint: df.groupby('departamento')['producci_n_t'].sum().sort_values(ascending=False).head(5)
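
Several statistics can also be computed in a single GroupBy pass with `.agg()`. The sketch below uses a made-up table (department names are real, values are hypothetical) and shows how row counts, non-missing counts, and totals can differ:

```python
import pandas as pd

# Hypothetical toy data (not EVA values); one production value is missing.
toy = pd.DataFrame({
    "departamento": ["ANTIOQUIA", "ANTIOQUIA", "META", "META", "META", "HUILA"],
    "producci_n_t": [120.0, 80.0, 500.0, None, 300.0, 50.0],
})

per_dept = (
    toy.groupby("departamento")["producci_n_t"]
       .agg(
           rows="size",          # every row in the group, missing or not
           non_missing="count",  # only rows with a production value
           total="sum",          # missing values are skipped in the sum
       )
       .sort_values("total", ascending=False)
)

print(per_dept)
```

In this toy table META has three rows but only two non-missing production values, which is why `size` and `count` can give different answers to "how many records?".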


### GroupBy Reflection

**Question:** Compare the overall mean production (from Variable 1 analysis) with the group means by grupo_de_cultivo. What insight does GroupBy reveal that a single statistic hides?

**Your answer:** (Write your answer here)

---

# Final Summary Table

Compile all your findings into a summary table.

In [None]:
# YOUR CODE HERE: Create the summary table using your results
# Make sure you have run the univariate_analysis for all 3 variables first
# and stored them in results_production, results_area, results_yield

summary_data = [
    results_production,
    results_area,
    results_yield
]

summary_df = pd.DataFrame(summary_data)
summary_df = summary_df[['variable', 'mean', 'median', 'distribution', 'outliers_pct', 'best_measure']]
summary_df.columns = ['Variable', 'Mean', 'Median', 'Distribution', 'Outliers (%)', 'Best Measure']

print("\n" + "=" * 80)
print("FINAL SUMMARY TABLE")
print("=" * 80)
print(summary_df.to_string(index=False))
print("=" * 80)

---

# Reflection Questions

Answer these questions based on your analysis:

### 1. Pattern Comparison
Do all three variables follow similar distribution patterns? Why might that be?

**Your answer:**

### 2. Mean vs Median
For which variable is the difference between mean and median the largest? What does this indicate?

**Your answer:**

### 3. Outlier Decision
If you had to decide whether to keep or remove outliers for further analysis, what would you choose for each variable and why?

**Your answer:**

### 4. Next Steps
Based on your univariate analysis, what bivariate questions would you like to explore in Week 6?

**Your answer:**

---

## Submission Checklist

Before submitting, verify that you have:

- [ ] Completed analysis for all 3 variables
- [ ] Added interpretations for each variable
- [ ] Completed all 3 GroupBy tasks
- [ ] Created the final summary table
- [ ] Answered all reflection questions
- [ ] Run all cells to ensure no errors
- [ ] Saved the notebook

---

*Week 5 Workshop - Data Analytics Course*