[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wasim/Data-Science/blob/main/data-analyst-roadmap/05_statistics_for_data_analysis/04_correlation_analysis.ipynb)

# Correlation Analysis

Discover relationships between variables.

## What is Correlation?
- Measures the strength and direction of the relationship between two variables
- Range: -1 to +1
- Positive: Both increase together
- Negative: One increases, other decreases
- Zero: No linear relationship
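
The three cases in the list can be checked on small hand-made arrays. This is a minimal sketch using `np.corrcoef`; the arrays are made up for illustration and are not part of the notebook's datasets.

```python
import numpy as np

x = np.arange(-5, 6, dtype=float)  # symmetric around zero: -5, ..., 5

# Perfect positive linear relationship
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# Perfect negative linear relationship
r_neg = np.corrcoef(x, -3 * x + 4)[0, 1]

# Strong but non-linear (U-shaped) relationship: r is close to zero
r_zero = np.corrcoef(x, x ** 2)[0, 1]

print(f"positive: {r_pos:.3f}")
print(f"negative: {r_neg:.3f}")
print(f"U-shaped: {abs(r_zero):.3f}")
```

In the last case `y` is fully determined by `x`, yet the coefficient is essentially zero. A value near zero rules out only a linear relationship, not a relationship of any kind.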

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

sns.set_style('whitegrid')
np.random.seed(42)

## 1. Pearson Correlation

Measures the linear relationship between two continuous variables.

In [None]:
# Generate sample data
n = 100
study_hours = np.random.uniform(1, 10, n)
test_scores = (
    50 + 5 * study_hours + 
    np.random.normal(0, 5, n)
)

df = pd.DataFrame({
    'Study_Hours': study_hours,
    'Test_Score': test_scores
})

df.head()

In [None]:
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(
    df['Study_Hours'], 
    df['Test_Score'],
    alpha=0.6
)
plt.xlabel('Study Hours')
plt.ylabel('Test Score')
plt.title('Study Hours vs Test Score')
plt.grid(True)
plt.show()

In [None]:
# Calculate Pearson correlation
corr, p_value = stats.pearsonr(
    df['Study_Hours'], 
    df['Test_Score']
)

print("Pearson Correlation")
print(f"Correlation coefficient (r): {corr:.4f}")
print(f"p-value: {p_value:.4g}")  # 'g' format avoids showing tiny p-values as 0.0000

# Interpret strength (rule-of-thumb cutoffs; conventions vary by field)
if abs(corr) > 0.7:
    strength = "Strong"
elif abs(corr) > 0.4:
    strength = "Moderate"
else:
    strength = "Weak"

direction = "positive" if corr > 0 else "negative"

print(f"\nInterpretation: {strength} {direction} "
      f"correlation")

if p_value < 0.05:
    print("Result: Statistically significant")
else:
    print("Result: Not statistically significant")
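
For reference, the coefficient can also be computed directly from its definition: the sum of the products of the deviations from each mean, divided by the square root of the product of the summed squared deviations. The sketch below assumes a hypothetical helper `pearson_manual` and its own made-up data (generated with a separate random generator, so the numbers will not match the cell above); it compares the result with `np.corrcoef`.

```python
import numpy as np

def pearson_manual(x, y):
    """Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up data with the same structure as the study-hours example
rng = np.random.default_rng(0)
hours = rng.uniform(1, 10, 100)
scores = 50 + 5 * hours + rng.normal(0, 5, 100)

r_manual = pearson_manual(hours, scores)
r_numpy = np.corrcoef(hours, scores)[0, 1]
print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```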

## 2. Spearman Correlation

Measures the monotonic relationship between two variables using their ranks, 
so it also works with ordinal data.
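
Spearman's ρ is the Pearson correlation of the ranks of the values rather than of the values themselves. The sketch below illustrates this with `scipy.stats.rankdata`; the data are made up and use a separate random generator, which is an assumption of this example.

```python
import numpy as np
from scipy import stats

# Made-up curved data with an increasing trend
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = x ** 2 + rng.normal(0, 5, 100)

# Replace each value by its rank (tied values would share an average rank)
x_ranks = stats.rankdata(x)
y_ranks = stats.rankdata(y)

# Pearson on the ranks should agree with Spearman on the raw values
rho_from_ranks, _ = stats.pearsonr(x_ranks, y_ranks)
rho_scipy, _ = stats.spearmanr(x, y)

print(f"Pearson on ranks: {rho_from_ranks:.6f}")
print(f"stats.spearmanr:  {rho_scipy:.6f}")
```

Because only the ordering matters, any strictly increasing transformation of `x` or `y` leaves ρ unchanged.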

In [None]:
# Generate non-linear data with an increasing (monotonic) trend
x = np.linspace(0, 10, 100)
y = x ** 2 + np.random.normal(0, 5, 100)

df_nonlinear = pd.DataFrame({
    'X': x,
    'Y': y
})

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(df_nonlinear['X'], 
            df_nonlinear['Y'], 
            alpha=0.6)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Non-Linear Relationship')
plt.grid(True)
plt.show()

In [None]:
# Compare Pearson vs Spearman
pearson_r, pearson_p = stats.pearsonr(
    df_nonlinear['X'], 
    df_nonlinear['Y']
)
spearman_r, spearman_p = stats.spearmanr(
    df_nonlinear['X'], 
    df_nonlinear['Y']
)

print("Pearson Correlation:")
print(f"r = {pearson_r:.4f}, p = {pearson_p:.4g}")

print("\nSpearman Correlation:")
print(f"ρ = {spearman_r:.4f}, p = {spearman_p:.4g}")

print("\nNote: Spearman is rank-based, so it measures the "
      "monotonic trend rather than the straight-line fit.")

## 3. Correlation Matrix

Analyze multiple variables at once.

In [None]:
# Load sample dataset (features are already mean-centered and scaled)
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
df_diabetes = pd.DataFrame(
    diabetes.data[:, :5],
    columns=['Age', 'Sex', 'BMI', 'BP', 'S1']
)
df_diabetes['Target'] = diabetes.target

df_diabetes.head()

In [None]:
# Calculate correlation matrix
corr_matrix = df_diabetes.corr()

print("Correlation Matrix:")
print(corr_matrix.round(3))

In [None]:
# Visualize with heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    center=0,
    square=True,
    linewidths=1
)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

In [None]:
# Find strongest correlations with target
target_corr = corr_matrix['Target'].drop('Target')
target_corr_sorted = target_corr.abs().sort_values(
    ascending=False
)

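print("Correlations with Target (sorted by absolute value):")
# Iterate over the index only, so the earlier `corr` variable is not overwritten
for var in target_corr_sorted.index:
    print(f"{var}: {target_corr[var]:.3f}")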
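
Beyond correlations with a single target, it can be useful to list the strongest pairs anywhere in the matrix. The sketch below assumes a hypothetical helper `top_correlated_pairs` and made-up data; it keeps only the upper triangle of the matrix so each pair is reported once.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, k=3):
    """Return the k variable pairs with the largest |r| (hypothetical helper)."""
    corr = df.corr()
    # Keep the strict upper triangle: drops the diagonal and mirrored pairs
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by absolute value but keep the signed coefficients
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(k)

# Made-up data: b tracks a closely, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.3, size=200),
    'c': rng.normal(size=200),
})

print(top_correlated_pairs(demo, k=2))
```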

## 4. Scatter Plot Matrix

Visualize all pairwise relationships.

In [None]:
# Create pairplot
sns.pairplot(
    df_diabetes[['BMI', 'BP', 'S1', 'Target']],
    diag_kind='hist',
    plot_kws={'alpha': 0.6}
)
plt.suptitle('Pairwise Relationships', y=1.02)
plt.tight_layout()
plt.show()

## 5. Partial Correlation

Measure the relationship between two variables after controlling for a confounding variable.

In [None]:
# Example: Ice cream sales vs drowning
# (confounded by temperature)
temperature = np.random.uniform(15, 35, 100)
ice_cream = (
    10 + 2 * temperature + 
    np.random.normal(0, 5, 100)
)
drowning = (
    5 + 0.5 * temperature + 
    np.random.normal(0, 2, 100)
)

df_confound = pd.DataFrame({
    'Temperature': temperature,
    'Ice_Cream_Sales': ice_cream,
    'Drowning_Cases': drowning
})

# Correlation without controlling
r_raw, _ = stats.pearsonr(
    df_confound['Ice_Cream_Sales'],
    df_confound['Drowning_Cases']
)

print(f"Raw correlation: {r_raw:.3f}")
print("(Spurious! Both caused by temperature)")
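
The cell above reports only the raw correlation. One common way to obtain the partial correlation is to regress both variables on the control variable and then correlate the residuals. The sketch below does this with `np.polyfit`; `partial_corr` is a hypothetical helper rather than a library function, and the data are regenerated with a separate random generator and a larger sample (n = 500), both assumptions of this example.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))

    def residuals(v):
        slope, intercept = np.polyfit(z, v, 1)  # least-squares line v ~ z
        return v - (slope * z + intercept)

    return float(np.corrcoef(residuals(x), residuals(y))[0, 1])

# Made-up data with the same structure as the cell above
rng = np.random.default_rng(0)
n = 500
temperature = rng.uniform(15, 35, n)
ice_cream = 10 + 2 * temperature + rng.normal(0, 5, n)
drowning = 5 + 0.5 * temperature + rng.normal(0, 2, n)

r_raw = np.corrcoef(ice_cream, drowning)[0, 1]
r_partial = partial_corr(ice_cream, drowning, temperature)

print(f"Raw correlation:     {r_raw:.3f}")
print(f"Partial correlation: {r_partial:.3f}")
```

Once temperature is held fixed, the remaining correlation should be close to zero, because the two series share nothing except their dependence on temperature.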

## 6. Correlation ≠ Causation

**Important:** Correlation does NOT imply 
causation!

### Common Issues:
1. **Confounding variables**
2. **Reverse causation**
3. **Coincidence**

### Example:
- Ice cream sales ↔ Drowning deaths
- Both caused by temperature!

## 7. Real-World Example

In [None]:
# Marketing data
np.random.seed(42)
n = 200

ad_spend = np.random.uniform(1000, 10000, n)
social_media = np.random.uniform(100, 1000, n)
email_campaigns = np.random.randint(5, 50, n)

# Revenue influenced by all factors
revenue = (
    5000 + 
    0.8 * ad_spend + 
    2 * social_media + 
    100 * email_campaigns +
    np.random.normal(0, 1000, n)
)

df_marketing = pd.DataFrame({
    'Ad_Spend': ad_spend,
    'Social_Media': social_media,
    'Email_Campaigns': email_campaigns,
    'Revenue': revenue
})

df_marketing.head()

In [None]:
# Correlation analysis
corr_matrix = df_marketing.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.3f',
    cmap='RdYlGn',
    center=0,
    square=True
)
plt.title('Marketing Metrics Correlation')
plt.tight_layout()
plt.show()

print("\nCorrelations with Revenue:")
revenue_corr = corr_matrix['Revenue'].drop(
    'Revenue'
).sort_values(ascending=False)
for metric, r in revenue_corr.items():
    print(f"{metric}: {r:.3f}")
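
`DataFrame.corr()` returns coefficients only, so significance has to be tested separately. The sketch below loops over the metrics with `stats.pearsonr` to get a p-value for each one; the data are regenerated with a separate random generator (an assumption of this example), so the numbers will differ slightly from the heatmap above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up marketing data with the same structure as above
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'Ad_Spend': rng.uniform(1000, 10000, n),
    'Social_Media': rng.uniform(100, 1000, n),
    'Email_Campaigns': rng.integers(5, 50, n),
})
df['Revenue'] = (
    5000 + 0.8 * df['Ad_Spend'] + 2 * df['Social_Media']
    + 100 * df['Email_Campaigns'] + rng.normal(0, 1000, n)
)

# Pearson r and p-value for each metric against revenue
results = {}
for col in df.columns.drop('Revenue'):
    r, p = stats.pearsonr(df[col], df['Revenue'])
    results[col] = (r, p)
    print(f"{col}: r = {r:.3f}, p = {p:.3g}")
```

The size of each coefficient here reflects how much of the revenue variance a metric accounts for, which depends on both its coefficient in the formula and the spread of its values.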

## Practice Exercises

### Exercise 1
Calculate the correlation between age and 
income in a dataset.

In [None]:
# Your code here


### Exercise 2
Create a correlation matrix for a 
multi-variable dataset.

In [None]:
# Your code here


## Key Takeaways

✅ **Pearson** - Linear relationships  
✅ **Spearman** - Monotonic relationships  
✅ **Range** - -1 to +1  
✅ **Heatmap** - Visualize correlations  
✅ **Causation** - Correlation ≠ Causation!  
✅ **p-value** - Test significance  

**Next:** [A/B Testing](05_ab_testing.ipynb) →