
## Step 1: Import Required Libraries

We import essential libraries for **data handling, visualization, and statistical analysis**.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
```

### **Breaking Down Each Import:**
1. **NumPy (`numpy`)**: 
   - Used for numerical computations.
   - Helps in handling arrays and performing mathematical operations.

2. **Pandas (`pandas`)**: 
   - A powerful library for handling structured data.
   - Used for reading datasets and performing correlation computations.

3. **Matplotlib (`matplotlib.pyplot`)**:
   - Used for data visualization.
   - Provides basic plotting functions.

4. **Seaborn (`seaborn`)**:
   - Built on Matplotlib but provides **better statistical visualizations**.
   - Used for scatter plots, heatmaps, and pair plots in correlation analysis.

5. **Warnings (`warnings.filterwarnings('ignore')`)**:
   - Suppresses **all** warnings for cleaner output. This also hides deprecation notices, so it is best limited to tutorials and exploratory work.
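
A narrower alternative is to silence a single warning category inside a limited scope. The sketch below uses only the standard library; `noisy_step` is a hypothetical stand-in for any call that emits a warning, not a function from the libraries above.

```python
import warnings

def noisy_step():
    # Hypothetical function that emits a FutureWarning, as some library calls do
    warnings.warn("this behaviour will change", FutureWarning)
    return 42

# record=True collects every warning that is NOT filtered out inside the block
with warnings.catch_warnings(record=True) as caught:
    # Ignore only FutureWarning; other categories would still be recorded
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_step()

suppressed_count = len(caught)
print(result, suppressed_count)
```

Outside the `with` block the previous warning filters are restored, so later cells still report warnings.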


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')


## Step 2: Load the Dataset

We load the **Palmer Penguins dataset** using **Seaborn’s built-in dataset loader**.

```python
penguin = sns.load_dataset('penguins')
penguin.head()
```

### **Breaking Down the Code:**
1. **`sns.load_dataset('penguins')`**:
   - Loads the `penguins` dataset from Seaborn’s built-in collection.
   - Contains measurements of **three penguin species** (Adélie, Chinstrap, and Gentoo) from the Palmer Archipelago, Antarctica.

2. **`penguin.head()`**:
   - Displays the **first five rows** of the dataset.
   - Helps us quickly inspect the structure.

### **Why is this Important?**
- We need to understand the **features** available before performing correlation analysis.


In [None]:
penguin = sns.load_dataset('penguins')
penguin.head()


## Step 3: Exploratory Data Analysis (EDA)

Before analyzing correlations, we **explore** the dataset:

```python
penguin.describe()
penguin.info()
penguin.isnull().sum()
```

### **Breaking Down the Code:**

1. **`penguin.describe()`**:
   - Generates **summary statistics** for numerical columns (mean, std, min, max, etc.).

2. **`penguin.info()`**:
   - Shows **data types**, the **non-null count** of each column, and memory usage.

3. **`penguin.isnull().sum()`**:
   - Counts missing values in each column.

### **Why is this Important?**
- Helps identify **data issues** before correlation analysis.
- If many missing values exist, correlation results might be **biased**.
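
To judge whether "many" values are missing, the counts can be converted to fractions. The following is a minimal sketch on a hypothetical toy frame; `toy` and its values are made up for illustration and are not rows from the penguins dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, np.nan, np.nan],
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
})

missing_count = toy.isnull().sum()       # number of missing values per column
missing_fraction = toy.isnull().mean()   # share of rows that are missing per column

print(pd.DataFrame({"missing": missing_count, "fraction": missing_fraction}))
```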


In [None]:
penguin.describe()


## Step 4: Handling Missing Values

Missing values can distort correlation computations. We need to decide whether to:

1. **Drop rows with missing values** (reasonable when only a small share of rows is affected).
2. **Impute missing values** using the mean/median/mode.
3. **Use pairwise deletion**, the default in `DataFrame.corr()`, where each pair of columns uses every row in which both values are present.

### **Steps We Take:**

```python
penguin.dropna(inplace=True)
```

- **`dropna(inplace=True)`**:  
  - Removes all rows containing missing values.
  - `inplace=True` modifies `penguin` directly instead of returning a new DataFrame.

### **Why is this Important?**
- If we **ignore missing values**, correlation results might be **misleading**.
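
The difference between dropping rows first and relying on pairwise deletion can be seen on a small example. This is a sketch with hypothetical values in `toy`; only column `c` has gaps, yet dropping rows also changes the correlation between `a` and `b`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only column "c" contains missing values
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "c": [1.0, np.nan, np.nan, 2.0, 3.0, 5.0],
})

pairwise = toy.corr()            # each pair uses every row where both values are present
listwise = toy.dropna().corr()   # only rows with no missing value in any column

rows_pairwise_ab = toy[["a", "b"]].dropna().shape[0]   # rows used for (a, b) under pairwise deletion
rows_listwise = toy.dropna().shape[0]                  # rows used for every pair after dropna()

print(rows_pairwise_ab, rows_listwise)
print(pairwise.loc["a", "b"], listwise.loc["a", "b"])
```

Pairwise deletion keeps more data, but each cell of the matrix may then be based on a different subset of rows.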


In [None]:
sns.scatterplot(y = 'bill_length_mm', x = 'body_mass_g', data=penguin)
plt.show()


## Step 5: Visualizing Correlations using Scatterplots

A **scatterplot** helps visualize **relationships between variables**.

```python
sns.scatterplot(y='bill_length_mm', x='body_mass_g', data=penguin)
plt.show()
```

### **Breaking Down the Code:**
1. **`sns.scatterplot(y='bill_length_mm', x='body_mass_g', data=penguin)`**:
   - Plots a **scatterplot** with:
     - **X-axis (`body_mass_g`)**: Penguin body mass.
     - **Y-axis (`bill_length_mm`)**: Penguin bill length.

2. **`plt.show()`**:
   - Displays the scatter plot.

### **How to Interpret?**
- An **upward trend** suggests a **positive correlation**.
- A **downward trend** suggests a **negative correlation**.


In [None]:
penguin.corr(numeric_only= True)


## Step 6: Computing Pearson Correlation

We compute **Pearson’s correlation coefficient**:

```python
penguin.corr(numeric_only=True)
```

### **Breaking Down the Code:**
1. **`penguin.corr(numeric_only=True)`**:
   - Computes Pearson’s correlation for **numeric columns only**.
   - Pearson’s correlation measures **linear relationships** between variables.

### **How to Interpret Pearson Correlation?**
- **+1.0** → Perfect **positive** correlation.
- **-1.0** → Perfect **negative** correlation.
- **0.0** → No **linear** correlation (a non-linear relationship may still exist).

⚠️ **Pearson correlation only captures linear relationships.**  
If a relationship is **monotonic but non-linear**, Spearman or Kendall correlation is a better fit.
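
To make the interpretation concrete, Pearson’s r can be computed from its definition: the covariance of the two variables divided by the product of their standard deviations. The helper `pearson_r` below is an illustrative sketch, not part of pandas or SciPy, and the input values are made up.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_linear = pearson_r(x, 2 * x + 1)             # points on an increasing line
r_inverse = pearson_r(x, -3 * x)               # points on a decreasing line
r_parabola = pearson_r(x - 3, (x - 3) ** 2)    # symmetric curve: strong dependence, no linear trend

print(r_linear, r_inverse, r_parabola)
```

The parabola case shows why a value near zero means "no linear relationship" rather than "no relationship".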


In [None]:
val = penguin.corr(numeric_only=True)
# Boolean mask that hides the lower triangle and the diagonal, so each pair appears once
mask = np.tril(np.ones_like(val, dtype=bool))
sns.heatmap(val, cmap='RdYlGn', annot=True, mask=mask)
plt.show()


## Step 7: Correlation Heatmap

A **heatmap** provides a **visual representation** of correlations.

```python
val = penguin.corr(numeric_only=True)
sns.heatmap(val, cmap='RdYlGn', annot=True)
plt.show()
```

### **Breaking Down the Code:**
1. **`penguin.corr(numeric_only=True)`** → Computes Pearson correlation matrix.
2. **`sns.heatmap(val, cmap='RdYlGn', annot=True)`**:
   - Plots a **heatmap**.
   - **`cmap='RdYlGn'`** → Uses Red-Yellow-Green color scheme.
   - **`annot=True`** → Displays correlation values on the heatmap.

### **How to Interpret the Heatmap?**
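- **Dark Green** → Strong **positive** correlation.
- **Dark Red** → Strong **negative** correlation.
- **Yellow** → Weak or no correlation.
- By default the color scale spans the range of values in the matrix, so yellow sits at zero only if the scale is anchored, for example with `vmin=-1, vmax=1`.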


In [None]:
sns.pairplot(penguin,
             y_vars = ['body_mass_g'],
             x_vars = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'], height = 5, aspect = 0.8, kind = 'kde')
plt.show()


## Step 8: Additional Correlation Methods (Spearman & Kendall)

Pearson correlation measures only the **linear** component of a relationship.  
We also check **Spearman** and **Kendall** correlation, which capture **monotonic relationships** even when they are non-linear.

```python
penguin.corr(method='spearman', numeric_only=True)
penguin.corr(method='kendall', numeric_only=True)
```

### **Breaking Down the Code:**  
1. **`penguin.corr(method='spearman', numeric_only=True)`**:  
   - Computes **Spearman Rank Correlation**.  
   - Measures **monotonic** relationships (order-based, but not necessarily linear).  

2. **`penguin.corr(method='kendall', numeric_only=True)`**:  
   - Computes **Kendall Tau Correlation**.  
   - More robust for small datasets or data with **many tied ranks**.

### **Why Use Spearman & Kendall?**  
- If data has a **monotonic but non-linear** relationship, **Pearson can understate** its strength.  
- These methods work better for **ordinal** or **rank-based** data.
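
A quick way to see the difference is a relationship that always increases but is strongly curved. The sketch below uses synthetic data (an exponential curve), not the penguins measurements.

```python
import numpy as np
import pandas as pd

x = np.linspace(0, 5, 50)
curve = pd.DataFrame({"x": x, "y": np.exp(x)})   # strictly increasing, far from a straight line

pearson = curve.corr(method="pearson").loc["x", "y"]
spearman = curve.corr(method="spearman").loc["x", "y"]

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Because the ordering of `x` and `y` agrees perfectly, Spearman reports a perfect monotonic relationship, while Pearson is pulled below 1 by the curvature.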


In [None]:
sns.pairplot(penguin,
             y_vars = ['body_mass_g'],
             x_vars = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'],
             height = 5,
             aspect = 0.8,
             kind = 'reg',
             hue = 'species')
plt.show()


## Step 9: Hypothesis Testing for Correlation

A non-zero correlation value alone does not show that the relationship is **statistically significant**, so we test it.  

We use **p-values** to judge whether a correlation this strong could plausibly arise by chance if the true correlation were zero.  

```python
from scipy.stats import pearsonr, spearmanr

pearson_corr, pearson_p = pearsonr(penguin['bill_length_mm'], penguin['body_mass_g'])
spearman_corr, spearman_p = spearmanr(penguin['bill_length_mm'], penguin['body_mass_g'])

print(f"Pearson Correlation: {pearson_corr}, p-value: {pearson_p}")
print(f"Spearman Correlation: {spearman_corr}, p-value: {spearman_p}")
```

### **Breaking Down the Code:**  
1. **`pearsonr(penguin['bill_length_mm'], penguin['body_mass_g'])`**:  
   - Computes Pearson’s **correlation coefficient** and **p-value**.  
   - If **p-value < 0.05**, correlation is **statistically significant**.  

2. **`spearmanr(penguin['bill_length_mm'], penguin['body_mass_g'])`**:  
   - Computes Spearman’s **rank correlation** and **p-value**.  
   - If **p-value < 0.05**, we reject H₀ (**correlation is significant**).  

### **How to Interpret?**  
- **p < 0.05** → Correlation is statistically significant at the 5% level.  
- **p ≥ 0.05** → Not enough evidence to conclude that the variables are correlated.
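
For Pearson’s r, the p-value can be derived from a t statistic with n − 2 degrees of freedom. The sketch below recomputes it by hand on synthetic data (generated with a fixed seed, not the penguins dataset) and compares it with the value from `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)       # fixed seed so the sketch is reproducible
x = rng.normal(size=40)              # synthetic data for illustration
y = 0.5 * x + rng.normal(size=40)

r, p_scipy = stats.pearsonr(x, y)
n = len(x)

# Under H0 (no linear correlation) this statistic follows a Student t distribution with n - 2 degrees of freedom
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value

print(f"r = {r:.3f}, t = {t_stat:.3f}, p (manual) = {p_manual:.4g}, p (scipy) = {p_scipy:.4g}")
```

The formula also shows why the same r can be significant in a large sample and not in a small one: the t statistic grows with n.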


In [None]:
for cols in penguin:
  if penguin[cols].dtype == 'O':
    print(cols, ':', penguin[cols].unique())


## Step 10: Checking Multicollinearity with Variance Inflation Factor (VIF)

If two or more variables are **highly correlated**, they cause **multicollinearity**,  
which **reduces model reliability** in regression analysis.  

We use **Variance Inflation Factor (VIF)** to detect multicollinearity.

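```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Select numeric columns and drop rows with missing values
num_cols = penguin.select_dtypes(include='number').dropna()

# Add an intercept column; without it, each auxiliary regression is forced through the origin
X = add_constant(num_cols)

# Compute VIF for each feature (index 0 is the constant, so start at 1)
vif_data = pd.DataFrame()
vif_data['Feature'] = num_cols.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]

print(vif_data)
```

### **Breaking Down the Code:**  
1. **`variance_inflation_factor(X.values, i)`**:  
   - Computes the **VIF score** for the column at position `i`.  
   - VIF > **5** suggests **multicollinearity** is a concern.  

2. **`penguin.select_dtypes(include='number').dropna()`**:  
   - Ensures only **numeric columns** are analyzed (ignoring categorical) and drops rows with missing values.  

3. **`add_constant(num_cols)`**:  
   - Adds an intercept column, so each VIF comes from a regression that includes a constant term.  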

### **How to Interpret VIF Results?**  
- **VIF < 5** → Little multicollinearity concern.  
- **5 ≤ VIF ≤ 10** → Moderate multicollinearity.  
- **VIF > 10** → Severe multicollinearity; consider removing one of the correlated variables.
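
VIF for a feature is 1 / (1 − R²), where R² comes from regressing that feature on all the others. The sketch below implements this definition with NumPy only; `vif_from_r2` is an illustrative helper, and the three columns are synthetic.

```python
import numpy as np

def vif_from_r2(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for i in range(k):
        target = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = rng.normal(size=200)               # independent of a
c = a + 0.1 * rng.normal(size=200)     # nearly a copy of a

features = np.column_stack([a, b, c])
vifs = vif_from_r2(features)
print(dict(zip(["a", "b", "c"], vifs.round(2))))
```

Here `a` and `c` receive large VIF values because each is almost fully explained by the other, while `b` stays close to 1.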


In [None]:
sns.jointplot(data=penguin,
              x='bill_length_mm',
              y='body_mass_g',
              kind='kde')
plt.show()


## Step 11: Outlier Detection & Its Effect on Correlation

Outliers can **inflate or distort correlation values**.  
We detect outliers using **boxplots**.

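```python
# Draw one boxplot per variable: millimetres and grams differ too much in scale to share an axis
fig, axes = plt.subplots(1, 2, figsize=(10, 6))
sns.boxplot(data=penguin, y='bill_length_mm', ax=axes[0])
sns.boxplot(data=penguin, y='body_mass_g', ax=axes[1])
fig.suptitle('Boxplot of Variables')
plt.show()
```

### **Breaking Down the Code:**  
1. **`plt.subplots(1, 2, figsize=(10, 6))`**:  
   - Creates two side-by-side axes so that each variable keeps its own scale.  

2. **`sns.boxplot(data=penguin, y=..., ax=...)`**:  
   - Plots a **boxplot** to detect **outliers** in a numerical variable; points beyond the whiskers are candidates.  

3. **`plt.show()`**:  
   - Displays the plot.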

### **How to Handle Outliers?**  
- **Winsorization**: Replacing extreme values with percentiles.  
- **Removing Outliers**: If they distort correlation.  
- **Transforming Data**: Using log/square root transformations.
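
Winsorization can be sketched with `Series.clip` and quantiles. The helper `winsorize` and the 5%/95% cut-offs below are illustrative assumptions, and the body-mass values are made up.

```python
import pandas as pd

def winsorize(series, lower=0.05, upper=0.95):
    """Clip values below the `lower` quantile and above the `upper` quantile to those quantiles."""
    low, high = series.quantile([lower, upper])
    return series.clip(lower=low, upper=high)

# Hypothetical body masses in grams, with one extreme value at the end
mass = pd.Series([3200, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100, 9000], dtype=float)
mass_w = winsorize(mass)

print(mass_w.tolist())
```

Unlike removing outliers, this keeps every row, which matters when the other columns of those rows are still trustworthy.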



## Step 12: Computing Correlation Before & After Removing Outliers

We check how removing outliers **changes correlation values**.

```python
Q1 = penguin['body_mass_g'].quantile(0.25)
Q3 = penguin['body_mass_g'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier range
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
filtered_penguin = penguin[(penguin['body_mass_g'] >= lower_bound) & (penguin['body_mass_g'] <= upper_bound)]

# Compare correlation before and after (the newline keeps each matrix aligned under its label)
print("Correlation before removing outliers:\n", penguin.corr(numeric_only=True))
print("Correlation after removing outliers:\n", filtered_penguin.corr(numeric_only=True))
```

### **Breaking Down the Code:**  
1. **`IQR = Q3 - Q1`**:  
   - Computes **Interquartile Range (IQR)** to detect outliers.  

2. **`lower_bound = Q1 - 1.5 * IQR`** & **`upper_bound = Q3 + 1.5 * IQR`**:  
   - Defines **outlier threshold** using **1.5x IQR rule**.  

3. **`penguin[(penguin['body_mass_g'] >= lower_bound) & (penguin['body_mass_g'] <= upper_bound)]`**:  
   - Creates a **filtered dataset** without extreme values.  

4. **`corr()` Before & After**:  
   - Computes correlation **before and after** removing outliers.  

### **Why is This Important?**  
- If correlation **changes significantly**, outliers are **influencing results**.



## Step 13: Partial Correlation - Controlling for Other Variables

Sometimes, two variables **appear correlated** only because they both depend on a **third variable**.  
To check the **true correlation** between two variables **while controlling for another**, we use **Partial Correlation**.

### **Formula for Partial Correlation (r_xy.z):**
\[ r_{xy.z} = \frac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}} \]

Where:
- **r_xy** = Pearson correlation between `x` and `y`.
- **r_xz** = Pearson correlation between `x` and `z`.
- **r_yz** = Pearson correlation between `y` and `z`.

```python
from pingouin import partial_corr

# Compute Partial Correlation between bill_length_mm and body_mass_g, controlling for flipper_length_mm
partial_corr_result = partial_corr(data=penguin, x='bill_length_mm', y='body_mass_g', covar='flipper_length_mm')

print(partial_corr_result)
```

### **Why is This Important?**
- If two variables **share a common influence**, **Partial Correlation** helps reveal the **true relationship**.
- Useful when dealing with **confounding variables**.
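
The first-order formula above can also be evaluated directly from three pairwise Pearson correlations. This is a minimal sketch; `partial_corr_first_order` is an illustrative helper (not the pingouin function), and the data are synthetic, built so that `x` and `y` are related only through `z`.

```python
import numpy as np
import pandas as pd

def partial_corr_first_order(df, x, y, z):
    """Partial correlation of x and y controlling for a single variable z (first-order formula)."""
    r = df[[x, y, z]].corr()
    r_xy, r_xz, r_yz = r.loc[x, y], r.loc[x, z], r.loc[y, z]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = np.random.default_rng(2)
z = rng.normal(size=500)                    # common driver (synthetic)
toy = pd.DataFrame({
    "x": z + rng.normal(size=500),          # depends on z plus independent noise
    "y": z + rng.normal(size=500),          # depends on z plus independent noise
    "z": z,
})

raw = toy["x"].corr(toy["y"])
partial = partial_corr_first_order(toy, "x", "y", "z")

print(f"raw correlation: {raw:.3f}, partial correlation given z: {partial:.3f}")
```

The raw correlation is clearly positive, while the partial correlation is close to zero once `z` is accounted for.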


In [None]:
from pingouin import partial_corr

In [None]:
partial_corr(data=penguin, x='bill_length_mm', y='body_mass_g', covar='flipper_length_mm')


## Step 14: Feature Selection Based on Correlation

Highly correlated features can cause **redundancy** in machine learning models.  
We **remove one of the features** if correlation **exceeds a threshold**.

```python
corr_matrix = penguin.corr(numeric_only=True).abs()

threshold = 0.8  # Define strong correlation threshold

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# Flag the second feature of each highly correlated pair; the first one is kept
high_corr_features = {col for col in upper.columns if (upper[col] > threshold).any()}

print("Highly correlated features:", high_corr_features)
```

### **Why is This Important?**
- Helps in **dimensionality reduction**.
- Prevents **overfitting** by removing **redundant variables**.


In [None]:
corr_matrix = penguin.corr(numeric_only=True)


## Step 15: Correlation with Categorical Variables

Pearson correlation **only works with numerical data**.  
For **categorical variables**, we use:

1. **Point-Biserial Correlation** (Binary vs. Numeric)
2. **Cramér’s V** (Categorical vs. Categorical)

```python
from scipy.stats import pointbiserialr, chi2_contingency

# Point-Biserial Correlation: species has three levels, so we encode Adelie (1) vs. all other species (0)
penguin['species_binary'] = (penguin['species'] == 'Adelie').astype(int)
r, p = pointbiserialr(penguin['species_binary'], penguin['body_mass_g'])
print(f"Point-Biserial Correlation: {r}, p-value: {p}")

# Cramér’s V: Checking association between two categorical variables (species & island)
contingency_table = pd.crosstab(penguin['species'], penguin['island'])
chi2, p, dof, expected = chi2_contingency(contingency_table)
n = contingency_table.to_numpy().sum()  # number of observations actually counted in the table
cramers_v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))

print(f"Cramér’s V: {cramers_v}, p-value: {p}")
```

### **Why is This Important?**
- **Point-Biserial** measures the relationship between a **binary** variable and a **numeric** variable.
- **Cramér’s V** measures association strength between **two categorical variables**, on a scale from 0 to 1.
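
Point-biserial correlation is Pearson correlation applied to a 0/1 variable. The sketch below checks this on hypothetical group labels and measurements (made up for illustration).

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Hypothetical 0/1 group labels and measurements, for illustration only
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
value = np.array([3.1, 3.4, 2.9, 3.6, 4.8, 5.1, 4.6, 5.3])

r_pb, p_pb = pointbiserialr(group, value)
r_pearson, p_pearson = pearsonr(group, value)

print(f"point-biserial: {r_pb:.4f}, Pearson on 0/1 labels: {r_pearson:.4f}")
```

This is why the binary encoding matters: swapping which group is coded as 1 flips the sign of the coefficient but not its magnitude.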


In [None]:
from scipy.stats import pointbiserialr, chi2_contingency


## Step 16: Cross-Correlation for Time-Series Data

If working with **time-series data**, we check **cross-correlation** (correlation at different time lags).  

```python
import pandas as pd
import numpy as np

# Generate time-series data
np.random.seed(42)
time_series = pd.DataFrame({
    'time': np.arange(1, 101),
    'feature_1': np.random.randn(100).cumsum(),
    'feature_2': np.random.randn(100).cumsum()
})

# Compute lagged correlation (Shift feature_2 by 1 step)
corr_lag1 = time_series['feature_1'].corr(time_series['feature_2'].shift(1))

print(f"Lagged Correlation (1 time-step lag): {corr_lag1}")
```

### **Why is This Important?**
- Useful for detecting **leading/lagging relationships** in stock prices, climate patterns, etc.
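
To look for a leading/lagging relationship, the same lagged correlation can be computed for several candidate lags. This sketch uses synthetic white noise rather than random walks, because two independent random walks can show large correlations by chance; the names `driver` and `follower` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
driver = pd.Series(rng.normal(size=300))                              # synthetic white noise
follower = driver.shift(3) + 0.1 * pd.Series(rng.normal(size=300))    # repeats driver 3 steps later, plus noise

# Correlate follower[t] with driver[t - lag] for each candidate lag
lag_corr = pd.Series({lag: follower.corr(driver.shift(lag)) for lag in range(0, 8)})
best_lag = int(lag_corr.idxmax())

print(lag_corr.round(3))
print("Lag with the highest correlation:", best_lag)
```

Rows made incomplete by `shift` contain NaN and are skipped by `Series.corr`, so each lag uses slightly fewer observations.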


In [None]:
time_series['feature_1'].corr(time_series['feature_2'].shift(1))


## Step 17: Rank Transformation for Robust Correlation

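Pearson’s coefficient is sensitive to **outliers and skewed distributions**, and its significance test assumes approximately **normally distributed data**.  
If these conditions fail, we **apply a rank transformation** for a **more robust correlation measure**.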

```python
from scipy.stats import rankdata

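# Rank transform both numerical features
penguin['body_mass_rank'] = rankdata(penguin['body_mass_g'])
penguin['bill_length_rank'] = rankdata(penguin['bill_length_mm'])

# Pearson correlation on the ranks of both variables equals Spearman's rank correlation
ranked_corr = penguin[['body_mass_rank', 'bill_length_rank']].corr()

print(ranked_corr)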
```

### **Why is This Important?**
- **Ranked correlation** is **less sensitive to outliers**.
- Works well for **non-normal data distributions**.


In [None]:
from scipy.stats import rankdata

penguin['body_mass_rank'] = rankdata(penguin['body_mass_g'])


## Step 18: Non-Parametric Bootstrap for Correlation Confidence Intervals

We estimate **confidence intervals for correlation** using **bootstrap resampling**.

```python
import numpy as np

# Bootstrap resampling for correlation estimation
bootstrap_corr = []
for _ in range(1000):
    sample = penguin.sample(frac=1, replace=True)
    bootstrap_corr.append(sample['bill_length_mm'].corr(sample['body_mass_g']))

# Compute 95% confidence interval
ci_lower, ci_upper = np.percentile(bootstrap_corr, [2.5, 97.5])

print(f"95% Confidence Interval for Correlation: ({ci_lower}, {ci_upper})")
```

### **Why is This Important?**
- Provides **uncertainty estimates** for correlation values.
- Useful when the **assumptions behind analytic confidence intervals** (such as bivariate normality) are doubtful.
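
The key detail in a bootstrap for correlation is that rows are resampled as pairs, so each `x` value stays attached to its `y` value. The sketch below is a NumPy-only version of the percentile bootstrap; `bootstrap_corr_ci` and its parameters are illustrative, and the data are synthetic.

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Pearson's r, resampling (x, y) pairs together."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    boot_r = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # the same row indices for x and y keep pairs intact
        boot_r[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lower, upper = np.percentile(boot_r, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)

rng = np.random.default_rng(4)
x = rng.normal(size=150)                     # synthetic data for illustration
y = 0.6 * x + rng.normal(size=150)

r_hat = np.corrcoef(x, y)[0, 1]
ci_lower, ci_upper = bootstrap_corr_ci(x, y)

print(f"r = {r_hat:.3f}, 95% CI = ({ci_lower:.3f}, {ci_upper:.3f})")
```

Resampling `x` and `y` independently would break the pairing and push every bootstrap correlation toward zero.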


In [None]:
samples = (penguin.sample(frac=1, replace=True) for _ in range(1000))  # one resample per iteration keeps x and y paired
bootstrap_corr = [s['bill_length_mm'].corr(s['body_mass_g']) for s in samples]