# 📘 Ultimate EDA & Feature Engineering Playbook

## Overview
This notebook is a comprehensive guide to Exploratory Data Analysis (EDA) and Feature Engineering.
It is designed to be a **template** you can plug any dataset into.

### 📚 Table of Contents
1. **Environment Setup**: Libraries and Configuration
2. **Data Loading**: Ingestion and Sanity Checks
3. **Initial Exploration**: Structure, Types, and Summary Stats
4. **Data Cleaning**: Missing Values and Duplicates
5. **Univariate Analysis**: Numerical and Categorical Distributions
6. **Bivariate Analysis**: Correlations and Relationships
7. **Multivariate Analysis**: Pairplots and Interactions
8. **Feature Engineering**: Creation, Transformation, and Encoding
9. **Preprocessing for Machine Learning**: Scaling and Splitting
10. **Practice Exercises**: Test Your Understanding


## 1. Environment Setup
Import the libraries needed for data manipulation and visualization.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Configuration
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_theme(style='whitegrid')
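# Suppresses every warning for cleaner output, including deprecation notices;
# comment this out while debugging
warnings.filterwarnings('ignore')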

%matplotlib inline
print('Libraries Imported Successfully')


## 2. Data Loading
Load your dataset here. For this playbook, we will generate a synthetic dataset to demonstrate functionality.


In [None]:
# Generating a robust synthetic dataset
from sklearn.datasets import make_classification

# Create synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, 
                           n_redundant=5, n_classes=2, random_state=42)

df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(20)])
df['target'] = y

# Injecting some messiness for practice (missing values, categorical columns)
rng = np.random.default_rng(42)  # Seeded so the categorical columns are reproducible

df['category_A'] = rng.choice(['Red', 'Blue', 'Green'], df.shape[0])
df['category_B'] = rng.choice(['Low', 'Medium', 'High'], df.shape[0])
df.loc[::10, 'feature_0'] = np.nan   # Inject missing values in a numerical column
df.loc[::20, 'category_A'] = np.nan  # Inject missing values in a categorical column

print(f'Dataset Shape: {df.shape}')
df.head()
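
When you swap the synthetic data for your own file, a few sanity checks right after ingestion can catch problems early. The following is a minimal sketch, not a fixed recipe: the helper `load_and_check`, the column names, and the inline CSV text are all hypothetical stand-ins, and in practice you would pass a real file path to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
# Replace this object with your own path, e.g. 'data.csv' (an assumed name).
csv_source = io.StringIO(
    "age,income,segment,target\n"
    "34,52000,A,1\n"
    "45,,B,0\n"
    "45,,B,0\n"
    "29,61000,A,1\n"
)


def load_and_check(source, target_col):
    """Load a CSV and collect a few basic sanity checks in a dict."""
    data = pd.read_csv(source)
    if target_col not in data.columns:
        raise KeyError(f"Expected target column '{target_col}' is missing")
    report = {
        'n_rows': len(data),
        'n_cols': data.shape[1],
        'n_duplicate_rows': int(data.duplicated().sum()),
        'n_missing_cells': int(data.isna().sum().sum()),
        # Columns holding a single value carry no information for a model
        'constant_columns': [
            c for c in data.columns if data[c].nunique(dropna=False) <= 1
        ],
    }
    return data, report


raw_df, sanity_report = load_and_check(csv_source, target_col='target')
print(sanity_report)
```

The report is a plain dictionary so it can be logged or asserted on before any further analysis runs.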


## 3. Initial Exploration
Understanding the basic structure of the data.


In [None]:
# 3.1 Data Types and Info
df.info()


In [None]:
# 3.2 Summary Statistics (Numerical)
df.describe().T


In [None]:
# 3.3 Summary Statistics (Categorical)
df.describe(include=['object']).T


## 4. Data Cleaning
Identifying and handling Nulls and Duplicates.


In [None]:
# 4.1 Missing Value Analysis
missing = df.isnull().sum()
missing = missing[missing > 0]
if not missing.empty:
    missing_percent = (missing / len(df)) * 100
    # display() is required here: a bare expression inside an if-block is not rendered
    display(pd.DataFrame({'Missing Count': missing, 'Percentage': missing_percent}))
else:
    print('No missing values found')


In [None]:
# 4.2 Visualizing Missing Data
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Value Heatmap')
plt.show()


In [None]:
# 4.3 Handling Missing Values
# Strategy: Impute Numerical with Median, Categorical with Mode

# Numerical
num_cols = df.select_dtypes(include=np.number).columns
for col in num_cols:
    # Assign the result back; fillna(inplace=True) on a selected column is
    # chained assignment, which recent pandas versions deprecate
    df[col] = df[col].fillna(df[col].median())

# Categorical
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print('Missing values handled.')
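
The cell above imputes with statistics computed on the full dataset. That is fine for exploration, but it lets test-set information leak into training if a model is fitted later. Here is a minimal sketch of one alternative, assuming scikit-learn's `SimpleImputer` and a small made-up frame `demo` (hypothetical column names, not the playbook's `df`): fit the imputers on the training split only, then reuse them on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Small illustrative frame with injected gaps (hypothetical data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'num': rng.normal(size=50),
    'cat': pd.Series(rng.choice(['Red', 'Blue', 'Green'], size=50), dtype=object),
})
demo.loc[::10, 'num'] = np.nan
demo.loc[::7, 'cat'] = np.nan

# Split first, so the imputation statistics never see the test rows
train_part, test_part = train_test_split(demo, test_size=0.2, random_state=42)
train_part = train_part.copy()
test_part = test_part.copy()

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

# fit_transform learns the median / mode from the training split only
train_part['num'] = num_imputer.fit_transform(train_part[['num']]).ravel()
train_part['cat'] = cat_imputer.fit_transform(train_part[['cat']]).ravel()

# transform reuses the training statistics on the test split
test_part['num'] = num_imputer.transform(test_part[['num']]).ravel()
test_part['cat'] = cat_imputer.transform(test_part[['cat']]).ravel()

print(train_part.isna().sum().sum(), test_part.isna().sum().sum())
```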


In [None]:
# 4.4 Duplicate Removal
duplicates = df.duplicated().sum()
print(f'Duplicates found: {duplicates}')
df.drop_duplicates(inplace=True)


## 5. Univariate Analysis
Analyzing features individually.


In [None]:
# 5.1 Numerical Distributions (Histograms + KDE)
features_to_plot = ['feature_0', 'feature_1', 'feature_2', 'feature_3']

plt.figure(figsize=(15, 10))
for i, col in enumerate(features_to_plot, 1):
    plt.subplot(2, 2, i)
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()


In [None]:
# 5.2 Boxplots for Outlier Detection
plt.figure(figsize=(15, 6))
sns.boxplot(data=df[features_to_plot], orient='h')
plt.title('Boxplots of Selected Features')
plt.show()
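
The whiskers in a boxplot conventionally extend 1.5 times the interquartile range (IQR) beyond the quartiles, and points outside them are drawn as outliers. A rough sketch of counting those points numerically follows. The helper `iqr_outlier_counts` and the toy frame are hypothetical, and the multiplier `k=1.5` is the usual convention, not a setting taken from this notebook.

```python
import numpy as np
import pandas as pd


def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include=np.number)
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align the per-column bounds with the frame's columns
    is_outlier = (numeric < lower) | (numeric > upper)
    return is_outlier.sum()


# Toy data: column 'a' has one extreme value, column 'b' has none
demo = pd.DataFrame({'a': [1, 2, 3, 4, 100], 'b': [10, 11, 12, 13, 14]})
counts = iqr_outlier_counts(demo)
print(counts)
```

Calling `iqr_outlier_counts(df[features_to_plot])` would give the counts for the features plotted above.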


In [None]:
# 5.3 Categorical Frequency Plots
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.countplot(x='category_A', data=df)
plt.title('Frequency of Category A')

plt.subplot(1, 2, 2)
sns.countplot(x='category_B', data=df)
plt.title('Frequency of Category B')
plt.tight_layout()
plt.show()


## 6. Bivariate Analysis
Analyzing relationships between variables and the target.


In [None]:
# 6.1 Numerical Feature vs Target (Box Plot)
# Assuming 'target' is categorical/binary for this visualization
plt.figure(figsize=(15, 6))
sns.boxplot(x='target', y='feature_0', data=df)
plt.title('Feature 0 Distribution by Target')
plt.show()


In [None]:
# 6.2 Correlation Heatmap
plt.figure(figsize=(18, 14))
corr = df.select_dtypes(include=np.number).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
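
With 20 or more features, the annotated heatmap becomes hard to read, so it can help to list the strongest pairs as a table. Below is a minimal sketch, assuming an absolute-correlation cutoff of 0.8 chosen purely for illustration; the helper `top_correlated_pairs` and the demo columns are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def top_correlated_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.select_dtypes(include=np.number).corr().abs()
    # combinations() visits each unordered pair once and skips the diagonal
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=['feature_a', 'feature_b', 'abs_corr'])
    strong = pairs[pairs['abs_corr'] >= threshold]
    return strong.sort_values('abs_corr', ascending=False).reset_index(drop=True)


# Toy data: 'x_scaled' is a linear function of 'x'; 'noise' is independent
rng = np.random.default_rng(42)
x = rng.normal(size=200)
demo = pd.DataFrame({
    'x': x,
    'x_scaled': 2 * x + 1,
    'noise': rng.normal(size=200),
})
strong_pairs = top_correlated_pairs(demo, threshold=0.8)
print(strong_pairs)
```

Pairs flagged this way are candidates for dropping or combining, since strongly correlated inputs add little independent information.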


In [None]:
# 6.3 Categorical vs Target (Crosstab)
ct = pd.crosstab(df['category_A'], df['target'], normalize='index')
ct.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Category A vs Target (Stacked Bar)')
plt.ylabel('Proportion')
plt.show()


## 7. Multivariate Analysis
Complex interactions between multiple variables.


In [None]:
# 7.1 Pairplot of key features
subset_cols = ['feature_0', 'feature_1', 'feature_2', 'target']
sns.pairplot(df[subset_cols], hue='target', palette='husl')
plt.show()


## 8. Feature Engineering
Creating new features and transforming existing ones.


In [None]:
# 8.1 Interaction Features
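# Example: Creating a ratio of two features
# Note: the small constant only avoids division by exactly zero; denominators
# close to -0.001 can still produce very large ratios
df['feature_0_1_ratio'] = df['feature_0'] / (df['feature_1'] + 0.001)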
print('Created interaction feature: feature_0_1_ratio')
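
Adding a small constant to the denominator does not protect a ratio when the denominator can be negative, as it can for these standardized-looking features. One possible alternative, sketched under the assumption that near-zero denominators should yield a missing value, is shown here; the helper `safe_ratio` and its `eps` cutoff are hypothetical.

```python
import pandas as pd


def safe_ratio(numerator, denominator, eps=1e-3):
    """Divide two Series, returning NaN wherever |denominator| < eps."""
    # where() keeps values that satisfy the condition and masks the rest as NaN
    guarded = denominator.where(denominator.abs() >= eps)
    return numerator / guarded


num = pd.Series([1.0, 2.0, 3.0, 4.0])
den = pd.Series([2.0, 0.0, -0.0005, -4.0])
ratio = safe_ratio(num, den)
print(ratio)
```

The resulting NaNs can then be imputed or flagged with an indicator column, which keeps extreme ratios from dominating later scaling.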


In [None]:
# 8.2 Binning Numerical Variables
df['feature_0_binned'] = pd.qcut(df['feature_0'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print('Created binned feature: feature_0_binned')


In [None]:
# 8.3 Encoding Categorical Variables
# One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['category_A', 'category_B', 'feature_0_binned'], drop_first=True)
print('Performed One-Hot Encoding')
df_encoded.head()


## 9. Preprocessing for Machine Learning
Scaling and Splitting the data.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separating Target and Features
X = df_encoded.drop('target', axis=1)
y = df_encoded['target']

# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling (fit on the training set only, then apply the same transform to the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f'Train Shape: {X_train_scaled.shape}')
print(f'Test Shape: {X_test_scaled.shape}')
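
The cell above scales every column, including the one-hot dummies. A common alternative is to bundle imputation, scaling, and encoding into one object that is fitted on the training split only. This is a sketch under stated assumptions: the frame `demo` and its column names are made up, and scikit-learn's `ColumnTransformer` is swapped in for the `pd.get_dummies` approach used earlier in this playbook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical demo data with one numeric gap pattern and one categorical column
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'num_a': rng.normal(size=100),
    'num_b': rng.normal(loc=5, scale=2, size=100),
    'cat': pd.Series(rng.choice(['Low', 'Medium', 'High'], size=100), dtype=object),
    'target': rng.integers(0, 2, size=100),
})
demo.loc[::10, 'num_a'] = np.nan

X_demo = demo.drop(columns='target')
y_demo = demo['target']
# stratify keeps the class balance similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

numeric_cols = ['num_a', 'num_b']
categorical_cols = ['cat']

numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    # handle_unknown='ignore' encodes unseen test categories as all zeros
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocess = ColumnTransformer(
    [('num', numeric_steps, numeric_cols), ('cat', categorical_steps, categorical_cols)],
    sparse_threshold=0,  # always return a dense array
)

# All statistics (medians, means, categories) come from the training split
X_tr_prepared = preprocess.fit_transform(X_tr)
X_te_prepared = preprocess.transform(X_te)
print(X_tr_prepared.shape, X_te_prepared.shape)
```

Only the numeric columns are standardized here; the one-hot columns stay as 0/1 indicators.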


## 10. Practice Exercises
Try these to test your understanding.


In [None]:
# Exercise 1: Find the feature with the highest correlation to the target (excluding itself).


In [None]:
# Exercise 2: Create a Violin Plot for 'feature_5' against 'category_A'.


In [None]:
# Exercise 3: Use Log Transformation on 'feature_0' and plot the before/after distribution.
