# EDA and Feature Engineering - Flight Price Prediction Dataset

## Introduction

This notebook walks through **Exploratory Data Analysis (EDA)** combined with **Feature Engineering** on a flight price dataset. Unlike pure EDA, it focuses on creating new features and transforming existing ones so that a downstream model has better inputs to learn from.

### What is Feature Engineering?

**Feature Engineering** is the process of using domain knowledge to create new features from raw data so that machine learning algorithms work better. It is often considered one of the most important skills in applied data science because it:

- **Improves Model Performance**: Better features = better predictions
- **Captures Hidden Patterns**: Reveals relationships not obvious in raw data
- **Reduces Complexity**: Simplifies models while maintaining accuracy
- **Incorporates Domain Knowledge**: Leverages expertise about the problem

### What You'll Learn

1. **Data Loading and Initial Exploration**
2. **Handling Different Data Types** (Numeric, Categorical, Datetime)
3. **DateTime Feature Engineering**
4. **Categorical Feature Encoding**
5. **Numerical Feature Transformation**
6. **Creating Interaction Features**
7. **Feature Scaling and Normalization**
8. **Feature Selection Techniques**
9. **Handling Missing Values**
10. **Creating the Final Feature Set**

### Flight Price Dataset

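The notebook uses a synthetic sample modeled on a flight booking dataset. Such datasets typically contain features like the following (the synthetic sample omits Route and Additional_Info and stores times as separate hour and minute columns):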
- **Airline**: The airline company
- **Source & Destination**: Departure and arrival cities
- **Route**: Flight path with stops
- **Departure & Arrival Time**: Flight schedule
- **Duration**: Total flight time
- **Total_Stops**: Number of stops
- **Additional_Info**: Extra flight information
- **Price**: Flight ticket price (TARGET VARIABLE)

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.preprocessing import LabelEncoder, StandardScaler
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

print("Libraries imported successfully!")

## 1. Load and Explore Data

In [None]:
# Create sample flight data
np.random.seed(42)
n_samples = 500

airlines = ['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet', 'Multiple carriers', 'GoAir', 'Vistara']
sources = ['Delhi', 'Kolkata', 'Mumbai', 'Chennai', 'Banglore']
destinations = ['Delhi', 'Kolkata', 'Mumbai', 'Chennai', 'Banglore', 'Cochin', 'Hyderabad']
stop_options = ['non-stop', '1 stop', '2 stops', '3 stops', '4 stops']

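# Generate data
# Note: Source and Destination are drawn independently, so some rows have the
# same city for both. That is acceptable here because neither column feeds into Price.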
data = {
    'Airline': np.random.choice(airlines, n_samples),
    'Source': np.random.choice(sources, n_samples),
    'Destination': np.random.choice(destinations, n_samples),
    'Total_Stops': np.random.choice(stop_options, n_samples),
    'Dep_Hour': np.random.randint(0, 24, n_samples),
    'Dep_Min': np.random.choice([0, 15, 30, 45], n_samples),
    'Arrival_Hour': np.random.randint(0, 24, n_samples),
    'Arrival_Min': np.random.choice([0, 15, 30, 45], n_samples),
    'Duration_Hours': np.random.randint(1, 25, n_samples),
    'Duration_Mins': np.random.randint(0, 60, n_samples),
    'Price': np.random.uniform(3000, 40000, n_samples)  # placeholder; replaced by the rule-based price below
}

flight_df = pd.DataFrame(data)

# Add some logic to price based on features
flight_df['Price'] = 5000 + \
    (flight_df['Duration_Hours'] * 800) + \
    (flight_df['Total_Stops'].map({'non-stop': 0, '1 stop': 1000, '2 stops': 2000, '3 stops': 3000, '4 stops': 4000})) + \
    np.random.normal(0, 2000, n_samples)

flight_df['Price'] = flight_df['Price'].clip(3000, 50000)

print("=" * 70)
print("Flight Dataset Created")
print("=" * 70)
print(f"Shape: {flight_df.shape}")
print("\nFirst few rows:")
display(flight_df.head(10))

print("\n" + "=" * 70)
print("Dataset Info:")
print("=" * 70)
flight_df.info()

## 2. Feature Engineering - DateTime Features

DateTime features are rich sources of information. We can extract multiple useful features from departure and arrival times.

In [None]:
# Create time-based features
print("=" * 70)
print("FEATURE ENGINEERING: TIME-BASED FEATURES")
print("=" * 70)

# 1. Departure time of day categories
def categorize_time(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

flight_df['Dep_Time_Category'] = flight_df['Dep_Hour'].apply(categorize_time)
flight_df['Arrival_Time_Category'] = flight_df['Arrival_Hour'].apply(categorize_time)

# 2. Is it a red-eye flight? Approximated by departure hour only (22:00 to 04:59); arrival time is not checked
flight_df['Is_Red_Eye'] = ((flight_df['Dep_Hour'] >= 22) | (flight_df['Dep_Hour'] <= 4)).astype(int)

# 3. Total duration in minutes
flight_df['Total_Duration_Mins'] = flight_df['Duration_Hours'] * 60 + flight_df['Duration_Mins']

# 4. Convert stops to numeric
stop_mapping = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}
flight_df['Stops_Numeric'] = flight_df['Total_Stops'].map(stop_mapping)

print("\nNew Features Created:")
print(f"✓ Dep_Time_Category: {flight_df['Dep_Time_Category'].nunique()} categories")
print(f"✓ Arrival_Time_Category: {flight_df['Arrival_Time_Category'].nunique()} categories")
print(f"✓ Is_Red_Eye: Binary feature (0/1)")
print(f"✓ Total_Duration_Mins: Continuous numeric feature")
print(f"✓ Stops_Numeric: Ordinal numeric feature (0-4)")

# Visualize new features
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Time category distribution
sns.countplot(data=flight_df, x='Dep_Time_Category', ax=axes[0, 0], palette='Set2')
axes[0, 0].set_title('Departure Time Categories', fontweight='bold')
axes[0, 0].set_xlabel('Time Category')

# Red-eye flights
sns.countplot(data=flight_df, x='Is_Red_Eye', ax=axes[0, 1], palette='Set1')
axes[0, 1].set_title('Red-Eye Flights Distribution', fontweight='bold')
axes[0, 1].set_xticks([0, 1])
axes[0, 1].set_xticklabels(['Regular', 'Red-Eye'])

# Duration distribution
sns.histplot(flight_df['Total_Duration_Mins'], bins=30, ax=axes[1, 0], color='skyblue', kde=True)
axes[1, 0].set_title('Flight Duration Distribution', fontweight='bold')
axes[1, 0].set_xlabel('Duration (minutes)')

# Stops impact on price
sns.boxplot(data=flight_df, x='Total_Stops', y='Price', ax=axes[1, 1], palette='viridis',
            order=stop_options)  # keep the x-axis in logical order: non-stop -> 4 stops
axes[1, 1].set_title('Price vs Number of Stops', fontweight='bold')
axes[1, 1].set_xlabel('Number of Stops')
axes[1, 1].set_ylabel('Price (₹)')

plt.tight_layout()
plt.show()

print("\n" + "=" * 70)
print("KEY INSIGHTS:")
print("=" * 70)
print(f"• Average flight duration: {flight_df['Total_Duration_Mins'].mean():.0f} minutes")
print(f"• Red-eye flights: {flight_df['Is_Red_Eye'].sum()} ({flight_df['Is_Red_Eye'].mean()*100:.1f}%)")
print(f"• Most common departure time: {flight_df['Dep_Time_Category'].mode()[0]}")
print("=" * 70)
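
The synthetic data above already stores hours and minutes as separate columns. Real flight data more often stores them as strings, so the following is a minimal sketch of that extraction step. The column names `Dep_Time` and `Duration` and the string formats (`"22:20"`, `"2h 50m"`) are assumptions for illustration, not taken from this notebook's data.

```python
import pandas as pd

# Hypothetical raw rows; column names and string formats are assumptions
raw = pd.DataFrame({
    "Dep_Time": ["22:20", "05:50", "09:25"],
    "Duration": ["2h 50m", "19h", "45m"],
})

# Parse "HH:MM" strings, then pull out the hour and minute components
dep = pd.to_datetime(raw["Dep_Time"], format="%H:%M")
raw["Dep_Hour"] = dep.dt.hour
raw["Dep_Min"] = dep.dt.minute

# Either the hour or the minute part may be missing ("19h", "45m"),
# so extract each separately and treat a missing part as zero
hours = raw["Duration"].str.extract(r"(\d+)\s*h")[0].fillna("0").astype(int)
mins = raw["Duration"].str.extract(r"(\d+)\s*m")[0].fillna("0").astype(int)
raw["Total_Duration_Mins"] = hours * 60 + mins

print(raw)
```

Extracting the hour and minute parts independently avoids failing on durations that contain only one of the two units.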

## 3. Feature Engineering - Categorical Encoding

Most machine learning models require numeric inputs, so categorical variables need to be converted to a numeric format first.

In [None]:
# Encode categorical variables
print("=" * 70)
print("CATEGORICAL ENCODING")
print("=" * 70)

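# Method 1: One-Hot Encoding (for nominal variables)
# Note: all dummy columns are kept here; for linear models, drop_first=True avoids a redundant column per variable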
print("\n1. One-Hot Encoding for Airline, Source, Destination:")
airline_encoded = pd.get_dummies(flight_df['Airline'], prefix='Airline')
source_encoded = pd.get_dummies(flight_df['Source'], prefix='Source')
dest_encoded = pd.get_dummies(flight_df['Destination'], prefix='Dest')

print(f"   Airline: Created {airline_encoded.shape[1]} binary features")
print(f"   Source: Created {source_encoded.shape[1]} binary features")
print(f"   Destination: Created {dest_encoded.shape[1]} binary features")

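# Method 2: Label Encoding (compact integer codes; best suited to tree-based models)
# Note: LabelEncoder assigns codes alphabetically (Afternoon=0, Evening=1, Morning=2, Night=3),
# so the integers do NOT follow time-of-day order.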
print("\n2. Label Encoding for Time Categories:")
le_dep = LabelEncoder()
le_arr = LabelEncoder()

flight_df['Dep_Time_Encoded'] = le_dep.fit_transform(flight_df['Dep_Time_Category'])
flight_df['Arr_Time_Encoded'] = le_arr.fit_transform(flight_df['Arrival_Time_Category'])

print(f"   Departure Time: Encoded to {flight_df['Dep_Time_Encoded'].nunique()} unique values")
print(f"   Arrival Time: Encoded to {flight_df['Arr_Time_Encoded'].nunique()} unique values")

# Combine all engineered features
engineered_df = pd.concat([
    flight_df[['Price', 'Total_Duration_Mins', 'Stops_Numeric', 'Is_Red_Eye',
                'Dep_Time_Encoded', 'Arr_Time_Encoded']],
    airline_encoded,
    source_encoded,
    dest_encoded
], axis=1)

print("\n" + "=" * 70)
print("FINAL FEATURE SET")
print("=" * 70)
print(f"Total Features (excluding target): {engineered_df.shape[1] - 1}")
print(f"Total Samples: {engineered_df.shape[0]}")
print("\nFeature Types:")
print("  • Target: 1 (Price)")
print("  • Engineered numeric features: 5")
print(f"  • One-hot encoded features: {airline_encoded.shape[1] + source_encoded.shape[1] + dest_encoded.shape[1]}")

# Show correlation with price
print("\n" + "=" * 70)
print("TOP 10 FEATURES CORRELATED WITH PRICE")
print("=" * 70)
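# Cast to float so boolean dummy columns are handled uniformly, and drop the target by name
# rather than relying on it being first after sorting
correlations = engineered_df.astype(float).corr()['Price'].drop('Price').abs().sort_values(ascending=False).head(10)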
for i, (feat, corr) in enumerate(correlations.items(), 1):
    print(f"{i:2d}. {feat:30s}: {corr:.4f}")

# Visualize correlation strength (a linear, univariate proxy; not model-based feature importance)
plt.figure(figsize=(12, 6))
correlations.plot(kind='barh', color='teal', alpha=0.7)
plt.title('Top 10 Features Correlated with Price', fontsize=14, fontweight='bold')
plt.xlabel('Absolute Correlation', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("=" * 70)
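
`LabelEncoder` sorts its classes alphabetically, so the codes it produces for the time categories do not follow the order of the day. If an order-preserving encoding is wanted, one option is an explicit category order with `pd.Categorical`. The sketch below contrasts the two on a small hypothetical series; the chosen order (Morning first) is an assumption, and since time of day is cyclic, any linear order is only an approximation.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small hypothetical sample of time-of-day categories
times = pd.Series(["Night", "Morning", "Evening", "Afternoon", "Morning"])

# LabelEncoder: classes are sorted alphabetically
# (Afternoon=0, Evening=1, Morning=2, Night=3)
le = LabelEncoder()
alphabetical_codes = le.fit_transform(times)

# Explicit order: codes follow the position in time_order
time_order = ["Morning", "Afternoon", "Evening", "Night"]
ordered = pd.Categorical(times, categories=time_order, ordered=True)
ordinal_codes = ordered.codes

print("Alphabetical:", list(alphabetical_codes))
print("Explicit order:", list(ordinal_codes))
```

Using one shared category list for both departure and arrival columns also guarantees that the same integer means the same category in each.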

## Summary: Feature Engineering Best Practices

### What We Learned

**1. Feature Engineering Techniques:**
- **DateTime Feature Extraction**: Converting time data into meaningful categories
- **Categorical Encoding**: One-Hot and Label Encoding for different variable types
- **Feature Creation**: Building new features from domain knowledge
- **Feature Transformation**: Converting raw data into model-ready format

**2. Key Principles:**
- **Domain Knowledge is Crucial**: Understanding flights helps create better features
- **Feature Quality > Quantity**: Well-engineered features outperform many poor ones
- **Encoding Matters**: Choose the right encoding for the variable type (nominal vs ordinal)
- **Test Impact**: Check each feature's relationship with the target variable (correlation captures only linear effects) and confirm it with validation performance

**3. Common Feature Engineering Operations:**

| Operation | When to Use | Example |
|-----------|-------------|---------|
| **One-Hot Encoding** | Nominal categorical variables | Airline, City names |
| **Label Encoding** | Ordinal variables (encoded in their true order) or tree-based models | Number of stops, Ratings |
| **Binning** | Continuous to categorical | Age groups, Price ranges |
| **Scaling** | Different magnitude features | StandardScaler, MinMaxScaler |
| **Datetime Extraction** | Timestamp data | Hour, Day of Week, Month |
| **Interaction Features** | Multiplicative effects | Area = Length × Width |
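
Three of the operations in the table (binning, interaction features, scaling) are not exercised in the cells above. The following is a minimal sketch of each on a tiny hypothetical frame; the bin edges, labels, and the `Duration_x_Stops` feature name are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical frame reusing the notebook's engineered column names
demo = pd.DataFrame({
    "Total_Duration_Mins": [90, 150, 480, 1200],
    "Stops_Numeric": [0, 1, 2, 3],
})

# Binning: continuous duration -> ordered categories (bin edges are assumptions)
demo["Duration_Bin"] = pd.cut(
    demo["Total_Duration_Mins"],
    bins=[0, 120, 360, 1500],
    labels=["Short", "Medium", "Long"],
)

# Interaction: product of two features to capture a multiplicative effect
demo["Duration_x_Stops"] = demo["Total_Duration_Mins"] * demo["Stops_Numeric"]

# Scaling: standardize columns of very different magnitude to mean 0, std 1
scaler = StandardScaler()
scaled = scaler.fit_transform(demo[["Total_Duration_Mins", "Duration_x_Stops"]])

print(demo)
print(scaled.round(2))
```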

**4. Feature Engineering Workflow:**
1. **Understand the Data**: Know what each feature represents
2. **Identify Feature Types**: Numeric, Categorical, Datetime
3. **Create New Features**: Use domain knowledge
4. **Encode Categories**: Convert to numeric
5. **Handle Missing Values**: Impute or remove
6. **Scale Features**: Normalize if needed
7. **Select Best Features**: Use correlation, importance scores
8. **Validate**: Test on holdout set
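
Steps 5 to 8 of this workflow can be sketched with a scikit-learn pipeline. The example below is an illustration under stated assumptions, not part of the notebook's analysis: it generates its own small synthetic data using the same price rule as above, injects a few missing values artificially, and uses a plain linear regression as a stand-in model.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (same pricing rule as the notebook's sample data)
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "Total_Duration_Mins": rng.integers(60, 1500, n).astype(float),
    "Stops_Numeric": rng.integers(0, 5, n).astype(float),
})
y = (5000 + X["Total_Duration_Mins"] / 60 * 800
     + X["Stops_Numeric"] * 1000 + rng.normal(0, 2000, n))

# Artificially blank out 5% of durations to mimic missing data
missing_idx = X.sample(frac=0.05, random_state=0).index
X.loc[missing_idx, "Total_Duration_Mins"] = np.nan

# Hold out 20% of rows before fitting anything
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Imputer and scaler are fitted on the training split only, which avoids leakage
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LinearRegression(),
)
model.fit(X_train, y_train)

holdout_r2 = model.score(X_test, y_test)
print(f"Holdout R^2: {holdout_r2:.3f}")
```

Wrapping the preprocessing steps in the pipeline means the holdout rows are transformed with statistics learned from the training rows, so the score reflects performance on unseen data.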

This notebook showed how to turn raw flight attributes into model-ready features and how to screen them by their correlation with price. Whether these features actually improve model performance should be confirmed by training a model and evaluating it on a holdout set.