## Pre-Modeling Outlier Analysis Explained

### What It Is and Why You Do It

- Pre-modeling outlier analysis means checking your data for unusual values BEFORE you build your regression model.
- Think of it as inspecting your ingredients before cooking: you want to know if there's rotten fruit before you make the salad!

### Why Check All 4 Variables

- Target variable (price): one extreme value can distort your entire model.
- Features (size, bedrooms, bathrooms): extreme x-values have a disproportionate influence on the slope.
- That influence changes the feature's coefficient, i.e. its weight in the regression equation.
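
To make the effect on the slope concrete, here is a minimal sketch on made-up numbers (not the housing dataset used in this notebook). It fits a straight line with `np.polyfit` before and after adding one extreme point. The helper name `fit_slope` and every value are illustrative assumptions.

```python
import numpy as np

def fit_slope(x, y):
    """Return the least-squares slope of y on x (degree-1 polynomial fit)."""
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

# Made-up data: price rises by exactly 100 per unit of size
size = np.array([1000.0, 1200.0, 1400.0, 1600.0, 1800.0])
price = 100.0 * size

slope_clean = fit_slope(size, price)

# Add one hypothetical extreme house: very large but priced low for its size
size_with_outlier = np.append(size, 8000.0)
price_with_outlier = np.append(price, 200_000.0)

slope_distorted = fit_slope(size_with_outlier, price_with_outlier)

print(f"Slope without the extreme point: {slope_clean:.2f}")
print(f"Slope with the extreme point:    {slope_distorted:.2f}")
```

In this toy sample a single point far out on the x-axis pulls the fitted slope well below the value that describes the other five houses.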

### What We Look For During the Analysis

- Data entry errors
- Measurement errors
- Valid extreme values

### Simple Pre-Modeling Checklist

In [None]:
for col in ['size', 'bedrooms', 'bathrooms', 'price']:
    print(f"\n=== {col.upper()} ===")
    print(f"Min: {df[col].min()}")
    print(f"Max: {df[col].max()}")
    print(f"Mean: {df[col].mean():.2f}")
    print(f"Std Dev: {df[col].std():.2f}")
    
    # Ask: Do these make sense?
    # Size: Negative? >10,000 sq ft for regular house?
    # Bedrooms: 0? >10?
    # Bathrooms: 0? More bathrooms than seems plausible?
    # Price: Negative? $0? $100 million?
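
The questions in the comments above can also be turned into explicit checks. The sketch below is one possible way to do that, run on a small made-up DataFrame (`df_demo`). The rule names and thresholds are assumptions taken from the comments, not fixed standards, so adjust them to your own data.

```python
import pandas as pd

# Made-up rows for illustration; replace with your own DataFrame
df_demo = pd.DataFrame({
    'size':      [1500, -200, 2200, 12000],
    'bedrooms':  [3, 2, 0, 4],
    'bathrooms': [2, 1, 1, 3],
    'price':     [300_000, 250_000, 0, 450_000],
})

# Hypothetical plausibility rules: each entry is a boolean Series (True = suspicious)
rules = {
    'size_not_positive':  df_demo['size'] <= 0,
    'size_over_10000':    df_demo['size'] > 10_000,
    'bedrooms_zero':      df_demo['bedrooms'] == 0,
    'bedrooms_over_10':   df_demo['bedrooms'] > 10,
    'price_not_positive': df_demo['price'] <= 0,
}

# One boolean column per rule, then count how many rows break each rule
violations = pd.DataFrame(rules)
violation_counts = violations.sum()
print(violation_counts)

# Rows that break at least one rule, kept for manual review
suspect_rows = df_demo[violations.any(axis=1)]
print(suspect_rows)
```

A flagged row is only a candidate for review: a 12,000 sq ft house may be a real property, while a negative size cannot be.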

### Visual Check - Boxplots

- A boxplot tells you which variables have dots outside the whiskers (with matplotlib's default settings, points more than 1.5 × IQR beyond the quartiles).


In [None]:
# Boxplots show outliers automatically as dots beyond whiskers
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
variables = ['size', 'bedrooms', 'bathrooms', 'price']

for idx, var in enumerate(variables):
    row = idx // 2
    col = idx % 2
    axes[row, col].boxplot(df[var].dropna())
    axes[row, col].set_title(f'{var}')
    
plt.show()

### IQR Method (Mathematical Check)

In [None]:
def find_outliers_iqr(data, column):
    """Find outliers using IQR method"""
    Q1 = data[column].quantile(0.25)  # 25th percentile
    Q3 = data[column].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1  # Interquartile Range
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | 
                    (data[column] > upper_bound)]
    return outliers

# Check ALL 4 variables
for col in ['size', 'bedrooms', 'bathrooms', 'price']:
    outliers = find_outliers_iqr(df, col)
    print(f"\n{col}: {len(outliers)} potential outliers")
    if len(outliers) > 0:
        print(outliers[[col]].head())  # Show first few
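
As a worked example of the arithmetic inside `find_outliers_iqr`, the sketch below applies the same steps to seven made-up prices (in thousands). The values are invented for illustration only.

```python
import pandas as pd

# Made-up prices in thousands; the last value is deliberately extreme
prices = pd.Series([100, 110, 120, 130, 140, 150, 500])

q1 = prices.quantile(0.25)   # 25th percentile
q3 = prices.quantile(0.75)   # 75th percentile
iqr = q3 - q1                # interquartile range

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values outside the two bounds
flagged = prices[(prices < lower_bound) | (prices > upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(flagged)
```

With pandas' default linear interpolation, Q1 is 115 and Q3 is 145, so the IQR is 30 and the bounds are 70 and 190. Only the value 500 falls outside them.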

### Common Patterns to Look For

### A. Single Variable Outliers

In [14]:
# Only PRICE has outliers: Might be luxury homes
# Only SIZE has outliers: Might be mansions or tiny homes
# Action: Consider if these are valid or errors

### B. Correlated Outliers

In [17]:
# Same houses are outliers in MULTIPLE variables:
# Size=8000, Price=5M, Bedrooms=8 → Legitimate mansion
# Size=100, Price=50K, Bedrooms=1 → Might be studio apartment
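
One way to spot houses that are outliers in several variables at once is to count, per row, how many columns flag it. The sketch below does this on made-up data. The helper `iqr_outlier_mask` and the column `n_outlier_vars` are hypothetical names introduced here.

```python
import pandas as pd

def iqr_outlier_mask(series):
    """Boolean mask: True where a value lies outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

# Made-up data: the last row is a large, expensive house with many bedrooms
df_demo = pd.DataFrame({
    'size':     [1400, 1500, 1600, 1700, 1800, 1900, 8000],
    'bedrooms': [3, 3, 3, 3, 4, 4, 8],
    'price':    [280_000, 300_000, 320_000, 340_000, 360_000, 380_000, 5_000_000],
})

# Apply the mask to every column, then count flags per row
flags = df_demo.apply(iqr_outlier_mask)
df_demo['n_outlier_vars'] = flags.sum(axis=1)

# Rows flagged in two or more variables are candidates for "legitimate mansion"
multi_variable_outliers = df_demo[df_demo['n_outlier_vars'] >= 2]
print(multi_variable_outliers)
```

A row that is extreme in size, bedrooms and price together tells a consistent story, which is weak evidence that it is a real property and not a typo in one field.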

### C. Isolated Outliers 

In [20]:
# Outlier in ONE variable but normal in others:
# Size=5000, Price=200K → Undervalued mansion?
# Size=1000, Price=2M → Overpriced small house?
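
Mismatches like these are about the relationship between two variables, so a derived ratio can help surface them. The sketch below is an assumption-laden illustration, not a method taken from this notebook: it computes a hypothetical `price_per_sqft` column on made-up data and applies the same IQR rule to it.

```python
import pandas as pd

# Made-up data: rows 6 and 7 mirror the two examples in the comments above
df_demo = pd.DataFrame({
    'size':  [1500, 1600, 1700, 1800, 2000, 2200, 5000, 1000],
    'price': [285_000, 312_000, 340_000, 369_000, 420_000, 473_000,
              200_000, 2_000_000],
})

# Hypothetical derived feature: price per square foot
df_demo['price_per_sqft'] = df_demo['price'] / df_demo['size']

# Same IQR rule as before, applied to the ratio
q1 = df_demo['price_per_sqft'].quantile(0.25)
q3 = df_demo['price_per_sqft'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

mismatched = df_demo[(df_demo['price_per_sqft'] < lower_bound) |
                     (df_demo['price_per_sqft'] > upper_bound)]
print(mismatched)
```

Rows flagged this way deserve a closer look at the source record, since the ratio alone cannot tell a bargain from a data entry error.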

---

### What to Do with Outliers

### Option 1: Keep Them (If Legitimate)

In [26]:
# Example: Legitimate luxury homes
# Action: Document them, consider robust regression
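
"Robust regression" covers several techniques. As one illustration, the sketch below implements the Theil-Sen slope (the median of all pairwise slopes) with NumPy only and compares it with an ordinary least-squares slope on made-up data. Theil-Sen is chosen here for brevity; the notebook does not prescribe a specific robust method, and `theil_sen_slope` is a hypothetical helper.

```python
from itertools import combinations

import numpy as np

def theil_sen_slope(x, y):
    """Median of the slopes between all pairs of points with distinct x."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    return float(np.median(slopes))

# Made-up data: five houses on the line price = 100 * size, plus one extreme house
size = np.array([1000.0, 1200.0, 1400.0, 1600.0, 1800.0, 8000.0])
price = np.array([100_000.0, 120_000.0, 140_000.0, 160_000.0, 180_000.0, 200_000.0])

ols_slope = np.polyfit(size, price, 1)[0]     # ordinary least squares
robust_slope = theil_sen_slope(size, price)   # median-based estimate

print(f"OLS slope:       {ols_slope:.2f}")
print(f"Theil-Sen slope: {robust_slope:.2f}")
```

Because the median ignores the minority of pairwise slopes that involve the extreme house, the robust estimate stays close to the trend of the other five points in this toy sample.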

### Option 2: Transform Them

In [None]:
# Example: Use log transformation (requires strictly positive values)
import numpy as np

df['log_price'] = np.log(df['price'])
df['log_size'] = np.log(df['size'])
# This reduces impact of extreme values
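
To see what the log transformation does to an extreme value, the sketch below counts IQR outliers before and after `np.log` on eight made-up prices. The helper `count_iqr_outliers` is a hypothetical name, and the result holds for this sample only: a log transform does not always bring every extreme value inside the bounds.

```python
import numpy as np
import pandas as pd

def count_iqr_outliers(series):
    """Count values outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(outside.sum())

# Made-up, right-skewed prices; all strictly positive so the log is defined
prices = pd.Series([200_000, 220_000, 250_000, 280_000,
                    300_000, 350_000, 400_000, 600_000])

n_raw = count_iqr_outliers(prices)
n_log = count_iqr_outliers(np.log(prices))

print(f"IQR outliers on raw prices: {n_raw}")
print(f"IQR outliers on log prices: {n_log}")
```

The log compresses large values more than small ones, so the gap between the top price and the rest shrinks relative to the spread of the bulk of the data.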

### Option 3: Remove Them (If Errors)

In [None]:
# Example: Clear data entry errors
# (thresholds are illustrative; choose limits that fit your data)
df_clean = df[(df['price'] > 0) & 
              (df['size'] > 100) & 
              (df['bedrooms'] <= 10)]

### Option 4: Create Indicator Variable 

In [None]:
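# Flag outliers without removing them (upper bound = Q3 + 1.5 * IQR, as in the IQR method)
price_q1, price_q3 = df['price'].quantile([0.25, 0.75])
size_q1, size_q3 = df['size'].quantile([0.25, 0.75])
price_upper_bound = price_q3 + 1.5 * (price_q3 - price_q1)
size_upper_bound = size_q3 + 1.5 * (size_q3 - size_q1)
df['is_outlier'] = ((df['price'] > price_upper_bound) | 
                    (df['size'] > size_upper_bound)).astype(int)
# Include this flag in your regression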
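
What "include this flag in your regression" can look like is sketched below with plain least squares (`np.linalg.lstsq`) on made-up data. The helper `iqr_upper_bound`, the data and the premium added to the last house are all assumptions for illustration. Your modeling library of choice would take the flag as one more feature column in the same way.

```python
import numpy as np
import pandas as pd

def iqr_upper_bound(series):
    """Upper IQR fence: Q3 + 1.5 * IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)

# Made-up data: price = 100 * size, plus a premium on the one very large house
df_demo = pd.DataFrame({'size': [1000, 1200, 1400, 1600, 1800, 2000, 6000]})
df_demo['price'] = 100 * df_demo['size']
df_demo.loc[6, 'price'] += 1_000_000  # hypothetical luxury premium

# Flag rows above the upper fence in price or size
df_demo['is_outlier'] = (
    (df_demo['price'] > iqr_upper_bound(df_demo['price'])) |
    (df_demo['size'] > iqr_upper_bound(df_demo['size']))
).astype(int)

# Design matrix: intercept, size, and the outlier flag
X = np.column_stack([np.ones(len(df_demo)), df_demo['size'], df_demo['is_outlier']])
y = df_demo['price'].to_numpy(dtype=float)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, size_coef, flag_coef = coef
print(f"size coefficient: {size_coef:.2f}, flag coefficient: {flag_coef:.0f}")
```

The flag lets flagged rows have their own intercept shift, so the size coefficient is estimated mainly from the ordinary houses. With only one flagged row, as here, the flag simply absorbs that row's deviation, so treat its coefficient with caution.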

In [42]:
"""Start Pre-Modeling Outlier Check
      ↓
Check each variable separately
      ↓
For each variable:
  1. Look at min/max → Any impossible values?
  2. Look at boxplot → Visual outliers?
  3. Calculate IQR → Statistical outliers?
      ↓
Document findings:
  - How many outliers in each variable?
  - Are they correlated across variables?
  - Do they make sense for housing data?
      ↓
Decide: Keep, Transform, Remove, or Flag
      ↓
Proceed to build regression model

"""

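
The "document findings" step of the workflow can be condensed into one summary table. The sketch below is a minimal version under stated assumptions: `outlier_report` is a hypothetical helper, and `df_demo` is made-up data standing in for the housing DataFrame.

```python
import pandas as pd

def outlier_report(data, columns):
    """Summarise min, max, IQR bounds and outlier count for each column."""
    rows = []
    for col in columns:
        values = data[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        n_out = int(((values < lower) | (values > upper)).sum())
        rows.append({
            'variable': col,
            'min': values.min(),
            'max': values.max(),
            'lower_bound': lower,
            'upper_bound': upper,
            'n_outliers': n_out,
        })
    return pd.DataFrame(rows).set_index('variable')

# Made-up data for illustration
df_demo = pd.DataFrame({
    'size':  [1400, 1500, 1600, 1700, 1800, 1900, 8000],
    'price': [280_000, 300_000, 320_000, 340_000, 360_000, 380_000, 5_000_000],
})

report = outlier_report(df_demo, ['size', 'price'])
print(report)
```

Saving this table alongside the model makes it easier to explain later which values were kept, transformed, removed or flagged, and why.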