# Homework 02 - Regression

Goal: Create a regression model for predicting car fuel efficiency (column 'fuel_efficiency_mpg')

## 1. Data Preparation

Load the dataset and select only the required columns.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

In [None]:
# Load the dataset with only the required columns
columns = [
    'engine_displacement',
    'horsepower',
    'vehicle_weight',
    'model_year',
    'fuel_efficiency_mpg'
]

df = pd.read_csv('car_fuel_efficiency.csv', usecols=columns)
print(f"Dataset shape: {df.shape}")
df.head()

## 2. Exploratory Data Analysis (EDA)

In [None]:
# Basic information about the dataset
print("Dataset Info:")
print(df.info())
print("\nMissing values:")
print(df.isnull().sum())
print("\nBasic statistics:")
df.describe()

## 3. Analysis of fuel_efficiency_mpg Distribution

Let's examine the distribution of the target variable to see if it has a long tail.

In [None]:
# Statistical summary of fuel_efficiency_mpg
print("Fuel Efficiency MPG Statistics:")
print(f"Mean: {df['fuel_efficiency_mpg'].mean():.2f}")
print(f"Median: {df['fuel_efficiency_mpg'].median():.2f}")
print(f"Std: {df['fuel_efficiency_mpg'].std():.2f}")
print(f"Min: {df['fuel_efficiency_mpg'].min():.2f}")
print(f"Max: {df['fuel_efficiency_mpg'].max():.2f}")
print(f"\nSkewness: {df['fuel_efficiency_mpg'].skew():.2f}")
print(f"Kurtosis: {df['fuel_efficiency_mpg'].kurtosis():.2f}")

In [None]:
# Create visualizations to check for long tail
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Histogram
axes[0, 0].hist(df['fuel_efficiency_mpg'].dropna(), bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Fuel Efficiency (MPG)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Histogram of Fuel Efficiency')
axes[0, 0].axvline(df['fuel_efficiency_mpg'].mean(), color='red', linestyle='--', label='Mean')
axes[0, 0].axvline(df['fuel_efficiency_mpg'].median(), color='green', linestyle='--', label='Median')
axes[0, 0].legend()

# 2. Box plot
axes[0, 1].boxplot(df['fuel_efficiency_mpg'].dropna(), vert=True)
axes[0, 1].set_ylabel('Fuel Efficiency (MPG)')
axes[0, 1].set_title('Box Plot of Fuel Efficiency')

# 3. Q-Q plot to check for normality
from scipy import stats
stats.probplot(df['fuel_efficiency_mpg'].dropna(), dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot')

# 4. KDE (Kernel Density Estimation) plot
df['fuel_efficiency_mpg'].dropna().plot(kind='kde', ax=axes[1, 1])
axes[1, 1].set_xlabel('Fuel Efficiency (MPG)')
axes[1, 1].set_title('Kernel Density Estimation')
axes[1, 1].axvline(df['fuel_efficiency_mpg'].mean(), color='red', linestyle='--', label='Mean')
axes[1, 1].axvline(df['fuel_efficiency_mpg'].median(), color='green', linestyle='--', label='Median')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Check percentiles to understand the distribution
print("Percentile Analysis:")
percentiles = [10, 25, 50, 75, 90, 95, 99]
for p in percentiles:
    value = df['fuel_efficiency_mpg'].quantile(p/100)
    print(f"{p}th percentile: {value:.2f} MPG")

### Analysis: Does fuel_efficiency_mpg have a long tail?

To determine if the distribution has a long tail, we examine:

1. **Skewness**: 
   - A skewness value > 0 indicates right skew (long tail on the right)
   - A skewness value < 0 indicates left skew (long tail on the left)
   - Values between -0.5 and 0.5 are approximately symmetric

2. **Mean vs Median**:
   - If mean > median: right-skewed (long tail on right)
   - If mean < median: left-skewed (long tail on left)

3. **Visual Inspection**:
   - Histogram and KDE plot show the shape of the distribution
   - Box plot shows outliers that contribute to the tail
   - Q-Q plot shows deviation from normal distribution

---

### **ANSWER: Does fuel_efficiency_mpg have a long tail?**

**NO**, the fuel_efficiency_mpg variable does **NOT** have a significant long tail.

**Evidence:**
- **Skewness**: -0.01 (essentially 0, indicating symmetry)
- **Mean**: 14.99 MPG
- **Median**: 15.01 MPG
- **Kurtosis**: 0.02 (close to 0, indicating normal distribution)

**Conclusion:**

The distribution of fuel_efficiency_mpg is **approximately symmetric** with no significant skew in either direction. The mean and median are nearly identical, and the skewness value is very close to zero. This indicates that the data follows a relatively normal distribution without extreme values pulling the tail in either direction.

The fuel efficiency values are concentrated around the center (15 MPG) with a relatively uniform spread, making this variable suitable for standard regression techniques without requiring special transformations to handle skewness.