### Case Study: Polynomial Regression

**Why Polynomial Regression?**

Polynomial regression is used when the relationship between the independent variable \( X \) and the dependent variable \( y \) is not linear. Real-world data often follows curved patterns that simple linear regression cannot capture. By adding powers of \( X \) as extra features, polynomial regression can model a wide range of curves while still being fit with ordinary linear regression.
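
As a sketch of the model being described, a polynomial regression of degree \( d \) can be written as follows. The symbols \( \beta_0, \dots, \beta_d \) (coefficients), \( d \) (degree), and \( \varepsilon \) (noise term) are notation introduced here for illustration and do not appear in the original text:

\[
y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_d X^d + \varepsilon
\]

The model is non-linear in \( X \) but linear in the coefficients, which is why a standard linear regression solver can fit it. The case study below uses \( d = 2 \).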

**Use Case Example:**
Consider commuting time as a function of distance. A linear relationship may hold over short distances, but over longer distances the time per unit of distance can rise or fall because of traffic congestion, varying speeds, or a change in the mode of transportation.

**Pros and Cons of Polynomial Regression:**

**Pros:**
- **Flexibility:** Can model non-linear relationships.
- **Better Fit:** Reduces the systematic error that a straight line leaves behind when the underlying trend is curved.

**Cons:**
- **Overfitting:** Higher-degree polynomials can lead to overfitting, capturing noise rather than the underlying trend.
- **Complexity:** The number of terms grows with the degree of the polynomial, which makes the coefficients harder to interpret and the fit more expensive to compute.
- **Extrapolation:** Polynomial models can behave erratically outside the range of the training data.
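
The extrapolation risk can be illustrated with a small experiment. The sketch below reuses the synthetic data-generating process from this case study and compares a degree-2 and a degree-6 fit at a distance of 40, well outside the training range of 1 to 20. The degree 6, the test point 40, and the use of `make_pipeline` are choices made for this illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic process as the case study: quadratic trend plus noise
np.random.seed(42)
distance = np.random.uniform(1, 20, 100)
hours = 0.1 * distance**2 + 0.5 * distance + np.random.normal(0, 1, 100)
X = distance.reshape(-1, 1)

# A point far outside the training range [1, 20] (illustrative choice)
x_far = np.array([[40.0]])
true_far = 0.1 * 40.0**2 + 0.5 * 40.0  # noise-free value of the generating formula

extrapolated = {}
for degree in (2, 6):
    candidate = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    candidate.fit(X, hours)
    extrapolated[degree] = float(candidate.predict(x_far)[0])
    print(f"degree={degree}: prediction at 40 = {extrapolated[degree]:.1f} "
          f"(noise-free value {true_far:.1f})")
```

The degree-2 model matches the form of the generating process, so its prediction should stay reasonably close to the noise-free value. Higher-degree fits typically drift much further, because the extra terms are determined only by noise inside the training range.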

### Steps to Implement Polynomial Regression

1. **Generate Sample Data**
2. **Data Preprocessing**
3. **Split Data into Training and Testing Sets**
4. **Transform Features for Polynomial Regression**
5. **Fit the Polynomial Regression Model**
6. **Make Predictions**
7. **Evaluate the Model**
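
For orientation, the steps above can be condensed into a single scikit-learn pipeline. This is a minimal sketch rather than the code used in the rest of the case study: it works on NumPy arrays instead of a DataFrame, and `make_pipeline` is an alternative to calling the transformer and the model separately.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Steps 1-2: synthetic data (same process and seed as the case study)
np.random.seed(42)
distance = np.random.uniform(1, 20, 100)
hours = 0.1 * distance**2 + 0.5 * distance + np.random.normal(0, 1, 100)
X = distance.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

# Step 3: hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, hours, test_size=0.2, random_state=42
)

# Steps 4-5: feature expansion and linear regression chained together
pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
pipeline.fit(X_train, y_train)

# Steps 6-7: score() predicts on the test set and returns R-squared
test_r2 = pipeline.score(X_test, y_test)
print(f"Testing R-squared: {test_r2:.3f}")
```

A pipeline keeps the transformer and the model together, so the polynomial expansion is fit on the training data only and is applied automatically at prediction time.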

### Step 1: Generate Sample Data

Let's generate sample data in which commuting time has a non-linear relationship with distance. The cell below draws 100 distances uniformly from the range 1 to 20 and computes the commuting time (the `Hours` column) as \( 0.1 \cdot \text{Distance}^2 + 0.5 \cdot \text{Distance} \) plus Gaussian noise with mean 0 and standard deviation 1.

In [1]:
import numpy as np
import pandas as pd

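# Generate sample data (fixed seed for reproducibility)
np.random.seed(42)
# 100 distances drawn uniformly from [1, 20]
distance = np.random.uniform(1, 20, 100)
# Quadratic trend plus Gaussian noise with standard deviation 1
hours = 0.1 * distance**2 + 0.5 * distance + np.random.normal(0, 1, 100)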

data = {
    'Distance': distance,
    'Hours': hours
}

df = pd.DataFrame(data)
print(df.head())

    Distance      Hours
0   8.116262  10.732550
1  19.063572  45.574756
2  14.907885  29.770206
3  12.374511  19.512539
4   3.964354   3.334116


### Step 2: Data Preprocessing

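Check for missing values and inspect the summary statistics before modeling. The data here is synthetic, so no cleaning is needed, but the same checks apply to real data.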

In [2]:
# Check for null values
print(df.isnull().sum())

# Summary statistics
print(df.describe())

Distance    0
Hours       0
dtype: int64
         Distance       Hours
count  100.000000  100.000000
mean     9.933434   17.995848
std      5.652299   14.582097
min      1.104920    0.355942
25%      4.670814    4.976722
50%      9.818707   14.309141
75%     14.873859   30.430579
max     19.750852   49.666863


### Step 3: Split Data into Training and Testing Sets

In [3]:
from sklearn.model_selection import train_test_split

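X = df[['Distance']]  # double brackets keep the features 2-D
y = df['Hours']       # target as a 1-D Series

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)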
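
A quick sanity check on the split can catch shape mistakes early. The sketch below rebuilds the same DataFrame so that it runs on its own, then confirms the 80/20 split and that no row ends up in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild the case-study data so this snippet is self-contained
np.random.seed(42)
distance = np.random.uniform(1, 20, 100)
hours = 0.1 * distance**2 + 0.5 * distance + np.random.normal(0, 1, 100)
df = pd.DataFrame({'Distance': distance, 'Hours': hours})

X = df[['Distance']]
y = df['Hours']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 rows with test_size=0.2 gives 80 training rows and 20 testing rows
print(X_train.shape, X_test.shape)

# The two sets should share no row labels
overlap = X_train.index.intersection(X_test.index)
print(f"Rows in both sets: {len(overlap)}")
```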

### Step 4: Transform Features for Polynomial Regression

In [7]:
from sklearn.preprocessing import PolynomialFeatures

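# Create polynomial features: columns 1, Distance, Distance^2
poly = PolynomialFeatures(degree=2)
# Fit on the training set only, then reuse the fitted transformer on the test set
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)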

### Step 5: Fit the Polynomial Regression Model

In [8]:
from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the expanded features;
# the model is still linear in its coefficients
model = LinearRegression()
model.fit(X_train_poly, y_train)

LinearRegression()
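
Because the data was generated from a known formula, the fitted coefficients can be compared with the true values 0.1 (quadratic term) and 0.5 (linear term). The sketch below refits the same model on NumPy arrays so that it runs on its own; the variable names `linear_coef` and `quadratic_coef` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Rebuild the case-study data and split
np.random.seed(42)
distance = np.random.uniform(1, 20, 100)
hours = 0.1 * distance**2 + 0.5 * distance + np.random.normal(0, 1, 100)
X_train, X_test, y_train, y_test = train_test_split(
    distance.reshape(-1, 1), hours, test_size=0.2, random_state=42
)

poly = PolynomialFeatures(degree=2)
model = LinearRegression()
model.fit(poly.fit_transform(X_train), y_train)

# coef_ follows the column order 1, x, x^2. The constant column gets a
# coefficient of (numerically) zero because LinearRegression fits its
# own intercept, stored separately in intercept_.
_, linear_coef, quadratic_coef = model.coef_
print(f"intercept:      {model.intercept_:.3f} (true value 0)")
print(f"linear term:    {linear_coef:.3f} (true value 0.5)")
print(f"quadratic term: {quadratic_coef:.3f} (true value 0.1)")
```

The estimates will not match the true values exactly because of the added noise, and the linear term is usually estimated less precisely than the quadratic one since the two columns are strongly correlated over this range.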

### Step 6: Make Predictions

In [9]:
# Predict on both sets so training and testing error can be compared
y_train_pred = model.predict(X_train_poly)
y_test_pred = model.predict(X_test_poly)
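
Predicting for a new, unseen distance requires the same feature transformation before calling the model. The sketch below is self-contained and works on NumPy arrays; the distance of 10 is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Rebuild the case-study data, split, and fitted model
np.random.seed(42)
distance = np.random.uniform(1, 20, 100)
hours = 0.1 * distance**2 + 0.5 * distance + np.random.normal(0, 1, 100)
X_train, X_test, y_train, y_test = train_test_split(
    distance.reshape(-1, 1), hours, test_size=0.2, random_state=42
)
poly = PolynomialFeatures(degree=2)
model = LinearRegression().fit(poly.fit_transform(X_train), y_train)

# New input must be 2-D and must go through the fitted transformer first
new_distance = np.array([[10.0]])
predicted = model.predict(poly.transform(new_distance))[0]

# Noise-free value from the generating formula, for comparison
noise_free = 0.1 * 10.0**2 + 0.5 * 10.0
print(f"Predicted: {predicted:.2f}, noise-free value: {noise_free:.2f}")
```

Passing the raw distance straight to `model.predict` would fail, because the model was trained on three columns rather than one.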

### Step 7: Evaluate the Model

In [10]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Training set evaluation
train_mae = mean_absolute_error(y_train, y_train_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

# Testing set evaluation
test_mae = mean_absolute_error(y_test, y_test_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

print(f'Training MAE: {train_mae}')
print(f'Training MSE: {train_mse}')
print(f'Training R-squared: {train_r2}')
print(f'Testing MAE: {test_mae}')
print(f'Testing MSE: {test_mse}')
print(f'Testing R-squared: {test_r2}')

Training MAE: 0.7034466665093435
Training MSE: 0.8147153703416304
Training R-squared: 0.9960376124547679
Testing MAE: 0.577998177949695
Testing MSE: 0.6358406072820825
Testing R-squared: 0.997222493423401


### Interpretation of Results

- **MAE:** The average absolute prediction error, in the same units as `Hours`. It is about 0.70 on the training set and about 0.58 on the testing set.
- **MSE:** The average squared prediction error, which penalizes large errors more heavily. It is about 0.81 on the training set and about 0.64 on the testing set. Both are on the order of 1, the variance of the noise added in Step 1, which suggests that most of the remaining error is noise rather than model misfit.
- **R-squared:** The proportion of variance in `Hours` explained by the model. It is about 0.996 on the training set and about 0.997 on the testing set.

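These metrics help evaluate the model's performance and its ability to generalize. If the training error is much lower than the testing error, the model might be overfitting. If both errors are high, the model might be underfitting. Here the testing error is slightly lower than the training error and R-squared is high on both sets, so there is no sign of either problem.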
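
The underfitting and overfitting checks described above can be tried by fitting several degrees and comparing training and testing MSE side by side. This is a sketch under the same synthetic setup; the degrees 1, 2, and 6 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Rebuild the case-study data and split
np.random.seed(42)
distance = np.random.uniform(1, 20, 100)
hours = 0.1 * distance**2 + 0.5 * distance + np.random.normal(0, 1, 100)
X_train, X_test, y_train, y_test = train_test_split(
    distance.reshape(-1, 1), hours, test_size=0.2, random_state=42
)

results = {}
for degree in (1, 2, 6):
    candidate = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    candidate.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, candidate.predict(X_train))
    test_mse = mean_squared_error(y_test, candidate.predict(X_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The straight line (degree 1) should show clearly higher error on both sets, which is the underfitting pattern. For the higher degree, a gap that opens between training and testing MSE would point to overfitting; with only one feature and 80 training rows, that gap may be small.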