In [None]:
# Install required libraries
!pip install matplotlib scikit-learn seaborn --quiet

## 1. What is Simple Linear Regression?

Simple Linear Regression is a statistical method used to model the relationship between a single independent variable (X) and a dependent variable (Y) by fitting a linear equation to observed data. The model is of the form:

\[ Y = mX + c \]

Where:
- \( Y \) is the predicted value,
- \( m \) is the slope of the line,
- \( X \) is the independent variable,
- \( c \) is the intercept.

## 2. What are the key assumptions of Simple Linear Regression?

1. Linearity
2. Independence of errors
3. Homoscedasticity (constant variance of errors)
4. Normality of errors
5. No multicollinearity (not applicable with a single predictor; it only becomes a concern once a second predictor is added)

## 3. What does the coefficient m represent in the equation Y = mX + c?

The coefficient \( m \) represents the **slope** of the regression line. It indicates the change in the dependent variable \( Y \) for a one-unit change in the independent variable \( X \).

## 4. What does the intercept c represent in the equation Y = mX + c?

The intercept \( c \) is the predicted value of \( Y \) when \( X = 0 \). It represents the point where the regression line crosses the Y-axis.

## 5. How do we calculate the slope m in Simple Linear Regression?

The formula for slope \( m \):

\[ m = \frac{n\sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} \]
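As a minimal sketch, the slope formula can be evaluated directly with NumPy. The data values below are made up for illustration. The intercept is obtained from the standard fact that the fitted line passes through the point of means, and `np.polyfit` serves as a cross-check.

```python
import numpy as np

# Illustrative data (invented for this sketch, not from a real dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])

n = len(x)

# Slope from the closed-form formula
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x) ** 2)

# Intercept: the least squares line passes through (mean(x), mean(y))
c = np.mean(y) - m * np.mean(x)

# Cross-check against NumPy's degree-1 least squares fit
m_np, c_np = np.polyfit(x, y, 1)

print(f"formula: m = {m:.4f}, c = {c:.4f}")
print(f"polyfit: m = {m_np:.4f}, c = {c_np:.4f}")
```

Both approaches should agree up to floating-point rounding.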

## 6. What is the purpose of the least squares method in Simple Linear Regression?

The **least squares method** minimizes the sum of squared differences between the observed and predicted values. It helps find the best-fitting regression line.
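To make the idea concrete, here is a small sketch that computes the sum of squared errors (SSE) for the least squares line and for a few slightly nudged lines. The data and the helper name `sse` are illustrative assumptions.

```python
import numpy as np

# Illustrative data (invented for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])

def sse(m, c):
    """Sum of squared errors for the line y = m*x + c (hypothetical helper)."""
    residuals = y - (m * x + c)
    return float(np.sum(residuals**2))

# Least squares solution of [x 1] @ [m, c] = y in the least squares sense
A = np.column_stack([x, np.ones_like(x)])
(m_hat, c_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

best = sse(m_hat, c_hat)

# Nudging the slope or intercept away from the solution increases the SSE
nudged = [sse(m_hat + dm, c_hat + dc) for dm in (-0.1, 0.1) for dc in (-0.1, 0.1)]

print(f"SSE at least squares solution: {best:.4f}")
print(f"Smallest SSE among nudged lines: {min(nudged):.4f}")
```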

## 7. How is the coefficient of determination (R²) interpreted in Simple Linear Regression?

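R² represents the proportion of the variance in the dependent variable that is explained by the independent variable. For a least squares fit with an intercept, it ranges from 0 to 1 on the data used to fit the model. On new data it can be negative, which means the model predicts worse than simply using the mean of Y.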
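A minimal sketch of the calculation, assuming illustrative data: R² is one minus the ratio of the residual sum of squares to the total sum of squares, and the manual value should match scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Illustrative data (invented for this sketch)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y, y_hat)

print(f"manual R²: {r2_manual:.6f}, r2_score: {r2_sklearn:.6f}")
```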

## 8. What is Multiple Linear Regression?

Multiple Linear Regression is a method that models the relationship between a dependent variable and **two or more** independent variables.

\[ Y = b_0 + b_1X_1 + b_2X_2 + \dots + b_nX_n \]
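As a rough sketch, the equation can be fitted with scikit-learn's `LinearRegression`. The synthetic data below is generated from known coefficients (an assumption made for illustration) so the recovered values can be checked.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients: b0 = 3.0, b1 = 1.5, b2 = -2.0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

print("intercept b0:", model.intercept_)
print("coefficients b1, b2:", model.coef_)
```

The estimates should land close to the generating values, with small deviations caused by the added noise.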

## 9. What is the main difference between Simple and Multiple Linear Regression?

- Simple Linear Regression: one independent variable
- Multiple Linear Regression: two or more independent variables

## 10. What are the key assumptions of Multiple Linear Regression?

1. Linearity
2. Independence of errors
3. Homoscedasticity
4. Normality of errors
5. No multicollinearity

## 11. What is heteroscedasticity, and how does it affect the results of a Multiple Linear Regression model?

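Heteroscedasticity refers to non-constant variance of the residuals across the range of fitted values. The coefficient estimates remain unbiased, but it affects the model by:
- Biasing the estimated standard errors, which invalidates t-tests, F-tests, and confidence intervals
- Making the least squares estimates less efficient (higher variance than a properly weighted estimator)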

## 12. How can you improve a Multiple Linear Regression model with high multicollinearity?

- Remove correlated predictors
- Use PCA (Principal Component Analysis)
- Apply Ridge or Lasso Regression
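Before choosing a remedy, it helps to measure how correlated the predictors are. One common diagnostic is the variance inflation factor (VIF), defined as \( 1 / (1 - R_j^2) \), where \( R_j^2 \) comes from regressing predictor \( j \) on the other predictors. The sketch below implements it with scikit-learn; the helper name `vif` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Variance inflation factor for each column of X (hypothetical helper)."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # R² from regressing column j on the remaining columns
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)  # nearly a copy of x1
x3 = rng.normal(size=300)                  # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print(np.round(vifs, 1))
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity; here the first two columns should far exceed that, while the third stays near 1.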

## 13. What are some common techniques for transforming categorical variables for use in regression models?

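- One-hot encoding (one 0/1 column per category; drop one column to avoid perfect collinearity with the intercept)
- Label encoding (maps categories to integers, which implies an order, so it suits ordinal categories best)
- Binary encoding (writes the category index in binary, using one column per bit)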

## 14. What is the role of interaction terms in Multiple Linear Regression?

Interaction terms model the combined effect of two or more variables on the target variable. They capture non-additive relationships, where the effect of one variable depends on the value of another.
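A small sketch of the idea, using noise-free synthetic data generated with a known interaction (an assumption for illustration): adding the product column \( X_1 X_2 \) lets the model recover the combined effect that a purely additive model misses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data where the effect of x1 depends on x2
rng = np.random.default_rng(1)
x1 = rng.uniform(0, 1, 200)
x2 = rng.uniform(0, 1, 200)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 4.0 * x1 * x2

X_additive = np.column_stack([x1, x2])
X_interaction = np.column_stack([x1, x2, x1 * x2])  # extra product column

r2_additive = LinearRegression().fit(X_additive, y).score(X_additive, y)
model_interaction = LinearRegression().fit(X_interaction, y)
r2_interaction = model_interaction.score(X_interaction, y)

print(f"R² without interaction: {r2_additive:.4f}")
print(f"R² with interaction:    {r2_interaction:.4f}")
print("coefficients:", model_interaction.coef_)
```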

## 15. How can the interpretation of intercept differ between Simple and Multiple Linear Regression?

- Simple: Y-intercept when X = 0
- Multiple: Value of Y when **all Xs = 0**, which may or may not be meaningful

## 16. What is the significance of the slope in regression analysis, and how does it affect predictions?

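The slope indicates the rate of change of the dependent variable with respect to the independent variable: each one-unit increase in X changes the prediction by the value of the slope. A larger absolute slope means a larger change per unit of X, but slopes are only comparable across variables measured on the same scale.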

## 17. How does the intercept in a regression model provide context for the relationship between variables?

It provides a baseline value of the dependent variable when all independent variables are 0.

## 18. What are the limitations of using R² as a sole measure of model performance?

- Doesn't indicate whether the model is biased
- Never decreases when variables are added, so it can reward overfitting
- Doesn't indicate a causal relationship

## 19. How would you interpret a large standard error for a regression coefficient?

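A large standard error indicates that the coefficient is imprecisely estimated: its confidence interval is wide, and the estimate may not be statistically distinguishable from zero. Common causes are a small sample, noisy data, or multicollinearity.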

## 20. How can heteroscedasticity be identified in residual plots, and why is it important to address it?

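- Identified when residuals plotted against fitted values fan out (a funnel shape) instead of forming a band of roughly constant width.
- It violates a model assumption and undermines the validity of standard errors, confidence intervals, and hypothesis tests.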
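Alongside a residual plot, a rough numeric check can help. The sketch below is an informal diagnostic, not a formal test: it fits a line to synthetic data whose noise grows with X (an assumption for illustration) and compares the residual spread in the lower and upper halves of X.

```python
import numpy as np

# Synthetic data with noise that grows with x (heteroscedastic by construction)
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 400)
y = 2.0 + 0.5 * x + rng.normal(scale=0.2 * x)

# Fit a straight line and compute residuals
m, c = np.polyfit(x, y, 1)
residuals = y - (m * x + c)

# Compare residual spread for small x versus large x
low = residuals[x <= np.median(x)]
high = residuals[x > np.median(x)]
spread_ratio = high.std() / low.std()

print(f"residual std (low x):  {low.std():.3f}")
print(f"residual std (high x): {high.std():.3f}")
print(f"ratio: {spread_ratio:.2f}")
```

A ratio well above 1 matches the funnel shape seen in a residual plot; a ratio near 1 is what constant variance would produce.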

## 21. What does it mean if a Multiple Linear Regression model has a high R² but low adjusted R²?

It means that some of the added variables contribute little explanatory power. Adjusted R² penalizes each additional predictor, so it falls below R² when predictors are added without a matching improvement in fit.
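The penalty can be seen from the standard formula, where \( n \) is the number of observations and \( p \) the number of predictors:

\[ R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \]

A minimal sketch follows; the function name `adjusted_r2` and the input values are illustrative.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R² from plain R², sample count, and predictor count (hypothetical helper)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same R² and sample size, different numbers of predictors
few_predictors = adjusted_r2(0.80, n_samples=50, n_features=2)
many_predictors = adjusted_r2(0.80, n_samples=50, n_features=20)

print(f"2 predictors:  {few_predictors:.4f}")
print(f"20 predictors: {many_predictors:.4f}")
```

With the same R², the model that uses more predictors receives the lower adjusted R².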

## 22. Why is it important to scale variables in Multiple Linear Regression?

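- Puts coefficients on a comparable scale, so their magnitudes can be compared across variables
- Important for regularized models like Ridge and Lasso, whose penalties depend on coefficient size
- Helps gradient-based solvers converge; ordinary least squares predictions themselves are unaffected by scaling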
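One way to apply scaling is a scikit-learn pipeline with `StandardScaler`. The sketch below uses synthetic features on very different scales (hypothetical `income` and `age` columns) and checks that ordinary least squares predictions are the same with or without scaling, while Ridge is fitted on the standardized features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (invented for this sketch)
rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, 200)
age = rng.normal(40, 10, 200)
X = np.column_stack([income, age])
y = 0.0001 * income + 0.5 * age + rng.normal(scale=1.0, size=200)

# Plain least squares: scaling changes the coefficients, not the predictions
ols_raw = LinearRegression().fit(X, y)
ols_scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
same_predictions = np.allclose(ols_raw.predict(X), ols_scaled.predict(X))

# Ridge: the penalty acts on coefficient size, so features are standardized first
ridge_scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
ridge_coefs = ridge_scaled.named_steps["ridge"].coef_

print("OLS predictions unchanged by scaling:", same_predictions)
print("Ridge coefficients on standardized features:", ridge_coefs)
```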

## 23. What is polynomial regression?

Polynomial regression is a form of regression analysis in which the relationship between the independent and dependent variables is modeled as an nth-degree polynomial.

## 24. How does polynomial regression differ from linear regression?

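- Linear regression fits a straight line (or a plane, when there are several predictors).
- Polynomial regression fits a curve by adding powers of X as features. It is still linear in the coefficients, so it is fitted with the same least squares method.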

## 25. When is polynomial regression used?

- When the relationship between the variables is non-linear.
- Useful in modeling curves or trends.

## 26. What is the general equation for polynomial regression?

\[ Y = b_0 + b_1X + b_2X^2 + b_3X^3 + \dots + b_nX^n \]

## 27. Can polynomial regression be applied to multiple variables?

Yes, it becomes a **multivariate polynomial regression**, where each variable can be raised to different powers and variables can be multiplied together to form cross terms (for example, \( X_1 X_2 \)).

## 28. What are the limitations of polynomial regression?

- Overfitting with high degrees
- Difficult to interpret
- Sensitive to outliers

## 29. What methods can be used to evaluate model fit when selecting the degree of a polynomial?

- Cross-validation
- Adjusted R²
- AIC/BIC
- Visual inspection of residual plots
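The first method in this list can be sketched as follows: fit one pipeline per candidate degree and compare the cross-validated mean squared error. The synthetic quadratic data, the 5-fold split, and the range of degrees are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data from a quadratic relationship plus noise
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = 1.0 + 0.5 * X.ravel() + 2.0 * X.ravel() ** 2 + rng.normal(scale=1.0, size=60)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scikit-learn returns negated MSE, so flip the sign back
    neg_mse = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    scores[degree] = -neg_mse.mean()

best_degree = min(scores, key=scores.get)

for degree, mse in scores.items():
    print(f"degree {degree}: CV MSE = {mse:.3f}")
print("lowest CV error at degree:", best_degree)
```

A straight line (degree 1) should score clearly worse than degree 2 here, while higher degrees typically bring little or no further improvement.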

## 30. Why is visualization important in polynomial regression?

- Helps detect overfitting or underfitting
- Makes model behavior interpretable
- Useful in selecting the appropriate polynomial degree

## 31. How is polynomial regression implemented in Python?

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt

# Sample data
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([1.2, 1.9, 3.1, 3.9, 5.1, 6.0, 6.8, 8.3, 9.1, 9.9])

# Degree 2 polynomial
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

# Predictions
X_test = np.linspace(1, 10, 100).reshape(-1, 1)
y_pred = model.predict(X_test)

# Plotting
plt.scatter(X, y, color='red', label='Data')
plt.plot(X_test, y_pred, color='blue', label='Polynomial Fit')
plt.title('Polynomial Regression (degree=2)')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()