# Theory Questions

1. **What is Simple Linear Regression?**
   - Simple Linear Regression is a statistical method used to model the relationship between a dependent variable (Y) and an independent variable (X) using a straight-line equation: **Y = mX + c**.

2. **What are the key assumptions of Simple Linear Regression?**
   1. Linearity: The relationship between X and Y is linear.
   2. Independence: Observations are independent of each other.
   3. Homoscedasticity: Constant variance of residuals.
   4. Normality: Residuals follow a normal distribution.
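   5. Multicollinearity is not a concern here: with a single predictor there is nothing for X to be collinear with. It becomes an assumption only in Multiple Linear Regression.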

3. **What does the coefficient m represent in the equation Y = mX + c?**
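   - The coefficient **m** (slope) represents the expected change in Y for a one-unit increase in X. Its sign gives the direction of the relationship and its magnitude the size of the effect; the strength of the association is measured by the correlation coefficient or R², not by m.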

4. **What does the intercept c represent in the equation Y = mX + c?**
   - The intercept **c** is the predicted value of Y when X is zero, i.e. the point where the regression line crosses the Y-axis.

5. **How do we calculate the slope m in Simple Linear Regression?**
   - The slope **m** is calculated using the formula: **m = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)²**, where X̄ and Ȳ are the mean values of X and Y.
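
   The formula can be checked numerically. The following is a minimal sketch with NumPy; the sample data is hypothetical and chosen so that the exact line is Y = 2X + 1. The intercept line uses the standard result that the fitted line passes through (X̄, Ȳ).

   ```python
   import numpy as np

   # Hypothetical sample data lying exactly on Y = 2X + 1.
   X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
   Y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

   x_mean, y_mean = X.mean(), Y.mean()

   # Slope: sum of cross-deviations divided by sum of squared X deviations.
   m = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)

   # Intercept: the fitted line passes through the point of means.
   c = y_mean - m * x_mean

   print(m, c)
   ```

   For this data the sketch should recover a slope of 2 and an intercept of 1.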

6. **What is the purpose of the least squares method in Simple Linear Regression?**
   - The least squares method minimizes the sum of squared residuals (differences between actual and predicted values) to find the best-fitting regression line.

7. **How is the coefficient of determination (R²) interpreted in Simple Linear Regression?**
   - R² measures the proportion of variance in the dependent variable explained by the independent variable. An R² value close to 1 indicates a strong relationship.
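
   As a rough illustration, R² can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the sample values below are assumptions made for this sketch.

   ```python
   import numpy as np

   def r_squared(y_true, y_pred):
       """Return 1 - SS_res / SS_tot."""
       y_true = np.asarray(y_true, dtype=float)
       y_pred = np.asarray(y_pred, dtype=float)
       ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
       ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
       return 1.0 - ss_res / ss_tot

   y_true = [3.0, 5.0, 7.0, 9.0]

   # Predictions that match the data exactly explain all of the variance.
   perfect = r_squared(y_true, [3.0, 5.0, 7.0, 9.0])

   # Always predicting the mean of Y explains none of it.
   baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])
   ```

   Here `perfect` is 1 and `baseline` is 0, which brackets the usual range of R² for a least squares fit with an intercept.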

8. **What is Multiple Linear Regression?**
   - Multiple Linear Regression models the relationship between a dependent variable and multiple independent variables using the equation: **Y = b₀ + b₁X₁ + b₂X₂ + ... + bₙXₙ**.
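
   One way to estimate the coefficients b₀, b₁, b₂ is ordinary least squares on a design matrix with a leading column of ones. This is a sketch using `numpy.linalg.lstsq`; the data is hypothetical and noise-free, generated from Y = 1 + 2X₁ + 3X₂.

   ```python
   import numpy as np

   # Hypothetical noise-free data generated from Y = 1 + 2*X1 + 3*X2.
   X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
   y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

   # Prepend a column of ones so the first coefficient is the intercept b0.
   design = np.column_stack([np.ones(len(X)), X])

   # Least squares solution of design @ b ≈ y.
   b, *_ = np.linalg.lstsq(design, y, rcond=None)
   b0, b1, b2 = b
   ```

   Because the data contains no noise, the estimates should match the generating coefficients up to floating-point error.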

9. **What is the main difference between Simple and Multiple Linear Regression?**
   - Simple Linear Regression has one independent variable, while Multiple Linear Regression has two or more independent variables.

10. **What are the key assumptions of Multiple Linear Regression?**
    1. Linearity
    2. Independence of errors
    3. Homoscedasticity
    4. No perfect multicollinearity
    5. Normality of residuals

11. **What is heteroscedasticity, and how does it affect the results of a Multiple Linear Regression model?**
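    - Heteroscedasticity occurs when the variance of the residuals is not constant across levels of the predictors. The coefficient estimates remain unbiased but are no longer efficient, and the usual standard errors are biased, so t-tests, F-tests, and confidence intervals become unreliable.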

12. **How can you improve a Multiple Linear Regression model with high multicollinearity?**
    1. Remove highly correlated variables
    2. Use Principal Component Analysis (PCA)
    3. Apply Ridge or Lasso regression
    4. Collect more data
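
    To decide which variables are candidates for removal, it helps to quantify the collinearity first. One common diagnostic, not named in the list above, is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The helper `vif` and the simulated data below are assumptions made for this sketch.

    ```python
    import numpy as np

    def vif(X):
        """Variance inflation factor for each column of X (n_samples, n_features)."""
        X = np.asarray(X, dtype=float)
        factors = []
        for j in range(X.shape[1]):
            target = X[:, j]
            others = np.delete(X, j, axis=1)
            # Regress column j on the remaining columns plus an intercept.
            design = np.column_stack([np.ones(len(X)), others])
            coef, *_ = np.linalg.lstsq(design, target, rcond=None)
            resid = target - design @ coef
            r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
            factors.append(1.0 / (1.0 - r2))
        return np.array(factors)

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
    x3 = rng.normal(size=200)                  # unrelated to the others
    vifs = vif(np.column_stack([x1, x2, x3]))
    ```

    In this setup `x1` and `x2` should receive large VIF values while `x3` stays close to 1. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, though the threshold is a convention rather than a strict rule.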

13. **What are some common techniques for transforming categorical variables for use in regression models?**
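    1. One-hot encoding: one 0/1 column per category
    2. Label encoding: one integer per category; this implies an order, so use it with care in regression
    3. Ordinal encoding: integers that follow a genuine category order
    4. Dummy variables: one-hot columns with one reference category dropped, which avoids perfect multicollinearity with the intercept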
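
    A short sketch of these encodings with pandas follows. The column names, categories, and the `size_order` mapping are hypothetical and exist only for illustration.

    ```python
    import pandas as pd

    df = pd.DataFrame({"city": ["Paris", "Tokyo", "Paris", "Lima"]})

    # One-hot encoding: one indicator column per category.
    one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)

    # Dummy coding: drop the first category so it acts as the reference level.
    dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True, dtype=int)

    # Ordinal encoding: map categories to integers that respect their order.
    size_order = {"small": 0, "medium": 1, "large": 2}
    sizes = pd.Series(["small", "large", "medium"]).map(size_order)
    ```

    `pd.get_dummies` orders the new columns alphabetically by category, so `drop_first=True` drops `city_Lima` here and makes Lima the reference category.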

14. **What is the role of interaction terms in Multiple Linear Regression?**
    - Interaction terms capture the combined effect of two or more variables when their effect is not purely additive.
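
    In practice an interaction term is usually just the product of two predictors added as an extra column. The sketch below assumes hypothetical noise-free data generated from y = 1 + 2x₁ + 3x₂ + 4x₁x₂ and recovers the coefficients with least squares.

    ```python
    import numpy as np

    # Hypothetical data where the effect of x1 on y depends on the value of x2.
    rng = np.random.default_rng(0)
    x1 = rng.uniform(0, 1, size=50)
    x2 = rng.uniform(0, 1, size=50)
    y = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

    # Design matrix: intercept, main effects, and the product (interaction) term.
    design = np.column_stack([np.ones(50), x1, x2, x1 * x2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ```

    With the interaction column present, the effect of a one-unit change in `x1` is 2 + 4·`x2` rather than a single constant.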

15. **How can the interpretation of intercept differ between Simple and Multiple Linear Regression?**
    - In Simple Linear Regression, the intercept represents the expected value of Y when X = 0. In Multiple Linear Regression, it represents the expected value of Y when all independent variables are zero.

16. **What is the significance of the slope in regression analysis, and how does it affect predictions?**
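    - The slope is the expected change in the dependent variable for a one-unit change in an independent variable (holding the other predictors constant in multiple regression). Each prediction shifts by the slope for every unit change in that predictor.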

17. **How does the intercept in a regression model provide context for the relationship between variables?**
    - The intercept provides a baseline value for Y when all independent variables are zero. It is only meaningful in practice when zero lies within the plausible range of the predictors; otherwise it mainly serves to position the line.

18. **What are the limitations of using R² as a sole measure of model performance?**
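    - R² never decreases when predictors are added, so it can be inflated by overfitting. It also does not indicate whether the model is correctly specified or how well it predicts new data. Adjusted R² and out-of-sample error are often reported alongside it.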

19. **How would you interpret a large standard error for a regression coefficient?**
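    - A large standard error means the coefficient is estimated imprecisely: its confidence interval is wide, and the estimate may not be statistically distinguishable from zero. Common causes are a small sample, noisy data, or multicollinearity.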

20. **How can heteroscedasticity be identified in residual plots, and why is it important to address it?**
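    - In a plot of residuals against fitted values, heteroscedasticity appears as a funnel or fan shape: the spread of the residuals widens or narrows along the axis. Addressing it (for example with a transformation of Y, weighted least squares, or robust standard errors) keeps standard errors and hypothesis tests valid.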
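
    A numeric stand-in for the funnel shape is to compare the spread of residuals at low and high values of the predictor. The sketch below simulates data under the assumption that the noise standard deviation grows with x; the data-generating process is invented for illustration.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 10, 200)

    # Assumed data-generating process: noise standard deviation grows with x.
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5 * x)

    # Fit a straight line and compute the residuals.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)

    # x is sorted, so the first and second halves correspond to low and high x.
    lower_half = residuals[:100].std()
    upper_half = residuals[100:].std()
    ```

    A clearly larger `upper_half` than `lower_half` corresponds to the widening funnel seen in a residual plot. Formal tests exist as well, but this comparison is only a quick check.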

21. **What does it mean if a Multiple Linear Regression model has a high R² but low adjusted R²?**
    - It suggests the model includes predictors that add little explanatory power. R² never decreases when a predictor is added, whereas adjusted R² penalizes the number of predictors, so a large gap between the two is a sign of overfitting.
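
    The penalty can be made concrete with the standard formula adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The function name and the example numbers below are assumptions made for this sketch.

    ```python
    def adjusted_r2(r2, n_samples, n_predictors):
        """Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1)."""
        return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

    # Same raw R² of 0.90 on 20 observations, with 2 versus 10 predictors.
    few = adjusted_r2(0.90, n_samples=20, n_predictors=2)
    many = adjusted_r2(0.90, n_samples=20, n_predictors=10)
    ```

    With the same raw R², the model with 10 predictors ends up near 0.79 while the model with 2 predictors stays near 0.89.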

22. **Why is it important to scale variables in Multiple Linear Regression?**
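    - Ordinary least squares itself is scale-invariant: rescaling a predictor rescales its coefficient but leaves the predictions unchanged. Scaling matters when coefficients are compared across predictors, when the model is regularized (Ridge, Lasso), or when it is fitted by gradient descent, where it improves convergence and numerical stability.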
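
    A minimal sketch of standardization with scikit-learn follows. The data is hypothetical, and Ridge is used here only as an example of a model whose penalty is sensitive to the scale of the predictors.

    ```python
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Hypothetical predictors on very different scales (an amount and a ratio).
    X = np.array([[30000.0, 0.10], [45000.0, 0.35], [60000.0, 0.20], [80000.0, 0.50]])
    y = np.array([1.0, 2.0, 2.5, 4.0])

    # Standardize each column to mean 0 and standard deviation 1.
    X_scaled = StandardScaler().fit_transform(X)

    # Putting the scaler in a pipeline applies the same scaling at prediction time.
    model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    model.fit(X, y)
    predictions = model.predict(X)
    ```

    Keeping the scaler inside the pipeline means its mean and standard deviation are learned from the training data only, which avoids leaking information when the model is cross-validated.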

23. **What is polynomial regression?**
    - Polynomial regression extends linear regression by fitting a polynomial equation to the data, capturing non-linear relationships.

24. **How does polynomial regression differ from linear regression?**
    - Polynomial regression allows for curved relationships by adding higher-degree terms of X, while simple linear regression fits only a straight line. The model is still linear in its coefficients, so it is fitted with the same least squares method.

25. **When is polynomial regression used?**
    - When data exhibits a non-linear relationship that cannot be well-represented by a simple straight line.

26. **What is the general equation for polynomial regression?**
    - Y = b₀ + b₁X + b₂X² + ... + bₙXⁿ, where higher-degree terms capture non-linearity.

27. **Can polynomial regression be applied to multiple variables?**
    - Yes. With multiple variables, the model includes powers of each predictor and, typically, cross terms between them (for example X₁X₂), so the number of terms grows quickly with the degree.
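
    The expansion can be inspected with scikit-learn's `PolynomialFeatures`. This sketch uses a single hypothetical observation with x1 = 2 and x2 = 3, and the feature names `x1` and `x2` are chosen for illustration.

    ```python
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    # One hypothetical observation with two predictors: x1 = 2, x2 = 3.
    X = np.array([[2.0, 3.0]])

    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)

    # Names of the generated columns, in the same order as X_poly.
    names = poly.get_feature_names_out(["x1", "x2"])
    ```

    For degree 2 the generated columns are the constant, x1, x2, x1², x1·x2, and x2², so two predictors already produce six terms.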

28. **What are the limitations of polynomial regression?**
    1. Risk of overfitting
    2. Higher complexity
    3. Sensitive to outliers
    4. Increased computational cost

29. **What methods can be used to evaluate model fit when selecting the degree of a polynomial?**
    1. Cross-validation
    2. Adjusted R²
    3. AIC/BIC
    4. Residual analysis
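
    As an example of the first method, the degree can be chosen by comparing cross-validated scores across candidate degrees. The sketch below assumes simulated data with a true quadratic relationship; the range of degrees and the number of folds are arbitrary choices.

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Simulated data (an assumption for this sketch): quadratic signal plus noise.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = 1 + 2 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=100)

    scores = {}
    for degree in range(1, 6):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        # 5-fold cross-validation; the default score for regressors is R².
        scores[degree] = cross_val_score(model, X, y, cv=5).mean()

    best_degree = max(scores, key=scores.get)
    ```

    Degrees above the true one often score almost as well, so it is common to prefer the lowest degree whose score is close to the best rather than the strict maximum.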

30. **Why is visualization important in polynomial regression?**
    - Visualization helps in assessing the fit of the polynomial curve and identifying overfitting or underfitting.

31. **How is polynomial regression implemented in Python?**
    - Polynomial regression can be implemented using **scikit-learn**:
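    ```python
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    # Example data (assumed for illustration): y = x² exactly
    X = np.arange(1, 6).reshape(-1, 1)
    y = X.ravel() ** 2
    poly = PolynomialFeatures(degree=2)  # generates the columns 1, x, x²
    X_poly = poly.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    ```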

