## Assignment Questions

Q1. What is Simple Linear Regression?

A1. Simple Linear Regression is a technique used to predict a continuous output variable (dependent variable) using one input variable (independent variable). It works by fitting a straight line to the data points so that the sum of squared differences between the predicted values and the actual values is minimized. The general formula is $Y = a + bX$, where $Y$ is the predicted value, $X$ is the input, $a$ is the intercept, and $b$ is the slope of the line. The same line is written as $Y = mX + c$ in later questions, with $m$ as the slope and $c$ as the intercept.

Q2. What are the key assumptions of Simple Linear Regression?

A2. The key assumptions of Simple Linear Regression are as follows:
1. Linearity: There is a linear relationship between the independent variable (X) and the dependent variable (Y).
2. Independence: The observations are independent of each other.
3. Homoscedasticity: The variance of errors (residuals) is constant across all values of X.
4. Normality of Errors: The residuals (difference between actual and predicted values) should be normally distributed.
5. No Multicollinearity: Since there is only one independent variable in simple linear regression, this assumption is automatically satisfied.

Q3. What does the coefficient m represent in the equation Y=mX+c?

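A3. In the equation Y = mX + c, the coefficient m represents the slope of the line: the change in the predicted value of Y for a one-unit increase in X. A positive m means Y increases as X increases, and a negative m means Y decreases as X increases.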

Q4. What does the intercept c represent in the equation Y=mX+c?

A4. In the equation Y = mX + c, the intercept c represents the predicted value of Y when X is zero. It is the point where the line crosses the Y-axis. In other words, when the input is zero (X = 0), the predicted output Y equals c.

Q5. How do we calculate the slope m in Simple Linear Regression?

A5. In **Simple Linear Regression**, the slope **m** is calculated using the formula:

$$
m = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}
$$
Here:
* $X_i$ = individual value of X
* $Y_i$ = individual value of Y
* $\bar{X}$ = mean of X values
* $\bar{Y}$ = mean of Y values
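
As a minimal sketch of the formula above, the slope can be computed directly with NumPy. The data values are made up for illustration, and the intercept line uses the standard identity $c = \bar{Y} - m\bar{X}$, which is not stated in the answer itself.

```python
import numpy as np

# Illustrative data (made up): the points lie exactly on Y = 2X + 1
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

x_bar, y_bar = X.mean(), Y.mean()

# Slope: sum of cross-deviations divided by sum of squared X deviations
m = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)

# Intercept: the fitted line passes through the point (x_bar, y_bar)
c = y_bar - m * x_bar

print(m, c)  # 2.0 1.0
```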

Q6. What is the purpose of the least squares method in Simple Linear Regression?

A6. The purpose of the least squares method in Simple Linear Regression is to find the best-fitting line through the data points by minimizing the sum of the squares of the errors (residuals). Residuals are the differences between the actual values and the predicted values. By squaring these differences and minimizing their total, the method ensures the line is as close as possible to all data points.

Q7. How is the coefficient of determination (R²) interpreted in Simple Linear Regression?

A7. The coefficient of determination (R²) in Simple Linear Regression measures how well the independent variable explains the variation in the dependent variable. Its value ranges from 0 to 1:
* R² = 1 means the model explains 100% of the variation (perfect fit).
* R² = 0 means the model explains none of the variation.
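
The two boundary cases can be checked with a short sketch. It assumes the usual definition of R² as one minus the ratio of the residual sum of squares to the total sum of squares; the helper name `r_squared` and the data are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y = np.array([3.0, 5.0, 7.0, 9.0])  # made-up observations

# Predictions equal to the observations: perfect fit
perfect = r_squared(y, y)

# Predicting the mean for every point: explains none of the variation
baseline = r_squared(y, np.full_like(y, y.mean()))
```

Here `perfect` evaluates to 1 and `baseline` to 0, matching the two cases listed above.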

Q8. What is Multiple Linear Regression?

A8. Multiple Linear Regression is an extension of Simple Linear Regression that uses two or more independent variables to predict a single continuous dependent variable. It helps understand how multiple factors together influence the output.

Q9. What is the main difference between Simple and Multiple Linear Regression?

A9. The main difference between Simple Linear Regression and Multiple Linear Regression lies in the number of input variables used to predict the output. Simple Linear Regression uses only one independent variable to predict a dependent variable, whereas Multiple Linear Regression uses two or more independent variables for prediction.

Q10. What are the key assumptions of Multiple Linear Regression?

A10. The key assumptions of Multiple Linear Regression are that there is a linear relationship between the dependent variable and all independent variables, and the residuals (errors) are normally distributed. It also assumes independence of observations, meaning data points are not related to each other, and homoscedasticity, which means the variance of residuals is constant across all levels of independent variables. Another important assumption is no multicollinearity, meaning the independent variables should not be highly correlated with each other.

Q11. What is heteroscedasticity, and how does it affect the results of a Multiple Linear Regression model?

A11. Heteroscedasticity occurs when the variance of the residuals (errors) is not constant across all levels of the independent variables in a regression model. In other words, as the value of an independent variable changes, the spread of the errors becomes larger or smaller instead of staying uniform. This violates the assumption of homoscedasticity in Multiple Linear Regression.

Heteroscedasticity affects the results by making the standard errors of the coefficients unreliable, which can lead to incorrect t-tests and confidence intervals. This means the significance of predictors might be misinterpreted, reducing the accuracy and trustworthiness of the model.

Q12. How can you improve a Multiple Linear Regression model with high multicollinearity?

A12. To improve a Multiple Linear Regression model with high multicollinearity, you can take several steps. One common approach is to remove one or more highly correlated independent variables, as they provide redundant information. Another method is to combine correlated variables into a single feature (such as creating an average or using feature engineering). You can also apply regularization techniques like Ridge Regression or Lasso Regression, which help reduce the impact of multicollinearity by penalizing large coefficients. Additionally, calculating the Variance Inflation Factor (VIF) can help identify which variables contribute most to multicollinearity so you can address them specifically.
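
As a rough sketch of the VIF check mentioned above, each predictor can be regressed on the others and its VIF computed as $1 / (1 - R_j^2)$. This version uses plain NumPy rather than a statistics library, the function name `vif` is hypothetical, and the data are synthetic.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
x3 = rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # large values for x1 and x2, a value near 1 for x3
```

A commonly quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, though the cutoff is a convention rather than a strict rule.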

Q13. What are some common techniques for transforming categorical variables for use in regression models?

A13. Common techniques for transforming categorical variables for use in regression models include One-Hot Encoding, Label Encoding, and Dummy Variable Encoding. One-Hot Encoding creates separate binary columns for each category, assigning 1 if the category is present and 0 otherwise, which is widely used because it avoids giving an ordinal meaning to categories. Label Encoding assigns a unique numeric value to each category, but it can introduce an unintended order, so it is usually applied when categories have a natural ranking. Dummy Variable Encoding is similar to One-Hot Encoding but drops one category to avoid the dummy variable trap (multicollinearity).
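
The three encodings can be illustrated with a small plain-NumPy sketch. The `colors` column is made up, and in practice a library helper would normally be used instead of building the matrices by hand.

```python
import numpy as np

# Made-up categorical column
colors = np.array(["red", "green", "blue", "green"])

# Label Encoding: one integer per category (sorted order: blue=0, green=1, red=2)
categories, labels = np.unique(colors, return_inverse=True)

# One-Hot Encoding: one binary column per category
one_hot = np.eye(len(categories), dtype=int)[labels]

# Dummy Variable Encoding: drop the first category ("blue") as the baseline
dummy = one_hot[:, 1:]

print(labels)   # [2 1 0 1]
print(one_hot)
print(dummy)
```

A row of all zeros in `dummy` stands for the dropped baseline category, which is how the dummy variable trap is avoided without losing information.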

Q14. What is the role of interaction terms in Multiple Linear Regression?

A14. The role of interaction terms in Multiple Linear Regression is to capture the combined effect of two or more independent variables on the dependent variable that cannot be explained by their individual effects alone. An interaction term is created by multiplying two variables, and it allows the model to account for situations where the impact of one variable depends on the value of another.
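
A minimal sketch of this idea: the product of two predictors is added as an extra column, and ordinary least squares recovers its coefficient. The data are synthetic, and the true coefficients (including the interaction effect of 3.0) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)

# Synthetic target with an assumed interaction effect of 3.0
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 3.0 * x1 * x2 + rng.normal(scale=0.1, size=300)

# Design matrix: intercept, x1, x2, and the interaction term x1 * x2
A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(coef)  # roughly [1, 2, -1, 3]
```

With the interaction included, the effect of a one-unit change in `x1` is `coef[1] + coef[3] * x2`, so it depends on the value of `x2`.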

Q15. How can the interpretation of intercept differ between Simple and Multiple Linear Regression?

A15. In Simple Linear Regression, the intercept represents the predicted value of the dependent variable when the independent variable is zero. For example, if predicting house price based on size, the intercept is the estimated price when the size is 0.

In Multiple Linear Regression, the intercept represents the predicted value of the dependent variable when all independent variables are zero simultaneously. However, this situation may not always make practical sense because some variables cannot realistically be zero (e.g., age, income). Therefore, the intercept in Multiple Linear Regression often has less meaningful interpretation compared to Simple Linear Regression, though it is still essential for calculations in the model.

Q16. What is the significance of the slope in regression analysis, and how does it affect predictions?

A16. The slope in regression analysis represents the change in the dependent variable for a one-unit increase in the independent variable, keeping other variables constant (in multiple regression). Its sign gives the direction of the relationship and its magnitude gives the size of the effect in the units of the variables: a positive slope means the dependent variable increases as the independent variable increases, while a negative slope means it decreases.

The slope directly affects predictions because it determines how much the predicted value changes when the input changes. For example, in predicting house price based on size, if the slope is 2000, then for every additional square foot, the price increases by 2000 units. This makes the slope a key factor in understanding the impact of each predictor on the outcome.

Q17. How does the intercept in a regression model provide context for the relationship between variables?

A17. The intercept in a regression model provides the starting point or baseline value of the dependent variable when all independent variables are zero. It sets the context for the relationship by indicating where the regression line crosses the Y-axis. This helps understand the overall equation and gives a reference for predictions.

Q18. What are the limitations of using R² as a sole measure of model performance?

A18. The main limitation of using R² as the only measure of model performance is that it only explains how much variance in the dependent variable is explained by the model, but it does not indicate whether the model is correct or reliable. A high R² does not mean the model is accurate, as it can increase by simply adding more variables, even if they are irrelevant, leading to overfitting. R² also does not check assumptions of regression or indicate if relationships are causal.

Q19. How would you interpret a large standard error for a regression coefficient?

A19. A large standard error for a regression coefficient indicates that the estimate of that coefficient is not precise and varies significantly across different samples. This usually happens when there is high variability in the data or when the predictor has a weak relationship with the dependent variable. It can also signal problems like multicollinearity, where predictors are highly correlated.

Q20. How can heteroscedasticity be identified in residual plots, and why is it important to address it?

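A20. Heteroscedasticity can be identified in residual plots by looking at the pattern of residuals versus predicted values (or versus an independent variable). If the residuals are spread out unevenly, such as forming a funnel shape (narrow at one end and wide at the other) or any systematic pattern instead of a random scatter, it indicates heteroscedasticity.

It is important to address because heteroscedasticity makes the standard errors of the coefficients unreliable, so t-tests and confidence intervals can be misleading. Common remedies include transforming the dependent variable (for example, taking its logarithm), using weighted least squares, or reporting heteroscedasticity-robust standard errors.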
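
The funnel pattern can also be checked numerically. The sketch below is a crude comparison of residual spread between the lower and upper halves of the fitted values, not a formal statistical test. The data are synthetic, with noise that is deliberately made to grow with `x`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 400)

# Synthetic data whose noise standard deviation grows with x (assumed for illustration)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# Compare the residual spread for small and large fitted values
order = np.argsort(fitted)
half = len(x) // 2
spread_low = residuals[order[:half]].std()
spread_high = residuals[order[half:]].std()

print(spread_low, spread_high)  # the second value is clearly larger
```

Plotting `residuals` against `fitted` would show the same thing visually as a funnel that widens to the right.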

Q21. What does it mean if a Multiple Linear Regression model has a high R² but low adjusted R²?

A21. If a Multiple Linear Regression model has a high R² but low Adjusted R², it means that the model includes predictors that do not meaningfully contribute to explaining the variation in the dependent variable. R² never decreases when new variables are added, even if they are irrelevant, while Adjusted R² penalizes the number of predictors and increases only if a new variable improves the fit by more than would be expected by chance.
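
For reference, the standard formula for Adjusted R² is shown below, where $n$ is the number of observations and $p$ is the number of predictors (these symbols are introduced here and do not appear in the answer above):

$$
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

As a worked example with made-up numbers, a model with $R^2 = 0.90$, $n = 20$, and $p = 10$ gives $R^2_{\text{adj}} = 1 - 0.10 \times \frac{19}{9} \approx 0.79$, so the penalty is noticeable when the number of predictors is large relative to the sample size.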

Q22. Why is it important to scale variables in Multiple Linear Regression?

A22. It is important to scale variables in Multiple Linear Regression because predictors can have different units and ranges, which can cause numerical instability and make coefficient magnitudes hard to compare. For ordinary least squares, scaling does not change the predictions, but it puts the coefficients on a comparable scale. It matters most when regularization methods like Ridge or Lasso are applied, because their penalties depend on coefficient size and are therefore sensitive to variable magnitude.
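
One common way to scale is standardization: subtracting each column's mean and dividing by its standard deviation. The sketch below assumes that approach and uses made-up house data (size in square feet and number of rooms).

```python
import numpy as np

# Made-up predictors on very different scales: [size in sq ft, number of rooms]
X = np.array([
    [1500.0, 3.0],
    [2400.0, 4.0],
    [900.0, 2.0],
    [3000.0, 5.0],
])

# Standardize each column to mean 0 and standard deviation 1
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```

When new data arrive, the same `mean` and `std` computed from the training data should be reused rather than recomputed.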

Q23. What is polynomial regression?

A23. Polynomial Regression is a type of regression technique that models the relationship between the independent variable (X) and the dependent variable (Y) as an nth-degree polynomial instead of a straight line. It is useful when data shows a non-linear pattern but can still be represented by a curve.

Q24. How does polynomial regression differ from linear regression?

A24. Polynomial Regression is different from Linear Regression because Linear Regression fits a straight line to the data, while Polynomial Regression fits a curved line by adding powers of the independent variable.

Q25. When is polynomial regression used?

A25. Polynomial Regression is used when the relationship between the independent variable and the dependent variable is non-linear but can be represented as a smooth curve. It is helpful when a straight line from Linear Regression does not fit the data well.

Q26. What is the general equation for polynomial regression?

A26. The general equation for Polynomial Regression is:

$$
Y = b_0 + b_1X + b_2X^2 + b_3X^3 + \dots + b_nX^n
$$
Here:
* $Y$ = Dependent variable
* $X$ = Independent variable
* $b_0, b_1, \dots, b_n$ = Coefficients
* $n$ = Degree of the polynomial

Q27. Can polynomial regression be applied to multiple variables?

A27. Yes, Polynomial Regression can be applied to multiple variables. In this case, the model includes polynomial terms for each variable and their interactions.

Q28. What are the limitations of polynomial regression?

A28. The limitations of Polynomial Regression include:
1. Overfitting: Higher-degree polynomials can fit the training data too closely, reducing performance on new data.
2. Complexity: As the degree increases, the model becomes harder to interpret and more computationally expensive.
3. Extrapolation Issues: Predictions outside the range of the data can be highly inaccurate.
4. Multicollinearity: Adding higher-order terms can cause strong correlation between predictors, making coefficient estimates unstable.

Q29. What methods can be used to evaluate model fit when selecting the degree of a polynomial?

A29. To evaluate model fit when selecting the degree of a polynomial, you can use the following methods:

1. Cross-Validation: Split the data into training and validation sets (for example, k-fold cross-validation, which repeats the split k times) to check how well each degree generalizes to unseen data.
2. Adjusted R²: Unlike R², it accounts for the number of predictors and penalizes unnecessary complexity.
3. AIC/BIC (Akaike or Bayesian Information Criterion): These criteria help balance model fit with complexity; lower values indicate a better model.
4. Residual Plots: Analyze residuals for randomness; patterns may indicate underfitting or overfitting.
5. Validation Metrics: Use metrics like RMSE (Root Mean Squared Error) or MAE (Mean Absolute Error) on a test set to compare different polynomial degrees.
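
As a minimal sketch of the validation approach, the code below compares polynomial degrees by RMSE on a held-out set. It uses a single hold-out split rather than full cross-validation, the data are synthetic with a true degree of 2, and the helper name `validation_rmse` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)

# Synthetic data generated from a degree-2 polynomial plus noise
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.5, size=200)

# Simple hold-out split: first 150 points for training, last 50 for validation
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]

def validation_rmse(degree):
    """Fit a polynomial of the given degree and return RMSE on the validation set."""
    coefs = np.polyfit(x_train, y_train, deg=degree)
    pred = np.polyval(coefs, x_val)
    return np.sqrt(np.mean((y_val - pred) ** 2))

rmse_by_degree = {d: validation_rmse(d) for d in range(1, 7)}
best_degree = min(rmse_by_degree, key=rmse_by_degree.get)

print(rmse_by_degree)
print(best_degree)
```

The error typically drops sharply from degree 1 to degree 2 and then flattens, which suggests that higher degrees add complexity without improving generalization.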

Q30. Why is visualization important in polynomial regression?

A30. Visualization is important in Polynomial Regression because it helps us see how well the polynomial curve fits the data and whether the chosen degree is appropriate. By plotting the actual data points and the regression curve, we can easily identify underfitting (curve too simple), overfitting (curve too complex), or a good fit. Visualization also helps in understanding the nature of the relationship between variables, especially since polynomial regression models non-linear patterns that are not obvious in raw data.

Q31. How is polynomial regression implemented in Python?

A31. Polynomial Regression in Python can be implemented using the scikit-learn library by transforming the input features into polynomial terms and then applying linear regression. First, we create the polynomial features using `PolynomialFeatures` from `sklearn.preprocessing`, specifying the degree of the polynomial. Then, we fit these transformed features to a `LinearRegression` model and make predictions. For example, after preparing the data, we use `poly = PolynomialFeatures(degree=2)` to generate squared terms, transform the input with `X_poly = poly.fit_transform(X)`, and then fit the model using `model.fit(X_poly, y)`. This approach helps capture non-linear patterns while still using the linear regression algorithm on the expanded feature set.
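
Put together, the steps described above might look like the following sketch. The data are made up (noise-free points from $1 + 2X + 0.5X^2$), and the names `X_new` and `y_new` are introduced only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: noise-free points from y = 1 + 2x + 0.5x^2
X = np.linspace(-3, 3, 50).reshape(-1, 1)  # scikit-learn expects a 2D feature array
y = 1.0 + 2.0 * X.ravel() + 0.5 * X.ravel() ** 2

# Step 1: expand X into the columns [1, x, x^2]
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Step 2: fit ordinary linear regression on the expanded features
model = LinearRegression()
model.fit(X_poly, y)

# Step 3: predict for a new input, using the same transformation
X_new = np.array([[4.0]])
y_new = model.predict(poly.transform(X_new))

print(y_new)  # close to 17, since 1 + 2*4 + 0.5*16 = 17
```

New inputs go through `poly.transform` rather than `poly.fit_transform`, so that they are expanded in the same way as the training data.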