## Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

**Answer (Expanded and Theoretical):**

Simple Linear Regression (SLR) is a statistical and machine learning technique used to model the relationship between two continuous variables: one independent (predictor) variable `X` and one dependent (response) variable `Y`. The model assumes this relationship is linear and can be expressed as a straight line through the data. SLR belongs to the family of **parametric** models because it assumes a specific functional form (linear) and estimates a small number of parameters (two in SLR).

### Key theoretical points:
- **Model form:** The model can be written as

  \[ Y = \beta_0 + \beta_1 X + \epsilon \]

  where `β₀` is the intercept, `β₁` is the slope, and `ε` is a random error term representing unexplained variability.

- **Assumptions:** The validity of inference and many statistical properties of the estimates depend on linearity, homoscedasticity, independence, and normality of errors (see Question 2 for details).

- **Estimation goal:** Estimate `β₀` and `β₁` from observed data so that the fitted line best explains the observed `Y` values.

- **Statistical interpretation:** The slope `β₁` is the average change in `Y` associated with a one-unit increase in `X`. Because SLR has a single predictor, there are no other variables to hold constant, so the slope describes an association and not necessarily a causal effect.

- **Uses and purpose:**
  - **Prediction:** Estimate `Y` for new values of `X`.
  - **Inference:** Test hypotheses (e.g., is `β₁ = 0`?) and build confidence intervals for parameters.
  - **Explanation:** Understand direction and strength of association between `X` and `Y`.
  - **Baseline modeling:** SLR often acts as a simple baseline before progressing to multiple regression or non-linear methods.

- **Limitations:** Limited to linear relationships; sensitive to outliers and influential points; real-world phenomena often require multiple predictors or non-linear forms.
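To make the model form concrete, the sketch below simulates data from `Y = β₀ + β₁X + ε` and recovers the two parameters with a least-squares fit. The true values (`beta0_true`, `beta1_true`), the noise level, the sample size, and the seed are arbitrary choices made for illustration and do not come from any real dataset.

```python
import numpy as np

# Illustrative (assumed) parameter values; not taken from real data
beta0_true, beta1_true, sigma = 2.0, 0.5, 1.0

rng = np.random.default_rng(0)          # fixed seed for reproducibility
x = rng.uniform(0, 10, size=200)        # predictor values
eps = rng.normal(0, sigma, size=200)    # random error term ε
y = beta0_true + beta1_true * x + eps   # Y = β₀ + β₁X + ε

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
beta1_hat, beta0_hat = np.polyfit(x, y, 1)

print(f"Estimated intercept: {beta0_hat:.3f} (true {beta0_true})")
print(f"Estimated slope:     {beta1_hat:.3f} (true {beta1_true})")
```

The estimates should land close to, but not exactly on, the true values; the gap reflects the unexplained variability carried by `ε`.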

## Question 2: What are the key assumptions of Simple Linear Regression?

**Answer (Detailed):**

SLR relies on several assumptions. Violations affect estimator properties, hypothesis tests, and predictive performance. Below are the assumptions, their meanings, diagnostics and consequences of violation:

1. **Linearity**
   - **Meaning:** The conditional expectation of `Y` given `X` is a linear function of `X`: `E[Y|X] = β₀ + β₁X`.
   - **Diagnostic:** Scatter plots of `Y` vs `X`, residual vs fitted plots.
   - **Consequence if violated:** The fitted line is a biased description of the true relationship and predictions are systematically off; consider transforming variables or using polynomial or non-linear models.

2. **Independence of errors**
   - **Meaning:** Error terms `ε_i` are uncorrelated; `Cov(ε_i, ε_j) = 0` for `i ≠ j`.
   - **Diagnostic:** Durbin-Watson test, autocorrelation plots (ACF) especially for time series.
   - **Consequence if violated:** Standard errors are incorrect, leading to invalid confidence intervals and hypothesis tests.

3. **Homoscedasticity (constant variance)**
   - **Meaning:** `Var(ε_i) = σ²` for all observations.
   - **Diagnostic:** Plot residuals vs fitted values; Breusch-Pagan test.
   - **Consequence if violated (heteroscedasticity):** OLS estimates remain unbiased but are no longer the Best Linear Unbiased Estimators (BLUE); the usual standard errors are biased, so confidence intervals and hypothesis tests become unreliable.

4. **Normality of errors**
   - **Meaning:** `ε_i ~ N(0, σ²)` (for inference and small-sample properties).
   - **Diagnostic:** Q-Q plot of residuals, Shapiro-Wilk test.
   - **Consequence if violated:** Confidence intervals and p-values may be unreliable for small samples; with large samples, the Central Limit Theorem mitigates this.

5. **No perfect multicollinearity**
   - **Meaning:** In SLR this is trivial (one predictor). For multiple regression, predictors should not be exact linear combinations of each other.

6. **Exogeneity / No omitted variable bias**
   - **Meaning:** The predictor `X` should be uncorrelated with the error term: `Cov(X, ε) = 0`.
   - **Consequence if violated:** Coefficient estimates are biased and inconsistent. This commonly occurs with omitted confounders or measurement errors in `X`.

**Practical notes:** Diagnostics and remedial measures (transformations, robust standard errors, generalized least squares, or additional predictors) are used when assumptions do not hold.
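As a rough illustration of how some of these diagnostics start from the residuals, the sketch below reuses the five sample points from the scikit-learn example in Question 9 and computes the Durbin-Watson statistic by hand with NumPy. Five observations are far too few for a meaningful diagnosis, so this only shows the mechanics.

```python
import numpy as np

# Sample data reused from the scikit-learn example in Question 9
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.5, 3.5, 4.2, 5.0, 5.8])

# Fit the least-squares line and compute residuals e_i = y_i - ŷ_i
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# With an intercept in the model, OLS residuals sum to zero and are
# orthogonal to x by construction (up to floating-point error)
print(f"Sum of residuals:     {residuals.sum():.2e}")
print(f"Sum of x * residuals: {(x * residuals).sum():.2e}")

# Durbin-Watson statistic: values near 2 suggest little first-order
# autocorrelation; values near 0 or 4 suggest positive or negative autocorrelation
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"Durbin-Watson: {dw:.3f}")
```

In practice, a residual-versus-fitted plot and a Q-Q plot of `residuals` are usually more informative than any single test statistic.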

## Question 3: Write the mathematical equation for a simple linear regression model and explain each term.

**Answer (Expanded):**

The standard SLR equation is:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

**Term-by-term explanation:**
- **Y (dependent variable):** The outcome variable we want to predict or explain.
- **X (independent variable):** The predictor variable used to explain variation in `Y`.
- **β₀ (intercept):** The expected value of `Y` when `X = 0`; geometrically, the point where the regression line crosses the Y-axis.
- **β₁ (slope coefficient):** The expected change in `Y` for a one-unit increase in `X` (average marginal effect).
- **ε (error term / disturbance):** A random variable capturing the deviation of observed `Y` from the deterministic part `β₀ + β₁X`. It represents unobserved factors, measurement error, and inherent randomness.

**Additional notes:**
- The model implies `E[Y|X] = β₀ + β₁X` and `Var(Y|X) = Var(ε) = σ²` under homoscedasticity.
- Estimation of `β₀` and `β₁` is commonly done using Ordinary Least Squares (OLS).
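The two identities in the first note each follow in one step, assuming the errors have conditional mean zero, `E[ε|X] = 0`, and constant variance `σ²`, and treating `β₀ + β₁X` as fixed once `X` is given:

\[ E[Y \mid X] = E[\beta_0 + \beta_1 X + \epsilon \mid X] = \beta_0 + \beta_1 X + E[\epsilon \mid X] = \beta_0 + \beta_1 X \]

\[ \mathrm{Var}(Y \mid X) = \mathrm{Var}(\beta_0 + \beta_1 X + \epsilon \mid X) = \mathrm{Var}(\epsilon \mid X) = \sigma^2 \]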

## Question 4: Provide a real-world example where simple linear regression can be applied.

**Answer (Detailed Example): Predicting House Prices by Size**

**Context:** Real estate analysts often want to model the relationship between a house's sale price and a single measurable attribute such as living area (square feet).

- **Dependent variable (Y):** House sale price (e.g., in ₹ or $).
- **Independent variable (X):** Living area (square feet).

**Why SLR fits:** There is typically an approximately linear relationship: larger houses tend to sell for higher prices. A linear model gives a simple interpretable rate of change (price per square foot).

**What to check and extend:**
- Perform exploratory data analysis and check scatter plots for linearity.
- Check for influential outliers (extremely large homes or luxury properties) that can distort the slope.
- If necessary, extend to multiple regression by adding predictors such as location, number of bedrooms, age of house, to reduce omitted variable bias.

**Other real-world SLR examples:**
- Predicting sales from advertising budget.
- Estimating crop yield from rainfall.
- Predicting student test scores from hours studied.
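The house-price example can be sketched in a few lines. The areas and prices below are hypothetical numbers invented purely for illustration (they are not real market data), and the variable names are placeholders.

```python
import numpy as np

# Hypothetical data, invented for illustration only
area_sqft = np.array([800, 1000, 1200, 1500, 1800, 2000], dtype=float)
price_k = np.array([120, 150, 175, 220, 260, 290], dtype=float)  # price in thousands

# Least-squares line: price ≈ intercept + slope * area
slope, intercept = np.polyfit(area_sqft, price_k, 1)

# Predict the price of a hypothetical 1,600 sq ft house (inside the observed range)
new_area = 1600.0
predicted_price_k = intercept + slope * new_area

print(f"Estimated price per extra square foot: {slope * 1000:.2f}")
print(f"Predicted price for {new_area:.0f} sq ft: {predicted_price_k:.1f} thousand")
```

The prediction is made inside the range of observed areas; extrapolating to much smaller or larger houses relies on the linear pattern continuing, which the data cannot confirm.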

## Question 5: What is the method of least squares in linear regression?

**Answer (Theory + Formulas):**

The **Ordinary Least Squares (OLS)** method chooses the estimates of `β₀` and `β₁` that minimize the sum of squared residuals. The residual for observation `i` is \( e_i = Y_i - \hat{Y}_i \), where \( \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \) is the fitted value.

**Objective function:**

\[ SSE(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2 \]

The OLS estimates `β̂₀` and `β̂₁` are the values that minimize this SSE. By calculus, setting derivatives to zero gives closed-form solutions:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} \]

\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]
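For readers who want the intermediate step, the closed-form solutions above come from the two first-order conditions (the normal equations):

\[ \frac{\partial SSE}{\partial \beta_0} = -2 \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i) = 0 \]

\[ \frac{\partial SSE}{\partial \beta_1} = -2 \sum_{i=1}^n X_i (Y_i - \beta_0 - \beta_1 X_i) = 0 \]

Dividing the first condition by \( n \) gives \( \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \). Substituting this into the second condition yields

\[ \sum_{i=1}^n X_i \left( (Y_i - \bar{Y}) - \hat{\beta}_1 (X_i - \bar{X}) \right) = 0 \quad\Longrightarrow\quad \hat{\beta}_1 = \frac{\sum_{i=1}^n X_i (Y_i - \bar{Y})}{\sum_{i=1}^n X_i (X_i - \bar{X})} \]

which equals the centered form because \( \sum_{i=1}^n \bar{X} (Y_i - \bar{Y}) = 0 \) and \( \sum_{i=1}^n \bar{X} (X_i - \bar{X}) = 0 \).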

**Properties of OLS (under Gauss-Markov conditions):**
- **Unbiasedness:** `E[β̂] = β`.
- **Efficiency:** Among all linear unbiased estimators, OLS has the smallest variance (BLUE) if errors are homoscedastic and uncorrelated.
- **Variance formula:** `Var(β̂_1) = σ² / Σ(X_i - X̄)²` and `Var(β̂_0) = σ² (1/n + X̄²/Σ(X_i - X̄)²)`.

**Interpretation:** OLS minimizes vertical distances (errors in `Y`), not perpendicular distances to the line.
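The closed-form estimators can be checked numerically. The sketch below applies them with plain NumPy to the five sample points used in the scikit-learn example of Question 9; the variable names are illustrative.

```python
import numpy as np

# Sample data reused from the scikit-learn example in Question 9
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.5, 3.5, 4.2, 5.0, 5.8])

x_bar, y_bar = x.mean(), y.mean()

# Numerator and denominator of the slope formula
s_xy = np.sum((x - x_bar) * (y - y_bar))   # Σ (X_i - X̄)(Y_i - Ȳ)
s_xx = np.sum((x - x_bar) ** 2)            # Σ (X_i - X̄)²

beta1_hat = s_xy / s_xx                    # slope estimate
beta0_hat = y_bar - beta1_hat * x_bar      # intercept estimate

print(f"Slope:     {beta1_hat:.4f}")   # 0.8100
print(f"Intercept: {beta0_hat:.4f}")   # 1.7700
```

These values agree with the slope and intercept printed by `LinearRegression` in Question 9.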

## Question 6: What is Logistic Regression? How does it differ from Linear Regression?

**Answer (Thorough):**

Logistic Regression is a generalized linear model (GLM) used for binary classification. Instead of modelling the conditional mean `E[Y|X]` directly as a linear function, logistic regression models the **log-odds** (logit) of the probability that `Y = 1` as a linear function of predictors:

\[ \text{logit}(P(Y=1|X)) = \log\frac{P(Y=1|X)}{1 - P(Y=1|X)} = \beta_0 + \beta_1 X \]

**Equivalently,** the probability is given by the logistic (sigmoid) function:

\[ P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

**Key differences vs Linear Regression:**
- **Outcome type:** Linear → continuous; Logistic → binary (multinomial and ordinal extensions handle more than two categories).
- **Model target:** Linear models `E[Y|X]`; logistic models `P(Y=1|X)` via a transformed (logit) scale.
- **Estimation:** Linear uses OLS; logistic uses maximum likelihood estimation (MLE).
- **Interpretation of coefficients:** In logistic regression, `e^{β_1}` is the odds ratio for a one-unit increase in `X`.

**Use cases:** Spam detection, credit default prediction, medical diagnosis.
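The link between the logit scale, the probability scale, and the odds ratio can be verified numerically. In the sketch below the coefficients `beta0 = -1.0` and `beta1 = 0.5` are hypothetical values chosen for illustration, and the helper names are placeholders.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a real-valued log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients chosen for illustration
beta0, beta1 = -1.0, 0.5

def prob(x):
    """P(Y = 1 | X = x) under the logistic model."""
    return sigmoid(beta0 + beta1 * x)

def odds(x):
    """Odds of Y = 1 at X = x, i.e. p / (1 - p)."""
    p = prob(x)
    return p / (1.0 - p)

x = 2.0
log_odds = np.log(odds(x))               # equals beta0 + beta1 * x
odds_ratio = odds(x + 1.0) / odds(x)     # equals exp(beta1) for any x

print(f"P(Y=1 | X=2):        {prob(x):.4f}")
print(f"Log-odds at X=2:     {log_odds:.4f}")
print(f"Odds ratio per unit: {odds_ratio:.4f}  vs  exp(beta1) = {np.exp(beta1):.4f}")
```

The odds ratio is the same at every `x`, while the change in probability per unit of `X` is not, which is why logistic coefficients are interpreted on the odds scale.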

## Question 7: Name and briefly describe three common evaluation metrics for regression models.

**Answer (Expanded with interpretation):**

1. **Mean Absolute Error (MAE):**
   - Formula: \( MAE = \frac{1}{n} \sum_{i=1}^n |Y_i - \hat{Y}_i| \).
   - Interpretation: Average absolute prediction error in the same units as `Y`. Less sensitive to outliers than MSE.

2. **Mean Squared Error (MSE):**
   - Formula: \( MSE = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 \).
   - Interpretation: Penalizes larger errors more heavily; useful for optimization and theoretical analysis.

3. **Root Mean Squared Error (RMSE):**
   - Formula: \( RMSE = \sqrt{MSE} \).
   - Interpretation: Presents error on the same scale as `Y`. Easier to interpret than MSE.

**Additional metrics (brief):**
- **Mean Absolute Percentage Error (MAPE):** Useful when relative errors matter, but problematic if `Y` can be zero.
- **R-squared:** Proportion of variance explained (covered in Question 8).
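A minimal NumPy sketch of the three main metrics, computed on the sample data from Question 9 with predictions from the least-squares line (in-sample, so the errors are optimistic):

```python
import numpy as np

# Sample data reused from the scikit-learn example in Question 9
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.5, 3.5, 4.2, 5.0, 5.8])

# Predictions from the least-squares line
slope, intercept = np.polyfit(x, y, 1)
y_pred = intercept + slope * x

errors = y - y_pred
mae = np.mean(np.abs(errors))    # Mean Absolute Error
mse = np.mean(errors ** 2)       # Mean Squared Error
rmse = np.sqrt(mse)              # Root Mean Squared Error

print(f"MAE:  {mae:.4f}")    # 0.0440
print(f"MSE:  {mse:.4f}")    # 0.0038
print(f"RMSE: {rmse:.4f}")   # 0.0616
```

RMSE is at least as large as MAE on the same errors; a wide gap between the two indicates that a few large errors dominate.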

## Question 8: What is the purpose of the R-squared metric in regression analysis?

**Answer (Detailed):**

R-squared (R²) quantifies the proportion of variance in the dependent variable `Y` that is explained by the independent variable(s) `X` in the model. Formally:

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \]

where \( SS_{res} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 \) is the residual sum of squares and \( SS_{tot} = \sum_{i=1}^n (Y_i - \bar{Y})^2 \) is the total sum of squares.

**Interpretation and caveats:**
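- `R² = 0` → the model explains none of the variance (it does no better than always predicting `Ȳ`); `R² = 1` → perfect fit. On held-out data, R² can be negative when the model predicts worse than the mean.
- R² never decreases when predictors are added, even irrelevant ones; use **Adjusted R²** to penalize model complexity.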
- High R² doesn't imply causation or model correctness.
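The definition can be applied directly. The sketch below computes R² and Adjusted R² on the sample data from Question 9, using the common adjustment \( 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \) with `p` predictors (here `p = 1`).

```python
import numpy as np

# Sample data reused from the scikit-learn example in Question 9
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.5, 3.5, 4.2, 5.0, 5.8])

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

n, p = len(y), 1                        # observations, number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R-squared:          {r2:.4f}")      # 0.9971
print(f"Adjusted R-squared: {adj_r2:.4f}")  # 0.9961
```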

## Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

**Answer (Code + Explanation):**

```python
# Scikit-learn example for Simple Linear Regression (SLR)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Sample data (X: feature, y: target)
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2.5, 3.5, 4.2, 5.0, 5.8])

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Coefficients
slope = model.coef_[0]
intercept = model.intercept_
y_pred = model.predict(X)

print(f"Slope (β₁): {slope:.4f}")
print(f"Intercept (β₀): {intercept:.4f}")

# Additional evaluation
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R-squared: {r2:.4f}")
```


```text
Slope (β₁): 0.8100
Intercept (β₀): 1.7700
MSE: 0.0038
RMSE: 0.0616
R-squared: 0.9971
```


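**Explanation:** `LinearRegression` fits the coefficients by ordinary least squares. `model.coef_` is an array with one slope per feature (hence `coef_[0]` here), and `model.intercept_` is the intercept. The MSE, RMSE, and R² are computed on the training data, so they describe goodness of fit and not out-of-sample performance.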

## Question 10: How do you interpret the coefficients in a simple linear regression model?

**Answer (Expanded):**

Given `Y = β₀ + β₁ X + ε`:

- **Intercept (β₀):** The expected value of `Y` when `X = 0`. In practice, assess whether `X = 0` is in the domain; if not, the intercept may not have meaningful real-world interpretation.

- **Slope (β₁):** The expected change in `Y` given a one-unit increase in `X`. It captures direction (positive/negative) and magnitude of association.

**Statistical inference:**
- Use standard errors, t-tests and confidence intervals to assess whether coefficients are statistically significantly different from zero.
- A small p-value for `β₁` suggests a statistically significant association between `X` and `Y` under model assumptions.

**Practical considerations:**
- Check residual diagnostics, influence measures (e.g., Cook's distance), and multicollinearity (in multiple regression) before trusting coefficient interpretation.
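The inference steps can be sketched with NumPy using the variance formulas from Question 5 and the sample data from Question 9. The critical value `t_crit = 3.182` is the two-sided 95% quantile of the t distribution with 3 degrees of freedom, hard-coded from a standard t-table as an assumption of this sketch so that no extra library is needed.

```python
import numpy as np

# Sample data reused from the scikit-learn example in Question 9
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.5, 3.5, 4.2, 5.0, 5.8])
n = len(x)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Unbiased estimate of σ²: residual sum of squares over n - 2 degrees of freedom
sigma2_hat = np.sum(residuals ** 2) / (n - 2)
s_xx = np.sum((x - x.mean()) ** 2)

# Standard errors from the variance formulas, with σ² replaced by its estimate
se_slope = np.sqrt(sigma2_hat / s_xx)
se_intercept = np.sqrt(sigma2_hat * (1.0 / n + x.mean() ** 2 / s_xx))

# t statistic for H0: β₁ = 0
t_slope = slope / se_slope

# 95% confidence interval for the slope (t-table value for n - 2 = 3 df)
t_crit = 3.182
ci_low, ci_high = slope - t_crit * se_slope, slope + t_crit * se_slope

print(f"SE(slope) = {se_slope:.4f}, SE(intercept) = {se_intercept:.4f}")
print(f"t statistic for slope = {t_slope:.2f}")
print(f"95% CI for slope: ({ci_low:.4f}, {ci_high:.4f})")
```

Since the interval excludes zero, the slope is statistically distinguishable from zero at the 5% level, subject to the model assumptions and the very small sample.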