## Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

### Simple Linear Regression (SLR)

Simple Linear Regression (SLR) is a statistical method used to model the relationship between two continuous variables:

1.  **Dependent Variable (Y)**: The variable we are trying to predict or explain.
2.  **Independent Variable (X)**: The variable used to predict or explain the dependent variable.

SLR assumes a linear relationship between these two variables, meaning that each one-unit change in the independent variable is associated with the same constant change in the expected value of the dependent variable. The relationship is represented by a straight line, often expressed by the equation:

$Y = \beta_0 + \beta_1 X + \epsilon$

Where:
*   $Y$ is the dependent variable.
*   $X$ is the independent variable.
*   $\beta_0$ (beta-naught) is the y-intercept, representing the expected value of Y when X is 0.
*   $\beta_1$ (beta-one) is the slope of the regression line, representing the change in Y for a one-unit change in X.
*   $\epsilon$ (epsilon) is the error term, accounting for the variability in Y that cannot be explained by X.
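
To make the role of each term concrete, the following minimal sketch simulates data from the model. The parameter values (`beta_0 = 2.0`, `beta_1 = 0.5`, `sigma = 1.0`) and the helper `simulate` are illustrative assumptions, not values taken from any dataset.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

beta_0 = 2.0   # assumed intercept (illustrative value)
beta_1 = 0.5   # assumed slope (illustrative value)
sigma = 1.0    # assumed standard deviation of the error term


def simulate(x_values, beta_0, beta_1, sigma):
    """Draw one Y for each X from Y = beta_0 + beta_1 * X + epsilon."""
    ys = []
    for x in x_values:
        epsilon = random.gauss(0.0, sigma)  # random error with mean zero
        ys.append(beta_0 + beta_1 * x + epsilon)
    return ys


x_values = [float(x) for x in range(11)]
y_values = simulate(x_values, beta_0, beta_1, sigma)

for x, y in zip(x_values, y_values):
    line = beta_0 + beta_1 * x  # systematic part of the model
    print(f"X={x:4.1f}  line={line:5.2f}  Y={y:5.2f}  error={y - line:+.2f}")
```

Each printed row separates the systematic part ($\beta_0 + \beta_1 X$) from the random error that moves the observed $Y$ off the line.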

### Purpose of SLR

The primary purposes of Simple Linear Regression are:

1.  **Prediction**: To predict the value of the dependent variable based on a given value of the independent variable. For example, predicting a student's exam score based on the number of hours they studied.
2.  **Understanding the Relationship**: To quantify and understand the strength and direction of the linear relationship between two variables. This helps in determining how one variable influences another.
3.  **Explanation and Inference**: To explain the variation in the dependent variable based on the independent variable and to make inferences about the population parameters (e.g., whether a relationship exists, and the magnitude of that relationship).

In essence, SLR provides a straightforward way to analyze and visualize how two variables move together, making it a foundational tool in many fields like economics, biology, social sciences, and engineering.

## Question 2: What are the key assumptions of Simple Linear Regression?

Simple Linear Regression (SLR) relies on several key assumptions for its results to be valid and reliable. Violations of these assumptions can lead to biased coefficients, incorrect standard errors, and unreliable hypothesis tests. The main assumptions are:

1.  **Linearity**: The relationship between the independent variable (X) and the dependent variable (Y) is linear. This means that the change in Y for a one-unit change in X is constant.

2.  **Independence of Errors**: The residuals (errors) are independent of each other. In other words, the error for one observation does not influence the error for another observation. This assumption is often violated in time series data or hierarchical data.

3.  **Homoscedasticity (Constant Variance of Errors)**: The variance of the residuals is constant across all levels of the independent variable. This means the spread of the residuals should be roughly the same for all values of X. If the variance changes, it's called heteroscedasticity.

4.  **Normality of Errors**: The errors are normally distributed. The least squares estimates of $\beta_0$ and $\beta_1$ do not need this assumption to be unbiased, but exact hypothesis tests and confidence intervals (the usual $t$ and $F$ procedures) rely on it in small samples. For large samples, the Central Limit Theorem makes these procedures approximately valid even under moderate departures from normality.

5.  **No Multicollinearity (for Multiple Linear Regression)**: This is not an assumption of *Simple* Linear Regression, which has only one independent variable. In *Multiple* Linear Regression, the independent variables should not be highly correlated with each other. For SLR, the corresponding requirement is simply that X is not constant (i.e., there is some variation in the independent variable).

6.  **No Outliers**: While not a strict mathematical assumption, outliers can heavily influence the regression line, especially in smaller datasets. It's important to identify and address outliers if they significantly distort the model.
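
Several of these assumptions can be checked informally from the residuals of a fitted line. The sketch below uses made-up data and plain Python. The two checks (comparing residual variance across the lower and upper half of X, and a lag-1 autocorrelation) are rough heuristics chosen for illustration, not formal statistical tests.

```python
from statistics import mean, pvariance

# Hypothetical data, invented for this sketch (not from a real study).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

# Fit the line by least squares so that the residuals can be inspected.
x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Homoscedasticity (informal): residual spread in the lower vs. upper half of X.
half = len(x) // 2
var_low = pvariance(residuals[:half])
var_high = pvariance(residuals[half:])

# Independence (informal): lag-1 autocorrelation of residuals ordered by X.
lag1 = sum(a * b for a, b in zip(residuals, residuals[1:])) / sum(
    r * r for r in residuals
)

print(f"mean residual:         {mean(residuals):+.4f}")
print(f"variance (low X half): {var_low:.4f}")
print(f"variance (high X half):{var_high:.4f}")
print(f"lag-1 autocorrelation: {lag1:+.4f}")
```

Similar variances in the two halves are consistent with homoscedasticity, and a lag-1 autocorrelation far from zero would hint at dependent errors. A plot of residuals against X is usually the first thing to look at for linearity and outliers.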

## Question 3: Write the mathematical equation for a simple linear regression model and explain each term.

The mathematical equation for a Simple Linear Regression (SLR) model is typically expressed as:

$Y = \beta_0 + \beta_1 X + \epsilon$

Let's break down each term in the equation:

*   **$Y$ (Dependent Variable / Response Variable)**:
    *   This is the variable that we are trying to predict or explain. Its value depends on the independent variable(s).
    *   In the context of prediction, $Y$ is the output we want to estimate.

*   **$X$ (Independent Variable / Predictor Variable / Explanatory Variable)**:
    *   This is the variable used to predict or explain the changes in the dependent variable $Y$.
    *   In SLR, there is only one independent variable.

*   **$\beta_0$ (Beta-naught / Y-intercept)**:
    *   This represents the expected mean value of $Y$ when $X$ is 0.
    *   Geometrically, it's the point where the regression line crosses the y-axis.
    *   It gives the baseline value of $Y$ when there is no influence from $X$.

*   **$\beta_1$ (Beta-one / Slope Coefficient)**:
    *   This represents the change in the expected mean value of $Y$ for every one-unit increase in $X$.
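    *   Its sign gives the direction of the linear relationship, and its magnitude gives the rate of change in $Y$ per unit of $X$. The strength of the relationship is measured separately, for example by the correlation coefficient or $R^2$.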
    *   A positive $\beta_1$ means $Y$ increases as $X$ increases, while a negative $\beta_1$ means $Y$ decreases as $X$ increases.

*   **$\epsilon$ (Epsilon / Error Term)**:
    *   This represents the error in the model: the variability in $Y$ that cannot be explained by the linear relationship with $X$. The residual $e_i = Y_i - \hat{Y}_i$ computed from a fitted line is the observable estimate of this unobserved error.
    *   It includes the effects of all other factors not included in the model, as well as random variation.
    *   The error term is typically assumed to be normally distributed with a mean of zero and constant variance (homoscedasticity).

## Question 4: Provide a real-world example where simple linear regression can be applied.

**Example: Predicting House Prices based on Size**

*   **Scenario**: A real estate agent wants to predict the selling price of a house based on its living area (square footage).

*   **Dependent Variable (Y)**: Selling Price of the house (e.g., in USD).

*   **Independent Variable (X)**: Living Area of the house (e.g., in square feet).

*   **Application of SLR**: The agent can collect data on many houses, recording their living area and their corresponding selling prices. By applying Simple Linear Regression, they can model the relationship between these two variables.

    *   **Equation**: `Selling Price = β₀ + β₁ * Living Area + ε`

*   **Interpretation of Terms**:
    *   **β₀ (Y-intercept)**: This would represent the baseline price of a house, hypothetically when the living area is zero. In a practical sense, it might not be directly interpretable but helps in fitting the line.
    *   **β₁ (Slope Coefficient)**: This would represent the average increase in selling price for every one-unit (one square foot) increase in the living area. For example, if β₁ = 100, it means that, on average, a house's price increases by $100 for every additional square foot.

*   **Purpose in this example**:
    *   **Prediction**: The agent can use the established regression model to predict the selling price of a new house for which only the living area is known.
    *   **Understanding the Relationship**: It helps the agent understand how strongly and in what direction house size influences its price. They can see if larger houses generally fetch higher prices and by how much.
    *   **Pricing Strategy**: This model can inform pricing strategies for new listings, helping to set a competitive and realistic price.

This is a classic example of SLR because it involves a single independent variable (living area) explaining a single dependent variable (selling price) with an assumed linear relationship.
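
A minimal sketch of this scenario follows. The five listings are invented for illustration (they are not real market data), and `predict_price` is a hypothetical helper name.

```python
# Hypothetical listings, invented for illustration.
area_sqft = [1000, 1500, 2000, 2500, 3000]                   # living area (X)
price_usd = [205_000, 245_000, 300_000, 355_000, 395_000]    # selling price (Y)

n = len(area_sqft)
x_bar = sum(area_sqft) / n
y_bar = sum(price_usd) / n

# Least-squares estimates of the slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(area_sqft, price_usd))
s_xx = sum((x - x_bar) ** 2 for x in area_sqft)
slope = s_xy / s_xx                  # USD per additional square foot
intercept = y_bar - slope * x_bar    # fitted price at zero area (not meaningful on its own)


def predict_price(area):
    """Predicted selling price for a house with the given living area."""
    return intercept + slope * area


print(f"Slope: {slope:.2f} USD per sq ft")
print(f"Intercept: {intercept:.2f} USD")
print(f"Predicted price for 1800 sq ft: {predict_price(1800):.2f} USD")
```

With these made-up numbers the slope works out to 98 USD per square foot and the intercept to 104,000 USD, so an 1,800 sq ft house is predicted at 280,400 USD. Predictions are only trustworthy inside the range of areas the line was fitted on.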

## Question 5: What is the method of least squares in linear regression?

The **Method of Least Squares** is a standard approach in linear regression to find the best-fitting line through a set of data points. The "best-fitting" line is defined as the line that minimizes the sum of the squared differences between the observed values (actual data points) and the values predicted by the model (points on the regression line).

Let's break down the concept:

*   **The Goal**: In Simple Linear Regression, we want to estimate the coefficients ($\beta_0$ and $\beta_1$) of the regression line $Y = \beta_0 + \beta_1 X + \epsilon$. The method of least squares provides a way to calculate these coefficients.

*   **Residuals (Errors)**: For each data point $(X_i, Y_i)$ and a candidate pair of coefficients, the line predicts a value $\hat{Y_i} = \beta_0 + \beta_1 X_i$. The difference between the actual observed value $Y_i$ and the predicted value $\hat{Y_i}$ is called the residual ($e_i$):

    $e_i = Y_i - \hat{Y_i} = Y_i - (\beta_0 + \beta_1 X_i)$

    These residuals represent how far off our prediction is for each data point.

*   **Minimizing the Sum of Squared Residuals**: The core idea of the least squares method is to find the values of $\beta_0$ and $\beta_1$ that minimize the **Sum of Squared Errors (SSE)**, also known as the **Residual Sum of Squares (RSS)**. Squaring the errors does two things:
    1.  It ensures that positive and negative errors don't cancel each other out, so an error of a given size is penalized equally whether it is positive or negative.
    2.  It gives more weight to larger errors, meaning the model tries harder to fit points that are further away from the line.

    The objective function to minimize is:

    $SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_i))^2$

*   **Finding the Coefficients**: To find the values of $\beta_0$ and $\beta_1$ that minimize SSE, calculus is used. We take the partial derivatives of the SSE function with respect to $\beta_0$ and $\beta_1$, set them to zero, and solve the resulting system of equations. This process yields the following formulas for the estimated coefficients:

    $\hat{\beta_1} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{Cov(X, Y)}{Var(X)}$

    $\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}$

    Where:
    *   $\bar{X}$ is the mean of the independent variable $X$.
    *   $\bar{Y}$ is the mean of the dependent variable $Y$.
    *   $Cov(X, Y)$ is the covariance between $X$ and $Y$.
    *   $Var(X)$ is the variance of $X$.

In summary, the method of least squares is a mathematical technique used to determine the regression line that minimizes the total of the squared vertical distances (residuals) from each data point to the line. This method is fundamental to estimating the parameters of a linear regression model.
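
The closed-form formulas above can be sketched in a few lines of plain Python. The data are a small made-up example, and the function names `least_squares` and `sse` are chosen for this sketch only.

```python
def least_squares(x, y):
    """Return (b0_hat, b1_hat) from the closed-form least-squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1_hat = s_xy / s_xx
    b0_hat = y_bar - b1_hat * x_bar
    return b0_hat, b1_hat


def sse(b0, b1, x, y):
    """Sum of squared residuals for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Small made-up dataset for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0_hat, b1_hat = least_squares(x, y)
best = sse(b0_hat, b1_hat, x, y)

print(f"slope = {b1_hat:.2f}, intercept = {b0_hat:.2f}, SSE = {best:.2f}")

# Nudging either coefficient away from the estimate increases the SSE.
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    print(f"shift ({db0:+.2f}, {db1:+.2f}) -> SSE = {sse(b0_hat + db0, b1_hat + db1, x, y):.4f}")
```

For this data the estimates are a slope of 0.6 and an intercept of 2.2, with an SSE of 2.4. Every perturbed line has a larger SSE, which is what "least squares" means.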

## Question 6: What is Logistic Regression? How does it differ from Linear Regression?

### What is Logistic Regression?

**Logistic Regression** is a statistical model used for predicting the probability of a binary outcome (an outcome that can have only two possible values, e.g., yes/no, true/false, pass/fail). Despite its name, Logistic Regression is a classification algorithm, not a regression algorithm in the traditional sense, because its output is a probability that is then mapped to a class label.

It works by using a **sigmoid (or logistic) function** to map any real-valued number into a value between 0 and 1. This output value can then be interpreted as the probability of the dependent variable belonging to a particular class. If the probability is above a certain threshold (e.g., 0.5), the observation is classified into one class; otherwise, it's classified into the other.

The equation for the logistic (sigmoid) function is:

$P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$

Where:
*   $P(Y=1|X)$ is the probability that the dependent variable $Y$ belongs to class 1, given the independent variable $X$.
*   $e$ is the base of the natural logarithm.
*   $\beta_0$ is the intercept.
*   $\beta_1$ is the coefficient for the independent variable $X$.

The core idea is to model the *log-odds* of the event occurring as a linear combination of the independent variables:

$\ln\left(\frac{P(Y=1|X)}{1 - P(Y=1|X)}\right) = \beta_0 + \beta_1 X$
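
The two equations are inverses of each other: the sigmoid turns a log-odds value into a probability, and the log-odds (logit) turns it back. A minimal sketch, with coefficient values picked purely for illustration:

```python
import math


def sigmoid(z):
    """Map any real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Inverse of the sigmoid: the log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


beta_0, beta_1 = -4.0, 1.0  # assumed coefficients (illustrative values)

for x in [0, 2, 4, 6, 8]:
    z = beta_0 + beta_1 * x  # linear predictor, equal to the log-odds
    p = sigmoid(z)           # P(Y=1 | X=x)
    print(f"X={x}  log-odds={z:+.1f}  P(Y=1|X)={p:.3f}  recovered log-odds={log_odds(p):+.1f}")
```

At `X = 4` the linear predictor is zero, so the probability is exactly 0.5. Each one-unit increase in X adds $\beta_1$ to the log-odds rather than to the probability itself.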

### How does it differ from Linear Regression?

The key differences between Logistic Regression and Linear Regression are fundamental and relate to their purpose, the nature of their dependent variables, and the functions they employ:

1.  **Nature of Dependent Variable (Outcome)**:
    *   **Linear Regression**: Predicts a continuous outcome variable (e.g., house price, temperature, sales figures).
    *   **Logistic Regression**: Predicts a categorical outcome variable, specifically a binary outcome (e.g., spam/not spam, disease/no disease, customer churn/no churn).

2.  **Output and Interpretation**:
    *   **Linear Regression**: Outputs a direct continuous value, representing the predicted value of the dependent variable.
    *   **Logistic Regression**: Outputs a probability (a value between 0 and 1) that the dependent variable belongs to a particular class. This probability is then often converted into a binary class prediction.

3.  **Underlying Function/Equation**:
    *   **Linear Regression**: Uses a linear equation ($Y = \beta_0 + \beta_1 X + \epsilon$) to model the relationship between X and Y directly.
    *   **Logistic Regression**: Uses the **logistic (sigmoid) function** to transform the linear combination of independent variables into a probability. It models the *log-odds* of the dependent variable as a linear function of the independent variables.

4.  **Error Distribution**:
    *   **Linear Regression**: Assumes that the error terms (residuals) are normally distributed.
    *   **Logistic Regression**: Does not assume normally distributed errors. Instead, the outcome is assumed to follow a Bernoulli distribution (a binomial with a single trial) conditional on $X$, consistent with the binary nature of the outcome.

5.  **Parameter Estimation**:
    *   **Linear Regression**: Parameters (coefficients) are typically estimated using the Ordinary Least Squares (OLS) method, which minimizes the sum of squared errors.
    *   **Logistic Regression**: Parameters are typically estimated using Maximum Likelihood Estimation (MLE), which maximizes the likelihood of observing the actual data given the model parameters.

In summary, while both are regression techniques, Linear Regression is designed for continuous outcome prediction, producing a numerical value, whereas Logistic Regression is a classification technique for binary outcomes, producing a probability that can be used for classification.
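
The contrast can be seen by fitting both models to the same binary outcome. The sketch below uses invented pass/fail data, fits the logistic model by plain gradient ascent on the log-likelihood (one simple way to do MLE; libraries typically use faster optimizers), and fits the straight line with the least-squares formulas. The step size and iteration count are assumptions chosen for this small example.

```python
import math

# Hypothetical data, invented for illustration: hours studied and pass (1) / fail (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
n = len(hours)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1):
    """Bernoulli log-likelihood of the data under the logistic model."""
    total = 0.0
    for x, y in zip(hours, passed):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Logistic regression: maximize the likelihood with plain gradient ascent.
b0, b1 = 0.0, 0.0
learning_rate = 0.1  # assumed step size
for _ in range(20_000):
    errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(hours, passed)]
    b0 += learning_rate * sum(errors) / n
    b1 += learning_rate * sum(e * x for e, x in zip(errors, hours)) / n


def predict_proba(x):
    """Logistic model: estimated P(pass | hours = x)."""
    return sigmoid(b0 + b1 * x)


# Linear regression on the same 0/1 outcome, via the least-squares formulas.
x_bar = sum(hours) / n
y_bar = sum(passed) / n
ols_slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, passed)) / sum(
    (x - x_bar) ** 2 for x in hours
)
ols_intercept = y_bar - ols_slope * x_bar


def predict_linear(x):
    """Straight-line prediction; not constrained to [0, 1]."""
    return ols_intercept + ols_slope * x


for x in (0.0, 2.75, 6.0, 10.0):
    print(f"hours={x:5.2f}  logistic={predict_proba(x):.3f}  linear={predict_linear(x):+.3f}")
```

The logistic predictions stay between 0 and 1 for every input, while the straight line drops below 0 at 0 hours and exceeds 1 at 10 hours, which is why a linear model is a poor fit for probabilities.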

## Question 7: Name and briefly describe three common evaluation metrics for regression models.

When evaluating the performance of a regression model, several metrics can be used to quantify how well the model's predictions align with the actual observed values. Here are three common ones:

1.  **Mean Absolute Error (MAE)**:
    *   **Description**: MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average of the absolute differences between the predicted values and the actual values.
    *   **Formula**: $MAE = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i|$
    *   **Interpretation**: A lower MAE indicates a better fit. MAE is more robust to outliers than MSE because it does not square the errors.

2.  **Mean Squared Error (MSE)**:
    *   **Description**: MSE measures the average of the squares of the errors. It is the average squared difference between the predicted values and the actual values. Because the errors are squared, MSE gives more weight to larger errors, making it sensitive to outliers.
    *   **Formula**: $MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
    *   **Interpretation**: A lower MSE indicates a better fit. MSE is useful when larger errors are particularly undesirable, as it penalizes them more heavily.

3.  **R-squared ($R^2$) or Coefficient of Determination**:
    *   **Description**: R-squared is a statistical measure that represents the proportion of the variance in the dependent variable that can be explained by the independent variable(s) in the regression model. It indicates how well the model accounts for the variability in the outcome.
    *   **Formula**: $R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$
        *   $SSE$ (Sum of Squared Errors) is the sum of the squared differences between actual and predicted values.
        *   $SST$ (Total Sum of Squares) is the sum of the squared differences between actual values and their mean.
    *   **Interpretation**: R-squared values range from 0 to 1 (or sometimes negative for poor models). An $R^2$ of 1 means the model perfectly explains the variability of the dependent variable. An $R^2$ of 0 means the model explains none of the variability. Generally, a higher $R^2$ indicates a better fit, but it should be used cautiously as a standalone metric, especially when comparing models with different numbers of predictors.
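
The three formulas can be sketched directly in plain Python. The actual and predicted values below are made up for illustration, and the helper names `mae`, `mse`, and `r2` are local to this sketch.

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |Y_i - Y_hat_i|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average of (Y_i - Y_hat_i)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """R-squared: 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst


# Made-up values for illustration; the last prediction is off by 3.
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 12]

print(f"MAE = {mae(actual, predicted):.2f}")
print(f"MSE = {mse(actual, predicted):.2f}")
print(f"R^2 = {r2(actual, predicted):.2f}")
```

Here the errors are 1, 0, -1, and -3, giving MAE = 1.25, MSE = 2.75, and $R^2$ = 0.45. The single error of size 3 contributes 9 of the 11 squared-error units, which shows how MSE weights large errors more heavily than MAE does.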

## Question 8: What is the purpose of the R-squared metric in regression analysis?


**Description:** R-squared is a statistical measure that represents the proportion of the variance in the dependent variable that can be explained by the independent variable(s) in the regression model. It indicates how well the model accounts for the variability in the outcome.

**Formula:** $R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$

*   $SSE$ (Sum of Squared Errors) is the sum of the squared differences between actual and predicted values.
*   $SST$ (Total Sum of Squares) is the sum of the squared differences between actual values and their mean.

**Interpretation:** R-squared values range from 0 to 1 (or sometimes negative for poor models). An $R^2$ of 1 means the model perfectly explains the variability of the dependent variable. An $R^2$ of 0 means the model explains none of the variability. Generally, a higher $R^2$ indicates a better fit, but it should be used cautiously as a standalone metric, especially when comparing models with different numbers of predictors.
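
A short sketch of what $R^2$ measures, using the same sample data as the Question 9 code. It relies on a standard property of simple linear regression with an intercept: $R^2$ equals the square of the Pearson correlation between X and Y. The variable names are chosen for this sketch.

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 4, 6, 7, 8, 9, 10, 12]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least-squares line and its predictions.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
predictions = [intercept + slope * xi for xi in x]

sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))  # unexplained variation
sst = s_yy                                                   # total variation
r_squared = 1.0 - sse / sst

# Pearson correlation between X and Y.
r = s_xy / math.sqrt(s_xx * s_yy)

# Baseline model that always predicts the mean of Y explains nothing.
sse_baseline = sum((yi - y_bar) ** 2 for yi in y)
r_squared_baseline = 1.0 - sse_baseline / sst

print(f"R^2 of fitted line:   {r_squared:.4f}")
print(f"squared correlation:  {r ** 2:.4f}")
print(f"R^2 of mean baseline: {r_squared_baseline:.4f}")
```

The fitted line leaves only a small share of the total variation unexplained, while the mean-only baseline scores exactly 0. That comparison against the baseline is the purpose of the metric.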

## Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 1. Generate some sample data
# Let's assume X is the independent variable and y is the dependent variable
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1) # Independent variable
y = np.array([2, 4, 5, 4, 6, 7, 8, 9, 10, 12]) # Dependent variable

# You can also add some noise to make it more realistic
# y = 2 * X.flatten() + 1 + np.random.randn(10) * 0.5

# 2. Initialize the Linear Regression model
model = LinearRegression()

# 3. Fit the model to the data
model.fit(X, y)

# 4. Print the slope (coefficient) and intercept
print(f"Slope (Coefficient): {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
```

Output:

```
Slope (Coefficient): 1.00
Intercept: 1.20
```


### Explanation of the Code:

1.  **`import numpy as np`**: Imports the NumPy library, which is essential for numerical operations, especially for creating arrays.
2.  **`from sklearn.linear_model import LinearRegression`**: Imports the `LinearRegression` class from scikit-learn, which is the model we'll use.
3.  **`X = np.array([...]).reshape(-1, 1)`**: Creates our independent variable data. `reshape(-1, 1)` is crucial because scikit-learn expects the input features (`X`) to be a 2D array (even if there's only one feature).
4.  **`y = np.array([...])`**: Creates our dependent variable data.
5.  **`model = LinearRegression()`**: Initializes an instance of the `LinearRegression` model.
6.  **`model.fit(X, y)`**: This is where the magic happens! The model learns the relationship between `X` and `y` by finding the optimal slope and intercept that minimize the sum of squared errors (the method of least squares).
7.  **`model.coef_[0]`**: After fitting, `model.coef_` contains the estimated coefficients (slopes) of the linear model. For simple linear regression, it's a single value, so we access it with `[0]`.
8.  **`model.intercept_`**: This attribute holds the estimated y-intercept of the regression line.

## Question 10: How do you interpret the coefficients in a simple linear regression model?

In a simple linear regression model, represented by the equation $Y = \beta_0 + \beta_1 X + \epsilon$, there are two main coefficients to interpret: the intercept ($\beta_0$) and the slope ($\beta_1$).

1.  **Interpretation of the Intercept ($\beta_0$)**:
    *   **Definition**: The intercept represents the predicted mean value of the dependent variable ($Y$) when the independent variable ($X$) is equal to zero.
    *   **Practicality**: Its practical interpretation depends on the context of the data:
        *   **Meaningful Zero**: If $X=0$ is a meaningful and possible value (e.g., predicting plant growth based on hours of sunlight, and 0 hours is a possibility), then $\beta_0$ represents the expected baseline value of $Y$ without any influence from $X$.
        *   **Extrapolation**: If $X=0$ is outside the range of observed data or doesn't make logical sense in the real world (e.g., predicting house price based on square footage, where zero square footage is not a real house), then $\beta_0$ might not have a direct, interpretable meaning. In such cases, it primarily serves to adjust the height of the regression line.

2.  **Interpretation of the Slope ($\beta_1$)**:
    *   **Definition**: The slope represents the change in the predicted mean value of the dependent variable ($Y$) for every one-unit increase in the independent variable ($X$).
    *   **Direction and Magnitude**:
        *   **Positive $\beta_1$**: Indicates a positive linear relationship. As $X$ increases by one unit, $Y$ is predicted to increase by $\beta_1$ units.
        *   **Negative $\beta_1$**: Indicates a negative linear relationship. As $X$ increases by one unit, $Y$ is predicted to decrease by $|\beta_1|$ units.
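        *   **Magnitude**: The absolute value of $\beta_1$ is the size of the predicted change in $Y$ per one-unit change in $X$. Because it depends on the units of both variables, it does not by itself measure the strength of the relationship; use the correlation coefficient or $R^2$ for that.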
    *   **Causation vs. Correlation**: It's important to remember that a significant slope indicates a correlation or association, but it does not necessarily imply causation, unless the study design allows for such conclusions (e.g., a well-controlled experiment).

In summary, the intercept gives the baseline prediction of $Y$ when $X$ is zero, and the slope quantifies the expected change in $Y$ for a unit change in $X$.
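
One practical consequence for interpretation is that the slope is tied to the units of $X$. The sketch below refits the same made-up house data with area measured in square feet and then in square metres. The data, the rounded conversion factor, and the helper `fit_line` are assumptions for illustration.

```python
import math

SQFT_PER_SQM = 10.7639  # square feet in one square metre (rounded)

# Hypothetical listings, invented for illustration.
area_sqft = [1000, 1500, 2000, 2500, 3000]
price_usd = [205_000, 245_000, 300_000, 355_000, 395_000]


def fit_line(x, y):
    """Return (intercept, slope, correlation) for the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope, s_xy / math.sqrt(s_xx * s_yy)


area_sqm = [a / SQFT_PER_SQM for a in area_sqft]

b0_ft, b1_ft, r_ft = fit_line(area_sqft, price_usd)
b0_m, b1_m, r_m = fit_line(area_sqm, price_usd)

print(f"per sq ft: intercept={b0_ft:.0f}  slope={b1_ft:.2f}  correlation={r_ft:.4f}")
print(f"per sq m:  intercept={b0_m:.0f}  slope={b1_m:.2f}  correlation={r_m:.4f}")
```

Changing the unit of $X$ rescales the slope by the conversion factor but leaves the intercept and the correlation unchanged, so a slope should always be read together with its units.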