# Regression Assignment

Q1. What is Simple Linear Regression?  
- Simple Linear Regression is a statistical method that models the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. The model assumes a linear relationship between the two variables, represented by the equation:

\[ Y = mX + c \]

where:
- \( Y \) is the dependent variable.
- \( X \) is the independent variable.
- \( m \) is the slope of the line (the change in \( Y \) for a one-unit change in \( X \)).
- \( c \) is the intercept (the value of \( Y \) when \( X \) is zero).

-----------------------------------------------------------------

Q2. What are the key assumptions of Simple Linear Regression?  
- **Linearity:** The relationship between the independent and dependent variables should be linear.  
- **Independence:** The residuals (errors) should be independent.  
- **Homoscedasticity:** The residuals should have constant variance at every level of the independent variable.  
- **Normality:** The residuals should be normally distributed.  

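**Example:** If a plot of residuals against fitted values shows a funnel shape (the spread widens or narrows as the fitted values grow), homoscedasticity is violated. A curved pattern in the same plot instead suggests that the linearity assumption does not hold.  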
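
The funnel-shape symptom can be reproduced numerically. The following is a minimal sketch, assuming synthetic data whose noise grows with the independent variable; the data, the seed, and the split point of 5.5 are all assumptions made for illustration, and comparing two standard deviations is only a crude stand-in for inspecting a residual plot.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption for illustration): the noise grows with x,
# which deliberately violates homoscedasticity.
x = np.linspace(1, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x)

# Fit a straight line and compute the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Crude check: compare the residual spread for small x and for large x.
low_spread = residuals[x < 5.5].std()
high_spread = residuals[x >= 5.5].std()
print(f"Residual std for small x: {low_spread:.2f}")
print(f"Residual std for large x: {high_spread:.2f}")
```

A clearly larger spread in one half than in the other is the numerical counterpart of the funnel shape.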

-----------------------------------------------------------------

Q3. Write the mathematical equation for a simple linear regression model and explain each term.  
- The mathematical equation for a simple linear regression model is:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

where:
- \( y \): The dependent variable (or response variable), the outcome you are trying to predict or explain.
- \( x \): The independent variable (or predictor variable), the input or feature used to make predictions.
- \( \beta_0 \): The intercept, the value of \( y \) when \( x = 0 \). It represents the baseline level of the response variable.
- \( \beta_1 \): The slope coefficient, which shows how much \( y \) changes for a one-unit increase in \( x \). It captures the strength and direction of the relationship between \( x \) and \( y \).
- \( \varepsilon \): The error term, which accounts for the variability in \( y \) that cannot be explained by the linear relationship with \( x \). It represents random noise or other influencing factors not included in the model.
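
To make the role of each term concrete, the sketch below generates data directly from the model equation. The parameter values, the noise level, and the seed are assumptions chosen for illustration and do not come from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Assumed "true" parameters, chosen only for this illustration.
beta_0 = 1.0   # intercept
beta_1 = 0.5   # slope
sigma = 0.2    # standard deviation of the error term

x = np.linspace(0, 10, 50)
epsilon = rng.normal(loc=0.0, scale=sigma, size=x.shape)  # error term
y = beta_0 + beta_1 * x + epsilon                         # the model equation

# The deterministic part of the model (no error term), for comparison.
y_line = beta_0 + beta_1 * x
print("Largest deviation from the line:", np.abs(y - y_line).max())
```

Each observed value is the point on the line plus its own random error, which is why real data scatter around the fitted line rather than lying exactly on it.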

-----------------------------------------------------------------

Q4. Provide a real-world example where simple linear regression can be applied.  
- **Scenario:** A real estate analyst wants to predict the price of a house based on its size (in square feet).
- **Dependent variable (\( y \)):** House price
- **Independent variable (\( x \)):** Size of the house in square feet

The analyst collects data on recent house sales, including the size and sale price of each home. Using simple linear regression, they fit the model:

\[ \text{Price} = \beta_0 + \beta_1 \cdot \text{Size} + \varepsilon \]

This model estimates how much the expected price increases for each additional square foot. For example, if \( \beta_1 = 1500 \) and prices are measured in rupees, each extra square foot adds ₹1500 to the expected price.
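
The scenario can be sketched in a few lines. The sales figures below are invented for illustration (they are not real market data), and `np.polyfit` is used here as one convenient way to fit a straight line by least squares.

```python
import numpy as np

# Hypothetical sales records, invented for this illustration:
# house size in square feet and sale price in rupees.
size_sqft = np.array([800, 1000, 1200, 1500, 1800, 2000], dtype=float)
price = np.array([1_420_000, 1_685_000, 2_010_000,
                  2_430_000, 2_915_000, 3_190_000], dtype=float)

# Fit Price = beta_0 + beta_1 * Size by least squares.
# np.polyfit returns the highest-degree coefficient first (slope, then intercept).
beta_1, beta_0 = np.polyfit(size_sqft, price, deg=1)
print(f"Estimated price increase per extra square foot: {beta_1:.0f}")

# Use the fitted line to estimate the price of a 1600 sq ft house.
estimated_price = beta_0 + beta_1 * 1600
print(f"Estimated price for 1600 sq ft: {estimated_price:.0f}")
```

With these made-up numbers the fitted slope comes out close to 1500, in line with the example value in the text.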


-----------------------------------------------------------------

Q5. What is the method of least squares in linear regression?  
- The method of least squares is the standard approach used in linear regression to find the best-fitting line through a set of data points. It chooses the line that minimizes the sum of the squared residuals, where a residual is the difference between an observed value and the value predicted by the linear model.

Given a simple linear regression model:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

the method of least squares estimates the coefficients \( \beta_0 \) (intercept) and \( \beta_1 \) (slope) by minimizing the sum of squared residuals:

\[ \min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2 \]

where:
- \( y_i \) is the actual observed value
- \( \hat{y}_i \) is the predicted value from the model
- \( x_i \) is the input value
- \( n \) is the number of data points
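
For a single predictor, this minimization has a well-known closed-form solution: the slope is the sum of the products of the deviations of x and y from their means, divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. The sketch below applies those formulas in plain Python to the five sample points used in the Q9 code cell of this assignment; the variable names are chosen for illustration.

```python
# Sample points (the same five used in the Q9 code cell of this assignment).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form least-squares estimates for one predictor.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
beta_1 = s_xy / s_xx                 # slope
beta_0 = y_mean - beta_1 * x_mean    # intercept

# Sum of squared residuals at the fitted coefficients.
ssr = sum((yi - (beta_0 + beta_1 * xi)) ** 2 for xi, yi in zip(x, y))
print("Slope:", beta_1)
print("Intercept:", beta_0)
print("Sum of squared residuals:", ssr)
```

Up to floating-point rounding, the slope is 0.6 and the intercept is 2.2, which agrees with the scikit-learn output reported for Q9. Any other choice of slope and intercept gives a larger sum of squared residuals on these points.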

-----------------------------------------------------------------

Q6. What is Logistic Regression? How does it differ from Linear Regression?  
- Logistic regression is a statistical method for binary classification: it predicts one of two possible outcomes (such as yes/no, 0/1, or success/failure) by modeling the probability that a given input belongs to a particular category.

It differs from linear regression in three main ways:
- **Output:** Linear regression estimates a continuous numeric value using a straight-line relationship between variables. Logistic regression passes a linear combination of the inputs through the logistic (sigmoid) function, so its output lies between 0 and 1 and can be read as a probability.
- **Typical use:** Linear regression suits tasks like predicting house prices or temperatures. Logistic regression suits classification problems like spam detection or disease diagnosis.
- **Loss function:** Linear regression minimizes mean squared error, whereas logistic regression minimizes log loss (cross-entropy), which is better suited to categorical outcomes.

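
The effect of the sigmoid function can be seen in a short sketch. The coefficients below are hypothetical values chosen for illustration, not fitted to any data, and the 0.5 threshold is a common default rather than a requirement.

```python
import math

def sigmoid(z):
    """Map a real number z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
beta_0 = -4.0
beta_1 = 1.0

for x in [0, 2, 4, 6, 8]:
    linear_score = beta_0 + beta_1 * x      # unbounded, like a linear regression output
    probability = sigmoid(linear_score)     # squeezed into the range 0 to 1
    label = 1 if probability >= 0.5 else 0  # classify with a 0.5 threshold
    print(f"x={x}  score={linear_score:+.1f}  p={probability:.3f}  class={label}")
```
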
-----------------------------------------------------------------

Q7. Name and briefly describe three common evaluation metrics for regression models.  

- **Mean Absolute Error (MAE):** The average magnitude of the prediction errors, ignoring their direction. It is expressed in the same units as the target, which makes it easy to interpret, and it is less sensitive to outliers than MSE.
- **Mean Squared Error (MSE):** The average of the squared errors. Squaring penalizes larger errors more heavily, which is useful when large deviations are particularly undesirable.
- **R-squared (coefficient of determination):** The proportion of variance in the dependent variable that is explained by the independent variable(s). Values closer to 1 signify a better fit.

These metrics help assess how well a regression model performs and guide improvements.
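
All three metrics can be computed directly from their definitions. The sketch below uses a small set of observed values and predictions assumed for illustration.

```python
import numpy as np

# Illustrative observed values and model predictions (assumed for this example).
y_true = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
y_pred = np.array([2.8, 3.4, 4.0, 4.6, 5.2])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                    # Mean Absolute Error
mse = np.mean(errors ** 2)                       # Mean Squared Error
ss_res = np.sum(errors ** 2)                     # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot                       # R-squared

print(f"MAE: {mae:.2f}")        # MAE: 0.64
print(f"MSE: {mse:.2f}")        # MSE: 0.48
print(f"R-squared: {r2:.2f}")   # R-squared: 0.60
```

In practice, scikit-learn's `sklearn.metrics` module offers `mean_absolute_error`, `mean_squared_error`, and `r2_score` for the same calculations.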

-----------------------------------------------------------------

Q8. What is the purpose of the R-squared metric in regression analysis?
  
- The purpose of the R-squared metric in regression analysis is to measure how well the independent variable(s) explain the variability of the dependent variable. It represents the proportion of the total variation in the outcome that is accounted for by the regression model. For a least-squares model with an intercept, evaluated on the data it was fitted to, R-squared ranges from 0 to 1: a value closer to 1 indicates that the model explains a large portion of the variance, and a value near 0 suggests that it explains little. On new data, R-squared can be negative when the model predicts worse than simply using the mean of the outcome. In essence, R-squared assesses the goodness of fit of a regression model.
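
One way to build intuition for R-squared is to compare a few sets of predictions against the same observations. The sketch below uses the common definition of R-squared as one minus the ratio of the residual sum of squares to the total sum of squares; the helper `r_squared` and the data are written for this illustration.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

y = [2, 4, 5, 4, 5]  # illustrative observations

perfect = r_squared(y, y)                           # predictions match exactly
mean_only = r_squared(y, [4, 4, 4, 4, 4])           # always predict the mean of y
fitted = r_squared(y, [2.8, 3.4, 4.0, 4.6, 5.2])    # a fitted straight line
poor = r_squared(y, [5, 5, 5, 5, 5])                # worse than predicting the mean

print(perfect, mean_only, round(fitted, 2), round(poor, 2))
```

Perfect predictions give 1, predicting the mean gives 0, the fitted line here gives about 0.6, and predictions worse than the mean give a negative value.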

-----------------------------------------------------------------


Q9. Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.  

    from sklearn.linear_model import LinearRegression
    import numpy as np

    # Sample data: X = independent variable, y = dependent variable
    X = np.array([[1], [2], [3], [4], [5]])  # Feature values
    y = np.array([2, 4, 5, 4, 5])            # Target values

    # Create and fit the model
    model = LinearRegression()
    model.fit(X, y)

    # Print the slope (coefficient) and intercept
    print("Slope (β1):", model.coef_[0])
    print("Intercept (β0):", model.intercept_)

Output:

    Slope (β1): 0.6
    Intercept (β0): 2.2

-----------------------------------------------------------------


Q10. How do you interpret the coefficients in a simple linear regression model?  

- In a simple linear regression model, the coefficients describe the relationship between the independent variable and the dependent variable. The intercept (\( \beta_0 \)) is the expected value of the dependent variable when the independent variable is zero, which serves as the baseline level of the outcome. This reading is only meaningful when \( x = 0 \) is a plausible value within or near the range of the data. The slope (\( \beta_1 \)) shows how much the dependent variable is expected to change for each one-unit increase in the independent variable. A positive slope means the variables move in the same direction (as \( x \) increases, \( y \) increases), while a negative slope indicates an inverse relationship (as \( x \) increases, \( y \) decreases). Together, these coefficients define the best-fit line that models the data.
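
As a small check of this interpretation, the sketch below plugs in the slope and intercept printed by the Q9 code and confirms that the prediction at zero equals the intercept and that a one-unit step in x changes the prediction by the slope. The helper `predict` is written for this illustration.

```python
# Coefficients reported by the Q9 code in this assignment.
intercept = 2.2   # β0
slope = 0.6       # β1

def predict(x):
    """Return the fitted value on the line for a given x."""
    return intercept + slope * x

at_zero = predict(0)                  # equals the intercept
one_step = predict(4) - predict(3)    # equals the slope, up to rounding

print("Prediction at x = 0:", at_zero)
print("Change in prediction from x = 3 to x = 4:", round(one_step, 6))
```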
