In [None]:
#Question 1 : What is Simple Linear Regression (SLR)? Explain its purpose.
'''
Simple Linear Regression is a statistical method used to model the relationship between two variables — one independent variable (X) and one dependent
variable (Y) — by fitting a straight line to the observed data.

The model assumes that the relationship between X and Y can be expressed as:

Y = β₀ + β₁X + ε

Where:
Y = Dependent (response) variable
X = Independent (predictor) variable
β₀ = Intercept (expected value of Y when X = 0)
β₁ = Slope (expected change in Y for a one-unit increase in X)
ε = Random error term (captures variation not explained by the model)

Purpose of Simple Linear Regression:
1. To Understand Relationships:
It helps determine whether, and in which direction, two variables are related, for example how study hours (X) relate to exam scores (Y).

2. To Predict Outcomes:
Once the line is fitted, it can be used to predict the value of the dependent variable for a given value of the independent variable.
Example: Predicting house price (Y) from its size (X).

3. To Quantify the Relationship:
The slope (β₁) gives a numerical estimate of how much Y changes, on average, for each one-unit increase in X.

4. To Identify Trends:
SLR is often used to find trends or forecast future values based on historical data.'''

In [None]:
#Question 2: What are the key assumptions of Simple Linear Regression?
'''
Key Assumptions of Simple Linear Regression (SLR):
For Simple Linear Regression to give valid and reliable results, several assumptions must be satisfied. These assumptions ensure that the relationship
between the independent variable (X) and the dependent variable (Y) is properly modeled by a straight line.

1. Linearity:
The relationship between X and Y should be linear, meaning that each one-unit change in X shifts the expected value of Y by the same constant amount (β₁).
Mathematically:
            E(Y | X) = β₀ + β₁X
How to check: Use scatter plots or residual plots. If the data points roughly follow a straight line and the residuals show no systematic curve, the linearity assumption is reasonable.

2. Independence of Errors:
The errors should be independent of each other.
In other words, the error for one observation should not depend on the error for another.
Why it matters: Violations are common in time-series data (autocorrelation) and lead to misleading standard errors and significance tests.
How to check: Use the Durbin–Watson test or inspect a plot of residuals in observation order.

3. Homoscedasticity (Constant Variance of Errors):
The variance of the error terms (ε) should be constant across all levels of X.
That is, the spread of residuals should be roughly the same for all values of X.
If violated: This is called heteroscedasticity. The coefficient estimates remain unbiased, but standard errors, confidence intervals, and hypothesis tests become unreliable.
How to check: Plot residuals vs. predicted values. The spread should look uniform, with no funnel shape.

4. Normality of Errors:
The errors should be normally distributed. This matters mainly for hypothesis tests and confidence intervals, especially in small samples.
How to check:
  * Use a histogram or Q-Q plot of residuals.
  * Apply the Shapiro–Wilk test or the Kolmogorov–Smirnov test.

5. No Perfect Multicollinearity (in SLR context: X is not constant):
In SLR there is only one independent variable, so this reduces to requiring that X varies across observations.
If X is constant, the slope (β₁) cannot be estimated.'''
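Two of these checks can be computed directly from the residuals. The sketch below is illustrative only: it reuses the five data points from the scikit-learn cell in Question 9, the helper names `fit_line` and `durbin_watson` are hypothetical, and it relies only on the standard library. With five points neither number is statistically meaningful; the aim is to show what is being computed.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Independence of errors: Durbin-Watson on residuals in observation order.
dw = durbin_watson(residuals)

# Rough homoscedasticity check: compare residual spread for low and high X.
half = len(xs) // 2
spread_low = mean(e ** 2 for e in residuals[:half])
spread_high = mean(e ** 2 for e in residuals[-half:])

print("Residuals:", [round(e, 2) for e in residuals])
print("Durbin-Watson:", round(dw, 3))
print("Mean squared residual (low X, high X):", round(spread_low, 3), round(spread_high, 3))
```

The statistic always lies between 0 and 4. On real data, a formal test or a residual plot is a better guide than the raw numbers printed here.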


In [None]:
#Question 3: Write the mathematical equation for a simple linear regression model and explain each term.
'''
Mathematical Equation of a Simple Linear Regression Model:
The general equation for a Simple Linear Regression (SLR) model is:
                          Y = β₀ + β₁X + ε

| Term | Meaning                                   | Description |
| ---- | ----------------------------------------- | ----------- |
| Y    | Dependent Variable (Response Variable)    | The variable we are trying to predict or explain. |
| X    | Independent Variable (Predictor Variable) | The variable used to predict Y. The model describes an association; it does not by itself establish that X causes Y. |
| β₀   | Intercept (Constant Term)                 | The expected value of Y when X = 0. It is the point where the regression line crosses the Y-axis. |
| β₁   | Slope (Regression Coefficient)            | The expected change in Y for every one-unit increase in X. Its sign gives the direction of the relationship and its size gives the rate of change. |
| ε    | Error Term                                | The random variation in Y that is not explained by the linear relationship with X. It captures noise, measurement error, and other unobserved factors. The residual is its observed estimate from the fitted line. |
'''
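To see how the terms fit together, the sketch below generates data from the model equation and then recovers the parameters. Everything numeric here is an assumption made for illustration: the "true" intercept and slope, the noise level, the sample size, and the random seed are arbitrary choices, not values from any real dataset.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

TRUE_INTERCEPT = 5.0   # β₀, chosen arbitrarily for this illustration
TRUE_SLOPE = 2.0       # β₁, chosen arbitrarily for this illustration
NOISE_SD = 1.0         # standard deviation of the error term ε

# X values spread evenly over [0, 10]; Y = β₀ + β₁X + ε with Gaussian ε.
xs = [10 * i / 199 for i in range(200)]
errors = [random.gauss(0, NOISE_SD) for _ in xs]
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + e for x, e in zip(xs, errors)]

# Estimate the parameters with the least-squares formulas.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope_hat = s_xy / s_xx
intercept_hat = y_bar - slope_hat * x_bar

# Residuals are the observable stand-ins for the unobserved errors ε.
residuals = [y - (intercept_hat + slope_hat * x) for x, y in zip(xs, ys)]

print(f"estimated intercept: {intercept_hat:.3f} (true {TRUE_INTERCEPT})")
print(f"estimated slope:     {slope_hat:.3f} (true {TRUE_SLOPE})")
```

The estimates land near the true values but not exactly on them, because the error term adds random variation that the line cannot explain.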

In [None]:
#Question 4: Provide a real-world example where simple linear regression can be applied.
'''
Real-World Example of Simple Linear Regression (SLR)
Example: Predicting House Prices Based on Size
Scenario:
A real estate company wants to predict the price of a house (Y) based on its size in square feet (X).
They collect data from several houses, noting both the house size and its selling price.
| **House Size (sq. ft.) (X)** | **House Price (₹ in lakhs) (Y)** |
| ---------------------------- | -------------------------------- |
| 1000                         | 50                               |
| 1500                         | 65                               |
| 2000                         | 80                               |
| 2500                         | 95                               |
| 3000                         | 110                              |

Model Formulation:
We assume a linear relationship between house size and price:
                             Y = β₀ + β₁X + ε
Fitting the regression to the data above gives:
                             Ŷ = 20 + 0.03X

Interpretation:
* β₀ (Intercept) = 20:
When house size = 0 sq. ft., the predicted price is ₹20 lakh. A zero-size house lies outside the observed data, so this value has limited practical meaning, but it anchors the line.
* β₁ (Slope) = 0.03:
For every additional 1 sq. ft., the predicted price increases by ₹0.03 lakh (i.e., ₹3,000).

Using the Model for Prediction:
If a house is 2,500 sq. ft., then:
                            Ŷ = 20 + 0.03(2500) = 20 + 75 = 95
Predicted Price: ₹95 lakh

Purpose of Using SLR Here:
* To understand how house size affects price.
* To predict prices of new houses based on size.
* To support decision-making for buyers, sellers, and real estate agents.'''
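The fitted line for this example can be reproduced from the five rows of the table. The sketch below applies the standard least-squares formulas in plain Python; the function name `predict_price` is a hypothetical helper introduced for illustration.

```python
sizes = [1000, 1500, 2000, 2500, 3000]   # X: house size in sq. ft.
prices = [50, 65, 80, 95, 110]           # Y: price in ₹ lakh

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(prices) / n

# Slope = covariation of X and Y divided by variation of X.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices))
s_xx = sum((x - x_bar) ** 2 for x in sizes)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

def predict_price(size_sqft):
    """Predicted price in ₹ lakh for a house of the given size."""
    return intercept + slope * size_sqft

print("Slope:", slope)
print("Intercept:", intercept)
print("Predicted price for 2,500 sq. ft.:", predict_price(2500))
```

Because the table rows lie exactly on a straight line, the fit has zero residuals. Real price data would scatter around the line.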

In [None]:
#Question 5: What is the method of least squares in linear regression?
'''
The method of least squares is a mathematical approach used in linear regression to find the best-fitting line through a set of data points by 
minimizing the sum of the squared errors (residuals).
In simple terms, it finds the line that makes the predicted values as close as possible to the actual observed values.

Mathematical Explanation:

The simple linear regression model is:
                                   Y = β₀ + β₁X + ε
The predicted value of Y is:
                                   Ŷ = β₀ + β₁X

The residual (error) for each observation is:
                                  eᵢ = Yᵢ − Ŷᵢ = Yᵢ − (β₀ + β₁Xᵢ)

The method of least squares chooses β₀ and β₁ such that the sum of squared residuals (SSR) is minimized:
                               Minimize S = Σᵢ₌₁ⁿ (Yᵢ − β₀ − β₁Xᵢ)²

Purpose of the Least Squares Method:
1. To find the line of best fit:
Ensures the fitted regression line is as close as possible to the observed data points, measured by squared vertical distance.

2. To minimize prediction errors:
Minimizes the total squared difference between actual and predicted values.

3. To obtain unbiased estimates:
Provides unbiased estimates of the regression coefficients, with the smallest variance among linear unbiased estimators, under the standard assumptions of linear regression.'''
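For a single predictor, the minimization has a closed-form solution: the slope is Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² and the intercept is Ȳ − slope · X̄. The sketch below applies these formulas to the five data points from Question 9 and then checks that nearby lines give a larger sum of squares. The function names are hypothetical helpers, and the check over a few nearby candidates is a sanity test, not a proof.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """S = sum of (y - intercept - slope * x)^2 over all observations."""
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Closed-form minimizer of S for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

# Any other line should give a larger S; try a few nearby candidates.
nearby = [
    sum_squared_residuals(xs, ys, b0 + d0, b1 + d1)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.1, 0.0, 0.1)
    if (d0, d1) != (0.0, 0.0)
]

print("intercept:", round(b0, 4), "slope:", round(b1, 4))
print("S at the least-squares line:", round(best, 4))
print("smallest S among nearby lines:", round(min(nearby), 4))
```

The slope and intercept agree with the scikit-learn output shown in Question 9 (0.6 and 2.2).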

In [None]:
#Question 6: What is Logistic Regression? How does it differ from Linear Regression?
'''
Logistic Regression is a statistical method used to model the relationship between a categorical dependent variable (usually binary: 0 or 1) 
and one or more independent variables (which can be continuous or categorical).
It is used when the outcome you’re predicting is qualitative, such as yes/no, pass/fail, spam/not spam, etc.

Mathematical Form:
In logistic regression, instead of predicting Y directly, we predict the probability that Y=1.

The logistic regression model is based on the sigmoid (logistic) function:

                                     P(Y = 1 | X) = 1 / (1 + e^(−(β₀ + β₁X)))

Where:
P(Y = 1 | X) = Probability that the dependent variable Y equals 1 given X
β₀ = Intercept
β₁ = Coefficient (slope) of predictor X, measured on the log-odds scale
e = Euler's number (≈ 2.718)

Difference Between Linear and Logistic Regression
| **Aspect**     | **Linear Regression**                                    | **Logistic Regression**                                                           |
| -------------- | -------------------------------------------------------- | --------------------------------------------------------------------------------- |
| **Purpose**    | Predicts a **continuous** outcome (e.g., income, height) | Predicts a **categorical** outcome (e.g., yes/no, 0/1)                            |
| **Equation**   | Y = β₀ + β₁X + ε                                         | P(Y = 1) = 1 / (1 + e^(−(β₀ + β₁X)))                                              |
| **Output**     | Produces **numerical values** (from −∞ to +∞)            | Produces **probabilities** (between 0 and 1)                                      |
| **Error Term** | Assumes errors are normally distributed                  | Assumes a **binomial distribution** for outcomes                                  |
| **Linearity**  | Models a **linear relationship** between X and Y         | Models a **non-linear (sigmoid)** relationship between X and the probability of Y |
| **Use Case**   | Predicting house prices, salaries, etc.                  | Predicting if an email is spam, a patient has a disease, etc.                     |
'''
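The contrast between the two outputs can be seen by passing the same linear predictor through the sigmoid. In the sketch below the coefficients are hypothetical values picked only for illustration (they are not fitted to any data), and the 0.5 decision threshold is a common default rather than a requirement.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
BETA_0 = -4.0
BETA_1 = 1.0

def linear_output(x):
    """Linear predictor β₀ + β₁x: unbounded, as in linear regression."""
    return BETA_0 + BETA_1 * x

def probability(x):
    """Logistic regression output P(Y = 1 | X = x)."""
    return sigmoid(linear_output(x))

def classify(x, threshold=0.5):
    """Turn the probability into a 0/1 label using a decision threshold."""
    return 1 if probability(x) >= threshold else 0

for x in (0, 2, 4, 6, 8):
    print(x, linear_output(x), round(probability(x), 3), classify(x))
```

The linear predictor grows without bound as X increases, while the probability stays strictly between 0 and 1 and crosses 0.5 where the linear predictor is zero.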

In [None]:
#Question 7: Name and briefly describe three common evaluation metrics for regression models?
'''
Common Evaluation Metrics for Regression Models:
When we build a regression model (like linear regression), we need to measure how well it predicts the actual outcomes.
Here are three commonly used evaluation metrics:
1. Mean Absolute Error (MAE):
                      MAE = (1/n) Σᵢ₌₁ⁿ |Yᵢ − Ŷᵢ|
Description:
* MAE measures the average absolute difference between the actual values (Yᵢ) and the predicted values (Ŷᵢ).
* It tells us, on average, how much the predictions deviate from the true values.

Key Points:
* Easy to understand and interpret.
* Treats all errors equally (no squaring).
* Lower MAE → better model performance.

Example:
If MAE = 3, predictions are off by 3 units on average.

2. Mean Squared Error (MSE):
                      MSE = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²
Description:
* MSE measures the average of the squared errors.
* Squaring the errors penalizes larger deviations more strongly than smaller ones.

Key Points:
* Sensitive to large errors (outliers).
* Commonly used in optimization because it is smooth and differentiable.
* Lower MSE → better model performance.

Example:
If MSE = 9, the average squared error between predictions and actual values is 9.

3. R-squared (Coefficient of Determination):
                      R² = 1 − SSres / SStot
Where:
* SSres = Σ (Yᵢ − Ŷᵢ)² → Sum of squared residuals
* SStot = Σ (Yᵢ − Ȳ)² → Total sum of squares

Description:
* R² represents the proportion of variance in the dependent variable that is explained by the independent variable(s).

Key Points:
* For a least-squares fit with an intercept, evaluated on the data it was fitted to, R² ranges from 0 to 1.
   * R² = 1: Perfect prediction
   * R² = 0: Model explains none of the variability of Y (no better than always predicting the mean)
* On new data, or for a model fitted without an intercept, R² can be negative.
* Gives a sense of overall fit of the model.

Example:
If R² = 0.85, it means the model explains 85% of the variation in the dependent variable.'''
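The three metrics can be computed by hand from their definitions. The sketch below uses the five data points and the fitted line (intercept 2.2, slope 0.6) from Question 9. The functions `mae`, `mse`, and `r_squared` are plain-Python stand-ins written for this illustration, not scikit-learn functions.

```python
def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average of (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSres / SStot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
actual = [2, 4, 5, 4, 5]
predicted = [2.2 + 0.6 * x for x in xs]   # fitted line from Question 9

mae_value = mae(actual, predicted)
mse_value = mse(actual, predicted)
r2_value = r_squared(actual, predicted)

print("MAE:", round(mae_value, 3))
print("MSE:", round(mse_value, 3))
print("R²: ", round(r2_value, 3))
```

MAE is in the same units as Y, while MSE is in squared units, which is why the two numbers are not directly comparable.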

In [None]:
#Question 8: What is the purpose of the R-squared metric in regression analysis?
'''
R-squared (R²), also called the coefficient of determination, is a statistical measure that indicates how well the independent variable(s) 
explain the variation in the dependent variable in a regression model.
Mathematically, it is expressed as:
                             R² = 1 − SSres / SStot
Where:
* SSres = Σ (Yᵢ − Ŷᵢ)² → Residual Sum of Squares (unexplained variation)
* SStot = Σ (Yᵢ − Ȳ)² → Total Sum of Squares (total variation in Y)

Purpose and Interpretation:
1. Measures Goodness of Fit:
* R² shows how well the regression model fits the observed data.
* Higher R² values mean the model explains a larger portion of the variance in the dependent variable.
Example: R² = 0.80 → The model explains 80% of the variation in Y.

2. Indicates Predictive Power:
* A higher R² means the fitted values are, on the whole, closer to the actual values in the data used to compute it.
* It helps assess how effectively the independent variable(s) predict the dependent variable, although a high R² on training data does not guarantee accurate predictions on new data.

3. Helps Compare Models:
* When comparing regression models for the same dependent variable on the same dataset, the one with the higher R² fits the data more closely. This does not make it the most appropriate model, because adding predictors to a least-squares model never lowers R².
Example:
Model A: R² = 0.60 → explains 60% of the variance
Model B: R² = 0.85 → explains 85% of the variance → closer fit

4. Shows Explained vs. Unexplained Variation:
* R² is based on splitting the total variation in Y into:
   * Explained variation: due to the regression model
   * Unexplained variation: due to random errors (residuals)
Total Variation = Explained Variation + Unexplained Variation'''
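The split of total variation can be checked numerically. The sketch below reuses the data and fitted line from Question 9. It also evaluates a deliberately poor line (an arbitrary choice for illustration) to show that the identity holds for the least-squares fit specifically, and that 1 − SSres / SStot can fall below zero for other lines.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
fitted = [2.2 + 0.6 * x for x in xs]      # least-squares line from Question 9

y_bar = sum(ys) / len(ys)
ss_tot = sum((y - y_bar) ** 2 for y in ys)                # total variation
ss_reg = sum((f - y_bar) ** 2 for f in fitted)            # explained variation
ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # unexplained variation

r2 = 1 - ss_res / ss_tot

# A deliberately poor model: a line sloping the wrong way (illustrative only).
poor = [5.0 - 0.6 * x for x in xs]
ss_res_poor = sum((y - p) ** 2 for y, p in zip(ys, poor))
r2_poor = 1 - ss_res_poor / ss_tot

print("SStot:", round(ss_tot, 3))
print("SSreg + SSres:", round(ss_reg + ss_res, 3))
print("R² (least-squares line):", round(r2, 3))
print("R² (poor line):", round(r2_poor, 3))
```

For the least-squares line, the explained share SSreg / SStot and 1 − SSres / SStot are the same number.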


In [1]:
#Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

from sklearn.linear_model import LinearRegression
import numpy as np

# (X = independent variable, Y = dependent variable)
# Reshape X into a 2D array because scikit-learn expects that format
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
Y = np.array([2, 4, 5, 4, 5])

# Create and fit the Linear Regression model
model = LinearRegression()
model.fit(X, Y)

# Print the slope (coefficient) and intercept
print("Slope (β₁):", model.coef_[0])
print("Intercept (β₀):", model.intercept_)


Slope (β₁): 0.6
Intercept (β₀): 2.2


In [None]:
#Question 10: How do you interpret the coefficients in a simple linear regression model?
'''
A Simple Linear Regression model is expressed as:
                                     Y = β₀ + β₁X + ε
Where:
Y: Dependent (response) variable
X: Independent (predictor) variable
β₀: Intercept
β₁: Slope (coefficient)
ε: Random error term

1. Intercept (β₀)
* The intercept represents the predicted value of Y when X = 0.
* It is the point where the regression line crosses the Y-axis.
* Interpretation:
   When the independent variable X is 0, the expected value of Y is β₀. This is meaningful only if X = 0 lies within, or close to, the range of the observed data.
Example:
If the regression equation is:
                             Ŷ = 30 + 4X
Then when X = 0, Ŷ = 30.
→ The intercept (30) is the baseline predicted value of Y when X equals 0.

2. Slope (β₁)
* The slope represents the expected change in Y for a one-unit increase in X.
* Its sign gives the direction of the relationship and its magnitude gives the rate of change. The strength of the relationship is measured separately, for example by R².

Interpretation:
* For every 1-unit increase in X, the predicted value of Y changes by β₁ units (increases if positive, decreases if negative).
Example:
In the same equation:
                     Ŷ = 30 + 4X
* β₁ = 4: For each additional unit increase in X, the predicted Y increases by 4 units.
* If β₁ were negative (e.g., −4), then the predicted Y would decrease by 4 units for each 1-unit increase in X.

3. Summary Table:
| **Coefficient**         | **Meaning**                              | **Interpretation Example**                             |
| ----------------------- | ---------------------------------------- | ------------------------------------------------------ |
| β₀ (Intercept)          | Value of Y when X = 0                    | If β₀ = 30 → baseline value of Y = 30                  |
| β₁ (Slope)              | Change in Y per unit change in X         | If β₁ = 4 → Y increases by 4 for every 1 increase in X |'''
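Both interpretations can be read off the example equation by evaluating it at a few points. The sketch below is a minimal illustration; `predict` is a hypothetical helper, and the X values 10, 11, and 15 are arbitrary.

```python
INTERCEPT = 30   # β₀ from the example equation
SLOPE = 4        # β₁ from the example equation

def predict(x):
    """Predicted Y from the example line Ŷ = 30 + 4X."""
    return INTERCEPT + SLOPE * x

baseline = predict(0)                          # prediction at X = 0 is the intercept
step_change = predict(11) - predict(10)        # one-unit increase in X
five_unit_change = predict(15) - predict(10)   # five-unit increase in X

print("Prediction at X = 0:", baseline)
print("Change for +1 in X:", step_change)
print("Change for +5 in X:", five_unit_change)
```

The change in the prediction depends only on how far X moves, not on where it starts, which is what a constant slope means.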
