**Theoretical**

In [None]:
### Q.1)   What does R-squared represent in a regression model?

ans) R-squared (R²) represents the proportion of variance in the dependent variable that is explained by the independent variable(s) in a regression model.

Mathematical definition:
R² = 1 - (SSR/SST)
where:
- SSR = Sum of Squared Residuals (unexplained variation)
- SST = Total Sum of Squares (total variation)

In simpler terms:
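- For an OLS model with an intercept, R² on the training data ranges from 0 to 1 (or 0% to 100%); on held-out data it can be negative when the model predicts worse than the mean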
- An R² of 0.7 means 70% of the variation in your dependent variable is explained by your model
- The remaining 30% is due to other factors not included in the model

Key points to understand:
1. A higher R² doesn't necessarily mean a better model
2. Adding more variables almost always increases R² (which is why adjusted R² is often preferred)
3. R² doesn't indicate whether:
   - The coefficients make sense
   - The variables have a causal relationship
   - The model meets regression assumptions
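
As a small worked example of the definition above, the sketch below computes R² directly from SSR and SST. The observed and predicted values are made-up illustrative numbers, not results from any dataset in this notebook.

```python
import numpy as np

# Hypothetical observed values and model predictions (illustrative only)
y_actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_predicted = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

# SSR: variation the model leaves unexplained
ssr = np.sum((y_actual - y_predicted) ** 2)
# SST: total variation around the mean of the observed values
sst = np.sum((y_actual - y_actual.mean()) ** 2)

r_squared = 1 - ssr / sst
print(f"SSR = {ssr:.2f}, SST = {sst:.2f}, R² = {r_squared:.4f}")
```

Here the residuals are small relative to the spread of `y_actual`, so R² comes out close to 1.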



In [None]:
### Q.2) What are the assumptions of linear regression?

ans) 1. Linearity: The relationship between independent and dependent variables is linear.
2. Independence: Observations are independent of each other.
3. Homoscedasticity: Constant variance of residuals.
4. Normality: Residuals are normally distributed.
5. No multicollinearity among predictors.

In [None]:
### Q.3) What is the difference between R-squared and Adjusted R-squared?

ans)  1. R-squared: Indicates the proportion of variance explained by the model but doesn't account for the number of predictors.
2. Adjusted R-squared: Adjusts R-squared for the number of predictors, penalizing for unnecessary complexity.

In [None]:
### Q.4) Why do we use Mean Squared Error (MSE)?

ans) MSE is used as a loss/error function in statistics and machine learning because of several mathematical and practical advantages:

Mathematical Definition:
MSE = (1/n) Σ(yᵢ - ŷᵢ)²
where:
- yᵢ is the actual value
- ŷᵢ is the predicted value
- n is the number of observations

Key reasons for using MSE:

1. Penalizes Large Errors
- Squaring makes larger errors disproportionately more significant than smaller ones
- Particularly useful when large errors are especially undesirable
- Example: Under MSE, one $1,000 error (1,000,000 when squared) counts twice as much as two $500 errors (500,000 in total when squared)

2. Mathematical Properties
- Always non-negative (due to squaring)
- Differentiable, making it suitable for optimization algorithms
- Convex in the predictions; for linear regression it is also convex in the coefficients, so any local minimum is a global minimum

3. Statistical Connection
- Direct relationship to variance and standard deviation
- When using least squares regression, minimizing MSE gives the same result as maximum likelihood estimation under normal distribution assumptions

4. Practical Advantages
- Its square root (RMSE) is in the same units as the dependent variable
- Easier to calculate derivatives (important for gradient descent)
- Smooth at zero error, unlike Mean Absolute Error (MAE), which often makes gradient-based optimization more stable

However, MSE has limitations:
- Expressed in squared units of the dependent variable, not the original units
- Very sensitive to outliers
- Can be harder to interpret than MAE
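
To illustrate the point about differentiability, here is a minimal gradient descent sketch that fits a line by minimizing MSE. The synthetic data, the learning rate, and the iteration count are assumptions chosen for illustration; `np.polyfit` is used only as a least-squares reference to compare against.

```python
import numpy as np

# Synthetic data: true slope 2, true intercept 1, small Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=200)

w, b = 0.0, 0.0          # start from zero slope and intercept
learning_rate = 0.5      # assumed step size, small enough to converge here

for _ in range(2000):
    error = (w * x + b) - y
    # Gradients of MSE = mean(error²) with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Closed-form least-squares fit for comparison (returns slope, then intercept)
w_ols, b_ols = np.polyfit(x, y, deg=1)
print("Gradient descent:", w, b)
print("Least squares:   ", w_ols, b_ols)
```

Because MSE is smooth and convex for a linear model, the iterative solution should agree with the closed-form least-squares fit.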


In [None]:
### Q.5) What does an Adjusted R-squared value of 0.85 indicate?
ans)  85% of the variance in the dependent variable is explained by the independent variables, adjusted for the number of predictors in the model.

In [None]:
### Q.6) How do we check for normality of residuals in linear regression?
ans) There are several ways to check the normality of residuals in linear regression:

1. Visual Methods:
   - Q-Q (Quantile-Quantile) Plot
     * Plots theoretical vs. actual quantiles
     * Straight line indicates normal distribution
     * Deviations show non-normality

   - Histogram of residuals
     * Should approximate a bell curve
     * Look for symmetry and shape

2. Statistical Tests:
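   - Shapiro-Wilk Test
     * Null hypothesis: data is normally distributed
     * Well suited to small and moderate samples; with very large samples it flags even trivial deviations
     * Generally among the most powerful tests for normality

   - Kolmogorov-Smirnov Test
     * Tests against any fully specified continuous distribution (use the Lilliefors variant when the mean and variance are estimated from the residuals)
     * Less powerful than Shapiro-Wilk for normality
     * Better for larger samples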

   - Anderson-Darling Test
     * Similar to K-S test but gives more weight to tails
     * More sensitive to deviations in distribution tails

3. Numerical Measures:
   - Skewness (should be close to 0)
   - Kurtosis (should be close to 3, i.e. excess kurtosis close to 0; `scipy.stats.kurtosis` reports excess kurtosis by default)

In Python, you can implement these checks using:
```python
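from scipy import stats
import matplotlib.pyplot as plt

# `residuals` is assumed to be a 1-D array of model residuals, e.g. y_test - y_pred

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Histogram
plt.hist(residuals, bins='auto')
plt.show()

# Shapiro-Wilk test (a small p-value is evidence against normality)
stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk statistic:", stat, "p-value:", p_value)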
```


In [None]:
### Q.7) What is multicollinearity, and how does it impact regression?
ans)  Multicollinearity refers to high correlations between independent variables in a regression model.

Definition:
- Occurs when two or more independent variables are highly correlated
- Perfect multicollinearity: exact linear relationship (correlation = 1 or -1)
- Near multicollinearity: strong but not perfect correlation

Detection Methods:
1. Variance Inflation Factor (VIF)
   - VIF > 5 indicates potential problem
   - VIF > 10 indicates serious multicollinearity

2. Correlation Matrix
   - Look for correlations > 0.8 between predictors

Impacts on Regression:
1. Coefficient Estimates
   - Become unstable
   - Standard errors increase
   - Coefficients may have wrong signs
   - Individual effects harder to isolate

2. Statistical Inference
   - Reduced t-statistics
   - Wider confidence intervals
   - May fail to reject null hypothesis when you should

3. Model Interpretation
   - Difficult to determine individual variable importance
   - R² remains unaffected
   - Predictions still valid if correlation pattern remains same

Solutions:
1. Remove one of the correlated variables
2. Combine the correlated variables into a single feature (for example, an average or a ratio)
3. Use regularization (Ridge, Lasso)
4. Principal Component Analysis (PCA)
5. Collect more data if possible
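
The VIF thresholds above can be made concrete with a small sketch that computes VIF from its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors plus an intercept. The helper name `vif` and the synthetic columns are illustrative assumptions; in practice a library routine such as the one in statsmodels would normally be used.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1 - residuals.var() / target.var()
        factors.append(1 / (1 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of a
c = rng.normal(size=500)                  # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], vifs.round(2))))
```

The two nearly duplicate columns should show VIFs far above 10, while the unrelated column stays close to 1.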


In [None]:
### Q.8) What is Mean Absolute Error (MAE)?

ans) MAE measures the average absolute difference between predicted and observed values, providing an easily interpretable error metric.
Formula:
MAE = (1/n) Σ|yᵢ - ŷᵢ|
where yᵢ is the actual value, ŷᵢ is the predicted value, and n is the number of observations.



In [None]:
### Q.9)What are the benefits of using an ML pipeline?

ans)   1. Automates repetitive tasks.
2. Ensures consistency in data preprocessing.
3. Simplifies the deployment process.
4. Facilitates hyperparameter tuning.

In [None]:
### Q.10) Why is RMSE considered more interpretable than MSE?
ans)   RMSE is in the same units as the dependent variable, making it easier to interpret compared to MSE, which is squared.

In [None]:
### Q.11) What is pickling in Python, and how is it useful in ML?
ans)   Pickling serializes Python objects to save them to disk. In ML, it is used to save trained models for reuse.



In [None]:
### Q.12)  What does a high R-squared value mean?
ans)    A high R-squared indicates that a large proportion of the variance in the dependent variable is explained by the independent variables.

In [None]:
### Q.13)  What happens if linear regression assumptions are violated?
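ans)  1. Predictions may become unreliable.
2. Coefficients might be biased (for example, when the true relationship is nonlinear).
3. Hypothesis tests and confidence intervals may be invalid because the standard errors are wrong (for example, under heteroscedasticity or correlated observations).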

In [None]:
### Q.14) How can we address multicollinearity in regression?
ans) 1. Use techniques like Principal Component Analysis (PCA).
2. Drop highly correlated predictors.
3. Regularize the model using methods like Ridge or Lasso regression.

In [None]:
### Q.15) Why do we use pipelines in machine learning?
ans) 1. Streamlines the workflow by combining preprocessing, feature selection, and modeling steps.
2. Reduces the chances of data leakage, because preprocessing steps are fitted only on the training data.


In [None]:
### Q.16) How is Adjusted R-squared calculated?

ans)  The Adjusted R-squared adjusts the R-squared value to account for the number of predictors in a regression model. It penalizes the addition of irrelevant predictors that do not improve the model's performance.
The formula for Adjusted R-squared is:
                                  adjusted_r2 = 1 - ((1 - r2) * (n - 1)) / (n - p - 1)
Where:

r2 is the R-squared value.
n is the number of observations.
p is the number of predictors.
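
The formula can be wrapped in a small helper, sketched below. The function name `adjusted_r2` and the example inputs (n = 30 with two and then three predictors) are illustrative assumptions.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted on n observations."""
    if n - p - 1 <= 0:
        raise ValueError("Need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical case: a third predictor raises R² only slightly
before = adjusted_r2(0.700, n=30, p=2)
after = adjusted_r2(0.702, n=30, p=3)
print(f"Adjusted R² with 2 predictors: {before:.4f}")
print(f"Adjusted R² with 3 predictors: {after:.4f}")
```

In this hypothetical case the tiny gain in R² does not offset the penalty for the extra predictor, so the adjusted value falls.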

In [None]:
### Q.17)   Why is MSE sensitive to outliers?
ans) MSE is sensitive to outliers because its formula squares each error:

MSE = (1/n) Σ(yᵢ - ŷᵢ)²

Consider this example:
- Normal error: (10 - 8)² = 2² = 4
- Outlier error: (100 - 8)² = 92² = 8,464

The squaring operation causes:
1. Large errors grow quadratically (an error 10 times larger contributes 100 times more)
2. A single large outlier can dominate the entire MSE calculation
3. Small and moderate errors become relatively insignificant

Let's see a concrete comparison:
- Dataset A (no outliers): errors of 2, 2, 2, 2, 2
  * MSE = (4 + 4 + 4 + 4 + 4)/5 = 4

- Dataset B (one outlier): errors of 2, 2, 2, 2, 92
  * MSE = (4 + 4 + 4 + 4 + 8,464)/5 = 1,696

One outlier made the MSE 424 times larger (1,696 / 4).

This is why alternatives are often used for datasets prone to outliers:
- MAE (Mean Absolute Error)
- Huber Loss
- RMSE computed after trimming extreme residuals
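
The Dataset A and Dataset B comparison can be reproduced with a few lines of NumPy. This sketch also computes MAE on the same errors to show how much less it reacts to the outlier; the helper names `mse` and `mae` are illustrative.

```python
import numpy as np

errors_a = np.array([2, 2, 2, 2, 2])     # Dataset A: no outliers
errors_b = np.array([2, 2, 2, 2, 92])    # Dataset B: one outlier

def mse(errors):
    return np.mean(errors ** 2)

def mae(errors):
    return np.mean(np.abs(errors))

mse_ratio = mse(errors_b) / mse(errors_a)
mae_ratio = mae(errors_b) / mae(errors_a)

print("MSE:", mse(errors_a), "->", mse(errors_b), f"(x{mse_ratio:.0f})")
print("MAE:", mae(errors_a), "->", mae(errors_b), f"(x{mae_ratio:.0f})")
```

The same outlier multiplies MSE by 424 but MAE by only 10.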


In [None]:
### Q.18) What is the role of homoscedasticity in linear regression?
ans) Homoscedasticity is a key assumption in linear regression that refers to the constant variance of residuals.

Key Aspects:
1. Definition
- The variance of residuals should be constant across all values of predicted/independent variables
- Mathematically: Var(εᵢ|Xᵢ) = σ² (constant) for all i

2. Importance
- Together with the other Gauss-Markov assumptions, ensures OLS estimators are BLUE (Best Linear Unbiased Estimators)
- Makes standard errors reliable
- Validates inference tests (t-tests, F-tests)
- Enables accurate confidence intervals

3. When Violated (Heteroscedasticity)
- Estimators remain unbiased but lose efficiency
- Standard errors become incorrect
- Hypothesis tests become unreliable
- Confidence intervals are inaccurate

4. Detection Methods
- Visual: Plotting residuals vs. fitted values
- Statistical tests:
   * Breusch-Pagan test
   * White test
   * Goldfeld-Quandt test

5. Solutions if Violated
- Transform variables (often log transformation)
- Use Weighted Least Squares (WLS)
- Apply robust standard errors
- Consider different modeling approaches
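
As a rough illustration of the Breusch-Pagan idea, the sketch below computes the LM statistic n·R² from an auxiliary regression of the squared OLS residuals on the predictor (Koenker's studentized form). The function name, the synthetic data, and the use of a single predictor are assumptions for illustration; a library implementation such as the one in statsmodels would normally be preferred.

```python
import numpy as np

def breusch_pagan_lm(x, y):
    """LM statistic: n * R² from regressing squared OLS residuals on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    squared_resid = (y - X @ beta) ** 2
    # Auxiliary regression: do the squared residuals depend on x?
    gamma, *_ = np.linalg.lstsq(X, squared_resid, rcond=None)
    aux_resid = squared_resid - X @ gamma
    r2_aux = 1 - aux_resid.var() / squared_resid.var()
    return len(y) * r2_aux

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)
y_const = 2 * x + rng.normal(scale=2.0, size=500)        # constant error variance
y_growing = 2 * x + rng.normal(scale=0.5 * x, size=500)  # error spread grows with x

# 5% critical value of the chi-square distribution with 1 degree of freedom
CHI2_CRIT_1DF_5PCT = 3.841

lm_const = breusch_pagan_lm(x, y_const)
lm_growing = breusch_pagan_lm(x, y_growing)
print("LM (constant variance):", round(lm_const, 2))
print("LM (growing variance): ", round(lm_growing, 2))
print("Reject homoscedasticity for growing variance:", lm_growing > CHI2_CRIT_1DF_5PCT)
```

A statistic above the critical value is evidence of heteroscedasticity; the dataset whose noise grows with `x` should exceed it by a wide margin.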




In [None]:
### Q.19) What is Root Mean Squared Error (RMSE)?

ans)  RMSE is the square root of MSE. It measures the typical size of the prediction errors and equals their standard deviation when the errors average to zero.
Formula:
$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2}
$$
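
A short numeric sketch of the formula, using made-up prices as an assumption, shows MSE, RMSE, and MAE side by side.

```python
import numpy as np

# Hypothetical actual and predicted prices in dollars
y_actual = np.array([200.0, 250.0, 300.0, 350.0])
y_predicted = np.array([210.0, 240.0, 320.0, 330.0])

errors = y_actual - y_predicted
mse = np.mean(errors ** 2)       # in squared dollars
rmse = np.sqrt(mse)              # back in dollars
mae = np.mean(np.abs(errors))    # in dollars

print(f"MSE  = {mse:.1f} (squared dollars)")
print(f"RMSE = {rmse:.2f} (dollars)")
print(f"MAE  = {mae:.2f} (dollars)")
```

RMSE is at least as large as MAE on the same errors because squaring gives the larger errors more weight.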


In [None]:
### Q.20) Why is pickling considered risky?

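ans)  1. Loading a pickle file can execute arbitrary code, so files from untrusted sources are a code-injection risk.
2. Pickles are not guaranteed to load across different Python or library versions (for example, a model saved with one scikit-learn version may not load in another).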
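
The code-execution risk can be shown with a deliberately harmless sketch. The class name `Payload` is hypothetical; its `__reduce__` method tells pickle which callable to run at load time. Here that callable is the built-in `len`, but an attacker could name any importable function.

```python
import pickle

class Payload:
    def __reduce__(self):
        # Instructs pickle: "to rebuild this object, call len('attacker-chosen')"
        return (len, ("attacker-chosen",))

blob = pickle.dumps(Payload())

# Loading does not return a Payload; it returns whatever the callable produced
result = pickle.loads(blob)
print(type(result), result)
```

The loaded value is the return value of the function call, which shows that unpickling runs code chosen by whoever created the file.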

In [None]:
### Q.21) What alternatives exist to pickling for saving ML models?

ans)  1. Joblib for model serialization (efficient for large NumPy arrays, but pickle-based, so the same security caveat applies).
2. ONNX or PMML for platform-independent formats.
3. Saving model weights using frameworks like TensorFlow or PyTorch.

In [None]:
### Q.22)  What is heteroscedasticity, and why is it a problem?
ans)  Heteroscedasticity refers to a statistical condition where:
1. The variance of the error terms in a regression model is not constant across observations
2. The error terms' variability differs across values of an independent variable

Mathematical definition:
- In a regression model Y = β₀ + β₁X + ε
- Heteroscedasticity exists when Var(ε|X) ≠ σ² (variance is not constant)

Problems it causes:
1. OLS estimators remain unbiased but are no longer BLUE (Best Linear Unbiased Estimators)
2. Standard errors are incorrect, invalidating:
   - t-tests
   - F-tests
   - Confidence intervals
3. Statistical inference becomes unreliable

Statistical tests to detect it:
1. Breusch-Pagan test
2. White test
3. Goldfeld-Quandt test

Formal solutions:
1. Transform dependent variable (usually log transformation)
2. Use Weighted Least Squares (WLS)
3. Use Heteroscedasticity-consistent standard errors (White's robust standard errors)
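
Weighted Least Squares can be sketched with plain NumPy by solving the weighted normal equations (XᵀWX)β = XᵀWy. The synthetic data and the choice of weights 1/x² are assumptions: they presume the error standard deviation is proportional to x, which is how the data below is generated.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=300)
# True intercept 1 and slope 2; the noise standard deviation grows with x
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)

X = np.column_stack([np.ones_like(x), x])
weights = 1.0 / x**2      # inverse of the assumed error variance (up to a constant)

# Weighted normal equations: (X' W X) beta = X' W y
XtW = X.T * weights
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

# Ordinary least squares for comparison
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("WLS intercept, slope:", beta_wls)
print("OLS intercept, slope:", beta_ols)
```

Both estimators are unbiased here; WLS gives less weight to the noisy high-x observations, which is what restores efficiency.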

In [None]:
### Q.23)  How does adding irrelevant predictors affect R-squared and Adjusted R-squared?

ans) Here's how adding irrelevant predictors affects both R² and Adjusted R²:

Impact on R-squared:
1. R² always increases or stays the same when adding predictors, even irrelevant ones
2. Mathematical reason: OLS can always set the new coefficient to zero, so the residual sum of squares cannot increase
3. This increase occurs even with random noise variables
4. The increase is typically small for truly irrelevant predictors

Impact on Adjusted R-squared:
1. Adjusts for the number of predictors using the formula:
   Adj R² = 1 - [(1 - R²)(n-1)/(n-k-1)]
   where:
   - n = sample size
   - k = number of predictors

2. Can decrease when adding irrelevant predictors because:
   - Penalizes for additional variables
   - Only increases if the absolute value of the new variable's t-statistic is greater than 1
   - Helps prevent overfitting

Example (illustrative numbers, n = 30 observations):
Original model (2 relevant predictors):
- R² = 0.700
- Adj R² = 0.678

After adding irrelevant predictor:
- R² = 0.702 (increases slightly)
- Adj R² = 0.668 (decreases)
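
The behaviour of R² can also be checked by simulation. The sketch below fits OLS with NumPy on synthetic data, then adds a pure-noise column and refits. The helper name `r2_and_adjusted`, the sample size, and the coefficients are illustrative assumptions; whether Adjusted R² falls in any single run depends on the random draw, so the code only prints it.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Fit OLS with an intercept and return (R², Adjusted R²)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    ssr = np.sum((y - design @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ssr / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(0)
n = 50
X_relevant = rng.normal(size=(n, 2))
y = 3 * X_relevant[:, 0] - 2 * X_relevant[:, 1] + rng.normal(size=n)
noise_column = rng.normal(size=(n, 1))    # unrelated to y by construction

r2_base, adj_base = r2_and_adjusted(X_relevant, y)
r2_more, adj_more = r2_and_adjusted(np.hstack([X_relevant, noise_column]), y)

print(f"2 predictors:      R² = {r2_base:.4f}, Adj R² = {adj_base:.4f}")
print(f"+ noise predictor: R² = {r2_more:.4f}, Adj R² = {adj_more:.4f}")
```

R² for the larger model is never below R² for the smaller one, while Adjusted R² is free to move in either direction.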



**Practical**

In [None]:
### Q.1) Write a Python script that calculates the Mean Squared Error (MSE) and Mean Absolute Error (MAE) for a multiple linear regression model using Seaborn's diamonds dataset.

In [None]:
ans)   import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds').dropna()

# Select features and target
# price is the target, so it must not appear among the features (that would leak the answer)
X = diamonds[['carat', 'depth', 'table']]
y = diamonds['price']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Calculate MSE and MAE
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print("MSE:", mse)
print("MAE:", mae)


In [None]:
### Q.2) Write a Python script to calculate and print Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) for a linear regression model.

In [None]:
ans)  import numpy as np

# RMSE Calculation
rmse = np.sqrt(mse)

print("MSE:", mse)
print("MAE:", mae)
print("RMSE:", rmse)


In [None]:
### Q.3) Write a Python script to check if the assumptions of linear regression are met. Use a scatter plot to check linearity, residuals plot for homoscedasticity, and correlation matrix for multicollinearity.

In [None]:
ans)  import matplotlib.pyplot as plt
import seaborn as sns

# Linearity Check
sns.scatterplot(x=y_test, y=y_pred)
plt.title("Linearity Check")
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.show()

# Residuals for Homoscedasticity
residuals = y_test - y_pred
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title("Residuals for Homoscedasticity")
plt.show()

# Multicollinearity Check (Correlation Matrix)
corr_matrix = X.corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()


In [None]:
### Q.4)  Create a machine learning pipeline that standardizes the features, fits a linear regression model, and evaluates the model's R-squared score.

In [None]:
ans)  from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# Create the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Evaluate the model
y_pred_pipeline = pipeline.predict(X_test)
r2 = r2_score(y_test, y_pred_pipeline)
print("R-squared Score:", r2)


In [None]:
### Q.5)  Implement a simple linear regression model on a dataset and print the model's coefficients, intercept, and R-squared score.

In [None]:
ans)   # Fit a simple linear regression model
simple_model = LinearRegression()
simple_model.fit(X_train[['carat']], y_train)  # Using 'carat' as a single feature
y_pred_simple = simple_model.predict(X_test[['carat']])

# Print coefficients, intercept, and R-squared score
print("Coefficient:", simple_model.coef_[0])
print("Intercept:", simple_model.intercept_)
print("R-squared Score:", r2_score(y_test, y_pred_simple))


In [None]:
### Q.6)  Fit a simple linear regression model to the tips dataset and print the slope and intercept of the regression line.

In [None]:
ans)   # Load tips dataset
tips = sns.load_dataset('tips')

# Select features and target
X_tips = tips[['total_bill']]
y_tips = tips['tip']

# Split the data
X_train_tips, X_test_tips, y_train_tips, y_test_tips = train_test_split(X_tips, y_tips, test_size=0.2, random_state=42)

# Train the model
tips_model = LinearRegression()
tips_model.fit(X_train_tips, y_train_tips)

# Print slope and intercept
print("Slope (Coefficient):", tips_model.coef_[0])
print("Intercept:", tips_model.intercept_)


In [None]:
### Q.7)  Write a Python script that fits a linear regression model to a synthetic dataset with one feature. Use the model to predict new values and plot the data points along with the regression line.

In [None]:
ans)    import numpy as np

# Generate synthetic data
X_synthetic = np.random.rand(100, 1) * 10
y_synthetic = 3 * X_synthetic + np.random.randn(100, 1) * 2

# Train-test split
X_train_syn, X_test_syn, y_train_syn, y_test_syn = train_test_split(X_synthetic, y_synthetic, test_size=0.2, random_state=42)

# Train the model
synthetic_model = LinearRegression()
synthetic_model.fit(X_train_syn, y_train_syn)

# Predictions
y_pred_syn = synthetic_model.predict(X_test_syn)

# Plot the data points and regression line
plt.scatter(X_synthetic, y_synthetic, color='blue', label='Data Points')
plt.plot(X_test_syn, y_pred_syn, color='red', label='Regression Line')
plt.legend()
plt.show()


In [None]:
### Q.8)  Write a Python script that pickles a trained linear regression model and saves it to a file.

In [None]:
ans) import pickle

# Save the model to a file
with open('linear_model.pkl', 'wb') as f:
    pickle.dump(model, f)

print("Model saved to 'linear_model.pkl'")


In [None]:
### Q.9)  Write a Python script that fits a polynomial regression model (degree 2) to a dataset and plots the curve.

In [None]:
ans)  from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate synthetic data
X_poly = np.random.rand(100, 1) * 10
y_poly = 3 * (X_poly ** 2) + 2 * X_poly + np.random.randn(100, 1) * 10

# Polynomial features
poly_features = PolynomialFeatures(degree=2)
X_poly_transformed = poly_features.fit_transform(X_poly)

# Train the model
poly_model = LinearRegression()
poly_model.fit(X_poly_transformed, y_poly)

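# Predictions
y_pred_poly = poly_model.predict(X_poly_transformed)

# Plot (sort by X so the curve is drawn left to right instead of zigzagging)
order = np.argsort(X_poly[:, 0])
plt.scatter(X_poly, y_poly, color='blue', label='Data Points')
plt.plot(X_poly[order], y_pred_poly[order], color='red', label='Polynomial Regression Curve')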
plt.legend()
plt.show()


In [None]:
### Q.10)  Generate synthetic data for simple linear regression (random values for X and y), fit a linear regression model, and print the coefficient and intercept.

In [None]:
ans) # Generate synthetic data
X_simple = np.random.rand(100, 1) * 10
y_simple = 5 * X_simple + np.random.randn(100, 1) * 5

# Train the model
simple_model = LinearRegression()
simple_model.fit(X_simple, y_simple.ravel())  # 1-D target so coef_[0] and intercept_ are scalars

# Print coefficient and intercept
print("Coefficient:", simple_model.coef_[0])
print("Intercept:", simple_model.intercept_)


In [None]:
### Q.11)  Write a Python script that fits a polynomial regression model (degree 3) to a synthetic dataset and plots the curve.

In [None]:
ans) # Polynomial features (degree 3)
poly_features_3 = PolynomialFeatures(degree=3)
X_poly_3 = poly_features_3.fit_transform(X_poly)

# Train the model
poly_model_3 = LinearRegression()
poly_model_3.fit(X_poly_3, y_poly)

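# Predictions
y_pred_poly_3 = poly_model_3.predict(X_poly_3)

# Plot (sort by X so the curve is drawn left to right instead of zigzagging)
order_3 = np.argsort(X_poly[:, 0])
plt.scatter(X_poly, y_poly, color='blue', label='Data Points')
plt.plot(X_poly[order_3], y_pred_poly_3[order_3], color='green', label='Degree 3 Polynomial Regression Curve')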
plt.legend()
plt.show()


In [None]:
### Q.12) Write a Python script that fits a simple linear regression model with two features and prints the coefficients, intercept, and R-squared score.

In [None]:
ans) # Select two features for linear regression
X_two_features = diamonds[['carat', 'depth']]
y_two_features = diamonds['price']

# Train-test split
X_train_two, X_test_two, y_train_two, y_test_two = train_test_split(X_two_features, y_two_features, test_size=0.2, random_state=42)

# Train the model
two_feature_model = LinearRegression()
two_feature_model.fit(X_train_two, y_train_two)

# Predictions
y_pred_two = two_feature_model.predict(X_test_two)

# Print coefficients, intercept, and R-squared score
print("Coefficients:", two_feature_model.coef_)
print("Intercept:", two_feature_model.intercept_)
print("R-squared Score:", r2_score(y_test_two, y_pred_two))


In [None]:
### Q.13)  Write a Python script that generates a synthetic dataset, fits a linear regression model, and calculates MSE, MAE, and RMSE.

In [None]:
ans)   # Generate synthetic data
X_syn = np.random.rand(100, 1) * 10
y_syn = 4 * X_syn + np.random.randn(100, 1) * 5

# Train-test split
X_train_syn, X_test_syn, y_train_syn, y_test_syn = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Train the model
synthetic_model = LinearRegression()
synthetic_model.fit(X_train_syn, y_train_syn)

# Predictions
y_pred_syn = synthetic_model.predict(X_test_syn)

# Calculate MSE, MAE, RMSE
mse_syn = mean_squared_error(y_test_syn, y_pred_syn)
mae_syn = mean_absolute_error(y_test_syn, y_pred_syn)
rmse_syn = np.sqrt(mse_syn)

print("MSE:", mse_syn)
print("MAE:", mae_syn)
print("RMSE:", rmse_syn)


In [None]:
### Q.14) Write a Python script that uses the Variance Inflation Factor (VIF) to check for multicollinearity in a dataset with multiple features.



In [None]:
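ans)  import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature
# An intercept column is added first; without it the VIFs are computed against an uncentered model and come out inflated
X_multi = sm.add_constant(diamonds[['carat', 'depth', 'table', 'x', 'y', 'z']])
vif_data = pd.DataFrame()
vif_data['Feature'] = X_multi.columns
vif_data['VIF'] = [variance_inflation_factor(X_multi.values, i) for i in range(X_multi.shape[1])]

# The 'const' row is not a predictor, so leave it out of the report
print(vif_data[vif_data['Feature'] != 'const'])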


In [None]:
### Q.15) Write a Python script that generates synthetic data for a polynomial relationship (degree 4), fits a polynomial regression model, and plots the regression curve.

In [None]:
ans)  # Generate synthetic data
X_poly4 = np.random.rand(100, 1) * 10
y_poly4 = 2 * (X_poly4 ** 4) - 3 * (X_poly4 ** 3) + 4 * (X_poly4 ** 2) + 5 * X_poly4 + np.random.randn(100, 1) * 10

# Polynomial features (degree 4)
poly_features_4 = PolynomialFeatures(degree=4)
X_poly4_transformed = poly_features_4.fit_transform(X_poly4)

# Train the model
poly_model_4 = LinearRegression()
poly_model_4.fit(X_poly4_transformed, y_poly4)

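# Predictions
y_pred_poly4 = poly_model_4.predict(X_poly4_transformed)

# Plot (sort by X so the curve is drawn left to right instead of zigzagging)
order_4 = np.argsort(X_poly4[:, 0])
plt.scatter(X_poly4, y_poly4, color='blue', label='Data Points')
plt.plot(X_poly4[order_4], y_pred_poly4[order_4], color='purple', label='Degree 4 Polynomial Regression Curve')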
plt.legend()
plt.show()


In [None]:
### Q.16) Write a Python script that creates a machine learning pipeline with data standardization and a multiple linear regression model, and prints the R-squared score.

In [None]:
ans) # Pipeline from previous task
pipeline.fit(X_train, y_train)
r2 = pipeline.score(X_test, y_test)

print("R-squared Score:", r2)


In [None]:
### Q.17)  Write a Python script that performs polynomial regression (degree 3) on a synthetic dataset and plots the regression curve.

In [None]:
ans)  import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10  # Random values for X in the range 0 to 10
y = 2 * (X ** 3) - 3 * (X ** 2) + 4 * X + np.random.randn(100, 1) * 50  # Degree 3 polynomial relationship with noise

# Transform the features to polynomial features (degree 3)
poly_features = PolynomialFeatures(degree=3)
X_poly = poly_features.fit_transform(X)

# Train a polynomial regression model
model = LinearRegression()
model.fit(X_poly, y)

# Make predictions
y_pred = model.predict(X_poly)

# Plot the data and the regression curve (sorted by X so the line does not zigzag)
order = np.argsort(X[:, 0])
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X[order], y_pred[order], color='red', label='Degree 3 Regression Curve')
plt.title('Polynomial Regression (Degree 3)')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()


In [None]:
### Q.18)  Write a Python script that performs multiple linear regression on a synthetic dataset with 5 features. Print the R-squared score and model coefficients.

In [None]:
ans)  # Generate synthetic data with 5 features
X_syn_multi = np.random.rand(100, 5)
y_syn_multi = 3 * X_syn_multi[:, 0] + 2 * X_syn_multi[:, 1] - X_syn_multi[:, 2] + np.random.randn(100)

# Train-test split
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_syn_multi, y_syn_multi, test_size=0.2, random_state=42)

# Train the model
multi_model = LinearRegression()
multi_model.fit(X_train_multi, y_train_multi)

# Predictions and R-squared score
r2_multi = multi_model.score(X_test_multi, y_test_multi)

print("R-squared Score:", r2_multi)
print("Coefficients:", multi_model.coef_)


In [None]:
### Q.19)  Write a Python script that generates synthetic data for linear regression, fits a model, and visualizes the data points along with the regression line.



In [None]:
ans)  # Generate synthetic data
X_visual = np.random.rand(100, 1) * 10
y_visual = 5 * X_visual + np.random.randn(100, 1) * 2

# Train the model
visual_model = LinearRegression()
visual_model.fit(X_visual, y_visual)

# Predictions
y_pred_visual = visual_model.predict(X_visual)

# Plot
plt.scatter(X_visual, y_visual, color='blue', label='Data Points')
plt.plot(X_visual, y_pred_visual, color='red', label='Regression Line')
plt.legend()
plt.show()


In [None]:
### Q.20) Create a synthetic dataset with 3 features and perform multiple linear regression. Print the model's R-squared score and coefficients.



In [None]:
ans)  # Generate synthetic data with 3 features
X_syn_3 = np.random.rand(100, 3)
y_syn_3 = 3 * X_syn_3[:, 0] + 2 * X_syn_3[:, 1] - X_syn_3[:, 2] + np.random.randn(100)

# Train-test split
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_syn_3, y_syn_3, test_size=0.2, random_state=42)

# Train the model
model_3_features = LinearRegression()
model_3_features.fit(X_train_3, y_train_3)

# Print R-squared and coefficients
r2_3 = model_3_features.score(X_test_3, y_test_3)
print("R-squared Score:", r2_3)
print("Coefficients:", model_3_features.coef_)


In [None]:
### Q.21)  Write a Python script to pickle a trained linear regression model, save it to a file, and load it back for prediction.

In [None]:
ans)  # Save the model
with open('model.pkl', 'wb') as f:
    pickle.dump(model_3_features, f)

# Load the model
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Test prediction with the loaded model
sample_data = np.array([[1.5, 2.5, -1.0]])
print("Prediction for sample data:", loaded_model.predict(sample_data))


In [None]:
### Q.22) Write a Python script to perform linear regression with categorical features using one-hot encoding. Use the Seaborn tips dataset.



In [None]:
ans)  import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Load the Seaborn 'tips' dataset
data = sns.load_dataset('tips')

# One-hot encoding of categorical features
data_encoded = pd.get_dummies(data, columns=['sex', 'smoker', 'day', 'time'], drop_first=True)

# Define features (X) and target (y)
X = data_encoded.drop('tip', axis=1)
y = data_encoded['tip']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print R-squared score
print("R-squared Score:", r2_score(y_test, y_pred))


In [None]:
### Q.23) Compare Ridge Regression with Linear Regression on a synthetic dataset and print the coefficients and R-squared score.


In [None]:
ans)   import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Generate synthetic dataset
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 3 * X + 2 + np.random.randn(100, 1) * 2

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

# Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
y_pred_ridge = ridge_model.predict(X_test)

# Print coefficients and R-squared scores
print("Linear Regression Coefficients:", lr_model.coef_)
print("Linear Regression R-squared Score:", r2_score(y_test, y_pred_lr))
print("Ridge Regression Coefficients:", ridge_model.coef_)
print("Ridge Regression R-squared Score:", r2_score(y_test, y_pred_ridge))


In [None]:
### Q.24) Write a Python script that uses cross-validation to evaluate a Linear Regression model on a synthetic dataset.

In [None]:
ans) import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Generate synthetic dataset
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 5 * X + 7 + np.random.randn(100, 1) * 3

# Initialize Linear Regression model
model = LinearRegression()

# Perform cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='r2')

# Print cross-validation R-squared scores
print("Cross-Validation R-squared Scores:", scores)
print("Mean R-squared Score:", np.mean(scores))


In [None]:
### Q.25) Write a Python script that compares polynomial regression models of different degrees and prints the R-squared score for each.


In [None]:
ans) import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Generate synthetic dataset
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * (X ** 3) - 3 * (X ** 2) + 4 * X + np.random.randn(100, 1) * 50

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for degree in range(1, 6):  # Degrees 1 to 5
    poly_features = PolynomialFeatures(degree=degree)
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_poly, y_train)
    y_pred = model.predict(X_test_poly)

    print(f"Degree {degree} R-squared Score:", r2_score(y_test, y_pred))


In [None]:
### Q.26) Write a Python script that adds interaction terms to a linear regression model and prints the coefficients.

In [None]:
ans) import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate synthetic dataset
np.random.seed(42)
X = pd.DataFrame({
    'feature1': np.random.rand(100) * 10,
    'feature2': np.random.rand(100) * 5
})
y = 3 * X['feature1'] + 2 * X['feature2'] + 1.5 * X['feature1'] * X['feature2'] + np.random.randn(100) * 2

# Add interaction terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interaction = poly.fit_transform(X)

# Fit a linear regression model
model = LinearRegression()
model.fit(X_interaction, y)

# Print coefficients
print("Coefficients:", model.coef_)
