# **Theoretical**

In [None]:
### Q.1) What does R-squared represent in a regression model?

In [None]:
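ans)  R-squared is the proportion of variance in the dependent variable that the independent variables explain. For an ordinary least squares fit with an intercept, evaluated on the training data, it lies between 0 and 1; on held-out data it can be negative when the model predicts worse than the mean of the target. A value of 0.75 means the model explains 75% of the variability.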
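
As a minimal sketch of the definition, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The four data points below are hypothetical values chosen for easy arithmetic; `r2_score` from scikit-learn is used only as a cross-check.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical toy values, not taken from any dataset in this notebook
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation: 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean: 20.0
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R-squared:  {r2_manual:.3f}")                 # 0.925
print(f"sklearn R-squared: {r2_score(y_true, y_pred):.3f}")  # 0.925
```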

In [None]:
### Q.2) What are the assumptions of linear regression?

In [None]:
ans) The key assumptions are:

1) Linearity: X and Y have a linear relationship
2) Independence of observations
3) Homoscedasticity: Constant variance of residuals
4) Normality of residuals
5) No multicollinearity among independent variables

In [None]:
### Q.3) What is the difference between R-squared and Adjusted R-squared?

In [None]:
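ans)   R-squared never decreases when a variable is added, even an irrelevant one. Adjusted R-squared penalizes each extra predictor, so it can decrease when a variable adds little explanatory power, which makes it better for comparing models with different numbers of predictors.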

In [None]:
### Q.4) Why do we use Mean Squared Error (MSE)?

In [None]:
ans) MSE is used because it:

1) Measures average squared differences between predictions and actuals
2) Makes all errors positive through squaring
3) Penalizes larger errors more heavily
4) Is differentiable for optimization

In [None]:
### Q.5) What does an Adjusted R-squared value of 0.85 indicate?

In [None]:
ans)  It indicates that roughly 85% of the variance in the dependent variable is explained by the independent variables after adjusting for the number of predictors and the sample size.

In [None]:
### Q.6) How do we check for normality of residuals in linear regression?

In [None]:
ans)  Through:

1) Q-Q plots
2) Histograms of residuals
3) Shapiro-Wilk test
4) Kolmogorov-Smirnov test
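
The checks listed above can be sketched with SciPy. The two residual samples here are simulated (an assumption for illustration, not output from a fitted model): one drawn from a normal distribution and one from a shifted exponential, which is strongly skewed. A small Shapiro-Wilk p-value is evidence against normality, and the correlation returned by `probplot` summarizes how straight the Q-Q plot would be.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_resid = rng.normal(0, 1, size=200)           # simulated well-behaved residuals
skewed_resid = rng.exponential(1.0, size=200) - 1   # simulated right-skewed residuals

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
w_normal, p_normal = stats.shapiro(normal_resid)
w_skewed, p_skewed = stats.shapiro(skewed_resid)
print(f"normal sample: W={w_normal:.3f}, p={p_normal:.4f}")
print(f"skewed sample: W={w_skewed:.3f}, p={p_skewed:.4g}")

# Q-Q data without plotting: r near 1 means the points lie close to a straight line
(_, _), (_, _, r_normal) = stats.probplot(normal_resid, dist="norm")
print(f"Q-Q correlation for the normal sample: {r_normal:.3f}")
```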

In [None]:
### Q.7) What is multicollinearity, and how does it impact regression?

In [None]:
ans)  Multicollinearity is high correlation between independent variables. It causes:

1) Unstable coefficient estimates
2) Increased standard errors
3) Difficulty in determining variable importance
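
One common way to quantify this is the Variance Inflation Factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features. The sketch below uses simulated data (an assumption for illustration) in which `x2` is nearly a copy of `x1` and `x3` is unrelated; the helper `vif` is a hypothetical name, not a library function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                   # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Regress column j on the other columns and return 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

A VIF near 1 means the feature is not explained by the others; a frequently quoted rule of thumb treats values above 5 or 10 as a sign of problematic collinearity.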

In [None]:
### Q.8) What is Mean Absolute Error (MAE)?

In [None]:
ans) MAE is the average absolute difference between predicted and actual values. It's more robust to outliers than MSE and maintains the original unit of measurement.

In [None]:
### Q.9) What are the benefits of using an ML pipeline?

In [None]:
ans)  ML pipelines provide:

1) Standardized workflow
2) Automated preprocessing
3) Reproducibility
4) Easier deployment and maintenance
5) Reduced human error

In [None]:
### Q.10) Why is RMSE considered more interpretable than MSE?

In [None]:
ans) RMSE is in the same units as the target variable, while MSE is squared units. This makes RMSE easier to understand in context of the original data.

In [None]:
### Q.11) What is pickling in Python, and how is it useful in ML?

In [None]:
ans) Pickling is serializing Python objects to save them to files. In ML, it's used to:

1) Save trained models
2) Share models between systems
3) Preserve preprocessing transformations
4) Store model parameters

In [None]:
### Q.12) What does a high R-squared value mean?

In [None]:
ans) A high R-squared indicates the model explains a large portion of variance in the dependent variable, suggesting good fit. However, a high value on the training data alone can also reflect overfitting, so it should be confirmed on held-out data.

In [None]:
### Q.13) What happens if linear regression assumptions are violated?

In [None]:
ans) Violations can lead to:

1) Biased coefficients
2) Incorrect standard errors
3) Unreliable p-values
4) Poor model generalization

In [None]:
### Q.14) How can we address multicollinearity in regression?

In [None]:
ans)  Methods include:

1) Remove highly correlated features
2) Use principal component analysis (PCA)
3) Ridge regression or Lasso regularization
4) Combine correlated features
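
As a rough illustration of the regularization option, the sketch below fits ordinary least squares and Ridge to simulated data with two nearly identical features. The data and the choice `alpha=1.0` are assumptions for illustration. OLS is free to split the shared effect between the two columns almost arbitrarily, while Ridge shrinks the coefficients toward a more stable split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=1.0, size=n)  # only the shared signal matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("OLS coefficient norm:  ", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```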

In [None]:
### Q.15) How can feature selection improve model performance in regression analysis?

In [None]:
ans) Feature selection can:

1) Reduce overfitting
2) Improve model interpretability
3) Decrease training time
4) Remove noise from irrelevant features
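
A minimal sketch of univariate feature selection, assuming simulated data in which only the first three of ten features influence the target. `SelectKBest` with `f_regression` scores each feature by its individual linear relationship with `y` and keeps the top `k`; this is one selection technique among several, chosen here for brevity.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
# Only columns 0, 1 and 2 carry signal; the other seven are pure noise
y = 5 * X[:, 0] - 4 * X[:, 1] + 3 * X[:, 2] + rng.normal(scale=1.0, size=500)

selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
selected = selector.get_support(indices=True)
X_reduced = selector.transform(X)

print("Selected feature indices:", selected)
print("Reduced shape:", X_reduced.shape)
```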

In [None]:
### Q.16) How is Adjusted R-squared calculated?

In [None]:
ans) Adjusted R-squared = 1 - [(1 - R²)(n-1)/(n-k-1)]
where n is sample size and k is number of predictors
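
The formula above can be sketched as a small function. The helper name `adjusted_r2` and the example numbers are hypothetical; they show that, for the same R-squared and sample size, more predictors give a lower adjusted value.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

few_predictors = adjusted_r2(0.85, n=100, k=5)
many_predictors = adjusted_r2(0.85, n=100, k=20)

print(f"k=5:  {few_predictors:.4f}")   # 0.8420
print(f"k=20: {many_predictors:.4f}")  # 0.8120
```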

In [None]:
### Q.17) Why is MSE sensitive to outliers?

In [None]:
ans) MSE squares errors, which magnifies large differences, making it particularly sensitive to outliers compared to metrics like MAE.
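
A small worked example, using hypothetical values, makes the effect concrete: changing a single error from 1 to 10 raises MAE from 1 to 3.25 but raises MSE from 1 to 25.75.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and two sets of predictions
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_clean = np.array([11.0, 19.0, 31.0, 39.0])    # every error has size 1
y_pred_outlier = np.array([11.0, 19.0, 31.0, 50.0])  # last error has size 10

results = {}
for label, y_pred in [("clean", y_pred_clean), ("outlier", y_pred_outlier)]:
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    results[label] = (mae, mse, rmse)
    print(f"{label}: MAE={mae:.2f}, MSE={mse:.2f}, RMSE={rmse:.2f}")
# clean: MAE=1.00, MSE=1.00, RMSE=1.00
# outlier: MAE=3.25, MSE=25.75, RMSE=5.07
```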

In [None]:
### Q.18) What is the role of homoscedasticity in linear regression?

In [None]:
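ans) Homoscedasticity means the residuals have constant variance across all predictor values. OLS coefficient estimates remain unbiased without it, but the standard errors, and therefore confidence intervals and hypothesis tests, are only reliable when it holds.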

In [None]:
### Q.19) What is Root Mean Squared Error (RMSE)?

In [None]:
ans) RMSE is the square root of MSE, providing error measurement in the same units as the target variable.

In [None]:
### Q.20) Why is pickling considered risky?

In [None]:
ans) Pickling can be risky because:

1) Security vulnerabilities when unpickling untrusted files
2) Version compatibility issues
3) Platform dependency problems
4) Potential for arbitrary code execution
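
The code-execution risk can be shown with a deliberately harmless sketch. The class `NotReallyAModel` is hypothetical: its `__reduce__` method tells pickle to call `sorted([3, 1, 2])` at load time. A malicious file could name any importable callable in the same way, which is why pickles from untrusted sources should not be loaded.

```python
import pickle

class NotReallyAModel:
    """Hypothetical class whose pickle instructs the loader to call a function."""

    def __reduce__(self):
        # pickle stores (callable, args); the loader then calls callable(*args)
        return (sorted, ([3, 1, 2],))

payload = pickle.dumps(NotReallyAModel())
result = pickle.loads(payload)  # sorted([3, 1, 2]) runs during loading

# The loaded object is the function's return value, not a NotReallyAModel
print(type(result).__name__, result)  # list [1, 2, 3]
```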

In [None]:
### Q.21) What alternatives exist to pickling for saving ML models?

In [None]:
ans) Alternatives include:

1) ONNX format
2) TensorFlow SavedModel
3) Joblib (more efficient for large NumPy arrays, but built on pickle, so the same security caveats apply)
4) Model-specific formats (like H5 for Keras)
5) Custom serialization methods
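
As a sketch of the custom serialization option, a linear model can be stored as plain JSON because its learned state is just the coefficients and the intercept. The data below is simulated and the dictionary keys `coef` and `intercept` are arbitrary names chosen for this example.

```python
import json

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5

model = LinearRegression().fit(X, y)

# Serialize only the learned numbers; no executable code is stored
params_json = json.dumps({
    "coef": model.coef_.tolist(),
    "intercept": float(model.intercept_),
})

# Rebuild predictions from the stored parameters alone
params = json.loads(params_json)
restored_pred = X @ np.array(params["coef"]) + params["intercept"]

print(np.allclose(restored_pred, model.predict(X)))
```

Loading JSON only parses data, so this avoids the code-execution risk of pickle, at the cost of writing the save and restore logic by hand for each model type.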

In [None]:
### Q.22) What is heteroscedasticity, and why is it a problem?

In [None]:
ans) Heteroscedasticity is non-constant variance in residuals, causing:

1) Inefficient parameter estimates
2) Biased standard errors
3) Invalid hypothesis tests
4) Unreliable confidence intervals
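
One way to test for this is a Breusch-Pagan style check, sketched here in its studentized (Koenker) form: regress the squared residuals on the predictors and compare n times the R-squared of that auxiliary regression to a chi-squared distribution. The data is simulated with noise whose standard deviation grows with `x`, which is an assumption made for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, size=n)
X = x.reshape(-1, 1)
y = 2 * x + rng.normal(scale=x, size=n)  # noise standard deviation grows with x

# Residuals from the main regression
resid = y - LinearRegression().fit(X, y).predict(X)

# Auxiliary regression: do the predictors explain the squared residuals?
aux_r2 = LinearRegression().fit(X, resid ** 2).score(X, resid ** 2)
lm_stat = n * aux_r2
p_value = stats.chi2.sf(lm_stat, df=X.shape[1])

print(f"LM statistic: {lm_stat:.2f}, p-value: {p_value:.3g}")
```

A small p-value suggests the residual variance depends on the predictors, in which case heteroscedasticity-robust standard errors or a transformation of the target are common remedies.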

In [None]:
### Q.23) How can interaction terms enhance a regression model's predictive power?

In [None]:
ans)  Interaction terms can:

1) Capture non-linear relationships
2) Model feature dependencies
3) Improve model flexibility
4) Account for conditional effects between variables
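
A minimal sketch of the effect, assuming simulated data where the target depends only on the product of two features. A plain linear model cannot represent that product, while adding the interaction column with `PolynomialFeatures(interaction_only=True)` lets the same linear estimator fit it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
# The effect of feature 0 depends entirely on the value of feature 1
y = 2 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=500)

plain = LinearRegression().fit(X, y)
with_interaction = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
).fit(X, y)

r2_plain = plain.score(X, y)
r2_interaction = with_interaction.score(X, y)
print(f"R-squared without interaction: {r2_plain:.3f}")
print(f"R-squared with interaction:    {r2_interaction:.3f}")
```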

# **Practical**

In [None]:
### Q.1) Write a Python script to visualize the distribution of errors (residuals) for a multiple linear regression model using Seaborn's "diamonds" dataset.

In [None]:
ans) import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Preprocess the data: Select numeric features and drop rows with missing values
diamonds = diamonds.select_dtypes(include=[np.number]).dropna()

# Define features (X) and target (y)
X = diamonds.drop(columns=['price'])  # Exclude 'price' as it's the target
y = diamonds['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Calculate residuals
residuals = y_test - y_pred

# Visualize the distribution of residuals
sns.histplot(residuals, kde=True)
plt.title('Distribution of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()


In [None]:
### Q.2)  Write a Python script to calculate and print Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) for a linear regression model.

In [None]:
ans) from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"RMSE: {rmse}")


In [None]:
### Q.3) Write a Python script to check if the assumptions of linear regression are met. Use a scatter plot to check linearity, residuals plot for homoscedasticity, and correlation matrix for multicollinearity.

In [None]:
ans) import seaborn as sns
import matplotlib.pyplot as plt

# Linearity: Actual vs Predicted (points should scatter around the diagonal)
sns.scatterplot(x=y_test, y=y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--')
plt.title("Actual vs Predicted")
plt.show()

# Homoscedasticity
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title("Homoscedasticity Check")
plt.show()

# Multicollinearity
corr_matrix = X.corr()
sns.heatmap(corr_matrix, annot=True)
plt.title("Correlation Matrix")
plt.show()


In [None]:
### Q.4)  Write a Python script that creates a machine learning pipeline with feature scaling and evaluates the performance of different regression models.

In [None]:
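ans) from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Candidate regressors, each wrapped in the same scaling pipeline
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=1.0),
}

for name, regressor in models.items():
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', regressor)
    ])
    # Cross-validation to evaluate performance
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
    print(f"{name} mean R-squared: {scores.mean()}")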


In [None]:
### Q.5) Implement a simple linear regression model on a dataset and print the model's coefficients, intercept, and R-squared score.

In [None]:
ans) model = LinearRegression()
model.fit(X_train, y_train)
r_squared = model.score(X_test, y_test)

print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print(f"R-squared: {r_squared}")


In [None]:
### Q.6)  Write a Python script that analyzes the relationship between total bill and tip in the tips dataset using simple linear regression and visualizes the results.

In [None]:
ans) tips = sns.load_dataset('tips')

# Simple Linear Regression
X = tips[['total_bill']]
y = tips['tip']

model = LinearRegression()
model.fit(X, y)

# Predictions
y_pred = model.predict(X)

# Plot
sns.scatterplot(x=tips['total_bill'], y=tips['tip'])
plt.plot(tips['total_bill'], y_pred, color='red')
plt.title("Total Bill vs Tip")
plt.show()


In [None]:
### Q.7) Write a Python script that fits a linear regression model to a synthetic dataset with one feature. Use the model to predict new values and plot the data points along with the regression line.



In [None]:
ans) from sklearn.datasets import make_regression

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
model = LinearRegression()
model.fit(X, y)

# Predictions
y_pred = model.predict(X)

# Plot
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.title("Regression Line with Synthetic Data")
plt.show()


In [None]:
### Q.8)  Write a Python script that pickles a trained linear regression model and saves it to a file.



In [None]:
ans)  import pickle

with open('linear_model.pkl', 'wb') as file:
    pickle.dump(model, file)

print("Model saved to 'linear_model.pkl'")


In [None]:
### Q.9) Write a Python script that fits a polynomial regression model (degree 2) to a dataset and plots the regression curve.

In [None]:
ans) from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Polynomial Regression
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)

# Predictions
y_pred = poly_model.predict(X)

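# Plot (sort by X so the curve is drawn left to right instead of zigzagging)
order = np.argsort(X[:, 0])
plt.scatter(X, y, color='blue')
plt.plot(X[order, 0], y_pred[order], color='red')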
plt.title("Polynomial Regression (Degree 2)")
plt.show()


In [None]:
### Q.10) Generate synthetic data for simple linear regression (use random values for X and y) and fit a linear regression model to the data. Print the model's coefficient and intercept.

In [None]:
ans) import numpy as np

X = np.random.rand(100, 1) * 10  # Random values for X
y = 3 * X.flatten() + np.random.randn(100) * 5  # Linear relation with noise

model = LinearRegression()
model.fit(X, y)

print(f"Coefficient: {model.coef_[0]}")
print(f"Intercept: {model.intercept_}")


In [None]:
### Q.11)   Write a Python script that fits polynomial regression models of different degrees to a synthetic dataset and compares their performance.

In [None]:
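ans) degrees = [1, 2, 3, 4]
order = np.argsort(X[:, 0])  # sort by X so each curve is drawn left to right
for degree in degrees:
    poly_model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    poly_model.fit(X, y)
    y_pred = poly_model.predict(X)
    # Include the training R-squared in the legend so the fits can be compared
    r_squared = poly_model.score(X, y)
    plt.plot(X[order, 0], y_pred[order], label=f"Degree {degree} (R-squared: {r_squared:.3f})")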

plt.scatter(X, y, color='blue', alpha=0.5)
plt.legend()
plt.title("Polynomial Regression Models")
plt.show()


In [None]:
### Q.12)  Write a Python script that fits a simple linear regression model with two features and prints the model's coefficients, intercept, and R-squared score.

In [None]:
ans) X = diamonds[['carat', 'depth']]
y = diamonds['price']

model = LinearRegression()
model.fit(X, y)
r_squared = model.score(X, y)

print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print(f"R-squared: {r_squared}")


In [None]:
### Q.13) Write a Python script that generates synthetic data, fits a linear regression model, and visualizes the regression line along with the data points.

In [None]:
ans) X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.title("Regression Line")
plt.show()


In [None]:
### Q.14) Write a Python script that uses the Variance Inflation Factor (VIF) to check for multicollinearity in a dataset with multiple features.



In [None]:
ans) import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.datasets import make_regression

# Generate synthetic data
X, _ = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)
df = pd.DataFrame(X, columns=['Feature1', 'Feature2', 'Feature3', 'Feature4', 'Feature5'])

# Add an intercept column so each VIF comes from a regression with a constant term
df_const = add_constant(df)

# Calculate VIF for each feature (column 0 is the constant, so start at 1)
vif_data = pd.DataFrame()
vif_data["Feature"] = df.columns
vif_data["VIF"] = [variance_inflation_factor(df_const.values, i) for i in range(1, df_const.shape[1])]

print(vif_data)


In [None]:
### Q.15) Write a Python script that generates synthetic data for a polynomial relationship (degree 4), fits a polynomial regression model, and plots the regression curve.

In [None]:
ans) from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
X = np.random.rand(100, 1) * 10
x = X.flatten()  # 1-D copy so y has shape (100,) instead of broadcasting to (100, 100)
y = 3 * x**4 - 5 * x**3 + 2 * x**2 + 7 * x + np.random.randn(100) * 100

# Polynomial Regression
poly_model = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())
poly_model.fit(X, y)

# Predictions
y_pred = poly_model.predict(X)

# Plot
plt.scatter(X, y, color='blue')
plt.plot(np.sort(X, axis=0), y_pred[np.argsort(X, axis=0)], color='red')
plt.title("Polynomial Regression (Degree 4)")
plt.show()


In [None]:
### Q.16) Write a Python script that creates a machine learning pipeline with data standardization and a multiple linear regression model, and prints the R-squared score.

In [None]:
ans) from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

# Train and evaluate
pipeline.fit(X_train, y_train)
r_squared = pipeline.score(X_test, y_test)

print(f"R-squared: {r_squared}")


In [None]:
### Q.17) Write a Python script that performs polynomial regression (degree 3) on a synthetic dataset and plots the regression curve.


In [None]:
ans) # Generate synthetic data
X = np.random.rand(100, 1) * 10
x = X.flatten()  # 1-D copy so y has shape (100,) instead of broadcasting to (100, 100)
y = x**3 - 2 * x**2 + 5 * x + np.random.randn(100) * 50

# Polynomial Regression
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(X, y)

# Predictions
y_pred = poly_model.predict(X)

# Plot
plt.scatter(X, y, color='blue')
plt.plot(np.sort(X, axis=0), y_pred[np.argsort(X, axis=0)], color='red')
plt.title("Polynomial Regression (Degree 3)")
plt.show()


In [None]:
### Q.18) Write a Python script that performs multiple linear regression on a synthetic dataset with 5 features. Print the R-squared score and model coefficients.

In [None]:
ans) # Generate synthetic data
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Train model
model = LinearRegression()
model.fit(X, y)

# Results
r_squared = model.score(X, y)
print(f"R-squared: {r_squared}")
print(f"Coefficients: {model.coef_}")


In [None]:
### Q.19) Write a Python script that generates synthetic data for linear regression, fits a model, and visualizes the data points along with the regression line.



In [None]:
ans) # Generate synthetic data
X = np.random.rand(100, 1) * 10
y = 2 * X.flatten() + 5 + np.random.randn(100) * 2

# Fit model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Plot
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.title("Linear Regression Line")
plt.show()


In [None]:
### Q.20) Create a synthetic dataset with 3 features and perform multiple linear regression. Print the model's R-squared score and coefficients.

In [None]:
ans) # Generate synthetic data
X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=42)

# Train model
model = LinearRegression()
model.fit(X, y)

# Results
r_squared = model.score(X, y)
print(f"R-squared: {r_squared}")
print(f"Coefficients: {model.coef_}")


In [None]:
### Q.21) Write a Python script that demonstrates how to serialize and deserialize machine learning models using joblib instead of pickling.

In [None]:
ans) from joblib import dump, load

# Serialize model
dump(model, 'linear_model.joblib')
print("Model saved to 'linear_model.joblib'")

# Deserialize model
loaded_model = load('linear_model.joblib')
print("Model loaded successfully.")


In [None]:
### Q.22) Write a Python script to perform linear regression with categorical features using one-hot encoding. Use the Seaborn tips dataset.

In [None]:
ans) from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Load dataset
tips = sns.load_dataset('tips')

# One-hot encoding
X = tips[['total_bill', 'sex', 'smoker']]
y = tips['tip']

ct = ColumnTransformer([
    # drop='first' removes one level per feature to avoid the dummy variable trap
    ('onehot', OneHotEncoder(drop='first'), ['sex', 'smoker'])
], remainder='passthrough')

X_transformed = ct.fit_transform(X)

# Fit model
model = LinearRegression()
model.fit(X_transformed, y)

# Results
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")


In [None]:
### Q.23) Compare Ridge Regression with Linear Regression on a synthetic dataset and print the coefficients and R-squared score.

In [None]:
ans) from sklearn.linear_model import Ridge

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X, y)

# Linear Regression
linear_model = LinearRegression()
linear_model.fit(X, y)

# Results
print("Ridge Regression Coefficients:", ridge_model.coef_)
print("Linear Regression Coefficients:", linear_model.coef_)
print("Ridge R-squared:", ridge_model.score(X, y))
print("Linear R-squared:", linear_model.score(X, y))


In [None]:
### Q.24) Write a Python script that uses cross-validation to evaluate a Linear Regression model on a synthetic dataset.

In [None]:
ans) from sklearn.model_selection import cross_val_score

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Cross-validation
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(f"R-squared scores: {scores}")
print(f"Mean R-squared: {scores.mean()}")


In [None]:
### Q.25)  Write a Python script that compares polynomial regression models of different degrees and prints the R-squared score for each.

In [None]:
ans) degrees = [1, 2, 3, 4]
for degree in degrees:
    poly_model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    poly_model.fit(X, y)
    r_squared = poly_model.score(X, y)
    print(f"Degree {degree} R-squared: {r_squared}")
