Regression Assignment

Theory Questions
Q1: What does R-squared represent in a regression model?

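A: R-squared represents the proportion of variance in the dependent variable that's explained by the independent variables. On the data used to fit an ordinary least squares model with an intercept, it ranges from 0 to 1, where 1 means the model explains all variability in the data. On held-out data it can be negative when the model predicts worse than simply using the mean of the target.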
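As a minimal sketch of the definition, the snippet below computes R-squared as 1 - SS_res / SS_tot with NumPy. The helper name `r_squared` and the toy numbers are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)                    # predictions match exactly
mean_only = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])    # always predict the mean
worse = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])        # worse than predicting the mean

print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```

The last case shows that the value is not bounded below by 0 when predictions are worse than the mean baseline.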

Q2: What is the difference between R-squared and Adjusted R-squared?

A: The two metrics differ as follows:
- R-squared never decreases when a variable is added, even an irrelevant one
- Adjusted R-squared penalizes variables that don't improve the model
- Adjusted R-squared can decrease when irrelevant variables are added

Q3: What does an Adjusted R-squared value of 0.85 indicate?

A: An Adjusted R-squared of 0.85 indicates that 85% of the variance in the dependent variable is explained by the predictors, accounting for the number of predictors in the model.

Q4: What does a high R-squared value mean?

A: A high R-squared indicates:
- Strong fit between predictors and target
- Large proportion of variance explained
- Model captures patterns well
- May suggest overfitting if extremely high

Q5: How is Adjusted R-squared calculated?

A: Adjusted R-squared is calculated as:
Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]
where:
n = sample size
k = number of predictors
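The formula above can be sketched as a small function. The name `adjusted_r_squared` and the example values (R² = 0.90, n = 50) are assumptions chosen only to show the arithmetic.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors: n > k + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Same R² and sample size, different numbers of predictors
adj_few = adjusted_r_squared(0.90, n=50, k=5)    # 1 - 0.10 * 49 / 44
adj_many = adjusted_r_squared(0.90, n=50, k=20)  # 1 - 0.10 * 49 / 29

print(f"k=5:  {adj_few:.4f}")   # 0.8886
print(f"k=20: {adj_many:.4f}")  # 0.8310
```

With R² held fixed, the penalty grows as k approaches n, which is why the adjusted value drops from about 0.889 to about 0.831 here.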

ERROR METRICS

Q6: Why do we use Mean Squared Error (MSE)?

A: MSE is used because it:
- Penalizes larger errors more heavily (squared term)
- Provides a single metric for model evaluation
- Is differentiable (useful for optimization)
- Is always non-negative (0 only for a perfect fit)

Q7: What is Mean Absolute Error (MAE)?

A: MAE is the average absolute difference between predicted and actual values. It's:
- More robust to outliers than MSE
- Easier to interpret
- Represents average error in original units

Q8: Why is RMSE considered more interpretable than MSE?

A: RMSE is more interpretable than MSE because:
- It's in the same units as the target variable
- Provides a more intuitive error magnitude
- Easier to compare with the original scale

Q9: Why is MSE sensitive to outliers?

A: MSE is sensitive to outliers because:
- Errors are squared
- Large deviations are heavily penalized
- Outliers have disproportionate impact
- Can distort model evaluation
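A small worked example may make the effect concrete. The arrays below are made-up values: every prediction is off by 1, then a single prediction is replaced so that it is off by 10.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred_clean = np.array([11.0, 11.0, 12.0, 12.0, 13.0])    # every error is exactly 1
y_pred_outlier = np.array([11.0, 11.0, 12.0, 12.0, 22.0])  # last error is 10

mse_clean, mae_clean = mse(y_true, y_pred_clean), mae(y_true, y_pred_clean)
mse_out, mae_out = mse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier)
rmse_out = float(np.sqrt(mse_out))

print(f"clean:   MSE={mse_clean:.2f}, MAE={mae_clean:.2f}")                        # MSE=1.00, MAE=1.00
print(f"outlier: MSE={mse_out:.2f}, MAE={mae_out:.2f}, RMSE={rmse_out:.2f}")       # MSE=20.80, MAE=2.80, RMSE=4.56
```

One bad prediction multiplies MSE by about 21 but MAE by less than 3, because the error of 10 contributes 100 to the squared sum.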

Q10: What is Root Mean Squared Error (RMSE)?

A: RMSE is:
- Square root of MSE
- A measure of typical error size in the original units of the target (not a plain average of the errors)
- Common metric for model evaluation
- More sensitive to outliers than MAE

LINEAR REGRESSION ASSUMPTIONS & ISSUES

Q11: What are the assumptions of linear regression?

A: Linear regression assumptions include:
- Linearity: Linear relationship between X and Y
- Independence of errors
- Homoscedasticity: Constant variance of residuals
- Normality of residuals
- No perfect multicollinearity
- Independent observations

Q12: How do we check for normality of residuals in linear regression?

A: Normality of residuals can be checked through:
- Q-Q plots
- Histogram of residuals
- Shapiro-Wilk test
- Anderson-Darling test
- Visual inspection of residual plots
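The checks listed above can be sketched with SciPy. The residuals here are simulated (one normal sample, one skewed sample) purely as an assumption for illustration; `normality_report` is a hypothetical helper name. `scipy.stats.probplot` is called without plotting, so it only returns the Q-Q points and the correlation `r` of the Q-Q line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_resid = rng.normal(loc=0.0, scale=1.0, size=200)      # roughly normal residuals
skewed_resid = rng.exponential(scale=1.0, size=200) - 1.0    # right-skewed residuals

def normality_report(residuals):
    """Return the Shapiro-Wilk p-value and the Q-Q line correlation."""
    _, p_value = stats.shapiro(residuals)
    (_, _), (_, _, r) = stats.probplot(residuals, dist="norm")
    return {"shapiro_p": float(p_value), "qq_r": float(r)}

report_normal = normality_report(normal_resid)
report_skewed = normality_report(skewed_resid)

print("normal:", report_normal)
print("skewed:", report_skewed)
```

A small Shapiro-Wilk p-value (commonly below 0.05) is evidence against normality, and a Q-Q correlation noticeably below 1 points the same way.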

Q13: What is multicollinearity, and how does it impact regression?

A: Multicollinearity occurs when independent variables are highly correlated. It:
- Makes coefficient estimates unstable
- Increases standard errors
- Makes it difficult to determine individual variable importance
- Usually has little effect on overall predictive accuracy, as long as new data shows the same correlation pattern

Q14: What happens if linear regression assumptions are violated?

A: Violation of assumptions can lead to:
- Biased coefficient estimates (for example, when the true relationship is not linear)
- Incorrect standard errors
- Invalid hypothesis tests
- Unreliable predictions
- Misleading R-squared values

Q15: What is the role of homoscedasticity in linear regression?

A: Homoscedasticity ensures:
- Constant variance of residuals
- Valid standard errors
- Reliable hypothesis tests
- Efficient parameter estimates

HANDLING ISSUES & SOLUTIONS

Q16: How can we address multicollinearity in regression?

A: Addressing multicollinearity:
- Remove highly correlated variables
- Use principal component analysis (PCA)
- Ridge regression
- Create composite variables
- Collect more data
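As a rough illustration of the ridge regression option, the sketch below builds two nearly identical predictors (an assumed toy setup) and compares ordinary least squares with `Ridge(alpha=1.0)`. The value of `alpha` is arbitrary here and would normally be tuned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(size=n)          # only the shared signal matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # individual values can be large and unstable
print("Ridge coefficients:", ridge.coef_)  # shrunk toward sharing the effect
print("Sum of ridge coefficients:", ridge.coef_.sum())
```

OLS cannot tell the two columns apart, so it may split the effect in an arbitrary way; the ridge penalty shrinks the coefficient vector, and the combined effect stays close to the true value of 3.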

Q17: What is heteroscedasticity, and why is it a problem?

A: Heteroscedasticity:
- Non-constant variance of residuals
- Leads to inefficient estimates
- Invalidates standard errors
- Makes prediction intervals unreliable
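One way to detect non-constant variance is a Breusch-Pagan style test. The sketch below is a simplified NumPy version (the n·R² form, often attributed to Koenker), not a library implementation; the helper names and the simulated data are assumptions. The statistic is compared with 3.841, the 5% critical value of a chi-squared distribution with 1 degree of freedom.

```python
import numpy as np

def ols_residuals(x, y):
    """Residuals from regressing y on x with an intercept."""
    design = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

def breusch_pagan_lm(x, residuals):
    """LM statistic: n * R² from regressing squared residuals on x."""
    design = np.column_stack([np.ones(len(x)), x])
    target = np.asarray(residuals, dtype=float) ** 2
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    ss_res = np.sum((target - design @ beta) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return len(target) * (1.0 - ss_res / ss_tot)

rng = np.random.default_rng(42)
x = rng.uniform(1.0, 10.0, size=500)
y_homo = 2.0 * x + rng.normal(0.0, 2.0, size=500)        # constant noise level
y_hetero = 2.0 * x + rng.normal(0.0, 1.0, size=500) * x  # noise grows with x

lm_homo = breusch_pagan_lm(x, ols_residuals(x, y_homo))
lm_hetero = breusch_pagan_lm(x, ols_residuals(x, y_hetero))
CHI2_CRIT_1DF_5PCT = 3.841

print(f"homoscedastic LM:   {lm_homo:.2f}")
print(f"heteroscedastic LM: {lm_hetero:.2f} (critical value {CHI2_CRIT_1DF_5PCT})")
```

A statistic well above the critical value suggests the residual variance depends on x.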

Q18: How does adding irrelevant predictors affect R-squared and Adjusted R-squared?

A: Effect on R² and Adjusted R²:
- R² never decreases, and usually rises slightly, when irrelevant predictors are added
- Adjusted R² typically decreases when irrelevant predictors are added
- Shows why Adjusted R² is preferred
- Helps prevent overfitting
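A quick experiment can illustrate the point. The sketch below fits least squares with NumPy on one real predictor, then again after appending ten columns of pure noise. The data, seed, and helper names are assumptions for illustration only.

```python
import numpy as np

def fit_r2(X, y):
    """In-sample R² of an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def adjusted(r2, n, k):
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(1)
n = 60
x = rng.normal(size=(n, 1))
y = 2.0 * x[:, 0] + rng.normal(size=n)
junk = rng.normal(size=(n, 10))          # predictors unrelated to y
X_full = np.hstack([x, junk])

r2_base, r2_full = fit_r2(x, y), fit_r2(X_full, y)
adj_base, adj_full = adjusted(r2_base, n, 1), adjusted(r2_full, n, 11)

print(f"1 predictor:   R²={r2_base:.3f}, adjusted R²={adj_base:.3f}")
print(f"11 predictors: R²={r2_full:.3f}, adjusted R²={adj_full:.3f}")
```

The in-sample R² of the larger model cannot be lower than that of the smaller one, while the adjusted value is discounted for the ten extra predictors.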

ML PIPELINES & MODEL SAVING

Q19: What are the benefits of using an ML pipeline?

A: ML pipeline benefits:
- Ensures consistent preprocessing
- Reduces code duplication
- Prevents data leakage
- Makes deployment easier
- Improves reproducibility

Q20: Why do we use pipelines in machine learning?

A: Pipelines are used because they:
- Ensure proper order of operations
- Prevent data leakage
- Simplify model deployment
- Make cross-validation easier
- Enable proper scaling/transformation
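The leakage point can be sketched with scikit-learn. In this assumed setup (synthetic data from `make_regression`), the scaler inside the pipeline learns its statistics from the training rows only, and `cross_val_score` refits the whole pipeline inside each fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipeline.fit(X_train, y_train)

# The scaler's mean comes from the training rows only, never from X_test
scaler_mean = pipeline.named_steps["scaler"].mean_
uses_train_only = np.allclose(scaler_mean, X_train.mean(axis=0))

# Each fold refits scaler + model on that fold's training part
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="r2")

print("Scaler fitted on training data only:", uses_train_only)
print("Cross-validated R-squared:", cv_scores)
```

Scaling the full dataset before splitting would let test-set statistics influence training; keeping the scaler inside the pipeline avoids that.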

Q21: What is pickling in Python, and how is it useful in ML?

A: Pickling in Python:
- Serializes Python objects to binary format
- Allows saving trained models to disk
- Enables model sharing and deployment
- Preserves the entire object state

Q22: Why is pickling considered risky?

A: Pickling risks include:
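- Security vulnerabilities: unpickling data from an untrusted source can execute arbitrary code
- Version compatibility issues (a model pickled with one library version may not load in another)
- Platform dependencies
- Size inefficiency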
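The code execution risk can be shown with a harmless example. `Payload` is a hypothetical class invented for this sketch; its `__reduce__` method tells pickle which callable to run at load time. Here the callable is the built-in `sorted`, but an attacker could name any importable function.

```python
import pickle

class Payload:
    """Hypothetical stand-in for a malicious object."""
    def __reduce__(self):
        # At load time pickle calls sorted([3, 1, 2]) instead of rebuilding Payload
        return (sorted, ([3, 1, 2],))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)   # runs the callable chosen by whoever created the blob

print(type(result).__name__, result)  # list [1, 2, 3]
```

The loaded object is a sorted list rather than a `Payload` instance, which shows that loading ran a function chosen by the author of the file. This is why pickle files should only be loaded from trusted sources.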

Q23: What alternatives exist to pickling for saving ML models?

A: Alternatives to pickling:
- joblib (efficient for large NumPy arrays, but built on pickle, so the same security caveats apply)
- ONNX format
- TensorFlow SavedModel
- PMML
- Custom serialization methods
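For a plain linear model, the custom serialization option can be as simple as storing the learned parameters as JSON. This is a minimal sketch under that assumption; the file name, the toy data, and the helper `predict_from_params` are hypothetical, and the approach does not generalize to arbitrary estimators.

```python
import json
import os
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + 0.5

model = LinearRegression().fit(X, y)

# Store only plain numbers: no executable content ends up in the file
params = {"coef": model.coef_.tolist(), "intercept": float(model.intercept_)}
path = os.path.join(tempfile.mkdtemp(), "linear_model.json")
with open(path, "w") as fh:
    json.dump(params, fh)

with open(path) as fh:
    loaded = json.load(fh)

def predict_from_params(p, X_new):
    """Recompute predictions from the stored coefficients and intercept."""
    return np.asarray(X_new) @ np.asarray(p["coef"]) + p["intercept"]

restored_pred = predict_from_params(loaded, X)
print("Predictions match:", np.allclose(restored_pred, model.predict(X)))
```

Because the file holds only numbers, loading it cannot run code, and it stays readable across library versions.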

PRACTICAL QUESTIONS

In [None]:
#QUESTION 1 ANSWER

import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Load dataset
diamonds = sns.load_dataset('diamonds')

# Preprocessing
X = pd.get_dummies(diamonds.drop('price', axis=1), drop_first=True)
y = diamonds['price']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Errors
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"MSE: {mse:.2f}, MAE: {mae:.2f}")


In [None]:
#QUESTION 2 ANSWER
from sklearn.datasets import make_regression
import numpy as np

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Errors
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"MSE: {mse:.2f}, MAE: {mae:.2f}, RMSE: {rmse:.2f}")


In [None]:
#QUESTION 3 ANSWER
import matplotlib.pyplot as plt
import seaborn as sns

# Linearity
sns.scatterplot(x=y_test, y=y_pred)
plt.title('Linearity Check')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

# Residuals plot for homoscedasticity
residuals = y_test - y_pred
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='r', linestyle='--')
plt.title('Residuals Plot (Homoscedasticity)')
plt.show()

# Correlation matrix for multicollinearity
sns.heatmap(pd.DataFrame(X_train).corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


In [None]:
#QUESTION 4 ANSWER
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)

print(f"R-squared score: {score:.2f}")



In [None]:
#QUESTION 5 ANSWER
# Model fitting
model = LinearRegression()
model.fit(X_train, y_train)

# Coefficients, Intercept, and R-squared
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print(f"R-squared score: {model.score(X_test, y_test):.2f}")


In [None]:
#QUESTION 6 ANSWER
# Load dataset
tips = sns.load_dataset('tips')

# Prepare data
X = tips[['total_bill']]
y = tips['tip']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
model = LinearRegression()
model.fit(X_train, y_train)

# Slope and Intercept
print(f"Slope: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")


In [None]:
#QUESTION 7 ANSWER
import numpy as np

# Generate synthetic data
X = np.random.rand(100, 1) * 10
y = 2 * X + np.random.randn(100, 1) * 2

# Model
model = LinearRegression()
model.fit(X, y)

# Predictions
y_pred = model.predict(X)

# Plot
plt.scatter(X, y, label='Data points')
plt.plot(X, y_pred, color='r', label='Regression line')
plt.legend()
plt.title('Simple Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.show()


In [None]:
#QUESTION 8 ANSWER
import pickle

# Save the model
with open('linear_model.pkl', 'wb') as file:
    pickle.dump(model, file)

# Load the model
with open('linear_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

print(f"Loaded Model Coefficients: {loaded_model.coef_}")


In [None]:
#QUESTION 9 ANSWER
from sklearn.preprocessing import PolynomialFeatures

# Polynomial Features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Model
poly_model = LinearRegression()
poly_model.fit(X_poly, y)

# Predictions
y_poly_pred = poly_model.predict(X_poly)

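# Plot (sort by X so the curve is drawn left to right instead of zigzagging)
order = np.argsort(X[:, 0])
plt.scatter(X, y, label='Data points')
plt.plot(X[order], y_poly_pred[order], color='r', label='Polynomial regression curve')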
plt.legend()
plt.title('Polynomial Regression (Degree 2)')
plt.show()


In [None]:
#QUESTION 10 ANSWER
# Generate synthetic data
X = np.random.rand(100, 1) * 10
y = 3 * X + np.random.randn(100, 1) * 2

# Model
model = LinearRegression()
model.fit(X, y)

# Coefficient and Intercept
print(f"Coefficient: {model.coef_[0][0]:.2f}")
print(f"Intercept: {model.intercept_[0]:.2f}")


In [None]:
#QUESTION 11 ANSWER
# Polynomial Features
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

# Model
poly_model = LinearRegression()
poly_model.fit(X_poly, y)

# Predictions
y_poly_pred = poly_model.predict(X_poly)

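# Plot (sort by X so the curve is drawn left to right instead of zigzagging)
order = np.argsort(X[:, 0])
plt.scatter(X, y, label='Data points')
plt.plot(X[order], y_poly_pred[order], color='r', label='Polynomial regression curve')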
plt.legend()
plt.title('Polynomial Regression (Degree 3)')
plt.show()


In [None]:
#QUESTION 12 ANSWER
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
model = LinearRegression()
model.fit(X_train, y_train)

# Coefficients, Intercept, and R-squared
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print(f"R-squared score: {model.score(X_test, y_test):.2f}")


In [None]:
#QUESTION 13 ANSWER
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Errors
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"MSE: {mse:.2f}, MAE: {mae:.2f}, RMSE: {rmse:.2f}")


In [None]:
#QUESTION 14 ANSWER
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

# Example synthetic dataset with multiple features
X, _ = make_regression(n_samples=100, n_features=3, noise=10, random_state=42)
X_df = pd.DataFrame(X, columns=['Feature1', 'Feature2', 'Feature3'])

# Calculate VIF
vif_data = pd.DataFrame()
vif_data["Feature"] = X_df.columns
vif_data["VIF"] = [variance_inflation_factor(X_df.values, i) for i in range(X_df.shape[1])]

print(vif_data)


In [None]:
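#QUESTION 15 ANSWER
# Generate single-feature data so the fitted curve can be drawn in 2D.
# (Assumed data: the X left over from the VIF cell has three features and
# does not match y, so plotting it against y would fail.)
X = np.random.rand(100, 1) * 10
y = 3 * X + np.random.randn(100, 1) * 2

# Polynomial Features
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)

# Model
poly_model = LinearRegression()
poly_model.fit(X_poly, y)

# Predictions
y_poly_pred = poly_model.predict(X_poly)

# Plot (sort by X so the curve is drawn left to right)
order = np.argsort(X[:, 0])
plt.scatter(X, y, label='Data points')
plt.plot(X[order], y_poly_pred[order], color='r', label='Polynomial regression curve')
plt.legend()
plt.title('Polynomial Regression (Degree 4)')
plt.show()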


In [None]:
#QUESTION 16 ANSWER
# Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)

print(f"R-squared score: {score:.2f}")


In [None]:
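#QUESTION 17 ANSWER
# Generate single-feature data so the fitted curve can be drawn in 2D.
# (Assumed data: this cell should not depend on whatever X and y
# were left over from earlier cells.)
X = np.random.rand(100, 1) * 10
y = 3 * X + np.random.randn(100, 1) * 2

# Polynomial Features
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

# Model
poly_model = LinearRegression()
poly_model.fit(X_poly, y)

# Predictions
y_poly_pred = poly_model.predict(X_poly)

# Plot (sort by X so the curve is drawn left to right)
order = np.argsort(X[:, 0])
plt.scatter(X, y, label='Data points')
plt.plot(X[order], y_poly_pred[order], color='r', label='Polynomial regression curve')
plt.legend()
plt.title('Polynomial Regression (Degree 3)')
plt.show()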


In [None]:
#QUESTION 18 ANSWER
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
model = LinearRegression()
model.fit(X_train, y_train)

# Results
print(f"R-squared score: {model.score(X_test, y_test):.2f}")
print(f"Coefficients: {model.coef_}")


In [None]:
#QUESTION 19 ANSWER
# Generate synthetic data
X = np.random.rand(100, 1) * 10
y = 4 * X + np.random.randn(100, 1) * 3

# Model
model = LinearRegression()
model.fit(X, y)

# Predictions
y_pred = model.predict(X)

# Plot
plt.scatter(X, y, label='Data points')
plt.plot(X, y_pred, color='r', label='Regression line')
plt.legend()
plt.title('Linear Regression Visualization')
plt.show()


In [None]:
#QUESTION 20 ANSWER
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=42)

# Model
model = LinearRegression()
model.fit(X, y)

# Results
print(f"R-squared score: {model.score(X, y):.2f}")
print(f"Coefficients: {model.coef_}")


In [None]:
#QUESTION 21 ANSWER
# Save the model
with open('linear_model.pkl', 'wb') as file:
    pickle.dump(model, file)

# Load the model
with open('linear_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Prediction
new_data = [[2.5, 3.5, 1.2]]
prediction = loaded_model.predict(new_data)
print(f"Prediction: {prediction}")


In [None]:
#QUESTION 22 ANSWER
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Load dataset
tips = sns.load_dataset('tips')

# One-hot encoding
column_transformer = ColumnTransformer([
    ('encoder', OneHotEncoder(), ['sex', 'smoker', 'day', 'time'])
], remainder='passthrough')

X = column_transformer.fit_transform(tips.drop('tip', axis=1))
y = tips['tip']

# Model
model = LinearRegression()
model.fit(X, y)

print(f"R-squared score: {model.score(X, y):.2f}")


In [None]:
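#QUESTION 23 ANSWER
from sklearn.linear_model import Ridge

# Fit both models on the same training split so the comparison is fair.
# (The `model` from the previous cell was trained on the tips data and
# cannot score this X_test, which has a different number of features.)
linear = LinearRegression()
linear.fit(X_train, y_train)

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Results
print("Linear Regression:")
print(f"Coefficients: {linear.coef_}")
print(f"R-squared: {linear.score(X_test, y_test):.2f}")

print("Ridge Regression:")
print(f"Coefficients: {ridge.coef_}")
print(f"R-squared: {ridge.score(X_test, y_test):.2f}")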


In [None]:
#QUESTION 24 ANSWER
from sklearn.model_selection import cross_val_score

# Cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print(f"Cross-validated R-squared scores: {scores}")
print(f"Mean R-squared score: {scores.mean():.2f}")


In [None]:
#QUESTION 25 ANSWER
from sklearn.model_selection import cross_val_score

# Cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print(f"Cross-validated R-squared scores: {scores}")
print(f"Mean R-squared score: {scores.mean():.2f}")

