In [None]:
THEORY QUESTIONS

1. What does R-squared represent in a regression model?
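R-squared, or the coefficient of determination, represents the proportion of the variance in the dependent variable that is predictable from the independent variables. For an ordinary least squares model with an intercept, evaluated on its training data, it ranges from 0 to 1, where 1 indicates that the model explains all the variability of the response data around its mean. On held-out data it can be negative when the model predicts worse than simply using the mean.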
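
As a minimal sketch of that definition, R-squared can be computed by hand as 1 - SS_res / SS_tot. The function name `r_squared` and the toy numbers are illustrative assumptions, not taken from any dataset used later.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(y_true, y_true)                  # predictions match exactly
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])   # always predict the mean
worse = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])      # worse than predicting the mean

print(perfect, baseline, worse)
```

The third case shows why a negative score is possible once predictions are worse than the mean of the observed values.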

2. What are the assumptions of linear regression?
The key assumptions are:
Linearity: The relationship between the independent and dependent variables is linear.
Independence: The residuals (errors) are independent.
Homoscedasticity: The residuals have constant variance at every level of the independent variable.
Normality: The residuals are normally distributed.
No multicollinearity: The independent variables are not highly correlated.

3. What is the difference between R-squared and Adjusted R-squared?
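Adjusted R-squared adjusts the R-squared value for the number of predictors in the model. R-squared never decreases when a predictor is added, whereas Adjusted R-squared rises only when the new predictor improves the fit by more than its penalty, which makes it a fairer basis for comparing models with different numbers of predictors.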

4. Why do we use Mean Squared Error (MSE)?
MSE is used to measure the average of the squares of the errors. It indicates how well the regression line fits the data points. Lower MSE values indicate better fit.

5. What does an Adjusted R-squared value of 0.85 indicate?
An Adjusted R-squared value of 0.85 indicates that 85% of the variance in the dependent variable is explained by the model, adjusted for the number of predictors.

6. How do we check for normality of residuals in linear regression?
Normality of residuals can be checked using visual tools like histograms, Q-Q plots, and statistical tests like the Shapiro-Wilk test.
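
One way to put a number on a Q-Q plot is to correlate the sorted residuals with standard-normal quantiles. The sketch below does that with the standard library only; the function name `qq_correlation`, the plotting positions (i + 0.5) / n, and the toy residuals are assumptions chosen for illustration. It is not the Shapiro-Wilk test, which is available in SciPy as `scipy.stats.shapiro`.

```python
import math
from statistics import NormalDist

def qq_correlation(residuals):
    """Correlation between sorted residuals and standard-normal quantiles.

    Values close to 1 mean the points of a Q-Q plot lie near a straight line.
    """
    n = len(residuals)
    observed = sorted(residuals)
    # Theoretical quantiles at plotting positions (i + 0.5) / n
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mean_o = sum(observed) / n
    mean_t = sum(theoretical) / n
    cov = sum((o - mean_o) * (t - mean_t) for o, t in zip(observed, theoretical))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_t = sum((t - mean_t) ** 2 for t in theoretical)
    return cov / math.sqrt(var_o * var_t)

# Deterministic toy residuals: one set is exactly normal-shaped, one is right-skewed
normal_like = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [math.exp(r) for r in normal_like]

print(qq_correlation(normal_like))  # very close to 1
print(qq_correlation(skewed))       # noticeably lower
```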

7. What is multicollinearity, and how does it impact regression?
Multicollinearity occurs when independent variables are highly correlated. It can inflate the variance of the coefficient estimates and make the model unstable.
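
The variance inflation can be made concrete with the variance inflation factor (VIF). The sketch below covers only the two-predictor case, where VIF reduces to 1 / (1 - r^2) with r the correlation between the predictors; the function name and the toy data are assumptions for illustration.

```python
def vif_two_predictors(x1, x2):
    """VIF for either predictor when the model has exactly two predictors.

    With two predictors, the R-squared from regressing one on the other
    equals their squared Pearson correlation, so VIF = 1 / (1 - r**2).
    """
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    ss1 = sum((a - m1) ** 2 for a in x1)
    ss2 = sum((b - m2) ** 2 for b in x2)
    r_sq = cov ** 2 / (ss1 * ss2)
    return 1 / (1 - r_sq)

x1 = [1.0, 2.0, 3.0, 4.0]
uncorrelated = [1.0, -1.0, -1.0, 1.0]   # zero correlation with x1
nearly_copy = [1.1, 1.9, 3.2, 3.8]      # almost a copy of x1

print(vif_two_predictors(x1, uncorrelated))  # no inflation
print(vif_two_predictors(x1, nearly_copy))   # strong inflation
```

A VIF of 1 means the coefficient variance is not inflated at all; the nearly duplicated predictor inflates it by a factor well above 10.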

8. What is Mean Absolute Error (MAE)?
MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation.

9. What are the benefits of using an ML pipeline?
ML pipelines automate the workflow for machine learning models, ensuring consistent and efficient processing of data, transformation, and modeling.

10. Why is RMSE considered more interpretable than MSE?
RMSE is in the same units as the dependent variable, making it easier to interpret compared to MSE, which is in squared units.
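
The unit difference is easy to see on a small example. This sketch computes MSE, MAE, and RMSE by hand; the prices are hypothetical values in dollars, chosen only for illustration.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared errors (units: target units squared)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute errors (units: target units)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of MSE (units: target units)."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical prices in dollars
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 370.0]

print(mse(actual, predicted))   # in dollars squared
print(mae(actual, predicted))   # in dollars
print(rmse(actual, predicted))  # in dollars
```

Here the MSE of 500 is in squared dollars, while the RMSE of roughly 22.4 and the MAE of 20 can both be read directly as dollar amounts.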

11. What is pickling in Python, and how is it useful in ML?
Pickling is a way to serialize and deserialize Python objects. It’s useful for saving ML models and other data structures to disk.
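
A minimal round trip looks like the sketch below. The dictionary `model_state` is a hypothetical stand-in for a fitted model; any picklable Python object is handled the same way.

```python
import pickle

# A stand-in for a fitted model: any picklable Python object works the same way
model_state = {"coef": [3.0, 2.0], "intercept": 4.0}

payload = pickle.dumps(model_state)   # serialize to bytes
restored = pickle.loads(payload)      # deserialize back to an object

print(type(payload).__name__)         # bytes
print(restored == model_state)        # True: same contents, new object
```

Writing to disk uses `pickle.dump` and `pickle.load` with a file opened in binary mode instead of the in-memory `dumps` and `loads` shown here.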

12. What does a high R-squared value mean?
A high R-squared value indicates that the model explains a large portion of the variance in the dependent variable.

13. What happens if linear regression assumptions are violated?
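Violating assumptions can lead to biased estimates, invalid statistical tests, and unreliable predictions. For example, a non-linear relationship biases the fitted coefficients, while non-constant variance or correlated errors make standard errors, confidence intervals, and p-values unreliable.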

14. How can we address multicollinearity in regression?
Multicollinearity can be addressed by removing or combining correlated predictors, using principal component analysis, or ridge regression.

15. Why do we use pipelines in machine learning?
Pipelines streamline the process of data transformation and model training, ensuring reproducibility and reducing the chances of data leakage.

16. How is Adjusted R-squared calculated?
Formula:
Adjusted R-squared = 1 - [(1 - R^2) * (n - 1) / (n - k - 1)]

where:
R^2 is the R-squared value
n is the number of data points
k is the number of independent variables
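
The formula translates directly into a small helper. The function name and the example values (R^2 = 0.90, 101 data points, 5 predictors) are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n data points and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 for the formula to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical values: R^2 = 0.90, 101 data points, 5 predictors
value = adjusted_r_squared(0.90, n=101, k=5)
print(round(value, 4))
```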


17. Why is MSE sensitive to outliers?
MSE squares the errors, giving more weight to larger errors, thus making it sensitive to outliers.
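
A small comparison makes the effect visible. In this sketch one of four errors is made ten times larger, and the resulting change in MSE is compared with the change in MAE; the error values are made up for illustration.

```python
errors_clean = [1.0, -1.0, 2.0, -2.0]
errors_outlier = [1.0, -1.0, 2.0, -20.0]  # last error is ten times larger

def mean_squared(errors):
    return sum(e ** 2 for e in errors) / len(errors)

def mean_absolute(errors):
    return sum(abs(e) for e in errors) / len(errors)

# How much each metric grows once the outlier is present
mse_ratio = mean_squared(errors_outlier) / mean_squared(errors_clean)
mae_ratio = mean_absolute(errors_outlier) / mean_absolute(errors_clean)

print(mse_ratio)  # MSE grows by a factor of about 40.6
print(mae_ratio)  # MAE grows by a factor of 4.0
```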

18. What is the role of homoscedasticity in linear regression?
Homoscedasticity ensures that the variance of errors is constant across all levels of the independent variables, which is essential for valid statistical inference.
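
A rough numeric companion to a residual plot is to compare the residual spread for low and high fitted values. The sketch below is a crude heuristic under stated assumptions (the function name `spread_ratio`, the half-and-half split, and the toy residuals are all illustrative); it is not a formal test.

```python
def spread_ratio(fitted, residuals):
    """Mean squared residual in the upper half of fitted values over the lower half.

    A ratio near 1 is consistent with constant variance; a ratio far from 1
    hints at heteroscedasticity.
    """
    pairs = sorted(zip(fitted, residuals))  # order residuals by fitted value
    half = len(pairs) // 2
    low = [r ** 2 for _, r in pairs[:half]]
    high = [r ** 2 for _, r in pairs[-half:]]
    return (sum(high) / len(high)) / (sum(low) / len(low))

fitted = [float(i) for i in range(1, 11)]
# Residuals with the same magnitude everywhere
constant = [0.5 if i % 2 else -0.5 for i in range(1, 11)]
# Residuals whose magnitude grows with the fitted value
growing = [0.1 * i * (1 if i % 2 else -1) for i in range(1, 11)]

print(spread_ratio(fitted, constant))  # 1.0
print(spread_ratio(fitted, growing))   # about 6
```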

19. What is Root Mean Squared Error (RMSE)?
RMSE is the square root of the average of squared differences between predicted and observed values, providing a measure of how well the model fits the data.

20. Why is pickling considered risky?
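Unpickling can execute arbitrary code: a pickle file that has been tampered with, or that comes from an untrusted source, can run attacker-chosen functions as soon as it is loaded. Only load pickle files you created yourself or obtained from a trusted source.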
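
The risk can be demonstrated harmlessly. In this sketch the class `Sneaky` (a hypothetical name) uses `__reduce__` to tell the loader to call `print`; an attacker could name any importable callable in the same position.

```python
import pickle

class Sneaky:
    """Object whose pickle tells the loader to call a function."""
    def __reduce__(self):
        # pickle.loads will call print(...) instead of rebuilding a Sneaky.
        # print is harmless; the same mechanism can call any importable callable.
        return (print, ("this ran during unpickling",))

payload = pickle.dumps(Sneaky())
result = pickle.loads(payload)   # prints the message as a side effect

print(type(result).__name__)     # NoneType: no Sneaky object came back
```

Loading the bytes was enough to run the call; nothing had to be done with the returned object.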

21. What alternatives exist to pickling for saving ML models?
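Alternatives include Joblib, which is efficient for objects holding large NumPy arrays but is built on pickle and carries the same risk when loading untrusted files, and model exchange formats such as ONNX and PMML, which store a description of the model rather than arbitrary Python objects.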

22. What is heteroscedasticity, and why is it a problem?
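Heteroscedasticity occurs when the variance of errors differs across levels of the independent variables. The OLS coefficient estimates remain unbiased, but they are no longer efficient, and the usual standard errors are biased, which makes confidence intervals and hypothesis tests unreliable.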

23. How does adding irrelevant predictors affect R-squared and Adjusted R-squared?
Adding irrelevant predictors can increase R-squared but decrease Adjusted R-squared, as the latter penalizes unnecessary complexity.
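
A quick numeric illustration, using hypothetical fits: suppose a fourth, irrelevant predictor nudges R-squared from 0.800 to 0.801 on 50 data points.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n data points and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 50
# Hypothetical fits: the extra predictor raises R^2 only from 0.800 to 0.801
before = adjusted_r_squared(0.800, n, k=3)
after = adjusted_r_squared(0.801, n, k=4)

print(round(before, 4), round(after, 4))
print(after < before)  # True: the tiny gain does not cover the penalty
```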





In [None]:
PRACTICAL QUESTIONS

1.
# Import necessary libraries
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Drop rows with missing values
diamonds = diamonds.dropna()

# Select features and target variable
X = diamonds[['carat', 'depth', 'table']]  # Example features
y = diamonds['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

In [None]:
2.

# Import necessary libraries
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import seaborn as sns
import numpy as np

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Drop rows with missing values
diamonds = diamonds.dropna()

# Select features and target variable
X = diamonds[['carat', 'depth', 'table']]  # Example features
y = diamonds['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Print the results
print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


In [None]:
3.
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds').dropna()

# Select features and target variable
X = diamonds[['carat', 'depth', 'table']]
y = diamonds['price']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Check linearity
plt.scatter(y_test, y_pred)
plt.title('Linearity Check')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()

# Check homoscedasticity
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.title('Homoscedasticity Check')
plt.xlabel('Predicted Prices')
plt.ylabel('Residuals')
plt.show()

# Check multicollinearity
X_vif = add_constant(X)
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif["features"] = X_vif.columns
print(vif)


In [None]:
4.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Create the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred_pipeline = pipeline.predict(X_test)
r2 = r2_score(y_test, y_pred_pipeline)
print(f"Pipeline R-squared Score: {r2}")


In [None]:
5.
# Simple Linear Regression Example using tips dataset
tips = sns.load_dataset('tips')

# Features and target variable
X_tips = tips[['total_bill']]
y_tips = tips['tip']

# Split the data
X_train_tips, X_test_tips, y_train_tips, y_test_tips = train_test_split(X_tips, y_tips, test_size=0.2, random_state=42)

# Fit the linear regression model
model_tips = LinearRegression()
model_tips.fit(X_train_tips, y_train_tips)

# Print the model's coefficients, intercept, and R-squared score
print(f"Coefficient: {model_tips.coef_[0]}")
print(f"Intercept: {model_tips.intercept_}")
print(f"R-squared Score: {model_tips.score(X_test_tips, y_test_tips)}")


In [None]:
6.
# Fit the linear regression model
model_tips = LinearRegression()
model_tips.fit(X_tips, y_tips)

# Print the slope and intercept
print(f"Slope: {model_tips.coef_[0]}")
print(f"Intercept: {model_tips.intercept_}")


In [None]:
7.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(42)
X_synthetic = 2 * np.random.rand(100, 1)
y_synthetic = 4 + 3 * X_synthetic + np.random.randn(100, 1)

# Fit the linear regression model
model = LinearRegression()
model.fit(X_synthetic, y_synthetic)

# Predict new values
X_new = np.array([[0], [2]])
y_new = model.predict(X_new)

# Plot the data points and regression line
plt.scatter(X_synthetic, y_synthetic, color='blue')
plt.plot(X_new, y_new, color='red', linewidth=2)
plt.title('Linear Regression on Synthetic Data')
plt.xlabel('X')
plt.ylabel('y')
plt.show()

# Print coefficients
print(f"Coefficient: {model.coef_[0][0]}")
print(f"Intercept: {model.intercept_[0]}")


In [None]:
8.
import pickle

# Save the trained model using pickle
with open('linear_regression_model.pkl', 'wb') as f:
    pickle.dump(model, f)

print("Model saved as 'linear_regression_model.pkl'")


In [None]:
9.
from sklearn.preprocessing import PolynomialFeatures

# Generate synthetic data
X_poly_synthetic = np.random.rand(100, 1) * 10
y_poly_synthetic = 3 * X_poly_synthetic ** 2 + 2 * X_poly_synthetic + 1 + np.random.randn(100, 1) * 10

# Transform the data for polynomial regression
poly_features = PolynomialFeatures(degree=2)
X_poly = poly_features.fit_transform(X_poly_synthetic)

# Fit the polynomial regression model
poly_model = LinearRegression()
poly_model.fit(X_poly, y_poly_synthetic)

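# Plot the polynomial regression curve (sorted by X so the line is drawn left to right)
order = X_poly_synthetic[:, 0].argsort()
plt.scatter(X_poly_synthetic, y_poly_synthetic, color='blue')
plt.plot(X_poly_synthetic[order], poly_model.predict(X_poly)[order], color='red')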
plt.title('Polynomial Regression (Degree 2)')
plt.xlabel('X')
plt.ylabel('y')
plt.show()


In [None]:
10.
# Generate synthetic data
X_lin = 2 * np.random.rand(100, 1)
y_lin = 4 + 3 * X_lin + np.random.randn(100, 1)

# Fit the linear regression model
model_lin = LinearRegression()
model_lin.fit(X_lin, y_lin)

# Plot the data and regression line
plt.scatter(X_lin, y_lin, color='blue')
plt.plot(X_lin, model_lin.predict(X_lin), color='red')
plt.title('Simple Linear Regression on Synthetic Data')
plt.xlabel('X')
plt.ylabel('y')
plt.show()

# Print coefficients
print(f"Coefficient: {model_lin.coef_[0][0]}")
print(f"Intercept: {model_lin.intercept_[0]}")


In [None]:
11.
# Generate synthetic data
X_poly3 = np.random.rand(100, 1) * 10
y_poly3 = 2 * X_poly3 ** 3 + 3 * X_poly3 ** 2 + 4 * X_poly3 + 5 + np.random.randn(100, 1) * 50

# Transform the data for polynomial regression (degree 3)
poly_features3 = PolynomialFeatures(degree=3)
X_poly3_transformed = poly_features3.fit_transform(X_poly3)

# Fit the polynomial regression model
poly_model3 = LinearRegression()
poly_model3.fit(X_poly3_transformed, y_poly3)

# Plot the polynomial regression curve (sorted by X so the line is drawn left to right)
order3 = X_poly3[:, 0].argsort()
plt.scatter(X_poly3, y_poly3, color='blue')
plt.plot(X_poly3[order3], poly_model3.predict(X_poly3_transformed)[order3], color='red')
plt.title('Polynomial Regression (Degree 3)')
plt.xlabel('X')
plt.ylabel('y')
plt.show()


In [None]:
12.
# Generate synthetic data with two features
X_two_features = np.random.rand(100, 2) * 10
y_two_features = 3 * X_two_features[:, 0] + 2 * X_two_features[:, 1] + 4 + np.random.randn(100)

# Fit the linear regression model
model_two_features = LinearRegression()
model_two_features.fit(X_two_features, y_two_features)

# Print the coefficients, intercept, and R-squared score
print(f"Coefficients: {model_two_features.coef_}")
print(f"Intercept: {model_two_features.intercept_}")
print(f"R-squared Score: {model_two_features.score(X_two_features, y_two_features)}")


In [None]:
13.
# Generate synthetic data
X_synth = 2 * np.random.rand(100, 1)
y_synth = 5 + 3 * X_synth + np.random.randn(100, 1)

# Fit the linear regression model
model_synth = LinearRegression()
model_synth.fit(X_synth, y_synth)

# Predict on the synthetic dataset
y_pred_synth = model_synth.predict(X_synth)

# Calculate MSE, MAE, and RMSE
mse_synth = mean_squared_error(y_synth, y_pred_synth)
mae_synth = mean_absolute_error(y_synth, y_pred_synth)
rmse_synth = np.sqrt(mse_synth)

# Print the results
print(f"Mean Squared Error (MSE): {mse_synth}")
print(f"Mean Absolute Error (MAE): {mae_synth}")
print(f"Root Mean Squared Error (RMSE): {rmse_synth}")


In [None]:
14.
# Generate synthetic data with multiple features
X_vif_synthetic = np.random.rand(100, 5) * 10
X_vif_synthetic_df = pd.DataFrame(X_vif_synthetic, columns=['Feature1', 'Feature2', 'Feature3', 'Feature4', 'Feature5'])

# Add constant for VIF calculation
X_vif_synthetic_df = add_constant(X_vif_synthetic_df)

# Calculate VIF for each feature
vif_df = pd.DataFrame()
vif_df['VIF Factor'] = [variance_inflation_factor(X_vif_synthetic_df.values, i) for i in range(X_vif_synthetic_df.shape[1])]
vif_df['Feature'] = X_vif_synthetic_df.columns
print(vif_df)


In [None]:
15.
# Generate synthetic data for a degree 4 polynomial relationship
X_poly4 = np.random.rand(100, 1) * 10
y_poly4 = 1 + 2 * X_poly4 + 3 * X_poly4 ** 2 + 4 * X_poly4 ** 3 + 5 * X_poly4 ** 4 + np.random.randn(100, 1) * 50

# Transform the data for polynomial regression (degree 4)
poly_features4 = PolynomialFeatures(degree=4)
X_poly4_transformed = poly_features4.fit_transform(X_poly4)

# Fit the polynomial regression model
poly_model4 = LinearRegression()
poly_model4.fit(X_poly4_transformed, y_poly4)

# Plot the polynomial regression curve (sorted by X so the line is drawn left to right)
order4 = X_poly4[:, 0].argsort()
plt.scatter(X_poly4, y_poly4, color='blue')
plt.plot(X_poly4[order4], poly_model4.predict(X_poly4_transformed)[order4], color='red')
plt.title('Polynomial Regression (Degree 4)')
plt.xlabel('X')
plt.ylabel('y')
plt.show()


In [None]:
16.
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generate synthetic data for multiple linear regression
X_multi, y_multi = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Split the data
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_multi, y_multi, test_size=0.2, random_state=42)

# Create the pipeline
pipeline_multi = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit the pipeline
pipeline_multi.fit(X_train_multi, y_train_multi)

# Predict and evaluate
r2_multi = pipeline_multi.score(X_test_multi, y_test_multi)
print(f"Pipeline R-squared Score: {r2_multi}")


In [None]:
17.
# Generate synthetic data for a degree 3 polynomial relationship
X_poly3_synth = np.random.rand(100, 1) * 10
y_poly3_synth = 1 + 2 * X_poly3_synth + 3 * X_poly3_synth ** 2 + 4 * X_poly3_synth ** 3 + np.random.randn(100, 1) * 50

# Transform the data for polynomial regression (degree 3)
poly_features3_synth = PolynomialFeatures(degree=3)
X_poly3_synth_transformed = poly_features3_synth.fit_transform(X_poly3_synth)

# Fit the polynomial regression model
poly_model3_synth = LinearRegression()
poly_model3_synth.fit(X_poly3_synth_transformed, y_poly3_synth)

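# Plot the polynomial regression curve (sorted by X so the line is drawn left to right)
order3_synth = X_poly3_synth[:, 0].argsort()
plt.scatter(X_poly3_synth, y_poly3_synth, color='blue')
plt.plot(X_poly3_synth[order3_synth], poly_model3_synth.predict(X_poly3_synth_transformed)[order3_synth], color='red')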
plt.title('Polynomial Regression (Degree 3)')
plt.xlabel('X')
plt.ylabel('y')
plt.show()


In [None]:
18.
# Generate synthetic data for multiple linear regression
X_synth_multi, y_synth_multi = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Fit the multiple linear regression model
model_multi = LinearRegression()
model_multi.fit(X_synth_multi, y_synth_multi)

# Print the R-squared score and model coefficients
print(f"R-squared Score: {model_multi.score(X_synth_multi, y_synth_multi)}")
print(f"Coefficients: {model_multi.coef_}")


In [None]:
19.
# Generate synthetic data for simple linear regression
X_final = 2 * np.random.rand(100, 1)
y_final = 5 + 2 * X_final + np.random.randn(100, 1)

# Fit the linear regression model
model_final = LinearRegression()
model_final.fit(X_final, y_final)

# Predict on the synthetic dataset
y_pred_final = model_final.predict(X_final)

# Plot the data points and regression line
plt.scatter(X_final, y_final, color='blue', label='Data points')
plt.plot(X_final, y_pred_final, color='red', label='Regression line')
plt.title('Simple Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

# Print the model's coefficient and intercept
print(f"Coefficient: {model_final.coef_[0][0]}")
print(f"Intercept: {model_final.intercept_[0]}")


In [None]:
20.
from sklearn.datasets import make_regression

# Generate synthetic data with 3 features
X_3_features, y_3_features = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=42)

# Fit the multiple linear regression model
model_3_features = LinearRegression()
model_3_features.fit(X_3_features, y_3_features)

# Print the R-squared score and coefficients
print(f"R-squared Score: {model_3_features.score(X_3_features, y_3_features)}")
print(f"Coefficients: {model_3_features.coef_}")


In [None]:
21.
# Save the model to a file
with open('linear_regression_model.pkl', 'wb') as file:
    pickle.dump(model_3_features, file)

# Load the model from the file
with open('linear_regression_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Predict using the loaded model
predictions = loaded_model.predict(X_3_features[:5])
print("Predictions on the first 5 samples:", predictions)


In [None]:
22.
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Load the 'tips' dataset
tips = sns.load_dataset('tips')

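# One-hot encode the categorical features
# (sparse_output replaced the older `sparse` argument in scikit-learn 1.2)
encoder = OneHotEncoder(drop='first', sparse_output=False)
categorical_cols = ['sex', 'smoker', 'day', 'time']
encoded_features = encoder.fit_transform(tips[categorical_cols])

# Keep string column names and the original index so the frames align
encoded_df = pd.DataFrame(
    encoded_features,
    columns=encoder.get_feature_names_out(categorical_cols),
    index=tips.index,
)

# Create the feature matrix and target vector
X_tips = pd.concat([encoded_df, tips[['total_bill', 'size']]], axis=1)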
y_tips = tips['tip']

# Fit the linear regression model
model_tips = LinearRegression()
model_tips.fit(X_tips, y_tips)

# Print the R-squared score and coefficients
print(f"R-squared Score: {model_tips.score(X_tips, y_tips)}")
print(f"Coefficients: {model_tips.coef_}")


In [None]:
23.
from sklearn.linear_model import Ridge

# Generate synthetic data
X_ridge, y_ridge = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Fit the Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_ridge, y_ridge)

# Fit the Ridge Regression model
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_ridge, y_ridge)

# Print the R-squared scores and coefficients
print(f"Linear Regression R-squared: {linear_model.score(X_ridge, y_ridge)}")
print(f"Linear Regression Coefficients: {linear_model.coef_}")

print(f"Ridge Regression R-squared: {ridge_model.score(X_ridge, y_ridge)}")
print(f"Ridge Regression Coefficients: {ridge_model.coef_}")


In [None]:
24.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Create a synthetic dataset
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Initialize the model
model = LinearRegression()

# Perform cross-validation and calculate the R-squared score for each fold
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

# Print the cross-validation R-squared scores and the mean score
print(f"Cross-validation R-squared scores: {cv_scores}")
print(f"Mean R-squared score: {np.mean(cv_scores)}")


In [None]:
25.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression

# Create a synthetic dataset
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Evaluate polynomial regression models for different degrees
for degree in range(1, 6):
    # Transform features into polynomial features
    poly = PolynomialFeatures(degree)
    X_poly_train = poly.fit_transform(X_train)
    X_poly_test = poly.transform(X_test)

    # Initialize the model
    model = LinearRegression()

    # Train the model
    model.fit(X_poly_train, y_train)

    # Make predictions
    y_pred = model.predict(X_poly_test)

    # Calculate the R-squared score
    r2 = r2_score(y_test, y_pred)
    print(f"Degree {degree} Polynomial Regression R-squared: {r2}")


In [None]:
26.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import make_regression

# Create a synthetic dataset with multiple features
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=42)

# Add degree-2 terms: with interaction_only=False this includes squared terms
# as well as pairwise interaction terms (set interaction_only=True for interactions alone)
poly = PolynomialFeatures(degree=2, interaction_only=False)
X_poly = poly.fit_transform(X)

# Initialize the linear regression model
model = LinearRegression()

# Fit the model with the transformed features
model.fit(X_poly, y)

# Print the coefficients (including interaction terms)
print("Coefficients with interaction terms:")
print(model.coef_)

# Optionally, print the names of the features (including interaction terms)
print("Feature names:")
print(poly.get_feature_names_out())
