<a href="https://colab.research.google.com/github/thepersonuadmire/MLLinear/blob/main/ML_L2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Theoretical**

1. What does R-squared represent in a regression model?

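R-squared, or the coefficient of determination, represents the proportion of the variance in the dependent variable that is explained by the independent variables in the model. For a least-squares model with an intercept, evaluated on its training data, it ranges from 0 to 1, and a higher value indicates a closer fit. On held-out data it can be negative when the model predicts worse than simply using the mean of the target.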
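
The definition can be checked numerically. The sketch below uses a hypothetical synthetic dataset (not one from this notebook), computes R-squared as one minus the ratio of residual to total sum of squares, and compares it with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical synthetic data: y depends linearly on one feature plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 2.0, size=200)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

ss_res = np.sum((y - y_pred) ** 2)     # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R-squared: {r2_manual:.4f}")
print(f"sklearn r2_score: {r2_score(y, y_pred):.4f}")
```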

2. What are the assumptions of linear regression?

The main assumptions of linear regression include:

Linearity: The relationship between the independent and dependent variables is linear.

Independence: The residuals (errors) are independent.

Homoscedasticity: The residuals have constant variance at all levels of the independent variables.

Normality: The residuals are normally distributed.

No multicollinearity: The independent variables are not highly correlated with each other.

3. What is the difference between R-squared and Adjusted R-squared?

R-squared measures the proportion of variance explained by the model, while Adjusted R-squared adjusts R-squared for the number of predictors in the model. Adjusted R-squared can decrease if adding more predictors does not improve the model, making it a better metric for comparing models with different numbers of predictors.

4. Why do we use Mean Squared Error (MSE)?

MSE is used to measure the average squared difference between the predicted and actual values. It provides a way to quantify the model's prediction error, with larger errors being penalized more due to the squaring of differences.

5. What does an Adjusted R-squared value of 0.85 indicate?

An Adjusted R-squared value of 0.85 indicates that 85% of the variance in the dependent variable can be explained by the independent variables in the model, after adjusting for the number of predictors. This suggests a strong model fit.

6. How do we check for normality of residuals in linear regression?

Normality of residuals can be checked using:

Q-Q plots (quantile-quantile plots)

Histogram of residuals

Statistical tests such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test.
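
As a rough illustration of the statistical-test option, the sketch below runs `scipy.stats.shapiro` on two hypothetical residual vectors, one drawn from a normal distribution and one from a skewed distribution. The arrays are stand-ins for real model residuals; a small p-value suggests a departure from normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_resid = rng.normal(0, 1, size=500)            # stand-in for well-behaved residuals
skewed_resid = rng.exponential(1.0, size=500) - 1.0  # stand-in for skewed residuals

for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    stat, p_value = stats.shapiro(resid)
    print(f"{name}: W={stat:.3f}, p-value={p_value:.4g}")

# Keep the p-value of the skewed sample for inspection
p_skewed = stats.shapiro(skewed_resid)[1]
```

With large samples the test flags even small deviations, so it is usually read alongside a Q-Q plot rather than on its own.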

7. What is multicollinearity, and how does it impact regression?

 Multicollinearity occurs when two or more independent variables in a regression model are highly correlated. It can lead to unreliable coefficient estimates, inflated standard errors, and difficulties in determining the effect of each predictor.
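
One common way to quantify this is the Variance Inflation Factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The sketch below computes it by hand on hypothetical data in which `x2` is nearly a copy of `x1`; the helper `vif` is illustrative, not a library function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)               # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Regress column j on the other columns and return 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    r2_j = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2_j)

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"VIF({name}) = {value:.1f}")
```

The two correlated columns should show very large values, while the independent column stays close to 1.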

8. What is Mean Absolute Error (MAE)?

MAE is a measure of prediction accuracy that calculates the average absolute difference between predicted and actual values. It is less sensitive to outliers compared to MSE.

9. What are the benefits of using an ML pipeline?

Benefits of using an ML pipeline include:

Streamlining the workflow from data preprocessing to model deployment.

Ensuring reproducibility and consistency in model training and evaluation.

Facilitating collaboration among team members.

Simplifying the process of model updates and maintenance.

10. Why is RMSE considered more interpretable than MSE?

RMSE (Root Mean Squared Error) is considered more interpretable than MSE because it is in the same units as the dependent variable, making it easier to understand the magnitude of the prediction errors.
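
A small worked example makes the unit difference concrete. The prices below are hypothetical values in dollars, so the MSE comes out in squared dollars while the RMSE is back in dollars.

```python
import numpy as np

# Hypothetical house prices in dollars
y_true = np.array([200_000, 250_000, 310_000, 180_000], dtype=float)
y_pred = np.array([210_000, 240_000, 300_000, 200_000], dtype=float)

mse = np.mean((y_true - y_pred) ** 2)   # units: dollars squared
rmse = np.sqrt(mse)                     # units: dollars

print(f"MSE:  {mse:,.0f}")    # prints "MSE:  175,000,000"
print(f"RMSE: {rmse:,.2f}")   # prints "RMSE: 13,228.76"
```

An RMSE of about 13,229 dollars can be compared directly with the prices; the MSE of 175,000,000 cannot.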

11. What is pickling in Python, and how is it useful in ML?

Pickling is the process of serializing Python objects into a byte stream, allowing them to be saved to a file and later deserialized back into the original object. In machine learning, it is useful for saving trained models for later use without needing to retrain them.
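
A minimal round trip, assuming a small synthetic dataset and an in-memory byte string instead of a file, might look like this. The restored model should give the same predictions as the original.

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.5 * X[:, 0] + rng.normal(size=50)
model = LinearRegression().fit(X, y)

blob = pickle.dumps(model)      # serialize to bytes (pickle.dump writes to a file instead)
restored = pickle.loads(blob)   # rebuild an equivalent object; only do this with trusted data

same = np.allclose(model.predict(X), restored.predict(X))
print(f"Predictions identical after round trip: {same}")
```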

12. What does a high R-squared value mean?

A high R-squared value indicates that a large proportion of the variance in the dependent variable is explained by the independent variables in the model, suggesting a good fit. However, it does not imply causation or that the model is the best one.

13. What happens if linear regression assumptions are violated?

If linear regression assumptions are violated, it can lead to biased or inefficient estimates of the coefficients, incorrect conclusions about the significance of predictors, and unreliable predictions.



14. How can we address multicollinearity in regression?

Multicollinearity can be addressed by:

Removing highly correlated predictors.

Combining correlated variables into a single predictor (e.g., using PCA).

Using regularization techniques like Ridge or Lasso regression.
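
To illustrate the regularization option, the sketch below fits ordinary least squares and Ridge on hypothetical data with two almost identical predictors. The `alpha=1.0` value is an arbitrary choice for the example; in practice it would be tuned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(f"OLS coefficients:   {ols.coef_}")
print(f"Ridge coefficients: {ridge.coef_}")
print(f"OLS coefficient norm:   {np.linalg.norm(ols.coef_):.3f}")
print(f"Ridge coefficient norm: {np.linalg.norm(ridge.coef_):.3f}")
```

The OLS coefficients can take large offsetting values because the data cannot separate the two predictors, whereas the L2 penalty pulls the Ridge coefficients toward a smaller, more stable solution.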

15. How can feature selection improve model performance in regression analysis?

Feature selection can improve model performance by reducing overfitting, enhancing model interpretability, and decreasing computational cost. It helps in identifying the most relevant predictors that contribute to the model's predictive power.
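
One simple approach is univariate selection. The sketch below assumes a hypothetical dataset with ten features, of which only the first two influence the target, and uses scikit-learn's `SelectKBest` with `f_regression` to keep the two highest-scoring features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
# Only columns 0 and 1 carry signal; the other eight are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = np.flatnonzero(selector.get_support())

print(f"Selected feature indices: {selected}")
X_reduced = selector.transform(X)   # keeps only the selected columns
print(f"Reduced shape: {X_reduced.shape}")
```

Univariate scores ignore interactions between features, so this is a starting point rather than a complete selection strategy.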

16. How is Adjusted R-squared calculated?

Adjusted R-squared is calculated using the formula:

$$
\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
$$

where $R^2$ is the R-squared value, $n$ is the number of observations, and $p$ is the number of predictors in the model. This adjustment accounts for the number of predictors, providing a more accurate measure of model fit when comparing models with different numbers of predictors.
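
The formula translates directly into a few lines of Python. The helper name `adjusted_r2` and the input values are illustrative; the example shows how the same raw R-squared is penalized more heavily as the number of predictors grows.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw R-squared of 0.90 on 100 observations, different numbers of predictors
few = adjusted_r2(0.90, n=100, p=5)
many = adjusted_r2(0.90, n=100, p=40)

print(f"p=5:  {few:.4f}")    # prints "p=5:  0.8947"
print(f"p=40: {many:.4f}")   # prints "p=40: 0.8322"
```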

17. Why is MSE sensitive to outliers?

 MSE is sensitive to outliers because it squares the differences between predicted and actual values. This squaring means that larger errors have a disproportionately large effect on the overall error metric, which can skew the results and lead to misleading interpretations of model performance.
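
The effect is easy to see with made-up numbers. In the sketch below every prediction is off by 1, and then a single prediction is changed to be off by 20; MSE and MAE are computed for both cases.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])   # every prediction is off by 1
y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] = 32.0                            # one prediction is now off by 20

def mse(a, b):
    return np.mean((a - b) ** 2)

def mae(a, b):
    return np.mean(np.abs(a - b))

print(f"MSE without outlier: {mse(y_true, y_pred):.1f}")          # 1.0
print(f"MSE with outlier:    {mse(y_true, y_pred_outlier):.1f}")  # 80.8
print(f"MAE without outlier: {mae(y_true, y_pred):.1f}")          # 1.0
print(f"MAE with outlier:    {mae(y_true, y_pred_outlier):.1f}")  # 4.8
```

One bad prediction multiplies the MSE by about 81 but the MAE by only 4.8.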

18. What is the role of homoscedasticity in linear regression?

Homoscedasticity refers to the condition where the variance of the residuals is constant across all levels of the independent variables. It matters because the usual OLS standard errors assume it: when it is violated (heteroscedasticity), the coefficient estimates remain unbiased but are no longer efficient, and hypothesis tests and confidence intervals become unreliable.

19. What is Root Mean Squared Error (RMSE)?

 RMSE is the square root of the average of the squared differences between predicted and actual values. It provides a measure of how well a model predicts the outcome variable, with lower values indicating better model performance.

20. Why is pickling considered risky?

Pickling can be considered risky because it can lead to security vulnerabilities if untrusted data is deserialized. Additionally, changes in the code or libraries used to create the pickled objects can result in compatibility issues when trying to unpickle the data later.
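
The security concern comes from the fact that unpickling can call functions named inside the byte stream. The deliberately harmless sketch below uses a hypothetical class `Surprise` whose `__reduce__` tells pickle to call `len("abc")` on load; a malicious payload could name a far more dangerous callable in the same way.

```python
import pickle

class Surprise:
    """On unpickling, pickle calls whatever __reduce__ names instead of rebuilding this class."""

    def __reduce__(self):
        # (callable, args): harmless here, but it could be any importable function
        return (len, ("abc",))

payload = pickle.dumps(Surprise())
result = pickle.loads(payload)   # calls len("abc") while loading
print(result)                    # prints 3
```

This is why pickle files should only be loaded from sources that are trusted.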

21. What alternatives exist to pickling for saving ML models?

Alternatives to pickling for saving machine learning models include:

Using joblib, which is optimized for objects that contain large NumPy arrays.

Saving models in formats like ONNX or PMML for interoperability.

Utilizing frameworks like TensorFlow or PyTorch that provide their own model saving mechanisms.

22. What is heteroscedasticity, and why is it a problem?

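Heteroscedasticity occurs when the variance of the residuals is not constant across all levels of the independent variables. It is a problem because the standard OLS formulas for standard errors are then wrong: coefficient estimates stay unbiased but inefficient, so t-tests, F-tests, and confidence intervals can lead to misleading conclusions. Common remedies include transforming the dependent variable, weighted least squares, and heteroscedasticity-robust standard errors.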
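
An informal numeric check is to compare the residual variance for small and large values of a predictor. The sketch below assumes hypothetical data whose noise grows with `x`; it is a quick diagnostic in the spirit of a split-sample comparison, not a formal test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=1000)
y = 2.0 * x + rng.normal(0, 0.5 * x)   # noise standard deviation grows with x
X = x.reshape(-1, 1)

residuals = y - LinearRegression().fit(X, y).predict(X)

# Compare residual spread below and above the median of x
cutoff = np.median(x)
var_low = residuals[x <= cutoff].var()
var_high = residuals[x > cutoff].var()
ratio = var_high / var_low

print(f"Residual variance (low x):  {var_low:.2f}")
print(f"Residual variance (high x): {var_high:.2f}")
print(f"Ratio: {ratio:.2f}")
```

A ratio far from 1 suggests that the spread of the residuals depends on the predictor.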

23. How can interaction terms enhance a regression model's predictive power?

Interaction terms can enhance a regression model's predictive power by allowing the model to capture the combined effect of two or more independent variables on the dependent variable. This can reveal more complex relationships that a simple additive model might miss, leading to improved predictions.
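
As a sketch of this idea, the example below builds hypothetical data in which the target depends only on the product of two features. A purely additive model explains almost nothing, while adding the interaction column through `PolynomialFeatures(interaction_only=True)` recovers the relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(11)
X = rng.uniform(-1, 1, size=(300, 2))
# The effect exists only jointly: y depends on x1 * x2
y = 1.0 + 2.0 * X[:, 0] * X[:, 1] + rng.normal(0, 0.1, size=300)

r2_additive = LinearRegression().fit(X, y).score(X, y)

# interaction_only=True adds the product x1*x2 but no squared terms
expand = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = expand.fit_transform(X)   # columns: x1, x2, x1*x2
r2_interaction = LinearRegression().fit(X_inter, y).score(X_inter, y)

print(f"R-squared, additive model:    {r2_additive:.3f}")
print(f"R-squared, with interaction:  {r2_interaction:.3f}")
```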

# **Practical**

1. Write a Python script to visualize the distribution of errors (residuals) for a multiple linear regression model using Seaborn's "diamonds" dataset.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load diamonds dataset
data = sns.load_dataset('diamonds')
data = data.dropna()

# Prepare features and target
X = data[['carat', 'depth', 'table', 'x', 'y', 'z']]
y = data['price']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and calculate residuals
y_pred = model.predict(X_test)
residuals = y_test - y_pred

# Plot residual distribution
sns.histplot(residuals, kde=True)
plt.title('Distribution of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()


2. Write a Python script to calculate and print Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) for a linear regression model.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Calculate errors
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Print results
print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"RMSE: {rmse}")


3. Write a Python script to check if the assumptions of linear regression are met. Use a scatter plot to check linearity, residuals plot for homoscedasticity, and correlation matrix for multicollinearity.


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot for linearity
sns.scatterplot(x=y_test, y=y_pred)
plt.title('Linearity Check')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.show()

# Residuals plot for homoscedasticity: residuals vs. predictions should show a constant spread around 0
sns.scatterplot(x=y_pred, y=residuals, alpha=0.3)
plt.axhline(0, color='red')
plt.title('Homoscedasticity Check')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()

# Correlation matrix for multicollinearity
corr_matrix = pd.DataFrame(X_train).corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix (Multicollinearity)')
plt.show()


4. Write a Python script that creates a machine learning pipeline with feature scaling and evaluates the performance of different regression models.


In [None]:
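from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Candidate models to compare inside the same preprocessing pipeline
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42),
}

# Shuffle the folds in case the rows of the dataset are ordered
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate each model with cross-validation
for name, estimator in models.items():
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', estimator)
    ])
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring='r2')
    print(f"{name} average R-squared: {scores.mean():.4f}")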


5. Implement a simple linear regression model on a dataset and print the model's coefficients, intercept, and R-squared score.


In [None]:
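# Simple linear regression: a single feature (carat) predicting price on the diamonds split from exercise 1
simple_model = LinearRegression().fit(X_train[['carat']], y_train)
print(f"Coefficients: {simple_model.coef_}")
print(f"Intercept: {simple_model.intercept_}")
print(f"R-squared Score: {simple_model.score(X_test[['carat']], y_test)}")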


6. Write a Python script that analyzes the relationship between total bill and tip in the 'tips' dataset using simple linear regression and visualizes the results.


In [None]:
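tips = sns.load_dataset('tips')
X = tips[['total_bill']]
y = tips['tip']

model = LinearRegression()
model.fit(X, y)

# Fitted line: expected tip = intercept + slope * total_bill
print(f"Slope: {model.coef_[0]:.3f}")
print(f"Intercept: {model.intercept_:.3f}")
print(f"R-squared Score: {model.score(X, y):.3f}")

sns.regplot(x='total_bill', y='tip', data=tips, line_kws={'color': 'red'})
plt.title('Total Bill vs Tip')
plt.show()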


7. Write a Python script that fits a linear regression model to a synthetic dataset with one feature. Use the model to predict new values and plot the data points along with the regression line.


In [None]:
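import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate synthetic data (seeded so the results are reproducible)
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X.ravel() + np.random.randn(100)

# Fit model
model = LinearRegression()
model.fit(X, y)

# Predict new values
X_new = np.array([[2.0], [5.0], [8.0]])
print(f"Predictions for new values: {model.predict(X_new)}")

# Plot data points and regression line
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.title('Linear Regression on Synthetic Data')
plt.show()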


8. Write a Python script that pickles a trained linear regression model and saves it to a file.


In [None]:
import pickle

with open('linear_model.pkl', 'wb') as file:
    pickle.dump(model, file)
print("Model saved to linear_model.pkl")


9. Write a Python script that fits a polynomial regression model (degree 2) to a dataset and plots the regression curve.


In [None]:
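from sklearn.preprocessing import PolynomialFeatures

# Expand the single feature into [1, x, x^2] and fit a linear model on it
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model.fit(X_poly, y)

# Sort X so the fitted curve is drawn left to right instead of zigzagging
X_sorted = np.sort(X, axis=0)
plt.scatter(X, y, color='blue')
plt.plot(X_sorted, model.predict(poly.transform(X_sorted)), color='red')
plt.title('Polynomial Regression Curve')
plt.show()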


10. Generate synthetic data for simple linear regression (use random values for X and y) and fit a linear regression model to the data. Print the model's coefficient and intercept.


In [None]:
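# Refit on the synthetic X, y from exercise 7, since `model` was last fitted on polynomial features
model = LinearRegression().fit(X, y)
print(f"Coefficient: {model.coef_.ravel()[0]}")
print(f"Intercept: {model.intercept_}")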


11. Write a Python script that fits polynomial regression models of different degrees to a synthetic dataset and compares their performance.


In [None]:
degrees = [1, 2, 3, 4]
for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    model.fit(X_poly, y)
    print(f"Degree {degree} R-squared: {model.score(X_poly, y)}")


12. Write a Python script that fits a simple linear regression model with two features and prints the model's coefficients, intercept, and R-squared score.


In [None]:
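# Fit on the diamonds training split from exercise 1, restricted to two features
features = ['carat', 'depth']
model = LinearRegression()
model.fit(X_train[features], y_train)

print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print(f"R-squared Score: {model.score(X_test[features], y_test)}")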


13. Write a Python script that generates synthetic data, fits a linear regression model, and visualizes the regression line along with the data points.


In [None]:
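# Generate one-feature synthetic data and fit a model so the line can be plotted
X = np.random.rand(100, 1) * 10
y = 3 * X.ravel() + 4 + np.random.randn(100)
model = LinearRegression().fit(X, y)

plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.title('Regression Line')
plt.show()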


14. Write a Python script that uses the Variance Inflation Factor (VIF) to check for multicollinearity in a dataset with multiple features.


In [None]:
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler

# Generate synthetic dataset
np.random.seed(42)
X = pd.DataFrame({
    'Feature_1': np.random.rand(100),
    'Feature_2': np.random.rand(100) * 0.5,
    'Feature_3': np.random.rand(100) * 1.5,
    'Feature_4': np.random.rand(100) * 2
})

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compute VIF
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]

print(vif_data)


15. Write a Python script that generates synthetic data for a polynomial relationship (degree 4), fits a polynomial regression model, and plots the regression curve.


In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Generate synthetic data
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X**4 - 3*X**3 + 2*X**2 + np.random.randn(100, 1) * 10

# Fit polynomial regression
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)

# Plot results
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X_poly), color='red')
plt.title("Polynomial Regression (Degree 4)")
plt.show()


16. Write a Python script that creates a machine learning pipeline with data standardization and a multiple linear regression model, and prints the R-squared score.


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Generate synthetic dataset
X = np.random.rand(100, 3)
y = 2*X[:,0] + 3*X[:,1] - 1.5*X[:,2] + np.random.randn(100)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

# Fit model and evaluate
pipeline.fit(X_train, y_train)
r_squared = pipeline.score(X_test, y_test)
print(f"R-squared Score: {r_squared}")


17. Write a Python script that performs polynomial regression (degree 3) on a synthetic dataset and plots the regression curve.


In [None]:
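# Generate one-feature cubic data so the curve can be plotted against X
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X**3 - X**2 + 2 + np.random.randn(100, 1) * 2

# Fit polynomial regression
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)

plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X_poly), color='red')
plt.title("Polynomial Regression (Degree 3)")
plt.show()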


18. Write a Python script that performs multiple linear regression on a synthetic dataset with 5 features. Print the R-squared score and model coefficients.


In [None]:
X = np.random.rand(100, 5)
y = 2*X[:,0] - 1.5*X[:,1] + 3*X[:,2] + np.random.randn(100)

model = LinearRegression()
model.fit(X, y)

print(f"R-squared Score: {model.score(X, y)}")
print(f"Coefficients: {model.coef_}")


19. Write a Python script that generates synthetic data for linear regression, fits a model, and visualizes the data points along with the regression line.


In [None]:
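# One-feature synthetic data so the fit can be drawn as a single straight line
X_line = np.random.rand(100, 1) * 10
y_line = 1.8 * X_line.ravel() + 5 + np.random.randn(100)
line_model = LinearRegression().fit(X_line, y_line)

plt.scatter(X_line, y_line, color='blue')
plt.plot(X_line, line_model.predict(X_line), color='red')
plt.title("Linear Regression Line")
plt.show()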


20. Create a synthetic dataset with 3 features and perform multiple linear regression. Print the model's R-squared score and coefficients.


In [None]:
X = np.random.rand(100, 3)
y = 2*X[:,0] + 1.2*X[:,1] - 0.8*X[:,2] + np.random.randn(100)

model.fit(X, y)

print(f"R-squared Score: {model.score(X, y)}")
print(f"Coefficients: {model.coef_}")


21. Write a Python script that demonstrates how to serialize and deserialize machine learning models using joblib instead of pickling.


In [None]:
import joblib

joblib.dump(model, 'model.joblib')
loaded_model = joblib.load('model.joblib')

print("Model loaded successfully.")


22. Write a Python script to perform linear regression with categorical features using one-hot encoding. Use the Seaborn 'tips' dataset.


In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

tips = sns.load_dataset('tips')

X = tips[['total_bill', 'sex', 'smoker', 'day']]
y = tips['tip']

# One-hot encoding categorical variables
preprocessor = ColumnTransformer([
    ('onehot', OneHotEncoder(drop='first'), ['sex', 'smoker', 'day'])
], remainder='passthrough')

X_transformed = preprocessor.fit_transform(X)

# Train model
model.fit(X_transformed, y)
print(f"R-squared Score: {model.score(X_transformed, y)}")


23. Compare Ridge Regression with Linear Regression on a synthetic dataset and print the coefficients and R-squared score.


In [None]:
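from sklearn.linear_model import LinearRegression, Ridge

# Synthetic dataset with three numeric features
np.random.seed(42)
X = np.random.rand(100, 3)
y = 2*X[:,0] + 1.2*X[:,1] - 0.8*X[:,2] + np.random.randn(100)

# Fit both models on the same data
model = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(f"Linear Regression Coefficients: {model.coef_}")
print(f"Ridge Regression Coefficients: {ridge.coef_}")
print(f"Linear Regression R-squared: {model.score(X, y)}")
print(f"Ridge Regression R-squared: {ridge.score(X, y)}")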


24. Write a Python script that uses cross-validation to evaluate a Linear Regression model on a synthetic dataset.


In [None]:
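from sklearn.model_selection import cross_val_score

# Synthetic dataset with three numeric features
X = np.random.rand(100, 3)
y = 2*X[:,0] + 1.2*X[:,1] - 0.8*X[:,2] + np.random.randn(100)

# cross_val_score clones the estimator and refits it on each training fold
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"Cross-validated R-squared scores: {cv_scores}")
print(f"Mean R-squared: {np.mean(cv_scores)}")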


25. Write a Python script that compares polynomial regression models of different degrees and prints the R-squared score for each.

In [None]:
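# Synthetic one-feature data with a quadratic relationship
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X.ravel()**2 + X.ravel() + np.random.randn(100)

# Training R-squared never decreases with degree, so use held-out data to pick a degree in practice
for degree in range(1, 5):
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    print(f"Degree {degree} R-squared: {model.score(X_poly, y)}")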
