In [32]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [33]:
data = pd.read_csv('housing.csv')
data

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished


In [34]:
# Count missing values in each column
data.isna().sum()

price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64

In [35]:
# Split the dataset into features (X) and target variable (y)
X = data[['area', 'bedrooms', 'bathrooms', 'stories', 'parking']]  # Add other relevant numerical features
y = data['price']

# c. Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [59]:
from sklearn.linear_model import Lasso



# Instantiate the Lasso model
lasso_model = Lasso(alpha=0.01)  # You can tune the alpha parameter for regularization strength

# Fit the model to the training data
lasso_model.fit(X_train, y_train)

# b. Discuss the impact of L1 regularization on feature selection and coefficients
# L1 regularization (Lasso) can force the coefficients of irrelevant features to be exactly 0,
# effectively performing feature selection and simplifying the model.
print("Lasso Coefficients:", lasso_model.coef_)


Lasso Coefficients: [3.08902826e+02 1.51185146e+05 1.18536672e+06 4.95050002e+05
 3.37542244e+05]


In [60]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predict house prices on the test set
lasso_predictions = lasso_model.predict(X_test)
print('Lasso R-sq value :\t', r2_score(y_test, lasso_predictions))

# Calculate MAE, MSE, and RMSE (np was imported in the first cell)
lasso_mae = mean_absolute_error(y_test, lasso_predictions)
lasso_mse = mean_squared_error(y_test, lasso_predictions)
lasso_rmse = np.sqrt(lasso_mse)

print("Lasso Regression Metrics:")
print("Mean Absolute Error (MAE):", lasso_mae)
print("Mean Squared Error (MSE):", lasso_mse)
print("Root Mean Squared Error (RMSE):", lasso_rmse)


Lasso R-sq value :	 0.5463945902723355
Lasso Regression Metrics:
Mean Absolute Error (MAE): 1127504.0986581189
Mean Squared Error (MSE): 2292780407597.269
Root Mean Squared Error (RMSE): 1514192.988887899


Lasso regression's L1 regularization term adds the sum of the absolute values of the coefficients to the loss. As alpha grows, this shrinks the coefficients and can set some of them to exactly zero, which acts as built-in feature selection and reduces both model complexity and the risk of overfitting. In the fit above, with alpha=0.01 and unscaled features, all five coefficients are nonzero, so no feature was dropped; the penalty is negligible next to prices in the millions. Because the penalty depends on coefficient magnitude, features should be brought to a comparable scale before fitting, and categorical features must first be encoded numerically.
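
To make the feature-selection effect concrete, here is a minimal sketch on synthetic data (not the housing data). The data-generating process, the alpha values, and the name `zero_counts` are all assumptions chosen for illustration: only the first two of five features influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
# Assumption: only the first two columns drive the target; the other three are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Standardize so the L1 penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)

zero_counts = {}
for alpha in [0.001, 0.1, 0.5, 5.0]:
    model = Lasso(alpha=alpha).fit(X_scaled, y)
    zero_counts[alpha] = int(np.sum(model.coef_ == 0.0))
    print(f"alpha={alpha:<6} zero coefficients: {zero_counts[alpha]}  coef={np.round(model.coef_, 3)}")
```

With a very small alpha the fit is close to ordinary least squares, while a sufficiently large alpha zeroes every coefficient; useful values usually lie in between and are chosen by cross-validation.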

In [56]:
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=0.01)
ridge_model.fit(X_train,y_train)



In [57]:
# b. Explain how L2 regularization in Ridge regression differs from L1 regularization in Lasso.
# L2 regularization (Ridge) penalizes the sum of squared coefficients (the squared Euclidean norm),
# while L1 regularization (Lasso) penalizes the sum of absolute values of coefficients (the Manhattan norm).
# The key difference is that L2 regularization does not force coefficients to be exactly zero;
# it discourages large coefficients by adding their squared values to the cost function.
# Ridge therefore shrinks all coefficients toward zero and keeps every feature in the model,
# whereas Lasso can produce sparse solutions in which some coefficients are exactly zero.

In [58]:
# Predict house prices on the test set
ridge_predictions = ridge_model.predict(X_test)

print('ridge R-sq value :\t',r2_score(y_test,ridge_predictions))
# Calculate MAE, MSE, and RMSE
ridge_mae = mean_absolute_error(y_test, ridge_predictions)
ridge_mse = mean_squared_error(y_test, ridge_predictions)
ridge_rmse = np.sqrt(ridge_mse)

print("Ridge Regression Metrics:")
print("Mean Absolute Error (MAE):", ridge_mae)
print("Mean Squared Error (MSE):", ridge_mse)
print("Root Mean Squared Error (RMSE):", ridge_rmse)


ridge R-sq value :	 0.5464063262338164
Ridge Regression Metrics:
Mean Absolute Error (MAE): 1127483.3062465736
Mean Squared Error (MSE): 2292721087355.555
Root Mean Squared Error (RMSE): 1514173.4006894836


b. Discuss the benefits of Ridge regression in handling multicollinearity among features and its impact on the model's coefficients:

Ridge regression is particularly useful when dealing with multicollinearity, a situation where independent variables in a regression model are highly correlated. In the presence of multicollinearity, ordinary least squares (OLS) estimates can be unstable, leading to unreliable and highly sensitive coefficients. Ridge regression addresses this issue by adding the L2 regularization term, which discourages large coefficients.

The L2 regularization term in Ridge regression penalizes the sum of squared coefficients. When features are highly correlated, Ridge regression works to distribute the impact of these correlated features more evenly across them, preventing any single feature from dominating the model. By doing so, Ridge regression helps stabilize the coefficient estimates, making them less sensitive to small changes in the data. This, in turn, leads to a more reliable and interpretable model, especially when dealing with datasets where multicollinearity is a concern. Additionally, Ridge regression can also help prevent overfitting in situations where there are many predictors relative to the number of observations, further enhancing the model's generalization ability.
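
The stabilizing effect can be illustrated with a small simulation. This is a sketch under stated assumptions: two synthetic predictors that are nearly copies of each other, a hypothetical helper `coef_spread`, and an arbitrary `alpha=1.0`. It refits each model on fresh samples and reports how much the coefficients vary between fits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

def coef_spread(model_factory, n_repeats=30, seed=0):
    """Refit on fresh noisy samples and return the std of each coefficient."""
    rng = np.random.default_rng(seed)
    coefs = []
    for _ in range(n_repeats):
        x1 = rng.normal(size=100)
        x2 = x1 + rng.normal(scale=0.01, size=100)  # nearly a copy of x1
        X = np.column_stack([x1, x2])
        y = x1 + x2 + rng.normal(scale=1.0, size=100)
        coefs.append(model_factory().fit(X, y).coef_)
    return np.std(coefs, axis=0)

# Same seed for both, so both models see identical samples.
ols_spread = coef_spread(LinearRegression)
ridge_spread = coef_spread(lambda: Ridge(alpha=1.0))

print("OLS coefficient std across refits:  ", np.round(ols_spread, 3))
print("Ridge coefficient std across refits:", np.round(ridge_spread, 3))
```

The OLS coefficients swing widely from sample to sample because the data cannot separate the two predictors, while the Ridge coefficients stay close to each other and vary far less.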

In [61]:
print("Lasso Regression Metrics:")
print("Mean Absolute Error (MAE):", lasso_mae)
print("Mean Squared Error (MSE):", lasso_mse)
print("Root Mean Squared Error (RMSE):", lasso_rmse)

print("\nRidge Regression Metrics:")
print("Mean Absolute Error (MAE):", ridge_mae)
print("Mean Squared Error (MSE):", ridge_mse)
print("Root Mean Squared Error (RMSE):", ridge_rmse)


Lasso Regression Metrics:
Mean Absolute Error (MAE): 1127504.0986581189
Mean Squared Error (MSE): 2292780407597.269
Root Mean Squared Error (RMSE): 1514192.988887899

Ridge Regression Metrics:
Mean Absolute Error (MAE): 1127483.3062465736
Mean Squared Error (MSE): 2292721087355.555
Root Mean Squared Error (RMSE): 1514173.4006894836


Discuss when it is preferable to use Lasso, Ridge, or plain linear regression:

Plain Linear Regression: Use when you assume that all features are relevant and there is no issue of multicollinearity.

Lasso Regression: Use when you suspect that there are irrelevant features in your dataset and you want to perform feature selection. Lasso is also useful when you have a large number of features and want to reduce the model's complexity.

Ridge Regression: Use when you suspect there is multicollinearity in the dataset (correlation between predictors) as it handles multicollinearity well. It's also a good choice when you have a dataset with many predictors and you want to prevent overfitting by penalizing large coefficients.
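
In practice the choice can also be checked empirically. The sketch below compares the three models by cross-validated RMSE on synthetic data; the data, the alpha values, and the dictionary names `candidates` and `cv_rmse` are assumptions for illustration, not results from the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
# Assumption: two informative features, six irrelevant ones.
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=1.0, size=150)

candidates = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
}

cv_rmse = {}
for name, estimator in candidates.items():
    # Scaling happens inside the pipeline so each CV fold is scaled on its own training part.
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()
    print(f"{name:<7} CV RMSE: {cv_rmse[name]:.3f}")
```

If the three scores are close, as they are for the Lasso and Ridge fits reported earlier in this notebook, the simpler or more interpretable model is usually a reasonable default.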

In [62]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for Lasso
lasso_params = {'alpha': [0.1, 1, 10, 100]}
lasso_grid = GridSearchCV(Lasso(), param_grid=lasso_params, scoring='neg_mean_squared_error', cv=5)
lasso_grid.fit(X_train, y_train)
best_lasso = lasso_grid.best_estimator_
print("Best Lasso Model:", best_lasso)

# Hyperparameter tuning for Ridge
ridge_params = {'alpha': [0.1, 1, 10, 100]}
ridge_grid = GridSearchCV(Ridge(), param_grid=ridge_params, scoring='neg_mean_squared_error', cv=5)
ridge_grid.fit(X_train, y_train)
best_ridge = ridge_grid.best_estimator_
print("Best Ridge Model:", best_ridge)


Best Lasso Model: Lasso(alpha=100)
Best Ridge Model: Ridge(alpha=1)
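
The search above selected alpha=100 for Lasso, the largest value in its grid, which often suggests that the grid should be widened. The following sketch shows one way to do that with a log-spaced grid and with scaling inside a pipeline. It runs on synthetic data with mixed feature scales; `X_demo`, `y_demo`, the step names, and the grid bounds are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Assumption: the first column has a much larger scale than the rest (like area vs. bedrooms).
X_demo = rng.normal(size=(200, 5)) * np.array([1000.0, 1.0, 1.0, 1.0, 1.0])
y_demo = 0.003 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(size=200)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", Lasso(max_iter=10_000)),
])

alphas = np.logspace(-3, 2, 11)  # 0.001 ... 100 on a log scale
grid = GridSearchCV(pipe, {"lasso__alpha": alphas},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X_demo, y_demo)

best_alpha = grid.best_params_["lasso__alpha"]
on_edge = best_alpha in (alphas[0], alphas[-1])
print("Best alpha:", best_alpha, "| on grid edge:", on_edge)
```

When `on_edge` is true, extending the grid in that direction and rerunning the search is a cheap check that the optimum has actually been bracketed.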


8. Model Improvement:

a. Investigate feature engineering or data preprocessing techniques:

Feature Scaling: Standardize or normalize numerical features to bring them to a similar scale, especially if you are using regularization techniques.

Polynomial Features: Introduce interaction terms or polynomial features to capture nonlinear relationships between predictors.

Handling Categorical Variables: Explore advanced techniques like target encoding or embedding layers for neural networks if you have categorical variables.

Feature Selection: Conduct detailed feature analysis to identify the most influential features and focus on them, discarding irrelevant ones.

Outlier Handling: Investigate outliers and apply appropriate techniques such as removing outliers or transforming skewed features.

Data Imputation: If missing values exist, explore advanced imputation techniques like K-nearest neighbors imputation or predictive mean matching.
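
As a sketch of how scaling and categorical handling could be combined, the example below uses six rows taken from the data preview at the top of the notebook and only four of its columns. The choice of one-hot encoding, the Ridge model, and the names `toy`, `preprocess`, and `n_features_out` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Six rows from the preview of housing.csv, restricted to a few columns.
toy = pd.DataFrame({
    "area": [7420, 8960, 3000, 2400, 3620, 2910],
    "bedrooms": [4, 4, 2, 3, 2, 3],
    "mainroad": ["yes", "yes", "yes", "no", "yes", "no"],
    "furnishingstatus": ["furnished", "furnished", "unfurnished",
                         "semi-furnished", "unfurnished", "furnished"],
    "price": [13300000, 12250000, 1820000, 1767150, 1750000, 1750000],
})

numeric_cols = ["area", "bedrooms"]
categorical_cols = ["mainroad", "furnishingstatus"]
feature_cols = numeric_cols + categorical_cols

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                            # put numeric features on one scale
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one column per category
])

model = Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=1.0))])
model.fit(toy[feature_cols], toy["price"])

n_features_out = model.named_steps["prep"].transform(toy[feature_cols]).shape[1]
print("Features after preprocessing:", n_features_out)
```

Keeping the preprocessing inside the pipeline means the same transformations are applied at prediction time and, under cross-validation, are fitted only on each training fold.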

1. Initial Linear Regression Model:

a. Dataset and Variables:
The dataset contains information about employees, with variables including experience (in years), education level (numeric representation), and the number of projects completed. The target variable is employee performance, measured using a numerical scale or score.

b. Simple Linear Regression Model:
The simple linear regression model predicts employee performance based on one predictor variable, for example, experience.

c. Suitability of Linear Regression:
Linear regression is suitable because it assumes a linear relationship between the predictors and the target variable. It's appropriate when predicting a numeric outcome (like performance scores) based on one or more predictor variables (like experience, education level, and projects completed).
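
A minimal sketch of the one-predictor model, using the closed-form least-squares solution. The data are synthetic: the variables `experience` and `performance` and the assumed true relationship (score = 50 + 2 × years + noise) are hypothetical stand-ins for the employee dataset.

```python
import numpy as np

rng = np.random.default_rng(3)
experience = rng.uniform(0, 20, size=100)                                # hypothetical years of experience
performance = 50 + 2.0 * experience + rng.normal(scale=3.0, size=100)    # hypothetical score

# Closed-form least squares for one predictor:
#   slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   intercept = y_mean - slope * x_mean
x_mean, y_mean = experience.mean(), performance.mean()
slope = np.sum((experience - x_mean) * (performance - y_mean)) / np.sum((experience - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(f"performance ~ {intercept:.2f} + {slope:.2f} * experience")
```

The slope is read as the expected change in the performance score for one additional year of experience.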



2. Identifying Heteroscedasticity:
a. Heteroscedasticity in Linear Regression:
Heteroscedasticity refers to the situation where the variability of the residuals (the differences between observed and predicted values) is not constant across all levels of the predictor variables. In simpler terms, the spread of the residuals increases or decreases as the predicted values change.

b. Methods for Diagnosing Heteroscedasticity:

Residual Plot: Plotting residuals against predicted values can visually reveal patterns.
Breusch-Pagan Test: A statistical test that formally tests for heteroscedasticity.
White’s Test: Another statistical test for heteroscedasticity.
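
The idea behind the Breusch-Pagan test can be sketched with NumPy alone. The version below is the studentized (Koenker) form: regress the squared OLS residuals on the predictors and use n·R² as the test statistic, which is compared with a chi-square critical value. The helper names, the synthetic data, and the single-predictor setup are assumptions; a statistics library would normally be used instead.

```python
import numpy as np

def ols_residuals(x, y):
    """Residuals from an OLS fit of y on x with an intercept."""
    design = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

def breusch_pagan_lm(X, residuals):
    """LM statistic n * R^2 from regressing squared residuals on X (with intercept)."""
    n = len(residuals)
    design = np.column_stack([np.ones(n), X])
    u2 = residuals ** 2
    beta, *_ = np.linalg.lstsq(design, u2, rcond=None)
    fitted = design @ beta
    r2 = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return n * r2

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=500)
y_hetero = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)       # noise grows with x
y_homo = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=500)   # constant noise

lm_hetero = breusch_pagan_lm(x, ols_residuals(x, y_hetero))
lm_homo = breusch_pagan_lm(x, ols_residuals(x, y_homo))

CHI2_CRIT_5PCT_DF1 = 3.841  # 5% critical value of chi-square with 1 degree of freedom
print(f"LM (heteroscedastic data): {lm_hetero:.2f}")
print(f"LM (homoscedastic data):   {lm_homo:.2f}")
print("Reject constant variance for first dataset:", lm_hetero > CHI2_CRIT_5PCT_DF1)
```

A statistic well above the critical value indicates that the residual variance depends on the predictor, which matches the fan shape one would see in a residual plot.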
    
c. Diagnostic Results:

After applying these methods to your model's residuals, you find evidence of heteroscedasticity, suggesting that the variability in employee performance prediction errors is not constant across different levels of predictor variables.
    

3. Remedying Heteroscedasticity:

a. Consequences of Heteroscedasticity:

Under heteroscedasticity the OLS coefficient estimates remain unbiased but are no longer efficient. The usual standard errors are biased, so t-tests and confidence intervals for the regression coefficients can lead to incorrect conclusions about the significance of predictors.

b. Addressing Heteroscedasticity:

Transforming Variables: Applying transformations like logarithmic or square root transformations to the dependent variable or certain predictor variables can stabilize the variance.
Weighted Least Squares (WLS) Regression: Assigning weights to observations inversely proportional to the variance of the residuals can mitigate heteroscedasticity effects.
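
Weighted least squares can be sketched in a few lines of NumPy by rescaling each row with the square root of its weight. The example assumes the noise standard deviation is proportional to the predictor, so the weights are 1/x²; in real data the variance structure has to be estimated or justified. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """Solve min_b sum_i w_i * (y_i - [1, x_i] . b)^2 by scaling rows with sqrt(w_i)."""
    design = np.column_stack([np.ones(len(y)), X])
    sw = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return beta  # [intercept, slope, ...]

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=400)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)  # assumption: noise std proportional to x

beta_ols = weighted_least_squares(x, y, np.ones_like(x))  # equal weights reduce to OLS
beta_wls = weighted_least_squares(x, y, 1.0 / x**2)       # weights = 1 / variance (up to a constant)

print("OLS intercept, slope:", np.round(beta_ols, 3))
print("WLS intercept, slope:", np.round(beta_wls, 3))
```

Both estimators target the same coefficients; the weighted fit gives less influence to the noisy high-x observations, which is what makes it more efficient when the weights are right.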
    
c. Implementation and Evaluation: 
    
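You decide to apply a logarithmic transformation to the employee performance variable and re-run the regression. Additionally, you experiment with WLS regression by assigning appropriate weights to observations.
After these remedial actions, you reevaluate the model's performance metrics (such as R-squared and RMSE) and find that they have improved. For a fair comparison, the metrics of the log model should be computed after back-transforming its predictions to the original scale, because R-squared and RMSE computed on a log-transformed target are not directly comparable with those of the untransformed model. The model's predictions are now more reliable, and the assumptions of linear regression are better met.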

4. Detecting Multicollinearity:
a. Multicollinearity in Linear Regression:

Multicollinearity occurs when predictor variables in a regression model are highly correlated with each other. This correlation can cause issues in the regression analysis, making it difficult to identify the individual effect of each predictor variable on the target. Multicollinearity inflates standard errors, leading to unstable and unreliable coefficient estimates. It does not affect the model's overall fit (R-squared), but it makes it challenging to interpret the importance of each predictor variable.
b. Identification Methods:

Correlation Matrices: Calculate the correlation coefficients between all pairs of predictor variables. High absolute values (close to 1) indicate strong correlations.
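Variance Inflation Factors (VIFs): VIF measures how much the variance of an estimated regression coefficient is inflated by correlation among the predictors. For predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. A common rule of thumb treats VIF values greater than 10 as a sign of problematic collinearity.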
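
The VIF computation can be sketched directly from its definition, 1 / (1 - R²) for each predictor regressed on the others. The predictor names mirror the employee example, but the data are synthetic and the strength of the relationship between them is an assumption.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the other columns."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(6)
experience = rng.uniform(0, 20, size=300)
projects = 2.0 * experience + rng.normal(scale=1.0, size=300)  # assumption: tightly tied to experience
education = rng.integers(1, 5, size=300).astype(float)         # unrelated to the other two

vifs = variance_inflation_factors(np.column_stack([experience, projects, education]))
for name, vif in zip(["experience", "projects", "education"], vifs):
    print(f"{name:<11} VIF = {vif:.1f}")
```

The two linked predictors get large VIFs while the unrelated one stays near 1, which is the pattern that flags a collinear pair.
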
c. Findings on Highly Correlated Variables:

After calculating correlation coefficients and VIF values for your predictor variables, you find that "experience" and "number of projects completed" have a high positive correlation coefficient (close to 1). Additionally, the VIF values for both of these variables are above 10, indicating significant multicollinearity.

5. Mitigating Multicollinearity:
a. Issues Associated with Multicollinearity:

Unreliable Coefficients: Multicollinearity makes it challenging to determine the true effect of each predictor variable on the target because the coefficients become unstable and can flip signs erratically.
Reduced Interpretability: When predictor variables are highly correlated, it's difficult to isolate and understand the impact of each variable on the target variable. Interpretation becomes ambiguous.
b. Strategies for Mitigating Multicollinearity:

Feature Selection: Choose a subset of the most relevant predictors based on domain knowledge or statistical techniques like stepwise regression.

Regularization Techniques: Regularization methods like Lasso (L1 regularization) and Ridge (L2 regularization) penalize large coefficients, effectively reducing the impact of less important predictors.


After evaluating the options, you decide to use Lasso regression, which performs both variable selection and regularization and yields a simpler, more interpretable model. Note that among a group of highly correlated predictors Lasso tends to keep one and shrink the others to zero, and which one it keeps can be somewhat arbitrary.
You apply Lasso regression to the dataset, and it retains the most relevant predictors while shrinking the coefficients of the less relevant ones toward zero. After this adjustment, you assess the model's performance using metrics like R-squared, RMSE, and MAE.

6. Model Evaluation:
a. Performance Metrics:

R-squared (Coefficient of Determination): Indicates the proportion of the variance in the dependent variable (employee performance) that is predictable from the independent variables.

MAE (Mean Absolute Error): Represents the average of the absolute errors between predicted and actual performance scores.

MSE (Mean Squared Error): Measures the average of the squared errors, giving more weight to large errors.

RMSE (Root Mean Squared Error): Square root of MSE, providing an interpretable measure in the same unit as the target variable.
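
A small worked example may help tie the four metrics together. The four actual and predicted values below are made up so that the arithmetic can be followed by hand.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])    # made-up actual scores
y_pred = np.array([2.0, 5.0, 8.0, 11.0])   # made-up predictions

errors = y_true - y_pred                   # [1, 0, -1, -2]
mae = np.mean(np.abs(errors))              # (1 + 0 + 1 + 2) / 4 = 1.0
mse = np.mean(errors ** 2)                 # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                        # sqrt(1.5), about 1.2247

ss_res = np.sum(errors ** 2)                       # 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # 9 + 1 + 1 + 9 = 20
r2 = 1.0 - ss_res / ss_tot                         # 1 - 6/20 = 0.7

print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}, R2={r2}")
```

The single error of 2 contributes 4 of the 6 units of squared error, which shows how MSE and RMSE weight large errors more heavily than MAE does.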

b. Interpretation of Coefficients:

After addressing heteroscedasticity and multicollinearity, the coefficients of the remaining predictors become more stable and interpretable.
For example, if "experience" is a significant predictor, a unit increase in experience leads to a certain change in the employee performance score, holding other variables constant. The coefficients indicate the strength and direction of these relationships.

7. Conclusion:
a. Impact of Addressing Issues:

Identifying and addressing heteroscedasticity and multicollinearity have significantly improved the predictive accuracy and interpretability of the employee performance model.
The model's performance metrics have likely shown enhancements, with a higher R-squared value indicating a better fit to the data. The MAE, MSE, and RMSE values have likely reduced, indicating smaller prediction errors.
b. Recommendations for Future Model Development:

Feature Engineering: Consider exploring additional relevant features that could enhance the model's predictive power.
Continuous Monitoring: Regularly check for new data patterns and reevaluate the model's performance as the dataset evolves.
External Factors: Incorporate external factors like market trends or company policies that might influence employee performance.
Advanced Techniques: Explore advanced machine learning algorithms beyond linear regression, like decision trees, random forests, or neural networks, to capture complex relationships.