Income-Data-Analysis-Using-Multiple-Regression

Project Overview

This project focuses on analyzing the factors that influence income using a Multiple Linear Regression model. The dataset consists of 4508 observations with 13 variables, including the dependent variable income and 12 independent variables. The goal is to build an optimal model to predict income and understand the relationships between income and other factors.

Key Concepts:

Multiple Linear Regression: A statistical method that models the relationship between one dependent variable and two or more independent variables.
Gauss-Markov Assumptions: The conditions under which ordinary least squares is the best linear unbiased estimator, including linearity, homoscedasticity, no autocorrelation, and independence between predictors and errors.
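
As a minimal sketch of what fitting a multiple linear regression involves, the Python snippet below solves the normal equations with the standard library only. The helper names (`solve_linear_system`, `fit_ols`) and the toy data are hypothetical and are not taken from this project's scripts or dataset.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Return [intercept, slope_1, ..., slope_k] minimising squared error."""
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(xi[a] * xi[b] for xi in x) for b in range(k)] for a in range(k)]
    xty = [sum(xi[a] * yi for xi, yi in zip(x, y)) for a in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data generated exactly from y = 1 + 2*x1 + 3*x2.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
response = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]
coefficients = fit_ols(predictors, response)
print([round(c, 6) for c in coefficients])
```

Because the toy response is an exact linear function of the predictors, the recovered coefficients should match the generating values up to floating-point error. Real analyses would normally rely on a statistics package rather than hand-rolled linear algebra.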

Dataset Description

The dataset contains 13 variables:

| Variable | Description |
| --- | --- |
| `age` | Age in years |
| `yrsed` | Years of education |
| `edcat` | Level of education |
| `income` | Income of individuals (dependent variable) |
| `yrsempl` | Years with current employer |
| `creddebt` | Credit card debt in thousands |
| `othdebt` | Other debt in thousands |
| `default` | Ever defaulted on a bank loan (0/1) |
| `jobsat` | Job satisfaction (1-5 scale) |
| `homeown` | Home ownership (0/1) |
| `address` | Years at current address |
| `cars` | Number of cars owned/leased |
| `carvalue` | Value of primary vehicle |

Descriptive Statistics

The variables differ in their distributions. For example, income, creddebt, and othdebt are positively skewed, while default and homeown are binary (0/1) indicators.
The dependent variable income has a mean of 55.41 and a standard deviation of 56.51; a standard deviation larger than the mean is consistent with the positive skew.

Model Building Process

Step 1: Initial Model

We began by fitting an initial multiple regression model using all independent variables. However, multicollinearity was detected, particularly with the variable edcat. This led to the removal of edcat from subsequent models.

Step 2: Refined Model

After removing edcat, we refit the model and removed further variables, including homeown, cars, and address, because of collinearity. The remaining variables showed a more linear relationship with income.

Step 3: Handling Heteroscedasticity

Heteroscedasticity was detected using `ncvTest` (the score test for non-constant error variance in the R `car` package), indicating that the errors did not have constant variance. To address this, we transformed some variables (e.g., a logarithmic transformation of income and square root transformations of creddebt and othdebt).

Step 4: Final Model

After removing outliers using Cook's Distance and applying transformations, we arrived at a final model that satisfies all Gauss-Markov assumptions:

Linearity was confirmed using diagnostic plots.
Homoscedasticity was checked visually: the residuals formed a random band around a horizontal line in the residual plots.
No autocorrelation was detected (Durbin-Watson statistic close to 2).
Multicollinearity was reduced (VIF values below 5).

Final Model Summary

The final model achieved an Adjusted R-squared of 92.66%, meaning it explains about 93% of the variance in log-transformed income after adjusting for the number of predictors. All predictors were statistically significant, which supports keeping them in the model.

Key Results:

Adjusted R-squared: 92.66%
Durbin-Watson Statistic: 1.92 (indicating no autocorrelation)
Variance Inflation Factor (VIF): All values below 5, indicating low multicollinearity.
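
For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. It ranges from 0 to 4, with values near 2 indicating no first-order autocorrelation. A minimal Python sketch follows; the residual sequences are made up for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in their observed order."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residual sequences.
alternating = [1, -1, 1, -1]          # negative autocorrelation: statistic above 2
clustered = [1, 1, 1, -1, -1, -1]     # positive autocorrelation: statistic below 2

dw_alternating = durbin_watson(alternating)
dw_clustered = durbin_watson(clustered)
print(dw_alternating, dw_clustered)
```

The reported value of 1.92 sits close to 2, between these two extremes.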

Final Model Equation:

$$ \log(\text{income}) = \beta_0 + \beta_1 \sqrt{\text{creddebt}} + \beta_2 \sqrt{\text{othdebt}} + \beta_3 \text{default} + \beta_4 \log(\text{carvalue}) $$
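
Because the response is on the log scale, a prediction has to be exponentiated to return to income units. The Python sketch below shows that step. The coefficient values are placeholders invented for illustration, since the fitted coefficients are not reported here, and the function and key names are hypothetical.

```python
from math import exp, log, sqrt


def predict_income(coef, creddebt, othdebt, default, carvalue):
    """Evaluate the final model equation and back-transform to income units."""
    log_income = (
        coef["intercept"]
        + coef["sqrt_creddebt"] * sqrt(creddebt)
        + coef["sqrt_othdebt"] * sqrt(othdebt)
        + coef["default"] * default
        + coef["log_carvalue"] * log(carvalue)
    )
    return exp(log_income)


# Placeholder coefficients, NOT the fitted values from this project.
placeholder_coef = {
    "intercept": 2.0,
    "sqrt_creddebt": 0.1,
    "sqrt_othdebt": 0.1,
    "default": -0.05,
    "log_carvalue": 0.5,
}

without_default = predict_income(placeholder_coef, 1.5, 3.0, 0, 20.0)
with_default = predict_income(placeholder_coef, 1.5, 3.0, 1, 20.0)
# On the log scale a 0/1 predictor acts multiplicatively after back-transforming.
ratio = with_default / without_default
print(f"multiplicative effect of default: {ratio:.4f}")
```

Two points follow from the log response: the coefficient on default translates into a multiplicative factor of exp(beta_3) on predicted income, and exponentiating a log-scale prediction estimates a typical (median-like) income rather than the mean unless a retransformation correction is applied.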

Conclusion

This project built a multiple regression model that explains most of the variance in log-transformed income (Adjusted R-squared of 92.66%). By addressing multicollinearity, heteroscedasticity, and outliers, we arrived at a model that passes the diagnostic checks for the Gauss-Markov assumptions.

