---
title: "Generational Differences in Student Loan Debt and the Implications on Homeownership Trends"
subtitle: MSDS Capstone 2024
authors:
  - Leann Kim and Kass Traieh
title-block-banner: true
format: 
  html:
    theme:
      light: lightly
      dark: darkly
toc: true
---

# Introduction

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As current students, student loans are at the forefront of our minds and loom over us as we juggle the stress of financial burdens and the need for education to grow and succeed in life. Societal pressures push us to graduate college, start our careers, buy a house and start a family. But are Millennials and Gen Z’ers able to fall into place and follow the footsteps of the generations before them? Or are they facing additional challenges with the outstanding debt they’re accumulating with student loans? This project aims to shed light on our research question: how do differences in student loan debt between older and current generations affect homeownership trends? 


# Background

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Historically, homeownership has been a cornerstone of wealth accumulation for American families as it provides stability and financial security across generations. However, in the aftermath of the Great Recession and amidst rising costs of higher education, younger generations are increasingly burdened by substantial student loan debt. According to recent data, outstanding student loan debt in the United States has surpassed $1.75 trillion, which is a 67% increase from the previous decade [@federal2024]. Research has shown that high levels of student debt can delay or reduce the likelihood of homeownership. By comparing older generations (Baby Boomers and Gen X) with younger generations (Millennials), we can better understand how the landscape of student debt has evolved and its implications on long-term financial decisions such as homeownership for younger generations.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Understanding the difference in student loan debt composition across different generations holds personal significance to us as members of the current generations facing the burden of student loans. It speaks to the experiences of students navigating the complexities of higher education financing and the challenges they face in achieving future financial stability, including homeownership. By examining these dynamics, our research aims to empower individuals with knowledge to make informed financial decisions and advocate for systemic reforms.


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This study also reflects on the evolving societal norms and expectations surrounding higher education and homeownership. It highlights the shift towards higher education attendance and the resulting increase in student loan debt, which affects traditional milestones such as home ownership. Ultimately, our research contributes to building a more inclusive and resilient society by addressing systemic barriers to economic opportunity and wealth accumulation.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;In summary, the proposed research holds significant merit, with the potential to drive positive societal change. A better understanding of generational differences in student loan debt and its implications on homeownership rates can lead to more informed policies, enhanced educational programs, and a more equitable and prosperous future.


# Data

## Data Acquisition

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The acquisition of student loan data and homeownership data proved challenging due to several factors. There was a significant lack of publicly available data, particularly data on student loans dating back to 1990. This scarcity was exacerbated by the fact that the raw data needed for our analysis was not available on any of the websites we consulted, including government portals and academic repositories. The data we encountered was predominantly aggregated, with limited access to detailed, disaggregated datasets that would allow for more granular analysis. Furthermore, many of the existing studies and reports that addressed related topics did not provide explicit citations or source details for their data, making it difficult to trace the origins and verify the reliability of the information used in previous research. This lack of transparency and accessibility posed significant obstacles to obtaining the comprehensive and historical data required for our study.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To perform our analysis, we utilized the data from the Federal Reserve and Census surveys within the 1989-2022 time frame. The Federal Reserve survey consisted of age groups, the percentage of those who had a mortgage and the percentage of those with education loans, and was stored in various tabs in an Excel spreadsheet. The Census survey, also stored in various tabs in an Excel spreadsheet, included age groups, the total number of people who were surveyed and the number of homeowners within each age group. To further explore our research question and to supplement our existing tables, we found data on the amount of debt students had at the time of their graduation. This table included graduation year, amount of debt at graduation, average starting salary out of college and the debt-to-income ratio. The original data structure can be seen in [Figure 1](#figure-1).


### Original Tables

![Figure 1: Original Tables](images/erd.png){#figure-1}

::: {.callout-note collapse="true"}
#### Data Dictionary - Features, Descriptions, and Sources (Click to Expand) 
| Feature              	 | Description                                                                                                                                                                                                 	 |
|------------------------|----------------------------------------------------------------------------------------------------------------|
| `year`                   | Survey year from 1989-2022 in increments of 3 years |
| `age_group`              | Age groups: 18-34, 35-44, 45-54, 55-64, 65-74, 75+ |
| `percent_mortgage`       | Percent of families within the age group that hold mortgage debt |
| `percent_education_loan` | Percent of families within the age group that hold education loan debt |
:Table 1 - Survey of Consumer Finances 1989-2022: a normally triennial cross-sectional survey of U.S. families showcasing family holdings of debt by selected characteristics of families and type of debt. 

<br><br>

| Feature              	 | Description                                                                                                                                                                                                 	 |
|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `year`               	 | Survey year from 1989-2022 in increments of 3 years                                                                                                                                                                                               	 |
| `age_group`                 	 | Age groups: 18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75+                                                                                                                                                                                      |
| `total_surveyed`            	 | Total number of homes surveyed                                                                                                                            	 |
| `total_owner`    	 | Total number of homes surveyed that were owned                                                                                                           	 |
:Table 2 - Census Housing Vacancy Survey: historical data on rental and homeowner vacancy rates in the U.S.


<br><br>

| Feature              	 | Description                                                                                                                                                                                                 	 |
|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `grad_year`               	 | Graduation year from 1970-2021 in varying increments                                                                                                                                                                                             	 |
| `debt_at_grad`                 	 | Average student loan debt at graduation                                                                                                                                                                                      |
| `avg_start_salary`            	 | Average starting salary after graduation                                                                                                                            	 |
| `avg_debt_to_income`    	 | Average debt to average income ratio                                                                                                            	 |
:Table 3 - Student Loan Debt by Year: average debt by year of graduation for students who graduated with a bachelor’s degree

:::


## Data Cleaning & Feature Engineering

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The data preparation process involved extensive cleaning and manipulation in Excel and PostgreSQL. It is crucial to emphasize that all data was aggregated, ensuring no sensitive information was included within the datasets. For the Federal Reserve data, we extracted variables such as the survey year, age groups, the percentage of families in each age group with a mortgage loan, and the percentage of families in each age group with an education loan. Similarly, for the Census data, we extracted the survey year, age groups, the total number of homes surveyed, and the total number of homes that were owned. All these variables were organized in a structured manner to facilitate analysis. It is important to note that the age groups from the Federal Reserve and Census data overlapped, although they were categorized differently. The Federal Reserve data utilized broader age categories, while the Census data employed more granular age bins. Specifically, the Federal Reserve age groups were: less than 35, 35-44, 45-54, 55-64, 65-74, and 75+. In contrast, the Census data age groups were segmented as follows: less than 25, 25-29, 30-34, 35-39, ..., 70-74, and 75+.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Once we created these two tables in PostgreSQL, we split each table’s age groups into a minimum age and a maximum age for ease of joining the two tables together. With those new columns in both the Federal Reserve and Census tables, we joined our data on survey year, minimum age, and maximum age, which together we treated as a composite key. In the join, we ensured that the broader Federal Reserve age ranges were matched with the more granular Census age ranges. Although the join conditions account for overlapping age ranges, there may still be mismatches where age groups did not perfectly align. Additionally, overlaps or gaps in age ranges might lead to duplication of data, which would affect the overall quality and reliability of the dataset.
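&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The join itself was performed in PostgreSQL. As a rough illustration only, the same idea can be sketched in pandas. The toy rows, the helper names `split_age_group` and `add_age_bounds`, and the upper bound of 120 for the open-ended "75+" group are assumptions made for this sketch and are not taken from our pipeline.

```python
import pandas as pd

# Toy rows with placeholder values (not survey figures).
federal = pd.DataFrame({
    "year": [1989, 1989],
    "age_group": ["18-34", "35-44"],
    "percent_mortgage": [0.0, 0.0],
})
census = pd.DataFrame({
    "year": [1989, 1989, 1989, 1989],
    "age_group": ["18-24", "25-29", "30-34", "35-39"],
    "total_surveyed": [0, 0, 0, 0],
})


def split_age_group(label: str) -> tuple[int, int]:
    """Turn '35-44' into (35, 44) and '75+' into (75, 120)."""
    if label.endswith("+"):
        return int(label[:-1]), 120  # 120 is an assumed upper bound
    low, high = label.split("-")
    return int(low), int(high)


def add_age_bounds(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Replace age_group with <prefix>_min_age and <prefix>_max_age."""
    bounds = df["age_group"].map(split_age_group)
    out = df.drop(columns="age_group").copy()
    out[f"{prefix}_min_age"] = [b[0] for b in bounds]
    out[f"{prefix}_max_age"] = [b[1] for b in bounds]
    return out


federal = add_age_bounds(federal, "federal")
census = add_age_bounds(census, "census")

# Pair rows by survey year, then keep only the Census bins that fall
# entirely inside a (broader) Federal Reserve bin.
joined = census.merge(federal, on="year")
joined = joined[
    (joined["census_min_age"] >= joined["federal_min_age"])
    & (joined["census_max_age"] <= joined["federal_max_age"])
].reset_index(drop=True)
```

A containment condition like this one avoids duplicated rows only when every granular bin sits inside exactly one broad bin; bins that straddle a boundary would be dropped, which is one way the mismatches described above can arise.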

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Upon joining our two tables, we proceeded to integrate generation names for each age range. This involved calculating the minimum and maximum birth years using the survey year along with the minimum and maximum ages. To classify each row into a generational cohort (Traditionalist, Baby Boomer, Generation X, Millennial, or Gen Z), we created a formula to manage overlaps by comparing the minimum and maximum birth years to determine the closest generational boundary alignment for each row. We also created new columns for minimum and maximum graduation years. This was done by assuming the average college graduation age to be 21 and adding this to the minimum and maximum birth years. It’s important to note that using a single average age may lead to inaccuracies in calculating the minimum and maximum graduation years, which could affect subsequent analyses and the classification of individuals into generational cohorts. 
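&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The sketch below illustrates one possible reading of this step. Only the Traditionalist cutoff (born 1945 or earlier) is stated in this report; the other birth-year boundaries, the "largest overlap" rule in the hypothetical `classify` helper, and the choice to derive the earliest birth year from the maximum age are assumptions for illustration and may differ from the formula we used.

```python
import pandas as pd

# Assumed birth-year boundaries (inclusive). Only the 1945 cutoff for
# Traditionalists comes from the text; the rest are common conventions.
GENERATIONS = {
    "Traditionalist": (1900, 1945),
    "Baby Boomer": (1946, 1964),
    "Generation X": (1965, 1980),
    "Millennial": (1981, 1996),
    "Gen Z": (1997, 2012),
}
GRAD_AGE = 21  # average graduation age assumed in the text


def classify(min_birth: int, max_birth: int) -> str:
    """Return the generation sharing the most birth years with the range."""
    def overlap(bounds):
        start, end = bounds
        return min(max_birth, end) - max(min_birth, start) + 1

    return max(GENERATIONS, key=lambda name: overlap(GENERATIONS[name]))


rows = pd.DataFrame({
    "year": [1989, 2022],
    "census_min_age": [25, 30],
    "census_max_age": [29, 34],
})

# Oldest respondents give the earliest birth year, youngest the latest.
rows["census_min_birth_year"] = rows["year"] - rows["census_max_age"]
rows["census_max_birth_year"] = rows["year"] - rows["census_min_age"]
rows["census_generation_name"] = [
    classify(lo, hi)
    for lo, hi in zip(rows["census_min_birth_year"], rows["census_max_birth_year"])
]
rows["census_min_grad_year"] = rows["census_min_birth_year"] + GRAD_AGE
rows["census_max_grad_year"] = rows["census_max_birth_year"] + GRAD_AGE
```

For the two toy rows, the 1989 respondents aged 25-29 fall among Baby Boomers and the 2022 respondents aged 30-34 fall among Millennials under these assumed boundaries.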

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Finally, we integrated our student debt dataset, which included variables such as debt at graduation, average starting salary, and debt-to-income ratio by graduation year. The data source did not provide a downloadable file. Consequently, the values were manually entered into Excel and subsequently imported into PostgreSQL for analysis. We aligned each row of our joined dataset to the corresponding debt figures by comparing the minimum and maximum graduation years to determine the closest match to the graduation year in our debt dataset. Although the assignments were done as carefully as possible, there is a chance that some inaccuracies may remain due to variations in actual graduation ages and the potential misalignment of graduation years within the dataset.
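&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;A minimal sketch of a closest-match rule is shown below. It assumes the midpoint of the minimum and maximum graduation years is compared against each `grad_year` in the debt table; the midpoint rule and the placeholder debt values are assumptions for illustration, not a record of the exact matching we performed.

```python
import numpy as np
import pandas as pd

# Placeholder debt table (values are not real figures).
debt = pd.DataFrame({
    "grad_year": [1970, 1980, 1990],
    "debt_at_grad": [1.0, 2.0, 3.0],
})
rows = pd.DataFrame({
    "census_min_grad_year": [1981, 1986],
    "census_max_grad_year": [1985, 1990],
})

# Midpoint of each row's graduation-year range.
midpoint = (rows["census_min_grad_year"] + rows["census_max_grad_year"]) / 2

# Absolute distance from every midpoint to every available grad_year,
# then pick the column index of the smallest distance per row.
distance = np.abs(
    midpoint.to_numpy()[:, None] - debt["grad_year"].to_numpy()[None, :]
)
nearest = distance.argmin(axis=1)

rows["grad_year"] = debt["grad_year"].to_numpy()[nearest]
rows["debt_at_grad"] = debt["debt_at_grad"].to_numpy()[nearest]
```

Because the debt table lists graduation years in varying increments, the matched year can be several years away from the midpoint, which is one source of the misalignment noted above.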



### Final Data Dictionary


::: {.callout-note collapse="true"}
#### Data Dictionary - Features and Descriptions (Click to Expand) 
| Feature              	 | Description                                                                                                                                                                                                 	 |
|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `year`| year of survey, in increments of 3 |
|`census_min_birth_year`|  minimum birth year calculated from `year` and `census_min_age` |
|`census_max_birth_year`| maximum birth year calculated from `year` and `census_max_age` |
|`census_generation_name`| generation name derived from `census_min_birth_year` and `census_max_birth_year`|
|`census_min_grad_year`| minimum graduation year derived from `census_min_birth_year` + 21 years|
|`census_max_grad_year`| maximum graduation year derived from `census_max_birth_year` + 21 years |
|`debt_at_grad`| average debt at year of graduation |
|`avg_start_salary`| average starting salary post graduation|
|`avg_debt_to_income`| average debt to income ratio |
|`census_min_age`| minimum age derived from age group|
|`census_max_age`| maximum age derived from age group|
|`total_surveyed`| total number of those surveyed|
|`total_owner`| total number of those surveyed who own a home|
|`federal_min_age`| minimum age derived from age group|
|`federal_max_age`| maximum age derived from age group|
|`percent_mortgage`| percentage of those surveyed with mortgages|
|`percent_education_loan`| percentage of those surveyed with education loans|
|`generation_order`| generations encoded (Baby Boomer = 1, Generation X = 2, Millennial = 3) |
:Table 4 - Final analysis dataset: features of the joined Federal Reserve, Census, and student debt tables, including derived columns.

:::

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As mentioned previously, all the data from the Federal Reserve and Census surveys were anonymized to protect individual privacy, preventing the identification of specific individuals. Additionally, the data was presented in aggregated forms such as percentages and averages, rather than detailed individual records, to further safeguard privacy. However, despite these measures, the potential for bias remains a concern, as survey respondents may not fully represent the entire population, with certain demographics potentially being underrepresented. Additionally, working with aggregated data can weaken the statistical power of analyses and increase the likelihood of Type II errors (failing to detect a true effect) or result in misleading statistics. Regarding data integrity, while the Federal Reserve and Census data were sourced from reputable institutions, our dataset for student debt was obtained from a less reliable source. We acknowledge the limitations of this dataset but feel that its inclusion was crucial for providing a more comprehensive analysis. We advise readers to consider the findings from this source alongside the more robust datasets and to be mindful of the potential impact on our study’s overall conclusions.


# Methods

## Project Design


### Summary
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Due to the limitations of our dataset, there were relatively few features to choose from that could be associated with homeownership and student debt. Initially, we selected specific features and employed linear regression models to evaluate the influence of each feature on homeownership rates. To explore more intricate, non-linear relationships among the features, we then applied a Random Forest machine learning model. This approach allowed us to build upon the findings from our linear regression analysis and assess the relative importance of each feature. These methodologies together provide a comprehensive framework for understanding the potential impact of student loan debt and generational factors on homeownership rates.

### 1. Materials List (Software used)
1. Python v3.11.5
2. pandas v2.0.3
3. numpy v1.24.3
4. statsmodels v0.14.0 for ordinary least squares (linear regression) model
5. scikit-learn v1.3.0 for random forest model
6. scipy v1.11.1 for statistical analysis
7. matplotlib v3.7.2 and seaborn v0.12.2 for visualizations


### 2. Data Exploration and Initial Analysis

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Initially, our dataset comprised 144 rows. Following preliminary data cleaning and the focus on specific generational cohorts, the dataset was refined to 92 rows, which were subsequently used for analysis. This reduction was essential to ensure the relevance and accuracy of the data in examining the targeted generational groups. While analyzing a small dataset can be manageable and insightful, it also comes with significant limitations related to statistical power, model complexity, and generalizability. 

#### Analyzing Trends Over Time

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Initially, we concentrated on examining the overall trend of student loan debt at graduation across different generations, specifically Baby Boomers, Generation X, and Millennials. To facilitate this analysis, we calculated the average graduation year for each entry in our dataset, covering graduates as far back as 1967, and calculated the slope of debt at graduation over time for each generation. Additionally, we assessed the general trend in homeownership rates over the years. However, since our homeownership data only dates back to 1989, it presents a challenge in directly comparing the homeownership rates of Baby Boomers with our graduation debt trend.
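&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;One way to obtain a per-generation slope is a first-degree least-squares fit of debt against graduation year. The sketch below uses `numpy.polyfit` on toy groups; the column name `avg_grad_year`, the group labels, and all values are placeholders chosen so the slopes are easy to verify, and the fitting routine is an assumption rather than a statement of the exact method we used.

```python
import numpy as np
import pandas as pd

# Toy data: group A rises by 100 per year, group B by 500 per year.
toy = pd.DataFrame({
    "generation": ["A", "A", "A", "B", "B", "B"],
    "avg_grad_year": [1970, 1975, 1980, 1990, 1995, 2000],
    "debt_at_grad": [1000, 1500, 2000, 5000, 7500, 10000],
})


def debt_slope(group: pd.DataFrame) -> float:
    """Slope of a straight-line fit of debt_at_grad on avg_grad_year."""
    slope, _intercept = np.polyfit(
        group["avg_grad_year"].to_numpy(dtype=float),
        group["debt_at_grad"].to_numpy(dtype=float),
        1,
    )
    return float(slope)


slopes = {name: debt_slope(group) for name, group in toy.groupby("generation")}
```

The slope is read as the average change in debt at graduation per additional graduation year within that generation.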


#### Choosing an Analysis Dataset

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Initially, we intended to incorporate trends for Generation Z into our analysis. However, upon integrating all our datasets, we discovered that the available data for Generation Z was insufficient for a robust analysis of their student debt and homeownership rates. Including Generation Z data would have risked distorting the analysis of Baby Boomers, Generation X, and Millennials. Consequently, we decided to exclude data for Generation Z from our analysis.

#### Handling Missing Data

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;While our survey datasets from the Census and Federal Reserve included detailed trends for Traditionalists, our student debt dataset only covered graduation years from 1970 onward. Since Traditionalists were born in 1945 or earlier and since we assumed the average college graduation age of 21, they were not included in the student debt data. Given that the available dataset did not capture their college graduation years, we had to exclude Traditionalists from our analysis.

### 3. Identification of Significant Features

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As mentioned earlier, there were only a handful of features that we could employ in our analysis. In order to determine whether student debt affected homeownership trends, we created a variable to hold the homeownership rate, using the following formula:

<div style="text-align: center">$$ hr = \frac{to}{ts} \times 100 $$ </div>

Where:<br>
<i>to</i> = `total_owner`<br>
<i>ts</i> = `total_surveyed`<br>
<i>hr</i> = `homeownership_rate`<br>
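
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;In pandas, this formula amounts to a single column operation. The two rows below are made-up counts used only to show the arithmetic.

```python
import pandas as pd

# Made-up counts for illustration.
df = pd.DataFrame({
    "total_owner": [50, 30],
    "total_surveyed": [200, 40],
})

# hr = to / ts * 100
df["homeownership_rate"] = df["total_owner"] / df["total_surveyed"] * 100
```

Here the first row yields a rate of 25% (50 of 200) and the second a rate of 75% (30 of 40).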

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;In our linear regression and random forest models, the dependent variable was `homeownership_rate`, while the independent variables included `debt_at_grad`, `avg_debt_to_income`, `avg_start_salary`, `percent_mortgage`, `percent_education_loan`, and `generation_order`.

### 4. Building a Linear Regression Model

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To examine the relationship between homeownership rates and average debt at graduation across generations, we utilized a linear regression model. While our primary focus is to shed light on the impact of student loan debt by generation on homeownership rates, we included the additional features mentioned above in our model to enhance its robustness and accuracy. We then created training and testing datasets with a 70%/30% split and utilized the statsmodels library to perform Ordinary Least Squares (OLS) regression and evaluate the performance of the model.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the results of a linear regression analysis to be valid, certain assumptions about the data must be met. These assumptions include linearity, independence, homoscedasticity, normality, and lack of multicollinearity.

#### 4a. Testing for Linearity

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The linearity assumption requires that the relationship between the independent variables and the dependent variable is linear. To assess this, we plotted the residuals (errors) against the fitted (predicted) values and checked for any systematic patterns. If the residuals display a clear pattern or trend, it suggests that the linear relationship assumption might be violated. Ideally, residuals should scatter randomly around zero, indicating that the linear model is appropriate.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The residuals vs. fitted values plot in [Figure 2](#figure-2) shows a generally random scatter, but there is a very slight parabolic tendency present. Despite this minor curvature, the pattern is not strong enough to warrant a departure from a linear regression model.

![Figure 2: Linearity Assumption Satisfied: Residuals vs. Fitted Plot Shows No Strong Curvature or Patterns](images/residuals.png){#figure-2}

#### 4b. Testing for Independence

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The independence assumption stipulates that the residuals are independent of each other. To test for independence, we used the Durbin-Watson statistic, which tests for autocorrelation in the residuals. The Durbin-Watson statistic ranges from 0 to 4. A value close to 2 suggests no autocorrelation. If the value is significantly less than 2, it indicates positive autocorrelation, while a value significantly greater than 2 indicates negative autocorrelation.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Our analysis yielded a Durbin-Watson statistic of 1.77. This value is somewhat close to 2, suggesting that while there is a tendency towards positive autocorrelation, it is not extreme. In practice, DW values close to 2 (within a range of about 1.5 to 2.5) are often considered acceptable.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Given that the Durbin-Watson statistic indicated some positive correlation, we conducted the Breusch-Godfrey test to confirm and assess the extent of autocorrelation. The test yielded a statistic of 0.948 and a p-value of 0.330. Since this p-value exceeds the conventional significance threshold of 0.05, we conclude that there is no statistically significant evidence of autocorrelation in the residuals. This suggests that the residuals are likely independent, supporting the assumption of independence.

#### 4c. Testing for Homoscedasticity

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The homoscedasticity assumption requires that residuals have constant variance across all levels of the independent variable.  Although we noted a minor curvature in the linearity assumption, [Figure 3](#figure-3) shows no pronounced pattern of variance around the horizontal line, indicating that the homoscedasticity assumption is generally met.

![Figure 3: Homoscedasticity Assumption Satisfied: Residuals are Generally Well-Dispersed Around the Horizontal Line](images/homosced.png){#figure-3}

#### 4d. Testing for Normality

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The normality assumption requires that the residuals are normally distributed. To assess this, we created a Q-Q plot of the residuals, seen in [Figure 4](#figure-4). The points on the Q-Q plot slightly deviate but generally follow the 45-degree line, indicating that the residuals are approximately normally distributed.

![Figure 4: Normality Assumption Satisfied: Residuals Generally Align with the 45-Degree Line](images/normality.png){#figure-4}

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We also performed the Shapiro-Wilk test for normality and obtained a test statistic of 0.99 and a p-value of 0.86. Since the p-value is well above the conventional significance level of 0.05, there is no significant evidence to reject the null hypothesis of normality. This supports the assumption that the residuals are approximately normally distributed.
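&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The Shapiro-Wilk test is provided by `scipy.stats.shapiro`. The sketch below applies it to synthetic residuals drawn from a normal distribution; the sample is a stand-in for illustration, not our model's residuals.

```python
import numpy as np
from scipy import stats

# Synthetic residuals (a stand-in for the fitted model's residuals).
rng = np.random.default_rng(2)
residuals = rng.normal(loc=0.0, scale=1.0, size=64)

# Null hypothesis: the sample was drawn from a normal distribution.
statistic, p_value = stats.shapiro(residuals)

print(f"Shapiro-Wilk W = {statistic:.3f}, p = {p_value:.3f}")
```

A W statistic close to 1 together with a p-value above the significance level gives no evidence against normality.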

#### 4e. Testing for Multicollinearity

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Multicollinearity occurs when independent variables are highly correlated, which can distort the estimation of regression coefficients. We evaluated multicollinearity using the Variance Inflation Factor (VIF) from the statsmodels library.

| Feature                                | Value                 |
|---------------------------------------|-----------------------|
| `debt_at_grad`             | 26.534943     |
| `avg_start_salary`              | 21.435287    |
| `avg_debt_to_income`        | 28.121863    |
| `percent_mortgage` | 1.411813   |
| `percent_education_loan`              | 7.590111    |
| `generation_order`                            | 7.403381     |

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Half of our features had a VIF value exceeding 10, signaling a significant concern. To further test this, we analyzed the condition number, which resulted in 2.41e+05, which is relatively high. This issue can compromise the reliability and interpretability of our linear regression model by leading to unstable coefficient estimates and inflated standard errors. Although the other assumptions of the model are satisfied, the presence of multicollinearity necessitates cautious interpretation of our results.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Additionally, to assess the fit of our model, we measured the following values:

* R-squared
* Adjusted R-squared
* Mean Absolute Error (MAE)
* Mean Squared Error (MSE)
* Root Mean Squared Error (RMSE)
* F-statistic
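
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The error and fit metrics listed above follow directly from their definitions. The sketch below computes them with numpy for a tiny made-up example; the helper name `regression_metrics` and the sample values are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_predictors):
    """Return MAE, MSE, RMSE, R-squared and adjusted R-squared."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    n = len(y_true)

    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)

    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-squared penalizes additional predictors.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9], n_predictors=1)
```

For these four points the errors are 1, 0, -1 and 0, giving an MAE of 0.5, an MSE of 0.5, an R-squared of 0.9 and an adjusted R-squared of 0.85.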

### 5. Building a Random Forest Model

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Given the limitations identified in the linear regression model, we opted to perform a Random Forest analysis to further investigate the relationship between homeownership rates and various predictor variables. The primary motivation for this choice was to address some of the challenges and assumptions inherent in linear regression that may impact the robustness and interpretability of the results. Random Forest models, being ensemble methods based on decision trees, are less sensitive to multicollinearity and do not require the assumptions of linearity, homoscedasticity, or normality of residuals. This makes them a more flexible and robust alternative when dealing with correlated predictors. Overall, the Random Forest model provides an opportunity to validate and complement the findings from our linear regression analysis, offering a more comprehensive approach to analyzing the predictors of homeownership rates and mitigating some of the limitations observed in the linear model.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We utilized scikit-learn’s random forest implementation to train and test the model on our dataset. Additionally, we employed the permutation importance function from scikit-learn to determine the importance of the various features.
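&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;A minimal sketch of this workflow on synthetic data is shown below. It assumes `RandomForestRegressor` as the estimator (the target is a continuous rate), and the hyperparameters, split, and `random_state` values are illustrative choices rather than the settings used in our analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first of three features drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 0.1 * rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Shuffle one feature at a time on the held-out rows and record how much
# the model's score drops; larger drops mean more important features.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
ranking = result.importances_mean.argsort()[::-1]  # most important first
```

Permutation importance is computed on held-out data here so that it reflects how much the model relies on each feature for generalization rather than for fitting the training rows.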

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To assess the fit of our model and compare it with our linear regression model, we measured the following values:

* R-squared
* Adjusted R-squared
* Mean Absolute Error (MAE)
* Mean Squared Error (MSE)
* Root Mean Squared Error (RMSE)







# Results

## Trends Over Time


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Our analysis of graduation debt reveals a clear upward trajectory in average graduation debt levels over time for each successive generation, as shown in [Figure 5](#figure-5). For the Baby Boomers, graduates had relatively low levels of student loan debt, which remained stable and below $5,000 through the 1970s and early 1980s. In contrast, Generation X experienced a noticeable increase in debt levels starting in the late 1980s, with a steady rise through the 1990s, peaking around $15,000 by the early 2000s. This trend continues with Millennials, who faced the highest levels of debt at graduation. Starting in the early 2000s, the average debt for this generation saw a steep increase, surpassing $20,000 and reaching nearly $30,000 by 2015. This escalation in average student loan debt across generations highlights a growing financial burden on recent graduates, with Millennials bearing the highest debt load. The data suggest significant changes in higher education financing, economic conditions, and possibly policy impacts over the decades. 


![Figure 5: Rising Average Debt Levels at Graduation Across Successive Generations.](images/debt_over_time.png){#figure-5}


| Generation  | Slope         |
|------------|-----------|
| Baby Boomer      | 296.62     |
| Generation X       | 858.02    |
| Millennial   | 768.13    |




&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The plot in [Figure 6](#figure-6) illustrates homeownership rates from 1989 to 2022. The initial rate starts at around 47.5% and experiences a slight dip shortly after. From the early 1990s, there is a significant upward trend, peaking at around 62.5% by the mid-2000s. This peak occurred during a significant decade for mortgage rates, particularly influenced by the 2008 financial crisis [(Campisi, 2022)](#ref5). After the peak, the rate declines from the late 2000s, likely influenced by the housing market crash and the global financial crisis of 2007-2008. From the mid-2010s onwards, the homeownership rate climbs steadily to approximately 65% by 2022. This recent increase might be due to various factors, including economic recovery and an increase in housing demand.

![Figure 6: Fluctuations in Homeownership Rates from 1989 to 2022: A Reflection of Economic Conditions and Financial Barriers.](images/homeownership_over_time.png){#figure-6}





## Linear Regression Model

### Model Fit

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The linear regression model demonstrates a relatively good fit, with an R-squared value of 0.708, indicating that 70.8% of the variability in homeownership rates is explained by the predictors. The adjusted R-squared of 0.681 suggests that after accounting for the number of predictors in the model, about 68.1% of the variance is explained, which is still substantial. The F-statistic for the model is 26.67 with a p-value of 6.99e-16, indicating that the predictors are jointly statistically significant.

##### Model test statistics:

| Metric                                | Value                 |
|---------------------------------------|-----------------------|
| R-squared             | 0.708     |
| Adjusted R-squared              |  0.681    |
| Mean Absolute Error (MAE)       | 10.5018    |
| Mean Squared Error (MSE) |  139.7864   |
| Root Mean Squared Error (RMSE)              | 11.8231    |
| F-statistic                             | 26.67    |
| P-value (F-statistic)                            | 6.99e-16    |
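
As a consistency check on the table, RMSE is by definition the square root of MSE, so the two reported values should agree. The analysis code is not shown in this section, so the following is a minimal sketch in Python; the helper name `error_metrics` and the toy values are hypothetical.

```python
import math

def error_metrics(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired observations."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)

# Toy values for illustration only (not the study's data).
mae, mse, rmse = error_metrics([60.0, 64.0, 66.0], [62.0, 63.0, 70.0])

# Square root of the reported linear-regression MSE, rounded like the table.
reported_rmse_check = round(math.sqrt(139.7864), 4)
print(reported_rmse_check)
```

The rounded square root of the reported MSE matches the reported RMSE of 11.8231, so the two rows are internally consistent.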

### Coefficient Results

| Feature                  | Coefficient | P-value |
|--------------------------|-------------|---------|
| `debt_at_grad`           | 0.0015      | 0.073   |
| `avg_start_salary`       | -0.0013     | 0.015   |
| `avg_debt_to_income`     | -0.6644     | 0.136   |
| `percent_mortgage`       | 0.7176      | < 0.001 |
| `percent_education_loan` | 0.6885      | 0.046   |
| `generation_order`       | -3.9232     | 0.429   |



&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We anticipated that `debt_at_grad`, `percent_education_loan`, and `generation_order` would be the most influential features. However, the observed relationships were not as straightforward as expected.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The coefficient for `debt_at_grad` indicates a modest positive relationship with homeownership rate, suggesting that increased graduation debt is associated with higher homeownership rates. However, the p-value for this coefficient exceeds conventional significance levels, casting uncertainty on its precise impact.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Additionally, the coefficient for `percent_education_loan` suggests an unexpected positive association between education loan debt and homeownership rate. This counterintuitive finding may be attributable to factors such as high multicollinearity, sampling issues, or specific economic conditions that could influence the observed relationship.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Finally, the coefficient for `generation_order` is negative, indicating that higher values of `generation_order` are associated with lower homeownership rates. Although this is the largest coefficient in absolute terms, its magnitude is not directly comparable with the others because the predictors are measured on different scales, and its lack of statistical significance (p = 0.429) suggests that the effect is not robust within the framework of this model. This result could be influenced by various factors, including high multicollinearity among the features included in the regression model.
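
One simple first screen for the multicollinearity mentioned above is the pairwise correlation between predictors. The sketch below is illustrative only: the helper `pearson` is hypothetical, and the values are made up to mimic debt rising with each generation. They are not the study's data. Variance inflation factors would be a more complete diagnostic.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical values for illustration only (not the study's data):
# debt at graduation climbs with each generation, so the predictors move together.
generation_order = [1, 1, 2, 2, 3, 3]
debt_at_grad = [3000, 4000, 9000, 14000, 22000, 29000]

r = pearson(generation_order, debt_at_grad)
print(round(r, 3))
```

A correlation close to 1 between two predictors means the regression cannot cleanly separate their individual effects, which can inflate standard errors and even flip coefficient signs.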

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The coefficient analysis reveals that `percent_mortgage` has the largest positive coefficient and the strongest statistical significance, with a coefficient of 0.7176 and a p-value below 0.001. This finding aligns with practical expectations, as mortgage debt is a critical determinant of homeownership.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The variables `avg_start_salary` and `avg_debt_to_income` were included to refine the model. Notably, `avg_start_salary` has a slight negative association with homeownership rates, suggesting that higher starting salaries might correlate with lower homeownership, though this effect is minor. Additionally, while `avg_debt_to_income` shows a negative trend, its p-value indicates that this relationship is not statistically significant at the 0.05 level, suggesting potential weakness or confounding factors.
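
To make the size of these coefficients concrete, a linear regression coefficient gives the change in the predicted outcome for a one-unit change in a feature, holding the others fixed. The sketch below assumes that `debt_at_grad` and `avg_start_salary` are measured in dollars and that the homeownership rate is measured in percentage points. Neither unit is stated in this section, and the helper `predicted_change` is hypothetical.

```python
# Coefficients copied from the Coefficient Results table.
coefficients = {
    "debt_at_grad": 0.0015,
    "avg_start_salary": -0.0013,
}

def predicted_change(feature, delta):
    """Change in the predicted homeownership rate when one feature moves
    by `delta` and every other feature is held fixed."""
    return coefficients[feature] * delta

debt_effect = predicted_change("debt_at_grad", 1000)        # +$1,000 of debt (assumed unit)
salary_effect = predicted_change("avg_start_salary", 1000)  # +$1,000 of salary (assumed unit)
print(round(debt_effect, 2), round(salary_effect, 2))
```

Under these assumptions, an additional $1,000 of graduation debt shifts the predicted rate by about 1.5 points, and an additional $1,000 of starting salary shifts it by about -1.3 points, which is why both coefficients look small despite the wide range of the underlying dollar values.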


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The coefficients offer a nuanced view of how differences in student loan debt between older and current generations affect homeownership trends, but they do not provide a definitive answer.




## Random Forest Model

### Model Fit

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;In this model analysis, we used the same dependent and independent variables as in our linear regression model. The Mean Squared Error was 88.46, about 37% lower than our linear regression MSE of 139.79. Although an MSE of 88.46 still indicates some error in the predictions, it is a useful benchmark for comparing the model’s performance with our linear regression model. The R-squared value was 0.726, which signifies that approximately 72.6% of the variability in homeownership rates can be explained by the model.


##### Model test statistics:

| Metric                                | Value                 |
|---------------------------------------|-----------------------|
| R-squared             | 0.7262     |
| Adjusted R-squared              |  0.5893    |
| Mean Absolute Error (MAE)       | 6.4070    |
| Mean Squared Error (MSE) |  88.4552   |
| Root Mean Squared Error (RMSE)              | 9.4051    |
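
The gap between the R-squared and adjusted R-squared values in this table follows from the usual adjustment formula, which penalizes the number of predictors relative to the number of observations. The sketch below applies that formula with the six predictors used in both models. The evaluation sample size is not stated in this section, so `n_obs = 19` is purely an assumption, chosen because it happens to reproduce the reported figure.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# R-squared from the table above, with six predictors.
# n_obs = 19 is an assumption: the evaluation sample size is not stated here.
rf_adjusted = adjusted_r2(0.7262, n_obs=19, n_features=6)
print(round(rf_adjusted, 4))
```

If the evaluation sample is indeed this small, the error metrics in the table rest on few observations and should be read with corresponding caution.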




### Feature Importance Results

| Feature                  | Importance |
|--------------------------|------------|
| `debt_at_grad`           | 0.0936     |
| `avg_start_salary`       | 0.0727     |
| `avg_debt_to_income`     | 0.0891     |
| `percent_mortgage`       | 0.5900     |
| `percent_education_loan` | 0.1162     |
| `generation_order`       | 0.0385     |


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To aid in visualizing the importance of different features, we created a plot of their respective importances. We excluded `percent_mortgage` from this visualization due to its dominance as the most significant feature. 
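
The ranking used for that plot can be reproduced directly from the importance table. The following is a minimal sketch rather than the code used in the analysis; the variable names are hypothetical, and the values are the reported importances rounded to four decimals.

```python
# Importances copied from the table above (rounded to four decimals).
importances = {
    "debt_at_grad": 0.0936,
    "avg_start_salary": 0.0727,
    "avg_debt_to_income": 0.0891,
    "percent_mortgage": 0.5900,
    "percent_education_loan": 0.1162,
    "generation_order": 0.0385,
}

# Drop the dominant feature, then rank the rest from most to least important.
remaining = {k: v for k, v in importances.items() if k != "percent_mortgage"}
ranked = sorted(remaining.items(), key=lambda item: item[1], reverse=True)

# Share of the remaining importance held by each feature.
total = sum(remaining.values())
shares = {name: value / total for name, value in ranked}

for name, value in ranked:
    print(f"{name:<24}{value:.4f}  ({shares[name]:.0%} of the remainder)")
```

Removing `percent_mortgage` does not change the order of the other features. It only makes their relative differences easier to see.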

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As illustrated in [Figure 7](#figure-7), the variables `percent_education_loan` and `debt_at_grad` emerge as the two most influential of the remaining features in predicting the homeownership rate. This suggests that the amount of student loan debt individuals have plays a substantial role in the model's predictions of homeownership. Feature importance does not indicate the direction of an effect, but higher student loan debt could plausibly be a barrier to homeownership because of the financial burden it places on individuals.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Conversely, `generation_order` ranks as the least significant feature. This implies that, according to the model, the generational sequence (e.g., Baby Boomer, Gen X, Millennials) is not a major factor in predicting homeownership rates compared to the direct financial impacts like student loan debt. 


![Figure 7: Generational cohort is the least significant predictor of homeownership rate.](images/features.png){#figure-7}

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Other features, such as `avg_start_salary` and `avg_debt_to_income`, indicate that although starting salary and average debt-to-income ratio play a role in predicting homeownership, they are not as important as education loan debt.



# Conclusions

The rise in debt at graduation over the generations suggests that newer graduates are starting their post-college lives with heavier financial burdens. Higher student loan debt can directly impact their ability to save for a down payment and secure mortgages, potentially delaying their entry into homeownership.






### Random Forest Conclusion

As illustrated in [Figure 7](#figure-7), the variables `percent_education_loan` and `debt_at_grad` emerge as the most influential factors in predicting the homeownership rate. This highlights the significant role that student loan debt plays in determining an individual's likelihood of owning a home. The substantial impact of `percent_education_loan` suggests that the proportion of education-related debt an individual carries is a critical determinant of homeownership. Similarly, `debt_at_grad` indicates that the amount of debt incurred by graduation also significantly affects homeownership prospects.

The prominence of these variables implies that higher levels of student loan debt could be a considerable barrier to homeownership. This financial burden may limit individuals' ability to save for a down payment, manage monthly mortgage payments, or qualify for favorable loan terms. Consequently, the burden of student loans may hinder long-term financial stability and reduce the likelihood of achieving homeownership.

Addressing the challenges associated with student loan debt could be pivotal in improving homeownership rates. Policy interventions aimed at reducing education-related debt or providing targeted financial assistance might alleviate some of the barriers identified in this analysis and promote greater access to homeownership.









Conversely, `generation_order` ranks as the least important feature in predicting homeownership rates. This finding suggests that the generational cohort to which an individual belongs (whether Baby Boomer, Gen X, Millennial, or another group) has minimal impact on their likelihood of owning a home when compared to more direct financial factors, such as student loan debt.

This result indicates that, according to the model, the generational sequence alone does not substantially influence homeownership rates. While generational trends and experiences might shape broader economic and social patterns, the specific cohort to which an individual belongs appears less relevant in this context than the tangible financial burdens they face.

The minimal importance of `generation_order` highlights the value of focusing on financial factors that more directly affect homeownership, such as debt levels, income, and savings. It underscores that while generational trends might provide context, the financial challenges individuals face are more decisive in determining their ability to achieve homeownership.

This insight emphasizes the need for targeted policies and interventions that address financial barriers directly, rather than focusing solely on generational characteristics. By prioritizing measures that alleviate student loan debt and improve financial stability, stakeholders can more effectively support homeownership aspirations across different generational cohorts.























In addressing our research question (how do differences in student loan debt between older and current generations affect homeownership trends?), our study offers insightful but nuanced perspectives. The historical context and the analysis reveal a complex relationship between student loan debt and homeownership rates. Our findings indicate that student loan debt at graduation (`debt_at_grad`) is a contributing factor in homeownership rates, though not a dominant one. The data show a clear increase in student loan debt across generations, with Millennials bearing the highest levels, which is likely contributing to the challenges they face in achieving homeownership. Despite the substantial debt load carried by recent graduates, the measured impact of debt at graduation on homeownership is relatively moderate.

Through linear regression and Random Forest analyses, we found that `debt_at_grad` has a less pronounced but still notable effect. `percent_education_loan`, while less influential than mortgage debt, also plays a role, suggesting that education loan debt does impact homeownership trends to a degree. Interestingly, `generation_order`, representing the sequence of generations, was found to be the least important predictor of homeownership rates. This indicates that, within the context of our model, generational cohort alone does not substantially explain variations in homeownership rates compared to the financial burden of student loans and mortgages.

It is crucial to acknowledge that student loan debt is not the sole determinant of homeownership trends. Inflation, for instance, has affected purchasing power and contributed to rising housing costs, which can make homeownership more challenging for many. Additionally, extensive evidence highlights how student loan debt creates barriers to homeownership through its effects on mortgage eligibility and credit scores. According to a Federal Reserve Board study summarized by the [Urban Institute](https://housingmatters.urban.org/articles/how-student-loan-debt-affects-racial-homeownership-gap), a $1,000 increase in student loan debt lowers the homeownership rate by about 1.8 percent for public four-year college goers, which amounts to an average delay of about four months in attaining homeownership. Furthermore, as noted above, despite our efforts to mitigate bias, the potential for underrepresentation of certain demographics in the survey samples remains a concern. Student loan debt can exacerbate racial disparities. Racial wealth and income gaps are rooted in part in historical discriminatory housing policies, meaning that Black students, in particular, may face greater financial risks in pursuing higher education. Not adequately capturing the experiences of these underrepresented groups can affect our research and must be taken into consideration when evaluating our results.

In summary, while student loan debt does impact homeownership trends, it is just one of several factors contributing to the complexity of financial decisions related to homeownership. The rising debt levels faced by Millennials and Gen Z are significant but must be considered alongside other financial elements. Our research underscores the need for a holistic view when analyzing homeownership trends and suggests that addressing the broader financial challenges faced by younger generations may be crucial in facilitating increased homeownership rates.






# References
```{bibliography}
@article{federal2024,
  author = {Board of Governors of the Federal Reserve System},
  year = {2024},
  url = {https://www.federalreserve.gov/releases/g19/HIST/cc_hist_memo_levels.html}
}
```



<a id="ref1"></a> 
1. Board of Governors of the Federal Reserve System. (2024). https://www.federalreserve.gov/releases/g19/HIST/cc_hist_memo_levels.html 

<a id="ref2"></a> 
2. Board of Governors of the Federal Reserve System. (n.d.). Survey of Consumer Finances (SCF). https://www.federalreserve.gov/econres/scfindex.html 

<a id="ref3"></a> 
3. U.S. Census Bureau. (2019, April 15). Housing vacancies and homeownership: Historical tables. https://www.census.gov/housing/hvs/data/histtabs.html 

<a id="ref4"></a> 
4. McLoughlin, D. (2023, May 5). Student loan debt by year. https://wordsrated.com/student-loan-debt-by-year/

<a id="ref5"></a> 
5. Campisi, N. (2022, December 29). Mortgage rates history. Forbes Advisor. https://www.forbes.com/advisor/mortgages/mortgage-rates-history/
