## Unveiling the Tapestry of Life: A Machine Learning Exploration of Socioeconomic and Housing Dynamics

Rahul Kanth Panganamamula

## Introduction

The objective of this project is to delve into the American Community Survey 2022 data to uncover intricate patterns and insights related to socioeconomic and housing dynamics. While many studies have explored various aspects of this dataset, our approach aims to integrate multiple dimensions—demographic, socioeconomic, and housing—to provide a comprehensive understanding of the factors influencing community well-being and economic status. This research is vital as it contributes to a nuanced understanding of societal structures, aiding policymakers, community planners, and researchers in making informed decisions.

Our dataset stands out not only for its volume but for the depth of information it contains. From basic demographic details like age and sex to more complex socioeconomic indicators such as income levels, educational attainment, and housing affordability, the dataset offers a panoramic view of the living conditions and societal structures within the surveyed region. Among the myriad of columns, we have judiciously selected 50 features that hold the key to unlocking a multitude of patterns and trends. These features span across various domains, including:

Demographic and Socioeconomic Features

    Age, Sex, Race, and Marital Status: These fundamental demographic attributes lay the groundwork for understanding population structure and social dynamics. They allow us to examine trends such as aging, gender disparities, racial diversity, and family composition.
    Household Income and Personal Income: These features are pivotal in assessing economic well-being and inequality. By analyzing income data, we can identify patterns of wealth distribution, poverty, and economic mobility.
    Educational Attainment: This indicator sheds light on the levels of education reached within the community, offering insights into access to education, lifelong earning potential, and societal value placed on education.
    Employment Status and Occupation Code: These variables help us understand labor market participation, industry sectors employing the population, and the nature of employment—full-time, part-time, or unemployed.
    Income-to-Poverty Ratio: An essential measure for identifying households that face economic hardship, enabling targeted studies on poverty and its impact on various life outcomes.

Housing and Living Conditions

    Property Value and Monthly Rent: These indicators provide a glimpse into the housing market, reflecting on affordability, housing quality, and the economic status of neighborhoods.
    Number of Bedrooms and Rooms: Insights from these features relate to living space, overcrowding, and the capacity of housing to meet family needs.
    Year Structure Built: This feature offers historical context, revealing patterns in housing development, urbanization, and potentially, the aging infrastructure.

Lifestyle and Expenses

    Vehicle Ownership: A proxy for mobility, access to services, and, indirectly, economic status.
    Monthly Utility Costs (Electricity, Gas, Water): These variables offer perspectives on living expenses, energy consumption, and the financial burden of utilities on households.
    Internet Access: In the digital age, internet access is a critical factor in educational opportunities, employment, and social connectivity.

Advanced Socioeconomic Indicators

    Public and Private Health Insurance Coverage: These features highlight healthcare accessibility and the role of public versus private support in ensuring health coverage.
    Disability Status: Understanding the prevalence and types of disabilities within the population can inform accessibility, policy, and support services.
    Military Service: Insights into the veteran population, addressing their unique needs, experiences, and contributions to society.

Our selection is designed not just for breadth but depth, allowing us to construct a multi-dimensional analysis that can address complex questions. By exploring these features through the lens of machine learning, we aim to uncover patterns that are not immediately visible, predict outcomes of interest, and generate insights that have tangible implications for policy-making, community support, and further research.


**Data Dictionary**

**Explanatory Variables:** 

    Demographic Variables: Age of the householder (AGEP), Sex (SEX), Marital Status (MAR), Number of Persons in the Household (NP).
    Housing Variables: Tenure (TEN, indicating whether the home is owned or rented), Number of Bedrooms (BDSP), Year Structure Built (YBL) as a proxy for housing quality.
    Employment Variables: Employment Status (EMP), Occupation (OCCP), Hours Worked Per Week (WKHP), Industry (INDP) as indicators of labor market engagement and sector.

### Literature Review

Summary of the reviewed study (linked at the end of this section):

    Introduction and Problem Statement: The project focuses on using machine learning to predict the income level (>50K or <=50K) of individuals based on various demographic and employment-related features.

    Data Preparation and EDA: The dataset, consisting of 32,560 records with 15 attributes, was cleaned and explored. Special attention was given to handling missing values and understanding the distribution of various features.

    Feature Engineering: Categorical variables were encoded, and numerical features were scaled to improve model performance.

    Model Building: Various models including Logistic Regression, KNN, Decision Trees, Random Forest, AdaBoost, SVM, Gradient Boosting, and XGBoost were trained to predict income levels.

    Model Evaluation: Models were evaluated using metrics such as accuracy, precision, recall, F1-score, and ROC AUC curves. XGBoost and Gradient Boosting were identified as the best performing models.

    Hyperparameter Tuning: Further improvements were attempted by tuning the hyperparameters of the XGBoost model, which resulted in minimal gains.

    Conclusion: The research concluded with the selection of the XGBoost model as the most effective for predicting income levels based on the dataset, after considering performance metrics and cross-validation scores.

Purpose of the Research:

The research is centered around leveraging machine learning techniques to analyze census data with the aim of predicting income levels. This endeavor is significant for several reasons:

    Policy Making and Social Research: Understanding income distribution and the factors influencing income levels is crucial for policymakers and social researchers. The insights gained can aid in the formulation of policies aimed at income equality and poverty alleviation.

    Economic Analysis: The project offers a methodological framework for using machine learning in economic research, showcasing how data-driven insights can inform our understanding of economic conditions and trends.

    Technical Showcase: It serves as a practical example of applying various machine learning techniques, including preprocessing steps, model selection, and evaluation strategies. This is valuable for practitioners and students in the field of data science.

    Problem-Solving with Data Science: The research exemplifies how data science can address real-world problems by transforming raw data into actionable insights, thereby demonstrating the impact of machine learning in societal applications.
    
    
    
**Link to similar research: https://medium.com/@lokeshbisen989/census-income-project-using-python-8f8d33a5942d**

In their research, the authors trained several models, including Logistic Regression, KNN, Decision Trees, Random Forest, AdaBoost, SVM, Gradient Boosting, and XGBoost, to predict income levels. Building on their work, we are implementing three models for the 2022 ACS PUMS dataset:

XGBoost Model, Neural Network Model, Random Forest Model

### American Community Survey (ACS)

The American Community Survey (ACS) is an ongoing survey that provides vital information on a yearly basis about our nation and its people. Information from the survey generates data that help inform how trillions of dollars in federal funds are distributed each year.

ACS Information:
Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics. Public officials, planners, and entrepreneurs use this information to assess the past and plan the future. When you respond to the ACS, you are doing your part to help your community plan for hospitals and schools, support school lunch programs, improve emergency services, build bridges, and inform businesses looking to add jobs and expand to new markets, and more.

## Research Questions

    1. Can we predict a household's income level based on demographic, housing, and employment characteristics?
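
The research question leaves "income level" open. As a minimal sketch, the target could be a binary label derived from household income. This assumes the income column is named `HINCP` and borrows a 50,000 threshold from the >50K / <=50K split used in the reviewed study; the threshold and the helper name `make_income_target` are hypothetical, not something this project has fixed.

```python
import numpy as np
import pandas as pd

# Hypothetical threshold, mirroring the >50K / <=50K split of the reviewed study.
INCOME_THRESHOLD = 50_000


def make_income_target(hincp: pd.Series, threshold: float = INCOME_THRESHOLD) -> pd.Series:
    """Return 1 where income exceeds the threshold, 0 otherwise, NA where income is missing."""
    target = (hincp > threshold).astype("Int64")
    # Keep missing incomes as missing instead of silently labelling them 0.
    return target.where(hincp.notna())


# Tiny synthetic example; these are not ACS values.
toy = pd.DataFrame({"HINCP": [12_000, 50_000, 87_500, np.nan]})
toy["income_high"] = make_income_target(toy["HINCP"])
print(toy)
```

Keeping missing incomes as NA makes it an explicit later decision whether to drop or impute those rows.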

## Data to be Used

Our dataset consists of 241 columns (features) and 819,228 rows (records).

We use data from the American Community Survey 2022 (https://www.census.gov/programs-surveys/acs/about.html), downloaded from the ACS Public Use Microdata Sample (PUMS) files available at https://www.census.gov/programs-surveys/acs/microdata/access.html.

Specifically, the files come from this location: https://www2.census.gov/programs-surveys/acs/data/pums/2022/

We downloaded the data and uploaded it to our GitHub repository using Git LFS.

Because of data size and running-time constraints, we consider data from only the first 26 states (counting the District of Columbia).

These are the states we are working on:

    AK - Alaska
    AL - Alabama
    AR - Arkansas
    AZ - Arizona
    CA - California
    CO - Colorado
    CT - Connecticut
    DC - District of Columbia
    DE - Delaware
    FL - Florida
    GA - Georgia
    HI - Hawaii
    IA - Iowa
    ID - Idaho
    IL - Illinois
    IN - Indiana
    KS - Kansas
    KY - Kentucky
    LA - Louisiana
    MA - Massachusetts
    MD - Maryland
    ME - Maine
    MI - Michigan
    MN - Minnesota
    MO - Missouri
    MS - Mississippi
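
Given the size and running-time constraints above, one option is to read the file in chunks and keep only the needed columns and states. The sketch below assumes the column names `SERIALNO`, `ST`, `NP`, `TEN`, `BDSP`, and `HINCP`, and assumes `ST` holds numeric state codes. The helper `load_subset` and the in-memory CSV are illustrative stand-ins for the real file.

```python
import io

import pandas as pd

# Columns assumed to be needed; names are assumptions about the PUMS housing file.
USE_COLS = ["SERIALNO", "ST", "NP", "TEN", "BDSP", "HINCP"]


def load_subset(source, state_codes, chunksize=100_000):
    """Read a large CSV in chunks, keeping only selected columns and states."""
    kept = []
    for chunk in pd.read_csv(source, usecols=USE_COLS, chunksize=chunksize):
        kept.append(chunk[chunk["ST"].isin(state_codes)])
    return pd.concat(kept, ignore_index=True)


# Synthetic stand-in for the real file; all values are made up.
fake_csv = io.StringIO(
    "SERIALNO,ST,NP,TEN,BDSP,HINCP,EXTRA\n"
    "A1,1,2,1,3,54000,x\n"
    "A2,6,4,3,2,91000,y\n"
    "A3,36,1,3,1,30000,z\n"
)
subset = load_subset(fake_csv, state_codes={1, 6}, chunksize=2)
print(subset)
```

Passing `usecols` avoids loading all 241 columns, and `chunksize` bounds how many rows are held in memory at once.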

## Approach

#### 1. Exploratory Data Analysis (EDA)

    Objective: Understand the data's characteristics, distribution, and potential relationships between variables.
    Approach:
        Data Cleaning: Handle missing values, outliers, and errors in the dataset.
        Visualization: Use histograms, box plots, scatter plots, and heatmaps to visualize distributions and relationships.
        Correlation Analysis: Identify potential relationships between variables using correlation coefficients.
        Summary Statistics: Generate summary statistics to understand the central tendency, dispersion, and shape of the dataset's distribution.
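
The EDA steps above can be sketched with pandas as follows. The data here is synthetic and only reuses column names from the data dictionary. The IQR rule is one common way to flag outliers, shown as an assumption rather than the method this project settled on.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data with column names from the data dictionary; not real ACS values.
df_demo = pd.DataFrame({
    "NP": rng.integers(1, 7, size=n),
    "BDSP": rng.integers(0, 6, size=n),
    "WKHP": rng.integers(0, 61, size=n).astype(float),
})
# Inject some missing values so the cleaning step has something to report.
df_demo.loc[rng.choice(n, size=10, replace=False), "WKHP"] = np.nan

missing_share = df_demo.isna().mean()       # fraction of missing values per column
summary = df_demo.describe()                # count, mean, std, min, quartiles, max
corr = df_demo.corr(method="pearson")       # pairwise correlation coefficients

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = df_demo["WKHP"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df_demo[(df_demo["WKHP"] < q1 - 1.5 * iqr) | (df_demo["WKHP"] > q3 + 1.5 * iqr)]

print(missing_share, summary, corr, len(outliers), sep="\n\n")
```

The `corr` frame is the kind of input a heatmap would take, and the histograms and box plots would be drawn from the individual columns.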


#### 2. Preprocessing Data

    Objective: Prepare the data for modeling by handling categorical variables, scaling numerical data, and potentially reducing dimensionality.
    Approach:
        Encoding Categorical Variables: Use one-hot encoding or label encoding for categorical variables.
        Scaling/Normalization: Apply standardization or normalization to numerical variables to ensure they're on the same scale.
        Feature Selection: Use statistical tests, feature importance scores, or dimensionality reduction techniques (e.g., PCA) to select relevant features.
        Data Splitting: Split the data into training, validation, and test sets to prepare for model training and evaluation.
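
A minimal scikit-learn sketch of the encoding, scaling, and splitting steps above. The data is synthetic, the choice of which columns count as numeric or categorical is an assumption, and the 60/20/20 split ratio is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic features using names from the data dictionary; not real ACS values.
X = pd.DataFrame({
    "AGEP": rng.integers(18, 90, size=n),
    "BDSP": rng.integers(0, 6, size=n),
    "TEN": rng.choice([1, 2, 3, 4], size=n),
    "MAR": rng.choice([1, 2, 3, 4, 5], size=n),
})
y = rng.integers(0, 2, size=n)  # placeholder binary income label

# Assumed 60/20/20 split into training, validation, and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["AGEP", "BDSP"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["TEN", "MAR"]),
])

# Fit on the training set only, then reuse the fitted transform elsewhere.
X_train_t = preprocess.fit_transform(X_train)
X_val_t = preprocess.transform(X_val)
X_test_t = preprocess.transform(X_test)
print(X_train_t.shape, X_val_t.shape, X_test_t.shape)
```

Fitting the scaler and encoder on the training split alone keeps information from the validation and test sets out of the preprocessing step.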


#### 3. Model Development and Evaluation

    1. XGBoost Model:

    Approach: Use XGBoost for a classification task to predict the income level category. XGBoost is effective for handling categorical variables and can manage imbalanced data through its weighting mechanism.
    Preprocessing: Encode categorical variables and apply feature scaling. Income categories will be used as the target.
    Evaluation Metrics: Accuracy, Precision, Recall, F1-Score.

    2. Neural Network Model:

    Approach: Design a shallow neural network architecture with fully connected layers to classify the income level category. Neural networks are flexible and can model complex non-linear relationships.
    Preprocessing: One-hot encode categorical variables, normalize numerical features, and categorize the income variable.
    Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, alongside validation loss.

    3. Random Forest Model:

    Approach: Utilize Random Forest for its robustness and ability to handle a mix of numerical and categorical data without extensive preprocessing. It's also useful for feature importance analysis.
    Preprocessing: Minimal preprocessing required; however, encoding categorical variables is necessary.
    Evaluation Metrics: Accuracy, Precision, Recall, F1-Score.
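
The three models and their shared metrics could be wired up roughly as below. To keep the sketch dependent on scikit-learn alone, `GradientBoostingClassifier` is used as a stand-in for XGBoost, and `MLPClassifier` with one hidden layer stands in for the shallow neural network. Both are substitutions, not the project's final choices. `xgboost.XGBClassifier` exposes a scikit-learn style interface and could take the stand-in's place. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for the preprocessed ACS features.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Stand-in for XGBoost (a different gradient boosting implementation).
    "xgboost_stand_in": GradientBoostingClassifier(random_state=0),
    # Shallow fully connected network; inputs are scaled first.
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, pred, average="binary"
    )
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

for name, metrics in results.items():
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```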

#### 4. Comparing Model Performance

    Performance Metrics: Compare the models based on their evaluation metrics to identify strengths and weaknesses.
    Validation Set: Use a consistent validation approach, like cross-validation, to ensure fair comparison.
    Selection Criteria: Decide on the best model(s) based on performance, complexity, and interpretability.
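
One way to keep the comparison fair is to score every candidate on the same cross-validation folds. The sketch below assumes 5 stratified folds and uses two placeholder candidates on synthetic data. The fold count and the candidates are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary data; not real ACS values.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)

# A single splitter with a fixed seed, so every model sees identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1"]

candidates = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

comparison = {}
for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the folds.
    comparison[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}

for name, metrics in comparison.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```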


#### 5. Creating an Ensemble Model

    Objective: Combine the predictions from the XGBoost, Neural Network, and Random Forest models to create a more robust ensemble model.
    Approach:
        Simple Averaging/Weighted Averaging: For regression problems, average the predictions. For classification, use voting or weighted voting based on model confidence.
        Stacking: Use a meta-model to learn how to best combine the predictions from the individual models. The meta-model is trained on the predictions of the base models on a hold-out set.
    Evaluation: Assess the ensemble model's performance on a test set, separate from the data used to train the individual models.
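
Both ensemble strategies can be sketched with scikit-learn's `VotingClassifier` and `StackingClassifier`. As an assumption made to keep the example self-contained, `GradientBoostingClassifier` again stands in for XGBoost and `MLPClassifier` for the neural network. Note one difference from the description above: `StackingClassifier` trains its meta-model on cross-validated (out-of-fold) predictions, not on a single hold-out set. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data; not real ACS values.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

base_models = [
    ("gb", GradientBoostingClassifier(random_state=0)),  # XGBoost stand-in
    ("nn", make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    )),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# Soft voting averages predicted class probabilities; weights are equal here.
voting = VotingClassifier(estimators=base_models, voting="soft", weights=[1, 1, 1])

# Stacking: a logistic-regression meta-model learns from out-of-fold predictions.
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

ensemble_scores = {}
for name, model in {"voting": voting, "stacking": stacking}.items():
    model.fit(X_train, y_train)
    # The test split was never seen while fitting the base models.
    ensemble_scores[name] = float(accuracy_score(y_test, model.predict(X_test)))

print(ensemble_scores)
```

Both ensemble classes clone the base estimators before fitting, so the same `base_models` list can be passed to each without the two ensembles sharing fitted state.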


#### 6. Conclusion and Insights

    Performance Summary: Summarize the performance of the individual models and the ensemble model.
    Insights: Discuss the implications of the findings, including any surprising patterns, strengths, and limitations of the models.
    Recommendations: Offer recommendations for practical applications of the models, potential improvements, and areas for future research.

#### Final Steps

    Documentation: Throughout the project, document your methodology, code, findings, and decisions.
    Reproducibility: Ensure that your analysis and models are reproducible, including data preprocessing steps, model parameters, and evaluation criteria.
    Ethical Considerations: Reflect on the ethical implications of your work, especially in relation to predictive modeling on sensitive topics like income, education, and housing.

### Work Distribution among the team

    Mahanth Dasari: EDA, building the XGBoost model
    Minjae Lee: Data preparation, building the Random Forest model
    Rahul Kanth Panganamamula: Building the ensemble and the Neural Network model

In [1]:
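import pandas as pd

# Load the ACS 2022 PUMS housing file (psam_husa.csv) from the GitHub LFS media URL.
# The token in the query string grants access to the file; the 404 below is consistent with it no longer being valid.
df = pd.read_csv("https://media.githubusercontent.com/media/rk18081999/final-project-rep/main/psam_husa.csv?token=ANTHJQG5TC32VTHJKHOYIF3F6A4JQ")

df.head()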


HTTPError: HTTP Error 404: Not Found