
# **Data Preparation for US Home price**

## **Project Scope:**
This project aims to build a data science model to predict U.S. home prices based on key economic factors over the last 20 years.

### **Steps**
1. Data Collection and Preparation:

*   Gather historical data for the specified features.
*   Clean and preprocess the data for analysis.

2. Exploratory Data Analysis (EDA):

*   Visualize relationships between features and the target series, CSUSHPINSA.
*   Identify correlations and patterns.

3. Model Development:

*  Build predictive models using regression or machine learning algorithms.
*  Train and validate the models.

4. Model Evaluation:


*   Evaluate model performance using appropriate metrics.
*   Fine-tune models for better accuracy.

## **Factors chosen & Data Collection Sources**

1. **USHPI** (S&P/Case-Shiller U.S. National Home Price Index) https://fred.stlouisfed.org/series/CSUSHPINSA
2. **Population** (Population includes resident population plus armed forces overseas.  The monthly estimate is the average of estimates for the first of the month and the first of the following month.) https://fred.stlouisfed.org/series/POPTHM
3. **Personal Income** (It is the income that persons receive in return for their provision of labor, land, and capital used in current production and the net current transfer payments that they receive from business and from government.) https://fred.stlouisfed.org/series/PI
4. **Gross Domestic Product** (GDP, the featured measure of U.S. output, is the market value of the goods and services produced by labor and property located in the United States.) https://fred.stlouisfed.org/series/GDP
5. **Unemployment Rate** (The unemployment rate represents the number of unemployed people, ages 16 and older, as a percentage of the labor force.) https://fred.stlouisfed.org/series/UNRATE
6. **Employment-Population Ratio** (A macroeconomic statistic that measures the civilian labor force currently employed against the total working-age population of a region, municipality, or country; FRED series EMRATIO.) https://fred.stlouisfed.org/series/EMRATIO https://www.investopedia.com/terms/e/employment_to_population_ratio.asp
7. **Building Construction issued permit** in US (Total Units) https://fred.stlouisfed.org/series/PERMIT
8. **Labor Force Participation Rate** (The participation rate is the percentage of the population that is either working or actively looking for work.) https://fred.stlouisfed.org/series/CIVPART https://www.investopedia.com/terms/p/participationrate.asp
9. **Monthly Supply of New Houses** in the United States (The monthly supply is the ratio of new houses for sale to new houses sold.) https://fred.stlouisfed.org/series/MSACSR
10. **Housing Starts** (New Housing Project) (This is a measure of the number of units of new housing projects started in a given period.) https://fred.stlouisfed.org/series/HOUST
11. **Median Sales Price** (Median Sales Price of Houses Sold for the United States, in U.S. dollars.) https://fred.stlouisfed.org/series/MSPUS
12. **Producer Price Index - Cement Manufacturing** https://fred.stlouisfed.org/series/PCU327310327310
13. **Producer Price Index by Industry: Concrete Block and Brick Manufacturing** https://fred.stlouisfed.org/series/PCU32733132733106
14. **All Employees, Residential Building Construction** (Thousands of Persons) (Construction employees in the construction sector include working supervisors, qualified craft workers, mechanics, apprentices, helpers, laborers, and so forth, engaged in new work, alterations, demolition, repair, maintenance, etc.) https://fred.stlouisfed.org/series/CES2023610001
15. **All Employees, Construction** (Thousands of Persons) (Construction employees in the construction sector include working supervisors, qualified craft workers, mechanics, apprentices, helpers, laborers, and so forth, engaged in new work, alterations, demolition, repair, maintenance, etc.) https://fred.stlouisfed.org/series/USCONS
16. **Industrial Production: Cement** (The industrial production (IP) index measures the real output of all relevant establishments located in the United States) https://fred.stlouisfed.org/series/IPN32731S https://www.investopedia.com/terms/i/ipi.asp
17. **Homeownership Rate (Percentage)** (The homeownership rate is the proportion of households that is owner-occupied.)
https://fred.stlouisfed.org/series/RSAHORUSQ156S
18. **Personal Saving Rate** (Percent) (Personal saving as a percentage of disposable personal income (DPI), frequently referred to as "the personal saving rate," is calculated as the ratio of personal saving to DPI. Personal saving may be viewed as the portion of personal income that is used either to provide funds to capital markets or to invest in real assets such as residences.)
https://fred.stlouisfed.org/series/PSAVERT
19. **New Privately-Owned Housing Units Completed**: (Total units in thousands) https://fred.stlouisfed.org/series/COMPUTSA
20. **New Privately-Owned Housing Units Under Construction**: (Total Units in thousands)
https://fred.stlouisfed.org/series/UNDCONTSA
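
The cells below read each series from a local CSV file named after its FRED series ID. As a minimal sketch of how those files might be fetched, the snippet below assumes FRED's public `fredgraph.csv` download endpoint; the helper names `fred_csv_url` and `download_series` are hypothetical and not part of the original notebook. A freshly downloaded file may label its date column differently from the `DATE` column used here, so check the header before running the cleaning cells.

```python
import urllib.request
from pathlib import Path

# Assumed endpoint: FRED's public CSV download URL for a single series.
FRED_CSV_ENDPOINT = "https://fred.stlouisfed.org/graph/fredgraph.csv"


def fred_csv_url(series_id: str) -> str:
    """Build the download URL for one FRED series ID (e.g. "UNRATE")."""
    return f"{FRED_CSV_ENDPOINT}?id={series_id}"


def download_series(series_id: str, out_dir: str = ".") -> Path:
    """Download one series to <out_dir>/<series_id>.csv and return the path."""
    target = Path(out_dir) / f"{series_id}.csv"
    urllib.request.urlretrieve(fred_csv_url(series_id), target)
    return target


# A few of the series IDs listed above.
series_ids = ["CSUSHPINSA", "POPTHM", "PI", "GDP", "UNRATE"]
urls = [fred_csv_url(s) for s in series_ids]
# download_series("UNRATE")  # uncomment to fetch; requires network access
```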

In [1]:
#importing libraries
import numpy as np
import pandas as pd
import os

In [2]:
# Creating Directory Cleaned data for storing cleaned data
directory_path = "Cleaned data"
os.makedirs(directory_path, exist_ok=True)

## **1. Target**

S&P/Case-Shiller U.S. National Home Price **Index**

In [46]:
# Reading target data
ushpi = pd.read_csv("CSUSHPINSA.csv")

In [47]:
ushpi.head()

Unnamed: 0,DATE,CSUSHPINSA
0,1987-01-01,63.735
1,1987-02-01,64.134
2,1987-03-01,64.47
3,1987-04-01,64.973
4,1987-05-01,65.547


In [48]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
ushpi.set_index('DATE', inplace =True)
ushpi.rename(columns= {'CSUSHPINSA': 'ushpi'}, inplace =True)
ushpi.index = pd.to_datetime(ushpi.index)
ushpi = ushpi["1990-01-01":"2023-01-01"]

In [49]:
ushpi.shape

(397, 1)
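
The 397 rows are what a gap-free monthly series should yield: January 1990 through January 2023 spans 33 full years plus one month, so 33 × 12 + 1 = 397 month-start dates. A quick sketch of that check, using only `pd.date_range` (the variable names are illustrative):

```python
import pandas as pd

# Every month-start date from 1990-01-01 to 2023-01-01, both ends included.
expected_index = pd.date_range("1990-01-01", "2023-01-01", freq="MS")

n_months = len(expected_index)
print(n_months)  # 33 * 12 + 1
```

Comparing a cleaned frame's index against `expected_index`, for example with `expected_index.difference(ushpi.index)`, would reveal any missing months.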

In [50]:
ushpi.to_csv("Cleaned data/ushpi.csv")

## **2. Population**
Population includes resident population plus armed forces overseas.

In [51]:
#Reading population data
population = pd.read_csv('POPTHM.csv')

In [52]:
population.head()

Unnamed: 0,DATE,POPTHM
0,1959-01-01,175818.0
1,1959-02-01,176044.0
2,1959-03-01,176274.0
3,1959-04-01,176503.0
4,1959-05-01,176723.0


In [53]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
population.set_index('DATE', inplace =True)
population.rename(columns= {'POPTHM': 'population'}, inplace =True)
population.index = pd.to_datetime(population.index)
population = population["1990-01-01":"2023-01-01"]

In [54]:
population.shape

(397, 1)

In [55]:
population.to_csv('Cleaned data/population.csv')
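
The target and population cells follow the same four steps (set `DATE` as the index, rename the column, parse the index as dates, slice the date range), and most of the later series repeat them. A minimal sketch of how the pattern could be wrapped in one function follows; `clean_series` and its parameters are hypothetical names introduced here, and the demo frame is synthetic rather than FRED data.

```python
import pandas as pd


def clean_series(df, old_name, new_name,
                 start="1990-01-01", end="2023-01-01"):
    """Return a copy indexed by parsed DATE, renamed, and sliced to [start, end]."""
    out = df.set_index("DATE").rename(columns={old_name: new_name})
    out.index = pd.to_datetime(out.index)
    return out.loc[start:end]


# Synthetic monthly data standing in for a raw FRED file.
raw = pd.DataFrame({
    "DATE": pd.date_range("1989-10-01", periods=6, freq="MS").strftime("%Y-%m-%d"),
    "POPTHM": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
cleaned = clean_series(raw, "POPTHM", "population")
```

Because the function returns a new frame instead of mutating its input, re-running a cell does not fail on an already-renamed column.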

## **3. Personal Income**
Income that persons receive in return for their provision of labor, land, and capital used in current production and the net current transfer payments that they receive from business and from government.

In [56]:
#Reading personal income data
income = pd.read_csv('PI.csv')

In [57]:
income.head()

Unnamed: 0,DATE,PI
0,1959-01-01,391.8
1,1959-02-01,393.7
2,1959-03-01,396.5
3,1959-04-01,399.9
4,1959-05-01,402.4


In [58]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
income.set_index('DATE', inplace =True)
income.rename(columns= {'PI': 'income'}, inplace =True)
income.index = pd.to_datetime(income.index)
income = income["1990-01-01":"2023-01-01"]

In [59]:
income.shape

(397, 1)

In [60]:
income.to_csv("Cleaned data/income.csv")

## **4. Gross Domestic Product**
GDP, the featured measure of U.S. output, is the market value of the goods and services produced by labor and property located in the United States.

GDP is defined as the total market value of all final goods and services produced within a country in a given period of time, usually a calendar year, a financial year, or a fraction of a year such as a quarter.

GDP shows how much is produced within the boundaries of the country by both citizens and foreigners. It focuses on where the output is produced rather than who produced it, so it is a geographical concept: GDP measures all domestic production, regardless of the producing entities' nationalities.

In calculating GDP, certain transactions are excluded. For example, gains from resale are excluded, but the services provided by agents are counted. When a used house is sold, no new goods are produced, but the real estate agent earns a commission, which adds to GDP.

In [65]:
#Reading GDP data
gdp = pd.read_csv("GDP.csv")

In [66]:
gdp.head()

Unnamed: 0,DATE,GDP
0,1947-01-01,243.164
1,1947-04-01,245.968
2,1947-07-01,249.585
3,1947-10-01,259.745
4,1948-01-01,265.742


In [67]:
# Setting DATE as index & parsing it as dates (the GDP column keeps its original name).
gdp.set_index('DATE', inplace =True)
gdp.index = pd.to_datetime(gdp.index)

# Resampling
gdp = gdp.resample('M').ffill()

# Setting the day of the index to 1
gdp.index = gdp.index.map(lambda x: x.replace(day=1))
gdp = gdp["1990-01-01":"2023-01-01"]

In [68]:
gdp.shape

(397, 1)

In [69]:
gdp.to_csv("Cleaned data/gdp.csv")
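
GDP is published quarterly, so the cell above forward-fills each quarterly value into the months that follow it to match the monthly series. The snippet below sketches an equivalent route, assuming the month-start alias `'MS'` is used so that the index needs no day adjustment afterwards. The values are synthetic, not real GDP figures.

```python
import pandas as pd

# Synthetic quarterly series: one value per quarter start.
quarterly = pd.DataFrame(
    {"GDP": [100.0, 110.0, 120.0]},
    index=pd.to_datetime(["1990-01-01", "1990-04-01", "1990-07-01"]),
)

# 'MS' labels each bin by the first day of the month; ffill() copies each
# quarterly value into the months that follow it until the next observation.
monthly = quarterly.resample("MS").ffill()
```

Resampling stops at the last observed date, so the final quarter contributes only its first month. The raw file therefore has to reach at least the end date of the slice.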

## **5. Unemployment Rate**
The unemployment rate represents the number of unemployed people, ages 16 and older, as a percentage of the labor force.

In [70]:
#Reading Unemployment Rate data
unemployment_rate = pd.read_csv('UNRATE.csv')

In [71]:
unemployment_rate.head()

Unnamed: 0,DATE,UNRATE
0,1948-01-01,3.4
1,1948-02-01,3.8
2,1948-03-01,4.0
3,1948-04-01,3.9
4,1948-05-01,3.5


In [72]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
unemployment_rate.set_index('DATE', inplace =True)
unemployment_rate.rename(columns= {'UNRATE': 'unemployed_rate'}, inplace =True)
unemployment_rate.index = pd.to_datetime(unemployment_rate.index)
unemployment_rate = unemployment_rate["1990-01-01":"2023-01-01"]

In [73]:
unemployment_rate.shape

(397, 1)

In [74]:
unemployment_rate.to_csv("Cleaned data/unemployment_rate.csv")

## **6. Employment-Population Ratio (emratio)**
It is a macroeconomic statistic that measures the civilian labor force currently employed against the total working-age population of a region, municipality, or country.

In [75]:
emp_pop_ratio = pd.read_csv("EMRATIO.csv")

In [76]:
emp_pop_ratio.head()

Unnamed: 0,DATE,EMRATIO
0,1948-01-01,56.6
1,1948-02-01,56.7
2,1948-03-01,56.1
3,1948-04-01,56.7
4,1948-05-01,56.2


In [77]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
emp_pop_ratio.set_index('DATE', inplace =True)
emp_pop_ratio.rename(columns= {'EMRATIO': 'emp_pop_ratio'}, inplace =True)
emp_pop_ratio.index = pd.to_datetime(emp_pop_ratio.index)
emp_pop_ratio = emp_pop_ratio["1990-01-01":"2023-01-01"]

In [78]:
emp_pop_ratio.shape

(397, 1)

In [79]:
emp_pop_ratio.to_csv("Cleaned data/emp_pop_ratio.csv")

## **7. Building Construction issued permit in US (Total Units)**

In [80]:
#Reading permit data
permit = pd.read_csv("PERMIT.csv")

In [81]:
permit.head()

Unnamed: 0,DATE,PERMIT
0,1960-01-01,1092.0
1,1960-02-01,1088.0
2,1960-03-01,955.0
3,1960-04-01,1016.0
4,1960-05-01,1052.0


In [82]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
permit.set_index('DATE', inplace =True)
permit.rename(columns= {'PERMIT': 'permit'}, inplace =True)
permit.index = pd.to_datetime(permit.index)
permit = permit["1990-01-01":"2023-01-01"]

In [83]:
permit.shape

(397, 1)

In [84]:
permit.to_csv("Cleaned data/permit.csv")

## **8. Labor Force Participation Rate**
The participation rate is the percentage of the population that is either working or actively looking for work. The labor force participation rate is an estimate of an economy’s active workforce. The formula is the number of people ages 16 and older who are employed or actively seeking employment, divided by the total non-institutionalized, civilian working-age population.

In [85]:
#Reading Labour force participation rate data
labor_percent = pd.read_csv("CIVPART.csv")

In [86]:
labor_percent.head()

Unnamed: 0,DATE,CIVPART
0,1948-01-01,58.6
1,1948-02-01,58.9
2,1948-03-01,58.5
3,1948-04-01,59.0
4,1948-05-01,58.3


In [87]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
labor_percent.set_index('DATE', inplace =True)
labor_percent.rename(columns= {'CIVPART': 'labor_percent'}, inplace =True)
labor_percent.index = pd.to_datetime(labor_percent.index)
labor_percent = labor_percent["1990-01-01":"2023-01-01"]

In [88]:
labor_percent.shape

(397, 1)

In [89]:
labor_percent.to_csv("Cleaned data/labor_percent.csv")

## **9. Monthly Supply of New Houses in the United States**
The monthly supply is the ratio of new houses for sale to new houses sold. This statistic provides an indication of the size of the for-sale inventory in relation to the number of houses being sold.

In [90]:
#Reading monthly supply data
monthly_supply = pd.read_csv("MSACSR.csv")

In [91]:
monthly_supply.head()

Unnamed: 0,DATE,MSACSR
0,1963-01-01,4.7
1,1963-02-01,6.6
2,1963-03-01,6.4
3,1963-04-01,5.3
4,1963-05-01,5.1


In [92]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
monthly_supply.set_index('DATE', inplace =True)
monthly_supply.rename(columns= {'MSACSR': 'monthly_supply'}, inplace =True)
monthly_supply.index = pd.to_datetime(monthly_supply.index)
monthly_supply = monthly_supply["1990-01-01":"2023-01-01"]

In [93]:
monthly_supply.shape

(397, 1)

In [94]:
monthly_supply.to_csv("Cleaned data/monthly_supply.csv")

## **10. Housing starts (New Housing Project)**
This is a measure of the number of units of new housing projects started in a given period.

In [99]:
#Reading house starts measurement data
house_starts = pd.read_csv("HOUST.csv")

In [100]:
house_starts.head()

Unnamed: 0,DATE,HOUST
0,1959-01-01,1657.0
1,1959-02-01,1667.0
2,1959-03-01,1620.0
3,1959-04-01,1590.0
4,1959-05-01,1498.0


In [101]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
house_starts.set_index('DATE', inplace =True)
house_starts.rename(columns= {'HOUST': 'house_st'}, inplace =True)
house_starts.index = pd.to_datetime(house_starts.index)
house_starts = house_starts["1990-01-01":"2023-01-01"]

In [102]:
house_starts.shape

(397, 1)

In [103]:
house_starts.to_csv("Cleaned data/House_starts.csv")

## **11. Median Sales Price**
Median Sales Price of Houses Sold for the United States, in U.S. dollars.

In [104]:
#Reading median sales price data
MSPUS = pd.read_csv("MSPUS.csv")

In [105]:
MSPUS.head()

Unnamed: 0,DATE,MSPUS
0,1963-01-01,17800.0
1,1963-04-01,18000.0
2,1963-07-01,17900.0
3,1963-10-01,18500.0
4,1964-01-01,18500.0


In [106]:
# Setting DATE as index & parsing it as dates (the MSPUS column keeps its original name).
MSPUS.set_index('DATE', inplace =True)
MSPUS.index = pd.to_datetime(MSPUS.index)

# Resampling
MSPUS = MSPUS.resample('M').ffill()

# Setting the day of the index to 1
MSPUS.index = MSPUS.index.map(lambda x: x.replace(day=1))
MSPUS = MSPUS["1990-01-01":"2023-01-01"]

In [107]:
MSPUS.shape

(397, 1)

In [108]:
MSPUS.to_csv("Cleaned data/MSPUS.csv")

## **12. Producer Price Index -Cement Manufacturing**

The Producer Price Index (PPI) measures the average change over time in the prices domestic producers receive for their output. It is a measure of inflation at the wholesale level that is compiled from thousands of indexes measuring producer prices by industry and product category. The index is published monthly by the U.S. Bureau of Labor Statistics (BLS). The PPI is different from the consumer price index (CPI), which measures the changes in the price of goods and services paid by consumers.

In [109]:
# Reading Producer Price Index
PPI_Cement = pd.read_csv("PCU327310327310.csv")

In [110]:
PPI_Cement.head()

Unnamed: 0,DATE,PCU327310327310
0,1965-01-01,28.7
1,1965-02-01,28.7
2,1965-03-01,28.7
3,1965-04-01,28.7
4,1965-05-01,28.7


In [111]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
PPI_Cement.set_index('DATE', inplace =True)
PPI_Cement.rename(columns= {'PCU327310327310': 'PPI_Cement'}, inplace =True)
PPI_Cement.index = pd.to_datetime(PPI_Cement.index)
PPI_Cement = PPI_Cement["1990-01-01":"2023-01-01"]

In [112]:
PPI_Cement.shape

(397, 1)

In [113]:
PPI_Cement.to_csv("Cleaned data/PPI_Cement.csv")

## **13. Producer Price Index by Industry: Concrete Brick**



In [114]:
#Reading PPI of concrete brick
PPI_Concrete = pd.read_csv("PCU32733132733106.csv")

In [115]:
PPI_Concrete.head()

Unnamed: 0,DATE,PCU32733132733106
0,1981-06-01,100.0
1,1981-07-01,.
2,1981-08-01,.
3,1981-09-01,.
4,1981-10-01,.


In [116]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
PPI_Concrete.set_index('DATE', inplace =True)
PPI_Concrete.rename(columns= {'PCU32733132733106': 'PPI_Concrete'}, inplace =True)
PPI_Concrete.index = pd.to_datetime(PPI_Concrete.index)
PPI_Concrete = PPI_Concrete["1990-01-01":"2023-01-01"]

In [117]:
PPI_Concrete.shape

(397, 1)

In [118]:
PPI_Concrete.to_csv("Cleaned data/PPI_Concrete.csv")
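
The preview above shows `.` in place of several early values, which is how the file marks missing observations. While such rows are present, pandas reads the whole column as strings instead of numbers. A minimal sketch of one way to handle this, assuming the placeholders should become `NaN` (the inline CSV text is a synthetic stand-in for the real file):

```python
import io

import pandas as pd

# Synthetic stand-in mirroring the layout of the raw file.
csv_text = """DATE,PCU32733132733106
1981-06-01,100.0
1981-07-01,.
1981-08-01,.
1981-09-01,101.5
"""

# na_values="." tells read_csv to treat the placeholder as missing,
# so the column is parsed as float instead of object (string).
ppi = pd.read_csv(io.StringIO(csv_text), na_values=".")
missing = int(ppi["PCU32733132733106"].isna().sum())
```

Counting the missing values inside the 1990 to 2023 window is a useful follow-up check before the series is merged with the others.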

## **14. All Employees, Residential Building Construction (Thousands of Persons)**
Construction employees in the construction sector include: Working supervisors, qualified craft workers, mechanics, apprentices, helpers, laborers, and so forth, engaged in new work, alterations, demolition, repair, maintenance etc.

In [126]:
#Reading employees data
all_const_emp = pd.read_csv("CES2023610001.csv")

In [127]:
all_const_emp.head()

Unnamed: 0,DATE,CES2023610001
0,1985-01-01,650.5
1,1985-02-01,643.4
2,1985-03-01,651.8
3,1985-04-01,655.2
4,1985-05-01,659.8


In [128]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
all_const_emp.set_index('DATE', inplace =True)
all_const_emp.rename(columns= {'CES2023610001': 'all_const_emp'}, inplace =True)
all_const_emp.index = pd.to_datetime(all_const_emp.index)
all_const_emp = all_const_emp["1990-01-01":"2023-01-01"]

In [129]:
all_const_emp.shape

(397, 1)

In [130]:
all_const_emp.to_csv("Cleaned data/all_const_emp.csv")

## **15. All Employees, Construction (Thousands of persons)**
Construction employees in the construction sector include: Working supervisors, qualified craft workers, mechanics, apprentices, helpers, laborers, and so forth, engaged in new work, alterations, demolition, repair, maintenance.

In [131]:
#Reading employees data
total_emp_cons = pd.read_csv("USCONS.csv")

In [132]:
total_emp_cons.head()

Unnamed: 0,DATE,USCONS
0,1939-01-01,1139
1,1939-02-01,1162
2,1939-03-01,1225
3,1939-04-01,1249
4,1939-05-01,1262


In [133]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
total_emp_cons.set_index('DATE', inplace =True)
total_emp_cons.rename(columns= {'USCONS': 'total_emp_cons'}, inplace =True)
total_emp_cons.index = pd.to_datetime(total_emp_cons.index)
total_emp_cons = total_emp_cons["1990-01-01":"2023-01-01"]

In [134]:
total_emp_cons.shape

(397, 1)

In [135]:
total_emp_cons.to_csv("Cleaned data/total_emp_cons.csv")

## **16. Industrial Production: Cement**
The industrial production (IP) index measures the real output of all relevant establishments located in the United States. The industrial production index (IPI) measures levels of production in the manufacturing, mining—including oil and gas field drilling services—and electrical and gas utilities sectors. It also measures capacity, an estimate of the production levels that could be sustainably maintained; and capacity utilization, the ratio of actual output to capacity. Here we are talking about the IP of Cement.

In [137]:
#Reading IPI of cement data
IPI_Cement = pd.read_csv("IPN32731S.csv")

In [138]:
IPI_Cement.head()

Unnamed: 0,DATE,IPN32731S
0,1972-01-01,144.6423
1,1972-02-01,138.8505
2,1972-03-01,137.9965
3,1972-04-01,139.3841
4,1972-05-01,136.9861


In [139]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
IPI_Cement.set_index('DATE', inplace =True)
IPI_Cement.rename(columns= {'IPN32731S': 'IPI_Cement'}, inplace =True)
IPI_Cement.index = pd.to_datetime(IPI_Cement.index)
IPI_Cement = IPI_Cement["1990-01-01":"2023-01-01"]

In [140]:
IPI_Cement.shape

(397, 1)

In [141]:
IPI_Cement.to_csv("Cleaned data/IPI_Cement.csv")

## **17. Homeownership Rate (Percentage)**
The homeownership rate is the proportion of households that is owner-occupied.

In [142]:
#Reading homeownership rate data
home_ow_rate = pd.read_csv("RSAHORUSQ156S.csv")

In [143]:
home_ow_rate.head()

Unnamed: 0,DATE,RSAHORUSQ156S
0,1980-01-01,65.5
1,1980-04-01,65.6
2,1980-07-01,65.6
3,1980-10-01,65.6
4,1981-01-01,65.6


In [144]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
home_ow_rate.set_index('DATE', inplace =True)
home_ow_rate.index = pd.to_datetime(home_ow_rate.index)
home_ow_rate.rename(columns= {'RSAHORUSQ156S': 'home_ow_rate'}, inplace =True)

# Resampling
home_ow_rate = home_ow_rate.resample('M').ffill()

# Setting the day of the index to 1
home_ow_rate.index = home_ow_rate.index.map(lambda x: x.replace(day=1))
home_ow_rate = home_ow_rate["1990-01-01":"2023-01-01"]

In [145]:
home_ow_rate.shape

(397, 1)

In [146]:
home_ow_rate.to_csv("Cleaned data/home_ow_rate.csv")

## **18. Personal Saving Rate (Percent)**
Personal saving as a percentage of disposable personal income (DPI), frequently referred to as "the personal saving rate," is calculated as the ratio of personal saving to DPI. Personal saving may be viewed as the portion of personal income that is used either to provide funds to capital markets or to invest in real assets such as residences.

In [147]:
# Reading personal saving rate
p_saving_rate =pd.read_csv("PSAVERT.csv")

In [148]:
p_saving_rate.head()

Unnamed: 0,DATE,PSAVERT
0,1959-01-01,11.3
1,1959-02-01,10.6
2,1959-03-01,10.3
3,1959-04-01,11.2
4,1959-05-01,10.6


In [149]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
p_saving_rate.set_index('DATE', inplace =True)
p_saving_rate.rename(columns= {'PSAVERT': 'p_saving_rate'}, inplace =True)
p_saving_rate.index = pd.to_datetime(p_saving_rate.index)
p_saving_rate = p_saving_rate["1990-01-01":"2023-01-01"]

In [150]:
p_saving_rate.shape

(397, 1)

In [151]:
p_saving_rate.to_csv("Cleaned data/p_saving_rate.csv")

## **19. New Privately-Owned Housing Construction Completed: (Total units in thousands)**

In [153]:
#Reading privately owned housing data
new_private_house = pd.read_csv("COMPUTSA.csv")

In [154]:
new_private_house.head()

Unnamed: 0,DATE,COMPUTSA
0,1968-01-01,1257.0
1,1968-02-01,1174.0
2,1968-03-01,1323.0
3,1968-04-01,1328.0
4,1968-05-01,1367.0


In [155]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
new_private_house.set_index('DATE', inplace =True)
new_private_house.rename(columns= {'COMPUTSA': 'new_private_house'}, inplace =True)
new_private_house.index = pd.to_datetime(new_private_house.index)
new_private_house = new_private_house["1990-01-01":"2023-01-01"]

In [156]:
new_private_house.shape

(397, 1)

In [157]:
new_private_house.to_csv("Cleaned data/new_private_house.csv")

## **20. New Privately-Owned Housing Units Under Construction: Total Units in thousands**

In [158]:
#Reading housing units data
new_private_hw_under = pd.read_csv("UNDCONTSA.csv")

In [159]:
new_private_hw_under.head()

Unnamed: 0,DATE,UNDCONTSA
0,1970-01-01,889.0
1,1970-02-01,888.0
2,1970-03-01,890.0
3,1970-04-01,891.0
4,1970-05-01,883.0


In [160]:
# Setting DATE as index, column renaming & filtering data from "1990-01-01":"2023-01-01".
new_private_hw_under.set_index('DATE', inplace =True)
new_private_hw_under.rename(columns= {'UNDCONTSA': 'new_private_hw_under'}, inplace =True)
new_private_hw_under.index = pd.to_datetime(new_private_hw_under.index)
new_private_hw_under = new_private_hw_under["1990-01-01":"2023-01-01"]

In [161]:
new_private_hw_under.shape

(397, 1)

In [162]:
new_private_hw_under.to_csv("Cleaned data/new_private_hw_under.csv")

# **Creating the final dataset for Exploratory Data Analysis**

In [163]:
directory_name = "Cleaned data"

if not os.path.exists(directory_name):
    os.makedirs(directory_name)

In [164]:
path = 'Cleaned data'

csv_files = [os.path.join(path, f) for f in os.listdir(path) if f.endswith('.csv')]

dfs = [pd.read_csv(f) for f in csv_files]

# Stacking the dataframes, then grouping by 'DATE' so that each series lands in its own column
df_final = pd.concat(dfs, ignore_index=False).groupby('DATE').sum()

df_final.head()

DATE,GDP,permit,house_st,IPI_Cement,unemployed_rate,MSPUS,all_const_emp,PPI_Concrete,monthly_supply,total_emp_cons,p_saving_rate,ushpi,new_private_house,new_private_hw_under,emp_pop_ratio,labor_percent,population,home_ow_rate,PPI_Cement,income
1990-01-01,5872.701,1748.0,1551.0,138.1363,5.4,123900.0,710.3,122.2,7.0,5422.0,7.9,76.527,1508.0,891.0,63.2,66.8,248743.0,64.1,101.7,4783.8
1990-02-01,5872.701,1329.0,1437.0,134.7538,5.3,123900.0,707.3,122.2,7.6,5416.0,8.5,76.587,1352.0,898.0,63.2,66.7,248920.0,64.1,101.7,4819.8
1990-03-01,5872.701,1246.0,1289.0,132.5115,5.2,123900.0,703.0,122.2,7.8,5392.0,8.3,76.79,1345.0,885.0,63.2,66.7,249146.0,64.1,102.0,4842.7
1990-04-01,5960.028,1136.0,1248.0,127.1853,5.4,126800.0,692.5,122.2,8.3,5355.0,8.7,77.038,1332.0,872.0,63.0,66.6,249436.0,63.9,102.5,4883.8
1990-05-01,5960.028,1067.0,1212.0,123.8842,5.4,126800.0,688.6,122.2,8.2,5321.0,8.7,77.297,1351.0,858.0,63.1,66.6,249707.0,63.9,102.5,4889.5
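
Stacking the frames and calling `groupby('DATE').sum()` works here because each date has exactly one non-missing value per column, but `sum()` would also turn a genuinely missing observation into `0`. A sketch of an alternative that aligns the files on their `DATE` index instead is shown below; the two small frames are synthetic and the variable names are illustrative only.

```python
import pandas as pd

dates = ["1990-01-01", "1990-02-01", "1990-03-01"]

# Two synthetic "cleaned" frames as they would look after pd.read_csv.
a = pd.DataFrame({"DATE": dates, "permit": [1.0, 2.0, 3.0]})
b = pd.DataFrame({"DATE": dates[:2], "ushpi": [10.0, 20.0]})

# Put DATE in the index of each frame, then concatenate column-wise;
# rows are aligned by date and gaps stay NaN rather than becoming 0.
frames = [df.set_index("DATE") for df in (a, b)]
wide = pd.concat(frames, axis=1)
```

Keeping gaps as `NaN` makes them visible to `wide.isna().sum()` during exploratory analysis instead of hiding them as zeros.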


In [165]:
df_final.to_csv('ushomepricedataset.csv')
from google.colab import files
files.download('ushomepricedataset.csv')
