## Final Project - Beast Team
Idea: The pandemic has driven people toward the suburbs, causing a surge in home prices. We expect this to bring new supply onto the market (as we saw in 2008). In this project we build two TS-ML models: one that uses historical home prices to predict supply, and one that uses historical supply and current interest rates to predict home prices. We first use the former to predict supply for major cities, then feed those predictions into the latter to predict home prices for the same cities under various interest rate regimes.

$$s_{jt}=f(X_j,X_t,s_{jt-1},...,s_{jt-k},p_{jt-1},...,p_{jt-k} )+\epsilon_{jt}$$

$s_{jt}$ is the supply for county $j$ at time $t$.

$X_j$ are demographic features for county $j$, for example population and income per capita.

$X_t$ are time-specific features, for example a summer fixed effect (an indicator variable for summer that captures the regime difference of increased supply).

$p_{jt}$ is the home value index for county $j$ at time $t$. We consider only single-family homes in this analysis.
[Add a note on calculation of home value index]

$$p_{jt}=g(s_{jt},s_{jt-1},...,s_{jt-k},r_t,X_j,X_t,p_{jt-1},...,p_{jt-k} )+\nu_{jt}$$

$r_t$ is the interest rate at time $t$.

Here we also include historical supply and historical prices, since an oversupply of homes at an earlier point may still weigh on current prices.

Goal: Estimate $\hat{f}$ and $\hat{g}$ using ML.
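
Both equations take the previous $k$ values of supply and price as inputs, so a first step toward estimating $\hat{f}$ and $\hat{g}$ is building those lagged columns per county. The following is a minimal sketch on a made-up toy panel; the helper `add_lags` and the column names `county`, `period`, `supply`, and `price` are assumptions for illustration, not names from the project data.

```python
import pandas as pd

def add_lags(panel, cols, k, group_col="county", time_col="period"):
    """Append <col>_lag1..<col>_lagk, shifted within each county (hypothetical helper)."""
    out = panel.sort_values([group_col, time_col]).copy()
    lag_cols = []
    for col in cols:
        for lag in range(1, k + 1):
            name = f"{col}_lag{lag}"
            # shift() inside the groupby never leaks values across counties.
            out[name] = out.groupby(group_col)[col].shift(lag)
            lag_cols.append(name)
    # The first k periods of each county lack a full history, so drop them.
    return out.dropna(subset=lag_cols).reset_index(drop=True)

# Toy panel: 2 counties x 4 periods, values invented for illustration.
toy = pd.DataFrame({
    "county": ["A"] * 4 + ["B"] * 4,
    "period": [1, 2, 3, 4] * 2,
    "supply": [10, 11, 12, 13, 20, 21, 22, 23],
    "price": [100, 101, 102, 103, 200, 201, 202, 203],
})
lagged = add_lags(toy, ["supply", "price"], k=2)
print(lagged)
```

With $k=2$, each county keeps only periods 3 and 4, and every remaining row carries the two previous supply and price values as features.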


Data cleaning and preparation steps. 
1) Combine the multiple census data files into one file. [ETA - 06/19 - EOD PDT]
   - Census data for a county may have been retrieved at different points in time; we will take the latest data for modeling.
2) Get two letter code mapping for states. [Done]
3) Keep only the 4-week records from home_data. [Done]
4) Split the region field in home_data into county and state, and join on the two-letter state code and county. [Done]
5) Derive the month from each 4-week period and combine with the interest rate data. [Done]
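
Step 1 keeps only the most recent census record per county. One way this could look in pandas is sketched below; the column names (`state_code`, `county_name`, `census_year`, `population`) and all values are hypothetical placeholders, not the actual census schema.

```python
import pandas as pd

# Toy census extract with invented counties and values.
census = pd.DataFrame({
    "state_code": ["AA", "AA", "BB", "BB", "BB"],
    "county_name": ["Alpha County", "Alpha County", "Beta County",
                    "Beta County", "Gamma County"],
    "census_year": [2015, 2019, 2010, 2019, 2018],
    "population": [100, 110, 200, 230, 50],
})

# Sort by retrieval year, then keep the last (most recent) row per county.
latest = (
    census.sort_values("census_year")
          .drop_duplicates(subset=["state_code", "county_name"], keep="last")
          .sort_values(["state_code", "county_name"])
          .reset_index(drop=True)
)
print(latest)
```

Each county appears exactly once in `latest`, with the values from its newest record.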

TODO : Visualizations to build. 
1) Side-by-side geographic heat maps of the increase in prices and the increase in supply.
2) Scatter plots with the latest 2020 price on the Y axis against:
    - Population
    - Income/Poverty
    - Education 
        - Percent of adults with less than a high school diploma, 2015-19
        - Percent of adults with a high school diploma only, 2015-19
        - Percent of adults completing some college or associate's degree, 2015-19
        - Percent of adults with a bachelor's degree or higher, 2015-19

TODO : Basic modelling.
1) Select demographic features using LASSO: regress 2017 median home prices on all demographic features (RHS).
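
The LASSO selection step could be sketched as follows. This assumes scikit-learn is available and uses synthetic data in place of the census features; the feature names and the toy target are invented, and `LassoCV` with standardized inputs is one reasonable choice rather than the project's settled method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins for demographic columns (not real census data).
X = pd.DataFrame(
    rng.normal(size=(n, 5)),
    columns=["population", "income", "poverty", "pct_hs_only", "pct_bachelors"],
)
# By construction, only two columns drive the toy "median price".
y = 3.0 * X["population"] - 2.0 * X["income"] + rng.normal(scale=0.1, size=n)

# Standardize so the L1 penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)
# Cross-validation picks the penalty strength.
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

# Features whose coefficient was not shrunk to zero are the selected ones.
selected = [c for c, w in zip(X.columns, lasso.coef_) if abs(w) > 1e-8]
print(selected)
```

On the real data, `X` would hold the demographic columns and `y` the 2017 median home price per county.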

In [27]:
import pandas as pd

In [28]:
data_folder = "C:\\Users\\spsundar.FAREAST\\OneDrive - Microsoft\\Documents\\Masters\\Coursework\\TML\\HomePriceBeastNew\\"

In [29]:
raw_home_data = pd.read_csv(f"{data_folder}weekly_housing_market_data_most_recent.tsv", delimiter="\t")

In [30]:
merged_census_county_data = pd.read_csv(f"{data_folder}merged_census_county_data.csv", low_memory=False)

In [31]:
interest_data = pd.read_csv(f"{data_folder}fed_funds.csv")
interest_data["DATE"] = pd.to_datetime(interest_data["DATE"])

In [32]:
state_mapping_data = pd.read_csv(f"{data_folder}lettercodestatemapping.csv")
state_mapping_data = state_mapping_data[['State', 'Code']]

In [33]:
def combine_datasets(merged_census_county_data, home_data, interest_data):
    # Join each record to the interest rate whose DATE equals its period_month_year.
    merged_home_data = pd.merge(
        home_data,
        interest_data,
        how="inner",
        left_on="period_month_year",
        right_on="DATE")

    # No on= keys are given, so pandas joins on all column names
    # that the two frames have in common.
    merged_home_data = pd.merge(
        merged_home_data,
        merged_census_county_data,
        how="inner")

    return merged_home_data

In [34]:
def clean_home_data(raw_home_data):
    # Keep county-level rows only; copy so the caller's raw frame is not mutated.
    home_data = raw_home_data[raw_home_data["region_type"] == 'county'].copy()
    # Split region_name on the comma into county name and state code.
    home_data["county_name"] = home_data["region_name"].apply(lambda x: x.split(',')[0])
    home_data["state_code"] = home_data["region_name"].apply(lambda x: x.split(', ')[1])
    home_data['period_begin'] = pd.to_datetime(home_data['period_begin'])
    home_data['period_end'] = pd.to_datetime(home_data['period_end'])
    # Length of each reporting period in days.
    home_data['period_diff'] = (home_data['period_end'] - home_data['period_begin']).dt.days
    # Keep only the 4-week periods (27 days between begin and end).
    home_data = home_data[home_data['period_diff'] == 27].copy()
    # First day of the period's starting month, used as the interest rate join key.
    home_data["period_month_year"] = home_data["period_begin"].dt.to_period("M").dt.to_timestamp()
    return home_data

In [None]:
home_data = clean_home_data(raw_home_data)
combined_home_data = combine_datasets(merged_census_county_data,
                                      home_data,
                                      interest_data)

In [None]:
ignore_cols = ['Unnamed: 165', 'duration', 'last_updated',
               'region_type', 'region_name', 'region_type_id',
               'period_diff', 'DATE']
# Drop bookkeeping columns; errors="ignore" skips any that are absent.
combined_home_data = combined_home_data.drop(columns=ignore_cols, errors="ignore")

In [None]:
combined_home_data.to_csv(f"{data_folder}combined_home_data.csv", index=False)

In [None]:
for x in combined_home_data.columns:
    print(x)