# 03b Data Analysis Appreciation (Correlation)

In this notebook we look at the correlations among the features themselves as well as the correlation of each feature with the target variable (house price appreciation).

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import math
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from shapely import wkt

In [None]:
df_origin = pd.read_csv("../data/cleaned/df_appreciation_final.csv", index_col=0)

In [None]:
df = df_origin.copy()

### Preprocessing

In [None]:
# drop columns
columns_to_drop = [
    "state_y",
    "geometry_y",
    "lag_month",
    "lag_year_y",
    "date_y",
    "lag_year_x",
]
df = df.drop(columns=columns_to_drop)

In [None]:
df["age"] = (df.year - df.yrblt)
df["eff_age"] = (df.year - df.effyrblt)

In [None]:
# drop where age is negative
df = df[df.age >= 0]

In [None]:
# set effyrblt and eff_age to NaN for the case where effyrblt > saledate (data leakage)
df.loc[df.eff_age < 0, "effyrblt"] = np.nan
df.loc[df.eff_age < 0, "eff_age"] = np.nan

In [None]:
# drop transactions with yrblt of zero
df = df[df.yrblt != 0]

In [None]:
# drop transactions with effyrblt of zero
df = df[df.effyrblt != 0]

In [None]:
# calculate relative appreciation as a fraction of the prior price (0.25 means 25 %)
df["appreciation"] = (df["price"] - df["prior_price"])/df.prior_price
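
As a quick sanity check of the formula in the cell above, here is a minimal sketch on made-up numbers. The three rows and the `appreciation_pct` column are hypothetical and not taken from the dataset. Note that the result is a fraction, so it has to be multiplied by 100 to read it as a percentage.

```python
import pandas as pd

# Hypothetical toy transactions, not taken from the dataset
toy = pd.DataFrame({
    "prior_price": [200_000, 150_000, 300_000],
    "price": [250_000, 135_000, 300_000],
})

# Same formula as in the notebook: relative change with respect to the prior price
toy["appreciation"] = (toy["price"] - toy["prior_price"]) / toy["prior_price"]

# Multiply by 100 only when a percentage is needed for display
toy["appreciation_pct"] = toy["appreciation"] * 100

print(toy)
```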

In [None]:
df["saledate"] = pd.to_datetime(df.saledate)
df["prior_saledate"] = pd.to_datetime(df.prior_saledate)

In [None]:
df["appreciation_time"] = df.saledate - df.prior_saledate

In [None]:
df["appreciation_time"] = df.appreciation_time.dt.days

In [None]:
# drop cases where the sale date equals the prior sale date; these are mostly double-recorded transactions
df = df[df.saledate != df.prior_saledate]

In [None]:
# drop negative appreciation time (only four cases)
df = df[df.appreciation_time > 0]

In [None]:
# filter out prior prices smaller than 100
df = df[df.prior_price > 100]

In [None]:
# create features longitude, latitude
df["geometry_x"] = df.geometry_x.apply(wkt.loads)
df["geometry_x"] = df.geometry_x.apply(lambda x: x.centroid)

df["longitude"] = df.geometry_x.apply(lambda x: x.x)
df["latitude"] = df.geometry_x.apply(lambda x: x.y)

### Data Analysis

#### Correlations

In [None]:
# pairwise correlations between all numeric columns
corr = df.corr(numeric_only=True)

# plot
fig, ax = plt.subplots(figsize=(30,20))
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, ax=ax)

#px.bar(corr.loc["Target"].abs().sort_values(ascending=False)) # maybe leave out abs

**Findings:**

- the correlation pattern between the features is the same as for the dataset that we will use for the initial house price prediction model
- the additional features such as appreciation_time, prior_year, prior_month and prior_price do not show any remarkable correlations with other features
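
Statements like these are easier to verify by listing the strongest pairwise correlations than by reading them off the heatmap. Below is a minimal sketch on a toy frame; the helper `top_correlated_pairs` and the columns `a`, `b`, `c` are hypothetical names introduced for illustration. In this notebook, `corr` would be passed in place of `toy.corr()`.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(corr: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute correlation."""
    # Indices of the upper triangle (k=1 skips the diagonal), so each pair appears once
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.Series(
        corr.to_numpy()[rows, cols],
        index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
    )
    # Sort by absolute value but keep the signed coefficient in the result
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy data (assumption): b is almost a linear function of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

result = top_correlated_pairs(toy.corr(), n=3)
print(result)
```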

In [None]:
px.bar(corr.loc["appreciation"].sort_values(ascending=False).drop(["appreciation", "price"]),
             labels={'index': 'Features', 'value': 'Correlation Coefficient'},
             title='Feature Correlation with Appreciation')

**Findings:**

- overall, no feature is highly correlated with the appreciation
- the feature with the highest correlation is the appreciation time, at only 0.07; after that come the prior sale year with a negative correlation of -0.06 and the prior (initial) price with -0.04
- when we plot the individual features against our target, we cannot observe any obvious or clear patterns either
- all of this suggests that it may be better to predict the appreciation indirectly, by predicting the future price and then calculating the appreciation from it
- if we look at the correlations with the future price, we can clearly observe the same correlation pattern as for the initial price; the only main difference is that prior_price, as expected, has the highest correlation with the future price, at 0.76
- this supports the idea of predicting the future price and then calculating the resulting predicted appreciation, instead of predicting the appreciation directly
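
The indirect approach suggested in these findings can be sketched as follows. This is only an illustration on synthetic data: the straight-line fit via `np.polyfit`, the single `prior_price` feature, and all numbers are assumptions, not the model used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): future price is roughly proportional to the prior price
prior_price = rng.uniform(100_000, 500_000, size=200)
price = 1.2 * prior_price + rng.normal(0, 20_000, size=200)

# Step 1: fit a simple model for the future price (a straight line as a stand-in)
slope, intercept = np.polyfit(prior_price, price, deg=1)
predicted_price = slope * prior_price + intercept

# Step 2: derive the predicted appreciation from the predicted price
predicted_appreciation = (predicted_price - prior_price) / prior_price
actual_appreciation = (price - prior_price) / prior_price

print(f"mean predicted appreciation: {predicted_appreciation.mean():.3f}")
print(f"mean actual appreciation:    {actual_appreciation.mean():.3f}")
```

The point of the two-step design is that the price model can exploit the strong relationship between prior and future price, while the appreciation follows from it by plain arithmetic.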

In [None]:
px.bar(corr.loc["price"].sort_values(ascending=False).drop(["appreciation", "price"]),
             labels={'index': 'Features', 'value': 'Correlation Coefficient'},
             title='Feature Correlation with Future Price')

##### Mutual Information

In [None]:
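# keep only numeric/boolean features; datetime columns (saledate, prior_saledate) cannot be passed to the estimator
# appreciation is dropped as well because it is derived from the target price
X = df.drop(columns=["price", "appreciation"]).select_dtypes(include=["number", "bool"])
y = df.loc[:, "price"]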

In [None]:
# impute missing values with 0 so that the estimator receives no NaN
X = X.fillna(0)

In [None]:
# mutual information between each feature and the price
# (could be repeated after encoding the categorical variables)
mi_scores = mutual_info_regression(X, y, random_state=0)  # fixed seed for reproducible scores
mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
mi_scores = mi_scores.sort_values(ascending=False)

# plot
px.bar(y=mi_scores, x=mi_scores.index, orientation="v")

**Findings:**

- regarding the mutual information, we can see some differences compared to the correlation
- the most notable difference is that the number of people in poverty and the number of young people in poverty appear to be highly informative about the price
- the house price index and the household income, however, still rank high as well
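
One reason the mutual information ranking can differ from the correlation ranking is that the Pearson coefficient only captures linear relationships, while mutual information also picks up non-linear dependence. Here is a minimal sketch on synthetic data; the variables `x` and `y_toy` are made up for illustration and have nothing to do with the housing features.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
n = 2_000

# x is symmetric around zero; y_toy depends on x only through its square
x = rng.uniform(-1, 1, size=n)
y_toy = x ** 2 + rng.normal(scale=0.05, size=n)

# Pearson correlation: close to zero because the relationship is not linear
pearson = np.corrcoef(x, y_toy)[0, 1]

# Mutual information: clearly positive because y_toy is almost determined by x
mi = mutual_info_regression(x.reshape(-1, 1), y_toy, random_state=0)[0]

print(f"Pearson correlation: {pearson:.3f}")
print(f"Mutual information:  {mi:.3f}")
```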