# 03 Data Analysis (Correlation)

In this notebook we look at the correlations among the features as well as the correlation of each feature with the target variable (house price).

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import math
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from shapely import wkt

In [None]:
df = pd.read_csv("../data/cleaned/df_final.csv", index_col=0)

In [None]:
df_back = df.copy()

### Preprocessing

In [None]:
# drop columns
columns_to_drop = [
    "state_y",
    "geometry_y",
    "lag_month",
    "lag_year_y",
    "date_y",
    "lag_year_x",
]
df = df.drop(columns=columns_to_drop)

In [None]:
df["age"] = (df.year - df.yrblt)
df["eff_age"] = (df.year - df.effyrblt)

In [None]:
# drop where age is negative
df = df[df.age >= 0]

In [None]:
# set effyrblt and eff_age to NaN for the case where effyrblt > saledate (data leakage)
df.loc[df.eff_age < 0, "effyrblt"] = np.nan
df.loc[df.eff_age < 0, "eff_age"] = np.nan

In [None]:
# drop transactions with yrblt of zero
df = df[df.yrblt != 0]

In [None]:
# drop transactions with effyrblt of zero
df = df[df.effyrblt != 0]
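
One detail of the filters above is easy to miss: rows whose `effyrblt` was set to NaN survive the `!= 0` filter, because NaN compares as "not equal" to zero. A minimal sketch on a made-up frame (the values are hypothetical, only the column names follow the notebook):

```python
import numpy as np
import pandas as pd

# Toy data: the values are invented, only the column names match the notebook.
toy = pd.DataFrame(
    {
        "year": [2020.0, 2020.0, 2020.0, 2020.0],
        "effyrblt": [1995.0, 2021.0, 0.0, 1980.0],
    }
)
toy["eff_age"] = toy.year - toy.effyrblt

# Same masking step as in the notebook: a renovation year after the sale year
# is treated as leakage and replaced by NaN.
toy.loc[toy.eff_age < 0, ["effyrblt", "eff_age"]] = np.nan

# NaN != 0 evaluates to True, so the masked row is kept; only the zero is dropped.
filtered = toy[toy.effyrblt != 0]
print(filtered)
```

If rows with a missing `effyrblt` should be removed as well, an explicit `dropna(subset=["effyrblt"])` would be needed; the notebook keeps them.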

In [None]:
# create features longitude, latitude
df["geometry_x"] = df.geometry_x.apply(wkt.loads)
df["geometry_x"] = df.geometry_x.apply(lambda x: x.centroid)

df["longitude"] = df.geometry_x.apply(lambda x: x.x)
df["latitude"] = df.geometry_x.apply(lambda x: x.y)
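
To make the centroid step concrete, here is a small sketch on a hypothetical square polygon. It assumes the WKT coordinates are stored as (longitude, latitude); if the geometries were in a projected coordinate system, `x` and `y` would be projected units rather than degrees.

```python
from shapely import wkt

# Hypothetical parcel geometry as a WKT string (a 2 x 2 square).
polygon = wkt.loads("POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))")

# The centroid is a Point; its x and y attributes are used as longitude and latitude.
centroid = polygon.centroid
longitude, latitude = centroid.x, centroid.y
print(longitude, latitude)
```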

### Data Analysis

#### Correlations

In [None]:
df.columns

In [None]:
### correlations
corr = df.corr(numeric_only=True)

# plot 
fig, ax = plt.subplots(figsize=(30,20))
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, ax=ax)

#px.bar(corr.loc["Target"].abs().sort_values(ascending=False)) # maybe leave out abs

**Findings:**

- The effective living area also seems to correlate with other core features such as the number of bedrooms and the number of bathrooms.

- Most geographic frequency features are clearly positively correlated with the other geographic frequency features.
- The geographic distance features are mostly negatively correlated with the geographic frequency features. This makes sense: if the number of healthcare facilities within a 5 km radius is high, the distance to a hospital, which counts as a healthcare facility, is more likely to be small.
- Besides that, there are strong positive correlations between several economic factors, for example between the population and the number of employed persons. This makes sense because counties with larger populations are likely to have more employed people in total than counties with very small populations. On closer inspection, however, the population and the unemployment rate are not strongly correlated: the unemployment rate is relative to the population and can therefore differ from county to county without the population having a clear effect.
- Within the economic factors there are also some negatively correlated features, such as poverty_rate and n_employed, which also makes economic sense: if poverty is high, it is more likely that fewer people are employed, which generally indicates a weak economy. The same holds for unemployment_rate and household_income: a high unemployment_rate suggests a weak economy, so it makes sense that the average household_income would be low.
- Another interesting point is the strong positive correlation between the house price index (hpi) and the number of new housing permits. Since the house price index captures the price development of houses, it is plausible that when house prices generally go up, the interest in building or acquiring a house also increases. However, the hpi does not seem to be strongly correlated with the other economic features.

- Another interesting observation is the very strong correlation between an economic or demographic factor such as the population and some geographic distance features.
- The population shows a very strong negative correlation with the distance to a railway station. This is probably because there tend to be more railway stations in cities or counties with a higher population, so the distance to a railway station is generally smaller than in counties with a very low population. The same effect can be seen for the number of employed people, which follows from the strong positive correlation between the number of employed people and the population.


Price (price):

- Shows moderate correlation with variables like household_income and hpi.

Household Income (household_income):

- Strongly positively correlated with hpi and new_housing.
- Strongly negatively correlated with poverty_rate, unemployment_rate, and n_unemployed.

Employment-related Variables:

- unemployment_rate and n_unemployed are positively correlated.
- These variables are negatively correlated with household_income and n_employed.
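
The count-versus-rate argument from the findings can be checked on synthetic data. The sketch below uses made-up counties in which the unemployment rate is drawn independently of the population; it then lists the feature pairs ordered by absolute correlation, which is one way to read off the strongest relationships without scanning the heatmap. The numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic counties (all values are invented for illustration).
population = rng.lognormal(mean=10, sigma=1, size=n)
unemployment_rate = rng.uniform(0.02, 0.12, size=n)  # independent of county size
n_employed = population * 0.5 * (1 - unemployment_rate)  # a count that scales with size

toy = pd.DataFrame(
    {
        "population": population,
        "n_employed": n_employed,
        "unemployment_rate": unemployment_rate,
    }
)
toy_corr = toy.corr()

# Keep each pair once (upper triangle, without the diagonal) and sort by magnitude.
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = (
    toy_corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```

In this toy setup the two size-driven columns are almost perfectly correlated, while the rate is close to uncorrelated with the population, which mirrors the pattern described in the findings.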

In [None]:
px.bar(corr.loc["price"].sort_values(ascending=False).drop("price"),
             labels={'index': 'Features', 'value': 'Correlation Coefficient'},
             title='Feature Correlation with Price')

**Findings:**

- Regarding the correlation with the target, the sale price, we can see that no feature has a particularly high correlation with the price.
- The feature with the highest correlation (0.5) is the effective living area.
- The number of bathrooms and the living area follow with correlations of 0.45 and 0.4.
- The household_income and the house price index follow after that as economic factors.
- Among the geographic distance features, the distance to the airport has the highest correlation with the price.
- Interestingly, the frequency features seem to have a very low correlation with the price, which is also reflected by our previous analysis.
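
The bar chart above is sorted by signed correlation, so strongly negative features end up at the far end of the axis. A minimal sketch, using hypothetical feature names and values, of ranking by absolute correlation while keeping the sign for display:

```python
import pandas as pd

# Hypothetical correlations with the target; not taken from the dataset.
corr_with_price = pd.Series(
    {"feat_a": 0.50, "feat_b": -0.45, "feat_c": 0.10, "feat_d": -0.05}
)

# Order by magnitude, then look up the signed values in that order.
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.loc[order]
print(ranked)
```

Whether to rank by magnitude or by signed value depends on the question: magnitude shows which features are most strongly related to the price at all, while the signed order separates positive from negative relationships.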

##### Mutual Information

In [None]:
X = df.drop(columns="price").select_dtypes(exclude=["object"])
y = df.loc[:, "price"]

In [None]:
X = X.fillna(0)

In [None]:
### mutual information (maybe also after encoding of categorical variables)
mi_scores = mutual_info_regression(X, y)
mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
mi_scores = mi_scores.sort_values(ascending=False)



# plot
px.bar(y=mi_scores, x=mi_scores.index, orientation="v")

- Regarding the mutual information, we can see some differences compared to the correlation.
- The most significant difference is that the number of people in poverty and the number of young people in poverty seem to have a high impact on the price.
- The house price index and the household income still appear to have a high impact as well.
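
One reason the two rankings can differ is that the correlation coefficient only measures linear association, whereas mutual information also picks up non-linear dependence. The sketch below illustrates this on synthetic data (a quadratic relationship and a pure-noise feature, both invented here). Note that mutual information is non-negative and carries no sign, so it says nothing about the direction of a relationship, and the estimator in scikit-learn is randomized, which is why `random_state` is fixed.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 2000

# Synthetic features: one drives the target quadratically, one is unrelated.
x_quadratic = rng.uniform(-1, 1, size=n)
x_noise = rng.uniform(-1, 1, size=n)
y_toy = x_quadratic**2 + rng.normal(scale=0.05, size=n)

X_toy = np.column_stack([x_quadratic, x_noise])

# Pearson correlation of each feature with the target.
pearson = [np.corrcoef(X_toy[:, j], y_toy)[0, 1] for j in range(X_toy.shape[1])]

# Mutual information of each feature with the target (k-nearest-neighbour estimate).
mi_toy = mutual_info_regression(X_toy, y_toy, random_state=0)

print("pearson:", pearson)
print("mutual information:", mi_toy)
```

Here the quadratic feature has a correlation close to zero but a clearly higher mutual information than the noise feature, so a feature can rank low in the correlation plot and high in the mutual information plot.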