# 02 Data Analysis (Target Plotting)

In this notebook we look at the relationship between our different features and our target variable, the house price.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import math
import numpy as np
from shapely import wkt

In [None]:
df = pd.read_csv("../data/cleaned/df_final.csv", index_col=0)

In [None]:
df_back = df.copy()

### Preprocessing

In [None]:
# drop columns
columns_to_drop = [
    "state_y",
    "geometry_y",
    "lag_month",
    "lag_year_y",
    "date_y",
    "lag_year_x",
]
df = df.drop(columns=columns_to_drop)

In [None]:
df["age"] = (df.year - df.yrblt)
df["eff_age"] = (df.year - df.effyrblt)

In [None]:
# drop where age is negative
df = df[df.age >= 0]

In [None]:
# set effyrblt and eff_age to NaN for the case where effyrblt > saledate (data leakage)
df.loc[df.eff_age < 0, "effyrblt"] = np.nan
df.loc[df.eff_age < 0, "eff_age"] = np.nan

In [None]:
# drop transactions with yrblt of zero
df = df[df.yrblt != 0]

In [None]:
# drop transactions with effyrblt of zero
df = df[df.effyrblt != 0]

In [None]:
# create features longitude, latitude
df["geometry_x"] = df.geometry_x.apply(wkt.loads)
df["geometry_x"] = df.geometry_x.apply(lambda x: x.centroid)

df["longitude"] = df.geometry_x.apply(lambda x: x.x)
df["latitude"] = df.geometry_x.apply(lambda x: x.y)

### Data Analysis

#### Target Plotting

##### livarea and efflivarea

In [None]:
df_copy = df.copy()

In [None]:
df_copy["livarea_rage"] = pd.cut(df_copy.livarea ,bins=20)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="livarea_rage", ax=ax)

# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Living Area Ranges (in Square Feet)",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=45, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("livarea", fontsize=24,weight="bold")
plt.show()
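
The axis-labeling lines in the cell above recur in most plotting cells of this notebook. A minimal sketch of how they could be wrapped in one helper follows; `format_price_axis` is a hypothetical name that does not appear elsewhere in the notebook, and the bar values are made up. The sketch uses a `FuncFormatter` instead of `set_yticklabels`, so the labels follow whatever ticks matplotlib chooses.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def format_price_axis(ax, xlabel, title, rotation=45):
    """Label a price plot and show the y-axis in thousands (hypothetical helper)."""
    ax.set_ylabel("Price ($ in thousands)", fontsize=20)
    ax.set_xlabel(xlabel, fontsize=20)
    # The formatter converts each tick value to thousands when the axis is drawn
    ax.yaxis.set_major_formatter(
        FuncFormatter(lambda value, _pos: f"{int(value / 1000)}")
    )
    ax.tick_params(axis="x", which="major", labelrotation=rotation, labelsize=14)
    ax.tick_params(axis="y", which="major", labelsize=14)
    ax.set_title(title, fontsize=24, weight="bold")
    return ax


# Made-up bar heights, only to have something to label
fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(["small", "large"], [250_000, 500_000])
format_price_axis(ax, "Living Area Ranges (in Square Feet)", "livarea")
```

With such a helper, each plotting cell would only need the seaborn call followed by one `format_price_axis(...)` call.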

In [None]:
df_copy["efflivarea_rage"] = pd.cut(df_copy.efflivarea ,bins=10)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="efflivarea_rage", ax=ax)
# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Effective Living Area Ranges (in Square Feet)",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=45, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("efflivarea", fontsize=24,weight="bold")
plt.show()

##### age and eff_age

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
sns.scatterplot(df, y="price", x="age", alpha=.5, ax=ax)

In [None]:
df_copy["age_range"] = pd.cut(df_copy.age ,bins=20)
df_copy["age_range_q"] = pd.qcut(df_copy.age ,q=20, duplicates="drop")

In [None]:
df_copy.price

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="age_range", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="age_range", ax=ax)

# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Age Range",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=45, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("age", fontsize=24, weight="bold")

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="age_range_q", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df, y="price", x="eff_age", alpha=.5, ax=ax)

In [None]:
df_copy["eff_age_range"] = pd.cut(df_copy.eff_age ,bins=20)
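# duplicates="drop" avoids an error when several quantile edges coincide (as for age above)
df_copy["eff_age_range_q"] = pd.qcut(df_copy.eff_age, q=20, duplicates="drop")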

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="eff_age_range", ax=ax)
# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Effective Age",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=45, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("eff_age", fontsize=24,weight="bold")
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="eff_age_range_q", ax=ax)

##### city

In [None]:
# city
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="city", ax=ax)

In [None]:
# city
# Set display format to avoid e-notation
pd.options.display.float_format = '{:.2f}'.format
df_copy.groupby(by="city")["price"].mean().sort_values(ascending=False)

In [None]:
pd.set_option("display.max_rows", None)
df_copy.groupby(by="city").size().sort_values(ascending=False)

##### Condition Description

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="cond_desc", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="cond_desc", ax=ax, estimator=np.median)

# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Condition",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=0, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("cond_desc", fontsize=24, weight="bold")
plt.show()

In [None]:
px.box(df_copy, y="price", x="cond_desc")

In [None]:
df_copy.groupby("cond_desc")["price"].median().sort_values()

In [None]:
df_copy.groupby("cond_desc").size().sort_values()

##### Sale Year

In [None]:
# date
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="saledate", ax=ax)

In [None]:
px.bar(df_copy, y="price", x="year")

In [None]:
# year
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="year", ax=ax)

In [None]:
# year
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy, y="price", x="year", ax=ax)

In [None]:
# year
fig, ax = plt.subplots(figsize=(40,20))
sns.lineplot(df_copy, y="price", x="year", ax=ax)

# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Sale Year",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=0, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("year", fontsize=24,weight="bold")
plt.show()

##### Sale Month

In [None]:
# month
fig, ax = plt.subplots(figsize=(40,20))
sns.lineplot(df_copy, y="price", x="month", ax=ax)

# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Sale Month",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.set_xticks(range(1, 13))  # months run from 1 to 12
ax.tick_params(axis='x', which='major', labelrotation=0, labelsize=16)
ax.tick_params(axis='y', which='major', labelsize=16)
ax.set_title("month", fontsize=24,weight="bold")
plt.show()

In [None]:
# month
px.box(df_copy, y="price", x="month")

In [None]:
df_copy.groupby(by="month")["price"].mean().sort_values(ascending=False)

In [None]:
df_copy.groupby(by="month")["price"].median().sort_values(ascending=False)

##### Bedrooms

In [None]:
# bedrooms
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy, y="price", x="nbed", ax=ax)

In [None]:
# bedrooms
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy, y="price", x="nbed", ax=ax, estimator=np.median)

In [None]:
# bedrooms
fig, ax = plt.subplots(figsize=(40,20))
sns.boxplot(df_copy, y="price", x="nbed", ax=ax)

# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Number of Bedrooms",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=0, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("nbed", fontsize=24, weight="bold")

plt.show()

In [None]:
# bedrooms
fig, ax = plt.subplots(figsize=(40,20))
sns.regplot(df_copy, y="price", x="nbed", ax=ax)

In [None]:
px.box(df_copy, y="nbed")

In [None]:
px.histogram(df_copy, x="nbed")

In [None]:
# we have some extreme cases regarding the number of the bedrooms (upper fence = 5)
# bedrooms

#fig, ax = plt.subplots(figsize=(40,20))
px.box(df_copy[df_copy.nbed < 10], y="price", x="nbed")

##### Bathrooms and Half-Bathrooms

In [None]:
# bathrooms
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy, y="price", x="nbath", ax=ax)

In [None]:
px.box(df_copy, y="nbath")

In [None]:
# number of bathrooms (upper fence 4)
px.histogram(df_copy, "nbath")

In [None]:
fig, ax = plt.subplots(figsize=(40,20))
sns.regplot(df_copy[df_copy.nbath < 7], y="price", x="nbath", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(40,20))
sns.boxplot(df_copy[df_copy.nbath < 7], y="price", x="nbath", ax=ax)

# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Number of Bathrooms",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=0, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("nbath", fontsize=24, weight="bold")

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy[df_copy.nbath < 7], y="price", x="nbath", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(40,20))
sns.boxplot(df_copy[df_copy.nhalfbath < 4], y="price", x="nhalfbath", ax=ax)

# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Number of Half Bathrooms",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=0, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("nhalfbath", fontsize=24, weight="bold")
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy[df_copy.nhalfbath < 4], y="price", x="nhalfbath", ax=ax)

In [None]:
# upper fence at 2
px.box(df_copy, "nhalfbath")

##### distance_aerodrome

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="distance_aerodrome", ax=ax)

In [None]:
df_copy["bins_aerodrome"] = pd.cut(df_copy.distance_aerodrome, bins=40)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="price", x="bins_aerodrome", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="bins_aerodrome", ax=ax)
# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Distance Aerodrome (in Kilometers)",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=90, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("distance_aerodrome", fontsize=24, weight="bold")
plt.show()

##### distance_ferry_terminal

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="distance_ferry_terminal", ax=ax)

In [None]:
df_copy["bins_ferry"] = pd.cut(df_copy.distance_ferry_terminal, bins=30)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="price", x="bins_ferry", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="bins_ferry", ax=ax)
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Distance Ferry Terminal (in Kilometers)",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=90, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("distance_ferry_terminal", fontsize=24, weight="bold")
plt.show()

##### distance_hospital

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="distance_hospital", ax=ax)

In [None]:
df_copy["bins_hospital"] = pd.cut(df_copy.distance_hospital, bins=30, labels=range(0,30))

In [None]:
# fig, ax = plt.subplots(figsize=(30,10))
px.box(df_copy, y="price", x="bins_hospital")

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="bins_hospital", ax=ax)

ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Distance Hospital (in Kilometers)",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=0, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("distance_hospital", fontsize=24, weight="bold")
plt.show()

##### distance_hotel

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="distance_hotel", ax=ax)

In [None]:
df_copy["bins_hotel"] = pd.cut(df_copy.distance_hotel, bins=30)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="price", x="bins_hotel", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="bins_hotel", ax=ax)
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Distance Hotel (in Kilometers)",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=90, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("distance_hotel", fontsize=24, weight="bold")
plt.show()

In [None]:
df_copy.groupby(by="bins_hotel")["price"].mean()

In [None]:
df_copy.groupby(by="bins_hotel")["price"].median()

##### distance_market

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="distance_market", ax=ax, alpha=.3)

In [None]:
df_copy["bins_market"] = pd.cut(df_copy.distance_market, bins=30)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="price", x="bins_market", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="bins_market", ax=ax)
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Distance Market (in Kilometers)",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=90, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("distance_market", fontsize=24, weight="bold")
plt.show()

##### distance_museum

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="distance_museum", ax=ax, alpha=.3)

In [None]:
df_copy["bins_museum"] = pd.cut(df_copy.distance_museum, bins=10)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="price", x="bins_museum", ax=ax)

In [None]:
df_copy.groupby(by="bins_museum")["price"].median()

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="bins_museum", ax=ax)
# Adjusting the y-axis to display prices in thousands
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Distance to Museum",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=45, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("distance_museum", fontsize=24, weight="bold")
plt.show()

##### distance_railway_station

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="distance_railway_station", ax=ax, alpha=.3)

In [None]:
df_copy["bins_railway"] = pd.cut(df_copy.distance_railway_station, bins=20)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="price", x="bins_railway", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="bins_railway", ax=ax)
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Distance Railway Station (in Kilometers)",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=90, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("distance_railway_station", fontsize=24, weight="bold")
plt.show()

In [None]:
df_copy.groupby(by="bins_railway")["price"].mean()

In [None]:
df_copy.groupby(by="bins_railway")["price"].median()

##### n_accommodation

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="n_accommodation", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="price", x="n_accommodation", ax=ax)

In [None]:
px.bar(df_copy.groupby(by="n_accommodation")["price"].median())

In [None]:
df_copy["bins_acc"] = pd.cut(df_copy.n_accommodation, bins=10)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="bins_acc", ax=ax)


##### n_food_drink

In [None]:
df_copy["bins_food"] = pd.cut(df_copy.n_food_drink, bins=10)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="bins_food", ax=ax)

In [None]:
df_copy["bins_food"] = pd.cut(df_copy.n_food_drink, bins=10, labels=range(10))
px.bar(df_copy.groupby(by="bins_food")["price"].median())

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="n_food_drink", ax=ax, alpha=.3)

##### n_adults_entertain

In [None]:
df_copy.n_adults_entertain.unique()

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="n_adults_entertain", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="n_adults_entertain", ax=ax, alpha=.4)

In [None]:
df_copy["bins_adults"] = pd.cut(df_copy.n_adults_entertain, bins=20)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="bins_adults", ax=ax)

In [None]:
df_copy["bins_adults"] = pd.cut(df_copy.n_adults_entertain, bins=20, labels=range(20))
px.bar(df_copy.groupby(by="bins_adults")["price"].median())

##### frequency_features

In [None]:
df_copy = df.copy()

In [None]:
columns= [
#'n_animalcare',
 #'n_commu_serv',
 #'n_commu_venu',
 #'n_edu_fac',
 #'n_emergency',
 'n_entertainment',
 #'n_financial',
 #'n_food_drink',
 #'n_government_civic',
 'n_healthcare',
 #'n_recreational',
 'n_reli_inst',
 #'n_shopping',
 #'n_sports',
 #'n_transport',
 'n_utilities',
]

In [None]:
for col in columns:
    print(col)
    df_copy[f"bins_{col}"] = pd.cut(df_copy[col], bins=20)
    fig, ax = plt.subplots(figsize=(30,10))
    sns.barplot(df_copy, y="price", x=f"bins_{col}", ax=ax) 
    plt.show()

In [None]:
for col in columns:
    print(col)
    df_copy[f"bins_{col}"] = pd.cut(df_copy[col], bins=20)
    fig, ax = plt.subplots(figsize=(30,10))
    sns.barplot(df_copy, y="price", x=f"bins_{col}", ax=ax, estimator=np.median) 
    plt.show()

In [None]:

df_copy[f"bins_n_utilities"] = pd.cut(df_copy["n_utilities"], bins=20)
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x=f"bins_n_utilities", ax=ax, estimator=np.median) 
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Number of Utility Facilities",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=90, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("n_utilities", fontsize=24, weight="bold")
plt.show()

In [None]:

df_copy[f"bins_n_reli_inst"] = pd.cut(df_copy["n_reli_inst"], bins=20)
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x=f"bins_n_reli_inst", ax=ax, estimator=np.median) 
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Number of Religious Institutions",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=90, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("n_reli_inst", fontsize=24, weight="bold")
plt.show()

In [None]:

df_copy[f"bins_n_healthcare"] = pd.cut(df_copy["n_healthcare"], bins=20)
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x=f"bins_n_healthcare", ax=ax, estimator=np.median) 
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Number of Healthcare Facilities",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=90, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("n_healthcare", fontsize=24, weight="bold")
plt.show()

In [None]:

df_copy[f"bins_n_entertainment"] = pd.cut(df_copy["n_entertainment"], bins=20)
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x=f"bins_n_entertainment", ax=ax, estimator=np.median) 
ax.set_ylabel('Price ($ in thousands)', fontsize=20)
ax.set_xlabel("Number of Entertainment Facilities",fontsize=20)
ax.set_yticklabels([f'{int(label/1000)}' for label in ax.get_yticks()])
ax.tick_params(axis='x', which='major', labelrotation=90, labelsize=14)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_title("n_entertainment", fontsize=24, weight="bold")
plt.show()

### Results

In most frequency features, no clear-cut pattern can be seen.

n_utilities:
- A higher frequency is associated with a lower price.

n_shopping:
- An increasing frequency is associated with a higher price.

n_reli_inst, n_entertainment, n_healthcare:
- The price first decreases with increasing frequency and then increases again, which indicates a quadratic relationship.
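
One way to check such a U-shaped pattern is to compare a linear and a quadratic fit of the binned prices. The sketch below uses `np.polyfit` on made-up values that are not taken from the dataset; the variable names are illustrative only.

```python
import numpy as np

# Made-up median prices per frequency bin with a U-shape (not from the dataset)
frequency = np.arange(0, 20)
median_price = 300_000 + 1_500 * (frequency - 10) ** 2

# Least-squares polynomial fits of degree 1 and degree 2
linear_coeffs = np.polyfit(frequency, median_price, deg=1)
quadratic_coeffs = np.polyfit(frequency, median_price, deg=2)


def sum_squared_residuals(coeffs):
    """Sum of squared differences between the observed and the fitted prices."""
    predicted = np.polyval(coeffs, frequency)
    return float(np.sum((median_price - predicted) ** 2))


linear_error = sum_squared_residuals(linear_coeffs)
quadratic_error = sum_squared_residuals(quadratic_coeffs)

# A positive leading coefficient means the fitted parabola opens upwards
opens_upwards = quadratic_coeffs[0] > 0
```

A clearly smaller `quadratic_error` together with a positive leading coefficient would support the "decrease, then increase" reading; on real data the binned prices are noisy, so the comparison is only indicative.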

n_financial, n_transport:
- Here it is the other way around: the price first increases and then suddenly decreases.


n_adults_entertain:
- No clear-cut pattern; the price increases with increasing frequency and then decreases again.

n_food_drink:
- Up to a frequency of 260, both the mean price and the median price increase continuously.

n_accommodation:
- No clear-cut pattern can be seen.


Distance to railway station:
- The price decreases with increasing distance to the railway station, but at a distance of around 62 km it increases again. There are significantly fewer houses at large distances from the railway station, so each of them has a stronger effect on the mean price.


Distance to museum:
- With increasing distance, both the mean price and the median price generally decrease.

Distance to market (possibly delete):
- We cannot see a clear pattern.


Distance to hotel:
- With higher distance the price increases slightly at first.
- At large distances to hotels, however, the mean price is quite small: starting from a distance of 15 km, it decreases nearly continuously.
- The same is the case for the median price.



Distance to hospital:
- The median price decreases with higher distance to the hospital: first a slight increase, then a clear decrease.
- The same is the case for the mean price.


Distance to ferry terminal:
- The median price does not really differ for different distances to a ferry terminal.
- The mean price varies, but without a clear linear pattern. This can be due to the effect of outliers on the mean: the number of outliers varies across the distance bins in a similar way to the mean price.

Distance to aerodrome:
- The price increases slightly with increasing distance to the airport.
- After a certain distance to the airport, the median and the mean price decrease again.
- This may be explained by airports tending to lie outside of cities.
- Being near an airport could therefore indicate being far away from the city center.

Number of half bathrooms:
- The upper fence is already at 2.
- Generally we see the same effect as for the number of bathrooms when we do not consider the outliers or extreme values in the dataset.
- An increasing number of half bathrooms goes along with an increasing price.
- However, we generally do not have many houses with many half bathrooms.


Number of bathrooms:
- Varies greatly (most values lie between 0 and the upper fence of 4).
- Some extreme values (seen in the descriptive analysis).
- We do not consider the outlier values (bigger than 6), since we do not want to get a biased picture from extreme cases.
- Generally, the price increases with an increasing number of bathrooms.
- Interesting point: some buildings have no bathrooms.


Number of bedrooms:
- The sales price increases with the number of bedrooms up to 5 bedrooms.
- After that we do not see this effect as clearly, partly because the upper fence of the box plot is at 5 and we have only very few houses with more than 5 bedrooms.
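
The upper fences quoted for bedrooms and bathrooms come from the box plots. As a rough sketch, the usual Tukey limit (third quartile plus 1.5 times the interquartile range) can be computed directly; the bedroom counts below are made up, `upper_fence` is a hypothetical helper, and the exact value depends on the quantile method the plotting library uses.

```python
import pandas as pd


def upper_fence(values: pd.Series) -> float:
    """Tukey limit: third quartile plus 1.5 times the interquartile range."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return float(q3 + 1.5 * (q3 - q1))


# Made-up bedroom counts with one extreme value (not from the dataset)
nbed = pd.Series([2, 3, 3, 3, 3, 4, 4, 4, 5, 12])

fence = upper_fence(nbed)
# Observations above the limit would be drawn as outliers in a box plot
within_fence = nbed[nbed <= fence]
```

Box plots typically draw the upper whisker at the largest observation that does not exceed this limit, which is why the reported fence is a whole number of rooms.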

Sale Month:
- We see a slight increase in the mean sales price for the summer months (June, July, August).
- In the winter months it is lower (January, November, October).

Sale Year:
- The mean sales price per year increases until 1990 (first decrease) and then decreases slightly.
- Very significant decrease in the mean sales price in 2008 and 2009, followed by a slight increase.
- In 2015 and 2016 again a slight decrease.


Condition:
- The number of houses differs greatly between the conditions; for poor condition in particular we have very few houses (only 1537 vs 610260 for Average). The mean price per category therefore gives a very different picture than the median, because individual transactions have a stronger effect in the small poor category.
- For the median price per category we get the expected order, from lowest to highest: poor -> fair -> average -> average plus -> good.
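
The effect of group size on the mean can be made visible by reporting count, mean and median side by side. The sketch below uses a handful of made-up transactions, not the real data; only the column names `cond_desc` and `price` are taken from the notebook.

```python
import pandas as pd

# Made-up transactions (not from the dataset): one expensive sale in a small group
toy = pd.DataFrame({
    "cond_desc": ["Poor", "Poor", "Poor", "Average", "Average", "Average", "Average"],
    "price": [40_000, 50_000, 900_000, 200_000, 210_000, 220_000, 230_000],
})

# Count, mean and median per condition, ordered by the median
summary = (
    toy.groupby("cond_desc")["price"]
    .agg(["count", "mean", "median"])
    .sort_values("median")
)
```

In this toy table the single expensive sale lifts the mean of the small Poor group above that of Average, while the medians keep the expected order.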


City:
- Highest mean prices in the cities Greenwich, Westport and New Canaan.
- Very high mean price for Greenwich at 2,072,040.71 dollars.
- Lowest mean prices in the cities Plainville, Norwich and East Hartford.
- Very low mean price in Plainville at 133,000 dollars.



Age:
- We observe more high-priced houses at low ages, e.g. 0 to 20 years; however, it needs to be considered that younger houses are more frequent in our dataset (most have age 0).
- When we examine the average price for the different age ranges, it first decreases with increasing age and then goes up again. However, older houses are less frequent in our dataset, so individual transactions have a higher effect on the mean for these age ranges.
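
The uneven number of houses per age range is the reason the notebook uses both `pd.cut` (equal-width bins) and `pd.qcut` (quantile bins). The sketch below contrasts the bin sizes of the two on made-up, right-skewed ages; the distribution and its parameters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up, right-skewed ages: many new houses, few old ones (not from the dataset)
age = pd.Series(rng.exponential(scale=15, size=1_000).round())

# Equal-width bins: same age span per bin, very different numbers of houses
equal_width = pd.cut(age, bins=5)
# Quantile bins: roughly the same number of houses per bin, different age spans
equal_count = pd.qcut(age, q=5, duplicates="drop")

width_counts = equal_width.value_counts(sort=False)
quantile_counts = equal_count.value_counts(sort=False)
```

With equal-width bins the oldest bins contain only a few houses, so their mean price is easily moved by single transactions; quantile bins keep the group sizes comparable at the cost of uneven age spans.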

Effective age:
- For the effective age we see similar effects.
- If we take the average over the different quantiles, the price generally decreases with increasing age up to a point where it slightly increases again.

livarea:
- For living area we see a slight tendency for the price to increase with the living area. However, this does not hold for every house: some houses have more living area but still a lower price than others. This is to be expected, because living area is not the only factor influencing the price.