# 02 Data Analysis Appreciation (Target Plotting)

In this notebook, we look at the relationship between our different features and our target variable, the house price appreciation.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import math
import numpy as np

In [2]:
df_origin = pd.read_csv("../data/cleaned/df_appreciation_final.csv", index_col=0)

In [3]:
df = df_origin.copy()

### Preprocessing

In [4]:
# drop columns
columns_to_drop = [
    "state_y",
    "geometry_y",
    "lag_month",
    "lag_year_y",
    "date_y",
    "lag_year_x",
]
df = df.drop(columns=columns_to_drop)

In [5]:
df["age"] = (df.year - df.yrblt)
df["eff_age"] = (df.year - df.effyrblt)

In [6]:
# drop where age is negative
df = df[df.age >= 0]

In [7]:
# set effyrblt and eff_age to NaN for the case where effyrblt > saledate (data leakage)
df.loc[df.eff_age < 0, "effyrblt"] = np.nan
df.loc[df.eff_age < 0, "eff_age"] = np.nan

In [8]:
# drop transactions with yrblt of zero
df = df[df.yrblt != 0]

In [9]:
# drop transactions with effyrblt of zero
df = df[df.effyrblt != 0]

- prior_price
- price
- saledate
- prior_saledate
- year
- prior_year
- month
- prior_month

In [None]:
# calculate appreciation relative to the prior price (a fraction; multiply by 100 for percent)
df["appreciation"] = (df["price"] - df["prior_price"])/df.prior_price

In [None]:
df["saledate"] = pd.to_datetime(df.saledate)
df["prior_saledate"] = pd.to_datetime(df.prior_saledate)

In [None]:
df["appreciation_time"] = df.saledate - df.prior_saledate

In [None]:
df["appreciation_time"] = df.appreciation_time.dt.days

In [None]:
# drop cases where the sale date equals the prior sale date (no appreciation time), mostly double recordings of the same transaction
df = df[df.saledate != df.prior_saledate]

In [None]:
# drop non-positive appreciation times (only four negative cases)
df = df[df.appreciation_time > 0]

In [None]:
# calculate the appreciation per day
df["appreciation_per_day"] = np.where(df.appreciation_time == 0, df.appreciation, (df["appreciation"] / df.appreciation_time))

In [None]:
df[df.appreciation_per_day > 10][["appreciation_per_day", "price", "prior_price", "appreciation_time"]]
# we have to filter out very small prior prices (e.g. a prior_price of 10)

In [None]:
# filter out prior prices smaller than 100
df = df[df.prior_price > 100]

In [None]:
df[df.appreciation_per_day > 10][["appreciation_per_day", "price", "prior_price", "appreciation_time"]]

In [None]:
df[df.appreciation_per_day.abs() > 1][["appreciation_per_day", "price", "prior_price", "appreciation_time"]]

In [None]:
# annualize using a 360-day year (linear scaling, no compounding)
df["appreciation_per_year"] = df.appreciation_per_day * 360

In [None]:
df[(df["appreciation_per_year"] > 100) & (df["appreciation_time"] < 360)][["price", "prior_price", "saledate", "prior_saledate", "appreciation_per_year", "appreciation_time"]]

In [None]:
# drop outliers whose absolute yearly appreciation exceeds 100 (i.e. 10,000 %)
df = df[df["appreciation_per_year"].abs() <= 100]
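The target construction above can be condensed into one helper. The following is a minimal sketch on a tiny made-up DataFrame: the function name `add_appreciation_columns` and the toy values are assumptions for illustration, while the column names and the 360-day convention follow the cells above.

```python
import pandas as pd


def add_appreciation_columns(df: pd.DataFrame, days_per_year: int = 360) -> pd.DataFrame:
    """Return a copy of df with the appreciation columns used in this notebook."""
    out = df.copy()
    out["saledate"] = pd.to_datetime(out["saledate"])
    out["prior_saledate"] = pd.to_datetime(out["prior_saledate"])
    # relative price change as a fraction (0.10 means +10 %)
    out["appreciation"] = (out["price"] - out["prior_price"]) / out["prior_price"]
    # holding period in days
    out["appreciation_time"] = (out["saledate"] - out["prior_saledate"]).dt.days
    # keep only rows with a positive holding period
    out = out[out["appreciation_time"] > 0].copy()
    out["appreciation_per_day"] = out["appreciation"] / out["appreciation_time"]
    out["appreciation_per_year"] = out["appreciation_per_day"] * days_per_year
    return out


# hypothetical toy transactions; the last one has identical sale dates and is dropped
toy = pd.DataFrame(
    {
        "prior_price": [100_000, 200_000, 150_000],
        "price": [110_000, 180_000, 150_000],
        "prior_saledate": ["2010-01-01", "2012-06-01", "2015-03-01"],
        "saledate": ["2011-01-01", "2014-06-01", "2015-03-01"],
    }
)
result = add_appreciation_columns(toy)
print(result[["appreciation", "appreciation_time", "appreciation_per_year"]])
```

Note that this is a linear annualization, as in the cells above: a short holding period with a large price change is scaled up proportionally rather than compounded.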

### Data Analysis

#### Target Plotting

##### age and eff_age

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
sns.scatterplot(df, y="appreciation_per_day", x="age", alpha=.5, ax=ax)

In [None]:
df_copy = df.copy()

In [None]:
df_copy["age_range"] = pd.cut(df_copy.age, bins=20)
df_copy["age_range_q"] = pd.qcut(df_copy.age, q=20, duplicates="drop")

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="age_range", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="age_range", ax=ax, estimator=np.median)

**Finding:**
- we can see a large appreciation value for new houses
- in general, however, the relationship between age and appreciation per year appears to be quadratic
- the appreciation first increases with age up to an age of around 130 to 150 years, and after that it seems to decrease
- there are still some extreme cases of very old houses with large appreciation values, but if we disregard these extreme cases, we can generally see this relationship

- the same picture emerges if we consider the median values instead of the mean values
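One way to check the quadratic impression numerically is to fit a second-degree polynomial to the binned medians. The sketch below uses synthetic data (the coefficients and the noise level are invented, not taken from the dataset) together with `np.polyfit`; a negative leading coefficient indicates an inverted-U shape, and the vertex gives the age at which the fitted curve peaks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic stand-in for the real data: inverted-U relationship peaking at age 140
age = rng.uniform(0, 250, size=5_000)
appreciation = 0.06 - 2e-6 * (age - 140) ** 2 + rng.normal(0, 0.01, size=age.size)
toy = pd.DataFrame({"age": age, "appreciation_per_year": appreciation})

# bin the age into equal-width bins as in the notebook and take the median per bin
toy["age_range"] = pd.cut(toy["age"], bins=20)
binned = toy.groupby("age_range", observed=True)["appreciation_per_year"].median()
bin_mid = np.array([interval.mid for interval in binned.index])

# fit y = a*x^2 + b*x + c to the bin medians
a, b, c = np.polyfit(bin_mid, binned.to_numpy(), deg=2)
peak_age = -b / (2 * a)
print(f"a={a:.2e}, fitted peak at age {peak_age:.0f}")
```

Fitting on bin medians rather than raw points keeps the extreme cases mentioned above from dominating the fit, at the cost of treating sparsely and densely populated bins alike.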

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_day", x="age_range_q", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df, y="appreciation_per_day", x="eff_age", alpha=.5, ax=ax)

In [None]:
df_copy["eff_age_range"] = pd.cut(df_copy.eff_age, bins=20)
# duplicates="drop" avoids an error when several quantile edges coincide
df_copy["eff_age_range_q"] = pd.qcut(df_copy.eff_age, q=20, duplicates="drop")

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="eff_age_range", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="eff_age_range_q", ax=ax)

##### city

In [None]:
# city
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="city", ax=ax)

In [None]:
# city
# Set display format to avoid e-notation
pd.options.display.float_format = '{:.9f}'.format
df_copy.groupby(by="city")["appreciation_per_day"].mean().sort_values(ascending=False) * 100

In [None]:
# city
# Set display format to avoid e-notation
pd.options.display.float_format = '{:.9f}'.format
df_copy.groupby(by="city")["appreciation_per_day"].median().sort_values(ascending=False) * 100

**Findings**:
- The city with the largest appreciation is Chaplin with a mean value of 117 %, followed by Colchester, Windsor and Lebanon, which have mean yearly appreciation values of more than 100 %.
- The cities with the lowest appreciation are Dunn Loring with a mean appreciation of around 8.14 %, as well as Oakton, Fairfax Station, Reston and Burke, which also have low appreciation values of less than 9 %.

- However, it must be noted that Chaplin only has 141 transactions, so the appreciation of individual transactions has a bigger effect on the mean value.

- Therefore, to decrease the effect of outliers, we also look at the median values (which give clearly different results).
- We do this because the dataset contains many outliers with regard to the appreciation.
- Regarding the median values, Bridgeport has the highest median value with 11.6 %, followed by Hartford (10.65 %) and New Britain.
- The lowest median values can be seen in Stafford with 0.51 %, Weston (1.22 %) and Redding (1.56 %).
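Because the mean and the median rankings differ so much, it can help to tabulate the mean, the median and the number of transactions per city side by side. The following is a minimal sketch with invented toy values (cities `A` and `B` are placeholders), in which one extreme transaction in a small group inflates the mean but leaves the median almost untouched.

```python
import pandas as pd

# toy data: city A has few transactions and one extreme appreciation value
toy = pd.DataFrame(
    {
        "city": ["A", "A", "A", "B", "B", "B", "B", "B"],
        "appreciation_per_year": [0.05, 0.06, 3.00, 0.04, 0.05, 0.05, 0.06, 0.05],
    }
)

# mean, median and group size in a single table, sorted by the median
summary = (
    toy.groupby("city")["appreciation_per_year"]
    .agg(["mean", "median", "count"])
    .sort_values("median", ascending=False)
)
print(summary)
```

Reading the `count` column next to the mean makes it easier to spot groups where a handful of transactions drive the result.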


In [None]:
# city
# Set display format to avoid e-notation
# caveat: a yearly appreciation may never have been realized when the appreciation time is below one year
# this can be neglected here because for most transactions the appreciation time is over one year
pd.options.display.float_format = '{:.9f}'.format
df_copy.groupby(by="city")["appreciation_per_year"].median().sort_values(ascending=False) * 100

In [None]:
# city
# Set display format to avoid e-notation
# caveat: a yearly appreciation may never have been realized when the appreciation time is below one year
# this can be neglected here because for most transactions the appreciation time is over one year
pd.options.display.float_format = '{:.9f}'.format
df_copy.groupby(by="city")["appreciation_per_year"].mean().sort_values(ascending=False) * 100
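The comments in the cells above assume that most holding periods exceed one year. A quick way to check such an assumption is to compute the share of transactions below the 360-day threshold. The values below are invented for illustration, and `short_share` is a hypothetical name.

```python
import pandas as pd

# toy holding periods in days
toy = pd.DataFrame({"appreciation_time": [30, 200, 400, 800, 1500, 3000, 90, 5000]})

# share of transactions resold within less than one (360-day) year;
# the mean of a boolean Series is the fraction of True values
short_share = (toy["appreciation_time"] < 360).mean()
print(f"{short_share:.1%} of holding periods are shorter than 360 days")
```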

In [None]:
# county
# Set display format to avoid e-notation
pd.options.display.float_format = '{:.9f}'.format
df_copy.groupby(by="county")["appreciation_per_year"].mean().sort_values(ascending=False) * 100

In [None]:
# county
# Set display format to avoid e-notation
pd.options.display.float_format = '{:.9f}'.format
df_copy.groupby(by="county")["appreciation_per_year"].median().sort_values(ascending=False) * 100

**Findings:**
- The county with the highest mean appreciation (83.37 %) is Windham, followed by New London and Tolland
- The counties with the lowest mean appreciation are Fairfax with 10.36 %, Fairfield (32.72 %) and Hartford (37.52 %)

- Regarding the median values, Windham still has the highest value with 6.66 %, followed by Fairfax (6.34 %) and New London (5.29 %)

In [None]:
pd.set_option("display.max_rows", None)
df_copy.groupby(by="city").size().sort_values(ascending=False)

##### Condition Description

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="cond_desc", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="age", x="cond_desc", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_time", x="cond_desc", ax=ax)

**Findings:**
- regarding the relationship between condition and appreciation, we cannot say that a better condition at the time of the sale leads to a higher appreciation
- in fact, looking at the mean appreciation per condition category, the poor category clearly exhibits the highest mean appreciation, followed by the fair category with the second highest mean appreciation
- the differences between the other condition categories are not that large; average plus shows the lowest mean appreciation overall
- this may be connected to the age of the house, since we have already seen that the appreciation increases up to a certain age
- checking this, we see that the poor and the fair condition categories have the highest mean age compared to the other categories, with the mean age still being under 100
- as mentioned before, the appreciation values seemed to increase up to an age of 150

- if we look at the median appreciation values, we cannot see a clear difference between the categories; they are all at around 5 % to 6 % appreciation
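To see whether the condition effect is really an age effect, the comparison can be repeated within age bands. The sketch below uses synthetic data in which the appreciation depends only on age while poor houses tend to be older; the band edges, the column name `age_band` and all numbers are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 4_000

# synthetic data: appreciation depends on age only; older houses are more often "Poor"
age = rng.uniform(0, 120, size=n)
cond = np.where(rng.uniform(size=n) < age / 150, "Poor", "Average")
appreciation = 0.03 + 0.0005 * age + rng.normal(0, 0.005, size=n)
toy = pd.DataFrame({"age": age, "cond_desc": cond, "appreciation_per_year": appreciation})

# unconditional comparison: "Poor" looks better because it is older on average
overall = toy.groupby("cond_desc")["appreciation_per_year"].mean()

# comparison within age bands: the gap between the conditions shrinks
toy["age_band"] = pd.cut(toy["age"], bins=[0, 30, 60, 90, 120], include_lowest=True)
by_band = (
    toy.groupby(["age_band", "cond_desc"], observed=True)["appreciation_per_year"]
    .mean()
    .unstack("cond_desc")
)
print(overall)
print(by_band)
```

If the gap between the condition categories shrinks within each age band, as it does in this toy example by construction, age is a plausible driver of the unconditional difference.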

In [None]:
px.box(df_copy, y="appreciation_per_year", x="cond_desc")

In [None]:
df_copy.groupby("cond_desc")["appreciation_per_year"].median().sort_values()

In [None]:
df_copy.groupby("cond_desc").size().sort_values()

##### Sale Year

In [None]:
# date
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="appreciation_per_year", x="saledate", ax=ax)

In [None]:
px.bar(df_copy.groupby("year")["appreciation_per_year"].mean())

In [None]:
px.bar(df_copy.groupby("year")["appreciation_per_year"].median())

**Findings:**
- regarding the development of the appreciation over the years, we do not see a clear trend; there are many fluctuations
- we can see peaks in 1980, 1986, 1994 and 2005; a similar picture is seen if we look at the median values
- overall, we can see that since 2009 the appreciation has been at a low level compared to the years before
- however, the appreciation seems to have been increasing since 2015

##### Sale Month

In [None]:
# month
fig, ax = plt.subplots(figsize=(20,10))
sns.barplot(df_copy, y="appreciation_per_year", x="month", ax=ax)

**Findings:**
- Overall we can see higher appreciation in the winter months (especially from January to April) and the lowest appreciation for sales during the summer months, especially June and July

In [None]:
# month
fig, ax = plt.subplots(figsize=(20,10))
sns.barplot(df_copy, y="appreciation_per_year", x="month", ax=ax, estimator=np.median)

In [None]:
df_copy.groupby(by="month")["appreciation_per_year"].mean().sort_values(ascending=False)

In [None]:
df_copy.groupby(by="month")["appreciation_per_year"].median().sort_values(ascending=False)

##### Bedrooms

In [None]:
# bedrooms
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy[df_copy.nbed < 10], y="appreciation_per_year", x="nbed", ax=ax)

In [None]:
# bedrooms
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy[df_copy.nbed < 10], y="appreciation_per_year", x="nbed", ax=ax, estimator=np.median)

**Findings:**
- there are some extreme cases regarding the number of bedrooms
- looking at the mean appreciation values, we cannot see a clear pattern (high appreciation for zero and one bedroom, lower appreciation for 2 to 5 bedrooms, then an increase again)
- looking at the median values, however, a higher number of bedrooms seems to go along with a higher median appreciation

##### Bathrooms and Half-Bathrooms

In [None]:
# bathrooms
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy, y="appreciation_per_year", x="nbath", ax=ax)

In [None]:
# bathrooms
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy, y="appreciation_per_year", x="nbath", ax=ax, estimator=np.median)

In [None]:
px.box(df_copy, y="nbath")

In [None]:
# number of bathrooms (upper fence 4)
px.histogram(df_copy, "nbath")

In [None]:
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy[df_copy.nbath < 7], y="appreciation_per_year", x="nbath", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(40,20))
sns.barplot(df_copy[df_copy.nhalfbath < 4], y="appreciation_per_year", x="nhalfbath", ax=ax)

**Finding:**
- no clear pattern for either the number of bathrooms or the number of half bathrooms

##### distance_aerodrome

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="appreciation_per_year", x="distance_aerodrome", ax=ax)

In [None]:
df_copy["bins_aerodrome"] = pd.cut(df_copy.distance_aerodrome, bins=40)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_aerodrome", ax=ax, estimator=np.median)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_aerodrome", ax=ax)

**Findings:**
- No clear pattern for the distance to the airport; the mean and median values also give us very different pictures

##### distance_ferry_terminal

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="appreciation_per_year", x="distance_ferry_terminal", ax=ax)

In [None]:
df_copy["bins_ferry"] = pd.cut(df_copy.distance_ferry_terminal, bins=30)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_ferry", ax=ax, estimator=np.median)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_ferry", ax=ax)

In [None]:
df_copy.groupby("bins_ferry")["appreciation_per_year"].mean()

**Findings:**
- no clear and simple relationship between appreciation and distance to the ferry terminal
- decrease until 10 km, then increase until 30 to 38 km, then decrease until 46 km, then increase again (overall no clear relationship)

##### distance_hospital

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="appreciation_per_year", x="distance_hospital", ax=ax)

In [None]:
df_copy["bins_hospital"] = pd.cut(df_copy.distance_hospital, bins=30)

In [None]:
# fig, ax = plt.subplots(figsize=(30,10))
#px.box(df_copy, y="appreciation_per_year", x="bins_hospital")

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_hospital", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_hospital", ax=ax, estimator=np.median)

**Findings:**
- For the distance to the hospital, the mean values suggest a quadratic relationship in which the appreciation first decreases with the distance and then increases again
- if we look at the median values, we overall see a decrease in appreciation with increasing distance

##### distance_hotel

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="appreciation_per_year", x="distance_hotel", ax=ax)

In [None]:
df_copy["bins_hotel"] = pd.cut(df_copy.distance_hotel, bins=30)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="appreciation_per_year", x="bins_hotel", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_hotel", ax=ax)

**Findings:**
- regarding the distance to the closest hotel, the mean values of the appreciation suggest that the appreciation increases with increasing distance to the hotel
- looking at the median values, we cannot see a clear-cut pattern, since the medians for the different distances are very close together and do not differ much

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_hotel", ax=ax, estimator=np.median)

In [None]:
df_copy.groupby(by="bins_hotel")["price"].mean()

In [None]:
df_copy.groupby(by="bins_hotel")["price"].median()

##### distance_market

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="appreciation_per_year", x="distance_market", ax=ax, alpha=.3)

In [None]:
df_copy["bins_market"] = pd.cut(df_copy.distance_market, bins=30)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="appreciation_per_year", x="bins_market", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_market", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_market", ax=ax, estimator=np.median)

**Findings:**
- no clear pattern; the median and mean values give different results

##### distance_museum

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="appreciation_per_year", x="distance_museum", ax=ax, alpha=.3)

In [None]:
df_copy["bins_museum"] = pd.cut(df_copy.distance_museum, bins=20)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="appreciation_per_year", x="bins_museum", ax=ax)

In [None]:
df_copy.groupby(by="bins_museum")["appreciation_per_year"].median()

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_museum", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_museum", ax=ax, estimator=np.median)

**Findings:**
- no clear relationship; the distance to museums does not seem to affect the appreciation

##### distance_railway_station

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="appreciation_per_year", x="distance_railway_station", ax=ax, alpha=.3)

In [None]:
df_copy["bins_railway"] = pd.cut(df_copy.distance_railway_station, bins=20)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="appreciation_per_year", x="bins_railway", ax=ax)

In [None]:
mean = np.mean(df_copy.appreciation_per_year)
upper = mean + 2* np.std(df_copy.appreciation_per_year)
lower = mean - 2* np.std(df_copy.appreciation_per_year)

In [None]:
pd.set_option("display.max_rows", 20)
df_test = df_copy.copy()
df_test = df_test[(df_test.appreciation_per_year < upper) & (df_test.appreciation_per_year > lower)]

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_test, y="appreciation_per_year", x="bins_railway", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_railway", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_test, y="appreciation_per_year", x="bins_railway", ax=ax, estimator=np.median)

In [None]:
df_copy.groupby(by="bins_railway")["appreciation_per_year"].mean()

In [None]:
df_copy.groupby(by="bins_railway")["appreciation_per_year"].median()

**Findings:**
- no clear pattern can be identified, even when we filter out outliers using the mean and standard deviation
- the median and the mean still result in different patterns, so the nature of the relationship remains unclear
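The outlier filter used in this subsection keeps observations within two standard deviations of the mean. As a minimal sketch, it can be written as a reusable function; the name `filter_within_k_std` and the toy values are assumptions for illustration, and the population standard deviation (`ddof=0`) is used because that is what `np.std` computes in the cell above.

```python
import pandas as pd


def filter_within_k_std(df: pd.DataFrame, column: str, k: float = 2.0) -> pd.DataFrame:
    """Keep rows whose value lies strictly within mean +/- k population standard deviations."""
    values = df[column]
    mean = values.mean()
    std = values.std(ddof=0)  # population standard deviation, matching np.std
    return df[(values > mean - k * std) & (values < mean + k * std)]


# toy data with one extreme value
toy = pd.DataFrame(
    {"appreciation_per_year": [0.04, 0.05, 0.05, 0.06, 0.05, 0.04, 0.06, 0.05, 0.05, 5.0]}
)
filtered = filter_within_k_std(toy, "appreciation_per_year")
print(len(toy), "->", len(filtered))
print(toy["appreciation_per_year"].mean(), filtered["appreciation_per_year"].mean())
```

A caveat of this rule is that the mean and the standard deviation are themselves inflated by the outliers they are meant to detect, so only the most extreme values are removed.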

##### n_accommodation

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="n_accommodation", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="n_accommodation", ax=ax, estimator=np.median)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(df_copy, y="appreciation_per_year", x="n_accommodation", ax=ax)

In [None]:
px.bar(df_copy.groupby(by="n_accommodation")["appreciation_per_year"].median())

In [None]:
df_copy["bins_acc"] = pd.cut(df_copy.n_accommodation, bins=10)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_acc", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_acc", ax=ax, estimator=np.median)

**Findings:**
- No clear pattern; median and mean give different results
- mean: an increasing number of accommodations goes along with decreasing appreciation
- median: the other way around

##### n_food_drink

In [None]:
df_copy["bins_food"] = pd.cut(df_copy.n_food_drink, bins=10)
df_test["bins_food"] = pd.cut(df_test.n_food_drink, bins=10)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="appreciation_per_year", x="bins_food", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_test, y="appreciation_per_year", x="bins_food", ax=ax)

In [None]:
df_copy["bins_food"] = pd.cut(df_copy.n_food_drink, bins=10, labels=range(10))
px.bar(df_copy.groupby(by="bins_food")["appreciation_per_year"].median())

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="appreciation_per_year", x="n_food_drink", ax=ax, alpha=.3)

##### n_adults_entertain

In [None]:
df_copy.n_adults_entertain.unique()

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="n_adults_entertain", ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.scatterplot(df_copy, y="price", x="n_adults_entertain", ax=ax, alpha=.4)

In [None]:
df_copy["bins_adults"] = pd.cut(df_copy.n_adults_entertain, bins=20)

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.barplot(df_copy, y="price", x="bins_adults", ax=ax)

In [None]:
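# bin n_adults_entertain (the original cell binned n_food_drink here, presumably a copy-paste slip)
df_copy["bins_adults"] = pd.cut(df_copy.n_adults_entertain, bins=20, labels=range(20))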
px.bar(df_copy.groupby(by="bins_adults")["price"].median())

##### frequency_features

In [None]:
df_copy = df.copy()

In [None]:
columns= [
'n_animalcare',
 'n_commu_serv',
 'n_commu_venu',
 'n_edu_fac',
 'n_emergency',
 'n_entertainment',
 'n_financial',
 'n_food_drink',
 'n_government_civic',
 'n_healthcare',
 'n_recreational',
 'n_reli_inst',
 'n_shopping',
 'n_sports',
 'n_transport',
 'n_utilities',
]

In [None]:
for col in columns:
    print(col)
    df_copy[f"bins_{col}"] = pd.cut(df_copy[col], bins=20)
    fig, ax = plt.subplots(figsize=(30,10))
    sns.barplot(df_copy, y="price", x=f"bins_{col}", ax=ax) 
    plt.show()

In [None]:
for col in columns:
    print(col)
    df_copy[f"bins_{col}"] = pd.cut(df_copy[col], bins=20)
    fig, ax = plt.subplots(figsize=(30,10))
    sns.barplot(df_copy, y="price", x=f"bins_{col}", ax=ax, estimator=np.median) 
    plt.show()
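The two loops above differ only in the estimator. As a sketch, assuming a hypothetical helper `binned_summary`, the mean, the median and the number of observations per bin can be tabulated together, which also shows how thinly populated the outer bins are. The data below is randomly generated and only stands in for the real features.

```python
import numpy as np
import pandas as pd


def binned_summary(df: pd.DataFrame, feature: str, target: str, bins: int = 20) -> pd.DataFrame:
    """Mean, median and count of target per equal-width bin of feature."""
    binned = pd.cut(df[feature], bins=bins)
    # observed=False keeps empty bins in the table (their mean and median are NaN)
    return df.groupby(binned, observed=False)[target].agg(["mean", "median", "count"])


rng = np.random.default_rng(2)
toy = pd.DataFrame(
    {
        "n_shopping": rng.poisson(5, size=1_000),
        "price": rng.lognormal(mean=12, sigma=0.5, size=1_000),
    }
)
summary = binned_summary(toy, "n_shopping", "price", bins=5)
print(summary)
```

Calling the helper inside the `for col in columns:` loop would give one such table per frequency feature alongside the bar plots.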

### Results

In most frequency features, no clear-cut pattern can be seen.

n_utilities:
- a higher frequency goes along with a lower price

n_shopping:
- an increasing frequency goes along with a higher price

n_reli_inst, n_entertainment, n_healthcare:
- the price first decreases with increasing frequency and then increases again, which indicates a quadratic relationship

n_financial, n_transport:
- here it is the other way around: the price first increases and then suddenly decreases


n_adults_entertain:
- no clear-cut pattern; the price first increases with increasing frequency and then decreases again

n_food_drink:
- up to a frequency of 260, there is a continuous increase in both the mean and the median price

n_accommodation:
- no clear-cut pattern can be seen


distance to railway station:
- the price decreases with increasing distance to the railway station, but at some point the prices increase again. However, there are significantly fewer houses at large distances from the railway station, so each of them has a larger effect on the mean price. We can see an increase again at a distance of around 62 km.


distance to museum:
- with increasing distance, we generally see a decrease in both the mean and the median price

distance to market (possibly delete):
- we cannot see a clear pattern


distance to hotel:
- with higher distance the price increases slightly
- but at large distances to hotels the mean price is quite small and decreases; starting from a distance of 15 km, the price decreases nearly continuously
- the same is the case for the median price



distance to hospital:
- the median price decreases with higher distance to the hospital: first a small increase, but then a clear decrease
- the same is the case for the mean price


distance to ferry terminal:
- the median price does not really differ for different distances to a ferry terminal
- the mean price varies, but without a clear linear pattern. This may be due to the effect of outliers on the mean: the number of outliers varies across distances in a way similar to the mean price.

distance to aerodrome:
- we can see a slight increase of the price with increasing distance to the airport
- beyond a certain distance to the airport, the median and the mean price decrease again
- this may be explained by airports tending to be located outside of cities
- therefore, being near an airport could indicate being far away from the city center

Number of half bathrooms:
- upper fence already at 2
- when we disregard the outliers or extreme values in the dataset, we generally see the same effect as for the number of bathrooms
- an increasing number of half bathrooms goes along with an increasing price
- but we generally do not have many houses with many half bathrooms


Number of bathrooms:
- varies greatly (most values between 0 and 4, the upper fence)
- some extreme values (seen in the descriptive analysis)
- we do not consider the outlier values (larger than 6), because we do not want to get a biased picture through extreme cases
- generally, the price increases with an increasing number of bathrooms
- interesting point: some buildings have no bathrooms


Number of bedrooms:
- the sales price increases with an increasing number of bedrooms up to 5 bedrooms
- after that we do not see this effect as clearly, which is also because the upper fence of the boxplot is at 5; we have only very few houses with more than 5 bedrooms

Sale Month:
- we see a slight increase in the mean sales price for the summer months (June, July, August)
- in the winter months it is lower (January, November, October)

Sale Year:
- the mean sales price per year increases until 1990 (the first decrease), followed by a slight decrease in the sales price
- very significant decrease in the mean sales price in 2008 and 2009, then again a slight increase
- then again a slight decrease in 2015 and 2016


Condition:
- the number of houses differs greatly between the conditions; especially for poor condition houses we have a very small number of houses (only 1537 vs 610260 for Average). Therefore, the mean prices per category would give a much different picture than the medians, since individual transactions have a larger effect in the poor category.
- for the median price per category, we get the expected picture: the median increases from poor -> fair -> average -> average plus -> good.


City:
- highest mean prices in the cities Greenwich, Westport and New Canaan
- very high mean price for Greenwich with 2,072,040.71
- lowest mean prices in the cities Plainville, Norwich and East Hartford
- very low mean price in Plainville with 133,000 dollars



Age:
- we observe more houses with higher prices at low ages, e.g. 0 to 20 years; however, it needs to be considered that we have more younger houses in our dataset (with most having age 0)
- when we examine the average price for different age ranges, we can see that the average price first decreases with increasing age and then goes up again. However, we have to keep in mind that older houses are less frequent in our dataset, which results in individual transactions having a higher effect on the mean for these age ranges

Effective age:
- for the effective age we see similar effects
- if we take the average of the different quantiles, we generally see that the price decreases with increasing age up to a point where it slightly increases again