# 01 Data Analysis (Descriptive Statistics)

This notebook examines the descriptive statistics of our features. We use boxplots and histograms to inspect how the values of individual features are distributed.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import math
import numpy as np
from shapely import wkt

In [None]:
df = pd.read_csv("../data/cleaned/df_final.csv", index_col=0)

In [None]:
df_back = df.copy()

### Preprocessing

In [None]:
# drop columns
columns_to_drop = [
    "state_y",
    "geometry_y",
    "lag_month",
    "lag_year_y",
    "date_y",
    "lag_year_x",
]
df = df.drop(columns=columns_to_drop)

In [None]:
df["age"] = (df.year - df.yrblt)
df["eff_age"] = (df.year - df.effyrblt)

In [None]:
# drop where age is negative
df = df[df.age >= 0]

In [None]:
# set effyrblt and eff_age to NaN for the case where effyrblt > saledate (data leakage)
df.loc[df.eff_age < 0, "effyrblt"] = np.nan
df.loc[df.eff_age < 0, "eff_age"] = np.nan
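
The masking step above can be checked on a toy frame. This is a minimal sketch with made-up values; only the column names `year`, `effyrblt`, and `eff_age` come from the notebook. It computes the mask once, so both columns are cleared by the same condition and the result does not depend on the order of the two assignments.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the second row has an effective year built after the sale year.
toy = pd.DataFrame({"year": [2010, 2005, 2018], "effyrblt": [1995.0, 2008.0, 2018.0]})
toy["eff_age"] = toy["year"] - toy["effyrblt"]

# Compute the mask once, then clear both columns in a single assignment.
leaky = toy["eff_age"] < 0
toy.loc[leaky, ["effyrblt", "eff_age"]] = np.nan

print(toy)
```

Rows with an effective age of exactly zero are kept, matching the `< 0` condition used in the notebook.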

In [None]:
# drop transactions with yrblt of zero
df = df[df.yrblt != 0]

In [None]:
# drop transactions with effyrblt of zero
df = df[df.effyrblt != 0]

In [None]:
# create features longitude, latitude
df["geometry_x"] = df.geometry_x.apply(wkt.loads)
df["geometry_x"] = df.geometry_x.apply(lambda x: x.centroid)

df["longitude"] = df.geometry_x.apply(lambda x: x.x)
df["latitude"] = df.geometry_x.apply(lambda x: x.y)

### Data Analysis

#### Descriptive

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
pd.options.display.float_format = '{:.2f}'.format
# summary statistics of the numeric columns, one row per feature
summary = df.describe().transpose()

# drop the 'count' column
summary = summary.drop(columns='count')


In [None]:
summary

- nbed, nbath, nhalfbath have very high max values
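
To see which rows drive those maxima, one option is to list the largest values per column. The following is a minimal sketch on made-up data; only the column names `nbed` and `nbath` are taken from the notebook, and the helper name `top_rows` is hypothetical.

```python
import pandas as pd


def top_rows(frame: pd.DataFrame, column: str, n: int = 3) -> pd.DataFrame:
    """Return the n rows with the largest values in `column` (hypothetical helper)."""
    return frame.nlargest(n, column)


# Toy data (made up) with one implausibly large bedroom count.
toy = pd.DataFrame({"nbed": [3, 4, 124, 2, 3], "nbath": [2, 2, 5, 1, 2]})
largest = top_rows(toy, "nbed", n=2)
print(largest)
```

On the real data, something like `top_rows(df, "nbed")` would show the full rows, which makes it easier to judge whether an extreme value is a data error or a non-residential building.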

In [None]:
pd.options.display.float_format = '{:.2f}'.format
# summary statistics of the object (string) columns, one row per feature
summary = df.describe(include="O").transpose()

# drop the 'count' column
summary = summary.drop(columns='count')


In [None]:
summary.rename(index={"state_x":"state"}).drop(index=["addr", "id", "geometry_x", "date_x"])

#### Plots

In [None]:
px.histogram(df.nbed)

In [None]:
px.histogram(df.nbath)

In [None]:
px.box(df.age)

In [None]:
px.histogram(df.eff_age)

In [None]:
px.box(df.eff_age)

In [None]:
px.histogram(df.yrblt)

In [None]:
px.box(df.yrblt)

In [None]:
px.box(df.effyrblt)

In [None]:
px.histogram(df.effyrblt)

In [None]:
px.histogram(df.city)

In [None]:
df.groupby(by="city").size().sort_values()

In [None]:
df.groupby(by="county").size().sort_values()

In [None]:
df.groupby(by="cond_desc").size().sort_values() / df.shape[0]

In [None]:
px.histogram(df.cond_desc)

In [None]:
px.histogram(df.distance_aerodrome)

In [None]:
px.box(df.distance_aerodrome)

In [None]:
px.box(df.distance_ferry_terminal)

In [None]:
px.box(df.distance_hospital)

In [None]:
px.box(df.livarea)

In [None]:
px.box(df.efflivarea)

#### Insights:
Livarea:
- most houses have a living area of around 1300 to 2500 sq ft
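
Ranges like this one read like the box of a boxplot, i.e. the span from the first to the third quartile. That reading is an assumption, since the notebook does not state how the ranges were derived. Under that assumption, the numbers can be computed directly instead of being read off the plot. The values below are made up and the helper name `middle_half` is hypothetical.

```python
import pandas as pd


def middle_half(values: pd.Series) -> tuple[float, float]:
    """Return (Q1, Q3), the range covering the middle 50 % of the values."""
    q1, q3 = values.quantile([0.25, 0.75])
    return float(q1), float(q3)


# Toy living areas in square feet (made up).
toy_livarea = pd.Series([900, 1300, 1600, 2000, 2500, 4000, 1800, 2200, 1500])
low, high = middle_half(toy_livarea)
print(f"middle 50 % of houses: {low:.0f} to {high:.0f} sq ft")
```

The same helper could be applied to any numeric column, for example `middle_half(df.livarea)`. Plotly may compute quartiles slightly differently from pandas, so small deviations from the plotted box are possible.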



Distance ferry terminal:
- most houses are around 17 to 40 km away from the nearest ferry terminal

Distance Aerodrome:
- most houses are around 6 to 13 km away from the nearest aerodrome

Condition Description:
- around 60 % of houses are in average condition and around 26 % are in good condition

City:
- most transactions are in the cities Alexandria, Fairfax and Springfield
- the fewest transactions are in Plainville, Darien and Arlington

County:
- most transactions are in Fairfax, Hartford and New Haven
- the fewest transactions are in Windham, Middlesex and Tolland

Bedrooms:
- most houses have three bedrooms
- there are some outliers; the building with the most bedrooms has 124
- the particular house with 124 bedrooms is a hotel
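
Outliers such as this one can also be flagged numerically. The sketch below uses the common 1.5 × IQR rule, which is an assumption about what counts as an outlier here and not something the notebook specifies. The data is made up and the helper name `iqr_outliers` is hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy bedroom counts (made up) with one extreme value.
toy_nbed = pd.Series([2, 3, 3, 3, 4, 4, 5, 124])
mask = iqr_outliers(toy_nbed)
print(toy_nbed[mask])
```

Flagged rows are candidates for manual inspection, not automatic removal: as the hotel example shows, an extreme value can be a correct record of a non-residential building.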

Bathrooms:
- most houses have two bathrooms
- the building with 239 bathrooms is a real estate agency

Age:
- the most common age at sale is 0 to 1 year, i.e. many houses were sold directly after they were built
- most houses are between 0 (14) and 116 years old at time of sale

Effective Age:
- most houses are effectively 18 to 27 years old at time of sale

Yrblt: 
- the most common year built is 1986, and most houses were built between 1956 and 1988

Effyrblt:
- most houses were refurbished between around 1987 and 2001
- the most common refurbishment year is 1992
