| Research Question   | Hypotheses           | Indicators            | 
|--|----------------------|-----------------------|
| 1. **Prices**: do house prices vary systematically by location-related characteristics? | 1.1 If a house has a higher-quality view, then its sale price tends to be higher than that of houses with a lower-quality view or no view.<br>1.2 If a house has waterfront access, then its sale price tends to be higher than that of non-waterfront houses.<br>1.3 If houses are located in different zip codes, then their price distributions differ.<br>1.4 If a house is located closer to central latitude/longitude clusters, then its price tends to be higher. | 1.1 price, view<br>1.2 price, waterfront (binary)<br>1.3 price, zipcode<br>1.4 price, lat, long |
| 2. **Housing condition**: are observable housing characteristics associated with better housing condition? | 2.1 The more recently a house was built, the more likely it is to be in a higher condition category.<br>2.2 The higher the grade of a house, the higher its reported condition tends to be.<br>2.3 If a house has been renovated more recently (excluding houses with no recorded renovation), then its condition tends to be higher. | 2.1 condition, yr_built<br> 2.2 condition, grade<br>2.3 condition, yr_renovated|
| 3. **Neighborhood**: Can distinct neighborhood types be identified using housing and sales characteristics? | 3.1 If houses are located in different zip codes, then their average living space and lot size differ.<br>3.2 Some areas exhibit a higher frequency of house sales than others.<br>3.3 The larger a house is, the larger the surrounding houses and lots in its neighborhood tend to be.| 3.1 zipcode, sqft_living, sqft_lot<br>3.2 zipcode, count of sales<br>3.3 sqft_living, sqft_living15, sqft_lot, sqft_lot15 | 

Charles Christensen (Seller: invests in houses that promise big returns)

Overall question: Which factors influence an investor's returns, and which kinds of projects promise high returns?

| Research Question   | Hypotheses           | Indicators            | 
|--|----------------------|-----------------------|
| 1. **Market Dynamics**: How have housing prices evolved over time, and how volatile is the market? | 1.1 The later a house is sold, the higher its sale price tends to be.<br>1.2 Price dispersion increases over time, indicating rising investment risk. | 1.1 date, price<br>1.2 price, date|
| 2. **Timing**: Does timing improve expected returns? | 2.1 Houses sold in certain months tend to achieve higher prices.<br>2.2 Periods with higher transaction volume are associated with different price levels. | 2.1 date -> month, price<br> 2.2 date, price, sales_count (derived)|
| 3. **Neighborhood opportunity**: Where are above-average returns likely? | 3.1 Price levels and variability differ significantly across zip codes.<br>3.2 Some neighborhoods have relatively low prices despite strong housing characteristics.<br> 3.3 Some neighborhoods show consistently higher sales activity than others. | 3.1 zipcode, price<br>3.2 price, zipcode, sqft_living, sqft_living15, grade<br>3.3 zipcode, sales_count (derived) | 
| 4. **Property characteristics**: What types of houses offer upside? | 4.1 Price per square foot decreases as living area increases.<br>4.2 Houses in poorer condition sell at a discount relative to comparable houses.<br> 4.3 Higher-grade houses command a price premium even after controlling for size. | 4.1 price, sqft_living, price_per_sqft (derived)<br>4.2 price, condition, sqft_living, zipcode<br>4.3 price, grade, sqft_living | 
| 5. **Renovation**: Does renovation pay off? | 5.1 Renovated houses tend to sell at higher prices than non-renovated houses.<br>5.2 Houses in poor condition show larger relative price gains after renovation.<br> 5.3 The price premium from renovation varies by neighborhood. | 5.1 price, yr_renovated (binary) <br>5.2 condition, yr_renovated, price<br>5.3 price, yr_renovated, zipcode |
| 6. **High-return Properties**: What characterizes high-return projects? | 6.1 Houses with high relative prices share common patterns in location, condition, and quality. | 6.1 price (top decile), zipcode, grade, condition, sqft_living, waterfront, view |
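
Several indicators in the table above are marked as derived (`month`, `sales_count`, `price_per_sqft`, the binary renovation flag, and the top price decile). The following is a minimal sketch of how they could be computed with pandas. It assumes that `date` can be parsed by `pd.to_datetime` and that `yr_renovated` is 0 when no renovation is recorded. The helper names `add_investor_indicators` and `sales_count_by`, the column names `renovated` and `top_decile`, and the four-row sample are hypothetical and serve only as illustration.

```python
import pandas as pd


def add_investor_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the derived indicators from the table."""
    out = df.copy()
    # Assumption: `date` is in a format pd.to_datetime can parse
    out["date"] = pd.to_datetime(out["date"])
    # 2.1: month of sale
    out["month"] = out["date"].dt.month
    # 4.1: price per square foot of living area
    out["price_per_sqft"] = out["price"] / out["sqft_living"]
    # 5.1: binary renovation flag (assumes 0 means "no recorded renovation")
    out["renovated"] = (out["yr_renovated"] > 0).astype(int)
    # 6.1: flag houses at or above the 90th percentile of price
    out["top_decile"] = out["price"] >= out["price"].quantile(0.9)
    return out


def sales_count_by(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Count sales per group, e.g. per month (2.2) or per zipcode (3.3)."""
    return df.groupby(key).size().rename("sales_count").reset_index()


# Made-up sample rows, only to show the shape of the result
sample = pd.DataFrame({
    "date": ["2014-05-02", "2014-05-20", "2014-06-11", "2015-01-07"],
    "price": [300000.0, 450000.0, 600000.0, 900000.0],
    "sqft_living": [1000, 1500, 2000, 2500],
    "yr_renovated": [0, 1995, 0, 2010],
    "zipcode": [98001, 98001, 98002, 98002],
})

enriched = add_investor_indicators(sample)
monthly = sales_count_by(enriched, "month")
per_zip = sales_count_by(enriched, "zipcode")
```

Applied to the real data, the same call would be `add_investor_indicators(df_housing)`. Working on a copy leaves `df_housing` unchanged, so the plotting cell below is not affected.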


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Optional: log-transform price to reduce skewness
df_housing['log_price'] = np.log1p(df_housing['price'])  # log1p handles zero values safely

# =========================
# RQ1: Prices by location-related characteristics
# =========================

# 1.1 Price vs. view (ordinal categorical)
plt.figure(figsize=(8,5))
sns.boxplot(x='view', y='log_price', data=df_housing)
plt.title('Log Price by View Quality')
plt.xlabel('View')
plt.ylabel('Log(Price)')
plt.show()

# 1.2 Price vs. waterfront (binary)
plt.figure(figsize=(6,5))
sns.boxplot(x='waterfront', y='log_price', data=df_housing)
plt.title('Log Price by Waterfront Access')
plt.xlabel('Waterfront')
plt.ylabel('Log(Price)')
plt.show()

# 1.3 Price vs. zipcode (categorical)
# Limit to top 10 most frequent zipcodes for clarity
top_zipcodes = df_housing['zipcode'].value_counts().nlargest(10).index
plt.figure(figsize=(10,5))
sns.boxplot(x='zipcode', y='log_price', data=df_housing[df_housing['zipcode'].isin(top_zipcodes)])
plt.title('Log Price by Top 10 Zipcodes')
plt.xlabel('Zipcode')
plt.ylabel('Log(Price)')
plt.xticks(rotation=45)
plt.show()

# 1.4 Price vs. latitude/longitude
plt.figure(figsize=(8,6))
sns.scatterplot(x='long', y='lat', hue='log_price', size='log_price', alpha=0.5, data=df_housing)
plt.title('Geographic Distribution of Log Price')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(bbox_to_anchor=(1.05,1), loc='upper left')
plt.show()

# =========================
# RQ2: Housing condition vs characteristics
# =========================

# 2.1 Condition vs. yr_built
plt.figure(figsize=(8,5))
sns.scatterplot(x='yr_built', y='condition', alpha=0.3, data=df_housing)
plt.title('Condition vs Year Built')
plt.xlabel('Year Built')
plt.ylabel('Condition')
plt.show()

# 2.2 Condition vs. grade (ordinal)
plt.figure(figsize=(8,5))
sns.boxplot(x='grade', y='condition', data=df_housing)
plt.title('Condition by Grade')
plt.xlabel('Grade')
plt.ylabel('Condition')
plt.show()

# 2.3 Condition vs. yr_renovated (exclude 0 values)
plt.figure(figsize=(8,5))
sns.scatterplot(x='yr_renovated', y='condition', alpha=0.3, data=df_housing[df_housing['yr_renovated']>0])
plt.title('Condition vs Year Renovated')
plt.xlabel('Year Renovated')
plt.ylabel('Condition')
plt.show()

# =========================
# RQ3: Neighborhood characteristics
# =========================

# 3.1 Average living space & lot size per zipcode
zipcode_summary = df_housing.groupby('zipcode')[['sqft_living','sqft_lot']].mean().reset_index()
plt.figure(figsize=(10,5))
sns.barplot(x='zipcode', y='sqft_living', data=zipcode_summary)
plt.title('Average Living Space by Zipcode')
plt.xlabel('Zipcode')
plt.ylabel('Average Sqft Living')
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(10,5))
sns.barplot(x='zipcode', y='sqft_lot', data=zipcode_summary)
plt.title('Average Lot Size by Zipcode')
plt.xlabel('Zipcode')
plt.ylabel('Average Sqft Lot')
plt.xticks(rotation=45)
plt.show()

# 3.2 Number of sales per zipcode
plt.figure(figsize=(10,5))
sns.countplot(x='zipcode', data=df_housing, order=df_housing['zipcode'].value_counts().index)
plt.title('Number of Sales by Zipcode')
plt.xlabel('Zipcode')
plt.ylabel('Count of Sales')
plt.xticks(rotation=45)
plt.show()

# 3.3 House size vs. neighbors
plt.figure(figsize=(8,5))
sns.scatterplot(x='sqft_living', y='sqft_living15', alpha=0.5, data=df_housing)
plt.title('House Size vs Average Neighbor Size')
plt.xlabel('Sqft Living')
plt.ylabel('Sqft Living of 15 Nearest Neighbors')
plt.show()

plt.figure(figsize=(8,5))
sns.scatterplot(x='sqft_lot', y='sqft_lot15', alpha=0.5, data=df_housing)
plt.title('Lot Size vs Average Neighbor Lot Size')
plt.xlabel('Sqft Lot')
plt.ylabel('Sqft Lot of 15 Nearest Neighbors')
plt.show()