
# House Sales in King County, USA — Final Project (Step-by-step Solution Guide)

This notebook is a clean, **step-by-step** template to complete your Coursera/Skills Network final project.
Run each cell in order and **take screenshots** of the output as required by the rubric.

> **Tip:** If a cell shows a plot, leave the plot visible on screen and take a screenshot that includes both the **code and the output**.


## 0) Setup — Import libraries

In [None]:

# Run this cell first
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

# For cleaner plots
plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["figure.dpi"] = 120

print("Libraries imported.")



## 1) Load the dataset

- Put the **King County CSV file** in the same folder as this notebook (typical names: `kc_house_data.csv` or `kc_house_data_NaN.csv`).  
- If your file name is different, set it in the `csv_candidates` list below.


In [None]:

# Try common filenames; edit this list if your filename is different.
csv_candidates = [
    "kc_house_data.csv",
    "kc_house_data_NaN.csv",
    "kc_house_data (1).csv",
    "kc_house_data_kc_house_data.csv"
]

df = None
for name in csv_candidates:
    try:
        df = pd.read_csv(name)
        print(f"Loaded: {name}")
        break
    except FileNotFoundError:
        # Try the next candidate; other errors (e.g. a malformed CSV) should surface.
        continue

if df is None:
    raise FileNotFoundError("CSV not found. Place the King County CSV in this folder and update csv_candidates.")

print("Rows, Columns:", df.shape)
df.head()


## Q1) Display the data types of each column (`dtypes`)

In [None]:

# Q1
df.dtypes



## Q2) Drop `id` and `Unnamed: 0` (axis=1, `inplace=True`), then `describe()`

> **Important:** If a column is missing in your CSV, dropping it will raise a KeyError.  
> We guard against that using `errors='ignore'`.


In [None]:

# Q2
df.drop(["id", "Unnamed: 0"], axis=1, inplace=True, errors='ignore')
desc = df.describe(include='all')
desc
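
To see what `errors='ignore'` changes, here is a minimal sketch on a toy DataFrame (hypothetical values, not the King County data). The names `toy`, `raised`, and `cleaned` are illustrative only.

```python
import pandas as pd

# Toy frame (hypothetical data): it has "id" but no "Unnamed: 0" column.
toy = pd.DataFrame({"id": [1, 2], "price": [221900.0, 538000.0]})

# Without errors="ignore", asking to drop a missing label raises KeyError.
try:
    toy.drop(["id", "Unnamed: 0"], axis=1)
    raised = False
except KeyError:
    raised = True

# With errors="ignore", only the labels that actually exist are dropped.
cleaned = toy.drop(["id", "Unnamed: 0"], axis=1, errors="ignore")

print("KeyError without errors='ignore':", raised)
print("Remaining columns:", list(cleaned.columns))
```

The trade-off is that a misspelled column name is also ignored silently, so it is worth checking `df.columns` after the drop.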


## Q3) Count houses by unique `floors` and convert to DataFrame

In [None]:

# Q3
floors_count_df = df['floors'].value_counts().to_frame(name='count').sort_index()
floors_count_df



## Q4) Boxplot — price vs waterfront

**Goal:** Determine whether houses **with** a waterfront view have more price outliers than those **without**.

- `waterfront` is typically binary (0 = no, 1 = yes).
- Include both the code and the displayed boxplot in your screenshot.


In [None]:

# Q4
sns.boxplot(data=df, x='waterfront', y='price')
plt.title("Price distribution by Waterfront (0 = No, 1 = Yes)")
plt.xlabel("Waterfront")
plt.ylabel("Price")
plt.show()
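
If you want a number to back up what the boxplot shows, the sketch below counts points beyond the whiskers using the 1.5 × IQR rule, which is seaborn's default whisker setting (`whis=1.5`). The helper `count_iqr_outliers` and the toy prices are assumptions made for illustration; quartile conventions can differ slightly between libraries, so treat the counts as approximate.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, whis: float = 1.5) -> int:
    """Count points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((values < lower) | (values > upper)).sum())


# Toy data (hypothetical prices in thousands), one extreme value in group 0.
toy_q4 = pd.DataFrame({
    "waterfront": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "price": [300, 320, 310, 305, 315, 900, 1000, 1100, 1050, 1080],
})

# One outlier count per waterfront group.
outliers_by_group = toy_q4.groupby("waterfront")["price"].apply(count_iqr_outliers)
print(outliers_by_group)
```

On the real data, the same call would be `df.groupby("waterfront")["price"].apply(count_iqr_outliers)`. Because the two groups differ greatly in size, comparing the share of outliers per group may be more informative than the raw counts.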



## Q5) Seaborn `regplot` — `sqft_above` vs `price`

**Goal:** Visually check whether `sqft_above` is **positively** or **negatively** correlated with price.


In [None]:

# Q5
sns.regplot(data=df, x='sqft_above', y='price', scatter_kws={'alpha':0.3})
plt.title("sqft_above vs price (with regression line)")
plt.show()
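
The sign of the Pearson correlation coefficient gives the same answer as the slope of the regression line. A minimal sketch on toy values (hypothetical, not from the dataset):

```python
import pandas as pd

# Toy data (hypothetical): larger sqft_above goes with a higher price.
toy_q5 = pd.DataFrame({
    "sqft_above": [800, 1200, 1500, 2000, 2600],
    "price": [200_000, 310_000, 380_000, 520_000, 650_000],
})

# Series.corr uses Pearson correlation by default.
r = toy_q5["sqft_above"].corr(toy_q5["price"])
direction = "positive" if r > 0 else "negative"

print(f"Pearson r = {r:.3f} ({direction} correlation)")
```

On the real data, `df["sqft_above"].corr(df["price"])` should agree in sign with the direction of the line in the regplot.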


## 2) Train/Test Split (used for the following questions)

In [None]:

# Two independent splits with the same settings (test_size, random_state):
# a single-feature split for Q6 and a multi-feature split for Q7–Q10
target = 'price'
y = df[target]

# Simple feature used in Q6
X_sqft = df[['sqft_living']]

# Multi-feature set used in Q7–Q10
features = [
    "floors",
    "waterfront",
    "lat",
    "bedrooms",
    "sqft_basement",
    "view",
    "bathrooms",
    "sqft_living15",
    "sqft_above",
    "grade",
    "sqft_living"
]

# Some datasets may have missing values; drop rows with NA in required columns
model_df = df[features + [target]].dropna()
X_multi = model_df[features]
y_multi = model_df[target]

# Train/Test Split
X_sqft_train, X_sqft_test, y_sqft_train, y_sqft_test = train_test_split(X_sqft, y, test_size=0.15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X_multi, y_multi, test_size=0.15, random_state=1)

print("Single-feature train/test shapes:", X_sqft_train.shape, X_sqft_test.shape)
print("Multi-feature   train/test shapes:", X_train.shape, X_test.shape)


## Q6) Linear Regression with `sqft_living` → compute R²

In [None]:

# Q6
lin1 = LinearRegression()
lin1.fit(X_sqft_train, y_sqft_train)

y_sqft_pred = lin1.predict(X_sqft_test)
r2_q6 = r2_score(y_sqft_test, y_sqft_pred)

print("Intercept:", lin1.intercept_)
print("Coefficient for sqft_living:", lin1.coef_[0])
print("R^2 on test:", r2_q6)
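
As a reminder of what `r2_score` reports, R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below checks this on a few made-up numbers; `y_true` and `y_hat` are illustrative values only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions (illustration only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print("Manual R^2:  ", r2_manual)
print("r2_score R^2:", r2_score(y_true, y_hat))
```

A value of 1 means perfect predictions, 0 means no better than always predicting the mean, and on held-out data R² can be negative.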



## Q7) Linear Regression with multiple features → compute R²

Features used:
- `floors`, `waterfront`, `lat`, `bedrooms`, `sqft_basement`, `view`, `bathrooms`, `sqft_living15`, `sqft_above`, `grade`, `sqft_living`


In [None]:

# Q7
lin_multi = LinearRegression()
lin_multi.fit(X_train, y_train)

y_pred_lin_multi = lin_multi.predict(X_test)
r2_q7 = r2_score(y_test, y_pred_lin_multi)

print("Coefficients:")
print(pd.Series(lin_multi.coef_, index=X_train.columns))
print("Intercept:", lin_multi.intercept_)
print("R^2 on test:", r2_q7)



## Q8) Pipeline: Scale → Polynomial Transform (degree=2) → Linear Regression → R²


In [None]:

# Q8
pipe = Pipeline(steps=[
    ("scale", StandardScaler(with_mean=False)),  # with_mean=False skips centering; the data here is dense, so plain StandardScaler() also works
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("linreg", LinearRegression())
])

pipe.fit(X_train, y_train)
y_pred_pipe = pipe.predict(X_test)
r2_q8 = r2_score(y_test, y_pred_pipe)
print("R^2 on test (Pipeline degree=2):", r2_q8)


## Q9) Ridge Regression (alpha=0.1) on original features → R²

In [None]:

# Q9
# Standardize features first so the L2 penalty treats features on different scales comparably
ridge_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=0.1, random_state=1))
])
ridge_pipe.fit(X_train, y_train)
y_pred_ridge = ridge_pipe.predict(X_test)
r2_q9 = r2_score(y_test, y_pred_ridge)
print("R^2 on test (Ridge alpha=0.1):", r2_q9)
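
To get a feel for what `alpha` does, the sketch below fits Ridge on synthetic data with three increasing values of `alpha` and records the size of the coefficient vector. The data, the chosen `alpha` values, and the names `toy_X`, `toy_y`, and `norms` are assumptions for illustration, not part of the project.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: y depends linearly on three features plus a little noise.
rng = np.random.default_rng(1)
toy_X = rng.normal(size=(50, 3))
toy_y = toy_X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=50)

# Larger alpha means a stronger L2 penalty and smaller coefficients.
norms = {}
for alpha in (0.1, 10.0, 1000.0):
    model = Ridge(alpha=alpha).fit(toy_X, toy_y)
    norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>7}: ||coef|| = {norms[alpha]:.4f}")
```

With `alpha=0.1`, as the question requires, the penalty is mild, so the result is usually close to plain linear regression.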



## Q10) Polynomial (degree=2) features **then** Ridge (alpha=0.1) → R²

- Perform a second-order polynomial transform on **train** and **test**.
- Fit **Ridge** on transformed **train**, compute **R²** on transformed **test**.


In [None]:

# Q10
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly  = poly.transform(X_test)

# Scale polynomial features then Ridge
ridge_poly_pipe = Pipeline(steps=[
    ("scale", StandardScaler(with_mean=False)),
    ("ridge", Ridge(alpha=0.1, random_state=1))
])
ridge_poly_pipe.fit(X_train_poly, y_train)
y_pred_ridge_poly = ridge_poly_pipe.predict(X_test_poly)
r2_q10 = r2_score(y_test, y_pred_ridge_poly)
print("R^2 on test (Poly degree=2 + Ridge alpha=0.1):", r2_q10)
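
The transform-then-fit steps above could also be written as one pipeline, which fits every step on the training data only and reuses the fitted steps for the test data. The sketch below compares the manual route with a single `Pipeline` on synthetic data; all names and values are hypothetical, and it uses the default `StandardScaler()` rather than the `with_mean=False` variant from the cell above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic train/test data (illustration only).
rng = np.random.default_rng(2)
toy_train = rng.normal(size=(40, 2))
toy_test = rng.normal(size=(10, 2))
toy_y_train = toy_train[:, 0] ** 2 + toy_train[:, 1] + rng.normal(scale=0.05, size=40)

# Manual route: fit each transform on train, then apply it to test.
poly_step = PolynomialFeatures(degree=2, include_bias=False)
scale_step = StandardScaler()
train_t = scale_step.fit_transform(poly_step.fit_transform(toy_train))
test_t = scale_step.transform(poly_step.transform(toy_test))
manual_pred = Ridge(alpha=0.1).fit(train_t, toy_y_train).predict(test_t)

# Pipeline route: the same three steps chained in one estimator.
one_pipe = Pipeline(steps=[
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=0.1)),
])
pipe_pred = one_pipe.fit(toy_train, toy_y_train).predict(toy_test)

same = np.allclose(manual_pred, pipe_pred)
print("Manual and pipeline predictions match:", same)
```

The rubric asks for the explicit transform on train and test, so the manual version is the one to screenshot; the pipeline form is only a convenience that makes it harder to fit a transform on test data by accident.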



---

### Final Checklist for Screenshots
Take screenshots that include **both the code and the output** for:

1. `df.dtypes` (Q1)  
2. `df.drop(..., inplace=True)` and `describe()` (Q2)  
3. `value_counts().to_frame()` for `floors` (Q3)  
4. Seaborn **boxplot** (`waterfront` vs `price`) (Q4)  
5. Seaborn **regplot** (`sqft_above` vs `price`) (Q5)  
6. Linear Regression with `sqft_living` and the **R²** (Q6)  
7. Multi-feature Linear Regression and the **R²** (Q7)  
8. **Pipeline** (scale → poly → linear) **R²** (Q8)  
9. **Ridge (alpha=0.1)** on original features **R²** (Q9)  
10. **Polynomial (degree=2) + Ridge (alpha=0.1)** **R²** (Q10)

You're done! Download your notebook (`File → Download`) and submit it along with your screenshots.
