# **Assignment: Building Permit Analysis & ML in Construction**

#### *Student Name:*  
#### *Date:*  

Welcome to the assignment notebook! Follow the instructions below to complete all tasks. Use additional code and markdown cells as needed. Make sure to document your decisions and findings.

---
### **1. Data Loading & Initial Exploration**
1. Download the dataset (e.g., Seattle Building Permits) from Kaggle.
2. Load it here and resolve any file path issues.
3. Print dataset shape, columns, and first few rows.

> **Hint**: If the dataset is large, consider sampling or filtering to a certain time period to keep the notebook manageable.

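The steps above might look like the following minimal sketch. The inline CSV and its column names (`PermitNum`, `PermitClass`, `IssuedDate`, `EstProjectCost`) are hypothetical stand-ins so the example runs on its own; your downloaded file will have its own path and columns.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV; these column names are hypothetical.
csv_text = """PermitNum,PermitClass,IssuedDate,EstProjectCost
A1,Residential,2019-03-01,12000
A2,Commercial,2021-07-15,250000
A3,Residential,2022-01-20,
A4,Residential,2022-11-05,48000
"""

# In the notebook, replace io.StringIO(csv_text) with the path to your file.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["IssuedDate"])

# Option A: keep only a recent time period to limit the working dataset.
recent = df[df["IssuedDate"] >= "2021-01-01"]

# Option B: draw a reproducible random sample of rows.
sample = df.sample(frac=0.5, random_state=42)

print(df.shape)
print(df.columns.tolist())
print(recent.head())
```

Parsing the date column at load time makes the time-period filter a simple comparison; fixing `random_state` keeps the sample identical between runs.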

In [None]:
# 1. Data Loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score, classification_report

sns.set_theme(style='whitegrid')

# Example: load data
# df = pd.read_csv('path_to_your_building_permits.csv')
# print(df.shape)
# df.head()

### **2. Data Cleaning & Preprocessing**
Here, you'll:
1. Inspect missing values and decide how to handle them.
2. Check for duplicates or incorrect data.
3. Potentially create new columns (feature engineering).

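As a rough illustration of these three steps, the sketch below uses a small toy DataFrame. The column names and the `CostPerUnit` feature are assumptions made for the example, not columns you should expect in the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names, standing in for the permit dataset.
df = pd.DataFrame({
    "PermitNum": ["A1", "A2", "A2", "A3", "A4"],
    "EstProjectCost": [12000.0, 250000.0, 250000.0, np.nan, 48000.0],
    "HousingUnits": [1, 0, 0, 2, 4],
})

# 1. Count missing values per column.
print(df.isna().sum())

# 2. Drop exact duplicate rows, then rows missing the column we care about.
clean = df.drop_duplicates().dropna(subset=["EstProjectCost"]).copy()

# 3. Feature engineering: cost per housing unit. Zero units would cause a
#    division by zero, so those rows get NaN instead of inf.
units = clean["HousingUnits"].replace(0, np.nan)
clean["CostPerUnit"] = clean["EstProjectCost"] / units

print(clean)
```
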
In [None]:
# 2. Data Cleaning
# Example steps:
# print(df.isna().sum())
# df.dropna(subset=['SomeImportantColumn'], inplace=True)
# df['NewFeature'] = df['ColumnA'] / df['ColumnB']
# df.drop_duplicates(inplace=True)
# etc.


### **3. Exploratory Data Analysis (EDA)**
Create plots (histograms, bar charts, correlation heatmap, etc.) to understand distributions and relationships. Summarize interesting insights.

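Plots are easier to interpret alongside the numbers behind them. The sketch below computes a few such summaries on toy data; the column names are hypothetical, and treating `EstProjectCost` as the candidate target is an assumption made for illustration.

```python
import pandas as pd

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "PermitClass": ["Residential", "Commercial", "Residential",
                    "Commercial", "Residential", "Residential"],
    "EstProjectCost": [12000, 250000, 30000, 900000, 48000, 15000],
    "HousingUnits": [1, 0, 1, 0, 4, 1],
})

# Category frequencies (the numbers behind a count plot).
class_counts = df["PermitClass"].value_counts()

# Per-class cost summary; the median is less sensitive to a few very
# expensive projects than the mean.
cost_by_class = df.groupby("PermitClass")["EstProjectCost"].agg(
    ["count", "median", "mean"]
)

# Correlation of each numeric column with the candidate target,
# ordered by absolute strength.
corr_with_cost = (
    df.corr(numeric_only=True)["EstProjectCost"]
    .drop("EstProjectCost")
    .sort_values(key=abs, ascending=False)
)

print(class_counts)
print(cost_by_class)
print(corr_with_cost)
```
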
In [None]:
# 3. Exploratory Data Analysis
# Example EDA code:
# df.hist(figsize=(12,8))
# plt.show()

# sns.countplot(x='PermitClass', data=df)
# plt.show()

# # Correlation heatmap:
# corr = df.corr(numeric_only=True)
# sns.heatmap(corr, annot=True, cmap='coolwarm')
# plt.show()

# # Summaries:
# print(df.describe())

### **4. Choose a Target & Split Data**
- Decide whether you are predicting a numeric target (regression) or a categorical target (classification).
- Perform a train/test split.
- Watch out for data leakage: exclude columns whose values would not be known at prediction time (e.g., the completion date or anything else recorded after the outcome).

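One way to guard against leakage is to drop after-the-fact columns before splitting, and to consider a time-based split in addition to a random one. The sketch below shows both on toy data; all column names, including the leaky `DaysToComplete`, are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "AppliedYear": [2018, 2018, 2019, 2019, 2020, 2020, 2021, 2021, 2022, 2022],
    "HousingUnits": [1, 2, 1, 4, 0, 1, 3, 1, 0, 2],
    # Known only after the project finishes, so it must not be a feature.
    "DaysToComplete": [90, 120, 60, 300, 45, 80, 210, 75, 30, 150],
    "EstProjectCost": [12000, 30000, 15000, 90000, 250000,
                       18000, 70000, 16000, 400000, 35000],
})

target = "EstProjectCost"
leaky_columns = ["DaysToComplete"]

# Remove the target and any after-the-fact columns from the features.
X = df.drop(columns=[target] + leaky_columns)
y = df[target]

# Random split, as in the assignment template.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Time-based alternative: train on earlier years, test on the latest year,
# which mimics predicting permits that have not been filed yet.
train_mask = df["AppliedYear"] < 2022
X_train_t, X_test_t = X[train_mask], X[~train_mask]
y_train_t, y_test_t = y[train_mask], y[~train_mask]

print(X_train.shape, X_test.shape)
print(X_train_t.shape, X_test_t.shape)
```
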
In [None]:
# 4. Choose target column
# Example: If predicting ProjectValue (regression)
# or PermitStatus (classification)
# X = df.drop('TargetColumn', axis=1)
# y = df['TargetColumn']

# # Train/test split
# X_train, X_test, y_train, y_test = train_test_split(X, y,
#                                                    test_size=0.2,
#                                                    random_state=42)
# X_train.shape, X_test.shape

### **5. Build & Evaluate Models**
1. Start with a **baseline model** (e.g., Linear Regression, Logistic Regression).
2. Move on to an **advanced model** (e.g., Random Forest, XGBoost).
3. Evaluate using **appropriate metrics** (MAE, RMSE, R² for regression; accuracy, precision, recall, F1 for classification).

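A baseline is easier to judge against a trivial reference. The sketch below compares Linear Regression with scikit-learn's `DummyRegressor`, which always predicts the training mean, on synthetic data. The data and the `regression_report` helper are made up for illustration and only cover the regression case.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: a linear signal plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def regression_report(model):
    """Fit on the training split and return MAE, RMSE, and R^2 on the test split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "MAE": mean_absolute_error(y_test, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, y_pred))),
        "R2": r2_score(y_test, y_pred),
    }


dummy_scores = regression_report(DummyRegressor(strategy="mean"))
linear_scores = regression_report(LinearRegression())

print("Dummy :", dummy_scores)
print("Linear:", linear_scores)
```
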
In [None]:
# 5. Model Training & Evaluation Example
# from sklearn.linear_model import LinearRegression
# model = LinearRegression()
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)

# # Evaluate (for regression)
# mae = mean_absolute_error(y_test, y_pred)
# rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # the squared=False argument is deprecated in newer scikit-learn
# r2 = r2_score(y_test, y_pred)
# print("MAE:", mae)
# print("RMSE:", rmse)
# print("R^2 :", r2)

# # Evaluate (for classification; y_pred must come from a classifier, e.g. LogisticRegression)
# acc = accuracy_score(y_test, y_pred)
# print("Accuracy:", acc)
# print(classification_report(y_test, y_pred))

### **6. Advanced ML & Hyperparameter Tuning (Optional)**
Try **Random Forest**, **XGBoost**, or another ensemble method. Use `GridSearchCV` or `RandomizedSearchCV` to tune hyperparameters.

Document your parameter choices and final model results.

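When the parameter grid grows, `RandomizedSearchCV` tries a fixed number of random combinations instead of all of them. The sketch below is one possible setup on synthetic data; the parameter ranges, `n_iter`, and the scoring choice are arbitrary assumptions, not recommended values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data with a non-linear signal.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 3))
y = X[:, 0] ** 2 + 5.0 * X[:, 1] + rng.normal(0, 1.0, size=150)

# Candidate values to sample from (27 combinations in total).
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Evaluate only 5 randomly chosen combinations with 3-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=42,
)
search.fit(X, y)

# Scores are negated errors, so flip the sign to report MAE.
print("Best parameters:", search.best_params_)
print("Best CV MAE:", -search.best_score_)
```
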
In [None]:
# Example Random Forest for regression
# from sklearn.ensemble import RandomForestRegressor
# rf = RandomForestRegressor(random_state=42)
# rf.fit(X_train, y_train)
# y_pred_rf = rf.predict(X_test)

# # Evaluate
# mae_rf = mean_absolute_error(y_test, y_pred_rf)
# rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
# r2_rf = r2_score(y_test, y_pred_rf)
# print("RF MAE:", mae_rf)

# # Tuning example
# from sklearn.model_selection import GridSearchCV
# param_grid = {
#    'n_estimators': [50, 100],
#    'max_depth': [None, 10, 20]
# }
# grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='neg_mean_squared_error')
# grid_search.fit(X_train, y_train)
# best_rf = grid_search.best_estimator_
# print("Best Parameters:", grid_search.best_params_)

### **7. Results & Discussion**
- Summarize key findings.
- Compare baseline vs. advanced model performance.
- Reflect on data leakage precautions.
- Make domain-specific suggestions if relevant.

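A side-by-side table can make the baseline vs. advanced comparison easier to read. The sketch below builds one from toy numbers; the model labels and values are placeholders, so substitute `y_test` and your own models' predictions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values only; in the notebook, use y_test and each model's predictions.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
predictions = {
    "Baseline": np.array([12.0, 18.0, 33.0, 37.0]),
    "Advanced": np.array([11.0, 19.0, 31.0, 39.0]),
}

# One row of metrics per model.
rows = []
for name, y_pred in predictions.items():
    rows.append({
        "model": name,
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": r2_score(y_true, y_pred),
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary.round(3))
```

Keeping every model's metrics in one DataFrame also makes it straightforward to export the table for the written report.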

In [None]:
# Summarize your final metrics here
# Example:
# print("Model Summary:")
# print("Baseline LR - MAE, RMSE, R^2:", mae, rmse, r2)
# print("Advanced Model (RF) - MAE, RMSE, R^2:", mae_rf, rmse_rf, r2_rf)
# # etc.

---
# **End of Notebook**
Ensure you export/save this notebook (`.ipynb`) as part of your submission.
You can now write a short write-up or use the accompanying Word template to formally present your results and reflections.
