# Homework 1 Problem 2. Design an Imputation Algorithm (Module 2)
40 Points Total<br><br>

The lecture notes provide an example of finding and replacing missing values in data. Use that example as the starting point for inserting or replacing missing values.<br>

In this problem you will design an algorithm to replace missing values in a dataset, a process known as imputation. Keep in mind that the choice of imputation method can significantly affect the performance of subsequent analyses. Select the dataset from Kaggle or a similar data repository.

# Fill before submitting
- **Student:** `Zach Hatzenbeller`
- **Course:** `Data Science Modeling & Analytics`
- **Homework:** `HW #1`
- **Date:** `2025-09-14`
- **Instructor:** `Ben Rodriguez, PhD`

---

## What to Submit in your `.ipynb`
Your notebook must include, in this order:

1. **Cover Block** — Name, course, HW #, date.  
2. **README (Execution & Setup)** — Python version; required packages + install steps; dataset source + download/use instructions; end‑to‑end run steps; any hardware notes (GPU/CPU).  
3. **Adjustable Inputs** — A single, clearly marked code cell where we can change paths, seeds, and key hyperparameters.  
4. **Problem Sections** — Each problem and sub‑part clearly labeled (e.g., “Problem 1 (a)”).  
5. **Results, Summary & Conclusions** — Your takeaways, trade‑offs, limitations.  
6. **References & Attributions** — Cite datasets, code you reused, articles, and **any AI tools** used (and how).

> **One file only**. The notebook must run **top‑to‑bottom** with no errors.

---

## How You’re Graded (what “full credit” looks like)

**1) Completeness & Problem Coverage (20%)**  
<div style="margin-left: 40px"> To earn full points, students must ensure that all parts of the assignment, including sub-questions, are fully answered. Both qualitative and quantitative components should be addressed where required, and any coding tasks must be implemented completely without omissions. </div>

**2) Writing Quality, Technical Accuracy & Justification (20%)**  
<div style="margin-left: 40px"> Writing should be clear, concise, and demonstrate graduate-level quality. All technical content must be correct, and reasoning should be sound and well-supported. Students are expected to justify their design choices and conclusions with logical arguments that reflect a strong understanding of the material. </div>   

**3) Quantitative Work (0% on this HW)**  
<div style="margin-left: 40px"> Assignments should clearly state all assumptions before attempting solutions. Derivations and calculations must be shown step by step, either in Markdown cells or through annotated code. Final results should be presented with appropriate units and precision, ensuring they are easy to interpret and technically correct. </div>
 
**4) Code Quality, Documentation & Execution (30%)**  
<div style="margin-left: 40px"> Code must run from top to bottom without errors, avoiding “Traceback” or other runtime issues. Programs should follow best practices for naming, formatting, and organization, with descriptive variables and functions. Meaningful comments should be included to explain key logic, making the code both efficient and easy to follow. </div>

**5) Examples, Test Cases & Visuals (20%)**  
<div style="margin-left: 40px"> Students should include realistic examples and test cases that demonstrate program functionality, with outputs clearly labeled. Figures and tables must be properly titled, captioned, and have labeled axes. For machine learning tasks, particularly those with imbalanced datasets such as Credit Card Fraud or NSL-KDD, evaluation metrics must go beyond simple accuracy and include measures like precision, recall, F1-score, and ROC or PR curves. </div>

**6) Notebook README & Reproducibility (10%)**  
<div style="margin-left: 40px"> Each notebook must include a README section containing the Python version, a list of required packages with installation instructions, dataset details with download information, and complete steps to run the notebook. The work should be fully reproducible on another system, with seeds set for consistency and relative paths used instead of system-dependent absolute paths. </div>

---

## README (Execution & Setup)

**Use this section to make your notebook reproducible.**

- **Python version:** `3.11.1`
- **Required packages:** `numpy`, `pandas`, `scikit-learn`, `matplotlib`, `kagglehub`
- **Install instructions (if non-standard):**
  ```bash
  pip install numpy pandas scikit-learn matplotlib kagglehub
  ```
- **Datasets used:**
  - The `yasserh/housing-prices-dataset` dataset is downloaded directly from Kaggle with `kagglehub`
  - The notebook cells run in order and clean/transform the dataset as needed
- **How to run this notebook:**
  1. Run all cells in order (Kernel → Restart & Run All).
  2. Verify that all outputs match those in the **Sample Tests** section.
  3. Ensure figures and tables render correctly.


# Problem Statement for Problem 2. Design an Imputation Algorithm

## Steps to Design an Imputation Algorithm

Use the following steps for your algorithm:
* Analyze the Data
* Choose an Appropriate Imputation Method
* Implement Imputation Algorithm
* Validate the Imputation

You will need to implement 2 of the following algorithms:
1) Mean/Median/Mode Imputation
2) K-Nearest Neighbors (KNN) Imputation
   - Use the KNN algorithm to find the 'k' samples closest in distance to the sample with the missing value, and impute the value from those nearest neighbors.
3) Regression Imputation
   - Use a regression model to predict missing values from the other variables.
4) Random Forest Imputation
   - Use the Random Forest algorithm to predict missing values.
5) Deep Learning-Based Imputation
   - Use neural networks, particularly Autoencoders, for imputing missing values.
6) Expectation-Maximization (EM) Algorithm
   - Iteratively estimate the missing values and update the model parameters until the maximum-likelihood estimate converges.
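
As an illustration of option 2, the following is a minimal sketch of KNN imputation written with NumPy only. It assumes purely numeric features, measures distance over the columns that two rows both observe, and fills each gap with the mean of the `k` nearest donor rows. The function name `knn_impute` and the toy matrix are hypothetical; scikit-learn's `KNNImputer` is the more usual choice in practice.

```python
import numpy as np

def knn_impute(X: np.ndarray, k: int = 2) -> np.ndarray:
    """Fill NaNs with the mean of the k nearest rows that observe that column."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    n_rows = X.shape[0]
    for i in range(n_rows):
        missing_cols = np.flatnonzero(np.isnan(X[i]))
        if missing_cols.size == 0:
            continue
        # Distance to every other row over the columns both rows observe
        dists = np.full(n_rows, np.inf)
        for j in range(n_rows):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                diff = X[i, shared] - X[j, shared]
                dists[j] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(dists)
        for c in missing_cols:
            # Keep only reachable donors that have column c observed
            donors = [j for j in order if np.isfinite(dists[j]) and not np.isnan(X[j, c])]
            if donors:
                filled[i, c] = X[donors[:k], c].mean()
    return filled

# Hypothetical toy data: the second row is missing its second feature
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [10.0, 100.0],
])
X_filled = knn_impute(X_toy, k=2)
print(X_filled)
```

The two closest rows to the incomplete one are the first and third, so the gap is filled with the average of their second feature. Features on very different scales would normally be standardized first so that no single column dominates the distance.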

# (a) [5 points] Analyze the Data

## Type your analysis of the data here ##

### Download the data from Kaggle and transform it for use

In [55]:
## In your analysis of the data consider the following: ##
# 1) Understand the nature of your data (categorical, numerical, time-series, etc.).
# 2) Identify the pattern of missingness (Missing Completely at Random, Missing at Random, Missing Not at Random). 

import kagglehub
import pandas as pd
import numpy as np
from pathlib import Path
import random

# Download latest version
path = kagglehub.dataset_download("yasserh/housing-prices-dataset")
df_housing = pd.read_csv(Path(path, "Housing.csv"))

true_false_cols = ["mainroad", "guestroom", "basement", "hotwaterheating", "airconditioning", "prefarea"]

for col in true_false_cols:
    df_housing.loc[df_housing[col] == "yes", col] = True
    df_housing.loc[df_housing[col] == "no", col] = False

# one-hot encode the only multi-category column; multiplying by 1 converts boolean values to 0/1
df_housing = pd.get_dummies(df_housing, columns=["furnishingstatus"])*1
df_housing_na = df_housing.copy()

# replace a random 2-10% of the values in each column, except the target ("y") column, with NaN to simulate a dataset with missing data
for col in df_housing_na.columns:
    if df_housing_na[col].dtype == object:
        df_housing_na[col] = df_housing_na[col].astype(int)
    
    if df_housing[col].dtype == object:
        df_housing[col] = df_housing[col].astype(int)
    # avoid creating nans in y column "price"
    if col != "price":
        random_float = random.uniform(0.02, 0.10)
        df_housing_na.loc[df_housing_na.sample(frac=random_float).index, col] = np.nan

print("Path to dataset files:", path)
df_housing_na.head()

Path to dataset files: C:\Users\zhatz\.cache\kagglehub\datasets\yasserh\housing-prices-dataset\versions\1


Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus_furnished,furnishingstatus_semi-furnished,furnishingstatus_unfurnished
0,13300000,7420.0,4.0,2.0,,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0
1,12250000,8960.0,,4.0,4.0,1.0,,0.0,0.0,1.0,3.0,0.0,1.0,0.0,0.0
2,12250000,9960.0,3.0,2.0,2.0,1.0,0.0,1.0,0.0,0.0,2.0,1.0,0.0,1.0,0.0
3,12215000,7500.0,4.0,2.0,2.0,1.0,0.0,1.0,0.0,1.0,3.0,1.0,1.0,0.0,0.0
4,11410000,7420.0,4.0,1.0,2.0,1.0,1.0,1.0,0.0,1.0,,0.0,1.0,0.0,0.0
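
The cell above draws both the missing fraction and the masked rows without a fixed seed, so the NaN positions change on every run. A minimal sketch of a seeded version is shown here, assuming a single NumPy generator drives both draws; `SEED`, `mask_at_random`, and the toy frame are hypothetical names used only for illustration. Because cells are removed independently of any value in the data, the simulated missingness is Missing Completely at Random (MCAR) by construction.

```python
import numpy as np
import pandas as pd

SEED = 42  # assumed value; any fixed integer works

def mask_at_random(df, target, low=0.02, high=0.10, seed=SEED):
    """Return a float copy of df with a random low..high share of each feature set to NaN."""
    rng = np.random.default_rng(seed)
    masked = df.astype(float)
    for col in masked.columns:
        if col == target:
            continue  # keep the prediction target complete
        frac = rng.uniform(low, high)
        n_missing = int(round(frac * len(masked)))
        rows = rng.choice(len(masked), size=n_missing, replace=False)
        masked.iloc[rows, masked.columns.get_loc(col)] = np.nan
    return masked

# Hypothetical stand-in for the housing data
toy = pd.DataFrame({
    "price": np.arange(100, 200),
    "area": np.arange(100) * 10.0,
    "bedrooms": np.tile([2, 3, 4, 5], 25),
})
masked_a = mask_at_random(toy, target="price")
masked_b = mask_at_random(toy, target="price")
print(masked_a.isna().mean())
```

Calling the function twice with the same seed yields identical masks, which is what makes the later validation numbers repeatable.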


### Get a description of the data and determine amount of missing values in dataset

In [56]:
display(df_housing_na.describe()); display(df_housing_na.info())

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus_furnished,furnishingstatus_semi-furnished,furnishingstatus_unfurnished
count,545.0,494.0,506.0,510.0,510.0,504.0,505.0,525.0,503.0,501.0,521.0,527.0,513.0,493.0,498.0
mean,4766729.0,5125.062753,2.956522,1.290196,1.8,0.855159,0.178218,0.354286,0.045726,0.321357,0.694818,0.237192,0.25731,0.432049,0.327309
std,1870440.0,2181.522484,0.729904,0.507413,0.868121,0.35229,0.383075,0.478752,0.209097,0.467464,0.864332,0.425765,0.437578,0.495864,0.469703
min,1750000.0,1650.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3430000.0,3520.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4340000.0,4500.0,3.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,5740000.0,6415.0,3.0,2.0,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0
max,13300000.0,16200.0,6.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 15 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   price                            545 non-null    int64  
 1   area                             494 non-null    float64
 2   bedrooms                         506 non-null    float64
 3   bathrooms                        510 non-null    float64
 4   stories                          510 non-null    float64
 5   mainroad                         504 non-null    float64
 6   guestroom                        505 non-null    float64
 7   basement                         525 non-null    float64
 8   hotwaterheating                  503 non-null    float64
 9   airconditioning                  501 non-null    float64
 10  parking                          521 non-null    float64
 11  prefarea                         527 non-null    float64
 12  furnishingstatus_furni

None
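
The non-null counts above can be turned into a direct missingness report; for instance, `area` has 494 of 545 values observed. The sketch below is one way to tabulate this, assuming only pandas; `missing_summary` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, largest first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "area": [50.0, np.nan, 70.0, np.nan],
    "parking": [1.0, 2.0, np.nan, 0.0],
})
report = missing_summary(toy)
print(report)
```

Applied to `df_housing_na`, the same function would list each feature with its share of simulated gaps, which should fall between 2% and 10% given how the NaNs were generated.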

# (b) [5 points] Choose an Appropriate Imputation Method

## Type your description of the imputation method here ##

### Mean/Median/Mode Imputation

The first imputation method is mean/median/mode imputation, where the statistic used depends on each column's data type. For object (categorical) columns, missing values are replaced with the mode, the most frequent value. For numeric columns, the mean or the median is used; the median is preferable when a column contains outliers, because outliers pull the mean much more strongly than the median.
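
The selection rule described above can be sketched as a small helper. This is an illustration under stated assumptions, not the implementation used later in the notebook: skewness stands in for "has outliers", the threshold of 1.0 is an arbitrary assumed cutoff, and `choose_fill_value` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def choose_fill_value(series, skew_threshold=1.0):
    """Pick mode for categorical/binary columns, median for skewed numerics, else mean."""
    observed = series.dropna()
    if not pd.api.types.is_numeric_dtype(series) or observed.nunique() <= 2:
        return "mode", observed.mode().iloc[0]
    # A strongly skewed column is treated as outlier-prone
    if abs(observed.skew()) > skew_threshold:
        return "median", observed.median()
    return "mean", observed.mean()

symmetric = pd.Series([1, 2, 3, 4, 5, np.nan])
skewed = pd.Series([1, 2, 2, 3, 100, np.nan])
binary = pd.Series([1, 0, 1, 1, np.nan])
labels = pd.Series(["a", "b", "c", "a", None])

for name, s in [("symmetric", symmetric), ("skewed", skewed),
                ("binary", binary), ("labels", labels)]:
    print(name, choose_fill_value(s))
```

The skewed column, where a single value of 100 would drag the mean to 21.6, is filled with its median of 2 instead.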

### Random Forest Algorithm for Imputation

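The second method is random forest imputation. For each column with missing values, a random forest regression model is trained on the rows where that column is observed, using all other columns as features, and its predictions fill in the missing entries.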

In [57]:
## In choosing your Imputation Method consider the following: ##
# 1) The method should align with the data type and missingness pattern.


# (c) [20 points] Implement Imputation Algorithm

## Type your implemented imputation Algorithm here ##

In [58]:
## In your Implementation of the Imputation Algorithm consider the following: ##
# 1) Develop the algorithm to handle different data types and patterns.
# 2) Ensure the method preserves the original data distribution and relationships.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

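class SimpleImputer:
    """Fill NaNs with the column mean or median (numeric) or mode (object columns).

    Note: the DataFrame passed in is modified in place.
    """

    def __init__(self, data: pd.DataFrame, method: str = "mean"):
        self.data = data
        self.method = method  # "mean" uses the mean; any other value uses the median
        # store column names (not sub-frames) for each dtype group
        self.numeric_cols = self.data.select_dtypes(include=["number"]).columns
        self.object_cols = self.data.select_dtypes(include=["object"]).columns
    
    def mean_meadian_mode(self):
        for col in self.numeric_cols:
            # skip complete columns (e.g. "price") so their integer dtype is left untouched
            if not self.data[col].isna().any():
                continue
            if self.method == "mean":
                fill_value = np.nanmean(self.data[col])
            else:
                fill_value = np.nanmedian(self.data[col])
            self.data.loc[self.data[col].isna(), col] = fill_value
        
        for col in self.object_cols:
            # most frequent value; mode() ignores NaN by default
            self.data.loc[self.data[col].isna(), col] = self.data[col].mode().values[0]
        
        return self.data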


class RandomForestImputer:

    def __init__(self, data: pd.DataFrame, random_state: int = 42):
        self.data = data.copy()
        self.random_state = random_state
        self.encoders = {}   # store label encoders for categorical columns

    def fit_transform(self):
        df_imputed = self.data.copy()

        for col in df_imputed.columns:
            
            if df_imputed[col].isna().sum() == 0:
                continue  # no missing values in this column

            print(f"Imputing column: {col}")

            # Split into observed and missing
            observed = df_imputed[df_imputed[col].notna()]
            missing = df_imputed[df_imputed[col].isna()]

            # Features = all other columns
            features = df_imputed.columns.drop(col)
            X_train = observed[features]
            X_test = missing[features]

            # Encode categoricals for ML
            # observed_enc = self._encode_categoricals(observed[features])
            # missing_enc = self._encode_categoricals(missing[features])

            # Target
            y = observed[col]

            model = RandomForestRegressor(n_estimators=100, random_state=self.random_state)

            # Train
            model.fit(X_train, y)

            # Predict missing values
            preds = model.predict(X_test)

            # Fill in missing values
            df_imputed.loc[df_imputed[col].isna(), col] = preds

        return df_imputed

df_simple = df_housing_na.copy()
df_forest = df_housing_na.copy()

imputer = SimpleImputer(df_simple)
simple_imputed = imputer.mean_meadian_mode()
print("Mean/Median/Mode Imputer:")
display(simple_imputed.describe())

imputer_forest = RandomForestImputer(df_forest)
forest_imputed = imputer_forest.fit_transform()
print("Random Forest Imputer:")
display(forest_imputed.describe())

Mean/Median/Mode Imputer:




Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus_furnished,furnishingstatus_semi-furnished,furnishingstatus_unfurnished
count,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0
mean,4766729.0,5125.062753,2.956522,1.290196,1.8,0.855159,0.178218,0.354286,0.045726,0.321357,0.694818,0.237192,0.25731,0.432049,0.327309
std,1870440.0,2076.747533,0.703253,0.490818,0.83973,0.338755,0.368723,0.469869,0.200864,0.448161,0.84505,0.418662,0.424513,0.47157,0.448954
min,1750000.0,1650.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3430000.0,3600.0,2.956522,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4340000.0,4880.0,3.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,5740000.0,6240.0,3.0,1.290196,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.237192,0.25731,1.0,1.0
max,13300000.0,16200.0,6.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0


Imputing column: area
Imputing column: bedrooms
Imputing column: bathrooms
Imputing column: stories
Imputing column: mainroad
Imputing column: guestroom
Imputing column: basement
Imputing column: hotwaterheating
Imputing column: airconditioning
Imputing column: parking
Imputing column: prefarea
Imputing column: furnishingstatus_furnished
Imputing column: furnishingstatus_semi-furnished
Imputing column: furnishingstatus_unfurnished
Random Forest Imputer:


Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus_furnished,furnishingstatus_semi-furnished,furnishingstatus_unfurnished
count,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0
mean,4766729.0,5157.080699,2.959878,1.288937,1.797853,0.855807,0.179416,0.353023,0.04669,0.321021,0.695255,0.235765,0.257269,0.415297,0.323339
std,1870440.0,2110.980793,0.713597,0.49363,0.846122,0.340845,0.371741,0.470991,0.202162,0.451863,0.847416,0.420121,0.433971,0.490845,0.467781
min,1750000.0,1650.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3430000.0,3600.0,2.455314,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4340000.0,4500.0,3.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,5740000.0,6420.0,3.0,1.842081,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.113977,0.864226,1.0,1.0
max,13300000.0,16200.0,6.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0
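
The `describe()` output above shows fractional values in columns that hold only 0/1 flags or small counts in the original data, for example 0.113977 at the 75th percentile of `prefarea` for the random forest imputer and 2.956522 for `bedrooms` under mean imputation. One possible post-processing step, sketched here as an assumption rather than as part of the implementation above, snaps each imputed cell to the nearest value actually observed in that column. `snap_to_observed_levels` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def snap_to_observed_levels(imputed, original, discrete_cols):
    """Round imputed cells in discrete columns to the nearest observed level."""
    snapped = imputed.copy()
    for col in discrete_cols:
        levels = np.sort(original[col].dropna().unique())
        was_missing = original[col].isna().to_numpy()
        values = snapped.loc[was_missing, col].to_numpy(dtype=float)
        # Index of the closest observed level for each imputed value
        nearest = np.abs(values[:, None] - levels[None, :]).argmin(axis=1)
        snapped.loc[was_missing, col] = levels[nearest]
    return snapped

# Hypothetical toy data: NaNs mark the cells that were imputed
original = pd.DataFrame({
    "prefarea": [0.0, 1.0, np.nan, 0.0, np.nan],
    "bedrooms": [2.0, 3.0, 4.0, np.nan, 3.0],
})
imputed = pd.DataFrame({
    "prefarea": [0.0, 1.0, 0.114, 0.0, 0.864],
    "bedrooms": [2.0, 3.0, 4.0, 2.455, 3.0],
})
snapped = snap_to_observed_levels(imputed, original, ["prefarea", "bedrooms"])
print(snapped)
```

Only cells that were originally missing are touched, so observed values pass through unchanged. Training a classifier instead of a regressor for the binary columns would be another way to keep imputed values within the valid set.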


# (d) [10 points] Validate the Imputation

## Type your validation of the implemented imputation algorithm here ##

In [60]:
## In your validation of the Imputation Algorithm consider the following: ##
# 1) Use statistical tests and visualization to assess the quality of imputation.
# 2) Compare the performance of models trained on imputed data versus non-imputed data.

from sklearn.metrics import r2_score

metrics_dict = {}
for col in df_housing:
    y_actual = df_housing[col].values.astype(float)
    y_pred_simple = simple_imputed[col].values.astype(float)
    y_pred_forest = forest_imputed[col].values.astype(float)
    simple_score = r2_score(y_actual, y_pred_simple)
    forest_score = r2_score(y_actual, y_pred_forest)
    metrics_dict[col] = {"Simple Imputation R2 Score": simple_score, "Random Forest Imputation R2 Score": forest_score}
    # print(f"Column Imputed: {col}")
    # print(f"Simple Imputation Score: {simple_score:0.03f}, Random Forest Imputation Score: {forest_score:0.03f}")
    # print("")

pd.DataFrame(metrics_dict).T.round(2)

| Column | Simple Imputation R2 Score | Random Forest Imputation R2 Score |
|---|---|---|
| price | 1.0 | 1.0 |
| area | 0.92 | 0.93 |
| bedrooms | 0.91 | 0.94 |
| bathrooms | 0.95 | 0.97 |
| stories | 0.94 | 0.96 |
| mainroad | 0.94 | 0.93 |
| guestroom | 0.93 | 0.94 |
| basement | 0.97 | 0.98 |
| hotwaterheating | 0.92 | 0.91 |
| airconditioning | 0.93 | 0.95 |
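
The R2 scores in the table above are computed over every row, including the cells that were never masked, where both imputers trivially reproduce the original value (which is why `price` scores 1.0). A sharper comparison scores only the cells that were actually imputed. The sketch below shows one way to do that with NumPy and pandas; `masked_scores` and the toy frames are hypothetical. On masked cells alone, a constant fill such as the mean should be expected to score near or below zero.

```python
import numpy as np
import pandas as pd

def masked_scores(truth, with_nan, imputed):
    """RMSE and R^2 per column, computed only on cells that were masked."""
    rows = {}
    for col in truth.columns:
        mask = with_nan[col].isna().to_numpy()
        if not mask.any():
            continue  # nothing was imputed in this column
        y_true = truth.loc[mask, col].to_numpy(dtype=float)
        y_pred = imputed.loc[mask, col].to_numpy(dtype=float)
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        rows[col] = {
            "n_masked": int(mask.sum()),
            "rmse": float(np.sqrt(ss_res / mask.sum())),
            "r2": float(1 - ss_res / ss_tot) if ss_tot > 0 else float("nan"),
        }
    return pd.DataFrame(rows).T

# Hypothetical toy data: two "area" cells masked, then filled with the observed mean (20)
truth = pd.DataFrame({"price": [1, 2, 3, 4], "area": [10.0, 20.0, 30.0, 40.0]})
with_nan = pd.DataFrame({"price": [1, 2, 3, 4], "area": [10.0, np.nan, 30.0, np.nan]})
mean_filled = with_nan.fillna(with_nan.mean())
scores = masked_scores(truth, with_nan, mean_filled)
print(scores)
```

In the notebook, the three arguments would correspond to `df_housing`, `df_housing_na`, and either `simple_imputed` or `forest_imputed`.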


Overall, both imputation methods succeed in removing the NaN values from the dataset, which makes it possible to run machine learning algorithms on it. The random forest imputer does a better job overall compared to the simple imputer, which uses only the mean/median/mode: it scores higher in most of the columns shown, although the simple imputer is marginally ahead for `mainroad` and `hotwaterheating`. Intuitively this makes sense, as the random forest imputer trains a random forest regression model for each column to predict what the NaN values in that column should be. This allows for better NaN replacement because the model learns relationships within the data that it can exploit. The simple imputer replaces every NaN in a column with the same single value, so it is unreliable for rows whose true values lie far from the column's center, such as outliers.

Both methods are useful, and the choice between them should depend on the use case. On complex datasets a more advanced algorithm may be beneficial, since it is generally more accurate. For a smaller dataset a simpler technique may work better. If time constraints are a concern, simpler methods are the better fit because they are much faster to run.
