## Arbitrary Value Imputation in Scikit-learn
We will use the data from the
[Housing Dataset](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv).

In [None]:
import io
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
# Limit our data and use only these columns
cols_to_use = [
    "TotalBsmtSF",
    "GrLivArea",
    "BsmtUnfSF",
    "LotFrontage",
    "MasVnrArea",
    "GarageYrBlt",
    "SalePrice",
]

In [None]:
# Load the House Prices dataset.
data = pd.read_csv(io.BytesIO(uploaded['houseprice.csv']), usecols=cols_to_use)
data.head()

In [None]:
# Let's separate the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),  
    data["SalePrice"],  
    test_size=0.3,  
    random_state=42
    )

X_train.shape, X_test.shape

In [None]:
# Find missing data
X_train.isnull().mean()

# SimpleImputer - default

In [None]:
imputer = SimpleImputer(
    strategy="constant",
    fill_value=999,
)

# We fit the imputer to the train set.
# The imputer will assign 999 to all missing data
imputer.fit(X_train)

In [None]:
# Imputed Values for each Column
imputer.statistics_

In [None]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

X_train

In [None]:
# Convert the train set back to a DataFrame (transform() returned a NumPy array)
X_train = pd.DataFrame(
    X_train,
    columns=imputer.get_feature_names_out(),  
)

X_train.head()

In [None]:
# Let's explore the distributions after the imputation
X_train.hist(bins=50, figsize=(10, 10))
plt.show()

The tall bar at 999 in the LotFrontage histogram appeared after the imputation. The effect is also visible in GarageYrBlt, where a separate bar sits at 999 on the far left, well below the observed garage years.
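Because the arbitrary value is meant to stand out from the observed data, it can help to check that it does not already occur in the variable and to count how many cells received it. The following is a minimal sketch on a hypothetical toy frame (the `toy` values are invented for illustration, not taken from the housing data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for two of the housing columns
toy = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 80.0, 70.0, np.nan],
        "GarageYrBlt": [2003.0, 1976.0, np.nan, 1998.0, 2001.0],
    }
)

fill_value = 999

# True if no observed (non-missing) value already equals the fill value
no_collision = not (toy == fill_value).any().any()

imputed = SimpleImputer(strategy="constant", fill_value=fill_value).fit_transform(toy)
imputed = pd.DataFrame(imputed, columns=toy.columns)

# Number of cells per column that now hold the arbitrary value
n_filled = (imputed == fill_value).sum()

print(no_collision)
print(n_filled)
```

If the fill value does not collide with observed data, the counts per column match the number of missing values in the original frame.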

# SimpleImputer - dataframe

In [None]:
# Let's separate the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),  
    data["SalePrice"],  
    test_size=0.3,  
    random_state=42
    )

X_train.shape, X_test.shape

In [None]:
imputer = SimpleImputer(
    strategy="constant",
    fill_value=999,
).set_output(transform="pandas")

imputer.fit(X_train)

In [None]:
# Imputed Value
imputer.statistics_

In [None]:
# Impute the Data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

# The output is now a DataFrame rather than a NumPy array
X_train.head()

# SimpleImputer - feature subsets

In [None]:
# Let's separate the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),  
    data["SalePrice"],  
    test_size=0.3,  
    random_state=42
    )

X_train.shape, X_test.shape

### Impute specific columns with different arbitrary values

- `999` -> **LotFrontage**
- `1999` -> **MasVnrArea**
- `2999` -> **GarageYrBlt**

In [None]:
imputer = ColumnTransformer(
    transformers=[
        (
            "LotFrontAgeImputer",
            SimpleImputer(strategy="constant", fill_value=999),
            ["LotFrontage"],
        ),
        (
            "MasVnrAreaImputer",
            SimpleImputer(strategy="constant", fill_value=1999),
            ["MasVnrArea"],
        ),
        (
            "GarageYrBltImputer",
            SimpleImputer(strategy="constant", fill_value=2999),
            ["GarageYrBlt"],
        ),
    ],
    remainder="drop",  # Untransformed columns are dropped from the final DataFrame
    # verbose_feature_names_out=False #Uncomment in order to remove prefix in transformed df
)

**Note:** Use `remainder="passthrough"` to retain the untransformed columns in the final DataFrame.

In [None]:
imputer.set_output(transform="pandas")

In [None]:
imputer.fit(X_train)

In [None]:
# Explore the Imputers
imputer.transformers

In [None]:
# Imputer Statistics for LotFrontage
imputer.named_transformers_["LotFrontAgeImputer"].statistics_

In [None]:
# Imputer Statistics for MasVnrArea
imputer.named_transformers_["MasVnrAreaImputer"].statistics_

In [None]:
# Imputer Statistics for GarageYrBlt
imputer.named_transformers_["GarageYrBltImputer"].statistics_

In [None]:
# Impute the train and test set
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
X_train.head()

**Note:** With `remainder="drop"`, the `ColumnTransformer` returns only the three variables we specified.

In [None]:
X_train.hist(bins=50, figsize=(10, 10))
plt.show()