### 3. How the SingleImputer Works Under the Hood
---
<div>A lot has to happen behind the scenes for the `SingleImputer` to meet the goals outlined in section 2. This section peeks under the hood of the `SingleImputer` to explore how it makes the imputation process easy, familiar, and flexible. In this section, we cover:</div>
<ol>
    <li>How the SingleImputer Fits the Imputation Model</li>
    <li>Customizing a SingleImputer by Tuning its Arguments</li>
</ol>

This section utilizes the toy dataset created below. Note that the dataset has no missing values. This is fine because we will not perform imputations. Rather, we are interested in how we can control an `Imputer` as we prepare to impute. Our toy dataframe has columns of mixed types. `age`, `salary`, and `weight` are numeric, while `gender` and `employment` are categorical.

In [None]:
toy_df = pd.DataFrame({
    "age": np.random.choice(np.arange(20,80), 50),
    "gender": np.random.choice(["Male","Female"], 50),
    "employment": np.random.choice(["Unemployed","Employed", "Part Time", "Self-Employed"], 50),
    "salary": np.random.choice(np.arange(50_000, 1_000_000), 50),
    "weight": np.random.choice(np.arange(100, 300), 50),
})
toy_df.head()

#### 3.1 How the SingleImputer Fits the Imputation Model
Before we tune anything, let's start with a `SingleImputer` that uses default arguments.

In [None]:
si = SingleImputer()
si.fit(toy_df)

In this case, we create an instance of the `SingleImputer` and `fit` it to our toy dataset. This fitting process is easy and should be quite familiar. As with `sklearn`, our `SingleImputer` used default arguments to fit a dataset and returned an instance of the `Imputer` class, which we assigned the name `si`. **But what actually happened when we fit the Imputer model?**

Just like `sklearn` `Transformers`, the `SingleImputer` generates a `statistics_` attribute when we fit the model:

In [None]:
si.statistics_

While the `statistics_` are quite dense, they demonstrate what a `SingleImputer` does by default. `si.statistics_` returns a dictionary, where the key is a column in the dataset and its value is a series-imputer that corresponds to the strategy set for the given column. **The concept of the series-imputer is critical to the design of the Autoimpute Imputers**. The `strategy` assigned to each column in a `SingleImputer` maps to a series-imputer class prefixed by the strategy's name. Because we did not specify any strategies for any columns, the strategy assigned to each column is `default predictive`. As a result, each column is imputed with a series-imputer called the `DefaultPredictiveImputer`.

These series-imputers are in the `autoimpute.imputations.series` folder, but they are hidden from end users because they aren't meant for public use. They are simply workers behind the scenes. The `SingleImputer` delegates work to the series-imputer that corresponds to a specified strategy. The series-imputer then implements the algorithm on behalf of the `SingleImputer` and returns the results. The reason for this design becomes more clear once we start to customize Imputers by tuning their arguments.
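
To make the delegation idea concrete, here is a minimal sketch in plain Python. It is not Autoimpute's implementation: `MiniImputer`, `MeanSeriesImputer`, and `ModeSeriesImputer` are hypothetical stand-ins, and columns are plain lists in which `None` marks a missing value. The sketch only shows the shape of the design, where a coordinator maps each strategy name to a worker class and stores one fitted worker per column in `statistics_`.

```python
from collections import Counter


class MeanSeriesImputer:
    """Hypothetical series-imputer: learns the mean of one numeric column."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.statistic_ = sum(observed) / len(observed)
        return self


class ModeSeriesImputer:
    """Hypothetical series-imputer: learns the most common value of one column."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.statistic_ = Counter(observed).most_common(1)[0][0]
        return self


class MiniImputer:
    """Hypothetical coordinator that delegates each column to a series-imputer."""

    # strategy name -> series-imputer class
    strategies = {"mean": MeanSeriesImputer, "mode": ModeSeriesImputer}

    def __init__(self, strategy):
        # strategy is a dict of column name -> strategy name
        self.strategy = strategy

    def fit(self, data):
        # build and fit one independent series-imputer instance per column
        self.statistics_ = {
            column: self.strategies[name]().fit(data[column])
            for column, name in self.strategy.items()
        }
        return self


data = {"age": [30, None, 50], "gender": ["Male", "Female", "Male"]}
mi = MiniImputer({"age": "mean", "gender": "mode"}).fit(data)
print({col: type(imp).__name__ for col, imp in mi.statistics_.items()})
```

As in the real `statistics_`, each key is a column and each value is the worker object that was fit for that column.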

#### 3.2 Customizing a SingleImputer by Tuning its Arguments
In this section, we'll review how we can **customize the SingleImputer by tuning its arguments**. We'll look at how the `statistics_` change as we customize our `SingleImputer`. Then, we'll return to the default example we introduced in section 3.1 above.

<div><b><em><span style="color:navy">The strategy argument</span></em></b></div>
The **strategy** argument is the most important argument within the `SingleImputer`. The list below shows all the strategies available to impute a column within a dataframe. `default predictive` is the default strategy if a user does not specify one. It chooses the preferred strategy to use depending on a column's data type (`pmm` for numerical, `multinomial logistic` for categorical). Note that some of these strategies are for categorical data, while others are for numeric data. As we'll see later, the Imputers let you know whether a strategy will work for a given column when you try to fit the imputation model.

In [None]:
print_header("Strategies Available for Imputation")
print(list(SingleImputer().strategies.keys()))

We have a wealth of imputation methods at our disposal, and we continue to make more available. That being said, imputation strategies are restricted to the list of strategies provided above. A user cannot even create an instance of an Imputer if the strategy he or she provides is not supported. Improper strategy specification throws a `ValueError`, as shown below. Again, the traceback is removed to keep this tutorial clean, but the error below is clear. The `strategy` argument is validated when instantiating the class.

In [None]:
# providing a strategy not yet supported or that doesn't exist
print_header("Creating a SingleImputer with an unsupported strategy")
try:
    SingleImputer(strategy="unsupported")
except ValueError as ve:
    print(f"{ve.__class__.__name__}: {ve}")

<div>So how can we utilize supported strategies? We can set the <b>strategy</b> in three ways:</div>
<ul>
    <li>As a <b>string</b>, which broadcasts the strategy across every column in the DataFrame</li>
    <li>As a <b>list or tuple</b>, where each strategy is applied to the column in the corresponding position</li>
    <li>As a <b>dictionary</b>, where the key is the column we want to impute, and the value is the strategy to use</li>
</ul>
<br>
<div>We advise the <b>dictionary method</b>, as it is the most explicit and allows the user to impute all or a subset of the columns in a DataFrame. It is also the least prone to unexpected behavior and errors when trying to fit the imputation model. Let's look at some examples below, where we run into problems with the string and iterator method but have better control with the dictionary method.</div>

In [None]:
# string strategy broadcasts across all columns
si_str = SingleImputer(strategy="mean")

# list strategies, where each item is a corresponding strategy
si_list1 = SingleImputer(strategy=["mean", "mode", "binary logistic", "categorical", "median"])
si_list2 = SingleImputer(strategy=["mean", "binary logistic", "median"])

# dictionary strategy, where we specify column and strategy together
# Note that with the dictionary, we can specify a SUBSET of columns to impute
si_dict = SingleImputer(strategy={"gender":"categorical", "salary": "pmm"})

<div>Note that we instantiated each SingleImputer with no issues so far. We provided valid types (string, iterator, or dictionary) for the strategy argument, and each strategy we provided is one of the strategies supported. So we are able to at least create an instance of our class. <b>But that does not necessarily mean the strategies we've chosen will work with the columns of our DataFrame</b>. This is something we cannot validate <b>until the user fits the Imputer to a dataset</b> because the Imputer itself knows nothing about the dataset's columns or the column types until the Imputer is fit. We will see how this plays out below when we try to fit each imputer to our toy dataframe.</div>

First, let's try to fit `si_str`, `si_list1`, and `si_list2`. All are valid Imputers but won't work with our toy data.

In [None]:
# fitting the string strategy, which yields an error
print_header("Fitting si_str, a SingleImputer that broadcasts strategy='mean'")
try:
    si_str.fit(toy_df)
except TypeError as te:
    print(f"{te.__class__.__name__}: {te}")

While a valid imputer, `si_str` failed to fit the dataset. We set the strategy as `mean`, so the `si_str` `Imputer` broadcast `mean` to all columns in our dataset. `mean` worked fine when imputing `age`, the first column, because `age` is numerical. But an error occurred when we tried to take the `mean` of the `employment` column. We cannot take the `mean` of a categorical column such as `employment`, so the `Imputer` throws an error. (Note that the same error would have occurred when the `Imputer` reached the `gender` column.)

Therefore, we must be very careful when setting the strategy with a string since that strategy is broadcast to all columns in the DataFrame. When setting strategy with a string, we must ensure that we want the same strategy for each column, and we must ensure that our DataFrame does not contain any columns that the strategy cannot fit. As we'll see later, we could have specified `mean` for every column except `employment` and `gender` if we had used a dictionary, or we could have specified a different strategy for `employment` and `gender` had we used a list.
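
Before broadcasting a strategy, it can help to check which columns it cannot fit. The sketch below is a plain-Python illustration under stated assumptions: `NUMERIC_ONLY`, `is_numeric`, and `incompatible_columns` are hypothetical helpers, not part of Autoimpute, and the toy data is a dictionary of lists rather than a DataFrame.

```python
from numbers import Number

# Assumption for this sketch: strategies that only make sense for numeric data.
NUMERIC_ONLY = {"mean", "median"}


def is_numeric(values):
    """Return True if every observed (non-None) value is a number."""
    return all(isinstance(v, Number) for v in values if v is not None)


def incompatible_columns(data, strategy):
    """List the columns that a broadcast strategy cannot fit."""
    if strategy not in NUMERIC_ONLY:
        return []
    return [col for col, values in data.items() if not is_numeric(values)]


toy = {
    "age": [25, 40, 63],
    "gender": ["Male", "Female", "Female"],
    "employment": ["Employed", "Part Time", "Unemployed"],
    "salary": [52_000, 91_000, 75_500],
}

# columns for which a broadcast "mean" would fail at fit time
bad = incompatible_columns(toy, "mean")
print(bad)
```

Any column reported here needs either a different strategy or to be left out of the strategy specification altogether.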

Next, we'll attempt to fit both of the Imputers that use a list of strategies.

In [None]:
# fitting the first list strategy, which yields an error
print_header("Fitting si_list1, a SingleImputer with a list of strategies to apply")
try:
    si_list1.fit(toy_df)
except TypeError as te:
    print(f"{te.__class__.__name__}: {te}")

The code segment above demonstrates the trouble with setting strategies through lists. In `si_list1`, the problem is similar to the `mean` example. We specified the fourth strategy (index=3) as `categorical`. Therefore, our imputer tried to fit `categorical` to the fourth column in `toy_df`, which is `salary`. Unfortunately, `salary` expects a numerical strategy, and `categorical` applies to categorical columns only. As a result, the imputer throws a `TypeError`.

In [None]:
# fitting the second list strategy, which yields an error
print_header("Fitting si_list2, a SingleImputer with a list of strategies to apply")
try:
    si_list2.fit(toy_df)
except ValueError as ve:
    print(f"{ve.__class__.__name__}: {ve}")

With `si_list2`, a different problem occurs. If we use a list as the value to the `strategy` argument, the list **must contain one strategy per column**. When we created the `Imputer`, the list contained 3 valid strategies, so no problem with instantiation. But when we tried to fit the `Imputer` to the dataset, the `Imputer` noticed the dataset had five columns. The `Imputer` does not know how to handle the fourth and fifth column, and the `Imputer` has not been told explicitly to ignore these columns, so a `ValueError` is thrown.
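
The three ways of passing `strategy` can be summarized in a short sketch. This is an illustration in plain Python, not the library's code: `expand_strategy` is a hypothetical helper that turns a string, a list or tuple, or a dictionary into a column-to-strategy mapping once the columns are known, and raises a `ValueError` when a list does not supply one strategy per column.

```python
def expand_strategy(strategy, columns):
    """Map each column to a strategy name from a str, list/tuple, or dict."""
    if isinstance(strategy, str):
        # a string is broadcast to every column
        return {col: strategy for col in columns}
    if isinstance(strategy, (list, tuple)):
        # a list or tuple must line up with the columns one-to-one
        if len(strategy) != len(columns):
            raise ValueError(
                f"Got {len(strategy)} strategies for {len(columns)} columns."
            )
        return dict(zip(columns, strategy))
    if isinstance(strategy, dict):
        # a dictionary may name any subset of the columns
        return dict(strategy)
    # assumption: the text does not say which error an invalid type raises
    raise ValueError("strategy must be a string, list, tuple, or dictionary")


columns = ["age", "gender", "employment", "salary", "weight"]

print(expand_strategy("mean", columns))
print(expand_strategy({"gender": "categorical", "salary": "pmm"}, columns))

try:
    expand_strategy(["mean", "binary logistic", "median"], columns)
except ValueError as ve:
    print(f"ValueError: {ve}")
```

Note that the length check can only happen once the columns are known, which is why a three-item list passes instantiation but fails during `fit`.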

Finally, let's examine `si_dict`:

In [None]:
print_header("Fitting si_dict, a SingleImputer with a dictionary of strategies to apply")
si_dict.fit(toy_df)

The `si_dict` `Imputer` successfully fit the toy dataset. For `gender`, it used the `categorical` method. For `salary`, it used `pmm`. Because we did not specify any imputation method for `age`, `weight`, or `employment`, these columns are not imputed and are ignored. Not only is the dictionary method more flexible, but it can drastically speed up the time it takes an Imputer to fit a model if we have hundreds of columns but only need to impute a couple of them.

<b>So what did the fit method actually do?</b> Well, it fit `pmm` to salary and `categorical` to gender, and it used all `predictors` but no `imp_kwgs`. To understand what `predictors` and `imp_kwgs` actually do, let's revisit the `statistics_` from fitting the model.

In [None]:
print_header("Accessing statistics after fitting the si_dict Imputer")
si_dict.statistics_

In `si_dict.statistics_`, the keys are the names of the columns we chose to impute and the values are the **respective series-imputers** that map to the strategy specified for the given column. Note the difference between the statistics in the code above and the statistics in section 3.1. Here, `gender` receives the series-imputer `CategoricalImputer`, which handles categorical imputation. `salary` receives the series-imputer `PMMImputer`, which implements the `pmm` algorithm.

Observe that the **series-imputers take arguments of their own**. This occurs because certain strategies may need additional information in order to implement their imputation model. In this example, the `categorical` strategy has no additional parameters necessary to pass to its series-imputer, while the `pmm` strategy has 10 additional parameters that control the way the `PMMImputer` fits a dataset. While each strategy's respective series-imputer sets default arguments, we want to be able to control those arguments to alter how the strategy ultimately works and performs. We can do so using the **imp_kwgs** argument in the `SingleImputer`, which by default is set to `None`.

<div><b><em><span style="color:navy">The imp_kwgs argument</span></em></b></div>

Let's review our `si_dict` `Imputer`. By default, its `imp_kwgs` argument is set to `None`.

In [None]:
si_dict

When `imp_kwgs` is `None`, all series-imputers for given strategies use their default arguments. We can observe those defaults once we've fit our `Imputer` by accessing its statistics.

In [None]:
si_dict.statistics_

We specified `pmm` (predictive mean matching) as the strategy to use for the `salary` column. `pmm` is a semi-parametric method that borrows logic from Bayesian regression, linear regression, and nearest neighbor search. We'll cover details of imputation algorithms in another tutorial, but for now, let's review some of the arguments the `PMMImputer` takes. Specifically, we'll focus on **neighbors** and **fill_value**.
* **neighbors** is the number of observations `pmm` will use to determine an imputation value.
* If **fill_value** is set to **random**, `pmm` randomly selects one of the `n` neighbors as the imputation. Random is the default.
* If the **fill_value** is set to **mean**, `pmm` takes the mean of the `n` neighbors and uses the mean as the imputation.  
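
The effect of these two arguments can be illustrated with a simplified sketch of the neighbor-selection step. This is not the `PMMImputer` itself: `pmm_fill` is a hypothetical function, the predicted values are supplied by hand instead of coming from a regression model, and the `rng` parameter is an addition that keeps the random choice reproducible.

```python
import random


def pmm_fill(predicted_missing, predicted_observed, observed_values,
             neighbors=5, fill_value="random", rng=None):
    """Pick an imputation from the observed values whose predictions are closest."""
    rng = rng or random.Random()
    # rank observed rows by how close their prediction is to the missing row's
    ranked = sorted(
        range(len(observed_values)),
        key=lambda i: abs(predicted_observed[i] - predicted_missing),
    )
    # the n closest rows donate their actual observed values
    donors = [observed_values[i] for i in ranked[:neighbors]]
    if fill_value == "mean":
        return sum(donors) / len(donors)
    if fill_value == "random":
        return rng.choice(donors)
    # this sketch handles only the two options described above
    raise ValueError("fill_value must be 'random' or 'mean'")


# hand-made predictions and observed salaries for six complete rows
predicted_observed = [48_000, 52_000, 61_000, 75_000, 90_000, 120_000]
observed_values = [50_000, 51_000, 64_000, 70_000, 95_000, 110_000]

# impute a row whose predicted salary is 60,000 using its 3 nearest neighbors
mean_fill = pmm_fill(60_000, predicted_observed, observed_values,
                     neighbors=3, fill_value="mean")
random_fill = pmm_fill(60_000, predicted_observed, observed_values,
                       neighbors=3, fill_value="random", rng=random.Random(0))
print(mean_fill, random_fill)
```

With more `neighbors`, the donor pool widens, so `mean` smooths over more rows and `random` has more candidates to draw from.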

We'll create two new Imputers to demonstrate how we can tweak the behavior of the `PMMImputer` through `imp_kwgs`.

In [None]:
# using the column name
si_dict_col = SingleImputer(
    strategy={"gender":"categorical", "salary": "pmm", "weight": "pmm"},
    imp_kwgs={"salary": {"neighbors": 10, "fill_value": "mean"}}
)

# using the strategy name
si_dict_strat = SingleImputer(
    strategy={"gender":"categorical", "salary": "pmm", "weight": "pmm"},
    imp_kwgs={"pmm": {"neighbors": 10, "fill_value": "mean"}}
)

In [None]:
# fit the si_dict_col imputer
si_dict_col.fit(toy_df)

In [None]:
# fit the si_dict_strat imputer
si_dict_strat.fit(toy_df)

We fit two new imputers to our toy dataset. The first `Imputer`, `si_dict_col`, sets `pmm` as the strategy for both `weight` and `salary`. Additionally, it sets `imp_kwgs` to fine-tune **the pmm algorithm for the salary column only**. Our second imputer, `si_dict_strat`, sets the same strategies, but it sets `imp_kwgs` to fine-tune **any column that uses the pmm algorithm**.

As a result, the `si_dict_col` `Imputer` uses a customized version of `pmm` for salary but the default version of `pmm` for weight. We can see the differences by accessing the Imputer's statistics, as shown below. The `weight` column has the default number of `neighbors` (5) and the default `fill_value` (random). The `salary` column, on the other hand, uses 10 `neighbors`, and its `fill_value` is set to `mean`.

In [None]:
print_header("PMMImputer for weight")
print(si_dict_col.statistics_["weight"])
print_header("PMMImputer for salary")
print(si_dict_col.statistics_["salary"])
print_header("Number of neighbors used for weight vs. salary")
print(
    {"number of neighbors for salary": si_dict_col.statistics_["salary"].neighbors, 
     "number of neighbors for weight": si_dict_col.statistics_["weight"].neighbors}
)

The `si_dict_strat` `Imputer` applies `imp_kwgs` to **any column that has pmm as its strategy**. Therefore, the customized `PMMImputer` applies to both the `salary` and the `weight` column, as shown below. The number of neighbors is the same for both columns, as is the `fill_value`.

In [None]:
print_header("PMMImputer for weight")
print(si_dict_strat.statistics_["weight"])
print_header("PMMImputer for salary")
print(si_dict_strat.statistics_["salary"])
print_header("Number of neighbors used for weight vs. salary")
print(
    {"number of neighbors for salary": si_dict_strat.statistics_["salary"].neighbors, 
     "number of neighbors for weight": si_dict_strat.statistics_["weight"].neighbors}
)

Therefore, we can customize the series-imputer for any given imputation strategy **by column** or **by strategy itself**. While we demonstrated this behavior for `pmm`, the same logic applies to any imputation strategy. Below is an example using `interpolate` as an imputation strategy. We'll specify `imp_kwgs` by column.

In [None]:
# interpolate with imp_kwgs using the column name
si_interp = SingleImputer(
    strategy={"salary": "interpolate", "weight": "interpolate"},
    imp_kwgs={"salary": {"fill_strategy": "linear"}, "weight": {"fill_strategy": "quadratic"}}
)

# fit the imputer
si_interp.fit(toy_df)

In [None]:
si_interp.statistics_

In this case, we've created an `Imputer` that uses **linear interpolation** to impute `salary` and **quadratic interpolation** to impute `weight`. This is just another example of how we can use `imp_kwgs` along with a column or strategy to fine-tune the imputation algorithm itself when we create an instance of an `Imputer`.
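
The two ways of keying `imp_kwgs` can be pictured as a simple lookup. The sketch below is a plain-Python illustration, not Autoimpute's code: `resolve_imp_kwgs` is a hypothetical helper, and the rule that a column-level entry takes precedence over a strategy-level entry is an assumption that the text above does not confirm.

```python
def resolve_imp_kwgs(column, strategy, imp_kwgs):
    """Return the keyword arguments for one column's series-imputer."""
    if imp_kwgs is None:
        return {}  # the series-imputer keeps its defaults
    if column in imp_kwgs:
        # assumption: a column-level entry wins over a strategy-level one
        return dict(imp_kwgs[column])
    if strategy in imp_kwgs:
        return dict(imp_kwgs[strategy])
    return {}


strategy = {"gender": "categorical", "salary": "pmm", "weight": "pmm"}

by_column = {"salary": {"neighbors": 10, "fill_value": "mean"}}
by_strategy = {"pmm": {"neighbors": 10, "fill_value": "mean"}}

for label, kwgs in [("by column", by_column), ("by strategy", by_strategy)]:
    resolved = {
        col: resolve_imp_kwgs(col, strat, kwgs) for col, strat in strategy.items()
    }
    print(label, resolved)
```

Keyed by column, only `salary` receives custom arguments. Keyed by strategy, both `salary` and `weight` receive them, which mirrors the behavior of `si_dict_col` and `si_dict_strat`.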

#### Revisiting Imputer Model Fitting
In the code samples above, we introduce the concept of a **series-imputer**. Remember that the `strategy` for each column in a `SingleImputer` is simply a key that maps to a series-imputer class prefixed by the strategy's name. Also recall that the `SingleImputer` delegates work to the series-imputer that corresponds to a specified strategy. The series-imputer then implements the algorithm on behalf of the `SingleImputer` and returns the results.

Let's return to the `si_dict` imputer. We'll access its statistics, which contain each strategy's respective series-imputer.

In [None]:
si_dict.statistics_

The `si_dict` `Imputer` delegates the work for `gender` to the `CategoricalImputer`, which in turn fits the categorical imputation model. Similarly, the `SingleImputer` delegates the work for `salary` to the `PMMImputer`, which in turn fits the pmm imputation model. In this case, we did not tune the arguments for the `PMMImputer`, as we left the `imp_kwgs` set to `None`.

Now let's look at the `Imputer` we created with the `interpolate` strategy.

In [None]:
si_interp.statistics_

In this case, we created an `Imputer` and we also set custom `imp_kwgs` for the **type of interpolation we wanted for each column**. As a result, the `si_interp` `Imputer` delegates the work for `salary` to the `InterpolateImputer`, with `fill_strategy` set to `linear`. Similarly, the `si_interp` `Imputer` delegates the work for `weight` to another `InterpolateImputer`, which uses `quadratic` instead of `linear` for its `fill_strategy`. Each column gets a **separate instance** of the `InterpolateImputer`, which does the imputation work for the column it's supposed to impute. These series-imputers act independently of one another to fit an imputation model for the column they've been assigned to.

This **design pattern** is inherently robust and flexible. The `SingleImputer` focuses on all the class setup, error handling, and dirty work to make the `Imputer` easy, familiar, and flexible. It then prepares the data for all strategy-specific series-imputers to handle. But in the end, the `SingleImputer` delegates all the imputation work and simply waits for the results of each strategy-specific series-imputer to do its job. This makes it easy to identify a bug should something break down during imputation because we've isolated responsibilities and independently delegated work for each column's imputation.