## Getting the Most out of the Imputer Classes: Part II
---
This tutorial is part II of a comprehensive overview of `Autoimpute` Imputers. It includes:
1. Creating a Toy Dataset for the `SingleImputer`
2. How the `SingleImputer` Works under the Hood
3. Customizing a `SingleImputer` through its Arguments

[Part I](https://github.com/kearnz/autoimpute-tutorials/blob/master/tutorials/imputer_mechanics_I.ipynb) of this series motivates the need for imputation and reviews the main design considerations behind Autoimpute Imputers.

Part III will extend the concepts in the first two parts of this series to the `MultipleImputer`.

### [1. Creating a Toy Dataset for the SingleImputer](#1.-Creating-a-Toy-Dataset-for-the-SingleImputer)
---
This tutorial utilizes the toy dataset created below. Note that the dataset has no missing values. This is fine because we will not perform imputations. Rather, we are interested in how the `SingleImputer` works under the hood and how we can control the `SingleImputer` to fit imputation models capable of imputing missing data. While the imputations themselves may be of primary interest to the end user, they are simply the output from a fitted imputation model. Therefore, this tutorial places emphasis on how `Autoimpute` Imputers **fit imputation models**.

In [1]:
# imports to create toy df
import warnings
import numpy as np
import pandas as pd
warnings.filterwarnings('ignore')

# helper functions used throughout this project
print_header = lambda msg: print(f"{msg}\n{'-'*len(msg)}")

# dataframe with columns as random selections from various arrays
toy_df = pd.DataFrame({
    "age": np.random.choice(np.arange(20,80), 50),
    "gender": np.random.choice(["Male","Female"], 50),
    "employment": np.random.choice(["Unemployed","Employed", "Part Time", "Self-Employed"], 50),
    "salary": np.random.choice(np.arange(50_000, 1_000_000), 50),
    "weight": np.random.choice(np.arange(100, 300, 0.1), 50),
})

# print a header, then preview the first rows of the toy dataframe
print_header("Creating a toy dataset for demonstration purposes")
toy_df.head()

Creating a toy dataset for demonstration purposes
-------------------------------------------------


| | age | employment | gender | salary | weight |
|---|---|---|---|---|---|
| 0 | 26 | Part Time | Female | 114588 | 178.4 |
| 1 | 75 | Part Time | Male | 895440 | 294.2 |
| 2 | 79 | Unemployed | Male | 268436 | 277.9 |
| 3 | 53 | Employed | Male | 166998 | 209.2 |
| 4 | 72 | Unemployed | Female | 693478 | 299.6 |


Our toy dataframe has 5 columns of mixed types. `age`, `salary`, and `weight` are numeric, while `gender` and `employment` are categorical. For the numeric variables, `age` and `salary` take integer values, while `weight` is float. For the categorical variables, `gender` is binary while `employment` is multiclass.  As we'll see later in this tutorial, column types matter for imputation. Imputers handle numeric and categorical columns, but users must be aware of which imputation methods apply to which data types.
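A quick way to see which columns fall into each group is to split a frame by dtype with `pandas`. The sketch below uses a small stand-in frame (`demo_df`, with values made up for illustration) and `select_dtypes`; it does not involve `Autoimpute` at all.

```python
import pandas as pd

# small stand-in for toy_df; the values are illustrative only
demo_df = pd.DataFrame({
    "age": [26, 75, 79],
    "gender": ["Female", "Male", "Male"],
    "employment": ["Part Time", "Part Time", "Unemployed"],
    "salary": [114588, 895440, 268436],
    "weight": [178.4, 294.2, 277.9],
})

# numeric columns (integer and float) vs. everything else
numeric_cols = demo_df.select_dtypes(include="number").columns.tolist()
categorical_cols = demo_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Running the same two calls on your own dataframe before choosing strategies shows which columns need a numeric method and which need a categorical one.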

### [2. How the SingleImputer Works under the Hood](#2.-How-the-SingleImputer-Works-under-the-Hood)
---
A lot has to happen behind the scenes for the `SingleImputer` to meet Autoimpute's design goals addressed in [Part I, section 2](https://github.com/kearnz/autoimpute-tutorials/blob/master/tutorials/imputer_mechanics_I.ipynb). This section peeks under the hood of the `SingleImputer` to explore how it makes the imputation process easy, familiar, and flexible. Additionally, users are more adequately prepared to utilize Imputers if they have a solid understanding of `Imputer` mechanics. In this section, we cover:
<ol>
    <li>How the `SingleImputer` Fits the Imputation Model</li>
    <li>What are "series-imputers"?</li>
    <li>The Design Patterns behind Imputers</li>
</ol>

#### 2.1 How the SingleImputer Fits the Imputation Model
Recall from [Part I, section 2](https://github.com/kearnz/autoimpute-tutorials/blob/master/tutorials/imputer_mechanics_I.ipynb) that `Autoimpute` Imputers set smart defaults for each argument. Therefore, we can instantiate and fit an imputer that can handle any dataset we give it. Let's create a default instance of the `SingleImputer` class and fit it to our toy dataset.

In [2]:
from autoimpute.imputations import SingleImputer
si = SingleImputer()

print_header("Fit Method returning instance of the SingleImputer class")
si.fit(toy_df)

Fit Method returning instance of the SingleImputer class
--------------------------------------------------------


SingleImputer(copy=True, imp_kwgs=None, predictors='all', scaler=None,
       seed=None, strategy='default predictive', verbose=False,
       visit='default')

In the code above, we create a default instance of the `SingleImputer` and assign it the variable name `si`. Note that because we did not specify a strategy, the `SingleImputer` set the strategy to **default predictive** for each column in the dataset. We'll explore the strategy argument in more detail in [section 3](#3.-Customizing-a-SingleImputer-through-its-Arguments) of this tutorial.

We then fit `si` to our toy dataset. This fitting process is easy and should be quite familiar. As with `sklearn`, our `SingleImputer` uses default arguments to fit the dataset and returns an instance of the `Imputer` class itself when the fitting process is complete (The `BaseEstimator` `__repr__` prints the instance and its arguments to the console). **But what actually happened when we fit the Imputer?** Just like `sklearn` `Transformers`, the `SingleImputer` generates a `statistics_` attribute when the `fit` method is called. In our case, the `statistics_` stores the imputation model built for each column we want to impute. We can access the `statistics_` of `si` to explore what happened after the `fit` process completed and the imputation model(s) were created.

In [3]:
print_header("Statistics generated from fit method of the SingleImputer")
si.statistics_

Statistics generated from fit method of the SingleImputer
---------------------------------------------------------


{'age': DefaultPredictiveImputer(cat_imputer=MultinomialLogisticImputer(),
              cat_kwgs=None,
              num_imputer=PMMImputer(am=57.03147018626208, asd=10,
       bm=array([ 4.07991e-06, -3.71088e-02,  2.50454e+00,  4.20461e+00,
        -4.72861e+00, -1.98054e+00, -3.12426e+00,  3.12426e+00]),
       bsd=10, fill_value='random', init='auto', neighbors=5, sample=1000,
       sig=1, tune=1000),
              num_kwgs=None),
 'employment': DefaultPredictiveImputer(cat_imputer=MultinomialLogisticImputer(),
              cat_kwgs=None,
              num_imputer=PMMImputer(am=None, asd=10, bm=None, bsd=10, fill_value='random', init='auto',
       neighbors=5, sample=1000, sig=1, tune=1000),
              num_kwgs=None),
 'gender': DefaultPredictiveImputer(cat_imputer=MultinomialLogisticImputer(),
              cat_kwgs=None,
              num_imputer=PMMImputer(am=None, asd=10, bm=None, bsd=10, fill_value='random', init='auto',
       neighbors=5, sample=1000, sig=1, tune=1000

While the `statistics_` are quite dense, they demonstrate what a `SingleImputer` does by default. `si.statistics_` returns a dictionary, where the key is a column we are imputing in our dataset and the value is a **series-imputer** that corresponds to the strategy set for the given column (or key). Each series-imputer contains within itself the **imputation model** for the given column it is called upon to fit. The next section discusses the concept of a series-imputer in more detail.
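To make the shape of `statistics_` concrete, here is a minimal sketch of the idea: a frame-level object whose `fit` method stores one fitted per-column object in a dictionary. `ToyMeanImputer` and `ToyFrameImputer` are hypothetical names invented for this illustration, and the code is not how `Autoimpute` is implemented.

```python
import pandas as pd

class ToyMeanImputer:
    """Hypothetical series-imputer: its 'model' is just the column mean."""
    def fit(self, series):
        self.statistics_ = series.mean()
        return self

class ToyFrameImputer:
    """Hypothetical frame-level imputer that keeps one series-imputer per column."""
    def fit(self, df):
        self.statistics_ = {}
        for col in df.columns:
            # each column gets its own fitted series-imputer
            self.statistics_[col] = ToyMeanImputer().fit(df[col])
        return self  # return the instance, as sklearn-style estimators do

demo_df = pd.DataFrame({"age": [20, 30, 40], "weight": [150.0, 160.0, 200.0]})
toy = ToyFrameImputer().fit(demo_df)
print({col: imp.statistics_ for col, imp in toy.statistics_.items()})
```

The keys of `toy.statistics_` are column names and the values are fitted objects, which mirrors the dictionary layout described above.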

#### 2.2 What are "series-imputers"?
The **series-imputer** is a critical component of how `Autoimpute` Imputers work under the hood. From the end-user's perspective, series-imputers are simply workers behind the scenes that implement a given imputation model. From Autoimpute's perspective, series-imputers are classes that implement the "imputation interface" and therefore can be used interchangeably within the `SingleImputer` when different strategies are passed.

The `strategy` a `SingleImputer` assigns to each column in a dataset maps to a **series-imputer class prefixed by the strategy's name**. These series-imputers are in the `autoimpute.imputations.series` folder, but they are hidden from end users because they aren't meant for public use. The `SingleImputer` and `MultipleImputer` are the main classes `Autoimpute` exposes to end users because they are robust and work with `DataFrames`. Under the hood, however, both of these `Imputers` rely on series-imputers to do all of the actual work once a dataset is ready for imputation models to be fit.

We'll explain the importance of and reasoning behind this design pattern in the next section (2.3). For now, let's return to our `si` `Imputer` and explore the series-imputer used by default for each column. We'll focus on the statistics for the `employment` column first.

In [5]:
# get the series imputer for employment column
emp_series_imputer = si.statistics_["employment"]

print_header("The series-imputer for the 'employment' column in the toy dataset")
emp_series_imputer

The series-imputer for the 'employment' column in the toy dataset
-----------------------------------------------------------------


DefaultPredictiveImputer(cat_imputer=MultinomialLogisticImputer(),
             cat_kwgs=None,
             num_imputer=PMMImputer(am=None, asd=10, bm=None, bsd=10, fill_value='random', init='auto',
      neighbors=5, sample=1000, sig=1, tune=1000),
             num_kwgs=None)

The code above returns the value of `si.statistics_` for the `employment` key. We observe that our `si` `Imputer` fit the `employment` column using the `DefaultPredictiveImputer`. In fact, `si` fits all the columns in our toy dataset with the `DefaultPredictiveImputer`. This occurs because we did not specify a strategy when creating an instance of the Imputer. Therefore, the Imputer set each column's strategy to `default predictive` - the default strategy deployed when a user does not set the imputation strategy.

The `DefaultPredictiveImputer` **is an example of a series-imputer**. Specifically, it is the series-imputer that maps to any column with the `default predictive` strategy. As we mentioned above, each series-imputer is a class itself that implements a specific imputation model following a general set of rules to which all series-imputers must adhere. The `DefaultPredictiveImputer` is actually one of the most complex because it has to be flexible enough to fit numeric and categorical columns. In fact, the `DefaultPredictiveImputer` delegates its work to other series-imputers depending on the column type, as shown in the code below.

In [6]:
print_header("The series imputer for categorical columns within DefaultPredictiveImputer")
print(emp_series_imputer.cat_imputer)
print()
print_header("The series imputer for numerical columns within DefaultPredictiveImputer")
print(emp_series_imputer.num_imputer)

The series imputer for categorical columns within DefaultPredictiveImputer
--------------------------------------------------------------------------
MultinomialLogisticImputer()

The series imputer for numerical columns within DefaultPredictiveImputer
------------------------------------------------------------------------
PMMImputer(am=None, asd=10, bm=None, bsd=10, fill_value='random', init='auto',
      neighbors=5, sample=1000, sig=1, tune=1000)


The `DefaultPredictiveImputer` class takes the `cat_imputer` and `num_imputer` arguments. The `cat_imputer` is the series-imputer that the `DefaultPredictiveImputer` uses when it comes across a categorical column (i.e. `object` datatype in `pandas` `DataFrames`). The `num_imputer` is the series-imputer that the `DefaultPredictiveImputer` uses when it comes across a numerical column (i.e. a numeric datatype such as `int64` or `float64` in `pandas` `DataFrames`). `cat_imputer` is set to `MultinomialLogisticImputer` by default, while `num_imputer` is set to `PMMImputer` by default. `MultinomialLogisticImputer` is the series-imputer for the `multinomial logistic` strategy, while `PMMImputer` is the series-imputer for the `pmm` strategy. Therefore, the `default predictive` strategy is simply an abstraction that chooses between the `multinomial logistic` strategy and the `pmm` strategy depending on the column datatype. Behind these strategies are their respective series-imputers: the `DefaultPredictiveImputer` delegates to the `MultinomialLogisticImputer` or the `PMMImputer` accordingly.
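The delegation rule described above can be sketched in a few lines. `pick_default_strategy` is a hypothetical helper written for this tutorial, not part of `Autoimpute`; it only encodes the rule stated in the text (numeric columns get `pmm`, all other columns get `multinomial logistic`) using a `pandas` dtype check.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def pick_default_strategy(series):
    """Hypothetical helper: numeric -> 'pmm', otherwise 'multinomial logistic'."""
    return "pmm" if is_numeric_dtype(series) else "multinomial logistic"

# illustrative stand-in data, not the toy dataset itself
demo_df = pd.DataFrame({
    "age": [26, 75, 79],
    "employment": ["Part Time", "Part Time", "Unemployed"],
})

chosen = {col: pick_default_strategy(demo_df[col]) for col in demo_df.columns}
print(chosen)
```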

Before we move to the next section, we'll look at the series-imputer for `age`.

In [7]:
print_header("The series-imputer for the 'age' column in the toy dataset")
si.statistics_["age"]

The series-imputer for the 'age' column in the toy dataset
----------------------------------------------------------


DefaultPredictiveImputer(cat_imputer=MultinomialLogisticImputer(),
             cat_kwgs=None,
             num_imputer=PMMImputer(am=57.03147018626208, asd=10,
      bm=array([ 4.07991e-06, -3.71088e-02,  2.50454e+00,  4.20461e+00,
       -4.72861e+00, -1.98054e+00, -3.12426e+00,  3.12426e+00]),
      bsd=10, fill_value='random', init='auto', neighbors=5, sample=1000,
      sig=1, tune=1000),
             num_kwgs=None)

The astute reader will have already noticed that the `DefaultPredictiveImputer` for `age` is a bit different from the one used for `employment`, even though each column has the same strategy ("default predictive"). The differences are the values of the arguments within the `PMMImputer`. This is an implementation detail of how the `PMMImputer` actually fits a dataset, which we will address in a separate tutorial on what each imputation strategy (or series-imputer) actually does. In our case, what's important to remember is that the `default predictive` strategy delegates to either `pmm` or `multinomial logistic` depending on the column type. Because `age` is a numerical column, its "default predictive" strategy is actually `pmm`, so the `PMMImputer` is invoked to fit an imputation model. With `employment`, the `PMMImputer` is never invoked since the column is categorical. The `MultinomialLogisticImputer` is invoked instead, as the `default predictive` strategy for `employment` delegates to `multinomial logistic`.

#### 2.3 The Design Considerations behind Imputers

### [3. Customizing a SingleImputer through its Arguments](#3.-Customizing-a-SingleImputer-through-its-Arguments)
---
In this section, we explore how we can **customize the SingleImputer** by tuning its arguments. As we tune arguments, we look at how the `statistics_` attribute of the `SingleImputer` changes. We learn that some of the arguments we pass to the `SingleImputer` alter the behavior of the `SingleImputer` itself, while other arguments modify the specific imputation models created by the series-imputers to which the `SingleImputer` delegates its work.

#### The strategy argument
The **strategy** argument is the most important argument within the `SingleImputer`. The list below shows all the strategies available to impute a column within a dataframe. `default predictive` is the default strategy if a user does not specify one. As we observed in [section 2](#2.-How-the-SingleImputer-Works-under-the-Hood), `default predictive` chooses the preferred strategy depending on a column's data type (`pmm` for numerical, `multinomial logistic` for categorical). Note that some of these strategies are for categorical data, while others are for numeric data. As we'll see later, the Imputers let you know whether a strategy will work for a given column when you try to fit the imputation model.

In [8]:
print_header("Strategies Available for Imputation")
print(list(SingleImputer().strategies.keys()))

Strategies Available for Imputation
-----------------------------------
['default predictive', 'least squares', 'stochastic', 'binary logistic', 'multinomial logistic', 'bayesian least squares', 'bayesian binary logistic', 'pmm', 'default univariate', 'default time', 'mean', 'median', 'mode', 'random', 'norm', 'categorical', 'interpolate', 'locf', 'nocb']


We have a wealth of imputation methods at our disposal, and we continue to make more available. That being said, imputation strategies are restricted to the list of strategies provided above. A user cannot even create an instance of an Imputer if the strategy he or she provides is not supported. Improper strategy specification throws a `ValueError`, as shown below. The traceback is removed to keep this tutorial clean, but the error below is clear. The `strategy` argument is validated when instantiating the class.

In [9]:
# providing a strategy that is not yet supported or that doesn't exist
print_header("Creating a SingleImputer with an unsupported strategy")
try:
    SingleImputer(strategy="unsupported")
except ValueError as ve:
    print(f"{ve.__class__.__name__}: {ve}")

Creating a SingleImputer with an unsupported strategy
-----------------------------------------------------
ValueError: Strategy unsupported not a valid imputation method.
 Strategies must be one of ['default predictive', 'least squares', 'stochastic', 'binary logistic', 'multinomial logistic', 'bayesian least squares', 'bayesian binary logistic', 'pmm', 'default univariate', 'default time', 'mean', 'median', 'mode', 'random', 'norm', 'categorical', 'interpolate', 'locf', 'nocb'].


<div>So how can we utilize supported strategies? We can set the <b>strategy</b> in three ways:</div>
<ul>
    <li>As a <b>string</b>, which broadcasts the strategy across every column in the DataFrame</li>
    <li>As a <b>list or tuple</b>, where each strategy is applied to the column at the corresponding position</li>
    <li>As a <b>dictionary</b>, where the key is the column we want to impute, and the value is the strategy to use</li>
</ul>
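One way to picture the three forms is as inputs that all reduce to a single column-to-strategy mapping once the column names are known. The function below, `normalize_strategy`, is a hypothetical sketch written for this tutorial and is not `Autoimpute`'s code; the check for dictionary keys that are not columns is an extra assumption of the sketch.

```python
def normalize_strategy(strategy, columns):
    """Hypothetical sketch: turn a str, list/tuple, or dict into {column: strategy}."""
    columns = list(columns)
    if isinstance(strategy, str):
        # a string is broadcast to every column
        return {col: strategy for col in columns}
    if isinstance(strategy, (list, tuple)):
        # a list or tuple must supply exactly one strategy per column
        if len(strategy) != len(columns):
            raise ValueError(
                f"Length of columns ({len(columns)}) not equal to "
                f"number of strategies ({len(strategy)})."
            )
        return dict(zip(columns, strategy))
    if isinstance(strategy, dict):
        # a dict may name any subset of columns; the rest are left alone
        unknown = [col for col in strategy if col not in columns]
        if unknown:
            raise ValueError(f"Columns not in dataset: {unknown}")
        return dict(strategy)
    raise TypeError("strategy must be a str, list, tuple, or dict")

cols = ["age", "employment", "gender", "salary", "weight"]
print(normalize_strategy("mean", cols))
print(normalize_strategy({"gender": "categorical", "salary": "pmm"}, cols))
```

Note that the mapping can only be built when the columns are available, which is why this kind of check belongs at fit time rather than at instantiation.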
<br>
<div>We advise the <b>dictionary method</b>, as it is the most explicit and allows the user to impute all or a subset of the columns in a DataFrame. It is also the least prone to unexpected behavior and errors when trying to fit the imputation model. Let's look at some examples below, where we run into problems with the string and iterator method but have better control with the dictionary method.</div>

In [21]:
# string strategy broadcasts across all columns
si_str = SingleImputer(strategy="mean")

# list strategies, where each item is a corresponding strategy
si_list1 = SingleImputer(strategy=["mean", "mode", "binary logistic", "categorical", "median"])
si_list2 = SingleImputer(strategy=["mean", "binary logistic", "median"])

# dictionary strategy, where we specify column and strategy together
# Note that with the dictionary, we can specify a SUBSET of columns to impute
si_dict = SingleImputer(strategy={"gender":"categorical", "salary": "pmm"})

<div>Note that we instantiated each SingleImputer with no issues so far. We provided valid types (string, iterator, or dictionary) for the strategy argument, and each strategy we provided is one of the strategies supported. So we are able to at least create an instance of our class. <b>But that does not necessarily mean the strategies we've chosen will work with the columns of our DataFrame</b>. This is something we cannot validate <b>until the user fits the Imputer to a dataset</b> because the Imputer itself knows nothing about the dataset's columns or the column types until the Imputer is fit. We will see how this plays out below when we try to fit each imputer to our toy dataframe.</div>

First, let's try to fit `si_str`, `si_list1`, and `si_list2`. All are valid Imputers but won't work with our toy data.

In [11]:
# fitting the string strategy, which yields an error
print_header("Fitting si_str, a SingleImputer that broadcasts strategy='mean'")
try:
    si_str.fit(toy_df)
except TypeError as te:
    print(f"{te.__class__.__name__}: {te}")

Fitting si_str, a SingleImputer that broadcasts strategy='mean'
---------------------------------------------------------------
TypeError: mean not appropriate for Series employment of type object.


While a valid imputer, `si_str` failed to fit the dataset. We set the strategy as `mean`, so the `si_str` `Imputer` broadcast `mean` to all columns in our dataset. `mean` worked fine when imputing `age`, the first column, because `age` is numerical. But an error occurred when we tried to take the `mean` of the `employment` column. We cannot take the `mean` of a categorical column such as `employment`, so the `Imputer` throws an error. (Note that the same error would have occurred when the `Imputer` reached the `gender` column.)

Therefore, we must be very careful when setting the strategy with a string since that strategy is broadcast to all columns in the DataFrame. When setting strategy with a string, we must ensure that we want the same strategy for each column, and we must ensure that our DataFrame does not contain any columns that the strategy cannot fit. As we'll see later, we could have specified `mean` for every column except `employment` and `gender` if we had used a dictionary, or we could have specified a different strategy for `employment` and `gender` had we used a list.
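If the goal is `mean` for every numeric column and nothing else, the strategy dictionary can be built from the dtypes instead of typed by hand. The sketch below uses only `pandas`; `demo_df` is an illustrative stand-in for the toy dataset, and the resulting dictionary is the kind of value one would pass as `strategy`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# illustrative stand-in with the same column names as the toy dataset
demo_df = pd.DataFrame({
    "age": [26, 75],
    "employment": ["Part Time", "Unemployed"],
    "gender": ["Female", "Male"],
    "salary": [114588, 895440],
    "weight": [178.4, 294.2],
})

# keep only the numeric columns and assign 'mean' to each of them
mean_for_numeric = {
    col: "mean" for col in demo_df.columns if is_numeric_dtype(demo_df[col])
}
print(mean_for_numeric)
```

Because `employment` and `gender` never appear in the dictionary, they would simply be skipped rather than triggering a `TypeError`.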

Next, we'll attempt to fit both of the Imputers that use a list of strategies.

In [12]:
# fitting the first list strategy, which yields an error
print_header("Fitting si_list1, a SingleImputer with a list of strategies to apply")
try:
    si_list1.fit(toy_df)
except TypeError as te:
    print(f"{te.__class__.__name__}: {te}")

Fitting si_list1, a SingleImputer with a list of strategies to apply
--------------------------------------------------------------------
TypeError: categorical not appropriate for Series salary of type int64.


The code segment above demonstrates the trouble setting strategies with lists. In `si_list1`, the problem is similar to the `mean` example. In `si_list1`, we specified the fourth strategy (index=3) as `categorical`. Therefore, our imputer tried to fit `categorical` to the fourth column in `toy_df`, which is `salary`. Unfortunately, `salary` expects a numerical strategy, and `categorical` applies to categorical columns only. As a result, the imputer throws a `TypeError`.

In [13]:
# fitting the second list strategy, which yields an error
print_header("Fitting si_list2, a SingleImputer with a list of strategies to apply")
try:
    si_list2.fit(toy_df)
except ValueError as ve:
    print(f"{ve.__class__.__name__}: {ve}")

Fitting si_list2, a SingleImputer with a list of strategies to apply
--------------------------------------------------------------------
ValueError: Length of columns not equal to number of strategies.
Length of columns: 5
Length of strategies: 3


With `si_list2`, a different problem occurs. If we use a list as the value to the `strategy` argument, the list **must contain one strategy per column**. When we created the `Imputer`, the list contained 3 valid strategies, so no problem with instantiation. But when we tried to fit the `Imputer` to the dataset, the `Imputer` noticed the dataset had five columns. The `Imputer` does not know how to handle the fourth and fifth column, and the `Imputer` has not been told explicitly to ignore these columns, so a `ValueError` is thrown.

Finally, let's examine `si_dict`:

In [22]:
print_header("Fitting si_dict, a SingleImputer with a dictionary of strategies to apply")
si_dict.fit(toy_df)

Fitting si_dict, a SingleImputer with a dictionary of strategies to apply
-------------------------------------------------------------------------


SingleImputer(copy=True, imp_kwgs=None, predictors='all', scaler=None,
       seed=None, strategy={'gender': 'categorical', 'salary': 'pmm'},
       verbose=False, visit='default')

The `si_dict` `Imputer` successfully fit the toy dataset. For `gender`, it used the `categorical` method. For `salary`, it used `pmm`. Because we did not specify any imputation method for `age`, `weight`, or `employment`, these columns are not imputed and are ignored. Not only is the dictionary method more flexible, but it can drastically speed up the time it takes an Imputer to fit a model if we have hundreds of columns but only need to impute a couple of them.

**So what did the fit method actually do?** As we learned in [section 2](#2.-How-the-SingleImputer-Works-under-the-Hood), Imputers delegate the work for each column to a series-imputer that maps to the specified strategy. In this case, we specified `pmm` for `salary` and `categorical` for `gender`, so `si_dict` delegated work for `salary` to the `PMMImputer` and delegated work for `gender` to the `CategoricalImputer`. 

In [23]:
print_header("Accessing statistics after fitting the si_dict Imputer")
si_dict.statistics_

Accessing statistics after fitting the si_dict Imputer
------------------------------------------------------


{'gender': CategoricalImputer(),
 'salary': PMMImputer(am=456939.78037656035, asd=10,
       bm=array([ 9.55614e+02, -2.89887e+01, -2.06254e+04,  4.46729e+03,
        -4.79621e+04,  6.41201e+04,  1.70351e+04, -1.70351e+04]),
       bsd=10, fill_value='random', init='auto', neighbors=5, sample=1000,
       sig=1, tune=1000)}

Note the difference between the statistics in the code above and the statistics in section 2.1. In section 2.1, all columns including `gender` and `salary` received the `DefaultPredictiveImputer`. Here, `gender` receives the series-imputer `CategoricalImputer`, which maps to the `categorical` strategy; `salary` receives the series-imputer `PMMImputer` which maps to the `pmm` strategy; and the remaining columns receive **no series-imputer at all** because we explicitly ignored them. Both `si` and `si_dict` are instances of the same `SingleImputer`, but each looks very different because we've tuned the `strategy` argument. Because the `strategy` argument maps to a `series-imputer` that creates each column's imputation model, `si` and `si_dict` end up with completely different `statistics_`. If we actually impute data, each of these Imputers would produce a very different set of imputations. 

#### The imp_kwgs argument
Observe that at times, the **series-imputers take arguments of their own**. This occurs because certain strategies may need additional information in order to implement their imputation model. In the example above, the `categorical` strategy has no additional parameters to pass to its series-imputer, while the `pmm` strategy has 10 additional parameters that control the way the `PMMImputer` fits a dataset. While each strategy's respective series-imputer sets default arguments as well, we want to be able to control those arguments to alter how the strategy ultimately works and performs. We can do so using the **imp_kwgs** argument in the `SingleImputer`, which by default is set to `None`.

Let's review our `si_dict` `Imputer`. By default, its `imp_kwgs` value is set to `None`.

In [24]:
si_dict

SingleImputer(copy=True, imp_kwgs=None, predictors='all', scaler=None,
       seed=None, strategy={'gender': 'categorical', 'salary': 'pmm'},
       verbose=False, visit='default')

When `imp_kwgs` is `None`, all series-imputers for given strategies use their default arguments. We can observe those defaults once we've fit our `Imputer` by accessing its statistics.

In [25]:
si_dict.statistics_

{'gender': CategoricalImputer(),
 'salary': PMMImputer(am=456939.78037656035, asd=10,
       bm=array([ 9.55614e+02, -2.89887e+01, -2.06254e+04,  4.46729e+03,
        -4.79621e+04,  6.41201e+04,  1.70351e+04, -1.70351e+04]),
       bsd=10, fill_value='random', init='auto', neighbors=5, sample=1000,
       sig=1, tune=1000)}

We specified `pmm` (predictive mean matching) as the strategy to use for the `salary` column. `pmm` is a semi-parametric method that borrows logic from Bayesian regression, linear regression, and nearest neighbor search. We'll cover details of imputation algorithms in another tutorial, but for now, let's review some of the arguments the `PMMImputer` takes. Specifically, we'll focus on **neighbors** and **fill_value**.
* **neighbors** is the number of observations `pmm` will use to determine an imputation value.
* If **fill_value** is set to **random**, `pmm` randomly selects one of the `n` neighbors as the imputation. Random is the default.
* If the **fill_value** is set to **mean**, `pmm` takes the mean of the `n` neighbors and uses the mean as the imputation.  
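
The effect of these two arguments can be illustrated with a simplified donor-selection step. The function `pmm_fill` below is a hypothetical sketch, not the `PMMImputer` implementation: it assumes the predicted values for observed and missing rows are already available (in real predictive mean matching they come from a fitted regression) and only shows how `neighbors` and `fill_value` shape the final imputation.

```python
import numpy as np

def pmm_fill(pred_missing, pred_observed, y_observed,
             neighbors=5, fill_value="random", rng=None):
    """Hypothetical sketch of donor selection; predictions are assumed given."""
    rng = np.random.default_rng() if rng is None else rng
    pred_observed = np.asarray(pred_observed, dtype=float)
    y_observed = np.asarray(y_observed, dtype=float)
    # observed rows whose predictions are closest to the missing row's prediction
    nearest = np.argsort(np.abs(pred_observed - pred_missing))[:neighbors]
    donors = y_observed[nearest]
    if fill_value == "mean":
        return donors.mean()       # average of the donors' observed values
    if fill_value == "random":
        return rng.choice(donors)  # one donor's observed value, picked at random
    raise ValueError("fill_value must be 'random' or 'mean'")

pred_obs = [10.0, 12.0, 15.0, 30.0, 31.0]   # made-up predictions for observed rows
y_obs = [11.0, 13.0, 14.0, 29.0, 33.0]      # made-up observed values
mean_fill = pmm_fill(12.5, pred_obs, y_obs, neighbors=3, fill_value="mean")
random_fill = pmm_fill(12.5, pred_obs, y_obs, neighbors=3,
                       fill_value="random", rng=np.random.default_rng(0))
print(mean_fill, random_fill)
```

With `fill_value="mean"` the result is an average that may not equal any observed value, while `fill_value="random"` always returns a value that was actually observed.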

We'll create two new Imputers to demonstrate how we can tweak the behavior of the `PMMImputer` through `imp_kwgs`.

In [26]:
# using the column name
si_dict_col = SingleImputer(
    strategy={"gender":"categorical", "salary": "pmm", "weight": "pmm"},
    imp_kwgs={"salary": {"neighbors": 10, "fill_value": "mean"}}
)

# using the strategy name
si_dict_strat = SingleImputer(
    strategy={"gender":"categorical", "salary": "pmm", "weight": "pmm"},
    imp_kwgs={"pmm": {"neighbors": 10, "fill_value": "mean"}}
)

In [27]:
# fit the si_dict_col imputer
si_dict_col.fit(toy_df)

SingleImputer(copy=True,
       imp_kwgs={'salary': {'neighbors': 10, 'fill_value': 'mean'}},
       predictors='all', scaler=None, seed=None,
       strategy={'gender': 'categorical', 'salary': 'pmm', 'weight': 'pmm'},
       verbose=False, visit='default')

In [28]:
# fit the si_dict_strat imputer
si_dict_strat.fit(toy_df)

SingleImputer(copy=True,
       imp_kwgs={'pmm': {'neighbors': 10, 'fill_value': 'mean'}},
       predictors='all', scaler=None, seed=None,
       strategy={'gender': 'categorical', 'salary': 'pmm', 'weight': 'pmm'},
       verbose=False, visit='default')

We fit two new imputers to our toy dataset. The first `Imputer`, `si_dict_col`, sets `pmm` as the strategy for both `weight` and `salary`. Additionally, it sets `imp_kwgs` to fine-tune **the pmm algorithm for the salary column only**. Our second imputer, `si_dict_strat`, sets the same strategies, but it sets `imp_kwgs` to fine-tune **any column that uses the pmm algorithm**.

As a result, the `si_dict_col` `Imputer` uses a customized version of `pmm` for salary but the default version of `pmm` for weight. We can see the differences by accessing the Imputer's statistics, as shown below. The `weight` column has the default number of `neighbors` (5) and the default `fill_value` (random). The `salary` column, on the other hand, uses 10 `neighbors`, and its `fill_value` is set to `mean`.

In [29]:
print_header("PMMImputer for weight")
print(si_dict_col.statistics_["weight"])
print_header("PMMImputer for salary")
print(si_dict_col.statistics_["salary"])
print_header("Number of neighbors used for weight vs. salary")
print(
    {"number of neighbors for salary": si_dict_col.statistics_["salary"].neighbors, 
     "number of neighbors for weight": si_dict_col.statistics_["weight"].neighbors}
)

PMMImputer for weight
---------------------
PMMImputer(am=227.65915432069784, asd=10,
      bm=array([-3.91893e-01, -1.30703e-06,  1.38004e+01, -7.00962e-01,
       -3.54279e+01,  2.23285e+01, -1.12760e+01,  1.12760e+01]),
      bsd=10, fill_value='random', init='auto', neighbors=5, sample=1000,
      sig=1, tune=1000)
PMMImputer for salary
---------------------
PMMImputer(am=456939.78037656035, asd=10,
      bm=array([ 9.55614e+02, -2.89887e+01, -2.06254e+04,  4.46729e+03,
       -4.79621e+04,  6.41201e+04,  1.70351e+04, -1.70351e+04]),
      bsd=10, fill_value='mean', init='auto', neighbors=10, sample=1000,
      sig=1, tune=1000)
Number of neighbors used for weight vs. salary
----------------------------------------------
{'number of neighbors for salary': 10, 'number of neighbors for weight': 5}


The `si_dict_strat` `Imputer` applies `imp_kwgs` to **any column that has pmm as its strategy**. Therefore, the customized `PMMImputer` applies to both the `salary` and the `weight` column, as shown below. The number of neighbors is the same for both columns, as is the `fill_value`.

In [30]:
print_header("PMMImputer for weight")
print(si_dict_strat.statistics_["weight"])
print_header("PMMImputer for salary")
print(si_dict_strat.statistics_["salary"])
print_header("Number of neighbors used for weight vs. salary")
print(
    {"number of neighbors for salary": si_dict_strat.statistics_["salary"].neighbors, 
     "number of neighbors for weight": si_dict_strat.statistics_["weight"].neighbors}
)

PMMImputer for weight
---------------------
PMMImputer(am=227.65915432069784, asd=10,
      bm=array([-3.91893e-01, -1.30703e-06,  1.38004e+01, -7.00962e-01,
       -3.54279e+01,  2.23285e+01, -1.12760e+01,  1.12760e+01]),
      bsd=10, fill_value='mean', init='auto', neighbors=10, sample=1000,
      sig=1, tune=1000)
PMMImputer for salary
---------------------
PMMImputer(am=456939.78037656035, asd=10,
      bm=array([ 9.55614e+02, -2.89887e+01, -2.06254e+04,  4.46729e+03,
       -4.79621e+04,  6.41201e+04,  1.70351e+04, -1.70351e+04]),
      bsd=10, fill_value='mean', init='auto', neighbors=10, sample=1000,
      sig=1, tune=1000)
Number of neighbors used for weight vs. salary
----------------------------------------------
{'number of neighbors for salary': 10, 'number of neighbors for weight': 10}
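
The two keying styles used by `si_dict_col` and `si_dict_strat` can be pictured as a small lookup. `resolve_kwgs` below is a hypothetical helper written for this tutorial. Its lookup order (column name first, then strategy name) is an assumption of the sketch; this tutorial does not state which key wins when both are present.

```python
def resolve_kwgs(column, strategy, imp_kwgs):
    """Hypothetical lookup: column key first (assumed), then strategy key."""
    if not imp_kwgs:
        return {}                       # no overrides: series-imputer defaults apply
    if column in imp_kwgs:
        return dict(imp_kwgs[column])   # overrides keyed by column name
    return dict(imp_kwgs.get(strategy, {}))  # overrides keyed by strategy name

by_col = {"salary": {"neighbors": 10, "fill_value": "mean"}}
by_strat = {"pmm": {"neighbors": 10, "fill_value": "mean"}}

print(resolve_kwgs("salary", "pmm", by_col))    # customized
print(resolve_kwgs("weight", "pmm", by_col))    # defaults
print(resolve_kwgs("weight", "pmm", by_strat))  # customized via the strategy key
```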


Therefore, we can customize the series-imputer for any given imputation strategy **by column** or **by strategy itself**. While we demonstrated this behavior for `pmm`, the same logic applies to any imputation strategy that takes additional arguments. Below is an example using `interpolate` as an imputation strategy. We'll specify `imp_kwgs` by column.

In [31]:
# interpolate with imp_kwgs using the column name
si_interp = SingleImputer(
    strategy={"salary": "interpolate", "weight": "interpolate"},
    imp_kwgs={"salary": {"fill_strategy": "linear"}, "weight": {"fill_strategy": "quadratic"}}
)

# fit the imputer
si_interp.fit(toy_df)

SingleImputer(copy=True,
       imp_kwgs={'salary': {'fill_strategy': 'linear'}, 'weight': {'fill_strategy': 'quadratic'}},
       predictors='all', scaler=None, seed=None,
       strategy={'salary': 'interpolate', 'weight': 'interpolate'},
       verbose=False, visit='default')

In [32]:
si_interp.statistics_

{'salary': InterpolateImputer(end=None, fill_strategy='linear', order=None, start=None),
 'weight': InterpolateImputer(end=None, fill_strategy='quadratic', order=None,
           start=None)}

In this case, we've created an `Imputer` that uses **linear interpolation** to impute `salary` and **quadratic interpolation** to impute `weight`. This is just another example of how we can use `imp_kwgs` along with a column or strategy to fine-tune the imputation algorithm itself when we create an instance of an `Imputer`.
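
To see what linear interpolation does to missing values, the sketch below calls `pandas`' own `Series.interpolate` on a short made-up series. It assumes, without confirmation from this tutorial, that a `fill_strategy` such as `linear` behaves like the `pandas` interpolation method of the same name. Only the linear case is run here because `method="quadratic"` in `pandas` additionally requires SciPy.

```python
import numpy as np
import pandas as pd

# made-up salary values with two gaps
salary = pd.Series([50_000.0, np.nan, 70_000.0, np.nan, 110_000.0])

# linear interpolation fills each gap with the midpoint of its neighbors
linear_filled = salary.interpolate(method="linear")
print(linear_filled.tolist())  # [50000.0, 60000.0, 70000.0, 90000.0, 110000.0]
```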