# **Replacing missing values with an arbitrary number**

Replace missing values with an arbitrary number using pandas, Scikit-learn and Feature-engine.

In [1]:
pip install feature-engine


In [2]:
import pandas as pd

# to split the data sets:
from sklearn.model_selection import train_test_split

# to impute missing data with sklearn:
from sklearn.impute import SimpleImputer

# to impute missing data with Feature-engine:
from feature_engine.imputation import ArbitraryNumberImputer

## **Load data**

In [3]:
data = pd.read_csv("credit_approval_uci.csv")
data.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,target
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,1
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
2,a,24.5,,u,g,q,h,,,,0,f,g,280.0,824,1
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,1


## **Split data into train and test sets**

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)

X_train.shape, X_test.shape

((483, 15), (207, 15))

## **Arbitrary imputation with pandas**

[pd.fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)

### **All variables with the same number**

In [5]:
# Find the maximum value of each variable, so we can
# pick an arbitrary number beyond the observed range:
X_train[["A2", "A3", "A8", "A11"]].max()

A2     76.750
A3     26.335
A8     20.000
A11    67.000
dtype: float64
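
The maxima above are the reason 99 is a reasonable fill value here: it is larger than every observed value, so imputed rows stay distinguishable from real ones. A minimal sketch of that check on a toy DataFrame follows. The helper name `is_outside_observed_range` and the toy values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for two numerical variables (illustrative values only).
toy = pd.DataFrame({
    "A2": [30.83, np.nan, 24.50, 76.75],
    "A8": [1.25, 3.04, np.nan, 20.00],
})


def is_outside_observed_range(df, value):
    """Per column, flag whether `value` falls outside [min, max] of the observed data."""
    # min() and max() skip NaN by default, so only observed values count.
    return (value < df.min()) | (value > df.max())


outside_99 = is_outside_observed_range(toy, 99)  # candidate fill value
outside_50 = is_outside_observed_range(toy, 50)  # a poorer candidate for A2

print(outside_99)
print(outside_50)
```

In this toy data, 50 falls inside the observed range of `A2`, so it would be a poor choice for that variable.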

In [6]:
# Replace missing data with 99:
X_train[["A2", "A3", "A8", "A11"]] = X_train[[
    "A2", "A3", "A8", "A11"]].fillna(99)
X_test[["A2", "A3", "A8", "A11"]] = X_test[[
    "A2", "A3", "A8", "A11"]].fillna(99)

In [7]:
# Confirm the absence of missing values in the train set:
X_train[["A2", "A3", "A8", "A11"]].isnull().sum()

A2     0
A3     0
A8     0
A11    0
dtype: int64

In [8]:
# Confirm the absence of missing values in the test set:
X_test[["A2", "A3", "A8", "A11"]].isnull().sum()

A2     0
A3     0
A8     0
A11    0
dtype: int64

### **Imputing with different values**

In [9]:
# Let's separate into train and test set:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)

In [10]:
# Create a dictionary with the numbers to impute
# each variable:
imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}

# Replace missing data:
X_train.fillna(value=imputation_dict, inplace=True)
X_test.fillna(value=imputation_dict, inplace=True)
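
When `fillna` receives a dictionary, it only touches the columns named in the keys; any other column keeps its missing values. The sketch below shows this on a toy DataFrame. The values and the extra column `A14` are used purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "A2": [30.83, np.nan, 24.50],
    "A8": [np.nan, 3.04, 1.25],
    "A14": [202.0, np.nan, 280.0],  # deliberately not listed in the dictionary
})

# One arbitrary number per variable:
fill_values = {"A2": -1, "A8": 999}

# fillna returns a new DataFrame unless inplace=True is passed.
imputed = toy.fillna(value=fill_values)

# Count the missing values left in each column:
remaining = imputed.isnull().sum()
print(remaining)
```

Only `A14` should still report a missing value, because it has no entry in the dictionary.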

## **Arbitrary imputation with Scikit-learn**

In [11]:
# Let's separate into train and test set:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)

In [12]:
# Set up the imputer to replace missing data with 99:
imputer = SimpleImputer(strategy="constant", fill_value=99)

# Fit the imputer to the train set:
imputer.fit(X_train[["A2", "A3", "A8", "A11"]])

SimpleImputer(fill_value=99, strategy='constant')

In [13]:
# Replace missing data with 99:
X_train[["A2", "A3", "A8", "A11"]] = imputer.transform(
    X_train[["A2", "A3", "A8", "A11"]]
)
X_test[["A2", "A3", "A8", "A11"]] = imputer.transform(
    X_test[["A2", "A3", "A8", "A11"]])

In [14]:
# Confirm the absence of missing values in the train set:
X_train[["A2", "A3", "A8", "A11"]].isnull().sum()

A2     0
A3     0
A8     0
A11    0
dtype: int64

In [15]:
# Confirm the absence of missing values in the test set:
X_test[["A2", "A3", "A8", "A11"]].isnull().sum()

A2     0
A3     0
A8     0
A11    0
dtype: int64
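
The Scikit-learn cells above use a single fill value for all four variables. One possible way to use a different number per variable, mirroring the pandas dictionary, is to wrap several `SimpleImputer` instances in a `ColumnTransformer`. This is a sketch on toy data, not part of the original recipe; the step names `impute_a2` and `impute_a8` are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "A2": [30.83, np.nan, 24.50],
    "A8": [np.nan, 3.04, 1.25],
})

# One constant imputer per variable, each with its own arbitrary number:
ct = ColumnTransformer(
    transformers=[
        ("impute_a2", SimpleImputer(strategy="constant", fill_value=-1), ["A2"]),
        ("impute_a8", SimpleImputer(strategy="constant", fill_value=999), ["A8"]),
    ]
)

# The output columns follow the order of the transformers, so we can
# restore the column names by hand:
imputed = pd.DataFrame(
    ct.fit_transform(toy),
    columns=["A2", "A8"],
    index=toy.index,
)
print(imputed)
```

Columns not assigned to any transformer are dropped by default; pass `remainder="passthrough"` to keep them.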

## **Arbitrary imputation with Feature-engine**

In [16]:
# Let's separate into train and test set:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)

In [17]:
# Set up the imputer to replace missing data 
# with 99 in a subset of variables:
imputer = ArbitraryNumberImputer(
    arbitrary_number=99,
    variables=["A2", "A3", "A8", "A11"],
)

# Replace missing data with 99:
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

In [18]:
# Confirm the absence of missing values in the train set:
X_train[["A2", "A3", "A8", "A11"]].isnull().sum()

A2     0
A3     0
A8     0
A11    0
dtype: int64

In [19]:
# Confirm the absence of missing values in the test set:
X_test[["A2", "A3", "A8", "A11"]].isnull().sum()

A2     0
A3     0
A8     0
A11    0
dtype: int64

### **Imputing with different values**

In [20]:
# Let's separate into train and test set:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)

In [21]:
# Remember our imputation dictionary:
imputation_dict

{'A11': 9999, 'A2': -1, 'A3': -1, 'A8': 999}

In [22]:
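# Set up the imputer to replace missing data
# with different numbers in each variable.
# Note: when imputer_dict is given, it takes precedence,
# so the variables argument below is redundant: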
imputer = ArbitraryNumberImputer(
    imputer_dict=imputation_dict,
    variables=["A2", "A3", "A8", "A11"],
)

# Replace missing data with the numbers in the dictionary:
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

In [23]:
# Confirm the absence of missing values in the train set:
X_train[["A2", "A3", "A8", "A11"]].isnull().sum()

A2     0
A3     0
A8     0
A11    0
dtype: int64

In [24]:
# Check the imputation value: the minimum of A2
# should now be the fill value, -1:
X_train["A2"].min()

-1.0

In [25]:
# Check the imputation value: the maximum of A8
# should now be the fill value, 999:
X_train["A8"].max()

999.0
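
Checking the minimum or maximum confirms that the fill value is present, but not how many values were replaced. A stricter check, sketched below on toy data with pandas only, compares the number of originally missing values with the number of entries equal to the fill value. It assumes the arbitrary numbers do not occur among the observed values; the toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data before imputation (illustrative values only).
before = pd.DataFrame({
    "A2": [30.83, np.nan, 24.50, np.nan],
    "A8": [np.nan, 3.04, 1.25, 20.00],
})

fill_values = {"A2": -1, "A8": 999}
after = before.fillna(value=fill_values)

# Missing values per variable before imputation:
n_missing = before.isnull().sum()

# Entries equal to the fill value after imputation:
n_filled = pd.Series(
    {col: (after[col] == val).sum() for col, val in fill_values.items()}
)

# The two counts should match for every variable:
print((n_missing == n_filled).all())
```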