Example of LOCF (Last Observation Carried Forward) using pandas built-in functions

In [12]:
import pandas as pd
import numpy as np

d = {'timestamp': ['2021-11-11 12:00:00','2021-11-11 12:01:00','2021-11-11 12:02:00','2021-11-11 12:03:00','2021-11-11 12:04:00',
                    '2021-11-11 12:05:00','2021-11-11 12:06:00','2021-11-11 12:07:00','2021-11-11 12:08:00','2021-11-11 12:09:00'],
    'value1': [1,1,2,3,4,4,4,np.nan,6,6],
    'value2': [10,10,9,7,np.nan,5,5,5,5,5]
    }

df = pd.DataFrame(data=d)

## Fill in the missing data in the value1 column (LOCF).
## Assign the result back to the column: calling ffill(inplace=True) on
## df["value1"] is chained assignment, which raises a FutureWarning in
## recent pandas and stops updating df in pandas 3.0.
df["value1"] = df["value1"].ffill()

print(df)

             timestamp  value1  value2
0  2021-11-11 12:00:00     1.0    10.0
1  2021-11-11 12:01:00     1.0    10.0
2  2021-11-11 12:02:00     2.0     9.0
3  2021-11-11 12:03:00     3.0     7.0
4  2021-11-11 12:04:00     4.0     NaN
5  2021-11-11 12:05:00     4.0     5.0
6  2021-11-11 12:06:00     4.0     5.0
7  2021-11-11 12:07:00     4.0     5.0
8  2021-11-11 12:08:00     6.0     5.0
9  2021-11-11 12:09:00     6.0     5.0


Example of NOCB (Next Observation Carried Backward) using pandas built-in functions

In [13]:
## Fill in the missing data in the 'value2' column (NOCB).
## As with ffill, assign the result back instead of using inplace=True
## on a column selection.
df["value2"] = df["value2"].bfill()

print(df)

             timestamp  value1  value2
0  2021-11-11 12:00:00     1.0    10.0
1  2021-11-11 12:01:00     1.0    10.0
2  2021-11-11 12:02:00     2.0     9.0
3  2021-11-11 12:03:00     3.0     7.0
4  2021-11-11 12:04:00     4.0     5.0
5  2021-11-11 12:05:00     4.0     5.0
6  2021-11-11 12:06:00     4.0     5.0
7  2021-11-11 12:07:00     4.0     5.0
8  2021-11-11 12:08:00     6.0     5.0
9  2021-11-11 12:09:00     6.0     5.0
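
LOCF and NOCB each leave one kind of gap untouched: a forward fill cannot fill a missing value at the very start of a column, and a backward fill cannot fill one at the very end. The small sketch below illustrates this on a made-up series (the values are hypothetical, not taken from the dataset above) and shows one possible way to combine the two.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, in the middle, and at the end
s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

locf = s.ffill()          # leading NaN stays: nothing earlier to carry forward
nocb = s.bfill()          # trailing NaN stays: nothing later to carry backward
both = s.ffill().bfill()  # LOCF first, then NOCB for the remaining leading gap

print(locf.tolist())  # [nan, 2.0, 2.0, 4.0, 4.0]
print(nocb.tolist())  # [2.0, 2.0, 4.0, 4.0, nan]
print(both.tolist())  # [2.0, 2.0, 2.0, 4.0, 4.0]
```

Whether chaining the two fills is appropriate depends on the data; for some time series it may be better to leave edge gaps as missing.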


Example of linear interpolation using NumPy

In [14]:
x = [2015]
xp = [2000,2005,2010,2020]
yp = [1000000,1157018,1238120,1682971]

y = np.interp(x, xp, yp)
y

array([1460545.5])
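
To see where that number comes from, the same value can be worked out by hand with the straight-line formula y = y0 + (x - x0) * (y1 - y0) / (x1 - x0), where (x0, y0) and (x1, y1) are the known points on either side of x. This is only a sketch: it assumes x lies strictly inside the range of xp, and the names x0, x1, y0, y1 and y_manual are introduced here for illustration.

```python
import numpy as np

xp = [2000, 2005, 2010, 2020]
yp = [1000000, 1157018, 1238120, 1682971]
x = 2015

# Find the two known points that bracket x (assumes xp[0] < x <= xp[-1])
i = np.searchsorted(xp, x)   # index of the first xp value >= x
x0, x1 = xp[i - 1], xp[i]
y0, y1 = yp[i - 1], yp[i]

# Straight line between (x0, y0) and (x1, y1), evaluated at x
y_manual = y0 + (x - x0) * (y1 - y0) / (x1 - x0)
print(y_manual)  # 1460545.5
```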

A second example, this time filling several missing values at once

In [16]:
evens = [2000,2002,2004,2006,2008]
even_values = [10000,8300,6124,3971,1795]

# Odd years to estimate, and their linearly interpolated values
odds = range(2001, 2008, 2)
odd_values = np.interp(odds, evens, even_values)

print(odd_values)

[9150.  7212.  5047.5 2883. ]
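
Since the earlier examples use pandas, here is a possible pandas equivalent of the same calculation. It is a sketch that assumes the data sits in a Series indexed by year; `method="index"` tells `interpolate` to use the index values (the years) as the x-axis. The names `s` and `filled` are introduced for illustration.

```python
import pandas as pd

evens = [2000, 2002, 2004, 2006, 2008]
even_values = [10000, 8300, 6124, 3971, 1795]

# Series indexed by year; reindexing adds the odd years as NaN
s = pd.Series(even_values, index=evens, dtype=float)
s = s.reindex(range(2000, 2009))

# Linear interpolation along the index (the years)
filled = s.interpolate(method="index")

print(filled.loc[[2001, 2003, 2005, 2007]].tolist())
# [9150.0, 7212.0, 5047.5, 2883.0]
```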


Another example of linear interpolation, this time using the SciPy library

In [15]:
from scipy.interpolate import interp1d

x = [2000,2005,2010,2020]
y = [1000000,1157018,1238120,1682971]

x_value = 2015

# Finding the interpolation line
interp_line = interp1d(x, y)

y_value = interp_line(x_value)
y_value

array(1460545.5)
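
One practical difference between the two libraries shows up outside the range of the known points. The sketch below, which queries the hypothetical year 2025, suggests how each behaves: by default `interp1d` raises a `ValueError`, `fill_value="extrapolate"` extends the last line segment, and `np.interp` simply returns the last known value.

```python
import numpy as np
from scipy.interpolate import interp1d

x = [2000, 2005, 2010, 2020]
y = [1000000, 1157018, 1238120, 1682971]

# Default behaviour: querying outside [2000, 2020] raises an error
strict = interp1d(x, y)
try:
    strict(2025)
except ValueError as err:
    print("out of range:", err)

# Opt in to linear extrapolation from the last segment (2010 to 2020)
extrap = interp1d(x, y, fill_value="extrapolate")
print(float(extrap(2025)))  # approximately 1905396.5

# np.interp clamps to the nearest endpoint instead
print(np.interp(2025, x, y))  # 1682971.0
```

Extrapolated values are generally less trustworthy than interpolated ones, so the stricter default can be a useful safeguard.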

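Example of model-based imputation using scikit-learn's IterativeImputer. With its default settings it returns a single completed dataset; multiple imputation in the strict sense means repeating the imputation with sample_posterior=True and different random seeds.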

In [4]:
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import pandas as pd

# Create the dataset as a Python dictionary
d = {
    'X': [5.4,13.8,14.7,17.6,np.nan,1.1,12.9,3.4,np.nan,10.2],
    'Y': [18,27.4,np.nan,18.3,49.6,48.9,np.nan,13.6,16.1,42.7],
    'Z': [7.6,4.6,4.2,np.nan,4.7,8.5,3.5,np.nan,1.8,4.7]
}

dTest = {
    'X': [13.1, 10.8, np.nan, 9.7, 11.2],
    'Y': [18.3, np.nan, 14.1, 19.8, 17.5],
    'Z': [4.2, 3.1, 5.7,np.nan, 9.6]
}

# Create the pandas DataFrame from our dictionary
df = pd.DataFrame(data=d)
dfTest = pd.DataFrame(data=dTest)

# Create the IterativeImputer model to predict missing values
imp = IterativeImputer(max_iter=10, random_state=0)

# Fit the imputer on dfTest
imp.fit(dfTest)

# Use the fitted imputer to fill the missing values in df (rounded to 1 decimal)
dfComplete = pd.DataFrame(np.round(imp.transform(df),1), columns=['X','Y','Z'])

print(dfComplete.head(10))

      X     Y    Z
0   5.4  18.0  7.6
1  13.8  27.4  4.6
2  14.7  17.4  4.2
3  17.6  18.3  5.6
4  11.2  49.6  4.7
5   1.1  48.9  8.5
6  12.9  17.4  3.5
7   3.4  13.6  5.7
8  11.2  16.1  1.8
9  10.2  42.7  4.7
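
The cell above produces one completed dataset. As a rough sketch of how several imputations could be generated and pooled, the code below refits the imputer with `sample_posterior=True` under different seeds and averages the results. The number of imputations (5), fitting directly on `df`, and pooling by a simple mean are all assumptions made for illustration, not part of the example above.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'X': [5.4, 13.8, 14.7, 17.6, np.nan, 1.1, 12.9, 3.4, np.nan, 10.2],
    'Y': [18, 27.4, np.nan, 18.3, 49.6, 48.9, np.nan, 13.6, 16.1, 42.7],
    'Z': [7.6, 4.6, 4.2, np.nan, 4.7, 8.5, 3.5, np.nan, 1.8, 4.7],
})

n_imputations = 5  # hypothetical choice
imputed = []
for seed in range(n_imputations):
    # sample_posterior=True draws each imputed value from the predictive
    # distribution, so different seeds give different completed datasets
    imp = IterativeImputer(max_iter=10, sample_posterior=True, random_state=seed)
    imputed.append(imp.fit_transform(df))

stacked = np.stack(imputed)  # shape: (n_imputations, rows, columns)

# Pool by averaging; the spread shows how uncertain each imputed cell is
pooled = pd.DataFrame(stacked.mean(axis=0), columns=df.columns)
spread = pd.DataFrame(stacked.std(axis=0), columns=df.columns)

print(pooled.round(1))
print(spread.round(1))
```

Observed cells are identical across imputations, so their spread is zero; only the originally missing cells vary. In a full analysis one would usually fit the downstream model on each completed dataset and pool the estimates rather than average the data itself.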
