# **Information:**
- The data consists of 80 columns, which can be divided into 4 groups: F1, F2, F3, and F4.
- F1 (columns 0 - 14): float64
- F2 (columns 0 - 24): int64
- F3 (columns 0 - 24): float64
- F4 (columns 0 - 14): float64
- There are 1 000 000 entries.
- There are 1 000 000 missing values.
- Group F2 has no missing values.
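
The figures above can be checked with a few lines of pandas. The sketch below runs on a small synthetic frame rather than the competition data; `toy` and its column names (`F_1_0`, `F_2_0`, ...) are made up for illustration and only assume that the group number is the second `_`-separated token of each column name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for data.csv (names and values are made up).
toy = pd.DataFrame({
    "F_1_0": [0.5, np.nan, 1.5, 2.0],
    "F_2_0": [1, 2, 3, 4],
    "F_3_0": [np.nan, 0.1, 0.2, np.nan],
    "F_4_0": [1.0, 2.0, np.nan, 4.0],
})

# Group label of each column: the second '_'-separated token of its name.
group_of = np.array([col.split("_")[1] for col in toy.columns])

# Missing values per column, summed per group, and the overall total.
missing_per_column = toy.isna().sum()
missing_per_group = missing_per_column.groupby(group_of).sum()
total_missing = int(missing_per_column.sum())

print(toy.dtypes)
print(missing_per_group)
print("total missing:", total_missing)
```

On the real data, the same three prints should reproduce the dtypes, the empty F2 count, and the total listed above.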

[Imputation guide](https://www.kaggle.com/code/parulpandey/a-guide-to-handling-missing-values-in-python)

# **Strategy:**
## Focus on investigating F1 and F3; F4 is the control and is always mean-imputed.

#### **mean, median** 
1. F1 mean imputer, F3 mean imputer, F4 mean imputer **[1.41613]** (Benchmark)
2. F1 mean imputer, F3 median imputer, F4 mean imputer **[1.41620]**
3. F1 median imputer, F3 mean imputer, F4 mean imputer **[1.41621]**
4. F1 median imputer, F3 median imputer, F4 mean imputer **[1.41627]**

#### **ffill, bfill** (see the code for implementation details)
5. F1 ffill, F3 ffill, F4 mean imputer **[1.64095]**
6. F1 ffill, F3 bfill, F4 mean imputer **[1.63977]**
7. F1 bfill, F3 ffill, F4 mean imputer **[1.64218]**
8. F1 bfill, F3 bfill, F4 mean imputer **[1.64101]**
9. F1 ffill, F3 mean imputer, F4 mean imputer **[1.49984]**
10. F1 bfill, F3 mean imputer, F4 mean imputer **[1.50119]**
11. F1 ffill, F3 median imputer, F4 mean imputer **[1.56481]**
12. F1 bfill, F3 median imputer, F4 mean imputer **[1.56481]**
13. F1 mean imputer, F3 ffill, F4 mean imputer **[1.56357]**
14. F1 mean imputer, F3 bfill, F4 mean imputer **[1.49990]**
15. F1 median imputer, F3 ffill, F4 mean imputer **[1.56488]**
16. F1 median imputer, F3 bfill, F4 mean imputer **[1.56364]**

#### **linear interpolation**
17. F1 bfill, F3 linear interpolation, F4 mean imputer **[1.57352]**
18. F1 ffill, F3 linear interpolation, F4 mean imputer **[1.57222]**
19. F1 linear interpolation, F3 bfill, F4 mean imputer **[1.60245]**
20. F1 linear interpolation, F3 ffill, F4 mean imputer **[1.60366]**
21. F1 linear interpolation, F3 linear interpolation, F4 mean imputer **[1.53327]**
22. F1 linear interpolation, F3 mean imputer, F4 mean imputer **[1.45895]**
23. F1 linear interpolation, F3 median imputer, F4 mean imputer **[1.45901]**
24. F1 mean imputer, F3 linear interpolation, F4 mean imputer **[1.49259]**
25. F1 median imputer, F3 linear interpolation, F4 mean imputer **[1.49266]**
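
To make the five methods concrete, the sketch below applies each one to a tiny series with known ground truth and scores only the imputed positions. It assumes the bracketed scores above are RMSE values (this notebook does not state the metric); `truth`, `observed`, `impute`, and `rmse` are names made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy ground truth and a copy with two values hidden (positions 1 and 4).
truth = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
observed = truth.copy()
observed.iloc[[1, 4]] = np.nan
mask = observed.isna()

def impute(s, method):
    """Fill NaNs in a Series with one of the five methods compared above."""
    if method == "mean":
        return s.fillna(s.mean())
    if method == "median":
        return s.fillna(s.median())
    if method == "ffill":
        return s.ffill().bfill()  # bfill covers leading NaNs
    if method == "bfill":
        return s.bfill().ffill()  # ffill covers trailing NaNs
    if method == "linear":
        return s.interpolate(method="linear", limit_direction="both")
    raise ValueError(f"Unknown method: {method!r}")

def rmse(pred, target):
    """Root mean squared error between two aligned Series."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Score each method on the hidden positions only.
methods = ["mean", "median", "ffill", "bfill", "linear"]
scores = {m: rmse(impute(observed, m)[mask], truth[mask]) for m in methods}
print(scores)
```

The toy series is ordered on purpose, so linear interpolation recovers it exactly. That ranking is not representative of the results listed above, where mean imputation scores best.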

In [None]:
# SELECT STRATEGY
current_strategy = 25 # See strategy list above. [Integer (1-25)]

# Check that the strategy number is within the listed range (1-25)
if current_strategy < 1 or current_strategy > 25:
    raise ValueError("Please enter a valid number, see the list of strategies above.")

index_strategy = current_strategy - 1

In [None]:
# List of strategies as [F1 method, F3 method], in the same order as the list above
strategy = [['mean','mean'],
            ['mean','median'],
            ['median','mean'],
            ['median','median'],
            
            ['ffill','ffill'],
            ['ffill','bfill'],
            ['bfill','ffill'],
            ['bfill','bfill'],
            ['ffill','mean'],
            ['bfill','mean'],
            ['ffill','median'],
            ['bfill','median'],
            ['mean','ffill'],
            ['mean','bfill'],
            ['median','ffill'],
            ['median','bfill'],
            
            ['bfill','linear'],
            ['ffill','linear'],
            ['linear','bfill'],
            ['linear','ffill'],
            ['linear','linear'],
            ['linear','mean'],
            ['linear','median'],
            ['mean','linear'],
            ['median','linear'],
           ]

In [None]:
def print_current_strategy(codename):
    """Return a readable name for a strategy codename."""
    if codename in ('mean', 'median'):
        strategy_name = f"{codename} imputation."
    elif codename == 'ffill':
        strategy_name = "forward fill imputation."
    elif codename == 'bfill':
        strategy_name = "back fill imputation."
    elif codename == 'linear':
        strategy_name = "linear interpolation imputation."
    else:
        raise ValueError(f"Unknown strategy codename: {codename!r}")
    return strategy_name

In [None]:
# Print used strategy

F1_code = strategy[index_strategy][0]
F3_code = strategy[index_strategy][1]

print("Current strategy: ")
print("F1: ",print_current_strategy(F1_code))
print("F3: ",print_current_strategy(F3_code))

In [None]:
# Import libs
import numpy as np
import pandas as pd
from pathlib import Path
from tqdm import tqdm
pd.options.mode.chained_assignment = None  # default='warn'
pd.set_option('display.max_columns', 100)

from sklearn.impute import SimpleImputer

In [None]:
# Read data file
input_path = Path('/kaggle/input/tabular-playground-series-jun-2022/')

data = pd.read_csv(input_path / 'data.csv', index_col='row_id')
submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='row-col')

### **Process**

In [None]:
# Divide the data into 4 groups, using the second '_'-separated token of each column name
features = list(data.columns)
F = [[], [], [], [], []]  # index 0 is unused so that F[i] holds the columns of group i
for feature in features:
    for i in [1, 2, 3, 4]:
        if feature.split('_')[1] == str(i):
            F[i].append(feature)
df = [[], [], [], [], []]

for i in [1, 2, 3, 4]:
    df[i] = data[F[i]]

F1_df = df[1].copy()
F2_df = df[2].copy()
F3_df = df[3].copy()
F4_df = df[4].copy()

In [None]:
# Fill missing values with SimpleImputer (mean, median) or pandas (ffill, bfill, linear)
mean_imp = SimpleImputer(
        missing_values=np.nan,
        strategy='mean')
median_imp = SimpleImputer(
        missing_values=np.nan,
        strategy='median')

def used_simple_imputer(codename, data):
    if codename == 'mean':
        # Pass index explicitly: fit_transform returns a NumPy array without row labels
        data_imputed = pd.DataFrame(mean_imp.fit_transform(data), columns=data.columns, index=data.index)
    elif codename == 'median':
        data_imputed = pd.DataFrame(median_imp.fit_transform(data), columns=data.columns, index=data.index)
    elif codename == 'ffill':
        data_imputed = data.ffill()
        data_imputed = data_imputed.bfill()  # fill leading NaNs that have no previous value
    elif codename == 'bfill':
        data_imputed = data.bfill()
        data_imputed = data_imputed.ffill()  # fill trailing NaNs that have no next value
    elif codename == 'linear':
        data_imputed = data.interpolate(method='linear', limit_direction="both")
    else:
        raise ValueError(f"Unknown strategy codename: {codename!r}")
    return data_imputed
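
One detail worth keeping in mind when wrapping `SimpleImputer` output: `fit_transform` returns a NumPy array, so a DataFrame rebuilt from it gets a fresh 0..n-1 index unless `index=` is passed. The later `join` aligns on the index, so this only goes unnoticed when `row_id` already runs from 0 to n-1. The sketch below shows the difference on a toy frame with a non-default index; `toy` and `other` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames sharing a non-default index (hypothetical data).
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0]}, index=[10, 11, 12])
other = pd.DataFrame({"b": [7, 8, 9]}, index=[10, 11, 12])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# Without index=, the result is labelled 0, 1, 2 and no longer matches `other`.
without_index = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns)
# With index=, the original row labels are kept.
with_index = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns, index=toy.index)

joined_bad = without_index.join(other)  # column 'b' is all NaN
joined_ok = with_index.join(other)      # column 'b' is 7, 8, 9
print(joined_bad)
print(joined_ok)
```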

In [None]:
# F4 (control): always mean imputation
F4_df_final = pd.DataFrame(mean_imp.fit_transform(F4_df), columns=F4_df.columns, index=F4_df.index)

In [None]:
# Impute F1 and F3
F1_df_final = used_simple_imputer(F1_code, F1_df)
F3_df_final = used_simple_imputer(F3_code, F3_df)

### **Join**

In [None]:
# join dataframe
final_data1 = F1_df_final.join(F2_df)
final_data2 = F3_df_final.join(F4_df_final)
final_data3 = final_data1.join(final_data2)
final_data3.head()

In [None]:
final_data3.tail()

In [None]:
final_data3.describe()

In [None]:
final_data3.info()

In [None]:
# Use row-col from the sample submission to find the imputed values
for i in tqdm(submission.index):
    row = int(i.split('-')[0])
    col = i.split('-')[1]
    submission.loc[i, 'value'] = final_data3.loc[row, col]

submission.to_csv('submission.csv')
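
The loop above performs one `.loc` lookup per submission row, which can be slow for a million rows. A possible vectorized alternative, sketched here on toy data, splits the `row-col` keys once and reads all values with NumPy indexing. `imputed` and `sub` are hypothetical stand-ins for `final_data3` and `submission`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made-up values) for the imputed data and the sample submission.
imputed = pd.DataFrame({"F_1_0": [0.1, 0.2, 0.3], "F_3_1": [1.0, 2.0, 3.0]})
sub = pd.DataFrame(
    {"value": [0.0, 0.0, 0.0]},
    index=pd.Index(["0-F_1_0", "2-F_3_1", "1-F_1_0"], name="row-col"),
)

# Split each 'row-col' key at the first '-' into a row label and a column name.
keys = sub.index.to_series().str.split("-", n=1, expand=True)
rows = keys[0].astype(int).to_numpy()
cols = keys[1].to_numpy()

# Translate labels to integer positions; -1 would mean the label was not found.
row_pos = imputed.index.get_indexer(rows)
col_pos = imputed.columns.get_indexer(cols)
assert (row_pos >= 0).all() and (col_pos >= 0).all()

# Pick one value per (row, column) pair in a single indexing operation.
sub["value"] = imputed.to_numpy()[row_pos, col_pos]
print(sub)
```

Note that converting a frame with both integer and float columns to a single array upcasts everything to float, which is harmless here because only float columns have missing values to submit.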

**If you copy this notebook and try strategies beyond those tested here, please post your strategies and public results in the comments section.**
# **Thank you.**