# June 2022 Kaggle Competition Notebook
This is my notebook for the June 2022 Kaggle competition (Tabular Playground Series). It has been a while since I participated in one of these.

## Step 0: Setup
Here, I import numpy, pandas, seaborn, and matplotlib, specific classes from sklearn, and the os module that Kaggle notebooks use to list the input files.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.ensemble import RandomForestRegressor
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Step 1: Get the data
Here, I load the provided data and the sample submission into two pandas dataframes. I also use some basic methods to peek into the data.

In [None]:
#Make a dataframe from the csv file containing the data.
df=pd.read_csv("../input/tabular-playground-series-jun-2022/data.csv")

#Print the first few rows of the dataframe.
print(df.head())

#Print the unique data types used in the dataframe.
print(df.dtypes.unique())

#Make another dataframe from the sample submission csv file.
sample_sub=pd.read_csv("../input/tabular-playground-series-jun-2022/sample_submission.csv")

#Print the first ten rows of the sample submission dataframe.
print(sample_sub.head(10))

In [None]:
print(df.describe())

## Step 2: Build Submission DataFrame
In the step above, I saw that the row-col column of the sample submission has values in the following format:
(Row Number)-(Column Name)
Therefore, I need to build a row-col column for my submission CSV that identifies the row and column name of each missing value.

In [None]:
#Get column names.
col_names=df.columns

#Turn the column names array into a list.
col_names=list(col_names)

#Remove the first column as it is an index containing row ids.
col_names.pop(0)

#Find the index locations of every null value in every column.

#For each column, collect the indexes of the rows where the value is null.
col_missing_lists=[df[df[col].isnull()].index.to_list() for col in col_names]

#Make a list for the row-col values.
ls_submission_rc=[]

# For each column, for each index in the list of indexes for missing data in that column, append a string 
# that follows the (Row)-(Column Name) format for each missing entry.
for i in range(len(col_names)):
    for j in col_missing_lists[i]:
        ls_submission_rc.append(str(j)+'-'+col_names[i])
        
#Make a new dataframe for submission.
ls_submission=pd.DataFrame({"row-col":ls_submission_rc})
print(ls_submission.head())

In [None]:
print(col_names)

Check to see if the values in ls_submission are the null values in the original dataframe.
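
A minimal sketch of such a check, run on a small toy dataframe rather than the competition data. The names `toy`, `toy_cols`, `toy_rc`, and `cell_is_null` are hypothetical and only serve the illustration; the labels are built the same way as in Step 2.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition data (column names are hypothetical).
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, 2.0, 3.0, np.nan],
})
toy_cols = ["A", "B"]

# Build the row-col labels column by column, as in Step 2.
toy_rc = [f"{j}-{c}" for c in toy_cols for j in toy.index[toy[c].isnull()]]

# Check 1: there is exactly one label per missing cell.
n_missing = int(toy[toy_cols].isnull().sum().sum())
assert len(toy_rc) == n_missing

# Check 2: every label points at a cell that is actually null.
def cell_is_null(label, frame):
    # Split only on the first '-' so column names containing '-' still work.
    row, col = label.split("-", 1)
    return bool(pd.isna(frame.loc[int(row), col]))

all_null = all(cell_is_null(label, toy) for label in toy_rc)
print(toy_rc, all_null)
```

On the real data, the same idea would apply to `ls_submission["row-col"]` and `df`. Comparing `set(ls_submission["row-col"])` with `set(sample_sub["row-col"])` should be a further sanity check, assuming the sample submission lists every missing cell.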

## Step 3: Impute missing values
Here, we need to decide which method to use for data imputation. The competition page tells us that the data "...contains missing values due to electronic errors." Because the stated cause (electronic errors) is unrelated to the values themselves, I treat the data as missing at random. Based on that, I will start with sklearn's IterativeImputer, which models each feature with missing values as a function of the other features and is appropriate for data that is missing at random.

In [None]:
# Make a copy of the submission dataframe (.copy() keeps ls_submission itself unchanged).
ls_submission_ii=ls_submission.copy()

# Initiate the iterative imputer.
imp=IterativeImputer(max_iter=10,random_state=0,tol=1e-8)

# Select the feature columns of the original dataframe (everything except the row id).
df1=df[col_names]

# Fit the imputer on the dataframe.
imp.fit(df1)

# Transform the data.
df_imp=imp.transform(df1)
df_imp=pd.DataFrame(df_imp,columns=col_names)

# Create a new list for the imputed values.
ii_list=[]

# Find the imputed values corresponding to the cells that are null.
for i in range(len(col_names)):
    for j in col_missing_lists[i]:
        ii_list.append(df_imp[col_names[i]].iloc[j])
        
# Make a new column in the submission dataframe that has our values.
ls_submission_ii["value"]=ii_list

# Output a CSV for submission.
ls_submission_ii.to_csv('ls_submission_ii.csv',index=False)

## Step 4: Re-evaluate
The iterative imputer submission resulted in a public score of 0.98250.
After reading through the scikit-learn documentation, I realized that I had glossed over the iterative imputer's ability to use other machine learning algorithms as its estimator. Since we are told that the continuous features are the only features that have missing values, we need an estimator that takes in and outputs continuous numeric data. I tried LinearRegression and SGDRegressor as estimators.
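
One way to compare estimators without spending a submission is to hide some known values, impute them, and measure the error. The block below is a rough sketch on synthetic data, not the competition data. The helper `masked_rmse`, the 10% masking rate, and the use of RMSE as the error measure are all assumptions made for illustration. `BayesianRidge` is the estimator that `IterativeImputer` uses when none is given.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, LinearRegression

rng = np.random.default_rng(0)

# Synthetic, correlated columns standing in for the continuous features.
n = 300
base = rng.normal(size=(n, 1))
X_true = np.hstack([base + 0.1 * rng.normal(size=(n, 1)) for _ in range(4)])

# Hide roughly 10% of the cells at random; their true values stay in X_true.
mask = rng.random(X_true.shape) < 0.10
X_missing = X_true.copy()
X_missing[mask] = np.nan

def masked_rmse(estimator):
    """Impute X_missing and return the RMSE on the hidden cells only."""
    imp = IterativeImputer(estimator=estimator, max_iter=10, random_state=0)
    X_filled = imp.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2)))

scores = {
    "BayesianRidge": masked_rmse(BayesianRidge()),
    "LinearRegression": masked_rmse(LinearRegression()),
}
print(scores)
```

On the real data, the same masking could be applied to cells of `df[col_names]` that are not null, though how well such a local score tracks the leaderboard is not something this sketch establishes.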

### Step 4A: Linear Regression

In [None]:
# Make a copy of the submission dataframe (.copy() keeps ls_submission itself unchanged).
ls_submission_ii_LR=ls_submission.copy()

# Initiate the iterative imputer.
imp_LR=IterativeImputer(estimator=LinearRegression(),max_iter=10,random_state=0,tol=1e-8)

# Select the feature columns of the original dataframe (everything except the row id).
df1=df[col_names]

# Fit the imputer on the dataframe.
imp_LR.fit(df1)

# Transform the data.
df_imp_LR=imp_LR.transform(df1)
df_imp_LR=pd.DataFrame(df_imp_LR,columns=col_names)

# Create a new list for the imputed values.
ii_LR_list=[]

# Find the imputed values corresponding to the cells that are null.
for i in range(len(col_names)):
    for j in col_missing_lists[i]:
        ii_LR_list.append(df_imp_LR[col_names[i]].iloc[j])
        
# Make a new column in the submission dataframe that has our values.
ls_submission_ii_LR["value"]=ii_LR_list

# Output a CSV for submission.
ls_submission_ii_LR.to_csv('ls_submission_ii_LR.csv',index=False)

### Step 4B: SGDRegressor

In [None]:
# Make a copy of the submission dataframe (.copy() keeps ls_submission itself unchanged).
ls_submission_ii_SGD=ls_submission.copy()

# Initiate the iterative imputer.
imp_SGD=IterativeImputer(estimator=SGDRegressor(),max_iter=10,random_state=0,tol=1e-8)

# Select the feature columns of the original dataframe (everything except the row id).
df1=df[col_names]

# Fit the imputer on the dataframe.
imp_SGD.fit(df1)

# Transform the data.
df_imp_SGD=imp_SGD.transform(df1)
df_imp_SGD=pd.DataFrame(df_imp_SGD,columns=col_names)

# Create a new list for the imputed values.
ii_SGD_list=[]

# Find the imputed values corresponding to the cells that are null.
for i in range(len(col_names)):
    for j in col_missing_lists[i]:
        ii_SGD_list.append(df_imp_SGD[col_names[i]].iloc[j])
        
# Make a new column in the submission dataframe that has our values.
ls_submission_ii_SGD["value"]=ii_SGD_list

# Output a CSV for submission.
ls_submission_ii_SGD.to_csv('ls_submission_ii_SGD.csv',index=False)
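
The three imputation cells above differ only in the estimator, so the shared steps could be factored into one function. The following is a sketch of that idea on toy data. `build_submission` is a hypothetical helper, not part of sklearn or of the notebook above, and passing `estimator=None` falls back to the imputer's default estimator.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

def build_submission(frame, feature_cols, estimator=None):
    """Impute feature_cols and return a row-col/value dataframe (hypothetical helper)."""
    features = frame[feature_cols]
    null_mask = features.isnull()
    imp = IterativeImputer(estimator=estimator, max_iter=10, random_state=0, tol=1e-8)
    filled = pd.DataFrame(imp.fit_transform(features),
                          columns=feature_cols, index=frame.index)
    rows = []
    for col in feature_cols:  # column-major order, as in Step 2
        for idx in frame.index[null_mask[col].to_numpy()]:
            rows.append({"row-col": f"{idx}-{col}", "value": filled.at[idx, col]})
    return pd.DataFrame(rows, columns=["row-col", "value"])

# Toy data with a few missing cells (column names are hypothetical).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["A", "B", "C"])
toy.loc[[3, 10], "A"] = np.nan
toy.loc[[7], "C"] = np.nan

sub_default = build_submission(toy, ["A", "B", "C"])
sub_lr = build_submission(toy, ["A", "B", "C"], estimator=LinearRegression())
print(sub_default)
```

Because the function returns a fresh dataframe on every call, the submissions cannot accidentally share a `value` column, which is the risk when several names point at the same `ls_submission` object.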

## Step 5: Competition End
The following are the results of my submissions:

ls_submission_ii_LR
* Private Score: 0.97939
* Public Score: 0.98252

ls_submission_ii_SGD
* Private Score: 0.99311
* Public Score: 0.99662

ls_submission_ii
* Private Score: 0.97937
* Public Score: 0.98250

Clearly, it is not always the case that swapping in a different estimator will improve the accuracy of the imputation.