# In this notebook, I explore the F_2 features to see whether they hide any "magic" that could improve scores in this competition.

# I tried the following things and have included all the plots here.

1. Dissecting the data one F_2 variable at a time and exploring the other features within each slice. For example, for the F_2_1 feature I split the data into separate buckets, one per distinct value of the feature, and analyzed each bucket separately. I also tried imputing each bucket separately.
2. Summing all the F_2 variables in each row and dissecting the data by that sum. **This yields an interesting insight.**

# The insights are interesting, but the analysis has not yet yielded anything "magical" that improves scores. Imputing within the dissections or within the sum groups gives the same scores as imputing without them.
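
As a rough illustration of what bucket-wise imputation can look like, the sketch below fills missing values with the mean of each bucket and falls back to the global mean. It runs on a toy frame. The choice of the mean as the fill value and the helper name `impute_by_bucket` are assumptions for illustration, not necessarily the method behind the scores mentioned above.

```python
import numpy as np
import pandas as pd

def impute_by_bucket(data, bucket_col, target_cols):
    """Fill NaNs in target_cols with the mean of each bucket_col group (hypothetical helper)."""
    out = data.copy()
    # Per-bucket means, broadcast back to the original row order.
    bucket_means = out.groupby(bucket_col)[target_cols].transform('mean')
    out[target_cols] = out[target_cols].fillna(bucket_means)
    # A bucket that is entirely NaN still has gaps; fall back to the global mean.
    out[target_cols] = out[target_cols].fillna(out[target_cols].mean())
    return out

# Toy data: two buckets of F_2_1 with one missing value each.
toy = pd.DataFrame({
    'F_2_1': [0, 0, 1, 1, 1],
    'F_1_0': [1.0, np.nan, 4.0, 6.0, np.nan],
})
imputed = impute_by_bucket(toy, 'F_2_1', ['F_1_0'])
```

The same helper could be called with `'sum'` as the bucket column to impute within the groups defined by the row-wise F_2 sum.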

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.rcParams["figure.figsize"] = (18,6)

In [None]:
df=pd.read_csv('/kaggle/input/tabular-playground-series-jun-2022/data.csv', index_col='row_id')

# A dictionary is created to store a dataframe for each of the F_2 features. So, the dictionary will have 25 dataframes.

# Each dataframe is keyed by its F_2 feature name and has one column per distinct value of that feature. The rows are the F_1, F_3 and F_4 features, and each cell is the share of missing values in that feature within the cut.

In [None]:
# Features whose missingness is profiled: all F_1, F_3 and F_4 columns.
feature_rows = (
    [f'F_1_{i}' for i in range(15)]
    + [f'F_3_{i}' for i in range(25)]
    + [f'F_4_{i}' for i in range(15)]
)

dict_of_df = {}

for j in range(25):
    key = f'F_2_{j}'
    # One column per candidate value 0..17: the share of missing entries in each
    # feature among the rows where this F_2 feature equals that value.
    cuts = {
        f'{key}_{i}': df.loc[df[key] == i, feature_rows].isna().mean()
        for i in range(18)
    }
    # Values that never occur give all-NaN columns, which dropna removes.
    dict_of_df[key] = pd.DataFrame(cuts, index=feature_rows).dropna(axis=1)
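
The same kind of table can also be built in one pass with a `groupby` instead of filtering one value at a time. The sketch below uses a small made-up frame (`toy`, `bucket` and `rows` are illustrative names) and builds the table both ways so the two results can be compared.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'F_2_0': [0, 0, 1, 1, 2, 2],
    'F_1_0': [np.nan, 1.0, 2.0, 3.0, np.nan, np.nan],
    'F_3_0': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
})
rows = ['F_1_0', 'F_3_0']
bucket = 'F_2_0'

# Missing-value share per bucket: rows are features, columns are bucket values.
by_groupby = toy[rows].isna().groupby(toy[bucket]).mean().T

# Reference: filter one value at a time, as in the loop above.
by_filter = pd.DataFrame(
    {value: toy.loc[toy[bucket] == value, rows].isna().mean()
     for value in sorted(toy[bucket].unique())}
)
```

The grouped version only produces columns for values that actually occur, so no `dropna` step is needed afterwards.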

# Each dataframe is plotted as a heatmap to spot cuts where the proportion of missing values is well above or well below the population average of 1.8%.

# There are many cuts where this proportion is significantly higher or lower, but I am not sure whether this can be used to impute the missing values and get a better result.

# Each of these plots also contains a lot of numbers, so I might have missed something. I am putting them out there for others to have a look as well.
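
One possible way to judge whether a cut's missing rate differs from the 1.8% baseline by more than chance is a z-score for a proportion, which takes the bucket size into account. The sketch below is only an assumption about how such a screen could look; the function name `missing_rate_zscore` and the example numbers are made up.

```python
import numpy as np

def missing_rate_zscore(observed_rate, n_rows, baseline=0.018):
    """z-score of an observed missing rate against a baseline proportion."""
    # Standard error of a proportion under the baseline rate.
    standard_error = np.sqrt(baseline * (1.0 - baseline) / n_rows)
    return (observed_rate - baseline) / standard_error

# The same 3% missing rate is weak evidence in a bucket of 100 rows
# and much stronger evidence in a bucket of 10,000 rows.
z_small = missing_rate_zscore(0.03, 100)
z_large = missing_rate_zscore(0.03, 10_000)
```

With this many cuts being inspected at once, some large z-scores are expected by chance alone, so any threshold would need to be fairly strict.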

In [None]:
dict_of_df['F_2_0'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_1'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_2'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_3'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_4'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_5'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_6'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_7'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_8'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_9'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_10'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_11'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_12'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_13'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_14'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_15'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_16'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_17'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_18'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_19'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_20'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_21'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_22'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_23'].style.background_gradient(cmap='coolwarm')

In [None]:
dict_of_df['F_2_24'].style.background_gradient(cmap='coolwarm')

In [None]:
df_list = []

for key in dict_of_df.keys():
    df_list.append(dict_of_df[key])

In [None]:
final_df = pd.concat(df_list, axis=1)

In [None]:
final_df.T.max()

## I also joined all 25 dataframes into one big dataframe so that we can look at the cuts of each F_1, F_3 and F_4 feature across all the F_2 features.

In [None]:
final_df.style.background_gradient(cmap='coolwarm')
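
Since the combined table is large, it may be easier to rank the cuts programmatically than to scan the heatmap by eye. A minimal sketch on a made-up table, where `toy_final` stands in for `final_df` and the numbers are invented:

```python
import pandas as pd

toy_final = pd.DataFrame(
    {'F_2_0_0': [0.018, 0.030], 'F_2_0_1': [0.010, 0.019], 'F_2_1_0': [0.025, 0.017]},
    index=['F_1_0', 'F_3_0'],
)
baseline = 0.018  # population missing rate quoted above

# Long format: one row per (feature, cut) with its deviation from the baseline.
deviation = (toy_final - baseline).stack().rename('deviation').reset_index()
deviation.columns = ['feature', 'cut', 'deviation']

# Largest absolute deviations first.
ranked = deviation.sort_values('deviation', key=lambda s: s.abs(), ascending=False)
top_cut = ranked.iloc[0]
```

Applied to the real table, `ranked.head(20)` would list the cuts that deviate most from the average in either direction.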

In [None]:
col_list=['F_2_0', 'F_2_1', 'F_2_2', 'F_2_3', 'F_2_4', 'F_2_5', 'F_2_6', 'F_2_7',
       'F_2_8', 'F_2_9', 'F_2_10', 'F_2_11', 'F_2_12', 'F_2_13', 'F_2_14',
       'F_2_15', 'F_2_16', 'F_2_17', 'F_2_18', 'F_2_19', 'F_2_20', 'F_2_21',
       'F_2_22', 'F_2_23', 'F_2_24']

# Summing all the F2 variables in each row

In [None]:
df['sum'] = df[col_list].sum(axis=1)

# There are only 66 unique values for the sums of F2 features

In [None]:
df['sum'].nunique()

In [None]:
df['sum'].max()

In [None]:
df['sum'].nsmallest(3)

In [None]:
df[df['sum']==20][col_list]

# Concatenating all F_2 variables in each row

In [None]:
df['concat'] = df[col_list].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

# Each row in the data is unique as far as the F_2 values are concerned

In [None]:
df['concat'].nunique()
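
The uniqueness check does not strictly need a string column. `DataFrame.duplicated` on the F_2 subset answers the same question and avoids the row-wise `apply`, which is typically slow on large frames. A sketch on a toy frame (`toy` and `f2_cols` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({'F_2_0': [1, 1, 2], 'F_2_1': [3, 3, 4], 'F_1_0': [0.1, 0.2, 0.3]})
f2_cols = ['F_2_0', 'F_2_1']

# True if any combination of F_2 values occurs more than once.
has_duplicates = toy.duplicated(subset=f2_cols).any()

# Number of distinct F_2 combinations, comparable to df['concat'].nunique().
n_unique = len(toy.drop_duplicates(subset=f2_cols))
```

On the competition data the equivalent calls would be `df.duplicated(subset=col_list).any()` and `len(df.drop_duplicates(subset=col_list))`.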

In [None]:
df_by_sum = df[['sum', 'concat']].groupby('sum').count().reset_index()

In [None]:
df_by_sum

# When grouped by the sum of the F_2 features, the row counts follow an approximately Normal (bell-shaped) distribution.

# This seems interesting, but whether it actually helps in the competition is yet to be figured out.

In [None]:
sns.barplot(data=df_by_sum, x='sum', y='concat')
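
A bell shape is what the central limit theorem predicts for a sum of many roughly independent columns, so the shape alone may not point to hidden structure. The sketch below illustrates this with simulated data: 25 independent Poisson columns stand in for the F_2 features. The Poisson choice, its rate and the independence are all assumptions made for illustration, not properties of the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 25 F_2 columns: independent Poisson counts.
fake_f2 = pd.DataFrame(rng.poisson(lam=2.0, size=(100_000, 25)))
row_sum = fake_f2.sum(axis=1)

# Both statistics are 0 for an exactly normal distribution;
# values near 0 indicate a bell-shaped histogram.
skewness = row_sum.skew()
excess_kurtosis = row_sum.kurt()
```

Running `df['sum'].skew()` and `df['sum'].kurt()` on the real data would give a quick numeric check of how close the observed distribution is to Normal.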

## I am still exploring other ways to look at the F_2 features. If anyone has done something similar or has any ideas, I would be happy to collaborate.