I just finished Fast.ai's [Practical Deep Learning for Coders Part 1](https://course.fast.ai/)
and, as a next step, I am trying to learn a little more through competitions.
In this notebook, I'll explore this dataset using methods I learned in the course.
I have also run and submitted a number of other beginner-friendly methods, and I will post a comparison of them.

I first ran this whole notebook without fixing the severe class imbalance, then ran it again after fixing it.

Comments and tips on how to improve this are most welcome.


In [None]:
! pip install dtreeviz imbalanced-learn
# dtreeviz helps with decision tree visualization, and
# imbalanced-learn will help create a class-balanced dataset

In [None]:
import numpy as np 
import pandas as pd  
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
from fastai import *
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor
#from dtreeviz.trees import *
from IPython.display import Image
import joblib

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-may-2021/train.csv', low_memory=False)

In [None]:
test_df = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv', low_memory=False)

In [None]:
train_df.drop(columns="id", inplace = True) # Because the ID column is useless

In [None]:
dep_var = "target" 

## Basic EDA

In [None]:
import seaborn as sns
train_df.describe().style.background_gradient(axis=1,cmap=sns.light_palette('green', as_cmap=True))

There are a LOT of zeros in this dataset.

In [None]:
plt.figure(figsize=(5, 30))
plt.spy(train_df.drop(columns="target").sample(n = 100),  markersize=1 )

Seeing all the zeros, I added 1 as a constant to the whole dataframe (`df += 1`) and ran a random forest on that, which gave me my best manual leaderboard score so far: 1.11039. My overall best scores are still from AutoML. But here we go.

## Resampling to fix class imbalance

In [None]:
from imblearn.over_sampling import RandomOverSampler, SMOTE
# from imblearn.under_sampling import NearMiss,RandomUnderSampler
from collections import Counter
from matplotlib import pyplot
from numpy import where

In [None]:
y = train_df["target"]

In [None]:
x = train_df.drop(columns = "target")

In [None]:
counter = Counter(y)
print(counter)

In [None]:
train_df.target.hist()
# Clearly there's a massive imbalance.
# In my other non-resampled trials, the validation set, even when using a
# class-balanced split, would only have classes 2 and 3 in it,
# and this led to a lot of confusion/underfitting.

In [None]:
smote = SMOTE() # this creates the sampler object
# I recently learned that, in general, sklearn-style names that start
# with a capital letter are classes and need to be instantiated before use

While learning about resampling, I found [this article about SMOTE](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/) very helpful. Here is a link to [the original paper on SMOTE](https://arxiv.org/pdf/1106.1813.pdf). The [imbalanced-learn docs](https://imbalanced-learn.org/stable/over_sampling.html#smote-adasyn) are also great.
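
To make the idea behind SMOTE a bit more concrete, here is a rough toy sketch (not imbalanced-learn's actual implementation). It picks a minority point, picks one of its nearest minority neighbours, and creates a synthetic point somewhere on the line between them. The helper name `smote_like_sample` and the toy points are made up for illustration.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point interpolated between a random minority
    point and one of its k nearest minority neighbours (toy sketch)."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to `base`.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical minority-class points (corners of the unit square).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like_sample(minority_points, k=2, rng=random.Random(0))
print(synthetic)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies.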

In [None]:
x_resampled, y_resampled = smote.fit_resample(x, y)


In [None]:
counter = Counter(y_resampled)
print(counter)

In [None]:
x_resampled['target'] = y_resampled  # adding the y to the x

In [None]:
x_resampled.target.value_counts()

In [None]:
x_resampled.to_csv("resampled_train.csv")

## Fastai Tabular Data prep

In [None]:
x_resampled = pd.read_csv("../input/resampled-traincsv/resampled_train.csv")

In [None]:
x_resampled.target.hist() #Shows all classes are equal in number

Fast.ai's tabular methods have an automatic categorical data identifier, [`cont_cat_split`](https://docs.fast.ai/tabular.core.html#cont_cat_split). This function decides whether a column is continuous or categorical based on the cardinality of its values. It's interesting to play around with.
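
As a rough sketch of how a cardinality-based split can work (my understanding of the idea, not fastai's source code): float columns are treated as continuous, and integer columns are treated as continuous only when they have more distinct values than some threshold. The helper `split_by_cardinality`, the threshold of 20, and the toy columns below are assumptions for illustration.

```python
def split_by_cardinality(columns, max_card=20):
    """Toy cardinality-based split.
    `columns` maps a column name to a list of its values."""
    cont_names, cat_names = [], []
    for name, values in columns.items():
        is_float = any(isinstance(v, float) for v in values)
        is_int = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
        if is_float or (is_int and len(set(values)) > max_card):
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names

# Hypothetical columns, each shown on its own (lengths need not match here).
toy = {
    "feature_a": [0, 1, 0, 2, 1, 0],    # few distinct ints -> categorical
    "feature_b": list(range(50)),       # many distinct ints -> continuous
    "feature_c": [0.1, 0.5, 0.1, 0.9],  # floats -> continuous
}
cont_names, cat_names = split_by_cardinality(toy)
print(cont_names, cat_names)
```

Changing `max_card` moves integer columns between the two lists, which is the main knob to play with.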

In [None]:
cont,cat = cont_cat_split(x_resampled, dep_var=dep_var)


In [None]:
cat, cont

In [None]:
splits = RandomSplitter()(range_of(x_resampled))

In [None]:
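# Pass the column lists by keyword: the second positional argument
# of TabularPandas is `procs`, not the categorical column names
to = TabularPandas(x_resampled, cat_names=cat, cont_names=cont,
                   y_names="target", splits=splits)
len(to.train),len(to.valid)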

In [None]:
save_pickle('/kaggle/working/to_resampled.pkl',to)

In [None]:
to = load_pickle('/kaggle/working/to_resampled.pkl')

## Decision Tree 

In [None]:
xs,y = to.train.xs,to.train.y
valid_xs,valid_y = to.valid.xs,to.valid.y

In [None]:
m = DecisionTreeRegressor(max_leaf_nodes=6)
m.fit(xs, y);

In [None]:
from sklearn import tree  # needed for tree.export_graphviz
tree.export_graphviz(m, out_file='tree.dot', feature_names = xs.columns.tolist(),
           rounded = True, proportion = False, precision = 3, filled = True)
!dot -Tpng tree.dot -o tree.png 
from IPython.display import Image
Image(filename = 'tree.png')

In [None]:
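# Legacy (pre-2.0) dtreeviz API; newer releases may need a different call
from dtreeviz.trees import dtreeviz
samp_idx = np.random.permutation(len(y))[:1500]
dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
        fontname='monospace', scale=2, label_fontsize=10,
        orientation='LR')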

What's interesting is that these graphs give us some idea of the feature importance
as found by the decision tree, and they are so very different from the ones found by the random forest
and by other methods.

In [None]:
m.predict(valid_xs)

In [None]:
valid_y.unique()

In [None]:
m.get_n_leaves(), len(xs)

## Random forest with resampled data

In [None]:
xs,y = to.train.xs,to.train.y
valid_xs,valid_y = to.valid.xs,to.valid.y

In [None]:
xs.shape

I am using a high number of n_estimators because, while a lower number can already give a good reduction in error rate, proximity measures do not improve with fewer trees. Sadly, I haven't been able to figure out how to plot them in Python.

In [None]:
def rf(xs, y, n_estimators=1000, max_samples=183991,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    # Extra keyword arguments are forwarded to RandomForestClassifier
    return RandomForestClassifier(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True, **kwargs).fit(xs, y)


In [None]:
m = rf(xs, y)

In [None]:
m_probs = m.predict_proba(valid_xs)


In [None]:
m_probs

In [None]:
from sklearn.metrics import log_loss
score = log_loss(valid_y, m_probs)
score
# That's a good score, but maybe it's overfitted?

The OOB score and the validation score are roughly the same, which suggests that nothing major is missing and that there is no skew in the random forest classifier.

In [None]:
m.oob_score_

In [None]:

m.score(valid_xs, valid_y)

In [None]:
joblib.dump(m, "/kaggle/working/random_forest_resampled")

Let's look at the feature importance

In [None]:
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
                       ).sort_values('imp', ascending=False)

In [None]:
fi = rf_feat_importance(m, xs)
fi[:10]

These features are so very different from the ones that the decision tree found.
Interestingly, in my experiments with LightGBM, Tabular NN, XGBoost and CatBoost, roughly the same features seem to be the most important.

In [None]:
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(fi[:30]);

In [None]:
import scipy.stats  # needed for scipy.stats.spearmanr
from scipy.cluster import hierarchy as hc

def cluster_columns(df, figsize=(10,16), font_size=12):
    # Spearman rank correlation between every pair of columns
    corr = np.round(scipy.stats.spearmanr(df).correlation, 4)
    # Turn correlation into a distance and cluster hierarchically
    corr_condensed = hc.distance.squareform(1-corr)
    z = hc.linkage(corr_condensed, method='average')
    fig = plt.figure(figsize=figsize)
    hc.dendrogram(z, labels=df.columns, orientation='left', leaf_font_size=font_size)
    plt.show()

What a cluster chart shows is how similar the various features are to one another (here measured by Spearman rank correlation). The earlier the split, the less similar the features. In this case, none of the features seem to be too similar, as they all split up pretty early. This is useful for random forests because they can benefit from pruning the features and removing redundant ones.
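
To show what the clustering is built on, here is a small pure-Python sketch of the distance used between two columns: one minus their Spearman rank correlation. The helper names and toy columns are made up for illustration; `cluster_columns` itself relies on scipy for this.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman_distance(a, b):
    """1 - Spearman correlation: near 0 for columns that move together."""
    return 1 - pearson(ranks(a), ranks(b))

x = [1, 2, 3, 4, 5]
monotone = [10, 20, 40, 80, 160]  # same ordering as x
reversed_x = [5, 4, 3, 2, 1]      # opposite ordering
d_same = spearman_distance(x, monotone)
d_opposite = spearman_distance(x, reversed_x)
print(d_same, d_opposite)
```

Columns with a small distance merge close to the leaves of the dendrogram, while columns with a large distance only join near the root.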

In [None]:
cluster_columns(xs)


### Run the resampled Model on test data

In [None]:
test_df= pd.read_csv("../input/tabular-playground-series-may-2021/test.csv")
test_df.drop(columns="id", inplace =True)

In [None]:
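# Shift the test features by 1, as in the df += 1 experiment described earlier.
# Note: train_df was not shifted in this notebook, so only run this cell
# if the model was trained on shifted data.
test_df += 1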

In [None]:
# We need to feed this model a TabularPandas object because
# that's what we trained it with.
def load_pandas(fname):
    "Load in a `TabularPandas` object from `fname`"
    distrib_barrier()
    with open(fname, 'rb') as f:  # close the file once it has been read
        return pickle.load(f)


In [None]:
save_pickle('/kaggle/working/to_resampled.pkl',to)

In [None]:
to_load = load_pandas('/kaggle/working/to_resampled.pkl')

In [None]:
to_new = to_load.train.new(test_df)

In [None]:
testing_df = to_new.xs

In [None]:

test_preds = m.predict_proba(testing_df)  # testing_df holds the processed test features


In [None]:
submission = pd.read_csv("../input/tabular-playground-series-may-2021/sample_submission.csv")
submission.iloc[:, 1:] = test_preds  # predict_proba already returns a NumPy array
submission.to_csv('rf_resampled_smot.csv', index = False)

In [None]:
!kaggle competitions submit -c tabular-playground-series-may-2021 -f rf_resampled_smot.csv -m "RF with SMOTE-resampled data"

The Kaggle score for this entry was 1.28073.

Let's look at permutation importance.

Permutation importance estimates feature importance by looking at how much the score (accuracy, F1, etc.) decreases when a feature is not available.

But selecting features based on this did not make much of a difference and actually worsened my scores, I think because the difference is 0.0002 or less for the worst-performing features.
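
Here is a minimal pure-Python sketch of the idea, assuming accuracy as the score: shuffle one column so it no longer carries information, then measure how much the score drops. The toy model, data, and helper names are hypothetical; the notebook itself uses `eli5` for this.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(model, rows, labels, col, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling column `col`."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled_col = [r[col] for r in rows]
        rng.shuffle(shuffled_col)
        shuffled_rows = [r[:col] + [v] + r[col + 1:]
                         for r, v in zip(rows, shuffled_col)]
        drops.append(baseline - accuracy(model, shuffled_rows, labels))
    return sum(drops) / n_repeats

# Hypothetical toy "model": predicts 1 when the first feature is positive
# and ignores the second feature entirely.
def toy_model(row):
    return int(row[0] > 0)

rows = [[-2, 7], [-1, 3], [1, 9], [2, 1], [3, 4], [-3, 8]]
labels = [0, 0, 1, 1, 1, 0]

imp_used = permutation_importance(toy_model, rows, labels, col=0)
imp_ignored = permutation_importance(toy_model, rows, labels, col=1)
print(imp_used, imp_ignored)
```

A feature the model never uses gets an importance of exactly zero, which is why features with tiny values like 0.0002 are hard to tell apart from noise.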
 

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(m, random_state=31).fit(valid_xs, valid_y)
arry = eli5.show_weights(perm, feature_names = valid_xs.columns.tolist())

In [None]:
arry

In [None]:
from sklearn.feature_selection import SelectFromModel
sel = SelectFromModel(perm, threshold=0.0005, prefit=True)
xs_trans = sel.transform(xs)


In [None]:
valid_xs_trans = sel.transform(valid_xs)

In [None]:
test_xs = sel.transform(testing_df)

In [None]:
m = rf(xs_trans, y)

In [None]:
m.score(xs_trans, y)

In [None]:
m.oob_score_

In [None]:
m.score(valid_xs_trans, valid_y)

## FastAI Tabular NN Trial with resampled data

In [None]:
train_df_nn= pd.read_csv("../input/resampled-traincsv/resampled_train.csv", low_memory=False, )

In [None]:
train_df_nn.drop(columns="Unnamed: 0", inplace=True) # I forgot to pass index=False while saving the file

In [None]:
train_df_nn.head()

In [None]:
train_df_nn.target.value_counts()

In [None]:
dep_var = "target"

In [None]:
cont_nn,cat_nn = cont_cat_split(train_df_nn, dep_var=dep_var)

In [None]:
cat_nn

`procs` is another fastai concept: you select the kinds of processing you would like the tabular data to go through, and fastai applies them like a pipeline.

`Categorify` takes every categorical variable, builds a map from integers to unique categories, and then replaces the values with the corresponding index.
`FillMissing` fills the missing values in the continuous variables with the median of the existing values (you can choose a specific value if you prefer).
`Normalize` normalizes the continuous variables (subtract the mean and divide by the std).
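
To illustrate what these three steps do to a column, here is a plain-Python sketch. It is a simplification, not fastai's implementation; the function names are made up, and reserving index 0 for unknown values is an assumption of this sketch.

```python
from statistics import mean, median, pstdev

def categorify(values):
    """Map each distinct category to an integer index
    (index 0 is left free for unknown or missing categories)."""
    vocab = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [vocab[v] for v in values], vocab

def fill_missing(values):
    """Replace None with the median of the observed values."""
    med = median(v for v in values if v is not None)
    return [med if v is None else v for v in values]

def normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

codes, vocab = categorify(["b", "a", "b", "c"])
filled = fill_missing([1.0, None, 3.0, 5.0])
scaled = normalize(filled)
print(codes, filled, scaled)
```

In a real pipeline the vocabulary, median, mean and std are computed on the training split only and then reused on the validation and test data.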



In [None]:
procs = [Categorify, Normalize] 

In [None]:
splits = RandomSplitter()(range_of(train_df_nn))

In [None]:
to_nn = TabularPandas(train_df_nn, procs, cat_nn, cont_nn,
                      splits=splits, y_names=dep_var)

In [None]:
save_pickle("to_nn_resampeled.pkl", to_nn)

In [None]:
to_nn

In [None]:
dls = to_nn.dataloaders(1024) 
# A smaller batch size is probably better, but this works
# because tabular data doesn't take up much memory


In [None]:
y = to_nn.train.y
y.min(),y.max()
# Once we have the range of y, we can pass it to the learner as y_range

In [None]:
from sklearn.metrics import confusion_matrix, hamming_loss
# The Hamming loss is the fraction of labels that are incorrectly predicted.
hammingloss= HammingLoss() 

# In multilabel classification, Accuracy computes subset accuracy: the set of
# labels predicted for a sample must exactly match the corresponding set 
# of labels in y_true.



`tabular_learner` creates a NN customized for your data and automatically picks up most things that would otherwise need to be declared.


In [None]:
learn = tabular_learner(dls,  y_range=(0,3),  wd=0.1,  
                        metrics=[accuracy,hammingloss])
 

In [None]:
learn.lr_find()

Fast.ai's `fit_one_cycle` method is based on [Leslie Smith's 1cycle policy](https://arxiv.org/pdf/1803.09820.pdf). For a more graphical and intuitive explanation, check out [Sylvain Gugger's post](https://sgugger.github.io/the-1cycle-policy.html). I find that this leads to much faster improvements than most other methods I've tried.
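
As a rough sketch of the learning-rate side of the 1cycle idea: the rate warms up from a small value to the maximum, then anneals down to a much smaller value. The cosine shape and the values `div=25`, `div_final=1e5` and `pct_start=0.25` reflect my understanding of fastai's defaults and should be treated as assumptions. The real schedule also varies momentum, which is left out here.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

def one_cycle_lr(pct, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress `pct` in [0, 1]."""
    if pct < pct_start:
        # Warm-up: lr_max / div -> lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing: lr_max -> lr_max / div_final
    return cos_anneal(lr_max, lr_max / div_final,
                      (pct - pct_start) / (1 - pct_start))

lr_max = 3e-3
schedule = [one_cycle_lr(i / 10, lr_max) for i in range(11)]
for i, lr in enumerate(schedule):
    print(f"{i * 10:3d}% of training: lr = {lr:.2e}")
```

The value passed to `fit_one_cycle` plays the role of `lr_max` here, so most of the training run happens at rates below it.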

In [None]:
learn.fit_one_cycle(10, 3e-3)
# As a rule of thumb, pick an LR that is about
# 10 times smaller than the LR at which the loss is lowest

In [None]:
learn.recorder.plot_loss()
# As you can see, the loss hasn't levelled off and there doesn't seem
# to be any overfitting, so we can run this for longer

In [None]:
learn.lr_find()

In [None]:
learn.fit_one_cycle(50, 8e-8) 

In [None]:
learn.recorder.plot_loss()

OK, the validation loss has levelled off, and the training loss is showing no sign of decreasing. Training for longer would just lead to overfitting.

I have experimented with up to 1000 epochs with adjustments to the LR, but the score doesn't significantly improve beyond this point.

In [None]:
learn.lr_find()

In [None]:
learn.fit_one_cycle(50, 4e-6)

In [None]:
learn.recorder.plot_loss()

In [None]:
# fastai has a built-in confusion matrix method
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

We can see that accuracy ranks `class_2` > `class_1` > `class_4` > `class_3`, and that `class_3` is most frequently confused with `class_2`.

If you remember the class imbalance, it was `{'Class_2': 57497, 'Class_3': 21420, 'Class_4': 12593, 'Class_1': 8490}`,
and we would have expected class_1 and class_4 to have worse classification. But SMOTE has helped.
The next way to improve this might be to resample using other methods (oversampling, undersampling, etc.). But without knowing more about the data, it's difficult to do other data engineering at my level.

In [None]:
test_dl = dls.test_dl(test_df)

In [None]:
test_preds, _ = learn.get_preds(dl = test_dl)
test_preds.shape

In [None]:
submission = pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')
submission.iloc[:, 1:] = test_preds.numpy()  # convert the tensor to a NumPy array

In [None]:
submission.to_csv('nn_resamp_100ep.csv', index = False)

The Kaggle score for this entry was 1.31014, so the random forest actually did better.