# The February 2022 competition with Fastai v2

This notebook is a quick demonstration of how to use the Fastai v2 library for a Kaggle tabular competition. Fastai v2 is based on PyTorch and lets you build a decent machine learning application. For more information please visit the Fastai documentation: https://docs.fast.ai/. I refer to "Chapter 9, Tabular Modelling Deep Dive" and the notebook "09_tabular.ipynb".

This monthly competition is a classic classification task: find the correct bacteria species for each row of the provided test data set. In this notebook I use a neural network approach and train the network on the provided training data set.

Let's start by importing what we need.

In [None]:
from fastai.tabular.all import * 
from fastai.test_utils import show_install
from math import factorial

show_install()

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

In [None]:
def set_seed_value(seed=718):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

set_seed_value()

In [None]:
path = Path('../input/tabular-playground-series-feb-2022')
Path.BASE_PATH = path
path.ls()

In [None]:
train_df = pd.read_csv(os.path.join(path, 'train.csv')).set_index("row_id")
test_df = pd.read_csv(os.path.join(path, 'test.csv')).set_index("row_id")
sample_submission = pd.read_csv(os.path.join(path, 'sample_submission.csv'))

I use Pandas to import the data and to check whether there are null or missing values. The result shows that the data set is complete, so no additional data completion is needed. That's a good result!

In [None]:
train_df.isna().sum().sum(), test_df.isna().sum().sum(), train_df.isnull().sum().sum(), test_df.isnull().sum().sum()

Let's have a look at the training data set.

In [None]:
train_df.head()

We must define our dependent variable 'target' and get the number of different bacteria species in the training data set. 10 different species are detected.

In [None]:
dep_var= 'target'
indep_vars = [e for e in train_df.columns if e != dep_var]

train_df[dep_var] = train_df[dep_var].astype('category')
nunOfBacteriaTypes = len(train_df[dep_var].unique())

print('Number of detected species ', nunOfBacteriaTypes)

That's a good result: all species (target values) are equally distributed, so there is no skewness!

Other participants in this competition mentioned a published paper that describes an analysis method for bacterial species. The following link refers to the publication: https://www.frontiersin.org/articles/10.3389/fmicb.2020.00257/full
The paper describes a method in which bias values are added to the original measured values, and how these are transformed into values like ours. In the following cells I add a function that adds the bias, scales the result and transforms the floating point values into integer values. These transformed values are used as training data for the neural network below.

In [None]:
do_conversion = True

In [None]:
def get_bias(w, x, y, z):
    return factorial(10) / (factorial(int(w)) * factorial(int(x)) * factorial(int(y)) * factorial(int(z)) * 4**10)

def convert_to_FBC_spectrum(df):
    for col in indep_vars:
        coeffs = re.findall(r'\d+', col)
        bias = get_bias(coeffs[0], coeffs[1], coeffs[2], coeffs[3])
        df[col] = ((df[col] + bias) *1e6).round()
        df[col] = df[col].astype(int)
    return df
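
To make the bias term concrete, here is a minimal standalone sketch of the same formula as `get_bias`, using only the standard library. It assumes column names in the style `A<w>T<x>G<y>C<z>` whose four counts add up to 10; the two names used below and the helper `multinomial_bias` are illustrative assumptions, not taken from the data set.

```python
import re
from math import factorial


def multinomial_bias(column_name: str, k: int = 10) -> float:
    """Multinomial probability of one base-count combination, assuming
    four equally likely bases and k draws (same formula as get_bias)."""
    counts = [int(c) for c in re.findall(r"\d+", column_name)]
    denominator = 4 ** k
    for c in counts:
        denominator *= factorial(c)
    return factorial(k) / denominator


# Hypothetical column names, used for illustration only.
bias_all_c = multinomial_bias("A0T0G0C10")  # only one possible arrangement
bias_mixed = multinomial_bias("A2T3G2C3")   # many possible arrangements

print(bias_all_c, bias_mixed)
```

The more ways a count combination can be arranged, the larger its bias, which is why the mixed column gets a much larger value than the all-C column.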

In [None]:
if do_conversion:
    train_df = convert_to_FBC_spectrum(train_df)
    test_df = convert_to_FBC_spectrum(test_df)

In [None]:
train_df.head()

Let's list the number of distinct values for each column. We can use this info later, when we split the columns into categorical and continuous variables.

In [None]:
# number of distinct values per column, smallest first
val_counts = pd.DataFrame({
    'col': indep_vars,
    'value_counts': [train_df[col].nunique() for col in indep_vars],
})

val_counts = val_counts.sort_values(by=['value_counts'], ascending=True)
val_counts.head(10)

Next we check whether the training data set contains duplicated rows. We can remove these duplicates by setting the flag 'drop_duplicates'.

In [None]:
drop_duplicates = False

In [None]:
train_df.duplicated().sum()

In [None]:
if drop_duplicates:
    train_df = train_df.drop_duplicates()

We know that the test data set also contains duplicated rows. We collect their indices and store them in a list of tuples, so that we can verify the predicted species later on. The assumption is that all rows with identical features must get the same species name.

In [None]:
def get_index_of_duplicated(df):
    tmp_df = df.copy()
    tmp_df = tmp_df[tmp_df.duplicated(keep=False)]
    duplicated_ids = tmp_df.groupby(list(tmp_df)).apply(lambda x: tuple(x.index)).tolist()
    return duplicated_ids
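
The idea behind the helper above can be shown on a tiny example. This is a dependency-free sketch, not the Pandas implementation from the notebook: the toy rows, the toy predictions and the helpers `group_duplicate_ids` and `count_inconsistent` are all made up for illustration.

```python
from collections import defaultdict


def group_duplicate_ids(rows):
    """Group row ids that share exactly the same feature values.
    Only groups with more than one member are returned."""
    groups = defaultdict(list)
    for row_id, features in rows.items():
        groups[tuple(features)].append(row_id)
    return [tuple(ids) for ids in groups.values() if len(ids) > 1]


def count_inconsistent(groups, predictions):
    """Count duplicate groups whose members got different predictions."""
    return sum(1 for ids in groups if len({predictions[i] for i in ids}) != 1)


# Toy data: row id -> feature values (hypothetical).
toy_rows = {
    100: [1, 0, 3],
    101: [2, 2, 2],
    102: [1, 0, 3],
    103: [5, 1, 0],
    104: [2, 2, 2],
    105: [1, 0, 3],
}
duplicate_groups = group_duplicate_ids(toy_rows)

# Toy predictions: rows 101 and 104 are duplicates but disagree.
toy_predictions = {100: "a", 101: "b", 102: "a", 103: "c", 104: "a", 105: "a"}
n_bad = count_inconsistent(duplicate_groups, toy_predictions)

print(duplicate_groups, n_bad)
```

Each tuple holds the ids of rows with identical features, so a consistent model must assign one single label per tuple.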

In [None]:
duplicated_test_ids = get_index_of_duplicated(test_df)

len(duplicated_test_ids)

Let's look at the first 5 entries of the list. The rows (247934, 260325, 267489, 280701, 299358) must get the same predicted species later on, as must (294925, 296888), and so on.

In [None]:
duplicated_test_ids[:5]

How are the target values or species distributed in the training data set?
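
One way to answer this question is to count the labels and convert the counts to percentages. The sketch below does this on a made-up label list with the standard library; the species names are placeholders. In the notebook itself, something like `train_df[dep_var].value_counts(normalize=True)` should give the same kind of overview.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: (count, share in percent)} sorted by label."""
    counts = Counter(labels)
    total = len(labels)
    return {name: (n, 100.0 * n / total) for name, n in sorted(counts.items())}


# Hypothetical labels, for illustration only.
toy_labels = ["species_a"] * 4 + ["species_b"] * 4 + ["species_c"] * 2
shares = class_shares(toy_labels)

for name, (n, pct) in shares.items():
    print(f"{name}: {n} rows, {pct:.1f}%")
```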

I need one list of the column names that are candidates for categorical variables and one list of the remaining columns, the continuous variables. The Fastai library offers the function 'cont_cat_split' to do this for us. The original training data set contains only floating point values for the independent variables, so we would expect no categorical variables. Note that after the integer conversion above, columns with only a few distinct values may end up in the categorical list.

In [None]:
cont_vars, cat_vars = cont_cat_split(train_df, dep_var=dep_var)
len(cat_vars), len(cont_vars)
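
The rule behind this split, as I understand it, is based on the data type and the number of distinct values: float columns are always continuous, and integer columns are continuous only if they have more distinct values than a threshold (I believe the default `max_card` is 20). The following sketch is a simplified re-implementation on plain Python lists, not the Fastai code; `split_by_cardinality` and the toy columns are assumptions for illustration.

```python
def split_by_cardinality(columns, max_card=20):
    """Simplified version of a cardinality-based cont/cat split.
    columns maps a column name to its list of values."""
    cont_names, cat_names = [], []
    for name, values in columns.items():
        is_float = all(isinstance(v, float) for v in values)
        is_int = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
        if is_float or (is_int and len(set(values)) > max_card):
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names


# Hypothetical columns, for illustration only.
toy_columns = {
    "float_col": [0.1, 0.2, 0.1],  # floats: continuous
    "int_many": list(range(50)),   # 50 distinct integers: continuous
    "int_few": [0, 1, 2, 1, 0],    # 3 distinct integers: categorical
}
cont, cat = split_by_cardinality(toy_columns)
print(cont, cat)
```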

The next step is to create a data loader. The Fastai library offers a powerful helper called 'TabularPandas'. It needs the data frame, the lists of categorical and continuous variables, the dependent variable and a splitter. The splitter divides the data set into two parts: one for training and one for validation, which is used to evaluate the model after each epoch. The batch size defaults to 1024 (the call below uses 2048), because we have a large data set. We can use a random split because the rows in the data set are independent.

Use the option 'genSmallDataset=True' for a quick turnaround or for prototyping: it trains on a random sample of 25000 rows, which makes learning faster.

In [None]:
def getData(df, batchSize=1024, randomSplit=True, genSmallDataset=False):
    
  if genSmallDataset:
    # sample 25000 distinct rows for fast prototyping
    example_idx = np.random.choice(len(df), 25000, replace=False)
    df = df.iloc[example_idx]

  splits = None
  if randomSplit:
    splits = RandomSplitter(valid_pct=0.2, seed=718)(range_of(df))
  else:
    # ordered split: first 80% for training, last 20% for validation
    cut = int(0.8 * len(df))
    splits = (L(list(range(cut))), L(list(range(cut, len(df)))))
  to_train = TabularPandas(df, 
                           [Categorify, Normalize],
                           cat_vars,
                           cont_vars, 
                           splits=splits,  
                           device = device,
                           y_block=CategoryBlock(),
                           y_names=dep_var) 

  return to_train.dataloaders(bs=batchSize)
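
What the random 80/20 split does can be sketched without Fastai. This is a minimal illustration with the standard library, assuming a plain shuffle of the row positions; it will not reproduce the exact indices that `RandomSplitter` returns, and `random_split` is a hypothetical helper.

```python
import random


def random_split(n_items, valid_pct=0.2, seed=718):
    """Shuffle the positions 0..n_items-1 and cut off valid_pct of them
    as validation indices. Returns (train_idx, valid_idx)."""
    rng = random.Random(seed)  # local generator, so the split is repeatable
    indices = list(range(n_items))
    rng.shuffle(indices)
    cut = int(valid_pct * n_items)
    return indices[cut:], indices[:cut]


train_idx, valid_idx = random_split(10)
print(len(train_idx), len(valid_idx))
```

Fixing the seed matters here: with the same seed, every run validates on the same rows, so the validation loss stays comparable between experiments.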

In [None]:
dls = getData(train_df, batchSize=2048, randomSplit=True, genSmallDataset=False)
len(dls.train), len(dls.valid), type(dls.train), dls.train.device

Let's look at the transformed data, which will be used in the network later.

In [None]:
dls.show_batch()

In [None]:
class BatchSwapNoise(Module):
    "Swap Noise Module"
    def __init__(self, p): store_attr()

    def forward(self, x):
        if self.training:
            mask = torch.rand(x.size()) > (1 - self.p)
            l1 = torch.floor(torch.rand(x.size()) * x.size(0)).type(torch.LongTensor)
            l2 = (mask.type(torch.LongTensor) * x.size(1))
            res = (l1 * l2).view(-1)
            idx = torch.arange(x.nelement()) + res
            idx[idx>=x.nelement()] = idx[idx>=x.nelement()]-x.nelement()
            return x.flatten()[idx].view(x.size())
        else:
            return x

In [None]:
#Cell
class TabularModelBatchSwapNoise(Module):
    "Basic model for tabular data."
    def __init__(self, emb_szs, n_cont, out_sz, layers, ps=None, embed_p=0.,
                 y_range=None, bs_noise=None, use_bn=True, bn_final=False, bn_cont=True, act_cls=nn.ReLU(inplace=True),
                 lin_first=True):
        
        ps = ifnone(ps, [0]*len(layers))
        if not is_listy(ps): ps = [ps]*len(layers)
        self.embeds = nn.ModuleList([Embedding(ni, nf) for ni,nf in emb_szs])
        self.emb_drop = nn.Dropout(embed_p)
        self.bn_cont = nn.BatchNorm1d(n_cont) if bn_cont else None
        n_emb = sum(e.embedding_dim for e in self.embeds)
        self.n_emb,self.n_cont = n_emb,n_cont
        sizes = [n_emb + n_cont] + layers + [out_sz]
        actns = [act_cls for _ in range(len(sizes)-2)] + [None]
        _layers = [LinBnDrop(sizes[i], sizes[i+1], bn=use_bn and (i!=len(actns)-1 or bn_final), p=p, act=a, lin_first=lin_first)
                       for i,(p,a) in enumerate(zip(ps+[0.],actns))]
        if y_range is not None: _layers.append(SigmoidRange(*y_range))
        self.layers = nn.Sequential(*_layers)
        
        self.bsn_value = bs_noise
        if(self.bsn_value != None): self.batch_swap_noise = BatchSwapNoise(self.bsn_value)

    def forward(self, x_cat, x_cont=None):
        if self.bsn_value != None:
            x_cat = self.batch_swap_noise(x_cat)
            x_cont = self.batch_swap_noise(x_cont)
        if self.n_emb != 0:
            x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
            x = torch.cat(x, 1)
            x = self.emb_drop(x)
        if self.n_cont != 0:
            if self.bn_cont is not None: x_cont = self.bn_cont(x_cont)
            x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
        return self.layers(x)


In [None]:
@delegates(Learner.__init__)
def tabular_learner_with_swap_noise(dls, layers=None, emb_szs=None, config=None, n_out=None, y_range=None,  **kwargs):
    "Get a `Learner` using `dls`, with `metrics`, including a `TabularModel` created using the remaining params."
    if config is None: config = tabular_config()
    if layers is None: layers = [286,100]
    to = dls.train_ds
    emb_szs = get_emb_sz(dls.train_ds, {} if emb_szs is None else emb_szs)
    if n_out is None: n_out = get_c(dls)
    assert n_out, "`n_out` is not defined, and could not be inferred from data, set `dls.c` or pass `n_out`"
    if y_range is None and 'y_range' in config: y_range = config.pop('y_range')
    # use a 'batch swap noise' to swap some columns 
    bs_noise=None
    if 'bs_noise' in config: bs_noise = config.pop('bs_noise')
    model = TabularModelBatchSwapNoise(emb_szs, len(dls.cont_names), n_out, layers, y_range=y_range, bs_noise=bs_noise, **config)
    return TabularLearner(dls, model, **kwargs)

Finally, I create a learner and pass the data loaders into it. The Fastai default is two hidden layers with 200 and 100 elements. Remember that we have 286 independent variables; therefore I chose a hidden layer structure of [286,100] as my default and baseline structure. The learner below uses a much larger network.

Increasing the number of parameters in the neural network should, hopefully, improve the accuracy and the score.

In [None]:
my_config = tabular_config(ps=0.15, 
                           embed_p=0.15, 
                           use_bn=True, 
                           bn_cont=True, 
                           bs_noise=0.15, 
                           y_range=(0, nunOfBacteriaTypes-1))

learn = tabular_learner_with_swap_noise(dls,
                        n_out = nunOfBacteriaTypes,
                        #layer=[286,100], as default or baseline
                        layers=[1024,1024,1204,512,128,32],
                        config=my_config,
                        metrics=[accuracy])
learn.summary()

In [None]:
learn.lr_find()

I will use a maximum learning rate of 5e-3. Starting the learning process is quite easy; I will train for 600 epochs and save the model with the lowest validation loss. The Fastai library offers the SaveModelCallback callback for this. You only need to specify the file name. The option with_opt=True also stores the state of the optimizer. You will find the new file under models/kaggle_tps_feb_2022.pth

In [None]:
learn.fit_one_cycle(600, 5e-3, wd=0.01, cbs=SaveModelCallback(fname='kaggle_tps_feb_2022', with_opt=True)) 
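
The value 5e-3 is only the peak of the learning rate schedule. As a rough sketch of the one-cycle policy, the rate warms up to the maximum with a cosine curve and then anneals down to a very small value. The parameters `div=25`, `div_final=1e5` and `pct_start=0.25` are what I believe the Fastai defaults to be; treat them as assumptions, since the notebook only sets the maximum learning rate.

```python
import math


def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2


def one_cycle_lr(pct, lr_max=5e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress pct in [0, 1]."""
    if pct < pct_start:
        # warm-up phase: lr_max/div -> lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # annealing phase: lr_max -> lr_max/div_final
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))


for pct in (0.0, 0.25, 0.5, 1.0):
    print(f"{pct:.2f} -> {one_cycle_lr(pct):.2e}")
```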

The confusion matrix below shows us the quality of the predictions on the validation data.

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(normalize=True, norm_dec=3, figsize=(10, 10))
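
To read a normalized confusion matrix, it helps to see how it is built. The sketch below counts (actual, predicted) pairs and divides every row by its sum, so each row shows how the rows of one actual class are spread over the predicted classes. As far as I know this row-wise normalization matches what `normalize=True` does, but that is an assumption; the labels below are toy values.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] = number of rows with actual class t and predicted class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def normalize_rows(cm):
    """Divide every row by its sum (rows without samples stay zero)."""
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in cm]


# Toy labels for three classes, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred, 3)
norm = normalize_rows(cm)
print(cm)
print(norm)
```

The diagonal of the normalized matrix is the per-class recall, so values close to 1.0 on the diagonal are what we hope to see.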

To calculate the predictions for this competition, I will load the best model from the training process. Best model means the model with the lowest validation loss.

In [None]:
learn.load('kaggle_tps_feb_2022')

Now it's time to calculate the predictions for the test data set.

In [None]:
dlt = learn.dls.test_dl(test_df, bs=4096) 
preds, _ = learn.get_preds(dl=dlt) 
print(preds[:2])

The predictions are not species names yet: for every row I get one value per species, which acts as a probability for that species. np.argmax returns the index with the maximum probability value, like 1 or 7. This index must be translated into the species name. The function 'vocab.map_ids' returns the names for the index values.

In [None]:
decoded_preds_str = dls.train.categorize.vocab.map_ids(np.argmax(preds, axis=1))
sample_submission[dep_var] =  decoded_preds_str
sample_submission.to_csv("submission.csv", index=False)
sample_submission.head(10)
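
The decoding step can be illustrated without Fastai or NumPy. This is a minimal sketch with made-up probabilities and a placeholder vocabulary; `decode_predictions` is a hypothetical helper that mirrors the argmax followed by the vocabulary lookup.

```python
def decode_predictions(probabilities, vocab):
    """For every row, pick the index of the largest probability
    and translate it into the corresponding class name."""
    decoded = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)
        decoded.append(vocab[best])
    return decoded


# Hypothetical vocabulary and probabilities, for illustration only.
toy_vocab = ["species_a", "species_b", "species_c"]
toy_probs = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
]
decoded = decode_predictions(toy_probs, toy_vocab)
print(decoded)
```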

As the next step I verify that rows with duplicated features in the test data set are predicted as the same species: each group must have exactly 1 unique name. If not, at least one prediction in the group is wrong and the submission is less accurate.

In [None]:
tmp_submission = sample_submission.copy()
tmp_submission.set_index('row_id', inplace=True)

error_count = 0
for idxs in duplicated_test_ids:
    dups = tmp_submission[tmp_submission.index.isin(idxs)]
    num_unique_values = dups.nunique().item()
    if num_unique_values != 1:
        print( dups)
        error_count  += 1
        
print('Number of wrong duplicated predictions ', error_count)

The following table shows the distribution of the different species names in the training data set and in the predictions for the test data set. The total number and the share in percent are shown. I assume that the distribution in the test data should be the same as in the training data set. Do you agree?

In [None]:
dep_var_dist = pd.DataFrame({
    'target_count': train_df[dep_var].value_counts(),
    'target_quota (%)': train_df[dep_var].value_counts() / train_df.shape[0] * 100,
})

dep_var_dist['pred_count'] = pd.Series(decoded_preds_str, index=test_df.index).value_counts()
dep_var_dist['pred_quota (%)'] = dep_var_dist['pred_count'] / len(test_df) * 100
dep_var_dist.sort_index().head(11)

In [None]:
!ls -la 

The end!