This notebook is a quick demonstration of how to use the Fastai v2 library for a Kaggle tabular competition. Fastai v2 is based on PyTorch and allows you to build a decent machine learning application. For more information, please visit the Fastai documentation: https://docs.fast.ai/. I will refer to "Chapter 9, Tabular Modelling Deep Dive" and the notebook "09_tabular.ipynb".

This competition is a binary classification problem: predict whether a passenger was transported. The provided dataset has 14 different features, and many rows have missing values.
In this notebook I will use a neural network approach and train the network on the training data set.

Let's start by importing what we need.

In [None]:
from fastai.tabular.all import * 
from fastai.test_utils import show_install
from IPython.display import display, clear_output
import seaborn as sns
show_install()

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
def set_seed_value(seed=718):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

set_seed_value()

In [None]:
path = Path('../input/spaceship-titanic/')
Path.BASE_PATH = path
path.ls()

Load the datasets and define the dependent variable: Transported

In [None]:
train_df = pd.read_csv(os.path.join(path, 'train.csv'))
test_df = pd.read_csv(os.path.join(path, 'test.csv'))
sample_submission = pd.read_csv(os.path.join(path, 'sample_submission.csv'))

dep_var = 'Transported'

Let's see the first rows of our training data set to get an overview:

In [None]:
train_df.head()

Let's look at the columns and their types. The function info() shows the number of non-null values for each column. If these counts differ between columns or from the total number of rows, the dataset has missing values, usually represented as NaN.

In [None]:
train_df.info()

In [None]:
train_df.describe()

Let's print the number of NaN values for each column.

In [None]:
print(train_df.isna().sum())
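
To make the relationship between info() and isna() concrete, here is a minimal sketch on a made-up three-row frame. The values are invented for illustration and are not taken from the competition data. The non-null count reported by info() is simply the row count minus the NaN count.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0],
    "HomePlanet": ["Europa", "Earth", None],
    "Spa": [0.0, 549.0, 6715.0],
})

nan_counts = toy_df.isna().sum()      # NaN values per column
non_null_counts = toy_df.count()      # what info() lists as the non-null count

print(nan_counts)
print(non_null_counts + nan_counts)   # equals len(toy_df) for every column
```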

The name values are mostly unique and shouldn't have any influence on the 'Transported' value, so I will remove them.

In [None]:
train_df.drop(['Name'], axis=1, inplace=True)
test_df.drop(['Name'], axis=1, inplace=True)

As mentioned in the data description, the values in the columns 'Cabin' and 'PassengerId' are combined string values. The values for 'Cabin' consist of a deck, a number, and a side. The values for 'PassengerId' are the concatenation of a group value and a unique number within that group.
I will define a function to replace the values in the columns 'Cabin' and 'PassengerId' with these sub-values. The original columns can then be dropped.

In [None]:
def split_columns_with_combinded_data(df, drop_orgin:bool=False):
    df[['Deck','Num', 'Side']] = df['Cabin'].str.split('/', expand=True)
    df[['PGroup','PNr']] = df['PassengerId'].str.split('_', expand=True)
    if drop_orgin:
        df.drop(['Cabin', 'PassengerId'], axis=1, inplace=True)
    return df

In [None]:
train_df = split_columns_with_combinded_data(train_df, drop_orgin=True)
test_df = split_columns_with_combinded_data(test_df, drop_orgin=True)
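
A minimal sketch of what the split does, using two made-up rows in the competition's format (the values are invented for illustration). It also shows that a missing 'Cabin' value leads to NaN in all three derived columns, and that the parts come back as strings.

```python
import numpy as np
import pandas as pd

# Toy rows in the competition's format; values invented for illustration
toy_df = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_02"],
    "Cabin": ["B/0/P", np.nan],
})

# expand=True returns one column per part instead of a list per row
toy_df[["Deck", "Num", "Side"]] = toy_df["Cabin"].str.split("/", expand=True)
toy_df[["PGroup", "PNr"]] = toy_df["PassengerId"].str.split("_", expand=True)

print(toy_df)
```

Because the parts are strings, the numeric ones still need a type conversion before they can be used as numbers.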

The columns 'Destination' and 'HomePlanet' appear to be enumeration types. I will print the number of their unique values to check my assumption.

In [None]:
train_df['Destination'].nunique(), train_df['HomePlanet'].nunique(),

Okay, there are only a handful of unique values in both columns. That makes handling the missing values in these columns easier: we can convert them into one-hot encoded values. If a row contains a NaN value in the original column, none of the derived columns contains the value 1; all of them have the value 0.

In [None]:
def convert_to_dummies(df):
    df = pd.get_dummies(df, columns=['Destination'], prefix="D")
    df = pd.get_dummies(df, columns=['HomePlanet'], prefix="H")
    df = pd.get_dummies(df, columns=['Side'])
    df = pd.get_dummies(df, columns=['VIP'])
    df = pd.get_dummies(df, columns=['CryoSleep'])
    df = pd.get_dummies(df, columns=['Deck'])
    
    return df

In [None]:
train_df = convert_to_dummies(train_df)
test_df = convert_to_dummies(test_df)
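
The NaN behaviour described above can be checked on a tiny made-up column (values invented for illustration). With the default dummy_na=False, pandas creates no indicator column for NaN, so the NaN row ends up with zeros everywhere.

```python
import numpy as np
import pandas as pd

# Toy column, invented for illustration only
toy_df = pd.DataFrame({"HomePlanet": ["Earth", "Mars", np.nan]})

# dummy_na defaults to False, so NaN gets no indicator column of its own
encoded = pd.get_dummies(toy_df, columns=["HomePlanet"], prefix="H")

print(encoded)
print(encoded.sum(axis=1))   # 1 for known planets, 0 for the NaN row
```

One caveat when encoding the training and test frames separately: a category that occurs in only one of them produces a column the other frame lacks. Reindexing the test frame to the training columns, for example with reindex(columns=..., fill_value=0), is one way to align them.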

The values of the different service columns span a wide range, from 0 to about 24000. To prevent issues in the later training process, and because these columns are independent variables, I scale them down by dividing by 1024. The ln(x+1) function from numpy (np.log1p) is an alternative and is left commented out in the code.

In [None]:
def scale_service_values(df):
    for s in ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']:
        # df[s] = np.log1p(df[s])
        df[s] = df[s]/1024.0
    return df

In [None]:
train_df = scale_service_values(train_df)
test_df = scale_service_values(test_df)
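
The function above divides by 1024 and leaves np.log1p commented out. The following sketch compares the two options on a few example amounts (chosen by me for illustration) to show how differently they treat large values.

```python
import numpy as np

# Example amounts spanning the range mentioned above
amounts = np.array([0.0, 100.0, 1024.0, 24000.0])

linear = amounts / 1024.0      # what scale_service_values applies
logged = np.log1p(amounts)     # the commented-out alternative, ln(x + 1)

print(linear)   # keeps ratios: the largest value is still 240x the 100.0 entry
print(logged)   # compresses large values; 0 stays 0
```

The linear scaling only changes the unit, while the logarithm also reduces the influence of a few very large spenders. Which one works better here is something to try out.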

I will change the type of some columns, which is needed for the neural network later.

In [None]:
def change_column_type(df):
    df['Num'] = df['Num'].astype('float')
    df['PNr'] = df['PNr'].astype('int')
    df['PGroup'] = df['PGroup'].astype('float')
    
    return df

In [None]:
train_df = change_column_type(train_df)
test_df = change_column_type(test_df)

Let's recheck the number of NaN values for each column after the preprocessing:

In [None]:
print(train_df.isna().sum())

First I will look at the correlation matrix to get a rough idea of how important each feature is.

In [None]:
plt.figure(figsize=(15,15))

corr=train_df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, robust=True, center=0,square=True, linewidths=.6,cmap='rainbow')
plt.title('Correlation')
plt.show()
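
A heatmap is easy to scan but hard to read exact values from. As a complement, this sketch ranks features by their absolute correlation with the target on a small made-up frame (the column names 'signal' and 'noise' and all values are invented for illustration). Keep in mind that Pearson correlation only captures linear relationships.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "signal": [0, 1, 2, 3, 4, 5],
    "noise": [1, 0, 1, 0, 1, 0],
    "Transported": [False, False, False, True, True, True],
})

# Cast to float so the boolean target takes part in the correlation
corr = toy_df.astype(float).corr()

# Absolute correlation of every feature with the target, strongest first
ranking = corr["Transported"].drop("Transported").abs().sort_values(ascending=False)
print(ranking)
```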

I need a list of the column names that are candidates for categorical variables and a list of those that are not, the so-called continuous variables. The Fastai library offers the function 'cont_cat_split' to do this for us. Float columns are classified as continuous, while integer columns with few unique values (up to max_card, 20 by default) and non-numeric columns are classified as categorical, so the one-hot encoded columns will show up in the category list.

In [None]:
cont_vars, cat_vars = cont_cat_split(train_df, dep_var=dep_var)
cont_vars, cat_vars
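
To illustrate the kind of dtype rule such a split applies, here is a simplified sketch in plain pandas. The function simple_cont_cat_split is a hypothetical stand-in written for this example, not the library code, and it assumes the rule is: floats are continuous, integers with more than max_card unique values are continuous, everything else is categorical. The toy frame is invented for illustration.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def simple_cont_cat_split(df, max_card=20, dep_var=None):
    """Hypothetical, simplified stand-in for the splitting rule (not the library code)."""
    cont_names, cat_names = [], []
    for col in df.columns:
        if col == dep_var:
            continue
        many_ints = is_integer_dtype(df[col]) and df[col].nunique() > max_card
        if is_float_dtype(df[col]) or many_ints:
            cont_names.append(col)   # floats and high-cardinality integers
        else:
            cat_names.append(col)    # low-cardinality integers, bools, strings
    return cont_names, cat_names

toy_df = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0],             # float
    "PNr": [1, 2, 1],                      # integer, few unique values
    "H_Earth": [True, False, False],       # one-hot column
    "Transported": [False, True, False],   # dependent variable, skipped
})
toy_cont, toy_cat = simple_cont_cat_split(toy_df, dep_var="Transported")
print(toy_cont, toy_cat)
```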

The next step is to create the data loaders. The Fastai library offers a powerful helper called 'TabularPandas'. It needs the data frame, the lists of categorical and continuous variables, the dependent variable, and a splitter. The splitter divides the data set into two parts: one for training and one for validation after each epoch. The batch size defaults to 128 in the function below. We can use a random split because the rows in the data set are independent.

In [None]:
def getData(df, batchSize=128):
    
    to_train = TabularPandas(df, 
                           [Normalize, Categorify, FillMissing],
                           cat_vars,
                           cont_vars, 
                           splits=RandomSplitter(valid_pct=0.2)(df),  
                           device = device,
                           y_block=CategoryBlock(),
                           y_names=dep_var) 

    return to_train.dataloaders(bs=batchSize)

In [None]:
dls = getData(train_df)
len(dls.train), len(dls.valid)
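
What a random 80/20 split amounts to can be sketched with numpy alone. The helper random_split and its seed parameter are hypothetical names for this example. I assume the library's splitter works along similar lines (shuffle the row indices, then cut off the validation share), but this is not its implementation.

```python
import numpy as np

def random_split(n_rows, valid_pct=0.2, seed=718):
    """Hypothetical helper: shuffle row indices, then cut off a validation share."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    cut = int(valid_pct * n_rows)
    return shuffled[cut:], shuffled[:cut]   # (train indices, validation indices)

train_idx, valid_idx = random_split(1000)
print(len(train_idx), len(valid_idx))
```

Every row lands in exactly one of the two parts, so the validation rows are never used for the weight updates.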

Show me the transformed data, which will be used in the network later.

In [None]:
dls.show_batch()

Finally, I create a learner, passing the dataloaders into it. The default settings are two hidden layers with 200 and 100 units. Increasing the number of parameters in the neural network may improve the accuracy and score: you can change the number and size of the hidden layers, use batch normalization and/or dropout layers, etc.

In [None]:
my_config = tabular_config(y_range=(0,1), use_bn=True, ps=0.1, embed_p=0.1)

learn = tabular_learner(dls,
                        config = my_config,
                        layers=[200,100],
                        metrics=[accuracy])

learn.summary()

We need a proper learning rate. The Fastai library offers the function lr_find() for this job.

In [None]:
lr_min,lr_steep = learn.lr_find(suggest_funcs=(minimum, steep))

In [None]:
print(f"Minimum/10: {lr_min:.2e}, steepest point: {lr_steep:.2e}")

I will use a maximum learning rate of 5e-3. Starting the learning process is quite easy; I will run for 30 epochs. I will save the best model, that is, the one with the lowest validation loss. The Fastai library offers the SaveModelCallback callback for this. You only need to specify the file name. The option with_opt=True also stores the optimizer state. You will find the new file in the subdirectory 'models'.

In [None]:
learn.fit_one_cycle(30, 5e-3, wd=0.01, cbs=SaveModelCallback(fname='kaggle_spaceship_titanic', with_opt=True))
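
The term "maximum learning rate" makes more sense with the shape of the one-cycle schedule in mind: the rate warms up to the maximum and then anneals to a very small value. The sketch below is a plain-Python illustration, not the library code. The helper names are hypothetical, and the values div=25, div_final=1e5 and pct_start=0.25 are what I understand the library defaults to be, so treat them as assumptions.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pct, lr_max=5e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Hypothetical helper: learning rate at training progress pct in [0, 1]."""
    if pct < pct_start:   # warm-up phase
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # annealing phase
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))

for pct in (0.0, 0.25, 0.5, 1.0):
    print(f"{pct:.2f} -> {one_cycle_lr(pct):.2e}")
```

Under these assumptions the rate starts at lr_max/25, peaks at 5e-3 after a quarter of the training, and ends close to zero.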

The confusion matrix below shows the quality of the predictions on the validation set.

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(normalize=True, norm_dec=3)
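
For readers who want to see what goes into such a plot, here is a minimal numpy sketch with made-up labels (invented for illustration). It assumes that the normalized variant divides each row, that is each actual class, by its total, so every row sums to 1.

```python
import numpy as np

# Toy labels, invented for illustration only (1 = transported)
actual    = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
predicted = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes
cm = np.zeros((2, 2), dtype=int)
for a, p in zip(actual, predicted):
    cm[a, p] += 1

# Row-wise normalization: each row sums to 1
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(cm)
print(cm_norm.round(3))
```

The diagonal of the normalized matrix then reads as the share of each actual class that was predicted correctly.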

Now it's time to calculate the predictions for the test data set. Therefore I load the 'best' model, the one with the lowest validation loss.

In [None]:
learn.load('kaggle_spaceship_titanic')

In [None]:
learn.show_results()

I get one row of prediction values per passenger, containing the probabilities for the different target values. np.argmax returns the index with the maximum probability, here 0 or 1.

In [None]:
dlt = learn.dls.test_dl(test_df) 
nn_preds,_ ,preds = learn.get_preds(dl=dlt , with_decoded=True) 

nn_preds

In [None]:
sample_submission[dep_var] = np.argmax(nn_preds, axis=1) == 1
sample_submission.to_csv("submission.csv", index=False)
sample_submission.head()
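
The conversion from probabilities to the boolean submission column can be tried on a small made-up array (values invented for illustration). The sketch assumes the class order is [False, True], so that index 1 means "transported".

```python
import numpy as np

# Toy probabilities, invented for illustration: columns are [False, True]
probs = np.array([
    [0.90, 0.10],
    [0.35, 0.65],
    [0.50, 0.50],
])

class_idx = np.argmax(probs, axis=1)   # index of the larger probability per row
transported = class_idx == 1           # boolean column expected in the submission

print(class_idx)     # ties resolve to the first index, here 0
print(transported)
```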

In [None]:
!ls -la

As we can see, I achieve an accuracy of roughly 0.802 - 0.805 with the default Fastai settings and minimal feature engineering. That is a great result and a baseline for investigating more feature engineering and/or modeling to get a better final result. At this point you can start your own experiments. Feel free to use my notebook if you like, or tell me your concerns. Feedback is welcome!