This worksheet builds a basic model: fastai handles the tabular preprocessing and a scikit-learn random forest does the classification.

# Data Preprocessing

First I import all of fastai, which includes pandas and numpy, and then the other functionality I'll need. Next I load the train and test CSV files as dataframes so I can look at what the columns contain.

In [1]:
from fastai.imports import *

from fastai.tabular.all import *
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

I want to add a new column for the group component of PassengerId, two new columns to separate out the deck and the side of the Cabin, a new column for last name, and a new column that sums the spending in the RoomService, FoodCourt, ShoppingMall, Spa and VRDeck columns.

In [3]:
train['Group'] = train['PassengerId'].str[0:4]
train['Deck'] = train['Cabin'].str[0]
train['Side'] = train['Cabin'].str.strip().str[-1]
splitted = train['Name'].str.split()
train['LastName'] = splitted.str[-1]
train['Spend'] = train['RoomService'] + train['FoodCourt'] + train['ShoppingMall'] + train['Spa'] + train['VRDeck']
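
As an aside, the same pieces can be pulled out by splitting on the separators instead of slicing by character position. This is a minimal sketch on a made-up toy frame; the `toy` data and the `GroupSize` column are my own illustrative assumptions and are not used anywhere else in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows in the same shape as the competition data (values are made up).
toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0003_01', '0003_02'],
    'Cabin': ['B/0/P', 'A/12/S', np.nan],
})

# PassengerId looks like 'gggg_pp': the group, then the number within the group.
toy['Group'] = toy['PassengerId'].str.split('_').str[0]

# Hypothetical extra feature: how many passengers share each group.
toy['GroupSize'] = toy.groupby('Group')['PassengerId'].transform('count')

# Cabin looks like 'deck/num/side'; splitting on '/' works for any cabin number length.
cabin_parts = toy['Cabin'].str.split('/', expand=True)
toy['Deck'] = cabin_parts[0]
toy['Side'] = cabin_parts[2]

print(toy)
```

Rows with a missing Cabin simply end up with missing Deck and Side values, which is the same outcome as the slicing approach above.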

In [4]:
train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Group,Deck,Side,LastName,Spend
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,1,B,P,Ofracculy,0.0
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,2,F,S,Vines,736.0
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,3,A,S,Susent,10383.0
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,3,A,S,Susent,5176.0
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,4,F,S,Santantines,1091.0


I also want to do the same things on the test dataset.

In [5]:
test['Group'] = test['PassengerId'].str[0:4]
test['Deck'] = test['Cabin'].str[0]
test['Side'] = test['Cabin'].str.strip().str[-1]
splitted = test['Name'].str.split()
test['LastName'] = splitted.str[-1]
test['Spend'] = test['RoomService'] + test['FoodCourt'] + test['ShoppingMall'] + test['Spa'] + test['VRDeck']
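
Since the train and test cells repeat the same steps, one option is to wrap them in a function and call it on both dataframes. The sketch below is only an illustration: `add_features`, `SPEND_COLS` and the toy rows are names and data I made up, and the function is meant to mirror the cells above rather than replace them.

```python
import numpy as np
import pandas as pd

SPEND_COLS = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

def add_features(df):
    """Return a copy of df with the engineered columns used in this notebook."""
    out = df.copy()
    out['Group'] = out['PassengerId'].str[0:4]
    out['Deck'] = out['Cabin'].str[0]
    out['Side'] = out['Cabin'].str.strip().str[-1]
    out['LastName'] = out['Name'].str.split().str[-1]
    # skipna=False mirrors plain addition: one missing amount makes Spend missing.
    out['Spend'] = out[SPEND_COLS].sum(axis=1, skipna=False)
    return out

# Made-up rows, with one missing spending value in the second row.
toy = pd.DataFrame({
    'PassengerId': ['0013_01', '0018_01'],
    'Cabin': ['G/3/S', 'F/4/S'],
    'Name': ['Nelly Carsoning', 'Lerome Peckers'],
    'RoomService': [0.0, 0.0],
    'FoodCourt': [0.0, 9.0],
    'ShoppingMall': [0.0, np.nan],
    'Spa': [0.0, 2823.0],
    'VRDeck': [0.0, 0.0],
})
toy_out = add_features(toy)
print(toy_out[['Group', 'Deck', 'Side', 'LastName', 'Spend']])
```

Calling the same function on both dataframes would keep the two feature sets from drifting apart if a step is changed later.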

In [6]:
test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Group,Deck,Side,LastName,Spend
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning,13,G,S,Carsoning,0.0
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers,18,F,S,Peckers,2832.0
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus,19,C,S,Unhearfus,0.0
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,Meratz Caltilter,21,C,S,Caltilter,7418.0
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,Brence Harperez,23,F,S,Harperez,645.0


Since fastai's Categorify will change PassengerId to a number, I'm storing the original values in a separate dataframe, which I'll put back into the test dataset before creating the submission file.

In [7]:
test_passenger = test[['PassengerId']]

In [8]:
test_passenger.head()

Unnamed: 0,PassengerId
0,0013_01
1,0018_01
2,0019_01
3,0021_01
4,0023_01


I want to do the easiest separation of continuous and categorical variables possible, so I'm using cont_cat_split. 

In [9]:
cont,cat = cont_cat_split(train)

The continuous and categorical columns are now identified.

In [10]:
cont

['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Spend']

In [11]:
cat

['PassengerId',
 'HomePlanet',
 'CryoSleep',
 'Cabin',
 'Destination',
 'VIP',
 'Name',
 'Transported',
 'Group',
 'Deck',
 'Side',
 'LastName']
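
To make the split less of a black box, here is a rough stand-in for what cont_cat_split does as I understand it: float columns, and integer columns with many distinct values, are treated as continuous, and everything else as categorical. The function name `split_by_dtype` and the toy frame are my own assumptions. As far as I know, the real cont_cat_split also accepts a `dep_var` argument that leaves the target out of both lists.

```python
import pandas as pd

def split_by_dtype(df, max_card=20, dep_var=None):
    """Rough stand-in for cont_cat_split, written with plain pandas."""
    cont_names, cat_names = [], []
    for col in df.columns:
        if col == dep_var:
            continue  # leave the target out of both lists
        is_float = pd.api.types.is_float_dtype(df[col])
        is_big_int = (pd.api.types.is_integer_dtype(df[col])
                      and df[col].nunique() > max_card)
        if is_float or is_big_int:
            cont_names.append(col)
        else:
            cat_names.append(col)
    return cont_names, cat_names

# Made-up rows with one float, one string and one boolean column.
toy = pd.DataFrame({
    'Age': [39.0, 24.0, 58.0],
    'HomePlanet': ['Europa', 'Earth', 'Europa'],
    'Transported': [False, True, False],
})
cont_toy, cat_toy = split_by_dtype(toy, dep_var='Transported')
cont_all, cat_all = split_by_dtype(toy)
print(cont_toy, cat_toy)
```

Without `dep_var`, the boolean Transported column lands in the categorical list, which is why it shows up in `cat` above.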

In preparation for setting up the validation dataset, I want to remove the target Transported column from the cat values so that I can pass it separately as the dependent variable. PassengerId stays in the list; its original values are saved in the separate dataframe created above.

In [12]:
cat.remove('Transported')

In [13]:
cat

['PassengerId',
 'HomePlanet',
 'CryoSleep',
 'Cabin',
 'Destination',
 'VIP',
 'Name',
 'Group',
 'Deck',
 'Side',
 'LastName']

I'll also split out the continuous and categorical features for the test dataset.

In [14]:
test_cont,test_cat = cont_cat_split(test)

Next I'll split train into the training and validation datasets.

In [15]:
splits = RandomSplitter(valid_pct=0.25, seed=42)(range_of(train))
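
Conceptually, the splitter shuffles the row positions and holds out a fraction of them for validation. The sketch below shows that idea with numpy only. `random_split` is a name I made up, and because it uses a different random number generator it will not reproduce the exact indices that RandomSplitter returns.

```python
import numpy as np

def random_split(n_rows, valid_pct=0.25, seed=42):
    """Shuffle row positions and hold out the first valid_pct of them."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    cut = int(valid_pct * n_rows)
    # Return (train positions, validation positions), like the splits tuple above.
    return shuffled[cut:].tolist(), shuffled[:cut].tolist()

train_idx, valid_idx = random_split(8, valid_pct=0.25, seed=42)
print(len(train_idx), len(valid_idx))
```

Fixing the seed means the same rows end up in the validation set on every run, so scores stay comparable between experiments.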

TabularPandas is fastai's main wrapper for preparing tabular data. Categorify changes the categorical values into numbers. FillMissing fills missing values in the continuous columns, by default with the median of each column, and adds a _na column that flags which rows were filled. I tried to use the mode option instead, but I couldn't get it to work. Normalize rescales the continuous columns, which neural networks need; the random forest used below does not depend on it.

In [16]:
to = TabularPandas(train, procs=[Categorify, FillMissing, Normalize],
                   cat_names = cat,
                   cont_names = cont,
                   y_names='Transported',
                   y_block = CategoryBlock,
                   splits=splits)
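
To show roughly what the three procs do to a column, here is a pandas-only sketch on made-up data. It is an approximation under my own assumptions (median fill, code 0 reserved for missing categories, population standard deviation), not fastai's actual implementation.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing category and one missing number.
toy = pd.DataFrame({
    'HomePlanet': ['Earth', 'Europa', np.nan, 'Earth'],
    'Age': [20.0, np.nan, 40.0, 30.0],
})

# Categorify-like step: map each category to an integer code.
# pandas gives missing values the code -1, so adding 1 reserves 0 for them.
codes = pd.Categorical(toy['HomePlanet']).codes + 1

# FillMissing-like step: flag the gaps, then fill them with the column median.
age_na = toy['Age'].isna()
age_filled = toy['Age'].fillna(toy['Age'].median())

# Normalize-like step: subtract the mean and divide by the standard deviation.
age_norm = (age_filled - age_filled.mean()) / age_filled.std(ddof=0)

print(codes.tolist(), age_na.tolist(), age_filled.tolist())
print(age_norm.round(3).tolist())
```

In the real pipeline the statistics are learned from the training split, which is why the normalized values in the output below are not centred exactly on zero for every subset of rows.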

In [17]:
train_to= to.train.xs

In [18]:
train_to.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,VIP,Name,Group,Deck,Side,...,Spa_na,VRDeck_na,Spend_na,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Spend
2246,2247,0,2,1018,3,1,2316,1604,3,2,...,1,1,1,1.060233,-0.333266,-0.281079,-0.277151,-0.268876,-0.260867,-0.512477
4598,4599,1,2,6250,3,1,2667,3297,7,2,...,1,1,1,0.78155,-0.333266,-0.281079,-0.277151,-0.268876,-0.260867,-0.512477
5381,5382,2,1,78,3,2,7289,3871,1,1,...,1,1,1,0.433197,0.03742,-0.008341,0.492159,-0.26802,1.526277,0.625822
1283,1284,1,1,1412,3,1,7605,905,5,2,...,1,1,1,-0.263509,-0.333266,0.34259,-0.277151,-0.268876,-0.260867,-0.149546
3816,3817,0,2,6063,3,1,4761,2733,7,1,...,1,1,1,-0.542191,-0.333266,-0.281079,-0.277151,-0.268876,-0.260867,-0.512477


Next I break out the x's and y's for the training and validation sets.

In [19]:
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_valid, y_valid = to.valid.xs, to.valid.ys.values.ravel()

I create the random forest model using hyperparameters from my previous grid search work.

In [20]:
rf_model = RandomForestClassifier(criterion='entropy', 
                                  n_estimators=300,
                                  min_samples_split=14,
                                  min_samples_leaf=1,
                                  oob_score=True,
                                  max_depth=16,
                                  random_state=1,
                                  max_features='log2',
                                  n_jobs=-1)

Then I fit the model and compute its accuracy on the validation set.

In [21]:
rf_model.fit(X_train, y_train)

rf_predictions = rf_model.predict(X_valid)
accuracy_score(y_valid, rf_predictions)

0.8053382420616659
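
For reference, the accuracy score is just the fraction of validation rows where the prediction matches the label. A tiny numpy sketch with made-up labels and predictions:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])   # made-up labels
y_pred = np.array([0, 1, 0, 0, 1])   # made-up predictions

# Comparing element-wise gives booleans; their mean is the share of matches.
accuracy = (y_true == y_pred).mean()
print(accuracy)
```

Four of the five made-up predictions match, so the result is 0.8.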

Next I need to do the same data cleaning work on the test dataset.

In [22]:
test_to = TabularPandas(test, procs=[Categorify, FillMissing, Normalize],
                   cat_names = test_cat,
                   cont_names = test_cont,
                   )

In [23]:
X_test= test_to.train.xs
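
One caveat with building a second TabularPandas: it learns its own category codes and normalization statistics from the test data, so the same value can receive a different code than it did in training. The pandas-only sketch below illustrates the effect on made-up values; the variable names are my own. As far as I know, fastai's usual way to reuse the training-time processing is `to.new(test)` followed by `.process()`, but I have not tried that here.

```python
import pandas as pd

train_planets = pd.Series(['Earth', 'Europa', 'Mars'])
test_planets = pd.Series(['Europa', 'Mars'])   # 'Earth' happens to be absent

# Fitting the encoding separately on each frame gives different codes
# for the same value.
train_codes = dict(zip(train_planets, pd.Categorical(train_planets).codes))
test_codes_separate = dict(zip(test_planets, pd.Categorical(test_planets).codes))

# Reusing the categories learned from the training data keeps the codes aligned.
train_categories = pd.Categorical(train_planets).categories
shared = pd.Categorical(test_planets, categories=train_categories)
test_codes_shared = dict(zip(test_planets, shared.codes))

print(train_codes)
print(test_codes_separate)
print(test_codes_shared)
```

With separate encodings, 'Europa' is 1 in training but 0 in the test frame, so a model trained on the first mapping would misread it.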

In [24]:
X_test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,VIP,Name,Group,Deck,Side,...,Spa_na,VRDeck_na,Spend_na,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Spend
0,1,1,2,2785,3,1,2913,1,7,2,...,1,1,1,-0.114146,-0.357343,-0.283839,-0.312179,-0.267842,-0.246709,-0.508841
1,2,1,1,1868,3,1,2407,2,6,2,...,1,1,1,-0.684308,-0.357343,-0.277878,-0.312179,2.287505,-0.246709,0.546791
2,3,2,2,258,1,1,3377,3,3,2,...,1,1,1,0.170935,-0.357343,-0.283839,-0.312179,-0.267842,-0.246709,-0.508841
3,4,2,1,260,3,1,2712,4,3,2,...,1,1,1,0.669827,-0.357343,4.121501,-0.312179,-0.104002,0.226645,2.256229
4,5,1,1,1941,3,1,669,5,6,2,...,1,1,1,-0.613037,-0.340727,-0.283839,0.832137,-0.267842,-0.246709,-0.268416


Next I add the predicted values to the test dataset.

In [25]:
X_test['Transported'] = rf_model.predict(X_test)

In [26]:
X_test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,VIP,Name,Group,Deck,Side,...,VRDeck_na,Spend_na,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Spend,Transported
0,1,1,2,2785,3,1,2913,1,7,2,...,1,1,-0.114146,-0.357343,-0.283839,-0.312179,-0.267842,-0.246709,-0.508841,0
1,2,1,1,1868,3,1,2407,2,6,2,...,1,1,-0.684308,-0.357343,-0.277878,-0.312179,2.287505,-0.246709,0.546791,0
2,3,2,2,258,1,1,3377,3,3,2,...,1,1,0.170935,-0.357343,-0.283839,-0.312179,-0.267842,-0.246709,-0.508841,1
3,4,2,1,260,3,1,2712,4,3,2,...,1,1,0.669827,-0.357343,4.121501,-0.312179,-0.104002,0.226645,2.256229,1
4,5,1,1,1941,3,1,669,5,6,2,...,1,1,-0.613037,-0.340727,-0.283839,0.832137,-0.267842,-0.246709,-0.268416,1


Then I put the actual PassengerId values back into the test dataset.

In [27]:
X_test['PassengerId'] = test_passenger['PassengerId']

In [28]:
X_test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,VIP,Name,Group,Deck,Side,...,VRDeck_na,Spend_na,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Spend,Transported
0,0013_01,1,2,2785,3,1,2913,1,7,2,...,1,1,-0.114146,-0.357343,-0.283839,-0.312179,-0.267842,-0.246709,-0.508841,0
1,0018_01,1,1,1868,3,1,2407,2,6,2,...,1,1,-0.684308,-0.357343,-0.277878,-0.312179,2.287505,-0.246709,0.546791,0
2,0019_01,2,2,258,1,1,3377,3,3,2,...,1,1,0.170935,-0.357343,-0.283839,-0.312179,-0.267842,-0.246709,-0.508841,1
3,0021_01,2,1,260,3,1,2712,4,3,2,...,1,1,0.669827,-0.357343,4.121501,-0.312179,-0.104002,0.226645,2.256229,1
4,0023_01,1,1,1941,3,1,669,5,6,2,...,1,1,-0.613037,-0.340727,-0.283839,0.832137,-0.267842,-0.246709,-0.268416,1


Finally I convert the Transported values back to True/False values.

In [29]:
X_test['Transported'] = np.where(X_test['Transported'] == 1, 'True', 'False')
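
An alternative I could have used is to cast the 0/1 codes to booleans, which end up as the same True/False text once written to CSV. A small sketch on made-up predictions (the variable names are my own):

```python
import io
import pandas as pd

preds = pd.Series([0, 0, 1], name='Transported')   # made-up class codes

# Casting to bool turns 0 into False and 1 into True.
as_bool = preds.astype(bool)

# Write to an in-memory buffer to see the text a CSV file would contain.
buffer = io.StringIO()
as_bool.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```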

In [30]:
X_test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,VIP,Name,Group,Deck,Side,...,VRDeck_na,Spend_na,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Spend,Transported
0,0013_01,1,2,2785,3,1,2913,1,7,2,...,1,1,-0.114146,-0.357343,-0.283839,-0.312179,-0.267842,-0.246709,-0.508841,False
1,0018_01,1,1,1868,3,1,2407,2,6,2,...,1,1,-0.684308,-0.357343,-0.277878,-0.312179,2.287505,-0.246709,0.546791,False
2,0019_01,2,2,258,1,1,3377,3,3,2,...,1,1,0.170935,-0.357343,-0.283839,-0.312179,-0.267842,-0.246709,-0.508841,True
3,0021_01,2,1,260,3,1,2712,4,3,2,...,1,1,0.669827,-0.357343,4.121501,-0.312179,-0.104002,0.226645,2.256229,True
4,0023_01,1,1,1941,3,1,669,5,6,2,...,1,1,-0.613037,-0.340727,-0.283839,0.832137,-0.267842,-0.246709,-0.268416,True


In [31]:
submit_benchmark = X_test[['PassengerId', 'Transported']]

In [32]:
submit_benchmark.head()

Unnamed: 0,PassengerId,Transported
0,0013_01,False
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True


In [33]:
submit_benchmark.to_csv('submit_rf_fastai_data.csv', index=False)
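
Before uploading, a few quick checks on the submission frame can catch formatting mistakes. The helper below is a hypothetical sketch: `check_submission` and the toy frame are my own, and the expected row count has to come from the actual test file.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Raise AssertionError if df does not look like a valid submission."""
    assert list(df.columns) == ['PassengerId', 'Transported']
    assert len(df) == expected_rows
    assert df['PassengerId'].is_unique
    # Works for both real booleans and 'True'/'False' strings.
    assert set(df['Transported'].astype(str)) <= {'True', 'False'}
    return True

# Made-up rows in the same shape as the submission frame above.
toy_submission = pd.DataFrame({
    'PassengerId': ['0013_01', '0018_01', '0019_01'],
    'Transported': ['False', 'False', 'True'],
})
ok = check_submission(toy_submission, expected_rows=3)
print(ok)
```

In this notebook the call would look like `check_submission(submit_benchmark, expected_rows=len(test))`.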

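For this submission I received a score of 0.7898, which was lower than the validation accuracy of 0.8053 above. One possible contributor is that the test set was processed with its own TabularPandas, so its category codes and normalization statistics were fit separately from the training data.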