For this worksheet I want to test a stratified K-fold cross-validation technique: train one model per fold, then average the predicted probabilities from all the folds to make the final predictions. It was suggested in this Kaggle discussion: https://www.kaggle.com/competitions/spaceship-titanic/discussion/312173
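
To make the "stratified" part concrete, here is a minimal sketch on made-up data (the names `X_toy`, `y_toy` and `fold_positive_rates` are only for illustration, and `random_state=0` is my own choice). It checks that each validation fold keeps the overall share of positive labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 20 samples, 5 of them positive (25%).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 5 + [0] * 15)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positive_rates = []
for train_idx, valid_idx in skf.split(X_toy, y_toy):
    # Share of positive labels in this validation fold
    fold_positive_rates.append(y_toy[valid_idx].mean())

print(fold_positive_rates)
```

A plain `KFold` gives no such guarantee, which matters more when the classes are unbalanced.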

# Data Preprocessing

First I import everything from fastai's imports module, which includes pandas and numpy, along with the other functionality I'll need. Next I load the train and test CSV files as dataframes and show their heads to see what the columns contain.

In [1]:
from fastai.imports import *

#from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
#from sklearn.tree import DecisionTreeRegressor

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [4]:
test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,Meratz Caltilter
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,Brence Harperez


Below are all the preprocessing steps for the train and test dataframes from the part 1 worksheet. 

In [5]:
modes = train.mode().iloc[0]
train.fillna(modes, inplace=True)
cont,cat = cont_cat_split(train)
for i in cat:
    train[i] = pd.Categorical(train[i])
cat.remove('Transported')
dep='Transported'

In [6]:
modes_test = test.mode().iloc[0]
test.fillna(modes_test, inplace=True)
test_passid = pd.DataFrame(test['PassengerId'])
test_cont,test_cat = cont_cat_split(test)
for i in test_cat:
    test[i] = pd.Categorical(test[i])
test_passid['PassIdCode'] = test.PassengerId.cat.codes
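
As a small illustration of the mode-filling step used in both cells above, here is a sketch on a made-up dataframe (`toy`, `toy_modes` and `toy_filled` are hypothetical names, not part of the worksheet).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with one missing value per column
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", None, "Europa"],
    "Age": [24.0, np.nan, 24.0, 58.0],
})

# mode() returns a DataFrame (ties produce extra rows);
# row 0 holds the first mode of each column.
toy_modes = toy.mode().iloc[0]

# fillna with a Series fills each column with the value under its own name
toy_filled = toy.fillna(toy_modes)
print(toy_filled)
```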

Before running the Stratified K Fold function I need to convert all the categorical data into their category codes. 

In [7]:
train[cat] = train[cat].apply(lambda x: x.cat.codes)

In [8]:
train[cat].head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,VIP,Name
0,0,1,0,149,2,0,5252
1,1,0,0,2184,2,0,4502
2,2,1,0,1,2,1,457
3,3,1,0,1,2,0,7149
4,4,0,0,2186,2,0,8319
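
One caveat with `.cat.codes`: the integers depend on which categories are present in each dataframe, so encoding train and test separately can give the same label different codes. The sketch below uses made-up frames (`toy_train`, `toy_test`) and assumes it is acceptable to build one shared category list from both; it is not what the cells above do.

```python
import pandas as pd

# Toy frames (hypothetical): test is missing one of the train labels
toy_train = pd.DataFrame({"Destination": ["TRAPPIST-1e", "55 Cancri e", "PSO J318.5-22"]})
toy_test = pd.DataFrame({"Destination": ["TRAPPIST-1e", "PSO J318.5-22"]})

# Encoded separately, the same label can receive different codes
separate_train = pd.Categorical(toy_train["Destination"]).codes
separate_test = pd.Categorical(toy_test["Destination"]).codes

# Encoded against one shared category list, the codes agree
shared = sorted(set(toy_train["Destination"]) | set(toy_test["Destination"]))
shared_train = pd.Categorical(toy_train["Destination"], categories=shared).codes
shared_test = pd.Categorical(toy_test["Destination"], categories=shared).codes

print(separate_train, separate_test)
print(shared_train, shared_test)
```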


Then I need to combine them into the X value of independent variables. 

In [9]:
X = train[cat+cont]

In [10]:
X.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,VIP,Name,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,0,1,0,149,2,0,5252,39.0,0.0,0.0,0.0,0.0,0.0
1,1,0,0,2184,2,0,4502,24.0,109.0,9.0,25.0,549.0,44.0
2,2,1,0,1,2,1,457,58.0,43.0,3576.0,0.0,6715.0,49.0
3,3,1,0,1,2,0,7149,33.0,0.0,1283.0,371.0,3329.0,193.0
4,4,0,0,2186,2,0,8319,16.0,303.0,70.0,151.0,565.0,2.0


In [11]:
y = train[dep]

In [12]:
y.head()

0    False
1     True
2    False
3    False
4     True
Name: Transported, dtype: category
Categories (2, object): [False, True]

I discovered when I first ran the Stratified K Fold function that the y target values can't have an 'unknown' data type, so I have to switch them to a binary integer array. 

In [13]:
from sklearn.utils.multiclass import type_of_target

In [14]:
type_of_target(y)

'unknown'

In [15]:
y = np.array(y, dtype=int)

In [16]:
type_of_target(y)

'binary'

In [17]:
y

array([0, 1, 0, ..., 1, 0, 1])
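
The same conversion on a tiny made-up target (`toy_y` is a hypothetical name) shows the integer array that the splitter is happy with:

```python
import numpy as np
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Toy target (hypothetical) stored as a categorical of booleans
toy_y = pd.Series(pd.Categorical([False, True, False, True]))

# Converting to a plain integer array gives 0/1 labels
toy_y_int = np.array(toy_y, dtype=int)

print(toy_y_int)
print(type_of_target(toy_y_int))
```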

# Preprocessing the Test Data to Run the Model On

First I'm converting the test dataset's categorical columns into their integer category codes, as I did for the training data. All of the resulting columns in the test dataset are then used as independent variables.

In [17]:
test[test_cat] = test[test_cat].apply(lambda x: x.cat.codes)

In [18]:
test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0,0,1,2784,2,27.0,0,0.0,0.0,0.0,0.0,0.0,2912
1,1,0,0,1867,2,19.0,0,0.0,9.0,0.0,2823.0,0.0,2406
2,2,1,1,257,0,31.0,0,0.0,0.0,0.0,0.0,0.0,3376
3,3,1,0,259,2,38.0,0,0.0,6652.0,0.0,181.0,585.0,2711
4,4,0,0,1940,2,20.0,0,10.0,0.0,635.0,0.0,0.0,668


# Develop and Run the Model

I need to import the sklearn functions needed for the next steps. 

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import StratifiedKFold
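
A note on reading `mean_absolute_error` for a classifier: with 0/1 labels it is simply the fraction of wrong predictions, so it equals 1 minus the accuracy. A quick check on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Toy labels (hypothetical): 2 of the 5 predictions are wrong
y_true_toy = np.array([0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 0])

mae = mean_absolute_error(y_true_toy, y_pred_toy)  # fraction of wrong predictions
acc = accuracy_score(y_true_toy, y_pred_toy)       # fraction of correct predictions

print(mae, acc)
```

Read this way, a validation score of about 0.215 corresponds to an accuracy of about 0.785, and lower is better.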

In [20]:
y_probs = []

In [21]:
folds = StratifiedKFold(n_splits=5, shuffle=True)
for fold, (train_id, valid_id) in enumerate(folds.split(X, y)):

    print("fold : ", fold + 1, end = ' ')

    # Split into this fold's training and validation rows
    X_train, y_train = X.iloc[train_id], y[train_id]
    X_valid, y_valid = X.iloc[valid_id], y[valid_id]

    model = RandomForestClassifier()
    model.fit(X_train, y_train)

    # For 0/1 labels, mean absolute error is the fraction of wrong predictions
    valid_pred = model.predict(X_valid)
    valid_score = mean_absolute_error(y_valid, valid_pred)
    print( "Validation score: ", valid_score, end = ' ')

    # Store this fold's class probabilities for the test set.
    # Note: test's columns are not in the same order as X's (cat + cont),
    # which is what triggers the feature-name message in the output.
    y_probs.append(model.predict_proba(test))
    print(" ")

fold :  1 Validation score:  0.21506612995974697
Feature names must be in the same order as they were in fit.
fold :  2 Validation score:  0.21391604370327774
Feature names must be in the same order as they were in fit.
fold :  3 Validation score:  0.21506612995974697
Feature names must be in the same order as they were in fit.
fold :  4 Validation score:  0.21461449942462602
Feature names must be in the same order as they were in fit.
fold :  5 Validation score:  0.1956271576524741


Feature names must be in the same order as they were in fit.
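
The repeated message in the output above comes from scikit-learn's feature-name check: `X` was built as `train[cat+cont]`, while `test` still has its original column order, so each value is read as the wrong feature at prediction time. I believe newer scikit-learn releases raise an error here instead of warning. A minimal sketch of one possible fix, on made-up frames (`X_toy`, `test_toy` and `test_aligned` are hypothetical names), is to reorder the test columns to match the training columns before predicting:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy frames (hypothetical): same columns, different order
X_toy = pd.DataFrame({"VIP": [0, 1, 0, 1], "Age": [39.0, 24.0, 58.0, 33.0]})
y_toy = [0, 1, 0, 1]
test_toy = pd.DataFrame({"Age": [27.0, 19.0], "VIP": [1, 0]})

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_toy, y_toy)

# Reorder the test columns to match the training columns before predicting
test_aligned = test_toy[X_toy.columns]
probs = model.predict_proba(test_aligned)
print(probs.shape)
```

In this worksheet the equivalent call would presumably be `model.predict_proba(test[X.columns])`, though I have not rerun the folds with it, so the scores reported here are from the unaligned version.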



In [22]:
y_probs

[array([[0.48, 0.52],
        [0.62, 0.38],
        [0.49, 0.51],
        ...,
        [0.41, 0.59],
        [0.69, 0.31],
        [0.38, 0.62]]),
 array([[0.65, 0.35],
        [0.58, 0.42],
        [0.49, 0.51],
        ...,
        [0.37, 0.63],
        [0.64, 0.36],
        [0.47, 0.53]]),
 array([[0.61, 0.39],
        [0.56, 0.44],
        [0.44, 0.56],
        ...,
        [0.4 , 0.6 ],
        [0.73, 0.27],
        [0.56, 0.44]]),
 array([[0.52, 0.48],
        [0.55, 0.45],
        [0.46, 0.54],
        ...,
        [0.39, 0.61],
        [0.72, 0.28],
        [0.55, 0.45]]),
 array([[0.64, 0.36],
        [0.56, 0.44],
        [0.34, 0.66],
        ...,
        [0.29, 0.71],
        [0.69, 0.31],
        [0.56, 0.44]])]

In [23]:
y_prob_3 = sum(y_probs) / len(y_probs)

The first number in each row is the probability that the passenger belongs to class 0 and the second that they belong to class 1. The two should add up to 1.0. Therefore, if the right value is greater than the left, the prediction should be True, since class 1 means the passenger was transported.

In [24]:
y_prob_3

array([[0.57 , 0.43 ],
       [0.522, 0.478],
       [0.468, 0.532],
       ...,
       [0.356, 0.644],
       [0.68 , 0.32 ],
       [0.488, 0.512]])
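
The averaging step can also be written with `np.mean` over the fold axis. The sketch below reuses the first two rows of the first two fold outputs shown earlier (`fold_probs` and `mean_probs` are hypothetical names) and checks that the averaged rows still sum to 1.

```python
import numpy as np

# First two rows from the first two folds' predict_proba outputs
fold_probs = [
    np.array([[0.48, 0.52], [0.62, 0.38]]),
    np.array([[0.65, 0.35], [0.58, 0.42]]),
]

# Average across folds (axis 0), keeping one row per passenger
mean_probs = np.mean(fold_probs, axis=0)

# Each averaged row is still a probability distribution
row_sums = mean_probs.sum(axis=1)
print(mean_probs, row_sums)
```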

The Kaggle discussion suggests assigning True if the right value is greater than the left, and False otherwise. However, I got a very low score when I did this. So I switched it and got a much higher score, but still much lower than the benchmark random forest results.

In [25]:
y_prob_value = []
x = []

In [26]:
for i in y_prob_3:
    if i[1] > i[0]:
        x = 'True' 
    else:
        x = 'False'
    y_prob_value.append(x)


In [27]:
y_prob_value[0:5]

['False', 'False', 'True', 'False', 'False']
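
The same right-versus-left comparison can be done without a loop. This is only a sketch on the first three averaged rows shown above (`probs` and `transported` are hypothetical names); it produces real booleans rather than the strings 'True' and 'False'.

```python
import numpy as np

# First three averaged rows from y_prob_3
probs = np.array([[0.57, 0.43], [0.522, 0.478], [0.468, 0.532]])

# True where the class 1 probability beats the class 0 probability
transported = probs[:, 1] > probs[:, 0]
print(transported)
```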

In [28]:
test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0,0,1,2784,2,27.0,0,0.0,0.0,0.0,0.0,0.0,2912
1,1,0,0,1867,2,19.0,0,0.0,9.0,0.0,2823.0,0.0,2406
2,2,1,1,257,0,31.0,0,0.0,0.0,0.0,0.0,0.0,3376
3,3,1,0,259,2,38.0,0,0.0,6652.0,0.0,181.0,585.0,2711
4,4,0,0,1940,2,20.0,0,10.0,0.0,635.0,0.0,0.0,668


# Create Submission File

I loaded the sample submission file and compared it to the test file to make sure it had the same number of rows and that the PassengerId values appeared to match.

In [29]:
submit = pd.read_csv('sample_submission.csv')

In [30]:
submit.head()

Unnamed: 0,PassengerId,Transported
0,0013_01,False
1,0018_01,False
2,0019_01,False
3,0021_01,False
4,0023_01,False


In [31]:
submit.describe()

Unnamed: 0,PassengerId,Transported
count,4277,4277
unique,4277,1
top,0013_01,False
freq,1,4277


I need to rename the PassengerId column to PassCode so that I can then match the integer codes back to the original PassengerId strings.

In [32]:
test.rename(columns={'PassengerId': 'PassCode'}, inplace=True)

In [33]:
test.head()

Unnamed: 0,PassCode,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0,0,1,2784,2,27.0,0,0.0,0.0,0.0,0.0,0.0,2912
1,1,0,0,1867,2,19.0,0,0.0,9.0,0.0,2823.0,0.0,2406
2,2,1,1,257,0,31.0,0,0.0,0.0,0.0,0.0,0.0,3376
3,3,1,0,259,2,38.0,0,0.0,6652.0,0.0,181.0,585.0,2711
4,4,0,0,1940,2,20.0,0,10.0,0.0,635.0,0.0,0.0,668


I need to create a PassengerId column in the test dataset that holds the original string IDs, to match what is in the submit dataset. Where the PassIdCode column from the test_passid dataframe I created earlier matches the PassCode column in the test dataframe, the PassengerId column in the test dataframe is filled with the PassengerId value from test_passid. Otherwise a NaN value is created.

In [34]:
# Use np.nan (not the string 'NaN') so that isna() can detect any mismatches
test['PassengerId'] = np.where(test_passid['PassIdCode'] == test['PassCode'], test_passid['PassengerId'], np.nan)

In [35]:
test.head()

Unnamed: 0,PassCode,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,PassengerId
0,0,0,1,2784,2,27.0,0,0.0,0.0,0.0,0.0,0.0,2912,0013_01
1,1,0,0,1867,2,19.0,0,0.0,9.0,0.0,2823.0,0.0,2406,0018_01
2,2,1,1,257,0,31.0,0,0.0,0.0,0.0,0.0,0.0,3376,0019_01
3,3,1,0,259,2,38.0,0,0.0,6652.0,0.0,181.0,585.0,2711,0021_01
4,4,0,0,1940,2,20.0,0,10.0,0.0,635.0,0.0,0.0,668,0023_01


Below I'm making sure there are no NaN values for the new PassengerId column. 

In [36]:
test['PassengerId'].isna().sum()

0
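
A note on this kind of check: `isna()` only flags real missing values such as `np.nan`, not the text 'NaN', so a count of 0 is only meaningful when mismatches are marked with a true missing value. A tiny sketch with made-up Series (`as_string` and `as_missing` are hypothetical names):

```python
import numpy as np
import pandas as pd

as_string = pd.Series(["0013_01", "NaN"])     # the text 'NaN'
as_missing = pd.Series(["0013_01", np.nan])   # a real missing value

string_count = as_string.isna().sum()
missing_count = as_missing.isna().sum()
print(string_count, missing_count)
```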

I'm just moving the PassengerId column to the front of the dataframe here.

In [37]:
col = test.pop('PassengerId')
test.insert(0, col.name, col)

In [38]:
test.head()

Unnamed: 0,PassengerId,PassCode,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0013_01,0,0,1,2784,2,27.0,0,0.0,0.0,0.0,0.0,0.0,2912
1,0018_01,1,0,0,1867,2,19.0,0,0.0,9.0,0.0,2823.0,0.0,2406
2,0019_01,2,1,1,257,0,31.0,0,0.0,0.0,0.0,0.0,0.0,3376
3,0021_01,3,1,0,259,2,38.0,0,0.0,6652.0,0.0,181.0,585.0,2711
4,0023_01,4,0,0,1940,2,20.0,0,10.0,0.0,635.0,0.0,0.0,668


Here I'm adding the Transported column and filling it with the y_prob_value values.

In [39]:
test['Transported'] = y_prob_value

In [40]:
test.head()

Unnamed: 0,PassengerId,PassCode,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0013_01,0,0,1,2784,2,27.0,0,0.0,0.0,0.0,0.0,0.0,2912,False
1,0018_01,1,0,0,1867,2,19.0,0,0.0,9.0,0.0,2823.0,0.0,2406,False
2,0019_01,2,1,1,257,0,31.0,0,0.0,0.0,0.0,0.0,0.0,3376,True
3,0021_01,3,1,0,259,2,38.0,0,0.0,6652.0,0.0,181.0,585.0,2711,False
4,0023_01,4,0,0,1940,2,20.0,0,10.0,0.0,635.0,0.0,0.0,668,False


This creates the actual submission file for the stratified K-fold random forest model.

In [41]:
submit_benchmark = test[['PassengerId', 'Transported']]

In [42]:
submit_benchmark.head()

Unnamed: 0,PassengerId,Transported
0,0013_01,False
1,0018_01,False
2,0019_01,True
3,0021_01,False
4,0023_01,False


In [43]:
submit_benchmark.to_csv('submit_rf_stratKfold.csv', index=False)
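
Since the row order of the test data never changes in this worksheet, a shorter route to the same file might be to pair the saved IDs with the predictions directly. The sketch below uses made-up inputs (`passenger_ids`, `mean_probs` and `submission` are hypothetical names, with three IDs and averaged rows copied from the outputs above) and assumes the IDs were kept aside before encoding.

```python
import numpy as np
import pandas as pd

# Original IDs kept aside before encoding, in the same row order as the predictions
passenger_ids = pd.Series(["0013_01", "0018_01", "0019_01"], name="PassengerId")
mean_probs = np.array([[0.57, 0.43], [0.522, 0.478], [0.468, 0.532]])

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Transported": mean_probs[:, 1] > 0.5,  # True where class 1 is more likely
})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = submission.to_csv(index=False)
print(csv_text)
```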

For the right-versus-left submissions I received a score of 0.28057 when a greater right value was 'True' and 0.68973 when a greater left value was 'True'. Both are much worse than my other submissions.

