# Exploratory Data Analysis

First, we are going to import the libraries and modules we will be using:

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

And the datasets:

In [4]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

---

Let's take a look at the data:

In [5]:
train_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [6]:
train_df.isna().any()

PassengerId     False
HomePlanet       True
CryoSleep        True
Cabin            True
Destination      True
Age              True
VIP              True
RoomService      True
FoodCourt        True
ShoppingMall     True
Spa              True
VRDeck           True
Name             True
Transported     False
dtype: bool

**Observation:** All the columns have null values, except for ```PassengerId``` and ```Transported```.


In [7]:
print('Sum of null:')
train_df.isna().sum()

Sum of null:


PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

# Imputing Null Values

I am much more comfortable with doing all this on a new copy, just in case I mess up.

In [8]:
train_df_copy = train_df.copy()

make an ```Expenses``` column. 
**if someone is in cryosleep, they are not spending any money**. So, knowing someone's expenses can help us impute values for ```CryoSleep```.

In [9]:
train_df_copy['Expenses'] = train_df_copy[['RoomService', 'FoodCourt',
                                           'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)

Since we can't really guess anyone's age, I'll just impute these null values with the median.  **only people who are 13+ have expenses**. 

In [10]:
train_df_copy.Age = train_df_copy.Age.fillna(train_df_copy.Age.median())

Let's take a look at our dataset now:

In [11]:
train_df_copy.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Expenses
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,0.0
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,736.0
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,10383.0
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,5176.0
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,1091.0


 make a new column for cryosleep, with **all** values equal to **False** (or 0):

In [12]:
train_df_copy['Cryosleep'] = 0

Now, for every row where ```Expenses``` is ```0```, we're going to put ``1`` as the value. Because **if someone has not spent any money, they are proably in cryosleep**. 

In [13]:
train_df_copy.loc[train_df_copy['Expenses'] == 0, 'Cryosleep'] = 1

Now, we are going to set this column's value to ``1`` wherever the original ```CryoSleep``` is equal to **True**.

In [14]:
train_df_copy.loc[train_df_copy.CryoSleep.astype('str') == 'True', 'Cryosleep'] = 1

Conversely, we will put it to ``0`` wherever ``CryoSleep`` is **False**. 

In [15]:
train_df_copy.loc[train_df_copy.CryoSleep.astype('str') == 'False', 'Cryosleep'] = 0

Let's take a look at this new column now:

In [16]:
train_df_copy.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Expenses,Cryosleep
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,0.0,0
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,736.0,0
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,10383.0,0
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,5176.0,0
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,1091.0,0


What we have done here is:

* First, we set **all** values for cryosleep as false.
* Next, we set cryosleep as true for **everyone who hasn't spent any money**.
* Finally, we used the original `Cryosleep colum`, to correct cryosleep status for the people who **haven't spent any money, but aren't in cryosleep**. Just in case our last step **incorrectly** classified them as **being in cryosleep**.



Now, let's just replace the original column with this one. There's probably a better way of doing this than how I did it here here:

In [17]:
train_df_copy['Cryosleep'] = train_df_copy['Cryosleep'].astype('bool')
train_df_copy['CryoSleep'] = train_df_copy['Cryosleep']
train_df_copy.drop('Cryosleep',axis=1,inplace=True)

Let's take another look at our dataset now:

In [18]:
train_df_copy.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Expenses
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,0.0
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,736.0
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,10383.0
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,5176.0
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,1091.0


**We have now replaced the values of our original `CryoSleep` column, that had missing values, with the values of our newly created `Cryosleep` column which doesn't have any null values. Then we dropped our new column.**



In [19]:
train_df_copy.CryoSleep.isnull().any()

False

In [20]:
train_df_copy.drop('Name',axis=1,inplace=True)

Now for the amenities, we can easily impute null values for ```Cryosleep``` == True, since **we know they are going to be zero as the person is in cryosleep**.

In [21]:
train_df_copy.loc[train_df_copy.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']] = 0
train_df_copy.loc[train_df_copy.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']].isna().sum()

RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
dtype: int64

Before dealing with the rest of the amenities' values, let's make some more new columns to aid us.

In [22]:
train_df_copy['Adults'] = train_df_copy['Age'] >= 13

make a column now that tells us if **someone is 13+** and **is spending money**.

In [23]:
train_df_copy['Adult_and_spending'] = (train_df_copy['Expenses'] > 0) & (train_df_copy['Age'] >=13)

Let's take a look at the rows that are **True** for our new `Adult_and_spending` column:

In [24]:
train_df_copy.loc[train_df_copy.Adult_and_spending == True]

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Expenses,Adults,Adult_and_spending
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,736.0,True,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,10383.0,True,True
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,5176.0,True,True
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,1091.0,True,True
5,0005_01,Earth,False,F/0/P,PSO J318.5-22,44.0,False,0.0,483.0,0.0,291.0,0.0,True,774.0,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8687,9275_03,Europa,False,A/97/P,TRAPPIST-1e,30.0,False,0.0,3208.0,0.0,2.0,330.0,True,3540.0,True,True
8688,9276_01,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False,8536.0,True,True
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True,1873.0,True,True
8691,9280_01,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False,4637.0,True,True


So there are **5040** people who **are 13+** and **are spending money**.

Now we are going to impute the values for our amenities.

We know if someone is **not an adult** and has **zero expenses**, they are either below 13, which means they **definitely** haven't spent on **any** amenities, or they are **in cryosleep**, which again means they **definitely** haven't spent on amenities.

So, wherever we have `Adult_and_spending` == False, we'll impute them with `0`.

In [25]:
train_df_copy.RoomService = train_df_copy.RoomService.fillna(train_df_copy.RoomService.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'RoomService'] = 0

train_df_copy.FoodCourt = train_df_copy.FoodCourt.fillna(train_df_copy.FoodCourt.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'FoodCourt'] = 0

train_df_copy.ShoppingMall = train_df_copy.ShoppingMall.fillna(train_df_copy.ShoppingMall.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'ShoppingMall'] = 0

train_df_copy.Spa = train_df_copy.Spa.fillna(train_df_copy.Spa.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'Spa'] = 0

train_df_copy.VRDeck = train_df_copy.VRDeck.fillna(train_df_copy.VRDeck.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'VRDeck'] = 0

In [26]:
train_df_copy[['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']].isna().sum()

RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
dtype: int64

Perfect.

For the remaining columns, we can't figure out what values to fill in this manner. So we are just going to fill them with the values that the majority of people have in the dataset, i.e., the **mode**. 

In [27]:
train_df_copy.HomePlanet.mode()

0    Earth
Name: HomePlanet, dtype: object

In [28]:
train_df_copy.Destination.mode()

0    TRAPPIST-1e
Name: Destination, dtype: object

In [29]:
train_df_copy.VIP.mode()

0    False
Name: VIP, dtype: object

So, these are the values we will be imputing with.

In [30]:
train_df_copy.HomePlanet = train_df_copy.HomePlanet.fillna('Earth')
train_df_copy.Destination = train_df_copy.Destination.fillna('TRAPPIST-1e')
train_df_copy.VIP = train_df_copy.VIP.fillna('False')
train_df_copy.VIP = train_df_copy.VIP.astype('bool')

Aaand done!

Let's see how much we are done:

In [31]:
train_df_copy.isnull().sum()

PassengerId             0
HomePlanet              0
CryoSleep               0
Cabin                 199
Destination             0
Age                     0
VIP                     0
RoomService             0
FoodCourt               0
ShoppingMall            0
Spa                     0
VRDeck                  0
Transported             0
Expenses                0
Adults                  0
Adult_and_spending      0
dtype: int64

The cabin is the only column that remains with null values! 

Filling this is not easy due to my limited skill. I am just going to use **ffill** to fill these null values. What that does is basically use the **previous** value to impute the **missing** one. 

So, for example, if we have a dataset like:

[1, 2, 3, **null**, 4]

If we use **ffill** on this, it'll become:

[1, 2, 3, **3**, 4].

In [32]:
train_df_copy['Cabin'] = train_df_copy.Cabin.fillna(method='ffill')

In [33]:
train_df_copy.isnull().sum()

PassengerId           0
HomePlanet            0
CryoSleep             0
Cabin                 0
Destination           0
Age                   0
VIP                   0
RoomService           0
FoodCourt             0
ShoppingMall          0
Spa                   0
VRDeck                0
Transported           0
Expenses              0
Adults                0
Adult_and_spending    0
dtype: int64

And so, we are done with imputing. Time to move on to feature engineering.

# Feature Engineering

These are the features that I am going to add to this dataset (again, I got the idea for them [here](https://www.kaggle.com/code/mateuszk013/spaceship-titanic-81-eda-ml/notebook)).

In [34]:
train_df_copy['Group_nums'] = train_df_copy.PassengerId.apply(lambda x: x.split('_')).apply(lambda x: x[0])
train_df_copy['Grouped'] = ((train_df_copy['Group_nums'].value_counts() > 1).reindex(train_df_copy['Group_nums'])).tolist()
train_df_copy['Deck'] = train_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[0])
train_df_copy['Side'] = train_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[2])
train_df_copy['Has_expenses'] = train_df_copy['Expenses'] > 0
train_df_copy['Is_Embryo'] = train_df_copy['Age'] == 0

These specifiy:

* If someone was **alone** or **in a group**.
* Which **deck** someone was in.
* Which side (**Starboard** or **Port**).
* If the passenger was **0 years old** (i.e, an **embryo**).

Let's get rid of our temporary columns:

In [35]:
train_df_copy.drop(['Adult_and_spending','Group_nums','Expenses'],axis=1,\
                   inplace=True)

This is our final dataset:

In [36]:
train_df_copy.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Adults,Grouped,Deck,Side,Has_expenses,Is_Embryo
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,True,False,B,P,False,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,True,False,F,S,True,False
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,True,True,A,S,True,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,True,True,A,S,True,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,True,False,F,S,True,False


In [37]:
train_df_copy.to_csv('Cleaned imputed data.csv',index=False)

---

Since even our test data has missing values, we have to do **all** that to our test data as well.

### Test Data

In [38]:
test_df_copy = test_df.copy()

test_df_copy['Expenses'] = test_df_copy[['RoomService', 'FoodCourt',
                                           'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)

test_df_copy.Age = test_df_copy.Age.fillna(test_df_copy.Age.median())

test_df_copy['Adult_spending_awake'] = (test_df_copy['Expenses'] > 0)\
                                     & (test_df_copy['Age'] >= 13)\
                                     & (test_df_copy['CryoSleep'] == False)

test_df_copy['Cryosleep'] = 0
test_df_copy.loc[test_df_copy['Expenses'] == 0, 'Cryosleep'] = 1
test_df_copy.loc[test_df_copy.CryoSleep.astype('str') == 'True', 'Cryosleep'] = 1
test_df_copy.loc[test_df_copy.CryoSleep.astype('str') == 'False', 'Cryosleep'] = 0
test_df_copy['Cryosleep'] = test_df_copy['Cryosleep'].astype('bool')
test_df_copy['CryoSleep'] = test_df_copy['Cryosleep']
test_df_copy.drop('Cryosleep',axis=1,inplace=True)
test_df_copy.drop('Name',axis=1,inplace=True)

test_df_copy.loc[test_df_copy.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']] = 0

test_df_copy['Adults'] = test_df_copy['Age'] >= 13

test_df_copy['Adult_and_spending'] = (test_df_copy['Expenses'] > 0) & (test_df_copy['Age'] >=13)
test_df_copy.loc[test_df_copy.Adult_and_spending == True]

test_df_copy.RoomService = test_df_copy.RoomService.fillna(test_df_copy.RoomService.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'RoomService'] = 0

test_df_copy.FoodCourt = test_df_copy.FoodCourt.fillna(test_df_copy.FoodCourt.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'FoodCourt'] = 0

test_df_copy.ShoppingMall = test_df_copy.ShoppingMall.fillna(test_df_copy.ShoppingMall.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'ShoppingMall'] = 0

test_df_copy.Spa = test_df_copy.Spa.fillna(test_df_copy.Spa.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'Spa'] = 0

test_df_copy.VRDeck = test_df_copy.VRDeck.fillna(test_df_copy.VRDeck.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'VRDeck'] = 0

test_df_copy.HomePlanet = test_df_copy.HomePlanet.fillna('Earth')
test_df_copy.Destination = test_df_copy.Destination.fillna('TRAPPIST-1e')
test_df_copy.VIP = test_df_copy.VIP.fillna('False')
test_df_copy.VIP = test_df_copy.VIP.astype('bool')

test_df_copy['Cabin'] = test_df_copy.Cabin.fillna(method='ffill')

test_df_copy['Group_nums'] = test_df_copy.PassengerId.apply(lambda x: x.split('_')).apply(lambda x: x[0])
test_df_copy['Grouped'] = ((test_df_copy['Group_nums'].value_counts() > 1).reindex(test_df_copy['Group_nums'])).tolist()
test_df_copy['Deck'] = test_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[0])
test_df_copy['Side'] = test_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[2])
test_df_copy['Has_expenses'] = test_df_copy['Expenses'] > 0
test_df_copy['Is_Embryo'] = test_df_copy['Age'] == 0

test_df_copy.columns
test_df_copy.drop(['Expenses', 'Adult_spending_awake', 'Adult_and_spending','Adults'],axis=1, inplace=True)

test_df_copy.to_csv('Cleaned test data.csv',index=False)

# Model Building

Let's import **Logistic Regression**. I'm also going to import **train-test split**, just for some light evaluation.

In [39]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Now, we import the csv's that we saved earlier.

In [40]:
df_train = pd.read_csv('Cleaned and imputed data.csv')
df_test = pd.read_csv('Cleaned and imputed test data.csv')

In [41]:
df_train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Adults,Grouped,Deck,Side,Has_expenses,Is_Embryo
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,True,False,B,P,False,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,True,False,F,S,True,False
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,True,True,A,S,True,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,True,True,A,S,True,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,True,False,F,S,True,False


In [42]:
df_test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Group_nums,Grouped,Deck,Side,Has_expenses,Is_Embryo
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,13,False,G,S,False,False
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,18,False,F,S,True,False
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,19,False,C,S,False,False
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,21,False,C,S,True,False
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,23,False,F,S,True,False


## feature selection.

In [43]:
df_train.dtypes

PassengerId      object
HomePlanet       object
CryoSleep          bool
Cabin            object
Destination      object
Age             float64
VIP                bool
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Transported        bool
Adults             bool
Grouped            bool
Deck             object
Side             object
Has_expenses       bool
Is_Embryo          bool
dtype: object

In [44]:
features = ['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP',
            'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 
            'Grouped', 'Deck', 'Has_expenses', 'Side', 'Is_Embryo']

In [45]:
X = pd.get_dummies(df_train[features])
y = df_train['Transported']

In [46]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=1)

Let's fit and score:

In [47]:
model = LogisticRegression(max_iter=10000)
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.8003679852805887

Not bad.

Since we actually have to predict the **test** set that Kaggle has provided, we want to use all of the **train** data to train the model. The more data the model gets to learn from, the better the prediction.

In [48]:
model2 = LogisticRegression(max_iter=10000)
model2.fit(X,y)
model2.score(X,y)

0.792016565052341

Let's predict our test set now and save it:

In [49]:
y_pred_log2 = model2.predict(pd.get_dummies(df_test[features]))

---

using **K-Neighbors Classifier**.

In [50]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

And use **GridSearchCV** to get the optimal K value (code commented out as it takes time to run):

In [51]:
# knn = KNeighborsClassifier()
# param_grid = {'n_neighbors':np.arange(2,15)}
# knn_gscv = GridSearchCV(knn, param_grid, cv=5)
# knn_gscv.fit(X,y)
# knn_gscv.best_params_

In [52]:
knn2 = KNeighborsClassifier(n_neighbors=14)
knn2.fit(X,y)
knn2.score(X,y)

0.8149085471068676

And save:

In [53]:
y_pred_knn = knn2.predict(pd.get_dummies(df_test[features]))

In [54]:
from sklearn.ensemble import GradientBoostingClassifier
gbr = GradientBoostingClassifier(random_state = 1)
  
# Fit to training set
gbr.fit(X, y)
gbr.score(X,y)

0.8130679857356494

Seems slightly worse than our K-Neighbors Classifier. But still, we'll keep its predictions as well.

In [55]:
pred_y_gbr = gbr.predict(pd.get_dummies((df_test[features])))
pred_y_gbr

array([ True, False,  True, ...,  True,  True,  True])

In [56]:
gbc = GradientBoostingClassifier()
parameters = {
    "n_estimators":[5,50,100],
    "max_depth":[1,3,5],
   "learning_rate":[0.01,0.1,1]
}

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

cv = RandomizedSearchCV(gbc, parameters, n_iter=27, scoring='accuracy', n_jobs=-1, cv=5, random_state=1)
cv.fit(X,y)
cv.best_params_

{'n_estimators': 50, 'max_depth': 5, 'learning_rate': 0.1}

In [57]:
gbc1 = GradientBoostingClassifier(n_estimators=50,max_depth=5,learning_rate=0.1) #best params from gscv

gbc1.fit(X,y)
gbc1.score(X,y)

0.831013459105027

Trying the AdaBoostClassifier

In [58]:
#Giving a 0 on kaggle so we need to find something else
#from sklearn.ensemble import AdaBoostClassifier
#from sklearn.datasets import make_classification
#X, y = make_classification(n_samples=8040, n_features=27,
#                           n_informative=2, n_redundant=0,
#                           random_state=0, shuffle=False)
#clf = AdaBoostClassifier(n_estimators=100, random_state=0)
#clf.fit(X, y)
#AdaBoostClassifier(n_estimators=100, random_state=0)
#clf.predict(pd.get_dummies((df_test[features])))
#array([1])
#clf.score(X, y)


0.875

In [59]:
#pred_y_gbr2 = clf.predict(pd.get_dummies((df_test[features])))



Trying AdaBoost Regressor

In [78]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features=4, n_informative=2,
                        random_state=0, shuffle=False)
regr = AdaBoostRegressor(random_state=0, n_estimators=100)
regr.fit(X, y)
AdaBoostRegressor(n_estimators=100, random_state=0)
regr.predict([[0, 0, 0, 0]])

regr.score(X, y)


0.9771376939813695

In [79]:
y_pred = regr.predict(X)

ExtraTrees Regressor - NO

In [89]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
     X, y, random_state=0)
reg = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(
    X_train, y_train)
reg.score(X_test, y_test)

0.27081747066124695

Gradient Boosting Classifier 

In [90]:
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
     max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)


0.913

Gradient Boosting Regressor - NO

In [91]:
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
X, y = make_regression(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
   X, y, random_state=0)
reg = GradientBoostingRegressor(random_state=0)
reg.fit(X_train, y_train)
GradientBoostingRegressor(random_state=0)
reg.predict(X_test[1:2])
reg.score(X_test, y_test)

0.43848663277068134

Combining 3 models: GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier

In [101]:
from sklearn.ensemble import GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from scipy.stats import randint, uniform, reciprocal

# Define the classifiers
gbc_clf = GradientBoostingClassifier(random_state=42)
et_clf = ExtraTreesClassifier(random_state=42)

# Define the parameter distributions
param_dist_gbc = {
    "n_estimators": randint(100, 200),
    "learning_rate": reciprocal(0.001, 1.0),
    "max_depth": randint(5, 15),
    "max_leaf_nodes": randint(2, 12),
    "min_samples_split": randint(2, 12),
    "min_samples_leaf": randint(1, 8),
    "min_impurity_decrease": uniform(0.001, 0.1),
    "n_iter_no_change": randint(5, 15),
    "max_features": randint(10, 25),
}
param_dist_et = {
    "n_estimators": randint(100, 200),
    "criterion": ["gini", "entropy"],
    "max_depth": randint(5, 15),
    "max_features": randint(10, 25),
    "min_samples_split": randint(2, 12),
    "min_samples_leaf": randint(1, 8),
    "bootstrap": [True, False],
}

# Define the random search objects
gbc_rnd_search = RandomizedSearchCV(
    gbc_clf,
    param_distributions=param_dist_gbc,
    cv=10,
    n_iter=100,
    n_jobs=-1,
    scoring="accuracy",
    random_state=42,
)
et_rnd_search = RandomizedSearchCV(
    et_clf,
    param_distributions=param_dist_et,
    cv=10,
    n_iter=100,
    n_jobs=-1,
    scoring="accuracy",
    random_state=42,
)

# Perform the random searches
gbc_rnd_search.fit(X, y)
et_rnd_search.fit(X, y)

# Get the best estimators
gbc_best = gbc_rnd_search.best_estimator_
et_best = et_rnd_search.best_estimator_

# Define the voting classifier
voting_clf = VotingClassifier(
    estimators=[("gbc", gbc_best), ("et", et_best)],
    voting="hard",
)

# Fit the voting classifier
voting_clf.fit(X, y)

# Get the accuracy score
voting_clf.score(X, y)


550 fits failed out of a total of 1000.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
550 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\jkobe\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\jkobe\anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 586, in fit
    n_stages = self._fit_stages(
  File "C:\Users\jkobe\anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 663, in _fit_stages
    raw_predictions = self._fit_stage(
  File "C:\Users\jkobe\anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 246, in _fit_stage
    tree.fit(X, residual, sample_weight=sam

0.985

Combining: RandomForestClassifier, GradientBoostingClassifier

In [102]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from scipy.stats import randint, uniform, reciprocal

# Define the classifiers
gbc_clf = GradientBoostingClassifier(random_state=42)
rf_clf = RandomForestClassifier(random_state=42)
knn_clf = KNeighborsClassifier()

# Define the parameter distributions
param_dist_gbc = {
    "n_estimators": randint(100, 200),
    "learning_rate": reciprocal(0.001, 1.0),
    "max_depth": randint(5, 15),
    "max_leaf_nodes": randint(2, 12),
    "min_samples_split": randint(2, 12),
    "min_samples_leaf": randint(1, 8),
    "min_impurity_decrease": uniform(0.001, 0.1),
    "n_iter_no_change": randint(5, 15),
    "max_features": randint(10, 25),
}
param_dist_rf = {
    "n_estimators": randint(100, 200),
    "criterion": ["gini", "entropy"],
    "max_depth": randint(5, 15),
    "max_features": ["auto", "sqrt", "log2"],
    "min_samples_split": randint(2, 12),
    "min_samples_leaf": randint(1, 8),
}
param_dist_knn = {
    "n_neighbors": randint(1, 20),
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
    "leaf_size": randint(10, 50),
}

# Define the random search objects
gbc_rnd_search = RandomizedSearchCV(
    gbc_clf,
    param_distributions=param_dist_gbc,
    cv=10,
    n_iter=100,
    n_jobs=-1,
    scoring="accuracy",
    random_state=42,
)
rf_rnd_search = RandomizedSearchCV(
    rf_clf,
    param_distributions=param_dist_rf,
    cv=10,
    n_iter=100,
    n_jobs=-1,
    scoring="accuracy",
    random_state=42,
)
knn_rnd_search = RandomizedSearchCV(
    knn_clf,
    param_distributions=param_dist_knn,
    cv=10,
    n_iter=100,
    n_jobs=-1,
    scoring="accuracy",
    random_state=42,
)

# Perform the random searches
gbc_rnd_search.fit(X, y)
rf_rnd_search.fit(X, y)
knn_rnd_search.fit(X, y)

# Get the best estimators
gbc_best = gbc_rnd_search.best_estimator_
rf_best = rf_rnd_search.best_estimator_
knn_best = knn_rnd_search.best_estimator_

# Compute the cross-validation scores
gbc_scores = cross_val_score(gbc_best, X, y, cv=10)
rf_scores = cross_val_score(rf_best, X, y, cv=10)
knn_scores = cross_val_score(knn_best, X, y, cv=10)

# Print the scores
print("Gradient Boosting Classifier Scores: ", gbc_scores.mean())
print("Random Forest Classifier Scores: ", rf_scores.mean())
print("K-Nearest Neighbors Classifier Scores: ", knn_scores.mean())

550 fits failed out of a total of 1000.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
550 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\jkobe\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\jkobe\anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 586, in fit
    n_stages = self._fit_stages(
  File "C:\Users\jkobe\anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 663, in _fit_stages
    raw_predictions = self._fit_stage(
  File "C:\Users\jkobe\anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 246, in _fit_stage
    tree.fit(X, residual, sample_weight=sam

Gradient Boosting Classifier Scores:  0.96
Random Forest Classifier Scores:  0.9480000000000001
K-Nearest Neighbors Classifier Scores:  0.9240000000000002


Trying the Ensemble Stacking Classifier

In [106]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier
X, y = load_iris(return_X_y=True)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svr', make_pipeline(StandardScaler(),
     LinearSVC(random_state=42))) ]
clf = StackingClassifier(
estimators=estimators, final_estimator=LogisticRegression())
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, stratify=y, random_state=42 )
clf.fit(X_train, y_train).score(X_test, y_test)

0.9473684210526315

Ensemble Voting Classifier

In [109]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

clf1 = LogisticRegression(multi_class='multinomial', random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

eclf1 = VotingClassifier(estimators=[
         ('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
eclf1 = eclf1.fit(X, y)
print(eclf1.predict(X))

np.array_equal(eclf1.named_estimators_.lr.predict(X),
                eclf1.named_estimators_['lr'].predict(X))

eclf2 = VotingClassifier(estimators=[
         ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
         voting='soft')
eclf2 = eclf2.fit(X, y)
print(eclf2.predict(X))

# Define X_test and y_test
X_test = np.array([[0, 0], [-1, -2], [4, 2]])
y_test = np.array([1, 1, 2])

print("Accuracy score:", eclf2.score(X_test, y_test))


[1 1 1 2 2 2]
[1 1 1 2 2 2]
Accuracy score: 1.0


HistGradientBoostingClassifier 

In [110]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
clf = HistGradientBoostingClassifier().fit(X, y)
clf.score(X, y)

1.0

Combinations of Ensemble Methods with KNN as a base 

In [124]:
#from sklearn.ensemble import AdaBoostClassifier
#from sklearn.neighbors import KNeighborsClassifier
#from sklearn.model_selection import GridSearchCV
#from sklearn.pipeline import Pipeline
#from sklearn.preprocessing import StandardScaler
#import warnings
#warnings.simplefilter(action='ignore', category=FutureWarning)

#param_grid_knn = {'knn__n_neighbors': [3, 5, 7]}
#param_grid_adaboost = {'adaboost__n_estimators': [50, 100, 150], 'adaboost__learning_rate': [0.1, 0.5, 1]}

#pipe = Pipeline([
#    ('scaler', StandardScaler()),
#    ('knn', KNeighborsClassifier()),
#    ('adaboost', AdaBoostClassifier())
#])
#param_grid = {**param_grid_knn, **param_grid_adaboost}

#grid = GridSearchCV(pipe, param_grid, cv=5)
#grid.fit(X_test, y_test)

#print("Best parameters: ", grid.best_params_)
#print("Best score: ", grid.best_score_)



Combination of KNN with stacking ensemble method: 

In [123]:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

knn = Pipeline([
    ('scaler', StandardScaler()),
    ('kneighborsclassifier', KNeighborsClassifier())
])

rf = RandomForestClassifier()

estimators = [('knn', knn), ('rf', rf)]

stacking = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=3
)

param_grid_knn = {'knn__kneighborsclassifier__n_neighbors': [3, 5, 7]}
param_grid_rf = {'rf__n_estimators': [50, 100, 150], 'rf__max_depth': [3, 5, 7]}
param_grid_stacking = {'final_estimator__C': [0.1, 1, 10]}

param_grid = {**param_grid_knn, **param_grid_rf, **param_grid_stacking}

grid = GridSearchCV(stacking, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters: ", grid.best_params_)
print("Best score: ", grid.best_score_)



Best parameters:  {'final_estimator__C': 0.1, 'knn__kneighborsclassifier__n_neighbors': 7, 'rf__max_depth': 3, 'rf__n_estimators': 50}
Best score:  0.9636363636363636


submission.

In [60]:
# Logist_out2 = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported': y_pred_log2})
# Logist_out2.to_csv('submission.csv',index=False)

Logistic Regression competition Score = **0.79448**

In [61]:
# knn_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported': y_pred_knn})
# knn_out.to_csv('submission.csv',index=False)

KNN competition score = **0.79261**

In [62]:
# gbr_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported': pred_y_gbr})
# gbr_out.to_csv('submission.csv',index=False)

Gradient Boost competition score = **0.80056**

In [63]:
clf_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported':pred_y_gbr2})
clf_out.to_csv('submission.csv',index=False)

Tuned Gradient Boost competition score = **0.80476**

And so, we have a winner.

---