# LAB | Feature Engineering

**Load the data**

In this challenge, we will be working with the same Spaceship Titanic data, like the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [29]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [30]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [31]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [32]:
#your code here
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [33]:
#your code here
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


There are multiple strategies to handle missing data

- Removing all rows or all columns containing missing data.
- Filling all missing values with a value (mean in continouos or mode in categorical for example).
- Filling all missing values with an algorithm.

For this exercise, because we have such low amount of null values, we will drop rows containing any missing value. 

In [34]:
#your code here
spaceship.dropna()
spaceship.columns=spaceship.columns.str.lower()

- **Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [35]:
#your code here
spaceship.head()
spaceship['cabin'].unique()


array(['B/0/P', 'F/0/S', 'A/0/S', ..., 'G/1499/S', 'G/1500/S', 'E/608/S'],
      dtype=object)

In [36]:
spaceship['cabin']=spaceship['cabin'].str[0]

- Drop PassengerId and Name

In [37]:
#your code here
spaceship=spaceship.drop(['passengerid', 'name'], axis=1)


In [38]:
spaceship 

Unnamed: 0,homeplanet,cryosleep,cabin,destination,age,vip,roomservice,foodcourt,shoppingmall,spa,vrdeck,transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,A,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False
8689,Earth,True,G,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False
8690,Earth,False,G,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True
8691,Europa,False,E,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False


- For non-numerical columns, do dummies.

In [39]:
#let´s see our non numerical columns

cat_cols=spaceship.select_dtypes(exclude='number').columns
cat_cols=cat_cols.drop(['transported'])
cat_cols

Index(['homeplanet', 'cryosleep', 'cabin', 'destination', 'vip'], dtype='object')

In [40]:
#let´s do the dummies
spaceship_dummies=pd.get_dummies(spaceship, columns=cat_cols)
spaceship_dummies

Unnamed: 0,age,roomservice,foodcourt,shoppingmall,spa,vrdeck,transported,homeplanet_Earth,homeplanet_Europa,homeplanet_Mars,...,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T,destination_55 Cancri e,destination_PSO J318.5-22,destination_TRAPPIST-1e,vip_False,vip_True
0,39.0,0.0,0.0,0.0,0.0,0.0,False,False,True,False,...,False,False,False,False,False,False,False,True,True,False
1,24.0,109.0,9.0,25.0,549.0,44.0,True,True,False,False,...,False,False,True,False,False,False,False,True,True,False
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,False,True,False,...,False,False,False,False,False,False,False,True,False,True
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,False,True,False,...,False,False,False,False,False,False,False,True,True,False
4,16.0,303.0,70.0,151.0,565.0,2.0,True,True,False,False,...,False,False,True,False,False,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0,False,False,True,False,...,False,False,False,False,False,True,False,False,False,True
8689,18.0,0.0,0.0,0.0,0.0,0.0,False,True,False,False,...,False,False,False,True,False,False,True,False,True,False
8690,26.0,0.0,0.0,1872.0,1.0,0.0,True,True,False,False,...,False,False,False,True,False,False,False,True,True,False
8691,32.0,0.0,1049.0,0.0,353.0,3235.0,False,False,True,False,...,False,True,False,False,False,True,False,False,True,False


**Perform Train Test Split**

In [None]:
#select features 
target=spaceship['transported']
features=spaceship_dummies
print(features)

       age  roomservice  foodcourt  shoppingmall     spa  vrdeck  transported  \
0     39.0          0.0        0.0           0.0     0.0     0.0        False   
1     24.0        109.0        9.0          25.0   549.0    44.0         True   
2     58.0         43.0     3576.0           0.0  6715.0    49.0        False   
3     33.0          0.0     1283.0         371.0  3329.0   193.0        False   
4     16.0        303.0       70.0         151.0   565.0     2.0         True   
...    ...          ...        ...           ...     ...     ...          ...   
8688  41.0          0.0     6819.0           0.0  1643.0    74.0        False   
8689  18.0          0.0        0.0           0.0     0.0     0.0        False   
8690  26.0          0.0        0.0        1872.0     1.0     0.0         True   
8691  32.0          0.0     1049.0           0.0   353.0  3235.0        False   
8692  44.0        126.0     4688.0           0.0     0.0    12.0         True   

      homeplanet_Earth  hom

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [60]:
#check for null values in our features

features.isna().sum()
features=features.fillna(features.median())


In [61]:
#your code 

X_train, X_test, y_train, y_test= train_test_split(features, target, test_size=.20, random_state=0)

In [62]:
X_train.head()
y_train.head()

4278    False
5971    False
464     False
4475    False
8469     True
Name: transported, dtype: bool

In [63]:
# summon classifier
from sklearn.neighbors import KNeighborsClassifier


In [64]:
knn=KNeighborsClassifier(n_neighbors=3)

In [65]:
y_train

4278    False
5971    False
464     False
4475    False
8469     True
        ...  
4373     True
7891    False
4859    False
3264    False
2732    False
Name: transported, Length: 6954, dtype: bool

In [66]:
#fit the model to our data
knn.fit(X_train, y_train)

In [67]:
pred=knn.predict(X_test)
pred

array([False,  True, False, ...,  True, False,  True])

- Evaluate your model's performance. Comment it

In [68]:
#your code here

knn.score(X_test, y_test)



0.8332374928119609