In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np

# Training data

In [None]:
df = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv") # read training data
df.head() # print first 5 lines of dataframe

In [None]:
df.shape # dataframe dimensions

In [None]:
df.describe() # main properties of numeric features

In [None]:
df.describe(exclude="number") # main properties of non-numeric features

# Datatypes parsing

First, let's parse aggregated data into its components.
In particular, let's split the `PassengerId`, `Cabin`, and `Name` information.

## `PassengerId` parsing

As the documentation reports, the passenger ID takes the form `gggg_pp`, where `gggg` identifies the group the passenger is travelling with and `pp` is the passenger's number within that group.

In [None]:
df.insert(loc=0, column="GroupMember", value=df.PassengerId.apply(lambda x: x.split("_")[-1]))
df.insert(loc=0, column="Group", value=df.PassengerId.apply(lambda x: x.split("_")[-2]))
df = df.drop(columns="PassengerId")

df.head()

`Group` and `GroupMember` are still represented by strings.

In [None]:
df.iloc[:, :2].dtypes

Let's parse them as numbers.

In [None]:
df.Group = df.Group.astype(int)
df.GroupMember = df.GroupMember.astype(int)

df.head()
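
The same split can also be written with pandas' vectorized string methods. The following is only a minimal sketch on made-up identifiers in the documented `gggg_pp` format; the `GroupSize` column is a hypothetical extra feature, not something used elsewhere in this notebook.

```python
import pandas as pd

# Toy identifiers in the documented gggg_pp format (made-up values)
ids = pd.Series(["0001_01", "0003_01", "0003_02"], name="PassengerId")

# Split once on "_" into two columns, then cast both to integers
parts = ids.str.split("_", expand=True).astype(int)
parts.columns = ["Group", "GroupMember"]

# Hypothetical extra feature: number of passengers sharing the same group
parts["GroupSize"] = parts.groupby("Group")["GroupMember"].transform("count")

print(parts)
```

Splitting once with `expand=True` avoids calling `apply` twice and yields both columns together.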

## `Cabin` parsing

As the documentation reports, the cabin number takes the form `deck/num/side`, where side can be either $P$ for Port or $S$ for Starboard.

In [None]:
n = df.columns.tolist().index("Cabin") # Cabin column position in dataframe

for i, col in enumerate(["Side", "CabinNumber", "Deck"]):
    df.insert(loc=n, column=col, value=df.Cabin.apply(lambda x: x.split("/")[-i-1] if isinstance(x, str) else np.nan)) # there are NaNs to consider!
    
df = df.drop(columns="Cabin")
df.head()

In [None]:
df.loc[:, ["Deck", "CabinNumber", "Side"]].isna().sum()

The three new columns have the same number of null values, matching the missing `Cabin` entries, as expected.
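
As a cross-check, the `deck/num/side` split can be sketched with `Series.str.split`, which propagates missing values on its own. The cabin values below are made up for illustration, and the numeric conversion of `CabinNumber` is an optional step that this notebook does not perform.

```python
import numpy as np
import pandas as pd

# Toy cabins in the documented deck/num/side format, with one missing value
cabins = pd.Series(["B/0/P", np.nan, "F/12/S"], name="Cabin")

# expand=True returns one column per component; a missing cabin stays missing everywhere
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "CabinNumber", "Side"]

# Optional: treat the cabin number as a number instead of a string
cabin_number = pd.to_numeric(parts["CabinNumber"])

print(parts)
print(parts.isna().sum())
```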

## `Name` parsing

In [None]:
df.Name.apply(lambda x: len(x.split()) if type(x) == str else 0).value_counts() # count words in names

Names are either missing or made of a first name and a family name. Let's split them: the family name can be useful to search for family connections.

In [None]:
n = df.columns.tolist().index("Name") # Name column position in dataframe

for i, col in enumerate(["FamilyName", "FirstName"]):
    df.insert(loc=n, column=col, value=df.Name.apply(lambda x: x.split()[-i-1] if isinstance(x, str) else np.nan)) # there are NaNs to consider!
    
df = df.drop(columns="Name")
df.head()

In [None]:
df.loc[:, ["FirstName", "FamilyName"]].isna().sum()

Both new columns have the same number of null values, matching the missing `Name` entries, as expected.
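
One possible way to look for family connections is to count how many passengers share a family name. This is only a sketch on made-up names; `FamilySize` is a hypothetical feature, and sharing a family name does not guarantee that two passengers are related.

```python
import numpy as np
import pandas as pd

# Made-up family names, including a missing one
family = pd.Series(["Vines", "Vines", "Susent", np.nan], name="FamilyName")

# value_counts ignores missing values; map leaves them as NaN
family_size = family.map(family.value_counts())
family_size.name = "FamilySize"  # hypothetical feature name

print(pd.concat([family, family_size], axis=1))
```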

## Final parsing

Let's check if all datatypes are correctly parsed.

In [None]:
df.dtypes.loc[lambda x: x == "object"]

`CryoSleep` and `VIP` should be boolean.

In [None]:
for col in ["CryoSleep", "VIP"]:
    df[col] = df[col].astype("boolean") # nullable boolean: a plain astype(bool) would turn NaN into True

In [None]:
df.dtypes
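
Casting a column that still contains missing values deserves some care. The sketch below, on made-up values, compares a plain `bool` cast with pandas' nullable `"boolean"` dtype: since `bool(np.nan)` is `True`, the plain cast silently turns missing entries into `True`, while the nullable dtype keeps them missing.

```python
import numpy as np
import pandas as pd

# Toy object column mixing booleans and a missing value
flags = pd.Series([True, False, np.nan], dtype=object)

as_bool = flags.astype(bool)           # NaN is truthy, so it becomes True
as_nullable = flags.astype("boolean")  # missing value is kept as <NA>

print(as_bool.tolist())
print(as_nullable.isna().sum())
```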

# Missing values cleaning

Null values can either be replaced with valid ones or be dropped.
Keeping more information is usually better, since we will have more data to train a model, but sometimes it is quite difficult to find the best way to replace all missing values.

There are various techniques that can be used:
- interpolation for numeric series (i.e. fit a function between two valid values to fill the missing ones);
- find a useful relation between features (i.e. columns) and use one of them to recover information about another;
- generate random values from the distribution of the valid ones;
- [...]
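
The three techniques listed above can be sketched on a toy dataframe. All values are made up, and the assumption that passengers in the same group share a `HomePlanet` is purely illustrative, not something verified in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed for reproducibility

toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, np.nan, 30.0],
    "HomePlanet": ["Earth", np.nan, "Mars", "Mars", np.nan],
    "Group": [1, 1, 2, 2, 3],
})

# 1. Interpolation: linear fit between neighbouring valid values
#    (meaningful only when the row order carries information)
age_interp = toy["Age"].interpolate()

# 2. Relation between features: ASSUMING group members share a home planet,
#    fill each gap with the first known value in the same group
group_planet = toy.groupby("Group")["HomePlanet"].transform("first")
planet_filled = toy["HomePlanet"].fillna(group_planet)

# 3. Random values: sample replacements from the observed valid values
missing = toy["Age"].isna()
age_random = toy["Age"].copy()
age_random.loc[missing] = rng.choice(toy["Age"].dropna().to_numpy(), size=missing.sum())

print(age_interp.tolist())
print(planet_filled.tolist())
print(age_random.tolist())
```

Note that the last passenger has no group mate with a known planet, so the second technique leaves that value missing.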

There are also various ways to drop missing data:
- drop all entries (i.e. rows) with any null value;
- drop an entire feature with missing values.

In addition, when dropping values we could also use a threshold on the percentage of missing values, dropping a feature or an entry only when that percentage is exceeded.
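
A threshold-based drop could look like the following sketch. The toy data and the 50% threshold (`max_missing`) are arbitrary choices made only for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 4.0],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values per feature
missing_share = toy.isna().mean()

# Drop features whose missing fraction exceeds a hypothetical threshold
max_missing = 0.5
kept = toy.loc[:, missing_share <= max_missing]

# Then drop the remaining rows with any missing value
rows = kept.dropna()

# Row-wise alternative: keep rows with at least 2 non-null values
rows_thresh = toy.dropna(thresh=2)

print(kept.columns.tolist())
print(len(rows), len(rows_thresh))
```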

During training, dropping entries is not so painful: we lose the information in the discarded rows, but the same kind of information remains in the other rows.
On the contrary, when an entire feature is discarded its information is totally lost, and a model trained without it will not be able to use it anymore.

In this notebook, entries with missing values are simply dropped, but any of the techniques mentioned above could be used instead.

In [None]:
df.isna().any(axis=1).sum() / df.shape[0] * 100 # percentage of rows with missing values

We are going to drop $24\%$ of the original data. 

In [None]:
df = df.dropna() # drop rows with missing values
df = df.reset_index(drop=True) # reset the incremental index dropping the current one

df.head()

In [None]:
df.isna().any(axis=None) # is there any missing value in the dataframe?

# Non-numeric values encoding

Most machine learning library tools work only on numeric data. Because of this, we have to encode all objects as numeric values.

The resulting range is not so important here; the key point is that every qualitative value is represented by a different number.
Later, the numeric values will be transformed depending on the supervised model to train, and that is the point where the range of the values has to be chosen.

To encode non-numeric values we will use _Scikit-learn_ tools.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
df.dtypes.loc[lambda x: x == "object"]

In [None]:
cols = df.dtypes.loc[lambda x: x == "object"].index.tolist() # features with non-numeric values
encoders = {col: LabelEncoder() for col in cols} # one encoder per feature

for col in encoders.keys():
    df[col] = encoders[col].fit_transform(df[col]) # learn the distinct labels of the feature, then encode them as integers

In [None]:
df.head()

In [None]:
df.dtypes
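
To make the mapping concrete, here is a small sketch of what `LabelEncoder` does, using made-up planet names. The encoder sorts the distinct labels and assigns consecutive integers, so the resulting numbers carry no meaning beyond identity.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up categorical values
planets = ["Mars", "Earth", "Europa", "Earth"]

encoder = LabelEncoder()
codes = encoder.fit_transform(planets)  # fit and encode in one step

print(list(encoder.classes_))  # sorted distinct labels
print(codes.tolist())          # integer code of each value

# The mapping can be reversed to recover the original labels
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())
```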

# Notebook output

Now the encoders and the final dataframe are saved for further processing in other notebooks.
The encoders are saved in case the inverse transformation of labels is needed.

In [None]:
import pickle as pkl

In [None]:
# save encoders using pickle serialization
with open("encoders.pkl", "wb") as file: 
    pkl.dump(encoders, file)

# save dataframe to a csv file
df.to_csv("train.csv", index=False) # don't save incremental index
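
In another notebook, the saved encoders could be loaded back and used to recover the original labels. The following is a minimal sketch of that round trip; it uses a temporary directory and a made-up encoder so that it runs on its own, whereas a real notebook would open the saved `encoders.pkl` directly.

```python
import pickle as pkl
import tempfile
from pathlib import Path

from sklearn.preprocessing import LabelEncoder

# Made-up encoder standing in for the dictionary saved by this notebook
encoders = {"HomePlanet": LabelEncoder().fit(["Earth", "Europa", "Mars"])}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "encoders.pkl"

    # Save the encoders, as done above
    with open(path, "wb") as file:
        pkl.dump(encoders, file)

    # Load them back, as another notebook would do
    with open(path, "rb") as file:
        restored = pkl.load(file)

# Turn encoded values back into their original labels
labels = restored["HomePlanet"].inverse_transform([2, 0])
print(labels.tolist())
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.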