In [None]:
%%capture
!pip install missingno

In [None]:
import numpy as np 
import pandas as pd 
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display_html 

import missingno as msno

plt.style.use("ggplot")

# <div style="color:white;display:fill;border-radius:5px;background-color:#4f77a4;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Overview</p></div>

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6c/PIA19048_realistic_color_Europa_mosaic_edited.jpg/1920px-PIA19048_realistic_color_Europa_mosaic_edited.jpg" width=700 class="center">

*Source: Wikipedia*


The purpose of this notebook is to explore the data in order to better understand the variables and anticipate possible relationships between them and the target we want to predict (being transported or not). The idea is that this knowledge could help to develop better methods that yield more accurate results. For that reason I will use only the training data and perform (almost) no feature engineering.

# <div style="color:white;display:fill;border-radius:5px;background-color:#4f77a4;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Preparing the data</p></div>

Firstly, we are going to load the data (only train data) and take a quick look.

In [None]:
df_train = pd.read_csv("../input/spaceship-titanic/train.csv")
df_train.head()

In [None]:
TARGET = 'Transported'
feat = [col for col in df_train if col != TARGET]

df_train[feat].isna().sum()

In [None]:
msno.matrix(df_train[feat], figsize=(12,4), fontsize=12)
plt.show()

We see that around 200 values per column are missing and, according to the plot above, they are scattered across different rows rather than concentrated in a few of them. The numbers next to the sparkline on the right are the minimum and maximum number of non-missing values in a row: every row has between 10 and 13 non-missing values, where 13 means the row is complete. Since there are not that many NaNs, I am simply going to drop those rows for this first EDA. Later I'll consider using imputation techniques for the continuous and categorical variables.
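
Before dropping rows, it can be worth checking how many of them a blanket `dropna()` would actually remove, since missing values spread over different rows add up. A minimal sketch on a toy frame (the `toy` dataframe and its values are invented for illustration; on the real data one would use `df_train[feat]` instead):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train[feat]; values are invented for illustration only.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [24.0, 31.0, np.nan, 45.0],
    "VIP": [False, False, None, True],
})

# Number of missing values in each row.
missing_per_row = toy.isna().sum(axis=1)

# Fraction of rows that dropna() would remove (rows with at least one NaN).
frac_dropped = (missing_per_row > 0).mean()

print(missing_per_row.value_counts().sort_index())
print(f"Rows lost with dropna(): {frac_dropped:.0%}")
```

If the fraction turns out to be large, that is an argument for trying imputation sooner rather than later.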

Another thing I will do now is split the **PassengerId** and **Cabin** columns according to the description given in the competition and store the parts in new columns. As explained in the instructions, **PassengerId** encodes the group ID and the passenger's number within the group, and **Cabin** encodes deck, cabin number and side.

In [None]:
def preprocess_col(df):
    """Split Cabin and PassengerId into their components and return a copy."""
    df = df.copy()

    # Cabin has the form deck/num/side
    cabin_code = df['Cabin'].str.split('/', expand=True).rename(columns={0: 'CabinDeck', 1: 'CabinNum', 2: 'CabinSide'})
    df = pd.concat([cabin_code, df.drop('Cabin', axis=1)], axis=1)

    # PassengerId has the form gggg_pp: group and number within the group
    group_id = df['PassengerId'].str.split('_', expand=True).rename(columns={0: 'group', 1: 'id'})
    df = pd.concat([group_id, df.drop('PassengerId', axis=1)], axis=1)

    return df

In [None]:
df_train = df_train.dropna()

df_train = preprocess_col(df_train)
df_train.head()

This is the dataframe we are going to use for the EDA.

# <div style="color:white;display:fill;border-radius:5px;background-color:#4f77a4;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Exploratory Data Analysis</p></div>


## 1. Relationship between categorical variables and the target

In [None]:
sel_features = ['CabinDeck', 'CabinSide', 'VIP', 'Destination', 'HomePlanet', 'CryoSleep']

fig, axes = plt.subplots(2, 3, figsize=(22, 12))
axes = axes.ravel()
fig.suptitle('Categorical features VS. Transported')

for k, f in enumerate(sel_features):
    sns.countplot(y=f, hue=TARGET,
                  edgecolor=".6",
                  data=df_train, ax=axes[k])
    axes[k].set_title(f)

Interesting insights from the plot above:

1. Passengers staying in decks **B** and **C** seem more likely to be transported, and those in decks **E** and **F** less likely
2. Slightly more people were transported from cabins on side **S**
3. The **VIP** category is quite imbalanced, so it's difficult to infer a relationship with the target
4. **HomePlanet** and **Destination** seem to be related to being transported, especially traveling to **TRAPPIST-1e** (less likely) or **55 Cancri e** (more likely), and coming from **Earth** (less likely) or **Europa** (more likely). Are those passengers somehow related?
5. Being in **CryoSleep** shows a clear tendency: around 80% of the people in cryosleep were transported.
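
Reading rates off a count plot is approximate, so a figure like the one in point 5 can also be computed directly. A minimal sketch, using an invented toy frame in place of `df_train` (the numbers below are not the real ones):

```python
import pandas as pd

# Toy stand-in for df_train; values are invented for illustration only.
toy = pd.DataFrame({
    "CryoSleep":   [True, True, True, False, False, False, False, True],
    "Transported": [True, True, False, False, True, False, False, True],
})

# The mean of a boolean column is the fraction of True values in each group.
rate = toy.groupby("CryoSleep")["Transported"].mean()
print(rate)
```

The same `groupby(...)[TARGET].mean()` pattern should work for any of the categorical features in `sel_features`.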



Next, to follow up on the question raised in point 4, we are going to explore the relationship between origin and destination.

## 2. Relationship between passengers' origin and destination

In [None]:
c = sns.color_palette('Set2')
cTr, cCa, cPs = c[6], c[7], c[5]

f, axes = plt.subplots(figsize=(10, 5))
sns.histplot(data=df_train, y='HomePlanet', hue='Destination', 
             palette={'TRAPPIST-1e':cTr, 'PSO J318.5-22':cPs, '55 Cancri e':cCa}, multiple="stack")
plt.title('Nb of passengers grouped by HomePlanet and Destination')
plt.show()

From the plot above we can conclude that:
1. Most passengers come from **Earth**; **Europa** and **Mars** contribute similar, smaller numbers.
2. Most passengers are traveling to **TRAPPIST-1e**, regardless of their origin.
3. **PSO J318.5-22** is the destination with the fewest travelers, and most of them come from **Earth**.

Well, this shows some tendencies and makes sense, since the furthest destination is actually **PSO J318.5-22**, approximately [80 light-years](https://en.wikipedia.org/wiki/PSO_J318.5%E2%88%9222) away from the Solar System. On the other hand, [55 Cancri e](https://en.wikipedia.org/wiki/55_Cancri_e) and [TRAPPIST-1e](https://en.wikipedia.org/wiki/TRAPPIST-1e) are both about 40 light-years away (the distances between Earth, Mars and Europa are negligible at this scale). So **PSO J318.5-22** would be the last stop, twice as far from the home planets as the other two destinations. One could hypothesize that more people would choose to travel in cryosleep when heading to the furthest destination. Let's check that.

In [None]:
# df_train has no NaNs at this point, so a boolean mask on CryoSleep is enough
cryo = df_train[df_train['CryoSleep'] == True]
no_cryo = df_train[df_train['CryoSleep'] == False]

dist_cryo = pd.crosstab(cryo['HomePlanet'], cryo['Destination'])
dist_noCryo = pd.crosstab(no_cryo['HomePlanet'], no_cryo['Destination'])

df1_styler = dist_cryo.style.set_table_attributes("style='display:inline'").set_caption('Passengers in CryoSleep')
df2_styler = dist_noCryo.style.set_table_attributes("style='display:inline'").set_caption('Passengers awake')

display_html(df2_styler._repr_html_() + df1_styler._repr_html_(), raw=True)

In [None]:
f, axes = plt.subplots(1, 2, figsize=(18, 4), sharey=True)
axes = axes.ravel()

# df_train has no NaNs at this point, so a boolean mask on CryoSleep is enough
awake = df_train[df_train['CryoSleep'] == False]
asleep = df_train[df_train['CryoSleep'] == True]
dest_palette = {'TRAPPIST-1e': cTr, 'PSO J318.5-22': cPs, '55 Cancri e': cCa}

sns.histplot(data=awake, y='HomePlanet', hue='Destination',
             palette=dest_palette, multiple="stack", ax=axes[0])
axes[0].set_title('No CryoSleep')

sns.histplot(data=asleep, y='HomePlanet', hue='Destination',
             palette=dest_palette, multiple="stack", ax=axes[1])
axes[1].set_title('In CryoSleep')
axes[1].set_xlim([0,2600])

plt.show()

Splitting the previous plot into CryoSleep and No CryoSleep gives the tables and distributions above. Some comments:

1. The distribution of passengers who are not in cryosleep is very similar to the overall one we saw before, so the proportions we saw there still hold here.
2. For **PSO J318.5-22**, the CryoSleep and No CryoSleep counts are close: almost half of the people going to this planet are in cryosleep and the other half are not.
3. For the other two destinations, a smaller proportion generally travels in cryosleep. This is clear for **TRAPPIST-1e**, but less so for **55 Cancri e**, especially for passengers coming from **Europa**.
4. **Europa** actually seems to be the origin least affected by the CryoSleep condition, followed by **Mars**; **Earth** shows the largest differences.

In a nutshell, when we set aside the furthest destination (**PSO J318.5-22**), which is the one that should be most affected by the CryoSleep condition, what do we have? **Earth** is affected the most, **Mars** a little less, and **Europa** is barely affected.
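
Comparing two count tables by eye is error-prone, so the share of passengers in cryosleep for each origin and destination pair can also be computed in one step. A minimal sketch on an invented toy frame (on the real data, `toy` would be `df_train`; the `astype(bool)` cast is an assumption to cover the case where the column is stored as `object`):

```python
import pandas as pd

# Toy stand-in for df_train; values are invented for illustration only.
toy = pd.DataFrame({
    "HomePlanet":  ["Earth"] * 6 + ["Europa"] * 6,
    "Destination": ["TRAPPIST-1e"] * 4 + ["55 Cancri e"] * 2
                   + ["TRAPPIST-1e"] * 2 + ["55 Cancri e"] * 4,
    "CryoSleep":   [True, False, False, False, True, False,
                    True, False, True, True, False, False],
})

# Mean of a boolean = fraction of passengers in cryosleep per (origin, destination).
cryo_share = (
    toy.assign(CryoSleep=toy["CryoSleep"].astype(bool))
       .groupby(["HomePlanet", "Destination"])["CryoSleep"]
       .mean()
       .unstack()  # rows: HomePlanet, columns: Destination
)
print(cryo_share)
```

Each cell is then directly comparable across planets, regardless of how many passengers each planet contributes.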

This raises the question: why would so many people from **Europa** (compared to **Earth**) travel in cryosleep? Travelling in cryosleep must be more expensive ([I guess](https://en.wikipedia.org/wiki/Cryonics#Cryonics_in_practice)). Isn't it a bit too much just to go to **55 Cancri e** (50%) or to **TRAPPIST-1e** (39%), which are both a stone's throw away?

We might want to take a look at possible relationships between origin, cryosleep and money.


## 3. Relationship between purchasing power and origin

I didn't want to do any feature engineering in this kernel, but I will add a new column with the sum of the luxury amenities as a proxy for purchasing power. 

In [None]:
lux_amenities = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
df_train['BilledLuxury'] = df_train[lux_amenities].sum(axis=1)

We can see from the plot below that it is indeed *a matter of money*. Apparently, passengers from **Europa** are the rich people of the Solar System, so many of them may prefer to pay a bit more and get to their destination in a heartbeat. **Mars** comes in second place and, finally, passengers from **Earth** have the smallest purchasing power and are therefore the ones most affected by the CryoSleep condition, as we saw before.

(I've set the y-axis to a log scale because BilledLuxury has a very long-tailed distribution: a few people spend a lot of money and push the average up, as happens in real life, which is why reporting only the mean is often not a very good idea.)

In [None]:
c = sns.color_palette('Set2')
cEu, cMa, cEa = c[0:3]
g = sns.boxplot(x="HomePlanet", y="BilledLuxury", data=df_train, palette={'Earth': cEa, 'Mars': cMa, 'Europa': cEu})
g.set_yscale("log")
plt.show()
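
One caveat with the log axis: a value of exactly zero cannot be drawn on it, so any passenger with no spending at all is not represented at their true position. A possible workaround, sketched here on invented numbers rather than the real column, is to transform the values with `np.log1p` before plotting; the same toy series also illustrates how far a long tail pulls the mean away from the median:

```python
import numpy as np
import pandas as pd

# Invented long-tailed spending values, including zeros.
spend = pd.Series([0.0, 0.0, 9.0, 99.0, 9999.0], name="BilledLuxury")

# log1p(x) = log(1 + x) is defined at zero, unlike log(x).
log_spend = np.log1p(spend)

# A few large values dominate the mean but leave the median untouched.
print("mean:", spend.mean(), "median:", spend.median())
print(log_spend.round(2).tolist())
```

Alternatively, matplotlib's `"symlog"` scale (`g.set_yscale("symlog")`) keeps the original units on the axis while still accepting zeros.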

Let's explore further the relationship between **HomePlanet** and the purchasing power of passengers in terms of VIP status, luxury amenities and cabin deck.

In [None]:
pie_vip = df_train.loc[df_train['VIP']==True]['HomePlanet'].value_counts()
pie_novip = df_train.loc[df_train['VIP']==False]['HomePlanet'].value_counts()

# Look colors up by planet name so they don't depend on the value_counts() ordering
planet_colors = {'Earth': cEa, 'Mars': cMa, 'Europa': cEu}

f, axes = plt.subplots(1, 3, figsize=(13, 5))
axes[0].pie(df_train['VIP'].value_counts(), labels=['No VIP', 'VIP'], autopct='%1.1f%%', colors=[c[-1], c[-2]])
axes[1].pie(pie_novip, labels=pie_novip.index, autopct='%1.1f%%', colors=[planet_colors[p] for p in pie_novip.index])
axes[1].set_title('No VIP')
axes[2].pie(pie_vip, labels=pie_vip.index, autopct='%1.1f%%', colors=[planet_colors[p] for p in pie_vip.index])
axes[2].set_title('VIP')
plt.show()

In [None]:
f, ax = plt.subplots(2, 1, figsize=(8,10))
ax = ax.ravel()

g1 = sns.histplot(data=df_train.sort_values('CabinDeck'), x='CabinDeck', hue='HomePlanet', 
                  palette={'Earth': cEa, 'Mars': cMa, 'Europa': cEu}, multiple="stack", stat='percent', ax=ax[0])

g2 = sns.boxplot(data=df_train.sort_values("CabinDeck"), x="CabinDeck", y="BilledLuxury", 
                 color=c[7], ax=ax[1])
g2.set_yscale("log")

The previous figures seem to confirm our hypothesis. First of all, most passengers don't travel as **VIP**. Of those who do, 68% come from **Europa**, the other 32% come from **Mars**, and none come from **Earth**.

Moreover, most passengers from **Europa** are in decks **A**, **B** and **C**, which might be the most comfortable ones; no passenger coming from **Earth** or **Mars** is in those decks. Decks **D** and **E** seem to be mid-level, with more heterogeneity. Finally, decks **F** and **G** are probably the cheapest ones, where most people coming from **Earth** are staying. Only a few people are staying in deck **T** and all of them come from **Europa**, which could indicate that it's also an exclusive one (?).
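
The percentages shown in the pie charts can also be obtained as a table, which is easier to reuse later. A minimal sketch with `pd.crosstab` on an invented toy frame (the values do not reproduce the real shares; on the real data `toy` would be `df_train`):

```python
import pandas as pd

# Toy stand-in for df_train; values are invented for illustration only.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Europa", "Europa", "Mars", "Europa"],
    "VIP":        [False, False, True, True, True, False],
})

# Readable row labels instead of True/False.
vip_label = toy["VIP"].map({True: "VIP", False: "No VIP"})

# normalize="index" makes every row sum to 1: the share of each planet within a VIP class.
vip_share = pd.crosstab(vip_label, toy["HomePlanet"], normalize="index")
print(vip_share.round(2))
```

Swapping `"HomePlanet"` for `"CabinDeck"` would give the analogous breakdown per deck.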

Finally, let's see how these insights relate to age.

## 4. Relationship between passengers' age, economic spending and origin

In [None]:
f, ax = plt.subplots(figsize=(12, 4))
g = sns.histplot(data=df_train, x='Age', bins=40, kde=True, ax=ax)
plt.show()

We can see that the distribution is centered around 20-30, with a longer right tail. It may be interesting to group the data by age range.

In [None]:
x0 = 0
step = 10

n_ranges = np.ceil(df_train['Age'].max()/step).astype(int)
for n in range(n_ranges):
    x1 = x0 + step
    # .loc (not .at, which only accepts a single label) assigns to all matching rows
    in_range = (df_train['Age'] >= x0) & (df_train['Age'] < x1)
    df_train.loc[in_range, 'AgeRange'] = f'{x0}-{x1}'
    x0 = x1
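
As an alternative to the manual loop, `pd.cut` can build the same half-open bins in one call and returns an ordered categorical, so sorting follows the numeric order of the ranges rather than the alphabetical order of the labels. A minimal sketch on invented ages (the `ages` series is a toy stand-in for `df_train['Age']`):

```python
import pandas as pd

# Invented ages for illustration only.
ages = pd.Series([0, 9, 10, 25, 39, 79], name="Age")

step = 10
# Round the upper edge up so that the maximum age always falls inside the last bin.
upper = (int(ages.max()) // step + 1) * step
edges = list(range(0, upper + step, step))           # 0, 10, ..., upper
labels = [f"{lo}-{lo + step}" for lo in edges[:-1]]  # "0-10", "10-20", ...

# right=False gives intervals [lo, hi), matching the >= x0 and < x1 test in the loop.
age_range = pd.cut(ages, bins=edges, labels=labels, right=False)
print(age_range.tolist())
```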

In [None]:
f, ax = plt.subplots(2, 1, figsize=(10, 10))
sns.stripplot(x="AgeRange", y="BilledLuxury", data=df_train.sort_values('AgeRange'), alpha=0.4, palette='Set3', ax=ax[0])
sns.histplot(data=df_train.sort_values('AgeRange'), x='AgeRange', hue='HomePlanet', 
             palette={'Earth': cEa, 'Mars': cMa, 'Europa': cEu}, multiple="stack", ax=ax[1])
ax[0].set_title('Age vs. BilledLuxury')
ax[1].set_title('Age vs. HomePlanet')
plt.show()

So, passengers between 20 and 50 years old are generally the ones spending the most. The age distribution is quite similar across planets, with the largest group of passengers in the 20-30 range. Although **Earth** has more passengers in the 20-30 range, **Europa** and **Mars** are more balanced between 20-30 and 30-40, so the money spent by those older ranges carries more weight in the **Age vs. BilledLuxury** distribution.

I have one last question: Where are those old people going...?

In [None]:
f, ax = plt.subplots(figsize=(10, 5))
sns.histplot(data=df_train.sort_values('AgeRange'), x='AgeRange', hue='Destination', 
             palette={'TRAPPIST-1e':cTr, 'PSO J318.5-22':cPs, '55 Cancri e':cCa}, multiple="stack")
plt.show()

Fortunately, not many of them are going to **PSO J318.5-22** (which is 80 light-years away), but even for the other two destinations, 40 light-years seems too much for passengers who are already 70-80 years old, right?

Please, tell me they are at least traveling in cryosleep...

In [None]:
f, ax = plt.subplots(figsize=(10, 5))
sns.histplot(data=df_train.sort_values('AgeRange'), x='AgeRange', hue='CryoSleep', 
             palette='Set1', multiple="stack")
plt.show()

Ok, so there are just a few passengers between 70 and 80 years old, and they don't travel in cryosleep?! Did nobody tell those old men and women that it would make sense to pay a bit more and arrive alive at their destination? Well, I hope that 1000 years from now life expectancy will be way higher...

# <div style="color:white;display:fill;border-radius:5px;background-color:#4f77a4;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Conclusion</p></div>

In this analysis we saw possible relationships between our categorical variables and the likelihood of being transported. On top of that, we explored possible population differences between groups coming from or heading to the six planets at hand. Given what we saw with the cryosleep condition, we might want to consider the distance between origin and destination in the model. There is also a socio-economic pattern that affects the cabin deck where passengers are staying and whether they are in cryosleep, and that can therefore have an impact on the likelihood of being transported.
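
One simple way to feed the distance idea to a model is a lookup table from destination to distance. This is only a sketch: the distances are the approximate figures quoted earlier in this notebook, the column name `DestDistanceLy` is hypothetical, and `toy` stands in for the real dataframe.

```python
import pandas as pd

# Approximate distances from the Solar System in light-years, as quoted in the text above.
DEST_DISTANCE_LY = {
    "TRAPPIST-1e": 40,
    "55 Cancri e": 40,
    "PSO J318.5-22": 80,
}

# Toy stand-in for df_train; values are invented for illustration only.
toy = pd.DataFrame({"Destination": ["TRAPPIST-1e", "PSO J318.5-22", "55 Cancri e"]})

# Hypothetical numeric feature; destinations missing from the table would map to NaN.
toy["DestDistanceLy"] = toy["Destination"].map(DEST_DISTANCE_LY)
print(toy)
```

Since two of the three destinations share the same distance, this feature mostly separates **PSO J318.5-22** from the rest, so whether it helps a model is something to verify rather than assume.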

If you've reached this point, I just want to thank you for reading, and I hope this can be of help for your models!