# Spaceship Titanic Data Analysis
https://www.kaggle.com/competitions/spaceship-titanic/data?select=train.csv

## What we need:
- [x] Data Loading
- [x] Data Cleaning
- [x] Data Wrangling and Aggregation
- [x] Data Plotting

### Data Loading
***

In [None]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

In [None]:
#Load the csv data and print row one.
df = pd.read_csv('../input/spaceship-titanic/train.csv')
#Test that it's open
df.head(1)

### Data Cleaning
***

The df.head(1) call above simply shows the first row of the dataset, which gives a rough idea of the columns we'll be working with.

In [None]:
#view column info and null values
df.info()

This gives a more in-depth view of our dataset. As you can see, there are 14 columns in total, and the info shows the data type of each one. There are 6 float columns: Age, RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck. There are also 7 object columns, which we can treat as strings in this case: PassengerId, HomePlanet, CryoSleep, Cabin, Destination, VIP, and Name. The final data type is bool, which is used only for the Transported column. The output also shows the memory usage of the DataFrame, as well as the number of non-null values within each column.
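
As a quick cross-check of the dtype tally above, the counts can also be computed directly instead of being read off the info output. This is a minimal sketch on a made-up toy frame (the values and the name toy are illustrative, only the column names mirror train.csv). Depending on the pandas version, text columns may be reported as object or as a string dtype.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror train.csv.
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01"],
    "Age": [39.0, 24.0],
    "RoomService": [0.0, 109.0],
    "Transported": [False, True],
})

# Count how many columns share each dtype (dtype names converted to strings).
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```

On the real data, the same call would be df.dtypes.astype(str).value_counts().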

In [None]:
df.describe()

This shows summary statistics for the six float columns of the dataset. Here we can see that a significant portion of people didn't spend any money on services. The most someone spent on a single service was 29813 dollars at the FoodCourt, and the smallest of the per-service maximums was 14327 dollars on RoomService.
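
To put a number on "a significant portion", one option is to compute the share of zero values per service. The sketch below uses a made-up toy frame (the values, toy, and spend_cols are illustrative assumptions), so the printed shares say nothing about the real dataset.

```python
import pandas as pd

# The five spending columns named in the dataset.
spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy frame with made-up values for four passengers.
toy = pd.DataFrame({
    "RoomService": [0.0, 109.0, 0.0, 0.0],
    "FoodCourt": [0.0, 9.0, 3576.0, 0.0],
    "ShoppingMall": [0.0, 25.0, 0.0, 0.0],
    "Spa": [0.0, 549.0, 6715.0, 0.0],
    "VRDeck": [0.0, 44.0, 49.0, 0.0],
})

# Share of passengers who spent nothing on each individual service.
zero_share = (toy[spend_cols] == 0).mean()

# Share of passengers who spent nothing on any service at all.
no_spend_share = (toy[spend_cols].sum(axis=1) == 0).mean()

print(zero_share)
print(no_spend_share)
```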


We know that some rows contain null values, so to make the data easier to work with we're going to remove every row that has one. This weeds out the incomplete portions of the dataset and makes the later plots easier to read.

In [None]:
df.isnull().sum()

In [None]:
df.dropna(inplace = True)
df

With the null rows gone, we can turn CryoSleep and VIP from object type to bool type.

In [None]:
df['CryoSleep'] = df['CryoSleep'].astype(bool)
df['VIP'] = df['VIP'].astype(bool)
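
The order of these steps matters: converting an object column to bool while it still contains nulls silently turns each null into True. The sketch below shows this on a made-up Series (raw, naive, and safe are illustrative names), which is why the conversion is only safe after dropna.

```python
import numpy as np
import pandas as pd

# Made-up object column with one missing value, like CryoSleep before cleaning.
raw = pd.Series([True, False, np.nan], dtype=object)

# Converting directly: NaN is truthy, so the missing value becomes True.
naive = raw.astype(bool)

# Dropping nulls first keeps only the genuine True/False values.
safe = raw.dropna().astype(bool)

print(naive.tolist())
print(safe.tolist())
```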

Now if there are any duplicate names, we'll go ahead and remove them.

In [None]:
df.drop_duplicates(subset = ['Name'], inplace = True)
df
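
Before dropping anything, it can help to look at how many rows share a name and what those rows are, since two different passengers can legitimately have the same name. This is a minimal sketch on made-up data (the names and IDs are invented for illustration).

```python
import pandas as pd

# Toy frame with one repeated, made-up name.
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01"],
    "Name": ["Ada Example", "Bo Sample", "Ada Example"],
})

# Number of rows that repeat an earlier name.
n_dupes = toy.duplicated(subset=["Name"]).sum()

# Every row involved in a name collision, for manual inspection.
dupes = toy[toy.duplicated(subset=["Name"], keep=False)]

# drop_duplicates keeps the first occurrence of each name by default.
deduped = toy.drop_duplicates(subset=["Name"])

print(n_dupes)
print(dupes)
print(deduped)
```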

Now we'll add up the spending on all of the paid services into a single ServiceSum column so that it's easier to graph and analyze.

In [None]:
df["ServiceSum"] = df["RoomService"] + df["FoodCourt"] + df["ShoppingMall"] + df["Spa"] + df["VRDeck"]
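
An equivalent way to build the total is a row-wise sum over a list of columns. One difference worth knowing: sum(axis=1) skips nulls by default, while chaining + propagates them. Since the null rows were already dropped here, both give the same result on the real data. The sketch below uses made-up values, and spend_cols and toy are illustrative names.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy frame with made-up values; the last row has a missing RoomService.
toy = pd.DataFrame({
    "RoomService": [109.0, 43.0, np.nan],
    "FoodCourt": [9.0, 3576.0, 1.0],
    "ShoppingMall": [25.0, 0.0, 2.0],
    "Spa": [549.0, 6715.0, 3.0],
    "VRDeck": [44.0, 49.0, 4.0],
})

# Row-wise sum; nulls are skipped by default (skipna=True).
toy["ServiceSum"] = toy[spend_cols].sum(axis=1)

# Chained addition; a null in any column makes the whole total null.
chained = (toy["RoomService"] + toy["FoodCourt"] + toy["ShoppingMall"]
           + toy["Spa"] + toy["VRDeck"])

print(toy["ServiceSum"].tolist())
print(chained.tolist())
```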

In [None]:
#Check for null values
df.isnull().sum()

#### Cleaning Summary
***
- [x] Drop all of the NULL values
- [x] Drop the name duplicates
- [x] Turn CryoSleep and VIP into bool values
- [x] Merge all services
***
From the table below, we can reevaluate the dataset and see that every column now has the same non-null count, and that the only change to the columns is the new ServiceSum column.

In [None]:
df.info()

We can also see that VIP and CryoSleep are now bool values, which will tremendously help with data wrangling later on.

In [None]:
df.describe()

We can also see that there is a slight change in our statistics since we removed all the rows with null values; however, it does not appear to be a significant portion of people or a significant amount of data lost.
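
Whether the loss is significant can be checked by comparing row counts before and after dropna. This is a minimal sketch on a made-up frame (toy, n_before, n_after, and lost_share are illustrative names). On the real data, the length of df would need to be recorded before the dropna cell runs.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; one of four rows has a missing Age.
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Europa", "Earth"],
    "Age": [39.0, np.nan, 58.0, 33.0],
})

n_before = len(toy)
cleaned = toy.dropna()
n_after = len(cleaned)

# Fraction of rows removed by dropping every row with a null.
lost_share = 1 - n_after / n_before

print(n_before, n_after, lost_share)
```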

### Data Wrangling and Aggregation with Plotting
***

Next, we'll use Seaborn to plot Transported against the other columns and compare the graphs with each other, so we can see the differences between groups of passengers and predict what kind of person would be transported to the alternate universe.

##### Transported x Age

In [None]:
sb.barplot(data=df, x='Transported', y='Age')

This suggests that Age on its own is not a direct reason for being transported, so let's look at the rest of the values to see what they show.

##### Transported x ServiceSum

In [None]:
sb.barplot(data=df, x='Transported', y='ServiceSum')

Here we see that spending on services might be a factor, so we can check each service individually to see whether spending on it is related to being transported.

##### Transported x RoomService

In [None]:
sb.barplot(data=df, x='Transported', y='RoomService')

This shows that those who spent more in room service were typically not transported. Let's explore the rest.

##### Transported x FoodCourt

In [None]:
sb.barplot(data=df, x='Transported', y='FoodCourt')

##### Transported x ShoppingMall

In [None]:
sb.barplot(data=df, x='Transported', y='ShoppingMall')

##### Transported x Spa

In [None]:
sb.barplot(data=df, x='Transported', y='Spa')

##### Transported x VRDeck

In [None]:
sb.barplot(data=df, x='Transported', y='VRDeck')

From all of this data, it seems that a transported passenger tends to be someone who spent little or nothing on Room Service, the Spa, and the VRDeck. We're going to continue to look at the rest of the variables to determine what kind of person would be transported.
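
The bar heights in the plots above are group means, so the same comparison can be read as a single table with groupby. The sketch below uses made-up values (toy and spend_cols are illustrative names), so the numbers are not results from the real dataset.

```python
import pandas as pd

spend_cols = ["RoomService", "Spa", "VRDeck"]

# Toy frame with made-up spending for four passengers.
toy = pd.DataFrame({
    "Transported": [False, False, True, True],
    "RoomService": [100.0, 300.0, 0.0, 20.0],
    "Spa": [50.0, 150.0, 0.0, 10.0],
    "VRDeck": [80.0, 120.0, 0.0, 0.0],
})

# Mean spend per service, split by whether the passenger was transported.
means = toy.groupby("Transported")[spend_cols].mean()

print(means)
```

On the real data, the same call with all five service columns gives one row per Transported value.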

##### Age x Transported

In [None]:
sb.histplot(data=df, x='Age', hue='Transported')

This graph shows that those younger than 14 were transported more often than older age groups.
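
One way to check the histogram reading numerically is to bin Age and compute the share transported in each bin. This is a sketch under assumptions: the bin edges and labels are chosen here for illustration (only the cut at 14 comes from the text above), and the toy values are made up.

```python
import pandas as pd

# Toy frame with made-up ages and outcomes.
toy = pd.DataFrame({
    "Age": [3.0, 10.0, 13.0, 20.0, 35.0, 60.0],
    "Transported": [True, True, False, False, True, False],
})

# Left-closed bins: [0, 14), [14, 30), [30, 50), [50, 120).
bins = [0, 14, 30, 50, 120]
labels = ["<14", "14-29", "30-49", "50+"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# The mean of a bool column is the share of True values in each band.
rate = toy.groupby("AgeBand", observed=True)["Transported"].mean()

print(rate)
```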

##### Transported x CryoSleep

In [None]:
sb.barplot(data=df, x='Transported', y='CryoSleep')

This plot shows that those in CryoSleep were more likely to be transported than those who weren't.
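
Note that the barplot above puts Transported on the x axis, so each bar is the share of passengers in CryoSleep within a Transported group. To read the claim in the direction it is stated, the transported rate can be computed within each CryoSleep group instead. This is a minimal sketch on made-up values (toy is an illustrative name).

```python
import pandas as pd

# Toy frame with made-up values for six passengers.
toy = pd.DataFrame({
    "CryoSleep": [True, True, True, False, False, False],
    "Transported": [True, True, False, False, False, True],
})

# Share transported among passengers in and out of CryoSleep.
rate = toy.groupby("CryoSleep")["Transported"].mean()

print(rate)
```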

##### Destination x Transported

In [None]:
sb.histplot(data=df, x='Destination', hue='Transported')

It seems that those whose destination was '55 Cancri e' were significantly more likely to be transported than those headed to the other two destinations.
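
Because the destinations have very different passenger counts, raw histogram heights can be hard to compare. A row-normalized crosstab gives the share transported per destination. The sketch below uses made-up rows (toy and the outcome labels are illustrative), so the shares are not results from the real dataset.

```python
import pandas as pd

# Toy frame with made-up outcomes for six passengers.
toy = pd.DataFrame({
    "Destination": ["TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e",
                    "55 Cancri e", "55 Cancri e", "PSO J318.5-22"],
    "Transported": [False, True, False, True, True, False],
})

# Readable column labels instead of True/False.
outcome = toy["Transported"].map({True: "transported", False: "not transported"})

# normalize="index" makes each row sum to 1, giving shares per destination.
shares = pd.crosstab(toy["Destination"], outcome, normalize="index")

print(shares)
```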

As of right now, we can say that a passenger who was Transported would likely not have spent money on RoomService, Spa, or VRDeck, would likely be younger than 14, would be in CryoSleep, and would be headed to '55 Cancri e'.

In [None]:
sb.relplot(data=df, x='Age', y='RoomService', col='Destination', row='Transported', size='VRDeck', style='CryoSleep', hue='Spa')