# Dataset Description
In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

### File and Data Field Descriptions
**train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
- **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
- **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- **Destination** - The planet the passenger will be debarking to.
- **Age** - The age of the passenger.
- **VIP** - Whether the passenger has paid for special VIP service during the voyage.
- **RoomService**, **FoodCourt**, **ShoppingMall**, **Spa**, **VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- **Name** - The first and last names of the passenger.
- **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
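
The `PassengerId` (gggg_pp) and `Cabin` (deck/num/side) formats described above can be split into separate columns. The following is a minimal sketch on a small made-up frame, assuming pandas; the helper `split_id_and_cabin` and the new column names (`Group`, `GroupNum`, `Deck`, `CabinNum`, `Side`) are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Tiny made-up sample that follows the documented formats.
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_02", "0004_01"],
    "Cabin": ["B/0/P", "A/0/S", None],
})


def split_id_and_cabin(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add columns derived from PassengerId and Cabin."""
    out = df.copy()
    # gggg_pp -> group id and the passenger's number within the group
    id_parts = out["PassengerId"].str.split("_", expand=True)
    out["Group"] = id_parts[0]
    out["GroupNum"] = id_parts[1].astype(int)
    # deck/num/side -> three columns; a missing cabin stays missing in all three
    cabin_parts = out["Cabin"].str.split("/", expand=True)
    out["Deck"] = cabin_parts[0]
    out["CabinNum"] = pd.to_numeric(cabin_parts[1])
    out["Side"] = cabin_parts[2]
    return out


features = split_id_and_cabin(sample)
print(features)
```

Keeping `Group` as a string preserves the leading zeros, which makes it easy to match passengers who travel together.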


**test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

**sample_submission.csv** - A submission file in the correct format.
- **PassengerId** - Id for each passenger in the test set.
- **Transported** - The target. For each passenger, predict either True or False.
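
As a rough illustration of the submission layout described above, the sketch below builds a two-column frame and renders it as CSV. The helper name `make_submission` and the ids and predictions are made up for the example; only the column names `PassengerId` and `Transported` come from the description.

```python
import pandas as pd


def make_submission(passenger_ids, predictions) -> pd.DataFrame:
    """Hypothetical helper: build a frame shaped like sample_submission.csv."""
    return pd.DataFrame({
        "PassengerId": list(passenger_ids),
        # Cast to plain bool so the CSV contains True/False rather than 1/0
        "Transported": [bool(p) for p in predictions],
    })


# Made-up ids and predictions, for illustration only.
submission = make_submission(["0013_01", "0018_01"], [1, 0])
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path to `to_csv` instead of printing would write the file to disk.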

### Planning
I am going to try out several ML algorithms, pick the best-performing one, and fine-tune it:
- Random Forest
- XGBoost
- LightGBM

I will also approach the problem with deep learning:
- ANN (with hyperparameter tuning)
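
One way the "try several algorithms and keep the best" plan could look is sketched below. It is only an illustration under stated assumptions: the data is synthetic (not the competition data), and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost and LightGBM so that the example needs scikit-learn alone. Both of those libraries provide scikit-learn-compatible classifiers (`xgboost.XGBClassifier`, `lightgbm.LGBMClassifier`) that could be added to the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the real features would come from train.csv
# after preprocessing.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Stand-in for XGBoost / LightGBM (assumption made for this sketch).
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 3-fold cross-validated accuracy for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(scores)
print("best:", best_name)
```

The model with the highest cross-validated score would then be the one to fine-tune.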

Import the required libraries/modules.

In [1]:
import opendatasets as od
import pandas as pd

## 1. Download the Dataset
- Install required libraries
- Download data from Kaggle
- View dataset files
- Load training set with Pandas
- Load test set with Pandas

Dataset link: [Spaceship Titanic](https://www.kaggle.com/competitions/spaceship-titanic/data)

### Download Data from Kaggle
We'll use the [opendatasets](https://github.com/JovianML/opendatasets) library to download the data.

In [2]:
dataset_url = 'https://www.kaggle.com/competitions/spaceship-titanic/data'
od.download(dataset_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:
Your Kaggle Key:
Downloading spaceship-titanic.zip to ./spaceship-titanic
100%|██████████| 299k/299k [00:00<00:00, 2.15MB/s]
Extracting archive ./spaceship-titanic/spaceship-titanic.zip to ./spaceship-titanic

In [4]:
data_dir = './spaceship-titanic'
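
For the "View dataset files" step, a small sketch using only the standard library is shown here. The helper `list_dataset_files` is a hypothetical name; it returns an empty dictionary when the directory does not exist, for example before the download has run.

```python
from pathlib import Path


def list_dataset_files(data_dir: str) -> dict:
    """Hypothetical helper: map each file name in data_dir to its size in bytes."""
    root = Path(data_dir)
    if not root.is_dir():
        return {}  # nothing downloaded yet
    return {p.name: p.stat().st_size for p in sorted(root.iterdir()) if p.is_file()}


print(list_dataset_files("./spaceship-titanic"))
```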

## Loading Training Set and Test Set

In [6]:
spaceship_titanic_df = pd.read_csv(f'{data_dir}/train.csv')
spaceship_titanic_test_df = pd.read_csv(f'{data_dir}/test.csv')

In [7]:
spaceship_titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB
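
The non-null counts above imply missing values in most columns; for example, `CryoSleep` has 8476 non-null entries out of 8693, so 217 values (about 2.5%) are missing. A short sketch of how such a summary could be computed follows, shown on a made-up frame; `missing_summary` is a hypothetical helper name.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have missing values.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up frame for illustration only.
demo = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [39.0, 24.0, None, None],
    "Transported": [False, True, False, True],
})
summary = missing_summary(demo)
print(summary)
```

Calling `missing_summary(spaceship_titanic_df)` would give the same kind of table for the training set.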


In [8]:
spaceship_titanic_df.describe()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 8514.0 | 8512.0 | 8510.0 | 8485.0 | 8510.0 | 8505.0 |
| mean | 28.82793 | 224.687617 | 458.077203 | 173.729169 | 311.138778 | 304.854791 |
| std | 14.489021 | 666.717663 | 1611.48924 | 604.696458 | 1136.705535 | 1145.717189 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 38.0 | 47.0 | 76.0 | 27.0 | 59.0 | 46.0 |
| max | 79.0 | 14327.0 | 29813.0 | 23492.0 | 22408.0 | 24133.0 |


## Exploratory Data Analysis

## Preprocessing

### Handle Missing Values
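
One simple baseline for this step is to fill numeric columns with the median and all other columns with the most frequent value. The sketch below illustrates that idea on a made-up frame; it is an assumed approach rather than the one this notebook settles on, and `fill_missing` is a hypothetical helper name.

```python
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Median for numeric columns, most frequent value for the rest."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Made-up frame for illustration only.
demo = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Earth", "Mars"],
    "Age": [20.0, None, 30.0, 40.0],
})
filled = fill_missing(demo)
print(filled)
```

In practice the fill values would usually be computed on the training set and reused for the test set, so that both are imputed consistently.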