# First workbook on Titanic data

The Titanic data set is available on Kaggle:
https://www.kaggle.com/c/titanic/data

We will use this data to get an initial understanding of how to work with data.

To start with, we know we'll need pandas and numpy, since the data is tabular, largely numeric, and stored in CSV files.
We import those libraries and list the contents of the data directory to verify that the required files exist.

In [1]:
# Importing required libraries
import pandas
import numpy
import os

# Check if files exist
files = []
for dirname, _, filenames in os.walk('./data/'):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))

print(files)

['./data/actual_result.csv', './data/gender_submission.csv', './data/test.csv', './data/train.csv']
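
The cell above lists whatever is in the directory, so a missing file only shows up if the reader notices it. A minimal sketch of a more explicit check follows. The helper `missing_files` and the `REQUIRED` tuple are hypothetical names introduced here for illustration, and the file names are taken from the listing above.

```python
from pathlib import Path

# Assumed set of required files, based on the directory listing above.
REQUIRED = ('gender_submission.csv', 'test.csv', 'train.csv')


def missing_files(data_dir, required=REQUIRED):
    """Return the required file names that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in required if not (data_dir / name).is_file()]


missing = missing_files('./data')
if missing:
    print('Missing files:', missing)
else:
    print('All required files are present.')
```

Returning the list of missing names, rather than raising immediately, leaves it to the caller to decide whether a missing file is fatal.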


## Loading the data

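We load two of the files by their position in files to peek at the data. Going by the listing printed above, files[0] is actual_result.csv and files[2] is test.csv, so those are the files previewed below, not gender_submission.csv and train.csv as the variable names suggest.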

In [2]:
gender_submission = pandas.read_csv(files[0])
gender_submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [3]:
train_data = pandas.read_csv(files[2])
train_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


gender_submission.csv is a sample file for submission.
train.csv is the data file we'll be working with.
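
Picking files by position depends on the order in which os.walk happens to return them. As a hedged alternative, the sketch below looks paths up by file name instead. The `files` list is copied from the printed listing, and `paths_by_name` is a name introduced here for illustration.

```python
import os

# The listing printed by the first cell, copied verbatim.
files = [
    './data/actual_result.csv',
    './data/gender_submission.csv',
    './data/test.csv',
    './data/train.csv',
]

# Map each base name to its full path so lookups do not depend on list order.
paths_by_name = {os.path.basename(path): path for path in files}

train_path = paths_by_name['train.csv']
submission_path = paths_by_name['gender_submission.csv']
print(train_path)       # ./data/train.csv
print(submission_path)  # ./data/gender_submission.csv

# The frames could then be loaded with, for example:
# train_data = pandas.read_csv(train_path)
```

In the listing above, train.csv sits at index 3 and gender_submission.csv at index 1, which a name-based lookup finds without the reader having to count positions.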

The columns are described as follows:

| Column | Definition | Key (if any) |
|:------:|:----------:|:-----------:|
| PassengerId | Unique identifier for each passenger | |
| Survived | Whether the passenger survived | 0 = No, 1 = Yes |
| Pclass | Class of ticket purchased | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Name | Name of the passenger | |
| Sex | Sex of the passenger | |
| Age | Age of the passenger in years | |
| SibSp | Number of siblings/spouses aboard the Titanic | |
| Parch | Number of parents/children aboard the Titanic | |
| Ticket | Ticket number | |
| Fare | The fare paid for the ticket | |
| Cabin | Cabin number, for passengers with a cabin | |
| Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Assuming this is the actual passenger data from the Titanic, it is likely to be raw and unprocessed.
So we clean it first, which starts with checking whether the values in each column are consistent.
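
One simple way to start checking whether the data is consistent is to count missing values and duplicate identifiers. The sketch below does this on the five rows shown in the preview above, reproduced as CSV text (without the unnamed index column) so that it runs without the data files. It is an illustration of the idea, not necessarily the checks this workbook goes on to use.

```python
import io

import pandas

# The five rows shown in the preview above, reproduced as CSV text.
sample_csv = '''PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
'''
sample = pandas.read_csv(io.StringIO(sample_csv))

# Count missing values per column; in this sample only Cabin is empty.
missing_counts = sample.isna().sum()
print(missing_counts[missing_counts > 0])

# Duplicate passenger IDs would indicate repeated records.
duplicate_ids = sample['PassengerId'].duplicated().sum()
print(duplicate_ids)
```

On the full frame the same two calls, `isna().sum()` and `duplicated().sum()`, would be applied to `train_data` directly.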

In [18]:
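def is_categorical(label):
    # Heuristic: treat a column as categorical when its distinct values
    # make up less than 20% of its rows.
    unique_share = len(train_data[label].value_counts()) / len(train_data[label]) * 100
    return unique_share < 20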

is_categorical('Fare')

False

In [12]:
len(train_data['Fare'].value_counts()) / len(train_data['Fare'])

0.4043062200956938

418
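
The ratio of about 0.404 for Fare means roughly 40% of its rows hold distinct values, which is why the 20% test above returns False. The sketch below restates the same heuristic so that it takes a Series and a threshold instead of reading the global train_data. The function `unique_ratio`, the `threshold` parameter, and the two synthetic columns are assumptions made for illustration and are not Titanic data.

```python
import pandas


def unique_ratio(series):
    """Number of distinct non-missing values divided by the number of rows."""
    return series.nunique() / len(series)


def is_categorical(series, threshold=0.20):
    """Heuristic: a column is categorical when few of its values are distinct."""
    return unique_ratio(series) < threshold


# Synthetic columns, for illustration only.
port = pandas.Series(['S', 'C', 'Q', 'S'] * 25)          # 3 distinct values in 100 rows
amount = pandas.Series([i * 0.5 for i in range(100)])    # 100 distinct values in 100 rows

print(is_categorical(port))    # True
print(is_categorical(amount))  # False
```

Passing the Series explicitly makes the check reusable on any frame, for example `is_categorical(train_data['Fare'])`.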