Let's import the NumPy and pandas packages.

In [None]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

pandas has its own functions for reading CSV files, so we no longer need Python's `csv` package.

## DataFrames

In [None]:
df = pd.read_csv('data/titanic/train.csv', header=0)

In [None]:
df

In [None]:
df.head(2)

Earlier, the `csv` package treated every value as a string, but pandas infers a type for each column. Let's see what the datatypes are now.

In [None]:
df.dtypes
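
To see the type inference in isolation, here is a minimal sketch that reads a tiny CSV from an in-memory string instead of the Titanic file. The column names mirror the dataset, but the three rows and the names `csv_text` and `sample` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample; not taken from data/titanic/train.csv.
csv_text = """PassengerId,Name,Age,Fare
1,Alice,22,7.25
2,Bob,,71.2833
3,Carol,26,7.925
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Whole-number columns are read as integers. A column with decimals or
# with a missing value is read as float, and text columns come back as
# object (or a dedicated string dtype, depending on the pandas version).
print(sample.dtypes)
```

Note that `Age` becomes a float column even though the two values present are whole numbers, because the missing entry is stored as NaN.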

`df.info()` is a useful command for getting an overview.

In [None]:
df.info()

There's a lot of useful info there! You can see immediately we have 891 entries (rows), and for most of the variables we have complete values (891 are non-null). But not for Age, or Cabin, or Embarked -- those have nulls somewhere. Now try:


In [None]:
df.describe()

This is also very useful: pandas has taken all of the numerical columns and quickly calculated the count, mean, standard deviation, minimum, quartiles and maximum of each.
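
Two details of `describe()` are easy to miss: it skips non-numeric columns by default, and its statistics ignore missing values. The sketch below shows both on a small invented frame (`toy` and its values are assumptions, not Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing age.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.93, 53.10],
    'Sex': ['male', 'female', 'female', 'female'],
})

summary = toy.describe()

# Only the numeric columns are summarised; 'Sex' is left out.
print(summary.columns.tolist())

# The 'count' row counts non-null values only, so Age has one fewer than Fare.
print(summary.loc['count'])
```

So when a column such as `Age` contains nulls, its mean and quartiles describe only the rows where a value is present.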

## Filtering and Cleaning

Let's get the first 10 rows of the `Age` column.

In [None]:
df['Age'][0:10]

In [None]:
type(df['Age'])

In [None]:
df['Age'].mean()

In [None]:
df['Age'].median()

We can select multiple columns too, by passing a list of column names.

In [None]:
df[['Age', 'Survived', 'Fare']][0:10]

Let's now do some filtering.

In [None]:
df[df['Age'] > 70][['Name', 'Age', 'Sex']]
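
The filter above works because a comparison on a column returns a boolean Series, and indexing a DataFrame with that Series keeps only the rows marked `True`. Here is a rough sketch on invented data (the frame `toy` is hypothetical) that also combines two conditions.

```python
import pandas as pd

# Hypothetical passengers, for illustration only.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [80.0, 71.0, 30.0, 74.0],
    'Sex': ['male', 'male', 'female', 'female'],
})

# The comparison yields one True/False value per row.
mask = toy['Age'] > 70
print(mask.tolist())

# Combine conditions with & (and) or | (or); each condition needs its
# own parentheses because & binds more tightly than the comparisons.
older_women = toy[(toy['Age'] > 70) & (toy['Sex'] == 'female')]
print(older_women[['Name', 'Age']])
```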

## Feature Engineering

In [None]:
len(df['Age'].dropna())

In [None]:
df['Gender'] = df['Sex'].map(lambda x: x[0].upper())

In [None]:
df.head(3)

In [None]:
df['Gender'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

In [None]:
df.head(6)
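
One caveat with the dictionary form of `map`: any value that is not a key in the dictionary becomes NaN, and a following `.astype(int)` then fails. The sketch below illustrates this with a made-up Series that contains an unexpected label; the Titanic `Sex` column is assumed to contain only `female` and `male`, which is why the cell above works.

```python
import pandas as pd

# Hypothetical values; 'unknown' is deliberately absent from the mapping.
sex = pd.Series(['male', 'female', 'unknown'])
codes = sex.map({'female': 0, 'male': 1})

# Keys missing from the dict map to NaN instead of raising an error.
print(codes.isna().tolist())

# NaN has no integer representation, so the cast raises a ValueError.
try:
    codes.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)
```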

Let's replace the null values in `Age` with the median age.

In [None]:
# Use .loc so the assignment writes to df itself rather than to a temporary copy.
df.loc[df['Age'].isnull(), 'Age'] = df['Age'].median()

In [None]:
df.info()
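
Another common way to fill the gaps is `fillna`. The sketch below applies it to a small invented column (`toy` and its values are assumptions) and relies on `median()` skipping NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, np.nan, 40.0]})

# median() ignores NaN, so this is the median of 22, 26 and 40.
median_age = toy['Age'].median()

# fillna returns a new Series; assign it back to keep the result.
toy['Age'] = toy['Age'].fillna(median_age)
print(toy['Age'].tolist())
```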

We can create new features that might make more sense.

In [None]:
df['FamilySize'] = df['SibSp'] + df['Parch']

In [None]:
df.head(6)

In [None]:
df['Age*Pclass'] = df['Age'] * df['Pclass']

In [None]:
df.info()
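
Both new features rely on element-wise arithmetic: operating on two columns combines them row by row, with no explicit loop. A minimal sketch on three invented rows (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical rows using the same column names as the Titanic data.
toy = pd.DataFrame({
    'SibSp': [1, 0, 3],
    'Parch': [0, 0, 2],
    'Age': [22.0, 38.0, 4.0],
    'Pclass': [3, 1, 3],
})

# Each result row is computed from the matching row of the inputs.
toy['FamilySize'] = toy['SibSp'] + toy['Parch']
toy['Age*Pclass'] = toy['Age'] * toy['Pclass']
print(toy)
```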

Uninformative features add overhead and can hurt the performance of our model. So, let's drop them.

In [None]:
df = df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1) 

In [None]:
df.info()
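
As a side note, `drop` also accepts a `columns=` keyword, which some find easier to read than `axis=1`. Either way it returns a new DataFrame and leaves the original untouched unless the result is assigned back. A small sketch on an invented one-row frame:

```python
import pandas as pd

# Hypothetical one-row frame, for illustration only.
toy = pd.DataFrame({'Age': [22.0], 'Name': ['A'], 'Cabin': ['C85']})

# Equivalent to toy.drop(['Name', 'Cabin'], axis=1).
trimmed = toy.drop(columns=['Name', 'Cabin'])

print(trimmed.columns.tolist())  # columns remaining after the drop
print(toy.columns.tolist())      # the original frame still has all three
```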

We can convert a pandas DataFrame to a NumPy array at any time.

In [None]:
df.values
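
The `to_numpy()` method does the same job and is the spelling the pandas documentation recommends over `.values`. One thing to keep in mind is that a NumPy array holds a single dtype, so mixed integer and float columns are upcast to float. A minimal sketch with invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing an integer and a float column.
toy = pd.DataFrame({'Survived': [0, 1], 'Age': [22.0, 38.0]})

arr = toy.to_numpy()

# One row per passenger, one column per feature.
print(arr.shape)

# The integer column is upcast so the whole array shares one float dtype.
print(arr.dtype)
```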

Great! Now, we are ready to apply machine learning.