Many machine learning algorithms are sensitive to the scale of the features. In this recipe, we will inspect the feature magnitudes using the most common summary statistics.

====================================================================================================

To download the Titanic data, visit the [Kaggle Titanic competition page](https://www.kaggle.com/c/titanic/data).

Click the 'train.csv' link, then click the blue 'download' button towards the right of the screen to download the dataset. Rename the file to titanic.csv and save it in the parent directory of this repo (../titanic.csv).

**Note that you need to be logged in to Kaggle and accept the competition terms and conditions to download the datasets**.

====================================================================================================

In [1]:
import pandas as pd

In [2]:
# load numerical variables of the Titanic Dataset

data = pd.read_csv('../titanic.csv',
                   usecols=['Pclass', 'Age', 'Fare'])

data.head()

|   | Pclass | Age | Fare |
|---|--------|------|---------|
| 0 | 3 | 22.0 | 7.25 |
| 1 | 1 | 38.0 | 71.2833 |
| 2 | 3 | 26.0 | 7.925 |
| 3 | 1 | 35.0 | 53.1 |
| 4 | 3 | 35.0 | 8.05 |


In [3]:
# let's have a look at the values of those variables
# to get an idea of the feature magnitudes

data.describe()

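|       | Pclass | Age | Fare |
|-------|----------|-----------|-----------|
| count | 891.0 | 714.0 | 891.0 |
| mean | 2.308642 | 29.699118 | 32.204208 |
| std | 0.836071 | 14.526497 | 49.693429 |
| min | 1.0 | 0.42 | 0.0 |
| 25% | 2.0 | 20.125 | 7.9104 |
| 50% | 3.0 | 28.0 | 14.4542 |
| 75% | 3.0 | 38.0 | 31.0 |
| max | 3.0 | 80.0 | 512.3292 |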


The table shows the main summary statistics of each variable: the count, the mean, the standard deviation, the minimum and maximum values, and the 25th, 50th and 75th percentiles. Comparing these statistics quickly tells us whether the features are on a similar scale. In this case, they clearly are not: Pclass takes values from 1 to 3, Age takes values from 0.42 to 80, and Fare takes values from 0 to 512.33.

In [4]:
# let's now calculate the range of the variables

data.max() - data.min()

Pclass      2.0000
Age        79.5800
Fare      512.3292
dtype: float64

As expected, the ranges of the variables are quite different.
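
The minimum, maximum and range can also be collected in a single table, which makes the differences in scale easier to compare side by side. The snippet below is a minimal sketch of that idea, not part of the original recipe: so that it runs without titanic.csv, it rebuilds only the first five rows shown by `data.head()`, and the names `sample` and `summary` are illustrative.

```python
import pandas as pd

# First five rows of the Titanic data, copied from the data.head() output
# (an assumption made so the sketch runs without titanic.csv).
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Minimum and maximum per variable: rows are statistics, columns are variables.
summary = sample.agg(["min", "max"])

# Add the range (max - min) as an extra row.
summary.loc["range"] = summary.loc["max"] - summary.loc["min"]

print(summary)
```

With the full dataset loaded as `data`, replacing `sample` with `data` should give a `range` row that matches the output of `data.max() - data.min()`.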