# Exploratory Data Analysis (EDA) with the Titanic dataset

## Titanic

Now we will explore the [Titanic dataset](https://www.kaggle.com/c/titanic) in more detail. We will read the dataset and walk through the first steps of EDA, answering several questions along the way.


In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("../data/titanic.csv")

Now that the data has been read, let's start **looking** at it.

In [3]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df.shape

(891, 12)

In [5]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The information above lets us tell which columns contain **categorical** data and which contain **numeric** data.

- **Categorical data**: qualitative data, almost always stored as **strings**. Most models cannot handle categorical data directly, so if we want to use it, we first have to transform it into numerical data.

- **Numerical data**: quantitative data stored as numbers, which we can use directly.
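
As a small sketch of this distinction, `select_dtypes` can split the columns by type. The `toy` frame below is made-up data (not the Titanic file), and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "Pclass": [3, 1, 3],
})

# Columns stored as numbers (int or float).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (here, the text column).
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Note that a column such as `Pclass` is stored as an integer but arguably describes a category, so the dtype is a starting point rather than a final answer.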

In [7]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [8]:
df.loc[:, df.dtypes != object]

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,0,3,22.0,1,0,7.2500
1,2,1,1,38.0,1,0,71.2833
2,3,1,3,26.0,0,0,7.9250
3,4,1,1,35.0,1,0,53.1000
4,5,0,3,35.0,0,0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,27.0,0,0,13.0000
887,888,1,1,19.0,0,0,30.0000
888,889,0,3,,1,2,23.4500
889,890,1,1,26.0,0,0,30.0000


In [9]:
df.shape[0]

891

In [10]:
df.isnull().sum() / df.shape[0]

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
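
The fraction of missing values per column can also be written as the mean of the boolean mask, since `True` counts as 1. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: half of the ages are missing, no fare is missing.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Mean of a boolean mask = share of True values = share of missing entries.
missing_fraction = toy.isnull().mean()

# Same quantity computed as count / number of rows.
missing_fraction_manual = toy.isnull().sum() / toy.shape[0]

print(missing_fraction)
```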

In [11]:
df.describe().apply(lambda x: round(x, 1))

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.4,2.3,29.7,0.5,0.4,32.2
std,257.4,0.5,0.8,14.5,1.1,0.8,49.7
min,1.0,0.0,1.0,0.4,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.1,0.0,0.0,7.9
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.5
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3


### Groupby

The `.groupby()` method helps us build a **pivot table** from the data. This type of structure helps a lot with the important step of **looking at the data**.

Suppose you want to investigate the relationship between the class on the ship and survival. To do this, you **group** the data by class and aggregate the survival column.

In [12]:
df["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [13]:
df["Survived"].mean()

0.3838383838383838

In [14]:
df.groupby(["Pclass"])[["Survived"]].mean()

Pclass,Survived
1,0.62963
2,0.472826
3,0.242363


In [15]:
df.groupby(["Pclass", "Sex"])[["Survived"]].mean()

Pclass,Sex,Survived
1,female,0.968085
1,male,0.368852
2,female,0.921053
2,male,0.157407
3,female,0.5
3,male,0.135447


It is also possible to make a pivot table through the function **pd.pivot_table()**.

In [16]:
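# passing the string "mean" avoids the deprecation warning that recent pandas versions raise for np.mean
pd.pivot_table(df, values='Survived', index='Pclass', columns='Sex', aggfunc="mean")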

Sex,female,male
Pclass,,
1,0.968085,0.368852
2,0.921053,0.157407
3,0.5,0.135447
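
As a rough illustration of how the two approaches relate, the sketch below builds the same class-by-sex table once with `groupby(...).unstack()` and once with `pd.pivot_table()`. The `toy` frame is made-up data, not the Titanic file.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic frame.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "female", "male"],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# groupby gives a long result (one row per class/sex pair);
# unstack moves the "Sex" level into the columns.
by_groupby = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex")

# pivot_table builds the wide layout directly.
by_pivot = pd.pivot_table(toy, values="Survived", index="Pclass",
                          columns="Sex", aggfunc="mean")

print(by_groupby)
print(by_pivot)
```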


A single cell of this table can also be computed with boolean **filters**.

In [17]:
df[(df["Pclass"] == 1) & (df["Sex"] == "male")]["Survived"].mean()

0.36885245901639346

In [18]:
df[(df["Pclass"] == 1) & (df["Sex"] == "male")]["Survived"].value_counts(normalize=True)

0    0.631148
1    0.368852
Name: Survived, dtype: float64

It is also possible to pass more than one aggregation function at once with `.agg()`. Another example question: **what is the relationship between the port of embarkation, the survival rate, and the class?**

In [19]:
df.groupby(["Embarked", "Pclass"])[['Survived']].mean()

Embarked,Pclass,Survived
C,1,0.694118
C,2,0.529412
C,3,0.378788
Q,1,0.5
Q,2,0.666667
Q,3,0.375
S,1,0.582677
S,2,0.463415
S,3,0.189802


In [20]:
df.groupby(["Embarked", "Pclass"])[['Survived']].agg(["count", "sum", "mean"])

Embarked,Pclass,Survived count,Survived sum,Survived mean
C,1,85,59,0.694118
C,2,17,9,0.529412
C,3,66,25,0.378788
Q,1,2,1,0.5
Q,2,3,2,0.666667
Q,3,72,27,0.375
S,1,127,74,0.582677
S,2,164,76,0.463415
S,3,353,67,0.189802
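
Passing a list of functions to `.agg()` produces two-level column headers. One possible alternative is named aggregation, which yields flat, self-describing column names. This is a sketch on made-up data; the output names `passengers`, `survivors` and `survival_rate` are illustrative choices.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic frame.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "S", "S", "S"],
    "Pclass":   [1, 1, 3, 3, 3],
    "Survived": [1, 0, 0, 0, 1],
})

# Named aggregation: new_column=(source_column, function).
summary = toy.groupby(["Embarked", "Pclass"]).agg(
    passengers=("Survived", "count"),
    survivors=("Survived", "sum"),
    survival_rate=("Survived", "mean"),
)

print(summary)
```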


In [21]:
df[(df["Embarked"] == "Q") & (df["Pclass"] == 1)]["Survived"]

245    0
412    1
Name: Survived, dtype: int64

It is also possible to use the **pd.crosstab()** function:

In [22]:
pd.crosstab([df["Survived"]], 
            [df["Embarked"], df["Pclass"]], 
            margins=True)

Embarked,C,C,C,Q,Q,Q,S,S,S,All
Pclass,1,2,3,1,2,3,1,2,3,
Survived,,,,,,,,,,
0,26,8,41,1,1,45,53,88,286,549
1,59,9,25,1,2,27,74,76,67,340
All,85,17,66,2,3,72,127,164,353,889
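
`pd.crosstab()` can also return proportions instead of counts through its `normalize` argument. The sketch below, on made-up data, shows that normalizing over columns gives the same survival rates as a `groupby` mean.

```python
import pandas as pd

# Hypothetical toy data: 3 first-class and 4 third-class passengers.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})

# Each column (class) sums to 1, so row 1 holds the survival rate per class.
rates = pd.crosstab(toy["Survived"], toy["Pclass"], normalize="columns")

# The same rates from a groupby, for comparison.
rates_groupby = toy.groupby("Pclass")["Survived"].mean()

print(rates)
print(rates_groupby)
```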


#### Other examples

In [23]:
# how many people died? How many survived?
df["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [24]:
# what is the proportion of survivors and dead?
df["Survived"].value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

In [25]:
# what is the amount who died and survived, split by sex
df.groupby(["Survived", "Sex"])[["Survived"]].count()

Survived,Sex,Survived
0,female,81
0,male,468
1,female,233
1,male,109


In [26]:
# modify the example from before to include not only port and class, but also sex as a grouping variable
df.groupby(["Embarked", "Pclass", "Sex"])[['Survived']].agg(["count", "sum", "mean"])

Embarked,Pclass,Sex,Survived count,Survived sum,Survived mean
C,1,female,43,42,0.976744
C,1,male,42,17,0.404762
C,2,female,7,7,1.0
C,2,male,10,2,0.2
C,3,female,23,15,0.652174
C,3,male,43,10,0.232558
Q,1,female,1,1,1.0
Q,1,male,1,0,0.0
Q,2,female,2,2,1.0
Q,2,male,1,0,0.0


In [27]:
df[(df["Embarked"] == "C") & 
   (df["Pclass"] == 1) &
   (df["Sex"] == "male")]["Survived"].value_counts()

0    25
1    17
Name: Survived, dtype: int64

In [28]:
# how many passengers are in each age group? (0-15, 15-30, 30-45, 45+; lower bound exclusive, upper bound inclusive)
print("0-15 years:\t", df[df["Age"] <= 15].shape[0])
print("15-30 years:\t", df[(df["Age"] > 15) & (df["Age"] <= 30)].shape[0])
print("30-45 years:\t", df[(df["Age"] > 30) & (df["Age"] <= 45)].shape[0])
print("45+ years:\t", df[df["Age"] > 45].shape[0])

0-15 years:	 83
15-30 years:	 326
30-45 years:	 202
45+ years:	 103


In [29]:
df['group'] = pd.cut(df['Age'], [0, 15, 30, 45, df["Age"].max()], include_lowest=True, right=True)
df["group"].value_counts()

(15.0, 30.0]      326
(30.0, 45.0]      202
(45.0, 80.0]      103
(-0.001, 15.0]     83
Name: group, dtype: int64

In [30]:
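# integer-divide the age into 15-year buckets and cap the bucket at 3;
# missing ages stay NaN instead of being counted in bucket 3
df["FE"] = (df["Age"] // 15.01).clip(upper=3)
df["FE"].value_counts()

1.0    326
2.0    202
3.0    103
0.0     83
Name: FE, dtype: int64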

In [31]:
# and what is the sex distribution in each age group?
print(df[df["Age"] <= 15]["Sex"].value_counts(normalize=True), "\n")
print(df[(df["Age"] > 15) & (df["Age"] <= 30)]["Sex"].value_counts(normalize=True), "\n")
print(df[(df["Age"] > 30) & (df["Age"] <= 45)]["Sex"].value_counts(normalize=True), "\n")
print(df[df["Age"] > 45]["Sex"].value_counts(normalize=True), "\n")

female    0.518072
male      0.481928
Name: Sex, dtype: float64 

male      0.647239
female    0.352761
Name: Sex, dtype: float64 

male      0.638614
female    0.361386
Name: Sex, dtype: float64 

male      0.708738
female    0.291262
Name: Sex, dtype: float64 



In [32]:
# and what is the proportion of survivors and dead in each age group?
print(df[df["Age"] <= 15]["Survived"].value_counts(normalize=True), "\n")
print(df[(df["Age"] > 15) & (df["Age"] <= 30)]["Survived"].value_counts(normalize=True), "\n")
print(df[(df["Age"] > 30) & (df["Age"] <= 45)]["Survived"].value_counts(normalize=True), "\n")
print(df[df["Age"] > 45]["Survived"].value_counts(normalize=True), "\n")

1    0.590361
0    0.409639
Name: Survived, dtype: float64 

0    0.641104
1    0.358896
Name: Survived, dtype: float64 

0    0.574257
1    0.425743
Name: Survived, dtype: float64 

0    0.631068
1    0.368932
Name: Survived, dtype: float64 
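
The four hand-written filters above can be collapsed into one `pd.cut()` followed by a `groupby`. This is only a sketch on made-up ages; the label strings and the `AgeGroup` column name are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including one missing age.
toy = pd.DataFrame({
    "Age":      [4.0, 15.0, 22.0, 30.0, 38.0, 62.0, np.nan],
    "Survived": [1, 1, 0, 1, 0, 0, 1],
})

# Right-closed bins: (0, 15], (15, 30], (30, 45], (45, inf).
bins = [0, 15, 30, 45, np.inf]
labels = ["0-15", "15-30", "30-45", "45+"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels, include_lowest=True)

# Rows with a missing age get a missing group and are left out of the groupby.
rate_by_group = toy.groupby("AgeGroup", observed=True)["Survived"].mean()

print(rate_by_group)
```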



### Apply

Let's now get to know the **.apply()** method, which is extremely useful for **modifying columns** or **creating new columns from old columns**.

The structure of the names is: **Surname, Title. First Names**.

Can we extract a column **only with surnames?** And another column **only with the titles?**

In [33]:
name = df["Name"][0]
name

'Braund, Mr. Owen Harris'

In [34]:
name.split(",")

['Braund', ' Mr. Owen Harris']

In [35]:
for i in range(5):    
    name = df["Name"][i]    
    print(name, " | ", name.split(",")[0])

Braund, Mr. Owen Harris  |  Braund
Cumings, Mrs. John Bradley (Florence Briggs Thayer)  |  Cumings
Heikkinen, Miss. Laina  |  Heikkinen
Futrelle, Mrs. Jacques Heath (Lily May Peel)  |  Futrelle
Allen, Mr. William Henry  |  Allen


We can **apply** this same procedure simultaneously to all elements of the name column using the **.apply()** method.

In [36]:
df["Name"].apply(lambda name: name.split(",")[0])

0         Braund
1        Cumings
2      Heikkinen
3       Futrelle
4          Allen
         ...    
886     Montvila
887       Graham
888     Johnston
889         Behr
890       Dooley
Name: Name, Length: 891, dtype: object

In [37]:
df["Surname"] = df["Name"].apply(lambda x: x.split(",")[0])

In [38]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,group,FE,Surname
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,"(15.0, 30.0]",1.0,Braund
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"(30.0, 45.0]",2.0,Cumings
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,"(15.0, 30.0]",1.0,Heikkinen
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,"(30.0, 45.0]",2.0,Futrelle
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,"(30.0, 45.0]",2.0,Allen
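
The title can be extracted in a similar way. The sketch below assumes every name follows the **Surname, Title. First Names** pattern seen in the rows above (one comma before the title, one period after it); the helper `extract_title` is a hypothetical name. It also shows a vectorized alternative with `.str.extract()`.

```python
import pandas as pd

# Three names taken from the rows displayed above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

def extract_title(name):
    """Return the text between the first comma and the following period."""
    after_comma = name.split(",")[1]
    return after_comma.split(".")[0].strip()

# Option A: element-wise with .apply()
titles_apply = names.apply(extract_title)

# Option B: vectorized regular expression (comma, optional spaces, then
# everything up to the next period)
titles_regex = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(titles_apply.tolist())
print(titles_regex.tolist())
```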


By default, new columns are appended at the end of the dataframe. To change the column order (and, here, to keep only the columns we will use from now on), select them in the desired order:

In [39]:
df.columns.tolist()

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked',
 'group',
 'FE',
 'Surname']

In [40]:
df = df[['Survived',
         'Pclass',
         'Surname',
         'Sex',
         'Age',
         'SibSp',
         'Parch',
         'Fare',
         'Embarked']].copy()

In [41]:
df.head()

,Survived,Pclass,Surname,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,Braund,male,22.0,1,0,7.25,S
1,1,1,Cumings,female,38.0,1,0,71.2833,C
2,1,3,Heikkinen,female,26.0,0,0,7.925,S
3,1,1,Futrelle,female,35.0,1,0,53.1,S
4,0,3,Allen,male,35.0,0,0,8.05,S


This process is called **feature engineering**: using original features (here, the name) to create **new features** that may be more useful than the originals.

### Missing data

Now that we know the dataset better, let's look again at the missing data:

In [42]:
df.isnull().sum()

Survived      0
Pclass        0
Surname       0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

There are different ways to handle missing data in a dataset. What we are going to do here is just **for exploratory purposes**. When we deal with missing data **for modeling**, it is important to use more robust tools.

In [43]:
# axis=0 means drop rows
df_drop_null_rows = df.dropna(axis=0, how="any")

In [44]:
df.shape, df_drop_null_rows.shape

((891, 9), (712, 9))

In [45]:
df_drop_null_rows.isnull().sum()

Survived    0
Pclass      0
Surname     0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

In [46]:
# axis=1 means drop columns
df_drop_null_cols = df.dropna(axis=1, how="any")

In [47]:
df.shape, df_drop_null_cols.shape

((891, 9), (891, 7))

In [48]:
df_drop_null_cols.isnull().sum()

Survived    0
Pclass      0
Surname     0
Sex         0
SibSp       0
Parch       0
Fare        0
dtype: int64

In both approaches above we are throwing data away, which is rarely a good option. If we can **fill in** the missing data in some justifiable way, that is usually a better approach.
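
When dropping is still preferred, the loss can be limited by restricting `dropna()` to the columns that matter through its `subset` argument. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing age and one missing port.
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Fare":     [7.25, 71.28, 7.93, 53.10],
})

# Drops every row with a missing value in any column.
rows_any = toy.dropna(axis=0, how="any")

# Drops only the rows where "Embarked" is missing; missing ages are kept.
rows_embarked = toy.dropna(axis=0, subset=["Embarked"])

print(toy.shape, rows_any.shape, rows_embarked.shape)
```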

#### Fill in missing data

Let's discuss now how to fill in the "Embarked" and "Age" columns.

In [49]:
df[df["Embarked"].isnull()]

,Survived,Pclass,Surname,Sex,Age,SibSp,Parch,Fare,Embarked
61,1,1,Icard,female,38.0,0,0,80.0,
829,1,1,Stone,female,62.0,0,0,80.0,


The missing values in "Embarked" belong to two first-class women who survived. Let's take another look at the grouping of these columns:

In [50]:
df.groupby(["Pclass", "Survived", "Sex", "Embarked"])[["Survived"]].count()

Pclass,Survived,Sex,Embarked,Survived
1,0,female,C,1
1,0,female,S,2
1,0,male,C,25
1,0,male,Q,1
1,0,male,S,51
1,1,female,C,42
1,1,female,Q,1
1,1,female,S,46
1,1,male,C,17
1,1,male,S,28


Most of the first-class women who survived embarked at port C or S. How do we decide between these two?

In [51]:
df.groupby(["Embarked", "Pclass", "Sex"])[["Fare"]].agg(["count", "mean", "median"])

Embarked,Pclass,Sex,Fare count,Fare mean,Fare median
C,1,female,43,115.640309,83.1583
C,1,male,42,93.536707,61.6792
C,2,female,7,25.268457,24.0
C,2,male,10,25.42125,25.8604
C,3,female,23,14.694926,14.4583
C,3,male,43,9.352237,7.2292
Q,1,female,1,90.0,90.0
Q,1,male,1,90.0,90.0
Q,2,female,2,12.35,12.35
Q,2,male,1,12.35,12.35


Based on the fare (80.0), the port of embarkation could plausibly be S, but it could just as well be C, so the fare alone does not settle the question. In this case, we will **fill in the missing data with the most common value**, which is S.

In [52]:
df["Embarked"] = df["Embarked"].fillna(value = "S")
df.isnull().sum()

Survived      0
Pclass        0
Surname       0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      0
dtype: int64
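
Instead of hard-coding `"S"`, the most common value can be looked up with `.mode()`. A sketch on a made-up series (the values below are not the real column):

```python
import pandas as pd

# Hypothetical toy column with two missing ports.
embarked = pd.Series(["S", "C", "S", None, "Q", "S", None], name="Embarked")

# .mode() returns a Series (there can be ties); take the first entry.
most_common = embarked.mode(dropna=True)[0]

filled = embarked.fillna(most_common)

print(most_common)
print(filled.tolist())
```

If two values tie for most frequent, `.mode()` returns both and `[0]` picks the first in sorted order, so it is worth checking the result before filling.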

To fill in the ages (a numerical column with a lot of missing data), there are several possible alternatives. Let's see some:

In [53]:
# first, we create a copy of the dataframe, so we can recover it later
df_checkpoint = df.copy()

Option 1: Filling in with the overall average

In [54]:
df["Age"] = df["Age"].fillna(value=df["Age"].mean())
df.isnull().sum()

Survived    0
Pclass      0
Surname     0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

Option 2: Filling in with the average separated by class

In [55]:
# recover the dataframe with the missing ages
df = df_checkpoint.copy()
df.isnull()["Age"].sum()

177

In [56]:
# grouping by class and taking the mean of ages
mean_age_by_class = df.groupby("Pclass")[["Age"]].mean().reset_index()
mean_age_by_class

,Pclass,Age
0,1,38.233441
1,2,29.87763
2,3,25.14062


In [57]:
# subset of the rows with a missing age, keeping only the class column
df_missing_age = df.loc[df["Age"].isnull()][["Pclass"]]
df_missing_age.shape[0]

177

In [58]:
# join the rows with missing ages to the mean age by class, keeping the index of the left frame
age_to_fill = df_missing_age.merge(mean_age_by_class,
                                   on="Pclass",
                                   how="left").set_index(df_missing_age.index)["Age"]
age_to_fill.shape[0]

177

In [59]:
# set the series above to the missing ages
df.loc[df["Age"].isnull(), "Age"] = age_to_fill
df.isnull().sum()

Survived    0
Pclass      0
Surname     0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

Option 3: Generalize the above procedure to as many grouping columns as we want to use in the merge


In [60]:
# recover the dataframe with the missing ages
df = df_checkpoint.copy()
df.isnull().sum()

Survived      0
Pclass        0
Surname       0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      0
dtype: int64

In [61]:
# fill with the mean age of the respective group: class, sex, port of embarkation
cols = ["Pclass", "Sex", "Embarked"]
mean_ages = df.groupby(cols)[["Age"]].mean().reset_index()
df_missing_age = df.loc[df["Age"].isnull()][cols]
age_to_fill = df_missing_age.merge(mean_ages,
                                   on=cols,
                                   how="left").set_index(df_missing_age.index)["Age"]

age_to_fill

5      28.142857
17     30.875889
19     14.062500
26     25.016800
28     22.850000
         ...    
859    25.016800
863    23.223684
868    26.574766
878    26.574766
888    23.223684
Name: Age, Length: 177, dtype: float64

In [62]:
# set the missing ages
df.loc[df["Age"].isnull(), "Age"] = age_to_fill
df.isnull().sum()

Survived    0
Pclass      0
Surname     0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
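
The merge-based procedure above can also be written more compactly with `groupby(...).transform("mean")`, which returns the group mean aligned to the original index. This is a sketch on made-up data, not a replacement for the cells above; the fallback to the overall mean for groups with no known age is an added assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the (1, "male") group has no known age at all.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    ["female", "female", "male", "male", "male", "male"],
    "Age":    [30.0, np.nan, np.nan, 20.0, 30.0, np.nan],
})

cols = ["Pclass", "Sex"]

# One value per row: the mean age of that row's group (NaN if the group has no ages).
group_mean = toy.groupby(cols)["Age"].transform("mean")

# Fill from the group mean first, then fall back to the overall mean.
toy["AgeFilled"] = toy["Age"].fillna(group_mean).fillna(toy["Age"].mean())

print(toy)
```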

In [63]:
# redefining the checkpoint at this point
df_checkpoint = df.copy()

### Processing categorical data

Most machine learning models cannot use categorical data directly, so it must first be transformed into numerical data as part of the pre-processing.

Again, our goal here is **exploratory**. When building models, this numerical encoding of categorical features should be done in a more robust way.

In [64]:
# capture a subdataframe with the columns that are not numeric
df = df_checkpoint.copy()
df_cat = df.select_dtypes(exclude=[np.number])

In [65]:
# check the unique values of each categorical column
for col in df_cat.columns:
    print("Column:", col, end=" --> ")
    # list the values only for columns with at most 50 categorical levels
    if len(df[col].unique().tolist()) > 50:
        print("Too many categorical levels:", len(df[col].unique().tolist()))
    else:
        unique_values = df[col].unique().tolist()
        print(unique_values)

Column: Surname --> Too many categorical levels: 667
Column: Sex --> ['male', 'female']
Column: Embarked --> ['S', 'C', 'Q']


In some columns, ["Sex", "Embarked"], there are only a few **categorical levels**. When this is the case, we can **represent each category as a number**.

What we are going to do is:
- convert the column to category (with the method `astype('category')`);
- use the `cat.codes` attribute to create numeric codes that represent the categories.


In [66]:
df.drop(columns=["Surname"], inplace=True)

In [67]:
for col in ["Sex", "Embarked"]:
    df[col] = df[col].astype('category').cat.codes

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int8   
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    int8   
dtypes: float64(2), int64(4), int8(2)
memory usage: 43.6 KB


In [69]:
df.head()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2
1,1,1,0,38.0,1,0,71.2833,0
2,1,3,0,26.0,0,0,7.925,2
3,1,1,0,35.0,1,0,53.1,2
4,0,3,1,35.0,0,0,8.05,2
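
To read the codes back, it helps to keep the code-to-label mapping before overwriting the column. A small sketch on a made-up series (the name `code_to_label` is illustrative):

```python
import pandas as pd

# Hypothetical toy column, including one missing value.
embarked = pd.Series(["S", "C", "S", "Q", None], name="Embarked").astype("category")

# Categories are sorted, and codes are their positions; missing values get -1.
codes = embarked.cat.codes
code_to_label = dict(enumerate(embarked.cat.categories))

print(codes.tolist())
print(code_to_label)
```

The codes follow the sorted order of the labels, so they carry no meaning of their own; a model may still read them as ordered numbers.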


#### One-hot encoding

Another very common way to use categorical variables is to create **dummy variables**.

<img src="https://miro.medium.com/max/2474/1*ggtP4a5YaRx6l09KQaYOnw.png" width=700>

This is easily done with pandas using the function [pd.get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).

In [70]:
# restore the dataframe with the original categorical features
df = df_checkpoint.copy()
df.head()

,Survived,Pclass,Surname,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,Braund,male,22.0,1,0,7.25,S
1,1,1,Cumings,female,38.0,1,0,71.2833,C
2,1,3,Heikkinen,female,26.0,0,0,7.925,S
3,1,1,Futrelle,female,35.0,1,0,53.1,S
4,0,3,Allen,male,35.0,0,0,8.05,S


In [71]:
# get dummies only for the columns with few categorical levels
# (no .head() here, otherwise df would keep only the first 5 rows)
df = pd.get_dummies(df, columns=["Sex", "Embarked"])
df.head()

,Survived,Pclass,Surname,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,3,Braund,22.0,1,0,7.25,0,1,0,0,1
1,1,1,Cumings,38.0,1,0,71.2833,1,0,1,0,0
2,1,3,Heikkinen,26.0,0,0,7.925,1,0,0,0,1
3,1,1,Futrelle,35.0,1,0,53.1,1,0,0,0,1
4,0,3,Allen,35.0,0,0,8.05,0,1,0,0,1


In [72]:
df.columns.tolist()

['Survived',
 'Pclass',
 'Surname',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Sex_female',
 'Sex_male',
 'Embarked_C',
 'Embarked_Q',
 'Embarked_S']
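
With full one-hot encoding, one dummy per feature is redundant (`Sex_male` is always `1 - Sex_female`). A possible variation, sketched on made-up data, is `drop_first=True`, which drops the first level of each encoded column; `dtype=int` is passed so the dummies are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data with the same categorical columns.
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare":     [7.25, 71.28, 7.75],
})

# One column per category level.
full = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

# Same encoding minus the first level of each feature.
reduced = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```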

__________

### Wrap up

Now that we have finished processing the dataset, it is worth saving it so that we do not lose the changes we made. Pandas lets you save a file with a single line of code.

In [73]:
df.to_csv("titanic_processed.csv")
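
By default `to_csv()` also writes the index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference on made-up data, using an in-memory buffer instead of a real file; passing `index=False` is one way to avoid the extra column.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the processed data.
toy = pd.DataFrame({"Survived": [0, 1], "Fare": [7.25, 71.2833]})

# Default behaviour: the index is written as an unnamed first column.
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reloaded_with_index = pd.read_csv(buffer_with_index)

# index=False: only the data columns are written.
buffer_without_index = io.StringIO()
toy.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
reloaded_without_index = pd.read_csv(buffer_without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_without_index.columns.tolist())
```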