# Titanic dataset analysis

#### Author: Romullo Ferreira

The aim of this project is to analyse the Kaggle Titanic dataset. The analysis follows these steps:

1. Business Questions (Business Understanding)
2. Data Understanding
3. Data wrangling
    - 3.1. Gather
    - 3.2. Assess
    - 3.3. Prepare Data (Clean)
4. Data exploration and visualization (Data Modeling)
5. Evaluate the Results

## 3.1. Gather

First, let's import the libraries needed for this project.

In [1]:
# Import the libraries we will use in our data analysis.
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load the CSV file.

In [2]:
# Gather the data (read the CSV file into a dataframe).
df_titanic = pd.read_csv('titanic-data-6.csv', sep=',')

## 3.2. Assess

A large dataset can be difficult to analyze all at once, but inspecting small samples lets us answer some questions early on.

##### Let's take a first look at the data using the head() function, which returns the first 5 rows of the dataframe.

In [4]:
df_titanic.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


##### We can also return the last 5 rows of the dataframe using the tail() function.

In [6]:
df_titanic.tail()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


##### Descriptive statistics summarize each numeric column of the data.
The describe() function reports the count, mean, standard deviation, minimum, quartiles and maximum of every numeric column.

In [7]:
df_titanic.describe()

|  | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
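By default describe() only covers the numeric columns, which is why Name, Sex, Ticket and Embarked are absent from the summary above. The sketch below shows one possible way to summarize the text columns as well. It uses a small stand-in frame named `sample` (an assumption for illustration, built from the five rows printed by head()), not the full dataset.

```python
import pandas as pd

# Stand-in frame built from the rows shown by head(); the real analysis
# would call the same methods on df_titanic.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Default behaviour: only numeric columns are summarized.
numeric_summary = sample.describe()

# Excluding numeric columns summarizes the remaining (text) columns with
# count, unique, top (most frequent value) and freq (its frequency).
text_summary = sample.describe(exclude="number")

print(numeric_summary)
print(text_summary)
```

For text columns the useful rows are `unique`, `top` and `freq`, which give a quick feel for how many categories a column has and which one dominates.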


##### Returning the dimensions of the dataframe

In [8]:
df_titanic.shape

(891, 12)

The output above shows that the dataframe has 891 rows and 12 columns.

##### Using the info() function, we display a concise summary of the dataframe: the number of non-null values in each column (which reveals missing values) and the data type of each column.

In [9]:
df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


As we can see, the Age, Cabin and Embarked columns have missing values.
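From the info() output, the full dataset has 891 - 714 = 177 missing Age values, 891 - 204 = 687 missing Cabin values and 891 - 889 = 2 missing Embarked values. The same counts can be obtained directly rather than by subtraction. The sketch below illustrates the idea on a small stand-in frame named `sample` (an assumption for illustration, built from rows printed by head() and tail()).

```python
import numpy as np
import pandas as pd

# Stand-in frame using rows 0, 1, 2, 888 and 889 from the outputs above;
# the real analysis would use df_titanic instead.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, "C148"],
    "Embarked": ["S", "C", "S", "S", "C"],
})

# isnull() marks each missing cell as True; summing counts them per column.
missing_counts = sample.isnull().sum()

# Dividing by the number of rows gives the share of missing values.
missing_share = missing_counts / len(sample)

print(missing_counts)
print(missing_share)
```

Looking at the share rather than the raw count helps decide between filling a column and dropping it.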

##### What are the data types of the columns? We can also inspect them with the dtypes attribute.

In [None]:
df_titanic.dtypes

I believe the data types need no changes.
The Age variable is a float for two reasons: there are children on board under the age of 1 (the minimum age is 0.42), and the missing values are stored as NaN, which is itself a float. Let's leave it as is.
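The effect of fractional and missing values on the column type can be checked with a minimal sketch. The three series below are hypothetical illustrations, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Whole-number ages with no gaps are stored as integers.
complete_ages = pd.Series([22, 38, 26])

# A single missing value (NaN) forces the column to float64,
# because NaN is itself a floating-point value.
ages_with_gap = pd.Series([22, 38, np.nan])

# A fractional age, such as the minimum of 0.42 reported by describe(),
# also requires a float column.
ages_with_infant = pd.Series([22, 38, 0.42])

print(complete_ages.dtype)
print(ages_with_gap.dtype)
print(ages_with_infant.dtype)
```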

##### Let's see the number of unique values for each column.

In [10]:
df_titanic.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

##### Are there duplicate rows? Summing the result of the duplicated() method counts them.

In [1]:
df_titanic.duplicated().sum()

The dataset has no duplicate rows: PassengerId takes 891 distinct values across the 891 rows, so no two rows can be identical. This is very good.
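To show what duplicated() reports when duplicates do exist, here is a minimal sketch on a hypothetical frame named `sample`. Its values come from the head() output, but the repeated third row is made up for illustration.

```python
import pandas as pd

# Illustrative frame: the third row repeats the first one exactly.
sample = pd.DataFrame({
    "PassengerId": [1, 3, 1],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Braund, Mr. Owen Harris",
    ],
    "Pclass": [3, 3, 3],
})

# duplicated() flags a row as True when an identical row appeared earlier.
n_full_duplicates = sample.duplicated().sum()

# subset= restricts the comparison to the listed columns.
n_same_class = sample.duplicated(subset=["Pclass"]).sum()

# drop_duplicates() returns a copy without the flagged rows.
deduplicated = sample.drop_duplicates()

print(n_full_duplicates, n_same_class, len(deduplicated))
```

Checking a subset of columns is useful when an identifier column would otherwise make every row look unique.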

## 3.3. Prepare Data (Clean)

### Missing values (fixing NaN values in the Age column)

- There are missing values in the Age column.

##### Define

- We use the fillna method to replace the missing values in the Age column with the mean age.

##### Code

In [2]:
# Replace missing ages with the mean of the known ages.
mean_age = df_titanic['Age'].mean()
df_titanic['Age'] = df_titanic['Age'].fillna(mean_age)

##### Test

In [12]:
df_titanic.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            891
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

Nice! We solved the problem in the Age column.
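It may help to see what mean imputation does to the column's statistics. The sketch below uses a hypothetical series named `ages` (four ages from head() plus one missing value, as in row 888) and suggests that filling with the mean leaves the mean unchanged while reducing the spread.

```python
import numpy as np
import pandas as pd

# Four known ages from the head() output and one missing value.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan], name="Age")

# mean() skips NaN by default, so this is the mean of the known ages.
mean_age = ages.mean()

# fillna() returns a new series; assigning the result back avoids
# relying on inplace=True.
filled = ages.fillna(mean_age)

print(filled.isnull().sum())      # number of missing values left
print(ages.mean(), filled.mean())
print(ages.std(), filled.std())
```

Because every filled value sits exactly on the mean, the standard deviation of the filled column is smaller than that of the known ages, which is worth keeping in mind when reading later plots of the age distribution.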

### Delete columns

- The Cabin column also has missing values, but I chose to delete it because I will not use it in this project. 

##### Define

- We will delete the Cabin column.

##### Code

In [None]:
# Drop the Cabin column, which this analysis does not use.
df_titanic = df_titanic.drop(columns=['Cabin'])

##### Test

In [None]:
df_titanic.head()

Great! Cabin column successfully deleted!
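One thing to watch in a notebook is that running a drop cell a second time raises a KeyError, because the column no longer exists. The following sketch, on a hypothetical frame named `sample`, shows one way to make the step safe to re-run.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the first three rows from head().
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Cabin": [np.nan, "C85", np.nan],
})

# columns= names the axis explicitly; drop() returns a new frame and
# leaves the original untouched.
without_cabin = sample.drop(columns=["Cabin"])

# Dropping Cabin again would normally raise KeyError.
# errors="ignore" skips labels that are not present.
still_without_cabin = without_cabin.drop(columns=["Cabin"], errors="ignore")

print(list(without_cabin.columns))
print(list(still_without_cabin.columns))
```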

## 4. Data exploration and visualization (Data Modeling)

##### a. What was the average age of passengers on board?