## Prepare Data: Read in the data

_Throughout this course we are going to be learning to implement a few different ensemble learning techniques and then at the end of the course we will compare the performance of each of them against each other._

_In order to set up for that, in this chapter we are going to introduce the data we will be using, clean it up, and then write out our training, validation, and test sets._

_The dataset we will be using is this Titanic dataset which is a publicly available dataset that is commonly used for machine learning. And if you have taken any of my other Applied Machine Learning courses, you'll be familiar with this data._

_It's worth mentioning that we are only using the training set from the Kaggle competition in this course because the target values have been stripped from the test set. We need those target values to evaluate our models, so we will take the training set from the Kaggle competition and split it into our own training, validation, and test sets._
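
A minimal sketch of that kind of three-way split, using only pandas. The 60/20/20 proportions, the random seed, and the placeholder frame of 891 row IDs are assumptions made for illustration; the course itself may use different ratios or a different tool.

```python
import pandas as pd

# Placeholder frame: 891 row IDs stand in for the Kaggle training set.
full = pd.DataFrame({"PassengerId": range(1, 892)})

# Assumed proportions (not taken from the course): 60% train, 20% validation.
TRAIN_FRAC = 0.6
VAL_FRAC = 0.2

# Shuffle all rows once, then slice the shuffled frame into three parts.
shuffled = full.sample(frac=1.0, random_state=42)
n_train = int(len(shuffled) * TRAIN_FRAC)
n_val = int(len(shuffled) * VAL_FRAC)

train = shuffled.iloc[:n_train]
val = shuffled.iloc[n_train:n_train + n_val]
test = shuffled.iloc[n_train + n_val:]  # whatever remains becomes the test set

print(len(train), len(val), len(test))
```

Shuffling once and slicing keeps the three sets disjoint, so no passenger appears in more than one of them.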

_So this dataset contains information about 891 people who were on board the ship when it departed in 1912. Some people aboard the ship were more likely to survive the wreck than others. There were not enough lifeboats for everybody, so certain groups of people were prioritized. Our task is to build a model to predict which people would survive using certain information about the 891 people on board. The features to be used:_

_READ THROUGH FEATURES_

We are using the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition (training set only).

This dataset contains information about 891 people who were on board the ship when it departed on April 10th, 1912. As noted in the description on Kaggle's website, some people aboard the ship were more likely to survive the wreck than others. There were not enough lifeboats for everybody, so women, children, and the upper class were prioritized. Using the information about these 891 passengers, the challenge is to build a model to predict which people would survive based on the following fields:

- **Name** (str) - Name of the passenger
- **Pclass** (int) - Ticket class
- **Sex** (str) - Sex of the passenger
- **Age** (float) - Age in years
- **SibSp** (int) - Number of siblings and spouses aboard
- **Parch** (int) - Number of parents and children aboard
- **Ticket** (str) - Ticket number
- **Fare** (float) - Passenger fare
- **Cabin** (str) - Cabin number
- **Embarked** (str) - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
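
One way to confirm that the loaded data matches the types listed above is to check each column's dtype. The sketch below is illustrative only: it parses three rows copied from the Kaggle training data rather than the full file, and the `EXPECTED` mapping and `check_field_types` helper are hypothetical names introduced here, not part of the course code.

```python
import io

import pandas as pd
from pandas.api import types as ptypes

# Three rows copied from the Kaggle training data, enough to illustrate the check.
SAMPLE_CSV = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
'''

sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Hypothetical mapping from each listed field to a pandas dtype predicate.
EXPECTED = {
    "Name": ptypes.is_string_dtype,
    "Pclass": ptypes.is_integer_dtype,
    "Sex": ptypes.is_string_dtype,
    "Age": ptypes.is_float_dtype,
    "SibSp": ptypes.is_integer_dtype,
    "Parch": ptypes.is_integer_dtype,
    "Ticket": ptypes.is_string_dtype,
    "Fare": ptypes.is_float_dtype,
    "Cabin": ptypes.is_string_dtype,
    "Embarked": ptypes.is_string_dtype,
}


def check_field_types(df):
    """Return {column: True/False} for whether each dtype matches the list."""
    return {col: bool(check(df[col].dtype)) for col, check in EXPECTED.items()}


type_report = check_field_types(sample)
print(type_report)
```

The predicates are applied to each column's dtype rather than to its values, so text columns that contain missing entries (such as `Cabin`) are still reported as string-like.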

### Read in Data

_Let's get into the data. We will start by importing pandas, reading in our data, and taking a look at the first five rows._

_You'll see each of the features that we discussed above. The `Survived` column is the one that we'll be trying to predict (0 = did not survive, 1 = survived), and all of the other columns are to be considered as potential features for the model. As I mentioned before, in this section we will be cleaning the data and then writing out our training, validation, and test sets. The main idea behind cleaning the data is to structure it so that the model has the best possible chance of predicting whether a person survived or not. If you are interested in digging more into this topic, feel free to take a look at my Feature Engineering course that is part of this Applied Machine Learning course._

_In the next video, we will dive into cleaning up this data._

```python
import pandas as pd

# Load the Kaggle training set and preview the first five rows
titanic = pd.read_csv('../titanic.csv')
titanic.head()
```

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |
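
Since `Survived` is the value to predict and the remaining columns are candidate features, a natural first look is to separate the two and count missing entries per column. This is a rough sketch under stated assumptions: it uses three rows copied from the output above instead of the full file, and the names `titanic_sample`, `target`, `features`, and `missing_counts` are illustrative rather than taken from the course.

```python
import io

import pandas as pd

# Three of the rows shown above, inlined so the example runs without the CSV file.
SAMPLE_CSV = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
'''

titanic_sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# The target is the Survived column (0 = did not survive, 1 = survived).
target = titanic_sample["Survived"]

# Every other column is a potential feature at this stage.
features = titanic_sample.drop(columns=["Survived"])

# Count missing values per column; empty CSV cells are read as NaN.
missing_counts = titanic_sample.isnull().sum()

print(target.tolist())
print(missing_counts[missing_counts > 0])
```

On the full dataset, the same `isnull().sum()` call gives a quick view of which columns will need attention during cleaning.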
