# Introduction to Pandas
Welcome to Pandas I. In this lesson we will cover:
- **Creating, reading and writing dataframes**
- **Indexing in Pandas**
- **Mapping and Summarizing with pandas**

For this exercise we will be using the Titanic Survival Dataset from Kaggle. We will perform various transformations, edits, and explorations.

These are the various features present for each passenger on the ship:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `?`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `?`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Let's start by importing some of the libraries we will be using.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Why Pandas?
Pandas is the go-to Python library for data scientists and machine learning engineers. It makes data easy to access and manipulate, and it's free! Let's start off by reading the Titanic survival dataset.

### Reading in Data

In [5]:
# Read in the titanic survival dataset 
titanic_data = pd.read_csv('titanic_data.csv')
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,?,C,?,328,?
1305,3,0,"Zabour, Miss. Thamine",female,?,1,0,2665,14.4542,?,C,?,?,?
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,?,C,?,304,?
1307,3,0,"Zakarian, Mr. Ortin",male,27,0,0,2670,7.225,?,C,?,?,?


Pandas has various options to read data. If you are curious, you can read about them here: [Pandas Input/Output](https://pandas.pydata.org/pandas-docs/stable/reference/io.html). For the majority of the exercises, we will be using `read_csv`. If you would like to see all the options for `read_csv`, [here is the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv).
A CSV (Comma-Separated Values) file is a table of values separated by commas. The file looks like:

data1,data2,data3

The data is read into a dataframe. A dataframe, as you can see above, is a table: every entry sits at the intersection of a row and a column.
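To round out reading with creating and writing, here is a minimal sketch that builds a small dataframe by hand, writes it out as CSV text, and reads it back. The table `toy` and its values are made up for illustration and are not part of the Titanic file; `io.StringIO` stands in for a file on disk so the example runs without one.

```python
import io
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cho"],
    "age": [29, 41, 35],
})

# Write the table to CSV text (index=False leaves out the row labels)
csv_text = toy.to_csv(index=False)
print(csv_text)

# Read the CSV text back into a new dataframe
toy_again = pd.read_csv(io.StringIO(csv_text))
print(toy_again.shape)  # (rows, columns)
```

To write to an actual file instead, pass a path such as `toy.to_csv('my_file.csv', index=False)`.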

### Pandas Series 
A Series is a sequence of data. What's the difference between a Series and a DataFrame? A DataFrame is a table with one or more named columns, while a Series is a single column of data, much like a list. A Series has no column headers of its own, although it keeps the name of the column it was taken from (see `Name: survived` in the output below).
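A quick way to see the difference is to check the type of what you get back. This is a small sketch on a made-up table (`toy` is hypothetical): single square brackets return a Series, while a list of column names returns a DataFrame.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({"survived": [1, 0, 1], "pclass": [1, 3, 2]})

one_column = toy["survived"]     # single brackets give a Series
sub_table = toy[["survived"]]    # double brackets give a one-column DataFrame

print(type(one_column).__name__)  # Series
print(type(sub_table).__name__)   # DataFrame
print(one_column.name)            # survived
```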

In [6]:
# Let's take a look at a Series within the Titanic survival dataset.
titanic_series_data = titanic_data['survived']
titanic_series_data

0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: survived, Length: 1309, dtype: int64

In [7]:
# To do: Choose a column and create your own series below


### Summary Functions 
Summary functions quickly summarize your data. They are very handy when you are working with a new dataset, and we will explore why that is important in a later lesson.

Let's start by using the `describe()` method.

In [8]:
# Use the describe method to quickly summarize the titanic survival dataset
titanic_data.describe()

Unnamed: 0,pclass,survived,sibsp,parch
count,1309.0,1309.0,1309.0,1309.0
mean,2.294882,0.381971,0.498854,0.385027
std,0.837836,0.486055,1.041658,0.86556
min,1.0,0.0,0.0,0.0
25%,2.0,0.0,0.0,0.0
50%,3.0,0.0,0.0,0.0
75%,3.0,1.0,1.0,0.0
max,3.0,1.0,8.0,9.0


The `describe()` method returns 8 values for each numeric column:
- **count:**
  Number of non-NA/null observations.
- **mean:**
  Mean of the values.
- **std:**
  Standard deviation of the observations.
- **min:**
  Minimum of the values in the object.
- **max:**
  Maximum of the values in the object.
- **25%, 50%, 75%:**
  These values are quantiles. We will go over what they mean in another lesson.
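The entries listed above can also be computed one at a time. The sketch below uses a small made-up Series (`ages` is hypothetical, with one missing entry) to show that `count` skips missing values and that the 50% row is the same as `quantile(0.5)`.

```python
import pandas as pd

ages = pd.Series([2.0, 25.0, 29.0, 30.0, None])  # hypothetical values, one missing

summary = ages.describe()
print(summary)

# Each entry can also be computed on its own
print(ages.count())          # non-missing observations only
print(ages.quantile(0.5))    # the 50% row, also called the median
```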


If you only want to calculate one value, you can call the corresponding method on a single column:

In [9]:
# Statistics for an individual column
titanic_data.survived.mean()

0.3819709702062643

In [1]:
# To do: Calculate the count of the survived column


#### The `unique()` and `value_counts()` methods 

The following methods allow you to quickly summarize a column.

- **unique:** This method will display all the unique values of your column. 
- **value_counts:** This method will display all the unique values and their counts. 

In [10]:
# Unique values of the sex column
titanic_data.sex.unique()

array(['female', 'male'], dtype=object)

In [11]:
#value counts 
titanic_data.sex.value_counts()

male      843
female    466
Name: sex, dtype: int64

#### Data Types
As we saw in previous lessons, a data type describes the kind of value being stored, such as numbers (`int`, `float`) or text (`str`). It is important to know which data types you are working with, because at times you will need to alter, adjust, or replace values in your data. When you do, make sure the new values match the data type of the column you are changing.

Let's take a look at how to find the data types of the columns in the Titanic dataset.

In [12]:
# dtypes: the data type of each column
titanic_data.dtypes

pclass        int64
survived      int64
name         object
sex          object
age          object
sibsp         int64
parch         int64
ticket       object
fare         object
cabin        object
embarked     object
boat         object
body         object
home.dest    object
dtype: object

In [13]:
# To do: find the datatype of just one column from the titanic dataset


Now that we know how to check data types, let's try changing them. In the example below we will convert the `age` column from `object` (text) to numeric.

#### Question
Why would we want to convert the age values from str to numeric? Answer this question below. To answer, double-click the cell below.

(Double click here) Answer: 


In [14]:
# errors='coerce' turns values that cannot be parsed, such as '?', into NaN
titanic_data['age'] = pd.to_numeric(titanic_data['age'], errors = 'coerce')
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,?,C,?,328,?
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,?,C,?,?,?
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.225,?,C,?,304,?
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.225,?,C,?,?,?


Now let's verify our change.

In [15]:
titanic_data.age.dtype

dtype('float64')

As you can see, we successfully converted our age data from str to numeric.
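To see what `errors='coerce'` does in isolation, here is a small sketch on made-up values (`raw_ages` is hypothetical). Text that looks like a number is converted, and anything else becomes NaN; with the default `errors='raise'`, the `?` entry would stop the conversion with an error instead.

```python
import pandas as pd

raw_ages = pd.Series(["29", "0.9167", "?"])   # hypothetical values stored as text

ages = pd.to_numeric(raw_ages, errors="coerce")
print(ages)
print(ages.isna().sum())   # how many entries could not be converted
```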

#### Missing Values 
Missing values are displayed as NaN. NaN, or Not a Number, marks an entry that does not have any data. It is important to know whether your data has missing values, because your AI model will only be as good as the data it learns from.

The raw file marks unknown entries with `?`, which pandas reads as ordinary text rather than as missing values, so only a converted column such as `age` contains NaN. Let's explore how to find missing values.

In [16]:
# Use the .isnull() method to find missing values in our dataset 
titanic_data.isnull()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1305,False,False,False,False,True,False,False,False,False,False,False,False,False,False
1306,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1307,False,False,False,False,False,False,False,False,False,False,False,False,False,False


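Let's also explore how to count the missing values in each column. Keep in mind that a `?` placeholder is ordinary text, so it is not counted as missing. The counts below appear to come from a run in which `age` still held its `?` placeholders, which is why every column reports zero.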

In [51]:
titanic_data.isnull().sum()

pclass       0
survived     0
name         0
sex          0
age          0
sibsp        0
parch        0
ticket       0
fare         0
cabin        0
embarked     0
boat         0
body         0
home.dest    0
last_name    0
dtype: int64
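Because `?` placeholders are plain text, `isnull()` does not see them. One option, sketched below on a made-up three-row file (`csv_text` is hypothetical), is to tell `read_csv` which marker means "missing" through its `na_values` parameter; another is to count the placeholder directly.

```python
import io
import pandas as pd

# Hypothetical three-row file that uses "?" for unknown entries
csv_text = "name,age,cabin\nAnn,29,B5\nBob,?,?\nCho,35,?\n"

as_text = pd.read_csv(io.StringIO(csv_text))
as_missing = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(as_text.isnull().sum())       # every count is 0: "?" is just text
print(as_missing.isnull().sum())    # age 1, cabin 2
print((as_text == "?").sum())       # counting the placeholder directly
```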

#### Question:
Why do you believe missing values are bad for AI? 

(Double click here) Answer: 


### Indexing, Selecting, and Assigning

Selecting, indexing, and assigning data will be your bread and butter of data analysis. These tools will allow you to select parts of your data quickly and efficiently for analysis.

Let's start by selecting a column of data. There are two ways to do this: by accessing the column using the dot operator, or by referencing the column explicitly.

In [52]:
# Select one column data using the dot operator
titanic_data.ticket

0        24160
1       113781
2       113781
3       113781
4       113781
         ...  
1304      2665
1305      2665
1306      2656
1307      2670
1308    315082
Name: ticket, Length: 1309, dtype: object

In [53]:
# Select one column by referencing the column explicitly
titanic_data['ticket']

0        24160
1       113781
2       113781
3       113781
4       113781
         ...  
1304      2665
1305      2665
1306      2656
1307      2670
1308    315082
Name: ticket, Length: 1309, dtype: object

Now let's take a look at a single value of a column. Let's reference the column explicitly, and use square brackets.

In [54]:
titanic_data['ticket'][1]

'113781'

We were able to access the second row of data from the ticket column by referencing the column explicitly and using square brackets. The method that is typically used in industry, and the recommended way of accessing data by position, is `iloc`.

Let's use `iloc` below to access the same data as we did above.

In [19]:
#indexing 
titanic_data['ticket'].iloc[1]

'113781'

`iloc` selects data by integer position, row first and column second.

- `iloc[row, column]`


Accessing columns using iloc

In [66]:
titanic_data.iloc[:,0]

0       1
1       1
2       1
3       1
4       1
       ..
1304    3
1305    3
1306    3
1307    3
1308    3
Name: pclass, Length: 1309, dtype: int64

Selecting the first 3 rows

In [69]:
titanic_data.iloc[:3,0]

0    1
1    1
2    1
Name: pclass, dtype: int64

Accessing the last 5 rows

In [70]:
titanic_data.iloc[-5:,0]

1304    3
1305    3
1306    3
1307    3
1308    3
Name: pclass, dtype: int64

In [76]:
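# loc selects by label; unlike iloc, the end of the slice (2) is included
# (the last_name column is created in the assignment cell further down)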
titanic_data.loc[0:2,'last_name']

0      Allen
1    Allison
2    Allison
Name: last_name, dtype: object
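The slice `0:2` means different things to `loc` and `iloc`. The sketch below uses a made-up table (`toy` is hypothetical) to show that `loc` works with labels and includes the end of the slice, while `iloc` works with positions and excludes it.

```python
import pandas as pd

# Hypothetical toy table with the default 0..n-1 index
toy = pd.DataFrame({
    "last_name": ["Allen", "Allison", "Andrews", "Astor"],
    "pclass": [1, 1, 1, 1],
})

by_label = toy.loc[0:2, "last_name"]    # labels 0, 1 and 2 (end included)
by_position = toy.iloc[0:2, 0]          # positions 0 and 1 (end excluded)

print(len(by_label), len(by_position))
```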

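The index holds the row labels of a dataframe. By default it is the row number, but `set_index()` lets you use the values of a column as labels instead, which is useful when you want to look rows up by those values.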

In [78]:
#changing the index
titanic_data_new_index = titanic_data.set_index("sex")
titanic_data_new_index

sex,pclass,survived,name,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name
female,1,1,"Allen, Miss. Elisabeth Walton",29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO",Allen
male,1,1,"Allison, Master. Hudson Trevor",0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON",Allison
female,1,0,"Allison, Miss. Helen Loraine",2,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison
male,1,0,"Allison, Mr. Hudson Joshua Creighton",30,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON",Allison
female,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",25,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
female,3,0,"Zabour, Miss. Hileni",14.5,1,0,2665,14.4542,?,C,?,328,?,Zabour
female,3,0,"Zabour, Miss. Thamine",?,1,0,2665,14.4542,?,C,?,?,?,Zabour
male,3,0,"Zakarian, Mr. Mapriededer",26.5,0,0,2656,7.225,?,C,?,304,?,Zakarian
male,3,0,"Zakarian, Mr. Ortin",27,0,0,2670,7.225,?,C,?,?,?,Zakarian
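An index is most useful when its labels identify rows. As a sketch, the made-up table below (`toy` is hypothetical) uses the ticket number as the index, looks a row up by that label with `loc`, and then moves the index back into a regular column with `reset_index()`. Note that `set_index()` returns a new dataframe and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "ticket": ["24160", "113781", "13502"],
    "last_name": ["Allen", "Allison", "Andrews"],
    "pclass": [1, 1, 1],
})

by_ticket = toy.set_index("ticket")         # returns a new dataframe
print(by_ticket.loc["13502", "last_name"])  # look a row up by its label

restored = by_ticket.reset_index()          # move the index back into a column
print(list(restored.columns))
```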


Conditional indexing selects rows based on the values they contain. A comparison produces a Series of True/False values, which can then be passed to `loc`.

In [79]:
#conditional indexing 
titanic_data.sex == 'female'

0        True
1       False
2        True
3       False
4        True
        ...  
1304     True
1305     True
1306    False
1307    False
1308    False
Name: sex, Length: 1309, dtype: bool

In [80]:
titanic_data.loc[titanic_data.sex == 'female']

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO",Allen
2,1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63,1,0,13502,77.9583,D7,S,10,?,"Hudson, NY",Andrews
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53,2,0,11769,51.4792,C101,S,D,?,"Bayside, Queens, NY",Appleton
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1286,3,1,"Whabee, Mrs. George Joseph (Shawneene Abi-Saab)",female,38,0,0,2688,7.2292,?,C,C,?,?,Whabee
1290,3,1,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,?,S,?,?,?,Wilkes
1300,3,1,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15,1,0,2659,14.4542,?,C,?,?,?,Yasbeck
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,?,C,?,328,?,Zabour


In [81]:
titanic_data.loc[(titanic_data.sex == 'female') & (titanic_data.survived ==1)]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO",Allen
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63,1,0,13502,77.9583,D7,S,10,?,"Hudson, NY",Andrews
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53,2,0,11769,51.4792,C101,S,D,?,"Bayside, Queens, NY",Appleton
11,1,1,"Astor, Mrs. John Jacob (Madeleine Talmadge Force)",female,18,1,0,PC 17757,227.525,C62 C64,C,4,?,"New York, NY",Astor
12,1,1,"Aubart, Mme. Leontine Pauline",female,24,0,0,PC 17477,69.3,B35,C,9,?,"Paris, France",Aubart
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1260,3,1,"Turja, Miss. Anna Sofia",female,18,0,0,4138,9.8417,?,S,15,?,?,Turja
1261,3,1,"Turkula, Mrs. (Hedwig)",female,63,0,0,4134,9.5875,?,S,15,?,?,Turkula
1286,3,1,"Whabee, Mrs. George Joseph (Shawneene Abi-Saab)",female,38,0,0,2688,7.2292,?,C,C,?,?,Whabee
1290,3,1,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,?,S,?,?,?,Wilkes


In [83]:
titanic_data.loc[titanic_data.last_name.isin(['Allen','Allison'])]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO",Allen
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON",Allison
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON",Allison
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison
618,3,0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,?,S,?,?,"Lower Clapton, Middlesex or Erdington, Birmingham",Allen
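Conditions can be combined in other ways too. The sketch below, on a made-up table (`toy` is hypothetical), uses `|` for "or" and `~` for "not" alongside `isin()`. Each condition needs its own parentheses when combined, as in the `&` example above.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "survived": [1, 0, 0, 1],
    "pclass": [1, 3, 2, 1],
})

is_female = toy.sex == "female"

either = toy.loc[is_female | (toy.survived == 1)]   # | means "or"
not_female = toy.loc[~is_female]                    # ~ means "not"
first_or_second = toy.loc[toy.pclass.isin([1, 2])]

print(len(either), len(not_female), len(first_or_second))
```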


In [84]:
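# Assigning data
# Using pandas we can perform various transformations,
# such as adding a new column.
# Note: this assignment does not copy the dataframe, so both names refer to
# the same object and the new column also appears in titanic_data.
titanic_data_last_name = titanic_data
# The last name is the first word of the name, with the trailing comma removed
titanic_data_last_name['last_name'] = titanic_data['name'].str.split().str[0].replace(',','',regex=True)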
titanic_data_last_name

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO",Allen
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON",Allison
2,1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON",Allison
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,?,C,?,328,?,Zabour
1305,3,0,"Zabour, Miss. Thamine",female,?,1,0,2665,14.4542,?,C,?,?,?,Zabour
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,?,C,?,304,?,Zakarian
1307,3,0,"Zakarian, Mr. Ortin",male,27,0,0,2670,7.225,?,C,?,?,?,Zakarian


In [88]:
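# Assign another column
titanic_data_name = titanic_data_last_name
# The third word of the name is treated as the first name; this is only a rough
# rule (for "Mrs. Hudson J C (Bessie ...)" it returns the husband's name)
titanic_data_name['first_name'] = titanic_data['name'].str.split().str[2].replace(',','',regex=True)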
titanic_data_name

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO",Allen,Elisabeth
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON",Allison,Hudson
2,1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison,Helen
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON",Allison,Hudson
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison,Hudson
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,?,C,?,328,?,Zabour,Hileni
1305,3,0,"Zabour, Miss. Thamine",female,?,1,0,2665,14.4542,?,C,?,?,?,Zabour,Thamine
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,?,C,?,304,?,Zakarian,Mapriededer
1307,3,0,"Zakarian, Mr. Ortin",male,27,0,0,2670,7.225,?,C,?,?,?,Zakarian,Ortin


In [27]:
# To do: create a new column of your own using the Titanic dataset
# (assign to titanic_data['your_column_name'], as in the cells above)


Mapping applies a function to every value in a column and returns the transformed values. Let's use it to normalize the ages to the range 0 to 1.

In [99]:
# Mapping Functions 

max_age = titanic_data.age.max()
min_age = titanic_data.age.min()
titanic_data.age.map(lambda p: (p-min_age)/(max_age-min_age))

0       0.361169
1       0.009395
2       0.022964
3       0.373695
4       0.311064
          ...   
1304    0.179540
1305         NaN
1306    0.329854
1307    0.336117
1308    0.361169
Name: age, Length: 1309, dtype: float64
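The same min-max normalization can also be written without `map`, because arithmetic on a Series is applied to every element at once. This is a sketch on made-up values (`ages` is hypothetical); both versions leave missing entries as NaN.

```python
import pandas as pd

ages = pd.Series([2.0, 30.0, None, 80.0])   # hypothetical ages, one missing

min_age = ages.min()
max_age = ages.max()

mapped = ages.map(lambda p: (p - min_age) / (max_age - min_age))
vectorized = (ages - min_age) / (max_age - min_age)

print(vectorized)
```

The vectorized form is usually shorter and faster; `map` is the better fit when the transformation cannot be expressed as plain arithmetic.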

We can also write a named function and apply it to every row with `apply()`.

In [101]:
def rename_embarked(row):
    if row.embarked == 'S':
        row.embarked = 'UK'
    elif row.embarked == 'C':
        row.embarked = 'FR'
    elif row.embarked == 'Q':
        row.embarked = 'IE'
    return row
    
# axis='columns' passes each row to the function, one at a time
titanic_data = titanic_data.apply(rename_embarked,axis='columns')
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,UK,2,?,"St Louis, MO",Allen,Elisabeth
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,UK,11,?,"Montreal, PQ / Chesterville, ON",Allison,Hudson
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.55,C22 C26,UK,?,?,"Montreal, PQ / Chesterville, ON",Allison,Helen
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.55,C22 C26,UK,?,135,"Montreal, PQ / Chesterville, ON",Allison,Hudson
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.55,C22 C26,UK,?,?,"Montreal, PQ / Chesterville, ON",Allison,Hudson
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,?,FR,?,328,?,Zabour,Hileni
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,?,FR,?,?,?,Zabour,Thamine
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.225,?,FR,?,304,?,Zakarian,Mapriededer
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.225,?,FR,?,?,?,Zakarian,Ortin
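When the transformation is a simple lookup like this one, a dictionary can do the same job without a row-by-row function. The sketch below uses made-up values (`ports` and `codes` are hypothetical names) and is an alternative to the cell above, not the approach the lesson itself takes.

```python
import pandas as pd

ports = pd.Series(["S", "C", "Q", "?"])   # hypothetical values

codes = {"S": "UK", "C": "FR", "Q": "IE"}
renamed = ports.replace(codes)   # values not in the dictionary are kept

print(renamed.tolist())
```

`ports.map(codes)` looks similar, but it turns any value missing from the dictionary (here `?`) into NaN, so `replace` is the closer match to the function above.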


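Grouping splits the rows into groups that share a value, so that each group can be summarized separately. Here we count the passengers for each port of embarkation.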

In [103]:
# Grouping 
titanic_data.groupby('embarked').embarked.count()

embarked
?       2
FR    270
IE    123
UK    914
Name: embarked, dtype: int64
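Counting is only one of the summaries a group can produce. As a sketch on a made-up table (`toy` is hypothetical), the code below counts the passengers in each group and also takes the mean of `survived`, which gives the share of survivors per group because the column holds 0s and 1s.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "embarked": ["UK", "FR", "UK", "IE", "UK"],
    "survived": [1, 1, 0, 0, 1],
})

passengers = toy.groupby("embarked").survived.count()
survival_rate = toy.groupby("embarked").survived.mean()

print(passengers)
print(survival_rate)
```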

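Several aggregations can be computed at once with `agg()`. Note that `fare` is still stored as text at this point, so `min` and `max` compare strings rather than numbers, which is why some results below look wrong (for example, a minimum larger than the maximum).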

In [107]:
titanic_data.groupby(['embarked']).fare.agg([len,min,max])

embarked,len,min,max
?,2,80.0,80
FR,270,106.425,91.0792
IE,123,10.7083,90
UK,914,0.0,?
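To get numeric minimums and maximums, the column has to be converted before aggregating. The sketch below uses a made-up table (`toy` is hypothetical) to show the text comparison problem and one way around it, using `pd.to_numeric` as in the age example earlier.

```python
import pandas as pd

# Hypothetical toy table with fares stored as text
toy = pd.DataFrame({
    "embarked": ["UK", "UK", "FR", "FR"],
    "fare": ["7.25", "211.3375", "106.425", "91.0792"],
})

# Text is compared character by character, so "91.0792" counts as the largest
text_max = max(toy["fare"].tolist())

# Convert to numbers first, then aggregate per group
toy["fare"] = pd.to_numeric(toy["fare"], errors="coerce")
stats = toy.groupby("embarked").fare.agg(["count", "min", "max"])

print(text_max)
print(stats)
```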


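`sort_values()` sorts the rows by the values of one or more columns. Sorting is ascending by default; pass `ascending=False` to reverse it.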

In [108]:
# Sorting 
titanic_data.sort_values(by='embarked')

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
168,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80,B28,?,6,?,?,Icard,Amelie
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80,B28,?,6,?,"Cincinatti, OH",Stone,George
953,3,1,"Leeni, Mr. Fahim ('Philip Zenni')",male,22.0,0,0,2620,7.225,?,FR,6,?,?,Leeni,Fahim
531,2,0,"Pernot, Mr. Rene",male,,0,0,SC/PARIS 2131,15.05,?,FR,?,?,?,Pernot,Rene
538,2,1,"Portaluppi, Mr. Emilio Ilario Giuseppe",male,30.0,0,0,C.A. 34644,12.7375,?,FR,14,?,"Milford, NH",Portaluppi,Emilio
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
543,2,0,"Reeves, Mr. David",male,36.0,0,0,C.A. 17248,10.5,?,UK,?,?,"Brighton, Sussex",Reeves,David
544,2,0,"Renouf, Mr. Peter Henry",male,34.0,1,0,31027,21,?,UK,12,?,"Elizabeth, NJ",Renouf,Peter
545,2,1,"Renouf, Mrs. Peter Henry (Lillian Jefferys)",female,30.0,3,0,31027,21,?,UK,?,?,"Elizabeth, NJ",Renouf,Peter
528,2,0,"Parkes, Mr. Francis 'Frank'",male,,0,0,239853,0,?,UK,?,?,Belfast,Parkes,Francis


In [111]:
titanic_data.sort_values(by='embarked', ascending=False)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,UK,2,?,"St Louis, MO",Allen,Elisabeth
784,3,0,"Dyker, Mr. Adolf Fredrik",male,23.0,1,0,347072,13.9,?,UK,?,?,"West Haven, CT",Dyker,Adolf
794,3,1,"Emanuel, Miss. Virginia Ethel",female,5.0,0,0,364516,12.475,?,UK,13,?,"New York, NY",Emanuel,Virginia
793,3,0,"Elsbury, Mr. William James",male,47.0,0,0,A/5 3902,7.25,?,UK,?,?,"Illinois, USA",Elsbury,William
788,3,0,"Ekstrom, Mr. Johan",male,45.0,0,0,347061,6.975,?,UK,?,?,"Effington Rut, SD",Ekstrom,Johan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
953,3,1,"Leeni, Mr. Fahim ('Philip Zenni')",male,22.0,0,0,2620,7.225,?,FR,6,?,?,Leeni,Fahim
243,1,0,"Rosenshine, Mr. George ('Mr George Thorne')",male,46.0,0,0,PC 17585,79.2,?,FR,?,16,"New York, NY",Rosenshine,George
654,3,0,"Baccos, Mr. Raffull",male,20.0,0,0,2679,7.225,?,FR,?,?,?,Baccos,Raffull
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80,B28,?,6,?,"Cincinatti, OH",Stone,George


In [114]:
titanic_data.sort_values(by = ['sex','first_name'])

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
333,2,1,"Ball, Mrs. (Ada E Hall)",female,36.0,0,0,28551,13,D,UK,10,?,"Bristol, Avon / Jacksonville, FL",Ball,(Ada
371,2,1,"Christy, Mrs. (Alice Frances)",female,45.0,0,2,237789,30,?,UK,12,?,London,Christy,(Alice
484,2,1,"Lemore, Mrs. (Amelia Milley)",female,34.0,0,0,C.A. 34260,10.5,F33,UK,14,?,"Chicago, IL",Lemore,(Amelia
1026,3,1,"Moor, Mrs. (Beila)",female,27.0,0,1,392096,12.475,E121,UK,14,?,?,Moor,(Beila)
666,3,0,"Barbara, Mrs. (Catherine David)",female,45.0,0,1,2691,14.4542,?,FR,?,?,"Syria Ottawa, ON",Barbara,(Catherine
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1219,3,0,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,?,UK,?,?,?,Spector,Woolf
876,3,0,"Ilieff, Mr. Ylio",male,,0,0,349220,7.8958,?,UK,?,?,?,Ilieff,Ylio
750,3,0,"Danoff, Mr. Yoto",male,27.0,0,0,349219,7.8958,?,UK,?,?,"Bulgaria Chicago, IL",Danoff,Yoto
1186,3,0,"Samaan, Mr. Youssef",male,,2,0,2662,21.6792,?,FR,?,?,?,Samaan,Youssef


Replacing values and renaming columns

In [125]:
# Replacing values: turn the '?' placeholders into missing values
titanic_data = titanic_data.replace({'?': None})
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,UK,2,,"St Louis, MO",Allen,Elisabeth
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,UK,11,,"Montreal, PQ / Chesterville, ON",Allison,Hudson
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.55,C22 C26,UK,,,"Montreal, PQ / Chesterville, ON",Allison,Helen
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.55,C22 C26,UK,,135,"Montreal, PQ / Chesterville, ON",Allison,Hudson
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.55,C22 C26,UK,,,"Montreal, PQ / Chesterville, ON",Allison,Hudson
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,FR,,328,,Zabour,Hileni
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,FR,,,,Zabour,Thamine
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.225,,FR,,304,,Zakarian,Mapriededer
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.225,,FR,,,,Zakarian,Ortin


In [126]:
# Renaming: rename() returns a new dataframe; assign the result to keep it
titanic_data.rename(columns={'embarked': 'country'})

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,country,boat,body,home.dest,last_name,first_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,UK,2,,"St Louis, MO",Allen,Elisabeth
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,UK,11,,"Montreal, PQ / Chesterville, ON",Allison,Hudson
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.55,C22 C26,UK,,,"Montreal, PQ / Chesterville, ON",Allison,Helen
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.55,C22 C26,UK,,135,"Montreal, PQ / Chesterville, ON",Allison,Hudson
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.55,C22 C26,UK,,,"Montreal, PQ / Chesterville, ON",Allison,Hudson
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,FR,,328,,Zabour,Hileni
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,FR,,,,Zabour,Thamine
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.225,,FR,,304,,Zakarian,Mapriededer
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.225,,FR,,,,Zakarian,Ortin
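Like most pandas methods, `rename()` leaves the original dataframe untouched and hands back a new one. The sketch below shows this on a made-up table (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({"embarked": ["UK", "FR"], "pclass": [1, 3]})

renamed = toy.rename(columns={"embarked": "country"})   # new dataframe

print(list(toy.columns))       # original is unchanged
print(list(renamed.columns))
```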


In [None]:
# Combining 
# joins, concat, 
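The cell above is left as a placeholder, so the following is only a sketch of the two operations it names, using made-up tables (`first`, `second` and `fares` are hypothetical). `pd.concat` stacks tables that share the same columns, and `merge` joins two tables on a common column.

```python
import pandas as pd

# Hypothetical toy tables
first = pd.DataFrame({"ticket": ["24160", "113781"], "pclass": [1, 1]})
second = pd.DataFrame({"ticket": ["2665"], "pclass": [3]})
fares = pd.DataFrame({"ticket": ["24160", "2665"], "fare": [211.3375, 14.4542]})

# Stack the rows of both tables and renumber the index
stacked = pd.concat([first, second], ignore_index=True)

# Left join: keep every row of stacked, add fare where the ticket matches
joined = stacked.merge(fares, on="ticket", how="left")

print(joined)
```

Tickets with no match in `fares` get NaN in the `fare` column, which is the usual sign that a left join found nothing to attach.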

Now that we have explored our dataset, let's use it for machine learning.

In [118]:
# Separate target from dataset
target = titanic_data['survived']
features_raw = titanic_data.drop('survived', axis = 1)

In [127]:
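# Preprocess data: one-hot encode the text columns and fill missing values with 0
# Note: this encodes every text column, including name, boat and body. The boat
# and body columns are only known after the sinking, so they likely leak the
# outcome into the features.
features = pd.get_dummies(features_raw)
features = features.fillna(0.0)
features.head()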

Unnamed: 0,pclass,age,sibsp,parch,"name_Abbing, Mr. Anthony","name_Abbott, Master. Eugene Joseph","name_Abbott, Mr. Rossmore Edward","name_Abbott, Mrs. Stanton (Rosa Hunt)","name_Abelseth, Miss. Karen Marie","name_Abelseth, Mr. Olaus Jorgensen",...,first_name_Wazli,first_name_Wendla,first_name_Wilhelm,first_name_William,first_name_Winifred,first_name_Woolf,first_name_Ylio,first_name_Yoto,first_name_Youssef,first_name_hoef
0,1,29.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0.9167,1,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,2.0,1,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,30.0,1,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,25.0,1,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [128]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Train the model

# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# Define the classifier and fit it to the training data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [129]:
# Test the model

# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 1.0
The test accuracy is 0.9732824427480916
