# Introduction to Pandas II
Welcome to Pandas II. In this lesson we will be covering:
- **Creating, reading and writing dataframes**
- **Indexing in Pandas**
- **Mapping and Summarizing with pandas**

For this exercise we will again be using the Titanic Survival Dataset from Kaggle. We will perform various transformations, edits, and explorations.

Let's go over the column values once more as a reminder of the data we are using:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `?`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `?`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)


In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [29]:
# Read in the titanic survival dataset 
titanic_data = pd.read_csv('titanic_data.csv')
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,?,C,?,328,?
1305,3,0,"Zabour, Miss. Thamine",female,?,1,0,2665,14.4542,?,C,?,?,?
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,?,C,?,304,?
1307,3,0,"Zakarian, Mr. Ortin",male,27,0,0,2670,7.225,?,C,?,?,?


## Grouping and Sorting 

### Groupwise analysis

In [30]:
# Grouping 
titanic_data.groupby('embarked').embarked.count()

embarked
?      2
C    270
Q    123
S    914
Name: embarked, dtype: int64

In [31]:
titanic_data.groupby(['embarked']).fare.agg([len,min,max])

Unnamed: 0_level_0,len,min,max
embarked,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
?,2,80.0,80
C,270,106.425,91.0792
Q,123,10.7083,90
S,914,0.0,?
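
The `min` and `max` values above look inconsistent: Cherbourg's minimum is larger than its maximum, and Southampton's maximum is `?`. A likely explanation is that `fare` is stored as text at this point (its dtype is `object`), so values are compared character by character rather than numerically. The sketch below reproduces the effect on a small made-up frame (the values are illustrative, not taken from the dataset) and shows how converting with `pd.to_numeric` first gives numeric results.

```python
import pandas as pd

# Hypothetical toy data: fares stored as strings, as happens when a
# column contains a placeholder such as '?'.
toy = pd.DataFrame({
    "embarked": ["C", "C", "C", "S", "S"],
    "fare": ["106.425", "91.0792", "15.05", "7.25", "?"],
})

# String comparison: '106.425' sorts before '15.05' and '91.0792'.
text_stats = toy.groupby("embarked")["fare"].agg(["min", "max"])

# Convert to numbers first; '?' becomes NaN and is skipped by min/max.
toy["fare_num"] = pd.to_numeric(toy["fare"], errors="coerce")
num_stats = toy.groupby("embarked")["fare_num"].agg(["count", "min", "max"])

print(text_stats)
print(num_stats)
```

Passing the aggregation names as strings (`"min"`, `"max"`) rather than the Python built-ins is a stylistic choice here; both forms are accepted by `agg`.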


### Multi_indexes
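
Grouping by more than one column produces a result whose index has several levels, which pandas calls a `MultiIndex`. The following is a minimal sketch on a made-up frame (the column names mirror the Titanic data, the values are illustrative); `reset_index()` is one way to turn the levels back into ordinary columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "embarked": ["C", "C", "S", "S", "S"],
    "sex": ["female", "male", "female", "male", "male"],
    "survived": [1, 0, 1, 0, 1],
})

# Grouping by two columns yields a Series indexed by (embarked, sex) pairs.
survival_rate = toy.groupby(["embarked", "sex"])["survived"].mean()
print(type(survival_rate.index).__name__)

# Select one combination with a tuple, or one outer level with a single label.
print(survival_rate.loc[("S", "male")])
print(survival_rate.loc["S"])

# Flatten the index levels back into regular columns.
flat = survival_rate.reset_index()
print(flat)
```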

### Sorting

In [32]:
# Sorting 
titanic_data.sort_values(by='embarked')

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
168,1,1,"Icard, Miss. Amelie",female,38,0,0,113572,80,B28,?,6,?,?
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62,0,0,113572,80,B28,?,6,?,"Cincinatti, OH"
953,3,1,"Leeni, Mr. Fahim ('Philip Zenni')",male,22,0,0,2620,7.225,?,C,6,?,?
531,2,0,"Pernot, Mr. Rene",male,?,0,0,SC/PARIS 2131,15.05,?,C,?,?,?
538,2,1,"Portaluppi, Mr. Emilio Ilario Giuseppe",male,30,0,0,C.A. 34644,12.7375,?,C,14,?,"Milford, NH"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
543,2,0,"Reeves, Mr. David",male,36,0,0,C.A. 17248,10.5,?,S,?,?,"Brighton, Sussex"
544,2,0,"Renouf, Mr. Peter Henry",male,34,1,0,31027,21,?,S,12,?,"Elizabeth, NJ"
545,2,1,"Renouf, Mrs. Peter Henry (Lillian Jefferys)",female,30,3,0,31027,21,?,S,?,?,"Elizabeth, NJ"
528,2,0,"Parkes, Mr. Francis 'Frank'",male,?,0,0,239853,0,?,S,?,?,Belfast


In [33]:
titanic_data.sort_values(by='embarked', ascending=False)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
784,3,0,"Dyker, Mr. Adolf Fredrik",male,23,1,0,347072,13.9,?,S,?,?,"West Haven, CT"
794,3,1,"Emanuel, Miss. Virginia Ethel",female,5,0,0,364516,12.475,?,S,13,?,"New York, NY"
793,3,0,"Elsbury, Mr. William James",male,47,0,0,A/5 3902,7.25,?,S,?,?,"Illinois, USA"
788,3,0,"Ekstrom, Mr. Johan",male,45,0,0,347061,6.975,?,S,?,?,"Effington Rut, SD"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
953,3,1,"Leeni, Mr. Fahim ('Philip Zenni')",male,22,0,0,2620,7.225,?,C,6,?,?
243,1,0,"Rosenshine, Mr. George ('Mr George Thorne')",male,46,0,0,PC 17585,79.2,?,C,?,16,"New York, NY"
654,3,0,"Baccos, Mr. Raffull",male,20,0,0,2679,7.225,?,C,?,?,?
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62,0,0,113572,80,B28,?,6,?,"Cincinatti, OH"


In [34]:
# 'first_name' is not a column in this dataset, so sort by the existing 'name' column instead
titanic_data.sort_values(by=['sex', 'name'])
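
Every label passed to `by` has to be an existing column; a name that is not in the frame raises a `KeyError`. Below is a minimal sketch of sorting on several columns, using a made-up frame and a hypothetical `first_name` column derived from `name` purely for illustration (the split assumes the "Surname, Title. First names" pattern seen in the data).

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Allen, Miss. Elisabeth Walton", "Allison, Mr. Hudson Joshua",
             "Zabour, Miss. Hileni", "Zakarian, Mr. Ortin"],
    "sex": ["female", "male", "female", "male"],
})

# Hypothetical helper column: the text after the first '.' (the title's dot).
toy["first_name"] = toy["name"].str.split(".", n=1).str[1].str.strip()

# Sort by sex (ascending), then by first name (descending) within each sex.
ordered = toy.sort_values(by=["sex", "first_name"], ascending=[True, False])
print(ordered[["sex", "first_name"]])

# Check for column names that do not exist before sorting.
missing = [col for col in ["sex", "surname"] if col not in toy.columns]
print(missing)
```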

## Data Types and Missing values 


### Dtypes
As covered in our previous lessons, a data type describes what kind of values are stored: numbers are stored as integers or floats, and text is stored as strings (which pandas reports as `object`). It is important to know which data types you are working with, because at times you will need to alter, adjust, or replace values in your data. When you do, make sure the new values match the data type of the data you are changing.

Let's take a look at how to find the data types of the columns in the Titanic dataset.

In [35]:
#dtypes, types of data
titanic_data.dtypes

pclass        int64
survived      int64
name         object
sex          object
age          object
sibsp         int64
parch         int64
ticket       object
fare         object
cabin        object
embarked     object
boat         object
body         object
home.dest    object
dtype: object

In [36]:
# To do: find the datatype of just one column from the titanic dataset


Now that we know how to check data types, let's try changing them. In the example below we will convert the `age` column from `object` (text) to numeric.

#### Question
Why would we want to convert the age values from str to numeric? Answer this question below. To answer, double-click the cell below.

(Double click here) Answer: 


In [37]:
titanic_data['age'] = pd.to_numeric(titanic_data['age'], errors = 'coerce')
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,?,C,?,328,?
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,?,C,?,?,?
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.225,?,C,?,304,?
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.225,?,C,?,?,?


Now let's verify our change.

In [38]:
titanic_data.age.dtype

dtype('float64')

As you can see, we successfully converted our age data from str to numeric.
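
The `errors='coerce'` argument is what turns unparseable entries such as `?` into `NaN` instead of raising an error. A small sketch on a made-up Series (the values are illustrative):

```python
import pandas as pd

raw_age = pd.Series(["29", "0.9167", "?", "30"])

# Without coercion, the '?' entry makes the conversion fail.
try:
    pd.to_numeric(raw_age)
except ValueError as exc:
    print("conversion failed:", exc)

# With errors='coerce', values that cannot be parsed become NaN.
age = pd.to_numeric(raw_age, errors="coerce")
print(age.dtype)
print(age.isna().sum())

# Numeric operations now work; NaN is skipped by default.
print(age.mean())
```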

### Missing Values 
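Missing values are displayed as `NaN` (Not a Number), which marks an entry that has no data. It is important to know whether your data has missing values, because your AI model will only be as good as the data it is trained on.

After the conversion above, the `?` entries in the `age` column became `NaN`, so that column now contains missing values. The other columns still store `?` as plain text, which pandas does not count as missing. Let's explore how to find missing values.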

In [39]:
# Use the .isnull() method to find missing values in our dataset 
titanic_data.isnull()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1305,False,False,False,False,True,False,False,False,False,False,False,False,False,False
1306,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1307,False,False,False,False,False,False,False,False,False,False,False,False,False,False


Let's also explore how to count the missing values in each column of data.

In [40]:
titanic_data.isnull().sum()

pclass         0
survived       0
name           0
sex            0
age          263
sibsp          0
parch          0
ticket         0
fare           0
cabin          0
embarked       0
boat           0
body           0
home.dest      0
dtype: int64
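
Note that `isnull()` only counts true missing values; a placeholder such as `?` stored as text is not counted. One option, sketched below on a small in-memory CSV (the rows are illustrative), is to tell `read_csv` which strings to treat as missing through its `na_values` parameter.

```python
import io
import pandas as pd

csv_text = """name,age,cabin
Allen,29,B5
Zabour,?,?
Zakarian,27,?
"""

# Read once as-is: '?' is kept as ordinary text.
as_text = pd.read_csv(io.StringIO(csv_text))
print(as_text.isnull().sum())

# Read again, declaring '?' as a missing-value marker.
as_missing = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(as_missing.isnull().sum())
print(as_missing.dtypes)
```

With the placeholder declared up front, a column like `age` is parsed as numeric directly, so a separate conversion step may not be needed.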

#### Question:
Why do you believe missing values are bad for AI? 

(Double click here) Answer: 


## Renaming and Combining 


### Renaming

In [41]:
# Replace the '?' placeholders with None so pandas treats them as missing
titanic_data = titanic_data.replace({'?': None})
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.55,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,,328,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.225,,C,,304,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.225,,C,,,


In [42]:
titanic_data.rename(columns={'embarked': 'country'})

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,country,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.55,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,,328,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.225,,C,,304,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.225,,C,,,
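
`rename` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A brief sketch on a made-up frame (the new name `port` is just an illustrative choice):

```python
import pandas as pd

toy = pd.DataFrame({"embarked": ["S", "C"], "fare": [7.25, 80.0]})

# The call returns a renamed copy; 'toy' itself still has 'embarked'.
renamed = toy.rename(columns={"embarked": "port"})
print(list(toy.columns))
print(list(renamed.columns))

# Index labels can be renamed the same way.
relabeled = renamed.rename(index={0: "first", 1: "second"})
print(relabeled)
```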


### Combining

In [43]:
# Combining 
# joins, concat, 
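
As a minimal sketch of the two approaches named in the comment above, `pd.concat` stacks frames that share columns, while `merge` joins frames on a key column. The frames and the `port_name` lookup table are made up for illustration.

```python
import pandas as pd

first_class = pd.DataFrame({"name": ["Allen", "Icard"], "embarked": ["S", "C"]})
third_class = pd.DataFrame({"name": ["Zabour"], "embarked": ["C"]})

# concat: stack rows; ignore_index=True rebuilds a clean 0..n-1 index.
passengers = pd.concat([first_class, third_class], ignore_index=True)

# Lookup table mapping port codes to port names.
ports = pd.DataFrame({
    "embarked": ["C", "Q", "S"],
    "port_name": ["Cherbourg", "Queenstown", "Southampton"],
})

# merge: a left join keeps every passenger and adds the matching port name.
combined = passengers.merge(ports, on="embarked", how="left")
print(combined)
```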

## Let's now build our first machine learning model

Now that we have looked at our dataset, let's use it for machine learning.

In [44]:
# Separate the target from the dataset
target = titanic_data['survived']
features_raw = titanic_data.drop('survived', axis = 1)

In [45]:
# Preprocess the data: one-hot encode text columns and fill missing values with 0
features = pd.get_dummies(features_raw)
features = features.fillna(0.0)
features.head()

Unnamed: 0,pclass,age,sibsp,parch,"name_Abbing, Mr. Anthony","name_Abbott, Master. Eugene Joseph","name_Abbott, Mr. Rossmore Edward","name_Abbott, Mrs. Stanton (Rosa Hunt)","name_Abelseth, Miss. Karen Marie","name_Abelseth, Mr. Olaus Jorgensen",...,"home.dest_Wimbledon Park, London / Hayling Island, Hants","home.dest_Windsor, England New York, NY","home.dest_Winnipeg, MB","home.dest_Winnipeg, MN","home.dest_Woodford County, KY","home.dest_Worcester, England","home.dest_Worcester, MA","home.dest_Yoevil, England / Cottage Grove, OR","home.dest_Youngstown, OH","home.dest_Zurich, Switzerland"
0,1,29.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0.9167,1,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,2.0,1,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,30.0,1,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,25.0,1,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
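
`pd.get_dummies` replaces each text column with one indicator column per distinct value, which is why the frame above has so many `name_...` and `home.dest_...` columns. A small sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "pclass": [1, 3, 3],
    "sex": ["female", "male", "male"],
    "embarked": ["S", "C", None],
})

# Numeric columns pass through; each text column expands into indicators.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# A missing value gets all-zero indicators unless dummy_na=True is passed.
print(encoded.loc[2, ["embarked_C", "embarked_S"]].astype(int).tolist())
```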


In [46]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Train the model

# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# Define the classifier, and fit it to the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier()

In [47]:
#test model

# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 1.0
The test accuracy is 0.950381679389313
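
A training accuracy of 1.0 usually means the tree has memorized the training rows, and the high test score deserves a second look: in this dataset `boat` (lifeboat) and `body` (body identification number) are only known after the sinking, so they effectively reveal the outcome. The sketch below, on a made-up frame, shows one possible way to drop such columns and near-unique identifiers before encoding. Which columns to keep is a modelling choice, not something this lesson prescribes.

```python
import pandas as pd

toy = pd.DataFrame({
    "pclass": [1, 3, 3, 2],
    "sex": ["female", "male", "male", "female"],
    "name": ["Allen", "Zabour", "Zakarian", "Renouf"],
    "boat": ["2", None, None, "12"],
    "body": [None, "328", "304", None],
    "survived": [1, 0, 0, 1],
})

# Columns recorded after the event (boat, body) leak the target; name is
# close to a unique identifier and adds one indicator column per passenger.
leaky_or_id = ["boat", "body", "name"]

target = toy["survived"]
features = pd.get_dummies(toy.drop(columns=["survived"] + leaky_or_id))
print(list(features.columns))
```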
