# Introduction to Pandas II
Welcome to Pandas II. In this lesson we will cover:
- **Grouping and Sorting Data**
- **Data Types and Missing Values in Data**
- **Renaming and Combining Data** 

We will also create our first **Machine Learning Model**.

The lab for Lesson 4 consists of all the exercises that you will find throughout the notebook. 

For this lesson we will again be using the Titanic Survival Dataset from Kaggle. We will perform various transformations, edits, and explorations. 

Let's review the column values once more as a reminder of the data we are using:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `?`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `?`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
- **Home.Dest**: Home / Destination


In [3]:
import pandas as pd
import numpy as np

Let's read in the data set. We did this in our previous lab; now give it a try:

In [4]:
# EXERCISE 1
# Read in the titanic survival dataset (titanic_data.csv)
titanic_data = pd.read_csv('titanic_data.csv')
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,?,C,?,328,?
1305,3,0,"Zabour, Miss. Thamine",female,?,1,0,2665,14.4542,?,C,?,?,?
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,?,C,?,304,?
1307,3,0,"Zakarian, Mr. Ortin",male,27,0,0,2670,7.225,?,C,?,?,?


## Grouping and Sorting 

### Groupwise Analysis
The `groupby()` method allows us to group our data. Depending on the input given, `groupby()` can also be combined with summary methods such as `count()`, `mean()`, and others. 

In [5]:
# Grouping 
titanic_data.groupby('embarked').embarked.count()


embarked
?      2
C    270
Q    123
S    914
Name: embarked, dtype: int64

One thing to note from the above grouping is our results. We grouped by the column "embarked" and then asked for only the count of the "embarked" column. The results we got were ?, C, Q and S. Just as a thought, does it make sense to have ? as a value? Let's continue and we shall find out.

In [6]:
# EXERCISE 2

# Run the cell below, and then answer the question.
titanic_data.groupby('embarked').count()
# It does not make sense, because from my understanding a question mark in the "embarked" column means that the port of
# embarkation was not recorded for those people. The '?' is a placeholder for a missing value, not a real port.

embarked,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,boat,body,home.dest
?,2,2,2,2,2,2,2,2,2,2,2,2,2
C,270,270,270,270,270,270,270,270,270,270,270,270,270
Q,123,123,123,123,123,123,123,123,123,123,123,123,123
S,914,914,914,914,914,914,914,914,914,914,914,914,914


In [7]:
# EXERCISE 3

# Describe the difference between the first `groupby()` vs the second one we executed in Exercise 2.
# Double click the cell below to type your answer

#### Answer: 
The first `groupby()` selected only the "embarked" column before counting, so it returned a single Series with one count per port. The second called `count()` on the whole grouped DataFrame, so it returned a table with the count of every other column for each port. 
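
The difference between the two calls can be sketched on a small, made-up DataFrame. The `toy` table and its values below are assumptions for illustration only, not rows from the Titanic file. The sketch also shows that `count()` skips missing values, which is why per-column counts can differ within a group.

```python
import pandas as pd

# Hypothetical toy data standing in for a few Titanic columns
toy = pd.DataFrame({
    "embarked": ["S", "C", "S", "Q", "S", "C"],
    "age": [29.0, None, 30.0, 25.0, None, 40.0],
    "fare": [211.3, 151.5, 151.5, 7.2, 8.0, 80.0],
})

# Selecting one column before count() returns a Series
per_port_series = toy.groupby("embarked").embarked.count()

# Calling count() on the whole groupby returns a DataFrame:
# one row per group, one column for every non-grouping column
per_port_frame = toy.groupby("embarked").count()

print(type(per_port_series).__name__)  # Series
print(type(per_port_frame).__name__)   # DataFrame
print(per_port_frame)
```

In this toy table the "S" group has three rows but only two non-missing ages, so the `age` count for "S" is 2 while the `fare` count is 3.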

We can also use the `agg()` method to compute several aggregations at once. Let's use it below to get the length, minimum value and maximum value. 

In [8]:
titanic_data.groupby(['sibsp']).survived.agg([len,min,max])

sibsp,len,min,max
0,891,0,1
1,319,0,1
2,42,0,1
3,20,0,1
4,22,0,1
5,6,0,0
8,9,0,0


Interesting fact about the data above:  
The table above shows that no passenger travelling with 5 or 8 siblings / spouses (sibsp) survived: the maximum of *survived* is 0 for those groups. 

In [9]:
# EXERCISE 4
# Group the data by sibsp and find the sum, min and max values for the survived column using the agg() method
titanic_data.groupby(['sibsp']).survived.agg([sum,min,max])

sibsp,sum,min,max
0,309,0,1
1,163,0,1
2,19,0,1
3,6,0,1
4,3,0,1
5,0,0,0
8,0,0,0
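
As a complement to the `agg()` calls above, here is a minimal sketch that passes aggregation names as strings and adds the mean, which for a 0/1 column is the survival rate of each group. The `toy` data is made up for illustration. String names are used because some recent pandas versions may emit a warning when Python built-ins such as `sum` are passed directly; treat that as an assumption to check against your installed version.

```python
import pandas as pd

# Hypothetical toy data: sibsp group and 0/1 survival flag
toy = pd.DataFrame({
    "sibsp":    [0, 0, 0, 1, 1, 5, 5],
    "survived": [1, 0, 1, 1, 0, 0, 0],
})

# Aggregations given by name; "mean" of a 0/1 column is the survival rate
summary = toy.groupby("sibsp").survived.agg(["count", "sum", "mean", "max"])
print(summary)
```

A group whose `max` is 0 contains no survivors at all, which is the pattern seen for sibsp values 5 and 8 in the real data.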


### Sorting
Sorting is an extremely valuable tool. It keeps our data organized and gives the user of the dataset better control over their data. Let's sort our data below, using the "embarked" column as our reference. 

In [10]:
# Sorting 
titanic_data.sort_values(by='embarked')

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
168,1,1,"Icard, Miss. Amelie",female,38,0,0,113572,80,B28,?,6,?,?
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62,0,0,113572,80,B28,?,6,?,"Cincinatti, OH"
953,3,1,"Leeni, Mr. Fahim ('Philip Zenni')",male,22,0,0,2620,7.225,?,C,6,?,?
531,2,0,"Pernot, Mr. Rene",male,?,0,0,SC/PARIS 2131,15.05,?,C,?,?,?
538,2,1,"Portaluppi, Mr. Emilio Ilario Giuseppe",male,30,0,0,C.A. 34644,12.7375,?,C,14,?,"Milford, NH"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
543,2,0,"Reeves, Mr. David",male,36,0,0,C.A. 17248,10.5,?,S,?,?,"Brighton, Sussex"
544,2,0,"Renouf, Mr. Peter Henry",male,34,1,0,31027,21,?,S,12,?,"Elizabeth, NJ"
545,2,1,"Renouf, Mrs. Peter Henry (Lillian Jefferys)",female,30,3,0,31027,21,?,S,?,?,"Elizabeth, NJ"
528,2,0,"Parkes, Mr. Francis 'Frank'",male,?,0,0,239853,0,?,S,?,?,Belfast


As we can see now, the embarked column has been sorted to show ?, C, Q, and S. ? is first because it comes before the uppercase letters in the ASCII table of characters (http://www.asciitable.com/). 

The values in the code above were sorted in ascending order (the default), but we can also sort in descending order as shown below:

In [11]:
titanic_data.sort_values(by='embarked', ascending=False)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
784,3,0,"Dyker, Mr. Adolf Fredrik",male,23,1,0,347072,13.9,?,S,?,?,"West Haven, CT"
794,3,1,"Emanuel, Miss. Virginia Ethel",female,5,0,0,364516,12.475,?,S,13,?,"New York, NY"
793,3,0,"Elsbury, Mr. William James",male,47,0,0,A/5 3902,7.25,?,S,?,?,"Illinois, USA"
788,3,0,"Ekstrom, Mr. Johan",male,45,0,0,347061,6.975,?,S,?,?,"Effington Rut, SD"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
953,3,1,"Leeni, Mr. Fahim ('Philip Zenni')",male,22,0,0,2620,7.225,?,C,6,?,?
243,1,0,"Rosenshine, Mr. George ('Mr George Thorne')",male,46,0,0,PC 17585,79.2,?,C,?,16,"New York, NY"
654,3,0,"Baccos, Mr. Raffull",male,20,0,0,2679,7.225,?,C,?,?,?
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62,0,0,113572,80,B28,?,6,?,"Cincinatti, OH"
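
The ordering seen in both sorts can be checked on a short, made-up Series. This is only a sketch: `ports` is a hypothetical stand-in for the embarked column.

```python
import pandas as pd

# Hypothetical stand-in for the embarked column
ports = pd.Series(["S", "?", "C", "Q", "S"])

# '?' has a smaller character code than any uppercase letter
print(ord("?"), ord("C"), ord("Q"), ord("S"))  # 63 67 81 83

ascending = ports.sort_values().tolist()
descending = ports.sort_values(ascending=False).tolist()

print(ascending)   # ['?', 'C', 'Q', 'S', 'S']
print(descending)  # ['S', 'S', 'Q', 'C', '?']
```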


In [12]:
# EXERCISE 5

# Look at the code in the cell below (titanic_data['last_name'] and titanic_data['first_name']) and 
#  try to describe what is occurring:

# Double click the cell below to type your answer

#### Answer: 
The code splits the name column on whitespace: the first word (with the comma removed) becomes a new last_name column and the third word becomes a new first_name column, both added at the end of the DataFrame. 

In [13]:
titanic_data['last_name'] = titanic_data['name'].str.split().str[0].replace(',','',regex=True)
titanic_data['first_name'] = titanic_data['name'].str.split().str[2].replace(',','',regex=True)

Note that we have used the code above in a previous lesson. It breaks the name column into first and last names and creates new columns at the end of the DataFrame.  
We will now use the newly created *last_name* and *first_name* columns to sort two columns at once:

In [14]:
titanic_data.sort_values(by = ['last_name','first_name'])

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
600,3,0,"Abbing, Mr. Anthony",male,42,0,0,C.A. 5547,7.55,?,S,?,?,?,Abbing,Anthony
601,3,0,"Abbott, Master. Eugene Joseph",male,13,0,2,C.A. 2673,20.25,?,S,?,?,"East Providence, RI",Abbott,Eugene
602,3,0,"Abbott, Mr. Rossmore Edward",male,16,1,1,C.A. 2673,20.25,?,S,?,190,"East Providence, RI",Abbott,Rossmore
603,3,1,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35,1,1,C.A. 2673,20.25,?,S,A,?,"East Providence, RI",Abbott,Stanton
604,3,1,"Abelseth, Miss. Karen Marie",female,16,0,0,348125,7.65,?,S,16,?,"Norway Los Angeles, CA",Abelseth,Karen
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
392,2,1,"del Carlo, Mrs. Sebastiano (Argenia Genovesi)",female,24,1,0,SC/PARIS 2167,27.7208,?,C,12,?,"Lucca, Italy / California",del,Mrs.
1262,3,0,"van Billiard, Master. James William",male,?,1,1,A/5. 851,14.5,?,S,?,?,?,van,Master.
1263,3,0,"van Billiard, Master. Walter John",male,11.5,1,1,A/5. 851,14.5,?,S,?,1,?,van,Master.
1264,3,0,"van Billiard, Mr. Austin Blyler",male,40.5,0,2,A/5. 851,14.5,?,S,?,255,?,van,Mr.
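
The last rows above show a limitation of splitting on whitespace: surnames with more than one word ("del Carlo", "van Billiard") are cut to their first word, and the title lands in first_name. One possible alternative, sketched here under the assumption that names follow the "Last, Title. First ..." layout, is to extract the parts with a regular expression. The pattern and the `title` column are illustrative choices, not part of the original lesson.

```python
import pandas as pd

# A few names copied from the outputs above
names = pd.Series([
    "Allen, Miss. Elisabeth Walton",
    "del Carlo, Mrs. Sebastiano (Argenia Genovesi)",
    "van Billiard, Master. James William",
])

# Assumed layout: everything before the comma is the last name,
# the text up to the next period is the title,
# and the next word is the first name
pattern = r"^(?P<last_name>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<first_name>\S+)"
parts = names.str.extract(pattern)
print(parts)
```

Names that do not fit the assumed layout would produce missing values or odd splits, so the result should be inspected before it is used for sorting.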


In [15]:
# EXERCISE 6
# Sort the fare and age columns and print
titanic_data.sort_values(by = ['fare','age'])

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
898,3,0,"Johnson, Mr. William Cahoone Jr",male,19,0,0,LINE,0,?,S,?,?,?,Johnson,William
1254,3,1,"Tornquist, Mr. William Henry",male,25,0,0,LINE,0,?,S,15,?,?,Tornquist,William
963,3,0,"Leonard, Mr. Lionel",male,36,0,0,LINE,0,?,S,?,?,?,Leonard,Lionel
234,1,0,"Reuchlin, Jonkheer. John George",male,38,0,0,19972,0,?,S,?,?,"Rotterdam, Netherlands",Reuchlin,John
7,1,0,"Andrews, Mr. Thomas Jr",male,39,0,0,112050,0,A36,S,?,?,"Belfast, NI",Andrews,Thomas
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
225,1,0,"Payne, Mr. Vivian Ponsonby",male,23,0,0,12749,93.5,B24,S,?,?,"Montreal, PQ",Payne,Vivian
230,1,1,"Perreault, Miss. Anne",female,30,0,0,12749,93.5,B73,S,3,?,?,Perreault,Anne
155,1,1,"Hays, Mrs. Charles Melville (Clara Jennings Gr...",female,52,1,1,12749,93.5,B69,S,3,?,"Montreal, PQ",Hays,Charles
154,1,0,"Hays, Mr. Charles Melville",male,55,1,1,12749,93.5,B69,S,?,307,"Montreal, PQ",Hays,Charles
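
Notice that the sorted output above ends at a fare of 93.5. At this point in the notebook the fare and age columns are still stored as text (object), so they are sorted character by character rather than by numeric value. The sketch below uses a few made-up fare strings to show the difference. `fares_as_text` is a hypothetical example, not data read from the file.

```python
import pandas as pd

# Hypothetical fares stored as text, as they are before conversion
fares_as_text = pd.Series(["7.25", "93.5", "512.3292", "0"])

# Text sort compares character by character: "5..." comes before "7..." and "9..."
text_order = fares_as_text.sort_values().tolist()

# Numeric sort compares the actual values
numeric_order = pd.to_numeric(fares_as_text).sort_values().tolist()

print(text_order)     # ['0', '512.3292', '7.25', '93.5']
print(numeric_order)  # [0.0, 7.25, 93.5, 512.3292]
```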


## Data Types and Missing Values 


### Dtypes
As covered in our previous lessons, a data type (dtype) describes the kind of value a column stores: numbers are stored as integer or floating-point types such as *int64* and *float64*, while text is stored in pandas as *object* (Python *str*). It is important to know what data types you are working with, because at times you will need to alter, adjust, or replace values in your data. When you do, make sure the new values match the data type of the column you are changing. 

Let's take a look at how to find the data type of our columns from the titanic dataset:

In [16]:
#dtypes, types of data
titanic_data.dtypes

pclass         int64
survived       int64
name          object
sex           object
age           object
sibsp          int64
parch          int64
ticket        object
fare          object
cabin         object
embarked      object
boat          object
body          object
home.dest     object
last_name     object
first_name    object
dtype: object

In [17]:
# EXERCISE 7
# In the example above, we found the data type for all the columns. For this exercise,
#  find the data type of just one column from the titanic dataset
titanic_data.cabin.dtypes

dtype('O')

Now that we know how to check data types, let's try changing data types. In the example below we will be converting the *age* column from object (*str or char*) to numeric (*float*). 


In [18]:
# EXERCISE 8

# Why would we want to convert the age value from str to numeric?   
# Double click the cell below to type your answer

#### Answer:
It is better to convert age values from string to numeric because numeric values sort in numerical order and support calculations such as the mean. Strings sort character by character, so they cannot be ordered numerically without first converting them to a numeric data type. 

From the data types listed above, we can see that the *age* column is an *object*. However, we want it to be numeric.  
Let's convert the *age* column from *object* to *numeric*:

In [19]:
titanic_data['age'] = pd.to_numeric(titanic_data['age'], errors = 'coerce')
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO",Allen,Elisabeth
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON",Allison,Hudson
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison,Helen
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON",Allison,Hudson
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison,Hudson
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,?,C,?,328,?,Zabour,Hileni
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,?,C,?,?,?,Zabour,Thamine
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.225,?,C,?,304,?,Zakarian,Mapriededer
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.225,?,C,?,?,?,Zakarian,Ortin


We can verify the data type of the *age* column by using `dtype`:

In [20]:
titanic_data.age.dtype

dtype('float64')

As you can see, we were able to successfully convert our age data from *object* to *numeric*.
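
To see what `errors = 'coerce'` does during the conversion, here is a minimal sketch on a made-up Series. `raw_age` is a hypothetical example. Any entry that cannot be parsed as a number, such as `?`, becomes NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical ages stored as text, including a '?' placeholder
raw_age = pd.Series(["29", "0.9167", "?", "30"])

# Unparseable entries are turned into NaN rather than raising an error
age = pd.to_numeric(raw_age, errors="coerce")

print(age.dtype)           # float64
print(age.isnull().sum())  # 1
```

This is also why the converted *age* column in the real data contains NaN values: every `?` was coerced to NaN.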

In [21]:
# EXERCISE 9

# Convert the fare column into a numeric value and print the new data type for the column
titanic_data['fare'] = pd.to_numeric(titanic_data['fare'], errors = 'coerce')
titanic_data.fare.dtype

dtype('float64')

### Missing Values 
Missing values are displayed as *NaN*. *NaN (Not a Number) values* are values that do not have any data. It is important to understand if your data has missing values because your AI model will only be as good as the data that you are working with.

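Our raw data marked missing entries with *'?'* instead of leaving them empty, so the only NaN values so far are the ones created when we converted *age* and *fare* to numeric. Let's explore how to find them. 
To do this, we can use the `isnull()` method as shown below: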

In [22]:
titanic_data.isnull()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1305,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
1306,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1307,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


The DataFrame above contains `True` wherever a value is missing and `False` everywhere else. Manually counting the `True` values would be extremely cumbersome and time consuming. However, we can use some of the tools we've learned so far and combine them to make this easier.

In [23]:
# This line of code uses isnull() to find the missing values and sums them up per column
titanic_data.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin           0
embarked        0
boat            0
body            0
home.dest       0
last_name       0
first_name      0
dtype: int64
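
The same pattern can be sketched on a small, made-up DataFrame. Because `True` counts as 1 and `False` as 0, `sum()` gives the number of missing values per column and `mean()` gives the share of missing values. The `toy` table is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "age":  [29.0, np.nan, 2.0, np.nan],
    "fare": [211.34, 151.55, np.nan, 7.25],
    "sex":  ["female", "male", "female", "male"],
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of missing values per column

print(missing_count)
print(missing_share)
```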

In [24]:
# EXERCISE 10

# Do you believe missing values are bad for AI? and why?
# Double click the cell below to type your answer

#### Answer: 
Yes, I do believe missing values are bad for AI because they hide part of the picture from the program or user. For example, if the price of gasoline was 90 cents two years ago and is 3 dollars today, what happened between those two points is not presented to the AI. The price may have dropped first, risen to 4 dollars, or grown at a linear rate. Because of the missing values, the program may jump to a false conclusion.

## Renaming and Combining 


### Renaming
Most datasets you work with will contain column names, index labels, or values that are unusable or unhelpful as given. In this case, our data contains some *'?'* values throughout. We can use the `replace()` method to change those values and the `rename()` method to change column names.

In [25]:
# note some of the ? in the cabin and boat columns
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO",Allen,Elisabeth
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON",Allison,Hudson
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison,Helen
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON",Allison,Hudson
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON",Allison,Hudson
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,?,C,?,328,?,Zabour,Hileni
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,?,C,?,?,?,Zabour,Thamine
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,?,C,?,304,?,Zakarian,Mapriededer
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,?,C,?,?,?,Zakarian,Ortin


In [26]:
# Replacing values: mark every '?' as missing
titanic_data = titanic_data.replace({'?': None})
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,last_name,first_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO",Allen,Elisabeth
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON",Allison,Hudson
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",Allison,Helen
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON",Allison,Hudson
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",Allison,Hudson
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,,328,,Zabour,Hileni
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,,Zabour,Thamine
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,,C,,304,,Zakarian,Mapriededer
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,,C,,,,Zakarian,Ortin


As you can see above, the values that were *'?'* are now None. We could also have used NaN; either value marks the entries as missing. Now let's rename the embarked column to country to make it easier to interpret.

In [27]:
titanic_data = titanic_data.rename(columns={'embarked': 'country'})
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,country,boat,body,home.dest,last_name,first_name
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO",Allen,Elisabeth
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON",Allison,Hudson
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",Allison,Helen
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON",Allison,Hudson
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",Allison,Hudson
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,,328,,Zabour,Hileni
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,,Zabour,Thamine
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,,C,,304,,Zakarian,Mapriededer
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,,C,,,,Zakarian,Ortin


In [41]:
# EXERCISE 11
# Find and replace the missing values in the new "country" column. 
#  To verify, run the cell below. The true values should be zero. 

# HINT: use the fillna() method. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
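# Assign the result back; calling fillna(..., inplace=True) on a selected column may not update the DataFrame in newer pandas versions
titanic_data["country"] = titanic_data["country"].fillna("None")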

In [42]:
titanic_data.country.isnull().value_counts()

False    1309
Name: country, dtype: int64
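
Putting the pieces of this section together, the sketch below shows one possible shortcut: telling `read_csv()` up front that `?` means "missing" through its `na_values` parameter, then filling the gaps with `fillna()`. The CSV text and the "Unknown" label are made up for illustration, and the approach assumes `?` never appears as a legitimate value.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO lets us read it like a file
csv_text = "name,embarked\nA,S\nB,?\nC,C\n"

# Treat '?' as a missing value while reading
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")
missing_before = toy["embarked"].isnull().sum()

# Replace the missing entries and assign the result back to the column
toy["embarked"] = toy["embarked"].fillna("Unknown")
missing_after = toy["embarked"].isnull().sum()

print(missing_before, missing_after)  # 1 0
print(toy["embarked"].tolist())       # ['S', 'Unknown', 'C']
```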

## Building our First Machine Learning Model

Now that we have looked at our dataset, let's use it for machine learning. For this model, just follow along and try to understand what is happening. We will discuss all these concepts in more detail as the lessons progress. 

The first step is for us to select a column we would like to try and predict a value for. In this case we will try to predict who will survive, so we will choose the *survived* column as our target and the rest of our columns as our features. 

We will learn more about what feature and target variables are in upcoming lessons. 

In [30]:
# Separate target from dataset
target = titanic_data['survived']
features_raw = titanic_data.drop('survived', axis = 1)

We will now perform some basic steps to pre-process our data before passing it into a machine learning model. Machine learning models work with numbers; they do not understand words (e.g., cat, dog, goldfish). We therefore need to convert string values into numerical values. In a future lesson we will go over specific methods to do this. 

In [31]:
# preprocess data
features = pd.get_dummies(features_raw)
features = features.fillna(0.0)
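
To make the `get_dummies()` step above more concrete, here is a minimal sketch on a made-up DataFrame. The `toy` table is an assumption for illustration. Numeric columns pass through unchanged, while each text column is replaced by one indicator column per distinct value.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one text column
toy = pd.DataFrame({
    "age": [29.0, 30.0, 25.0],
    "sex": ["female", "male", "female"],
})

# Text columns become one indicator column per distinct value
encoded = pd.get_dummies(toy)

print(list(encoded.columns))  # ['age', 'sex_female', 'sex_male']
print(encoded)
```

Because every distinct text value gets its own column, columns with many unique values (such as names or ticket numbers) produce a very wide table.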

In order to train our machine learning model, we will use the **sklearn** (scikit-learn) library. The **sklearn** library contains various statistical and machine learning models that we can use. It also contains useful tools to aid in the machine learning process.  
In the example below, we will import **train_test_split** from **sklearn.model_selection** and **DecisionTreeClassifier** from **sklearn.tree**.  

**train_test_split** is a tool used to split data into *'training'* and *'test'* sets: the model learns from the training set and is evaluated on the test set.  
**DecisionTreeClassifier** is a machine learning model.

In [32]:
# Import the splitting utility
from sklearn.model_selection import train_test_split

# Split the data into train and test. 
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

#train

# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# Define the classifier, and fit it to the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier()
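
The split performed above can be illustrated on a tiny, made-up dataset. In this sketch `X` and `y` are hypothetical arrays. With `test_size=0.2`, 20% of the rows go to the test set and the rest to the training set, and no row appears in both.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and a 0/1 target
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows; random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```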

In [33]:
# EXERCISE 12

# Why do you believe we need to split the data into a training and a test set, 
#  before we pass it into a machine learning model?

# I believe we need to split our data into a training and a test set because doing so allows the model to first train itself
# on the data it has been provided and, after deriving a pattern from that data, to be evaluated on test data
# that it has not seen before.

In the last two lines of the code cell above, we used our training data to train our model. We will now use the trained model to predict values for both the training set and the test set, then compare the predictions with the true values to compute an accuracy score. 

In [34]:
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is {}%'.format(train_accuracy*100))
print('The test accuracy is {}%'.format(test_accuracy * 100))

The training accuracy is 100.0%
The test accuracy is 95.0381679389313%
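
The accuracy values printed above are the fraction of predictions that match the true labels. A minimal sketch with made-up labels shows the arithmetic. `y_true` and `y_pred` are hypothetical arrays; with its default settings, `accuracy_score` returns this same fraction.

```python
import numpy as np

# Hypothetical true labels and model predictions
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# True where the prediction matches the label; 4 of the 5 match here
matches = (y_true == y_pred)

# The mean of a True/False array is the fraction of True values
accuracy = matches.mean()
print('The accuracy is {}%'.format(accuracy * 100))
```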


In [35]:
# EXERCISE 13

# Do you believe the machine learning model performed well? or is it performing too well? 
# Double click the cell below to type your answer

#### Answer: 
With the basic knowledge I have of machine learning models so far, I think it performed too well, because the training accuracy was at its maximum (100%), which suggests the model may have memorized the training data. If the accuracy had been around 98%, I would have said it performed just right. 