# Introduction to Pandas II
Welcome to Pandas II. In this lesson we will be covering:
- **Grouping and Sorting Data**
- **Data Types and Missing Values in Data**
- **Renaming and Combining Data** 

We will also create our first **Machine Learning Model**.

The lab for Lesson 4 consists of all the exercises that you will find throughout the notebook.

For this lesson we will again be using the Titanic Survival Dataset from Kaggle. We will perform various transformations, edits and exploration.

Let's review the column values once more as a reminder of the data we are using:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `?`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `?`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
- **Home.Dest**: Home / Destination


In [None]:
import pandas as pd
import numpy as np

Let's read in the data set. We performed this in our previous lab, now give it a try:

In [None]:
# EXERCISE 1
# Read in the titanic survival dataset (titanic_data.csv)


## Grouping and Sorting 

### Groupwise Analysis
The `groupby()` method allows us to group our data. Depending on the input given, `groupby()` can also be combined with summary methods such as `count()`, `mean()`, and others.

In [None]:
# Grouping 
titanic_data.groupby('embarked').embarked.count()


One thing to note from the above grouping is the result. We grouped by the "embarked" column and then asked for the count of only the "embarked" column. The groups we got were ?, C, Q and S. Just as a thought, does it make sense to have ? as a value? Let's continue and we shall find out.
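
To see what `groupby()` does in isolation, here is a minimal sketch on a small made-up DataFrame. The `toy` frame, its column names and its values are invented for illustration and are not part of the Titanic dataset. It groups the rows by one column and then averages a different column within each group:

```python
import pandas as pd

# Hypothetical toy data, not taken from titanic_data.csv
toy = pd.DataFrame({
    'port': ['S', 'C', 'S', 'Q', 'C', 'S'],
    'fare': [10.0, 30.0, 20.0, 8.0, 50.0, 30.0],
})

# Split the rows by 'port', then take the mean fare inside each group
mean_fare_by_port = toy.groupby('port')['fare'].mean()
print(mean_fare_by_port)
```

Each distinct value of the grouping column becomes one row of the result, which is why the Titanic grouping above returns one row per port.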

In [None]:
# EXERCISE 2

# Run the cell below, and then answer the question.
titanic_data.groupby('embarked').count()

In [None]:
# EXERCISE 3

# Describe the difference between the first `groupby()` and the second one we executed in Exercise 2.
# Double click the cell below to type your answer

#### Answer: 
(type answer here)


We can also use the `agg()` method to aggregate values all at once. Let's use it below to get the length, minimum value and maximum value.

In [None]:
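# 'size' is the number of rows per group (its length); 'min' and 'max' are pandas' own aggregations
titanic_data.groupby(['sibsp']).survived.agg(['size', 'min', 'max'])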

Interesting fact about the data above:  
The table suggests that passengers with the most siblings / spouses on board (the largest *sibsp* values) did not survive: for those groups even the maximum of *survived* is 0.
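
As a rough illustration of how to read such a table, the sketch below runs the same kind of aggregation on a tiny invented DataFrame (the `toy` values are hypothetical and chosen only to make the pattern visible). The aggregations are named with strings here, which is one of several ways `agg()` accepts them:

```python
import pandas as pd

# Hypothetical toy data, not taken from titanic_data.csv
toy = pd.DataFrame({
    'sibsp':    [0, 0, 0, 1, 1, 5, 5],
    'survived': [1, 0, 1, 1, 0, 0, 0],
})

# One row per sibsp value: group size, smallest and largest 'survived' value
summary = toy.groupby('sibsp')['survived'].agg(['size', 'min', 'max'])
print(summary)
```

Because *survived* only takes the values 0 and 1, a group whose `max` is 0 contains no survivors, and a group whose `min` is 1 contains only survivors.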

In [None]:
# EXERCISE 4
# Group the data by sibsp and find the sum, min and max values using the agg() method


### Sorting
Sorting is an extremely valuable tool. It keeps our data organized and gives users of the dataset better control over their data. Let's sort our data below, using the "embarked" column as our reference.

In [None]:
# Sorting 
titanic_data.sort_values(by='embarked')

As we can see, the embarked column is now sorted in the order ?, C, Q, S. The ? is first because it comes before the uppercase letters in the ASCII table of characters (http://www.asciitable.com/).
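
The ordering can be checked directly. The short sketch below prints the character code of each value with Python's built-in `ord()` and then sorts a small made-up Series (the `ports` Series is hypothetical, not the real column):

```python
import pandas as pd

# The character code of each value decides its position in an ascending sort
for ch in ['?', 'C', 'Q', 'S']:
    print(ch, ord(ch))  # ? 63, C 67, Q 81, S 83

# Hypothetical toy Series with the same four values in a shuffled order
ports = pd.Series(['S', '?', 'Q', 'C'])
sorted_ports = ports.sort_values().tolist()
print(sorted_ports)  # ['?', 'C', 'Q', 'S']
```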

The values in the code above were sorted in ascending order (the default), but we can also sort in descending order as shown below:

In [None]:
titanic_data.sort_values(by='embarked', ascending=False)

In [None]:
# EXERCISE 5

# Look at the code in the cell below and try to describe what is occurring:
# Double click the cell below to type your answer

#### Answer: 
(type answer here)

In [None]:
titanic_data['last_name'] = titanic_data['name'].str.split().str[0].replace(',','',regex=True)
titanic_data['first_name'] = titanic_data['name'].str.split().str[2].replace(',','',regex=True)


Note that we have used the code above in a previous lesson. It breaks the name column into first and last names and creates new columns at the end of the DataFrame.  
We will now use the newly created *last_name* and *first_name* columns to sort two columns at once:

In [None]:
titanic_data.sort_values(by = ['last_name','first_name'])

In [None]:
# EXERCISE 6
# Sort the data by the fare and age columns and print the result


## Data Types and Missing Values


### Dtypes
As covered in our previous lessons, a data type describes what kind of value is being stored, such as numbers (*int*, *float*) or text (*str*). It is important to know which data types you are working with because at times you will need to alter, adjust, or replace values in your data. When you do, make sure the new values match the data type of the column that you are changing.

Let's take a look at how to find the data type of our columns from the titanic dataset:

In [None]:
#dtypes, types of data
titanic_data.dtypes

In [None]:
# EXERCISE 7
# In the example above, we found the data type for all the columns. For this exercise,
#  find the data type of just one column from the titanic dataset


Now that we know how to check data types, let's try changing data types. In the example below we will be converting the *age* column from object (text, i.e. *str*) to numeric (*float*).


In [None]:
# EXERCISE 8

# Why would we want to convert the age value from str to numeric?   
# Double click the cell below to type your answer

#### Answer: 
(type answer here)

From the data types listed above, we can see that the *age* column is an *object*. However, we want it to be numeric.  
Let's convert the *age* column from *object* to *numeric*:

In [None]:
titanic_data['age'] = pd.to_numeric(titanic_data['age'], errors = 'coerce')
titanic_data

We can verify the data type of the *age* column by using `dtype`:

In [None]:
titanic_data.age.dtype

As you can see, we were able to successfully convert our age data from *object* to *numeric*.
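
The `errors = 'coerce'` argument is what lets the conversion succeed even though some entries are `?`. Here is a minimal sketch on a made-up Series (the `raw_ages` values are hypothetical, and the Series is created with `dtype=object` on purpose to mimic how the column looks in this lesson):

```python
import pandas as pd

# Hypothetical raw values stored as text, including one '?' placeholder
raw_ages = pd.Series(['29', '?', '0.92', '45'], dtype=object)

# Entries that cannot be parsed as numbers become NaN instead of raising an error
clean_ages = pd.to_numeric(raw_ages, errors='coerce')
print(clean_ages)
print(clean_ages.dtype)
```

Without `errors='coerce'`, the default behaviour of `pd.to_numeric()` is to raise an error as soon as it meets a value such as `?` that it cannot parse.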

In [None]:
# EXERCISE 9

# Convert the fare column into a numeric value and print the new data type for the column


### Missing Values 
Missing values are displayed as *NaN*. *NaN* (Not a Number) marks an entry that does not have any data. It is important to know whether your data has missing values because your AI model will only be as good as the data that you are working with.

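Our raw data marks unknown entries with `?` rather than leaving them empty, so pandas does not count them as missing at first. Converting the *age* column with `errors = 'coerce'` above did turn its `?` entries into *NaN*, though, so let's explore how to find missing values.
To do this, we can use the `isnull()` method as shown below: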

In [None]:
titanic_data.isnull()

The DataFrame above contains `True` wherever a value is missing and `False` everywhere else. Manually counting the `True` values would be extremely cumbersome and time consuming. However, we can use some of the tools we've learned so far and combine them to make this easier.

In [None]:
# This line of code uses isnull() to find the missing values and sums them up per column
titanic_data.isnull().sum()

In [None]:
# EXERCISE 10

# Do you believe missing values are bad for AI? Why or why not?
# Double click the cell below to type your answer

#### Answer: 
(type answer here)

## Renaming and Combining 


### Renaming
Most of the time when you use a dataset, the columns, indexes, or values will contain entries that we cannot use or that are not beneficial to use. In this case, our data contains some *'?'* values throughout. We can use the `replace()` method to change those values, and the `rename()` method to change column names.

In [None]:
# Note the ? values in the cabin and boat columns
titanic_data

In [None]:
# Replacing the '?' placeholder values
titanic_data = titanic_data.replace({'?': None})
titanic_data

As you can see above, the values that were *'?'* are now None. We could also have used NaN; either value allows us to mark the entries as missing. Now let's rename the embarked column to country to make it easier to interpret.

In [None]:
titanic_data = titanic_data.rename(columns={'embarked': 'country'})
titanic_data

In [None]:
# EXERCISE 11
# Find and replace the missing values in the new "country" column. 
#  To verify, run the cell below. There should be no True values in the result. 

# HINT: use the fillna() method. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html


In [None]:
titanic_data.country.isnull().value_counts()

## Building our First Machine Learning Model

Now that we have looked at our dataset, let's use it for machine learning. For this model, just follow along and try to understand what is happening. We will discuss all these concepts in more detail as the lessons progress.

The first step is for us to select a column we would like to try and predict a value for. In this case we will try to predict who will survive, so we will choose the *survived* column as our target and the rest of our columns as our features. 

We will learn more about what feature and target variables are in upcoming lessons. 

In [None]:
# Separate the target from the rest of the dataset
target = titanic_data['survived']
features_raw = titanic_data.drop('survived', axis = 1)

We will now perform some basic steps to pre-process our data before passing it into a machine learning model. Computers only understand numbers; they do not understand words (e.g., cat, dog, goldfish). Instead, we need to convert text values into numerical values. In a future lesson we will go over specific methods to do this.
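
As a small preview of what such a conversion can look like, the sketch below applies `pd.get_dummies()` to a made-up DataFrame (the `toy` frame is hypothetical and far smaller than the real features). Each distinct text value becomes its own indicator column, while numeric columns are left unchanged:

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female'],
    'fare': [7.25, 71.28, 8.05],
})

# Text columns are expanded into one indicator column per distinct value
encoded = pd.get_dummies(toy)
print(encoded)
```

A column with many distinct text values, such as a passenger name, produces one new column per value, so the encoded table can become much wider than the original.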

In [None]:
# preprocess data
features = pd.get_dummies(features_raw)
features = features.fillna(0.0)

In order to train our machine learning model, we will use the **sklearn** (scikit-learn) library. The **sklearn** library contains various statistical and machine learning models that we can use. It also contains useful tools to aid in the machine learning process.  
In the example below, we will import **train_test_split** from **sklearn.model_selection** and **DecisionTreeClassifier** from **sklearn.tree**.  

**train_test_split** is a tool used to split data into *'training'* and *'test'* sets, so that we can train our machine learning model on one part and evaluate it on the other.  
**DecisionTreeClassifier** is a machine learning model.
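
To make the split concrete, here is a minimal sketch on ten made-up samples (the `toy_` arrays are hypothetical and only stand in for real features and targets). With `test_size=0.2`, two of the ten samples are held out for testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and 10 labels
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

# Hold out 20% of the samples; random_state makes the shuffle repeatable
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)
print(toy_X_train.shape, toy_X_test.shape)  # (8, 2) (2, 2)
```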

In [None]:
# Import the tool that splits the data
from sklearn.model_selection import train_test_split

# Split the data into train and test. 
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Train the model

# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# Define the classifier, and fit it to the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

In [None]:
# EXERCISE 12

# Why do you believe we need to split the data into a training and a test set, 
#  before we pass it into a machine learning model?


In the last two lines of the training cell above, we used our training data to train our model. We will now take our trained model and use it to predict values for both our training set and our test set. We will then compare the predictions with the true values and summarize the comparison as an accuracy score.
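
Accuracy is simply the fraction of predictions that match the true values. The sketch below checks this on five made-up labels (the `y_true` and `y_pred_toy` arrays are hypothetical), computing the score once by hand and once with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions; 4 of the 5 predictions are correct
y_true = np.array([1, 0, 1, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0])

# Fraction of positions where the prediction equals the true label
manual_accuracy = (y_true == y_pred_toy).mean()
library_accuracy = accuracy_score(y_true, y_pred_toy)
print(manual_accuracy, library_accuracy)  # 0.8 0.8
```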

In [None]:
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is {}%'.format(train_accuracy*100))
print('The test accuracy is {}%'.format(test_accuracy * 100))

In [None]:
# EXERCISE 13

# Do you believe the machine learning model performed well, or is it performing too well?
# Double click the cell below to type your answer

#### Answer: 
(type answer here)