# Introduction
Every data science investigation needs an introduction. In this example project we're going to explore how to use a Decision Tree Classifier to predict whether or not passengers on the Titanic would have survived the disaster.

In your project, you'll choose a different set of data (or the same, but make your analysis different to this one!) and make your own predictions.

Note that this project assumes that you've already completed the [Intro to Machine Learning course](https://www.kaggle.com/learn/intro-to-machine-learning) on Kaggle and are familiar with Decision Tree Classifiers and Regressors. If you need to brush up on these, watch the following [YouTube playlist on different types of decision trees.](https://www.youtube.com/playlist?list=PLH8ela8ws3gvTRV0iZezkod_0egMCNwGq)

# Setting up
The code below contains the steps needed to set up our machine learning environment. Key features are described in the comments.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for data visualisation purposes
from sklearn.tree import DecisionTreeClassifier, plot_tree # Our model and a handy tool for visualising trees

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Gather and explore the data
Now that we've set things up, the next step is to add data to our project. You can do this by clicking the __Add data__ button in the top right and searching Kaggle for a data set that interests you.

It would be best to limit your search to data that:
- Is in .csv format
- Contains mostly numerical values
- May contain categorical values but with limited options such as male/female.
- Doesn't have many missing values
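
A quick way to check a candidate data set against this list is to look at how much of each column is missing and how many options each categorical column has. The sketch below uses a small made-up DataFrame (the `candidate` data and its column names are assumptions for illustration only); in your notebook you would load your own file with `pd.read_csv` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a data set you are considering
candidate = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "male"],
    "cabin": [np.nan, "C85", np.nan, np.nan],
})

# Fraction of missing values per column (0.0 means the column is complete)
missing_fraction = candidate.isna().mean()
print(missing_fraction)

# Number of distinct options in each non-numerical column
categorical = candidate.select_dtypes(exclude="number")
options_per_column = categorical.nunique()
print(options_per_column)
```

Columns with a high missing fraction, or categorical columns with a very large number of options, are the ones most likely to cause trouble later.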

Once you've found the right data set, follow the prompts to add it to your notebook. It should appear under the _input_ folder icon as you see it in the top right of this notebook. For this example, we've chosen data about passengers on the Titanic.

Now that we have a file containing data, let's get it into a Pandas DataFrame and take a peek.

In [None]:
train_file_path = '../input/titanic/train.csv'

# Create a new Pandas DataFrame with our training data
titanic_train_data = pd.read_csv(train_file_path)

#titanic_train_data.columns
titanic_train_data.describe(include='all')
#titanic_train_data.head()

# Prepare the data
In this example, we want to predict whether or not a passenger __survived__ the Titanic disaster. Therefore the 'Survived' column is our prediction target.

Before we can separate our prediction target 'y' from the rest of the data, we need to remove any rows with missing values, because our machine learning model will not be able to handle them.

## Select features and target then drop missing values
Choosing our features first will help reduce the total number of rows we need to drop (remove).
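
To see why the order matters, here is a minimal sketch on a made-up four-row table (the `toy` DataFrame is an assumption for illustration, not the real Titanic file). One column is mostly empty, so dropping rows before selecting features throws away far more data than necessary.

```python
import numpy as np
import pandas as pd

# Made-up data: 'Cabin' is mostly missing, the other columns are complete
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Survived": [0, 1, 1, 1],
})

# Dropping rows first removes every row where ANY column is missing
rows_if_dropping_first = len(toy.dropna(axis=0))

# Selecting features first means 'Cabin' can no longer cause rows to be dropped
rows_if_selecting_first = len(toy[["Age", "Fare", "Survived"]].dropna(axis=0))

print(rows_if_dropping_first, rows_if_selecting_first)
```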

We want to choose a selection of features that:
- Are relevant to our predictions
- Don't have many missing values

Note that we'll also be including the target for now.


In [None]:
# Let's reduce our data to only the features we need and the target.
# The features we chose have similar 'count' values when we describe() them
# We need to keep the target as part of our DataFrame for now.
selected_columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']

# Create our new training set containing only the features we want
prepared_data = titanic_train_data[selected_columns]

# Drop rows (axis=0) that contain missing values
prepared_data = prepared_data.dropna(axis=0)

# Check that you still have a good 'count' value. The value should be the same for all columns.
# If your count is very low then you may need to remove features with the lowest count.
prepared_data.describe(include='all')

## Split data into training and testing data
Splitting the training set into two subsets is important: to tell how well your model is performing, you need data it hasn't seen yet, with actual values you can compare to your predictions. In this example project we're skipping this step, but when you do your project you'll need to consider how you want to split your data. The [Intro to Machine Learning course](https://www.kaggle.com/learn/intro-to-machine-learning) goes through how to do this.
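
As a reminder of what a split can look like, here is a minimal sketch using scikit-learn's `train_test_split` on a made-up eight-row table. The `toy` data, the 25% validation size and the `random_state` value are all assumptions chosen for illustration; pick values that suit your own data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data standing in for a prepared DataFrame with no missing values
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1],
})
X_toy = toy.drop("Survived", axis=1)
y_toy = toy["Survived"]

# Hold back 25% of the rows for validation; random_state makes the split repeatable
train_X, val_X, train_y, val_y = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

print(len(train_X), len(val_X))
```

The model would then be fitted on `train_X` and `train_y` only, and checked against `val_X` and `val_y`.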

## Separate Features From Target
Now that we have a set of data (as a Pandas DataFrame) without any missing values, let's separate the features we will use for training from the target.




In [None]:
# Separate out the prediction target
y = prepared_data.Survived

# Drop the target column (axis=1) from the original dataframe and use the rest as our feature data
X = prepared_data.drop('Survived', axis=1)

# Take a look at the data again
X.head()
#y.head()

## One Hot Encode Categorical Data 
One of the difficulties of working with machine learning models is that most of them can only work with numerical features. One problem with our current data is that 'Sex' is categorical, with values of either _male_ or _female_ which are not numbers!

To use 'Sex' as a feature, we must find a way to encode each category as a number. One Hot Encoding is a widely used approach for doing this. One Hot Encoding creates new (binary) columns, indicating the presence of each possible category value in the original data. In other words, it separates each of the options for a category into a separate column, where a 1 means that the row fits the category in question and a 0 means it doesn't.

__For example:__

<img src="https://i.imgur.com/mtimFxh.png" alt="One Hot Encoding" style="width:600px">

Watch this video (https://www.youtube.com/watch?v=v_4KWmkwmsU) for more information about how and why this works.

The Pandas __get_dummies__ function is the easiest way to One Hot Encode categorical data. Here's how it's done.

In [None]:
# One hot encode the features. This will only act on columns containing non-numerical values.
one_hot_X = pd.get_dummies(X)

one_hot_X.head()

Now we have two columns for Sex, one for female and one for male. Consider that if we had even more categories then we'd get even more columns, which could cause performance problems if there are a large number of categories in our data.
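
The sketch below shows how the column count grows, using a made-up three-row table with a second categorical column that has three options (the `toy` data and its 'Embarked' values are assumptions for illustration). Passing `dtype=int` just makes the new columns show 1s and 0s rather than True/False.

```python
import pandas as pd

# Made-up data: one numerical column and two categorical columns
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

# Each categorical column becomes one new column per distinct value
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.shape)
print(list(encoded.columns))
```

Three original columns turn into six: 'Fare' is left alone, 'Sex' becomes two columns and 'Embarked' becomes three.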

# Choose and Train a Model
Now that we have data our model can digest, let's use it to train a model and make some predictions. We're going to use a __Decision Tree Classifier__ which is different from the Decision Tree Regressor used in the [Intro to Machine Learning course](https://www.kaggle.com/learn/intro-to-machine-learning) in that it makes categorical predictions instead of continuous numerical predictions. 

In this case, the category we want to predict is whether or not a passenger survived, with the output being a 1 if they survived and a 0 if they did not. Decision Tree Classifiers can also work with non-numerical prediction targets. For example, you might have a 'y' that contains the names (as strings) of different species of flowers. It's only the features that need to be encoded.

For an example of a Decision Tree Classifier working with a non-numerical 'y' and a more in-depth look at how they work, take a look at this Kaggle notebook (https://www.kaggle.com/chrised209/decision-tree-modeling-of-the-iris-dataset).
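
Here is a minimal sketch of a string-valued 'y' in action. The four flower measurements are made up for illustration and are not taken from the real iris data set.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up petal measurements: [length, width] in cm
features = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4], [4.5, 1.5]]

# The target is a list of strings, with no encoding needed
species = ["setosa", "setosa", "versicolor", "versicolor"]

flower_classifier = DecisionTreeClassifier(random_state=0)
flower_classifier.fit(features, species)

# The classifier remembers the class labels it saw during training
print(flower_classifier.classes_)

# Predictions come back as the original strings
prediction = flower_classifier.predict([[1.5, 0.3]])
print(prediction)
```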

Ok, let's train our model and see what it looks like.

In [None]:
# Create a decision tree classifier with a maximum depth of 3 for easy display later on
# Try changing the max_depth to see what happens
survivor_predictor = DecisionTreeClassifier(max_depth=3)

# Train the model on the one hot encoded data
survivor_predictor.fit(one_hot_X, y)

# Let's plot the tree to see what it looks like!
plt.figure(figsize = (20,10))
plot_tree(survivor_predictor,
          feature_names=list(one_hot_X.columns),  # a plain list works across scikit-learn versions
          class_names=['perished', 'survived'],
          filled=True)
plt.show()

# Note for class_names we've used strings to represent each of the values.
# However, the real values are 0 for perished and 1 for survived.
# Class names for plot_tree must be strings so to get the right replacement values
# we had to do the following:
# First get a list of classes the tree will classify things as with the following command
### print(survivor_predictor.classes_) ###
# This gives us [0 1]
# Now we can create a new list with the replacement class strings in the same order.

## Pretty Cool!
Take a good look at the decision tree. 
- Does the hierarchy of nodes make sense? 
- Are the values used to make decisions what you expected?
- Is it telling a plausible story about what kind of passengers are more likely to survive?
- Is there anything that surprises you?

Note that there are other ways to view a decision tree and there may be other parameters you could include when plotting the tree to display the nodes differently, but this is fine for now.
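
One of those other ways is a plain-text view of the rules using scikit-learn's `export_text`. The sketch below trains a tiny tree on made-up data (the `toy_X` values and the 'is_female' and 'Age' feature names are assumptions for illustration) and prints its rules.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows: [is_female, Age], with survival as the target
toy_X = [[0, 22], [1, 38], [1, 26], [0, 35], [0, 54], [1, 4]]
toy_y = [0, 1, 1, 0, 0, 1]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(toy_X, toy_y)

# export_text returns the tree's decision rules as an indented string
tree_rules = export_text(toy_tree, feature_names=["is_female", "Age"])
print(tree_rules)
```

A text view like this can be easier to read than a plot once a tree gets deeper than three or four levels.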



# Evaluate model performance and tune hyperparameters
Now that we have a sweet looking model, let's see how good it is at predicting passenger survival on our training set. 

In [None]:
print("Making predictions for every passenger in the training set.")

# predict() returns a NumPy array with one prediction per row
pred = survivor_predictor.predict(one_hot_X)

print("The first five predictions are:")

# Put actual target values and predictions next to a copy of the features to see how we went.
# Working on a copy keeps 'Survived' out of X if earlier cells are run again.
results = X.copy()
results['Survived'] = y
results['Predicted'] = pred

results.head()

## Wow! Perfect!... or is it?

Remember the problem with evaluating a model on its training data that we discovered in the [Intro to Machine Learning course](https://www.kaggle.com/learn/intro-to-machine-learning)? It looks like our predictions are good at the moment, but will they be as good for data our model hasn't seen yet?

In practice, it's always a good idea to split the data from the original training set into a training and validation/test set, otherwise it's difficult to know how good our predictions really are if we're testing our model on the same data it was trained with!

For now, we're only concerned with making predictions and displaying them. When you do your own project, you'll want to __test your model on separate test data using different hyperparameter values__ (such as tree depth) and different feature selections, and compare the __Mean Absolute Error__ of your predictions, just as you did in the [Intro to Machine Learning course](https://www.kaggle.com/learn/intro-to-machine-learning), before drawing any conclusions. For a 0/1 target like 'Survived', the Mean Absolute Error is simply the fraction of predictions that are wrong (1 minus the accuracy). You might even want to create __multiple models__ and compare the predictions between them as well!
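
A comparison like that could be sketched as follows. This is only an illustration under stated assumptions: the data comes from scikit-learn's `make_classification` generator rather than the Titanic file, and the list of depths to try is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a 0/1 target (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=0)

# Train one tree per depth and record its error on the unseen validation rows
val_errors = {}
for depth in [1, 3, 5, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(train_X, train_y)
    val_errors[depth] = mean_absolute_error(val_y, model.predict(val_X))
    print(depth, val_errors[depth])

# The depth with the lowest validation error is the one to prefer
best_depth = min(val_errors, key=val_errors.get)
print("Best depth:", best_depth)
```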

# Conclusion
Now that you have some predictions it's important to talk about them. Some questions you might want to answer are:
- How accurate is your final model?
- How does it compare to other models you've tested with this data set?
- Do you think the features you selected are appropriate? Why?
- What could be done to improve your model further?
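
To help answer the first question, scikit-learn's `accuracy_score` and `confusion_matrix` give a quick summary of how a set of predictions compares to the actual values. The two short lists below are made up for illustration; in your project they would be your validation targets and your model's predictions for them.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up actual outcomes and predictions (0 = perished, 1 = survived)
actual = [0, 1, 1, 0, 1, 0, 1, 1]
predicted = [0, 1, 0, 0, 1, 1, 1, 1]

# Fraction of predictions that match the actual values
accuracy = accuracy_score(actual, predicted)
print(accuracy)

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(actual, predicted)
print(matrix)
```

The confusion matrix shows not just how often the model is wrong, but which kind of mistake it makes more often.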

... and that's it!