# Introduction

Today, we have our first-ever Kaggle Hack Session! We're going to be competing in the Titanic competition. The goal of this competition is to predict who survived and who passed away during this tragedy, given information about the people involved.

We know that getting started with one of these competitions can be difficult, so we've provided this starter notebook to help you get up and running. Let's think about what we need to do when approaching any machine learning competition or problem.

1) Determine your problem space. Do you have a classification problem, or a regression problem?

2) Determine what model you want to use (it's always good to start with simple models).

3) Load in and preprocess your dataset. Examine your dataset to see if there are any NULL or non-numeric values.

4) Split up your dataset into training and testing components. 

5) Create your model. This entails defining your function, your placeholders, the loss function, and the optimizer. 

6) Train, evaluate, and iterate on your model!

7) Once you have a model that you're satisfied with, load in test.csv (the test set for the Titanic competition), compute your predictions, save them to a CSV file, and submit to Kaggle. 

In [1]:
import pandas as pd
import tensorflow as tf

# Load in Data

You can download the data from the Kaggle website. The direct link is [here](https://www.kaggle.com/c/titanic/data), but we've already downloaded it for you. It's located in the Data subfolder. 

In [2]:
# Use the Pandas read_csv() function to load in the train.csv
titanicTrain = pd.read_csv('Data/train.csv')

# Examine Data

In [3]:
# Use the head function to see what the first few rows of the dataframe look like
titanicTrain.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [5]:
# Figure out what the different column names are
titanicTrain.columns.tolist()

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

Use other functions such as describe, max, mean, value_counts, etc. to learn more about the dataset you're dealing with.
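Here is a minimal sketch of what those exploration functions return. It runs on a small made-up dataframe (the `toy` name and its values are assumptions for illustration, not the real Titanic data), so you can see the shape of each result before trying the same calls on `titanicTrain`.

```python
import pandas as pd

# Toy stand-in for the Titanic dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
})

summary = toy.describe()                 # count/mean/std/min/max for numeric columns
oldest = toy["Age"].max()                # largest non-null value in the column
survival_rate = toy["Survived"].mean()   # fraction of 1s in a 0/1 column
sex_counts = toy["Sex"].value_counts()   # how often each category appears

print(summary)
print(oldest, survival_rate)
print(sex_counts)
```

Note that the count row of describe only counts non-null entries, which makes it a quick first hint that a numeric column has missing values.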

# Clean Data

This is one of the most important parts of any machine learning pipeline. We want to make sure that the inputs we feed into any machine learning model are valid, non-null, numerical values. To get you started with data preprocessing, we'll show you one example of a column you may want to drop in this dataset.

In [7]:
# Summarize the numeric columns of the data we're working with
titanicTrain.describe()

So, as you can see above, the column _ has null values. There are ways we can deal with this (for example, replacing the null values with the median of the other values, replacing them with 0, etc.), but a simple method is to just drop the column.

In [None]:
# Drop the column
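
If you're not sure how to spot and drop a column with nulls, here is one possible sketch. The dataframe and the column names (`feature_a`, `mostly_missing`) are hypothetical, chosen only for illustration; which Titanic column to drop is still up to you.

```python
import pandas as pd

# Hypothetical dataframe; column names are made up for illustration
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "mostly_missing": [None, None, None, "x"],
})

# Count the null entries in every column
null_counts = toy.isnull().sum()
print(null_counts)

# drop returns a new dataframe; the original is left unchanged
cleaned = toy.drop(columns=["mostly_missing"])
print(cleaned.columns.tolist())
```

Because drop returns a copy, remember to assign the result back to a variable if you want to keep working with the cleaned dataframe.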

Another column that needs processing is _

In [None]:
# Do the preprocessing


In [None]:
# TODO Find the other attributes that may give us trouble later on! Once you find these
# columns, figure out if you just want to drop the attribute altogether or replace with 
# median, or something else!

Now that you know a couple ways of dealing with null values and string values, feel free to be creative! The best way to get a more accurate machine learning model is to understand the best ways to visualize and clean your data! This is one of the most important steps in any ML pipeline. 
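
As a rough illustration of two of these options, the sketch below fills nulls with the column median and converts a string column into numbers. The dataframe, the column names (`age`, `group`), and the particular string-to-number mapping are assumptions made up for this example; they are not a recommendation for any specific Titanic column.

```python
import pandas as pd

# Hypothetical dataframe with one null and one string column
toy = pd.DataFrame({
    "age": [20.0, None, 40.0, 30.0],
    "group": ["a", "b", "a", "b"],
})

# Option 1: replace nulls with the median of the non-null values
age_median = toy["age"].median()          # median skips nulls by default
toy["age"] = toy["age"].fillna(age_median)

# Option 2: map each string category to a number
toy["group"] = toy["group"].map({"a": 0, "b": 1})

print(toy)
```

One thing to watch with map: any category missing from the dictionary becomes null, so it's worth re-checking for nulls afterwards.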

# Create Training/Testing Matrices

So, now that we've made our final changes to our dataframe, we want to convert it into a matrix of numbers. 

In [None]:
# Convert to numpy matrices. 
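
One way this conversion could look is sketched below, assuming the dataframe is already fully numeric. The `toy` dataframe and the names `features` and `labels` are placeholders for illustration.

```python
import pandas as pd

# Hypothetical, already-cleaned dataframe: one label column, two feature columns
toy = pd.DataFrame({
    "label": [0, 1, 1],
    "f1": [1.0, 2.0, 3.0],
    "f2": [0.5, 0.1, 0.9],
})

# Everything except the label becomes the input matrix
features = toy.drop(columns=["label"]).values.astype("float32")

# The label becomes a column vector with one row per example
labels = toy["label"].values.reshape(-1, 1).astype("float32")

print(features.shape, labels.shape)
```

Keeping the labels as a column vector (rather than a flat array) tends to make the shapes line up more easily with a model that outputs one value per row.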

(OPTIONAL) Remember that whenever we have a dataset, it's good practice to separate it into two parts: one that we will use to train the model, and one that we will use as a test/validation set to check how our model is doing.

In [None]:
# Divide into xTrain, yTrain, xTest, and yTest
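
A simple way to do this split with plain NumPy is sketched below. The 80/20 ratio, the fixed random seed, and the synthetic `features` and `labels` arrays are all assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the split is reproducible

# Synthetic stand-ins for the real feature and label matrices
features = np.arange(20, dtype="float32").reshape(10, 2)
labels = (np.arange(10) % 2).reshape(-1, 1).astype("float32")

# Shuffle the row indices, then cut them into an 80% / 20% split
indices = rng.permutation(len(features))
split = int(0.8 * len(features))
train_idx, test_idx = indices[:split], indices[split:]

xTrain, yTrain = features[train_idx], labels[train_idx]
xTest, yTest = features[test_idx], labels[test_idx]

print(xTrain.shape, yTrain.shape, xTest.shape, yTest.shape)
```

Shuffling before splitting matters when the rows are stored in some meaningful order; indexing features and labels with the same index arrays keeps each example paired with its label.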

# Create Model

Now that we have all of our data loaded in and preprocessed, we can start creating our model. This component is pretty open-ended. You have the freedom to choose whichever model you'd like to create. If you need inspiration, take a look at the code for linear regression and logistic regression in the week2 and week3 folders.

In [1]:
# TODO Create your model here

# Train Model

Now that you've created your model by defining your computational graph, you're ready to start training the model. Remember that training a model basically means that we want to run our optimizer object over different parts of our training dataset. A few other reminders:
- Remember to create a TensorFlow session and initialize all of your variables
- Run your optimizer object at every iteration
- Keep track of how your model is doing every now and again

In [None]:
numIterations = 1000 # Adjust this number as you see fit!
# TODO Initialize variables
for i in range(numIterations):
    # TODO Run optimizer object over your data
    # TODO Check accuracy every once in a while
    pass  # Placeholder so the loop is valid Python until the TODOs are filled in
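
To make the shape of a training loop concrete, here is a sketch of logistic regression trained with gradient descent. Note that it deliberately uses plain NumPy instead of TensorFlow, so there are no placeholders, sessions, or optimizer objects here; the gradient update is written out by hand where your optimizer would normally run. The synthetic data, the learning rate, and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data: the label is 1 when the two features sum to more than 0
x = rng.randn(200, 2)
y = (x[:, 0] + x[:, 1] > 0).astype(float).reshape(-1, 1)

weights = np.zeros((2, 1))
bias = 0.0
learning_rate = 0.5   # assumed value; tune as needed
numIterations = 1000

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for i in range(numIterations):
    # Forward pass: predicted probability of class 1 for every row
    probs = sigmoid(x @ weights + bias)

    # Gradient of the mean cross-entropy loss with respect to the parameters
    grad_w = x.T @ (probs - y) / len(x)
    grad_b = np.mean(probs - y)

    # Gradient descent step (the role played by the optimizer object)
    weights -= learning_rate * grad_w
    bias -= learning_rate * grad_b

    # Check accuracy every once in a while
    if i % 200 == 0:
        accuracy = np.mean((probs > 0.5) == y)
        print("iteration", i, "accuracy", accuracy)

final_accuracy = np.mean((sigmoid(x @ weights + bias) > 0.5) == y)
print("final accuracy", final_accuracy)
```

The same three steps (compute predictions, update parameters, occasionally report accuracy) carry over to a TensorFlow version; only the mechanics of the update differ.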

# Test Model

By now, you have a trained model and you're almost ready to submit! We now want to see how our model does on data that it has never seen before, so we need to compute our predictions for the test set. We will then submit these predictions to Kaggle to see how accurate we are. A few reminders:
- Remember that preprocessing you did for the training dataset? You'll need to do that same preprocessing for this test set as well. 
- No need to initialize variables or anything. Everything is already trained! We just want to compute our predictions for this new set of data. 

In [None]:
titanicTest = pd.read_csv('Data/test.csv')

# TODO Do the same data preprocessing you did for the train set
# TODO Compute the predictions for the testing set by evaluating (?)
# TODO Check that the predictions are a matrix of _ dimensionality 
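
One way to keep the train and test preprocessing identical is to put the steps in a single function and call it on both dataframes. The sketch below assumes a hypothetical `preprocess` helper and made-up columns (`age`, `group`); it also reuses the median computed from the training data when filling nulls in the test data, which is a design choice rather than a requirement.

```python
import pandas as pd

def preprocess(df, fill_values):
    """Hypothetical helper: apply the same cleaning steps to any split."""
    out = df.copy()
    out["age"] = out["age"].fillna(fill_values["age"])
    out["group"] = out["group"].map({"a": 0, "b": 1})
    return out

# Made-up train and test dataframes with the same columns
train = pd.DataFrame({"age": [20.0, None, 40.0], "group": ["a", "b", "a"]})
test = pd.DataFrame({"age": [None, 50.0], "group": ["b", "a"]})

# Compute fill values once, from the training data only
fill_values = {"age": train["age"].median()}

train_clean = preprocess(train, fill_values)
test_clean = preprocess(test, fill_values)

print(test_clean)
```

With this pattern, any change you make to the cleaning steps automatically applies to both datasets, so the model always sees test inputs in the same format it was trained on.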

# Create Kaggle Submission

It's very important to be familiar with the exact Kaggle submission format. We basically want to create a CSV file where the first line of the CSV has the column names '' and '' (this will be different from competition to competition). The following lines will contain the id number of each test example as well as the prediction for that example.

In [None]:
import numpy as np
import csv

firstRow = [['id', 'pred']]
with open("predictions.csv", "w", newline="") as f:  # Text mode: csv.writer needs str, not bytes, in Python 3
    writer = csv.writer(f)
    writer.writerows(firstRow)
    # TODO write the predictions you got from the last step!
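
As an alternative to the csv module, here is a sketch that builds the submission with pandas. The ids and predictions are made-up placeholders. The column names `PassengerId` and `Survived` are what the Titanic competition expects to the best of our knowledge, but treat them as an assumption and double check the sample submission file on the competition page.

```python
import numpy as np
import pandas as pd

# Placeholder values; in practice these come from test.csv and your model
passenger_ids = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])

submission = pd.DataFrame(
    {"PassengerId": passenger_ids, "Survived": predictions},
    columns=["PassengerId", "Survived"],  # fix the column order explicitly
)

# index=False keeps the dataframe's row index out of the file
submission.to_csv("predictions.csv", index=False)

print(submission)
```

Leaving out index=False is a common cause of rejected submissions, since it adds an extra unnamed column to the CSV.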