# This Assignment (Objective)

For this assignment, we use a machine learning algorithm called k-nearest neighbors (KNN) to determine what sorts of passengers were likely to survive the sinking of the Titanic.

## Data Analysis

The data provided are in files called `train.csv` and `test.csv`. 

`train.csv` contains a column named `Survived`, which states whether that individual survived. In `test.csv`, that column does not exist. We use what the KNN algorithm learns from `train.csv` to predict the `Survived` status for each row (person) in `test.csv`.

We will be using this data to see what traits of people made it more likely for that individual to survive. We will be using `train.csv` to train our machine learning algorithm first. To initialize, we do the following:

In [93]:
import csv
import pandas as pd
import numpy
from sklearn import neighbors
from pprint import pprint

TITANIC_TRAIN = 'train.csv'
TITANIC_TEST  = 'test.csv'

dataframe = pd.read_csv(TITANIC_TRAIN, header=0)

The `header` parameter tells pandas which line of the CSV holds the column names (line 0, the first line). Several built-in functions that normally work on dictionaries or lists, such as `len`, also work on our new `DataFrame` object.
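As a small illustration of the `header` parameter and of built-ins working on a `DataFrame`, here is a sketch that reads a tiny in-memory CSV instead of `train.csv`. The three sample rows and the names `sample_csv` and `sample` are made up for this example.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for train.csv
sample_csv = io.StringIO(
    "PassengerId,Survived,Age\n"
    "1,0,22\n"
    "2,1,38\n"
    "3,1,26\n"
)

# header=0 tells pandas that the first line holds the column names
sample = pd.read_csv(sample_csv, header=0)

print(len(sample))        # number of rows
print(list(sample))       # iterating a DataFrame yields its column names
print('Age' in sample)    # membership tests check column names, like dict keys
```

Note that `len` counts rows, while iteration and `in` operate on column names, which is closer to how a dictionary behaves than a list.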

In [33]:
print('length: {0} '.format(len(dataframe)))

length: 891 


We can also call a `.info()` method on our data to take a glance at the number of entries under each column. This gives us an idea of what data is present, and what isn't.

In [60]:
print(dataframe.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
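The non-null counts printed by `.info()` can also be turned into explicit counts of missing values. The sketch below does this on a made-up frame named `toy`; the values are invented and only the column names echo the real data.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in two of the three columns
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks each missing cell as True; sum() counts the Trues per column
missing = toy.isnull().sum()
print(missing)
```

Run against the training data, the same call should agree with the `.info()` output above: 891 rows minus the non-null count for each column.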


We know the table contains data on 891 people. We can also see that every column has a value for every person except `Age`, `Cabin`, and `Embarked`. Because scikit-learn cannot process rows with missing (`NaN`) values, we must handle them somehow. Since `Cabin` contains only 204 entries, we treat it as unusable. `Age`, on the other hand, is missing only 177 of its 891 entries, so we can still use that column.

We can't just call `dataframe.dropna()`, because that would throw away every row with a missing value and we want to keep the data. Instead, we fill the missing `Age` fields with the mean age, which we can calculate using the methods pandas provides.
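To show the trade-off between dropping and filling, here is a minimal sketch on a made-up four-row frame. The names `toy`, `dropped`, and `filled` are assumptions for illustration, not part of the assignment code.

```python
import numpy as np
import pandas as pd

# Made-up data: two of the four ages are missing
toy = pd.DataFrame({
    'Age':      [20.0, np.nan, 40.0, np.nan],
    'Survived': [0, 1, 1, 0],
})

dropped = toy.dropna()            # removes every row that has a missing value
mean_age = toy['Age'].mean()      # missing entries are skipped when averaging
filled = toy.assign(Age=toy['Age'].fillna(mean_age))

print(len(dropped), len(filled))
print(mean_age)
```

Dropping halves the toy data set, while filling keeps every row at the cost of inventing an age for the rows that lacked one.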

We've determined that the columns that may be relevant to passenger survival are:
- Pclass
- Sex
- Age
- SibSp
- Parch
- Fare
- Embarked

We will drop the columns that aren't likely to affect chances of survival:
- PassengerId
- Name
- Ticket

We will drop the columns that don't provide enough information as well:
- Cabin

To drop them, we call the following command.
We set the axis to 1 because we want to delete columns, not rows.
The `inplace` parameter determines whether the drop modifies our current `DataFrame` directly or returns a new one.

In [94]:
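# Syntactic sugar for the axis argument.
# Note: pandas treats axis=0 as the row axis and axis=1 as the column axis,
# so the value 1 (named ROWS here) is the one that drops columns.
COLUMNS = 0
ROWS = 1

# Defenestrate the columns we don't need
dataframe.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=ROWS, inplace=True)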
print(dataframe.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB
None
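To make the `axis` and `inplace` arguments of the drop above concrete, here is a small sketch on a made-up two-row frame. All names in it (`toy`, `working`, and so on) are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['a', 'b'], 'Age': [22, 38], 'Fare': [7.25, 71.28]})

# axis=1 drops by column label; without inplace the original frame is untouched
without_name = toy.drop(['Name'], axis=1)

# axis=0 (the default) drops by row label instead
without_first_row = toy.drop([0], axis=0)

# inplace=True modifies the frame itself and returns None
working = toy.copy()
returned = working.drop(['Fare'], axis=1, inplace=True)

print(list(without_name.columns), len(without_first_row), returned)
```

Because the in-place call returns `None`, assigning its result back to a variable would discard the frame, which is an easy mistake to make.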


As we can see, we got rid of the useless columns. Now we need to find the mean age and use it as the age for the rows where it is missing.

We can find the mean age using the following command:

In [95]:
mean = dataframe.Age.mean()
print(mean)

29.69911764705882


We can then set the rows with no Age entry to the mean using the following command:

In [96]:
# Assign the filled column back instead of filling in place through attribute access
dataframe['Age'] = dataframe['Age'].fillna(mean)

And now the table looks like this:

In [63]:
print(dataframe.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB
None


Another thing we should look at is the `Embarked` column. Its entries are letters rather than numbers, which will cause scikit-learn problems down the road. We'll use the pandas `map` method to convert each `Embarked` value to an integer, overwriting the letters in the `Embarked` column rather than storing the codes in a separate `Port` column.

There are still `Embarked` values that are null, so we must assign them to `0`.

In [97]:
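# To see each port of embarkation we can call:
print(dataframe['Embarked'].unique())
# After we confirm that there are only C, S and Q (plus missing values), we can map it.
# Missing values have no entry in the dict, so they come out as NaN and are filled with 0.
dataframe['Embarked'] = dataframe['Embarked'].map({'C': 1, 'S': 2, 'Q': 3}).fillna(0).astype(int)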

['S' 'C' 'Q' nan]
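The behavior of `map` on values that have no dictionary entry is worth seeing in isolation. This is a minimal sketch on a made-up `Series` named `ports` that holds the same four values printed above.

```python
import numpy as np
import pandas as pd

ports = pd.Series(['S', 'C', 'Q', np.nan])

# Values without an entry in the dict (including NaN) become NaN after map
codes = ports.map({'C': 1, 'S': 2, 'Q': 3})

# Fill the gaps with 0 first, then cast; casting NaN straight to int would raise
codes = codes.fillna(0).astype(int)
print(codes.tolist())
```

The integer codes are arbitrary labels. KNN will still treat them as distances, so port 3 counts as "farther" from port 1 than port 2 is.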


Now our data set has no missing values. However, we still need to convert every column whose type is `object` to `int64` or `float64`.

In [65]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    891 non-null int64
dtypes: float64(2), int64(5), object(1)
memory usage: 55.8+ KB


Currently, `Sex` is listed as an `object` type. To fix that, we use the pandas `map` method again.

In [98]:
# To see each value of Sex we can call:
print(dataframe['Sex'].unique())
# After we confirm that there are only male and female, we can map it
dataframe['Sex'] = dataframe['Sex'].map({'male':0, 'female':1}).astype(int)

['male' 'female']


Now we can see that the type of the `Sex` entries is `int64`.

In [99]:
print(dataframe.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null int64
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    891 non-null int64
dtypes: float64(2), int64(6)
memory usage: 55.8 KB
None


We did it! We can now put our data into scikit-learn. We create a KNN classifier object that we can train with our cleaned training `dataframe`.

In [173]:
model = neighbors.KNeighborsClassifier()

We have created a classifier object using the KNN machine learning algorithm. The next thing we want to do is split our columns. Essentially, the following lines of code split our table into two.

Table train_columns:
- Pclass
- Sex
- Age
- SibSp
- Parch
- Fare
- Embarked

Table target_columns:
- Survived

The reason we do this is that we want scikit-learn to look at every row in our training data and relate the values in `train_columns` to the value in `target_columns`. In other words: does a passenger's status in `[Pclass, Sex, Age, SibSp, Parch, Fare, Embarked]` affect their chance of `Survived`?

In [174]:
# Get the column names as a list
columns = dataframe.columns.tolist()
dataframe = dataframe[columns]

# Split the names: the first column is the target, the rest are features
train_columns = columns[1:]
target_columns = [columns[0]]
print(train_columns, target_columns)

# Get the data from those columns
train_data = dataframe[train_columns]
target_data = dataframe[target_columns]

['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] ['Survived']


Variables ending in `_columns` hold lists of column names, as you can see in the last print statement.
Variables ending in `_data` hold the actual data associated with those columns.

Next, we need to feed our `KNeighborsClassifier` object our training data. For the first parameter, we pass in `train_data.values`, which returns a two-dimensional array containing everything in our table except for the `Survived` column.

The line `[value[0] for value in target_data.values]` handles the second parameter.
We cannot just pass in `target_data.values` because it is a two-dimensional array with a single column, and scikit-learn expects the labels as a one-dimensional array. So we take the first item of every row and add it to a list (that's the reason for the `for` loop and the `[0]`).
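The shape difference is easier to see on a small example. The sketch below uses a made-up three-row frame named `toy` and compares the list comprehension with two alternatives that produce the same one-dimensional labels.

```python
import pandas as pd

toy = pd.DataFrame({'Survived': [0, 1, 1]})

two_d = toy[['Survived']].values       # list of column names -> shape (3, 1)
by_loop = [row[0] for row in two_d]    # first item of every row, as described above
by_ravel = two_d.ravel()               # NumPy's built-in flattening
by_column = toy['Survived'].values     # a single column name -> already shape (3,)

print(two_d.shape, by_ravel.shape, by_column.shape)
```

Selecting with a single name instead of a one-element list avoids the extra dimension in the first place, so either alternative could stand in for the loop.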

In [175]:
# Train our KNN Classifier Object
model.fit(train_data.values, [value[0] for value in target_data.values])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
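The settings printed above (`n_neighbors=5`, `metric='minkowski'` with `p=2`, `weights='uniform'`) describe a majority vote among the five closest training rows by Euclidean distance. The following is a rough sketch of that idea in plain NumPy, not scikit-learn's actual implementation. The function `knn_predict` and the toy data are made up for illustration, and tie-breaking may differ from the library.

```python
from collections import Counter
import numpy as np

def knn_predict(train_X, train_y, query, k=5):
    """Predict one label by majority vote among the k nearest training rows."""
    # Euclidean distance from the query to every training row
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-feature data: label 0 sits near the origin, label 1 near (10, 10)
train_X = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [9, 10], [10, 9]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])

near_origin = knn_predict(train_X, train_y, np.array([0.5, 0.5]), k=3)
near_corner = knn_predict(train_X, train_y, np.array([9.5, 9.5]), k=3)
print(near_origin, near_corner)
```

Because the distance is computed on raw values, columns with large ranges such as `Fare` contribute more to it than 0/1 columns such as `Sex`.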

In [154]:
# Load the test file
test_dataframe = pd.read_csv(TITANIC_TEST)

# Save PassengerIds for Kaggle Submission
ids = test_dataframe.PassengerId.values

# Drop the columns we don't need
test_dataframe.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=ROWS, inplace=True)

In [155]:
# Check if our table is clean
print(len(test_dataframe))
test_dataframe.info()

418
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass      418 non-null int64
Sex         418 non-null object
Age         332 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Fare        417 non-null float64
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 22.9+ KB


In [156]:
# Same cleaning steps we applied to the training data
mean_age = test_dataframe.Age.mean()
test_dataframe['Age'] = test_dataframe['Age'].fillna(mean_age)

# The test set is also missing one Fare value
fare_mean = test_dataframe.Fare.mean()
test_dataframe['Fare'] = test_dataframe['Fare'].fillna(fare_mean)

test_dataframe['Embarked'] = test_dataframe['Embarked'].map({'C': 1, 'S': 2, 'Q': 3}).fillna(0).astype(int)
test_dataframe['Sex'] = test_dataframe['Sex'].map({'male': 0, 'female': 1}).astype(int)


test_data = test_dataframe.values

In [159]:
# Checking output is correct
test_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass      418 non-null int64
Sex         418 non-null int64
Age         418 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Fare        418 non-null float64
Embarked    418 non-null int64
dtypes: float64(2), int64(5)
memory usage: 22.9 KB


In [169]:
output = model.predict(test_data).astype(int)

result = numpy.c_[ids.astype(int), output]
print(result[:5])

[[892   0]
 [893   0]
 [894   0]
 [895   1]
 [896   0]]


In [172]:
# Write the predictions in the two-column format used for the Kaggle submission
with open("veryoriginalcode.csv", "w", newline="") as predictions_file:
    writer = csv.writer(predictions_file)
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(zip(ids, output))
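The same two-column file could also be produced with pandas instead of the `csv` module. This is an alternative sketch, not what the notebook does. The ids and predictions are made up, and the output goes to an in-memory buffer so the example does not touch the disk.

```python
import io
import pandas as pd

# Made-up ids and predictions standing in for `ids` and `output`
sample_ids = [892, 893, 894]
sample_output = [0, 0, 1]

submission = pd.DataFrame({'PassengerId': sample_ids, 'Survived': sample_output})

# index=False keeps the row index out of the file;
# pass a file name instead of the buffer to write to disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Building the frame first also makes it easy to inspect the predictions, for example with `submission.head()`, before writing them out.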

## Evaluation

## Representation

## KNN

## Optimization

## SciKit Learn