# Classification
Scikit-Learn is a Python package that provides machine learning algorithms and supporting utilities. Models such as decision trees can be trained and used with only a few lines of code.

## Simple Classification 
We will use the DecisionTreeClassifier.
As with other classifiers, DecisionTreeClassifier takes two arrays as input: an array X, sparse or dense, of shape [n_samples, n_features] holding the training samples, and an array Y of shape [n_samples] holding the class labels for the training samples. The labels may be integers, as in Example 1, or strings, as in Example 2:

In [20]:
from sklearn import tree

#Example 1
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print('Example 1')
print(clf.predict([[2., 2.]]))


Example 1
[1]
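
The input description above mentions that X may be sparse or dense. As a minimal sketch of what that means, the toy data from Example 1 can be stored as a SciPy sparse matrix and passed to the same classifier. The variable names and the fixed random_state are choices made for this illustration only:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn import tree

# Same toy data as Example 1, stored once densely and once sparsely
X_dense = np.array([[0, 0], [1, 1]])
X_sparse = csr_matrix(X_dense)
Y = [0, 1]

# Rows are samples, columns are features
print(X_dense.shape)

# random_state is fixed only so the sketch is repeatable
clf_dense = tree.DecisionTreeClassifier(random_state=0).fit(X_dense, Y)
clf_sparse = tree.DecisionTreeClassifier(random_state=0).fit(X_sparse, Y)

# Both classifiers should agree on a new sample
sample = [[2., 2.]]
print(clf_dense.predict(sample), clf_sparse.predict(csr_matrix(sample)))
```

Sparse storage only pays off when most feature values are zero; for small dense data like this, plain lists or NumPy arrays are simpler.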


In [21]:
from sklearn import tree

# Example 2
# Create training data for the decision tree
# Each row is one person: [height, hair-length, voice-pitch]
X = [[180, 15, 0],
     [167, 42, 1],
     [136, 35, 1],
     [174, 15, 0],
     [141, 28, 1]]
#Create target data for training set
Y = ['man', 'woman', 'woman', 'man', 'woman']

# Fit the classifier on the training data, then predict the class of a new sample
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
prediction = clf.predict([[133, 37, 1]])
print('Example 2')
print(prediction)

Example 2
['woman']


In [25]:
from sklearn.datasets import load_iris
from sklearn import tree
import numpy as np

# Scikit-Learn bundles the Iris dataset, so there is no need to read it from a CSV file
iris = load_iris()
# Hold out one sample of each class (indices 0, 50 and 100) for testing
test_idx = [0, 50, 100]

# Training data
training_target = np.delete(iris.target, test_idx)
training_data = np.delete(iris.data, test_idx, axis = 0)

# Testing data
testing_target = iris.target[test_idx]
testing_data = iris.data[test_idx]

classifier = tree.DecisionTreeClassifier()
classifier.fit(training_data, training_target)

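# Predict the class of the first held-out sample
print(classifier.predict(testing_data[:1]))
# Show that sample's features and true label for comparison
print(testing_data[0], testing_target[0])
# Feature names and class names, for reference
print(iris.feature_names, iris.target_names)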

[0]
[ 5.1  3.5  1.4  0.2] 0
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] ['setosa' 'versicolor' 'virginica']
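
The cell above checks only the first held-out sample. As a rough sketch, the same setup can be extended to predict all three held-out samples, print the species names, and report accuracy. The use of accuracy_score and the fixed random_state are additions for this illustration, not part of the original cell:

```python
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.metrics import accuracy_score
import numpy as np

iris = load_iris()
test_idx = [0, 50, 100]  # one sample from each class

# Split into training and testing sets exactly as in the cell above
training_data = np.delete(iris.data, test_idx, axis=0)
training_target = np.delete(iris.target, test_idx)
testing_data = iris.data[test_idx]
testing_target = iris.target[test_idx]

# random_state is fixed only so the sketch is repeatable
classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(training_data, training_target)

# Predict every held-out sample and translate label indices to species names
predicted = classifier.predict(testing_data)
for features, label in zip(testing_data, predicted):
    print(features, iris.target_names[label])

print('Accuracy on held-out samples:', accuracy_score(testing_target, predicted))
```

With only three test samples the accuracy figure is very coarse, so treat it as a sanity check rather than a real evaluation.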


## Iris Flower Dataset
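Let's apply what we've learned from the simple examples to a real dataset. The third example above does this with the Iris flower dataset bundled with Scikit-Learn: it holds out one sample of each class for testing and trains on the rest.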

## Titanic Survival with Decision Tree Classifier
Each row of the provided dataset describes one passenger.
Use this dataset to predict whether a passenger survived.

### The Dataset
The following data sets are provided:
* **Train:** A training dataset for training your machine learning algorithm.
* **Test:** A test dataset for evaluating your machine learning algorithm.

### Preparing the Data
* Remember to convert all string data into categorical (numeric) data
* Handle missing values appropriately

Use the following datasets: 
* datasets/titanic_train.csv
* datasets/titanic_test.csv
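
The two preparation steps listed above could look roughly like the sketch below. It reads a small in-memory CSV rather than the real files, and the column names (Sex, Age, Survived), the sample rows, and the choice of median filling are all assumptions made for illustration. Check the header of datasets/titanic_train.csv for the actual columns.

```python
import csv
import io
import numpy as np

# Hypothetical rows standing in for the real CSV; column names are assumed
raw = io.StringIO(
    "Sex,Age,Survived\n"
    "male,22,0\n"
    "female,38,1\n"
    "female,,1\n"
    "male,35,0\n"
)
rows = list(csv.DictReader(raw))

# Convert string categories to integer codes
sex_codes = {'male': 0, 'female': 1}
sex = np.array([sex_codes[row['Sex']] for row in rows])

# Parse ages, marking blank entries as NaN
age = np.array([float(row['Age']) if row['Age'] else np.nan for row in rows])
# Fill missing ages with the median of the known ages (one option among several)
age[np.isnan(age)] = np.nanmedian(age)

# Assemble the feature matrix X and the label vector Y
X = np.column_stack([sex, age])
Y = np.array([int(row['Survived']) for row in rows])
print(X)
print(Y)
```

Median filling is only one way to handle missing values. Dropping incomplete rows or filling with a separate marker value are alternatives worth comparing.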

In [96]:
import csv
import numpy as np
from sklearn import tree

# Read in data and identify features

# Plot and understand your features (optional)

# Convert any required features to numeric

# Handle missing values

# Make a Scikit Learn classifier

# Fit the model using your classifier

# Determine accuracy of model
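
For the last steps of the skeleton above, one possible way to fit a classifier and measure accuracy is sketched here. It uses the Iris data as a stand-in so that it runs on its own. The 25% hold-out split, the fixed random_state, and the use of train_test_split and accuracy_score are assumptions of this sketch, not requirements of the exercise:

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data; replace with the prepared Titanic feature matrix and labels
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

# Make and fit the classifier (random_state fixed for repeatability)
classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)

# Accuracy is the fraction of held-out samples predicted correctly
predictions = classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print('Accuracy:', accuracy)
```

If the test file has no survival labels, holding out part of the training data in this way is one option for estimating accuracy.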
