## Python Modeling Exercises

Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You can speak with others regarding the assignment, but all work must be your own.


### This is a 30-point assignment, graded on your answers to the questions and on the automated tests that should be run at the bottom. Be sure to clearly label all of your answers and commit the final tests at the end. If you attempt to fake passing the tests, you will receive a 0 on the assignment and it will be considered an ethical violation. (Note: not all questions have tests.)

### You must show the executed code and then the output. Do not just copy and paste the code into a markdown cell.

In [1]:
NAME = "Jason Kuruzovich"
COLLABORATORS = ["Alyssa Hacker"]  #You can speak with others regarding the assignment, but all typed work must be your own.

In [2]:
%load_ext ipython_unittest

### Get Cleaned Data
It is often useful to be able to move back and forth between R and Python. In the last class we used the file `00-tree-models.R` to do some general analysis.

Run that code and save the dataframes `train-new` and `test-new` to the `input` directory in this repository as `train-new.csv` and `test-new.csv`. 

**(1) In the `00-tree-models.R` example, explain why you combine the train and test sets before doing data cleaning.** 




#### Answer (1) here. 

In [3]:
import pandas as pd
#Load the train-new.csv and test-new.csv into dataframes train and test
train = pd.read_csv('input/train-new.csv')
test = pd.read_csv('input/test-new.csv')
train.head()


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | family_size | Survived_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | H | S | Mr | 2 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | C | Mrs | 2 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | H | S | Miss | 1 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C | S | Mrs | 2 | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | H | S | Mr | 1 | 0 |


## Dummy Variables XSIMPLE Example.

In the previous Python examples, we had easy data to work with that consisted entirely of numeric values. For scikit-learn, we have to convert our categorical data to numeric data. Let's do a refresher and create a simple model.

In [4]:
train_xsimple = pd.get_dummies(train[['Sex']])
train_xsimple.head()

|   | Sex_female | Sex_male |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 0 | 1 |


In [5]:
# Combine the continuous variables Age and Pclass with the dummies. 
X = pd.concat([train[['Age','Pclass']], train_xsimple], axis=1)
X.head()

|   | Age | Pclass | Sex_female | Sex_male |
|---|---|---|---|---|
| 0 | 22.0 | 3 | 0 | 1 |
| 1 | 38.0 | 1 | 1 | 0 |
| 2 | 26.0 | 3 | 1 | 0 |
| 3 | 35.0 | 1 | 1 | 0 |
| 4 | 35.0 | 3 | 0 | 1 |
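
As a further refresher, here is a minimal sketch of how `pd.get_dummies` behaves when it is given more than one categorical column at once. The dataframe `toy` and its columns (`height`, `color`, `size`) are made up for illustration and are not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic files used in this assignment.
toy = pd.DataFrame({
    "height": [1.2, 3.4, 2.2, 0.9],
    "color": ["red", "blue", "red", "blue"],
    "size": ["S", "L", "L", "S"],
})

# Passing several categorical columns at once creates one indicator column
# per category, prefixed with the source column name. dtype=int requests
# 0/1 integers; recent pandas versions otherwise return True/False.
toy_dummies = pd.get_dummies(toy[["color", "size"]], dtype=int)

# Put the numeric column first, then the dummy columns, side by side.
toy_X = pd.concat([toy[["height"]], toy_dummies], axis=1)
print(toy_X.columns.tolist())
```

The column order of the result follows the order of the pieces passed to `pd.concat`. If a specific order is required, selecting with a list of names, as in `toy_X[["height", "size_L", "size_S", "color_blue", "color_red"]]`, rearranges the columns.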


### Dummy Variables: Generating X
Follow the example above to generate a new value for `X` using all of the numeric columns and dummy variables.

The resulting dataframe `X` should be all numeric and have these columns (in this exact order):

`['Age', 'Pclass', 'SibSp', 'family_size', 'Fare', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Cabin_A', 'Cabin_B', 'Cabin_C', 'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G', 'Cabin_H']`

### Test we got X right.
This is the same test that is included below. 

In [None]:
%%unittest_main
class TestPackages(unittest.TestCase):
    def test_packages1(self):
        self.assertTrue((X.columns.values.tolist() == ['Age','Pclass', \
        'SibSp','family_size','Fare','Sex_female','Sex_male', \
        'Embarked_C', 'Embarked_Q','Embarked_S','Title_Miss', \
        'Title_Mr', 'Title_Mrs','Cabin_A','Cabin_B','Cabin_C', \
        'Cabin_D','Cabin_E','Cabin_F','Cabin_G','Cabin_H']))

## Set the y Value

Set the y variable to the dependent variable.  

In [None]:
#Set the y value to survived. 
y= 


## Split Training Set For Cross Validation
We want to split up our training set so that we can do some cross validation.  

Below, use the scikit-learn methods to do a train/test split.

From `X` and `y`, generate the following objects by drawing rows **randomly**, with 80% of the data going to train and 20% going to test. So that you get repeatable results, set `random_state=100`. This sets a "seed" so that your random selection will be the same as mine and you will pass the internal tests.

`train_X`, `test_X`, `train_y`, `test_y`
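
For reference, here is a minimal sketch of a seeded 80/20 split with `train_test_split` from `sklearn.model_selection`. The data (`toy_X`, `toy_y`) and the `toy_`-prefixed result names are hypothetical stand-ins, not the assignment's variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature, a binary label.
toy_X = pd.DataFrame({"feature": np.arange(10)})
toy_y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], name="label")

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
# so that repeated runs select the same rows.
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=100
)
print(len(toy_train_X), len(toy_test_X))  # 8 2
```

Note the order of the returned values: both feature pieces come first, followed by both label pieces. The rows of `toy_train_X` and `toy_train_y` stay aligned because the features and labels are shuffled together.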


### Perform Nearest Neighbor Classification (KNeighborsClassifier)
Using the default options, perform nearest neighbor classification. 

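Calculate the accuracy measure using `metrics.accuracy_score` for both the training data (assign to `knn_train1_y_acc`) and the testing data (assign to `knn_test1_y_acc`). The confusion matrix cell further down expects the predictions themselves to be stored in `knn_train1_y` and `knn_test1_y`.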
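
The general fit, predict, and score pattern can be sketched as follows. This is only an illustration on synthetic data from `make_classification`; every `demo_` name is an assumption, and the numbers it prints say nothing about the Titanic data.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration only).
demo_X, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)
demo_train_X, demo_test_X, demo_train_y, demo_test_y = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)

# Default options; for KNeighborsClassifier this means n_neighbors=5.
knn = KNeighborsClassifier()
knn.fit(demo_train_X, demo_train_y)

# Predict on the data the model was fit on and on the held-out data.
demo_train_pred = knn.predict(demo_train_X)
demo_test_pred = knn.predict(demo_test_X)

# accuracy_score takes the true labels first, then the predictions.
demo_train_acc = metrics.accuracy_score(demo_train_y, demo_train_pred)
demo_test_acc = metrics.accuracy_score(demo_test_y, demo_test_pred)
print(demo_train_acc, demo_test_acc)
```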

**2. Is the accuracy higher for the training or the test set?  Is this normal?  Does the difference indicate anything?**


### Answer  2 here.

### Confusion Matrix
Though we haven't calculated one in the example code, a confusion matrix can help us understand misclassifications in more detail.

See the documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html). 

You can use the syntax below to generate `knn_mat1_train` and `knn_mat1_test`.
```
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_pred)
```
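
To see the mechanics on something small, here is a sketch with hand-made labels. The lists `toy_true` and `toy_pred` are hypothetical and chosen only so the counts are easy to check by eye.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 3 samples of class 0 followed by 4 of class 1.
toy_true = [0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 0, 1]

# The result is a 2x2 NumPy array for a binary problem.
toy_mat = confusion_matrix(toy_true, toy_pred)
print(toy_mat)
# [[2 1]
#  [1 3]]

# Individual cells are read with [row, column] indexing.
print(toy_mat[1, 1])  # 3
```

In scikit-learn's convention, entry `[i, j]` counts the samples whose true label is `i` and whose predicted label is `j`, so the true labels must be passed as the first argument.
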
**3. Explain what each of the four values for the confusion matrix for train means. Are the false-positives or false-negatives more frequent for the train set?**


### Answer  3 here.

In [6]:
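from sklearn.metrics import confusion_matrix
# Rows are the true labels, columns are the predicted labels.
knn_mat1_train=confusion_matrix(train_y, knn_train1_y)
knn_mat1_test=confusion_matrix(test_y, knn_test1_y)
# A cell only displays its last bare expression, so print both matrices.
print(knn_mat1_train)
print(knn_mat1_test)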

NameError: name 'train_y' is not defined

### Other Models

Test 2 other algorithms/models (your choice).  Provide a summary of the best performance below. 

Use any of the available classification models. You should show and comment your code.

[Scikit Learn Documentation](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning).
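
One way to keep several models comparable is to fit them in a loop and collect the scores in a dictionary. The sketch below assumes synthetic data and picks `LogisticRegression` and `DecisionTreeClassifier` purely as examples; the choice of models for the assignment is yours.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration only).
demo_X, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)
tr_X, te_X, tr_y, te_y = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)

# Example candidates; any scikit-learn classifier follows the same pattern.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
demo_scores = {}
for name, model in candidates.items():
    model.fit(tr_X, tr_y)
    demo_scores[name] = accuracy_score(te_y, model.predict(te_X))

print(demo_scores)
```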

**4. Which model performed the best of the 3? List the accuracy for each.**

### Answer 4 here. 

**5. For the best performing model, look at the scikit-learn documentation and identify 2 alternate model configurations. For example, if the best performing model is a nearest neighbor, you could change the value for K. If the model is an SVM, you could try changing the kernel.**
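
As a rough sketch of what trying alternate configurations can look like, the loop below varies `n_neighbors` for `KNeighborsClassifier` on synthetic data. The values 1, 5, and 15 are arbitrary choices for illustration; for an SVM, the analogous change would be passing a different `kernel` (for example `"linear"` or `"rbf"`) to `sklearn.svm.SVC`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration only).
demo_X, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)
tr_X, te_X, tr_y, te_y = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)

# Refit the same model type with different values of K and record the
# held-out accuracy for each configuration.
k_scores = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(tr_X, tr_y)
    k_scores[k] = accuracy_score(te_y, model.predict(te_X))

print(k_scores)
```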

### Answer 5 here. 

**6. For the best performing model, try simplifying the input variables to age, gender, and class. What is the resulting impact on performance?**

### Answer 6 here. 

### Grading
These will be used for grading. 

15 - Automated test.
15 - Answers to questions.  


In [None]:
%%unittest_main
class TestHm6(unittest.TestCase):
    def test1_columns(self):
        self.assertTrue((X.columns.values.tolist() == ['Age','Pclass', \
        'SibSp','family_size','Fare','Sex_female','Sex_male', \
        'Embarked_C', 'Embarked_Q','Embarked_S','Title_Miss', \
        'Title_Mr', 'Title_Mrs','Cabin_A','Cabin_B','Cabin_C', \
        'Cabin_D','Cabin_E','Cabin_F','Cabin_G','Cabin_H']))   
    def test2_datasplit(self):
        self.assertAlmostEqual(train_X.iloc[4,0], 32.102631578947395) 
    def test3_knn_train1_y(self):
        self.assertAlmostEqual(knn_train1_y_acc, 0.820224719101)
    def test4_knn_test1_y(self):
        self.assertAlmostEqual(knn_test1_y_acc, 0.687150837989)
    def test5_confusion(self):
        self.assertTrue(knn_mat1_train[1,1]==191)       
 