# Basic Pandas and Scikit Learn Workflow

This is the first in a series of notebooks that demonstrate a proper machine learning workflow for supervised learning using Pandas and Scikit Learn.

### Common Theme
Never look at the test data!

This seemingly obvious point has subtle implications when Pandas feature engineering and cross-validation are employed.  Many beginners, and even a few experts, inadvertently write code that makes implicit use of the test data, which results in an overly optimistic estimate of the model's accuracy.
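To make the "implicit use of test data" concrete, here is a minimal sketch of one common case: filling missing values with a mean. The toy DataFrame, its column names, and the 25% split are assumptions made up for illustration; they are not taken from the Titanic data or from later notebooks in this series.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data (not the Titanic set): one numeric feature with gaps.
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0],
    "label": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Split FIRST, before computing any statistic from the data.
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)

# Leaky version (avoid): this mean is computed over rows that end up in the test split.
leaky_mean = df["age"].mean()

# Safer version: learn the fill value from the training rows only...
train_mean = train_df["age"].mean()

# ...then apply that same training-derived value to both splits.
train_filled = train_df.assign(age=train_df["age"].fillna(train_mean))
test_filled = test_df.assign(age=test_df["age"].fillna(train_mean))
```

The difference between `leaky_mean` and `train_mean` is usually small, which is exactly why this kind of leak tends to go unnoticed.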


### Notebooks
1. Basic Pandas and Scikit Learn Workflow
2. Scikit Learn Pipelines
3. Scikit Learn Pipelines with Pandas Feature Engineering
4. Scikit Learn Pipelines with Pandas Feature Engineering and Hyperparameter Optimization


### Prerequisites
This notebook assumes the reader has written at least some code in Python making use of Pandas and Scikit Learn.

The goal of this notebook is to help the reader put these tools together to create Machine Learning models using best practices.

This notebook was created in a development environment based on the Anaconda distribution, using the following versions:

* Python  3.6
* Numpy  1.13
* Pandas 0.22
* Scikit Learn 0.19.1
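
To compare a local environment against the versions listed above, a short check along these lines may help. This is only a sketch; the `versions` dictionary is a name introduced here for illustration.

```python
import sys

import numpy as np
import pandas as pd
import sklearn

# Collect the version strings of the interpreter and the main libraries.
versions = {
    "Python": sys.version.split()[0],
    "Numpy": np.__version__,
    "Pandas": pd.__version__,
    "Scikit Learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

Newer versions will likely work for most of the code here, but some function locations and defaults have changed between releases.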

### Data Set and Problem

A common dataset used in introductory Machine Learning is the Titanic dataset from Kaggle.  The goal is to predict who survived and who did not based on the other features.

This notebook will use "train.csv" as available from Kaggle.

It should be noted that "test.csv" is also provided by Kaggle.  This is a set of records without labels.  As this notebook is concerned with supervised learning only (not semi-supervised learning), test.csv will not be used at all.

Question:
Why am I disregarding test.csv when even the tutorial writers over at Kaggle for this introductory competition make use of it?

Answer:
Especially for a beginner, it is important not to learn bad practices up front.  In a supervised problem in real life, you will not have unlabeled data (making use of that is called semi-supervised learning).  Furthermore, the very name of the file, "test.csv", suggests you should not be looking at it.
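
Since test.csv is set aside, a labeled test set has to come out of train.csv itself. The sketch below shows one way this could be done with `train_test_split`; the stand-in DataFrame, the 20% test size, the `random_state`, and the use of `stratify` are all assumptions for illustration, not necessarily the choices made later in this series.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the labeled data (values are made up).
labeled = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 3, 1, 2, 3],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 8.46, 51.86, 21.08, 11.13],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
})

X = labeled.drop(columns="Survived")
y = labeled["Survived"]

# Hold out 20% of the labeled rows as a test set that is never inspected
# during feature engineering or model selection. stratify=y keeps the
# class balance similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Everything learned from data (fill values, encodings, scaling parameters, hyperparameters) would then be derived from `X_train` and `y_train` only.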



In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
%matplotlib inline
sns.set() # enable seaborn style

In [23]:
train = pd.read_csv('./data/train.csv')

In [24]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
