# Python Guide to Data Science

This guide is based on Python 3; any 3.x release should work.
An easy way to get Python and the necessary libraries is to install them through [Anaconda](https://www.continuum.io/downloads), a distribution that provides everything you need to start working in data science. What you are looking at is an IPython notebook, which lets you write up your process and execute code side by side. On Kaggle, this is one type of what they call a **kernel**.

---
This guide looks at the Titanic dataset to see whether we can predict what types of people would have survived the Titanic.

First, let's import the libraries we will use.

In [19]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

**os** is a built-in library for operating-system-related tasks. We mostly use the `os.path.join()` function to build the path to the file we want. Different operating systems write file paths differently, and Python handles this for us. For example, Windows might have a path like `"C:\Users\scientist\Desktop"` while Linux may have `"~/Desktop"`.
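As a minimal sketch of the idea, the snippet below builds the same relative path used later in this guide and a path to the Desktop folder. The `Desktop` folder is only an assumption for illustration, and note that `os.path.join()` does not expand `~` by itself; `os.path.expanduser()` does that.

```python
import os

# Build a relative path piece by piece; os.path.join inserts the
# separator that the current operating system expects.
data_path = os.path.join('..', 'titanic_data', 'train.csv')
print(data_path)

# "~" is shorthand for the home directory. os.path.join leaves it
# untouched, so expand it explicitly first.
desktop = os.path.join(os.path.expanduser('~'), 'Desktop')
print(desktop)
```

The printed paths differ between operating systems, which is exactly why the pieces are joined by the library instead of being typed out by hand.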

**matplotlib** is used to plot our data. It is a very flexible library that covers everything from basic scatter plots to animated geographical maps.
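A basic scatter plot might look like the sketch below. It uses the ages and fares of the four passengers that appear in the `head(2)` and `tail(2)` outputs in this guide; the axis labels and title are choices made for this illustration.

```python
import matplotlib.pyplot as plt

# Ages and fares of four passengers from the Titanic data.
ages = [22.0, 38.0, 26.0, 32.0]
fares = [7.25, 71.2833, 30.0, 7.75]

# Create a figure with a single set of axes and draw one point per passenger.
fig, ax = plt.subplots()
ax.scatter(ages, fares)
ax.set_xlabel('Age')
ax.set_ylabel('Fare')
ax.set_title('Fare vs. age for four passengers')
```

In a notebook the figure is displayed when the cell finishes; in a plain script you would call `plt.show()` afterwards.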

**pandas** is used to store our data in something called a dataframe (as you will see shortly). The library lets us easily extract parts of the data, apply functions (e.g. mean) to it, and much more. If you are already familiar with this concept, pandas has a good cheatsheet [here](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf).

**numpy** is a scientific computing library that allows for faster computations and provides useful tools such as linear algebra capabilities.
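To give a feel for both points, here is a small sketch. The ages are those of the first three passengers in the data; the matrix `A` and vector `b` are made-up values chosen only to demonstrate a linear solve.

```python
import numpy as np

# Operations apply to the whole array at once, with no explicit loop.
ages = np.array([22.0, 38.0, 26.0])
ages_in_months = ages * 12
mean_age = ages.mean()

# Linear algebra: solve the system A x = b for x.
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)

print(ages_in_months)
print(mean_age)
print(x)
```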

We import them as `plt`, `np`, and `pd` by convention, which is much faster than writing the full name each time.

In [6]:
titanic_data = pd.read_csv(os.path.join('..', 'titanic_data', 'train.csv')) # .. means the parent folder

Since there are no errors, the file loaded successfully. The shape below shows that we imported 891 observations and 12 variables.

In [7]:
print(titanic_data.shape)

(891, 12)


You can view the first `n` or last `n` observations using `dataframe.head(n)` and `dataframe.tail(n)` respectively.

In [8]:
titanic_data.head(2)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |


In [9]:
titanic_data.tail(2)

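| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |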


We can also select individual columns.

In [28]:
titanic_data[['Sex', 'Age']].head(3)

| | Sex | Age |
|---|---|---|
| 0 | male | 22.0 |
| 1 | female | 38.0 |
| 2 | female | 26.0 |
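One detail worth knowing about column selection is that single and double brackets return different types. The sketch below rebuilds the three rows shown above as a small dataframe; the name `sample` is made up for this illustration.

```python
import pandas as pd

# The first three passengers, rebuilt by hand for illustration.
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, 38.0, 26.0],
})

age_series = sample['Age']            # single brackets: one column as a Series
age_frame = sample[['Age']]           # double brackets: a DataFrame with one column
sex_and_age = sample[['Sex', 'Age']]  # a list of names selects several columns

print(type(age_series).__name__)
print(type(age_frame).__name__)
```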


Pandas is powerful because it lets us group data by a variable. We can apply what we learned to see the average `Fare`, the average `Age`, and the proportion of `Survived` for each ticket class. As you move to a higher class (i.e. 3 -> 1):
- Fares increase
- Passengers are older
- A larger proportion survived

In [18]:
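# Select the numeric columns before averaging; recent pandas versions raise an
# error if mean() is applied to text columns such as Name.
titanic_data.groupby('Pclass')[['Fare', 'Age', 'Survived']].mean()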

| Pclass | Fare | Age | Survived |
|---|---|---|---|
| 1 | 84.154687 | 38.233441 | 0.62963 |
| 2 | 20.662183 | 29.87763 | 0.472826 |
| 3 | 13.67555 | 25.14062 | 0.242363 |
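To see why the mean of `Survived` can be read as a proportion, here is a sketch on a tiny made-up table. All values in `toy` are hypothetical and not taken from the Titanic data. Because `Survived` is coded as 0 or 1, its average within a group is the share of passengers in that group who survived.

```python
import pandas as pd

# Four hypothetical passengers, two in class 1 and two in class 3.
toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 3],
    'Name': ['A', 'B', 'C', 'D'],
    'Fare': [80.0, 60.0, 8.0, 12.0],
    'Survived': [1, 1, 0, 1],
})

# Group the rows by class, keep only the numeric columns of interest,
# then average each column within each group.
by_class = toy.groupby('Pclass')[['Fare', 'Survived']].mean()
print(by_class)
```

In this toy table, class 3 has one survivor out of two passengers, so its `Survived` mean is 0.5.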


While this seems pretty good, there's a problem that may not be obvious. Data rarely comes in perfect condition; in this case, there are missing values throughout the data set.

In [32]:
titanic_data[['Pclass', 'Age']].tail(4)

| | Pclass | Age |
|---|---|---|
| 887 | 1 | 19.0 |
| 888 | 3 | NaN |
| 889 | 1 | 26.0 |
| 890 | 3 | 32.0 |
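One way to find out how many values are missing is to count them per column. The sketch below rebuilds the four rows above by hand (the name `tail_sample` is made up for this illustration) and also shows that `mean()` skips missing values by default, which is why the earlier averages were computed without an error.

```python
import numpy as np
import pandas as pd

# The last four passengers, with the missing age stored as NaN.
tail_sample = pd.DataFrame(
    {'Pclass': [1, 3, 1, 3], 'Age': [19.0, np.nan, 26.0, 32.0]},
    index=[887, 888, 889, 890],
)

# isna() marks each missing cell as True; summing counts them per column.
missing_per_column = tail_sample.isna().sum()
print(missing_per_column)

# mean() ignores NaN by default, so this averages only the three known ages.
mean_age = tail_sample['Age'].mean()
print(mean_age)
```

The same `isna().sum()` call can be applied to `titanic_data` to count the gaps in every column.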
