# Introduction to Pandas

## Some disclaimers
* The code is written in Python 2.7 (if you want to run it in Python 3, you'll need to make some adaptations)
* Almost everything in Pandas can be done in more than one way. The code here uses the approach I found most suitable

## What is Pandas?
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

https://pandas.pydata.org/

## How to install:
`pip install pandas` (on Linux / macOS)

### Dependencies:
(install using `pip install XXX`)
* numpy
* seaborn
* scikit-learn (imported as `sklearn`)

## Documentation:
https://pandas.pydata.org/pandas-docs/stable/

# Titanic: Machine Learning from Disaster
This data set is taken from Kaggle and is great for playing with Data Science and Machine Learning

https://www.kaggle.com/c/titanic

## Download the data set:
Download it from here:

https://www.kaggle.com/c/titanic/data

## Exploring the data:

#### Load and check the data:

In [None]:
data_path = '../data/titanic_train.csv'

import pandas as pd
df = pd.read_csv(data_path, header=0)

# see the data type of each column:
df.dtypes

In [None]:
# See more info:
df.info()

In [None]:
df.head(10)

In [None]:
# how many rows / columns
df.shape

In [None]:
# some valuable statistical data 
df.describe()

# std is the standard deviation

## Visualization:

In [None]:
# see a histogram of ages
import pylab as P
df['Age'].hist()
P.title('Age histogram')
P.xlabel('Ages')
P.ylabel('Amount of passengers')
P.show()

In [None]:
# we cannot plot the 'Survived' column as-is, we must count its values first
print(df['Survived'].value_counts())
df['Survived'].value_counts().plot(kind='pie')
# df['Survived'].value_counts().plot(kind='barh')
P.title('Survival pie chart')
P.show()

# 0 = Did not survive, 1 = Survived

# other built-in chart types:
# line, bar, barh, hist, box, kde, density, area, pie, scatter, hexbin
# more info here: 
# http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.plot.html

In [None]:
# survival based on the gender.
survived_sex = df[df['Survived']==1]['Sex'].value_counts()
dead_sex = df[df['Survived']==0]['Sex'].value_counts()
df2 = pd.DataFrame([survived_sex,dead_sex])
df2.index = ['Survived','Dead']
df2.plot(kind='bar',stacked=True, figsize=(15,8))
P.title('Survival based on the gender')
P.ylabel('Amount of passengers')
P.show()


In [None]:
#generate age groups
df['age_group'] = df['Age']//10*10
df.head(10)
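
The floor-division trick above rounds each age down to the start of its decade, and missing ages stay missing. A minimal sketch on a few made-up ages (the values in `ages` are illustrative, not taken from the data set):

```python
import numpy as np
import pandas as pd

# made-up ages, including a missing value
ages = pd.Series([0.42, 9.0, 10.0, 38.5, 80.0, np.nan])

# floor-divide by 10, then multiply back: 38.5 -> 3.0 -> 30.0
age_group = ages // 10 * 10

print(age_group.tolist())
```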

In [None]:
#seaborn visualization
import seaborn as sns
# from matplotlib import pyplot as plt
sns.countplot(x="age_group", hue="Survived", data=df).set_title('Survived by age group')
P.xlabel('Age group')
P.ylabel('Amount of passengers')
P.show()

In [None]:
# age_group is not needed anymore
df = df.drop(['age_group'], axis=1)

In [None]:
# better to be in first class
sns.countplot(x="Pclass", hue="Survived", data=df).set_title('Survived by class')
P.xlabel('Class')
P.ylabel('Amount of passengers')
P.show()

## Cleaning the data
Before the actual processing, we would like to clean the data: remove unneeded columns, fill in missing values, normalize, change types, etc.

In [None]:
# it is difficult to work with string values such as "male" or "female":

# we will create a new binary column 'Gender' (initialized with a placeholder value of -1):
df['Gender'] = -1
df.head(10)

In [None]:
# set Gender column:
df['Gender'] = df['Sex'].map( {'female' : 0, 'male' : 1} ).astype(int)

# drop the 'Sex' column - we don't need it anymore:
df = df.drop(['Sex'], axis=1)


df.head(10)

In [None]:
df.count()
# we can see that the Age column is missing some values. Let's fill them

In [None]:
# We cannot fill Cabin properly because it has too many missing values,
# so we will remove the column:

df = df.drop(['Cabin'], axis=1)

In [None]:
df['Age'].head(20)

In [None]:
# we will fill the missing values in Age with the mean of the existing values (or the median)
# df['Age'].median()
# df['Age'].mean()

# we don't want to overwrite the existing column, so we'll add a new column 'AllAge' (maybe a better name?)

# fillna returns a new Series, so the original 'Age' column stays untouched
df['AllAge'] = df['Age'].fillna(int(df['Age'].mean()))

df.head(20)

In [None]:
# let's check the age histogram again:
df['AllAge'].hist()
P.title('Age histogram')
P.xlabel('Ages')
P.ylabel('Amount of passengers')
P.show()

In [None]:
# improvement: instead of filling both males and females with the same mean value, we will fill males
# with the male mean and females with the female mean
df['AllAge'] = df['Age']

# boolean indexing
male_mean = int(df.loc[df['Gender'] == 1, 'Age'].mean())
female_mean = int(df.loc[df['Gender'] == 0, 'Age'].mean())

print('male mean: %d' % male_mean)
print('female mean: %d' % female_mean)

import numpy as np

# loc selects rows (and columns) by label or by a boolean mask

# each comparison needs its own parentheses, because & binds more tightly than ==
df.loc[(df['Gender'] == 1) & df['Age'].isnull(), 'AllAge'] = male_mean
df.loc[(df['Gender'] == 0) & df['Age'].isnull(), 'AllAge'] = female_mean

# for index, row in df.iterrows():
#     if pd.isnull(row['AllAge']):
#         if row.Gender == 1:
#             df.at[index, 'AllAge'] = male_mean
#         else:
#             df.at[index,'AllAge'] =female_mean

# df = df.drop(['Age'], axis=1)

df.head(20)
# More precise methods:
# 1. Check if the person is a child or not (whether they have siblings or parents aboard)
#    and set the mean age accordingly.
# 2. Check the percentage of each bar in the histogram and fill in the missing values as the mean of the bar
# (so there won't be too much variance)
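
The same per-gender fill can also be written with `groupby` and `transform`, which avoids computing each mean by hand. This is only a sketch on a tiny made-up frame (`toy` and its values are assumptions for illustration), and unlike the cell above it does not round the mean to an integer:

```python
import numpy as np
import pandas as pd

# made-up data: one missing age per gender
toy = pd.DataFrame({
    'Gender': [1, 1, 1, 0, 0, 0],
    'Age':    [20.0, 40.0, np.nan, 10.0, 30.0, np.nan],
})

# mean age of each gender, broadcast back to every row of that gender
group_mean = toy.groupby('Gender')['Age'].transform('mean')

# keep the known ages, fill the missing ones with the group's mean
toy['AllAge'] = toy['Age'].fillna(group_mean)

print(toy)
```

The approach extends to more groups (for example gender and class together) by passing a list of columns to `groupby`.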

In [None]:
# let's add a new feature: FamilySize:
# the motivation is to merge similar features into one
df['FamilySize'] = df['SibSp'] + df['Parch']

# remove the merged features
df = df.drop(['SibSp', 'Parch'], axis=1)

df.head(10)

## Filtering:

In [None]:
# let's find how many children below the age of 10 were on the ship:
below_10 = df[df['AllAge'] < 10]
len(below_10)


In [None]:
# let's check how many of them survived:
below_10['Survived'].value_counts()

In [None]:
# or as a fraction of the total:
below_10['Survived'].value_counts(normalize=True)
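
Filters can also be combined with `&` (and) and `|` (or). Each comparison needs its own parentheses, because `&` binds more tightly than `<` or `==` in Python. A small sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# made-up passengers
toy = pd.DataFrame({
    'AllAge':   [4.0, 8.0, 35.0, 9.0, 60.0],
    'Survived': [1, 0, 1, 1, 0],
})

# children under 10 who survived: note the parentheses around each condition
young_survivors = toy[(toy['AllAge'] < 10) & (toy['Survived'] == 1)]

print(len(young_survivors))
```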

## Pivoting:

In [None]:
# Similar to Excel's PivotTable, pivoting in Pandas enables us to automatically sort,
# count, total, or average the data stored in one table

df.pivot_table(index='Gender', values='Survived', aggfunc=[np.sum, len, np.mean])
# 3 out of every 4 women were saved! looks like "women and children first" works
# (men weren't so lucky)
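
For a single index column, a similar summary can be produced with `groupby` and `agg`. A sketch on a made-up frame (`toy` is illustrative), using string names for the aggregations:

```python
import pandas as pd

# made-up data: 4 women (Gender 0) and 4 men (Gender 1)
toy = pd.DataFrame({
    'Gender':   [0, 0, 0, 0, 1, 1, 1, 1],
    'Survived': [1, 1, 1, 0, 0, 0, 0, 1],
})

# survivors, group size and survival rate per gender
summary = toy.groupby('Gender')['Survived'].agg(['sum', 'count', 'mean'])

print(summary)
```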

In [None]:
# upper class women had a much better chance of surviving!
pd.pivot_table(df,index=['Gender', 'Pclass'], values=['Survived'], aggfunc=[np.sum, len, np.mean])

In [None]:
# let's split the passengers into children (18 and under) and adults (over 18) and see how this affects the
# survival chances

age = pd.cut(df['Age'], [0, 18, 80])
pd.pivot_table(df,index=['Gender', age], values=['Survived'], aggfunc=[np.sum, len, np.mean])
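
By default `pd.cut` builds right-closed intervals, so the bins above are (0, 18] and (18, 80]. A quick sketch on made-up ages shows where the boundary value 18 lands:

```python
import pandas as pd

# made-up ages around the bin edges
ages = pd.Series([0.42, 17.0, 18.0, 19.0, 80.0])

# default right=True: each bin includes its right edge
bins = pd.cut(ages, [0, 18, 80])

# category codes: 0 is the first bin, 1 is the second
print(bins.cat.codes.tolist())
```

Passing `right=False` would make the intervals left-closed instead, in which case the last edge has to be larger than the maximum age to keep it inside a bin.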


## Time for some machine learning!
In this section we use the cleaned data to predict whether a passenger survived, based on the data features


### Splitting the data (train / test):
We need to split the data into train and test sets in order to check how accurate our model is

In [None]:
from sklearn.model_selection import train_test_split

# X is the set of features we want to predict from
X = df.drop(['Survived', 'PassengerId', 'Name'], axis=1)

# y is the target we want to predict
y = df['Survived']

# split the data 80-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

len(X_train), len(X_test), len(y_train), len(y_test)
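
A single train/test split gives only one accuracy estimate. k-fold cross-validation repeats the split several times and averages the scores. A sketch using scikit-learn's `cross_val_score` on synthetic data (the data from `make_classification` and the choice of 5 folds are assumptions, not part of the Titanic set):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic binary classification data, just for the demonstration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# 5 folds: each fold is used once as the test set
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5)

# average accuracy over the folds
print(scores.mean())
```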


### Predicting using Logistic Regression:
Logistic Regression is a classification algorithm. It is used to predict a binary outcome given a set of independent variables.

More info in:

https://en.wikipedia.org/wiki/Logistic_regression

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
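
Under the hood, logistic regression computes a weighted sum of the features and passes it through the sigmoid function to get a probability between 0 and 1. A minimal sketch in plain Python (the weights, intercept and feature values are made up for illustration, not fitted to the Titanic data):

```python
import math

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical weights for two features, plus an intercept
weights = [0.8, -1.5]
intercept = 0.2
features = [1.0, 0.5]

# linear score, then probability of the positive class
score = intercept + sum(w * x for w, x in zip(weights, features))
probability = sigmoid(score)

# predict class 1 when the probability is at least 0.5
prediction = 1 if probability >= 0.5 else 0
print(prediction)
```

Fitting the model means finding the weights and intercept that make these probabilities match the training labels as closely as possible.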


#### Logistic Regression - first try

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression()

# let's try to predict using only the 'AllAge' feature:
predictors = ['AllAge']

# Train the model according to the given training data
logreg.fit(X_train[predictors], y_train)

# Predict class labels for samples in X_test
y_pred = logreg.predict(X_test[predictors])

# compare y_pred with y_test (fraction of correct predictions):
accuracy_score(y_test, y_pred)

#### Logistic Regression - second try

In [None]:
# this time let's try to predict with more features
predictors = ['AllAge', 'Pclass', 'FamilySize', 'Gender', 'Fare']

logreg.fit(X_train[predictors], y_train)

y_pred = logreg.predict(X_test[predictors])

accuracy_score(y_test, y_pred)

### Predicting using Naive Bayes:

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with naive independence assumptions between the features.

https://en.wikipedia.org/wiki/Naive_Bayes_classifier

http://scikit-learn.org/stable/modules/naive_bayes.html
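
For a single numeric feature, Gaussian Naive Bayes scores each class as its prior times the Gaussian likelihood of the value, then picks the class with the larger score. A rough sketch in plain Python (the priors, means and standard deviations are made-up numbers, not estimated from the data):

```python
import math

def gaussian_pdf(x, mean, std):
    # density of a normal distribution at x
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))

# hypothetical per-class parameters for one feature (e.g. a fare)
params = {
    0: {'prior': 0.6, 'mean': 20.0, 'std': 10.0},
    1: {'prior': 0.4, 'mean': 50.0, 'std': 20.0},
}

x = 45.0

# unnormalized posterior for each class: prior * likelihood
scores = {c: p['prior'] * gaussian_pdf(x, p['mean'], p['std'])
          for c, p in params.items()}

# the predicted class is the one with the highest score
predicted = max(scores, key=scores.get)
print(predicted)
```

With several features, the "naive" assumption means the per-feature likelihoods are simply multiplied together.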

In [None]:
# the advantage of Naive Bayes is that it is relatively fast; when the features are relatively independent,
# the result will be better

from sklearn.naive_bayes import GaussianNB

# GaussianNB implements the Gaussian Naive Bayes algorithm for classification. 
# The likelihood of the features is assumed to be Gaussian

gaussian_nb = GaussianNB()

gaussian_nb.fit(X_train[predictors], y_train)
y_pred = gaussian_nb.predict(X_test[predictors])

accuracy_score(y_test, y_pred)

### Predicting using Random Forest:

Random forest is a learning method for classification, regression and other tasks that operates by constructing a multitude of randomized decision trees.

https://en.wikipedia.org/wiki/Random_forest

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

https://medium.com/@Synced/how-random-forest-algorithm-works-in-machine-learning-3c0fe15b6674
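
Once fitted, a `RandomForestClassifier` exposes its individual trees through `estimators_` and a per-feature score through `feature_importances_`. A sketch on synthetic data (the data set and the parameter values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic binary classification data, just for the demonstration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 25 trees; random_state makes the run repeatable
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# one fitted decision tree per estimator
print(len(forest.estimators_))

# importances are normalized to sum to 1
print(forest.feature_importances_)
```

Without a fixed `random_state`, the accuracy of a random forest can change slightly from run to run.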

In [None]:
from sklearn.ensemble import RandomForestClassifier


predictors = ['AllAge', 'Pclass', 'FamilySize', 'Gender', 'Fare']

random_forest = RandomForestClassifier()

random_forest.fit(X_train[predictors], y_train)
y_pred = random_forest.predict(X_test[predictors])

accuracy_score(y_test, y_pred)

0.8666666666666667