# GA Data Science 19 (DAT19) - Class 5
## Developing Mastery of Pandas, Numpy & Bokeh

Justin Breucop (with parts from Craig Sakuma)

## Lab goals

- NumPy: Entering the Matrix
- Pandas: DataFrames as Bamboo
- Bokeh: Picture-Perfect Visuals

## NumPy
As we've seen in lecture, linear algebra is the branch of mathematics that deals with vectors, vector spaces, and the linear maps between them. This matters in practice because a large part of data cleansing is converting data between formats, and many algorithms require their input in a specific shape.

NumPy is a package for scientific computing, built around an N-dimensional array object.

### Creating an array

In [1]:
import numpy as np
a = np.arange(25).reshape(5,5)
# arange(n) is a function that creates a 1-D array of the integers 0 through n-1
# reshape(M,N) is a method that returns the same values arranged as an MxN array
a

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

We can convert lists to arrays. Note, however, that unlike a list, all elements of an array must share the same datatype.

In [2]:
alist = [[ 0,  1,  2,  3,  4],[ 5,  6,  7,  8,  9],[10, 11, 12, 13, 14],[15, 16, 17, 18, 19],[20, 21, 22, 23, 24]]
type(alist)

list

In [3]:
np.array(alist)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
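
The same-datatype rule can be seen directly by converting a list that mixes integers and floats. This is a minimal sketch; the names `mixed`, `grid` and `back` are made up for illustration.

```python
import numpy as np

# Mixed int/float input: every element is upcast to float64
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64

# A nested list of ints keeps an integer dtype and gains a 2-D shape
grid = np.array([[0, 1, 2], [3, 4, 5]])
print(grid.shape)    # (2, 3)

# Converting back gives plain nested Python lists
back = grid.tolist()
print(back)          # [[0, 1, 2], [3, 4, 5]]
```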

In [4]:
biga = a*10
biga

array([[  0,  10,  20,  30,  40],
       [ 50,  60,  70,  80,  90],
       [100, 110, 120, 130, 140],
       [150, 160, 170, 180, 190],
       [200, 210, 220, 230, 240]])

In [5]:
print biga.mean()
print biga.mean(0) #Average per column
biga.mean(1) #average per row
# type(biga.mean(1))

120.0
[ 100.  110.  120.  130.  140.]


array([  20.,   70.,  120.,  170.,  220.])
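
The axis argument names the dimension that gets collapsed, which is easy to mix up. A small sketch on a made-up 2x3 array `m`, where the means can be checked by hand:

```python
import numpy as np

m = np.arange(6).reshape(2, 3) * 10   # [[0, 10, 20], [30, 40, 50]]

col_means = m.mean(axis=0)   # collapse the rows: one mean per column
row_means = m.mean(axis=1)   # collapse the columns: one mean per row

print(col_means)   # [15. 25. 35.]
print(row_means)   # [10. 40.]
```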

In [6]:
bigm = np.matrix(biga-20)
bigm

matrix([[-20, -10,   0,  10,  20],
        [ 30,  40,  50,  60,  70],
        [ 80,  90, 100, 110, 120],
        [130, 140, 150, 160, 170],
        [180, 190, 200, 210, 220]])

In [7]:
np.linalg.inv(biga-20)

array([[ -7.03687442e+13,   2.60439480e+13,   1.21204527e+14,
         -3.90659220e+13,  -3.78138091e+13],
       [ -9.96183206e-02,  -5.76973618e+14,   4.32730213e+14,
          8.65460427e+14,  -7.21217022e+14],
       [  2.74809160e-01,   1.12589991e+15,  -1.12589991e+15,
         -1.12589991e+15,   1.12589991e+15],
       [  2.81474977e+14,  -6.25054753e+14,   4.68791064e+14,
         -1.88317778e+14,   6.31064894e+13],
       [ -2.11106233e+14,   5.00845154e+13,   1.03174102e+14,
          4.87823180e+14,  -4.29975565e+14]])
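
The enormous values in that result are a warning sign: `biga-20` is singular, so it has no true inverse and the numbers are floating-point noise. A hedged sketch of how one might check before inverting, rebuilding the same matrix under the made-up name `m`:

```python
import numpy as np

m = np.arange(25).reshape(5, 5) * 10 - 20   # same values as biga - 20

# Every row is the first row plus a multiple of a constant row,
# so only 2 of the 5 rows are linearly independent
rank = np.linalg.matrix_rank(m)
print(rank)   # 2

# A very large condition number points to the same problem
cond = np.linalg.cond(m)
print(cond)
```

A rank below the number of rows means `np.linalg.inv` either raises an error or, as here, returns values that should not be trusted.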

#### Slices

In [8]:
bigm = np.array(bigm)
bigm[0]

array([-20, -10,   0,  10,  20])

In [9]:
# Same idea on biga, with a colon to select every column of row 0.
# Only the last expression in a cell is displayed, so the output below is all of biga.
biga[0,:]
biga

array([[  0,  10,  20,  30,  40],
       [ 50,  60,  70,  80,  90],
       [100, 110, 120, 130, 140],
       [150, 160, 170, 180, 190],
       [200, 210, 220, 230, 240]])

In [10]:
biga[:,3]

array([ 30,  80, 130, 180, 230])

The same slicing rules extend to arrays with more dimensions.

In [11]:
compa = np.arange(30).reshape(5,3,2)
compa

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15],
        [16, 17]],

       [[18, 19],
        [20, 21],
        [22, 23]],

       [[24, 25],
        [26, 27],
        [28, 29]]])

In [12]:
# let's describe it
print compa.shape
print compa.ndim
print compa.dtype

(5, 3, 2)
3
int64


In [13]:
compa[3,:,1]

array([19, 21, 23])
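
A rule of thumb for reading these slices: an integer index removes its axis, while a colon or a range keeps it. A short sketch that rebuilds `compa` and prints the resulting shapes:

```python
import numpy as np

compa = np.arange(30).reshape(5, 3, 2)

# An integer index drops that axis; a colon keeps it
print(compa[3, :, 1])          # [19 21 23]
print(compa[3, :, 1].shape)    # (3,)

# A range keeps the axis even when it selects a single element
print(compa[3:4, :, 1].shape)  # (1, 3)

# Indexing only the leading axis keeps the trailing axes whole
print(compa[0].shape)          # (3, 2)
```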

In [14]:
compa[0,0,0]

0

In [15]:
compa[0,0,0] = 5.9
compa[0,0,0]

5

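NumPy casts assigned values to the array's existing dtype, so 5.9 is silently truncated to 5 in this integer array, sometimes to our dismay. Converting the array to float first keeps the fractional part: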

In [16]:
compa = compa.astype(float)
compa[0,0,0] = 5.75
compa[0,0,0]

5.75

#### Random Numbers
Random numbers are useful, and at times necessary, for testing data pipelines and running statistical analyses. Functions for creating random values live under `numpy.random`.

In [17]:
# Create a 5x5 array of random values drawn uniformly from [0, 1)
rm = np.random.rand(5,5)
rm

array([[ 0.57972977,  0.20743422,  0.3579575 ,  0.27484107,  0.4575964 ],
       [ 0.00634841,  0.01497859,  0.04118995,  0.60118203,  0.47578861],
       [ 0.09561091,  0.19474204,  0.2844459 ,  0.31167375,  0.99969289],
       [ 0.84037522,  0.7672864 ,  0.41542942,  0.07806972,  0.70847936],
       [ 0.29535343,  0.37504016,  0.32762846,  0.19263531,  0.42336328]])

In [18]:
rm.shape

(5, 5)

In [19]:
print rm.mean()
print rm.mean(0) #Average per column
print rm.mean(1) #average per row

0.373074911746
[ 0.36348355  0.31189628  0.28533024  0.29168038  0.61298411]
[ 0.37551179  0.22789752  0.3772331   0.56192802  0.32280413]


In [20]:
# rand draws from a uniform distribution; for a normal distribution, use np.random.normal(mean, stdev, shape)
rm = np.random.normal(5,9,(30,30))
rm

array([[  8.53827853,  11.04672278,  -4.79064525,  -4.79647129,
          1.14868313,   7.27369609,   7.56292374,   4.27812527,
         19.45497235, -19.18907943,   1.04812048,  -3.36694198,
         14.70893565,   0.16331484,  17.73318639,  13.7093975 ,
         15.46482257,  18.56257846,   9.66226848,  10.39060412,
          4.26983946,   8.18102845,  -7.67996756,  11.90733128,
          7.7275947 ,  19.79623207,   6.74617954,  -2.41239889,
          6.05222098,   4.53970666],
       [ 16.68878774,  12.79018862,  12.00417801,  -0.27024898,
          6.46775796,  -1.14241123,  12.81252704,   4.30546659,
          0.73989995,  -2.03607684,   6.88960308,   8.35929607,
         19.02118143,  -4.36293488,  16.17951785,   7.88007127,
         11.82117014,   0.3107944 ,  12.99032379,   6.73637277,
          4.07849586,  -3.21289108,  -4.21769427,  15.28276487,
         -4.84358198,  -0.47177178,   4.14096844,   3.58298483,
         12.6445205 ,   9.86331946],
       [  5.67255819,   3.7250

In [21]:
print rm.mean(), "which is hopefully close to the input mean"
print rm.var(), "which is the variance (stdev squared), hopefully close to 81"
print np.median(rm)

5.3164341127 which is hopefully close to the input mean
74.2189697114 which is the variance (stdev squared), hopefully close to 81
5.61000604641


Find more distributions and random functions here: http://docs.scipy.org/doc/numpy/reference/routines.random.html
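
Random output changes on every run, which makes results hard to compare. One common approach, sketched here with an arbitrary seed of 42 and the made-up names `rng`, `sample` and `again`, is to draw from a seeded generator:

```python
import numpy as np

rng = np.random.RandomState(42)          # fixed seed: same draws every run
sample = rng.normal(5, 9, size=(300, 300))

print(sample.shape)                      # (300, 300)
print(abs(sample.mean() - 5) < 0.5)      # True: sample mean is near 5
print(abs(sample.var() - 81) < 5)        # True: sample variance is near 9**2

# Re-creating the generator with the same seed reproduces the array exactly
again = np.random.RandomState(42).normal(5, 9, size=(300, 300))
print(np.array_equal(sample, again))     # True
```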

### Exercise 1
1) Create a 4x5 array of integers numbering 0 to 19.

In [22]:
np.arange(20).reshape(4,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

2) Create a 50x500 array of normally distributed values with a mean of 20 and a variance of 100 (a standard deviation of 10). Save it to a variable called  `biggie`

In [23]:
biggie = np.random.normal(20,10,(50,500))
print biggie.shape
print biggie.mean()
print biggie.var()

(50, 500)
20.0532162464
101.402095063


3) Change the mean of the array to a value within 1 of 0 and the variance within 1 of 25. Think about what the mean and the variance represent and try using various mathematical operations.

In [24]:
morph = (biggie - 20)/2
print morph.mean()
print morph.var()

0.0266081232242
25.3505237658
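
The reason this solution works: for constants a and b, the mean of a*x + b is a*mean(x) + b, while the variance is a**2 * var(x). Subtracting 20 moves the mean to 0, and dividing by 2 cuts the variance from 100 to 25. A sketch that checks both identities on freshly generated data (the seed and the names `a`, `b`, `x`, `y` are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(20, 10, size=(50, 500))

a, b = 0.5, -10.0          # (x - 20) / 2 written as a*x + b
y = a * x + b

# Shifting changes only the mean; scaling by a multiplies the variance by a**2
print(np.isclose(y.mean(), a * x.mean() + b))   # True
print(np.isclose(y.var(), a ** 2 * x.var()))    # True
```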


## Pandas: DataFrames as Bamboo
You've already been exposed to dataframes in the previous labs, so let's look at how we can work with them.

In [25]:
import pandas as pd

data = pd.read_csv("../data/titanic.csv")
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [26]:
data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [27]:
data[data.Age>65]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
672,673,0,2,"Mitchell, Mr. Henry Michael",male,70.0,0,0,C.A. 24580,10.5,,S
745,746,0,1,"Crosby, Capt. Edward Gifford",male,70.0,1,1,WE/P 5735,71.0,B22,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S


In [28]:
data[(data.Age==11)&(data.SibSp==5)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
59,60,0,3,"Goodwin, Master. William Frederick",male,11.0,5,2,CA 2144,46.9,,S


In [29]:
data[(data.Age==11)|(data.SibSp==5)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
59,60,0,3,"Goodwin, Master. William Frederick",male,11.0,5,2,CA 2144,46.9,,S
71,72,0,3,"Goodwin, Miss. Lillian Amy",female,16.0,5,2,CA 2144,46.9,,S
386,387,0,3,"Goodwin, Master. Sidney Leonard",male,1.0,5,2,CA 2144,46.9,,S
480,481,0,3,"Goodwin, Master. Harold Victor",male,9.0,5,2,CA 2144,46.9,,S
542,543,0,3,"Andersson, Miss. Sigrid Elisabeth",female,11.0,4,2,347082,31.275,,S
683,684,0,3,"Goodwin, Mr. Charles Edward",male,14.0,5,2,CA 2144,46.9,,S
731,732,0,3,"Hassan, Mr. Houssein G N",male,11.0,0,0,2699,18.7875,,C
802,803,1,1,"Carter, Master. William Thornton II",male,11.0,1,2,113760,120.0,B96 B98,S
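
The filters above combine boolean masks with `&` (and) and `|` (or). Each comparison needs its own parentheses because `&` and `|` bind more tightly than `==`. A minimal sketch on a made-up four-row frame `toy`, which also shows `~` for negation and that a missing value never matches a comparison:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Age': [11.0, 16.0, np.nan, 11.0],
                    'SibSp': [5, 5, 0, 0]})

both = toy[(toy.Age == 11) & (toy.SibSp == 5)]
either = toy[(toy.Age == 11) | (toy.SibSp == 5)]
neither = toy[~((toy.Age == 11) | (toy.SibSp == 5))]

print(list(both.index))      # [0]
print(list(either.index))    # [0, 1, 3]
print(list(neither.index))   # [2]
```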


### Cleaning Data

In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


#### Working with nulls
One option is to exclude the rows that have missing values:

In [31]:
# data[data.Age.isnull()]
data[data.Age.notnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S


In [32]:
# You can also just replace the nulls (this returns a new Series; data itself is unchanged)
data.Age[data.Age.isnull()].fillna(0)

5      0.0
17     0.0
19     0.0
26     0.0
28     0.0
29     0.0
31     0.0
32     0.0
36     0.0
42     0.0
45     0.0
46     0.0
47     0.0
48     0.0
55     0.0
64     0.0
65     0.0
76     0.0
77     0.0
82     0.0
87     0.0
95     0.0
101    0.0
107    0.0
109    0.0
121    0.0
126    0.0
128    0.0
140    0.0
154    0.0
      ... 
718    0.0
727    0.0
732    0.0
738    0.0
739    0.0
740    0.0
760    0.0
766    0.0
768    0.0
773    0.0
776    0.0
778    0.0
783    0.0
790    0.0
792    0.0
793    0.0
815    0.0
825    0.0
826    0.0
828    0.0
832    0.0
837    0.0
839    0.0
846    0.0
849    0.0
859    0.0
863    0.0
868    0.0
878    0.0
888    0.0
Name: Age, dtype: float64

In [33]:
# Replace with the mean, which keeps the column mean unchanged (though it shrinks the variance)
avg_age = data.Age[data.Age.notnull()].mean()
print avg_age
data.Age.fillna(avg_age)

29.6991176471


0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
5      29.699118
6      54.000000
7       2.000000
8      27.000000
9      14.000000
10      4.000000
11     58.000000
12     20.000000
13     39.000000
14     14.000000
15     55.000000
16      2.000000
17     29.699118
18     31.000000
19     29.699118
20     35.000000
21     34.000000
22     15.000000
23     28.000000
24      8.000000
25     38.000000
26     29.699118
27     19.000000
28     29.699118
29     29.699118
         ...    
861    21.000000
862    48.000000
863    29.699118
864    24.000000
865    42.000000
866    27.000000
867    31.000000
868    29.699118
869     4.000000
870    26.000000
871    47.000000
872    33.000000
873    47.000000
874    28.000000
875    15.000000
876    20.000000
877    19.000000
878    29.699118
879    56.000000
880    25.000000
881    33.000000
882    22.000000
883    28.000000
884    25.000000
885    39.000000
886    27.000000
887    19.000000
888    29.6991

#### Replace with random normal distribution

In [34]:
# Get values of mean and standard deviation
data.Age[data.Age.notnull()].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [35]:
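# Replace null values with a draw from a normal distribution.
# Caution: without a size argument, np.random.normal returns a single number,
# so every null is filled with the same value (44.580741 in this run).
data.Age.fillna(np.random.normal(29.7,14.5),inplace=True)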

In [36]:
# The nulls were already filled in place above, so this fillna changes nothing
data.Age.fillna(np.random.normal(29.7,14.5)).describe()

count    891.000000
mean      32.655400
std       14.294988
min        0.420000
25%       22.000000
50%       32.000000
75%       44.580741
max       80.000000
Name: Age, dtype: float64
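
Note the repeated 44.580741 at the 75% mark: a single call to `np.random.normal(29.7,14.5)` produces one number, so all the nulls received the same age. If the goal is a separate random age for each missing row, one possible approach is to request as many draws as there are nulls. This is a sketch on a made-up Series `ages`, not the method used in the cells above, and draws from a normal distribution can come out negative, which real ages cannot:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0, np.nan])

missing = ages.isnull()
# size=... asks for one independent draw per missing value
draws = rng.normal(29.7, 14.5, size=missing.sum())

filled = ages.copy()
filled.loc[missing] = draws

print(filled.isnull().sum())        # 0
print(filled[missing].nunique())    # 3
```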

### Convert categorical data to numerical

In [37]:
data.Sex=='female'

0      False
1       True
2       True
3       True
4      False
5      False
6      False
7      False
8       True
9       True
10      True
11      True
12     False
13     False
14      True
15      True
16     False
17     False
18      True
19      True
20     False
21     False
22      True
23     False
24      True
25      True
26     False
27     False
28      True
29     False
       ...  
861    False
862     True
863     True
864    False
865     True
866     True
867    False
868    False
869    False
870    False
871     True
872    False
873    False
874     True
875     True
876    False
877    False
878    False
879     True
880     True
881    False
882     True
883    False
884    False
885     True
886    False
887     True
888     True
889    False
890    False
Name: Sex, dtype: bool

In [38]:
data.rename(columns={'Sex':'Is Female'},inplace=True)
data['Is Female']=data['Is Female']=='female'
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Is Female,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",False,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",True,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",True,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",True,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",False,35.0,0,0,373450,8.05,,S


In [39]:
# get unique values of Embarked
data.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [40]:
# replace values with numbers
data.Embarked.replace(['S', 'C', 'Q'],[1,2,3],inplace=True)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Is Female,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",False,22.0,1,0,A/5 21171,7.25,,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",True,38.0,1,0,PC 17599,71.2833,C85,2.0
2,3,1,3,"Heikkinen, Miss. Laina",True,26.0,0,0,STON/O2. 3101282,7.925,,1.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",True,35.0,1,0,113803,53.1,C123,1.0
4,5,0,3,"Allen, Mr. William Henry",False,35.0,0,0,373450,8.05,,1.0
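
The codes show up as 1.0 rather than 1 because the column still contains missing values, and NaN forces a float dtype. A small sketch of the same conversion using `map` with a dictionary, on a made-up Series `ports`:

```python
import numpy as np
import pandas as pd

ports = pd.Series(['S', 'C', 'Q', np.nan, 'S'])

# Values with no entry in the dictionary (here the NaN) stay missing
codes = ports.map({'S': 1, 'C': 2, 'Q': 3})
print(codes.tolist())    # [1.0, 2.0, 3.0, nan, 1.0]
print(codes.dtype)       # float64
```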


### Selecting with .loc, .iloc, & .ix

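Selecting data in pandas can be tricky. The main takeaway is that .loc selects by index label, .iloc selects by integer position, and .ix accepts a mix of both. Note that .ix was deprecated in pandas 0.20 and removed in 1.0, so prefer .loc and .iloc in new code.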

In [41]:
df = pd.DataFrame(np.random.randn(6,4),index=list('abcdef'),columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
a,-1.327672,-0.293902,0.180261,0.09625
b,-0.046679,-2.684191,1.241858,1.097784
c,0.402153,-2.176681,1.426811,-1.678443
d,0.732103,-0.450226,-0.86618,-1.931541
e,0.114139,-0.16057,-0.757395,-0.321668
f,1.028489,-0.121512,0.975229,1.319535


In [42]:
df.loc['f']

A    1.028489
B   -0.121512
C    0.975229
D    1.319535
Name: f, dtype: float64

In [43]:
df.iloc[len(df.index)-1]

A    1.028489
B   -0.121512
C    0.975229
D    1.319535
Name: f, dtype: float64

In [44]:
df.A.ix['f'] == df.A.ix[-1]

True

In [45]:
# Negative positions count back from the end, just as they do in a plain Python list
cc = list('cookies')
cc[-4]

'k'
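
The mixed lookup can be written with `.loc` and `.iloc` alone. A sketch on a made-up frame with the same labels, using integer values so the result is easy to check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=list('abcdef'), columns=list('ABCD'))

by_label = df.loc['f', 'A']        # row label 'f', column label 'A'
by_position = df.iloc[-1, 0]       # last row, first column
print(by_label == by_position)     # True

# To mix a position with a label, translate the position to a label first
mixed = df.loc[df.index[-1], 'A']
print(mixed)                       # 20
```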

### Group by

In [46]:
# Find average age of passengers that survived vs. died
data.groupby('Survived')['Age'].mean()

Survived
0    33.803447
1    30.812481
Name: Age, dtype: float64

In [47]:
# Count passengers by sex (True = female)
data.groupby('Is Female')['PassengerId'].count()

Is Female
False    577
True     314
Name: PassengerId, dtype: int64

In [48]:
data.groupby(['Survived','Pclass'])['PassengerId'].count()

Survived  Pclass
0         1          80
          2          97
          3         372
1         1         136
          2          87
          3         119
Name: PassengerId, dtype: int64
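
A two-key count like this is often easier to read as a grid. One way to get there, sketched on a made-up six-row frame `toy`, is `size()` followed by `unstack()`; `agg` can also return several statistics in one call:

```python
import pandas as pd

toy = pd.DataFrame({'Survived': [0, 0, 1, 1, 1, 0],
                    'Pclass':   [3, 3, 1, 1, 2, 1],
                    'Fare':     [7.0, 8.0, 70.0, 50.0, 13.0, 30.0]})

# size() counts rows per group; unstack() pivots Pclass into columns
table = toy.groupby(['Survived', 'Pclass']).size().unstack(fill_value=0)
print(table)

# Several aggregates at once for a single column
summary = toy.groupby('Survived')['Fare'].agg(['count', 'mean'])
print(summary)
```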

### Apply

In [49]:
# Convert ticket prices to USD, assuming fares in pounds and a rate of 1.6 dollars per pound
data.Fare.apply(lambda x: x*1.6)

0       11.60000
1      114.05328
2       12.68000
3       84.96000
4       12.88000
5       13.53328
6       82.98000
7       33.72000
8       17.81328
9       48.11328
10      26.72000
11      42.48000
12      12.88000
13      50.04000
14      12.56672
15      25.60000
16      46.60000
17      20.80000
18      28.80000
19      11.56000
20      41.60000
21      20.80000
22      12.84672
23      56.80000
24      33.72000
25      50.22000
26      11.56000
27     420.80000
28      12.60672
29      12.63328
         ...    
861     18.40000
862     41.48672
863    111.28000
864     20.80000
865     20.80000
866     22.17328
867     80.79328
868     15.20000
869     17.81328
870     12.63328
871     84.08672
872      8.00000
873     14.40000
874     38.40000
875     11.56000
876     15.75328
877     12.63328
878     12.63328
879    133.05328
880     41.60000
881     12.63328
882     16.82672
883     16.80000
884     11.28000
885     46.60000
886     20.80000
887     48.00000
888     37.520

In [50]:
data.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
6                                McCarthy, Mr. Timothy J
7                         Palsson, Master. Gosta Leonard
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
16                                  Rice, Master. Eugene
17                          Wil

In [51]:
data.Name.apply(lambda x: x.split(",")[0])

0               Braund
1              Cumings
2            Heikkinen
3             Futrelle
4                Allen
5                Moran
6             McCarthy
7              Palsson
8              Johnson
9               Nasser
10           Sandstrom
11             Bonnell
12         Saundercock
13           Andersson
14             Vestrom
15             Hewlett
16                Rice
17            Williams
18       Vander Planke
19          Masselmani
20              Fynney
21             Beesley
22             McGowan
23              Sloper
24             Palsson
25             Asplund
26                Emir
27             Fortune
28             O'Dwyer
29            Todoroff
            ...       
861              Giles
862              Swift
863               Sage
864               Gill
865            Bystrom
866       Duran y More
867           Roebling
868      van Melkebeke
869            Johnson
870             Balkic
871           Beckwith
872           Carlsson
873    Vand
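
`apply` calls a Python function once per element. For simple arithmetic and string operations, pandas also offers vectorized equivalents that give the same result. A sketch on two made-up Series, `names` and `fares`:

```python
import numpy as np
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])
fares = pd.Series([7.25, 7.925])

# Element-wise apply, as in the cells above
surname_apply = names.apply(lambda x: x.split(',')[0])

# Vectorized string accessor: same result without a lambda
surname_str = names.str.split(',').str[0]
print(surname_str.tolist())                              # ['Braund', 'Heikkinen']
print(surname_apply.tolist() == surname_str.tolist())    # True

# Arithmetic broadcasts over the whole Series
scaled = fares * 1.6
print(np.allclose(scaled, fares.apply(lambda x: x * 1.6)))   # True
```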

### Concatenate

In [52]:
# Despite the name, this is only the first 10 rows
data_first_half = data.iloc[0:10,:]
data_first_half.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
PassengerId    10 non-null int64
Survived       10 non-null int64
Pclass         10 non-null int64
Name           10 non-null object
Is Female      10 non-null bool
Age            10 non-null float64
SibSp          10 non-null int64
Parch          10 non-null int64
Ticket         10 non-null object
Fare           10 non-null float64
Cabin          3 non-null object
Embarked       10 non-null float64
dtypes: bool(1), float64(3), int64(5), object(3)
memory usage: 962.0+ bytes


In [53]:
data_second_half = data.iloc[10:,:]

remake_data = pd.concat([data_first_half,data_second_half])
remake_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Is Female      891 non-null bool
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null float64
dtypes: bool(1), float64(3), int64(5), object(3)
memory usage: 84.4+ KB
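
`pd.concat` stacks the pieces in the order given and keeps their original index labels. A sketch on a made-up six-row frame, including the `ignore_index=True` option for renumbering the rows when the pieces arrive out of order:

```python
import pandas as pd

df = pd.DataFrame({'PassengerId': range(1, 7),
                   'Fare': [7.25, 71.28, 7.92, 53.1, 8.05, 8.46]})

top = df.iloc[:2]
rest = df.iloc[2:]

rebuilt = pd.concat([top, rest])
print(list(rebuilt.index) == list(df.index))   # True

# In the other order the old labels are kept...
swapped = pd.concat([rest, top])
print(list(swapped.index))                     # [2, 3, 4, 5, 0, 1]

# ...unless ignore_index=True assigns a fresh 0..n-1 index
renumbered = pd.concat([rest, top], ignore_index=True)
print(list(renumbered.index))                  # [0, 1, 2, 3, 4, 5]
```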


### Exercise 2
1) Replace Pclass numbers with 'First Class', 'Second Class', 'Third Class'

In [54]:
data['Pclass'] = data.Pclass.map({1: 'First Class', 2:'Second Class', 3:'Third Class'})
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Is Female,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,Third Class,"Braund, Mr. Owen Harris",False,22.000000,1,0,A/5 21171,7.2500,,1.0
1,2,1,First Class,"Cumings, Mrs. John Bradley (Florence Briggs Th...",True,38.000000,1,0,PC 17599,71.2833,C85,2.0
2,3,1,Third Class,"Heikkinen, Miss. Laina",True,26.000000,0,0,STON/O2. 3101282,7.9250,,1.0
3,4,1,First Class,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",True,35.000000,1,0,113803,53.1000,C123,1.0
4,5,0,Third Class,"Allen, Mr. William Henry",False,35.000000,0,0,373450,8.0500,,1.0
5,6,0,Third Class,"Moran, Mr. James",False,44.580741,0,0,330877,8.4583,,3.0
6,7,0,First Class,"McCarthy, Mr. Timothy J",False,54.000000,0,0,17463,51.8625,E46,1.0
7,8,0,Third Class,"Palsson, Master. Gosta Leonard",False,2.000000,3,1,349909,21.0750,,1.0
8,9,1,Third Class,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",True,27.000000,0,2,347742,11.1333,,1.0
9,10,1,Second Class,"Nasser, Mrs. Nicholas (Adele Achem)",True,14.000000,1,0,237736,30.0708,,2.0


2) What was the average ticket price for survivors vs. dead passengers?

In [55]:
data.groupby('Survived')['Fare'].mean()

Survived
0    22.117887
1    48.395408
Name: Fare, dtype: float64

### Bonus!!!
Round all ages to the nearest year using `apply`

In [56]:
data.Age.apply(lambda x: int(round(x)))

0      22
1      38
2      26
3      35
4      35
5      45
6      54
7       2
8      27
9      14
10      4
11     58
12     20
13     39
14     14
15     55
16      2
17     45
18     31
19     45
20     35
21     34
22     15
23     28
24      8
25     38
26     45
27     19
28     45
29     45
       ..
861    21
862    48
863    45
864    24
865    42
866    27
867    31
868    45
869     4
870    26
871    47
872    33
873    47
874    28
875    15
876    20
877    19
878    45
879    56
880    25
881    33
882    22
883    28
884    25
885    39
886    27
887    19
888    45
889    26
890    32
Name: Age, dtype: int64

## Bokeh: Picture-Perfect Visuals

To install Bokeh, go to a terminal and type:

`conda install bokeh` 

Bokeh is built by the same people who created Anaconda (Continuum Analytics) and is designed for web display out of the box, which makes it well suited to creating presentation-ready, interactive visuals quickly. Labs in this course will be shown in Bokeh. Check out http://bokeh.pydata.org/en/latest/docs/quickstart.html#concepts to see some of the range of capabilities.

In [57]:
from bokeh.plotting import figure,output_notebook,show,vplot
output_notebook()

In [58]:
import pandas.io.data
import datetime
# Note: the variable is named aapl, but the ticker requested is 'FB'
aapl = pd.io.data.get_data_yahoo('FB', 
                                 start=datetime.datetime(2015, 4, 1), 
                                 end=datetime.datetime(2015, 4, 28))


The pandas.io.data module is moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.


In [59]:
# prepare some data
x = aapl.Low
y = aapl.High

# create a new plot with a title and axis labels
p = figure(title="Stock High vs. Low", x_axis_label='Low', y_axis_label='High')

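# These are glyphs: one circle per trading day, plus a reference line
# through the origin with slope mean(High)/mean(Low)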
p.circle(x, y,size=30,alpha=.5,)
p.line(x,x*y.mean()/x.mean())

# show the results
show(p)

At its core, Bokeh is built from plots and glyphs. A plot is created with the `figure` function, and glyphs are the visual shapes (circles, lines, quads) added to it. The resulting visuals are scalable, interactive and savable. Glyph properties can also be vectorized, for example by passing one color per point.

In [60]:
# prepare some data
N = 4000
x = np.random.random(size=N) * 100
y = np.random.random(size=N) * 100
radii = np.random.random(size=N) * 1.5
colors = ["#%02x%02x%02x" % (r, g, 150) for r, g in zip(np.floor(50+2*x), np.floor(30+2*y))]

TOOLS="resize,crosshair,pan,wheel_zoom,box_zoom,reset,box_select,lasso_select"

# create a new plot with the tools above, and explicit ranges
p = figure(tools=TOOLS, x_range=(0,100), y_range=(0,100))

# add a circle renderer with vectorized colors and sizes
p.circle(x,y, radius=radii, fill_color=colors, fill_alpha=0.6, line_color=None)

# show the results
show(p)
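
The `colors` list holds one hex string per point, built from each point's x and y values with the blue channel fixed at 150. The sketch below isolates just that step, with no Bokeh required. It casts the channel values to integers first, because on Python 3 the `%02x` format rejects floats; the seed and the names `reds` and `greens` are illustrative.

```python
import numpy as np

rng = np.random.RandomState(1)
x = rng.random_sample(5) * 100
y = rng.random_sample(5) * 100

# Cast to int before formatting: %02x expects an integer on Python 3
reds = np.floor(50 + 2 * x).astype(int)     # between 50 and 249
greens = np.floor(30 + 2 * y).astype(int)   # between 30 and 229
colors = ["#%02x%02x%02x" % (r, g, 150) for r, g in zip(reds, greens)]

print(colors[0])
```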

In [61]:
p1 = figure(title="Titanic Ages Dead",x_axis_label = 'Age',y_axis_label = 'Count')
# Construct the histogram; without density=True, hist holds the raw count in each bin
hist, edges = np.histogram(data.Age[data.Survived==0].values, bins=50)
# Add the bars: one quad per bin, as tall as the number of passengers in it
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],line_color='black')

p2 = figure(title="Titanic Ages Survived",x_axis_label = 'Age',y_axis_label = 'Count')

hist, edges = np.histogram(data.Age[data.Survived==1].values, bins=50)
p2.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],line_color='black')


show(vplot(p1,p2))
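
`np.histogram` can return either raw counts per bin or, with `density=True`, a density that integrates to 1. The two differ by a factor of the sample size times the bin width, which matters when the axis is labelled 'Count'. A NumPy-only sketch on made-up ages:

```python
import numpy as np

rng = np.random.RandomState(0)
ages = rng.uniform(0, 80, size=200)

counts, edges = np.histogram(ages, bins=10)
density, _ = np.histogram(ages, bins=10, density=True)
widths = np.diff(edges)

print(counts.sum())                                        # 200
# density = counts / (N * bin width), so this recovers the counts
print(np.allclose(density * widths * len(ages), counts))   # True
```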