Most of the time in data analysis and modeling is spent on data preparation. Pandas provides high-performance core manipulations and algorithms that make it easy to wrangle data into the right form.

### Combining and Merging Data Sets

Data contained in pandas objects can be merged in a number of ways:

- pandas.merge: connects rows in DataFrames based on one or more keys (similar to joins in SQL)

- pandas.concat: glues or stacks together objects along an axis

- combine_first: instance method that splices together overlapping data to fill in missing values in one object with values from another

#### Database-style DataFrame Merges



In [160]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

In [161]:
df1= DataFrame({'key': ['b','b','a','c','a','a','b'],
                'data1': range(7)})

df2 = DataFrame({'key': ['a','b','d'],
                 'data2': range(3)})

df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [162]:
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,d


In [163]:
# By default, merge performs an inner join; specify other join types with how=

pd.merge(df1, df2, on='key')

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0
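
The `how` argument accepts `'inner'`, `'left'`, `'right'`, and `'outer'`. The following is a minimal sketch, not part of the original notebook: it rebuilds the two small frames under the hypothetical names `left_df` and `right_df`, counts the rows each join type produces, and uses the `indicator=True` option of `merge` to label where each row of an outer join came from.

```python
import pandas as pd

# Hypothetical copies of the two frames used in this section
left_df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': range(7)})
right_df = pd.DataFrame({'key': ['a', 'b', 'd'],
                         'data2': range(3)})

# Number of rows produced by each join type
row_counts = {how: len(pd.merge(left_df, right_df, on='key', how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
outer = pd.merge(left_df, right_df, on='key', how='outer', indicator=True)
print(outer['_merge'].value_counts())
```

The `_merge` column is a convenient way to audit which keys failed to match before deciding which join type to keep.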


If column names are different in each object, we can specify them separately:

In [164]:
df3=DataFrame({'lkey': ['b','b','a','c','a','a','b'],
               'data1': range(7)})

df4=DataFrame({'rkey': ['a','b','d'],
               'data2': range(3)})

df3

Unnamed: 0,data1,lkey
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [165]:
df4

Unnamed: 0,data2,rkey
0,0,a
1,1,b
2,2,d


In [166]:
pd.merge(df3,df4, left_on='lkey', right_on='rkey')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,1,b
1,1,b,1,b
2,6,b,1,b
3,2,a,0,a
4,4,a,0,a
5,5,a,0,a


In [167]:
# Outer Join

pd.merge(df1, df2, how='outer')

Unnamed: 0,data1,key,data2
0,0.0,b,1.0
1,1.0,b,1.0
2,6.0,b,1.0
3,2.0,a,0.0
4,4.0,a,0.0
5,5.0,a,0.0
6,3.0,c,
7,,d,2.0


In [168]:
# Many-to-Many merges

df1=DataFrame({'key':['b','b','a','c','a','b'],
               'data1': range(6)})

df2=DataFrame({'key': ['a','b','a','b','d'],
               'data2': range(5)})

df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,b


In [169]:
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,a
3,3,b
4,4,d


In [170]:
# Returns a cartesian product of the rows. Since there are 3 b's in the left df and 2 on the right df, there are 6 b's.
pd.merge(df1,df2, on='key', how='left')

Unnamed: 0,data1,key,data2
0,0,b,1.0
1,0,b,3.0
2,1,b,1.0
3,1,b,3.0
4,2,a,0.0
5,2,a,2.0
6,3,c,
7,4,a,0.0
8,4,a,2.0
9,5,b,1.0
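
The Cartesian-product behavior can be checked with a little arithmetic: for each shared key, the number of output rows is the count on the left times the count on the right. The sketch below assumes the same two frames (renamed `left_df` and `right_df` for illustration) and compares that prediction with the length of an inner join.

```python
import pandas as pd

left_df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                        'data1': range(6)})
right_df = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                         'data2': range(5)})

# How often each key occurs on each side
left_counts = left_df['key'].value_counts()
right_counts = right_df['key'].value_counts()

# Only keys present on both sides contribute rows to an inner join
shared_keys = left_counts.index.intersection(right_counts.index)
expected_rows = int((left_counts[shared_keys] * right_counts[shared_keys]).sum())

inner = pd.merge(left_df, right_df, on='key', how='inner')
print(expected_rows, len(inner))
```

A left join adds one further row for every left key that has no match on the right (here, `'c'`).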


In [171]:
# To merge with multiple keys, pass a list of column names:

left = DataFrame({'key1': ['foo','foo','bar'],
                 'key2': ['one','two','one'],
                 'lval': [1,2,3]})

right= DataFrame({'key1': ['foo','foo','bar','bar'],
                  'key2': ['one','one','one','two'],
                  'rval': [4,5,6,7]})

pd.merge(left, right, on=['key1','key2'], how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0


For overlapping column names, you can address the overlap manually, or use the `suffixes` option of `merge` to specify strings to append to overlapping names in the left and right DataFrame objects:

In [172]:
pd.merge(left,right, on='key1')

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


In [173]:
pd.merge(left,right,on='key1',suffixes=('_left','_right'))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


### Merging on Index

In some cases, the merge keys in a DataFrame will be found in its index. If so, pass `left_index=True` or `right_index=True` (or both) to indicate that the index should be used as the merge key.

In [174]:
left1=DataFrame({'key':['a','b','a','a','b','c'],
                 'value': range(6)})

right1=DataFrame({'group_val':[3.5,7]}, index=['a','b'])

left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [175]:
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [176]:
pd.merge(left1,right1,right_index=True, left_on='key')

# The default merge method intersects the join keys (inner join)

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


In [177]:
pd.merge(left1,right1,left_on='key',right_index=True, how='outer')

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0
5,c,5,


In [178]:
left2= DataFrame([[1.,2.],[3.,4.],[5.,6.]], index=['a','c','e'],
                columns=['Ohio','Nevada'])

right2=DataFrame([[7.,8.], [9.,10.],[11.,12.],[13,14]],
                 index=['b','c','d','e'], columns = ['Missouri','Alabama'])

left2

Unnamed: 0,Ohio,Nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [179]:
right2

Unnamed: 0,Missouri,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [180]:
pd.merge(left2,right2, how='outer',left_index=True, right_index=True)

# Using the indexes of both sides of the merge

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


For merging by index, you can use the `join` instance method instead. It can also be used to combine many DataFrames that have the same or similar indexes but non-overlapping columns.

In [181]:
# Equivalent of previous example but using join

left2.join(right2,how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


In [182]:
print(left1)
print(right1)
left1.join(right1,on='key')

  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
   group_val
a        3.5
b        7.0


Unnamed: 0,key,value,group_val
0,a,0,3.5
1,b,1,7.0
2,a,2,3.5
3,a,3,3.5
4,b,4,7.0
5,c,5,
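
`join` also accepts a list of DataFrames, which covers the case of combining several frames with similar indexes and non-overlapping columns. A minimal sketch follows; the third frame, `another`, is hypothetical and introduced only for this illustration.

```python
import pandas as pd

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])
# Hypothetical third frame with its own, non-overlapping columns
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York', 'Oregon'])

# The default (left) join keeps only the index labels of left2
joined_left = left2.join([right2, another])

# An outer join keeps the union of all index labels
joined_outer = left2.join([right2, another], how='outer')
print(joined_outer)
```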


### Concatenating Along an Axis

Another kind of data combination is referred to as concatenation, binding, or stacking. NumPy has a `concatenate` function for doing this with raw NumPy arrays:

In [183]:
arr=np.arange(12).reshape((3,4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [184]:
np.concatenate([arr,arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [185]:
np.concatenate([arr,arr], axis=0)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [186]:
# Using concat to glue objects together. By default, it works along axis=0

s1= Series([0,1], index=['a','b'])
s2=Series([2,3,4], index=['c','d','e'])
s3=Series([5,6], index=['f','g'])

pd.concat([s1,s2,s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

In [187]:
pd.concat([s1,s2,s3], axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [188]:
pd.concat([s1,s2,s3], join='inner')

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
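
The `join` option controls how labels on the non-concatenation axis are combined, so its effect only shows up when Series with overlapping indexes are concatenated along `axis=1`. A small sketch, using a hypothetical Series `s4` that shares the labels `'a'` and `'b'` with `s1`:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1 * 5, s3])   # hypothetical Series with index a, b, f, g

# Default join='outer': union of the index labels, NaN where s1 has no value
outer = pd.concat([s1, s4], axis=1)

# join='inner': only the labels present in both Series survive
inner = pd.concat([s1, s4], axis=1, join='inner')
print(inner)
```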

In [189]:
results = pd.concat([s1,s2,s3], keys=['one','two','three'])

results # created a hierarchical index on the concatenation axis distinguishing where the data came from

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64

In [190]:
results.unstack()

Unnamed: 0,a,b,c,d,e,f,g
one,0.0,1.0,,,,,
two,,,2.0,3.0,4.0,,
three,,,,,,5.0,6.0


When combining Series along axis=1, the keys become the DataFrame column headers:

In [191]:
pd.concat([s1,s2,s3], axis=1, keys=['one','two','three'])

Unnamed: 0,one,two,three
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [192]:
df1 = DataFrame(np.arange(6).reshape(3,2), index=['a','b','c'],
                columns=['one','two'])

df2=DataFrame(5 + np.arange(4).reshape(2,2), index=['a','c'],
              columns=['three','four'])

pd.concat([df1, df2], axis=1, keys=['level1','level2']) # the number of keys should match the number of objects being concatenated

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


If we pass a dict of objects instead of a list, the dict's keys will be used for the `keys` option:

In [193]:
pd.concat({'level1':df1,'level2':df2}, axis=1)

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


We can also name our axes:

In [194]:
pd.concat([df1,df2], axis=1, keys=['level1','level2'],
          names=['upper','lower'])

upper,level1,level1,level2,level2
lower,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


Consider a DataFrame in which the row index is not meaningful in the context of the analysis. We can discard the existing indexes by passing `ignore_index=True`:

In [195]:
df1=DataFrame(np.random.randn(3,4), columns=['a','b','c','d'])
df2=DataFrame(np.random.randn(2,3), columns=['b','d','a'])

df1

Unnamed: 0,a,b,c,d
0,-0.212653,0.241883,0.888291,0.006326
1,0.133292,0.751779,0.255386,-0.867594
2,1.309691,-0.739512,-2.486909,0.401454


In [196]:
df2

Unnamed: 0,b,d,a
0,-0.777135,0.315455,1.114322
1,1.332383,1.881511,-1.809813


In [197]:
pd.concat([df1,df2], ignore_index=False)

Unnamed: 0,a,b,c,d
0,-0.212653,0.241883,0.888291,0.006326
1,0.133292,0.751779,0.255386,-0.867594
2,1.309691,-0.739512,-2.486909,0.401454
0,1.114322,-0.777135,,0.315455
1,-1.809813,1.332383,,1.881511


In [198]:
pd.concat([df1,df2], ignore_index=True)

Unnamed: 0,a,b,c,d
0,-0.212653,0.241883,0.888291,0.006326
1,0.133292,0.751779,0.255386,-0.867594
2,1.309691,-0.739512,-2.486909,0.401454
3,1.114322,-0.777135,,0.315455
4,-1.809813,1.332383,,1.881511


#### Combining Data with Overlap

We may have two datasets whose indexes overlap in full or in part (see the example below). Series has a `combine_first` method, which performs the equivalent of an if-else operation plus data alignment:

In [199]:
a = Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],
           index=['f','e','d','c','b','a'])

b= Series(np.arange(len(a), dtype=np.float64),
          index=['f','e','d','c','b','a'])

b.iloc[-1] = np.nan  # set the last element by position

a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [200]:
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

In [201]:
print(b[:-2])

print(a[2:])

f    0.0
e    1.0
d    2.0
c    3.0
dtype: float64
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64


In [202]:
b[:-2].combine_first(a[2:])

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64

In [203]:
b.combine_first(a) # Keep b's values; wherever b is null, fill in the value from a

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

In [204]:
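# The array equivalent of a.combine_first(b): take b where a is null, else a
# (note that the roles of a and b are swapped relative to the previous cell)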

np.where(pd.isnull(a),b,a)

array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])
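
The relationship between `combine_first` and `np.where` can be checked directly. In this sketch the caller (`a`) keeps its own non-null values and takes values from the argument (`b`) only where it is null, which is what `np.where(pd.isnull(a), b, a)` computes. The comparison assumes both Series share the same index, since `np.where` works on raw arrays and does no alignment.

```python
import numpy as np
import pandas as pd

labels = ['f', 'e', 'd', 'c', 'b', 'a']
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=labels)
b = pd.Series([0., 1., 2., 3., 4., np.nan], index=labels)

# Keep a's values and fall back to b wherever a is null
filled = a.combine_first(b)

# Same logic on the raw arrays (no index alignment happens here)
manual = np.where(a.isnull(), b, a)

# Reindex to the original label order before comparing position by position
same = np.allclose(filled.reindex(labels).values, manual, equal_nan=True)
print(same)
```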

In [205]:
# With Dataframes

df1= DataFrame({'a':[1.,np.nan,5.,np.nan],
                'b':[np.nan,2.,np.nan,6],
                'c':range(2,18,4)})

df2=DataFrame({'a':[5.,4.,np.nan,3.,7.],
               'b':[np.nan,3.,4.,6.,8.]})
print(df1)
print(df2)
df1.combine_first(df2) # Preserve df1 values unless they are null, in which case replace them with df2 values

     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0


Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


### Reshaping and Pivoting

For reshaping with hierarchical indexing, there are two primary actions:

- stack: this "rotates" or pivots from columns --> rows
- unstack: pivots from rows --> columns

In [206]:
data = DataFrame(np.arange(6).reshape((2,3)),
                 index=pd.Index(['Ohio','Colorado'],name='states'),
                 columns=pd.Index(['one','two','three'], name='number'))

data

number,one,two,three
states,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


In [207]:
# Stack method on the data pivots the columns into rows producing a series

result= data.stack()
result

states    number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

In [208]:
result.unstack()

# By default the innermost level is unstacked (same with stack)

number,one,two,three
states,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


In [209]:
# Specify the index level to unstack from rows to columns
result.unstack('states')

states,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


Note that unstacking might introduce missing data if not all of the values in the level are found in each subgroup:

In [210]:
s1= Series([0,1,2,3], index=['a','b','c','d'])
s2=Series([4,5,6], index=['c','d','e'])

data2=pd.concat([s1,s2], keys=['one','two'])
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [211]:
data2.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,4.0,5.0,6.0


In [212]:
# Stacking filters out missing data by default, so the operation is easily invertible

data2.unstack().stack()

# Override the default behavior with dropna=False
data2.unstack().stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

In [213]:
df= DataFrame({'left': result, 'right': result+5},
              columns=pd.Index(['left','right'],name='side'))

df

Unnamed: 0_level_0,side,left,right
states,number,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,one,0,5
Ohio,two,1,6
Ohio,three,2,7
Colorado,one,3,8
Colorado,two,4,9
Colorado,three,5,10


In [214]:
df.unstack()

side,left,left,left,right,right,right
number,one,two,three,one,two,three
states,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Ohio,0,1,2,5,6,7
Colorado,3,4,5,8,9,10


In [215]:
df.unstack().stack('side')

Unnamed: 0_level_0,number,one,three,two
states,side,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Ohio,left,0,2,1
Ohio,right,5,7,6
Colorado,left,3,5,4
Colorado,right,8,10,9


#### Pivoting "long" to "wide" format 

A common way to store time series data is in long (or stacked) format, where each row holds a single observation and the date, item, and value are stored in separate columns. We might prefer a DataFrame containing one column per distinct item value, indexed by the timestamps in the date column. The DataFrame `pivot` method performs this transformation:

In [216]:
%cd C:\Users\sonya\Documents\Python for Data Analysis\data\ch07

C:\Users\sonya\Documents\Python for Data Analysis\data\ch07


In [217]:
df = pd.DataFrame({'foo': ['one','one','one','two','two','two'],
                       'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'baz': [1, 2, 3, 4, 5, 6]})

df

Unnamed: 0,bar,baz,foo
0,A,1,one
1,B,2,one
2,C,3,one
3,A,4,two
4,B,5,two
5,C,6,two


In [218]:
df.pivot(index='foo', columns='bar', values='baz')

bar,A,B,C
foo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,1,2,3
two,4,5,6


In [219]:
df.pivot(columns='bar',values='baz')

bar,A,B,C
0,1.0,,
1,,2.0,
2,,,3.0
3,4.0,,
4,,5.0,
5,,,6.0
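
`pivot` can be thought of as a shortcut for building a hierarchical index with `set_index` and then reshaping with `unstack`. The following sketch reuses the small `foo`/`bar`/`baz` frame to compare the two routes; it is an illustration of the equivalence, not a claim about how pandas implements `pivot` internally.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6]})

# Route 1: pivot directly
wide = df.pivot(index='foo', columns='bar', values='baz')

# Route 2: move foo and bar into the index, then unstack the bar level
via_unstack = df.set_index(['foo', 'bar'])['baz'].unstack('bar')

print(wide)
print(via_unstack)
```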


### Data Transformation

This section covers filtering, cleaning, and other transformations, such as removing duplicates.

In [220]:
# Removing Duplicates

data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                  'k2': [1,1,2,3,3,4,4]})

print(data)

# Return a boolean Series indicating whether each row is a duplicate or not
print(data.duplicated())

    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool


In [221]:
# Use drop_duplicates to return a df of values where the duplicated array is false

data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
2,one,2
3,two,3
5,two,4


In [222]:
# To filter duplicates based on a specific column

data['v1'] = range(7)

data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,one,1,1
2,one,2,2
3,two,3,3
4,two,3,4
5,two,4,5
6,two,4,6


In [223]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
3,two,3,3
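
By default `duplicated` and `drop_duplicates` keep the first observed combination of values. The sketch below, which rebuilds the same frame, shows the `keep='last'` option returning the last occurrence of each `k1`/`k2` pair instead.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4],
                     'v1': range(7)})

# Default keep='first': the first row of each (k1, k2) pair survives
first_kept = data.drop_duplicates(['k1', 'k2'])

# keep='last': the last row of each (k1, k2) pair survives
last_kept = data.drop_duplicates(['k1', 'k2'], keep='last')
print(last_kept)
```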


#### Transforming Data Using a Function or Mapping

Suppose we want to add a column indicating the type of animal that each food came from:

In [224]:
data = DataFrame({'food': ['bacon','pulled pork','bacon','Pastrami',
                           'corned beef','Bacon','pastrami','honey ham','nova lox'],
                  'ounces': [4,3,12,6,7.5,8,3,5,6]})

data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [225]:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

The `map` method on a Series accepts a function or dict-like object containing a mapping. The issue here is that some of our meats are capitalized and others are not, so we need to convert each value to lower case and then map:

In [226]:
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)

data[['animal','food']]

Unnamed: 0,animal,food
0,pig,bacon
1,pig,pulled pork
2,pig,bacon
3,cow,Pastrami
4,cow,corned beef
5,pig,Bacon
6,cow,pastrami
7,pig,honey ham
8,salmon,nova lox


In [227]:
# Equivalent using function

data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
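
A possible variation, sketched here with a shortened food list: the vectorized `.str.lower()` accessor can replace `map(str.lower)`, and passing the dict to `map` yields NaN for values with no matching key. The `'tofu'` entry is hypothetical and added only to show that behavior.

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow',
                  'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon'}

# 'tofu' is a hypothetical entry with no key in the mapping
food = pd.Series(['bacon', 'Pastrami', 'Bacon', 'nova lox', 'tofu'])

# Lower-case every string, then look each one up in the dict
animal = food.str.lower().map(meat_to_animal)
print(animal)
```

By contrast, the lambda form `meat_to_animal[x.lower()]` would raise a `KeyError` on the unmapped value, so the dict form is the more forgiving choice when the mapping may be incomplete.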

#### Replacing Values

While `map` can be used to modify a subset of values in an object, `replace` provides a simpler and more flexible way of replacing values. It returns a new Series and leaves the original unchanged.

In [228]:
data = Series([1., -999.,2.,-999.,-1000,3.])

data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [229]:
# replace -999 values with NA

print(data.replace(-999, np.nan))

print('')
print(data)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64


In [230]:
# To replace multiple values at once, pass a list then the substitute value

data.replace([-999,-1000],np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [231]:
# To use a different replacement for each value, pass a list of substitutes

data.replace([-999,-1000],["Red","More Red"])

0           1
1         Red
2           2
3         Red
4    More Red
5           3
dtype: object

In [232]:
# The same idea using a dictionary of {old value: new value} pairs

data.replace({-999: "Red Zone", -1000: "More Red Zone"})

0                1
1         Red Zone
2                2
3         Red Zone
4    More Red Zone
5                3
dtype: object

#### Renaming Axis Indexes

Like values in a Series, axis labels can also be transformed by a function or mapping of some form to produce new, differently labeled objects.

In [233]:
data=DataFrame(np.arange(12).reshape((3,4)),
               index=['Ohio','Colorado','New York'],
               columns = ['one','two','three','four'])


# Convert the index labels to upper case (returns a new Index)
data.index.map(str.upper)

# Assign the result back to modify the DataFrame's index
data.index = data.index.map(str.upper)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


To create a transformed version of a data set without modifying the original, use `rename`:

In [234]:
data.rename(index=str.title, columns=str.upper)

# upper = upper case
# title = capitalize the first letter of each word

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


To modify the data set in place, pass `inplace=True`:

In [235]:
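# rename looks up each existing label once, so the 'MIDWEST' entry matches
# nothing ('MIDWEST' is not a label before the call) and is silently ignored
_ = data.rename(index={'OHIO': 'MIDWEST',
                       'MIDWEST': 'ROCKY',
                       'NEW YORK': 'EAST'}, inplace=True)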

data

Unnamed: 0,one,two,three,four
MIDWEST,0,1,2,3
COLORADO,4,5,6,7
EAST,8,9,10,11


#### Discretization and Binning

Suppose you want to group ages into discrete age buckets for further analysis. We can use `cut` to place numbers into bins:

In [236]:
ages = [20,22,25,27,21,23,27,23,37,61,43,32]

# Bins of 18-25, 26-35, 36-60, and 61-100
bins = [18,25,35,60,100]

cats=pd.cut(ages,bins)

cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (18, 25], (35, 60], (60, 100], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object returned by `cut` is a special `Categorical` object. It can be treated like an array of strings indicating the bin name. Internally it contains a `categories` array with the distinct bins, along with a labelling of the `ages` data in the `codes` attribute:

In [237]:
cats.codes

array([0, 0, 0, 1, 0, 0, 1, 0, 2, 3, 2, 1], dtype=int8)

In [238]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')

In [239]:
cats.categories[0]

Interval(18, 25, closed='right')

In [240]:
# A parenthesis means that side is open (exclusive), while a square bracket means it is closed (inclusive)

pd.value_counts(cats)

(18, 25]     6
(25, 35]     3
(35, 60]     2
(60, 100]    1
dtype: int64

In [241]:
# To switch which side is open/closed, use right=False

cats = pd.cut(ages, bins, right=False)

pd.value_counts(cats)

[18, 25)     5
[25, 35)     4
[35, 60)     2
[60, 100)    1
dtype: int64
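
The effect of `right=False` is easiest to see on values that sit exactly on a bin edge. The sketch below bins three such ages (a hypothetical `boundary_ages` list) both ways and prints the integer bin codes.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
boundary_ages = [25, 35, 60]   # hypothetical values lying exactly on bin edges

# Default right=True: intervals are (18, 25], (25, 35], (35, 60], (60, 100]
right_closed = pd.cut(boundary_ages, bins)

# right=False: intervals are [18, 25), [25, 35), [35, 60), [60, 100)
left_closed = pd.cut(boundary_ages, bins, right=False)

print(list(right_closed.codes))
print(list(left_closed.codes))
```

Each edge value moves up one bin when the closed side switches from right to left. With the default setting, a value exactly equal to the lowest edge (18) falls outside every bin unless `include_lowest=True` is passed.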

We can also supply our own bin names by passing a list or array to the `labels` option of `cut`:

In [242]:
group_names = ['Young','Youngish','Old','Older']

pd.cut(ages,bins,labels=group_names)

[Young, Young, Young, Youngish, Young, ..., Young, Old, Older, Old, Youngish]
Length: 12
Categories (4, object): [Old < Older < Young < Youngish]

If we pass `cut` an integer number of bins instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values of the data:

In [243]:
data=np.random.rand(20)

cats = pd.cut(data,4,precision=2)

cats

[(0.27, 0.51], (0.27, 0.51], (0.029, 0.27], (0.75, 1.0], (0.27, 0.51], ..., (0.27, 0.51], (0.51, 0.75], (0.75, 1.0], (0.27, 0.51], (0.029, 0.27]]
Length: 20
Categories (4, interval[float64]): [(0.029, 0.27] < (0.27, 0.51] < (0.51, 0.75] < (0.75, 1.0]]

In [244]:
pd.value_counts(cats)

(0.27, 0.51]     9
(0.51, 0.75]     5
(0.029, 0.27]    4
(0.75, 1.0]      2
dtype: int64

`qcut` bins the data based on sample quantiles, so we obtain roughly equal-size bins of observations.

In [245]:
data=np.random.randn(1000)

# Cut into quartiles (4 quantile-based bins)
cats = pd.qcut(data,4)

cats

[(-0.728, -0.0274], (-2.75, -0.728], (0.64, 3.478], (-0.0274, 0.64], (-0.728, -0.0274], ..., (0.64, 3.478], (0.64, 3.478], (-2.75, -0.728], (-0.0274, 0.64], (0.64, 3.478]]
Length: 1000
Categories (4, interval[float64]): [(-2.75, -0.728] < (-0.728, -0.0274] < (-0.0274, 0.64] < (0.64, 3.478]]

In [246]:
pd.value_counts(cats)

(0.64, 3.478]        250
(-0.0274, 0.64]      250
(-0.728, -0.0274]    250
(-2.75, -0.728]      250
dtype: int64

We can also pass our own quantiles (numbers between 0 and 1, inclusive):

In [247]:
cats = pd.qcut(data,[0,0.1,0.5,0.9,1.])
cats

[(-1.303, -0.0274], (-2.75, -1.303], (-0.0274, 1.208], (-0.0274, 1.208], (-1.303, -0.0274], ..., (1.208, 3.478], (1.208, 3.478], (-2.75, -1.303], (-0.0274, 1.208], (1.208, 3.478]]
Length: 1000
Categories (4, interval[float64]): [(-2.75, -1.303] < (-1.303, -0.0274] < (-0.0274, 1.208] < (1.208, 3.478]]

In [248]:
pd.value_counts(cats)

(-0.0274, 1.208]     400
(-1.303, -0.0274]    400
(1.208, 3.478]       100
(-2.75, -1.303]      100
dtype: int64

#### Detecting and Filtering Outliers

Outliers can be detected and filtered using NumPy array operations:

In [249]:
np.random.seed(12345)

data = DataFrame(np.random.randn(1000,4)) # 1000 numbers per column,4 columns
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.067684,0.067924,0.025598,-0.002298
std,0.998035,0.992106,1.006835,0.996794
min,-3.428254,-3.548824,-3.184377,-3.745356
25%,-0.77489,-0.591841,-0.641675,-0.644144
50%,-0.116401,0.101143,0.002073,-0.013611
75%,0.616366,0.780282,0.680391,0.654328
max,3.366626,2.653656,3.260383,3.927528


In [250]:
col = data[3]

# Find values in the column whose absolute value exceeds 3
col[np.abs(col) > 3]

# Select all rows containing a value greater than 3 or less than -3
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
5,-0.539741,0.476985,3.248944,-1.021228
97,-0.774363,0.552936,0.106061,3.927528
102,-0.655054,-0.56523,3.176873,0.959533
305,-2.315555,0.457246,-0.025907,-3.399312
324,0.050188,1.951312,3.260383,0.963301
400,0.146326,0.508391,-0.196713,-3.745356
499,-0.293333,-0.242459,-3.05699,1.918403
523,-3.428254,-0.296336,-0.439938,-0.867165
586,0.275144,1.179227,-3.184377,1.369891
808,-0.362528,-3.548824,1.553205,-2.186301


Values can easily be set based on such criteria. For example, to cap values outside the interval -3 to 3:

**np.sign()** returns -1 if x < 0, 0 if x == 0, and 1 if x > 0. nan is returned for nan inputs.

In [251]:
data[np.abs(data) > 3] = np.sign(data) * 3

data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.067623,0.068473,0.025153,-0.002081
std,0.995485,0.990253,1.003977,0.989736
min,-3.0,-3.0,-3.0,-3.0
25%,-0.77489,-0.591841,-0.641675,-0.644144
50%,-0.116401,0.101143,0.002073,-0.013611
75%,0.616366,0.780282,0.680391,0.654328
max,3.0,2.653656,3.0,3.0
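
As a cross-check, the same capping can be expressed with the DataFrame `clip` method. The sketch below is an alternative to the `np.sign` assignment, not the notebook's own approach; it draws its own random frame with a local `RandomState`, so the numbers will not necessarily match the output shown in this section.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(12345)
frame = pd.DataFrame(rng.randn(1000, 4))

# Approach used in this section: overwrite outliers with sign(value) * 3
capped_sign = frame.copy()
capped_sign[np.abs(capped_sign) > 3] = np.sign(capped_sign) * 3

# Alternative: clip every value into the closed interval [-3, 3]
capped_clip = frame.clip(lower=-3, upper=3)

print(np.allclose(capped_sign.values, capped_clip.values))
```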


####  Permutation and Random Sampling

**Permuting (randomly reordering)** a Series or the rows in a DataFrame is easy to do using the `numpy.random.permutation` function. Calling `permutation` with the length of the axis you want to permute returns an array of integers indicating the new ordering.

If x is an integer, randomly permute np.arange(x). If x is an array, make a copy and shuffle the elements randomly.

In [252]:
df = DataFrame(np.arange(5 * 4).reshape(5,4))

df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [253]:
print(np.random.permutation(3))
print(np.random.permutation(10))

[2 1 0]
[1 3 0 8 5 7 6 4 9 2]


`take(a)` : returns an array formed from the elements of a at the given indices.

In [254]:
sampler = np.random.permutation(5)
print(sampler)

# This array can then be used in iloc-based indexing or with the take function to reorder the rows
df.take(sampler)

[1 0 4 3 2]


Unnamed: 0,0,1,2,3
1,4,5,6,7
0,0,1,2,3
4,16,17,18,19
3,12,13,14,15
2,8,9,10,11


In [255]:
# To select a random subset without replacement, slice off the first k elements of the array
# returned by permutation, where k = desired subset size

df.take(np.random.permutation(len(df))[:3])

# returns the first 3 rows of an array with permuted results

Unnamed: 0,0,1,2,3
1,4,5,6,7
4,16,17,18,19
0,0,1,2,3


In [256]:
# To generate a sample with replacement, use np.random.randint to draw random integers

# bag of numbers
bag = np.array([5,7,-1,6,4])

# draw 10 random integers from 0 up to (but not including) len(bag)
sampler = np.random.randint(0,len(bag),size=10)
print(sampler)

[1 1 2 3 0 1 2 2 3 2]


In [257]:
draws = bag.take(sampler)
draws

# returns an array formed from the elements of bag at the indices given in sampler

array([ 7,  7, -1,  6,  5,  7, -1, -1,  6, -1])
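
Reasonably recent pandas versions also offer a `sample` method on Series and DataFrame that covers both cases directly. The sketch below is an alternative to the `permutation`/`randint` approach above; the `random_state` values are arbitrary and only make the draw repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

# Three distinct rows (sampling without replacement is the default)
subset = df.sample(n=3, random_state=0)

# Ten rows drawn with replacement, so rows may repeat
resampled = df.sample(n=10, replace=True, random_state=0)

print(subset)
print(resampled)
```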

### Computing Indicator/Dummy Variables 

A common type of transformation is converting a categorical variable into a "dummy" or "indicator" matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing binary values of 1s and 0s. Pandas has a `get_dummies` function for doing this.

In [258]:
df = DataFrame({'key': ['b','b','a','c','a','b'],
                'data1': range(6)})

df

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,b


In [259]:
dummies = pd.get_dummies(df['key'])
print(dummies)

   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0


We need double brackets (`[[]]`) in the case below because `join` is a DataFrame method, so the left-hand side must be a DataFrame rather than a Series.

Double-bracket selection returns a **pandas DataFrame**.

Single-bracket selection returns a **pandas Series**.

https://stackoverflow.com/questions/45201104/the-difference-between-double-brace-and-single-brace-indexing-i

In [260]:
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

Unnamed: 0,data1,a,b,c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
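
When the indicator columns are merged back into other data, a prefix helps keep their origin clear. The sketch below uses the `prefix` argument of `get_dummies` on the same small frame; the prefix string `'key'` is simply the source column name.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Each indicator column is named <prefix>_<category>
dummies = pd.get_dummies(df['key'], prefix='key')

df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy.columns.tolist())
```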


If a row in a DataFrame belongs to multiple categories, things are more complicated:

In [261]:
mnames = ['movie_id','title','genres']
%cd C:\Users\sonya\Documents\Python for Data Analysis\data\ch07
    
# A multi-character separator such as '::' requires the Python parsing engine
movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames,
                       engine='python')

movies[:10]

C:\Users\sonya\Documents\Python for Data Analysis\data\ch07



Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


We need to parse the `genres` column using the `|` delimiter. First we extract the list of unique genres in the dataset using a nice `set.union` trick.

The sets module provides classes for constructing and manipulating unordered collections of unique elements. Common uses include membership testing, removing duplicates from a sequence, and computing standard math operations on sets such as intersection, union, difference, and symmetric difference.

The `union()` method returns a new set with distinct elements from all the sets.

Like other collections, sets support x in set, len(set), and for x in set. Being an unordered collection, sets do not record element position or order of insertion. Accordingly, sets do not support indexing, slicing, or other sequence-like behavior.

https://docs.python.org/2/library/sets.html

Split function: http://www.pythonforbeginners.com/dictionary/python-split
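
The `set.union` trick can be sketched on a small sample. The three genre strings below are copied from the first rows of the movies table shown above, so the block runs without `movies.dat`. The last step uses the `str.get_dummies` string method as one possible way to build the indicator matrix; it is an assumption about a convenient route, not necessarily the approach the notebook goes on to use.

```python
import pandas as pd

# First three genre strings from the movies table shown above
genre_strings = pd.Series(["Animation|Children's|Comedy",
                           "Adventure|Children's|Fantasy",
                           "Comedy|Romance"])

# One set per movie, then the union of all of them
genre_sets = [set(g.split('|')) for g in genre_strings]
genres = sorted(set.union(*genre_sets))
print(genres)

# One 0/1 indicator column per genre, split on the same separator
indicators = genre_strings.str.get_dummies(sep='|')
print(indicators)
```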


In [262]:
for x in movies.genres:
    print(x.split("|"))

['Animation', "Children's", 'Comedy']
['Adventure', "Children's", 'Fantasy']
['Comedy', 'Romance']
['Comedy', 'Drama']
['Comedy']
['Action', 'Crime', 'Thriller']
['Comedy', 'Romance']
['Adventure', "Children's"]
['Action']
['Action', 'Adventure', 'Thriller']
['Comedy', 'Drama', 'Romance']
['Comedy', 'Horror']
['Animation', "Children's"]
['Drama']
['Action', 'Adventure', 'Romance']
['Drama', 'Thriller']
['Drama', 'Romance']
['Thriller']
['Comedy']
['Action']
['Action', 'Comedy', 'Drama']
['Crime', 'Drama', 'Thriller']
['Thriller']
['Drama', 'Sci-Fi']
['Drama', 'Romance']
['Drama']
['Drama']
['Romance']
['Adventure', 'Sci-Fi']
['Drama']
['Drama']
['Drama', 'Sci-Fi']
['Adventure', 'Romance']
["Children's", 'Comedy', 'Drama']
['Drama', 'Romance']
['Drama']
['Documentary']
['Comedy']
['Comedy', 'Romance']
['Drama']
['Drama', 'War']
['Action', 'Crime', 'Drama']
['Drama']
['Action', 'Adventure']
['Comedy', 'Drama']
['Drama', 'Romance']
['Crime', 'Thriller']
['Animation', "Children's", 'Musica

['Documentary']
['Comedy', 'Romance']
["Children's", 'Comedy', 'Musical']
['Action', 'Adventure', 'Comedy']
['Western']
['Thriller']
['Action', 'Crime', 'Romance']
['Documentary']
['Drama']
['Action', 'Adventure', 'Animation', "Children's", 'Fantasy']
['Comedy']
['Drama']
['Thriller']
['Comedy', 'Drama']
['Drama']
['Comedy']
['Horror']
['Comedy', 'Romance']
['Drama']
['Comedy', 'Drama']
["Children's", 'Comedy']
['Comedy', 'Drama']
['Drama']
['Drama']
['Drama']
['Comedy', 'Drama']
["Children's", 'Comedy']
['Comedy']
['Adventure', "Children's"]
['Drama', 'Mystery']
['Thriller']
['Drama']
['Documentary']
['Comedy']
['Comedy', 'Drama']
['Drama']
['Comedy']
["Children's", 'Comedy']
['Comedy', 'Romance', 'Thriller']
['Animation', "Children's", 'Comedy', 'Musical']
['Action', 'Sci-Fi', 'Thriller']
['Adventure', 'Drama', 'Western']
['Action', 'Drama', 'Thriller']
['Action', 'Adventure', 'Crime', 'Drama']
['Drama', 'Thriller']
['Animation', "Children's", 'Musical']
['Animation', "Children's", '

['Animation', "Children's", 'Musical']
['Adventure', 'Animation', "Children's", 'Musical']
['Adventure', "Children's", 'Musical']
['Animation', "Children's", 'Musical']
['Animation', "Children's"]
['Crime']
['Musical']
['Action', 'Thriller']
['Action', 'Sci-Fi', 'Thriller']
['Drama']
['Documentary']
['Drama']
['Drama']
['Comedy']
['Drama', 'Romance']
['Drama']
['Comedy', 'Drama']
['Drama', 'Romance']
['Action', 'Thriller']
['Action', 'Adventure']
['Documentary', 'Drama']
['Drama']
['Drama']
['Crime', 'Drama']
['Drama']
['Thriller']
['Drama']
['Comedy', 'Musical', 'Romance']
['Drama']
['Drama', 'Romance']
['Comedy', 'Drama']
['Crime', 'Drama']
['Drama']
['Drama']
['Animation', "Children's", 'Comedy']
['Mystery']
['Comedy', 'Musical', 'Romance']
['Comedy', 'Musical', 'Romance']
['Crime', 'Film-Noir']
['Film-Noir', 'Thriller']
['Adventure']
['Romance', 'War']
['Adventure', "Children's", 'Comedy', 'Fantasy']
['Comedy']
['Thriller']
['Comedy', 'Sci-Fi']
['Comedy', 'War']
['Comedy']
['Comedy

['Drama']
['Romance']
['Drama', 'War']
['Crime']
['Drama']
['Drama']
['Drama', 'Romance', 'Thriller']
['Crime', 'Thriller']
['Action', 'Adventure', 'Sci-Fi', 'War']
['Comedy']
['Drama']
['Comedy']
['Drama', 'Romance']
['Action', 'Adventure']
['Drama']
['Drama', 'Romance', 'Thriller']
['Romance']
['Romance']
['Crime', 'Thriller']
['Action', 'Thriller']
['Animation', "Children's", 'Musical']
['Comedy', 'Mystery']
['Action', 'Horror', 'Sci-Fi']
['Horror', 'Sci-Fi']
['Drama']
['Drama']
['Drama']
['Drama', 'War']
['Crime']
['Comedy']
['Drama']
['Comedy', 'Drama']
["Children's", 'Comedy', 'Fantasy']
['Comedy']
['Drama']
['Drama']
['Drama']
["Children's", 'Comedy']
['Drama']
['Thriller']
['Drama']
['Comedy', 'Crime', 'Drama', 'Mystery']
["Children's", 'Comedy']
['Romance']
['Thriller']
['Drama']
['Horror', 'Thriller']
['Thriller']
['Drama']
['Action', 'Adventure', 'Sci-Fi']
['Drama', 'Romance']
['Action', 'Romance', 'Thriller']
['Comedy', 'Drama']
['Drama']
['Drama']
['Drama']
['Drama']
['Dra

['Drama']
['Drama']
['Comedy']
['Drama']
['Drama', 'Thriller']
['Drama']
['Drama']
['Drama', 'Western']
['Drama']
['Drama', 'Romance']
['Comedy', 'Drama']
['Drama']
['Thriller']
['Comedy', 'Drama']
['Horror', 'Sci-Fi']
['Adventure', "Children's"]
['Adventure', "Children's", 'Sci-Fi']
['Horror']
['Horror']
['Drama', 'Fantasy']
['Horror', 'Sci-Fi']
['Horror', 'Sci-Fi']
['Horror', 'Sci-Fi']
['Action', 'Comedy']
['Comedy', 'Crime']
['Horror']
['Horror']
['Horror']
['Horror']
['Comedy']
['Horror']
['Horror']
['Drama']
['Mystery']
['Action', 'Comedy', 'Romance', 'Thriller']
['Comedy', 'Romance']
['Adventure', 'Comedy']
['Adventure', 'Comedy']
['Comedy']
['Comedy']
['Drama']
['Action', 'Mystery', 'Thriller']
['Action', 'War']
['Adventure']
['Comedy', 'Western']
['Drama', 'Thriller']
['Drama']
['Drama', 'Romance']
['Comedy', 'Romance']
['Comedy', 'Horror', 'Thriller']
['Comedy']
['Comedy', 'Romance']
['Drama']
['Action', 'Comedy']
['Drama', 'Horror', 'Thriller']
['Drama']
['Action', 'Thriller'

['Drama', 'War']
['Animation', 'Musical']
['Action', 'Thriller']
['Comedy', 'Horror']
['Thriller']
['Action', 'Crime']
['Drama', 'Mystery']
['Drama', 'Thriller']
['Horror']
['Drama']
['Drama']
['Crime', 'Drama']
['Drama']
['Drama', 'Western']
['Drama']
['Comedy', 'Romance']
['Action', 'Comedy']
['Crime', 'Drama']
['Drama', 'War']
['Comedy', 'Romance']
['Action', 'Crime']
['Adventure', 'Animation', 'Sci-Fi']
['Drama', 'War']
['Drama']
['Comedy', 'Romance']
['Drama']
['Drama']
['Animation', "Children's", 'Comedy']
['Comedy']
['Action', 'Drama', 'War']
['Animation', "Children's", 'Comedy']
['Action', 'Adventure', 'Thriller']
['Drama']
['Horror']
['Drama', 'Sci-Fi', 'Thriller']
['Animation', "Children's", 'Musical']
['Comedy']
['Crime', 'Drama']
['Horror']
['Action', 'Crime', 'Thriller']
['Action', 'Crime', 'Thriller']
['Drama', 'Romance']
['Action', 'War']
['Action', 'War']
['Action', 'War']
['Action']
['Adventure', 'Crime', 'Sci-Fi', 'Thriller']
['Action', 'Adventure']
['Horror']
['Comed

`*` unpacks a list or tuple into positional arguments

`**` unpacks a dictionary into keyword arguments

**Function arguments** can be specified by position or by keyword. Keywords make it clear what the purpose of each argument is when it would be confusing with only **positional arguments**. Keyword arguments with default values make it easy to add new behaviors to a function, especially when the function has existing callers.
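A minimal sketch of both unpacking forms. The function `describe` and its arguments are made up for illustration; only `set.union` comes from the cell below.

```python
# Hypothetical function used only to illustrate argument unpacking
def describe(title, year, genre='Unknown'):
    return '%s (%d) - %s' % (title, year, genre)

args = ('Toy Story', 1995)          # a tuple of positional arguments
kwargs = {'genre': 'Animation'}     # a dict of keyword arguments

# * spreads the tuple over title and year; ** spreads the dict over genre
with_unpacking = describe(*args, **kwargs)
explicit = describe('Toy Story', 1995, genre='Animation')

# set.union accepts any number of sets, so * can feed it a whole list of them
all_genres = set.union(*[{'Comedy', 'Drama'}, {'Drama'}, {'Horror'}])
```

Both calls to `describe` are equivalent, and `all_genres` ends up holding each distinct genre once.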

In [263]:
genre_iter = (set(x.split("|")) for x in movies.genres)
genres = set.union(*genre_iter)  # set.union returns a new set with the elements of all the given sets
genres = sorted(genres) # sort the genres
genres

['Action',
 'Adventure',
 'Animation',
 "Children's",
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']

In [264]:
# One way of constructing the indicator df is to start with a df of all zeroes

# Create a DataFrame of zeroes with one row per movie and one column per genre
dummies = DataFrame(np.zeros((len(movies),len(genres))), columns = genres)
dummies.head()

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**`enumerate()`** adds a counter to an iterable. For each element it produces a tuple `(counter, element)`, which a `for` loop can unpack into two variables (`i` and `gen` in the cell below). It is a built-in function that returns an iterator; see http://docs.python.org/2/library/functions.html
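A small sketch of what `enumerate` yields, using a made-up list rather than the `movies` data:

```python
genres_sample = ['Animation', 'Comedy', 'Drama']  # illustrative data

# Materialize the (counter, element) tuples to inspect them
pairs = list(enumerate(genres_sample))

# The optional start argument changes the first counter value
numbered = ['%d: %s' % (i, g) for i, g in enumerate(genres_sample, start=1)]
```

Here `pairs` begins with `(0, 'Animation')`, while `numbered` counts from 1.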

In [265]:
for i, gen in enumerate(movies.genres):
    dummies.loc[i,gen.split('|')] = 1
    
# For each movie, set the columns of its genres to 1 in the dummies df

dummies.head()

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [266]:
# Combine this with movies

movies_windic=movies.join(dummies)
movies_windic.head()

Unnamed: 0,movie_id,title,genres,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Animation|Children's|Comedy,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [267]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.head()

Unnamed: 0,movie_id,title,genres,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Crime,Genre_Documentary,...,Genre_Fantasy,Genre_Film-Noir,Genre_Horror,Genre_Musical,Genre_Mystery,Genre_Romance,Genre_Sci-Fi,Genre_Thriller,Genre_War,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
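As a possible shortcut to the loop above, pandas also offers `Series.str.get_dummies`, which splits each string on a separator and builds the indicator columns in one step. The sketch below uses a tiny hand-written frame, `movies_small`, standing in for `movies`; it is an alternative, not the approach taken in these notes.

```python
import pandas as pd

# Small stand-in for the movies table (illustrative rows only)
movies_small = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Split on '|' and build one 0/1 column per distinct genre
indicators = movies_small['genres'].str.get_dummies(sep='|')

# Attach the indicator columns, prefixed as in the cell above
combined = movies_small.join(indicators.add_prefix('Genre_'))
```

One difference from the zero-matrix approach: the columns here hold integers rather than floats.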


**`pandas.cut(x,bins)`**: Return the half-open bins to which each value of x belongs. By default each bin includes its right edge and excludes its left edge. This can be overridden with the `right=False` option.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
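To make the effect of `right=False` concrete, here is a small sketch with values chosen to sit exactly on bin edges (the values are illustrative, not from the cell below):

```python
import numpy as np
import pandas as pd

values = np.array([0.0, 0.2, 0.35, 0.8])   # illustrative values on bin edges
bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

# Default: intervals are closed on the right, e.g. (0.0, 0.2]
right_closed = pd.cut(values, bins)

# right=False: intervals are closed on the left, e.g. [0.0, 0.2)
left_closed = pd.cut(values, bins, right=False)

# .codes holds the bin index of each value, or -1 when it falls in no bin
right_codes = list(right_closed.codes)
left_codes = list(left_closed.codes)
```

With the default, `0.0` falls outside every bin (code `-1`) because the first interval excludes its left edge; with `right=False` it lands in the first bin and `0.2` moves up to the second.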

In [273]:
values = np.random.rand(10)
print values

bins = np.arange(0,1.2,0.2)
print bins

print pd.cut(values,bins)
# returns the respective bins for each value

pd.get_dummies(pd.cut(values,bins))

[ 0.51630663  0.39898442  0.04617011  0.71845025  0.01501439  0.2463375
  0.65773015  0.11142634  0.68040654  0.65088076]
[ 0.   0.2  0.4  0.6  0.8  1. ]
[(0.4, 0.6], (0.2, 0.4], (0.0, 0.2], (0.6, 0.8], (0.0, 0.2], (0.2, 0.4], (0.6, 0.8], (0.0, 0.2], (0.6, 0.8], (0.6, 0.8]]
Categories (5, interval[float64]): [(0.0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1.0]]


Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,1,0,0
1,0,1,0,0,0
2,1,0,0,0,0
3,0,0,0,1,0
4,1,0,0,0,0
5,0,1,0,0,0
6,0,0,0,1,0
7,1,0,0,0,0
8,0,0,0,1,0
9,0,0,0,1,0


### String Manipulation

Apply string methods and regular expressions concisely to whole arrays of data.

#### String Object Methods 

To break a string on a delimiter, use the `split` method

In [275]:
val = 'a,b, guido'

val.split(',')

['a', 'b', ' guido']

`split` is often combined with `strip` to trim whitespace (including newlines):

In [281]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

[]: Used to define mutable data types - lists, list comprehensions and for indexing/lookup/slicing.

(): Define tuples, order of operations, generator expressions, function calls and other syntax.

{}: The two hash table types - dictionaries and sets.

Substrings can be concatenated together with a `::` two-colon delimiter using addition:

In [284]:
first

'a'

In [286]:
first, second, third = pieces

first + '::' + second + '::' + third

'a::b::guido'

In [287]:
# a faster way is to pass list or tuple to the join method on the string '::'

'::'.join(pieces)

'a::b::guido'

Other methods are concerned with locating substrings. Using Python's `in` keyword is the best way to detect a substring.

`index` and `find` can also be used.

In [288]:
'guido' in val

True

In [298]:
print val.index(',')
print val.index('z')
# index raises an exception if the substring is not found (whereas find returns -1)

1


ValueError: substring not found

**`string.find(value)`**: Returns the index of the first character of the first occurrence if found, and -1 otherwise.

In [296]:
print val.find(':')
print val.find('a')
print val.find(',')

-1
0
1


**`count()`** returns the number of occurrences of a particular substring


In [299]:
val.count(',')

2

**`replace(value,replacement)`** substitutes occurrences of one substring with another. Passing an empty string as the replacement deletes the matched text.

In [300]:
val.replace(',','::')

'a::b:: guido'

In [301]:
val.replace(',','')

'ab guido'

The method **`str.join()`** returns a string in which the string elements of a sequence are joined, with the string it is called on as the separator.

In [309]:
s = "-"
seq = ("a", "b", "c")  # This is a sequence of strings.
print s.join(seq)

a-b-c


**`lower(), upper()`** Convert string to lower case or upper case

In [311]:
val.upper()

'A,B, GUIDO'

### Regular Expressions

**Regular expressions** provide a flexible way to search or match string patterns in text.

A **regex** is a single expression that describes a pattern to locate in the text. For example, the regex describing one or more whitespace characters is `\s+`.

Python's built-in `re` module is responsible for applying regular expressions to strings. `re` module functions fall into 3 categories: pattern matching, substitution, and splitting. 


In [314]:
import re

text = "foo    bar\t baz    \tqux"

print re.split(r'\s+', text)


['foo', 'bar', 'baz', 'qux']


In [316]:
regex = re.compile(r'\s+')

print regex.split(text) # split on runs of whitespace

print regex.findall(text) # find all the runs of whitespace

['foo', 'bar', 'baz', 'qux']
['    ', '\t ', '    \t']


Creating a regex object with `re.compile` is highly recommended if you intend to apply the same expression to many strings. Doing so will save CPU cycles.

While `findall` returns all matches in a string, `search()` returns only the first match. 

`match()` only matches at the beginning of the string; it returns `None` if the pattern does not occur at the start.
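The difference between the three methods can be sketched on a short made-up string (the pattern `\d+` and the text are illustrative only):

```python
import re

regex = re.compile(r'\d+')
text = 'room 12, floor 3'

all_numbers = regex.findall(text)        # every match, as a list of strings
first = regex.search(text)               # first match anywhere in the string
at_start = regex.match(text)             # None: the string does not start with digits
at_start_ok = regex.match('12 rooms')    # matches because the digits come first
```

`search` and `match` return match objects (or `None`), so the matched text is read with `.group()` and its position with `.start()` and `.end()`.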

`re.compile(pattern, flags=value)` compiles a regular expression pattern into a regular expression object, which can be used for matching via its `match()` and `search()` methods.

The expression’s behaviour can be modified by specifying a flags value.

In [318]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

pattern = r'[A-Z0-9._%+=-]+@[A-Z0-9.-]+\.[A-Z]{2,4}' # find strings that have the format xxxxxx@xxxxxx.xxx

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE) # compile 

When an "r" or "R" prefix is present, the literal is a raw string: a character following a backslash is included in the string without change, and all backslashes are left in the string. This avoids unwanted escape processing of `\` in a regex. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase "n". String quotes can be escaped with a backslash, but the backslash remains in the string.
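A quick check of the raw-string behaviour described above, as a small sketch with made-up variable names:

```python
import re

normal = "\n"     # one character: a newline
raw = r"\n"       # two characters: a backslash and the letter n

# For regex patterns the raw form keeps the backslash for the re module to interpret
words = re.split(r'\s+', 'foo   bar\tbaz')
```

Writing patterns as raw strings means `\s`, `\d`, `\1` and similar sequences reach the `re` module unchanged.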

In [319]:
regex.findall(text) 

# returns all substrings of text that match the pattern

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [323]:
m=regex.search(text)
print m

text[m.start():m.end()]

<_sre.SRE_Match object at 0x000000000914A578>


'dave@google.com'

`sub('new string', source)` returns a new string in which each occurrence of the pattern is replaced by the new string

In [324]:
print regex.sub('*',text)

Dave *
Steve *
Rob *
Ryan *



Example: Find email addresses and simultaneously segment each address into its 3 components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:

In [328]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9._]+)\.([A-Z]{2,4})'

regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()  # Return a tuple of the substrings captured by each parenthesized group

('wesm', 'bright', 'net')

In [329]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

`sub` also has access to groups in each match using special symbols like \1, \2, etc.

In [334]:
print regex.sub(r'\nUsername: \1\n  Domain: \2\n  Suffix: \3',text)

Dave 
Username: dave
  Domain: google
  Suffix: com
Steve 
Username: steve
  Domain: gmail
  Suffix: com
Rob 
Username: rob
  Domain: gmail
  Suffix: com
Ryan 
Username: ryan
  Domain: yahoo
  Suffix: com



`re.VERBOSE`
This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

`groupdict([default])`
Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None. 

`(?P<Y>...)`	Capturing group named Y

In [337]:
regex = re.compile(r"""
    (?P<username>[A-Z0-9._%+-]+)
    @
    (?P<domain>[A-Z0-9._%-]+)
    \.
    (?P<suffix>[A-Z]{2,4})"""
                  , flags = re.IGNORECASE|re.VERBOSE)

m = regex.match('wesm@bright.net')

m.groupdict()

{'domain': 'bright', 'suffix': 'net', 'username': 'wesm'}
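`re.VERBOSE` also allows inline `#` comments, and `groupdict` accepts a default for groups that did not take part in the match. The sketch below illustrates both with a deliberately modified pattern: the suffix is made optional and the domain class excludes `.`, which are assumptions made for this example and not part of the pattern above.

```python
import re

regex = re.compile(r"""
    (?P<username>[A-Z0-9._%+-]+)    # everything before the @
    @
    (?P<domain>[A-Z0-9_%-]+)        # domain name (no dots, for this sketch)
    (?:\.(?P<suffix>[A-Z]{2,4}))?   # optional suffix
    $
    """, flags=re.IGNORECASE | re.VERBOSE)

full = regex.match('wesm@bright.net').groupdict()

# 'suffix' does not participate here, so the default value is used for it
partial = regex.match('wesm@localhost').groupdict(default='n/a')
```

Without the `default` argument, the missing `suffix` entry would be `None`.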

### Vectorized string functions in pandas

Cleaning up a messy data set for analysis often requires a lot of string munging and regularization. String and regex methods can be applied to each value with `data.map` (passing a lambda or other function), but this fails on NA values. Series has concise string methods that skip NA values; they are accessed through the Series `str` attribute.

For example, we can check whether each email address has 'gmail' in it with `str.contains()`

In [358]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob': 'rob@gmail.com', 'Wes': np.nan}

data = Series(data)

print data
print data.isnull()

Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool


In [341]:
data.str.contains('gmail')

Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object
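To see why `map` struggles with NA values while the `str` methods do not, here is a small sketch on a two-element Series (the Series `emails` is made up for this example):

```python
import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com', 'Wes': np.nan}, dtype=object)

# Calling a plain string method through map fails on the missing value,
# because NaN is a float and has no upper() method
try:
    emails.map(lambda s: s.upper())
    map_failed = False
except AttributeError:
    map_failed = True

# The vectorized version skips the missing value and leaves it as NaN
upper = emails.str.upper()
```

The `map` version could be rescued with an explicit NA check inside the lambda, but the `str` accessor handles this without extra code.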

In [356]:
# regex can be used too along with any re options like IGNORECASE
print pattern
data.str.findall(pattern, flags=re.IGNORECASE)

([A-Z0-9._%+-]+)@([A-Z0-9._]+)\.([A-Z]{2,4})


Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

In [359]:
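# Vectorized element retrieval - either use str.get or index into the str attribute.
# Note: judging by the all-NaN output below, str.match in this pandas version
# appears to return booleans rather than the matched groups, so slicing finds nothing.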

matches = data.str.match(pattern, flags=re.IGNORECASE)
matches.str[:5]

Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64
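Since the cell above yields only NaN, one way to actually retrieve the matched pieces is to index into the result of `str.findall`, or to use `str.extract`, which returns one column per capture group. This is a sketch on a copy of the email data, assuming a reasonably recent pandas; it is not the approach used in the cell above.

```python
import re
import numpy as np
import pandas as pd

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9._]+)\.([A-Z]{2,4})'
data = pd.Series({'Dave': 'dave@google.com', 'Rob': 'rob@gmail.com', 'Wes': np.nan})

# findall gives a list of group tuples per row; .str[0] picks the first tuple,
# and .str.get(1) then picks the second group (the domain) from that tuple
first_match = data.str.findall(pattern, flags=re.IGNORECASE).str[0]
domains = first_match.str.get(1)

# str.extract returns one column per capture group
parts = data.str.extract(pattern, flags=re.IGNORECASE)
```

In both results the row for `Wes` stays missing, consistent with the other `str` methods.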

### Example: USDA Food Database


In [362]:
import os
os.getcwd()

import json
db=json.load(open('foods-2011-10-03.json'))

len(db)

6636

In [365]:
# Each entry in db is a dict containing all the data for a single food. The nutrients field is a list of dicts,
# one for each nutrient

db[0].keys()

[u'portions',
 u'description',
 u'tags',
 u'nutrients',
 u'group',
 u'id',
 u'manufacturer']

In [371]:
print db[0]['nutrients'][0]

{u'units': u'g', u'group': u'Composition', u'description': u'Protein', u'value': 25.18}


In [373]:
nutrients = DataFrame(db[0]['nutrients'])
nutrients

Unnamed: 0,description,group,units,value
0,Protein,Composition,g,25.180
1,Total lipid (fat),Composition,g,29.200
2,"Carbohydrate, by difference",Composition,g,3.060
3,Ash,Other,g,3.280
4,Energy,Energy,kcal,376.000
5,Water,Composition,g,39.280
6,Energy,Energy,kJ,1573.000
7,"Fiber, total dietary",Composition,g,0.000
8,"Calcium, Ca",Elements,mg,673.000
9,"Iron, Fe",Elements,mg,0.640


In [377]:
info_keys = ['description','group','id','manufacturer']

info = DataFrame(db, columns=info_keys)
info.head()

Unnamed: 0,description,group,id,manufacturer
0,"Cheese, caraway",Dairy and Egg Products,1008,
1,"Cheese, cheddar",Dairy and Egg Products,1009,
2,"Cheese, edam",Dairy and Egg Products,1018,
3,"Cheese, feta",Dairy and Egg Products,1019,
4,"Cheese, mozzarella, part skim milk",Dairy and Egg Products,1028,


In [378]:
info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
description     6636 non-null object
group           6636 non-null object
id              6636 non-null int64
manufacturer    5195 non-null object
dtypes: int64(1), object(3)
memory usage: 207.4+ KB


In [379]:
pd.value_counts(info.group)[:10]

Vegetables and Vegetable Products    812
Beef Products                        618
Baked Products                       496
Breakfast Cereals                    403
Legumes and Legume Products          365
Fast Foods                           365
Lamb, Veal, and Game Products        345
Sweets                               341
Fruits and Fruit Juices              328
Pork Products                        328
Name: group, dtype: int64

To analyze all of the nutrient data, it's easier to assemble the nutrients for each food into a single large DataFrame. To do so, first convert each list of food nutrients into a DataFrame, add a column for the food id, and append the DataFrame to a list. Then concatenate the list with `concat()`

In [382]:
nutrients =[]

for rec in db:
    fnuts=DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']
    nutrients.append(fnuts)


In [386]:
nutrients[:3]

[                            description        group    units     value    id
 0                               Protein  Composition        g    25.180  1008
 1                     Total lipid (fat)  Composition        g    29.200  1008
 2           Carbohydrate, by difference  Composition        g     3.060  1008
 3                                   Ash        Other        g     3.280  1008
 4                                Energy       Energy     kcal   376.000  1008
 5                                 Water  Composition        g    39.280  1008
 6                                Energy       Energy       kJ  1573.000  1008
 7                  Fiber, total dietary  Composition        g     0.000  1008
 8                           Calcium, Ca     Elements       mg   673.000  1008
 9                              Iron, Fe     Elements       mg     0.640  1008
 10                        Magnesium, Mg     Elements       mg    22.000  1008
 11                        Phosphorus, P     Element

In [388]:
nutrients = pd.concat(nutrients, ignore_index=True)

In [393]:
# count duplicate rows before removing them

nutrients.duplicated().sum()

14179

In [396]:
nutrients = nutrients.drop_duplicates()
nutrients.duplicated().sum()

0

Since 'group' and 'description' appear in both DataFrame objects, we can rename them with df.rename() to make clear which is which

In [398]:
col_mapping = {'description' : 'food',
               'group' : 'food_group'}

info = info.rename(columns=col_mapping, copy=False)
info.head()

Unnamed: 0,food,food_group,id,manufacturer
0,"Cheese, caraway",Dairy and Egg Products,1008,
1,"Cheese, cheddar",Dairy and Egg Products,1009,
2,"Cheese, edam",Dairy and Egg Products,1018,
3,"Cheese, feta",Dairy and Egg Products,1019,
4,"Cheese, mozzarella, part skim milk",Dairy and Egg Products,1028,


In [400]:
col_mapping = {'description': 'nutrient',
               'group' : 'nutgroup'}

nutrients = nutrients.rename(columns=col_mapping, copy=False)
nutrients.head()

Unnamed: 0,nutrient,nutgroup,units,value,id
0,Protein,Composition,g,25.18,1008
1,Total lipid (fat),Composition,g,29.2,1008
2,"Carbohydrate, by difference",Composition,g,3.06,1008
3,Ash,Other,g,3.28,1008
4,Energy,Energy,kcal,376.0,1008


In [402]:
# We're ready to merge info with nutrients

ndata = pd.merge(nutrients, info, on='id', how='outer')
ndata.head()

Unnamed: 0,nutrient,nutgroup,units,value,id,food,food_group,manufacturer
0,Protein,Composition,g,25.18,1008,"Cheese, caraway",Dairy and Egg Products,
1,Total lipid (fat),Composition,g,29.2,1008,"Cheese, caraway",Dairy and Egg Products,
2,"Carbohydrate, by difference",Composition,g,3.06,1008,"Cheese, caraway",Dairy and Egg Products,
3,Ash,Other,g,3.28,1008,"Cheese, caraway",Dairy and Egg Products,
4,Energy,Energy,kcal,376.0,1008,"Cheese, caraway",Dairy and Egg Products,


In [407]:
result = ndata.groupby(['nutrient','food_group'])['value'].quantile(0.5)
result.head()

nutrient          food_group                       
Adjusted Protein  Sweets                               12.900
                  Vegetables and Vegetable Products     2.180
Alanine           Baby Foods                            0.085
                  Baked Products                        0.248
                  Beef Products                         1.550
Name: value, dtype: float64

`pandas.DataFrame.quantile`
Return values at the given quantile over requested axis, a la numpy.percentile.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.quantile.html
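`quantile(0.5)` on a group is the group's median. A small sketch with made-up numbers (not taken from the USDA data) shows this, along with the linear interpolation pandas uses by default for other quantiles:

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({
    'food_group': ['Dairy', 'Dairy', 'Dairy', 'Beef', 'Beef'],
    'value': [1.0, 2.0, 10.0, 4.0, 6.0],
})

grouped = df.groupby('food_group')['value']
medians = grouped.quantile(0.5)     # same result as grouped.median()
upper_q = grouped.quantile(0.75)    # interpolates linearly between data points
```

For `Beef`, the 0.75 quantile lies three quarters of the way from 4.0 to 6.0.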

In [424]:
result['Zinc, Zn'].plot(kind='barh')

by_nutrient = ndata.groupby(['nutgroup','nutrient'])

get_max = lambda x: x.xs(x.value.idxmax())
get_min= lambda x: x.xs(x.value.idxmin())

max_foods = by_nutrient.apply(get_max)[['value','food']]

max_foods.food=max_foods.food.str[:50]

max_foods.loc['Amino Acids']['food']

nutrient
Alanine                           Gelatins, dry powder, unsweetened
Arginine                               Seeds, sesame flour, low-fat
Aspartic acid                                   Soy protein isolate
Cystine                Seeds, cottonseed flour, low fat (glandless)
Glutamic acid                                   Soy protein isolate
Glycine                           Gelatins, dry powder, unsweetened
Histidine                Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline    KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
Isoleucine        Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Leucine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Lysine            Seal, bearded (Oogruk), meat, dried (Alaska Na...
Methionine                    Fish, cod, Atlantic, dried and salted
Phenylalanine     Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Proline                           Gelatins, dry powder, unsweetened
Serine            Soy protein isolate, 