# Lecture 2 
## Part 1 -- Introduction to NumPy and pandas

We will look at Python libraries that are extremely useful for data analysis.

First, we will work through the NumPy and pandas tutorial by Marcel Caraciolo. You can find the full tutorial here: [http://marcelcaraciolo.github.io/big-data-tutorial](http://marcelcaraciolo.github.io/big-data-tutorial)

## 2.1 NumPy: Numerical Python

### What is it?

*It is a Python library that provides a multidimensional array object, various
derived objects (such as masked arrays and matrices), and an assortment of
routines for fast operations on arrays, including mathematical, logical, shape
manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear
algebra, basic statistical operations, random simulation and much more.*

In [1]:
import numpy as np

# set some print options
np.set_printoptions(precision=4)
np.set_printoptions(threshold=5)
np.set_printoptions(suppress=True)

# init random gen
np.random.seed(2)

### NumPy's basic data structure: the ndarray

Think of ndarrays as the building blocks for pydata: a multidimensional array
object that acts as a container for data to be passed between algorithms. Also,
libraries written in a lower-level language, such as C or Fortran, can operate
on the data stored in a NumPy array without copying any data.


In [2]:
import numpy as np

# build an array using the array function
arr = np.array([0, 9, 5, 4, 3])
arr

array([0, 9, 5, 4, 3])

### Array creation examples

There are several functions that are used to create new arrays:

- `np.array`
- `np.asarray`
- `np.arange`
- `np.ones`
- `np.ones_like`
- `np.zeros`
- `np.zeros_like`
- `np.empty`
- `np.random.randn` and other funcs from the random module

In [3]:
np.zeros(4)

array([ 0.,  0.,  0.,  0.])

In [4]:
np.ones(4)

array([ 1.,  1.,  1.,  1.])

In [5]:
# np.empty does not initialize memory, so the values shown are arbitrary
np.empty(4)

array([ -3.1050e+231,  -3.1050e+231,   6.9391e-310,   2.2708e-314])

In [6]:
np.arange(4)

array([0, 1, 2, 3])
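The list above also names `np.asarray` and the `_like` constructors, which the cells do not demonstrate. The following is a minimal sketch of how they behave; the variable names (`base`, `same`, `copied`) are illustrative and not part of the original tutorial.

```python
import numpy as np

base = np.arange(6, dtype=np.float64)

# The *_like constructors reuse the shape and dtype of an existing array.
zeros = np.zeros_like(base)
ones = np.ones_like(base)
print(zeros.shape, zeros.dtype)

# np.asarray hands back its input when it is already a suitable ndarray,
# whereas np.array makes a copy by default.
same = np.asarray(base)
copied = np.array(base)
print(same is base)    # True
print(copied is base)  # False
```

`np.asarray` is therefore a cheap way to accept "anything array-like" in a function without copying data that is already an ndarray.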

### dtype and shape

NumPy's arrays are containers of homogeneous data, which means all elements are
of the same type. The 'dtype' property is an object that specifies the data type
of each element. The 'shape' property is a tuple that indicates the size of each
dimension.

In [7]:
arr = np.random.randn(5)
arr

array([-0.4168, -0.0563, -2.1362,  1.6403, -1.7934])

In [8]:
arr.dtype

dtype('float64')

In [9]:
arr.shape

(5,)

In [10]:
# you can be explicit about the data type that you want
np.empty(4, dtype=np.int32)

array([         0, -268435456, -828433777, 1610614779], dtype=int32)

In [11]:
# np.bytes_ replaces the np.string_ alias, which was removed in NumPy 2.0
np.array(['numpy','pandas','pytables'], dtype=np.bytes_)

array(['numpy', 'pandas', 'pytables'], 
      dtype='|S8')

In [12]:
float_arr = np.array([4.4, 5.52425, -0.1234, 98.1], dtype=np.float64)
# truncate the decimal part
float_arr.astype(np.int32)

array([ 4,  5,  0, 98], dtype=int32)
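Two details of type handling are easy to miss: `np.array` picks a single common dtype for mixed input, and `astype` returns a new array, truncating toward zero when it converts floats to integers. A short sketch, with illustrative names and values:

```python
import numpy as np

# Mixed int/float input is upcast to one common dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# astype returns a new array; float -> int conversion truncates toward zero,
# so -2.9 becomes -2 rather than -3.
float_arr = np.array([4.4, 5.52425, -0.1234, -2.9])
int_arr = float_arr.astype(np.int32)
print(int_arr.tolist())  # [4, 5, 0, -2]
print(float_arr[0])      # 4.4, the original array is unchanged
```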

### Indexing and slicing

#### Just what you would expect from Python

In [13]:
arr = np.array([0, 9, 1, 4, 64])
arr[3]

4

In [14]:
arr[1:3]

array([9, 1])

In [15]:
arr[:2]

array([0, 9])

In [16]:
# set the last two elements to 55
arr[-2:] = 55
arr

array([ 0,  9,  1, 55, 55])

#### Indexing behaviour for multidimensional arrays

A good way to think about indexing in multidimensional arrays is that you are
moving along the values of the shape property. So, a 4d array `arr_4d`, with a
shape of `(w,x,y,z)` will result in indexed views such that:

- `arr_4d[i].shape == (x,y,z)`
- `arr_4d[i,j].shape == (y,z)`
- `arr_4d[i,j,k].shape == (z,)`

For the case of slices, what you are doing is selecting a range of elements
along a particular axis:

In [17]:
arr_2d = np.array([[5,3,4],[0,1,2],[1,1,10],[0,0,0.1]])
arr_2d

array([[  5. ,   3. ,   4. ],
       [  0. ,   1. ,   2. ],
       [  1. ,   1. ,  10. ],
       [  0. ,   0. ,   0.1]])

In [18]:
# get the first row
arr_2d[0]

array([ 5.,  3.,  4.])

In [19]:
# get the first column
arr_2d[:,0]

array([ 5.,  0.,  1.,  0.])

In [20]:
# get the first two rows
arr_2d[:2]

array([[ 5.,  3.,  4.],
       [ 0.,  1.,  2.]])
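The shape rule for integer indices and slices can be checked directly. This sketch assumes an arbitrary 4-d array of zeros with shape `(2, 3, 4, 5)`; only the shapes matter here.

```python
import numpy as np

# A 4-d array with shape (w, x, y, z) = (2, 3, 4, 5).
arr_4d = np.zeros((2, 3, 4, 5))

# Each integer index removes the leading axis.
print(arr_4d[0].shape)        # (3, 4, 5)
print(arr_4d[0, 1].shape)     # (4, 5)
print(arr_4d[0, 1, 2].shape)  # (5,)

# A slice keeps its axis and only shortens it.
print(arr_4d[:, 1:3].shape)   # (2, 2, 4, 5)
```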

#### Careful, it's a view!

A slice returns a view, not a copy, which means that any modification of the
slice is reflected in the source array. This is a design feature of NumPy: it
avoids copying large arrays in memory.

In [21]:
arr = np.array([0, 3, 1, 4, 64])
arr

array([ 0,  3,  1,  4, 64])

In [22]:
subarr = arr[2:4]
subarr[1] = 99
arr

array([ 0,  3,  1, 99, 64])
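When an independent array is needed, an explicit `.copy()` breaks the link to the source. The sketch below uses `np.shares_memory` to make the difference visible; the names `view` and `independent` are illustrative.

```python
import numpy as np

arr = np.array([0, 3, 1, 4, 64])

view = arr[2:4]                # shares memory with arr
independent = arr[2:4].copy()  # owns its own data

independent[1] = 99            # does not touch arr
print(arr.tolist())            # [0, 3, 1, 4, 64]

print(np.shares_memory(arr, view))         # True
print(np.shares_memory(arr, independent))  # False
```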

#### (Fancy) Boolean indexing

Boolean indexing allows you to select data subsets of an array that satisfy a
given condition.

In [23]:
arr = np.array([10, 20])
idx = np.array([True, False])
arr[idx]


array([10])

In [24]:
arr_2d = np.random.randn(5)
arr_2d


array([-0.8417,  0.5029, -1.2453, -1.058 , -0.909 ])

In [25]:
arr_2d < 0

array([ True, False,  True,  True,  True], dtype=bool)

In [26]:
arr_2d[arr_2d < 0]

array([-0.8417, -1.2453, -1.058 , -0.909 ])

In [27]:
arr_2d[(arr_2d > -0.5) & (arr_2d < 0)]

array([], dtype=float64)

In [28]:
arr_2d[arr_2d < 0] = 0
arr_2d

array([ 0.    ,  0.5029,  0.    ,  0.    ,  0.    ])
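Conditions can also be combined with `|` (or) and `~` (not), and `np.where` offers a way to replace values without modifying the array in place. A small sketch with made-up values:

```python
import numpy as np

arr = np.array([-0.8, 0.5, -1.2, 1.7, -0.9])

# Combine conditions with & (and), | (or), ~ (not). The parentheses are
# required because these operators bind tighter than the comparisons.
outside = arr[(arr < -1) | (arr > 1)]
print(outside.tolist())  # [-1.2, 1.7]

# np.where builds a new array and leaves arr untouched.
clipped = np.where(arr < 0, 0.0, arr)
print(clipped.tolist())  # [0.0, 0.5, 0.0, 1.7, 0.0]
```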

#### (Fancy) list-of-locations indexing

Fancy indexing is indexing with integer arrays.

In [29]:
arr = np.arange(18).reshape(6,3)
arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17]])

In [30]:
# fancy selection of rows in a particular order
arr[[0,4,4]]

array([[ 0,  1,  2],
       [12, 13, 14],
       [12, 13, 14]])

In [31]:
# fancy selection of rows 5, 3 and 1, in that order
arr[[5,3,1]]

array([[15, 16, 17],
       [ 9, 10, 11],
       [ 3,  4,  5]])

In [32]:
# index into individual elements: (5,2), (3,1) and (1,0)
arr[[5,3,1],[2,1,0]]

array([17, 10,  3])

In [33]:
# select a submatrix
arr[np.ix_([5,3,1],[2,1])]

array([[17, 16],
       [11, 10],
       [ 5,  4]])
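Unlike slicing, fancy indexing returns a copy rather than a view. The sketch below checks this, and also shows that `np.ix_` gives the same result as selecting rows first and columns second; the names `rows`, `sub`, and `same` are illustrative.

```python
import numpy as np

arr = np.arange(18).reshape(6, 3)

# Fancy indexing copies the selected data.
rows = arr[[0, 4]]
rows[0, 0] = -1
print(arr[0, 0])  # 0, the source array is unchanged

# np.ix_ builds index arrays that broadcast to the full row/column grid,
# which matches selecting the rows and then the columns.
sub = arr[np.ix_([5, 3, 1], [2, 1])]
same = arr[[5, 3, 1]][:, [2, 1]]
print(np.array_equal(sub, same))  # True
```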

### Vectorization
Vectorization is at the heart of NumPy and it enables us to express operations
without writing any for loops.  Operations between arrays with equal shapes are
performed element-wise.


In [34]:
arr = np.array([0, 9, 1.02, 4, 32])
arr - arr

array([ 0.,  0.,  0.,  0.,  0.])

In [35]:
arr * arr


array([    0.    ,    81.    ,     1.0404,    16.    ,  1024.    ])

In [36]:
np.array([[2],[3]]) * np.array([1,2])

array([[2, 4],
       [3, 6]])
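To make the "no for loops" point concrete, this sketch computes the element-wise square twice, once with an explicit Python loop and once with a vectorized expression, and confirms that both agree.

```python
import numpy as np

arr = np.array([0, 9, 1.02, 4, 32])

# Explicit Python loop over the elements.
looped = np.empty_like(arr)
for i in range(len(arr)):
    looped[i] = arr[i] * arr[i]

# Vectorized form: one expression, evaluated in compiled code.
vectorized = arr * arr

print(np.allclose(looped, vectorized))  # True
```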

### Broadcasting Rules

Vectorized operations between arrays of different sizes and between arrays and
scalars are subject to the rules of broadcasting. The idea is quite simple in
many cases:


In [37]:
arr = np.array([0, 9, 1.02, 4, 64])
5 * arr 

array([   0. ,   45. ,    5.1,   20. ,  320. ])

In [38]:
10 + arr

array([ 10.  ,  19.  ,  11.02,  14.  ,  74.  ])

In [39]:
arr ** .5

array([ 0.  ,  3.  ,  1.01,  2.  ,  8.  ])

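The case of arrays of different shapes is slightly more complicated. The gist of it is that the shapes of the operands need to be compatible: NumPy compares the shapes dimension by dimension, starting from the last one, and two dimensions are compatible when they are equal or when one of them is 1. Don't worry if this does not make sense right away.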

In [40]:
arr = np.random.randn(4,2)
arr

array([[ 0.5515,  2.2922],
       [ 0.0415, -1.1179],
       [ 0.5391, -0.5962],
       [-0.0191,  1.175 ]])

In [41]:
mean_row = np.mean(arr, axis=0)
mean_row

array([ 0.2782,  0.4383])

In [42]:
centered_rows = arr - mean_row
centered_rows

array([[ 0.2732,  1.8539],
       [-0.2367, -1.5562],
       [ 0.2608, -1.0344],
       [-0.2974,  0.7367]])

In [43]:
np.mean(centered_rows, axis=0)

array([-0.,  0.])

In [44]:
mean_col = np.mean(arr, axis=1)
mean_col

array([ 1.4218, -0.5382, -0.0286,  0.5779])

In [46]:
# raises ValueError: shapes (4,2) and (4,) cannot be broadcast together
centered_cols = arr - mean_col

In [47]:
# make the 1-D array a column vector
mean_col.reshape((4,1))

array([[ 1.4218],
       [-0.5382],
       [-0.0286],
       [ 0.5779]])

In [48]:
centered_cols = arr - mean_col.reshape((4,1))
centered_cols

array([[-0.8704,  0.8704],
       [ 0.5797, -0.5797],
       [ 0.5676, -0.5676],
       [-0.5971,  0.5971]])

In [49]:
centered_cols.mean(axis=1)

array([-0.,  0.,  0., -0.])
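The reshape step can be avoided by asking the reduction to keep the reduced axis. This is a sketch using `keepdims=True` and `np.newaxis`, with a locally seeded `RandomState` so that it does not depend on the cells above.

```python
import numpy as np

rng = np.random.RandomState(2)  # legacy seeded generator, used for illustration
arr = rng.randn(4, 2)

# keepdims=True keeps the reduced axis with length 1: shape (4, 1).
mean_col = arr.mean(axis=1, keepdims=True)
print(mean_col.shape)  # (4, 1)

# (4, 2) - (4, 1): the length-1 axis is stretched across the columns.
centered_cols = arr - mean_col
print(np.allclose(centered_cols.mean(axis=1), 0.0))  # True

# np.newaxis is another way to turn a (4,) array into a (4, 1) column.
flat_mean = arr.mean(axis=1)
print(np.allclose(arr - flat_mean[:, np.newaxis], centered_cols))  # True
```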

#### A note about NaNs:

Per the floating point standard IEEE 754, NaN is a floating point value that, by definition, is not equal to any floating point value, including itself.

In [50]:
np.nan != np.nan

True

In [51]:
np.array([10,5,4,np.nan,1,np.nan]) == np.nan

array([False, False, False, False, False, False], dtype=bool)

In [52]:
np.isnan(np.array([10,5,4,np.nan,1,np.nan]))

array([False, False, False,  True, False,  True], dtype=bool)
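In practice the `np.isnan` mask is usually combined with boolean indexing, or replaced by a NaN-aware reduction such as `np.nanmean`. A minimal sketch on the same values:

```python
import numpy as np

data = np.array([10, 5, 4, np.nan, 1, np.nan])

# Ordinary reductions propagate NaN.
print(np.isnan(data.mean()))  # True

# Drop the NaNs with a boolean mask, or use a NaN-aware reduction.
valid = data[~np.isnan(data)]
print(valid.mean())      # 5.0
print(np.nanmean(data))  # 5.0
```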

## 2.2 pandas: Python Data Analysis Library

### What is it?

*Python has long been great for data munging and preparation, but less so for
data analysis and modeling. pandas helps fill this gap, enabling you to carry
out your entire data analysis workflow in Python without having to switch to a
more domain specific language like R.*

The heart of pandas is the DataFrame object for data manipulation. It features:

- a powerful index object
- data alignment
- handling of missing data
- aggregation with groupby
- data manipulation via reshape, pivot, slice, merge, join

In [53]:
import pandas as pd

pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.precision', 3)

### Series: labelled arrays

The pandas Series is the simplest data structure to start with. It is a
one-dimensional labelled array, backed by a NumPy ndarray, that supports more
meaningful indices.

#### Let's look at some creation examples for Series

In [54]:
import pandas as pd

values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
ser = pd.Series(values)
print(ser)

0     2.00
1     1.00
2     5.00
3     0.97
4     3.00
5    10.00
6     0.06
7     8.00
dtype: float64


In [55]:
values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
ser = pd.Series(data=values, index=labels)
print(ser)


A     2.00
B     1.00
C     5.00
D     0.97
E     3.00
F    10.00
G     0.06
H     8.00
dtype: float64


In [56]:
movie_rating = {
    'age': 1,
    'gender': 'F',
    'genres': 'Drama',
    'movie_id': 1193,
    'occupation': 10,
    'rating': 5,
    'timestamp': 978300760,
    'title': "One Flew Over the Cuckoo's Nest (1975)",
    'user_id': 1,
    'zip': '48067'
    }
ser = pd.Series(movie_rating)
print(ser)


age                                                1
gender                                             F
genres                                         Drama
movie_id                                        1193
occupation                                        10
rating                                             5
timestamp                                  978300760
title         One Flew Over the Cuckoo's Nest (1975)
user_id                                            1
zip                                            48067
dtype: object


In [57]:
ser.index

Index([u'age', u'gender', u'genres', u'movie_id', u'occupation', u'rating',
       u'timestamp', u'title', u'user_id', u'zip'],
      dtype='object')

In [58]:
ser.values

array([1, 'F', 'Drama', ..., "One Flew Over the Cuckoo's Nest (1975)", 1,
       '48067'], dtype=object)

#### Series indexing

In [59]:
# positional access; plain ser[0] on a labelled index is deprecated in recent pandas
ser.iloc[0]

1

In [60]:
ser['gender']

'F'

In [61]:
# .at is the label-based scalar accessor (get_value was removed in pandas 1.0)
ser.at['gender']

'F'
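Label-based and position-based access can be kept apart explicitly with `.loc` and `.iloc`. The sketch below uses a shortened, illustrative version of the rating record.

```python
import pandas as pd

ser = pd.Series({'age': 1, 'gender': 'F', 'rating': 5})

# .loc selects by label, .iloc by integer position.
print(ser.loc['gender'])  # F
print(ser.iloc[0])        # 1

# Both accept slices; note that a label slice includes its end label.
print(ser.loc['age':'gender'].tolist())  # [1, 'F']
print(ser.iloc[:2].tolist())             # [1, 'F']
```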

#### Operations between Series with different index objects

In [62]:
ser_1 = pd.Series(data=[1,3,4], index=['A', 'B', 'C'])
ser_2 = pd.Series(data=[5,5,5], index=['A', 'G', 'C'])
print(ser_1 + ser_2)


A    6.0
B    NaN
C    9.0
G    NaN
dtype: float64
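The result is aligned on the union of the two indices, and labels present on only one side become NaN. If that is not what you want, one option is `Series.add` with `fill_value`; another is to drop the unmatched labels. A sketch of both:

```python
import pandas as pd

ser_1 = pd.Series(data=[1, 3, 4], index=['A', 'B', 'C'])
ser_2 = pd.Series(data=[5, 5, 5], index=['A', 'G', 'C'])

# Treat a label missing from one side as 0 instead of producing NaN.
total = ser_1.add(ser_2, fill_value=0)
print(total.to_dict())  # {'A': 6.0, 'B': 3.0, 'C': 9.0, 'G': 5.0}

# Alternatively, keep only the labels present in both Series.
common = (ser_1 + ser_2).dropna()
print(common.to_dict())  # {'A': 6.0, 'C': 9.0}
```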


### DataFrame

The DataFrame is the 2-dimensional version of a Series.

#### Let's look at some creation examples for DataFrame

You can think of it as a spreadsheet whose columns are Series objects.

In [63]:
# build from a dict of equal-length lists or ndarrays
pd.DataFrame({'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]})

   col_1  col_2
0   0.12    0.9
1   7.00    9.0
2  45.00   34.0
3  10.00   11.0


You can explicitly set the column names and index values as well.

In [64]:
pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
             columns=['col_1', 'col_2', 'col_3'])

   col_1  col_2 col_3
0   0.12    0.9   NaN
1   7.00    9.0   NaN
2  45.00   34.0   NaN
3  10.00   11.0   NaN


In [65]:
pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
             columns=['col_1', 'col_2', 'col_3'],
             index=['obs1', 'obs2', 'obs3', 'obs4'])


      col_1  col_2 col_3
obs1   0.12    0.9   NaN
obs2   7.00    9.0   NaN
obs3  45.00   34.0   NaN
obs4  10.00   11.0   NaN


You can also think of it as a dictionary of Series objects.

In [66]:
movie_rating = {
    'gender': 'F',
    'genres': 'Drama',
    'movie_id': 1193,
    'rating': 5,
    'timestamp': 978300760,
    'user_id': 1,
    }
ser_1 = pd.Series(movie_rating)
ser_2 = pd.Series(movie_rating)
df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})
df.columns.name = 'rating_events'
df.index.name = 'rating_data'
df

rating_events        r_1        r_2
rating_data
gender                 F          F
genres             Drama      Drama
movie_id            1193       1193
rating                 5          5
timestamp      978300760  978300760
user_id                1          1


In [67]:
df = df.T
df

rating_data   gender genres movie_id rating  timestamp user_id
rating_events
r_1                F  Drama     1193      5  978300760       1
r_2                F  Drama     1193      5  978300760       1


In [68]:
df.columns 

Index([u'gender', u'genres', u'movie_id', u'rating', u'timestamp', u'user_id'], dtype='object', name=u'rating_data')

In [69]:
df.index

Index([u'r_1', u'r_2'], dtype='object', name=u'rating_events')

In [70]:
df.values

array([['F', 'Drama', 1193, 5, 978300760, 1],
       ['F', 'Drama', 1193, 5, 978300760, 1]], dtype=object)

#### Adding/Deleting entries

In [71]:
df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})
df.drop('genres', axis=0)

                   r_1        r_2
rating_data
gender               F          F
movie_id          1193       1193
rating               5          5
timestamp    978300760  978300760
user_id              1          1


In [72]:
df.drop('r_1', axis=1)

                   r_2
rating_data
gender               F
genres           Drama
movie_id          1193
rating               5
timestamp    978300760
user_id              1


In [73]:
# careful with the order here
df['r_3'] = ['F', 'Drama', 1193, 5, 978300760, 1]
df

                   r_1        r_2        r_3
rating_data
gender               F          F          F
genres           Drama      Drama      Drama
movie_id          1193       1193       1193
rating               5          5          5
timestamp    978300760  978300760  978300760
user_id              1          1          1
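Two points about these cells are worth spelling out: `drop` returns a new DataFrame and leaves the original untouched, and a plain list is assigned by position while a Series is aligned on its index labels. The sketch below uses a two-row illustrative frame; the column names `r_3` and `r_4` are only for demonstration.

```python
import pandas as pd

df = pd.DataFrame({'r_1': [1193, 5], 'r_2': [1193, 5]},
                  index=['movie_id', 'rating'])

# drop returns a new DataFrame; df itself keeps all of its rows.
without_rating = df.drop('rating', axis=0)
print(df.shape, without_rating.shape)  # (2, 2) (1, 2)

# A list is assigned by position, so its order must match the index.
df['r_3'] = [1193, 5]

# A Series is aligned on its index labels, so its order does not matter.
df['r_4'] = pd.Series({'rating': 5, 'movie_id': 1193})
print(df['r_4'].tolist())  # [1193, 5]
```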


#### DataFrame indexing

You can index into a column using its label, or with dot notation:


In [74]:
df = pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
                  columns=['col_1', 'col_2', 'col_3'],
                  index=['obs1', 'obs2', 'obs3', 'obs4'])
df['col_1']

obs1     0.12
obs2     7.00
obs3    45.00
obs4    10.00
Name: col_1, dtype: float64

In [75]:
df.col_1

obs1     0.12
obs2     7.00
obs3    45.00
obs4    10.00
Name: col_1, dtype: float64

You can also use multiple columns to select a subset of them:

In [76]:
df[['col_2', 'col_1']]

      col_2  col_1
obs1    0.9   0.12
obs2    9.0   7.00
obs3   34.0  45.00
obs4   11.0  10.00


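The `.ix` indexer accepts both labels and integer positions, which lets you
index into certain rows, or even rows and columns. It was deprecated in pandas
0.20 and removed in pandas 1.0 in favour of `.loc` (label-based) and `.iloc`
(position-based), so the cells below only run on older versions: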

In [77]:
df.ix['obs3']

col_1     45
col_2     34
col_3    NaN
Name: obs3, dtype: object

In [78]:
df.ix[0]

col_1    0.12
col_2     0.9
col_3     NaN
Name: obs1, dtype: object

In [79]:
df.ix[:2]

      col_1  col_2 col_3
obs1   0.12    0.9   NaN
obs2   7.00    9.0   NaN


In [80]:
df.ix[:2, 'col_2']

obs1    0.9
obs2    9.0
Name: col_2, dtype: float64

In [81]:
df.ix[:2, ['col_1', 'col_2']]

      col_1  col_2
obs1   0.12    0.9
obs2   7.00    9.0
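Because `.ix` is not available in pandas 1.0 and later, the same selections can be written with `.loc` and `.iloc`. The following is one possible translation, not the only one; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
                  columns=['col_1', 'col_2', 'col_3'],
                  index=['obs1', 'obs2', 'obs3', 'obs4'])

row_by_label = df.loc['obs3']   # like df.ix['obs3']
row_by_position = df.iloc[0]    # like df.ix[0]
first_two = df.iloc[:2]         # like df.ix[:2]

# Mixed position/label selection: translate the positions to labels first.
col_2_head = df.loc[df.index[:2], 'col_2']           # like df.ix[:2, 'col_2']
two_cols = df.loc[df.index[:2], ['col_1', 'col_2']]  # like df.ix[:2, [...]]

print(col_2_head.tolist())  # [0.9, 9.0]
print(two_cols.shape)       # (2, 2)
```

Keeping label and position lookups in separate indexers removes the ambiguity `.ix` had when an index itself contained integers.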
