<a name="top"></a>
# Python Workshop 3
We will look at [NumPy](http://www.numpy.org/) and [Pandas](https://pandas.pydata.org/), which are two of the most popular packages for data analysts these days.

* [NumPy](#numpy) provides a homogeneous multidimensional array, which is well suited to numerical work such as predictive analysis.
* [Pandas](#pandas) provides easy-to-use data structures, which are great for handling data.

-----
<a name="numpy"></a>
## NumPy
NumPy’s main object is the homogeneous multidimensional array (an N-dimensional grid of elements, usually numbers, all of the same type). NumPy’s array class is called `ndarray`. Here we will cover:
* Array Creation
* Basic Operations
* Indexing, Slicing and Iterating
* Shape Manipulation
* Copies
* Broadcasting Rules

More details [here](https://docs.scipy.org/doc/numpy-1.15.1/user/quickstart.html).

### Array Creation
First, we need to create the `ndarray`. There are several ways to create arrays.

In [1]:
import numpy as np
a = np.array([2,3,4])
print('a={}\ntype={}'.format(a, type(a)))
b = np.array([1.2, 3.5, 5.1])
print('b=',b)
c = np.zeros((2,4))
print('c=',c)
d = np.ones((2,3,4))
print('d=',d)
e = np.empty((2,3))
print('e=',e)

a=[2 3 4]
type=<class 'numpy.ndarray'>
b= [1.2 3.5 5.1]
c= [[0. 0. 0. 0.]
 [0. 0. 0. 0.]]
d= [[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]
e= [[1.11260348e-306 8.34422220e-308 1.60218763e-306]
 [1.69121367e-306 1.33511969e-306 9.34562257e-307]]
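
The values printed for `e` carry no meaning: `np.empty` allocates memory without initialising it, so the contents can differ from run to run. The following is a minimal sketch of a few related constructors, assuming you want predictable contents or a specific element type. The variable names are illustrative only.

```python
import numpy as np

# Request an element type explicitly instead of letting NumPy infer it.
ints_as_floats = np.array([2, 3, 4], dtype=np.float64)

# np.full fills a new array with a constant value.
sevens = np.full((2, 3), 7)

# np.zeros_like copies the shape and dtype of an existing array.
zeros = np.zeros_like(sevens)

print(ints_as_floats.dtype)  # float64
print(sevens)
print(zeros.shape)           # (2, 3)
```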


Notice how NumPy displays arrays (in NumPy, dimensions are called axes):
* The last axis is printed from left to right,
* The second-to-last is printed from top to bottom,
* The rest are also printed from top to bottom, with each slice separated from the next by an empty line.

You can also create sequences of numbers:

In [2]:
f = np.arange(10, 30, 5)
print('arange=', f)
g = np.linspace(0, 2, 9)
print('linspace=', g)
h = np.random.random((3,4))
print('random=', h)

arange= [10 15 20 25]
linspace= [0.   0.25 0.5  0.75 1.   1.25 1.5  1.75 2.  ]
random= [[0.24676819 0.24108013 0.13106437 0.49904287]
 [0.87870332 0.6212798  0.13764296 0.09222461]
 [0.59885802 0.52389192 0.88047451 0.84402225]]
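
The difference between the two sequence functions is easy to miss: `arange` takes a step and excludes the stop value, while `linspace` takes a number of points and includes both endpoints by default. The sketch below also uses a seeded generator so the random array is reproducible. The seed value `0` is an arbitrary choice for illustration.

```python
import numpy as np

# arange: start, stop (excluded), step.
by_step = np.arange(10, 30, 5)

# linspace: start, stop (included), number of points.
by_count = np.linspace(0, 2, 9)

# A seeded generator gives the same "random" numbers on every run.
rng = np.random.default_rng(seed=0)
sample = rng.random((3, 4))  # uniform values in [0, 1)

print(by_step)       # [10 15 20 25]
print(by_count[-1])  # 2.0
print(sample.shape)  # (3, 4)
```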


`ndarray` has a number of attributes.

In [3]:
print('a=',a)
print('a.ndim=',a.ndim)
print('a.shape=',a.shape)
print('a.size=',a.size)
print('a.dtype=',a.dtype)

a= [2 3 4]
a.ndim= 1
a.shape= (3,)
a.size= 3
a.dtype= int32
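
The `int32` shown above reflects the machine this notebook ran on; on many other platforms the default integer type is `int64`. As a small sketch, fixing the dtype explicitly makes the size-related attributes predictable (the array `m` is a made-up example):

```python
import numpy as np

m = np.zeros((2, 4), dtype=np.int64)

print(m.ndim)      # 2  (number of axes)
print(m.shape)     # (2, 4)
print(m.size)      # 8  (total number of elements)
print(m.itemsize)  # 8  (bytes per int64 element)
print(m.nbytes)    # 64 (size * itemsize)
```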


### Basic Operations
Arithmetic operators on arrays apply elementwise.

In [4]:
print('a=',a)
print('b=',b)
print('a-b=',a-b)

a= [2 3 4]
b= [1.2 3.5 5.1]
a-b= [ 0.8 -0.5 -1.1]


In [5]:
print(a**2)
print(np.sqrt(a))
print(b>3)

[ 4  9 16]
[1.41421356 1.73205081 2.        ]
[False  True  True]
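
Because arithmetic is elementwise, `*` on two 2D arrays is not the matrix product. A minimal sketch of the difference, using two small made-up matrices and the `@` operator for matrix multiplication:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

elementwise = A * B  # multiplies matching positions
matrix_prod = A @ B  # rows of A times columns of B

print(elementwise)   # [[0 2], [3 0]]
print(matrix_prod)   # [[2 1], [4 3]]
```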


Many unary operations, such as the sum of all the elements in the array, are implemented as methods of the `ndarray` class.

In [6]:
print(a.sum())
print(a.mean())
print(a.std())
print(a.min(), a.max())

9
3.0
0.816496580927726
2 4
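
By default these methods treat the array as a flat list of numbers. On a multidimensional array, the `axis` parameter restricts the operation to one axis. A short sketch on a made-up 3x4 array:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(m.sum())        # 66, over all elements
print(m.sum(axis=0))  # [12 15 18 21], one sum per column
print(m.max(axis=1))  # [ 3  7 11], one maximum per row
```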


### Indexing, Slicing and Iterating
**Indexing** and **slicing** work much as they do for Python lists. Multidimensional arrays can have one index per axis; these indices are given as a comma-separated tuple:

In [7]:
print('a=',a)
print('a[1]=',a[1])
print('a[1:]=',a[1:])

a= [2 3 4]
a[1]= 3
a[1:]= [3 4]


In [8]:
print('e=',e)
print('e[0,1]=',e[0,1])
print('e[:,1]=',e[:,1])
print('e[0,0:2]=',e[0,0:2])

e= [[1.11260348e-306 8.34422220e-308 1.60218763e-306]
 [1.69121367e-306 1.33511969e-306 9.34562257e-307]]
e[0,1]= 8.344222198972635e-308
e[:,1]= [8.34422220e-308 1.33511969e-306]
e[0,0:2]= [1.11260348e-306 8.34422220e-308]
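
To make the tuple form concrete, here is a small sketch on an array with known contents (`m` is a made-up example). The three spellings of the first lookup select the same element, and negative indices and steps behave as they do for lists.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)
# [[0 1 2]
#  [3 4 5]]

print(m[0, 1])     # 1
print(m[(0, 1)])   # 1, the comma form is shorthand for this tuple
print(m[0][1])     # 1, two separate lookups with the same result
print(m[-1, ::2])  # [3 5], last row, every second column
```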


When fewer indices are provided than the number of axes, the missing indices are considered complete slices.

In [9]:
print('d=',d)
print('d[1]=',d[1])
print('d[...,3]=', d[...,3])

d= [[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]
d[1]= [[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
d[...,3]= [[1. 1. 1.]
 [1. 1. 1.]]
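
Since `d` contains only ones, it is hard to see which elements were selected. The sketch below repeats the idea on a made-up array of distinct values and checks that the short forms match the fully written slices.

```python
import numpy as np

t = np.arange(24).reshape(2, 3, 4)

# A missing trailing index means "everything along that axis".
same_block = np.array_equal(t[1], t[1, :, :])

# The ellipsis stands for as many full slices as needed.
same_last = np.array_equal(t[..., 3], t[:, :, 3])

print(same_block)       # True
print(same_last)        # True
print(t[..., 3].shape)  # (2, 3)
```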


**Iterating** over multidimensional arrays is done with respect to the first axis:

In [10]:
for row in e:
    print('row=',row)

row= [1.11260348e-306 8.34422220e-308 1.60218763e-306]
row= [1.69121367e-306 1.33511969e-306 9.34562257e-307]
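
If you need every element rather than every row, one option is the `flat` attribute, which iterates over all elements in order. A minimal sketch with a made-up 2x3 array:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

# Iterating over the array yields one row (first axis) at a time.
row_sums = [int(row.sum()) for row in m]

# m.flat yields the individual elements instead.
elements = [int(x) for x in m.flat]

print(row_sums)  # [6, 15]
print(elements)  # [1, 2, 3, 4, 5, 6]
```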


### Shape Manipulation
You can change the shape of an array.

In [11]:
print('h=',h)
print(h.shape)

h= [[0.24676819 0.24108013 0.13106437 0.49904287]
 [0.87870332 0.6212798  0.13764296 0.09222461]
 [0.59885802 0.52389192 0.88047451 0.84402225]]
(3, 4)


In [12]:
print('reshape=',h.reshape(6,2))

reshape= [[0.24676819 0.24108013]
 [0.13106437 0.49904287]
 [0.87870332 0.6212798 ]
 [0.13764296 0.09222461]
 [0.59885802 0.52389192]
 [0.88047451 0.84402225]]


In [13]:
print('transpose=',h.T)

transpose= [[0.24676819 0.87870332 0.59885802]
 [0.24108013 0.6212798  0.52389192]
 [0.13106437 0.13764296 0.88047451]
 [0.49904287 0.09222461 0.84402225]]


In [39]:
h.reshape(2,3,2)

array([[[0.24676819, 0.24108013],
        [0.13106437, 0.49904287],
        [0.87870332, 0.6212798 ]],

       [[0.13764296, 0.09222461],
        [0.59885802, 0.52389192],
        [0.88047451, 0.84402225]]])

In [41]:
print(np.swapaxes(h.reshape(2,3,2),1,2))

[[[0.24676819 0.13106437 0.87870332]
  [0.24108013 0.49904287 0.6212798 ]]

 [[0.13764296 0.59885802 0.88047451]
  [0.09222461 0.52389192 0.84402225]]]
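
Two details of the calls above are worth a sketch: `reshape` returns a new array object and leaves the original shape untouched, and one dimension may be given as `-1` so that NumPy works it out. The example also checks that `swapaxes(…, 1, 2)` matches a transpose with the last two axes exchanged. The array `m` is made up for illustration.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)

# -1 asks NumPy to infer that dimension from the total size.
r = m.reshape(2, -1)
print(r.shape)  # (2, 6)
print(m.shape)  # (3, 4), the original is unchanged

cube = m.reshape(2, 3, 2)
swapped = np.swapaxes(cube, 1, 2)
print(swapped.shape)                                    # (2, 2, 3)
print(np.array_equal(swapped, cube.transpose(0, 2, 1)))  # True
```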


### Copies
**No Copy at All** (similar to passing by reference): Simple assignment makes no copy of the array object or of its data; both names refer to the same object.

In [14]:
a = np.arange(12)
b = a
print('a=',a)
print(b is a)
b[0] = 100
print('a=',a)

a= [ 0  1  2  3  4  5  6  7  8  9 10 11]
True
a= [100   1   2   3   4   5   6   7   8   9  10  11]


**Deep Copy** (passing-by-value): The `copy` method makes a complete copy of the array and its data.

In [15]:
d = a.copy()
print(d is a)
d[0] = 50
print('a=',a)
print('d=',d)

False
a= [100   1   2   3   4   5   6   7   8   9  10  11]
d= [50  1  2  3  4  5  6  7  8  9 10 11]
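
There is also a case in between these two: slicing an array returns a view, which is a new array object that shares the original's data. A minimal sketch, using `np.shares_memory` to check for shared data (the variable names are illustrative):

```python
import numpy as np

a = np.arange(12)

# A slice is a different object but looks at the same data.
view = a[2:5]
view[0] = -1
print(view is a)                  # False
print(np.shares_memory(view, a))  # True
print(a[2])                       # -1, the original changed

# Calling copy() on the slice breaks the link.
independent = a[2:5].copy()
independent[0] = 99
print(a[2])                       # still -1
```

If a function modifies a slice it was given, take a `copy()` first when the original array must stay intact.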


### Broadcasting Rules

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when

1. they are equal, or
2. one of them is 1

If these conditions are not met, a `ValueError: operands could not be broadcast together` exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays.
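
As a sketch of these rules, `np.broadcast_shapes` (available in NumPy 1.20 and later, which is an assumption about your installed version) reports the result shape without building any arrays. The last part triggers the error on purpose with two incompatible shapes.

```python
import numpy as np

# Shapes are compared from the trailing dimension backwards.
print(np.broadcast_shapes((3, 2), (2,)))     # (3, 2)
print(np.broadcast_shapes((2, 4, 3), (3,)))  # (2, 4, 3)
print(np.broadcast_shapes((4, 1), (3,)))     # (4, 3)

# Trailing dimensions 2 and 3 are neither equal nor 1.
try:
    np.ones((3, 2)) + np.ones(3)
    raised = False
except ValueError as err:
    raised = True
    print('incompatible shapes:', err)
```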

In [16]:
a = np.array([[1,2],
              [3,4],
              [5,6]])
b = np.array([10,-10])
print(f'a \t size = {a.shape}\n', a)
print(f'b \t size = {b.shape}\n', b)
print(f'a+b \t size = {(a+b).shape}\n', a+b)

a 	 size = (3, 2)
 [[1 2]
 [3 4]
 [5 6]]
b 	 size = (2,)
 [ 10 -10]
a+b 	 size = (3, 2)
 [[11 -8]
 [13 -6]
 [15 -4]]


In [17]:
a = np.arange(24).reshape((2,4,3))
b = np.array([-1,1,0.1])
print(f'a \t size = {a.shape}\n', a)
print(f'b \t size = {b.shape}\n', b)
print(f'a*b \t size = {(a*b).shape}\n', a*b)

a 	 size = (2, 4, 3)
 [[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]
  [ 9 10 11]]

 [[12 13 14]
  [15 16 17]
  [18 19 20]
  [21 22 23]]]
b 	 size = (3,)
 [-1.   1.   0.1]
a*b 	 size = (2, 4, 3)
 [[[ -0.    1.    0.2]
  [ -3.    4.    0.5]
  [ -6.    7.    0.8]
  [ -9.   10.    1.1]]

 [[-12.   13.    1.4]
  [-15.   16.    1.7]
  [-18.   19.    2. ]
  [-21.   22.    2.3]]]


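The second rule of broadcasting (a dimension of size 1 is stretched to match the other array) provides a convenient way of taking the outer operation of two arrays, that is, combining every element of one array with every element of the other.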

In [18]:
a = np.array([0.0, 10.0, 20.0, 30.0])
b = np.array([1.0, 2.0, 3.0])
print(f'a \t\t size = {a.shape}\n', a)
print(f'b \t\t size = {b.shape}\n', b)
print(f'a[:, np.newaxis] size = {a[:, np.newaxis].shape}')
print('a[:, np.newaxis] + b = \n', a[:, np.newaxis] + b)

a 		 size = (4,)
 [ 0. 10. 20. 30.]
b 		 size = (3,)
 [1. 2. 3.]
a[:, np.newaxis] size = (4, 1)
a[:, np.newaxis] + b = 
 [[ 1.  2.  3.]
 [11. 12. 13.]
 [21. 22. 23.]
 [31. 32. 33.]]


Here the `newaxis` index operator inserts a new axis into a, making it a two-dimensional 4x1 array. Combining the 4x1 array with b, which has shape (3,), yields a 4x3 array.
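
As a cross-check, the same 4x3 result can be reached in two other ways: reshaping `a` into a column with `reshape(-1, 1)`, or calling the `outer` method of the `np.add` ufunc. This is a sketch for comparison, not a recommendation of one form over another.

```python
import numpy as np

a = np.array([0.0, 10.0, 20.0, 30.0])
b = np.array([1.0, 2.0, 3.0])

via_newaxis = a[:, np.newaxis] + b  # (4, 1) + (3,) -> (4, 3)
via_reshape = a.reshape(-1, 1) + b  # same column shape, spelled differently
via_outer = np.add.outer(a, b)      # every a[i] + b[j]

print(np.array_equal(via_newaxis, via_reshape))  # True
print(np.array_equal(via_newaxis, via_outer))    # True
print(via_outer.shape)                           # (4, 3)
```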

[[top](#top)]

----

## <a name="pandas"></a> Pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas handles data through two main structures:
* `Series` for 1D data
* `DataFrame` for 2D data

### Series
A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

In [19]:
import pandas as pd
s = pd.Series([1,3,5,'str',6.1])
s

0      1
1      3
2      5
3    str
4    6.1
dtype: object

In [20]:
s = pd.Series(range(3))
s

0    0
1    1
2    2
dtype: int64
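
The "labeled" part becomes visible once you supply your own index instead of the default 0, 1, 2. A minimal sketch with made-up labels and values:

```python
import pandas as pd

prices = pd.Series([1.5, 2.0, 3.25], index=['a', 'b', 'c'])

print(prices['b'])     # 2.0, lookup by label
print(prices.iloc[0])  # 1.5, lookup by position

# Boolean conditions filter the Series and keep the labels.
expensive = prices[prices > 1.75]
print(expensive.index.tolist())  # ['b', 'c']
```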

### DataFrame
A DataFrame is a 2D data structure in which data is arranged in a table of rows and columns.

In [21]:
df1 = pd.DataFrame(np.random.randn(3,4), columns=['A','B','C','D'])
print(df1)

          A         B         C         D
0  0.738447  1.080987 -0.917871  0.938041
1 -0.623283 -0.946206 -1.518223 -0.255585
2  1.273193 -0.016428 -0.303638 -0.226786


In [22]:
df1

          A         B         C         D
0  0.738447  1.080987 -0.917871  0.938041
1 -0.623283 -0.946206 -1.518223 -0.255585
2  1.273193 -0.016428 -0.303638 -0.226786


In [23]:
df2 = pd.DataFrame({'A':1, 'B':np.arange(3), 'C':'Hello'})
df2

   A  B      C
0  1  0  Hello
1  1  1  Hello
2  1  2  Hello
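
In the dictionary form, scalar values such as `1` and `'Hello'` are repeated to match the length of the array-valued column. The sketch below rebuilds the same frame under the illustrative name `df` and shows that selecting one column gives back a Series.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 1, 'B': np.arange(3), 'C': 'Hello'})

print(df.shape)          # (3, 3)
print(df['A'].tolist())  # [1, 1, 1], the scalar was repeated
print(df['B'].tolist())  # [0, 1, 2]
print(type(df['B']))     # a pandas Series
print(df.dtypes)         # each column has its own dtype
```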


In [24]:
df1 = pd.DataFrame(np.random.randn(10,5), columns=['A','B','C','D', 'E'])
df1

          A         B         C         D         E
0  0.828180 -0.797814 -1.266018  0.707084 -1.011089
1 -2.424754  1.860558  0.334354 -1.137992  0.271048
2  0.879616  1.128962  0.779180 -0.121240 -1.204958
3 -1.114732 -1.569894 -0.017670 -1.259541 -0.592882
4 -0.435665 -0.459891 -0.908328 -0.540685 -1.054029
5  0.187300  0.623774 -0.618745  0.541438  0.990823
6 -0.580636 -1.465300 -1.905776 -1.486758 -1.245909
7 -0.014368 -0.683845 -0.274539 -0.377423 -0.035931
8  0.835499  0.720072 -0.201252  1.592790  0.483071
9 -1.699335 -0.547593 -1.361964 -0.281972  0.067018


In [25]:
df1.head(3)

          A         B         C         D         E
0  0.828180 -0.797814 -1.266018  0.707084 -1.011089
1 -2.424754  1.860558  0.334354 -1.137992  0.271048
2  0.879616  1.128962  0.779180 -0.121240 -1.204958


In [26]:
df1.tail(3)

          A         B         C         D         E
7 -0.014368 -0.683845 -0.274539 -0.377423 -0.035931
8  0.835499  0.720072 -0.201252  1.592790  0.483071
9 -1.699335 -0.547593 -1.361964 -0.281972  0.067018
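
`head` and `tail` are shortcuts for positional slicing. As a sketch, the frame below uses deterministic made-up values (not the random data above) so that the equivalence with `iloc` can be checked directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(50).reshape(10, 5), columns=list('ABCDE'))

# head(n) is the first n rows, tail(n) the last n rows.
print(df.head(3).equals(df.iloc[:3]))   # True
print(df.tail(3).equals(df.iloc[-3:]))  # True

# The original row labels are kept.
print(df.tail(3).index.tolist())        # [7, 8, 9]
```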


In [27]:
df1.index

RangeIndex(start=0, stop=10, step=1)

In [28]:
df1.columns

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [29]:
df1.values

array([[ 0.82818036, -0.7978145 , -1.26601762,  0.70708406, -1.0110886 ],
       [-2.42475382,  1.86055768,  0.33435408, -1.1379916 ,  0.27104832],
       [ 0.87961629,  1.12896196,  0.77917965, -0.12123983, -1.20495778],
       [-1.11473208, -1.56989414, -0.01766993, -1.25954123, -0.59288199],
       [-0.43566452, -0.45989085, -0.90832764, -0.54068523, -1.05402872],
       [ 0.18730028,  0.62377441, -0.61874527,  0.54143787,  0.99082336],
       [-0.58063561, -1.46529995, -1.90577559, -1.48675838, -1.24590892],
       [-0.01436803, -0.68384506, -0.27453855, -0.37742253, -0.03593064],
       [ 0.83549861,  0.72007199, -0.20125176,  1.59278976,  0.48307135],
       [-1.69933527, -0.54759311, -1.36196422, -0.28197239,  0.06701759]])

In [30]:
type(df1.values)

numpy.ndarray
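
Recent pandas documentation recommends the `to_numpy()` method over the `values` attribute for this conversion. A minimal sketch on a small made-up frame, showing that both give the same array here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})

arr = df.to_numpy()

print(type(arr))                       # <class 'numpy.ndarray'>
print(arr.dtype)                       # float64
print(np.array_equal(arr, df.values))  # True
```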

#### Convert Between NumPy and Pandas

You can convert between a Pandas DataFrame or Series and a NumPy `ndarray` very easily:

In [32]:
a1 = np.array(df1)
a1

array([[ 0.82818036, -0.7978145 , -1.26601762,  0.70708406, -1.0110886 ],
       [-2.42475382,  1.86055768,  0.33435408, -1.1379916 ,  0.27104832],
       [ 0.87961629,  1.12896196,  0.77917965, -0.12123983, -1.20495778],
       [-1.11473208, -1.56989414, -0.01766993, -1.25954123, -0.59288199],
       [-0.43566452, -0.45989085, -0.90832764, -0.54068523, -1.05402872],
       [ 0.18730028,  0.62377441, -0.61874527,  0.54143787,  0.99082336],
       [-0.58063561, -1.46529995, -1.90577559, -1.48675838, -1.24590892],
       [-0.01436803, -0.68384506, -0.27453855, -0.37742253, -0.03593064],
       [ 0.83549861,  0.72007199, -0.20125176,  1.59278976,  0.48307135],
       [-1.69933527, -0.54759311, -1.36196422, -0.28197239,  0.06701759]])

In [33]:
a1.dtype

dtype('float64')

In [34]:
a2 = np.array(df2)
a2

array([[1, 0, 'Hello'],
       [1, 1, 'Hello'],
       [1, 2, 'Hello']], dtype=object)

In [35]:
df2.values

array([[1, 0, 'Hello'],
       [1, 1, 'Hello'],
       [1, 2, 'Hello']], dtype=object)

In [36]:
pd.DataFrame(a1)

          0         1         2         3         4
0  0.828180 -0.797814 -1.266018  0.707084 -1.011089
1 -2.424754  1.860558  0.334354 -1.137992  0.271048
2  0.879616  1.128962  0.779180 -0.121240 -1.204958
3 -1.114732 -1.569894 -0.017670 -1.259541 -0.592882
4 -0.435665 -0.459891 -0.908328 -0.540685 -1.054029
5  0.187300  0.623774 -0.618745  0.541438  0.990823
6 -0.580636 -1.465300 -1.905776 -1.486758 -1.245909
7 -0.014368 -0.683845 -0.274539 -0.377423 -0.035931
8  0.835499  0.720072 -0.201252  1.592790  0.483071
9 -1.699335 -0.547593 -1.361964 -0.281972  0.067018


In [37]:
pd.Series(a2[:,1])

0    0
1    1
2    2
dtype: object
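
Two things are lost in the round trip above: the column names (the rebuilt frame has columns 0 to 4) and, for mixed data, the numeric dtype (the Series is `object`). The sketch below shows one way to restore both, using a small made-up object array; the column names passed in are assumptions for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 0, 'Hello'],
                [1, 1, 'Hello']], dtype=object)

# Pass the labels back in when rebuilding the frame.
df = pd.DataFrame(arr, columns=['A', 'B', 'C'])
print(df.columns.tolist())  # ['A', 'B', 'C']

# A column sliced from an object array stays object until converted.
col = pd.Series(arr[:, 1])
print(col.dtype)            # object
as_int = col.astype('int64')
print(as_int.dtype)         # int64
```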

[[top](#top)]

----