<a name="top"></a>
# Python Workshop 3
We will look at [NumPy](http://www.numpy.org/) and [Pandas](https://pandas.pydata.org/), which are two of the most popular packages for data analysts these days.

* [NumPy](#numpy) provides a homogeneous multidimensional array, which is well suited to numerical work such as predictive analysis.
* [Pandas](#pandas) provides easy-to-use data structures, which are great for handling data.

-----
<a name="numpy"></a>
## NumPy
NumPy’s main object is the homogeneous multidimensional array (an N-dimensional grid whose elements, usually numbers, are all of the same type). NumPy’s array class is called `ndarray`. Here we will cover:
* Array Creation
* Basic Operations
* Indexing, Slicing and Iterating
* Shape Manipulation
* Copies
* Broadcasting Rules

More details [here](https://docs.scipy.org/doc/numpy-1.15.1/user/quickstart.html).

### Array Creation
First, we need to create the `ndarray`. There are several ways to create arrays.

In [56]:
import numpy as np
a = np.array([2,3,4])
print('a={}\ntype={}'.format(a, type(a)))
b = np.array([1.2, 3.5, 5.1])
print('b=',b)
c = np.zeros((2,4))
print('c=',c)
d = np.ones((2,3,4))
print('d=',d)
e = np.empty((2,3))  # uninitialized: the values are whatever happened to be in memory
print('e=',e)

a=[2 3 4]
type=<class 'numpy.ndarray'>
b= [1.2 3.5 5.1]
c= [[0. 0. 0. 0.]
 [0. 0. 0. 0.]]
d= [[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]
e= [[4.9e-324 9.9e-324 1.5e-323]
 [2.0e-323 2.5e-323 3.0e-323]]


Notice how NumPy displays arrays (in NumPy, dimensions are called axes):
* The last axis is printed from left to right,
* The second-to-last is printed from top to bottom,
* The rest are also printed from top to bottom, with each slice separated from the next by an empty line.

You can also create sequences of numbers.

In [57]:
f = np.arange(10, 30, 5)
print('arange=', f)
g = np.linspace(0, 2, 9)
print('linspace=', g)
h = np.random.random((3,4))
print('random=', h)

arange= [10 15 20 25]
linspace= [0.   0.25 0.5  0.75 1.   1.25 1.5  1.75 2.  ]
random= [[0.07635127 0.46437346 0.51108858 0.52961393]
 [0.24813921 0.83966458 0.77815347 0.35104367]
 [0.87344161 0.67031862 0.43899596 0.81716443]]
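The random values above will differ on every run. If reproducible numbers are wanted, one option is a seeded generator. This is a minimal sketch, assuming a NumPy version that provides `np.random.default_rng`; the seed value `0` and the names `h_seeded` and `h_again` are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator reproduces the same numbers on every run.
rng = np.random.default_rng(seed=0)   # the seed value is an arbitrary choice
h_seeded = rng.random((3, 4))         # uniform samples in [0, 1), shape (3, 4)

# A second generator created with the same seed yields an identical array.
h_again = np.random.default_rng(seed=0).random((3, 4))
print(np.array_equal(h_seeded, h_again))  # True
```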


`ndarray` has a number of attributes.

In [58]:
print('a=',a)
print('a.ndim=',a.ndim)
print('a.shape=',a.shape)
print('a.size=',a.size)
print('a.dtype=',a.dtype)

a= [2 3 4]
a.ndim= 1
a.shape= (3,)
a.size= 3
a.dtype= int64


### Basic Operations
Arithmetic operators on arrays apply elementwise.

In [59]:
print('a=',a)
print('b=',b)
print('a-b=',a-b)

a= [2 3 4]
b= [1.2 3.5 5.1]
a-b= [ 0.8 -0.5 -1.1]


In [60]:
print(a**2)
print(np.sqrt(a))
print(b>3)

[ 4  9 16]
[1.41421356 1.73205081 2.        ]
[False  True  True]


Many unary operations, such as the sum of all the elements in the array, are implemented as methods of the `ndarray` class.

In [61]:
print(a.sum())
print(a.mean())
print(a.std())
print(a.min(), a.max())

9
3.0
0.816496580927726
2 4
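By default these methods treat the array as a flat list of numbers. For a multidimensional array, the `axis` argument applies the operation along one axis instead. A small sketch using a hypothetical 2x3 array `m`:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]
print(m.sum())         # 15, the sum of all elements
print(m.sum(axis=0))   # [3 5 7], one sum per column
print(m.sum(axis=1))   # [ 3 12], one sum per row
print(m.max(axis=1))   # [2 5], the largest value in each row
```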


### Indexing, Slicing and Iterating
**Indexing** and **slicing** work much as they do for lists. Multidimensional arrays can have one index per axis, given as a comma-separated tuple:

In [62]:
print('a=',a)
print('a[1]=',a[1])
print('a[1:]=',a[1:])

a= [2 3 4]
a[1]= 3
a[1:]= [3 4]


In [63]:
print('e=',e)
print('e[0,1]=',e[0,1])
print('e[:,1]=',e[:,1])
print('e[0,0:2]=',e[0,0:2])

e= [[4.9e-324 9.9e-324 1.5e-323]
 [2.0e-323 2.5e-323 3.0e-323]]
e[0,1]= 1e-323
e[:,1]= [9.9e-324 2.5e-323]
e[0,0:2]= [5.e-324 1.e-323]


When fewer indices are provided than the number of axes, the missing indices are considered complete slices.

In [64]:
print('d=',d)
print('d[1]=',d[1])
print('d[...,3]=', d[...,3])

d= [[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]
d[1]= [[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
d[...,3]= [[1. 1. 1.]
 [1. 1. 1.]]


**Iterating** over multidimensional arrays is done with respect to the first axis:

In [65]:
for row in e:
    print('row=',row)

row= [4.9e-324 9.9e-324 1.5e-323]
row= [2.0e-323 2.5e-323 3.0e-323]
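To visit every element rather than every row, the `flat` attribute gives an iterator over all elements. A short sketch with a hypothetical array `m`:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)

rows = [row.tolist() for row in m]       # iterating yields slices along the first axis
print(rows)                              # [[0, 1, 2], [3, 4, 5]]

flat_values = [int(x) for x in m.flat]   # every element, row by row
print(flat_values)                       # [0, 1, 2, 3, 4, 5]
```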


### Shape Manipulation
You can change the shape of an array.

In [66]:
print('h=',h)
print(h.shape)

h= [[0.07635127 0.46437346 0.51108858 0.52961393]
 [0.24813921 0.83966458 0.77815347 0.35104367]
 [0.87344161 0.67031862 0.43899596 0.81716443]]
(3, 4)


In [67]:
print('reshape=',h.reshape(6,2))

reshape= [[0.07635127 0.46437346]
 [0.51108858 0.52961393]
 [0.24813921 0.83966458]
 [0.77815347 0.35104367]
 [0.87344161 0.67031862]
 [0.43899596 0.81716443]]


In [68]:
print('transpose=',h.T)

transpose= [[0.07635127 0.24813921 0.87344161]
 [0.46437346 0.83966458 0.67031862]
 [0.51108858 0.77815347 0.43899596]
 [0.52961393 0.35104367 0.81716443]]
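Note that `reshape` and `T` return a new array object and leave the shape of the original unchanged. Two related conveniences are sketched below with a hypothetical array `m`: passing `-1` lets NumPy infer one dimension, and `ravel` flattens the array.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)

wide = m.reshape(2, -1)   # -1 is inferred as 12 / 2 = 6
print(wide.shape)         # (2, 6)

flat = m.ravel()          # one-dimensional, elements in row order
print(flat.shape)         # (12,)

print(m.shape)            # (3, 4): m itself keeps its shape
```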


### Copies
**No Copy at All** (reference semantics): Simple assignment makes no copy of the array object or of its data; both names refer to the same array.

In [69]:
a = np.arange(12)
b = a
print('a=',a)
print(b is a)
b[0] = 100
print('a=',a)

a= [ 0  1  2  3  4  5  6  7  8  9 10 11]
True
a= [100   1   2   3   4   5   6   7   8   9  10  11]


**Deep Copy** (value semantics): The `copy` method makes a complete, independent copy of the array and its data.

In [70]:
d = a.copy()
print(d is a)
d[0] = 50
print('a=',a)
print('d=',d)

False
a= [100   1   2   3   4   5   6   7   8   9  10  11]
d= [50  1  2  3  4  5  6  7  8  9 10 11]
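Between these two cases sits the view: a new array object that shares the data of the original. Basic slicing returns a view, so writing to a slice changes the original array. A minimal sketch (the names `v` and `c` are chosen here for illustration):

```python
import numpy as np

a = np.arange(12)
v = a[2:5]                         # basic slicing returns a view, not a copy
print(v is a)                      # False: v is a different array object
print(np.shares_memory(a, v))      # True: but it shares a's data

v[0] = -1
print(a[2])                        # -1: the change is visible through a

c = a[2:5].copy()                  # an independent copy of the slice
c[0] = 99
print(a[2])                        # -1: a is unaffected by changes to c
```

When a slice must be modified without touching the source array, calling `copy` on it first avoids this surprise.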


### Broadcasting Rules

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when

1. they are equal, or
2. one of them is 1

If the arrays have different numbers of dimensions, the shorter shape is treated as if it were padded with ones on the left. If these conditions are not met, a `ValueError: operands could not be broadcast together` exception is raised, indicating that the arrays have incompatible shapes. The shape of the resulting array is the maximum size along each dimension of the input arrays.

In [71]:
a = np.array([[1,2],
              [3,4],
              [5,6]])
b = np.array([10,-10])
print(f'a \t size = {a.shape}\n', a)
print(f'b \t size = {b.shape}\n', b)
print(f'a+b \t size = {(a+b).shape}\n', a+b)

a 	 size = (3, 2)
 [[1 2]
 [3 4]
 [5 6]]
b 	 size = (2,)
 [ 10 -10]
a+b 	 size = (3, 2)
 [[11 -8]
 [13 -6]
 [15 -4]]


In [72]:
a = np.arange(24).reshape((2,4,3))
b = np.array([-1,1,0.1])
print(f'a \t size = {a.shape}\n', a)
print(f'b \t size = {b.shape}\n', b)
print(f'a*b \t size = {(a*b).shape}\n', a*b)

a 	 size = (2, 4, 3)
 [[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]
  [ 9 10 11]]

 [[12 13 14]
  [15 16 17]
  [18 19 20]
  [21 22 23]]]
b 	 size = (3,)
 [-1.   1.   0.1]
a*b 	 size = (2, 4, 3)
 [[[ -0.    1.    0.2]
  [ -3.    4.    0.5]
  [ -6.    7.    0.8]
  [ -9.   10.    1.1]]

 [[-12.   13.    1.4]
  [-15.   16.    1.7]
  [-18.   19.    2. ]
  [-21.   22.    2.3]]]
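The shape rules can also be checked without doing any arithmetic. The sketch below assumes a NumPy version that provides `np.broadcast_shapes` (NumPy 1.20 or later) and then shows an incompatible pair of shapes; the array `bad` is a hypothetical example.

```python
import numpy as np

a = np.ones((2, 4, 3))
print(np.broadcast_shapes(a.shape, (3,)))     # (2, 4, 3)
print(np.broadcast_shapes(a.shape, (4, 1)))   # (2, 4, 3)

bad = np.ones(4)   # trailing dimensions are 3 and 4: not equal, and neither is 1
raised = False
try:
    a + bad
except ValueError as err:
    raised = True
    print('ValueError:', err)
print(raised)      # True
```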


The second rule of broadcasting provides a convenient way to compute an outer operation (such as an outer sum or outer product) of two arrays.

In [73]:
a = np.array([0.0, 10.0, 20.0, 30.0])
b = np.array([1.0, 2.0, 3.0])
print(f'a \t\t size = {a.shape}\n', a)
print(f'b \t\t size = {b.shape}\n', b)
print(f'a[:, np.newaxis] size = {a[:, np.newaxis].shape}')
print('a[:, np.newaxis] + b = \n', a[:, np.newaxis] + b)

a 		 size = (4,)
 [ 0. 10. 20. 30.]
b 		 size = (3,)
 [1. 2. 3.]
a[:, np.newaxis] size = (4, 1)
a[:, np.newaxis] + b = 
 [[ 1.  2.  3.]
 [11. 12. 13.]
 [21. 22. 23.]
 [31. 32. 33.]]


Here the `newaxis` index operator inserts a new axis into a, making it a two-dimensional 4x1 array. Combining the 4x1 array with b, which has shape (3,), yields a 4x3 array.
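The same pattern works for other operators. As a quick check, the sketch below compares the broadcast results with NumPy's own outer functions, `np.outer` and `np.add.outer`; the names `outer_sum` and `outer_prod` are chosen here for illustration.

```python
import numpy as np

a = np.array([0.0, 10.0, 20.0, 30.0])
b = np.array([1.0, 2.0, 3.0])

outer_sum = a[:, np.newaxis] + b     # shape (4, 3)
outer_prod = a[:, np.newaxis] * b    # same pattern with multiplication

print(np.array_equal(outer_sum, np.add.outer(a, b)))   # True
print(np.array_equal(outer_prod, np.outer(a, b)))      # True
```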

[[top](#top)]

----

## <a name="pandas"></a> Pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas handles data through
* Series for 1D data
* DataFrame for 2D data

### Series
Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.).

In [74]:
import pandas as pd
s = pd.Series([1,3,5,'str',6.1])
s

0      1
1      3
2      5
3    str
4    6.1
dtype: object

In [75]:
s = pd.Series(range(3))
s

0    0
1    1
2    2
dtype: int64
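The labels do not have to be the default integers 0, 1, 2. A small sketch, using made-up labels `'a'`, `'b'`, `'c'`, of looking values up by label and by position:

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])

print(s['b'])          # 2.5, lookup by label
print(s.iloc[0])       # 1.5, lookup by position
print(list(s.index))   # ['a', 'b', 'c']
```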

### DataFrame
A DataFrame is a 2D data structure in which data is arranged in rows and columns, like a table.

In [76]:
df1 = pd.DataFrame(np.random.randn(3,4), columns=['A','B','C','D'])
print(df1)

          A         B         C         D
0  1.175837  1.229544 -0.747967 -0.037420
1  0.455640 -1.363090 -0.070912  1.059917
2 -0.969031 -0.950011 -1.404612  2.181282


In [77]:
df1

          A         B         C         D
0  1.175837  1.229544 -0.747967 -0.037420
1  0.455640 -1.363090 -0.070912  1.059917
2 -0.969031 -0.950011 -1.404612  2.181282


In [78]:
df2 = pd.DataFrame({'A':1, 'B':np.arange(3), 'C':'Hello'})
df2

   A  B      C
0  1  0  Hello
1  1  1  Hello
2  1  2  Hello


In [79]:
df1 = pd.DataFrame(np.random.randn(10,5), columns=['A','B','C','D', 'E'])
df1

          A         B         C         D         E
0  0.558737 -0.689699  0.853262  0.270851  0.134071
1 -2.070171 -1.514859 -1.413037  0.253821  1.484072
2  1.034710  0.948821  1.502790 -0.487263 -0.681141
3 -0.748126 -0.283792 -0.249979 -0.326275 -0.573418
4  0.082693  0.151935  0.596887 -0.751293 -0.312903
5 -0.503388 -1.179796  0.735956 -0.930468 -1.709655
6 -0.665418  1.452067  1.415349  1.086858 -1.159439
7 -0.127376 -1.422871 -0.334567 -1.720186  1.279950
8 -0.438580 -0.345760  0.105766  0.025187  0.522760
9  0.325755 -0.180258  0.573176  0.157329 -0.773369


In [80]:
df1.head(3)

          A         B         C         D         E
0  0.558737 -0.689699  0.853262  0.270851  0.134071
1 -2.070171 -1.514859 -1.413037  0.253821  1.484072
2  1.034710  0.948821  1.502790 -0.487263 -0.681141


In [81]:
df1.tail(3)

          A         B         C         D         E
7 -0.127376 -1.422871 -0.334567 -1.720186  1.279950
8 -0.438580 -0.345760  0.105766  0.025187  0.522760
9  0.325755 -0.180258  0.573176  0.157329 -0.773369


In [82]:
df1.index

RangeIndex(start=0, stop=10, step=1)

In [83]:
df1.columns

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [84]:
df1.values

array([[ 0.558737  , -0.6896986 ,  0.85326235,  0.27085084,  0.13407062],
       [-2.07017051, -1.51485926, -1.41303711,  0.25382093,  1.48407215],
       [ 1.0347096 ,  0.94882068,  1.50279017, -0.487263  , -0.68114071],
       [-0.74812593, -0.28379151, -0.24997927, -0.32627487, -0.57341788],
       [ 0.0826934 ,  0.15193504,  0.59688734, -0.75129313, -0.3129026 ],
       [-0.50338828, -1.17979576,  0.73595641, -0.93046787, -1.70965468],
       [-0.66541821,  1.45206729,  1.41534877,  1.08685844, -1.1594392 ],
       [-0.12737598, -1.42287094, -0.33456718, -1.72018641,  1.27995018],
       [-0.43858048, -0.34576012,  0.10576608,  0.02518689,  0.52275955],
       [ 0.32575544, -0.18025795,  0.57317564,  0.15732858, -0.77336894]])

In [85]:
type(df1.values)

numpy.ndarray

### Getting Data In/Out

#### CSV
A **CSV** (comma-separated values) file is a text file in which the values in each row are separated by commas.

CSV is preferred because of its simplicity, and it can be used with any spreadsheet program.

In [86]:
df3 = pd.read_csv('customer.csv')
df3

   customer_id  first_name   last_name                email                    address             city  state    zip  balance
0            1      George  Washington  gwashington@usa.gov         3200 Mt Vernon Hwy     Mount Vernon     VA  22121     1000
1            2        John       Adams       jadams@usa.gov            1250 Hancock St           Quincy     MA   2169      500
2            3      Thomas   Jefferson   tjefferson@usa.gov  931 Thomas Jefferson Pkwy  Charlottesville     VA  22902     2500
3            4       James     Madison     jmadison@usa.gov     11350 Constitution Hwy           Orange     VA  22960     9000
4            5       James      Monroe      jmonroe@usa.gov     2050 James Monroe Pkwy  Charlottesville     VA  22902     1756


In [87]:
df3.to_csv('customer2.csv', header=False, index=False, float_format='%.6f')
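One thing to watch when reading CSV files: `read_csv` infers column types, so a code such as a ZIP code with a leading zero is parsed as an integer and the zero is dropped. The value `2169` for Quincy, MA in the output above is consistent with this, assuming the file stores it as `02169`. The sketch below uses a small in-memory CSV (made-up here, not the workshop's `customer.csv`) and the `dtype` argument to keep the column as text.

```python
import io
import pandas as pd

# A tiny stand-in for a CSV file, held in memory for illustration.
csv_text = "customer_id,state,zip\n1,VA,22121\n2,MA,02169\n"

as_int = pd.read_csv(io.StringIO(csv_text))
print(as_int['zip'].tolist())    # [22121, 2169]: parsed as integers

as_str = pd.read_csv(io.StringIO(csv_text), dtype={'zip': str})
print(as_str['zip'].tolist())    # ['22121', '02169']: leading zero kept
```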

#### JSON

**JSON** (JavaScript Object Notation) stores data as human-readable text. A JSON object is a collection of key-value pairs enclosed in curly brackets; arrays use square brackets.

JSON suits NoSQL data because its structure is flexible, and it can also represent tabular data.

In [88]:
!cat test.json
# On Windows, use `!type test.json` instead of `!cat test.json`.
df4 = pd.read_json('test.json')
df4

{
    "fruit": ["Apple", "Banana", "Orange", "Peach", "Pear"],
    "size": ["Large", "Medium", "Small", "Small", "Large"],
    "color": ["Red", "Yellow", "Orange", "Pink", "Green"]
}


    fruit    size   color
0   Apple   Large     Red
1  Banana  Medium  Yellow
2  Orange   Small  Orange
3   Peach   Small    Pink
4    Pear   Large   Green


In [89]:
json_str = df4.to_json(orient='records', lines=True)
print(json_str)
print(type(json_str))

{"fruit":"Apple","size":"Large","color":"Red"}
{"fruit":"Banana","size":"Medium","color":"Yellow"}
{"fruit":"Orange","size":"Small","color":"Orange"}
{"fruit":"Peach","size":"Small","color":"Pink"}
{"fruit":"Pear","size":"Large","color":"Green"}
<class 'str'>
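A string in this one-record-per-line layout can be read back with `read_json` by passing the same `orient` and `lines` arguments. The sketch below is a round trip on a small made-up DataFrame, wrapping the string in `io.StringIO` so that it is treated as file content rather than as a path.

```python
import io
import pandas as pd

df = pd.DataFrame({'fruit': ['Apple', 'Banana'], 'size': ['Large', 'Medium']})
json_lines = df.to_json(orient='records', lines=True)   # one JSON object per line

restored = pd.read_json(io.StringIO(json_lines), orient='records', lines=True)
print(restored)
print(restored.to_dict('records') == df.to_dict('records'))   # True
```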


#### Excel

Excel is a widely used spreadsheet program. Pandas can read and write its `.xlsx` files, including workbooks that contain several sheets.

In [91]:
df5 = pd.read_excel('test.xlsx', index_col=0)
print(df5)

            first_name   last_name                email  \
customer_id                                               
1               George  Washington  gwashington@usa.gov   
2                 John       Adams       jadams@usa.gov   
3               Thomas   Jefferson   tjefferson@usa.gov   
4                James     Madison     jmadison@usa.gov   
5                James      Monroe      jmonroe@usa.gov   

                               address             city state    zip  
customer_id                                                           
1                   3200 Mt Vernon Hwy     Mount Vernon    VA  22121  
2                      1250 Hancock St           Quincy    MA   2169  
3            931 Thomas Jefferson Pkwy  Charlottesville    VA  22902  
4               11350 Constitution Hwy           Orange    VA  22960  
5               2050 James Monroe Pkwy  Charlottesville    VA  22902  


In [92]:
xl = pd.ExcelFile('test.xlsx')
for i, name in enumerate(xl.sheet_names):
    df = xl.parse(name, index_col=0)
    print('\nDataFrame', i)
    print(df)


DataFrame 0
            first_name   last_name                email  \
customer_id                                               
1               George  Washington  gwashington@usa.gov   
2                 John       Adams       jadams@usa.gov   
3               Thomas   Jefferson   tjefferson@usa.gov   
4                James     Madison     jmadison@usa.gov   
5                James      Monroe      jmonroe@usa.gov   

                               address             city state    zip  
customer_id                                                           
1                   3200 Mt Vernon Hwy     Mount Vernon    VA  22121  
2                      1250 Hancock St           Quincy    MA   2169  
3            931 Thomas Jefferson Pkwy  Charlottesville    VA  22902  
4               11350 Constitution Hwy           Orange    VA  22960  
5               2050 James Monroe Pkwy  Charlottesville    VA  22902  

DataFrame 1
                   order_date   amount  customer_id
order_id   

In [94]:
with pd.ExcelWriter('test2.xlsx') as writer:
    df2.to_excel(writer, sheet_name='Sheet1')
    df4.to_excel(writer, sheet_name='Sheet2')

For more details about Pandas I/O, see https://pandas.pydata.org/pandas-docs/stable/reference/io.html

#### Convert Between NumPy and Pandas

You can convert between a Pandas DataFrame/Series and a NumPy ndarray very easily.

In [95]:
a1 = np.array(df1)
a1

array([[ 0.558737  , -0.6896986 ,  0.85326235,  0.27085084,  0.13407062],
       [-2.07017051, -1.51485926, -1.41303711,  0.25382093,  1.48407215],
       [ 1.0347096 ,  0.94882068,  1.50279017, -0.487263  , -0.68114071],
       [-0.74812593, -0.28379151, -0.24997927, -0.32627487, -0.57341788],
       [ 0.0826934 ,  0.15193504,  0.59688734, -0.75129313, -0.3129026 ],
       [-0.50338828, -1.17979576,  0.73595641, -0.93046787, -1.70965468],
       [-0.66541821,  1.45206729,  1.41534877,  1.08685844, -1.1594392 ],
       [-0.12737598, -1.42287094, -0.33456718, -1.72018641,  1.27995018],
       [-0.43858048, -0.34576012,  0.10576608,  0.02518689,  0.52275955],
       [ 0.32575544, -0.18025795,  0.57317564,  0.15732858, -0.77336894]])

In [96]:
a1.dtype

dtype('float64')

In [97]:
a2 = np.array(df2)
a2

array([[1, 0, 'Hello'],
       [1, 1, 'Hello'],
       [1, 2, 'Hello']], dtype=object)

In [98]:
df2.values

array([[1, 0, 'Hello'],
       [1, 1, 'Hello'],
       [1, 2, 'Hello']], dtype=object)

In [99]:
pd.DataFrame(a1)

          0         1         2         3         4
0  0.558737 -0.689699  0.853262  0.270851  0.134071
1 -2.070171 -1.514859 -1.413037  0.253821  1.484072
2  1.034710  0.948821  1.502790 -0.487263 -0.681141
3 -0.748126 -0.283792 -0.249979 -0.326275 -0.573418
4  0.082693  0.151935  0.596887 -0.751293 -0.312903
5 -0.503388 -1.179796  0.735956 -0.930468 -1.709655
6 -0.665418  1.452067  1.415349  1.086858 -1.159439
7 -0.127376 -1.422871 -0.334567 -1.720186  1.279950
8 -0.438580 -0.345760  0.105766  0.025187  0.522760
9  0.325755 -0.180258  0.573176  0.157329 -0.773369


In [100]:
pd.Series(a2[:,1])

0    0
1    1
2    2
dtype: object
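The `object` dtype above comes from the mixed-type array: a DataFrame with both numbers and strings converts to an `object` array, and a column taken from that array stays `object`. The sketch below rebuilds a frame like `df2`, uses `to_numpy` (an alternative to `.values`, assuming a reasonably recent pandas), and shows two ways to get a numeric dtype back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 1, 'B': np.arange(3), 'C': 'Hello'})

mixed = df.to_numpy()                 # mixed column types fall back to dtype=object
numeric = df[['A', 'B']].to_numpy()   # integer-only columns keep an integer dtype
print(mixed.dtype, numeric.dtype)

col = pd.Series(mixed[:, 1])          # dtype: object, as in the cell above
as_int = col.astype('int64')          # convert explicitly to a numeric dtype
print(col.dtype, as_int.dtype)
```

Selecting only the numeric columns before converting avoids the `object` fallback in the first place.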

[[top](#top)]

----