### Topics 
- array data types with numpy
- more advanced data manipulation with pandas
- Dict and Strings in Python

### Numpy Basics 
Numerical Python is a powerful library of functions, methods, and data types.
The main object that numpy provides is the **ndarray**. Arrays differ in a few fundamental ways from regular Python lists:
- Arrays cannot be of mixed types. They can be all integers, floats, strings, logical (or boolean) values, or other immutable values, but they cannot be some characters and some numbers, or any other olio of data types. They also cannot contain mutable types such as lists. So, we can have a list of lists, but not an array of lists. We can, however, have an array of arrays (sort of), which brings us to:

- Arrays can be multidimensional, but they must be rectangular. You can have a list of lists, where the first interior list is 3 elements long, the second 5, and the third 12, but for your multidimensional arrays, every row must have the same number of columns.

- We can perform vector operations on them, which can be algebraic functions (like a dot product), or simple replacements of values in a slice of the array.
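
A minimal sketch of the first two points, using made-up values: when a list with mixed types is passed to np.array, numpy converts every element to one common type rather than keeping the mix, and equal-length rows give a rectangular 2d array.

```python
import numpy as np

# integers and floats are promoted to a common float type
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)        # float64

# numbers and strings are all converted to strings
mixed_text = np.array([1, "a"])
print(mixed_text.dtype.kind)      # 'U' (unicode string)

# rows of equal length give a rectangular 2d array
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.shape)                 # (2, 3)
```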

### Initializing Arrays
Start with a list and turn it into an array with the np.array function.

In [5]:
import numpy as np

a = [0] * 40
print (type(a))

a = np.array(a)
print (type(a))
print (a)

<class 'list'>
<class 'numpy.ndarray'>
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0]


A similar vector of zeros can be created as follows. Note that np.zeros and np.ones produce floats by default:

In [8]:
a = np.zeros(40)
print (a)
print(np.ones(5))

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.]
[ 1.  1.  1.  1.  1.]


For some non-zero arrays:

In [10]:
print (np.arange(14))

# more parameters could be given to arange
print (np.arange(10, 20))  ## starting and stopping points 

print (np.arange(10, 20, 2))   ## starting, stopping points and increment 
print (np.arange(10, 20, 1.5))

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13]
[10 11 12 13 14 15 16 17 18 19]
[10 12 14 16 18]
[ 10.   11.5  13.   14.5  16.   17.5  19. ]


In [14]:
# from 5 to 6 with 10 items, linear spacing 
print (np.linspace(5, 6, 10))

[ 5.          5.11111111  5.22222222  5.33333333  5.44444444  5.55555556
  5.66666667  5.77777778  5.88888889  6.        ]


In [15]:
# from 10^0 to 10^5, logarithmic spacing 
print (np.logspace(0, 5, 10))

[  1.00000000e+00   3.59381366e+00   1.29154967e+01   4.64158883e+01
   1.66810054e+02   5.99484250e+02   2.15443469e+03   7.74263683e+03
   2.78255940e+04   1.00000000e+05]


Arrays with more than one dimension

In [16]:
# the shape is given as a tuple, hence the inner set of parentheses: (rows, columns)
np.zeros ((3,4))

array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

In [17]:
# or create it from a Python list of lists
a = [ [1,2,3], [3,4,5], [8,9,10] ]
np.array(a)

array([[ 1,  2,  3],
       [ 3,  4,  5],
       [ 8,  9, 10]])

### Recap: creating numpy arrays
- np.zeros: array of 0.0 of specified size
- np.ones : array of 1.0 of specified size
- np.array: create numpy array from python list
- np.linspace : create a numpy array within a given interval with linear spacing
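
np.arange and np.linspace, both used above, differ in how they treat the end of the interval. The sketch below uses arbitrary values to show that arange takes a step and excludes the stop value, while linspace takes a number of points and includes the end value by default.

```python
import numpy as np

# arange: fixed step, the stop value is NOT included
stepped = np.arange(0, 1, 0.25)

# linspace: fixed number of points, the end value IS included by default
spaced = np.linspace(0, 1, 5)

print(stepped)   # four values from 0 to 0.75
print(spaced)    # five values from 0 to 1
```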

### Accessing elements 

### 1d arrays

For 1d arrays, indexing works just like it does for lists.


In [30]:
a = np.arange(10)
print (a)
print (a[5])

[0 1 2 3 4 5 6 7 8 9]
5


In [21]:
# slicing conventions also work 
print (a[:3])
print (a[-1])

[0 1 2]
9


In [25]:
# could also index with a list or another numpy array 
a = np.array([2, 4, 5, 8, 10])  # this is array creation from list 

# items wanted with their location in a list 
local = [0, 2, 3]
print (a[local])

# also works if locations are stored in another array
local = np.array(local)
print (a[local])
print (type(a[local]))

[2 5 8]
[2 5 8]
<class 'numpy.ndarray'>


**Indexing with a list of positions DOES NOT work with a normal Python list.**
A comprehension or a loop is needed instead.

In [28]:
a = [2, 4, 5, 8, 10]
local = [0, 2, 3]

print ([a[i] for i in local])
print (type([a[i] for i in local]))

[2, 5, 8]
<class 'list'>


You can also index into an array with boolean values:
- the boolean array must be the same length as the array (recent NumPy versions raise an IndexError otherwise; older versions only gave a warning)
- **this doesn't work with regular Python lists**

In [29]:
a = np.array([2, 4, 6, 8, 10, 12])
b = np.array([True, True, False, False, False, True])
a[b]

array([ 2,  4, 12])

### 2d Array

For 2d arrays, the indexing is [row, column].

In [31]:
a = np.array([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [3, 6, 9, 12, 15]])
a

array([[ 1,  2,  3,  4,  5],
       [ 2,  4,  6,  8, 10],
       [ 3,  6,  9, 12, 15]])

In [32]:
a[2,4]   # single element

15

In [33]:
a[1, :]   # entire row

array([ 2,  4,  6,  8, 10])

In [34]:
a[:, 2]   # entire column

array([3, 6, 9])

**Slicing and boolean indexing work as they did before; they are now applied to each axis separately.**
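
As a small illustration of per-axis indexing, the sketch below reuses the 3 x 5 array from above. The variable names block, keep_rows, picked_rows and picked_cols are made up for this example.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5],
              [2, 4, 6, 8, 10],
              [3, 6, 9, 12, 15]])

# slice on both axes: rows 0-1, columns 1-3
block = a[0:2, 1:4]

# boolean array on the row axis: keep rows 0 and 2, all columns
keep_rows = np.array([True, False, True])
picked_rows = a[keep_rows, :]

# boolean mask on the column axis: columns whose first-row value is > 2
picked_cols = a[:, a[0, :] > 2]

print(block)
print(picked_rows)
print(picked_cols)
```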


### Numpy array properties 

Numpy arrays have two important properties: shape and size.

In [35]:
a = np.zeros((5, 10))
a

array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [36]:
a.shape

(5, 10)

In [37]:
a.size 

50

In [38]:
# len only gives the length of the outermost dimension, which is the number of rows
len(a)

5

Every value in a numpy array must be the same type (unlike normal Python lists). This type is stored in the 'dtype' property.


In [40]:
# the element type of a numpy array is stored in dtype
a.dtype

dtype('float64')

In [41]:
# type of a single entry
type(a[0, 0])

numpy.float64

In [44]:
# arange with integer arguments gives an integer dtype ('int64' here; the default integer size can depend on the platform)
a = np.arange(10)
print (a.dtype)

# the dtype can also be set explicitly as follows
b = np.arange(10, dtype = 'float')
print (b.dtype)

int64
float64


In [45]:
# could also cast from one to the other 
a = np.arange(10).astype('float')
a.dtype

dtype('float64')
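
Casting in the other direction loses information. A minimal sketch with made-up values: astype(int) drops the fractional part, and astype returns a new array rather than changing the original.

```python
import numpy as np

values = np.array([1.7, -1.7, 2.0])

# casting float -> int drops the fractional part (truncates toward zero)
as_int = values.astype(int)
print(as_int)            # the values 1, -1 and 2

# astype returns a new array; the original keeps its dtype
print(values.dtype)      # float64
```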

### Operating on numpy arrays

In [52]:
a = np.zeros((5, 5))

In [53]:
# update the current element values in the multidimensional array
a[1, 3] = 7
a

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  7.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [54]:
a[1, 3] *= 2
a 

array([[  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,  14.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.]])

**The real power of arrays is the ability to have one statement affect a large chunk of an array.**

In [56]:
# assign 1 to the entire row 1
a[1, :] = 1
a 

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [58]:
# assign 7 to the entire column with index 3 (the fourth column)
a[:, 3] = 7
a

array([[ 0.,  0.,  0.,  7.,  0.],
       [ 1.,  1.,  1.,  7.,  1.],
       [ 0.,  0.,  0.,  7.,  0.],
       [ 0.,  0.,  0.,  7.,  0.],
       [ 0.,  0.,  0.,  7.,  0.]])

In [59]:
# Assign -1 everywhere array is zero (this works because (a == 0) produces a 2d boolean numpy array)
a[a == 0] = -1
a

array([[-1., -1., -1.,  7., -1.],
       [ 1.,  1.,  1.,  7.,  1.],
       [-1., -1., -1.,  7., -1.],
       [-1., -1., -1.,  7., -1.],
       [-1., -1., -1.,  7., -1.]])
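
The masked assignment above changes the array in place. When the original should stay untouched, np.where is one alternative; the sketch below uses a small made-up array and builds a new array from the same kind of boolean mask.

```python
import numpy as np

a = np.zeros((2, 3))
a[0, 1] = 7

# the comparison produces a boolean array of the same shape
mask = (a == 0)

# np.where builds a NEW array: -1 where the mask is True, a's value elsewhere
replaced = np.where(mask, -1, a)

print(replaced)
print(a)   # unchanged
```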

If we try to do the above using lists:

In [66]:
# create list of lists with all zeros 
ll = [ [0] * 5 for i in range(5)]
print (ll)

[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]


In [67]:
# set entries in row 1 to 1
# enumerate(iterable[, start]) yields (index, value) pairs, counting from
# start (default 0): (0, seq[0]), (1, seq[1]), (2, seq[2]), ...
for i, elem in enumerate(ll[1]):
    ll[1][i] = 1
    
ll

[[0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0]]

In [79]:
# ll[1, :] fails with "list indices must be integers or slices, not tuple"
print (ll[:1])   # a slice: a list holding only row 0

print (ll[1])    # row 1

[[0, 0, 0, 0, 0]]
[1, 1, 1, 1, 1]


In [82]:
for i, elem in enumerate(ll[1]):
    print(i, elem)

0 1
1 1
2 1
3 1
4 1


In [63]:
# set entries in the column with index 4 to 7 (the numpy example above used index 3)
for l in ll:
    l[4]= 7
ll

[[0, 0, 0, 0, 7],
 [1, 1, 1, 1, 7],
 [0, 0, 0, 0, 7],
 [0, 0, 0, 0, 7],
 [0, 0, 0, 0, 7]]

### Vector math with arrays 

Do math on many values at once with arrays, without using a for loop.

In [83]:
a = np.arange(0, 100, 2)
b = np.arange(50)

print (a)
print (b)

[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48
 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]


In [84]:
b / 2

array([  0. ,   0.5,   1. ,   1.5,   2. ,   2.5,   3. ,   3.5,   4. ,
         4.5,   5. ,   5.5,   6. ,   6.5,   7. ,   7.5,   8. ,   8.5,
         9. ,   9.5,  10. ,  10.5,  11. ,  11.5,  12. ,  12.5,  13. ,
        13.5,  14. ,  14.5,  15. ,  15.5,  16. ,  16.5,  17. ,  17.5,
        18. ,  18.5,  19. ,  19.5,  20. ,  20.5,  21. ,  21.5,  22. ,
        22.5,  23. ,  23.5,  24. ,  24.5])

In [86]:
# pairwise multiplication 
print(a * b) 

# dot product 
np.dot(a, b)

[   0    2    8   18   32   50   72   98  128  162  200  242  288  338  392
  450  512  578  648  722  800  882  968 1058 1152 1250 1352 1458 1568 1682
 1800 1922 2048 2178 2312 2450 2592 2738 2888 3042 3200 3362 3528 3698 3872
 4050 4232 4418 4608 4802]


80850

In [87]:
print ((a * b).sum())

print (np.sum(a * b))

80850
80850


In [88]:
# other math functions also apply to all the elements
a = np.linspace(-np.pi, np.pi, 10)
np.sin(a)

array([ -1.22464680e-16,  -6.42787610e-01,  -9.84807753e-01,
        -8.66025404e-01,  -3.42020143e-01,   3.42020143e-01,
         8.66025404e-01,   9.84807753e-01,   6.42787610e-01,
         1.22464680e-16])

Also:
- np.exp : exponential function
- np.log : logarithm
- np.abs : absolute value
- ... and many others

### Useful numpy functions
NumPy is huge, with around 1200 pages of reference documentation. The following are some basic statistical functions:

### Random distributions 
- uniform distribution
- normal distribution 

In [89]:
a = np.random.uniform(0, 100, 10)   # lower, upper bound and size of output
a

array([ 92.47768987,  55.66739508,  81.39816102,  41.21994641,
        94.32882325,  35.04759045,  91.5061206 ,  69.43786929,
        55.28103639,  14.65728916])

In [90]:
# could also give a shape as the third argument
a = np.random.uniform(0, 100, (3, 3))
a

array([[ 30.53575893,  14.9305635 ,  62.77591512],
       [ 92.42281326,  17.23172521,  93.17119396],
       [ 67.1859455 ,   0.22656211,   8.54016835]])

In [91]:
# normal distribution with mean = 0, std = 1 and sample size= 10
a = np.random.normal(0, 1, 10)
a


array([-0.99089343,  0.49238631, -0.2529982 , -0.12834274, -1.0440051 ,
       -0.35750895,  0.51206632, -0.40534158,  0.10268209,  0.99130447])
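
The numbers above change on every run. For repeatable results, one option is a seeded generator. The sketch below assumes NumPy 1.17 or newer, where np.random.default_rng is available; the seed 42 is arbitrary and this generator API is not used elsewhere in these notes.

```python
import numpy as np

# two generators created with the same seed yield the same samples
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

uniform_a = rng_a.uniform(0, 100, 5)   # lower, upper bound and size
uniform_b = rng_b.uniform(0, 100, 5)

# mean 0, std 1, output shape (2, 3)
normal_a = rng_a.normal(0, 1, (2, 3))

print(np.array_equal(uniform_a, uniform_b))   # True
```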

### Summary statistics 

In [92]:
# draw 1000 samples from a normal distribution with mean 5 and std 3
a = np.random.normal(5, 3, 1000)

print (np.mean(a))
print (np.std(a))
print (np.min(a))
print (np.max(a))

5.2111643091
3.08882647628
-4.34183503261
14.2378638305


### Operating on 2d arrays 
One of the areas where numpy really shines is its ability to quickly operate along an axis of a 2d array.

In [93]:
# 2d array with 5 rows, 3 columns
a = np.ones ((5, 3)) 
a

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [94]:
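# Python's built-in sum adds the rows together, so this gives one total per column
# (use np.sum(a) or a.sum() to sum over all elements)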
sum(a)

array([ 5.,  5.,  5.])

In [96]:
# sum along axis 0 (down the rows): one total per column
np.sum(a, axis = 0)

# which is the same as 
a.sum(axis = 0 )

array([ 5.,  5.,  5.])

In [97]:
# sum along axis 1 (across the columns): one total per row
np.sum(a, axis = 1)

array([ 3.,  3.,  3.,  3.,  3.])
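
With an array of all ones the axes are easy to mix up. The sketch below uses a made-up 2 x 3 array with distinct values so the effect of each axis argument is visible.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [10, 20, 30]])

total = np.sum(a)                # every element: 66
per_column = np.sum(a, axis=0)   # collapses the rows: 11, 22, 33
per_row = np.sum(a, axis=1)      # collapses the columns: 6, 60
builtin = sum(a)                 # Python's sum adds the rows: same as axis=0

print(total, per_column, per_row, builtin)
```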

### Boolean numpy arrays for selection and filtering 

In [98]:
a = np.zeros(10, dtype = bool)
a

array([False, False, False, False, False, False, False, False, False, False], dtype=bool)

In [99]:
# slicing and mass assignment still work for array with BOOLEAN type 
a[2: 5] = True 
a

array([False, False,  True,  True,  True, False, False, False, False, False], dtype=bool)

In [100]:
# the character ~ inverts the boolean array
b = ~a
b

array([ True,  True, False, False, False,  True,  True,  True,  True,  True], dtype=bool)

In [101]:
# "&" (and) and "|" (or) combine boolean arrays element by element
a = np.array([True, False, True])
b = np.array([False, False, True])

print (" A and B is {}". format (a & b))
print (" A or B is {}". format (a | b))

 A and B is [False False  True]
 A or B is [ True False  True]


In [102]:
# boolean array and data accessing example

# np.random.randn returns a sample (or samples) from the “standard normal” distribution.
data = np.random.randn(10)
data 

array([-0.32709702, -0.72586373, -0.06293976,  0.9014897 , -0.38309493,
       -0.22420609, -0.87452063,  0.73498385,  0.47888074,  0.29893285])

In [104]:
data = data[data < 0]
data 

array([-0.32709702, -0.72586373, -0.06293976, -0.38309493, -0.22420609,
       -0.87452063])

In [124]:
# np.random.rand creates an array of the given shape and populates it with random samples from a uniform distribution over [0, 1).
data = np.random.rand(4, 5) * 10
print (type (data))
data 


<class 'numpy.ndarray'>


array([[  7.72212061e+00,   3.32878843e+00,   4.26967922e+00,
          4.24869577e+00,   7.08036316e+00],
       [  6.80014456e+00,   2.58262997e+00,   9.88908151e+00,
          4.35020049e+00,   6.93051621e+00],
       [  7.00247033e+00,   9.05154777e+00,   5.79075262e+00,
          8.11815066e+00,   4.50701992e-02],
       [  4.54546024e+00,   9.14537478e+00,   8.86582688e+00,
          6.88032997e+00,   3.59784621e-04]])

In [113]:
# mean of each row (the output below has five values because it comes from an earlier run on a 5-row array)
np.mean(data, axis = 1)

array([ 3.45422862,  4.6599894 ,  6.24009515,  6.59614927,  4.3316211 ])

In [126]:
# Give me a subset of the data matrix, containing only rows with a mean > 5

data[np.mean(data, axis = 1) > 5, :]


array([[  7.72212061e+00,   3.32878843e+00,   4.26967922e+00,
          4.24869577e+00,   7.08036316e+00],
       [  6.80014456e+00,   2.58262997e+00,   9.88908151e+00,
          4.35020049e+00,   6.93051621e+00],
       [  7.00247033e+00,   9.05154777e+00,   5.79075262e+00,
          8.11815066e+00,   4.50701992e-02],
       [  4.54546024e+00,   9.14537478e+00,   8.86582688e+00,
          6.88032997e+00,   3.59784621e-04]])
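
Two conditions can be combined into one row selection with &. The sketch below uses a small made-up matrix (not the random data above) and keeps only rows whose mean is greater than 5 and whose second column is less than 4.

```python
import numpy as np

data = np.array([[7.0, 3.0, 8.0],
                 [6.0, 9.0, 6.0],
                 [1.0, 2.0, 3.0]])

row_mean_ok = np.mean(data, axis=1) > 5   # True, True, False
second_col_ok = data[:, 1] < 4            # True, False, True

# & combines the two boolean arrays element by element;
# parentheses are required when the comparisons are written inline
subset = data[row_mean_ok & second_col_ok, :]

print(subset)   # only the first row satisfies both conditions
```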

### Why Numpy?
- Avoid writing loops (don't re-invent the wheel)
- Most other Python data libraries work on top of numpy
- Efficient computation: operations using numpy are many times faster than their pure Python implementations. This is because numpy processes arrays using code written in 'low-level' languages like C or Fortran. These languages are much more tedious to write programs in, but run much faster than a 'high-level' language like Python. By using Python to call functions written by other people in low-level languages, you can get the best of both worlds.


### Quick performance comparison 

In [145]:
# create a 1000 x 1000, list of lists (2d matrix)
n_row = 1000
n_col = 1000

python_matrix = [[1] * n_col for i in range(n_row)]

In [146]:
%%timeit
# add 1 to every entry in the matrix
for i in range(n_row):
    for j in range(n_col):
        python_matrix [i][j] += 1

138 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [147]:
%%timeit 
# list comprehensions 
results = [[x + 1 for x in row] for row in python_matrix]

49 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [149]:
%%timeit
numpy_matrix = np.zeros ((1000, 1000))
numpy_matrix += 1

3.72 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
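
%%timeit only works inside IPython or Jupyter. Outside a notebook, a similar comparison can be sketched with the standard-library timeit module. The function names and the smaller 200 x 200 size are choices made for this example, and the measured times depend on the machine.

```python
import timeit
import numpy as np

n_row, n_col = 200, 200   # smaller than above so the sketch runs quickly
python_matrix = [[1] * n_col for _ in range(n_row)]
numpy_matrix = np.ones((n_row, n_col))

def add_one_lists():
    # pure Python: nested list comprehension
    return [[x + 1 for x in row] for row in python_matrix]

def add_one_numpy():
    # numpy: one vectorized statement
    return numpy_matrix + 1

# total seconds for 10 calls of each function
t_lists = timeit.timeit(add_one_lists, number=10)
t_numpy = timeit.timeit(add_one_numpy, number=10)
print(f"lists: {t_lists:.4f} s, numpy: {t_numpy:.4f} s")
```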


### Introduction to pandas 
The main object in pandas you will use is the DataFrame. It has several advantages over numpy ndarrays:
- Allows mixed types
- Label-based row and column indices
- Easy database-like operations (merge, join, groupby, sort, etc.)

If pandas is so great, why did we just learn about numpy?
- many of the same things work on numpy arrays and pandas DataFrames
- pandas actually uses numpy under the hood
- many libraries expect numpy arrays (but it is easy to cast from pandas to numpy)


In [150]:
import pandas as pd

In [153]:
# Read in data from a tab-delimited text-file
planes = pd.read_table("nycflights13/planes.txt")

In [155]:
# columns 
planes.columns

Index(['tailnum', 'year', 'type', 'manufacturer', 'model', 'engines', 'seats',
       'speed', 'engine'],
      dtype='object')

In [169]:
# row index 
print (planes.index)

RangeIndex(start=0, stop=3322, step=1)


There are three important types to know when working with DataFrames:
- DataFrame
- Series
- Index

### Series 
**one-dimensional**: represents a single column or row of data. It has only one index.

### DataFrame 
**two-dimensional**: has both row and column labels.

### Index
Represents the row or column labels in Series and DataFrame.

In [162]:
planes.head()

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
0,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
1,N102UW,1998.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
2,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
3,N104UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
4,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan


In [170]:
print (type(planes.columns))
print (type(planes.index))

# one single row or column is a series 
print (type (planes.iloc[0]))  # row 
print (type(planes['year']))   # column

<class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


### DataFrame Indexing
- To grab a single column, use: dataframe[column_name]
- To grab a row, use: dataframe.loc[row_name]
- And to grab a specific element use: dataframe.loc[row_name, column_name]
- If you want to grab rows or column based on their position, use: dataframe.iloc[row_number or :, column_number or :]
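
The four access patterns above can be tried on a tiny DataFrame. The sketch below builds one by hand, with tail numbers as row labels; the frame and its variable names are made up for this example and are not the planes table itself.

```python
import pandas as pd

df = pd.DataFrame(
    {"seats": [55, 182, 100], "engines": [2, 2, 2]},
    index=["N10156", "N102UW", "N997AT"],   # row labels
)

column = df["seats"]                 # a single column (Series)
row = df.loc["N102UW"]               # a row by label (Series)
value = df.loc["N102UW", "seats"]    # one element by labels
same_value = df.iloc[1, 0]           # the same element by position
first_column = df.iloc[:, 0]         # all rows, first column by position

print(value, same_value)
```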

In [175]:
# when a single column is selected, a Series is created with the same index as the DataFrame's index
print (type(planes['manufacturer']))

<class 'pandas.core.series.Series'>


In [181]:
# when a single row is selected, its index is the same as the DataFrame's columns

rowthree = planes.loc[3]
print (type(rowthree))     # this sliced row is also a series

<class 'pandas.core.series.Series'>


### For flights table 

In [182]:
flights = pd.read_table("nycflights13/flights.txt")

In [183]:
# random 5 rows
flights.sample(5)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
27631,2013,10,1,1559.0,1600,-1.0,1833.0,1905,-32.0,AA,1156,N3DWAA,LGA,DFW,178.0,1389,16,0,2013-10-01 16:00:00
198435,2013,5,6,1305.0,1250,15.0,1540.0,1535,5.0,WN,1791,N746SW,EWR,HOU,195.0,1411,12,50,2013-05-06 12:00:00
176379,2013,4,12,1954.0,1935,19.0,2340.0,2256,44.0,DL,1433,N371NB,LGA,RSW,165.0,1080,19,35,2013-04-12 19:00:00
240528,2013,6,20,1429.0,1430,-1.0,1724.0,1725,-1.0,B6,325,N584JB,JFK,TPA,142.0,1005,14,30,2013-06-20 14:00:00
185546,2013,4,22,1655.0,1700,-5.0,2021.0,1933,48.0,DL,1499,N675DL,LGA,ATL,117.0,762,17,0,2013-04-22 17:00:00


### Perform functions along an axis

In [189]:
# get the average air_time across all flights 
flights['air_time'].mean()

subset = flights[['air_time', 'dep_delay', 'arr_delay']]
# average along all rows
np.mean(subset, axis = 0)
# default average axis is also over the rows 
subset.mean()

air_time     150.686460
dep_delay     12.639070
arr_delay      6.895377
dtype: float64

In addition to mean, there's also:
- min
- max
- median
- sum
- var (for variance)
- std (for standard deviation)

There's also sort_values, which sorts by one or more columns.

In [191]:
# Sorts by year, then by month, then by day....and so on

flights.sort_values(['year', 'month', 'day', 'hour', 'minute']).head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01 05:00:00
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01 05:00:00
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01 05:00:00
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01 05:00:00
5,2013,1,1,554.0,558,-4.0,740.0,728,12.0,UA,1696,N39463,EWR,ORD,150.0,719,5,58,2013-01-01 05:00:00


**unique()** is used to check the distinct values in a column.

In [192]:
flights['origin'].unique()

array(['EWR', 'LGA', 'JFK'], dtype=object)

### Identifying and removing NA in a dataset

In [194]:
# Example of some bad rows
flights.iloc[835:845, :]

# are there any NAs in each column of the flights dataframe?
flights.isnull().any()

year              False
month             False
day               False
dep_time           True
sched_dep_time    False
dep_delay          True
arr_time           True
sched_arr_time    False
arr_delay          True
carrier           False
flight            False
tailnum            True
origin            False
dest              False
air_time           True
distance          False
hour              False
minute            False
time_hour         False
dtype: bool

In [143]:
# select flights where there is complete data
# True for rows with a null value in any column
null_rows = flights.isnull().any(axis = 1)

# invert the boolean SERIES
complete_rows = ~null_rows
flights_complete = flights[complete_rows]
# OR 
flights_complete = flights.loc[complete_rows]


print("Original Matrix Shape:", flights.shape)
print("Complete-rows shape:", flights_complete.shape)

Original Matrix Shape: (336776, 19)
Complete-rows shape: (327346, 19)
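
pandas also has a built-in shortcut for this filter. The sketch below uses a small made-up frame to check that dropna() with its default arguments keeps the same rows as the boolean mask built above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dep_delay": [2.0, np.nan, -1.0],
    "tailnum": ["N14228", "N24211", None],
})

# same approach as above: mask of rows without any null value
complete_rows = ~df.isnull().any(axis=1)
by_mask = df[complete_rows]

# dropna() with default arguments drops every row containing a missing value
by_dropna = df.dropna()

print(by_dropna)
```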


In [3]:
import pandas as pd
import numpy as np

In [1]:
### merge on columns 

In [4]:
planes = pd.read_table("nycflights13/planes.txt")
flights = pd.read_table("nycflights13/flights.txt")

In [123]:
planes.tail()

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
3317,N997AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3318,N997DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS AIRCRAFT CO,MD-88,2,142,,Turbo-fan
3319,N998AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3320,N998DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS CORPORATION,MD-88,2,142,,Turbo-jet
3321,N999DN,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS CORPORATION,MD-88,2,142,,Turbo-jet


In [121]:
flights.tail()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
336771,2013,9,30,,1455,,,1634,,9E,3393,,JFK,DCA,,213,14,55,2013-09-30 14:00:00
336772,2013,9,30,,2200,,,2312,,9E,3525,,LGA,SYR,,198,22,0,2013-09-30 22:00:00
336773,2013,9,30,,1210,,,1330,,MQ,3461,N535MQ,LGA,BNA,,764,12,10,2013-09-30 12:00:00
336774,2013,9,30,,1159,,,1344,,MQ,3572,N511MQ,LGA,CLE,,419,11,59,2013-09-30 11:00:00
336775,2013,9,30,,840,,,1020,,MQ,3531,N839MQ,LGA,RDU,,431,8,40,2013-09-30 08:00:00


In [81]:
# number of seats on planes that departed on Feb 1st

feb_flights = flights[(flights.month == 2) & (flights.day == 1)]
feb_flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
111296,2013,2,1,456.0,500,-4.0,652.0,648,4.0,US,1117,N197UW,EWR,CLT,98.0,529,5,0,2013-02-01 05:00:00
111297,2013,2,1,520.0,525,-5.0,816.0,820,-4.0,UA,1018,N24211,EWR,IAH,209.0,1400,5,25,2013-02-01 05:00:00
111298,2013,2,1,527.0,530,-3.0,837.0,829,8.0,UA,650,N470UA,LGA,IAH,233.0,1416,5,30,2013-02-01 05:00:00
111299,2013,2,1,532.0,540,-8.0,1007.0,1017,-10.0,B6,725,N554JB,JFK,BQN,195.0,1576,5,40,2013-02-01 05:00:00
111300,2013,2,1,540.0,540,0.0,859.0,850,9.0,AA,1141,N615AA,JFK,MIA,169.0,1089,5,40,2013-02-01 05:00:00


In [18]:
((flights.month == 2) & (flights.day == 1)).sum(axis = 0)

926

In [38]:
# 2750 planes did not depart on Feb 1st

len(set(planes.tailnum) - set(feb_flights.tailnum))

2750

In [136]:
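# pd.merge on columns
# note: on = 'tailnum' together with left_index = True is contradictory, since both name a join key for the left side;
# here the result ends up with row labels in the range of planes' index, and recent pandas versions raise a MergeError for this combination

flight_plane = pd.merge(feb_flights, planes, how = 'left', on = 'tailnum', left_index = True)
print (np.sum(flight_plane['seats']))

# the row labels 415, 515, 1322 ... of flight_plane are not feb_flights labels (see the index comparison below)
flight_plane.head()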

104242.0


Unnamed: 0,year_x,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,...,minute,time_hour,year_y,type,manufacturer,model,engines,seats,speed,engine
415,2013,2,1,456.0,500,-4.0,652.0,648,4.0,US,...,0,2013-02-01 05:00:00,2009.0,Fixed wing multi engine,AIRBUS,A321-211,2.0,199.0,,Turbo-fan
515,2013,2,1,520.0,525,-5.0,816.0,820,-4.0,UA,...,25,2013-02-01 05:00:00,1998.0,Fixed wing multi engine,BOEING,737-824,2.0,149.0,,Turbo-fan
1322,2013,2,1,527.0,530,-3.0,837.0,829,8.0,UA,...,30,2013-02-01 05:00:00,2001.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-232,2.0,200.0,,Turbo-fan
1632,2013,2,1,532.0,540,-8.0,1007.0,1017,-10.0,B6,...,40,2013-02-01 05:00:00,2002.0,Fixed wing multi engine,AIRBUS,A320-232,2.0,200.0,,Turbo-fan
1867,2013,2,1,540.0,540,0.0,859.0,850,9.0,AA,...,40,2013-02-01 05:00:00,1967.0,Fixed wing multi engine,BEECH,65-A90,2.0,9.0,202.0,Turbo-prop


In [137]:
print (flight_plane.index)  # values lie in the range of planes' row labels, not feb_flights'

print (feb_flights.index)   # subset of flights, keeps the flights row labels

print (planes.index.values)    # the default pandas index starts from zero


Int64Index([ 415,  515, 1322, 1632, 1867, 2626, 3321,  808, 3284, 3321,
            ...
              96,   85,   91,  101,   96,  172,  787, 3321, 2442, 3321],
           dtype='int64', length=926)
Int64Index([111296, 111297, 111298, 111299, 111300, 111301, 111302, 111303,
            111304, 111305,
            ...
            112212, 112213, 112214, 112215, 112216, 112217, 112218, 112219,
            112220, 112221],
           dtype='int64', length=926)
[   0    1    2 ..., 3319 3320 3321]


In [117]:
# the same plane(same tailnum) may depart several times on a single day

print (flight_plane.tailnum.shape)
print((flight_plane.tailnum.unique()).shape)

(926,)
(678,)


In [141]:
print ((flights.loc[111297] == feb_flights.loc[111297]).sum())

# left_index = True means "use the left index as the join key", not "keep the left index in the result", so flight_plane does not carry feb_flights' row labels
flight_plane.loc[515]['tailnum'] ==  feb_flights.loc[111297]['tailnum']
flight_plane.loc[515]['dest'] ==  feb_flights.loc[111297]['dest']


19


True
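
One way to merge on a column and still keep the left frame's row labels is to carry them through the merge as an ordinary column. The sketch below uses two tiny made-up frames (flights_small, planes_small); it is one possible approach, not the one used in the cells above.

```python
import pandas as pd

flights_small = pd.DataFrame(
    {"tailnum": ["N2", "N1", "N2"], "dest": ["IAH", "MIA", "ORD"]},
    index=[111296, 111297, 111298],
)
planes_small = pd.DataFrame({"tailnum": ["N1", "N2"], "seats": [55, 182]})

# merging on a column returns a fresh 0..n-1 index
merged = pd.merge(flights_small, planes_small, how="left", on="tailnum")

# to keep the left frame's row labels, carry them through as a column
kept = (
    flights_small.reset_index()          # old labels become the column "index"
    .merge(planes_small, how="left", on="tailnum")
    .set_index("index")                  # restore them as row labels
)

print(merged.index.tolist())   # [0, 1, 2]
print(kept.index.tolist())     # [111296, 111297, 111298]
```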

In [None]:
### pd.concat stacks multiple dataframes along the specified axis (axis = 1, concat horizontally; axis = 0, concat vertically)
### pd.merge joins on columns (by default)
### DataFrame.join joins on the index (by default)
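
A minimal sketch of the combining tools listed above, using three tiny made-up frames that share tail-number row labels.

```python
import pandas as pd

left = pd.DataFrame({"seats": [55, 182]}, index=["N1", "N2"])
right = pd.DataFrame({"engines": [2, 2]}, index=["N1", "N2"])
more = pd.DataFrame({"seats": [100]}, index=["N3"])

stacked = pd.concat([left, more], axis=0)        # vertically: 3 rows, 1 column
side_by_side = pd.concat([left, right], axis=1)  # horizontally: 2 rows, 2 columns
joined = left.join(right)                        # DataFrame.join aligns on the index

print(stacked.shape, side_by_side.shape, joined.shape)
```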

In [133]:
### another merge example - most common destination airports

airports = pd.read_table('nycflights13/airports.txt')
airports.head()

Unnamed: 0,faa,name,lat,lon,alt,tz,dst,tzone
0,04G,Lansdowne Airport,41.130472,-80.619583,1044,-5,A,America/New_York
1,06A,Moton Field Municipal Airport,32.460572,-85.680028,264,-6,A,America/Chicago
2,06C,Schaumburg Regional,41.989341,-88.101243,801,-6,A,America/Chicago
3,06N,Randall Airport,41.431912,-74.391561,523,-5,A,America/New_York
4,09J,Jekyll Island Airport,31.074472,-81.427778,11,-5,A,America/New_York


In [146]:
# flights data without NAs
flights_complete.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01 05:00:00
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01 05:00:00
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01 05:00:00
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01 05:00:00
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01 06:00:00


In [150]:
flights_dest = pd.merge(flights_complete, airports, how = 'left', left_on = 'dest', right_on = 'faa')
# the above merge gives too many data fields
flights_dest.columns

Index(['year', 'month', 'day', 'dep_time', 'sched_dep_time', 'dep_delay',
       'arr_time', 'sched_arr_time', 'arr_delay', 'carrier', 'flight',
       'tailnum', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute',
       'time_hour', 'faa', 'name', 'lat', 'lon', 'alt', 'tz', 'dst', 'tzone'],
      dtype='object')

In [161]:
# most common destination airports
flights_dest[['origin', 'name']].groupby('name').count().sort_values('origin')[::-1].head()

name,origin
Hartsfield Jackson Atlanta Intl,16837
Chicago Ohare Intl,16566
Los Angeles Intl,16026
General Edward Lawrence Logan Intl,15022
Orlando Intl,13967
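
Counting how often each value occurs can also be done with value_counts, which sorts from most to least frequent. The sketch below runs on a short made-up Series of destination names rather than the real flights_dest data.

```python
import pandas as pd

dest_names = pd.Series(
    ["Atlanta", "Chicago", "Atlanta", "Orlando", "Atlanta", "Chicago"],
    name="name",
)

# counts per distinct value, already sorted from most to least frequent
counts = dest_names.value_counts()
print(counts.head(2))
```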


In [169]:
# average air_time for each route, i.e. each (origin, destination name) pair
## groupby(['origin','name'])['air_time']

air_time = flights_dest[['origin', 'name', 'air_time']].groupby(['origin','name'])['air_time'].mean()
print (air_time.shape[0])

air_time.head()

216


origin  name                      
EWR     Albany Intl                    31.787081
        Asheville Regional Airport     89.796813
        Austin Bergstrom Intl         211.247649
        Baltimore Washington Intl      37.164122
        Bradley Intl                   25.466019
Name: air_time, dtype: float64

In [172]:
# count the frequency of the different air routes
flights_dest[['origin', 'name', 'air_time']].groupby(['origin','name']).count().head()

origin,name,air_time
EWR,Albany Intl,418
EWR,Asheville Regional Airport,251
EWR,Austin Bergstrom Intl,957
EWR,Baltimore Washington Intl,524
EWR,Bradley Intl,412


#### air_time has two levels of index
- the outer one is origin 
- inner one is name, which is the destination

In [178]:
print (type(air_time))

# longest average air_time (each route averages several flights) from each origin airport
air_time.groupby(level = 'origin').max()

<class 'pandas.core.series.Series'>


origin
EWR    612.075209
JFK    623.087719
LGA    227.515997
Name: air_time, dtype: float64

In [181]:
air_time.max()

623.08771929824559

In [185]:
# to return the routes 
rows = air_time.groupby(level = 'origin').idxmax()
rows

origin
EWR    (EWR, Honolulu Intl)
JFK    (JFK, Honolulu Intl)
LGA      (LGA, Denver Intl)
Name: air_time, dtype: object

In [186]:
# and check the row details 
air_time[rows]

origin  name         
EWR     Honolulu Intl    612.075209
JFK     Honolulu Intl    623.087719
LGA     Denver Intl      227.515997
Name: air_time, dtype: float64

In [187]:
# resets the hierarchical index back to a default integer index, turning the index levels into columns
air_time.reset_index() 

Unnamed: 0,origin,name,air_time
0,EWR,Albany Intl,31.787081
1,EWR,Asheville Regional Airport,89.796813
2,EWR,Austin Bergstrom Intl,211.247649
3,EWR,Baltimore Washington Intl,37.164122
4,EWR,Bradley Intl,25.466019
5,EWR,Buffalo Niagara Intl,51.153846
6,EWR,Burlington Intl,46.257336
7,EWR,Charleston Afb Intl,95.746091
8,EWR,Charlotte Douglas Intl,84.975680
9,EWR,Cherry Capital Airport,96.000000


### Pivot table 
Put origin in rows, destination in columns, and air_time as values.

In [188]:
# unstack air_time, which has hierarchical index
pvt_airtime = air_time.unstack()
pvt_airtime

name,Akron Canton Regional Airport,Albany Intl,Albuquerque International Sunport,Asheville Regional Airport,Austin Bergstrom Intl,Baltimore Washington Intl,Bangor Intl,Birmingham Intl,Blue Grass,Bob Hope,...,Tampa Intl,Ted Stevens Anchorage Intl,Theodore Francis Green State,Tulsa Intl,Washington Dulles Intl,Will Rogers World,William P Hobby,Wilmington Intl,Yampa Valley,Yeager
origin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
EWR,,31.787081,,89.796813,211.247649,37.164122,,,,,...,143.620302,413.125,32.667598,178.309524,46.117304,193.009524,205.174051,,248.5,
JFK,,,249.169291,,213.702201,39.083624,,117.0,,334.102703,...,144.278399,,,,48.138382,,205.500703,,,
LGA,64.036817,,,92.2,,40.466667,54.117318,122.798507,90.0,,...,145.963311,,,,48.594937,,202.235849,75.35514,,74.753731


In [189]:
# pivot table from a dataframe
## create the dataframe first by resetting the index of the hierarchically indexed series
air_df = air_time.reset_index()
air_df.head()

Unnamed: 0,origin,name,air_time
0,EWR,Albany Intl,31.787081
1,EWR,Asheville Regional Airport,89.796813
2,EWR,Austin Bergstrom Intl,211.247649
3,EWR,Baltimore Washington Intl,37.164122
4,EWR,Bradley Intl,25.466019


In [190]:
air_pvt = air_df.pivot(index = 'origin', columns = 'name', values = 'air_time')
air_pvt.head()

name,Akron Canton Regional Airport,Albany Intl,Albuquerque International Sunport,Asheville Regional Airport,Austin Bergstrom Intl,Baltimore Washington Intl,Bangor Intl,Birmingham Intl,Blue Grass,Bob Hope,...,Tampa Intl,Ted Stevens Anchorage Intl,Theodore Francis Green State,Tulsa Intl,Washington Dulles Intl,Will Rogers World,William P Hobby,Wilmington Intl,Yampa Valley,Yeager
origin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
EWR,,31.787081,,89.796813,211.247649,37.164122,,,,,...,143.620302,413.125,32.667598,178.309524,46.117304,193.009524,205.174051,,248.5,
JFK,,,249.169291,,213.702201,39.083624,,117.0,,334.102703,...,144.278399,,,,48.138382,,205.500703,,,
LGA,64.036817,,,92.2,,40.466667,54.117318,122.798507,90.0,,...,145.963311,,,,48.594937,,202.235849,75.35514,,74.753731
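
The two routes to a pivot table can be compared on a small made-up table. This is only a sketch: `toy`, `via_pivot` and `via_unstack` are hypothetical names, and the values are invented.

```python
import pandas as pd

# Hypothetical long-format data: one row per (origin, destination) pair
toy = pd.DataFrame({
    'origin': ['EWR', 'EWR', 'JFK'],
    'name': ['Albany Intl', 'Bradley Intl', 'Bradley Intl'],
    'air_time': [32.0, 25.0, 28.0],
})

# Route 1: pivot directly from the dataframe
via_pivot = toy.pivot(index='origin', columns='name', values='air_time')

# Route 2: build a two-level index, then unstack the inner level into columns
via_unstack = toy.set_index(['origin', 'name'])['air_time'].unstack()

print(via_pivot)
print(via_pivot.equals(via_unstack))
```

Combinations that never occur in the data (here JFK to Albany Intl) show up as NaN in both results.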


### multi column merge

In [198]:
weather = pd.read_table('nycflights13/weather.txt')
weather.head()

Unnamed: 0,origin,year,month,day,hour,temp,dewp,humid,wind_dir,wind_speed,wind_gust,precip,pressure,visib,time_hour
0,EWR,2013,1,1,0,37.04,21.92,53.97,230.0,10.35702,11.918651,0.0,1013.9,10.0,2012-12-31 16:00:00
1,EWR,2013,1,1,1,37.04,21.92,53.97,230.0,13.80936,15.891535,0.0,1013.0,10.0,2012-12-31 17:00:00
2,EWR,2013,1,1,2,37.94,21.92,52.09,230.0,12.65858,14.567241,0.0,1012.6,10.0,2012-12-31 18:00:00
3,EWR,2013,1,1,3,37.94,23.0,54.51,230.0,13.80936,15.891535,0.0,1012.7,10.0,2012-12-31 19:00:00
4,EWR,2013,1,1,4,37.94,24.08,57.04,240.0,14.96014,17.21583,0.0,1012.8,10.0,2012-12-31 20:00:00


In [199]:
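# shared columns between the two dataframes
# (Index.intersection is used because `&` on an Index is no longer a set operation in recent pandas)
print(flights_complete.columns.intersection(weather.columns))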

Index(['year', 'month', 'day', 'origin', 'hour', 'time_hour'], dtype='object')


In [200]:
print (flights_complete.shape[0])
print (weather.shape[0])

327346
26130


In [216]:
weather_flights = pd.merge(flights_complete, weather, on = ['year', 'month', 'day','hour', 'origin'])

print (weather_flights.shape[0])
weather_flights.head()

326160


Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,...,temp,dewp,humid,wind_dir,wind_speed,wind_gust,precip,pressure,visib,time_hour_y
0,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,...,39.92,26.06,57.33,260.0,13.80936,15.891535,0.0,1011.9,10.0,2012-12-31 22:00:00
1,2013,1,1,557.0,600,-3.0,709.0,723,-14.0,EV,...,39.92,26.06,57.33,260.0,13.80936,15.891535,0.0,1011.9,10.0,2012-12-31 22:00:00
2,2013,1,1,558.0,600,-2.0,753.0,745,8.0,AA,...,39.92,26.06,57.33,260.0,13.80936,15.891535,0.0,1011.9,10.0,2012-12-31 22:00:00
3,2013,1,1,559.0,600,-1.0,941.0,910,31.0,AA,...,39.92,26.06,57.33,260.0,13.80936,15.891535,0.0,1011.9,10.0,2012-12-31 22:00:00
4,2013,1,1,600.0,600,0.0,851.0,858,-7.0,B6,...,39.92,26.06,57.33,260.0,13.80936,15.891535,0.0,1011.9,10.0,2012-12-31 22:00:00
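
The row count drops from 327346 to 326160 because `pd.merge` performs an inner join by default, and `time_hour` becomes `time_hour_x` / `time_hour_y` because it exists in both frames but is not a join key. A minimal sketch with made-up data (`flights_toy` and `weather_toy` are hypothetical) shows both effects:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
flights_toy = pd.DataFrame({
    'origin': ['EWR', 'EWR', 'JFK'],
    'hour': [5, 6, 5],
    'flight': [1545, 1714, 1141],
    'time_hour': ['t1', 't2', 't3'],
})
weather_toy = pd.DataFrame({
    'origin': ['EWR', 'JFK'],
    'hour': [5, 5],
    'temp': [39.02, 39.92],
    'time_hour': ['w1', 'w2'],
})

# Inner join (the default): the EWR / hour 6 flight has no weather row and is dropped
merged = pd.merge(flights_toy, weather_toy, on=['origin', 'hour'])
print(merged.shape[0])
print(list(merged.columns))   # time_hour gets the suffixes _x and _y

# how='left' keeps every flight; indicator=True adds a '_merge' column
# telling whether a matching weather row was found
left = pd.merge(flights_toy, weather_toy, on=['origin', 'hour'],
                how='left', indicator=True)
print(left[['flight', '_merge']])
```

Including `time_hour` in the `on` list, or dropping it from one frame beforehand, would avoid the suffixed duplicate columns.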


In [233]:
# get the flight number and weather for flights delayed by more than 200 minutes

delay = weather_flights[weather_flights.dep_delay > 200][['flight', 'wind_speed', 'wind_gust', 'precip' ]]
# the following works the same way:
# the '>' comparison returns a boolean series aligned with the rows, and .loc keeps the rows where it is True
delay = weather_flights.loc[weather_flights.dep_delay > 200][['flight', 'wind_speed', 'wind_gust', 'precip']]

print (delay.shape[0])
delay.head()

2811


Unnamed: 0,flight,wind_speed,wind_gust,precip
158,3944,14.96014,17.21583,0.0
412,4417,10.35702,11.918651,0.0
455,4633,14.96014,17.21583,0.0
600,4644,14.96014,17.21583,0.0
640,3347,17.2617,19.864419,0.0


### Dictionary 

Python's efficient key/value **hash table** structure is called a "dict". The contents of a dict can be written as a series of key: value pairs within braces { }, e.g. dict = {key1: value1, key2: value2, ... }. The "empty dict" is just an empty pair of curly braces {}.

**Looking up or setting a value in a dict uses square brackets**
- e.g. dict['foo'] looks up the value under the key 'foo'. Any type can be a value, but keys must be hashable, which in practice means immutable: strings, numbers, and tuples of immutable items work, while lists and dictionaries cannot be used as keys.
- Looking up a key that is not in the dict raises a KeyError.
- Use "in" to check whether a key is in the dict, or use dict.get(key), which returns the value, or None if the key is not present (get(key, not-found) lets you specify the value to return in the not-found case).
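
The lookup rules above can be sketched in a few lines; `greek` is a hypothetical example dict used only for illustration.

```python
greek = {'a': 'alpha', 'g': 'gamma'}

print(greek['a'])                  # square-bracket lookup
print('z' in greek)                # membership test on the keys
print(greek.get('z'))              # missing key: get returns None
print(greek.get('z', 'unknown'))   # missing key: get returns the given default

# Square-bracket lookup of a missing key raises KeyError
try:
    greek['z']
except KeyError as err:
    print('KeyError for key', err)

# A tuple of immutable items is hashable and works as a key
points = {(0, 0): 'origin'}
print(points[(0, 0)])

# A list is mutable, hence unhashable, and cannot be a key
try:
    bad = {['a', 'b']: 1}
except TypeError as err:
    print('TypeError:', err)
```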




In [300]:
dic = {}
# adding keys to the dict
dic['a'] = 'alpha'
dic['g'] = 'gamma'
dic['o'] = 'omega'
dic

{'a': 'alpha', 'g': 'gamma', 'o': 'omega'}

In [236]:
# removing keys and their values using del and pop
del dic['a']

In [245]:
help(dic.pop)

Help on built-in function pop:

pop(...) method of builtins.dict instance
    D.pop(k[,d]) -> v, remove specified key and return the corresponding value.
    If key is not found, d is returned if given, otherwise KeyError is raised
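
A short sketch of the difference between `pop` and `del`, using a hypothetical dict `greek`: `pop` hands back the removed value and accepts a fallback, while `del` returns nothing and always raises KeyError for a missing key.

```python
greek = {'a': 'alpha', 'g': 'gamma', 'o': 'omega'}

removed = greek.pop('a')         # removes 'a' and returns its value
missing = greek.pop('z', None)   # key absent: the default is returned, no error
del greek['g']                   # removes 'g' without returning anything

print(removed, missing, greek)
```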



### By default iterating over a dictionary iterates over its keys. Since Python 3.7 the keys come out in insertion order; in older versions the order was arbitrary

In [280]:
for key in dic:
    print (key)

a
g
o


In [277]:
# the above is exactly the same as the following
for key in dic.keys():
    print (key)

a
g
o


In [281]:
# to loop over sorted keys 

for key in sorted(dic.keys()):
    print (key, dic[key])

a alpha
g gamma
o omega


In [298]:
# the following syntax loops over .items(), which yields one (key, value) tuple on each iteration

for k, v in dic.items():
    print ('{} is the short for {}'.format(k, v))
    
# this key, value pair iteration is useful in list comprehensions as well. for example
print ([v for k, v in dic.items() if k == 'a' or k == 'o'])

a is the short for alpha
g is the short for gamma
o is the short for omega
['alpha', 'omega']


### create pandas dataframe from dictionary


In [292]:
# creation of a relatively more complex dictionary 
# keys for the following values are 'country', 'drives_right' and 'cars_per_cap'

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

cars = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}
print (cars)

# and convert this dictionary to pandas dataframe using pd.DataFrame
cars = pd.DataFrame(cars)
print(cars)

{'country': ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'], 'drives_right': [True, False, False, False, True, True, True], 'cars_per_cap': [809, 731, 588, 18, 200, 70, 45]}
   cars_per_cap        country  drives_right
0           809  United States          True
1           731      Australia         False
2           588          Japan         False
3            18          India         False
4           200         Russia          True
5            70        Morocco          True
6            45          Egypt          True


### dataframe rename column names and rearrange the order of columns

In [319]:
# reorder columns
print (cars.columns)
cars_reorder = cars[['country', 'drives_right', 'cars_per_cap']]

cars_reorder

Index(['cars_per_cap', 'country', 'drives_right'], dtype='object')


Unnamed: 0,country,drives_right,cars_per_cap
0,United States,True,809
1,Australia,False,731
2,Japan,False,588
3,India,False,18
4,Russia,True,200
5,Morocco,True,70
6,Egypt,True,45


In [349]:
# or could work on the slicing of column names when there are too many of them 

col = cars.columns.tolist()
print (type(col))

col_val = cars.columns.values
print (type(col_val))

print (type(cars.columns))

# reorder columns - col list; the + sign concatenates lists or strings, but not across types (a list and a string cannot be concatenated)
col = [col[1]] + [col[2]] + [col[0]]
cars1 = cars[col]
print (cars1)

# reorder columns - col_val numpy array; a 1d numpy array is indexed the same way as a python list
col_val = [col_val[1]] + [col_val[2]] + [col_val[0]]
print (cars[col_val])

<class 'list'>
<class 'numpy.ndarray'>
<class 'pandas.core.indexes.base.Index'>
         country  drives_right  cars_per_cap
0  United States          True           809
1      Australia         False           731
2          Japan         False           588
3          India         False            18
4         Russia          True           200
5        Morocco          True            70
6          Egypt          True            45
         country  drives_right  cars_per_cap
0  United States          True           809
1      Australia         False           731
2          Japan         False           588
3          India         False            18
4         Russia          True           200
5        Morocco          True            70
6          Egypt          True            45


In [356]:
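# rename columns (rename returns a new dataframe; cars itself is unchanged)
cars.rename(columns = {'cars_per_cap': 'cpc', 'drives_right': 'dr'})

# rename using a lambda (an anonymous function); here each name is replaced by its last character
cars.rename(columns = lambda x: x[-1])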

Unnamed: 0,p,y,t
0,809,United States,True
1,731,Australia,False
2,588,Japan,False
3,18,India,False
4,200,Russia,True
5,70,Morocco,True
6,45,Egypt,True
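
Since `rename` returns a new dataframe by default, the result has to be assigned to keep it. A minimal sketch with a hypothetical two-row frame `cars_toy`:

```python
import pandas as pd

# Hypothetical small frame, not the notebook's cars data
cars_toy = pd.DataFrame({'country': ['Japan', 'India'],
                         'cars_per_cap': [588, 18]})

# Mapping old names to new names; the original frame is left untouched
renamed = cars_toy.rename(columns={'cars_per_cap': 'cpc'})
print(list(cars_toy.columns))
print(list(renamed.columns))

# Any callable works as well, for example the built-in str.upper
upper = cars_toy.rename(columns=str.upper)
print(list(upper.columns))
```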


http://openbookproject.net/thinkcs/python/english3e/classes_and_objects_I.html

In procedural programming the focus is on writing functions or procedures which operate on data. In object-oriented programming the focus is on the creation of objects which contain both data and functionality together. 

Object instances have both attributes and methods.
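
As a small illustration of attributes versus methods, here is a hypothetical `Point` class (the class and its method name are made up for this sketch):

```python
class Point:
    """A point in the plane: x and y are attributes (data)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def distance_from_origin(self):
        """A method (functionality) that works on the instance's own data."""
        return (self.x ** 2 + self.y ** 2) ** 0.5


p = Point(3, 4)
print(p.x, p.y)                    # attributes are accessed without parentheses
print(p.distance_from_origin())    # methods are called with parentheses
```

The same distinction applies to pandas objects: `cars.index` is an attribute, while `cars.head()` is a method.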

In [363]:
print (cars.index)
print (type(cars.index))

# to show the values of the index, use values, an attribute that many other pandas objects also have
print (cars.index.values)
print (cars.index.tolist())   # tolist is a method, so it needs parentheses to be called

RangeIndex(start=0, stop=7, step=1)
<class 'pandas.core.indexes.range.RangeIndex'>
[0 1 2 3 4 5 6]
[0, 1, 2, 3, 4, 5, 6]


In [360]:
print (col)
# reverse the list order
print (col[::-1])

print (col[-1])   # index the single last item in the list
print (type(col[0]))  # indexing returns the element itself (a string), not a list

print (col[:-1])   # slice everything except the last item in the list


['country', 'drives_right', 'cars_per_cap']
['cars_per_cap', 'drives_right', 'country']
cars_per_cap
<class 'str'>
['country', 'drives_right']


In [355]:
l = ['a', 'c', 'd', 'e', 'm']
print (l[::-1])

# reverse the order while taking every second item of the original list (step size -2)
print (l[::-2])

# 2 is the step size (stride)
print (l[::2])

['m', 'e', 'd', 'c', 'a']
['m', 'd', 'a']
['a', 'd', 'm']


In [334]:
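# row-wise mean over the numeric and boolean columns (True counts as 1);
# numeric_only skips the country strings, which recent pandas would otherwise reject
cars.mean(axis = 1, numeric_only = True)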

0    405.0
1    365.5
2    294.0
3      9.0
4    100.5
5     35.5
6     23.0
dtype: float64

### Dictionary iteration 

In [270]:
# get the key names 

dic.keys()

dict_keys(['a', 'g', 'o'])

In [261]:
# get the values in the dictionary

dic.values()

dict_values(['alpha', 'gamma', 'omega'])

In [243]:
# get the key names and values in pairs

dic.items()

dict_items([('a', 'alpha'), ('g', 'gamma'), ('o', 'omega')])

In [365]:
# get the value from the dictionary for a given key

print('a:', dic.get('a'))

a: alpha


### connection between list and dictionaries
- lists from dictionaries: dict.values(), dict.keys(), dict.items() (wrap each in list() to get an actual list)
- dictionaries from lists, using zip(), dict()

In [366]:
# create a dictionary from lists using zip and dict
dishes = ["pizza", "sauerkraut", "paella", "Hamburger"]
countries = ["Italy", "Germany", "Spain", "USA"]
con_dish = zip(countries, dishes)

con_dish_dict = dict(con_dish)
con_dish_dict

{'Germany': 'sauerkraut',
 'Italy': 'pizza',
 'Spain': 'paella',
 'USA': 'Hamburger'}

In [368]:
# when the number of elements in the two lists does not match, the superfluous element is not used
countries = ["Italy", "Germany", "Spain", "USA", "Switzerland"]
dishes = ["pizza", "sauerkraut", "paella", "Hamburger"]
con_dish = zip(countries, dishes)

con_dish_dict = dict(con_dish)
con_dish_dict

{'Germany': 'sauerkraut',
 'Italy': 'pizza',
 'Spain': 'paella',
 'USA': 'Hamburger'}

### OrderedDict and regular Dict
- An OrderedDict is a dictionary subclass that remembers the order in which its contents are added.
- Since Python 3.7 a regular dict also preserves insertion order when iterated (in older versions the order was arbitrary). An OrderedDict still differs in that equality between two OrderedDicts is order-sensitive, and it offers order-specific methods such as move_to_end.
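
A small sketch of two behaviours specific to OrderedDict, order-sensitive equality and `move_to_end`; the variable names are hypothetical.

```python
from collections import OrderedDict

d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}
od1 = OrderedDict([('a', 1), ('b', 2)])
od2 = OrderedDict([('b', 2), ('a', 1)])

print(d1 == d2)     # True: regular dicts ignore order when compared
print(od1 == od2)   # False: OrderedDicts compare order as well

# move_to_end reorders an existing key; regular dicts have no such method
od1.move_to_end('a')
print(list(od1))    # ['b', 'a']
```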


In [378]:
# ordered dictionary
import collections
od = collections.OrderedDict()
od['a'] = 'A'
od['b'] = 'B'
od['c'] = 'C'
od['d'] = 'D'
od['e'] = 'E'
for k, v in od.items():
    print (k, v)

a A
b B
c C
d D
e E


In [388]:
#  regular dictionary
d = {}
d['c'] = 'C'
d['d'] = 'D'
d['e'] = 'E'
d['a'] = 'A'
d['b'] = 'B'

for k, v in d.items():
    print (k, v)


c C
d D
e E
a A
b B


### Sort Dictionary
- by keys
- by values

A dict cannot be sorted in place; you can only build a sorted representation of it. sorted() returns a list, here a list of (key, value) tuples, which can then be passed to dict() or OrderedDict() to get a mapping that iterates in sorted order (regular dicts preserve insertion order since Python 3.7).
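
As a sketch of that idea, the sorted list of tuples can be fed straight back into `dict()`; this assumes Python 3.7 or later, where a regular dict keeps insertion order. The name `by_value_desc` is hypothetical.

```python
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}

# sorted() returns a list of (key, value) tuples; reverse=True gives descending order
pairs = sorted(x.items(), key=lambda kv: kv[1], reverse=True)
print(pairs)   # [(3, 4), (4, 3), (1, 2), (2, 1), (0, 0)]

# Rebuilding a dict from the sorted pairs keeps that order when iterating
by_value_desc = dict(pairs)
print(list(by_value_desc))   # [3, 4, 1, 2, 0]
```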


In [392]:
# sort on values
import operator
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=operator.itemgetter(1))
# sorted_x will be a list of tuples sorted by the second element in each tuple. dict(sorted_x) == x.

# sort on keys
import operator
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=operator.itemgetter(0))

In [393]:
help(operator.itemgetter)

Help on class itemgetter in module operator:

class itemgetter(builtins.object)
 |  itemgetter(item, ...) --> itemgetter object
 |  
 |  Return a callable object that fetches the given item(s) from its operand.
 |  After f = itemgetter(2), the call f(r) returns r[2].
 |  After g = itemgetter(2, 5, 3), the call g(r) returns (r[2], r[5], r[3])
 |  
 |  Methods defined here:
 |  
 |  __call__(self, /, *args, **kwargs)
 |      Call self as a function.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |  
 |  __reduce__(...)
 |      Return state information for pickling
 |  
 |  __repr__(self, /)
 |      Return repr(self).



In [396]:
# sort by keys
print (sorted(d.items()))

# sort by values - sorted() turns the (key, value) pairs from .items() into a list of tuples
# and the lambda tells the sorted function to sort according to the second value in each tuple

print (collections.OrderedDict(sorted(d.items(), key = lambda t: t[1])))

[('a', 'A'), ('b', 'B'), ('c', 'C'), ('d', 'D'), ('e', 'E')]
OrderedDict([('a', 'A'), ('b', 'B'), ('c', 'C'), ('d', 'D'), ('e', 'E')])


In [387]:
# convert a regular dict to OrderedDict after sorted by keys
new_od_dict = collections.OrderedDict(sorted(d.items()))
new_od_dict 

OrderedDict([('a', 'A'), ('b', 'B'), ('c', 'C'), ('d', 'D'), ('e', 'E')])

### Dict, enumerate and setdefault
- **enumerate** (enumerate(iterable, start = 0)) returns an enumerate object. The argument must be a sequence, an iterator, or some other object which supports iteration. The `__next__()` method of the iterator returned by enumerate() returns a tuple containing a count (from start, which defaults to 0) and the value obtained from iterating over the argument.
- **setdefault** (dict.setdefault(key, default)) returns the value stored under key if it exists; otherwise it stores default under key and returns it. With an empty list as the default, it is useful for building a dictionary of lists from tuples whose first items repeat and are meant to be the dictionary keys.

In [400]:
seasons = ['Spring', 'Summer', 'Fall', 'Winter']

# enumerate simply gives both the index and the item

print (list(enumerate(seasons)))
print (list(enumerate(seasons, start = 1)))

[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
[(1, 'Spring'), (2, 'Summer'), (3, 'Fall'), (4, 'Winter')]


In [404]:
for i, (k, v) in enumerate(d.items()):
    print (i, k, v)
    
# the pairs produced by the built-in function enumerate can also be saved in a list
# and reused later as often as wanted
ite = list(enumerate(d.items()))
type(ite)
ite

0 c C
1 d D
2 e E
3 a A
4 b B


[(0, ('c', 'C')),
 (1, ('d', 'D')),
 (2, ('e', 'E')),
 (3, ('a', 'A')),
 (4, ('b', 'B'))]

In [413]:
# setdefault 

key_value_pairs = [
    ('key1', 'value'),
    ('key1', 'value2'),
    ('key1', 'value3'),
    ]

# initialize an empty dictionary
dic = {}
 
for key, value in key_value_pairs:
    dic.setdefault(key, []).append(value)
 
print (dic)

{'key1': ['value', 'value2', 'value3']}


In [412]:
## comparison 
# really verbose
new = {}
for (key, value) in key_value_pairs:
    if key in new:
        new[key].append( value )
    else:
        new[key] = [value]


# easy with setdefault
new = {}
for (key, value) in key_value_pairs:
    group = new.setdefault(key, []) # key might exist already
    group.append( value )
    # the two lines above are equivalent to the one-liner:
    # new.setdefault(key, []).append(value)
 


# even simpler with defaultdict 
new = collections.defaultdict(list)
for (key, value) in key_value_pairs:
    new[key].append( value ) # all keys have a default already
    
print (new)

defaultdict(<class 'list'>, {'key1': ['value', 'value2', 'value3']})


In [414]:
# List comprehension 
 
vec = [[1,2,3], [4,5,6], [7,8,9]] 
[num for elem in vec for num in elem] 


[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [416]:
# defaultdict(list) lets you append to the list of a key that does not exist yet
# OrderedDict maintains the order in which items are inserted
# the following class combines the ordering of OrderedDict with the append-on-missing-key behaviour

from collections import OrderedDict

# class DefaultListOrderedDict(OrderedDict):
#     def __missing__(self,k):
#         self[k] = []
#         return self[k]
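
A runnable sketch of the class outlined above, with a short usage example. The input list `pairs` is made up; `__missing__` is the hook that dict subclasses can define to handle lookups of absent keys.

```python
from collections import OrderedDict


class DefaultListOrderedDict(OrderedDict):
    """OrderedDict that creates an empty list for any missing key."""

    def __missing__(self, key):
        # called by self[key] when key is absent
        self[key] = []
        return self[key]


# Hypothetical input with a repeated key
pairs = [('b', 1), ('a', 2), ('b', 3)]

grouped = DefaultListOrderedDict()
for key, value in pairs:
    grouped[key].append(value)

print(list(grouped.items()))   # [('b', [1, 3]), ('a', [2])]
```

On Python 3.7 and later, `collections.defaultdict(list)` already keeps insertion order, so this subclass mainly matters when the OrderedDict-specific behaviour is wanted.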


In [418]:
# WORKING EXAMPLE BELOW:
# enumerate gives the index alongside each item, which is its main benefit

a = ['dog','cat','boar','cow','sheep']

print ('using enumerate')
for i, item in enumerate(a):
    print ('index: ', i, 'value: ', item)
print ('\n')



# when you want to know the location of each value in the list
for i in range(len(a)):
    print( 'index: ', i, 'value:', a[i])

# enumerate gives the same result without indexing by hand
for index, value in enumerate(a):
    print ('index:', index, 'value: ', value)


using enumerate
index:  0 value:  dog
index:  1 value:  cat
index:  2 value:  boar
index:  3 value:  cow
index:  4 value:  sheep


index:  0 value: dog
index:  1 value: cat
index:  2 value: boar
index:  3 value: cow
index:  4 value: sheep
index: 0 value:  dog
index: 1 value:  cat
index: 2 value:  boar
index: 3 value:  cow
index: 4 value:  sheep


In [419]:
# find in nested dictionary

def gen_dict_extract(key, var):
    # yield every value stored under `key` anywhere in a nested dict/list structure
    # (Python 3: dicts have .items(), not .iteritems())
    if hasattr(var, 'items'):
        for k, v in var.items():
            if k == key:
                yield v
            if isinstance(v, dict):
                for result in gen_dict_extract(key, v):
                    yield result
            elif isinstance(v, list):
                for d in v:
                    for result in gen_dict_extract(key, d):
                        yield result


# http://stackoverflow.com/questions/9807634/find-all-occurences-of-a-key-in-nested-python-dictionaries-and-lists
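
A usage sketch for this kind of recursive search, written as a slightly simplified Python 3 variant. `find_key` and the `nested` data are hypothetical and are not taken from the linked answer; unlike the function above, this variant also accepts a list at the top level.

```python
def find_key(key, var):
    """Yield every value stored under `key` in nested dicts and lists."""
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield v
            # search inside the value as well (no-op for plain values)
            yield from find_key(key, v)
    elif isinstance(var, list):
        for item in var:
            yield from find_key(key, item)


# Hypothetical nested structure
nested = {
    'id': 1,
    'child': {'id': 2, 'tags': [{'id': 3}, {'name': 'x'}]},
}

# The function is a generator, so wrap it in list() to see all matches
print(list(find_key('id', nested)))   # [1, 2, 3]
```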

In [427]:
# regular unsorted dictionary
d = {'banana': 3, 'apple': 4, 'pear': 1, 'orange': 2}

# dictionary sorted by length of the key string
odl = OrderedDict(sorted(d.items(), key=lambda t: len(t[0])))
print (odl)


# http://www.saltycrane.com/blog/2007/09/how-to-sort-python-dictionary-by-keys/

OrderedDict([('pear', 1), ('apple', 4), ('banana', 3), ('orange', 2)])


In [423]:
# different ways to build the same dictionary with dict()

print (dict(one=1, two=2))
print (dict({'one': 1, 'two': 2}))
print (dict(zip(('one', 'two'), (1, 2))))
print (dict([['two', 2], ['one', 1]]))

{'one': 1, 'two': 2}
{'one': 1, 'two': 2}
{'one': 1, 'two': 2}
{'two': 2, 'one': 1}


In [438]:
for i in d.items():
    print (type(i))
    print (i[0])

<class 'tuple'>
banana
<class 'tuple'>
apple
<class 'tuple'>
pear
<class 'tuple'>
orange
