# Intro
We can all agree that numpy is great, right? But what if your data looks like this?

**name**|**data science training**|**years of experience**|**age**|**years of study**
:-----:|:-----:|:-----:|:-----:|:-----:
jack|no|0|22|12
jill|yes|2|25|10
tarazan|yes|?|23|11
cheetah|yes|1|?|?

Trying to handle this data in numpy may be painful:
1. Each data element (row) has fields of different types: integers, strings and booleans
2. It would be useful to refer to data by name: a person's name or a feature (field) name
3. There are multiple missing values

Have no fear, pandas to the rescue. Pandas is a Python library designed specifically for data manipulation and analysis. It has answers to all of the issues above, and more:
1. It can do pretty much whatever you can do with a spreadsheet: grouping, pivots, etc.
2. It has built-in support for time series
3. Much, much more
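
As a taste of what is coming, the table from the top of this section could be loaded into pandas roughly as follows. This is only a sketch: the variable name `people` and the choice of `np.nan` to stand in for the `?` cells are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# the intro table; np.nan stands in for the '?' cells (an assumption)
people = pd.DataFrame(
    {
        'data science training': [False, True, True, True],
        'years of experience': [0, 2, np.nan, 1],
        'age': [22, 25, 23, np.nan],
        'years of study': [12, 10, 11, np.nan],
    },
    index=['jack', 'jill', 'tarazan', 'cheetah'],
)

print(people.dtypes)               # one dtype per column, mixed types are fine
print(people.loc['jill', 'age'])   # refer to data by person name and field name
print(people.isna().sum())         # count the missing values in each column
```

All three pain points from the list above (mixed types, access by name, missing values) are handled without any extra bookkeeping.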

## Installing and importing Pandas
Make sure you have pandas installed. Verify it now by trying to import it.

In [24]:
# run forest run
import pandas as pd

You will note that, like numpy, pandas has a "canonical" import statement. Be a pal, use it.

# Pandas objects
The two pandas objects everyone should know about are series and dataframes.

## It's Getting Series - 1D Pandas Objects
you can define a pandas series just as you would a numpy array

In [25]:
s = pd.Series([0, 1.5, 8, 5])
s

0    0.0
1    1.5
2    8.0
3    5.0
dtype: float64

We can access data using the familiar numpy syntax. Try generating a slice of `s` containing the first 3 elements

In [26]:
s[:3]

0    0.0
1    1.5
2    8.0
dtype: float64

Not impressed? That's because you haven't seen anything yet. Let's make it more interesting by introducing indices.

In [27]:
s = pd.Series([0, 1.5, 8, 5], index=['what', 'about', 'second', 'breakfast'])
s

what         0.0
about        1.5
second       8.0
breakfast    5.0
dtype: float64

Now we can access data using labels, as we would for a dictionary. Try accessing the 'about' field of `s`.

In [28]:
s['about']

1.5

This newfound ability does not diminish your ability to slice and dice like in the good old numpy days. Try generating a slice of `s` containing the first 3 elements

In [29]:
s[:3]

what      0.0
about     1.5
second    8.0
dtype: float64

Series are like the love child of a numpy array and a dictionary. They can even be generated from a dictionary. Try it

In [30]:
# run forest run
note7 = pd.Series({'model': 'note 7', 'ordinal_number': 7, 'specific_issues': 'with charging...'})
print(note7)

print('--')

disguised_dict = pd.Series(dict(year=2018, month=4, day=23, hour=9))
print(disguised_dict)

model                        note 7
ordinal_number                    7
specific_issues    with charging...
dtype: object
--
day        23
hour        9
month       4
year     2018
dtype: int64


This is where it gets weird: series can also be accessed via dot notation, like a class...

In [31]:
# run forest run
disguised_dict.year

2018

On this merry occasion, this may be a good time to remind you **NEVER TO USE PANDAS METHODS AS FIELD NAMES**: dot notation would hand you the method instead of your data.
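
A minimal sketch of what goes wrong. The field name `count` is a hypothetical example chosen because it collides with the existing `Series.count()` method.

```python
import pandas as pd

# 'count' is a hypothetical field name that collides with Series.count()
s = pd.Series({'year': 2018, 'count': 3})

print(s.year)        # attribute access finds the 'year' field
print(s['count'])    # bracket access always returns the data
print(s.count)       # attribute access returns the bound method, not the value
print(s.count())     # calling it counts the non-missing entries
```

Bracket access keeps working no matter what the field is called, so it is the safer habit.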

## You've been framed, dataframed - 2D Pandas objects and beyond
series are nice and all, but 1D data lacks real depth. This is when we turn to dataframes, the real pandas workhorse. Or is that workpanda?... never mind. Let's get busy

In [32]:
# we can initialize pandas dataframes from a numpy array
import numpy as np
df = pd.DataFrame(data=np.eye(3))
print(df, '\n')

# we can give it indices and labels(columns)
df2 = pd.DataFrame(data=np.eye(4), columns=['a', 'b', 'c', 'd'], index=['what', 'about', 'second', 'breakfast'])
print(df2, '\n')

# we can compose it from a dictionary of series
s1 = pd.Series([1, 2, 9, 0], index=['class_', 'lesson', 'district', 'fudges_given'])
s2 = pd.Series([2, 1, 13, 0.5], index=['class_', 'lesson', 'district', 'fudges_given'])
df3 = pd.DataFrame({'simple': s1, 'advanced': s2})
print(df3,'\n')

# or as a dictionary of dictionaries...
df4 = pd.DataFrame({
    'simple': dict(class_=1, lesson=2, district=9, fudges_given=0),
    'advanced': dict(class_=2, lesson=1, district=13, fudges_given=0.5),
})
print(df4, '\n')

(     0    1    2
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0, '\n')
(             a    b    c    d
what       1.0  0.0  0.0  0.0
about      0.0  1.0  0.0  0.0
second     0.0  0.0  1.0  0.0
breakfast  0.0  0.0  0.0  1.0, '\n')
(              advanced  simple
class_             2.0       1
lesson             1.0       2
district          13.0       9
fudges_given       0.5       0, '\n')
(              advanced  simple
class_             2.0       1
district          13.0       9
fudges_given       0.5       0
lesson             1.0       2, '\n')


### Indexing dataframes
there are many different and wonderful ways to index dataframes. 

We can slice dataframes by index as we did for series

In [33]:
# run forest run
print(df2[:2])

         a    b    c    d
what   1.0  0.0  0.0  0.0
about  0.0  1.0  0.0  0.0


but using a single index won't work

In [34]:
# run forest run
print(df2[2])

KeyError: 2

and we will soon find out why. 

In [35]:
# run forest run
print(df2['a'])

what         1.0
about        0.0
second       0.0
breakfast    0.0
Name: a, dtype: float64


That's right. With plain brackets, a single key selects a column, while a slice selects rows (this is confusing). The same column can be fetched using attribute-style (dot) notation

In [36]:
# run forest run
print(df2.a)

what         1.0
about        0.0
second       0.0
breakfast    0.0
Name: a, dtype: float64


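Dot notation does seem more convenient, but it does not allow us to create new columns: the assignment to `df2.w` below only sets a plain Python attribute and leaves the dataframe unchanged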

In [37]:
# run forest run
df2['e'] = [90, 80, 70, 60]
print(df2, '\n')

df2.w = [10, 20, 30, 40]
print(df2, '\n')

(             a    b    c    d   e
what       1.0  0.0  0.0  0.0  90
about      0.0  1.0  0.0  0.0  80
second     0.0  0.0  1.0  0.0  70
breakfast  0.0  0.0  0.0  1.0  60, '\n')
(             a    b    c    d   e
what       1.0  0.0  0.0  0.0  90
about      0.0  1.0  0.0  0.0  80
second     0.0  0.0  1.0  0.0  70
breakfast  0.0  0.0  0.0  1.0  60, '\n')
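
The difference between the two assignments can be checked directly. This is a small self-contained sketch on a fresh dataframe (named `df` here only for illustration); recent pandas versions may also emit a warning on the attribute assignment, which is silenced below.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.eye(2), columns=['a', 'b'])

df['e'] = [90, 80]                   # bracket assignment creates a real column

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas may warn about the next line
    df.w = [10, 20]                  # only sets a Python attribute on the object

print(list(df.columns))              # 'w' is not among the columns
print(df.w)                          # the list is still reachable as an attribute
```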



### Dataframes as 2D arrays
beneath all the fancy machinery at the heart of each dataframe lies a numpy array. It's quite easy to find actually

In [38]:
# run forest run
print(df2.values)

[[ 1.  0.  0.  0. 90.]
 [ 0.  1.  0.  0. 80.]
 [ 0.  0.  1.  0. 70.]
 [ 0.  0.  0.  1. 60.]]


Using this knowledge we can do some clever stuff.

In [39]:
# run forest run
print(df2.T)   # transposing as we would for a numpy array

   what  about  second  breakfast
a   1.0    0.0     0.0        0.0
b   0.0    1.0     0.0        0.0
c   0.0    0.0     1.0        0.0
d   0.0    0.0     0.0        1.0
e  90.0   80.0    70.0       60.0


## Indexing functions - iloc, loc, ix
Armed with this knowledge of a dataframe's inner workings, it may be easier to grasp pandas "advanced" indexing functions. (`ix` has been removed from recent pandas versions, so only `iloc` and `loc` are covered here.)

### Pandas what pandas? - iloc
If you want to treat dataframes as arrays, you came to the right place. The `iloc` function allows us to index the dataframe as we would a numpy array. Use it now to get the subarray of `df2` containing rows 0, 1, 2 and columns 3, 1

In [40]:
df2.iloc[:3,3::-2]


          d    b
what    0.0  0.0
about   0.0  1.0
second  0.0  0.0
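
The reversed slice `3::-2` is compact but a little cryptic. As a sketch, the same positions can also be spelled out as explicit lists; `df2` is rebuilt here (without the 'e' column) only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.eye(4), columns=['a', 'b', 'c', 'd'],
                   index=['what', 'about', 'second', 'breakfast'])

by_slice = df2.iloc[:3, 3::-2]          # columns 3, then 1 (step of -2)
by_list = df2.iloc[[0, 1, 2], [3, 1]]   # the same positions, spelled out

print(by_list)
print(by_slice.equals(by_list))         # both selections hold the same data
```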


### Mutant indexing - loc
So you want to slice, but you want to use named ranges. You want it all. Luckily, pandas has got your back. The `loc` function allows you to slice using names instead of integers. Wrap your head around this one and get the same result you got for `iloc`.

*note* - named ranges include both ends

In [41]:
df2.loc[[True, False, False, True]]

             a    b    c    d   e
what       1.0  0.0  0.0  0.0  90
breakfast  0.0  0.0  0.0  1.0  60
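
The cell above passes a boolean mask to `loc`, which keeps rows 'what' and 'breakfast'. A label-based selection that reproduces the `iloc` result might look like the sketch below; `df2` is rebuilt here (without the 'e' column) only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.eye(4), columns=['a', 'b', 'c', 'd'],
                   index=['what', 'about', 'second', 'breakfast'])

# label slices include both ends, so 'second' is part of the result
by_label = df2.loc['what':'second', ['d', 'b']]
by_position = df2.iloc[:3, [3, 1]]

print(by_label)
print(by_label.equals(by_position))   # same rows and columns either way
```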


### Boolean indexing
Boolean indexing is possible with the `loc` function by passing a boolean mask in place of the labels for a dimension. Use the `loc` function to get the sub dataframe of `df2` with columns 'b' and 'c' and rows where the 'a' column is 0

In [42]:
df2.loc[df2['a'] == 0, ['b','c']]

             b    c
about      1.0  0.0
second     0.0  1.0
breakfast  0.0  0.0
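
Masks can also be combined. The sketch below adds a second, made-up condition on the 'e' column to the one used above; note that each condition needs its own parentheses when joined with `&` or `|`.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.eye(4), columns=['a', 'b', 'c', 'd'],
                   index=['what', 'about', 'second', 'breakfast'])
df2['e'] = [90, 80, 70, 60]

# rows where 'a' is 0 AND 'e' is above 60 (the second condition is illustrative)
mask = (df2['a'] == 0) & (df2['e'] > 60)
subset = df2.loc[mask, ['b', 'c']]

print(mask)
print(subset)
```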


## But what about numpy??
You have not learned numpy for nothing. At their core, pandas dataframes are simply upgraded numpy arrays, and can be treated as such. Behold

In [43]:
# run forest run
df = pd.DataFrame(np.arange(16).reshape((4, -1)), index=['a', 'b', 'c', 'd'], columns=['I', 'II', 'III', 'IV'])
print('df=\n{}\n'.format(df))

mat = np.diag(v=[1, 1, 1], k=1)
print('mat=\n{}\n'.format(mat))

print('df.dot(mat)=\n{}\n'.format(df.dot(mat))) # implicit example

print('df.dot(mat)=\n{}\n'.format(df.values.dot(mat))) # explicit example using values

print('exponent of df = \n{}\n'.format(np.exp(df)))  # calling numpy functions on a pandas dataframe

df_ = df.copy()
df_['V'] = ['silly', 'words', 'in', 'list']
print('df_=\n{}\n'.format(df_))
print('numeric data in df_ multiplied by mat=\n{}\n'.format(df_.select_dtypes(include='number')*mat)) # keep only the numeric columns (public API), then multiply element-wise

df=
    I  II  III  IV
a   0   1    2   3
b   4   5    6   7
c   8   9   10  11
d  12  13   14  15

mat=
[[0 1 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [0 0 0 0]]

df.dot(mat)=
   0   1   2   3
a  0   0   1   2
b  0   4   5   6
c  0   8   9  10
d  0  12  13  14

df.dot(mat)=
[[ 0  0  1  2]
 [ 0  4  5  6]
 [ 0  8  9 10]
 [ 0 12 13 14]]

exponent of df = 
               I             II           III            IV
a       1.000000       2.718282  7.389056e+00  2.008554e+01
b      54.598150     148.413159  4.034288e+02  1.096633e+03
c    2980.957987    8103.083928  2.202647e+04  5.987414e+04
d  162754.791419  442413.392009  1.202604e+06  3.269017e+06

df_=
    I  II  III  IV      V
a   0   1    2   3  silly
b   4   5    6   7  words
c   8   9   10  11     in
d  12  13   14  15   list

numeric data in df_ multiplied by mat=
   I  II  III  IV
a  0   1    0   0
b  0   0    6   0
c  0   0    0  11
d  0   0    0   0



So you can indeed treat pandas dataframes (and series) as numpy objects for most intents and purposes. Pandas, a gracious library and a beautiful creature, doesn't mind you treating it as a numpy array. It will retain its column names and indices for your use and enjoyment (as long as you do it implicitly).
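
The difference between "implicit" and "explicit" can be seen in a short sketch. `np.sqrt` is used here just as an arbitrary numpy function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, -1)),
                  index=['a', 'b', 'c', 'd'], columns=['I', 'II', 'III', 'IV'])

implicit = np.sqrt(df)          # numpy function applied to the dataframe itself
explicit = np.sqrt(df.values)   # numpy function applied to the underlying array

print(type(implicit).__name__)  # still a DataFrame, labels survive
print(type(explicit).__name__)  # a bare ndarray, labels are gone
print(implicit.loc['c', 'II'])  # label-based access still works on the result
```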

#### exercise
Now you. You are requested to get the sub dataframe of `df` with rows 'c' and 'd' and columns 'II' and 'IV', with a transformation applied to each element. If $x$ is the original element, we want the element $x'=\cos(e^{x+3})$. There are two obvious ways to achieve this. Write them both. Which is more efficient?

*Note* try to achieve both ways with one line of code each

In [45]:
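# note: iloc[[1,2],[1,3]] picks rows 'b' and 'c'; the requested rows 'c' and 'd' would be iloc[[2,3],[1,3]]
xprime = np.cos(np.exp(df.iloc[[1,2],[1,3]]+3))      # select first, then transform
print(xprime)
x2prime = np.cos(np.exp(df+3)).iloc[[1,2],[1,3]]     # transform everything, then select
print(x2prime)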
print("\nThe first way is more efficient because we dont need to apply \ncos(e^x+3) on every element but only to the sub dataframe.")

         II        IV
b -0.915744 -0.725042
c  0.128037 -0.865213
         II        IV
b -0.915744 -0.725042
c  0.128037 -0.865213

The first way is more efficient because we dont need to apply 
cos(e^x+3) on every element but only to the sub dataframe.
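
A label-based version of the two approaches, selecting rows 'c' and 'd' as the exercise asks, might look like the sketch below. The names `rows`, `cols` and the two result variables are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, -1)),
                  index=['a', 'b', 'c', 'd'], columns=['I', 'II', 'III', 'IV'])

rows, cols = ['c', 'd'], ['II', 'IV']

select_then_transform = np.cos(np.exp(df.loc[rows, cols] + 3))   # transforms 4 elements
transform_then_select = np.cos(np.exp(df + 3)).loc[rows, cols]   # transforms all 16

print(np.allclose(select_then_transform, transform_then_select))  # same values
print(df.loc[rows, cols].size, df.size)                            # work done by each approach
```

Selecting by label avoids off-by-one mistakes in the positions, and the element counts make the efficiency argument concrete.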


# Index
in pandas the row names are called the index. 

In [46]:
# run forest run
df.index

Index([u'a', u'b', u'c', u'd'], dtype='object')

The index is an immutable, list-like sequence of labels. There are some explicit uses for the index, but it's presented here to discuss the implicit ones.
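
A quick sketch of what "immutable" means in practice. The standalone `idx` below mirrors `df.index` and is built separately only to keep the example self-contained.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c', 'd'])

print(idx[1:3])                 # slicing works as it does for a list or array

try:
    idx[0] = 'z'                # item assignment is not supported
except TypeError as err:
    print('TypeError:', err)

renamed = idx.map(str.upper)    # "changing" labels means building a new Index
print(renamed)
```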

### index preservation
Let's compute how busy the restaurants around the ITC are on average

In [47]:
customer_num = pd.Series([100, 80, 60, 200],
                       index=['humus_hakerem', 'falafel_gina', 
                              '24_rupee', 'pizza_munch']) # number of customers per day

hours_open = pd.Series([10, 12, 9, 17],
                      index=['humus_hakerem', 'falafel_gina', 
                             'al_harampa', '24_rupee'])

print('customer_num=\n{}\n'.format(customer_num))
print('hours_open=\n{}\n'.format(hours_open))

print('average number of customers per hour\n{}\n'.format(customer_num/hours_open))

customer_num=
humus_hakerem    100
falafel_gina      80
24_rupee          60
pizza_munch      200
dtype: int64

hours_open=
humus_hakerem    10
falafel_gina     12
al_harampa        9
24_rupee         17
dtype: int64

average number of customers per hour
24_rupee          3.529412
al_harampa             NaN
falafel_gina      6.666667
humus_hakerem    10.000000
pizza_munch            NaN
dtype: float64
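
The NaN entries above belong to restaurants that appear in only one of the two series. If only the restaurants present in both matter, one possible follow-up (a sketch, with `per_hour` and `shared` as illustrative names) is to look at the intersection of the two indices or simply drop the NaNs before averaging.

```python
import pandas as pd

customer_num = pd.Series([100, 80, 60, 200],
                         index=['humus_hakerem', 'falafel_gina',
                                '24_rupee', 'pizza_munch'])
hours_open = pd.Series([10, 12, 9, 17],
                       index=['humus_hakerem', 'falafel_gina',
                              'al_harampa', '24_rupee'])

per_hour = customer_num / hours_open                         # aligned on labels
shared = customer_num.index.intersection(hours_open.index)   # labels found in both

print(list(shared))
print(per_hour.dropna())        # keeps only the restaurants with both numbers
print(per_hour[shared].mean())  # average customers per hour over those restaurants
```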



Whenever we perform operations on pandas objects, pandas aligns indices automagically. Elements that have no counterpart for the computation are replaced by NaN. This also works for columns and indices in dataframes

In [48]:
# run forest run
print('df=\n{}\n'.format(df))

df_nw = df.iloc[:3, :3]
print('df_nw=\n{}\n'.format(df_nw))  # north-west sub dataframe of df

df_se = df.iloc[1:, 1:]
print('df_se=\n{}\n'.format(df_se))  # south-east sub dataframe of df

print('df_se*df_nw=\n{}\n'.format(df_se*df_nw))

df=
    I  II  III  IV
a   0   1    2   3
b   4   5    6   7
c   8   9   10  11
d  12  13   14  15

df_nw=
   I  II  III
a  0   1    2
b  4   5    6
c  8   9   10

df_se=
   II  III  IV
b   5    6   7
c   9   10  11
d  13   14  15

df_se*df_nw=
    I    II    III  IV
a NaN   NaN    NaN NaN
b NaN  25.0   36.0 NaN
c NaN  81.0  100.0 NaN
d NaN   NaN    NaN NaN



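You could tell in advance which rows and columns would be replaced by NaNs by taking the symmetric difference (XOR) of the two sets of labels.

**note** - xor stands for exclusive or: one or the other, but not both, which is usually what we mean by "or" in everyday use. Older pandas versions spelled this `^` on two indices; newer versions treat `^` as an element-wise logical operation, so `symmetric_difference` is the safer spelling.

In [49]:
# run forest run
print('indices (rows) that will be replaced with NaNs = {}\n'.format(df_se.index.symmetric_difference(df_nw.index)))
print('columns (columns) that will be replaced with NaNs = {}'.format(df_se.columns.symmetric_difference(df_nw.columns)))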

indices (rows) that will be replaced with NaNs = Index([u'a', u'd'], dtype='object')

columns (columns) that will be replaced with NaNs = Index([u'I', u'IV'], dtype='object')


In real-world cases, missing values are very common, so methods for coping with missing data are baked into pandas. You can use the `dropna` and `fillna` methods of a pandas object to drop missing values and fill them, respectively.

#### exercise
Create a new dataframe `df_missing` by multiplying `df_se` and `df_nw` to experiment with in this exercise (do not alter `df_missing` - print a copy)
1. fill missing values with a constant value of your choosing.
2. achieve the same result without using `dropna` or `fillna` (hint: `isnull`, `notnull`, `where`)
3. drop rows containing NaNs
4. drop rows containing **only** NaNs
5. achieve the same result without using `dropna` or `fillna`
6. drop columns containing **only** NaNs
7. achieve the same result without using `dropna` or `fillna`
8. fill missing values with the **last** existing value along each row (fill forward)
9. fill missing values with the **next** existing value along each column (fill backward)

In [50]:
df_missing = df_se * df_nw
print(df_missing)

df_missing2 = df_missing.fillna(2)
print(df_missing2)

# Nadav approved my not doing the rest of the exercise

    I    II    III  IV
a NaN   NaN    NaN NaN
b NaN  25.0   36.0 NaN
c NaN  81.0  100.0 NaN
d NaN   NaN    NaN NaN
     I    II    III   IV
a  2.0   2.0    2.0  2.0
b  2.0  25.0   36.0  2.0
c  2.0  81.0  100.0  2.0
d  2.0   2.0    2.0  2.0
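
For reference, the remaining items could be approached along these lines. This is one possible sketch, not the official solution: `df_missing` is rebuilt by hand from the printout above so the example is self-contained, and all result names are made up.

```python
import numpy as np
import pandas as pd

# rebuild df_missing as printed above (np.nan marks the missing cells)
nan = np.nan
df_missing = pd.DataFrame(
    [[nan, nan, nan, nan],
     [nan, 25.0, 36.0, nan],
     [nan, 81.0, 100.0, nan],
     [nan, nan, nan, nan]],
    index=['a', 'b', 'c', 'd'], columns=['I', 'II', 'III', 'IV'])

filled = df_missing.where(df_missing.notnull(), 2)                   # 2. like fillna(2)
no_nan_rows = df_missing.dropna()                                    # 3. drop rows with any NaN
some_rows = df_missing.dropna(how='all')                             # 4. drop all-NaN rows
some_rows_alt = df_missing[df_missing.notnull().any(axis=1)]         # 5. same, via a mask
some_cols = df_missing.dropna(axis=1, how='all')                     # 6. drop all-NaN columns
some_cols_alt = df_missing.loc[:, df_missing.notnull().any(axis=0)]  # 7. same, via a mask
forward = df_missing.ffill(axis=1)                                   # 8. fill forward along rows
backward = df_missing.bfill()                                        # 9. fill backward along columns

for name, frame in [('filled', filled), ('some_rows', some_rows),
                    ('some_cols', some_cols), ('forward', forward),
                    ('backward', backward)]:
    print('{}=\n{}\n'.format(name, frame))
```

Every call returns a new dataframe, so `df_missing` itself is never altered, as the exercise requires.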
