### <font color="brown">Pandas</font>

https://pandas.pydata.org/docs/user_guide/index.html<br>
(You can also reach this from a Jupyter notebook through Help -> pandas Reference -> User Guide)

#### Pandas has two key data structures: Series and DataFrame

In [1]:
from pandas import Series

---

#### <font color="brown">Series is a 1D array-like object containing an array of data (of any NumPy datatype),<br> and an associated array of data labels called *index*</font>

In [2]:
aser = Series([1, 5, -2, 16])
aser

0     1
1     5
2    -2
3    16
dtype: int64

In [3]:
aser.values, aser.index

(array([ 1,  5, -2, 16], dtype=int64), RangeIndex(start=0, stop=4, step=1))

**Both values and index have data types**

In [4]:
aser.values.dtype, aser.index.dtype

(dtype('int64'), dtype('int64'))

**Can explicitly specify index**

In [5]:
aser = Series([1, 5, -2, 16], index=range(0,4))
aser

0     1
1     5
2    -2
3    16
dtype: int64

In [6]:
aser.values, aser.index

(array([ 1,  5, -2, 16], dtype=int64), RangeIndex(start=0, stop=4, step=1))

**Index can be string labels**

In [7]:
ser = Series([1, 5, -2, 16], index=['a','b','x','d'])
ser

a     1
b     5
x    -2
d    16
dtype: int64

In [8]:
print(ser.index.dtype) 

object


---

#### <font color="brown">Accessing Series values</font>

**Can use index label subscripts to access and assign values, like a dictionary**

In [9]:
aser

0     1
1     5
2    -2
3    16
dtype: int64

In [10]:
aser[2]

-2

In [11]:
ser

a     1
b     5
x    -2
d    16
dtype: int64

In [12]:
ser['x']

-2

In [13]:
ser['a'] = 10
ser

a    10
b     5
x    -2
d    16
dtype: int64

**Can access several values at once by passing a list of index labels (like NumPy fancy indexing)**

In [14]:
ser[['x','a','b']]  

x    -2
a    10
b     5
dtype: int64
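
The subscripts above look values up by index label. As a minimal sketch, the `.loc` and `.iloc` accessors make it explicit whether a label or a position is meant; the Series is rebuilt here only so the block runs on its own.

```python
from pandas import Series

ser = Series([10, 5, -2, 16], index=['a', 'b', 'x', 'd'])

by_label = ser.loc['x']             # label-based lookup
by_position = ser.iloc[2]           # position-based lookup (third element)
subset = ser.loc[['x', 'a', 'b']]   # list of labels, returned in the order requested

print(by_label, by_position)
print(subset)
```

Both lookups reach the same element here because the label 'x' sits at position 2.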

---

#### <font color="brown">NumPy-like array operations work as before; the index tags along</font>

In [15]:
import numpy as np

print(ser, '\n')

res = ser[ser > 0]
print(res, '\n')

res = ser * 2
print(res, '\n')

res = np.power(ser,2)
print(res, '\n')

ser = ser ** 2
print(ser, '\n')

a    10
b     5
x    -2
d    16
dtype: int64 

a    10
b     5
d    16
dtype: int64 

a    20
b    10
x    -4
d    32
dtype: int64 

a    100
b     25
x      4
d    256
dtype: int64 

a    100
b     25
x      4
d    256
dtype: int64 



---

#### <font color="brown">Series is like an ordered dictionary</font>

**Can test membership on the index (like key membership in a dictionary)**

In [16]:
ser

a    100
b     25
x      4
d    256
dtype: int64

In [17]:
'x' in ser

True

**Can create a Series out of a Python dictionary**

In [18]:
udict = {'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000, 'USC': 40000}
useries = Series(udict)
print(useries)
print(useries.index)

Rutgers      55000
Princeton    15000
MIT          20000
USC          40000
dtype: int64
Index(['Rutgers', 'Princeton', 'MIT', 'USC'], dtype='object')


**Make a new Series from udict with an explicit index that replaces 'Princeton' with 'Purdue'; 'Purdue' is not a key in udict, so its value is NaN**

In [19]:
univs = ['Purdue','Rutgers','MIT','USC']
useries2 = Series(udict, index=univs)
useries2

Purdue         NaN
Rutgers    55000.0
MIT        20000.0
USC        40000.0
dtype: float64
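
Note that the dtype changed from int64 to float64 in the output above. A short sketch of why, assuming the same dictionary: NaN is a floating-point value, so a Series that has to hold it is stored as floats.

```python
import numpy as np
from pandas import Series

udict = {'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000, 'USC': 40000}

full = Series(udict)                                   # every label has a value
partial = Series(udict, index=['Purdue', 'Rutgers'])   # 'Purdue' is not a key in udict

print(full.dtype)      # an integer dtype
print(partial.dtype)   # float64, because NaN is a float
```

Keys that are left out of the explicit index ('Princeton', 'MIT', 'USC' here) are simply dropped.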

In [20]:
# What if dictionary has list values
adict = {"one": [1,2,3,4], "two": [4,5,6]}
aser = Series(adict)
aser

one    [1, 2, 3, 4]
two       [4, 5, 6]
dtype: object

In [21]:
np.power(aser,2)

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

---

##### <font color="brown">Checking for null/not null values</font>
**NaN is equivalent to null**

In [None]:
useries2

In [None]:
useries2.isnull()  

In [None]:
useries2.notnull()
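
A self-contained sketch of what these two checks return, rebuilding useries2 from the same dictionary so the block runs on its own: each gives a boolean Series with the same index.

```python
from pandas import Series

udict = {'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000, 'USC': 40000}
useries2 = Series(udict, index=['Purdue', 'Rutgers', 'MIT', 'USC'])

missing = useries2.isnull()    # True where the value is NaN
present = useries2.notnull()   # element-wise opposite of isnull()

print(missing)
print(present)
print(missing.sum())           # True counts as 1, so this is the number of NaNs
```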

---

##### <font color="brown">Naming the Series, and the index</font>

In [None]:
useries

In [22]:
useries.name = "student population"
useries.index.name = "university"
useries

university
Rutgers      55000
Princeton    15000
MIT          20000
USC          40000
Name: student population, dtype: int64

---

##### <font color="brown">Adding two Series: automatic alignment of differently indexed data</font>

**If an index appears in one and not the other, result is NaN**

In [23]:
useries + useries2

MIT           40000.0
Princeton         NaN
Purdue            NaN
Rutgers      110000.0
USC           80000.0
dtype: float64
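
If NaN is not the result you want for labels present on only one side, the `add` method takes a `fill_value`. This is a sketch under the same data as above, not the only way to handle it.

```python
import numpy as np
from pandas import Series

udict = {'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000, 'USC': 40000}
useries = Series(udict)
useries2 = Series(udict, index=['Purdue', 'Rutgers', 'MIT', 'USC'])

plain = useries + useries2                     # NaN wherever either side is missing
filled = useries.add(useries2, fill_value=0)   # a side that is missing counts as 0

print(plain)
print(filled)
```

'Princeton' now keeps its value from useries. 'Purdue' is still NaN, because it is missing on both sides (absent from useries, NaN in useries2) and there is nothing to add.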

---

##### <font color="brown">Changing the index</font>

In [24]:
useries

university
Rutgers      55000
Princeton    15000
MIT          20000
USC          40000
Name: student population, dtype: int64

In [25]:
print('Original index: ',useries.index)
useries.index = ['RU','Princeton U','MIT','USC']
print('\nUpdated index: ',useries.index)

Original index:  Index(['Rutgers', 'Princeton', 'MIT', 'USC'], dtype='object', name='university')

Updated index:  Index(['RU', 'Princeton U', 'MIT', 'USC'], dtype='object')


In [26]:
useries

RU             55000
Princeton U    15000
MIT            20000
USC            40000
Name: student population, dtype: int64

In [27]:
useries.name
useries.index.name = 'university'
useries

university
RU             55000
Princeton U    15000
MIT            20000
USC            40000
Name: student population, dtype: int64

##### <font color="brown">Dropping NaNs</font>

In [28]:
dat = Series([1, np.nan, 2.6, np.nan, 6])
print(dat)

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64


**Alternatively, you can use an alias for np.nan (popular), as follows**

In [29]:
from numpy import nan as NA

dat = Series([1, NA, 2.6, NA, 6])
print(dat)

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64


**dropna**

In [30]:
dat.dropna()

0    1.0
2    2.6
4    6.0
dtype: float64

In [31]:
dat  # is the original modified?

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

---

##### <font color="brown">Filling NaNs</font>

In [32]:
dat = Series([1, np.nan, 2.6, np.nan, 6])
print(dat)

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64


**1. Filling (replacing) NaNs with a specific value**

In [33]:
dat.fillna(dat.mean())   # replace each NaN with mean

0    1.0
1    3.2
2    2.6
3    3.2
4    6.0
dtype: float64

In [34]:
dat   # ? is original modified?

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

**<font color="red">fillna returns a new Series, the original is unchanged</font>**

**2. Filling (replacing) NaNs with existing value using 'forward fill'**

In [35]:
dat.fillna(method='ffill')   # forward fill value into following NaN sequence

0    1.0
1    1.0
2    2.6
3    2.6
4    6.0
dtype: float64

In [36]:
dat2 = Series([1, np.nan, 2.6, np.nan, np.nan, 6])
print(dat2)

0    1.0
1    NaN
2    2.6
3    NaN
4    NaN
5    6.0
dtype: float64


In [37]:
dat2.fillna(method='ffill')   # forward fill value into following NaN sequence

0    1.0
1    1.0
2    2.6
3    2.6
4    2.6
5    6.0
dtype: float64

**3. Filling (replacing) NaNs with existing value using 'back fill'**

In [38]:
dat

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

In [39]:
dat.fillna(method='bfill')   # back fill value into preceding NaN sequence

0    1.0
1    2.6
2    2.6
3    6.0
4    6.0
dtype: float64

In [40]:
dat2

0    1.0
1    NaN
2    2.6
3    NaN
4    NaN
5    6.0
dtype: float64

In [41]:
dat2.fillna(method='bfill')   # back fill value into preceding NaN sequence

0    1.0
1    2.6
2    2.6
3    6.0
4    6.0
5    6.0
dtype: float64
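
Series also has dedicated `ffill()` and `bfill()` methods that do the same filling; recent pandas releases warn that the `method=` argument of `fillna` is deprecated in favor of them. A minimal sketch, which also uses the `limit` parameter to cap how many consecutive NaNs get filled:

```python
import numpy as np
from pandas import Series

dat2 = Series([1, np.nan, 2.6, np.nan, np.nan, 6])

forward = dat2.ffill()          # carry the last valid value forward
backward = dat2.bfill()         # pull the next valid value backward
capped = dat2.ffill(limit=1)    # fill at most one NaN in each consecutive run

print(forward)
print(backward)
print(capped)
```

With `limit=1` the second NaN of the two-NaN run (position 4) is left unfilled. As with `fillna`, each call returns a new Series and dat2 is unchanged.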

---

##### <font color="brown">Filtering with notnull()</font>

In [42]:
dat[dat.notnull()]

0    1.0
2    2.6
4    6.0
dtype: float64

In [43]:
dat1 = dat.copy()
dat1

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

**Updating in place (modifying original) with inplace parameter**

In [44]:
dat1.dropna(inplace=True)  
dat1

0    1.0
2    2.6
4    6.0
dtype: float64

**Or modify original by reassigning to it**

In [45]:
dat1 = dat.copy()
dat1

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

In [46]:
dat1 = dat1[dat1.notnull()]
dat1

0    1.0
2    2.6
4    6.0
dtype: float64

---

##### <font color="brown">Counting occurrences of values</font>

In [47]:
valser = Series(np.random.randint(1,10,20))
valser

0     2
1     4
2     9
3     3
4     2
5     4
6     1
7     2
8     6
9     9
10    5
11    7
12    2
13    8
14    1
15    9
16    4
17    1
18    1
19    3
dtype: int32

In [48]:
valser.value_counts()

2    4
1    4
4    3
9    3
3    2
6    1
5    1
7    1
8    1
dtype: int64

In [49]:
valser.value_counts().index

Int64Index([2, 1, 4, 9, 3, 6, 5, 7, 8], dtype='int64')

In [50]:
valser.value_counts()[7]

1
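
The result of `value_counts()` is itself a Series whose index holds the distinct values, so it can be sorted or normalized like any other. A sketch with fixed data (chosen here so the result is repeatable, unlike the random data above):

```python
from pandas import Series

valser = Series([2, 4, 9, 3, 2, 4, 1, 2])

counts = valser.value_counts()                 # ordered by count, largest first
by_value = counts.sort_index()                 # ordered by the value itself
shares = valser.value_counts(normalize=True)   # fractions instead of counts

print(counts)
print(by_value)
print(shares)
```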

---

##### <font color="brown">Mapping values with map function</font>

In [51]:
def mapper(val):
    return val**2 + 5

In [52]:
aser = Series(np.arange(1,5))
aser

0    1
1    2
2    3
3    4
dtype: int32

In [53]:
aser.map(mapper)   # each value is transformed via the mapper function

0     6
1     9
2    14
3    21
dtype: int64

In [54]:
aser

0    1
1    2
2    3
3    4
dtype: int32

In [55]:
aser.map(lambda v: v**2 + 5)   # can do the same with a lambda function

0     6
1     9
2    14
3    21
dtype: int64
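
`map` also accepts a dictionary in place of a function, in which case each value is looked up as a key. A small sketch with made-up codes and names (both are illustrative, not from the data above):

```python
from pandas import Series

codes = Series(['RU', 'MIT', 'USC', 'NYU'])
full_names = {'RU': 'Rutgers', 'MIT': 'MIT', 'USC': 'Southern California'}

names = codes.map(full_names)   # look each value up in the dictionary

print(names)
```

'NYU' has no entry in the dictionary, so its result is NaN. The original Series is not modified.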

---

##### <font color="brown">Resetting the index</font>

In [56]:
useries

university
RU             55000
Princeton U    15000
MIT            20000
USC            40000
Name: student population, dtype: int64

**Reset index to numbers**

In [57]:
useries = useries.reset_index() 
useries

| | university | student population |
|---|---|---|
| 0 | RU | 55000 |
| 1 | Princeton U | 15000 |
| 2 | MIT | 20000 |
| 3 | USC | 40000 |


In [58]:
type(useries)   

pandas.core.frame.DataFrame

**Can change column names**

In [59]:
useries.columns = ['Univ','Student Population']
useries

| | Univ | Student Population |
|---|---|---|
| 0 | RU | 55000 |
| 1 | Princeton U | 15000 |
| 2 | MIT | 20000 |
| 3 | USC | 40000 |
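
`reset_index()` turned the Series into a DataFrame because the old index was kept as a column. A sketch, assuming a shortened version of the same data, of the `drop=True` option, which discards the old index and keeps the result a Series:

```python
from pandas import DataFrame, Series

useries = Series([55000, 15000], index=['RU', 'MIT'], name='student population')
useries.index.name = 'university'

as_frame = useries.reset_index()             # old index becomes a column
as_series = useries.reset_index(drop=True)   # old index is discarded

print(type(as_frame).__name__, list(as_frame.columns))
print(type(as_series).__name__, as_series.index.tolist())
```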


---

---

### <font color="brown">Pandas - DataFrame</font>

#### DataFrame is a tabular spreadsheet-like data structure consisting of an ordered collection of columns, each of which can be a different value type

In [60]:
import numpy as np
from pandas import DataFrame

#### <font color="brown">DataFrame Creation</font>

**1. Creating a DataFrame from a dictionary**

In [61]:
popdat = {'state': ['Arizona','Arizona','Arizona','Virginia','Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat)
popdf

| | state | year | pop |
|---|---|---|---|
| 0 | Arizona | 2005 | 5.9 |
| 1 | Arizona | 2010 | 6.6 |
| 2 | Arizona | 2015 | 6.8 |
| 3 | Virginia | 2010 | 7.9 |
| 4 | Virginia | 2015 | 8.3 |


In [62]:
popdf.shape

(5, 3)

**Can sequence columns of a DataFrame as needed during creation with columns parameter**

In [63]:
popdf = DataFrame(popdat, columns=['year','state', 'pop'])
popdf

| | year | state | pop |
|---|---|---|---|
| 0 | 2005 | Arizona | 5.9 |
| 1 | 2010 | Arizona | 6.6 |
| 2 | 2015 | Arizona | 6.8 |
| 3 | 2010 | Virginia | 7.9 |
| 4 | 2015 | Virginia | 8.3 |


**If you give a column name that's not a key in the dictionary, you get NaNs (like Series index)**

In [64]:
popdf1 = DataFrame(popdat, columns=['year','state', 'population'])
popdf1

| | year | state | population |
|---|---|---|---|
| 0 | 2005 | Arizona | NaN |
| 1 | 2010 | Arizona | NaN |
| 2 | 2015 | Arizona | NaN |
| 3 | 2010 | Virginia | NaN |
| 4 | 2015 | Virginia | NaN |


**Index and column names**

In [65]:
print('Index:',popdf.index)
print('Columns:',popdf.columns)

Index: RangeIndex(start=0, stop=5, step=1)
Columns: Index(['year', 'state', 'pop'], dtype='object')


**values property gives an ndarray**

In [66]:
popdf.values

array([[2005, 'Arizona', 5.9],
       [2010, 'Arizona', 6.6],
       [2015, 'Arizona', 6.8],
       [2010, 'Virginia', 7.9],
       [2015, 'Virginia', 8.3]], dtype=object)

In [67]:
type(popdf.values)

numpy.ndarray
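
The array above has `dtype=object` because a single ndarray has to hold ints, strings and floats together. A sketch, using a two-row version of the same data, showing that each column keeps its own dtype inside the DataFrame and that selecting only numeric columns gives a numeric array:

```python
import numpy as np
from pandas import DataFrame

popdf = DataFrame({'year': [2005, 2010],
                   'state': ['Arizona', 'Virginia'],
                   'pop': [5.9, 7.9]})

print(popdf.dtypes)                        # one dtype per column

mixed = popdf.values                       # ints, strings and floats together
numeric = popdf[['year', 'pop']].values    # ints and floats share a float dtype

print(mixed.dtype, numeric.dtype)
```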

In [68]:
popdf.name  

AttributeError: 'DataFrame' object has no attribute 'name'

---

**2. Creating a DataFrame from a nested dictionary**

In [69]:
popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}
popdf2 = DataFrame(popdat2)
popdf2

| | Arizona | Virginia |
|---|---|---|
| 2005 | 5.9 | NaN |
| 2010 | 6.6 | 7.9 |
| 2015 | 6.8 | 8.3 |


In [70]:
popdf2.T

| | 2005 | 2010 | 2015 |
|---|---|---|---|
| Arizona | 5.9 | 6.6 | 6.8 |
| Virginia | NaN | 7.9 | 8.3 |


In [71]:
popdf2    # is the original dataframe modified?

| | Arizona | Virginia |
|---|---|---|
| 2005 | 5.9 | NaN |
| 2010 | 6.6 | 7.9 |
| 2015 | 6.8 | 8.3 |


In [72]:
popdf3 = DataFrame(popdat2,columns=['AZ','VA'])
popdf3

| | AZ | VA |
|---|---|---|


In [73]:
popdf3 = DataFrame(popdat2,columns=['VA','Arizona'])
popdf3

| | VA | Arizona |
|---|---|---|
| 2005 | NaN | 5.9 |
| 2010 | NaN | 6.6 |
| 2015 | NaN | 6.8 |
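
With a nested dictionary, the outer keys become columns and the inner keys become the index. If you want the outer keys as rows instead, one option is `DataFrame.from_dict` with `orient='index'`; this is a sketch of that alternative, using the same nested dictionary.

```python
import numpy as np
from pandas import DataFrame

popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}

by_state = DataFrame.from_dict(popdat2, orient='index')   # outer keys become rows

print(by_state)
```

Virginia has no entry for 2005, so that cell is NaN, just as in the column-oriented version.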


---

**3. Creating a DataFrame from a 2D NumPy array**

In [74]:
rand2d = np.random.random((3,2))
randdf = DataFrame(rand2d)
randdf

| | 0 | 1 |
|---|---|---|
| 0 | 0.57984 | 0.264331 |
| 1 | 0.03722 | 0.072799 |
| 2 | 0.396589 | 0.216195 |


**Change index and column names**

In [75]:
randdf.index = ['one', 'two', 'three']
randdf.columns = ['first', 'second']
randdf

| | first | second |
|---|---|---|
| one | 0.57984 | 0.264331 |
| two | 0.03722 | 0.072799 |
| three | 0.396589 | 0.216195 |


**Or set them up at creation time**

In [76]:
randdf = DataFrame(rand2d, index=['one', 'two', 'three'],
                   columns = ['first', 'second'])
randdf

| | first | second |
|---|---|---|
| one | 0.57984 | 0.264331 |
| two | 0.03722 | 0.072799 |
| three | 0.396589 | 0.216195 |


---

#### <font color="brown">Columns</font>

**Membership**

In [77]:
popdf

| | year | state | pop |
|---|---|---|---|
| 0 | 2005 | Arizona | 5.9 |
| 1 | 2010 | Arizona | 6.6 |
| 2 | 2015 | Arizona | 6.8 |
| 3 | 2010 | Virginia | 7.9 |
| 4 | 2015 | Virginia | 8.3 |


In [78]:
'debt' in popdf.columns  

False

**Each column is a Series**

**Column can be referenced by using column name as index into dataframe**

In [79]:
print(popdf['state'])
print(popdf['state'].name)
print(popdf['state'].values)
print(popdf['state'].index)

0     Arizona
1     Arizona
2     Arizona
3    Virginia
4    Virginia
Name: state, dtype: object
state
['Arizona' 'Arizona' 'Arizona' 'Virginia' 'Virginia']
RangeIndex(start=0, stop=5, step=1)


**Alternatively, a column can be referenced as an attribute of the dataframe**

In [80]:
popdf.state

0     Arizona
1     Arizona
2     Arizona
3    Virginia
4    Virginia
Name: state, dtype: object
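
One caveat with attribute access: it only works when the column name does not clash with an existing DataFrame attribute or method. The 'pop' column in this data is such a case, since `DataFrame.pop` is a method. A minimal sketch with a two-row version of the data:

```python
from pandas import DataFrame, Series

popdf = DataFrame({'state': ['Arizona', 'Virginia'], 'pop': [5.9, 7.9]})

by_attribute = popdf.state    # works: 'state' is not an existing attribute
by_subscript = popdf['pop']   # subscripting works for any column name
shadowed = popdf.pop          # the DataFrame.pop method, not the 'pop' column

print(type(by_subscript).__name__)
print(callable(shadowed))
```

Subscripting with the column name is the safer habit when names are not under your control.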

**Can get at a subset of columns with list, similar to rows of ndarray or index of Series**

In [81]:
popdf[['state','pop']]

| | state | pop |
|---|---|---|
| 0 | Arizona | 5.9 |
| 1 | Arizona | 6.6 |
| 2 | Arizona | 6.8 |
| 3 | Virginia | 7.9 |
| 4 | Virginia | 8.3 |


**Changing column names**

In [82]:
popdf

| | year | state | pop |
|---|---|---|---|
| 0 | 2005 | Arizona | 5.9 |
| 1 | 2010 | Arizona | 6.6 |
| 2 | 2015 | Arizona | 6.8 |
| 3 | 2010 | Virginia | 7.9 |
| 4 | 2015 | Virginia | 8.3 |


In [83]:
popdf.columns = ['year','state','pop']
popdf

| | year | state | pop |
|---|---|---|---|
| 0 | 2005 | Arizona | 5.9 |
| 1 | 2010 | Arizona | 6.6 |
| 2 | 2015 | Arizona | 6.8 |
| 3 | 2010 | Virginia | 7.9 |
| 4 | 2015 | Virginia | 8.3 |


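In [84]:
# restore to original (labels are assigned by position, so they must follow the data's column order)
popdf.columns = ['year','state','pop']
popdf

| | year | state | pop |
|---|---|---|---|
| 0 | 2005 | Arizona | 5.9 |
| 1 | 2010 | Arizona | 6.6 |
| 2 | 2015 | Arizona | 6.8 |
| 3 | 2010 | Virginia | 7.9 |
| 4 | 2015 | Virginia | 8.3 |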
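
Assigning a list to `columns` relabels the columns by position; it does not move any data. As a hedged sketch of two alternatives, `rename` changes selected names by matching on the old name, and selecting with a list of names reorders the data together with its labels. The name 'population' is just an example.

```python
from pandas import DataFrame

popdf = DataFrame({'year': [2005, 2010],
                   'state': ['Arizona', 'Virginia'],
                   'pop': [5.9, 7.9]})

renamed = popdf.rename(columns={'pop': 'population'})   # returns a new DataFrame
reordered = popdf[['state', 'year', 'pop']]             # data moves with its labels

print(list(renamed.columns))
print(reordered)
```

Both calls leave popdf itself unchanged.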
