# lecture 14

### why pandas
* One of the most popular libraries among data scientists
* Labeled axes to avoid misalignment of data
    * Does Data[:, 2] represent weight or weight2?
    * When merging two tables, some rows may be missing or in a different order
* Missing values or special values may need to be removed or replaced
* A powerful and productive Python library for data analysis and data management
* The name comes from "panel data" (Panel Data System)
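
A minimal sketch of the labeled-axes point above. The measurements and the column names (height, weight, weight2) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: each row is one person, columns are height, weight, weight2.
raw = np.array([[170.0, 65.0, 66.5],
                [182.0, 80.0, 79.0]])

# Positional access: the reader has to remember what column 2 means.
by_position = raw[:, 2]

# Labeled access: the column name documents itself.
table = pd.DataFrame(raw, columns=["height", "weight", "weight2"])
by_label = table["weight2"]

print(by_position)
print(by_label)
```

Both expressions pick the same numbers; only the labeled version says which quantity they are.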

### overview
* Python library that provides data analysis features similar to those of R, MATLAB, and SAS
* Rich data structures and functions that make working with structured data fast, easy, and expressive
* It is built on top of NumPy
* Key components provided by Pandas
    * Series 
    * DataFrame

In [1]:
from pandas import Series, DataFrame
import pandas as pd

### series
* One-dimensional array-like object
* It contains an array of data (of any NumPy data type) with an associated index. (Index labels can be strings, integers, or other data types.)
* By default, a Series is indexed from 0 to N, where N = size - 1

In [2]:
obj=Series([4,7,-5,3])
print(obj,"\n")
print(obj.values,"\n")
print(obj.index)

0    4
1    7
2   -5
3    3
dtype: int64 

[ 4  7 -5  3] 

RangeIndex(start=0, stop=4, step=1)


In [10]:
obj=Series([4,5,6,7.0],index=['d','b','a','c'])

print(obj['b':'c']) #label-based slice: unlike a positional slice, the end label 'c' is included

b    5.0
a    6.0
c    7.0
dtype: float64
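
To make the difference between the two kinds of slicing explicit, here is a small sketch using the .loc (label-based) and .iloc (position-based) accessors on the same data. The variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([4, 5, 6, 7.0], index=["d", "b", "a", "c"])

by_label = s.loc["b":"c"]    # label slice: the end label 'c' is included
by_position = s.iloc[1:3]    # positional slice: position 3 is excluded

print(by_label)
print(by_position)
```

The label slice returns three entries (b, a, c), while the positional slice returns two (b, a).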


#### series-referencing elements

In [3]:
obj2=Series([4,7,-5,3],index=['d','b','a','c'])
print(obj2,"\n")
print(obj2.index,"\n")
print(obj2.values,"\n")

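#both statements return the same value
print(obj2['a'],"\n")
print(obj2.a,"\n")

obj2['d']=10 #assign a new value by label
print(obj2['d'],"\n")
print(obj2[['d','c','a']],"\n") #select several labels, in the order given
print(obj2[:2],"\n") #positional slice: the first two entries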

d    4
b    7
a   -5
c    3
dtype: int64 

Index(['d', 'b', 'a', 'c'], dtype='object') 

[ 4  7 -5  3] 

-5 

-5 

10 

d    10
c     3
a    -5
dtype: int64 

d    10
b     7
dtype: int64 



In [4]:
print(obj2[obj2>0],"\n") #keeps only the entries whose values are greater than zero
print(obj2**2,"\n") #squares each value; the index labels are preserved
print('b' in obj2,"\n") #membership test checks the index labels, not the values

d    10
b     7
c     3
dtype: int64 

d    100
b     49
a     25
c      9
dtype: int64 

True 



#### series-array/dict operations
* NumPy array operations can also be applied; they preserve the index-value link
* A Series can be thought of as a dict that maps index labels to values
* It can be constructed directly from a dict

In [5]:
obj3=Series({'a':10,'b':5,'c':10})
print(obj3)


a    10
b     5
c    10
dtype: int64
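
The cell above only shows construction from a dict. As a rough sketch of the other two bullets, the snippet below applies arithmetic and a NumPy function to a Series and then uses it like a dict. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series({"a": 10, "b": 5, "c": 10})

doubled = s * 2        # element-wise arithmetic, labels unchanged
roots = np.sqrt(s)     # a NumPy function returns a Series with the same index

print(doubled)
print(roots)
print("b" in s)        # dict-style membership test on the index labels
print(s.to_dict())     # convert back to a plain dict
```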


#### series-null values

In [6]:
sdata={'Texas':10,'Ohio':20,'Oregon':15,'Utah':18} #dictionary
states=['Texas','Ohio','Oregon','Iowa'] #list
#'Iowa' has no entry in sdata, so its value is NaN (missing); 'Utah' is not in states, so it is dropped
obj4=Series(sdata,index=states) #Series built from the dict, keeping only the labels listed in states
print(obj4,"\n")
print(pd.isnull(obj4),"\n") #checks whether each value is null, returns booleans
print(pd.notnull(obj4),"\n") #checks whether each value is not null, returns booleans
print(obj4[obj4.notnull()]) #returns the entries of obj4 whose values are not null

Texas     10.0
Ohio      20.0
Oregon    15.0
Iowa       NaN
dtype: float64 

Texas     False
Ohio      False
Oregon    False
Iowa       True
dtype: bool 

Texas      True
Ohio       True
Oregon     True
Iowa      False
dtype: bool 

Texas     10.0
Ohio      20.0
Oregon    15.0
dtype: float64
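
Once missing values are detected, they usually need to be removed or replaced. A minimal sketch with dropna and fillna follows. Using 0 as the replacement is an arbitrary choice for illustration.

```python
import pandas as pd

sdata = {"Texas": 10, "Ohio": 20, "Oregon": 15, "Utah": 18}
states = ["Texas", "Ohio", "Oregon", "Iowa"]
s = pd.Series(sdata, index=states)   # 'Iowa' has no entry in sdata, so it is NaN

removed = s.dropna()      # drop the entries whose value is NaN
replaced = s.fillna(0)    # or substitute a placeholder value instead

print(removed)
print(replaced)
```

Both calls return a new Series and leave s unchanged.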


#### series-auto alignment

In [7]:
sdata={'Ohio': 20, 'Oregon': 15, 'Texas': 10,  'Utah': 18}
obj5=Series(sdata)
print(obj4,"\n")
print(obj5,"\n")
print(obj5+obj4,"\n") #adds values whose index labels match; labels found in only one Series give NaN

Texas     10.0
Ohio      20.0
Oregon    15.0
Iowa       NaN
dtype: float64 

Ohio      20
Oregon    15
Texas     10
Utah      18
dtype: int64 

Iowa       NaN
Ohio      40.0
Oregon    30.0
Texas     20.0
Utah       NaN
dtype: float64 
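
If NaN is not wanted for labels that appear on only one side, the add method accepts a fill_value. The sketch below uses two small Series with illustrative names a and b.

```python
import pandas as pd

a = pd.Series({"Ohio": 20, "Oregon": 15, "Texas": 10, "Utah": 18})
b = pd.Series({"Texas": 10.0, "Ohio": 20.0, "Oregon": 15.0})

plain = a + b                     # 'Utah' is missing from b, so its sum is NaN
filled = a.add(b, fill_value=0)   # treat a label missing on one side as 0

print(plain)
print(filled)
```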



#### series name and index name
* The index of a Series can be replaced as a whole by assigning a new one
* The Index object itself is immutable: a single label cannot be changed in place


In [8]:
obj4.name='population' #assign name of data series
print(obj4,"\n")

obj4.index.name='state' #name the index 'state'
print(obj4,"\n")

obj4.index= ['Florida', 'New York', 'Kentucky', 'Georgia'] #replace the whole index; the name 'state' is not carried over
print(obj4,"\n")

#obj4.index[2]='California' -> TypeError: Index does not support mutable operations


Texas     10.0
Ohio      20.0
Oregon    15.0
Iowa       NaN
Name: population, dtype: float64 

state
Texas     10.0
Ohio      20.0
Oregon    15.0
Iowa       NaN
Name: population, dtype: float64 

Florida     10.0
New York    20.0
Kentucky    15.0
Georgia      NaN
Name: population, dtype: float64 
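
Since a single label cannot be assigned in place, one way to change just one label is Series.rename with a mapping. This is a sketch of one option, not the only approach.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 15.0], index=["Florida", "New York", "Kentucky"])

try:
    s.index[2] = "California"     # Index objects do not allow item assignment
except TypeError as err:
    print("TypeError:", err)

renamed = s.rename({"Kentucky": "California"})   # returns a new Series; s is unchanged
print(renamed)
```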



### dataframe
* A DataFrame is a tabular data structure made up of rows and columns, akin to a spreadsheet or database table
* It can be treated as an ordered collection of columns
    * Each column can have a different data type
    * It has both row and column indices

In [9]:
data= {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
       'year': [2000, 2001, 2002, 2001, 2002],
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame,"\n")
print(frame.sort_index(axis=1),"\n") #columns sorted alphabetically by name

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9 

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002 



#### dataframe– specifying columns and indices
* The order of columns and the row labels can be specified
* Columns that are listed but not present in the data are filled with NaN

In [10]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data,
                   columns=['year', 'state', 'pop', 'debt'], #selects the columns and sets their order; this does not rename them
                   index=['A', 'B', 'C', 'D', 'E']) #assigning labels of each row
print(frame2,"\n") #'debt' is not a key in data, so the column is filled with NaN


   year   state  pop debt
A  2000    Ohio  1.5  NaN
B  2001    Ohio  1.7  NaN
C  2002    Ohio  3.6  NaN
D  2001  Nevada  2.4  NaN
E  2002  Nevada  2.9  NaN 
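
The columns argument selects and orders existing keys; it does not rename anything. A short sketch of the difference follows, with renaming done as a separate step. The new name 'region' is made up for illustration.

```python
import pandas as pd

data = {"state": ["Ohio", "Nevada"], "year": [2000, 2001], "pop": [1.5, 2.4]}

# columns= picks keys from data in the given order; 'pop' is left out,
# and 'debt' is not a key in data, so that column is all NaN.
frame = pd.DataFrame(data, columns=["year", "state", "debt"])

# Renaming is a separate operation; rename() returns a new DataFrame.
renamed = frame.rename(columns={"state": "region"})

print(frame)
print(renamed)
```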



#### dataframe– from nested dict of dicts
* Outer dict keys become columns and inner dict keys become row indices
* The row index is the union of the inner keys (in the output below they appear in order of first appearance, not sorted)

In [11]:
pop = {'Nevada': {2001: 2.9, 2002: 2.9},
       'Ohio': {2002: 3.6, 2001: 1.7, 2000: 1.5}}
#state names become columns and years become row labels; in the output below the years appear in order of first appearance (2001, 2002, 2000), not sorted
frame3 = DataFrame(pop)
print(frame3,"\n")
#transpose
print(frame3.T,"\n") #swaps rows and columns

      Nevada  Ohio
2001     2.9   1.7
2002     2.9   3.6
2000     NaN   1.5 

        2001  2002  2000
Nevada   2.9   2.9   NaN
Ohio     1.7   3.6   1.5 
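
The row order in the output above depends on how the inner keys were encountered. When a sorted row index is wanted, an explicit sort_index call is one way to get it, as in this small sketch.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.9, 2002: 2.9},
       "Ohio": {2002: 3.6, 2001: 1.7, 2000: 1.5}}
frame = pd.DataFrame(pop)

by_year = frame.sort_index()   # rows ordered by label: 2000, 2001, 2002

print(by_year)
```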



#### dataframe– index, columns, values

In [12]:
print(frame3.index,"\n")
print(frame3.columns,"\n")
print(frame3.values,"\n")

#naming the row index and the column index
frame3.index.name = 'year'
frame3.columns.name='state'
print(frame3,"\n") #bad design

Int64Index([2001, 2002, 2000], dtype='int64') 

Index(['Nevada', 'Ohio'], dtype='object') 

[[2.9 1.7]
 [2.9 3.6]
 [nan 1.5]] 

state  Nevada  Ohio
year               
2001      2.9   1.7
2002      2.9   3.6
2000      NaN   1.5 

