### <font color="brown">Pandas - DataFrame</font>

#### A DataFrame is a tabular, spreadsheet-like data structure: an ordered collection of columns, each of which can hold a different value type

In [86]:
import numpy as np
from pandas import DataFrame, Series  # Series is needed later, when a column is built from one

---

#### <font color="brown">DataFrame Creation</font>

**1. Creating a DataFrame from a dictionary**

In [87]:
popdat = {'state': ['Arizona','Arizona','Arizona','Virginia','Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
# all lists must be the same length; otherwise DataFrame raises a ValueError
popdf = DataFrame(popdat)
popdf

Unnamed: 0,state,year,pop
0,Arizona,2005,5.9
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9
4,Virginia,2015,8.3
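
The comment in the cell above says that lists of different lengths cause an error. A minimal sketch of what that looks like, using a hypothetical `bad` dictionary that is not part of this notebook:

```python
from pandas import DataFrame

# hypothetical data: 'year' has one fewer entry than 'state'
bad = {'state': ['Arizona', 'Arizona', 'Virginia'],
       'year': [2005, 2010]}

try:
    DataFrame(bad)
    raised = False
except ValueError as err:
    # pandas refuses to build a frame from columns of unequal length
    raised = True
    print('ValueError:', err)
```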


In [88]:
# how many rows and columns
popdf.shape

(5, 3)

**The `columns` parameter takes a sequence of names and sets the column order**

In [90]:
popdf = DataFrame(popdat, columns=['year','state', 'pop'])
popdf

Unnamed: 0,year,state,pop
0,2005,Arizona,5.9
1,2010,Arizona,6.6
2,2015,Arizona,6.8
3,2010,Virginia,7.9
4,2015,Virginia,8.3


In [91]:
# a column name that is not a key in the dictionary produces a column of NaNs (as with a Series index)
popdf1 = DataFrame(popdat, columns=['year','state', 'population'])
popdf1

Unnamed: 0,year,state,population
0,2005,Arizona,
1,2010,Arizona,
2,2015,Arizona,
3,2010,Virginia,
4,2015,Virginia,


In [92]:
# names of index and columns
print('Index:',popdf.index)
print('Columns:',popdf.columns)

Index: RangeIndex(start=0, stop=5, step=1)
Columns: Index(['year', 'state', 'pop'], dtype='object')


In [93]:
popdf.values  # values gives an ndarray

array([[2005, 'Arizona', 5.9],
       [2010, 'Arizona', 6.6],
       [2015, 'Arizona', 6.8],
       [2010, 'Virginia', 7.9],
       [2015, 'Virginia', 8.3]], dtype=object)

In [94]:
type(popdf.values)

numpy.ndarray
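
The array above has `dtype=object` because the columns mix integers, strings, and floats. A small sketch, rebuilding `popdf` from the same data, suggests that selecting only the numeric columns gives a numeric array instead. `to_numpy()` is used here as an equivalent of `values`; the variable names `mixed` and `numeric` are only for illustration.

```python
import numpy as np
from pandas import DataFrame

popdat = {'state': ['Arizona', 'Arizona', 'Arizona', 'Virginia', 'Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat, columns=['year', 'state', 'pop'])

mixed = popdf.to_numpy()                     # all three columns
numeric = popdf[['year', 'pop']].to_numpy()  # integer and float columns only

print(mixed.dtype)    # object
print(numeric.dtype)  # float64: the integers are upcast to match the floats
```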

In [118]:
popdf.name  # unlike a Series, a DataFrame does not have a name attribute

AttributeError: 'DataFrame' object has no attribute 'name'

---

**2. Creating a DataFrame from a nested dictionary**

In [153]:
popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}
popdf2 = DataFrame(popdat2)
popdf2

Unnamed: 0,Arizona,Virginia
2005,5.9,
2010,6.6,7.9
2015,6.8,8.3


**The outer keys become the column names and the inner keys become the index labels.<br>
The inner keys are aligned across columns, and NaNs fill in any missing values**

In [154]:
# if you want it the other way, transpose the data frame
popdf2.T

Unnamed: 0,2005,2010,2015
Arizona,5.9,6.6,6.8
Virginia,,7.9,8.3


In [155]:
popdf2   # .T returns a new DataFrame and leaves popdf2 unchanged,
         # so reassign the result if you want to keep the transposed version

Unnamed: 0,Arizona,Virginia
2005,5.9,
2010,6.6,7.9
2015,6.8,8.3
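
A short sketch of keeping the transposed frame, assuming the same `popdat2` as above. The names `by_state` and `direct` are only for illustration; `DataFrame.from_dict` with `orient='index'` is shown as one possible way to get the state-by-year layout without transposing.

```python
from pandas import DataFrame

popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}
popdf2 = DataFrame(popdat2)

by_state = popdf2.T      # a new object; popdf2 itself is untouched
print(popdf2.shape)      # (3, 2)
print(by_state.shape)    # (2, 3)

# alternative: treat the outer keys as row labels from the start
direct = DataFrame.from_dict(popdat2, orient='index')
print(direct)
```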


In [111]:
popdf3 = DataFrame(popdat2,columns=['AZ','VA'])
popdf3

Unnamed: 0,AZ,VA


<font color="red">In the above, the column names don't match any of the keys of popdat2, so the resulting dataframe has columns named AZ and VA, but no rows</font>

In [112]:
popdf3 = DataFrame(popdat2,columns=['VA','Arizona'])
popdf3

Unnamed: 0,VA,Arizona
2005,,5.9
2010,,6.6
2015,,6.8


*So this is similar to creating a Series out of a dictionary where an index is given, with only some labels matching the dictionary keys*
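
For comparison, a minimal sketch of the Series case mentioned above. The dictionary `azpop` and the index values are hypothetical and chosen only to show the matching behavior.

```python
from pandas import Series

azpop = {2005: 5.9, 2010: 6.6, 2015: 6.8}   # hypothetical dictionary

# only 2010 and 2015 match dictionary keys; 2020 has no match
s = Series(azpop, index=[2010, 2015, 2020])
print(s)
```

The label 2020 gets NaN, and the key 2005 is dropped because it is not in the requested index.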

---

**3. Creating a DataFrame from a 2D NumPy array**

In [113]:
rand2d = np.random.random((3,2))
randdf = DataFrame(rand2d)
randdf

Unnamed: 0,0,1
0,0.685627,0.080473
1,0.390041,0.037957
2,0.210715,0.090835


In [114]:
# change index and column names
randdf.index = ['one', 'two', 'three']
randdf.columns = ['first', 'second']
randdf

Unnamed: 0,first,second
one,0.685627,0.080473
two,0.390041,0.037957
three,0.210715,0.090835


In [144]:
# or do all of it at the time of creation
randdf = DataFrame(rand2d, index=['one', 'two', 'three'],
                   columns = ['first', 'second'])
randdf

Unnamed: 0,first,second
one,0.685627,0.080473
two,0.390041,0.037957
three,0.210715,0.090835


---

#### <font color="brown">Columns</font>

**Membership**

In [116]:
popdf

Unnamed: 0,year,state,pop
0,2005,Arizona,5.9
1,2010,Arizona,6.6
2,2015,Arizona,6.8
3,2010,Virginia,7.9
4,2015,Virginia,8.3


In [117]:
'debt' in popdf.columns  # like dictionary keys

False

**Each column is a Series**

In [32]:
print(popdf['state'])
print(popdf['state'].values)
print(popdf['state'].index)

0     Arizona
1     Arizona
2     Arizona
3    Virginia
4    Virginia
Name: state, dtype: object
['Arizona' 'Arizona' 'Arizona' 'Virginia' 'Virginia']
RangeIndex(start=0, stop=5, step=1)


In [119]:
popdf['state'].name   # name of a column Series is simply the column name

'state'

**Alternatively, a column can be referenced as an attribute of the dataframe**

In [120]:
popdf.state

0     Arizona
1     Arizona
2     Arizona
3    Virginia
4    Virginia
Name: state, dtype: object
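
Attribute access has a catch worth knowing: it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. The `pop` column in this notebook is such a clash, as this sketch (rebuilding `popdf` from the same data) tries to show.

```python
from pandas import DataFrame, Series

popdat = {'state': ['Arizona', 'Arizona', 'Arizona', 'Virginia', 'Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat, columns=['year', 'state', 'pop'])

print(type(popdf.state))     # a Series: no clash, so this is the column
print(type(popdf['pop']))    # a Series: bracket access always gives the column
print(callable(popdf.pop))   # True: this is the DataFrame.pop method, not the column
```

Bracket access is the safer habit when column names are not under your control.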

**A list of column names selects a subset of columns**

In [122]:
popdf[['state','pop']]

Unnamed: 0,state,pop
0,Arizona,5.9
1,Arizona,6.6
2,Arizona,6.8
3,Virginia,7.9
4,Virginia,8.3


**Changing column names**

In [123]:
popdf

Unnamed: 0,year,state,pop
0,2005,Arizona,5.9
1,2010,Arizona,6.6
2,2015,Arizona,6.8
3,2010,Virginia,7.9
4,2015,Virginia,8.3


In [124]:
popdf.columns = ['state','year','pop']
popdf

Unnamed: 0,state,year,pop
0,2005,Arizona,5.9
1,2010,Arizona,6.6
2,2015,Arizona,6.8
3,2010,Virginia,7.9
4,2015,Virginia,8.3


<font color="red">**Warning: Assigning to `columns` afterward only relabels the columns; it does not rearrange the data!**</font>

In [125]:
# restore to original
popdf.columns = ['year','state','pop']
popdf

Unnamed: 0,year,state,pop
0,2005,Arizona,5.9
1,2010,Arizona,6.6
2,2015,Arizona,6.8
3,2010,Virginia,7.9
4,2015,Virginia,8.3
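
If the goal is to rearrange columns or to rename one of them, a sketch along these lines may be closer to what is wanted. It rebuilds `popdf` from the same data; `reordered` and `renamed` are illustrative names, and `population` is a hypothetical new column name.

```python
from pandas import DataFrame

popdat = {'state': ['Arizona', 'Arizona', 'Arizona', 'Virginia', 'Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat, columns=['year', 'state', 'pop'])

# selecting with a list moves the data along with the names
reordered = popdf[['state', 'year', 'pop']]

# rename relabels one column and leaves the others alone
renamed = popdf.rename(columns={'pop': 'population'})

print(reordered.head(2))
print(renamed.columns)
```

Both calls return new DataFrames, so `popdf` keeps its original column order and names.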


---

#### <font color="brown">Indexing and Manipulating rows and columns</font>

**Row indexing by label, using loc (here the labels are the default integers 0 to 4)**

In [127]:
popdf.loc[1]

year        2010
state    Arizona
pop          6.6
Name: 1, dtype: object

**Row of a DataFrame is a Series**

In [128]:
print(popdf.loc[1].name)
print(popdf.loc[1].values)

1
[2010 'Arizona' 6.6]


**Range of rows**

In [129]:
popdf.loc[1:3]

Unnamed: 0,year,state,pop
1,2010,Arizona,6.6
2,2015,Arizona,6.8
3,2010,Virginia,7.9


**Note that with loc, the end value of a row slice is INCLUSIVE!**
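
To see the inclusive end point next to ordinary Python slicing, here is a small sketch on a hypothetical one-column frame `df`. `iloc` is the position-based counterpart of `loc`.

```python
from pandas import DataFrame

# hypothetical frame with the default integer index 0..4
df = DataFrame({'pop': [5.9, 6.6, 6.8, 7.9, 8.3]})

by_label = df.loc[1:3]       # labels 1, 2 and 3: the end label is included
by_position = df.iloc[1:3]   # positions 1 and 2: the end position is excluded

print(len(by_label))      # 3
print(len(by_position))   # 2
```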

**Subset of rows, subset of columns**

In [130]:
popdf.loc[[0,2],['state','pop']]  # lists of row labels and column names, similar to fancy indexing on an ndarray

Unnamed: 0,state,pop
0,Arizona,5.9
2,Arizona,6.8


**Adding a column**

In [132]:
# same value for all rows
popdf['debt'] = 1.5
popdf

Unnamed: 0,year,state,pop,debt
0,2005,Arizona,5.9,1.5
1,2010,Arizona,6.6,1.5
2,2015,Arizona,6.8,1.5
3,2010,Virginia,7.9,1.5
4,2015,Virginia,8.3,1.5


In [133]:
# Different value for each row
popdf['debt'] = np.arange(1,6)
popdf

Unnamed: 0,year,state,pop,debt
0,2005,Arizona,5.9,1
1,2010,Arizona,6.6,2
2,2015,Arizona,6.8,3
3,2010,Virginia,7.9,4
4,2015,Virginia,8.3,5


In [157]:
popdf2

Unnamed: 0,Arizona,Virginia
2005,5.9,
2010,6.6,7.9
2015,6.8,8.3


In [158]:
# Different value for each row
popdf2['NJ'] = [8.2, 8.4, 8.6]
popdf2

Unnamed: 0,Arizona,Virginia,NJ
2005,5.9,,8.2
2010,6.6,7.9,8.4
2015,6.8,8.3,8.6


In [134]:
debts = Series([1.2, 1.5, 1.7])
popdf['debt'] = debts
popdf

Unnamed: 0,year,state,pop,debt
0,2005,Arizona,5.9,1.2
1,2010,Arizona,6.6,1.5
2,2015,Arizona,6.8,1.7
3,2010,Virginia,7.9,
4,2015,Virginia,8.3,


**In the above, the Series is aligned with the DataFrame on index labels: the Series only has labels 0, 1 and 2, so rows 3 and 4 get NaN**
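
The NaNs in the cell above come from index alignment rather than from position. A sketch with a hypothetical Series whose labels are deliberately out of order may make this clearer; `popdf` is rebuilt from the same data.

```python
from pandas import DataFrame, Series

popdat = {'state': ['Arizona', 'Arizona', 'Arizona', 'Virginia', 'Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat, columns=['year', 'state', 'pop'])

# hypothetical values, labeled 4 and 0 (not 0 and 1)
debts = Series([1.2, 1.5], index=[4, 0])
popdf['debt'] = debts

print(popdf)
```

The value 1.2 lands in row 4 and 1.5 in row 0; rows 1, 2 and 3 have no matching label and get NaN.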

**Creating a new column with values as a function of the other columns**

In [145]:
randdf

Unnamed: 0,first,second
one,0.685627,0.080473
two,0.390041,0.037957
three,0.210715,0.090835


In [146]:
randdf['third'] = randdf['first'] > randdf['second']
randdf

Unnamed: 0,first,second,third
one,0.685627,0.080473,True
two,0.390041,0.037957,True
three,0.210715,0.090835,True


**Row (index) membership**

In [147]:
'three' in randdf.index

True

**Row indexing by labels, using loc**

In [148]:
randdf.loc['two':'three']

Unnamed: 0,first,second,third
two,0.390041,0.037957,True
three,0.210715,0.090835,True


In [139]:
randdf.loc[1:2]  # fails: loc slices by label, and this index holds string labels, not integers

TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>
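
When a frame has string labels, positions are still available through `iloc`. A minimal sketch, using a fixed array `vals` as a stand-in for the random `rand2d` so the result is reproducible:

```python
import numpy as np
from pandas import DataFrame

# fixed stand-in for rand2d (an assumption, for reproducibility)
vals = np.array([[0.1, 0.2],
                 [0.3, 0.4],
                 [0.5, 0.6]])
df = DataFrame(vals, index=['one', 'two', 'three'],
               columns=['first', 'second'])

by_pos = df.iloc[1:3]               # positions 1 and 2 (end excluded)
by_label = df.loc['two':'three']    # labels 'two' through 'three' (end included)

print(by_pos.equals(by_label))      # True
```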

---

**Adding a row using loc**

In [159]:
popdf2

Unnamed: 0,Arizona,Virginia,NJ
2005,5.9,,8.2
2010,6.6,7.9,8.4
2015,6.8,8.3,8.6


In [160]:
popdf2.loc[2020] = [7.2, 8.6, 8.9]
popdf2

Unnamed: 0,Arizona,Virginia,NJ
2005,5.9,,8.2
2010,6.6,7.9,8.4
2015,6.8,8.3,8.6
2020,7.2,8.6,8.9


In [161]:
popdf2.loc[[2010,2020]]

Unnamed: 0,Arizona,Virginia,NJ
2010,6.6,7.9,8.4
2020,7.2,8.6,8.9
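
Assigning through `loc` appends a row only when the label is new. With a label that already exists, the same syntax overwrites that row. The sketch below rebuilds the two-column `popdf2`; the replacement values for 2010 are hypothetical.

```python
from pandas import DataFrame

popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}
popdf2 = DataFrame(popdat2)
n_before = len(popdf2)

popdf2.loc[2020] = [7.2, 8.6]   # new label: a row is appended
popdf2.loc[2010] = [6.4, 8.0]   # existing label: the row is overwritten (hypothetical values)

print(popdf2)
```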


**Deleting a column with del operation**

In [181]:
popdf

Unnamed: 0,year,state,pop,debt
0,2005,Arizona,5.9,1.0
1,2010,Arizona,6.6,2.0
2,2015,Arizona,6.8,3.0
3,2010,Virginia,7.9,
4,2015,Virginia,8.3,


In [182]:
del popdf['debt']
popdf

Unnamed: 0,year,state,pop
0,2005,Arizona,5.9
1,2010,Arizona,6.6
2,2015,Arizona,6.8
3,2010,Virginia,7.9
4,2015,Virginia,8.3


In [164]:
randdf

Unnamed: 0,first,second,third
one,0.685627,0.080473,True
two,0.390041,0.037957,True
three,0.210715,0.090835,True


In [165]:
del randdf['second']
randdf

Unnamed: 0,first,third
one,0.685627,True
two,0.390041,True
three,0.210715,True
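
`del` modifies the DataFrame in place. If the original should stay intact, `drop` is one alternative, since it returns a new frame. A sketch on a small hypothetical frame `df` (fixed values instead of the random ones above):

```python
from pandas import DataFrame

# hypothetical frame standing in for randdf
df = DataFrame({'first': [0.1, 0.3, 0.5],
                'second': [0.2, 0.4, 0.6]},
               index=['one', 'two', 'three'])

trimmed = df.drop(columns=['second'])   # new frame without 'second'

print(list(trimmed.columns))   # ['first']
print(list(df.columns))        # ['first', 'second']: df is unchanged
```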


---

**Indexing a DataFrame with iloc (using integer positions)**

In [183]:
popdf

Unnamed: 0,year,state,pop
0,2005,Arizona,5.9
1,2010,Arizona,6.6
2,2015,Arizona,6.8
3,2010,Virginia,7.9
4,2015,Virginia,8.3


In [184]:
popdf.loc[1]

year        2010
state    Arizona
pop          6.6
Name: 1, dtype: object

In [185]:
popdf.loc[1,'year']

2010

In [186]:
# using iloc
popdf.iloc[1,0]   # integer positions for both the row and the column

2010

In [188]:
popdf.iloc[1:4]   # slice rows

Unnamed: 0,year,state,pop
1,2010,Arizona,6.6
2,2015,Arizona,6.8
3,2010,Virginia,7.9


In [189]:
popdf.iloc[[1,2,3]]  # list of row positions

Unnamed: 0,year,state,pop
1,2010,Arizona,6.6
2,2015,Arizona,6.8
3,2010,Virginia,7.9


In [190]:
popdf.iloc[:,'state']

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

**With iloc, both rows and columns must be given as integer positions (or a boolean array), never as labels**
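
If only the column name is known, one possible workaround is to look up its position first with `columns.get_loc`, as in this sketch (rebuilding `popdf` from the same data; `col` and `by_pos` are illustrative names).

```python
from pandas import DataFrame

popdat = {'state': ['Arizona', 'Arizona', 'Arizona', 'Virginia', 'Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat, columns=['year', 'state', 'pop'])

col = popdf.columns.get_loc('state')   # integer position of the 'state' column
by_pos = popdf.iloc[:, col]            # all rows, that one column

print(col)      # 1
print(by_pos)
```

For label-based selection, `popdf.loc[:, 'state']` gives the same column without the lookup.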

In [194]:
popdf2

Unnamed: 0,Arizona,Virginia,NJ
2005,5.9,,8.2
2010,6.6,7.9,8.4
2015,6.8,8.3,8.6
2020,7.2,8.6,8.9


In [196]:
popdf2.iloc[2:,[0,2]]

Unnamed: 0,Arizona,NJ
2015,6.8,8.6
2020,7.2,8.9
