## Pandas and SQL - Continued

In [3]:
#Import pandas
import pandas as pd

In [5]:
#Import pandas_access https://github.com/jbn/pandas_access
try:
    import pandas_access as mdb
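except ImportError:
    #pip.main() was removed in pip 10, so run pip as a module instead
    import subprocess, sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pandas_access'])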
    import pandas_access as mdb

### Reading in the NRI point data
Below, we read the CSV file into a Pandas DataFrame as we have in the past, with a few exceptions.

Pandas, like MS Access, will infer each column's data type from the values it imports. However, some fields that look numeric need to be imported as strings: the `recordid`, `fips`, `hydro`, `mhydro`, and `mlra` fields. To do this, we create a dictionary that maps each of those field names to the type we want to use instead. Any fields left off this list get the default (inferred) data types.

We will also set the `recordid` column as the index for the DataFrame.

In [None]:
#Create the dataType dictionary
dtypeDict = {'recordid':'str',
             'fips':'str',
             'hydro':'str',
             'mhydro':'str',
             'mlra':'str'
            }

#Read in the data
dfPoint = pd.read_csv('../Data/nc_point.csv',
                      index_col='recordid',
                      dtype=dtypeDict)
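
To see why the override matters, here is a minimal sketch using a small in-memory CSV instead of `nc_point.csv`. The sample rows and the names `csv_text`, `inferred`, and `as_text` are made up for illustration; the point is only that an identifier with a leading zero loses it when Pandas infers an integer type.

```python
import io
import pandas as pd

# Hypothetical sample data (not taken from nc_point.csv)
csv_text = "recordid,fips,xfact\n0100101,01001,5\n0100102,01001,7\n"

# Default behavior: pandas infers integers, so the leading zeros are dropped
inferred = pd.read_csv(io.StringIO(csv_text))

# With a dtype dictionary, the listed columns are kept as text
as_text = pd.read_csv(io.StringIO(csv_text),
                      dtype={'recordid': 'str', 'fips': 'str'})

print(inferred['fips'].tolist())
print(as_text['fips'].tolist())
```

Columns not named in the dictionary (here `xfact`) are still inferred as usual.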

In [21]:
#Show the data types
dfPoint.dtypes

nriptr        int64
stratum       int64
cluster       int64
xfact         int64
fips         object
hydro        object
mhydro       object
mlra         object
bailey       object
split       float64
impute        int64
impute87      int64
wifact        int64
wcfact        int64
urfact        int64
ukfact      float64
usleflag      int64
s5id         object
s5name       object
surftxt      object
txtmod       object
loslope       int64
hislope       int64
flood        object
ophase       object
tfact         int64
dtype: object

In [10]:
#Have a quick look 
dfPoint.head()

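| | recordid | nriptr | stratum | cluster | xfact | fips | hydro | mhydro | mlra | bailey | ... | usleflag | s5id | s5name | surftxt | txtmod | loslope | hislope | flood | ophase | tfact |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37001010001 | 249587 | 37001 | 1 | 1 | 37001 | 3030002 | 303 | 136 | 231A | ... | 0 | NC0032 | APPLING | SL | NON | 2 | 6 | NONE | ERODED | 4 |
| 1 | 37001010002 | 288554 | 37001 | 1 | 17 | 37001 | 3030002 | 303 | 136 | 231A | ... | 0 | VA0046 | ORANGE | SIL | NON | 6 | 10 | NONE | 0 | 3 |
| 2 | 37001010003 | 274351 | 37001 | 1 | 17 | 37001 | 3030002 | 303 | 136 | 231A | ... | 0 | SC0017 | HERNDON | SIL | NON | 15 | 25 | NONE | 0 | 5 |
| 3 | 37001010004 | 274167 | 37001 | 1 | 17 | 37001 | 3030002 | 303 | 136 | 231A | ... | 0 | SC0014 | GEORGEVILLE | SIL | NON | 10 | 15 | NONE | 0 | 4 |
| 4 | 37001010005 | 288554 | 37001 | 1 | 1 | 37001 | 3030002 | 303 | 136 | 231A | ... | 0 | VA0046 | ORANGE | SIL | NON | 6 | 10 | NONE | 0 | 3 |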


OK. Now it's your turn. Import the `nc_trend.csv` file. Set the following columns to be strings: `recordid`, `landuse`, `broad`. (Other columns with nominal data should be strings too, but this will suffice...) Also, as above, set the `recordid` column to be the index.

In [27]:
dtypeDict = {'recordid':'str',
             'landuse':'str',
             'broad':'str'
            }

dfTrend = pd.read_csv("../Data/nc_trend.csv", dtype=dtypeDict, index_col='recordid')
dfTrend.dtypes

yr           int64
landuse     object
broad       object
prime        int64
crp          int64
crpcov       int64
crpnum       int64
dblcrop      int64
irtyp        int64
irsor        int64
irsys        int64
ucfact     float64
upfact     float64
lenslop      int64
slope      float64
landcl      object
knoll        int64
wrotat       int64
usle       float64
ei         float64
eiwater    float64
eiwind     float64
aaweq      float64
dtype: object

OK, now we are ready to analyze the data (and learn how Pandas does it...)

* First, another example of an aggregate function: let's count the number of sample points and the total area (the sum of `xfact`) within each county using the `dfPoint` DataFrame.

In [36]:
#Create the grouping object
grpCounty = dfPoint.groupby('fips')
type(grpCounty)

pandas.core.groupby.DataFrameGroupBy
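
The grouping object does not hold a result yet; it only records how the rows are split. A minimal sketch of how to inspect one before aggregating, using a made-up three-row table (the values and the names `df` and `grp` are illustrative, not from the NRI data):

```python
import pandas as pd

# Hypothetical sample rows standing in for dfPoint
df = pd.DataFrame({
    'fips':  ['37001', '37001', '37003'],
    'xfact': [1, 17, 4],
})

grp = df.groupby('fips')

# Number of distinct groups (here, counties)
print(grp.ngroups)

# Number of rows in each group
print(grp.size())

# The rows belonging to a single group
print(grp.get_group('37003'))
```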

In [51]:
#With this DataFrameGroupBy object we can apply different aggregate functions.
dfX = grpCounty['fips'].agg('count')
dfX.head()

fips
37001    369
37003    191
37005    314
37007    336
37009    445
Name: fips, dtype: int64

In [55]:
#Sum up the xfact values for each county
dfX = grpCounty['xfact'].agg('sum')
dfX.head()

fips
37001    2783
37003    1685
37005    1507
37007    3438
37009    2732
Name: xfact, dtype: int64

In [56]:
#Or we can combine the aggregating functions into a single 
# command using a dictionary to define how we want to aggregate

#Create a dictionary of field names: aggregating functions
grpFunctions = {'fips':['count'],'xfact':['sum']}

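#Apply them all at once. The dictionary keys are column names, so call agg on
# the whole grouping object rather than on the single 'xfact' column
# (recent pandas versions reject a dictionary passed to a single grouped column)
dfX = grpCounty.agg(grpFunctions)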
dfX.head()

| fips | fips (count) | xfact (sum) |
|---|---|---|
| 37001 | 369 | 2783 |
| 37003 | 191 | 1685 |
| 37005 | 314 | 1507 |
| 37007 | 336 | 3438 |
| 37009 | 445 | 2732 |
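
For readers coming from SQL, the grouped count and sum correspond to a `GROUP BY` query. The sketch below is one way to compare the two on a made-up table; the sample values, the table name `point`, and the output names `n_points` and `total_xfact` are assumptions chosen for illustration. It uses Pandas named aggregation (available in pandas 0.25 and later) and an in-memory SQLite database.

```python
import sqlite3
import pandas as pd

# Hypothetical sample rows standing in for dfPoint
df = pd.DataFrame({
    'fips':  ['37001', '37001', '37003'],
    'xfact': [1, 17, 4],
})

# Named aggregation: output column = (input column, aggregate function)
by_county = df.groupby('fips').agg(
    n_points=('xfact', 'count'),
    total_xfact=('xfact', 'sum'),
)

# The same summary in SQL, run against an in-memory SQLite copy of the table
con = sqlite3.connect(':memory:')
df.to_sql('point', con, index=False)
sql = """
SELECT fips, COUNT(xfact) AS n_points, SUM(xfact) AS total_xfact
FROM point
GROUP BY fips
ORDER BY fips
"""
by_county_sql = pd.read_sql_query(sql, con, index_col='fips')
con.close()

print(by_county)
print(by_county_sql)
```

Named aggregation also gives flat, readable column names instead of the two-level header that the dictionary form produces.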


## Transforming data
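Pandas can pivot data too. Let's pivot our `dfTrend` table so that the year values move into columns. This is done with the Pandas `pivot` method. Called with only `columns='yr'`, it spreads every remaining column across the years, so the result has two-level column labels (field name, then year); the `broad` value for each year is the `broad` block of that result.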

In [59]:
dfX = dfTrend.pivot(columns='yr')
dfX.head()

| recordid | landuse 1982 | landuse 1987 | landuse 1992 | landuse 1997 | broad 1982 | broad 1987 | broad 1992 | broad 1997 | prime 1982 | prime 1987 | ... | eiwater 1992 | eiwater 1997 | eiwind 1982 | eiwind 1987 | eiwind 1992 | eiwind 1997 | aaweq 1982 | aaweq 1987 | aaweq 1992 | aaweq 1997 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37001010001 | 342 | 342 | 800 | 800 | 5 | 5 | 8 | 8 | 1 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 37001010002 | 342 | 342 | 342 | 342 | 5 | 5 | 5 | 5 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 37001010003 | 342 | 342 | 342 | 342 | 5 | 5 | 5 | 5 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 37001010004 | 342 | 342 | 342 | 342 | 5 | 5 | 5 | 5 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 37001010005 | 342 | 342 | 342 | 800 | 5 | 5 | 5 | 8 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
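
To get just one field per year, either select that field's block from the pivoted result or name it in the `values` argument. A minimal sketch on a made-up long-format table (the record IDs, values, and the names `long_df`, `wide_all`, `broad_from_block`, and `broad_direct` are illustrative only):

```python
import pandas as pd

# Hypothetical long-format sample: one row per record per year
long_df = pd.DataFrame({
    'recordid': ['A', 'A', 'B', 'B'],
    'yr':       [1982, 1987, 1982, 1987],
    'broad':    ['5', '5', '5', '8'],
    'slope':    [2.0, 2.0, 6.0, 6.0],
}).set_index('recordid')

# Pivot everything: columns become (field, year) pairs
wide_all = long_df.pivot(columns='yr')

# Option 1: pick the 'broad' block out of the full pivot
broad_from_block = wide_all['broad']

# Option 2: pivot only the 'broad' values in the first place
broad_direct = long_df.pivot(columns='yr', values='broad')

print(broad_from_block)
print(broad_direct)
```

Note that `pivot` expects each index/year combination to appear only once; duplicated combinations raise an error.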
