# Requesting Statistics from the USGS Statistics Service
The USGS calculates various types of statistics for its data and provides these values through a web service. You can access this service through the `stats` function.
Learn more about the [USGS Statistics Service](https://waterservices.usgs.gov/rest/Statistics-Service.html).

There are three types of reports that you can request using the `statReportType` parameter:

- **'annual'**: This summarizes all of the official daily data for each year using max, min, mean, and the 5, 10, 20, 25, 50, 75, 80, 90, and 95th percentiles.
- **'monthly'**: This calculates the mean of the 28 to 31 daily values that occur for each of the months in each of the years of record.
- **'daily'**: This summarizes all of the data for this month and day, using max, min, mean, and the 5, 10, 20, 25, 50, 75, 80, 90, and 95th percentiles.

## Request multiple sites
You can request multiple sites by separating them with commas, like this: `'01541200,01542500'`
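
If your site numbers are stored in a Python list, a minimal sketch like the following can build the comma-separated string. The variable names are illustrative, and the `hf.stats` call appears only as a comment because it needs a network connection.

```python
# Site numbers are kept as strings so that leading zeros are preserved.
sites = ['01541200', '01542500']

# Join the list into the comma-separated form described above.
sites_arg = ','.join(sites)
print(sites_arg)

# The joined string can then be passed as the first argument, e.g.:
# hf.stats(sites_arg, 'annual')
```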

## Providing additional arguments
The USGS Statistics Service allows you to specify a wide array of additional parameters in your request. You can provide these parameters as keyword arguments, like in this example:

`hf.stats('01452500', parameterCd='00060')`

This will only request statistics for discharge, which is specified with the '00060' parameter code.
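
To see how a keyword argument ends up in the request, the sketch below rebuilds the kind of URL that hydrofunctions prints in the examples at the end of this page. The `build_stats_url` helper is hypothetical and is not part of hydrofunctions. It only assumes that each keyword argument is appended to the query string unchanged, which is consistent with those printed URLs.

```python
from urllib.parse import urlencode

BASE_URL = 'https://waterservices.usgs.gov/nwis/stat/'


def build_stats_url(site, stat_report_type='daily', **kwargs):
    """Hypothetical helper: assemble a Statistics Service request URL."""
    params = {
        'statReportType': stat_report_type,
        'statType': 'all',
        'sites': site,
        'format': 'rdb',
    }
    # Extra keyword arguments are passed through to the query string as-is.
    params.update(kwargs)
    return BASE_URL + '?' + urlencode(params)


url = build_stats_url('01542500', 'daily', parameterCd='00060')
print(url)
```

Printing the URL this way can be a quick check that an extra argument is spelled the way the service expects before you send the request.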

## Limiting requests to only certain parameters
The [default behavior for the USGS Statistics Service](https://waterservices.usgs.gov/rest/Statistics-Service.html#parameterCd) is to provide statistics for every parameter that is collected at a site. This can make for a long table that you will have to filter by the parameter that you want, like this:

`my_stat_dataframe.loc[my_stat_dataframe['parameter_cd'] == '00060']`

Alternatively, you can request only the parameter that you are interested in, rather than all of the parameters. To limit your request, provide the `parameterCd` keyword argument, like this:

`hf.stats('01452500', parameterCd='00060')`

You can request more than one parameter by listing every parameter code that you are interested in, separated by commas:
`parameterCd='00060,00065'`
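
If you have already downloaded every parameter, the filtering approach described earlier in this section also works for several codes at once. The sketch below assumes pandas is installed and uses a small hand-built table in place of a real response. The values are placeholders, not real measurements. `isin` keeps the rows whose `parameter_cd` matches any code in the list.

```python
import pandas as pd

# Illustrative stand-in for a stats dataframe; the values are placeholders.
stats_df = pd.DataFrame({
    'parameter_cd': ['00010', '00060', '00065', '00060'],
    'mean_va': [12.9, 258.8, 3.1, 441.1],
})

# Keep only the discharge ('00060') and gage height ('00065') rows.
wanted = ['00060', '00065']
subset = stats_df.loc[stats_df['parameter_cd'].isin(wanted)]
print(subset)
```

Parameter codes are stored as strings here so that the leading zeros survive the comparison.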

## Calculating annual statistics using water years
The [default behavior for the USGS Statistics Service](https://waterservices.usgs.gov/rest/Statistics-Service.html#statYearType) is to calculate annual statistics using calendar years. Unfortunately, in many parts of the US, a calendar year splits the wet season in half. Since discharge tends to be autocorrelated, a large flood in December 2019 makes a large flood in January 2020 more likely, yet the two events would be counted in different years. To avoid this, hydrologists often use 'water years', which begin on October 1st, during what is typically a drier part of the year. To calculate annual statistics using water years, provide the `statYearType='water'` argument, like this:

`hf.stats('01452500', 'annual', statYearType='water')`
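
To make the October 1st split concrete, here is a minimal sketch that assigns a date to a water year. It assumes the common convention that a water year is labeled by the calendar year in which it ends. The `water_year` helper is illustrative and is not part of hydrofunctions.

```python
from datetime import date


def water_year(d):
    """Return the water year for a date.

    Assumes the water year starts on October 1st and is labeled by the
    calendar year in which it ends.
    """
    return d.year + 1 if d.month >= 10 else d.year


# December 2019 and January 2020 fall in the same water year.
print(water_year(date(2019, 12, 15)))
print(water_year(date(2020, 1, 15)))
# The day before the split still belongs to the previous water year.
print(water_year(date(2019, 9, 30)))
```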

## Missing data
The [default behavior for the USGS Statistics Service](https://waterservices.usgs.gov/rest/Statistics-Service.html#missingData) is to not calculate statistics for months or years if there are *any* missing values. In other words, in an annual report, every year reported will be based on 365 or 366 (leap year) values. You can override this behavior by providing the `missingData='on'` parameter. This will calculate the statistics as long as there is at least one measurement. You can decide whether or not to use the statistic by looking at the `count_nu` column to see how many values were used to generate the statistic.
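
One way to act on `count_nu` is to convert it into the fraction of the year that it covers. The sketch below assumes calendar-year statistics. The `completeness` helper and the 0.9 cutoff are illustrative choices, not part of hydrofunctions or a USGS recommendation. The year and count pairs are taken from the annual example later on this page.

```python
import calendar


def completeness(year, count_nu):
    """Fraction of a calendar year's days covered by count_nu daily values."""
    days_in_year = 366 if calendar.isleap(year) else 365
    return count_nu / days_in_year


MIN_FRACTION = 0.9  # illustrative cutoff; choose one that suits your analysis

for year, count in [(2010, 41), (2012, 360), (2013, 365)]:
    frac = completeness(year, count)
    status = 'keep' if frac >= MIN_FRACTION else 'drop'
    print(year, round(frac, 3), status)
```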

## Viewing the metadata header or the data
The USGS accompanies every dataset with a header that explains the data. Hydrofunctions will automatically display this header along with the data. To access just one of the two, use either the `.header` or the `.df` attribute.

```
import hydrofunctions as hf

test = hf.stats('01542500')

test        # Display the header & dataframe
test.header # Display just the header
test.df     # Display just the dataframe
```

## Examples
The first step as always is to import hydrofunctions.

In [1]:
import hydrofunctions as hf
print(hf.__version__)

0.2.0


To get started, let's request some data from Karthaus, PA to see what typically gets collected there.

In [2]:
may_2019 = hf.NWIS('01542500', 'dv', '2019-05-01', '2019-06-01')
may_2019

Requested data from https://waterservices.usgs.gov/nwis/dv/?format=json%2C1.1&sites=01542500&startDT=2019-05-01&endDT=2019-06-01


USGS:01542500: WB Susquehanna River at Karthaus, PA
    00010: <Day>  Temperature, water, degrees Celsius
    00060: <Day>  Discharge, cubic feet per second
    00095: <Day>  Specific conductance, water, unfiltered, microsiemens per centimeter at 25 degrees Celsius
    00300: <Day>  Dissolved oxygen, water, unfiltered, milligrams per liter
    00400: <Day>  pH, water, unfiltered, field, standard units
Start: 2019-05-01 00:00:00+00:00
End:   2019-06-01 00:00:00+00:00

### Requesting annual statistics
This site has collected discharge data since 1960, but other parameters, such as water temperature ('00010'), have only been collected since 2010. Unfortunately, in 2010, only 41 days of water temperature measurements were collected. By setting the `missingData` argument to `on`, we can ask the USGS to report averages for incomplete years. Now it is up to you to decide if 41 values is an adequate number!

In [3]:
annual_stats = hf.stats('01542500', 'annual', missingData='on')
# Use annual_stats.header to access just the header, or .df for just the dataframe.
# If you don't specify, both will be provided.
annual_stats

Retrieving annual statistics for site #01542500 from https://waterservices.usgs.gov/nwis/stat/?statReportType=annual&statType=all&sites=01542500&format=rdb&missingData=on


| | agency_cd | site_no | parameter_cd | ts_id | loc_web_ds | year_nu | mean_va | count_nu |
|---|---|---|---|---|---|---|---|---|
| 0 | USGS | 01542500 | 00010 | 118870 | | 2010 | 4.70 | 41 |
| 1 | USGS | 01542500 | 00010 | 118870 | | 2011 | 12.92 | 354 |
| 2 | USGS | 01542500 | 00010 | 118870 | | 2012 | 13.98 | 360 |
| 3 | USGS | 01542500 | 00010 | 118870 | | 2013 | 12.76 | 365 |
| 4 | USGS | 01542500 | 00010 | 118870 | | 2014 | 12.43 | 362 |
| 5 | USGS | 01542500 | 00010 | 118870 | | 2015 | 12.46 | 365 |
| 6 | USGS | 01542500 | 00010 | 118870 | | 2016 | 12.64 | 358 |
| 7 | USGS | 01542500 | 00010 | 118870 | | 2017 | 12.37 | 358 |
| 8 | USGS | 01542500 | 00010 | 118870 | | 2018 | 11.64 | 353 |
| 9 | USGS | 01542500 | 00010 | 118870 | | 2019 | 11.80 | 362 |


### Requesting monthly statistics
The monthly report provides the mean value for each parameter for every month since 1960, when data collection began at this site.

Since this site collects lots of parameters, we can limit our display of the dataframe by filtering everything out except the discharge parameter ('00060').

In [4]:
monthly_stats = hf.stats('01542500', 'monthly')
monthly_stats.df.loc[monthly_stats.df['parameter_cd']=='00060']

Retrieving monthly statistics for site #01542500 from https://waterservices.usgs.gov/nwis/stat/?statReportType=monthly&statType=all&sites=01542500&format=rdb


| | agency_cd | site_no | parameter_cd | ts_id | loc_web_ds | year_nu | month_nu | mean_va | count_nu |
|---|---|---|---|---|---|---|---|---|---|
| 93 | USGS | 01542500 | 00060 | 118867 | | 1960 | 10 | 258.8 | 31 |
| 94 | USGS | 01542500 | 00060 | 118867 | | 1960 | 11 | 441.1 | 30 |
| 95 | USGS | 01542500 | 00060 | 118867 | | 1960 | 12 | 280.5 | 31 |
| 96 | USGS | 01542500 | 00060 | 118867 | | 1961 | 1 | 474.2 | 31 |
| 97 | USGS | 01542500 | 00060 | 118867 | | 1961 | 2 | 5155.0 | 28 |
| 98 | USGS | 01542500 | 00060 | 118867 | | 1961 | 3 | 6108.0 | 31 |
| 99 | USGS | 01542500 | 00060 | 118867 | | 1961 | 4 | 6145.0 | 30 |
| 100 | USGS | 01542500 | 00060 | 118867 | | 1961 | 5 | 3320.0 | 31 |
| 101 | USGS | 01542500 | 00060 | 118867 | | 1961 | 6 | 2109.0 | 30 |
| 102 | USGS | 01542500 | 00060 | 118867 | | 1961 | 7 | 1004.0 | 31 |


### Requesting daily reports
The daily statistics report is different from the monthly and annual reports in that it aggregates multiple years together from across the entire period of record. So in the following example, row 0 provides statistics for January 1st, calculated from the January 1st values recorded between 1961 ('begin_yr') and 2019 ('end_yr').

Note that the full report has 366 rows: one for each of the 365 days of a normal year, plus one for February 29th, which only occurs in leap years.

In [5]:
daily_stats = hf.stats('01542500', 'daily', parameterCd='00060')
daily_stats.df

Retrieving daily statistics for site #01542500 from https://waterservices.usgs.gov/nwis/stat/?statReportType=daily&statType=all&sites=01542500&format=rdb&parameterCd=00060


| | agency_cd | site_no | parameter_cd | ts_id | loc_web_ds | month_nu | day_nu | begin_yr | end_yr | count_nu | ... | mean_va | p05_va | p10_va | p20_va | p25_va | p50_va | p75_va | p80_va | p90_va | p95_va |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USGS | 01542500 | 00060 | 118867 | | 1 | 1 | 1961 | 2019 | 46 | ... | 2850 | 520.0 | 807 | 970 | 1080 | 2180 | 3790 | 4340 | 6800 | 7600.0 |
| 1 | USGS | 01542500 | 00060 | 118867 | | 1 | 2 | 1961 | 2019 | 46 | ... | 2950 | 487.0 | 800 | 1040 | 1100 | 2080 | 3730 | 3960 | 6400 | 9460.0 |
| 2 | USGS | 01542500 | 00060 | 118867 | | 1 | 3 | 1961 | 2019 | 46 | ... | 2880 | 523.0 | 793 | 1110 | 1290 | 2110 | 3400 | 4590 | 6450 | 8390.0 |
| 3 | USGS | 01542500 | 00060 | 118867 | | 1 | 4 | 1961 | 2019 | 46 | ... | 2720 | 541.0 | 753 | 1120 | 1280 | 1890 | 3280 | 4190 | 6070 | 7510.0 |
| 4 | USGS | 01542500 | 00060 | 118867 | | 1 | 5 | 1961 | 2019 | 46 | ... | 2710 | 551.0 | 716 | 1080 | 1180 | 1850 | 3880 | 4480 | 5760 | 8090.0 |
| 5 | USGS | 01542500 | 00060 | 118867 | | 1 | 6 | 1961 | 2019 | 46 | ... | 2850 | 556.0 | 802 | 1040 | 1120 | 2140 | 3820 | 4420 | 6160 | 7460.0 |
| 6 | USGS | 01542500 | 00060 | 118867 | | 1 | 7 | 1961 | 2019 | 46 | ... | 2780 | 576.0 | 819 | 1020 | 1100 | 1900 | 3650 | 4060 | 5500 | 6930.0 |
| 7 | USGS | 01542500 | 00060 | 118867 | | 1 | 8 | 1961 | 2019 | 46 | ... | 2630 | 603.0 | 792 | 976 | 1100 | 1770 | 3700 | 4050 | 4870 | 6090.0 |
| 8 | USGS | 01542500 | 00060 | 118867 | | 1 | 9 | 1961 | 2019 | 46 | ... | 2710 | 621.0 | 788 | 929 | 1130 | 1920 | 3620 | 3720 | 5280 | 8450.0 |
| 9 | USGS | 01542500 | 00060 | 118867 | | 1 | 10 | 1961 | 2019 | 46 | ... | 2560 | 696.0 | 847 | 961 | 1070 | 1720 | 3320 | 3700 | 5370 | 7960.0 |
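
The 366-row count of the full daily report can be checked with a short standard-library sketch that enumerates every month and day pair in a leap year. The year 2020 is an arbitrary choice, used only because it contains February 29th. This block does not contact the USGS service.

```python
from datetime import date, timedelta

# Walk through a leap year one day at a time, recording (month, day) pairs.
day = date(2020, 1, 1)
month_day_pairs = []
while day.year == 2020:
    month_day_pairs.append((day.month, day.day))
    day += timedelta(days=1)

# One pair per row of the daily report, including February 29th.
print(len(month_day_pairs))
# The first pair corresponds to month_nu and day_nu in row 0.
print(month_day_pairs[0])
```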
