# Data Workshop 2

**Instructor:** Jared Brzenski jabrzenski@ucsd.edu

**TAs:** Tommy Stone           tstone@ucsd.edu


This can be run as MATLAB or Python, depending on the environment chosen.

For MATLAB, run ```pip install jupyter-matlab-proxy``` in your environment and activate MATLAB in the upper right corner.


# Raw Data
Let's say we are given the task of analyzing the historical data from the water gauge station at the [Prado Dam on the Santa Ana River in California](https://waterdata.usgs.gov/monitoring-location/USGS-11074000/#dataTypeId=continuous-00065-0&period=P7D).

We want to download the [data](data/PradoDam.txt), clean it, and do some spectral analysis on it to see if there is anything interesting.



## Reading in the data

A quick look at the file shows a long comment header, followed by tab-delimited columns of data.
```
# ---------------------------------- WARNING ----------------------------------------
# Some of the data that you have obtained from this U.S. Geological Survey database
# may not have received Director's approval. Any such data values are qualified
# as provisional and are subject to revision. Provisional data are released on the
# condition that neither the USGS nor the United States Government may be held liable
# for any damages resulting from its use.
#
# Additional info: https://help.waterdata.usgs.gov/policies/provisional-data-statement
#
# File-format description:  https://help.waterdata.usgs.gov/faq/about-tab-delimited-output
# Automated-retrieval info: https://help.waterdata.usgs.gov/faq/automated-retrievals
#
# Contact:   gs-w_support_nwisweb@usgs.gov
# retrieved: 2020-04-29 18:30:02 EDT       (caww01)
#
# Data for the following 1 site(s) are contained in this file
#    USGS 11074000 SANTA ANA R BL PRADO DAM CA
# -----------------------------------------------------------------------------------
#
# Data provided for site 11074000
#            TS   parameter     statistic     Description
#          8183       00060     00003     Discharge, cubic feet per second (Mean)
#
# Data-value qualification codes included in this output:
#     A  Approved for publication -- Processing and review completed.
#     P  Provisional data subject to revision.
#     e  Value has been estimated.
# 
agency_cd	site_no	datetime	8183_00060_00003	8183_00060_00003_cd
5s	15s	20d	14n	10s
USGS	11074000	1940-09-30	51.0	A
USGS	11074000	1940-10-01	47.0	A
USGS	11074000	1940-10-02	47.0	A
USGS	11074000	1940-10-03	47.0	A
```


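The header gives the pertinent information about the file: where it came from, which site and parameter it covers (mean discharge in cubic feet per second), what the qualification codes mean, and how the columns are laid out.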
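
For the Python environment, the same layout can be read with the standard library alone. The following is a minimal sketch, not the workshop's reference solution: the function name ```read_usgs``` and the ```SAMPLE``` text are illustrative assumptions, and the last sample row (with an empty flow field) is made up to show how a missing value could be handled.

```python
import csv
import io
from datetime import date

# A few lines in the same tab-delimited layout as the file shown above.
# The final row, with an empty flow field, is hypothetical.
SAMPLE = (
    "# Data provided for site 11074000\n"
    "agency_cd\tsite_no\tdatetime\t8183_00060_00003\t8183_00060_00003_cd\n"
    "5s\t15s\t20d\t14n\t10s\n"
    "USGS\t11074000\t1940-09-30\t51.0\tA\n"
    "USGS\t11074000\t1940-10-01\t47.0\tA\n"
    "USGS\t11074000\t1940-10-02\t\tP\n"
)


def read_usgs(lines):
    """Return (dates, flows) from an iterable of tab-delimited USGS lines.

    Flow values that are empty or non-numeric are stored as float('nan').
    """
    # Drop the '#' comment header before handing the rest to the csv module.
    body = (ln for ln in lines if not ln.startswith("#"))
    reader = csv.reader(body, delimiter="\t")
    next(reader)  # column names
    next(reader)  # column format codes such as 5s, 15s, 20d
    dates, flows = [], []
    for row in reader:
        if len(row) < 4 or row[0] != "USGS":
            continue  # skip blank or malformed rows
        dates.append(date.fromisoformat(row[2]))
        try:
            flows.append(float(row[3]))
        except ValueError:
            flows.append(float("nan"))
    return dates, flows


dates, flows = read_usgs(io.StringIO(SAMPLE))
print(dates[0], flows[0])  # 1940-09-30 51.0
```

To read the real file instead of the sample, pass an open file handle, for example ```with open("data/PradoDam.txt") as fh: dates, flows = read_usgs(fh)```.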

In MATLAB, we can load this data by passing the name of the data file to ```readtable```.

In [1]:
% MATLAB
% Read in a text file

filename = 'PradoDam.txt';

% Offer a helpful hint if the file does not have the expected extension.
% (The first output is named folder so it does not shadow MATLAB's path function.)
[folder, name, ext] = fileparts(filename);
if ~strcmp(ext, '.txt')
    fprintf("Wrong file extension given.\n");
    return;
end

% We could read in the data raw with
% ff = importdata(filename);

% Or, read it in as a table with
f = readtable(filename);

We can examine the data by viewing the ```f``` variable, and note that it is already in columns. We are interested in columns 3 and 4, the date and the measurements.

In [2]:
% MATLAB
flow = f{:,4};
% Convert the dates into serial date numbers (datenum) so MATLAB can do date-specific work.
date = datenum( f{:,3} );
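
A serial date number is simply a count of days from a fixed origin. As a rough illustration for the Python environment, the sketch below converts between ISO date strings and MATLAB-style whole-day serial numbers; the helper names ```to_datenum``` and ```from_datenum``` are hypothetical, and the offset of 366 days reflects the different origins used by MATLAB's ```datenum``` and Python's ```date.toordinal```.

```python
from datetime import date

# Offset in days between MATLAB serial date numbers and Python ordinals.
MATLAB_OFFSET = 366


def to_datenum(iso_string):
    """Whole-day MATLAB-style serial date number for a 'YYYY-MM-DD' string."""
    return date.fromisoformat(iso_string).toordinal() + MATLAB_OFFSET


def from_datenum(serial):
    """Calendar date for a MATLAB-style serial date number (time of day dropped)."""
    return date.fromordinal(int(serial) - MATLAB_OFFSET)


first = to_datenum("1940-09-30")
print(first, from_datenum(first))
```

Because the serial numbers are plain integers, differences between them are day counts, which is what makes date arithmetic convenient.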

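% MATLAB
% Say we need the average flow for each calendar month; indexing can do that.
% Convert to datetime first: month() on a datetime is part of base MATLAB.
date = datetime( date , 'ConvertFrom', 'datenum' );
months = month(date);

Total = zeros(12, 1);
Average = zeros(12, 1);
for ii = 1:12
    indexes = find(months == ii);      % every entry in month ii, across all years
    Total(ii) = sum( flow(indexes) );  % NaN if any flow in that month is missing
    Average(ii) = Total(ii) / length(indexes);
end

save ('save_data.mat', 'date', 'flow' );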
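
The same grouping idea can be sketched in Python with the standard library. This is only an illustration under stated assumptions: the function name ```monthly_average``` is hypothetical, the sample values are made up rather than taken from the Prado Dam record, and missing flows (NaN) are skipped instead of propagating into the average.

```python
from collections import defaultdict
from datetime import date

# Illustrative values only, not taken from the Prado Dam record.
dates = [date(1940, 10, 1), date(1940, 10, 2), date(1940, 11, 1), date(1941, 10, 1)]
flows = [47.0, 49.0, 60.0, 51.0]


def monthly_average(dates, flows):
    """Average flow per calendar month (1-12), pooling all years together."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for d, q in zip(dates, flows):
        if q != q:  # NaN is the only value not equal to itself; skip it
            continue
        totals[d.month] += q
        counts[d.month] += 1
    return {m: totals[m] / counts[m] for m in sorted(counts)}


averages = monthly_average(dates, flows)
print(averages)  # {10: 49.0, 11: 60.0}
```

Months with no valid data are simply absent from the result, rather than appearing as a division by zero.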

Conversely, if we do not have a recent version of MATLAB, or want to see roughly what ```readtable``` is doing for us, the code below is the equivalent of the above. Be careful running it: it will clobber the variables from the cells above!

In [None]:
% MATLAB - Not necessary to run
raw_data = importdata(filename);   % assumed here to give one cell per line of text
[nr, nc] = size(raw_data);
date = NaN(nr,1);                  % header rows are never filled in and stay NaN
flow = NaN(nr,1);
% Scan the rows for a string, string, string, float, and a string
for ii = 1:nr
    row = textscan( raw_data{ii}, '%s%s%s%f%s');
    agency = row{1};
    % If the first string says USGS, we know we are in the data!
    if strcmp(agency, 'USGS')
        date(ii) = datenum(row{3});
        flow(ii) = row{4};
    end
end

% Pull the date and flow columns back out of the table and plot them
date =  f{:,3} ;
flow =  f{:,4};
plot(date, flow)
xtickformat('dd-MMM-yyyy');   % date format for the x-axis ticks (needs datetime values)


If you scrolled through the data, you noticed there were some missing values, dates, etc. We need to remove those from our data set. How can we find them efficiently?

In [None]:
% MATLAB
% Clean NaNs
inan = find(isnan(flow));
flow(inan) = [];           % This effectively removes the entry
date(inan) = [];           % Do not forget the dates as well

% The datetime counterpart of isnan is isnat ("is Not-a-Time"),
% which flags missing entries in a datetime array.

% If we did everything correctly, no missing dates remain and this is empty
inad = find(isnat(date))
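
For the Python environment, the cleaning step can be sketched as a single pass that keeps the dates and flows aligned. The function name ```drop_missing``` and the sample values are illustrative assumptions; here ```None``` stands in for a missing date and NaN for a missing flow.

```python
import math
from datetime import date

# Illustrative values only; None marks a missing date, NaN a missing flow.
dates = [date(1940, 9, 30), date(1940, 10, 1), None, date(1940, 10, 3)]
flows = [51.0, float("nan"), 47.0, 47.0]


def drop_missing(dates, flows):
    """Keep only the rows where both the date and the flow are present."""
    kept = [
        (d, q)
        for d, q in zip(dates, flows)
        if d is not None and not math.isnan(q)
    ]
    clean_dates = [d for d, _ in kept]
    clean_flows = [q for _, q in kept]
    return clean_dates, clean_flows


clean_dates, clean_flows = drop_missing(dates, flows)
print(len(clean_dates), clean_flows)  # 2 [51.0, 47.0]
```

Filtering both lists in one pass avoids the easy mistake of removing a flow value while leaving its date behind.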