# Data Ingestion

Machine learning tasks are typically data heavy, requiring labelled data for supervised learning or unlabelled data for unsupervised learning. In Python, data is typically stored in memory as NumPy arrays at some level, but in most cases you can use higher-level containers built on top of NumPy that are more convenient for tabular data ([Pandas](http://pandas.pydata.org)), multidimensional gridded data ([xarray](http://xarray.pydata.org)), or out-of-core and distributed data ([Dask](http://dask.pydata.org)).  

Each of these libraries allows reading local data in a variety of formats. In many cases the required datasets are large and stored on remote servers, so we will show how to use the [Intake](https://intake.readthedocs.io) library to fetch remote datasets efficiently, including built-in caching to avoid unnecessary downloads when the files are already available locally.

To ensure that you understand the properties of your data and how it gets transformed at each step in the workflow, we will use exploratory visualization tools as soon as the data is available and at every subsequent step.

Once you have loaded your data, you will typically need to reshape it appropriately before it can be fed into a machine learning pipeline. Those steps will be detailed in the next tutorial: [Alignment and Preprocessing](03_Alignment_and_Preprocessing.ipynb).  

## Inline loading

We'll start with the simple case of loading small local datasets, such as a .csv file for Pandas:

In [1]:
import pandas as pd

training_df = pd.read_csv('../data/landsat5_training.csv')

We can inspect the first several rows of the dataframe using ``.head()``, or a random set of rows using ``.sample(n)``:

In [2]:
training_df.head()

Unnamed: 0,image,type,easting,northing,red,green,blue,nir,ndvi,bn,bnn
0,LT05_L1TP_042033_19881022_20161001_01_T1,water,348586.0,4286269.0,182,351,319,130,-0.166667,2.453846,-0.420935
1,LT05_L1TP_042033_19881022_20161001_01_T1,water,338690.0,4323890.0,620,656,527,433,-0.177588,1.21709,-0.097917
2,LT05_L1TP_042033_19881022_20161001_01_T1,veg,345930.0,4360830.0,358,506,272,5411,0.875888,0.050268,0.904276
3,LT05_L1TP_042033_19881022_20161001_01_T1,veg,344490.0,4363590.0,343,639,374,5826,0.888799,0.064195,0.879355
4,LT05_L1TP_042033_19881022_20161001_01_T1,veg,346410.0,4360620.0,360,611,325,5405,0.875108,0.06013,0.886562


To get a better sense of how this dataframe is set up, we can look at ``.info()``:

In [3]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 11 columns):
image       22 non-null object
type        22 non-null object
easting     22 non-null float64
northing    22 non-null float64
red         22 non-null int64
green       22 non-null int64
blue        22 non-null int64
nir         22 non-null int64
ndvi        22 non-null float64
bn          22 non-null float64
bnn         22 non-null float64
dtypes: float64(5), int64(4), object(2)
memory usage: 2.0+ KB


To use methods like `pd.read_csv`, all of the data needs to be on the local filesystem (or at one of the limited set of remote locations supported by Pandas, such as S3). We could of course add commands here to fetch a file explicitly from a remote server, but the notebook would then quickly become complex and unreadable.
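
To give a sense of what such explicit fetching involves, here is a minimal sketch using only the standard library. The helper name `fetch_if_missing` and the file names are hypothetical, and the demo uses a `file://` URL as a stand-in for a remote server so that it runs without network access:

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_if_missing(url, cache_dir):
    """Download `url` into `cache_dir` unless a copy is already there.

    Hypothetical helper for illustration; returns (local_path, downloaded).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    local_path = cache_dir / url.rsplit("/", 1)[-1]
    if local_path.exists():
        return local_path, False  # cache hit: nothing to download
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return local_path, True


# Demo with a file:// URL standing in for a remote server.
with tempfile.TemporaryDirectory() as tmp:
    remote = Path(tmp) / "remote" / "training.csv"
    remote.parent.mkdir()
    remote.write_text("type,red\nwater,182\nveg,358\n")

    cache = Path(tmp) / "cache"
    first_path, first_downloaded = fetch_if_missing(remote.as_uri(), cache)
    second_path, second_downloaded = fetch_if_missing(remote.as_uri(), cache)
    cached_text = first_path.read_text()
```

Even this small version ignores partial downloads, stale copies, and authentication, which is part of why a dedicated tool is attractive.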

Instead, for larger datasets, we can automate those steps using Intake so that remote and local data can be treated similarly.

In [4]:
import intake

training = intake.open_csv('../data/landsat5_training.csv')

To get better insight into the data without loading it all in just yet, we can inspect it lazily using ``.to_dask()``:

In [5]:
training_dd = training.to_dask()
training_dd.head()

Unnamed: 0,image,type,easting,northing,red,green,blue,nir,ndvi,bn,bnn
0,LT05_L1TP_042033_19881022_20161001_01_T1,water,348586.0,4286269.0,182,351,319,130,-0.166667,2.453846,-0.420935
1,LT05_L1TP_042033_19881022_20161001_01_T1,water,338690.0,4323890.0,620,656,527,433,-0.177588,1.21709,-0.097917
2,LT05_L1TP_042033_19881022_20161001_01_T1,veg,345930.0,4360830.0,358,506,272,5411,0.875888,0.050268,0.904276
3,LT05_L1TP_042033_19881022_20161001_01_T1,veg,344490.0,4363590.0,343,639,374,5826,0.888799,0.064195,0.879355
4,LT05_L1TP_042033_19881022_20161001_01_T1,veg,346410.0,4360620.0,360,611,325,5405,0.875108,0.06013,0.886562


In [6]:
training_dd.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 11 entries, image to bnn
dtypes: object(2), float64(5), int64(4)

To get a full pandas.DataFrame object, use ``.read()`` to load in all the data.

In [7]:
training_df = training.read()
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 11 columns):
image       22 non-null object
type        22 non-null object
easting     22 non-null float64
northing    22 non-null float64
red         22 non-null int64
green       22 non-null int64
blue        22 non-null int64
nir         22 non-null int64
ndvi        22 non-null float64
bn          22 non-null float64
bnn         22 non-null float64
dtypes: float64(5), int64(4), object(2)
memory usage: 2.0+ KB


**NOTE:** The two info views differ because they reflect what is knowable before and after we read all the data. For instance, it is not possible to know the ``shape`` of the whole dataset before it is loaded.
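
The same distinction can be illustrated with only the standard library. This is an analogy rather than a description of how Dask works internally, and the tiny in-memory CSV is made up: the column names are available after reading one line, while the row count requires a full pass.

```python
import csv
import io

# Hypothetical stand-in for a CSV file on disk.
csv_text = "type,red,nir\nwater,182,130\nveg,358,5411\nveg,343,5826\n"

# Lazy view: read only the header line. Column names are known, row count is not.
with io.StringIO(csv_text) as f:
    columns = next(csv.reader(f))

# Eager view: a full pass over the file is needed to learn the number of rows.
with io.StringIO(csv_text) as f:
    rows = list(csv.DictReader(f))

n_rows = len(rows)
print(columns)
print(n_rows)
```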

## Loading multiple files

In addition to allowing partitioned reading of files, Intake lets the user load and concatenate data across multiple files in one command:

In [8]:
training = intake.open_csv(['../data/landsat5_training.csv', '../data/landsat8_training.csv'])

In [9]:
training_df = training.read()
training_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 27
Data columns (total 11 columns):
image       50 non-null object
type        50 non-null object
easting     50 non-null float64
northing    50 non-null float64
red         50 non-null int64
green       50 non-null int64
blue        50 non-null int64
nir         50 non-null int64
ndvi        50 non-null float64
bn          50 non-null float64
bnn         50 non-null float64
dtypes: float64(5), int64(4), object(2)
memory usage: 4.7+ KB


**NOTE:** The length of the dataframe has increased now that we are loading multiple sets of training data.

This can be more simply expressed as:

In [10]:
training = intake.open_csv('../data/landsat*_training.csv')
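
For comparison, a rough plain-Pandas counterpart of the multi-file load might look like the sketch below. The two small files are made up for the demo, and this is not a claim about how Intake concatenates internally. It also shows why an index such as `0 to 27` can appear on a 50-row frame: each file keeps its own row labels unless `ignore_index=True` is passed.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Two tiny made-up files standing in for the Landsat training CSVs.
    pd.DataFrame({"type": ["water", "veg"], "red": [182, 358]}).to_csv(
        os.path.join(tmp, "landsat5_training.csv"), index=False)
    pd.DataFrame({"type": ["veg", "water", "veg"], "red": [343, 620, 360]}).to_csv(
        os.path.join(tmp, "landsat8_training.csv"), index=False)

    # Expand the glob pattern and read every matching file.
    paths = sorted(glob.glob(os.path.join(tmp, "landsat*_training.csv")))
    frames = [pd.read_csv(p) for p in paths]

# Each file keeps its own 0-based index, so labels repeat after concatenation.
combined = pd.concat(frames)
# ignore_index=True renumbers the rows 0..n-1 instead.
renumbered = pd.concat(frames, ignore_index=True)

print(list(combined.index))
print(list(renumbered.index))
```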

Sometimes there is data encoded in a file name or path, and concatenating the files causes that context to be lost. In this example, we lose the information about which version of Landsat each set of training data came from. To keep track of that information, we can use a Python format string to specify our path and declare a new field on our data. That field will be populated based on its value in the path.

In [11]:
training = intake.open_csv('../data/landsat{version:d}_training.csv')
training_df = training.read()
training_df.head()

Unnamed: 0,image,type,easting,northing,red,green,blue,nir,ndvi,bn,bnn,version
0,LT05_L1TP_042033_19881022_20161001_01_T1,water,348586.0,4286269.0,182,351,319,130,-0.166667,2.453846,-0.420935,5
1,LT05_L1TP_042033_19881022_20161001_01_T1,water,338690.0,4323890.0,620,656,527,433,-0.177588,1.21709,-0.097917,5
2,LT05_L1TP_042033_19881022_20161001_01_T1,veg,345930.0,4360830.0,358,506,272,5411,0.875888,0.050268,0.904276,5
3,LT05_L1TP_042033_19881022_20161001_01_T1,veg,344490.0,4363590.0,343,639,374,5826,0.888799,0.064195,0.879355,5
4,LT05_L1TP_042033_19881022_20161001_01_T1,veg,346410.0,4360620.0,360,611,325,5405,0.875108,0.06013,0.886562,5
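
To make the idea of parsing a field out of a path concrete, here is a minimal standard-library sketch. It is not Intake's implementation: the helpers `pattern_to_regex` and `parse_fields` are hypothetical, and only plain fields and the `:d` (integer) format spec are handled.

```python
import re
import string


def pattern_to_regex(pattern):
    """Turn a format string such as 'landsat{version:d}_training.csv' into a regex.

    Hypothetical helper: only plain fields and the ':d' (integer) spec are handled.
    """
    parts = []
    for literal, field, spec, _conversion in string.Formatter().parse(pattern):
        parts.append(re.escape(literal))
        if field:
            # ':d' fields match digits; any other field matches lazily.
            body = r"\d+" if spec == "d" else r".+?"
            parts.append(f"(?P<{field}>{body})")
    return re.compile("".join(parts) + r"\Z")


def parse_fields(pattern, filename):
    """Return the field values encoded in `filename`, or None if it does not match."""
    match = pattern_to_regex(pattern).match(filename)
    if match is None:
        return None
    specs = {f: s for _, f, s, _ in string.Formatter().parse(pattern) if f}
    return {name: int(value) if specs[name] == "d" else value
            for name, value in match.groupdict().items()}


pattern = "landsat{version:d}_training.csv"
fields_l5 = parse_fields(pattern, "landsat5_training.csv")
fields_l8 = parse_fields(pattern, "landsat8_training.csv")
no_match = parse_fields(pattern, "fluxnet_daily.csv")
print(fields_l5, fields_l8, no_match)
```

Each parsed value can then be attached to the rows read from the corresponding file, which is the role the `version` column plays in the output above.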


In [12]:
# Exercise: Try looking at the tail of the data using training_df.tail(), or a random sample using training_df.sample(5)

## Using Catalogs

For more complicated setups, we use the file `catalog.yml` to declare how the data should be loaded. The catalog also defines some metadata and specifies any patterns in the file path that should be included in the data. Here is an example of a catalog entry:

```
sources:
  l5:
    description: Images contain Landsat 5 Surface Reflectance Level-2 Science Product.
    driver: rasterio
    cache:
      - argkey: urlpath
        regex: 'earth-data/landsat'
        type: file
    args:
      urlpath: 's3://earth-data/landsat/LT05_L1TP_042033_19881022_20161001_01_T1_sr_band{band:d}.tif'
      chunks:
        band: 1
        x: 256
        y: 256
      concat_dim: band
      storage_options: {'anon': True}
```

The ``urlpath`` can be a path to a file, a list of files, or a path with glob notation. Alternatively, the path can be written as a Python-style [format string](https://docs.python.org/3.6/library/string.html#format-string-syntax). When the ``urlpath`` is a format string, the fields specified in that string will be parsed from the filenames and returned in the data.

In [13]:
cat = intake.open_catalog('../catalog.yml')
list(cat)

['l5',
 'l8',
 'google_landsat_band',
 'amazon_landsat_band',
 'fluxnet_daily',
 'fluxnet_metadata']

In [14]:
# Exercise: Read the description of the l5 data source using cat.l5.description

**NOTE:** If you don't have the data cached yet, then the next cell will take several minutes.

In [15]:
l5 = cat.l5
l5.to_dask()

<xarray.DataArray (band: 6, y: 7241, x: 7961)>
dask.array<shape=(6, 7241, 7961), dtype=int16, chunksize=(1, 256, 256)>
Coordinates:
  * y        (y) float64 4.414e+06 4.414e+06 4.414e+06 ... 4.197e+06 4.197e+06
  * x        (x) float64 2.424e+05 2.424e+05 2.425e+05 ... 4.812e+05 4.812e+05
  * band     (band) int64 1 2 3 4 5 7
Attributes:
    transform:   (30.0, 0.0, 242385.0, 0.0, -30.0, 4414215.0)
    crs:         +init=epsg:32611
    res:         (30.0, 30.0)
    is_tiled:    0
    nodatavals:  (-9999.0,)
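
The `chunksize=(1, 256, 256)` in this output corresponds to the `chunks` entry in the catalog. As a back-of-the-envelope sketch, using only the shape and dtype printed above, we can work out how many chunks this produces and how large each one is:

```python
import math

shape = (6, 7241, 7961)      # (band, y, x), from the output above
chunksize = (1, 256, 256)    # from the `chunks` entry in the catalog
itemsize = 2                 # bytes per int16 value

# Number of chunks along each dimension (edge chunks may be smaller).
chunks_per_dim = tuple(math.ceil(n / c) for n, c in zip(shape, chunksize))
total_chunks = math.prod(chunks_per_dim)

# Size of one full chunk and of the whole array, in bytes.
chunk_bytes = math.prod(chunksize) * itemsize
array_bytes = math.prod(shape) * itemsize

print(chunks_per_dim, total_chunks)   # (6, 29, 32) 5568
print(chunk_bytes, array_bytes)       # 131072 691747212
```

So the roughly 692 MB array is split into 5568 pieces of at most 128 KiB each, which can be read and processed independently.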

The data values have not been loaded yet, but we already have access to the coordinates and metadata. Next we will read in the data:

In [16]:
l5_da = l5.read_chunked()

## Visualizing the data

To get a quick sense of the data, we can plot it using [hvPlot](https://hvplot.pyviz.org/), which provides interactive plotting commands for Intake, Pandas, xarray, Dask, and GeoPandas. We'll look more closely at hvPlot and its options in later tutorials.

In [17]:
import hvplot.intake
intake.output_notebook()

We can quickly generate a plot of each of the Landsat bands:

In [18]:
l5.hvplot(kind='image', x='x', y='y', groupby='band', datashade=True, width=400)

This same plot can be declared in the catalog for ease of use and to point users to helpful ways to visualize data. Here is the relevant part of `catalog.yml`:

```
metadata:
  plots:
    band_image:
      kind: 'image'
      x: 'x'
      y: 'y'
      groupby: 'band'
      datashade: True
      width: 400
      dynamic: False
```

In [19]:
l5.hvplot.band_image()

We can achieve the same output using the underlying xarray DataArray itself. When using the DataArray, we can do some pre-processing such as filtering out missing values.

In [20]:
l5_da_filtered = l5_da.where(l5_da > l5_da.nodatavals[0])
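
To see what this kind of masking does to the values, here is a small plain-NumPy sketch on a made-up 2x3 band. It mirrors the idea of the `.where` call above rather than reproducing xarray's internals: pixels that fail the condition become NaN, so the result has to be floating point.

```python
import numpy as np

nodata = -9999
# Made-up 2x3 band with two missing pixels marked by the nodata value.
band = np.array([[182, nodata, 351],
                 [nodata, 620, 656]], dtype=np.int16)

# Keep values above the nodata marker; replace everything else with NaN.
# The data is cast to float first because integers cannot represent NaN.
masked = np.where(band > nodata, band.astype(float), np.nan)

n_missing = int(np.isnan(masked).sum())
valid_mean = float(np.nanmean(masked))  # mean over the valid pixels only

print(masked)
print(n_missing, valid_mean)
```

NaN-aware reductions and plotting tools can then skip the masked pixels instead of treating `-9999` as a real reflectance value.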

We can plot this filtered array to get rid of the background artifact seen above. 

In [21]:
import hvplot.xarray

In [22]:
l5_da_filtered.hvplot(kind='image', x='x', y='y', groupby='band', datashade=True, width=400, dynamic=False)

Since `xarray` objects store raster information in a predictable way, we can even project the image onto latitude/longitude coordinates using the `geo` keyword argument.

In [23]:
# Exercise: Add geo=True to the cell above and rerun it. Notice how the axes are now in lat, lon.

## Accessing the data

Machine learning libraries such as scikit-learn accept NumPy arrays as input. These arrays are accessible on DataArray objects through the `values` attribute.

In [24]:
type(l5_da_filtered.values)

numpy.ndarray

We can use tab completion to explore the attributes and methods available on our xarray.DataArray object, and to see what other information is stored on it.

In [25]:
# Exercise: Try typing l5_da_filtered. and press [tab] - don't forget the trailing dot!

In [26]:
# Exercise: Look at the data for just one band using l5_da_filtered.sel(band=n).
# Challenge: Try to plot that band (hint: don't use groupby)

### Next:

Now that you have loaded your data, you will typically need to reshape it appropriately before it can be fed into a machine-learning pipeline. These steps are detailed in the next tutorial: [Alignment and Preprocessing](03_Alignment_and_Preprocessing.ipynb).