# Data collection

In this notebook we will collect the raw input data from the Sentinel-2 satellites and the target labels from the CORINE land cover dataset.

## Sentinel 2

The Sentinel-2 data is available [here](https://scihub.copernicus.eu/). This program provides free multispectral satellite imagery of the whole planet with a revisit time of 3 to 5 days.

The resolution depends on the band you are interested in, the highest being 10m and the lowest being 60m.
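
To make the band/resolution relationship concrete, here is a small lookup sketch. The band names follow the usual Sentinel-2 MSI naming; the helper `bands_at` is a hypothetical name introduced only for this illustration. Note that band `B10` is generally not shipped in the atmospherically corrected (Level-2A) products.

```python
# Native ground resolution (in metres) of each Sentinel-2 MSI band.
BAND_RESOLUTION_M = {
    "B01": 60, "B02": 10, "B03": 10, "B04": 10,
    "B05": 20, "B06": 20, "B07": 20, "B08": 10,
    "B8A": 20, "B09": 60, "B10": 60, "B11": 20, "B12": 20,
}


def bands_at(resolution_m):
    """Return the sorted names of the bands with the given native resolution."""
    return sorted(
        band for band, res in BAND_RESOLUTION_M.items() if res == resolution_m
    )


# The 10m bands are blue, green, red and near infrared.
print(bands_at(10))
```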

In this notebook, we will use the [`sentinelsat`](https://sentinelsat.readthedocs.io/en/stable/) library to select and download our data. You will need a free [account](https://scihub.copernicus.eu/userguide/SelfRegistration) for that.

## CORINE

CORINE is also a European program. It provides a manual classification of the land cover in Europe. We will use the 2018 update here, as it is the latest one and we have Sentinel data for that year.

The data used is available [here](https://land.copernicus.eu/pan-european/corine-land-cover/clc2018) by going to the download tab and selecting the GeoTIFF. (You'll need a free account for that).

The classification contains 44 classes that you can find [here](https://land.copernicus.eu/eagle/files/eagle-related-projects/pt_clc-conversion-to-fao-lccs3_dec2010). Classes are grouped into 5 main categories:
* Artificial surfaces
* Agricultural areas
* Forest and semi-natural areas
* Wetlands
* Water bodies

In the following notebooks, we will be interested in all kinds of forests: classes `311`, `312` and `313`.
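
As a rough illustration of how these three-digit codes are organised, the sketch below reads the main category from the first digit and flags the three forest classes. The helper names `level1_category` and `is_forest` are hypothetical, and the sketch assumes you are working with the three-digit CLC codes themselves; the pixel values stored in the GeoTIFF may be encoded differently, so check the legend shipped with the download.

```python
# Main (level 1) CORINE categories, keyed by the first digit of the class code.
LEVEL1 = {
    1: "Artificial surfaces",
    2: "Agricultural areas",
    3: "Forest and semi-natural areas",
    4: "Wetlands",
    5: "Water bodies",
}

# The three forest classes used as targets in the following notebooks.
FOREST_CLASSES = {
    311: "Broad-leaved forest",
    312: "Coniferous forest",
    313: "Mixed forest",
}


def level1_category(clc_code):
    """Return the main category name for a three-digit CLC code."""
    return LEVEL1[clc_code // 100]


def is_forest(clc_code):
    """Return True when the code is one of the three forest classes."""
    return clc_code in FOREST_CLASSES


# A few example codes, turned into binary forest / non-forest labels.
codes = [112, 211, 311, 312, 313, 324, 512]
forest_flags = [is_forest(code) for code in codes]
print(list(zip(codes, forest_flags)))
```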

In the following, we assume the data is stored in the `data/` folder. We also need to have a geometry available. If you do not like the default one, you can create one on [geojson.io](http://geojson.io/). This geometry will be used to search for tiles to download.

Here we use an approximate version of Normandy's geometry, as I like this region :p
You do not need a super accurate geometry at this step, as it is only used to query input data from the Sentinel satellites.
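
If you would rather build a geometry by hand than draw it, a bounding box is usually enough for this query. The sketch below builds a GeoJSON `FeatureCollection` from four corner values using only the standard library. The function name `bbox_to_feature_collection` is hypothetical, and the coordinates are a rough, illustrative box around Normandy, not the geometry used in this notebook.

```python
import json


def bbox_to_feature_collection(min_lon, min_lat, max_lon, max_lat):
    """Build a GeoJSON FeatureCollection holding one rectangular polygon."""
    # GeoJSON uses (longitude, latitude) order and a ring that ends
    # on the same point it starts from.
    ring = [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],
    ]
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {},
                "geometry": {"type": "Polygon", "coordinates": [ring]},
            }
        ],
    }


# Illustrative values only: a loose box around Normandy.
region = bbox_to_feature_collection(-2.0, 48.2, 1.8, 50.1)
geojson_text = json.dumps(region, indent=2)
print(geojson_text)
```

Saving `geojson_text` to a `.geojson` file in the data folder should give a file that can be used in place of the default geometry.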

In [1]:
from sentinelsat import SentinelAPI, read_geojson, geojson_to_wkt
import zipfile
import os

# First we connect to the Copernicus public API
api = SentinelAPI(os.environ['DHUS_USER'], os.environ['DHUS_PASSWORD'])

# Then we load the geometry to query the data
footprint = geojson_to_wkt(read_geojson('../data/region.geojson'))

# We query Copernicus for 2020 as 2018 is archived and harder to retrieve.
# We focus on June because images from this month are less likely to be cloudy.
products = api.query(footprint,
                     date=('20200601', '20200630'),
                     producttype='S2MSI2A', # This is the atmospherically corrected version of S2
                     cloudcoverpercentage=(0, 5))
                     
print(f'{len(products)} results found')

31 results found
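
Before downloading, it can be useful to see which tiles the query returned. The sketch below assumes that the query result behaves like a dictionary mapping product ids to metadata dictionaries, and that each of them has `title` and `cloudcoverpercentage` entries; treat those key names as assumptions to check against your own result. The helper `summarize_products` is hypothetical and the sample data is made up so that the snippet runs without a connection.

```python
def summarize_products(products):
    """Return (title, cloud cover %) pairs, least cloudy product first."""
    rows = [
        (meta["title"], float(meta["cloudcoverpercentage"]))
        for meta in products.values()
    ]
    return sorted(rows, key=lambda row: row[1])


# Made-up stand-in for a query result, for illustration only.
sample_products = {
    "id-a": {"title": "TILE_A", "cloudcoverpercentage": 3.2},
    "id-b": {"title": "TILE_B", "cloudcoverpercentage": 0.4},
    "id-c": {"title": "TILE_C", "cloudcoverpercentage": 1.7},
}

for title, cloud_cover in summarize_products(sample_products):
    print(f"{title}: {cloud_cover:.1f}% clouds")
```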


If you kept the default parameters you should see 31 tiles covering Normandy with less than 5% cloud coverage in June 2020. Let us now download these to the data folder. (Might take a while, so obviously try to run it only once).

We also unzip the downloaded archives and delete them afterwards for convenience.

In [None]:
raw_data_path = '../data/s2a/'
api.download_all(products, raw_data_path)

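for archive in os.listdir(raw_data_path):
    # Skip anything that is not a zip archive (e.g. already extracted folders)
    if not archive.endswith('.zip'):
        continue
    path = os.path.join(raw_data_path, archive)
    with zipfile.ZipFile(path, 'r') as zip_ref:
        zip_ref.extractall(raw_data_path)
    # Delete the archive only once it has been closed
    os.remove(path)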

The first part of this tutorial is now over. You can move on to notebook 2, which covers pre-processing.