# Usage
We start by exploring the data-processing pipeline part of `DAMAST`.
We consider a synthetic dataset of Automatic Identification System (AIS) messages.
The data is generated for 1000 boats, where each trajectory contains at least 25 and at most 300 messages.

In [10]:
import damast.domains.maritime.ais.data_generator as generator

data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)

The data is stored in a [vaex.DataFrame](https://vaex.io/), and we can inspect the first and last 5 messages in the dataset.

In [11]:
print(data.dataframe)

#        mmsi       lon                  lat                  date_time_utc        sog                  cog                  true_heading         nav_status    rot    message_nr    source
0        621532812  169.01944156308522   -50.731451872711695  1985-06-03 11:13:00  4.1499236136697135   -2.354470944458527   -2.2876499646786725  1             0.0    2             g
1        199844590  -83.18571761617619   -62.45988284001118   2013-05-20 09:12:49  -8.204280701719151   -4.2991047127428805  -4.254451021920967   1             0.0    2             g
2        548223900  108.05711108014981   2.064646638815308    1994-09-08 15:24:10  7.561444636735771    -2.5578428137523153  -2.5519120877859653  1             0.0    2             g
3        352947096  21.591849741843838   -52.45075452078838   1988-10-04 07:25:39  -3.1909910434431     -4.430585703265792   -4.420046661627201   1             0.0    2             s
4        746644671  -78.35262475651618   -5.019945790377366   1976-10-09 22:53:5

The dataset consists of 11 columns, which we will go through in detail.

## Data-specification
The Maritime Mobile Service Identity (MMSI) is used to identify a ship. It *should* be a 9-digit number whose first digit is between 2 and 7.
The data we have generated should contain some invalid numbers. Let us inspect these.

In [12]:
from damast.domains.maritime.data_specification import MMSI

df = data.dataframe
# Select the rows whose MMSI lies outside [MMSI.min_value, MMSI.max_value]
invalid_mmsis = df[(df["mmsi"] < MMSI.min_value) | (df["mmsi"] > MMSI.max_value)]
invalid_mmsis

#,mmsi,lon,lat,date_time_utc,sog,cog,true_heading,nav_status,rot,message_nr,source
0,199844590,-83.18571761617619,-62.45988284001118,2013-05-20 09:12:49,-8.204280701719151,-4.2991047127428805,-4.254451021920967,1,0.0,2,g
1,837189107,-17.057999085781166,73.17942504645124,1999-02-14 04:19:40,4.161191531695675,-2.1446024143296953,-2.141166182643515,1,0.0,2,s
2,808438589,6.541356033529531,-50.962478312716335,2003-11-10 14:21:49,-0.21343675077172186,-1.3832250016226015,-1.2880294451172154,1,0.0,2,s
3,801576021,-64.04333586597684,74.6913387398394,2008-01-10 14:20:27,-5.7904595085769675,3.15275458554951,3.23937413376734,1,0.0,2,g
4,820183028,47.79091861635569,70.66473159505955,2005-04-28 02:14:46,-9.831573137181007,5.7983732973460285,5.8321671902263255,0,0.0,2,s
...,...,...,...,...,...,...,...,...,...,...,...
12893,832643648,11.99445634184816,19.426419931381997,2004-10-04 01:05:41,11.403018588983691,4.286657792814315,4.373453929778491,1,0.0,2,g
12894,832643648,12.057301018030678,19.37045107258576,2004-10-04 00:36:25,9.150906448560725,4.137031361195816,4.2285641150594415,1,0.0,2,g
12895,192628228,79.91846583193717,32.82099865898587,1980-01-10 10:30:57,-16.879659916693292,1.6403416676838498,1.7146719250402822,7,0.0,2,s
12896,194539051,140.18493668457407,-55.06869714944428,1975-09-29 23:05:12,-0.4189652880784167,0.9340582884391302,1.0261002223031315,0,0.0,2,s
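
The range check used in the cell above can also be sketched without `damast` or `vaex`. The bounds below are derived from the rule stated in the text (nine digits, first digit between 2 and 7); that they equal `MMSI.min_value` and `MMSI.max_value` is an assumption, and `is_valid_mmsi` is a hypothetical helper, not part of `damast`.

```python
# Bounds derived from the rule in the text: nine digits, first digit between 2 and 7.
# Assumption: MMSI.min_value and MMSI.max_value in damast correspond to these values.
MMSI_MIN = 200_000_000
MMSI_MAX = 799_999_999


def is_valid_mmsi(mmsi: int) -> bool:
    """Return True if mmsi is a nine-digit number starting with a digit from 2 to 7."""
    return MMSI_MIN <= mmsi <= MMSI_MAX


# MMSI values taken from the outputs shown in this section.
samples = [621532812, 199844590, 837189107, 548223900, 192628228]
invalid = [m for m in samples if not is_valid_mmsi(m)]
print(invalid)  # [199844590, 837189107, 192628228]
```

The three values flagged here also appear in the filtered output above, which is consistent with the assumed bounds.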


Before sending this data to a machine learning algorithm, one would have to filter out invalid data.
We can do this by creating a `damast.core.DataSpecification` describing what valid output we would like in our data-frame.

In [13]:
from damast.core import DataSpecification, MinMax
mmsi_spec = DataSpecification(name="mmsi", description="Maritime Mobile Service Identity", representation_type=int,
                              value_range=MinMax(MMSI.min_value, MMSI.max_value))

Here we have specified what the column contains, how its data is represented in Python, and its minimum and maximum allowed values.
Next, we create a `damast.core.MetaData` object that we can apply to the dataframe.

In [14]:
from damast.core import MetaData, ValidationMode
metadata = MetaData([mmsi_spec])
metadata.apply(df, ValidationMode.UPDATE_DATA)



Of course, we do not want to do this manually for each column. Instead, we create a `DataSpecification` per column and let the `damast.core.AnnotatedDataFrame` handle the validation of the data. There are three ways of handling input data against its metadata:
- `ValidationMode.READONLY`: Reads in the data, checks it against the meta-data and throws an error if the data does not adhere to the data-specification.
- `ValidationMode.UPDATE_METADATA`: Update the metadata based on the input in the annotated data-frame. This might change the representation type, column name and valid ranges of the data.
- `ValidationMode.UPDATE_DATA`: Update data so that it adheres to the meta-data.
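
The difference between the three modes can be illustrated with a minimal, pure-Python sketch. `Mode`, `RangeSpec` and `validate` are hypothetical stand-ins and not part of `damast`. In particular, dropping out-of-range values in the `UPDATE_DATA` branch is an assumption made for illustration; `damast` may handle such values differently.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class Mode(Enum):
    """Hypothetical stand-in for damast's ValidationMode."""
    READONLY = auto()
    UPDATE_METADATA = auto()
    UPDATE_DATA = auto()


@dataclass
class RangeSpec:
    """Hypothetical stand-in for a column specification with a min/max range."""
    name: str
    min_value: int
    max_value: int


def validate(values: List[int], spec: RangeSpec, mode: Mode) -> List[int]:
    """Check values against spec and handle violations according to mode."""
    invalid = [v for v in values if not spec.min_value <= v <= spec.max_value]
    if not invalid:
        return list(values)
    if mode is Mode.READONLY:
        # The data is left untouched; any violation is an error.
        raise ValueError(f"{spec.name}: {len(invalid)} value(s) outside the valid range")
    if mode is Mode.UPDATE_METADATA:
        # The specification is widened so that all observed values become valid.
        spec.min_value = min(spec.min_value, min(values))
        spec.max_value = max(spec.max_value, max(values))
        return list(values)
    # Mode.UPDATE_DATA: the data is changed to fit the specification.
    # Assumption: offending values are simply dropped in this sketch.
    return [v for v in values if spec.min_value <= v <= spec.max_value]


mmsi_values = [621532812, 199844590, 837189107]

spec = RangeSpec("mmsi", 200_000_000, 799_999_999)
print(validate(mmsi_values, spec, Mode.UPDATE_DATA))  # [621532812]

validate(mmsi_values, spec, Mode.UPDATE_METADATA)
print(spec.min_value, spec.max_value)  # 199844590 837189107
```

With `Mode.READONLY` the same input raises a `ValueError`, because two of the three values lie outside the range.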

In [21]:
from damast.core.metadata import DataCategory
from damast.core.dataframe import AnnotatedDataFrame
dataspec = {
    "annotations": {"comment": "This is an autogenerated test data set"},
    "columns": [
        {"name": "mmsi", "is_optional": False, "category": DataCategory.STATIC,
         "value_range":{"MinMax": {"min": MMSI.min_value, "max": MMSI.max_value}}},
        {"name": "lon", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "lat", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "date_time_utc", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "sog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "cog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "true_heading", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "nav_status", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "rot", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "message_nr", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "source", "is_optional": False, "category": DataCategory.DYNAMIC},
    ]
}
metadata = MetaData.from_dict(dataspec)
data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)
adf = AnnotatedDataFrame(data.dataframe, metadata, validation_mode=ValidationMode.UPDATE_DATA)
adf



#,mmsi,lon,lat,date_time_utc,sog,cog,true_heading,nav_status,rot,message_nr,source
0,314098953,95.48376234338699,-13.183112816482993,1985-06-30 05:54:09,2.2456943765993764,-2.4784838425669555,-2.4144838666540136,0,0.0,2,g
1,652693917,163.36414039335648,-22.71019798308159,2009-11-30 19:55:04,-17.54548950112173,-1.673696483450787,-1.6133598294089926,7,0.0,2,g
2,302955746,-165.42009688588345,49.66171311820449,1993-12-12 13:19:08,16.218648300391166,-0.7903644947335857,-0.7383507416449882,0,0.0,2,g
3,450038448,69.44124964216434,-40.735481407406844,1990-12-25 08:13:47,-2.681262299291747,1.8424407120661597,1.8746845180943952,7,0.0,2,s
4,309601479,-20.000057028329937,86.07870394453047,1970-04-16 14:24:40,2.7672072779456798,1.3100704134646464,1.339808315797292,1,0.0,2,s
...,...,...,...,...,...,...,...,...,...,...,...
168133,725252480,-112.85003131101635,-10.964689355797598,1970-03-05 00:47:35,-9.089431977158117,-0.8831700683681305,-0.7931477772601211,0,0.0,2,s
168134,422842121,4.902779474933569,60.5131208024045,1976-10-11 02:20:05,2.1560387048143808,-1.1568993975767228,-1.0783496253500509,0,0.0,2,g
168135,498463225,-103.44501767274288,-83.69724238201772,1987-05-09 08:31:07,1.561557035793344,-1.969834312744223,-1.9259732350145522,1,0.0,2,s
168136,755237644,-78.37346861779069,-56.77304595750046,1988-05-12 15:02:08,1.6864113149311981,-2.4635335986334157,-2.375469776943925,0,0.0,2,g
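
Most column entries in the specification dictionary above differ only by name, so the `"columns"` list can also be built programmatically. The following is a sketch under stated assumptions: `column_entry` is a hypothetical helper, the category values are placeholder strings standing in for `DataCategory.STATIC` and `DataCategory.DYNAMIC`, and the MMSI bounds are derived from the rule in the text rather than read from `MMSI.min_value` and `MMSI.max_value`.

```python
from typing import Any, Dict, List, Optional

# Placeholder strings; the cell above uses DataCategory.STATIC and DataCategory.DYNAMIC.
STATIC = "static"
DYNAMIC = "dynamic"


def column_entry(name: str, category: str, unit: Optional[str] = None,
                 value_range: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build one entry of the "columns" list, omitting keys that are not set."""
    entry: Dict[str, Any] = {"name": name, "is_optional": False}
    if unit is not None:
        entry["unit"] = unit
    entry["category"] = category
    if value_range is not None:
        entry["value_range"] = value_range
    return entry


# Column names as they appear in the dataframe output.
dynamic_names = ["lon", "lat", "date_time_utc", "sog", "cog", "true_heading",
                 "nav_status", "rot", "message_nr", "source"]
units = {"lon": "deg", "lat": "deg"}

# Assumed bounds: nine digits with a first digit between 2 and 7.
mmsi_range = {"MinMax": {"min": 200_000_000, "max": 799_999_999}}

columns: List[Dict[str, Any]] = [column_entry("mmsi", STATIC, value_range=mmsi_range)]
columns += [column_entry(n, DYNAMIC, unit=units.get(n)) for n in dynamic_names]

dataspec_sketch = {
    "annotations": {"comment": "This is an autogenerated test data set"},
    "columns": columns,
}
print(len(dataspec_sketch["columns"]))  # 11
```

To pass such a dictionary to `MetaData.from_dict`, the placeholder strings would have to be replaced by the actual `DataCategory` members.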


## Data-processing
To repeat this process on any dataset we read in, we create a `damast.core.dataprocessing.DataProcessingPipeline`.
A pipeline consists of pipeline elements, each of which is a transformation applied to the dataset.
We start by creating a pipeline element that drops all rows missing an `"mmsi"` entry.

In [27]:
from damast.data_handling.transformers.filters import DropMissing
from damast.core.dataprocessing import DataProcessingPipeline
pipeline = DataProcessingPipeline("Remove rows with missing MMSI", "./output_dir")
# Map the transformer's generic input "x" to the "mmsi" column
pipeline.add("Drop rows with missing MMSI", DropMissing(), name_mappings={"x": "mmsi"})

transformed_adf = pipeline.transform(adf)
transformed_adf

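
The idea behind pipeline elements and `name_mappings` can be sketched in plain Python, independent of `damast`. `SketchPipeline` and `drop_missing` are hypothetical names used only for illustration; the sketch operates on a list of dictionaries instead of an annotated dataframe, and it treats `None` as the marker for a missing value, which is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[int]]
# A transformer receives the rows and the name mapping for its generic inputs.
Transformer = Callable[[List[Row], Dict[str, str]], List[Row]]


def drop_missing(rows: List[Row], name_mappings: Dict[str, str]) -> List[Row]:
    """Drop rows whose value in the column mapped to the generic input "x" is None."""
    column = name_mappings["x"]
    return [row for row in rows if row.get(column) is not None]


class SketchPipeline:
    """Hypothetical, simplified stand-in for a data-processing pipeline."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.steps: List[Tuple[str, Transformer, Dict[str, str]]] = []

    def add(self, name: str, transformer: Transformer,
            name_mappings: Dict[str, str]) -> "SketchPipeline":
        """Register a named element together with its column mapping."""
        self.steps.append((name, transformer, name_mappings))
        return self

    def transform(self, rows: List[Row]) -> List[Row]:
        """Apply every element in the order it was added."""
        for _, transformer, name_mappings in self.steps:
            rows = transformer(rows, name_mappings)
        return rows


rows = [
    {"mmsi": 621532812, "nav_status": 1},
    {"mmsi": None, "nav_status": 0},
    {"mmsi": 548223900, "nav_status": 1},
]
sketch_pipeline = SketchPipeline("Remove rows with missing MMSI")
sketch_pipeline.add("Drop missing MMSI", drop_missing, name_mappings={"x": "mmsi"})
print(sketch_pipeline.transform(rows))
```

Because the transformer only refers to the generic input `"x"`, the same element can be reused for another column by changing the mapping, for example `{"x": "nav_status"}`.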