# Data ingestion with Logstash

In [16]:
!curl -X PUT http://elasticsearch:9200/_index_template/trips -H 'Content-Type: application/json' -d ' \
{ \
  "index_patterns": ["trips"], \
  "template": { \
    "mappings": { \
      "dynamic_templates": [ \
        { \
          "strings_as_keywords": { \
            "match_mapping_type": "string", \
            "mapping": { "type": "keyword" } \
          } \
        } \
      ], \
      "properties": { \
        "EndAirportGeo": { "type": "geo_point" }, \
        "StartAirportGeo": { "type": "geo_point" }, \
        "DistanceKM": { "type": "integer" }, \
        "ActivityCostAUD": { "type": "integer" }, \
        "StartTime": { \
          "type":   "date", \
          "format": "HH:mm:ss||H:mm:ss" \
        }, \
        "EndTime": { \
          "type":   "date", \
          "format": "HH:mm:ss||H:mm:ss" \
        }, \
        "StartDate": { \
          "type":   "date", \
          "format": "dd/MM/yy" \
        }, \
        "EndDate": { \
          "type":   "date", \
          "format": "dd/MM/yy" \
        } \
      } \
    } \
  } \
}'      

{"acknowledged":true}
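
The same template body can also be built as a Python dictionary, which avoids the backslash line continuations and lets `json` serialize it before it is sent. The following is a minimal sketch, not part of the original notebook: the helper names `build_trips_template` and `put_template` are hypothetical, and the default URL is simply the one used in the curl call above. Only the standard library is used.

```python
import json
import urllib.request


def build_trips_template():
    """Return the same index template body that the curl command sends."""
    time_field = {"type": "date", "format": "HH:mm:ss||H:mm:ss"}
    date_field = {"type": "date", "format": "dd/MM/yy"}
    return {
        "index_patterns": ["trips"],
        "template": {
            "mappings": {
                # Any string field not listed below is indexed as a keyword.
                "dynamic_templates": [
                    {
                        "strings_as_keywords": {
                            "match_mapping_type": "string",
                            "mapping": {"type": "keyword"},
                        }
                    }
                ],
                "properties": {
                    "EndAirportGeo": {"type": "geo_point"},
                    "StartAirportGeo": {"type": "geo_point"},
                    "DistanceKM": {"type": "integer"},
                    "ActivityCostAUD": {"type": "integer"},
                    "StartTime": dict(time_field),
                    "EndTime": dict(time_field),
                    "StartDate": dict(date_field),
                    "EndDate": dict(date_field),
                },
            }
        },
    }


def put_template(base_url="http://elasticsearch:9200", name="trips"):
    """Send the template to Elasticsearch (requires the cluster to be reachable)."""
    request = urllib.request.Request(
        f"{base_url}/_index_template/{name}",
        data=json.dumps(build_trips_template()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Offline check: list the explicitly mapped fields without contacting the cluster.
print(sorted(build_trips_template()["template"]["mappings"]["properties"]))
```

Calling `put_template()` from a notebook cell on the same Docker network should return the same acknowledgement as the curl command.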

Let's look at the contents of the trips.csv file:

In [15]:
import pandas as pd

trips = pd.read_csv('../data/elasticsearch/tirps/trips.csv')

print(trips.dtypes)

StartAirport        object
EndAirport          object
TripID               int64
Type                object
ActivityID           int64
ActivityCostAUD    float64
AirlineCode         object
Aircraft            object
ServiceClass        object
FlightNumber         int64
StartCountry        object
StartCityName       object
StartLat           float64
StartLong          float64
StartDate           object
StartTime           object
EndCountry          object
EndCityName         object
EndLat             float64
EndLong            float64
EndDate             object
EndTime             object
Stops               object
DistanceKM           int64
dtype: object
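
pandas reads the four date and time columns as `object` (plain strings); it is the index template that tells Elasticsearch how to parse them. As a rough illustration, the Java-style patterns `dd/MM/yy` and `HH:mm:ss||H:mm:ss` correspond to the `strptime` formats `%d/%m/%y` and `%H:%M:%S`. The sample values below are hypothetical, since the actual column contents are not shown here.

```python
from datetime import datetime

# Approximate strptime equivalents of the formats declared in the index template.
DATE_FORMAT = "%d/%m/%y"   # dd/MM/yy
TIME_FORMAT = "%H:%M:%S"   # HH:mm:ss and H:mm:ss (strptime accepts one-digit hours)


def parse_trip_date(value):
    """Parse a day-first date with a two-digit year."""
    return datetime.strptime(value, DATE_FORMAT).date()


def parse_trip_time(value):
    """Parse a time of day written with one or two hour digits."""
    return datetime.strptime(value, TIME_FORMAT).time()


# Hypothetical sample values, for illustration only.
print(parse_trip_date("05/01/17"))  # 2017-01-05
print(parse_trip_time("7:45:00"))   # 07:45:00
```

Note that the date is day-first: `05/01/17` is 5 January 2017, not 1 May.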


In [13]:
print(trips.head(5)) 

  StartAirport EndAirport     TripID Type  ActivityID  ActivityCostAUD  \
0          CBR        MEL  306007947  Air  1141935494          1241.36   
1          MEL        CBR  306007947  Air  1141935494          1241.36   
2          SYD        MEL  305316367  Air  1140039658           502.00   
3          CBR        SYD  305312206  Air  1140385947          1313.16   
4          MEL        SYD  305312206  Air  1140269701           350.00   

  AirlineCode                                         Aircraft ServiceClass  \
0          QF         Boeing 737-800 (winglets) Passenger/BBJ2      Economy   
1          QF                                   Boeing 717-200      Economy   
2          VA         Boeing 737-800 (winglets) Passenger/BBJ2      Economy   
3          QF  De Havilland (Bombardier) DHC-8-300 Dash 8 / 8Q      Economy   
4          VA         Boeing 737-800 (winglets) Passenger/BBJ2      Economy   

   FlightNumber     ...     StartDate StartTime  EndCountry  EndCityName  \
0   

Logstash configuration:

```
input {
    file {
        path => "/tmp/data/*"
        # Read files that already exist from the start; the default ("end") only tails new lines.
        start_position => "beginning"
    }
}

filter {
    # Split each line into the named columns and drop the header row.
    csv {
        source => "message"
        columns => ["StartAirport","EndAirport","TripID","Type","ActivityID","ActivityCostAUD","AirlineCode","Aircraft","ServiceClass","FlightNumber","StartCountry","StartCityName","StartLat","StartLong","StartDate","StartTime","EndCountry","EndCityName","EndLat","EndLong","EndDate","EndTime","Stops","DistanceKM"]
        skip_header => true
    }

    # Build "lat,lon" strings for the geo_point fields and drop fields we do not need.
    mutate {
        add_field => {
            "StartAirportGeo" => "%{StartLat},%{StartLong}"
            "EndAirportGeo" => "%{EndLat},%{EndLong}"
        }
        remove_field => ["host", "@version", "@timestamp", "message", "StartLat", "EndLat", "StartLong", "EndLong"]
    }
}

output {
    elasticsearch {
        hosts => "elasticsearch:9200"
        index => "trips"
    }
}
```
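
To make the effect of the `csv` and `mutate` filters concrete, the sketch below applies the same steps to a single line in plain Python. It is an approximation of what the pipeline does, not Logstash itself. The first nine values of the sample line come from the `head` output above; the remaining values (flight number, cities, coordinates, dates, times, stops, distance) are hypothetical.

```python
import csv

COLUMNS = [
    "StartAirport", "EndAirport", "TripID", "Type", "ActivityID", "ActivityCostAUD",
    "AirlineCode", "Aircraft", "ServiceClass", "FlightNumber", "StartCountry",
    "StartCityName", "StartLat", "StartLong", "StartDate", "StartTime", "EndCountry",
    "EndCityName", "EndLat", "EndLong", "EndDate", "EndTime", "Stops", "DistanceKM",
]
REMOVED_FIELDS = ["StartLat", "EndLat", "StartLong", "EndLong"]

# Hypothetical sample row (values after "Economy" are invented for illustration).
SAMPLE_LINE = (
    "SYD,MEL,305316367,Air,1140039658,502.00,VA,"
    "Boeing 737-800 (winglets) Passenger/BBJ2,Economy,800,Australia,Sydney,"
    "-33.95,151.18,05/01/17,7:00:00,Australia,Melbourne,-37.67,144.84,"
    "05/01/17,8:35:00,Nonstop,706"
)


def transform(line):
    """Mimic the csv + mutate filters for one line; return None for the header."""
    values = next(csv.reader([line]))
    if values == COLUMNS:  # skip_header: drop the row that repeats the column names
        return None
    event = dict(zip(COLUMNS, values))
    # geo_point accepts a "lat,lon" string, which is what add_field builds.
    event["StartAirportGeo"] = f'{event["StartLat"]},{event["StartLong"]}'
    event["EndAirportGeo"] = f'{event["EndLat"]},{event["EndLong"]}'
    for field in REMOVED_FIELDS:
        event.pop(field, None)
    return event


event = transform(SAMPLE_LINE)
print(event["StartAirportGeo"], event["EndAirportGeo"])
```

All values stay strings at this stage; the index template is what turns them into dates, integers, and geo points when Elasticsearch indexes the document.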

In [None]:
docker run --rm -it --network=datahack-nosql_default \
    -v /Users/rgarrote/desarrollo/datahack-nosql/work/data/elasticsearch/trips/pipeline/:/usr/share/logstash/pipeline/ \
    -v /Users/rgarrote/desarrollo/datahack-nosql/work/data/elasticsearch/trips/data/:/tmp/data/ \
    docker.elastic.co/logstash/logstash:8.3.3
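
Once Logstash has processed the file, one way to check the result is to ask Elasticsearch how many documents the `trips` index holds through the `_count` API. The sketch below assumes the same `elasticsearch:9200` address used earlier and no authentication; `count_url` and `count_documents` are hypothetical helper names.

```python
import json
import urllib.request


def count_url(base_url, index):
    """Build the URL of the _count endpoint for one index."""
    return f"{base_url.rstrip('/')}/{index}/_count"


def count_documents(base_url="http://elasticsearch:9200", index="trips"):
    """Return the number of documents in the index (requires a running cluster)."""
    with urllib.request.urlopen(count_url(base_url, index)) as response:
        return json.loads(response.read())["count"]


# Offline: show the URL that would be requested.
print(count_url("http://elasticsearch:9200", "trips"))
```

If every row was ingested, `count_documents()` would be expected to equal `len(trips)` from the pandas cell above.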

## 1. Create the data view

A fundamental aspect of starting to work with a dataset on Kibana is configuring the data view for the data. A Kibana data view determines what underlying Elasticsearch indices will be addressed in a given query, dashboard, alert, or machine learning job configuration. Data views also cache some metadata for underlying Elasticsearch indices, including the field names and data types (the schema) in a given group of indices. This cached data is used in the Kibana interface when creating and working with visualizations.

For time series data, a data view can specify the name of the field containing the timestamp in a given index. This allows Kibana to narrow your queries, dashboards, and so on to the appropriate time range on the underlying indices, which keeps results fast and efficient. The universal date and time picker at the top right of the screen gives granular control over time ranges. The time picker is not available if no time field is configured for a data view.

A data view can also specify how fields are formatted and rendered in visualizations. For example, a source.bytes integer field can use the bytes format so that values are shown in human-readable units such as MB or GB.
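
Conceptually, restricting results to a time range amounts to a range filter on the time field. The sketch below builds such an Elasticsearch query body for the `StartDate` field of the `trips` index. It only illustrates the idea; it is not the exact request Kibana sends, and the date bounds are hypothetical. The helper name `time_range_query` is also an assumption.

```python
import json


def time_range_query(field, start, end, date_format="dd/MM/yy"):
    """Build a query body that keeps documents whose date field lies in [start, end]."""
    return {
        "query": {
            "range": {
                field: {"gte": start, "lte": end, "format": date_format}
            }
        }
    }


# Hypothetical bounds: all trips starting in January 2017.
print(json.dumps(time_range_query("StartDate", "01/01/17", "31/01/17"), indent=2))
```

The body could be sent to `http://elasticsearch:9200/trips/_search` to run the equivalent search directly.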

To do this, open Kibana: http://127.0.0.1:5601

Next, go to the Stack Management section:
Menu > Management > Stack Management

Stack Management is home to UIs for managing all things Elastic Stack: indices, clusters, licenses, UI settings, data views, spaces, and more.

To create our data view, open the data view management section:
Kibana > Data Views

This page lists the data views that already exist. To create our first data view, click the Create Data View button.

To create the data view, we must specify the index or indices that will be its data source. We can enter either the name of a single index or, to build the data view over several indices, a wildcard pattern that the index names must match.
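
Kibana index patterns use `*` as a wildcard. The sketch below approximates that matching with Python's `fnmatch` module to show which indices a given pattern would select. The index names are hypothetical, and `fnmatch` also understands `?` and `[...]`, which is more than this illustration needs.

```python
from fnmatch import fnmatchcase


def matching_indices(pattern, index_names):
    """Return the index names selected by a wildcard pattern."""
    return [name for name in index_names if fnmatchcase(name, pattern)]


# Hypothetical index names, for illustration only.
indices = ["trips", "trips-2016", "trips-2017", "flights"]

print(matching_indices("trips", indices))   # ['trips']
print(matching_indices("trips*", indices))  # ['trips', 'trips-2016', 'trips-2017']
```

A plain name selects exactly one index, while `trips*` would cover every index whose name starts with `trips`.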

In our case we will use the index created earlier during data ingestion, so in the Name field we enter the index name: trips.

Since the dataset we ingested represents a series of events over time, that is, a time series, we can indicate which date field holds this temporal information by selecting StartDate in the Timestamp field.

Once the form is filled in correctly, click Create Data View.

The next page shows the details of the data view we just created, including all of its fields and their properties.
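
The same data view can also be created without the UI through Kibana's data views API. The sketch below is an assumption-laden illustration: it presumes a Kibana 8.x instance at the address used above with no authentication, and the helper names `build_data_view_payload` and `create_data_view` are hypothetical. Kibana requires the `kbn-xsrf` header on this kind of request.

```python
import json
import urllib.request


def build_data_view_payload(title, time_field):
    """Build the request body: the index pattern (title) and the timestamp field."""
    return {"data_view": {"title": title, "timeFieldName": time_field}}


def create_data_view(kibana_url="http://127.0.0.1:5601",
                     title="trips", time_field="StartDate"):
    """POST the data view to Kibana (requires a running, unauthenticated Kibana)."""
    request = urllib.request.Request(
        f"{kibana_url}/api/data_views/data_view",
        data=json.dumps(build_data_view_payload(title, time_field)).encode("utf-8"),
        headers={"Content-Type": "application/json", "kbn-xsrf": "true"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Offline: show the body that would be sent.
print(json.dumps(build_data_view_payload("trips", "StartDate"), indent=2))
```

Scripting this step can be handy when the environment is recreated often, since the form does not have to be filled in again by hand.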


