
Commit b78cfe9

administration: migrate pending content
Signed-off-by: Eduardo Silva <eduardo@treasure-data.com>
1 parent 75f37fc commit b78cfe9

File tree

6 files changed, +445 -4 lines changed

administration/backpressure.md

Lines changed: 36 additions & 0 deletions
# Backpressure
In certain environments it is common to see that logs or data are being ingested faster than they can be flushed to some destinations. A common case is reading from big log files and dispatching the logs to a backend over the network, which takes some time to respond. This generates backpressure, leading to high memory consumption in the service.
In order to avoid backpressure, Fluent Bit implements a mechanism in the engine that restricts the amount of data that an input plugin can ingest; this is done through the configuration parameter **Mem\_Buf\_Limit**.
## Mem\_Buf\_Limit
This option is disabled by default and can be applied to all input plugins. Let's explain its behavior using the following scenario:
* Mem\_Buf\_Limit is set to 1MB \(one megabyte\)
* the input plugin tries to append 700KB
* the engine routes the data to an output plugin
* the output plugin backend \(HTTP Server\) is down
* the engine scheduler will retry the flush after 10 seconds
* the input plugin tries to append 500KB
At this exact point, the engine will still **allow** appending those 500KB of data; in total we have 1.2MB. The option works in a permissive mode until the limit is reached; once the limit is **exceeded**, the following actions are taken:
* block local buffers for the input plugin \(it cannot append more data\)
* notify the input plugin by invoking a **pause** callback
The engine will protect itself and will not append more data coming from the input plugin in question. Note that it is the plugin's responsibility to keep its state and decide what to do in that _paused_ state.
After some seconds, if the scheduler was able to flush the initial 700KB of data, or it gave up after retrying, that amount of memory is released and internally the following actions happen:
* Upon data buffer release \(700KB\), the internal counters get updated
* The counters are now set to 500KB
* Since 500KB is < 1MB, the engine checks the input plugin state
* If the plugin is paused, it invokes a **resume** callback
* The input plugin can continue appending more data
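The scenario above maps to a configuration like the following sketch \(the tail input and the path are illustrative; the relevant key is **Mem\_Buf\_Limit**\):

```text
[INPUT]
    Name          tail
    Path          /var/log/app/*.log
    Mem_Buf_Limit 1MB
```

With this limit in place, once the engine holds more than 1MB of unflushed data from this input, the plugin is paused until buffers are released.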
## About pause and resume Callbacks
Each plugin is independent and not all of them implement the **pause** and **resume** callbacks. As said, these callbacks are just a notification mechanism for the plugin.
One plugin that implements these callbacks and keeps a good state is the [Tail Input](../input/tail.md) plugin. When the **pause** callback is triggered, it stops its collectors and stops appending data. Upon **resume**, it re-enables the collectors.
Lines changed: 67 additions & 1 deletion
# Fluent Bit and Buffering
The end-goal of [Fluent Bit](https://fluentbit.io) is to collect, parse, filter and ship logs to a central place. In this workflow there are many phases, and one of the critical pieces is the ability to do _buffering_: a mechanism to place processed data into a temporary location until it is ready to be shipped.
By default, when Fluent Bit processes data it uses memory as a primary and temporary place to store the record logs, but there are certain scenarios where it would be ideal to have a persistent buffering mechanism based on the filesystem to provide aggregation and data safety capabilities.
Starting with Fluent Bit v1.0, we introduced a new _storage layer_ that can work either in memory or in the file system. Input plugins can be configured to use one or the other on demand at start time.
## Configuration
The storage layer configuration takes place in two areas:
- Service Section
- Input Section
The Service section configures a global environment for the storage layer, and then each Input section defines which mechanism to use.
### Service Section Configuration
| Key | Description | Default |
| --- | --- | --- |
| storage.path | Set an optional location in the file system to store streams and chunks of data. If this parameter is not set, Input plugins can only use in-memory buffering. | |
| storage.sync | Configure the synchronization mode used to store the data in the file system. It can take the values _normal_ or _full_. | normal |
| storage.checksum | Enable the data integrity check when writing and reading data from the filesystem. The storage layer uses the CRC32 algorithm. | Off |
| storage.backlog.mem_limit | If _storage.path_ is set, Fluent Bit will look for data chunks that were not delivered and are still in the storage layer; these are called _backlog_ data. This option sets a hint of the maximum amount of memory to use when processing these records. | 5M |
A Service section will look like this:
```
[SERVICE]
    flush                     1
    log_level                 info
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 5M
```
That configuration sets an optional buffering mechanism whose root path for data is _/var/log/flb-storage/_; it will use _normal_ synchronization mode, no checksum, and up to a maximum of 5MB of memory when processing backlog data.
### Input Section Configuration
Optionally, any Input plugin can configure its storage preference; the following table describes the options available:
| Key | Description | Default |
| --- | --- | --- |
| storage.type | Specify the buffering mechanism to use. It can be _memory_ or _filesystem_. | memory |
The following example configures a service that offers filesystem buffering capabilities and two Input plugins, the first using memory buffering and the second the filesystem:
```
[SERVICE]
    flush                     1
    log_level                 info
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 5M

[INPUT]
    name          cpu
    storage.type  filesystem

[INPUT]
    name          mem
    storage.type  memory
```

administration/memory-management.md

Lines changed: 36 additions & 1 deletion
# Memory Usage
In certain scenarios it would be ideal to estimate how much memory Fluent Bit could be using; this is very useful for containerized environments where memory limits are a must.
In order to estimate this we will assume that the input plugins have set the **Mem\_Buf\_Limit** option \(you can learn more about it in the [Backpressure](backpressure.md) section\).
## Estimating
Input plugins append data independently, so in order to do an estimation a limit should be imposed through the **Mem\_Buf\_Limit** option. If the limit was set to _10MB_, we need to estimate that in the worst case the output plugin will likely use up to _20MB_.
Fluent Bit has an internal binary representation for the data being processed, but when this data reaches an output plugin, the plugin will likely create its own representation in a new memory buffer for processing. The best examples are the [InfluxDB](../output/influxdb.md) and [Elasticsearch](../output/elasticsearch.md) output plugins; both need to convert the binary representation to their respective custom JSON formats before talking to their backend servers.
So, if we impose a limit of _10MB_ for the input plugins and consider the worst-case scenario of the output plugins consuming _20MB_ extra, as a minimum we need \(_30MB_ x 1.2\) = **36MB**.
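As a rough sketch, the arithmetic above can be captured in a few lines \(the 2x output factor and the 1.2 overhead multiplier are the assumptions stated in this section, not values reported by Fluent Bit\):

```python
def estimate_memory_mb(mem_buf_limit_mb: float) -> float:
    """Rough worst-case memory estimate following this section's rule of thumb."""
    output_mb = 2 * mem_buf_limit_mb              # output plugins may hold ~2x in their own format
    return (mem_buf_limit_mb + output_mb) * 1.2   # 20% overhead margin

# A 10MB Mem_Buf_Limit suggests provisioning about 36MB.
print(estimate_memory_mb(10))
```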
## Glibc and Memory Fragmentation
It is well known that in intensive environments, where memory allocations happen at a high rate, the default memory allocator provided by Glibc could lead to high fragmentation, reported as high memory usage by the service.
It's strongly suggested that in any production environment Fluent Bit should be built with [jemalloc](http://jemalloc.net/) enabled \(e.g. `-DFLB_JEMALLOC=On`\). Jemalloc is an alternative memory allocator that can reduce fragmentation \(among other things\), resulting in better performance.
You can check if Fluent Bit has been built with Jemalloc using the following command:
```text
$ bin/fluent-bit -h | grep JEMALLOC
```
The output should look like this:
```text
Build Flags =  JSMN_PARENT_LINKS JSMN_STRICT FLB_HAVE_TLS FLB_HAVE_SQLDB
FLB_HAVE_TRACE FLB_HAVE_FLUSH_LIBCO FLB_HAVE_VALGRIND FLB_HAVE_FORK
FLB_HAVE_PROXY_GO FLB_HAVE_JEMALLOC JEMALLOC_MANGLE FLB_HAVE_REGEX
FLB_HAVE_C_TLS FLB_HAVE_SETJMP FLB_HAVE_ACCEPT4 FLB_HAVE_INOTIFY
```
If the FLB\_HAVE\_JEMALLOC option is listed in _Build Flags_, everything is fine.

administration/monitoring.md

Lines changed: 189 additions & 0 deletions
# Monitoring

Fluent Bit comes with a built-in HTTP Server that can be used to query internal information and monitor metrics of each running plugin.
## Getting Started {#getting_started}
To get started, the first step is to enable the HTTP Server from the configuration file:
```text
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

[INPUT]
    Name cpu

[OUTPUT]
    Name  stdout
    Match *
```
The above configuration snippet will instruct Fluent Bit to start its HTTP Server on TCP port 2020, listening on all network interfaces:
```text
$ bin/fluent-bit -c fluent-bit.conf
Fluent-Bit v0.14.x
Copyright (C) Treasure Data

[2017/10/27 19:08:24] [ info] [engine] started
[2017/10/27 19:08:24] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
```
Now a simple **curl** command is enough to gather some information:
```text
$ curl -s http://127.0.0.1:2020 | jq
{
  "fluent-bit": {
    "version": "0.13.0",
    "edition": "Community",
    "flags": [
      "FLB_HAVE_TLS",
      "FLB_HAVE_METRICS",
      "FLB_HAVE_SQLDB",
      "FLB_HAVE_TRACE",
      "FLB_HAVE_HTTP_SERVER",
      "FLB_HAVE_FLUSH_LIBCO",
      "FLB_HAVE_SYSTEMD",
      "FLB_HAVE_VALGRIND",
      "FLB_HAVE_FORK",
      "FLB_HAVE_PROXY_GO",
      "FLB_HAVE_REGEX",
      "FLB_HAVE_C_TLS",
      "FLB_HAVE_SETJMP",
      "FLB_HAVE_ACCEPT4",
      "FLB_HAVE_INOTIFY"
    ]
  }
}
```
Note that we are piping the _curl_ command output into the _jq_ program, which helps to make the JSON data easy to read from the terminal; Fluent Bit itself does not aim to do JSON pretty-printing.
## REST API Interface {#rest_api}
Fluent Bit aims to expose useful interfaces for monitoring; as of Fluent Bit v0.14 the following endpoints are available:
| URI | Description | Data Format |
| :--- | :--- | :--- |
| / | Fluent Bit build information | JSON |
| /api/v1/uptime | Get uptime information in seconds and human-readable format | JSON |
| /api/v1/metrics | Internal metrics per loaded plugin | JSON |
| /api/v1/metrics/prometheus | Internal metrics per loaded plugin, ready to be consumed by a Prometheus Server | Prometheus Text 0.0.4 |
76+
## Uptime Example
Query the service uptime with the following command:
```
$ curl -s http://127.0.0.1:2020/api/v1/uptime | jq
```
It should print output similar to this:
```json
{
  "uptime_sec": 8950000,
  "uptime_hr": "Fluent Bit has been running: 103 days, 14 hours, 6 minutes and 40 seconds"
}
```
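The `uptime_hr` string can be derived from `uptime_sec`. A small sketch \(a hypothetical helper, not part of Fluent Bit\) that reproduces the example above:

```python
def uptime_hr(uptime_sec: int) -> str:
    # Break the total seconds into days, hours, minutes and seconds.
    days, rem = divmod(uptime_sec, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return (f"Fluent Bit has been running: {days} days, {hours} hours, "
            f"{minutes} minutes and {seconds} seconds")

# 8950000 seconds is 103 days, 14 hours, 6 minutes and 40 seconds.
print(uptime_hr(8950000))
```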
## Metrics Examples
Query internal metrics in JSON format with the following command:
```bash
$ curl -s http://127.0.0.1:2020/api/v1/metrics | jq
```
It should print output similar to this:
```json
{
  "input": {
    "cpu.0": {
      "records": 8,
      "bytes": 2536
    }
  },
  "output": {
    "stdout.0": {
      "proc_records": 5,
      "proc_bytes": 1585,
      "errors": 0,
      "retries": 0,
      "retries_failed": 0
    }
  }
}
```
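Since the endpoint returns plain JSON, it is easy to post-process. A sketch \(the counter names follow the example above; the helper is illustrative, not part of Fluent Bit\) that flattens the per-plugin counters:

```python
import json

def flatten_metrics(payload: str) -> dict:
    """Flatten /api/v1/metrics JSON into {'input.cpu.0.records': 8, ...}."""
    metrics = json.loads(payload)
    flat = {}
    for direction, plugins in metrics.items():       # "input" / "output"
        for plugin, counters in plugins.items():     # e.g. "cpu.0"
            for name, value in counters.items():     # e.g. "records"
                flat[f"{direction}.{plugin}.{name}"] = value
    return flat

sample = '{"input": {"cpu.0": {"records": 8, "bytes": 2536}}}'
print(flatten_metrics(sample))
```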
#### Metrics in Prometheus format
Query internal metrics in Prometheus Text 0.0.4 format:
```bash
$ curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
```
This time the same metrics will be in Prometheus format instead of JSON:
```
fluentbit_input_records_total{name="cpu.0"} 57 1509150350542
fluentbit_input_bytes_total{name="cpu.0"} 18069 1509150350542
fluentbit_output_proc_records_total{name="stdout.0"} 54 1509150350542
fluentbit_output_proc_bytes_total{name="stdout.0"} 17118 1509150350542
fluentbit_output_errors_total{name="stdout.0"} 0 1509150350542
fluentbit_output_retries_total{name="stdout.0"} 0 1509150350542
fluentbit_output_retries_failed_total{name="stdout.0"} 0 1509150350542
```
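The Prometheus text format is line-oriented, so it can also be parsed with minimal code. A hedged sketch \(it handles only the simple `name{labels} value timestamp` sample lines shown above\):

```python
def parse_prom_line(line: str) -> tuple:
    """Parse one 'metric{labels} value timestamp' sample line."""
    name_part, value, timestamp = line.rsplit(" ", 2)
    name, _, labels = name_part.partition("{")
    return name, labels.rstrip("}"), float(value), int(timestamp)

line = 'fluentbit_input_records_total{name="cpu.0"} 57 1509150350542'
print(parse_prom_line(line))
```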
### Configuring Aliases
By default, configured plugins at runtime get an internal name in the format _plugin_name.ID_. For monitoring purposes this can be confusing if many plugins of the same type were configured. To distinguish them, each configured input or output section can get an _alias_ that will be used as the parent name for the metric.
The following example sets an alias on the INPUT section, which is using the [CPU](../input/cpu.md) input plugin:
```
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

[INPUT]
    Name  cpu
    Alias server1_cpu

[OUTPUT]
    Name  stdout
    Alias raw_output
    Match *
```
Now when querying the metrics we get the aliases in place of the plugin names:
```json
{
  "input": {
    "server1_cpu": {
      "records": 8,
      "bytes": 2536
    }
  },
  "output": {
    "raw_output": {
      "proc_records": 5,
      "proc_bytes": 1585,
      "errors": 0,
      "retries": 0,
      "retries_failed": 0
    }
  }
}
```
Lines changed: 41 additions & 1 deletion
# Scheduler
[Fluent Bit](https://fluentbit.io) has an Engine that helps to coordinate data ingestion from input plugins and calls the _Scheduler_ to decide when it is time to flush the data through one or multiple output plugins. The Scheduler flushes new data at a fixed interval of seconds and schedules retries when asked.
Once an output plugin gets called to flush some data, after processing that data it can notify the Engine using one of three possible return statuses:
- OK
- Retry
- Error
If the return status was __OK__, it means the plugin was successfully able to process and flush the data. If it returned an __Error__ status, it means that an unrecoverable error happened and the Engine should not try to flush that data again. If a __Retry__ was requested, the _Engine_ will ask the _Scheduler_ to retry flushing that data; the Scheduler will decide how many seconds to wait before that happens.
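As an illustrative sketch \(a simplified model, not Fluent Bit's actual C implementation\), the Engine's handling of these statuses could look like this:

```python
# Illustrative model of how an engine might react to a flush status.
def handle_flush_status(status: str, retries_done: int, retry_limit) -> str:
    """Return the next action: 'done', 'drop' or 'schedule_retry'."""
    if status == "OK":
        return "done"                 # data flushed, chunk can be released
    if status == "Error":
        return "drop"                 # unrecoverable, never retried
    # status == "Retry": honor Retry_Limit (False means unlimited retries)
    if retry_limit is not False and retries_done >= retry_limit:
        return "drop"
    return "schedule_retry"           # the Scheduler picks the wait time

print(handle_flush_status("Retry", retries_done=0, retry_limit=2))
```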
## Configuring Retries
The Scheduler provides a simple configuration option called __Retry_Limit__, which can be set independently on each output section. This option allows you to disable retries or impose a limit to try N times and then discard the data after reaching that limit:
| | Value | Description |
| ----------- | ----- | ----------- |
| Retry_Limit | N | Integer value to set the maximum number of retries allowed. N must be >= 1 \(default: 2\). |
| Retry_Limit | False | When Retry_Limit is set to False, there is no limit for the number of retries the Scheduler can do. |
### Example
The following example configures two outputs, where the HTTP plugin has an unlimited number of retries and the Elasticsearch plugin has a limit of 5:
```
[OUTPUT]
    Name        http
    Host        192.168.5.6
    Port        8080
    Retry_Limit False

[OUTPUT]
    Name            es
    Host            192.168.5.20
    Port            9200
    Logstash_Format On
    Retry_Limit     5
```
