# Apache Flume

![](https://flume.apache.org/_static/flume-logo.png)


## What is Apache Flume?
:::: {.columns}

::: {.fragment .column width="50%"}
Apache Flume is a

- *distributed*: multiple agents manage multiple sources and sinks
- *reliable*: events are staged in the channel and delivered using a transactional approach
- *available*: fault tolerant, with failover and recovery mechanisms; also open source and mature

system
::: 

::: {.fragment .column width="50%"}
for efficiently:

- collecting: making data available
- aggregating: merging data from different sources
- moving: transferring data from one place to another

:::
::::

:::{.fragment}
from many different sources (differing in both data type and location) to a centralized data store
:::



## Flume Usage
Apache Flume is not restricted to log data aggregation.

Since data sources are customizable, Flume can be used to transport massive quantities of event data, including but not limited to:

- network traffic data
- social-media-generated data
- email messages

and pretty much any other data source.

## Event Data

:::: {.columns}

::: {.fragment .column width="50%"}

**High Level Architecture**

![](images/flume-dataflow.png){.fragment}
::: 

::: {.fragment .column width="50%"}

**Event**

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes.

![](images/flume-event.png){.fragment}
:::
::::
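
As a rough mental model (not Flume's Java API), an event can be pictured as a byte payload plus an optional map of string headers. The class name `Event` and the sample header keys below are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """Illustrative model of a Flume event: a byte payload plus string headers."""
    body: bytes
    headers: dict[str, str] = field(default_factory=dict)


# Headers are optional; the payload is just bytes.
plain = Event(body=b"GET /index.html 200")
tagged = Event(body=b"GET /index.html 200",
               headers={"host": "web01", "timestamp": "1700000000000"})

print(len(plain.body), plain.headers)
print(tagged.headers["host"])
```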




## Components

### Source

:::: {.columns}

::: {.fragment .column width="50%"}
- A Flume source consumes events delivered to it by an external source like a web server. 
- The external source sends events to Flume in a format that is recognized by the target Flume source. 
- Sources can be pollable or event-driven.
::: 

::: {.fragment .column width="50%"}
![](images/flume-source.png)
:::
::::


### Channel

:::: {.columns}

::: {.fragment .column width="50%"}
- When a Flume source receives an event, it stores it into one or more channels.

- The channel is a passive store that keeps the event until it’s consumed by a Flume sink

- A channel is a conduit for events between a source and a sink. Channels also dictate the durability of event delivery between a source and a sink. 
::: 

::: {.fragment .column width="50%"}
Conduit image here
:::
::::
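
The idea that a channel keeps an event until a sink has safely consumed it can be sketched as a toy in-memory queue with commit and rollback. This is a simplified illustration, not Flume's implementation; `ToyChannel` and its method names are hypothetical.

```python
from collections import deque
from typing import Optional


class ToyChannel:
    """Hypothetical in-memory channel: events leave only when a take is committed."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self._queue = deque()
        self._in_flight = []  # events taken by a sink but not yet committed

    def put(self, event: bytes) -> None:
        if len(self._queue) + len(self._in_flight) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self) -> Optional[bytes]:
        if not self._queue:
            return None
        event = self._queue.popleft()
        self._in_flight.append(event)
        return event

    def commit(self) -> None:
        # The sink delivered everything it took: forget those events.
        self._in_flight.clear()

    def rollback(self) -> None:
        # Delivery failed: put the taken events back at the head, in order.
        self._queue.extendleft(reversed(self._in_flight))
        self._in_flight.clear()


channel = ToyChannel()
channel.put(b"event-1")
channel.put(b"event-2")

channel.take()        # the sink takes event-1 ...
channel.rollback()    # ... but fails to deliver it, so it goes back
first = channel.take()
channel.commit()      # second attempt succeeds
print(first)          # b'event-1'
```

Durability then depends on where the queue lives: Flume's memory channel loses staged events if the agent process dies, while the file channel persists them on disk.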


### Sink

:::: {.columns}

::: {.fragment .column width="50%"}
A sink is the counterpart to the source in that it is a destination for data in Flume. 

Some of the built-in sinks included with Flume are:

- the Hadoop Distributed File System sink, which writes events to HDFS in various ways

- the logger sink, which simply logs all events received

- the null sink, which is Flume's version of `/dev/null`

::: 

::: {.fragment .column width="50%"}
![](images/nemo.gif)
:::
::::




## Agent

::::{.columns}

::: {.fragment .column width="50%"}

- A Flume agent puts together the components to be connected:
  1. source
  2. channel
  3. sink

- Components are named and configured per agent
- A configuration file can define multiple agents; each Flume process runs the one named at launch

:::

::: {.fragment .column width="50%"}
![](images/flume-agent.png)
:::

::::






### Agent Configuration 

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
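
Property keys follow the pattern `<agent>.<component type>.<component name>.<property>`. Note that a source uses `channels` (it may write to several), while a sink uses `channel` (it reads from exactly one). As an illustration, the sketch below lists which channels each source and sink is bound to. The helper name `wiring` is hypothetical, and the parsing covers only simple `key = value` lines, not the full Java properties syntax.

```python
def wiring(conf_text: str, agent: str) -> dict[str, list[str]]:
    """Map each source/sink of `agent` to the channels it is bound to."""
    props = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()

    result = {}
    for name in props.get(f"{agent}.sources", "").split():
        # A source may fan out to several space-separated channels.
        result[f"source {name}"] = props.get(f"{agent}.sources.{name}.channels", "").split()
    for name in props.get(f"{agent}.sinks", "").split():
        # A sink reads from exactly one channel.
        result[f"sink {name}"] = props.get(f"{agent}.sinks.{name}.channel", "").split()
    return result


example = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""
print(wiring(example, "a1"))
# {'source r1': ['c1'], 'sink k1': ['c1']}
```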

# Running Flume using Docker 

## Dockerfile

```dockerfile
FROM openjdk:8-alpine
LABEL maintainer="Salvo Nicotra"
ENV FLUME_VERSION="1.11.0"
ENV PATH=/opt/flume/bin:$PATH
ENV HADOOP_VERSION=2.10.1
ENV HADOOP_HOME=/opt/flume/lib/hadoop-$HADOOP_VERSION

RUN apk update && apk add bash

ADD pkg/apache-flume-${FLUME_VERSION}-bin.tar.gz /opt/

# Create sym link
RUN ln -s /opt/apache-flume-${FLUME_VERSION}-bin /opt/flume

ADD pkg/hadoop-$HADOOP_VERSION.tar.gz /opt/flume/lib/

RUN mkdir /var/log/netcat
ADD start-flume.sh /opt/flume/bin/start-flume
# Copy All conf here
ADD conf/* /opt/flume/conf/

EXPOSE 44444

ENTRYPOINT [ "start-flume" ]
```

## Wrapper Entry Point

```bash
#!/bin/bash
set -v
FLUME_CONF_DIR=/opt/flume/conf
# Agent name defaults to a1; override with -e FLUME_AGENT_NAME=<name>
FLUME_AGENT_NAME="${FLUME_AGENT_NAME:-a1}"

# Fail early if the config directory or the config file name is missing
[[ -d "${FLUME_CONF_DIR}" ]] || { echo "Flume config dir ${FLUME_CONF_DIR} not found"; exit 1; }
[[ -n "${FLUME_CONF_FILE}" ]] || { echo "FLUME_CONF_FILE required"; exit 1; }

echo "Starting flume agent : ${FLUME_AGENT_NAME}"

# exec replaces the shell, so flume-ng receives container signals directly
exec flume-ng agent \
  -c "${FLUME_CONF_DIR}" \
  -f "${FLUME_CONF_DIR}/${FLUME_CONF_FILE}" \
  -n "${FLUME_AGENT_NAME}" \
  -Dflume.root.logger=INFO,console \
  -Dorg.apache.flume.log.printconfig=true \
  -Dorg.apache.flume.log.rawdata=true
```

# Flume Hello World

```bash
# Only once
docker network create --subnet=10.0.100.0/24 tap
# Build
docker build flume/ --tag tap:flume
# Run
docker run --rm --name flumehw --network tap --ip 10.0.100.10 -p 44444:44444  -e FLUME_CONF_FILE=netcatExample.conf tap:flume
# Get flume logs
docker exec -it flumehw tail -f flume.log
```
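
Once the container is running, the netcat source can be exercised from the host with any TCP client (for example `telnet localhost 44444`). Below is a minimal Python sketch, assuming the `-p 44444:44444` port mapping from the `docker run` command above; the function name `send_line` is hypothetical.

```python
import socket


def send_line(text: str, host: str = "localhost", port: int = 44444) -> str:
    """Send one newline-terminated line and return the source's reply."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(text.encode("utf-8") + b"\n")
        return conn.recv(1024).decode("utf-8").strip()
```

With the `flumehw` container up, calling `send_line("Hello world!")` should return `OK`, since the netcat source acknowledges each received line by default, and the event should then show up in the logger sink output.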

# Hadoop Demo

## HDFS
:::: {.columns}

::: {.fragment .column width="50%"}

- HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.

- The HDFS architecture diagram depicts basic interactions among the NameNode, the DataNodes, and the clients. Clients contact the NameNode for file metadata or file modifications and perform actual file I/O directly with the DataNodes.
::: 

::: {.fragment .column width="50%"}
![](images/hdfs.gif)
:::
::::

## Hadoop in Docker
<https://github.com/big-data-europe/docker-hadoop>

Last update 4 years ago...

![](https://i.imgflip.com/8l9zv8.jpg)

### Let's build a TAP one
```bash
docker build hadoop --tag tap:hadoop 

docker run --name namenode --hostname namenode -it --rm --network tap -p 9870:9870  tap:hadoop /bin/bash -c "hdfs namenode -format && hdfs namenode -fs hdfs://namenode:9000"

docker run --name datanode --hostname datanode -it --rm --network tap  tap:hadoop /bin/bash -c "hdfs datanode -fs hdfs://namenode:9000"

```

| Service  | Address                                             |
| -------- | --------------------------------------------------- |
| NameNode | <http://localhost:9870/dfshealth.html#tab-overview> |

### Configure Flume

```properties
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:9000/flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
```
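
The time escapes in `hdfs.path` are filled in from each event's `timestamp` header (or from the local clock when `hdfs.useLocalTimeStamp` is set to `true`), and the `round*` settings round that timestamp down before the path is expanded. The sketch below mimics this with Python's `strftime`, whose `%y`, `%m`, `%d`, `%H`, `%M` and `%S` happen to have the same meaning as the Flume escapes used here; the function name `bucket_path` is hypothetical.

```python
from datetime import datetime


def bucket_path(ts: datetime, round_minutes: int = 10) -> str:
    """Expand the configured path for `ts`, rounding down to `round_minutes`."""
    rounded = ts.replace(minute=ts.minute - ts.minute % round_minutes,
                         second=0, microsecond=0)
    return rounded.strftime("hdfs://namenode:9000/flume/events/%y-%m-%d/%H%M/%S")


print(bucket_path(datetime(2012, 6, 12, 11, 54, 34)))
# hdfs://namenode:9000/flume/events/12-06-12/1150/00
```

All events that arrive within the same 10-minute window therefore land in the same directory.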

### Demo Netcat - HDFS

```bash
# Run Flume with netcatHdfs.conf; the HDFS sink writes events under /flume in HDFS
docker run --rm --network tap -p 44444:44444 -it -e FLUME_CONF_FILE=netcatHdfs.conf tap:flume
```