# Fundamentals of Data Pipelines

## What is a Data Pipeline?

A data pipeline is a unified system for capturing events, often from several different sources, and delivering them to value-generating initiatives (such as analytics, product development, monitoring, etc.).

![Example of a Data Pipeline](img/data-pipeline.png)

## What can we call EVENTS?

Any executed action can be called an event. An action can be executed by:
- Users;
- Systems;
- Bots;

### Coarse-grained events

This type of event is a kind of *by-product* derived from the main product/operation being tracked. Usually, coarse-grained events are stored as text logs that can be used for debugging and analysis.

```json
{
  "id": 3969,
  "ip": "127.0.0.1",
  "timestamp": "2018-01-03 08:48:26.987-02Z",
  "action": "GET / HTTP/2.0",
  "status": 200
}
```

In some cases, the meaning of an action depends on context that must be known at debugging or analysis time. For example, suppose you have the following event from an email app:

```json
{
  "id": 36728,
  "ip": "127.0.0.1",
  "timestamp": "2018-01-03 08:55:26.987-02Z",
  "action": "GET /inbox",
  "status": 200
}
```

This event has two possible meanings:
- The user just loaded the entire app (if this is the first load of the session);
- The user just refreshed their timeline/inbox;

So these implementation details have to be documented and shared with the people who are going to use the events for analysis or debugging.
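To make the ambiguity concrete, the sketch below labels each `GET /inbox` event by looking at the session it belongs to. The `session_id` field and the `interpret_inbox_events` function are assumptions made for illustration; the example events above do not carry a session identifier.

```python
def interpret_inbox_events(events):
    """Label each "GET /inbox" event as an app load or an inbox refresh.

    Assumes events are ordered by time and carry a hypothetical
    `session_id` field (not present in the example events above).
    """
    seen_sessions = set()
    labels = []
    for event in events:
        if event["action"] != "GET /inbox":
            continue  # this rule only covers inbox requests
        if event["session_id"] in seen_sessions:
            labels.append((event["id"], "inbox refreshed"))
        else:
            # First inbox request of the session: the app was just loaded.
            seen_sessions.add(event["session_id"])
            labels.append((event["id"], "app loaded"))
    return labels


# Made-up sample events, for illustration only.
events = [
    {"id": 36728, "session_id": "s1", "action": "GET /inbox", "status": 200},
    {"id": 36729, "session_id": "s1", "action": "GET /inbox", "status": 200},
    {"id": 36730, "session_id": "s2", "action": "GET /inbox", "status": 200},
]
labels = interpret_inbox_events(events)
print(labels)
```

Every analyst or debugger who reads these events would need to know, and re-implement, a rule like this one, which is why the rule has to be documented.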

### Fine-grained events

This type of event has a record/ticket format, for example:
- app opened;
- auto refresh;
- user pull down refresh;

With fine-grained events, we annotate actions with contextual information, so we avoid the risk of multiple meanings that comes with coarse-grained events.
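As a rough sketch of what this can look like in practice, the events below name the exact action that happened, so an analysis can count them directly. The field names (`event`, `user_id`) and the sample values are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical fine-grained events: each one names the exact action that occurred.
events = [
    {"id": 1, "event": "app_opened", "user_id": 42},
    {"id": 2, "event": "auto_refresh", "user_id": 42},
    {"id": 3, "event": "user_pull_down_refresh", "user_id": 42},
    {"id": 4, "event": "auto_refresh", "user_id": 7},
]


def count_by_event(events):
    """Count occurrences of each event name; no request parsing or session logic needed."""
    return Counter(e["event"] for e in events)


counts = count_by_event(events)
print(counts["auto_refresh"])  # 2
print(counts["app_opened"])    # 1
```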

### Coarse- vs. fine-grained

Usually, coarse-grained events are more suitable for logging and debugging, since we don't know in advance the exact context that will surround an issue or bug. So we need more **generic** data.

However, analytics initiatives have their scope/context well defined. So it is a good practice to create specific events for everything within the defined context.

The final message of this topic is: **decouple** logging and analysis!
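One possible way to picture this decoupling is a request handler that writes a generic log line for debugging and, separately, emits a specific event for analytics. This is only a sketch: the logger name `email_app`, the `track` helper, the in-memory `analytics_events` list, and `handle_inbox_request` are all hypothetical stand-ins for whatever logging and analytics sinks a real system uses.

```python
import io
import logging

# Generic, coarse-grained log stream used for debugging (kept in memory here).
log_stream = io.StringIO()
logger = logging.getLogger("email_app")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(log_stream))
logger.propagate = False  # keep the example's output out of the root logger

# Separate sink for fine-grained analytics events (a list stands in for a real store).
analytics_events = []


def track(event_name, **context):
    """Record a fine-grained analytics event with its contextual fields."""
    analytics_events.append({"event": event_name, **context})


def handle_inbox_request(user_id, first_load):
    # Debugging concern: log the raw request, with no interpretation attached.
    logger.info("GET /inbox status=200")
    # Analytics concern: record what the action actually meant.
    track("app_opened" if first_load else "user_pull_down_refresh", user_id=user_id)


handle_inbox_request(user_id=42, first_load=True)
handle_inbox_request(user_id=42, first_load=False)

print(log_stream.getvalue())
print(analytics_events)
```

The two log lines are identical, while the analytics sink holds two distinct events, so neither consumer has to reverse-engineer the other's data.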

## Schema

A schema is the structure used to describe your data: a contract that binds the **fields** collected to their **data types**.
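A minimal sketch of such a contract, assuming a plain dictionary that maps field names to Python types, is shown below. `ACCESS_EVENT_SCHEMA` and `validate` are hypothetical names; the fields mirror the access-log event shown earlier, and real pipelines typically rely on a dedicated schema format rather than hand-written checks.

```python
# Hypothetical schema for the access-log event shown earlier: field name -> expected type.
ACCESS_EVENT_SCHEMA = {
    "id": int,
    "ip": str,
    "timestamp": str,
    "action": str,
    "status": int,
}


def validate(event, schema):
    """Return a list of contract violations; an empty list means the event conforms."""
    errors = []
    for field, expected_type in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            actual = type(event[field]).__name__
            errors.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    for field in event:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


event = {
    "id": 3969,
    "ip": "127.0.0.1",
    "timestamp": "2018-01-03 08:48:26.987-02Z",
    "action": "GET / HTTP/2.0",
    "status": 200,
}
print(validate(event, ACCESS_EVENT_SCHEMA))                       # []
print(validate(dict(event, status="200"), ACCESS_EVENT_SCHEMA))   # ['status: expected int, got str']
```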

### Schema and schema-less events

Schema-less formats such as JSON and CSV are now widely used for event collection, alongside the older but still extensively used table format (e.g. SQL/relational).

The main advantages of having an event schema are:
- More efficient processing, since you don't have to spend resources checking for new fields or changes in data types;
- You don't need to write parsers;
- Schemas are easier to change, since you know the current one in advance;
- It facilitates **automated analytics**;

But there are drawbacks too:
- You put the data processing effort on the application side;
- Less flexibility for shipping new data;
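To illustrate the other side of this trade-off: without a schema, the type checking moves from the application to every consumer of the events. The sketch below shows the kind of defensive parsing a consumer might end up writing for a single field; `parse_status` is a hypothetical helper, not part of any library.

```python
def parse_status(raw_event):
    """Defensively read `status` from a schema-less event.

    Returns an int when the value is an int or a numeric string, otherwise None.
    """
    value = raw_event.get("status")
    if isinstance(value, int):
        return value
    if isinstance(value, str) and value.strip().isdigit():
        return int(value.strip())
    return None  # missing field or unusable value


print(parse_status({"status": 200}))      # 200
print(parse_status({"status": " 404 "}))  # 404
print(parse_status({"status": "OK"}))     # None
print(parse_status({}))                   # None
```

With a schema enforced at the source, a consumer could read `status` directly, at the cost of the application having to conform before shipping any new field.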

# Show me the code

Let's use the Vizgr (www.vizgr.org) web service to request some historical events from Wikipedia.

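```python
from urllib.request import urlopen  # Python 3 replacement for urllib2

# Historical events up to the end of 2017, returned as JSON, in Portuguese
VIZGR_URL = "http://www.vizgr.org/historical-events/search.php?format=json&begin_date=-3000000&end_date=20171231&lang=pt"

# Send the request and read the raw response body
response = urlopen(VIZGR_URL)
result = response.read()

type(response)
```

Under Python 3, `type(response)` is `http.client.HTTPResponse`; with Python 2's `urllib2`, the same call reported `instance`.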
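Since the request asks for `format=json`, a natural next step is to decode the body that was read from the service. The sketch below does this with the standard `json` module. The sample payload is made up purely to exercise the function, and `decode_events` is a hypothetical helper; nothing here assumes the actual layout of the Vizgr response.

```python
import json


def decode_events(raw_body):
    """Decode a JSON response body (bytes or str) into Python objects."""
    return json.loads(raw_body)


# Made-up payload for illustration; it does not reproduce the real Vizgr response layout.
sample_body = b'{"events": [{"date": "2017", "description": "example"}]}'

decoded = decode_events(sample_body)
print(type(decoded).__name__)  # dict
print(sorted(decoded.keys()))  # ['events']
```

With a live response, the same call would be `decode_events(result)`, after which the top-level keys can be inspected to learn the structure before writing any analysis on top of it.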

# References

1. https://www.slideshare.net/g33ktalk/data-pipeline-acial-lyceum20140624
2. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying