In recent years, `stream processing` has become an important way of handling data for companies large and small.

It extends the way we traditionally worked with data: instead of waiting for data to be collected first, it processes each record as it arrives.

This approach addresses problems that traditional data systems struggle with, in particular:
  - data analysis,
  - moving data between systems (ETL),
  - transaction handling,
  - software development,
  - and discovering new business opportunities.

Stream processing offers several advantages over traditional batch processing methodologies:

  - **Real-time data processing**: Unlike batch processing, which deals with data in discrete chunks, stream processing allows for real-time or near-real-time data processing. This is crucial for applications that require immediate insights from their data.
  
  - **Scalability**: Stream processing systems are designed to handle large volumes of data. They can scale horizontally to accommodate data growth, making them suitable for big data applications.
  
  - **Fault tolerance**: Many stream processing systems are designed to be fault-tolerant. This ensures that data processing can continue uninterrupted, even if a part of the system fails.
  
  - **Integration with other systems**: Stream processing systems can often integrate with many data sources and sinks, which makes it easier to build complex data pipelines that span several systems.

This lecture aims to explain the growing excitement around stream processing and its transformative impact on data-driven application development.


Before we dive into stream processing, let's first understand traditional data application architectures and their limitations. 

In the past, data applications have mostly used **batch processing**. This means data is processed in separate chunks at specific times. Think of it like baking cookies - you can only bake as many as fit on the tray at one time.

Batch processing works well in some cases, but it has problems when it comes to real-time data processing.


Imagine you're trying to track the score of a live football game. If you're using batch processing, you might only get updates at half-time and at the end of the game. The delay between an event happening and its result becoming visible is known as latency, and batch processing makes it large by design.
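The football example can be sketched in a few lines of Python. This is only an illustration: the match events, the batch times (minutes 45 and 90), and the function names are invented here, not part of any library.

```python
# Hypothetical match events: (minute, scoring_team). Invented for illustration.
EVENTS = [(12, "home"), (30, "away"), (44, "home"), (67, "home"), (88, "away")]


def batch_updates(events, batch_times=(45, 90)):
    """Report the score only at fixed batch times (half-time, full-time)."""
    updates = []
    for cutoff in batch_times:
        home = sum(1 for minute, team in events if minute <= cutoff and team == "home")
        away = sum(1 for minute, team in events if minute <= cutoff and team == "away")
        updates.append((cutoff, home, away))
    return updates


def stream_updates(events):
    """Report the score immediately after every single event."""
    home = away = 0
    for minute, team in events:
        if team == "home":
            home += 1
        else:
            away += 1
        yield (minute, home, away)


batch = batch_updates(EVENTS)
stream = list(stream_updates(EVENTS))
print("batch :", batch)
print("stream:", stream)
```

Both approaches end with the same final score, but the batch version produces two updates while the streaming version produces one per goal.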

Batch processing also has trouble scaling, which means it can't always handle large amounts of data efficiently. It's like trying to bake cookies for a whole city with just one oven!

Moreover, batch processing is not very flexible. If the data changes or something unexpected happens, batch processing can't adapt quickly. This can slow down businesses and prevent them from innovating.

In the next part of the lecture, we'll discuss how stream processing can help overcome these limitations.


##  Data. From data file to data lake

The advancement of information technology has revolutionized the way we access and utilize data. With the availability of vast amounts of both structured and unstructured data, new opportunities have emerged for scientific and business challenges.

Data plays a crucial role in the creation, collection, storage, and processing of information on an unprecedented scale. This has led to the development of numerous tools and technologies that enable us to harness the power of data.

Thanks to open-source software and the computing power of home computers, we can now tackle complex problems and explore new frontiers in various domains. The possibilities are endless as we continue to push the boundaries of data-driven innovation.

The new era of business and scientific challenges brings forth a multitude of opportunities:

- Intelligent advertising for thousands of products, targeting millions of customers.
- Processing of data related to genes, RNA, or proteins, such as [genus](http://genus.fuw.edu.pl).
- Intelligent detection of fraudulent activities among hundreds of billions of credit card transactions.
- Stock market simulations based on thousands of financial instruments.
- ...

As we enter the data age, we face not only the challenge of handling large quantities of data but also the need for faster data processing.

Many classical machine learning algorithms work on structured data in tabular form. The data is organized into columns (features) that describe each observation (row), for example sex, height, or the number of cars owned. These features are used to predict, say, whether a customer will repay a loan. The outcome to be predicted (the target) is stored as an additional column. Given a feature table built this way, algorithms such as XGBoost or logistic regression can find the combination of variables that best explains the probability that a customer is good or bad.
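A minimal sketch of such a feature table, and of how a logistic model turns one row into a probability. The column names, rows, weights, and bias are all invented for illustration; in practice the weights would be learned from data, for example by fitting a logistic regression.

```python
import math

# Hypothetical feature table: each dict is one observation (row).
customers = [
    {"height_cm": 180, "cars_owned": 2, "repaid": 1},
    {"height_cm": 165, "cars_owned": 0, "repaid": 0},
    {"height_cm": 172, "cars_owned": 1, "repaid": 1},
]

FEATURES = ["height_cm", "cars_owned"]  # input columns
TARGET = "repaid"                       # column we want to predict

# Hand-picked, purely illustrative parameters (NOT fitted to any data).
WEIGHTS = {"height_cm": 0.0, "cars_owned": 1.5}
BIAS = -1.0


def repay_probability(row):
    """Logistic model: squash a weighted sum of features into (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * row[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))


probabilities = [repay_probability(row) for row in customers]
for row, p in zip(customers, probabilities):
    print(f"target={row[TARGET]}  predicted probability={p:.3f}")
```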

**Unstructured data** refers to data that does not have a predefined structure or format, such as sound, images, and text. Unlike structured data, which is organized in a tabular form with columns and rows, unstructured data lacks a consistent organization.

When processing unstructured data, it is often converted into a vector form to enable analysis and extraction of meaningful insights. However, individual elements like letters, frequencies, or pixels do not convey specific information on their own. They need to be transformed and analyzed collectively to derive valuable features and patterns.
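One simple way to turn text into vectors is a bag-of-words count, sketched below with the standard library only. The two sample sentences and the helper names `tokenize` and `to_vector` are assumptions made for this example; real pipelines usually use more careful tokenization.

```python
from collections import Counter

# Two tiny sample "documents", invented for illustration.
documents = ["to be or not to be", "to sleep perchance to dream"]


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


# The vocabulary fixes the meaning of each vector position.
vocabulary = sorted({word for doc in documents for word in tokenize(doc)})


def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [to_vector(doc) for doc in documents]
print(vocabulary)
for vector in vectors:
    print(vector)
```

A single word says little on its own, but the resulting count vectors can be compared or fed to the same algorithms that work on tabular data.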

Understanding the distinction between structured and unstructured data is crucial for effective data processing and analysis.

> Give an example of structured and unstructured data. Load sample data in a Jupyter notebook.

### Example of structured data

In [None]:
import pandas as pd

### Create a DataFrame with structured data

In [None]:
data = {'Name': ['John', 'Jane', 'Mike'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

### Display the structured data

In [None]:
print("Structured Data:")
print(df)

### Example of unstructured data

In [None]:
import nltk

### Load sample unstructured text data

In [None]:
nltk.download('gutenberg')
from nltk.corpus import gutenberg

### Display the unstructured data

In [None]:
print("\nUnstructured Data:")
print(gutenberg.raw('shakespeare-hamlet.txt'))

> Knows the types of structured and unstructured data (K2A_W02, K2A_W04, O2_W04, O2_W07)


## Data sources

The three largest data `generators` are:

- Social data in the form of text (tweets, social network posts, comments), photos, or videos.
    This data is valuable because it enables analysis of consumer behaviour and sentiment in marketing.
- Data from all kinds of sensors, and logs of device and user activity (e.g. on a website).
    This data is associated with IoT (Internet of Things) technology, currently one of the fastest-growing areas both in data processing and in business.
- Transaction data, generated whenever a transaction takes place, online or offline.
    Today this type of data is processed both to execute the transactions themselves and to power rich analytics supporting virtually every area of everyday life.

## Actual data generation process

In reality, data arises from the continuous operation of systems.
You have already generated a lot of data today on your phone (and on other devices, too!).
Will you not generate more later today, or tomorrow?
Batch processing splits the data into chunks of a fixed time length and runs the processing at a time specified by the user.
However, splitting by timestamp is not always appropriate.
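A small sketch of time-based chunking, assuming fixed 60-second windows and made-up events. It also hints at why a timestamp boundary can be awkward: the events at seconds 119 and 120 are one second apart, yet they land in different chunks.

```python
from collections import defaultdict

# Hypothetical events: (timestamp_in_seconds, value). Invented for illustration.
events = [(3, 10), (42, 5), (61, 7), (119, 1), (120, 4)]
WINDOW_SECONDS = 60  # assumed chunk length


def batch_by_window(events, window_seconds):
    """Group events into fixed-length time chunks keyed by chunk start."""
    windows = defaultdict(list)
    for timestamp, value in events:
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start].append(value)
    return dict(windows)


windows = batch_by_window(events, WINDOW_SECONDS)
# One batch job per chunk: here it simply sums the values.
totals = {start: sum(values) for start, values in sorted(windows.items())}
print(totals)
```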

You are already familiar with many systems that handle data streams.
Examples include:
- data warehouses
- devices monitoring systems (IoT)
- transaction systems
- website analytics systems
- Internet advertising
- social media
- operating systems
- ....

> A company is an organization that operates on, and responds to, a constant stream of data.

In batch processing, the source of the data (and also the result of processing) is a **file**.
It is written once and can then be referenced many times (multiple processes or tasks can run on it).
The file name identifies the set of records.

In the case of a stream, an event is generated only once, by a so-called _producer_ (also referred to as the sender or supplier).
The resulting event can be processed by many so-called _consumers_ (recipients).
Streaming events are grouped into so-called **topics**.
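The producer, consumer, and topic roles can be sketched with a toy in-memory broker. The class `InMemoryBroker` and the topic names are hypothetical and exist only for this example. Real brokers typically also store events and let consumers read at their own pace, which this sketch does not attempt.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a message broker; not a real library API."""

    def __init__(self):
        # topic name -> list of consumer callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a consumer for a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Producer side: emit an event once; every consumer of the topic gets it."""
        for callback in self._subscribers[topic]:
            callback(event)


broker = InMemoryBroker()
billing_seen, analytics_seen = [], []

# Two independent consumers of the same topic.
broker.subscribe("orders", billing_seen.append)
broker.subscribe("orders", analytics_seen.append)

# The producer publishes each event exactly once.
broker.publish("orders", {"order_id": 1, "amount": 25.0})
broker.publish("clicks", {"page": "/home"})  # no consumers on this topic

print(billing_seen)
print(analytics_seen)
```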

## not to Big Data

> _"Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it."_ - Dan Ariely, Professor of Psychology and Behavioral Economics, Duke University

### One, two, ... five V

1. **Volume** - the amount of data produced worldwide is growing exponentially. Huge amounts of data are generated every second: emails, posts on Twitter, Facebook, and other social media, videos, pictures, SMS messages, call records, and data from a variety of devices and sensors.
2. **Velocity** - the speed at which data is produced, transferred, and processed.
3. **Variety** - traditional data is associated with an alphanumeric form composed of letters and numbers. Today we also have images, sound, video, and IoT data streams at our disposal.
4. **Veracity** - Is the data complete and correct? Does it objectively reflect reality? Can it serve as a basis for making decisions?
5. **Value** - the value that the data holds. In the end, it is all about costs and benefits.

> _The purpose of computing is insight, not numbers._ R.W. Hamming, 1962.


As you can see, data and data processing have been omnipresent in businesses for many decades.
Over the years the collection and usage of data have grown consistently, and companies have designed and built infrastructures to manage that data.

## Data processing models

The traditional architecture that most businesses implement distinguishes two types of data processing.

Most of the data is stored in databases or data warehouses.
Typically, data is accessed through queries issued by applications.
The way in which database access is used and implemented is called the **processing model**.
Two implementations are most commonly used:

### Traditional Model

**Traditional model** - online transaction processing (OLTP).
It works well for day-to-day operations, e.g. customer service, order registration, sales handling, etc.
Companies use all kinds of applications for their day-to-day business activities, such as Enterprise Resource Planning (ERP) Systems, Customer Relationship Management (CRM) software, and web-based applications.
These systems are typically designed with separate tiers for data processing and data storage (transactional database system).

![OLTP system](img/baza1.png){.center}

Applications are usually connected to external services or face human users and continuously process incoming events such as orders, emails, or clicks on a website.

When **an event** is processed, an application reads its state or updates it by running transactions against the remote database system. Often, a database system serves multiple applications that sometimes access the same databases or tables.
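A minimal sketch of this read-then-update pattern, using an in-memory SQLite database in place of the remote transactional database. The `stock` table, the `handle_order` function, and the sample quantities are assumptions made for the example.

```python
import sqlite3

# In-memory database standing in for the remote transactional database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (product TEXT PRIMARY KEY, quantity INTEGER NOT NULL)")
conn.execute("INSERT INTO stock VALUES ('widget', 5)")
conn.commit()


def handle_order(conn, product, amount):
    """Process one order event: read the state, then update it in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT quantity FROM stock WHERE product = ?", (product,)
        ).fetchone()
        if row is None or row[0] < amount:
            return False  # unknown product or not enough stock: reject the order
        conn.execute(
            "UPDATE stock SET quantity = quantity - ? WHERE product = ?",
            (amount, product),
        )
        return True


accepted = handle_order(conn, "widget", 3)  # 5 in stock, so this succeeds
rejected = handle_order(conn, "widget", 4)  # only 2 left, so this is rejected
remaining = conn.execute(
    "SELECT quantity FROM stock WHERE product = 'widget'"
).fetchone()[0]
print(accepted, rejected, remaining)
```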

This model provides effective solutions for:

- effective and safe data storage,
- transactional data recovery after a failure,
- data access optimization,
- concurrency management,
- event processing -> read -> write

And what if we are dealing with:

- aggregation of data from many systems (e.g. for many stores),
- supporting data analysis,
- data reporting and summaries,
- optimization of complex queries,
- supporting business decisions.

Research on such issues has led to the formulation of a new data processing model and a new type of database _(Data warehouse)_.
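The kind of question a data warehouse answers can be sketched as an aggregate query over sales collected from several stores. SQLite is used here only as a convenient stand-in; the `sales` table, store names, and amounts are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, product TEXT, amount REAL)")

# Hypothetical rows gathered from the transactional systems of two stores.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Warsaw", "widget", 10.0),
        ("Warsaw", "gadget", 5.0),
        ("Krakow", "widget", 7.5),
        ("Krakow", "widget", 2.5),
    ],
)

# Analytical query: summarize many records instead of touching a single one.
report = conn.execute(
    "SELECT store, COUNT(*), SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()

for store, n_sales, total in report:
    print(f"{store}: {n_sales} sales, total {total}")
```

Unlike the per-event transactions of the traditional model, this query scans and aggregates the whole table, which is why such workloads are usually kept apart from the operational database.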

This application design can cause problems when applications need to evolve or scale. 
Since multiple applications might work on the same data representation or share the same infrastructure, changing the schema of a table or scaling a database system requires careful planning and a lot of effort.
Currently, many running applications (even within one area) are implemented as **microservices**, i.e. small and independent applications (following the Unix philosophy: do one thing and do it well).
Because microservices are strictly decoupled from each other and only communicate over well-defined interfaces, each microservice can be implemented with a different technology stack including a programming language, libraries and data stores.

Both are performed in batch mode. 
Today they are commonly implemented using Hadoop technology.

![OLAP system](img/baza2.png){.center}