# Big Data

Date: September 26, 2023

In [1]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('default')
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('font', size=18)
plt.rc('axes', titlesize=18)
plt.rc('axes', labelsize=18)
plt.rc('xtick', labelsize=18)
plt.rc('ytick', labelsize=18)
plt.rc('legend', fontsize=18)
plt.rc('lines', markersize=10)

The three Vs of big data:
- Volume: The amount of data generated by sources such as social media, machines, and the internet. Size determines the value and potential of the data under consideration and whether it can be considered big data at all.
- Velocity: The rate at which data is generated and the speed at which it moves from one point to another. The flow of data is massive and continuous.
- Variety: The many types of data that are available: structured, unstructured, and semi-structured. Structured data is organized and typically found in databases, while unstructured data is raw, has no predefined organization, and is hard to interpret automatically. Semi-structured data sits in between: it has no fixed schema, but it carries tags or keys that describe its own structure.

Examples of structured, unstructured, and semi-structured data:
- Structured data (Schema first): Relational database
- Unstructured data (Non-schema): Text, images, video, audio
- Semi-structured data (Data first): XML, JSON, HTML
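
As a rough illustration of the three categories above, the sketch below builds one tiny sample of each. The column names, records, and text are made up for the example and are not taken from any real data set.

```python
import json

# Structured (schema first): every row must fit a schema fixed in advance.
schema = ("id", "name", "age")            # hypothetical column names
rows = [(1, "Ana", 36), (2, "Ben", 41)]   # hypothetical rows
structured = [dict(zip(schema, row)) for row in rows]

# Semi-structured (data first): each record names its own fields,
# so two records may carry different keys.
raw = '[{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben", "tags": ["math"]}]'
semi_structured = json.loads(raw)
fields_seen = sorted({key for record in semi_structured for key in record})

# Unstructured (no schema): a blob of text; any structure has to be inferred.
unstructured = "The quick brown fox jumps over the lazy dog."

print(structured[0])   # {'id': 1, 'name': 'Ana', 'age': 36}
print(fields_seen)     # ['id', 'name', 'tags']
```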

BATCH vs STREAMING

Batch processing runs over a large volume of data that has been collected over a period of time. It suits tasks where a group of transactions accumulates first: data is collected, entered, and processed, and then the batch results are produced. Typical tools include Hadoop, Spark, Hive, and Pig.

Streaming data is generated continuously by thousands of data sources, which typically send their records simultaneously and in small sizes (on the order of kilobytes). It covers a wide variety of data: log files generated by customers using mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers.
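
To make the contrast concrete, here is a minimal sketch that computes the same statistic (a mean) in both styles. The function names and the sample readings are hypothetical; a real pipeline would use a framework such as those listed above.

```python
from typing import Iterable, Iterator, List

def batch_mean(records: List[float]) -> float:
    """Batch: the whole data set is available before processing starts."""
    return sum(records) / len(records)

def streaming_mean(records: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result per record, keeping only constant state."""
    count, total = 0, 0.0
    for value in records:
        count += 1
        total += value
        yield total / count   # an up-to-date answer after every record

readings = [2.0, 4.0, 6.0, 8.0]   # hypothetical sensor readings
batch_result = batch_mean(readings)
stream_results = list(streaming_mean(iter(readings)))

print(batch_result)     # 5.0
print(stream_results)   # [2.0, 3.0, 4.0, 5.0]
```

The batch version needs the full list in memory, while the streaming version only keeps a count and a running total, so it can run over an unbounded source.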

Robustness: The system should be able to recover from failure. Moreover, it should be able to process data that is out of order or arrives late.
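
One common way to cope with out-of-order and late records is to buffer events and release them only once a "watermark" has passed them. The sketch below is an assumption about how this could look, not a method taken from these notes; `reorder`, `allowed_lateness`, and the sample events are hypothetical names and data.

```python
import heapq

def reorder(events, allowed_lateness):
    """Return (ordered, too_late) for a stream of (timestamp, payload) pairs.

    Events wait in a min-heap until the watermark (largest timestamp seen
    so far minus allowed_lateness) reaches them. An event that arrives with
    a timestamp already behind the watermark is set aside as too late.
    """
    heap, ordered, too_late = [], [], []
    watermark = float("-inf")
    for ts, payload in events:
        if ts < watermark:
            too_late.append((ts, payload))   # cannot be emitted in order any more
            continue
        heapq.heappush(heap, (ts, payload))
        watermark = max(watermark, ts - allowed_lateness)
        while heap and heap[0][0] <= watermark:
            ordered.append(heapq.heappop(heap))
    while heap:                              # end of stream: flush the buffer
        ordered.append(heapq.heappop(heap))
    return ordered, too_late

# Hypothetical events: "b" arrives out of order, "late" arrives far too late.
events = [(1, "a"), (3, "c"), (2, "b"), (6, "f"), (1, "late"), (5, "e")]
ordered, too_late = reorder(events, allowed_lateness=2)
print(ordered)    # [(1, 'a'), (2, 'b'), (3, 'c'), (5, 'e'), (6, 'f')]
print(too_late)   # [(1, 'late')]
```

A larger `allowed_lateness` tolerates more disorder at the cost of holding events longer before they are emitted.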

In [2]:
L = [2, 3, 1, 2, 1, 3, 4, 5, 6, 7]

In [3]:
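from functools import reduce
# Map each element to 1 if it passes the parity test (else 0), then reduce by summing.
M = reduce(lambda x, y: x + y, map(lambda x: 1 if x % 2 == 0 else 0, L))   # number of even values
M1 = reduce(lambda x, y: x + y, map(lambda x: 1 if x % 2 != 0 else 0, L))  # number of odd values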

In [4]:
M

4

In [5]:
M1

6

$$
\mathbf{P}_{n, s} = \mathbf{A}_{n, m} \cdot \mathbf{B}_{m, s}
= \begin{bmatrix}
    a_{11} & a_{12} & \cdots & a_{1m} \\
    a_{21} & a_{22} & \cdots & a_{2m} \\
    \vdots & \vdots & \ddots & \vdots \\
    a_{n1} & a_{n2} & \cdots & a_{nm}
  \end{bmatrix}
  \begin{bmatrix}
    b_{11} & b_{12} & \cdots & b_{1s} \\
    b_{21} & b_{22} & \cdots & b_{2s} \\
    \vdots & \vdots & \ddots & \vdots \\
    b_{m1} & b_{m2} & \cdots & b_{ms}
  \end{bmatrix},
\qquad
p_{ij} = \sum_{k=1}^m a_{ik} b_{kj}
$$

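Each matrix can be stored as a relation of (row, column, value) triples:

$$
\begin{aligned}
\mathbf{A}(i, k, a) &\rightarrow \mathbf{A}(\text{row}, \text{column}, \text{value}) \\
\mathbf{B}(k, j, b) &\rightarrow \mathbf{B}(\text{row}, \text{column}, \text{value}) \\
\mathbf{P}(i, j, p) &\rightarrow \mathbf{P}(\text{row}, \text{column}, \text{value})
\end{aligned}
$$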

In SQL, the matrix product can be computed by joining A and B on the shared index k and summing the products for each output cell:
```sql
SELECT A.row, B.column, SUM(A.value * B.value) AS value
FROM A, B
WHERE A.column = B.row
GROUP BY A.row, B.column
```

In [7]:
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 2], [3, 4], [5, 6]]
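
The same join-then-group-by logic as the SQL query can be sketched in plain Python on the matrices `A` and `B` above. The helpers `to_triples` and `multiply_triples` are hypothetical names introduced for this illustration, and indices are 0-based here, unlike the 1-based indices in the formula.

```python
from collections import defaultdict

A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 2], [3, 4], [5, 6]]

def to_triples(matrix):
    """Flatten a dense matrix into (row, column, value) triples."""
    return [(i, j, v) for i, row in enumerate(matrix) for j, v in enumerate(row)]

def multiply_triples(a_triples, b_triples):
    """Join on A.column = B.row, then group by (A.row, B.column) and sum."""
    # Index B by its row so the join does not scan all of B for each A triple.
    b_by_row = defaultdict(list)
    for k, j, b in b_triples:
        b_by_row[k].append((j, b))
    # GROUP BY (i, j) with SUM(a * b).
    sums = defaultdict(int)
    for i, k, a in a_triples:
        for j, b in b_by_row[k]:
            sums[(i, j)] += a * b
    return sorted((i, j, p) for (i, j), p in sums.items())

P = multiply_triples(to_triples(A), to_triples(B))
print(P)   # [(0, 0, 22), (0, 1, 28), (1, 0, 49), (1, 1, 64)]
```

Because only stored triples take part in the join, zero entries can simply be left out of the relations, which is what makes this representation attractive for sparse matrices.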

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a map procedure (or method), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
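
The student example can be sketched as three small functions. This is a single-process illustration of the programming model only, with no cluster or parallelism; the function names and the student list are hypothetical.

```python
from collections import defaultdict

def map_phase(students):
    """Map: emit a (first_name, 1) pair for every student record."""
    return [(name.split()[0], 1) for name in students]

def shuffle(pairs):
    """Shuffle: group the emitted values by key, one queue per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues

def reduce_phase(queues):
    """Reduce: summarize each queue, here by counting its entries."""
    return {key: sum(values) for key, values in queues.items()}

# Hypothetical input records.
students = ["Ana Lopez", "Ben Ito", "Ana Kim", "Cy Roy", "Ben Wu", "Ana Diaz"]
name_counts = reduce_phase(shuffle(map_phase(students)))
print(name_counts)   # {'Ana': 3, 'Ben': 2, 'Cy': 1}
```

In a real MapReduce system the map and reduce calls run on different machines and the framework performs the shuffle, but the data flow is the same.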