# Big Data

Date: September 26, 2023

In [1]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('default')
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('font', size=18)
plt.rc('axes', titlesize=18)
plt.rc('axes', labelsize=18)
plt.rc('xtick', labelsize=18)
plt.rc('ytick', labelsize=18)
plt.rc('legend', fontsize=18)
plt.rc('lines', markersize=10)

The three Vs of big data:
- Volume: The amount of data generated by sources such as social media, machines, and the internet. The size of a data set determines its value and potential, and whether it counts as big data at all.
- Velocity: The rate at which data is generated and the speed at which it moves from one point to another. The flow of data is massive and continuous.
- Variety: The many types of data available: structured, semi-structured, and unstructured. Structured data is organized and typically lives in databases. Unstructured data is raw, has no predefined organization, and is hard to interpret automatically. Semi-structured data sits in between: it has no fixed schema, but it carries tags or keys that describe its fields.

Examples of structured, unstructured, and semi-structured data:
- Structured data (Schema first): Relational database
- Unstructured data (Non-schema): Text, images, video, audio
- Semi-structured data (Data first): XML, JSON, HTML
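
To make the three categories concrete, here is a minimal sketch using only the Python standard library. The `users` table, the field names, and the sample sentence are hypothetical and chosen only for illustration.

```python
import json
import sqlite3

# Structured (schema first): the table schema must exist before any row is inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)")
conn.execute("INSERT INTO users (id, name, age) VALUES (?, ?, ?)", (1, "Ada", 36))
structured_row = conn.execute("SELECT id, name, age FROM users").fetchone()

# Semi-structured (data first): every record carries its own field names,
# so two records in the same collection may have different fields.
semi_structured = [
    json.loads('{"id": 1, "name": "Ada", "age": 36}'),
    json.loads('{"id": 2, "name": "Alan", "tags": ["math", "logic"]}'),
]

# Unstructured: free text with no fields; any structure has to be extracted first.
unstructured = "Ada, who is 36, joined the team last week."

print(structured_row)
print([sorted(record.keys()) for record in semi_structured])
print(unstructured.split())
```

The relational insert fails if a row does not fit the declared schema, while the JSON records are accepted as they come and interpreted later.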

BATCH vs STREAMING

Batch processing handles high volumes of data collected over a period of time. A group of transactions is accumulated, entered, and processed, and then the batch results are produced. Typical tools are Hadoop, Spark, Hive, and Pig.

Streaming data is generated continuously by thousands of data sources, which typically send their records simultaneously and in small sizes (on the order of kilobytes). It covers a wide variety of data: log files generated by customers using mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers.

Robustness: The system should be able to recover from failure. Moreover, it should be able to process data that is out of order or arrives late.
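
The difference between the two models, and one way to tolerate out-of-order arrival, can be sketched in plain Python. This is an illustration under stated assumptions, not how any particular engine works: the `(event_time, value)` record layout, the function names, and the `allowed_lateness` parameter are all hypothetical.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, float]  # (event_time, value): hypothetical record layout


def batch_total(events: List[Event]) -> float:
    """Batch: all events are available up front, so one pass gives the result."""
    return sum(value for _, value in events)


def streaming_totals(events: Iterable[Event]) -> Iterator[float]:
    """Streaming: emit an updated running total after every arriving event."""
    total = 0.0
    for _, value in events:
        total += value
        yield total


def reorder_with_lateness(events: Iterable[Event], allowed_lateness: int) -> Iterator[Event]:
    """Buffer events and release them in event-time order.

    An event is released once the largest event time seen so far is at least
    `allowed_lateness` ahead of it; the remaining buffer is flushed at the end.
    """
    buffer: List[Event] = []
    max_seen = None
    for event in events:
        heapq.heappush(buffer, event)
        max_seen = event[0] if max_seen is None else max(max_seen, event[0])
        while buffer and buffer[0][0] <= max_seen - allowed_lateness:
            yield heapq.heappop(buffer)
    while buffer:
        yield heapq.heappop(buffer)


# Events arrive out of order: time 2 comes after time 3, time 4 after time 5.
arrivals = [(1, 10.0), (3, 30.0), (2, 20.0), (5, 50.0), (4, 40.0)]
ordered = list(reorder_with_lateness(arrivals, allowed_lateness=2))
print(batch_total(arrivals))
print(list(streaming_totals(arrivals)))
print(ordered)
```

An event that arrives later than `allowed_lateness` permits would still be emitted here, but out of order. A real system needs an explicit policy for that case, such as dropping the event or correcting earlier results.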

In [2]:
L = [2, 3, 1, 2, 1, 3, 4, 5, 6, 7]

In [3]:
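from functools import reduce
# Map every number to 1 if it matches the condition (else 0), then reduce by summing
M = reduce(lambda x, y: x + y, map(lambda x: 1 if x % 2 == 0 else 0, L))   # number of even values
M1 = reduce(lambda x, y: x + y, map(lambda x: 1 if x % 2 != 0 else 0, L))  # number of odd values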

In [4]:
M

4

In [5]:
M1

6
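
The two counts can also be obtained in a single pass by mapping each number to a pair and reducing the pairs component-wise. This is only a sketch of the same map-then-reduce idea; the names `pairs`, `even_count`, and `odd_count` are introduced here for illustration.

```python
from functools import reduce

L = [2, 3, 1, 2, 1, 3, 4, 5, 6, 7]

# Map: each number becomes an (even_count, odd_count) pair containing a single 1
pairs = map(lambda x: (1, 0) if x % 2 == 0 else (0, 1), L)

# Reduce: add the pairs component-wise, starting from (0, 0)
even_count, odd_count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs, (0, 0))

print(even_count, odd_count)
```

Because component-wise addition is associative and commutative, the same reduction could be split across partitions and merged afterwards.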

$$
\mathbf{P}_{n, s} = \mathbf{A}_{n, m} \cdot \mathbf{B}_{m, s}
= \begin{bmatrix}
    a_{11} & a_{12} & \cdots & a_{1m} \\
    a_{21} & a_{22} & \cdots & a_{2m} \\
    \vdots & \vdots & \ddots & \vdots \\
    a_{n1} & a_{n2} & \cdots & a_{nm}
  \end{bmatrix}
  \begin{bmatrix}
    b_{11} & b_{12} & \cdots & b_{1s} \\
    b_{21} & b_{22} & \cdots & b_{2s} \\
    \vdots & \vdots & \ddots & \vdots \\
    b_{m1} & b_{m2} & \cdots & b_{ms}
  \end{bmatrix},
\qquad
p_{ij} = \sum_{k=1}^m a_{ik} b_{kj}
$$

$$
\begin{aligned}
\mathbf{A}(i, k, a) &\rightarrow \mathbf{A}(\text{row}, \text{column}, \text{value}) \\
\mathbf{B}(k, j, b) &\rightarrow \mathbf{B}(\text{row}, \text{column}, \text{value}) \\
\mathbf{P}(i, j, p) &\rightarrow \mathbf{P}(\text{row}, \text{column}, \text{value})
\end{aligned}
$$

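With each matrix stored as a table of (row, column, value) triples, the product can be computed in SQL by joining on the shared index $k$ and summing the products for each output cell: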
```sql
SELECT A.row, B.column, SUM(A.value * B.value) AS value
FROM A, B
WHERE A.column = B.row
GROUP BY A.row, B.column
```

In [6]:
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 2], [3, 4], [5, 6]]
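
The SQL query above can be mirrored in plain Python on the matrices `A` and `B`: convert each matrix to (row, column, value) triples, join on `A.column = B.row`, and sum the products grouped by `(A.row, B.column)`. This is a sketch; the helper names `to_triples` and `multiply_triples` are introduced here, and indices are 0-based rather than 1-based as in the formula.

```python
from collections import defaultdict

A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 2], [3, 4], [5, 6]]


def to_triples(matrix):
    """Convert a dense matrix into a list of (row, column, value) triples."""
    return [(i, j, v) for i, row in enumerate(matrix) for j, v in enumerate(row)]


def multiply_triples(a_triples, b_triples):
    """Join on A.column = B.row, then sum products grouped by (A.row, B.column)."""
    # Index B by row so the join does not scan all of B for every triple of A
    b_by_row = defaultdict(list)
    for k, j, b in b_triples:
        b_by_row[k].append((j, b))

    sums = defaultdict(int)
    for i, k, a in a_triples:
        for j, b in b_by_row[k]:
            sums[(i, j)] += a * b  # SUM(A.value * B.value) per output cell
    return sorted((i, j, p) for (i, j), p in sums.items())


P = multiply_triples(to_triples(A), to_triples(B))
print(P)
```

The triple form pays off for sparse matrices, where zero entries are simply not stored and therefore never take part in the join.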

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a map procedure (or method), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).

## MapReduce

- Data model: key-value pairs $(k, v)$
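- Execution model: $\text{map}(k, v) \rightarrow [(k_1, v_1), (k_2, v_2), \ldots]$, where a user-defined function (UDF) is applied to every input pair; the pairs are then grouped by key, and $\text{reduce}(k, [v_1, v_2, \ldots]) \rightarrow [(k, v'), \ldots]$ is applied to each group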
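
The execution model can be simulated on a single machine to show the three phases: map, group by key (shuffle), and reduce. This is a minimal sketch in plain Python, not a distributed implementation; `map_reduce`, `word_mapper`, and `sum_reducer` are hypothetical names.

```python
from collections import defaultdict
from itertools import chain


def map_reduce(records, mapper, reducer):
    """Run mapper over (key, value) records, group by key, then run reducer per key."""
    # Map phase: every input pair yields zero or more intermediate pairs
    intermediate = chain.from_iterable(mapper(k, v) for k, v in records)

    # Shuffle phase: collect all values that share the same intermediate key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: every (key, [values]) group yields zero or more output pairs
    return list(chain.from_iterable(reducer(k, vs) for k, vs in groups.items()))


def word_mapper(_line_number, line):
    """User-defined map function: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def sum_reducer(word, counts):
    """User-defined reduce function: sum the counts of one word."""
    return [(word, sum(counts))]


lines = ["aa bb aa", "bb cc"]
result = dict(map_reduce(enumerate(lines), word_mapper, sum_reducer))
print(result)
```

In a real cluster, the map and reduce calls run on different machines and the shuffle moves data over the network, which is usually the expensive step.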

### SPARK

In [7]:
# ! pip install pyspark

In [8]:
from pyspark import SparkContext
sc = SparkContext()



In [9]:
sc

In [10]:
list_of_number = [2, 4, 3, 2, 5, 5, 3, 5, 2, 6, 8, 3, 9, 9, 2, 3, 6, 4, 7, 9, 8]

In [11]:
print(list_of_number)

[2, 4, 3, 2, 5, 5, 3, 5, 2, 6, 8, 3, 9, 9, 2, 3, 6, 4, 7, 9, 8]


In [12]:
rdd = sc.parallelize(list_of_number)

In [13]:
rdd

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:289

In [14]:
# collect() is an action
print("RDD: ", rdd.collect())

RDD:  [2, 4, 3, 2, 5, 5, 3, 5, 2, 6, 8, 3, 9, 9, 2, 3, 6, 4, 7, 9, 8]


In [15]:
print("Number of partitions: ", rdd.getNumPartitions())

Number of partitions:  8


In [16]:
# glom() is a transformation: it turns each partition into a list
print("Partitions structure: ", rdd.glom().collect())

Partitions structure:  [[2, 4], [3, 2], [5, 5, 3, 5], [2, 6], [8, 3], [9, 9, 2, 3], [6, 4], [7, 9, 8]]


In [17]:
# map() is a transformation
plus_one = rdd.map(lambda x: x + 1).collect()
print("Plus one: ", plus_one)
print("Structure: ", rdd.map(lambda x: x + 1).glom().collect())

Plus one:  [3, 5, 4, 3, 6, 6, 4, 6, 3, 7, 9, 4, 10, 10, 3, 4, 7, 5, 8, 10, 9]
Structure:  [[3, 5], [4, 3], [6, 6, 4, 6], [3, 7], [9, 4], [10, 10, 3, 4], [7, 5], [8, 10, 9]]


In [18]:
# filter() is a transformation
even = rdd.filter(lambda x: x % 2 == 0).collect()
print("Even: ", even)
print("Structure: ", rdd.filter(lambda x: x % 2 == 0).glom().collect())

Even:  [2, 4, 2, 2, 6, 8, 2, 6, 4, 8]
Structure:  [[2, 4], [2], [], [2, 6], [8], [2], [6, 4], [8]]


In [19]:
# map() is a transformation and reduce() is an action; neither requires a shuffle
print("Map: ", rdd.map(lambda x: x + 1).glom().collect())
print("Map and reduce: ", rdd.map(lambda x: x + 1).reduce(lambda x, y: x + y))

Map:  [[3, 5], [4, 3], [6, 6, 4, 6], [3, 7], [9, 4], [10, 10, 3, 4], [7, 5], [8, 10, 9]]
Map and reduce:  126


In [20]:
assert rdd.map(lambda x: x + 1).reduce(lambda x, y: x + y) == sum(plus_one)
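
Why `reduce()` needs no shuffle can be sketched without Spark: each partition is reduced locally, and only the partial results are combined. The snippet below reuses the partition structure printed by `glom()` earlier; it is an illustration of the idea, not Spark's actual implementation, and the names `partitions`, `partials`, and `total` are introduced here.

```python
from functools import reduce

# Partition structure as printed by rdd.glom().collect() above
partitions = [[2, 4], [3, 2], [5, 5, 3, 5], [2, 6], [8, 3], [9, 9, 2, 3], [6, 4], [7, 9, 8]]


def add(x, y):
    return x + y


# map(): applied element by element inside each partition, no data movement
mapped = [[x + 1 for x in part] for part in partitions]

# reduce(), step 1: every non-empty partition is reduced locally to one value
partials = [reduce(add, part) for part in mapped if part]

# reduce(), step 2: the partial results are combined into the final value
total = reduce(add, partials)

print(partials)
print(total)
```

This only gives a well-defined result when the reducing function is associative and commutative, because the grouping and order of the partial results are not fixed.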

In [21]:
# Calculate the sum of even numbers and odd numbers
print("Sum of even numbers: ", rdd.filter(lambda x: x % 2 == 0).reduce(lambda x, y: x + y))
print("Sum of odd numbers: ", rdd.filter(lambda x: x % 2 != 0).reduce(lambda x, y: x + y))

Sum of even numbers:  44
Sum of odd numbers:  61


In [22]:
print("Sum of even numbers: ", rdd.map(lambda x: x if x % 2 == 0 else 0).reduce(lambda x, y: x + y))
print("Sum of odd numbers: ", rdd.map(lambda x: x if x % 2 != 0 else 0).reduce(lambda x, y: x + y))

Sum of even numbers:  44
Sum of odd numbers:  61


In [23]:
# reduceByKey() is a transformation; it shuffles values with the same key into one partition
sums = rdd.map(lambda x: (x % 2, x)).reduceByKey(lambda x, y: x + y)
print(sums.glom().collect())
# Look results up by key: the order of the collected pairs is not guaranteed
sums_by_parity = sums.collectAsMap()
print("Sum of even numbers: ", sums_by_parity[0])
print("Sum of odd numbers: ", sums_by_parity[1])

[[(0, 44)], [(1, 61)], [], [], [], [], [], []]
Sum of even numbers:  44
Sum of odd numbers:  61


In [24]:
# flatMap(), union() are transformations
print("Flat map: ", rdd.flatMap(lambda x: list(range(x))).collect())
print("Union: ", rdd.union(rdd).collect())

Flat map:  [0, 1, 0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7]
Union:  [2, 4, 3, 2, 5, 5, 3, 5, 2, 6, 8, 3, 9, 9, 2, 3, 6, 4, 7, 9, 8, 2, 4, 3, 2, 5, 5, 3, 5, 2, 6, 8, 3, 9, 9, 2, 3, 6, 4, 7, 9, 8]


In [25]:
# take() is an action
print("Take: ", rdd.take(5))

Take:  [2, 4, 3, 2, 5]


In [26]:
# Word count exercise
text = ["aa bb nn aa cc", "aa bb cc dd ee", "aa bb cc dd ff", "aa bb cc dd gg", "zz cc xx vv bb", "aa bb cc dd hh"]

In [27]:
r = sc.parallelize(text)
print(r.collect())
print(r.glom().collect())

['aa bb nn aa cc', 'aa bb cc dd ee', 'aa bb cc dd ff', 'aa bb cc dd gg', 'zz cc xx vv bb', 'aa bb cc dd hh']
[[], ['aa bb nn aa cc'], ['aa bb cc dd ee'], ['aa bb cc dd ff'], [], ['aa bb cc dd gg'], ['zz cc xx vv bb'], ['aa bb cc dd hh']]


In [28]:
def word_count(r):
    count = r.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collect()
    return count

In [29]:
word_count(r)

[('ff', 1),
 ('nn', 1),
 ('zz', 1),
 ('hh', 1),
 ('ee', 1),
 ('aa', 6),
 ('gg', 1),
 ('dd', 4),
 ('vv', 1),
 ('cc', 6),
 ('xx', 1),
 ('bb', 6)]

In [30]:
def max_count(r):
    # Count every word, find the highest count, then keep the words that reach it
    count = r.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
    highest = count.map(lambda x: x[1]).reduce(lambda x, y: x if x > y else y)
    return count.filter(lambda x: x[1] == highest).collect()

In [31]:
max_count(r)

[('aa', 6), ('cc', 6), ('bb', 6)]

In [32]:
# Calculate intersection of two RDDs
r1 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
r2 = sc.parallelize([2, 4, 6, 8, 10, 12, 14, 16, 18])

In [33]:
def intersection(r1, r2):
    # Assumes neither RDD contains duplicates; otherwise a count of 2 could come from one side only
    result1 = r1.map(lambda x: (x, 1)) # [(1, 1), (2, 1), (3, 1), ...]
    result2 = r2.map(lambda x: (x, 1)) # [(2, 1), (4, 1), (6, 1), ...]
    # After reduceByKey: [(1, 1), (2, 2), (3, 1), ...]; keep the keys that occur in both RDDs
    result = result1.union(result2).reduceByKey(lambda x, y: x + y).flatMap(lambda x: [x[0]] if x[1] == 2 else []).collect()
    return result

In [34]:
intersection(r1, r2)

[2, 4, 6, 8]
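
The counting approach used in `intersection` is only correct when neither input contains duplicates. The plain-Python sketch below shows the failure case and one possible fix: tag each element with its source and keep the keys seen on both sides. The function names and the `"left"`/`"right"` tags are hypothetical; in PySpark itself, calling `distinct()` on both RDDs first or using the built-in `RDD.intersection` method would address the same issue.

```python
from collections import Counter, defaultdict
from itertools import chain


def count_based_intersection(left, right):
    """Same idea as the RDD version: keep the keys whose total count is exactly 2."""
    counts = Counter(chain(left, right))
    return sorted(x for x, c in counts.items() if c == 2)


def tag_based_intersection(left, right):
    """Record which side each key came from; keep the keys seen on both sides."""
    sources = defaultdict(set)
    for x in left:
        sources[x].add("left")
    for x in right:
        sources[x].add("right")
    return sorted(x for x, tags in sources.items() if tags == {"left", "right"})


left, right = [1, 2, 2, 3], [2, 4, 4]
print(count_based_intersection(left, right))  # 2 is missed (count 3), 4 is wrongly kept (count 2)
print(tag_based_intersection(left, right))
```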