# Lab 05 - Scalable Computations with the MapReduce Framework - Introduction

During this lab we will explore the MapReduce computational model. It can be used to process large
data sets in a distributed computing environment. The framework simplifies the process of developing
parallel applications by abstracting the complexities of managing parallel tasks, data distribution,
and fault tolerance.

## 1. MapReduce Overview

Several implementations of the MapReduce framework are available, both as self-hosted software and
as managed cloud platforms. They differ in ease of setup and in their requirements, so you may want
to explore and compare their features and performance. Some popular options include:

- Apache Hadoop
- Amazon EMR - a managed platform for running big data frameworks, such as Apache Hadoop or Apache
  Spark
- Google Dataproc - similar to Amazon EMR, but provided by Google Cloud Platform
- ...

For this lab, use the MapReduce implementation of your choice. For simplicity, you can use
the Python library [mrjob](https://mrjob.readthedocs.io/en/latest/index.html) in local mode to run
MapReduce locally without setting up a distributed cluster or an HDFS environment.

The internals of the computation are abstracted away, and you only need to focus on two main
functions: mapping and reducing. The computation (in its simplest form) consists of three steps:

- Data files are split into chunks and processed by some number of Map tasks. Each Map task
  generates zero or more key-value pairs as intermediate results.
- The intermediate results are gathered, shuffled/sorted by key, and passed to Reduce tasks so that
  all values associated with a given key reach the same Reduce task.
- The Reduce tasks process these groups and combine/aggregate all the values associated with each
  key in some way.
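
The three steps above can be sketched in plain Python, without any framework. This is only a single-process illustration of the model, not how Hadoop or `mrjob` are implemented; the helper `run_mapreduce`, the `"sensor,reading"` record format, and the sample data are all hypothetical.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle, and reduce sequentially in one process."""
    # Step 1 (map): every record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Step 2 (shuffle/sort): gather all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Step 3 (reduce): aggregate the values of each key.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def max_mapper(line):
    """Parse a hypothetical 'sensor,reading' record; skip blank lines."""
    line = line.strip()
    if not line:
        return  # a mapper may emit zero pairs for a record
    sensor, reading = line.split(",")
    yield sensor, float(reading)


def max_reducer(sensor, readings):
    """Keep the largest reading seen for each sensor."""
    yield sensor, max(readings)


readings = ["a,3", "b,7", "", "a,5", "b,2"]
result = run_mapreduce(readings, max_mapper, max_reducer)
print(result)
```

In a real deployment the map and reduce calls run as parallel tasks on different machines, and the shuffle moves data over the network; the mapper and reducer functions themselves keep the same shape.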

## 2. Introductory Examples

For each of the problems stated below, think of the input format that best lets you solve the
problem efficiently. Prepare several input files (programmatically or manually) with expected
outputs so that you can easily verify the correctness of your solutions. You are responsible for
creating interesting and diverse test cases that match your chosen input and output formats.


### 2.1 Word Count Example

Let's start with the classic "Word Count" example. It is considered the "Hello World" for MapReduce.
The goal is to count the occurrences of each word in given text files. E.g., given the input text
"hello world hello" the output should be semantically equivalent to `{"hello": 2, "world": 1}`.

Implement your solution to the stated problem. If you have decided to use the `mrjob` library in
local mode, you may need to put your solution in a separate file outside the notebook or use one of
the testing features of `mrjob`. Check your implementation by running it on different text files.
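
As a starting point, here is a minimal framework-free sketch of word count. The `mapper(key, value)` and `reducer(key, values)` generators deliberately follow the shape of the methods you would write in an `mrjob` job class, but the driver `word_count` is a hypothetical stand-in for the framework, and the tokenisation rule (lowercasing, splitting on non-word characters) is an assumption you may want to change.

```python
import re
from collections import defaultdict

# Assumed tokenisation: runs of word characters and apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(_, line):
    """Emit (word, 1) for every word in one input line; the key is unused."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the ones emitted for a single word."""
    yield word, sum(counts)


def word_count(lines):
    """Hypothetical local driver: map every line, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            groups[key].append(value)
    return {k: v for key in groups for k, v in reducer(key, groups[key])}


counts = word_count(["hello world hello"])
print(counts)
```

Because the driver accepts any iterable of lines, the same functions can be checked against the contents of your prepared test files by passing an open file object.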

### 2.2 Multiplying matrices and vectors

Let us consider the problem of multiplying a matrix `M` by a vector `v`. You can assume that the
size of the matrix is compatible with the size of the vector for multiplication, i.e., `M` has `I`
rows and `J` columns, while `v` has `J` elements.

Design a solution that can handle large sparse matrices (i.e., matrices in which most elements are
zeros).

In the simpler scenario, the vector fits into memory. How would you approach the case when it does
not?

Remember that your design concerns not only the MapReduce algorithm but also how to represent
the data so that it can be efficiently processed by MapReduce.
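
One possible design for the simpler scenario is sketched below. It assumes the matrix is stored as one `i,j,value` line per non-zero element and that the vector fits in memory as a dictionary from index to value; both choices, and all function names, are illustrative assumptions rather than a required format. Each mapper call emits the product `m_ij * v_j` under key `i`, and the reducer sums the products of one row.

```python
from collections import defaultdict


def make_mapper(v):
    """Build a mapper that closes over the in-memory vector v (dict j -> v_j)."""
    def mapper(line):
        i, j, m_ij = line.split(",")
        v_j = v.get(int(j), 0.0)
        if v_j != 0.0:  # a zero product contributes nothing to the sum
            yield int(i), float(m_ij) * v_j
    return mapper


def reducer(i, products):
    """Sum the partial products of row i to get the i-th result element."""
    yield i, sum(products)


def mat_vec(lines, v):
    """Hypothetical local driver standing in for the framework."""
    mapper = make_mapper(v)
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(pair for i in groups for pair in reducer(i, groups[i]))


# M = [[1, 0, 2],
#      [0, 3, 0]]  stored as non-zero triples, and v = [1, 2, 3].
matrix_lines = ["0,0,1", "0,2,2", "1,1,3"]
vector = {0: 1.0, 1: 2.0, 2: 3.0}
product = mat_vec(matrix_lines, vector)
print(product)
```

Rows whose products are all zero never appear in the output, which keeps the result sparse as well; whether that is acceptable depends on the output format you choose.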

### 2.3 Matrix multiplication

Consider matrix-matrix multiplication where you can assume that the matrices are compatible for
multiplication, i.e., `M` has size `I x J`, and `N` has size `J x K`. You can also assume that both
matrices are sparse.
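
One way to approach this is a two-pass design, sketched below under several assumptions: both matrices arrive in a single input as `name,row,col,value` lines (only non-zero elements), the first pass joins elements of `M` and `N` on the shared index `j`, and the second pass sums the partial products per output cell `(i, k)`. The record format and all function names are hypothetical, and a single-pass design is also possible.

```python
from collections import defaultdict


def run_step(records, mapper, reducer):
    """Hypothetical local driver for one MapReduce pass."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(reducer(key, values))
    return out


def join_mapper(line):
    """Key every element by the shared index j."""
    name, row, col, value = line.split(",")
    if name == "M":  # m_ij: j is the column index
        yield int(col), ("M", int(row), float(value))
    else:            # n_jk: j is the row index
        yield int(row), ("N", int(col), float(value))


def join_reducer(j, entries):
    """Pair every m_ij with every n_jk and emit their product for cell (i, k)."""
    m_entries = [(i, x) for tag, i, x in entries if tag == "M"]
    n_entries = [(k, y) for tag, k, y in entries if tag == "N"]
    for i, x in m_entries:
        for k, y in n_entries:
            yield (i, k), x * y


def identity_mapper(pair):
    """Second pass: forward ((i, k), product) pairs unchanged."""
    yield pair


def sum_reducer(cell, products):
    """Sum all partial products that belong to one output cell."""
    yield cell, sum(products)


def mat_mul(lines):
    partial = run_step(lines, join_mapper, join_reducer)
    return dict(run_step(partial, identity_mapper, sum_reducer))


# M = [[1, 2], [0, 3]] and N = [[4, 0], [5, 6]], non-zero elements only.
lines = ["M,0,0,1", "M,0,1,2", "M,1,1,3", "N,0,0,4", "N,1,0,5", "N,1,1,6"]
result = mat_mul(lines)
print(result)
```

The join on `j` means zero elements never generate any work, which is what makes the approach suitable for sparse inputs.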

### 2.4 Common Friends

Let us imagine a social network in which each user can have friends. We want a feature, recomputed
for example on a daily basis, that lists the common friends of every pair of users who are already
friends in our system. Pairs of users who are not friends are omitted.

---

Example:

Given the following friendships:

- A -> B, C, D
- B -> A, C, D, E
- C -> A, B, D, E
- D -> A, B, C, E
- E -> B, C, D

We should be able to conclude that:

- The common friends for A and B are C and D.
- The common friends for B and C are A, D, and E.
- etc.

---

Remember that, besides designing the solution itself, you are also required to propose
an input and output format that allow you to perform the computation efficiently.
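
A possible sketch of one such design follows. It assumes the input uses the same `user -> friend, friend, ...` notation as the example, one user per line, and that friendship is symmetric, so every pair of friends receives exactly two friend lists. The parsing, the function names, and the local driver are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    """For each friend of a user, emit the (sorted) pair with the user's full friend list."""
    user, _, rest = line.partition("->")
    user = user.strip()
    friends = sorted(f.strip() for f in rest.split(",") if f.strip())
    for friend in friends:
        pair = tuple(sorted((user, friend)))  # same key from both sides
        yield pair, friends


def reducer(pair, friend_lists):
    """Intersect the two friend lists that arrive for a pair of friends."""
    friend_lists = list(friend_lists)
    if len(friend_lists) != 2:
        return  # one-sided entry: violates the symmetry assumption, skip it
    common = set(friend_lists[0]) & set(friend_lists[1])
    yield pair, sorted(common)


def common_friends(lines):
    """Hypothetical local driver standing in for the framework."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(item for pair in groups for item in reducer(pair, groups[pair]))


network = [
    "A -> B, C, D",
    "B -> A, C, D, E",
    "C -> A, B, D, E",
    "D -> A, B, C, E",
    "E -> B, C, D",
]
result = common_friends(network)
print(result[("A", "B")])
print(result[("B", "C")])
```

Pairs such as A and E never appear as a key because neither user lists the other, so non-friends are omitted without any extra filtering.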

### 2.5 Friends Recommendations 