# Map, Filter, and Reduce with Python

Map, filter, and reduce are Python functions (`map()` and `filter()` are built in; `reduce()` lives in the `functools` module) that let you process large, sequential data sets, such as lists, concisely. The same operations are also the foundation of processing big data in a distributed environment. Familiarity with map and reduce, especially, is important for understanding efficient data processing.

In this tutorial, we will explore the base functions to understand how to use them alone and in conjunction with each other.


## Get the data

We will be exploring movie data (since I *love* movies).

Download the Kaggle [IMDB dataset](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset).

The data has the following fields (in order):

1. `color`
1. `director_name`
1. `num_critic_for_reviews`
1. `duration`
1. `director_facebook_likes`
1. `actor_3_facebook_likes`
1. `actor_2_name`
1. `actor_1_facebook_likes`
1. `gross`
1. `genres`
1. `actor_1_name`
1. `movie_title`
1. `num_voted_users`
1. `cast_total_facebook_likes`
1. `actor_3_name`
1. `facenumber_in_poster`
1. `plot_keywords`
1. `movie_imdb_link`
1. `num_user_for_reviews`
1. `language`
1. `country`
1. `content_rating`
1. `budget`
1. `title_year`
1. `actor_2_facebook_likes`
1. `imdb_score`
1. `aspect_ratio`
1. `movie_facebook_likes`
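The code in this tutorial addresses columns by zero-based position (for example, `x[11]` for `movie_title`). As a minimal sketch, a name-to-index lookup can make those positions easier to check. The names `FIELDS` and `INDEX` are introduced here for illustration only and are not part of the dataset.

```python
# Column names in file order, copied from the list above.
FIELDS = [
    "color", "director_name", "num_critic_for_reviews", "duration",
    "director_facebook_likes", "actor_3_facebook_likes", "actor_2_name",
    "actor_1_facebook_likes", "gross", "genres", "actor_1_name",
    "movie_title", "num_voted_users", "cast_total_facebook_likes",
    "actor_3_name", "facenumber_in_poster", "plot_keywords",
    "movie_imdb_link", "num_user_for_reviews", "language", "country",
    "content_rating", "budget", "title_year", "actor_2_facebook_likes",
    "imdb_score", "aspect_ratio", "movie_facebook_likes",
]

# Hypothetical helper: map each column name to its zero-based position.
INDEX = {name: position for position, name in enumerate(FIELDS)}

print(INDEX["director_name"], INDEX["genres"], INDEX["movie_title"])  # 1 9 11
print(INDEX["budget"], INDEX["title_year"])                           # 22 23
```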

Unzip the downloaded file into the same directory as this notebook.

## Reading in the data

To load the data into your Python environment (the `[1:]` slice skips the header row):

In [1]:
with open('movie_metadata.csv') as file:
    data = [ line.strip().split(',') for line in file.readlines()[1:] ]

len(data)

In [2]:
data[0]
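One caveat about the loading code above: splitting each line on commas will mis-parse any quoted field that itself contains a comma. The sketch below contrasts the naive split with the standard library's `csv.reader` on a made-up line; the sample row is hypothetical and not taken from the dataset.

```python
import csv
import io

# Hypothetical row whose second field contains a quoted comma.
line = 'Color,"Hello, World",2009\n'

# Naive approach: the quoted field is broken into two pieces.
naive = line.strip().split(',')

# csv.reader understands the quoting and keeps the field whole.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['Color', '"Hello', ' World"', '2009']
print(parsed)  # ['Color', 'Hello, World', '2009']
```

If column positions look shifted for some rows, this is a likely cause.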

## Mapping

The `map()` function takes two arguments: a function and the data to apply it to. It calls the function on every element and, in Python 3, returns a lazy iterator, which is why the examples in this tutorial wrap it in `list()`. Mapping is similar to the column list of a `select` statement in SQL: you choose which fields to keep for each row and how to transform the raw values.
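As a small illustration of that behavior, here is a sketch on toy data (the `rows` list is made up for this example):

```python
# Toy rows standing in for parsed CSV lines (hypothetical data).
rows = [["a", "1"], ["b", "2"], ["c", "3"]]

# In Python 3, map() returns a lazy iterator; nothing is computed yet.
firsts = map(lambda row: row[0].upper(), rows)

# Wrapping the iterator in list() forces evaluation.
result = list(firsts)
print(result)  # ['A', 'B', 'C']

# The iterator is now exhausted, so a second pass yields nothing.
leftover = list(firsts)
print(leftover)  # []
```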

Let's say we want to retrieve the title and year of each movie:

In [3]:
movie_title_and_year = list(map(lambda x: (x[11].encode('ascii', 'ignore'), x[23]), data))

len(movie_title_and_year)

5043

In [4]:
movie_title_and_year[:10]

[(b'Avatar', '2009'),
 (b"Pirates of the Caribbean: At World's End", '2007'),
 (b'Spectre', '2015'),
 (b'The Dark Knight Rises', '2012'),
 (b'Star Wars: Episode VII - The Force Awakens            ', ''),
 (b'John Carter', '2012'),
 (b'Spider-Man 3', '2007'),
 (b'Tangled', '2010'),
 (b'Avengers: Age of Ultron', '2015'),
 (b'Harry Potter and the Half-Blood Prince', '2009')]

## Filtering

The `filter()` function also takes two arguments: a predicate function and the data to test. It keeps only the elements for which the predicate returns a true value. Filtering is similar to the `where` clause in a SQL `select` statement.
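A minimal sketch on toy numbers (the `numbers` list is made up for this example) shows the predicate at work, alongside the equivalent list comprehension:

```python
numbers = [3, 8, 1, 10, 6]  # hypothetical sample data

# Keep only the elements for which the predicate is true.
big = list(filter(lambda n: n > 5, numbers))
print(big)  # [8, 10, 6]

# The same selection written as a list comprehension.
big_comprehension = [n for n in numbers if n > 5]
print(big == big_comprehension)  # True
```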

Let's say that we want to find all movies that are directed by James Cameron:

In [5]:
james_cameron_movies = list(filter(lambda x: x[1] == 'James Cameron', data))

len(james_cameron_movies)

7

In [6]:
james_cameron_movies[0]

['Color',
 'James Cameron',
 '723',
 '178',
 '0',
 '855',
 'Joel David Moore',
 '1000',
 '760505847',
 'Action|Adventure|Fantasy|Sci-Fi',
 'CCH Pounder',
 'Avatar\xa0',
 '886204',
 '4834',
 'Wes Studi',
 '0',
 'avatar|future|marine|native|paraplegic',
 'http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1',
 '3054',
 'English',
 'USA',
 'PG-13',
 '237000000',
 '2009',
 '936',
 '7.9',
 '1.78',
 '33000']

## Reducing

The `reduce()` function takes three arguments: a function that performs the aggregation, the data to aggregate, and an (optional) initial value. In Python 3 it lives in the `functools` module rather than among the built-ins. This is usually the hardest function for people to wrap their heads around, but it is simple if you let it be :)

Let's say we want to count the number of movies in the list (I know, you could just use `len(data)`, but this will be much more useful in later examples):

In [7]:
from functools import reduce
count = reduce(lambda x, y: x+1, data, 0)

count

5043

Reduction traverses the data one element at a time, applying the function to the previous result (the *accumulated* value) and the current element. In the previous case, we start with the value `0`, and for each element in `data` we add `1` to the accumulator. `x` in the lambda expression is the accumulated value (which we initialized to `0`) and `y` is each row as it is processed; the counting lambda simply ignores `y`.
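To make the accumulator visible, the sketch below records every call that `reduce()` makes. The `rows` values and the `count_step` and `trace` names are hypothetical, chosen only for this illustration.

```python
from functools import reduce

rows = ["row-a", "row-b", "row-c"]  # hypothetical stand-ins for rows of data
trace = []                          # records each (accumulator, element) pair

def count_step(acc, row):
    """Ignore the row itself and add 1 to the accumulator."""
    trace.append((acc, row))
    return acc + 1

count = reduce(count_step, rows, 0)

print(trace)  # [(0, 'row-a'), (1, 'row-b'), (2, 'row-c')]
print(count)  # 3
```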

### A little bit about generators

To understand the true power of filter/map/reduce, you need to know about *generators*. This is a topic in and of itself, but basically a generator is a function that processes only as much as it needs to in order to emit an interim value and then stops until more data is requested by the consumer.

**Note**: That's **way** over-simplified; for a good overview, see [Generators](https://wiki.python.org/moin/Generators) in the [Python Wiki](https://wiki.python.org/). If you take it on faith that generators allow you to keep the minimum amount of data in memory at any point in time, you will be OK for the rest of this discussion.

In Python 3, `map()` and `filter()` return lazy iterators that behave like generators, and `reduce()` consumes its input one element at a time, so chaining them pipelines data through your select/aggregate very efficiently. Intermediate values are created as needed and become eligible for garbage collection as soon as they are consumed, so they do not pile up in memory.
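The following sketch shows the "only as much as it needs to" behavior directly. The `squares` generator and the `produced` list are hypothetical names used only to observe when work actually happens.

```python
produced = []  # records which inputs the generator has actually processed

def squares(numbers):
    """Yield the square of each number, one at a time."""
    for n in numbers:
        produced.append(n)
        yield n * n

gen = squares([1, 2, 3, 4])
print(produced)          # [] because nothing runs until a value is requested

first = next(gen)
print(first, produced)   # 1 [1]

second = next(gen)
print(second, produced)  # 4 [1, 2]
```

The remaining inputs (`3` and `4`) are never touched unless the consumer asks for more values.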

## Putting it all together

### Filter and map

The first combination of functions that makes sense is applying a function to a subset of your data. The pattern is:

    map(mapfn, filter(filterfn, data))

To get the title of all James Cameron movies, you would do the following:

In [8]:
list(map(lambda x: x[11].encode('ascii' , 'ignore'), filter(lambda x: x[1] == 'James Cameron', data)))

[b'Avatar',
 b'Titanic',
 b'Terminator 2: Judgment Day',
 b'True Lies',
 b'The Abyss',
 b'Aliens',
 b'The Terminator']

### Map and reduce

The next combination of functions that makes sense is to aggregate one or more values for each element in a dataset. The pattern is:

    reduce(reducefn, map(mapfn, data))

To get the total number of genres (counting duplicates) for all movies in the dataset:

In [9]:
reduce(lambda x, y: x+y, map(lambda x: len(x[9].split('|')), data))

14504

### Filter and map and reduce

The complete, canonical pattern combines all three functions:

    reduce(fn, map(fn, filter(fn, data)))

That's a reduction of a mapping of a filtering of a set of data.

Let's say you want to compute the average budget of all films from James Cameron (note that we are assuming a budget of `0` if no budget was provided):

In [10]:
sum_num = reduce(
    lambda x, y: (x[0]+y, x[1]+1),
    map(
        lambda x: int(x[22]) if x[22].isdigit() else 0,
        filter(
            lambda x: x[1] == 'James Cameron',
            data)
    ),
    (0, 0)
)

sum_num

(748500000, 7)

Note that this gives us a total and a count, from which we can compute the average:

In [11]:
round(sum_num[0] / sum_num[1], 2)

106928571.43

## Final thoughts

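Everything here can also be done with `while` or `for` loops around `if` statements, but the map/filter/reduce style has practical advantages. Each step is lazy, so data streams through without building intermediate lists, and because the map and filter steps treat every element independently, the same program structure can be parallelized or distributed in ways that are harder to apply to an arbitrary hand-written loop.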

For instance, if we are aggregating data from a CSV file that is multiple terabytes in size, we can use the `csv` library to read and process one line at a time, and `map()`, `filter()`, and `reduce()` this data with almost no overhead. This means we can run meaningful analyses of datasets that are far larger than the resources we have in the machine doing the computations.

This would look like the following:

In [12]:
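import csv
from functools import reduce

# newline='' is how the csv module recommends opening files
with open('movie_metadata.csv', newline='') as file:
    sum_num = reduce(
        lambda x, y: (x[0]+y, x[1]+1),    # accumulate (total, count)
        map(
            # budget as an int, or 0 if it is missing
            lambda x: int(x[22]) if x[22].isdigit() else 0,
            filter(
                lambda x: x[1] == 'James Cameron',
                csv.reader(file)          # yields one parsed row at a time
            )
        ),
        (0, 0)
    )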

sum_num

(748500000, 7)
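For comparison, here is a sketch of the same kind of streaming aggregation written as an explicit loop. To keep it self-contained it reads a made-up three-column sample (title, director, budget) from an in-memory string rather than the real 28-column file, so the column positions and values are assumptions for illustration only.

```python
import csv
import io

# Hypothetical sample data; the real file has 28 columns.
sample = io.StringIO(
    "title,director,budget\n"
    "Film A,Director X,100\n"
    "Film B,Director Y,250\n"
    "Film C,Director X,\n"
    "Film D,Director X,300\n"
)

total, count = 0, 0
for row in csv.reader(sample):
    # The filter step: skip rows by other directors (and the header row).
    if row[1] != "Director X":
        continue
    # The map step: budget as an int, or 0 if it is missing.
    budget = int(row[2]) if row[2].isdigit() else 0
    # The reduce step: update the running (total, count).
    total, count = total + budget, count + 1

print((total, count))  # (400, 3)
```

The loop does the same work, but the three steps are interleaved in one body instead of being separate, composable functions.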

### Distributed MapReduce

As an extension of that example, if the data itself is spread across multiple machines, the `map()` and `filter()` steps can be performed where the data is (send the code to the data rather than sending the data to the code) and the only thing that happens on the calling machine is the final reduction. This is the idea behind Google's MapReduce framework, which was used for jobs such as building its web search index. But that's a talk for another time.
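As a rough single-process sketch of that idea, the code below treats each inner list as the rows held on one machine. Everything here is hypothetical: the `partitions` data, the `local_sum_and_count` helper, and the director names. It also performs a partial reduction on each partition before the final one, which is a common refinement but goes one step beyond what the paragraph above describes.

```python
from functools import reduce

# Hypothetical partitions: each inner list stands in for one machine's rows,
# given as (director, budget) pairs.
partitions = [
    [("Director X", 100), ("Director Y", 50)],
    [("Director X", 300)],
    [("Director Y", 70), ("Director X", 200)],
]

def local_sum_and_count(rows):
    """Filter, map, and partially reduce one partition where it lives."""
    budgets = map(lambda r: r[1], filter(lambda r: r[0] == "Director X", rows))
    return reduce(lambda acc, b: (acc[0] + b, acc[1] + 1), budgets, (0, 0))

# Each "machine" sends back only a small (sum, count) pair.
partials = [local_sum_and_count(p) for p in partitions]

# The caller combines the partial results in a final reduction.
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials, (0, 0))

print(partials)       # [(100, 1), (300, 1), (200, 1)]
print(total / count)  # 200.0
```

This works because addition is associative: the partial sums and counts can be combined in any grouping and still give the same totals.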