Week 5: Aggregation Framework

Topics:

Goals, The Use Of The Pipeline, Comparison With SQL Facilities.

Simple Aggregation Example

select manufacturer, count(*) from products group by manufacturer

use agg
db.products.aggregate([
    {$group:
     {
	 _id:"$manufacturer",    //set the field '_id' to the field you want to group by
	 num_products:{$sum:1}   //define field 'num_products'
     }
    }
])

Quiz:

db.products.aggregate([{
  "$group":{
         "_id":"$category",
         "num_products":{"$sum":1}
   }
}])

The Aggregation Pipeline

Lecture Video

Each document in the array passed to aggregate() is a stage in the pipeline.
Documents flow through the stages from collection to result, for example: collection -> $project -> $match -> $group -> $sort -> result
Stages (main operators):
- $project: selects relevant fields and reshapes each doc; in: 1 / out: 1
- $match: filters docs; in: n / out: n-x
- $group: aggregates, reducing the number of docs; in: n / out: n-x
- $sort: sorts the docs; in: 1 / out: 1
- $skip: skips docs; in: n / out: n-x
- $limit: limits the number of returned docs; in: n / out: n-x
- $unwind: explodes arrays, producing one doc for each value in an array field with everything else repeated; in: n / out: n+x
Stages (additional operators):
- $out: writes the resulting docs to another collection; in: 1 / out: 1
- $redact
- $geoNear: limits the docs passing through the stage based on location
Each stage can appear more than once in a pipeline.
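
The in/out counts above can be illustrated with a plain JavaScript sketch that models each stage as a function from an array of documents to an array of documents. This is only a mental model, not how MongoDB executes a pipeline; the helper names (match, project, limit, runPipeline) and the sample documents are hypothetical.

```javascript
// Each stage takes an array of documents and returns a new array.
const match = (pred) => (docs) => docs.filter(pred);   // n in, n-x out
const project = (shape) => (docs) => docs.map(shape);  // 1 in, 1 out per doc
const limit = (n) => (docs) => docs.slice(0, n);       // n in, at most n out

// Run the stages in order, feeding each stage's output to the next.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

// Hypothetical sample data.
const sampleProducts = [
  { name: "iPad", manufacturer: "Apple", price: 499 },
  { name: "Nexus 7", manufacturer: "Google", price: 199 },
  { name: "MacBook", manufacturer: "Apple", price: 1299 },
];

const pipelineResult = runPipeline(sampleProducts, [
  match((d) => d.manufacturer === "Apple"),
  project((d) => ({ item: d.name, price: d.price })),
  limit(1),
]);

console.log(pipelineResult); // [ { item: 'iPad', price: 499 } ]
```

Because every stage has the same shape (documents in, documents out), the same stage can be used more than once and the stages can be reordered freely, which is what the array passed to aggregate() expresses.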

Simple Example Explained

Lecture Video

Quiz:

db.stuff.find()
{ "_id" : ObjectId("50b26f9d80a78af03b5163c8"), "a" : 1, "b" : 1, "c" : 1 }
{ "_id" : ObjectId("50b26fb480a78af03b5163c9"), "a" : 2, "b" : 2, "c" : 1 }
{ "_id" : ObjectId("50b26fbf80a78af03b5163ca"), "a" : 3, "b" : 3, "c" : 1 }
{ "_id" : ObjectId("50b26fcd80a78af03b5163cb"), "a" : 3, "b" : 3, "c" : 2 }
{ "_id" : ObjectId("50b26fd380a78af03b5163cc"), "a" : 3, "b" : 5, "c" : 3 }

db.stuff.aggregate([{$group:{_id:'$c'}}])

How many documents will be in the result set from aggregate?

3
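
The answer can be checked without a database. The sketch below is plain JavaScript, not the mongo shell: it copies the five documents from the quiz (leaving out the ObjectId values) and mimics a $group on c with no accumulators, which produces one document per distinct value of c.

```javascript
// The five documents from db.stuff (the _id values are omitted here).
const stuff = [
  { a: 1, b: 1, c: 1 },
  { a: 2, b: 2, c: 1 },
  { a: 3, b: 3, c: 1 },
  { a: 3, b: 3, c: 2 },
  { a: 3, b: 5, c: 3 },
];

// $group with _id:'$c' and no accumulators yields one document per distinct c.
const distinctC = [...new Set(stuff.map((doc) => doc.c))];
const groupedByC = distinctC.map((value) => ({ _id: value }));

console.log(groupedByC.length); // 3
```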

Compound Grouping

Lecture Video

select manufacturer, category, count(*) from products group by manufacturer, category.

Use a compound id with manufacturer and category

db.products.aggregate([{
  "$group":{
        "_id":{                     //create new key or _id for each doc
           "maker":"$manufacturer",
           "category":"$category"
        },
        "num_products":{"$sum":1}
  }
}])

Using a document for _id

Lecture Video

Aggregation Expressions

Lecture Video

These are the expressions that you can use in the $group stage of the pipeline:

$sum adds one per document, mySum: {$sum:1}, or sums up the values of a key, sum_prices: {$sum:"$price"}
$avg, $min, $max compute the average, minimum or maximum value of a key
$push builds an array (duplicates are kept)
$addToSet builds an array in which each element appears only once
$first only useful after a sort
$last only useful after a sort
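
To make the difference between these expressions concrete, here is a plain JavaScript sketch that groups a few hypothetical documents by manufacturer and computes the equivalent of $sum, $max, $avg, $push and $addToSet in one pass. The function name groupByManufacturer and the sample data are assumptions for illustration; MongoDB does not guarantee the element order of an $addToSet array, while this sketch happens to keep first-seen order.

```javascript
// Hypothetical sample documents.
const items = [
  { manufacturer: "Apple", category: "Tablets", price: 499 },
  { manufacturer: "Apple", category: "Laptops", price: 1299 },
  { manufacturer: "Apple", category: "Tablets", price: 599 },
  { manufacturer: "Google", category: "Tablets", price: 199 },
];

// Group by manufacturer and apply several accumulators in a single pass.
function groupByManufacturer(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.manufacturer)) {
      groups.set(doc.manufacturer, {
        _id: doc.manufacturer,
        count: 0,             // like {$sum: 1}
        sum_prices: 0,        // like {$sum: "$price"}
        max_price: -Infinity, // like {$max: "$price"}
        pushed: [],           // like {$push: "$category"}
        unique: [],           // like {$addToSet: "$category"}
      });
    }
    const g = groups.get(doc.manufacturer);
    g.count += 1;
    g.sum_prices += doc.price;
    g.max_price = Math.max(g.max_price, doc.price);
    g.pushed.push(doc.category);
    if (!g.unique.includes(doc.category)) g.unique.push(doc.category);
  }
  // $avg is the sum divided by the number of documents in the group.
  return [...groups.values()].map((g) => ({
    ...g,
    avg_price: g.sum_prices / g.count,
  }));
}

const accumulated = groupByManufacturer(items);
console.log(accumulated[0].pushed); // [ 'Tablets', 'Laptops', 'Tablets' ]
console.log(accumulated[0].unique); // [ 'Tablets', 'Laptops' ]
```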

Using $sum

Lecture Video

db.products.aggregate([
   {
      "$group":{
         "_id":{
            "maker":"$manufacturer"
         },
         "sum_prices":{ "$sum":"$price" }
      }
   }
])

Quiz:

db.zips.aggregate([  
   {  
      "$group":{  
         "_id":"$state",
         "population":{ $sum:"$pop" }
      }
   }
])

Using $avg

Lecture Video

db.products.aggregate([
   {
      "$group":{
         "_id":{
            "category":"$category"
         },
         "avg_prices":{ "$avg":"$price" }
      }
   }
])

Quiz:

db.zips.aggregate([
        {$group:
         {
    	 _id: {
           "state":"$state", "zip":"$zip"
    	 },
    	 avg_pop:{$avg:"$pop"}
         }
        }
    ])

Using $addToSet

Lecture Video

Create arrays for each doc
$addToSet adds it only if it's not already there.

db.products.aggregate([  
   {  
      "$group":{  
         "_id":{  
            "maker":"$manufacturer"
         },
         "categories":{ "$addToSet":"$category" }
      }
   }
])

Quiz:

db.zips.aggregate([  
   {  
      "$group":{  
         "_id":"$city",
         "postal_codes":{ "$addToSet":"$_id" }
      }
   }
])

Using $push

Lecture Video

$push is very similar to $addToSet, except that $push does not guarantee that it adds each item only once. It doesn't look through to make sure it's not already there.

db.products.aggregate([  
   {  
      "$group":{  
         "_id":{  
            "maker":"$manufacturer"
         },
         "categories":{ "$push":"$category" }
      }
   }
])

Using $max and $min

Lecture Video

db.products.aggregate([  
   {  
      "$group":{  
         "_id":{  
            "maker":"$manufacturer"
         },
         "maxprice":{ "$max":"$price" }
      }
   }
])

Quiz:

db.zips.aggregate([  
   {  
      "$group":{  
         "_id":"$state",
         "pop":{ "$max":"$pop" }
      }
   }
])

Double $group stages

Lecture Video

db.grades.aggregate([  
   {  
      '$group':{  
         _id:{  
            class_id:"$class_id",
            student_id:"$student_id"
         },
         'average':{  
            "$avg":"$score"
         }
      }
   },//grouping one
   {  
      '$group':{  
         _id:"$_id.class_id",
         'average':{  
            "$avg":"$average"
         }
      }
   } //grouping two from previous $group stage 
])
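
The two grouping stages compute an average of averages, which is generally not the same as averaging all scores in a class directly. The plain JavaScript sketch below shows this on hypothetical grades; averageBy is an illustrative helper, not a MongoDB function.

```javascript
// Hypothetical grades: class 1 has two students, student 10 has two scores.
const grades = [
  { class_id: 1, student_id: 10, score: 80 },
  { class_id: 1, student_id: 10, score: 100 },
  { class_id: 1, student_id: 11, score: 60 },
];

// Average valueOf(doc) within groups keyed by keyOf(doc).
function averageBy(docs, keyOf, valueOf) {
  const buckets = new Map();
  for (const doc of docs) {
    const key = JSON.stringify(keyOf(doc));
    if (!buckets.has(key)) buckets.set(key, { _id: keyOf(doc), total: 0, n: 0 });
    const b = buckets.get(key);
    b.total += valueOf(doc);
    b.n += 1;
  }
  return [...buckets.values()].map((b) => ({ _id: b._id, average: b.total / b.n }));
}

// Grouping one: average per (class, student).
const perStudent = averageBy(
  grades,
  (d) => ({ class_id: d.class_id, student_id: d.student_id }),
  (d) => d.score
);
// Grouping two: average of the student averages per class.
const perClass = averageBy(perStudent, (d) => d._id.class_id, (d) => d.average);

console.log(perClass); // [ { _id: 1, average: 75 } ]
```

Here student 10 averages 90 and student 11 averages 60, so the class average is 75. A single $group on class_id would instead give (80 + 100 + 60) / 3 = 80, because student 10 would count twice.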

Using $project

Lecture Video

It's a 1:1 stage of the pipeline, so for every document that comes into the project phase, one document will leave the project phase.

You can do things like:

remove a key: if you don't mention a key, it is not included, except for _id, which must be explicitly suppressed {$project: {_id: 0, ...
add a new key: also possible to create new subdocuments
reshape the keys: you could take a key and decide to put it into a subdocument with another key.
keep keys: {$project: {myKey: 1, ...
rename keys / use functions: $toUpper, $toLower, $add, $multiply

Simple function of keys:

$toUpper
$toLower
$add add something to the value
$multiply multiply by number

db.products.aggregate([
    {$project:
     {
	 _id:0,
	 'maker': {$toLower:"$manufacturer"},
	 'details': {'category': "$category",
		     'price' : {"$multiply":["$price",10]}
		    },
	 'item':'$name'
     }
    }
])

Quiz:

db.zips.aggregate([
    {$project:
     {
     _id:0,                      //remove _id
     'city': {$toLower:"$city"},
     'pop':1,                    //show pop
     'state':1,                  //show state
     'zip':'$_id'                //ref _id
     }
    }
])

Using $match

Lecture Video

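The $match phase performs filtering, which is an n:1 operation: the number of documents that come out may be smaller than the number that go in. $match goes through each document and checks whether it matches your criteria.

There are two reasons why you might want to match:

pre-aggregation filter: reduce the documents before they are grouped
filter the results: keep only some of the documents produced by an earlier stage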

db.zips.aggregate([
    {$match:
     {
	 state:"CA"
     }
    },
    {$group:
     {
	 _id: "$city",
	 population: {$sum:"$pop"},
	 zip_codes: {$addToSet: "$_id"}
     }
    },
    {$project: //reshape the doc above grouped and matched doc
     {
	 _id: 0,
	 city: "$_id",
	 population: 1,
	 zip_codes:1
     }
    }
])

One thing to note about $match (and $sort) is that they can use indexes, but only if done at the beginning of the aggregation pipeline.

You can read the documentation here.

Quiz:

db.zips.aggregate([
    {$match:
     {
	 pop: {$gt:100000}
     }
    }
])

Using $sort

Lecture Video

$sort supports both disk-based and memory-based sorting.

By default the aggregation framework will try to sort in memory.
There is a limit of 100MB for any given pipeline stage.

$sort can be placed before or after the grouping stage (there are good reasons for either).

db.zips.aggregate([
    {$match:
     {
	 state:"NY"
     }
    },
    {$group:
     {
	 _id: "$city",
	 population: {$sum:"$pop"},
     }
    },
    {$project:
     {
	 _id: 0,
	 city: "$_id",
	 population: 1,
     }
    },
    {$sort:
     {
	 population:-1
     }
    }
])

Quiz:

db.zips.aggregate([
    {$sort:
     {
	 state:1, city:1
     }
    }
])

Using $limit and $skip

Lecture Video

It doesn't make sense to skip and limit unless you sort first. Put $skip first, then $limit: the order of the stages in the pipeline matters.

db.zips.aggregate([
    {$match:
     {
	 state:"NY"
     }
    },
    {$group:
     {
	 _id: "$city",
	 population: {$sum:"$pop"},
     }
    },
    {$project:
     {
	 _id: 0,
	 city: "$_id",
	 population: 1,
     }
    },
    {$sort:
     {
	 population:-1
     }
    },
    {$skip: 10},
    {$limit: 5}
])
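
Why the order of $skip and $limit matters can be seen with array slicing. This is a plain JavaScript sketch on hypothetical data, using smaller numbers than the pipeline above (skip 1, limit 2).

```javascript
// Hypothetical city totals, deliberately unsorted.
const cities = [
  { city: "B", population: 20 },
  { city: "D", population: 40 },
  { city: "A", population: 10 },
  { city: "C", population: 30 },
];

// Sort descending by population (copy first so the input is untouched).
const sortedCities = [...cities].sort((x, y) => y.population - x.population);

// $skip: 1 then $limit: 2 -> drop the first, then keep the next two.
const skipThenLimit = sortedCities.slice(1).slice(0, 2);
// $limit: 2 then $skip: 1 -> keep the first two, then drop one of them.
const limitThenSkip = sortedCities.slice(0, 2).slice(1);

console.log(skipThenLimit.map((d) => d.city)); // [ 'C', 'B' ]
console.log(limitThenSkip.map((d) => d.city)); // [ 'C' ]
```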

Revisiting $first and $last

Lecture Video

$first and $last are group operators.

db.zips.aggregate([
    /* get the population of every city in every state */
    {$group:
     {
	 _id: {state:"$state", city:"$city"},
	 population: {$sum:"$pop"},
     }
    },
     /* sort by state, population */
    {$sort: 
     {"_id.state":1, "population":-1}
    },
    /* group by state, get the first item in each group */
    {$group: 
     {
	 _id:"$_id.state",
	 city: {$first: "$_id.city"},
	 population: {$first:"$population"}
     }
    },
    /* now sort by state again */
    {$sort:
     {"_id":1}
    }
])
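
The sort-then-$first pattern can be sketched in plain JavaScript. The data below is a hypothetical stand-in for the output of the first $group stage, and the population numbers are made up; the point is that after sorting by state and descending population, the first document seen for each state is its largest city.

```javascript
// Hypothetical output of the first $group stage (population per state and city).
const cityTotals = [
  { _id: { state: "NY", city: "BUFFALO" }, population: 300 },
  { _id: { state: "CA", city: "LOS ANGELES" }, population: 900 },
  { _id: { state: "NY", city: "NEW YORK" }, population: 800 },
  { _id: { state: "CA", city: "FRESNO" }, population: 100 },
];

// Sort by state ascending, then population descending.
const ordered = [...cityTotals].sort(
  (x, y) =>
    x._id.state.localeCompare(y._id.state) || y.population - x.population
);

// Group by state, keeping only the first document seen per state ($first).
const largestByState = new Map();
for (const doc of ordered) {
  if (!largestByState.has(doc._id.state)) {
    largestByState.set(doc._id.state, {
      _id: doc._id.state,
      city: doc._id.city,
      population: doc.population,
    });
  }
}

const largestCities = [...largestByState.values()];
console.log(largestCities.map((d) => d._id + ": " + d.city));
// [ 'CA: LOS ANGELES', 'NY: NEW YORK' ]
```

Without the sort, "first" would simply mean whichever document happened to arrive first, which is why $first and $last are only useful after a $sort.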

Using $unwind

Lecture Video

use agg;
db.items.drop();
db.items.insert({_id:'nail', 'attributes':['hard', 'shiny', 'pointy', 'thin']});
db.items.insert({_id:'hammer', 'attributes':['heavy', 'black', 'blunt']});
db.items.insert({_id:'screwdriver', 'attributes':['long', 'black', 'flat']});
db.items.insert({_id:'rock', 'attributes':['heavy', 'rough', 'roundish']});
db.items.aggregate([{$unwind:"$attributes"}]);

{ "_id" : "nail", "attributes" : "hard" }
{ "_id" : "nail", "attributes" : "shiny" }
{ "_id" : "nail", "attributes" : "pointy" }
{ "_id" : "nail", "attributes" : "thin" }
{ "_id" : "hammer", "attributes" : "heavy" }
{ "_id" : "hammer", "attributes" : "black" }
{ "_id" : "hammer", "attributes" : "blunt" }
{ "_id" : "screwdriver", "attributes" : "long" }
{ "_id" : "screwdriver", "attributes" : "black" }
{ "_id" : "screwdriver", "attributes" : "flat" }
{ "_id" : "rock", "attributes" : "heavy" }
{ "_id" : "rock", "attributes" : "rough" }
{ "_id" : "rock", "attributes" : "roundish" }
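
The behavior shown above can be mimicked in a few lines of plain JavaScript. The unwind helper is a hypothetical sketch that assumes the field is always an array; the real $unwind stage also has rules for missing or non-array fields that are not modeled here.

```javascript
// Emit one copy of the document per element of doc[field].
function unwind(docs, field) {
  return docs.flatMap((doc) =>
    doc[field].map((value) => ({ ...doc, [field]: value }))
  );
}

const tools = [
  { _id: "nail", attributes: ["hard", "shiny", "pointy", "thin"] },
  { _id: "hammer", attributes: ["heavy", "black", "blunt"] },
  { _id: "screwdriver", attributes: ["long", "black", "flat"] },
  { _id: "rock", attributes: ["heavy", "rough", "roundish"] },
];

const unwound = unwind(tools, "attributes");
console.log(unwound.length); // 13
console.log(unwound[0]);     // { _id: 'nail', attributes: 'hard' }
```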

$unwind example

Lecture Video

use blog;
db.posts.aggregate([
    /* unwind by tags */
    {"$unwind":"$tags"},
    /* now group by tags, counting each tag */
    {"$group": 
     {"_id":"$tags",
      "count":{$sum:1}
     }
    },
    /* sort by popularity */
    {"$sort":{"count":-1}},
    /* show me the top 10 */
    {"$limit": 10},
    /* change the name of _id to be tag */
    {"$project":
     {_id:0,
      'tag':'$_id',
      'count' : 1
     }
    }
])

Quiz: Which operator reverses the effects of an $unwind? $push

Double $unwind

Lecture Video

Two $unwind stages in a row create a Cartesian product of the two arrays, with the rest of the document repeated for each combination.

use agg;
db.inventory.drop();
db.inventory.insert({'name':"Polo Shirt", 'sizes':["Small", "Medium", "Large"], 'colors':['navy', 'white', 'orange', 'red']})
db.inventory.insert({'name':"T-Shirt", 'sizes':["Small", "Medium", "Large", "X-Large"], 'colors':['navy', "black",  'orange', 'red']})
db.inventory.insert({'name':"Chino Pants", 'sizes':["32x32", "31x30", "36x32"], 'colors':['navy', 'white', 'orange', 'violet']})

> db.inventory.find()
{ "_id" : ObjectId("552ef5ecdf3c683c8c1fd66f"), "name" : "Polo Shirt", "sizes" : [ "Small", "Medium", "Large" ], "colors" : [ "navy", "white", "orange", "red" ] }
{ "_id" : ObjectId("552ef5ecdf3c683c8c1fd670"), "name" : "T-Shirt", "sizes" : [ "Small", "Medium", "Large", "X-Large" ], "colors" : [ "navy", "black", "orange", "red" ] }
{ "_id" : ObjectId("552ef5eddf3c683c8c1fd671"), "name" : "Chino Pants", "sizes" : [ "32x32", "31x30", "36x32" ], "colors" : [ "navy", "white", "orange", "violet" ] }

db.inventory.aggregate([
    {$unwind: "$sizes"},
    {$unwind: "$colors"},
    {$group: 
     {
	'_id': {'size':'$sizes', 'color':'$colors'},
	'count' : {'$sum':1}
     }
    }
])
{ "_id" : { "size" : "31x30", "color" : "orange" }, "count" : 1 }
{ "_id" : { "size" : "Medium", "color" : "navy" }, "count" : 2 }
{ "_id" : { "size" : "31x30", "color" : "navy" }, "count" : 1 }
{ "_id" : { "size" : "32x32", "color" : "violet" }, "count" : 1 }
{ "_id" : { "size" : "36x32", "color" : "navy" }, "count" : 1 }
{ "_id" : { "size" : "32x32", "color" : "white" }, "count" : 1 }
{ "_id" : { "size" : "32x32", "color" : "navy" }, "count" : 1 }
{ "_id" : { "size" : "X-Large", "color" : "orange" }, "count" : 1 }
{ "_id" : { "size" : "Medium", "color" : "red" }, "count" : 2 }
{ "_id" : { "size" : "Large", "color" : "orange" }, "count" : 2 }
{ "_id" : { "size" : "X-Large", "color" : "black" }, "count" : 1 }
{ "_id" : { "size" : "X-Large", "color" : "navy" }, "count" : 1 }
{ "_id" : { "size" : "Small", "color" : "black" }, "count" : 1 }
{ "_id" : { "size" : "Large", "color" : "navy" }, "count" : 2 }
{ "_id" : { "size" : "Medium", "color" : "black" }, "count" : 1 }
{ "_id" : { "size" : "Small", "color" : "navy" }, "count" : 2 }
{ "_id" : { "size" : "31x30", "color" : "white" }, "count" : 1 }
{ "_id" : { "size" : "Large", "color" : "white" }, "count" : 1 }
{ "_id" : { "size" : "Large", "color" : "red" }, "count" : 2 }
{ "_id" : { "size" : "36x32", "color" : "orange" }, "count" : 1 }
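
The Cartesian product can be reproduced in plain JavaScript with the inventory documents from above (ObjectId values omitted). The helpers unwindField and countPair are hypothetical names for this sketch, and unwindField assumes the field is always an array.

```javascript
// Same idea as $unwind: one output document per array element.
const unwindField = (docs, field) =>
  docs.flatMap((doc) => doc[field].map((v) => ({ ...doc, [field]: v })));

const inventory = [
  { name: "Polo Shirt", sizes: ["Small", "Medium", "Large"],
    colors: ["navy", "white", "orange", "red"] },
  { name: "T-Shirt", sizes: ["Small", "Medium", "Large", "X-Large"],
    colors: ["navy", "black", "orange", "red"] },
  { name: "Chino Pants", sizes: ["32x32", "31x30", "36x32"],
    colors: ["navy", "white", "orange", "violet"] },
];

// Two unwinds in a row: every (size, color) pair per product.
const pairs = unwindField(unwindField(inventory, "sizes"), "colors");
console.log(pairs.length); // 3*4 + 4*4 + 3*4 = 40

// Count how many products offer a given (size, color) pair.
const countPair = (size, color) =>
  pairs.filter((d) => d.sizes === size && d.colors === color).length;
console.log(countPair("Medium", "navy"));  // 2
console.log(countPair("Medium", "black")); // 1
```

The two counts agree with the corresponding lines of the shell output above: Medium/navy is offered by both shirts, Medium/black only by the T-Shirt.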

Quiz: Can you reverse the effects of a double $unwind (2 unwinds in a row) in our inventory collection (shown in the lesson ) with the $push operator?

db.inventory.aggregate([
    {$unwind: "$sizes"},
    {$unwind: "$colors"},
    /* create the color array */
    {$group: 
     {
	'_id': {name:"$name",size:"$sizes"},
	 'colors': {$push: "$colors"},
     }
    },
    /* create the size array */
    {$group: 
     {
	'_id': {'name':"$_id.name",
		'colors' : "$colors"},
	 'sizes': {$push: "$_id.size"}
     }
    },
    /* reshape for beauty */
    {$project: 
     {
	 _id:0,
	 "name":"$_id.name",
	 "sizes":1,
	 "colors": "$_id.colors"
     }
    }
])
{ "sizes" : [ "Medium", "Large", "Small" ], "name" : "Polo Shirt", "colors" : [ "navy", "white", "orange", "red" ] }
{ "sizes" : [ "Large", "X-Large", "Small", "Medium" ], "name" : "T-Shirt", "colors" : [ "navy", "black", "orange", "red" ] }
{ "sizes" : [ "36x32", "31x30", "32x32" ], "name" : "Chino Pants", "colors" : [ "navy", "white", "orange", "violet" ] }

Simple grouping reverting with $addToSet

db.inventory.aggregate([
    {$unwind: "$sizes"},
    {$unwind: "$colors"},
    {$group: 
     {
	'_id': "$name",
	 'sizes': {$addToSet: "$sizes"},
	 'colors': {$addToSet: "$colors"},
     }
    }
])
{ "_id" : "Chino Pants", "sizes" : [ "36x32", "31x30", "32x32" ], "colors" : [ "violet", "white", "orange", "navy" ] }
{ "_id" : "T-Shirt", "sizes" : [ "X-Large", "Large", "Medium", "Small" ], "colors" : [ "red", "black", "orange", "navy" ] }
{ "_id" : "Polo Shirt", "sizes" : [ "Large", "Medium", "Small" ], "colors" : [ "red", "white", "orange", "navy" ] }

Mapping between SQL and Aggregation

Lecture Video

SQL to Aggregation Mapping Chart

SQL	MongoDB
WHERE	$match
GROUP BY	$group
HAVING	$match
SELECT	$project
ORDER BY	$sort
LIMIT	$limit
SUM()	$sum
COUNT	$sum
JOIN	No direct corresponding operator; however, the `$unwind` operator allows for somewhat similar functionality, but with fields embedded within the document.

Some Common SQL examples

Lecture Video

SQL to Aggregation Mapping - Examples

Limitations of the Aggregation Framework

Lecture Video

by default, there is a 100MB limit for pipeline stages
- the allowDiskUse option gets around that 100MB limit
if the results are returned in one document, that document is limited to 16MB (this is the default in Python)
- an easy way around this one: set cursor equal to an empty document, cursor = {}, and the aggregation result comes back as a cursor that is no longer capped at 16MB
in a sharded environment:
- for a group by, a sort, or anything else that requires looking at all the data, the results from the multiple shards are brought back to the first shard

Alternatives to the aggregation framework:

map-reduce with Hadoop: via the Hadoop connector
map-reduce built into MongoDB: not recommended

Tip: a projection or a match can run in parallel on all the shards.

Aggregation Framework with the Java Driver

Lecture Video

Examples
