Skip to content

Latest commit

 

History

History
695 lines (609 loc) · 19.1 KB

week5.md

File metadata and controls

695 lines (609 loc) · 19.1 KB

Week 5: Aggregation Framework

Topics:

Goals, The Use Of The Pipeline, Comparison With SQL Facilities.

Simple Aggregation Example

Lecture Video

select manufacturer, count(*), from products where product group by manufacturer.
use agg
db.products.aggregate([
    {$group:
     {
	 _id:"$manufacturer",    //set the field '_id' to the field you want to group by
	 num_products:{$sum:1}   //define field 'num_products'
     }
    }
])

Quiz:

db.products.aggregate([{
  "$group":{
         "_id":"$category",
         "num_products":{"$sum":1}
   }
}])

The Aggregation Pipeline

Lecture Video

  • Each document in the array parameter to the aggregate function is a stage in the pipeline

  • pipeline stages from collection to results: collection -> $project -> $match -> $group -> $sort -> result

  • Stages (main operators):

    • $project select relevant fields and reshape doc in: 1 / out: 1
    • $match filters docs; in: n / out: n-x
    • $group aggregates; Reduces the number of docs in: n / out: n-x
    • $sort sorts the docs in: 1 / out: 1
    • $skip skips docs in: n / out: n-x
    • $limit limits returned docs in: n / out: n-x
    • $unwind explodes arrays - Produces a doc for each value in an array-key-field with everything else repeated in: n / out: n+x
  • Stages (plus operators):

    • $out allow put docs to another collection in: 1 / out: 1
    • $redact
    • $geonear limit doc by the pipeline stage based on location
  • Each stage can exist more than once in a pipeline

Simple Example Explained

Lecture Video

Quiz:

db.stuff.find()
{ "_id" : ObjectId("50b26f9d80a78af03b5163c8"), "a" : 1, "b" : 1, "c" : 1 }
{ "_id" : ObjectId("50b26fb480a78af03b5163c9"), "a" : 2, "b" : 2, "c" : 1 }
{ "_id" : ObjectId("50b26fbf80a78af03b5163ca"), "a" : 3, "b" : 3, "c" : 1 }
{ "_id" : ObjectId("50b26fcd80a78af03b5163cb"), "a" : 3, "b" : 3, "c" : 2 }
{ "_id" : ObjectId("50b26fd380a78af03b5163cc"), "a" : 3, "b" : 5, "c" : 3 }

db.stuff.aggregate([{$group:{_id:'$c'}}])

How many documents will be in the result set from aggregate?

  • 3

Compound Grouping

Lecture Video

select manufacturer, category, count(*) from products group by manufacturer, category.
  • Use a compound id with manufacturer and category
db.products.aggregate([{
  "$group":{
        "_id":{                     //create new key or _id for each doc 
           "maker":"$manufacturer"
           "category":"$category"
        },
        "num_products":{"$sum":1}
  }
}])

Using a document for _id

Lecture Video

Aggregation Expressions

Lecture Video

These are the expression that you can use aggregation grouping stage of the pipeline $group :

  • $sun add one to a key mySum: {$sum:1} or sum up keys sum_prices:{$sum:”$price”}
  • $avg, $min, $max average, minimum or maximum value of a key
  • $push build the arrays
  • $addToSet build the arrays, no duplicate a element - uniquely
  • $first only useful after a sort
  • $last only useful after a sort

Using $sum

Lecture Video

db.products.aggregate([  
   {  
      "$group":{  
         "_id":{  
            "maker":"$manufacturer"
         },
         "sum_prices":{ "$sum":"$prices" }
      }
   }
])

Quiz:

db.zips.aggregate([  
   {  
      "$group":{  
         "_id":"$state",
         "population":{ $sum:"$pop" }
      }
   }
])

Using $avg

Lecture Video

db.products.aggregate([  
   {  
      "$group":{  
         "_id":{  
            "category":"$category"
         },
         "avg_prices":{ "$avg":"$prices" }
      }
   }
])

Quiz:

db.zips.aggregate([
        {$group:
         {
    	 _id: {
           "state":"$state", "zip":"$zip"
    	 },
    	 avg_pop:{$avg:"$pop"}
         }
        }
    ])

Using $addToSet

Lecture Video

  • Create arrays for each doc
  • $addToSet adds it only if it's not already there.
db.products.aggregate([  
   {  
      "$group":{  
         "_id":{  
            "maker":"$manufacturer"
         },
         "categories":{ "$addToSet":"$category" }
      }
   }
])

Quiz:

db.zips.aggregate([  
   {  
      "$group":{  
         "_id":"$city",
         "postal_codes":{ "$addToSet":"$_id" }
      }
   }
])

Using $push

Lecture Video

$push is very similar to $addToSet, except that $push does not guarantee that it adds each item only once. It doesn't look through to make sure it's not already there.

db.products.aggregate([  
   {  
      "$group":{  
         "_id":{  
            "maker":"$manufacturer"
         },
         "categories":{ "$push":"$category" }
      }
   }
])

Using $max and $min

Lecture Video

db.products.aggregate([  
   {  
      "$group":{  
         "_id":{  
            "maker":"$manufacturer"
         },
         "maxprice":{ "$max":"$price" }
      }
   }
])

Quiz:

db.zips.aggregate([  
   {  
      "$group":{  
         "_id":"$state",
         "pop":{ "$max":"$pop" }
      }
   }
])

Double $group stages

Lecture Video

db.grades.aggregate([  
   {  
      '$group':{  
         _id:{  
            class_id:"$class_id",
            student_id:"$student_id"
         },
         'average':{  
            "$avg":"$score"
         }
      }
   },//grouping one
   {  
      '$group':{  
         _id:"$_id.class_id",
         'average':{  
            "$avg":"$average"
         }
      }
   } //grouping two from previous $group stage 
])

Using $project

Lecture Video

It's a 1:1 stage of the pipeline, so for every document that comes into the project phase, one document will leave the project phase.

You can do things like:

  • remove a key: if you don't mention a key, it is not included, except for _id, which must be explicitly suppressed {$project: {_id: 0, ...
  • add a new key: also possible to create new subdocuments
  • reshape the keys: you could take a key and decide to put it into a subdocument with another key.
  • keep keys: {$project: {myKey: 1, ...
  • rename keys / use functions: $toUpper, $toLower, $add, $multiply

Simple function of keys:

  • $toUpper
  • $toLower
  • $add add something to the value
  • $multiply multiply by number
db.products.aggregate([
    {$project:
     {
	 _id:0,
	 'maker': {$toLower:"$manufacturer"},
	 'details': {'category': "$category",
		     'price' : {"$multiply":["$price",10]}
		    },
	 'item':'$name'
     }
    }
])

Quiz:

db.zips.aggregate([
    {$project:
     {
     _id:0,                      //remove _id
     'city': {$toLower:"$city"},
     'pop':1,                    //show pop
     'state':1,                  //show state
     'zip':'$_id'                //ref _id
     }
    }
])

Using $match

Lecture Video

$match phase performs a filtering which is an n:1 operation. Match will go through each document and see if the document matches your criteria.

There are two reasons why you might want to match.

  • pre agg filter
  • filter the results
db.zips.aggregate([
    {$match:
     {
	 state:"CA"
     }
    },
    {$group:
     {
	 _id: "$city",
	 population: {$sum:"$pop"},
	 zip_codes: {$addToSet: "$_id"}
     }
    },
    {$project: //reshape the doc above grouped and matched doc
     {
	 _id: 0,
	 city: "$_id",
	 population: 1,
	 zip_codes:1
     }
    }
])

One thing to note about $match (and $sort) is that they can use indexes, but only if done at the beginning of the aggregation pipeline.

You can read the documentation here.

Quiz:

db.zips.aggregate([
    {$match:
     {
	 pop: {$gt:100000}
     }
    }
])

Using $sort

Lecture Video

$sort supports both disk and memory bases sorting.

  • by default the agg framework will try to sort in memory.
  • there is a limit of 100MB for any given pipeline stage

before or after the grouping stage (there are some good reasons!)

db.zips.aggregate([
    {$match:
     {
	 state:"NY"
     }
    },
    {$group:
     {
	 _id: "$city",
	 population: {$sum:"$pop"},
     }
    },
    {$project:
     {
	 _id: 0,
	 city: "$_id",
	 population: 1,
     }
    },
    {$sort:
     {
	 population:-1
     }
    }
])

Quiz:

db.zips.aggregate([
    {$sort:
     {
	 state:1, city:1
     }
    }
])

Using $limit and $skip

Lecture Video

So it doesn't make any sense to skip and limit, unless you first sort. First $skip – then $limit (order of the stages in the pipeline matter)

db.zips.aggregate([
    {$match:
     {
	 state:"NY"
     }
    },
    {$group:
     {
	 _id: "$city",
	 population: {$sum:"$pop"},
     }
    },
    {$project:
     {
	 _id: 0,
	 city: "$_id",
	 population: 1,
     }
    },
    {$sort:
     {
	 population:-1
     }
    },
    {$skip: 10},
    {$limit: 5}
])

Revisiting $first and $last

Lecture Video

$first and $last are group operators.

db.zips.aggregate([
    /* get the population of every city in every state */
    {$group:
     {
	 _id: {state:"$state", city:"$city"},
	 population: {$sum:"$pop"},
     }
    },
     /* sort by state, population */
    {$sort: 
     {"_id.state":1, "population":-1}
    },
    /* group by state, get the first item in each group */
    {$group: 
     {
	 _id:"$_id.state",
	 city: {$first: "$_id.city"},
	 population: {$first:"$population"}
     }
    },
    /* now sort by state again */
    {$sort:
     {"_id":1}
    }
])

Using $unwind

Lecture Video

use agg;
db.items.drop();
db.items.insert({_id:'nail', 'attributes':['hard', 'shiny', 'pointy', 'thin']});
db.items.insert({_id:'hammer', 'attributes':['heavy', 'black', 'blunt']});
db.items.insert({_id:'screwdriver', 'attributes':['long', 'black', 'flat']});
db.items.insert({_id:'rock', 'attributes':['heavy', 'rough', 'roundish']});
db.items.aggregate([{$unwind:"$attributes"}]);

{ "_id" : "nail", "attributes" : "hard" }
{ "_id" : "nail", "attributes" : "shiny" }
{ "_id" : "nail", "attributes" : "pointy" }
{ "_id" : "nail", "attributes" : "thin" }
{ "_id" : "hammer", "attributes" : "heavy" }
{ "_id" : "hammer", "attributes" : "black" }
{ "_id" : "hammer", "attributes" : "blunt" }
{ "_id" : "screwdriver", "attributes" : "long" }
{ "_id" : "screwdriver", "attributes" : "black" }
{ "_id" : "screwdriver", "attributes" : "flat" }
{ "_id" : "rock", "attributes" : "heavy" }
{ "_id" : "rock", "attributes" : "rough" }
{ "_id" : "rock", "attributes" : "roundish" }

$unwind example

Lecture Video

use blog;
db.posts.aggregate([
    /* unwind by tags */
    {"$unwind":"$tags"},
    /* now group by tags, counting each tag */
    {"$group": 
     {"_id":"$tags",
      "count":{$sum:1}
     }
    },
    /* sort by popularity */
    {"$sort":{"count":-1}},
    /* show me the top 10 */
    {"$limit": 10},
    /* change the name of _id to be tag */
    {"$project":
     {_id:0,
      'tag':'$_id',
      'count' : 1
     }
    }
])

Quiz: Reverse the effects of an $unwind? $push

Double $unwind

Lecture Video

And therefore create a Cartesian product of the two arrays as well as the rest of the document.

use agg;
db.inventory.drop();
db.inventory.insert({'name':"Polo Shirt", 'sizes':["Small", "Medium", "Large"], 'colors':['navy', 'white', 'orange', 'red']})
db.inventory.insert({'name':"T-Shirt", 'sizes':["Small", "Medium", "Large", "X-Large"], 'colors':['navy', "black",  'orange', 'red']})
db.inventory.insert({'name':"Chino Pants", 'sizes':["32x32", "31x30", "36x32"], 'colors':['navy', 'white', 'orange', 'violet']})
> db.inventory.find()
{ "_id" : ObjectId("552ef5ecdf3c683c8c1fd66f"), "name" : "Polo Shirt", "sizes" : [ "Small", "Medium", "Large" ], "colors" : [ "navy", "white", "orange", "red" ] }
{ "_id" : ObjectId("552ef5ecdf3c683c8c1fd670"), "name" : "T-Shirt", "sizes" : [ "Small", "Medium", "Large", "X-Large" ], "colors" : [ "navy", "black", "orange", "red" ] }
{ "_id" : ObjectId("552ef5eddf3c683c8c1fd671"), "name" : "Chino Pants", "sizes" : [ "32x32", "31x30", "36x32" ], "colors" : [ "navy", "white", "orange", "violet" ] }
db.inventory.aggregate([
    {$unwind: "$sizes"},
    {$unwind: "$colors"},
    {$group: 
     {
	'_id': {'size':'$sizes', 'color':'$colors'},
	'count' : {'$sum':1}
     }
    }
])
{ "_id" : { "size" : "31x30", "color" : "orange" }, "count" : 1 }
{ "_id" : { "size" : "Medium", "color" : "navy" }, "count" : 2 }
{ "_id" : { "size" : "31x30", "color" : "navy" }, "count" : 1 }
{ "_id" : { "size" : "32x32", "color" : "violet" }, "count" : 1 }
{ "_id" : { "size" : "36x32", "color" : "navy" }, "count" : 1 }
{ "_id" : { "size" : "32x32", "color" : "white" }, "count" : 1 }
{ "_id" : { "size" : "32x32", "color" : "navy" }, "count" : 1 }
{ "_id" : { "size" : "X-Large", "color" : "orange" }, "count" : 1 }
{ "_id" : { "size" : "Medium", "color" : "red" }, "count" : 2 }
{ "_id" : { "size" : "Large", "color" : "orange" }, "count" : 2 }
{ "_id" : { "size" : "X-Large", "color" : "black" }, "count" : 1 }
{ "_id" : { "size" : "X-Large", "color" : "navy" }, "count" : 1 }
{ "_id" : { "size" : "Small", "color" : "black" }, "count" : 1 }
{ "_id" : { "size" : "Large", "color" : "navy" }, "count" : 2 }
{ "_id" : { "size" : "Medium", "color" : "black" }, "count" : 1 }
{ "_id" : { "size" : "Small", "color" : "navy" }, "count" : 2 }
{ "_id" : { "size" : "31x30", "color" : "white" }, "count" : 1 }
{ "_id" : { "size" : "Large", "color" : "white" }, "count" : 1 }
{ "_id" : { "size" : "Large", "color" : "red" }, "count" : 2 }
{ "_id" : { "size" : "36x32", "color" : "orange" }, "count" : 1 }

Quiz: Can you reverse the effects of a double $unwind (2 unwinds in a row) in our inventory collection (shown in the lesson ) with the $push operator?

db.inventory.aggregate([
    {$unwind: "$sizes"},
    {$unwind: "$colors"},
    /* create the color array */
    {$group: 
     {
	'_id': {name:"$name",size:"$sizes"},
	 'colors': {$push: "$colors"},
     }
    },
    /* create the size array */
    {$group: 
     {
	'_id': {'name':"$_id.name",
		'colors' : "$colors"},
	 'sizes': {$push: "$_id.size"}
     }
    },
    /* reshape for beauty */
    {$project: 
     {
	 _id:0,
	 "name":"$_id.name",
	 "sizes":1,
	 "colors": "$_id.colors"
     }
    }
])
{ "sizes" : [ "Medium", "Large", "Small" ], "name" : "Polo Shirt", "colors" : [ "navy", "white", "orange", "red" ] }
{ "sizes" : [ "Large", "X-Large", "Small", "Medium" ], "name" : "T-Shirt", "colors" : [ "navy", "black", "orange", "red" ] }
{ "sizes" : [ "36x32", "31x30", "32x32" ], "name" : "Chino Pants", "colors" : [ "navy", "white", "orange", "violet" ] }

Simple grouping reverting with $addToSet

db.inventory.aggregate([
    {$unwind: "$sizes"},
    {$unwind: "$colors"},
    {$group: 
     {
	'_id': "$name",
	 'sizes': {$addToSet: "$sizes"},
	 'colors': {$addToSet: "$colors"},
     }
    }
])
{ "_id" : "Chino Pants", "sizes" : [ "36x32", "31x30", "32x32" ], "colors" : [ "violet", "white", "orange", "navy" ] }
{ "_id" : "T-Shirt", "sizes" : [ "X-Large", "Large", "Medium", "Small" ], "colors" : [ "red", "black", "orange", "navy" ] }
{ "_id" : "Polo Shirt", "sizes" : [ "Large", "Medium", "Small" ], "colors" : [ "red", "white", "orange", "navy" ] }

Mapping between SQL and Aggregation

Lecture Video

SQL to Aggregation Mapping Chart

SQL MongoDB
WHERE $match
GROUP BY $group
HAVING $match
SELECT $project
ORDER BY $sort
LIMIT $limit
SUM() $sum
COUNT $sum
JOIN No direct corresponding operator; however, the $unwind operator allows for somewhat similar functionality, but with fields embedded within the document.

Some Common SQL examples

Lecture Video

SQL to Aggregation Mapping - Examples

Limitations of the Aggregation Framework

Lecture Video

  • by default, 100MB limit for pipeline stages
    • by allowDiskUse option, get around that 100MB limit
  • to return the results in one document can only be a 16MB limit (by default in python)
    • easy way around this one, set cursor equal to empty document cursor = {} and you can have aggregation result that have no limit.
  • in sharded environment:
    • group by or a sort or anything that requires looking at all the data, then will be brought back to the first shard from multiple shards.

Alternatives to aggregation framework:

  • map-redude built from hadoop: hadoop connector
  • map-reduce built from mongodb: don't recommended

Tips: Projection or a match, could go in parallel with all the shards.

Aggregation Framework with the Java Driver

Lecture Video

Examples