In [1]:
import pandas as pd

## Question 1
Make sure that the theme is also included in dataset `sets`:

1. Load the CSV data into dataframes. Be sure to load every file in a separate variable so you can analyze them easily.  
   The URLs for the files are as follows, one for each table:
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/colors.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/elements.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventories.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_minifigs.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_parts.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_sets.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/minifigs.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/parts.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/part_categories.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/part_relationships.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/sets.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/themes_hierachie.csv`
2. Link `themes` to `sets`. The hierarchy has already been resolved for you in the `themes` table
3. Create a new field (variable, column) that indicates with 0/1 if a set belongs to the Creator theme

When you read all files at once in a single cell, you might see an error because GitHub throttles the requests.
The solution is to split the script into multiple cells, each of which loads part of the data.
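
As an alternative to splitting cells by hand, the loading can be sketched as a loop that pauses between requests. This is only a sketch: the helper name `load_tables`, its `reader` and `pause_seconds` parameters, and the assumption that a short pause is enough to avoid throttling are all mine, not part of the exercise. It returns a dictionary, so each table still has to be assigned to its own variable afterwards.

```python
import time

import pandas as pd

BASE_URL = "https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/"


def load_tables(names, reader=pd.read_csv, pause_seconds=1.0):
    """Hypothetical helper: load one dataframe per table name, pausing between requests."""
    tables = {}
    for i, name in enumerate(names):
        if i > 0:
            time.sleep(pause_seconds)  # assumed to be long enough; tune if throttling persists
        tables[name] = reader(f"{BASE_URL}{name}.csv")
    return tables


# Usage (needs network access):
# tables = load_tables(["colors", "sets"])
# colors, sets = tables["colors"], tables["sets"]
```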

In [2]:
# Reading data from a CSV file: pd.read_csv
# All tables go into their own separate dataframe and variable
colors = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/colors.csv")
elements = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/elements.csv")
inventories = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventories.csv")
inventory_minifigs = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_minifigs.csv")
inventory_parts = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_parts.csv")
inventory_sets = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_sets.csv")
minifigs = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/minifigs.csv")
parts = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/parts.csv")
part_categories = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/part_categories.csv")
part_relationships = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/part_relationships.csv")
sets = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/sets.csv")
themes = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/themes_hierachie.csv")

In [3]:
# link themes to sets
themes_sets = themes.merge(sets, left_on='id', right_on='theme_id')
themes_sets.head(10) # Display top 10 rows to show the result

Unnamed: 0,id,name_x,parent_name,set_num,name_y,year,theme_id,num_parts
0,1,Technic,Technic,001-1,Gears,1965,1,43
1,1,Technic,Technic,002-1,4.5V Samsonite Gears Motor Set,1965,1,3
2,1,Technic,Technic,1030-1,TECHNIC I: Simple Machines Set,1985,1,191
3,1,Technic,Technic,1038-1,ERBIE the Robo-Car,1985,1,120
4,1,Technic,Technic,1039-1,Manual Control Set 1,1986,1,39
5,1,Technic,Technic,1168-1,Battery Box,1986,1,1
6,1,Technic,Technic,1314-1,Stop bush / Small pulley,1987,1,210
7,1,Technic,Technic,1315-1,Piston Rod,1987,1,50
8,1,Technic,Technic,1316-1,Connector peg,1987,1,150
9,1,Technic,Technic,1317-1,TECHNIC Chainlinks,1987,1,350
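
The merge above is an inner join, so a set whose `theme_id` has no match in `themes` would silently disappear. A possible way to check for that, sketched on invented toy data (the set `X-1` and theme id `99` are made up), is to merge from the `sets` side with `how="left"`, let pandas verify the key relationship with `validate`, and flag unmatched rows with `indicator`:

```python
import pandas as pd

# Toy stand-ins for the real tables; the last set has a theme_id that does not exist.
themes_toy = pd.DataFrame({
    "id": [1, 22],
    "name": ["Technic", "Creator"],
    "parent_name": ["Technic", "Creator"],
})
sets_toy = pd.DataFrame({
    "set_num": ["001-1", "10664-1", "X-1"],
    "name": ["Gears", "Creative Tower", "Made-up set"],
    "theme_id": [1, 22, 99],
})

merged = sets_toy.merge(
    themes_toy,
    left_on="theme_id",
    right_on="id",
    how="left",              # keep every set, even without a matching theme
    validate="many_to_one",  # raises if a theme id occurs more than once
    indicator=True,          # adds a _merge column: "both" or "left_only"
    suffixes=("_set", "_theme"),
)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["set_num", "theme_id"]])
```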


### Before we continue
There are apparently two "theme" columns in the table `themes`:

* name (now renamed to `name_x` to avoid ambiguous column names)
* parent_name

From the data model we know already that this was (once) a parent-child hierarchy - let's do some further analysis:

In [4]:
print("Number of sets with parent_name Creator: ", themes_sets.query('parent_name == "Creator"')["id"].count()) # Count the number of sets with parent_name Creator
print("Number of sets with name_x Creator: ", themes_sets.query('name_x == "Creator"')["id"].count()) # Count the number of sets with name_x Creator
print("Values for parent_name when child_name is Creator: ", themes_sets.query('name_x == "Creator"')["name_x"].value_counts())
print("Values for child name when parent_name is Creator: ", themes_sets.query('parent_name == "Creator"')["name_x"].value_counts())


Number of sets with parent_name Creator:  540
Number of sets with name_x Creator:  124
Values for parent_name when child_name is Creator:  name_x
Creator    124
Name: count, dtype: int64
Values for child name when parent_name is Creator:  name_x
Creator 3-in-1    182
Basic Set         120
Creator            92
Creator Expert     58
Early Creator      29
Supplemental       22
Food & Drink       16
Basic Model         8
Creature            5
Construction        3
Traffic             3
Castle              1
Building            1
Name: count, dtype: int64


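Based on this analysis, we can state that:

* The third line selects `name_x` instead of `parent_name`, so it only repeats the count of 124 and does not show any parent names
* Of the 124 sets with name_x 'Creator', only 92 show up under parent_name 'Creator' in the last output; the other 32 have a different (or missing) parent_name
* When parent_name is 'Creator', name_x can be different values, for example:
  * Early Creator
  * Creator Expert
  * Creator 3-in-1
  * Basic Set

We can safely assume that all sets with parent_name 'Creator' are still "Creator" sets, but have some sub-theme.
Therefore, the filter should be on `parent_name`.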
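
To see which parents belong to a given theme name, the `value_counts` has to be taken on `parent_name`. A small sketch on invented toy data (the "Duplo" parent is made up for illustration and is not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "name_x": ["Creator", "Creator", "Creator 3-in-1", "Creator"],
    "parent_name": ["Creator", "Creator", "Creator", "Duplo"],
})

# Filter on the child name, then count the *parent* names.
parents_of_creator = toy.query('name_x == "Creator"')["parent_name"].value_counts()
print(parents_of_creator)
```

On the real `themes_sets` table the same expression would list every parent under which a theme called "Creator" occurs.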

In [5]:
# Create a new field that indicates with 0/1 if a set belongs to the Creator theme
themes_sets['is_creator'] = (themes_sets['parent_name'] == 'Creator').astype(int)

In [6]:
themes_sets.query('is_creator == 1').head(5) # 5 examples for rows with creator theme

Unnamed: 0,id,name_x,parent_name,set_num,name_y,year,theme_id,num_parts,is_creator
580,22,Creator,Creator,10664-1,Creative Tower,2013,22,1600,1
581,22,Creator,Creator,11938-1,Robot,2020,22,45,1
582,22,Creator,Creator,11939-1,Octopus,2020,22,63,1
583,22,Creator,Creator,11940-1,Fortress,2020,22,52,1
584,22,Creator,Creator,11941-1,Frog,2020,22,56,1


In [7]:
themes_sets.query('is_creator == 0').head(5) # 5 examples for rows without creator theme

Unnamed: 0,id,name_x,parent_name,set_num,name_y,year,theme_id,num_parts,is_creator
0,1,Technic,Technic,001-1,Gears,1965,1,43,0
1,1,Technic,Technic,002-1,4.5V Samsonite Gears Motor Set,1965,1,3,0
2,1,Technic,Technic,1030-1,TECHNIC I: Simple Machines Set,1985,1,191,0
3,1,Technic,Technic,1038-1,ERBIE the Robo-Car,1985,1,120,0
4,1,Technic,Technic,1039-1,Manual Control Set 1,1986,1,39,0


In [8]:
themes_sets.query('is_creator == 0 and parent_name == "Creator"') # Check if there are rows with is_creator == 0 and parent_name == 'Creator'

Unnamed: 0,id,name_x,parent_name,set_num,name_y,year,theme_id,num_parts,is_creator


Question: What % of all sets do you own?

In [9]:
num_creator_sets = themes_sets['is_creator'].sum()
print("Number of sets that belong to the Creator theme:", num_creator_sets)
num_total_sets = len(themes_sets)
print("Total number of sets:", num_total_sets)
print(f"Percentage of total sets owned: {num_creator_sets / num_total_sets * 100:.2f}")

Number of sets that belong to the Creator theme: 540
Total number of sets: 17835
Percentage of total sets owned: 3.03


Three things to note:

1. Because `is_creator` is a number (0 or 1), we can sum it to get the number of Creator sets
2. I've used a *format string*. Basically the following three expressions yield the same result:
```python
print("Perc of sets owned:", (num_creator_sets / num_total_sets * 100))
print("Perc of sets owned: {placeholder}".format( placeholder=(num_creator_sets / num_total_sets * 100))) # You can use any name instead of placeholder, of course
print(f"Perc of sets owned: {(num_creator_sets / num_total_sets * 100)}")
```
3. The `:.2f` signifies the number of decimal places that should be displayed in a floating point number. Instead of displaying `3.027754415475189` it now displays `3.03`.
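
A short illustration of the format specification, using the two counts printed above. The `%` variant is an addition of mine: it multiplies by 100 and appends the percent sign itself.

```python
num_creator_sets = 540
num_total_sets = 17835

share = num_creator_sets / num_total_sets * 100
two_decimals = f"{share:.2f}"                             # fixed-point, 2 decimals
as_percent = f"{num_creator_sets / num_total_sets:.2%}"   # scales by 100 and adds "%"

print(two_decimals)  # 3.03
print(as_percent)    # 3.03%
```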


## Question 2:

Filter the dataset based on your created variable `is_creator`

In [10]:
creator_sets = themes_sets.query('is_creator == 1')

Link the data sources together so that you know which parts it contains for all sets together. 

You need the following tables for this:

* Inventories
* Inventory_parts
* Parts
* Colors

This table is unique on `part_num` and `color_id`

Based on the *rebrickable* datamodel, let's create a join. We will use method chaining in parentheses to create a clear and readable statement for this:

In [11]:
all_my_parts = (creator_sets
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
    .merge(inventory_parts, left_on='id_i', right_on='inventory_id', suffixes=(None, '_ip'))
    # .merge(parts, left_on='part_num', right_on='part_num') # I don't think we will need parts for now
    .merge(colors, left_on='color_id', right_on='id', suffixes=(None, '_c'))
)

How many parts do you have in the color "Red"?

In [12]:
# First guess: sum all quantities where the color is "red"
# This is a somewhat dangerous one: there could have been duplication along the way
all_my_parts.query('name == "Red"')["quantity"].sum()

26372

Let's double-check: it should be somewhat in line with (but not exactly the same as) the number of parts from `creator_sets`. Let's check this:

In [13]:
print("All my parts:", all_my_parts["quantity"].sum())
print("On set level:", creator_sets["num_parts"].sum())

All my parts: 280639
On set level: 255766


That sounds reasonable, albeit a little high:

* Lego puts some spare parts in every box, so the inventory total should be higher
* However, we have 540 sets, and there are 24,873 "spare" parts here
* Lego would then have added approximately 46 spare parts to every box, which is rather high
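
The arithmetic behind these bullets, using the totals printed above:

```python
parts_in_inventories = 280639  # sum of quantity in all_my_parts
parts_on_boxes = 255766        # sum of num_parts in creator_sets
num_sets = 540

surplus = parts_in_inventories - parts_on_boxes
surplus_per_set = surplus / num_sets

print(surplus)                    # 24873
print(round(surplus_per_set, 2))  # 46.06
```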

Let's double check if no unintended duplication is going on:

In [14]:
len(creator_sets)

540

In [15]:
len(creator_sets
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
)

564

In [16]:
len(creator_sets
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
    .merge(inventory_parts, left_on='id_i', right_on='inventory_id', suffixes=(None, '_ip'))
)

58206

In [17]:
len(creator_sets
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
    .merge(inventory_parts, left_on='id_i', right_on='inventory_id', suffixes=(None, '_ip'))
        # .merge(parts, left_on='part_num', right_on='part_num') # I don't think we will need parts for now
    .merge(colors, left_on='color_id', right_on='id', suffixes=(None, '_c'))
)

58206

Apparently, nothing goes wrong: 

* The number of rows increases predictably
* There are a few sets with multiple inventories (540 -> 564)
* There seem to be approximately 100 inventory_parts rows per inventory (564 -> 58K)
* Merging the colors in doesn't make any difference.

From a high level, nothing is wrong. 

However, we could have a separate look:

* When there are multiple *inventories* associated with one *set*, are lists of parts maybe reported twice?
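
One way to find the sets in question is to count inventories per `set_num` and keep the sets with more than one. A sketch on a toy extract (the three rows reuse inventory ids that appear in the outputs of this notebook):

```python
import pandas as pd

inventories_toy = pd.DataFrame({
    "id": [3217, 18231, 12223],
    "set_num": ["31015-1", "31015-1", "10664-1"],
    "version": [1, 2, 1],
})

inventories_per_set = inventories_toy.groupby("set_num")["id"].count()
sets_with_multiple = inventories_per_set[inventories_per_set > 1]
print(sets_with_multiple)
```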

In [18]:
(creator_sets
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
)

Unnamed: 0,id_s,name_x,parent_name,set_num,name_y,year,theme_id,num_parts,is_creator,id_i,version
0,22,Creator,Creator,10664-1,Creative Tower,2013,22,1600,1,12223,1
1,22,Creator,Creator,11938-1,Robot,2020,22,45,1,69864,1
2,22,Creator,Creator,11939-1,Octopus,2020,22,63,1,72642,1
3,22,Creator,Creator,11940-1,Fortress,2020,22,52,1,73478,1
4,22,Creator,Creator,11941-1,Frog,2020,22,56,1,76834,1
...,...,...,...,...,...,...,...,...,...,...,...
559,674,Early Creator,Creator,4906-1,Helicopter,2005,674,16,1,12591,1
560,674,Early Creator,Creator,5370-1,Large Make and Create Bucket with Special LEGO...,2005,674,0,1,10795,1
561,674,Early Creator,Creator,7830-1,Small Blue Bucket,2002,674,200,1,6120,1
562,674,Early Creator,Creator,K4103-1,Creator Bucket bundled with 4782 (TRU Exclusive),2005,674,0,1,104,1


A second look makes me wonder at the "version" column. There are *some* sets with a **version** that is equal to 2:

In [19]:
(creator_sets
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
)["version"].value_counts()

version
1    539
2     25
Name: count, dtype: int64

In [20]:
(creator_sets
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
).query("version == 2")

Unnamed: 0,id_s,name_x,parent_name,set_num,name_y,year,theme_id,num_parts,is_creator,id_i,version
198,37,Basic Set,Creator,5508-1,Deluxe Brick Box,2010,37,704,1,30080,2
267,48,Supplemental,Creator,6117-1,Doors and Windows,2008,48,100,1,74248,2
282,672,Creator 3-in-1,Creator,31004-1,Fierce Flyer,2013,672,166,1,27496,2
289,672,Creator 3-in-1,Creator,31010-1,Treehouse,2013,672,356,1,78334,2
293,672,Creator 3-in-1,Creator,31013-1,Red Thunder,2014,672,66,1,45633,2
296,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,18231,2
350,672,Creator 3-in-1,Creator,31070-1,Turbo Track Racer,2017,672,670,1,29830,2
365,672,Creator 3-in-1,Creator,31085-1,Mobile Stunt Show,2018,672,581,1,95283,2
390,672,Creator 3-in-1,Creator,31111-1,Cyber Drone,2021,672,113,1,88727,2
405,672,Creator 3-in-1,Creator,4838-1,Mini Vehicles,2008,672,79,1,29359,2


Let's zoom in on one with few parts: Emerald Express (`31015-1`):

In [21]:
(creator_sets
    .query("set_num == '31015-1'")
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
    .merge(inventory_parts, left_on='id_i', right_on='inventory_id', suffixes=(None, '_ip'))
        # .merge(parts, left_on='part_num', right_on='part_num') # I don't think we will need parts for now
    .merge(colors, left_on='color_id', right_on='id', suffixes=(None, '_c'))
)

Unnamed: 0,id_s,name_x,parent_name,set_num,name_y,year,theme_id,num_parts,is_creator,id_i,version,inventory_id,part_num,color_id,quantity,is_spare,id,name,rgb,is_trans
0,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,3217,1,3217,10201,0,1,f,0,Black,05131D,f
1,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,3217,1,3217,2540,0,1,f,0,Black,05131D,f
2,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,3217,1,3217,3023,0,2,f,0,Black,05131D,f
3,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,3217,1,3217,3068b,0,1,f,0,Black,05131D,f
4,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,3217,1,3217,3942c,0,1,f,0,Black,05131D,f
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,18231,2,18231,87580,4,1,f,4,Red,C91A09,f
74,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,3217,1,3217,98138,46,1,f,46,Trans-Yellow,F5CD2F,t
75,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,3217,1,3217,98138,46,1,t,46,Trans-Yellow,F5CD2F,t
76,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,18231,2,18231,98138,46,1,t,46,Trans-Yellow,F5CD2F,t


Now the column `is_spare` looks interesting. Let's focus on only spare parts:

In [22]:
(creator_sets
    .query("set_num == '31015-1'")
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
    .merge(inventory_parts, left_on='id_i', right_on='inventory_id', suffixes=(None, '_ip'))
    .query('is_spare == "t"')
        # .merge(parts, left_on='part_num', right_on='part_num') # I don't think we will need parts for now
    .merge(colors, left_on='color_id', right_on='id', suffixes=(None, '_c'))
)

Unnamed: 0,id_s,name_x,parent_name,set_num,name_y,year,theme_id,num_parts,is_creator,id_i,version,inventory_id,part_num,color_id,quantity,is_spare,id,name,rgb,is_trans
0,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,3217,1,3217,3070b,72,1,t,72,Dark Bluish Gray,6C6E68,f
1,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,18231,2,18231,3070b,72,1,t,72,Dark Bluish Gray,6C6E68,f
2,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,3217,1,3217,3673,71,1,t,71,Light Bluish Gray,A0A5A9,f
3,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,18231,2,18231,3673,71,1,t,71,Light Bluish Gray,A0A5A9,f
4,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,3217,1,3217,6141,0,2,t,0,Black,05131D,f
5,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,18231,2,18231,6141,0,2,t,0,Black,05131D,f
6,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,3217,1,3217,98138,46,1,t,46,Trans-Yellow,F5CD2F,t
7,672,Creator 3-in-1,Creator,31015-1,Emerald Express,2014,672,56,1,18231,2,18231,98138,46,1,t,46,Trans-Yellow,F5CD2F,t


It seems we've caught something: there can be two versions of an inventory, so parts can be reported twice.
What if we use only version == 1?

In [23]:
all_my_parts.query("version == 1")["quantity"].sum()

255103

That number is actually *lower* than the number of parts in the sets. This could well be because minifig parts are counted in the number of parts on the box, but not in the number of parts in the inventory.
Let's check how many minifig "parts" would be inside:

In [24]:
all_my_minifigs = (creator_sets
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
    .query("version == 1")
    .merge(inventory_minifigs, left_on='id_i', right_on='inventory_id', suffixes=(None, '_im'))
    # .merge(parts, left_on='part_num', right_on='part_num') # I don't think we will need parts for now
    .merge(minifigs, left_on='fig_num', right_on='fig_num', suffixes=(None, '_m'))
)

In [25]:
(all_my_minifigs["quantity"] * all_my_minifigs["num_parts_m"]).sum()

1142
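
The expression above multiplies quantity and parts per minifig row by row and only then sums. A toy example (the figure ids and numbers are invented) shows why the order matters:

```python
import pandas as pd

figs = pd.DataFrame({
    "fig_num": ["fig-a", "fig-b"],  # hypothetical ids
    "quantity": [2, 1],             # how many of each minifig
    "num_parts_m": [4, 5],          # parts per minifig
})

total_parts = (figs["quantity"] * figs["num_parts_m"]).sum()     # 2*4 + 1*5
wrong_total = figs["quantity"].sum() * figs["num_parts_m"].sum()  # 3 * 9, not what we want

print(total_parts, wrong_total)  # 13 27
```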

In [26]:
total_num_parts_including_minifigs = (
    all_my_parts.query("version == 1")["quantity"].sum()
    +
    (all_my_minifigs["quantity"] * all_my_minifigs["num_parts_m"]).sum()
)
total_num_parts_including_minifigs

256245

Which is *a little bit more* than the number of parts on the box:

In [27]:
total_num_parts_according_to_box = creator_sets["num_parts"].sum()
total_num_parts_according_to_box

255766

That boils down to approximately 0.9 extra parts per box, a much more reasonable number:

In [28]:
(total_num_parts_including_minifigs - total_num_parts_according_to_box) / len(creator_sets)

0.8870370370370371

With this new counting method, how many red bricks do we have?

In [29]:
all_my_parts.query("version == 1 and name == 'Red'")["quantity"].sum()

25277
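
A caveat on the `version == 1` filter: the value counts above show 539 version-1 inventories for 540 sets, so at least one set has no version-1 inventory and drops out entirely. A possible alternative, sketched here on toy data and not verified against the real tables, is to keep the lowest available version per set before merging. The set `HYPO-1` and its inventory id are invented.

```python
import pandas as pd

inventories_toy = pd.DataFrame({
    "id": [3217, 18231, 500],
    "set_num": ["31015-1", "31015-1", "HYPO-1"],  # HYPO-1 only has a version 2
    "version": [1, 2, 2],
})

one_inventory_per_set = (
    inventories_toy
    .sort_values(["set_num", "version"])
    .drop_duplicates("set_num", keep="first")  # lowest version per set
)
print(one_inventory_per_set)
```

Merging `creator_sets` with such a deduplicated table instead of the full `inventories` would give exactly one inventory per set.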

## Question 3
You are also curious how many parts you now have per category. Link the table you just created to the table `part_categories`, count the number of parts per category, and sort the result in descending order.

In [30]:
# Let's first link the table with the parts and parts_categories tables.
# Because of our findings with Q2, we will keep the "version" column in here as well:
my_parts_categories = (
    all_my_parts
    .merge(parts, left_on='part_num', right_on='part_num', suffixes=(None, '_p'))
    .merge(part_categories, left_on="part_cat_id", right_on="id", suffixes=(None, "_pc"))
)[["part_num", "name_pc", "name_p", "quantity", "version"]]
my_parts_categories.head(10)

Unnamed: 0,part_num,name_pc,name_p,quantity,version
0,2412b,Tiles Special,Tile Special 1 x 2 Grille with Bottom Groove,4,1
1,2412b,Tiles Special,Tile Special 1 x 2 Grille with Bottom Groove,3,1
2,2412b,Tiles Special,Tile Special 1 x 2 Grille with Bottom Groove,3,1
3,2412b,Tiles Special,Tile Special 1 x 2 Grille with Bottom Groove,2,1
4,2412b,Tiles Special,Tile Special 1 x 2 Grille with Bottom Groove,3,1
5,2412b,Tiles Special,Tile Special 1 x 2 Grille with Bottom Groove,2,1
6,2412b,Tiles Special,Tile Special 1 x 2 Grille with Bottom Groove,1,1
7,2412b,Tiles Special,Tile Special 1 x 2 Grille with Bottom Groove,4,1
8,2412b,Tiles Special,Tile Special 1 x 2 Grille with Bottom Groove,3,1
9,2412b,Tiles Special,Tile Special 1 x 2 Grille with Bottom Groove,2,1


In [31]:
my_parts_categories[["name_pc", "name_p", "part_num", "quantity"]].groupby(["name_pc", "name_p", "part_num"]).sum().sort_values("quantity", ascending=False).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,quantity
name_pc,name_p,part_num,Unnamed: 3_level_1
Bricks,Brick 1 x 2,3004,19753
Bricks,Brick 2 x 2,3003,17117
Bricks,Brick 1 x 1,3005,15083
Plates,Plate 1 x 2,3023,10577
Plates Round Curved and Dishes,Plate Round 1 x 1 with Solid Stud,6141,8165
Bricks,Brick 2 x 4,3001,8012
Plates,Plate 1 x 1,3024,7241
Bricks,Brick 1 x 4,3010,6402
Bricks Sloped,Slope 30° 1 x 1 x 2/3 (Cheese Slope),54200,5406
Bricks,Brick 2 x 3,3002,4906


Which category do you have the most parts from?

In [32]:
# Method 1 (naive)
my_parts_categories[["name_pc", "quantity"]].groupby(["name_pc"]).sum().sort_values("quantity", ascending=False).head(10)

Unnamed: 0_level_0,quantity
name_pc,Unnamed: 1_level_1
Bricks,85852
Plates,50832
Bricks Sloped,23558
Plates Special,20983
Tiles,14987
Plates Round Curved and Dishes,11661
Bricks Special,8755
Bricks Curved,8236
Bricks Round and Cones,5873
Technic Pins,5440


In [34]:
# Method 2 (only version 1 inventories)
my_parts_categories.query("version == 1")[["name_pc", "quantity"]].groupby(["name_pc"]).sum().sort_values("quantity", ascending=False).head(10)

Unnamed: 0_level_0,quantity
name_pc,Unnamed: 1_level_1
Bricks,82852
Plates,44867
Bricks Sloped,21366
Plates Special,18216
Tiles,12834
Plates Round Curved and Dishes,10600
Bricks Special,7806
Bricks Curved,7665
Bricks Round and Cones,4953
Technic Pins,4684


What is the average number of parts per creator set?

In [38]:
my_parts_categories_sets = (
    all_my_parts
    .merge(parts, left_on='part_num', right_on='part_num', suffixes=(None, '_p'))
    .merge(part_categories, left_on="part_cat_id", right_on="id", suffixes=(None, "_pc"))
)[["quantity", "version", "name_y", "set_num", "name_pc"]]
my_parts_categories_sets

Unnamed: 0,quantity,version,name_y,set_num,name_pc
0,4,1,Creative Tower,10664-1,Tiles Special
1,3,1,Buildings,4406-1,Tiles Special
2,3,1,Build Your Own House Tub,3600-1,Tiles Special
3,2,1,Safari Building Set,4637-1,Tiles Special
4,3,1,Deluxe Starter Set,7795-1,Tiles Special
...,...,...,...,...,...
58201,1,1,NASA Space Shuttle Discovery,10283-1,Stickers
58202,1,1,Friends - The Apartments,10292-1,Stickers
58203,1,1,Sopwith Camel,3451-1,Stickers
58204,1,1,Harley-Davidson Mini Motorcycle,HARLEY-1,Stickers


* Select only version 1 of the inventory
* Select the name, set number and quantity as columns
* First sum the quantity per set (this is important: a set contains multiple part types, so each set number occurs on multiple rows!)
* Then take the average of those per-set totals
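
A toy example of the last two steps (the set numbers and quantities are invented), contrasted with averaging before summing:

```python
import pandas as pd

rows = pd.DataFrame({
    "set_num": ["A-1", "A-1", "A-1", "B-1"],
    "quantity": [10, 20, 30, 40],
})

per_set_total = rows.groupby("set_num")["quantity"].sum()  # A-1: 60, B-1: 40
avg_parts_per_set = per_set_total.mean()                    # (60 + 40) / 2

# Averaging within each set first answers a different question:
# the average quantity per inventory line, not the average set size.
mean_of_means = rows.groupby("set_num")["quantity"].mean().mean()  # (20 + 40) / 2

print(avg_parts_per_set, mean_of_means)  # 50.0 30.0
```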

In [45]:
my_parts_categories_sets.query("version == 1")[["set_num", "name_y", "quantity"]].groupby(["set_num", "name_y"]).sum().mean()

quantity    492.476834
dtype: float64

In [48]:
my_parts_categories_sets[["set_num", "name_y", "quantity"]].groupby(["set_num", "name_y"]).sum().mean()

quantity    540.73025
dtype: float64

### Incorrect calculations:

1. Not summing the totals per set before taking the average:

In [51]:
my_parts_categories_sets.query("version == 1")[["set_num", "name_y", "quantity"]].groupby(["set_num", "name_y"]).mean().mean()

quantity    3.928597
dtype: float64

In [52]:
my_parts_categories_sets[["set_num", "name_y", "quantity"]].groupby(["set_num", "name_y"]).mean().mean()

quantity    3.935028
dtype: float64

2. Including all sets instead of only the Creator sets:

In [55]:
(
    themes_sets
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
    .merge(inventory_parts, left_on='id_i', right_on='inventory_id', suffixes=(None, '_ip'))
    # .merge(parts, left_on='part_num', right_on='part_num') # I don't think we will need parts for now
    .merge(colors, left_on='color_id', right_on='id', suffixes=(None, '_c'))
    .merge(parts, left_on='part_num', right_on='part_num', suffixes=(None, '_p'))
    .merge(part_categories, left_on="part_cat_id", right_on="id", suffixes=(None, "_pc"))
)[["quantity", "version", "name_y", "set_num", "name_pc"]].query("version == 1")[["set_num", "name_y", "quantity"]].groupby(["set_num", "name_y"]).sum().mean()

quantity    188.563959
dtype: float64

In [56]:
(
    themes_sets
    .merge(inventories, left_on='set_num', right_on='set_num', suffixes=('_s', '_i'))
    .merge(inventory_parts, left_on='id_i', right_on='inventory_id', suffixes=(None, '_ip'))
    # .merge(parts, left_on='part_num', right_on='part_num') # I don't think we will need parts for now
    .merge(colors, left_on='color_id', right_on='id', suffixes=(None, '_c'))
    .merge(parts, left_on='part_num', right_on='part_num', suffixes=(None, '_p'))
    .merge(part_categories, left_on="part_cat_id", right_on="id", suffixes=(None, "_pc"))
)[["quantity", "version", "name_y", "set_num", "name_pc"]][["set_num", "name_y", "quantity"]].groupby(["set_num", "name_y"]).sum().mean()

quantity    202.799388
dtype: float64

## Question 4

Determine which parts are included in the Hobby Train set (set 10183-1).

Match this table with the table of all your Creator Collection items. Create a new variable that counts how much you miss of each part.

What percentage of all required bricks are you still missing?


What are the top 5 parts that you already have, and the bottom 5 that you are still missing?
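
This question is left open here. One possible approach, sketched on invented toy data, is to left-merge the required parts with the owned parts on `part_num` and `color_id` and clip the difference at zero. It assumes both tables have already been summed to one row per part and color; the column names produced by `suffixes` are my own choice.

```python
import pandas as pd

# Toy stand-ins: parts required by the target set, and parts owned in total.
required = pd.DataFrame({
    "part_num": ["3004", "3023", "6141"],
    "color_id": [4, 0, 0],
    "quantity": [10, 6, 2],
})
owned = pd.DataFrame({
    "part_num": ["3004", "3023"],
    "color_id": [4, 0],
    "quantity": [4, 20],
})

comparison = required.merge(
    owned, on=["part_num", "color_id"], how="left",
    suffixes=("_required", "_owned"),
)
# Parts we do not own at all come back as NaN after the left merge.
comparison["quantity_owned"] = comparison["quantity_owned"].fillna(0).astype(int)
comparison["missing"] = (
    comparison["quantity_required"] - comparison["quantity_owned"]
).clip(lower=0)  # owning more than needed does not count as negative missing

pct_missing = comparison["missing"].sum() / comparison["quantity_required"].sum() * 100
print(comparison)
print(f"{pct_missing:.2f}% of the required parts are missing")
```

Sorting `comparison` on `quantity_owned` or `missing` then gives the top and bottom lists.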

## Bonus

You bought every set from the “Creator” theme in the year it came out. 
Make an overview per year of which parts you received.

Make a list of unique parts that only appear in one set.

Which recent set(s) (2017-now) contain the most parts you still need?

Is this set still available? Answer this last question by scraping the LEGO website (tip: BeautifulSoup)

Do you know any alternative ways to answer the main question? If so, please explain how to do this.
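
For the list of parts that appear in only one set, a possible starting point is to count the distinct sets per part. A sketch on invented toy data:

```python
import pandas as pd

usage = pd.DataFrame({
    "set_num": ["A-1", "A-1", "B-1", "B-1"],
    "part_num": ["3004", "3023", "3004", "6141"],
})

sets_per_part = usage.groupby("part_num")["set_num"].nunique()
parts_in_one_set = sets_per_part[sets_per_part == 1].index.tolist()
print(parts_in_one_set)  # ['3023', '6141']
```

Using `nunique` rather than `count` means a part that shows up on several rows of the same set (for example in different colors) is still counted as one set.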