In [2]:
import pandas as pd

## Question 1
Make sure that the theme is also included in dataset `sets`:

1. Load the CSV data into dataframes. Be sure to load every file into a separate variable so you can analyze them easily.  
   The URLs for the files are as follows, one per table:
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/colors.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/elements.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventories.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_minifigs.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_parts.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_sets.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/minifigs.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/parts.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/part_categories.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/part_relationships.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/sets.csv`
   * `https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/themes_hierachie.csv`
2. Link `themes` to `sets`. The hierarchy has already been resolved for you in the `themes` table.
3. Create a new field (variable, column) that indicates with 0/1 whether a set belongs to the Creator theme.

When reading all files at once in one script, you might see an error because GitHub throttles the requests.
The solution is to split the script into multiple cells, each of which loads part of the data.

In [3]:
# Reading data from a CSV file: pd.read_csv
# All tables go into their own separate dataframe and variable
colors = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/colors.csv")
elements = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/elements.csv")
inventories = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventories.csv")
inventory_minifigs = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_minifigs.csv")
inventory_parts = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_parts.csv")
inventory_sets = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/inventory_sets.csv")
minifigs = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/minifigs.csv")
parts = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/parts.csv")
part_categories = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/part_categories.csv")
part_relationships = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/part_relationships.csv")
sets = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/sets.csv")
themes = pd.read_csv("https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/themes_hierachie.csv")
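
As an alternative to the cell above, the repeated URL prefix can be factored out and a short pause added between requests. This is only a sketch: `load_tables`, `FILES`, and the one-second pause are hypothetical choices, and a pause is an assumption about what helps against throttling, not something GitHub documents for this case.

```python
import time

import pandas as pd

BASE_URL = "https://github.com/wortell-smart-learning/lego-casus/raw/main/dataset/"

# Variable name -> file name (the themes file really is spelled "themes_hierachie.csv")
FILES = {
    "colors": "colors.csv",
    "elements": "elements.csv",
    "inventories": "inventories.csv",
    "inventory_minifigs": "inventory_minifigs.csv",
    "inventory_parts": "inventory_parts.csv",
    "inventory_sets": "inventory_sets.csv",
    "minifigs": "minifigs.csv",
    "parts": "parts.csv",
    "part_categories": "part_categories.csv",
    "part_relationships": "part_relationships.csv",
    "sets": "sets.csv",
    "themes": "themes_hierachie.csv",
}


def load_tables(files, base_url=BASE_URL, reader=pd.read_csv, pause_seconds=1.0):
    """Read each file into its own dataframe, pausing between requests."""
    tables = {}
    for i, (name, filename) in enumerate(files.items()):
        if i > 0:
            time.sleep(pause_seconds)  # hypothetical pause to reduce the chance of throttling
        tables[name] = reader(base_url + filename)
    return tables
```

Calling `tables = load_tables(FILES)` would return a dictionary, so `tables["sets"]` plays the role of the `sets` variable used in the rest of this notebook.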

In [4]:
# link themes to sets
themes_sets = themes.merge(sets, left_on='id', right_on='theme_id')
themes_sets.head(10) # Display top 10 rows to show the result

Unnamed: 0,id,name_x,parent_name,set_num,name_y,year,theme_id,num_parts
0,1,Technic,Technic,001-1,Gears,1965,1,43
1,1,Technic,Technic,002-1,4.5V Samsonite Gears Motor Set,1965,1,3
2,1,Technic,Technic,1030-1,TECHNIC I: Simple Machines Set,1985,1,191
3,1,Technic,Technic,1038-1,ERBIE the Robo-Car,1985,1,120
4,1,Technic,Technic,1039-1,Manual Control Set 1,1986,1,39
5,1,Technic,Technic,1168-1,Battery Box,1986,1,1
6,1,Technic,Technic,1314-1,Stop bush / Small pulley,1987,1,210
7,1,Technic,Technic,1315-1,Piston Rod,1987,1,50
8,1,Technic,Technic,1316-1,Connector peg,1987,1,150
9,1,Technic,Technic,1317-1,TECHNIC Chainlinks,1987,1,350


### Before we continue
There are apparently two "theme" columns in the table `themes`:

* name (renamed to `name_x` by the merge, because `sets` also has a `name` column)
* parent_name
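
The `_x`/`_y` suffixes can be avoided by renaming the overlapping columns before merging. The sketch below uses made-up toy tables, and `theme_name` and `set_name` are hypothetical column names; the rest of this notebook keeps using `name_x` and `name_y`.

```python
import pandas as pd

# Toy stand-ins for the real tables; only a few columns are included.
themes = pd.DataFrame({
    "id": [1, 22],
    "name": ["Technic", "Creator"],
    "parent_name": ["Technic", "Creator"],
})
sets = pd.DataFrame({
    "set_num": ["001-1", "10664-1"],
    "name": ["Gears", "Creative Tower"],
    "theme_id": [1, 22],
})

# Rename before merging so the result has descriptive column names
themes_sets_named = themes.rename(columns={"id": "theme_id", "name": "theme_name"}).merge(
    sets.rename(columns={"name": "set_name"}),
    on="theme_id",
    validate="one_to_many",  # raises MergeError if a theme id occurs more than once in themes
)
print(themes_sets_named.columns.tolist())
```

The `validate` argument is optional; it guards against a duplicated theme id silently multiplying the rows of `sets`.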

From the data model we already know that this was (once) a parent-child hierarchy. Let's analyze it further:

In [23]:
print("Number of sets with parent_name Creator: ", themes_sets.query('parent_name == "Creator"')["id"].count()) # Count the number of sets with parent_name Creator
print("Number of sets with name_x Creator: ", themes_sets.query('name_x == "Creator"')["id"].count()) # Count the number of sets with name_x Creator
print("Values for parent_name when child_name is Creator: ", themes_sets.query('name_x == "Creator"')["name_x"].value_counts())
print("Values for child name when parent_name is Creator: ", themes_sets.query('parent_name == "Creator"')["name_x"].value_counts())


Number of sets with parent_name Creator:  540
Number of sets with name_x Creator:  124
Values for parent_name when child_name is Creator:  name_x
Creator    124
Name: count, dtype: int64
Values for child name when parent_name is Creator:  name_x
Creator 3-in-1    182
Basic Set         120
Creator            92
Creator Expert     58
Early Creator      29
Supplemental       22
Food & Drink       16
Basic Model         8
Creature            5
Construction        3
Traffic             3
Castle              1
Building            1
Name: count, dtype: int64


Based on this analysis, we can state that:

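* Of the 124 sets where name_x is 'Creator', 92 also have parent_name 'Creator'; the other 32 sit under a different parent (the third `print` above counts `name_x` rather than `parent_name`, so it does not show which one)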
* When parent_name is 'Creator', name_x can be different values, for example:
  * Early Creator
  * Creator Expert
  * Creator 3-in-1
  * Basic Set

We can safely assume that all of these are still "Creator" sets that merely have a sub-theme.
Therefore, the filter should be on `parent_name`.
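
One way to inspect the relationship between the two columns in a single pass is to count every (`parent_name`, `name_x`) pair, and to count `parent_name` for the rows whose `name_x` is "Creator". The sketch runs on made-up toy data; in the notebook the same expressions would be applied to the merged `themes_sets` dataframe.

```python
import pandas as pd

# Toy data standing in for themes_sets; the rows are made up.
toy_themes_sets = pd.DataFrame({
    "name_x": ["Creator", "Creator", "Creator 3-in-1", "Technic", "Basic Set"],
    "parent_name": ["Creator", "Creator", "Creator", "Technic", "Creator"],
})

# Number of sets for every (parent_name, name_x) combination
pair_counts = (
    toy_themes_sets.groupby(["parent_name", "name_x"])
    .size()
    .reset_index(name="num_sets")
)
print(pair_counts)

# Which parents occur for the sub-theme that is itself called "Creator"
parents_of_creator = toy_themes_sets.loc[
    toy_themes_sets["name_x"] == "Creator", "parent_name"
].value_counts()
print(parents_of_creator)
```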

In [12]:
# Create a new field that indicates with 0/1 if a set belongs to the Creator theme
themes_sets['is_creator'] = (themes_sets['parent_name'] == 'Creator').astype(int)

In [13]:
themes_sets.query('is_creator == 1').head(5) # 5 examples for rows with creator theme

Unnamed: 0,id,name_x,parent_name,set_num,name_y,year,theme_id,num_parts,is_creator
580,22,Creator,Creator,10664-1,Creative Tower,2013,22,1600,1
581,22,Creator,Creator,11938-1,Robot,2020,22,45,1
582,22,Creator,Creator,11939-1,Octopus,2020,22,63,1
583,22,Creator,Creator,11940-1,Fortress,2020,22,52,1
584,22,Creator,Creator,11941-1,Frog,2020,22,56,1


In [14]:
themes_sets.query('is_creator == 0').head(5) # 5 examples for rows without creator theme

Unnamed: 0,id,name_x,parent_name,set_num,name_y,year,theme_id,num_parts,is_creator
0,1,Technic,Technic,001-1,Gears,1965,1,43,0
1,1,Technic,Technic,002-1,4.5V Samsonite Gears Motor Set,1965,1,3,0
2,1,Technic,Technic,1030-1,TECHNIC I: Simple Machines Set,1985,1,191,0
3,1,Technic,Technic,1038-1,ERBIE the Robo-Car,1985,1,120,0
4,1,Technic,Technic,1039-1,Manual Control Set 1,1986,1,39,0


In [22]:
themes_sets.query('is_creator == 0 and parent_name == "Creator"') # Check if there are rows with is_creator == 0 and parent_name == 'Creator'

name_x
Creator 3-in-1    182
Basic Set         120
Creator Expert     58
Early Creator      29
Supplemental       22
Food & Drink       16
Basic Model         8
Creature            5
Construction        3
Traffic             3
Castle              1
Building            1
Name: count, dtype: int64
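
When `is_creator` is derived from `parent_name`, as in the cell that creates the flag, a row with `is_creator == 0` and `parent_name == "Creator"` cannot exist, so this check should come back empty if it is run against the current flag. A minimal sketch of the same check written as an assertion, on made-up toy data:

```python
import pandas as pd

# Toy data; the rows are made up for illustration.
toy = pd.DataFrame({
    "name_x": ["Creator", "Creator Expert", "Technic"],
    "parent_name": ["Creator", "Creator", "Technic"],
})
toy["is_creator"] = (toy["parent_name"] == "Creator").astype(int)

# Rows that contradict the flag; empty when the flag is derived from parent_name
mismatches = toy.query('is_creator == 0 and parent_name == "Creator"')
assert mismatches.empty, "is_creator disagrees with parent_name"
print(len(mismatches))  # prints 0
```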

Question: What percentage of all sets do you own?

In [24]:
num_creator_sets = themes_sets['is_creator'].sum()
print("Number of sets that belong to the Creator theme:", num_creator_sets)
num_total_sets = len(themes_sets)
print("Total number of sets:", num_total_sets)
print(f"Percentage of total sets owned: {num_creator_sets / num_total_sets * 100:.2f}")

Number of sets that belong to the Creator theme: 540
Total number of sets: 17835
Percentage of total sets owned: 3.03


Three things to note:

1. Because `is_creator` is numeric (0 or 1), summing the column gives the number of Creator sets
2. I've used a *format string* (f-string). The following three expressions print the same result:
```python
print("Perc of sets owned:", (num_creator_sets / num_total_sets * 100))
print("Perc of sets owned: {placeholder}".format(placeholder=(num_creator_sets / num_total_sets * 100)))  # You can use any name instead of placeholder, of course
print(f"Perc of sets owned: {(num_creator_sets / num_total_sets * 100)}")
```
3. The `:.2f` format specifier sets the number of decimal places shown for a floating-point number. Instead of `3.027754415475189`, the output is `3.03`.
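
A few variations on the format specifier, using the two counts printed above. The variable name `share` is only illustrative.

```python
share = 540 / 17835  # Creator sets divided by all sets

print(f"{share * 100:.2f}")   # two decimals: 3.03
print(f"{share * 100:.1f}%")  # one decimal plus a literal percent sign: 3.0%
print(f"{share:.2%}")         # the % format type multiplies by 100 and appends %: 3.03%
```

The last form avoids the manual `* 100` and adds the percent sign that the output above lacks.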


## Question 2

Filter the dataset on the variable you created, `is_creator`.

Link the data sources together so that you know which parts all of these sets contain together.

You need the following tables for this:

* `inventories`
* `inventory_parts`
* `parts`
* `colors`

The resulting table should be unique on `part_num` and `color_id`.
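
A possible shape for this chain of joins, shown on made-up toy tables. The column names `inventory_id`, `quantity`, `part_cat_id`, and the `id` columns of `inventories` and `colors` are assumptions about the CSV layout and should be checked against the real files; `owned`, `part_name`, and `color_name` are hypothetical names.

```python
import pandas as pd

# Toy tables; all values are made up and the column names are assumptions.
creator_sets = pd.DataFrame({"set_num": ["A-1", "B-1"], "is_creator": [1, 1]})
inventories = pd.DataFrame({"id": [10, 11], "set_num": ["A-1", "B-1"]})
inventory_parts = pd.DataFrame({
    "inventory_id": [10, 10, 11],
    "part_num": ["3001", "3002", "3001"],
    "color_id": [4, 4, 4],
    "quantity": [2, 1, 3],
})
parts = pd.DataFrame({
    "part_num": ["3001", "3002"],
    "name": ["Brick 2 x 4", "Brick 2 x 3"],
    "part_cat_id": [11, 11],
})
colors = pd.DataFrame({"id": [4], "name": ["Red"]})

owned = (
    creator_sets.query("is_creator == 1")
    .merge(inventories.rename(columns={"id": "inventory_id"}), on="set_num")
    .merge(inventory_parts, on="inventory_id")
    # Aggregate first so the result is unique on part_num and color_id
    .groupby(["part_num", "color_id"], as_index=False)["quantity"].sum()
    .merge(parts.rename(columns={"name": "part_name"}), on="part_num")
    .merge(colors.rename(columns={"id": "color_id", "name": "color_name"}), on="color_id")
)
print(owned)
```

Note that if a set has more than one row in `inventories`, this sketch counts its parts once per inventory.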

## Question 3
You are also curious how many parts you now have per category. Link the table you just created to the table `part_categories`, count the number of parts per category, and sort the result in descending order.

Which category do you have the most parts from?

What is the average number of tiles per creator set?
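
A sketch of the per-category count and the tiles average on made-up toy data. The column `part_cat_id`, the category names, and the value of `num_creator_sets_toy` are assumptions for illustration only.

```python
import pandas as pd

# Toy data; `owned_parts` stands in for the per-part table from Question 2.
owned_parts = pd.DataFrame({
    "part_num": ["3001", "3002", "3068b"],
    "part_cat_id": [11, 11, 19],
    "quantity": [5, 1, 4],
})
part_categories_toy = pd.DataFrame({"id": [11, 19], "name": ["Bricks", "Tiles"]})

per_category = (
    owned_parts.merge(
        part_categories_toy.rename(columns={"id": "part_cat_id", "name": "category"}),
        on="part_cat_id",
    )
    .groupby("category", as_index=False)["quantity"].sum()
    .sort_values("quantity", ascending=False, ignore_index=True)
)
print(per_category)

num_creator_sets_toy = 2  # hypothetical; in the notebook this is themes_sets["is_creator"].sum()
tiles = per_category.loc[per_category["category"] == "Tiles", "quantity"].sum()
avg_tiles_per_set = tiles / num_creator_sets_toy
print(avg_tiles_per_set)
```

The first row of `per_category` answers the "most parts" question, because the table is sorted in descending order.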

## Question 4

Determine which parts are included in the Hobby Train set (set 10183-1).

Match this table with the table of all parts in your Creator collection. Create a new variable that counts how many of each part you are missing.

What percentage of all required bricks are you still missing?


What are the top 5 parts that you already have, and the bottom 5 that you are still missing?
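
The comparison could be sketched as a left merge from the required parts to the owned parts. The data below is made up, `required` stands in for the parts of set 10183-1, and the `quantity` column is an assumption about the CSV layout.

```python
import pandas as pd

# Toy data; all quantities are made up.
required = pd.DataFrame({
    "part_num": ["3001", "3002", "3020"],
    "color_id": [4, 4, 0],
    "quantity": [4, 2, 6],
})
owned_toy = pd.DataFrame({
    "part_num": ["3001", "3002"],
    "color_id": [4, 4],
    "quantity": [5, 1],
})

# A left merge keeps every required part, including parts that are not owned at all
comparison = required.merge(
    owned_toy, on=["part_num", "color_id"], how="left", suffixes=("_required", "_owned")
)
comparison["quantity_owned"] = comparison["quantity_owned"].fillna(0).astype(int)
# Owning more than required must not produce a negative shortfall
comparison["missing"] = (
    comparison["quantity_required"] - comparison["quantity_owned"]
).clip(lower=0)

pct_missing = comparison["missing"].sum() / comparison["quantity_required"].sum() * 100
print(comparison)
print(f"{pct_missing:.2f}")
```

From here, `comparison.nlargest(5, "missing")` and `comparison.nlargest(5, "quantity_owned")` would give rankings of the most-missed and most-owned parts.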

## Bonus

You bought every set from the “Creator” theme in the year it came out. 
Make an overview per year of which parts you received.
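
One possible layout for such an overview is a pivot table with one row per part and one column per year. The toy data and the `quantity` column are assumptions; in practice the `year` column would come from `sets`.

```python
import pandas as pd

# Toy data: parts received with the Creator sets of each year (values are made up).
received = pd.DataFrame({
    "year": [2017, 2017, 2018],
    "part_num": ["3001", "3002", "3001"],
    "quantity": [2, 1, 3],
})

overview = received.pivot_table(
    index="part_num", columns="year", values="quantity", aggfunc="sum", fill_value=0
)
print(overview)
```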

Make a list of unique parts that only appear in one set.
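
Parts that occur in exactly one set can be found by counting distinct sets per part. A minimal sketch on made-up data, with `set_parts` as a hypothetical table holding one row per set and part:

```python
import pandas as pd

# Toy data: one row per (set, part); values are made up.
set_parts = pd.DataFrame({
    "set_num": ["A-1", "A-1", "B-1", "B-1"],
    "part_num": ["3001", "3002", "3001", "3020"],
})

# Number of distinct sets in which each part occurs
sets_per_part = set_parts.groupby("part_num")["set_num"].nunique()
only_in_one_set = sets_per_part[sets_per_part == 1].index.tolist()
print(only_in_one_set)  # ['3002', '3020']
```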

Which recent set(s) (2017-now) contain the most parts you still need?

Is this set still available? Answer this last question by scraping the LEGO website (tip: BeautifulSoup)

Do you know any alternative ways to answer the main question? If so, please explain how to do this.