# Exploratory Data Analysis and Data Visualization of Lego Database

## Objective of this Notebook

This notebook explores the provided Lego dataset, investigates it for facts and applies data analysis techniques to it. It also visualizes the data to surface more facts and to inspect the relationships between the tables.

#### Importing Data
- Importing Data with Pandas

#### Required Packages for Data Analysis
- NumPy
- Pandas

#### Data Visualization 
- Matplotlib
- Seaborn

## So, what exactly are Legos?
As explained by Wikipedia:
>Lego (/ˈlɛɡoʊ/ LEG-oh, Danish: [ˈle̝ːko] stylised as LEGO) is a line of plastic construction toys that are manufactured by The Lego Group, a privately held company based in Billund, Denmark. The company's flagship product, Lego, consists of variously colored interlocking plastic bricks accompanying an array of gears, figurines called minifigures, and various other parts. Lego pieces can be assembled and connected in many ways to construct objects, including vehicles, buildings, and working robots. Anything constructed can be taken apart again, and the pieces reused to make new things.

<img src="https://miro.medium.com/max/1400/1*_HEXQOt7cl3TJT_MI0VM7g.jpeg" width="700"/>

You have almost certainly seen these little figures made of Lego bricks in real life. Lego is currently very popular in entertainment, media, movies and games, and there are many Lego theme parks and retail stores all over the world.

## Let's start with the Lego Database

A comprehensive database of Lego parts and sets is provided by Rebrickable. The data is available as CSV files, and the schema is shown below.
<img src="https://storage.googleapis.com/kagglesdsdata/datasets%2F1599%2F2846%2Fdownloads_schema.png?GoogleAccessId=databundle-worker-v2@kaggle-161607.iam.gserviceaccount.com&Expires=1592688871&Signature=GxsFD0xJMmH5FPkPR49fItBIXQ%2Bi1lnczK0lhI087kkDx11CswjxFtJrDf3y3fCxK%2B0Z%2BuMcJ5XwzCYTzHin4E6OlykR3l6LUuEeftKqvBYoPqvXVZt7tCtDVWRNs4r0n7ie3GqZcN3gS1RvLtNJDhB19AZ%2BWb%2F7T9j89BKz1pK5vaiZ1ErsGYDv6n%2FPF8W%2FbSrsmfe6QaYQDJ%2FOAmbLchDIPW831V2zkDpziiuhKOW0xt30A3Kk8agbeHH3uLst2Ni3GO%2FiDIDndoD1zLZ6bMvfD5Cx2mF%2FuJQiToYB2GZZTppLg4pPukJsI%2FZzkX35tzyofLawPZb4HQxK8tKwLw%3D%3D" width="800">

The **Schematic Diagram** above shows the relationships between these tables. We are provided with 8 CSV files, one for each table shown in the figure. We will explore them one by one.

## Importing all the necessary packages to start data analysis



In [None]:
# for performing mathematical operations
import numpy as np 

# for data processing, CSV file I/O 
import pandas as pd 

# for data visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

## Data Preparation

In this section, the required datasets are imported, explored and cleaned so that they are in the right format for data visualization.

Data analysis techniques used in this section include:
- Importing data set using Pandas 
- Exploring data to find features and target 
- Handling missing or corrupted values in the data
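The missing-value step listed above can be sketched on a small made-up table. The frame `toy_parts` and its rows are hypothetical and only stand in for the real CSV data; `isna()` and `dropna()` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing name and one missing category id
toy_parts = pd.DataFrame({
    "name": ["Brick 2 x 4", None, "Plate 1 x 2"],
    "part_cat_id": [11, 11, np.nan],
})

# Count missing values per column
missing_per_column = toy_parts.isna().sum()
print(missing_per_column)

# One possible handling strategy: drop every row that has a missing value
cleaned = toy_parts.dropna()
print(cleaned)
```

Dropping rows is only one option. Whether to drop or fill missing values depends on the column and on the question being asked.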

*You can skip this section if you want to play with data yourself.*

## Importing Datasets using Pandas Library

In [None]:
# read the data from the csv files into a dataframe
themes = pd.read_csv('../input/lego-database/themes.csv', index_col=0)
sets = pd.read_csv('../input/lego-database/sets.csv', index_col=0)
parts = pd.read_csv('../input/lego-database/parts.csv', index_col=0)
part_categories = pd.read_csv('../input/lego-database/part_categories.csv', index_col=0)
inventories = pd.read_csv('../input/lego-database/inventories.csv', index_col=0)
inventory_sets = pd.read_csv('../input/lego-database/inventory_sets.csv', index_col=0)
inventory_parts = pd.read_csv('../input/lego-database/inventory_parts.csv', index_col=0)
colors = pd.read_csv('../input/lego-database/colors.csv', index_col=0)
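Because every file is read with `index_col=0`, the first column of each CSV (for example `id` or `inventory_id`) becomes the DataFrame index rather than a regular column. A minimal sketch of that effect, using a small in-memory CSV with illustrative rows instead of the real files:

```python
import io
import pandas as pd

# Illustrative rows in the same column layout as colors.csv (not the real data)
csv_text = "id,name,rgb,is_trans\n1,Sample Red,FF0000,f\n2,Sample Clear,FCFCFC,t\n"

# With index_col=0 the 'id' column becomes the index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(with_index.index.name)        # name of the index level
print(list(with_index.columns))     # remaining data columns

# Without index_col, 'id' stays an ordinary column
without_index = pd.read_csv(io.StringIO(csv_text))
print(list(without_index.columns))
```

This matters later on: operations such as `merge` and `groupby` can refer to a named index level by its name, which is why keys like `id` and `inventory_id` still work after loading.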

## Exploring Colors, Inventories, Parts and Part Categories
We have loaded the required datasets. We will start with colors, parts and inventories, then move on to inventory_parts. As the Schematic Diagram above shows, colors, parts and inventories come together to form inventory_parts.

#### According to the database diagram above, the relationships map as follows:
- Id (Colors) -> Color_Id (Inventory_Parts)
- Id (Inventories) -> Inventory_Id (Inventory_Parts)
- Part_Num (Parts) -> Part_Num (Inventory_Parts)
- Id (Part_Categories) -> Part_Cat_Id (Parts)
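The key mapping above can be sketched as a chain of joins. The three toy frames below are made up for illustration and only reuse the key column names from the schema; the renamed columns `color_name` and `part_name` are assumptions introduced to avoid a name clash.

```python
import pandas as pd

# Made-up rows that follow the schema's key columns
toy_colors = pd.DataFrame({"id": [1, 2], "name": ["Red", "Blue"]})
toy_parts = pd.DataFrame({
    "part_num": ["3001", "3002"],
    "name": ["Brick 2 x 4", "Brick 2 x 3"],
    "part_cat_id": [11, 11],
})
toy_inventory_parts = pd.DataFrame({
    "inventory_id": [10, 10, 20],
    "part_num": ["3001", "3002", "3001"],
    "color_id": [1, 2, 2],
    "quantity": [4, 2, 1],
})

# Id (Colors) -> Color_Id (Inventory_Parts), Part_Num (Parts) -> Part_Num (Inventory_Parts)
joined = (
    toy_inventory_parts
    .merge(toy_colors.rename(columns={"id": "color_id", "name": "color_name"}),
           on="color_id", how="left")
    .merge(toy_parts.rename(columns={"name": "part_name"}),
           on="part_num", how="left")
)
print(joined[["inventory_id", "part_name", "color_name", "quantity"]])
```

A left join keeps every row of the inventory table even if a color or part is missing from the lookup tables, which makes unmatched keys easy to spot as missing values.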

## Investigating Colors

### How many colors does the database have?

In [None]:
# checking first twenty rows for colors csv file
colors.head(20)

In [None]:
# checking the info of the colors dataset
colors.info()

In [None]:
# checking the shape of the dataset
colors.shape

Our database has a total of **135** colors. Every color has its own name, RGB value and transparency flag.

### How many transparent and non-transparent colors do we have?

In [None]:
# checking the number of transparent and non transparent colors
colors['is_trans'].value_counts()

We have **107** non-transparent and **28** transparent colors.
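The same split can be expressed as shares rather than counts. A minimal sketch on made-up data, assuming the flag column uses the values `'t'` and `'f'` as in the dataset:

```python
import pandas as pd

# Made-up flags: three non-transparent colors and one transparent color
toy_colors = pd.DataFrame({"is_trans": ["f", "f", "f", "t"]})

# Absolute counts per flag value
counts = toy_colors["is_trans"].value_counts()

# Relative shares per flag value (sums to 1.0)
shares = toy_colors["is_trans"].value_counts(normalize=True)

print(counts)
print(shares)
```

The shares are the same numbers a pie chart of the counts displays as percentages.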

In [None]:
matplotlib.rcParams.update({'font.size': 20})

# visualize transparent vs non transparent colors
transparent = colors['is_trans'] == 't'
non_transparent = colors['is_trans'] == 'f'

# data to plot
labels = 'Transparent Colors', 'Non Transparent Colors'
sizes = [transparent.sum(), non_transparent.sum()]
# use a separate name so the colors dataframe is not overwritten
pie_colors = ['lightcoral', 'lightskyblue']

# explode 1st slice
explode = (0.1, 0)

fig, axs = plt.subplots(figsize=(14, 7))
plt.pie(sizes, explode=explode, labels=labels, colors=pie_colors,
        autopct='%1.1f%%', shadow=True, startangle=90)

plt.axis('equal')
plt.show()

## Investigating Parts and Part Categories

### How many parts does the database have?

In [None]:
# checking first twenty rows for parts csv file
parts.head(20)

In [None]:
# checking the info of the parts dataset
parts.info()

In [None]:
# checking the shape of the dataset
parts.shape

In our database we have a total of **25993** different parts. Every part is associated with its part_category.

### How many part categories does the database have?

In [None]:
# checking first twenty rows for part_categories csv file
part_categories.head(20)

In [None]:
# checking the shape of the dataset
part_categories.shape

In our database we have a total of **57** part categories. 

### How many parts does each part_category contain?

In [None]:
# creating new dataframe with part_categories and their parts
parts_with_categories = pd.merge(left=part_categories, right=parts, left_on='id', right_on='part_cat_id')
parts_with_categories = parts_with_categories.rename(columns={'name_x': 'Part_Category_Name', 'name_y':'Part_Name'})
parts_with_categories.head(20)

In [None]:
# grouping categories and counting their respective number of parts
parts_with_categories = parts_with_categories['Part_Category_Name'].value_counts()
parts_with_categories.sort_values(ascending=False)

### Let's visualize the number of parts every part-category contains

In [None]:
matplotlib.rcParams.update({'font.size': 16})

fig, axs = plt.subplots(figsize=(18,4))
parts_with_categories.plot(kind="bar", color="brown", alpha=0.6, width= 0.8)

plt.ylabel('Number of Parts')
plt.title('Number of Parts in Each Part Category')
plt.xticks(rotation=90)
plt.legend()

plt.show()

**"Minifigs"** is the largest category and contains more than 8.5K parts.

## Investigating Inventories

### How many inventories are there in the database?

In [None]:
# checking first twenty rows for inventories csv file
inventories.head(20)

In [None]:
# checking the info of the inventories dataset
inventories.info()

In [None]:
# checking the shape of the dataset
inventories.shape

There are **11681** inventories in total in our database. Every inventory is associated with its set_num.

### How many sets does each inventory version have?

In [None]:
# grouping by version and counting the frequency of sets in each inventory version
inventories['version'].value_counts()

We have **5** inventory versions, of which **version 1** has the most sets: **11669**.

### Let's visualize the number of sets in every inventory version

In [None]:
sets_per_inventory_parts = inventories['version'].value_counts()
sets_per_inventory_parts.sort_values(ascending=False)

In [None]:
fig, axs = plt.subplots(figsize=(16,4))
sets_per_inventory_parts.plot(kind="bar", color="green", alpha=0.6, width= 0.5)

plt.xlabel('Inventory Version')
plt.ylabel('Number of Sets')
plt.title('Number of Sets per Inventory Version')
plt.xticks(rotation=0)
plt.grid()

plt.show()

In the graph above, **version 1** has **11669** sets. Since **version 2** has just **9** sets, its bar is barely visible.

### Which inventories have the most colors available?

In [None]:
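# inventory_id is the index (index_col=0), so groupby can refer to it by name
# note: count() counts color entries per inventory, including repeated colors
unique_inventory_parts = inventory_parts[['color_id']]
unique_inventory_parts = unique_inventory_parts.groupby('inventory_id').count()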

# taking out the top 15 inventory parts with most colors available
inventory_parts_most_colors = unique_inventory_parts.sort_values(by='color_id', ascending=False)
inventory_parts_most_colors = inventory_parts_most_colors[0:15]
inventory_parts_most_colors
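Note that `count()` tallies every color entry of an inventory, so a color used by several parts is counted several times. If the goal is the number of distinct colors, `nunique()` is one alternative. A minimal sketch of the difference on made-up data (the frame `toy_inventory_parts` is hypothetical and only mirrors the column names):

```python
import pandas as pd

# Made-up rows: inventory 1 uses color 4 twice, inventory 2 uses two different colors
toy_inventory_parts = pd.DataFrame({
    "inventory_id": [1, 1, 1, 2, 2],
    "color_id": [4, 4, 7, 0, 1],
}).set_index("inventory_id")

grouped = toy_inventory_parts.groupby("inventory_id")["color_id"]

# Number of color entries per inventory (repeats included)
entries_per_inventory = grouped.count()

# Number of distinct colors per inventory
distinct_colors_per_inventory = grouped.nunique()

print(entries_per_inventory)
print(distinct_colors_per_inventory)
```

The two measures answer different questions, so a ranking based on one does not have to match a ranking based on the other.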

In [None]:
matplotlib.rcParams.update({'font.size': 16})

fig, axs = plt.subplots(figsize=(18,9))
inventory_parts_most_colors['color_id'].plot(kind="barh", color="orange", alpha=0.6, width= 0.8)

plt.xlabel('Number of Colors Available')
plt.ylabel('Inventory Ids')
plt.title('Inventories with the Most Colors Available')
plt.xticks(rotation=90)
plt.legend()
plt.grid()
axs.set_xticks(np.arange(0,800,20))

plt.show()

In [None]:
sns.set(style="whitegrid")

# initialize the matplotlib figure
fig, axs = plt.subplots(figsize=(18,9))

# plot the number of colors available for the top 15 inventories
sns.set_color_codes("bright")
sns.barplot(x=inventory_parts_most_colors.index, y="color_id", data=inventory_parts_most_colors, color="r")

# customizing Bar Graph
plt.xticks(rotation=90)
plt.xlabel('Inventory Ids', fontsize=15)
plt.ylabel('Number of Colors Available', fontsize=15)
plt.title('Number of Colors Available per Inventory', fontsize=20)

plt.show()

Inventory_ID **1305** has the most available colors.

## Conclusion
We have analyzed the Lego dataset and found several facts about it: the number of colors and how many of them are transparent or non-transparent, the number of parts and part categories, the number of inventories and their versions, the number of sets per inventory version, and the inventories with the most colors available.
