# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png)
# DAT09 Data Science Capstone: LEGO Case Study by Ryan Peralta

## Background
- The name 'LEGO' is an abbreviation of the two Danish words "leg godt", meaning "play well".
- The key product remains the traditional LEGO brick which was launched in 1958. The interlocking design makes it unique and offers unlimited building possibilities. 
- LEGO gets the imagination going and lets a wealth of creative ideas emerge through play.

<img src="Lego_01.jpeg" alt="LEGO Figurine with Computer" title="LEGO Computer Scientist"/>

## Creativity, Innovation, and Invention
- According to Wikipedia, creativity is a phenomenon whereby something new and somehow valuable is formed. The creation may be either tangible or intangible. Innovation is the implementation of something new, while invention is the creation of something that has never been made before and is recognized as the product of some unique insight. Both innovation and invention go hand-in-hand with creativity.
- It would be of interest to see if there is a relationship between LEGO and creativity. Specifics such as price, piece count, theme, and age range would be good data points to investigate this.

## The Hypothesis
- Price accessibility of LEGO sets would help drive creativity.
- Getting kids to play at a younger age would help drive creativity.
- Complexity of the set helps drive creativity.

## Sources of Data
To investigate the hypotheses above, we will use the following datasets:
- LEGO dataset from Kaggle https://www.kaggle.com/mterzolo/lego-sets/home which includes information such as: age, price, reviews, piece count, play rating, description, difficulty, set name, theme, and country.
- Global Creativity Index from http://martinprosperity.org/content/the-global-creativity-index-2015/ which includes country rankings across technology, talent, and tolerance, which are ultimately summarized into a single index number. The table in the webpage will be converted into a CSV file for purposes of this case study.

## Data Preparation & Exploratory Data Analysis
#### Loading the data sets

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

legofile = "lego_sets.csv"
gcifile = "GCI.csv"
dfl = pd.read_csv(legofile)
dfg = pd.read_csv(gcifile)

#### Starting dataframes insights

In [2]:
dfl.head()

Unnamed: 0,ages,list_price,num_reviews,piece_count,play_star_rating,prod_desc,prod_id,prod_long_desc,review_difficulty,set_name,star_rating,theme_name,val_star_rating,country
0,6-12,29.99,2.0,277.0,4.0,Catapult into action and take back the eggs fr...,75823.0,Use the staircase catapult to launch Red into ...,Average,Bird Island Egg Heist,4.5,Angry Birds™,4.0,US
1,6-12,19.99,2.0,168.0,4.0,Launch a flying attack and rescue the eggs fro...,75822.0,Pilot Pig has taken off from Bird Island with ...,Easy,Piggy Plane Attack,5.0,Angry Birds™,4.0,US
2,6-12,12.99,11.0,74.0,4.3,Chase the piggy with lightning-fast Chuck and ...,75821.0,Pitch speedy bird Chuck against the Piggy Car....,Easy,Piggy Car Escape,4.3,Angry Birds™,4.1,US
3,12+,99.99,23.0,1032.0,3.6,Explore the architecture of the United States ...,21030.0,Discover the architectural secrets of the icon...,Average,United States Capitol Building,4.6,Architecture,4.3,US
4,12+,79.99,14.0,744.0,3.2,Recreate the Solomon R. Guggenheim Museum® wit...,21035.0,Discover the architectural secrets of Frank Ll...,Challenging,Solomon R. Guggenheim Museum®,4.6,Architecture,4.1,US


In [3]:
dfg.head()

Unnamed: 0,Ranking,Country,Technology,Talent,Tolerance,Global Creativity Index
0,1,Australia,7,1,4,0.97
1,2,United States,4,3,11,0.95
2,3,New Zealand,7,8,3,0.949
3,4,Canada,13,14,1,0.92
4,5,Denmark,10,6,13,0.917


> To combine the LEGO dataframe with the GCI dataframe, we need to map the GCI country names to their ISO 3166 codes, using the CSV file from https://datahub.io/core/country-list#resource-country-list_zip.

In [4]:
isofile = "country_code.csv"
dfc = pd.read_csv(isofile)
dfg.Country = dfg.Country.map(dfc.set_index('Name').Code)
dfg = dfg.rename(columns = {'Country':'country'})
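
`Series.map` leaves `NaN` wherever a country name has no exact match in the code table, and those rows would then fail to match in the merge that follows. A minimal sketch of a check for this, using small made-up frames: the helper name `unmapped_countries` and the toy values (including the fictional "Atlantis") are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the GCI table and the ISO code table (illustrative values only).
gci = pd.DataFrame({"Country": ["Australia", "United States", "Atlantis"]})
codes = pd.DataFrame({"Name": ["Australia", "United States"], "Code": ["AU", "US"]})

def unmapped_countries(gci_df, code_df):
    """Return the country names that have no match in the code table."""
    lookup = code_df.set_index("Name")["Code"]
    mapped = gci_df["Country"].map(lookup)
    # Names that mapped to NaN had no exact match in the lookup.
    return gci_df.loc[mapped.isna(), "Country"].tolist()

missing_names = unmapped_countries(gci, codes)
print(missing_names)
```

Running a check like this before overwriting `dfg.Country` would show whether any GCI country names are spelled differently from the ISO list.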

In [5]:
dfg.head()

Unnamed: 0,Ranking,country,Technology,Talent,Tolerance,Global Creativity Index
0,1,AU,7,1,4,0.97
1,2,US,4,3,11,0.95
2,3,NZ,7,8,3,0.949
3,4,CA,13,14,1,0.92
4,5,DK,10,6,13,0.917


> Merging the two dataframes on their shared `country` column to create the working dataframe for the rest of the analysis.

In [6]:
# Inner join on 'country', the only column the two dataframes share
df = pd.merge(dfl, dfg, on='country')
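
By default `pd.merge` performs an inner join on the column names the two frames share (here `country`), so any LEGO rows whose country is absent from the GCI table are dropped without warning. The sketch below, on toy frames with made-up values, shows one way to count such rows with `indicator=True` and to assert that each country appears only once on the GCI side with `validate="many_to_one"`.

```python
import pandas as pd

# Toy stand-ins for dfl and dfg (illustrative values only; "XX" is a made-up code).
lego = pd.DataFrame({"set_name": ["A", "B", "C"], "country": ["US", "US", "XX"]})
gci = pd.DataFrame({"country": ["US", "AU"], "creativity": [0.95, 0.97]})

# A left join keeps every LEGO row; `_merge` records whether a GCI match was found.
checked = pd.merge(lego, gci, on="country", how="left",
                   validate="many_to_one", indicator=True)
unmatched = int((checked["_merge"] == "left_only").sum())

# The inner join used for the analysis keeps only the matched rows.
merged = pd.merge(lego, gci, on="country", validate="many_to_one")
print(unmatched, len(merged))
```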

#### Working dataframe insights

In [7]:
df.head()

Unnamed: 0,ages,list_price,num_reviews,piece_count,play_star_rating,prod_desc,prod_id,prod_long_desc,review_difficulty,set_name,star_rating,theme_name,val_star_rating,country,Ranking,Technology,Talent,Tolerance,Global Creativity Index
0,6-12,29.99,2.0,277.0,4.0,Catapult into action and take back the eggs fr...,75823.0,Use the staircase catapult to launch Red into ...,Average,Bird Island Egg Heist,4.5,Angry Birds™,4.0,US,2,4,3,11,0.95
1,6-12,19.99,2.0,168.0,4.0,Launch a flying attack and rescue the eggs fro...,75822.0,Pilot Pig has taken off from Bird Island with ...,Easy,Piggy Plane Attack,5.0,Angry Birds™,4.0,US,2,4,3,11,0.95
2,6-12,12.99,11.0,74.0,4.3,Chase the piggy with lightning-fast Chuck and ...,75821.0,Pitch speedy bird Chuck against the Piggy Car....,Easy,Piggy Car Escape,4.3,Angry Birds™,4.1,US,2,4,3,11,0.95
3,12+,99.99,23.0,1032.0,3.6,Explore the architecture of the United States ...,21030.0,Discover the architectural secrets of the icon...,Average,United States Capitol Building,4.6,Architecture,4.3,US,2,4,3,11,0.95
4,12+,79.99,14.0,744.0,3.2,Recreate the Solomon R. Guggenheim Museum® wit...,21035.0,Discover the architectural secrets of Frank Ll...,Challenging,Solomon R. Guggenheim Museum®,4.6,Architecture,4.1,US,2,4,3,11,0.95


> Renaming columns prior to checking the tail.

In [8]:
rename = {'Ranking':'ranking','Technology':'technology','Talent':'talent','Tolerance':'tolerance','Global Creativity Index':'creativity'} 
df.rename(columns=rename, inplace=True)
df.tail()

Unnamed: 0,ages,list_price,num_reviews,piece_count,play_star_rating,prod_desc,prod_id,prod_long_desc,review_difficulty,set_name,star_rating,theme_name,val_star_rating,country,ranking,technology,talent,tolerance,creativity
11681,7-14,36.5878,6.0,341.0,4.4,Protect NINJAGO® City from flying Manta Ray Bo...,70609.0,Help Cole save Shen-Li in this cool THE LEGO® ...,Easy,Manta Ray Bomber,4.3,THE LEGO® NINJAGO® MOVIE™,4.2,PT,23,35,36,22,0.71
11682,7-14,24.3878,8.0,217.0,4.1,Stop a Piranha Attack with Kai and Misako!,70629.0,Play out an action-packed Piranha Mech pursuit...,Easy,Piranha Attack,3.6,THE LEGO® NINJAGO® MOVIE™,4.1,PT,23,35,36,22,0.71
11683,7-14,24.3878,18.0,233.0,4.6,Stop a crime in the NINJAGO® City street market!,70607.0,"Team up with Lloyd Garmadon, Nya and Officer T...",Easy,NINJAGO® City Chase,4.6,THE LEGO® NINJAGO® MOVIE™,4.5,PT,23,35,36,22,0.71
11684,6-14,12.1878,1.0,48.0,5.0,Achieve Spinjitzu greatness with the Green Ninja!,70628.0,Learn all the skills of Spinjitzu with THE LEG...,Very Easy,Lloyd - Spinjitzu Master,5.0,THE LEGO® NINJAGO® MOVIE™,5.0,PT,23,35,36,22,0.71
11685,6-14,12.1878,11.0,109.0,4.5,Practice your Spinjitzu skills with Kai and Zane!,70606.0,Join the ninja heroes at the dojo with this ac...,Easy,Spinjitzu Training,4.7,THE LEGO® NINJAGO® MOVIE™,4.8,PT,23,35,36,22,0.71


In [9]:
df.shape

(11686, 19)

In [10]:
df.describe()

Unnamed: 0,list_price,num_reviews,piece_count,play_star_rating,prod_id,star_rating,val_star_rating,ranking,creativity
count,11686.0,10143.0,11686.0,9996.0,11686.0,10143.0,9977.0,11686.0,11686.0
mean,65.617539,16.826383,493.344857,4.337585,59816.95,4.514286,4.229398,15.013178,0.823992
std,92.736468,36.354701,825.108231,0.652136,163784.0,0.518996,0.660119,10.882678,0.114028
min,2.2724,1.0,1.0,1.0,630.0,1.8,1.0,1.0,0.516
25%,19.99,2.0,97.0,4.0,21034.0,4.3,4.0,5.0,0.788
50%,36.5878,6.0,216.0,4.5,42069.0,4.7,4.3,14.0,0.837
75%,71.221725,13.5,544.0,4.8,70922.0,5.0,4.7,20.0,0.917
max,1104.87,367.0,7541.0,5.0,2000431.0,5.0,5.0,46.0,0.97


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11686 entries, 0 to 11685
Data columns (total 19 columns):
ages                 11686 non-null object
list_price           11686 non-null float64
num_reviews          10143 non-null float64
piece_count          11686 non-null float64
play_star_rating     9996 non-null float64
prod_desc            11327 non-null object
prod_id              11686 non-null float64
prod_long_desc       11686 non-null object
review_difficulty    9729 non-null object
set_name             11686 non-null object
star_rating          10143 non-null float64
theme_name           11683 non-null object
val_star_rating      9977 non-null float64
country              11686 non-null object
ranking              11686 non-null int64
technology           11686 non-null object
talent               11686 non-null object
tolerance            11686 non-null object
creativity           11686 non-null float64
dtypes: float64(8), int64(1), object(10)
memory usage: 1.8+ MB


> There appear to be missing values in num_reviews, play_star_rating, prod_desc, review_difficulty, star_rating, theme_name, and val_star_rating. We will fill num_reviews, play_star_rating, star_rating, and val_star_rating with their column means. We will drop (1) the prod_desc column, given the presence of prod_long_desc, and (2) the 3 rows without a theme name. We will use the piece count to determine the missing review difficulty.
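
The per-column gaps read off `df.info()` can also be tabulated directly. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (illustrative values only).
toy = pd.DataFrame({
    "num_reviews": [2.0, np.nan, 11.0],
    "theme_name": ["City", None, "Architecture"],
    "list_price": [29.99, 19.99, 12.99],
})

# Count missing entries per column and keep only the columns that have any.
na_counts = toy.isna().sum()
na_counts = na_counts[na_counts > 0]
print(na_counts)
```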

In [12]:
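missing = ['num_reviews','play_star_rating','star_rating','val_star_rating']
for x in missing:
    # Assign the result back rather than calling fillna(inplace=True) on a
    # column selection, which may not update df in recent pandas versions.
    df[x] = df[x].fillna(df[x].mean())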

In [13]:
df.drop(columns='prod_desc', inplace=True)

In [14]:
df.dropna(subset=['theme_name'], inplace=True)

In [20]:
df[df.review_difficulty.isnull()].head()

Unnamed: 0,ages,list_price,num_reviews,piece_count,play_star_rating,prod_id,prod_long_desc,review_difficulty,set_name,star_rating,theme_name,val_star_rating,country,ranking,technology,talent,tolerance,creativity
22,10+,9.99,1.0,136.0,4.337585,41607.0,This Gamora LEGO® BrickHeadz construction char...,,Gamora,5.0,BrickHeadz,4.229398,US,2,4,3,11,0.95
32,10+,19.99,16.826383,209.0,4.337585,41610.0,These LEGO® BrickHeadz™ 41610 Tactical Batman™...,,Tactical Batman™ & Superman™,4.514286,BrickHeadz,4.229398,US,2,4,3,11,0.95
48,5-12,49.99,16.826383,387.0,4.337585,60175.0,Pick up your badge and join the LEGO® City Mou...,,Mountain River Heist,4.514286,City,4.229398,US,2,4,3,11,0.95
55,7-12,99.99,1.0,883.0,5.0,60188.0,Grab your hard hat and head out to the LEGO® C...,,Mining Experts Site,5.0,City,5.0,US,2,4,3,11,0.95
69,5-12,39.99,16.826383,297.0,4.337585,60172.0,Pick up your badge and join the LEGO® City Mou...,,Dirt Road Pursuit,4.514286,City,4.229398,US,2,4,3,11,0.95


In [16]:
df.groupby(["review_difficulty"]).piece_count.describe()

review_difficulty,count,mean,std,min,25%,50%,75%,max
Average,3590.0,602.669081,471.525419,1.0,257.0,494.0,830.0,2595.0
Challenging,1007.0,2161.286991,1690.504234,1.0,603.0,1969.0,3444.0,7541.0
Easy,4033.0,247.056286,406.459927,1.0,102.0,174.0,285.0,4900.0
Very Challenging,7.0,1966.142857,0.377964,1966.0,1966.0,1966.0,1966.0,1967.0
Very Easy,1089.0,79.401286,68.437053,1.0,10.0,68.0,122.0,347.0


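> Using the information above, we can map review_difficulty from piece_count. Each range starts at the 25% quartile of its difficulty class and ends just below the 25% quartile of the next class (the upper bound of Average, 602, also matches that class's mean of about 602.7):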
- Very easy: 1.0 to 101.0
- Easy: 102.0 to 256.0
- Average: 257.0 to 602.0
- Challenging/Very Challenging: 603.0 and above
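
The ranges above can be applied with `pd.cut`. The following is a minimal sketch on a toy frame, not the notebook's own implementation. It assumes piece counts are whole numbers, labels everything from 603 pieces upward simply as "Challenging" (an assumption, since the list merges the two top classes), and fills only the rows where review_difficulty is missing.

```python
import numpy as np
import pandas as pd

# Bin edges taken from the ranges listed above; right=False makes each bin
# [lower, upper), so 101 falls in "Very Easy" and 102 in "Easy".
bins = [1, 102, 257, 603, np.inf]
labels = ["Very Easy", "Easy", "Average", "Challenging"]  # top label is an assumption

# Toy rows (illustrative values only); the last one already has a rating.
toy = pd.DataFrame({
    "piece_count": [48.0, 136.0, 387.0, 883.0, 217.0],
    "review_difficulty": [None, None, None, None, "Easy"],
})

# Derive a difficulty label from piece_count for every row.
derived = pd.cut(toy["piece_count"], bins=bins, labels=labels, right=False)

# Fill only the missing ratings; existing ratings are left untouched.
toy["review_difficulty"] = toy["review_difficulty"].fillna(derived.astype(object))
print(toy)
```

Filling only the gaps keeps the reviewer-supplied ratings intact, even where they disagree with the piece-count ranges.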

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11683 entries, 0 to 11685
Data columns (total 18 columns):
ages                 11683 non-null object
list_price           11683 non-null float64
num_reviews          11683 non-null float64
piece_count          11683 non-null float64
play_star_rating     11683 non-null float64
prod_id              11683 non-null float64
prod_long_desc       11683 non-null object
review_difficulty    9726 non-null object
set_name             11683 non-null object
star_rating          11683 non-null float64
theme_name           11683 non-null object
val_star_rating      11683 non-null float64
country              11683 non-null object
ranking              11683 non-null int64
technology           11683 non-null object
talent               11683 non-null object
tolerance            11683 non-null object
creativity           11683 non-null float64
dtypes: float64(8), int64(1), object(9)
memory usage: 1.7+ MB
