# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png)
# DAT09 Data Science Capstone: LEGO Case Study
#### by Ryan Peralta

## Background
- The name 'LEGO' is an abbreviation of the two Danish words "leg godt", meaning "play well".
- The key product remains the traditional LEGO brick which was launched in 1958. The interlocking design makes it unique and offers unlimited building possibilities. 
- LEGO gets the imagination going and lets a wealth of creative ideas emerge through play.

<img src="Lego_01.jpeg" alt="LEGO Figurine with Computer" title="LEGO Computer Scientist"/>

## Creativity, Innovation, and Invention
- As per Wikipedia, creativity is a phenomenon whereby something new and somehow valuable is formed. The creation may be either tangible or intangible. Innovation is the implementation of something new, while invention is the creation of something that has never been made before and is recognized as the product of some unique insight. Both innovation and invention go hand in hand with creativity.

## The Hypothesis
> It would be of interest to see if there is a relationship between LEGO and creativity. Specifics such as price, piece count, theme, and age range would be good data points to investigate this.

> Furthermore, we will try to predict the price of a LEGO set based on the features available and identify which are the most important in driving that price.

<img src="Lego_02.jpg" alt="LEGO Figurine with Magnifying Glass" title="LEGO Sherlock"/>

## Sources of Data
To investigate the above hypothesis we will be using the following datasets:
- The LEGO dataset from Kaggle https://www.kaggle.com/mterzolo/lego-sets/home, which includes information such as age, price, reviews, piece count, play rating, description, difficulty, set name, theme, and country.
- The Global Creativity Index (GCI) from http://martinprosperity.org/content/the-global-creativity-index-2015/, which ranks countries on technology, talent, and tolerance and summarizes the three into a single index number. The table on the webpage will be converted into a CSV file for the purposes of this case study.

## Data Preparation & Exploratory Data Analysis
#### Loading the data sets

In [40]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error 
from math import sqrt
import matplotlib.pyplot as plt
%matplotlib inline

In [41]:
legofile = "lego_sets.csv"
gcifile = "GCI.csv"

dfl = pd.read_csv(legofile)
dfg = pd.read_csv(gcifile)

#### Starting dataframes insights

In [42]:
dfl.head()

Unnamed: 0,ages,list_price,num_reviews,piece_count,play_star_rating,prod_desc,prod_id,prod_long_desc,review_difficulty,set_name,star_rating,theme_name,val_star_rating,country
0,6-12,29.99,2.0,277.0,4.0,Catapult into action and take back the eggs fr...,75823.0,Use the staircase catapult to launch Red into ...,Average,Bird Island Egg Heist,4.5,Angry Birds™,4.0,US
1,6-12,19.99,2.0,168.0,4.0,Launch a flying attack and rescue the eggs fro...,75822.0,Pilot Pig has taken off from Bird Island with ...,Easy,Piggy Plane Attack,5.0,Angry Birds™,4.0,US
2,6-12,12.99,11.0,74.0,4.3,Chase the piggy with lightning-fast Chuck and ...,75821.0,Pitch speedy bird Chuck against the Piggy Car....,Easy,Piggy Car Escape,4.3,Angry Birds™,4.1,US
3,12+,99.99,23.0,1032.0,3.6,Explore the architecture of the United States ...,21030.0,Discover the architectural secrets of the icon...,Average,United States Capitol Building,4.6,Architecture,4.3,US
4,12+,79.99,14.0,744.0,3.2,Recreate the Solomon R. Guggenheim Museum® wit...,21035.0,Discover the architectural secrets of Frank Ll...,Challenging,Solomon R. Guggenheim Museum®,4.6,Architecture,4.1,US


In [43]:
dfg.head()

Unnamed: 0,Ranking,Country,Technology,Talent,Tolerance,Global Creativity Index
0,1,Australia,7,1,4,0.97
1,2,United States,4,3,11,0.95
2,3,New Zealand,7,8,3,0.949
3,4,Canada,13,14,1,0.92
4,5,Denmark,10,6,13,0.917


#### Combining LEGO and GCI dataframes
> In order to combine the LEGO dataframe with the GCI dataframe, we will need to map the GCI country names to their ISO 3166 codes, using the CSV file from https://datahub.io/core/country-list#resource-country-list_zip.

<img src="Lego_06.jpg" alt="LEGO Woman Scientist" title="LEGO Combine"/>

In [44]:
isofile = "country_code.csv"
dfc = pd.read_csv(isofile)
dfc.head()

Unnamed: 0,Name,Code
0,Afghanistan,AF
1,Åland Islands,AX
2,Albania,AL
3,Algeria,DZ
4,American Samoa,AS


In [45]:
dfg.Country = dfg.Country.map(dfc.set_index('Name').Code)
dfg = dfg.rename(columns = {'Country':'country'})
dfg.head()

Unnamed: 0,Ranking,country,Technology,Talent,Tolerance,Global Creativity Index
0,1,AU,7,1,4,0.97
1,2,US,4,3,11,0.95
2,3,NZ,7,8,3,0.949
3,4,CA,13,14,1,0.92
4,5,DK,10,6,13,0.917
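
One thing to watch with this kind of `map` lookup is that any country name without an exact match in the lookup table silently becomes `NaN`. The following is a minimal sketch of how one might check for that. It uses small made-up tables rather than the real GCI and ISO files, so the frames `gci` and `codes` and their rows are illustrative assumptions only.

```python
import pandas as pd

# Toy stand-ins for the GCI table and the ISO lookup table (illustrative rows only)
gci = pd.DataFrame({"Country": ["Australia", "United States", "Korea"]})
codes = pd.DataFrame({
    "Name": ["Australia", "United States", "Korea, Republic of"],
    "Code": ["AU", "US", "KR"],
})

# Same approach as the notebook: look each name up in a Name -> Code Series
mapped = gci["Country"].map(codes.set_index("Name")["Code"])

# Names with no exact match come back as NaN; list them so they can be fixed by hand
unmatched = gci.loc[mapped.isna(), "Country"].tolist()
print(unmatched)
```

Any names listed this way would need to be renamed before mapping, otherwise those countries drop out of the later merge.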


> We create the dataframe for the rest of the analysis by merging the GCI and LEGO dataframes on country.

In [46]:
df = dfg.merge(dfl, on='country', how='inner')
df.columns = map(str.lower, df.columns)
df.head()

Unnamed: 0,ranking,country,technology,talent,tolerance,global creativity index,ages,list_price,num_reviews,piece_count,play_star_rating,prod_desc,prod_id,prod_long_desc,review_difficulty,set_name,star_rating,theme_name,val_star_rating
0,1,AU,7,1,4,0.97,12+,113.9924,23.0,1032.0,3.6,Explore the architecture of the United States ...,21030.0,Discover the architectural secrets of the icon...,Average,United States Capitol Building,4.6,Architecture,4.3
1,1,AU,7,1,4,0.97,12+,75.9924,14.0,744.0,3.2,Recreate the Solomon R. Guggenheim Museum® wit...,21035.0,Discover the architectural secrets of Frank Ll...,Challenging,Solomon R. Guggenheim Museum®,4.6,Architecture,4.1
2,1,AU,7,1,4,0.97,12+,60.7924,7.0,597.0,3.7,Celebrate Shanghai with this LEGO® Architectur...,21039.0,Recreate Shanghai in China's blend of historic...,Average,Shanghai,4.9,Architecture,4.4
3,1,AU,7,1,4,0.97,12+,53.1924,24.0,780.0,4.4,Recreate Buckingham Palace with LEGO® Architec...,21029.0,Build a LEGO® brick model of London's official...,Average,Buckingham Palace,4.7,Architecture,4.3
4,1,AU,7,1,4,0.97,12+,53.1924,37.0,598.0,3.7,Celebrate New York City with this LEGO® Archit...,21028.0,Celebrate the architectural diversity of New Y...,Average,New York City,4.2,Architecture,4.1
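
Because an inner merge keeps only the countries present in both frames, rows without a match are dropped without any message. A small sketch, on made-up frames, of using the `indicator` option of `merge` to see what an inner join would discard (the frames `left` and `right` and the code `XX` are hypothetical):

```python
import pandas as pd

# Toy frames: 'DK' has no LEGO rows and 'XX' has no GCI row (hypothetical values)
left = pd.DataFrame({"country": ["AU", "US", "DK"], "creativity": [0.97, 0.95, 0.917]})
right = pd.DataFrame({"country": ["AU", "US", "XX"], "list_price": [10.0, 20.0, 30.0]})

# An outer join with indicator=True adds a _merge column:
# 'both', 'left_only' or 'right_only'
check = left.merge(right, on="country", how="outer", indicator=True)

# Everything not marked 'both' would be lost by how='inner'
dropped = check.loc[check["_merge"] != "both", "country"].tolist()
print(sorted(dropped))
```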


> Renaming columns prior to checking the tail.

In [47]:
rename = {'global creativity index':'creativity'} 
df.rename(columns=rename, inplace=True)
df.tail()

Unnamed: 0,ranking,country,technology,talent,tolerance,creativity,ages,list_price,num_reviews,piece_count,play_star_rating,prod_desc,prod_id,prod_long_desc,review_difficulty,set_name,star_rating,theme_name,val_star_rating
11681,46,PL,46,25,101,0.516,8-14,40.5971,8.0,341.0,4.2,Take on the tentacular Flying Jelly Sub with Jay!,70610.0,Join ninja Jay in battle against the shark arm...,Average,Flying Jelly Sub,4.6,THE LEGO® NINJAGO® MOVIE™,4.5
11682,46,PL,46,25,101,0.516,7-14,40.5971,6.0,341.0,4.4,Protect NINJAGO® City from flying Manta Ray Bo...,70609.0,Help Cole save Shen-Li in this cool THE LEGO® ...,Easy,Manta Ray Bomber,4.3,THE LEGO® NINJAGO® MOVIE™,4.2
11683,46,PL,46,25,101,0.516,7-14,26.0971,8.0,217.0,4.1,Stop a Piranha Attack with Kai and Misako!,70629.0,Play out an action-packed Piranha Mech pursuit...,Easy,Piranha Attack,3.6,THE LEGO® NINJAGO® MOVIE™,4.1
11684,46,PL,46,25,101,0.516,7-14,26.0971,18.0,233.0,4.6,Stop a crime in the NINJAGO® City street market!,70607.0,"Team up with Lloyd Garmadon, Nya and Officer T...",Easy,NINJAGO® City Chase,4.6,THE LEGO® NINJAGO® MOVIE™,4.5
11685,46,PL,46,25,101,0.516,6-14,13.0471,,48.0,,Achieve Spinjitzu greatness with the Green Ninja!,70628.0,Learn all the skills of Spinjitzu with THE LEG...,,Lloyd - Spinjitzu Master,,THE LEGO® NINJAGO® MOVIE™,


#### Missing Values

There seem to be missing values in num_reviews, play_star_rating, prod_desc, review_difficulty, star_rating, theme_name, and val_star_rating.

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11686 entries, 0 to 11685
Data columns (total 19 columns):
ranking              11686 non-null int64
country              11686 non-null object
technology           11686 non-null object
talent               11686 non-null object
tolerance            11686 non-null object
creativity           11686 non-null float64
ages                 11686 non-null object
list_price           11686 non-null float64
num_reviews          10143 non-null float64
piece_count          11686 non-null float64
play_star_rating     9996 non-null float64
prod_desc            11327 non-null object
prod_id              11686 non-null float64
prod_long_desc       11686 non-null object
review_difficulty    9729 non-null object
set_name             11686 non-null object
star_rating          10143 non-null float64
theme_name           11683 non-null object
val_star_rating      9977 non-null float64
dtypes: float64(8), int64(1), object(10)
memory usage: 1.8+ MB


#### Strategy to deal with missing values
- We will fill the gaps with the column mean for the following: num_reviews, play_star_rating, star_rating, and val_star_rating.
- We will drop (1) prod_desc given the presence of prod_long_desc and (2) the 3 rows without the theme name.
- We will explore using a model to fill in missing values for review_difficulty.

<img src="Lego_03.jpg" alt="LEGO Batman Missing Pieces" title="LEGO Missing Pieces"/>

In [49]:
missing = ['num_reviews','play_star_rating','star_rating','val_star_rating']
for x in missing:
    df[x].fillna(df[x].mean(), inplace=True)
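
Newer pandas releases warn about calling `fillna(..., inplace=True)` on a column selected with `df[x]`, because the change may not reach the original frame. A sketch of the same mean imputation written as an assignment, on a toy frame (`toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (made-up values)
toy = pd.DataFrame({
    "num_reviews": [2.0, np.nan, 10.0],
    "star_rating": [4.0, 5.0, np.nan],
})

# Fill each column's gaps with that column's own mean and assign the result back
cols = ["num_reviews", "star_rating"]
toy[cols] = toy[cols].fillna(toy[cols].mean())
print(toy)
```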

In [50]:
df.drop(columns='prod_desc', inplace=True)

In [51]:
df.dropna(subset=['theme_name'], inplace=True)

In [52]:
df[df.review_difficulty.isnull()].head()

Unnamed: 0,ranking,country,technology,talent,tolerance,creativity,ages,list_price,num_reviews,piece_count,play_star_rating,prod_id,prod_long_desc,review_difficulty,set_name,star_rating,theme_name,val_star_rating
19,1,AU,7,1,4,0.97,10+,12.1524,1.0,136.0,4.337585,41607.0,This Gamora LEGO® BrickHeadz construction char...,,Gamora,5.0,BrickHeadz,4.229398
29,1,AU,7,1,4,0.97,10+,22.7924,16.826383,209.0,4.337585,41610.0,These LEGO® BrickHeadz™ 41610 Tactical Batman™...,,Tactical Batman™ & Superman™,4.514286,BrickHeadz,4.229398
38,1,AU,7,1,4,0.97,7-12,121.5924,1.0,883.0,5.0,60188.0,Grab your hard hat and head out to the LEGO® C...,,Mining Experts Site,5.0,City,5.0
51,1,AU,7,1,4,0.97,5-12,45.5924,16.826383,387.0,4.337585,60175.0,Pick up your badge and join the LEGO® City Mou...,,Mountain River Heist,4.514286,City,4.229398
52,1,AU,7,1,4,0.97,5-12,37.9924,16.826383,297.0,4.337585,60172.0,Pick up your badge and join the LEGO® City Mou...,,Dirt Road Pursuit,4.514286,City,4.229398


#### Explore using a model to predict the values for review_difficulty
> We will use piece_count and play_star_rating as the features to classify the missing values of review_difficulty. Source: https://towardsdatascience.com/the-tale-of-missing-values-in-python-c96beb0e8a9d. The process is as follows:
- Call the variable that has missing values y.
- Split the data into the rows where y is missing and the rows where it is present. Take the features of the missing rows as X_test and the features of the complete rows as X_train, and set the known y values of the complete rows aside as y_train.
- Fit a classification method on X_train and y_train, then predict y_pred for X_test.
- Attach y_pred to the X_test rows as their y column, then combine the two sets again.

<img src="Lego_04.png" alt="LEGO Classifier" title="LEGO Classifier"/>

In [53]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

dfr_test = df[df.review_difficulty.isnull()]
dfr_train = df.dropna()
dfr_results = dfr_test.drop(['review_difficulty'], axis=1)

feature_cols = ['piece_count', 'play_star_rating']

y_pred = dfr_test.review_difficulty
X_test = dfr_test[feature_cols]

y_train = dfr_train.review_difficulty
X_train = dfr_train[feature_cols]

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_s, y_train)

y_pred = knn.predict(X_test_s)
y = y_pred.tolist()

columns=['review_difficulty']
index=dfr_results.index.values.tolist()

ydf = pd.DataFrame(y,columns=columns,index=index)
dfr_results = dfr_results.join(ydf)

frames = [dfr_train, dfr_results]
dfr = pd.concat(frames, sort=True)
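
The cell above fills the gaps with a 1-nearest-neighbour classifier but does not measure how reliable those filled-in labels are. One possible check is to cross-validate the same classifier on rows where the label is known. The sketch below does this on synthetic data, so `X_known` and `y_known` are assumptions standing in for the scaled features and known review_difficulty values, not the real LEGO rows.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for rows with a known label: two features, two separated classes
X_known = np.vstack([
    rng.normal(0, 1, size=(50, 2)),
    rng.normal(5, 1, size=(50, 2)),
])
y_known = np.array(["easy"] * 50 + ["challenging"] * 50)

# Scaling sits inside the pipeline so each fold is scaled from its own training part
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))

# Mean accuracy over 5 folds estimates how often an imputed label would be right
scores = cross_val_score(model, X_known, y_known, cv=5)
print(scores.mean())
```

Repeating this for a few values of `n_neighbors` would show whether k = 1 is a reasonable choice for the imputation.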

#### Working dataframe insights
With the missing values dealt with, we now prepare our working dataframe:
- Converting talent, technology, and tolerance to integers, since the documentation describes them as numerical.
- Converting review_difficulty from categorical to numerical using get_dummies, after first combining categories to fix the class imbalance.

In [54]:
dfr['talent'] = dfr['talent'].astype(int)
dfr['technology'] = dfr['technology'].astype(int)
dfr['tolerance'] = dfr['tolerance'].astype(int)

In [55]:
dfr.review_difficulty.value_counts()

Easy                5195
Average             4079
Very Easy           1316
Challenging         1086
Very Challenging       7
Name: review_difficulty, dtype: int64

In [56]:
dfr.review_difficulty.replace('Very Easy','easy',inplace=True)
dfr.review_difficulty.replace('Easy','easy',inplace=True)
dfr.review_difficulty.replace('Average','challenging',inplace=True)
dfr.review_difficulty.replace('Challenging','challenging',inplace=True)
dfr.review_difficulty.replace('Very Challenging','challenging',inplace=True)
dfr.review_difficulty.value_counts()

easy           6511
challenging    5172
Name: review_difficulty, dtype: int64
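
The grouping of five difficulty levels into two can also be written as a single lookup table, which keeps the whole mapping visible in one place. A sketch on a toy Series (the name `groups` is hypothetical; the grouping itself mirrors the cell above):

```python
import pandas as pd

# One dict holds the whole five-to-two grouping
groups = {
    "Very Easy": "easy",
    "Easy": "easy",
    "Average": "challenging",
    "Challenging": "challenging",
    "Very Challenging": "challenging",
}

# Toy input standing in for the review_difficulty column
difficulty = pd.Series(["Easy", "Average", "Very Easy", "Very Challenging"])

# map() replaces each label by its group; labels missing from the dict would become NaN
binary = difficulty.map(groups)
print(binary.value_counts().to_dict())
```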

In [57]:
dfd = dfr.copy()
dfd = pd.get_dummies(dfd, columns=['review_difficulty'], prefix = ['difficulty'])
dfd = pd.get_dummies(dfd, columns=['country'], prefix = ['country'])

In [58]:
dfd.head()

Unnamed: 0,ages,creativity,list_price,num_reviews,piece_count,play_star_rating,prod_id,prod_long_desc,ranking,set_name,...,country_GB,country_IE,country_IT,country_LU,country_NL,country_NO,country_NZ,country_PL,country_PT,country_US
0,12+,0.97,113.9924,23.0,1032.0,3.6,21030.0,Discover the architectural secrets of the icon...,1,United States Capitol Building,...,0,0,0,0,0,0,0,0,0,0
1,12+,0.97,75.9924,14.0,744.0,3.2,21035.0,Discover the architectural secrets of Frank Ll...,1,Solomon R. Guggenheim Museum®,...,0,0,0,0,0,0,0,0,0,0
2,12+,0.97,60.7924,7.0,597.0,3.7,21039.0,Recreate Shanghai in China's blend of historic...,1,Shanghai,...,0,0,0,0,0,0,0,0,0,0
3,12+,0.97,53.1924,24.0,780.0,4.4,21029.0,Build a LEGO® brick model of London's official...,1,Buckingham Palace,...,0,0,0,0,0,0,0,0,0,0
4,12+,0.97,53.1924,37.0,598.0,3.7,21028.0,Celebrate the architectural diversity of New Y...,1,New York City,...,0,0,0,0,0,0,0,0,0,0


In [None]:
dfd.tail()

In [None]:
dfd.shape

In [None]:
dfd.info()

In [None]:
dfd.describe()

From our exploratory data analysis using dfd.corr() and a seaborn heatmap, we can observe the following:
- The correlation between the LEGO features and the GCI measures is weak. As such, the data does not support the earlier stated hypothesis of a relationship between LEGO and creativity.
- The target, list_price, correlates strongly with the features difficulty_challenging, prod_id, piece_count, and num_reviews.

In [None]:
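f,ax=plt.subplots(figsize=(25,25))
# Compute the correlation matrix once and reuse it
corr = dfd.corr()
# np.bool is deprecated and removed in newer NumPy; the builtin bool works the same here
mask = np.zeros_like(corr, dtype=bool)
# Hide the upper triangle, which mirrors the lower one
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, cmap = 'coolwarm', annot=True, ax=ax, linewidth=.5)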
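
With this many columns the heatmap can be hard to read, so it may help to list the correlations with the target directly. A minimal sketch on a made-up frame (`toy` and its numbers are illustrative, not taken from the LEGO data):

```python
import pandas as pd

# Made-up numbers: piece_count tracks the price closely, creativity does not
toy = pd.DataFrame({
    "list_price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "piece_count": [100, 210, 290, 405, 500],
    "creativity": [0.9, 0.5, 0.95, 0.6, 0.9],
})

# Correlation of every other column with the target
target_corr = toy.corr()["list_price"].drop("list_price")

# Rank by absolute value so strong negative correlations are not overlooked
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```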

## Building the Machine Learning Models
#### Creating the training and testing sets
<img src="Lego_08.jpg" alt="LEGO Office" title="LEGO Office"/>

In [None]:
from sklearn import metrics
from sklearn.model_selection import train_test_split

dfx = dfd.copy()
# Drop the target and the text/label columns that are not used as features
not_features=['list_price','ages','prod_long_desc', 'set_name', 'theme_name']
X = dfx.drop(columns=not_features)
y = dfd.list_price

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99, test_size=0.25)

#### Applying Machine Learning
##### Linear Regression

In [None]:
linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_pred_lr = linreg.predict(X_test)


# score() returns R^2 (here on the training data), not a classification accuracy
acc_lr = round(linreg.score(X_train,y_train) * 100, 2)
print(acc_lr, "%")
print('MAE:', metrics.mean_absolute_error(y_test,y_pred_lr))
print('MSE:', metrics.mean_squared_error(y_test,y_pred_lr))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test,y_pred_lr)))
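
For reference, the four numbers printed above can be reproduced by hand. For a regressor, `score` returns the coefficient of determination R2, and MAE, MSE, and RMSE are averages over the prediction errors. A small worked sketch with made-up values (`y_true` and `y_hat` are illustrative only):

```python
import numpy as np

# Made-up actual and predicted prices
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

err = y_true - y_hat              # [-2, 2, -3, 3]
mae = np.mean(np.abs(err))        # (2 + 2 + 3 + 3) / 4 = 2.5
mse = np.mean(err ** 2)           # (4 + 4 + 9 + 9) / 4 = 6.5
rmse = np.sqrt(mse)               # square root of the MSE, in price units

# R^2 = 1 - (sum of squared errors) / (total sum of squares around the mean)
ss_res = np.sum(err ** 2)                        # 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 500
r2 = 1 - ss_res / ss_tot                         # 0.948

print(mae, mse, rmse, r2)
```

RMSE is in the same units as list_price, which makes it the easiest of the three error measures to interpret.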

##### Random Forest Regressor

In [None]:
random_forest = RandomForestRegressor(n_estimators=100, max_depth=3)
random_forest.fit(X_train, y_train)

y_pred_rf = random_forest.predict(X_test)

# score() returns R^2 (here on the training data), not a classification accuracy
acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)
print(acc_random_forest, "%")
print('MAE:', metrics.mean_absolute_error(y_test,y_pred_rf))
print('MSE:', metrics.mean_squared_error(y_test,y_pred_rf))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test,y_pred_rf)))

##### Nearest Neighbors Regression
###### Finding k

In [None]:
rmse=[]
for k in range(1, 21):
    knn=neighbors.KNeighborsRegressor(n_neighbors = k)

    knn.fit(X_train, y_train)
    y_pred_knn=knn.predict(X_test)
    error = sqrt(mean_squared_error(y_test,y_pred_knn))
    rmse.append(error)
    print('RMSE value for k= ' , k , 'is:', error)

In [None]:
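# Index the curve by k (1..len(rmse)) so the x-axis shows k rather than the list position
curve = pd.DataFrame(rmse, index=range(1, len(rmse) + 1), columns=['RMSE'])
curve.plot()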
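
The k with the lowest error can also be read off programmatically instead of from the plot. A minimal sketch, assuming a list like `rmse` above in which position 0 holds the error for k = 1 (the values below are made up):

```python
import numpy as np

# Made-up RMSE values for k = 1..5
rmse = [30.1, 27.4, 26.0, 26.3, 27.9]

# argmin gives the 0-based position of the smallest error; add 1 to get k
best_k = int(np.argmin(rmse)) + 1
print(best_k)  # 3
```

Since these errors are measured on the test set, a k chosen this way is tuned to that particular split. Picking k on a separate validation split or with cross validation would give a less optimistic estimate.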

###### Applying k to find the accuracy

In [None]:
knn = neighbors.KNeighborsRegressor(n_neighbors = 7)
knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)

# score() returns R^2 (here on the training data), not a classification accuracy
acc_knn = round(knn.score(X_train, y_train) * 100, 2)
print(acc_knn, "%")
print('MAE:', metrics.mean_absolute_error(y_test,y_pred_knn))
print('MSE:', metrics.mean_squared_error(y_test,y_pred_knn))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test,y_pred_knn)))

> From the above, we can see that the Random Forest Regressor performs better than the linear regression and nearest neighbor regression.

## K-Fold Cross Validation
#### Five-fold Cross Validation

In [None]:
from sklearn import model_selection

kf = model_selection.KFold(n_splits=5, shuffle=True)

rmse_rf = []
scores = []
n = 0

print('CROSS VALIDATION each fold:')
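for train_index, test_index in kf.split(X_train, y_train):
    # kf.split yields positions within X_train, so index X_train/y_train rather than the full X/y
    rf  = RandomForestRegressor().fit(X_train.iloc[train_index], y_train.iloc[train_index])
    
    rmse_rf.append(np.sqrt(metrics.mean_squared_error(y_train.iloc[test_index], rf.predict(X_train.iloc[test_index]))))
    # Score on the held-out fold only, so R2 is not inflated by rows the model was fitted on
    scores.append(rf.score(X_train.iloc[test_index], y_train.iloc[test_index]))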
    
    n += 1
    
    print('Model {}'.format(n))
    print('RMSE: {}'.format(rmse_rf[n-1]))
    print('R2: {}\n'.format(scores[n-1])) 
    #print('Coefficients: \n', lr.coef_)

print("SUMMARY OF CROSS VALIDATION")
print('Mean of RMSE for all folds: {}'.format(np.mean(rmse_rf)))
print('Mean of R2 for all folds: {}'.format(np.mean(scores)))

#### Ten-fold Cross Validation

In [None]:
from sklearn import model_selection

kf = model_selection.KFold(n_splits=10, shuffle=True)

rmse_rf = []
scores = []
n = 0

print('CROSS VALIDATION each fold:')
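for train_index, test_index in kf.split(X_train, y_train):
    # kf.split yields positions within X_train, so index X_train/y_train rather than the full X/y
    rf  = RandomForestRegressor().fit(X_train.iloc[train_index], y_train.iloc[train_index])
    
    rmse_rf.append(np.sqrt(metrics.mean_squared_error(y_train.iloc[test_index], rf.predict(X_train.iloc[test_index]))))
    # Score on the held-out fold only, so R2 is not inflated by rows the model was fitted on
    scores.append(rf.score(X_train.iloc[test_index], y_train.iloc[test_index]))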
    
    n += 1
    
    print('Model {}'.format(n))
    print('RMSE: {}'.format(rmse_rf[n-1]))
    print('R2: {}\n'.format(scores[n-1])) 
    #print('Coefficients: \n', lr.coef_)

print("SUMMARY OF CROSS VALIDATION")
print('Mean of RMSE for all folds: {}'.format(np.mean(rmse_rf)))
print('Mean of R2 for all folds: {}'.format(np.mean(scores)))

With both five-fold and ten-fold cross validation, the RMSE of each fold stays relatively close to the mean RMSE, and R2 is high across the folds and comparable to the model score.
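
The manual fold loops can also be expressed with scikit-learn's `cross_validate`, which handles the fold indexing and scores each model on its held-out fold. The sketch below runs on synthetic regression data, so `X_demo` and `y_demo` are assumptions standing in for the LEGO features and list_price.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (not the LEGO set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Fixed seeds so the folds and the forest are reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)

# Each fold is fitted on 4/5 of the rows and scored on the remaining 1/5
cv = cross_validate(model, X_demo, y_demo, cv=kf,
                    scoring=("neg_root_mean_squared_error", "r2"))

# scikit-learn reports errors negated (higher is better), so flip the sign back
rmse_per_fold = -cv["test_neg_root_mean_squared_error"]
r2_per_fold = cv["test_r2"]

print("Mean RMSE:", rmse_per_fold.mean())
print("Mean R2:", r2_per_fold.mean())
```

Changing `n_splits` to 10 gives the ten-fold variant without any other change.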

#### Using Random Forest Importance to Tune Linear Regression
##### Getting and summarizing feature importance

In [None]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
importances.head(10)

##### Using the features and creating the linear regression model

In [None]:
dfz = dfd.copy()
features=['piece_count','val_star_rating','num_reviews']

Xz = dfz[features]
yz = dfd.list_price

X_train, X_test, y_train, y_test = train_test_split(Xz, yz, random_state=99, test_size=0.25)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_pred_lr = linreg.predict(X_test)


# score() returns R^2 (here on the training data), not a classification accuracy
acc_lr = round(linreg.score(X_train,y_train) * 100, 2)
print(acc_lr, "%")
print('MAE:', metrics.mean_absolute_error(y_test,y_pred_lr))
print('MSE:', metrics.mean_squared_error(y_test,y_pred_lr))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test,y_pred_lr)))

## Conclusions and Recommendations
> On the first hypothesis, we have seen no strong relationship between LEGO and creativity.

> On the second hypothesis, we have been able to create a model that can help predict the price of a LEGO set, and we found that the key features determining the price are piece_count, val_star_rating, num_reviews, and difficulty.

> It would be good to find another dataset that might show a stronger relationship between LEGO and creativity, and to try predicting the ranking of a market based on the features of the combined data set.

<img src="Lego_09.gif" alt="LEGO Saying" title="LEGO Saying" width="300" class="center">


<img src="LinkedIn.jpeg"/><img src="Lego_07.jpg" alt="LEGO End" title="LEGO End"/>