
# 1. Investigating IKEA Furniture Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

# 2. Predicting IKEA's features

## Table of Contents
    
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#prep">Imports and preparations</a></li>
<li><a href="#model_1">Model 1</a></li>
<li><a href="#model_2">Model 2</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Context:
This dataset is an exercise in web scraping techniques. It was scraped from the furniture category of the IKEA Saudi Arabia website. The scraped website link: https://www.ikea.com/sa/en/cat/furniture-fu001/

The data was requested on 4/20/2020. <br>
Dataset: https://www.kaggle.com/ahmedkallam/ikea-sa-furniture-web-scraping

### Content:

* item_id: item id, which can be used later to merge with other IKEA dataframes
* name: the commercial name of the item
* category: the furniture category that the item belongs to (Sofas, beds, chairs, Trolleys,…)
* price: the current price in Saudi Riyals as shown on the website on 4/20/2020
* old_price: the price of the item in Saudi Riyals before discount
* short_description: a brief description of the item
* full_description: a very detailed description of the item. Because it is long, it is dropped from the final dataframe, but it is available in the code in case it needs to be analyzed.
* designer: the name of the designer who designed the item. This is extracted from the full_description column.
* size: the dimensions of the item, including a lot of details. Because many dimensions are mentioned and they vary from item to item, the most common ones have been extracted: height, width, and depth. This column is dropped from the final dataframe, but it is available in the code in case it is needed.
* width: width of the item in centimeters
* height: height of the item in centimeters
* depth: depth of the item in centimeters
* sellable_online: whether the item is available for online purchase or in stores only (Boolean)
* other_colors: whether other colors are available for the item, or just the one color displayed on the website (Boolean)
* link: the web link of the item

### Licences:
The scraped website link: https://www.ikea.com/sa/en/cat/furniture-fu001/

## Importing and Loading data

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf

import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import *


%matplotlib inline

In [None]:
df = pd.read_csv('../input/ikea-sa-furniture-web-scraping/IKEA_SA_Furniture_Web_Scrapings_sss.csv',  index_col=0)

In [None]:
df.head()

<a id='wrangling'></a>
## Data Wrangling

In [None]:
df.describe()

* the max value of price might indicate an outlier; a visualization will tell us more.
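
Before plotting, a quick numeric check can hint at whether the maximum is an outlier. The sketch below applies the common 1.5 × IQR rule to a toy series; the prices are made up for illustration and the rule itself is an assumption, not something the original analysis uses.

```python
import pandas as pd

# Toy prices in SR (made-up values, not taken from the dataset).
prices = pd.Series([50, 120, 250, 300, 450, 9585], name="price")

# Interquartile range and the usual upper fence (Q3 + 1.5 * IQR).
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are candidate outliers worth inspecting.
outliers = prices[prices > upper_fence]
print(outliers)
```

On the real data the same lines could be run on `df['price']`; a flagged value is only a candidate and still needs a look in context.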


In [None]:
df.info()

* huge number of null values in the depth, height, and width features
* old_price is an object column and might require a closer look

In [None]:
df.nunique()

* The number of unique values varies across features; category, sellable_online, and other_colors could be useful in the analysis

In [None]:
df.duplicated().sum()

* No Duplicates

### Investigating Categorical values

In [None]:
df['category'].value_counts()

* the old_price feature needs some modifications:
    1. remove the "SR " prefix
    2. replace "No old price" with the current price
    3. convert it to float

In [None]:
df.category.unique()

In [None]:
df.other_colors.unique()

In [None]:
df.sellable_online.unique()

### Investigating Old Price

* Also add a new column indicating whether an item is discounted

In [None]:
df.old_price.unique()[:20]

In [None]:
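def fix_old_price(row):
    '''Return a numeric old_price for one row of the dataframe.'''

    # No discount: the old price equals the current price.
    if row['old_price'] == 'No old price':
        return row['price']

    # Strip the 3-character "SR " prefix and the thousands separators,
    # then convert to float.
    elif row['old_price'][-4:] != 'pack':
        return float(str(row['old_price'])[3:].replace(',', ''))

    # Entries ending in 'pack' are set to NaN and dropped later.
    else:
        return np.nan

# Flag discounted items before old_price is overwritten.
df['discounted'] = (df['old_price'] != 'No old price').astype(int)
df['old_price'] = df.apply(fix_old_price, axis=1)
df[['price', 'old_price', 'discounted']].head()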

* here I found 10 values containing the word "pack", so I set them to NaN and will drop them later
* and I applied the modifications mentioned above
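
The same cleaning can also be written without a row-wise `apply`. The sketch below is one possible vectorized variant on a toy frame; the values (including the format of the "pack" entry) are made up for illustration, and the `old_price_clean` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values); the exact "pack" format is an assumption.
toy = pd.DataFrame({
    "price": [265.0, 995.0, 69.0, 20.0],
    "old_price": ["No old price", "SR 1,385", "SR 89", "SR 50/4 pack"],
})

is_missing = toy["old_price"].eq("No old price")
is_pack = toy["old_price"].str.endswith("pack")

# Strip the "SR " prefix and thousands separators, then convert to float.
# errors="coerce" turns anything unparsable into NaN.
numeric = pd.to_numeric(
    toy["old_price"]
    .str.replace("SR ", "", regex=False)
    .str.replace(",", "", regex=False),
    errors="coerce",
)

toy["discounted"] = (~is_missing).astype(int)
# Undiscounted items inherit the current price; pack prices stay NaN.
toy["old_price_clean"] = numeric.mask(is_missing, toy["price"]).mask(is_pack)
print(toy)
```

The row-wise function is easier to read; the vectorized form mainly pays off on larger frames.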

<hr>


### Investigating depth, height, width

In [None]:
df[['depth', 'height', 'width']].isna().head(5)

* looks like one or more of the three measures can be null for a given item
* before fixing those values, let's plot them.
    - from: https://www.kaggle.com/pozdniakov/ikea-furniture
    - Inspiration: https://twitter.com/henrywrover2/status/1323626098924621825

In [None]:
ggplot(df, aes(xmin = 0, ymin = 0, xmax = 'width', ymax = 'height', colour = 'category')) + \
geom_rect(alpha = 0.05, fill = "#FFFFFF", size = 1) + \
scale_x_continuous(limits = (0, 200)) + \
scale_y_continuous(limits = (0, 200)) + \
facet_wrap('category', ncol = 3) + \
guides()+ \
coord_fixed() +\
theme(figure_size=(9, 9)) 

* Nice, looks like the items in each category have similar shapes
* now let's count the valid (not null) values of each measure

In [None]:
ggplot(df, aes(xmin = 0, ymin = 0, xmax = 'width', ymax = 'height', colour = 'price', size='price',  fill = 'sellable_online')) + \
geom_rect(alpha = 0.05, fill = "#FFFFFF", size = 1) + \
scale_x_continuous(limits = (0, 200)) + \
scale_y_continuous(limits = (0, 200)) + \
facet_wrap('category', ncol = 3) + \
guides()+ \
coord_fixed() +\
theme(figure_size=(9, 9)) 

* And here brighter (yellowish) colors mean the item is pricier than others

In [None]:
df.groupby('category')[['width', 'height', 'depth']].apply(lambda x: x.notnull().sum())

* some categories show patterns; for example, most of the Trolleys don't have a depth value
<br><br>
* now I will fill the null values with the category mean, but first I will add 3 new columns indicating whether each value was originally available (for other purposes)
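
One edge case is worth keeping in mind when filling with a group mean: if a category has no valid value at all for a measure, its mean is NaN and the gaps stay unfilled. The toy sketch below (made-up values) shows both the indicator column and that edge case.

```python
import numpy as np
import pandas as pd

# Toy data: "Trolleys" has no valid depth at all (made-up values).
toy = pd.DataFrame({
    "category": ["Beds", "Beds", "Trolleys", "Trolleys"],
    "depth": [200.0, np.nan, np.nan, np.nan],
})

# Remember which values were originally present.
toy["depth_d"] = toy["depth"].notnull().astype(int)

# Fill gaps with the mean of the item's own category.
toy["depth"] = toy.groupby("category")["depth"].transform(
    lambda s: s.fillna(s.mean())
)
print(toy)
```

The missing "Beds" depth becomes 200.0, while both "Trolleys" rows remain NaN, so a later `dropna` would still remove them.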

In [None]:
df['width_d'] = (df['width'].notnull()).astype(int)
df['height_d'] = (df['height'].notnull()).astype(int)
df['depth_d'] = (df['depth'].notnull()).astype(int)
df[['width', 'height', 'depth', 'width_d', 'height_d', 'depth_d']].head(5)

In [None]:
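# Select the columns with a list (double brackets); tuple-style selection on a groupby fails in recent pandas.
df[['width', 'height', 'depth']] = df.groupby(['category'])[['width', 'height', 'depth']].transform(lambda x: x.fillna(x.mean()))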

In [None]:
df.groupby('category')[['width', 'height', 'depth']].apply(lambda x: x.notnull().sum())


<hr>

### Dropping unused columns and the 10 rows with weird old_price values

In [None]:
cols = ['item_id', 'name','link', 'short_description',
        'designer']
df2 = df.drop(cols, axis=1)
df2.columns

In [None]:
df2.isna().sum()

In [None]:
df2.dropna(inplace=True)
df2.isna().sum()

In [None]:
df2.head()

<hr>

<a id='eda'></a>
## Exploratory Data Analysis


* now that the cleaning is done, let's look at the data for any interesting insights


## Univariate Exploration

In [None]:
order = df['category'].value_counts().index
color0 = sns.color_palette()[0]
color1 = sns.color_palette()[1]

plt.figure(figsize=[10, 6])


sns.countplot(data=df2, y='category', order=order, color=color1)

In [None]:
binsize = 500

plt.figure(figsize=[16, 5])
plt.hist(data=df2, x='price',bins=binsize, color=color1)

plt.xlabel('Price');

* the price distribution looks roughly log-shaped; let's zoom in a little bit

In [None]:
binsize = 500

plt.figure(figsize=[16, 5])
plt.hist(data=df2, x='price',bins=binsize, color=color1)
plt.xlim(0,1500)

plt.xlabel('Price');

* there appears to be a peak in prices every 100 SR, most noticeably at the 1000 SR mark
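
One way to probe the regular peaks numerically is to look at the remainder of each price modulo 100: if prices cluster at a fixed offset from a round hundred, one remainder dominates. The sketch below uses made-up prices; that the real data behaves like this is an assumption to check, not a finding.

```python
import pandas as pd

# Toy prices in SR (made-up values).
prices = pd.Series([95, 195, 295, 995, 995, 120, 349, 1395])

# Remainder after dividing by 100; a dominant remainder suggests
# a pricing pattern that repeats every 100 SR.
remainder = prices % 100
counts = remainder.value_counts()
print(counts)
```

On the real data the same idea would apply to `df2['price'] % 100`.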

In [None]:
binsize = 500

plt.figure(figsize=[16, 5])
plt.hist(data=df2, x='old_price',bins=binsize, color=color0)
plt.xlim(0,1500)

plt.xlabel('Old Price');

In [None]:
binsize = 500

plt.figure(figsize=[16, 5])
plt.hist(data=df2, x='old_price',bins=binsize)
plt.hist(data=df2, x='price',bins=binsize)

plt.xlim(0,1500)

plt.xlabel('Price vs Old Price');

* old price, as expected, follows the shape of price

In [None]:
selable_online_count = df2['sellable_online'].value_counts()

plt.figure(figsize=[6, 6])
explode = (0, 0.4)

plt.pie(selable_online_count, explode=explode, autopct='%1.1f%%');
# Label wedges in value_counts() order so the legend matches the slices.
plt.legend(selable_online_count.index.astype(str).tolist())

In [None]:
other_colors_count = df2['other_colors'].value_counts()

plt.figure(figsize=[6, 6])
explode = (0, 0.1)

plt.pie(other_colors_count, autopct='%1.1f%%')
# Label wedges in value_counts() order so the legend matches the slices.
plt.legend(other_colors_count.index.astype(str).tolist());

In [None]:
discounted_count = df2['discounted'].value_counts()

plt.figure(figsize=[6, 6])
explode = (0, 0.1)

plt.pie(discounted_count, autopct='%1.1f%%')
# Use the discounted labels (0/1), not the other_colors labels.
plt.legend(discounted_count.index.astype(str).tolist());

* here we see that most items are sellable online (99.2%)
* and only 40% of items have other colors
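
The percentages shown in the pies can also be read off directly with `value_counts(normalize=True)`. A minimal sketch on a toy frame (made-up values, so the shares differ from the ones quoted above):

```python
import pandas as pd

# Toy frame (made-up values).
toy = pd.DataFrame({
    "sellable_online": [True, True, True, False],
    "other_colors": ["Yes", "No", "No", "No"],
})

# Share of each level, in percent, for every categorical column.
shares = {
    col: toy[col].value_counts(normalize=True).mul(100).round(1).to_dict()
    for col in ["sellable_online", "other_colors"]
}
for col, pct in shares.items():
    print(col, pct)
```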

In [None]:
binsize = 30

measures = ['width', 'height', 'depth']

fig, ax = plt.subplots(nrows=3, figsize = [6,8])
for index, measure in enumerate(measures): 
    ax[index].hist(data=df2, x=measure, bins=binsize, color=color0)
    ax[index].set_ylabel(measure);
    ax[index].set_xlabel('');

## Bivariate Exploration

### 1. Price vs Old Price

In [None]:
plt.figure(figsize=[16, 6])

sns.scatterplot(data=df2, x="old_price", y="price", alpha=0.3);

In [None]:
plt.figure(figsize=[16, 6])

sns.scatterplot(data=df2.query('old_price < 500'), x="old_price", y="price", alpha=0.3)

* interesting relation between old price and price: the discount amount increases linearly as the price increases
* maybe we can see more if we look at the relation between price and the discount amount

In [None]:
plt.figure(figsize=[16, 6])

df2['discount_amount'] = df2['old_price'] - df2['price']

sns.scatterplot(data=df2, x="price", y="discount_amount", alpha=0.4)

In [None]:
plt.figure(figsize=[16, 6])

sns.scatterplot(data=df2.query('price < 3000'), x="price", y="discount_amount", alpha=0.5)

* from this visualization, we found:
    1. most of the items don't have any discount
    2. for low prices there are two linear relations: one shared with the high-priced items and one limited to low prices
    3. this relation corresponds to roughly a 25% discount
    4. items from 8k to 10k SR don't follow this relation; they have a 200 SR discount only
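
To put a number on the slopes seen in the scatter plots, the discount can be expressed as a percentage. The sketch below measures it relative to the old price, which is an assumption about how the "25%" should be read; the rows are made up and `discount_pct` is a hypothetical column name.

```python
import pandas as pd

# Toy rows (made-up values).
toy = pd.DataFrame({
    "price": [750.0, 300.0, 9000.0, 120.0],
    "old_price": [1000.0, 400.0, 9200.0, 120.0],
})

toy["discount_amount"] = toy["old_price"] - toy["price"]
# Discount as a percentage of the old price (assumed definition).
toy["discount_pct"] = (toy["discount_amount"] / toy["old_price"] * 100).round(1)
print(toy)
```

Grouping the real `discount_pct` values, for example with `value_counts()`, would show whether a few fixed rates account for the lines in the plot.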

### 2. Price vs Categorical Variables

In [None]:
plt.figure(figsize=[16, 6])
result = df.groupby(["category"])['price'].aggregate(np.mean).reset_index().sort_values('price', ascending=False)

sns.barplot(data=df2, y='price', x='category', color=color0, order=result['category'])

plt.xticks(rotation=90);

In [None]:
plt.figure(figsize=[6, 4])

sns.barplot(data=df2, y='price', x='other_colors', color=color1)
plt.xticks(rotation=90);

* items with other colors are more expensive on average

In [None]:
plt.figure(figsize=[6, 4])

sns.barplot(data=df2, y='price', x='sellable_online', color=color0)
plt.xticks(rotation=90);

* items that are sellable online are more expensive on average than those sold in stores only

In [None]:
order = df['category'].value_counts().index

sns.catplot(data=df2, x="category", hue='other_colors', kind="count", order=order, height=8, aspect=12/10)

plt.xticks(rotation=90);

### 3. Measures vs Price

In [None]:
binsize = 30

measures = ['width', 'height', 'depth']

fig, ax = plt.subplots(nrows=3, figsize = [6,8])
for index, measure in enumerate(measures): 
    sns.scatterplot(data=df2, x="price", y=measure, alpha=0.5, ax = ax[index])
    ax[index].set_ylabel(measure);
    ax[index].set_xlabel('');

In [None]:
df2['size'] = (np.where(df2['depth_d'] == 1, df2['depth'],1)) *\
(np.where(df2['width_d'] == 1, df2['width'],1)) *\
(np.where(df2['height_d'] == 1, df2['height'],1))


df2[['size', 'width', 'height', 'depth', 'width_d', 'height_d', 'depth_d']].head(10)

In [None]:
plt.figure(figsize=[16, 6])

sns.scatterplot(data=df2, x="price", y="size", alpha=0.5)

In [None]:
plt.figure(figsize=[16, 6])

sns.scatterplot(data=df2.query('price < 3000'), x="price", y="size", hue='discounted', alpha=0.5)


In [None]:
result = df.groupby(["category"])['price'].aggregate(np.mean).reset_index().sort_values('price', ascending=False)

sns.catplot(data=df2, x="category", hue='other_colors', kind="bar", y='price', order=result['category'], height=10, aspect=12/9)

plt.xticks(rotation=90);

In [None]:
result = df.groupby(["category"])['price'].aggregate(np.mean).reset_index().sort_values('price', ascending=False)

sns.catplot(data=df2, x="category", hue='discounted', kind="bar", y='price', order=result['category'], height=10, aspect=12/9)

plt.xticks(rotation=90);

In [None]:
plt.figure(figsize=[16, 10])

sns.scatterplot(data=df2, x="width", y="height", size='price', hue='price')

In [None]:
df2.head()

In [None]:
df2.to_csv('clean_IKEA_dataset.csv', index=False)

<a id='conclusions'></a>
## Conclusions

<a id='prep'></a>
## Imports and preparations

In [None]:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers
from tensorflow.keras import utils

In [None]:
df = df2.copy()
df.head()

In [None]:
scaler = MinMaxScaler()

df[['size', 'width', 'height', 'depth', 'discount_amount','price']] = scaler.fit_transform(df[['size', 'width', 'height', 'depth', 'discount_amount','price']])
df.head()

In [None]:
# encode class values as integers
encoder = LabelEncoder()
encoded_Y = encoder.fit_transform(df.category)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = utils.to_categorical(encoded_Y)


In [None]:
df_train , df_test, dummy_y_train, dummy_y_test = train_test_split(df, dummy_y, shuffle=True, test_size=0.3)

<a id='model_1'></a>
## Model 1: Item's price prediction

In [None]:
feature_columns = []

discount_amount = tf.feature_column.numeric_column("discount_amount")
feature_columns.append(discount_amount)

size = tf.feature_column.numeric_column("size")
feature_columns.append(size)

width = tf.feature_column.numeric_column("width")
feature_columns.append(width)

height = tf.feature_column.numeric_column("height")
feature_columns.append(height)

depth = tf.feature_column.numeric_column("depth")
feature_columns.append(depth)

width_d = tf.feature_column.numeric_column("width_d")
feature_columns.append(width_d)

height_d = tf.feature_column.numeric_column("height_d")
feature_columns.append(height_d)

depth_d = tf.feature_column.numeric_column("depth_d")
feature_columns.append(depth_d)

other_colors = tf.feature_column.categorical_column_with_vocabulary_list(
    key='other_colors', vocabulary_list=('Yes', 'No'), default_value=0)
feature_columns.append(tf.feature_column.indicator_column(other_colors))

category = tf.feature_column.categorical_column_with_vocabulary_list(
    key='category', vocabulary_list=('Bar furniture', 'Beds', 'Bookcases & shelving units',
                                     'Cabinets & cupboards', 'Café furniture', 'Chairs',
                                     'Chests of drawers & drawer units', "Children's furniture",
                                     'Nursery furniture', 'Outdoor furniture', 'Room dividers',
                                     'Sideboards, buffets & console tables', 'Sofas & armchairs',
                                     'Tables & desks', 'Trolleys', 'TV & media furniture', 'Wardrobes'),
    default_value=0)
feature_columns.append(tf.feature_column.indicator_column(category))

my_feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

In [None]:
#@title Define the plotting function.

def plot_the_loss_curve(epochs, mse):
  """Plot a curve of loss vs. epoch."""

  plt.figure()
  plt.xlabel("Epoch")
  plt.ylabel("Mean Squared Error")

  plt.plot(epochs, mse, label="Loss")
  plt.legend()
  plt.ylim([mse.min()*0.95, mse.max() * 1.03])
  plt.show()  

print("Defined the plot_the_loss_curve function.")

In [None]:
def create_model(my_learning_rate, my_feature_layer):
  """Create and compile a small neural network regression model."""
  # Most simple tf.keras models are sequential.
  model = tf.keras.models.Sequential()

  # Add the layer containing the feature columns to the model.
  model.add(my_feature_layer)

  # Describe the topology of the model by calling the tf.keras.layers.Dense
  # method once for each layer. We've specified the following arguments:
  #   * units specifies the number of nodes in this layer.
  #   * activation specifies the activation function (Rectified Linear Unit).
  #   * name is just a string that can be useful when debugging.

  # Define the first hidden layer with 20 nodes.   
  model.add(tf.keras.layers.Dense(units=20, 
                                  activation='relu', 
                                  kernel_regularizer=tf.keras.regularizers.l2(l=0.0),
                                  name='Hidden1'))
  
  # Define the second hidden layer with 10 nodes. 
  model.add(tf.keras.layers.Dense(units=10, 
                                  activation='relu', 
                                  kernel_regularizer=tf.keras.regularizers.l2(l=0.0),
                                  name='Hidden2'))

  
  # Define the output layer.
  model.add(tf.keras.layers.Dense(units=1,  
                                  name='Output'))                              
  
  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=my_learning_rate),
                loss="mean_squared_error",
                metrics=[tf.keras.metrics.MeanSquaredError()])

  return model


def train_model(model, dataset, epochs, label_name,
                batch_size=None):
  """Train the model by feeding it data."""

  # Split the dataset into features and label.
  features = {name:np.array(value) for name, value in dataset.items()}
  label = np.array(features.pop(label_name))
  history = model.fit(x=features, y=label, batch_size=batch_size,
                      epochs=epochs, shuffle=True) 

  # The list of epochs is stored separately from the rest of history.
  epochs = history.epoch
  
  # To track the progression of training, gather a snapshot
  # of the model's mean squared error at each epoch. 
  hist = pd.DataFrame(history.history)
  mse = hist["mean_squared_error"]

  return epochs, mse

In [None]:
# The following variables are the hyperparameters.
learning_rate = 0.01
epochs = 20
batch_size = 2


# Specify the label
label_name = "price"

# Establish the model's topology.
my_model = create_model(learning_rate, my_feature_layer)

# Train the model on the normalized training set. We're passing the entire
# normalized training set, but the model will only use the features
# defined by the feature_layer.
epochs, mse = train_model(my_model, df_train, epochs, 
                          label_name, batch_size)
plot_the_loss_curve(epochs, mse)

# After building a model against the training set, test that model
# against the test set.
test_features = {name:np.array(value) for name, value in df_test.items()}
test_label = np.array(test_features.pop(label_name)) # isolate the label
print("\n Evaluate the new model against the test set:")
my_model.evaluate(x = test_features, y = test_label, batch_size=batch_size)
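
Because `price` was min-max scaled to the default [0, 1] range, the reported MSE is in scaled units. A rough way to read it in riyals is to take the square root and multiply by the original price range. The helper below is a hypothetical sketch with made-up numbers; in this notebook the range could be taken from `scaler.data_min_` and `scaler.data_max_` (price is the last of the six scaled columns).

```python
import math

def rmse_in_original_units(mse_scaled, value_min, value_max):
    """Convert an MSE on min-max scaled values to an RMSE in original units."""
    # Min-max scaling divides by (max - min), so errors shrink by that factor.
    return math.sqrt(mse_scaled) * (value_max - value_min)

# Hypothetical numbers, for illustration only.
rmse_sr = rmse_in_original_units(mse_scaled=0.0004, value_min=0.0, value_max=10000.0)
print(round(rmse_sr, 2))
```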

<a id='model_2'></a>
## Model 2: Category classifier

In [None]:
feature_columns = []

price = tf.feature_column.numeric_column("price")
feature_columns.append(price)

discount_amount = tf.feature_column.numeric_column("discount_amount")
feature_columns.append(discount_amount)

size = tf.feature_column.numeric_column("size")
feature_columns.append(size)

width = tf.feature_column.numeric_column("width")
feature_columns.append(width)

height = tf.feature_column.numeric_column("height")
feature_columns.append(height)

depth = tf.feature_column.numeric_column("depth")
feature_columns.append(depth)

width_d = tf.feature_column.numeric_column("width_d")
feature_columns.append(width_d)

height_d = tf.feature_column.numeric_column("height_d")
feature_columns.append(height_d)

depth_d = tf.feature_column.numeric_column("depth_d")
feature_columns.append(depth_d)

other_colors = tf.feature_column.categorical_column_with_vocabulary_list(
    key='other_colors', vocabulary_list=('Yes', 'No'), default_value=0)
feature_columns.append(tf.feature_column.indicator_column(other_colors))


my_feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

In [None]:
#@title Define the plotting function
def plot_curve(epochs, hist, list_of_metrics):
  """Plot a curve of one or more classification metrics vs. epoch."""  
  # list_of_metrics should be one of the names shown in:
  # https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#define_the_model_and_metrics  

  plt.figure()
  plt.xlabel("Epoch")
  plt.ylabel("Value")

  for m in list_of_metrics:
    x = hist[m]
    plt.plot(epochs[1:], x[1:], label=m)

  plt.legend()

print("Loaded the plot_curve function.")

In [None]:
def create_model(my_learning_rate, my_feature_layer):
  """Create and compile a deep neural net."""
  
  # Most simple tf.keras models are sequential.
  model = tf.keras.models.Sequential()

  # Add the layer containing the feature columns to the model.
  model.add(my_feature_layer)

  # Define three hidden layers, each followed by dropout.
  model.add(tf.keras.layers.Dense(units=500, activation='relu'))
  model.add(tf.keras.layers.Dropout(rate=0.2))

  model.add(tf.keras.layers.Dense(units=200, activation='relu'))
  model.add(tf.keras.layers.Dropout(rate=0.2))

  model.add(tf.keras.layers.Dense(units=20, activation='relu'))
  model.add(tf.keras.layers.Dropout(rate=0.2))


  # Output Layer
  model.add(tf.keras.layers.Dense(units=17, activation='softmax'))     
                           
  # Construct the layers into a model that TensorFlow can execute.  
  # Notice that the loss function for multi-class classification
  # is different than the loss function for binary classification.  
  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=my_learning_rate),
                loss="categorical_crossentropy",
                metrics=['accuracy'])
  
  return model    


def train_model(model, dataset, train_label, epochs,
                batch_size=None):
  """Train the model by feeding it data."""

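  # Convert the dataframe to a dict of arrays and drop the label column.
  # Note: label_name is read from the notebook's global scope; the one-hot
  # labels themselves are passed in as train_label.
  features = {name:np.array(value) for name, value in dataset.items()}
  label = np.array(features.pop(label_name))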


  history = model.fit(x=features, y=train_label, batch_size=batch_size,
                      epochs=epochs, shuffle=True) 

  # The list of epochs is stored separately from the rest of history.
  epochs = history.epoch
  
  # To track the progression of training, gather a snapshot
  # of the model's metrics at each epoch.
  hist = pd.DataFrame(history.history)

  return epochs, hist

In [None]:
# The following variables are the hyperparameters.
learning_rate = 0.001
epochs = 200
batch_size = 5


label_name = "category"
# Establish the model's topology.
my_model = create_model(learning_rate, my_feature_layer)

# Train the model on the normalized training set.
epochs, hist = train_model(my_model, df_train, dummy_y_train, 
                           epochs, batch_size)

# Plot a graph of the metric vs. epochs.
list_of_metrics_to_plot = ['accuracy']
plot_curve(epochs, hist, list_of_metrics_to_plot)

# Evaluate against the test set.
print("\n Evaluate the new model against the test set:")
features = {name:np.array(value) for name, value in df_test.items()}

my_model.evaluate(x=features, y=dummy_y_test, batch_size=batch_size)
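
To inspect individual predictions rather than a single accuracy number, the softmax output can be mapped back to category names with `argmax`. The sketch below uses a made-up probability matrix and an illustrative subset of class names; in this notebook the probabilities would come from `my_model.predict(features)` and the names from `encoder.classes_`.

```python
import numpy as np

# Illustrative subset of class names, in encoder order (assumption).
class_names = np.array(["Beds", "Chairs", "Wardrobes"])

# Made-up softmax outputs (one row per item) and one-hot true labels.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6]])
true_one_hot = np.array([[0, 1, 0],
                         [1, 0, 0],
                         [0, 1, 0]])

# The predicted class is the column with the highest probability.
pred_idx = probs.argmax(axis=1)
true_idx = true_one_hot.argmax(axis=1)

accuracy = (pred_idx == true_idx).mean()
print(class_names[pred_idx], accuracy)
```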