# Coding Quiz for L15-16 (30 July 2022)

Please complete this notebook and submit your answers using the following link:

https://forms.gle/gxg18mzPsDf1BtB1A

There are **4** questions in total, and you have 15 minutes for this quiz. Good luck!

## **You are strongly recommended to run this notebook with GPUs.**

*(For Google Colab users, select Runtime > Change runtime type > Hardware accelerator > GPU)*

In [48]:
import numpy as np
import pandas as pd
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [49]:
df = sb.load_dataset('diamonds')

In [50]:
df

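| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |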


## Data Pre-processing & Analysis

In [51]:
# remove the column "color"

print("No. of columns: {} (Before)".format(df.shape[1]))

df.drop(columns=['color'], inplace=True) # your code here (1)

print("No. of columns: {} (After)".format(df.shape[1]))

No. of columns: 10 (Before)
No. of columns: 9 (After)


**Q1. Correct the code above to remove the column "color" from df. (Copy your code to the submission form)**
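
As a small illustration of the two usual ways to drop a column in pandas, the sketch below uses a made-up three-row frame. The names `toy`, `toy_copy`, `dropped` and `result` are hypothetical and are not part of the quiz notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the diamonds data.
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "color": ["E", "E", "E"],
    "price": [326, 326, 327],
})

# Option 1: drop() returns a new frame and leaves the original unchanged.
dropped = toy.drop(columns=["color"])

# Option 2: inplace=True modifies the frame itself and returns None.
toy_copy = toy.copy()
result = toy_copy.drop(columns=["color"], inplace=True)

print(list(dropped.columns))
print(list(toy_copy.columns))
print(result)
```

With the first option the result must be assigned back to a variable, otherwise the column is still present afterwards.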

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   clarity  53940 non-null  category
 3   depth    53940 non-null  float64 
 4   table    53940 non-null  float64 
 5   price    53940 non-null  int64   
 6   x        53940 non-null  float64 
 7   y        53940 non-null  float64 
 8   z        53940 non-null  float64 
dtypes: category(2), float64(6), int64(1)
memory usage: 3.0 MB


In this exercise, we are trying to predict diamond prices (*price*) using the 8 features (*carat, cut, clarity, depth, table, x, y, z*) given in the dataset.



Categorical feature description:

- cut: Fair, Good, Very Good, Premium, Ideal
- clarity: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)

**Q2a. Which of the following is a better encoding method for the categorical features: cut & clarity?**
- **One hot encoder**
- **Ordinal encoder**

**Q2b.  Explain your answer in Q2a.**
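Ordinal encoder. Both features have a natural ranking (cut runs from Fair to Ideal, clarity from I1 to IF), and ordinal encoding preserves that order in a single column, whereas one-hot encoding discards it.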
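
To make the difference between the two encoders concrete, here is a minimal sketch on a hypothetical four-row sample of cut values. The variable names are illustrative, and the category order is taken from the feature description above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Cut grades in the order given in the feature description.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Hypothetical sample of cut values, one per row.
sample = np.array([["Ideal"], ["Fair"], ["Premium"], ["Good"]], dtype=object)

# Ordinal encoding: a single column whose integers follow the stated order.
ordinal = OrdinalEncoder(categories=[cut_order]).fit_transform(sample)

# One-hot encoding: one column per category seen in the data, no order implied.
one_hot = OneHotEncoder().fit_transform(sample).toarray()

print(ordinal.ravel())
print(one_hot.shape)
```

The ordinal version keeps one column per feature, while the one-hot version grows with the number of distinct categories.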

## Encoding & feature scaling

In [53]:
# choose an encoder for categorical features
# based on your answer in Q2, assign encoder = 'one hot' or 'ordinal'

encoder = 'ordinal' # your code here (2)

In [54]:
# encode categorical features
def encode_cat(encoder_name, data):
  if encoder_name == 'one hot':
    return OneHotEncoder().fit_transform(data).toarray()
  elif encoder_name == 'ordinal':
    # categories follow the order given in the feature description
    cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
    clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
    return OrdinalEncoder(categories=[cut_order, clarity_order]).fit_transform(data)
  else:
    raise ValueError("Please assign encoder = 'one hot' or 'ordinal'.")

cat = ['cut', 'clarity']
cat_encoded = encode_cat(encoder, df[cat])

In [55]:
# scaling on numerical features

num = ['carat', 'depth', 'table', 'x', 'y', 'z']
num_scaled = StandardScaler().fit_transform(df[num])

In [56]:
# combine the numpy arrays cat_encoded and num_scaled as input features

X =  np.concatenate((cat_encoded, num_scaled), axis=1)

*Hint: Try np.concatenate()* 

*https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html*
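
The effect of the `axis` argument can be seen on small arrays. This is a sketch with hypothetical stand-ins (`cat_part`, `num_part`) for the encoded and scaled arrays, not the notebook's actual data.

```python
import numpy as np

# Hypothetical stand-ins: 3 rows, 2 encoded categorical columns
# and 3 scaled numerical columns.
cat_part = np.array([[4.0, 1.0], [3.0, 2.0], [1.0, 4.0]])
num_part = np.zeros((3, 3))

# axis=1 joins the arrays side by side, so the row counts must match.
features = np.concatenate((cat_part, num_part), axis=1)

# axis=0 would stack rows instead and requires equal column counts.
print(features.shape)
```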

In [57]:
# scaling on prediction label (price)

encoder_y = StandardScaler().fit(df.price.to_numpy().reshape(-1, 1))
Y = encoder_y.transform(df.price.to_numpy().reshape(-1, 1))
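
Because the label is standardized, any MSE computed on it is expressed in units of the price variance rather than in squared dollars. The sketch below, on a few hypothetical prices, shows how a fitted `StandardScaler` can map values back with `inverse_transform` and how a scaled MSE could be converted. The value `scaled_mse` is illustrative only and is not a result from this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical prices in dollars, shaped as a single column.
prices = np.array([326.0, 326.0, 327.0, 334.0, 335.0]).reshape(-1, 1)

scaler = StandardScaler().fit(prices)
scaled = scaler.transform(prices)

# After scaling, the column has mean 0 and standard deviation 1.
# inverse_transform maps scaled values (or scaled predictions) back to dollars.
recovered = scaler.inverse_transform(scaled)

# An MSE measured on scaled labels can be converted to squared dollars
# by multiplying by the variance that the scaler learned.
scaled_mse = 0.1  # illustrative value, not a result from the notebook
dollar_mse = scaled_mse * scaler.var_[0]

print(recovered.ravel())
```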

## Diamonds Price Prediction Model Training & Evaluation

In [58]:
# Split the data into training and testing sets with a ratio of 8:2 and random state 10

x_train, x_test, y_train, y_test =  train_test_split(X, Y, test_size=0.2, random_state=10)
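
A quick way to check what `test_size` and `random_state` do is to split a tiny array. This is a sketch on hypothetical data (`X_demo`, `y_demo`), with all variable names chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

# Repeating the call with the same random_state gives the same split.
_, a_test_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

print(a_train.shape, a_test.shape)
```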

In [59]:
# train a basic linear regression model

lr_model = LinearRegression()
lr_model.fit(x_train, y_train)

predict = lr_model.predict(x_test)
print("MSE: {}".format(mean_squared_error(y_test, predict)))

MSE: 0.11386911250118208
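
For reference, the mean squared error is the average of the squared differences between labels and predictions. The sketch below computes it by hand on made-up numbers and compares the result with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up labels and predictions for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.5])

# Average of the squared errors: (0.25 + 0 + 1 + 0.25) / 4.
manual_mse = np.mean((y_true - y_pred) ** 2)
library_mse = mean_squared_error(y_true, y_pred)

print(manual_mse, library_mse)
```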


In [60]:
# define a deep neural network model

dnn_model = Sequential()
dnn_model.add(Dense(32, input_shape=(x_train.shape[1],), activation='relu'))
dnn_model.add(Dense(32, activation='relu'))
dnn_model.add(Dense(1))

dnn_model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_squared_error'])
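
To get a feel for the size of this network, the trainable parameters of each `Dense` layer can be counted by hand as weights plus biases. The sketch assumes 8 input features (2 ordinal-encoded plus 6 scaled numerical columns); with one-hot encoding the input width would be larger. The helper `dense_params` is hypothetical and not part of Keras.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 8  # assumption: 2 ordinal-encoded + 6 scaled numerical columns
layer_units = [32, 32, 1]  # units of the three Dense layers

counts = []
n_in = n_features
for units in layer_units:
    counts.append(dense_params(n_in, units))
    n_in = units  # the output of one layer feeds the next

total = sum(counts)
print(counts, total)
```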

In [61]:
# train the dnn_model using x_train and y_train, with batch_size = 2048 and epochs = 10

dnn_model.fit(x_train, y_train, batch_size=2048, epochs=10)

print("MSE: {}".format(dnn_model.evaluate(x_test, y_test)[1]))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
MSE: 0.09306003898382187


**Q3a. Which model (lr_model or dnn_model) performed better in terms of MSE?**

*Hint: The lower the better*

dnn_model (test MSE of about 0.093, compared with about 0.114 for lr_model)

**Q3b. Does the result match your expectations? Explain your answer.**

**Q4. Which of the following is a possible way to improve dnn_model? (You may choose more than one)**
- **Adding more Dense() layers**
- **Increasing the number of units in the first two layers**
- **Increasing the number of epochs**
- **Increasing the batch size**
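
One way to reason about the epoch and batch-size options is to count how many weight updates the optimizer performs. The sketch below assumes the 53940 rows reported by `df.info()` and the 8:2 split used earlier. The helper `updates` is hypothetical, and the count says nothing by itself about the resulting MSE.

```python
import math

n_samples = 53940              # rows reported by df.info()
n_train = n_samples * 8 // 10  # training share of an 8:2 split

def updates(n_train: int, batch_size: int, epochs: int) -> int:
    """Number of gradient steps: batches per epoch times epochs."""
    return math.ceil(n_train / batch_size) * epochs

current = updates(n_train, batch_size=2048, epochs=10)
bigger_batch = updates(n_train, batch_size=4096, epochs=10)
more_epochs = updates(n_train, batch_size=2048, epochs=20)

print(n_train, current, bigger_batch, more_epochs)
```

Under these assumptions, doubling the batch size halves the number of updates for the same number of epochs, while doubling the epochs doubles it.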

## End of Quiz! 👏
## Remember to submit your answers and notebook!