# Shonil Dabreo, s3835204, Group 62

## Data Preparation 
### Loading dataset
Importing packages and displaying the dataset shape and feature values

In [1]:
import warnings
warnings.filterwarnings("ignore")

# importing packages
import numpy as np
import pandas as pd

# Setting random seed to 999
np.random.seed(999)

# Importing THA_diamonds.csv dataset and assigning it to a variable using read_csv function
diamond = pd.read_csv('THA_diamonds.csv', decimal = '.')

# printing no of observations & features
print(diamond.shape)

# headers
diamond.columns.values

(212, 5)


array(['cut', 'color', 'depth', 'price', 'carat'], dtype=object)

There are 212 observations and 5 columns in total: 4 descriptive features and the target feature, `carat`.

### Checking Missing Values


In [2]:
diamond.isna().sum()

cut      0
color    0
depth    0
price    0
carat    0
dtype: int64

We found no missing values.

Let's look at the contents of the dataset.

In [3]:
# To display all columns in a dataframe. 
pd.set_option('display.max_columns', None) 
diamond

Unnamed: 0,cut,color,depth,price,carat
0,Good,D,63.6,low,0.44
1,Fair,F,64.2,low,0.45
2,Good,I,60.4,low,0.50
3,Good,F,56.8,low,0.45
4,Fair,F,64.3,low,0.45
...,...,...,...,...,...
207,Good,F,63.7,premium,0.96
208,Fair,D,57.5,premium,0.90
209,Fair,F,64.7,premium,0.90
210,Good,I,58.2,premium,0.93


## Encoding Categorical Features
Categorical features must be encoded as numerical features before modelling can begin.


We assign the `carat` feature to the *target* variable and the remaining features to the *data* variable.

In [4]:
data = diamond.drop(columns='carat')
target = diamond['carat']
target.head(5)

0    0.44
1    0.45
2    0.50
3    0.45
4    0.45
Name: carat, dtype: float64

Now, we perform one-hot encoding for categorical features.

In [5]:
category_feat = data.columns[data.dtypes==object].tolist()
category_feat

['cut', 'color', 'price']

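The categorical features are `cut`, `color` and `price`. Features with exactly 2 levels are encoded first, with `drop_first` set to `True` so that each becomes a single binary column. The remaining features, which have more than 2 levels, are then one-hot encoded into one column per level.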

In [6]:
for column in category_feat:
    n = len(data[column].unique())
    if (n == 2):
        data[column] = pd.get_dummies(data[column], drop_first=True)

# One-hot encoding for features with more than 2 levels
data = pd.get_dummies(data)

# printing list of features after encoding them
print(data.columns)

Index(['cut', 'depth', 'color_D', 'color_F', 'color_I', 'price_high',
       'price_low', 'price_medium', 'price_premium'],
      dtype='object')


In [7]:
data.head(5)

Unnamed: 0,cut,depth,color_D,color_F,color_I,price_high,price_low,price_medium,price_premium
0,1,63.6,1,0,0,0,1,0,0
1,0,64.2,0,1,0,0,1,0,0
2,1,60.4,0,0,1,0,1,0,0
3,1,56.8,0,1,0,0,1,0,0
4,0,64.3,0,1,0,0,1,0,0
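
The two encoding paths can be illustrated on a small made-up frame. This is only a sketch: the rows and the names `toy`, `cut_binary` and `color_onehot` are hypothetical and not part of the dataset. It shows that `drop_first=True` drops the alphabetically first level (`Fair`), so the remaining indicator is 1 for `Good`, which agrees with the `cut` column shown above.

```python
import pandas as pd

# Toy frame (made-up rows) with one two-level and one three-level feature.
toy = pd.DataFrame({
    "cut": ["Good", "Fair", "Good"],
    "color": ["D", "F", "I"],
})

# Two levels: drop_first=True drops the first level in sorted order ("Fair"),
# leaving a single indicator column for "Good".
cut_binary = pd.get_dummies(toy["cut"], drop_first=True).astype(int)

# Three levels: plain one-hot encoding yields one indicator column per level.
color_onehot = pd.get_dummies(toy["color"], prefix="color").astype(int)

print(cut_binary.columns.tolist())
print(color_onehot.columns.tolist())
```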


## Feature Scaling
Applying min-max scaling to all descriptive features. A copy of the dataframe is kept for later use.

In [8]:
from sklearn import preprocessing

dt_copy = data.copy()

Data_scaler = preprocessing.MinMaxScaler()
# fit_transform fits the scaler and scales the data in one step
data = Data_scaler.fit_transform(data)

The scaler returns a NumPy array, so the column names are lost. We use the copy of the dataframe to restore them.

In [9]:
data = pd.DataFrame(data, columns = dt_copy.columns)
data.tail(10)

Unnamed: 0,cut,depth,color_D,color_F,color_I,price_high,price_low,price_medium,price_premium
202,1.0,0.134328,0.0,0.0,1.0,0.0,0.0,0.0,1.0
203,1.0,0.201493,0.0,0.0,1.0,0.0,0.0,0.0,1.0
204,0.0,0.776119,0.0,1.0,0.0,0.0,0.0,0.0,1.0
205,1.0,0.320896,0.0,0.0,1.0,0.0,0.0,0.0,1.0
206,0.0,0.858209,0.0,1.0,0.0,0.0,0.0,0.0,1.0
207,1.0,0.626866,0.0,1.0,0.0,0.0,0.0,0.0,1.0
208,0.0,0.164179,1.0,0.0,0.0,0.0,0.0,0.0,1.0
209,0.0,0.701493,0.0,1.0,0.0,0.0,0.0,0.0,1.0
210,1.0,0.216418,0.0,0.0,1.0,0.0,0.0,0.0,1.0
211,0.0,0.268657,0.0,1.0,0.0,0.0,0.0,0.0,1.0


The last 10 observations are displayed. We can see that the values are between 0 and 1 after scaling.
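
For reference, min-max scaling maps each column with (x - min) / (max - min). A minimal sketch on made-up depth values (the array `depth` is an assumption for illustration only) checks that the manual formula agrees with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up depth values, used only for illustration.
depth = np.array([[56.8], [60.4], [63.6], [64.3]])

# Min-max scaling by hand: (x - min) / (max - min), column by column.
col_min = depth.min(axis=0)
col_max = depth.max(axis=0)
manual = (depth - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(depth)

print(np.allclose(manual, scaled))
```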

## Train-Test Splitting
Splitting the dataset into train and test sets with a 70:30 ratio.

In [10]:
from sklearn.model_selection import train_test_split

d_train, d_test, t_train, t_test = train_test_split(data, target, test_size = 0.3, random_state = 999)

print(d_train.shape)
print(d_test.shape)

(148, 9)
(64, 9)
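
The split sizes follow from the ratio: 0.3 × 212 = 63.6, and scikit-learn rounds the test count up to 64, leaving 148 rows for training. A small sketch on placeholder arrays (`X` and `y` here are made-up stand-ins with the same number of rows, not the diamonds data) reproduces the counts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 212 rows, standing in for the features and target.
X = np.arange(212).reshape(-1, 1)
y = np.arange(212)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=999
)

print(X_train.shape, X_test.shape)
```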


## Model Hyperparameter tuning and Evaluation

We run 10-fold cross-validation on the full scaled dataset and build a model for k = 1, 5, 10 (number of neighbors). RMSE is the evaluation measure, and the model with the lowest value performs best. For each k, the RMSE is computed per fold and then averaged across the 10 folds.

Note that scikit-learn's `neg_mean_squared_error` scorer returns the negated MSE, so the scores are negative. The sign is flipped before taking the square root.
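
The sign convention can be seen in isolation on synthetic data. This is a sketch only: `X`, `y` and the fold count below are made up for illustration and are unrelated to the diamonds dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 2))
y = X[:, 0] + 0.1 * rng.normal(size=60)

# The scorer returns one negated MSE per fold (higher, i.e. closer to 0, is better).
neg_mse = cross_val_score(
    KNeighborsRegressor(n_neighbors=5), X, y,
    cv=5, scoring="neg_mean_squared_error",
)

# Flip the sign, then take the square root to get the RMSE of each fold.
rmse_per_fold = np.sqrt(-neg_mse)

print(neg_mse)
print(rmse_per_fold.mean())
```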

In [11]:
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

k = [1, 5, 10]
rmse_errors = []
for n_neighbors in k:
    knn_cv = KNeighborsRegressor(n_neighbors = n_neighbors, metric = 'euclidean')
    # cross_val_score returns the negated MSE per fold, so flip the sign before taking the root
    rmse = np.sqrt(-cross_val_score(knn_cv, data, target, cv = 10, scoring = 'neg_mean_squared_error'))
    rmse_errors.append(rmse.mean())
    print("The RMSE error for k =", n_neighbors, "is:", rmse_errors[-1].round(3))
    print()

The RMSE error for k = 1 is: 0.088

The RMSE error for k = 5 is: 0.077

The RMSE error for k = 10 is: 0.078



The lowest average RMSE, 0.077, is obtained with 5 neighbors, so k = 5 is the best of the three values tried.
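
The same kind of search can also be expressed with `GridSearchCV`. The sketch below runs on synthetic data (`X` and `y` are made up), so its result says nothing about the diamonds dataset. `GridSearchCV` ranks candidates by the mean negated MSE across folds, which is not exactly the same quantity as the mean of per-fold RMSE values, so the two approaches may occasionally disagree when candidates are close.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 2))
y = X[:, 0] + 0.1 * rng.normal(size=60)

# Try the same candidate values of k with 10-fold cross-validation.
gs = GridSearchCV(
    KNeighborsRegressor(metric="euclidean"),
    param_grid={"n_neighbors": [1, 5, 10]},
    cv=10,
    scoring="neg_mean_squared_error",
)
gs.fit(X, y)

print(gs.best_params_)
# Square root of the mean MSE of the best candidate.
print(np.sqrt(-gs.best_score_))
```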

We will now fit a model on the training set for each value of k.

In [12]:
knn_1 = KNeighborsRegressor(n_neighbors = 1, metric = 'euclidean')
knn_1.fit(d_train, t_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='euclidean',
                    metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                    weights='uniform')

In [13]:
knn_5 = KNeighborsRegressor(n_neighbors = 5, metric = 'euclidean')
knn_5.fit(d_train, t_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='euclidean',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [14]:
knn_10 = KNeighborsRegressor(n_neighbors = 10, metric = 'euclidean')
knn_10.fit(d_train, t_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='euclidean',
                    metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                    weights='uniform')

Now that the models with different parameters are fitted, we will use the `predict` function to predict the target value for a single observation. First, let's create a dataframe with a single observation and convert its categorical features. Then, so that the observation matches the training data, we encode it with the same columns and scale it based on the training data.

In [15]:
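single_df = pd.DataFrame([['Good', 60, 'D', 'premium']] , columns = ['cut', 'depth', 'color', 'price'])

# converting features to categories
# 'cut' levels are listed as Fair, Good so that drop_first keeps the 'Good' indicator, as in the training encoding
single_df['cut'] = pd.Categorical(single_df['cut'], 
                                  categories = ["Fair", "Good"])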

single_df['color'] = pd.Categorical(single_df['color'], 
                                  categories = ["D", "F", "I"])

single_df['price'] = pd.Categorical(single_df['price'], 
                                  categories = ["low", "medium", "high", "premium"])

# Encoding 2 levels and >2 levels categorical values based on training data (d_train.columns). 
single_df['cut'] = pd.get_dummies(single_df['cut'], drop_first=True)
single_df_enc = pd.get_dummies(single_df)[d_train.columns]

# transforming the encoded data of single observation
d_new = preprocessing.MinMaxScaler().fit_transform(single_df_enc)
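
One point worth checking at this step: a `MinMaxScaler` fitted on a single row sees a zero range in every column and therefore maps that row to all zeros, whereas a scaler already fitted on the training data (`Data_scaler` in this notebook) keeps the original scale. A minimal sketch on made-up depth values (`train_depth` and `new_depth` are assumptions for illustration) shows the difference:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values standing in for a training column and one new observation.
train_depth = np.array([[56.0], [60.0], [64.0]])
new_depth = np.array([[60.0]])

# Reusing the scaler fitted on the training column: (60 - 56) / (64 - 56).
fitted = MinMaxScaler().fit(train_depth)
reused = fitted.transform(new_depth)

# Fitting a fresh scaler on the single row: min == max, so the result is 0.
refit = MinMaxScaler().fit_transform(new_depth)

print(reused, refit)
```

If the intent is to scale the new observation on the training scale, calling `transform` on the already fitted scaler is the variant that does so.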

The predicted carat value for k = 1

In [16]:
t_pred1 = knn_1.predict(d_new)
t_pred1

array([0.64])

The predicted carat value for k = 5

In [17]:
t_pred5 = knn_5.predict(d_new)
t_pred5

array([0.664])

The predicted carat value for k = 10

In [18]:
t_pred10 = knn_10.predict(d_new)
t_pred10

array([0.678])
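
With uniform weights, as in the models above, a KNN regressor predicts the mean target of the k nearest training points. The sketch below checks this by hand on synthetic data; `X`, `y`, `query` and `k` are made up for illustration and do not come from the diamonds dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic training data and a single query point, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))
y = rng.uniform(size=20)
query = np.array([[0.5, 0.5]])
k = 3

model = KNeighborsRegressor(n_neighbors=k, metric="euclidean").fit(X, y)

# By hand: Euclidean distance to every training point, then average the
# targets of the k closest ones.
dist = np.linalg.norm(X - query, axis=1)
nearest = np.argsort(dist)[:k]
manual = y[nearest].mean()

print(np.isclose(manual, model.predict(query)[0]))
```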

## Wrap Up

**df_summary_sklearn**

|method | prediction | is_best|
|---|---|---|
|1-KNN | 0.64 | False |
|5-KNN | 0.664 | False |
|10-KNN | 0.678 | True |