# Diamond Price Regression

Dataset link: https://www.kaggle.com/shivam2503/diamonds

### Context

This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It's a great dataset for beginners learning to work with data analysis and visualization.

### Content

**price** price in US dollars (\$326--\$18,823)

**carat** weight of the diamond (0.2--5.01)

**cut** quality of the cut (Fair, Good, Very Good, Premium, Ideal)

**color** diamond colour, from J (worst) to D (best)

**clarity** a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

**x** length in mm (0--10.74)

**y** width in mm (0--58.9)

**z** depth in mm (0--31.8)

**depth** total depth percentage = 100 * z / mean(x, y) = 200 * z / (x + y) (43--79)

**table** width of top of diamond relative to widest point (43--95)
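
As a quick sanity check on the **depth** definition, the sketch below recomputes it from the x, y, z values of the first row shown in the `df.head()` output further down, expressing the ratio as a percentage (multiplied by 100). The helper name `depth_percentage` is made up for this illustration and is not part of the dataset or any library.

```python
def depth_percentage(x: float, y: float, z: float) -> float:
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    if x + y == 0:
        # The listed ranges start at 0 for x, y and z, so guard against division by zero.
        raise ValueError("x + y must be non-zero to compute depth")
    return 100 * 2 * z / (x + y)


# Dimensions of the first row in the notebook's df.head() output.
computed = depth_percentage(3.95, 3.98, 2.43)
print(f"computed depth: {computed:.2f}, recorded depth: 61.5")
```

The computed value is close to, but not exactly, the recorded 61.5, which would be consistent with the dimensions being stored rounded to two decimals.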

# Imports

In [20]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os # system module

import sklearn # ML library
from sklearn import svm, preprocessing

In [17]:
df = pd.read_csv("../input/diamonds.csv", index_col=0)
df.head()

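| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |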
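
The ranges listed in the Content section start at 0 for x, y and z, which suggests some rows may carry a zero dimension. The sketch below shows one possible way to flag and drop such rows. It runs on a tiny hand-made frame (the third row is invented for illustration), and whether dropping is the right treatment for the real data is an assumption this notebook does not test.

```python
import pandas as pd

# Tiny stand-in for the real data; the last row is invented to show a zero dimension.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 1.00],
    "x": [3.95, 3.89, 0.0],
    "y": [3.98, 3.84, 6.5],
    "z": [2.43, 2.31, 0.0],
})

dims = ["x", "y", "z"]
# True for every row where at least one of the three dimensions equals zero.
has_zero_dim = (sample[dims] == 0).any(axis=1)
cleaned = sample[~has_zero_dim]

print(f"rows with a zero dimension: {has_zero_dim.sum()}")
print(cleaned)
```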


# Data preparation

**Converting labels to numerical values**

In [18]:
df['cut'].unique()

array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)

In [19]:
cut_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
clarity_dict = {"I3": 1, "I2": 2, "I1": 3, "SI2": 4, "SI1": 5, "VS2": 6, "VS1": 7, "VVS2": 8, "VVS1": 9, "IF": 10, "FL": 11}
color_dict = {"J": 1,"I": 2,"H": 3,"G": 4,"F": 5,"E": 6,"D": 7}

df['cut'] = df['cut'].map(cut_dict)
df['clarity'] = df['clarity'].map(clarity_dict)
df['color'] = df['color'].map(color_dict)
df.head()

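| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.23 | 5 | 6 | 4 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | 4 | 6 | 5 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 3 | 0.23 | 2 | 6 | 7 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 4 | 0.29 | 4 | 2 | 6 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 5 | 0.31 | 2 | 1 | 4 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |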
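
`Series.map` with a dictionary turns any label missing from the dictionary into `NaN` without raising an error, so it can be worth checking for unmapped values after the conversion. A minimal sketch on a hand-made series follows; the label "Excellent" is hypothetical and only there to trigger the case.

```python
import pandas as pd

cut_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

# "Excellent" is a made-up label that does not appear in cut_dict.
cuts = pd.Series(["Ideal", "Premium", "Excellent"])
mapped = cuts.map(cut_dict)

# Labels that were silently converted to NaN by the mapping.
unmapped = cuts[mapped.isna()].unique()
print(f"unmapped labels: {list(unmapped)}")
```

On the real frame the same check could be applied to `cut`, `clarity` and `color` before the features are scaled.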


**Preparing data**

In [26]:
# shuffle data
df = sklearn.utils.shuffle(df)

# split data into feature set X and labels y
X = df.drop("price", axis=1).values
y = df["price"].values

# scale feature set
X = preprocessing.scale(X)

# split data into training & testing sets
test_size = 200

X_train = X[:-test_size]
y_train = y[:-test_size]

X_test = X[-test_size:]
y_test = y[-test_size:]
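
The cell above scales the whole feature matrix before splitting, so the 200 test rows contribute to the mean and variance used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data with nine feature columns; the variable names and the random data are assumptions for illustration, not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the nine diamond features and the price column.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(1000, 9))
y_demo = rng.normal(size=1000)

# An integer test_size is read as an absolute number of rows; shuffling is done here.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=200, shuffle=True, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With roughly 54,000 rows the difference in the scaling statistics is likely small, but this ordering keeps the test set fully unseen.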

# Model training (linear kernel)

**Training**

In [28]:
clf = svm.SVR(kernel="linear")
clf.fit(X_train, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='linear', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)

**Show the score**

In [29]:
clf.score(X_test, y_test)

0.8840228338814546
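
For regressors such as `SVR`, `score` returns the coefficient of determination R², where 1.0 is a perfect fit and 0.0 matches a model that always predicts the mean of the true values. The sketch below computes R² by hand on toy numbers to make the definition concrete; the function name `r2` is chosen for this illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy targets, not taken from the diamonds data.
perfect = r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # predictions equal the targets
mean_only = r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predicts the mean
print(perfect, mean_only)
```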

**Check some predictions**

In [31]:
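# use distinct loop names so the full X and y arrays are not overwritten
for features, real_price in list(zip(X_test, y_test))[:10]:
    print(f"Predicted value: {clf.predict([features])[0]}, Real value: {real_price}")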

Predicted value: 5273.0066010428745, Real value: 5293
Predicted value: 4645.503997530525, Real value: 4308
Predicted value: 8121.141560187336, Real value: 8555
Predicted value: 6006.7119480563815, Real value: 5124
Predicted value: 5534.519467297801, Real value: 8403
Predicted value: 1736.5549284488188, Real value: 1950
Predicted value: 172.99162966173662, Real value: 602
Predicted value: 3377.316242012824, Real value: 2940
Predicted value: 1819.4750076035025, Real value: 1720
Predicted value: 654.4692295436957, Real value: 811
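
To summarise a list like this in one number, the mean absolute error (MAE) gives the average miss in dollars. The sketch below computes it for the ten pairs printed above, with the predictions rounded to two decimals; it describes only these ten rows, not the full test set.

```python
# The ten (predicted, real) pairs printed above, predictions rounded to 2 decimals.
pairs = [
    (5273.01, 5293), (4645.50, 4308), (8121.14, 8555), (6006.71, 5124),
    (5534.52, 8403), (1736.55, 1950), (172.99, 602), (3377.32, 2940),
    (1819.48, 1720), (654.47, 811),
]

# Mean absolute error: average of |predicted - real| over the pairs.
mae = sum(abs(pred - real) for pred, real in pairs) / len(pairs)
# Largest single miss, to show how much one row can dominate the average.
worst = max(abs(pred - real) for pred, real in pairs)

print(f"MAE over these rows: {mae:.2f}, largest error: {worst:.2f}")
```

`clf.predict` also accepts a 2D array, so all ten rows could be predicted with a single `clf.predict(X_test[:10])` call instead of one call per row.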


# Model training (RBF kernel)

**Training**

In [33]:
clf = svm.SVR(kernel="rbf")
clf.fit(X_train, y_train)



SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)

**Score**

In [35]:
clf.score(X_test, y_test)

0.5443050551573221
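
One plausible contributor to the lower RBF score, which this notebook does not test, is that the targets are prices in the thousands while `C=1.0` caps the size of each dual coefficient, limiting how far predictions can move away from the intercept. The sketch below illustrates the effect on synthetic data by comparing a plain RBF `SVR` with one wrapped in `TransformedTargetRegressor`, which standardises the target before fitting. The data and variable names are invented; results on the diamonds data may differ.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features and a target on a scale of thousands, loosely like prices.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 3))
y_syn = (4000 + 3000 * np.sin(X_syn[:, 0]) + 1000 * X_syn[:, 1]
         + rng.normal(scale=50, size=300))

X_fit, X_eval = X_syn[:200], X_syn[200:]
y_fit, y_eval = y_syn[:200], y_syn[200:]

# RBF SVR trained directly on the raw target.
raw = SVR(kernel="rbf").fit(X_fit, y_fit)

# Same model, but the target is standardised for fitting and mapped back for predictions.
scaled = TransformedTargetRegressor(
    regressor=SVR(kernel="rbf"), transformer=StandardScaler()
).fit(X_fit, y_fit)

raw_score = raw.score(X_eval, y_eval)
scaled_score = scaled.score(X_eval, y_eval)
print(f"raw target R^2: {raw_score:.3f}, scaled target R^2: {scaled_score:.3f}")
```

Raising `C` or tuning `gamma` and `epsilon` are other options one might try; which of these helps on the real data would need to be checked.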

**Check some predictions**

In [36]:
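# use distinct loop names so the full X and y arrays are not overwritten
for features, real_price in list(zip(X_test, y_test))[:10]:
    print(f"Predicted value: {clf.predict([features])[0]}, Real value: {real_price}")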

Predicted value: 4627.196543866474, Real value: 5293
Predicted value: 4364.349684095466, Real value: 4308
Predicted value: 5814.670649599392, Real value: 8555
Predicted value: 5337.807842931223, Real value: 5124
Predicted value: 5175.573856092917, Real value: 8403
Predicted value: 2918.335621723976, Real value: 1950
Predicted value: 1466.650783485402, Real value: 602
Predicted value: 3122.03292429705, Real value: 2940
Predicted value: 2033.4289685926972, Real value: 1720
Predicted value: 682.3429363939044, Real value: 811


# Model training (default SVR)

**Training**

In [37]:
clf = svm.SVR()
clf.fit(X_train, y_train)



SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)
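
The repr above lists `kernel='rbf'`, so `svm.SVR()` with no arguments builds the same model as the explicit RBF section, which is why the score and predictions that follow repeat the earlier ones. A short check of the default, using only `get_params`:

```python
from sklearn import svm

# Build an SVR with no arguments and read back the kernel it will use.
default_model = svm.SVR()
default_kernel = default_model.get_params()["kernel"]
print(default_kernel)
```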

**Score**

In [40]:
clf.score(X_test, y_test)

0.5443050551573221

**Check some predictions**

In [41]:
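# use distinct loop names so the full X and y arrays are not overwritten
for features, real_price in list(zip(X_test, y_test))[:10]:
    print(f"Predicted value: {clf.predict([features])[0]}, Real value: {real_price}")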

Predicted value: 4627.196543866474, Real value: 5293
Predicted value: 4364.349684095466, Real value: 4308
Predicted value: 5814.670649599392, Real value: 8555
Predicted value: 5337.807842931223, Real value: 5124
Predicted value: 5175.573856092917, Real value: 8403
Predicted value: 2918.335621723976, Real value: 1950
Predicted value: 1466.650783485402, Real value: 602
Predicted value: 3122.03292429705, Real value: 2940
Predicted value: 2033.4289685926972, Real value: 1720
Predicted value: 682.3429363939044, Real value: 811
