# Why are low quality diamonds more expensive?
## 66 Days of Data with Ken Jee


Today I was wondering whether the objective of a data science project is always to build a predictive model. So I went back to my "R for Data Science" book and reread the modeling chapter, and this time I found a new insight: a model can also be used for data processing, like a filter that you train to make your data easier to understand. I hope you read it and reproduce the diamonds example. Here is my Python version...

**Author:** Andres Jejen   
**Bibliography**: [model building](https://r4ds.had.co.nz/model-building.html)

### Loading Libraries

In [None]:
import numpy as np    # linear algebra
import pandas as pd   # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # plotting library
import matplotlib.pyplot as plt

# Modeling libraries
from sklearn import linear_model
sns.set_style("whitegrid") # setting style

### Loading Data

The diamonds dataset contains information about ~54K diamonds, including the ``price``, ``carat`` (weight), ``cut``, ``clarity``, ``depth``, the ``x``, ``y``, ``z`` dimensions, and so on.   
After the exploratory data analysis we found some counterintuitive facts. For example, the median price of diamonds appears to be higher for lower-quality cuts, colors, and clarities.   
Is that true, or can we explain this phenomenon with the data? First of all, let's take a look.

In [None]:
diamonds = pd.read_csv('/kaggle/input/diamonds/diamonds.csv')

order = {
    "cut": ["Fair","Good", "Very Good", "Premium", "Ideal"],
    "clarity": ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"],
    "color": ["D","E","F","G","H","I","J"]
}

diamonds.head()

In [None]:
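def EDA(data=diamonds, target_feature="price"):
    """Box plot and per-category median of target_feature for cut, clarity and color."""
    for feature in ["cut", "clarity", "color"]:
        plt.figure(figsize=(14, 7))
        sns.boxplot(data=data, x=feature, y=target_feature, order=order[feature])
        print(f"Median analysis by {feature}")
        # reindex so the table follows the same category order as the plot
        print(data.groupby(feature)[target_feature].median().reindex(order[feature]))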

In [None]:
EDA()

You can see that the median price is highest for the lowest-quality diamonds.

- In the ``cut`` category, ``Fair`` is the worst, yet it has the highest median price.
- In the ``color`` category, ``J`` is the worst color, yet it has the highest median price.
- In the ``clarity`` category, ``I1`` and ``SI2`` are the worst, yet they have the highest median prices.

> Here is a challenge: why are we using the median instead of the mean in this case? **Please comment below**.

Let's look at it from another angle: what if we compare the carat (weight) feature with price?

In [None]:
plt.figure(figsize=(14,7))
sns.scatterplot(data=diamonds, x="carat", y="price")

The price appears to grow non-linearly with carat (the curve bends upward), and most of the data comes from diamonds lighter than 2.5 carats.   
Let's make a couple of tweaks to the dataset to make it easier to work with.   

1. Focus on diamonds smaller than 2.5 carats.
2. Log-transform carat and price to straighten out the non-linear relationship between them.
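
Why does the log transform help? A minimal sketch, assuming the price follows roughly a power law of the weight, `price = a * carat ** b`. The constants `a` and `b` below are made-up values for illustration, not estimates from the dataset. Taking log2 of both sides turns the curve into a straight line whose slope is `b`.

```python
import numpy as np

# Hypothetical power law: price = a * carat ** b (a and b are made-up values)
a, b = 4000.0, 1.7
carat = np.linspace(0.2, 2.5, 50)
price = a * carat ** b

# On the log2 scale the relationship is a straight line:
# log2(price) = log2(a) + b * log2(carat)
slope, intercept = np.polyfit(np.log2(carat), np.log2(price), deg=1)

print(f"slope = {slope:.3f}")                    # recovers b
print(f"2 ** intercept = {2 ** intercept:.1f}")  # recovers a
```

Real prices are noisy, so the fit on the actual data will not be exact, but the same idea applies.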

In [None]:
diamonds_filtered = diamonds.query("carat < 2.5").copy()  # copy so that adding columns does not trigger SettingWithCopyWarning
diamonds_filtered["log_price"] = np.log2(diamonds_filtered["price"])
diamonds_filtered["log_carat"] = np.log2(diamonds_filtered["carat"])

In [None]:
plt.figure(figsize=(14,7))
sns.scatterplot(data=diamonds_filtered, x="log_carat", y="log_price")

Now we see a possibly linear relationship, so let's fit a linear model and evaluate the result.

In [None]:
X = diamonds_filtered[["log_carat"]].to_numpy()  # 2D array of shape (n_samples, 1), as scikit-learn expects
y = diamonds_filtered["log_price"].to_numpy()    # 1D target array

linear_regressor = linear_model.LinearRegression()  # create the model object
linear_regressor.fit(X, y)  # fit the linear regression
score = linear_regressor.score(X, y)  # R^2 on the data used for fitting

print(f"R^2 of the linear regression: {score}")

diamonds_filtered["log_predicted_price"] = linear_regressor.predict(X)  # predictions on the log2 scale
diamonds_filtered["predicted_price"] = 2 ** diamonds_filtered["log_predicted_price"]  # undo the log2 transform

# Actual vs. predicted price on the log scale
fig, ax = plt.subplots(figsize=(14, 7))
sns.scatterplot(data=diamonds_filtered, x="log_carat", y="log_price", ax=ax)
sns.scatterplot(data=diamonds_filtered, x="log_carat", y="log_predicted_price", ax=ax)

# Actual vs. predicted price on the original scale
fig, ax = plt.subplots(figsize=(14, 7))
sns.scatterplot(data=diamonds_filtered, x="carat", y="price", ax=ax)
sns.scatterplot(data=diamonds_filtered, x="carat", y="predicted_price", ax=ax)

Let's look at the residuals. Remember that a linear fit is reasonable when the residuals are spread evenly around zero, with no visible pattern.

In [None]:
diamonds_filtered["model_log_residuals"] = diamonds_filtered["log_price"]-diamonds_filtered["log_predicted_price"]
plt.figure(figsize=(14,7))
sns.scatterplot(data=diamonds_filtered, x="log_carat", y="model_log_residuals")
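
The visual check can be complemented with a numeric one. The sketch below is only an illustration on synthetic data (the coefficients, noise level and variable names are assumptions, not values from the diamonds dataset): it splits the predictor into four equally populated bins and compares the mean and standard deviation of the residuals in each bin. For a well-behaved linear fit, the means should be close to zero and the standard deviations similar across bins.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the log-transformed data (all values are made up)
x = rng.uniform(-2.0, 1.3, size=4000)
y = 12.0 + 1.7 * x + rng.normal(scale=0.2, size=x.size)

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Split x into four equally populated bins and summarize the residuals in each
bins = pd.qcut(x, q=4, labels=False)
summary = (
    pd.DataFrame({"bin": bins, "residual": residuals})
    .groupby("bin")["residual"]
    .agg(["mean", "std"])
)
print(summary)
```

The same summary could be computed on `diamonds_filtered` by replacing `x` and `residuals` with the `log_carat` and `model_log_residuals` columns.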

## Final part
Now repeat the exploratory data analysis, this time using the residuals instead of the price.


In [None]:
EDA(diamonds_filtered,"model_log_residuals")
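
When reading these plots, it helps to remember that the residuals are on a log2 scale, so each unit is a factor of two in price. A small sketch with illustrative residual values (not taken from the dataset):

```python
import numpy as np

# Residuals on the log2 scale (illustrative values, not taken from the dataset)
log2_residuals = np.array([-1.0, 0.0, 1.0])

# 2 ** residual is the ratio actual price / predicted price
price_ratio = 2.0 ** log2_residuals
for r, ratio in zip(log2_residuals, price_ratio):
    print(f"residual {r:+.1f} -> actual price is {ratio:.1f}x the predicted price")
```

So a residual of -1 means the diamond sold for half of what its weight alone would suggest, and a residual of +1 means it sold for twice as much.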

> The counterintuitive result arises because the poorest-quality diamonds tend to be the largest (possibly the kind used for tunneling or drilling for oil). The linear regression model gave us a way to remove the effect of size and explain the phenomenon. From here it is possible to build more sophisticated models, which could eventually lead to a model that predicts the price of a diamond.
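
The mechanism can be reproduced on a toy example. This is a sketch on synthetic data, where every number is an assumption chosen for illustration: a hypothetical quality grade (0 = worst, 2 = best), lower-quality stones that tend to be larger, and a price that depends strongly on size and only mildly on quality.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 6000

# Hypothetical quality grade: 0 = worst, 2 = best
quality = rng.integers(0, 3, size=n)

# Assumption: lower-quality stones tend to be larger (on the log scale)
log_carat = rng.normal(loc=-0.5 * quality, scale=0.5, size=n)

# Assumption: price depends strongly on size and only mildly on quality
log_price = 10.0 + 1.7 * log_carat + 0.3 * quality + rng.normal(scale=0.1, size=n)

# Remove the size effect with a straight-line fit, as in the notebook
slope, intercept = np.polyfit(log_carat, log_price, deg=1)
residual = log_price - (slope * log_carat + intercept)

toy = pd.DataFrame({"quality": quality, "log_price": log_price, "residual": residual})
raw_medians = toy.groupby("quality")["log_price"].median()
residual_medians = toy.groupby("quality")["residual"].median()

print(raw_medians)       # falls as quality improves: size dominates
print(residual_medians)  # rises as quality improves once size is accounted for
```

In this toy setting the raw medians suggest that worse stones are more expensive, while the residual medians recover the expected ordering.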

## TAKEAWAYS

Modeling is not only the final task of a data science project; models are also useful for making tweaks to the data in order to extract insights or clean it. This example is taken from the "R for Data Science" book; I just translated it to Python and added some personal comments.
Please let me know your thoughts below.