In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
plt.rcParams["figure.figsize"] = (8,8)

In [None]:
datapath = Path('data/raw/diamonds.csv')
df = pd.read_csv(datapath)

We will explore the diamonds dataset. This is a nice dataset for data exploration, because:

1. It is easy to form hypotheses about the dataset
2. There is a lot of data. Not 150 observations, like the iris-dataset, but more than 50k observations. This makes the plotting a bit more interesting.

The information available about the 10 variables:
1. price: price in US dollars (\$326--\$18,823)
2. carat: weight of the diamond (0.2--5.01)
3. cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
4. color: diamond colour, from D (best) to J (worst)
5. clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
6. x: length in mm (0--10.74)
7. y: width in mm (0--58.9)
8. z: depth in mm (0--31.8)
9. depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
10. table: width of top of diamond relative to widest point (43--95)
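
The `depth` definition in item 9 is easy to check by hand. Here is a minimal sketch, assuming made-up dimensions (these are not rows from the dataset) and assuming the column stores a percentage, so the ratio is multiplied by 100:

```python
import pandas as pd

# Hypothetical dimensions in mm, invented for illustration only.
toy = pd.DataFrame({"x": [4.0, 6.5], "y": [4.0, 6.3], "z": [2.4, 4.0]})

# Total depth percentage: z divided by the mean of x and y, times 100.
toy["depth_check"] = 100 * 2 * toy["z"] / (toy["x"] + toy["y"])
print(toy)
```

On the real data, comparing such a column against `depth` would be a quick consistency check. Rows where `x`, `y` or `z` is 0 (the minimum listed above) would have to be excluded first.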

With this, let's start exploring.
First, let's explore with some basic `pandas` functions.

In [None]:
df.head()

In [None]:
df.describe()

This doesn't tell us much that is new. Let's check if everything is read in as expected with `.info()`.

In [None]:
df.info()

This seems to be correct. To me, the most straightforward relation I can spot is between carat and price. Big diamonds will cost more, right?

In [None]:
plt.scatter(data=df, x='carat', y='price')

That's a lot of points. Because there are 53k points, a lot of them probably overlap. Let's make the points a bit transparent.

In [None]:
plt.scatter(data=df, x='carat', y='price', alpha=0.1)

That's better. At least we get some idea about the distribution. Let's try to zoom in a bit more. We see that another column is called `cut`, which stands for the quality of the cut. This will probably have some impact. Let's use colors to see if that makes sense.

In the seaborn [documentation](https://seaborn.pydata.org/tutorial/color_palettes.html) we can look up the color palettes. Because this is `Sequential` data (ranging from low to high), we will pick one of the sequential color schemes.

In [None]:
sns.scatterplot(data=df, x='carat', y='price',
    linewidth=0, alpha=0.1,
    hue='cut', palette='rocket')

Ok, nice, but now we have another problem. The colors are not ordered by quality: for a string column, seaborn assigns them in the order in which the values first appear in the data. We want them ordered by their meaning. We can do this in different ways. One way is to use `hue_order`.

In [None]:
cutorder = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
sns.scatterplot(data=df, x='carat', y='price',
    alpha=0.3, linewidth=0,
    hue='cut', hue_order=cutorder, 
    palette='rocket')

Well, that seems to work. At least the colors are ordered. But it is still too crowded. Let's try to split things up with `FacetGrid`. But before that, let's use a better way than passing cutorder. The type of the `cut` column is a string. We can transform that into `category`, which is for categorical data, and pass an order.

In [None]:
df['cut'] = df['cut'].astype('category')
# Assign the result back: the `inplace` argument was removed in pandas 2.0.
df['cut'] = df['cut'].cat.set_categories(cutorder)
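
To see what a category with an explicit order buys us, here is a small sketch on a made-up series (the values are invented; only the level names come from the dataset). Note that it additionally passes `ordered=True`, which plotting does not need but comparisons do:

```python
import pandas as pd

cutorder = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Made-up values, only to show the behaviour.
toy = pd.Series(['Ideal', 'Fair', 'Premium', 'Good'])
toy = toy.astype('category').cat.set_categories(cutorder, ordered=True)

# Sorting now follows the quality order instead of the alphabet.
print(toy.sort_values().tolist())

# Ordered categories also support comparisons against a level.
print((toy >= 'Premium').tolist())
```

Seaborn picks up the category order by itself, so `hue_order` is no longer needed once the column is categorical.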

In [None]:
g = sns.FacetGrid(df,
        col='cut', hue='cut', palette='rocket',
        height=10, aspect=0.5)
g.map_dataframe(sns.scatterplot, x='carat', y='price',
        alpha=0.3, linewidth=0)

Well, the plots are separated. And indeed, it seems to be the case that better cuts have a higher price at lower carats. Let's try to confirm that with a `lowess` model, which stands for `locally weighted scatterplot smoothing`.

In [None]:
g = sns.lmplot(data=df, x='carat', y='price',
        hue='cut', palette='rocket',
        lowess=True, scatter_kws={'facecolors':'none', 'alpha':0.5})
plt.ylim(0, 20000)

Now, that seems to confirm what we thought. Changing the marker might make things a bit clearer.

In [None]:
sns.lmplot(data=df, x='carat', y='price',
        hue='cut', palette='rocket',
        lowess=True, markers='+')
plt.ylim(0, 20000)


Or dropping the markers completely:

In [None]:
sns.lmplot(data=df, x='carat', y='price',
        hue='cut', palette='rocket',
        lowess=True, scatter_kws={'facecolors':'none', 'edgecolors':'none'})

It is not very consistent, but `lmplot` actually has a facet grid built in. You can simply pass it a `col='cut'` parameter.

In [None]:
sns.lmplot(data=df, x='carat', y='price',
        hue='cut', palette='rocket',
        lowess=True, scatter_kws={'facecolors':'none', 'edgecolors':'none'},
        col='cut')

But this is looking too empty again.
Now let's try to find more properties that influence the price. We have `clarity`. That might influence the price, too. Let's use the plot we had, and add an extra dimension. But first, make sure `clarity` is categorical.

In [None]:
clarityorder = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
df['clarity'] = df['clarity'].astype('category')
# Assign the result back: the `inplace` argument was removed in pandas 2.0.
df['clarity'] = df['clarity'].cat.set_categories(clarityorder)

In [None]:
sns.lmplot(data=df, x='carat', y='price',
        hue='clarity', palette='rocket',
        col='cut',
        lowess=True,
        height=5,
        scatter_kws={'facecolors':'none', 'alpha':0.1})

While you might be able to see the relationships by comparing the facets, it is a bit hard to see.
Much clearer is the heatmap. First, we create a column `value` that shows the price per unit of weight.
Then, we pivot the data.

In [None]:
subset = df[['cut', 'clarity', 'carat', 'price']].copy()
# Price per carat, computed directly on the columns (much faster than a row-wise apply).
subset['value'] = subset['price'] / subset['carat']
hm = pd.pivot_table(subset, values='value', columns='clarity', index='cut')
hm
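
If the pivot step feels a bit magical, this sketch walks through it on four made-up rows (the numbers are invented, not taken from the dataset). It computes the price per carat directly on the columns and then lets `pivot_table` average it per cut and clarity:

```python
import pandas as pd

# Made-up rows, only to show the mechanics.
toy = pd.DataFrame({
    'cut': ['Good', 'Good', 'Ideal', 'Ideal'],
    'clarity': ['SI1', 'IF', 'SI1', 'SI1'],
    'carat': [0.5, 1.0, 0.5, 2.0],
    'price': [1000, 6000, 1500, 16000],
})

# Price per carat, computed directly on the columns.
toy['value'] = toy['price'] / toy['carat']

# One row per cut, one column per clarity, mean value in each cell.
toy_hm = pd.pivot_table(toy, values='value', index='cut', columns='clarity')
print(toy_hm)
```

The two Ideal/SI1 rows are averaged into one cell, and the combination that never occurs (Ideal/IF) ends up as `NaN`.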

In [None]:
sns.heatmap(hm, annot=hm/1000)

Interestingly, we can see a hotspot here: If your cut is very good, and clarity is the best, you get about 5k for every carat of diamond. Improving the cut will lower the price per carat! You could try another palette to make it clearer.

In [None]:
sns.heatmap(hm, annot=hm/1000, cmap='vlag')

Interesting, this shows that dropping the clarity even a bit will give you a price below the mean value. I had not expected that a medium clarity would give a higher price per unit. Maybe it has something to do with outliers? `pivot_table` uses the mean by default, so let's switch to the median.

In [None]:
hm = pd.pivot_table(subset, values='value', columns='clarity', index='cut', aggfunc='median')
sns.heatmap(hm, annot=hm/1000, cmap='vlag')
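
Why the switch from mean to median matters is easiest to see on a handful of numbers. A tiny sketch with invented prices per carat for a single heatmap cell, one of which is extreme:

```python
import pandas as pd

# Invented prices per carat for one cell, with a single extreme value.
values = pd.Series([3000, 3100, 3200, 3300, 15000])

print(values.mean())    # pulled up by the single extreme value
print(values.median())  # stays with the bulk of the data
```

Here the mean lands at 5520 while the median stays at 3200, so one expensive diamond is enough to make a whole cell look like a hotspot.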

This raises other questions. But there don't seem to be these sudden drops. Outliers do have an impact here! Still, it might be unexpected that you get the most "bang for your buck" with worse cuts but better clarity, or better cuts but worse clarity.

Let's try to get a grip on these outliers. We could wonder what the distribution actually is for every category. So, we want to look at the distribution of the prices in every clarity group.
Let's put the clarity on the x-axis, and the prices on the y-axis, with a boxplot.

In [None]:
sns.boxplot(data=subset, x='clarity', y='price')

Well, that's something you might not have expected. What's going on? The carat has the biggest impact on price, right? So we might want to put the carat on the x-axis. But we run into problems if we do that just like this:

In [None]:
sns.boxplot(data=subset, x='carat', y='price')

What's going on? `carat` is a continuous variable. If we want to make a boxplot, we need groups for which the boxes are calculated. So let's make our own bins on the x-axis to fix this.


In [None]:
subset['bins'] = pd.cut(df['carat'], bins=5)
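
It helps to know where `pd.cut` actually puts the edges. With an integer `bins`, it splits the range between the minimum and maximum into equal-width intervals, so only the extremes matter. A sketch that uses just the carat range from the variable list above (0.2 to 5.01) instead of the full column:

```python
import pandas as pd

# Only the minimum and maximum carat matter for equal-width bin edges.
carat_range = pd.Series([0.2, 5.01])

# retbins=True also returns the bin edges that pd.cut computed.
_, edges = pd.cut(carat_range, bins=5, retbins=True)
print(edges.round(3))
```

The first bin ends at roughly 1.162, which is the threshold that shows up in the `carat < 1.162` filters further down. On the real column, `subset['bins'].value_counts()` would show how many diamonds fall into each bin.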

In [None]:
sns.boxplot(data=subset, x='bins', y='price')

This looks better! We can see how the price grows in every group. We can also see the outliers, and how the groups get smaller. Maybe we can split this up again? The weirdness seems to be happening in the lowest carat group.

What if we use clarity to split up the carat bins into even smaller groups?

In [None]:
sns.boxplot(data=subset, x='bins', y='price', hue='clarity', palette='rocket')

This gives a really nice overview. We can still see how the groups differ in size. We can also see very clearly how the clarity for the small diamonds has an impact on price, but isn't distributed normally. For the rest of the groups, there is more or less a normal distribution.

Let's try to zoom in on the weird subset.

In [None]:
subset2 = df[(df.carat < 1.162) & (df.clarity == 'IF')]
sns.boxplot(data=subset2, x ='cut', y='price', palette='rocket')
len(subset2)  # number of diamonds in the zoomed-in subset


Ok, now zoom in even further on the 'Ideal' group.

In [None]:
colororder = ['D', 'E', 'F', 'G', 'H', 'I', "J"]
df['color'] = df['color'].astype('category')
# Assign the result back: the `inplace` argument was removed in pandas 2.0.
df['color'] = df['color'].cat.set_categories(colororder)

In [None]:
subset2 = df[(df.carat < 1.162) & (df.clarity == 'IF') & (df.cut == 'Ideal')]
sns.boxplot(data=subset2, x ='color', y='price', palette='rocket')

And narrow it down to the F and G color groups.

In [None]:
subset2 = df[(df.carat < 1.162) & (df.clarity == 'IF') & (df.cut == 'Ideal') & ((df.color == 'F') | (df.color == 'G'))]
sns.scatterplot(data=subset2, x='carat', y='price')

Ah, there we have it. The group between 0.8 and 1.0 carat is missing. This gives the boxplot its unbalanced look. If this was a normal distribution, we would have expected diamonds between 0.8 and 1.0. This is a pattern we actually could have noticed already in the first plots. For some reason, carat values tend to be distributed more around the start of a carat, and tend to be very rare close to the end of a carat. My hypothesis is that this is caused by a human decision. We can see how the carat converges to certain numbers by checking the histogram.

In [None]:
df['ceil'] = np.ceil(df.carat)

In [None]:
fig, ax = plt.subplots(figsize=(10,10))

ax = sns.histplot(data=df, x='carat')
ticks = np.linspace(0, 5, 21)
labels = ['{:.2f}'.format(x) for x in ticks]
ax.set_xticks(ticks)
ax.set_xticklabels(labels, rotation=45)
plt.show()

This confirms my hypothesis. Carat seems to converge towards nice round numbers. You find peaks at 0.25, 0.5, 0.75, 1.00, 1.5, 2.0. Probably this is due to how diamonds are sold, in these categories. With this, we could go to an expert to ask for clarification.
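
A complementary check to the histogram is to simply count how often each exact weight occurs. The sketch below does that on invented carat values that mimic the clustering (they are not real data):

```python
import pandas as pd

# Invented carat values that mimic the clustering; not real data.
toy_carat = pd.Series([0.3, 0.3, 0.3, 0.31, 0.5, 0.5, 0.99, 1.0, 1.0, 1.01])

# Count how often each exact weight occurs, most frequent first.
counts = toy_carat.value_counts()
print(counts.head(3))
```

On the real column, `df['carat'].value_counts().head(10)` would list the most popular weights without depending on how the histogram bins happen to fall.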

Remember we calculated a 'price increase per unit of weight'? Let's see if that changes what is going on.


In [None]:
sns.boxplot(data=subset, x='cut', y='value', hue='clarity', palette='rocket')

It turns out that it really matters how we look at the data. Things that seem weird from one perspective sometimes turn out to be reflections of human decisions that create non-normal distributions.

If we normalize the price, we can see that there is some other process going on that drives up the price per unit of diamond. Probably something like aesthetics, or context, or history of the diamond, etc., that makes the diamond more valuable.

But we also see this happening more often in the Ideal group, with perfect coloring, which makes sense. These diamonds are probably rare, something we can explore with a heatmap:

In [None]:
hm = pd.pivot_table(subset, values='value', columns='clarity', index='cut', aggfunc='count')
sns.heatmap(hm, annot=hm/1000, cmap='vlag')
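
For plain counts, `pd.crosstab` is an alternative to `pivot_table` with `aggfunc='count'` that does not need a value column. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows, only to show the mechanics.
toy = pd.DataFrame({
    'cut': ['Good', 'Good', 'Ideal', 'Ideal', 'Ideal'],
    'clarity': ['SI1', 'IF', 'SI1', 'SI1', 'IF'],
})

# Number of rows for every cut/clarity combination.
toy_counts = pd.crosstab(toy['cut'], toy['clarity'])
print(toy_counts)
```

Such a table of integers can be passed to `sns.heatmap` with `annot=True, fmt='d'` to print the raw counts instead of thousands.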

In [None]:
g = sns.FacetGrid(df,
        row='cut', col='color',
        hue='clarity', palette='rocket',
        height=2, aspect=1)
g.map_dataframe(sns.scatterplot, x='carat', y='price')