In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
plt.rcParams["figure.figsize"] = (8,8)

In [None]:
datapath = Path('data/raw/diamonds.csv')
df = pd.read_csv(datapath)
df.head()

We will explore the diamonds dataset. This is a nice dataset for data exploration because:

1. It is easy to form hypotheses about the dataset.
2. There is a lot of data: not 150 observations, like the iris dataset, but more than 50k. This makes the plotting a bit more interesting.

The information available about the 10 variables:
1. price: price in US dollars (\$326--\$18,823)
2. carat: weight of the diamond (0.2--5.01)
3. cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
4. color: diamond colour, from D (best) to J (worst)
5. clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
6. x: length in mm (0--10.74)
7. y: width in mm (0--58.9)
8. z: depth in mm (0--31.8)
9. depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
10. table: width of top of diamond relative to widest point (43--95)
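
The depth definition above can be checked against the measured dimensions. The sketch below is a minimal example, assuming the column names from the list (`x`, `y`, `z`); the helper name `depth_from_dimensions` is made up for illustration. It multiplies the ratio by 100, on the assumption that the listed range (43--79) is a percentage, and returns NaN where `x + y` is 0, since the listed minimum of 0 mm for the dimensions suggests some rows have missing measurements.

```python
import numpy as np
import pandas as pd


def depth_from_dimensions(df: pd.DataFrame) -> pd.Series:
    """Recompute depth (in percent) from x, y and z.

    Hypothetical helper: assumes depth = 100 * 2 * z / (x + y).
    Rows where x + y == 0 yield NaN instead of a division by zero.
    """
    denominator = (df["x"] + df["y"]).replace(0, np.nan)
    return 100 * 2 * df["z"] / denominator
```

In the notebook, something like `(depth_from_dimensions(df) - df["depth"]).abs().describe()` would show how closely the recomputed values match the `depth` column, and which rows deviate.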

With this, let's start exploring.

## Step 1
Start with basic plots.
First explore the basic correlation between carat and price.

Make sure you get some basic aesthetics right. The number of observations needs a solution, because default plots will be too crowded. So experiment with:
- changing the properties of a plot, like marker transparency or shape
- adding colors to explore additional dimensions like cut or clarity
- using some simple statistical plots like boxplots or lmplots.
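
As a possible starting point for the suggestions above, here is a minimal sketch of a scatter plot with transparent, small markers and an optional color dimension. The function name `plot_carat_price` and the default values for `alpha` and marker size are assumptions chosen for illustration, not a recommended final design.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_carat_price(df: pd.DataFrame, hue=None, alpha: float = 0.1):
    """Scatter price against carat with semi-transparent markers.

    Hypothetical helper: `hue` can be a column name such as "cut" or
    "clarity" to color the points by an extra dimension.
    """
    fig, ax = plt.subplots()
    sns.scatterplot(
        data=df,
        x="carat",
        y="price",
        hue=hue,
        alpha=alpha,   # low alpha reveals density where points overlap
        s=10,          # small markers reduce clutter
        linewidth=0,   # drop the white marker edge
        ax=ax,
    )
    ax.set(xlabel="carat", ylabel="price (USD)")
    return ax
```

Calling `plot_carat_price(df, hue="cut")` in the notebook should give a first impression; from there you can vary `alpha`, try a different marker, or switch to `sns.boxplot` for a categorical comparison.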




## Step 2
Formulate hypotheses based on the basic plots.
E.g., what is the impact of extra dimensions on the relationship between price and carat?

Use facetgrids and color to try to differentiate. Use different approaches (heatmaps, lmplots, boxplots) depending on the question you are asking.
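
Two of the approaches mentioned here can be sketched as follows: a facet grid that repeats the carat/price scatter per cut, and a heatmap of the median price per cut and clarity. The function names, the choice of `cut` as the facet variable, and the median as the aggregate are all assumptions for illustration; other combinations may suit your hypothesis better.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def facet_by_cut(df: pd.DataFrame):
    """One carat/price scatter per cut (hypothetical helper)."""
    return sns.relplot(
        data=df, x="carat", y="price",
        col="cut", col_wrap=3,      # one panel per cut, three per row
        kind="scatter", alpha=0.2, s=10, linewidth=0, height=3,
    )


def median_price_heatmap(df: pd.DataFrame):
    """Heatmap of median price per cut and clarity (hypothetical helper)."""
    table = df.pivot_table(
        index="cut", columns="clarity", values="price", aggfunc="median"
    )
    fig, ax = plt.subplots()
    sns.heatmap(table, annot=True, fmt=".0f", cmap="viridis", ax=ax)
    return table, ax
```

Note that a heatmap of raw prices ignores carat, so a surprising pattern there is itself a hypothesis worth checking against the faceted scatter.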


## Step 3

Pay special attention to the distributions. Explore questions like, for example:

- Are the properties of the diamonds (e.g., price or carat) normally distributed in all groups?
- What is the impact of using different metrics (e.g., mean or median)?
- What might be an explanation for anomalies you find?
- Can you make plots that you could show to an expert, to clarify your question?
- Maybe you can create new types of features that are less sensitive to these outliers?
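
To make the mean-versus-median question concrete, the sketch below summarizes price per group and adds two candidate features. Both helper names are hypothetical, and the features (a base-10 log of the price and the price per carat) are only suggestions of transformations that tend to dampen the influence of extreme values; whether they help here is for you to check.

```python
import numpy as np
import pandas as pd


def summarize_price(df: pd.DataFrame, by: str = "cut") -> pd.DataFrame:
    """Count, mean and median price per group (hypothetical helper).

    A mean far above the median hints at a right-skewed distribution.
    """
    return df.groupby(by)["price"].agg(["count", "mean", "median"])


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with two illustrative features added."""
    out = df.copy()
    out["log_price"] = np.log10(out["price"])            # compresses the long right tail
    out["price_per_carat"] = out["price"] / out["carat"]  # normalizes price by weight
    return out
```

For example, `summarize_price(df, by="clarity")` gives a quick table to compare against your plots, and `sns.histplot(add_features(df), x="log_price", hue="cut")` would show the distribution on a log scale.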

