# Explanatory Visualizations Exercise

In this notebook, your goal will be to practice polishing plots. There will be one exercise with a bivariate plot. There is quite a bit of data manipulation shown; you will just be responsible for the plot polishing!

In [None]:
# prerequisite package imports
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# `solutions_explanatory.py` is a Python file available on the Notebook server that contains solutions to the TO DO tasks.
# The solution to each task is in a separate function in `solutions_explanatory.py`.
# Do not refer to the file until you have attempted to write the code yourself.
from solutions_explanatory import *

We're going to start with the Diamonds data set you are familiar with.

In [None]:
df = pd.read_csv('data/diamonds.csv')
df.shape

In [None]:
df.head(5)

### Question and Data Manipulation

Here we want to answer the question "how does the _average_ cost of diamonds for a given carat size change for diamonds of different quality?" Specifically, we want to use a scatterplot to see trends as diamonds grow, and compare `Ideal` diamonds versus `Fair` (i.e., the highest and lowest quality diamonds in our data set).

To start, I am going to do some data manipulation to prepare for plotting:
1. Extract only the diamonds with the quality grades of interest.
1. Define carat-size bins that are 1/4 carat wide.
1. Use `pandas.cut()` to put each diamond into the correct bin.
1. For each bin and quality grade, calculate the average price and its standard deviation.
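Before running these steps on the full data set, it may help to see the binning idea on a tiny example. The sketch below uses made-up carat and price values (not taken from `diamonds.csv`) and hypothetical names (`toy`, `edges`, `mids`, `toy_means`); it only illustrates how `pd.cut()` with midpoint labels feeds into a group-by average.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; these are not rows from diamonds.csv
toy = pd.DataFrame({
    'carat': [0.10, 0.24, 0.30, 0.49, 0.55],
    'price': [300, 400, 500, 900, 1100],
})

# Bin edges every 0.25 carat, and one midpoint label per bin
step = 0.25
edges = np.arange(0, toy['carat'].max() + step, step)
mids = [lower + step / 2 for lower in edges[:-1]]

# Assign each row to a bin; the categorical labels are converted back to floats
toy['carat_bin'] = pd.cut(toy['carat'],
                          bins=edges,
                          include_lowest=True,
                          labels=mids).astype(float)

# Average price per bin
toy_means = toy.groupby('carat_bin', as_index=False).agg(
    price_avg=('price', 'mean')
)
print(toy_means)
```

Each row lands in the bin whose right edge is the first one at or above its carat value, so the label plotted on the x-axis is the bin midpoint rather than any individual diamond's size.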

In [None]:
# only look at data with Ideal or Fair quality grades
df_subset = df[df['cut'].isin(['Ideal', 'Fair'])].reset_index(drop=True)
print(f'Original dataframe shape: {df.shape}')
print(f'Subset dataframe shape: {df_subset.shape}')

In [None]:
# define xbins
step = 0.25
xbins = np.arange(0, df['carat'].max()+step, step)

# the bin label is the middle value of the bin
labels = [lower+step/2 for lower in xbins[:-1]]

In [None]:
# bin data using pd.cut
df_subset['carat_avg'] = pd.cut(df_subset['carat'],
                                   bins=xbins,
                                   include_lowest=True,
                                   labels=labels)

# pd.cut() returns categorical data, so convert the bin labels back to floats
df_subset['carat_avg'] = df_subset['carat_avg'].astype(float)

In [None]:
# group by carat bin and diamond cut, then compute the mean and
# standard deviation of price within each group
dgroup = df_subset.groupby(by=['carat_avg', 'cut'], as_index=False).agg(
    price_avg=('price', 'mean'),
    price_std=('price', 'std')
)

In [None]:
# display grouped data frame
dgroup

After all that manipulation, now we are ready to plot the average price and standard deviation for a given carat size and quality grade.

In [None]:
sns.scatterplot(data=dgroup, x='carat_avg', y='price_avg', hue='cut');

### Polishing Exercise

Now it's time to take this plot and add some polish! Besides relabeling the axes, you may consider performing transformations as well!

In [None]:
# YOUR CODE HERE

### Solution

When you've given the exercise a try, run the following code to see what I came up with.

In [None]:
explanatory_solution()
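The contents of `explanatory_solution()` are not shown in this notebook, so the following is only a sketch of one possible polish, not necessarily what that function does. It uses plain matplotlib and a stand-in frame (`toy_group`, with made-up numbers shaped like `dgroup`) to show descriptive axis labels, a title, a titled legend, and a log transformation of the price axis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for `dgroup`; the numbers are made up for illustration only
toy_group = pd.DataFrame({
    'carat_avg': [0.375, 0.625, 0.875, 0.375, 0.625, 0.875],
    'cut': ['Fair', 'Fair', 'Fair', 'Ideal', 'Ideal', 'Ideal'],
    'price_avg': [900, 1800, 3200, 1000, 2200, 4300],
})

fig, ax = plt.subplots(figsize=(8, 5))

# One scatter series per quality grade so each gets its own legend entry
for grade, group in toy_group.groupby('cut'):
    ax.scatter(group['carat_avg'], group['price_avg'], label=grade)

# A log scale on price is one transformation worth trying
ax.set_yscale('log')

# Descriptive labels, title, and legend title
ax.set_xlabel('Carat (midpoint of 0.25-carat bin)')
ax.set_ylabel('Average price (log scale)')
ax.set_title('Average diamond price by carat size and cut')
ax.legend(title='Cut')
```

Whether a log scale actually helps depends on the real data; it is worth comparing against the untransformed plot before settling on it.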