In the following, we will load the "tips" dataset from the file "tips.csv". This dataset includes information from 244 receipts at a restaurant. For each order, we have
* total bill amount
* tip amount
* gender of the customer
* whether the customer is a smoker
* day of the order
* time of the order (dinner or lunch)
* size of dining group


How do we visualize the potential relationship between 'total_bill' and 'tip'?
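
Before plotting, the data has to be in a pandas DataFrame. The sketch below shows one way to load and inspect it. In the notebook the call would be `pd.read_csv("tips.csv")`; so that the example runs without that file, a few made-up rows stand in for it. The column names `sex`, `smoker`, `day`, and `time` are assumptions; only `total_bill`, `tip`, and `size` are named in the text.

```python
import io

import pandas as pd

# In the notebook, the data would be read from the file named above:
#     tips = pd.read_csv("tips.csv")
# So that this sketch runs without the file, a few MADE-UP rows stand in for it.
# Column names other than 'total_bill', 'tip', and 'size' are assumptions.
sample_csv = io.StringIO(
    "total_bill,tip,sex,smoker,day,time,size\n"
    "18.50,2.50,Female,No,Sun,Dinner,2\n"
    "24.00,4.00,Male,No,Sat,Dinner,3\n"
    "12.75,2.00,Male,Yes,Thur,Lunch,2\n"
    "31.20,5.00,Female,No,Fri,Lunch,4\n"
)
tips = pd.read_csv(sample_csv)

# Quick checks: number of rows and columns, column types, first rows
print(tips.shape)
print(tips.dtypes)
print(tips.head())
```

With the real file, `tips.shape` should report 244 rows, matching the number of receipts described above.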

# Scatterplots by seaborn

**seaborn** is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. With this library, one can create many different types of charts.


<img title="plots" src="https://www.dropbox.com/scl/fi/7guvvmp6os0bjqp7slge2/sns_plots.png?rlkey=erswd72c4wp1s0xlnuzvp8mz3&st=a3f9651f&raw=1" width=600>

In this notebook, we will focus on how to use seaborn to create scatterplots. 

You can learn more about the library through its [documentation](https://seaborn.pydata.org/).

By default, seaborn is included in Anaconda, so we can simply import it with the `import` statement. If you find that seaborn has not been installed, you can install it with Anaconda Navigator, by running `pip install seaborn` in a command-line tool (Windows PowerShell, launchable from Anaconda Navigator, or the Terminal on a Mac), or by running `!pip install seaborn` in a Jupyter cell.

When importing seaborn, the convention is to use the alias `sns`. Since seaborn is built on top of matplotlib, it is good practice to import `pyplot` from matplotlib at the same time.
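
The import convention described above can be sketched as follows. The alias `plt` for `pyplot` is the usual matplotlib convention rather than something stated in the text, and printing the version is only an optional check that the installation works.

```python
import matplotlib.pyplot as plt  # 'plt' is the customary alias for pyplot
import seaborn as sns            # 'sns' is the customary alias for seaborn

# Optional check: confirm that seaborn is installed and see which version is in use
print(sns.__version__)
```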

A nice feature of seaborn is its ability to work directly with a DataFrame.

The basic command to create a scatterplot is 
* `sns.scatterplot(data=name_of_DataFrame,x=column_a,y=column_b)`

where `data=` specifies the DataFrame to plot from, and `x=` and `y=` give the names of the columns (as strings) to place on the two axes.
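
As a minimal sketch of the basic command, the example below plots 'tip' against 'total_bill'. The four rows are made up so that the block runs on its own; in the notebook, `tips` would be the DataFrame loaded from "tips.csv".

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# MADE-UP rows standing in for the DataFrame loaded from "tips.csv"
tips = pd.DataFrame({
    "total_bill": [18.50, 24.00, 12.75, 31.20],
    "tip": [2.50, 4.00, 2.00, 5.00],
})

# Column names are passed as strings; seaborn looks them up in `data`
ax = sns.scatterplot(data=tips, x="total_bill", y="tip")
```

The function returns the matplotlib Axes it drew on, and by default the axis labels are taken from the column names.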


Let's create a scatterplot of 'tip' vs. 'total_bill'.
* Use `data=` to specify the DataFrame
* Use different colors for dinner and lunch
* Use different marker symbols for dinner and lunch
* Let the size of the marker be larger for larger groups (i.e., a larger 'size')
    * Use `sizes=()` to specify the range of sizes of the markers
* Use `alpha=` to make the markers partially transparent
* Also add the title, and the x- and y-labels
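
One possible way to put these steps together is sketched below. The rows are made up so that the block is self-contained, the column name `time` (with values "Lunch" and "Dinner") is an assumption, and the values chosen for `sizes=` and `alpha=` as well as the title and label text are illustrative only.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# MADE-UP rows standing in for the DataFrame loaded from "tips.csv";
# the column name 'time' is an assumption
tips = pd.DataFrame({
    "total_bill": [18.50, 24.00, 12.75, 31.20, 9.80, 45.10],
    "tip": [2.50, 4.00, 2.00, 5.00, 1.50, 7.00],
    "time": ["Dinner", "Dinner", "Lunch", "Dinner", "Lunch", "Lunch"],
    "size": [2, 3, 2, 4, 1, 6],
})

ax = sns.scatterplot(
    data=tips,
    x="total_bill",
    y="tip",
    hue="time",       # different colors for dinner and lunch
    style="time",     # different marker symbols for dinner and lunch
    size="size",      # larger markers for larger dining groups
    sizes=(20, 200),  # marker-size range (illustrative values)
    alpha=0.6,        # partially transparent markers (illustrative value)
)

# pyplot functions act on the current Axes, i.e., the one seaborn just drew on
plt.title("Tip vs. total bill")
plt.xlabel("Total bill")
plt.ylabel("Tip")
```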

Interestingly, based on a categorical variable in the dataset, we can:

* use different colors for the points, by `hue=column_name`; e.g., use one color for all lunches, and a different color for all dinners,
* use different shapes for the points, by `style=column_name`
* use different sizes for the points, by `size=column_name`
    * the range of sizes can be adjusted by `sizes=(min_size,max_size)`, where min_size and max_size specify the size of the smallest point and the size of the largest point, respectively


We can use `pyplot` to add the title and labels.

You can find the arguments for `sns.scatterplot()` [here](https://www.dropbox.com/scl/fi/bsuxn9j7ab0zxb998ue9t/seaborn_parameters.pdf?rlkey=u6hyerymywoll9q78jcw3p5ze&st=id0wp1cb&dl=0).


| **Parameter**      | **Description**                                                                                                         | **Default**                 |
|--------------------|-------------------------------------------------------------------------------------------------------------------------|-----------------------------|
| `data`             | DataFrame or array-like structure (dict, array, etc.) for plotting data.                                                 | `None`                      |
| `x`                | Variable for x-axis data (from `data` or direct).                                                                        | `None`                      |
| `y`                | Variable for y-axis data (from `data` or direct).                                                                        | `None`                      |
| `hue`              | Grouping variable to map plot colors.                                                                                    | `None`                      |
| `size`             | Grouping variable to map plot sizes.                                                                                     | `None`                      |
| `style`            | Grouping variable to map plot markers.                                                                                   | `None`                      |
| `palette`          | Colors for the different levels of the `hue` variable.                                                                   | `None` (default color cycle) |
| `hue_order`        | Order of levels in `hue`.                                                                                                | `None`                      |
| `hue_norm`         | Normalization range for continuous `hue`.                                                                                | `None`                      |
| `sizes`            | Mapping from `size` values to point sizes (tuple, dict, etc.).                                                           | `None`                      |
| `size_order`       | Order of levels in `size`.                                                                                               | `None`                      |
| `size_norm`        | Normalization range for continuous `size`.                                                                               | `None`                      |
| `markers`          | Marker styles for the levels of `style`.                                                                                 | `True`                      |
| `style_order`      | Order of levels in `style`.                                                                                              | `None`                      |
| `legend`           | How to draw the legend (`"auto"`, `"brief"`, `"full"`, `False`).                                                         | `"auto"`                    |
| `ax`               | Axes object to draw the plot onto.                                                                                       | `None` (creates new axes)    |
| `cmap`             | Not a named `scatterplot` parameter; to use a colormap for a continuous `hue`, pass its name to `palette`.               | n/a                         |
| `linewidth`        | Width of the lines around the markers.                                                                                   | `None` (default width)       |
| `edgecolor`        | Color for marker edges.                                                                                                  | `None` (default behavior)    |
| `alpha`            | Transparency level for the markers (0: fully transparent, 1: fully opaque).                                              | `None` (fully opaque)        |
| `x_bins`, `y_bins`, `units`, `estimator`, `ci`, `n_boot` | Listed in the signature of older seaborn releases, where they were documented as non-functional for `scatterplot`; recent releases no longer list them. | n/a |
| `sort`, `err_style`, `err_kws` | Parameters of `sns.lineplot()`, not of `sns.scatterplot()`.                                                  | n/a                         |
| `log_scale`        | Not a `scatterplot` parameter; for logarithmic axes, use matplotlib instead, e.g., `plt.xscale("log")`.                  | n/a                         |
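
To illustrate a few of the parameters in the table, the sketch below fixes the order of the categories, assigns explicit colors and markers, and draws onto an existing Axes. The rows are made up, the column name `time` is an assumption, and the particular colors and markers are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# MADE-UP rows standing in for the DataFrame loaded from "tips.csv";
# the column name 'time' is an assumption
tips = pd.DataFrame({
    "total_bill": [18.50, 24.00, 12.75, 31.20, 9.80, 45.10],
    "tip": [2.50, 4.00, 2.00, 5.00, 1.50, 7.00],
    "time": ["Dinner", "Dinner", "Lunch", "Dinner", "Lunch", "Lunch"],
})

# Create the Axes first, then tell seaborn to draw onto it with ax=
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(
    data=tips,
    x="total_bill",
    y="tip",
    hue="time",
    style="time",
    hue_order=["Lunch", "Dinner"],    # order of the levels in the legend
    style_order=["Lunch", "Dinner"],
    palette={"Lunch": "tab:orange", "Dinner": "tab:blue"},  # color per level
    markers={"Lunch": "s", "Dinner": "o"},                  # marker per level
    legend="full",
    ax=ax,
)
```

Passing dictionaries to `palette=` and `markers=` keeps the mapping from category to appearance explicit, which helps when several plots should use the same colors.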


We can add a trend line to visualize how the two variables are correlated with each other.

To do this, we will use the `regplot()` function in `seaborn` instead of `scatterplot`.

**Notes**:
1. The argument `ci=` shows a confidence interval in the form of a shaded area. By default, the confidence level is 95%, but we may change it to other values, such as `ci=80` for an 80% confidence interval.
2. The `order=` argument selects the order of the polynomial function as a trend line. By default, `order=1`, meaning that we show a linear trend line. But we can use `order=2` for a quadratic trend, or higher orders.
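
The two notes above can be tried out side by side, as in the sketch below: the left panel draws a linear trend with an 80% confidence interval, and the right panel a quadratic trend with the default 95% interval. The rows are made up so that the block runs on its own, and the panel titles are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# MADE-UP rows standing in for the DataFrame loaded from "tips.csv"
tips = pd.DataFrame({
    "total_bill": [9.80, 12.75, 15.40, 18.50, 24.00, 27.30, 31.20, 45.10],
    "tip": [1.50, 2.00, 2.20, 2.50, 4.00, 3.50, 5.00, 7.00],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Linear trend line (order=1 is the default) with an 80% confidence interval
sns.regplot(data=tips, x="total_bill", y="tip", ci=80, ax=axes[0])
axes[0].set_title("Linear trend, 80% CI")

# Quadratic trend line with the default 95% confidence interval
sns.regplot(data=tips, x="total_bill", y="tip", order=2, ax=axes[1])
axes[1].set_title("Quadratic trend, 95% CI")
```

The shaded band is estimated by bootstrapping, so its exact shape can vary slightly from run to run.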