<font size="+3"><strong>Visualizing Data: seaborn</strong></font>

There are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: [pandas](../%40textbook/07-visualization-pandas.ipynb), [Matplotlib](../%40textbook/06-visualization-matplotlib.ipynb), [plotly express](../%40textbook/08-visualization-plotly.ipynb), and seaborn. In this section, we'll focus on using seaborn.

# Scatter Plots

A **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot along the horizontal and vertical axes indicates the values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for **correlations**.

In the following example, we will see some scatter plots based on the Mexico City real estate data. Specifically, we can use a scatter plot to show how `"price"` and `"surface_covered_in_m2"` are correlated. First we need to read the dataset and do a little cleaning.

In [None]:
import pandas as pd
import seaborn as sns

# Read Data
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")

# Clean the data and drop `NaNs`
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)

mexico_city1 = mexico_city1.dropna(axis=0)

# Exclude some outliers
mexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]

mexico_city1.head()
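To see what each cleaning step does, here is a minimal sketch on a tiny DataFrame. The `toy` name and every value in it are made up for illustration and are not taken from the Mexico City file; only the column names and the price threshold mirror the cell above.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not values from the real dataset
toy = pd.DataFrame(
    {
        "price": [1_500_000.0, 2_300_000.0, np.nan, 250_000_000.0],
        "surface_covered_in_m2": [60.0, 85.0, 70.0, 900.0],
        "floor": [np.nan, 3.0, np.nan, np.nan],
    }
)

# Step 1: drop a column we do not need, so its gaps cannot discard rows later
toy = toy.drop(["floor"], axis=1)

# Step 2: drop every row that still contains a missing value
toy = toy.dropna(axis=0)

# Step 3: keep only the rows below the price threshold
toy = toy[toy["price"] < 100_000_000]

print(toy)
```

The order matters: if `dropna` ran before the column was dropped, any row with a missing `"floor"` value would be removed as well.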

Use seaborn to plot the scatter plot for `"price"` and `"surface_covered_in_m2"`:

In [None]:
sns.scatterplot(data=mexico_city1, x="price", y="surface_covered_in_m2");

The `scatterplot` function has a very useful argument called `hue`. When you pass a categorical column as `hue`, seaborn colors each point according to its category, so you can compare the relationship between the two variables across categories. Let's look at an example using `"property_type"`:

In [None]:
sns.scatterplot(
    data=mexico_city1, x="price", y="surface_covered_in_m2", hue="property_type"
);
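The effect of `hue` can also be checked on a small self-contained example. This is only a sketch: the `toy` DataFrame and all of its values are invented for illustration, and the explicit `plt.subplots` call is just one way to get a fresh set of axes.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up listings for illustration only
toy = pd.DataFrame(
    {
        "price": [1.2e6, 2.5e6, 3.1e6, 0.9e6, 1.8e6, 4.0e6],
        "surface_covered_in_m2": [55, 110, 140, 45, 80, 200],
        "property_type": [
            "apartment", "house", "house",
            "apartment", "apartment", "house",
        ],
    }
)

fig, ax = plt.subplots()
sns.scatterplot(
    data=toy, x="price", y="surface_covered_in_m2", hue="property_type", ax=ax
)

# The legend lists the categories found in the hue column
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```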

<font size="+1">Practice</font>

Plot a scatter plot for `"price"` and `"surface_total_in_m2"` by `"property_type"` for `"mexico-city-real-estate-1.csv"`:

# Bar Charts

A **bar chart** is a graph that shows the values of a categorical variable in a dataset. It consists of an axis and a series of labeled horizontal or vertical bars. The bars depict either the frequencies of the different values of a variable or simply the values themselves. The numbers on the y-axis of a vertical bar chart, or on the x-axis of a horizontal bar chart, are called the scale.

In the following example, we will see some bar plots based on the Mexico City real estate dataset. Specifically, we will count the number of observations in each borough and plot them. We first need to import the dataset and extract the borough and other location information from the `"place_with_parent_names"` column.

In [None]:
# Read Data
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")

# Clean the data and drop `NaNs`
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)

# Extract location columns from place_with_parent_names
# (`n` must be passed by keyword in recent pandas versions)
mexico_city1[
    ["First Empty", "Country", "City", "Borough", "Second Empty"]
] = mexico_city1["place_with_parent_names"].str.split("|", n=4, expand=True)
mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)
mexico_city1 = mexico_city1.dropna(axis=0)

# Exclude some outliers
mexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]
mexico_city1 = mexico_city1[mexico_city1["Borough"] != ""]

mexico_city1.head()
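The splitting step in the cell above can be tried in isolation. The sketch below assumes that each location string starts and ends with `"|"`, which is inferred from the `"First Empty"` and `"Second Empty"` column names; the two strings themselves are made up for illustration.

```python
import pandas as pd

# Made-up location strings in an assumed "|a|b|c|" shape; not rows from the real file
places = pd.Series(
    [
        "|México|Distrito Federal|Benito Juárez|",
        "|México|Distrito Federal|Coyoacán|",
    ]
)

# The leading and trailing "|" each produce an empty piece,
# so four separators give five columns
parts = places.str.split("|", n=4, expand=True)
parts.columns = ["First Empty", "Country", "City", "Borough", "Second Empty"]

print(parts[["Country", "City", "Borough"]])
```

With `expand=True`, the pieces come back as separate DataFrame columns instead of a single column of lists, which is why they can be assigned to several new columns at once.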

Let's check the example of a bar plot showing the value counts of each borough in the dataset. We first need to create a DataFrame showing the value counts:

In [None]:
# Count listings per borough; explicit names keep the columns stable across pandas versions
counts = mexico_city1["Borough"].value_counts()
bar_df = counts.rename_axis("Borough").reset_index(name="count")
bar_df

Since there are 16 different categories in `"Borough"`, we should increase the default plot size and rotate the x-axis tick labels to make the plot more readable:

In [None]:
# Increase plot size
sns.set(rc={"figure.figsize": (15, 4)})

# Plot the bar plot
ax = sns.barplot(data=bar_df, x="Borough", y="count")

# Rotate the x-axis tick labels
ax.set_xticklabels(ax.get_xticklabels(), rotation=75);
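As an alternative sketch, seaborn's `countplot` tallies the rows in each category itself, so no intermediate value-counts DataFrame is needed. The `toy` DataFrame and its borough labels below are made up for illustration, and `tick_params` is used here as one possible way to rotate the tick labels.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up borough labels for illustration only
toy = pd.DataFrame(
    {
        "Borough": [
            "Coyoacán", "Tlalpan", "Coyoacán",
            "Iztapalapa", "Coyoacán", "Tlalpan",
        ]
    }
)

fig, ax = plt.subplots(figsize=(8, 4))

# countplot counts the rows in each category and draws one bar per category
sns.countplot(data=toy, x="Borough", ax=ax)

# Rotate the tick labels, which helps when there are many categories
ax.tick_params(axis="x", labelrotation=75)

# Each bar's height is the number of rows in that category
bar_heights = sorted(patch.get_height() for patch in ax.patches)
print(bar_heights)
```

Building the counts first, as in the cells above, is still useful when you want to inspect or sort the numbers before plotting them.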

<font size="+1">Practice</font>

Plot a bar plot showing the value counts for property types in `"mexico-city-real-estate-1.csv"`:

In [None]:
bar_df = ...


# Correlation Heatmaps

A **correlation heatmap** shows the relative strength of correlations between the variables in a dataset. Here's what the code looks like:

In [None]:
import pandas as pd
import seaborn as sns

mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)
mexico_city1 = mexico_city1.dropna(axis=0)
mexico_city1_numeric = mexico_city1.select_dtypes(include="number")
corr = mexico_city1_numeric.corr(method="pearson")
sns.heatmap(corr)

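Notice that before plotting we dropped four columns, removed the rows with missing entries, and kept only the numeric columns, since a correlation can only be computed between numeric variables.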

This heatmap is showing us what we might already have suspected: the price is moderately positively correlated with the size of the properties. 
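To make it clearer what the heatmap is coloring, the sketch below computes a correlation matrix for a small DataFrame. All values are made up for illustration; only the column names echo the ones used in this section.

```python
import pandas as pd

# Made-up numeric columns for illustration only
toy = pd.DataFrame(
    {
        "price": [1.0, 2.0, 3.0, 4.0, 5.0],
        "surface_covered_in_m2": [12.0, 18.0, 33.0, 41.0, 48.0],
        "surface_total_in_m2": [20.0, 25.0, 30.0, 60.0, 50.0],
    }
)

corr = toy.corr(method="pearson")

# Every column correlates perfectly with itself, so the diagonal is all ones,
# and the matrix is symmetric: corr(a, b) equals corr(b, a)
print(corr.round(2))
```

Each cell of the heatmap corresponds to one entry of this matrix, with values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).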

<font size="+1">Practice</font>

The seaborn documentation on heatmaps explains how to add numeric labels to each cell and how to use a different colormap. Modify the plot so that it uses the `viridis` colormap, draws lines of width 0.5 between cells, and shows a numeric label in each cell.

# References and Further Reading
- [Official Plotly Express Documentation on Scatter Plots](https://plotly.com/python/plotly-express/#scatter-line-area-and-bar-charts)
- [Official Plotly Express Documentation on 3D Plots](https://plotly.com/python/plotly-express/#3d-coordinates)
- [Official Plotly Documentation on Notebooks](https://plotly.com/python/ipython-notebook-tutorial/)
- [Plotly Community Forum Post on Axis Labeling](https://community.plotly.com/t/re-name-the-axes-in-plotly-express/39645/3)
- [Plotly Express Official Documentation on Tile Maps](https://plotly.com/python/plotly-express/#tile-maps)
- [Plotly Express Official Documentation on Figure Display](https://plotly.com/python/renderers/#setting-the-default-renderer)
- [Online Tutorial on String Conversion in Pandas](https://www.statology.org/convert-string-to-float-pandas/)
- [Official Pandas Documentation on using Lambda Functions on a Column](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)
- [Official seaborn Documentation on Generating a Heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html)
- [Online Tutorial on Correlation Matrices in Pandas](https://www.stackvidhya.com/plot-correlation-matrix-in-pandas-python/)
- [Official Pandas Documentation on Correlation Matrices](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)
- [Official Matplotlib Documentation on Colormaps](https://matplotlib.org/stable/tutorials/colors/colormaps.html)
- [Official Pandas Documentation on Box Plots](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#box-plots)
- [Online Tutorial on Box Plots](https://www.statology.org/matplotlib-boxplot-by-group/)
- [Online Tutorial on Axes Labels in seaborn and Matplotlib](https://www.geeksforgeeks.org/rotate-axis-tick-labels-in-seaborn-and-matplotlib/)
- [Matplotlib Gallery Example of an Annotated Heatmap](https://matplotlib.org/stable/gallery/images_contours_and_fields/image_annotated_heatmap.html#sphx-glr-gallery-images-contours-and-fields-image-annotated-heatmap-py)

---
Copyright © 2022 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
