<font size="+3"><strong>Visualizing Data: plotly express</strong></font>

There are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: [pandas](../%40textbook/07-visualization-pandas.ipynb), [Matplotlib](../%40textbook/06-visualization-matplotlib.ipynb), plotly express, and [seaborn](../%40textbook/09-visualization-seaborn.ipynb). In this section, we'll focus on using plotly express.

# Scatter Plots

A **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for **correlations**.
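
As a numeric companion to what a scatter plot shows visually, the strength of a linear relationship can be estimated with pandas. The sketch below uses made-up values (not rows from the dataset) and the column names used later in this section.

```python
import pandas as pd

# Made-up listings for illustration only (not taken from the dataset)
toy = pd.DataFrame(
    {
        "surface_covered_in_m2": [50, 65, 80, 120, 200],
        "price": [1_000_000, 1_200_000, 1_500_000, 2_600_000, 4_100_000],
    }
)

# Pearson correlation coefficient between the two numeric columns;
# values near +1 or -1 suggest a strong linear relationship
r = toy["surface_covered_in_m2"].corr(toy["price"])
print(round(r, 3))
```

A coefficient close to 1 here matches what the dots would look like in a scatter plot: they fall roughly along a rising line.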

In [None]:
import pandas as pd

mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")

# clean the data and drop `NaNs`
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)
mexico_city1 = mexico_city1.dropna(axis=0)
mexico_city1.head()

After cleaning the data, we can use plotly express to draw a scatter plot by passing the DataFrame and the names of the columns we're interested in.

In [None]:
import plotly.express as px

fig = px.scatter(mexico_city1, x="price", y="surface_covered_in_m2")
fig.show()

<font size="+1">Practice</font> 

Draw a scatter plot of the `price` and `surface_total_in_m2` columns.

In [None]:
fig = ...
fig.show()

# 3D Scatter Plots

**3D scatter plots** extend scatter plots to three variables. Three-dimensional scatter plots look great, but be careful: it can be difficult for people who aren't sure what they're looking at to accurately read the values of points in the plot. Still, they are useful for displaying relationships between three quantities that would be more difficult to observe in a two-dimensional plot.

Let's take a look at the first 50 rows of the `mexico-city-real-estate-1.csv` dataset.

In [None]:
import pandas as pd
import plotly.express as px

mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)
mexico_city1 = mexico_city1.dropna(axis=0)
# Split the place string into its parts; pandas 2.0+ requires `n` as a keyword argument
mexico_city1[
    ["First Empty", "Country", "City", "Borough", "Second Empty"]
] = mexico_city1["place_with_parent_names"].str.split("|", n=4, expand=True)
mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)
# Keep the first 50 rows by position (`.loc[1:50]` would select by index label instead)
mexico_city1_subset = mexico_city1.head(50)
fig = px.scatter_3d(
    mexico_city1_subset,
    x="Borough",
    y="surface_covered_in_m2",
    z="price",
    symbol="property_type",
    color="property_type",
    labels={
        "surface_covered_in_m2": "Surface Covered in m^2",
        "price": "Price",
        "property_type": "Property Type",
    },
)

fig.show()

Notice that the plot is interactive: you can rotate it and zoom in or out. Plots like this also make outliers easier to find; here, we can see that houses have higher prices than other types of properties.
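
To see what the `str.split` step in the cell above does, it can help to run it on a couple of strings in isolation. The sketch below assumes that `place_with_parent_names` values start and end with a `|` (which is what the "First Empty" and "Second Empty" column names suggest); the example strings themselves are made up.

```python
import pandas as pd

# Made-up strings in the assumed "|Country|City|Borough|" format
places = pd.Series(
    [
        "|México|Distrito Federal|Benito Juárez|",
        "|México|Distrito Federal|Coyoacán|",
    ]
)

# Split on "|" at most 4 times, giving 5 pieces per string;
# expand=True returns the pieces as DataFrame columns
parts = places.str.split("|", n=4, expand=True)
parts.columns = ["First Empty", "Country", "City", "Borough", "Second Empty"]

print(parts["Borough"].tolist())
```

The first and last pieces are empty strings because of the leading and trailing `|`, which is why those two columns are dropped afterwards.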

<font size="+1">Practice</font> 

Modify the DataFrame to include columns for the base 10 log of `price` and `surface_covered_in_m2` and then plot these for the entire `mexico-city-real-estate-1.csv` dataset.

In [None]:
import math



# Mapbox Scatter Plots

A **mapbox scatter plot** is a special kind of scatter plot that allows you to create scatter plots in two dimensions and then superimpose them on top of a map. Our `mexico-city-real-estate-1.csv` dataset is a good place to start, because it includes **location data**. After importing the dataset and removing rows with missing data, split the `lat-lon` column into two separate columns: one for `latitude` and the other for `longitude`. Then use these to make a mapbox plot. Unfortunately, at present this type of plot does not easily allow for marker shape to vary based on a column of the DataFrame.

In [None]:
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)
mexico_city1 = mexico_city1.dropna(axis=0)
# Split "lat,lon" at the first comma; pandas 2.0+ requires `n` as a keyword argument
mexico_city1[["latitude", "longitude"]] = mexico_city1["lat-lon"].str.split(
    ",", n=1, expand=True
)
mexico_city1["latitude"] = mexico_city1["latitude"].astype(float)
mexico_city1["longitude"] = mexico_city1["longitude"].astype(float)
fig = px.scatter_mapbox(
    mexico_city1,
    lat="latitude",
    lon="longitude",
    color="property_type",
    mapbox_style="carto-positron",
    labels={"property_type": "Property Type"},
    title="Distribution of Property Types for Sale in Mexico City",
)
fig.show()
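
The coordinate handling in the cell above can be tried on a tiny example first. This sketch assumes the `lat-lon` values are strings of the form `"latitude,longitude"`; the coordinates below are made up for illustration.

```python
import pandas as pd

# Made-up coordinate strings in an assumed "lat,lon" format
toy = pd.DataFrame({"lat-lon": ["19.36269,-99.150565", "19.291345,-99.124312"]})

# Split at the first comma into two new string columns
toy[["latitude", "longitude"]] = toy["lat-lon"].str.split(",", n=1, expand=True)

# The pieces are still strings, so convert them to floats before plotting
toy["latitude"] = toy["latitude"].astype(float)
toy["longitude"] = toy["longitude"].astype(float)

print(toy.dtypes)
```

Without the conversion to `float`, the columns would hold strings rather than numbers, so it is worth checking `dtypes` before passing them as `lat` and `lon`.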

<font size="+1">Practice</font> 

Create another column in the DataFrame with a log scale of the prices. Then create three separate plots, one for `stores`, another for `houses`, and a final one for `apartments`. Color the points in the plots by the log of the price.

In [None]:
from math import log10



# Boxplots

A **boxplot** is a graph that shows the minimum, first quartile, median, third quartile, and maximum values in a dataset. Boxplots are useful because they provide a visual summary of the data, enabling researchers to quickly identify the median, the dispersion of the dataset, and signs of skewness. In the following example, we will explore how to draw boxplots for specific columns of a DataFrame.

In [None]:
# Read Data
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")

# Clean the data and drop `NaNs`
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)
mexico_city1 = mexico_city1.dropna(axis=0)

# Exclude some outliers
mexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]
mexico_city1.head()

Check the boxplot for column `"price"`:

In [None]:
import plotly.express as px

fig = px.box(mexico_city1, y="price")
fig.show()
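
The five values that a boxplot summarizes can also be computed directly with pandas. The sketch below uses a made-up Series; note that plotly may use a different quartile method than pandas' default linear interpolation, so the numbers shown when hovering over a plotly boxplot can differ slightly.

```python
import pandas as pd

# Made-up values for illustration only
values = pd.Series([1, 2, 3, 4, 5])

# Minimum, first quartile, median, third quartile, maximum
summary = values.quantile([0, 0.25, 0.5, 0.75, 1])
print(summary.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```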

To compare the distribution of a column across the categories of another, categorical column, add an `x` argument with the name of the categorical column. In the following example, we check the price distribution across different property types:

In [None]:
fig = px.box(mexico_city1, x="property_type", y="price")
fig.show()
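
A grouped boxplot has a tabular counterpart: per-category summary statistics. The following is a minimal sketch on made-up data, using `groupby` with `describe()` to get the same kind of information as numbers.

```python
import pandas as pd

# Made-up listings for illustration only (not taken from the dataset)
toy = pd.DataFrame(
    {
        "property_type": ["house", "house", "house", "apartment", "apartment", "apartment"],
        "price": [3_000_000, 5_000_000, 9_000_000, 1_000_000, 2_000_000, 2_500_000],
    }
)

# One row per property type, with min, quartiles, and max of price
summary = toy.groupby("property_type")["price"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])
```

The `50%` column is the median, which corresponds to the line drawn inside each box.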

<font size="+1">Practice</font> 

Check the distribution of `surface_covered_in_m2` by property type.

In [None]:
fig = ...
fig.show()

# Bar Chart

A **bar chart** is a graph that shows the values of a categorical variable in a dataset. It consists of an axis and a series of labeled horizontal or vertical bars. The bars depict the frequencies of different values of a variable or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale.

In the following example, we will see some bar plots based on the Mexico City real estate dataset. Specifically, we will count the number of observations in each borough and plot them. We first need to read the dataset and extract the borough and other location information from the `"place_with_parent_names"` column.

In [None]:
# Read Data
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")

# Drop columns we don't need (rows with `NaN`s are dropped further down)
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)

# Extract location columns from place_with_parent_names
# (pandas 2.0+ requires `n` as a keyword argument)
mexico_city1[
    ["First Empty", "Country", "City", "Borough", "Second Empty"]
] = mexico_city1["place_with_parent_names"].str.split("|", n=4, expand=True)
mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)
mexico_city1 = mexico_city1.dropna(axis=0)

# Exclude some outliers
mexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]
mexico_city1 = mexico_city1[mexico_city1["Borough"] != ""]

mexico_city1.head()

We can count the number of properties in the dataset for each borough using `value_counts()`, then plot the counts as a bar plot:

In [None]:
# Use value_counts() to get the data
mexico_city1["Borough"].value_counts()

In [None]:
# Plot value_counts() data
fig = px.bar(mexico_city1["Borough"].value_counts())
fig.show()

We can make bar plots more expressive by adding more arguments. For example, we can plot the number of observations by borough and property type. First, we need to use `groupby` to calculate the counts for each combination of borough and property type:

In [None]:
size_df = mexico_city1.groupby(["Borough", "property_type"], as_index=False).size()
size_df.head()
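
To make the shape of the grouped result concrete, here is the same `groupby(...).size()` pattern on a small made-up DataFrame (the rows are invented for illustration; `as_index=False` with `size()` assumes pandas 1.1 or later).

```python
import pandas as pd

# Made-up listings for illustration only
toy = pd.DataFrame(
    {
        "Borough": ["Coyoacán", "Coyoacán", "Coyoacán", "Tlalpan", "Tlalpan"],
        "property_type": ["house", "house", "apartment", "house", "apartment"],
    }
)

# One row per (Borough, property_type) pair, with the row count in a "size" column
toy_sizes = toy.groupby(["Borough", "property_type"], as_index=False).size()
print(toy_sizes)
```

The result is a regular DataFrame with a numeric `size` column, which is the kind of input a bar plot's `y` argument needs.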

By specifying `x`, `y` and `color`, the following bar graph shows the total counts by borough, with different property types shown in different colors. Note that `y` has to be numerical, while `x` and `color` are usually categorical variables.

In [None]:
fig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="relative")
fig.show()

Note that the argument `barmode` is set to `"relative"`, which is also the default value. In this mode, bars are stacked on top of one another. We can also use `"overlay"`, where the bars for each borough are drawn at the same position, over one another, so bars can obscure each other.

In [None]:
fig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="overlay")
fig.show()

If we want bars to be placed beside each other, we can set `barmode` to `"group"`:

In [None]:
fig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="group")
fig.show()

<font size="+1">Practice</font> 

Draw a bar plot of the number of observations by property type in `"mexico-city-real-estate-1.csv"`.

In [None]:
bar_df = ...

fig = ...
fig.show()

# References and Further Reading
- [Official plotly express Documentation on Scatter Plots](https://plotly.com/python/plotly-express/#scatter-line-area-and-bar-charts)
- [Official plotly express Documentation on 3D Plots](https://plotly.com/python/plotly-express/#3d-coordinates)
- [Official plotly Documentation on Notebooks](https://plotly.com/python/ipython-notebook-tutorial/)
- [plotly Community Forum Post on Axis Labeling](https://community.plotly.com/t/re-name-the-axes-in-plotly-express/39645/3)
- [Official plotly express Documentation on Tile Maps](https://plotly.com/python/plotly-express/#tile-maps)
- [Official plotly express Documentation on Figure Display](https://plotly.com/python/renderers/#setting-the-default-renderer)
- [Online Tutorial on String Conversion in Pandas](https://www.statology.org/convert-string-to-float-pandas/)
- [Official Pandas Documentation on using Lambda Functions on a Column](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)
- [Official Seaborn Documentation on Generating a Heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html)
- [Online Tutorial on Correlation Matrices in Pandas](https://www.stackvidhya.com/plot-correlation-matrix-in-pandas-python/)
- [Official Pandas Documentation on Correlation Matrices](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)
- [Official Matplotlib Documentation on Colormaps](https://matplotlib.org/stable/tutorials/colors/colormaps.html)
- [Official Pandas Documentation on Box Plots](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#box-plots)
- [Online Tutorial on Box Plots](https://www.statology.org/matplotlib-boxplot-by-group/)
- [Online Tutorial on Axes Labels in Seaborn and Matplotlib](https://www.geeksforgeeks.org/rotate-axis-tick-labels-in-seaborn-and-matplotlib/)
- [Matplotlib Gallery Example of an Annotated Heatmap](https://matplotlib.org/stable/gallery/images_contours_and_fields/image_annotated_heatmap.html#sphx-glr-gallery-images-contours-and-fields-image-annotated-heatmap-py)

---
Copyright © 2022 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
