<font size="+3"><strong>Visualizing Data: pandas</strong></font>

There are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: pandas, [Matplotlib](../%40textbook/06-visualization-matplotlib.ipynb), [plotly express](../%40textbook/08-visualization-plotly.ipynb), and [seaborn](../%40textbook/09-visualization-seaborn.ipynb). In this section, we'll focus on using pandas.

# Correlation Matrices

When examining numerical data in columns of a DataFrame, you might want to know how well one column can be approximated as a linear function of another column. In our `mexico-city-real-estate-1` dataset, for example, we might suspect that there is some relationship between the `"price_aprox_usd"` and `"surface_covered_in_m2"` variables. For the sake of thoroughness, let's make a table that shows all the **correlations** in the dataset. The code looks like this:

In [None]:
import pandas as pd

columns = ["price_aprox_usd", "surface_covered_in_m2"]
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)
corr = mexico_city1.corr()
corr.style.background_gradient(axis=None)

As you can see, there seems to be a moderate, positive correlation between `"price_aprox_usd"` and `"surface_covered_in_m2"`, but this isn't the only relationship worth checking. For instance, what if we look at the square root of `"surface_covered_in_m2"`, which is an approximation of a property's length?

In [None]:
mexico_city1["length"] = mexico_city1["surface_covered_in_m2"] ** 0.5
corr = mexico_city1.corr()
corr.style.background_gradient(axis=None)

We see that `price_aprox_usd` has a stronger positive correlation with the `length` of a property than with `surface_covered_in_m2`. This sort of transformation can help improve the performance of a linear model.
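To see why a transformation can raise a correlation, here is a minimal sketch on made-up numbers (not the Mexico City data). It assumes a toy case in which price is exactly proportional to the square root of area, so the names `toy`, `corr_area`, and `corr_length` are purely illustrative.

```python
import pandas as pd

# Synthetic data (an assumption for illustration, not the real dataset):
# price grows with the square root of area rather than with area itself.
toy = pd.DataFrame({"area": [25, 100, 225, 400, 625, 900]})
toy["length"] = toy["area"] ** 0.5
toy["price"] = 1000 * toy["length"]

# Pearson correlation of price with each candidate feature
corr_area = toy["price"].corr(toy["area"])
corr_length = toy["price"].corr(toy["length"])

print(f"price vs. area:   {corr_area:.3f}")
print(f"price vs. length: {corr_length:.3f}")
```

Because price is a straight-line function of `length` in this toy example, that correlation is essentially perfect, while the correlation with `area` is high but lower. Real data will rarely be this clean.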

<font size="+1">Practice</font> 

Try it yourself! Repeat the previous calculations for the `mexico-city-real-estate-5.csv` dataset. Is `"length"` better correlated with `"price_aprox_local_currency"` than `"surface_covered_in_m2"`?

In [None]:
# Load CSV into DataFrame
columns = ["price_aprox_local_currency", "surface_covered_in_m2"]
mexico_city5 = ...
mexico_city5["length"] = ...
corr = ...
corr.style.background_gradient(axis=None)

# Bar Charts

A **bar chart** is a graph that shows the values of a categorical variable in a dataset. It consists of an axis and a series of labeled horizontal or vertical bars. The bars depict the frequencies of the different values of a variable, or simply the values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale.

Let's make a bar chart in pandas using the `colombia-real-estate-1` dataset. We might be curious about how many houses and apartments there are in Colombia, so let's take a look at all the values in the `property_type` variable.

While we often use Matplotlib for our visualizations, pandas has its own plotting methods, which are built on top of Matplotlib. So we can generate a Series from our DataFrame using `value_counts` and then chain the `plot` method to make our visualization. Here's what the code looks like:

In [None]:
df1 = pd.read_csv("data/colombia-real-estate-1.csv", usecols=["property_type"])
df1["property_type"].value_counts().plot(
    kind="bar", title="Property Types in Colombia", ylabel="Count"
);

If we would prefer a horizontal bar chart (it'll be easier to read the labels), we can change `"bar"` to `"barh"`, like this:

In [None]:
df1["property_type"].value_counts().plot(
    kind="barh", title="Property Types in Colombia", ylabel="Count"
);
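It can help to look at the Series that `value_counts` produces before plotting it, since the bar labels come from its index and the bar lengths from its values. The sketch below uses a small, made-up list of property types (an assumption for illustration, not the Colombia dataset).

```python
import pandas as pd

# Hypothetical property types, made up for illustration
toy = pd.DataFrame(
    {"property_type": ["house", "apartment", "apartment", "house", "apartment", "lot"]}
)

# value_counts returns a Series: the index holds the categories and the
# values hold their frequencies, sorted from most to least common
counts = toy["property_type"].value_counts()
print(counts)

# normalize=True returns shares of the total instead of raw counts
shares = toy["property_type"].value_counts(normalize=True)
print(shares)
```

Either Series can be passed on to `plot(kind="bar")` or `plot(kind="barh")` in the same way as above.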

<font size="+1">Practice</font>

Try it yourself! Use `value_counts` and the `colombia-real-estate-2` dataset to make a bar chart called `"Property Types in Colombia"`.

# Histograms

A **histogram** is a graph that shows the frequency distribution of numerical data. In addition to helping us understand frequency, histograms are also useful for detecting outliers. We can use the `hist` method of a pandas Series to draw a histogram for a specific column, as long as its data type is numerical. Let's look at the following example:

In [None]:
df1 = pd.read_csv("data/mexico-city-real-estate-1.csv")
df1.head()

We can plot the histogram for the `price` column:

In [None]:
df1["price"].hist()

We can specify the number of `bins` for the histogram to see a more detailed view of the distribution. The default number of bins is 10.

In [None]:
df1["price"].hist(bins=100)
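To make the effect of `bins` concrete, the sketch below counts how many values fall into each equal-width bin for two different bin counts. It uses made-up prices and NumPy's `histogram` function as a stand-in for the counting step, so treat it as an illustration rather than a description of exactly what pandas does internally.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only
prices = pd.Series([100, 120, 130, 150, 155, 160, 200, 220, 400, 1000])

# Split the range from the minimum to the maximum into equal-width bins
# and count how many values land in each one
counts_5, edges_5 = np.histogram(prices, bins=5)
counts_10, edges_10 = np.histogram(prices, bins=10)

print("5 bins: ", counts_5)
print("10 bins:", counts_10)
```

Both arrays add up to the number of observations; more bins only divide the same data into narrower intervals. Note how the single large value stretches the range and leaves several bins empty, which is the pattern to look for when using a histogram to spot outliers.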

<font size="+1">Practice</font>

Try it yourself! Make a histogram with 100 bins for the `"price_aprox_usd"` column in `"mexico-city-real-estate-1.csv"`:

# Scatter Plots

A **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axes indicates the values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for **correlations**.

You can create a scatter plot from a DataFrame using the `plot` method. Set `kind` to `"scatter"`, and then specify the columns you want to plot on the x- and y-axes. Check the following example:

In [None]:
df1 = pd.read_csv("data/mexico-city-real-estate-1.csv")
df1.head()

In [None]:
df1.plot(kind="scatter", x="price", y="surface_covered_in_m2")

The scatter plot shows a weak positive correlation between real estate price and area.
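A visual impression of a relationship can be paired with a number. The sketch below draws a scatter plot on a handful of made-up points (an assumption for illustration, not values from the dataset) and computes the Pearson correlation coefficient for the same two columns.

```python
import pandas as pd

# Made-up values for illustration (not taken from the dataset)
toy = pd.DataFrame(
    {
        "area": [50, 60, 80, 100, 120, 150],
        "price": [90, 70, 130, 100, 160, 140],
    }
)

# Draw the scatter plot; pandas returns a Matplotlib Axes object
ax = toy.plot(kind="scatter", x="area", y="price", title="Price vs. Area (toy data)")

# Pearson correlation coefficient between the two columns
r = toy["area"].corr(toy["price"])
print(f"Pearson r = {r:.2f}")
```

A coefficient close to 0 suggests little linear relationship, while values near 1 or -1 suggest a strong one. The dots in the plot should tell the same story as the number.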

<font size="+1">Practice</font>

Try it yourself! Make a scatter plot for `"price"` and `"surface_total_in_m2"` in `"mexico-city-real-estate-1.csv"`:

# Line Plots

**Line plots** demonstrate relationships between two variables which have some order. If we look at the data in `mexico-city-real-estate-1.csv`, a scatter plot shows us that there's a relationship between `"surface_covered_in_m2"` and `"price_aprox_local_currency"`. 

In [None]:
columns = ["surface_covered_in_m2", "price_aprox_local_currency"]
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)
mexico_city1.plot.scatter(x="surface_covered_in_m2", y="price_aprox_local_currency");

To make the relationship between these two features clearer, it would be helpful to have a line showing how price goes up as surface area increases. If we create a linear regression model using this data (as we do in Module 2), the equation for this line is `price = 3467349 + 23642 * area`. Let's create a series of `x` and `y` values for this line and then plot it.

In [None]:
df = pd.DataFrame({"x_coords": range(0, 9000, 1000)})
df["y_coords"] = 3467349 + 23642 * df["x_coords"]
df

In [None]:
df.plot(
    x="x_coords",
    y="y_coords",
    xlabel="surface_covered_in_m2",
    ylabel="price_aprox_local_currency",
    label="linear model",
);
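One way to compare the line with the data is to draw both on the same axes. The sketch below is a minimal example under stated assumptions: the five properties in `points` are made up for illustration, and only the coefficients of the line come from the equation above. It relies on the `ax` argument of the pandas `plot` method, which lets a second plot reuse the Axes returned by the first.

```python
import pandas as pd

# Made-up properties for illustration; only the line's coefficients
# come from the equation in the text
points = pd.DataFrame(
    {
        "surface_covered_in_m2": [50, 120, 300, 750, 2000],
        "price_aprox_local_currency": [
            4_000_000, 7_500_000, 9_000_000, 25_000_000, 48_000_000
        ],
    }
)

# Coordinates for the line price = 3467349 + 23642 * area
line = pd.DataFrame({"x_coords": range(0, 9000, 1000)})
line["y_coords"] = 3467349 + 23642 * line["x_coords"]

# Draw the scatter plot first, then pass its Axes to the line plot
# so that both appear in the same figure
ax = points.plot.scatter(x="surface_covered_in_m2", y="price_aprox_local_currency")
line.plot(
    x="x_coords",
    y="y_coords",
    ax=ax,
    color="red",
    xlabel="surface_covered_in_m2",
    ylabel="price_aprox_local_currency",
    label="linear model",
);
```

Plotting the scatter first keeps the axis labels tied to the data columns, and the `label` argument gives the line an entry in the legend.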

<font size="+1">Practice</font>

Create a line plot for properties with areas from `0` to `8000`, where the price is determined by the equation `price = 2500000 + 2000 * area`.

In [None]:
df = pd.DataFrame({"x_coords": range(0, 9000, 1000)})
df["y_coords"] = ...


# References & Further Reading

- [Online Tutorial on Correlation Matrices using Pandas](https://www.stackvidhya.com/plot-correlation-matrix-in-pandas-python/)
- [Official Pandas Documentation on Correlations in DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)
- [Official Pandas Documentation on Styling a Table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.background_gradient.html)
- [Wikipedia Article on Correlation](https://en.wikipedia.org/wiki/Correlation)
- [Investopedia Article on Correlation](https://www.investopedia.com/terms/c/correlationcoefficient.asp)
- [Online Tutorial on Correlations](https://www.statology.org/what-is-a-strong-correlation/)
- [Pandas Documentation for Bar Charts](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.barh.html)
- [Pandas Official Visualization User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)
- [Pandas Official Documentation on Sorting Values in a DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values)

---
Copyright © 2022 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
