# Categorical Data

Categorical data pairs a quantitative variable with a qualitative (categorical) variable. Survey tallies like the ones we see on the show Family Feud, or the frequency of each eye color in a population, are examples of categorical data. Categorical data can be visualized using bar graphs and pie charts.
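
As a minimal sketch of what categorical data looks like in code, the snippet below tallies a small list of eye colors with Python's `collections.Counter`. The list `eye_colors` is hypothetical and made up for illustration: the colors are the qualitative variable and the counts are the quantity we would plot.

```python
from collections import Counter

# Hypothetical survey responses (made up for illustration)
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

# Count how many times each category appears
counts = Counter(eye_colors)

# Print the categories from most to least frequent
for color, n in counts.most_common():
    print(color, n)
```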

## Bar graphs

Bar graphs are a popular method used to visualize categorical data. They're simple, concise, and can condense
large and complex datasets into a simple visual summary. Most bar graphs depict a categorical element as an
independent variable on the x-axis while the height of the bar corresponds to a numerical variable on the y-axis.
We will practice making bar graphs using our military dataset.

First, let's create a graph to examine the percent of the GDP spent on the military in Canada. We will look at the years 2018, 2019, and 2020. To do this, we must extract the data for the years of interest from the column containing the data pertaining to GDP percentage of military spending in Canada. We will call this `can_gdp`. Then, we can call the `plt.bar()` function to create a bar chart using this data. The `plt.bar()` function needs two arguments. The first argument, `x`, can be an array of values that will be plotted on the x-axis. The second argument, `height`, determines the height of the bars and is essentially the y-values. We can create a list of our years of interest and call it `years` to input as the first argument. We can use `can_gdp` as our second argument.

In [None]:
can_gdp = military.iloc[[58,59,60], 1]
years = ['2018', '2019', '2020']
plt.bar(years, can_gdp)
plt.show()
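
Because `military` is loaded elsewhere, here is a self-contained sketch of the same call pattern. The names `categories` and `heights` and their values are placeholders invented for illustration, not figures from the dataset. It shows that `x` supplies the bar positions (here, category labels) and `height` supplies the bar heights.

```python
import matplotlib.pyplot as plt

# Placeholder values, NOT taken from the military dataset
categories = ['A', 'B', 'C']
heights = [1.0, 2.5, 1.8]

# x = categories (positions on the x-axis), height = heights (bar heights)
bars = plt.bar(x=categories, height=heights)
plt.show()
```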

Our first plot of Canada's spending works, but it has no descriptive labels that would help others understand the data. We need to add an axis label and a title to communicate what is being measured. Aesthetically, we can also change the width of each bar to leave more room on the graph.

In [None]:
plt.bar(years, can_gdp, width = 0.25)
plt.title('Military Spending in Canada')
plt.ylabel('Percentage of GDP')
plt.show()

This plot looks better and is a lot more descriptive. Let's add the data from Mexico and the United States. To do this, we can use the `plt.subplots()` function. This function creates a `Figure` object and an `Axes` object, which we will name `fig` and `ax`, respectively. Using this function, we can add data for Canada, Mexico, and the United States to the same plot, within the boundaries of the same axes. We do this by calling `ax.bar()` once per country.

In [None]:
can_gdp = military.iloc[[58,59,60], 1]
mex_gdp = military.iloc[[58,59,60], 2]
usa_gdp = military.iloc[[58,59,60], 3]
years = ['2018', '2019', '2020']
x = np.arange(len(years))

#Create plotting area and subplots
fig, ax = plt.subplots()
can_bar = ax.bar(x - 0.25, can_gdp, width = 0.25)
mex_bar = ax.bar(x , mex_gdp, width = 0.25)
usa_bar = ax.bar(x + 0.25, usa_gdp, width = 0.25)


plt.tight_layout()
plt.show()
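
The offsets `x - 0.25`, `x`, and `x + 0.25` place the three bars for each year side by side, centered on the tick at `x`. The sketch below generalizes that arithmetic. Note that `group_offsets` is a hypothetical helper name used for illustration, not a Matplotlib function.

```python
import numpy as np

def group_offsets(n_series, width):
    """Return one offset per series so each group is centered on 0."""
    return (np.arange(n_series) - (n_series - 1) / 2) * width

x = np.arange(3)                  # one tick position per year
offsets = group_offsets(3, 0.25)  # three series, each bar 0.25 wide

# Bar centers for each series, one array per country
for off in offsets:
    print(x + off)
```

With three series and a width of 0.25, the offsets are -0.25, 0, and 0.25, which matches the values used in the cell above. Each group then spans 0.75 units, leaving a gap before the next year's group.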

We were able to create a bar plot with all three data sets together. Now, let's add the appropriate title, axis labels, and other details. To add a title to the graph, we can call the `plt.title()` function, just as we did before. Because we are now working with an `Axes` object, we add axis labels with the `.set_xlabel()` and `.set_ylabel()` methods on `ax`. We can also print the numerical value above each bar by calling the `bar_label()` method on `ax` and passing it the object returned by each `ax.bar()` call, which is why we save those as `can_bar`, `mex_bar`, and `usa_bar`. Finally, the `label` given when creating each set of bars provides the names shown by `ax.legend()`.

In [None]:
#Create plotting area and subplots
fig, ax = plt.subplots()
can_bar = ax.bar(x - 0.25, can_gdp.round(decimals=2), 0.25, label='Canada') 
mex_bar = ax.bar(x , mex_gdp.round(decimals=2), 0.25, label='Mexico')
usa_bar = ax.bar(x + 0.25, usa_gdp.round(decimals=2), 0.25, label='USA')
    # .round(decimals=2) rounds to 2 places after decimal

# Add labels, title, custom x-axis tick labels, etc.
plt.title("Military Spending in North America", pad = 10)
ax.set_ylabel('Percentage of GDP')
ax.set_xlabel('Year')
ax.set_xticks(x, years)
ax.bar_label(can_bar, label_type= "edge", padding=4)
ax.bar_label(mex_bar, label_type= "edge", padding=4)
ax.bar_label(usa_bar, label_type= "edge", padding=4)

ax.legend(loc=4, bbox_to_anchor=(1.3, 0.5))
plt.ylim(0, 5)

plt.tight_layout()
plt.show()
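
As an alternative to rounding the data before plotting, `bar_label()` accepts a `fmt` argument that controls how each value is printed. The sketch below uses placeholder values rather than the military data, and it assumes Matplotlib 3.4 or later, where `bar_label()` is available.

```python
import matplotlib.pyplot as plt

# Placeholder values, NOT taken from the military dataset
years = ['2018', '2019', '2020']
values = [1.234, 1.301, 1.418]

fig, ax = plt.subplots()
bars = ax.bar(years, values, width=0.25, label='Example')

# fmt controls how each height is printed; the data stay unrounded
texts = ax.bar_label(bars, fmt='%.2f', padding=4)

ax.legend()
plt.show()
```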

Great! Our North America graph is now well annotated and visually appealing, and it conveys an important message about the data: the percentage of GDP spent on the military by each country from 2018 to 2020. From this graph, we can easily see that during this time period, Canada and Mexico contributed a smaller proportion of their GDP to military spending than the United States did. This may not have been easily discernible by just looking at our large data table.

## Pie charts

Pie charts are a commonly used visualization method to represent proportions in datasets. Pie charts use *wedges* to represent the numerical value of a proportion corresponding to a categorical variable. While pie charts are very common and can be easily interpreted by a lay audience, they may not be the best way to represent data in certain cases. Firstly, because pie charts use the area of a wedge to represent the proportion of a categorical variable, it can be difficult to gauge the numerical value that a wedge represents if the area doesn't appear as an easily discernible fraction (e.g. 1/2, 1/3, 1/4). This can be mitigated with labels and legends that explicitly show the numerical values associated with the wedges of the pie chart. Secondly, if you want to visualize many categories, or categories that make up a very small proportion of the dataset, they may be difficult to see on a pie chart. Overall, pie charts can be a simple and effective way to communicate proportional categorical data, but before using them, consider which attributes of the data need to be highlighted to help decide whether a pie chart is the most appropriate visualization method.

Next, let's look at the total amount of money spent on the military in the entire North American continent in 2020 and determine what proportion of this total came from each country. To do this, we will need to extract data for the year 2020 and call it `pie2020`. Then, we can make a pie chart using the `plt.pie()` function. We will use `pie2020` to determine the wedge sizes and the argument `normalize=True` to scale the data so that the values sum to 1. We'll also set the figure size using `plt.figure()`.

In [None]:
pie2020 = military.iloc[60,[4,5,6]]
plt.figure(figsize=(10, 6))
pie = plt.pie(pie2020, normalize = True)
plt.show()
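
To see what `normalize=True` does, the sketch below performs the same scaling by hand on placeholder values (not the dataset). Each value is divided by the total so that the fractions sum to 1, and each fraction corresponds to a share of the full 360 degrees of the circle.

```python
import numpy as np

# Placeholder spending values, NOT taken from the military dataset
values = np.array([20.0, 5.0, 75.0])

fractions = values / values.sum()  # scaling equivalent to normalize=True
angles = 360 * fractions           # wedge angle in degrees

print(fractions)
print(angles)
```

With `normalize=False`, Matplotlib instead expects values that already sum to at most 1 and raises a `ValueError` if they sum to more.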

Now that we have a pie chart, let's add some more detail to make it more descriptive. We can label the wedges of the chart so that we know which country corresponds to which color. Likewise, we can label each wedge with its percentage to show the exact proportion of each country's contribution to the total amount of money spent on the military in North America. To do this, we can create a list called `countries`, containing the strings "Canada", "Mexico", and "USA". We can then pass `countries` as the `labels` argument of `plt.pie()`. We can also add the `autopct` argument, which labels the wedges using a printf-style format string. More information on that format can be found here: <https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting>




In [None]:
plt.figure(figsize=(10, 6))

countries = ['Canada', 'Mexico', 'USA']
pie = plt.pie(pie2020, normalize = True, labels = countries, autopct='%.1f')
plt.show()
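
The `autopct` string is applied to each wedge's percentage with Python's `%` formatting operator. Here is a short sketch of how a few format strings behave; the value `96.4` is just an example input.

```python
pct = 96.4  # example percentage

one_decimal = '%.1f' % pct   # number with one decimal place
with_sign = '%.1f%%' % pct   # '%%' prints a literal percent sign
no_decimals = '%d' % pct     # drops the fractional part

print(one_decimal, with_sign, no_decimals)
```

The format `'%.1f'` used in the cell above therefore prints the number without a percent sign; `'%.1f%%'` would append one.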

This pie chart is okay, but it can be better. Because the wedges for Mexico and Canada are much smaller than the wedge for the United States, overlaying the percentages on the wedges crowds the labels and is visually displeasing. Instead, let's add the percentages to a legend along with the label for each wedge. Let's also add a title so others know what they are looking at when they view this chart. We'll use the following code:

In [None]:
plt.figure(figsize=(10, 6))


patches, text = plt.pie(pie2020, normalize = True)
labels = ['Canada (2.8 %)', 'Mexico (0.8 %)', 'USA (96.4 %)']
total = pie2020.sum().round(decimals=1) #finds the sum of pie2020 and rounds it to 1 position after the decimal

plt.legend(patches, labels, loc=4, bbox_to_anchor=(0.8, -0.2), fontsize=15)
plt.title("Military Spending in North America in 2020" + " (" + str(total) + " Billion USD)",  loc = 'center',
         fontsize = 15)


plt.show()

In creating this chart, we used `plt.pie()` in a way that we had not used it before. By default, the `plt.pie()` function returns two values, which we named `patches` and `text`. `patches` is a list of the wedge objects drawn on the chart, and `text` is a list of the text objects for the wedge labels. Here, we needed to capture these return values explicitly because we later pass `patches` as an argument to the `plt.legend()` function.
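
A minimal sketch of this unpacking, using placeholder values rather than the dataset. It also shows that passing `autopct` adds a third return value holding the percentage text objects.

```python
import matplotlib.pyplot as plt

values = [20, 5, 75]  # placeholder values for illustration

# Without autopct: wedge patches and label text objects
patches, texts = plt.pie(values)

# With autopct: a third list holding the percentage text objects
plt.figure()
patches2, texts2, autotexts = plt.pie(values, autopct='%.1f')

print(type(patches[0]).__name__, len(patches))
print([t.get_text() for t in autotexts])
plt.show()
```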

Our call to `plt.legend()` takes two positional arguments. The first argument specifies **what** is being labeled. In our case, the wedges of the pie chart (i.e. the `patches` object) are being labeled. The second argument specifies **how** they are labeled. Here, we simply created a variable called `labels`, which consists of the three strings 'Canada (2.8 %)', 'Mexico (0.8 %)' and 'USA (96.4 %)'. The other arguments, `loc`, `bbox_to_anchor`, and `fontsize`, are optional when using the `plt.legend()` function.
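
The legend labels above were typed by hand. As a hedged alternative, they could be built from the data. The sketch below uses placeholder values (not the dataset), and the variable names `names` and `values` are illustrative.

```python
names = ['Canada', 'Mexico', 'USA']
values = [20.0, 5.0, 75.0]  # placeholder values, NOT from the dataset

total = sum(values)

# One label per wedge in the form "Name (xx.x %)"
labels = [f"{name} ({100 * v / total:.1f} %)" for name, v in zip(names, values)]

print(labels)
```

Building the labels this way keeps the legend in sync with the data if the values change.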

## Conclusions

In this section, we were introduced to the `plt.bar()` and `plt.pie()` functions to construct bar plots and pie charts, respectively. The `plt.bar()` function requires `x` and `height` arguments, each of which can be an array of values, and it accepts other optional parameters. The `plt.pie()` function requires only an `x` argument as an array of values and also accepts other optional arguments. Both of these types of visualizations are used for depicting categorical data. As a reminder, when deciding whether to use a pie chart, consider certain attributes of the data, such as the number of categories or the size of the proportions to be plotted. Below is a list of functions with linked documentation for your reference and further reading:

- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html">plt.bar()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.pie.html">plt.pie()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html">plt.subplots()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.title.html">plt.title()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylim.html">plt.ylim()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xlim.html">plt.xlim()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylabel.html">plt.ylabel()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xlabel.html">plt.xlabel()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.bar.html">ax.bar()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.legend.html">ax.legend()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.tight_layout.html">plt.tight_layout()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html">plt.figure()</a>
- <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html">plt.show()</a>
