# Plotting with Pandas and Matplotlib


In this tutorial, we'll quickly review how to create the chart types covered in our course lectures, including box plots, histograms, bar charts, and more. Simple plots can be generated directly from [pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) (you can practice this yourself), but for finer control over the details of a plot we'll use the [Matplotlib](https://matplotlib.org/stable/gallery/index.html) plotting module. Matplotlib is both powerful and flexible, as we'll see throughout this tutorial.
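To make the difference concrete, here is a minimal sketch that draws the same made-up numbers twice: once with the pandas shortcut and once by calling Matplotlib directly. The DataFrame `df` and its `month` and `permits` columns are illustrative assumptions, not part of the tutorial's dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers, for illustration only
df = pd.DataFrame({"month": [1, 2, 3, 4], "permits": [120, 95, 140, 160]})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Quick plot straight from pandas; pandas draws onto the Axes we pass in
df.plot(x="month", y="permits", ax=ax_left, title="pandas: df.plot()")

# Same data through Matplotlib directly, with explicit styling choices
ax_right.plot(df["month"], df["permits"],
              color="tab:red", linestyle="--", marker="o")
ax_right.set_title("Matplotlib: ax.plot()")
ax_right.set_xlabel("month")
ax_right.set_ylabel("permits")

fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes. The pandas call is shorter, while the Matplotlib call exposes each styling decision explicitly.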

Python has many useful libraries for plotting.
Some of the most popular are:
    
1. [matplotlib](https://matplotlib.org/): Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. It can be used in Python scripts, the Python and IPython shells, web application servers, and six graphical user interface toolkits.

2. [seaborn](https://seaborn.pydata.org/): Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

3. [plotly](https://plot.ly/python/): Plotly's Python graphing library makes interactive, publication-quality graphs online. It supports line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.

4. [bokeh](https://bokeh.pydata.org/en/latest/): Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets.

5. [ggplot](http://ggplot.yhathq.com/): ggplot is a plotting system for Python based on R's ggplot2 and the Grammar of Graphics. It is built for making professional-looking plots quickly with minimal code.

6. [holoviews](http://holoviews.org/): HoloViews is an open-source Python library designed to make data analysis and visualization seamless and simple. With HoloViews, you can usually express what you want to do in very few lines of code, letting you focus on what you are trying to explore and convey, not on the process of plotting.


```{note}

Explore the galleries and examples of different visualization libraries above to learn what’s possible to do in Python.

```

# Anatomy of a Plot

Before visualizing our data, let's look at the components of a plot. We won't go into the details of each plot type in this tutorial; instead, we give a brief introduction to the kinds of plots that can be created with Python and to the essential elements of a plot.

Here is a list of several types of plots used to represent different kinds of data:

- [Bar Chart](https://en.wikipedia.org/wiki/Bar_chart)
- [Histogram](https://en.wikipedia.org/wiki/Histogram)
- [Scatter Plot](https://en.wikipedia.org/wiki/Scatter_plot)
- [Line Chart](https://en.wikipedia.org/wiki/Line_chart)
- [Pie Chart](https://en.wikipedia.org/wiki/Pie_chart)
- [Box Plot](https://en.wikipedia.org/wiki/Box_plot)
- [Violin Plot](https://en.wikipedia.org/wiki/Violin_plot)
- [Dendrogram](https://en.wikipedia.org/wiki/Dendrogram)
- [Chord Diagram](https://en.wikipedia.org/wiki/Chord_diagram_(information_visualization))
- [Treemap](https://en.wikipedia.org/wiki/Treemap)
- [Network Chart](https://en.wikipedia.org/wiki/Network_chart)

Despite the variety, most plots share common elements. Understanding basic terminology helps when creating or modifying plots. The figure below illustrates elements of a basic line plot.

![Basic elements of a plot](img/basic-elements-of-plot.png)
*Basic elements of a plot. (source: https://geo-python-site.readthedocs.io/en/latest/lessons/L7/plot-anatomy.html)*


## Common Plotting Terms

These terms may vary slightly depending on the plotting library, and for this list, we use typical Matplotlib terms.

| Term         | Description                                                                                    |
|--------------|-----------------------------------------------------------------------------------------------|
| *Axis*       | Graph axes, typically x, y, and z for 3D plots.                                               |
| *Title*      | Title of the plot.                                                                            |
| *Label*      | Name for the axis (e.g., xlabel or ylabel).                                                   |
| *Legend*     | Legend for the plot.                                                                          |
| *Tick Label* | Text or values represented on the axis.                                                       |
| *Symbol*     | Symbol for data point(s) on a scatter plot, presented with different shapes/colors.           |
| *Size*       | Size of a point on a scatter plot, also used for text sizes on a plot.                         |
| *Linestyle*  | The style of how the line should be drawn, solid or dashed, for example.                        |
| *Linewidth*  | The width of a line on a plot.                                                                |
| *Alpha*      | Transparency level of a filled element in a plot (values between 0.0 and 1.0).                 |
| *Tick(s)*    | Refers to the tick marks on a plot.                                                          |
| *Annotation* | Text added to a plot.                                                                        |
| *Padding*    | The distance between an (axis/tick) label and the axis.                                      |
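The sketch below maps most of the terms in the table to Matplotlib calls. The data values and label texts are made up for illustration; only the Matplotlib functions themselves are real API.

```python
import matplotlib.pyplot as plt

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 6, 5]

fig, ax = plt.subplots(figsize=(6, 4))

# Linestyle, linewidth and alpha control how the line is drawn
ax.plot(x, y, linestyle="--", linewidth=2, alpha=0.8, label="line")

# Symbol (marker) and size of the individual data points
ax.scatter(x, y, marker="o", s=60, label="points")

ax.set_title("Anatomy of a plot")                # Title
ax.set_xlabel("x value", labelpad=10)            # Label, with padding from the axis
ax.set_ylabel("y value", labelpad=10)
ax.set_xticks(x)                                 # Ticks
ax.set_xticklabels(["a", "b", "c", "d", "e"])    # Tick labels
ax.annotate("peak", xy=(4, 6), xytext=(2.5, 5.8),
            arrowprops={"arrowstyle": "->"})     # Annotation
ax.legend()                                      # Legend
```

Changing one argument at a time (for example `alpha` or `labelpad`) and re-running the cell is a quick way to see what each term controls.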


In [1]:
import warnings
warnings.filterwarnings("ignore")

## Getting started

Let's start by importing pandas and Matplotlib.

In [2]:
# import required libraries

import pandas as pd
import matplotlib.pyplot as plt


## Input data: Building Permits by Community

Our input data in this tutorial is a text file containing the Building Permits by Community data for the city of Calgary, Alberta, Canada, retrieved from the [City of Calgary Open Data Portal](https://data.calgary.ca/browse?limitTo=datasets):

- File name: [Building_Permits_20240122.csv]
- You can download the data from the [City of Calgary Open Data Portal](https://data.calgary.ca/Business-and-Economic-Activity/Building-Permits-by-Community/kr8b-c44i).
- The data covers building permit applications made to The City of Calgary's Planning & Development department in 2023.
- The dataset has 22,073 rows and 30 columns.

In [3]:
data = pd.read_csv('Building_Permits_20240122.csv')

In [4]:
data.head()

Unnamed: 0,PermitNum,StatusCurrent,AppliedDate,IssuedDate,CompletedDate,PermitType,PermitTypeMapped,PermitClass,PermitClassGroup,PermitClassMapped,...,CommunityCode,CommunityName,Latitude,Longitude,LocationCount,LocationTypes,LocationAddresses,LocationsWKT,LocationsGeoJSON,Point
0,BP2023-00001,Cancelled,2023/01/01,,2023/08/04,Commercial / Multi Family Project,Building,3106 - Retail Shop,Commercial,Non-Residential,...,SNA,SUNALTA,51.037938,-114.095388,2.0,Titled Parcel;Building,1438 17 AV SW;1438 17 AV SW,MULTIPOINT (-114.09538783812974 51.03793752506...,"{""type"":""MultiPoint"",""coordinates"":[[-114.0953...",POINT (-114.09538783812974 51.03793752506271)
1,BP2023-00002,Completed,2023/01/01,2023/01/24,2023/06/22,Residential Improvement Project,Building,1301 - Private Detached Garage,Garage,Residential,...,WWO,WOLF WILLOW,50.874928,-114.009338,2.0,Titled Parcel;Building,223 WOLF CREEK AV SE;223 WOLF CREEK AV SE,MULTIPOINT (-114.00933819969845 50.87492753090...,"{""type"":""MultiPoint"",""coordinates"":[[-114.0093...",POINT (-114.00933819969845 50.874927530904024)
2,BP2023-00016,Completed,2023/01/02,2023/01/02,2023/05/12,Demolition,Demolition,1106 - House,Single Family,Residential,...,BRD,BRIDGELAND/RIVERSIDE,51.053518,-114.039212,1.0,Titled Parcel,205 9A ST NE,POINT (-114.03921222082343 51.05351846849098),"{""type"":""Point"",""coordinates"":[-114.0392122,51...",POINT (-114.03921222082343 51.05351846849098)
3,BP2023-00015,Completed,2023/01/02,2023/01/03,2023/11/16,Residential Improvement Project,Building,1301 - Private Detached Garage,Garage,Residential,...,DAL,DALHOUSIE,51.108398,-114.147277,2.0,Titled Parcel;Building,6204 DALBEATTIE HL NW;6204 DALBEATTIE HL NW,MULTIPOINT (-114.1472766668805 51.108397819725...,"{""type"":""MultiPoint"",""coordinates"":[[-114.1472...",POINT (-114.1472766668805 51.10839781972519)
4,BP2023-00010,Issued Permit,2023/01/02,2023/01/03,,Residential Improvement Project,Building,1101 - Basement Development,Single Family,Residential,...,SHN,SHAWNESSY,50.908529,-114.095202,2.0,Titled Parcel;Building,72 SHAWBROOKE MR SW;72 SHAWBROOKE MR SW,MULTIPOINT (-114.09520199036152 50.90852910976...,"{""type"":""MultiPoint"",""coordinates"":[[-114.0952...",POINT (-114.09520199036152 50.90852910976423)


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22073 entries, 0 to 22072
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PermitNum          22073 non-null  object 
 1   StatusCurrent      22073 non-null  object 
 2   AppliedDate        22073 non-null  object 
 3   IssuedDate         19574 non-null  object 
 4   CompletedDate      10433 non-null  object 
 5   PermitType         22073 non-null  object 
 6   PermitTypeMapped   22073 non-null  object 
 7   PermitClass        22072 non-null  object 
 8   PermitClassGroup   22073 non-null  object 
 9   PermitClassMapped  22073 non-null  object 
 10  WorkClass          22073 non-null  object 
 11  WorkClassGroup     22073 non-null  object 
 12  WorkClassMapped    21533 non-null  object 
 13  Description        20973 non-null  object 
 14  ApplicantName      13679 non-null  object 
 15  ContractorName     13282 non-null  object 
 16  HousingUnits       220

### Quickly clean the data and remove unnecessary information

In [6]:
# keep only the columns we need

data = data[['PermitNum', 'StatusCurrent', 'AppliedDate', 'IssuedDate', 'CompletedDate', 'PermitType', 'PermitClass', 'PermitClassGroup', 'WorkClassGroup', 'HousingUnits', 'EstProjectCost', 'CommunityCode', 'Latitude', 'Longitude', 'LocationCount']]

data.head()

Unnamed: 0,PermitNum,StatusCurrent,AppliedDate,IssuedDate,CompletedDate,PermitType,PermitClass,PermitClassGroup,WorkClassGroup,HousingUnits,EstProjectCost,CommunityCode,Latitude,Longitude,LocationCount
0,BP2023-00001,Cancelled,2023/01/01,,2023/08/04,Commercial / Multi Family Project,3106 - Retail Shop,Commercial,Improvement,0,,SNA,51.037938,-114.095388,2.0
1,BP2023-00002,Completed,2023/01/01,2023/01/24,2023/06/22,Residential Improvement Project,1301 - Private Detached Garage,Garage,New,0,42530.0,WWO,50.874928,-114.009338,2.0
2,BP2023-00016,Completed,2023/01/02,2023/01/02,2023/05/12,Demolition,1106 - House,Single Family,Demolition,0,,BRD,51.053518,-114.039212,1.0
3,BP2023-00015,Completed,2023/01/02,2023/01/03,2023/11/16,Residential Improvement Project,1301 - Private Detached Garage,Garage,New,0,46291.0,DAL,51.108398,-114.147277,2.0
4,BP2023-00010,Issued Permit,2023/01/02,2023/01/03,,Residential Improvement Project,1101 - Basement Development,Single Family,Improvement,0,56141.0,SHN,50.908529,-114.095202,2.0


In [7]:
# how many null values are there?   

data.isnull().sum()

PermitNum               0
StatusCurrent           0
AppliedDate             0
IssuedDate           2499
CompletedDate       11640
PermitType              0
PermitClass             1
PermitClassGroup        0
WorkClassGroup          0
HousingUnits            0
EstProjectCost       2461
CommunityCode           7
Latitude                0
Longitude               0
LocationCount           7
dtype: int64

In the next steps we want to plot total project costs, so the cost values need to be complete. We therefore remove all rows with a missing `EstProjectCost`.

In [8]:
# remove rows with null values in the EstProjectCost column

data = data.dropna(subset=['EstProjectCost'])

# how many null values are there now?

data.isnull().sum()

PermitNum               0
StatusCurrent           0
AppliedDate             0
IssuedDate           1592
CompletedDate       10781
PermitType              0
PermitClass             0
PermitClassGroup        0
WorkClassGroup          0
HousingUnits            0
EstProjectCost          0
CommunityCode           6
Latitude                0
Longitude               0
LocationCount           6
dtype: int64

Convert the `AppliedDate` column to `datetime` format so that we can work with it as time data afterward.

In [10]:
# convert the AppliedDate column to datetime format
data['AppliedDate'] = pd.to_datetime(data['AppliedDate'])
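As a small illustration of what the conversion makes possible, the sketch below uses three made-up rows in the same `YYYY/MM/DD` layout as the `AppliedDate` values shown earlier. The `Month` and `Weekday` columns are hypothetical additions for this example only, and passing `format` explicitly is optional.

```python
import pandas as pd

# Made-up rows using the same YYYY/MM/DD layout as the AppliedDate column
df = pd.DataFrame({"AppliedDate": ["2023/01/01", "2023/01/02", "2023/02/15"]})

# An explicit format is optional, but it documents the expected layout
df["AppliedDate"] = pd.to_datetime(df["AppliedDate"], format="%Y/%m/%d")

# Once the column is datetime, the .dt accessor exposes date parts
df["Month"] = df["AppliedDate"].dt.month
df["Weekday"] = df["AppliedDate"].dt.day_name()

print(df)
```

With plain strings, neither `.dt.month` nor `.dt.day_name()` would be available, which is why the conversion comes before any time-based grouping.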


## Exploratory Analysis by plotting charts

Now let's create some charts to get more familiar with the data and gain more insight into its statistics.

We proceed to extract the date component from the 'AppliedDate' column and count the number of rows for each unique date. The `groupby()` method groups the DataFrame by the date component, and `size()` counts the occurrences. We then reset the index and name the resulting DataFrame columns as 'Datetime' and 'Row Count'.
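The grouping step described above can be sketched as follows. The six timestamps are made up, and the variable name `daily_counts` is a hypothetical choice; only the column names 'Datetime' and 'Row Count' come from the description.

```python
import pandas as pd

# Made-up applications; the real column comes from the permits file
df = pd.DataFrame({
    "AppliedDate": pd.to_datetime(
        ["2023-01-01 09:00", "2023-01-01 14:30", "2023-01-02 10:15",
         "2023-01-04 08:00", "2023-01-04 11:45", "2023-01-04 16:20"]
    )
})

# Group by the date part, count rows per date, then tidy the column names
daily_counts = (
    df.groupby(df["AppliedDate"].dt.date)
      .size()
      .reset_index(name="Row Count")
      .rename(columns={"AppliedDate": "Datetime"})
)

print(daily_counts)
```

Dates with no applications (here 2023-01-03) do not appear in the result, because `size()` only counts groups that exist in the data.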

### Line Chart
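A line chart of daily counts could look like the sketch below. It assumes a table with 'Datetime' and 'Row Count' columns; the seven values and the title and axis texts are made up for illustration and are not results from the permits dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily counts in the 'Datetime' / 'Row Count' layout
counts = pd.DataFrame({
    "Datetime": pd.date_range("2023-01-01", periods=7, freq="D"),
    "Row Count": [12, 48, 55, 61, 58, 40, 9],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(counts["Datetime"], counts["Row Count"], linewidth=1.5)
ax.set_title("Permit applications per day (made-up data)")
ax.set_xlabel("Date applied")
ax.set_ylabel("Number of applications")
fig.autofmt_xdate()  # tilt the date tick labels so they do not overlap
```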



In [9]:
import folium  # map rendering library

f = folium.Figure(width=800, height=500)  # set figure size

# center the map on Calgary and attach it to the figure
m = folium.Map(location=[51.0447, -114.0719], zoom_start=12).add_to(f)

# add one small circle per permit location (first 10,000 rows only)
for index, row in data[:10000].iterrows():
    folium.CircleMarker([row['Latitude'], row['Longitude']],
                        radius=0.1
                        ).add_to(m)

# display map
f