# UEP-0239: Python for Data Analysis and Visualization

---

**A Tufts University Data Lab Tutorial**  
Written by [Uku-Kaspar Uustalu](https://directory.tufts.edu/user/view/90E8E773F8EC92B23679584546E5E321/)

Contact: <uku-kaspar.uustalu@tufts.edu>

Last updated: `GH_ACTIONS_DATE`

---

## Importing Packages

We will be using the following Python data analysis and visualization libraries throughout this tutorial:

- [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) is the primary data analysis library in Python. It allows for easy analysis and manipulation of tabular data and is usually imported under the alias `pd`.
- [Matplotlib](https://matplotlib.org/) is the most essential data visualization library in Python. Although it consists of many modules, most of the plotting functionality is contained within the `matplotlib.pyplot` module, which is usually imported under the alias `plt`.
- [Seaborn](https://seaborn.pydata.org/) is an advanced plotting library that is built on top of Matplotlib. It has a simpler interface and allows for the easy creation of beautiful visualizations. Seaborn is usually imported under the alias `sns`.
- [HVPlot](https://hvplot.holoviz.org/) is a high-level plotting interface that integrates seamlessly with Pandas and allows for the easy creation of interactive visualizations. The `hvplot.pandas` module must be imported to allow for seamless integration with Pandas.
- [Plotly](https://plotly.com/python/) is an alternative interactive visualization library. It consists of many modules, but the `plotly.express` module is the easiest to use as it allows for the creation of whole plots using a single command. The module is usually imported under the alias `px`.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import hvplot.pandas
import plotly.express as px

---

## Getting Started with Pandas

For the first part of this tutorial, we will be using the following datasets from the `data` directory to investigate the relationship between health and wealth:

- [`gdp.csv`](./data/gdp.csv) -- World Bank gross domestic product (GDP) estimates (in USD) for world countries and regions from 1960 until 2021
- [`life-expectancy.csv`](./data/life-expectancy.csv) -- World Bank life expectancy estimates for world countries and regions from 1960 until 2020
- [`m49.csv`](./data/m49.csv) -- United Nations [M49](https://en.wikipedia.org/wiki/UN_M49) Standard Country or Area Codes for Statistical Use
- [`population.csv`](./data/population.csv) -- World Bank population estimates for world countries and regions from 1960 until 2021

All the datasets are in [IETF RFC 4180 CSV](https://www.rfc-editor.org/info/rfc4180) (comma-separated values) format. The first four rows of the World Bank data files contain metadata, with the actual data table starting on row five.

Let us start by reading in the population data. Pandas can easily read CSV datasets via the [`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function. The function reads the contents of the file into a [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) data structure and supports various additional arguments. For example, we can use the `skiprows` argument to tell Pandas to skip the first four rows of the dataset, as the data table does not start until row five.

In [None]:
population = pd.read_csv('data/population.csv', skiprows=4)
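
To see what `skiprows` does without needing the real file, here is a minimal sketch that reads a small in-memory CSV. The file contents, metadata rows, and numbers below are made up purely for illustration and only mimic the layout described above.

```python
import io

import pandas as pd

# Hypothetical file contents: four metadata rows followed by the actual table.
raw = (
    "Data Source,Illustrative example\n"
    "Last Updated Date,2022-01-01\n"
    "Note,first made-up metadata row\n"
    "Note,second made-up metadata row\n"
    "Country Name,Country Code,1960,1961\n"
    "Exampleland,EXA,100,110\n"
    "Samplestan,SAM,200,220\n"
)

# Skip the four metadata rows so that row five becomes the header.
toy_population = pd.read_csv(io.StringIO(raw), skiprows=4)
print(toy_population)
```

Without `skiprows=4`, Pandas would try to treat the first metadata row as the header, which is not what we want.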

Now the World Bank population dataset is stored in a DataFrame called `population`. Calling the DataFrame by its name will display the first and last five rows of the table by default.

In [None]:
population

We see that the DataFrame appears to have the following columns:

- `Country Name` -- English name of the country
- `Country Code` -- [ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) country code
- `Indicator Name` -- name of the indicator represented by the data
- `Indicator Code` -- World Bank code for the indicator
- `1960` ... `2021` -- population estimates by year

We also see that the DataFrame has 266 rows and 66 columns. We can double-check this by looking at the value of the [`pandas.DataFrame.shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) attribute.

In [None]:
population.shape

The [`pandas.DataFrame.size`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.size.html) attribute will give us the total number of values in the table (number of columns times number of rows).

In [None]:
population.size
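
The relationship between `shape` and `size` can be checked on a tiny table. The DataFrame below is a made-up example, not part of the tutorial data.

```python
import pandas as pd

# A made-up table with 2 rows and 3 columns.
toy = pd.DataFrame({
    "name": ["A", "B"],
    "y2000": [1.0, 2.0],
    "y2001": [3.0, 4.0],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
total_values = toy.size      # size is rows times columns

print(n_rows, n_cols, total_values)
```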

[`pandas.DataFrame.columns`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html) can be used to get a list of all the column names and [`pandas.DataFrame.dtypes`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) will display the datatype of each column.

In [None]:
population.columns

In [None]:
population.dtypes

Note how the first four columns all have the `object` datatype. This could mean that the column contains textual data (strings), has a mix of different datatypes (both textual and numeric, for example), or contains a more complex data structure (like a list or tuple). The population columns are all `float64`, denoting floating-point numbers. It might feel odd to store population values as floating-point numbers, as population counts are always whole numbers. However, Pandas stores a numeric column as floating-point numbers whenever it contains missing values. This is because the default integer columns in Pandas do not support missing data values. The default missing data value in Pandas is `numpy.nan` from NumPy, which is a floating-point value.
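
The effect of a missing value on the datatype can be reproduced with a few made-up numbers. This is only a sketch: the values are invented, and the last line uses the optional nullable `Int64` datatype, which is not used elsewhere in this tutorial.

```python
import numpy as np
import pandas as pd

# Whole numbers without gaps are stored using an integer datatype.
complete = pd.Series([10, 20, 30])

# A single missing value forces the whole column to floating point.
with_gap = pd.Series([10, np.nan, 30])

# The nullable "Int64" datatype is an opt-in alternative that can hold
# integers and missing values at the same time.
nullable = pd.Series([10, None, 30], dtype="Int64")

print(complete.dtype, with_gap.dtype, nullable.dtype)
```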

We know that the `population` DataFrame stores population values, so the `Indicator Name` and `Indicator Code` columns are redundant. We can drop them from the table using the [`pandas.DataFrame.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method.

In [None]:
population.drop(columns=['Indicator Name', 'Indicator Code'], inplace=True)

Note how we specified two arguments when calling the [`pandas.DataFrame.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method. First we specified a list of columns to drop using the `columns` argument. The [`pandas.DataFrame.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method also supports dropping rows, so that is why the `columns` argument is needed. Then we also specified `inplace` to be `True`. This ensures that the original `population` DataFrame gets modified. Otherwise the method would just return a new DataFrame and keep the `population` DataFrame unchanged.
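
The difference between the two modes can be seen on a small made-up table (the column names here are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B"],
    "code": ["x", "y"],
    "value": [1, 2],
})

# Without inplace=True, drop() returns a new DataFrame
# and leaves the original untouched.
trimmed = toy.drop(columns=["code"])
columns_before_inplace = list(toy.columns)

# With inplace=True, the original DataFrame itself is modified
# and the method returns None.
toy.drop(columns=["code"], inplace=True)

print(columns_before_inplace)  # still includes 'code'
print(list(toy.columns))       # 'code' is gone
```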

We can validate that the desired columns have been removed by taking a quick peek at the DataFrame via the [`pandas.DataFrame.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method. It displays the first five rows of the DataFrame by default, but you can also pass the desired number of rows as an argument.

In [None]:
population.head()

Knowing that the World Bank GDP dataset follows the exact same format as the World Bank population dataset, we can read it in and drop the `Indicator Name` and `Indicator Code` columns all in one go by chaining together the [`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function and the [`pandas.DataFrame.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method. If we want to include a line break somewhere in the chain, we need to wrap the whole thing in parentheses `()`.

In [None]:
gdp = (pd.read_csv('data/gdp.csv', skiprows=4)
         .drop(columns=['Indicator Name', 'Indicator Code']))

Note how here we did not specify `inplace=True` when dropping the columns. That is because we want the [`pandas.DataFrame.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method to take the DataFrame generated by [`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and then output a new DataFrame that we can save into the `gdp` variable. We can take a look at our newly created DataFrame by using the [`pandas.DataFrame.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method again.

In [None]:
gdp.head()

---

## Long vs Wide Data

GDP on its own is not a good indicator of the wealth of a country as countries with more people tend to have higher GDP. But if we were to normalize GDP by population, then the resulting GDP per capita values can be compared across countries and used as a proxy for wealth. To do so, we must be able to match up the GDP and population values for each unique combination of country and year.

The GDP and population tables currently are in wide format -- each row represents a unique country and each column represents a unique year, with each cell holding a single population or GDP estimate. While this wide format has many advantages and is commonly used in geospatial applications, it does complicate joining various datasets. One option would be to treat both tables as matrices and calculate GDP per capita by dividing the GDP matrix by the population matrix. However, both tables need to have the exact same layout, with the same countries and years in the exact same order, for the result to be reliable. Ensuring this is not a trivial task, so this method would involve a lot of work.

Alternatively, the two tables could be joined by country. Then we would have an extra-wide table with two sets of year columns -- one set of year columns for population and another set of year columns for GDP. We would then need to create another new column for each year by dividing the corresponding GDP column by the corresponding population column, resulting in yet another set of year columns. As you can see, this approach would quickly lead to a very messy and difficult-to-manage dataset and would also involve a lot of work, making it far from preferred.

The easiest option for calculating GDP per capita would involve converting both datasets into a long format, where each row represents a single unique observation (estimation). Instead of having countries in rows and years in columns, each row would instead represent a unique country and year combination. This would allow us to easily combine datasets on both country and year, ensuring that the GDP and population values for each country-year combination get matched.

We can use the [`pandas.DataFrame.melt()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.melt.html) method to convert wide DataFrames to long format. We need to specify three arguments when using this method:

- `id_vars` – name(s) of the column(s) that define a unique observation in the original wide dataset
- `var_name` – name of the column in the new long dataset that stores the column names of the original wide dataset
- `value_name` – name of the column in the new long dataset that stores the values of the original wide dataset

Each observation in the original wide dataset represents a unique country defined either by the country name or country code. Let us include both of these as `id_vars` to carry both columns over to the long dataset. The columns of the wide dataset represent years, so that is the name we will pass on to the `var_name` argument. The values of the wide dataset represent population estimates, so that will be the name passed on to the `value_name` argument.

The reverse command for [`pandas.DataFrame.melt()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.melt.html) is [`pandas.DataFrame.pivot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html), which can convert a long format table to wide format.

In [None]:
population_long = population.melt(id_vars=['Country Name', 'Country Code'],
                                  var_name='year',
                                  value_name='population')

In [None]:
population_long
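
To make the wide-to-long conversion and its reverse easier to follow, here is a minimal sketch on a made-up two-country table. The column names `country`, `year`, and `population` and all values are illustrative only.

```python
import pandas as pd

# Wide format: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2000": [1.0, 2.0],
    "2001": [3.0, 4.0],
})

# Long format: one row per country-year combination.
long = wide.melt(id_vars=["country"],
                 var_name="year",
                 value_name="population")

# pivot() reverses the operation: years become columns again,
# and the country becomes the row index.
wide_again = long.pivot(index="country",
                        columns="year",
                        values="population")

print(long)
print(wide_again)
```

Two countries and two years give four rows in the long table, one for each combination.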

Now we have a new long population DataFrame called `population_long`, where each row represents a unique country and year combination. Let us use `pandas.DataFrame.dtypes` to confirm the data types of this new table.

In [None]:
population_long.dtypes

Note how the `year` column is of type `object`, meaning that the years are currently stored as strings. As the years were previously column names, this makes sense. However, as years are actually numbers, they should also be stored as such to allow for easy comparisons and mathematical operations.

To convert the year values to integers, we must first extract the `year` column as a [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) object. This can be done by either using square brackets `df["column"]` or via dot-notation `df.column`. The latter requires the column name to consist of only letters, numbers, and underscores (and not start with a number), so it is only useful if the column names are neatly formatted. Using square brackets to extract columns is more robust and as the column name is passed as a string, it can contain spaces and other special characters.

Square brackets can be used to also create a new column or overwrite an existing column. Dot-notation should **only** be used to read columns. Attempting to write columns using dot-notation could have unexpected consequences.
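
One such unexpected consequence can be demonstrated on a small made-up table: assigning to a name that is not yet a column via dot-notation sets a plain Python attribute on the object instead of creating a column. In this sketch the column names are hypothetical, and the warning Pandas emits in this situation is silenced on purpose.

```python
import warnings

import pandas as pd

toy = pd.DataFrame({"gdp": [10.0, 20.0], "population": [2.0, 4.0]})

# Square brackets create a real column.
toy["ratio"] = toy.gdp / toy.population

# Dot-notation with a new name only sets an attribute on the object;
# no column is created. Pandas warns about this, so the warning is
# silenced here to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy.per_capita = toy.gdp / toy.population

columns_after = list(toy.columns)
print(columns_after)  # 'per_capita' is not among the columns
```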

Knowing this, let us extract the `year` column as a Series object using dot-notation `df.column` and then call [`pandas.Series.astype()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html) on the extracted values to convert them to integers. Then we can use square bracket notation `df["column"]` to replace the values of the `year` column with their integer equivalents.

In [None]:
population_long['year'] = population_long.year.astype(int)

In [None]:
population_long.head()

In [None]:
population_long.dtypes

Note how the values of the `year` column seemingly did not change, but the datatype of the values is now an integer type (`int32` or `int64`, depending on the platform and NumPy version), which means that the years are now stored as integers.

Now let us convert the GDP dataset to long format as well. We can chain the [`pandas.DataFrame.melt()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.melt.html) method together with the [`pandas.DataFrame.astype()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) method to convert the DataFrame from wide to long format and change the datatype of the `year` column to integer all in one go. The [`pandas.DataFrame.astype()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) method is very similar to the [`pandas.Series.astype()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html) method, but instead of taking a single datatype as an argument, it takes a dictionary that maps column names to datatypes.

In [None]:
gdp_long = (gdp.melt(id_vars=['Country Name', 'Country Code'],
                    var_name='year',
                    value_name='gdp')
               .astype({'year': int}))

In [None]:
gdp_long

In [None]:
gdp_long.dtypes
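
The two `astype()` variants can be compared side by side on a small made-up table. The column names and values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"year": ["2000", "2001"], "value": [1.5, 2.5]})

# Series.astype() takes a single datatype and converts one column.
year_as_int = toy["year"].astype(int)

# DataFrame.astype() takes a dictionary mapping column names to datatypes;
# columns not listed in the dictionary keep their datatype.
converted = toy.astype({"year": int})

print(converted.dtypes)
```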

---

## Joining Datasets

Finally we are ready to combine the population and GDP datasets. [`pandas.DataFrame.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) can be used to perform a join on one or more columns. The method is called on the left DataFrame and takes the right DataFrame as its first argument (this is only important to know when performing a left or right join). Additional arguments are as follows:

- `on` – A single column name (string) or list of column names to join on. These column names should appear in both tables. If the column names differ between datasets, the separate `left_on` and `right_on` arguments should be used instead.
- `how` – The type of join to perform. Here are the possible values:
    - `"left"` – use only keys from the left DataFrame (include all rows from left DataFrame)
    - `"right"` – use only keys from the right DataFrame (include all rows from right DataFrame)
    - `"outer"` – use the union of keys from both DataFrames (include all rows from both DataFrames)
    - `"inner"` – use the intersection of keys from both DataFrames (include only matching rows)
    - `"cross"` – creates the Cartesian product of both DataFrames (every row of the left DataFrame paired with every row of the right DataFrame)

We would like to join on each unique country and year combination. As spellings of country names might differ between datasets, it is good practice to always use the [ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) country code or some other analogous unique identifier to distinguish between countries. The country code for each country is determined by an international standard and should not differ between datasets, allowing us to reliably join the data. Hence we will specify `on=["Country Code", "year"]` to perform the join on unique country-year combinations and `how="inner"` to only keep country-year combinations that are present in both datasets. Since we do not want the `Country Name` column repeated in the joined dataset, we should remove it from the GDP table using [`pandas.DataFrame.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) before performing the join. Otherwise the `Country Name` column from the GDP dataset will also get joined, resulting in the joined table having two separate columns with country names.

In [None]:
data = population_long.merge(gdp_long.drop(columns='Country Name'),
                             on=['Country Code', 'year'],
                             how='inner')
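
How the `how` argument changes the result is easiest to see on a small example. The sketch below uses two made-up tables with hypothetical column names and values; only the row counts matter.

```python
import pandas as pd

left = pd.DataFrame({
    "code": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
    "population": [10, 11, 20],
})
right = pd.DataFrame({
    "code": ["A", "B", "C"],
    "year": [2000, 2000, 2000],
    "gdp": [100.0, 400.0, 900.0],
})

keys = ["code", "year"]

# Only (A, 2000) and (B, 2000) appear in both tables.
inner = left.merge(right, on=keys, how="inner")

# All rows of the left table are kept; (A, 2001) gets a missing gdp.
left_join = left.merge(right, on=keys, how="left")

# The union of keys also brings in (C, 2000) from the right table.
outer = left.merge(right, on=keys, how="outer")

print(len(inner), len(left_join), len(outer))
```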

In [None]:
data.head()

Now we have a table with a population and GDP value for each country and year combination. We can easily add a new column denoting GDP per capita to this table by dividing the GDP column by the population column.

In [None]:
data['gdp_per_capita'] = data.gdp / data.population

In [None]:
data.head()

Now we would also like to add life expectancy information to this joined dataset. Knowing that all World Bank data tables follow the same format, we can easily convert the workflow from before into a function that reads in a World Bank dataset, drops unneeded columns, converts it to long format, and ensures the year is in numeric format. That function would only need two inputs – the path of the CSV file and the name of the indicator represented by the data. (This name will be used as the column name for the values column in the long format table.) Let us define this function and use it to read in the World Bank life expectancy dataset and convert it to long format.

In [None]:
def read_world_bank_data(file_name, value_name):
    return (pd.read_csv(file_name, skiprows=4)
              .drop(columns=['Indicator Name', 'Indicator Code'])
              .melt(id_vars=['Country Name', 'Country Code'],
                    var_name='year',
                    value_name=value_name)
              .astype({'year': int}))

In [None]:
life_exp = read_world_bank_data(file_name='data/life-expectancy.csv',
                                value_name='life_exp')

In [None]:
life_exp.head()

Using the same workflow from before, we can join the long format life expectancy dataset to our table containing the GDP and population data.

In [None]:
data = data.merge(life_exp.drop(columns='Country Name'),
                  on=['Country Code', 'year'],
                  how='inner')

In [None]:
data.head()

Finally we would also like to know which [United Nations regional geoscheme](https://en.wikipedia.org/wiki/United_Nations_geoscheme) the country belongs to. Information on this is available in the United Nations [M49](https://en.wikipedia.org/wiki/UN_M49) dataset. As this dataset is a standard CSV table, we can use [`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) without any additional arguments to read it into a DataFrame.

In [None]:
m49 = pd.read_csv('data/m49.csv')

In [None]:
m49.head()

In [None]:
m49.columns

Note how this dataset contains a lot of information on the various groups and codes assigned to each country. We are only interested in the name of the region the country belongs to and the [ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) code assigned to the country. Using double square brackets `[[ ]]` we can extract the desired columns as a new DataFrame. (In reality we are just passing a list of column names to the standard single square brackets indexer.)

In [None]:
regions = m49[['Region Name', 'ISO-alpha3 Code']]

In [None]:
regions.head()

Now we can use [`pandas.DataFrame.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) again to join the regions to the rest of our data. Since the names of the columns containing the country code information differ between the datasets, we must use the `left_on` and `right_on` arguments instead of the `on` argument from before.

In [None]:
data = data.merge(regions,
                  left_on='Country Code',
                  right_on='ISO-alpha3 Code',
                  how='inner')

In [None]:
data.head()

Note how the new joined dataset contains both of the country code columns (because their names were different). Also, the naming convention in our table is not uniform – some column names are in [`snake_case`](https://en.wikipedia.org/wiki/Snake_case) (which is preferred) while others contain spaces and a mix of uppercase and lowercase letters. Let us use [`pandas.DataFrame.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to drop the second country code column and [`pandas.DataFrame.rename()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) to rename some of the columns to ensure a uniform column naming convention. Remember that we can use the `inplace=True` argument to apply the changes to the original DataFrame.

In [None]:
data.drop(columns='ISO-alpha3 Code', inplace=True)

[`pandas.DataFrame.rename()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) takes a dictionary in the format `{"old_name": "new_name"}` as an argument and you need to specify whether you would like to rename rows or columns.

In [None]:
data.rename(columns={'Country Name': 'country_name',
                     'Country Code': 'country_code',
                     'Region Name': 'region_name'},
            inplace=True)

In [None]:
data.head()
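
The join, drop, and rename steps from this section can also be written as one chain. The sketch below uses two small made-up tables; the three-letter codes and region names are invented and only the column names mirror the ones used above.

```python
import pandas as pd

values = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "gdp": [1.0, 2.0],
})
lookup = pd.DataFrame({
    "ISO-alpha3 Code": ["AAA", "BBB"],
    "Region Name": ["North", "South"],
})

# Joining on differently named key columns keeps both of them.
joined = values.merge(lookup,
                      left_on="Country Code",
                      right_on="ISO-alpha3 Code",
                      how="inner")

# Without inplace=True, drop() and rename() return new DataFrames,
# so they can be chained.
tidy = (joined.drop(columns="ISO-alpha3 Code")
              .rename(columns={"Country Code": "country_code",
                               "Region Name": "region_name"}))

print(list(joined.columns))
print(list(tidy.columns))
```

Whether to chain or to use `inplace=True` is mostly a matter of taste; chaining avoids modifying a DataFrame that other code might still rely on.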

---

## Boolean Indexing

To extract specific rows from a DataFrame, we can combine the square brackets indexing operator `pandas.DataFrame[]` with a logical operation that produces a boolean array. This would select every row from the DataFrame where the corresponding element in the boolean array equals `True`. For example, to extract all rows that correspond to the United States, we could use `data.country_code == "USA"`. This would return a boolean [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) of `True` and `False` values, where an element is `True` if the corresponding row in the `data` DataFrame has the value `"USA"` in its `country_code` column.

In [None]:
data.country_code == 'USA'

Combining this with the square brackets indexing operator `data[]` will extract all rows from the `data` DataFrame where the `country_code` column has the value `"USA"`.

In [None]:
usa_data = data[data.country_code == 'USA']

In [None]:
usa_data.head()

We can ensure that this new `usa_data` DataFrame only contains values corresponding to the United States by calling [`pandas.Series.unique()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html) on the `country_name` column. This will return an array of all the unique country names present in the table.

In [None]:
usa_data.country_name.unique()

Note that even though there is only one unique value, the result is still an array. To extract the value as a string, we must extract the first element of the array using `[0]`.

In [None]:
usa_data.country_name.unique()[0]
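
The whole boolean indexing workflow can be traced on a three-row made-up table. The column names mirror the ones used above, but the rows are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "country_code": ["USA", "CAN", "USA"],
    "country_name": ["United States", "Canada", "United States"],
    "year": [2019, 2019, 2020],
})

# The comparison produces one True/False value per row.
mask = toy.country_code == "USA"

# Indexing with the mask keeps only the rows where the mask is True.
usa_rows = toy[mask]

# unique() returns an array, so [0] is needed to get a plain string.
name = usa_rows.country_name.unique()[0]

print(mask.tolist())
print(name)
```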

---

## Creating Static Line Graphs

[Matplotlib](https://matplotlib.org/) is the primary plotting library in Python and it is designed to resemble the plotting functionalities of MATLAB. While it provides all kinds of different plotting functionality, the [`matplotlib.pyplot`](https://matplotlib.org/stable/api/pyplot_summary.html) module is used the most. It is common to import this module under the alias `plt` as we did before. Matplotlib works in a layered fashion. First you define your plot using [`matplotlib.pyplot.plot(x, y, ...)`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html), then you can use additional [`matplotlib.pyplot`](https://matplotlib.org/stable/api/pyplot_summary.html) functions to add more layers to your plot or modify its appearance. Finally, you use [`matplotlib.pyplot.show()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html) to display the plot or [`matplotlib.pyplot.savefig()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html) to save it to an external file.

The `x` and `y` arguments in the [`matplotlib.pyplot.plot()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) call can be either arrays or [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) objects. For example, we can visualize the population of the United States over time by extracting the `year` and `population` columns of the `usa_data` table as [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) objects and passing them along to [`matplotlib.pyplot.plot()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) as follows.

In [None]:
plt.plot(usa_data.year, usa_data.population)
plt.show()

Alternatively we could pass the [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to the [`matplotlib.pyplot.plot()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) command using the optional `data` argument. This will allow us to specify the desired column names as the `x` and `y` arguments instead of having to extract them as [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) objects. For example, we can visualize the GDP of the United States over time as follows.

In [None]:
plt.plot('year', 'gdp', data=usa_data)
plt.show()

[Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) also has built-in plotting functionality via the [`pandas.DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) method. It takes the column names of the `x` and `y` columns as arguments and uses a plotting backend to generate the plot. By default, the plotting backend is Matplotlib, but this could be reconfigured to be something else instead. For example, we can create a Matplotlib visualization showing United States life expectancy over time as follows.

In [None]:
usa_data.plot(x='year', y='life_exp')
plt.show()

To create a line graph with multiple lines, we need to stack the lines using multiple [`matplotlib.pyplot.plot()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) or [`pandas.DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) calls. But how can we specify that we would like to stack the lines onto a single plot instead of creating a new plot for each line? This is where the [`matplotlib.axes.Axes`](https://matplotlib.org/stable/api/axes_api.html#the-axes-class) class comes into play. For simplicity, you can think of each [`matplotlib.axes.Axes`](https://matplotlib.org/stable/api/axes_api.html#the-axes-class) object as a canvas onto which one can add multiple layers of visualization. When using [`pandas.DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) to create visualizations, we can utilize [`matplotlib.axes.Axes`](https://matplotlib.org/stable/api/axes_api.html#the-axes-class) to create multi-layered plots as follows:
1. The first [`pandas.DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) command will return a [`matplotlib.axes.Axes`](https://matplotlib.org/stable/api/axes_api.html#the-axes-class) object. This object should be saved into a variable. It is common to save it into a variable called `ax`.
2. In each subsequent [`pandas.DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) call, the [`matplotlib.axes.Axes`](https://matplotlib.org/stable/api/axes_api.html#the-axes-class) object from before should be passed on using the `ax` argument. This will ensure the new plot gets added to the same canvas.
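
The two steps above can be sketched on a small made-up table as follows. The column names and values are illustrative, and the non-interactive `Agg` backend is selected only so that the snippet also runs outside a notebook; in a notebook you would omit that line and finish with `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    "code": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "value": [1.0, 2.0, 3.0, 2.5],
})

# Step 1: the first plot() call returns the Axes object (the canvas).
ax = toy[toy.code == "A"].plot(x="year", y="value",
                               color="blue", label="A")

# Step 2: passing ax= draws the next line onto the same canvas.
toy[toy.code == "B"].plot(x="year", y="value",
                          color="red", label="B", ax=ax)

print(len(ax.get_lines()))  # number of lines drawn on the shared canvas
```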

We can combine the [`pandas.DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) call with boolean indexing to easily visualize subsets of the data and use the `color` and `label` arguments to specify a color and legend label for each subset.

Once all the lines have been added to the plot, we can use [`matplotlib.pyplot.ylabel()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylabel.html) and [`matplotlib.pyplot.xlabel()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xlabel.html) to label the axes and [`matplotlib.pyplot.title()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.title.html) to specify a title for the visualization. Finally we call [`matplotlib.pyplot.show()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html) to display the plot.

Knowing all this, a visualization illustrating the GDP per capita over time for North American countries can be generated as follows.

In [None]:
ax = data[data.country_code == 'USA'].plot(x='year',
                                           y='gdp_per_capita',
                                           color='blue',
                                           label='USA')

data[data.country_code == 'CAN'].plot(x='year',
                                      y='gdp_per_capita',
                                      color='red',
                                      label='Canada',
                                      ax=ax)

data[data.country_code == 'MEX'].plot(x='year',
                                      y='gdp_per_capita',
                                      color='green',
                                      label='Mexico',
                                      ax=ax)

plt.ylabel('GDP per capita')
plt.xlabel('Year')
plt.title('GDP per Capita Over Time for North American Countries')
plt.show()

There are many benefits to using [`pandas.DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) over [`matplotlib.pyplot.plot()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) when dealing with DataFrames. Most importantly, [`pandas.DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) interacts directly with a Pandas DataFrame and has a much simpler user interface with numerous named arguments allowing for easy customization. However, when it comes to more advanced tasks, Matplotlib allows for better fine-tuning and more flexibility, although this comes at the cost of more complex commands. For example, to create a plot that displays the temporal variation of both the life expectancy and GDP per capita of the United States using two different Y axes, we must use a relatively advanced workflow.

First, we define the size of our plot using the `figsize` argument of [`matplotlib.pyplot.subplots()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html). This command allows for the creation of multiple subplots, but is also frequently used to specify the size of a single plot. It returns a tuple consisting of a [`matplotlib.figure.Figure`](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure) and a [`matplotlib.axes.Axes`](https://matplotlib.org/stable/api/axes_api.html#the-axes-class) object.

To add a plot layer to a specific [`matplotlib.axes.Axes`](https://matplotlib.org/stable/api/axes_api.html#the-axes-class) object, we can use [`matplotlib.axes.Axes.plot()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.plot.html) which works very similarly to the previously discussed [`matplotlib.pyplot.plot()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) command. Both commands take an optional format string as the third positional argument that allows you to specify the line and marker style and color using a simple shorthand. For example, `"g--"` means a green dashed line and `"mx"` indicates magenta-colored X-shaped markers. Refer to the function documentation for a full overview of all the shorthand characters. The Matplotlib commands for adding axes labels and plot titles also have additional arguments that modify the appearance of the label or title. For example, `color` usually specifies the text color and `size` is used to specify the size of the font.

To add another Y axis to the plot, we can use [`matplotlib.axes.Axes.twinx()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.twinx.html) to create another [`matplotlib.axes.Axes`](https://matplotlib.org/stable/api/axes_api.html#the-axes-class) object that defines a new Y axis but shares the same X axis.

Finally, we can use [`matplotlib.figure.Figure.legend()`](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure.legend) to add a legend to the whole figure (including all the axes objects).

In [None]:
fig, ax = plt.subplots(figsize=(7, 5))
ax.plot(usa_data.year, usa_data.gdp_per_capita, 'g--', label='GDP per Capita')
plt.ylabel('GDP per Capita', color='g')
plt.xlabel('Year')
ax2 = ax.twinx()
ax2.plot(usa_data.year, usa_data.life_exp, 'mx', label='Life Expectancy')
plt.ylabel('Life Expectancy', color='m')
plt.title('United States', size=20)
fig.legend()
plt.show()
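
The cell above mixes `plt.*` calls (which act on whichever axes is currently active) with explicit `Axes` methods. As a minimal sketch of the same two-axis workflow written only with explicit `Axes` methods, consider the following. The numbers are made up purely for illustration and the names `fig_demo`, `ax_left`, and `ax_right` are hypothetical, not part of the tutorial's data.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only (not the tutorial's dataset)
years = [2000, 2005, 2010, 2015, 2020]
gdp = [36000, 44000, 48000, 57000, 63000]
life = [76.6, 77.5, 78.5, 78.7, 77.0]

fig_demo, ax_left = plt.subplots(figsize=(7, 5))

# Left Y axis: green dashed line
ax_left.plot(years, gdp, 'g--', label='GDP per Capita')
ax_left.set_xlabel('Year')
ax_left.set_ylabel('GDP per Capita', color='g')

# Right Y axis: shares the X axis with ax_left
ax_right = ax_left.twinx()
ax_right.plot(years, life, 'mx', label='Life Expectancy')
ax_right.set_ylabel('Life Expectancy', color='m')

ax_left.set_title('Toy Example', size=20)

# One legend for the whole figure, covering both axes
fig_demo.legend()
plt.show()
```

Calling `set_ylabel()` on each axes object makes it explicit which Y axis a label belongs to, which can be easier to follow once a figure contains more than one axes.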

---

## Visualizing Distributions and Correlations

Let us return to our original goal of exploring the relationship between health and wealth. We will use GDP per capita as a proxy for wealth and life expectancy as an indicator of health. We can simplify the analysis by looking only at one point in time and focus our analysis on 2020, which is the latest year we have both GDP per capita and life expectancy data available. We shall use boolean indexing to extract 2020 data into a new DataFrame called `data2020`.

In [None]:
data2020 = data[data.year == 2020]

In [None]:
data2020.head()

How is wealth distributed amongst the global population? Let us get a vague idea by visualizing the distribution of GDP per capita amongst world countries in 2020. We can easily create a histogram by using the [`matplotlib.pyplot.hist()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) command and passing it the GDP per capita [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html).

In [None]:
plt.hist(data2020.gdp_per_capita)
plt.xlabel('GDP per Capita')
plt.show()

Note how we were able to easily create a histogram, but the result was quite ugly. If we wanted a prettier plot, we could go through the trouble of customizing the plot using various additional arguments and commands, which would take quite a while. Or we could use [Seaborn](https://seaborn.pydata.org/) which allows us to easily create beautiful visualizations with sensible defaults. For example, we could create a well-designed histogram with a smoothed kernel density estimate (KDE) overlay using the [`seaborn.histplot()`](https://seaborn.pydata.org/generated/seaborn.histplot.html) function along with the `kde=True` flag. Knowing this, let us look at the distribution of life expectancy amongst world countries in 2020.

In [None]:
sns.histplot(data2020.life_exp, kde=True)
plt.show()

To easily create a scatter plot analyzing the relationship between GDP per capita and life expectancy, we can use [`pandas.DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) and specify `kind="scatter"` to ensure the result is a scatter plot.

In [None]:
data2020.plot(x='gdp_per_capita', y='life_exp', kind='scatter')
plt.show()

The relationship appears to be logarithmic. This is likely due to the distribution of GDP per capita being heavily skewed. We can easily confirm this by plotting a two-dimensional kernel density estimate (KDE) plot using [`seaborn.jointplot()`](https://seaborn.pydata.org/generated/seaborn.jointplot.html) along with `kind="kde"`. (To get a scatter plot with histograms, one would use `kind="scatter"`.)

In [None]:
sns.jointplot(data=data2020,
              x='gdp_per_capita',
              y='life_exp',
              kind='kde',
              fill=True)
plt.show()

To get a better sense of the potentially logarithmic relationship between GDP per capita and life expectancy, we should apply a logarithmic transformation to the axis corresponding to GDP per capita. In our example this is the X axis and we can apply a logarithmic transformation on the X axis by passing `"log"` to the [`matplotlib.pyplot.xscale()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xscale.html) function.

In [None]:
plt.scatter(data2020.gdp_per_capita, data2020.life_exp)
plt.xscale('log')
plt.xlabel('GDP per Capita')
plt.ylabel('Life Expectancy')
plt.show()

Note how we used [`matplotlib.pyplot.scatter()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) instead of [`pandas.DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) to create the scatter plot. Both functions are very similar and in reality, the latter simply calls the former. Also note how now the X axis of the scatter plot is logarithmic. This makes the relationship much clearer and we can quite definitely state that there appears to be a logarithmic relationship between life expectancy and GDP per capita.
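
To see why a logarithmic X axis straightens out this kind of relationship, here is a small sketch on synthetic data. The values and the coefficients `40` and `8` are invented for illustration and are not estimates from the tutorial's dataset. If $y = a + b \log_{10}(x)$, then $y$ is exactly linear in $\log_{10}(x)$, so its correlation with the log-transformed values should be essentially perfect, while its correlation with the raw values is weaker.

```python
import numpy as np

# Synthetic data: y depends linearly on log10(x) (coefficients are arbitrary)
gdp_demo = np.array([500, 1000, 5000, 10000, 50000, 100000], dtype=float)
life_demo = 40 + 8 * np.log10(gdp_demo)

# Pearson correlation with the raw and the log-transformed X values
corr_raw = np.corrcoef(gdp_demo, life_demo)[0, 1]
corr_log = np.corrcoef(np.log10(gdp_demo), life_demo)[0, 1]

print(f'correlation with raw x:   {corr_raw:.3f}')
print(f'correlation with log10 x: {corr_log:.3f}')
```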

But does the size of a country play a role in this relationship? To find out, we can scale the size of the data points proportionally to the population such that bigger points indicate countries with larger populations. This can be done using the `s` argument in [`matplotlib.pyplot.scatter()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html), which takes an array of point sizes. This array needs to be the same size as the `x` and `y` arrays with one size value for each `x` and `y` combination. We can easily generate an array like this using the formula $X \div max(X) \times s$ where $X$ is the array we want to base the sizes on and $s$ is a scaling factor in arbitrary plot units. Note that the scaling factor is completely arbitrary and you might need to try different values until you find something that makes the visualization look good. We divide the input array by its maximum value to normalize it to the range from zero to one before applying the scaling factor.
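
As a quick sketch of the size formula, using made-up population values rather than the tutorial's data (the names `population_demo` and `sizes_demo` are hypothetical):

```python
import pandas as pd

# Made-up population values for illustration only
population_demo = pd.Series([1_000_000, 50_000_000, 200_000_000, 1_400_000_000])
scaling_factor = 5000  # arbitrary; tune by eye until the plot looks good

# Normalize to the 0-1 range, then scale up to plot units
sizes_demo = population_demo / population_demo.max() * scaling_factor
print(sizes_demo)
```

The largest value always maps to the scaling factor itself and every other value maps to a proportionally smaller size.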

Scaling the point sizes by population might cause some bigger points to overlap smaller ones. To ensure we can properly see overlapping points, we can use the `alpha` argument in the [`matplotlib.pyplot.scatter()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) call to specify a transparency factor.

In [None]:
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(data2020.gdp_per_capita,
           data2020.life_exp,
           s=data2020.population/data2020.population.max()*5000,
           alpha=0.5)
plt.xscale('log')
plt.xlabel('GDP per Capita')
plt.ylabel('Life Expectancy')
plt.show()

It looks like the size of a country is not related to GDP per capita or life expectancy. But what about the region a country is in? There is a good chance a correlation exists between the geographical location of a country and other indicators. To find out, we should color the points based on their geographic region. We know from before that this requires adding multiple layers to the plot – one for each region. We can get a list of all the regions by using [`pandas.Series.unique()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html) on the `region_name` column. Then we can iterate over that list using a loop, subset the data for each region, and create a scatter plot layer using the subset data.

In [None]:
fig, ax = plt.subplots(figsize=(7, 7))

for region in data2020.region_name.unique():
    
    subset = data2020[data2020.region_name == region]
    
    ax.scatter(subset.gdp_per_capita,
               subset.life_exp,
               s=subset.population/data2020.population.max()*5000,
               label=region,
               alpha=0.5)
    
plt.xscale('log')
plt.xlabel('GDP per Capita')
plt.ylabel('Life Expectancy')
plt.title('2020')
plt.show()

One of the main drawbacks of Matplotlib is the fact that one needs to create multiple layers to visualize groups using different colors. This can be a tedious process and usually involves having to subset the data using a loop. To circumvent this, many choose to use Seaborn instead, which allows for a grouping variable to be passed via the `hue` argument. For example, to recreate the plot from above without having to use a loop, we can utilize [`seaborn.scatterplot()`](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) with `hue="region_name"`. To scale the point sizes by population, we can specify `size="population"` then use the `sizes` argument to give a tuple that defines the smallest point size and the largest point size in arbitrary plot units. As before, you might need to play around with the tuple values in `sizes` until you find a combination that looks good.

In [None]:
fig, ax = plt.subplots(figsize=(7, 7))
sns.scatterplot(data=data2020,
                x='gdp_per_capita',
                y='life_exp',
                hue='region_name',
                size='population',
                sizes=(10, 5000),
                alpha=0.5,
                legend=False,
                ax=ax)
plt.xscale('log')
plt.xlabel('GDP per Capita')
plt.ylabel('Life Expectancy')
plt.title('2020')
plt.show()

---

## Creating Interactive Visualizations

While the static scatter plot above is quite pretty to look at, it is not the most informative. We have no idea which points represent which countries and many countries appear clustered together, which makes it harder to tell them apart. An interactive visualization would allow for better exploration and investigation of the data. The easiest way of creating an interactive visualization out of a Pandas DataFrame is to use [HVPlot](https://hvplot.holoviz.org/), which is built on top of [Bokeh](https://bokeh.org/) and [HoloViews](https://holoviews.org/) and utilizes them in the background. Importing the `hvplot.pandas` module as we did before adds a new [`pandas.DataFrame.hvplot`](https://hvplot.holoviz.org/user_guide/Plotting.html#the-plot-interface) interface that allows for the creation of interactive plots using a syntax very similar to that of [`pandas.DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html).

We can easily create an interactive version of the scatter plot from before by using [`pandas.DataFrame.hvplot.scatter()`](https://hvplot.holoviz.org/reference/pandas/scatter.html) with the following arguments:

- `x` and `y` – the column names for the data plotted on the X and Y axes respectively
- `c` – the column name that defines the groups or values by which to color the points
- `s` – the column name that defines values to use as point sizes
- `scale` – scaling factor to use when deriving point sizes from values specified by `s` (we will use $1 \div max(X) \times y$, where $X$ is the column specified in `s` and $y$ is an arbitrary scaling factor)
- `hover_cols` – fields to include in the tooltips in addition to those specified in `x`, `y`, `c`, and `s`
- `alpha` – transparency factor
- `logx` – whether to apply a logarithmic transformation on the X axis
- `width` and `height` – the size of the visualization in pixels

Take some time to explore the interactive visualization using the available controls. Experiment with panning and zooming and hover over various points to explore the tooltips.

In [None]:
data2020.hvplot.scatter(x='gdp_per_capita',
                        y='life_exp',
                        c='region_name',
                        s='population',
                        scale=1/data2020.population.max()*2000000,
                        hover_cols=['country_name', 'country_code'],
                        alpha=0.5,
                        logx=True,
                        width=650,
                        height=500)

An alternative to HVPlot is [Plotly](https://plotly.com/python/), which is a popular interactive visualization library used in many programming languages. It consists of a complex ecosystem of various modules, but the [`plotly.express`](https://plotly.com/python/plotly-express/) module is the most popular and easiest to use. The syntax of [`plotly.express`](https://plotly.com/python/plotly-express/) is very similar to that of HVPlot. The biggest difference between the two libraries is that [`plotly.express`](https://plotly.com/python/plotly-express/) does not handle missing data and expects the input DataFrame to not contain any missing values. Hence we must drop all rows with missing values from the table using [`pandas.DataFrame.dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) before passing it onto Plotly.
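
As a small sketch of what [`pandas.DataFrame.dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) does, consider the toy table below. The country codes and values are made up for illustration. By default, any row containing at least one missing value is dropped; the `subset` argument restricts the check to specific columns.

```python
import numpy as np
import pandas as pd

# Toy table with made-up values; NaN marks a missing measurement
toy = pd.DataFrame({
    'country_code': ['AAA', 'BBB', 'CCC'],
    'gdp_per_capita': [1200.0, np.nan, 45000.0],
    'life_exp': [61.5, 70.2, np.nan],
})

# Drop every row that has a missing value in any column
complete_rows = toy.dropna()

# Drop only the rows where gdp_per_capita is missing
gdp_known = toy.dropna(subset=['gdp_per_capita'])

print(len(toy), 'rows before,', len(complete_rows), 'rows after dropna()')
print(gdp_known.country_code.tolist())
```

Note that `dropna()` returns a new DataFrame and leaves the original untouched.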

We can create an interactive scatter plot via Plotly using the [`plotly.express.scatter()`](https://plotly.com/python-api-reference/generated/plotly.express.scatter) function along with the following arguments (note the similarities with HVPlot):

- `data_frame` – the [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to use for the visualization with rows containing missing values removed
- `x` and `y` – the column names for the data plotted on the X and Y axes respectively
- `color` – the column name that defines the groups or values by which to color the points
- `size` – the column name that defines values to use as point sizes
- `size_max` – the size of the largest point in pixels (used to scale all point sizes)
- `hover_name` – the column name that defines the values to be used as tooltip titles
- `hover_data` – fields to include in the tooltip in addition to those specified in `x`, `y`, `color`, `size`, and `hover_name`
- `opacity` – transparency factor
- `log_x` – whether to apply a logarithmic transformation on the X axis
- `width` and `height` – the size of the visualization in pixels

As before, make sure to explore the interactive visualization using the available controls. Note how the tooltips and controls differ from those provided by HVPlot.

In [None]:
px.scatter(data_frame=data2020.dropna(),
           x='gdp_per_capita',
           y='life_exp',
           color='region_name',
           size='population',
           size_max=40,
           hover_name='country_name',
           hover_data=['country_code'],
           opacity=0.5,
           log_x=True,
           width=650,
           height=600)

---

## Working with Time Series

Thus far we have covered the basics of working with data in Python, including reading CSV files, manipulating and reshaping data, joining tables, and creating both static and interactive visualizations. This covers the majority of the most essential data analysis workflows you might need. However, there are two major topics we have yet to discuss -- working with time series and aggregating data by group. We will explore these concepts using rapid transit ridership data from the Massachusetts Bay Transportation Authority (MBTA). Once we have covered these two final topics, you should have all the skills you need to begin your Python data analysis journey.

The dataset we will use is a CSV file named [`mbta-gated-entries-2020.csv`](./data/mbta-gated-entries-2020.csv) located in the `data` directory. Each row in the table represents a unique 30-minute service time period for a specific MBTA rapid transit station and line combination in 2020. The columns are as follows:

- `service_date` -- date in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) `YYYY-MM-DD` format
- `time_period` -- timestamp denoting the start of the 30-minute time period in a somewhat unusual `(HH:mm:ss)` format
- `stop_id` -- unique identifier for the rapid transit stop
- `station_name` -- name of the rapid transit stop
- `route_or_line` -- route or line served by the stop
- `gated_entries` -- number of gated entries at the specified stop for the specified line or route in the specified time period

Let us read this dataset into a [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) called `mbta` using [`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and explore it via [`pandas.DataFrame.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) and [`pandas.DataFrame.dtypes`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html).

In [None]:
mbta = pd.read_csv('data/mbta-gated-entries-2020.csv')

In [None]:
mbta.head()

In [None]:
mbta.dtypes

Note how both the `service_date` and `time_period` have the datatype of `object`, indicating that they are stored as text. This does not allow us to treat these values as proper timestamps, limiting our options for quantitative analysis. To fix this, we should combine the `service_date` and `time_period` into a single timestamp using [`pandas.to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html). But first we must clean the `time_period` values, which are all in parentheses for some reason.

To strip the `time_period` values of the parentheses, we can utilize the [`pandas.Series.str`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html) interface that allows us to apply string methods on the whole column. This allows us to apply a vectorized version of the built-in [`str.strip()`](https://docs.python.org/3/library/stdtypes.html#str.strip) method to the whole column via [`pandas.Series.str.strip()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html).

Hence we can do the following all in one command:
1. Extract the `time_period` column as a [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) object.
2. Utilize [`pandas.Series.str.strip()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html) to remove the parentheses from the values, creating a new [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) object.
3. Replace the `time_period` column with this new [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) object where the parentheses have been removed.

In [None]:
mbta['time_period'] = mbta.time_period.str.strip('()')

In [None]:
mbta.head()
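
To see the vectorized strip in isolation, here is a minimal sketch on a hand-written series (the values are made up to mimic the `(HH:mm:ss)` format). Note that the argument `'()'` is treated as a set of characters to remove from both ends of each string, not as a literal substring.

```python
import pandas as pd

# Hand-written values mimicking the parenthesized time format
times_demo = pd.Series(['(05:00:00)', '(05:30:00)', '(06:00:00)'])

# Remove any leading or trailing '(' and ')' characters from every value
cleaned_demo = times_demo.str.strip('()')
print(cleaned_demo.tolist())  # ['05:00:00', '05:30:00', '06:00:00']
```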

We can concatenate [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) objects containing textual data the same way we can concatenate strings in Python. Knowing this, we can easily combine the `service_date` and `time_period` columns into a single timestamp.

In [None]:
mbta['timestamp'] = mbta.service_date + ' ' + mbta.time_period

In [None]:
mbta.head()

In [None]:
mbta.dtypes

Although the text in the new `timestamp` column sure looks like a valid timestamp, it is still just textual data and has no meaning to Python or Pandas. To convert these textual timestamps into Pandas-aware timestamps, we can use [`pandas.to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) and pass the `timestamp` series as input. This will produce a new series that we will use to replace the `timestamp` column.

In [None]:
mbta['timestamp'] = pd.to_datetime(mbta.timestamp)

In [None]:
mbta.head()

In [None]:
mbta.dtypes

Note how `timestamp` now has a datatype of `datetime64` (with nanosecond precision). This allows us to perform arithmetic and comparisons on the timestamps and also utilize various additional date-time methods (like extracting the month or weekday for example) via the [`pandas.Series.dt`](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html) interface.
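
The following sketch repeats the conversion on a tiny hand-written table and then tries out a few of the date-time features mentioned above. The table `toy_times` and its values are made up for illustration.

```python
import pandas as pd

# Tiny hand-written table with textual dates and times
toy_times = pd.DataFrame({
    'service_date': ['2020-02-24', '2020-02-24', '2020-03-01'],
    'time_period': ['05:00:00', '05:30:00', '23:30:00'],
})

# Concatenate the text columns, then parse the result into datetime64 values
toy_times['timestamp'] = pd.to_datetime(
    toy_times.service_date + ' ' + toy_times.time_period)

# The .dt interface exposes date-time components for the whole column
months = toy_times.timestamp.dt.month.tolist()
weekdays = toy_times.timestamp.dt.day_name().tolist()

# Arithmetic on timestamps yields a Timedelta
time_span = toy_times.timestamp.max() - toy_times.timestamp.min()

print(months)     # [2, 2, 3]
print(weekdays)   # ['Monday', 'Monday', 'Sunday']
print(time_span)
```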

Let us simplify our further analysis by dropping the redundant `service_date` and `time_period` columns using the [`pandas.DataFrame.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method.

In [None]:
mbta.drop(columns=['service_date', 'time_period'], inplace=True)

In [None]:
mbta.head()

---

## Simple Aggregations

We can easily get the total number of gated entries across the whole MBTA system in 2020 by using [`pandas.Series.sum()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.sum.html).

In [None]:
mbta.gated_entries.sum()

Combining [`pandas.Series.sum()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.sum.html) with boolean indexing allows us to extract the total number of gated entries for specific stations, lines, or even dates.

In [None]:
mbta.gated_entries[mbta.station_name == 'Davis'].sum()

In [None]:
mbta.gated_entries[mbta.route_or_line == 'Red Line'].sum()

In [None]:
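# A date-only string is parsed as midnight, so reset the time of each
# timestamp to 00:00:00 before comparing to match the whole day
mbta.gated_entries[mbta.timestamp.dt.normalize() == '2020-02-24'].sum()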

Having the timestamps in `datetime64` format allows us to extract specific time periods using comparisons. For example, we can get the total number of gated entries across the whole MBTA system in February 2020 as follows.

In [None]:
mbta.gated_entries[
    (mbta.timestamp >= '2020-02-01') & (mbta.timestamp < '2020-03-01')].sum()

Alternatively, we could take advantage of [`pandas.Series.dt.month`](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.month.html) to extract the month numbers of the `datetime64` values and use that to get the same information.

In [None]:
mbta.gated_entries[mbta.timestamp.dt.month == 2].sum()
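
One subtlety is worth keeping in mind when comparing `datetime64` values against a date-only string such as `'2020-02-24'`: the string is interpreted as midnight of that day, so an equality test matches only rows stamped exactly `00:00:00`. The sketch below contrasts three filters on a toy table with made-up numbers (the name `toy_entries` is hypothetical).

```python
import pandas as pd

# Toy table with made-up entry counts
toy_entries = pd.DataFrame({
    'timestamp': pd.to_datetime(['2020-02-24 00:00:00',
                                 '2020-02-24 08:00:00',
                                 '2020-02-24 17:30:00',
                                 '2020-02-25 08:00:00']),
    'gated_entries': [5, 120, 200, 90],
})

# Equality with a date-only string matches the midnight row only
only_midnight = toy_entries.gated_entries[
    toy_entries.timestamp == '2020-02-24'].sum()

# Normalizing resets every time to 00:00:00, so the whole day matches
whole_day = toy_entries.gated_entries[
    toy_entries.timestamp.dt.normalize() == '2020-02-24'].sum()

# A half-open range comparison selects the same rows
half_open = toy_entries.gated_entries[
    (toy_entries.timestamp >= '2020-02-24')
    & (toy_entries.timestamp < '2020-02-25')].sum()

print(only_midnight, whole_day, half_open)  # 5 325 325
```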

---

## Aggregating by Group

Let us say we would like to get the number of gated entries across the whole MBTA system for each day in 2020. Pandas provides easy functionality to calculate various aggregate values by group, as long as there is a categorical column that defines the groups. Currently we only have a datetime column, which is not categorical and hence not suitable for aggregating entries by date. However, the `service_date` column we removed would have been perfect for this task. Luckily we can easily recreate this column using [`pandas.Series.dt.date`](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.date.html) to extract the date from the `datetime64` timestamp.

In [None]:
mbta['date'] = mbta.timestamp.dt.date

In [None]:
mbta.head()

Now we can use [`pandas.DataFrame.groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) to convert the [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) into a [`pandas.core.groupby.DataFrameGroupBy`](https://pandas.pydata.org/docs/reference/groupby.html) object, where all the values of the DataFrame are grouped by the specified categorical variable and any methods will apply by group. Note that this is no longer a DataFrame, so we cannot display it as such.

In [None]:
mbta.groupby('date')

We can extract the desired column from this [`pandas.core.groupby.DataFrameGroupBy`](https://pandas.pydata.org/docs/reference/groupby.html) object as a [`pandas.core.groupby.SeriesGroupBy`](https://pandas.pydata.org/docs/reference/groupby.html) object, where any methods called on the series will apply by the previously defined groups.

In [None]:
mbta.groupby('date').gated_entries

When we call [`pandas.core.groupby.GroupBy.sum()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.sum.html) on this [`pandas.core.groupby.SeriesGroupBy`](https://pandas.pydata.org/docs/reference/groupby.html) object, we will get a new [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) object where all the gated entries for each unique date have been added together.

In [None]:
mbta.groupby('date').gated_entries.sum()

We can convert this [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) into a [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) using [`pandas.Series.to_frame()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html). We can also specify a new name for the column containing the aggregated values if desired.

In [None]:
mbta.groupby('date').gated_entries.sum().to_frame('total_entries').head()

Note how the groups make up the index of the new DataFrame. We can use [`pandas.DataFrame.reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) to convert the dates back into a column and reset the index to a numerical one ranging from zero to one less than the number of rows. We can chain all the methods from before together and create a new DataFrame called `mbta_daily_sum` that contains the total number of gated entries across the MBTA system for each date in 2020.

In [None]:
mbta_daily_sum = (mbta.groupby('date')
                      .gated_entries.sum()
                      .to_frame('total_entries')
                      .reset_index())

In [None]:
mbta_daily_sum.head()
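
To make each step of this chain easier to verify by hand, here is the same pattern applied to a four-row toy table. The table `toy_rides` and its values are made up for illustration.

```python
import pandas as pd

# Four made-up rows: two stations on two dates
toy_rides = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'station_name': ['A', 'B', 'A', 'B'],
    'gated_entries': [10, 20, 30, 40],
})

toy_daily_sum = (toy_rides.groupby('date')            # group rows by date
                          .gated_entries.sum()        # add up entries per group
                          .to_frame('total_entries')  # Series -> DataFrame
                          .reset_index())             # dates back into a column

print(toy_daily_sum)
```

The result has one row per date, with totals of 30 (10 + 20) and 70 (30 + 40).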

If we wanted to find out which date had the most ridership, we could use [`pandas.Series.max()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.max.html) to get the maximum number of gated entries and then utilize boolean indexing to find out which date it corresponds to.

In [None]:
mbta_daily_sum.total_entries.max()

In [None]:
mbta_daily_sum.date[
    mbta_daily_sum.total_entries == mbta_daily_sum.total_entries.max()]

In [None]:
mbta_daily_sum.date[
    mbta_daily_sum.total_entries == mbta_daily_sum.total_entries.max()
].values[0]

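Alternatively, we could use [`pandas.Series.argmax()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.argmax.html) to extract the integer position of the row with the most ridership and then utilize [`pandas.DataFrame.loc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) to extract said row. This works because `mbta_daily_sum` has a default integer index after the reset, so row positions and index labels coincide; with any other index, [`pandas.Series.idxmax()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.idxmax.html) returns the label that `loc[]` expects.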

In [None]:
mbta_daily_sum.total_entries.argmax()

In [None]:
mbta_daily_sum.loc[mbta_daily_sum.total_entries.argmax()]

In [None]:
mbta_daily_sum.loc[mbta_daily_sum.total_entries.argmax(), 'date']
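
The distinction between a row's integer position and its index label only becomes visible when the index is not the default range of integers. A minimal sketch with a made-up series and a text index (the name `toy_totals` is hypothetical):

```python
import pandas as pd

# Made-up totals with a non-numeric index
toy_totals = pd.Series([30, 70, 50], index=['mon', 'tue', 'wed'])

position = toy_totals.argmax()  # integer position of the maximum
label = toy_totals.idxmax()     # index label of the maximum

print(position, label)  # 1 tue

# iloc[] looks up by position, loc[] looks up by label
print(toy_totals.iloc[position], toy_totals.loc[label])  # 70 70
```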

Finally, we could utilize [`pandas.DataFrame.sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) to sort the values by total number of gated entries.

In [None]:
mbta_daily_sum.sort_values('total_entries')

Knowing all of this, we can easily take a quick look at the most and least used rapid transit stations and lines across the MBTA system in 2020 by combining the following:
- [`pandas.DataFrame.groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)
- [`pandas.core.groupby.GroupBy.sum()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.sum.html)
- [`pandas.Series.sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_values.html)
- [`pandas.Series.to_frame()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html)
- [`pandas.DataFrame.reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html)

In [None]:
(mbta.groupby('station_name')
     .gated_entries.sum()
     .sort_values(ascending=False)
     .to_frame('total_entries')
     .reset_index())

In [None]:
(mbta.groupby('route_or_line')
     .gated_entries.sum()
     .sort_values(ascending=False)
     .to_frame('total_entries')
     .reset_index())

We can also group by multiple columns. For example, we can group by `date`, `station_name`, and `route_or_line` to create a new DataFrame `mbta_daily`, where the gated entries for each station and line combination are shown in 24-hour intervals instead of 30-minute intervals.

In [None]:
mbta_daily = (mbta.groupby(['date', 'station_name', 'route_or_line'])
                   .gated_entries.sum()
                   .to_frame()
                   .reset_index())

In [None]:
mbta_daily.head()

This new DataFrame will allow us to perform further analysis that does not require 30-minute temporal resolution and where a daily resolution is suitable. For example, we could look at the daily number of gated entries at the Harvard Square MBTA station throughout 2020 and see whether the onset of the global pandemic had an effect on ridership.

In [None]:
fig, ax = plt.subplots(figsize=(7, 5))
mbta_daily[(mbta_daily.station_name == 'Harvard')].plot(x='date',
                                                        y='gated_entries',
                                                        legend=False,
                                                        color='crimson',
                                                        ax=ax)
plt.xlabel('Date')
plt.ylabel('Gated Entries')
plt.title('2020 Daily Gated Entries at the Harvard Square MBTA Station')
plt.show()
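
Grouping by multiple columns, as done for `mbta_daily` earlier in this section, can also be sketched on a small toy table. The example below uses made-up values and additionally shows the `as_index=False` option of [`pandas.DataFrame.groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), which keeps the grouping columns as regular columns and thereby offers a shorter alternative to the `to_frame()` and `reset_index()` chain.

```python
import pandas as pd

# Made-up rows: station A has two time periods on the first date
toy_multi = pd.DataFrame({
    'date': ['2020-01-01'] * 3 + ['2020-01-02'],
    'station_name': ['A', 'A', 'B', 'A'],
    'gated_entries': [10, 15, 20, 30],
})

# The chained approach used in this tutorial
chained = (toy_multi.groupby(['date', 'station_name'])
                    .gated_entries.sum()
                    .to_frame()
                    .reset_index())

# A shorter alternative: keep the group keys as columns from the start
direct = toy_multi.groupby(['date', 'station_name'],
                           as_index=False)['gated_entries'].sum()

print(chained)
print(direct)
```

Both approaches produce one row per unique date and station combination, with the two entries for station A on the first date added together.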

---

## Additional Resources

Interactive [Kaggle tutorials](https://www.kaggle.com/learn) with built-in exercises:
- Introduction to Python: https://www.kaggle.com/learn/python
- Introduction to Pandas: https://www.kaggle.com/learn/pandas
- Data Cleaning with Pandas: https://www.kaggle.com/learn/data-cleaning
- Data Visualization using Seaborn: https://www.kaggle.com/learn/data-visualization

Official [Pandas](https://pandas.pydata.org/pandas-docs/stable) resources:
- Pandas Getting Started Guide: https://pandas.pydata.org/pandas-docs/stable/getting_started
- Pandas User Guide: https://pandas.pydata.org/pandas-docs/stable/user_guide
- Pandas API Reference: https://pandas.pydata.org/pandas-docs/stable/reference
- Pandas Cheat Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

Official [Matplotlib](https://matplotlib.org) resources:
- Matplotlib Tutorials: https://matplotlib.org/stable/tutorials
- Matplotlib User Guide: https://matplotlib.org/stable/users
- Matplotlib Plot Types: https://matplotlib.org/stable/plot_types
- Matplotlib Examples Gallery: https://matplotlib.org/stable/gallery
- Matplotlib API Reference: https://matplotlib.org/stable/api
- Matplotlib Cheat Sheets: https://matplotlib.org/cheatsheets

Official [Seaborn](https://seaborn.pydata.org) resources:
- Seaborn User Guide and Tutorial: https://seaborn.pydata.org/tutorial
- Seaborn Examples Gallery: https://seaborn.pydata.org/examples
- Seaborn API Reference: https://seaborn.pydata.org/api

Official [HVPlot](https://hvplot.holoviz.org) resources:
- HVPlot User Guide: https://hvplot.holoviz.org/user_guide
- HVPlot Examples Gallery: https://hvplot.holoviz.org/reference

Official [Plotly](https://plotly.com/python) resources:
- Plotly Express User Guide: https://plotly.com/python/plotly-express
- Plotly Python Graphing Library: https://plotly.com/python