# Introductory data processing and analytics
### in Python, using pandas

pandas is a Python library for analyzing and organizing tabular data. It's probably the most common library for working with both big and small datasets in Python, and it is the basis for working with more specialized analytical packages (e.g. scikit-learn) and for analyzing geographic data (e.g. geopandas).

This notebook provides an intro to pandas for analyzing urban data. We'll be learning the following:
 - TODO

In [1]:
import pandas as pd

## DataFrames: the basic unit of pandas

In pandas, a `DataFrame` is a tabular data structure similar to a spreadsheet, where data is organized in rows and columns. Columns can contain different kinds of data, such as numbers, strings, dates, and so on. When we load data in pandas, we typically load it into a `DataFrame`.

Let's first take a look at a small dataset, Canadian municipalities and their population in 2021 and 2016, based on Census data. In Statistics Canada lingo, these are called [Census Subdivisions](https://www12.statcan.gc.ca/census-recensement/2021/ref/dict/az/Definition-eng.cfm?ID=geo012). This dataset only includes municipalities with a population greater than 25,000 in 2021.

The main way to load CSV data is the [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function, but pandas can also read [many other](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) data formats.

In [2]:
df = pd.read_csv("data/cities.csv")

Great! Now our data is stored in the variable `df` in the structure of a `DataFrame`.
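
If you don't have `data/cities.csv` on hand, here is a minimal sketch of the same idea. It assumes a hypothetical three-row extract written inline as a string (the name `df_small` and the use of `io.StringIO` are illustrative choices, not part of the notebook's dataset), and shows a few quick ways to check what was loaded.

```python
import io

import pandas as pd

# Hypothetical three-row extract, written inline so the example
# runs without needing data/cities.csv
csv_text = """Name,Prov/terr,"Population, 2021","Population, 2016"
Abbotsford,B.C.,153524,141397
Airdrie,Alta.,74100,61581
Ajax,Ont.,126666,119677
"""

# read_csv accepts a file path or any file-like object
df_small = pd.read_csv(io.StringIO(csv_text))

print(type(df_small))   # the result is a DataFrame
print(df_small.shape)   # (number of rows, number of columns)
print(df_small.dtypes)  # the type pandas inferred for each column
```

Note that the column names containing a comma are wrapped in double quotes in the CSV text, so pandas reads each as a single column name.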

### Exploring

Let's explore what this DataFrame looks like. Calling the method `.head(N)` or `.tail(N)` on it shows the top or bottom `N` rows. The following prints the first 10 rows - **see if you can print the bottom 10 rows or a different number of rows**.

In [3]:
df.head(10)

Unnamed: 0,Name,Prov/terr,"Population, 2021","Population, 2016"
0,Abbotsford,B.C.,153524,141397
1,Airdrie,Alta.,74100,61581
2,Ajax,Ont.,126666,119677
3,Alma,Que.,30331,30771
4,Aurora,Ont.,62057,55445
5,Barrie,Ont.,147829,141434
6,Belleville,Ont.,55071,50716
7,Blainville,Que.,59819,56863
8,Boisbriand,Que.,28308,26884
9,Boucherville,Que.,41743,41671
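
To see `.head()` and `.tail()` side by side, here is a small sketch on a hypothetical toy frame (`toy` is an illustrative name; its five rows are copied from the table above). It is only meant to show how the two methods behave, not to replace the exercise.

```python
import pandas as pd

# Hypothetical toy frame reusing five rows from the table above
toy = pd.DataFrame({
    "Name": ["Abbotsford", "Airdrie", "Ajax", "Alma", "Aurora"],
    "Prov/terr": ["B.C.", "Alta.", "Ont.", "Que.", "Ont."],
    "Population, 2021": [153524, 74100, 126666, 30331, 62057],
})

first_two = toy.head(2)     # first 2 rows
last_two = toy.tail(2)      # last 2 rows
default_head = toy.head()   # with no argument, up to 5 rows are returned

print(first_two)
print(last_two)
print(toy.shape)  # (rows, columns)
```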


Notice that each column has a unique name. We can view a single column's data by putting its name in square brackets, and see which unique values it contains using `.unique()`. **Try viewing the data of another column**. Beware of upper and lower case -- exact names matter.

In [20]:
df['Prov/terr'].head(10)  # Top 10 only

0     B.C.
1    Alta.
2     Ont.
3     Que.
4     Ont.
5     Ont.
6     Ont.
7     Que.
8     Que.
9     Que.
Name: Prov/terr, dtype: object

In [21]:
df['Prov/terr'].unique()  # Unique values for the *full* dataset - what happens if you do df['Prov/terr'].head(10).unique()?

array(['B.C.', 'Alta.', 'Ont.', 'Que.', 'Man.', 'N.S.', 'P.E.I.', 'N.L.',
       'N.B.', 'Sask.', 'Y.T.'], dtype=object)
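
The sketch below, on a hypothetical toy frame (`toy`, with rows copied from the table above), illustrates two points from this section: `.unique()` only sees the rows it is given, and column names must match exactly.

```python
import pandas as pd

# Hypothetical toy frame reusing five rows from the table above
toy = pd.DataFrame({
    "Name": ["Abbotsford", "Airdrie", "Ajax", "Alma", "Aurora"],
    "Prov/terr": ["B.C.", "Alta.", "Ont.", "Que.", "Ont."],
})

print(list(toy.columns))  # exact column names, including case and punctuation

all_values = toy["Prov/terr"].unique()           # unique values across all rows
first_three = toy["Prov/terr"].head(3).unique()  # unique values in the first 3 rows only

print(all_values)
print(first_three)

# A misspelled or wrongly cased column name raises a KeyError
try:
    toy["prov/terr"]
except KeyError as err:
    print("No such column:", err)
```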

### Filtering and sorting data

We can use the columns to identify data that we might want to filter by. The line below shows data only for Ontario, **but see if you can filter for another province or territory**.

In [15]:
df.loc[df['Prov/terr'] == 'Ont.']

Unnamed: 0,Name,Prov/terr,"Population, 2021","Population, 2016"
2,Ajax,Ont.,126666,119677
4,Aurora,Ont.,62057,55445
5,Barrie,Ont.,147829,141434
6,Belleville,Ont.,55071,50716
10,Bradford West Gwillimbury,Ont.,42880,35325
...,...,...,...,...
171,Whitby,Ont.,138501,128377
172,Whitchurch-Stouffville,Ont.,49864,45837
174,Windsor,Ont.,229660,217188
177,Woodstock,Ont.,46705,41098


pandas also lets us filter data using other comparison operators. Previously, we asked for all cities in Ontario. **Now, filter for all cities which had a population of at least 100,000 in 2021**.

HINT: in Python, "greater than or equal to" (i.e., "at least") is represented using the syntax `>=`.
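
As a rough illustration of how a comparison filter works, the sketch below uses a hypothetical toy frame (`toy`) and an arbitrary threshold of 70,000, chosen only for the example. The names `mask` and `large` are illustrative.

```python
import pandas as pd

# Hypothetical toy frame reusing five rows from the tables above
toy = pd.DataFrame({
    "Name": ["Abbotsford", "Airdrie", "Ajax", "Alma", "Aurora"],
    "Population, 2021": [153524, 74100, 126666, 30331, 62057],
})

# The comparison produces a Series of True/False values, one per row
mask = toy["Population, 2021"] >= 70000
print(mask)

# Passing that Series to .loc keeps only the rows marked True
large = toy.loc[mask]
print(large)
```

Building the mask as its own variable first can make it easier to check which rows a condition actually selects.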

pandas also allows us to combine filtering conditions. **Use the template below to select all cities in Ontario with a population of over 100,000 in 2021**.

In [None]:
df.loc[(df["Prov/terr"] == "Ont.") & (YOUR CONDITION HERE)]
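
The sketch below shows how conditions combine, using a hypothetical toy frame (`toy`) and a different column and threshold from the exercise. `&` means "and", `|` means "or"; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy frame reusing five rows from the tables above
toy = pd.DataFrame({
    "Name": ["Abbotsford", "Airdrie", "Ajax", "Alma", "Aurora"],
    "Prov/terr": ["B.C.", "Alta.", "Ont.", "Que.", "Ont."],
    "Population, 2016": [141397, 61581, 119677, 30771, 55445],
})

in_ontario = toy["Prov/terr"] == "Ont."
under_60k = toy["Population, 2016"] < 60000

both = toy.loc[in_ontario & under_60k]    # rows where both conditions are True
either = toy.loc[in_ontario | under_60k]  # rows where at least one is True

# Written inline, each condition needs its own parentheses,
# because & binds more tightly than == and <
both_inline = toy.loc[(toy["Prov/terr"] == "Ont.") & (toy["Population, 2016"] < 60000)]

print(both)
print(either)
```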

Now let's count how many cities actually meet these conditions. Run the line below to see how many cities there are in this data set in Ontario.

In [26]:
df.loc[df['Prov/terr'] == 'Ont.'].count()

Name                69
Prov/terr           69
Population, 2021    69
Population, 2016    69
dtype: int64

The method `.count()` tells us how many non-missing values there are in each column - but if we wanted to just see one column, we could also select that individual column using `df[COL_NAME]`. **Try a different condition and count the amount of data for it**.
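
Here is a small sketch of what `.count()` reports, on a hypothetical toy frame (`toy`). The missing 2016 value for Aurora is invented purely to show how gaps are treated; it is not in the real dataset.

```python
import pandas as pd

# Hypothetical toy frame; the missing value is invented for illustration
toy = pd.DataFrame({
    "Name": ["Ajax", "Aurora", "Barrie"],
    "Prov/terr": ["Ont.", "Ont.", "Ont."],
    "Population, 2016": [119677, None, 141434],
})

per_column = toy.count()                      # non-missing values in each column
one_column = toy["Population, 2016"].count()  # the same count for a single column
n_rows = len(toy)                             # number of rows, missing values or not

print(per_column)
print(one_column, n_rows)
```

When a column has no missing values, its count equals the number of rows, which is why every column shows the same number in the Ontario output above.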

You might have noticed that these cities are in alphabetical order - what if we wanted to see them in order of population? In pandas, we do this using the [sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method. The default is to sort in ascending order, so we set `ascending=False` (i.e. descending) so the most populous cities are at the top.

In [31]:
df.sort_values(by='Population, 2021', ascending=False)

Unnamed: 0,Name,Prov/terr,"Population, 2021","Population, 2016"
158,Toronto,Ont.,2794356,2731571
89,Montréal,Que.,1762949,1704694
19,Calgary,Alta.,1306784,1239220
106,Ottawa,Ont.,1017449,934243
42,Edmonton,Alta.,1010899,933088
...,...,...,...,...
130,Saint-Bruno-de-Montarville,Que.,26273,26197
155,Thetford Mines,Que.,26072,25403
75,Lincoln,Ont.,25719,23787
115,Prince Edward County,Ont.,25704,24735
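
The following sketch, on a hypothetical toy frame (`toy`), compares the default ascending sort with a descending one. It also shows that `sort_values` returns a new DataFrame and leaves the original row order alone unless you store the result.

```python
import pandas as pd

# Hypothetical toy frame reusing four rows from the tables above
toy = pd.DataFrame({
    "Name": ["Abbotsford", "Airdrie", "Ajax", "Alma"],
    "Population, 2021": [153524, 74100, 126666, 30331],
})

ascending = toy.sort_values(by="Population, 2021")  # default: smallest first
descending = toy.sort_values(by="Population, 2021", ascending=False)

print(ascending)
print(descending)

# sort_values returned new DataFrames; toy keeps its original order
print(toy["Name"].tolist())
```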


Let's put some of this together now. **Filter the data to show all cities in the province of Quebec with a population of at least 50,000 in 2016, and make sure to sort the cities by their 2016 population**.

HINT: You can do this in two steps (which is more readable) by storing the data that you filter into a variable called `df_filtered`, then running the command to sort the values on `df_filtered`.
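
As a sketch of the two-step pattern, here is the same idea on a hypothetical toy frame (`toy`) with a different province, year, and threshold from the exercise. The names `toy_filtered`, `toy_sorted`, and `chained` are illustrative.

```python
import pandas as pd

# Hypothetical toy frame reusing five rows from the tables above
toy = pd.DataFrame({
    "Name": ["Ajax", "Alma", "Aurora", "Barrie", "Belleville"],
    "Prov/terr": ["Ont.", "Que.", "Ont.", "Ont.", "Ont."],
    "Population, 2021": [126666, 30331, 62057, 147829, 55071],
})

# Step 1: filter and store the result in its own variable
toy_filtered = toy.loc[(toy["Prov/terr"] == "Ont.") & (toy["Population, 2021"] >= 60000)]

# Step 2: sort the filtered rows, largest population first
toy_sorted = toy_filtered.sort_values(by="Population, 2021", ascending=False)
print(toy_sorted)

# The same thing as one chained expression, which is shorter but harder to read
chained = toy.loc[
    (toy["Prov/terr"] == "Ont.") & (toy["Population, 2021"] >= 60000)
].sort_values(by="Population, 2021", ascending=False)
```

Keeping the intermediate variable makes it easy to inspect the filtered rows before sorting them.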

## Frame physics: modifying DataFrames

### Cleaning, renaming, and NaNs

### Creating new columns 

### Joining, merging, and concatenating tables

## Learning about data: summarizing tables

### `.describe()` and summary stats

### Plotting

### Cross tabulation

## Data wrangling: finding, loading, and saving data

TODO: Motivation

In [None]:
# TODO: Example