# 1 Pandas and Plotting

Pandas is a Python library with fairly advanced spreadsheet-like methods. This notebook explores some basic syntax.

The `import` command below lets us use `pandas`, which is not included in standard Python. The `as pd` part means we can type `pd` instead of `pandas` when referring to a command from the `pandas` library.

In [None]:
import pandas as pd

Read data from a CSV file into a `pandas` data frame. (Yes, there is a corresponding command for .xlsx files: `pd.read_excel`.) As you can see, we call a `pandas` function with the `pd.` ("pd-dot") notation.

Each data frame works much like a "tab" in Excel or Google Sheets. In other words, a data frame is a two-dimensional row-column data structure.

In [None]:
df = pd.read_csv('AU_COURSE_LIST.csv')
df
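To see what `pd.read_csv` produces without needing the course file, here is a minimal sketch that reads a tiny CSV from an in-memory string. The rows are made up, and the `Discipline` column name is an assumption; only `Subject`, `Tot Enrl`, and `Cap Enrl` are column names taken from this notebook.

```python
import io
import pandas as pd

# Made-up stand-in for AU_COURSE_LIST.csv; the real file has other rows
# and possibly other columns. 'Discipline' is a hypothetical column name.
csv_text = """Subject,Discipline,Tot Enrl,Cap Enrl
CRKT,Sports,12,20
SWRD,Fencing,8,10
SWRD,Fencing,10,10
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (number of rows, number of columns)
print(sample.columns.tolist())  # column names taken from the header line
print(sample.dtypes)            # pandas guesses a type for each column
```

`shape`, `columns`, and `dtypes` are quick checks worth running on any freshly loaded data frame.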

If you think about it, using individual "cells" or blocks of "cells" is rarely needed. Most often we need to manipulate a whole column: 

In [None]:
df['Subject']

But we can get individual "cells" using `.loc`: `df.loc[row, column]`. Here `column` is the column name and `row` is the row label. By default the row labels are the integer row numbers starting at 0, but they may be row names, provided the data frame has them.

In [None]:
df.loc[1,"Subject"]

Of course, we can get a row, too.

In [None]:
df.loc[1]
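The difference between row labels and row positions only shows up when the two disagree. The sketch below uses a small made-up data frame (the subject code `ARCH` and the index labels are hypothetical) and contrasts `.loc`, which looks up labels, with `.iloc`, which looks up positions.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Subject": ["CRKT", "SWRD", "ARCH"], "Tot Enrl": [12, 8, 15]},
    index=[10, 20, 30],  # labels deliberately differ from positions 0, 1, 2
)

by_label = toy.loc[20, "Subject"]  # the row whose label is 20
by_position = toy.iloc[1, 0]       # second row, first column
row = toy.loc[20]                  # the whole row, returned as a Series

print(by_label, by_position)
print(row)
```

Both lookups return the same cell here, because the row labelled 20 happens to be the second row.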

Much more interesting: defining a pattern and selecting all rows that match it.

In [None]:
pattern=(df['Subject']=='CRKT')
df[pattern]
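The pattern is simply a column of `True`/`False` values, one per row, and such columns can be combined. A minimal sketch on made-up data (the subject code `ARCH` and the enrollment numbers are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Subject": ["CRKT", "SWRD", "CRKT", "ARCH"],
    "Tot Enrl": [12, 8, 25, 15],
})

is_crkt = toy["Subject"] == "CRKT"  # Boolean Series, one entry per row
is_large = toy["Tot Enrl"] >= 20

print(is_crkt.tolist())
print(toy[is_crkt & is_large])                     # both conditions hold
print(toy[toy["Subject"].isin(["CRKT", "ARCH"])])  # subject is in a list
```

Use `&` for "and", `|` for "or", and `~` for "not". When a comparison is written inline, wrap it in parentheses, as in `(toy['Tot Enrl'] >= 20)`.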

### Task
Modify the box above to find
* all sword-classes: 'SWRD'.
* all classes offered in the discipline 'Fencing'.

## Let's investigate enrollment in AU university

That means we deal with the column: `df['Tot Enrl']`.
We can plot:
* The raw data.
* Histograms of the raw data.
* The cumulative distribution function of the raw data.
* A bar graph showing how often each enrollment value appears: `value_counts`.

***If you use floating plots, close each plot before executing the next cell.*** 

In [None]:
# by default each bar has a label, let's suppress that in this plot 
df['Tot Enrl'].plot.bar(xticks=[], title='Raw enrollment unsorted list')

There is more than one way to skin a cat, and there are at least four good ways to draw histograms. `df[...].hist()` is nice because it works with the DataFrame, but still communicates with matplotlib rather directly.

In [None]:
#df['Tot Enrl'].plot.hist(title='default histogram')
df['Tot Enrl'].hist()

A plain vanilla histogram.

In [None]:
df['Tot Enrl'].hist(bins=51)

A histogram normalized as a probability density: the bar heights times the bin widths sum to $1$. (The heights themselves sum to $1$ only if every bin has width $1$.)

In [None]:
df['Tot Enrl'].hist(bins=51, density=1)
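What `density` normalizes can be checked numerically. The sketch below uses `numpy.histogram`, which applies the same density normalization, on made-up enrollment numbers (not the course data).

```python
import numpy as np

# Made-up enrollment values, only for illustration.
values = np.array([5, 8, 8, 13, 13, 13, 20, 25, 25, 40])

heights, edges = np.histogram(values, bins=5, density=True)
widths = np.diff(edges)                  # every bin here is 7 units wide

area = float(np.sum(heights * widths))   # total area under the bars
bin_probabilities = heights * widths     # fraction of the data in each bin

print(heights.sum())       # sum of the heights: 1 / bin width, not 1
print(area)                # the area is what is normalized to 1
print(bin_probabilities)
```

To read a bar as a probability, multiply its height by its bin width.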

Sum the probabilities of all bins up to and including the current one to obtain the cumulative distribution function (CDF). Study this plot carefully. Make sure you understand why it climbs from $0$ on the left and ends at $1$.

In [None]:
#df['Tot Enrl'].plot.hist(bins=51, cumulative=True, density=1, title='cumulative distribution function')
df['Tot Enrl'].hist(bins=51, cumulative=True, density=1)
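The running sum behind the cumulative histogram can be reproduced by hand. This is a minimal sketch on made-up enrollment numbers, not on the course data:

```python
import numpy as np

# Made-up enrollment values, only for illustration.
values = np.array([5, 8, 8, 13, 13, 13, 20, 25, 25, 40])

counts, edges = np.histogram(values, bins=5)
bin_probabilities = counts / counts.sum()  # fraction of the data per bin
cdf = np.cumsum(bin_probabilities)         # running total, left to right

for right_edge, c in zip(edges[1:], cdf):
    print(f"fraction of classes with enrollment <= {right_edge:g}: {c:.1f}")
```

The running total can never decrease, because each bin adds a fraction that is zero or positive. The last entry is $1$ because by then every data point has been counted.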

In [None]:
#
# Value counts: Maximum-resolution histogram or bar-graph.
#
#
# This counts how often certain values show up in a column 
# and sorts the values by frequency
#
df['Tot Enrl'].value_counts()

In [None]:
# Sorting it by value = enrollment is slightly more tricky.
#
counts_by_freq = df['Tot Enrl'].value_counts()
counts_by_value = counts_by_freq.sort_index()
counts_by_value
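The two orderings are easier to tell apart on a handful of numbers. The sketch below uses a made-up series and also shows `normalize=True`, which turns the counts into relative frequencies.

```python
import pandas as pd

# Made-up enrollment values, only for illustration.
enrl = pd.Series([10, 25, 10, 8, 25, 10], name="Tot Enrl")

by_freq = enrl.value_counts()   # most frequent value first
by_value = by_freq.sort_index() # smallest enrollment first

# Relative frequencies instead of counts; they sum to 1.
rel_freq = enrl.value_counts(normalize=True).sort_index()

print(by_freq)
print(by_value)
print(rel_freq)
```

In the result of `value_counts`, the index holds the enrollment values and the data holds how often each occurs. That is why `sort_index` sorts by enrollment.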

In [None]:
counts_by_value.plot.bar(xticks=[10,20,30,40,50,60,70,80,90], 
                         xlabel='Enrollment', ylabel='Number of classes')
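One caveat about the tick positions: a pandas bar plot places the bars at positions 0, 1, 2, ..., one per row of the series. A tick at position 10 therefore marks the eleventh bar, which corresponds to enrollment 10 only if every value from 0 upward appears in the index. One possible way to make positions and enrollment values coincide, sketched on made-up counts, is to fill in the missing values with zero counts before plotting:

```python
import pandas as pd

# Made-up counts: enrollment value -> number of classes.
by_value = pd.Series({8: 1, 10: 3, 25: 2})

# Insert every missing enrollment value from 0 to the maximum with count 0.
largest = int(by_value.index.max())
full = by_value.reindex(range(0, largest + 1), fill_value=0)

print(len(full))      # one entry per enrollment value 0..25
print(full.iloc[10])  # position 10 now holds the count for enrollment 10
```

After this step a call such as `full.plot.bar(xticks=[10, 20])` puts the tick at position 10 under the bar for enrollment 10.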

### Task

For each of the five plots:
* Write a brief description: What specifically does this plot show?
* Does the plot have a clear message or at least important highlights?
* Make a bar graph of the `Cap Enrl` column. What do this column and plot communicate?
* Define histogram, PMF, and CDF with words and with formulas.