# Packages, Modules, Methods, and Functions

> The Python source distribution has long maintained the philosophy of "batteries included" -- having a rich and versatile standard library which is immediately available, without making the user download separate packages. This gives the Python language a head start in many projects.
>
> \- PEP 206

## Applied Review

### Python and Jupyter Overview

- We're working with Python through Jupyter, one of the most common environments for data science.

### Fundamentals

- Python's common *atomic*, or basic, data types are:
    - Integers
    - Floats (decimals)
    - Strings
    - Booleans

- These simple types can be combined to form more complex types, including:
    - Lists: Ordered collections
    - Dictionaries: Key-value pairs
    - DataFrames: Tabular datasets

## Packages (aka Modules)

So far we've seen several data types that Python offers out-of-the-box.
However, to keep things organized, some Python functionality is stored in standalone *packages*, or libraries of code.
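The word "module" is often used interchangeably with "package"; strictly speaking, a module is a single file of Python code and a package is a collection of modules, but you will hear both in discussions of Python.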

For example, functionality related to the operating system -- such as creating files and directories -- is stored in a package called `os`.
To use the tools in `os`, we *import* the package.

In [1]:
import os

Once we import it, we gain access to everything inside.
With Jupyter's autocomplete, we can view what's available.

In [None]:
# Move your cursor to the end of the line below and press tab.
os.

Some packages, like `os`, are bundled with every Python install; downloading Python guarantees you'll have these packages.
Collectively, this group of packages is known as the *standard library*.
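As a small illustration of a standard-library package in use, the sketch below calls a few functions from `os`. The directory and file names are only illustrative; nothing needs to exist on disk for this to run.

```python
import os

# Build a path from its parts, using the separator of the current operating system.
# "data" and "planes.csv" are illustrative names; no file is actually read.
path = os.path.join("data", "planes.csv")

# Split the path back into its directory and its file name.
directory, filename = os.path.split(path)
print(directory)  # data
print(filename)   # planes.csv

# Ask the operating system for the current working directory.
print(os.getcwd())
```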

Other packages must be downloaded separately, either because
- they aren't sufficiently popular to merit inclusion in the standard library
- *or* they change too quickly for the maintainers of Python to keep up

The DataFrame type that we saw earlier is part of one such package: `pandas` (short for *Panel Data*).
Since pandas is specific to data science and is still rapidly evolving, it is not part of the standard library.

We can download packages like pandas from the internet using a website called PyPI, the *Python Package Index*.
Fortunately, since we are using Binder today, that has been handled for us and pandas is already installed.
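If you are ever unsure whether a package is available in your environment, one way to check (a sketch, not the only approach) is the standard library's `importlib.util.find_spec`, which returns `None` when Python cannot locate a package with the given name.

```python
import importlib.util

# find_spec returns None when no package with that name can be found.
os_found = importlib.util.find_spec("os") is not None
pandas_found = importlib.util.find_spec("pandas") is not None

print(os_found)      # True: os ships with every Python install
print(pandas_found)  # True only where pandas has been installed
```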

It's possible to import packages under an *alias*, or a nickname.
The community has adopted certain conventions for aliases for common packages;
while following them isn't mandatory, it's highly recommended, as it makes your code easier for others to understand.

pandas is conventionally imported under the alias `pd`.

In [2]:
import pandas as pd

In [3]:
# Importing pandas has given us access to the DataFrame, accessible as pd.DataFrame
pd.DataFrame

pandas.core.frame.DataFrame

<font class="question">
    <strong>Question</strong>:<br><em>What is the type of `pd`? Guess before you run the code below.</em>
</font>

In [4]:
type(pd)

module

Third-party packages unlock a huge range of functionality that isn't available in native Python; many of Python's data science capabilities come from a handful of packages outside the standard library:

- pandas
- numpy (numerical computing)
- scikit-learn (modeling)
- scipy (scientific computing)
- matplotlib (graphing)

We won't have time to touch on most of these in this training, but if you're interested in one, google it!

<font class="your_turn">
    Your Turn
</font>

1. Import the `numpy` library, listed above. Give it the alias "np".
2. Using autocomplete, determine what variable inside the numpy library starts with "eig". *Hint: remember you'll need to prefix the variable name with the package alias, e.g. `np.eig`*

## Functions

On several occasions, we've seen parentheses used to produce a result.

```python
# Get the type of pd.
type(pd)
```

```python
# Get the first few rows of the planes data.
planes.head()
```

```python
# Read in the planes.csv file.
pd.read_csv('../data/planes.csv')
```

Expressions using parentheses like this are called "function calls".
The name to the left of the parens (`type`, `planes.head`, `pd.read_csv`) is called the **function**, and any values within the parens are called function arguments, or simply **arguments**.

Functions are named blocks of Python code: calling the short name runs the whole block.
For example, the `read_csv` function refers to a block of Python code that reads in a data file, turns it into a DataFrame, and *returns* it to the user.

The key idea of functions is **they take inputs and produce outputs**;
usually you don't need to know anything about *how* they do it.
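To make the input/output idea concrete, here is a short sketch using three of Python's built-in functions. The variable names are chosen only for illustration.

```python
# One argument in, one result out: the number of characters in a string.
n_chars = len("pandas")
print(n_chars)  # 6

# Two arguments: the number to round and how many decimal places to keep.
rounded = round(3.14159, 2)
print(rounded)  # 3.14

# Several arguments: max returns the largest of them.
biggest = max(4, 9, 2)
print(biggest)  # 9
```

In each case we only need to know what goes in and what comes out, not how the function does its work.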

Functions are integral to using Python, because it's much more efficient to use pre-written code than to always write your own.

If you ever do want to write your own function -- perhaps to share with others, or to make it easier to reuse your work -- it's fairly simple to do so, but beyond the scope of this training.
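Although writing functions is outside the scope of this training, a minimal sketch may help show that there is no magic involved. The function name `fahrenheit_to_celsius` and its argument `degrees_f` are invented for this example.

```python
def fahrenheit_to_celsius(degrees_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (degrees_f - 32) * 5 / 9

# Call the function: 212 is the argument, and the result is returned to us.
boiling_c = fahrenheit_to_celsius(212)
print(boiling_c)  # 100.0
```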

## Objects, Attributes, and Methods

Up to this point, we've referred to *variables* and *values*.
Variables are the names we give to values.

To be more precise and pythonic, the values stored in variables are **objects**.
Everything in Python is an object: integers, strings, DataFrames, and even modules (like the `pd` variable above).

All objects support **dot-notation**: using a period to access data *within* the object.

For example, we saw this syntax above:
```python
pd.DataFrame
```

This refers to the `DataFrame` **attribute** of the `pd` object.

In the Fundamentals notebook, we created a DataFrame called `planes` and used the `head` attribute to view the first 5 rows of the data.
```python
planes.head()
```

As you can see, attributes can be functions (note the parens after `head`);
we call such attributes **methods**.

Methods are just like other functions, except that their relationship to their parent object allows them to use its data.
For example, `head` accesses the first rows of its parent object (a DataFrame) and returns them.
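The same distinction holds for objects outside pandas. As a sketch, the example below uses a `date` object from the standard library's `datetime` module; the specific date is arbitrary.

```python
import datetime

d = datetime.date(2020, 1, 15)

# `year` is a plain data attribute: no parentheses are needed.
print(d.year)  # 2020

# `isoformat` is a method: an attribute that is a function, so we call it.
print(d.isoformat())  # 2020-01-15

# callable() tells us whether an attribute can be called like a function.
print(callable(d.isoformat))  # True
print(callable(d.year))       # False
```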

## Concept Map

We've introduced a lot of Python concepts. Let's review how they relate to each other.

![concept_map](images/concept_map.jpg)

### Functions, Objects, and Methods in the Context of DataFrames

DataFrames are a type of Python object, so let's use them to explore the new Python features we've learned.

Using the `read_csv` function from pandas to read in a DataFrame.

In [5]:
df = pd.read_csv('../data/airlines.csv')

Using the `type` function to determine the type of `df`.

In [6]:
type(df)

pandas.core.frame.DataFrame

Using the `head` method of the DataFrame to view some of its rows.

In [7]:
df.head()

Unnamed: 0,carrier,name
0,9E,Endeavor Air Inc.
1,AA,American Airlines Inc.
2,AS,Alaska Airlines Inc.
3,B6,JetBlue Airways
4,DL,Delta Air Lines Inc.


Examining the `columns` attribute of the DataFrame to see the names of its columns.

In [8]:
df.columns

Index(['carrier', 'name'], dtype='object')

Inspecting the `shape` attribute to find the *dimensions* (rows and columns) of the DataFrame.

In [9]:
df.shape

(16, 2)

Calling the `describe` method to get a summary of the data in the DataFrame.

In [10]:
df.describe()

Unnamed: 0,carrier,name
count,16,16
unique,16,16
top,VX,US Airways Inc.
freq,1,1


Now let's combine them: using the `type` function to determine what `df.describe` holds.

In [11]:
type(df.describe)

method

<font class="question">
    <strong>Question</strong>:<br><em>Does this result make sense? What would happen if you added parens? i.e. </em><code>type(df.describe())</code>
</font>

<font class="your_turn">
    Your Turn
</font>

Spend some time using autocomplete to explore the methods and attributes of the `df` object we used above.
Remember from the Jupyter lesson that you can use a question mark to see the documentation for a function or method (e.g. `df.describe?`).

# Deeper Dive on DataFrames

Now that we understand objects and functions better, let's look more at DataFrames.

## What Are DataFrames Made of?

Accessing an individual column of a DataFrame can be done by passing the column name as a string, in brackets.

In [12]:
carrier_column = df['carrier']
carrier_column

0     9E
1     AA
2     AS
3     B6
4     DL
5     EV
6     F9
7     FL
8     HA
9     MQ
10    OO
11    UA
12    US
13    VX
14    WN
15    YV
Name: carrier, dtype: object

Individual columns are pandas `Series` objects.

In [13]:
type(carrier_column)

pandas.core.series.Series

How are Series different from DataFrames?

- They're always 1-dimensional

- They have different attributes than DataFrames
    - For example, Series have a `to_list` method -- which doesn't make sense to have on DataFrames

- They don't print in the pretty format of DataFrames, but in plain text (see above)

In [14]:
carrier_column.shape

(16,)

In [15]:
df.shape

(16, 2)

In [16]:
carrier_column.to_list()

['9E',
 'AA',
 'AS',
 'B6',
 'DL',
 'EV',
 'F9',
 'FL',
 'HA',
 'MQ',
 'OO',
 'UA',
 'US',
 'VX',
 'WN',
 'YV']

In [17]:
df.to_list()

AttributeError: 'DataFrame' object has no attribute 'to_list'

It's important to be familiar with Series because they are the building blocks of DataFrames.
Not only are columns represented as Series, but so are rows!

In [18]:
# Fetch the first row of the DataFrame
first_row = df.loc[0]
first_row

carrier                   9E
name       Endeavor Air Inc.
Name: 0, dtype: object

In [19]:
type(first_row)

pandas.core.series.Series

Whenever you select individual columns or rows, you'll get Series objects.

### What Can You Do with a Series?

First, let's create our own Series object from scratch -- they don't need to come from a DataFrame.

In [20]:
# Pass a list in as an argument and it will be converted to a Series.
s = pd.Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

There are 3 things to notice about this Series:

- The values (10, 20, 30...)

- The *dtype*, short for data type.

- The *index* (0, 1, 2...)

#### Values
Values are fairly self-explanatory; we chose them in our input list.

#### dtype
Data types are also straightforward.

Series are always homogeneous: every value shares one dtype, such as integers, floats, booleans, or generic Python objects (called just `object`).

Because a Python object is general enough to contain any other type, any Series holding strings or other non-numeric data will typically default to be of type `object`.

For example, going back to our carriers DataFrame, note that the carrier column is of type `object`.

In [21]:
df['carrier']

0     9E
1     AA
2     AS
3     B6
4     DL
5     EV
6     F9
7     FL
8     HA
9     MQ
10    OO
11    UA
12    US
13    VX
14    WN
15    YV
Name: carrier, dtype: object
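To compare a few dtypes side by side, the sketch below builds three small Series from made-up values. The exact integer dtype can vary by platform and pandas version, so it is printed without a stated result.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
floats = pd.Series([1.5, 2.5, 3.5])
mixed = pd.Series([1, "two", 3.0])  # values of different Python types

print(ints.dtype)    # an integer dtype (typically int64)
print(floats.dtype)  # float64
print(mixed.dtype)   # object: the only dtype general enough to hold them all
```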

#### Index
Indexes are more interesting.
Every Series has an index, a way to reference each element.
The index of a Series is a lot like the keys of a dictionary: each index element corresponds to a value in the Series, and can be used to look up that element.

In [22]:
# Our index is a range from 0 (inclusive) to 5 (exclusive).
s.index

RangeIndex(start=0, stop=5, step=1)

In [23]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [24]:
s[3]

40

In our example, the index is just the integers 0-4, so right now it looks no different than referencing elements of a regular Python list.
*But* indexes can be changed to something different -- like the letters a-e, for example.

In [25]:
s.index = ['a', 'b', 'c', 'd', 'e']
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

Now to look up the value 40, we reference `'d'`.

In [26]:
s['d']

40
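An index doesn't have to be assigned after the fact. As a sketch, it can also be supplied when the Series is created, either through the `index` argument or by passing a dictionary whose keys become the index. The names `labeled` and `from_dict` are illustrative.

```python
import pandas as pd

# Supply the index explicitly alongside the values.
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Or pass a dictionary: its keys become the index, its values the data.
from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

print(labeled["b"])           # 20
print(from_dict["c"])         # 30
print(list(from_dict.index))  # ['a', 'b', 'c']
```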

We saw earlier that rows of a DataFrame are Series.
In such cases, the flexibility of Series indexes comes in handy;
the index is set to the DataFrame column names.

In [27]:
df.head()

Unnamed: 0,carrier,name
0,9E,Endeavor Air Inc.
1,AA,American Airlines Inc.
2,AS,Alaska Airlines Inc.
3,B6,JetBlue Airways
4,DL,Delta Air Lines Inc.


In [28]:
# Note that the index is ['carrier', 'name']
first_row = df.loc[0]
first_row

carrier                   9E
name       Endeavor Air Inc.
Name: 0, dtype: object

This is particularly handy because it means you can extract individual elements based on a column name.

In [29]:
first_row['carrier']

'9E'

## DataFrame Indexes

It's not just Series that have indexes!
DataFrames have them too.
Take a look at the carrier DataFrame again and note the bold numbers on the left.

In [30]:
df.head()

Unnamed: 0,carrier,name
0,9E,Endeavor Air Inc.
1,AA,American Airlines Inc.
2,AS,Alaska Airlines Inc.
3,B6,JetBlue Airways
4,DL,Delta Air Lines Inc.


These numbers are an index, just like the one we saw on our example Series.
And DataFrame indexes support similar functionality.

In [31]:
# Our index is a range from 0 (inclusive) to 16 (exclusive).
df.index

RangeIndex(start=0, stop=16, step=1)

When loading in a DataFrame, the default index will always be 0 to N-1, where N is the number of rows in your DataFrame.
This is called a `RangeIndex`.

Selecting individual rows by their index is done with the `.loc` accessor.
An *accessor* is an attribute designed specifically to help users reference something else (like rows within a DataFrame).

In [32]:
# Get the row at index 4 (the fifth row).
df.loc[4]

carrier                      DL
name       Delta Air Lines Inc.
Name: 4, dtype: object
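Because `.loc` works with index *labels* rather than positions, it behaves a little differently from list slicing: a label slice includes both endpoints. The sketch below uses a small made-up DataFrame called `letters` to show this.

```python
import pandas as pd

letters = pd.DataFrame({"letter": ["a", "b", "c", "d", "e"]})

# A single label returns one row as a Series.
row = letters.loc[4]
print(row["letter"])  # e

# A slice of labels returns a DataFrame and includes BOTH endpoints,
# so 0:2 selects the rows labeled 0, 1, and 2.
subset = letters.loc[0:2]
print(len(subset))  # 3
```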

As with Series, DataFrames support reassigning their index.

However, with DataFrames it often makes sense to change one of your columns into the index.

This is analogous to a primary key in relational databases: a way to rapidly look up rows within a table.

In our case, maybe we will often use the carrier code (`carrier`) to look up the full name of the airline.
In that case, it would make sense to set the carrier column as our index.

In [33]:
df = df.set_index('carrier')
df.head()

Unnamed: 0_level_0,name
carrier,Unnamed: 1_level_1
9E,Endeavor Air Inc.
AA,American Airlines Inc.
AS,Alaska Airlines Inc.
B6,JetBlue Airways
DL,Delta Air Lines Inc.


Now the RangeIndex has been replaced with a more meaningful index, and it's possible to look up rows of the table by passing a carrier code to the `.loc` accessor.

In [34]:
df.loc['UA']

name    United Air Lines Inc.
Name: UA, dtype: object

<font style="color:#800;">
    <strong>Caution</strong>:<br><em>pandas does not require index values to be unique (that is, free of duplicates), although many relational databases do require that of a primary key. It is therefore possible to create a non-unique index, but highly inadvisable: when several rows share an index value, looking up that value returns all of them, which can produce unexpected results. Don't do it if you can help it!</em>
</font>
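A minimal sketch of what a duplicated index looks like in practice, using a small made-up DataFrame called `flights`. The `is_unique` attribute of an index is a quick way to check for duplicates before relying on lookups.

```python
import pandas as pd

# A made-up table in which the carrier code "AA" appears twice.
flights = pd.DataFrame(
    {
        "carrier": ["AA", "UA", "AA"],
        "name": ["first AA row", "UA row", "second AA row"],
    }
).set_index("carrier")

# Check whether every index value is distinct.
print(flights.index.is_unique)  # False

# Looking up a duplicated label returns every matching row as a DataFrame,
# not the single row you might have expected.
aa_rows = flights.loc["AA"]
print(len(aa_rows))  # 2
```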

When starting to work with a DataFrame, it's often a good idea to determine what column makes sense as your index and to set it immediately.

This will make your code nicer -- by letting you directly look up values with the index -- and also make your selections and filters faster, because Pandas is optimized for operations by index.

If you want to change the index of your DataFrame later, you can always `reset_index` (and then assign a new one).

In [35]:
df.head()

Unnamed: 0_level_0,name
carrier,Unnamed: 1_level_1
9E,Endeavor Air Inc.
AA,American Airlines Inc.
AS,Alaska Airlines Inc.
B6,JetBlue Airways
DL,Delta Air Lines Inc.


In [36]:
df = df.reset_index()
df.head()

Unnamed: 0,carrier,name
0,9E,Endeavor Air Inc.
1,AA,American Airlines Inc.
2,AS,Alaska Airlines Inc.
3,B6,JetBlue Airways
4,DL,Delta Air Lines Inc.


<font class="your_turn">
    Your Turn
</font>

The cell below has code to load in the first 100 rows of the airports data as `airports`.
The data contains the airport code, airport name, and some basic facts about the airport location.

1. What kind of index is the current index of `airports`? 
2. Is this a good choice for the DataFrame's index? If not, what column or columns would be a better candidate?
3. If you chose a different column to be the index, make it your index using `airports.set_index()`.
4. Using your new index, look up "Pittsburgh-Monroeville Airport", code 4G0. What is its altitude?
5. Reset your index in case you want to make a different column your index in the future.

In [37]:
airports = pd.read_csv('../data/airports.csv')
# .loc slices include both endpoints, so labels 0 through 99 give the first 100 rows.
airports = airports.loc[0:99]
airports.head()

Unnamed: 0,faa,name,lat,lon,alt,tz,dst,tzone
0,04G,Lansdowne Airport,41.130472,-80.619583,1044,-5,A,America/New_York
1,06A,Moton Field Municipal Airport,32.460572,-85.680028,264,-6,A,America/Chicago
2,06C,Schaumburg Regional,41.989341,-88.101243,801,-6,A,America/Chicago
3,06N,Randall Airport,41.431912,-74.391561,523,-5,A,America/New_York
4,09J,Jekyll Island Airport,31.074472,-81.427778,11,-5,A,America/New_York


# Questions

Are there any questions before we move on?