(04:Package-structure-and-distribution)=
# Package structure and distribution
<hr style="height:1px;border:none;color:#666;background-color:#666;" />

The previous chapter provided a practical, high-level guide of how to develop and distribute a Python package. This chapter now goes into more detail about what a Python package actually is and how it is structured. We begin with a discussion of what modules and packages are in Python and how they are imported and used in a Python program. We go on to discuss some more advanced package structure topics such as controlling the import behavior of a package and including non-code files, like data files. The chapter finishes with a discussion of what it means to "build" and "distribute" a package. Along the way, we'll show practical examples by building upon the `partypy` package developed in the previous chapter.

## Packaging fundamentals

We'll begin this chapter with some intuition about what Python packages are and how they're used. Part of the beauty of Python is that it abstracts lower-level implementation details away from users who don't need to understand them, but to better understand Python packages, it's useful to peek under the hood, as we'll do in this section.

Firstly, all data in a Python program is represented by objects or by relations between objects. For example, integers and functions are kinds of Python objects. We can find the type of a Python object using the built-in `type()` function, as demonstrated in the examples below:

```{prompt} bash \$ auto
$ python
```

```{prompt} python >>> auto
>>> a = 1
>>> type(a)
```

```python
<class 'int'>
```

```{prompt} python >>> auto
>>> def hello_world(name):
        print(f"Hello world! My name is {name}.")
>>> type(hello_world)
```

```python
<class 'function'>
```

In the above code, we created an integer object and a function object which are mapped to the names `a` and `hello_world` respectively.

The object relevant to our discussion of Python packages is the module object. A module is an object that serves as an organizational unit of Python code. In the simplest case, this code is stored in a file with a .py suffix and is imported using the `import` statement. The created module object’s name is the name of the imported file (excluding the .py suffix). For example, imagine we have a module `greetings.py` in our current directory containing functions to print "Hello World!" in English, German, and Spanish:

```python
def hello_world():
    print("Hello World!")


def hallo_welt():
    print("Hallo Welt!")


def hola_mundo():
    print("Hola Mundo!")
```

We can import that module using the `import` statement and can use the `type()` function to verify that we created a module object which has been mapped to the name "greetings" (the name of the file):

```{prompt} python >>> auto
>>> import greetings
>>> type(greetings)
```

```python
<class 'module'>
```

As mentioned earlier, this module object is an organisational unit of code. We say this because the contents of the module (in this case, the three different "Hello World!" functions) can be accessed via the module name and "dot notation". For example:

```{prompt} python >>> auto
>>> greetings.hello_world()
```

```python
Hello World!
```

```{prompt} python >>> auto
>>> greetings.hallo_welt()
```

```python
Hallo Welt!
```

```{prompt} python >>> auto
>>> greetings.hola_mundo()
```

```python
Hola Mundo!
```

At this point in our discussion, it's useful to mention Python's namespaces. A "namespace" in Python is simply a mapping from names to objects. From the code examples above, we've added the names `a` (an integer), `hello_world` (a function), and `greetings` (a module) to the current namespace and can use those names to refer to the objects we created. Python provides various tools for introspecting namespaces, one of which is the `dir()` function, which, when called with no arguments, returns a list of names currently defined:

```{prompt} python >>> auto
>>> dir()
```

```python
['__annotations__', '__builtins__', '__doc__', '__loader__', '__name__',
 '__package__', '__spec__', 'a', 'greetings', 'hello_world']
```

In the output above, we can see the names of the three objects we defined: `a`, `hello_world`, and `greetings`. The other names prefixed with double underscores are objects that were initialised automatically when we started the Python interpreter and are implementation details that aren't important to our discussion here, but can be read about in the [Python documentation](https://docs.python.org/3/reference/executionmodel.html?highlight=__builtins__#execution-model). To help focus on just the names we intentionally defined in our namespace we can use the following list comprehension:

```{prompt} python >>> auto
>>> [name for name in dir() if not name.startswith("__")]
```

```python
['a', 'greetings', 'hello_world']
```

Namespaces are created at different moments, have different lifetimes, and can be accessed from different parts of a Python program, but these details are beyond the scope of this chapter and we point interested readers to the [Python documentation](https://docs.python.org/3/tutorial/classes.html#python-scopes-and-namespaces) to learn more. The important point to make here is that, when a module is imported using the `import` statement, a module object is created and it has its own namespace populated by the Python code (i.e., definitions and statements) within that module. The namespace can be accessed using the module's name and dot notation, as we saw earlier. In this way, the module object isolates a codebase and provides us with a clean, logical, and organised way to access code.

We can view the namespace of a module by passing the module object as an input to the `dir()` function:

```{prompt} python >>> auto
>>> [name for name in dir(greetings) if not name.startswith("__")]
```

```python
['hallo_welt', 'hello_world', 'hola_mundo']
```

A final point to stress is that there is no relation between names in different namespaces. For example, in the Python session we've been running in this section we have access to two `hello_world` functions; one that was defined earlier in our interactive interpreter, and one defined in the `greetings` module. While these functions have the exact same name, there is no relation between them because they exist in different namespaces; `greetings.hello_world` exists in the `greetings` module namespace and `hello_world` exists in the top-level global namespace. So, we can access both with the appropriate syntax:

```{prompt} python >>> auto
>>> hello_world("Tom")
```

```python
Hello world! My name is Tom.
```

```{prompt} python >>> auto
>>> greetings.hello_world()
```

```python
Hello World!
```
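
The independence of namespaces can also be checked programmatically. The sketch below is illustrative only: rather than reading `greetings.py` from disk, it builds a stand-in module object in memory with `types.ModuleType` (the name `greetings_demo` is made up for this example) and confirms that its `hello_world` is a different object from the one defined in the enclosing namespace.

```python
import types


def hello_world(name):
    """Top-level function, like the one defined in the interpreter session."""
    return f"Hello world! My name is {name}."


# Build a stand-in module object in memory (hypothetical name "greetings_demo").
greetings_demo = types.ModuleType("greetings_demo")

# Run some source code inside the module's own namespace.
source = 'def hello_world():\n    return "Hello World!"\n'
exec(source, greetings_demo.__dict__)

# Same name, two unrelated objects living in two different namespaces.
top_level_result = hello_world("Tom")
module_result = greetings_demo.hello_world()
same_object = hello_world is greetings_demo.hello_world

print(top_level_result)  # Hello world! My name is Tom.
print(module_result)     # Hello World!
print(same_object)       # False
```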

Now that we have a basic understanding of modules, we can further discuss packages. Luckily, it's an intuitive transition from modules to packages: Python packages are just a collection of one or more modules. Packages provide another level of abstraction for our code base and allow us to group and organise modules (as well as non-code files, like data; but more on that later) in one place for easy access and distribution.

We'll talk more about the specific file structure of packages later, but for now, a useful analogy to remember the distinction between modules and packages is to think of a file and directory structure on your computer: directories are packages and files within those directories are individual modules. Just like the file system on your computer, a root directory (package) may contain files (modules) and/or sub-directories (which we would call sub-packages). 

While this analogy holds at the conceptual level, the distinction between modules and packages in the Python programming language is a little more vague. In fact, regardless of whether you `import` a single, standalone module (i.e., a .py file) or a package (i.e., a directory), Python will create a module object in the current namespace. For example, let's import the `partypy` package we created in **Chapter 3: {ref}`03:How-to-package-a-Python`** and check its type:

```{note}
If you're following on from **Chapter 3: {ref}`03:How-to-package-a-Python`**, recall that we created and installed our `partypy` package in a `conda` virtual environment which can be activated by running `conda activate partypy` in the terminal.
```

```{prompt} python >>> auto
>>> import partypy
>>> type(partypy)
```

```python
<class 'module'>
```

Note that despite importing a package (a collection of modules), Python still created a module object. Just as before, we can access the contents of our package via "dot notation". For example, we can import the `simulate_party` function from the `simulate` module of the `partypy` package using the following syntax:

```{prompt} python >>> auto
>>> from partypy.simulate import simulate_party
>>> type(simulate_party)
```

```python
<class 'function'>
```

While we get a module object regardless of whether we import a module or a package, one technical difference between a module and a package in Python is that packages are module objects that have a `__path__` attribute. This `__path__` attribute tells Python where to look when importing the contents (modules or sub-packages) of your package. For example, let's check that the `partypy` module object we just created does indeed have an attribute called `__path__`, but the `greetings` module we imported from `greetings.py` earlier does not:

```{prompt} python >>> auto
>>> partypy.__path__
```

```python
['/Users/tomasbeuzen/GitHub/py-pkgs/partypy/src/partypy']
```

```{prompt} python >>> auto
>>> greetings.__path__
```

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'greetings' has no attribute '__path__'
```
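
To see this distinction without relying on `partypy` being installed, here is a minimal, self-contained sketch. It writes a standalone module and a regular package into a temporary directory (the names `demo_module` and `demo_pkg` are hypothetical), imports both, and checks which of the resulting module objects has a `__path__` attribute.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway directory holding one standalone module and one package.
# The names "demo_module" and "demo_pkg" are hypothetical.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "demo_module.py").write_text("x = 1\n")
(tmp_dir / "demo_pkg").mkdir()
(tmp_dir / "demo_pkg" / "__init__.py").write_text("")

sys.path.insert(0, str(tmp_dir))  # make the directory importable
importlib.invalidate_caches()

demo_module = importlib.import_module("demo_module")
demo_pkg = importlib.import_module("demo_pkg")

# Both imports produce module objects, but only the package has __path__.
module_has_path = hasattr(demo_module, "__path__")
package_has_path = hasattr(demo_pkg, "__path__")
print(type(demo_module) is type(demo_pkg))  # True
print(module_has_path, package_has_path)    # False True
```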

We'll talk a bit more about the nuances of importing from packages in the next section, but the takeaway message here is that packages are just a way to organize one or more (typically related!) modules along with other files like documentation, data, tests, etc., as we'll discuss in later chapters of this book.

Now let's talk a little more about the file structure of Python packages. Python actually supports two kinds of packages: "regular packages" and "namespace packages". The majority of readers and Python users will be familiar with regular packages (even if they've never heard the term), and these are the focus of this book. Regular packages are implemented as a directory of one or more modules and/or sub-packages that contains an `__init__.py` file. The `__init__.py` file is required to make Python treat the directory as a regular package (or sub-package). It is common for `__init__.py` files to be empty, present only to tell Python that a directory should be treated as a package, but they can also contain helpful initialization code to run when your package is imported, and we'll discuss that more in the next section.

Below is an example of a regular package without sub-packages:

```
simple_pkg
├── __init__.py
├── module_1.py
└── module_2.py
```

Here is an example of a regular package with sub-packages:

```
nested_pkg
├── __init__.py
├── module_1.py
├── sub_pkg_1
│   ├── __init__.py
│   ├── module_2.py
│   └── module_3.py
└── sub_pkg_2
    ├── __init__.py
    └── module_4.py
```

In contrast to regular packages, namespace packages are typically used to group multiple, separate packages, which may reside in different locations on a file system, under a single namespace. The main structural difference between regular and namespace packages is that the parent directory of a namespace package does not need an `__init__.py` file. As the vast majority of users will never need to use a namespace package, we won't discuss them in this book, and instead refer readers interested in learning more about them to the [Python documentation](https://docs.python.org/3/reference/import.html#namespace-packages). As a result, when we use the term "package" in this book, we specifically mean "regular package".

## Package structure

With the theory out of the way, we'll now discuss some more practical considerations of structuring Python packages.

### Package and module names

Python package naming guidelines and conventions are described in [Python Enhancement Proposal (PEP) 8 - Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) and [PEP 423 - Naming conventions and recipes related to packaging](https://www.python.org/dev/peps/pep-0423/). PEP 8 and PEP 423 should be read through at least once, but the fundamental guidelines are that:
- Packages and modules should have a single, short, all-lowercase name; and,
- Underscores can be used to separate words in a name if it improves readability, but their use is typically discouraged.

In terms of the actual name chosen for a module or package, it may be helpful to consider the following "three M's":
1. **Meaningful**: the name should somewhat reflect the functionality of the package.
2. **Memorable**: the name should be easy for users to find, remember, and relate to other relevant packages.
3. **Manageable**: remember that users of your package will access its contents/namespace via dot notation. Make it as quick and easy as possible for them to do this by keeping your package name short and sweet. For example, imagine if we called our `partypy` package something like `simulate_party_attendance`. Every time a user wanted to access the `simulate_party()` function from the `simulate` module, they'd have to write this: `from simulate_party_attendance.simulate import simulate_party` - yikes!

Finally, you should also check [PyPI](https://pypi.org) and other popular hosting sites like GitHub, GitLab, BitBucket, etc. to make sure your chosen name is not already in use.
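
As a rough illustration of these guidelines, the sketch below defines a small helper that flags common problems with a candidate name. The function `check_package_name` and its `max_length` threshold are hypothetical and are not part of PEP 8, PEP 423, or any packaging tool; they simply encode the conventions listed above (short, all-lowercase, underscores discouraged).

```python
import keyword
import re

# Lowercase letters and digits, with optional single underscores between words.
_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_package_name(name, max_length=15):
    """Return a list of problems with a candidate package name (empty if none).

    The max_length default is an arbitrary, illustrative threshold.
    """
    problems = []
    if not _NAME_PATTERN.match(name):
        problems.append("use lowercase letters, digits, and single underscores only")
    if keyword.iskeyword(name):
        problems.append("name is a reserved Python keyword")
    if "_" in name:
        problems.append("underscores are allowed but discouraged")
    if len(name) > max_length:
        problems.append(f"longer than {max_length} characters")
    return problems


print(check_package_name("partypy"))  # []
print(check_package_name("simulate_party_attendance"))
```

A check like this cannot tell you whether a name is meaningful or memorable, nor whether it is already taken on PyPI; those still need a human look.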

```{figure} images/pkg-naming.png
---
width: 40%
name: 04-pkg-naming
alt: Keep package names meaningful, memorable, and manageable.
---
Keep package names meaningful, memorable, and manageable.
```

### Nesting packages

We've discussed how packages are collections of one or more modules and/or sub-packages. Nesting packages (i.e., creating sub-packages within a package) can be helpful to organise complex codebases, but keep in mind the "manageable" guideline above - having deeply nested packages will make it verbose and cumbersome to access code. [PEP 423](https://www.python.org/dev/peps/pep-0423/) recommends generally using two levels of nesting at most, i.e., having one sub-package (`from partypy.simulate import ...`), but not two (`from partypy.simulate.wedding import ...`).

```{figure} images/pkg-nesting.png
---
width: 30%
name: 04-pkg-nesting
alt: It is typically recommended to have no more than two levels of nesting in a Python package.
---
It is typically recommended to have no more than two levels of nesting in a Python package.
```

### Intra-package references

When building packages of multiple modules, it is common to want to use code from one module in another. For example, consider the following package structure:

```
src
└── package
    ├── __init__.py
    ├── moduleA.py
    ├── moduleB.py
    └── subpackage
        ├── __init__.py
        └── moduleC.py
```

A developer may want to import code from `moduleA` in `moduleB`. This is sometimes called an "intra-package reference" and can be accomplished via an "absolute" or "relative" import:

- **Absolute imports** use the package name in an absolute context. For example, to import something from `moduleA` in `moduleC`, use the following syntax in `moduleC`: `from package.moduleA import XXX`.
- **Relative imports** were introduced in [PEP 328](https://www.python.org/dev/peps/pep-0328/). They use leading dots to indicate from where the relative import should begin. A single leading dot indicates an import relative to the current package (or sub-package). Two or more leading dots indicate a relative import to the parent(s) of the current package (sub-package), one level per dot after the first dot. For example, to import something from `moduleA` in `moduleB`, use the following syntax in `moduleB`: `from .moduleA import XXX` (one leading dot searches relative to `package`). To import something from `moduleA` in `moduleC`, use the following syntax in `moduleC`: `from ..moduleA import XXX` (one leading dot in `moduleC` would only search in `subpackage`, but two leading dots tells Python to search the parent of `subpackage`, i.e., `package`).

The choice here mostly comes down to personal preference and there isn't a standard across existing Python libraries. [PEP 8](https://www.python.org/dev/peps/pep-0008/) recommends using absolute imports because they are explicit and more readable. However, relative imports are less verbose (so may be preferable for complex, nested packages) and are flexible in the sense that you could change your package name without having to change all your import statements.
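
To make the two import styles concrete, the following sketch recreates the layout above in a temporary directory and imports from it. The top-level name `intra_demo` is hypothetical and stands in for `package`; `moduleB` uses a single-dot relative import, while `moduleC` uses both an absolute import and a two-dot relative import so that the results can be compared.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Recreate the layout above in a throwaway directory. The top-level name
# "intra_demo" is hypothetical and stands in for "package".
root = Path(tempfile.mkdtemp())
pkg = root / "intra_demo"
sub = pkg / "subpackage"
sub.mkdir(parents=True)

(pkg / "__init__.py").write_text("")
(sub / "__init__.py").write_text("")
(pkg / "moduleA.py").write_text("def greet():\n    return 'hello from moduleA'\n")
# moduleB uses a relative import with one leading dot (same package).
(pkg / "moduleB.py").write_text("from .moduleA import greet\n")
# moduleC shows both styles: absolute, and relative with two leading dots.
(sub / "moduleC.py").write_text(
    "from intra_demo.moduleA import greet as greet_absolute\n"
    "from ..moduleA import greet as greet_relative\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

module_b = importlib.import_module("intra_demo.moduleB")
module_c = importlib.import_module("intra_demo.subpackage.moduleC")

print(module_b.greet())  # hello from moduleA
# Both styles resolve to the very same function object.
print(module_c.greet_absolute is module_c.greet_relative)  # True
```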

### The \_\_init\_\_.py file

Earlier we discussed how an `__init__.py` file is used to tell Python that the directory containing the `__init__.py` file is a (regular) package. The `__init__.py` file is often left empty and only used for the purpose of identifying a directory as a package; however, this file can also be used to provide documentation and/or to run initialization code. When a regular package is imported, the `__init__.py` file is executed, and the objects it defines are bound to names in the package’s namespace.

The most common use case of this functionality is to control the import behaviour of a package using the `__init__.py` file. For example, recall from **Chapter 3: {ref}`03:How-to-package-a-Python`** that our `partypy` package has two main functions: `partypy.simulate.simulate_party()` and `partypy.plotting.plot_simulation()`. Users of our package will likely always want to use both these core functions, which currently requires two imports:

```python
from partypy.simulate import simulate_party
from partypy.plotting import plot_simulation
```

We could make life easier for our users by importing these core functions in `partypy`'s `__init__.py` file which would bind them to the top-level package namespace. Let's demonstrate all this through example. Your first thought at this point might be "well, I can already access my package's functionality with a single `import` using the following code":

```{prompt} python >>> auto
>>> import partypy
```

However, if we check our imported package's namespace, we'll see that it's empty:

```{prompt} python >>> auto
>>> [name for name in dir(partypy) if not name.startswith("__")]
```

```python
[]
```

By default, the modules and sub-packages of a package aren’t imported when you import the entire package (i.e., they aren't bound to its namespace). However, as discussed above, we can bind names to the package's namespace using the `__init__.py` file. Let's change `partypy`'s `__init__.py` file to contain some package-level documentation, and import the `partypy.simulate.simulate_party()` and `partypy.plotting.plot_simulation()` functions:

```python
"""
partypy
=======

`partypy` is a Python package for simulating guest attendance at a party.

Usage
-----

Users can import the entire `partypy` package as follows::
  >>> import partypy
Here is a typical workflow::
  >>> results = partypy.simulate_party([0.1, 0.5, 0.9], simulations=5)
  >>> histogram = partypy.plot_simulation(results)
  >>> histogram
  ... alt.Chart(...)
Use the built-in ``help`` function to view a function's docstring::
  >>> help(partypy.simulate_party)
"""


__version__ = "0.1.0"


from partypy.simulate import simulate_party
from partypy.plotting import plot_simulation

```

````{note}
We're using absolute imports to import the `partypy.simulate.simulate_party()` and `partypy.plotting.plot_simulation()` functions in the code above, but we could also use relative imports:

```python
from .plotting import plot_simulation
from .simulate import simulate_party
```

````

Now let's restart our Python session, re-import `partypy` and check its namespace:

```{prompt} python >>> auto
>>> exit()
```

```{prompt} bash \$ auto
$ python
```

```{prompt} python >>> auto
>>> import partypy
>>> [name for name in dir(partypy) if not name.startswith("__")]
```

```python
['plot_simulation', 'plotting', 'simulate', 'simulate_party', 'utils']
```

```{note}
In the output above, we see the two functions (`plot_simulation` and `simulate_party`) we explicitly imported in `__init__.py` as well as their respective modules (`plotting` and `simulate`) mapped to our package's namespace. This is because to import each of these functions, their respective modules must also be executed. You can read more about Python's import system in the [official Python documentation](https://docs.python.org/3/reference/import.html).
```
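
The behaviour described in the note above can be reproduced with a tiny throwaway package. In this sketch the package name `init_demo` is hypothetical; its `__init__.py` imports one function from a `simulate` module, and we then inspect the package namespace in the same way as before.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package "init_demo" with one module and a re-export in __init__.py.
root = Path(tempfile.mkdtemp())
pkg = root / "init_demo"
pkg.mkdir()
(pkg / "simulate.py").write_text("def simulate_party():\n    return 'simulated'\n")
(pkg / "__init__.py").write_text("from init_demo.simulate import simulate_party\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()
init_demo = importlib.import_module("init_demo")

# Importing the function also bound its module to the package namespace.
public_names = [name for name in dir(init_demo) if not name.startswith("__")]
print(public_names)  # ['simulate', 'simulate_party']
```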

By making this small change to our `__init__.py` file, we can now access the core functionality of our package very easily:

```{prompt} python >>> auto
>>> partypy.simulate_party([0.1, 0.5, 0.9], simulations=5)
```

```python
            Total guests
Simulation              
1                      2
2                      2
3                      1
4                      1
5                      2
```

We can also now view high-level documentation of our package using the built-in `help()` function:

```{prompt} python >>> auto
>>> help(partypy)
```

```
Help on package partypy:

NAME
    partypy

DESCRIPTION
    partypy
    =======
    
    `partypy` is a Python package for simulating guest attendance at a party.
    
    Usage
    -----
    
    Users can import the entire `partypy` package as follows::
      >>> import partypy
    ...
```

### Package layout

In **Chapter 3: {ref}`03:How-to-package-a-Python`** we defined the file and directory structure of our package as follows:

```
partypy
├── ...
└── src
    └── partypy
        ├── __init__.py
        └── partypy.py
```

This is sometimes called the "src" or "source" layout due to putting all our source code in the `src` folder. It is also common to see packages without a "src" folder, in this format:

```
partypy
├── ...
└── partypy
    ├── __init__.py
    └── partypy.py
```

In general, the "src" layout is considered to be better packaging practice and is now endorsed by the [Python Packaging Authority](https://packaging.python.org/tutorials/packaging-projects/?highlight=src#creating-the-package-files). There are some excellent blogs discussing why this is the case (e.g., [Ionel Cristian Mărieș](https://blog.ionelmc.ro/2014/05/25/python-packaging/) and [Hynek Schlawack](https://hynek.me/articles/testing-packaging/)), but the main reason for using the "src" layout is that it forces you to install your package before testing. The importance of this is difficult to appreciate without examples, and we'll revisit this point later in **Chapter 5: {ref}`05:Testing`**, but if you're new to packaging, or unsure of what structure to use, go with the "src" layout - it could save you time and tears in the long run.

### Including non-code files in a package

Up until now we've mostly been focusing on the code content (modules and sub-packages) of packages, but if you've been following along from **Chapter 3: {ref}`03:How-to-package-a-Python`**, you'll be wondering about all the other non-code files in our project too - things like tests, documentation, etc. Recall that the layout of our `partypy` project looks like this:

```
partypy
├── .gitignore
├── .readthedocs.yml
├── CHANGELOG.rst
├── CONDUCT.rst
├── CONTRIBUTING.rst
├── docs
├── LICENSE
├── pyproject.toml
├── README.md
├── src
└── tests
```

The `src` folder contains the guts of our package (modules and sub-packages). We'll talk about building packages for distribution in the next section, but the contents of the `src` folder are typically all that's included by default (along with some metadata) with a package built for distribution. The other components of the project (if they exist) are often shared separately, for example via a collaborative open-source service like GitHub.

However, it is possible to include any arbitrary additional files in your package for distribution. What you include typically depends on the purpose of your package, the intended audience, and the method of distribution. For example, open-source packages don't typically include their documentation (`docs` folder) with their package; instead, documentation is usually hosted online (via a GitHub repository and/or dedicated documentation website like [Read the Docs](https://readthedocs.org/)). Similarly, some packages will ship a `tests` folder as part of their package, while others will share them separately (via a service like GitHub) for those who actually wish to develop on the package, rather than just use it. If you're sharing a package privately within an organization, you may wish to ship everything with the package (documentation, tests, changelog, etc.).

Regardless of your use-case, most packaging tools make it very easy to include/exclude extra contents of your package. In `poetry`, it's as simple as specifying a list of arguments to the keyword `include` or `exclude` in the `pyproject.toml` file under the `[tool.poetry]` heading. For example, if we wanted to include the `tests` folder and `CHANGELOG.rst` file as part of our package, we would add the following:

```toml
[tool.poetry]
...
include = ["tests/*", "CHANGELOG.rst"]
```

These files would then be included in the built version of our package that we distribute to others.

### Including data in a package

Including data with a package is a common use case, so it's worth a short section of its own. Often data may be required to actually use some of the functionality of a package, or a packager may just wish to ship some example data to help demonstrate the functionality of the package. There are two common ways to do this:

1. Include the raw data as part of the package and provide code to help users load it. This option is well-suited to smaller data files or for data that the package absolutely depends on and must be shipped with the package.
2. Include script(s) as part of the package that help download the data from an external source. This option is suited to large data files, or ones that a user may only need optionally.

Let's first briefly demonstrate option 1 above with an example. We'll add a sample data set to our `partypy` package from **Chapter 3: {ref}`03:How-to-package-a-Python`**. The data set we'll be including is a csv file called `party.csv`, containing a sample guest list of 100 guests, including their (fake) name, and probability of attending a hypothetical party. It looks like this:

```csv
name, probability_of_attendance
Donovan Willis, 0.70
Jocelyn Navarro, 0.70
Houston Stein, 0.90
Carlos Mullins, 0.50
Bridger Pruitt, 0.70
..., ...
Maddox Santana, 0.50
Ariel Proctor, 0.50
Pedro Hull, 0.90
Janessa Collins, 0.95
Kendrick Burke, 0.30
```

To include this data set in our package we need to do two things:

1. Include the raw csv file in our package; and,
2. Include code to help a user load the csv file.

We'll start by creating a new "datasets" module in our package, as well as a "data" folder that includes the raw `party.csv` data file:

```
partypy
├── ...
└── src
    └── partypy
        ├── __init__.py
        ├── plotting.py
        ├── simulate.py
        ├── datasets.py
        └── data
            └── party.csv
```

All that's needed now is to add some code to `datasets.py` to help users load `party.csv`. That code is fairly simple, as shown below:

```python
import pandas as pd
from os.path import dirname, join


def load_party():
    """Return a dataframe of 100 party guests.

    Contains the following fields:
        name                           100 non-null object
        probability_of_attendance      100 non-null float

    Returns
    -------
    pandas.DataFrame
        DataFrame of party guest names and probabilities of attendance.

    Examples
    --------
    >>> data = load_party()
    >>> data.head()
                  name  probability_of_attendance
    0   Donovan Willis                        0.7
    1  Jocelyn Navarro                        0.7
    2    Houston Stein                        0.9
    3   Carlos Mullins                        0.5
    4   Bridger Pruitt                        0.7
    """
    module_path = dirname(__file__)  # directory location of datasets.py module
    data_path = join(module_path, "data", "party.csv")  # location of party.csv
    return pd.read_csv(data_path)

```
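
The `__file__`-based lookup above assumes the package is installed as ordinary files on disk. The standard library's `importlib.resources` module offers another way to locate packaged data; note that this is a different technique from the one used in `datasets.py` above, shown here only as a sketch. It assumes Python 3.9 or later (for `files()`), uses the standard `csv` module rather than `pandas` to stay self-contained, and builds a hypothetical package called `data_demo` in a temporary directory.

```python
import csv
import importlib
import sys
import tempfile
from importlib.resources import files  # available in Python 3.9+
from pathlib import Path

# Hypothetical package "data_demo" containing a tiny data/party.csv file.
root = Path(tempfile.mkdtemp())
data_dir = root / "data_demo" / "data"
data_dir.mkdir(parents=True)
(root / "data_demo" / "__init__.py").write_text("")
(data_dir / "party.csv").write_text(
    "name,probability_of_attendance\nDonovan Willis,0.70\nJocelyn Navarro,0.70\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()


def load_party_rows(package="data_demo"):
    """Return the rows of the packaged party.csv as a list of dictionaries."""
    resource = files(package) / "data" / "party.csv"
    with resource.open("r", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


rows = load_party_rows()
print(rows[0]["name"], rows[0]["probability_of_attendance"])  # Donovan Willis 0.70
```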

That's all there is to it. Let's re-install our package and see if we can import that data:

```{prompt} bash \$ auto
$ poetry install
```

Open an interactive Python session:

```{prompt} bash \$ auto
$ python
```

Then load our package's new example data set with the following code:

```{prompt} python >>> auto
>>> from partypy.datasets import load_party
>>> data = load_party()
>>> data.head()
```

```python
              name  probability_of_attendance
0   Donovan Willis                        0.7
1  Jocelyn Navarro                        0.7
2    Houston Stein                        0.9
3   Carlos Mullins                        0.5
4   Bridger Pruitt                        0.7
```

Looks like everything is working! Option 2 (providing scripts to download data from an external source) can be similarly implemented, except that the code will download data from an external source and save it on a user's local device, rather than load it locally from the installed package.
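
As a rough sketch of what option 2 might look like, the helper below downloads a file into a local cache directory the first time it is requested and reuses the cached copy afterwards. The function name `fetch_dataset` and the cache location are hypothetical, and `urllib.request.urlretrieve` is just one of several ways to fetch a file. So that the example runs without network access, a local file addressed with a `file://` URL stands in for the remote source; a real package would point at an `https://` URL instead.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_dataset(url, cache_dir, filename):
    """Download `url` into `cache_dir` once and return the local file path."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / filename
    if not local_path.exists():  # skip the download if we already have the file
        urllib.request.urlretrieve(url, str(local_path))
    return local_path


# Demonstration without network access: a local file stands in for the remote
# source, addressed with a file:// URL.
source_dir = Path(tempfile.mkdtemp())
source_file = source_dir / "party.csv"
source_file.write_text("name,probability_of_attendance\nDonovan Willis,0.70\n")

cache_dir = Path(tempfile.mkdtemp()) / "partypy_cache"  # hypothetical location
downloaded = fetch_dataset(source_file.as_uri(), cache_dir, "party.csv")
print(downloaded.read_text() == source_file.read_text())  # True
```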

At this point, we've made a few changes to our package (like updating the `__init__.py` file and including example data) and we'd likely want to write some more tests to cover any new code we wrote, run all our tests, and release a new version of the package. We'll explore testing in **Chapter 5: {ref}`05:Testing`** and versioning and releasing updated packages in **Chapter 7: {ref}`07:Releasing-and-versioning`**.

## Package distribution

Up until now we've mostly been talking about the structure and content of packages, but the final step in the packaging workflow is to create a "distribution package" (which many people in the Python community still refer to simply as a "package"). Right now, our package is a collection of files and folders that's difficult to share, so the goal is to bundle it up into a distribution that can be easily shared (via a file-sharing service or by downloading from a public or private hosting repository) and then easily installed by users (which could just be you) with a simple `pip install` (or equivalent). This distribution should include the source code for your package, along with necessary metadata such as the dependencies of the package, a license (which tells users the terms under which they can use your package), etc.

Modern packaging tools like `poetry` (and others we'll discuss in a later section) aim to make the process of creating distributions as seamless and accessible as possible, so that developers don't have to worry too much about the low-level details. However, it's useful for prospective Python packagers to have a basic knowledge of what a distribution is and the two primary distribution types in Python: source distributions and built distributions.

### How to build a distribution package

Modern packaging tools like `poetry` and `flit` (which we'll discuss briefly at the end of this chapter) make use of the `pyproject.toml` file to define the metadata of your project, its dependencies, and how it should be built. We added various components to this file throughout **Chapter 3: {ref}`03:How-to-package-a-Python`**. The file is human-readable and fairly self-explanatory and we won't go into detail here, although interested readers can take a look at [PEP 518 - Specifying Minimum Build System Requirements for Python Projects](https://www.python.org/dev/peps/pep-0518/) which introduced `pyproject.toml`. The pragmatic takeaway for prospective packagers is that different packaging tools will outline the required and optional sections of the file; for example, see the `poetry` [documentation on `pyproject.toml`](https://python-poetry.org/docs/pyproject/). Tools like `poetry` will use this file to build distribution packages of your package, usually in the form of a source distribution and/or built distribution.

### Source distributions

A source distribution, or "sdist", is a single, compressed archive (e.g., `.tar.gz` or `.zip`) of the metadata and the source files needed to install your package. A user can download this archive, unpack it, and run `pip install` to install it. For example, in **Chapter 3: {ref}`03:How-to-package-a-Python`** we used `poetry build` to create distributions for our `partypy` package. A new directory labelled `dist` was created in our working directory:

```
partypy
├── ...
└── dist
    ├── partypy-0.1.0.tar.gz
    └── partypy-0.1.0-py3-none-any.whl
```

The `.tar.gz` compressed archive is our source distribution. If we uncompress it, we'd find the following contents:

```
partypy-0.1.0
├── LICENSE
├── PKG-INFO
├── pyproject.toml
├── README.md
├── setup.py
└── src
    └── partypy
        ├── __init__.py
        ├── plotting.py
        ├── simulate.py
        ├── datasets.py
        └── data
            └── party.csv
```

These are the files that `pip` would use to build and install the package. Discussing how `pip` works is beyond the scope of this book (although interested readers can read about it in the [`pip` documentation](https://pip.pypa.io/en/stable/)), but at this point, it's important to note that one of the most powerful features of Python is its ability to interoperate with libraries written in other languages, for example, C. Developers sometimes choose to take advantage of this interoperability and include code from other languages in their package to make their code faster, access libraries written in other languages, and generally improve the functionality of their code. While Python is typically referred to as an interpreted language (i.e., your Python code is translated to machine code as it is executed), languages such as C require compilation before they can be used (i.e., your code must be translated into "machine code" *before* it can be executed).

Source distributions are "unbuilt" and require a build step before they can be installed - this basically means compiling any non-Python source code and extracting the package metadata. For packages written in pure Python, sdists are a perfectly acceptable way to distribute your package. However, for packages that rely on non-Python code (typically called "extensions"), many users may not have the tools, experience, or time to build the package themselves.
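
Because an sdist is just a compressed archive, we can inspect one with Python's standard-library `tarfile` module. The sketch below is illustrative only: rather than assuming `partypy-0.1.0.tar.gz` is available, it first creates a tiny stand-in archive for a hypothetical package called `demo_pkg` (the helper names `make_demo_sdist` and `list_sdist` are our own), and then lists its contents:

```python
import io
import tarfile
import tempfile
from pathlib import Path


def make_demo_sdist(directory):
    """Create a tiny stand-in sdist (hypothetical contents) and return its path."""
    path = Path(directory) / "demo_pkg-0.1.0.tar.gz"
    files = {
        "demo_pkg-0.1.0/PKG-INFO": b"Metadata-Version: 2.1\nName: demo-pkg\nVersion: 0.1.0\n",
        "demo_pkg-0.1.0/pyproject.toml": b"",
        "demo_pkg-0.1.0/src/demo_pkg/__init__.py": b"",
    }
    with tarfile.open(path, mode="w:gz") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return path


def list_sdist(path):
    """Return the sorted member names of a .tar.gz source distribution."""
    with tarfile.open(path, mode="r:gz") as archive:
        return sorted(archive.getnames())


with tempfile.TemporaryDirectory() as tmp:
    sdist_members = list_sdist(make_demo_sdist(tmp))

for member in sdist_members:
    print(member)
```

Pointing `list_sdist()` at a real sdist, such as the one in your own `dist` directory, should show a layout similar to the tree above.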

A useful analogy for all this is to imagine making a cake. A pure Python source distribution is a boxed cake mix: simply add liquid and bake. In contrast, an sdist relying on non-Python code is a collection of all the individual ingredients (flour, eggs, butter, etc.) and requires more time, effort, expertise, and tools to make the cake. At this point, you may be thinking - wouldn't it just be easier if you could distribute the baked cake, so users don't have to make it at all? Well, that would be a "built distribution".

### Built distributions

A built distribution is a distribution containing source files and metadata that has been pre-built and so does not require a build step before installation (unlike an sdist) - its files only need to be moved to the correct location on your system to be installed. The main built distribution format used by Python is the `wheel`, denoted with the extension `.whl`, and it is the preferred method of distribution in Python. In fact, `pip` prefers a compatible wheel over an sdist when both are available because installation is simpler and faster, even for pure-Python packages (which don't require compilation of non-Python code).

Built distributions containing non-Python code are built for specific operating systems (because different operating systems require different machine code). However, developers don't typically build a wheel for every single operating system, only the most common ones. For this reason, built distributions are usually provided with their corresponding source distributions; if a wheel isn't available for a particular operating system, users will still be able to (try to) build the package from source. As an example, take a look at the downloadable distributions of [`NumPy` on PyPI](https://pypi.org/project/numpy/#files) - you'll see `wheels` for most common platforms, as well as a source distribution at the bottom of the list. There are actually three types of wheels depending on how specific they are to a version of Python and/or an operating system:

1. *Universal wheels*: pure-Python and support Python 2 and 3. Can be installed anywhere using `pip`; 
2. *Pure Python wheels*: pure-Python but don’t support both Python 2 and 3; and,
3. *Platform wheels*: binary package distributions specific to certain platforms as a result of containing compiled extensions.

You can tell a lot about a `wheel` from the name itself which follows a [strict naming convention](https://www.python.org/dev/peps/pep-0427/#file-name-convention): `{distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl`. For example, the `NumPy` wheel `numpy-1.18.1-cp37-cp37m-macosx_10_9_x86_64.whl` tells us that:

- The distribution is `numpy v1.18.1`;
- It is made for CPython 3.7 (`cp37`);
- It is specific to the `macosx_10_9_x86_64` platform (i.e., this is a "platform wheel" because it is platform-specific).

For our current `partypy` wheel `partypy-0.1.0-py3-none-any.whl`:

- The distribution is `partypy v0.1.0`;
- It is made for Python 3+;
- It is not specific to any one operating system (`any`), so it is a pure-Python wheel.
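
The naming convention above is regular enough that we can pull the tags out of a wheel's file name with a few lines of Python. The function below is a minimal sketch that assumes a well-formed file name; the name `parse_wheel_filename`, the dictionary keys, and the `pure_python` flag are our own choices for illustration. For real projects, `packaging.utils.parse_wheel_filename` from the third-party `packaging` library is a more robust option.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into the components described in PEP 427."""
    if not filename.endswith(".whl"):
        raise ValueError(f"Not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        # The optional build tag is absent
        distribution, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        distribution, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"Unexpected number of components in {filename!r}")
    return {
        "distribution": distribution,
        "version": version,
        "build_tag": build_tag,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
        # Illustrative flag: no ABI requirement and no platform restriction
        "pure_python": abi_tag == "none" and platform_tag == "any",
    }


numpy_wheel = parse_wheel_filename("numpy-1.18.1-cp37-cp37m-macosx_10_9_x86_64.whl")
partypy_wheel = parse_wheel_filename("partypy-0.1.0-py3-none-any.whl")

print(numpy_wheel["python_tag"], numpy_wheel["platform_tag"], numpy_wheel["pure_python"])
print(partypy_wheel["python_tag"], partypy_wheel["platform_tag"], partypy_wheel["pure_python"])
```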

The vast majority of readers will never deal with building extensions in other languages for their Python package, and so won't have to worry about making different built distributions of their package - the default outputs and configuration of modern packaging tools like `poetry` will take care of everything. However, we refer those interested in learning more to the [Python Packaging Authority guide](https://packaging.python.org/guides/packaging-binary-extensions/).

### Installed packages

An installed package is a distribution that’s been decompressed, built (in the case of an `sdist`) and then copied to your chosen installation directory. The default "chosen installation directory" varies by platform and by how you installed Python. 
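
To see where packages are installed for the Python interpreter you are currently running, one option is the standard-library `sysconfig` module. The snippet below is a minimal sketch (the variable names are our own, and the printed paths will differ between machines and environments):

```python
import sysconfig

# Dictionary of installation paths for the current interpreter
install_paths = sysconfig.get_paths()

# "purelib": where pure-Python packages are installed
# "platlib": where packages containing compiled extensions are installed
purelib = install_paths["purelib"]
platlib = install_paths["platlib"]

print(f"Pure-Python packages: {purelib}")
print(f"Platform-specific packages: {platlib}")
```

In many setups both entries point to the same `site-packages` directory.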

"Installing" a package (e.g., with `pip install XXX`) is really a two-step process: 1) building the package (in the case of an `sdist`), and 2) installing the package. Using wheels takes out the first step, meaning we only need to install. The install step is fairly simple: it essentially copies the decompressed package files to the installation directory.

In fact, we could manually install a package ourselves by decompressing a `wheel` and copying the files to their appropriate locations. There's no real reason to do this: it's far more effort than a one-liner at the command line, it does not resolve dependencies (so it could break your installation), and it probably has other unwanted side-effects. However, it's a nice way to learn about the package installation process, so if you'd like to give it a go, you can try the following steps (which assume macOS and the [`conda` package manager](https://docs.conda.io/en/latest/)):


1. Create a new virtual environment called "manualpkg" to act as a safe, test playground:

    ```{prompt} bash \$ auto
    $ conda create --name manualpkg python=3.9
    $ conda activate manualpkg
    ```
    
2. You can find a toy wheel to download in the GitHub repository of this book [here](https://github.com/UBC-MDS/py-pkgs/blob/master/docs/toy-pkg/dist/toy_pkg-0.0.1-py3-none-any.whl) (although you can try this manual installation procedure with any wheel). In the code below we'll `cd` into the test environment's install location, download this wheel, and unzip it:

    ```{prompt} bash \$ auto
    $ cd /opt/miniconda3/envs/manualpkg/lib/python3.9/site-packages
    $ curl -O https://raw.githubusercontent.com/UBC-MDS/py-pkgs/master/docs/toy-pkg/dist/toy_pkg-0.0.1-py3-none-any.whl
    $ unzip toy_pkg-0.0.1-py3-none-any.whl
    ```

3. This will result in two new unzipped directories: `toy_pkg` and `toy_pkg-0.0.1.dist-info`. From the terminal we can now start a Python session and try importing this package: 

    ```{prompt} bash \$ auto
    $ python
    ```
    
    ```{prompt} python >>> auto
    >>> from toy_pkg.toy_module import test_function
    >>> test_function()
    ```

    ```python
    You manually installed the toy_pkg example! Well done!
    ```
    
4. You can remove the `conda` virtual environment if you wish with the following:

    ```{prompt} bash \$ auto
    $ conda deactivate
    $ conda env remove --name manualpkg
    ```
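
The same idea can be sketched entirely in Python, without touching a real environment. The example below is an illustration under stated assumptions, not a real installer: it writes a tiny stand-in wheel for a hypothetical package `demo_pkg` (a real wheel's `.dist-info` directory also contains files such as `WHEEL` and `RECORD`), unzips it into a temporary directory that plays the role of `site-packages`, and then imports from it:

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # 1. Build a stand-in wheel (a wheel is a zip archive with a .whl extension)
    wheel = tmp / "demo_pkg-0.0.1-py3-none-any.whl"
    with zipfile.ZipFile(wheel, "w") as archive:
        archive.writestr("demo_pkg/__init__.py", "")
        archive.writestr(
            "demo_pkg/demo_module.py",
            "def greet():\n    return 'Hello from demo_pkg!'\n",
        )
        archive.writestr(
            "demo_pkg-0.0.1.dist-info/METADATA",
            "Metadata-Version: 2.1\nName: demo-pkg\nVersion: 0.0.1\n",
        )

    # 2. "Install" it by unzipping into a directory standing in for site-packages
    site_dir = tmp / "fake-site-packages"
    site_dir.mkdir()
    with zipfile.ZipFile(wheel) as archive:
        archive.extractall(site_dir)

    # 3. Make that directory importable, then import and call the function
    sys.path.insert(0, str(site_dir))
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("demo_pkg.demo_module")
        message = module.greet()
    finally:
        sys.path.remove(str(site_dir))

print(message)
```

The real `site-packages` directory is already on Python's import path, which is why the manual steps above needed no equivalent of the `sys.path` manipulation used here.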

### Packaging tools

The focus of this book is on workflows and tools that make packaging accessible and efficient. `poetry` is one of those tools. It abstracts many of the lower-level details away from the packager and provides an easy-to-use CLI to develop, build, and publish a package - so developers can focus on writing code and not worry about the nuances of sharing it.

Another tool worth mentioning here is `flit`. `flit` is comparable to a slightly stripped-down `poetry`: it is a Python package that provides a simple way to put Python packages and modules on PyPI. It is similarly configured with the `pyproject.toml` file and provides CLI commands such as `build`, `install`, and `publish`. At the time of writing, the main difference between `flit` and `poetry` is that `flit` doesn’t help you manage or resolve dependencies. Dependency management is a useful feature, especially for beginner and intermediate packagers, which is why we typically choose to use `poetry` for our package development. `poetry` also implicitly supports the use of virtual environments, but this is a feature we haven't used in this book, in favor of using `conda`. This is mostly a personal choice, but `conda` can also be used with `flit`, for example, which illustrates one of the reasons we chose it - it's universal.

A caveat on the use of `poetry` and `flit` is that, at the time of writing, they only support pure Python projects. This will be completely fine for the vast majority of prospective packagers. However, for those looking to build more advanced packages that include non-Python code, we recommend reading this documentation from the [Python Packaging Authority](https://packaging.python.org/guides/packaging-binary-extensions/#cross-platform-wheel-generation-with-scikit-build).