# Software Environment and Cloud Computing

[![Dependency](fig/dependency.png)](https://xkcd.com/2347/)

In modern computational science, running code is only half the
challenge.
The other half is managing the software environment it depends on.
Different projects may require different compilers, libraries, or
Python packages, and the ability to reproduce results across HPC
systems, laptops, and cloud platforms is important.

This week, we will learn practical tools for creating and managing
software environments:
* Package managers (e.g., `apt-get`) for installing software at the
  system level.
* Building from source when precompiled packages are unavailable or
  when you do not have root access to the system.
* HPC `module` for loading different versions of software on a shared
  system.
* Python virtual environments for dependency isolation.
* Containers (e.g., Docker) for reproducibility and portability.

## Software Management Basics

On Unix/Linux systems, most software is installed through package
managers, which automatically handle downloading, installing, and
configuring programs and their dependencies.

Common package managers include:
* Debian/Ubuntu: [`apt`](https://wiki.debian.org/Apt) (or the older `apt-get`)
* Red Hat/Fedora/CentOS: [`yum`](https://fedoraproject.org/wiki/Yum) or [`dnf`](https://docs.fedoraproject.org/en-US/quick-docs/dnf/)
* Arch Linux: [`pacman`](https://wiki.archlinux.org/title/Pacman)
* macOS: [`brew`](https://brew.sh/) (Homebrew) and [`port`](https://www.macports.org/) (MacPorts)

[![sudo](fig/sandwich.png)](https://xkcd.com/149/)

By default, package managers install software system-wide, which
requires root (administrator) privileges.
On personal machines, this is often done by prefixing commands with
`sudo` (i.e., "superuser do").

On shared systems such as HPC clusters, users typically do not have
root access, so alternative approaches are used (modules, virtual
environments, containers).
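
Whether you can install system-wide comes down to your user ID: root is always ID 0.
A minimal sketch for checking this (the variable names are illustrative, not part of any tool):

```bash
# Root always has user ID 0; any other ID needs sudo for system-wide installs.
uid=$(id -u)
if [ "$uid" -eq 0 ]; then
    is_root=yes
else
    is_root=no
fi
echo "uid=$uid is_root=$is_root"
```

Inside the Docker sandbox used below you are root by default, so no `sudo` is needed there.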

1. Here, we use Docker to create a sandbox environment where you can
   safely practice system-level package management commands.
   We will cover Docker in more detail later, but for now, assuming
   Docker is installed, you can start a sandbox with:
   ```bash
   docker run -it --rm debian:forky-slim
   ```

2. Inside your Docker container, update the package list (always do
   this first):
   ```bash
   apt update
   ```

3. Install a simple system utility (e.g., `htop`) and a scientific
   library (e.g., GNU Scientific Library `gsl`):
   ```bash
   apt install -y htop
   apt install -y libgsl-dev
   ```

4. You may now verify the installation:
   ```bash
   htop                 # should start the process viewer
   dpkg -l | grep gsl   # list installed gsl package
   ```
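
For use in scripts, a non-interactive check is often more convenient than launching the program.
A minimal sketch using the POSIX `command -v` builtin; the helper name `check_tool` and the tool names passed to it are only examples:

```bash
# Report where a tool lives on PATH, or that it is missing.
check_tool() {
    if path=$(command -v "$1"); then
        echo "$1: $path"
    else
        echo "$1: not installed"
    fi
}

check_tool sh      # present on any Unix-like system
check_tool htop    # present only if the install above succeeded
```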

## Building from Source

Not all software is available in package repositories, or packaged
the way you want it.
Sometimes the version provided by the package manager is outdated, or
the software is not packaged at all.
Sometimes you may want to enable some special settings or install a
package in non-standard locations.
In these cases, you can download the source code and compile it
yourself.

[![Virus on Linux](fig/virus.png)](https://www.gnu.org/fun/jokes/evilmalware.en.html)

This process usually follows four steps:
1. **Configure** the build system (check dependencies, set options).
2. **Compile** the source code.
3. **Check** the resulting binaries by running the test suite.
4. **Install** the resulting binaries under the installation `prefix`.

We will try it out in a Docker sandbox.

1. Start a sandbox in a `gcc` Docker container, then download GSL:
   ```bash
   docker run -it --rm gcc
   # The remaining commands run inside the container.
   mkdir /src && cd /src
   wget https://ftp.wayne.edu/gnu/gsl/gsl-2.8.tar.gz
   tar -xvzf gsl-2.8.tar.gz && cd gsl-2.8
   ```

2. Configure, build, check, and install:
   ```bash
   ./configure
   make
   make check
   make install
   ```

3. Optionally, you may test it by compiling a small program:
   ```c
   /* Save as "test.c" and then compile with `gcc test.c -o test -lgsl -lgslcblas -lm` */
   #include <stdio.h>
   #include <gsl/gsl_sf_bessel.h>
   
   int main() {
       double x = 2.4048; /* first root of J0 */
       double y = gsl_sf_bessel_J0(x);
       printf("J0(%g) = %.18e\n", x, y);
       return 0;
   }
   ```

In [None]:
# HANDSON: how do you get the above source code in a Docker container?


In [None]:
# HANDSON: suppose that you don't have root access to a machine.
#          Let's install GSL to "/home/me/.local/" instead of the
#          standard location "/usr/local/".
#          How do you do it?
#          Hint: run `./configure --help` to see all the different
#          options.


## HPC Software Modules

On shared HPC systems, users do not have root access, so they cannot
install software with package managers like `apt`.
Instead, HPC centers provide software through the
[environment modules system](https://modules.readthedocs.io/en/latest/).
Modules let you load and switch between different software versions by
adjusting your environment variables (e.g., `PATH`,
`LD_LIBRARY_PATH`).
This avoids conflicts and allows multiple versions of compilers,
libraries, and applications to coexist.
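
Conceptually, loading a module is little more than editing these variables.
The sketch below imitates that by hand with a throwaway directory; `mytool_demo` is a made-up program, and real module systems do this through modulefiles rather than manual `PATH` edits:

```bash
# Stand-in for a module's install tree (hypothetical tool "mytool_demo").
fake_root=$(mktemp -d)
mkdir -p "$fake_root/bin"
printf '#!/bin/sh\necho "mytool_demo from fake module"\n' > "$fake_root/bin/mytool_demo"
chmod +x "$fake_root/bin/mytool_demo"

# "Loading" the module: prepend its bin directory to PATH.
old_path=$PATH
PATH="$fake_root/bin:$PATH"
loaded=$(command -v mytool_demo)    # resolves inside $fake_root/bin

# "Unloading" the module: restore the previous PATH.
PATH=$old_path
unloaded=$(command -v mytool_demo || echo "not found")

echo "loaded:   $loaded"
echo "unloaded: $unloaded"
rm -rf "$fake_root"
```

Because the change only touches environment variables of the current shell, it disappears when you log out, which is why modules must be loaded again in every new session or job script.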

Notes for UA HPC:
* Education (class) accounts are only available on `ocelote`.
* You must run the command `interactive` to request an interactive
  compute node before using `module` or compiling software.

1. Log in to UA HPC and switch to ocelote:
   ```bash
   ssh NETID@hpc.arizona.edu
   shell
   ocelote
   ```

2. Request an interactive node (wait for the prompt to change):
   ```bash
   interactive
   ```

3. List all available modules:
   ```bash
   module avail
   ```
   You should see a long list of compilers, MPI libraries, Python
   versions, and scientific software.

4. List currently loaded modules:
   ```bash
   module list
   ```

5. Load a specific GSL module:
   ```bash
   module load gsl
   ```

6. Compile the above test program.

In [None]:
# HANDSON: explore the available modules (`module avail`) and pick
#          one piece of scientific software that looks interesting to
#          you (e.g., `gcc`, `openmpi`, `gsl`).
#          Load it with `module load NAME/VERSION` and try a simple
#          test (e.g., check the compiler version with
#          `gcc --version`, or run `mpirun --version`).
#          Compare the output to what you get without loading the
#          module.


## Python Virtual Environments

In scientific computing, different projects often need different
Python packages, or even different versions of the same package.
Installing everything system-wide can quickly lead to dependency
conflicts, a.k.a. "dependency hell".

A virtual environment solves this problem by creating an isolated
Python workspace where you control exactly which packages are
installed, independent of the system or other projects.
This is similar to HPC modules, but focused on Python.

[![Python Environment](fig/python_environment.png)](https://xkcd.com/1987/)

1. On your system or HPC node, check the default Python version.
   Then use `venv` to create a directory that holds everything
   belonging to the virtual environment.

   ```bash
   python3 --version
   python3 -m venv ~/.venv/astr501
   ```

2. Activate the virtual environment by sourcing the "activate" file.
   ```bash
   . ~/.venv/astr501/bin/activate
   ```

3. Use `pip` to install packages with specific versions, e.g.,
   ```bash
   pip install numpy==2.0.0
   ```

4. Check that your virtual environment contains the requested package
   version:
   ```bash
   pip freeze | grep numpy
   ```

5. Exit the virtual environment and check that your system packages
   are not affected.
   ```bash
   deactivate
   pip freeze | grep numpy  # this is the system pip
   ```
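
To see what activation actually changes, the sketch below builds a throwaway environment and inspects it.
It assumes `python3` with the standard `venv` module is available; the directory name `demo-env` is arbitrary, and `--without-pip` is used only to keep the demo fast and offline:

```bash
# Create a throwaway environment in a temporary directory.
envdir=$(mktemp -d)/demo-env
python3 -m venv --without-pip "$envdir"

# Activation prepends the environment's bin directory to PATH.
. "$envdir/bin/activate"
active_python=$(command -v python)      # resolves inside $envdir/bin
in_venv=$(python -c 'import sys; print(sys.prefix != sys.base_prefix)')
echo "python:  $active_python"
echo "in venv: $in_venv"                # True while the environment is active

# Deactivation restores the previous PATH.
deactivate
rm -rf "$(dirname "$envdir")"
```

The comparison `sys.prefix != sys.base_prefix` is the usual way to detect a virtual environment from inside Python.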

In [None]:
# HANDSON: create a new virtual environment for your research
#          project, e.g., `galaxy-env` or `blackhole-env`.
#          Install an astronomy package that is not in the system
#          Python (e.g., `yt`, `plasmapy`).
#          Write a short Python script that uses your chosen package
#          and run it inside the environment.
#          Verify that this package is not available in the system
#          Python after deactivation.


## Containers for Reproducibility

So far we have seen:
* Package managers install software at the system level.
* Building from source gives flexibility when no package exists or
  customization is needed.
* Modules solve the software-version and "no root access" problems on
  HPC and shared workstations.
* Virtual environments isolate Python packages.

A container (e.g., Docker) can bundle all of these layers, from the
operating system and compilers to libraries and Python packages, into
a single portable image.
This ensures that your code will run the same way on your laptop, an
HPC cluster, or the cloud.

We already used Docker to create sandboxes above.
Let's now dive deeper into what it actually did.
There are many training materials about Docker online, including this
[CyVerse workshop](https://cyverse-astrocontainers-workshop-2018.readthedocs-hosted.com/en/latest/docker/dockerintro.html).
Depending on the time, we may go through the workshop or run the
following simple example.