# Introduction

1. Part 1: What is numerical software
   1. Why develop numerical software
   2. Hybrid architecture
   3. Numerical software = C++ + Python
2. Part 2: What to learn in this course
   1. Term project
   2. How to write a proposal
   3. Term project grading guideline
   4. Online discussion
3. Part 3: Runtime and course materials
   1. Runtime environment: Linux and AWS
   2. Jupyter notebook

# Part 1: What is numerical software

The digital computer was originally invented to do mathematics.  The mission of the first digital computer, the Electronic Numerical Integrator and Computer ([ENIAC](https://www.britannica.com/technology/ENIAC), 1945), was to quickly compute artillery range tables.  Running at electronic speed, the computer performed a tremendous amount of calculation.

We use digital computers to crunch more and more numbers.  Computer code follows the numerical methods, which are developed from the mathematical formulations.  Sometimes the mathematics has an associated physical problem.  But sometimes, it's just mathematics.  The applications are endless, but here are some famous packages to give you an idea of how useful numerical software is:

* Infrastructure: [NumPy](https://numpy.org)
* Data analytics: [Pandas](https://pandas.pydata.org), [Arrow](https://arrow.apache.org), [PyTables](https://www.pytables.org)
* Linear algebra: [BLAS](http://www.netlib.org/blas/), [LAPACK](http://www.netlib.org/lapack/)
* Geometry: [boost.geometry](https://www.boost.org/doc/libs/1_72_0/libs/geometry/doc/html/index.html)
* Visualization: [VTK](https://vtk.org), [Matplotlib](https://matplotlib.org)
* Machine learning: [PyTorch](https://pytorch.org)

Despite the versatility, numerical software shares common traits:

* Not visually pleasant, oftentimes no graphical user interface
* Knowledge-intensive, unintuitive to code
* Computation-intensive, often incorporating parallelism, distributed computing, and special hardware

Numerical software is developed to solve problems in science or engineering.  It always has an application domain attached, and cannot be handled solely in computer science.  Of course, since it is computer software, it cannot exist without computer science.  Naturally it is cross-disciplinary and demands knowledge and skills in at least 2 fields from its practitioners.

# Why develop numerical software

Numerical software is developed to solve problems that are either impracticable or unmarketable without it.

For the impracticable problems, numerical software simply enables the solution so that we can study them.  Problems in the fields of fluid dynamics and astrophysics are usually of this kind.  For the unmarketable problems, the software will significantly reduce the cost to solution.  Machine learning, visualization, communication, etc., are problems of this kind.

Like developing any software, the true driver must be identified so that the system can be properly specified.  After that, there is a pattern in developing numerical software:

1. Observation
2. Generalize to a theory in math
3. Obtain analytical solutions for simple setup
4. Get stuck with **complex** setup
5. Numerical analysis comes to rescue
6. ... **a lot of code development** ...
7. Release a software package
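
Steps 3 to 5 of the pattern can be illustrated with a small sketch that echoes the artillery range tables mentioned earlier.  Without air drag the range of a projectile has a closed-form solution.  Once drag is added the closed form is gone, and a numerical method takes over.  The drag model, the coefficient `drag`, and the time step `dt` below are assumptions made for illustration, and the explicit Euler method is chosen for brevity rather than accuracy.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def analytic_range(speed, angle_deg):
    """Closed-form range of a projectile launched from the ground, no drag."""
    return speed ** 2 * math.sin(2.0 * math.radians(angle_deg)) / G


def numerical_range(speed, angle_deg, drag=0.0, dt=1.0e-3):
    """Integrate the trajectory with the explicit Euler method.

    `drag` is a hypothetical quadratic drag coefficient per unit mass (1/m).
    """
    theta = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = speed * math.cos(theta), speed * math.sin(theta)
    while True:
        v = math.hypot(vx, vy)
        # Drag opposes the velocity; gravity acts downward.
        ax = -drag * v * vx
        ay = -G - drag * v * vy
        x_new, y_new = x + vx * dt, y + vy * dt
        if y_new < 0.0:
            # Interpolate linearly to the point where the projectile lands.
            return x + (x_new - x) * y / (y - y_new)
        x, y = x_new, y_new
        vx, vy = vx + ax * dt, vy + ay * dt


print("analytic, no drag:   ", analytic_range(100.0, 45.0))
print("numerical, no drag:  ", numerical_range(100.0, 45.0))
print("numerical, with drag:", numerical_range(100.0, 45.0, drag=1.0e-3))
```

The drag-free numerical result serves as a check against the analytical solution before the method is trusted on the case that has no closed form.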

# Hybrid architecture

Computing is about commanding the computers to perform calculations to yield the results that we want to see.  We delegate work to computers as much as possible, but keep the highest possible system performance.

Numerical software usually uses a hybrid architecture to achieve this.  The system is composed of a fast, low-level computing engine and an easy-to-use, high-level scripting layer.  It is usually developed as a platform, working like a library that provides data structures and helpers for problem solving.  The users will use a scripting engine it provides to build applications.  Assembly is allowed in the low-level computing engine to utilize every drop of hardware: multi-core, multi-threading, cache, vector processing, etc.

A general description of the architecture is like the following layers, from high-level to low-level:

* External result
  * This is presented in a non-technical way to people outside the problem-solving team.  They can be stakeholders for business or general public.  The result has to be generated in some way, which may or may not be included in the numerical software we make.
* Problem presentation: physics, math, or equations
  * Users use the software or associated tools to present the technical result.
* Scripting or configuration
  * Users follow the example scripts to configure the problems to solve.  Configuration files may also be used.
* Library interface
  * This defines the application programming interface (API) for the numerical software.  Scripts should not touch anything below this layer.
* Library structure
  * This is where we architect the software.  Good book-keeping code is here to separate the interface and the computing kernel.  Data structures are designed at this layer to make sure no time is wasted in copying or converting data.
* Computing kernel
  * This is the place that does the heavy lifting, and where we do most of the optimization.
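
The lower layers can be sketched in a few lines of Python.  This is only a toy under stated assumptions: NumPy stands in for the compiled computing kernel, and the names `Field`, `advance`, and `_kernel_axpy` are hypothetical, made up for this illustration.

```python
import numpy as np


def _kernel_axpy(alpha, x, y):
    """Computing kernel: update y <- alpha * x + y in the caller's buffer."""
    y += alpha * x


class Field:
    """Library structure: owns the data buffer and exposes it without copying."""

    def __init__(self, size):
        self._data = np.zeros(size, dtype="float64")

    @property
    def data(self):
        # Return the buffer itself, not a copy.
        return self._data


def advance(field, source, dt):
    """Library interface: the only entry point a user script should call."""
    _kernel_axpy(dt, source, field.data)
    return field


# Scripting layer: configure the problem and drive the library.
field = Field(4)
source = np.array([1.0, 2.0, 3.0, 4.0])
for _ in range(10):
    advance(field, source, dt=0.1)
print(field.data)
```

The script touches only `Field` and `advance`.  The kernel can later be replaced by a faster implementation without changing the script.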

## Pattern 1: Research code

For a research code, the boundary between external result, problem presentation, and scripting, and that between library interface, library structure, and computing kernel, may be less clear.  The architecture is usually like:

* Problem presentation: high-level description, physics, and scripting / code configuration
* Library implementation

But sometimes if we don't pay attention to architecting, there may be no boundary between anything.

## Pattern 2: Full-fledged application

For a commercial grade package, each of the layers will include more sub-layers.  It is a challenge to prevent those layers or sub-layers from interweaving.  From users' point of view, the sophistication appears in the problem presentation and the scripting layers.  Developers, on the other hand, take care of everything below problem presentation, so that users can focus on problem solving.

## Pattern 3: Scripting for modularization

At this point, it should be clear that the scripting layer is the key glue in the system architecture.  The high-level users, who use the code for problem solving, wouldn't want to spend time in the low-level implementation.  Instead, they will specify the performance of the API exposed in the scripting layer.  The performance may be about the quality of result and runtime (including memory).

The scripting layer can separate the programming work between the high-level problem presentation and the low-level library implementation.  A scripting language is usually dynamically typed, while for speed, the low-level implementation language uses a static typing system.  In the dynamic scripting language, unit-testing is required for robustness.  In a statically typed language like C++, the compiler and static analyzers are very good at detecting errors before runtime.  But the great job done by the compiler makes it clumsy to use C++ to quickly write highly flexible code for problem presentation.
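
A minimal sketch of why unit tests matter in a dynamically typed language follows.  The function `kinetic_energy` and the test class are hypothetical examples.  Python accepts the call with a string argument without complaint until the line actually executes, so the test is what pins down the behavior.

```python
import unittest


def kinetic_energy(mass, velocity):
    """Return 0.5 * m * v^2.  Nothing checks the argument types up front."""
    return 0.5 * mass * velocity ** 2


class KineticEnergyTest(unittest.TestCase):

    def test_numbers(self):
        self.assertAlmostEqual(kinetic_energy(2.0, 3.0), 9.0)

    def test_text_fails_only_at_runtime(self):
        # A compiler for a statically typed language would reject this call.
        # Python raises the error only when the expression is evaluated.
        with self.assertRaises(TypeError):
            kinetic_energy(2.0, "3")


# Run the tests programmatically so that the sketch works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(KineticEnergyTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```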

It is tempting to invent one programming language to rule them all.  That approach needs to convince both the high-level problem solvers and the low-level implementors to give up the tools they are familiar with.  The new language will also need to provide two distinct styles for both use cases.  It will be quite challenging, and before anyone succeeds with the one-language approach, we still need to live with a world of hybrid systems.

# Numerical software = C++ + Python

The key to a successful numerical software system is to make it unnegotiably fast and extremely flexible.  It should be flexible enough so that users, i.e., scientists and engineers, can easily write lengthy programs to control everything.  It should be noted that, although the users program in the system, they do not necessarily know computer science.

Not all programming languages can meet the expectation.  To this point, the most suitable scripting language is Python, and the most suitable low-level language may be C++.  C++ can be controversial, but considering the support it received from the industry, it's probably difficult to find another language of higher acceptance.  Our purpose here is to introduce the skills for developing numerical software, not to analyze programming languages.  We will focus on C++ and Python.

## More reasons for Python

* Python provides a better way to describe the physical or mathematical problem.
* Python can easily build an even higher-level application, using GUI, scripting, or both.
* Is there an alternative to C++?  No.  To Python?  Yes.  But Python is the easiest choice for its versatility and simplicity.
* A numerical software developer sees through the abstraction stack:
  * The highest-level application is presented as a Python script.
  * The Python script drives the number-crunching C++ library.
  * C++ is the syntactic sugar for the machine code.
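
The abstraction stack can be felt with a rough timing sketch.  It assumes that NumPy's compiled routines stand in for the number-crunching C++ library.  The helper `dot_python` and the array size `n` are arbitrary choices for illustration, and the measured times depend on the machine.

```python
import time

import numpy as np


def dot_python(a, b):
    """Dot product computed by the Python interpreter, one element at a time."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


n = 200_000
a = np.linspace(0.0, 1.0, n)
b = np.linspace(1.0, 2.0, n)
a_list, b_list = a.tolist(), b.tolist()

t0 = time.perf_counter()
slow = dot_python(a_list, b_list)
t1 = time.perf_counter()
fast = float(a.dot(b))  # the loop runs inside compiled code
t2 = time.perf_counter()

print("pure Python:", slow, "in", t1 - t0, "seconds")
print("NumPy:      ", fast, "in", t2 - t1, "seconds")
```

Both paths compute the same number.  Only the layer that executes the loop differs.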

# Part 2: What to learn in this course

The course is composed of 14 lectures, 6 homework assignments, 1 mid-term examination, and 1 term project, to introduce the development fundamentals to the students.  The term project will include a public GitHub repository and an oral presentation in front of the class.

The lectures before the mid-term will review the basic engineering, programming languages, and computer science topics.  The mid-term will test students' understanding.  The lectures after the mid-term will cover various coding skills and structures that are found in numerical software.  The 6 homework assignments are designed for students to practice the individual topics.  The term project will help students learn how to put everything together.

# Term project

The term project is an important part of the course.  You need to develop the code in a public GitHub repository.  In addition to applying the skills and structures covered in the course, you will use it to practice how to **specify** and **design** the software.  Specification and design are as important as coding itself.  Numerical software is used to solve open questions, and we need to change the target as we go.

To be successful in the project, you should start to think about the project topic when the course begins.  To help you, you will be asked to submit a proposal of the project, along with the GitHub repository that houses the source code.

If you start early and develop the project throughout the course, you will learn how to do iterative design.  That is what should happen in a healthy software shop.  A well-thought-out proposal will help you do well in the implementation, but a perfect proposal isn't a prerequisite.  It is OK to change the proposal when you are implementing the system, and it frequently happens with a real-world project.  Just don't change the plan at the last minute.

In a real software shop, the result is everything.  But for this course, the course of development is more important than the final product.  The term project is an opportunity to practice how to design and architect.

You are expected to do the project alone.  A project of intensive physics or mathematics is usually hard to explain.  The instructor will interact with you through the discussion for the proposal and development, to show you how to perform effective communication and collaboration.

# How to write a proposal

The proposal is to help you practice writing a specification.  It should at least include the following contents:

1. Basic information (including the GitHub repository)
2. Problem to solve
3. Prospective users
4. System architecture
5. API description
6. Engineering infrastructure
7. Schedule

The purpose of a proposal (or a specification) is to enable discussions that cannot be done with a programming language.  For example, source code is not suitable for describing software architecture.  In [The Architecture of Open Source Applications](https://aosabook.org/en/index.html), you can see the many different ways that the developers use to present architecture.  There is no fixed way, but natural language and diagrams are the most common tools.  They are probably the most effective ones, too.

# Term project grading guideline

Here is a list of the items to be considered in grading the term project.  Your source code repository (including the history) and oral presentation will be used for grading.  The proposal 

* Software engineering:

  * Build system:
  * Testing:
  * Version control:
  * History quality:
  * Issue tracking:
  * Documentation:
  * Other:
* Correctness:

  * Existence of golden:
  * Quality of golden:
  * Order of development:
  * Other:
* Software architecture:

  * [SOLID](https://en.wikipedia.org/wiki/SOLID):
  * Proper use of high-level language wrapping:
  * Level of modularity:
  * Performance:
    * Profiling:
    * Runtime:
    * Memory:
  * Design for testing:
  * Iterative design:
  * Other:
* Presentation:

  * Technical fluency:
  * Slide clarity:
  * Time control:
  * Appearance:
  * Other:

# Online discussion

Being a full-time software engineer working in a commercial company, I do not show up on campus often.  But the course requires a significant amount of discussion.  You will need to use GitHub and email to get my help.

* For anything about the course note, open an issue in the note repository https://github.com/yungyuc/nsd.
* For anything about the homework assignments, go to https://github.com/yungyuc/nsdhw_20au and open an issue there.
* For grades (of homework assignments or exam) that you see on E3, send me an E3 mail with `[nsd]` in the subject line.
* For anything else, send me an email at `yyc at solvcon.net` with `[nsd]` in the subject line.

**The `[nsd]` tag in the email subject line is important to draw my attention to your message.  Don't miss it.**

# Part 3: Runtime and course materials

# Runtime environment: Linux and AWS

In this course we will be using [AWS Educate](https://aws.amazon.com/education/awseducate/) to practice coding.  AWS Educate is the educational service provided by Amazon, on which you can launch virtual machines with the image I prepared.  The code written for homework assignments must run on the platform.  If the grader cannot run it, you get no points.

For setting up the service, take a look at https://www.it.nctu.edu.tw/?page_id=3193 (Chinese only).  Because it takes several days for AWS to review your enrollment, I suggest you do it as soon as possible.  Please do get the AWS Educate account and use it for your homework code, unless you know system administration so well that building dependencies is like breathing.

I will use the email you have in E3 for sending the AWS Educate invitation.  If you don't see the invitation, please write an email to me after checking the spam box.

Everyone will get credits of USD 50 in the region `us-east-1`.  Once you have the account set up, go to EC2 and launch a virtual machine using the AMI I prepared: `nsd-ubuntu1804`.  For testing, you may use an instance type as small as `t2.micro`, but more powerful instance types may be needed after we move on to later topics.

# Jupyter notebook

"Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages." -- https://jupyter.org

* These notes are written using Jupyter
* Show the code and run it at the same time
* Terminal access
* File management

I am using Jupyter notebook to write the course notes and to provide an interactive environment for you.  Everything in my notes should be runnable, so that you can tweak the code yourself and learn by doing.

## What is Jupyter

Jupyter is a client-server system.  What we are touching and playing is its "frontend", the interactive user interface.  It talks to the "backend", which is called a Jupyter kernel.  See the following image ([source](https://jupyter.readthedocs.io/en/latest/architecture/how_jupyter_ipython_work.html)):

<center><img src="https://jupyter.readthedocs.io/en/latest/_images/notebook_components.png" alt="Jupyter distributed architecture" /></center>

The system is distributed.  The browser and the Jupyter server run on different computers and HTTP is used to connect them.  The kernel can also be configured to run on a different computer than the server.

Jupyter has 3 types of cells:

1. Code.  The content will be executed.
2. [Markdown](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#markdown-cells).  Use a mark-up language called "[markdown](https://daringfireball.net/projects/markdown/)" to format text.
3. Raw nbconvert.  Jupyter skips processing the content and passes it through to other converting tools.

Most of the time we only care about the interactive computing capability provided by the code cell.
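
A notebook file is stored as JSON, and each cell records its type.  The sketch below builds a heavily simplified notebook by hand to show the three cell types.  It assumes the version 4 notebook format, is not validated against the official schema, and uses made-up cell contents.

```python
import json

# A hand-written, simplified notebook with one cell of each type.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')"],
        },
        {"cell_type": "raw", "metadata": {}, "source": ["passed through as is"]},
    ],
}

# Serialize and parse it back, as a tool reading an .ipynb file would.
text = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```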

## Python code

In [None]:
import numpy as np
v1 = np.array([1,1,1], dtype='float64')
v2 = np.array([1,-1,0], dtype='float64')
print("dot product between v1 and v2:", v1.dot(v2))
print("v1 length:", np.sqrt((v1**2).sum()))

In [None]:
# simple math
d = 30.*np.pi/180.
print("trigonometric function at 30 degree:", np.sin(d), np.cos(d), np.tan(d))

## IPython magic

[IPython](https://ipython.readthedocs.io/) provides the Jupyter kernel for enhanced interactive execution.  The "magic" commands are part of the enhancements.  There are two types of magic commands: line and cell.  A line magic is a line starting with "`%`".

In [None]:
import sys
print(sys.path) # show python import paths
%pwd # print current directory

## Cell magic

A line starting with "`%%`" indicates a magic that takes all the content of a cell.

In [None]:
%%script bash
whoami
pwd
ls -l

## Other features

* Escape to shell in a line starting with "`!`":

In [None]:
!uptime

* Editor
* Terminal

# References

* The Architecture of Open Source Applications, http://aosabook.org/en/index.html.