<div align="center">
  <img src="http://vlpavlov.org/Pythagoras-Logo3.svg"><br>
</div>

# Introduction to Pythagoras

This tutorial explains the core Pythagoras constructs, which allow everyone to easily parallelize their Python programs and execute them in the cloud with just a few extra lines of code. It makes it possible to significantly speed up computationally expensive calculations.

### Initial Setup

First, let's install and import Pythagoras:

In [1]:
!pip install pythagoras --quiet

In [2]:
from pythagoras import *

### Hello, World! 

A basic Pythagoras program needs only one new object. This object must be an instance of a class that inherits from the **PCloud** class.
    
* **PCloud**: objects of this class are responsible for the actual connection to the cloud (AWS, GCP, Azure, etc.); they can also store parallelized versions of your functions and execute them in the cloud.

**PCloud** is the base class for a hierarchy of classes that implement alternative deployment models (so-called "backends"). We will not discuss them in this tutorial.

In [3]:
my_cloud = PCloud(requires = "some packages", connection = "some connection parameters")

The **PCloud.add_pure_function** decorator registers your function with the Pythagoras cloud. Once registered, a function gains a few new capabilities, which we discuss below.

Not every function can be added to a cloud module. There are key requirements:
* a function must be [pure](https://en.wikipedia.org/wiki/Pure_function): fully deterministic, with no side effects, and with an output value that depends solely on its input values;
* a function is only allowed to accept keyword parameters. Positional parameters are forbidden.
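
The two requirements above can be illustrated in plain Python, without Pythagoras. This is only a sketch: the names `square`, `impure_square`, and `call_log` are hypothetical and are not part of the library. The bare `*` in the signature is standard Python syntax that makes every following parameter keyword-only.

```python
def square(*, x: int) -> int:
    """Pure: the result depends only on x, and nothing outside is modified."""
    return x ** 2


call_log = []  # module-level state, used only to demonstrate a side effect


def impure_square(*, x: int) -> int:
    """Not pure: appending to call_log is a side effect."""
    call_log.append(x)
    return x ** 2


# A keyword call works as usual.
keyword_result = square(x=7)

# A positional call is rejected by Python itself with a TypeError.
try:
    square(7)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

A function like `impure_square` would be a poor candidate for caching, because returning a stored result would silently skip the update of `call_log`.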

In [4]:
@my_cloud.add_pure_function
def very_slow_function(*, important_parameter:int):
    """     >>>>>       THIS FUNCTION RUNS FOR ABOUT AN HOUR       <<<<<     """
    return important_parameter**2

@my_cloud.add_pure_function
def another_slow_function(*, best_ever_parameter:int):
    """     >>>>>       THIS FUNCTION RUNS FOR ABOUT AN HOUR       <<<<<     """
    return best_ever_parameter**3

There are three main benefits of turning your regular function into a cloud-hosted function:
* Cloud-based memoization (caching)
* Cloud-based (remote) execution
* Cloud-based parallelization

Let's take a closer look:

### Cloud-based Memoization

The first time we run a slow function with a specific combination of input arguments, Pythagoras stores the function output in a cache. The next time we run the function with exactly the same input arguments, it does not need to execute again: the output is retrieved from the cache.

The cache is cloud-based. This means we can run the function once on any computer (local or cloud-based) and then reuse the cached output on any other computer.
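
The idea can be sketched locally with the standard library alone. The code below is an illustration under stated assumptions, not how Pythagoras is implemented: a plain dictionary (`_cache`) stands in for the cloud cache, and the hypothetical decorator `memoize_by_kwargs` derives a cache key by hashing the function name together with its keyword arguments.

```python
import functools
import hashlib
import pickle

_cache = {}  # assumption: a plain dict standing in for the shared cloud cache


def memoize_by_kwargs(func):
    """Hypothetical decorator: cache results keyed by function name and kwargs."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        # Sort the arguments so that the key does not depend on call order.
        payload = pickle.dumps((func.__name__, sorted(kwargs.items())))
        key = hashlib.sha256(payload).hexdigest()
        if key not in _cache:
            _cache[key] = func(**kwargs)  # the only place the function runs
        return _cache[key]
    return wrapper


executions = []  # records real executions, for demonstration only


@memoize_by_kwargs
def slow_square(*, important_parameter: int) -> int:
    executions.append(important_parameter)
    return important_parameter ** 2


first = slow_square(important_parameter=22)   # executes the function body
second = slow_square(important_parameter=22)  # served from the cache
```

After both calls, `executions` contains a single entry, because the second call never reached the function body. Sharing the dictionary between machines is the part that a cloud-based cache adds on top of this sketch.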

In [5]:
# The first execution is very slow: over an hour

very_slow_function( important_parameter=22 ) 

484

In [6]:
# The second execution is very fast: a fraction of a second
    
very_slow_function( important_parameter=22 )

484

In [7]:
# If the function was executed on another computer with important_parameter=99 in the past,
# now the execution will be very fast.
#
# However, if the function was never executed on this or another computer with important_parameter=99,
# then this execution will be very slow. All subsequent executions with important_parameter=99
# will be fast.

very_slow_function( important_parameter=99 ) 

9801

### Cloud-based Execution

Pythagoras allows us to choose whether to execute a specific function on a local computer, or remotely in the cloud. Remote execution happens seamlessly, we do not need to worry about server provisioning, data marshaling, etc.

In [8]:
# When we are executing a function with a new combination of input arguments, it will run locally 

very_slow_function( important_parameter=88 )

7744

In [9]:
# We can explicitly instruct a function to be executed in the cloud
# If we have a slow desktop, remote execution will be faster

very_slow_function.sync_remote( important_parameter=12345 )

152399025

In [10]:
# If the output of the function for a specific combination of inputs is available in the cache,
# no actual function execution will happen. The output will simply be retrieved from the cache.

very_slow_function.sync_remote( important_parameter=12345 )

152399025

In [11]:
# If the function was executed on a local or a remote computer with important_parameter=55 in the past,
# now the execution will be very fast. It will not be an actual execution, but rather a retrieval from the cache.
#
# However, if the function was never executed on this or another computer with important_parameter=55,
# then this execution will be slow. All subsequent executions with important_parameter=55
# will be fast.

very_slow_function.sync_remote( important_parameter=55 )

3025

The *sync* prefix in .sync_remote(...) means that remote execution is synchronous: the local program waits until the remote function completes and sends back its result. Only after the output has arrived on the local computer does the local execution flow resume.

Other distributed computing frameworks also allow remote execution to be started asynchronously. Pythagoras reserves the .async_remote(...) syntax for this scenario, but it is not currently supported and is planned for future versions. We strongly encourage Pythagoras users to design their programs for the synchronous execution model, which makes it much easier to reason about your code.
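
The difference between the two models can be shown with Python's standard `concurrent.futures` module. This is a local analogy only, assuming a thread pool as a stand-in for a remote worker; it does not use Pythagoras, and the function `square` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def square(*, important_parameter: int) -> int:
    return important_parameter ** 2


with ThreadPoolExecutor(max_workers=1) as pool:
    # Synchronous style: submit the call and immediately block on the result.
    sync_result = pool.submit(square, important_parameter=55).result()

    # Asynchronous style: keep a Future, do other work, collect the value later.
    future = pool.submit(square, important_parameter=55)
    other_work = sum(range(10))
    async_result = future.result()
```

Both styles produce the same value. The asynchronous one requires the program to track the `Future` object and decide when to wait for it, which is the extra bookkeeping the synchronous model avoids.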

Some people prefer the asynchronous model because they think they can better optimize asynchronous programs. But the flip side is increased complexity and cognitive load. We believe that actual engineers’ productivity is more important than perceived code efficiency.

After all, there is a reason why back in 1974 Donald Knuth wrote “*The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming*”.

Even in the 1970s, when computers were far slower than today, Knuth prioritized human productivity over code efficiency. Why would we take a different stand today?

### Cloud-based Parallelization

Pythagoras makes it possible to seamlessly parallelize loops that execute the same function with different combinations of input values.

In [12]:
# The first time we execute this code, it will take 10 hours to run.
# Of course, all the subsequent executions will be very fast because of memoization.
# But what if we wanted to speed up even the very first execution?

results = []

for i in range(10):
    results.append(   very_slow_function( important_parameter=i )   )
    
results

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [13]:
# Here we are using a list comprehension to illustrate exactly the same scenario as above:
# The first execution will take 10 hours, subsequent executions will be fast.
# But we do not want to wait 10 hours even the first time we run this code.

[   very_slow_function( important_parameter=i ) for i in range(100, 110)   ]

[10000, 10201, 10404, 10609, 10816, 11025, 11236, 11449, 11664, 11881]

In [14]:
# Pythagoras offers a solution. The example below shows how to
# simultaneously launch multiple instances of a function in the cloud.
# All the calculations will be done in parallel.
# The execution will take only one hour when we run this code for the first time.

very_slow_function.sync_parallel(   kw_args( important_parameter=i ) for i in range(200,210)   )

[40000, 40401, 40804, 41209, 41616, 42025, 42436, 42849, 43264, 43681]

In [15]:
# Of course, all the outputs are stored in the cache. 
# When we run the same code for the second time (no matter if on this or on another computer),
# it will only take a fraction of a second to execute.

very_slow_function.sync_parallel(   kw_args( important_parameter=i ) for i in range(200,210)   )

[40000, 40401, 40804, 41209, 41616, 42025, 42436, 42849, 43264, 43681]

The *sync* prefix in .sync_parallel(...) means that remote execution is synchronous: the local program waits until all remote functions complete and send back their results. Only after the outputs of all functions have arrived on the local computer does the local execution flow resume.

Alternatively, there is a .async_parallel(...) construct for asynchronous execution, which is currently not implemented and reserved for future versions of Pythagoras.
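
The fan-out-and-wait pattern described above can be approximated locally with the standard library. The sketch below is an analogy, not the Pythagoras implementation: `run_in_parallel` is a hypothetical helper, plain dictionaries stand in for the keyword-argument bundles, and a thread pool stands in for cloud workers. Threads are used only to show the shape of the call; for CPU-bound Python code they would not give the speedup that separate cloud machines provide.

```python
from concurrent.futures import ThreadPoolExecutor


def square(*, important_parameter: int) -> int:
    return important_parameter ** 2


def run_in_parallel(func, list_of_kwargs):
    """Hypothetical helper: call func once per kwargs dict, wait for all results."""
    list_of_kwargs = list(list_of_kwargs)
    with ThreadPoolExecutor(max_workers=max(1, len(list_of_kwargs))) as pool:
        # Launch every call first, then collect results in submission order.
        futures = [pool.submit(func, **kwargs) for kwargs in list_of_kwargs]
        return [future.result() for future in futures]


parallel_results = run_in_parallel(
    square, (dict(important_parameter=i) for i in range(200, 210))
)
```

The helper returns only after every call has finished, and the results come back in the same order as the inputs, which mirrors the synchronous behavior described for .sync_parallel(...).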

### Summary of the Key Capabilities

By adding a simple decorator in front of your Python function, you can turn it into serverless code that can run both locally and remotely. Another line of code replaces sequential loops with a parallel execution engine that simultaneously launches hundreds of serverless functions in the cloud. This is a good fit for complex computational tasks, such as multi-fold cross-validation, grid search for hyperparameter optimization, or feature selection algorithms.

For pure functions (fully deterministic functions with no side effects, whose output values depend solely on input values), Pythagoras provides cloud storage to cache function outputs. Such memoized functions run only once; all subsequent calls on any computer skip function execution and return previously computed values. This makes complex distributed algorithms cheap to rerun and easy to resume if they were interrupted.

Cloud storage is partially replicated on local computers, which allows Python scripts and notebooks to access stored values very quickly. Each piece of data is associated with its hash, which serves as the key for accessing the data. When some data (e.g., a large Pandas DataFrame) must be passed as an input to a serverless function, Pythagoras pushes the data to the cloud storage under the hood and passes only its hash to the function. This approach reduces the traffic associated with launching new instances of serverless functions in the cloud and significantly speeds up the process.
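
The hash-as-key scheme can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: a dictionary (`store`) plays the role of cloud storage, `put` and `get` are hypothetical helpers, and SHA-256 over the pickled bytes is just one possible choice of hash. The source does not specify which hash or serialization format Pythagoras uses.

```python
import hashlib
import pickle

store = {}  # assumption: a plain dict standing in for cloud storage


def put(value) -> str:
    """Serialize a value, store it under the hash of its bytes, return the hash."""
    blob = pickle.dumps(value)
    key = hashlib.sha256(blob).hexdigest()
    store[key] = blob
    return key


def get(key: str):
    """Fetch and deserialize the value stored under the given hash."""
    return pickle.loads(store[key])


def total(*, data_key: str) -> int:
    """The function receives only a short hash and fetches the data itself."""
    return sum(get(data_key))


big_input = list(range(1000))
key = put(big_input)           # the large object is uploaded once
result = total(data_key=key)   # only the 64-character key travels with the call
```

Storing the same value twice yields the same key, so identical inputs are uploaded only once no matter how many function instances use them.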

The typical scenario of working with Pythagoras is to parallelize Python code using backend compute infrastructure provided by a major cloud vendor. We are currently working on a reference implementation for AWS, with plans to integrate with GCP and Azure later. As an alternative, Pythagoras offers a simple P2P model, in which serverless code can be parallelized over a distributed swarm of workstations, on-premises servers, and even laptops. This model is a good solution for resource-constrained teams and educational projects.

### Conclusion

Pythagoras democratizes access to serverless compute for data scientists and other engineers who need to use Python for computationally expensive calculations.

It makes engineers' lives simpler, while allowing them to solve more complex problems faster and with smaller budgets.