# "Parallel Computing" by Alex Razoumov #103

Closed · opened Jun 7, 2016 · 6 comments
## Description

Join us for a beginner-level introduction to MPI (Message Passing Interface), an industry-standard library for distributed-memory parallel computing on large systems. There are implementations of MPI for all major compiled and interpreted languages, including C/C++, Python, and R, and it is the default parallel computing library on all academic HPC systems, including the Compute Canada clusters. Using a simple example, we will learn how to partition a calculation across multiple processors, and how to use basic send/receive and broadcast commands.

If you are thinking of parallelizing a long and/or large-memory calculation, this is a session to attend. If you already have a problem in need of parallelization, please bring it along for an in-class discussion.

## Time and Place

Where: Simon Fraser University, Burnaby Campus, Library Research Commons
When: Monday, August 8th, 10:30-11:30 am

REGISTER HERE

## Required Preparation

### Software Dependencies

#### SSH

• OS X and Linux: Should be pre-installed
• Windows: MobaXterm

#### WestGrid Account

You can find the details on how to obtain a WestGrid account here. If you need a sponsor (who would normally be your supervisor) to apply for an account, please contact Alex Razoumov, and he'll send you his CCRI, which you will need to fill in the application form.

⟶ It would be best if all attendees get their accounts a few days before the workshop.

If you don't have an account, you can still attend the workshop, but you won't be able to do hands-on exercises -- you'll still be able to watch the presentation and participate in the discussion.

# Notes

## Basic concepts

Why do you want to parallelize your code?

• (1) to speed up a calculation or data processing by breaking it into smaller pieces that run in parallel
• (2) the problem does not fit on a single node

Some people talk about task parallelism vs. data parallelism - the divide is not always clear, so I prefer not to use this terminology.

There are also embarrassingly parallel problems: often can simply do serial farming, no need to parallelize.

In general, whatever parallelization strategy you choose, you want to minimize the amount of communication. Remember: I/O and network are usually the bottlenecks -- for the reasons, look into the history of computing.

Amdahl's law

• serial vs. parallel chunks
• in a more complex version of it, different parts of your algorithm will scale differently
• always analyze parallel scaling before doing anything large-scale
• I/O can be parallel, but common sense applies

## Architectures and programming models

Parallel hardware architecture determines the parallel programming model.

• shared-memory parallel systems
• distributed-memory parallel systems (clusters)

The main programming models and tools:

• map and reduce (Hadoop, Spark) - designed for parallel data processing, usually require a dedicated hardware setup (and not an easy-to-use layer on top of a regular cluster/scheduler); very simple problems on ridiculously large datasets
• message passing interface (MPI) - fairly low-level library, but lots of built-in performance optimization; industry standard on HPC systems for the past ~20 years; many higher-level frameworks use MPI underneath
• high-level parallel languages and frameworks (Unified Parallel C, Coarray Fortran, Charm++, Chapel, Trilinos) - can be quite powerful, but haven't seen industry-wide adoption; promoted by their developers and various interested parties

## Code efficiency and common sense

Python and R have terrible native performance, so parallelizing such code might not be a good idea in the first place! Both are interpreted scripting languages designed for ease of use and a high level of abstraction, not for performance. There are exceptions to this rule -- can you name them?

Try to optimize your algorithm before parallelizing it! Don't use inefficient algorithms, silly data constructs, or slow languages (including Java), and don't reinvent the wheel by coding everything from scratch. Do use precompiled libraries, optimization flags, and profiling tools, and think about the bottlenecks in your workflow and the overall code design.
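As one example of leaning on precompiled libraries, the midpoint-rule sum from pi.py below can be vectorized with NumPy (a sketch assuming NumPy is installed), which moves the loop into compiled code:

```python
import numpy as np
from math import pi

n = 1000000
h = 1.0 / n
x = h * (np.arange(n) + 0.5)          # all midpoints at once, no Python loop
s = h * np.sum(4.0 / (1.0 + x**2))    # vectorized sum runs in compiled code
print(s, abs(s - pi))
```

The result matches the pure-Python loop, but the per-element work happens inside NumPy's C internals.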

Always think of the bottlenecks! For example, with data processing, running 100 I/O-intensive processes on 100 cores on a cluster will not make it 100X faster - why?
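To put rough numbers on it (a toy model with made-up figures, not measured data): if every process pulls its input through the same shared filesystem, the I/O phase is serialized no matter how many cores you add.

```python
# toy model with illustrative numbers: each process reads 1 GB of input,
# and the shared filesystem delivers ~1 GB/s in total across all processes
read_gb = 1.0          # per-process input size (assumed)
bandwidth_gbps = 1.0   # aggregate filesystem bandwidth (assumed)

for procs in (1, 10, 100):
    io_seconds = procs * read_gb / bandwidth_gbps   # total data / shared bandwidth
    print(procs, "processes ->", io_seconds, "s of I/O")
```

The I/O time grows with the process count instead of shrinking: the compute phase may be 100X faster on 100 cores, but the run becomes I/O-bound.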

## Python vs. C timing

Let's compute \pi via numerical integration.

First, let's take a look at the serial code pi.c:

```c
#include <stdio.h>
#include <math.h>
#define pi 3.14159265358979323846

int main(int argc, char *argv[])
{
  double h, sum, x;
  int n, i;

  n = 1000000;
  h = 1./n;
  sum = 0.;

  for (i = 1; i <= n; i++) {
    x = h * ( i - 0.5 );
    sum += 4. / ( 1. + pow(x,2));
  }

  sum *= h;
  printf("%.17g  %.17g\n", sum, fabs(sum-pi));

  return 0;
}
```
```
$ gcc pi.c -o pi
$ time ./pi
3.1415926535897643  2.886579864025407e-14
```

The time will be around 50ms.

We can ask the compiler to optimize the code.

```
$ gcc -O2 pi.c -o pi
$ time ./pi
3.1415926535897643  2.886579864025407e-14
```

The time will be around 10ms.

Now let's look at the same algorithm in pi.py:

```python
from math import pi

n = 1000000
h = 1./n
sum = 0.

for i in range(n):
    x = h * ( i + 0.5 )
    sum += 4. / ( 1. + x**2)

sum *= h
print(sum, abs(sum-pi))
```

```
$ time python pi.py
3.1415926535897643 2.886579864025407e-14
```

The time will be over 500ms. This is a 50X performance drop compared to compiler-optimized C code on my laptop! On a cluster's compute node I get 80X. If you code a PDE solver in a native language, you'll likely see a 100X-300X drop in performance when switching to Python.

Then why use Python?

• wonderful scripting language: can easily write clear and concise code, ease of use
• Python works much better with list manipulations, string comparisons, etc.
• lots of precompiled 3rd-party numerical libraries

Does it make sense to parallelize a Python code?

• yes, but only if your Python code spends most of its time inside precompiled libraries
• in all other cases you really need to rethink your approach before parallelizing, or use a compiled language
• in what follows, we'll parallelize our Python code just to learn the basics of MPI (the same in all languages) -- the resulting code is terribly inefficient and should not be used for production!

## Parallelization

Cluster environment:

• login and compute nodes
• job scheduler
• perhaps a small number of development/interactive nodes outside of the scheduler (not in WestGrid)

Normally on a cluster you need to submit jobs to the scheduler. Results (output and error files tagged with the jobID) usually go into the directory into which you cd inside the job's submission script.

```
$ qsub production.bat
$ qstat -u username
$ showq -w user=username
$ qdel jobID
```

For debugging and testing you can start an interactive job, specifying the resources from the command line. The job will start on "interactive" nodes with shorter runtime limits (to ensure that your job will start soon).

```
$ qsub -I -l nodes=1:ppn=1,walltime=0:30:00,pmem=2000mb
... wait for the shell to start ...
$ mpiexec parallelCode   # will run on the number of processors allocated to your job
```

However, this might take a while (from seconds to many minutes) if the system is busy and the scheduler is overwhelmed, even if the "interactive" nodes are idle. For the duration of this workshop only, we reserved a node cl2n230 on the Jasper cluster where you can work interactively.

```
$ ssh jasper.westgrid.ca
$ ssh cl2n230
$ module load library/openmpi/1.6.5-gnu
$ module load application/python/2.7.3
```

Plan B solution: if this does not work, you can use an interactive node (b402 or b403) on bugaboo (OK for quick testing only!).

Once you are on the interactive node, cd into a temporary directory and be prepared to run a parallel code.

```
$ cd newUserSeminar
$ etime() { /usr/bin/time -f "elapsed: %e seconds" $@; }   # formatted output works only in Linux
$ mpirun -np numProcs parallelCode
```

We'll now take a look at mpi4py, an MPI implementation for Python. There are two versions of each MPI command in mpi4py:

• upper-case commands, e.g., Bcast(), use numpy arrays for datatypes (faster) -- we'll use this method
• lower-case commands, e.g., bcast(), use the "pickling" method, packing any Python object into a bytestream (slower)

Let's try running the following code (parallelPi.py), adding the lines one by one:

```python
from math import pi
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 1000000
h = 1./n
sum = 0.

# if rank == 0:
#     print 'Calculating PI with', size, 'processes'
# print 'process', rank, 'of', size, 'started'

for i in range(rank, n, size):
    # print rank, i
    x = h * ( i + 0.5 )
    sum += 4. / ( 1. + x**2)

local = np.zeros(1)
total = np.zeros(1)
local[0] = sum*h
comm.Reduce(local, total, op = MPI.SUM)

if rank == 0:
    print total[0], abs(total[0]-pi)
```

```
$ etime mpirun -np 4 python parallelPi.py
```
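The work decomposition in parallelPi.py is the cyclic loop `range(rank, n, size)`. Here is a quick plain-Python check (no MPI needed; the values of n and size below are illustrative, not from a real run) that these per-rank index sets partition 0..n-1 exactly:

```python
# the per-rank loop "for i in range(rank, n, size)" assigns indices cyclically
n, size = 20, 4   # illustrative values: 20 intervals, 4 "ranks"
chunks = [list(range(rank, n, size)) for rank in range(size)]

# every index 0..n-1 is covered exactly once across all ranks
all_indices = sorted(i for chunk in chunks for i in chunk)
assert all_indices == list(range(n))

print(chunks[0])   # rank 0 handles indices 0, 4, 8, 12, 16
```

Because each rank's indices interleave rather than form one contiguous block, the load stays balanced even when n is not divisible by the number of processes.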

Compare the runtimes to the serial code

```
$ etime python pi.py
```

For n = 1,000,000 we get a slowdown from 0.8s to 1.9s! However, for n = 10,000,000 we get a speedup from 5.6s to 3.3s. And for n = 100,000,000 we get a speedup from 54s to 16s, getting closer to 4X.

Now let's compare the Python MPI syntax to C MPI in parallelPi.c:

```c
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  double total, h, sum, x;
  int n, rank, numprocs, i;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

  n = 10000;
  h = 1./n;
  sum = 0.;

  if (rank == 0)
    printf("Calculating PI with %d processes\n", numprocs);
  printf("process %d started\n", rank);

  for (i = rank+1; i <= n; i += numprocs) {
    x = h * ( i - 0.5 );   // calculate at the center of the interval
    sum += 4.0 / ( 1.0 + pow(x,2));
  }
  sum *= h;

  MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) {
    printf("%f\n", total);
  }

  MPI_Finalize();
  return 0;
}
```

```
$ mpicc parallelPi.c -o pi
$ etime mpirun -np 4 ./pi
```

Reduce() is an example of a collective operation.

## Major MPI functions

Examples of point-to-point communication functions are Comm.Send(buf, dest=0, tag=0) and Comm.Recv(buf, source=0, tag=0, status=None).

Examples of collective communication functions are Comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0) and Comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM) -- where the reduction operation can be MPI.MAX, MPI.MIN, MPI.SUM, MPI.PROD, MPI.LAND, MPI.BAND, MPI.LOR, MPI.BOR, MPI.LXOR, MPI.BXOR, MPI.MAXLOC, or MPI.MINLOC -- as well as Comm.Bcast(buf, root=0) (sending the same data to all processes) and Comm.Scatter(sendbuf, recvbuf, root) (sending different parts to different processes).
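To make the collective semantics concrete, here is a plain-Python sketch (no MPI involved; the four-element lists stand in for per-rank buffers, and all the values are illustrative):

```python
# simulate what the MPI collectives compute across 4 "ranks":
# each rank r contributes the buffer [r, 2r]
sendbufs = [[r, r * 2] for r in range(4)]

# Reduce with op=MPI.SUM to root=0: elementwise sum, result on the root only
reduced = [sum(vals) for vals in zip(*sendbufs)]   # [0+1+2+3, 0+2+4+6]

# Allreduce: every rank ends up with the same reduced result
allreduced = [reduced[:] for _ in range(4)]

# Bcast from root=0: every rank receives a copy of the root's buffer
broadcast = [sendbufs[0][:] for _ in range(4)]

# Scatter from the root: rank r receives the r-th piece of the root's buffer
pieces = [[0, 1], [2, 3], [4, 5], [6, 7]]
scattered = {r: pieces[r] for r in range(4)}

print(reduced, scattered[2])
```

In real mpi4py code each rank sees only its own buffer, of course; the lists here just show which values end up where.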

In the C MPI library there are 130+ communication functions, and mpi4py exposes a comparable set.

## Point-to-point example

Here is a code (point2point.py) demonstrating point-to-point communication, sending a number to the left neighbour around a ring:

```python
from math import pi
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

left = rank - 1
if rank == 0: left = size - 1

right = rank + 1
if rank == size - 1: right = 0

# print rank, left, right

square = np.zeros(1)
result = np.zeros(1)
square[0] = float(rank)**2
comm.Send(square, dest = left)
comm.Recv(result, source = right)

print rank, result[0]
```


Exercise: write an MPI code for two processors in which each processor sends a number to the other one and receives a number. Use separate comm.Send() and comm.Recv() functions. Is their order important?

Exercise: write another code to compute \pi using a series, on an arbitrary number of processors, i.e., the code should work on any number of cores. Run it on 1, 2, 4, 8 cores and measure the speedup. Do you get 100% parallel efficiency?

Discussion: how would you parallelize the diffusion equation?

Discussion: how would you parallelize your own problem?

## Attending SciProg Organizers

### brunogrande commented Jun 7, 2016 (edited)

 @razoumov, when you get a chance, comment with the details relating to your workshop so I can update the "TBA" values. Or you can edit the issue directly.


### brunogrande commented Jun 24, 2016

 @razoumov, when you get a chance after your trip, could you confirm whether SSH is all that is necessary for your workshop? That would mean it's pre-installed on OS X and Linux and something like PuTTY needs to be installed on Windows computers.

### razoumov commented Jun 27, 2016

 @brunogrande Participants will need ssh (for Windows we recommend http://mobaxterm.mobatek.net but really any client will do) and a WestGrid account. You can find the details on how to obtain an account at https://www.westgrid.ca/support/accounts/getting_account . If attendees need a sponsor (who is usually their supervisor), please tell them to contact me, and I'll send them my CCRI which they need to fill in the application form. It would be best if all attendees get their accounts few days before the workshop.

### brunogrande commented Jun 28, 2016

 Thanks, @razoumov!

### lpix commented Aug 5, 2016

 @razoumov do you prefer attendees to contact you by e-mail or through this issue? I am sending out an e-mail friday afternoon-ish and I don't want to be the reason you get spammed in the future :)

### razoumov commented Aug 5, 2016

 @lpix They can contact me via my WestGrid email listed at https://www.westgrid.ca/about_westgrid/staff_committees/staff . Thanks!