## Lecture 1: Introduction to Parallel Computing for Data Science and Review of Python/Jupyter/Conda/Github

## Syllabus Review

### Comments about the 2022 revamp of the course

The course description:

>This course studies parallelism in data science, drawing examples from data analytics, statistical programming, and machine learning. It focuses mostly on the Python programming ecosystem but will use C/C++ to accelerate Python and Java to explore shared-memory threading. It explores parallelism at all levels, including instruction level parallelism (pipelining and vectorization), shared-memory multicore, and distributed computing. Concepts from computer architecture and operating systems will be developed in support of parallelism, including Moore’s law, the memory hierarchy, caching, processes/threads, and concurrency control. The course will cover modern data-parallel programming frameworks, including Dask, Spark, Hadoop!, and Ray. The course will not cover GPU deep-learning frameworks nor CUDA. The course is suitable for second-year undergraduate CS majors and graduate students from other science and engineering disciplines that have prior programming experience. [Systems]

This course replaces _Parallel Programming_ as it was taught from 2013--2021. It shares the same course number
and you cannot receive credit for both.
The new syllabus changes the focus of the course:
  * Examples will be drawn from machine learning and data science as much as possible.
  * The course will not cover GPU programming, GPUs, or machine learning frameworks, such as TensorFlow, Keras, and PyTorch.
  * Supercomputing and scientific/numerical applications will be deemphasized.
  * The course adds material on instruction level parallelism, including pipelining and vectorization.

Most importantly, the term "Programming" has been replaced with "Computing", which reflects the natural
evolution of the field toward the architecture and systems aspects of the subject matter and away from
programming patterns and parallel design.
        
### Course structure and grading

* 30% activities
  * These are programming exercises. The goal is to learn different programming environments through hands-on experience. Activities combine a parallel programming framework with an architecture/platform and use the programming tasks to explore principles.
  * The goal is for every activity to be completed. Activities have due dates and must be turned in.
  * Grading determines whether the solution meets or does not meet the learning goal.

> The course consists of six programming activities that span one to two weeks of course time. Activities will be graded for completion of the assignment. Activities that are incomplete or do not fulfill the stated objectives may be resubmitted with permission from the instructor. The goal of the activities is for the student to gain skills with the algorithms, programming tools, and principles presented in the class. Credit for the assignment does not depend on providing correct answers to each question. Answers that are incorrect or programs that do not meet the assignment objectives will either (1) be marked as incorrect to provide feedback to the student or (2) be returned to the student for resubmission. Every student will have the opportunity to receive all credit for all activities. Activities make up 30% of the course grade.

* 60% in-person timed exams: two midterms and one final, each worth 20% of the grade.

* 10% in-course activities. This includes brainstorming, paper discussions, team exercises. Attendance is not required, but credit for this portion of the grade can only be accumulated during the course.

The final letter grades do not depend solely on the achievement of a target score over all assignments and exams. Grades will be determined based on the achievement of learning goals. The course staff will determine a map of total scores to grades at the end of the semester. This policy lets instructors account for variance in exam scores, specifically when the exam scores are lower than intended or expected by the instructors. Grades will start with the following guidelines:

* 93.3% or more -> A
* 90% - 93.3% -> A-
* 86.6% - 90% -> B+
* 83.3% - 86.6% -> B
* 80% - 83.3% -> B-
* less than 80% -> TBD based on evidence of learning

The instructors may choose to move the grade boundaries down, e.g., move the A- threshold from 90% to 87%, based on how well the course realized its learning goals. We will not move the thresholds up.
        
### The books

* Mattson -- Patterns for Parallel Programming
  * good, simple treatment
  * patterns are a powerful concept
  * outdated: old architectures, focus on high-performance computing (HPC)
* Matloff -- Parallel Computing for Data Science
  * more modern examples and good concepts
  * written from a statistics, rather than CS point-of-view (so, not sophisticated enough)
  * unreadable examples because of R
* Herlihy -- The Art of Multiprocessor Programming
  * narrow book on concurrency control algorithms
  * a small part of the course
  * sophisticated, deep treatment
      
Lectures will not follow any of the books in a sequential manner. The schedule will reference materials in these books that support the lectures.
    
### Ethics and collaboration

All recent offerings of this class have had major violations of academic integrity. The policy of this course is:

<div class="alert alert-danger">
Students are encouraged to consult with each other and even collaborate on all programming assignments. This means that students may look at each other’s code, pair program, and even help each other debug. However, you must write your own code. You cannot copy and paste from external sources. If you work with a partner or team, you must cite the collaboration in a comment in your code. Additionally, each assignment also involves questions that analyze the assignment and connect the program to course concepts. The answers to these questions must be prepared independently by each student and must be work that is solely their own.
</div>

To be very explicit, things that you cannot do:
  * copy and paste code from the Internet, from prior solutions, or from your classmates
  * share or read other people's answers to questions on activities or exams
  * collaborate with others or use external sources without citing them in your code

You will turn in all your code.
 
The course intends to replicate a "real world" programming environment in which you have access to teams, peers, and Stack Overflow. Use any resources that you want to solve your programming problem. In this environment, copying and pasting code is either (1) OK because the license allows it or (2) a copyright or license violation. The instructors want to avoid the latter, and therefore we forbid any copying and pasting of code.
    
### Python Requirement

The course asks you to program in Python. It is understood that this is not necessarily a programming language in which you have had formal instruction. Many students have learned Python as they go. Many other students already have some experience with Python. We intend to make this course about the parallel aspects of programming and not the Python language. Every attempt is made to minimize the need for knowledge of Python.

If you are not comfortable with programming in Python or investing extra time in learning the language, I recommend that you drop the course and take the 1-unit Python adapter course _EN.500.133 Bootcamp: Python_. This is mostly a disclaimer. If you are facile with Java, you should be able to adapt to the Python needed for this course with a modicum of effort.



## What is Parallel Programming?

### Motivation for this Class

* Parallelism is everywhere!
    * Multicore, GPU, cloud, HPC, ML
    * Every program/programmer needs to address it
* Traditional CS curriculum totally misses the point
    * Model the world as serial algorithms
    
### Who should take this course?
 
* Designed for Undergraduates in CS: 
    * quick lift of skills good for employers and internships
* Suitable for graduate students in Science and Engineering
    * Minimize dependencies on other CS courses
    * Self-contained treatment of OS and architecture
* Engineering and programming approach
    * Mostly ignore the theory of parallel computation
    * Focus on how programming languages interact with hardware architecture (particularly the memory system)
    
### What does a computer look like?

* Turing machine 

![Turing Machine Cartoon](https://i0.wp.com/www.worldofcomputing.net/wp-content/uploads/2013/01/turingMachine.gif?zoom=2&resize=400%2C274 "Cartoon")

* Universal Turing Machine

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Universal_Turing_machine.svg/1600px-Universal_Turing_machine.svg.png" width="512" title="Universal TM" />

* Von Neumann Architecture

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Von_Neumann_Architecture.svg/2880px-Von_Neumann_Architecture.svg.png" width="512" title="Von Neumann Architecture" />    

#### Properties of "Computers"

* Sequential processing
    * Control or logic flow  
* Algorithm costs measured in this model
    * Big-O notation counts number of sequential steps
* This is the basis for the CS curriculum
    * And it’s just wrong
    * Computers are not sequential and performance is more nuanced than counting the number of steps 
* We look at computers as parallel entities
    * Do many tasks concurrently
    * Tasks interfere with each other
    * More accurately reflects hardware and bottlenecks
* What about parallel computation models?
    * Exist but not useful, because reality collides with the abstraction
    * The following schematic shows a PRAM (Parallel RAM) computational model. We will not discuss it.
 
<img src="https://d3e8mc9t3dqxs7.cloudfront.net/wp-content/uploads/sites/11/2016/02/PRAM-Model.png" width="512" title="PRAM" /> 
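To make the point about step counting concrete, here is a minimal sketch (not part of the original lecture materials). It sums the same grid in two traversal orders. Both orders perform exactly the same number of additions, so a step-counting cost model treats them as identical, yet the measured times can differ because the memory access patterns differ. The helper names `make_grid`, `sum_grid`, and `timed` are hypothetical and exist only for this illustration. The gap in pure Python is typically modest and machine dependent; it is usually more pronounced with contiguous arrays in C or NumPy.

```python
import time


def make_grid(n_rows, n_cols):
    """Build a list-of-lists grid where grid[i][j] holds i * n_cols + j."""
    return [[i * n_cols + j for j in range(n_cols)] for i in range(n_rows)]


def sum_grid(grid, row_major=True):
    """Sum every element and return (total, steps).

    Both traversal orders perform len(grid) * len(grid[0]) additions,
    so they cost the same under a step-counting model.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    total = 0
    steps = 0
    if row_major:
        # Inner loop walks along one row at a time.
        for i in range(n_rows):
            for j in range(n_cols):
                total += grid[i][j]
                steps += 1
    else:
        # Inner loop jumps to a different row on every access.
        for j in range(n_cols):
            for i in range(n_rows):
                total += grid[i][j]
                steps += 1
    return total, steps


def timed(grid, row_major):
    """Run sum_grid once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = sum_grid(grid, row_major)
    return result, time.perf_counter() - start


grid = make_grid(400, 400)
(row_total, row_steps), row_seconds = timed(grid, row_major=True)
(col_total, col_steps), col_seconds = timed(grid, row_major=False)

# Same answer and same step count; only the wall-clock time may differ.
print(f"row-major:    {row_steps} steps in {row_seconds:.4f} s")
print(f"column-major: {col_steps} steps in {col_seconds:.4f} s")
```

Try increasing the grid size and rerunning the cell a few times. Timings vary from run to run, so treat any single measurement with caution.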



## Python/Jupyter/Conda/Github (Running the Lectures)

(Almost) all lectures will be presented in Jupyter notebooks. Students are encouraged to run the notebooks interactively during the lectures to follow examples. We will often vary code during class to examine the effects of parameters, identify and resolve bugs, etc.

The tools we are going to use for lectures:
  * python -- interpreted programming language favored by CS-oriented data scientists and ML folks
  * conda -- package manager and configuration environment for Python
  * jupyter lab -- literate programming environment that mixes code and markdown
  * github -- cloud-based repository management for sharing stuff

To get here, we need to do the following:

1. Install python: 
  - For Windows, I recommend using WSL (Windows Subsystem for Linux). I am running Ubuntu 20.04.2 LTS, which comes with Python 3.8.
  - For MacOSX, 
  
2. Install conda:
  - Anaconda prefers that you download an installer from https://www.anaconda.com/products/individual
  - If you are using WSL, you need to download and run the Ubuntu installer

3. Create and activate a conda virtual environment
```
conda create -n pp
conda activate pp
```
<br>

4. Install the python packages
```
conda install numba scikit-learn scikit-image jupyterlab pip matplotlib
```
<br>


5. Clone the github repository and move into the directory
```
git clone https://github.com/randalburns/pplectures2022.git
cd pplectures2022
```
<br>

6. Run jupyter lab
```
jupyter lab
```
<br>
and launch this notebook.
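
Once the notebook is running, a quick sanity check of the environment can be sketched as follows. This is not part of the original setup instructions. The helper `check_packages` and the list `COURSE_MODULES` are hypothetical names introduced for this illustration. The sketch assumes the packages from the `conda install` step and only tests whether each module can be located, without importing it.

```python
import importlib.util


def check_packages(module_names):
    """Return {module_name: bool} indicating whether each module can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# Import names for the conda packages installed above. Note that
# scikit-learn imports as "sklearn" and scikit-image as "skimage".
COURSE_MODULES = ["numba", "sklearn", "skimage", "jupyterlab", "matplotlib"]

for name, found in check_packages(COURSE_MODULES).items():
    print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

If any line reports `MISSING`, the most likely cause is that the notebook is running outside the `pp` environment, so check that `conda activate pp` was run before `jupyter lab`.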

