# Before we start
You've got two good ways of doing all the course material.  The first way is the simplest to get going, but is a bit less flexible, especially down the road. It just needs a web browser and a Google account. Download the course materials onto your Google Drive and you can run everything for the course in a web browser. Just head to [https://colab.research.google.com/](https://colab.research.google.com/) and you'll run everything on one of Google's servers.

The second way involves installing two bits of software on your machine. If you actually move forward with Python in your work, you'll likely want to do this anyway. 
- [Anaconda](https://www.anaconda.com/products/distribution): Python + data science packages + R + more *or* [Miniconda](https://docs.conda.io/projects/miniconda/en/latest/miniconda-other-installer-links.html). For Mac folks, you may want to try [Miniforge](https://github.com/conda-forge/miniforge/releases).  Which of these you pick really doesn't matter. "Conda" is just a nice way to access Python modules / packages and Anaconda is a one-stop shop that bundles all you'll need. You only need their free license. 
- [VSCode](https://code.visualstudio.com/Download): A ridiculously good and powerful editor + coding environment.  It plays very well with Jupyter notebooks. 

If you really want, you can run in anything that has Python3 + Numpy / Scipy / Pandas and that supports Jupyter notebooks. If you don't know what any of that means or aren't wedded to some setup, give the VSCode + Anaconda setup a go. It's the easiest setup I've found.  If you do go the Miniconda or Miniforge route, you'll start out with a pretty bare setup.  Just type `conda install scipy pandas seaborn icecream ipykernel` into the terminal and it'll install what you need for the course.
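
If you'd like to confirm the install worked, a quick check along these lines can be run in a notebook cell or a Python prompt. It's only a sketch: the package list simply mirrors the `conda install` command above, and `packages` and `installed` are names made up here for illustration.

```python
import importlib.util

# The packages named in the conda install command above
packages = ["scipy", "pandas", "seaborn", "icecream", "ipykernel"]

# find_spec returns None when Python cannot locate a package
installed = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, found in installed.items():
    print(name, "found" if found else "MISSING")
```

Anything reported as missing can be added by re-running the `conda install` command for just that package.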

## Course material
The course material is all located on [GitHub](https://github.com/celstark/n247data). Of course, you've got this here, so you must already know that. If you don't know how git works and don't care to, just download the `.zip` via the green _Code_ button. If you're interested in git / GitHub, but don't want to spend 20 hours learning the difference between forks and branches, just tell **VSCode** to clone the Git repository (I told you it was awesome) or download [GitHub Desktop](https://desktop.github.com/). 

# Introduction

These notes are based on Chapter 1 of *Dive Into Python*.  It's an old text now ([Python for Everybody](https://eng.libretexts.org/Bookshelves/Computer_Science/Programming_Languages/Python_for_Everybody_(Severance)) is a good option these days if you're looking for a text). But the intro here is a nice place to start, as it hits a number of key concepts.


## Concepts
*   Functions & parameters & return values
*   Documentation
*   Indentation
*   Control flow
*   Variables
*   Loops
*   Exceptions

## Mechanics and Practicals
* Using the Jupyter Notebooks in Google Colab or VSCode / Anaconda to run existing code
* Using the Scratch cell (Google Colab)
* _Don't worry_ - you're not editing my version of this file

## Colab
If you got this notebook from a Google Share link, you'll notice it says up top that it can't save any changes you make here.  That's fine for now.  You're not actually working on my copy of this, but rather a temporary copy unique to you.  But, for good practice, click on the **Copy to Drive** text just below the menu bar.  This will save it to your Google Drive so you can come back to this at any point.

Now, below is the `humansize.py` file from *Dive Into Python*.  Hit the "Play" button in the upper left corner of it (aka _Run cell_ or *Execute cell*), scroll down and look for the output.


# ChatGPT, zotGPT, and AI tools
You may have heard that the large language models of things like ChatGPT can write code for you. You may be tempted to just pass in all the assignments here and let it do them for you.  I've tried it on several of the assignments here and sure, it can nail the first couple of them. Past that though, you start to run into issues.

Is this me saying "Don't trust ChatGPT!"?  Well yes, and no.  Is it "Don't use ChatGPT!"?  Again, yes and no.  Flash back to the days of yore - maybe 2020. What did students do in the course when they were stuck?  They asked another student or Google and found sourceforge pages that were at least kinda close. They used those results to help them understand the syntax or work through one hurdle and that then helped them move forward.  In effect, they had a reference or a tutor.  *Back in my day...*, sure we didn't have sourceforge. Really, the internet was invented when I was in grad school.  But, I had, and still have on my bookshelf, texts like *Numerical Recipes in C*.  It was an algorithm cheat code and we all turned to it.


We're nowhere near the point where you could say to an LLM, "Write a program that will ingest any DICOM files it finds, figure out where any personal health information is stored in there, strip that out and save it in another location to anonymize all my data." or "Write a program that will take all my data and see if there's any structure in it I might have missed and then see if any amount of structure it finds would happen readily by chance."  We're scientists. We're doing things that haven't been done before.  AI models are based on knowledge on the web - **existing knowledge**.  Parts of what you need to do may well have been solved. Small parts surely have.  AI tools are going to be great coaches for things like "How can I make an N-dimensional array of normal random numbers using numpy".  Here's what zotGPT returns on that:

---

You can create an N-dimensional array of normal random numbers using the `numpy.random.normal` function in Numpy. Here's a general way of doing this:

```
import numpy as np

# Dimensions of the array
dims = (dim1, dim2, ..., dimN)  # Replace dim1, dim2, ..., dimN with the size along each dimension.

# Parameters for the normal distribution
loc = 0.0  # Mean (center) of the distribution
scale = 1.0  # Standard deviation (spread or width) of the distribution

# Creating the N-dimensional array
array = np.random.normal(loc, scale, dims)
```

For example, if you want to create a 3-dimensional array of normal random numbers with dimensions 2x3x4, you would do:

```
import numpy as np

dims = (2, 3, 4)
loc = 0.0
scale = 1.0

array = np.random.normal(loc, scale, dims)
```


The `array` will be a 3-dimensional array filled with numbers drawn from a normal distribution with mean `loc` and standard deviation `scale`. The shape of the array is defined by `dims`.

---

So, think of these tools as coaches.  Feed them small bits when you're frustrated or don't understand some part of the problem.  But, how to break up your problem into smaller pieces, how to structure it, etc. is going to be something that you're going to need to learn to do no matter what.

# Sample Python program

```
'''Convert file sizes to human-readable form.

Available functions:
approximate_size(size, a_kilobyte_is_1024_bytes)
    takes a file size and returns a human-readable string

Examples:
>>> approximate_size(1024)
'1.0 KiB'
>>> approximate_size(1000, False)
'1.0 KB'

'''

SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
            1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}

def approximate_size(size, a_kilobyte_is_1024_bytes=True):
    '''Convert a file size to human-readable form.

    Keyword arguments:
    size -- file size in bytes
    a_kilobyte_is_1024_bytes -- if True (default), use multiples of 1024
                                if False, use multiples of 1000

    Returns: string

    '''
    if size < 0:
        raise ValueError('number must be non-negative')

    multiple = 1024 if a_kilobyte_is_1024_bytes else 1000
    for suffix in SUFFIXES[multiple]:
        size /= multiple
        if size < multiple:
            return '{0:.1f} {1}'.format(size, suffix)

    raise ValueError('number too large')

print(approximate_size(1000000000000, False))
print(approximate_size(1000000000000))

# Copyright (c) 2009, Mark Pilgrim, All rights reserved.
#
# Redistribution and use in source and binary forms, with or without modification,
# are permitted provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright notice,
#   this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright notice,
#   this list of conditions and the following disclaimer in the documentation
#   and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS'
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
```



What happened when you hit play (aka "run cell" aka Shift-Enter)?  It should have given an output that said:
```
1.0 TB
931.3 GiB
```
But, what happened internally?  Some host machine (e.g., Google Colab or your own machine) has a version of Python running and when you ran the cell, it "defined a function" called *approximate_size()*, and then called that function twice.  It called that function once "passing in" two values: `1000000000000` and `False` and a second time passing in just the `1000000000000`.

In memory now, on that host, is this function.  So, once you've run this cell, that function, *approximate_size* exists as a command you can use.  Right now, in the code block below here, type in something like `approximate_size(123858)` and hit the run button.  **Try this for a few values**.

**Now, on a blank line** there type _app_ and pause for a second. See how it knows about _approximate_size_? If you're in Colab, click on the little circled i and watch it show you some helpful bits about the function.  If you're in VSCode, click the ">" for the same effect. Now (and you may need to restart typing the function name), type _app_**TAB** and watch it auto-complete the name of the function for you.

_(Note, if you need to get it to show you the completions again, it's Ctrl-Space)_

# Unpacking this code

Let's trace through this code a bit:

**def line - defining a function...**
```
def approximate_size(size, a_kilobyte_is_1024_bytes=True):
```

*   function name - can be whatever you want
*   takes parameters
    *   Mandatory
    *   Optional w/default value
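
To make the mandatory vs. optional distinction concrete, here's a small sketch. The `greet` function and its `excited` parameter are made up for illustration (they're not from the book); the pattern is the same one `approximate_size` uses for `a_kilobyte_is_1024_bytes`.

```python
def greet(name, excited=False):
    # name is mandatory; excited is optional and defaults to False
    ending = '.'
    if excited:
        ending = '!'
    return 'Hello, ' + name + ending

print(greet('Ada'))                # only the mandatory parameter
print(greet('Ada', True))          # optional parameter passed by position
print(greet('Ada', excited=True))  # optional parameter passed by name
```

Leaving out `name` entirely (a bare `greet()`) would raise a `TypeError`, since Python has no default to fall back on.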

**Comment / description - just what does this do**?
```
    '''Convert a file size to human-readable form.

    Keyword arguments:
    size -- file size in bytes
    a_kilobyte_is_1024_bytes -- if True (default), use multiples of 1024
                                if False, use multiples of 1000

    Returns: string

    '''
```

*   Note the triple-quote. If you want text - called a string - to span multiple lines, you use the triple-quote
*   A string you put right after the def line becomes the documentation for that function - called the docstring. 
*   **Right now - in your Scratch window type** `help(approximate_size)`
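
Here's a minimal sketch of the same idea on a tiny function. The `double` function is hypothetical, just something short enough to see the whole picture at once.

```python
import inspect

def double(x):
    '''Return twice the value of x.

    Works on anything that supports multiplication by 2.
    '''
    return 2 * x

# The docstring is stored on the function itself
print(double.__doc__)

# inspect.getdoc gives the same text with the indentation cleaned up
first_line = inspect.getdoc(double).splitlines()[0]
print(first_line)
```

Running `help(double)` in a scratch cell shows this same text, along with the function's name and parameters.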

**Indentation**

*   Notice how everything inside the "def" bit is indented and that `print(approximate_size...` bit is back on the left margin?
*   Indentation is huge in Python - it says what belongs with what and under what. So, all that is part of the function
*   Python knows that the `def approximate_size(...` through the `raise ValueError...` all are part of one block and that the `print` starts something else because of the indentation (other languages use "end" statements, squiggly brackets, etc).
*   Note the colon here though at the end of the `def` line - that's basically saying an indent is coming - it's another block of code we're going into. It'll make a bit more sense later
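
A small sketch of how indentation nests blocks inside blocks. The `count_positive` function is made up for this illustration; each colon opens a new block, and each block sits one level deeper than the line that opened it.

```python
def count_positive(numbers):
    total = 0                # one level in: part of the function
    for n in numbers:
        if n > 0:
            total += 1       # three levels in: inside def, for, and if
    return total             # back to one level: runs after the loop ends

result = count_positive([3, -1, 4, 0])   # left margin: not part of the function
print(result)
```

If you indented the `return total` line one more level, it would become part of the loop and the function would bail out after looking at only the first number.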

**if-statement**
```
    if size < 0:
        raise ValueError('number must be non-negative')
```

*   Control flow - check if something is the case. If so, do X. If not, do Y.
*   if - elif - else
*   What is _size_? Where did it come from?
    *   Concept of a variable
    *   Note size is only known inside this function. You could have called it _foo_ or _s_ or _ILikePie_. Don't do that though.
*   Note the indent
*   Note the colon
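
As a sketch of the full if - elif - else pattern, here's a hypothetical `describe_size` function (the name and the 1000-byte cutoff are invented for illustration). Only the first branch whose test is true gets run.

```python
def describe_size(size):
    if size < 0:
        # same check as in approximate_size
        raise ValueError('number must be non-negative')
    elif size < 1000:
        label = 'small'
    else:
        label = 'large'
    return label

print(describe_size(10))
print(describe_size(5000))
```

Here too, `size` and `label` exist only while the function is running; code outside the function can't see them.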

**conditional assignment**
```
multiple = 1024 if a_kilobyte_is_1024_bytes else 1000
```
This is a fun bit in Python.  You can do fancy things like this, but you needn't get this fancy.  We could do a simpler version like this:

```
multiple = 1000
if a_kilobyte_is_1024_bytes:
    multiple = 1024
```

Keep in mind, there are many ways to solve any problem - neither is at all wrong. But in the version in the code, _multiple_ gets set based on the outcome of that `if`.
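
If you want to convince yourself that the two versions really do the same thing, a sketch like this wraps each one in its own function (the function names are made up here) and compares them:

```python
def pick_multiple_short(a_kilobyte_is_1024_bytes):
    # one-line conditional assignment, as in the original code
    multiple = 1024 if a_kilobyte_is_1024_bytes else 1000
    return multiple

def pick_multiple_long(a_kilobyte_is_1024_bytes):
    # the simpler spelled-out version
    multiple = 1000
    if a_kilobyte_is_1024_bytes:
        multiple = 1024
    return multiple

for flag in (True, False):
    print(flag, pick_multiple_short(flag), pick_multiple_long(flag))
```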

**for-loop**
```
    for suffix in SUFFIXES[multiple]:
        size /= multiple
        if size < multiple:
            return '{0:.1f} {1}'.format(size, suffix)
```
We're going to iterate some number of times and each time, _suffix_ will get a different value.  What values? Whatever is in _SUFFIXES_.

Well, whatever is in one part of _SUFFIXES_ - the list stored under its 1000 or 1024 key

So, let's say we have _multiple_=1000. The first time, _suffix_ is 'KB'.  The next time it will be 'MB', etc.

Let's say _size_ = 1,000,000 (1 million).  



*   The first time in the loop we hit that `/=` sign (aka `size = size / multiple`), _size_ is now 1,000
*   if-statement -- is `1000 < 1000`? Nope
*   Loop back to the start
*   _suffix_ is now 'MB'
*   _size_ becomes 1
*   is `1 < 1000`?  Yup
*   returns this goofy thing

What does return mean? Returns - jumps out of this function we're doing here. So we bail on the loop and anything after that and we return a value - in this case that string.
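
You can watch this trace happen by adding a `print` inside the loop. This is just a sketch: `traced_size` is a made-up, stripped-down copy of the loop that only handles multiples of 1000, and the print line is the only real addition.

```python
SUFFIXES_1000 = ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']

def traced_size(size, multiple=1000):
    for suffix in SUFFIXES_1000:
        size /= multiple
        # show the suffix, the new size, and the result of the if-test
        print(suffix, size, size < multiple)
        if size < multiple:
            # return jumps out of the loop and the function right here
            return '{0:.1f} {1}'.format(size, suffix)
    raise ValueError('number too large')

print(traced_size(1000000))
```

With 1,000,000 passed in, the loop runs twice: the test is false for 'KB' and true for 'MB', so the 'GB' pass never happens.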

Now, we've got a lot of odd stuff in there - It's a string formatter - we'll get to it later, but:



*   `0` and `1` are placeholders - they stand for the 0th and 1st things passed to `format` - size and suffix
*   The `.1f` thing - means floating-point number with one decimal place.  For example, try running this in a scratch cell: `'{0:.1f} {1}'.format(1.2341, 'duckies')`
*   But for the moment, forget all this fancy formatting ... we'll come back to it later.  Just know there's a lot you can do.
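
For a quick taste, here's a sketch using the duckies example from above. The f-string line is an alternative spelling of the same thing (it isn't used in the book's code) and needs Python 3.6 or later.

```python
size = 1.2341
suffix = 'duckies'

with_format = '{0:.1f} {1}'.format(size, suffix)   # one decimal place
two_places = '{0:.2f} {1}'.format(size, suffix)    # two decimal places
with_fstring = f'{size:.1f} {suffix}'              # same result as with_format

print(with_format)
print(two_places)
print(with_fstring)
```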

**raise statement**
```
    raise ValueError('number too large')
```
What if things go wrong?  Well, here in the code, if we've gotten past that loop and haven't returned some nice output, things have gone badly. It's good practice to have your code let other code know that some kind of problem exists.  Here, we're going to send out a big alert (called an exception).  There are multiple kinds of exceptions - here it's a problem with a value - _ValueError_.  The idea is that your code that might call this function may be able to handle the error gracefully - "I'm sorry Dave, I'm afraid I can't do that".

We'll get to try ... except bits later.  But the idea is it's just a way to let anyone else know - any other bit of code know - that the sticky brown stuff hit the rotating blades.
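
As a small preview, here's roughly what handling an exception looks like. This is only a sketch: `check_size` is a made-up function that borrows the negative-number check from `approximate_size`.

```python
def check_size(size):
    if size < 0:
        raise ValueError('number must be non-negative')
    return size

try:
    check_size(-5)
    outcome = 'no problem'           # skipped, because the line above raised
except ValueError as err:
    outcome = 'caught: ' + str(err)  # the calling code decides what to do

print(outcome)
```

Without the `try` / `except`, the same call would stop the cell and print a traceback.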


## Change from the book
The book has an if-statement here and puts the print statements inside that if-statement:
```
if __name__ == '__main__':
    print(approximate_size(1000000000000, False))
    print(approximate_size(1000000000000))
```

**_TL;DR: This is your "when I run this file, do this" section_**

Every time I've taught this course, this trips people up and so I changed his code.  There are deep reasons why he did this, but they apply a lot less to us and less still when using things like Jupyter notebook environments.
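
If you're curious, here's a minimal sketch of the idea, with a made-up `main` function. Python sets the variable `__name__` to `'__main__'` when a file is run directly (and in a notebook cell), but to the module's name when the file is imported by other code, so the guarded lines are skipped on import.

```python
def main():
    return 'running the demo'

# Shows '__main__' when run directly or in a notebook cell
print(__name__)

if __name__ == '__main__':
    # Only reached when this code is run, not when it is imported
    print(main())
```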


## One more thing

Clear your current Python machine by:

*Colab*: going to *Runtime, Restart runtime* (or close this window and open it again).  
*VSCode*: Click the *Restart* button above.

Now, in the cell below, try running `approximate_size(123456)` or something like that.  What happens?

It failed right?  It gave you something like
```
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-4a112952de59> in <module>()
----> 1 approximate_size(1231)

NameError: name 'approximate_size' is not defined
```

Try the _app_**TAB** trick from before.  That fails too, right? _Do you know why?_


Python is all about adding your own bits to the language - we created this function `approximate_size()`.  But, at this point, Python doesn't know about it - it's a clean, virgin copy of Python we're running.  We need to tell it - we need to let it read these statements like the `def` bits to learn this - to add it in - to create that function.  We've not done anything to execute or load our code.  Python has never read over those lines and executed them, so that new function hasn't been learned by this clean copy of Python.
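
The same thing can be shown in miniature. In this sketch, `shout` is a made-up function: calling it before Python has run its `def` fails, and calling it afterwards works.

```python
try:
    shout('hello')             # Python hasn't run the def below yet
except NameError as err:
    before = str(err)

def shout(text):
    return text.upper() + '!'

after = shout('hello')         # now the function exists

print(before)
print(after)
```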

Now, fix this by hitting the _Run cell_ icon on the code and verify that your scratch cell works again.