# Intro to Data Science Lecture 2 - Intro to Command Line, Python, & Jupyter Notebooks
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/*

Welcome to our first coding lecture! We will be using Python, a popular data science programming language, in the lectures, homeworks, and projects. As part of Homework 0, you should have already set up Python, IPython, and Jupyter notebooks, so it's time to get started!


# Intro to Command Line Interfaces

You don't need a graphical interface, a mouse or a touch interface, to interact with a computer. Command Line Interfaces (CLIs) are the much older way of interacting with computers. And while most casual computer users won't need to use a command line interface today, as soon as you start programming you will likely encounter CLIs on a regular basis. They can be very powerful and even easier to use for some tasks. 






## Types of CLIs and Launching CLIs

There are various types of command line interfaces. The biggest difference used to be between Microsoft CLIs (think MS-DOS, and more recently [PowerShell](https://en.wikipedia.org/wiki/PowerShell)) and UNIX (and hence Linux/Mac) interfaces. 

But UNIX-like interfaces are dominant, and can also be installed on Windows nowadays, so we will be focusing on those. 

If you are on a Mac, you can start the "Terminal" app now to follow along. 

On Windows, you can use the [Anaconda Prompt](https://docs.anaconda.com/anaconda/user-guide/getting-started/) that comes with your Anaconda installation. This will also have the Python environment set up correctly. 

On Linux, we assume you know what you're doing already. 




## Principles and First Steps

Once you have started your terminal app, you're greeted by a prompt that might look like this: 

```
fdf@ubuntu:~$ 
```
This tells us that we're running as user `fdf` and are currently in the directory `~`. Here `~` refers to your user's home directory, so it's a shorthand for `/home/fdf` for my specific computer and my specific username. 

The `$` is called a [command prompt](https://en.wikipedia.org/wiki/Command-line_interface#Command_prompt), and indicates that you can enter a command. Other common prompt symbols are `>`, `%`, and `#`; for our purposes there is no difference between these variants (by convention, `#` often indicates that you are logged in as the administrator, `root`). 

Most of what you do on a command line is running little applications or commands. For example, we can run the `ls` application inside the folder that contains this notebook: 

```
$ ls
02-basic-python.ipynb		 datasciencecat.jpg
02-exercises.ipynb		    exercise.py
02-version-control.ipynb	  first_steps.py
anaconda_navigator.png		newrepo.png
```

Above, everything on the line with the prompt `$` is the command, and the rest is the output of the command. `ls` stands for list, and it lists all the files and folders in the directory. 

Commands can have parameters. For example, we can use the `-l` parameter for ls, which produces a directory listing with more details: 

```
$ ls -l
total 1480
-rw-r--r--  1 little  staff   34512 Jan 11 16:47 02-basic-python.ipynb
-rw-r--r--  1 little  staff    3414 Jan 11 16:39 02-exercises.ipynb
-rw-r--r--  1 little  staff   15193 Jan 11 12:32 02-version-control.ipynb
-rw-r--r--  1 little  staff  549517 Jan 11  2024 anaconda_navigator.png
-rw-r--r--  1 little  staff   21395 Jan 11  2024 datasciencecat.jpg
-rw-r--r--  1 little  staff     146 Jan 11  2024 exercise.py
-rw-r--r--@ 1 little  staff     230 Jan 11  2024 first_steps.py
-rw-r--r--  1 little  staff  111784 Jan 11  2024 newrepo.png
```

Here we have info about the permissions, file size, when a file was last changed, and finally the file name. 

It is absolutely essential to understand the file hierarchy and how to navigate it. 

On a UNIX system, the file hierarchy starts at the root directory `/`; everything else is a file tree from there. 

We can navigate the file tree with the `cd` (change directory) command: 

`cd [target]`

Some shorthands: 

`.` is the current directory, so `cd .` doesn't do anything. 

`..` is one directory up the hierarchy, so if I am in the directory 

`~/2024-datascience-lectures/02-basic-python`

running 

`cd ..`

will move me to the

`~/2024-datascience-lectures`

directory. 
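
If you later want to reason about paths from Python rather than the shell, the standard library's `pathlib` module has a direct counterpart to `..`. The following is only a sketch: it assumes the example directory lives under `/home/fdf` (the home directory of the example user from the prompt above) and works on the path as text, without touching your file system.

```python
from pathlib import PurePosixPath

# The example directory from the text, written out as an absolute path.
# "/home/fdf" is an assumption: it is the home directory of the example user.
current = PurePosixPath("/home/fdf/2024-datascience-lectures/02-basic-python")

# `.parent` is the directory one level up, i.e., where `cd ..` would take us.
one_up = current.parent
print(one_up)

# `.name` is the last component of the path (the directory we were in).
print(current.name)
```

Because `PurePosixPath` only manipulates the path string, this runs the same way on Windows, Mac, and Linux.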

I can create new directories with the `mkdir` command, e.g., 

`$ mkdir testdirectory`

will create a new directory/folder called testdirectory as a subdirectory of the current directory. If we list the directories now we can see the new directory: 

```bash 
$ ls -l
total 1480
-rw-r--r--  1 little  staff   34512 Jan 11 16:47 02-basic-python.ipynb
-rw-r--r--  1 little  staff    3414 Jan 11 16:39 02-exercises.ipynb
-rw-r--r--  1 little  staff   15193 Jan 11 12:32 02-version-control.ipynb
-rw-r--r--  1 little  staff  549517 Jan 11  2024 anaconda_navigator.png
-rw-r--r--  1 little  staff   21395 Jan 11  2024 datasciencecat.jpg
-rw-r--r--  1 little  staff     146 Jan 11  2024 exercise.py
-rw-r--r--@ 1 little  staff     230 Jan 11  2024 first_steps.py
-rw-r--r--  1 little  staff  111784 Jan 11  2024 newrepo.png
drwxr-xr-x  2 little  staff      64 Jan 11 16:48 testdirectory
```

I can copy files with the `cp` command: 

```bash 
cp newrepo.png newrepo-copy.png
```

Here I tell `cp` to copy the file `newrepo.png` and give the copy the name/destination `newrepo-copy.png`. 

I can then move a file into a different directory: 

```bash 
mv newrepo-copy.png testdirectory
```

Note that pressing the "TAB" key on your keyboard will auto-complete filenames in most CLI implementations. 

If we then run: 

```bash 
$ ls testdirectory
newrepo-copy.png
```

We can see that the listing of `testdirectory` contains our new file. 

We can remove files with `rm`: 

```bash 
$ rm testdirectory/newrepo-copy.png
```

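We can remove a folder and all of its contents (including subfolders) with the `-r` (recursive) parameter. Be careful: files removed with `rm` are not moved to a trash folder, so they cannot easily be recovered. 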

```bash 
$ rm -r testdirectory
```
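
The same sequence of steps (`mkdir`, `cp`, `mv`, `ls`, `rm -r`) can also be scripted in Python. The sketch below is one possible way to do it with the standard library modules `pathlib`, `shutil`, and `tempfile`. It works inside a temporary directory and uses a placeholder file instead of the real `newrepo.png`, so nothing on your machine is changed.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing on your machine is touched.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)

    # Placeholder standing in for newrepo.png (the real image is not needed here).
    original = workdir / "newrepo.png"
    original.write_bytes(b"placeholder")

    # mkdir testdirectory
    testdirectory = workdir / "testdirectory"
    testdirectory.mkdir()

    # cp newrepo.png newrepo-copy.png
    copy = workdir / "newrepo-copy.png"
    shutil.copy(original, copy)

    # mv newrepo-copy.png testdirectory
    shutil.move(str(copy), str(testdirectory))

    # ls testdirectory
    listing = sorted(p.name for p in testdirectory.iterdir())
    print(listing)

    # rm -r testdirectory
    shutil.rmtree(testdirectory)
    removed = not testdirectory.exists()
    print(removed)
```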

### Basic Commands

Here are a few other basic commands: 

`touch file.txt` creates a new, empty file called `file.txt` (if it doesn't already exist).

`echo "Test"` writes "Test" to the command line.

`cat file.txt` prints the content of the file. 

`pwd` stands for Print Working Directory, i.e., it shows you which directory you're in.

`man [cmd]`, where `[cmd]` is any command, will drop you into the man (manual) pages, where you can find information about that command. Try `man ls`. You exit by pressing `q`. 

### Piping and Redirecting

The Unix philosophy is to have small programs do one job well and forward results between different programs with piping. We're not going into details here, but here's a simple example: 

`$ ls | grep ipynb`

[`grep`](https://man7.org/linux/man-pages/man1/grep.1.html) is a fairly complex command that you can use to match patterns from any input. Here we're asking grep to extract text that contains `ipynb`. 

`ls` lists directories. The pipe `|` character feeds the output of the `ls` command to the `grep` command. So if you do this in this directory, you will identify all files that contain the `ipynb` string: 

```bash
~/2024-datascience-lectures/02-basic-python $ ls | grep ipynb
02-basic-python.ipynb
02-exercises.ipynb
02-version-control.ipynb
```
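
To connect this to Python: the filtering that `grep ipynb` does here amounts to keeping every line that contains a given piece of text. A minimal sketch of that idea, using the file names from the directory listing shown earlier as a hard-coded list (so it doesn't depend on which folder you are in):

```python
# File names copied from the directory listing shown earlier.
filenames = [
    "02-basic-python.ipynb",
    "02-exercises.ipynb",
    "02-version-control.ipynb",
    "anaconda_navigator.png",
    "datasciencecat.jpg",
    "exercise.py",
    "first_steps.py",
    "newrepo.png",
]

# Keep only the names that contain the text "ipynb", as `grep ipynb` does.
matches = [filename for filename in filenames if "ipynb" in filename]

for filename in matches:
    print(filename)
```

Note that `grep` can match far more general patterns than a plain substring; this sketch only covers the simple case used above.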

Redirecting is similar, though you're passing the output of a command to a file:
```bash
$ echo "Hello World" >> test.txt
```

This will create a new file and write "Hello World" to that file. You can check with `cat`: 
```bash
$ cat test.txt
Hello World
```

`>>` appends the output to the end of the file if it already exists (and creates the file otherwise). You can use `>` instead to overwrite the file. 
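
Python makes the same distinction when you open a file for writing: mode `"a"` appends, roughly like `>>`, and mode `"w"` overwrites, roughly like `>`. The sketch below writes to a `test.txt` inside a temporary directory, so it won't touch any file you created on the command line.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "test.txt"

    # Mode "a" appends, like `>>`; the file is created if it does not exist yet.
    with open(target, "a") as f:
        f.write("Hello World\n")
    with open(target, "a") as f:
        f.write("Hello World\n")
    after_append = target.read_text()

    # Mode "w" overwrites whatever was in the file, like `>`.
    with open(target, "w") as f:
        f.write("Hello World\n")
    after_overwrite = target.read_text()

# repr() makes the line breaks visible as \n
print(repr(after_append))
print(repr(after_overwrite))
```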

### Running Programs

You can think of `ls`, `echo`, etc. as real little programs (though in practice some of them, such as `echo`, are built into the shell). But you can also run proper programs from a CLI. For example, you can run `git`, or `python`, or `jupyter notebook`. And we'll do all of this next.

These programs might not return immediately, like the ones we had before, but might keep running until you terminate them. For example, `jupyter notebook` will start a server and redirect you to your browser. You can terminate a program on the shell by pressing "Ctrl + C". 

#### GIT

Here we'll only look at a few basic commands; check out the version control notebook for details. 

To clone (make a copy of) a repository, navigate to where you want that repository stored and run: 

```bash
git clone https://github.com/datascience-course/2024-datascience-homework
```

To get updates on the homeworks navigate into that directory and run: 
```bash
cd 2024-datascience-homework
git pull 
```

This is roughly what you should see (the sample output below was captured from a different repository, so the file names will differ): 

```bash
$ git pull
remote: Enumerating objects: 65, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 59 (delta 33), reused 4 (delta 3), pack-reused 0
Unpacking objects: 100% (59/59), done.
From https://github.com/visdesignlab/visdesignlab.github.io
   dd83bbd..2f65ea8  master     -> origin/master
Updating dd83bbd..2f65ea8
Fast-forward
 _persons/miholjcic.md              |  35 +++++++++++++++++++++++++++++++++++
 _persons/ssiu.md                   |  46 ++++++++++++++++++++++++++++++++++++++++++++++
 _persons/zcutler.md                |   6 +++---
 assets/images/people/miholjcic.jpg | Bin 0 -> 7758 bytes
 assets/images/people/ssiu.jpg      | Bin 0 -> 21360 bytes
 5 files changed, 84 insertions(+), 3 deletions(-)
 create mode 100644 _persons/miholjcic.md
 create mode 100644 _persons/ssiu.md
 create mode 100644 assets/images/people/miholjcic.jpg
 create mode 100644 assets/images/people/ssiu.jpg
```

Unfortunately, one of the biggest drawbacks of Jupyter notebooks is that they aren't great for version control. If you change anything in a notebook, and we make an update later, you will get a conflict in that file.

There are ways to deal with conflicts that are better in general, but for Jupyter notebooks we recommend simply making a copy of the file you changed with your file browser (if you want to keep your changes) and pulling again. 

If you don't want to keep your changes, you can run the git checkout command: 

```bash 
$ git checkout HW1/HW1.ipynb
```

This will overwrite your local changes to that file with the last committed version in your local copy of the repository. You can then pull again to get our updates. 


For an introduction with more background, please refer to the [02-version-control.ipynb](02-version-control.ipynb) notebook. We will cover this if we have time at the end of the lecture or some time in the future. 

## Writing code in a file

Let's look at another way to run Python: by executing a file. If you are still in the interactive environment, exit it by calling the exit function:

```python
exit()
```

Now, open up your favorite text editor (if you don't have one, check out, e.g., [Sublime](https://www.sublimetext.com/)) and create a new file called "first_steps.py". We've created such a file for you [here](first_steps.py).

You can also copy and paste this code into the file:

```python
def double_number(a):
    # btw, here is a comment! Use the # symbol to add comments or temporarily remove code
    # shorthand operator for 'a = a * 2'
    a *= 2
    return a

print(double_number(3))
print(double_number(14.22))
```

Here we've also defined our first function! We'll go into details about functions at a later time. For now, just notice that the indentation matters!

Now, run

```bash
$ python first_steps.py
6
28.44
```

What happened here? Python executed the commands in the file and then terminated. You saw the result, but the run was not interactive: the whole file executed within a fraction of a second.

Larger programs are commonly written as source code files and are not run interactively. They will read data from files, wait for user input, etc.


In this class, we will neither work with the interactive mode nor with straight-up Python files much, but instead will use Jupyter Notebooks, which we'll look at next!

# Intro to Jupyter Notebooks

Jupyter notebooks will be our main working environment for this class.

You should have already installed Jupyter as part of HW0. But you'll also need to start your notebook server.

There are two ways to do this:

1. You can use the command line to navigate to the directory that contains the notebook and then run:  

```bash
$ jupyter notebook
```

2. Or you can use the Anaconda Navigator to launch a notebook server in your home directory and then navigate to this folder: 

![Anaconda Navigator Screenshot](anaconda_navigator.png)

## Jupyter Notebook Basics

First, let's get familiar with Jupyter Notebooks. 

Notebooks are made up of "cells" that can contain text or code. Notebooks also show you output of the code right below a code cell. These words are written in a text cell using a simple formatting dialect called [markdown](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html). 

Double click on this cell text or press enter while the cell is selected to see how it is formatted and change it. We can make words *italic* or **bold** or add [links](http://datasciencecourse.net) or include pictures:

![Data science cat](datasciencecat.jpg)

The content of the notebook, as you edit in your browser, is written to the `.ipynb` file we provided. 

For your homeworks, you will write in `homework1.ipynb` files for which we will give you a template. You then create a zip archive of this file (and all relevant additional files) and submit it to Canvas. 

If you want to read up on Notebooks in detail, check out the [excellent documentation](http://jupyter-notebook.readthedocs.io/en/latest/notebook.html).

## Google Colab
An alternative to native Jupyter Notebooks is a cloud-hosted Google Colab notebook. Google Colab is largely identical to Jupyter notebooks on your local computer, though there are some differences when it comes to loading data from files. We generally recommend that you work on your homeworks and review lectures in local Jupyter Notebooks, but Google Colab could be a great idea for your final project, as it's really good for collaborative work, which is an area where Jupyter Notebooks themselves aren't so great because of the issues with doing proper version control on them.


## Writing Code

The most interesting aspect of notebooks, however, is that we can write code in the cells. You can use [many different programming languages](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels) in Jupyter notebooks, but we'll stick to Python. So, let's try it out:

In [1]:
print("Hello World!")
a = 3
# This is a comment!
# The return value of the last line of a cell is the output
a 

Hello World!


3

Again, we've greeted the world out there using the `print` function. 

We also assigned a variable and returned it, which makes it the output of this cell. Notice that the output here is directly written into the notebook. 

You can change something in a cell and re-run it using the "run cell" button in the toolbar, or use the `CTRL/CMD+ENTER` shortcut.

Another cool thing about cells is that they preserve the state of what happened before. Let's initialize a couple of variables in the next cell: 

In [2]:
age = 2
gender = "woman"
name = "Datascience Cat"
smart = True

These variables are now available to all cells below or above **if you executed the cell**. In practice, you should never rely on a variable from a lower cell in an earlier cell. **This behavior is different from if you were to execute the content of the cells in sequence in a Python file.**
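
To make the consequence of this concrete, here is a small sketch that imitates three notebook cells in one code block. The `counter` variable and the "Cell 1/2/3" labels are made up for illustration. In a real notebook you would simply press run on the second cell twice; here its line is repeated to simulate that.

```python
# Cell 1: initialize the state
counter = 0

# Cell 2: in a notebook you can run this cell as often as you like;
# every run changes the state that all other cells see.
counter = counter + 1

# Simulate running "Cell 2" a second time by repeating its code.
counter = counter + 1

# Cell 3: the result depends on how often Cell 2 was run (here: twice).
print(counter)
```

In a plain Python file, each line runs exactly once from top to bottom, so the outcome never depends on which cells you happened to run, or how often.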

If you make a change to a cell, you need to execute it again. You can also batch-execute multiple cells using the "Cell" menu in the toolbar. 

Let's do something with the variables we just defined:

In [3]:
print(name + ", age: " + str(age) + ", " + 
       gender + ", is smart: " + str(smart))

Datascience Cat, age: 2, woman, is smart: True


In the previous cell, we've [concatenated a couple of strings](https://docs.python.org/3.5/tutorial/introduction.html#strings) to produce one longer string using the `+` operator. Also, we had to call the `str()` function to get [string representations of these variables](https://docs.python.org/3.5/library/stdtypes.html#str).

An alternative way to do this is not to concatenate the string but to pass each variable in as a separate argument to the print function: 

In [4]:
print(name, "\n",
       "age:", age, "\n",
       gender, "\n",
       "is smart:", smart)

Datascience Cat 
 age: 2 
 woman 
 is smart: True


Here, we're using a new-line character "\n" to break the lines. 
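
Notice the leading space on every line after the first in the output above: `print` puts a space between its arguments by default, including right after each `"\n"`. Two common alternatives are sketched below, re-using the variables from the earlier cell: an f-string, which builds the whole string without any `str()` calls, and the `sep` parameter of `print`, which replaces the default space separator.

```python
name = "Datascience Cat"
age = 2
gender = "woman"
smart = True

# An f-string inserts the values directly; no str() calls are needed.
one_line = f"{name}, age: {age}, {gender}, is smart: {smart}"
print(one_line)

# sep="\n" puts a line break (and nothing else) between the arguments,
# so the lines no longer start with a space.
print(name, f"age: {age}", gender, f"is smart: {smart}", sep="\n")
```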

### Try it!

1. Create a Python cell below.
2. Create two variables, one for your UID and one for your email. What are the types of these variables?
3. Modify the above print statement to add your UID and email to the print-out.

In [5]:
uid = 1186463
email = "u1186463@utah.edu"
print(type(uid))
print(type(email))

<class 'int'>
<class 'str'>


## Modes

Notebooks have two modes, a **command mode** and **edit mode**. You can see which mode you're in by the color of the cell: 
 * **green** means edit mode, 
 * **blue** means command mode. 
 
Many operations depend on your mode. For code cells, you can switch into edit mode with "Enter", and get out of it with "Escape".


## Shortcuts

While you can always use the toolbar above, you'll be much more efficient if you use a couple of shortcuts. The most important ones are:

**`Ctrl/Cmd+Enter`** runs the current cell.  
**`Shift+Enter`** runs the current cell and jumps to the next cell.   
**`Alt/Option+Enter`** runs the cell and adds a new one below it.

In command mode:

**`h`** shows a help menu with all these commands.  
**`a`** adds a cell before the current cell.  
**`b`** adds a cell after the current cell.  
**`dd`** deletes a cell.  
**`m`** as in **m**arkdown, switches a cell to markdown mode.  
**`y`** as in p**y**thon switches a cell to code.  

## Kernels

When you [run code](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Running%20Code.html), the code is actually executed in a **kernel**. You can do bad things to a kernel: you can make it stuck in an endless loop, crash it, corrupt it, etc. And you probably will do all of these things :). 

So sometimes you might have to interrupt your kernel or restart it. Use the "Kernel" menu to restart the kernel, re-run your notebook, etc.

Also, before submitting a homework or a project, make sure to `Restart and Run All`. This will create a clean run of your project, without any side effects that you might encounter during development. We want you to submit the homeworks **with output**, and by doing that you will make sure that we actually can also execute your code properly.

## Storing Output

Notebooks contain both the input to a computation and the outputs. If you run a notebook, all the outputs generated by the code cells are also stored in the notebook. That way, you can look at notebooks also in non-interactive environments, like your first homework on [GitHub](https://github.com/datascience-course/2023-datascience-homework/blob/main/HW1/HW1.ipynb). 

The Notebook itself is stored in a rather ugly format containing the text, code, and the output. As discussed, this can sometimes be challenging when working with version control.

# Python Basics

## Functions

In math, functions transform an input to an output as defined by the property of the function, like this: 

$f(x) = x^2 + 3$

In programming, functions can do exactly this, but are also used to execute “subroutines”, i.e., to execute pieces of code in various order and under various conditions. Functions in programming are very important for structuring and modularizing code. 

In computer science, functions are also called “procedures” and “methods” (there are subtle distinctions, but nothing we need to worry about at this time). 

The following Python function, for example, provides the output of the above defined function for every valid input: 

In [6]:
def f(x):
    result = x ** 2 + 3 
    return result

We can now run this function with multiple input values: 

In [7]:
print(f(2))
print(f(3))
f(5)

7
12


28

Let's take a look at this function. The first line
```python
def f(x):
```
defines the function of name `f` using the `def` keyword. The name we use (`f` here) is largely arbitrary, but following good software engineering practices it should be something meaningful. So instead of `f`, **`square_plus_three` would be a better function name in this case**.  

After the function name follows a list of parameters, in parentheses. In this case we define that the function takes only one parameter, `x`, but we could also define multiple parameters like this:
```python 
def f(x, y, z):
```

The parameters are then available as local variables within the function.

The second line does the actual computation and assigns it to a **local variable** called `result`. 

The third line uses the `return` keyword to return the result variable. Functions can have a return value that we can assign to a variable. For example, here we could write: 

```python
my_result = f(10)
``` 

Which would assign the return value of the function to the variable `my_result`.
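
Putting these pieces together, here is a short sketch that uses the more descriptive name suggested above and stores a return value in a variable. The second function, `weighted_sum`, is a made-up example that only serves to show a function with several parameters.

```python
def square_plus_three(x):
    """Return x squared plus three, i.e., f(x) = x^2 + 3."""
    return x ** 2 + 3


def weighted_sum(x, y, z):
    """Hypothetical example of a function with three parameters."""
    return x + 2 * y + 3 * z


# The return value can be stored in a variable and used later.
my_result = square_plus_three(10)
print(my_result)               # 10 ** 2 + 3 = 103
print(weighted_sum(1, 2, 3))   # 1 + 4 + 9 = 14
```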

Note that the lines of code that belong to a function are **indented by four spaces** (you can hit tab to indent, but it will be converted to four spaces). Python defines the scope of a function using indentation. Many other programming languages use curly brackets `{}` to do this. 

A function's body ends at the first line of code that is no longer indented.

For example, the same function wouldn't work like this:

In [8]:
def f(x):
    result = x ** 2 + 3
# Throws a SyntaxError because return is used outside a function
return result

SyntaxError: 'return' outside function (2591632246.py, line 4)

Equally, we can't indent by too much:

In [9]:
def f(x):
    result = x ** 2 + 3
    # Throws an IndentationError
        return result

IndentationError: unexpected indent (3765520253.py, line 4)

### Try it!

1. Create a Python cell below.
2. Define a new function that takes two variables, `x` and `y` and prints the one divided by the other.
3. Test your function with multiple input values, printing the answer.
4. What happens when you try to divide by zero?

In [18]:
def f(x,y):
    result = x/y
    return result

In [19]:
print(f(1,2))
print(f(16,2))
print(f(1,0))

0.5
8.0


ZeroDivisionError: division by zero
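
As the error shows, Python stops with a `ZeroDivisionError` instead of returning a result. We haven't covered error handling yet, so treat the following as a preview only: one possible way to deal with this case is a `try`/`except` block. The function name `safe_divide` and the choice to return `None` are assumptions made for this sketch.

```python
def safe_divide(x, y):
    """Hypothetical helper: return x / y, or None if y is zero."""
    try:
        return x / y
    except ZeroDivisionError:
        # Only reached when y is zero; report the problem instead of crashing.
        print("Cannot divide", x, "by zero")
        return None


print(safe_divide(1, 2))
print(safe_divide(1, 0))
```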

## Scope

Another critical concept when working with functions is to understand the scope of a variable. Scope defines under which circumstances a variable is accessible. For example, in the following code snippet we cannot access the variable defined inside a function:

In [10]:
def scope_test():
    function_scope = "only readable in here"
    # Within the function, we can use the variable we have defined
    print("Within function: " + function_scope)

# calling the function, which will print     
scope_test()

Within function: only readable in here


If we try to use the `function_scope` variable outside of the function, we will find that it is not defined. 

This will throw a `NameError`, because Python doesn't know about that variable here.

In [11]:
print("Outside function: " + function_scope)

NameError: name 'function_scope' is not defined

You might wonder “Why is that? Wouldn't it make sense to have access to variables wherever I need access?”. The reason for scoping is that it's simply much easier to **build reliable software when we modularize code**. When we use a function, we shouldn't have to worry about its internals. 

Another practical reason is that this way we can **re-use variable names** that were used in other places. This is really important when we work with other people's code (e.g., libraries). If that weren't possible, we might get nasty side-effects just because the library uses a variable with the same name somewhere. 

You can, however, use variables defined in the larger scope in the sub-scope:

In [21]:
name = "Science Cat"

def print_name_with_dr():
    print("Dr.", name)
    
print_name_with_dr()

Dr. Science Cat


This is generally **not considered good practice** – functions should rely only on their input parameters. Otherwise it can easily lead to side effects. This would be the better approach: 

In [23]:
# note that we're re-using the name `name` from the previous cell, this time as a parameter.
def print_name_with_dr(name):
    print("Dr.", name)
    
print_name_with_dr('Foo')
print(name)

Dr. Foo
Science Cat
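
The same protection applies when a function assigns to a name that also exists outside of it. In the sketch below (the function `rename_locally` is made up for illustration), the assignment inside the function creates a new local variable and leaves the outer `name` untouched.

```python
name = "Science Cat"


def rename_locally():
    # This assignment creates a new local variable called `name`;
    # it does not change the `name` defined outside the function.
    name = "Local Cat"
    return name


inside = rename_locally()
print(inside)   # the local value returned by the function
print(name)     # the outer variable is unchanged
```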


Finally, there is a way to define a variable within a function for use outside its scope by using the global keyword. There are reasons to do this, but it is generally discouraged.

In [14]:
def scope_test():
    # Think long and hard before you do this - generally you shouldn't. I have never.
    global global_scope
    global_scope = "defined in the function, global scope"
    # Within the function, we can use the variable we have defined
    print("Within function: " + global_scope)

scope_test()
# Since this is defined as global we can also print the variable here
print("Outside function: " + global_scope)

Within function: defined in the function, global scope
Outside function: defined in the function, global scope
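
In most cases you can avoid `global` by returning the value and letting the caller store it. A minimal sketch of that alternative, with the made-up names `make_message` and `outside`:

```python
def make_message():
    # `message` is local; the caller only sees the returned value.
    message = "defined in the function, returned to the caller"
    return message


# The caller decides where the value is stored.
outside = make_message()
print("Outside function: " + outside)
```

This keeps the function free of side effects: everything it produces is visible in its return value.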


### Try it!

1. Create a Python cell below.
2. In the cell, define a variable called `x` and set its value to `2`.
3. Create three functions, all of which calculate `x + 7`:
    * The first function should use `x` without defining it.
    * The second function should have a parameter named `x`. 
    * The third function should redefine `x` inside of it to be `3`.
4. When you try each function, what is the result? What is the value of the `x` outside the function?

## Looking Ahead: Conditions, Loops, Advanced Data Types

We've learned how to execute operations and call and define functions. In the next lecture, we'll learn how we can control the flow of execution in a program using conditions (if statements) and loops. We'll also introduce more advanced data types such as lists and dictionaries. 