# Aug 30

## Making Programs

So far, we have been working in a nice interactive interface called the Jupyter Notebook. Notebooks are really nice for doing interactive programming. What we'll do today is look at making actual computer programs: self-contained pieces of code that can run in a variety of environments.

### Anatomy of a Python Program

First, there are import statements. What libraries will we need to run our code?

Second, there is content. These are usually the functions that we want to execute.

Third, there is a statement of which of our functions to execute. 
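
The three parts above can be sketched in a tiny program. This is only an illustration: the function `circle_area` is a made-up stand-in, not the function we use in class.

```python
# 1. Import statements: the libraries the program needs.
import math


# 2. Content: the functions we want to be able to execute.
# (circle_area is a hypothetical example function.)
def circle_area(radius):
    """Return the area of a circle with the given radius."""
    return math.pi * radius ** 2


# 3. The statement saying which function to execute when the file is run.
if __name__ == '__main__':
    print(circle_area(2.0))
```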

## Hands On

First, we are going to open a plain text file. Plain text files, unlike Word documents or Google Docs, store only the characters typed into them. Word, on the other hand, wraps the text in binary data that records the formatting and other extras in our document. Plain text files instead rely on the software that opens them to make the contents look pretty.

Next, we paste in our function.

Now, we save it. Call it `my_first_software.py`.

Now, we will open a terminal, and we will run our function like so:

~~~
$ python my_first_software.py
~~~

What happened? Did you get output? 

What did we forget? Let's add those import statements. Save the file and run it again. Did it do anything? No - look at the third step above. We need to tell Python which function to execute. We need to add a main statement. 

~~~
if __name__ == '__main__':
    surveys_df = pd.read_csv("../data/surveys.csv")
    yearly_data_arg_test(surveys_df)
~~~
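
Why did nothing happen before we added the main statement? A minimal sketch, using a made-up function `count_records` in place of our real one: defining a function only stores it, and the `if __name__ == '__main__':` block is what actually calls it when the file is run directly.

```python
# Hypothetical stand-in for the function pasted into the script.
def count_records(records):
    """Return how many records were passed in."""
    return len(records)


# Defining count_records above produces no output by itself.
# Python sets __name__ to '__main__' only when this file is run directly
# (for example with `python my_first_software.py`), so the call below
# happens then, but not when the file is imported from another program.
if __name__ == '__main__':
    print(count_records(['a', 'b', 'c']))
```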

## Digression: What is this interface we're using?

The dollar sign is a **prompt**, which shows us that the shell is waiting for input;
your shell may use a different character as a prompt and may add information before
the prompt. When typing commands, either from these lessons or from other sources,
do not type the prompt, only the commands that follow it.

~~~
$
~~~

Let's find out where we are by running a command called `pwd`
(which stands for "print working directory").
At any moment, our **current working directory**
is our current default directory,
i.e.,
the directory that the computer assumes we want to run commands in
unless we explicitly specify something else.
Here, the computer's response is `/home/yourusername`,
which is our home directory:

~~~
$ pwd
~~~
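
The working directory also matters to our Python script, because a relative path such as `../data/surveys.csv` is interpreted starting from it. Here is a small sketch using the standard-library `pathlib` module; the variable names are just for illustration, and the file does not need to exist for this to run.

```python
from pathlib import Path

# The current working directory: the same answer `pwd` gives in the shell.
here = Path.cwd()
print(here)

# A relative path is interpreted starting from the working directory.
# '../data/surveys.csv' means: go up one level, then into 'data'.
relative = Path('../data/surveys.csv')
absolute = (here / relative).resolve()
print(absolute)
```

If the script is run from a different directory, the same relative path points somewhere else, which is one reason a script can fail to find its data.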

Let's look at how our file system is organized.  

At the top is your directory, which holds all the 
subdirectories and files.

Inside that directory are some other directories:

We'll be working with these subdirectories throughout this workshop.  

The command to change locations in our file system is `cd` followed by a
directory name to change our working directory.
`cd` stands for "change directory".

Let's say we want to navigate to the `data` directory we saw above.  We can
use the following command to get there:

~~~
$ cd data
~~~

We can see what files and subdirectories are in this directory by running `ls`,
which stands for "list":

~~~
$ ls
~~~

`ls` prints the names of the files and directories in the current directory in
alphabetical order,
arranged neatly into columns.
We can make its output more comprehensible by using the **flag** `-F`,
which tells `ls` to add a trailing `/` to the names of directories:

~~~
$ ls -F
~~~

Anything with a "/" after it is a directory. Things with a "*" after them are programs. If
there are no decorations, it's a file.

`ls` has lots of other options. To find out what they are, we can type:

~~~
$ man ls
~~~

Some manual files are very long. You can scroll through the file using
your keyboard's down arrow or use the <kbd>Space</kbd> key to go forward one page
and the <kbd>b</kbd> key to go backwards one page. When you are done reading, hit <kbd>q</kbd>
to quit.

> ## Challenge
> Use the `-l` option for the `ls` command to display more information for each item 
> in the directory. What is one piece of additional information this long format
> gives you that you don't see with the bare `ls` command?
>
> > ## Solution
> > ~~~
> > $ ls -l
> > ~~~
> > 
> > The additional information given includes the name of the owner of the file,
> > when the file was last modified, and whether the current user has permission
> > to read and write to the file.
> > 


No one can possibly learn all of these arguments; that's what the manual page
is for. You can (and should) refer to the manual page or other help files
as needed.

Let's go into the `notebooks` directory and see what is in there.

~~~
$ cd notebooks
$ ls -F
~~~


This directory contains files with `.ipynb` extensions. `.ipynb` is the format
Jupyter uses for storing our notebooks.

### Shortcut: Tab Completion

Typing out file or directory names can waste a
lot of time and it's easy to make typing mistakes. Instead we can use tab completion
as a shortcut. When you start typing out the name of a directory or file, then
hit the <kbd>Tab</kbd> key, the shell will try to fill in the rest of the
directory or file name.

Return to your home directory:

~~~
$ cd
~~~

then type the following, pressing <kbd>Tab</kbd> where indicated:

~~~
$ cd CompBio2018/d<tab>
~~~

The shell will fill in the rest of the directory name for
`data`.


~~~
$ pw<tab><tab>
~~~

~~~
pwd         pwd_mkdb    pwhich      pwhich5.16  pwhich5.18  pwpolicy
~~~

Pressing <kbd>Tab</kbd> twice displays the name of every program that starts with `pw`.

## Summary

We now know how to move around our file system using the command line.
This gives us an advantage over interacting with the file system through
a GUI as it allows us to work on a remote server, carry out the same set of operations 
on a large number of files quickly, and opens up many opportunities for using 
bioinformatics software that is only available in command line versions. 

In the next few episodes, we'll be expanding on these skills and seeing how 
using the command line shell enables us to make our workflow more efficient and reproducible.

Back to our program: paste the `if __name__ == '__main__':` block into the script. Save it, and run it.

Congratulations, you're all hackers now!

How can we improve this script?

Right now, everything is what we call _hard-coded_. If we sent this script to a collaborator, they would have to go into the script and change the file name to use it with other data. This is dangerous - any time you have someone else hacking on a script, there's the potential that they make a mistake. We want to minimize the amount of modification our colleagues have to do to use our scripts.

Instead, what we want to do is pass in the file as a command line argument. Command line arguments are pieces of information that we specify outside the program, when we run it, and that are accepted as input by our functions.

We're going to import a module called `sys`, which stands for "system". This allows us to pass input from the command line to our program. Let's start by figuring out how to pass the name of the file in.

Where is the file name defined in the Python script? Let's replace the file name with the statement `sys.argv[1]`. Save the file. Now, we will run it differently:


~~~
$ python my_first_software.py ../data/surveys.csv
~~~
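
To see what `sys.argv` holds, here is a minimal sketch. The helper `get_filename` is a hypothetical name used only for this illustration; in our script you can use `sys.argv[1]` directly.

```python
import sys


# get_filename is a hypothetical helper for illustration.
def get_filename(argv):
    """Return the data file path from an argv-style list, or None if absent."""
    # argv[0] is the script name; the first thing the user typed is argv[1].
    if len(argv) < 2:
        return None
    return argv[1]


if __name__ == '__main__':
    filename = get_filename(sys.argv)
    if filename is None:
        print("usage: python my_first_software.py <data file>")
    else:
        print("Reading data from:", filename)
```

Checking the length of the list first means that a forgotten argument produces a short usage message instead of an `IndexError`.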

Next, we will try to add arguments for our other two inputs - min and max. Add these as `sys.argv[2]` and `sys.argv[3]`. Command line arguments always arrive as strings, so convert them (for example with `int()`) if the function expects numbers. Save and run your script. This is cool - you can give this to someone who barely knows Python and have them run this script.
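
One detail to watch with the two extra arguments: everything in `sys.argv` is a string. A small sketch, assuming the two inputs are whole numbers; the helper `parse_min_max` and the sample values are made up for illustration.

```python
# parse_min_max is a hypothetical helper; the sample values are invented.
def parse_min_max(argv):
    """Return (minimum, maximum) as integers from argv[2] and argv[3]."""
    # Arguments arrive as strings such as '1990'. Comparing strings gives
    # surprising answers ('9' > '10' is True), so convert to int first.
    minimum = int(argv[2])
    maximum = int(argv[3])
    return minimum, maximum


if __name__ == '__main__':
    # In the real script this list would be sys.argv.
    example_argv = ['my_first_software.py', '../data/surveys.csv', '1990', '1995']
    print(parse_min_max(example_argv))
```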

How can we make our program a little more user-friendly? Right now, we need to know precisely what arguments the program takes, and in what order. How confusing! This is where we will pick up Tuesday.

I've placed several scripts in the `scripts` directory. One is `my_first_script.py`. This one has the variables hard-coded into it. The second script is the one we wrote Thursday, which uses the `sys` module to take command line input. Run it, and make sure it gives you the correct output.

The final one is called `script-argparse.py`. Open that one.
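
As a rough idea of what the standard-library `argparse` module offers, here is a sketch. It is not the contents of `script-argparse.py`; the option names `--min` and `--max`, the help text, and the helper `build_parser` are assumptions made for illustration.

```python
import argparse


# build_parser and its option names are hypothetical, for illustration only.
def build_parser():
    """Build a parser for a data file plus an optional min and max."""
    parser = argparse.ArgumentParser(
        description="Run an analysis on a survey data file.")
    parser.add_argument('filename', help="path to the CSV file to analyze")
    parser.add_argument('--min', type=int, default=None, dest='minimum',
                        help="smallest value to include (optional)")
    parser.add_argument('--max', type=int, default=None, dest='maximum',
                        help="largest value to include (optional)")
    return parser


if __name__ == '__main__':
    # In a real script, call parse_args() with no arguments so that it
    # reads sys.argv; a list is passed here so the sketch runs anywhere.
    args = build_parser().parse_args(['../data/surveys.csv', '--min', '1990'])
    print(args.filename, args.minimum, args.maximum)
```

With a parser like this, running the script with `-h` prints a help message built from the `description` and `help` strings, and `type=int` handles the string-to-number conversion for us.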

### tl;dr

Good: having the code in a script
### Better: Having the code in a script that does not need to be edited to run.
## Best: Having the code in a script that does not need to be edited to run _and_ gives the user (you?) some help about what is required to run it.

One concluding thought: have we put the output_check() function in a sensible place? Where else _could_ we put it, and why might we?