# 6) Git and IDEs for good programming practices

Related material:

- Main basis (with many thanks) for this ipynb: https://swcarpentry.github.io/git-novice/
- Related nice reference: http://swcarpentry.github.io/git-novice/reference
- Additional Git reference: https://git-scm.com/docs
- https://git-scm.com/book/en/v2
- https://gist.github.com/trey/2722934 

## Version control with git

[Git](https://git-scm.com/) is an extremely useful and broadly adopted version control system. Ever find yourself saving copies of files under different names as you work on them, so that you can go back to older versions?

![Piled Higher and Deeper by Jorge Cham, http://www.phdcomics.com/comics/archive_print.php?comicid=1531](images/lect06_phd_versions.png)
"Piled Higher and Deeper" by Jorge Cham, http://www.phdcomics.com

Git (and other version control software) avoids saving many almost-identical versions of files and then having to sort through them. It makes it very easy to store incremental changes and then compare them. When writing programs, this is especially useful as you might temporarily have to break some functionality while extending it; wouldn't it be nice to make a separate "branch" to work on that, and merge it back when the new functionality is complete?

This is such a common problem that multiple version control tools have been created. Git is the most common. It becomes especially helpful when there are multiple people that want to work on the same project/file.

Git is not to be confused with GitHub; [GitHub](https://github.com/) is an online host (website) for projects which interfaces with git. It is extremely helpful for sharing code, managing projects, and team project development. We will be using both.

## Some visuals of what git can do for you:

Save changes sequentially:
![Changes Are Saved Sequentially](images/lect06_play-changes.svg)

Make independent changes to the same file:
![Different Versions Can be Saved](images/lect06_versions.svg)

Merge changes: If there are conflicts, you will have a chance to review them. 
![Multiple Versions Can be Merged](images/lect06_merge.svg)

The entire history of saved states (**commits**) and the metadata about them make up a particular git **repository**. These repositories are saved on individual machines, but can easily be kept in sync across different computers, facilitating collaboration among different people. Repositories do not need a central server to host the **repo** (the common shorthand for repository). Thus, git repos are described as distributed.

## When first using Git on a machine

Below are a few examples of configurations we will set as we get started with Git:

- our name and email address,
- what our preferred text editor is,
- and that we want to use these settings globally (i.e. for every project).

On a command line, Git commands are written as `git verb options`,
where `verb` is what we actually want to do and `options` is additional optional information which may be needed for the `verb`. So here is how Dracula sets up his new laptop:

~~~
$ git config --global user.name "Vlad Dracula"
$ git config --global user.email "vlad@tran.sylvan.ia"
~~~

The user name and email you set will be associated with your subsequent Git activity,
which means that any changes pushed to [GitHub](https://github.com/),
[BitBucket](https://bitbucket.org/), [GitLab](https://gitlab.com/) or another Git host server
will include this information.

### Line Endings

As with other keys, when you hit <kbd>Return</kbd> on your keyboard,
your computer encodes this input as a character.
Different operating systems use different character(s) to represent the end of a line.
(You may also hear these referred to as newlines or line breaks.)
Because Git uses these characters to compare files,
it may cause unexpected issues when editing a file on different machines. 

Although it is beyond the scope of this lesson, you can read more about this issue
[on this GitHub page](https://help.github.com/articles/dealing-with-line-endings/).

You can change the way Git recognizes and encodes line endings
using the `core.autocrlf` setting of `git config`. The following settings are recommended:

On macOS and Linux:

~~~
$ git config --global core.autocrlf input
~~~

And on Windows:
~~~
$ git config --global core.autocrlf true
~~~

We will be interacting with [GitHub](https://github.com/) and so the email address used should be the same as the one used when setting up your GitHub account. If you are concerned about privacy, please review [GitHub's instructions for keeping your email address private](https://help.github.com/articles/keeping-your-email-address-private/). 

If you elect to use a private email address with GitHub, then use that same email address for the `user.email` value, e.g. `username@users.noreply.github.com` replacing `username` with your GitHub one. You can change the email address later on by using the `git config` command again.

Dracula also has to set his favorite text editor, following this table:

| Editor             | Configuration command                            |
|:-------------------|:-------------------------------------------------|
| Atom | `$ git config --global core.editor "atom --wait"`|
| nano               | `$ git config --global core.editor "nano -w"`    |
| BBEdit (Mac, with command line tools) | `$ git config --global core.editor "bbedit -w"`    |
| Sublime Text (Mac) | `$ git config --global core.editor "/Applications/Sublime\ Text.app/Contents/SharedSupport/bin/subl -n -w"` |
| Sublime Text (Win, 32-bit install) | `$ git config --global core.editor "'c:/program files (x86)/sublime text 3/sublime_text.exe' -w"` |
| Sublime Text (Win, 64-bit install) | `$ git config --global core.editor "'c:/program files/sublime text 3/sublime_text.exe' -w"` |
| Notepad++ (Win, 32-bit install)    | `$ git config --global core.editor "'c:/program files (x86)/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"`|
| Notepad++ (Win, 64-bit install)    | `$ git config --global core.editor "'c:/program files/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"`|
| Kate (Linux)       | `$ git config --global core.editor "kate"`       |
| Gedit (Linux)      | `$ git config --global core.editor "gedit --wait --new-window"`   |
| Scratch (Linux)       | `$ git config --global core.editor "scratch-text-editor"`  |
| Emacs              | `$ git config --global core.editor "emacs"`   |
| Vim                | `$ git config --global core.editor "vim"`   |

It is possible to reconfigure the text editor for Git whenever you want to change it.

The four commands we just ran above only need to be run once: the flag `--global` tells Git
to use the settings for every project, in your user account, on this computer.

You can check your settings at any time:

~~~
$ git config --list
~~~

You can change your configuration as many times as you want: just use the
same commands to choose another editor or update your email address.
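Leaving out `--global` stores a setting for the current repository only, which is a safe way to experiment. The sketch below assumes a throwaway directory created with `mktemp` (the variable name `demo_dir` is just for illustration), so none of your real settings are touched:

```shell
# Work in a throwaway repository (illustrative location) so real settings stay untouched.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet

# Without --global, the values are written to this repository's .git/config only.
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

# Read back single values; --show-origin also reports the file each value came from.
git config --get user.name
git config --show-origin --get user.email
```

If a setting exists both globally and in a repository, the repository's value is the one Git uses there.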

### Git Help and Manual

If you forget the options of a `git` command, you can print a short summary of them with `-h` and open the full Git manual page for the command with `--help`:

~~~
$ git config -h
$ git config --help
~~~

## Let's make a Git repo!

We will make a repo for a project Wolfman and Dracula are working on, investigating if it is possible to send a planetary lander to Mars.

![motivatingexample](images/lect06_motivatingexample.png)

First, let's create a directory in the `Desktop` folder for our work and then move into that directory:

~~~
$ cd ~/Desktop
$ mkdir planets
$ cd planets
~~~

Then we tell Git to make `planets` a repository (where Git can store versions of our files):

~~~
$ git init
~~~

Note that `git init` will create a repository that
includes subdirectories and their files---there is no need to create
separate repositories nested within the `planets` repository, whether
subdirectories are present from the beginning or added later. Also, note
that the creation of the `planets` directory and its initialization as a
repository are completely separate processes.

If we use `ls` to show the directory's contents, it appears that nothing has changed:

~~~
$ ls
~~~

But if we add the `-a` flag to show everything, we can see that Git has created a hidden directory within `planets` called `.git`:

~~~
$ ls -a
~~~

~~~
.	..	.git
~~~

Git uses this special sub-directory to store all the information about the project, 
including all files and sub-directories located within the project's directory.
If we ever delete the `.git` sub-directory,
we will lose the project's history.

We can check that everything is set up correctly
by asking Git to tell us the status of our project:

~~~
$ git status
~~~

~~~
On branch master

Initial commit

nothing to commit (create/copy files and use "git add" to track)
~~~

If you are using a different version of `git`, the exact wording of the output might be slightly different.

### Correcting `git init` Mistakes
#### USE WITH CAUTION!

To undo the accidental creation of a Git repo (e.g. running `git init` in the wrong directory), remove the `.git` directory that `git init` created there. For example, if a repository was mistakenly initialized in a directory called `moons` inside the current directory:

~~~
$ rm -rf moons/.git
~~~

But be careful! Running this command on the wrong directory will remove the entire Git history of a project you might want to keep. Therefore, always check your current directory first using the command `pwd`.
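Before deleting anything, it can also help to ask Git which repository the current directory actually belongs to. A minimal sketch, assuming a throwaway layout (`demo_dir` and the nested `moons` directory are illustrative):

```shell
# Build a throwaway planets repository with a subdirectory inside it.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/planets/moons"
cd "$demo_dir/planets"
git init --quiet

# Move into the subdirectory and check where we are...
cd moons
pwd

# ...and which repository we are in: this prints the top-level directory
# of the enclosing repository (the one that holds the .git directory).
git rev-parse --show-toplevel
```

If the second path is not the project you expected, the `.git` directory you are about to remove is probably not the right one.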

## Tracking Changes

Let's create a file called `mars.txt` within the folder `planets` that contains some notes
about the Red Planet's suitability as a base.
We'll use `vim` to edit the file; this editor does not have to be the `core.editor` you set globally earlier. 

~~~
$ vim mars.txt
~~~

Type the text below into the `mars.txt` file:

~~~
Cold and dry, but everything is my favorite color
~~~

`mars.txt` now contains a single line, which we can see by running:

~~~
$ ls
~~~

~~~
mars.txt
~~~

~~~
$ cat mars.txt
~~~

~~~
Cold and dry, but everything is my favorite color
~~~

If we check the status of our project again, Git tells us that it's noticed the new file:

~~~
$ git status
~~~

~~~
On branch master

Initial commit

Untracked files:
   (use "git add <file>..." to include in what will be committed)

	mars.txt
nothing added to commit but untracked files present (use "git add" to track)
~~~

The "untracked files" message means that there's a file in the directory
that Git isn't keeping track of. We can tell Git to track a file using `git add`:

~~~
$ git add mars.txt
~~~

and then check that the right thing happened:

~~~
$ git status
~~~

~~~
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   mars.txt

~~~

*Note*: You can stage individual files by naming them, or stage all the changes that `git status` lists in the current directory and its subdirectories (new/modified/deleted files) by typing `git add .`

Git now knows that it's supposed to keep track of `mars.txt`,
but it hasn't recorded these changes as a commit yet.
To get it to do that, we need to run one more command:

~~~
$ git commit -m "Start notes on Mars as a base"
~~~

~~~
[master (root-commit) f22b25e] Start notes on Mars as a base
 1 file changed, 1 insertion(+)
 create mode 100644 mars.txt
~~~

When we run `git commit`,
Git takes everything we have told it to save by using `git add`
and stores a copy permanently inside the special `.git` directory.
This permanent copy is called a [commit](https://git-scm.com/docs/git-commit)
(or [revision](https://git-scm.com/docs/gitrevisions)) and its short identifier is `f22b25e`.
Your commit may have another identifier.

We use the `-m` flag (for "message")
to record a short, descriptive, and specific comment that will help us remember later on what we did and why.
If we just run `git commit` without the `-m` option,
Git will launch `vim` (or whatever other editor we configured as `core.editor`)
so that we can write a longer message.

[Good commit messages](https://chris.beams.io/posts/git-commit/) start with a brief (< 50 characters) statement about the changes made in the commit. Generally, the message should complete the sentence "If applied, this commit will ...".

If you want to go into more detail, add a blank line between the summary line and your additional notes. Use this additional space to explain why you made changes and/or what their impact will be.
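One way to write such a two-part message without opening an editor is to pass `-m` more than once; Git joins the pieces as separate paragraphs. The sketch below runs in a throwaway repository (`demo_dir` is illustrative, and the second paragraph of the message is made up for the example):

```shell
# Throwaway repository with a local identity so the commit can be made anywhere.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

echo "Cold and dry, but everything is my favorite color" > mars.txt
git add mars.txt

# The first -m is the summary line; the second becomes the body after a blank line.
git commit --quiet \
    -m "Start notes on Mars as a base" \
    -m "Record first impressions so that sites can be compared later."

# %B prints the raw message: summary, blank line, body.
git log -1 --format=%B
```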

If we run `git status` now:

~~~
$ git status
~~~

~~~
On branch master
nothing to commit, working directory clean
~~~

it tells us everything is up to date.
If we want to know what we've done recently,
we can ask Git to show us the project's history using `git log`:

~~~
$ git log
~~~

~~~
commit f22b25e3233b4645dabd0d81e651fe074bd8e73b
Author: Vlad Dracula <vlad@tran.sylvan.ia>
Date:   Thu Aug 22 09:51:46 2013 -0400

    Start notes on Mars as a base
~~~

`git log` lists all commits made to a repository in reverse chronological order.
The listing for each commit includes
the commit's full identifier
(which starts with the same characters as
the short identifier printed by the `git commit` command earlier),
the commit's author,
when it was created,
and the log message Git was given when the commit was created.

### Where Are My Changes?

If we run `ls` at this point, we will still see just one file called `mars.txt`.
That's because Git saves information about files' history
in the special `.git` directory mentioned earlier
so that our filesystem doesn't become cluttered
(and so that we can't accidentally edit or delete an old version).

Now suppose Dracula adds more information to the file.

~~~
$ vim mars.txt
~~~

~~~
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
~~~

When we run `git status` now,
it tells us that a file it already knows about has been modified:

~~~
$ git status
~~~

~~~
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   mars.txt

no changes added to commit (use "git add" and/or "git commit -a")
~~~

The last line is the key phrase: "no changes added to commit".

We have changed this file, but we haven't told Git we will want to save those changes
(which we do with `git add`) nor have we saved them (which we do with `git commit`).

It is good practice to review our changes before saving them. We do this using `git diff`.
This shows us the differences between the current state
of the file and the most recently saved version:

~~~
$ git diff
~~~

~~~
diff --git a/mars.txt b/mars.txt
index df0654a..315bf3a 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1 +1,2 @@
 Cold and dry, but everything is my favorite color
+The two moons may be a problem for Wolfman
~~~

The output is a series of commands for tools like editors and `patch`
telling them how to reconstruct one file given the other:

1.  The first line tells us that Git is producing output similar to the Unix `diff` command
    comparing the old and new versions of the file.
2.  The second line tells exactly which versions of the file
    Git is comparing;
    `df0654a` and `315bf3a` are unique computer-generated labels for those versions.
3.  The third and fourth lines once again show the name of the file being changed.
4.  The remaining lines are the most interesting: they show us the actual differences
    and the lines on which they occur.
    The `+` marker in the first column shows where we added a line.

Now to commit:

~~~
$ git commit -m "Add concerns about effects of Mars' moons on Wolfman"
$ git status
~~~

~~~
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   mars.txt

no changes added to commit (use "git add" and/or "git commit -a")
~~~

Why did we get that note?

~~~
$ git add mars.txt
$ git commit -m "Add concerns about effects of Mars' moons on Wolfman"
~~~

~~~
[master 34961b1] Add concerns about effects of Mars' moons on Wolfman
 1 file changed, 1 insertion(+)
~~~

Git insists that we add files to the set we want to commit before actually committing anything. This allows us to commit our changes in stages and capture changes in logical portions rather than only large batches.

For example, suppose we're adding a few citations to relevant research to our thesis.
We might want to commit those additions,
and the corresponding bibliography entries,
but *not* commit some of our work drafting the conclusion
(which we haven't finished yet).

To allow for this, Git has a special *staging area* where it keeps track of things that have been added to
the current [changeset](http://swcarpentry.github.io/git-novice/reference#changeset)
but not yet committed.
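The thesis scenario can be sketched as follows. The file names `bibliography.txt` and `conclusion.txt`, their contents, and the throwaway directory `demo_dir` are assumptions made for illustration:

```shell
# Throwaway repository with a local identity.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

# Two pieces of work in progress.
echo "Placeholder bibliography entry" > bibliography.txt
echo "Unfinished draft of the conclusion" > conclusion.txt

# Stage and commit only the finished part.
git add bibliography.txt
git commit --quiet -m "Add bibliography entry"

# The draft is still on disk, but it is not part of any commit yet.
git status --short
```

Here `git status --short` should list only `conclusion.txt`, marked `??` for untracked.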

### Staging Area

If you think of Git as taking snapshots of changes over the life of a project,
`git add` specifies *what* will go in a snapshot
(putting things in the staging area),
and `git commit` then *actually takes* the snapshot, and
makes a permanent record of it (as a commit).
If you don't have anything staged when you type `git commit`,
Git will prompt you to use `git commit -a` or `git commit --all`,
which stages every change to already-tracked files and then commits them.

Only do this if you are certain you know what will go into the commit, at least by checking `git status` first!

![The Git Staging Area](images/lect06_git-staging-area.svg)

Let's watch as our changes to a file move from our editor
to the staging area and into long-term storage.
First, we'll add another line to the file:

~~~
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
~~~

~~~
$ git diff
~~~

~~~
diff --git a/mars.txt b/mars.txt
index 315bf3a..b36abfd 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1,2 +1,3 @@
 Cold and dry, but everything is my favorite color
 The two moons may be a problem for Wolfman
+But the Mummy will appreciate the lack of humidity
~~~

Now let's put that change in the staging area
and see what `git diff` reports:

~~~
$ git add mars.txt
$ git diff
~~~

There is no output: as far as Git can tell,
there's no difference between what it's been asked to save permanently
and what's currently in the directory.
However, if we do this:

~~~
$ git diff --staged
~~~

~~~
diff --git a/mars.txt b/mars.txt
index 315bf3a..b36abfd 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1,2 +1,3 @@
 Cold and dry, but everything is my favorite color
 The two moons may be a problem for Wolfman
+But the Mummy will appreciate the lack of humidity
~~~

it shows us the difference between the last committed change
and what's in the staging area.

Let's save our changes:

~~~
$ git commit -m "Discuss concerns about Mars' climate for Mummy"
~~~

~~~
[master 005937f] Discuss concerns about Mars' climate for Mummy
 1 file changed, 1 insertion(+)
~~~

check our status:

~~~
$ git status
~~~

~~~
On branch master
nothing to commit, working directory clean
~~~

and look at the history of what we've done so far:

~~~
$ git log
~~~

~~~
commit 005937fbe2a98fb83f0ade869025dc2636b4dad5
Author: Vlad Dracula <vlad@tran.sylvan.ia>
Date:   Thu Aug 22 10:14:07 2013 -0400

    Discuss concerns about Mars' climate for Mummy

commit 34961b159c27df3b475cfe4415d94a6d1fcd064d
Author: Vlad Dracula <vlad@tran.sylvan.ia>
Date:   Thu Aug 22 10:07:21 2013 -0400

    Add concerns about effects of Mars' moons on Wolfman

commit f22b25e3233b4645dabd0d81e651fe074bd8e73b
Author: Vlad Dracula <vlad@tran.sylvan.ia>
Date:   Thu Aug 22 09:51:46 2013 -0400

    Start notes on Mars as a base
~~~

### Paging the Log

When the output of `git log` is too long to fit in your screen,
`git` uses a program to split it into pages of the size of your screen.
When this "pager" is called, you will notice that the last line in your
screen is a `:`, instead of your usual prompt.

-   To get out of the pager, press <kbd>Q</kbd>.
-   To move to the next page, press <kbd>Spacebar</kbd>.
-   To search for `some_word` in all pages,
    press <kbd>/</kbd>
    and type `some_word`.
    Navigate through matches pressing <kbd>N</kbd>.

### Limit Log Size

To avoid having `git log` cover your entire terminal screen, you can limit the
number of commits that Git lists by using `-N`, where `N` is the number of
commits that you want to view. For example, if you only want information from
the last commit you can use:

~~~
$ git log -1
~~~

~~~
commit 005937fbe2a98fb83f0ade869025dc2636b4dad5
Author: Vlad Dracula <vlad@tran.sylvan.ia>
Date:   Thu Aug 22 10:14:07 2013 -0400

    Discuss concerns about Mars' climate for Mummy
~~~

You can also reduce the quantity of information using the
`--oneline` option:

~~~
$ git log --oneline
~~~

~~~
005937f Discuss concerns about Mars' climate for Mummy
34961b1 Add concerns about effects of Mars' moons on Wolfman
f22b25e Start notes on Mars as a base
~~~

### Directories

Two important facts you should know about directories in Git.

1) Git does not track directories on their own, only files within them.
   Try it for yourself:

~~~
$ mkdir directory
$ git status
$ git add directory
$ git status
~~~

   Note, our newly created empty directory `directory` does not appear in
   the list of untracked files even if we explicitly add it (_via_ `git add`) to our
   repository. This is the reason why you will sometimes see `.gitkeep` files
   in otherwise empty directories. Unlike `.gitignore`, these files are not special
   and their sole purpose is to populate a directory so that Git adds it to
   the repository. In fact, you can name such files anything you like.

2) If you create a directory in your Git repository and populate it with files,
   you can add all files in the directory at once by:

   ~~~
   git add <directory-with-files>
   ~~~
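Both points can be tried in a throwaway repository. In this sketch the directory name `spaceships` and the location `demo_dir` are illustrative:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet

# An empty directory is invisible to Git: this prints nothing.
mkdir spaceships
git status --short

# A placeholder file makes the directory show up as untracked.
touch spaceships/.gitkeep
git status --short

# Adding the directory stages every file inside it.
git add spaceships
git status --short
```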

To recap, when we want to add changes to our repository,
we first need to add the changed files to the staging area
(`git add`) and then commit the staged changes to the
repository (`git commit`):

![The Git Commit Workflow](images/lect06_git-committing.svg)

### Author and Committer

For each of the commits you have made, Git stored your name twice.
You are named as the author and as the committer. You can observe
that by telling Git to show you more information about your last
commits:

~~~
$ git log --format=full
~~~

When committing you can name someone else as the author:

~~~
$ git commit --author="Vlad Dracula <vlad@tran.sylvan.ia>"
~~~
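To see the two roles differ, the sketch below commits as Dracula while crediting Wolfman as the author. Wolfman's email address and the directory `demo_dir` are made up for the example:

```shell
# Throwaway repository; the configured identity is the committer.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

echo "Cold and dry, but everything is my favorite color" > mars.txt
git add mars.txt

# Dracula runs the commit (committer) but names Wolfman as the author.
git commit --quiet --author="Wolfman <wolfman@example.com>" \
    -m "Start notes on Mars as a base"

# %an is the author name, %cn the committer name.
git log -1 --format="author: %an | committer: %cn"
```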

## Exploring History

As we previously saw, we can refer to commits by their
identifiers.  You can refer to the _most recent commit_ of the working
directory by using the identifier `HEAD`.

We've been adding one line at a time to `mars.txt`, so it's easy to track our
progress by looking at the file. Let's do that using `HEAD`.  Before we start,
let's make a change to `mars.txt`:

~~~
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
An ill-considered change
~~~

Now, let's see what we get.

~~~
$ git diff HEAD mars.txt
~~~

~~~
diff --git a/mars.txt b/mars.txt
index b36abfd..0848c8d 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1,3 +1,4 @@
 Cold and dry, but everything is my favorite color
 The two moons may be a problem for Wolfman
 But the Mummy will appreciate the lack of humidity
+An ill-considered change
~~~

which is the same as what you would get if you leave out `HEAD` (try it).  The
real goodness in all this is when you can refer to previous commits.  We do
that by adding `~1` 
(where "~" is "tilde", pronounced [**til**-d*uh*]) 
to refer to the commit one before `HEAD`.

~~~
$ git diff HEAD~1 mars.txt
~~~

If we want to see the differences between older commits we can use `git diff`
again, but with the notation `HEAD~1`, `HEAD~2`, and so on, to refer to them:

~~~
$ git diff HEAD~2 mars.txt
~~~

~~~
diff --git a/mars.txt b/mars.txt
index df0654a..b36abfd 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1 +1,4 @@
 Cold and dry, but everything is my favorite color
+The two moons may be a problem for Wolfman
+But the Mummy will appreciate the lack of humidity
+An ill-considered change
~~~

We could also use `git show` which shows us what changes we made at an older commit as well as the commit message, rather than the _differences_ between a commit and our working directory that we see by using `git diff`.

~~~
$ git show HEAD~2 mars.txt
~~~

~~~
commit f22b25e3233b4645dabd0d81e651fe074bd8e73b
Author: Vlad Dracula <vlad@tran.sylvan.ia>
Date:   Thu Aug 22 09:51:46 2013 -0400

    Start notes on Mars as a base

diff --git a/mars.txt b/mars.txt
new file mode 100644
index 0000000..df0654a
--- /dev/null
+++ b/mars.txt
@@ -0,0 +1 @@
+Cold and dry, but everything is my favorite color
~~~

In this way, we can build up a chain of commits.
The most recent end of the chain is referred to as `HEAD`;
we can refer to previous commits using the `~` notation,
so `HEAD~1` means "the previous commit",
while `HEAD~123` goes back 123 commits from where we are now.
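As a quick check of the `~` notation, this sketch rebuilds the three commits from this lesson in a throwaway repository (`demo_dir` is illustrative) and prints the summary line that each name resolves to:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

# Recreate the three commits from this lesson.
echo "Cold and dry, but everything is my favorite color" > mars.txt
git add mars.txt
git commit --quiet -m "Start notes on Mars as a base"

echo "The two moons may be a problem for Wolfman" >> mars.txt
git add mars.txt
git commit --quiet -m "Add concerns about effects of Mars' moons on Wolfman"

echo "But the Mummy will appreciate the lack of humidity" >> mars.txt
git add mars.txt
git commit --quiet -m "Discuss concerns about Mars' climate for Mummy"

# Print the summary line of HEAD and of its two ancestors.
git log -1 --format="HEAD   %s" HEAD
git log -1 --format="HEAD~1 %s" HEAD~1
git log -1 --format="HEAD~2 %s" HEAD~2
```

The commit identifiers in your copy will differ from the ones shown in this lesson, but the relative names work the same way.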

We can also refer to commits using those long strings of digits and letters
that `git log` displays. These are unique IDs for the changes,
and "unique" really does mean unique: every change to any set of files on any computer
has a unique 40-character identifier.
Our first commit was given the ID `f22b25e3233b4645dabd0d81e651fe074bd8e73b`,
so let's try this:

~~~
$ git diff f22b25e3233b4645dabd0d81e651fe074bd8e73b mars.txt
~~~

~~~
diff --git a/mars.txt b/mars.txt
index df0654a..93a3e13 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1 +1,4 @@
 Cold and dry, but everything is my favorite color
+The two moons may be a problem for Wolfman
+But the Mummy will appreciate the lack of humidity
+An ill-considered change
~~~

That's the right answer, but typing out random 40-character strings is annoying,
so Git lets us use just the first few characters:

~~~
$ git diff f22b25e mars.txt
~~~

~~~
diff --git a/mars.txt b/mars.txt
index df0654a..93a3e13 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1 +1,4 @@
 Cold and dry, but everything is my favorite color
+The two moons may be a problem for Wolfman
+But the Mummy will appreciate the lack of humidity
+An ill-considered change
~~~

All right! So we can save changes to files and see what we've changed—now how
can we restore older versions of things?
Let's suppose we accidentally overwrite our file:

~~~
We will need to manufacture our own oxygen
~~~

`git status` now tells us that the file has been changed,
but those changes haven't been staged:

~~~
$ git status
~~~

~~~
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   mars.txt

no changes added to commit (use "git add" and/or "git commit -a")
~~~

We can put things back the way they were
by using `git checkout`:

~~~
$ git checkout HEAD mars.txt
$ cat mars.txt
~~~

~~~
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
~~~

As you might guess from its name, `git checkout` checks out (i.e., restores) an old version of a file.
In this case, we're telling Git that we want to recover the version of the file recorded in `HEAD`,
which is the last saved commit.
If we want to go back even further, we can use a commit identifier instead:

~~~
$ git checkout f22b25e mars.txt
~~~

~~~
$ cat mars.txt
~~~

~~~
Cold and dry, but everything is my favorite color
~~~

~~~
$ git status
~~~

~~~
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   mars.txt
~~~

Notice that the changes are already in the staging area.
Again, we can put things back the way they were
by using `git checkout`:

~~~
$ git checkout HEAD mars.txt
~~~
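The whole overwrite-and-restore cycle can be rehearsed safely in a throwaway repository. This is a minimal sketch (`demo_dir` is illustrative) that uses the explicit `--` separator before the file name:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

echo "Cold and dry, but everything is my favorite color" > mars.txt
git add mars.txt
git commit --quiet -m "Start notes on Mars as a base"

# Accidentally overwrite the file...
echo "We will need to manufacture our own oxygen" > mars.txt

# ...and bring back the committed version. The -- marks what follows as file names.
git checkout HEAD -- mars.txt
cat mars.txt
```

Note that the overwritten text is gone for good after the checkout, since it was never committed.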

### Don't Lose Your HEAD

Above we used

~~~
$ git checkout f22b25e mars.txt
~~~

to revert `mars.txt` to its state after the commit `f22b25e`. But be careful!
The command `checkout` has other important functionalities, and Git will misunderstand
your intentions if you are not accurate with the typing. For example,
suppose you forget `mars.txt` in the previous command:

~~~
$ git checkout f22b25e
~~~
~~~
Note: checking out 'f22b25e'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

 git checkout -b <new-branch-name>

HEAD is now at f22b25e Start notes on Mars as a base
~~~

The "detached HEAD" is like "look, but don't touch" here,
so you shouldn't make any changes in this state.
After investigating your repo's past state, reattach your `HEAD` with `git checkout master`.

It's important to remember that
we must use the commit number that identifies the state of the repository
*before* the change we're trying to undo.
A common mistake is to use the number of
the commit in which we made the change we're trying to get rid of.
In the example below, we want to retrieve the state from before the most
recent commit (`HEAD~1`), which is commit `f22b25e`:

![Git Checkout](images/lect06_git-checkout.svg)

So, to put it all together,
here's how Git works in cartoon form:

![https://figshare.com/articles/How_Git_works_a_cartoon/1328266](images/lect06_git_staging.svg)

### Simplifying the Common Case
If you read the output of `git status` carefully,
you'll see that it includes this hint:
~~~
(use "git checkout -- <file>..." to discard changes in working directory)
~~~
As it says, `git checkout` without a version identifier restores files to the state saved in `HEAD`.
The double dash `--` (optional) separates the names of the files being recovered
from the command itself. Without it, Git might try to use the name of the file as the commit identifier (but I haven't had this problem).

The fact that files can be reverted one by one
tends to change the way people organize their work.
If everything is in one large document,
it's hard (but not impossible) to undo changes to the introduction
without also undoing changes made later to the conclusion.
If the introduction and conclusion are stored in separate files,
on the other hand,
moving backward and forward in time becomes much easier.

### Explore and Summarize Histories

Exploring history is an important part of Git, and it is often a challenge to find
the right commit ID, especially if the commit is from several months ago.

Imagine the `planets` project has more than 50 files.
You would like to find a commit in which specific text in `mars.txt` was modified.
When you type `git log`, a very long list appears.
How can you narrow down the search?

Recall that the `git diff` command allows us to explore one specific file,
e.g. `git diff mars.txt`. We can apply a similar idea here.

~~~
$ git log mars.txt
~~~

Unfortunately some of these commit messages are very ambiguous e.g. `update files`.
How can you search through these files?

Both `git diff` and `git log` are very useful and they summarize a different part of the history for you.
Is it possible to combine both? Let's try the following:

~~~
$ git log --oneline --patch mars.txt
~~~

You should get a long list of output, and you should be able to see both commit messages and the difference between each commit.
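If you remember a word that was added or removed rather than the commit message, `git log -S` (an option not used elsewhere in this lesson) lists only the commits that changed how often that text occurs. A minimal sketch in a throwaway repository, where `demo_dir` and the helper function `add_line` are hypothetical names introduced for the example:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

# Hypothetical helper: append one line to mars.txt and commit it.
add_line() {
    echo "$1" >> mars.txt
    git add mars.txt
    git commit --quiet -m "$2"
}

add_line "Cold and dry, but everything is my favorite color" \
         "Start notes on Mars as a base"
add_line "The two moons may be a problem for Wolfman" \
         "Add concerns about effects of Mars' moons on Wolfman"
add_line "But the Mummy will appreciate the lack of humidity" \
         "Discuss concerns about Mars' climate for Mummy"

# Show only the commits in which the text "Mummy" was added to or removed from mars.txt.
git log --oneline -S"Mummy" -- mars.txt
```

Adding `--patch` to the last command would also show the matching change itself.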

### Tagging

If there is a particularly important commit, e.g. a new version of a program, you can add an identifier to your commit (that is, just after committing). There are two types of tags. A "lightweight" tag is as simple as:

~~~
$ git tag v0.1-lw
~~~

This tag simply points to a particular commit and can then be used to check out that commit, instead of the alphanumeric commit ID.

More useful is the annotated tag:

~~~
$ git tag -a v1.0 -m "my version 1.0"
~~~

Annotated tags are stored as full objects in the Git database. They are checksummed; contain the tagger name, email, and date; have a tagging message; and can be cryptographically signed and verified with GNU Privacy Guard (GPG). It is generally recommended that you create annotated tags so you have all this information and the tag serves as a full bookmark of the state.

If desired, read more about tagging [here](https://git-scm.com/book/en/v2/Git-Basics-Tagging).
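
The difference between the two kinds of tag can be checked directly. This is a minimal sketch in a throwaway repository (temporary directory and placeholder identity are assumptions); it uses `git cat-file -t` to ask what kind of object each tag name refers to.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"

git tag v0.1-lw                       # lightweight tag
git tag -a v1.0 -m "my version 1.0"   # annotated tag

git tag                    # list all tags
git cat-file -t v0.1-lw    # a lightweight tag is just a name for a commit
git cat-file -t v1.0       # an annotated tag is its own object in the database
git show --no-patch v1.0   # tagger, date, and message, followed by the tagged commit
```

Either tag name can be used wherever a commit ID is expected, e.g. `git diff v0.1-lw` or `git checkout v1.0`.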

## Ignoring Things


What if we have files that we do not want Git to track for us,
like backup files created by our editor or intermediate files created during data analysis?
Let's create a few dummy files:

~~~
$ mkdir results
$ touch a.dat b.dat c.dat results/a.out results/b.out
~~~

and see what Git says:

~~~
$ git status
~~~

~~~
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	a.dat
	b.dat
	c.dat
	results/
nothing added to commit but untracked files present (use "git add" to track)
~~~

Putting these files under version control would be a waste of disk space.
What's worse, having them all listed could distract us from changes that actually matter,
so let's tell Git to ignore them.

We do this by creating a file called `.gitignore` in the root directory of our project
and adding these lines to it:
~~~
*.dat
results/
~~~

These patterns tell Git to ignore any file whose name ends in `.dat`
and everything in the `results` directory.
(If any of these files were already being tracked,
Git would continue to track them.)
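
If it is ever unclear why a file is (or is not) being ignored, `git check-ignore` can tell us which pattern matched. The sketch below recreates the dummy files in a throwaway repository; the temporary directory and the extra file `notes.txt` are assumptions added for the illustration.

```shell
cd "$(mktemp -d)"
git init -q
mkdir results
touch a.dat b.dat c.dat results/a.out results/b.out notes.txt

# The same two patterns as above.
printf '*.dat\nresults/\n' > .gitignore

# -v reports the file, line number, and pattern that matched each ignored path.
git check-ignore -v a.dat results/a.out

# A path that no pattern matches produces no output and a non-zero exit status.
git check-ignore -v notes.txt || echo "notes.txt is not ignored"

# Only .gitignore and notes.txt are reported as untracked.
git status --short
```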

Once we have created this file,
the output of `git status` is much cleaner:

~~~
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	.gitignore
nothing added to commit but untracked files present (use "git add" to track)
~~~

The only thing Git notices now is the newly-created `.gitignore` file.
You might think we wouldn't want to track it,
but everyone we're sharing our repository with will probably want to ignore
the same things that we're ignoring.
Let's add and commit `.gitignore`:

~~~
$ git add .gitignore
$ git commit -m "Ignore data files and the results folder."
$ git status
~~~

~~~
# On branch master
nothing to commit, working directory clean
~~~

As a bonus, using `.gitignore` helps us avoid accidentally adding to the repository files that we don't want to track:

~~~
$ git add a.dat
~~~

~~~
The following paths are ignored by one of your .gitignore files:
a.dat
Use -f if you really want to add them.
~~~

If we really want to override our ignore settings,
we can use `git add -f` to force Git to add something. For example,
`git add -f a.dat`.

We can also always see the status of ignored files if we want:

~~~
$ git status --ignored
~~~

~~~
On branch master
Ignored files:
 (use "git add -f <file>..." to include in what will be committed)

        a.dat
        b.dat
        c.dat
        results/

nothing to commit, working directory clean
~~~

## Branches!

When you first start your project you will by default be on a branch called `master`. Check for yourself with the following command, which will show you all your branches.

~~~
$ git branch -a
~~~

If I now want to start adding a new feature, best practice is to start a new branch.

~~~
$ git checkout -b mercury
~~~
~~~
Switched to a new branch 'mercury'
~~~

The `-b` flag created the new branch and switched to it. At any time, we can confirm that there are now two branches, `master` and `mercury`, with `git branch -a`, which also highlights which branch you are on.

Now let's make changes to the branch. 

~~~
$ echo "Mercury is the closest planet to our sun" > mercury.txt
$ git status
$ git add .
$ git commit -m 'Added first note re Mercury'
~~~

We can create multiple branches to work on different features, so let's do that:

~~~
$ git checkout -b saturn
$ echo "Saturn is my favorite planet because of its beautiful rings" > saturn.txt
$ git status
$ git add .
$ git commit -m 'Added first note re Saturn'
~~~

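FYI: by default, a new branch starts from whatever commit you currently have checked out, so `saturn` above was created from `mercury` and already contains its commit. You can base a new branch off any other existing branch with the command `git checkout -b <new-branch> <existing-branch>`.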
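
A quick way to check where a new branch starts is to compare commit histories. The following sketch uses a throwaway repository; the temporary directory, the placeholder identity, and the branch name `venus` are assumptions for the example. The `git symbolic-ref` line only makes sure the first branch is called `master` regardless of local Git defaults.

```shell
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example User"
git config user.email "user@example.com"

echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"

git checkout -q -b mercury
echo "Mercury is the closest planet to our sun" > mercury.txt
git add mercury.txt
git commit -q -m "Added first note re Mercury"

# Created while mercury is checked out, so it starts from mercury's latest commit.
git checkout -q -b saturn

# Explicit start point: venus starts from master even though we are on saturn.
git checkout -q -b venus master

ls                         # mercury.txt is absent on this branch
git log --oneline saturn   # two commits
git log --oneline venus    # one commit
```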

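When I'm done with my changes on my branch, I first merge `master` into that branch to pick up anything that changed on `master` in the meantime, and then switch to `master` and merge the branch into it. In this case, the first merge is not strictly necessary because I know that I didn't change `master`.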

~~~
$ git merge master
$ git checkout master
$ git branch -a
$ git merge saturn
$ git branch -d saturn
$ git branch -a
~~~

It is a good idea to delete a feature branch once the feature is finished and has been merged.
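
Git helps with this clean-up: `git branch --merged` lists the branches that are safe to remove, and `git branch -d` refuses to delete a branch whose work has not been merged. A minimal sketch, assuming a throwaway repository, a placeholder identity, and a hypothetical unmerged branch called `jupiter`:

```shell
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example User"
git config user.email "user@example.com"

echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"

# A finished feature: commit on saturn, then merge it into master.
git checkout -q -b saturn
echo "Saturn is my favorite planet because of its beautiful rings" > saturn.txt
git add saturn.txt
git commit -q -m "Added first note re Saturn"
git checkout -q master
git merge -q saturn

# An unfinished feature that has not been merged.
git checkout -q -b jupiter
echo "Jupiter notes in progress" > jupiter.txt
git add jupiter.txt
git commit -q -m "Start notes on Jupiter"
git checkout -q master

git branch --merged master   # branches already contained in master
git branch -d saturn         # succeeds: saturn is fully merged
git branch -d jupiter || echo "jupiter is not merged yet, so -d refuses to delete it"
git branch
```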

## Remotes in GitHub

Version control really comes into its own when we begin to collaborate with
other people.  We already have most of the machinery we need to do this; the
only thing missing is to copy changes from one repository to another.

Systems like Git allow us to move work between any two repositories.  In
practice, though, it's easiest to use one copy as a central hub, and to keep it
on the web rather than on someone's laptop.  Most programmers use hosting
services like [GitHub](https://github.com), [BitBucket](https://bitbucket.org) or
[GitLab](https://gitlab.com/) to hold those master copies; we'll explore the pros
and cons of this in the final section of this lesson.

Let's start by sharing the changes we've made to our current project with the
world.  Log in to GitHub, then click on the icon in the top right corner to
create a new repository called `planets`:

![Creating a Repository on GitHub (Step 1)](images/lect06_github-create-repo-01.png)

Name your repository "planets" and then click "Create Repository":

![Creating a Repository on GitHub (Step 2)](images/lect06_github-create-repo-02.png)

As soon as the repository is created, GitHub displays a page with a URL and some
information on how to configure your local repository:

![Creating a Repository on GitHub (Step 3)](images/lect06_github-create-repo-03.png)

This effectively does the following on GitHub's servers:

~~~
$ mkdir planets
$ cd planets
$ git init
~~~

If you remember back when we added and committed our earlier work on `mars.txt`, we had a diagram of the local repository
which looked like this:

![The Local Repository with Git Staging Area](images/lect06_git-staging-area.svg)

Now that we have two repositories, we need a diagram like this:

![Freshly-Made GitHub Repository](images/lect06_git-freshly-made-github-repo.svg)

Note that our local repository still contains our earlier work on `mars.txt`, but the
remote repository on GitHub appears empty as it doesn't contain any files yet.

The next step is to connect the two repositories.  We do this by making the
GitHub repository a [remote](http://swcarpentry.github.io/git-novice/reference#remote) for the local repository.
The home page of the repository on GitHub includes the string we need to
identify it:

![Where to Find Repository URL on GitHub](images/lect06_github-find-repo-string.png)

Click on the 'HTTPS' link to change the [protocol](http://swcarpentry.github.io/git-novice/reference#protocol) from
SSH to HTTPS.

### FYI: HTTPS vs. SSH

We use HTTPS here because it does not require additional configuration. Later
you may want to set up SSH access, which is a bit more secure, by
following one of the great tutorials from
[GitHub](https://help.github.com/articles/generating-ssh-keys),
[Atlassian/BitBucket](https://confluence.atlassian.com/display/BITBUCKET/Set+up+SSH+for+Git)
and [GitLab](https://about.gitlab.com/2014/03/04/add-ssh-key-screencast/)
(this one has a screencast).

### Back to the remote

Copy the URL above from the browser, go into the local `planets` repository, and run
this command:

~~~
$ git remote add origin https://github.com/vlad/planets.git
~~~

Make sure to use the URL for your repository rather than Vlad's: the only
difference should be your username instead of `vlad`.

We can check that the command has worked by running `git remote -v`:

~~~
origin   https://github.com/vlad/planets.git (push)
origin   https://github.com/vlad/planets.git (fetch)
~~~

The name `origin` is a local nickname for your remote repository. We could use
something else if we wanted to, but `origin` is by far the most common choice.

Once the nickname `origin` is set up, this command will push the changes from
our local repository to the repository on GitHub:

~~~
$ git push origin master
~~~

~~~
Counting objects: 9, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (9/9), 821 bytes, done.
Total 9 (delta 2), reused 0 (delta 0)
To https://github.com/vlad/planets
 * [new branch]      master -> master
Branch master set up to track remote branch master from origin.
~~~

The `git branch -a` command from earlier will display both your local and remote branches.
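
The same sequence can be rehearsed without a network connection. In the sketch below, a local bare repository stands in for the empty repository on GitHub; that stand-in, the temporary directory, and the placeholder identity are all assumptions for the example. With a real GitHub repository, only the URL given to `git remote add` would differ.

```shell
work="$(mktemp -d)"

# A bare repository plays the role of the freshly created GitHub repository.
git init -q --bare "$work/planets.git"

# Our local project with one commit.
git init -q "$work/planets"
cd "$work/planets"
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example User"
git config user.email "user@example.com"
echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"

# Connect the two repositories and publish master.
git remote add origin "$work/planets.git"
git remote -v
git push -q -u origin master   # -u is short for --set-upstream

# The remote-tracking branch remotes/origin/master is now listed too.
git branch -a
```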

### FYI: Network Proxies

If the network you are connected to uses a proxy, there is a chance that your
last command failed with "Could not resolve hostname" as the error message. To
solve this issue, you need to tell Git about the proxy:

~~~
$ git config --global http.proxy http://user:password@proxy.url
$ git config --global https.proxy http://user:password@proxy.url
~~~

When you connect to another network that doesn't use a proxy, you will need to
tell Git to disable the proxy using:

~~~
$ git config --global --unset http.proxy
$ git config --global --unset https.proxy
~~~

### FYI on Password Managers

If your operating system has a password manager configured, `git push` will
try to use it when it needs your username and password.  For example, this
is the default behavior for Git Bash on Windows. If you want to type your
username and password at the terminal instead of using a password manager,
type:

~~~
$ unset SSH_ASKPASS
~~~

in the terminal, before you run `git push`.  Despite the name, [git uses
`SSH_ASKPASS` for all credential
entry](https://git-scm.com/docs/gitcredentials#_requesting_credentials), so
you may want to unset `SSH_ASKPASS` whether you are using git via SSH or
https.

You may also want to add `unset SSH_ASKPASS` at the end of your `~/.bashrc`
to make git default to using the terminal for usernames and passwords.

### Back to our local and remote repositories

They are now in this state:

![GitHub Repository After First Push](images/lect06_github-repo-after-first-push.svg)

We can pull changes from the remote repository to the local one as well:

~~~
$ git pull origin master
~~~

~~~
From https://github.com/vlad/planets
 * branch            master     -> FETCH_HEAD
Already up-to-date.
~~~

Pulling has no effect in this case because the two repositories are already
synchronized.  If someone else had pushed some changes to the repository on
GitHub, though, this command would download them to our local repository.
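
Under the hood, a pull is a fetch followed by a merge, and running the two steps separately lets us look at incoming changes before applying them. The sketch below simulates this with a local bare repository in place of GitHub and two clones; the directory names `hub.git`, `owner`, and `partner`, and the placeholder identities, are assumptions for the example.

```shell
work="$(mktemp -d)"

# Local stand-in for the repository on GitHub.
git init -q --bare "$work/hub.git"
git --git-dir="$work/hub.git" symbolic-ref HEAD refs/heads/master

# First copy: create the project and publish it.
git init -q "$work/owner"
cd "$work/owner"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Owner"
git config user.email "owner@example.com"
echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"
git remote add origin "$work/hub.git"
git push -q -u origin master

# Second copy: someone else clones, commits, and pushes.
git clone -q "$work/hub.git" "$work/partner"
cd "$work/partner"
git config user.name "Example Partner"
git config user.email "partner@example.com"
echo "The two moons may be a problem for Wolfman" >> mars.txt
git commit -q -am "Add concerns about the moons"
git push -q origin master

# Back in the first copy: fetch downloads, merge applies.
cd "$work/owner"
git fetch -q origin
git log --oneline master..origin/master   # commits we do not have yet
git merge -q origin/master                # `git pull origin master` does both steps at once
cat mars.txt
```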

## Collaborating

For the next step, get into pairs.  One person will be the "Owner" and the other
will be the "Collaborator". The goal is that the Collaborator adds changes into
the Owner's repository. We will switch roles at the end, so both people will
play Owner and Collaborator.

If you're working through this lesson on your own, you can carry on by opening
a second terminal window. This window will represent your partner, working on another computer. You
won't need to give anyone access on GitHub, because both 'partners' are you.

If you do have a partner, the Owner needs to give the Collaborator access.
On GitHub, click the settings button on the right,
then select Collaborators, and enter your partner's username.

![Adding Collaborators on GitHub](images/lect06_github-add-collaborators.png)

*Note*: you can still copy any public repo even if you are not listed as a collaborator. Case in point, this class! We'll come back to this point.

*FYI*: If you want to add to a project on which you are not a collaborator, you can instead [`fork`](https://help.github.com/articles/fork-a-repo/) the repo, and eventually even ask the original project owner if they want to incorporate your changes.

To accept access to the Owner's repo, the Collaborator
needs to go to [https://github.com/notifications](https://github.com/notifications).
Once there she can accept access to the Owner's repo.

Next, the Collaborator needs to download a copy of the Owner's repository to her
 machine. This is called "cloning a repo". To clone the Owner's repo into
her `Desktop` folder, the Collaborator enters:

~~~
$ git clone https://github.com/vlad/planets.git ~/Desktop/vlad-planets
~~~

Replace 'vlad' with the Owner's username.

![After Creating Clone of Repository](images/lect06_github-collaboration.svg)

The Collaborator can now make a change in her clone of the Owner's repository,
exactly the same way as we've been doing before. It is best practice to only make changes to a new branch:

~~~
$ cd ~/Desktop/vlad-planets
$ git checkout -b pluto
$ echo 'It is so a planet!' > pluto.txt
$ git add pluto.txt
$ git commit -m "Add notes about Pluto"
~~~

~~~
 1 file changed, 1 insertion(+)
 create mode 100644 pluto.txt
~~~

Then push the change to the *Owner's repository* on GitHub, but as a new branch:

~~~
$ git push --set-upstream origin pluto
~~~

Take a look at the Owner's repository on its GitHub website now (you may need
to refresh your browser). You should be able to see that there is a new branch available. If you wish, you can request that the branch be merged into `master`.

![Request pull and merge](images/lect06_github-compare-pull.png)

Now, you can describe the features that you added (optional) or any other note for the owner to see before reviewing the request. You can also see what has changed.
![Add description](images/lect06_github-add-descript.png)

If there are no conflicts (a conflict would arise if, for example, someone else had already added a different file with the same name), you can review and accept the request.
![No conflicts](images/lect06_github-no-conflicts.png)

![Success](images/lect06_github-pull-success.png)

We can delete the remote branch now, and the Collaborator can delete the local branch `pluto` as well, if desired.
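
On the command line, the clean-up takes two commands: `git push origin --delete <branch>` for the remote branch and `git branch -d <branch>` for the local one. The sketch below rehearses this offline; the local bare repository standing in for GitHub, the temporary directory, and the placeholder identity are assumptions, and the local merge-and-push in the middle is only a stand-in for accepting the pull request on the website.

```shell
work="$(mktemp -d)"

# Local stand-in for GitHub, plus a project with master already published.
git init -q --bare "$work/hub.git"
git init -q "$work/planets"
cd "$work/planets"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"
echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"
git remote add origin "$work/hub.git"
git push -q -u origin master

# The feature branch, pushed as a new branch.
git checkout -q -b pluto
echo 'It is so a planet!' > pluto.txt
git add pluto.txt
git commit -q -m "Add notes about Pluto"
git push -q --set-upstream origin pluto

# Stand-in for accepting the pull request on GitHub.
git checkout -q master
git merge -q pluto
git push -q origin master

# Clean up: remove the branch on the remote, then locally.
git push -q origin --delete pluto
git branch -d pluto
git branch -a
```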

On the Owner's local computer, to download the Collaborator's changes from GitHub, the Owner now enters:

~~~
$ git pull origin master
~~~

Now the three repositories (Owner's local, Collaborator's local, and Owner's on
GitHub) are back in sync.

### A Basic Collaborative Workflow

In practice, it is good to be sure that you have an updated version of the
repository you are collaborating on, so you should `git pull` before making
your changes. The basic collaborative workflow would be:

- update your local repo with `git pull origin master`
- create a branch on which to make changes `git checkout -b <new-feature>`
- make your changes and stage them with `git add`
- commit your changes with `git commit -m`
- check if there will be any conflicts with the master with `git merge master` and resolve if necessary (see below)
- upload the changes to GitHub with `git push --set-upstream origin <new-feature>`
- request on GitHub that the new branch be merged into `master` (a pull request)
- delete remote and local branches
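
The command-line part of the steps above can be sketched end to end as follows. As before, a local bare repository stands in for GitHub; that stand-in, the directory names, the placeholder identities, and the feature branch `jupiter` are assumptions for this example. The pull request itself happens on the GitHub website and is not shown.

```shell
work="$(mktemp -d)"

# Stand-in for the shared repository on GitHub, seeded with one commit.
git init -q --bare "$work/hub.git"
git --git-dir="$work/hub.git" symbolic-ref HEAD refs/heads/master
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Owner"
git config user.email "owner@example.com"
echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"
git push -q "$work/hub.git" master

# The collaborator's clone.
git clone -q "$work/hub.git" "$work/planets"
cd "$work/planets"
git config user.name "Example Collaborator"
git config user.email "collaborator@example.com"

git pull -q origin master                    # update the local repo
git checkout -q -b jupiter                   # branch for the new feature
echo "Jupiter is the largest planet" > jupiter.txt
git add jupiter.txt                          # stage the change
git commit -q -m "Add notes about Jupiter"   # commit it
git merge -q master                          # check for conflicts with master
git push -q --set-upstream origin jupiter    # publish the branch

# The local branch now tracks its remote counterpart.
git rev-parse --abbrev-ref "jupiter@{upstream}"   # prints origin/jupiter
```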

It is better to make many commits with smaller changes rather than
one commit with massive changes: small commits are easier to
read and review.

### Comment Changes in GitHub

The Collaborator has some questions about a one-line change made by the Owner and
has some suggestions to propose.

With GitHub, it is possible to comment on the diff of a commit. Over the line of
code to comment on, a blue comment icon appears, which opens a comment window.

The Collaborator posts her comments and suggestions using the GitHub interface.

## Conflicts

As soon as people can work in parallel, they'll likely step on each other's
toes.  This will even happen with a single person: if we are working on
a piece of software on both our laptop and a server in the lab, we could make
different changes to each copy.  Version control helps us manage these
[conflicts](http://swcarpentry.github.io/git-novice/reference#conflicts) by giving us tools to
[resolve](http://swcarpentry.github.io/git-novice/reference#resolve) overlapping changes.

To see how we can resolve conflicts, we must first create one.  The file
`mars.txt` currently looks like this in both partners' copies of our `planets`
repository:

~~~
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
~~~

Let's add a line to one partner's copy only:

~~~
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
This line added to Wolfman's copy
~~~

and then push the change to GitHub:

~~~
$ git add mars.txt
$ git commit -m "Add a line in our home copy"
~~~

~~~
[master 5ae9631] Add a line in our home copy
 1 file changed, 1 insertion(+)
~~~

~~~
$ git push origin master
~~~

~~~
Counting objects: 5, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 352 bytes, done.
Total 3 (delta 1), reused 0 (delta 0)
To https://github.com/vlad/planets
   29aba7c..dabb4c8  master -> master
~~~

Now let's have the other partner
make a different change to their copy
*without* updating from GitHub:

~~~
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
We added a different line in the other copy
~~~

We can commit the change locally:

~~~
$ git add mars.txt
$ git commit -m "Add a line in my copy"
~~~

~~~
[master 07ebc69] Add a line in my copy
 1 file changed, 1 insertion(+)
~~~

but Git won't let us push it to GitHub:

~~~
$ git push origin master
~~~

~~~
To https://github.com/vlad/planets.git
 ! [rejected]        master -> master (non-fast-forward)
error: failed to push some refs to 'https://github.com/vlad/planets.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Merge the remote changes (e.g. 'git pull')
hint: before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
~~~

![The Conflicting Changes](images/lect06_conflict.svg)

Git rejects the push because it detects that the remote repository has new updates that have not been
incorporated into the local branch.
What we have to do is pull the changes from GitHub,
[merge](http://swcarpentry.github.io/git-novice/reference#merge) them into the copy we're currently working in,
and then push that.
Let's start by pulling:

~~~
$ git pull origin master
~~~

~~~
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 3 (delta 1)
Unpacking objects: 100% (3/3), done.
From https://github.com/vlad/planets
 * branch            master     -> FETCH_HEAD
Auto-merging mars.txt
CONFLICT (content): Merge conflict in mars.txt
Automatic merge failed; fix conflicts and then commit the result.
~~~

The `git pull` command updates the local repository to include those
changes already included in the remote repository.
After the changes from the remote branch have been fetched, Git detects that changes made to the local copy
overlap with those made to the remote repository, and therefore refuses to merge the two versions automatically to
stop us from trampling on our previous work. The conflict is marked
in the affected file:

~~~
$ cat mars.txt
~~~

~~~
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
<<<<<<< HEAD
We added a different line in the other copy
=======
This line added to Wolfman's copy
>>>>>>> dabb4c8c450e8475aee9b14b4383acc99f42af1d
~~~

Our change is preceded by `<<<<<<< HEAD`.
Git has then inserted `=======` as a separator between the conflicting changes
and marked the end of the content downloaded from GitHub with `>>>>>>>`.
(The string of letters and digits after that marker
identifies the commit we've just downloaded.)

It is now up to us to edit this file to remove these markers
and reconcile the changes.
We can do anything we want: keep the change made in the local repository, keep
the change made in the remote repository, write something new to replace both,
or get rid of the change entirely.
Let's replace both so that the file looks like this:

~~~
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
We removed the conflict on this line
~~~

To finish merging,
we add `mars.txt` to the changes being made by the merge
and then commit:

~~~
$ git add mars.txt
$ git status
~~~

~~~
On branch master
All conflicts fixed but you are still merging.
  (use "git commit" to conclude merge)

Changes to be committed:

	modified:   mars.txt

~~~

~~~
$ git commit -m "Merge changes from GitHub"
~~~

~~~
[master 2abf2b1] Merge changes from GitHub
~~~

Now we can push our changes to GitHub:

~~~
$ git push origin master
~~~

~~~
Counting objects: 10, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 697 bytes, done.
Total 6 (delta 2), reused 0 (delta 0)
To https://github.com/vlad/planets.git
   dabb4c8..2abf2b1  master -> master
~~~

Git keeps track of what we've merged with what,
so we don't have to fix things by hand again
when the collaborator who made the first change pulls again:

~~~
$ git pull origin master
~~~

~~~
remote: Counting objects: 10, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 2), reused 6 (delta 2)
Unpacking objects: 100% (6/6), done.
From https://github.com/vlad/planets
 * branch            master     -> FETCH_HEAD
Updating dabb4c8..2abf2b1
Fast-forward
 mars.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
~~~

We get the merged file:

~~~
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
We removed the conflict on this line
~~~

We don't need to merge again because Git knows someone has already done that.
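
The whole conflict cycle can also be rehearsed in a single local repository, which is a low-risk way to practice. In this sketch, a branch called `partner` stands in for the other person's copy; that branch name, the temporary directory, and the placeholder identity are assumptions for the example.

```shell
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example User"
git config user.email "user@example.com"

printf '%s\n' \
  "Cold and dry, but everything is my favorite color" \
  "The two moons may be a problem for Wolfman" \
  "But the Mummy will appreciate the lack of humidity" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"

# One copy adds a line...
git checkout -q -b partner
echo "This line added to Wolfman's copy" >> mars.txt
git commit -q -am "Add a line in our home copy"

# ...and the other copy adds a different line in the same place.
git checkout -q master
echo "We added a different line in the other copy" >> mars.txt
git commit -q -am "Add a line in my copy"

# The merge stops and leaves conflict markers in mars.txt.
git merge partner || echo "merge stopped: conflict in mars.txt"
git diff --name-only --diff-filter=U   # files that still have unresolved conflicts
cat mars.txt

# Resolve by writing the version we want, then stage and commit.
printf '%s\n' \
  "Cold and dry, but everything is my favorite color" \
  "The two moons may be a problem for Wolfman" \
  "But the Mummy will appreciate the lack of humidity" \
  "We removed the conflict on this line" > mars.txt
git add mars.txt
git commit -q -m "Merge changes from partner"
git log --oneline --graph
```

In real use, the resolution step is done in an editor rather than with `printf`; the point here is only that the merge is concluded by `git add` followed by `git commit`.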

Git's ability to resolve conflicts is very useful, but conflict resolution
costs time and effort, and can introduce errors if conflicts are not resolved
correctly. If you find yourself resolving a lot of conflicts in a project,
consider these technical approaches to reducing them:

- Pull from upstream more frequently, especially before starting new work
- Use topic branches to segregate work, merging to master when complete
- Make smaller, more atomic commits
- Where logically appropriate, break large files into smaller ones so that it is
  less likely that two authors will alter the same file simultaneously

Conflicts can also be minimized with project management strategies:

- Clarify who is responsible for what areas with your collaborators
- Discuss what order tasks should be carried out in with your collaborators so
  that tasks expected to change the same lines won't be worked on simultaneously
- If the conflicts are stylistic churn (e.g. tabs vs. spaces), agree on a
  governing project convention and, if necessary, use code style tools (e.g.
  `htmltidy`, `perltidy`, `rubocop`, etc.) to enforce it

## Open Science

> The opposite of "open" isn't "closed".
> The opposite of "open" is "broken".
>
> --- John Wilbanks


Free sharing of information might be the ideal in science,
but the reality is often more complicated.
Normal practice today looks something like this:

-   A scientist collects some data and stores it on a machine
    that is occasionally backed up by her department.
-   She then writes or modifies a few small programs
    (which also reside on her machine)
    to analyze that data.
-   Once she has some results,
    she writes them up and submits her paper.
    She might include her data—a growing number of journals require this—but
    she probably doesn't include her code.
-   Time passes.
-   The journal sends her reviews written anonymously by a handful of other people in her field.
    She revises her paper to satisfy them,
    during which time she might also modify the scripts she wrote earlier,
    and resubmits.
-   More time passes.
-   The paper is eventually published.
    It might include a link to an online copy of her data,
    but the paper itself will be behind a paywall:
    only people who have personal or institutional access
    will be able to read it.

For a growing number of scientists,
though, the process looks like this:

-   The data that the scientist collects is stored in an open access repository
    like [figshare](https://figshare.com/) or
    [Zenodo](https://zenodo.org), possibly as soon as it's collected,
    and given its own
    [Digital Object Identifier](https://en.wikipedia.org/wiki/Digital_object_identifier) (DOI).
    Or the data was already published and is stored in
    [Dryad](https://datadryad.org/).
-   The scientist creates a new repository on GitHub to hold her work.
-   As she does her analysis,
    she pushes changes to her scripts
    (and possibly some output files)
    to that repository.
    She also uses the repository for her paper;
    that repository is then the hub for collaboration with her colleagues.
-   When she's happy with the state of her paper,
    she posts a version to [arXiv](https://arxiv.org/)
    or some other preprint server
    to invite feedback from peers.
-   Based on that feedback,
    she may post several revisions
    before finally submitting her paper to a journal.
-   The published paper includes links to her preprint
    and to her code and data repositories,
    which makes it much easier for other scientists
    to use her work as a starting point for their own research.

This open model accelerates discovery:
the more open work is,
[the more widely it is cited and re-used](https://doi.org/10.1371/journal.pone.0000308).

However, people who want to work this way need to make some decisions
about what exactly "open" means and how to do it. You can find more on the different aspects of Open Science in [this book](https://link.springer.com/book/10.1007/978-3-319-00026-8).

This is one of the (many) reasons we teach version control.
When used diligently, it answers the "how" question
by acting as a shareable electronic lab notebook for computational work:

-   The conceptual stages of your work are documented, including who did
    what and when. Every step is stamped with an identifier (the commit ID)
    that is for most intents and purposes unique.
-   You can tie documentation of rationale, ideas, and other
    intellectual work directly to the changes that spring from them.
-   You can refer to what you used in your research to obtain your
    computational results in a way that is unique and recoverable.
-   With a version control system such as Git, 
    the entire history of the repository is easy to archive for perpetuity.
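
As a small illustration of the "unique and recoverable" point, the commit ID of the code that produced a result can be printed (and saved next to the result). This is only a sketch: the temporary directory, the placeholder identity, and the script name `analysis.py` are assumptions for the example.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo 'print("analysis")' > analysis.py   # hypothetical analysis script
git add analysis.py
git commit -q -m "Add analysis script"

# Full commit ID of the exact code version in use.
git rev-parse HEAD

# Abbreviated ID (or the nearest annotated tag, if there is one).
git describe --always --dirty

# With uncommitted edits to tracked files, --dirty appends "-dirty" as a warning
# that the results were not produced from a committed state.
echo '# tweak' >> analysis.py
git describe --always --dirty
```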

## Licensing

When a repository with source code, a manuscript or other creative
works becomes public, it should include a file `LICENSE` or
`LICENSE.txt` in the base directory of the repository that clearly
states under which license the content is being made available. This
is because creative works are automatically eligible for intellectual
property (and thus copyright) protection. Reusing creative works
without a license is dangerous, because the copyright holders could
sue you for copyright infringement.

A license solves this problem by granting rights to others (the
licensees) that they would otherwise not have. What rights are being
granted under which conditions differs, often only slightly, from one
license to another. In practice, a few licenses are by far the most
popular, and [choosealicense.com](https://choosealicense.com/) will
help you find a common license that suits your needs.  Important
considerations include:

- Whether you want to address patent rights.
- Whether you require people distributing derivative works to also
  distribute their source code.
- Whether the content you are licensing is source code.
- Whether you want to license the code at all.

Choosing a license that is in common use makes life easier for
contributors and users, because they are more likely to already be
familiar with the license and don't have to wade through a bunch of
jargon to decide if they're ok with it.  The [Open Source
Initiative](https://opensource.org/licenses) and [Free Software
Foundation](https://www.gnu.org/licenses/license-list.html) both
maintain lists of licenses which are good choices.

[This article](https://doi.org/10.1371/journal.pcbi.1002598) provides an excellent overview of
licensing and licensing options from the perspective of scientists who
also write code.

At the end of the day what matters is that there is a clear statement
as to what the license is. Also, the license is best chosen from the
get-go, even for a repository that is not public. Pushing off the
decision only makes it more complicated later, because each time a new
collaborator starts contributing, they, too, hold copyright and will
thus need to be asked for approval once a license is chosen.

## Citation

You may want to include a file called `CITATION` or `CITATION.txt`
that describes how to reference your project;
the [one for Software
Carpentry](https://github.com/swcarpentry/website/blob/gh-pages/CITATION)
states:

~~~
To reference Software Carpentry in publications, please cite both of the following:

Greg Wilson: "Software Carpentry: Getting Scientists to Write Better
Code by Making Them More Productive".  Computing in Science &
Engineering, Nov-Dec 2006.

Greg Wilson: "Software Carpentry: Lessons Learned". arXiv:1307.5448,
July 2013.

@article{wilson-software-carpentry-2006,
    author =  {Greg Wilson},
    title =   {Software Carpentry: Getting Scientists to Write Better Code by Making Them More Productive},
    journal = {Computing in Science \& Engineering},
    month =   {November--December},
    year =    {2006},
}

@online{wilson-software-carpentry-2013,
  author      = {Greg Wilson},
  title       = {Software Carpentry: Lessons Learned},
  version     = {1},
  date        = {2013-07-20},
  eprinttype  = {arxiv},
  eprint      = {1307.5448}
}
~~~

More detailed advice, and other ways to make your code citable can be found
[at the Software Sustainability Institute blog](https://www.software.ac.uk/how-cite-and-describe-software) and in:

>  Smith AM, Katz DS, Niemeyer KE, FORCE11 Software Citation Working Group. 
>  (2016) Software citation principles. [PeerJ Computer Science 2:e86](https://peerj.com/articles/cs-86/) https://doi.org/10.7717/peerj-cs.86
 
There is also an [`@software{…`](https://www.google.de/search?q=git+citation+%22%40software%7B%22) 
[BibTeX](https://www.ctan.org/pkg/bibtex) entry type in case
no "umbrella" citation like a paper or book exists for the project you want to
make citable.

## Hosting

The second big question for groups that want to open up their work is where to
host their code and data.  One option is for the lab, the department, or the
university to provide a server, manage accounts and backups, and so on.  The
main benefit of this is that it clarifies who owns what, which is particularly
important if any of the material is sensitive (i.e., relates to experiments
involving human subjects or may be used in a patent application).  The main
drawbacks are the cost of providing the service and its longevity: a scientist
who has spent ten years collecting data would like to be sure that data will
still be available ten years from now, but that's well beyond the lifespan of
most of the grants that fund academic infrastructure.

Another option is to purchase a domain and pay an Internet service provider
(ISP) to host it.  This gives the individual or group more control, and
sidesteps problems that can arise when moving from one institution to another,
but requires more time and effort to set up than either the option above or the
option below.

The third option is to use a public hosting service like
[GitHub](https://github.com), [GitLab](https://gitlab.com), or
[Bitbucket](https://bitbucket.org).
Each of these services provides a web interface that enables people to create,
view, and edit their code repositories.  These services also provide
communication and project management tools including issue tracking, wiki pages,
email notifications, and code reviews.  These services benefit from economies of
scale and network effects: it's easier to run one large service well than to run
many smaller services to the same standard.  It's also easier for people to
collaborate.  Using a popular service can help connect your project with
communities already using the same service.

As an example, Software Carpentry [is on
GitHub](https://github.com/swcarpentry), where you can find the [source for the
lesson this notebook is based on](https://github.com/swcarpentry/git-novice).
Anyone with a GitHub account can suggest changes to that text.

A GitHub repository can also be assigned a DOI [by connecting its releases to
Zenodo](https://guides.github.com/activities/citable-code/). For example,
[`10.5281/zenodo.57467`](https://zenodo.org/record/57467) is the DOI that has
been "minted" for the Software Carpentry introduction to Git.

Using large, well-established services can also help you quickly take advantage
of powerful tools.  One such tool, continuous integration (CI), can
automatically run software builds and tests whenever code is committed or pull
requests are submitted.  Direct integration of CI with an online hosting service
means this information is present in any pull request, and helps maintain code
integrity and quality standards.  While CI is still available in self-hosted
situations, there is much less setup and maintenance involved with using an
online service.  Furthermore, such tools are often provided free of charge to
open source projects, and are also available for private repositories for a fee.
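
To make the idea concrete, here is a minimal local sketch of what a CI job does on each commit: it checks out the committed state into a clean directory and runs the tests there. It is not how any particular hosting service implements CI. The repository, the `run_tests.sh` script, and the `ci_check` function are hypothetical names made up for this illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical project: a throwaway repository with one "test" script.
workdir="$(mktemp -d)"
git init --quiet "$workdir/project"
cd "$workdir/project"
# Placeholder identity, set for this demo repository only.
git config user.name "Demo User"
git config user.email "demo@example.com"

# run_tests.sh is an illustrative stand-in for a real build-and-test command.
printf '#!/bin/sh\ngrep -q "answer" data.txt\n' > run_tests.sh
echo "answer: 42" > data.txt
git add run_tests.sh data.txt
git commit --quiet -m "Add data and a test"

# A CI job in miniature: clone the committed state into a clean directory
# and run the tests there, not in the developer's working copy.
ci_check() {
    local build_dir
    build_dir="$(mktemp -d)"
    git clone --quiet "$workdir/project" "$build_dir/checkout"
    if (cd "$build_dir/checkout" && sh run_tests.sh); then
        echo "PASS"
    else
        echo "FAIL"
    fi
}

ci_result="$(ci_check)"
echo "CI result for commit $(git rev-parse --short HEAD): $ci_result"
```

Running the tests against a fresh clone is the important design choice: files that exist only in your working copy (or that you forgot to commit) cannot make the tests pass by accident.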

### Institutional Barriers

Sharing is the ideal for science,
but many institutions place restrictions on sharing,
for example to protect potentially patentable intellectual property.
If you encounter such restrictions,
it can be productive to inquire about the underlying motivations and
either to request an exception for a specific project or domain,
or to push more broadly for institutional reform to support more open science.

## Summary of key points from this notebook

- Version control is like an unlimited ‘undo’.
- Version control also allows many people to work in parallel.
- `git init` initializes a repository.
- Git stores all of its repository data in the `.git` directory. Be *really* careful before deleting such a directory!
- `git status` shows the status of a repository.
- `git add` puts files in the staging area.
- `git commit` saves the staged content as a new commit in the local repository.
- `git diff` displays differences between commits.
- `git checkout` recovers old versions of files.
- `git log --oneline` displays one line describing each commit in the repo.
- The `.gitignore` file tells Git what files to ignore.
- `git clone` copies a remote repository to create a local repository with a remote called `origin` automatically set up.
- Conflicts occur when two or more people change the same file(s) at the same time.
- The version control system does not allow people to overwrite each other’s changes blindly, but highlights conflicts so that they can be resolved.
- Open scientific work is more useful and more highly cited than closed.
- People who are not lawyers should not try to write licenses from scratch.
- Add a CITATION file to a repository to explain how you want your work cited.
- Projects can be hosted on university servers, on personal domains, or on public forges.
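
As a recap, the commands in the list above can be strung together into one short session. This is only a sketch that runs in a temporary directory; the file names, commit messages, and the `Demo User` identity are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so nothing here touches real projects.
workdir="$(mktemp -d)"
cd "$workdir"

# git init: create a repository (its data lives in the .git directory).
git init --quiet demo
cd demo
# Placeholder identity, set for this demo repository only.
git config user.name "Demo User"
git config user.email "demo@example.com"

# .gitignore: tell Git which files to ignore.
echo "*.log" > .gitignore
echo "scratch output" > debug.log
echo "first line" > notes.txt

# git status: debug.log is not listed because it is ignored.
git status --short

# git add + git commit: stage content, then save it as a commit.
git add .gitignore notes.txt
git commit --quiet -m "Add notes and ignore rules"

# Change the file, inspect the unstaged change, then commit it.
echo "second line" >> notes.txt
git diff
git add notes.txt
git commit --quiet -m "Extend notes"

# git log --oneline: one line per commit.
git log --oneline

# git checkout: recover notes.txt as it was in the previous commit.
git checkout --quiet HEAD~1 -- notes.txt
cat notes.txt

# git clone: copy the repository; the copy gets a remote called origin.
cd "$workdir"
git clone --quiet demo demo-copy
git -C demo-copy remote
```

Note that `git checkout HEAD~1 -- notes.txt` restores only that one file and stages the restored version; the two commits themselves stay in the history.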