# Version control with Git

This lesson covers version control with Git as a way to keep track of changes to a code-based project. We'll talk about why version control systems are useful for tracking changes to local projects, and how they enable remote collaboration and crediting of multiple authors on a project hosted in a remote repository on a platform like GitHub. This material draws from the [Version Control with Git](https://swcarpentry.github.io/git-novice/) lesson from Software Carpentry.

# Setup requirements

For Windows users, Git should be installed already on your computer as part of your Bash install from Day 1. 

Mac users need to install Git for Mac by downloading and running the most recent "mavericks" installer from [this list on Sourceforge.net](https://sourceforge.net/projects/git-osx-installer/files/). Because this installer is not signed by the developer, you may have to right click (control click) on the .pkg file, click Open, and click Open on the pop up window. After installing Git, there will not be anything in your /Applications folder, as Git is a command line program. For older versions of OS X (10.5-10.8) use the most recent available installer labelled "snow-leopard" available here. 

More detailed instructions and videos are available here: https://carpentries.github.io/workshop-template/#git

## Data and script for Day 3
We will continue to use the data from Day 2 on cities etc. _____

# Background

Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn’t just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.

A version control system is a tool that keeps track of changes to our files for us, effectively creating different versions of those files. It allows us to decide which changes will be made to the next version (each record of these changes is called a commit), and keeps useful metadata about them. The complete history of commits for a particular project and their metadata make up a repository. Repositories can be kept in sync across different computers, facilitating collaboration among different people. Version control systems start with a base version of the document and then record the changes you make each step of the way. You can think of it as a recording of your progress: you can rewind to the base document and play back each change you made, eventually arriving at your most recent version.

## Remote collaboration
It seems ridiculous to have multiple nearly-identical versions of the same document that we keep passing back and forth to create something whole. 

Collaborative writing with traditional word processors is cumbersome. Either every collaborator has to work on a document sequentially (slowing down the process of writing), or you have to send out a version to all collaborators and manually merge their comments into your document. The ‘track changes’ or ‘record changes’ option can highlight changes for you and simplifies merging, but as soon as you accept changes you will lose their history. You will then no longer know who suggested that change, why it was suggested, or when it was merged into the rest of the document. Features such as Microsoft Word’s Track Changes or Google Docs’ version history help a little, but they lack a streamlined way to attach custom messages to changes while keeping a single, up-to-date version of the file for everyone working on it. The result is a lot of files with names like Final_paper_EDIT01.doc or Final_paper_EDITCOMMITTEE03.doc, and it gets messy trying to merge suggestions and changes made by multiple people in multiple documents sent over email. I'm sure we all know the headache!

Using Git with a remote platform like GitHub avoids much of this. Unless multiple users make changes to the same section of the document (a conflict), you can incorporate two sets of changes into the same base document.

## Individual work
Teams are not the only ones to benefit from version control: lone researchers can benefit immensely. Keeping a record of what was changed, when, and why is extremely useful for all researchers if they ever need to come back to the project later on (e.g., a year later, when memory has faded).




# Setting up Git

When we use Git on a new computer for the first time, we need to configure a few things. The commands we will run only need to be run once: the flag `--global` tells Git to use the settings for every project, in your user account, on this computer. Below are configurations we will set as we get started with Git:

* our name and email address,
* what our preferred text editor is,
* and that we want to use these settings globally (i.e. for every project).

Open up your terminal or GitBash command prompt and type:

`$ git config --global user.name "<insert your name here>"`

`$ git config --global user.email "<insert your email here>"`

For these lessons, we will be interacting with GitHub and so the email address used should be the same as the one used when setting up your GitHub account.

We will also set up nano to be the preferred text editor for Git (you can always reconfigure this if you want to change to a different text editor in the future):

`$ git config --global core.editor "nano -w"`

You can check your settings at any time by running:

`$ git config --list`

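We will be working with some of the most common Git commands today. If you ever forget the options for a command, you can run `git <command> -h` (for example, `git config -h`) for a short summary, or `git <command> --help` (for example, `git config --help`) to open the detailed manual page for that command. Running `git help` on its own lists the most commonly used Git commands.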
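To see what `--global` changes, it can help to compare it with a setting made without that flag. The sketch below works in a throwaway folder created with `mktemp`, and the name and email are placeholders, so it does not touch your real configuration or projects.

```shell
# Work in a throwaway folder so nothing here touches your real projects.
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet

# Without --global, a setting is stored for this one repository only.
# The name and email below are placeholders for illustration.
git config user.name "Ada Example"
git config user.email "ada@example.org"

# Read a single setting back; a repository value wins over the global one.
git config user.name

# List every setting together with the file it comes from.
git config --list --show-origin
```

Settings made this way live in the repository's own `.git/config` file, which is handy if one project needs a different email address from the rest.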


# Make the `world_cities_workshop/` folder into a local Git repository

The first step in implementing Git for version control on any project is initializing a new Git repository in the main folder of that project. This is why keeping everything organized in subdirectories in one place is especially useful! For this workshop, our project folder is the `world_cities_workshop/` folder that should be stored on your Desktop, so we will initialize a new Git repository there. When we initialize a Git repository in our main folder, the contents of every subfolder in it can be tracked with Git as well, so we only need to do this step once, in the main `world_cities_workshop/` folder.

To do so, in your terminal you should `cd` into the `world_cities_workshop/` folder:

`$ cd ~/Desktop/world_cities_workshop`

And then we will run our first git command to initialize a git repository in this folder:

`$ git init`

And just like that, we've started a local repository that is ready to track changes for us. Pretty neat! Of course, we're still in the driver's seat: Git requires our input about what to track, what to ignore, and when to save or "commit" changes that we've made to our files and folders. For now, we can check that the git repository has indeed been initialised by running the `ls` command with the `-a` flag, which shows all files and folders, including the hidden ones that git has created:

`$ ls -a`

You should see in the output a hidden folder called `.git/`. This is where git will store all the information from commits you make based on changes in your projects, and also where you can revert to previous versions and specific commits if you ever need to. This folder needs to stay put so that the project remains tracked with git and we don't lose information about changes we've made - the choice to hide it makes it a little bit harder to delete (which we don't want to do)! :) 

Let's do one last check to make sure our git repository was set up successfully in our `world_cities_workshop/` directory. We'll ask git to report on its current status using:

`$ git status`

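You should see a message indicating that you are on a branch called "master" (or "main", depending on how Git is configured) and that there are no commits yet. If the folder already contains files, Git will list them as untracked. This makes sense because we haven't asked Git to record anything yet. That's coming up!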
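The claim that one `git init` in the main folder covers every subfolder can be checked directly. This is a minimal sketch in a throwaway folder; the `data` subfolder is a stand-in for `world_cities_workshop/data/`.

```shell
# Throwaway folder standing in for the main project folder.
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet

# A subfolder, standing in for world_cities_workshop/data/.
mkdir data
cd data

# Git walks up the folder tree until it finds .git, so this prints "true".
git rev-parse --is-inside-work-tree

# Only the main folder contains a .git directory; the subfolder has none.
ls -a ..
ls -a .
```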



# Making a change to a file

## _____ adding something to the map.py script

## Moving changes to the staging area with `git add`

If you think of Git as taking snapshots of changes over the life of a project, `git add` specifies what will go in a snapshot (putting things in the staging area), and `git commit` then actually takes the snapshot, and makes a permanent record of it (as a commit). 

If you don’t have anything staged when you type `git commit`, Git will prompt you to use `git commit -a` or `git commit --all`, which is kind of like gathering everyone to take a group photo! However, it’s almost always better to explicitly add things to the staging area, because with `-a` you might commit changes you forgot you made. The staging area can hold changes from any number of files that you want to commit as a single snapshot.

First, let's check the status of changes that git knows about using `git status`:

`$ git status`

You should see a message that you are on the branch "master", there are no commits yet, and there is one untracked file - the `map.py` that we just made changes to. We need to add these changes to the staging area to let git know we want to commit the latest version. We do this by running:

`$ git add <name_of_file>`

so, in this case:

`$ git add map.py`

Notice that git gives pretty helpful messages when you interact with it on the command line. In the `git status` output above, for example, it told us to use `git add` to track this file and get the changes ready for committing.
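The staging area is easiest to understand when only some of the changed files are staged. The sketch below uses a throwaway folder; `notes.txt` is a hypothetical second file added purely for illustration, and the contents of `map.py` are placeholders.

```shell
# Throwaway repository so the real project is left alone.
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet

# Two new files; notes.txt is a hypothetical extra file.
echo 'print("hello")' > map.py
echo 'remember to tidy up' > notes.txt

# Stage only map.py.
git add map.py

# Short status: "A " marks a staged new file, "??" an untracked one.
git status --short
# prints:
# A  map.py
# ?? notes.txt
```

A commit made at this point would contain `map.py` only; `notes.txt` would stay untracked until it is added as well.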

## Committing changes and adding a message with `git commit`

When we run `git commit`, Git takes everything we have told it to save by using `git add` and stores a copy permanently inside the special `.git` directory. This permanent copy is called a commit (or revision), and each one has a short identifier that can be used to point to that version of your files if you ever need to revert to it or investigate the changes made.

We use the `-m` flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run `git commit` without the `-m` option, Git will launch nano so that we can write a longer message.

Good commit messages start with a brief (<50 characters) statement about the changes made in the commit. Generally, the message should complete the sentence “If applied, this commit will...”.

Let's try it with our `map.py` script. This is the only file in the staging area, so the message we include in our commit can pertain only to it. If we had made changes to multiple files, for example changed parts of another script, added a new data file, or added lines to the data documentation text file, and we had added those to the staging area as well, we would be writing a commit that summarises the changes to ALL of those files. You can add all the changed files to the staging area at once by using `git add .`, but since we only changed one file we called it by name. Now, we're ready to make the commit by running:

`$ git commit -m "<my message here>"`

Choose a good commit message here - maybe something like ______. Remember, keep it short and sweet, but also memorable.

When we hit enter, we have successfully stored our first commit to our local git repository! To be sure, let's run:

`$ git status`

We should see a message that we are on branch "master", there is nothing to commit, and the working tree is clean. Nice!

Another way to verify that our commit was saved is by using the command `git log`, which shows us the project's history of commits. `git log` lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the `git commit` command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created.
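As a rough illustration of how a commit shows up in the history, the sketch below makes one commit in a throwaway repository and prints the log in two forms. The identity, file contents, and commit message are placeholders, not the ones used in the workshop.

```shell
# Throwaway repository with a placeholder identity.
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet
git config user.name "Ada Example"
git config user.email "ada@example.org"

# Stage and commit one file.
echo 'print("hello")' > map.py
git add map.py
git commit --quiet -m "Add map script"

# One line per commit: short identifier followed by the message.
git log --oneline

# The full listing: identifier, author, date, and message, newest first.
git log
```

`git log --oneline` is convenient once a project has many commits, because each one takes up a single line.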

If we run `ls` in our project folder at this point, we will still see ____. That’s because Git saves information about files’ history in the special .git directory mentioned earlier so that our filesystem doesn’t become cluttered (and so that we can’t accidentally edit or delete an old version).

## Recap

To recap, the workflow for adding and committing changes to files using Git is:

1. `$ git status` to see the files that have been changed and are awaiting commit
2. `$ git add <file_name>` to add one file at a time to the staging area or `$ git add .` to add all files that have been changed since the last commit
3. `$ git commit -m "<descriptive message>"` to commit the latest version of file(s) and describe the nature of the changes made



# Using `.gitignore`

What if we have files that we do not want Git to track for us, like backup files created by our editor or intermediate files created during data analysis? Putting these files under version control would be a waste of disk space. What’s worse, having them all listed could distract us from changes that actually matter, so let’s tell Git to ignore them.

We do this by creating a file called `.gitignore` in the root directory of our project (in this case, the `world_cities_workshop/` folder). We can use either of two shell commands, `touch` or `nano`, to create the file. If we use `nano`, the file will be created and opened with the nano editor in one step, but since it's empty and we still need to decide what to list in there, let's use `touch` to simply create the file. Make sure you are in the `world_cities_workshop/` folder (not any subfolder). You can check where you are with `pwd`; if you are somewhere else, move there with:

`$ cd ~/Desktop/world_cities_workshop`

And then run the command:

`$ touch .gitignore`

If you run `ls -a` now, you should see a new hidden file called `.gitignore` in your `world_cities_workshop/` folder. 

We can edit this file with nano and add the names of files or directories that we specifically want git to ignore and NOT track changes to. We don't need git to track changes to the `netherlands-cities.csv` file in the `world_cities_workshop/data/` folder, so let's tell git to ignore it. Open up the .gitignore file by running:

`$ nano .gitignore`

You should see a blank text document in your terminal window. We are going to add to it the name of the file we want to ignore, and we also need to specify the folder that it is in. In nano, type:

`data/netherlands-cities.csv`

Then hit control + O to write out, enter to confirm, and control + X to close nano and return to your shell.

If you now run `cat .gitignore` you should see the line you just wrote as the contents of this file. You can also use the .gitignore file to ignore whole folders, for example if you are working with large amounts of data, image files, or other files that are too large, unchanging, or otherwise not useful to track with Git. You can also ignore all files of a certain type using the same wildcard pattern matching we used in the shell to select the files ending in `-nl.csv` (`*-nl.csv`): for example, the pattern `*.jpg` ignores all `.jpg` files.
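The three kinds of rule mentioned above (a single file, a whole folder, a file type) can be tried out safely in a throwaway folder. In this sketch the `figures/` folder and `draft.jpg` are hypothetical names chosen for illustration; `git check-ignore -v` reports which rule, if any, matches a path.

```shell
# Throwaway repository with a few empty files to test against.
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet
mkdir data figures
touch data/netherlands-cities.csv figures/draft.jpg map.py

# One rule per line: a single file, a whole folder, and a file type.
printf '%s\n' 'data/netherlands-cities.csv' 'figures/' '*.jpg' > .gitignore

# Show which rule matches each ignored path; map.py matches none,
# so it is left out of the output.
git check-ignore -v data/netherlands-cities.csv figures/draft.jpg map.py

# Ignored files no longer appear as untracked.
git status --short
```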

Since we created the .gitignore file after our last commit, Git sees it as a new untracked file. Git applies the rules in it as soon as the file exists, but we should still add and commit it so that the rules are recorded in the project's history and shared with anyone else who works on the repository. We can do this the same way we added and committed changes to any other file:

`$ git add .gitignore`

`$ git commit -m "added .gitignore file and ignored netherlands-cities.csv file"`

If you run `git status` now, you should see the same message that you are on branch "master", there is nothing to commit, and the working tree is clean.



# Remote collaboration on GitHub

## Relationship between local and remote repositories

Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.

Systems like Git allow us to move work between any two repositories. In practice, though, it’s easiest to use one copy as a central hub, and to keep it on the web rather than on someone’s laptop. Most programmers use hosting services like GitHub, Bitbucket or GitLab to hold those master copies, which are sometimes referred to as a "single source of truth".

Let’s start by sharing the changes we’ve made to our current project with the world. Log in to GitHub, then click on the plus icon in the top right corner to create a new repository. Name your repository "map-<yourcountryname>" and then click “Create Repository”.

*Note:* Since this repository will be connected to a local repository, it needs to be empty. Leave “Initialize this repository with a README” unchecked, and keep “None” as options for both “Add .gitignore” and “Add a license.” 

## Pushing your local repository to the remote on GitHub

As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository. The remote repository appears empty for now because it doesn't have any files yet - we have to add those by connecting the remote repository to the local one - remember, this is the `world_cities_workshop/` directory on your computer. 

To do that, GitHub already suggests some helpful commands under the heading **"Push an existing repository from the command line"**. You should see some commands that are unique to the remote repository you just created on GitHub, which allow you to do this connection process to your local folder. In your terminal, make sure you are in your `world_cities_workshop/` directory and then run the following command, making sure to replace <url of your project> with (you guessed it) the actual url of your project shown on GitHub*:

`git remote add origin <url of your project>.git`

This first command, `git remote add origin`, does the connecting step between your local repository and the remote on GitHub; `origin` is the conventional nickname for that remote. The name and email we set in the global configuration are recorded in every commit, which is how your commits are credited to you on GitHub. They do not log you in, so GitHub may still ask you to authenticate when you push.

The next step uses the git command `push` to copy the commits in your local repository to the remote on GitHub. We have only one branch in this project ("master"), and we push it to `origin` (the remote repository).

`git push -u origin master`

This establishes the connection and makes it easy to `push` future local changes made in your local repository to the remote version on GitHub, and also allows you to `pull` changes that are made by you or someone else on GitHub to your local machine. The `-u` option used with `git push` associates the current branch with a remote branch so that the commands `git push` and `git pull` can be used without any arguments to move changes between the local and remote repositories. Think of it as a two-way street. :)
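If you would like to rehearse the connect-and-push steps without touching GitHub, a bare repository in a local folder can play the role of the empty remote. This is only a sketch under that assumption: the folder names `hub.git` and `project`, the identity, and the commit message are all placeholders, and the branch name is read from Git rather than assumed to be "master".

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A bare repository in a local folder stands in for the empty GitHub repository.
git init --quiet --bare hub.git

# A local project with one commit (placeholder identity and content).
git init --quiet project
cd project
git config user.name "Ada Example"
git config user.email "ada@example.org"
echo 'print("hello")' > map.py
git add map.py
git commit --quiet -m "Add map script"

# Connect the local repository to the remote under the nickname "origin".
git remote add origin "$scratch/hub.git"
git remote -v

# Push the current branch and remember the link with -u.
branch=$(git symbolic-ref --short HEAD)
git push -u origin "$branch"

# The first status line now names the remote branch being tracked.
git status --short --branch
```

With GitHub the only difference is the URL given to `git remote add origin`, plus the authentication step that GitHub requires.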

*Note: If you have navigated away from the page that displayed immediately after creating the remote repository on GitHub, you can always get back to the url of your project to set up this step by clicking on the green "Code" button near the top right of your screen on the main repository page. When you click there, choose "HTTPS" and copy the contents of the box below. You can use that directly in your commands with git, including when you want to clone someone else's repository (more on that later).

## The difference between `push` and `commit`

When we push changes, we’re interacting with a remote repository to update it with the changes we’ve made locally (often this corresponds to sharing the changes we’ve made with others). Commit only updates your local repository.
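One way to see the difference is to count the commits that exist locally but have not reached the remote yet. The sketch below again uses a local bare repository as a stand-in for GitHub; all names, the identity, and the messages are placeholders.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet --bare hub.git      # local stand-in for the GitHub remote
git init --quiet project
cd project
git config user.name "Ada Example"
git config user.email "ada@example.org"

# First commit, pushed to the remote.
echo 'print("hello")' > map.py
git add map.py
git commit --quiet -m "Add map script"
branch=$(git symbolic-ref --short HEAD)
git remote add origin "$scratch/hub.git"
git push --quiet -u origin "$branch"

# A second commit is recorded in the local repository only.
echo '# draw the map' >> map.py
git commit --quiet -am "Add comment to map script"

# Commits the remote has not received yet.
git rev-list --count "origin/$branch..HEAD"   # prints 1

# After pushing, local and remote agree again.
git push --quiet
git rev-list --count "origin/$branch..HEAD"   # prints 0
```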


# Working with others on GitHub

## Cloning a repository from GitHub

We set up a repository they can clone, then I make a change. They run git pull and update their local copy. Don't need to be members to clone. Important to set privacy settings if you don't want the world to contribute :) 

## Creating a pull request
Demonstrate but don't actually do it.


# A basic collaborative workflow

In practice, it is good to be sure that you have an up-to-date version of the repository you are collaborating on, so you should `git pull` before making your changes. The basic collaborative workflow would be:

1. update your local repo with `git pull origin master`
2. make your changes and stage them with `git add`
3. commit your changes with `git commit -m`, and
4. upload the changes to GitHub with `git push origin master`
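The four steps above can be rehearsed end to end on one computer. In this sketch a local bare repository (`hub.git`) stands in for GitHub, and two folders (`yours` and `theirs`) stand in for two collaborators' computers; every name, identity, and message is a placeholder chosen for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet --bare hub.git      # local stand-in for the shared GitHub repository

# Your repository, already pushed to the shared remote.
git init --quiet yours
cd yours
git config user.name "Ada Example"
git config user.email "ada@example.org"
echo 'print("hello")' > map.py
git add map.py
git commit --quiet -m "Add map script"
branch=$(git symbolic-ref --short HEAD)
git remote add origin "$scratch/hub.git"
git push --quiet -u origin "$branch"

# A collaborator clones the shared repository, commits a change, and pushes it.
cd "$scratch"
git clone --quiet hub.git theirs
cd theirs
git config user.name "Ben Example"
git config user.email "ben@example.org"
echo '# plot the cities' >> map.py
git commit --quiet -am "Add plotting note"
git push --quiet origin "$branch"

# Back in your copy: 1. pull, 2. edit and stage, 3. commit, 4. push.
cd "$scratch/yours"
git pull --quiet origin "$branch"
echo '# add a legend' >> map.py
git add map.py
git commit --quiet -m "Add legend note"
git push --quiet origin "$branch"

# All three commits, from both people, are now in the shared history.
git log --oneline
```

Pulling first means your new commit is built on top of your collaborator's work, so the final push goes through without any extra merging.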

It is better to make many commits with smaller changes than one commit with massive changes: small commits are easier to read and review.

# Open Science

Free sharing of information might be the ideal in science, but the reality is often more complicated. Normal practice today looks something like this:

* A scientist collects some data and stores it on a machine that is occasionally backed up by their department.
* They then write or modify a few small programs (which also reside on their machine) to analyze that data.
* Once they have some results, they write them up and submit their paper. They might include their data – a growing number of journals require this – but they probably don’t include their code.
* Time passes.
* The journal sends them reviews written anonymously by a handful of other people in their field. They revise their paper to satisfy them, during which time they might also modify the scripts they wrote earlier, and resubmit.
* More time passes.
* The paper is eventually published. It might include a link to an online copy of their data, but the paper itself will be behind a paywall: only people who have personal or institutional access will be able to read it.

For a growing number of scientists, though, the process looks like this:

* The data that the scientist collects is stored in an open access repository like 4TU.ResearchData or Zenodo, possibly as soon as it’s collected, and given its own Digital Object Identifier (DOI). 
* The scientist creates a new repository on GitHub to hold their work.
* As they do their analysis, they push changes to their scripts (and possibly some output files) to that repository. They also use the repository for their paper; that repository is then the hub for collaboration with their colleagues.
* When they're happy with the state of their paper, they post a version to arXiv or some other preprint server to invite feedback from peers.
* Based on that feedback, they may post several revisions before finally submitting their paper to a journal.
* The published paper includes links to their preprint and to their code and data repositories, which makes it much easier for other scientists to use their work as a starting point for their own research.

This open model accelerates discovery: the more open work is, the more widely it is cited and re-used. However, people who want to work this way need to make some decisions about what exactly “open” means and how to do it. This is one of the (many) reasons we teach version control. When used diligently, it answers the “how” question by acting as a shareable electronic lab notebook for computational work:

* The conceptual stages of your work are documented, including who did what and when. Every step is stamped with an identifier (the commit ID) that is for most intents and purposes unique.
* You can tie documentation of rationale, ideas, and other intellectual work directly to the changes that spring from them.
* You can refer to what you used in your research to obtain your computational results in a way that is unique and recoverable.
* With a version control system such as Git, the entire history of the repository is easy to archive in perpetuity.

At TU Delft, researchers are able to store 1TB/year of data for free on the 4TU.ResearchData data repository. You can organise your work into projects, and even connect your GitHub repository to link processing scripts to data stored in 4TU.ResearchData! For more information, contact ____.

Interested in learning more about working reproducibly with code and data? Join the Open Science Community Delft! ____ info.

