<center><img src=img/MScAI_brand.png width=70%></center>

# Version control, `git`, and GitHub


Has this ever happened to you?

```
$ ls
assignment.py
assignment2.py
assignment_from_Tom.py
assignment2_from_Tom.py
assignment2_from_Tom2.py
assignment_final.py
assignment_FINAL_with_Toms_changes_merged_except_spelling.py
assignment_final_5pm_missing_some_changes.py
```

A *version control* system keeps track of multiple versions of a set of files. It lets multiple collaborators work remotely on their own copies, and merges their changes automatically in most cases.

Version control also serves as a *history*, a *backup*, and (if you want) a *public repository*.

### Diff and patch

Version control usually works on plain text files. E.g. a `.txt` file, or a `.py` or `.java` file is plain text. Word `.docx` files are not, but people working on documents (as opposed to program code) might write them in Markdown or LaTeX, which are plain-text formats that convert into `pdf`s or `html`.

A fundamental concept is the *diff*. A diff is a line-by-line list of differences between two plain text files.

**`file_a.py`**

```
x = 3
a = "elephant"
print("hello")
```

**`file_b.py`**

```
x = 3
a = "lion"
print("hello")
```

The *diff* between the two files would look like this:

```
2c2
< a = "elephant"
---
> a = "lion"
```

There is a program `diff` which produces the *diff* seen above. If you're on Linux or macOS, you should be able to run it at the command line as `diff file_a.py file_b.py`. Inside a Jupyter Notebook, shell commands can be run using a `!` prefix, e.g. `!diff file_a.py file_b.py`. On Windows you might have to use Git Bash, which is part of the main `git` download from https://git-scm.com/download/win.
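Python's standard library can produce a similar listing, which is handy if `diff` is not installed. The sketch below uses `difflib.unified_diff` on the two example files, held here as lists of lines rather than read from disk (an assumption made to keep the example self-contained). Note that it prints the *unified* format, which marks removed lines with `-` and added lines with `+`, rather than the `2c2` format shown above.

```python
import difflib

# The two example files, one list entry per line (no trailing newlines).
file_a = ['x = 3', 'a = "elephant"', 'print("hello")']
file_b = ['x = 3', 'a = "lion"', 'print("hello")']

# lineterm="" because our lines carry no newline characters of their own.
diff_lines = list(
    difflib.unified_diff(
        file_a, file_b, fromfile="file_a.py", tofile="file_b.py", lineterm=""
    )
)

print("\n".join(diff_lines))
```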

There is also a program `patch` which can take `file_a.py` and *apply* the diff, to produce `file_b.py`.
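To make the idea concrete, here is a minimal sketch of both halves in Python. The helper names `make_patch` and `apply_patch` are hypothetical, and the patch is stored as a simple list of line-range replacements rather than in the real `patch` file format.

```python
import difflib


def make_patch(old_lines, new_lines):
    """Record every changed region as (start, end, replacement_lines)."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return [
        (i1, i2, new_lines[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_patch(old_lines, patch):
    """Apply the recorded changes to a copy of old_lines."""
    result = list(old_lines)
    # Work from the end backwards so earlier indices stay valid.
    for start, end, replacement in reversed(patch):
        result[start:end] = replacement
    return result


file_a = ['x = 3', 'a = "elephant"', 'print("hello")']
file_b = ['x = 3', 'a = "lion"', 'print("hello")']

patch = make_patch(file_a, file_b)
rebuilt = apply_patch(file_a, patch)
print(rebuilt == file_b)
```

The patch is usually much smaller than the file itself, which is why it is a convenient thing to send to a collaborator.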

This is useful, because it allows two people to start on a common file, work *independently*, and then merge their changes whenever they want. A version control system uses the ideas behind `diff` and `patch` internally, so in practice we rarely need to run them by hand.
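A rough picture of what "merging independent changes" means is given by the sketch below. It is a deliberately simplified three-way merge: it assumes both people only edited lines in place (so all three versions have the same number of lines), which real version control systems do not require. The function name `merge_lines` is hypothetical. Lines that both people changed in different ways cannot be merged automatically, so the sketch reports their line numbers and keeps my version there.

```python
def merge_lines(base, mine, yours):
    """Merge two edited copies of base, line by line.

    Returns (merged_lines, conflict_line_numbers).
    """
    if not (len(base) == len(mine) == len(yours)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for number, (b, m, y) in enumerate(zip(base, mine, yours), start=1):
        if m == y:      # untouched, or changed identically by both
            merged.append(m)
        elif m == b:    # only the other person changed this line
            merged.append(y)
        elif y == b:    # only I changed this line
            merged.append(m)
        else:           # both changed it differently: needs a human
            conflicts.append(number)
            merged.append(m)
    return merged, conflicts


base = ['x = 3', 'a = "lion"', 'print("hello")']
mine = ['x = 3', 'a = "elephant"', 'print("hello")']   # I changed line 2
yours = ['x = 4', 'a = "lion"', 'print("hello")']      # you changed line 1

merged, conflicts = merge_lines(base, mine, yours)
print(merged)
print(conflicts)
```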

In fact, `diff` and `patch` together give rise to an *algebra* of versions -- the theory that underlies version control.

<center><img src="img/diff-merge.png" width=60%></center>

### Common version control systems

There are several version control systems in common use, including
* Subversion
* Git
* Mercurial
* BitKeeper (formerly commercial, now open source)
* Perforce (\$)

Several websites offer free hosted version control. E.g. for Subversion there is https://riouxsvn.com/. For git there is http://www.github.com/. We will concentrate on git and GitHub. Remember, git is the name of the version control system, and GitHub is a company which provides a nice website with free (and pay-for) git hosting. A lot of people working in software and analytics use their GitHub account as a CV.

Sadly, git and GitHub are really complex. Happily, we can avoid most of the complexity. We just need to know a few simple things:

* How to create a new repository on GitHub
* How to clone the repository from GitHub to our disk
* How to add a new file to our local copy
* How to push from our local copy to GitHub
* A little about branches, merges, and merge conflicts.



Next, we'll carry out these basic tasks in a live GitHub repo. Don't worry, you can delete the repo afterwards.

Before proceeding, make an account on GitHub, and log in to it.

### Creating a new repository


<center><img src=img/github-new-repo.png width=60%></center>

* Click "New", then enter a name (no spaces or weird punctuation), e.g. "test", and a description, e.g. "My first test repository"
* Then choose "public" and tick "Initialize this repository with a README".

Now the repository will be created, and a `README.md` file will be created inside it. You can look at the list of files in the repository and get the clone URL of the repository (needed for the next step). You can also download the entire thing as a zip, but we won't normally proceed that way.

<center><img src="img/github-clone.png" width=50%></center>

### Cloning

To get the new repository onto your local disk, you *clone*. Type the following at your command prompt. Obviously, put in the appropriate clone URL for your new repository in place of the one I have mentioned.

```bash
git clone https://github.com/jmmcd/ML-snippets.git
```

You'll get a new directory, named after the repository, inside your current directory. It contains one file: `README.md`. Open it up in a text editor and have a look.

### Adding a new file


Now, let's write a new Python program, say `test.py`, and save it in the same directory as `README.md`. We have to tell Git that it exists, and *commit* it.

```bash
git add test.py
git commit -m "Wrote a simple test program."
```

### Please tell me who you are

If we see the message `Please tell me who you are`, it's because git needs to associate every commit with the person who made it. It helpfully tells us what to do:

```
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
```

### What does *commit* mean?


When you commit, you're saying the current version of the code is in a consistent state (i.e. no half-finished changes). It's not necessarily *complete* or perfect. Usually, you commit with messages like these:

* `git commit -m "Fixed a bug in calculation of y."`
* `git commit -m "Added a new function to print stats."`
* `git commit -m "Expanded documentation."`

In order to commit, you always `add` first, to tell `git` which files you want to commit.

### Pushing


So far, we've added a new file and committed it, but that only affects our local (on-disk) repository. Next we have to *push* to GitHub.

```bash
git push
```

(You will be asked to authenticate. GitHub no longer accepts your account password for this: use a personal access token or an SSH key instead.)

After this finishes, you can reload the web page to see your `test.py` has appeared on GitHub.

### Changing and committing

Previously, we added a new file and then committed. Even if we *edit* an existing file, we still have to run add (it really means "add the file to an upcoming commit", rather than "add to the repository") and then commit. 

So, try adding some text to the README.md, then add, commit with an appropriate message, and push.
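The role of `add` described above can be pictured with a toy model. The class below is only an illustration of the idea of a staging area, not how git is implemented, and the name `ToyRepo` and its methods are hypothetical. Editing a file changes the working copy only; `add` copies the current contents into the upcoming commit; `commit` turns whatever was added into a new snapshot in the history.

```python
class ToyRepo:
    """A toy model of working copy, staging area and history."""

    def __init__(self):
        self.working = {}   # file name -> current contents on disk
        self.staged = {}    # file name -> contents chosen for the next commit
        self.history = []   # list of (message, snapshot), oldest first

    def edit(self, name, contents):
        self.working[name] = contents

    def add(self, name):
        # Stage the file as it is right now; later edits need another add.
        self.staged[name] = self.working[name]

    def commit(self, message):
        if not self.staged:
            raise ValueError("nothing added to commit")
        # A new snapshot is the previous one plus the staged files.
        snapshot = dict(self.history[-1][1]) if self.history else {}
        snapshot.update(self.staged)
        self.history.append((message, snapshot))
        self.staged = {}


repo = ToyRepo()
repo.edit("README.md", "Hello")
repo.edit("test.py", "a = 'lion'")
repo.add("test.py")                          # README.md is edited but not added
repo.commit("Wrote a simple test program.")
print(repo.history[-1])
```

In this run the commit contains `test.py` only, because the edit to `README.md` was never added.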

### Pulling

Suppose your colleague is working in the same repository. To check whether they've committed and pushed any changes, you can run:

```bash
git pull
```

It gets any changes from GitHub and applies them to your local repository. If necessary it uses *merge* so that your colleague's changes and your own are merged together.

### Typical (simple) workflow on a single repo

```bash
git pull # get any changes by others
# edit test.py in text editor
git add test.py # tell git that test.py will be committed
git commit -m "Change to tigers" # commit
git push # push changes to GitHub
```
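To rehearse this cycle without touching GitHub, the sketch below drives `git` from Python. It rests on several assumptions: `git` must be installed (the function returns `None` otherwise), a throwaway bare repository on disk stands in for GitHub, the cycle starts with `git clone` rather than `git pull` because the practice repository is brand new, and a made-up name and email are passed on the command line so the script does not depend on our `git config`. The helper names `git` and `practise_workflow` are hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command in directory cwd and return its standard output."""
    identity = ["-c", "user.name=Practice User",
                "-c", "user.email=practice@example.com"]
    completed = subprocess.run(
        ["git", *identity, *args],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return completed.stdout


def practise_workflow():
    """Clone, edit, add, commit and push against a throwaway local 'remote'."""
    if shutil.which("git") is None:
        return None  # git is not installed, nothing to practise with
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        git("init", "--bare", "remote.git", cwd=tmp)   # stand-in for GitHub
        git("clone", "remote.git", "local", cwd=tmp)   # our copy on disk
        local = tmp / "local"
        (local / "test.py").write_text('a = "tiger"\n')  # the "text editor" step
        git("add", "test.py", cwd=local)
        git("commit", "-m", "Change to tigers", cwd=local)
        git("push", "origin", "HEAD", cwd=local)
        # Ask the stand-in remote which commit messages it now holds.
        return git("log", "--all", "--format=%s", cwd=tmp / "remote.git")


print(practise_workflow())
```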

### Walk-through 1

1. On GitHub, make a new repository by clicking "New".
2. Initialise it with a README.
3. Clone to our hard disk.
4. Make a new file `test.py`.
5. Add, commit, push.
6. See the changes on GitHub.
7. Make a change directly on GitHub.
8. Pull that to hard disk.

### Merge conflicts

If we are working with colleagues in a single repository, we may see *merge conflicts*. A merge conflict arises when two people make incompatible changes to the same lines (e.g. I changed lion to elephant, then `commit`ted and `push`ed, and at the same time you changed lion to antelope and `commit`ted). Your `push` will then be rejected, and the conflict shows up when you `pull` my change.

In a merge conflict, we'll see a message like this from `git`:
```bash
git pull
[...]
CONFLICT (content): Merge conflict in test.py
```

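`git` will also put some special markers in the conflicted file. They separate the two competing chunks of text: the text in our local copy (on disk) and the text coming from the remote repo (GitHub). With the default merge behaviour of `git pull`, our version comes first, between `<<<<<<<` and `=======`, and the remote version follows, between `=======` and `>>>>>>>`: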

```bash
<<<<<<< 
a = "tiger"
=======
a = "elephant"
>>>>>>> 
```

To solve this, we have to edit the file to decide which version is better, remove all the special markers `<<<` `===` and `>>>`, and then save it, `add` and `commit` with an appropriate message.
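The structure of the markers can be made explicit with a short parser. The sketch below assumes the default two-part marker style and the ordering that a plain merge produces (our text first, then the incoming text). The function name `split_conflict` is hypothetical, and the label `incoming` after `>>>>>>>` is a placeholder: git prints its own label there. The function returns the file resolved both ways, so we can compare them before deciding what to keep.

```python
def split_conflict(text):
    """Return (ours, theirs): the file's lines with every conflict
    resolved entirely one way or entirely the other."""
    ours, theirs = [], []
    section = None  # None outside a conflict, else "ours" or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            section = "ours"
        elif line.startswith("=======") and section == "ours":
            section = "theirs"
        elif line.startswith(">>>>>>>") and section == "theirs":
            section = None
        elif section == "ours":
            ours.append(line)
        elif section == "theirs":
            theirs.append(line)
        else:  # ordinary line outside any conflict: keep it on both sides
            ours.append(line)
            theirs.append(line)
    return ours, theirs


conflicted = """x = 3
<<<<<<< HEAD
a = "tiger"
=======
a = "elephant"
>>>>>>> incoming
print("hello")
"""

ours, theirs = split_conflict(conflicted)
print(ours)
print(theirs)
```

Often the right resolution is neither side exactly, which is why editing by hand remains the usual approach.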

### Walk-through 1 (part 2)

9. Make a change directly on GitHub.
10. Make a conflicting change on hard disk.
11. Try to pull and observe problem.
12. Try to commit and pull, still observe problem.
13. Resolve merge conflict by editing the markers.
14. Add, commit, push and see result in GitHub.

### Branching

<center><img src="img/atlassian-branch.svg" width=35%> <font size=1>From atlassian.com</font></center>

A more common workflow uses *branching*. When we are about to start a new work item such as fixing a bug, we tell Git to create a branch for that work item. That branch starts as a copy of the current master branch (repositories created on GitHub nowadays call this branch `main` by default, so substitute that name where needed). We commit our changes to the new branch and push. When the work item is done, we *merge* the branch back into the master branch and delete the work-item branch.

```bash
git checkout -b big_feature # create branch
# work on test.py in text editor
git add test.py
git commit -m "Created fab new function in test.py"
# maybe keep working for a few hours, days or weeks and then
git checkout master # switch to master branch
git merge big_feature # merge the branch onto master
git branch -d big_feature # delete branch 
git push # push to github
```
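One way to picture what these commands do is to treat a branch as nothing more than a name pointing at its latest commit. The toy model below follows that picture; it is an illustration rather than git's real data structures, and the class name `ToyHistory` and its methods are hypothetical. It only handles the simple case from the commands above, where master has not changed while we worked on the branch, so merging just moves the master pointer forward.

```python
class ToyHistory:
    """Toy model: a branch is a name pointing at its latest commit."""

    def __init__(self):
        self.commits = {}                 # commit id -> (parent id, message)
        self.branches = {"master": None}  # branch name -> latest commit id
        self.current = "master"

    def commit(self, message):
        commit_id = len(self.commits) + 1
        self.commits[commit_id] = (self.branches[self.current], message)
        self.branches[self.current] = commit_id

    def checkout(self, name, create=False):
        if create:  # like `git checkout -b`: start where we are now
            self.branches[name] = self.branches[self.current]
        self.current = name

    def contains(self, tip, commit_id):
        """True if commit_id is tip itself or one of its ancestors."""
        while tip is not None:
            if tip == commit_id:
                return True
            tip = self.commits[tip][0]
        return commit_id is None

    def merge(self, name):
        ours, other = self.branches[self.current], self.branches[name]
        if self.contains(ours, other):
            return  # nothing new on the other branch
        if not self.contains(other, ours):
            raise ValueError("branches have diverged; this sketch cannot merge them")
        self.branches[self.current] = other  # move our pointer forward

    def delete_branch(self, name):
        del self.branches[name]


history = ToyHistory()
history.commit("Initial commit")
history.checkout("big_feature", create=True)
history.commit("Created fab new function in test.py")
history.checkout("master")
history.merge("big_feature")
history.delete_branch("big_feature")
print(history.branches)
```

Deleting the branch removes only the name: the commit made on it is still reachable from master.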

### Pull requests

A further complication is that in many collaborative situations there is one person or organisation which controls the repository, but others want to make edits. The way to do this is: we *fork* the person's repository on GitHub, so that there is a new repo of the same name under our GitHub account:

<center><img src="img/github-fork.png" width=30%></center>

Then we clone our fork to our local computer, create a branch, work on that branch until the work is done, and push it back to our copy of the repo on GitHub.

Then we create a *pull request*. The easiest way to do this is by going to the organisation's repository on GitHub and clicking "New pull request". There are then some buttons to click, basically telling the organisation which of our branches we want them to merge changes from.

<center><img src="img/github_pr.png" width=30%></center>

The organisation may have a review policy, its own code and commenting style, etc., which we should follow.

### Git complexity

There is a lot more going on in Git. What we have covered is enough to get started. There are many guides which you could consider, e.g.:

* https://www.atlassian.com/git/tutorials
* https://rogerdudler.github.io/git-guide/
* https://guides.github.com/introduction/flow/

If you see something out of the ordinary, Stack Overflow is a good place to look it up.

### Walk-through 2

We'll walk through these steps using `networkx`:

1. Go to the repo page on GitHub, e.g. `networkx/networkx`.
2. On GitHub, fork it to your account e.g. `jmmcd/networkx`.
3. Clone that to your local machine using `git clone`.
4. Go into the directory and make a branch using `git checkout -b`.
5. Make some changes on that branch, add, commit.
6. Merge that branch to our master and delete the branch.
7. Push to our GitHub.
8. Make a pull request. (DON'T ACTUALLY DO THIS: it would send our test changes to the real `networkx` project.)