# Code management and collaboration

In this demo, we will learn about using **version control** to collaborate on programming projects.

```{image} images/version_control.png
:width: 400px
:align: center
```

We've all been in this situation before: it seems unnecessary to have multiple nearly-identical versions of the same document. But we do this because we're afraid we will delete something that may turn out to be useful. 

It becomes even more complex when writing collaboratively with traditional word processors (e.g. Microsoft Word). One option would be to require every collaborator to work on the document **sequentially**, which slows down the process of writing.

Alternatively, we could receive comments from all collaborators at the same time (i.e. parallel) and **manually** merge their edits into our document. But this is time-consuming, prone to errors, and makes it challenging to resolve **conflicting edits**. 

The **track changes** feature in modern word processors can be useful for highlighting changes and simplifying the merging process. But, once we have accepted their changes, we no longer know who suggested the change, why it was suggested, or when it was merged into the rest of the document.

## Version control

* **Version control systems** (VCS) start with a **base version** of the document and then **keep track of the changes** we make each step of the way


* VCS does not care about file names. Instead, it records **who, what, when, and why** changes were made to files


* VCS allows us to undo changes at any time by **reverting** to a previous version of the files


* A central location where we manage our files using VCS is called a **repository**


```{image} images/local-vcs.png
:width: 600px
:align: center
```

## Git

* One of the most popular VCS tools in use today is called `git`, which was developed in 2005 by Linus Torvalds


* It is a command-line interface (CLI) tool that is installed **locally** 


* It is free and open-source software

```{image} images/git.svg
:width: 600px
:align: center
```

## GitHub

* **GitHub** is a web-based hosting service for `git` repositories, owned by Microsoft

```{image} images/github.png
:width: 200px
:align: center
```

* Provides remote backup for our **repository** (i.e. in the cloud), a graphical user interface (GUI), and many other features that encourage sharing and collaboration


* There are other web-based hosting services (e.g. **GitLab** and **Bitbucket**)


* We can use git *and* GitHub together as a form of **distributed version control (DVCS)**, where a full mirror of our repository (including its history) is stored in multiple places and our collaborators all have access to the central repository.

```{image} images/dvcs.png
:width: 600px
:align: center
```

## Single user workflow

Install git using the following [instructions](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).

Open a terminal and set your user name and email address.

```
git config --global user.name "Johnny Ryan"
git config --global user.email jonathan.ryan@duke.edu
```

Set default branch name to `main`.

```
git config --global init.defaultBranch main
```


### Initialize a repository

Make a new folder with a text file.

```
mkdir my_project
cd my_project
echo "This is the content of my file." > file.txt
```

Initialize `git` in the new folder.

```
git init
```

```{image} images/git-dot.png
:width: 400px
:align: center
```

At this point our repository contains all the necessary files but no version control is happening yet. We can confirm this by checking the **status** of our files. 

```
git status
```

```{image} images/git-status.png
:width: 400px
:align: center
```

We use a two-step process to start tracking changes to our file. We first define the files we want to track using `git add`.

```
git add .
```

```{note}
The `.` tells `git` to stage **all changes to all files** in the current folder and its subfolders (here, the whole repository).
```

```{image} images/git-status2.png
:width: 400px
:align: center
```

We then use `git commit` to begin tracking our files. 

```
git commit -m "initial commit"
```

```{note}
The `-m` option lets us attach a message that briefly describes what the changes were.
```

```{image} images/git-commit.png
:width: 400px
:align: center
```

```{note}
`nothing to commit, working tree clean` means that none of our tracked files are **modified**. 
```
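The steps so far can be replayed in a throwaway folder. This is a minimal sketch: the temporary folder and the `Example User` identity are placeholders (the identity is stored for this repository only, so it does not touch the global settings above), and `git status --short` is used as a compact alternative to the full `git status` output shown in the screenshots.

```shell
# Work in a throwaway folder so nothing else on disk is affected.
repo="$(mktemp -d)"
cd "$repo"
echo "This is the content of my file." > file.txt

git init -q
git config user.name "Example User"        # placeholder identity, this repo only
git config user.email "user@example.com"

git status --short    # prints "?? file.txt": the file is untracked
git add .
git status --short    # prints "A  file.txt": the file is staged
git commit -q -m "initial commit"
git status --short    # prints nothing: the working tree is clean
```

The two-letter codes are a shorthand for the same information that `git status` reports in full sentences.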

### Modify repository

Let's add another file to our project to see what happens.

```
echo "# My Project" > README.md
```

We will also make some changes to `file.txt` so that it now reads `This is the content of my file with edits.` Now when we run `git status` we see two files: one that is **modified** and one that is **untracked**.

```{image} images/git-modify.png
:width: 400px
:align: center
```

**Both** these files can be staged using the same `git add .` command, which stages **all additions, modifications, and deletions** of files. We can therefore use the same workflow every time.

```{image} images/git-modify2.png
:width: 400px
:align: center
```

We can then use `git commit` to track them. 

```{image} images/git-commit.png
:width: 400px
:align: center
```

We can inspect the history of our repository using `git log`.

```{image} images/git-log.png
:width: 400px
:align: center
```

This command lists the commits in reverse chronological order (newest commits first) and includes the author's name, email, date written, and message. 

The commit is named using a type of **checksum** called a SHA-1 hash which is a 40-character string composed of hexadecimal characters (0–9 and a–f). 

```{tip}
Adding the `-a` option to the `git commit` command makes git automatically stage every file that is already tracked before doing the commit, letting us skip the `git add` step, e.g. `git commit -a -m 'some more changes'`. Untracked (new) files still need `git add`.
```
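As a rough illustration of the `-a` shortcut and of commit names, the sketch below makes two commits in a throwaway repository and then inspects them. The folder and identity are placeholders, and `git log --oneline` is a condensed form of the `git log` output described above.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity, this repo only
git config user.email "user@example.com"

echo "This is the content of my file." > file.txt
git add .
git commit -q -m "initial commit"

# file.txt is already tracked, so -a stages the edit for us.
echo "This is the content of my file with edits." > file.txt
git commit -q -a -m "some more changes"

git log --oneline                    # one line per commit, newest first
commit_id="$(git rev-parse HEAD)"    # full name of the latest commit
echo "${#commit_id}"                 # prints 40 in a default (SHA-1) repository
```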

### Knowledge check

Now that we have seen how git works, let's take a second to understand what it is doing "under-the-hood". It is important to note that git is not just storing a list of file-based changes (i.e. **delta-based version control**). Instead, it is recording a **series of snapshots** of our repository. Every time we **commit**, we are taking a picture of the current state of files.

```{image} images/snapshots.png
:width: 700px
:align: center
```

Files in a git repository have **three states**:

* **Modified** means that we have changed the file but have not staged or committed it to the repository.

* **Staged** means that we have marked a modified file in its current version to be committed into the next snapshot.

* **Committed** means that our file is safely recorded and stored.
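The three states can be observed directly with `git status --short`, which prints a two-character code per file: the first column describes the staging area and the second the working folder. A small sketch, using a throwaway repository and a placeholder identity:

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity, this repo only
git config user.email "user@example.com"
echo "This is the content of my file." > file.txt
git add .
git commit -q -m "initial commit"

echo "with edits" >> file.txt
modified="$(git status --short)"     # " M file.txt": changed but not staged
git add file.txt
staged="$(git status --short)"       # "M  file.txt": marked for the next snapshot
git commit -q -m "edit file"
committed="$(git status --short)"    # empty: everything is safely recorded

printf '[%s]\n' "$modified" "$staged" "$committed"
```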

### Going back in time

If we want to go back to a previous commit (just to look around) we can use `git checkout`.

```{image} images/checkout.png
:width: 400px
:align: center
```

From here we can save the old file somewhere else before going back to the `main` branch using `git checkout main`.
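A sketch of that round trip is shown below. It assumes a throwaway project with two commits, and it uses `HEAD~1` (the commit before the latest one) in place of a hash copied from `git log`; the `file_old.txt` name is just an example.

```shell
work="$(mktemp -d)"
git init -q -b main "$work/my_project"
cd "$work/my_project"
git config user.name "Example User"        # placeholder identity, this repo only
git config user.email "user@example.com"

echo "This is the content of my file." > file.txt
git add .
git commit -q -m "initial commit"
echo "This is the content of my file with edits." > file.txt
git commit -q -a -m "edit file"

git checkout -q HEAD~1               # look around the previous commit
cp file.txt "$work/file_old.txt"     # save the old version outside the repository
git checkout -q main                 # return to the latest commit on main

cat "$work/file_old.txt"             # the original text
cat file.txt                         # the edited text
```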

If we made changes to a file that we later decide we don't want, we can **undo** them using `git restore` (provided that we have not yet committed them).

```{image} images/git-restore.png
:width: 400px
:align: center
```

```{image} images/git-restore2.png
:width: 400px
:align: center
```
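The same undo can be tried in a throwaway repository. In this sketch (placeholder folder and identity), an unwanted edit is discarded before it is ever committed:

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity, this repo only
git config user.email "user@example.com"
echo "This is the content of my file." > file.txt
git add .
git commit -q -m "initial commit"

echo "An edit we will regret." >> file.txt
git status --short       # " M file.txt": the file is modified
git restore file.txt     # throw away the uncommitted edit
cat file.txt             # back to the committed text
```

Note that the discarded edit cannot be recovered afterwards, because it was never committed.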

If we want to reset our repository to a previous commit, we can use `git reset`.

```{image} images/git-reset.png
:width: 400px
:align: center
```

```{warning}
The `git reset` command can be very dangerous. When run with the `--hard` option, it completely rewinds our repository to that commit and discards all changes made after it, including any uncommitted work. Be careful with this command.
```
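To see the effect safely, the sketch below rewinds a throwaway repository by one commit. It assumes the `--hard` form of the command, and it uses `HEAD~1` rather than a hash copied from `git log`; the folder and identity are placeholders.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity, this repo only
git config user.email "user@example.com"

echo "This is the content of my file." > file.txt
git add .
git commit -q -m "initial commit"
echo "# My Project" > README.md
git add .
git commit -q -m "add README"

git reset -q --hard HEAD~1    # rewind to the first commit
git log --oneline             # only "initial commit" remains
ls                            # README.md is gone
```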

### Working with remotes

Remote repositories are versions of our project that are hosted somewhere else (usually on GitHub). To set up a remote repository, first go to [GitHub.com](https://github.com/) and click **New**. 

Choose a name for the new repository, **leave everything else blank**, and click **Create Repository**. Then back in Terminal, run:

```
git remote add origin git@github.com:JohnnyRyan1/my_project.git
```

This links our local repository with our remote repository. We can then **upload** our local repository to the remote using `git push`.

```
git push -u origin main
```

```{note}
The first time we use this command we have to include `-u origin main` to define which branch we want to upload to (i.e. the `main` branch of our remote repository called `origin`). The `-u` option also remembers this choice, so later we can simply run `git push`.
```
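The link between a local and a remote repository can be rehearsed without a GitHub account. In this sketch, a local **bare** repository (`remote.git`, a hypothetical stand-in for the empty repository created on GitHub) plays the role of the remote; on GitHub the address would instead look like the `git@github.com:...` one above.

```shell
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"      # stand-in for the empty GitHub repository

git init -q -b main "$work/my_project"
cd "$work/my_project"
git config user.name "Example User"        # placeholder identity, this repo only
git config user.email "user@example.com"
echo "This is the content of my file." > file.txt
git add .
git commit -q -m "initial commit"

git remote add origin "$work/remote.git"   # link local and remote
git remote -v                              # list the configured remotes
git push -q -u origin main                 # upload and remember the destination
git status -sb                             # "## main...origin/main"
```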

```{image} images/empty-repo.png
:width: 700px
:align: center
```

Our remote repository now contains `file.txt`. But since we used `reset` earlier, we lost our `README.md` file. We can make one on GitHub.

```{image} images/remote-commit.png
:width: 700px
:align: center
```

We can run `git fetch` now to download the latest history from our remote repository (without changing our local files) so that we can compare it with our local repository.

```{image} images/git-fetch.png
:width: 400px
:align: center
```

After running `git status`, we find that our local repository is 1 commit behind the remote repository. We can use `git pull` to update our local repository.

```{image} images/git-pull.png
:width: 400px
:align: center
```
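The fetch-then-pull sequence can be sketched locally as well. Here a bare repository stands in for GitHub, and a second clone (`web_editor`, a hypothetical name) stands in for creating `README.md` in the GitHub web interface. The identity is a placeholder set through environment variables so that it applies to both clones.

```shell
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

work="$(mktemp -d)"
git init -q --bare -b main "$work/remote.git"    # stand-in for GitHub
git init -q -b main "$work/my_project"
cd "$work/my_project"
echo "This is the content of my file." > file.txt
git add .
git commit -q -m "initial commit"
git remote add origin "$work/remote.git"
git push -q -u origin main

# Stand-in for adding README.md on GitHub: commit from a second clone.
git clone -q "$work/remote.git" "$work/web_editor"
( cd "$work/web_editor" && echo "# My Project" > README.md \
    && git add . && git commit -q -m "Create README.md" && git push -q origin main )

git fetch -q                                     # download, but change no files
behind="$(git rev-list --count HEAD..origin/main)"
git status -sb                                   # "## main...origin/main [behind 1]"
git pull -q                                      # bring local main up to date
ls                                               # README.md now exists locally
```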

## Centralized workflow for teams

* One team member creates a remote repository. All team members **clone** this central repository to their local machine.

```{image} images/colab.svg
:width: 600px
:align: center
```

* One team member makes changes (e.g. adding, modifying, or deleting files) on their local machine, periodically **committing** these changes (i.e. taking a snapshot)


* When they are finished working, they can **push** their changes back to the central repository 


```{image} images/john_push.svg
:width: 600px
:align: center
```

* But now when **another team member** (who has also been working on the project) tries to **push** their changes, Git will **refuse the request** because their local history has **diverged** from the central repository


```{image} images/mary_push.svg
:width: 600px
:align: center
```

* The team member must first **pull** the most recent changes in the central repository into their local repository

```{image} images/mary_pull.svg
:width: 600px
:align: center
```

* The team member then resolves any conflicts between their local version and the central repository.


* Once finished, the team member can then **commit** and **push** their changes to the central repo

```{image} images/mary_successful_push.svg
:width: 600px
:align: center
```
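The refused push and its fix can be reproduced with two local clones. In this sketch, `central.git` is a bare repository standing in for the central repository, and the clone names `john` and `mary` are illustrative. The two team members edit different files, so the pull merges cleanly without any conflicts to resolve.

```shell
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

work="$(mktemp -d)"
git init -q --bare -b main "$work/central.git"   # the central repository
git init -q -b main "$work/john"
cd "$work/john"
echo "Shared notes" > notes.txt
git add .
git commit -q -m "initial commit"
git remote add origin "$work/central.git"
git push -q -u origin main
git clone -q "$work/central.git" "$work/mary"    # second team member clones

# The first team member commits and pushes.
echo "First analysis" > john.txt
git add .
git commit -q -m "Add first analysis"
git push -q origin main

# The second team member has also been working; their push is refused.
cd "$work/mary"
echo "Second analysis" > mary.txt
git add .
git commit -q -m "Add second analysis"
if git push -q origin main 2>/dev/null; then first_push="accepted"; else first_push="rejected"; fi
echo "First push attempt was $first_push"

# Pull (merging the two histories), then push again.
git pull -q --no-rebase --no-edit origin main
git push -q origin main
```

If both team members had edited the same lines of the same file, the pull would stop and ask for the conflict to be resolved by hand before committing.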


### Advantages

* Simplest workflow


* Works well for small teams


### Disadvantages

* If someone breaks the central repo, it breaks for everyone


* Potential for a lot of conflicts


* One solution is to avoid working on the same files


* But this does not scale well as teams increase in size



### Feature branch workflow

* The logical extension of the centralized workflow is to use **branches**


* In this workflow, all feature development takes place in a dedicated branch instead of the main branch


* This means that the main branch never contains broken code - a huge advantage for continuous integration environments

* All team members **clone** a **single, central repository** to their local machine

```{image} images/colab.svg
:width: 600px
:align: center
```

* Team members immediately create a new branch to make their changes

```{image} images/big_branch.svg
:width: 600px
:align: center
```

* When team members finish their changes, they **push** their branch to the central repository. The central repository will now contain multiple branches.  


* Therefore, unlike the centralized workflow, this **push** will never cause conflicts


```{image} images/mary_successful_push.svg
:width: 600px
:align: center
```

* Team members then submit a **pull request** on GitHub.com asking to **merge** their new feature (or branch) into the main codebase. All team members will be notified automatically


```{image} images/git-pull-request.png
:width: 300px
:align: center
```


* Team leader **reviews** pull request, discusses any changes with team members


* Once everything looks good, team leader merges new feature into main codebase


* Team member can then delete their branch

```{image} images/merge.svg
:width: 600px
:align: center
```


```{image} images/pull-request.png
:width: 800px
:align: center
```


```{image} images/pr-changes.png
:width: 1000px
:align: center
```

```{image} images/create-pull-request.png
:width: 1000px
:align: center
```

```{image} images/github-diff-file.png
:width: 800px
:align: center
```
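The command-line side of the feature branch workflow might look like the sketch below. The pull request itself happens on GitHub.com and cannot be reproduced in a terminal, so a local `git merge` is used here as a stand-in for the team leader merging it. The branch name `add-readme`, the bare `central.git` repository, and the identity are all illustrative.

```shell
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

work="$(mktemp -d)"
git init -q --bare -b main "$work/central.git"   # the central repository
git init -q -b main "$work/my_project"
cd "$work/my_project"
echo "This is the content of my file." > file.txt
git add .
git commit -q -m "initial commit"
git remote add origin "$work/central.git"
git push -q -u origin main

# Create a dedicated branch, do the work, and push the branch.
git checkout -q -b add-readme
echo "# My Project" > README.md
git add .
git commit -q -m "Add README"
git push -q -u origin add-readme     # main on the central repository is untouched

# (On GitHub: open the pull request, review, discuss.)
# Stand-in for the team leader merging the pull request:
git checkout -q main
git merge -q --no-ff -m "Merge branch 'add-readme'" add-readme
git push -q origin main

# Once merged, the branch can be deleted locally and remotely.
git branch -q -d add-readme
git push -q origin --delete add-readme
git branch -a                        # only main and origin/main remain
```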


### Advantages of feature branch workflow

* Promotes collaboration with team members through **pull requests** and **merge reviews**


* Team members can work in parallel on the same files, so it is a good approach for larger teams


* Main branch never contains broken code 


* Guiding framework for other, more complex workflows

### Forking workflow

* Instead of using a single, central repository, forking workflows give every team member their **own central repository**



* Team members can tinker with their forked repository as they wish without disturbing anyone else



* When ready they can **push** to their private central repository and file **pull requests** if they think their changes are ready to be integrated to main codebase


```{image} images/multiple_repos.svg
:width: 600px
:align: center
```


* Provides a little more **power** to the team leader because they are the only person that can push to the official repository



* Allows the team leader to **accept/reject commits** from any developer without giving them write access to the main codebase
 


* Often used for large open-source projects
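A rough local sketch of a forking setup follows. Two bare repositories stand in for the official repository and a team member's fork (on GitHub the fork would be created with the **Fork** button), and the remote name `upstream` for the official repository is a common convention rather than a requirement. All paths and the identity are placeholders.

```shell
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

work="$(mktemp -d)"
git init -q --bare -b main "$work/official.git"   # the official repository
git init -q -b main "$work/leader"
( cd "$work/leader" && echo "Official code" > code.txt && git add . \
    && git commit -q -m "initial commit" \
    && git remote add origin "$work/official.git" && git push -q -u origin main )

# Stand-in for clicking Fork on GitHub: a personal server-side copy.
git clone -q --bare "$work/official.git" "$work/fork.git"

# The team member clones their fork and also links the official repository.
git clone -q "$work/fork.git" "$work/my_fork"
cd "$work/my_fork"
git remote add upstream "$work/official.git"

echo "Tinkering" > experiment.txt
git add .
git commit -q -m "Try an experiment"
git push -q origin main                 # goes to the fork only

git fetch -q upstream                   # check what the official repository has
ahead="$(git rev-list --count upstream/main..main)"
echo "$ahead"                           # prints 1: one commit not yet in the official repository
```

From here, the team member would file a pull request on GitHub asking the team leader to bring that commit into the official repository.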

## Good practices

### Agree on a workflow


* It is important that teams establish shared patterns of collaboration


* If a team doesn't agree on a shared workflow it can lead to inefficient communication when it comes time to merge branches


### Commit often


* Commits are **easy to make** and provide opportunities to **revert** or **undo** work


* They should be made **frequently** to capture updates to a code base


### Ensure you're working from latest version


* VCS enables rapid updates from multiple developers


* It's easy to have a local copy of the codebase fall behind the global copy


* Make sure to `git pull` or `git fetch` the latest code before you start working on the project


### Make detailed notes

* It is important to leave descriptive, explanatory commit log messages. These messages should explain the "why" and "what" of the commit's content.


* These log messages become the canonical history of the project's development and leave a trail for future contributors to review.
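One way to write a more detailed message from the command line is to pass `-m` twice: the first becomes a short summary line and the second becomes a separate paragraph with the explanation. The sketch below uses a throwaway repository, a placeholder identity, and made-up message text.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity, this repo only
git config user.email "user@example.com"
echo "This is the content of my file." > file.txt
git add .

# First -m: what changed. Second -m: why it changed.
git commit -q -m "Add sample text file" \
              -m "Later steps of the demo need a tracked file to modify."

git log -1 --format=%s    # prints the summary line
git log -1 --format=%b    # prints the explanatory paragraph
```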


### Use branches


* Branches enable multiple developers to work in parallel on **separate lines** of development


* Branches should be used **frequently** as they are quick and inexpensive. 


* When development on a branch is complete it should be **merged** into the main line of development and then **deleted**

There are two ways to use `git`: the command line and **GitHub Desktop**. Most students prefer to use the desktop version to begin with, but we'd be happy to provide guidance on the command-line version during labs.


## GitHub Desktop

We will talk a bit more about `git` later in the lecture but, to continue setting up our project, go ahead and install [**GitHub Desktop**](https://desktop.github.com/). 

* Open **GitHub Desktop** &rarr; **Add an Existing Repository from your Hard Drive...**
* Select **Choose...** and navigate to your project folder
* Respond to this warning by clicking **create a repository**
* Leave everything as is and **Create Repository**

Now we can **Publish repository** on GitHub.com by clicking the big blue button. If we signed up for an educational GitHub account, we should be able to tick the box to **Keep this code private** and click **Publish Repository** again.


## GitHub.com

If we navigate to `github.com` in a web browser, sign in, and navigate to our profile, there should be a new **repository** that contains our files (just a `requirements.txt` file and a `.venv` folder for now).


## Basic usage

In line with our "learn by doing" mantra, we will demonstrate the basics of version control with GitHub using a demo. 

### Add a new file

* Make a new text file called `README.md` using **Notepad** on Windows or **TextEdit** on macOS, add some random text, and save.
* In **GitHub Desktop** we will see **1 changed file**. 
* Type in **"Added README"** in the **Summary** box in the lower left &rarr; **Commit to main**.

If we navigate to `github.com`, we will see this new file. 

### Make some changes

Now make some changes to the text in the `README.md` file, save, **and close**. Again, we will see **1 changed file** in GitHub Desktop. 

### Undo the changes

* Type in **"Changed README"** in the **Summary** box in the lower left and **Commit to main** again.
* Click the **History** tab (next to **Changes**)
* Right-click the **Changed README** commit and click **Revert Changes in Commit...**

If we navigate to the `README.md` file, we will find that the changes we made have been undone. We have successfully used `git` in a practical way!
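For those using the command line, a similar undo can be sketched with `git revert`, on the assumption that this is the closest equivalent of the **Revert Changes in Commit...** menu item. Rather than erasing history, it adds a new commit that reverses an earlier one. The folder, identity, and file text below are placeholders.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity, this repo only
git config user.email "user@example.com"

echo "Some random text" > README.md
git add .
git commit -q -m "Added README"
echo "Some changes" >> README.md
git commit -q -a -m "Changed README"

git revert --no-edit HEAD    # new commit that undoes "Changed README"
cat README.md                # only "Some random text" remains
git log --oneline            # three commits: the history is kept
```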

## Why do we use version control systems?

### Security


* VCS acts like an unlimited **'undo'** thereby **protecting source code** from yourself **and** others 
    
    
* e.g. catastrophe, human error, and unintended consequences

```{image} images/hero.svg
:width: 600px
:align: center
```

### Collaboration

* VCS enables **many people** to work on the same project at the same time


* Teams working in parallel accelerates project development


```{image} images/colab.svg
:width: 500px
:align: center
```

### Community

* Very hard for a junior developer to permanently mess up a big project, because earlier versions can be restored


* Since it is so robust this encourages open-source **experimentation** and **development**


* `GitHub` has really emerged as the industry standard


```{image} images/pull.svg
:width: 500px
:align: center
```

## Challenges of version control

* Difficult to learn

```{image} images/meme.jpeg
:width: 600px
:align: center
```

## Acknowledgments

[Pro Git](https://git-scm.com/book/en/v2)

[Earth Data Analytics Online Certificate: Git and GitHub](https://earthdatascience.org/courses/intro-to-earth-data-science/git-github/)