# Version control and remote collaboration with Git and GitHub

This lesson focuses on understanding and implementing version control with Git to keep track of changes to a code-based project. We'll talk about the utility of version control systems for tracking changes to local projects, and how they can be used to enable remote collaboration and to credit multiple authors on a project hosted in a remote repository on a service like GitHub. This material draws from the [Version Control with Git](https://swcarpentry.github.io/git-novice/) lesson from Software Carpentry. It uses the [Gizmo repository](https://github.com/wmvanvliet/gizmo) from wmvanvliet on GitHub as a basis for Python challenges and to illustrate a track-changes workflow.

## Setup requirements

For Windows users, Git should be installed already on your computer as part of your Bash install from Day 1. 

Mac users need to install Git for Mac by downloading and running the most recent "mavericks" installer from [this list on Sourceforge.net](https://sourceforge.net/projects/git-osx-installer/files/). Because this installer is not signed by the developer, you may have to right-click (or Control-click) on the .pkg file, click Open, and then click Open again in the pop-up window. After installing Git, there will not be anything in your /Applications folder, as Git is a command line program. *Participants should install Git before the start of the lesson.*

More detailed instructions and videos are available here: https://carpentries.github.io/workshop-template/#git

## Data and code for this lesson

The material for this lesson is found in the Gizmo Python Challenges repository by wmvanvliet on GitHub. Participants do not need to do anything ahead of the lesson as the first part walks them through forking and cloning a repository on GitHub. The link is here: https://github.com/wmvanvliet/gizmo

## Background

Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn’t just for software code: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.

A version control system is a tool that keeps track of these changes for us, effectively creating different versions of our files. It allows us to decide which changes will be made to the next version (each record of these changes is called a commit), and keeps useful metadata about them. The complete history of commits for a particular project and their metadata make up a repository. Repositories can be kept in sync across different computers, facilitating collaboration among different people. Version control systems start with a base version of the document and then record changes you make each step of the way. You can think of it as a recording of your progress: you can rewind to start at the base document and play back each change you made, eventually arriving at your more recent version.
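
To make the "rewind and play back" idea concrete, here is a minimal sketch in Python. It is a toy model, not how Git actually stores data: the names `base`, `changes`, and `replay` are made up for this illustration.

```python
# Toy model of "base version plus recorded changes" (not Git's real storage format).
base = ["Title"]

# Each recorded change is a (description, function) pair that returns a new version.
changes = [
    ("add introduction", lambda lines: lines + ["Introduction"]),
    ("add methods", lambda lines: lines + ["Methods"]),
    ("rename title", lambda lines: ["Better title"] + lines[1:]),
]


def replay(base, changes, upto=None):
    """Rebuild a version by applying the first `upto` changes to the base."""
    version = list(base)
    for _description, apply_change in changes[:upto]:
        version = apply_change(version)
    return version


print(replay(base, changes, upto=0))  # the base document
print(replay(base, changes, upto=2))  # an intermediate version
print(replay(base, changes))          # the most recent version
```

Because every change is kept, any earlier version can be rebuilt on demand instead of being saved as a separate file.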

## Remote collaboration

Collaborative writing or scripting with traditional word processors and text editors is cumbersome. Either every collaborator has to work on a document sequentially (slowing down the process of writing), or you have to send out a version to all collaborators and manually merge their comments into your document. The ‘track changes’ or ‘record changes’ option can highlight changes for you and simplifies merging, but as soon as you accept changes you will lose their history. You will then no longer know who suggested that change, why it was suggested, or when it was merged into the rest of the document. 

Some tools handle this a little better, such as Microsoft Word’s Track Changes or Google Docs’ version history, but they lack a streamlined way to attach custom messages to changes while keeping a single, latest version of the file for everyone working on it. Without that, we end up passing multiple nearly identical versions of the same document back and forth to create something whole. The result is a lot of files with names like Final_paper_EDIT01.doc or Final_paper_EDITCOMMITTEE03.doc, and it gets messy trying to merge suggestions and changes made by multiple people in multiple documents sent over email. I'm sure we all know the headache!

When using a remote collaboration platform like GitHub, you can incorporate two sets of changes into the same base document, unless multiple users have changed the same section of the document (a conflict).
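
The idea of combining two sets of changes, and of a conflict, can be sketched with a small toy model in Python. This is an assumption-laden simplification: Git compares whole regions of text and is far more sophisticated, and the names `edits_alice`, `edits_bob`, and `merge` are hypothetical.

```python
# Toy model: each collaborator's edits map a line number to its new text.
base = ["Introduction", "Methods", "Results"]
edits_alice = {0: "Introduction (revised)"}
edits_bob = {2: "Results and discussion"}


def merge(base, edits_a, edits_b):
    """Apply both sets of edits, or raise if they change the same line differently."""
    conflicts = [n for n in edits_a if n in edits_b and edits_a[n] != edits_b[n]]
    if conflicts:
        raise ValueError(f"conflict on line(s): {sorted(conflicts)}")
    merged = list(base)
    for edits in (edits_a, edits_b):
        for line_number, new_text in edits.items():
            merged[line_number] = new_text
    return merged


print(merge(base, edits_alice, edits_bob))
```

If both collaborators had rewritten line 0 in different ways, `merge` would stop and report a conflict, which is the point at which a person has to decide which change to keep.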

From [GitHub Guides](https://guides.github.com/activities/hello-world/): "GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere." GitHub hosts code and all files for each project in a **repository**, sometimes shortened to "repo". A repository on GitHub is free to create, you can add contributing members and collaborators with different permission settings as you please, and you can access it with a unique URL.

## Individual work
Teams are not the only ones to benefit from version control: lone researchers can benefit immensely. Keeping a record of what was changed, when, and why is extremely useful for all researchers if they ever need to come back to the project later on (e.g., a year later, when memory has faded).




## Set up Git configuration settings

When we use Git on a new computer for the first time, we need to configure a few things. The commands we will run only need to be run once: the flag `--global` tells Git to use the settings for every project, in your user account, on this computer. Below are configurations we will set as we get started with Git:

* our name and email address,
* what our preferred text editor is,
* and that we want to use these settings globally (i.e. for every project).

Open up your terminal or GitBash command prompt and type:

`$ git config --global user.name "<insert your name here>"`

`$ git config --global user.email "<insert your email here>"`

For these lessons, we will be interacting with GitHub and so the email address used should be the same as the one used when setting up your GitHub account.

We will also set up nano to be the preferred text editor for Git (you can always reconfigure this if you want to change to a different text editor in the future):

`$ git config --global core.editor "nano -w"`

You can check your settings at any time by running:

`$ git config --list`

We will be working with some of the most common Git commands today. If you ever forget how a command works, you can use `git <command> -h` (for example, `git config -h`) for a short list of its options and `git <command> --help` (for example, `git config --help`) to open the detailed Git manual for that command.


## Fork and clone a repository from GitHub

### Why and when do you fork a repository on GitHub?
From [GitHub Guides](https://guides.github.com/activities/forking/#:~:text=After%20using%20GitHub%20by%20yourself,contribute%20to%20someone%20else's%20project.&text=Creating%20a%20%E2%80%9Cfork%E2%80%9D%20is%20producing,repository%20and%20your%20personal%20copy.): "After using GitHub by yourself for a while, you may find yourself wanting to contribute to someone else’s project. Or maybe you’d like to use someone’s project as the starting point for your own. This process is known as forking.

Creating a “fork” is producing a personal copy of someone else’s project. Forks act as a sort of bridge between the original repository and your personal copy. You can submit Pull Requests to help make other people’s projects better by offering your changes up to the original project. Forking is at the core of social coding at GitHub."

### About the Gizmo Python challenges GitHub Repository (https://github.com/aecryan/gizmo)

"This is a Python challenge. Create pull requests (PR's) to this repository to solve it. Upon PR submission, the GitHub action robots will check your code and report back how well you did. You can then add more commits to your PR until all tests come back green, which means you win! (*note:* you can also do all exercises locally and run the test to check if you pass or fail each exercise, without needing to do a pull request. This is how we will do it in this lesson.)

The exercises are meant to test your knowledge of some important features of the Python programming language and the NumPy and Pandas libraries. When it's not immediately obvious to you how to solve an exercise using only a few lines of code, it is likely you can learn a new Python trick by checking the links below the exercise."

### Fork the Gizmo Python challenges repository on GitHub

Make sure all participants are signed into their GitHub accounts, and then send the link to the Gizmo Python challenges repository: https://github.com/aecryan/gizmo. 

Near the top right of the repository page, there is a small gray button that says "Fork". Each participant should click this button to create a fork (a copy of the main repository) in their own account. Changes made to the forked version of the repository do not affect the original author's work, but if you intend to contribute to their repository you can later create what is called a pull request to integrate your changes/suggestions into their repository. We will not be doing that today but rather running all tests locally. There is more information about how to create a pull request in the Gizmo repository documentation if you are interested.

### Clone the forked copy of the Gizmo Python challenges repository to your local computer

In the Gizmo repository on GitHub, locate the green button near the top of the repository page that says "Code". When you click it, you will see a menu that includes a link to clone the repository over HTTPS (and several other options like SSH). We will use HTTPS because it is secure and requires no additional configuration. 

The steps here are:
1. Copy the link under the HTTPS heading.
2. In the terminal/command prompt, change directory to the Desktop: `cd Desktop`.
3. Clone the repository by running `git clone` followed by the link you copied, for example: `git clone https://github.com/<your-username>/gizmo.git` (with `<your-username>` replaced by your GitHub username, so that you clone your own fork).


You will see a new folder called gizmo appear on your desktop. It contains the same contents as the remote repository, but it is now a local copy unique to you. Changes you make here will not be reflected in the remote version until you tell Git to send them to GitHub by "pushing".

### Explore the .git folder and local git repo

The .git folder contains everything Git needs to keep your project under version control: the commits, the remote repository address, and so on. It also contains a log of your commit history so that you can roll back to an earlier version.

If you are in the Desktop, move into the newly created repository folder: `cd gizmo`. The .git folder is hidden by default, so in order to see it we will need to run the `ls` command with the flag `-a`. Run this in your terminal/command prompt: `ls -a`.

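Remember the settings we configured earlier with Git? The global ones (your name and email) are stored in a file called `.gitconfig` in your home folder, while settings that belong to this particular repository, such as the remote URL, are stored in the `config` file inside the .git folder. We can look at the repository's settings by opening that file: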

`cd .git` 

`cat config`

You can also check the remote URL of the repository that this local copy is associated with by running the command:

`git remote -v`

This is the location that all changes will be pushed to (more about this later).
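
For the curious, the repository's `config` file is plain text that a script can read too. Below is a minimal sketch using Python's standard `configparser` module. The file contents and the URL are hypothetical stand-ins for what a freshly cloned repository might contain, and Git's config syntax is only similar to the INI format that `configparser` expects, so treat this as an illustration for simple files rather than a general-purpose reader.

```python
import configparser

# Hypothetical contents of .git/config after cloning; your own file will differ.
sample_config = """\
[core]
\trepositoryformatversion = 0
\tbare = false
[remote "origin"]
\turl = https://github.com/your-username/gizmo.git
\tfetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
\tremote = origin
\tmerge = refs/heads/master
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(sample_config)

# Section names keep the quoted part, for example 'remote "origin"'.
remote_url = parser['remote "origin"']["url"]
print(remote_url)
```

The `url` value is the same address that `git remote -v` reports for `origin`.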

### Relationship between local and remote repositories

Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.

Systems like Git allow us to move work between any two repositories. In practice, though, it’s easiest to use one copy as a central hub, and to keep it on the web rather than on someone’s laptop. Most programmers use hosting services like GitHub, Bitbucket or GitLab to hold those master copies, which are sometimes referred to as a "single source of truth".

If you now change directory back to `gizmo` (move up one level with `cd ..`) and run the `ls` command, you will see two files, README.md and titanic.csv. These are the same files as in the original repository on GitHub, as we have not changed anything yet.


### What is the .gitignore file?
What if we have files that we do not want Git to track for us, like backup files created by our editor or intermediate files created during data analysis? Putting these files under version control would be a waste of disk space. What’s worse, having them all listed could distract us from changes that actually matter, so we tell Git to ignore them via this .gitignore file in the root (main folder) of our repository.

We can edit this file with nano and add the names of files or directories that we specifically want git to ignore and NOT track changes to. Anything that is listed there will not be tracked by Git. 

Use `cat` to view what git is ignoring about this repository:

`cat .gitignore` 

See that it is ignoring anything with `__pycache__`, `tags` or `.pytest_cache` in the name.
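
To get a feel for what "ignoring" means, here is a minimal Python sketch that checks paths against those three patterns. It is a simplification: Git's real pattern rules (negation, trailing slashes, `**`) are richer than this, and the helper name `is_ignored` and the example paths are made up for illustration.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Patterns named in the lesson text; the real .gitignore may contain more.
ignore_patterns = ["__pycache__", "tags", ".pytest_cache"]


def is_ignored(path, patterns=ignore_patterns):
    """Return True if any part of the path matches an ignore pattern (simplified)."""
    parts = PurePosixPath(path).parts
    return any(fnmatch(part, pattern) for part in parts for pattern in patterns)


for path in ["gizmo.py", "__pycache__/gizmo.cpython-311.pyc", ".pytest_cache/README.md"]:
    print(path, "->", "ignored" if is_ignored(path) else "not ignored")
```

Files that are ignored never show up as untracked in `git status`, which keeps the output focused on the files you care about.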


## Using branches with Git and GitHub

From [Backlog.com](https://backlog.com/git-tutorial/using-branches/):"In a collaborative environment, it is common for several developers to share and work on the same source code. While some developers will be fixing bugs, others will be implementing new features, etc. With so much going on, there needs to be a system in place for managing different versions of the same code base.

Branching allows each developer to branch out from the original code base and isolate their work from others. It also helps Git to easily merge versions later on.

A Git branch is essentially an independent line of development. You can take advantage of branching when working on new features or bug fixes because it isolates your work from that of other team members. Different branches can be merged into any one branch as long as they belong to the same repository."

### Make a new branch to work on in the Gizmo repository

Even though we are probably the only person who will work on our fork of the Gizmo Python challenges repository, we are going to make a new branch and make all our changes on it. The main advantage is that if we keep all our changes on one branch labelled with our name, then if we ever make a pull request to the original author, that person can easily find everything we changed in one convenient location. We can simply make a pull request to merge the changes from our new branch into the master branch of the original repository. The owner can then look through the changes we propose and decide which ones they are willing to incorporate. It's a clean way of keeping track of contributions per person.

To make a new branch and automatically switch to it, use:

`git checkout -b <yourname>`

Replace `<yourname>` with your name. You should see a message from Git in your terminal saying it has switched to a new branch called 'yourname'. That's it!

If you run the command `git branch`, you will see two branches: master and the one you just named, which should be highlighted and marked with an asterisk. That is good: it verifies that you are indeed on the new branch with your name. We can make changes in this branch that will not be reflected in the master branch of our forked copy (or anywhere else). If you run `ls` now that you are in your new branch, you will see the same files, README.md and titanic.csv. This is because we created the branch while we were on the master branch, so it started with the same files.
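
Why does a new branch start with the same files? One way to picture it is that a branch is just a name pointing at a commit. The sketch below is a toy model in Python, not Git's actual implementation: the `repo` dictionary, the commit names `c1` and `c2`, and the helper functions are all hypothetical.

```python
# Toy model: a branch is a name that points at a commit.
repo = {
    "commits": {"c1": {"parent": None, "files": ["README.md", "titanic.csv"]}},
    "branches": {"master": "c1"},
    "current": "master",
}


def create_branch(repo, name):
    """Like `git checkout -b <name>`: new name, same commit, then switch to it."""
    repo["branches"][name] = repo["branches"][repo["current"]]
    repo["current"] = name


def commit(repo, commit_id, files):
    """Record a new commit and move only the current branch forward."""
    parent = repo["branches"][repo["current"]]
    repo["commits"][commit_id] = {"parent": parent, "files": files}
    repo["branches"][repo["current"]] = commit_id


def files_on(repo, branch):
    """List the files in the commit that a branch points at."""
    return repo["commits"][repo["branches"][branch]]["files"]


create_branch(repo, "yourname")
print(files_on(repo, "yourname"))  # same files as master right after branching
commit(repo, "c2", ["README.md", "titanic.csv", "gizmo.py"])
print(files_on(repo, "master"))    # master is unchanged
print(files_on(repo, "yourname"))  # only the new branch has the new file
```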

## Make a new file called gizmo.py (a Python script) and try Python challenge 2

Great news - we have already done challenge 1 on the Python challenges repository by forking and cloning the repo! Now, let's move on to challenge 2, which is to define a new function called "hello" in a new module called "gizmo". 

First, making sure that you are on the new branch, create a new file called gizmo.py:

`nano gizmo.py` 

This handy command will create the file and open the nano editor in one step. 

Next, read the challenge in Exercise 2. It asks us to define a function that takes two arguments, name and country, and prints the message: `Hello <name>, how are things in <country>?` The country should default to Finland. Try this on your own: how would you write a function that can do this?

Answer:

    def hello(name, country='Finland'):
        # Build the greeting from the two arguments and print it
        print("Hello " + name + ", how are things in " + country + "?")

Next, open the Python interpreter you've been using (for example, type `python` in the terminal while you are in the gizmo folder). First, we need to tell Python to import the new module called gizmo that you just created by running:

`import gizmo`

This will allow you to run any function defined in the gizmo.py script. Right now, we only have one function called "hello" - let's use it to see if we've accomplished what the challenge asked us to:

`gizmo.hello('Your Name', 'Your Country')`

Remember that Python expects a very specific syntax. In this case, we are essentially running `module.function('argument1', 'argument2')`. Because the function itself contains a print statement that combines our arguments with some other words, it should do what we want. If you see a friendly message asking how things are in the country you named, you've done well!
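
The default value for `country` is worth trying out as well. The sketch below is one possible variation on the answer, using the same wording: it splits the work into a hypothetical helper called `greeting` that builds the text and a `hello` function that prints it. The helper and the example names are illustrative assumptions, not part of the challenge.

```python
def greeting(name, country="Finland"):
    """Build the greeting text; `country` falls back to 'Finland' if not given."""
    return f"Hello {name}, how are things in {country}?"


def hello(name, country="Finland"):
    """Print the greeting, as the exercise asks."""
    print(greeting(name, country))


hello("Ada")                        # only a name: the default country is used
hello("Ada", "Canada")              # both arguments given by position
hello(country="Kenya", name="Ada")  # keyword arguments can come in any order
```

Keeping the text-building step separate from `print` is a design choice that makes the function easier to check, because a returned string can be compared directly.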

## Use Git to add and commit the changes you've made to the local git repository


### Step 1: Move changes to the staging area with `git add`

If you think of Git as taking snapshots of changes over the life of a project, `git add` specifies what will go in a snapshot (putting things in the staging area), and `git commit` then actually takes the snapshot, and makes a permanent record of it (as a commit). 

If you don’t have anything staged when you type `git commit`, Git will prompt you to use `git commit -a` or `git commit --all`, which is kind of like gathering everyone to take a group photo! However, it’s almost always better to explicitly add things to the staging area, because otherwise you might commit changes you forgot you made. The staging area can hold changes from any number of files that you want to commit as a single snapshot.
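
The relationship between your files, the staging area, and commits can be pictured with a small toy model in Python. This is a deliberately simplified sketch, not how Git works internally: the dictionaries, the file `notes.txt`, and the `add` and `commit` functions are hypothetical stand-ins for the real commands.

```python
# Toy model of the working folder, the staging area, and the commit history.
working_folder = {"gizmo.py": "def hello(): ...", "notes.txt": "scratch notes"}
staging_area = {}
history = []


def add(filename):
    """Like `git add <file>`: copy the file's current content into the staging area."""
    staging_area[filename] = working_folder[filename]


def commit(message):
    """Like `git commit -m`: record what is staged, then empty the staging area."""
    if not staging_area:
        raise RuntimeError("nothing staged to commit")
    history.append({"message": message, "files": dict(staging_area)})
    staging_area.clear()


add("gizmo.py")
commit("define hello function in gizmo.py")
print(list(history[-1]["files"]))  # only the file we staged is in the commit
```

Notice that `notes.txt` was never added, so it is not part of the commit. That is the control the staging area gives you.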

First, let's check the status of changes that git knows about using `git status`:

`$ git status`

You should see a message that you are on the branch named 'yourname' and that there is one untracked file, `gizmo.py`, which we just created and edited. We need to add this file to the staging area to let Git know we want to include it in the next commit. We do this by running:

`$ git add <name_of_file>`

so, in this case:

`$ git add gizmo.py`



### Step 2: Commit changes and add a message with `git commit`

When we run `git commit`, Git takes everything we have told it to save by using `git add` and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision), and each one has a short identifier code that can be used to point to that version of your files if you ever need to revert to it or investigate the changes made.

We use the `-m` flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run `git commit` without the `-m` option, Git will launch nano so that we can write a longer message.

Good commit messages start with a brief (<50 characters) statement about the changes made in the commit. Generally, the message should complete the sentence “If applied, this commit will...” . 
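
If you would like a quick way to check a message against the length guideline, here is a minimal Python sketch. The function name `check_subject` and the exact checks are assumptions made for illustration; Git itself does not enforce any of this.

```python
def check_subject(message, limit=50):
    """Return a list of problems with the first line of a commit message."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    problems = []
    if not subject.strip():
        problems.append("subject line is empty")
    if len(subject) >= limit:
        problems.append(
            f"subject has {len(subject)} characters; aim for fewer than {limit}"
        )
    return problems


print(check_subject("define hello function in gizmo.py"))
print(check_subject("changed a lot of different things in several files all at once"))
```

An empty list means the first line passes both checks.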

Let's try it with our `gizmo.py` file. This is the only file in the staging area, so the message we include in our commit can pertain only to it. If we had made changes to multiple files, for example changed parts of another script, added a new data file, or added lines to a data documentation text file, and we had added those to the staging area as well, we would be writing a commit that summarises the changes to ALL of those files. You can add all the changed files to the staging area at once by using `git add .`, but since we only changed one file we called it by name. Now, we're ready to make the commit by running:

`$ git commit -m "<my message here>"`

Choose a good commit message here - maybe something like "define hello function in gizmo.py". Remember, keep it short and sweet, but also memorable.

When we hit enter, we have successfully stored our first commit to our local git repository! To be sure, let's run:

`$ git status`

We should see a message that we are on the branch named 'yourname', there is nothing to commit, and the working tree is clean. Nice!

Another way to verify that our commit was saved is by using the command `git log`, which shows us the project's history of commits. `git log` lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the `git commit` command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created.
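
Where do these identifiers come from? Git derives them from the content being stored, using a hash function. The Python sketch below shows the calculation for a file's content (which Git calls a "blob"), assuming a repository that uses the default SHA-1 format. Commit identifiers are computed along the same lines from the commit's own content, which is not shown here. The function name `git_blob_id` is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git gives to file content (a "blob") in a SHA-1 repository."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


full_id = git_blob_id(b"")  # the ID of an empty file
short_id = full_id[:7]      # a short identifier is just the start of the full one

print(full_id)
print(short_id, full_id.startswith(short_id))
```

Because the identifier depends only on the content, the same content always gets the same identifier, on any computer.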

If we run `ls` in our project folder at this point, we will still see the same files as before. That’s because Git saves information about files’ history in the special .git directory mentioned earlier so that our filesystem doesn’t become cluttered (and so that we can’t accidentally edit or delete an old version).

## Push changes to remote repository in GitHub

When we add and commit changes with git to files in our local computer, we are tracking changes to our **local** repository. In order to have those changes reflected in a **remote** repository (like on GitHub), we need to **push** them to either a brand new or an existing remote repository. In this case, we are working with a repository that already existed on GitHub before it existed on our local computers (we forked and cloned it from GitHub), and it already contains a remote URL in its configuration settings because that information came along with the files when we did the cloning step. We checked this earlier when looking at the git configuration settings - it is updated with each new repository we clone or create.

*In short:* When we push changes, we’re interacting with a remote repository to update it with the changes we’ve made locally (often this corresponds to sharing the changes we’ve made with others). 

Let's try to push our local changes to the remote repository using a simple command:

`git push`

We get an error message, which makes sense: we made the new 'yourname' branch on our local computer, so there is no branch in the remote repository called 'yourname'! Git helpfully suggests that we create a matching branch on the remote using the `--set-upstream` option. Run the following to do so:

`git push --set-upstream origin <yourname>`

You should see a message similar to `Branch 'yourname' set up to track remote branch 'yourname' from 'origin'.` (the exact wording varies between Git versions).

If you run `git status` at this point, you should see that your branch is up to date with 'origin/yourname' remote branch and your working tree is clean. This is good!

From now on, any new changes on the 'yourname' branch can be added (`git add .`), committed (`git commit -m "message"`) and pushed to GitHub with `git push`, without the need to run the set-upstream command again. 

We can also check on GitHub to see the changes were reflected in the remote location. There should be a message at the top of your forked repository location that indicates changes were pushed recently. Click on the branches drop down menu near the top left of the page and select show all. Select the branch 'yourname' - here you should see a new file called gizmo.py that you created which contains the "hello" function you defined.



## Recap: A basic collaborative workflow

In practice, it is good to make sure you have an up-to-date version of the repository you are collaborating on, so you should run `git pull` before making your changes. The basic collaborative workflow would be:

1. update your local repo with `git pull`
2. make your changes and stage them with `git add`
3. commit your changes with `git commit -m`, and
4. upload the changes to GitHub with `git push`

It is better to make many commits with small changes than one commit with massive changes: small commits are easier to read and review.
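
The four steps above can also be written down as a small Python sketch. It only prints the commands by default, because in practice you edit your files between step 1 and step 2. Treat it as an illustration of the order, not a script to run as is. The function names and the `dry_run` flag are hypothetical.

```python
import shlex
import subprocess


def workflow_commands(files, message):
    """Return the four Git commands of the basic workflow, in order."""
    return [
        ["git", "pull"],
        ["git", "add", *files],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def run_workflow(files, message, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for command in workflow_commands(files, message):
        print("$", shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # stop at the first failure


run_workflow(["gizmo.py"], "define hello function in gizmo.py")
```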

## Test your answers to Python challenges in the Gizmo repository!

Now that you're familiar with version control using git AND the basics of working with Python, it's time to try your hand at the remaining challenges in the Python challenges Gizmo repository! You can work completely on your local computer, add and commit the changes you make periodically (whenever you decide, don't forget the accompanying commit message) and test your answers to each challenge using the test_gizmo.py script. 

For now, let's test the function we wrote to solve exercise 2 by running the test_gizmo.py script. This script tests whether the code you wrote for each challenge accomplishes the task. It automatically runs on all fourteen challenges, so expect to see a lot of "failures" if you've only done the first two at this point. Good motivation! :)

In your terminal, make sure you are in the main folder of the gizmo repository and run:

`python -m pytest .tests/test_gizmo.py`

In response, you should see a pass/fail result for each test. Near the top there is a row of characters, one per test: a green dot for each test you pass and a red F for each test you fail.
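
If you are wondering how a test can check something that is printed rather than returned, here is a minimal sketch using only the Python standard library. It is a hypothetical example: the real tests in test_gizmo.py may be written differently, and the helper `captured_output` and the test name are made up for illustration.

```python
import contextlib
import io


def hello(name, country="Finland"):
    """The answer to exercise 2, repeated here so the example is self-contained."""
    print("Hello " + name + ", how are things in " + country + "?")


def captured_output(function, *args):
    """Run a function and return everything it printed as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        function(*args)
    return buffer.getvalue()


def test_hello_default_country():
    # Hypothetical check; the real tests in test_gizmo.py may look different.
    assert captured_output(hello, "Ada") == "Hello Ada, how are things in Finland?\n"


test_hello_default_country()
print("test passed")
```

A test passes when its `assert` statements hold and fails as soon as one of them does not, which is what the dots and Fs in the pytest output summarise.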

Try the rest of the exercises on your own! Good luck! 