# Version control and remote collaboration with Git and GitHub

This lesson focuses on understanding and implementing version control with Git to keep track of changes to a code-based project. We'll talk about how version control systems track changes to local projects, and how they enable remote collaboration and crediting of multiple authors on a project hosted on a remote platform like GitHub. This material draws from the [Version Control with Git](https://swcarpentry.github.io/git-novice/) lesson from Software Carpentry. It uses the [Gizmo repository](https://github.com/wmvanvliet/gizmo) from wmvanvliet on GitHub as a basis for Python challenges and to illustrate a track changes workflow.

## Setup requirements

For Windows users, Git should be installed already on your computer as part of your Bash install from Day 1. 

Mac users need to install Git for Mac by downloading and running the most recent "mavericks" installer from [this list on Sourceforge.net](https://sourceforge.net/projects/git-osx-installer/files/). Because this installer is not signed by the developer, you may have to right click (control click) on the .pkg file, click Open, and click Open on the pop up window. After installing Git, there will not be anything in your /Applications folder, as Git is a command line program. *Participants should install Git before the start of the lesson*

More detailed instructions and videos are available here: https://carpentries.github.io/workshop-template/#git

## Data and code for this lesson

The material for this lesson is found in the Gizmo Python Challenges repository by wmvanvliet on GitHub. Participants do not need to do anything ahead of the lesson as the first part walks them through forking and cloning a repository on GitHub. The link is here: https://github.com/wmvanvliet/gizmo

## Background

Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn’t just for software code: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.

A version control system is a tool that keeps track of these changes for us, effectively creating different versions of our files. It allows us to decide which changes will be made to the next version (each record of these changes is called a commit), and keeps useful metadata about them. The complete history of commits for a particular project and their metadata make up a repository. Repositories can be kept in sync across different computers, facilitating collaboration among different people. Version control systems start with a base version of the document and then record changes you make each step of the way. You can think of it as a recording of your progress: you can rewind to start at the base document and play back each change you made, eventually arriving at your more recent version.
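The "rewind and play back" idea can be sketched in a few lines of Python. This is only a toy illustration, not how Git stores data: the document contents and the two changes are made up for the example.

```python
# Toy model of a version history: a base document plus a list of recorded changes.
# Illustration only; Git does not store history this way.

base = ["Title", "First draft of the introduction."]

# Each change is a (description, function) pair; the function returns a new version.
changes = [
    ("add methods section", lambda doc: doc + ["Methods: to be written."]),
    ("revise introduction", lambda doc: [doc[0], "Revised introduction."] + doc[2:]),
]

def replay(base, changes, upto=None):
    """Start from the base version and apply the first `upto` changes in order."""
    doc = list(base)
    for _description, change in changes[:upto]:
        doc = change(doc)
    return doc

print(replay(base, changes, upto=0))  # the base version
print(replay(base, changes, upto=1))  # after the first change
print(replay(base, changes))          # the most recent version
```

Because every change is recorded separately, any earlier version can be rebuilt by stopping the playback early.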

## Remote collaboration

Collaborative writing or scripting with traditional word processors and text editors is cumbersome. Either every collaborator has to work on a document sequentially (slowing down the process of writing), or you have to send out a version to all collaborators and manually merge their comments into your document. The ‘track changes’ or ‘record changes’ option can highlight changes for you and simplifies merging, but as soon as you accept changes you will lose their history. You will then no longer know who suggested that change, why it was suggested, or when it was merged into the rest of the document. 

Some word processors let us deal with this a little better, such as Microsoft Word’s Track Changes or Google Docs’ version history, but they lack a streamlined way to attach custom messages to changes while keeping just one latest version of the file for everyone working on it. It seems ridiculous to have multiple nearly identical versions of the same document that we keep passing back and forth to create something whole. The result is a lot of files with names like Final_paper_EDIT01.doc or Final_paper_EDITCOMMITTEE03.doc, and it can get messy trying to merge suggestions and changes made by multiple people in multiple documents sent over email. I'm sure we all know the headache!

With a version control system and a hosting platform like GitHub, you can incorporate two sets of changes into the same base document, unless multiple users change the same section of the document, which is called a conflict.
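To make the idea of a conflict concrete, here is a minimal sketch of merging two edited copies of the same base document line by line. It is not Git's actual merge algorithm, and it assumes that all three versions have the same number of lines; the function name `merge_lines` is made up for this example.

```python
def merge_lines(base, ours, theirs):
    """Merge two edited copies of `base`, line by line.

    Simplifying assumption: all three versions have the same number of lines.
    Returns (merged_lines, conflict_line_numbers).
    """
    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:        # both copies agree (same change, or no change at all)
            merged.append(o)
        elif o == b:      # only "theirs" changed this line
            merged.append(t)
        elif t == b:      # only "ours" changed this line
            merged.append(o)
        else:             # both changed the same line differently: a conflict
            merged.append(b)
            conflicts.append(number)
    return merged, conflicts

base = ["Introduction", "Methods", "Results"]
ours = ["Introduction (edited by A)", "Methods", "Results"]
theirs = ["Introduction", "Methods", "Results (edited by B)"]

print(merge_lines(base, ours, theirs))
```

Here the two collaborators edited different lines, so both edits are kept and the list of conflicts is empty. If both had edited line 1, that line number would be reported for a person to resolve.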

From [GitHub Guides](https://guides.github.com/activities/hello-world/): "GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere." GitHub hosts code and all files for each project in a **repository**, sometimes shortened to "repo". A repository on GitHub is free to create, you can add contributing members and collaborators with different permission settings as you please, and you can access it with a unique URL.

## Individual work
Teams are not the only ones to benefit from version control: lone researchers can benefit immensely. Keeping a record of what was changed, when, and why is extremely useful for all researchers if they ever need to come back to the project later on (e.g., a year later, when memory has faded).




## Set up Git configuration settings

When we use Git on a new computer for the first time, we need to configure a few things. The commands we will run only need to be run once: the flag `--global` tells Git to use the settings for every project, in your user account, on this computer. Below are configurations we will set as we get started with Git:

* our name and email address,
* what our preferred text editor is,
* and that we want to use these settings globally (i.e. for every project).

Open up your terminal or GitBash command prompt and type:

`$ git config --global user.name "<insert your name here>"`

`$ git config --global user.email "<insert your email here>"`

For these lessons, we will be interacting with GitHub and so the email address used should be the same as the one used when setting up your GitHub account.

We will also set up nano to be the preferred text editor for Git (you can always reconfigure this if you want to change to a different text editor in the future):

`$ git config --global core.editor "nano -w"`

You can check your settings at any time by running:

`$ git config --list`

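We will be working with some of the most common Git commands today. If you ever forget how a command works, you can use `git config -h` for a short list of options for the `config` command and `git config --help` to open its detailed manual page. The same pattern works for other commands (for example `git commit -h`), and `git help` lists the most common commands.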


## Fork and clone a repository from GitHub
From [GitHub Guides](https://guides.github.com/activities/forking/): "After using GitHub by yourself for a while, you may find yourself wanting to contribute to someone else’s project. Or maybe you’d like to use someone’s project as the starting point for your own. This process is known as forking.

Creating a “fork” is producing a personal copy of someone else’s project. Forks act as a sort of bridge between the original repository and your personal copy. You can submit Pull Requests to help make other people’s projects better by offering your changes up to the original project. Forking is at the core of social coding at GitHub."

### Fork the [Gizmo Python challenges repository on GitHub](https://github.com/aecryan/gizmo)

About the Gizmo "Can you speak Python?" GitHub Repository: "This is a Python challenge. Create pull requests (PR's) to this repository to solve it. Upon PR submission, the GitHub action robots will check your code and report back how well you did. You can then add more commits to your PR until all tests come back green, which means you win! (*note:* you can also do all exercises locally and run the test to check if you pass or fail each exercise, without needing to do a pull request. This is how we will do it in this lesson.)

The exercises are meant to test your knowledge of some important features of the Python programming language and the NumPy and Pandas libraries. When it's not immediately obvious to you how to solve an exercise using only a few lines of code, it is likely you can learn a new Python trick by checking the links below the exercise."

- In the top right of the repository page on GitHub, find the small gray button that says "Fork".
- Click this button to create a copy of the main repository in your own account. You can work here and make changes to the original author's work without affecting it, and you can later create what is called a pull request to integrate your changes or suggestions into their repository. We will not be doing that today; instead, we will run all tests locally. There is more information in the links within the Gizmo repository.

### Clone the forked copy of the Gizmo Python challenges repository to your local computer

- Click the green button near the top of the screen that says "Code". We will use HTTPS today, but you can configure SSH later, which stores keys on your computer for a private connection. Both are secure; HTTPS is simply ready to use.
- Copy the link shown under the HTTPS heading.
- In your terminal, run `cd Desktop`.
- Run `git clone` followed by the link you copied, for example `git clone https://github.com/aecryan/gizmo.git` (your link will contain your own GitHub username).
- You will see a new folder appear on your desktop called gizmo. It has the same contents as the remote repository, but it is now a local version unique to you. You can make changes here that will not be reflected in the remote version until you tell Git to send them to GitHub by "pushing".

### Explore .git folder and local git repo
- `cd gizmo`
- `ls -a` lists hidden files and folders, including the `.git` folder. What is a `.git` folder?
  The `.git` folder contains all the information Git needs to keep your project under version control, including your commits and the address of the remote repository. It also stores your commit history so that you can roll back to an earlier version.
- `cd .git`, then `cat config`. This shows the settings for this repository, such as the address of the remote it was cloned from. The name and email address we set earlier with `--global` are stored in your user-level configuration (usually `~/.gitconfig`) and can be viewed with `git config --list`.

- `cd ..` to return to the `gizmo` folder, then `ls`. What is the relationship between local and remote repositories? You can see there are two files, README.md and titanic.csv. These are the same files that are on GitHub. There is also a `.gitignore` file, which you can find by showing hidden files; `cat .gitignore` shows that it is ignoring anything with `__pycache__`, `tags` or `.pytest_cache` in the name.


## Make a new branch 
- Why do we use branches with Git and GitHub?
- `git checkout -b <yourname>`
- You get the message: Switched to a new branch 'yourname'
- `git branch` shows two branches: master and the one you named, which is highlighted and marked with an asterisk. We can make changes in this branch that will not be part of the master branch. If you run `ls` you will see the same files, README.md and titanic.csv.
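One way to picture a branch is as a name that points at the latest commit on a line of work. The sketch below is a toy model of that idea, not Git's implementation; the commit labels `c1` and `c2` and the helper functions are made up for the example.

```python
# Toy model: a branch is a name that points at the latest commit on that line of work.
commits = {"c1": None}        # commit label -> label of its parent commit
branches = {"master": "c1"}   # branch name -> commit it currently points at

def create_branch(name, start_from):
    """A new branch starts at the same commit as the branch it was created from."""
    branches[name] = branches[start_from]

def commit(branch, commit_label):
    """Record a new commit on `branch` and move only that branch forward."""
    commits[commit_label] = branches[branch]
    branches[branch] = commit_label

create_branch("yourname", start_from="master")
commit("yourname", "c2")

print(branches)
```

After the commit, `yourname` points at `c2` while `master` still points at `c1`, which is why work on your branch does not change the master branch.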

## Make a new file called gizmo.py (a Python script) and try Python challenge 2
- `nano gizmo.py` creates the file and opens it in the editor in one step.
- Define a function that takes two arguments, name and country, and prints the message `Hello <name>, how are things going in <country>?`. The country should default to Finland (Exercise 2 from Gizmo).
- Answer:

      def hello(name, country='Finland'):
          print("Hello " + name + ", how are things in " + country + "?")

- Start the Python interpreter by typing `python` in your terminal.
- run `import gizmo` to import the new gizmo module you just created and let python access the new function
- run `gizmo.hello('Your Name', 'Your Country')` this should print the right message
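If you would like to check the printed message without reading it by eye, a small sketch like the following captures what the function prints. The `hello` function mirrors the answer above; the helper `captured_output` and the example name "Ada" are made up for illustration.

```python
import contextlib
import io

def hello(name, country='Finland'):
    print("Hello " + name + ", how are things in " + country + "?")

def captured_output(func, *args, **kwargs):
    """Run `func` and return everything it printed as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()

# Leaving out `country` uses the default value, Finland.
print(captured_output(hello, "Ada"), end="")
print(captured_output(hello, "Ada", country="Kenya"), end="")
```

The first call shows the default argument at work; the second overrides it.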

## Add and commit changes to local git repository
- run `git status` - this should show that you have one new untracked file, gizmo.py. See the message that you can use `git add <filename>` to add this file to the staging area. If you are working with more than one file at a time, you can also use `git add .` to add them all to the staging area at the same time.
- `git add gizmo.py`
- what is the staging area and how does it relate to tracking changes?

## Moving changes to the staging area with `git add`

If you think of Git as taking snapshots of changes over the life of a project, `git add` specifies what will go in a snapshot (putting things in the staging area), and `git commit` then actually takes the snapshot, and makes a permanent record of it (as a commit). 

If you don’t have anything staged when you type `git commit`, Git will prompt you to use `git commit -a` or `git commit --all`, which is kind of like gathering everyone to take a group photo! However, it’s almost always better to explicitly add things to the staging area, because otherwise you might commit changes you forgot you made. The staging area can hold changes from any number of files that you want to commit as a single snapshot.
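The snapshot analogy can be sketched as a toy model in Python. This is a deliberate simplification and not how Git works internally (for instance, a real commit records the state of every tracked file, not only the files just staged); the file names and contents are made up.

```python
# Toy model of the staging area. Simplified on purpose; not Git's real behaviour.
working_files = {"gizmo.py": "def hello(): ...", "notes.txt": "scratch notes"}
staging_area = {}
history = []

def add(filename):
    """Copy the current content of one file into the staging area."""
    staging_area[filename] = working_files[filename]

def commit(message):
    """Store everything that was staged as one snapshot, then empty the staging area."""
    if not staging_area:
        raise ValueError("nothing staged to commit")
    history.append({"message": message, "files": dict(staging_area)})
    staging_area.clear()

add("gizmo.py")                    # choose what goes in the snapshot
commit("define hello function")    # take the snapshot
print(history)
```

Only `gizmo.py` ends up in the snapshot, because `notes.txt` was never added. That is the control the staging area gives you.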

First, let's check the status of changes that git knows about using `git status`:

`$ git status`

You should see a message telling you which branch you are on and which files Git has noticed. A new file that has not been added yet, such as the `gizmo.py` script we just created, is listed as untracked. We need to add these changes to the staging area to let Git know we want to commit the latest version. We do this by running:

`$ git add <name_of_file>`

so, in this case:

`$ git add gizmo.py`

Notice that Git gives pretty helpful messages when you interact with it on the command line. In this case, `git status` tells us to use `git add` to track this file and get the changes ready for committing.

## Committing changes and adding a message with `git commit`

When we run `git commit`, Git takes everything we have told it to save by using `git add` and stores a copy permanently inside the special `.git` directory. This permanent copy is called a commit (or revision), and each one has a short identifier code that can be used to point to that version of your files if you ever need to revert to it or investigate the changes made.
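These identifiers are hashes of the stored content. As a rough illustration, the sketch below computes the identifier Git gives to a file's content (a "blob" object), assuming a repository that uses Git's default SHA-1 format. Commit identifiers are computed in a similar way, but over a commit object that also includes the author, date and message, which is not reproduced here. The function name `git_blob_id` is made up for this example.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """SHA-1 identifier Git assigns to file content stored as a "blob" object."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

full_id = git_blob_id(b"")   # identifier of an empty file
short_id = full_id[:7]       # a short identifier is a prefix of the full one

print(full_id)   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(short_id)  # e69de29
```

Because the identifier is derived from the content, the same content always gets the same identifier, and any change produces a different one.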

We use the `-m` flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run `git commit` without the `-m` option, Git will launch nano so that we can write a longer message.

Good commit messages start with a brief (<50 characters) statement about the changes made in the commit. Generally, the message should complete the sentence “If applied, this commit will...”.
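As a small illustration of the length guideline, the following sketch checks the first line of a message. The function `check_subject` is hypothetical and not part of Git; it only encodes the "fewer than 50 characters" rule of thumb mentioned above.

```python
def check_subject(message, limit=50):
    """Return a list of problems found in the first line of a commit message."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    problems = []
    if not subject.strip():
        problems.append("subject line is empty")
    if len(subject) >= limit:
        problems.append(
            f"subject has {len(subject)} characters; aim for fewer than {limit}"
        )
    return problems

print(check_subject("Define hello function in gizmo.py"))
print(check_subject("Changed a lot of different things in several files all at once today"))
```

An empty list means the subject line passes both checks.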

Let's try it with our `gizmo.py` script. This is the only file in the staging area, so the message we include in our commit pertains only to it. If we had made changes to multiple files (for example changed parts of another script, added a new data file, or added lines to the data documentation text file) and had added those to the staging area as well, we would write a commit message that summarises the changes to ALL of those files. You can add all the changed files to the staging area at once by using `git add .`, but since we only changed one file we called it by name. Now, we're ready to make the commit by running:

`$ git commit -m "<my message here>"`

Choose a good commit message here - maybe something like ______. Remember, keep it short and sweet, but also memorable.

When we hit enter, we have successfully stored our first commit to our local git repository! To be sure, let's run:

`$ git status`

We should see a message naming the branch we are on and saying that there is nothing to commit and the working tree is clean. Nice!

Another way to verify that our commit was saved is by using the command `git log`, which shows us the project's history of commits. `git log` lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the `git commit` command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created.

If we run `ls` in our project folder at this point, we will still see ____. That’s because Git saves information about files’ history in the special .git directory mentioned earlier so that our filesystem doesn’t become cluttered (and so that we can’t accidentally edit or delete an old version).

## Recap

To recap, the workflow for adding and committing changes to files using Git is:

1. `$ git status` to see the files that have been changed and are awaiting commit
2. `$ git add <file_name>` to add one file at a time to the staging area or `$ git add .` to add all files that have been changed since the last commit
3. `$ git commit -m "<descriptive message>"` to commit the latest version of file(s) and describe the nature of the changes made

- `git status` - shows which branch you are on, and a new message indicating that changes will be committed for the new file gizmo.py
- `git commit -m "defined hello function in gizmo.py"`
- What is a commit message and how do you write a good one?
- `git status` - shows which branch you are on, nothing to commit, working tree clean.

# Using `.gitignore`

What if we have files that we do not want Git to track for us, like backup files created by our editor or intermediate files created during data analysis? Putting these files under version control would be a waste of disk space. What’s worse, having them all listed could distract us from changes that actually matter, so let’s tell Git to ignore them.

We do this by creating a file in the root directory (in this case, the `world_cities_workshop/` folder) of our project called .gitignore. We can use two shell commands, either `touch` or `nano` to create the file. If we use `nano`, the file will be created and opened with the nano editor in one step, but since it's empty and we still need to decide what to list in there, let's use `touch` to simply create the file. Make sure you are in the `world_cities_workshop/` folder (not any subfolder):

`$ cd world_cities_workshop`

And then run the command:

`$ touch .gitignore`

If you run `ls -a` now, you should see a new hidden file called `.gitignore` in your `world_cities_workshop/` folder. 

We can edit this file with nano and add the names of files or directories that we specifically want git to ignore and NOT track changes to. We don't need git to track changes to the `netherlands-cities.csv` file in the `world_cities_workshop/data/` folder, so let's tell git to ignore it. Open up the .gitignore file by running:

`$ nano .gitignore`

You should see a blank text document in your terminal window. We are going to add to it the name of the file we want to ignore, and we also need to specify the folder that it is in. In nano, type:

`data/netherlands-cities.csv`

Then hit Ctrl + O to write out, Enter to confirm, and Ctrl + X to close nano and return to your shell.

If you now run `cat .gitignore` you should see the line you just wrote as the contents of this file. You can also use the .gitignore file to ignore whole folders, for example if you are working with large amounts of data, image files, or other files that are too large, unchanging, or otherwise not useful to track with Git. You can also ignore all files of a certain type with the same pattern-matching strategy we used to select files that ended with `-nl.csv` (`*-nl.csv`); for example, `*.jpg` ignores all `.jpg` files.
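To get a feel for how wildcard patterns select files, here is a rough approximation using Python's `fnmatch` module. Real `.gitignore` rules are richer than this (they also support negation, directory-only patterns and anchoring), so treat it as a sketch of the matching idea only. The file paths and the helper `is_ignored` are made up.

```python
from fnmatch import fnmatchcase

ignore_patterns = ["data/netherlands-cities.csv", "*.jpg"]
paths = ["data/netherlands-cities.csv", "figures/map.jpg", "map.py"]

def is_ignored(path, patterns):
    """Approximate check: does the path match any of the shell-style patterns?"""
    return any(fnmatchcase(path, pattern) for pattern in patterns)

for path in paths:
    print(path, "ignored" if is_ignored(path, ignore_patterns) else "tracked")
```

The exact file name and anything ending in `.jpg` are matched, while `map.py` is not.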

Git starts applying these rules as soon as the .gitignore file exists. However, since we created the file after our last commit, we still need to add and commit it so that the rules are recorded in the project history and shared with collaborators. We can do this the same way we added and committed changes to any other file, by:

`$ git add .gitignore`

`$ git commit -m "added .gitignore file and ignored netherlands-cities.csv file"`

If you run `git status` now, you should see the same message that you are on branch "master", there is nothing to commit, and the working tree is clean.



# Remote collaboration on GitHub

## Relationship between local and remote repositories

Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.

Systems like Git allow us to move work between any two repositories. In practice, though, it’s easiest to use one copy as a central hub, and to keep it on the web rather than on someone’s laptop. Most programmers use hosting services like GitHub, Bitbucket or GitLab to hold those master copies, which are sometimes referred to as a "single source of truth".

Let’s start by sharing the changes we’ve made to our current project with the world. Log in to GitHub, then click on the plus icon in the top right corner to create a new repository. Name your repository `map-<yourcountryname>` and then click “Create Repository”.

*Note:* Since this repository will be connected to a local repository, it needs to be empty. Leave “Initialize this repository with a README” unchecked, and keep “None” as options for both “Add .gitignore” and “Add a license.” 

## Pushing your local repository to the remote on GitHub

As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository. The remote repository appears empty for now because it doesn't have any files yet - we have to add those by connecting the remote repository to the local one - remember, this is the `world_cities_workshop/` directory on your computer. 

To do that, GitHub already suggests some helpful commands under the heading **"Push an existing repository from the command line"**. These commands are unique to the remote repository you just created on GitHub and connect it to your local folder. In your terminal, make sure you are in your `world_cities_workshop/` directory and then run the following command, replacing `<url of your project>` with (you guessed it) the actual URL of your project shown on GitHub (see the note below if you have navigated away from that page):

`git remote add origin <url of your project>.git`

This first command, `git remote add origin` does the connecting step between your local repository and the remote on GitHub. Remember, we set up the global configuration options so Git already knows that the person requesting this connection be made is you (from your computer). 

The next step uses the Git command `push` to move the files tracked by Git in your local repository to the remote on GitHub. We have only one branch in this project ("master"), and we push it to origin (the name we just gave the remote repository).

`git push -u origin master`

This establishes the connection and makes it easy to `push` future local changes made in your local repository to the remote version on GitHub, and also allows you to `pull` changes that are made by you or someone else on GitHub to your local machine. The `-u` option used with `git push` associates the current branch with a remote branch so that the commands `git push` and `git pull` can be used without any arguments to move changes between the local and remote repositories. Think of it as a two-way street. :)

*Note:* If you have navigated away from the page that displayed immediately after creating the remote repository on GitHub, you can always get back to the URL of your project by clicking on the green "Code" button near the top right of the main repository page. When you click there, choose "HTTPS" and copy the contents of the box below it. You can use that directly in your commands with Git, including when you want to clone someone else's repository (more on that later).

## The difference between `push` and `commit`

When we push changes, we’re interacting with a remote repository to update it with the changes we’ve made locally (often this corresponds to sharing the changes we’ve made with others). Commit only updates your local repository.
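The difference can be sketched with two lists standing in for the local and remote histories. This is a toy model only; it assumes the remote history is always a prefix of the local one, which real collaboration does not guarantee.

```python
# Toy model: two lists stand in for the local and remote commit histories.
local_history = ["initial commit"]
remote_history = ["initial commit"]

def commit(message):
    """A commit only changes the local history."""
    local_history.append(message)

def push():
    """Copy the commits the remote does not have yet; return how many were sent."""
    new_commits = local_history[len(remote_history):]
    remote_history.extend(new_commits)
    return len(new_commits)

commit("define hello function")
print(remote_history)   # still only the initial commit

push()
print(remote_history)   # now includes the new commit
```

Until `push()` runs, the new commit exists only on your computer.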


## Push changes to remote repository in GitHub
- Why do we have to push?
- Try `git push`. You get an error message because there is no branch in the remote repository called 'yourname', and Git helpfully suggests creating one with the `--set-upstream` option.
- `git push --set-upstream origin <yourname>` - you should see a message that the branch 'yourname' is set up to track the remote branch `<yourname>` from origin.
- `git status` - your branch is up to date with 'origin/yourname', nothing to commit, working tree clean.
- From now on, any new changes can be added (`git add .`), committed (`git commit -m "message"`) and pushed to GitHub with `git push`.
- Check on GitHub: you should see a message saying that changes were pushed less than a minute ago.
- Click on branches and show all, then select 'yourname'. Here you should see the new file gizmo.py that you created, and if you click into it, you should see the function you defined.

## Test answer against Gizmo repository 
- `python -m pytest .tests/test_gizmo.py` should show a list of tests two to fourteen and pass/fail for each one. At the top there is a row of fourteen characters: a green dot for each test you pass and a red F for each test you fail.
- Participants can do the rest of the exercises (14 in total) on their own.

# A basic collaborative workflow

In practice, it is good to be sure that you have an up-to-date version of the repository you are collaborating on, so you should run `git pull` before making your changes. The basic collaborative workflow would be:

1. update your local repo with `git pull`
2. make your changes and stage them with `git add`
3. commit your changes with `git commit -m`, and
4. upload the changes to GitHub with `git push`

It is better to make many commits with smaller changes rather than one commit with massive changes: small commits are easier to read and review.
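If you find yourself repeating the four workflow steps often, they could be collected in a small script. The sketch below only builds and prints the commands rather than running them; the function name `workflow_commands`, the file name and the commit message are examples.

```python
import shlex

def workflow_commands(files, message):
    """Build the Git commands for one pull / add / commit / push cycle."""
    return [
        ["git", "pull"],
        ["git", "add", *files],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]

# Print each command as it would be typed in the terminal.
for command in workflow_commands(["gizmo.py"], "Define hello function"):
    print(shlex.join(command))
```

To actually execute the steps, each command list could be passed to `subprocess.run(command, check=True)` from inside your local repository. Printing them first is a safe way to see what would happen.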