## UBC Intro to Machine Learning

### Git and GitHub
Instructor: Socorro Dominguez  
February 19, 2022

**Agenda:**

* What is Version Control? 
* What are Git and GitHub?
* Creating a GitHub account
* Setting up Git on your computer
* What is a repository?
* Creating a repository (directly on GitHub)
* Cloning a repository

# What is Version Control?

- Practice of tracking and managing changes to software code. 
- Version Control systems are software tools that help software teams manage changes to source code over time. 
- VC systems help software teams work faster and smarter. They help reduce development time and increase successful deployments.
- VC keeps track of every modification to the code in a special kind of database. 
    - If a mistake is made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while minimizing disruption to all team members.
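
To make "turning back the clock" concrete, here is a minimal sketch using Git in a throwaway directory. The file name `analysis.txt`, the commit messages, and the identity are made up for illustration; the commands themselves are covered later in these slides.

~~~shell
# Work in a throwaway directory so nothing real is touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
# Identity for this scratch repo only (hypothetical values).
git config user.name "Jane Doe"
git config user.email "jane@example.com"

# Version 1 of a file.
echo "mean = 10" > analysis.txt
git add analysis.txt
git commit -q -m "First version"

# Version 2 introduces a mistake.
echo "mean = 1000" > analysis.txt
git commit -q -am "Second version"

# List the recorded history, newest first.
git log --oneline

# Compare the previous version with the current one.
git diff HEAD~1 HEAD -- analysis.txt

# Turn back the clock: restore the file as it was one commit ago.
git checkout HEAD~1 -- analysis.txt
restored="$(cat analysis.txt)"
echo "$restored"
~~~

Both versions stay in the history, so the mistake can be inspected later even after the file is restored.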

# Version Control Motivation

![](https://github.com/UBC-MDS/DSCI_521_platforms-dsci/raw/275d8fd548028d6d37c8284a672a535b86ce5a2e/lectures/01_lecture-intro-MDS-tools/phd101212s.png)

*source: “Piled Higher and Deeper” by Jorge Cham, http://www.phdcomics.com*

**Additional reasons to learn VC with Git**
- GitHub (the website) can act as a backup for the files housed there.
- GitHub can be used to host websites and blogs.
- GitHub has fantastic search functionality.
- Git is the language we have to speak to communicate with GitHub.

# What is GitHub?

- Web-based platform for the dissemination of free- and open-source software.

- You are already using GitHub: it is where all the course material lives, including your assignments!

- GitHub provides the following:
    - Version control for free- and open-source software and other digital assets
    - Project discussion forums
    - DevOps to facilitate building and testing software
    - Bug reporting, patching, and tracking
    - Documentation hosting
    - An environment that fosters collaboration

- You can use GitHub for any digital asset. However, the most common use case is for individuals or organizations to house repositories of free- and open-source software.


# Register a GitHub account

Register an account with GitHub. It’s free!

[GitHub](https://github.com)

### Username advice

- Choose wisely.

- A few tips to take into consideration:

    - Incorporate your actual name! People like to know who they’re dealing with. 
    - Reuse your username from other contexts, e.g., Twitter or Slack. 
    - Pick a username you will be comfortable revealing to your future boss.
    - Shorter is better than longer.
    - Don’t highlight your current university, employer, or place of residence.
    - Lower case ideally.

# Installing Git

Follow the instructions:
- For [Windows](https://ubc-mds.github.io/resources_pages/install_ds_stack_windows/#git-bash-and-windows-terminal) users.

- For Mac users:
    1. [Bash Shell](https://ubc-mds.github.io/resources_pages/install_ds_stack_mac/#bash-shell)
    2. [Git](https://ubc-mds.github.io/resources_pages/install_ds_stack_mac/#git)


# Setting Up Git

Once you have your `shell` working, run the following:

~~~shell

git config --global user.name 'Jane Doe'
git config --global user.email 'jane@example.com'
git config --global --list

~~~

# Setting Up Git 2
* What user name should you give to Git? 
    - It does not have to be your GitHub user name, although it can be. 
    - Another good option is your actual first name and last name. 
    - Your commits will be labelled with this user name, so make it informative to potential collaborators and future you.

* What email should you give to Git? 
    - This must be the email associated with your GitHub account.

* The two commands that set your name and email return nothing. You can check that Git understood by typing:

~~~shell
git config --global --list
~~~
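
If you only want to check one setting, or need a different identity for a single project, `git config` can do both. A small sketch, assuming a scratch repository and a made-up work email address:

~~~shell
# Scratch repository, so your real settings are not modified.
scratch="$(mktemp -d)"
cd "$scratch"
git init -q

# Without --global, the setting applies to this repository only
# and takes precedence over the global value.
git config user.email "jane.doe@work.example.com"

# --get prints a single value instead of the whole list.
repo_email="$(git config --get user.email)"
echo "$repo_email"

# --show-origin reveals which config file the value comes from.
git config --show-origin --get user.email
~~~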


# Connecting from Git to GitHub

Now we need to make sure that the connections between the tools on your computer, and between your computer and GitHub, work.

- Unfortunately, we have to front-load a rather fiddly task: deciding whether to communicate with GitHub via HTTPS or SSH, and setting up credentials accordingly. 

- You have two choices: get a personal access token (for HTTPS) or set up SSH keys.

- Once our credentials are sorted out, in the "Connect to GitHub" section we use Git in the shell to make sure you can clone a repo from GitHub and establish two-way communication, i.e. pull and push. 
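
The protocol is visible in the URL you use for a repository: HTTPS URLs start with `https://`, while GitHub's SSH URLs start with `git@github.com:`. A minimal sketch of telling them apart; the helper name `remote_protocol` is made up for illustration:

~~~shell
# Hypothetical helper: classify a Git remote URL by protocol.
remote_protocol() {
  case "$1" in
    https://*) echo "https" ;;
    git@*|ssh://*) echo "ssh" ;;
    *) echo "other" ;;
  esac
}

remote_protocol "https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.git"
remote_protocol "git@github.com:YOUR-USERNAME/YOUR-REPOSITORY.git"

# Inside a cloned repository you could check the real remote with:
#   remote_protocol "$(git remote get-url origin)"
~~~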

Third verb: clone

## HTTPS

- When we interact with a remote Git server, such as GitHub, we have to include credentials in the request. 
- This proves we are a specific GitHub user.

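- It would make sense to use your GitHub password, but you can't: GitHub no longer accepts account passwords for Git operations over HTTPS. You need to create a personal access token (PAT) instead.

1. Go to https://github.com/settings/tokens and click “Generate token”.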
2. Look over the scopes; select “repo”, “user”, and “workflow”.
3. Copy the generated PAT to your clipboard/notes/somewhere.
4. Provide this PAT next time a Git operation asks for your password.

## Generate a personal access token (PAT)

On github.com, assuming you’re signed in, you can manage your PATs from https://github.com/settings/tokens

![](https://happygitwithr.com/img/new-personal-access-token-screenshot.png)


**Treat this PAT like a password! Do not ever hard-wire your PAT into your code!**
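
One common way to keep a token out of your code is to read it from an environment variable at run time. A minimal sketch, assuming the token was exported as `GITHUB_PAT` (the variable name and the `auth_header` helper are illustrative choices, not something GitHub requires):

~~~shell
# Hypothetical helper: build an HTTP Authorization header from the
# GITHUB_PAT environment variable instead of hard-wiring the token.
auth_header() {
  if [ -z "${GITHUB_PAT:-}" ]; then
    echo "GITHUB_PAT is not set" >&2
    return 1
  fi
  printf 'Authorization: Bearer %s\n' "$GITHUB_PAT"
}

# Usage sketch (needs network access and a real token):
#   curl -s -H "$(auth_header)" https://api.github.com/user
~~~

The script itself contains no secret, so it is safe to commit; the token lives only in your shell session or credential store.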

## Setting the PAT in your local computer

The credential helpers used by Git take advantage of official OS-provided credential stores, where possible, such as macOS Keychain and Windows Credential Manager.

Here’s a command to reveal the current credential helper:

macOS

~~~shell
git config --show-origin --get credential.helper
file:/Users/jenny/.gitconfig    osxkeychain
~~~

Windows

~~~shell
git config --show-origin --get credential.helper
file:C:/Program Files/Git/mingw64/etc/gitconfig manager
~~~

Store your PAT in your credential manager.

## Setting the PAT in your local computer 2 Opt A

- Using a token on the command line
    - Once you have a token, enter it instead of your password when performing Git operations over HTTPS.

For example, on the command line you would enter the following:

~~~output
$ git clone https://github.com/username/repo.git
Username: your_username
Password: your_token

~~~

## Setting the PAT in your local computer 2 Opt B
1. On macOS, click on the Spotlight icon (magnifying glass) on the right side of the menu bar. Type `Keychain Access`, then press the Enter key to launch the app.

![](https://docs.github.com/assets/cb-273689/images/help/setup/keychain-access.png)

2. In Keychain Access, search for github.com.
3. Find the "internet password" entry for github.com.
4. Edit or delete the entry accordingly.

## What is a Repo?

> A repository is usually used to organize a single project. Repositories can contain folders and files, images, videos, spreadsheets, and data sets – anything your project needs. Often, repositories include a README file, a file with information about your project. GitHub makes it easy to add one at the same time you create your new repository. It also offers other common options such as a license file.

[Template of Repo](https://github.com/throughput-ec/Template)

Each GitHub repository:
- Has an owner, which could be an individual or an organization. 
- Can be set to public or private, determining who can see and interact with it. 

- While a repository can simply store files, GitHub is designed with collaboration in mind. Three key collaborative tools in GitHub are:
    - Issues: report a bug, plan improvements, or provide feedback to others working on the repository.
    - Discussions: post ideas or other conversations that are not as specific or actionable as an Issue.
    - Pull requests: propose a change to any of the files within a repository.

# Connect to GitHub

1. Go to https://github.com and make sure you are logged in.

2. Near “Repositories”, click the big green “New” button. Or, if you are on your own profile page, click on “Repositories”, then click the big green “New” button.

3. How to fill this in:
    - Repository template: No template.
    - Repository name: `myrepo` (or whatever you wish).
    - Description: “Repository for testing my Git/GitHub setup” or similar. It will appear in the README.
    - Public.
    - Initialize this repository with: Add a README file.
    - Click the big green button that says “Create repository”.
4. You will be taken to your repository

## Clone your Repo

- Click the big green button that says “<> Code”.
- Copy the HTTPS clone URL to your clipboard.

- Navigate to a desired directory. `pwd` displays the working directory. `cd` is the command to change directory. 

- Clone `myrepo` from GitHub to your computer. Use the URL we just copied from GitHub. 
    - This URL has your GitHub username and the name of your practice repo. 
    - If your shell cooperates, you should be able to paste the whole https://.... bit that we copied above. 
    - Some shells are not clipboard aware. In that sad case, you must type it. Accurately.

~~~ shell
git clone https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.git
~~~

This should look something like this:


~~~shell
~/tmp % git clone https://github.com/jennybc/myrepo.git
Cloning into 'myrepo'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (3/3), done.
~~~
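
After cloning, you can check where the copy came from. The sketch below uses a bare repository on the local disk as a stand-in for GitHub so it runs without a network connection; with a real repo you would pass the HTTPS URL instead. The folder name `my-local-copy` is an arbitrary choice.

~~~shell
work="$(mktemp -d)"
cd "$work"

# Stand-in for GitHub: an empty bare repository on the local disk.
git init -q --bare remote.git

# An optional second argument picks the name of the local folder.
# Git may warn that the cloned repository is empty; that is expected here.
git clone -q remote.git my-local-copy
cd my-local-copy

# "origin" is the nickname Git gives to the place you cloned from.
git remote -v
origin_name="$(git remote)"
origin_url="$(git remote get-url origin)"
echo "$origin_name -> $origin_url"
~~~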

## Modify your Repo

- Make a local change, commit, and push. 
    - e.g., open the README. Add a new line and save it.

- Verify that Git notices the change using your terminal:

~~~shell
git status
~~~


- This should look something like this:

~~~shell
~/tmp/myrepo % echo "A line I wrote on my local computer" >> README.md

~/tmp/myrepo % git status
On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   README.md

no changes added to commit (use "git add" and/or "git commit -a")
~~~
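
Two related commands are handy at this step: `git status --short` gives a compact summary, and `git diff` shows exactly which lines changed. A minimal sketch in a scratch repository (the identity and initial README content are made up):

~~~shell
scratch="$(mktemp -d)"
cd "$scratch"
git init -q
git config user.name "Jane Doe"
git config user.email "jane@example.com"
echo "# myrepo" > README.md
git add README.md
git commit -q -m "Initial commit"

# Make the same kind of local change as in the slide above.
echo "A line I wrote on my local computer" >> README.md

# Compact status: " M" means modified but not yet staged.
short_status="$(git status --short)"
echo "$short_status"

# Show the changed lines; added lines are prefixed with "+".
git diff
~~~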

# Stage, commit and push to your remote repo

- Stage (“add”) and commit this change and then push to your remote repo on GitHub.

- If you set your credentials earlier, you might not be prompted to give them again.

- **IMPORTANT** Since you are a new GitHub user and using HTTPS, you might be challenged for your username and password. 
    - Many general Git tools still frame the authentication task with this vocabulary. **By all means**, provide your GitHub username and the **PAT** you created earlier as the password. AGAIN: Do not enter your web password. Enter your PAT.

~~~shell

git add README.md
git commit -m "A commit from my local computer"
git push

~~~

# Stage, commit and push to your remote repo 2

This should produce output similar to the following:
~~~shell
~/tmp/myrepo % git add README.md

~/tmp/myrepo % git commit -m "A commit from my local computer"
[main e92528c] A commit from my local computer
 1 file changed, 1 insertion(+)
 
~/tmp/myrepo % git push
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 12 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 327 bytes | 327.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/jennybc/myrepo.git
   31dcaef..e92528c  main -> main
~~~
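
You can also confirm from the shell that the push arrived. The sketch below repeats the stage, commit, and push steps against a local bare repository standing in for GitHub (an assumption so it runs offline), then compares the latest commit on both sides.

~~~shell
work="$(mktemp -d)"
cd "$work"

# Stand-in for GitHub: a local bare repository.
git init -q --bare remote.git
# The "empty repository" warning is expected, so it is silenced.
git clone -q remote.git myrepo 2>/dev/null
cd myrepo
git config user.name "Jane Doe"
git config user.email "jane@example.com"

echo "A line I wrote on my local computer" >> README.md
git add README.md
git commit -q -m "A commit from my local computer"

# Push the current branch and remember origin as its upstream.
git push -q -u origin HEAD

# After a successful push, both sides point at the same commit.
local_commit="$(git rev-parse HEAD)"
remote_commit="$(git ls-remote --heads origin | cut -f1)"
echo "local:  $local_commit"
echo "remote: $remote_commit"
~~~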

## Confirm the local change propagated to the GitHub remote

1. Go back to the browser where you have your new GitHub repo.

2. Refresh.

3. You should see the new “A line I wrote on my local computer” in the README.

4. If you click on “commits,” you should see one with the message “A commit from my local computer.”

5. If you have made it this far, you are ready to start using Git and GitHub.

## Exploring GitHub for Code / Data Reusability

Some academics like using GitHub for storing and working with numerical data. The data then lives in the same repository as the code used for analysis, which turns a research project into a single, neatly packaged, reproducible object.

Some drawbacks to using GitHub to store your data:
- It doesn’t specialize in one kind of data, making it difficult to find data to reuse. 
- It doesn’t work well for large data stores (often described as big data); for example, GitHub blocks individual files larger than 100 MB.

Now that we have discussed APIs and GitHub, you can query GitHub's API to search for repositories that match certain keywords, and even check whether those repos contain CSV files.
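
As a starting point, here is a minimal sketch that builds a repository-search URL for GitHub's REST API. The helper name `search_url` and the example keywords are made up for illustration, and the actual request is left as a comment because it needs network access.

~~~shell
# Hypothetical helper: build a repository-search URL for GitHub's REST API.
# Spaces in the keywords are replaced with "+" (a simplification; other
# special characters would need proper URL encoding).
search_url() {
  keywords="$(printf '%s' "$1" | tr ' ' '+')"
  printf 'https://api.github.com/search/repositories?q=%s&per_page=5\n' "$keywords"
}

url="$(search_url "paleoecology data")"
echo "$url"

# Usage sketch (needs network; unauthenticated requests are rate limited):
#   curl -s -H "Accept: application/vnd.github+json" "$url"
~~~

Checking for CSV files would be a further step, for example by listing a matching repository's files through the API's contents endpoint and looking for names that end in `.csv`.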

# Efforts from others

[The Throughput Cookbook](https://throughputdb.com/)

## Other Public Databases / Datasets / Places to Find Data

https://www.kaggle.com/competitions


https://www.ebsco.com/academic-libraries/research-databases-archives

https://www150.statcan.gc.ca/n1/en/type/data

https://github.com/awesomedata/awesome-public-datasets