# Repositories
This chapter digs a little deeper into how Git stores information and how you can explore a repository's history.

## Table of Contents
[I. How does Git store information?](#one)  
[II. What is a hash?](#two)  
[III. How can I view a specific commit?](#three)  
[IV. What is Git's equivalent of a relative path?](#four)  
[V. How can I see who changed what in a file?](#five)  
[VI. How can I see what changed between two commits?](#six)  
[VII. How do I add new files?](#seven)  
[VIII. How do I tell Git to ignore certain files?](#eight)  
[IX. How can I remove unwanted files?](#nine)  
[X. How can I see how Git is configured?](#ten)  
[XI. How can I change my Git configuration?](#eleven)

## How does Git store information? <a name="one"></a>
Git uses a three-level structure to store information:
1. A **commit** contains metadata such as author, commit message, and time of commit.
2. Each commit has a **tree**, which tracks the names and locations of the files in the repo when that commit happened.
3. For each file listed in the tree, there is a **blob** which contains a compressed snapshot of the contents of the file when the commit happened.

Fun fact: *blob* is short for *binary large object*, which is a SQL database term for "may contain data of any kind".
    
    
![Commit Tree Blob](imgs/commit_tree_blob2.svg)

Note: In the middle commit, `report.md` and `draft.md` were changed, so the blobs are shown next to that commit. `data/northern.csv` didn't change in that commit, so the tree links to the blob from the previous commit. Reusing blobs between commits helps make common operations fast and minimizes storage space.
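
The three levels can be inspected directly with `git cat-file`. The following is a minimal sketch, assuming a throwaway repository created in a temporary directory; the author identity and the file `report.md` are placeholders chosen for illustration.

```shell
# Build a throwaway repository so the example is self-contained.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"   # hypothetical identity

echo "# Report" > report.md
git add report.md
git commit -q -m "Add report"

# Ask Git for the type of the object at each level.
commit_type="$(git cat-file -t HEAD)"
tree_type="$(git cat-file -t 'HEAD^{tree}')"
blob_type="$(git cat-file -t HEAD:report.md)"
echo "$commit_type $tree_type $blob_type"

# Pretty-print each level.
git cat-file -p HEAD            # metadata plus the hash of the tree
git cat-file -p 'HEAD^{tree}'   # one entry per file: mode, type, hash, name
git cat-file -p HEAD:report.md  # the file contents stored in the blob
```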

## What is a hash?<a name="two"></a>
- Every commit to a repo has a unique identifier called a **hash**
    - The hash is normally written as a 40-character hexadecimal string like `7c35a3ce607a14953f070f0f83b5d74c2296ef93`
    - Most of the time, you only have to give Git the first 6 or 8 characters in order to identify a commit
- Hashes enable Git to share data efficiently between repos
- If two files are the same, their hashes are guaranteed to be the same
    - Similarly, if two commits contain the same files and have the same ancestors and metadata (author, message, and timestamps), their hashes are the same
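
A quick way to see that identical content gives identical hashes is `git hash-object`, which computes the hash Git would assign to a blob. This is a small sketch with made-up input strings; it hashes file contents rather than commits.

```shell
# Hash the same content twice, then hash slightly different content.
hash_a="$(printf 'hello\n' | git hash-object --stdin)"
hash_b="$(printf 'hello\n' | git hash-object --stdin)"
hash_c="$(printf 'hello!\n' | git hash-object --stdin)"

echo "$hash_a"
echo "$hash_b"   # identical to the line above
echo "$hash_c"   # differs, because the content differs
```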
    
## How can I view a specific commit?<a name="three"></a>
- Use the command `git show` with the first few characters of the commit's hash
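
As an illustration, the sketch below creates a scratch repository with one commit and then views it through an abbreviated hash. The repository, identity, and file name are assumptions made for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"   # hypothetical identity

echo "first draft" > draft.md
git add draft.md
git commit -q -m "Add draft"

full_hash="$(git rev-parse HEAD)"            # the complete hash
short_hash="$(git rev-parse --short HEAD)"   # an unambiguous prefix of it
resolved="$(git rev-parse "$short_hash")"    # Git expands the prefix again

# Show the commit's metadata and changes using only the prefix.
git --no-pager show "$short_hash"
```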

## What is Git's equivalent of a relative path?<a name="four"></a>
- A hash is like an absolute path; it identifies a specific commit
- Another way to identify a commit is to use the equivalent of a relative path
- The special label `HEAD` always refers to the most recent commit, so `git show HEAD` displays it
- `HEAD~1` (with a tilde) then refers to the commit before it
- `HEAD~2` refers to the commit before that, and so on
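
A minimal sketch of relative references, assuming a scratch repository with three commits whose messages are invented for the example:

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"   # hypothetical identity

# Create three commits in a row.
for n in 1 2 3; do
  echo "version $n" > notes.txt
  git add notes.txt
  git commit -q -m "Commit $n"
done

# -s suppresses the diff; %s prints only the commit message subject.
newest="$(git show -s --format=%s HEAD)"
previous="$(git show -s --format=%s HEAD~1)"
oldest="$(git show -s --format=%s HEAD~2)"
echo "$newest / $previous / $oldest"
```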

## How can I see who changed what in a file?<a name="five"></a>
`git annotate file` shows who made the last change to each line of a file and when. Each line of the output contains five elements, with elements 2 to 4 enclosed in parentheses:
1. The first eight characters of the hash
2. The author
3. The time of the commit
4. The line number
5. The contents of the line
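
To see this output on a small case, the sketch below builds a two-line file where each line was last changed by a different author. Both author names and the file `notes.txt` are hypothetical.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"   # hypothetical identity

echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# A second (hypothetical) author appends a line.
echo "second line" >> notes.txt
git commit -q -a --author="Second Author <second@example.com>" -m "Extend notes"

# One output line per file line: hash, (author, time, line number), contents.
annotated="$(git annotate notes.txt)"
printf '%s\n' "$annotated"
```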

## How can I see what changed between two commits?<a name="six"></a>
- `git show` with a commit ID shows the changes made *in* a particular commit
- To see changes *between* two commits, you can use `git diff ID1..ID2`, where `ID1` and `ID2` identify the two commits
- Something like `git diff HEAD~1..HEAD~3` shows the differences between the state of the repo one commit in the past and its state three commits in the past
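
A short sketch of comparing two commits with `git diff`, assuming a scratch repository with three invented commits. The `--name-only` option lists just the files that differ between the two states.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"   # hypothetical identity

echo "# Report" > report.md
git add report.md
git commit -q -m "Add report"

echo "x,y" > data.csv
git add data.csv
git commit -q -m "Add data"

echo "More text" >> report.md
git commit -q -a -m "Extend report"

# Files that differ between the previous commit and the latest one.
last_change="$(git diff --name-only HEAD~1..HEAD)"
# Files that differ between two commits back and the latest one.
two_back="$(git diff --name-only HEAD~2..HEAD)"

# Full line-by-line differences for the latest step.
git --no-pager diff HEAD~1..HEAD
```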

## How do I add new files?<a name="seven"></a>
- Git doesn't track files by default
- Instead, it waits until you have used `git add` on a file at least once before it starts tracking it
- Recall that `git status` will always tell you about files that are in your repo but aren't (yet) being tracked
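
The sketch below shows the status of a new file before and after `git add`, using the compact `--porcelain` output of `git status`. The scratch repository and the file `data.csv` are assumptions for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

echo "x,y" > data.csv

# "??" marks a file that Git is not tracking yet.
before="$(git status --porcelain)"
echo "$before"

git add data.csv

# "A" marks a file that has been added and is now staged.
after="$(git status --porcelain)"
echo "$after"
```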

## How do I tell Git to ignore certain files?<a name="eight"></a>
- Data analysis often produces temporary or intermediate files that you don't want to save
- You can tell Git to stop paying attention to certain files by creating a file in the root directory of the repo called `.gitignore`
    - In this file you can store a list of **wildcard** patterns (e.g. `*.csv`) or filenames to specify the files you want Git to ignore
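
As a minimal sketch, the `.gitignore` below combines one wildcard pattern and one plain filename. All file names here are invented for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

# One pattern per line: a wildcard and an exact filename.
printf '%s\n' '*.csv' 'scratch.txt' > .gitignore

echo "1,2" > results.csv
echo "tmp" > scratch.txt
echo "keep me" > notes.txt

# Ignored files no longer appear in the list of untracked files.
status_out="$(git status --porcelain)"
echo "$status_out"

# Report which pattern (and which line of .gitignore) matches a file.
git check-ignore -v results.csv
```

Note that `.gitignore` itself shows up as untracked; it is usually committed so that everyone working on the repo shares the same rules.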
    
## How can I remove unwanted files?<a name="nine"></a>
- `git clean -n` will show you a list of files that are in the repo, but whose history Git is not currently tracking
- `git clean -f` will then delete those files

*Note: Use this command carefully*. `git clean` only works on untracked files, so by definition, their history has not been saved. If you delete them with `git clean -f`, they're gone for good.
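
A cautious workflow is to preview first and delete second. The sketch below assumes a scratch repository with one tracked file (`report.md`) and one untracked file (`scratch.tmp`), both hypothetical.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"   # hypothetical identity

echo "# Report" > report.md
git add report.md
git commit -q -m "Add report"

echo "temporary" > scratch.tmp   # never added, so it is untracked

# Dry run: list what would be deleted without deleting anything.
preview="$(git clean -n)"
echo "$preview"

# Actually delete the untracked file; the tracked file is left alone.
git clean -f
```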

## How can I see how Git is configured?<a name="ten"></a>
To see what the settings are, use `git config --list` with one of three additional options:
1. `--system`: settings for every user on this computer
2. `--global`: settings for every one of your projects
3. `--local`: settings for one specific project

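Each level overrides the one above it, so local settings take precedence over global settings, which in turn take precedence over system settings.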
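
The sketch below sets a value at the local level in a scratch repository and then reads it back. The name "Project Author" is a placeholder. `git config --list --show-origin` additionally prints the file each setting comes from, which helps when working out which level is in effect.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Set a value for this one project only.
git config --local user.name "Project Author"   # hypothetical name

# List only the settings stored in this repository.
local_settings="$(git config --list --local)"
echo "$local_settings"

# Without a level option, Git reports the value that is in effect,
# which is the local one whenever a local value exists.
effective_name="$(git config user.name)"
echo "$effective_name"

# Show every setting together with the file it was read from.
git config --list --show-origin
```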

## How can I change my Git configuration?<a name="eleven"></a>
Most of Git's settings should be left as is. However, there are two you should set on every computer you use: your name and email address. These are recorded in the log every time you commit a change.

To change a configuration value for all of your projects on a particular computer, run:

        $ git config --global setting value
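
For the name and email address, the setting names are `user.name` and `user.email`. The sketch below uses placeholder values and, as a precaution, points `HOME` at a scratch directory so that running it does not overwrite real settings.

```shell
# Use a scratch HOME so this demo cannot overwrite real settings.
# Remove this line to change your actual global configuration.
export HOME="$(mktemp -d)"

git config --global user.name "Example Author"        # placeholder value
git config --global user.email "author@example.com"   # placeholder value

# Read the values back to confirm they were stored.
saved_name="$(git config --global user.name)"
saved_email="$(git config --global user.email)"
echo "$saved_name <$saved_email>"
```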