# Getting to know Git
Tools used:
- `tree`
- `wc`
- `hexdump`

## The `.git` directory

In [1]:
git init first-repo

Initialized empty Git repository in /Users/trueskawka/Documents/Alicja/Projects/programming/building_git/first-repo/.git/


In [2]:
cd first-repo/

In [3]:
# list the directory structure
tree .git/

.git/
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

8 directories, 15 files


In [4]:
# configuration for this repository
cat .git/config

[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
	ignorecase = true
	precomposeunicode = true


- `repositoryformatversion = 0` - the repository uses version 0 of the on-disk format
- `filemode = true` - track each file's mode (e.g. whether it's executable)
- `bare = false` - the repository has a working copy whose files the user edits and commits
- `logallrefupdates = true` - the reflog is enabled
- `ignorecase = true` and `precomposeunicode = true` - filesystem handling on macOS: file names are case-insensitive, and Unicode file names are converted to their precomposed form

In [9]:
# name of the repository
cat .git/description

Unnamed repository; edit this file 'description' to name the repository.


In [10]:
# reference to the current commit
cat .git/HEAD

ref: refs/heads/master


In [12]:
# .git/info contains metadata about the repository that doesn't fit into the other main directories
# .git/info/exclude lists files that should not be checked in (not part of the source tree)
# unlike .gitignore, it is not itself tracked in Git
cat .git/info/exclude

# git ls-files --others --exclude-from=.git/info/exclude
# Lines that start with '#' are comments.
# For a project mostly in C, the following would be a good set of
# exclude patterns (uncomment them if you want to use them):
# *.[oa]
# *~


In [14]:
# scripts that Git will execute as part of core commands
# e.g. before creating a commit .git/hooks/pre-commit
# on init a lot of .sample files are created
cat .git/hooks/commit-msg.sample

#!/bin/sh
#
# An example hook script to check the commit log message.
# Called by "git commit" with one argument, the name of the file
# that has the commit message.  The hook should exit with non-zero
# status after issuing an appropriate message if it wants to stop the
# commit.  The hook is allowed to edit the commit message file.
#
# To enable this hook, rename this file to "commit-msg".

# Uncomment the below to add a Signed-off-by line to the message.
# Doing this in a hook is a bad idea in general, but the prepare-commit-msg
# hook is more suited to it.
#
# SOB=$(git var GIT_AUTHOR_IDENT | sed -n 's/^\(.*>\).*$/Signed-off-by: \1/p')
# grep -qs "^$SOB" "$1" || echo "$SOB" >> "$1"

# This example catches duplicate Signed-off-by lines.

test "" = "$(grep '^Signed-off-by: ' "$1" |
	 sort | uniq -c | sed -e '/^[ 	]*1[ 	]/d')" || {
	echo >&2 Duplicate Signed-off-by lines.
	exit 1
}


In [15]:
# .git/objects is where all the content is stored
# .git/objects/pack - storing in an optimized format (rolling up loose files in a pack)
# .git/objects/info - store metadata

In [16]:
# .git/refs stores pointers to .git/objects, e.g.:
# .git/refs/heads - latest commit on each local branch
# .git/refs/remotes - latest commit on each remote branch
# .git/refs/tags - tags
# .git/refs/stash - stashed objects reference

## A simple commit

In [5]:
echo "hello" > hello.txt
echo "world" > world.txt
git add .
git commit --message "First commit."

[master (root-commit) d9a710f] First commit.
 2 files changed, 2 insertions(+)
 create mode 100644 hello.txt
 create mode 100644 world.txt


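Git prints a summary of the commit:
- on `master` branch
- made a `root-commit` (has no parent)
- abbreviated commit hash, `d9a710f` in this run (`badbad5` in the run the listings below were recorded from; the timestamps differ, so the commit ID differs too)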
- commit message `First commit.`
- two new files with mode `100644`

In [18]:
tree .git

.git
├── COMMIT_EDITMSG
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── index
├── info
│   └── exclude
├── logs
│   ├── HEAD
│   └── refs
│       └── heads
│           └── master
├── objects
│   ├── 88
│   │   └── e38705fdbd3608cddbe904b67c731f3234c45b
│   ├── ba
│   │   └── dbad5bb37eae9c37336ccb7b0a5529fbe95acb
│   ├── cc
│   │   └── 628ccd10742baea8241c5924df992b5c019f71
│   ├── ce
│   │   └── 013625030ba8dba906f756967f9e9ca394464a
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags

15 directories, 24 files


In [19]:
# the message given for the latest commit
# without --message, Git opens this file in a text editor and reads the message once it is saved
# with --message, Git writes the given message to this file
cat .git/COMMIT_EDITMSG

First commit.


In [20]:
# stores information about each file in current commit
cat .git/index

DIRC      ];��:�S�];��:�S�  ,?�  ��  �      �6%�۩�V�����FJ 	hello.txt ];��ͻ�];��ͻ�  ,?�  ��  �      �b��t+��$Y$ߙ+\�q 	world.txt TREE    2 0
��������|s24�[ �FB���A*A��=�4�Ub

In [22]:
# log of every time a ref to a commit changes its value
# used by reflog
cat .git/logs/HEAD # symref to refs/heads/master
cat .git/logs/refs/heads/master

0000000000000000000000000000000000000000 badbad5bb37eae9c37336ccb7b0a5529fbe95acb Alicja Raszkowska <malavarena@gmail.com> 1564186848 -0700	commit (initial): First commit.
0000000000000000000000000000000000000000 badbad5bb37eae9c37336ccb7b0a5529fbe95acb Alicja Raszkowska <malavarena@gmail.com> 1564186848 -0700	commit (initial): First commit.


Recorded log event: `0000000000000000000000000000000000000000` (pointing at nothing) changed to commit hash `badbad5bb37eae9c37336ccb7b0a5529fbe95acb`.
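
A minimal sketch of how one reflog line could be split into its fields, assuming the layout seen above (old ID, new ID, identity, timestamp, timezone, then a tab and the message). `REFLOG_LINE` and `parse_reflog_line` are hypothetical names used only for illustration.

```ruby
# Hypothetical pattern for one reflog line, based on the output above.
REFLOG_LINE = /\A(\h{40}) (\h{40}) (.*) <(.*)> (\d+) ([+-]\d{4})\t(.*)\z/

# Returns a hash of fields, or nil if the line does not match the assumed layout.
def parse_reflog_line(line)
  match = REFLOG_LINE.match(line.chomp)
  return nil unless match

  {
    old_id: match[1],
    new_id: match[2],
    name: match[3],
    email: match[4],
    time: Time.at(match[5].to_i).utc,
    zone: match[6],
    message: match[7]
  }
end

line = "0000000000000000000000000000000000000000 badbad5bb37eae9c37336ccb7b0a5529fbe95acb " \
       "Alicja Raszkowska <malavarena@gmail.com> 1564186848 -0700\tcommit (initial): First commit."
entry = parse_reflog_line(line)
puts entry[:old_id]
puts entry[:new_id]
puts entry[:message]
```

An all-zero `old_id` marks the first event for a ref, when it pointed at nothing.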

In [23]:
# records which commit is at the tip of master
cat .git/refs/heads/master

badbad5bb37eae9c37336ccb7b0a5529fbe95acb
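
Putting `HEAD` and `refs/heads/master` together: a rough sketch of following the symbolic ref to a commit ID. `resolve_ref` is a hypothetical helper, and it only handles loose ref files as shown above (it assumes the ref file exists and ignores anything else Git may use to store refs).

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: follow "ref: ..." lines until a plain ID is found.
def resolve_ref(git_dir, name = "HEAD")
  content = File.read(File.join(git_dir, name)).strip
  if content.start_with?("ref: ")
    resolve_ref(git_dir, content.sub(/\Aref: /, ""))
  else
    content
  end
end

# Demo against a throwaway directory that mimics the two files above.
Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, "refs", "heads"))
  File.write(File.join(dir, "HEAD"), "ref: refs/heads/master\n")
  File.write(File.join(dir, "refs", "heads", "master"),
             "badbad5bb37eae9c37336ccb7b0a5529fbe95acb\n")
  puts resolve_ref(dir)
end
```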


## Storing objects

In [25]:
# print information about the HEAD commit
git show

commit badbad5bb37eae9c37336ccb7b0a5529fbe95acb (HEAD -> master)
Author: Alicja Raszkowska <malavarena@gmail.com>
Date:   Fri Jul 26 17:20:48 2019 -0700

    First commit.

diff --git a/hello.txt b/hello.txt
new file mode 100644
index 0000000..ce01362
--- /dev/null
+++ b/hello.txt
@@ -0,0 +1 @@
+hello
diff --git a/world.txt b/world.txt
new file mode 100644
index 0000000..cc628cc
--- /dev/null
+++ b/world.txt
@@ -0,0 +1 @@
+world


One of the files in `.git/objects` is `ba/dbad5bb37eae9c37336ccb7b0a5529fbe95acb`: the commit ID, with its first two characters used as the directory name.
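
The mapping from ID to file path can be sketched like this (`object_path` is a hypothetical helper name):

```ruby
# Hypothetical helper: the first two hex characters become the directory,
# the remaining 38 become the file name.
def object_path(git_dir, oid)
  File.join(git_dir, "objects", oid[0, 2], oid[2..-1])
end

puts object_path(".git", "badbad5bb37eae9c37336ccb7b0a5529fbe95acb")
# prints .git/objects/ba/dbad5bb37eae9c37336ccb7b0a5529fbe95acb
```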

In [26]:
# print commit file from the database with -p pretty print
git cat-file -p badbad5bb37eae9c37336ccb7b0a5529fbe95acb

tree 88e38705fdbd3608cddbe904b67c731f3234c45b
author Alicja Raszkowska <malavarena@gmail.com> 1564186848 -0700
committer Alicja Raszkowska <malavarena@gmail.com> 1564186848 -0700

First commit.


Commit representation:
- first line `tree` followed by tree ID (commits point to trees)
- author and committer lines, with timestamps
- blank line
- commit message

The tree ID does not depend on the committer or timestamps, only on the names, modes and contents of the files, so identical files produce the same tree ID.
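
A small sketch of reading the commit representation back into its parts, assuming the layout described above (header lines, a blank line, then the message). `parse_commit` is a hypothetical helper; headers are kept as an ordered list of pairs rather than a hash.

```ruby
# Hypothetical helper: split the text printed by `git cat-file -p <commit>`.
def parse_commit(text)
  header_text, message = text.split("\n\n", 2)
  headers = header_text.lines.map { |line| line.chomp.split(" ", 2) }
  { headers: headers, message: message }
end

commit = <<~COMMIT
  tree 88e38705fdbd3608cddbe904b67c731f3234c45b
  author Alicja Raszkowska <malavarena@gmail.com> 1564186848 -0700
  committer Alicja Raszkowska <malavarena@gmail.com> 1564186848 -0700

  First commit.
COMMIT

parsed = parse_commit(commit)
parsed[:headers].each { |key, value| puts "#{key}: #{value}" }
puts parsed[:message]
```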

In [28]:
# let's look at the tree
git cat-file -p 88e38705fdbd3608cddbe904b67c731f3234c45b

100644 blob ce013625030ba8dba906f756967f9e9ca394464a	hello.txt
100644 blob cc628ccd10742baea8241c5924df992b5c019f71	world.txt


Git stores:
- one tree for every directory in the project (including the root)
- in each tree, a list of that directory's contents: either trees (subdirectories) or blobs (files)

Each entry lists:
- mode - numeric representation of the type and permissions
- type - blob or tree
- ID
- filename
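
The mode is an octal number. A sketch of splitting it into type and permission bits, assuming the usual Unix file-type bit values; `describe_mode` is a hypothetical helper.

```ruby
# Hypothetical helper: decode an octal mode string from a tree entry.
def describe_mode(mode_string)
  mode = mode_string.to_i(8)
  type = case mode & 0o170000 # upper bits hold the file type
         when 0o100000 then :regular_file
         when 0o040000 then :directory
         when 0o120000 then :symlink
         else :other
         end
  { type: type, permissions: format("%o", mode & 0o777) }
end

p describe_mode("100644") # regular file, permissions 644
p describe_mode("100755") # regular file, permissions 755 (executable)
p describe_mode("40000")  # directory, no permission bits recorded
```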

In [30]:
# let's look at a blob
# storing just the content
git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a

hello


In [31]:
git cat-file -p cc628ccd10742baea8241c5924df992b5c019f71

world


### Blobs on disk

In [33]:
# this file is compressed with DEFLATE algorithm, used in zlib
cat .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a

xK��OR0c�H���� �

In [36]:
# decompress and print data
# -r zlib - require zlib
# -e - inline script
alias inflate='ruby -r zlib -e "STDOUT.write Zlib::Inflate.inflate(STDIN.read)"'

In [37]:
cat .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a | inflate

blob 6 hello


In [38]:
# "blob 6" is probably a length header
# "blob 6" - 6 bytes
# "hello\n" - 6 bytes
# where is one extra byte coming from?
cat .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a | inflate | wc -c

      13


In [39]:
# print hexadecimal representation of the bytes in a file alongside textual
cat .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a | inflate | hexdump -C

00000000  62 6c 6f 62 20 36 00 68  65 6c 6c 6f 0a           |blob 6.hello.|
0000000d


All files are just a list of bytes; `hexdump` shows the numeric values of those bytes in hexadecimal, in rows of 16 values each.

The leftmost column shows the offset into the file at which each row begins (in hexadecimal).

The value at the bottom shows the total size, e.g. `0000000d` is 13 bytes.

The column on the right, between the `|` characters, shows the ASCII character for each byte. If a byte is not printable in ASCII, `hexdump` prints `.` in its place (to make it easier to count by eye).
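
To make the layout concrete, here is a rough Ruby imitation of that output: offset column, two groups of eight hex values, then the ASCII column. The `hexdump` method below is a hypothetical sketch, not a replacement for the real tool.

```ruby
# Hypothetical sketch of a `hexdump -C`-style dump.
def hexdump(data)
  lines = []
  data.bytes.each_slice(16).with_index do |row, index|
    hex = row.map { |byte| format("%02x", byte) }
    left = hex[0, 8].join(" ")
    right = (hex[8, 8] || []).join(" ")
    # Printable ASCII is 0x20..0x7e; everything else is shown as "."
    text = row.map { |byte| byte.between?(0x20, 0x7e) ? byte.chr : "." }.join
    lines << format("%08x  %-23s  %-23s  |%s|", index * 16, left, right, text)
  end
  lines << format("%08x", data.bytesize) # final line: total size
  lines.join("\n")
end

puts hexdump("blob 6\0hello\n")
```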

In this file:
- `62 6c 6f 62` - `blob`
- `20` - space
- `36`- digit `6`
- `00` - null byte
- `68  65 6c 6c 6f 0a` - `hello\n`

Git stores a blob as `blob <length><null byte><content>`, compressed with `zlib`.
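
A minimal round trip of that format, as a sketch: build the header, deflate, inflate again and split the header off. `serialize_blob` and `parse_object` are hypothetical helper names.

```ruby
require "zlib"

# Hypothetical helper: "blob", a space, the byte length, a null byte, the content.
def serialize_blob(content)
  "blob #{content.bytesize}\0#{content}"
end

# Hypothetical helper: split an inflated object into type, size and content.
def parse_object(serialized)
  header, content = serialized.split("\0", 2)
  type, size = header.split(" ")
  [type, size.to_i, content]
end

stored = Zlib::Deflate.deflate(serialize_blob("hello\n"))
type, size, content = parse_object(Zlib::Inflate.inflate(stored))
puts type            # blob
puts size            # 6
puts content.inspect # "hello\n"
```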

```
Decimal    Hex    Binary
      0      0      0000
      1      1      0001
      2      2      0010
      3      3      0011
      4      4      0100
      5      5      0101
      6      6      0110
      7      7      0111
      8      8      1000
      9      9      1001
     10      a      1010
     11      b      1011
     12      c      1100
     13      d      1101
     14      e      1110
     15      f      1111
```

### Trees on disk

In [40]:
cat .git/objects/88/e38705fdbd3608cddbe904b67c731f3234c45b | inflate

tree 74 100644 hello.txt �6%�۩�V�����FJ100644 world.txt �b��t+��$Y$ߙ+\�q

In [41]:
cat .git/objects/88/e38705fdbd3608cddbe904b67c731f3234c45b | inflate | hexdump -C

00000000  74 72 65 65 20 37 34 00  31 30 30 36 34 34 20 68  |tree 74.100644 h|
00000010  65 6c 6c 6f 2e 74 78 74  00 ce 01 36 25 03 0b a8  |ello.txt...6%...|
00000020  db a9 06 f7 56 96 7f 9e  9c a3 94 46 4a 31 30 30  |....V......FJ100|
00000030  36 34 34 20 77 6f 72 6c  64 2e 74 78 74 00 cc 62  |644 world.txt..b|
00000040  8c cd 10 74 2b ae a8 24  1c 59 24 df 99 2b 5c 01  |...t+..$.Y$..+\.|
00000050  9f 71                                             |.q|
00000052


Structure:
- `74 72 65 65` - `tree`, type of the object
- `20` - space
- `37 34` - size (`74`)
- `00` - null byte (end of length)
-  `31 30 30 36 34 34` - `100644`, file mode
- `20` - space
- `68 65 6c 6c 6f 2e 74 78 74` - `hello.txt`, file name
- `00` - null byte (end of file name)
- `ce 01 36 25 03 0b a8 db a9 06 f7 56 96 7f 9e 9c a3 94 46 4a` - object ID
- `31 30 30 36 34 34` - `100644`, file mode
- `20` - space
- `77 6f 72 6c  64 2e 74 78 74` - `world.txt`, file name
- `00` - null byte (end of file name)
- `cc 62 8c cd 10 74 2b ae a8 24 1c 59 24 df 99 2b 5c 01 9f 71` - object ID
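
The same structure can be walked in code. A sketch, assuming the tree body (everything after the `tree 74` header and its null byte) is laid out exactly as listed above; `parse_tree` is a hypothetical helper.

```ruby
# Hypothetical helper: walk "<mode> <name>\0<20 raw bytes>" entries.
def parse_tree(body)
  data = body.b # work on raw bytes
  entries = []
  offset = 0
  while offset < data.bytesize
    space = data.index(" ", offset)
    null = data.index("\0", space)
    entries << {
      mode: data[offset...space],
      name: data[(space + 1)...null].force_encoding("UTF-8"),
      oid: data[null + 1, 20].unpack1("H40") # 20 bytes -> 40 hex digits
    }
    offset = null + 21
  end
  entries
end

# Rebuild the body from the hexdump above and parse it again.
body = "100644 hello.txt\0".b + ["ce013625030ba8dba906f756967f9e9ca394464a"].pack("H40") +
       "100644 world.txt\0".b + ["cc628ccd10742baea8241c5924df992b5c019f71"].pack("H40")
puts body.bytesize # 74
parse_tree(body).each { |entry| puts "#{entry[:mode]} #{entry[:oid]} #{entry[:name]}" }
```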

Why are object IDs 40 chars?

- object IDs are hexadecimal representations of numbers
- each digit represents four bits in binary
- in a 40-digit object ID each digit stands for four bits of a 160-bit number
- in a tree it is stored in binary, as twenty blocks of eight bits (20 bytes)
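
A quick check of that arithmetic, using Ruby's `pack`/`unpack1` to move between the 40-digit text form and the 20-byte binary form (`oid_to_bytes` and `bytes_to_oid` are hypothetical helper names):

```ruby
# Hypothetical helpers: convert between hex text and raw bytes.
def oid_to_bytes(oid)
  [oid].pack("H*")
end

def bytes_to_oid(raw)
  raw.unpack1("H*")
end

oid = "ce013625030ba8dba906f756967f9e9ca394464a"
raw = oid_to_bytes(oid)

puts oid.length      # 40 hex digits
puts oid.length * 4  # 160 bits
puts raw.bytesize    # 20 bytes
puts oid[0, 2].to_i(16).to_s(2).rjust(8, "0") # "ce" as bits: 11001110
puts bytes_to_oid(raw) == oid                 # true
```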

### Commits on disk

In [44]:
cat .git/objects/ba/dbad5bb37eae9c37336ccb7b0a5529fbe95acb | inflate

commit 194 tree 88e38705fdbd3608cddbe904b67c731f3234c45b
author Alicja Raszkowska <malavarena@gmail.com> 1564186848 -0700
committer Alicja Raszkowska <malavarena@gmail.com> 1564186848 -0700

First commit.


In [45]:
cat .git/objects/ba/dbad5bb37eae9c37336ccb7b0a5529fbe95acb | inflate | hexdump -C

00000000  63 6f 6d 6d 69 74 20 31  39 34 00 74 72 65 65 20  |commit 194.tree |
00000010  38 38 65 33 38 37 30 35  66 64 62 64 33 36 30 38  |88e38705fdbd3608|
00000020  63 64 64 62 65 39 30 34  62 36 37 63 37 33 31 66  |cddbe904b67c731f|
00000030  33 32 33 34 63 34 35 62  0a 61 75 74 68 6f 72 20  |3234c45b.author |
00000040  41 6c 69 63 6a 61 20 52  61 73 7a 6b 6f 77 73 6b  |Alicja Raszkowsk|
00000050  61 20 3c 6d 61 6c 61 76  61 72 65 6e 61 40 67 6d  |a <malavarena@gm|
00000060  61 69 6c 2e 63 6f 6d 3e  20 31 35 36 34 31 38 36  |ail.com> 1564186|
00000070  38 34 38 20 2d 30 37 30  30 0a 63 6f 6d 6d 69 74  |848 -0700.commit|
00000080  74 65 72 20 41 6c 69 63  6a 61 20 52 61 73 7a 6b  |ter Alicja Raszk|
00000090  6f 77 73 6b 61 20 3c 6d  61 6c 61 76 61 72 65 6e  |owska <malavaren|
000000a0  61 40 67 6d 61 69 6c 2e  63 6f 6d 3e 20 31 35 36  |a@gmail.com> 156|
000000b0  34 31 38 36 38 34 38 20  2d 30 37 30 30 0a 0a 46  |4186848 -0700..F|
000000c0  69 72 73 74 20 63 6f 6d  6d 69 74 2e 0a           |irst commit..|
000000cd

Structure:
- `commit <commit-length>`
- `00` - null byte after length
- `tree` - every commit refers to a single tree representing the state of the files at that point in history (a commit points to a complete snapshot of the project, not a diff)
- `author`
- `committer`
- blank line
- commit message
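
A sketch of assembling that structure by hand, assuming exactly the fields listed above. `serialize_commit` is a hypothetical helper; the values are taken from the commit shown in this section.

```ruby
require "digest/sha1"

# Hypothetical helper: header lines, blank line, message, then the size header.
def serialize_commit(tree:, author:, committer:, message:)
  body = [
    "tree #{tree}",
    "author #{author}",
    "committer #{committer}",
    "",
    message
  ].join("\n") + "\n"
  "commit #{body.bytesize}\0#{body}"
end

identity = "Alicja Raszkowska <malavarena@gmail.com> 1564186848 -0700"
commit = serialize_commit(
  tree: "88e38705fdbd3608cddbe904b67c731f3234c45b",
  author: identity,
  committer: identity,
  message: "First commit."
)

puts commit.split("\0", 2).first  # commit 194
puts Digest::SHA1.hexdigest(commit)
```

Changing only the timestamp changes the hash of the commit while the `tree` line stays the same.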

### Computing object IDs

In [60]:
echo '
require "digest/sha1"
require "zlib"

string = "hello\n"
puts "raw:"
puts Digest::SHA1.hexdigest(string)

blob = "blob #{ string.bytesize }\0#{ string }" 
puts "blob:"
puts Digest::SHA1.hexdigest(blob)

zipped = Zlib::Deflate.deflate(blob)
puts "zipped:"
puts Digest::SHA1.hexdigest(zipped)
' > sha.rb

In [61]:
ruby sha.rb

raw:
f572d396fae9206628714fb2ce00f72e94f2258f
blob:
ce013625030ba8dba906f756967f9e9ca394464a
zipped:
3a3cca74450ee8a0245e7c564ac9e68f8233b1e8


The ID is computed by:
- taking the file content
- prepending its length in bytes and a null byte
- prepending `blob` and a space
- hashing the result with SHA-1

This helps speed up comparisons - if two objects have the same ID, then Git treats them as having the same content (within the limits of SHA-1). It reduces the amount of work required when comparing commits on large projects.

If an object ID appears in one repository, then it and everything it refers to does not need to be re-downloaded. You then only need to download missing objects.

This also explains why objects are hashed before compression: different compression settings produce different compressed bytes and therefore different hashes, which would make it impossible to decide from the hash alone whether two objects have the same content.
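
A small experiment along those lines, as a sketch: the same blob deflated at two zlib levels gives different bytes on disk, but inflates to identical content with a single ID. `git_object_id` is a hypothetical helper name.

```ruby
require "digest/sha1"
require "zlib"

# Hypothetical helper: the ID is the SHA-1 of the uncompressed object.
def git_object_id(serialized)
  Digest::SHA1.hexdigest(serialized)
end

blob = "blob 6\0hello\n"
stored_only = Zlib::Deflate.deflate(blob, Zlib::NO_COMPRESSION)
smallest = Zlib::Deflate.deflate(blob, Zlib::BEST_COMPRESSION)

puts stored_only == smallest                        # false: bytes on disk differ
puts Zlib::Inflate.inflate(stored_only) == blob     # true
puts Zlib::Inflate.inflate(smallest) == blob        # true
puts git_object_id(blob)                            # ce013625030ba8dba906f756967f9e9ca394464a
```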

### Bear necessities

Minimum viable Git repository:

```
.git
 ├── HEAD
 ├── objects
 │   ├── 88
 │   │   └── e38705fdbd3608cddbe904b67c731f3234c45b
 │   ├── ba
 │   │   └── dbad5bb37eae9c37336ccb7b0a5529fbe95acb
 │   ├── cc
 │   │   └── 628ccd10742baea8241c5924df992b5c019f71
 │   ├── ce
 │   │   └── 013625030ba8dba906f756967f9e9ca394464a
 └── refs
     └── heads
         └── master
```

If we can write some code that:
- stores itself as a commit, a tree and some blobs
- writes the ID of that commit to HEAD

then we’re up and running.
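
A rough sketch of those two steps, written against a temporary directory. `write_object` and `build_minimal_repo` are hypothetical helpers, the author identity and timestamp are placeholders, and nothing here is checked against a real `git` binary.

```ruby
require "digest/sha1"
require "fileutils"
require "tmpdir"
require "zlib"

# Hypothetical helper: serialize, hash, compress and store one object.
def write_object(git_dir, type, content)
  serialized = "#{type} #{content.bytesize}\0".b + content.b
  oid = Digest::SHA1.hexdigest(serialized)
  path = File.join(git_dir, "objects", oid[0, 2], oid[2..-1])
  unless File.exist?(path) # identical content is stored only once
    FileUtils.mkdir_p(File.dirname(path))
    File.binwrite(path, Zlib::Deflate.deflate(serialized))
  end
  oid
end

# Hypothetical helper: two blobs, one tree, one commit, then HEAD.
def build_minimal_repo(git_dir)
  files = { "hello.txt" => "hello\n", "world.txt" => "world\n" }
  entries = files.sort.map do |name, content|
    blob_id = write_object(git_dir, "blob", content)
    "100644 #{name}\0".b + [blob_id].pack("H40")
  end
  tree_id = write_object(git_dir, "tree", entries.join)

  # Placeholder identity and timestamp, not taken from the notes above.
  identity = "Example Author <author@example.com> 0 +0000"
  commit = "tree #{tree_id}\nauthor #{identity}\ncommitter #{identity}\n\nFirst commit.\n"
  commit_id = write_object(git_dir, "commit", commit)

  File.write(File.join(git_dir, "HEAD"), "#{commit_id}\n")
  commit_id
end

Dir.mktmpdir do |dir|
  puts build_minimal_repo(File.join(dir, ".git"))
end
```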

## The `parent` field

In [6]:
echo "second" > hello.txt
git add .
git commit --message "Second commit."

[master f826d50] Second commit.
 1 file changed, 1 insertion(+), 1 deletion(-)


In [7]:
cat .git/refs/heads/master

f826d50b3070fbd98c329cfeab412f0c8fd9b60a


In [8]:
git cat-file -p f826d50b3070fbd98c329cfeab412f0c8fd9b60a

tree 040c6f3e807f0d433870584bc91e06b6046b955d
parent d9a710f4b3b5d45de8e8b73bc614effda7706bb6
author Alicja Raszkowska <malavarena@gmail.com> 1564251489 -0700
committer Alicja Raszkowska <malavarena@gmail.com> 1564251489 -0700

Second commit.


In [9]:
git cat-file -p d9a710f4b3b5d45de8e8b73bc614effda7706bb6

tree 88e38705fdbd3608cddbe904b67c731f3234c45b
author Alicja Raszkowska <malavarena@gmail.com> 1564251438 -0700
committer Alicja Raszkowska <malavarena@gmail.com> 1564251438 -0700

First commit.


In [10]:
git cat-file -p 88e38705fdbd3608cddbe904b67c731f3234c45b

100644 blob ce013625030ba8dba906f756967f9e9ca394464a	hello.txt
100644 blob cc628ccd10742baea8241c5924df992b5c019f71	world.txt


In [12]:
# still references the same blob for the unchanged file
git cat-file -p 040c6f3e807f0d433870584bc91e06b6046b955d

100644 blob e019be006cf33489e2d0177a3837a2384eddebc5	hello.txt
100644 blob cc628ccd10742baea8241c5924df992b5c019f71	world.txt


In [13]:
git cat-file -p e019be006cf33489e2d0177a3837a2384eddebc5

second


Git stores new blobs only for files whose content has changed, rather than storing a complete copy of the project on every commit. Files with the same content point to the same blob, so most of the contents of a tree can be reused.
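
Comparing the two tree listings above makes the reuse visible. A sketch, with each tree written out by hand as a path-to-ID hash; `changed_paths` and `reused_paths` are hypothetical helpers.

```ruby
# Hypothetical helper: paths whose blob ID differs between two trees.
def changed_paths(old_tree, new_tree)
  (old_tree.keys | new_tree.keys).select { |path| old_tree[path] != new_tree[path] }
end

# Hypothetical helper: paths that still point at the same blob.
def reused_paths(old_tree, new_tree)
  new_tree.select { |path, oid| old_tree[path] == oid }.keys
end

first_tree = {
  "hello.txt" => "ce013625030ba8dba906f756967f9e9ca394464a",
  "world.txt" => "cc628ccd10742baea8241c5924df992b5c019f71"
}
second_tree = {
  "hello.txt" => "e019be006cf33489e2d0177a3837a2384eddebc5",
  "world.txt" => "cc628ccd10742baea8241c5924df992b5c019f71"
}

p changed_paths(first_tree, second_tree) # ["hello.txt"]
p reused_paths(first_tree, second_tree)  # ["world.txt"]
```

Only IDs are compared, so no file content has to be read to find what changed.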