# Git log analysis - Working with basic Python (No Numpy, Pandas)

---

# 0/ Import the necessary libraries

In [1]:
import datetime

# 1/ Get data

We will get the log data of the [nbdime repository](https://github.com/KienTrann/nbdime). Nbdime is a tool that helps Git diff and merge notebook files appropriately (instead of treating them as regular text files). The link points not to the original nbdime repository but to my _fork_ of it.

In [None]:
!git clone https://github.com/KienTrann/nbdime.git

## 1.1/ Check repo size

In [4]:
!du -sh --apparent-size ./nbdime

9.8M	./nbdime


## 1.2/ Get log and save to __log.tsv__

Collect the following fields for each commit:

|Column|Meaning|
|-|-|
|__Id__|abbreviated commit hash|
|__ParentIds__|abbreviated parent hashes|
|__AuthorName__|author name|
|__AuthorEmail__|author email|
|__AuthorDate__|author date|
|__Subject__|commit message subject (title line)|
|__ChangedFiles__|changed files, separated by commas|

In [6]:
!cd nbdime && git log --pretty=format:"%h    %p    %an    %ae    %aD    %s    " --name-only > ../log.tsv

In [7]:
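lines = []
with open("log.tsv", 'r', encoding='utf-8') as f:
    lines = f.read().splitlines()

    # The format string separates fields with four spaces; turn them into tabs
    for i in range(len(lines)):
        lines[i] = lines[i].replace("    ", "\t")

    # Merge each commit's header line and its file-name lines into one row
    for i in range(len(lines) - 1):
        # Header line (several tab-separated fields)
        if len(lines[i].split("\t")) > 1:
            if len(lines[i+1].split("\t")) == 1:
                # File names (or a blank line) follow: keep the row open
                continue
            else:
                # Another header follows: this commit lists no files, end the row
                lines[i] = lines[i] + '\n'
                continue

        # Blank separator line: contributes nothing to the output
        if lines[i] == "":
            continue
        
        # File-name line: add a comma if more follows, otherwise end the row
        if lines[i+1] != "":
            lines[i] = lines[i] + ','
        else:
            lines[i] = lines[i] + '\n'

# Overwrite current file
with open("log.tsv", 'w', encoding='utf-8') as f:
    f.write("Id\tParentIds\tAuthorName\tAuthorEmail\tAuthorDate\tSubject\tChangedFiles\n")
    for i in lines:
        f.write(i)
    f.write('\n')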
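The reshaping above can also be read as a grouping step: a line containing tabs starts a new commit, and the single-field lines that follow are its changed files. Below is a minimal sketch on an in-memory sample, assuming that layout. The helper name `group_log_lines` and the sample commits are hypothetical and are not part of the notebook.

```python
def group_log_lines(lines):
    """Group raw log lines into one tab-separated row per commit."""
    rows = []
    header, files = None, []
    for line in lines:
        if "\t" in line:             # a header line starts a new commit
            if header is not None:
                rows.append(header + ",".join(files))
            header, files = line, []
        elif line:                   # a non-empty single-field line is a file name
            files.append(line)
    if header is not None:           # flush the last commit
        rows.append(header + ",".join(files))
    return rows


# Hypothetical sample: headers end with a tab, file names follow on their own lines
sample = [
    "a1\t\tAnn\tann@example.com\tMon, 9 Nov 2015 13:13:28 +0100\tFirst\t",
    "README.md",
    "setup.py",
    "",
    "c3\ta1 d4\tBob\tbob@example.com\tTue, 10 Nov 2015 09:00:00 +0100\tMerge\t",
    "e5\tc3\tAnn\tann@example.com\tWed, 11 Nov 2015 10:00:00 +0100\tFix\t",
    "docs/README.md",
]
rows = group_log_lines(sample)
for row in rows:
    print(repr(row.split("\t")[-1]))
```

Each resulting row keeps the seven fields of the header, with the file names joined by commas in the last one (empty for the commit that lists no files).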

## 1.3/ Check whether the data is correct
If the check fails, the Git repository has probably changed since the reference file was created.

In [5]:
!wget https://raw.githubusercontent.com/the0nlyWyvern/git-log-analysis/main/correct_log.tsv

--2023-01-30 14:04:31--  https://raw.githubusercontent.com/the0nlyWyvern/git-log-analysis/main/correct_log.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 370723 (362K) [text/plain]
Saving to: ‘correct_log.tsv’


2023-01-30 14:04:32 (45.6 MB/s) - ‘correct_log.tsv’ saved [370723/370723]



In [8]:
# Read both files as UTF-8 and compare their full contents
with open('log.tsv', 'r', encoding='utf-8') as f:
    log = f.read()
with open('correct_log.tsv', 'r', encoding='utf-8') as f:
    correct_log = f.read()
assert log == correct_log
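A bare `assert` only says that the two files differ, not where. As a possible debugging aid, the sketch below reports the first differing line. The helper `first_difference` is hypothetical, and it works on in-memory strings so that it runs without the downloaded files.

```python
from itertools import zip_longest


def first_difference(text_a, text_b):
    """Return (line_number, line_a, line_b) for the first differing line, or None."""
    pairs = zip_longest(text_a.splitlines(), text_b.splitlines())
    for number, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return number, a, b
    return None


reference = "Id\tSubject\na1\tFirst\n"
same = first_difference(reference, reference)
diff = first_difference(reference, reference + "b2\tSecond\n")
print(same)
print(diff)
```

A missing line shows up as `None` on the shorter side. Because `splitlines()` drops line endings, this helper does not detect a difference that consists only of a trailing newline.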

---

# 2/ Explore the data

## 2.1/ Number of rows/columns

In [9]:
cols = {}

with open('correct_log.tsv', 'r', encoding='utf-8') as f:
    # The first line is the header row
    cols_name = f.readline().rstrip('\n').split("\t")
    for name in cols_name:
        cols[name] = []

    # Every following line is one commit; append each field to its column
    for line in f:
        fields = line.rstrip('\n').split("\t")
        for name, value in zip(cols_name, fields):
            cols[name].append(value)


num_cols = len(cols)
num_rows = len(cols['Id'])

In [11]:
print(num_cols)
print(num_rows)

7
1928
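The same column-oriented loading could also lean on the standard library's `csv` module, which is still "basic Python". This is a sketch on a small hypothetical sample rather than the real file; `quoting=csv.QUOTE_NONE` is chosen here on the assumption that quote characters in commit subjects should be kept literally.

```python
import csv
import io

# Hypothetical three-column sample standing in for the real TSV file
sample = "Id\tSubject\tChangedFiles\na1\tFirst\tREADME.md,setup.py\nb2\tMerge \"x\"\t\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
header = next(reader)                      # first row holds the column names
columns = {name: [] for name in header}
for row in reader:
    for name, value in zip(header, row):
        columns[name].append(value)

print(len(columns), len(columns["Id"]))
print(columns["ChangedFiles"])
```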


## 2.2/ What does each line mean? Do any lines have a different meaning?

Each line corresponds to a commit. There don't seem to be any lines that are "out of line" (i.e., no lines with a different meaning from the others).

## 2.3/ Does the data have duplicate lines?

In [12]:
# Ids identify commits, so duplicated lines would show up as duplicated Ids
duplicated_id = len(cols['Id']) != len(set(cols['Id']))

duplicated_id

False
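The check above answers yes or no. If it ever returned `True`, it would help to know which values repeat. A minimal sketch with `collections.Counter` follows; the helper name `duplicated_values` is hypothetical and the sample hashes are for illustration only.

```python
from collections import Counter


def duplicated_values(values):
    """Return the values that occur more than once, with their counts."""
    return {value: n for value, n in Counter(values).items() if n > 1}


no_dups = duplicated_values(["a031d32", "655fe8d", "2d8b4b2"])
with_dups = duplicated_values(["a031d32", "655fe8d", "a031d32"])
print(no_dups)
print(with_dups)
```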

## 2.4/ What does each column mean?

See section __1.2/__.

---

# 3/ Cleaning the data

## 3.1/ Change __"AuthorDate"__ from __string__ to __datetime__.

[How to use `strptime`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) 

In [13]:
new_col = []

for i in cols['AuthorDate']:
    # %H:%M:%S is spelled out because %X depends on the current locale
    d = datetime.datetime.strptime(i, "%a, %d %b %Y %H:%M:%S %z")
    new_col.append(d)

cols['AuthorDate'] = new_col
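Git's `%aD` placeholder emits the author date in RFC 2822 style, so the standard library's `email.utils.parsedate_to_datetime` is an alternative to a hand-written format string. The sketch below parses one sample string both ways; the sample value is an assumption modelled on the earliest date in this log.

```python
import datetime
from email.utils import parsedate_to_datetime

raw = "Mon, 9 Nov 2015 13:13:28 +0100"  # assumed sample in the %aD style

# Explicit format string (English day and month names are assumed)
via_strptime = datetime.datetime.strptime(raw, "%a, %d %b %Y %H:%M:%S %z")
# RFC 2822 parser from the standard library
via_email = parsedate_to_datetime(raw)

print(via_strptime.isoformat())
print(via_strptime == via_email)
```

Both calls return a timezone-aware `datetime`, so the results can be compared directly.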

## 3.2/ For each column with numeric and datetime data types, how are the values distributed?

There are no missing values in __"AuthorDate"__: an empty string `''` would have made `strptime` raise `ValueError: time data '' does not match format ...`.

Next, compute the min and max of the "AuthorDate" column and store them in `author_date_min` and `author_date_max`.

In [15]:
author_date_min = min(cols['AuthorDate'])
author_date_max = max(cols['AuthorDate'])

print(f"From {author_date_min} to {author_date_max}")

From 2015-11-09 13:13:28+01:00 to 2021-04-20 12:27:09+01:00
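Because the parsed dates are timezone-aware, `min` and `max` compare actual instants rather than the wall-clock digits. The short sketch below illustrates this with two hypothetical timestamps in different UTC offsets.

```python
import datetime

tz_plus1 = datetime.timezone(datetime.timedelta(hours=1))
tz_plus2 = datetime.timezone(datetime.timedelta(hours=2))

a = datetime.datetime(2021, 4, 20, 12, 27, 9, tzinfo=tz_plus1)  # 11:27:09 UTC
b = datetime.datetime(2021, 4, 20, 13, 0, 0, tzinfo=tz_plus2)   # 11:00:00 UTC

earliest = min(a, b)
print(earliest is b)   # b is earlier even though its clock reads later
print(a - b)           # difference between the two instants
```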


## 3.3/ For each column with a categorical data type, how are the values distributed?

In [17]:
cate_col_profiles = {}
columns = list(cols.keys())
columns.remove('AuthorDate')  # datetime column, handled in 3.2/

for c in columns:
    # Percentage of missing (empty-string) values
    missing = sum(1 for v in cols[c] if v == '')
    missing_pct = missing * 100 / num_rows

    # Distinct non-missing values
    s = set(cols[c])
    s.discard('')

    # [% missing values, number of unique values, set of unique values]
    cate_col_profiles[c] = [missing_pct, len(s), s]

In [21]:
print(f"{'ColName':12} {'Miss(%)':7} {'NumDifVals':10} {'SomeVals'}")
for col_name, col_profile in cate_col_profiles.items():
    print(f'{col_name:12} {col_profile[0]:<7.3f} {col_profile[1]:<10} {repr(col_profile[2])[:34] + "..."}')

ColName      Miss(%) NumDifVals SomeVals
Id           0.000   1928       {'a031d32', '655fe8d', '2d8b4b2', ...
ParentIds    0.052   1820       {'a031d32', '655fe8d', '31b716e 8e...
AuthorName   0.000   34         {'Charlotte Godley', 'Chris Sewell...
AuthorEmail  0.000   34         {'lev@columbia.edu', 'Engler.Will@...
Subject      0.000   1860       {'nbdime@4.0.0', 'Update to typesc...
ChangedFiles 16.079  842        {'packages/labextension/package.js...


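- __Id__ has no missing values and no duplicates (1928 distinct values for 1928 rows), whereas __ParentIds__ has both missing values (0.052%) and duplicates (only 1820 distinct values).
- Hypothesis: Git does not require every commit to have a parent, nor does it require parents to differ between commits.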

In [22]:
for i in range(num_rows):
    if cols['ParentIds'][i] != "":
        continue

    print(f"""
    ParentIds: {cols['ParentIds'][i]}
    Author: {cols['AuthorName'][i]} - {cols['AuthorEmail'][i]}
    Date: {cols['AuthorDate'][i]}
    Subject: {cols['Subject'][i]}
    Changed files: {cols['ChangedFiles'][i]}
    """)


    ParentIds: 
    Author: Martin Sandve Alnæs - martinal@simula.no
    Date: 2015-11-09 13:13:28+01:00
    Subject: Initial commit, framework files following nbformat setup.
    Changed files: CONTRIBUTING.md,COPYING.md,MANIFEST.in,README.md,docs/README.md,nbmerge/__init__.py,nbmerge/_version.py,scripts/nbdiff,scripts/nbmerge,setup.cfg,setup.py
    


- Note that the AuthorDate above equals `author_date_min`.
- This is the first commit, so there is no parent commit.
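The __ParentIds__ field holds zero, one, or several space-separated hashes, which is one way to tell root, regular, and merge commits apart. Several commits can also share the same parent when branches fork from it, which would explain the duplicated values. A minimal sketch, with the hypothetical helper `classify_by_parents` and made-up hashes:

```python
def classify_by_parents(parent_ids):
    """Label a commit by how many parent hashes its ParentIds field holds."""
    count = len(parent_ids.split())   # hashes are separated by spaces
    if count == 0:
        return "root"
    if count == 1:
        return "regular"
    return "merge"


# Hypothetical ParentIds values: empty, one hash, two hashes
labels = [classify_by_parents(p) for p in ["", "aaaaaaa", "aaaaaaa bbbbbbb"]]
print(labels)
```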

---

# 4/ Ask meaningful questions to get data insight

## 4.1/ Question 1: In 2021, who has the most commits, who has the second most commits, ...?

In [24]:
# Indices of the commits authored in 2021
filter_year21 = [i for i in range(num_rows) if cols['AuthorDate'][i].year == 2021]

# Count commits per author name
wanted_people = {}
for i in filter_year21:
    name = cols['AuthorName'][i]
    wanted_people[name] = wanted_people.get(name, 0) + 1

# Sort by commit count, descending
wanted_people = dict(sorted(wanted_people.items(), key=lambda item: item[1], reverse=True))

print(wanted_people)

{'Vidar Tonaas Fauske': 26, 'Alex Bozarth': 9, 'Frederic COLLONVAL': 7, 'Frédéric Collonval': 6, 'krassowski': 2, 'Michał Krassowski': 2}
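Some entries in this result look like spelling variants of the same author name, although the data shown here does not confirm that. One possible cross-check is to count by email as well as by name. The sketch below uses `collections.Counter` on hypothetical (name, email) pairs, and it only merges variants if the same email address was used for each of them.

```python
from collections import Counter

# Hypothetical sample of (AuthorName, AuthorEmail) pairs
commits = [
    ("Ann Lee", "ann@example.com"),
    ("ann-lee", "ann@example.com"),
    ("Bob Roy", "bob@example.com"),
    ("Ann Lee", "ann@example.com"),
]

by_name = Counter(name for name, _ in commits)
by_email = Counter(email for _, email in commits)

print(by_name.most_common())    # name variants are counted separately
print(by_email.most_common())   # the shared email merges them
```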


---

## 4.2/ Question 2: Which commit was the last to change the file "nbdime/webapp/templates/difftool.html"?

By convention, we ignore "merge pull request" commits here, because they carry no "ChangedFiles" information.

In [26]:
target = "nbdime/webapp/templates/difftool.html"

# Indices of the commits whose ChangedFiles mention the target file
filter_changedFiles = [i for i in range(num_rows) if target in cols['ChangedFiles'][i]]

# Among them, keep the commit with the most recent AuthorDate
latest = max(filter_changedFiles, key=lambda i: cols['AuthorDate'][i])

wanted_commit = {name: cols[name][latest] for name in cols}

print(wanted_commit)

{'Id': 'be0b5ee',
 'ParentIds': '4226179',
 'AuthorName': 'Vidar Tonaas Fauske',
 'AuthorEmail': 'vidartf@gmail.com',
 'AuthorDate': datetime.datetime(2017, 7, 4, 22, 53, 9, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))),
 'Subject': 'jinja2 templates + avoid name collisions with nb',
 'ChangedFiles': 'nbdime/webapp/nbdimeserver.py,nbdime/webapp/templates/compare.html,nbdime/webapp/templates/diff.html,nbdime/webapp/templates/difftool.html,nbdime/webapp/templates/merge.html,nbdime/webapp/templates/mergetool.html,nbdime/webapp/templates/nbdimepage.html,setup.py'}
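The filter above matches the path as a substring, so it would also match a longer path that merely contains it. A stricter variant splits the comma-separated field and compares whole paths. The sketch below assumes that file names themselves contain no commas; the helper `touches_file` and the sample strings are hypothetical.

```python
def touches_file(changed_files, path):
    """True if `path` is exactly one of the comma-separated changed files."""
    return path in changed_files.split(",")


target = "nbdime/webapp/templates/difftool.html"
exact = touches_file("setup.py,nbdime/webapp/templates/difftool.html", target)
lookalike = touches_file("old/nbdime/webapp/templates/difftool.html.bak", target)
substring = target in "old/nbdime/webapp/templates/difftool.html.bak"

print(exact, lookalike, substring)
```

The exact comparison rejects the lookalike path, while the substring test accepts it.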