Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a git blame parser #68

Closed
carlosparadis opened this issue Jul 11, 2020 · 24 comments
Closed

Create a git blame parser #68

carlosparadis opened this issue Jul 11, 2020 · 24 comments
Assignees

Comments

@carlosparadis
Copy link
Member

Part of the effort towards #2

The idea is that we have a git_blame() parser function added to R/git.R

@massihonda
Copy link
Collaborator

Hi Carlos,

I created a function that stores the blame commit of each line of all files of a checkout. I used the conf file that I am attaching, where I had to insert the absolute path to the git directory in order to run. I was not able to do it with path.expand unfortunately. I added the functions in the git.R file and created a vignette show_case to run the blame function. The show_case uses the drill repository as an example. These files are attached too, but you would have to clone the drill repository into the git_repo folder, since it was too large to upload. If you think there are some things to change I can do it.

The function has two optional parameters: one is all, the other is since and until (to use together). If all is set to TRUE, instead of retrieving only the last commit that touched the lines, it retrieves all the commits the touched the lines in all the history. If a date is specified in since and until, it retrieves all the commits that changed the lines within these two dates. When all is set to TRUE the time to compute the data frame increases, reasonably.

Kind regards,

Massimo

@carlosparadis
Copy link
Member Author

carlosparadis commented Jul 17, 2020

Hi Massimo,

I added some overall feedback concerning the code on #71 , however, I noticed one assumption you have on your code that may be going in a different direction than what you wanted to do for your thesis, which to my understanding is a collaboration network at function level, so I am double checking again.

In your git_blame() function, your parameters are:

parse_git_blame <- function(git_repo_path, all=FALSE, since, until){...}

However, the call to git blame via terminal is, for example on Kaiaulu:

git blame -p -C -w -M 9b2137579609144e98ca27e858edb76e05655eb2 git.R

A portion of the output, in turn, looks like this:

9da2897fe4dc320072fb8fd98b13c0c24430b3e6 1 1 4
author Carlos Paradis
author-mail <carlosviansi@gmail.com>
author-time 1590633783
author-tz -1000
committer Carlos Paradis
committer-mail <carlosviansi@gmail.com>
committer-time 1590633783
committer-tz -1000
summary i #12 Add MPL 2.0 License
previous 05dc1768f3e7d4c53881448222cc8cc06d2a52da R/parsers.R
filename R/parsers.R
        # This Source Code Form is subject to the terms of the Mozilla Public
9da2897fe4dc320072fb8fd98b13c0c24430b3e6 2 2
        # License, v. 2.0. If a copy of the MPL was not distributed with this
9da2897fe4dc320072fb8fd98b13c0c24430b3e6 3 3
        # file, You can obtain one at https://mozilla.org/MPL/2.0/.
9da2897fe4dc320072fb8fd98b13c0c24430b3e6 4 4
        
9b2137579609144e98ca27e858edb76e05655eb2 5 5 39
author Carlos Paradis
author-mail <carlosviansi@gmail.com>
author-time 1593782688
author-tz -1000
committer Carlos Paradis
committer-mail <carlosviansi@gmail.com>
committer-time 1593782688
committer-tz -1000
summary i #62 add git interface functions
filename R/git.R
        #' Performs a git checkout on specified repo
9b2137579609144e98ca27e858edb76e05655eb2 6 6
        #'

The above is the top portion of the blame shown here. In order for me to finish the git log at function level I understood you wanted, you need to create a function that takes the following parameters:

git_blame(git_repo_path,commit_hash,file_name)

And returns that to memory. In the case of git_blame, you should not need to save to /tmp since it is a small output (a single file). git_log saves to /tmp/ because it can range from 400MB to GBs. Either way, your main challenge is on parse_git_blame(git_repo_path,commit_hash,file_name). This function will be in charge of actually parsing the output above into a tabular format.

If you can parse it with those 2 functions, I can finish integrating to the rest of code already in Kaiaulu to get your git log at function level.

Also, you can still reuse the existing branch and PR. You would just need to make another commit editing so git.R contains the original code plus this function, and R/parsers.R contains the parser function. We can squash all commits later to clean up.

Let me know if you have questions, or if I misunderstood your interest was not on the git log functions part.

@massihonda
Copy link
Collaborator

massihonda commented Jul 17, 2020 via email

@carlosparadis
Copy link
Member Author

carlosparadis commented Jul 18, 2020

At the beginning I also put the commit in the parameter, then I took it
out, because I thought I could use the git_checkout function before to go
to a specific commit. I stored the output into /tmp because when I called
the terminal command it did not return anything I could store in a normal
variable, so I used it as a workaround (I understand it is not the proper
way though).

Your assumption of using git checkout is not incorrect, and it is indeed to the best of my knowledge the only way I have seen to date to compute static metrics in all implementations such as architectural flaws. However, it is my understanding also passing the commit hash will save you an immense amount of computational time, if not the only way viable to do this per commit. To be more specific:

The reason why when computing architectural flaws we have to use git checkout given a commit happens, is because even though we may know a given commit only changed, say, 4 files, we do not know by just looking these 4 files the new dependencies it created to the rest of the codebase. Specifically, without using an abstract syntax tree, which in itself adds also another overhead, we have no means to know all types of dependencies (eg. inheritances) generated to any of the other files in the codebase, even if only those 4 files changed. Hence, we must perform git_checkout, and then analyze all the files. Tools, like Understand and Depends, will require a folder path and will not leverage the git commit because of that. A side effect is that most publications, if not all, will compute these metrics per time window, or per release rather than per commit as the cost is prohibitive.

However, in your specific case, you do not need to do so: If a commit change 4 files, we can, with the implementation Kaiaulu already have of ctags and I also believe codeface does this way, only look at those 4 files instead of the entire folder tree. This means that if you are looking at a project like Chromium of thousands of files and over 900k commits, instead of ending up with 900k*5000 git blames, you may end up, on average, with something like 900k*5, as the average commit and human shouldn't be changing so many files at every commit, unless the code is really bad. Hence your execution time T should be something like O(n_commits)*O(5 ish), instead of O(n_commits)*O(n_files_in_folder) on the average case if my math serves me right.

You can also already do a git checkout iteration by using:

kaiaulu/R/git.R

Lines 5 to 26 in 9b21375

#' Performs a git checkout on specified repo
#'
#' @param commit_hash The commit hash the repo should be checkout
#' @param git_repo_path The git repo path
#' @return Any error message generated by git
#' @export
git_checkout <- function(commit_hash,git_repo_path){
# Expand paths (e.g. "~/Desktop" => "/Users/someuser/Desktop")
git_repo_path <- path.expand(git_repo_path)
# Remove ".git"
folder_path <- stri_replace_last(git_repo_path,replacement="",regex=".git")
error <- system2('git',
args = c('--git-dir',
git_repo_path,
'--work-tree',
folder_path,
'checkout',
commit_hash),
stdout = TRUE,
stderr = FALSE)
return(error)
}

In the following manner on a vignette (and it is indeed what I did to compute "Data for Bob" for architectural flaws:

project_gitlog <- parse_gitlog(perceval_path,git_repo_path)
project_gitlog <- project_gitlog[order(data.AuthorDate)]
metrics <- list()
for(commit in 1:nrow(project_gitlog)){
  commit_hash <- project_gitlog$data.commit
  git_checkout(commit_hash,git_repo_path)
  metrics[[commit_hash]] <- parse_line_metrics(scc_path,git_repo_path)
}
# Reset the detached HEAD of the repo to the most current commit as it is by default
git_checkout("master",git_repo_path)

But back to answer your question:

I understand you want to put this loop outside the function
parse_git_blame, so that parse_git_blame would return only the blame of
only one file at the time, is it correct? Is it fine that the function
returns a dataframe with the format I used (file_name, path, line, commit)
, but containing the information of only one file? What should I do with
the function list_files_in_repo?

You should not need to define a function that scans all file names in a folder, as we will already know what their filepath is and commit hash while looping through the gitlog. Your git_blame() function should only do one task generate the log and pass directly on memory to avoid I/O, and parse_git_blame(), which uses git_blame() should also only have one responsibility, which is to encode the logic to turn that output I've shown into a well-formatted data.table.

I will then put these together in a vignette with the other necessary functions, and finish your gitlog network at function level, and when I think it is good enough, turn into another function (which at that point I need to think which module it would go, if parse_, or something else, but certainly not git_ as git.R is solely to define functions to interact with the git interface).

As for the file filter or time windows, don't worry about that for now. In the vignette, this information is obtained from the config file. I can just file filter in a vignette the rows of project_git and then pass to your function:

This is done in gitlog_showcase.Rmd:

project_git <- project_git  %>%
  filter_by_file_extension(file_extensions,"file")  %>% 
  filter_by_filepath_substring(substring_filepath,"file"

Also:

project_git_slice <- project_git[data.AuthorDate >= as.POSIXct("2019-01-01", format = "%Y-%m-%d",tz = "UTC") & data.AuthorDate < as.POSIXct("2019-12-31", format = "%Y-%m-%d",tz = "UTC")]

Hence in a similar manner when we call:

parse_git_blame(git_repo_path,commit_hash,file_name)

One column will contain the file_path, as shown on the output before there is one line that says filename R/parsers.R. We could use the same method above to filter, in the vignette, with the parameters passed on the config file, as we need. In the future, I may change the git() API to leverage more of its various parameters, but it's best to keep it simple for now and go on as needed basis.

On a side note, I noticed my output above, despite specifying a single file, contains information of more than one. My understanding of git blame output may be flawed. Maybe we should start by making sure when we pass a commit and a file name, it does indeed report just the lines and commit of that file, and not of all files the commit touches.

Let me know if the above makes sense or if you have questions.

@carlosparadis
Copy link
Member Author

At the beginning I also put the commit in the parameter, then I took it
out, because I thought I could use the git_checkout function before to go
to a specific commit. I stored the output into /tmp because when I called
the terminal command it did not return anything I could store in a normal
variable, so I used it as a workaround (I understand it is not the proper
way though).

Take a look at this older version of parse_gitlog to see how you can do it without saving to /tmp/:

kaiaulu/R/parsers.R

Lines 18 to 24 in 6c62c97

# Use percerval to parse .git --json line is required to be parsed by jsonlite::fromJSON.
perceval_output <- system2(perceval_path,
args = c('git',git_repo_path,'--json-line'),
stdout = TRUE,
stderr = FALSE)
# Parsed JSON output.
perceval_parsed <- data.table(jsonlite::stream_in(textConnection(perceval_output),verbose=FALSE))

The reason you can't get the output is because of the > in the list of parameters you pass:

kaiaulu/R/parsers.R

Lines 66 to 67 in 9b21375

'>',
gitlog_path),

On a bash terminal, if you time a command like git log > somefile.txt what would be otherwise output to stdout is saved to somefile.txt instead. You can reason system2 as a function that copy and paste a command line to the terminal. With stdout = TRUE it will capture the output of the terminal. It may be counter intuitive to have stdout = TRUE when the output is saved to /tmp but that is required for a file to be saved as we use >. I didn't find a flag to explicitly have git to save the output. Most stackoverflow questions just suggest using >.
Another example:

kaiaulu/R/parsers.R

Lines 448 to 467 in 9b21375

parse_line_metrics <- function(scc_path,git_repo_path){
# Expand paths (e.g. "~/Desktop" => "/Users/someuser/Desktop")
scc_path <- path.expand(scc_path)
git_repo_path <- path.expand(git_repo_path)
# Remove ".git"
folder_path <- stri_replace_last(git_repo_path,replacement="",regex=".git")
# Use Depends to parse the code folder.
stdout <- system2(
scc_path,
args = c(folder_path, '--by-file','--format','csv'),
stdout = TRUE,
stderr = FALSE
)
line_metrics <- fread(stri_c(stdout,collapse = "\n"))
# /Users/user/git_repos/APR/xml/apr_xml_xmllite.c => "xml/apr_xml_xmllite.c"
line_metrics$Location <- stri_replace_first(line_metrics$Location,
replacement="",
regex=folder_path)
return(line_metrics)
}

You may need to fiddle with the exact function you read the input directly from the terminal. In the first I use jsonlite because Perceval gives a json output. In the second I use fread because the output to stdout can be a .csv. Gitlog may be neither, so you may need to use readlines instead, but it should be doable.

In the future, I will make it so all git_ functions can do both after I am certain there is no other way. For now, you can use the way the old parse_git used to obtain the output on memory.

@massihonda
Copy link
Collaborator

massihonda commented Jul 18, 2020 via email

@carlosparadis
Copy link
Member Author

Thanks! Feel free to post questions here. I am still intrigued on why the git blame is giving information mentioning other file in Kaiaulu. I hope my understanding of it is not incorrect!

@massihonda
Copy link
Collaborator

massihonda commented Jul 20, 2020 via email

@carlosparadis
Copy link
Member Author

carlosparadis commented Jul 22, 2020

You may want to look this response on GitHub so the embedded links appear and is easier to read:

I think that the blame message, when the parameter -p is specified, returns
in the metadata the hash of the previous commit and the file it touched.

If you mean by previous the ancestor commit to the one you pass as a parameter, then that should not be the case. For example,

Try: git blame -p -C -w -M 8459acac8d5d42ce2dc326269bd4ea65b1a5f0da R/filter.R

And the output should be a single file, filter.R, and the commit hash will be the same as the one provided as a parameter.

Take a look at the options man page and the formatting man page. -C and -M seem to be the reason we saw multiple files being shown despite providing only one. Because of that, I am afraid we can't just parse commit hashes but instead need to parse everything. But this shouldn't be too hard, considering the doc above explains the layout of the file.

Also, in regards to my comment 5 days ago and the associated command, try removing -p, so it is easier to read:

git blame -C -w -M 9b2137579609144e98ca27e858edb76e05655eb2 R/git.R

and you will get:

9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000  1) # This Source Code Form is subject to the terms of the Mozilla Public
9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000  2) # License, v. 2.0. If a copy of the MPL was not distributed with this
9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000  3) # file, You can obtain one at https://mozilla.org/MPL/2.0/.
9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000  4) 
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000  5) #' Performs a git checkout on specified repo
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000  6) #'
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000  7) #' @param commit_hash The commit hash the repo should be checkout
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000  8) #' @param git_repo_path The git repo path
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000  9) #' @return Any error message generated by git
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000 10) #' @export
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000 11) git_checkout <- function(commit_hash,git_repo_path){
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000 12)   # Expand paths (e.g. "~/Desktop" => "/Users/someuser/Desktop")
...

As you can see, it is just the license header that adds another file instead of R/git.R. Why? If you see the commits:

You will notice the license header commit and the file were where I first added the license header, and the commit was when I first explicitly added the license to the R package. This is likely the behavior of -C and -M.

Try experimenting removing the flags -C and -M, and you will see when you remove -C, the file now shows the git blame as we originally expected:

git blame -w -M 9b2137579609144e98ca27e858edb76e05655eb2 R/git.R outputs:

9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  1) # This Source Code Form is subject to the terms of the Mozilla Public
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  2) # License, v. 2.0. If a copy of the MPL was not distributed with this
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  3) # file, You can obtain one at https://mozilla.org/MPL/2.0/.
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  4) 
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  5) #' Performs a git checkout on specified repo
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  6) #'
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  7) #' @param commit_hash The commit hash the repo should be checkout
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  8) #' @param git_repo_path The git repo path
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  9) #' @return Any error message generated by git
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 10) #' @export
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 11) git_checkout <- function(commit_hash,git_repo_path){
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 12)   # Expand paths (e.g. "~/Desktop" => "/Users/someuser/Desktop")
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 13)   git_repo_path <- path.expand(git_repo_path)
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 14)   # Remove ".git"
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 15)   folder_path <- stri_replace_last(git_repo_path,replacement="",regex=".git")
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 16)   error <- system2('git',
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 17)                   args = c('--git-dir',
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 18)                            git_repo_path,
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 19)                            '--work-tree',
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 20)                            folder_path,
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 21)                            'checkout',
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 22)                            commit_hash),
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 23)                   stdout = TRUE,
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 24)                   stderr = FALSE)
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 25)   return(error)
...

So the bottom line is, take a look at what is the specified format of -p and encode in your parse_git_blame() function, and reason -C and -M to pass the explanation onto the function @description. This is where, over time, I think, it pays off to build these functions to wrap around these interfaces: They never make immediate sense, and once we wrap and document them properly, we don't need to go through this again. Stackoverflow has several questions on what git blame does as well, so I am led to believe it is confusing for its name.

Let me know if you have any other questions.

@massihonda
Copy link
Collaborator

massihonda commented Jul 22, 2020 via email

carlosparadis added a commit that referenced this issue Jul 23, 2020
Adds a wrapper to the command git blame

Signed-off-by: Carlos Paradis <carlosviansi@gmail.com>
@carlosparadis
Copy link
Member Author

Hi Massimo,

I went ahead and pushed to the repo a git_blame function. That way you don't need to worry about the issues of setwd, getwd, and how it is established in the config file, and can fully focus on parse_git_blame() by calling it.

As you said, git_blame differs from the other git_functions in that it also requires a file and a commit_hash, and I can see how this can lead to confusion if it is not clear what exactly is providing the inputs to it so you can decide where your assumption of the current working directory should be. Add that to the fact git requires an extra flag to be executed from a different folder and this just becomes plain confusing.

You can use the git_blame() I just pushed as follows:

git_blame(git_repo_path = "rawdata/git_repo/OpenSSL/.git",
          flags = c('-p','-C','-w','-M'),
          commit_hash = '1940c092a52afd8bc919b8faa5f3d51004503f3a',
          file_path = 'apps/genpkey.c'
          )

This function will return a character vector, where each element of the character is a line of the commithash and the file you specified (plus what I explained earlier concerning -C and -M).

This was executed from the working directory "~/Desktop/kaiaulu". I know, the config files use ../rawdata/git_repo/OpenSSL/.git and that's also correct. This is because when you are in RStudio, if you are running the commands from an R Notebook after opening kaiaulu.Rproj, your working directory will be: "~/Desktop/kaiaulu". However, when you Knit the vignette into html, it will actually be "~/Desktop/kaiaulu/vignettes". Hence it is the later that the config it is accounting for, not the former when it states the data lies at "../rawdata/" instead of simply "rawdata". As it should be: When you knit a vignette, the start of each vignette will be used to load the config file variables and the associated paths, i.e. the later case. Whereas when you are running the code block by block on R console, you have full control of what goes in each function call, the former case, and my example can simply be pasted.

The truth is, there is no silver bullet way to code this so the config works from any possible path. This is also why the API makes no assumption whatsoever of the config files when you make function calls. It is the responsibility of the user to say where things are.

However, and not versioned yet, there is also a component of this repo that is batch mode (will eventually appear under a exec folder). That component will assume a config file as input since it is a CLI, so calling the CLI won't look something like this. But you don't need to worry about it on this PR.

Take note how I call git_blame(), because this is exactly how parse_git_log will be expected to be called. The only difference being that parse_git_log will be handling the parsing of the file provided by git blame. I should have provided you this from the beginning, but my understanding of git blame itself was crude. Hence, I think this will facilitate your life :)

I hope this helps!

@carlosparadis
Copy link
Member Author

Just to clarify, the git_repo_path parameter will be still supplied by the user, which already is available in the config file. The others will not. That will be obtained via parse_gitlog. You can use my example call above as an example call for your parse_git_blame function to ignore these details too.

For now, don't worry about making additional files for .Rmd showcase, or additional functions. Just commit parse_git_blame(git_repo_path,flags,commit_hash,file_path) in the same branch you already have a Pull Request. Inside of the function, call git_blame with the parameters, obtain the blame data, and write your code for the parser. Make a commit with just this function.

Also, don't worry about how to update your branch with the commit I just did so git_blame appears on it (you can just copy and paste on the R Console the function definition to use it for now). We can go over that afterward to avoid mixing the code effort with the GitHub workflow nuances.

Let me know if you have questions!

@massihonda
Copy link
Collaborator

massihonda commented Jul 23, 2020 via email

@carlosparadis
Copy link
Member Author

carlosparadis commented Jul 23, 2020

I don't know. Codeface chose -p over --line-porcelain and git blame documentation says -p is the one to be used for machine consumption, not --line-porcelain. I also do not know the output, can you paste it here as I did in my responses?

The main problem, I believe, in putting everything in 1 line is that you run the risk the actual content of the code, which should be parsed, ends up being affected by whatever you use to separate the various columns. Meanwhile -p already does the work for you, so there is no risk of the content of the code being affected.

You should parse all data made available by git blame. There is no reason not to since they can all be potentially relevant to other work.

@massihonda
Copy link
Collaborator

massihonda commented Jul 23, 2020 via email

@carlosparadis
Copy link
Member Author

carlosparadis commented Jul 23, 2020

I see what you're saying. If that is the case, then --line-porcelain just considerably inflates the number of lines of a file repeating the same information. I don't know if this will be a good price to pay on a git log like Chromium which has 20GB. Try to go with -p and if you get stuck I can try to help.

Also, try to avoid for loops if you can and use vectorized operations instead. Loops in R are terrible, and parse_git_blame will be used in a very expensive computation. A for loop here may push the process to several hours.

Looking at the file example from:

git_blame_output <- git_blame(git_repo_path = "rawdata/git_repo/OpenSSL/.git",
          flags = c('-p','-C','-w','-M'),
          commit_hash = '1940c092a52afd8bc919b8faa5f3d51004503f3a',
          file_path = 'apps/genpkey.c'
          )

For example on the above if I use stri_match_all_regex(git_blame_output,"^([a-f0-9]{40}) (\\d+) (\\d+)"):

I get (a slice of the output):

.
.
.

[[768]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[769]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[770]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[771]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[772]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[773]]
     [,1]                                               [,2]                                      
[1,] "b3c6a331853c125985cdf0d3ae60e1894974c3eb 383 272" "b3c6a331853c125985cdf0d3ae60e1894974c3eb"
     [,3]  [,4] 
[1,] "383" "272"

[[774]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[775]]
     [,1]                                               [,2]                                      
[1,] "0f113f3ee4d629ef9a4a30911b22b224772085e5 349 273" "0f113f3ee4d629ef9a4a30911b22b224772085e5"
     [,3]  [,4] 
[1,] "349" "273"

[[776]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[777]]
     [,1]                                               [,2]                                      
[1,] "f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab 327 274" "f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab"
     [,3]  [,4] 
[1,] "327" "274"

.
.
.

With this function, you should be able to parse out the content and line numbers of commits. See the holes of NAs from 768 to 772? Maybe you can capitalize on the fact -p doesn't keep repeating things to get these holes properly parsed. Otherwise, you just need to parse the index + 1 of the content "as is", even if the entire line is just blank, so the code can be recreated.

The other regex symbols you can find here: https://rdrr.io/cran/stringi/man/stringi-search-regex.html

@massihonda
Copy link
Collaborator

massihonda commented Jul 24, 2020 via email

@carlosparadis
Copy link
Member Author

carlosparadis commented Jul 25, 2020

Hi Massimo,

Could you format your post on GitHub and enclose the code portions in code blocks and the table? It is really hard to read and for some reason, my editing will not work. Also for the table, maybe it is best you send me via e-mail an attachment to it. R way to break columns instead of providing a horizontal bar was always very confusing to me. Thanks for the work :)

@carlosparadis
Copy link
Member Author

carlosparadis commented Jul 26, 2020

parse_git_blame<-function(git_repo_path,flags,commit_hash,file_path){

  blame_file <- git_blame(git_repo_path,
          flags,
          commit_hash,
          file_path
)

cond<-'(?<=.)(?=[a-f0-9]{40} \\d+)'
pasted <- paste(blame_file,collapse = ' ')
splitted <- stri_split_regex(pasted,cond)

commit_line <- stri_match_all_regex(unlist(splitted),"^([a-f0-9]{40})
(\\d+) (\\d+)")

commits <- unlist(lapply(commit_line, `[[`, 2))
line_number_of_the_line_in_the_original_file <- unlist(lapply(commit_line,
`[[`, 3))
line_number_of_the_line_in_the_final_file <- unlist(lapply(commit_line,
`[[`, 4))
authors <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=author).*(?=author-mail)"))
author_mails <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=author-mail).*(?=author-time)"))
author_time <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=author-time).*(?=author-tz)"))
#author_tz <- stri_match_all_regex(unlist(splitted),
"(?<=author-tz).*(?=committer)")
committers <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=committer).*(?=committer-mail)"))
committer_mails <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=committer-mail).*(?=committer-time)"))
committer_time <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=committer-time).*(?=committer-tz)"))
committer_tz <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=committer-tz).*(?=summary)"))
summaries <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=summary).*(?=filename)"))
file_names <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=filename).*(?=\t)"))
lines <- unlist(stri_match_all_regex(unlist(splitted), "(?<=\t).*$"))

df <-
data.table(commits,line_number_of_the_line_in_the_original_file,line_number_of_the_line_in_the_final_file,

 authors,author_mails,author_time,committers,committer_mails,committer_time,committer_tz,summaries,
                 file_names,lines)
return(df)
}

Actually, scratch that. Can you go ahead and commit the code on the branch you already have a PR? We are on the same page on the function, I will make any final comments there :) Thanks.

@massihonda
Copy link
Collaborator

massihonda commented Jul 27, 2020 via email

@carlosparadis
Copy link
Member Author

carlosparadis commented Jul 29, 2020

@massihonda thank you for the changes, I have been working on this ever since you sent me the new commit. There are some changes I had to do locally because of how utags works.

The good news is that as of yesterday night I have a vignette that will successfully blame a file and use it's blamed content column to parse using utags, generating the additional information of whether a line is a function or not: I.e. we are able now to get all the information of git blame and add an additional column if said line is a function or not. So we're almost there with the gitlog at function level.

There is one thing I overlooked however: When we use git blame, we obtain information of every single line. I originally thought utags would do the same: but it does not.

The current flags I am using of utags do not output what every line of code is. Rather it output a few lines, some of which functions are one of them. There is a way to output the end line of functions but the output format is really weird. Technically if we are able to consistently parse where a function starts and ends given a file though, we should be able to tell if a person changed a function and which it is.

What is strange is that codeface does not use said approach, it uses more flags and doesn't include e:

  # f = functions, s = structs, c = classes, n = namespace
  # p = function prototype, g = enum, d = macro, t= typedef, u = union
  # e = end line
["u","f", "s", "c", "n", "p", "g", "d", "t"]

As you see, they do more than functons too. This is very important to understand as it affects the dependencies we are really looking at when we say "at funciton level".

So that's basically where I am at. Remember Codeface uses Exuberant Ctags according to the paper, but we use Universal Ctags (uctags). Exuberant is now dead code as far as I understand, and Universal Ctags is the sucessor. If you want to take a look at this to speed up us getting to the gitlog functions, check the README.md on how to install uctags, and try running a few examples to see if you can make sense of all these flags.

Let me know if anything doesn't make sense!

@carlosparadis
Copy link
Member Author

I have summarized my findings on #2 since it is not git blame related. I think I have everything I need now. Give me a few days and we will have the gitlog at the function level.

carlosparadis added a commit that referenced this issue Aug 2, 2020
Adds a git blame parser. Note this is a git
command, and the filepath needs only to exist
in the git log to be output. Said filepath and
commit hash can be obtained, for example, using
parse_gitlog(). A vignette will be added in
a later commit showing how to combine both.

Signed-off-by: Carlos Paradis <carlosviansi@gmail.com>
@carlosparadis
Copy link
Member Author

@massihonda when you have a chance, take a look at this function. I added you as a ctb on DESCRIPTION as part of this commit, and when I accept your PR for the configs git will add you as a contributor to the page. I will be pushing a vignette in a moment. A few comments:

  • The actual git blame output needed was 1 line of the original source code file annotated with the metadata. Otherwise, we can't easily use the line_content column to pass it to uctags.
  • When I suggested using a regex to parse commit hashes and not using loops, what I was trying to say is that you could observe how the format of the file changed, and then apply the regex to the lines. For example, author- information is followed by committer information, etc. Using this approach, I was able to narrow down the possible git blame files to be of one of the 3 cases (see sub-function parse_line_content). In your PR, you scan the entire git blame file every time you call stri_match_all_regex() over 10 times, even if you already know the line was associated with something else from a prior call of stri_match_all_regex(). The pushed function only applies the regex matching for commit hash in all lines in the pushed function. For the remaining ones, it leverages the possible 3 ways the blame file format appears to parse exactly the lines for the regex once.

I went ahead and extended the API files to have regex. This will allow us to unit test them as well. If you can think of counter cases, please let me know.

I will send the remaining commits in a moment for the use of uctags to a separate issue.

@massihonda
Copy link
Collaborator

massihonda commented Aug 3, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants