Create a git blame parser #68
Comments
Hi Carlos, I created a function that stores the blame commit of each line of every file in a checkout. I used the attached conf file, in which I had to insert the absolute path to the git directory for it to run; unfortunately I was not able to make it work with path.expand. I added the functions to the git.R file and created a vignette, show_case, that runs the blame function. The show_case uses the drill repository as an example. These files are attached too, but you would have to clone the drill repository into the git_repo folder, since it was too large to upload. If you think some things should change, I can do it. The function has optional parameters: all, and since and until (to be used together). If all is set to TRUE, instead of retrieving only the last commit that touched each line, it retrieves all the commits that touched the lines across the whole history. If dates are specified in since and until, it retrieves all the commits that changed the lines between those two dates. When all is set to TRUE, the time to compute the data frame increases accordingly. Kind regards, Massimo |
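Hi Massimo, I added some overall feedback concerning the code on #71. However, I noticed one assumption in your code that may be going in a different direction than what you wanted to do for your thesis, which to my understanding is a collaboration network at function level, so I am double checking again. In your git_blame() function, your parameters are:
parse_git_blame <- function(git_repo_path, all=FALSE, since, until){...}
However, the call to git blame via terminal is, for example on Kaiaulu:
git blame -p -C -w -M 9b21375 git.R
A portion of the output, in turn, looks like this:
9da2897 1 1 4
author Carlos Paradis
author-mail ***@***.***>
author-time 1590633783
author-tz -1000
committer Carlos Paradis
committer-mail ***@***.***>
committer-time 1590633783
committer-tz -1000
summary i #12 Add MPL 2.0 License
previous 05dc176 R/parsers.R
filename R/parsers.R
# This Source Code Form is subject to the terms of the Mozilla Public
9da2897 2 2
# License, v. 2.0. If a copy of the MPL was not distributed with this
9da2897 3 3
# file, You can obtain one at https://mozilla.org/MPL/2.0/.
9da2897 4 4
The above is the top portion of the blame shown here (https://github.com/sailuh/kaiaulu/blame/master/R/git.R). In order for me to finish the git log at function level I understood you wanted, you need to create a function that takes the following parameters:
git_blame(git_repo_path,commit_hash,file_name)
And returns that to memory. In the case of git_blame, you should not need to save to /tmp since it is a small output (a single file). git_log saves to /tmp/ because it can range from 400MB to GBs. Either way, your main challenge is on parse_git_blame(git_repo_path,commit_hash,file_name). This function will be in charge of actually parsing the output above into a tabular format. If you can parse it with those 2 functions, I can finish integrating it with the rest of the code already in Kaiaulu to get your git log at function level. Also, you can still reuse the existing branch and PR. You would just need to make another commit so that git.R contains the original code plus this function, and R/parsers.R contains the parser function. We can squash all commits later to clean up. Let me know if you have questions, or if I misunderstood and your interest was not in the git log functions part. |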
I see, my purpose is indeed also to arrive at a network at function
level. I am going to adjust it as you wish; I would just like to explain
the reason for this implementation, to better understand how to evolve it.
At the beginning I also put the commit in the parameter, then I took it
out, because I thought I could use the git_checkout function before to go
to a specific commit. I stored the output into /tmp because when I called
the terminal command it did not return anything I could store in a normal
variable, so I used it as a workaround (I understand it is not the proper
way though).
Instead of inserting the filename as a parameter, I created a function
list_files_in_repo, that lists all the files in a repository at a specific
commit. I called this function just before the first while loop, so that it
stores all the files into a variable. Then the while loop iteratively calls
the git_blame function on each file with an extension in the form of a
programming language (in order to avoid parsing an image for example). The
function returns a dataframe with the information about each line of all
files in a specific commit. I thought it would have been possible to filter
the data frame when you have to create the network at function level, i.e.
selecting only those rows belonging to one file each time to see who
committed each line, and joining the data frame of parse_gitlog to see the
author of the commit and the data frame from the function that parses ctags
to see which function is defined in that line.
I understand you want to put this loop outside the function
parse_git_blame, so that parse_git_blame would return only the blame of
only one file at the time, is it correct? Is it fine that the function
returns a dataframe with the format I used (file_name, path, line, commit)
, but containing the information of only one file? What should I do with
the function list_files_in_repo?
On Fri, Jul 17, 2020 at 5:49 PM Carlos Paradis wrote:
Hi Massimo,
I added some overall feedback concerning the code on #71, however, I noticed one
assumption you have on your code that may be going in a different direction
than what you wanted to do for your thesis, which to my understanding is a
collaboration network at function level, so I am double checking again.
In your git_blame() function, your parameters are:
parse_git_blame <- function(git_repo_path, all=FALSE, since, until){...}
However, the call to git blame via terminal is, for example on Kaiaulu:
git blame -p -C -w -M 9b21375 git.R
A portion of the output, in turn, looks like this:
9da2897 1 1 4
author Carlos Paradis
author-mail ***@***.***>
author-time 1590633783
author-tz -1000
committer Carlos Paradis
committer-mail ***@***.***>
committer-time 1590633783
committer-tz -1000
summary i #12 Add MPL 2.0 License
previous 05dc176 R/parsers.R
filename R/parsers.R
# This Source Code Form is subject to the terms of the Mozilla Public
9da2897 2 2
# License, v. 2.0. If a copy of the MPL was not distributed with this
9da2897 3 3
# file, You can obtain one at https://mozilla.org/MPL/2.0/.
9da2897 4 4
9b21375 5 5 39
author Carlos Paradis
author-mail ***@***.***>
author-time 1593782688
author-tz -1000
committer Carlos Paradis
committer-mail ***@***.***>
committer-time 1593782688
committer-tz -1000
summary i #62 add git interface functions
filename R/git.R
#' Performs a git checkout on specified repo
9b21375 6 6
#'
The above is the top portion of the blame shown here
<https://github.com/sailuh/kaiaulu/blame/master/R/git.R>. In order for me
to finish the git log at function level I understood you wanted, you need
to create a function that takes the following parameters:
git_blame(git_repo_path,commit_hash,file_name)
And returns that to memory. In the case of git_blame, you should not need
to save to /tmp since it is a small output (a single file). git_log saves
to /tmp/ because it can range from 400MB to GBs. Either way, your main
challenge is on parse_git_blame(git_repo_path,commit_hash,file_name). This
function will be in charge of actually parsing the output above into a
tabular format.
If you can parse it with those 2 functions, I can finish integrating to
the rest of code already in Kaiaulu to get your git log at function level.
Also, you can still reuse the existing branch and PR. You would just need
to make another commit editing so git.R contains the original code plus
this function, and R/parsers.R contains the parser function. We can squash
all commits later to clean up.
Let me know if you have questions, or if I misunderstood your interest was
not on the git log functions part.
and return a parsed table of the output above
—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
<#68 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AL42SQE3IYD5Y6FQXWGXMHDR4BXHNANCNFSM4OXDOL6Q>
.
|
Your assumption of using git checkout is not incorrect, and it is indeed to the best of my knowledge the only way I have seen to date to compute static metrics in all implementations such as architectural flaws. However, it is my understanding also passing the commit hash will save you an immense amount of computational time, if not the only way viable to do this per commit. To be more specific: The reason why when computing architectural flaws we have to use However, in your specific case, you do not need to do so: If a commit change 4 files, we can, with the implementation Kaiaulu already have of ctags and I also believe codeface does this way, only look at those 4 files instead of the entire folder tree. This means that if you are looking at a project like Chromium of thousands of files and over 900k commits, instead of ending up with 900k*5000 git blames, you may end up, on average, with something like 900k*5, as the average commit and human shouldn't be changing so many files at every commit, unless the code is really bad. Hence your execution time T should be something like O(n_commits)*O(5 ish), instead of O(n_commits)*O(n_files_in_folder) on the average case if my math serves me right. You can also already do a git checkout iteration by using: Lines 5 to 26 in 9b21375
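The cost comparison above can be made concrete with a few lines of arithmetic. This is only a sketch using the illustrative figures from the comment (900k commits, roughly 5000 files in the tree, roughly 5 files changed per commit); the variable names are made up for the example and the real numbers for any project will differ.

```r
# Illustrative figures taken from the comment above (Chromium-scale example).
n_commits <- 900000          # "over 900k commits"
n_files_in_tree <- 5000      # size of the folder tree used in the 900k*5000 estimate
avg_files_per_commit <- 5    # assumed average number of files a commit changes

# Blaming every file in the tree at every commit (checkout, then scan the folder).
blames_full_scan <- n_commits * n_files_in_tree

# Blaming only the files each commit actually changed.
blames_changed_only <- n_commits * avg_files_per_commit

# How many times fewer git blame calls the second strategy needs.
reduction_factor <- blames_full_scan / blames_changed_only

print(format(c(full_scan = blames_full_scan,
               changed_only = blames_changed_only,
               reduction_factor = reduction_factor),
             big.mark = ",", scientific = FALSE))
```

Under these assumed figures the number of blame calls drops by a factor equal to n_files_in_tree / avg_files_per_commit, which is why passing the commit hash and file path matters more as the tree grows.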
In the following manner on a vignette (and it is indeed what I did to compute "Data for Bob" for architectural flaws:
But back to answer your question:
You should not need to define a function that scans all file names in a folder, as we will already know what their filepath is and commit hash while looping through the gitlog. Your I will then put these together in a vignette with the other necessary functions, and finish your gitlog network at function level, and when I think it is good enough, turn into another function (which at that point I need to think which module it would go, if parse_, or something else, but certainly not git_ as git.R is solely to define functions to interact with the git interface). As for the file filter or time windows, don't worry about that for now. In the vignette, this information is obtained from the config file. I can just file filter in a vignette the rows of project_git and then pass to your function: This is done in
Also:
Hence in a similar manner when we call:
One column will contain the file_path, as shown on the output before there is one line that says On a side note, I noticed my output above, despite specifying a single file, contains information of more than one. My understanding of git blame output may be flawed. Maybe we should start by making sure when we pass a commit and a file name, it does indeed report just the lines and commit of that file, and not of all files the commit touches. Let me know if the above makes sense or if you have questions. |
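Take a look at this older version of parse_gitlog to see how you can do it without saving to /tmp/: https://github.com/sailuh/kaiaulu/blob/6c62c97eeb5668515f14495019117662184ebd15/R/parsers.R#L18-L24
The reason you can't get the output is because of the > in the list of parameters you pass: https://github.com/sailuh/kaiaulu/blob/9b2137579609144e98ca27e858edb76e05655eb2/R/parsers.R#L66-L67
On a bash terminal, if you type a command like git log > somefile.txt, what would otherwise be output to stdout is saved to somefile.txt instead. You can think of system2 as a function that copies and pastes a command line into the terminal. With stdout = TRUE it will capture the output of the terminal. It may be counterintuitive to have stdout = TRUE when the output is saved to /tmp, but that is required for a file to be saved as we use >. I didn't find a flag to explicitly have git save the output. Most Stack Overflow questions just suggest using >. Another example: https://github.com/sailuh/kaiaulu/blob/9b2137579609144e98ca27e858edb76e05655eb2/R/parsers.R#L448-L467
You may need to fiddle with the exact function you use to read the input directly from the terminal. In the first I use jsonlite because Perceval gives a json output. In the second I use fread because the output to stdout can be a .csv. Gitlog may be neither, so you may need to use readLines instead, but it should be doable. In the future, I will make it so all git_ functions can do both after I am certain there is no other way. For now, you can use the way the old parse_git used to obtain the output in memory. |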
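As a rough illustration of the two ways of getting terminal output into R discussed here, the sketch below captures stdout in memory and, alternatively, sends it to a file and reads it back. It uses echo as a stand-in for a git call and assumes a Unix-like shell; it is not the code Kaiaulu uses.

```r
# Capture the command's stdout directly as a character vector
# (one element per output line). "echo" stands in for a git call here.
in_memory <- system2("echo",
                     args = shQuote("hello from the terminal"),
                     stdout = TRUE)

# Alternative: have system2 send stdout to a file, then read the file back.
tmp_path <- tempfile(fileext = ".txt")
system2("echo",
        args = shQuote("hello from the terminal"),
        stdout = tmp_path)
from_file <- readLines(tmp_path)
unlink(tmp_path)  # clean up the temporary file

print(in_memory)
print(from_file)
```

For a small output such as the blame of a single file, the in-memory form avoids the temporary file altogether; the file form is the one that makes sense when the output may be very large.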
I understand, that's indeed a better way, I didn't think about it and I
made it just intuitively. So I am going to refactor it in this way and let
you know.
Regards,
Massimo
On Sat, Jul 18, 2020 at 5:29 AM Carlos Paradis wrote:
At the beginning I also put the commit in the parameter, then I took it
out, because I thought I could use the git_checkout function before to go
to a specific commit. I stored the output into /tmp because when I called
the terminal command it did not return anything I could store in a normal
variable, so I used it as a workaround (I understand it is not the proper
way though).
Take a look at this older version of parse_gitlog to see how you can do
it without saving to /tmp/:
https://github.com/sailuh/kaiaulu/blob/6c62c97eeb5668515f14495019117662184ebd15/R/parsers.R#L18-L24
The reason you can't get the output is because of the > in the list of
parameters you pass:
https://github.com/sailuh/kaiaulu/blob/9b2137579609144e98ca27e858edb76e05655eb2/R/parsers.R#L66-L67
On a bash terminal, if you type a command like git log > somefile.txt
what would be otherwise output to stdout is saved to somefile.txt instead.
You can reason system2 as a function that copy and paste a command line
to the terminal. With stdout = TRUE it will capture the output of the
terminal. It may be counter intuitive to have stdout = TRUE when the output
is saved to /tmp but that is required for a file to be saved as we use >.
I didn't find a flag to explicitly have git to save the output. Most
stackoverflow questions just suggest using >.
Another example:
https://github.com/sailuh/kaiaulu/blob/9b2137579609144e98ca27e858edb76e05655eb2/R/parsers.R#L448-L467
You may need to fiddle with the exact function you read the input directly
from the terminal. In the first I use jsonlite because Perceval gives a
json output. In the second I use fread because the output to stdout can be
a .csv. Gitlog may be neither, so you may need to use readlines instead,
but it should be doable.
In the future, I will make it so all git_ functions can do both after I
am certain there is no other way. For now, you can use the way the old
parse_git used to obtain the output on memory.
|
Thanks! Feel free to post questions here. I am still intrigued as to why git blame is giving information mentioning another file in Kaiaulu. I hope my understanding of it is not incorrect! |
Hi Carlos,
I think that the blame output, when the parameter -p is specified, includes
in the metadata the hash of the previous commit and the file it touched.
However, it appears only in this context and I think it should not harm the
parsing. All the lines that contain an integer beside the commit hash belong
to the blame of the file specified in the terminal command.
I would like to ask you a few questions. I could easily extract from the
blame output you showed in the first mail the hash that modified each line
of the file and put them in a data.table. For example:
line 1 '9da2897fe4dc320072fb8fd98b13c0c24430b3e6',
line 2 '9da2897fe4dc320072fb8fd98b13c0c24430b3e6',
...
line 41 '9da2897fe4dc320072fb8fd98b13c0c24430b3e6'
Do you think this information would be enough, considering that afterwards
you could join this table with the table of git log on the commit and
having more detailed info such as author, date, etc?
Secondly, should it return the hash for each line in the file, or only the
lines that the commit specified in the parameter has touched?
Thanks,
Massimo
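The join proposed in the first question could look roughly like the sketch below. The hashes, author name and timestamps are taken from the blame excerpt earlier in the thread; the column names and the shape of the git log table are assumptions made for the example, not the actual output of parse_gitlog.

```r
# Blame table: one row per line of the blamed file.
blame <- data.frame(
  line   = 1:6,
  commit = c(rep("9da2897", 4), rep("9b21375", 2)),
  stringsAsFactors = FALSE
)

# Hypothetical slice of a parsed git log: one row per commit.
gitlog <- data.frame(
  commit      = c("9da2897", "9b21375"),
  author      = c("Carlos Paradis", "Carlos Paradis"),
  author_time = c(1590633783, 1593782688),
  stringsAsFactors = FALSE
)

# Left join on the commit hash, then restore file order
# (merge() sorts its result by the join key).
blame_with_meta <- merge(blame, gitlog, by = "commit", all.x = TRUE)
blame_with_meta <- blame_with_meta[order(blame_with_meta$line), ]

print(blame_with_meta)
```

A table keyed only on the commit hash carries no information about which file a line originally came from, so this works only if the hash alone is enough for the analysis.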
On Sat, Jul 18, 2020 at 11:05 AM Carlos Paradis wrote:
Thanks! Feel free to post questions here. I am still intrigued on why the
git blame is giving information mentioning other file in Kaiaulu. I hope my
understanding of it is not incorrect!
|
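You may want to look at this response on GitHub so the embedded links appear and it is easier to read:
If you mean by previous the ancestor commit to the one you pass as a parameter, then that should not be the case. For example, try: git blame -p -C -w -M 8459aca R/filter.R
And the output should be a single file, filter.R, and the commit hash will be the same as the one provided as a parameter. Take a look at the options man page (https://www.git-scm.com/docs/git-blame#_options) and the formatting man page (https://www.git-scm.com/docs/git-blame#_the_porcelain_format). -C and -M seem to be the reason we saw multiple files being shown despite providing only one. Because of that, I am afraid we can't just parse commit hashes but instead need to parse everything. But this shouldn't be too hard, considering the doc above explains the layout of the file. Also, in regards to my comment 5 days ago and the associated command, try removing -p, so it is easier to read:
git blame -C -w -M 9b21375 R/git.R
and you will get:
9da2897 R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 1) # This Source Code Form is subject to the terms of the Mozilla Public
9da2897 R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 2) # License, v. 2.0. If a copy of the MPL was not distributed with this
9da2897 R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 3) # file, You can obtain one at https://mozilla.org/MPL/2.0/.
9da2897 R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 4)
9b21375 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 5) #' Performs a git checkout on specified repo
9b21375 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 6) #'
...
As you can see, it is just the license header that adds another file instead of R/git.R. Why? If you see the commits 9da2897 and 9b21375, you will notice the license header commit and the file were where I first added the license header, and the commit was when I first explicitly added the license to the R package. This is likely the behavior of -C and -M. Try experimenting with removing the flags -C and -M, and you will see that when you remove -C, the file now shows the git blame as we originally expected. git blame -w -M 9b21375 R/git.R outputs:
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 1) # This Source Code Form is subject to the terms of the Mozilla Public
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 2) # License, v. 2.0. If a copy of the MPL was not distributed with this
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 3) # file, You can obtain one at https://mozilla.org/MPL/2.0/.
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 4)
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 5) #' Performs a git checkout on specified repo
...
So the bottom line is, take a look at what is the specified format of -p and encode it in your parse_git_blame() function, and reason about -C and -M to pass the explanation onto the function @description. This is where, over time, I think, it pays off to build these functions to wrap around these interfaces: they never make immediate sense, and once we wrap and document them properly, we don't need to go through this again. Stack Overflow has several questions on what git blame does as well, so I am led to believe its name makes it confusing. Let me know if you have any other questions. |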
Hi Carlos,
Thanks for your reply. I made progress on the implementation, but I have a
doubt on the git_blame function. You want to build the function with the
parameter (git_repo_path,commit_hash,file_name), where the git_repo_path is
the path of the .git folder.
I noticed that the command
git blame -p -C -w -M 9b21375 git.R
works if you are in the directory that contains git.R, so by using
setwd() in R or by using git's -C parameter, which runs the command
as if git was started in <path> instead of
the current working directory:
git -C ../kaiaulu/R/ blame -p -C -w -M 9b21375 git.R
Thus, the function that I wrote is:
git_blame <- function(git_repo_path,commit_hash,file_name){
  blame_text <- system2("git",
                        args = c('-C',
                                 git_repo_path,
                                 'blame',
                                 '--line-porcelain',
                                 '-C',
                                 '-w',
                                 '-M',
                                 commit_hash,
                                 file_name),
                        stdout = TRUE,
                        stderr = FALSE)
  return(blame_text)
}
Where *git_repo_path* is the path of the folder that contains the file.
I noticed in git_log the git_repo path is used in the parameter
--git-dir, but I do not understand how I should use it in this case,
since the scenario seems very different.
It seems to me that I need to specify the path of the folder
containing the file in the parameter of the function, or is that not the
case?
On Wed, Jul 22, 2020 at 6:31 AM Carlos Paradis wrote:
You may want to look at this response on GitHub so the embedded links appear
and it is easier to read:
I think that the blame message, when the parameter -p is specified, returns
in the metadata the hash of the previous commit and the file it touched.
If you mean by previous the ancestor commit to the one you pass as a
parameter, then that should not be the case. For example,
Try: git blame -p -C -w -M 8459aca
R/filter.R
And the output should be a single file, filter.R, and the commit hash will
be the same as the one provided as a parameter.
Take a look at the options man page
<https://www.git-scm.com/docs/git-blame#_options> and the formatting man
page <https://www.git-scm.com/docs/git-blame#_the_porcelain_format>. -C
and -M seem to be the reason we saw multiple files being shown despite
providing only one. Because of that, I am afraid we can't just parse commit
hashes but instead parse everything. But this shouldn't be too hard,
considering the doc above explains the layout of the file.
Also, in regards to my comment 5 days ago and
the associated command, try removing -p, so it is easier to read:
git blame -C -w -M 9b21375 R/git.R
and you will get:
9da2897 R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 1) # This Source Code Form is subject to the terms of the Mozilla Public
9da2897 R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 2) # License, v. 2.0. If a copy of the MPL was not distributed with this
9da2897 R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 3) # file, You can obtain one at https://mozilla.org/MPL/2.0/.
9da2897 R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 4)
9b21375 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 5) #' Performs a git checkout on specified repo
9b21375 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 6) #'
9b21375 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 7) #' @param commit_hash The commit hash the repo should be checkout
9b21375 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 8) #' @param git_repo_path The git repo path
9b21375 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 9) #' @return Any error message generated by git
9b21375 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 10) #' @export
9b21375 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 11) git_checkout <- function(commit_hash,git_repo_path){
9b21375 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 12) # Expand paths (e.g. "~/Desktop" => "/Users/someuser/Desktop")
...
As you can see, it is just the license header that adds another file
instead of R/git.R. Why? If you see the commits:
- 9da2897
- 9b21375
You will notice the license header commit and the file was where I first
added the license header, and the commit was when I first explicitly added
the license to the R package. This is likely the behavior of -C and -M.
Try experimenting removing the flags -C and -M, and you will see when you
remove -C, the file now shows the git blame as we originally expected:
git blame -w -M 9b21375 R/git.R outputs:
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 1) # This Source Code Form is subject to the terms of the Mozilla Public
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 2) # License, v. 2.0. If a copy of the MPL was not distributed with this
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 3) # file, You can obtain one at https://mozilla.org/MPL/2.0/.
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 4)
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 5) #' Performs a git checkout on specified repo
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 6) #'
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 7) #' @param commit_hash The commit hash the repo should be checkout
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 8) #' @param git_repo_path The git repo path
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 9) #' @return Any error message generated by git
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 10) #' @export
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 11) git_checkout <- function(commit_hash,git_repo_path){
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 12) # Expand paths (e.g. "~/Desktop" => "/Users/someuser/Desktop")
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 13) git_repo_path <- path.expand(git_repo_path)
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 14) # Remove ".git"
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 15) folder_path <- stri_replace_last(git_repo_path,replacement="",regex=".git")
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 16) error <- system2('git',
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 17) args = c('--git-dir',
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 18) git_repo_path,
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 19) '--work-tree',
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 20) folder_path,
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 21) 'checkout',
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 22) commit_hash),
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 23) stdout = TRUE,
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 24) stderr = FALSE)
9b21375 (Carlos Paradis 2020-07-03 03:24:48 -1000 25) return(error)
...
So the bottom line is, take a look at what is the specified format of -p
and encode in your parse_git_blame() function, and reason -C and -M to
pass the explanation onto the function @description. This is where, over
time, I think, it pays off to build these functions to wrap around these
interfaces: They never make immediate sense, and once we wrap and document
them properly, we don't need to go through this again. Stackoverflow has
several questions on what git blame does as well, so I am led to believe
it is confusing for its name.
Let me know if you have any other questions.
|
Adds a wrapper to the command git blame Signed-off-by: Carlos Paradis <carlosviansi@gmail.com>
Hi Massimo, I went ahead and pushed to the repo a As you said, git_blame differs from the other git_functions in that it also requires a file and a commit_hash, and I can see how this can lead to confusion if it is not clear what exactly is providing the inputs to it so you can decide where your assumption of the current working directory should be. Add that to the fact git requires an extra flag to be executed from a different folder and this just becomes plain confusing. You can use the
This function will return a character vector, where each element of the character is a line of the commithash and the file you specified (plus what I explained earlier concerning -C and -M). This was executed from the working directory The truth is, there is no silver bullet way to code this so the config works from any possible path. This is also why the API makes no assumption whatsoever of the config files when you make function calls. It is the responsibility of the user to say where things are. However, and not versioned yet, there is also a component of this repo that is batch mode (will eventually appear under a Take note how I call I hope this helps! |
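Just to clarify, the git_repo_path parameter will still be supplied by the user, and is already available in the config file. The others will not: they will be obtained via parse_gitlog. You can use my example call above as an example call for your parse_git_blame function to ignore these details too. For now, don't worry about making additional files for .Rmd showcase, or additional functions. Just commit parse_git_blame(git_repo_path,flags,commit_hash,file_path) in the same branch you already have a Pull Request. Inside of the function, call git_blame with the parameters, obtain the blame data, and write your code for the parser. Make a commit with just this function. Also, don't worry about how to update your branch with the commit I just did so git_blame appears on it (you can just copy and paste on the R Console the function definition to use it for now). We can go over that afterward to avoid mixing the code effort with the GitHub workflow nuances. Let me know if you have questions! |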
Okay. I would like to ask one thing: in the parse_git_blame function I have
written so far, I used the flag --line-porcelain instead of -p. It
shows the porcelain format, but outputs commit information for each line,
not just the first time a commit is referenced. Since I am storing all the
information for each line in a tabular format (author name, committer
name, email, etc.), this made the parsing easier. Otherwise I could redo it
with -p, but I think it would become more complex. Is that right, or
are there reasons to use -p instead of --line-porcelain? Is the information
other than the commit hash and line relevant, or should the output be more
minimalistic?
Thanks,
Massimo
On Thu, Jul 23, 2020 at 2:48 AM Carlos Paradis wrote:
Just to clarify, the git_repo_path parameter will be still supplied by
the user, which already is available in the config file. The others *will
not*. That will be obtained via parse_gitlog. You can use my example call
above as an example call for your parse_git_blame function to ignore
these details too.
For now, don't worry about making additional files for .Rmd showcase, or
additional functions. Just commit
parse_git_blame(git_repo_path,flags,commit_hash,file_path) in the same
branch you already have a Pull Request. Inside of the function, call
git_blame with the parameters, obtain the blame data, and write your code
for the parser. Make a commit with just this function.
Also, don't worry about how to update your branch with the commit I just
did so git_blame appears on it (you can just copy and paste on the R
Console the function definition to use it for now). We can go over that
afterward to avoid mixing the code effort with the GitHub workflow nuances.
Let me know if you have questions!
|
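I don't know. Codeface chose -p over --line-porcelain and git blame documentation says -p is the one to be used for machine consumption, not --line-porcelain. I also do not know the output, can you paste it here as I did in my responses? The main problem, I believe, in putting everything in 1 line is that you run the risk that the actual content of the code, which should be parsed, ends up being affected by whatever you use to separate the various columns. Meanwhile -p already does the work for you, so there is no risk of the content of the code being affected. You should parse all data made available by git blame. There is no reason not to since they can all be potentially relevant to other work. |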
The beginning of the output is:
```
9da2897 1 1 4
author Carlos Paradis
author-mail <carlosviansi@gmail.com>
author-time 1590633783
author-tz -1000
committer Carlos Paradis
committer-mail <carlosviansi@gmail.com>
committer-time 1590633783
committer-tz -1000
summary i #12 Add MPL 2.0 License
previous 05dc176 R/parsers.R
filename R/parsers.R
# This Source Code Form is subject to the terms of the Mozilla Public
9da2897 2 2
author Carlos Paradis
author-mail <carlosviansi@gmail.com>
author-time 1590633783
author-tz -1000
committer Carlos Paradis
committer-mail <carlosviansi@gmail.com>
committer-time 1590633783
committer-tz -1000
summary i #12 Add MPL 2.0 License
```
It is the same as with -p, but with -p, the information about a commit such as
author name, mail, etc. is listed only the first time the commit appears.
Then, when the same commit reappears on another line, the output only gives
the commit hash, the line numbers, and the content of the line, like this:
9da2897 2 2
# License, v. 2.0. If a copy of the MPL was not distributed with this
On the contrary, with --line-porcelain it would be like this, because it
repeats the info every time:
9da2897 2 2
author Carlos Paradis
author-mail <carlosviansi@gmail.com>
author-time 1590633783
author-tz -1000
committer Carlos Paradis
committer-mail <carlosviansi@gmail.com>
committer-time 1590633783
committer-tz -1000
summary i #12 Add MPL 2.0 License
I wonder if the info like author, author-email etc should be included in
the final table the function parse_git_blame gives. If that's the case,
--line-porcelain would make the work easier, whereas if they are not needed
-p would be a better option.
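To make the trade-off concrete, here is a minimal base R sketch of how -p output could be parsed without loops even though the metadata appears only once per commit: headers and content lines are paired by position, and metadata is looked up by commit hash. The input is an abridged version of the excerpt in this thread (short hashes as shown there, with the tab that precedes content lines in the porcelain format); lines_table, meta_long and lookup are hypothetical names, and the sketch assumes each commit maps to a single filename, which may not hold when -C or -M attribute lines to several files.

```r
# Abridged `git blame -p` output based on the excerpt in this thread.
blame_text <- c(
  "9da2897 1 1 4",
  "author Carlos Paradis",
  "author-time 1590633783",
  "filename R/parsers.R",
  "\t# This Source Code Form is subject to the terms of the Mozilla Public",
  "9da2897 2 2",
  "\t# License, v. 2.0. If a copy of the MPL was not distributed with this",
  "9b21375 5 5 39",
  "author Carlos Paradis",
  "author-time 1593782688",
  "filename R/git.R",
  "\t#' Performs a git checkout on specified repo",
  "9b21375 6 6",
  "\t#'"
)

# Classify every line: header ("<hash> <orig> <final> [<n>]"), content (tab), or metadata.
is_header  <- grepl("^[0-9a-f]{7,40} [0-9]+ [0-9]+", blame_text)
is_content <- startsWith(blame_text, "\t")
is_meta    <- !is_header & !is_content

# The i-th header pairs with the i-th content line.
header_fields <- strsplit(blame_text[is_header], " ", fixed = TRUE)
lines_table <- data.frame(
  commit     = vapply(header_fields, `[`, character(1), 1),
  orig_line  = as.integer(vapply(header_fields, `[`, character(1), 2)),
  final_line = as.integer(vapply(header_fields, `[`, character(1), 3)),
  content    = substring(blame_text[is_content], 2),  # drop the leading tab
  stringsAsFactors = FALSE
)

# Each metadata line belongs to the most recent header above it.
owner <- lines_table$commit[cumsum(is_header)][is_meta]
meta_long <- data.frame(
  commit = owner,
  key    = sub(" .*$", "", blame_text[is_meta]),
  value  = sub("^[^ ]+ ?", "", blame_text[is_meta]),
  stringsAsFactors = FALSE
)

# Metadata appears once per commit, so fill it in for every line by hash.
lookup <- function(key) {
  rows <- meta_long[meta_long$key == key, ]
  rows$value[match(lines_table$commit, rows$commit)]
}
lines_table$author      <- lookup("author")
lines_table$author_time <- as.numeric(lookup("author-time"))
lines_table$file_name   <- lookup("filename")

print(lines_table)
```

With --line-porcelain the lookup step would be unnecessary, since every header is followed by its own metadata, at the cost of a larger output to read.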
On Thu, Jul 23, 2020 at 8:54 AM Carlos Paradis wrote:
I don't know. Codeface chose -p over --line-porcelain and git blame
documentation says -p is the one to be used for machine consumption, not
--line-porcelain. I also do not know the output, can you paste it here as
I did in my responses?
The main problem, I believe, in putting everything in 1 line is that you
run the risk the actual content of the code, which should be parsed, ends
up being affected by whatever you use to separate the various columns.
Meanwhile -p already does the work for you, so there is no risk of the
content of the code being affected.
|
I see what you're saying. If that is the case, then Also, try to avoid for loops if you can and use vectorized operations instead. Loops in R are terrible, and Looking at the file example from:
For example on the above if I use I get (a slice of the output):
With this function, you should be able to parse out the content and line numbers of commits. See the holes of NAs from 768 to 772? Maybe you can capitalize on the fact The other regex symbols you can find here: https://rdrr.io/cran/stringi/man/stringi-search-regex.html |
Hi Carlos,
I tried to follow your advice as much as I could and I created a function
that converts the log message into a table. The function is:
library(stringi)
library(data.table)

parse_git_blame <- function(git_repo_path,flags,commit_hash,file_path){
  blame_file <- git_blame(git_repo_path,
                          flags,
                          commit_hash,
                          file_path)
  # Split point: just before every "<40-char hash> <number>" header
  cond <- '(?<=.)(?=[a-f0-9]{40} \\d+)'
  # Join the blame output into one string, then cut it into one chunk per blamed line
  pasted <- paste(blame_file,collapse = ' ')
  splitted <- stri_split_regex(pasted,cond)
  chunks <- unlist(splitted)
  # Header of each chunk: commit hash, original line number, final line number
  commit_line <- stri_match_all_regex(chunks,"^([a-f0-9]{40}) (\\d+) (\\d+)")
  commits <- unlist(lapply(commit_line, `[[`, 2))
  line_number_of_the_line_in_the_original_file <- unlist(lapply(commit_line, `[[`, 3))
  line_number_of_the_line_in_the_final_file <- unlist(lapply(commit_line, `[[`, 4))
  # Each field is the text between its key and the key that follows it
  authors <- unlist(stri_match_all_regex(chunks, "(?<=author).*(?=author-mail)"))
  author_mails <- unlist(stri_match_all_regex(chunks, "(?<=author-mail).*(?=author-time)"))
  author_time <- unlist(stri_match_all_regex(chunks, "(?<=author-time).*(?=author-tz)"))
  #author_tz <- stri_match_all_regex(chunks, "(?<=author-tz).*(?=committer)")
  committers <- unlist(stri_match_all_regex(chunks, "(?<=committer).*(?=committer-mail)"))
  committer_mails <- unlist(stri_match_all_regex(chunks, "(?<=committer-mail).*(?=committer-time)"))
  committer_time <- unlist(stri_match_all_regex(chunks, "(?<=committer-time).*(?=committer-tz)"))
  committer_tz <- unlist(stri_match_all_regex(chunks, "(?<=committer-tz).*(?=summary)"))
  summaries <- unlist(stri_match_all_regex(chunks, "(?<=summary).*(?=filename)"))
  file_names <- unlist(stri_match_all_regex(chunks, "(?<=filename).*(?=\t)"))
  # The content of the blamed line is everything after the tab
  lines <- unlist(stri_match_all_regex(chunks, "(?<=\t).*$"))
  df <- data.table(commits,
                   line_number_of_the_line_in_the_original_file,
                   line_number_of_the_line_in_the_final_file,
                   authors,author_mails,author_time,
                   committers,committer_mails,committer_time,committer_tz,
                   summaries,file_names,lines)
  return(df)
}
The call pasted <- paste(blame_file, collapse = ' ') joins all the lines of
the blame file generated with your function git_blame(), so that all the
text is on one line. Then splitted <- stri_split_regex(pasted, cond) splits
this block on the commit hash, so that each string contains the metadata of
one line of the file to parse, for example:
"f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab 375 324 author Dr. Stephen Henson
author-mail <steve@openssl.org> author-time 1144762132 author-tz +0000
committer Dr. Stephen Henson committer-mail <steve@openssl.org>
committer-time 1144762132 committer-tz +0000 summary Initial keygen
support. filename apps/genpkey.c \t}"
Afterwards, the regex functions are applied to retrieve each piece of this
metadata, and the results are converted into a table. The output looks like:
parse_git_blame(git_repo_path,flags,commit_hash,file_path)

     commits                                   line_number_of_the_line_in_the_original_file
  1: 0f113f3ee4d629ef9a4a30911b22b224772085e5  2
  2: 6738bf1417289a14758590fca5a26b62c9b2c0be  2
  3: f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab  56
  4: 846e33c729311169d9c988ceba29484b3783f244  4
  5: 846e33c729311169d9c988ceba29484b3783f244  5
 ---
320: 0f113f3ee4d629ef9a4a30911b22b224772085e5  398
321: f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab  369
322: f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab  370
323: f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab  374
324: f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab  375

     line_number_of_the_line_in_the_final_file  authors             author_mails         author_time
  1: 1                                          Matt Caswell        <matt@openssl.org>   1421898055
  2: 2                                          Matt Caswell        <matt@openssl.org>   1518526289
  3: 3                                          Dr. Stephen Henson  <steve@openssl.org>  1144762132
  4: 4                                          Rich Salz           <rsalz@openssl.org>  1463509110
  5: 5                                          Rich Salz           <rsalz@openssl.org>  1463509110
 ---
320: 320                                        Matt Caswell        <matt@openssl.org>   1421898055
321: 321                                        Dr. Stephen Henson  <steve@openssl.org>  1144762132
322: 322                                        Dr. Stephen Henson  <steve@openssl.org>  1144762132
323: 323                                        Dr. Stephen Henson  <steve@openssl.org>  1144762132
324: 324                                        Dr. Stephen Henson  <steve@openssl.org>  1144762132

     committers          committer_mails      committer_time  committer_tz
  1: Matt Caswell        <matt@openssl.org>   1421918409      +0000
  2: Matt Caswell        <matt@openssl.org>   1518530365      +0000
  3: Dr. Stephen Henson  <steve@openssl.org>  1144762132      +0000
  4: Rich Salz           <rsalz@openssl.org>  1463509159      -0400
  5: Rich Salz           <rsalz@openssl.org>  1463509159      -0400
 ---
320: Matt Caswell        <matt@openssl.org>   1421918409      +0000
321: Dr. Stephen Henson  <steve@openssl.org>  1144762132      +0000
322: Dr. Stephen Henson  <steve@openssl.org>  1144762132      +0000
323: Dr. Stephen Henson  <steve@openssl.org>  1144762132      +0000
324: Dr. Stephen Henson  <steve@openssl.org>  1144762132      +0000

     summaries
  1: Run util/openssl-format-source -v -c . previous 22b52164aaed31d6e93dbd2d397ace041360e6aa apps/genpkey.c
  2: Update copyright year previous abfee348c613ea8037561ee850cac94218f73f64 apps/genpkey.c
  3: Initial keygen support.
  4: Copyright consolidation 01/10 previous be9c8deb7de92feb5e5300f2e46d3516bcc43c00 apps/genpkey.c
  5: Copyright consolidation 01/10 previous be9c8deb7de92feb5e5300f2e46d3516bcc43c00 apps/genpkey.c
 ---
320: Run util/openssl-format-source -v -c . previous 22b52164aaed31d6e93dbd2d397ace041360e6aa apps/genpkey.c
321: Initial keygen support.
322: Initial keygen support.
323: Initial keygen support.
324: Initial keygen support.

     file_names      lines
  1: apps/genpkey.c  /*
  2: apps/genpkey.c   * Copyright 2006-2018 The OpenSSL Project Authors. All Rights Reserved.
  3: apps/genpkey.c   *
  4: apps/genpkey.c   * Licensed under the OpenSSL license (the "License"). You may not use
  5: apps/genpkey.c   * this file except in compliance with the License. You can obtain a copy
 ---
320: apps/genpkey.c  c = '\\n';
321: apps/genpkey.c  BIO_write(b, &c, 1);
322: apps/genpkey.c  (void)BIO_flush(b);
323: apps/genpkey.c  return 1;
324: apps/genpkey.c  }
The content of the original file is in the column named lines, which is the
last one. Without --line-porcelain, the other columns, such as authors,
would be mostly NA, because they are defined only once per commit, but the
columns lines, commits, line_number_of_the_line_in_the_original_file, and
line_number_of_the_line_in_the_final_file would be full anyway.
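Because the metadata appears only once per commit in that case, one way to fill the sparse columns is to look each line's hash up in a per-commit table. This is only a sketch: the tables `lines_df` and `commit_meta`, the short hashes, and the author names are hypothetical stand-ins for what a parser would produce.

```r
# Per-line table as it would come from the header lines (hypothetical values).
lines_df <- data.frame(
  commit_hash = c("c1", "c2", "c1", "c1"),
  final_line = 1:4,
  stringsAsFactors = FALSE
)

# Metadata parsed once per commit, at the commit's first appearance.
commit_meta <- data.frame(
  commit_hash = c("c1", "c2"),
  author = c("Alice", "Bob"),
  author_time = c(1144762132, 1463509110),
  stringsAsFactors = FALSE
)

# Vectorized lookup: position of each line's commit in the metadata table.
idx <- match(lines_df$commit_hash, commit_meta$commit_hash)
lines_df$author <- commit_meta$author[idx]
lines_df$author_time <- commit_meta$author_time[idx]
```

After the lookup every line carries its author and timestamp, so the table is as complete as the --line-porcelain one without the repeated metadata in the raw output.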
Please let me know what you think about it.
kind regards,
Massimo
|
Hi Massimo, could you format your post on GitHub and enclose the code portions and the table in code blocks? It is really hard to read and, for some reason, my editing will not work. Also, for the table, maybe it is best that you send it to me as an e-mail attachment. R's way of breaking columns instead of providing a horizontal bar has always been very confusing to me. Thanks for the work :) |
Actually, scratch that. Can you go ahead and commit the code on the branch you already have a PR? We are on the same page on the function, I will make any final comments there :) Thanks. |
Ok, I committed. I'll wait for your review.
Thank you,
Massimo
|
@massihonda thank you for the changes. I have been working on this ever since you sent me the new commit. There are some changes I had to make locally because of how uctags works. The good news is that, as of last night, I have a vignette that will successfully blame a file and use its blamed content column to parse using uctags, generating the additional information of whether a line is a function or not:
I.e. we are now able to get all the information of git blame and add an additional column indicating whether said line is a function or not. So we're almost there with the gitlog at function level.
There is one thing I overlooked, however: when we use git blame, we obtain information on every single line. I originally thought uctags would do the same, but it does not. The uctags flags I am currently using do not output what every line of code is. Rather, they output a few lines, some of which
What is strange is that Codeface does not use said approach; it uses more flags and doesn't include
As you see, they do more than functions too. This is very important to understand, as it affects the dependencies we are really looking at when we say "at function level". So that's basically where I am at. Remember that Codeface uses Exuberant Ctags according to the paper, but we use Universal Ctags (uctags). Exuberant Ctags is now dead code as far as I understand, and Universal Ctags is the successor. If you want to take a look at this to speed up getting to the gitlog functions, check the README.md on how to install uctags, and try running a few examples to see if you can make sense of all these flags. Let me know if anything doesn't make sense! |
I have summarized my findings on #2 since it is not git blame related. I think I have everything I need now. Give me a few days and we will have the gitlog at the function level. |
Adds a git blame parser. Note this is a git command, and the filepath needs only to exist in the git log to be output. Said filepath and commit hash can be obtained, for example, using parse_gitlog(). A vignette will be added in a later commit showing how to combine both. Signed-off-by: Carlos Paradis <carlosviansi@gmail.com>
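@massihonda when you have a chance, take a look at this function. I added you as a ctb on DESCRIPTION as part of this commit, and when I accept your PR for the configs, git will add you as a contributor to the page. I will be pushing a vignette in a moment. A few comments:
- The actual git blame output needed was 1 line of the original source code file annotated with the metadata. Otherwise, we can't easily use the line_content column to pass it to uctags.
- When I suggested using a regex to parse commit hashes and not using loops, what I was trying to say is that you could observe how the format of the file changes, and then apply the regex to the lines. For example, author- information is followed by committer information, etc. Using this approach, I was able to narrow down the possible git blame files to one of 3 cases (see sub-function parse_line_content). In your PR, you scan the entire git blame file every time you call stri_match_all_regex(), which is over 10 times, even if you already know from a prior call of stri_match_all_regex() that the line was associated with something else. The pushed function only applies the regex matching for the commit hash to all lines. For the remaining ones, it leverages the 3 possible ways the blame file format appears, so that each regex is applied once and only to the lines it concerns.
I went ahead and extended the API files to have regex. This will allow us to unit test them as well. If you can think of counter cases, please let me know. I will send the remaining commits in a moment for the use of uctags to a separate issue. |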
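As a rough illustration of avoiding repeated scans of the whole blame file, the sketch below classifies every metadata line in a single regex pass. It is not the committed parse_line_content sub-function, whose three cases are not shown in this thread. The key list follows the fields seen in the porcelain output above, and the sample lines are hypothetical.

```r
library(stringi)

# Toy `git blame -p` record (hypothetical data).
blame_lines <- c(
  paste(strrep("a", 40), "10 1 1"),
  "author Alice Example",
  "author-mail <alice@example.org>",
  "author-time 1144762132",
  "author-tz +0000",
  "summary Initial commit",
  "filename demo.c",
  "\tint x = 0;"
)

# A single pass splits each metadata line into its key and its value.
# Header and content lines do not start with a known key, so they yield NA.
keys <- paste0(
  "^(author|author-mail|author-time|author-tz|committer|committer-mail|",
  "committer-time|committer-tz|summary|previous|filename) (.*)$"
)
kv <- stri_match_first_regex(blame_lines, keys)

# Group the values by key; rows with an NA key are dropped by split().
meta <- split(kv[, 3], kv[, 2])
```

Here `meta$author` holds the author names and `meta[["author-mail"]]` the addresses, each obtained from the same single scan. Anchoring the pattern at the line start means source code that merely mentions "author" cannot be mistaken for metadata, since content lines begin with a tab.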
Ok Carlos, I will pay more attention next time if I work on another
part!
|
Part of the effort towards #2
The idea is that we have a git_blame() parser function added to R/git.R