Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to remove forks from the analysis #109

Closed
warenlg opened this issue Jun 27, 2019 · 15 comments · Fixed by #196

Comments

@warenlg
Copy link
Member

commented Jun 27, 2019

When analyzing sourced's codebase with the dashboard, huge repos that have been forked under src-d have made some visuals become irrelavant and even confuse the overview of the codebase/company's activity. For example,

languages_srcd

commits_srcd

Here, we include the src-d/or-tools huge C repo that somebody forked one day, in the analysis, and I think by default we should not. Also, another consequence is that, all the contributors of those forsks are going to be analyzed in the dashboard and thus be presented as the analyzed company's contributors whereas they never did one contribution to any the company's repos.

I try to filter them with gitbase one day in vain. So, to have a relevant analysis of a company, I fetch the data myself excluding forks with the GitHub API

#!/bin/bash

output_directory="/home/waren/tmp/"
repos=$(curl -s "https://api.github.com/orgs/src-d/repos?page=1&per_page=100" | jq '.[].url' | tr -d '"')
for repo in $repos; do
    is_fork=$(curl -s $repo | jq '.fork')
    if [ $is_fork == "false" ]; then
        html_url=$(curl -s $repo | jq '.html_url' | tr -d '"')
        git clone $html_url $output_directory
    fi
done
@dpordomingo

This comment has been minimized.

Copy link
Member

commented Jun 27, 2019

I'd make it optional; with something like --no-forks, as proposed by #105 (comment)

@warenlg

This comment has been minimized.

Copy link
Member Author

commented Jun 27, 2019

Agree, but IMO if so, the flag should be heavily highlighted or put by default, otherwise the first impression of people using the dashboard are going to be: "Well... this is not my company"

@smacker

This comment has been minimized.

Copy link
Member

commented Jun 27, 2019

Heavy +++++ for this by default.

@dpordomingo

This comment has been minimized.

Copy link
Member

commented Jun 27, 2019

yes, whatever the default behavior is, if it's properly documented and being possible to alter, LGTM

@smacker

This comment has been minimized.

Copy link
Member

commented Jun 27, 2019

@marnovo wdyt? let's include it in the next release?

@warenlg warenlg changed the title Remove forks from the analysis Allow to remove forks from the analysis Jun 27, 2019

@marnovo

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

Agreed about this being:

  1. The default behavior
  2. Made explicit to users
  3. Able to switch it on/off
  4. Include in the stable release

The question is: what is the best way to achieve 2 and 3?

@dpordomingo

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

Should we consider the import something to be tweaked from sourced init?

If so, it could be something like:

$ sourced init org --help

    Will import all repos owned by the passed org, but the ones that are forks

Arguments:
    [org...]     list of organizations to import

Options:
    --with-forks=[fork|parent|both]
                 fork    will include the forks owned by the passed org
                 parent  will include parent repos of the forks owned by the passed org
                 both    will include both, forks owned by the passed org and its parents
    --exclude=file-with-ignored

That way:

$ sourced init org bblfsh

Will import regular repos of bblfsh; and to also import its forks it should be requested explicitly; and it could be also excluded some repos if passed a file with the repos to exclude.

@carlosms

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

To me it's also important to discuss how to visualize it after init.
Should we display somehow in sourced workdirs that your src-d org was initialized with or without forks?

And about being able to switch it on/off: if a user inits without forks, and later wants to init the same org with forks, should we...?

  • create 2 different environments, visible in sourced workdirs
  • override the previous one. Get rid of the previous gitcollector & ghsync data, and replace with new data to download
@dpordomingo

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

I think that sourced should behave the same than when the user runs init again over the same org, and that org added new repos: the environment is just updated with the new content instead of creating a new environment.
Today, the init ties the download and the environment, so it should apply here too. Creating two envs because the user runs init two times (with and without forks), should be through something like init --org=bblfsh --env=myenv1 and init --org=bblfsh --env=myenv2.

@smacker

This comment has been minimized.

Copy link
Member

commented Jul 3, 2019

Some relevant info.

Forks in gitbase:

Maxim: is it possible to recognize that a repository is a fork using gitbase? Before in jgit-spark-connecter there was a field for it. But now I don’t see it.
Antonio: nup

Forks in gitcollector:

Maxim: part of the solution can be to add --exclude-forks flags to gitcollector.
Manu: I think adding a flag to make the discovery not enqueuing forked repositories shouldn't be complicated

@marnovo

This comment has been minimized.

Copy link
Member

commented Jul 3, 2019

@dpordomingo's approach seems reasonable.

@carlosms maybe workdirs command can be turned into a more general purpose "status" of the currently active environment? Or expand the same on status command?

@smacker that would make sense.

@carlosms

This comment has been minimized.

Copy link
Contributor

commented Jul 3, 2019

We could have something like:

  • source workdirs list to show the list of workdirs (what the current sourced workdirs does now)
  • source workdirs show [orgName] to show the config used for the currently active workdir (or the given organization name/file path).

workdirs show could be just the output of ~/.sourced/workdirs/dir/.env, it contains information about the organizations cloned (or the local file path), and if we add a new env var to filter out forks, it will also be visible here.

	COMPOSE_PROJECT_NAME=srcd-TUxvbkNvZGU=
	GITBASE_VOLUME_TYPE=volume
	GITBASE_VOLUME_SOURCE=gitbase_repositories
	GITBASE_SIVA=true
	GITHUB_ORGANIZATIONS=MLonCode
	GITHUB_TOKEN=xxx

As an improvement on that we could have a more user friendly .txt file that gets created at the same time as .env. It can contain the same, but in a more readable way. And output that file with workdirs show

.

Alternatively we can do something similar to the progress tables in the welcome dashboard in the UI. We can have a table where ghsync puts true/false to show if it the download includes forks or not.

@marnovo

This comment has been minimized.

Copy link
Member

commented Jul 12, 2019

@src-d/applications

  1. The two actions make sense. Just one concerned regarding using the workdirs naming for more things because of UX (see more below*).
  2. The txt seems unnecessary, we could just output comments over the variable/value pairs with plain text explanation.
  3. The table is a nice touch, but where would that be displayed? Also, not super necessary at this point.

*I'll pose this question here for context (should probably live on a issue of this own), but would it make sense bundling it all under a status command that gives you all that is currently "active" or initialized/loaded? E.g.:

A. sourced status workdirs -> List all working directories and indicate the active, as we have today with sourced workdirs
B. sourced status components -> show the status of all components, as we have today w/ sourced status
C. sourced status config -> to show the config used for the currently active workdir (or the given organization name/file path
D. sourced status -> all of the above

@carlosms

This comment has been minimized.

Copy link
Contributor

commented Aug 2, 2019

That new sourced status proposal makes sense to me.

@marnovo

This comment has been minimized.

Copy link
Member

commented Aug 6, 2019

That new sourced status proposal makes sense to me.

Ok, created an issue for it: src-d/backlog#1458

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.