a utility to crawl all your repositories and find the information you need - now also working for Gitlab !

societe-generale/github-crawler

GitHub crawler

Why can it be useful ?

With the current move to microservices, it is not rare for a team that previously had a couple of repositories to now have several dozen. Keeping a minimum of consistency between the repositories becomes a challenge, and inconsistencies create risks :

  • have we updated all our repositories so that they use the latest Docker image ?
  • have we set up the proper security config in all our repositories ?
  • which versions of library X are we using across our repositories ?
  • are we using a library that we are not supposed to use anymore ?
  • do we use hardcoded strings in unexpected places in our code ?
  • which team owns a given repository ?

These are all simple questions that sometimes take hours to answer, with always the risk of missing one repository in the analysis, making the answer inaccurate.

Github crawler aims at automating the information gathering, by crawling an organization's repositories through the GitHub API. Even if your organization has hundreds of repositories, Github crawler will be able to report very useful information in a few seconds !

Getting started

If you want to provide your own configuration without any code customisation, then you can simply :

  • download the latest github-crawler-starter-exec jar from Maven
  • place your config file (say application.yml) next to the jar - see below
  • run from command line :
java -jar github-crawler-exec.jar --spring.config.location=./

(more examples are available in the sections below, i.e. how to run from your IDE and how to extend github crawler, and in this repository)

How does it work ?

Github crawler is a Spring Boot command line application, written in Kotlin.

Based on a simple configuration, it uses the GitHub API to list the repositories of a given organization, then, for each repository, looks for patterns in specified files or performs other actions.

You can easily exclude repositories from the analysis, and configure the files and patterns you're interested in. If you have several types of repositories (front-end, back-end, config repositories for instance), you can have separate configuration files so that the information retrieved is relevant to each scope of analysis.

Several output types are available in this package :

  • console is the default and will be used if no output is configured
  • a simple "raw" file output
  • HTTP output, which enables you to POST the results to an endpoint like ElasticSearch, for easy analysis in Kibana
  • some specific "CI-droid oriented" outputs, to easily "pipe" the crawler output to CI-droid

Configuration on crawler side

The configuration below shows how outputs, indicators and actions are configured under the crawler prefix.

crawler:
    
    source-control:    
        # since v2.0.0, we need to mention the source control type. defaults to GitHub, but other options are possible
        # as defined in https://github.com/societe-generale/github-crawler/blob/master/github-crawler-core/src/main/kotlin/com/societegenerale/githubcrawler/GithubConfig.kt
        type: "GITHUB"
        # the base GitHub URL for your Github enterprise or on prem GitLab instance to crawl. 
        # defaults to public URL for GitHub and Azure Devops if not provided in config. 
        url: https://my.githubEnterprise/api/v3
        apiToken: "YOUR_TOKEN"
        # the name of the GitHub organization (or equivalent for other systems) to crawl. To fetch the repositories, the crawler will hit 
        # https://${gitHub.url}/api/v3/orgs/${organizationName}/repos
        organizationName: MyOrganization
        # default is false - API URL is slightly different depending on whether you're crawling an organization (most common case) or a user's repositories
        crawlUsersRepoInsteadOfOrgasRepos: false
     
    #repositories matching one of the configured regexp will be excluded
    repositoriesToExclude:
      # exclude the ones that start with "financing-platform-" and end with "-run"
      - "^financing-platform-.*-run$"
      # exclude the ones that DON'T start with "financing-platform-" 
      - "^(?!financing-platform-.*$).*"

    # repositoriesToExclude and repositoriesToInclude ARE MUTUALLY EXCLUSIVE : configure only one of them, you will get an error at startup if both are configured
    repositoriesToInclude:
      # include the ones that start with "financing-platform-" and end with "-service" (and exclude the ones that don't match)
      - "^financing-platform-.*-service$"

    
    # do you want the excluded repositories to be written in output ? (default is false)
    # even if they won't have any indicators attached, it can be useful to output excluded repositories, 
    # especially at beginning, to make sure you're not missing any
    publishExcludedRepositories: true
    
    # by default, we'll crawl only the repositories' default branch. But in some cases, you may want to crawl all branches
    crawlAllBranches: true
    
    # by default, we'll crawl repositories in parallel. However, especially when facing an issue, crawling sequentially can help identify it faster,
    # hence the option to switch between parallel and sequential processing
    crawl-in-parallel: true
    
    # default output is console - it will be configured automatically if no output is defined
    # the crawler takes a list of outputs, so you can configure several
    outputs:
      file:
        # we'll output one repository branch per line, in a file named ${filenamePrefix}_yyyyMMdd_hhmmss.txt
        filenamePrefix: "orgaCheckupOutput"
      http:
        # we'll POST one repository branch individually to ${targetUrl}
        targetUrl: "http://someElasticSearchServer:9201/technologymap/MyOrganization"
     
    # list the files to crawl for, and the patterns to look for in each file         
    indicatorsToFetchByFile:
    # use syntax with "[....]" to escape the dot in the file name (configuration can't be parsed otherwise, as "." is a meaningful character in yaml files)
      "[pom.xml]":
        # name of the indicator that will be reported for that repository in the output
        - name: spring_boot_starter_parent_version
          # name of the method to find the value in the file, pointing to one of the implementation classes of FileContentParser
          type: findDependencyVersionInXml
          # the parameters to the method, specific to each method type
          params:
            # findDependencyVersionInXml needs an artifactId as a parameter : it will find the version for that Maven artifact by doing a SAX parsing, even if the version is a ${variable} defined in <properties> section
            artifactId: spring-boot-starter-parent
        - name: spring_boot_dependencies_version
          type: findDependencyVersionInXml
          params:
            artifactId: spring-boot-dependencies
      #another file to parse..
      Dockerfile:
        - name: docker_image_used
          # findFirstValueWithRegexpCapture needs a pattern as a parameter. The pattern needs to contain a group capture (see https://regexone.com/lesson/capturing_groups)
          # the first match will be returned as the value for this indicator
          type: findFirstValueWithRegexpCapture
          params:
            pattern: ".*\\/(.*)\\s?"
    
      "[src/main/resources/application.yml]":
          - name: spring_application_name
            type: findPropertyValueInYamlFile
            params:
              propertyName: "spring.application.name"
              
# We can also define a list of miscellaneous actions to perform : this includes things like various searches, ownership computation

    misc-repository-tasks:
       - name: "nbOfMetricsInPomXml"
       #will return the number of hits returned by a search using queryString, for each repo
         type: "countHitsOnRepoSearch"
         params:
           queryString: "q=metrics+extension:xml"
       - name: "pathsWhere_ConsulCatalogWatch_IsFound"
       #will return the paths for each hit on the search using queryString, for each repo
         type: "pathsForHitsOnRepoSearch"
         params:
           queryString: "q=ConsulCatalogWatch"          
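
The exclusion patterns in the sample configuration can be tried out before running a crawl. The following is a minimal sketch, assuming the patterns are applied to the repository name with standard java.util.regex semantics. The class and method names (RepoFilterSketch, matchesAny) are hypothetical and are not part of the crawler.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of github-crawler : checks a repository name
// against a list of regular expressions like the ones in the sample configuration.
class RepoFilterSketch {

    // Returns true if the name matches at least one of the patterns.
    // find() is used, so the ^ and $ anchors in each pattern decide how much of the name must match.
    static boolean matchesAny(List<String> patterns, String repoName) {
        return patterns.stream()
                .anyMatch(p -> Pattern.compile(p).matcher(repoName).find());
    }

    public static void main(String[] args) {
        List<String> toExclude = List.of(
                "^financing-platform-.*-run$",
                "^(?!financing-platform-.*$).*");
        for (String name : List.of("financing-platform-api-run", "financing-platform-api", "some-other-repo")) {
            System.out.println(name + " excluded : " + matchesAny(toExclude, name));
        }
    }
}
```

With these two patterns, financing-platform-api-run and some-other-repo are reported as excluded, while financing-platform-api is kept.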

Configuration on repository side

While the global configuration is defined along with github crawler, it can be overridden at the repository level. Repository-level config is stored in a .githubCrawler file, at the root of the repository, in the default branch.

  • Exclusion

If a repository should be excluded, we can declare it in the repository itself. If .githubCrawler contains :

    excluded: true  

then the crawler will consider the repository as excluded, even if it doesn't match any of the exclusion patterns in the crawler config.

  • Redirecting to a specific file to parse

Sometimes, the file we're interested in parsing is not in a standard location like the root of the repository.

What we can do in this case is define the file in the crawler config, and override the path in the repository config with the redirectTo attribute, here for a Dockerfile :

    filesToParse: 
      - 
        name: Dockerfile
        redirectTo: routing/Dockerfile 

With the above config, when the crawler tries to fetch Dockerfile at the root of the repository, it will actually try to parse routing/Dockerfile.
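
The redirection amounts to a lookup with a fallback : if the repository declares a redirectTo for a file, that path is used, otherwise the path from the crawler config is kept. A minimal sketch of that idea follows. RedirectSketch and resolvePath are hypothetical names used for illustration only, not the crawler's actual code.

```java
import java.util.Map;

// Hypothetical illustration of the redirectTo override; not the crawler's actual code.
class RedirectSketch {

    // redirects maps a file name from the crawler config to the path
    // declared with redirectTo in the repository's .githubCrawler file.
    static String resolvePath(Map<String, String> redirects, String configuredFile) {
        return redirects.getOrDefault(configuredFile, configuredFile);
    }

    public static void main(String[] args) {
        Map<String, String> redirects = Map.of("Dockerfile", "routing/Dockerfile");
        System.out.println(resolvePath(redirects, "Dockerfile")); // prints routing/Dockerfile
        System.out.println(resolvePath(redirects, "pom.xml"));    // prints pom.xml (no redirect declared)
    }
}
```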

  • Tagging a repo

You may want to "tag" some repos, to be able to filter easily on them when browsing the results. GitHub provides "topics", which are very easy to edit and are similar to "tags". The crawler reads the topics of each repository and attaches them as tags in the output, for every repository on which topics have been configured.

Gitlab support

Basic support for GitLab is available ! It all boils down to implementing a GitLab specific version of the RemoteSourceControl interface.

Your config would look like :

    crawler:
      source-control:
        type: "GITLAB"
        url: https://gitlab.com/api/v4/

        # your Gitlab personal access token
        apiToken: "5yL4_Y9hyC_YX9urZN_G"

        # your Gitlab "group"
        organizationName: myJavaProjects

Not all methods defined in RemoteSourceControl interface may have been implemented for Gitlab : NotImplementedError would be thrown in that case. If you need them, you can implement them in RemoteGitLabImpl (and contribute them back through a pull request ?).

Similarly, we may have added methods to the interface for some of our Gitlab specific use-cases : in that case, these methods may not have been implemented in the Github version of the interface.

Overriding config at repository level for Gitlab

The same rules apply as for GitHub, but in a file named .gitlabCrawler

Azure Devops support

Just like for GitLab, there's basic support for Azure Devops !

    crawler:
      source-control:
        type: "AZURE_DEVOPS"
        apiToken: "abcedfr6rwqwzslqhvfmdpuo5amfyv25a"
        # no need to define the URL, since it can only be a hosted service 

        # in Azure devops, repositories are in a project, within an organization. We mention both of them, separated by a '#' :
        # the crawler will pick the repositories from this project
        organization-name: "myOrg#myProject"
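
The combined "organization#project" value has to be split before it can be used. A minimal sketch of such a split follows, assuming the first '#' is the separator. AzureOrgProjectSketch is a hypothetical name, and how the crawler itself handles a malformed value is not documented here.

```java
// Hypothetical illustration of splitting the Azure Devops "organization#project" value.
class AzureOrgProjectSketch {

    // Returns {organization, project}, splitting at the first '#'.
    // Rejects values where either part is missing (an assumption made for this sketch).
    static String[] split(String organizationName) {
        int idx = organizationName.indexOf('#');
        if (idx <= 0 || idx == organizationName.length() - 1) {
            throw new IllegalArgumentException("expected organization#project, got : " + organizationName);
        }
        return new String[] { organizationName.substring(0, idx), organizationName.substring(idx + 1) };
    }

    public static void main(String[] args) {
        String[] parts = split("myOrg#myProject");
        System.out.println("organization=" + parts[0] + ", project=" + parts[1]);
    }
}
```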

BitBucket support

Since v2.2.0, there's also support for BitBucket !

    crawler:
      source-control:
        type: "BITBUCKET"
        url: YOUR_BITBUCKET_URL
        organizationName: myProject
        apiToken: "abcedfr6rwqwzslqhvfmdpuo5amfyv25a"

File content parsers

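Some parsers are provided here. As of v1.1.1, the parser types available out of the box include the ones used in the sample configuration : findDependencyVersionInXml, findFirstValueWithRegexpCapture and findPropertyValueInYamlFile.

See the javadoc in each class for details.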
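
To get a feel for what a "first regexp capture" parser returns, the Dockerfile pattern from the sample configuration can be tested with plain java.util.regex. This is a minimal sketch, not the crawler's implementation : FirstCaptureSketch and firstCapture are hypothetical names, and the Dockerfile content is made up for the example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of returning the first capture group found in a file.
class FirstCaptureSketch {

    // Returns the content of capture group 1 for the first match, if any.
    // The regex must contain at least one capture group, otherwise group(1) throws.
    static Optional<String> firstCapture(String regex, String fileContent) {
        Matcher m = Pattern.compile(regex).matcher(fileContent);
        return m.find() ? Optional.ofNullable(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up Dockerfile content, for illustration only.
        String dockerfile = "FROM registry.example.com/openjdk:8-jre\nCOPY target/app.jar /app.jar\n";
        // Same regex as in the sample configuration, once the YAML escapes are resolved : .*\/(.*)\s?
        String pattern = ".*\\/(.*)\\s?";
        System.out.println(firstCapture(pattern, dockerfile).orElse("not found")); // prints openjdk:8-jre
    }
}
```

Because `.` does not match line breaks by default, the first match is the first line containing a `/`, and the captured group is whatever follows the last `/` on that line.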

Miscellaneous tasks to perform

We sometimes need information on a repository that is not found in the files it contains : we then need to perform a "task" on each repository. As of v1.1.0, the task types available out of the box include the ones used in the sample configuration : countHitsOnRepoSearch and pathsForHitsOnRepoSearch.

See the javadoc in each class for details.
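
The search-based tasks take a queryString such as "q=metrics+extension:xml". As a rough sketch of how such a query could be scoped to one repository, the code below builds a GitHub code search URL with a repo: qualifier. The /search/code endpoint and the repo: qualifier are part of the GitHub REST API, but the way the crawler actually combines them is an assumption here, and RepoSearchUrlSketch is a hypothetical name.

```java
// Hypothetical illustration of scoping a configured queryString to a single repository.
class RepoSearchUrlSketch {

    // Assumption : the configured queryString ("q=...") is passed as-is
    // and narrowed to one repository with GitHub's repo: search qualifier.
    static String searchUrl(String apiBaseUrl, String queryString, String repoFullName) {
        return apiBaseUrl + "/search/code?" + queryString + "+repo:" + repoFullName;
    }

    public static void main(String[] args) {
        // MyOrganization/my-service is a made-up repository name.
        System.out.println(searchUrl("https://api.github.com", "q=metrics+extension:xml", "MyOrganization/my-service"));
    }
}
```

In the GitHub search response, the total_count field gives the number of hits and each item carries a path, which lines up with the two task types used in the sample configuration.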

Outputs

Default outputs are available in this package.

Each of them can be enabled at startup time through configuration. Have a look at GitHubCrawlerOutputConfig to see which property activates which output : we use Spring @ConditionalOnProperty to decide which output to instantiate, depending on what is configured under crawler.outputs

As of v1.1.0, there are 2 "general purpose" outputs available :

There are also 3 "specific purpose" outputs available (see the javadoc for more information) :

The default output is ConsoleOutput.

Example using HTTP output, pointing to ElasticSearch with Kibana on top

When running the crawler with the HTTP output to push indicators to ElasticSearch, this is the kind of data you'll get :

  • different values for the same indicator, fetched with findFirstValueWithRegexpCapture parser:

  • different values for the same indicator, fetched with findDependencyVersionInXml parser :

(when there's no value, it means the file was not found. When the value is "not found", it means the file exists, but the value was not found in it)

  • when using the crawlAllBranches: true property, the branch name is shown :

Once you have this data, you can quickly do any dashboard you want, like here, with the split of spring-boot-starter-parent version across our services :

Packaging

At build time, we produce several jars :

  • a starter-exec jar, bigger because self-contained. If you don't need to extend it, just take this jar and run it from command line with your config
  • much smaller regular jars (following Spring Boot recommendations) that contain just the compiled code : these are the jars you need to declare as dependencies if you want to extend Github crawler on your side.

Running the crawler from your IDE

We leverage Spring Boot profiles to manage several configurations. Since we consider that each profile represents a logical grouping of repositories, the Spring profile(s) will be copied to a "groups" attribute for each repository in the output.

Assuming you have a property file as defined above, all you need to do in your IDE is :

  1. check out this repository
  2. create your own property file in src/main/resources, and name it application-myOwn.yml : myOwn is the Spring Boot profile you'll use
  3. run GitHubCrawlerApplication, passing myOwn as profile

Extending the crawler (and contributing to it ?)

A starter project is available, allowing you to create your own GitHub crawler application, leveraging everything that exists in the library. This is the perfect way to test your own output or parser class on your side... before maybe contributing it back to the project ? ;-)

A simple example is available here : https://github.com/vincent-fuchs/my-custom-github-crawler/

  • import the gitHubCrawler starter as a dependency in your project
  • create a Spring Boot starter class, and inject the GitHubCrawler instantiated by the starter's autoconfig :
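import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
// also import GitHubCrawler from the github-crawler library (the package depends on the version you use)

@SpringBootApplication
public class PersonalGitHubCrawlerApplication implements CommandLineRunner {

    // instantiated by the starter's autoconfig
    @Autowired
    private GitHubCrawler crawler;

    public static void main(String[] args) {
        SpringApplication.run(PersonalGitHubCrawlerApplication.class, args);
    }

    // called by Spring Boot once the application context is ready
    @Override
    public void run(String... args) throws Exception {
        crawler.crawl();
    }
}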
  • add your own config or classes, the Spring Boot way : if you add your own, implementing the recognized interfaces for output or parsing, then Spring Boot will use them ! see here or here for examples

  • see the javadoc in FileContentParser, RepoTaskToPerform and GitHubCrawlerOutput, which are the main extension points.
