feat: add github_repository_content table #207

Merged
64 changes: 64 additions & 0 deletions docs/tables/github_repository_content.md
@@ -0,0 +1,64 @@
# Table: github_repository_content

Gets the contents of a file or directory in a repository.

Specify the file path or directory in `repository_content_path`.
If you omit `repository_content_path`, you will receive the contents of the repository's root directory.
When the path is a directory, the table returns one row for each entry in that directory.

The `github_repository_content` table can be used to query information about **ANY** repository, and **you must specify which repository** in the where or join clause (`where repository_full_name=`, `join github_repository_content on repository_full_name=`).

## Examples

### List the root directory of a repository

```sql
select
  repository_full_name,
  path,
  content,
  type,
  size,
  sha,
  html_url
from
  github_repository_content
where
  repository_full_name = 'github/docs';
```

### List a directory in a repository

```sql
select
  repository_full_name,
  path,
  content,
  type,
  size,
  sha,
  html_url
from
  github_repository_content
where
  repository_full_name = 'github/docs'
  and repository_content_path = 'docs';
```

### Get a file in a repository

```sql
select
  repository_full_name,
  path,
  type,
  size,
  sha,
  content,
  html_url
from
  github_repository_content
where
  repository_full_name = 'github/docs'
  and repository_content_path = '.vscode/settings.json';
```
1 change: 1 addition & 0 deletions github/plugin.go
@@ -43,6 +43,7 @@ func Plugin(ctx context.Context) *plugin.Plugin {
		"github_rate_limit": tableGitHubRateLimit(ctx),
		"github_release": tableGitHubRelease(ctx),
		"github_repository": tableGitHubRepository(),
		"github_repository_content": tableGitHubRepositoryContent(),
		"github_search_code": tableGitHubSearchCode(ctx),
		"github_search_commit": tableGitHubSearchCommit(ctx),
		"github_search_issue": tableGitHubSearchIssue(ctx),
150 changes: 150 additions & 0 deletions github/table_github_repository_content.go
@@ -0,0 +1,150 @@
package github

import (
	"context"
	"github.com/google/go-github/v48/github"
	"github.com/turbot/steampipe-plugin-sdk/v4/grpc/proto"
	"github.com/turbot/steampipe-plugin-sdk/v4/plugin"
	"github.com/turbot/steampipe-plugin-sdk/v4/plugin/transform"
)

//// TABLE DEFINITION

func tableGitHubRepositoryContent() *plugin.Table {
Review comment (Contributor):
@aminvielledebatAtBedrock I tried to use this table to list contents of a directory that contained a .png file, and then tried a query to get the actual .png file:

> select * from github_repository_content where repository_full_name = 'turbot/steampipe-mod-aws-insights' and repository_content_path = 'docs/images/aws-insights-console-graphic.png'

Error: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 (SQLSTATE HV000)

+----------------------+------+------+-------------------------+------+------+---------+--------+-----+-----+---------+----------+--------------+------+
| repository_full_name | type | name | repository_content_path | path | size | content | target | sha | url | git_url | html_url | download_url | _ctx |
+----------------------+------+------+-------------------------+------+------+---------+--------+-----+-----+---------+----------+--------------+------+
+----------------------+------+------+-------------------------+------+------+---------+--------+-----+-----+---------+----------+--------------+------+
> select * from github_repository_content where repository_full_name = 'turbot/steampipe-mod-aws-insights' and repository_content_path = 'docs/images'

Error: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 (SQLSTATE HV000)

+----------------------+------+------+-------------------------+------+------+---------+--------+-----+-----+---------+----------+--------------+------+
| repository_full_name | type | name | repository_content_path | path | size | content | target | sha | url | git_url | html_url | download_url | _ctx |
+----------------------+------+------+-------------------------+------+------+---------+--------+-----+-----+---------+----------+--------------+------+
+----------------------+------+------+-------------------------+------+------+---------+--------+-----+-----+---------+----------+--------------+------+

I'm not sure what the repository content API returns, but have you tested using this table when getting content for non-text files, or listing directories that contain them? For instance, does this table also work with GIF, JPEG, SVG, Microsoft Office (Word, PPT, Excel), PDF, etc., files? If so, what's in content for them?

Also, I'm not sure if you're on a different version, but I'm on github.com/google/go-github/v48 v48.0.0

Reply (Contributor):
I just tried against an .svg file and it seemed to return the content OK
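The two observations above are consistent with the error message: an SVG is XML text, while PNG bytes are not valid UTF-8, and a string field cannot carry invalid UTF-8. As a minimal sketch of one possible guard (not what this PR implements; `textOrBase64` is a hypothetical helper name), binary content could be detected with `utf8.Valid` and returned base64-encoded instead:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"unicode/utf8"
)

// textOrBase64 is a hypothetical helper. It returns raw unchanged when the
// bytes are valid UTF-8. Otherwise it returns the bytes re-encoded as base64
// and reports isText=false, so the value is always safe for a string field.
func textOrBase64(raw []byte) (value string, isText bool) {
	if utf8.Valid(raw) {
		return string(raw), true
	}
	return base64.StdEncoding.EncodeToString(raw), false
}

func main() {
	svg := []byte(`<svg xmlns="http://www.w3.org/2000/svg"/>`)
	png := []byte{0x89, 'P', 'N', 'G', 0x0d, 0x0a, 0x1a, 0x0a} // PNG file signature

	for _, sample := range [][]byte{svg, png} {
		value, isText := textOrBase64(sample)
		fmt.Printf("text=%t value=%s\n", isText, value)
	}
}
```

Whether binary files should surface as base64, as null, or through a separate column is a design decision for the table; the sketch only shows where the check could sit.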

	return &plugin.Table{
		Name:        "github_repository_content",
Review comment (Contributor):
For the table name (originally brought up in #207 (comment)), I think I still like the name github_repository_content, but am open to github_repository_file.

For github_repository_content, it's the name that the GitHub API uses and because each row can return each file's contents, the name still seems to fit.

On the other hand, github_repository_file would also be intuitive, as each row contains a file (giving meta information and contents).

Between the two, I don't have any strong preferences.

@e-gineer @johnsmyth @aminvielledebatAtBedrock - Curious to hear your thoughts as well, thanks!

Reply (Contributor):
@cbruno10 I also lean toward github_repository_content if the table also returns submodules and directories in addition to files.

		Description: "List the content in a repository (list directory, or get file content).",
		List: &plugin.ListConfig{
			Hydrate:           tableGitHubRepositoryContentList,
			ShouldIgnoreError: isNotFoundError([]string{"404"}),
			KeyColumns: []*plugin.KeyColumn{
Review comment (Contributor):
For key columns, are we also able to pass in the ref, e.g., commit, branch, that the API lists?
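For context on the question above: the GitHub contents endpoint accepts a `ref` query parameter (branch, tag, or commit SHA), and go-github exposes it as the `Ref` field of `RepositoryContentGetOptions`. The sketch below uses only the standard library to show how a ref would change the request URL; `contentsURL` is a hypothetical helper, not part of the plugin.

```go
package main

import (
	"fmt"
	"net/url"
)

// contentsURL is a hypothetical helper that builds the REST URL for a
// contents request. ref may be a branch, tag, or commit SHA; an empty ref
// leaves the choice to GitHub (the repository's default branch).
func contentsURL(owner, repo, path, ref string) string {
	u := url.URL{
		Scheme: "https",
		Host:   "api.github.com",
		Path:   "/repos/" + owner + "/" + repo + "/contents/" + path,
	}
	if ref != "" {
		q := url.Values{}
		q.Set("ref", ref)
		u.RawQuery = q.Encode()
	}
	return u.String()
}

func main() {
	fmt.Println(contentsURL("github", "docs", "docs", ""))
	fmt.Println(contentsURL("github", "docs", ".vscode/settings.json", "main"))
}
```

In the table itself, this would presumably mean an extra optional key column whose value is copied into the options struct before calling `GetContents`.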

				{Name: "repository_full_name", Require: plugin.Required},
				{Name: "repository_content_path", Require: plugin.Optional, CacheMatch: "exact"},
Review comment (Contributor):
Do we need a separate column for this? Could we use path instead as the optional key column?

			},
		},
		Columns: []*plugin.Column{
			{Name: "repository_full_name", Description: "The full name of the repository (login/repo-name).", Type: proto.ColumnType_STRING, Transform: transform.FromQual("repository_full_name")},
			{Name: "type", Description: "The file type (directory or file).", Type: proto.ColumnType_STRING},
			{Name: "name", Description: "The file name.", Type: proto.ColumnType_STRING},
			{Name: "repository_content_path", Description: "The requested path in repository search.", Type: proto.ColumnType_STRING, Transform: transform.FromQual("repository_content_path")},
			{Name: "path", Description: "The path of the file.", Type: proto.ColumnType_STRING},
			{Name: "size", Description: "The size of the file (in bytes).", Type: proto.ColumnType_INT},
Review comment (Contributor):
@aminvielledebatAtBedrock I saw that the GitHub API lists some caveats and restrictions about file size in https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28#size-limits.

What are the query results/error if the file size is over 100 MB? Also, for files less than 1 MB and for those between 1 - 100 MB, are there any differences in the column values, or do the differences not affect the table?
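The size limits page linked above describes three tiers: files of 1 MB or smaller support all features of the endpoint, files between 1 MB and 100 MB support only the raw or object media types, and files larger than 100 MB are not supported. A minimal sketch of that classification follows; `classifySize` and the tier names are hypothetical, and reading "MB" as 2^20 bytes is an assumption.

```go
package main

import "fmt"

// Assumption: "MB" in the GitHub size limits is read here as 2^20 bytes.
const mib = 1 << 20

// sizeTier is a hypothetical label for the three tiers in the GitHub docs.
type sizeTier string

const (
	tierFull        sizeTier = "full"        // 1 MB or smaller
	tierRawOnly     sizeTier = "raw-only"    // between 1 MB and 100 MB
	tierUnsupported sizeTier = "unsupported" // greater than 100 MB
)

// classifySize maps a file size in bytes to the tier it falls into.
func classifySize(sizeBytes int64) sizeTier {
	switch {
	case sizeBytes <= 1*mib:
		return tierFull
	case sizeBytes <= 100*mib:
		return tierRawOnly
	default:
		return tierUnsupported
	}
}

func main() {
	for _, size := range []int64{512, 5 * mib, 150 * mib} {
		fmt.Printf("%d bytes -> %s\n", size, classifySize(size))
	}
}
```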

			{Name: "content", Description: "The decoded file content (if the element is a file).", Type: proto.ColumnType_STRING, Transform: transform.From(transformFileContent), Hydrate: tableGitHubRepositoryContentGet},
			{Name: "target", Description: "Target is only set if the type is \"symlink\" and the target is not a normal file. If Target is set, Path will be the symlink path.", Type: proto.ColumnType_STRING},
			{Name: "sha", Description: "The sha of the file.", Type: proto.ColumnType_STRING, Transform: transform.FromField("SHA")},
			{Name: "url", Description: "URL of file's metadata.", Type: proto.ColumnType_STRING},
			{Name: "git_url", Description: "Git URL (with SHA) of the file.", Type: proto.ColumnType_STRING},
			{Name: "html_url", Description: "URL of the file's page on GitHub.", Type: proto.ColumnType_STRING},
			{Name: "download_url", Description: "Download URL: it expires and can be used just once.", Type: proto.ColumnType_STRING},
		},
	}
}

//// LIST FUNCTION

func tableGitHubRepositoryContentList(ctx context.Context, d *plugin.QueryData, h *plugin.HydrateData) (interface{}, error) {
	owner, repo := parseRepoFullName(d.KeyColumnQuals["repository_full_name"].GetStringValue())
	var filterPath string
	if d.KeyColumnQuals["repository_content_path"] != nil {
		filterPath = d.KeyColumnQuals["repository_content_path"].GetStringValue()
	}
	plugin.Logger(ctx).Trace("tableGitHubRepositoryContentList", "owner", owner, "repo", repo, "path", filterPath)

	type ListPageResponse struct {
		repositoryContent []*github.RepositoryContent
		resp              *github.Response
	}
	client := connect(ctx, d)
	opt := &github.RepositoryContentGetOptions{}
	listPage := func(ctx context.Context, d *plugin.QueryData, h *plugin.HydrateData) (interface{}, error) {
		fileContent, directoryContent, resp, err := client.Repositories.GetContents(ctx, owner, repo, filterPath, opt)

		if err != nil {
			plugin.Logger(ctx).Error("tableGitHubRepositoryContentList", "api_error", err, "path", filterPath)
			return nil, err
		}

		if fileContent != nil {
			directoryContent = []*github.RepositoryContent{fileContent}
		}

		return ListPageResponse{
			repositoryContent: directoryContent,
			resp:              resp,
		}, err
	}

	for {
		listPageResponse, err := retryHydrate(ctx, d, h, listPage)
		if err != nil {
			plugin.Logger(ctx).Error("tableGitHubRepositoryContentList", "retry_hydrate_error", err)
			return nil, err
		}

		for _, i := range listPageResponse.(ListPageResponse).repositoryContent {
			if i != nil {
				d.StreamListItem(ctx, i)
			}

			// Context can be cancelled due to manual cancellation or the limit has been hit
			if d.QueryStatus.RowsRemaining(ctx) == 0 {
				return nil, nil
			}
		}

		if listPageResponse.(ListPageResponse).resp.NextPage == 0 {
			break
		}
	}
	return nil, nil
}
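
The list function above starts by splitting `repository_full_name` with `parseRepoFullName`, which is defined elsewhere in the plugin and is not part of this diff. As a rough illustration of what splitting an `owner/repo` string involves, here is a hypothetical stand-in, `splitFullName`, which also reports malformed input; the real helper's behavior may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFullName is a hypothetical stand-in for the plugin's own helper.
// It splits "owner/repo" into its two parts and reports ok=false when the
// input does not contain exactly one slash with text on both sides.
func splitFullName(fullName string) (string, string, bool) {
	owner, repo, found := strings.Cut(fullName, "/")
	if !found || owner == "" || repo == "" || strings.Contains(repo, "/") {
		return "", "", false
	}
	return owner, repo, true
}

func main() {
	for _, name := range []string{"github/docs", "docs", "a/b/c"} {
		owner, repo, ok := splitFullName(name)
		fmt.Printf("%q -> owner=%q repo=%q ok=%t\n", name, owner, repo, ok)
	}
}
```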

//// GET FUNCTION

func tableGitHubRepositoryContentGet(ctx context.Context, d *plugin.QueryData, h *plugin.HydrateData) (interface{}, error) {
	owner, repo := parseRepoFullName(d.KeyColumnQuals["repository_full_name"].GetStringValue())
	filterPath := *h.Item.(*github.RepositoryContent).Path

	plugin.Logger(ctx).Trace("tableGitHubRepositoryContentGet", "owner", owner, "repo", repo, "path", filterPath)

	type GetResponse struct {
		repositoryContent *github.RepositoryContent
		resp              *github.Response
	}

	client := connect(ctx, d)
	getFileContent := func(ctx context.Context, d *plugin.QueryData, h *plugin.HydrateData) (interface{}, error) {
		fileContent, _, resp, err := client.Repositories.GetContents(ctx, owner, repo, filterPath, &github.RepositoryContentGetOptions{})

		if err != nil {
			plugin.Logger(ctx).Error("tableGitHubRepositoryContentGet", "api_error", err, "path", filterPath)
			return nil, err
		}

		return GetResponse{
			repositoryContent: fileContent,
			resp:              resp,
		}, err
	}

	getResponse, err := retryHydrate(ctx, d, h, getFileContent)
	if err != nil {
		return nil, err
	}

	return getResponse.(GetResponse).repositoryContent, nil
}

func transformFileContent(_ context.Context, d *transform.TransformData) (interface{}, error) {
	content := d.HydrateItem.(*github.RepositoryContent)
	// Directory case: a directory has no raw content.
	if content.Content == nil {
		return nil, nil
	}
	// Empty content: either an empty file, or a file too large for the API to
	// return inline (empty string with "none" encoding). Files greater than
	// 100 MB are not supported by the repository contents endpoint.
	if *content.Content == "" {
		return "", nil
	}
	return content.GetContent()
}
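
The transform above hands non-empty content to `GetContent()`, which decodes the payload according to the encoding reported by the API. The contents endpoint normally returns file bodies base64-encoded, often wrapped across lines. The following standard-library sketch illustrates that decoding step; `decodeContent` is a hypothetical function, and its handling of the `"none"` encoding mirrors the early return in the transform rather than the library's own code.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeContent is a hypothetical illustration of decoding a contents
// payload. It is not a copy of the go-github implementation.
func decodeContent(encoding, content string) (string, error) {
	switch encoding {
	case "base64":
		// Go's decoder ignores \r and \n, so line-wrapped payloads work as-is.
		raw, err := base64.StdEncoding.DecodeString(content)
		if err != nil {
			return "", fmt.Errorf("decoding base64 content: %w", err)
		}
		return string(raw), nil
	case "", "none":
		// No transfer encoding: return the content unchanged.
		return content, nil
	default:
		return "", fmt.Errorf("unsupported content encoding: %q", encoding)
	}
}

func main() {
	decoded, err := decodeContent("base64", "aGVsbG8g\nd29ybGQ=")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(decoded)
}
```

Note that decoding succeeds for any byte sequence, so a decoded binary file can still contain invalid UTF-8, which is the situation raised in the first review comment.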