Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

report gist filenames in table_github_gist #57

Closed
judell opened this issue Sep 15, 2021 · 4 comments · Fixed by #66
Closed

report gist filenames in table_github_gist #57

judell opened this issue Sep 15, 2021 · 4 comments · Fixed by #66
Assignees
Labels
enhancement New feature or request

Comments

@judell
Copy link
Contributor

judell commented Sep 15, 2021

It's hard to make sense of the table_github_gist listing without filenames. I found here the element Files.

I tried adding this to the table_github_gist.go:

{Name: "files", Type: pb.ColumnType_JSON, Description: "The filename."},

Now I can do this:

> select
  files->jsonb_object_keys(files)->'filename' as name,
  files->jsonb_object_keys(files)->'language' as language,
  description
from github_my_gist limit 5
+------------------------------+--------------+----------------------------------------------------+
| name                         | language     | description                                        |
+------------------------------+--------------+----------------------------------------------------+
| "ucdavis-april-2019.md"      | "Markdown"   |                                                    |
| "plpython.md"                | "Markdown"   |                                                    |
| "internal_and_public_id"     | <null>       | convert between hypothesis internal and public ids |
| "addMembers.js"              | "JavaScript" |                                                    |
| "async-postgres-listener.py" | "Python"     |                                                    |
+------------------------------+--------------+----------------------------------------------------+

The downside is that this requires some Postgres mojo that makes me stop and think, despite my having used Postgres JSONB a lot. So while this works for me I'm not sure I'd recommend for others. Presumably the plugin could use a derived type that hoists these fields to the top level? Anyway, it was an easy and instructive workaround.

@judell judell added the enhancement New feature or request label Sep 15, 2021
@e-gineer
Copy link
Contributor

Definitely makes sense to add the files column. JSON is a reasonable starting point for it too ... if a piece of data from deep in the JSON is used a lot then sometimes we elevate it up to a column on it's own, but that is usually when completely obvious or widely requested (easy to add columns, very hard to deprecate/remove them).

A slight change to the Postgres makes it easier to use I believe. For example:

select
  f ->> 'filename' as name,
  f ->> 'language' as language
from
  github_my_gist as g,
  jsonb_array_elements(g.files) as f

Add a good example like that to the docs with it and it would be good to go / easy to use.

@judell
Copy link
Contributor Author

judell commented Sep 15, 2021

Hmm. I think it needs to be jsonb_object_keys because the JSON is like so:

 {
  "files": {
   "gistfile1.txt": {
    "filename": "gistfile1.txt",
    "language": "Text",
    "raw_url": "https://gist.githubusercontent.com/judell/9744381/raw/cd2f695f7e776e82ef0c6dc6678a6322a514f5f9/gistfile1.txt",
    "size": 6341,
    "type": "text/plain"
   }
  }
 }

So maybe like this, to clean up the referencing?

> with filenames as (
 select
   id,
   jsonb_object_keys(files) as name
 from github_my_gist
)
select
 f.name,
 g.files -> f.name ->> 'language' as language,
 g.description
from filenames f
join github_my_gist g using (id) limit 5
+---------------------------------+----------+----------------------------------------------------+
| name                            | language | description                                        |
+---------------------------------+----------+----------------------------------------------------+
| internal_and_public_id          | <null>   | convert between hypothesis internal and public ids |
| async-postgres-listener.py      | Python   |                                                    |
| find-unused-security-groups.sql | SQL      | PATTERN: UNION similar things to combine them      |
| plpython.md                     | Markdown |                                                    |
| inventory                       | <null>   | h data inventory                                   |
+---------------------------------+----------+----------------------------------------------------+

@judell
Copy link
Contributor Author

judell commented Sep 26, 2021

As per https://steampipe.io/blog/adding-a-column-to-a-table, we decided to flatten the object to a simple array of objects. The query solution we gave was:

select 
  f ->> 'language' as language,
  count(*)
from 
  github_my_gist g
cross join
  jsonb_array_elements(g.files) f
group by
  language
order by 
  count desc

On reflection I'm still puzzled by the cross join. The literature suggests that the "Cartesian product" it produces is useful for things like this.

create table colors (color text);
insert into colors(color) values ('red');
insert into colors(color) values ('green');

create table sizes (size text);
insert into sizes(size) values ('small');
insert into sizes(size) values ('medium');
insert into sizes(size) values ('large');

select * from colors cross join sizes;

  size  | color 
--------+-------
 small  | red
 small  | green
 medium | red
 medium | green
 large  | red
 large  | green

Here is a simplified example of the case discussed in the blog post.

create table my_gist(id text, description text, files jsonb);
insert into my_gist(id, description, files) 
values ('b89721e4f71c3f647e8d686887de3008', 'gist-with-two-files', '[{"filename": "file1.md"}, {"filename": "file2.md"}]')

select * from my_gist

                id                |     description     |                        files                         
----------------------------------+---------------------+------------------------------------------------------
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | [{"filename": "file1.md", }, {"filename": "file2.md"}]


select jsonb_array_elements(g.files) as files from my_gist

   files
-----------------------
{"filename": "file1.md"}
{"filename": "file2.md"}

Cross joining against the files object.

select 
  g.id,
  g.description,
  files
from 
  my_gist g
cross join
  jsonb_array_elements(g.files) files;


                id                |     description     |                                    files                                     
----------------------------------+---------------------+------------------------------------------------------------------------------
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | [{"filename": "file1.md"}, {"filename": "file2.md"}]
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | [{"filename": "file1.md"}, {"filename": "file2.md"}]

That seems like a weird Cartesian product! Cross joining with indexing into the files object gets us where we want to go.

select 
  id,
  description,
  f ->> 'filename'
from 
  my_gist g
cross join
  jsonb_array_elements(g.files) f;


                id                |     description     | filename 
----------------------------------+---------------------+----------
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | file1.md
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | file2.md

But how to explain it? I'm just not seeing how it works.

I've always gone with this approach.

select 
  id,
  description,
  jsonb_array_elements(files) as files
from 
  my_gist;

                id                |     description     |         jsonb_array_elements         
----------------------------------+---------------------+--------------------------------------
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | { "filename": "file1.md"}
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | { "filename": "file2.md"}

I'm used to how unnesting an array produces array-length rows, and I think of jsonb_array_elements as the jsonb version of that.

And

```sql
select 
  id,
  description,
  jsonb_array_elements(files) ->> 'filename' as filename
from 
  my_gist;
  
                 id                |     description     | filename 
-----------------------------------+---------------------+----------
 b89721e4f71c3f647e8d686887de3008  | gist-with-two-files | file1.md
 b89721e4f71c3f647e8d686887de3008  | gist-with-two-files | file2.md

The unnest <-> jsonb_array_elements analogy enables me to grok this. But the cross join escapes me. How do you get a Cartesian product with two identical rows like this?

 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | [{"filename": "file1.md"}, {"filename": "file2.md"}]
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | [{"filename": "file1.md"}, {"filename": "file2.md"}]

Indexing into it with f ->> 'filename' gets the "right" answer but I just don't grok the mechanism.

                id                |     description     | filename 
----------------------------------+---------------------+----------
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | file1.md
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | file2.md

@judell
Copy link
Contributor Author

judell commented Sep 30, 2021

Hey @LalitTurbot, thanks for following up on that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants