report gist filenames in table_github_gist #57

judell · 2021-09-15T18:33:38Z

It's hard to make sense of the table_github_gist listing without filenames. I found here the element Files.

I tried adding this to the table_github_gist.go:

{Name: "files", Type: pb.ColumnType_JSON, Description: "The filename."},

Now I can do this:

> select
  files->jsonb_object_keys(files)->'filename' as name,
  files->jsonb_object_keys(files)->'language' as language,
  description
from github_my_gist limit 5
+------------------------------+--------------+----------------------------------------------------+
| name                         | language     | description                                        |
+------------------------------+--------------+----------------------------------------------------+
| "ucdavis-april-2019.md"      | "Markdown"   |                                                    |
| "plpython.md"                | "Markdown"   |                                                    |
| "internal_and_public_id"     | <null>       | convert between hypothesis internal and public ids |
| "addMembers.js"              | "JavaScript" |                                                    |
| "async-postgres-listener.py" | "Python"     |                                                    |
+------------------------------+--------------+----------------------------------------------------+

The downside is that this requires some Postgres mojo that makes me stop and think, despite my having used Postgres JSONB a lot. So while this works for me I'm not sure I'd recommend for others. Presumably the plugin could use a derived type that hoists these fields to the top level? Anyway, it was an easy and instructive workaround.

The text was updated successfully, but these errors were encountered:

e-gineer · 2021-09-15T19:33:39Z

Definitely makes sense to add the files column. JSON is a reasonable starting point for it too ... if a piece of data from deep in the JSON is used a lot then sometimes we elevate it up to a column on it's own, but that is usually when completely obvious or widely requested (easy to add columns, very hard to deprecate/remove them).

A slight change to the Postgres makes it easier to use I believe. For example:

select
  f ->> 'filename' as name,
  f ->> 'language' as language
from
  github_my_gist as g,
  jsonb_array_elements(g.files) as f

Add a good example like that to the docs with it and it would be good to go / easy to use.

judell · 2021-09-15T20:21:44Z

Hmm. I think it needs to be jsonb_object_keys because the JSON is like so:

 {
  "files": {
   "gistfile1.txt": {
    "filename": "gistfile1.txt",
    "language": "Text",
    "raw_url": "https://gist.githubusercontent.com/judell/9744381/raw/cd2f695f7e776e82ef0c6dc6678a6322a514f5f9/gistfile1.txt",
    "size": 6341,
    "type": "text/plain"
   }
  }
 }

So maybe like this, to clean up the referencing?

> with filenames as (
 select
   id,
   jsonb_object_keys(files) as name
 from github_my_gist
)
select
 f.name,
 g.files -> f.name ->> 'language' as language,
 g.description
from filenames f
join github_my_gist g using (id) limit 5
+---------------------------------+----------+----------------------------------------------------+
| name                            | language | description                                        |
+---------------------------------+----------+----------------------------------------------------+
| internal_and_public_id          | <null>   | convert between hypothesis internal and public ids |
| async-postgres-listener.py      | Python   |                                                    |
| find-unused-security-groups.sql | SQL      | PATTERN: UNION similar things to combine them      |
| plpython.md                     | Markdown |                                                    |
| inventory                       | <null>   | h data inventory                                   |
+---------------------------------+----------+----------------------------------------------------+

judell · 2021-09-26T20:18:12Z

As per https://steampipe.io/blog/adding-a-column-to-a-table, we decided to flatten the object to a simple array of objects. The query solution we gave was:

select 
  f ->> 'language' as language,
  count(*)
from 
  github_my_gist g
cross join
  jsonb_array_elements(g.files) f
group by
  language
order by 
  count desc

On reflection I'm still puzzled by the cross join. The literature suggests that the "Cartesian product" it produces is useful for things like this.

create table colors (color text);
insert into colors(color) values ('red');
insert into colors(color) values ('green');

create table sizes (size text);
insert into sizes(size) values ('small');
insert into sizes(size) values ('medium');
insert into sizes(size) values ('large');

select * from colors cross join sizes;

  size  | color 
--------+-------
 small  | red
 small  | green
 medium | red
 medium | green
 large  | red
 large  | green

Here is a simplified example of the case discussed in the blog post.

create table my_gist(id text, description text, files jsonb);
insert into my_gist(id, description, files) 
values ('b89721e4f71c3f647e8d686887de3008', 'gist-with-two-files', '[{"filename": "file1.md"}, {"filename": "file2.md"}]')

select * from my_gist

                id                |     description     |                        files                         
----------------------------------+---------------------+------------------------------------------------------
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | [{"filename": "file1.md", }, {"filename": "file2.md"}]


select jsonb_array_elements(g.files) as files from my_gist

   files
-----------------------
{"filename": "file1.md"}
{"filename": "file2.md"}

Cross joining against the files object.

select 
  g.id,
  g.description,
  files
from 
  my_gist g
cross join
  jsonb_array_elements(g.files) files;


                id                |     description     |                                    files                                     
----------------------------------+---------------------+------------------------------------------------------------------------------
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | [{"filename": "file1.md"}, {"filename": "file2.md"}]
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | [{"filename": "file1.md"}, {"filename": "file2.md"}]

That seems like a weird Cartesian product! Cross joining with indexing into the files object gets us where we want to go.

select 
  id,
  description,
  f ->> 'filename'
from 
  my_gist g
cross join
  jsonb_array_elements(g.files) f;


                id                |     description     | filename 
----------------------------------+---------------------+----------
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | file1.md
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | file2.md

But how to explain it? I'm just not seeing how it works.

I've always gone with this approach.

select 
  id,
  description,
  jsonb_array_elements(files) as files
from 
  my_gist;

                id                |     description     |         jsonb_array_elements         
----------------------------------+---------------------+--------------------------------------
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | { "filename": "file1.md"}
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | { "filename": "file2.md"}

I'm used to how unnesting an array produces array-length rows, and I think of jsonb_array_elements as the jsonb version of that.

And

```sql
select 
  id,
  description,
  jsonb_array_elements(files) ->> 'filename' as filename
from 
  my_gist;
  
                 id                |     description     | filename 
-----------------------------------+---------------------+----------
 b89721e4f71c3f647e8d686887de3008  | gist-with-two-files | file1.md
 b89721e4f71c3f647e8d686887de3008  | gist-with-two-files | file2.md

The unnest <-> jsonb_array_elements analogy enables me to grok this. But the cross join escapes me. How do you get a Cartesian product with two identical rows like this?

 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | [{"filename": "file1.md"}, {"filename": "file2.md"}]
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | [{"filename": "file1.md"}, {"filename": "file2.md"}]

Indexing into it with f ->> 'filename' gets the "right" answer but I just don't grok the mechanism.

                id                |     description     | filename 
----------------------------------+---------------------+----------
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | file1.md
 b89721e4f71c3f647e8d686887de3008 | gist-with-two-files | file2.md

…loses #57

judell · 2021-09-30T16:06:44Z

Hey @LalitTurbot, thanks for following up on that!

… (#66)

judell added the enhancement New feature or request label Sep 15, 2021

LalitLab self-assigned this Sep 30, 2021

LalitLab pushed a commit that referenced this issue Sep 30, 2021

Add "files" column to the "github_my_gist" and "github_gist" tables. C…

cbdd5a5

…loses #57

LalitLab linked a pull request Sep 30, 2021 that will close this issue

Add "files" column to the "github_my_gist" and "github_gist" tables. Closes #57 #66

Merged

cbruno10 closed this as completed in #66 Oct 4, 2021

cbruno10 pushed a commit that referenced this issue Oct 4, 2021

Add column files to github_my_gist and github_gist tables. Closes #57…

56bb26e

… (#66)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

report gist filenames in table_github_gist #57

report gist filenames in table_github_gist #57

judell commented Sep 15, 2021 •

edited

e-gineer commented Sep 15, 2021

judell commented Sep 15, 2021 •

edited

judell commented Sep 26, 2021 •

edited

judell commented Sep 30, 2021

report gist filenames in table_github_gist #57

report gist filenames in table_github_gist #57

Comments

judell commented Sep 15, 2021 • edited

e-gineer commented Sep 15, 2021

judell commented Sep 15, 2021 • edited

judell commented Sep 26, 2021 • edited

judell commented Sep 30, 2021

judell commented Sep 15, 2021 •

edited

judell commented Sep 15, 2021 •

edited

judell commented Sep 26, 2021 •

edited