Proof of concept on the fly sitemap generation built during Blacklight-LD meeting #2351

Closed
wants to merge 1 commit into from


mejackreed commented Sep 27, 2019

Paired on by:
@magibney
@mejackreed
@agazzarini
@netsensei

Todos / ?:

  • Test this at scale
  • Extract to a gem?
  • Are there any concerns about the url_for with a hostname in a load balanced scenario? (I don't seem to remember)
  • Validate the show action id param to make sure that the length is at least the access
  • Better variable names

cdmo commented Nov 4, 2019

I know this is for SearchWorks, not Blacklight, but I was curious and gave it a try locally, and it is not working. The index view renders as shown below:

<sitemapindex xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd">
<sitemap>
<loc>http://localhost:3000/sitemap/0</loc>
</sitemap>
</sitemapindex>

But the show view does not; it is just an empty <urlset>. I think the update dedupe action isn't actually adding the signatureField. Is that the expectation? That is, when you run an update, should it also generate a new field (hashed_id_ss) for every record?


magibney commented Nov 4, 2019

Yeah, if you're not getting a hashed_id_ss field generated for every record, then none of the rest of this will work. Did you configure all the relevant parts of solrconfig.xml and schema.xml?


agazzarini commented Nov 4, 2019

I see from the following schema snippet

...
<dynamicField name="*_si" type="string" stored="true" indexed="true" omitNorms="true" />
...

that the signature field "hashed_id_si" is indexed and stored, which means that if you run a query in Solr you should see that new field on each returned document.
Could you please confirm that? I'm checking the solrconfig.xml in the patch and it seems OK.


magibney commented Nov 4, 2019

Shouldn't you be looking for hashed_id_ssi, not hashed_id_ss? Perhaps that was just a typo, but the dynamic field and the signatureField configured in your schema.xml and solrconfig.xml are _ssi ...


cdmo commented Nov 4, 2019

@magibney sorry - I deleted my comment because I realized the branch was not quite right, will report back soon


cdmo commented Nov 4, 2019

Ok, so yes, I am using hashed_id_ssi because *_si is not stored in my schema.xml, and, happily, I can generate the hashed_id_ssi. It shows up in records like "hashed_id_ssi":"81524bef2092a2df".

Here's my branch

Still the same issue though, nothing in the urlset on the show view.

I wasn't sure whether I needed a signatureField in my schema.xml. It wasn't in this PR, but the documentation seemed to suggest that it is needed.

I guess I don't see how this works:

    @solr_response = Blacklight.default_index.connection.select({
      params: {
        q: "{!prefix f=hashed_id_ssi v=#{access_params}}", # changed f to my signature field
        fl: 'id,last_updated'
      }
    })

Or how it's supposed to work.

Thanks for the help
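
One way to picture what the query above appears to be doing: if every record carries a hex signature in hashed_id_ssi, a prefix query on that field returns the records whose signature starts with the given characters, so each hex prefix acts as one sitemap bucket. The following is a minimal Ruby sketch of that idea, not the code from this PR. Assumptions: Digest::MD5 stands in for whatever signature class Solr is configured with, and `signature_for` and `bucket_by_prefix` are hypothetical helper names that exist only for this illustration.

```ruby
require 'digest'

# Hypothetical stand-in for the Solr-side signature: a 16-character hex digest of the id.
# (Assumption: the real value is produced by Solr at index time, not by Ruby.)
def signature_for(id)
  Digest::MD5.hexdigest(id.to_s)[0, 16]
end

# Group record ids by the first `prefix_length` hex characters of their signature.
# Each group corresponds to what a prefix query for that prefix would match.
def bucket_by_prefix(ids, prefix_length: 1)
  ids.group_by { |id| signature_for(id)[0, prefix_length] }
end

ids = (1..1000).map { |n| "record-#{n}" }
buckets = bucket_by_prefix(ids, prefix_length: 1)
buckets.sort.each { |prefix, members| puts "#{prefix}: #{members.size} records" }
```

Because the signature is a hash, the records should spread roughly evenly across the prefixes, and a longer prefix gives more, smaller buckets.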


cdmo commented Nov 5, 2019

It looks like the hashed_id generated as the signatureKey should be associated with a sitemap id (not to be confused with a record id - I might suggest calling the sitemap id sitemap_id or something). Otherwise, how does Solr pull up a list of documents just by passing in the sitemap id in the show method for the sitemap controller? Again, thanks for any help.
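
Judging from the prefix query in this PR, the association may be positional rather than stored: the sitemap id in the URL looks like it is used as a prefix of the hashed value. Under that assumption, the set of sitemap ids is simply every hex string of a chosen length. A small sketch follows; `sitemap_ids` and `sitemap_paths` are hypothetical names, and the `/sitemap/<id>` path shape is taken from the index output posted earlier in this thread.

```ruby
# Every hex string of the given length, zero-padded:
# "0".."f" for length 1, "00".."ff" for length 2, and so on.
def sitemap_ids(prefix_length)
  (0...(16**prefix_length)).map { |n| n.to_s(16).rjust(prefix_length, '0') }
end

# Paths a sitemap index could list, one per prefix (path shape is an assumption).
def sitemap_paths(prefix_length)
  sitemap_ids(prefix_length).map { |id| "/sitemap/#{id}" }
end

puts sitemap_paths(1).length
puts sitemap_paths(1).first(3).inspect
```

With a prefix length of 1 this yields 16 sitemaps, and with 2 it yields 256, so the length can be chosen to keep each sitemap at a manageable size.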


cdmo commented Nov 5, 2019

With generous help from @magibney and @tampakis, I was able to get this working. The most important missing bit was that the defType had to be lucene (h/t to @magibney!). I now have a working POC for our Blacklight instance at Penn State for a few hundred records. Next I'll try a few million locally, and after that we'll test it on a test server environment. Thank you all for your help.
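
For anyone hitting the same empty <urlset>, here is a hedged sketch of what the adjusted query parameters might look like. It is not the code from the PR: `sitemap_query_params` is a hypothetical helper, `hashed_id_ssi` is the signature field used in this thread, and the hex-only check is an added assumption about what a valid sitemap id looks like. The likely reason defType matters is that an app's default request handler may use a different query parser (for example dismax or edismax), which may not treat the `{!prefix ...}` local params the way the standard parser does.

```ruby
# Build the Solr params for one sitemap page (hypothetical helper, see assumptions above).
def sitemap_query_params(prefix, field: 'hashed_id_ssi')
  # Accept only lowercase hex so nothing unexpected is interpolated into the local params.
  unless prefix.to_s.match?(/\A[0-9a-f]+\z/)
    raise ArgumentError, "invalid sitemap id: #{prefix.inspect}"
  end

  {
    defType: 'lucene',                      # use the standard query parser for q
    q: "{!prefix f=#{field} v=#{prefix}}",  # match signatures that start with the prefix
    fl: 'id,last_updated'
  }
end

params = sitemap_query_params('0')
puts params.inspect
# In a Blacklight app this hash could then be passed along the lines of:
#   Blacklight.default_index.connection.select(params: params)
```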


mejackreed commented Jan 11, 2020

mejackreed closed this Jan 11, 2020
mejackreed deleted the poc-sitemap-generation branch Jan 13, 2020