New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade Elasticsearch to version 6.x #4211

Merged
merged 16 commits into from Jun 19, 2018

Conversation

3 participants
@safwanrahman
Member

safwanrahman commented Jun 8, 2018

Currently, only project indexing working. Investigating and trying to index the pages/files in a cleaner way. (Fixed)
Things need to be done

  • Implement a cleaner way to index File
  • Search for project
  • Search for File
  • Add highlighter for result description
  • Implement facet searching in a cleaner way
  • Configure travis to run the tests
  • Fixup the tests in order to pass

This fixes #4183

@safwanrahman safwanrahman added this to Backlog in Search update via automation Jun 8, 2018

@safwanrahman safwanrahman moved this from Backlog to In progress in Search update Jun 8, 2018

@agjohnson agjohnson changed the title from [Fix #4183] Search Proof of Concept to Search Proof of Concept Jun 8, 2018

safwanrahman added some commits Jun 9, 2018

@ericholscher

Tested this locally and it's working well w/ projects & faceting. The next big thing is definitely getting the File search working. 👍

@@ -42,3 +42,4 @@ notifications:
branches:
only:
- master
- search_upgrade

This comment has been minimized.

@ericholscher

ericholscher Jun 12, 2018

Member

Interesting :)

@@ -143,7 +142,7 @@ def bulk_index(self, data, index=None, chunk_size=500, parent=None,
docs.append(doc)
# TODO: This doesn't work with the new ES setup.
bulk_index(self.es, docs, chunk_size=chunk_size)
# bulk_index(self.es, docs, chunk_size=chunk_size)
def index_document(self, data, index=None, parent=None, routing=None):
doc = self.extract_document(data)

This comment has been minimized.

@ericholscher

ericholscher Jun 12, 2018

Member

Guessing this entire file and other related code should be deleted?

This comment has been minimized.

@safwanrahman

safwanrahman Jun 13, 2018

Member

Yes. it need to be deleted. I will delete once I implement the file searching functionality!

@pytest.fixture
def all_projects():
def all_projects(es_index):

This comment has been minimized.

@ericholscher

ericholscher Jun 12, 2018

Member

Where does this get passed in? Is it automatically callign the above fixture based on name?

This comment has been minimized.

@safwanrahman

safwanrahman Jun 13, 2018

Member

Actually, its pytest's dependeny enjection. So if you have a fixture name foo and you accept this fixture in def bar(foo), the foo fixture will be passed to the bar fixture.

language=user_input.language)
response = project_search.execute()
results = response.hits
facets = response.facets

This comment has been minimized.

@ericholscher

ericholscher Jun 12, 2018

Member

Is this used?

This comment has been minimized.

@safwanrahman

safwanrahman Jun 13, 2018

Member

Yes. Its used for showing facet (language) in project search results.

'doc_types': [cls],
'model': cls._doc_type.model,
'query': query
}

This comment has been minimized.

@ericholscher

ericholscher Jun 12, 2018

Member

Is this logic required? It seems a bit heavy/complex.

This comment has been minimized.

@safwanrahman

safwanrahman Jun 13, 2018

Member

I think, to keep alligned with the search method, we can keep this logic. maybe its not needed now, but it will be useful to keep it alligned.

@ericholscher

I think this change is too complicated. We should be able to do this without much of the code changes here with just a filter on the manager.

@@ -902,6 +903,7 @@ class ImportedFile(models.Model):
path = models.CharField(_('Path'), max_length=255)
md5 = models.CharField(_('MD5 checksum'), max_length=255)
commit = models.CharField(_('Commit'), max_length=255)
is_html = models.BooleanField(default=False)

This comment has been minimized.

@ericholscher

ericholscher Jun 13, 2018

Member

I don't think we need this on the model, the queryset manager can just do the filter, no?

This comment has been minimized.

@safwanrahman

safwanrahman Jun 13, 2018

Member

Yeah. it can do the filtering. but I thought it would be much slow to filter in the queryset manager. I am ok to remove it.

if fnmatch.fnmatch(filename, '*.html'):
model_class = HTMLFile
else:
model_class = ImportedFile

This comment has been minimized.

@ericholscher

ericholscher Jun 13, 2018

Member

I don't believe this is needed, since it's all the same model in the database.

This comment has been minimized.

@safwanrahman

safwanrahman Jun 13, 2018

Member

The problem is with actually signal manager. I have opened sabricot/django-elasticsearch-dsl#111 about this.

Untill it has been fixed, we need to have a proxy model for the purpose, I believe.

class HTMLFileManager(models.Manager):
def get_queryset(self):
return super(HTMLFileManager, self).get_queryset().filter(is_html=True)

This comment has been minimized.

@ericholscher

ericholscher Jun 13, 2018

Member

Can't this just do filter(filename__endswith='html') instead of adding additional state to the model?

This comment has been minimized.

@safwanrahman

safwanrahman Jun 13, 2018

Member

Yeah. It can be done. I thought it would be slower, so I added another state.
But I think performance is not a issue here. So I am good to filter by name.

safwanrahman added some commits Jun 14, 2018

File searching basic backend task has been implemented
TODO: integrate it with view and template
@safwanrahman

This comment has been minimized.

Member

safwanrahman commented Jun 14, 2018

@ericholscher I think you can take another look into this!
I will integrate the view and templated later today!

@ericholscher

Good changes. I look forward to testing it locally once it's hooked up in the templates & views :)

class ImportedFileManager(models.Manager):
def get_queryset(self):

This comment has been minimized.

@ericholscher

ericholscher Jun 14, 2018

Member

I don't think we should exclude them here. This will change the logic in other downstream code which is using ImportedFile, which we don't want to do.

version_slug=self.version.slug,
include_file=False)
file_path = find_file(basename=basename, pattern='*.fjson', path=full_path)

This comment has been minimized.

@ericholscher

ericholscher Jun 14, 2018

Member

I don't think we need to do all this I/O. Shouldn't it simply be:

os.path.join(full_path, self.path.replace('.html', '.fjson') or similar? It should be in the same path structure as the existing path.

file_path = find_file(basename=basename, pattern='*.fjson', path=full_path)
return file_path
@cached_property

This comment has been minimized.

@ericholscher

ericholscher Jun 14, 2018

Member

How long is this cached for? Will we really see value in caching it vs. the tradeoff of keeping a bunch of JSON in memory?

This comment has been minimized.

@safwanrahman

safwanrahman Jun 14, 2018

Member

As per documentation cached result will persist as long as the instance does.
As its called multiple time for same instance while indexing, (eg path, content, header), I think the cache helps much in indexing fast.

return queryset
def update(self, thing, **kwargs):

This comment has been minimized.

@ericholscher

ericholscher Jun 14, 2018

Member

When does this update actually get called? I believe it's a signal attached to the saving of the ImportedFile objects?

This comment has been minimized.

@safwanrahman

safwanrahman Jun 14, 2018

Member

Its called everytime the object is created, updated, or deleted.
Also whenever we run the management command.
call from the registry

facets = {
'project': TermsFacet(field='project'),
'version': TermsFacet(field='version')
}

This comment has been minimized.

@ericholscher

ericholscher Jun 14, 2018

Member

Love how simple this is.

This comment has been minimized.

@safwanrahman

safwanrahman Jun 14, 2018

Member

Yes. me too! the thing is made so simple by using OOP concepts!

@safwanrahman

This comment has been minimized.

Member

safwanrahman commented Jun 14, 2018

@ericholscher I have fixed the integration in view and template. But currently the description with highlighting is not showing in the file search result. I will look over into it later. (Fixed)
Can you run it locally and let me know about your feedback?

@ericholscher

Looks good! Once we get the tests working, I think we can merge this into the search_upgrade branch, and continue to work on improvements as additional PR's, so as not to complicate this one.

@safwanrahman safwanrahman changed the title from Search Proof of Concept to Upgrade Elasticsearch to 6.x and rewrite using elasticsearch-dsl Jun 18, 2018

@safwanrahman safwanrahman changed the title from Upgrade Elasticsearch to 6.x and rewrite using elasticsearch-dsl to Upgrade to Elasticsearch 6.x Jun 18, 2018

@safwanrahman

This comment has been minimized.

Member

safwanrahman commented Jun 19, 2018

@ericholscher I have fixed the tests and now it is passing. 🎾
Could not mock the processed_json property, so added another method get_processed_json that will be called from processed_json and mock get_processed_json instead.

Without mocking, the get_processed_json would run in every test for all the files, which will make the test slower. I will add other tests for the model methods in future.
possible to merge?

@safwanrahman safwanrahman changed the title from Upgrade to Elasticsearch 6.x to Upgrade Elasticsearch to version 6.x Jun 19, 2018

@ericholscher

Looks good. I'll get it merged once we rename the migration.

@@ -0,0 +1,54 @@
# -*- coding: utf-8 -*-

This comment has been minimized.

@ericholscher

ericholscher Jun 19, 2018

Member

This migration should have a name, saying what it does.

I also worry that this migration will become out of date with a long-running branch beside the master. I'm not sure the best path to take -- we will likely just need to re-create it before we merge it into master.

This comment has been minimized.

@safwanrahman

safwanrahman Jun 19, 2018

Member

Yes! We can absolutely do this. I will keep this in mind.

@safwanrahman

This comment has been minimized.

Member

safwanrahman commented Jun 19, 2018

@ericholscher Renamed the migration. ready to merge! :shipit:

@ericholscher

This comment has been minimized.

Member

ericholscher commented Jun 19, 2018

Looks good. 👍

@ericholscher ericholscher merged commit fd75aa3 into rtfd:search_upgrade Jun 19, 2018

1 check was pending

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details

Search update automation moved this from In progress to Done Jun 19, 2018

@safwanrahman safwanrahman deleted the safwanrahman:search branch Jul 7, 2018

safwanrahman added a commit to safwanrahman/readthedocs.org that referenced this pull request Jul 16, 2018

Merge pull request rtfd#4211 from safwanrahman/search
Upgrade Elasticsearch to version 6.x

safwanrahman added a commit to safwanrahman/readthedocs.org that referenced this pull request Jul 16, 2018

Merge pull request rtfd#4211 from safwanrahman/search
Upgrade Elasticsearch to version 6.x
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment