Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade Elasticsearch to version 6.x #4211

Merged
merged 16 commits into from Jun 19, 2018

Conversation

@safwanrahman
Copy link
Member

@safwanrahman safwanrahman commented Jun 8, 2018

Currently, only project indexing working. Investigating and trying to index the pages/files in a cleaner way. (Fixed)
Things need to be done

  • Implement a cleaner way to index File
  • Search for project
  • Search for File
  • Add highlighter for result description
  • Implement facet searching in a cleaner way
  • Configure travis to run the tests
  • Fixup the tests in order to pass

This fixes #4183

@safwanrahman safwanrahman added this to Backlog in Search update via automation Jun 8, 2018
@safwanrahman safwanrahman moved this from Backlog to In progress in Search update Jun 8, 2018
@agjohnson agjohnson changed the title [Fix #4183] Search Proof of Concept Search Proof of Concept Jun 8, 2018
Copy link
Member

@ericholscher ericholscher left a comment

Tested this locally and it's working well w/ projects & faceting. The next big thing is definitely getting the File search working. 👍

@@ -42,3 +42,4 @@ notifications:
branches:
only:
- master
- search_upgrade
Copy link
Member

@ericholscher ericholscher Jun 12, 2018

Interesting :)

@@ -143,7 +142,7 @@ def bulk_index(self, data, index=None, chunk_size=500, parent=None,
docs.append(doc)

# TODO: This doesn't work with the new ES setup.
bulk_index(self.es, docs, chunk_size=chunk_size)
# bulk_index(self.es, docs, chunk_size=chunk_size)

def index_document(self, data, index=None, parent=None, routing=None):
doc = self.extract_document(data)
Copy link
Member

@ericholscher ericholscher Jun 12, 2018

Guessing this entire file and other related code should be deleted?

Copy link
Member Author

@safwanrahman safwanrahman Jun 13, 2018

Yes. it need to be deleted. I will delete once I implement the file searching functionality!



@pytest.fixture
def all_projects():
def all_projects(es_index):
Copy link
Member

@ericholscher ericholscher Jun 12, 2018

Where does this get passed in? Is it automatically callign the above fixture based on name?

Copy link
Member Author

@safwanrahman safwanrahman Jun 13, 2018

Actually, its pytest's dependeny enjection. So if you have a fixture name foo and you accept this fixture in def bar(foo), the foo fixture will be passed to the bar fixture.

language=user_input.language)
response = project_search.execute()
results = response.hits
facets = response.facets
Copy link
Member

@ericholscher ericholscher Jun 12, 2018

Is this used?

Copy link
Member Author

@safwanrahman safwanrahman Jun 13, 2018

Yes. Its used for showing facet (language) in project search results.

'doc_types': [cls],
'model': cls._doc_type.model,
'query': query
}
Copy link
Member

@ericholscher ericholscher Jun 12, 2018

Is this logic required? It seems a bit heavy/complex.

Copy link
Member Author

@safwanrahman safwanrahman Jun 13, 2018

I think, to keep alligned with the search method, we can keep this logic. maybe its not needed now, but it will be useful to keep it alligned.

Copy link
Member

@ericholscher ericholscher left a comment

I think this change is too complicated. We should be able to do this without much of the code changes here with just a filter on the manager.

@@ -902,6 +903,7 @@ class ImportedFile(models.Model):
path = models.CharField(_('Path'), max_length=255)
md5 = models.CharField(_('MD5 checksum'), max_length=255)
commit = models.CharField(_('Commit'), max_length=255)
is_html = models.BooleanField(default=False)
Copy link
Member

@ericholscher ericholscher Jun 13, 2018

I don't think we need this on the model, the queryset manager can just do the filter, no?

Copy link
Member Author

@safwanrahman safwanrahman Jun 13, 2018

Yeah. it can do the filtering. but I thought it would be much slow to filter in the queryset manager. I am ok to remove it.

if fnmatch.fnmatch(filename, '*.html'):
model_class = HTMLFile
else:
model_class = ImportedFile
Copy link
Member

@ericholscher ericholscher Jun 13, 2018

I don't believe this is needed, since it's all the same model in the database.

Copy link
Member Author

@safwanrahman safwanrahman Jun 13, 2018

The problem is with actually signal manager. I have opened django-es/django-elasticsearch-dsl#111 about this.

Untill it has been fixed, we need to have a proxy model for the purpose, I believe.

class HTMLFileManager(models.Manager):

def get_queryset(self):
return super(HTMLFileManager, self).get_queryset().filter(is_html=True)
Copy link
Member

@ericholscher ericholscher Jun 13, 2018

Can't this just do filter(filename__endswith='html') instead of adding additional state to the model?

Copy link
Member Author

@safwanrahman safwanrahman Jun 13, 2018

Yeah. It can be done. I thought it would be slower, so I added another state.
But I think performance is not a issue here. So I am good to filter by name.

@safwanrahman
Copy link
Member Author

@safwanrahman safwanrahman commented Jun 14, 2018

@ericholscher I think you can take another look into this!
I will integrate the view and templated later today!

Copy link
Member

@ericholscher ericholscher left a comment

Good changes. I look forward to testing it locally once it's hooked up in the templates & views :)


class ImportedFileManager(models.Manager):

def get_queryset(self):
Copy link
Member

@ericholscher ericholscher Jun 14, 2018

I don't think we should exclude them here. This will change the logic in other downstream code which is using ImportedFile, which we don't want to do.

version_slug=self.version.slug,
include_file=False)

file_path = find_file(basename=basename, pattern='*.fjson', path=full_path)
Copy link
Member

@ericholscher ericholscher Jun 14, 2018

I don't think we need to do all this I/O. Shouldn't it simply be:

os.path.join(full_path, self.path.replace('.html', '.fjson') or similar? It should be in the same path structure as the existing path.

file_path = find_file(basename=basename, pattern='*.fjson', path=full_path)
return file_path

@cached_property
Copy link
Member

@ericholscher ericholscher Jun 14, 2018

How long is this cached for? Will we really see value in caching it vs. the tradeoff of keeping a bunch of JSON in memory?

Copy link
Member Author

@safwanrahman safwanrahman Jun 14, 2018

As per documentation cached result will persist as long as the instance does.
As its called multiple time for same instance while indexing, (eg path, content, header), I think the cache helps much in indexing fast.

return queryset

def update(self, thing, **kwargs):
Copy link
Member

@ericholscher ericholscher Jun 14, 2018

When does this update actually get called? I believe it's a signal attached to the saving of the ImportedFile objects?

Copy link
Member Author

@safwanrahman safwanrahman Jun 14, 2018

Its called everytime the object is created, updated, or deleted.
Also whenever we run the management command.
call from the registry

facets = {
'project': TermsFacet(field='project'),
'version': TermsFacet(field='version')
}
Copy link
Member

@ericholscher ericholscher Jun 14, 2018

Love how simple this is.

Copy link
Member Author

@safwanrahman safwanrahman Jun 14, 2018

Yes. me too! the thing is made so simple by using OOP concepts!

@safwanrahman
Copy link
Member Author

@safwanrahman safwanrahman commented Jun 14, 2018

@ericholscher I have fixed the integration in view and template. But currently the description with highlighting is not showing in the file search result. I will look over into it later. (Fixed)
Can you run it locally and let me know about your feedback?

Copy link
Member

@ericholscher ericholscher left a comment

Looks good! Once we get the tests working, I think we can merge this into the search_upgrade branch, and continue to work on improvements as additional PR's, so as not to complicate this one.

@safwanrahman safwanrahman changed the title Search Proof of Concept Upgrade Elasticsearch to 6.x and rewrite using elasticsearch-dsl Jun 18, 2018
@safwanrahman safwanrahman changed the title Upgrade Elasticsearch to 6.x and rewrite using elasticsearch-dsl Upgrade to Elasticsearch 6.x Jun 18, 2018
@safwanrahman
Copy link
Member Author

@safwanrahman safwanrahman commented Jun 19, 2018

@ericholscher I have fixed the tests and now it is passing. 🎾
Could not mock the processed_json property, so added another method get_processed_json that will be called from processed_json and mock get_processed_json instead.

Without mocking, the get_processed_json would run in every test for all the files, which will make the test slower. I will add other tests for the model methods in future.
possible to merge?

@safwanrahman safwanrahman changed the title Upgrade to Elasticsearch 6.x Upgrade Elasticsearch to version 6.x Jun 19, 2018
Copy link
Member

@ericholscher ericholscher left a comment

Looks good. I'll get it merged once we rename the migration.

@@ -0,0 +1,54 @@
# -*- coding: utf-8 -*-
Copy link
Member

@ericholscher ericholscher Jun 19, 2018

This migration should have a name, saying what it does.

I also worry that this migration will become out of date with a long-running branch beside the master. I'm not sure the best path to take -- we will likely just need to re-create it before we merge it into master.

Copy link
Member Author

@safwanrahman safwanrahman Jun 19, 2018

Yes! We can absolutely do this. I will keep this in mind.

@safwanrahman
Copy link
Member Author

@safwanrahman safwanrahman commented Jun 19, 2018

@ericholscher Renamed the migration. ready to merge! :shipit:

@ericholscher
Copy link
Member

@ericholscher ericholscher commented Jun 19, 2018

Looks good. 👍

@ericholscher ericholscher merged commit fd75aa3 into readthedocs:search_upgrade Jun 19, 2018
1 check was pending
Search update automation moved this from In progress to Done Jun 19, 2018
@safwanrahman safwanrahman deleted the search branch Jul 7, 2018
safwanrahman added a commit to safwanrahman/readthedocs.org that referenced this issue Jul 16, 2018
safwanrahman added a commit to safwanrahman/readthedocs.org that referenced this issue Jul 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Search update
  
Done
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants