- Some of the input texts are empty.
- In those cases the text cannot be classified by saga and entities. These specific cases could be stored in a separate table to keep track of the number of readers with empty texts, or tracked with a Prometheus metric and visualized in a Grafana graph (see the Prometheus sketch after this list).
- When I ran this command, an issue was raised in the classification-service related to the scikit dependency; I had to change it to scikit-learn.
- There were also duplicate entities in the entities.txt file, which I have cleaned up. When only a first name is given (no last name), I have assumed it refers to the same entity.
- kafka_consumer was picking random files from the texts folder. I modified it to process them one by one.
- I am using text.count(entity_name) to count the number of occurrences of a specific entity in the text. This approach has some limitations:
- If the entities include both 'Sam Gamgee' and 'Sam', counting 'Sam' also matches the 'Sam' inside 'Sam Gamgee', inflating its count when the full name is what actually appears (see the counting sketch after this list).
- We could instead use a named entity recognition (NER) approach to extract the names, using an LLM or an ML model.
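A minimal sketch of the substring problem and one possible workaround (mask longer names before counting shorter ones); the helper and sample text are illustrative, not taken from the existing code:

```python
import re

def count_entities(text: str, entities: list[str]) -> dict[str, int]:
    """Count whole-word entity occurrences, matching longer names first
    so that 'Sam' is not also counted inside 'Sam Gamgee'."""
    counts: dict[str, int] = {}
    for name in sorted(entities, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        counts[name] = len(pattern.findall(text))
        # Blank out matched spans so shorter names cannot re-match them
        text = pattern.sub(" " * len(name), text)
    return counts

text = "Sam Gamgee left the Shire. Later, Sam returned."
print(text.count("Sam"))                            # 2 -- overcounts 'Sam'
print(count_entities(text, ["Sam", "Sam Gamgee"]))  # {'Sam Gamgee': 1, 'Sam': 1}
```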
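Regarding the empty-text tracking above, a hedged sketch of a Prometheus counter (the metric name and handler are hypothetical):

```python
from prometheus_client import Counter, start_http_server

# Hypothetical counter for articles that arrive with empty text
EMPTY_TEXTS = Counter(
    "articles_empty_text_total",
    "Number of articles received with empty text",
)

def handle_article(text: str) -> None:
    if not text.strip():
        EMPTY_TEXTS.inc()  # graphable in Grafana via a Prometheus scrape
        return
    # ... otherwise classify by saga and entities as usual

start_http_server(8001)  # expose /metrics for Prometheus to scrape
```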
$ docker compose up
I just reused the listener-service to create the tables in ClickHouse and populate them.
$ docker-compose exec clickhouse bash
$ python3 kafka_consumer.py
In order to access ClickHouse and run analytical queries, execute:
$ docker-compose exec clickhouse bash
$ clickhouse-client --host=127.0.0.1 --port=9000
USE analytics_saga
Now you can pick a query from the listener-service/sql/analytics.sql file and paste it into the clickhouse-client session opened above.
I decided to use ClickHouse as the data store for generating analytical reports; it is a high-performance, column-oriented SQL database management system for OLAP. I also decided to use a Kimball star schema, which is very popular for analytics. The fact table stores only ids (foreign keys to the dimension tables) plus the measures, in this case the number of readers. The dimension tables are used to filter by the different dimensions, i.e. saga and entity.
Fact table: fact_readers
- Measures:
- readers: how many people read a specific article
- entity_counter: number of times a specific entity appears in a specific article of a specific saga
Dim tables:
- dim_entity: all possible entities with an entity_id generated by the system
- dim_saga: all possible sagas with a saga_id generated by the system
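For illustration, a query over this schema could look like the following sketch, using the clickhouse-driver Python package; the column names (saga_name, saga_id, readers) are assumptions inferred from the description above:

```python
from clickhouse_driver import Client

client = Client(host="127.0.0.1", port=9000, database="analytics_saga")

# Total readers per saga, joining the fact table against dim_saga
rows = client.execute(
    """
    SELECT s.saga_name, sum(f.readers) AS total_readers
    FROM fact_readers AS f
    INNER JOIN dim_saga AS s ON s.saga_id = f.saga_id
    GROUP BY s.saga_name
    ORDER BY total_readers DESC
    """
)
for saga, total_readers in rows:
    print(saga, total_readers)
```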
I would use Kubernetes to manage the different services, deploying them via CI/CD. The architecture would look something like this:
- A FastAPI service that classifies the texts into sagas using the ML model.
- A streaming ingestion microservice that consumes the texts and stores them in a Kafka topic, 'analytics.article.e'.
- A streaming transformation microservice that consumes the articles from 'analytics.article.e' and applies the filters/transformations needed to store the data in ClickHouse in the star schema for subsequent analysis.
- Visualization tool:
- A custom solution: a FastAPI service that generates analytics from ClickHouse, acting as a wrapper over the database (see the sketch after this list).
- An analytics tool such as Sisense, Tableau, or Apache Superset; these tools can use ClickHouse databases and tables as a data source.
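A minimal sketch of that custom FastAPI wrapper (the endpoint, table, and column names are assumptions); it also uses the Pydantic response_model mentioned in the refactor notes below:

```python
from clickhouse_driver import Client
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
client = Client(host="clickhouse", port=9000, database="analytics_saga")

class EntityStats(BaseModel):
    entity: str
    mentions: int

@app.get("/sagas/{saga_name}/top-entities", response_model=list[EntityStats])
def top_entities(saga_name: str, limit: int = 10) -> list[EntityStats]:
    # Query shape follows the star schema described above; names are assumed
    rows = client.execute(
        """
        SELECT e.entity_name, sum(f.entity_counter) AS mentions
        FROM fact_readers AS f
        INNER JOIN dim_saga AS s ON s.saga_id = f.saga_id
        INNER JOIN dim_entity AS e ON e.entity_id = f.entity_id
        WHERE s.saga_name = %(saga)s
        GROUP BY e.entity_name
        ORDER BY mentions DESC
        LIMIT %(limit)s
        """,
        {"saga": saga_name, "limit": limit},
    )
    return [EntityStats(entity=name, mentions=count) for name, count in rows]
```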
By managing the different microservices with K8s we can manage resources (CPU and memory) per service, and it gives us the flexibility to autoscale each service with an HPA or similar. So, for:
- Kafka microservices: they could be autoscaled based on the consumer-lag metric.
- FastAPI services: they could be autoscaled based on the number of HTTP requests.
Dependencies: regarding the project structure, I would use Poetry to manage the different dependencies instead of installing them directly in the Dockerfile.
Tests: I didn't spend time implementing tests; I focused on the analytics part. But it would be necessary to implement unit and integration tests, for example the sketch below.
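A unit test for the entity-counting helper sketched earlier could look like this (pytest assumed; the helper is repeated so the test is self-contained):

```python
import re

def count_entities(text, entities):
    # Same masking approach as in the earlier counting sketch
    counts = {}
    for name in sorted(entities, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        counts[name] = len(pattern.findall(text))
        text = pattern.sub(" " * len(name), text)
    return counts

def test_full_name_is_not_double_counted():
    counts = count_entities("Sam Gamgee met Sam.", ["Sam", "Sam Gamgee"])
    assert counts == {"Sam Gamgee": 1, "Sam": 1}
```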
Refactor: I would refactor the code because I noticed some issues, not only in the code that was shared with me but also in some code I implemented. For example:
- Not using Pydantic models (response_model) for the responses of the classification API.
- If the API needs to be exposed outside of the cluster in the organization, then set up some kind of authentication mechanism.
- Configuration hardcoded across the code instead of defined in a config class that reads it from environment variables (see the sketch after this list).
- Avoid prints; logging should be used instead.
- Handle exceptions when calling the classification API and when ingesting into Kafka.
- Try to wrap functionality into classes instead of leaving unstructured code.
- Configure mypy, flake8, and black to enforce good practices and keep the code clean and formatted.
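For the configuration point above, a minimal sketch of a settings class backed by environment variables (the variable names and defaults are illustrative):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Central configuration read from environment variables."""
    kafka_bootstrap_servers: str = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
    kafka_topic: str = os.getenv("KAFKA_TOPIC", "analytics.article.e")
    clickhouse_host: str = os.getenv("CLICKHOUSE_HOST", "localhost")
    clickhouse_port: int = int(os.getenv("CLICKHOUSE_PORT", "9000"))
    classification_url: str = os.getenv("CLASSIFICATION_URL", "http://classification-service:8000")

settings = Settings()  # imported by services instead of hardcoding values
```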