Data Engineering #9

Closed
6 tasks
sayantikabanik opened this issue Dec 30, 2021 · 10 comments
@sayantikabanik
Owner

  • Start building a pipeline using Dagster (a POC is good enough)
  • Gather all the required datasets
  • Add an intake ingest step, or upload manually to the cloud
  • Add a model artifact pipeline
  • Collect a sample artifact
  • Prepare a dataset class to fetch each dataset
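
For the last item, a dataset class could look roughly like the sketch below. This is only an illustration under assumptions: the names `Dataset` and `DatasetRegistry`, and the one-CSV-per-dataset layout, are hypothetical and not taken from the repo.

```python
import csv
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Dataset:
    """One named dataset stored as a CSV file (hypothetical layout)."""

    name: str
    path: Path

    def fetch(self):
        """Read the CSV and return its rows as a list of dicts."""
        with self.path.open(newline="") as handle:
            return list(csv.DictReader(handle))


class DatasetRegistry:
    """Look up datasets by name under a single data directory."""

    def __init__(self, data_dir):
        self.data_dir = Path(data_dir)

    def get(self, name):
        # Assumes each dataset lives at <data_dir>/<name>.csv.
        path = self.data_dir / f"{name}.csv"
        if not path.exists():
            raise FileNotFoundError(f"No dataset named {name!r} in {self.data_dir}")
        return Dataset(name=name, path=path)
```

With this shape, `DatasetRegistry("data").get("some_name").fetch()` would return the rows of `data/some_name.csv`; swapping the CSV reader for pandas or intake later should only touch `fetch`.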
@sayantikabanik
Owner Author

@nitinjethwani7 @sahithi02 @anuraagbhavaraju let me know when the data has been uploaded to the repo, as discussed.

@sayantikabanik
Owner Author

Also, if anyone is interested in the data engineering part, I can give a walkthrough.

@nitinjethwani7
Collaborator

Yes, I am interested, please guide!

@anuraagbhavaraju
Collaborator

> Also if anyone is interested in the data engineering part I can give a walkthrough

I am interested too!

@sayantikabanik
Owner Author

sayantikabanik commented Jan 17, 2022

@sahithi02
https://docs.dagster.io/getting-started
https://docs.dagster.io/tutorial

The above links should give you a fair idea of how to get started.

@nitinjethwani7 @anuraagbhavaraju
Could you add the datasets to the folders here (if not already done), along with a note on how the data should be processed?

@sahithi02 the aim is to take the raw source data and, using code, get it into the processed format:
https://github.com/sayantikabanik/FP2/tree/main/forecasting_framework/data
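
As a rough idea of what "raw to processed" can mean in code, here is a minimal sketch using only the standard library. The column names (`date`, `load`) and the cleaning rules are made up for illustration; the actual processing steps depend on the notes added alongside each dataset.

```python
import csv
import io


def process_raw_csv(raw_text, numeric_columns):
    """Drop incomplete rows and cast the given columns to float."""
    processed = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Skip rows that are short, have blank fields, or have extra fields.
        if any(not isinstance(v, str) or not v.strip() for v in row.values()):
            continue
        for column in numeric_columns:
            row[column] = float(row[column])
        processed.append(row)
    return processed


# Hypothetical raw data: the second row is missing its value and is dropped.
RAW = "date,load\n2021-01-01,10.5\n2021-01-02,\n2021-01-03,12\n"
print(process_raw_csv(RAW, ["load"]))
```

Keeping this as a function of the raw text makes it easy to call from a pipeline step later.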

@sayantikabanik
Owner Author

sayantikabanik commented Jan 17, 2022

Dagster code example (note: this example uses an older version of the Dagster API)

import pandas as pd
from dagster import pipeline, solid, execute_pipeline

# Useful link:
# https://understandingdata.com/list-of-python-assert-statements-for-unit-tests/

@solid
def read_weather_data():
    df_sample = pd.read_csv("/Users/sayantikabanik/Downloads/SA4/ass-p1/sample_ds_sa.csv")
    return df_sample


@solid
def state_count(sample):
    age_greater_than_50 = sample.loc[sample["Age"] > 50, "Age"].count()
    return age_greater_than_50


@solid
def average_cal(sample):
    """
    Calculate the average experience from the sample.
    :param sample: DataFrame
    :return: float
    """
    avg_exp = sample.Experience.mean()
    return avg_exp


@solid
def display_results(context, age_greater_than_50, avg_exp):
    context.log.info(f"Count for age >50: {age_greater_than_50}")
    context.log.info(f"Overall avg experience: {avg_exp}")


@solid
def test_cases(avg_exp, count_age):
    assert avg_exp > 5
    assert count_age > 0
    # Both values must be present (not None).
    assert avg_exp is not None and count_age is not None


@pipeline
def data_pipeline():
    sample = read_weather_data()
    count_age = state_count(sample)
    avg_exp = average_cal(sample)
    display_results(count_age, avg_exp)
    test_cases(avg_exp, count_age)


if __name__ == "__main__":
    result = execute_pipeline(data_pipeline)
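
Since the example above uses the legacy `@solid`/`@pipeline` API, here is a rough sketch of how the same flow might look with the newer `@op`/`@job` decorators. It assumes a recent Dagster release. The rows are hard-coded, made-up stand-ins for the CSV, and plain lists of dicts replace the DataFrame so the calculations can be checked without pandas or Dagster installed.

```python
from statistics import mean

# Hypothetical stand-in rows for the CSV used above (made-up values).
SAMPLE_ROWS = [
    {"Age": 34, "Experience": 8.0},
    {"Age": 52, "Experience": 21.5},
    {"Age": 61, "Experience": 30.0},
]


def count_age_over_50(rows):
    """Count rows whose Age is strictly greater than 50."""
    return sum(1 for row in rows if row["Age"] > 50)


def average_experience(rows):
    """Return the mean of the Experience column as a float."""
    return mean(row["Experience"] for row in rows)


def build_job():
    """Wire the plain functions into a Dagster job (imports Dagster lazily)."""
    from dagster import job, op

    @op
    def read_sample():
        return SAMPLE_ROWS

    @op
    def state_count(rows):
        return count_age_over_50(rows)

    @op
    def average_cal(rows):
        return average_experience(rows)

    @op
    def display_results(context, age_greater_than_50, avg_exp):
        context.log.info(f"Count for age >50: {age_greater_than_50}")
        context.log.info(f"Overall avg experience: {avg_exp}")

    @job
    def data_job():
        rows = read_sample()
        display_results(state_count(rows), average_cal(rows))

    return data_job
```

With Dagster installed, `build_job().execute_in_process()` should run the job in the current process and return a result whose `success` attribute indicates whether it passed. Keeping the calculations in plain functions means they can be unit-tested on their own, which is the role `test_cases` plays in the older example.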

@anuraagbhavaraju
Collaborator

> @sahithi02 https://docs.dagster.io/getting-started https://docs.dagster.io/tutorial
>
> the above links should give you a fair idea how to get started
>
> @nitinjethwani7 @anuraagbhavaraju Could you add the datasets to the folders here (if not already done) also a note regarding how one should process the data
>
> @sahithi02 the aim is to take the raw source of the data and using code get to the processed format https://github.com/sayantikabanik/FP2/tree/main/forecasting_framework/data

The latest dataset has now been added to the data folder.

@sayantikabanik
Owner Author

https://github.com/sayantikabanik/FP2/tree/main/forecasting_framework/data
@sahithi02 please add the .py file for Dagster here.

@sayantikabanik
Owner Author

[Screenshot, 2022-01-28 at 10:55:02 AM]
@sahithi02
