### ETL - PinoyFoodBlog

This project demonstrates an ETL pipeline for PinoyFoodBlog built with <b>Medallion Architecture</b>, <b>Web Scraping</b>, <b>Data Modeling</b>, <b>Pandas</b>, <b>Python</b>, <b>SQL</b>, and <b>Jupyter Notebook</b>. The pipeline has three stages (Bronze, Silver, and Gold), each with its own data processing and storage methods.

- Bronze Stage:

    - <b>File:</b> `extract.py`
    - <b>Process:</b> Web scraping is performed asynchronously using `playwright` and `asyncio`, with an object-oriented structure provided by `dataclasses`. The raw data is saved in `datasets/bronze` for initial processing.

- Silver Stage:

    - <b>File:</b> `transform.ipynb`
    - <b>Process:</b> Data is cleaned and normalized in `Jupyter` using `pandas`, with spelling corrections by `thefuzz` and unit standardization by `pint`. The structured data is saved in `datasets/silver` both as a cleaned dataset and as a `pinoyfoodblog.db` database file.

- Gold Stage:

    - <b>File:</b> `load.ipynb`
    - <b>Process:</b> Using `sqlalchemy`, the analytics-ready data is loaded from the Silver database. Aggregated insights are saved as a Parquet file in `datasets/gold`, providing analytics on cooking times, nutritional content, and servings.
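
The Bronze stage described above could be sketched roughly as follows. This is a minimal illustration, not the contents of `extract.py`: the `Recipe` fields, the CSS selectors, the helper names (`scrape_recipe`, `scrape_all`, `save_bronze`), and the exact output path are all assumptions made for the example. Only the libraries (`playwright`, `asyncio`, `dataclasses`), the `datasets/bronze` folder, and the `pinoyfoodblog.json` filename come from this README.

```python
import asyncio
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Recipe:
    """One scraped recipe. The field names are hypothetical."""
    title: str
    url: str
    ingredients: list = field(default_factory=list)


async def scrape_recipe(page, url):
    """Open one recipe page and extract a few fields.

    The CSS selectors are placeholders; real ones depend on the site's markup.
    """
    await page.goto(url)
    title = await page.inner_text("h1")
    ingredients = await page.locator("li.ingredient").all_inner_texts()
    return Recipe(title=title.strip(), url=url, ingredients=ingredients)


async def scrape_all(urls):
    """Scrape several pages concurrently, one browser tab per URL."""
    # Imported lazily so the dataclass and save helper work without Playwright.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()

        async def worker(url):
            page = await browser.new_page()
            try:
                return await scrape_recipe(page, url)
            finally:
                await page.close()

        try:
            return await asyncio.gather(*(worker(u) for u in urls))
        finally:
            await browser.close()


def save_bronze(recipes, out_dir="datasets/bronze"):
    """Write the raw records to <out_dir>/pinoyfoodblog.json and return the path."""
    out_path = Path(out_dir) / "pinoyfoodblog.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    payload = [asdict(r) for r in recipes]
    out_path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path
```

With Playwright and its browsers installed, a run would look like `save_bronze(asyncio.run(scrape_all(urls)))`, where `urls` is a list of recipe page URLs. Keeping `scrape_recipe` separate from the browser setup makes it possible to exercise the parsing logic with a stub page object.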

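The two Silver-stage cleaning steps, spelling correction and unit standardization, can be illustrated with a small standard-library sketch. To keep it runnable without extra packages, it uses `difflib` as a stand-in for `thefuzz` and a hand-written conversion table in place of `pint`; it is not the code in `transform.ipynb`. The vocabulary, the cutoff, and the function names are assumptions made for the example.

```python
import difflib

# Hypothetical vocabulary of canonical ingredient names.
KNOWN_INGREDIENTS = ["soy sauce", "vinegar", "garlic", "bay leaf", "coconut milk"]

# Millilitres per unit (US customary); only a few units, for illustration.
ML_PER_UNIT = {"ml": 1.0, "tsp": 4.929, "tbsp": 14.787, "cup": 236.588}


def correct_spelling(name, vocabulary=KNOWN_INGREDIENTS, cutoff=0.8):
    """Return the closest known name, or the cleaned input if nothing is close."""
    cleaned = " ".join(name.lower().split())  # lowercase, collapse whitespace
    matches = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


def to_millilitres(amount, unit):
    """Convert a volume to millilitres; raises KeyError for unknown units."""
    factor = ML_PER_UNIT[unit.lower().rstrip(".")]
    return round(amount * factor, 1)


print(correct_spelling("Soy  Suace"))  # misspelling mapped to "soy sauce"
print(to_millilitres(2, "tbsp"))       # 29.6
```

In a `pandas` workflow, functions like these would typically be applied column-wise, for example with `Series.map`. Names with no close match are passed through unchanged so they can be reviewed by hand instead of being silently rewritten.
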
### Installation Process

To view or run the `.ipynb` files, install Jupyter Notebook or the Jupyter extension for VS Code.

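Install the Python packages used in the project with pip. Playwright also needs its browser binaries, which the `playwright install` command downloads once the package is installed.

In [None]:
pip install pandas thefuzz pint sqlalchemy playwright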

### Final Output

The extraction step saves the raw data as a JSON file named `pinoyfoodblog.json`, which is then transformed and loaded by the `.ipynb` notebooks.

For reference, here are the expected outputs at each stage, from raw data to analytics:

<div style="text-align: center;">
    Raw data from extraction<br>
    <img src="Images/json-structure.png" alt="json-structure" width="65%"/><br><br>
    Normalized database structure<br>
    <img src="Images/ERD.png" alt="ERD" width="80%"/><br><br>
    Analytics<br>
    <img src="Images/analytics.png" alt="analytics" width="100%"/>
</div>
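
The step from the normalized database to the analytics output can be sketched with plain SQL. This example uses the standard-library `sqlite3` module and an in-memory database instead of `sqlalchemy` and `pinoyfoodblog.db`, and it stops before the Parquet export. The table names, column names, and sample rows are made up for illustration and are not taken from the ERD above.

```python
import sqlite3


def build_sample_db(conn):
    """Create two hypothetical normalized tables and fill them with sample rows."""
    conn.executescript(
        """
        CREATE TABLE recipes (
            recipe_id    INTEGER PRIMARY KEY,
            title        TEXT NOT NULL,
            cook_minutes INTEGER,
            servings     INTEGER
        );
        CREATE TABLE nutrition (
            recipe_id INTEGER REFERENCES recipes(recipe_id),
            calories  REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO recipes VALUES (?, ?, ?, ?)",
        [(1, "Recipe A", 30, 4), (2, "Recipe B", 50, 4), (3, "Recipe C", 90, 6)],
    )
    conn.executemany(
        "INSERT INTO nutrition VALUES (?, ?)",
        [(1, 300.0), (2, 500.0), (3, 450.0)],
    )
    conn.commit()


def summarize_by_servings(conn):
    """Average cooking time and calories per serving size."""
    query = """
        SELECT r.servings,
               COUNT(*)            AS recipe_count,
               AVG(r.cook_minutes) AS avg_cook_minutes,
               AVG(n.calories)     AS avg_calories
        FROM recipes AS r
        JOIN nutrition AS n USING (recipe_id)
        GROUP BY r.servings
        ORDER BY r.servings
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
build_sample_db(conn)
summary = summarize_by_servings(conn)
print(summary)  # [(4, 2, 40.0, 400.0), (6, 1, 90.0, 450.0)]
```

In the project itself, a result like `summary` would be read into a `pandas` DataFrame through `sqlalchemy` and written to `datasets/gold` as a Parquet file.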