This Python-based project downloads a CSV file from a provided URL, saves it with a timestamp, cleans the data (removing duplicate and empty rows), logs the discarded rows, and produces a JSON stats file and a monthly metrics CSV.
## Assumptions

- Input CSV has the columns:
  - `order_date` (in `YYYY-MM-DD` format)
  - `item_price` (float)
  - `item_promo_discount` (float)
- System has enough memory to load a ~1GB CSV into Pandas (16GB+ RAM recommended).
- No extra invalid-row checks are performed beyond:
  - Empty rows
  - Duplicates
  - Invalid date formats (discarded)
- Therefore, `"total_invalid_rows_discarded"` in the stats output is always `0`.
## Installation

- Install Poetry: `pip install poetry`
- Install dependencies: `poetry install`
## Project Structure

```
csv-processor/
├── src/
│   ├── main.py                  # CLI entrypoint
│   └── csv_processor/
│       ├── clean.py             # Data cleaning logic
│       ├── download.py          # Streaming CSV download
│       ├── metrics.py           # Monthly aggregation calculations
│       └── stats.py             # Processing statistics (JSON output)
├── tests/
│   ├── __init__.py
│   └── test_clean.py            # Unit tests for cleaning module
├── pyproject.toml               # Poetry dependencies & build config
├── README.md                    # Project documentation
└── .gitignore                   # Ignore large outputs and build artifacts
```
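For orientation, the CLI surface of `main.py` might look roughly like this sketch (the `parse_args` helper and help texts are assumptions; only the `--url` flag appears in the source):

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the pipeline's single required CLI argument."""
    parser = argparse.ArgumentParser(
        description="Download, clean, and aggregate an order-items CSV"
    )
    parser.add_argument("--url", required=True,
                        help="URL of the CSV file to download")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # The real entrypoint would hand args.url to the download step here
    print(f"Processing {args.url}")
```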
## Usage

Run the pipeline via the CLI, passing the source URL:

```shell
poetry run python src/main.py --url https://storage.googleapis.com/nozzle-csv-exports/testing-data/order_items_data_2_.csv
```
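The first thing the pipeline does with this URL is stream the response to disk, so even a ~1GB file never needs to fit in memory during download. A minimal standard-library sketch of that step (the `download_csv` name and the timestamped filename pattern are assumptions about `download.py`):

```python
import shutil
import urllib.request
from datetime import datetime
from pathlib import Path


def download_csv(url: str, dest_dir: str = ".") -> Path:
    """Stream a CSV from `url` to a timestamped file on disk."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = Path(dest_dir) / f"order_items_{timestamp}.csv"
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as fh:
        # copyfileobj streams in fixed-size chunks, so memory use
        # stays flat regardless of the file size
        shutil.copyfileobj(resp, fh, length=1024 * 1024)
    return dest
```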
## Outputs

Running the pipeline generates the following files in the working directory:

- `order_items_.csv` → raw downloaded file
- `discarded_rows.csv` → rows removed during cleaning
- `processing_stats.json` → summary of processing (rows kept, discarded, invalid)
- `monthly_metrics.csv` → monthly aggregated metrics
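The monthly metrics file boils down to a pandas group-by on the order month. A sketch of what `metrics.py` might compute (the specific metric columns here are assumptions; the source only says "monthly aggregated metrics"):

```python
import pandas as pd


def monthly_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate cleaned order rows into one metrics row per month."""
    df = df.copy()
    # Bucket each order into its calendar month, e.g. "2024-01"
    df["month"] = pd.to_datetime(df["order_date"]).dt.to_period("M").astype(str)
    return (
        df.groupby("month")
        .agg(
            order_count=("item_price", "size"),
            total_revenue=("item_price", "sum"),
            total_discount=("item_promo_discount", "sum"),
        )
        .reset_index()
    )
```

The result can be written out directly with `DataFrame.to_csv("monthly_metrics.csv", index=False)`.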
## Testing

Run the test suite:

```shell
poetry run pytest
```

Tests cover basic cleaning functionality using a sample in-memory DataFrame.
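Such a test typically builds a tiny DataFrame inline and asserts on what survives cleaning; for example (a sketch with a hypothetical test name, since the actual cases live in `test_clean.py`):

```python
import pandas as pd


def test_clean_drops_duplicates_and_empty_rows():
    """A 3-row frame with one duplicate and one empty row cleans to 1 row."""
    df = pd.DataFrame({
        "order_date": ["2024-03-01", "2024-03-01", None],
        "item_price": [12.5, 12.5, None],
        "item_promo_discount": [0.0, 0.0, None],
    })
    # Mirrors the cleaning rules: drop fully empty rows, then duplicates
    cleaned = df.dropna(how="all").drop_duplicates()
    assert len(cleaned) == 1
    assert cleaned.iloc[0]["order_date"] == "2024-03-01"
```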
## Author

- Srikar Kunta