SRIKARKUNTA/De_coding_task

CSV Processor

A Python-based project that downloads a CSV file from a provided URL, saves it with a timestamp, cleans the data by removing duplicates and empty rows, logs discarded rows, and produces a JSON stats file and a monthly metrics CSV.

📌 Assumptions

  • Input CSV has columns:
    • order_date (in YYYY-MM-DD format)
    • item_price (float)
    • item_promo_discount (float)
  • The system has enough memory to load a ~1 GB CSV into pandas (16 GB+ RAM recommended).
  • No extra invalid row checks beyond:
    • Empty rows
    • Duplicates
    • Invalid date formats (discarded)
  • Because no other validity checks are applied, "total_invalid_rows_discarded" in the stats file is always 0.
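
The cleaning rules listed above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual `clean.py`: the function name `clean_orders`, its return shape, and the `reason` column are assumptions made for the example.

```python
import pandas as pd

DATE_FORMAT = "%Y-%m-%d"  # order_date format stated in the assumptions


def clean_orders(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split df into (kept, discarded) rows.

    Hypothetical helper: name, signature, and the "reason" column are
    illustrative assumptions, not the project's real API.
    """
    # A row counts as empty when every column is missing.
    is_empty = df.isna().all(axis=1)
    # Second and later copies of an identical row are duplicates.
    is_duplicate = df.duplicated(keep="first") & ~is_empty
    # Values that do not match YYYY-MM-DD are coerced to NaT.
    parsed = pd.to_datetime(df["order_date"], format=DATE_FORMAT, errors="coerce")
    is_bad_date = parsed.isna() & ~is_empty & ~is_duplicate

    # Record why each discarded row was dropped.
    reason = pd.Series("", index=df.index)
    reason[is_empty] = "empty"
    reason[is_duplicate] = "duplicate"
    reason[is_bad_date] = "invalid_date"

    discard = reason != ""
    discarded = df[discard].assign(reason=reason[discard])
    kept = df[~discard].assign(order_date=parsed[~discard])
    return kept, discarded


raw = pd.DataFrame(
    {
        "order_date": ["2024-01-05", "2024-01-05", None, "05/01/2024", "2024-02-10"],
        "item_price": [10.0, 10.0, None, 7.5, 20.0],
        "item_promo_discount": [1.0, 1.0, None, 0.0, 2.5],
    }
)
kept, discarded = clean_orders(raw)
```

Writing `discarded` out with `discarded.to_csv("discarded_rows.csv", index=False)` would give a log of removed rows; whether the real log includes a reason column is not stated in this README.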

Setup

  1. Install Poetry: pip install poetry
  2. Install dependencies: poetry install

📂 Project Structure

csv-processor/
├── src/
│   ├── main.py                # CLI entrypoint
│   └── csv_processor/
│       ├── clean.py           # Data cleaning logic
│       ├── download.py        # Streaming CSV download
│       ├── metrics.py         # Monthly aggregation calculations
│       └── stats.py           # Processing statistics (JSON output)
│
├── tests/
│   ├── __init__.py
│   └── test_clean.py          # Unit tests for cleaning module
│
├── pyproject.toml             # Poetry dependencies & build config
├── README.md                  # Project documentation
└── .gitignore                 # Ignore large outputs and build artifacts

How to Run

Run via CLI with the URL argument:

poetry run python src/main.py --url https://storage.googleapis.com/nozzle-csv-exports/testing-data/order_items_data_2_.csv
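
For orientation, the `--url` argument and the timestamped streaming download might look something like the sketch below. It uses only the standard library; the project's `download.py` may well use a different HTTP client. The function names and the `%Y%m%d_%H%M%S` timestamp format are assumptions for illustration.

```python
import argparse
import shutil
import urllib.request
from datetime import datetime
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the single --url argument shown in the command above."""
    parser = argparse.ArgumentParser(description="Download and process an order items CSV.")
    parser.add_argument("--url", required=True, help="URL of the CSV file to download")
    return parser.parse_args(argv)


def download_csv(url: str, out_dir: Path = Path(".")) -> Path:
    """Stream the file at url to disk and return the saved path.

    Hypothetical helper: the timestamp format in the file name is assumed.
    """
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = out_dir / f"order_items_{stamp}.csv"
    # Copy in 1 MiB chunks so the whole file is never held in memory at once.
    with urllib.request.urlopen(url) as response, open(target, "wb") as fh:
        shutil.copyfileobj(response, fh, length=1024 * 1024)
    return target
```

An entrypoint could then call `download_csv(parse_args().url)` and pass the returned path on to the cleaning step.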

Outputs: When you run the pipeline, the following files are generated in the working directory:

order_items_.csv → raw downloaded file

discarded_rows.csv → rows removed during cleaning

processing_stats.json → summary of processing (rows kept, discarded, invalid)

monthly_metrics.csv → monthly aggregated metrics

⚠️ Note: These files are listed in .gitignore, so they won't appear in the GitHub repo, but they are created locally every time you run the program.
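
As a rough illustration of how the stats and monthly metrics outputs could be built: this README does not specify the exact metrics or the JSON keys, so everything below is assumed except the `total_invalid_rows_discarded` key and the three input column names. The per-month sums are placeholders, not necessarily what `metrics.py` computes.

```python
import json

import pandas as pd


def build_stats(total_rows: int, kept_rows: int, discarded_rows: int) -> dict:
    """Assemble a stats dictionary (hypothetical helper)."""
    return {
        "total_rows_read": total_rows,       # assumed key name
        "rows_kept": kept_rows,              # assumed key name
        "rows_discarded": discarded_rows,    # assumed key name
        "total_invalid_rows_discarded": 0,   # key from this README; always 0
    }


def monthly_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Sum price and discount per calendar month (assumed metrics)."""
    dates = pd.to_datetime(df["order_date"], format="%Y-%m-%d")
    month = dates.dt.strftime("%Y-%m").rename("order_month")
    grouped = df.groupby(month).agg(
        total_item_price=("item_price", "sum"),
        total_item_promo_discount=("item_promo_discount", "sum"),
    )
    return grouped.reset_index()


clean = pd.DataFrame(
    {
        "order_date": ["2024-01-05", "2024-01-20", "2024-02-10"],
        "item_price": [10.0, 5.0, 20.0],
        "item_promo_discount": [1.0, 0.5, 2.5],
    }
)
metrics = monthly_metrics(clean)
stats_json = json.dumps(build_stats(total_rows=5, kept_rows=3, discarded_rows=2), indent=2)
```

Here `metrics.to_csv("monthly_metrics.csv", index=False)` and writing `stats_json` to `processing_stats.json` would produce files with the names listed above.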

How to Execute Tests

poetry run pytest

Tests cover basic cleaning functionality using a sample in-memory DataFrame.
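
A test of this kind might look roughly like the following. The `clean` function here is a stand-in defined inside the example so that it runs on its own; the real function name and behavior in `clean.py` may differ.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the project's cleaning function (hypothetical)."""
    return df.dropna(how="all").drop_duplicates().reset_index(drop=True)


def test_clean_removes_duplicates_and_empty_rows():
    # Small in-memory frame: one valid row, its duplicate, and one empty row.
    df = pd.DataFrame(
        {
            "order_date": ["2024-01-05", "2024-01-05", None],
            "item_price": [10.0, 10.0, None],
            "item_promo_discount": [1.0, 1.0, None],
        }
    )
    result = clean(df)
    assert len(result) == 1
    assert result.loc[0, "order_date"] == "2024-01-05"
```

pytest collects any function whose name starts with `test_` in files matching `test_*.py`, so a file such as `tests/test_clean.py` is picked up by `poetry run pytest` without extra configuration.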

πŸ‘¨β€πŸ’» Author

  • Srikar Kunta
