A Python application that processes files based on job definitions. It supports multiple file transformations including:
- Extracting ZIP files
- Converting XML files to CSV format
Note: This project simulates S3 paths locally. Any path starting with s3:// will be automatically converted to a local path by replacing s3:// with s3_simulation/. For example, s3://alejo-parsers/file.zip becomes s3_simulation/alejo-parsers/file.zip.
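The path conversion described in the note can be sketched as a small helper. The function name `to_local_path` and the two constants are hypothetical, chosen for illustration; the project's own implementation may differ.

```python
S3_PREFIX = "s3://"
LOCAL_ROOT = "s3_simulation/"  # local directory standing in for S3


def to_local_path(path: str) -> str:
    """Map an s3:// path onto the local simulation directory.

    Paths that do not start with s3:// are returned unchanged.
    """
    if path.startswith(S3_PREFIX):
        return LOCAL_ROOT + path[len(S3_PREFIX):]
    return path


print(to_local_path("s3://alejo-parsers/file.zip"))
# s3_simulation/alejo-parsers/file.zip
```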
Note: The job definition file path is hardcoded as job_definition.json in main.py. Make sure to place your job definition file in the project root directory.
.
├── src/
│ ├── parsers/
│ │ ├── base_parser.py # Base parser class
│ │ ├── zip_parser.py # ZIP file extraction
│ │ └── xml_parser.py # XML to CSV conversion
│ ├── config.py # Configuration and logging setup
│ └── main.py # Main application entry point
├── logs/ # Log files directory
│ └── file_parser.log # Application logs
├── s3_simulation/ # Local directory for S3 path simulation
└── job_definition.json # Job configuration file
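The parsers/ layout above suggests a shared base class with one subclass per transformation. The following is a minimal sketch of that idea, not the project's actual code: the constructor arguments mirror the origin, destiny and kwargs keys of the job definition, and the `parse` method name is an assumption.

```python
import zipfile
from abc import ABC, abstractmethod
from pathlib import Path


class BaseParser(ABC):
    """Hypothetical shared interface: every parser gets an origin and a destiny."""

    def __init__(self, origin, destiny, **kwargs):
        self.origin = Path(origin)
        self.destiny = Path(destiny)
        self.kwargs = kwargs  # e.g. scripts_path, scripts_bucket

    @abstractmethod
    def parse(self) -> list:
        """Run the transformation and return the paths that were written."""


class ZipFileParser(BaseParser):
    """Extracts every member of the origin archive into the destiny directory."""

    def parse(self) -> list:
        self.destiny.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(self.origin) as archive:
            archive.extractall(self.destiny)
            return [self.destiny / name for name in archive.namelist()]
```

Both paths are expected to be local already, that is, after any s3:// prefix has been mapped to s3_simulation/.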
- Python 3.x
- Required packages:
  - pandas
  - lxml
- Clone the repository:
git clone https://github.com/rubenoliveros/file_parser.git
cd file_parser
- Install dependencies:
pip install pandas lxml
- Create a job definition file (job_definition.json) with your transformations:
{
  "transformations": [
    {
      "object": {
        "parser": "unzip",
        "origin": "s3://alejo-parsers/workspace1/sources/rutafuente1/miarchivo1.zip",
        "destiny": "s3://alejo-parsers/workspace1/sources/rutafuente2/",
        "classname": "ZipFileParser"
      },
      "kwargs": {
        "scripts_path": "scripts/",
        "scripts_bucket": "alejo-scripts"
      }
    },
    {
      "object": {
        "parser": "xml_to_csv",
        "origin": "s3://alejo-parsers/workspace1/sources/rutafuente1/miarchivo2.xml",
        "destiny": "s3://alejo-parsers/workspace1/sources/rutafuente2/",
        "classname": "XmlToCsvParser"
      },
      "kwargs": {
        "scripts_path": "scripts/",
        "scripts_bucket": "alejo-scripts"
      }
    }
  ]
}
- Place your input files in the corresponding local directories under s3_simulation/. For example:
  s3_simulation/alejo-parsers/workspace1/sources/rutafuente1/miarchivo1.zip
  s3_simulation/alejo-parsers/workspace1/sources/rutafuente1/miarchivo2.xml
- Run the application:
python3 src/main.py
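When the application runs, each entry in transformations is handed to the parser named in its object block. A rough sketch of that dispatch loop follows. The `PARSERS` registry, `run_job`, and the two stand-in handler functions are hypothetical; the real project maps these names to its parser classes.

```python
import json


# Stand-in handlers for illustration only; they just describe the step.
def unzip(origin, destiny, **kwargs):
    return f"unzip {origin} -> {destiny}"


def xml_to_csv(origin, destiny, **kwargs):
    return f"xml_to_csv {origin} -> {destiny}"


# Hypothetical registry keyed by the "parser" field of the job definition.
PARSERS = {"unzip": unzip, "xml_to_csv": xml_to_csv}


def run_job(job: dict) -> list:
    """Run every transformation in order and collect the handler results."""
    results = []
    for step in job["transformations"]:
        spec = step["object"]
        handler = PARSERS.get(spec["parser"])
        if handler is None:
            raise ValueError(f"Unsupported parser: {spec['parser']}")
        results.append(
            handler(spec["origin"], spec["destiny"], **step.get("kwargs", {}))
        )
    return results


job = json.loads("""
{
  "transformations": [
    {
      "object": {
        "parser": "unzip",
        "origin": "s3://alejo-parsers/workspace1/sources/rutafuente1/miarchivo1.zip",
        "destiny": "s3://alejo-parsers/workspace1/sources/rutafuente2/",
        "classname": "ZipFileParser"
      },
      "kwargs": {"scripts_path": "scripts/", "scripts_bucket": "alejo-scripts"}
    }
  ]
}
""")
print(run_job(job))
```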
- ZIP Parser (unzip)
  - Extracts contents of a ZIP file to a destination directory
  - Example: "parser": "unzip"
- XML to CSV Parser (xml_to_csv)
  - Converts XML files to CSV format
  - CSV headers: name, email, street, city, country
  - Example: "parser": "xml_to_csv"
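To illustrate the XML to CSV conversion, here is a minimal sketch that produces the five headers listed above. The project depends on pandas and lxml; this example uses only the standard library so it stays self-contained. The XML layout (a `user` record with a nested `address`) and the function name `xml_to_csv_text` are assumptions, since the expected input schema is not documented here.

```python
import csv
import io
import xml.etree.ElementTree as ET

HEADERS = ["name", "email", "street", "city", "country"]


def xml_to_csv_text(xml_text: str, record_tag: str = "user") -> str:
    """Convert each <record_tag> element into one CSV row."""
    root = ET.fromstring(xml_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADERS, lineterminator="\n")
    writer.writeheader()
    for record in root.iter(record_tag):
        # ".//field" also matches fields nested deeper, e.g. inside <address>.
        # Missing fields become empty cells.
        writer.writerow(
            {h: (record.findtext(f".//{h}") or "").strip() for h in HEADERS}
        )
    return buffer.getvalue()


# Hypothetical input; the real schema may differ.
sample = """<users>
  <user>
    <name>Ana</name>
    <email>ana@example.com</email>
    <address>
      <street>Calle 1</street>
      <city>Bogota</city>
      <country>Colombia</country>
    </address>
  </user>
</users>"""

print(xml_to_csv_text(sample))
```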
The application logs all operations to:
- Console output
- logs/file_parser.log
Log entries include:
- Timestamp
- Log level (INFO/ERROR)
- Operation details
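A logging setup matching this description could look like the sketch below, using Python's standard logging module. The function name `setup_logging`, the logger name `file_parser`, and the exact format string are assumptions; the actual configuration lives in src/config.py.

```python
import logging
from pathlib import Path


def setup_logging(log_file: str = "logs/file_parser.log") -> logging.Logger:
    """Send log records to both the console and a log file."""
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger("file_parser")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    # Timestamp, log level, then the operation details.
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Calling `setup_logging()` once at startup and then `logger.info(...)` or `logger.error(...)` writes each entry to both destinations.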
The application handles various error cases:
- Missing job definition file
- Invalid JSON format
- Unsupported parser types
- File not found errors
- Processing errors
All errors are logged with detailed messages for debugging.
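The first three error cases can be sketched as follows. This is an illustration under assumptions, not the project's code: `load_job_definition` and `SUPPORTED_PARSERS` are hypothetical names, and the choice to skip an unsupported step rather than abort the whole job is a design guess.

```python
import json
import logging

logger = logging.getLogger("file_parser")

# Parser types documented in this README.
SUPPORTED_PARSERS = {"unzip", "xml_to_csv"}


def load_job_definition(path: str = "job_definition.json"):
    """Return the usable transformations, or None if the file cannot be read."""
    try:
        with open(path, encoding="utf-8") as handle:
            job = json.load(handle)
    except FileNotFoundError:
        logger.error("Job definition file not found: %s", path)
        return None
    except json.JSONDecodeError as exc:
        logger.error("Invalid JSON in %s: %s", path, exc)
        return None

    valid = []
    for step in job.get("transformations", []):
        parser = step.get("object", {}).get("parser")
        if parser not in SUPPORTED_PARSERS:
            logger.error("Unsupported parser type: %s", parser)
            continue  # skip the bad step, keep processing the rest
        valid.append(step)
    return valid
```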
Feel free to submit issues and enhancement requests!