Welcome to my GitHub profile! I'm passionate about data science, machine learning, data engineering, and open-source projects. I specialize in developing data-driven solutions and building automated systems using cutting-edge technologies.
Client: Suryoday Bank
Technologies: PySpark, Machine Learning, Shell Script, Sqoop, Data Warehousing, Oozie, SQL
- Designed and implemented scalable ETL pipelines using Sqoop, PySpark, Hive, and Shell, ensuring reliable and efficient data flow from source systems to the data warehouse.
- Optimized Sqoop ingestion performance by selecting the ideal split-by column and tuning the number of mappers, reducing ingestion time by 25%.
- Developed a pre-approved loan eligibility machine learning model using business-specific datamarts, improving targeting accuracy by 15% and reducing credit risk through better customer profiling.
- Created and maintained 5+ business-critical datamarts, enabling seamless report generation and reducing data retrieval time by 30%.
- Automated and orchestrated ETL workflows using Apache Oozie, cutting manual effort and improving job reliability and success rate.
- Performed exploratory data analysis (EDA) and feature selection on customer transaction data using Python (Pandas, Seaborn, Scikit-learn) to identify key behavioral patterns, laying groundwork for future classification models.
- Automated schema conversion from Greenplum to Hive using custom Linux shell scripting, accelerating migration efforts and reducing manual work.
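The Sqoop tuning described above boils down to picking an evenly distributed `--split-by` column and a mapper count the source database can sustain. A minimal sketch of assembling such a command (the JDBC URL, table, and column names are placeholders, not the actual bank systems):

```python
def build_sqoop_import(jdbc_url, table, split_by, num_mappers=8, target_dir=None):
    """Assemble a tuned Sqoop import command as an argv list."""
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        # a uniformly distributed column keeps the splits balanced
        "--split-by", split_by,
        # too many mappers can overload the source DB; tune per table
        "--num-mappers", str(num_mappers),
    ]
    if target_dir:
        cmd += ["--target-dir", target_dir]
    return cmd

cmd = build_sqoop_import(
    "jdbc:postgresql://dbhost/sales", "transactions",
    split_by="txn_id", num_mappers=8,
)
```

In practice the returned list would be passed to `subprocess.run` from an orchestration wrapper; skewed split columns are the usual cause of one mapper doing most of the work.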
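The schema-conversion automation mentioned above was shell-based; the core idea can be sketched in Python as a type-mapping pass over column definitions. The mapping table below is a hypothetical subset, not the full production script:

```python
import re

# Hypothetical subset of the Greenplum -> Hive type mapping;
# a real migration handles many more types and edge cases.
GP_TO_HIVE = {
    "int4": "INT", "int8": "BIGINT", "numeric": "DECIMAL",
    "varchar": "STRING", "text": "STRING",
    "timestamp": "TIMESTAMP", "bool": "BOOLEAN", "float8": "DOUBLE",
}

def convert_column(line):
    """Convert one 'name type' column definition from Greenplum to Hive DDL."""
    name, gp_type = line.strip().rstrip(",").split(None, 1)
    base = re.sub(r"\(.*\)", "", gp_type).strip().lower()
    params = re.search(r"\(.*\)", gp_type)
    hive = GP_TO_HIVE.get(base, "STRING")
    # Hive DECIMAL keeps its (precision, scale); Hive STRING drops the length
    if hive == "DECIMAL" and params:
        hive += params.group(0)
    return f"{name} {hive}"
```

Applied line by line to a dumped DDL, this is the kind of transformation that removes the manual rewriting step from the migration.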
Implementation and Optimization of the Llama 2 Chat Model with Quantized LoRA
Tools & Libraries: Python, Transformers, PEFT, Bitsandbytes, Accelerate, TRL, Hugging Face Datasets
- Implemented and fine-tuned the Llama 2 Chat Model using Quantized LoRA (Low-Rank Adaptation) to reduce memory and computational requirements for resource-constrained environments.
- Loaded and tokenized conversational datasets using Hugging Face libraries, and applied model quantization along with LoRA-based fine-tuning to significantly reduce trainable parameters and memory footprint.
- Trained the model using the Trainer class with optimized hyperparameters, achieving low-latency response generation and a 30–50% reduction in resource usage without sacrificing output quality, making the model ideal for edge deployments.
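The trainable-parameter reduction LoRA delivers can be illustrated with simple counting: instead of updating a full d × k weight matrix, LoRA learns two low-rank factors of shapes d × r and r × k. A sketch of the arithmetic (illustrative only, not the fine-tuning code itself):

```python
def lora_trainable_params(d, k, r):
    """Parameter counts: full fine-tuning vs. LoRA with rank r."""
    full = d * k              # every weight in the d x k matrix is updated
    lora = d * r + r * k      # only the factors B (d x r) and A (r x k) train
    return full, lora

# e.g. a 4096 x 4096 projection (Llama 2 7B sized) at rank 16
full, lora = lora_trainable_params(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

At rank 16 the adapter trains well under 1% of the layer's weights, which, combined with 4-bit quantization of the frozen base model, is what makes fine-tuning feasible on constrained hardware.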
Tools: Machine Learning, Python, Scikit-learn, Streamlit, API Integration
- Built and optimized a Random Forest model using real-time API data and preprocessing techniques, achieving 84% accuracy in predicting customer travel insurance purchases.
- Performed hyperparameter tuning with GridSearchCV to enhance model performance and compared multiple algorithms for best results.
- Developed and deployed an interactive Streamlit web app on the community cloud, enabling users to get instant purchase likelihood predictions.
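The tune-and-compare workflow above can be sketched with scikit-learn; here synthetic data stands in for the real API-sourced travel-insurance features, and the parameter grid is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed API features
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Cross-validated grid search over a small illustrative parameter grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 5, 10]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```

The fitted `grid.best_estimator_` is what a Streamlit app would load (e.g. via `joblib`) to serve instant purchase-likelihood predictions from user input.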
- Deep Dive into Advanced Machine Learning & Deep Learning: Exploring complex models and techniques to push the boundaries of AI applications.
- Mastering Transformer Architecture & Natural Language Processing (NLP): Focusing on state-of-the-art techniques for handling and understanding language in machines.
- LinkedIn: Swaraj_Gupta
- Blog: Medium
I'm always open to collaboration, new ideas, and discussions. Feel free to reach out to me on any of the platforms above.
"Data is the new oil, but we must refine it." – Clive Humby