stemgene/README.md

Haoyuan Dong's Data Science Portfolio

Hello! I'm Haoyuan Dong, a passionate Data Scientist and Machine Learning Engineer with over 4 years of experience in the field. Welcome to my GitHub portfolio where I share my projects, achievements, and insights from my data science journey.

About Me

I specialize in using Python for computer vision, machine learning, and data science. With hands-on experience building computer vision solutions, I've worked on projects that combine technology and data to address complex challenges. I'm also a proficient software development engineer and have completed full-stack applications.

Portfolio Highlights

Here are a few notable projects that showcase my skills and achievements:

Large Language Model Applications

  • Image caption and object detection with LLM and LangChain. Implemented image analysis on top of the ChatGPT API and LangChain. Users can import an image and ask questions about it in natural language. The app returns two types of responses:

    • Describe the image
    • Detect the bounding box of objects in this image.
  • Generate-Code-By-Nature-Language-Using-LLMs. Fine-tuned the Llama-2-7b and Gemma models on Hugging Face using QLoRA, training them on the Code Alpaca dataset (120k instruction-following code examples) to translate natural-language input into code output.

As an example, if we prompt the model with this instruction:

Instruction: Create an array of length 5 which contains all even numbers between 1 and 10.

We want the model to produce exactly this response:

Response: array = [2, 4, 6, 8, 10]
  • YouTube video content analyzer. This app audits a YouTube channel and reports a summary and the main topics of its videos. Provide a single video link or a list of links. Once you select a video by clicking its thumbnail, you can view:

    • A summary of the video
    • The topics discussed in the video
    • Whether any sensitive topics are discussed in the video
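The instruction/response pair shown for the code-generation project can be sketched as a single prompt string. This is a minimal illustration only: the `build_prompt` helper and the exact template are assumptions made here, not necessarily the format used in the repository.

```python
def build_prompt(instruction, response=None):
    """Format an instruction (and optionally its target response) as one string.

    The "Instruction: ... / Response: ..." layout mirrors the example in this
    README; the precise template is an assumption for illustration.
    """
    prompt = f"Instruction: {instruction}\n\nResponse:"
    if response is not None:
        # During fine-tuning the target response is appended to the prompt.
        prompt += f" {response}"
    return prompt


instruction = (
    "Create an array of length 5 which contains all even numbers "
    "between 1 and 10."
)

# Training example: prompt plus the expected completion.
training_text = build_prompt(instruction, "array = [2, 4, 6, 8, 10]")

# Inference prompt: the model is expected to continue after "Response:".
inference_prompt = build_prompt(instruction)

# Sanity check that the target response really satisfies the instruction.
expected_array = [n for n in range(1, 11) if n % 2 == 0]
```

At inference time only the prompt ending in `Response:` would be passed to the model, and the generated continuation is compared against the target.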

Computer Vision Projects

  • Ovarian Cancer Subtype Classification and Outlier Detection:

  • Breast Cancer Detection with Vision Transformers: Trained a vision transformer model to identify breast cancer in mammograms, using YOLOX to extract regions of interest (ROIs). Employed cross-self-attention to improve detection accuracy. Ensembling the results gave a probabilistic F1 score of 0.56.

  • Brain Tumor Detection with VGG-16 and LRP: Implemented VGG-16 for brain tumor detection from MRI images, achieving 81.25% accuracy. Improved model explainability with the Layer-wise Relevance Propagation (LRP) algorithm, which highlights tumor contours.

  • Computer Vision Projects: In this repository, I collect my mini computer vision projects including classification and object detection tasks, as well as basic image preprocessing, e.g. image augmentation, image vectorization, data loader, etc.
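For readers unfamiliar with the probabilistic F1 score mentioned for the breast cancer project, here is a small sketch of one common definition, in which predicted probabilities replace hard 0/1 predictions when counting true and false positives. That this is the exact variant used in the project is an assumption; the function name `probabilistic_f1` is chosen here for illustration.

```python
def probabilistic_f1(labels, probs):
    """Probabilistic F1: precision/recall computed from predicted probabilities.

    labels: iterable of 0/1 ground-truth values
    probs:  iterable of predicted probabilities in [0, 1]
    Assumed definition (not confirmed by the project): probabilities are
    summed in place of hard true-positive / false-positive counts.
    """
    labels = list(labels)
    probs = list(probs)
    p_tp = sum(p for y, p in zip(labels, probs) if y == 1)  # soft true positives
    p_fp = sum(p for y, p in zip(labels, probs) if y == 0)  # soft false positives
    n_pos = sum(labels)
    if n_pos == 0 or (p_tp + p_fp) == 0:
        return 0.0
    precision = p_tp / (p_tp + p_fp)
    recall = p_tp / n_pos
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


perfect = probabilistic_f1([1, 0, 1, 0], [1.0, 0.0, 1.0, 0.0])
uncertain = probabilistic_f1([1, 0], [0.5, 0.5])
```

Because the metric rewards calibrated confidence, a model that outputs 0.5 everywhere scores well below one that is confidently correct.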

Machine Learning Projects

  • Customer Loyalty Prediction: Predicted customer loyalty scores from 30 million card transactions. Used NLP-style feature extraction (count vectorization, TF-IDF) and model stacking (LightGBM, XGBoost) to achieve an RMSE of 3.676.

  • Clinical Entity Classification: Categorized named entities from 4,963 clinical documents using word embeddings and RNNs, achieving an F1 score of 0.94.

  • Patient Care Prediction: Consulted for a medical center to predict post-surgery patient care decisions. Improved data integrity, trained a neural network, and achieved an F1 score of 0.93.

  • Stock Index Prediction with Time-Series and NLP: Developed a model that predicts the US stock index from social media influence, using an LSTM for the time series and NLP techniques (LDA, VADER), reaching R² = 0.79.
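The regression metrics quoted above (RMSE for loyalty prediction, R² for stock index prediction) can be computed as in the following sketch. The helper names and the toy numbers are illustrative assumptions and are not taken from the projects.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data, made up for illustration only.
y_true = [1.0, 2.0, 3.0]
mean_baseline = [2.0, 2.0, 2.0]  # always predicts the mean of y_true

baseline_rmse = rmse(y_true, mean_baseline)
baseline_r2 = r_squared(y_true, mean_baseline)  # a mean predictor has R^2 of 0
```

Lower RMSE is better and is expressed in the units of the target, while R² is unitless and measures the fraction of variance explained relative to a mean-only predictor.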

Explore the repository for each project to find detailed code, data, and insights!

Work Experience

Software Engineer Consultant (JavaScript, AWS) | PACT

  • Developed a full-stack project optimizing the creation of legal contracts.
  • Designed SQL schema, deployed databases on AWS RDS, and implemented front-end and back-end services using Node.js and RESTful APIs.

Machine Learning Engineer Consultant (Python, AWS) | Prophia

  • Extracted data from PDFs using computer vision deep learning models.
  • Optimized Faster R-CNN with ResNet-50 and CascadeTabNet to detect tables and extract data.
  • Productized the model on AWS Lambda, S3, and DynamoDB.

Deputy Director, Department of Important Enterprise Organizations | China Mobile

  • Negotiated with industrial companies, achieving $8.07 million in annual revenue.
  • Designed a Smart University Platform, generating $150,000 in annual revenue through collaborations.

Product Manager, Department of Data Service | China Mobile Group

  • Increased song downloads by 23% through personalized song recommendations.
  • Led a nationwide project attracting over 100,000 users in 9 months.

Education

  • M.S. in Data Science, University of Rochester
  • M.S. in Electronic Engineering, Harbin Engineering University
  • Data Science Fellowships, Insight Data Science, TripleTen

Future Project Plan

Currently, my core focus is applying cutting-edge deep learning models to image classification, object detection, and related tasks. I have used transformer-based approaches in completed projects such as breast cancer tumor detection. Given the continuing advances in large language models (LLMs) and their success in multimodal applications, my next step is to fine-tune LLMs for computer vision tasks.

Connect with me on LinkedIn and let's collaborate on exciting data science projects!

Popular repositories

  1. Engine-RPM-Profile-Detection (Public)

    Jupyter Notebook 4

  2. PythonDataScienceHandbook (Public)

    Forked from jakevdp/PythonDataScienceHandbook

    Python Data Science Handbook: full text in Jupyter Notebooks

    Jupyter Notebook 1

  3. Python-Diary (Public)

    Jupyter Notebook

  4. free-programming-books-zh_CN (Public)

    Forked from Alfred-Sun/free-programming-books-zh_CN

    Free Chinese-language books on computer programming; submissions welcome

  5. stemgene.github.io (Public)

    HTML

  6. StarterLearningPython (Public)

    Forked from qiwsir/StarterLearningPython

    Learning Python: from Beginner to Master. http://www.itdiffer.com

    Python