Skip to content

This repository containing code for preprocessing text data from PDF and DOCX files for use with GPT-3. It includes steps such as tokenization, removal of stop words and punctuation, and formatting for GPT-3 input.

Notifications You must be signed in to change notification settings

shamspias/gpt3-data-preprocessing

Repository files navigation

gpt3-data-preprocessing

This GitHub repository contains code for preprocessing text data from PDF and DOCX files for use with GPT-3. It includes steps such as tokenization, removal of stop words and punctuation, and formatting for GPT-3 input.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

You will need to have Python and pip installed on your machine. You will also need to install the following Python packages:

  • PyPDF2
  • python-docx
  • NLTK
  • Pandas

You can install these packages by running the following command: pip install pypdf2 python-docx nltk pandas

Usage

The code in this repository is meant to be used in a Jupyter notebook or any Python environment.

  • To preprocess data from a PDF file, use the pdf_preprocessing.ipynb notebook or the pdf_preprocessing.py script
  • To preprocess data from a DOCX file, use the docx_preprocessing.ipynb notebook or the docx_preprocessing.py script
  • Both notebooks contain detailed comments on how to use the code and how it works.

Contributing

If you have any suggestions for improvements or find any bugs, please feel free to submit a pull request or open an issue.

Acknowledgments

This code uses the PyPDF2 and python-docx libraries to read and extract text from PDF and DOCX files respectively, and the NLTK library for tokenization and stop word removal.

Note

It's important to note that the code provided here is a sample, you may want to adapt it to your specific use case and add more preprocessing steps as needed. Also, it's important to have a big enough dataset to achieve better results with GPT-3.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

This repository containing code for preprocessing text data from PDF and DOCX files for use with GPT-3. It includes steps such as tokenization, removal of stop words and punctuation, and formatting for GPT-3 input.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published