gpt3-data-preprocessing

This GitHub repository contains code for preprocessing text data from PDF and DOCX files for use with GPT-3. It includes steps such as tokenization, removal of stop words and punctuation, and formatting for GPT-3 input.
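The cleaning steps named above can be sketched roughly as follows. This is an illustration only, not the repository's code: it uses a regular expression and a tiny hand-written stop word set in place of NLTK so that it runs without any downloads, and the names `STOP_WORDS` and `preprocess` are hypothetical.

```python
import re
from typing import List

# Assumption: a tiny illustrative stop word set. NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "over"}


def preprocess(text: str) -> List[str]:
    """Lowercase the text, tokenize it, and drop punctuation and stop words."""
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # is discarded as part of tokenization.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The quick, brown fox jumps over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

With NLTK installed, the regular expression and the hand-written set would typically be replaced by `nltk.word_tokenize` and `nltk.corpus.stopwords.words("english")`.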

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

You need Python and pip installed on your machine. The code also depends on the following Python packages:

PyPDF2
python-docx
NLTK
Pandas

You can install these packages by running the following command:

    pip install pypdf2 python-docx nltk pandas
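A quick way to confirm that the packages are importable is sketched below. The helper name `missing_packages` and the `REQUIRED` mapping are hypothetical and not part of this repository. The mapping is needed because the pip name and the import name are not always the same: python-docx is imported as `docx`.

```python
from importlib.util import find_spec

# Maps the pip package name to the module name used in "import ...".
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "nltk": "nltk",
    "pandas": "pandas",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    print("Missing packages:", missing_packages() or "none")
```

NLTK's tokenizer and stop word list also rely on data that is downloaded separately, typically with `nltk.download("punkt")` and `nltk.download("stopwords")`. The exact resource names can vary between NLTK versions.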

Usage

The code in this repository is meant to be used in a Jupyter notebook or any Python environment.

To preprocess data from a PDF file, use the pdf_preprocessing.ipynb notebook or the pdf_preprocessing.py script.
To preprocess data from a DOCX file, use the docx_preprocessing.ipynb notebook or the docx_preprocessing.py script.
Both notebooks contain detailed comments on how to use the code and how it works.
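For orientation, the text extraction step that precedes preprocessing might look like the sketch below. It is not taken from the notebooks or scripts: the function names and the `EXTRACTORS` table are hypothetical, and it assumes a PyPDF2 release that provides `PdfReader` (older releases use `PdfFileReader` instead).

```python
from pathlib import Path


def extract_pdf_text(path):
    """Return the text of every page in a PDF, joined by newlines."""
    from PyPDF2 import PdfReader  # imported lazily so DOCX-only use needs no PyPDF2

    reader = PdfReader(str(path))
    # extract_text() may yield nothing for image-only pages, hence the "or ''".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_docx_text(path):
    """Return the text of every paragraph in a DOCX file, joined by newlines."""
    from docx import Document  # provided by the python-docx package

    return "\n".join(paragraph.text for paragraph in Document(str(path)).paragraphs)


# Hypothetical dispatch table keyed by file extension.
EXTRACTORS = {".pdf": extract_pdf_text, ".docx": extract_docx_text}


def extract_text(path):
    """Pick an extractor based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return EXTRACTORS[suffix](path)
```

Calling `extract_text("report.pdf")` or `extract_text("report.docx")` (placeholder file names) returns a single string that can then be tokenized and filtered.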

Contributing

If you have any suggestions for improvements or find any bugs, please feel free to submit a pull request or open an issue.

Acknowledgments

This code uses the PyPDF2 and python-docx libraries to read and extract text from PDF and DOCX files respectively, and the NLTK library for tokenization and stop word removal.

Note

The code provided here is a sample. You may want to adapt it to your specific use case and add more preprocessing steps as needed. A sufficiently large dataset also tends to give better results with GPT-3.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Repository: shamspias/gpt3-data-preprocessing
