# Generative AI Text Generation Model Documentation
**Team name:** [MECHAMINDS]  
**Date:** February 2025  
**Version:** 3.0  


## 1. Introduction
This project implements an **LSTM (Long Short-Term Memory)** network for text generation, trained on **Wikibooks, General Question Answering Datasets (GPT-like), and a Story Book Dataset**. The goal is to generate human-like text that is **indistinguishable from human-written content**.

The model was built for the **[HTS'25 CODEVERSE]**, where the objective was to develop a generative AI solution **without using external APIs**.


## 2. Installation & Setup

- Visual Studio Code (development environment).
- Python 3.11.0 (the version used for this project).
- PyTorch 2.15.0 library.
- Pandas library.
- Scikit-Learn library.
- Matplotlib library.



Clone the repository and install the dependencies:
- `!git clone [repository_link]`
- `!cd project_directory`
- `!pip install -r requirements.txt`


Verify that PyTorch can see the GPU:
- `import torch`
- `print("CUDA Available:", torch.cuda.is_available())`

## 3. Model Architecture
This model is based on an **LSTM (Long Short-Term Memory)** network. It was trained on **Wikibooks, General Question Answering Datasets (GPT-like), and a Story Book Dataset** to produce more natural text.

### Features:
✅ Uses an **LSTM (Long Short-Term Memory)** network for deep learning  
✅ Implements **PyTorch** for training  
✅ Supports **English text generation**  


## 4. Training Process
The model was trained using the following steps:
### 1. Download and Explore Datasets.
- Wikibooks / E.A. Poe's Corpus of Short Stories dataset. (Kaggle)
- English Dialogues dataset. (From Web)
- Basic English Text dataset. (Kaggle)

### 2. Prepare the data for training using NLP and TorchText.
- Basic pre-processing using regex (regular expressions).
- Implementing a tokenizer to split the text into tokens.
- Fitting the tokenizer to the text.
- Padding the sequences so that all inputs have the same length.
- Building the vocabulary (an important step).
- Indexing the vocabulary.
- Building the input and output sequences and converting them into PyTorch tensors.
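
The preparation steps above can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code: the function names, the `<pad>`/`<unk>` tokens, the left-padding scheme, and the toy corpus are all assumptions made for the example.

```python
import re

def clean_text(text):
    """Lowercase the text and replace anything but letters, digits, apostrophes and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Simple whitespace tokenizer applied to the cleaned text (assumed scheme)."""
    return clean_text(text).split()

def build_vocab(sentences):
    """Map each word to an integer index; 0 is reserved for padding, 1 for unknown words."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def make_sequences(sentences, vocab):
    """Build left-padded prefixes; the last index of each prefix becomes the target word."""
    prefixes = []
    for sentence in sentences:
        ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(sentence)]
        for end in range(2, len(ids) + 1):
            prefixes.append(ids[:end])
    max_len = max(len(p) for p in prefixes)
    padded = [[vocab["<pad>"]] * (max_len - len(p)) + p for p in prefixes]
    inputs = [p[:-1] for p in padded]   # everything except the last word
    targets = [p[-1] for p in padded]   # the word the model should predict
    return inputs, targets

# Hypothetical two-sentence corpus standing in for the real datasets.
corpus = ["The raven sat.", "The raven spoke again!"]
vocab = build_vocab(corpus)
inputs, targets = make_sequences(corpus, vocab)
print(len(vocab), inputs, targets)
```

The resulting `inputs` and `targets` lists are rectangular, so they could be passed directly to `torch.tensor(...)` with `dtype=torch.long` for the next step.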

### 3. TensorDataset and DataLoader.
- Creation of a PyTorch Dataset and a PyTorch DataLoader to feed the neural network.
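
As a rough sketch of this step, index tensors can be wrapped in a `TensorDataset` and batched with a `DataLoader`. The tensor values and the batch size below are illustrative assumptions, not the project's real data or settings.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy index sequences standing in for the real preprocessed data (assumed values).
inputs = torch.tensor(
    [[0, 0, 2], [0, 2, 3], [0, 0, 2], [0, 2, 3], [2, 3, 5]], dtype=torch.long
)
targets = torch.tensor([3, 4, 3, 5, 6], dtype=torch.long)

# Pair each input sequence with its target word index.
dataset = TensorDataset(inputs, targets)

# Shuffle and batch the pairs; batch_size=2 is an arbitrary choice for the example.
loader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch_inputs, batch_targets in loader:
    print(batch_inputs.shape, batch_targets.shape)
```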

### 4. Model Creation (LSTM) using Custom Class.
- Steps:
1. Batches from the DataLoader are fed to the embedding layer (100 dimensions), which turns each word index into a vector.
2. The outputs of the embedding layer are fed to the LSTM layers (2 layers).
3. The outputs of the LSTM layers are fed to a Linear layer that produces the final outputs.
4. Implementing the forward pass.
5. Instantiation of the class with vocabulary size.
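
A custom class following the steps above might look like the sketch below. The 100-dimensional embedding and the 2 LSTM layers come from the description; the class name `TextLSTM`, the hidden size of 150, `padding_idx=0`, and the choice to predict from the last time step are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TextLSTM(nn.Module):
    """Embedding -> 2-layer LSTM -> Linear, predicting the next word index."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=150, num_layers=2):
        super().__init__()
        # Index 0 is assumed to be the padding token.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)        # (batch, seq_len, embed_dim)
        output, _ = self.lstm(embedded)     # (batch, seq_len, hidden_dim)
        return self.fc(output[:, -1, :])    # (batch, vocab_size) from the last time step

# Instantiate with a hypothetical vocabulary size and run one forward pass.
model = TextLSTM(vocab_size=50)
logits = model(torch.randint(0, 50, (4, 7)))
print(logits.shape)
```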

### 5. Loss Function and Optimization.
- Loss function : Cross Entropy Loss.
- Optimizer : Adam.
### 6. Training and Evaluation of the model.
1. Train the model epoch-by-epoch.
2. Backpropagation for decreasing loss.
3. Calculating the average loss.
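
The training steps above, together with the Cross Entropy loss and Adam optimizer, can be sketched as follows. The tiny model, toy data, learning rate, and epoch count are assumptions chosen to keep the example small; they are not the project's actual settings.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class TinyLSTM(nn.Module):
    """Deliberately small stand-in for the real model (assumed sizes)."""
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 16)
        self.lstm = nn.LSTM(16, 32, num_layers=2, batch_first=True)
        self.fc = nn.Linear(32, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return self.fc(out[:, -1, :])

# Toy data: each row of word indices predicts the index that follows it.
inputs = torch.tensor([[0, 0, 2], [0, 2, 3], [2, 3, 4], [3, 4, 5]])
targets = torch.tensor([3, 4, 5, 6])
loader = DataLoader(TensorDataset(inputs, targets), batch_size=2, shuffle=True)

model = TinyLSTM(vocab_size=7).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(30):
    model.train()
    total_loss = 0.0
    for batch_inputs, batch_targets in loader:
        batch_inputs = batch_inputs.to(device)
        batch_targets = batch_targets.to(device)
        optimizer.zero_grad()                    # clear gradients from the previous step
        loss = criterion(model(batch_inputs), batch_targets)
        loss.backward()                          # backpropagation
        optimizer.step()                         # update the weights
        total_loss += loss.item()
    epoch_losses.append(total_loss / len(loader))  # average loss for this epoch

print(f"first epoch: {epoch_losses[0]:.3f}, last epoch: {epoch_losses[-1]:.3f}")
```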

### 7. Generating Text.
- Generate text from a user prompt using the text-generation function.
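
One possible shape for such a generation function is sketched below. It uses greedy decoding (always picking the most likely next word), which is an assumption; the project may sample differently. The model here is small and untrained, and the vocabulary is hypothetical, so the output is meaningless and only demonstrates the loop.

```python
import torch
import torch.nn as nn

class TinyLSTM(nn.Module):
    """Small untrained stand-in for the trained model (assumed sizes)."""
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 16)
        self.lstm = nn.LSTM(16, 32, num_layers=2, batch_first=True)
        self.fc = nn.Linear(32, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return self.fc(out[:, -1, :])

def generate_text(model, prompt, vocab, num_words=5):
    """Repeatedly predict the most likely next word and append it to the prompt."""
    index_to_word = {index: word for word, index in vocab.items()}
    words = prompt.lower().split()
    model.eval()
    with torch.no_grad():
        for _ in range(num_words):
            ids = [vocab.get(word, vocab["<unk>"]) for word in words]
            logits = model(torch.tensor([ids]))          # shape (1, vocab_size)
            next_id = logits.argmax(dim=-1).item()       # greedy choice
            words.append(index_to_word[next_id])
    return " ".join(words)

# Hypothetical vocabulary; a real one would come from the preprocessing step.
vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "raven": 3, "sat": 4, "spoke": 5, "again": 6}
model = TinyLSTM(len(vocab))
text = generate_text(model, "The raven", vocab, num_words=3)
print(text)
```

Greedy decoding is deterministic and tends to repeat itself; sampling from the softmax distribution is a common alternative when more varied output is wanted.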

## 5. Model Inference & Usage
- Once trained, the model can generate human-like text given a prompt.

### Usage and Future Scope
- Can be used for content creation, language translation, and marketing.
- Can be used for code generation.
- Can be used as a college chatbot on a college website to answer FAQs and student queries from supplied data.
- Can be used in a college virtual library to answer questions about the contents of books.
- Can be used to summarize a given text input, and more.


## 6. Limitations & Future Improvements
### Limitations:
🚫 The model sometimes generates **inconsistent** text.  
🚫 Training requires a **high-end GPU** and large datasets, which makes it computationally expensive.

### Future Enhancements:
✅ Fine-tuning on **domain-specific datasets**.  
✅ Implementing **reinforcement learning for better text coherence**.  


## 7. Conclusion
This project successfully trains an **LSTM (Long Short-Term Memory)** network capable of producing human-like text. Future work includes improving model coherence and fine-tuning on custom datasets.

## 8. References

- Aurelien Geron. (2022). *Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow*. [https://oreilly.com/catalog/errata.csp?isbn=9781492032649]
- Kaggle. (2025). OpenWebText Dataset. [https://www.kaggle.com/datasets]


