<a href="https://colab.research.google.com/github/sushantchandelog/Projects/blob/main/Philosophy_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers
!pip install datasets



In [2]:
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    TextDataset,
    pipeline,
)
from google.colab import drive
import os
import time

In [3]:
# Mount Google Drive
drive.mount('/content/drive', force_remount=True)  # force_remount refreshes the connection so we work on the most current files

# Define the paths
folder_path = "/content/drive/MyDrive/cleaned_data"
COMBINED_FILE_PATH = f"{folder_path}/combined_plato.txt"
OUTPUT_DIR = f"{folder_path}/PhilosophyModel"
MODEL_NAME = "gpt2"

print("data folder", folder_path)
print("Combined file will be", COMBINED_FILE_PATH)
print("model will be saved to", OUTPUT_DIR)

Mounted at /content/drive
data folder /content/drive/MyDrive/cleaned_data
Combined file will be /content/drive/MyDrive/cleaned_data/combined_plato.txt
model will be saved to /content/drive/MyDrive/cleaned_data/PhilosophyModel


In [7]:
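# Combine all the .txt files in the data folder
all_files = os.listdir(folder_path)
# Note: this filter also matches TextDataset's cache file (cached_lm_*.txt), which is not a source text
txt_files = [f for f in all_files if f.endswith('.txt') and f != "combined_plato.txt"]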

print(len(txt_files), "files to combine:", txt_files)

9 files to combine: ['middle_symposium_cleaned.txt', 'late_timaeus_cleaned.txt', 'middle_phaedo_cleaned.txt', 'early_euthyphro_cleaned.txt', 'late_laws_cleaned.txt', 'early_crito_cleaned.txt', 'early_apology_cleaned.txt', 'middle_republic_cleaned.txt', 'cached_lm_GPT2Tokenizer_128_combined_plato.txt']
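Note that the ninth entry in the listing above, `cached_lm_GPT2Tokenizer_128_combined_plato.txt`, appears to be a cache file written by `TextDataset` on an earlier run rather than a ninth dialogue, so its contents end up in the combined text. A minimal sketch of a stricter filter follows, assuming every source file uses the `_cleaned.txt` suffix seen in the listing; `select_source_files` is a hypothetical helper name, not part of the notebook.

```python
def select_source_files(file_names, suffix="_cleaned.txt"):
    """Return only the names ending in `suffix`, sorted for a stable order."""
    return sorted(name for name in file_names if name.endswith(suffix))


# A shortened listing, taken from the file names printed above
listing = [
    "middle_symposium_cleaned.txt",
    "early_crito_cleaned.txt",
    "combined_plato.txt",
    "cached_lm_GPT2Tokenizer_128_combined_plato.txt",
]
print(select_source_files(listing))
```

In the notebook this would replace the list comprehension, for example `txt_files = select_source_files(os.listdir(folder_path))`.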


In [9]:
all_text = ""
for file_name in txt_files:
    file_path = os.path.join(folder_path, file_name)
    with open(file_path, 'r', encoding='latin-1') as f:
        all_text += f.read()

    all_text += "\n\n" # Add separation between books


# Write the combined text to the new file
with open(COMBINED_FILE_PATH, 'w', encoding='utf-8') as f:
    f.write(all_text)

print("Successfully combined all files into", COMBINED_FILE_PATH)

Successfully combined all files into /content/drive/MyDrive/cleaned_data/combined_plato.txt


In [10]:
#loading tokenizer and base model
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_NAME)
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
print("tokenizer model loaded")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer model loaded


In [11]:
# Load the combined dataset
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=COMBINED_FILE_PATH,
    block_size=128  # number of tokens per training example
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # causal (next-token) language modeling, not masked language modeling
)
print("dataset is prepared",len(train_dataset),"text blocks")

dataset is prepared 6061 text blocks
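With `block_size=128`, the 6061 blocks reported above correspond to roughly 6061 × 128 = 775,808 tokens of training text. The sketch below illustrates that kind of chunking in plain Python. It assumes consecutive, non-overlapping blocks with any short remainder dropped; `split_into_blocks` is a hypothetical helper for illustration, not the library's implementation.

```python
def split_into_blocks(token_ids, block_size=128):
    """Cut a flat list of token ids into consecutive, non-overlapping blocks.

    A trailing remainder shorter than block_size is dropped.
    """
    last_start = len(token_ids) - block_size
    return [token_ids[i:i + block_size] for i in range(0, last_start + 1, block_size)]


# Toy input: 300 fake token ids give two full blocks; the last 44 ids are dropped
toy_ids = list(range(300))
blocks = split_into_blocks(toy_ids, block_size=128)
print(len(blocks), len(blocks[0]))
```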




In [12]:
#setting up the trainer
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=3,              # 3 passes over the data is a good start
    per_device_train_batch_size=4,   # Batch size for T4 GPU
    save_steps=1000,
    save_total_limit=2,
    prediction_loss_only=True,
    report_to="none"
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset
)
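These settings determine how long training runs and which checkpoints survive. The arithmetic below is a rough sketch, assuming a single GPU, no gradient accumulation, and the 6061 blocks reported earlier; `total_training_steps` is a hypothetical helper name.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps when the last, smaller batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


steps = total_training_steps(num_examples=6061, batch_size=4, epochs=3)

# save_steps=1000 writes a checkpoint every 1000 steps;
# save_total_limit=2 keeps only the two most recent of them.
checkpoints = list(range(1000, steps + 1, 1000))
kept = checkpoints[-2:]
print(steps, checkpoints, kept)
```

Under these assumptions the run takes 4548 steps, and only the checkpoints from steps 3000 and 4000 remain in the output directory alongside the final model.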

In [13]:
# Start the training
start_time = time.time()
trainer.train()
end_time = time.time()

# Save the final model to OUTPUT_DIR (the tokenizer is saved separately in the next cell)
trainer.save_model()

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step    Training Loss
 500    3.5701
1000    3.4293
1500    3.3615
2000    3.1769
2500    3.1419
3000    3.1417
3500    3.0232
4000    3.0142
4500    3.0175
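The logged values are mean cross-entropy losses per token, so they can be read as perplexities by exponentiating them. A small sketch using the numbers from the table above follows; `perplexity` is a hypothetical helper, and these are training-set figures, not a held-out evaluation.

```python
import math


def perplexity(loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


# Step -> training loss, copied from the log above
logged_losses = {
    500: 3.5701, 1000: 3.4293, 1500: 3.3615, 2000: 3.1769, 2500: 3.1419,
    3000: 3.1417, 3500: 3.0232, 4000: 3.0142, 4500: 3.0175,
}
for step, loss in logged_losses.items():
    print(f"step {step:>4}: loss {loss:.4f} -> perplexity {perplexity(loss):.1f}")
```

Training perplexity falls from the mid-30s at step 500 to about 20 by the end, with most of the improvement in the first 3500 steps.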


AttributeError: GPT2Tokenizer has no attribute savepretrained

In [14]:
OUTPUT_DIR  = "/content/drive/MyDrive/cleaned_data/PhilosophyModel"

tokenizer.save_pretrained(OUTPUT_DIR)

('/content/drive/MyDrive/cleaned_data/PhilosophyModel/tokenizer_config.json',
 '/content/drive/MyDrive/cleaned_data/PhilosophyModel/special_tokens_map.json',
 '/content/drive/MyDrive/cleaned_data/PhilosophyModel/vocab.json',
 '/content/drive/MyDrive/cleaned_data/PhilosophyModel/merges.txt',
 '/content/drive/MyDrive/cleaned_data/PhilosophyModel/added_tokens.json')

In [19]:
# Test the fine-tuned model loaded back from Drive
model_from_drive = GPT2LMHeadModel.from_pretrained(OUTPUT_DIR)
tokenizer_from_drive = GPT2Tokenizer.from_pretrained(OUTPUT_DIR)
plato_generator = pipeline(
    'text-generation',
    model=model_from_drive,
    tokenizer=tokenizer_from_drive
)


Device set to use cuda:0


In [20]:
prompt = input("Enter your prompt for Plato: ")
print(f"Generating text for prompt: '{prompt}'")
generated_text = plato_generator(
    prompt,
    max_length=150,  # overridden by the pipeline's default max_new_tokens (256); see the warning in the output
    num_return_sequences=1,
    pad_token_id=tokenizer_from_drive.eos_token_id
)

print("\n--- MODEL'S OUTPUT ---")
print(generated_text[0]['generated_text'])
print("---------------------------------")

Enter your prompt for Plato: plato theory of forms


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generating text for prompt: 'plato theory of forms'

--- MODEL'S OUTPUT ---
plato theory of forms and forms of knowledge, is not to be found in the laws, but in the philosophy of plato himself, as far as we can tell. the laws are a kind of pre-meditation about the relations of mind to the world, and are made up of three sections—the idea of good, the idea of justice and morality, and the idea of good as the highest principle of all—of the two which the early greek philosophers sought to determine between the ideas of justice and the other two, and which they would have called the natural and rational. in the first place, there is the idea of good, which is the first principle of all, and is the principle of good when compared with the other two; and in the second place, there is the idea of justice, which is the second principle of all, and is the principle of justice when compared with the other two; and in the third place, there is the idea of good; and in the fourth place, there is 
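The warning in the output says that `max_new_tokens` took precedence over `max_length`, and the sample drifts into repetition ("in the first place ... in the second place ..."). One possible adjustment is sketched below: set `max_new_tokens` directly and add sampling plus an n-gram repetition limit. `build_generation_kwargs` is a hypothetical helper, and the values for `top_p`, `temperature` and `no_repeat_ngram_size` are illustrative assumptions, not tuned settings.

```python
def build_generation_kwargs(eos_token_id, max_new_tokens=150, sample=True):
    """Collect keyword arguments to pass to a text-generation pipeline call."""
    kwargs = {
        "max_new_tokens": max_new_tokens,  # bounds the generated part only
        "num_return_sequences": 1,
        "pad_token_id": eos_token_id,      # GPT-2 has no pad token of its own
        "no_repeat_ngram_size": 3,         # assumption: forbid repeated 3-grams
    }
    if sample:
        # Assumed, untuned sampling settings
        kwargs.update({"do_sample": True, "top_p": 0.9, "temperature": 0.8})
    return kwargs


# 50256 is GPT-2's end-of-text token id
kwargs = build_generation_kwargs(eos_token_id=50256)
print(kwargs)
```

In the notebook this could be used as `plato_generator(prompt, **build_generation_kwargs(tokenizer_from_drive.eos_token_id))`.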