LIBS

In [2]:
#installing libs
%pip install praw pandas vaderSentiment google-api-python-client flask datasets scikit-learn tensorflow tf-keras transformers==4.46.2 torch==2.3.1+cu121 torchvision==0.18.1+cu121 -f https://download.pytorch.org/whl/cu121/torch_stable.html accelerate google-generativeai pyngrok

Looking in links: https://download.pytorch.org/whl/cu121/torch_stable.html
Note: you may need to restart the kernel to use updated packages.




In [3]:
#importing required libs
import pandas as pd
import praw
from praw.models import MoreComments
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import time
from datetime import datetime, timezone
import googleapiclient.discovery
from googleapiclient.errors import HttpError
from werkzeug.serving import run_simple
import threading
import torch
from datasets import load_dataset, Dataset
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments, EvalPrediction, GPT2LMHeadModel, GPT2Tokenizer
from flask import Flask, render_template_string, request, session
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm import tqdm
from torch.cuda.amp import autocast, GradScaler
import google.generativeai as genai
from pyngrok import ngrok




STAGE 1

Scraping Reddit for Dataset to Train on Sentiments

In [4]:
#reddit creds
reddit = praw.Reddit(
    user_agent="Comment Extraction (by /u/lestergreeks)",
    client_id="YiK4KHJXxneFv0IiV8aOhg", 
    client_secret="8GfGLUx8E62B3sBUwJcie43RDBQm7A"
)

Version 7.7.1 of praw is outdated. Version 7.8.1 was released Friday October 25, 2024.


In [5]:
#extracting all comments from top10 posts of a subreddit
subreddit = reddit.subreddit('technews')

all_comments = pd.DataFrame()
    

In [6]:
def extract_comments(submission):
    posts = []
    submission.comments.replace_more(limit=None)
    count = 0
    for top_level_comment in submission.comments.list():
        if isinstance(top_level_comment, MoreComments):
            continue
        if (top_level_comment.author and "bot" not in top_level_comment.author.name.lower() 
            and not top_level_comment.stickied):
            posts.append(top_level_comment.body.encode('utf-8', 'ignore').decode('utf-8'))
    
    return posts


In [7]:
def process_posts():
    global all_comments
    attempt = 0
    for submission in subreddit.rising(limit=15):
        try:
            print(f"Processing post: {submission.title}")
            posts = extract_comments(submission)
            posts_df = pd.DataFrame(posts, columns=["body"])
            indexNames = posts_df[(posts_df.body == '[removed]') | (posts_df.body == '[deleted]')].index
            posts_df.drop(indexNames, inplace=True)
            posts_df['post_title'] = submission.title
            posts_df['post_time'] = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')

            all_comments = pd.concat([all_comments, posts_df], ignore_index=True)
        
        except praw.exceptions.RedditAPIException as e:
            if "RATELIMIT" in str(e):
                attempt += 1
                sleep_time = 120 * attempt
                print(f"Rate limit hit. Sleeping for {sleep_time} seconds. Error: {e}")
                time.sleep(sleep_time) 
                continue
            else:
                print(f"Encountered an error: {e}")
                continue
    
        time.sleep(60)


In [8]:
process_posts()

Processing post: Cybertruck's Many Recalls Make It Worse Than 91 Percent of All 2024 Vehicles
Processing post: Researchers jailbreak AI robots to run over pedestrians, place bombs for maximum damage, and covertly spy
Processing post: Valve first came up with the Steam Hardware Survey more than 20 years ago because it wanted to know what specs it should target for Half-Life 2
Processing post: Threads' latest test will finally let you make the ‘following’ feed the default
Processing post: Bluesky breaching rules around disclosure of information, says EU
Processing post: The Future of Online Privacy Hinges on Thousands of New Jersey Cops
Processing post: Most Gen Zers are terrified of AI taking their jobs. Their bosses consider themselves immune
Processing post: Sony’s making a handheld console to compete with Nintendo and Microsoft | The portable console could natively play PS5 games without an active Wi-Fi connection.
Processing post: Anthropic proposes a new way to connect data to AI c

In [9]:
all_comments

Unnamed: 0,body,post_title,post_time
0,I wonder which cars are in the remaining 9%?,Cybertruck's Many Recalls Make It Worse Than 9...,2024-11-25 13:06:48
1,"Since launch, Tesla's polarizing electric pick...",Cybertruck's Many Recalls Make It Worse Than 9...,2024-11-25 13:06:48
2,which 9% of cars make up a list of vehicles WO...,Cybertruck's Many Recalls Make It Worse Than 9...,2024-11-25 13:06:48
3,"It was never purchased for utility, so to the ...",Cybertruck's Many Recalls Make It Worse Than 9...,2024-11-25 13:06:48
4,DOGE will take care of that. Everyday consumer...,Cybertruck's Many Recalls Make It Worse Than 9...,2024-11-25 13:06:48
...,...,...,...
612,I think we’re likely to be survived only by ou...,Droidspeak: AI models work together faster whe...,2024-11-24 12:15:17
613,> Security researchers have disrupted a major ...,Dangerous global botnet fueling residential pr...,2024-11-24 03:46:44
614,I work in security and really don’t like the n...,Dangerous global botnet fueling residential pr...,2024-11-24 03:46:44
615,"ask crowdstrike, most groups follow their nami...",Dangerous global botnet fueling residential pr...,2024-11-24 03:46:44


In [10]:
all_comments.to_csv('red_data_2.csv', encoding='utf-8', index=False)

Labelling Data

In [11]:
# Load the dataset
file_path = 'red_data_2.csv'
data = pd.read_csv(file_path)

# Check the first few rows to understand the structure
# data.head()

In [13]:
# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Function to classify sentiment using VADER
def classify_sentiment_vader(text):
    sentiment_score = analyzer.polarity_scores(str(text))
    # Classify as negative, neutral, or positive based on compound score
    if sentiment_score['compound'] >= 0.05:
        return 2  # Positive
    elif sentiment_score['compound'] <= -0.05:
        return 0  # Negative
    else:
        return 1  # Neutral

# Apply the VADER sentiment classification to the 'body' column
data['score'] = data['body'].apply(classify_sentiment_vader)

# Check the results
data[['body', 'score']].head()


Unnamed: 0,body,score
0,I wonder which cars are in the remaining 9%?,1
1,"Since launch, Tesla's polarizing electric pick...",0
2,which 9% of cars make up a list of vehicles WO...,0
3,"It was never purchased for utility, so to the ...",1
4,DOGE will take care of that. Everyday consumer...,2


In [14]:
data.to_csv('red_data_3.csv')

STAGE 2

Performing Sentimental Analysis on YT Data

In [4]:
# Set your API key here
API_KEY = "AIzaSyCKz6fxf2IvH8z8LIp5mpm76LB9u9fxHUU"

Youtube Comments Extraction using YT.v3 API

In [6]:
# Function to get comments for a given video ID
def get_comments(video_id, max_comments=100, order="relevance"):
    youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=API_KEY)
    
    comments = []
    request = youtube.commentThreads().list(
        part="snippet",
        videoId=video_id,
        maxResults=max_comments,
        textFormat="plainText",
        order=order
    )
    
    while request and len(comments) < max_comments:
        try:
            response = request.execute()
        
            # Loop through each comment in the response
            for item in response.get("items", []):
                full_comment = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
                comments.append(full_comment)  # Store the full comment
                
                # Stop if we have reached the max comments
                if len(comments) >= max_comments:
                    break
        
            # Check for more comments (pagination)
            request = youtube.commentThreads().list_next(request, response)
        except HttpError as e:
            if "commentsDisabled" in str(e):
                print(f"Comments are disabled for Video ID: {video_id}. Skipping this video.")
                break
            else:
                print(f"An unexpected error occurred with Video ID: {video_id}. Error: {e}")
                break

    return comments[:max_comments]  # Return up to max_comments

In [8]:
def get_top_videos(query, max_videos=10):
    youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=API_KEY)
    
    # Search for videos based on the query
    request = youtube.search().list(
        q=query,
        part="snippet",
        order="viewCount",
        maxResults=max_videos,
        type="video"
    )
    
    response = request.execute()
    
    video_data = []  # List to hold video data and comments

    for item in response.get("items", []):
        video_id = item["id"]["videoId"]
        video_title = item["snippet"]["title"]  # Get video title
        print(f"Fetching comments for Video ID: {video_id}") 
        
        # Get comments and handle errors (skipping videos with disabled comments)
        comments = get_comments(video_id)
        
        if comments:  # If comments were successfully fetched
            for comment in comments:
                video_data.append({"video_id": video_id, "video_title": video_title, "comment": comment})
    
    return video_data

USER INPUT HERE

In [20]:
# Example usage
query = "tesla we robot"  # Your search query
top_videos_comments = get_top_videos(query, max_videos=10)  # Fetch top 10 videos

# Convert the results into a DataFrame for better readability
df = pd.DataFrame(top_videos_comments)
df.head()

Fetching comments for Video ID: 2cukB4_hDCI
Fetching comments for Video ID: DxREm3s1scA
Fetching comments for Video ID: cpraXaw7dyc
Fetching comments for Video ID: 2lLZ9AWhcNo
Fetching comments for Video ID: XiQkeWOFwmk
Fetching comments for Video ID: fgm5uZaS3-E
Fetching comments for Video ID: DB1027Bfpmo
Fetching comments for Video ID: Mu-eK72ioDk
Fetching comments for Video ID: nAgTgwak7ME
Fetching comments for Video ID: 8vsTNFUFJEU


Unnamed: 0,video_id,video_title,comment
0,2cukB4_hDCI,Robots testing the Bulletproof #cybertruck,Robots with guns does not give me a warm and f...
1,2cukB4_hDCI,Robots testing the Bulletproof #cybertruck,"""Yee haw"" killed me💀💀"
2,2cukB4_hDCI,Robots testing the Bulletproof #cybertruck,"Dude : ""Now give me back my gun""\nRobot : ""giv..."
3,2cukB4_hDCI,Robots testing the Bulletproof #cybertruck,Guy: Nice job! I’ll take my gun back.\nRobot: ...
4,2cukB4_hDCI,Robots testing the Bulletproof #cybertruck,"*10 seconds later*\nRobot:""Now we have to chex..."


In [25]:
df

Unnamed: 0,video_id,video_title,comment
0,2cukB4_hDCI,Robots testing the Bulletproof #cybertruck,Robots with guns does not give me a warm and f...
1,2cukB4_hDCI,Robots testing the Bulletproof #cybertruck,"""Yee haw"" killed me💀💀"
2,2cukB4_hDCI,Robots testing the Bulletproof #cybertruck,"Dude : ""Now give me back my gun""\nRobot : ""giv..."
3,2cukB4_hDCI,Robots testing the Bulletproof #cybertruck,Guy: Nice job! I’ll take my gun back.\nRobot: ...
4,2cukB4_hDCI,Robots testing the Bulletproof #cybertruck,"*10 seconds later*\nRobot:""Now we have to chex..."
...,...,...,...
995,8vsTNFUFJEU,Tesla Optimus Bot FOLDS the Laundry !,"“Hey Jerry, you got that jack rabbit chip?”\nJ..."
996,8vsTNFUFJEU,Tesla Optimus Bot FOLDS the Laundry !,When you see a Terminator and it asks for your...
997,8vsTNFUFJEU,Tesla Optimus Bot FOLDS the Laundry !,That shirt gone be wrinkled as hell 😂
998,8vsTNFUFJEU,Tesla Optimus Bot FOLDS the Laundry !,Such smooth movements. No arthritis


In [None]:
df.to_csv('red_data_4.csv')

In [9]:
def get_search_results_yt(query):
    top_videos_comments = get_top_videos(query, max_videos=10)  # Fetch top 10 videos

    # Convert the results into a DataFrame for better readability
    df = pd.DataFrame(top_videos_comments)
    df.to_csv('red_data_4.csv')

STAGE 3

Sentiment Model begins here...

# **Model v3.0**

In [10]:
df = pd.read_csv('red_data_3.csv')

In [11]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
model.to(device)

Using device: cuda


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

In [13]:
# Convert data to Dataset format compatible with Hugging Face Trainer
def preprocess_data(df):
    return {'text': df['body'], 'label': df['score']}

In [14]:
train_size = int(0.8 * len(df))
train_data = df[:train_size]
test_data = df[train_size:]

# Apply preprocessing and convert the data into the proper format
preprocessed_train = df[:train_size].apply(preprocess_data, axis=1).to_list()
preprocessed_test = df[train_size:].apply(preprocess_data, axis=1).to_list()
# Convert preprocessed data into Dataset objects
train_dataset = Dataset.from_list(preprocessed_train)
test_dataset = Dataset.from_list(preprocessed_test)

In [15]:
# Tokenize the datasets
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=512)

In [16]:
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)
# Set format to PyTorch tensors for Trainer
tokenized_train.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

Map:   0%|          | 0/392 [00:00<?, ? examples/s]

Map:   0%|          | 0/98 [00:00<?, ? examples/s]

In [17]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',               # output directory
    evaluation_strategy="epoch",          # evaluation strategy (evaluate at the end of each epoch)
    learning_rate=2e-5,                   # learning rate
    per_device_train_batch_size=16,       # batch size for training
    per_device_eval_batch_size=16,        # batch size for evaluation
    num_train_epochs=3,                   # number of training epochs
    weight_decay=0.01,                    # strength of weight decay
    logging_dir='./logs',                 # directory for storing logs
    logging_steps=10,                     # log every 10 steps
    save_strategy="epoch"                 # save model at the end of every epoch
)



In [18]:
def compute_metrics(pred: EvalPrediction):
    predictions, labels = pred.predictions, pred.label_ids
    predictions = np.argmax(predictions, axis=1)
    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='weighted')
    return {
        'accuracy': acc,
        'f1_score': f1,
    }


In [19]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics
)

In [20]:
# Fine-tune the model
trainer.train()

  0%|          | 0/75 [00:00<?, ?it/s]

  attn_output = torch.nn.functional.scaled_dot_product_attention(


{'loss': 1.1097, 'grad_norm': 3.8403055667877197, 'learning_rate': 1.7333333333333336e-05, 'epoch': 0.4}
{'loss': 1.0669, 'grad_norm': 4.20644998550415, 'learning_rate': 1.4666666666666666e-05, 'epoch': 0.8}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.0226513147354126, 'eval_accuracy': 0.4897959183673469, 'eval_f1_score': 0.43321247863974693, 'eval_runtime': 0.8654, 'eval_samples_per_second': 113.237, 'eval_steps_per_second': 8.088, 'epoch': 1.0}
{'loss': 1.0427, 'grad_norm': 3.9822816848754883, 'learning_rate': 1.2e-05, 'epoch': 1.2}
{'loss': 0.9613, 'grad_norm': 5.70676851272583, 'learning_rate': 9.333333333333334e-06, 'epoch': 1.6}
{'loss': 0.9851, 'grad_norm': 6.892285346984863, 'learning_rate': 6.666666666666667e-06, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.0184816122055054, 'eval_accuracy': 0.45918367346938777, 'eval_f1_score': 0.366649502343256, 'eval_runtime': 0.839, 'eval_samples_per_second': 116.81, 'eval_steps_per_second': 8.344, 'epoch': 2.0}
{'loss': 0.8918, 'grad_norm': 9.054837226867676, 'learning_rate': 4.000000000000001e-06, 'epoch': 2.4}
{'loss': 0.8894, 'grad_norm': 8.25114631652832, 'learning_rate': 1.3333333333333334e-06, 'epoch': 2.8}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.0140188932418823, 'eval_accuracy': 0.46938775510204084, 'eval_f1_score': 0.38917230433265415, 'eval_runtime': 0.8339, 'eval_samples_per_second': 117.517, 'eval_steps_per_second': 8.394, 'epoch': 3.0}
{'train_runtime': 40.7057, 'train_samples_per_second': 28.89, 'train_steps_per_second': 1.842, 'train_loss': 0.9810653940836589, 'epoch': 3.0}


TrainOutput(global_step=75, training_loss=0.9810653940836589, metrics={'train_runtime': 40.7057, 'train_samples_per_second': 28.89, 'train_steps_per_second': 1.842, 'total_flos': 309421379248128.0, 'train_loss': 0.9810653940836589, 'epoch': 3.0})

In [21]:
# Save the fine-tuned model
model.save_pretrained('./fine_tuned_bert_model')
tokenizer.save_pretrained('./fine_tuned_bert_tokenizer')

('./fine_tuned_bert_tokenizer\\tokenizer_config.json',
 './fine_tuned_bert_tokenizer\\special_tokens_map.json',
 './fine_tuned_bert_tokenizer\\vocab.txt',
 './fine_tuned_bert_tokenizer\\added_tokens.json')

In [22]:
# Load the fine-tuned model and tokenizer
model = BertForSequenceClassification.from_pretrained('./fine_tuned_bert_model')
tokenizer = BertTokenizer.from_pretrained('./fine_tuned_bert_tokenizer')

In [23]:
import torch.nn.functional as F

# Function to predict sentiment with percentage outputs
def predict_sentiment_with_percentages(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits     # Get logits from model output
    probs = F.softmax(logits, dim=-1).flatten()     # Convert logits to probabilities using softmax
    sentiment_labels = ['negative', 'neutral', 'positive']     # Map the probabilities to their respective sentiment labels
    sentiment_percentages = {label: round(prob.item() * 100, 2) for label, prob in zip(sentiment_labels, probs)}
    return sentiment_percentages

Testing Model

In [18]:
# Example prediction
text = "Should i buy a tesla as Emmanuel said it is good but it also very costly"
result = predict_sentiment_with_percentages(text)
print(result)

{'negative': 23.18, 'neutral': 22.91, 'positive': 53.91}


**Process Starts Here...**

In [24]:
def generate_results():
    df = pd.read_csv('red_data_4.csv')

    sentiment_results = []

    # Iterate over the texts in the DataFrame
    for index, row in df.iterrows():
        text = row['comment']
        sentiment_percentages = predict_sentiment_with_percentages(text)
        sentiment_results.append({
            'comment': text,
            'negative': sentiment_percentages['negative'],
            'neutral': sentiment_percentages['neutral'],
            'positive': sentiment_percentages['positive']
        })

    sentiment_df = pd.DataFrame(sentiment_results)
    return sentiment_df['positive'].mean().__round__(2), sentiment_df['neutral'].mean().__round__(2), sentiment_df['negative'].mean().__round__(2)   

In [21]:
sentiment_df

Unnamed: 0,comment,negative,neutral,positive
0,Robots with guns does not give me a warm and f...,42.29,15.06,42.64
1,"""Yee haw"" killed me💀💀",16.53,49.97,33.49
2,"Dude : ""Now give me back my gun""\nRobot : ""giv...",27.58,27.65,44.77
3,Guy: Nice job! I’ll take my gun back.\nRobot: ...,40.93,10.46,48.61
4,"*10 seconds later*\nRobot:""Now we have to chex...",36.79,22.58,40.63
...,...,...,...,...
995,"“Hey Jerry, you got that jack rabbit chip?”\nJ...",21.54,26.09,52.38
996,When you see a Terminator and it asks for your...,32.86,12.27,54.86
997,That shirt gone be wrinkled as hell 😂,33.81,21.01,45.19
998,Such smooth movements. No arthritis,31.91,33.74,34.35


In [22]:
print("Positive: ",sentiment_df['positive'].mean())
print("Neutral: ",sentiment_df['neutral'].mean())
print("Negative: ",sentiment_df['negative'].mean())

Positive:  43.271089999999994
Neutral:  27.8117
Negative:  28.917300000000004


STAGE 4 (DO NOT RUN THIS!) - MODEL NOT STABLE⚠️

TEXT GEN Model begins here....

In [27]:
# Load the dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

model = GPT2LMHeadModel.from_pretrained("distilgpt2")
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")

# Assign a padding token if not already present
tokenizer.pad_token = tokenizer.eos_token  # Use eos_token as the pad_token


In [28]:
# Tokenize the input text and set up labels
def tokenize_function(examples):
    tokenized_inputs = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()  # Set the labels as input_ids
    return tokenized_inputs

# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Convert dataset to PyTorch tensors
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

# Create DataLoader
train_dataset = tokenized_datasets["train"]


In [29]:
def collate_fn(batch):
    return tokenizer.pad(batch, padding=True, return_tensors="pt")

train_dataloader = DataLoader(
    train_dataset,
    batch_size=16,
    shuffle=True,
    num_workers=1,  # Disable multiprocessing to debug
    pin_memory=True
)


In [30]:
# Load the model
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Move the model to GPU if available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# Set up the optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)


In [None]:
# Set the model to training mode
model.train()

# Training loop
epochs = 3
scaler = GradScaler()  # Initialize the scaler for mixed precision

# Set gradient accumulation steps (adjust to simulate larger batches)
accumulation_steps = 4  # Simulates larger batch size

for epoch in range(epochs):
    loop = tqdm(train_dataloader, leave=True)

    optimizer.zero_grad()  # Reset the gradients before starting

    for step, batch in enumerate(loop):
        inputs = {key: val.to(device) for key, val in batch.items()}

        with torch.cuda.amp.autocast():  # Enable mixed precision
            outputs = model(**inputs)
            loss = outputs.loss / accumulation_steps  # Scale loss for accumulation

        scaler.scale(loss).backward()  # Backpropagate loss

        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)  # Update weights
            scaler.update()
            optimizer.zero_grad()  # Reset gradients

print("Training complete!")

In [None]:
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [None]:
# Load the model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Set pad token to eos token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

# Move model to the correct device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Prepare input text and move to device
input_text = "How are you? Tell me about Tesla Motors!"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)

# Create attention mask to differentiate between padding and actual data
attention_mask = torch.ones(inputs.shape, device=device)

# Generate text with repetition penalty
with torch.no_grad():
    outputs = model.generate(
        inputs,
        attention_mask=attention_mask,
        max_length=150,  # Increased length to avoid truncation
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        temperature=0.8,  # Slightly adjusted temperature
        top_k=50,         # Limit next token choices to top-k
        top_p=0.9,        # Use nucleus sampling
        repetition_penalty=1.2  # Penalty to discourage repetition
    )

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)



How are you? Tell me about Tesla Motors!
I'm a big fan of the company and I've been working on it for years. It's one of my favorite cars ever made, but there is something special that makes this car so unique to us: The way we build our vehicles in such an innovative manner allows them not only to be built with high quality materials (like aluminum), they can also have their own custom parts available from suppliers like BMW or Mercedes-Benz as well – all without having to worry too much over what will happen when these components go into production…and then get shipped out by truck every year."


STAGE 4.1

In [25]:
genai.configure(api_key="AIzaSyDYCwFViLa0Ry51BxOdLYZSU4JdkMF3kSY")

model2 = genai.GenerativeModel(model_name="gemini-1.5-flash")


In [26]:
def gen_tex(user_input):

    response = model2.generate_content(f"write a mail to {user_input} for partnership with my company in the future")
    return response.text

STAGE 5

Flask

In [27]:
app = Flask(__name__)
app.secret_key = "mypopeshighonweed"
ngrok.set_auth_token("2pQvQx6Ujbi5CAsfXIJeo4AVoOl_6fgRD49CmZMN2zFd9xrtw")

@app.route('/', methods=['GET', 'POST'])
def home():
    user_input = ""
    sentiment = session.get('sentiment', None)
    generated_text = None

    if request.method == 'POST':
        # Get the data from the input box
        user_input = request.form['user_input'].strip()

        if not user_input:  # Validate input
            generated_text = "Please enter a valid input before submitting."
        elif 'search' in request.form:
            search_results = get_search_results_yt(user_input)
            p, neu, neg = generate_results()
            sentiment = f"Positive: {p}, Neutral: {neu}, Negative: {neg}"
            session['sentiment'] = sentiment
        elif 'generate' in request.form:
            generated_text = gen_tex(user_input)

        # Debugging print statements (optional)
        print(f"Search: {user_input}")
        print(f"Sentiment scores: {sentiment}")

    return render_template_string("""
        <form method="POST">
            <input type="text" name="user_input" placeholder="Enter something" value="{{ user_input }}">
            <button type="submit" name="search">Search</button>
            <button type="submit" name="generate">Generate</button>
        </form>
        {% if user_input %}
            <h3>Searching: {{ user_input }}</h3>
            <p>Results: {{ sentiment }}</p>
        {% endif %}
        {% if user_input and generated_text %}
            <h3>Generated Text:</h3>
            <p>{{ generated_text }}</p>
        {% endif %}
    """, user_input=user_input, sentiment=sentiment, generated_text=generated_text)

if __name__ == "__main__":
    # This will keep the Flask app running until you manually stop it (using Ctrl+C)
    # app.run(debug=True, use_reloader=False, port = 5001)
    # Start the ngrok tunnel
    public_url = ngrok.connect(5000)
    print(f"Public URL: {public_url}")
    # Run the Flask app
    app.run(port=5000)

Public URL: NgrokTunnel: "https://312c-115-247-147-18.ngrok-free.app" -> "http://localhost:5000"    
 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - [27/Nov/2024 18:35:28] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [27/Nov/2024 18:35:29] "GET /favicon.ico HTTP/1.1" 404 -


Fetching comments for Video ID: pBHwJkrz3x4
Fetching comments for Video ID: 6R6F371Hj3k
Comments are disabled for Video ID: 6R6F371Hj3k. Skipping this video.
Fetching comments for Video ID: r1Rrt8iaOUc
Fetching comments for Video ID: igW_YJ7r1Zc
Fetching comments for Video ID: 43nSDUdse60
Fetching comments for Video ID: BDx_YTf9x1g
Fetching comments for Video ID: l2q_-xN2N54
Fetching comments for Video ID: JC9VVO0aUQw
Fetching comments for Video ID: EL1lwZP-RqM
Fetching comments for Video ID: 9vwHuCC6nP8
An unexpected error occurred with Video ID: 9vwHuCC6nP8. Error: <HttpError 400 when requesting https://youtube.googleapis.com/youtube/v3/commentThreads?part=snippet&videoId=9vwHuCC6nP8&maxResults=100&textFormat=plainText&order=relevance&key=AIzaSyCKz6fxf2IvH8z8LIp5mpm76LB9u9fxHUU&alt=json returned "The API server failed to successfully process the request. While this can be a transient error, it usually indicates that the request's input is invalid. Check the structure of the <code>com

127.0.0.1 - - [27/Nov/2024 18:36:34] "POST / HTTP/1.1" 200 -


Search: tesla
Sentiment scores: Positive: 38.99, Neutral: 27.78, Negative: 33.23


127.0.0.1 - - [27/Nov/2024 18:36:46] "POST / HTTP/1.1" 200 -


Search: tesla
Sentiment scores: Positive: 38.99, Neutral: 27.78, Negative: 33.23
