## Imports

In [1]:
import pandas as pd
import openai

In [2]:
from wordcloud import WordCloud
import numpy as np
import matplotlib.pyplot as plt
import itertools
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

## Data

Loading data from CSV files. Titles contains the titles that users have voted on. Info contains raw data from the videos, including the title posted on YouTube, and has over 7 million entries. Titles has only about 145,000 entries, so I'll left merge to keep only the entries that have voted-on titles.

In [3]:
titles = pd.read_csv('/Volumes/Samsung_T5/sb-mirror/titles.csv')
info = pd.read_csv('/Volumes/Samsung_T5/sb-mirror/videoInfo.csv')

In [4]:
info

Unnamed: 0,videoID,channelID,title,published
0,QyTRvbb3gkk,UCr7tNSNf7_aEEh5P-F5mE4A,Kronii Got Friendzoned by Chat but She End Up ...,1.634429e+09
1,qU1Yv58EXcc,UCo_IB5145EVNcf8hw1Kku7w,Game Theory: Minecraft's DARKEST Timeline! (He...,1.634429e+09
2,yKkVHBh9DQk,UCXJkLU1wZVqZjjVe1MuRj-A,TRIPLE RECORD EN GLOBILLOS? 🎈,1.633565e+09
3,7wCZSBOX7eM,UCg83RGdRpwfvoFEuE2zWKZA,Johnny vs. Nickelodeon All-Star Brawl (Sponsored),1.633392e+09
4,VVGjjaWWeRA,UCKBYXp4Xn2I2tL1UL4fpbhw,WOTB | NEW BIG HITTING JAGTIGER PREMIUM!,1.634429e+09
...,...,...,...,...
7000657,jRQKOKF3YNg,UCnXM5uNrNWEH7ewvIVsDRIg,DRAINING My Backyard FISH HATCHERY!!! (Surpris...,0.000000e+00
7000658,iZ5EyJudefY,UC06fO6LNH_AUgjbmqaZRV5Q,I Got My Sawmill Setup! Guess How Long I Can C...,0.000000e+00
7000659,9nMS6uMKSd4,UC2eVy7YvBT2XZo74wAT5scQ,A Snowboarders DREAM!! Heli Snowboarding Trip ...,0.000000e+00
7000660,LRB4r6WvqEY,UCLRlryMfL8ffxzrtqv0_k_w,"Spider-Man 4, Keanu Reeves in Sonic 3, Godzill...",0.000000e+00


In [None]:
titles

I'm limiting the info dataframe to just the videoID and title columns. Then I rename the "title" column in the titles dataframe so the two title columns stay distinct after the merge.

In [None]:
info_title = info[['videoID','title']]

In [None]:
titles.rename(columns={"title":"other_title"},inplace=True)

In [None]:
# Merge on the shared videoID key; a left merge keeps every row of titles
title_df = pd.merge(titles, info_title, on='videoID', how='left')
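
To see what the left merge does to rows without a match, here is a minimal sketch on toy data. The two small dataframes and their values are made up for illustration and only mimic the videoID, other_title, and title columns used above; the `indicator=True` flag is an addition that labels where each row came from.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration)
toy_titles = pd.DataFrame({
    "videoID": ["a1", "b2", "c3"],
    "other_title": ["Calm title A", "Calm title B", "Calm title C"],
})
toy_info = pd.DataFrame({
    "videoID": ["a1", "b2", "z9"],
    "title": ["SHOCKING A!!!", "You won't BELIEVE B", "Unrelated video"],
})

# Left merge: every row of toy_titles survives; rows with no match get NaN in 'title'
toy_merged = pd.merge(toy_titles, toy_info, on="videoID", how="left", indicator=True)

# Share of rows whose posted title is missing after the merge
missing_share = toy_merged["title"].isna().mean() * 100
print(toy_merged)
print(f"{missing_share:.1f}% of rows have no posted title")
```

Rows marked `left_only` in the `_merge` column are voted-on titles with no matching entry in the info table, which is where NaN titles come from.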

### Exploration

The original column seems to indicate that voters did not consider the video's posted title "clickbaity" enough to change, so it was fine to remain.

In [None]:
title_df[title_df['original']==1]

In [None]:
title_df['original'].value_counts(normalize=True)

Only 5 percent of the dataset is marked as "original title."

In [None]:
title_df[title_df['original']==1]

In [None]:
title_df

In [None]:
title_df['title'].isna().mean()*100

41% of the titles are NaN, so I have to drop those rows from the dataset to really be able to gauge the success of ChatGPT against the other models.

In [None]:
title_df.dropna(inplace=True)

In [None]:
title_df.sample(1)

In [None]:
title_df.info()

In [None]:
# index=False avoids writing the index as an extra "Unnamed: 0" column
title_df.to_csv('titles_no_nan.csv', index=False)

## Title Exploration

In [None]:
text = ' '.join(title_df['title'])

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)


In [None]:
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
other_text = ' '.join(title_df['other_title'])

In [None]:
other_text_cloud = WordCloud(width=800, height=400, background_color='white').generate(other_text)


In [None]:
plt.figure(figsize=(10, 5))
# Plot the word cloud built from the voted-on titles, not the posted titles
plt.imshow(other_text_cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
title_df.sample()

## ChatGPT Model

This is my attempt to use ChatGPT to classify the clickbait titles and see how it performs against other models. An OpenAI API key is necessary for this step. I will link to instructions for setting one up.
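
As a minimal sketch of the key setup, assuming the key is stored in the `OPENAI_API_KEY` environment variable (the name the openai library looks for by default), the helper below reads it and fails early with a clear message if it is missing. `load_api_key` is a hypothetical helper written for this notebook, not part of the openai package.

```python
import os

def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before calling the API.")
    return key

# Usage (commented out so the cell runs without a real key):
# openai.api_key = load_api_key()
```

Keeping the key in the environment rather than in a cell means it does not end up saved in the notebook.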

In [None]:
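# Draw one random row and keep its 'title' column (a one-element Series)
test = title_df.sample()['title']

In [None]:
# The sampled index label changes on every run, so store it instead of hard-coding it
test_index = test.index[0]
test = test.iloc[0]

In [None]:
test

In [None]:
# Look up the sampled row by index label (.loc), not by position (.iloc)
title_df.loc[test_index]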

In [None]:
prompt = f"""Classify the text into one of the classes. Also return the probability of it being clickbait.
Classes: [`clickbait`, `not clickbait`]
Text: World's first screw-bike
Class: `not clickbait`

Text: Mastering mood in photography (3 easy steps).
Class: `not clickbait`

Text: What's Inside the DON'T DIE BOX???.
Class: `clickbait`

Text: OBNOXIOUS Idiot Pushes The WRONG JUDGE Too Far!!! Wild Court Cam...
Class: `clickbait`

Text: {test}
Class: """

In [None]:
prompt
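
The few-shot prompt above is written by hand. If the labeled examples need to change, the same layout could be assembled programmatically. This is only a sketch: `build_prompt` and `demo_prompt` are hypothetical names, and the instruction line is shortened to the classification request.

```python
def build_prompt(examples, text):
    """Assemble a few-shot classification prompt from (title, label) pairs."""
    lines = [
        "Classify the text into one of the classes.",
        "Classes: [`clickbait`, `not clickbait`]",
    ]
    # One "Text / Class" pair per labeled example, separated by a blank line
    for title, label in examples:
        lines.append(f"Text: {title}")
        lines.append(f"Class: `{label}`")
        lines.append("")
    # The title to classify goes last, with the class left open for the model
    lines.append(f"Text: {text}")
    lines.append("Class: ")
    return "\n".join(lines)

examples = [
    ("World's first screw-bike", "not clickbait"),
    ("What's Inside the DON'T DIE BOX???", "clickbait"),
]
demo_prompt = build_prompt(examples, "Mastering mood in photography (3 easy steps)")
print(demo_prompt)
```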

In [None]:
# Generate completion using OpenAI's GPT-3.5 model
response = openai.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
)

# Extract the generated text (the predicted class and any probability the model reports)
output = response.choices[0].message.content
print(output)
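
The model replies with free text (available as `response.choices[0].message.content`), so comparing it against other models needs a step that turns the reply into one of the two class labels. A minimal sketch, assuming the reply starts with the class name, possibly wrapped in backticks or quotes; `parse_label` is a hypothetical helper, and replies in any other shape come back as `None`.

```python
def parse_label(reply):
    """Map a free-text model reply to 'clickbait', 'not clickbait', or None."""
    # Drop surrounding whitespace, then any wrapping backticks, quotes, spaces, or periods
    cleaned = reply.strip().strip("`'\" .").lower()
    # Check the negative class first because it also contains the word 'clickbait'
    if cleaned.startswith("not clickbait"):
        return "not clickbait"
    if cleaned.startswith("clickbait"):
        return "clickbait"
    return None

print(parse_label("`clickbait`"))
print(parse_label(" `Not Clickbait`.\n"))
```

Returning `None` for unexpected replies makes it easy to count how often the model ignores the requested format.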