# Table of Contents:

**ENGINEER CLASSIFICATION MODEL**

1. Pull in training data (cc_news from HuggingFace)
2. Preprocessing for features: BoW, N-Gram, TF-IDF, GloVe
3. Split Dataset: train-test-split
4. Compile Variables: hstack
5. Model Training: train different classifiers and perform feature ablation studies  
  
  A. Classification (Logistic Regression, SVC)  
  B. Ensemble (Random Forest, Gradient Boost)  
  C. Feed-Forward Network

**TEST ON CHATGPT DATA**

1. Import and preprocess ChatGPT text
2. Featurize, compile features, and check shapes
3. Predict with different prompting methods (natural, summary, detailed)
4. Store results
5. Results

**PLEASE NOTE THAT ADDITIONAL SOURCES ARE CITED IN OUR AFFILIATED RESEARCH PAPER**

# ENGINEER CLASSIFICATION MODEL

Here, we will engineer a classification model that can classify ChatGPT responses as "right-," "center-," or "left-leaning."

## 1. Pull in training data (cc_news from HuggingFace)

Mount Google Drive to pull in our data.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Read the .csv file of pared-down domains (see "Annotating our Training Dataset") and check that the data was read in properly.

In [None]:
import pandas as pd

path = '/content/drive/MyDrive/nlp_final_project_2023/filtered_train_data.csv'
news_df = pd.read_csv(path)

In [None]:
news_df.head()

Unnamed: 0.1,Unnamed: 0,title,text,domain,date,description,url,image_url,leaning-label
0,11724,Cleveland Shooter Disowned By Family On Twitter,"Donald Harvey, also dubbed as the 'Angel of De...",www.yahoo.com,2017-04-17 05:27:27,Cleveland Police issued an aggravated murder w...,https://www.yahoo.com/news/cleveland-shooter-d...,https://s.yimg.com/uu/api/res/1.2/vZB0t9O5GGqs...,conservative
1,11725,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,www.yahoo.com,2017-10-07 04:31:25,Movie company says it is taking claims “very s...,https://www.yahoo.com/movies/weinstein-company...,https://s.yimg.com/uu/api/res/1.2/4wtFnh7lUeYk...,conservative
2,11726,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,www.yahoo.com,2017-02-15 15:22:55,Former NHL star Mike Comrie — once married to ...,https://www.yahoo.com/celebrity/hilary-duffs-e...,https://s.yimg.com/uu/api/res/1.2/..p2z00Och0J...,conservative
3,11727,"At 117, Jamaican woman likely just became worl...","The world's oldest person Violet Brown, center...",www.yahoo.com,2017-04-17 22:55:18,"DUANVALE, Jamaica (AP) — Violet Brown spent mu...",https://www.yahoo.com/news/117-jamaican-woman-...,https://s.yimg.com/uu/api/res/1.2/Vd4NgTACWY1z...,conservative
4,11728,"Mark Hamill's Carrie Fisher Tribute: ""Making H...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,www.yahoo.com,2017-01-02 00:00:00,"""She was a handful, but my life would have bee...",https://www.yahoo.com/movies/mark-hamills-carr...,https://s.yimg.com/uu/api/res/1.2/Ole1yyNg3gmL...,conservative


Check data counts.

In [None]:
news_df.count()

Unnamed: 0       10555
title            10555
text             10555
domain           10555
date             10545
description      10471
url              10555
image_url        10555
leaning-label    10555
dtype: int64

The `date` column is missing 10 values and `description` is missing 84, but that should be fine because we only rely on `text` downstream.
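
As a small aside, `count()` reports non-null values, so the gaps have to be worked out by subtraction. A minimal sketch on a toy frame (the column values here are made up for illustration) shows how `isna().sum()` reports the missing counts directly:

```python
import pandas as pd

# Toy frame standing in for news_df; the values are illustrative only
toy = pd.DataFrame({
    'text': ['a', 'b', 'c'],
    'description': ['x', None, 'z'],
})

# Number of missing values per column
missing = toy.isna().sum()
print(missing)
```

Running the same call on `news_df` should list the gaps per column without any arithmetic.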

In [None]:
# Drop previous index column
news_df = news_df.drop(columns='Unnamed: 0')

In [None]:

news_df.head()

Unnamed: 0,title,text,domain,date,description,url,image_url,leaning-label
0,Cleveland Shooter Disowned By Family On Twitter,"Donald Harvey, also dubbed as the 'Angel of De...",www.yahoo.com,2017-04-17 05:27:27,Cleveland Police issued an aggravated murder w...,https://www.yahoo.com/news/cleveland-shooter-d...,https://s.yimg.com/uu/api/res/1.2/vZB0t9O5GGqs...,conservative
1,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,www.yahoo.com,2017-10-07 04:31:25,Movie company says it is taking claims “very s...,https://www.yahoo.com/movies/weinstein-company...,https://s.yimg.com/uu/api/res/1.2/4wtFnh7lUeYk...,conservative
2,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,www.yahoo.com,2017-02-15 15:22:55,Former NHL star Mike Comrie — once married to ...,https://www.yahoo.com/celebrity/hilary-duffs-e...,https://s.yimg.com/uu/api/res/1.2/..p2z00Och0J...,conservative
3,"At 117, Jamaican woman likely just became worl...","The world's oldest person Violet Brown, center...",www.yahoo.com,2017-04-17 22:55:18,"DUANVALE, Jamaica (AP) — Violet Brown spent mu...",https://www.yahoo.com/news/117-jamaican-woman-...,https://s.yimg.com/uu/api/res/1.2/Vd4NgTACWY1z...,conservative
4,"Mark Hamill's Carrie Fisher Tribute: ""Making H...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,www.yahoo.com,2017-01-02 00:00:00,"""She was a handful, but my life would have bee...",https://www.yahoo.com/movies/mark-hamills-carr...,https://s.yimg.com/uu/api/res/1.2/Ole1yyNg3gmL...,conservative


Check that one instance of the data looks good:

In [None]:
# news_df.head()

In [None]:
print(news_df['text'].iloc[0])

Donald Harvey, also dubbed as the 'Angel of Death,' used arsenic, rat poison and cyanide to kill patients at hospitals where he worked during 1970s and '80s.
Steve Stephens, 37, who has been accused of homicide Sunday of 74-year-old Ohio resident named Robert Godwin Sr., has been publicly disowned by his family, according to a Twitter post from his account. The shooting, which was streamed on Facebook Live, took place at 635 E. 93rd St. around 2 p.m. EDT.
The Twitter post on Stephens' account read: "We absolutely do not condone this type of behavior and this atrocity, therefore we do not consider Steve a part of this family. I would like everyone to refrain from posting pictures of our family in association with Steve, for we do not want our young ones to be burdened by this man. Please respect our privacy."
Cleveland Police Department issued an aggravated murder warrant against Stephens on Sunday night. They also alerted residents of Pennsylvania, New York, Indiana and Michigan as the

This is a three-class classification problem: conservative (right), liberal (left), or moderate (center). Let's keep our choices in mind.

In [None]:
# choices = ["liberal", "conservative", "moderate"]

We're only looking at bodies of text, because ChatGPT responses will consist only of text. We can narrow down our dataframe to something more manageable.

In [None]:
news_features = news_df[['domain', 'date', 'title', 'description', 'text', 'leaning-label']]

In [None]:
news_features.head()

Unnamed: 0,domain,date,title,description,text,leaning-label
0,www.yahoo.com,2017-04-17 05:27:27,Cleveland Shooter Disowned By Family On Twitter,Cleveland Police issued an aggravated murder w...,"Donald Harvey, also dubbed as the 'Angel of De...",conservative
1,www.yahoo.com,2017-10-07 04:31:25,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,Movie company says it is taking claims “very s...,conservative
2,www.yahoo.com,2017-02-15 15:22:55,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,Former NHL star Mike Comrie — once married to ...,conservative
3,www.yahoo.com,2017-04-17 22:55:18,"At 117, Jamaican woman likely just became worl...","DUANVALE, Jamaica (AP) — Violet Brown spent mu...","The world's oldest person Violet Brown, center...",conservative
4,www.yahoo.com,2017-01-02 00:00:00,"Mark Hamill's Carrie Fisher Tribute: ""Making H...","""She was a handful, but my life would have bee...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,conservative


In [None]:
news_features.count()

domain           10555
date             10545
title            10555
description      10471
text             10555
leaning-label    10555
dtype: int64

In [None]:
# Nice!

Now we want to make sure we have a numerical leaning label for each text label. A numerical representation is easier to use when calculating our results.

Reference: https://stackoverflow.com/questions/70047812/label-assignment-from-lookup-dictionary-keys-and-value-in-python

In [None]:
# Add new column
label_key = {'moderate':0, 'conservative':1, 'liberal':2}
# Work on an explicit copy so pandas does not raise SettingWithCopyWarning
news_features = news_features.copy()
news_features['numerical-label'] = news_features['leaning-label'].map(label_key)


Check one example row, as well as `news_features.head()`.

In [None]:
news_features.iloc[1342]

domain                                                   www.bbc.com
date                                             2018-04-24 17:30:44
title                    Toronto van attack: Moment suspect arrested
description        Alek Minassia pleaded "kill me" and claimed to...
text               Video\nA man suspected of killing 10 people an...
leaning-label                                               moderate
numerical-label                                                    0
Name: 1342, dtype: object

In [None]:
news_features.head()

Unnamed: 0,domain,date,title,description,text,leaning-label,numerical-label
0,www.yahoo.com,2017-04-17 05:27:27,Cleveland Shooter Disowned By Family On Twitter,Cleveland Police issued an aggravated murder w...,"Donald Harvey, also dubbed as the 'Angel of De...",conservative,1
1,www.yahoo.com,2017-10-07 04:31:25,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,Movie company says it is taking claims “very s...,conservative,1
2,www.yahoo.com,2017-02-15 15:22:55,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,Former NHL star Mike Comrie — once married to ...,conservative,1
3,www.yahoo.com,2017-04-17 22:55:18,"At 117, Jamaican woman likely just became worl...","DUANVALE, Jamaica (AP) — Violet Brown spent mu...","The world's oldest person Violet Brown, center...",conservative,1
4,www.yahoo.com,2017-01-02 00:00:00,"Mark Hamill's Carrie Fisher Tribute: ""Making H...","""She was a handful, but my life would have bee...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,conservative,1
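
The label mapping above can be sanity-checked with a short sketch. It uses a toy frame (an assumption for illustration, not our real data) and confirms that no label was left unmapped, since `.map()` silently produces `NaN` for any key missing from the dictionary:

```python
import pandas as pd

label_key = {'moderate': 0, 'conservative': 1, 'liberal': 2}

# Toy frame standing in for news_features
toy = pd.DataFrame({'leaning-label': ['moderate', 'liberal', 'conservative', 'liberal']})
toy['numerical-label'] = toy['leaning-label'].map(label_key)

# Labels missing from label_key would show up here as NaN
unmapped = int(toy['numerical-label'].isna().sum())
print(toy)
print('unmapped labels:', unmapped)
```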


## 2. Preprocessing for features: BoW, N-Gram, TF-IDF, GloVe

Now, we need to featurize our data so we can feed it into our classifier during training. We will build four main features: Bag of Words, N-Gram, TF-IDF, and GloVe. Featurizing requires some basic preprocessing.

List of features and the affiliated preprocessing steps:
- BoW, N-Gram, TF-IDF:  
  - Clean and remove URLs, hashtags, etc.
  - Tokenize
  - Remove stop words
  - Lemmatize
- GloVe


First, let's remove URLs, mentions, hashtags, non-English text, and other stray formatting. We will use the regular expressions that we also used in Assignment 3.

In [None]:
# From Assignment 3

import re # Import regular expressions

# URLs (start with https:// or www.)
# url_pattern = r'https?://\S+|www\.\S+' #Failed attempt
url_pattern = r'(https:\/\/|www.)[\S]+' # matches https:// or www. through the next whitespace character
# regex=True is needed: pandas >= 2.0 treats the pattern as a literal string by default
news_features['text-processed'] = news_features['text'].str.replace(url_pattern, "", regex=True)


In [None]:
# Mentions (has the @ symbol)
mention_pattern = r'@[\S]+' # matches anything following an @ symbol
news_features['text-processed'] = news_features['text-processed'].str.replace(mention_pattern, "", regex=True)


In [None]:
# Hashtags
hashtag_pattern = r'#[\S]+' # matches anything following a # symbol
news_features['text-processed'] = news_features['text-processed'].str.replace(hashtag_pattern, "", regex=True)


In [None]:
# Non-English text and non-letter characters
not_roman = r"[^a-zA-Z'\s]" # matches anything that is not a Latin letter, an apostrophe, or whitespace (digits and punctuation are removed too)
news_features['text-processed'] = news_features['text-processed'].str.replace(not_roman, " ", regex=True)


In [None]:
# # Remove numbers
# numbers = r'\d'
# news_features['text-processed'] = news_features['text-processed'].str.replace(numbers, "")

In [None]:
# Get rid of random hyphens
hyphen = r'-'
news_features['text-processed'] = news_features['text-processed'].str.replace(hyphen, " ")

In [None]:
# Make sure everything has only one space between it
space = r'\s\s+' # matches any run of two or more whitespace characters, including line breaks
news_features['text-processed'] = news_features['text-processed'].str.replace(space, " ", regex=True)


Let's check that it worked:

In [None]:
news_features.head()

Unnamed: 0,domain,date,title,description,text,leaning-label,numerical-label,text-processed
0,www.yahoo.com,2017-04-17 05:27:27,Cleveland Shooter Disowned By Family On Twitter,Cleveland Police issued an aggravated murder w...,"Donald Harvey, also dubbed as the 'Angel of De...",conservative,1,Donald Harvey also dubbed as the 'Angel of Dea...
1,www.yahoo.com,2017-10-07 04:31:25,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,Movie company says it is taking claims “very s...,conservative,1,Movie company says it is taking claims very se...
2,www.yahoo.com,2017-02-15 15:22:55,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,Former NHL star Mike Comrie — once married to ...,conservative,1,Former NHL star Mike Comrie once married to Hi...
3,www.yahoo.com,2017-04-17 22:55:18,"At 117, Jamaican woman likely just became worl...","DUANVALE, Jamaica (AP) — Violet Brown spent mu...","The world's oldest person Violet Brown, center...",conservative,1,The world's oldest person Violet Brown center ...
4,www.yahoo.com,2017-01-02 00:00:00,"Mark Hamill's Carrie Fisher Tribute: ""Making H...","""She was a handful, but my life would have bee...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,conservative,1,Photo Albert L Ortega Gettyimages Carrie and ...


In [None]:
print(news_features['text-processed'].iloc[348])

This undated picture provided on Monday Feb by the Albanian National Coastline Agency shows a shipwreck discovered by the RPM's Hercules research vessel in Ionian Sea Albania The country is promoting the archaeological finds in the waters off its southwest coast to raise public interest and to attract attention of decision makers who can help preserve the discoveries The Albanian National Coastline Agency opened an exhibition on Monday Feb of pictures showing underwater finds of potential archaeological significance from the last decade The Albanian National Coastline Agency via AP TIRANA Albania AP Albania is promoting the archaeological finds in the waters off its southwest coast to raise public interest and to attract attention of decision makers who can help preserve the discoveries The Albanian National Coastline Agency opened an exhibit Monday of photographs showing underwater finds of potential archaeological significance from the last decade The nonprofit RPM Nautical Foundatio
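
The cleaning steps above can also be collected into one helper, which makes it easy to reuse the exact same cleaning on other text later. This is a minimal sketch rather than the code we ran: the function name `clean_text` and the sample sentence are hypothetical, and the URL pattern is a slightly stricter variant (it also matches `http://` and escapes the dot after `www`).

```python
import re

# Patterns mirror the cells above; URL is an assumed, slightly stricter variant
URL = re.compile(r'(https?://|www\.)\S+')
MENTION = re.compile(r'@\S+')
HASHTAG = re.compile(r'#\S+')
NOT_ROMAN = re.compile(r"[^a-zA-Z'\s]")
SPACES = re.compile(r'\s\s+')

def clean_text(text):
    """Remove URLs, mentions, and hashtags, then non-letters, then extra spaces."""
    for pattern in (URL, MENTION, HASHTAG):
        text = pattern.sub('', text)
    text = NOT_ROMAN.sub(' ', text)   # digits and punctuation become spaces
    text = SPACES.sub(' ', text)      # collapse runs of whitespace
    return text.strip()

sample = "Read more at https://example.com/a-b @user #news on Feb. 12!"
cleaned = clean_text(sample)
print(cleaned)
```

With pandas, the helper could be applied row by row via `news_features['text'].apply(clean_text)`.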

Now, we want to tokenize our text, breaking it up into single words that we can continue to filter. Let's download the 'punkt' tokenizer models from nltk and apply `word_tokenize`.

**The below preprocessing code was inspired from labs completed in Dr. Abhijit Mishra's Natural Language Processing and Applications course at UT Austin's School of Information.**

In [None]:
import nltk # Import nltk library
nltk.download('punkt') # Download the Punkt tokenizer models that word_tokenize relies on
from nltk.tokenize import word_tokenize # Import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
news_features['text-processed'] = (news_features['text-processed'].str.lower()).apply(word_tokenize)

Now we have tokenized words. We can filter out stopwords with a list comprehension that checks each token against nltk's English stopword list.

In [None]:
# news_features.head()

In [None]:
nltk.download('stopwords')

from nltk.corpus import stopwords # Import stopwords
stop_words = set(stopwords.words('english')) # Set stopwords to 'English'

print(stop_words) # Did we do this right??

{'their', 'an', "that'll", 'aren', 'a', 'then', 'because', 'here', 'ain', 'how', 'yourself', 'over', 'is', 'under', 'shan', "hasn't", 'and', 'below', "you'd", 'against', 'y', 'our', 'wouldn', 'from', 'should', 'doesn', "shan't", "shouldn't", 'but', 'weren', 'wasn', 'his', 'himself', 'any', 'does', 'these', 'being', 'just', 's', 'those', 'there', 'at', 'such', 'now', 'me', 'once', 'didn', 'the', 'you', 'by', 'as', 'not', 'further', 'off', 'can', 'hasn', "haven't", 'were', "should've", 'more', 'so', 'whom', 'has', 'same', 'won', 'herself', 'did', 'her', 'having', "won't", 'had', 'if', 'that', 'been', 'have', 'all', 'd', 'it', "weren't", "you're", "you've", "hadn't", 'don', "needn't", 'my', 'yours', 'itself', 'most', "she's", 'while', 'other', 'about', 'she', 'am', 'm', "it's", 'out', 'nor', 'mustn', 'needn', 'with', 'myself', 'are', 'll', 'shouldn', "you'll", 'why', 'own', "couldn't", 'until', 'through', "aren't", 'ourselves', 't', 'this', 'your', 'on', 'in', 'couldn', 'isn', "wasn't", '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# We can use .apply() with a lambda to iterate through each tokenized row and remove stopwords
news_features['text-processed'] = news_features['text-processed'].apply(lambda x: [word for word in x if word not in stop_words])
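
To make the filtering step concrete, here is a minimal sketch on a single tokenized sentence. The stop-word set below is a tiny hand-written stand-in (an assumption so the example runs without any downloads), not nltk's full English list, and `remove_stop_words` is a hypothetical helper name.

```python
# Tiny illustrative subset; the notebook uses nltk's full English list instead
STOP_WORDS = {'the', 'a', 'is', 'of', 'and', 'to', 'in'}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]

tokens = "the world 's oldest person is in jamaica".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

Storing the stop words in a `set` keeps each membership check fast, which matters when the comprehension runs over every token of every article.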

Now, we lemmatize, which reduces noise by mapping inflected forms to a shared base form. For example, "rats" and "rat" both become "rat."

In [None]:
# news_features.head()

In [None]:
# Lemmatize function with help from practicum 2
# Import wordnet, the Lemmatizer, and set it equal to a variable

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
news_features['text-processed'] = news_features['text-processed'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

In [None]:
news_features.head()

Unnamed: 0,domain,date,title,description,text,leaning-label,numerical-label,text-processed
0,www.yahoo.com,2017-04-17 05:27:27,Cleveland Shooter Disowned By Family On Twitter,Cleveland Police issued an aggravated murder w...,"Donald Harvey, also dubbed as the 'Angel of De...",conservative,1,"[donald, harvey, also, dubbed, 'angel, death, ..."
1,www.yahoo.com,2017-10-07 04:31:25,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,Movie company says it is taking claims “very s...,conservative,1,"[movie, company, say, taking, claim, seriously..."
2,www.yahoo.com,2017-02-15 15:22:55,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,Former NHL star Mike Comrie — once married to ...,conservative,1,"[former, nhl, star, mike, comrie, married, hil..."
3,www.yahoo.com,2017-04-17 22:55:18,"At 117, Jamaican woman likely just became worl...","DUANVALE, Jamaica (AP) — Violet Brown spent mu...","The world's oldest person Violet Brown, center...",conservative,1,"[world, 's, oldest, person, violet, brown, cen..."
4,www.yahoo.com,2017-01-02 00:00:00,"Mark Hamill's Carrie Fisher Tribute: ""Making H...","""She was a handful, but my life would have bee...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,conservative,1,"[photo, albert, l, ortega, gettyimages, carrie..."


Now we save our progress to a .csv file and read it back in.

In [None]:
# Download .csv
news_features.to_csv('news_features.csv')

In [None]:
import pandas as pd

In [None]:
news_features = pd.read_csv('/content/news_features.csv')

In [None]:
# len(news_features)

We are left with pre-processed text that can be used to extract numerical features.
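
One caveat of the CSV round trip is worth illustrating: list-valued cells come back as strings, and each save without `index=False` adds another `Unnamed: 0` column. The sketch below uses a toy frame and an in-memory buffer (both assumptions for illustration) and parses the strings back into lists with `ast.literal_eval`:

```python
import ast
import io

import pandas as pd

# Toy frame standing in for news_features
toy = pd.DataFrame({'text-processed': [['donald', 'harvey'], ['movie', 'company']]})

# Round trip through CSV; index=False avoids an extra "Unnamed: 0" column
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Each cell is now the string representation of a list, not a list
as_string = reloaded['text-processed'].iloc[0]

# Parse the strings back into real Python lists
reloaded['text-processed'] = reloaded['text-processed'].apply(ast.literal_eval)
as_list = reloaded['text-processed'].iloc[0]
print(type(as_string), type(as_list))
```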

### GloVe

In order to get GloVe embeddings, we have to unzip the GloVe file and load our vectors. We will use the 200-dimensional GloVe vectors, which can encode more nuance than the 50- and 100-dimensional vectors in the same archive. We followed Dr. Abhijit Mishra's example on how to load GloVe vectors for this section.

In [None]:
# this is a one time download
!wget -c http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# do some necessary conversions
!python -m gensim.scripts.glove2word2vec --input  glove.6B.200d.txt --output glove.6B.200d.vec
!rm glove*.txt

--2023-12-03 16:35:25--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-12-03 16:35:25--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-12-03 16:35:25--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



We load our vectors, then retrieve an embedding from those vectors for each word in each document of our corpus. Then, we can average the embeddings for each document, to get one single embedding (200 dimensions) that represents the average of each document.
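
The averaging idea can be illustrated without downloading GloVe. The sketch below uses a hypothetical two-dimensional embedding table (`toy_vectors`) in place of the real 200-dimensional vectors; words missing from the table are skipped, and a document with no known words falls back to a zero vector.

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings" standing in for GloVe
toy_vectors = {
    'cat': np.array([1.0, 0.0]),
    'dog': np.array([0.0, 1.0]),
}

def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    found = [vectors[token] for token in tokens if token in vectors]
    if found:
        return np.mean(found, axis=0)
    return np.zeros(dim)

doc_vec = average_vector(['cat', 'dog', 'unicorn'], toy_vectors, dim=2)
empty_vec = average_vector(['unicorn'], toy_vectors, dim=2)
print(doc_vec, empty_vec)
```

Note that the function expects a list of tokens. Passing a plain string would iterate over its characters instead of its words.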

In [None]:
import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained GloVe embeddings
word_vectors = KeyedVectors.load_word2vec_format('glove.6B.200d.vec', binary=False)

In [None]:
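# Make a function that averages GloVe vectors
# From practicum 5
import ast

def get_average_glove_vector(text):
    # After the CSV round trip each row is a string such as "['donald', 'harvey']";
    # parse it back into a list so we iterate over words, not characters
    if isinstance(text, str):
        text = ast.literal_eval(text)
    vectors = [word_vectors[word] for word in text if word in word_vectors]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(word_vectors.vector_size)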

Now, we can apply the function that averages all of those embeddings, and can check our dataset.

In [None]:
news_features['glove'] = news_features['text-processed'].apply(get_average_glove_vector)

In [None]:
news_features.head()

Unnamed: 0.1,Unnamed: 0,domain,date,title,description,text,leaning-label,numerical-label,text-processed,glove
0,0,www.yahoo.com,2017-04-17 05:27:27,Cleveland Shooter Disowned By Family On Twitter,Cleveland Police issued an aggravated murder w...,"Donald Harvey, also dubbed as the 'Angel of De...",conservative,1,"['donald', 'harvey', 'also', 'dubbed', ""'angel...","[0.038828976, 0.35248274, -0.16552818, -0.0523..."
1,1,www.yahoo.com,2017-10-07 04:31:25,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,Movie company says it is taking claims “very s...,conservative,1,"['movie', 'company', 'say', 'taking', 'claim',...","[0.03548508, 0.36543903, -0.17005034, -0.07168..."
2,2,www.yahoo.com,2017-02-15 15:22:55,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,Former NHL star Mike Comrie — once married to ...,conservative,1,"['former', 'nhl', 'star', 'mike', 'comrie', 'm...","[0.046540175, 0.30033353, -0.16914955, -0.0656..."
3,3,www.yahoo.com,2017-04-17 22:55:18,"At 117, Jamaican woman likely just became worl...","DUANVALE, Jamaica (AP) — Violet Brown spent mu...","The world's oldest person Violet Brown, center...",conservative,1,"['world', ""'s"", 'oldest', 'person', 'violet', ...","[0.038885757, 0.34846243, -0.18184175, -0.0459..."
4,4,www.yahoo.com,2017-01-02 00:00:00,"Mark Hamill's Carrie Fisher Tribute: ""Making H...","""She was a handful, but my life would have bee...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,conservative,1,"['photo', 'albert', 'l', 'ortega', 'gettyimage...","[0.041827377, 0.31809318, -0.18032931, -0.0497..."


Those embeddings took a while to generate, so we saved a backup .csv. Here we read that backup back in.

In [None]:
import pandas as pd
news_features = pd.read_csv('/content/news_features including GLOVE.csv')

In [None]:
news_features.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,domain,date,title,description,text,leaning-label,numerical-label,text-processed,glove
0,0,0,www.yahoo.com,2017-04-17 05:27:27,Cleveland Shooter Disowned By Family On Twitter,Cleveland Police issued an aggravated murder w...,"Donald Harvey, also dubbed as the 'Angel of De...",conservative,1,"['donald', 'harvey', 'also', 'dubbed', ""'angel...",[ 3.88289765e-02 3.52482736e-01 -1.65528178e-...
1,1,1,www.yahoo.com,2017-10-07 04:31:25,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,Movie company says it is taking claims “very s...,conservative,1,"['movie', 'company', 'say', 'taking', 'claim',...",[ 3.54850814e-02 3.65439028e-01 -1.70050338e-...
2,2,2,www.yahoo.com,2017-02-15 15:22:55,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,Former NHL star Mike Comrie — once married to ...,conservative,1,"['former', 'nhl', 'star', 'mike', 'comrie', 'm...",[ 4.65401746e-02 3.00333530e-01 -1.69149548e-...
3,3,3,www.yahoo.com,2017-04-17 22:55:18,"At 117, Jamaican woman likely just became worl...","DUANVALE, Jamaica (AP) — Violet Brown spent mu...","The world's oldest person Violet Brown, center...",conservative,1,"['world', ""'s"", 'oldest', 'person', 'violet', ...",[ 0.03888576 0.34846243 -0.18184175 -0.045955...
4,4,4,www.yahoo.com,2017-01-02 00:00:00,"Mark Hamill's Carrie Fisher Tribute: ""Making H...","""She was a handful, but my life would have bee...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,conservative,1,"['photo', 'albert', 'l', 'ortega', 'gettyimage...",[ 4.18273769e-02 3.18093181e-01 -1.80329308e-...


## 3. Split Dataset: train-test-split

Now, we can split our dataset into a train set and a test set. That way, we can train our classifier and measure its accuracy on held-out data it has not seen.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train, test = train_test_split(news_features, test_size = 0.2)

In [None]:
train.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,domain,date,title,description,text,leaning-label,numerical-label,text-processed,glove
10372,10372,10372,www.cnn.com,2017-04-17 17:32:41,"If you think Atlanta traffic is terrible, it j...",An underground gas leak caused a portion of I-...,(CNN) Atlanta's running out of interstates.\nT...,liberal,2,"['cnn', 'atlanta', ""'s"", 'running', 'interstat...",[ 6.04905523e-02 3.24522138e-01 -1.76917225e-...
6253,6253,6253,www.foxnews.com,2018-02-02 21:45:00,Missouri soldier returns home early and surpri...,A Missouri soldier surprised his son at school...,A Missouri soldier gave his son a special birt...,conservative,1,"['missouri', 'soldier', 'gave', 'son', 'specia...",[ 0.02821473 0.3636754 -0.17310047 -0.051173...
4389,4389,4389,thehill.com,2018-03-19 10:14:06,GOP Senate candidate slams McCaskill over Clin...,The top Republican candidate running to face Sen.,The top Republican candidate running to face S...,moderate,0,"['top', 'republican', 'candidate', 'running', ...",[ 0.05036669 0.31638792 -0.17640485 -0.052371...
8346,8346,8346,www.forbes.com,2017-10-06 13:44:00,Is It Lights Out For Kaspersky After Latest NS...,"Kaspersky isn't definitively done in America, ...",How long can Kaspersky survive the assault on ...,moderate,0,"['long', 'kaspersky', 'survive', 'assault', 'b...",[ 4.58879545e-02 3.69129270e-01 -1.73489451e-...
6531,6531,6531,www.foxnews.com,2017-08-14 22:21:00,Stanton breaks Marlins season HR record in win...,Giancarlo Stanton sets the Miami Marlins recor...,MIAMI (AP) -- Giancarlo Stanton hit his team-r...,conservative,1,"['miami', 'ap', 'giancarlo', 'stanton', 'hit',...",[ 4.81091663e-02 2.76806742e-01 -1.82998374e-...


Let's run a count to make sure that worked!

In [None]:
train.count()

Unnamed: 0.1       8444
Unnamed: 0         8444
domain             8444
date               8438
title              8444
description        8377
text               8444
leaning-label      8444
numerical-label    8444
text-processed     8444
glove              8444
dtype: int64

In [None]:
test.count()

Unnamed: 0.1       2111
Unnamed: 0         2111
domain             2111
date               2107
title              2111
description        2094
text               2111
leaning-label      2111
numerical-label    2111
text-processed     2111
glove              2111
dtype: int64

In [None]:
train.groupby('leaning-label').size()

leaning-label
conservative    2695
liberal         2778
moderate        2971
dtype: int64

In [None]:
test.groupby('leaning-label').size()

leaning-label
conservative    699
liberal         683
moderate        729
dtype: int64

Now, we can featurize the text tokens that we prepared earlier into Bag of Words, N-Gram, and TF-IDF features. We will start by importing `CountVectorizer` and `TfidfVectorizer` to use as tools to get these features.

References:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- Dr. Abhijit Mishra's NLP course lab materials

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import accuracy_score

Now, we set our training data as those lists of tokens, and set our classification target to the numerical label (0, 1, 2 for moderate, right-, and left-leaning).

In [None]:
# Extract text and labels (reference: practicum 5)
X_train = train['text-processed']
y_train = train['numerical-label']

X_test = test['text-processed']
y_test = test['numerical-label']

Let's make sure the format is correct: each row should hold the document's word tokens, stored as a single string.

In [None]:
X_train

10372    ['cnn', 'atlanta', "'s", 'running', 'interstat...
6253     ['missouri', 'soldier', 'gave', 'son', 'specia...
4389     ['top', 'republican', 'candidate', 'running', ...
8346     ['long', 'kaspersky', 'survive', 'assault', 'b...
6531     ['miami', 'ap', 'giancarlo', 'stanton', 'hit',...
                               ...                        
8304     ['seem', 'impossible', 'person', 'reach', 'las...
7812     ['photo', 'peter', 'lyon', 'moment', 'fan', 'r...
4534     ['exclusive', 'interview', 'academy', 'award',...
1211     ['image', 'copyright', 'twitter', 'image', 'ca...
8532     ['bank', 'including', 'barclays', 'bank', 'ame...
Name: text-processed, Length: 8444, dtype: object

In [None]:
type(X_train)

pandas.core.series.Series

Now, we featurize by creating three vectorizers, one for each feature. Each vectorizer is fit on the training data only and then used to transform both the training and the test data. **It is critical that the test data is transformed with the same fitted vectorizers**, because the model can only predict on inputs that have the same columns (vocabulary) it was trained on.

In [None]:
# Bag of Words
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

In [None]:
# N-Grams (ngram_range=(1,2) keeps unigrams and adds bigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(1,2))
X_train_ngram = ngram_vectorizer.fit_transform(X_train)
X_test_ngram = ngram_vectorizer.transform(X_test)
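
To see what `ngram_range=(1,2)` produces, here is a minimal sketch on a single made-up document. The vocabulary contains every single word plus every pair of adjacent words:

```python
from sklearn.feature_extraction.text import CountVectorizer

# One toy document; the wording is illustrative only
docs = ["tax cut vote"]

toy_ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
toy_ngram_vectorizer.fit(docs)

# Learned vocabulary: unigrams and bigrams
ngram_vocab = sorted(toy_ngram_vectorizer.vocabulary_)
print(ngram_vocab)
```

Because the unigrams are included, the N-Gram matrix repeats every Bag of Words column, which is worth keeping in mind when the feature sets are combined.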

In [None]:
# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
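The fit-on-train, transform-on-test pattern used in the cells above can be illustrated on a toy corpus. This is only a minimal sketch: the `toy_*` names and sentences are made up for illustration and are not part of our dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents (not taken from cc_news)
toy_train = ["senate passes budget bill", "team wins final game"]
toy_test = ["senate debates new bill", "unseen words only here"]

toy_vectorizer = CountVectorizer()
toy_train_bow = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary from training text
toy_test_bow = toy_vectorizer.transform(toy_test)        # reuses it; words never seen in training are dropped

# Both matrices have one column per training-vocabulary word
print(toy_train_bow.shape, toy_test_bow.shape)
```

Because the test matrix is built from the training vocabulary, a test document made entirely of unseen words becomes an all-zero row rather than changing the number of columns.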

Now, for the GloVe feature, we call our above GloVe function (the one that averages a vector for the entire document) and place it in an array.

In [None]:
import numpy as np

In [None]:
# GloVE (apply earlier function)
X_train_glove = np.array([get_average_glove_vector(text) for text in X_train])
X_test_glove = np.array([get_average_glove_vector(text) for text in X_test])

Let's be sure that the number of rows matches across all of our features, and that the train and test versions of each feature have the same number of columns.

In [None]:
print(X_train.shape)
print(X_train_bow.shape)
print(X_train_ngram.shape)
print(X_train_tfidf.shape)

(8444,)
(8444, 62382)
(8444, 1419558)
(8444, 62382)


In [None]:
print(X_test.shape)
print(X_test_bow.shape)
print(X_test_ngram.shape)
print(X_test_tfidf.shape)

(2111,)
(2111, 62382)
(2111, 1419558)
(2111, 62382)


In [None]:
print(X_train_glove.shape)
print(X_test_glove.shape)

(8444, 200)
(2111, 200)


Great! Now we can combine our features below!

## 4. Compile Variables: hstack

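We can simply use `hstack` to combine all features column-wise. Because the Bag of Words, N-Gram, and TF-IDF matrices are sparse, we use the `scipy.sparse` version of `hstack` rather than the NumPy one. Since we are doing a feature ablation study for some of the models, we will make several versions with different combinations of features.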

References:
- https://numpy.org/doc/stable/reference/generated/numpy.hstack.html
- https://stackoverflow.com/questions/54560836/how-to-combine-text-features-and-categorical-features-in-python

In [None]:
import scipy.sparse as sp
from scipy.sparse import hstack
# All
X_train_all = hstack((X_train_bow, X_train_ngram, X_train_tfidf, X_train_glove))
X_test_all = hstack((X_test_bow, X_test_ngram, X_test_tfidf, X_test_glove))

# Everything but Bag of Words
X_train_all_but_bow = hstack((X_train_ngram, X_train_tfidf, X_train_glove))
X_test_all_but_bow = hstack((X_test_ngram, X_test_tfidf, X_test_glove))

# Everything but n-gram
X_train_all_but_ngram = hstack((X_train_bow, X_train_tfidf, X_train_glove))
X_test_all_but_ngram = hstack((X_test_bow, X_test_tfidf, X_test_glove))

# Everything but tfidf
X_train_all_but_tfidf = hstack((X_train_bow, X_train_ngram, X_train_glove))
X_test_all_but_tfidf = hstack((X_test_bow, X_test_ngram, X_test_glove))

# Everything but GloVe
X_train_all_but_glove = hstack((X_train_bow, X_train_ngram, X_train_tfidf))
X_test_all_but_glove = hstack((X_test_bow, X_test_ngram, X_test_tfidf))
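As a small illustration of what the stacking above does, the sketch below combines a toy sparse count matrix with a toy dense embedding block. The `toy_*` arrays are hypothetical stand-ins, and wrapping the dense block in `csr_matrix` is a precaution taken here rather than something the cells above require.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

toy_counts = csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))  # stands in for a BoW matrix
toy_embed = np.array([[0.1, 0.2], [0.3, 0.4]])             # stands in for averaged GloVe vectors

# Stack column-wise; hstack returns COO format by default, so convert to CSR
toy_all = hstack((toy_counts, csr_matrix(toy_embed))).tocsr()

# Column counts add up: 3 count columns + 2 embedding columns
print(toy_all.shape)
```

The number of rows must agree across all blocks, which is why the shape checks in the previous section matter before stacking.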

## 5. Model Training: train different classifiers and perform feature ablation studies

Now we can train several classifiers to identify which model will have the highest level of accuracy.

General Classifiers:
  - Logistic regression
  - SVC (scikit-learn's support vector machine classifier)

Ensemble classifiers:
  - Random forest
  - Gradient boosted trees

Neural Network:
  - Feed-forward neural network

References for the following section:

- Dr. Abhijit Mishra's examples in lab assignments (particularly, the function that trains classifiers, including how to apply the function properly)
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

### A. Classification

Let's set up a function that trains any classification model. Thank you Dr. Abhijit for this example, from our practicum lab!

In [None]:
def train_and_evaluate_classifier(classifier, X_train, y_actual, X_test, y_test_actual):
  classifier.fit(X_train, y_actual)
  y_pred = classifier.predict(X_test)
  accuracy = accuracy_score(y_test_actual, y_pred)
  return accuracy
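Accuracy is the only metric the function above returns. For a three-class problem it can also help to look at per-class behavior, so here is a hedged sketch of a companion helper. The name `evaluate_predictions` and the toy labels are hypothetical and not part of the practicum code.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def evaluate_predictions(y_true, y_pred, labels=(0, 1, 2)):
    """Return accuracy, macro-averaged F1, and the confusion matrix."""
    label_list = list(labels)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, labels=label_list, average="macro"),
        # Rows are true classes, columns are predicted classes
        "confusion": confusion_matrix(y_true, y_pred, labels=label_list),
    }

# Toy labels for illustration only
toy_true = [0, 1, 2, 2, 1, 0]
toy_pred = [0, 1, 2, 1, 1, 0]
toy_report = evaluate_predictions(toy_true, toy_pred)
print(toy_report["confusion"])
```

In the notebook, this could be called with `y_test` and the output of `classifier.predict(X_test_all)` after training.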

Now we import our classic models:

In [None]:
# Import Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

#### i. Logistic Regression

We will train our Logistic Regression. As you can see below, the first run reached the default maximum number of iterations before converging, so we increased `max_iter` and trained again.

In [None]:
LogReg_classifier = LogisticRegression()
accuracy = train_and_evaluate_classifier(LogReg_classifier, X_train_all, y_train, X_test_all, y_test)
print (f"Accuracy of Logistic Regression = {accuracy*100}%")

Accuracy of Logistic Regression = 85.31501657982%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
# Increase max_iter
LogReg_classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(LogReg_classifier, X_train_all, y_train, X_test_all, y_test)
print (f"Accuracy of Logistic Regression = {accuracy*100}%")

Accuracy of Logistic Regression = 85.36238749407865%


Now, we can perform a feature ablation study, removing one feature at a time to figure out which configuration achieves the highest accuracy.

In [None]:
# Without BOW
LogReg_classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(LogReg_classifier, X_train_all_but_bow, y_train, X_test_all_but_bow, y_test)
print (f"Accuracy of Logistic Regression without BOW = {accuracy*100}%")

Accuracy of Logistic Regression without BOW = 85.78872572240644%


In [None]:
# Without N-Gram
LogReg_classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(LogReg_classifier, X_train_all_but_ngram, y_train, X_test_all_but_ngram, y_test)
print (f"Accuracy of Logistic Regression without N-gram = {accuracy*100}%")

Accuracy of Logistic Regression without N-gram = 84.60445286594032%


In [None]:
# Without TFIDF
LogReg_classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(LogReg_classifier, X_train_all_but_tfidf, y_train, X_test_all_but_tfidf, y_test)
print (f"Accuracy of Logistic Regression without TF-IDF = {accuracy*100}%")

Accuracy of Logistic Regression without TF-IDF = 85.22027475130271%


In [None]:
# Without GloVE
LogReg_classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(LogReg_classifier, X_train_all_but_glove, y_train, X_test_all_but_glove, y_test)
print (f"Accuracy of Logistic Regression without GloVe = {accuracy*100}%")

Accuracy of Logistic Regression without GloVe = 85.36238749407865%
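The ablation cells above repeat the same pattern four times. As a sketch of how that could be expressed as a loop, the helper below trains once with all feature blocks and once with each block left out. The function name `ablation_accuracies`, the block names, and the toy data are all hypothetical; in the notebook the dictionaries would hold the BoW, N-Gram, TF-IDF, and GloVe matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def ablation_accuracies(make_classifier, train_blocks, y_tr, test_blocks, y_te):
    """Train with all feature blocks, then with each block left out in turn."""
    results = {}
    for left_out in [None, *train_blocks]:
        kept = [name for name in train_blocks if name != left_out]
        # Wrap every block as sparse so dense and sparse blocks stack the same way
        X_tr = hstack([csr_matrix(train_blocks[n]) for n in kept]).tocsr()
        X_te = hstack([csr_matrix(test_blocks[n]) for n in kept]).tocsr()
        clf = make_classifier()  # fresh, untrained classifier for every run
        clf.fit(X_tr, y_tr)
        key = "all" if left_out is None else f"all_but_{left_out}"
        results[key] = accuracy_score(y_te, clf.predict(X_te))
    return results

# Toy data: one informative block and one pure-noise block
rng = np.random.default_rng(0)
y_toy = np.array([0, 1] * 30)
toy_blocks = {
    "signal": y_toy.reshape(-1, 1).astype(float),
    "noise": rng.normal(size=(60, 3)),
}
toy_train = {k: v[:40] for k, v in toy_blocks.items()}
toy_test = {k: v[40:] for k, v in toy_blocks.items()}

toy_scores = ablation_accuracies(
    lambda: LogisticRegression(max_iter=1000),
    toy_train, y_toy[:40], toy_test, y_toy[40:],
)
print(toy_scores)
```

Passing a factory such as `lambda: SVC(kernel="linear")` instead would run the same study for another classifier.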


#### ii. SVC

Let's do the same with our SVC classifier!

In [None]:
SVC_classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(SVC_classifier, X_train_all, y_train, X_test_all, y_test)
print(f"Accuracy of Support Vector Classification = {accuracy*100}%")

Accuracy of Support Vector Classification = 83.13595452392231%


In [None]:
# Without BOW
SVC_classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(SVC_classifier, X_train_all_but_bow, y_train, X_test_all_but_bow, y_test)
print(f"Accuracy of Support Vector Classification without BOW = {accuracy*100}%")

Accuracy of Support Vector Classification without BOW = 83.89388915206062%


In [None]:
# Without N-Gram
SVC_classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(SVC_classifier, X_train_all_but_ngram, y_train, X_test_all_but_ngram, y_test)
print(f"Accuracy of Support Vector Classification without N-gram = {accuracy*100}%")

Accuracy of Support Vector Classification without N-gram = 81.47797252486974%


In [None]:
# Without TFIDF
SVC_classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(SVC_classifier, X_train_all_but_tfidf, y_train, X_test_all_but_tfidf, y_test)
print(f"Accuracy of Support Vector Classification without TF-IDF = {accuracy*100}%")

Accuracy of Support Vector Classification without TF-IDF = 82.94647086688774%


In [None]:
# Without GloVE
SVC_classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(SVC_classifier, X_train_all_but_glove, y_train, X_test_all_but_glove, y_test)
print(f"Accuracy of Support Vector Classification without GloVe = {accuracy*100}%")

Accuracy of Support Vector Classification without GloVe = 83.27806726669826%


### B. Ensemble

Now, we can move on to ensemble classifiers, which should, in theory, be more accurate because they combine the predictions of many individual models (here, decision trees). We will use all features for both the RandomForest and GradientBoosting classifiers, because the ablation studies above did not show a drastic difference in accuracy for our general classifiers. Given more time, we could test other feature combinations with our ensemble models.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
RF_Classifier = RandomForestClassifier()
accuracy = train_and_evaluate_classifier(RF_Classifier, X_train_all, y_train, X_test_all, y_test)
print(f"Accuracy of Random Forest = {accuracy*100}%")

Accuracy of Random Forest = 85.88346755092373%


In [None]:
RF_Classifier = RandomForestClassifier(n_estimators=100, random_state=50)
accuracy = train_and_evaluate_classifier(RF_Classifier, X_train_all, y_train, X_test_all, y_test)
print(f"Accuracy of Random Forest = {accuracy*100}%")

Accuracy of Random Forest = 85.45712932259593%


In [None]:
GB_Classifier = GradientBoostingClassifier()  # class name as imported above
accuracy = train_and_evaluate_classifier(GB_Classifier, X_train_all, y_train, X_test_all, y_test)
print(f"Accuracy of Gradient Boosting = {accuracy*100}%")

### C. Feed-Forward Network

Finally, we will try the MLPClassifier neural network. The first iteration below was interrupted due to computation errors.

In [None]:

from sklearn.neural_network import MLPClassifier

In [None]:
classifier = MLPClassifier(random_state=1, max_iter=300)
accuracy = train_and_evaluate_classifier(classifier, X_train_all, y_train, X_test_all, y_test)
print(f"Accuracy of MLP Classification = {accuracy*100}%")

Accuracy of MLP Classification = 88.25201326385599%


In [None]:
classifier = MLPClassifier(random_state=1, max_iter=300)
accuracy = train_and_evaluate_classifier(classifier, X_train_all, y_train, X_test_all, y_test)
print(f"Accuracy of MLP Classification = {accuracy*100}%")

Accuracy of MLP Classification = 87.44670772145902%


Since the model took three hours to train, we save it to disk, download it, and re-upload it into another Colab session whenever we need it. I followed the joblib documentation to figure this part out.

In [None]:
!pip install joblib



In [None]:
# Save model
import joblib

model_filename = 'mlp_news_classifier_model.joblib'
joblib.dump(classifier, model_filename)
print(f"Trained model saved to {model_filename}")

Trained model saved to mlp_news_classifier_model.joblib


In [None]:
loaded_classifier = joblib.load('/content/mlp_news_classifier_model.joblib')


Now, once we've loaded in our classifier, we can simply `.predict()` our featurized chatbot outputs to see if they are labeled as left, moderate, or right leaning.

In [None]:
predictions = loaded_classifier.predict(X_new)
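The save-and-reload round trip can be sketched end to end on a toy model. One optional variation shown here, which is an assumption and not what the cells above do, is to store the fitted vectorizer together with the classifier so that a later session does not need to refit it. The toy documents, labels, and the `toy_bundle.joblib` filename are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data for illustration only
toy_docs = ["tax cut border", "climate union healthcare",
            "tax border wall", "climate healthcare reform"]
toy_labels = [1, 2, 1, 2]

toy_vec = CountVectorizer()
toy_clf = LogisticRegression(max_iter=1000).fit(toy_vec.fit_transform(toy_docs), toy_labels)

# Save the vectorizer and classifier together in one file
bundle_path = os.path.join(tempfile.mkdtemp(), "toy_bundle.joblib")
joblib.dump({"vectorizer": toy_vec, "classifier": toy_clf}, bundle_path)

# Later (for example, in a new session): load and predict on new text
restored = joblib.load(bundle_path)
new_docs = ["border tax plan"]
restored_pred = restored["classifier"].predict(restored["vectorizer"].transform(new_docs))
print(restored_pred)
```

The reloaded objects should give the same predictions as the originals, as long as the same library versions are used to save and load.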


# TEST ON CHATGPT DATA

In this portion, I pull in the responses that we gathered by prompting ChatGPT and stored in .csv files.

## 6. Import and preprocess (SAME AS TRAIN DATA) ChatGPT text

We have three files, one for each prompting style (detailed in report). Let's load them in! We will keep distinctions between different prompting methods (natural, summary, and detailed) throughout this entire process because we want to be able to perform analysis on our prompting methods at the end.

In [None]:
import pandas as pd

path_natural = '/content/Prompt Responses - USE Natural Prompt.csv'
path_summary = '/content/Prompt Responses - USE_ Summary Prompt.csv'
path_detailed = '/content/Prompt Responses - USE Detailed Explanation.csv'

chatgpt_natural_df = pd.read_csv(path_natural)
chatgpt_summary_df = pd.read_csv(path_summary)
chatgpt_detailed_df = pd.read_csv(path_detailed)

In [None]:
# print(chatgpt_natural_df['ChatGPT Response'].dtype)

In [None]:
chatgpt_natural_df.head()

Unnamed: 0,Prompt,News Type,Year,Topic,ChatGPT Response
0,What was Hurricane Dorian?,Event,2019,Hurricane Dorian,\r\nHurricane Dorian was a powerful and devast...
1,What happened to the Notre Dame Cathedral in 2...,Event,2019,Notre Dame Cathedral,"The Notre-Dame Cathedral in Paris, France, suf..."
2,What happened at the Women's World Cup in 2019?,Event,2019,Women's World Cup,\r\nThe 2019 FIFA Women's World Cup took place...
3,What happened during the Area 51 raid?,Event,2019,Area 51 raid,As of my last knowledge update in January 2022...
4,What happened during the 2019 Area 51 raid?,Event,2019,Area 51 raid,"\r\nThe ""Storm Area 51"" event that gained atte..."


Now we clean our data in the **exact same way as we cleaned the data training our classifier model**. It is critical that we are consistent, because our model knows how to "recognize" a very specific structure of data.

So again, we use regular expressions to remove abnormalities. I did not include removing websites, hashtags, and mentions, because ChatGPT does not create those.

In [None]:
# Keep only Roman letters, apostrophes, and whitespace
not_roman = r"[^a-zA-Z'\s]" # matches anything that is not a letter, an apostrophe, or whitespace (so digits and punctuation are removed)
# regex=True is passed explicitly so the pattern is treated as a regular expression in every pandas version
chatgpt_natural_df['processed'] = chatgpt_natural_df['ChatGPT Response'].str.replace(not_roman, " ", regex=True)
chatgpt_summary_df['processed'] = chatgpt_summary_df['ChatGPT Response'].str.replace(not_roman, " ", regex=True)
chatgpt_detailed_df['processed'] = chatgpt_detailed_df['ChatGPT Response'].str.replace(not_roman, " ", regex=True)


In [None]:
# Get rid of random hyphens
hyphen = r'-'
chatgpt_natural_df['processed'] = chatgpt_natural_df['processed'].str.replace(hyphen, " ")
chatgpt_summary_df['processed'] = chatgpt_summary_df['processed'].str.replace(hyphen, " ")
chatgpt_detailed_df['processed'] = chatgpt_detailed_df['processed'].str.replace(hyphen, " ")

In [None]:
# Make sure everything has only one space between it
space = r'\s\s+' # selects any run of two or more whitespace characters, including line breaks
# regex=True is passed explicitly so the pattern is treated as a regular expression in every pandas version
chatgpt_natural_df['processed'] = chatgpt_natural_df['processed'].str.replace(space, " ", regex=True)
chatgpt_summary_df['processed'] = chatgpt_summary_df['processed'].str.replace(space, " ", regex=True)
chatgpt_detailed_df['processed'] = chatgpt_detailed_df['processed'].str.replace(space, " ", regex=True)


In [None]:
chatgpt_natural_df.head()

Unnamed: 0,Prompt,News Type,Year,Topic,ChatGPT Response,processed
0,What was Hurricane Dorian?,Event,2019,Hurricane Dorian,\r\nHurricane Dorian was a powerful and devast...,Hurricane Dorian was a powerful and devastati...
1,What happened to the Notre Dame Cathedral in 2...,Event,2019,Notre Dame Cathedral,"The Notre-Dame Cathedral in Paris, France, suf...",The Notre Dame Cathedral in Paris France suffe...
2,What happened at the Women's World Cup in 2019?,Event,2019,Women's World Cup,\r\nThe 2019 FIFA Women's World Cup took place...,The FIFA Women's World Cup took place in Fran...
3,What happened during the Area 51 raid?,Event,2019,Area 51 raid,As of my last knowledge update in January 2022...,As of my last knowledge update in January the ...
4,What happened during the 2019 Area 51 raid?,Event,2019,Area 51 raid,"\r\nThe ""Storm Area 51"" event that gained atte...",The Storm Area event that gained attention in...
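To see what the cleaning steps above do to a single response, here is a minimal sketch on a made-up string. The `toy` Series is hypothetical, and `regex=True` is passed explicitly because the default for `str.replace` differs between pandas versions.

```python
import pandas as pd

# Hypothetical response with line breaks, digits, punctuation, and extra spaces
toy = pd.Series(["\r\nThe 2019 \"Storm Area-51\" event:  Women's   Cup!"])

not_roman = r"[^a-zA-Z'\s]"  # anything that is not a letter, apostrophe, or whitespace
extra_space = r"\s\s+"       # runs of two or more whitespace characters

cleaned = (
    toy.str.replace(not_roman, " ", regex=True)
       .str.replace(extra_space, " ", regex=True)
)
print(cleaned[0].split())
```

Digits and punctuation disappear while the apostrophe in "Women's" is kept, which matches what we see in the `processed` column above.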


Now, we tokenize, remove stopwords, and lemmatize, as we did with our training data.

In [None]:
import nltk # Import nltk library
nltk.download('punkt') # Download the `punkt` tokenizer models that word_tokenize relies on
from nltk.tokenize import word_tokenize # Import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
chatgpt_natural_df['tokens'] = (chatgpt_natural_df['processed'].str.lower()).apply(word_tokenize)
chatgpt_summary_df['tokens'] = (chatgpt_summary_df['processed'].str.lower()).apply(word_tokenize)
chatgpt_detailed_df['tokens'] = (chatgpt_detailed_df['processed'].str.lower()).apply(word_tokenize)

In [None]:
chatgpt_natural_df.head()

Unnamed: 0,Prompt,News Type,Year,Topic,ChatGPT Response,processed,tokens
0,What was Hurricane Dorian?,Event,2019,Hurricane Dorian,\r\nHurricane Dorian was a powerful and devast...,Hurricane Dorian was a powerful and devastati...,"[hurricane, dorian, was, a, powerful, and, dev..."
1,What happened to the Notre Dame Cathedral in 2...,Event,2019,Notre Dame Cathedral,"The Notre-Dame Cathedral in Paris, France, suf...",The Notre Dame Cathedral in Paris France suffe...,"[the, notre, dame, cathedral, in, paris, franc..."
2,What happened at the Women's World Cup in 2019?,Event,2019,Women's World Cup,\r\nThe 2019 FIFA Women's World Cup took place...,The FIFA Women's World Cup took place in Fran...,"[the, fifa, women, 's, world, cup, took, place..."
3,What happened during the Area 51 raid?,Event,2019,Area 51 raid,As of my last knowledge update in January 2022...,As of my last knowledge update in January the ...,"[as, of, my, last, knowledge, update, in, janu..."
4,What happened during the 2019 Area 51 raid?,Event,2019,Area 51 raid,"\r\nThe ""Storm Area 51"" event that gained atte...",The Storm Area event that gained attention in...,"[the, storm, area, event, that, gained, attent..."


In [None]:
nltk.download('stopwords')

from nltk.corpus import stopwords # Import stopwords
stop_words = set(stopwords.words('english')) # Set stopwords to 'English'

print(stop_words) # Did we do this right??

{'being', 'ma', 'should', 'before', 'yourselves', 'my', 'up', 'which', 'am', "needn't", 'any', 'we', 'herself', "mightn't", "shan't", 'him', 'own', 'needn', 'each', "should've", 'all', "you'd", 'they', 'below', 'have', 'off', 'who', 'has', 'aren', "doesn't", "hasn't", "you'll", "you're", 'm', 'your', 've', "don't", 'themselves', 'same', 'ain', "weren't", 'no', 'this', 'just', "it's", 'she', 'too', 'of', "you've", 'been', 'until', 'through', 'on', 'that', 'over', 'will', 'don', "hadn't", 'against', 'into', 'd', 'weren', 'out', 'do', 'after', 'both', 'down', "she's", 'y', "that'll", 'how', 'itself', 'why', 'the', 'because', 'ours', 'are', 'nor', 'myself', 'more', 'under', 'other', "wasn't", 'yours', 'while', 'between', 'hers', 'and', 'yourself', 'then', 'for', 'does', "didn't", 'ourselves', 'its', "shouldn't", 'there', 'but', 'mustn', 'did', 'didn', 'where', 'them', 'wouldn', 'once', 'these', 'hadn', 'is', 'haven', 'only', 'you', 'couldn', 'such', 'very', 't', 'from', 'o', 'some', 'were'

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# We can use .apply() with a lambda to iterate through each tokenized row and remove stopwords
chatgpt_natural_df['tokens'] = chatgpt_natural_df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
chatgpt_summary_df['tokens'] = chatgpt_summary_df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
chatgpt_detailed_df['tokens'] = chatgpt_detailed_df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])

In [None]:
chatgpt_summary_df.head()

Unnamed: 0,Prompt,News Type,Year,Topic,ChatGPT Response,processed,tokens
0,Please provide a one paragraph summary of each...,Event,2019,Hurricane Dorian,"Hurricane Dorian, a powerful Category 5 storm,...",Hurricane Dorian a powerful Category storm str...,"[hurricane, dorian, powerful, category, storm,..."
1,Please provide a one paragraph summary of each...,Event,2019,Notre Dame Cathedral,The Notre Dame Cathedral in Paris suffered a d...,The Notre Dame Cathedral in Paris suffered a d...,"[notre, dame, cathedral, paris, suffered, deva..."
2,Please provide a one paragraph summary of each...,Event,2019,Women's World Cup,The Women's World Cup is a quadrennial interna...,The Women's World Cup is a quadrennial interna...,"[women, 's, world, cup, quadrennial, internati..."
3,Please provide a one paragraph summary of each...,Event,2019,Area 51 raid,"In September 2019, a viral social media event ...",In September a viral social media event called...,"[september, viral, social, media, event, calle..."
4,Please provide a one paragraph summary of each...,Event,2019,Copa America,Copa America is the oldest international footb...,Copa America is the oldest international footb...,"[copa, america, oldest, international, footbal..."


In [None]:
# Lemmatize function with help from practicum 2
# Import wordnet, the Lemmatizer, and set it equal to a variable

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
chatgpt_natural_df['tokens'] = chatgpt_natural_df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
chatgpt_summary_df['tokens'] = chatgpt_summary_df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
chatgpt_detailed_df['tokens'] = chatgpt_detailed_df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
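Because the ChatGPT text has to go through the same steps as the training text, one option is to collect the steps into a single function that both datasets call. The sketch below is an assumption about how that could look, not the code used in this notebook. The name `preprocess` is hypothetical, and the tokenizer, stop word set, and lemmatizer are passed in as arguments so the example runs with simple stand-ins.

```python
import re

def preprocess(text, tokenize, stop_words, lemmatize):
    """Clean, lowercase, tokenize, remove stop words, and lemmatize one document."""
    text = re.sub(r"[^a-zA-Z'\s]", " ", text)  # keep letters, apostrophes, whitespace
    text = re.sub(r"\s\s+", " ", text)         # collapse repeated whitespace
    tokens = tokenize(text.lower())
    tokens = [tok for tok in tokens if tok not in stop_words]
    return [lemmatize(tok) for tok in tokens]

# Stand-ins for illustration only; they are much cruder than the NLTK tools
toy_stop_words = {"the", "were"}
toy_lemmatize = lambda word: word[:-1] if word.endswith("s") else word

toy_tokens = preprocess("The 2019 storms were powerful!", str.split, toy_stop_words, toy_lemmatize)
print(toy_tokens)
```

In this notebook, the arguments would be `word_tokenize`, `stop_words`, and `lemmatizer.lemmatize`, applied to each response with `.apply()`.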

In [None]:
# chatgpt_natural_df['tokens_str'] = chatgpt_natural_df['tokens'].apply(lambda token_list: [str(token) for token in token_list])
# chatgpt_summary_df['tokens_str'] = chatgpt_summary_df['tokens'].apply(lambda token_list: [str(token) for token in token_list])
# chatgpt_detailed_df['tokens_str'] = chatgpt_detailed_df['tokens'].apply(lambda token_list: [str(token) for token in token_list])

In [None]:
chatgpt_detailed_df.head()

Unnamed: 0,Prompt,Year,Topic,ChatGPT Response,processed,tokens
0,Please provide an 5 paragraph essay including ...,2019,Hurricane Dorian,Title: Unleashing Nature's Fury: Hurricane Dor...,Title Unleashing Nature's Fury Hurricane Doria...,"[title, unleashing, nature, 's, fury, hurrican..."
1,Please provide an 5 paragraph essay including ...,2019,Notre Dame Cathedral,Title: Notre Dame Cathedral: A Symbolic Marvel...,Title Notre Dame Cathedral A Symbolic Marvel E...,"[title, notre, dame, cathedral, symbolic, marv..."
2,Please provide an 5 paragraph essay including ...,2019,Women's World Cup,Title: The Empowering Impact of the 2019 Women...,Title The Empowering Impact of the Women's Wor...,"[title, empowering, impact, woman, 's, world, ..."
3,Please provide an 5 paragraph essay including ...,2019,Area 51 raid,Introduction:\r\n\r\nThe Area 51 Raid of 2019 ...,Introduction The Area Raid of captured the att...,"[introduction, area, raid, captured, attention..."
4,Please provide an 5 paragraph essay including ...,2019,Copa America,Title: Copa America 2019: A Football Extravaga...,Title Copa America A Football Extravaganza Unv...,"[title, copa, america, football, extravaganza,..."


Let's save backup .csv files!

In [None]:
# chatgpt_detailed_df['tokens']

In [None]:
# Download .csv
chatgpt_natural_df.to_csv('chatgpt_natural_processed.csv')
chatgpt_summary_df.to_csv('chatgpt_summary_processed.csv')
chatgpt_detailed_df.to_csv('chatgpt_detailed_processed.csv')

We can pull the tokens that we want to featurize:

In [None]:
X_natural = chatgpt_natural_df['tokens']
X_summary = chatgpt_summary_df['tokens']
X_detailed = chatgpt_detailed_df['tokens']

Remember, X_test was shaped like this ...

In [None]:
X_test

6809     ['ohio', 'police', 'department', 'took', 'face...
7144     ['major', 'issue', 'post', 'race', 'technical'...
7521     ['shlomit', 'malka', 'hospitalized', 'sunday',...
8174     ['reality', 'sometimes', 'fact', 'always', 'ge...
5264     ['come', 'solving', 'nation', 'complex', 'prob...
                               ...                        
207      ['sarah', 'b', 'boxer', 'nation', 'remains', '...
5623     ['pentagon', 'announced', 'monday', 'despite',...
3012     ['marching', 'band', "'s", 'expected', 'perfor...
10223    ['cnn', 'ridley', 'scott', "'s", 'blade', 'run...
7110     ['stormy', 'daniel', 'returning', 'porn', 'adu...
Name: text-processed, Length: 2111, dtype: object

## 7. Featurize, compile features, and check shapes (do they agree with the shapes our model was trained on?)

Let's be sure that each row of tokens is converted to a single string, as the rows of X_test were (X_test is a Series of strings with the object data type).

In [None]:
# TOKEN_TEST = chatgpt_natural_df['tokens']
X_natural = chatgpt_natural_df['tokens']
X_summary = chatgpt_summary_df['tokens']
X_detailed = chatgpt_detailed_df['tokens']

In [None]:
X_natural_str = X_natural.apply(lambda x: str(x))
X_summary_str = X_summary.apply(lambda x: str(x))
X_detailed_str = X_detailed.apply(lambda x: str(x))

Now, we get Bag of Words, N-Grams, and TF-IDF features for each prompting style, just like we did for our training data.

In [None]:
# Bag of Words
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_natural_bow = vectorizer.transform(X_natural_str)
X_summary_bow = vectorizer.transform(X_summary_str)
X_detailed_bow = vectorizer.transform(X_detailed_str)

In [None]:
# N-Grams
ngram_vectorizer = CountVectorizer(ngram_range=(1,2))
X_train_ngram = ngram_vectorizer.fit_transform(X_train)
X_natural_ngram = ngram_vectorizer.transform(X_natural_str)
X_summary_ngram = ngram_vectorizer.transform(X_summary_str)
X_detailed_ngram = ngram_vectorizer.transform(X_detailed_str)

In [None]:
# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_natural_tfidf = tfidf_vectorizer.transform(X_natural_str)
X_summary_tfidf = tfidf_vectorizer.transform(X_summary_str)
X_detailed_tfidf = tfidf_vectorizer.transform(X_detailed_str)

GloVe

Now, we can pull GloVe vectors, again, as we did for our training data.

In [None]:
import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained GloVe embeddings
word_vectors = KeyedVectors.load_word2vec_format('glove.6B.200d.vec', binary=False)

In [None]:
# Make a function that averages GloVe vectors
# From practicum 5
def get_average_glove_vector(text):
    vectors = [word_vectors[word] for word in text if word in word_vectors]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(word_vectors.vector_size)

In [None]:
# GloVE (apply earlier function)
X_train_glove = np.array([get_average_glove_vector(text) for text in X_train])
X_natural_glove = np.array([get_average_glove_vector(text) for text in X_natural_str])
X_summary_glove = np.array([get_average_glove_vector(text) for text in X_summary_str])
X_detailed_glove = np.array([get_average_glove_vector(text) for text in X_detailed_str])
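One detail worth checking here: each element of `X_natural_str` is a single string such as `"['storm', 'area']"`, and iterating over a Python string yields characters, not words. As written, `get_average_glove_vector` would therefore average the vectors of whichever individual characters appear in the GloVe vocabulary. The sketch below shows the difference on a toy embedding table, along with one possible way to recover the token list using `ast.literal_eval`. `toy_vectors` and `average_vector` are hypothetical stand-ins for `word_vectors` and our function.

```python
import ast
import numpy as np

# Toy embedding table; note that the single character 's' has its own vector
toy_vectors = {
    "storm": np.array([1.0, 0.0]),
    "area": np.array([0.0, 1.0]),
    "s": np.array([5.0, 5.0]),
}

def average_vector(tokens, vectors, size=2):
    """Average the vectors of every item in `tokens` that has an embedding."""
    found = [vectors[tok] for tok in tokens if tok in vectors]
    return np.mean(found, axis=0) if found else np.zeros(size)

stringified = str(["storm", "area"])  # "['storm', 'area']", like one row of X_natural_str

as_chars = average_vector(stringified, toy_vectors)                     # iterates over characters
as_tokens = average_vector(ast.literal_eval(stringified), toy_vectors)  # iterates over words

print(as_chars, as_tokens)
```

Since the training rows shown earlier are also stored as strings, the training and ChatGPT features are at least built consistently. Switching to word-level lookups would mean re-featurizing the training data and retraining the model.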


Let's check our shapes! Here is the shape of the data used to train our model:

In [None]:
print(X_train.shape)
print(X_train_bow.shape)
print(X_train_ngram.shape)
print(X_train_tfidf.shape)
print(X_train_glove.shape)

(8444,)
(8444, 62382)
(8444, 1419558)
(8444, 62382)
(8444, 200)


Here is for natural prompting:

In [None]:
print(X_natural_str.shape)
print(X_natural_bow.shape)
print(X_natural_ngram.shape)
print(X_natural_tfidf.shape)
print(X_natural_glove.shape)

(49,)
(49, 62382)
(49, 1419558)
(49, 62382)
(49, 200)


Here is for summary style prompting:

In [None]:
print(X_summary_str.shape)
print(X_summary_bow.shape)
print(X_summary_ngram.shape)
print(X_summary_tfidf.shape)
print(X_summary_glove.shape)

(47,)
(47, 62382)
(47, 1419558)
(47, 62382)
(47, 200)


And here is for detailed explanatory prompting:

In [None]:
print(X_detailed_str.shape)
print(X_detailed_bow.shape)
print(X_detailed_ngram.shape)
print(X_detailed_tfidf.shape)
print(X_detailed_glove.shape)

(36,)
(36, 62382)
(36, 1419558)
(36, 62382)
(36, 200)


Let's stack all features, given that we trained our neural net on all features, and check the shape.

In [None]:
import scipy.sparse as sp
from scipy.sparse import hstack

X_train_all = hstack((X_train_bow, X_train_ngram, X_train_tfidf, X_train_glove))
X_natural_all = hstack((X_natural_bow, X_natural_ngram, X_natural_tfidf, X_natural_glove))
X_summary_all = hstack((X_summary_bow, X_summary_ngram, X_summary_tfidf, X_summary_glove))
X_detailed_all = hstack((X_detailed_bow, X_detailed_ngram, X_detailed_tfidf, X_detailed_glove))

In [None]:
print(X_train_all.shape)
print(X_natural_all.shape)
print(X_summary_all.shape)
print(X_detailed_all.shape)

(8444, 1544522)
(49, 1544522)
(47, 1544522)
(36, 1544522)
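The same comparison can be made programmatically against the loaded model, since fitted scikit-learn estimators record the number of input columns in `n_features_in_`. This is a minimal sketch; `check_feature_width` and the toy model are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def check_feature_width(model, X):
    """Raise an error if X does not have the column count the model was trained on."""
    expected = model.n_features_in_
    actual = X.shape[1]
    if actual != expected:
        raise ValueError(f"Model expects {expected} features, got {actual}")
    return True

# Toy model trained on two features
toy_X = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
toy_model = LogisticRegression().fit(toy_X, [0, 1, 0, 1])

width_ok = check_feature_width(toy_model, np.zeros((3, 2)))
print(width_ok)
```

In the notebook, the equivalent call would be `check_feature_width(loaded_classifier, X_natural_all)`, and likewise for the summary and detailed matrices.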


In [None]:
# type(X_test[1][1])

In [None]:
type(X_natural[1][1])

str

## 9. Predict

Finally, we can make predictions for each prompt type! Let's input our featurized data into our loaded neural net classifier and output an array of predictions, with classification numbers that represent the predicted classification.

### A. Natural

In [None]:
natural_predictions = loaded_classifier.predict(X_natural_all)

In [None]:
natural_predictions

array([2, 2, 0, 0, 0, 2, 1, 1, 0, 1, 0, 0, 2, 0, 1, 2, 2, 2, 0, 0, 1, 2,
       1, 2, 0, 0, 1, 0, 2, 0, 1, 2, 1, 2, 2, 1, 0, 1, 2, 2, 2, 0, 1, 1,
       2, 0, 2, 0, 0])

### B. Summary

In [None]:
summary_predictions = loaded_classifier.predict(X_summary_all)

In [None]:
summary_predictions

array([2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 2, 2, 0, 2, 0, 0, 0, 0, 0, 1,
       2, 0, 0, 0, 0, 0, 0, 2, 2, 1, 2, 0, 0, 2, 2, 2, 0, 0, 2, 0, 0, 0,
       2, 1, 0])

### C. Detailed

In [None]:
detailed_predictions = loaded_classifier.predict(X_detailed_all)

In [None]:
detailed_predictions

array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 0, 0, 0, 1,
       2, 2, 0, 1, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2])

## 10. Store results

Then, we can store our predictions into dataframes that will allow us to compare the predictions to the affiliated text.

In [None]:
chatgpt_natural_df['predictions'] = natural_predictions
chatgpt_summary_df['predictions'] = summary_predictions
chatgpt_detailed_df['predictions'] = detailed_predictions

In [None]:
chatgpt_detailed_df.tail()

Unnamed: 0,Prompt,Year,Topic,ChatGPT Response,processed,tokens,predictions,label
31,Please provide an 5 paragraph essay including ...,2019,Iran's Nuclear Deal (topic 21),Title: The Iran Nuclear Deal: An Analysis of i...,Title The Iran Nuclear Deal An Analysis of its...,"[title, iran, nuclear, deal, analysis, origin,...",1,conservative
32,Please provide an 5 paragraph essay including ...,2018,Russia and Trump (topic 20),Title: Unraveling the Complex Nexus: Russia an...,Title Unraveling the Complex Nexus Russia and ...,"[title, unraveling, complex, nexus, russia, tr...",2,liberal
33,Please provide an 5 paragraph essay including ...,2017-2019,Transgender Policy (topic 8),Title: Navigating the Waves of Transgender Pol...,Title Navigating the Waves of Transgender Poli...,"[title, navigating, wave, transgender, policy,...",2,liberal
34,Please provide an 5 paragraph essay including ...,2017-2019,"Police Brutality (topic 14, 23, 25, etc.)",Title: Unveiling the Layers of Police Brutalit...,Title Unveiling the Layers of Police Brutality...,"[title, unveiling, layer, police, brutality, c...",2,liberal
35,Please provide an 5 paragraph essay including ...,2017-2019,"Police Shootings (topic 14, 23, 25, etc.)",Title: Police Shootings in the United States (...,Title Police Shootings in the United States Un...,"[title, police, shooting, united, state, unrav...",2,liberal


For readability, we will convert numerical predictions (0, 1, 2) to human-language labels (moderate, conservative, liberal).

In [None]:
# Map each numeric prediction to a readable label in a new 'label' column
label_convert_key = {0:'moderate', 1:'conservative', 2:'liberal'}
chatgpt_natural_df['label'] = chatgpt_natural_df['predictions'].map(label_convert_key)
chatgpt_summary_df['label'] = chatgpt_summary_df['predictions'].map(label_convert_key)
chatgpt_detailed_df['label'] = chatgpt_detailed_df['predictions'].map(label_convert_key)
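
One caveat with `Series.map` is that any prediction missing from the dictionary silently becomes `NaN` instead of raising an error. A minimal sketch of a guard against that, using a small made-up dataframe (`toy_df` is hypothetical, not one of the dataframes above):

```python
import pandas as pd

label_convert_key = {0: 'moderate', 1: 'conservative', 2: 'liberal'}

# Hypothetical stand-in for one of the prediction dataframes
toy_df = pd.DataFrame({'predictions': [0, 1, 2, 2, 0]})
toy_df['label'] = toy_df['predictions'].map(label_convert_key)

# map() yields NaN for values absent from the dictionary, so count them
unmapped = int(toy_df['label'].isna().sum())
assert unmapped == 0, f"{unmapped} predictions had no matching label"
print(toy_df)
```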

In [None]:
chatgpt_natural_df.head()

Unnamed: 0,Prompt,News Type,Year,Topic,ChatGPT Response,processed,tokens,predictions,label
0,What was Hurricane Dorian?,Event,2019,Hurricane Dorian,\r\nHurricane Dorian was a powerful and devast...,Hurricane Dorian was a powerful and devastati...,"[hurricane, dorian, powerful, devastating, tro...",2,liberal
1,What happened to the Notre Dame Cathedral in 2...,Event,2019,Notre Dame Cathedral,"The Notre-Dame Cathedral in Paris, France, suf...",The Notre Dame Cathedral in Paris France suffe...,"[notre, dame, cathedral, paris, france, suffer...",2,liberal
2,What happened at the Women's World Cup in 2019?,Event,2019,Women's World Cup,\r\nThe 2019 FIFA Women's World Cup took place...,The FIFA Women's World Cup took place in Fran...,"[fifa, woman, 's, world, cup, took, place, fra...",0,moderate
3,What happened during the Area 51 raid?,Event,2019,Area 51 raid,As of my last knowledge update in January 2022...,As of my last knowledge update in January the ...,"[last, knowledge, update, january, proposed, a...",0,moderate
4,What happened during the 2019 Area 51 raid?,Event,2019,Area 51 raid,"\r\nThe ""Storm Area 51"" event that gained atte...",The Storm Area event that gained attention in...,"[storm, area, event, gained, attention, initia...",0,moderate


In [None]:
chatgpt_summary_df.head()

Unnamed: 0,Prompt,News Type,Year,Topic,ChatGPT Response,processed,tokens,predictions,label
0,Please provide a one paragraph summary of each...,Event,2019,Hurricane Dorian,"Hurricane Dorian, a powerful Category 5 storm,...",Hurricane Dorian a powerful Category storm str...,"[hurricane, dorian, powerful, category, storm,...",2,liberal
1,Please provide a one paragraph summary of each...,Event,2019,Notre Dame Cathedral,The Notre Dame Cathedral in Paris suffered a d...,The Notre Dame Cathedral in Paris suffered a d...,"[notre, dame, cathedral, paris, suffered, deva...",0,moderate
2,Please provide a one paragraph summary of each...,Event,2019,Women's World Cup,The Women's World Cup is a quadrennial interna...,The Women's World Cup is a quadrennial interna...,"[woman, 's, world, cup, quadrennial, internati...",2,liberal
3,Please provide a one paragraph summary of each...,Event,2019,Area 51 raid,"In September 2019, a viral social media event ...",In September a viral social media event called...,"[september, viral, social, medium, event, call...",0,moderate
4,Please provide a one paragraph summary of each...,Event,2019,Copa America,Copa America is the oldest international footb...,Copa America is the oldest international footb...,"[copa, america, oldest, international, footbal...",2,liberal


In [None]:
chatgpt_detailed_df.head()

Unnamed: 0,Prompt,Year,Topic,ChatGPT Response,processed,tokens,predictions,label
0,Please provide an 5 paragraph essay including ...,2019,Hurricane Dorian,Title: Unleashing Nature's Fury: Hurricane Dor...,Title Unleashing Nature's Fury Hurricane Doria...,"[title, unleashing, nature, 's, fury, hurrican...",0,moderate
1,Please provide an 5 paragraph essay including ...,2019,Notre Dame Cathedral,Title: Notre Dame Cathedral: A Symbolic Marvel...,Title Notre Dame Cathedral A Symbolic Marvel E...,"[title, notre, dame, cathedral, symbolic, marv...",1,conservative
2,Please provide an 5 paragraph essay including ...,2019,Women's World Cup,Title: The Empowering Impact of the 2019 Women...,Title The Empowering Impact of the Women's Wor...,"[title, empowering, impact, woman, 's, world, ...",0,moderate
3,Please provide an 5 paragraph essay including ...,2019,Area 51 raid,Introduction:\r\n\r\nThe Area 51 Raid of 2019 ...,Introduction The Area Raid of captured the att...,"[introduction, area, raid, captured, attention...",0,moderate
4,Please provide an 5 paragraph essay including ...,2019,Copa America,Title: Copa America 2019: A Football Extravaga...,Title Copa America A Football Extravaganza Unv...,"[title, copa, america, football, extravaganza,...",0,moderate


Once we've checked our data, we can export it to a .csv and save it for further analysis!

In [None]:
chatgpt_natural_df.to_csv('results_natural_df.csv')
chatgpt_summary_df.to_csv('results_summary_df.csv')
chatgpt_detailed_df.to_csv('results_detailed_df.csv')
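
By default, `to_csv` also writes the dataframe index, which typically reappears as an `Unnamed: 0` column when the file is read back without `index_col`. If that column is not wanted, passing `index=False` is one option. A minimal sketch of the difference, assuming a hypothetical dataframe (`toy_df`) and in-memory buffers in place of files:

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the results dataframes
toy_df = pd.DataFrame({'Topic': ['Hurricane Dorian', 'Copa America'],
                       'label': ['liberal', 'moderate']})

# Default behavior: the index is written as an unnamed first column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```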

## 11. Results

Here are the final counts of the predicted lean (of ChatGPT responses) for each prompt style.

In [None]:
chatgpt_natural_df['label'].value_counts()

liberal         18
moderate        18
conservative    13
Name: label, dtype: int64

In [None]:
chatgpt_summary_df['label'].value_counts()

moderate        26
liberal         18
conservative     3
Name: label, dtype: int64

In [None]:
chatgpt_detailed_df['label'].value_counts()

moderate        15
liberal         15
conservative     6
Name: label, dtype: int64
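
To compare the three prompt styles side by side, the counts above can be gathered into one table and converted to within-style shares. The sketch below hard-codes the counts from the `value_counts()` outputs rather than reading the notebook's dataframes; the names `counts` and `shares` are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() outputs above
counts = pd.DataFrame(
    {
        'natural': {'moderate': 18, 'conservative': 13, 'liberal': 18},
        'summary': {'moderate': 26, 'conservative': 3, 'liberal': 18},
        'detailed': {'moderate': 15, 'conservative': 6, 'liberal': 15},
    }
)

# Share of each label within a prompt style (each column sums to 1)
shares = counts / counts.sum(axis=0)

print(counts)
print(shares.round(2))
```

Because the three prompt styles have different numbers of responses (49, 47, and 36), shares are easier to compare across styles than raw counts.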