# Engineer a classifier and gather LLM classifications

# Table of Contents:

**ENGINEER CLASSIFICATION MODEL**

1. Pull in training data (cc_news from HuggingFace)
2. Preprocessing for features: BoW, N-Gram, TF-IDF, GloVe
3. Split Dataset: train-test-split
4. Featurize
5. Compile Variables: hstack
6. Model Training: train different classifiers and perform feature ablation studies  
  
  A. Classification (Logistic Regression, SVC)  
  B. Ensemble (Random Forest, Gradient Boost)  
  C. Feed-Forward Network

**TEST ON LLM DATA**

1. Import and preprocess LLM text
2. Predict and store


**PLEASE NOTE THAT ADDITIONAL SOURCES ARE CITED IN OUR AFFILIATED RESEARCH PAPER**

# ENGINEER CLASSIFICATION MODEL

Here, we will engineer a classification model that can classify ChatGPT responses as "right-," "center-," or "left-leaning."

## 1. Pull in training data (cc_news from HuggingFace)

Mount Google drive to pull in our data.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Read the .csv file of the dataset, which has been filtered to popular American news domains and given a label (moderate, conservative, or liberal).

In [None]:
import pandas as pd

path = 'dataset of news articles'
news_df = pd.read_csv(path)

In [None]:
news_df.head()

Unnamed: 0,title,text,domain,date,description,url,image_url,leaning-label
0,Cleveland Shooter Disowned By Family On Twitter,"Donald Harvey, also dubbed as the 'Angel of De...",www.yahoo.com,2017-04-17 05:27:27,Cleveland Police issued an aggravated murder w...,https://www.yahoo.com/news/cleveland-shooter-d...,https://s.yimg.com/uu/api/res/1.2/vZB0t9O5GGqs...,conservative
1,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,www.yahoo.com,2017-10-07 04:31:25,Movie company says it is taking claims “very s...,https://www.yahoo.com/movies/weinstein-company...,https://s.yimg.com/uu/api/res/1.2/4wtFnh7lUeYk...,conservative
2,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,www.yahoo.com,2017-02-15 15:22:55,Former NHL star Mike Comrie — once married to ...,https://www.yahoo.com/celebrity/hilary-duffs-e...,https://s.yimg.com/uu/api/res/1.2/..p2z00Och0J...,conservative
3,"At 117, Jamaican woman likely just became worl...","The world's oldest person Violet Brown, center...",www.yahoo.com,2017-04-17 22:55:18,"DUANVALE, Jamaica (AP) — Violet Brown spent mu...",https://www.yahoo.com/news/117-jamaican-woman-...,https://s.yimg.com/uu/api/res/1.2/Vd4NgTACWY1z...,conservative
4,"Mark Hamill's Carrie Fisher Tribute: ""Making H...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,www.yahoo.com,2017-01-02 00:00:00,"""She was a handful, but my life would have bee...",https://www.yahoo.com/movies/mark-hamills-carr...,https://s.yimg.com/uu/api/res/1.2/Ole1yyNg3gmL...,conservative


Check data counts.

In [None]:
news_df.count()

Unnamed: 0       10555
title            10555
text             10555
domain           10555
date             10545
description      10471
url              10555
image_url        10555
leaning-label    10555
dtype: int64

Check that one instance of the data looks good:

In [None]:
print(news_df['text'].iloc[0])

Donald Harvey, also dubbed as the 'Angel of Death,' used arsenic, rat poison and cyanide to kill patients at hospitals where he worked during 1970s and '80s.
Steve Stephens, 37, who has been accused of homicide Sunday of 74-year-old Ohio resident named Robert Godwin Sr., has been publicly disowned by his family, according to a Twitter post from his account. The shooting, which was streamed on Facebook Live, took place at 635 E. 93rd St. around 2 p.m. EDT.
The Twitter post on Stephens' account read: "We absolutely do not condone this type of behavior and this atrocity, therefore we do not consider Steve a part of this family. I would like everyone to refrain from posting pictures of our family in association with Steve, for we do not want our young ones to be burdened by this man. Please respect our privacy."
Cleveland Police Department issued an aggravated murder warrant against Stephens on Sunday night. They also alerted residents of Pennsylvania, New York, Indiana and Michigan as the

This is a classification problem of right, left, or moderate. Let's keep our choices in mind.

In [None]:
# choices = ["liberal", "conservative", "moderate"]

We're just looking at bodies of text because ChatGPT would likely be only looking at text. We can narrow down our dataframe to something more manageable.

In [None]:
news_features = news_df[['domain', 'date', 'title', 'description', 'text', 'leaning-label']]

In [None]:
news_features.head()

Unnamed: 0,domain,date,title,description,text,leaning-label
0,www.yahoo.com,2017-04-17 05:27:27,Cleveland Shooter Disowned By Family On Twitter,Cleveland Police issued an aggravated murder w...,"Donald Harvey, also dubbed as the 'Angel of De...",conservative
1,www.yahoo.com,2017-10-07 04:31:25,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,Movie company says it is taking claims “very s...,conservative
2,www.yahoo.com,2017-02-15 15:22:55,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,Former NHL star Mike Comrie — once married to ...,conservative
3,www.yahoo.com,2017-04-17 22:55:18,"At 117, Jamaican woman likely just became worl...","DUANVALE, Jamaica (AP) — Violet Brown spent mu...","The world's oldest person Violet Brown, center...",conservative
4,www.yahoo.com,2017-01-02 00:00:00,"Mark Hamill's Carrie Fisher Tribute: ""Making H...","""She was a handful, but my life would have bee...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,conservative


In [None]:
news_features.count()

domain           10555
date             10545
title            10555
description      10471
text             10555
leaning-label    10555
dtype: int64

In [None]:
# Nice!

Now we want to make sure we have a numerical leaning label for each text label. A numerical representation is easier to use when calculating our results.

In [None]:
# Add new column
label_key = {'moderate':0, 'conservative':1, 'liberal':2}
news_features['numerical-label'] = news_features['leaning-label'].map(label_key)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  news_features['numerical-label'] = news_features['leaning-label'].map(label_key)
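The warning above appears because `news_features` was created by selecting columns from `news_df`, so pandas cannot tell whether the new column is being written to a view or to a copy. A minimal sketch of one common way to avoid it, using a toy stand-in DataFrame (the `toy_df` and `toy_features` names and their contents are hypothetical): take an explicit `.copy()` before adding columns.

```python
import pandas as pd

# Toy stand-in for news_df (hypothetical data, for illustration only)
toy_df = pd.DataFrame({
    "text": ["a", "b", "c"],
    "leaning-label": ["moderate", "conservative", "liberal"],
    "url": ["u1", "u2", "u3"],
})

# An explicit copy makes the narrowed DataFrame independent of toy_df,
# so adding a column to it is unambiguous
toy_features = toy_df[["text", "leaning-label"]].copy()

label_key = {"moderate": 0, "conservative": 1, "liberal": 2}
toy_features["numerical-label"] = toy_features["leaning-label"].map(label_key)
```

Applied to this notebook, the same idea would mean adding `.copy()` where `news_features` is first created.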


Check some examples, as well as data.head().

In [None]:
news_features.iloc[1342]

domain                                                   www.bbc.com
date                                             2018-04-24 17:30:44
title                    Toronto van attack: Moment suspect arrested
description        Alek Minassia pleaded "kill me" and claimed to...
text               Video\nA man suspected of killing 10 people an...
leaning-label                                               moderate
numerical-label                                                    0
Name: 1342, dtype: object

In [None]:
news_features.head()

Unnamed: 0,domain,date,title,description,text,leaning-label,numerical-label
0,www.yahoo.com,2017-04-17 05:27:27,Cleveland Shooter Disowned By Family On Twitter,Cleveland Police issued an aggravated murder w...,"Donald Harvey, also dubbed as the 'Angel of De...",conservative,1
1,www.yahoo.com,2017-10-07 04:31:25,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,Movie company says it is taking claims “very s...,conservative,1
2,www.yahoo.com,2017-02-15 15:22:55,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,Former NHL star Mike Comrie — once married to ...,conservative,1
3,www.yahoo.com,2017-04-17 22:55:18,"At 117, Jamaican woman likely just became worl...","DUANVALE, Jamaica (AP) — Violet Brown spent mu...","The world's oldest person Violet Brown, center...",conservative,1
4,www.yahoo.com,2017-01-02 00:00:00,"Mark Hamill's Carrie Fisher Tribute: ""Making H...","""She was a handful, but my life would have bee...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,conservative,1


## 2. Preprocessing for features: BoW, N-Gram, TF-IDF, GloVe

Now, we need to featurize our data so we can feed it into our classifier during training. We will use four main features: Bag of Words, N-Gram, TF-IDF, and GloVe. Featurizing requires some basic preprocessing.

List of features and the affiliated preprocessing steps:
- BoW, N-Gram, TF-IDF:  
  - Clean and remove URLs, hashtags, etc.
  - Tokenize
  - Remove stop words
  - Lemmatize
- GloVe


First, let's remove URLs, mentions, hashtags, non-English text, and other outlying formatting issues. We will use regular expressions that we also harnessed in Assignment 3.

In [None]:

import re # Import regular expressions

# URLs (has http:// or www.)
# url_pattern = r'https?://\S+|www\.\S+' #Failed attempt
url_pattern = r'(https:\/\/|www.)[\S]+' # matches https:// or www. through any nonspace character
news_features['text-processed'] = news_features['text'].str.replace(url_pattern, "")

In [None]:
# Mentions (has the @ symbol)
mention_pattern = r'@[\S]+' # matches anything following an @ symbol
news_features['text-processed'] = news_features['text-processed'].str.replace(mention_pattern, "")

In [None]:
# Hashtags
hashtag_pattern = r'#[\S]+' # matches anything following a # symbol
news_features['text-processed'] = news_features['text-processed'].str.replace(hashtag_pattern, "")

In [None]:
# Non-English text (characters outside the Latin alphabet)
not_roman = r"[^a-zA-Z'\s]" # matches anything that is not a Latin letter, an apostrophe, or whitespace (this includes digits and punctuation)
news_features['text-processed'] = news_features['text-processed'].str.replace(not_roman, " ")

In [None]:
# # Remove numbers
# numbers = r'\d'
# news_features['text-processed'] = news_features['text-processed'].str.replace(numbers, "")

In [None]:
# Get rid of random hyphens
hyphen = r'-'
news_features['text-processed'] = news_features['text-processed'].str.replace(hyphen, " ")

In [None]:
# Make sure everything has only one space between it
space = r'\s\s+' # selects anything that is more than one space, including line breaks
news_features['text-processed'] = news_features['text-processed'].str.replace(space, " ")

Let's check that it worked:

In [None]:
news_features.head()

Unnamed: 0,domain,date,title,description,text,leaning-label,numerical-label,text-processed
0,www.yahoo.com,2017-04-17 05:27:27,Cleveland Shooter Disowned By Family On Twitter,Cleveland Police issued an aggravated murder w...,"Donald Harvey, also dubbed as the 'Angel of De...",conservative,1,"Donald Harvey, also dubbed as the 'Angel of De..."
1,www.yahoo.com,2017-10-07 04:31:25,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,Movie company says it is taking claims “very s...,conservative,1,Movie company says it is taking claims “very s...
2,www.yahoo.com,2017-02-15 15:22:55,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,Former NHL star Mike Comrie — once married to ...,conservative,1,Former NHL star Mike Comrie — once married to ...
3,www.yahoo.com,2017-04-17 22:55:18,"At 117, Jamaican woman likely just became worl...","DUANVALE, Jamaica (AP) — Violet Brown spent mu...","The world's oldest person Violet Brown, center...",conservative,1,"The world's oldest person Violet Brown, center..."
4,www.yahoo.com,2017-01-02 00:00:00,"Mark Hamill's Carrie Fisher Tribute: ""Making H...","""She was a handful, but my life would have bee...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,conservative,1,(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...


In [None]:
print(news_features['text-processed'].iloc[348])

This undated picture provided on Monday, Feb. 13, 2017, by the Albanian National Coastline Agency shows a shipwreck discovered by the RPM's Hercules research vessel in Ionian Sea, Albania. The country is promoting the archaeological finds in the waters off its southwest coast to raise public interest and to attract attention of decision makers who can help preserve the discoveries. The Albanian National Coastline Agency opened an exhibition on Monday, Feb. 13 of 30 pictures showing underwater finds of potential archaeological significance from the last decade. (The Albanian National Coastline Agency via AP)
TIRANA, Albania (AP) — Albania is promoting the archaeological finds in the waters off its southwest coast to raise public interest and to attract attention of decision makers who can help preserve the discoveries.
The Albanian National Coastline Agency opened an exhibit Monday of 30 photographs showing underwater finds of potential archaeological significance from the last decade.
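The printed example still contains digits, commas, periods, and parentheses, which suggests the patterns above were treated as literal strings rather than regular expressions. In pandas 2.0 and later, `Series.str.replace` defaults to `regex=False`, so regex patterns need `regex=True` to take effect. A minimal sketch on a made-up sentence (the `toy` Series is hypothetical, and the URL pattern is a slightly stricter variant of the one above: it also matches `http://` and escapes the dot):

```python
import pandas as pd

# Made-up example text (not from the dataset)
toy = pd.Series(["Visit https://example.com now! #news @user (AP) 2017"])

url_pattern = r"(https?://|www\.)\S+"   # http://, https://, or www. up to the next space
mention_pattern = r"@\S+"               # anything following an @ symbol
hashtag_pattern = r"#\S+"               # anything following a # symbol
not_roman = r"[^a-zA-Z'\s]"             # anything that is not a letter, apostrophe, or whitespace
space = r"\s\s+"                        # runs of two or more whitespace characters

cleaned = toy
for pattern, repl in [
    (url_pattern, ""),
    (mention_pattern, ""),
    (hashtag_pattern, ""),
    (not_roman, " "),
    (space, " "),
]:
    # regex=True tells pandas to interpret the pattern as a regular expression
    cleaned = cleaned.str.replace(pattern, repl, regex=True)

cleaned = cleaned.str.strip()
print(cleaned.iloc[0])
```

Note that the accuracy figures reported later in this notebook come from the text as it was actually processed, so applying this change would likely alter them.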


Now, we want to tokenize our text, breaking it up into single words that we can continue to filter. Let's download the 'punkt' tokenizer models from nltk and apply nltk's word tokenizer.

**The below preprocessing code was inspired from labs completed in Dr. Abhijit Mishra's Natural Language Processing and Applications course at UT Austin's School of Information.**

In [None]:
import nltk # Import nltk library
nltk.download('punkt') # Download the Punkt tokenizer models that word_tokenize relies on
from nltk.tokenize import word_tokenize # Import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
news_features['text-processed'] = (news_features['text-processed'].str.lower()).apply(word_tokenize)

Now we have tokenized words. We can filter out stopwords with a list comprehension, using nltk's list of English stopwords.

In [None]:
# news_features.head()

In [None]:
nltk.download('stopwords')

from nltk.corpus import stopwords # Import stopwords
stop_words = set(stopwords.words('english')) # Set stopwords to 'English'

print(stop_words) # Did we do this right??

{'for', 't', 'over', 'out', 'through', 'aren', 'of', 'i', 'isn', 'than', 'be', "wasn't", 'he', 'same', 'has', 'having', "wouldn't", 'yourself', 'while', 'can', 'before', 'own', 've', 'hers', 'nor', 'mightn', 'had', 'after', "you'll", 'the', 'our', 'herself', 'down', "mightn't", 'needn', 'during', 'them', 'into', 'mustn', 'you', 'up', "shan't", 'won', 'are', 'd', 'theirs', 'is', "she's", 'this', "should've", "you're", 'so', 'my', 'again', 'from', 'how', 'such', 'and', 'now', 'don', 'her', 'hasn', "doesn't", 'shouldn', 'were', 'his', 'doesn', "shouldn't", "it's", 'have', 'there', "isn't", 'very', 'not', 'but', 'am', 'will', 'if', 'no', 'a', 'or', "you'd", 'its', 'why', 'any', 'me', 'as', "didn't", 'what', 'few', 'all', 'haven', 'being', 'we', 'wasn', 'above', 'been', 'your', 'hadn', 'an', 'only', "hasn't", 'myself', 'between', "mustn't", 'ours', 'that', 'in', 'other', 'should', 'when', 'who', 'themselves', 'some', 'o', "hadn't", "won't", 'more', 'most', 'do', 'ma', "aren't", 'too', 'thes

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# We can use .apply() with a lambda to iterate through each tokenized row and remove stopwords
news_features['text-processed'] = news_features['text-processed'].apply(lambda x: [word for word in x if word not in stop_words])

Now, we lemmatize, which reduces noise by mapping inflected forms of a word to a common base form; for example, both "rats" and "rat" become "rat."

In [None]:
# news_features.head()

In [None]:
# Lemmatize function with help from practicum 2
# Import wordnet, the Lemmatizer, and set it equal to a variable

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
news_features['text-processed'] = news_features['text-processed'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

In [None]:
news_features.head()

Unnamed: 0,domain,date,title,description,text,leaning-label,numerical-label,text-processed
0,www.yahoo.com,2017-04-17 05:27:27,Cleveland Shooter Disowned By Family On Twitter,Cleveland Police issued an aggravated murder w...,"Donald Harvey, also dubbed as the 'Angel of De...",conservative,1,"[donald, harvey, ,, also, dubbed, 'angel, deat..."
1,www.yahoo.com,2017-10-07 04:31:25,Weinstein Company launches probe into co-found...,Movie company says it is taking claims “very s...,Movie company says it is taking claims “very s...,conservative,1,"[movie, company, say, taking, claim, “, seriou..."
2,www.yahoo.com,2017-02-15 15:22:55,Hilary Duff's Ex-Husband Mike Comrie Investiga...,Former NHL star Mike Comrie — once married to ...,Former NHL star Mike Comrie — once married to ...,conservative,1,"[former, nhl, star, mike, comrie, —, married, ..."
3,www.yahoo.com,2017-04-17 22:55:18,"At 117, Jamaican woman likely just became worl...","DUANVALE, Jamaica (AP) — Violet Brown spent mu...","The world's oldest person Violet Brown, center...",conservative,1,"[world, 's, oldest, person, violet, brown, ,, ..."
4,www.yahoo.com,2017-01-02 00:00:00,"Mark Hamill's Carrie Fisher Tribute: ""Making H...","""She was a handful, but my life would have bee...",(Photo: Albert L. Ortega/Gettyimages)\nCarrie ...,conservative,1,"[(, photo, :, albert, l., ortega/gettyimages, ..."


Now we save our dataframe to a .csv file in case this session crashes.

In [None]:
# Download .csv
news_features.to_csv('news_features.csv')

In [None]:
import pandas as pd

We are left with pre-processed text that can be used to extract numerical features.

## 3. Split Dataset: train-test-split

Now, we can split our dataset to a train set and a test set. That way, we can train our classifier and test it on "unknown" data for accuracy.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train, test = train_test_split(news_features, test_size = 0.2)

In [None]:
train.head()

Unnamed: 0.1,Unnamed: 0,domain,date,title,description,text,leaning-label,numerical-label,text-processed,glove
3440,3440,www.npr.org,2018-07-04 00:00:00,Plea Deal For Former Congressional IT Staffer ...,"An attorney for Imran Awan said his client, wh...",Plea Deal For Former Congressional IT Staffer ...,liberal,2,"['plea', 'deal', 'former', 'congressional', 's...","[0.044859424, 0.33368477, -0.17616433, -0.0678..."
38,38,www.yahoo.com,2017-10-05 11:21:00,Guns have killed more Americans in last 50 yea...,More than 1.5 million US citizens have died as...,People run from the Route 91 Harvest country m...,conservative,1,"['people', 'run', 'route', '91', 'harvest', 'c...","[0.031091737, 0.30892602, -0.16687986, -0.0482..."
1242,1242,www.bbc.com,2017-10-06 14:52:34,Birmingham care home boss stole £90k from 97-y...,Carleen Wilkins stole thousands from her 97-ye...,Image copyright West Midlands Police Image cap...,moderate,0,"['image', 'copyright', 'west', 'midland', 'pol...","[0.030551143, 0.27672046, -0.1790493, -0.05511..."
4524,4524,www.msnbc.com,2017-10-06 19:00:17,Trump may decertify Iran Deal,"According to the Washington Post,","According to the Washington Post, ""President T...",liberal,2,"['according', 'washington', 'post', ',', '``',...","[0.057606265, 0.3210167, -0.16817248, -0.03219..."
6168,6168,www.foxnews.com,2017-08-14 03:00:28,The Latest: 18 killed in Burkina Faso restaura...,The Latest on the attack in Burkina Faso (all ...,The Latest on the attack in Burkina Faso (all ...,conservative,1,"['latest', 'attack', 'burkina', 'faso', '(', '...","[0.05054396, 0.28976792, -0.17549747, -0.07414..."


Let's run a count to make sure that worked!

In [None]:
train.count()

Unnamed: 0         8444
domain             8444
date               8438
title              8444
description        8382
text               8444
leaning-label      8444
numerical-label    8444
text-processed     8444
glove              8444
dtype: int64

In [None]:
test.count()

Unnamed: 0         2111
domain             2111
date               2107
title              2111
description        2089
text               2111
leaning-label      2111
numerical-label    2111
text-processed     2111
glove              2111
dtype: int64

In [None]:
train.groupby('leaning-label').size()

leaning-label
conservative    2720
liberal         2738
moderate        2986
dtype: int64

In [None]:
test.groupby('leaning-label').size()

leaning-label
conservative    674
liberal         723
moderate        714
dtype: int64
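The class counts above are close but not identical across the train and test sets, and because no seed is set, the split changes on every run. A minimal sketch, on a toy DataFrame, of how `train_test_split` could be made reproducible and class-balanced with its `random_state` and `stratify` parameters (the seed value `42` and the `toy` data are assumptions for illustration, not the settings used for the results in this notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for news_features: 30 rows, 10 per class (hypothetical data)
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(30)],
    "leaning-label": ["moderate", "conservative", "liberal"] * 10,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.2,
    random_state=42,                # assumed seed; any fixed integer gives a repeatable split
    stratify=toy["leaning-label"],  # keep class proportions the same in both splits
)

print(toy_train.groupby("leaning-label").size())
print(toy_test.groupby("leaning-label").size())
```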

Now, we set our training data to those lists of tokens, and set our classification target to the numerical label (0, 1, and 2 for moderate, right-leaning, and left-leaning, respectively).

In [None]:
# Extract text and labels (reference: practicum 5)
X_train = train['text-processed']
y_train = train['numerical-label']

X_test = test['text-processed']
y_test = test['numerical-label']

In [None]:
X_train[90]

'[\'file\', \'oct.\', \'1\', \',\', \'2015\', \'file\', \'photo\', \',\', \'advocate\', \'victim\', \'domestic\', \'abuse\', \'protest\', \'outside\', \'state\', \'office\', \'downtown\', \'chicago\', \'.\', \'illinois\', \'department\', \'human\', \'service\', \'waited\', \'five\', \'month\', \'inform\', \'dozen\', \'domestic\', \'violence\', \'shelter\', \'money\', \'temporary\', \'budget\', \'lawmaker\', \'approved\', \'last\', \'summer\', \'.\', \'official\', \'providing\', \'service\', \'victim\', \'domestic\', \'violence\', \'tell\', \'associated\', \'press\', \'unaware\', \'$\', \'9\', \'million\', \'state\', \'funding\', \'left\', \'stopgap\', \'plan\', \'expired\', \'december\', \'.\', \'(\', \'ap\', \'photo/sophia\', \'tareen\', \',\', \'file\', \')\', \'springfield\', \',\', \'ill.\', \'(\', \'ap\', \')\', \'—\', \'illinois\', \'official\', \'waited\', \'five\', \'month\', \'alert\', \'dozen\', \'domestic\', \'violence\', \'program\', \'funding\', \'eliminated\', \',\', \'om
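The output above is a single string that looks like a Python list, which suggests the `text-processed` column was reloaded from the saved .csv file (lists are stored there as their string representation). The vectorizers in the next section accept strings, but any function that loops over tokens would loop over individual characters instead. A minimal sketch, assuming the column holds such strings, of turning one back into a list with the standard library's `ast.literal_eval` (the `raw` value is a shortened, made-up example):

```python
import ast

# A shortened example of how a token list looks after a round trip through .csv
raw = "['file', 'oct.', '1', ',', '2015']"

# literal_eval safely parses the string back into a Python list
tokens = ast.literal_eval(raw)

print(list(raw)[:3])   # iterating the string yields characters
print(tokens[:3])      # iterating the parsed list yields tokens
```

For a whole column, the same conversion could be applied with `.apply(ast.literal_eval)`.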

## 4. Featurize

Now, we can featurize the text tokens that we produced earlier into Bag of Words, N-Gram, and TF-IDF features. We will start by importing `CountVectorizer` and `TfidfVectorizer` to use as tools to get these features.

References:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- Dr. Abhijit Mishra's NLP course lab materials

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import accuracy_score

Now, we featurize by creating three vectorizers, one for each feature, and transforming our data. **It is critical that we featurize both our training and testing data** with the same vectorizer (fit on the training set only), because the model can only predict on data that has the same feature columns it was trained on.

In [None]:
# Bag of Words
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

In [None]:
# N-Grams
ngram_vectorizer = CountVectorizer(ngram_range=(1,2))
X_train_ngram = ngram_vectorizer.fit_transform(X_train)
X_test_ngram = ngram_vectorizer.transform(X_test)

In [None]:
# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
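To make the three representations concrete, here is a small sketch on a made-up two-document corpus (the `toy_train` and `toy_test` sentences are hypothetical). It shows that the test matrix inherits its columns from the vocabulary learned on the training data, and that adding bigrams enlarges the feature space.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus for illustration only
toy_train = ["the senate passed the bill", "the bill failed"]
toy_test = ["the senate failed"]

# Bag of Words: one column per unique word seen during fit
bow = CountVectorizer()
bow_train = bow.fit_transform(toy_train)
bow_test = bow.transform(toy_test)      # reuses the training vocabulary

# Unigrams plus bigrams: adds one column per unique adjacent word pair
ngram = CountVectorizer(ngram_range=(1, 2))
ngram_train = ngram.fit_transform(toy_train)

# TF-IDF: same columns as Bag of Words, but weighted instead of raw counts
tfidf = TfidfVectorizer()
tfidf_train = tfidf.fit_transform(toy_train)

print(bow_train.shape, bow_test.shape, ngram_train.shape, tfidf_train.shape)
```

This mirrors the shapes reported later in this section, where the Bag of Words and TF-IDF matrices share a column count and the N-Gram matrix is much wider.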

Now, for the GloVe feature, we define a function (below) that averages the word vectors for an entire document, and place the results in an array.

In order to get GloVe embeddings, we have to unzip the GloVe file and load our vectors. We will use 200-dimensional GloVe vectors, because they can capture more nuance than the lower-dimensional versions. We used Dr. Abhijit Mishra's example on how to load in GloVe vectors for this section.

In [None]:
# this is a one time download
!wget -c http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# do some necessary conversions
!python -m gensim.scripts.glove2word2vec --input  glove.6B.200d.txt --output glove.6B.200d.vec
!rm glove*.txt

--2024-04-21 20:08:42--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-04-21 20:08:42--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-04-21 20:08:43--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



We load our vectors, then retrieve an embedding from those vectors for each word in each document of our corpus. Then, we can average the embeddings for each document, to get one single embedding (200 dimensions) that represents the average of each document.

In [None]:
import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained GloVe embeddings
word_vectors = KeyedVectors.load_word2vec_format('glove.6B.200d.vec', binary=False)

In [None]:
# Make a function that averages GloVe vectors
def get_average_glove_vector(text):
    vectors = [word_vectors[word] for word in text if word in word_vectors]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(word_vectors.vector_size)
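As a worked illustration of the averaging step, the sketch below uses a hypothetical dictionary of 3-dimensional vectors in place of the real 200-dimensional GloVe vectors (`toy_vectors` and `average_vector` are stand-in names, not part of the notebook). Words missing from the vocabulary are skipped, and a document with no known words falls back to a zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" standing in for GloVe vectors
toy_vectors = {
    "tax": np.array([1.0, 0.0, 2.0]),
    "cut": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Average the vectors of all tokens found in `vectors`; zeros if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)

# "unknownword" is not in the vocabulary, so only "tax" and "cut" are averaged
doc_vec = average_vector(["tax", "cut", "unknownword"], toy_vectors)
empty_vec = average_vector(["unknownword"], toy_vectors)

print(doc_vec)    # element-wise mean of the two known vectors
print(empty_vec)  # all zeros
```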

Now, we can apply the function that averages all of those embeddings, and can check our dataset.

In [None]:
import numpy as np

In [None]:
# GloVe (apply earlier function)
X_train_glove = np.array([get_average_glove_vector(text) for text in X_train])
X_test_glove = np.array([get_average_glove_vector(text) for text in X_test])

Let's be sure that the number of rows matches for all of our features.

In [None]:
print(X_train.shape)
print(X_train_bow.shape)
print(X_train_ngram.shape)
print(X_train_tfidf.shape)

(8444,)
(8444, 67788)
(8444, 1466891)
(8444, 67788)


In [None]:
print(X_test.shape)
print(X_test_bow.shape)
print(X_test_ngram.shape)
print(X_test_tfidf.shape)

(2111,)
(2111, 67788)
(2111, 1466891)
(2111, 67788)


In [None]:
print(X_train_glove.shape)
print(X_test_glove.shape)

(8444, 200)
(2111, 200)


Great! Now we can combine our features below!

## 5. Compile Variables: hstack

We can simply use `hstack` to combine all features. Since we are doing a feature ablation study for some of the models, we will make several versions of different combinations.

References:
- https://numpy.org/doc/stable/reference/generated/numpy.hstack.html
- https://stackoverflow.com/questions/54560836/how-to-combine-text-features-and-categorical-features-in-python

In [None]:
import scipy.sparse as sp
from scipy.sparse import hstack
# All
X_train_all = hstack((X_train_bow, X_train_ngram, X_train_tfidf, X_train_glove))
X_test_all = hstack((X_test_bow, X_test_ngram, X_test_tfidf, X_test_glove))

# The combinations below are needed for the ablation studies in the next section

# Everything but Bag of Words
X_train_all_but_bow = hstack((X_train_ngram, X_train_tfidf, X_train_glove))
X_test_all_but_bow = hstack((X_test_ngram, X_test_tfidf, X_test_glove))

# Everything but n-gram
X_train_all_but_ngram = hstack((X_train_bow, X_train_tfidf, X_train_glove))
X_test_all_but_ngram = hstack((X_test_bow, X_test_tfidf, X_test_glove))

# Everything but tfidf
X_train_all_but_tfidf = hstack((X_train_bow, X_train_ngram, X_train_glove))
X_test_all_but_tfidf = hstack((X_test_bow, X_test_ngram, X_test_glove))

# Everything but GloVe
X_train_all_but_glove = hstack((X_train_bow, X_train_ngram, X_train_tfidf))
X_test_all_but_glove = hstack((X_test_bow, X_test_ngram, X_test_tfidf))
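To show what `hstack` does with these inputs, here is a small sketch that joins a sparse count matrix with a dense embedding array column-wise (the `sparse_counts` and `dense_embed` values are made up). Wrapping the dense block in `csr_matrix` and requesting `format="csr"` are choices made for this illustration; the cell above passes the arrays directly.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Made-up blocks: 2 documents, 3 count features and 2 embedding dimensions
sparse_counts = csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))
dense_embed = np.array([[0.5, -0.5], [1.0, 1.5]])

# Rows must line up (one row per document); columns are concatenated
combined = hstack((sparse_counts, csr_matrix(dense_embed)), format="csr")

print(combined.shape)
print(combined.toarray())
```

Every block must have the same number of rows, which is why the shape checks in the previous section matter.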

In [None]:
sp.save_npz('/content/drive/MyDrive/Current Projects 2024/Implications of NLP/classifier_data_and_model_training/redoing the classifier from last semester/X_train_all.npz', X_train_all)
sp.save_npz('/content/drive/MyDrive/Current Projects 2024/Implications of NLP/classifier_data_and_model_training/redoing the classifier from last semester/X_test_all.npz', X_test_all)


## 6. Model Training: train different classifiers and perform feature ablation studies

Now we can train several classifiers to identify which model will have the highest level of accuracy.

General Classifiers:
  - Logistic regression
  - SVC (scikit-learn's support vector machine classifier)

Ensemble classifiers:
  - Random forest
  - Gradient boosted trees

Neural Network:
  - Feed-forward neural network

References for the following section:

- Dr. Abhijit Mishra's examples in lab assignments (particularly, the function that trains classifiers, including how to apply the function properly)
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

### A. Classification

Let's set up a function that trains any classification model. Thank you Dr. Abhijit for this example, from our practicum lab!

In [None]:
def train_and_evaluate_classifier(classifier, X_train, y_actual, X_test, y_test_actual):
  classifier.fit(X_train, y_actual)
  y_pred = classifier.predict(X_test)
  accuracy = accuracy_score(y_test_actual, y_pred)
  return accuracy

Now we import our classic models:

In [None]:
# Import Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

#### i. Logistic Regression

We will train our Logistic Regression classifier. As you can see below, we increased `max_iter` because the model reached the default maximum number of iterations before converging.

In [None]:
# Increase max_iter
LogReg_classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(LogReg_classifier, X_train_all, y_train, X_test_all, y_test)
print (f"Accuracy of Logistic Regression = {accuracy*100}%")

Accuracy of Logistic Regression = 85.36238749407865%


Now, we can perform a feature ablation study, removing one feature at a time to see which configuration achieves the highest accuracy.

In [None]:
# Without BOW
LogReg_classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(LogReg_classifier, X_train_all_but_bow, y_train, X_test_all_but_bow, y_test)
print (f"Accuracy of Logistic Regression without BOW = {accuracy*100}%")

Accuracy of Logistic Regression without BOW = 85.78872572240644%


In [None]:
# Without N-Gram
LogReg_classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(LogReg_classifier, X_train_all_but_ngram, y_train, X_test_all_but_ngram, y_test)
print (f"Accuracy of Logistic Regression without N-gram = {accuracy*100}%")

Accuracy of Logistic Regression without N-gram = 84.60445286594032%


In [None]:
# Without TFIDF
LogReg_classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(LogReg_classifier, X_train_all_but_tfidf, y_train, X_test_all_but_tfidf, y_test)
print (f"Accuracy of Logistic Regression without TF-IDF = {accuracy*100}%")

Accuracy of Logistic Regression without TF-IDF = 85.22027475130271%


In [None]:
# Without GloVe
LogReg_classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(LogReg_classifier, X_train_all_but_glove, y_train, X_test_all_but_glove, y_test)
print (f"Accuracy of Logistic Regression without GloVe = {accuracy*100}%")

Accuracy of Logistic Regression without GloVe = 85.36238749407865%
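The ablation cells above all follow one pattern, so they could also be driven from a loop. Here is a minimal sketch of that idea. `run_ablation`, `toy_sets`, and the dummy scoring lambda are hypothetical names made up for illustration; in the notebook, the callable would wrap `train_and_evaluate_classifier` as shown in the comment.

```python
def run_ablation(evaluate, feature_sets):
    """Score every feature configuration with the same evaluation callable.

    evaluate:     callable (X_train, X_test) -> accuracy
    feature_sets: dict mapping a configuration name to (X_train, X_test)
    """
    return {name: evaluate(X_tr, X_te) for name, (X_tr, X_te) in feature_sets.items()}


# In the notebook, the pieces would look roughly like this (not run here):
#   evaluate = lambda X_tr, X_te: train_and_evaluate_classifier(
#       LogisticRegression(max_iter=1000), X_tr, y_train, X_te, y_test)
#   feature_sets = {"all": (X_train_all, X_test_all),
#                   "no BOW": (X_train_all_but_bow, X_test_all_but_bow)}

# Toy stand-in so the sketch runs on its own: the "accuracy" is a dummy number.
toy_sets = {"all": ([1, 2, 3], [4]), "no BOW": ([1, 2], [4])}
toy_results = run_ablation(lambda X_tr, X_te: len(X_tr) / 10, toy_sets)
best = max(toy_results, key=toy_results.get)
print(toy_results, "best:", best)
```

Creating a fresh classifier inside the callable keeps each configuration independent, which is what the separate cells above do by re-instantiating `LogisticRegression` each time.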


#### ii. SVC

Let's do the same with our SVC classifier!

In [None]:
SVC_classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(SVC_classifier, X_train_all, y_train, X_test_all, y_test)
print (f"Accuracy of Support Vector Classification = {accuracy*100}%")

Accuracy of Support Vector Classification = 83.13595452392231%


In [None]:
# Without BOW
SVC_classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(SVC_classifier, X_train_all_but_bow, y_train, X_test_all_but_bow, y_test)
print (f"Accuracy of Support Vector Classification without BOW = {accuracy*100}%")

Accuracy of Support Vector Classification without BOW = 83.89388915206062%


In [None]:
# Without N-Gram
SVC_classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(SVC_classifier, X_train_all_but_ngram, y_train, X_test_all_but_ngram, y_test)
print (f"Accuracy of Support Vector Classification without N-gram = {accuracy*100}%")

Accuracy of Support Vector Classification without N-gram = 81.47797252486974%


In [None]:
# Without TFIDF
SVC_classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(SVC_classifier, X_train_all_but_tfidf, y_train, X_test_all_but_tfidf, y_test)
print (f"Accuracy of Support Vector Classification without TF-IDF = {accuracy*100}%")

Accuracy of Support Vector Classification without TF-IDF = 82.94647086688774%


In [None]:
# Without GloVE
SVC_classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(SVC_classifier, X_train_all_but_glove, y_train, X_test_all_but_glove, y_test)
print (f"Accuracy of Support Vector Classification without GloVe = {accuracy*100}%")

Accuracy of Support Vector Classification without GloVe = 83.27806726669826%
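To read the two ablation studies side by side, it helps to express each run as a change from the all-features baseline. The sketch below does that with the accuracies printed above, rounded to two decimals; `reported` and `ablation_deltas` are names chosen here for illustration.

```python
# Accuracies (%) copied from the cell outputs above, rounded to two decimals.
reported = {
    "Logistic Regression": {"all": 85.36, "no BOW": 85.79, "no N-gram": 84.60,
                            "no TF-IDF": 85.22, "no GloVe": 85.36},
    "SVC (linear)": {"all": 83.14, "no BOW": 83.89, "no N-gram": 81.48,
                     "no TF-IDF": 82.95, "no GloVe": 83.28},
}


def ablation_deltas(scores):
    """Return accuracy change (percentage points) relative to the 'all' run."""
    baseline = scores["all"]
    return {name: round(acc - baseline, 2)
            for name, acc in scores.items() if name != "all"}


deltas = {model: ablation_deltas(scores) for model, scores in reported.items()}
for model, d in deltas.items():
    largest_drop = min(d, key=d.get)
    print(model, d, "largest drop:", largest_drop)
```

For both classifiers, dropping the N-gram features costs the most accuracy, while dropping BOW gives a small gain. The differences are all under two percentage points.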


### B. Ensemble

Now we can move on to ensemble classifiers, which in theory should be more accurate because they combine the predictions of many individual models (here, decision trees). We will use all features for both the RandomForest and GradientBoosting classifiers, because the ablation studies above showed no drastic difference in accuracy for our single-model classifiers. Given more time, we could test other feature combinations with our ensemble models.
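To make the idea of combining models concrete, here is a minimal hard-voting sketch. It is a simplification for intuition only: scikit-learn's random forest averages the trees' class probabilities rather than counting votes, and gradient boosting builds its trees sequentially so that each one corrects the errors of the previous ones. The three "classifiers" and their labels below are invented.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical base classifiers label the same four articles.
per_model = [
    [0, 1, 2, 1],  # classifier A
    [0, 2, 2, 1],  # classifier B
    [1, 1, 2, 0],  # classifier C
]

# zip(*per_model) regroups the labels article by article.
ensemble = [majority_vote(votes) for votes in zip(*per_model)]
print(ensemble)
```

Each base classifier makes one mistake on a different article, and the vote recovers a consistent label for every article, which is the intuition behind expecting ensembles to be more robust.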

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
RF_Classifier = RandomForestClassifier()
accuracy = train_and_evaluate_classifier(RF_Classifier, X_train_all, y_train, X_test_all, y_test)
print (f"Accuracy of Random Forest = {accuracy*100}%")

Accuracy of Random Forest = 85.88346755092373%


In [None]:
RF_Classifier = RandomForestClassifier(n_estimators=100, random_state=50)
accuracy = train_and_evaluate_classifier(RF_Classifier, X_train_all, y_train, X_test_all, y_test)
print (f"Accuracy of Random Forest = {accuracy*100}%")

Accuracy of Random Forest = 85.45712932259593%


In [None]:
GB_Classifier = GradientBoostingClassifier()
accuracy = train_and_evaluate_classifier(GB_Classifier, X_train_all, y_train, X_test_all, y_test)
print (f"Accuracy of Gradient Boosting = {accuracy*100}%")

### C. Feed-Forward Network

Finally, we will try the MLPClassifier neural network. The first iteration below was interrupted due to computation errors.

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
classifier = MLPClassifier(random_state=1, max_iter=300)
accuracy = train_and_evaluate_classifier(classifier, X_train_all, y_train, X_test_all, y_test)
print (f"Accuracy of MLP Classification = {accuracy*100}%")

Accuracy of MLP Classification = 88.39412600663192%


Now we can save our model, since it took three hours to train. We save it to Drive and download it so that, if needed, we can re-upload it into another Colab session. I followed the joblib documentation to figure this part out.

In [None]:
!pip install joblib



In [None]:
# Save model
import joblib

model_filename = 'filename'
joblib.dump(classifier, model_filename)
print(f"Trained model saved to {model_filename}")

Trained model saved to /content/drive/MyDrive/Current Projects 2024/Implications of NLP/classifier_data_and_model_training/redoing the classifier from last semester/mlp_news_classifier_model.joblib


In [None]:
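# Uncomment to reload the saved model in a fresh session
# loaded_classifier = joblib.load('/content/mlp_news_classifier_model.joblib')


Once the classifier is loaded, we can simply call `.predict()` on our featurized chatbot outputs to see whether each one is labeled left-, moderate-, or right-leaning.

In [None]:
# X_new is a placeholder for a featurized matrix with the same columns as X_train_all.
# This cell also needs the joblib.load line above to have been run.
predictions = loaded_classifier.predict(X_new)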


# TEST ON LLM DATA

In this part, I pull in the responses that we gathered by prompting LLMs and stored in .csv files.

## 1. Import and preprocess (SAME AS TRAIN DATA) LLM text

We have one file containing all of the queried LLM responses. Let's load it in! We will keep the distinctions between prompting methods (natural, summary, and detailed) throughout this entire process, because we want to be able to analyze our prompting methods at the end.

In [None]:
llm_responses = pd.read_csv('/content/drive/MyDrive/Current Projects 2024/Implications of NLP/llm_response_data.csv')

In [None]:
llm_responses.head()

Unnamed: 0,index,prompt_id,source,year,news_topic,polarized_flag,international_flag,event_flag,prompting_style,views,prompt_domain_source,prompt,LLAMA_response_13B,LLAMA_13B_rr_flag,LLAMA_response_70B,LLAMA_70B_rr_flag,GPT3_response_175B,GPT3_175B_rr_flag,GPT4_response_1.76T,GPT3_1.76T_rr_flag
0,0,1A,Google Trends,2017.0,Hurricane Irma,0,0,1,search-query,,,Hurricane Irma 2017,"2017 Hurricane Irma, a Category 5 hurricane th...",0,"Sure, here's a summary of Hurricane Irma in 20...",0,Hurricane Irma was a powerful and destructive ...,0,Hurricane Irma was a powerful and catastrophic...,0
1,1,1B,Google Trends,2017.0,Hurricane Irma,0,0,1,pinpointing,,,"Write a detailed, 5 paragraph essay about Hurr...","Hurricane Irma, a powerful Category 5 hurrican...",0,"Sure, here is a 5 paragraph essay about Hurric...",0,"In September 2017, Hurricane Irma made landfal...",0,Hurricane Irma was one of the most powerful hu...,0
2,2,2A,Google Trends,2017.0,Las Vegas shooting,1,0,1,search-query,,,Las Vegas shooting 2017,2017 Las Vegas shooting refers to the mass sho...,0,"On October 1, 2017, a mass shooting occurred i...",0,"The Las Vegas shooting occurred on October 1, ...",0,The Las Vegas shooting that took place in 2017...,0
3,3,2B,Google Trends,2017.0,Las Vegas shooting,1,0,1,pinpointing,,,"Write a detailed, 5 paragraph essay about the ...","Sorry, I cannot fulfill that request. The Las ...",1,I'm not able to provide a detailed essay about...,1,"On October 1, 2017, the United States was shak...",0,"On the evening of October 1, 2017, the city of...",0
4,4,3A,Google Trends,2017.0,North Korea Nuclear Test,0,1,1,search-query,,,North Korea Nuclear Test 2017,"Unfortunately, I'm a large language model, I c...",1,"Sure, here's a brief summary of North Korea's ...",0,"On September 3, 2017, North Korea conducted it...",0,North Korea conducted a significant nuclear te...,0


Now we clean our data in the **exact same way as we cleaned the training data for our classifier model**. It is critical that we are consistent, because our model only knows how to "recognize" a very specific structure of data.

So again, we use regular expressions to remove abnormalities. I did not include the steps that remove websites, hashtags, and mentions, because ChatGPT does not create those.
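As a compact view of the regex cleaning, here is a sketch that folds the steps into one helper. `clean_response` and the sample sentence are made up for illustration. Two things are worth noting: the first pattern already replaces hyphens, periods, and commas (they are not letters, apostrophes, or whitespace), and an unescaped `.` in a regular expression matches any character, so a literal period has to be written as `\.`.

```python
import re

NOT_ROMAN = re.compile(r"[^a-zA-Z'\s]")  # anything other than letters, apostrophes, whitespace
MULTI_SPACE = re.compile(r"\s\s+")       # two or more whitespace characters in a row


def clean_response(text):
    """Replace unwanted characters with spaces, then collapse repeated whitespace."""
    text = NOT_ROMAN.sub(" ", text)
    return MULTI_SPACE.sub(" ", text)


sample = "Hurricane Irma (2017) was a Category-5 storm, wasn't it?"
cleaned = clean_response(sample)
print(cleaned.split())

# The dot pitfall: unescaped, it blanks every character; escaped, only periods.
blanked = re.sub(r".", " ", "U.S.")
escaped = re.sub(r"\.", " ", "U.S.")
print(repr(blanked), repr(escaped))
```

Applying one helper to every model's column (for example with a list comprehension per column) also avoids repeating each substitution four times.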

In [None]:
LLAMA_response_13B = llm_responses['LLAMA_response_13B']
LLAMA_response_70B = llm_responses['LLAMA_response_70B']
GPT3_response_175B = llm_responses['GPT3_response_175B']
GPT4_response_1T = llm_responses['GPT4_response_1.76T']

In [None]:
len(LLAMA_response_13B)

198

In [None]:
# Replace every character that is not a Roman letter, an apostrophe, or whitespace
not_roman = r"[^a-zA-Z'\s]" # matches anything other than letters, apostrophes, and whitespace (digits are removed too)
processed_llama_response_13B = [re.sub(not_roman, " ", response) for response in LLAMA_response_13B]
processed_LLAMA_response_70B = [re.sub(not_roman, " ", response) for response in LLAMA_response_70B]
processed_GPT3_response_175B = [re.sub(not_roman, " ", response) for response in GPT3_response_175B]
processed_GPT4_response_1T = [re.sub(not_roman, " ", response) for response in GPT4_response_1T]

In [None]:
# Get rid of random hyphens
hyphen = r'-'
processed_llama_response_13B = [re.sub(hyphen, " ", response) for response in processed_llama_response_13B]
processed_LLAMA_response_70B = [re.sub(hyphen, " ", response) for response in processed_LLAMA_response_70B]
processed_GPT3_response_175B = [re.sub(hyphen, " ", response) for response in processed_GPT3_response_175B]
processed_GPT4_response_1T = [re.sub(hyphen, " ", response) for response in processed_GPT4_response_1T]

In [None]:
# Get rid of periods
period = r'\.' # the dot must be escaped; an unescaped '.' matches any character and would blank out every response
processed_llama_response_13B = [re.sub(period, " ", response) for response in processed_llama_response_13B]
processed_LLAMA_response_70B = [re.sub(period, " ", response) for response in processed_LLAMA_response_70B]
processed_GPT3_response_175B = [re.sub(period, " ", response) for response in processed_GPT3_response_175B]
processed_GPT4_response_1T = [re.sub(period, " ", response) for response in processed_GPT4_response_1T]

In [None]:
# Get rid of commas
comma = r','
processed_llama_response_13B = [re.sub(comma, " ", response) for response in processed_llama_response_13B]
processed_LLAMA_response_70B = [re.sub(comma, " ", response) for response in processed_LLAMA_response_70B]
processed_GPT3_response_175B = [re.sub(comma, " ", response) for response in processed_GPT3_response_175B]
processed_GPT4_response_1T = [re.sub(comma, " ", response) for response in processed_GPT4_response_1T]

In [None]:
# Make sure everything has only one space between it
space = r'\s\s+' # selects any run of two or more whitespace characters, including line breaks
processed_llama_response_13B = [re.sub(space, " ", response) for response in processed_llama_response_13B]
processed_LLAMA_response_70B = [re.sub(space, " ", response) for response in processed_LLAMA_response_70B]
processed_GPT3_response_175B = [re.sub(space, " ", response) for response in processed_GPT3_response_175B]
processed_GPT4_response_1T = [re.sub(space, " ", response) for response in processed_GPT4_response_1T]


In [None]:
processed_llama_response_13B[1]

Now, we tokenize, remove stopwords, and lemmatize, as we did with our training data.
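The three steps can be sketched as one function applied to each model's responses, which avoids repeating every line four times. `preprocess` and the toy stand-ins below are hypothetical; they let the sketch run without NLTK downloads. In the notebook, the arguments would be `word_tokenize`, `set(stopwords.words('english'))`, and `lemmatizer.lemmatize`.

```python
def preprocess(texts, tokenize, stop_words, lemmatize_word):
    """Lowercase and tokenize each text, drop stopwords, lemmatize what remains."""
    processed = []
    for text in texts:
        tokens = tokenize(text.lower())
        tokens = [tok for tok in tokens if tok not in stop_words]
        processed.append([lemmatize_word(tok) for tok in tokens])
    return processed


# Toy stand-ins (assumptions for illustration, not the NLTK resources).
toy_stop_words = {"the", "a", "of", "was"}
toy_lemmas = {"storms": "storm", "hurricanes": "hurricane"}

responses = {
    "model_a": ["The storms of 2017", "A hurricane was forming"],
    "model_b": ["Hurricanes hit the coast"],
}

processed = {
    name: preprocess(texts, str.split, toy_stop_words, lambda w: toy_lemmas.get(w, w))
    for name, texts in responses.items()
}
print(processed)
```

Keeping the responses in a dictionary keyed by model name means the same call covers every model, and adding a fifth model later would not require new cells.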

In [None]:
import nltk # Import nltk library
nltk.download('punkt') # Download the Punkt tokenizer models that word_tokenize relies on
from nltk.tokenize import word_tokenize # Import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Lowercase and tokenize each response (this reads the raw LLAMA_response_13B column)
processed_LLAMA_response_13B = [word_tokenize(text.lower()) for text in LLAMA_response_13B]


In [None]:
# processed_LLAMA_response_13B[1]

In [None]:
processed_LLAMA_response_70B = [word_tokenize(text.lower()) for text in LLAMA_response_70B]
processed_GPT3_response_175B = [word_tokenize(text.lower()) for text in GPT3_response_175B]
processed_GPT4_response_1T = [word_tokenize(text.lower()) for text in GPT4_response_1T]

In [None]:
# processed_GPT4_response_1T[1]

In [None]:
nltk.download('stopwords')

from nltk.corpus import stopwords # Import stopwords
stop_words = set(stopwords.words('english')) # Set stopwords to 'English'

print(stop_words) # Did we do this right??

{'for', 't', 'over', 'out', 'through', 'aren', 'of', 'i', 'isn', 'than', 'be', "wasn't", 'he', 'same', 'has', 'having', "wouldn't", 'yourself', 'while', 'can', 'before', 'own', 've', 'hers', 'nor', 'mightn', 'had', 'after', "you'll", 'the', 'our', 'herself', 'down', "mightn't", 'needn', 'during', 'them', 'into', 'mustn', 'you', 'up', "shan't", 'won', 'are', 'd', 'theirs', 'is', "she's", 'this', "should've", "you're", 'so', 'my', 'again', 'from', 'how', 'such', 'and', 'now', 'don', 'her', 'hasn', "doesn't", 'shouldn', 'were', 'his', 'doesn', "shouldn't", "it's", 'have', 'there', "isn't", 'very', 'not', 'but', 'am', 'will', 'if', 'no', 'a', 'or', "you'd", 'its', 'why', 'any', 'me', 'as', "didn't", 'what', 'few', 'all', 'haven', 'being', 'we', 'wasn', 'above', 'been', 'your', 'hadn', 'an', 'only', "hasn't", 'myself', 'between', "mustn't", 'ours', 'that', 'in', 'other', 'should', 'when', 'who', 'themselves', 'some', 'o', "hadn't", "won't", 'more', 'most', 'do', 'ma', "aren't", 'too', 'thes

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
type(processed_LLAMA_response_13B)

list

In [None]:
processed_LLAMA_response_13B_2 = [[word for word in text if word not in stop_words] for text in processed_LLAMA_response_13B]
processed_LLAMA_response_70B_2 = [[word for word in text if word not in stop_words] for text in processed_LLAMA_response_70B]
processed_GPT3_response_175B_2 = [[word for word in text if word not in stop_words] for text in processed_GPT3_response_175B]
processed_GPT4_response_1T_2 = [[word for word in text if word not in stop_words] for text in processed_GPT4_response_1T]

In [None]:
len(processed_LLAMA_response_13B_2)

198

In [None]:
# Lemmatize function with help from practicum 2
# Import wordnet, the Lemmatizer, and set it equal to a variable

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
def lemmatize(text):
    return [lemmatizer.lemmatize(word) for word in text]

In [None]:
processed_LLAMA_response_13B_3 = [lemmatize(text) for text in processed_LLAMA_response_13B_2]

In [None]:
# processed_LLAMA_response_13B_3[1]

In [None]:
processed_LLAMA_response_70B_3 = [lemmatize(text) for text in processed_LLAMA_response_70B_2]
processed_GPT3_response_175B_3 = [lemmatize(text) for text in processed_GPT3_response_175B_2]
processed_GPT4_response_1T_3 = [lemmatize(text) for text in processed_GPT4_response_1T_2]

In [None]:
len(processed_LLAMA_response_70B_3)

198

In [None]:
len(processed_GPT3_response_175B_3)

198

In [None]:
len(processed_GPT4_response_1T_3)

198

Now, we get Bag of Words, N-Grams, and TF-IDF features for each model's responses, just like we did for our training data.

In [None]:
len(processed_LLAMA_response_13B_3)

198

In [None]:
# Bag of Words
LLAMA_response_13B_bow = vectorizer.transform([" ".join(sentence) for sentence in processed_LLAMA_response_13B_3])
LLAMA_response_70B_bow = vectorizer.transform([" ".join(sentence) for sentence in processed_LLAMA_response_70B_3])
GPT3_response_175B_bow = vectorizer.transform([" ".join(sentence) for sentence in processed_GPT3_response_175B_3])
GPT4_response_1T_bow = vectorizer.transform([" ".join(sentence) for sentence in processed_GPT4_response_1T_3])


In [None]:
print(X_train_bow.shape)


(8444, 67788)


In [None]:
print(LLAMA_response_13B_bow.shape)

(198, 67788)
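The column counts match (67788 for both the training matrix and the LLM matrix) because we call `.transform` on a vectorizer that was already fitted, so the vocabulary learned from the training text fixes the columns. Here is a simplified pure-Python stand-in for that behavior. `fit_vocabulary` and `transform` are toy functions written for this sketch, not the scikit-learn API, and unlike `CountVectorizer` they keep words in first-seen order rather than sorting them.

```python
def fit_vocabulary(docs):
    """Assign a column index to every distinct word in the training documents."""
    vocab = {}
    for doc in docs:
        for tok in doc.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def transform(docs, vocab):
    """Count words per document using only the columns of a fitted vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in vocab:  # words never seen in training are dropped
                row[vocab[tok]] += 1
        rows.append(row)
    return rows


train_docs = ["tax cut bill", "border wall bill"]
new_docs = ["tax bill debate"]

vocab = fit_vocabulary(train_docs)
X_train_toy = transform(train_docs, vocab)
X_new_toy = transform(new_docs, vocab)
print(len(X_train_toy[0]), len(X_new_toy[0]), X_new_toy)
```

If we re-fitted a vectorizer on the LLM text instead, the columns would no longer line up with what the classifier was trained on.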


In [None]:
# N-Grams
LLAMA_response_13B_ngram = ngram_vectorizer.transform([" ".join(sentence) for sentence in processed_LLAMA_response_13B_3])
LLAMA_response_70B_ngram = ngram_vectorizer.transform([" ".join(sentence) for sentence in processed_LLAMA_response_70B_3])
GPT3_response_175B_ngram = ngram_vectorizer.transform([" ".join(sentence) for sentence in processed_GPT3_response_175B_3])
GPT4_response_1T_ngram = ngram_vectorizer.transform([" ".join(sentence) for sentence in processed_GPT4_response_1T_3])


In [None]:
print(X_train_ngram.shape)

(8444, 1466891)


In [None]:
print(LLAMA_response_13B_ngram.shape)

(198, 1466891)


In [None]:
# TF-IDF
LLAMA_response_13B_tfidf = tfidf_vectorizer.transform([" ".join(sentence) for sentence in processed_LLAMA_response_13B_3])
LLAMA_response_70B_tfidf = tfidf_vectorizer.transform([" ".join(sentence) for sentence in processed_LLAMA_response_70B_3])
GPT3_response_175B_tfidf = tfidf_vectorizer.transform([" ".join(sentence) for sentence in processed_GPT3_response_175B_3])
GPT4_response_1T_tfidf = tfidf_vectorizer.transform([" ".join(sentence) for sentence in processed_GPT4_response_1T_3])

In [None]:
print(X_train_tfidf.shape)

(8444, 67788)


In [None]:
print(LLAMA_response_13B_tfidf.shape)

(198, 67788)


GloVe

Now, we can pull GloVe vectors, again, as we did for our training data.

In [None]:
import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained GloVe embeddings
word_vectors = KeyedVectors.load_word2vec_format('glove.6B.200d.vec', binary=False)

In [None]:
# Make a function that averages GloVe vectors
# From practicum 5
def get_average_glove_vector(text):
    vectors = [word_vectors[word] for word in text if word in word_vectors]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(word_vectors.vector_size)
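To see what the averaging function returns, here is a toy version with two-dimensional vectors. `toy_vectors` and `average_vector` are made up for this sketch and mirror the logic of `get_average_glove_vector` without needing the GloVe file.

```python
import numpy as np

# Hypothetical 2-d "embeddings" standing in for the 200-d GloVe vectors.
toy_vectors = {"storm": np.array([1.0, 0.0]), "coast": np.array([0.0, 1.0])}


def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    found = [vectors[tok] for tok in tokens if tok in vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)


doc_vec = average_vector(["storm", "hit", "coast"], toy_vectors, dim=2)
empty_vec = average_vector(["zzz"], toy_vectors, dim=2)
print(doc_vec, empty_vec)
```

The unknown word "hit" is skipped rather than counted as zeros, so it does not pull the average toward the origin. A document with no known words falls back to an all-zero vector, which keeps every row the same length.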

In [None]:
# GloVE (apply earlier function)
LLAMA_13B_glove = np.array([get_average_glove_vector(text) for text in processed_LLAMA_response_13B_3])
LLAMA_70B_glove = np.array([get_average_glove_vector(text) for text in processed_LLAMA_response_70B_3])
GPT3_glove = np.array([get_average_glove_vector(text) for text in processed_GPT3_response_175B_3])
GPT4_glove = np.array([get_average_glove_vector(text) for text in processed_GPT4_response_1T_3])

Let's check our shapes! Here is the shape of the data used to train our model:

In [None]:
print(X_train.shape)
print(X_train_bow.shape)
print(X_train_ngram.shape)
print(X_train_tfidf.shape)
print(X_train_glove.shape)

(8444,)
(8444, 67788)
(8444, 1466891)
(8444, 67788)
(8444, 200)


In [None]:
print(LLAMA_response_13B_bow.shape)
print(LLAMA_response_13B_ngram.shape)
print(LLAMA_response_13B_tfidf.shape)
print(LLAMA_13B_glove.shape)

(198, 67788)
(198, 1466891)
(198, 67788)
(198, 200)


In [None]:
print(LLAMA_response_70B_bow.shape)
print(LLAMA_response_70B_ngram.shape)
print(LLAMA_response_70B_tfidf.shape)
print(LLAMA_70B_glove.shape)

(198, 67788)
(198, 1466891)
(198, 67788)
(198, 200)


In [None]:
print(GPT3_response_175B_bow.shape)
print(GPT3_response_175B_ngram.shape)
print(GPT3_response_175B_tfidf.shape)
print(GPT3_glove.shape)

(198, 67788)
(198, 1466891)
(198, 67788)
(198, 200)


In [None]:
print(GPT4_response_1T_bow.shape)
print(GPT4_response_1T_ngram.shape)
print(GPT4_response_1T_tfidf.shape)
print(GPT4_glove.shape)

(198, 67788)
(198, 1466891)
(198, 67788)
(198, 200)


In [None]:
from scipy.sparse import hstack

# Stack the four feature blocks column-wise into one sparse matrix per model

LLAMA_response_13B_all = hstack((LLAMA_response_13B_bow, LLAMA_response_13B_ngram, LLAMA_response_13B_tfidf, LLAMA_13B_glove))
LLAMA_response_70B_all = hstack((LLAMA_response_70B_bow, LLAMA_response_70B_ngram, LLAMA_response_70B_tfidf, LLAMA_70B_glove))
GPT3_response_175B_all = hstack((GPT3_response_175B_bow, GPT3_response_175B_ngram, GPT3_response_175B_tfidf, GPT3_glove))
GPT4_response_1T_all = hstack((GPT4_response_1T_bow, GPT4_response_1T_ngram, GPT4_response_1T_tfidf, GPT4_glove))

In [None]:
print(LLAMA_response_13B_all.shape)
print(LLAMA_response_70B_all.shape)
print(GPT3_response_175B_all.shape)
print(GPT4_response_1T_all.shape)

(198, 1602667)
(198, 1602667)
(198, 1602667)
(198, 1602667)
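A quick sanity check on that width: the stacked matrix should have as many columns as the four feature blocks combined. The sketch below just adds up the widths printed earlier; `feature_widths` is a name chosen for illustration.

```python
# Column counts taken from the shape printouts above.
feature_widths = {"bow": 67788, "ngram": 1466891, "tfidf": 67788, "glove": 200}

expected_total = sum(feature_widths.values())
print(expected_total)

# The stacked LLM matrices reported 1602667 columns.
assert expected_total == 1602667
```

The widths only guarantee that `.predict()` will accept the matrix. The blocks also need to be stacked in the same order as the training matrix, since the classifier identifies features by column position.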


## 2. Predict and store data


Finally, we can make predictions for each LLM response! We feed our featurized data into our neural net classifier, which returns an array of integer class codes, one per response.

In [None]:
LLAMA_13B_predictions = classifier.predict(LLAMA_response_13B_all)
LLAMA_70B_predictions = classifier.predict(LLAMA_response_70B_all)
GPT3_predictions = classifier.predict(GPT3_response_175B_all)
GPT4 = classifier.predict(GPT4_response_1T_all)

Store predictions in a dataframe.

In [None]:
predictions = pd.DataFrame(columns=['LLAMA_13B_predictions','LLAMA_70B_predictions','GPT3_predictions','GPT4_predictions'])

In [None]:
predictions['LLAMA_13B_predictions'] = LLAMA_13B_predictions
predictions['LLAMA_70B_predictions'] = LLAMA_70B_predictions
predictions['GPT3_predictions'] = GPT3_predictions
predictions['GPT4_predictions'] = GPT4

In [None]:
predictions.head(15)

Unnamed: 0,LLAMA_13B_predictions,LLAMA_70B_predictions,GPT3_predictions,GPT4_predictions
0,0,0,0,0
1,0,0,0,0
2,1,1,1,1
3,1,1,1,1
4,2,1,1,1
5,1,1,1,1
6,1,1,2,1
7,2,1,1,1
8,0,2,2,2
9,2,0,2,2
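Before moving on, a quick tally gives a first impression of the predictions. The sketch below counts the class codes per model and checks how often all four models agree, using only the ten rows shown above. It deliberately works with the raw codes 0, 1, and 2, because which leaning each code stands for depends on the label encoding set up earlier in the notebook.

```python
from collections import Counter

# First ten rows of the predictions table shown above
# (columns: LLAMA_13B, LLAMA_70B, GPT3, GPT4).
rows = [
    (0, 0, 0, 0), (0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 1, 1), (2, 1, 1, 1),
    (1, 1, 1, 1), (1, 1, 2, 1), (2, 1, 1, 1), (0, 2, 2, 2), (2, 0, 2, 2),
]
models = ["LLAMA_13B", "LLAMA_70B", "GPT3", "GPT4"]

# zip(*rows) turns the row tuples into one column of codes per model.
counts = {model: Counter(column) for model, column in zip(models, zip(*rows))}
all_agree = sum(1 for row in rows if len(set(row)) == 1)

for model in models:
    print(model, dict(counts[model]))
print(f"All four models agree on {all_agree} of {len(rows)} rows")
```

On the full data, the same tally could be run on the `predictions` dataframe, for example with `predictions[column].value_counts()` for each column.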


We can save the predictions to a .csv file and then check out each array.

In [None]:
predictions.to_csv('/content/drive/MyDrive/Current Projects 2024/Implications of NLP/Classified lean on LLM responses.csv')

In [None]:
LLAMA_13B_predictions

array([0, 0, 1, 1, 2, 1, 1, 2, 0, 2, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0, 0, 2, 0, 2, 1, 0, 0, 2, 2,
       0, 0, 2, 0, 0, 2, 0, 0, 2, 2, 1, 1, 0, 1, 2, 2, 2, 2, 1, 1, 2, 2,
       1, 0, 1, 1, 0, 0, 1, 2, 1, 1, 2, 1, 0, 1, 0, 2, 0, 0, 1, 2, 1, 2,
       1, 2, 1, 2, 1, 2, 0, 1, 2, 2, 0, 1, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 2, 2, 0, 0, 2, 0, 2, 2, 2, 0, 0, 2, 2, 2,
       2, 2, 0, 0, 0, 0, 1, 1, 1, 0, 2, 0, 0, 0, 2, 0, 0, 0, 1, 2, 2, 0,
       2, 2, 0, 2, 2, 2, 2, 2, 1, 2, 0, 2, 2, 2, 2, 2, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 2, 0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [None]:
LLAMA_70B_predictions

array([0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 2, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 2, 2, 0, 0, 0, 1, 0, 2, 1, 1, 0, 0, 2, 2,
       0, 0, 0, 1, 2, 2, 0, 0, 1, 2, 1, 1, 0, 0, 2, 0, 2, 2, 1, 1, 2, 0,
       0, 0, 1, 1, 0, 0, 2, 2, 1, 1, 1, 1, 0, 0, 0, 2, 1, 0, 0, 1, 1, 1,
       2, 0, 1, 2, 1, 2, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2,
       1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 2, 2, 0, 0,
       2, 2, 0, 2, 2, 2, 2, 2, 1, 1, 1, 0, 2, 0, 2, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 2, 1, 2, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [None]:
GPT3_predictions

array([0, 0, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 2, 2,
       0, 0, 1, 0, 2, 2, 0, 0, 2, 0, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 2, 2, 0, 0, 1, 2, 1, 2,
       1, 2, 2, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 2, 2, 0, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2,
       0, 2, 2, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 0, 2,
       2, 2, 0, 2, 2, 2, 2, 2, 1, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 2, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [None]:
GPT4

array([0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 2, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 2,
       0, 0, 0, 0, 2, 2, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2,
       0, 0, 1, 1, 0, 0, 1, 2, 1, 1, 1, 1, 0, 0, 2, 2, 0, 0, 1, 1, 1, 1,
       2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       2, 2, 0, 0, 1, 0, 0, 2, 0, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 2,
       0, 2, 2, 2, 2, 0, 2, 1, 0, 0, 1, 0, 1, 0, 2, 0, 2, 2, 2, 2, 2, 2,
       0, 2, 0, 0, 2, 2, 2, 0, 2, 0, 0, 0, 2, 2, 1, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])