# Analysing the abstract of "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" using NLP

### Loading in libraries and data

In [1]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

# Punctuation removal
import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\smorr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
abstract = "The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models."

In [3]:
abstract

'The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned appr

In [4]:
# Split text into sentences
sentences = sent_tokenize(abstract)
sentences

['The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English.',
 'BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size.',
 'Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English.',
 'In this paper, we take a step back and ask: How big is too big?',
 'What are the possible risks associated with this technology and what paths are available for mitigating those risks?',
 'We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating
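For intuition about what `sent_tokenize` returns, here is a minimal stdlib-only sketch that splits on whitespace following `.`, `!` or `?`. It is not the Punkt algorithm that NLTK uses, the helper name `naive_sent_split` is hypothetical, and the sample text is a shortened version of two sentences from the abstract. A splitter this simple would break on abbreviations such as "e.g.", which is one reason to prefer `sent_tokenize`.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive; not Punkt)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Shortened sample based on the abstract (assumption: for illustration only)
sample = "In this paper, we take a step back and ask: How big is too big? What are the possible risks?"
for s in naive_sent_split(sample):
    print(s)
```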

In [5]:
# Replace every character that is not a letter (a-zA-Z), digit (0-9), or whitespace (\s) with a space.
clean_sentences = [re.sub(r"[^a-zA-Z0-9\s]", " ", sentence) for sentence in sentences]

# Remove multiple spaces and strip leading/trailing spaces
clean_sentences = [re.sub(r"\s+", " ", sentence).strip() for sentence in clean_sentences]

# Output cleaned sentences - each sentence is an individual string in the list
print(clean_sentences)

['The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models especially for English', 'BERT its variants GPT 2 3 and others most recently Switch C have pushed the boundaries of the possible both through architectural innovations and through sheer size', 'Using these pretrained models and the methodology of fine tuning them for specific tasks researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English', 'In this paper we take a step back and ask How big is too big', 'What are the possible risks associated with this technology and what paths are available for mitigating those risks', 'We provide recommendations including weighing the environmental and financial costs first investing resources into curating and carefully documenting datasets rather than ingesting everything on the web carrying out pre development exercises evaluating how the planned appr
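The output above shows that the cleaning step splits model names: `GPT-2/3` becomes `GPT 2 3` and `Switch-C` becomes `Switch C`. The sketch below reproduces the two `re.sub` calls on a short string and adds an optional `keep` argument for characters that should survive. The function name `clean` and the `keep` parameter are hypothetical and not part of the notebook.

```python
import re

def clean(sentence, keep=""):
    """Replace characters other than letters, digits, whitespace (and any in `keep`) with a space."""
    pattern = rf"[^a-zA-Z0-9\s{re.escape(keep)}]"
    spaced = re.sub(pattern, " ", sentence)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", spaced).strip()

text = "GPT-2/3, and Switch-C!"
print(clean(text))             # same behaviour as the notebook cell
print(clean(text, keep="-/"))  # hyphens and slashes are preserved
```

Whether to keep hyphens and slashes depends on the analysis: keeping them preserves names such as `Switch-C` as single tokens, but it also keeps compounds like `fine-tuning` joined.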

### Tokenization

In [6]:
from nltk.tokenize import word_tokenize

In [7]:
# Tokenize words for each cleaned sentence and flatten the list
words = [word for sentence in clean_sentences for word in word_tokenize(sentence)]

# Output the list of words
print(words)

['The', 'past', '3', 'years', 'of', 'work', 'in', 'NLP', 'have', 'been', 'characterized', 'by', 'the', 'development', 'and', 'deployment', 'of', 'ever', 'larger', 'language', 'models', 'especially', 'for', 'English', 'BERT', 'its', 'variants', 'GPT', '2', '3', 'and', 'others', 'most', 'recently', 'Switch', 'C', 'have', 'pushed', 'the', 'boundaries', 'of', 'the', 'possible', 'both', 'through', 'architectural', 'innovations', 'and', 'through', 'sheer', 'size', 'Using', 'these', 'pretrained', 'models', 'and', 'the', 'methodology', 'of', 'fine', 'tuning', 'them', 'for', 'specific', 'tasks', 'researchers', 'have', 'extended', 'the', 'state', 'of', 'the', 'art', 'on', 'a', 'wide', 'array', 'of', 'tasks', 'as', 'measured', 'by', 'leaderboards', 'on', 'specific', 'benchmarks', 'for', 'English', 'In', 'this', 'paper', 'we', 'take', 'a', 'step', 'back', 'and', 'ask', 'How', 'big', 'is', 'too', 'big', 'What', 'are', 'the', 'possible', 'risks', 'associated', 'with', 'this', 'technology', 'and', 'w

### Remove stop words

In [8]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\smorr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
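# Remove stop words. The comparison is case-sensitive and the NLTK list is
# lowercase, so capitalised stop words such as 'The' and 'We' are kept.
stop_words = set(stopwords.words("english"))  # build the set once for fast lookups
words = [w for w in words if w not in stop_words]
print(words)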

['The', 'past', '3', 'years', 'work', 'NLP', 'characterized', 'development', 'deployment', 'ever', 'larger', 'language', 'models', 'especially', 'English', 'BERT', 'variants', 'GPT', '2', '3', 'others', 'recently', 'Switch', 'C', 'pushed', 'boundaries', 'possible', 'architectural', 'innovations', 'sheer', 'size', 'Using', 'pretrained', 'models', 'methodology', 'fine', 'tuning', 'specific', 'tasks', 'researchers', 'extended', 'state', 'art', 'wide', 'array', 'tasks', 'measured', 'leaderboards', 'specific', 'benchmarks', 'English', 'In', 'paper', 'take', 'step', 'back', 'ask', 'How', 'big', 'big', 'What', 'possible', 'risks', 'associated', 'technology', 'paths', 'available', 'mitigating', 'risks', 'We', 'provide', 'recommendations', 'including', 'weighing', 'environmental', 'financial', 'costs', 'first', 'investing', 'resources', 'curating', 'carefully', 'documenting', 'datasets', 'rather', 'ingesting', 'everything', 'web', 'carrying', 'pre', 'development', 'exercises', 'evaluating', 'pl
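Tokens such as `'The'`, `'In'`, `'How'`, `'What'` and `'We'` survive in the output above because the filter compares them, capitalised, against a lowercase stop-word list. A minimal sketch of a case-insensitive filter follows. To keep it self-contained, it uses a small hand-written `STOP_WORDS` set as a stand-in for `stopwords.words("english")`; the set and the function name `remove_stop_words` are assumptions for illustration.

```python
# Small stand-in for set(stopwords.words("english")); the real NLTK list is much longer.
STOP_WORDS = {"the", "of", "in", "this", "we", "and", "a", "is", "too", "how", "what"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word; keep original casing otherwise."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "past", "3", "years", "of", "work", "In", "this",
          "paper", "we", "ask", "How", "big", "is", "too", "big"]
print(remove_stop_words(tokens))
```

Lowercasing only for the comparison keeps acronyms such as `NLP` and `BERT` in their original form in the filtered list.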

### Stemming and lemmatization

In [10]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\smorr\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [11]:
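# Stemming
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems (the Porter stemmer also lowercases its input)
stemmer = PorterStemmer()  # create the stemmer once and reuse it
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)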

['the', 'past', '3', 'year', 'work', 'nlp', 'character', 'develop', 'deploy', 'ever', 'larger', 'languag', 'model', 'especi', 'english', 'bert', 'variant', 'gpt', '2', '3', 'other', 'recent', 'switch', 'c', 'push', 'boundari', 'possibl', 'architectur', 'innov', 'sheer', 'size', 'use', 'pretrain', 'model', 'methodolog', 'fine', 'tune', 'specif', 'task', 'research', 'extend', 'state', 'art', 'wide', 'array', 'task', 'measur', 'leaderboard', 'specif', 'benchmark', 'english', 'in', 'paper', 'take', 'step', 'back', 'ask', 'how', 'big', 'big', 'what', 'possibl', 'risk', 'associ', 'technolog', 'path', 'avail', 'mitig', 'risk', 'we', 'provid', 'recommend', 'includ', 'weigh', 'environment', 'financi', 'cost', 'first', 'invest', 'resourc', 'curat', 'care', 'document', 'dataset', 'rather', 'ingest', 'everyth', 'web', 'carri', 'pre', 'develop', 'exercis', 'evalu', 'plan', 'approach', 'fit', 'research', 'develop', 'goal', 'support', 'stakehold', 'valu', 'encourag', 'research', 'direct', 'beyond', '

In [12]:
# Lemmatize
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\smorr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
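# Reduce words to their lemmas. With no POS tag, lemmatize() treats every word
# as a noun, so verb forms such as 'pushed' and 'Using' are left unchanged.
lemmatizer = WordNetLemmatizer()  # create the lemmatizer once and reuse it
lemmatized = [lemmatizer.lemmatize(w) for w in words]
print(lemmatized)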

['The', 'past', '3', 'year', 'work', 'NLP', 'characterized', 'development', 'deployment', 'ever', 'larger', 'language', 'model', 'especially', 'English', 'BERT', 'variant', 'GPT', '2', '3', 'others', 'recently', 'Switch', 'C', 'pushed', 'boundary', 'possible', 'architectural', 'innovation', 'sheer', 'size', 'Using', 'pretrained', 'model', 'methodology', 'fine', 'tuning', 'specific', 'task', 'researcher', 'extended', 'state', 'art', 'wide', 'array', 'task', 'measured', 'leaderboards', 'specific', 'benchmark', 'English', 'In', 'paper', 'take', 'step', 'back', 'ask', 'How', 'big', 'big', 'What', 'possible', 'risk', 'associated', 'technology', 'path', 'available', 'mitigating', 'risk', 'We', 'provide', 'recommendation', 'including', 'weighing', 'environmental', 'financial', 'cost', 'first', 'investing', 'resource', 'curating', 'carefully', 'documenting', 'datasets', 'rather', 'ingesting', 'everything', 'web', 'carrying', 'pre', 'development', 'exercise', 'evaluating', 'planned', 'approach'

### Topic modelling using LDA (Latent Dirichlet Allocation)

In [16]:
!pip install gensim

from gensim import corpora
from gensim.models import LdaModel


In [17]:
documents = [lemmatized]  # Wrap it in a list to create a single document

# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(documents)

# Create a corpus: List of bag-of-words representations
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Fit LDA with two topics; with only one document, expect the topics to overlap heavily
lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)

# Print the topics
topics = lda_model.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.011*"development" + 0.011*"model" + 0.011*"3" + 0.011*"language"')
(1, '0.022*"model" + 0.022*"development" + 0.016*"English" + 0.016*"big"')
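The corpus passed to `LdaModel` is a list of bag-of-words vectors: each document becomes a list of `(token_id, count)` pairs. The stdlib-only sketch below illustrates that representation. The helper names `build_vocab` and `to_bow` are hypothetical, and assigning ids in order of first appearance is an assumption, so the ids will generally differ from those gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of one document."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Two toy documents (assumption: illustrative tokens, not the full abstract)
docs = [["model", "big", "model"], ["risk", "model"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

In the notebook the whole abstract is wrapped as a single document, so the corpus holds exactly one such vector. One option worth trying is to pass one token list per sentence instead, which would give LDA several documents to compare.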


### Sentiment Analysis

In [24]:
from textblob import TextBlob

In [25]:
# Analyze sentiment
for sentence in clean_sentences:
    blob = TextBlob(sentence)
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {blob.sentiment}")  # namedtuple (polarity in [-1, 1], subjectivity in [0, 1])

Sentence: The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models especially for English
Sentiment: Sentiment(polarity=-0.0625, subjectivity=0.4375)
Sentence: BERT its variants GPT 2 3 and others most recently Switch C have pushed the boundaries of the possible both through architectural innovations and through sheer size
Sentiment: Sentiment(polarity=0.125, subjectivity=0.625)
Sentence: Using these pretrained models and the methodology of fine tuning them for specific tasks researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English
Sentiment: Sentiment(polarity=0.06333333333333332, subjectivity=0.22999999999999998)
Sentence: In this paper we take a step back and ask How big is too big
Sentiment: Sentiment(polarity=0.0, subjectivity=0.06666666666666667)
Sentence: What are the possible risks associated with this technology and what paths are availa