## Fine Tuning

### Notes
- Transfer learning from large pre-trained models to new small dataset 
    - Fine tuning existing model instead of training new model from scratch
- The transformers library makes it quick to fine-tune a BERT model (http://jalammar.github.io/illustrated-bert/)
    - Google BERT (Bidirectional Encoder Representations from Transformers): semi-supervised training on large amounts of text (strong ability to recognize language patterns)
        - Supervised usage: classify sentences (spam vs no spam), sentiment analysis 
    - Composed of large stack of encoders (1. self attention 2. feed forward NN)
    - Putting a sentence through BERT encodes it as a fixed-size vector (768 values for BERT base), which can then be fed to a classifier, e.g. for sentiment analysis
        - "Vector hand off" occurs between BERT and classifier
    - Word embedding with ELMo (embeds based on the context of a word in its sentence, e.g. 'nature' has multiple meanings, each with its own embedding)
        - Needs to be trained on LARGE dataset then can be used as component in other models
    - ULMFiT has a process to fine-tune language models
    - Fine-tuning JUST the decoder of a transformer
    - Using BERT to create contextualized word embeddings THEN feed into existing "model" (what model?)
- Transformer
    - Uses "attention" to boost training speed
    - Built of encoder (input) and decoder (output) stacks
    - Flow:
        - 1) Embedding algorithm (word2vec, GloVe) turns each input word into a vector (each word is a 512-dimensional vector)
        - 2) Self-attention builds more vectors for each word (query, key, and value) to capture its meaning in context
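
The flow above can be sketched with a toy version of scaled dot-product self-attention. This is a minimal illustration, not the full Transformer: the 4-dimensional vectors are made up, and the queries, keys, and values are assumed to be the raw word vectors (a real model first multiplies them by learned weight matrices).

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of word vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # score this word against every word in the sentence
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # weighted sum of the value vectors
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 3-word sentence with 4-dimensional vectors (hypothetical values).
words = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]
outputs, weights = self_attention(words, words, words)
```

Each row of `weights` says how much one word attends to every other word, and each row of `outputs` is that word's new, context-aware vector.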

### Actions
- Building sentiment analysis that looks at the ORDER of the words (https://nlp.stanford.edu/sentiment/code.html)
    - Recursive NN built on top of grammatical structure
    - Sentiment Treebank built by Stanford
- Implementing Transformer w/ PyTorch (http://nlp.seas.harvard.edu/2018/04/03/attention.html)
- BERT is too big to implement so trying ALBERT (https://medium.com/spark-nlp/1-line-to-albert-word-embeddings-with-nlu-in-python-1691bc048ed1)
    - ALBERT uses two parameter-reduction techniques (factorized embedding parameterization and cross-layer parameter sharing), making it smaller and quicker to train than BERT while keeping self-supervised pre-training
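
To get a feel for why ALBERT is smaller, here is a rough count for one of its parameter-reduction techniques, factorized embedding parameterization as described in the ALBERT paper: instead of one vocabulary-by-hidden-size embedding matrix, it uses a small embedding followed by a projection. The sizes below (30,000-word vocabulary, hidden size 768, embedding size 128) are round illustrative values and should be treated as assumptions.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, optionally with a factorized embedding."""
    if embedding_size is None:
        # one big matrix: vocab_size x hidden_size
        return vocab_size * hidden_size
    # two smaller matrices: vocab_size x embedding_size, then embedding_size x hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

VOCAB, HIDDEN, EMBED = 30_000, 768, 128  # illustrative sizes (assumption)

direct = embedding_params(VOCAB, HIDDEN)             # 23,040,000
factorized = embedding_params(VOCAB, HIDDEN, EMBED)  # 3,938,304
print(direct, factorized, round(direct / factorized, 2))
```

This only counts the embedding layer; sharing parameters across encoder layers reduces the rest of the model as well.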

In [1]:
import pandas as pd
import requests # download texts

## 1. Downloading data

In [2]:
# download book data
response1 = requests.get('http://classics.mit.edu/Antoninus/meditations.mb.txt') # Meditations by Marcus Aurelius
response2 = requests.get('http://classics.mit.edu/Epictetus/discourses.mb.txt') # Discourses by Epictetus
response3 = requests.get('http://classics.mit.edu/Epictetus/epicench.1b.txt') # Enchiridion by Epictetus
filename4 = open('/Users/stellajia/Desktop/UCSB/Fall-2022/ENGL197/stoic-analysis/data/shortness-of-life.txt') # Shortness of Life by Seneca
filename5 = open('/Users/stellajia/Desktop/UCSB/Fall-2022/ENGL197/stoic-analysis/data/letters-from-a-stoic.txt') # Letters from a Stoic by Seneca

doc1 = response1.text
doc2 = response2.text
doc3 = response3.text
doc4 = filename4.read()
doc5 = filename5.read()

filename6 = open('/Users/stellajia/Desktop/UCSB/Fall-2022/ENGL197/stoic-analysis/data/daily-stoic.txt') # Daily Stoic by Ryan Holiday (2016)
filename7 = open('/Users/stellajia/Desktop/UCSB/Fall-2022/ENGL197/stoic-analysis/data/guide-to-good-life.txt') # Guide to a Good Life by William Irvine (2008)
filename8 = open('/Users/stellajia/Desktop/UCSB/Fall-2022/ENGL197/stoic-analysis/data/stillness-is-key.txt') # Stillness is Key by Ryan Holiday (2019)
filename9 = open('/Users/stellajia/Desktop/UCSB/Fall-2022/ENGL197/stoic-analysis/data/how-to-be-a-stoic.txt') # How to Be a Stoic by Massimo Pigliucci (2017)
filename10 = open('/Users/stellajia/Desktop/UCSB/Fall-2022/ENGL197/stoic-analysis/data/think-like-a-roman-emperor.txt') # How to Think Like a Roman Emperor by Donald Robertson (2019)

doc6 = filename6.read()
doc7 = filename7.read()
doc8 = filename8.read()
doc9 = filename9.read()
doc10 = filename10.read()

In [3]:
# load twitter data
tweets = pd.read_csv('/Users/stellajia/Desktop/UCSB/Fall-2022/ENGL197/stoic-analysis/data/stoicism-tweets.csv') # from netlytic
tweets2 = pd.read_excel('/Users/stellajia/Downloads/stoicism/stoicism-tweet.xls') # from socioviz

## 2. Cleaning data

### 2.1 Cleaning tweets

In [4]:
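# remove links from tweets
tweets2_1 = []
for i in range(len(tweets2)):
    text = tweets2['Text'][i]
    if 'https' in text:  # check this tweet's text, not the whole column
        tweets2_1.append(text[:-24])  # assumes the link sits at the end and takes up the last 24 characters
    else:
        tweets2_1.append(text)
# tweets2_1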
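
Slicing off a fixed number of characters only works when the link is the last thing in the tweet. A pattern-based alternative is sketched below; `strip_urls` is a hypothetical helper, and the pattern is a simple assumption (anything starting with `http://` or `https://` up to the next whitespace), not a full URL parser.

```python
import re

# Assumption: a link is "http://" or "https://" followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    """Remove every link from a tweet and tidy up the leftover whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())

print(strip_urls("Adversity reveals character https://t.co/abc123XYZ0"))
print(strip_urls("Read https://example.com/stoic before bed"))
```

With a pandas column, the same helper could be applied with `tweets2['Text'].apply(strip_urls)`.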

In [5]:
# combine tweets
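# Series.append was removed in pandas 2.0; pd.concat is the equivalent call.
# Each source keeps its own index here, so index labels repeat across the two sources
# (pass ignore_index=True to pd.concat for a fresh 0..n-1 index).
all_tweets = pd.concat([tweets2['Text'], tweets['description']])
all_tweets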

0      Western societies have become soft. The collap...
1      All this fire at DeSantis and he hasn’t return...
2      “Mean tweets” is funny when it’s banter. But t...
3      @BillRevans So very sorry to hear this, and so...
4      @SterlingSpector @womnofvalr Buddhism and Stoi...
                             ...                        
426    If you find something very difficult to achiev...
427    Vanitas Still Life  by Pieter Claesz, 1630.  (...
428    Contrary to what most "Stoic guys" say, Stoici...
429    If you find something very difficult to achiev...
430    Adversity Reveals - DAY 179 - The Daily Stoic ...
Length: 531, dtype: object

In [6]:
tweets_df = all_tweets.to_frame(name='tweet')  # name the column directly
tweets_df[['tweet']]

| | tweet |
|---|---|
| 0 | Western societies have become soft. The collap... |
| 1 | All this fire at DeSantis and he hasn’t return... |
| 2 | “Mean tweets” is funny when it’s banter. But t... |
| 3 | @BillRevans So very sorry to hear this, and so... |
| 4 | @SterlingSpector @womnofvalr Buddhism and Stoi... |
| ... | ... |
| 426 | If you find something very difficult to achiev... |
| 427 | Vanitas Still Life by Pieter Claesz, 1630. (... |
| 428 | Contrary to what most "Stoic guys" say, Stoici... |
| 429 | If you find something very difficult to achiev... |


In [7]:
all_tweets[2]

2    “Mean tweets” is funny when it’s banter. But t...
2    People talk a lot about #Stoicism but don\'t n...
dtype: object
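
The lookup above returns two rows because the combined Series still carries the index labels of both sources, so label 2 exists twice. A small sketch with made-up data shows the difference between keeping the original labels and asking `pd.concat` for a fresh index; the Series contents here are placeholders, not real tweets.

```python
import pandas as pd

# Two small stand-in Series, each with its own 0..2 index (hypothetical data).
a = pd.Series(["tweet a0", "tweet a1", "tweet a2"])
b = pd.Series(["tweet b0", "tweet b1", "tweet b2"])

kept = pd.concat([a, b])                      # index labels: 0, 1, 2, 0, 1, 2
fresh = pd.concat([a, b], ignore_index=True)  # index labels: 0, 1, 2, 3, 4, 5

dup_rows = kept.loc[2]   # label 2 appears twice, so this is a 2-row Series
one_row = fresh.loc[2]   # label 2 is unique, so this is a single string

print(dup_rows)
print(one_row)
```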

In [16]:
tweets['description'][2]

"People talk a lot about #Stoicism but don\\'t necessarily put it into practice in daily life. There are many Stoic exercises. I honestly think that for *some* people the key would be contemplating the View from Above regularly – it encapsulates the philosophy. - Donald Robertson"

### 2.2 Cleaning books

In [27]:
# store books
stoicDocs = [doc1, doc2, doc3, doc4, doc5, doc6, doc7, doc8, doc9, doc10]
# join the books into one string
stoicString = " ".join(stoicDocs)

# load stop words into a set so each membership test is fast
with open("/Users/stellajia/Desktop/UCSB/Fall-2022/ENGL197/stoic-analysis/data/buckley-salton.txt") as f:
    stop_words = set(f.read().split())
# remove stop words
stoicWords = [word for word in stoicString.split() if word.lower() not in stop_words]

In [28]:
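# note: the first value counts characters in the joined string,
# the second counts words left after stop-word removal
print("Old length: ", len(stoicString))
print("New length: ", len(stoicWords))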

Old length:  2810288
New length:  207857
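
For a like-for-like comparison, both counts can be taken in words. The sketch below also strips punctuation before the stop-word check, since a token such as `the,` would otherwise slip past the filter. `remove_stop_words`, the sample sentence, and the two-word stop list are all made up for illustration.

```python
import string

def remove_stop_words(text, stop_words):
    """Split on whitespace, strip surrounding punctuation, drop stop words."""
    stop_set = {w.lower() for w in stop_words}
    kept = []
    for raw in text.split():
        word = raw.strip(string.punctuation)
        if word and word.lower() not in stop_set:
            kept.append(word)
    return kept

sample = "The obstacle, in the end, becomes the way."
stops = ["the", "in"]  # tiny stand-in for the Buckley-Salton list

kept = remove_stop_words(sample, stops)
words_before = len(sample.split())
words_after = len(kept)
print(words_before, words_after, kept)
```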


In [29]:
# tokenize 
from transformers import AutoTokenizer



In [None]:
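from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") # pretrained English model trained with masked language modeling

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# .map() is a datasets.Dataset method, not a list method, so wrap the words in a Dataset first
# (each row is a single word here; chunking into sentences or passages would likely be more useful)
stoic_dataset = Dataset.from_dict({"text": stoicWords})
tokenized_datasets = stoic_dataset.map(tokenize_function, batched=True)

# STILL NEED TO CLEAN TWEETS!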
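
With `truncation=True`, anything past the tokenizer's maximum length is dropped, so long book text needs to be split into smaller pieces before tokenizing. One possible approach, sketched under assumptions: `chunk_words` is a hypothetical helper, and the chunk size and overlap are arbitrary word counts, not tuned values (word counts are only a rough proxy for token counts).

```python
def chunk_words(words, chunk_size=200, overlap=20):
    """Group a list of words into overlapping text chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end
    return chunks

# Small demonstration with numbered stand-in "words".
demo_words = [str(i) for i in range(10)]
demo_chunks = chunk_words(demo_words, chunk_size=4, overlap=1)
print(demo_chunks)
```

The resulting list of strings could then be wrapped as the `"text"` column of a dataset and passed through the tokenizer.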

## 3. Using ALBERT

### 3.1 First attempt - NLU

In [9]:
import nlu # access to a bunch of models

In [103]:
# test on one sentence
pipe = nlu.load('albert')
pipe.predict('Stoicism is about living in tranquility')

albert_base_uncased download started this may take some time.
Approximate size to download 42.7 MB
Download done! Loading the resource.
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
Download done! Loading the resource.
[OK!]

| | token | word_embedding_albert |
|---|---|---|
| 0 | Stoicism | [-1.0533853769302368, -0.8770667314529419, 0.8... |
| 0 | is | [1.0381802320480347, -1.4269936084747314, 0.93... |
| 0 | about | [-0.38724860548973083, -1.548833966255188, 0.8... |
| 0 | living | [-0.7819571495056152, -1.5227669477462769, -0.... |
| 0 | in | [0.6698690056800842, -1.0752289295196533, 2.27... |
| 0 | tranquility | [-0.8311948776245117, -0.32697033882141113, 2.... |
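
Once each token has an embedding, a common way to compare two of them is cosine similarity. The sketch below uses short made-up vectors rather than real ALBERT output; `cosine_similarity` is a plain-Python helper written for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction, 0 = unrelated)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # avoid dividing by zero for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, just to show the call.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```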


In [123]:
# adding more models: sentiment classifier, emotion classifier, albert
pipe = nlu.load('sentiment emotion albert') 
predictions = pipe.predict(tweets_df[['tweet']], output_level='token')

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
Download done! Loading the resource.
[OK!]
classifierdl_use_emotion download started this may take some time.
Approximate size to download 21.3 MB
Download done! Loading the resource.
[OK!]
albert_base_uncased download started this may take some time.
Approximate size to download 42.7 MB
[OK!]
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
Download done! Loading the resource.
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]

Exception: Something went wrong during completing the DAG for the Spark NLP Pipeline.If this error persists, please contact us in Slack https://join.slack.com/t/spark-nlp/shared_invite/zt-lutct9gm-kuUazcyFKhuGY3_0AMkxqA Or open an issue on Github https://github.com/JohnSnowLabs/nlu/issues

### 3.2 Second attempt - johnsnowlabs

In [9]:
# nlu not working so trying something else
from johnsnowlabs import nlp # another way to access albert
from pyspark.sql import SparkSession

In [10]:
nlp.load('sentiment').predict('Wow that easy!')

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
Download done! Loading the resource.
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
Error: Unexpected end of ZLIB input stream
22/11/16 18:48:26 ERROR ResourceDownloader$: Unexpected end of ZLIB input stream
[OK!]

Exception: Something went wrong during completing the DAG for the Spark NLP Pipeline.If this error persists, please contact us in Slack https://join.slack.com/t/spark-nlp/shared_invite/zt-lutct9gm-kuUazcyFKhuGY3_0AMkxqA Or open an issue on Github https://github.com/JohnSnowLabs/nlu/issues

### 3.3 Third attempt - transformers
- Downloading a deep learning library (PyTorch)

In [14]:
from transformers import pipeline



In [18]:
print(pipeline('sentiment-analysis')('statistics is hard to learn'))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9977403879165649}]
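
The warning above suggests naming the model explicitly. A minimal sketch of that, using the same model the pipeline defaulted to: `build_classifier` and `label_counts` are hypothetical helper names, and the classifier is only constructed when `build_classifier()` is called, since doing so downloads the model.

```python
from collections import Counter

# The model the pipeline fell back to in the output above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"

def build_classifier(model_name=MODEL_NAME):
    """Create a sentiment pipeline with an explicit model (downloads on first use)."""
    from transformers import pipeline  # imported lazily; requires transformers and a backend
    return pipeline("sentiment-analysis", model=model_name)

def label_counts(results):
    """Tally labels from a list of {'label': ..., 'score': ...} dicts."""
    return Counter(r["label"] for r in results)

# Example usage (not run here because it needs a model download):
#   classifier = build_classifier()
#   results = classifier(["statistics is hard to learn",
#                         "Stoicism is about living in tranquility"])
#   print(label_counts(results))
```

Passing a list of strings returns one result dict per string, which makes it easy to run the classifier over a whole column of tweets and summarize the labels.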


## 4. Self-supervised learning

- Self supervised sentiment analyzer - https://www.sciencedirect.com/science/article/pii/S2666827021000074
    - "SSentiA utilizing limited labeled data can yield similar performance to a fully labeled training dataset"
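
One generic way to work with limited labeled data is pseudo-labeling: keep only the predictions a model is confident about and treat them as labels for further training. The sketch below illustrates that general idea only; it is not claimed to be SSentiA's actual procedure, and the helper name, the threshold, and the example predictions are all made up.

```python
def select_pseudo_labels(texts, predictions, threshold=0.95):
    """Split texts into confidently labeled pairs and texts left unlabeled."""
    confident, uncertain = [], []
    for text, pred in zip(texts, predictions):
        if pred["score"] >= threshold:
            confident.append((text, pred["label"]))
        else:
            uncertain.append(text)
    return confident, uncertain

# Hypothetical predictions in the same shape as the pipeline output above.
texts = ["tweet one", "tweet two", "tweet three"]
predictions = [
    {"label": "NEGATIVE", "score": 0.99},
    {"label": "POSITIVE", "score": 0.61},
    {"label": "POSITIVE", "score": 0.97},
]

confident, uncertain = select_pseudo_labels(texts, predictions)
print(confident)
print(uncertain)
```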
    