# DS 5899 Project 1 Starter Code

This notebook is a guide to developing a supervised sentiment analysis model for the news articles in the provided dataset. The default sentiment labels are positive, negative, and neutral, but each team may define its own more granular scale based on those three. In addition to the data provided in `project1_annotation_cleaned.xlsx`, teams may use daily returns or other stock data from Yahoo Finance. Once the sentiment classification model is developed, the predicted sentiment, daily returns, and other information will be used to construct asset portfolios. The final deliverables are:
- All project source code in a private GitHub repo within the DSI org (https://github.com/vanderbilt-data-science). Please use the prefix `5899-spring24` for your project name.
- All final versions of the data used for the analysis and constructed portfolios
- A README page detailing the results and performance of your portfolios
- A team project presentation 

## About this notebook

The experimental framework defined below is based on the Stanford Sentiment Treebank ([SST](https://nlp.stanford.edu/sentiment/index.html)). You don't have to use it, but it may be helpful to you. 

## Set-up

See [the first notebook in this unit](sst_01_overview.ipynb#Set-up) for set-up instructions.

In [6]:
from collections import Counter
import numpy as np
import os
import pandas as pd
from sklearn.linear_model import LogisticRegression
import torch.nn as nn

from torch_rnn_classifier import TorchRNNClassifier
from torch_tree_nn import TorchTreeNN
import sst
import utils

## A softmax baseline


In [14]:
def unigrams_phi(text):
    return Counter(text.split())
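
As a quick illustration of what this feature function produces, the sketch below applies it to a made-up headline (the example string is an assumption, not taken from the dataset). Note that whitespace splitting keeps punctuation attached to tokens.

```python
from collections import Counter

def unigrams_phi(text):
    # Bag-of-words features: map each whitespace-delimited token to its count.
    return Counter(text.split())

# Hypothetical headline, used only for illustration.
feats = unigrams_phi("Shares rise as profits rise again.")
print(feats["rise"])    # 2
print(feats["again."])  # 1 (the period stays attached to the token)
print(feats["again"])   # 0 (Counter returns 0 for unseen keys)
```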

Thin wrapper around `LogisticRegression` for the sake of `sst.experiment`:

In [15]:
def fit_softmax_classifier(X, y):
    mod = LogisticRegression(
        fit_intercept=True,
        solver='liblinear',
        multi_class='ovr')
    mod.fit(X, y)
    return mod

The experimental run with some notes:

In [None]:
# For detailed function usage, see sst.py
softmax_experiment = sst.experiment(
    your_train_dataframe, # TODO: add your own here
    unigrams_phi,                
    fit_softmax_classifier,       
    assess_dataframes=[your_eval_datasets]) # TODO: add your own here

`softmax_experiment` contains a lot of information that you can use for error analysis; see [this section below](#Error-analysis) for starter code.

## RNNClassifier wrapper

This section illustrates how to use `sst.experiment` with `TorchRNNClassifier`.

To featurize examples for an RNN, we can just get the words in order, letting the model take care of mapping them into an embedding space.

In [17]:
def rnn_phi(text):
    return text.split()

The model wrapper gets the vocabulary using `utils.get_vocab`. If you want to use pretrained word representations, you can have `fit_rnn_classifier` build that embedding space too.

In [18]:
def fit_rnn_classifier(X, y):
    # Keep only tokens that occur at least twice in the training data:
    vocab = utils.get_vocab(X, mincount=2)
    mod = TorchRNNClassifier(
        vocab,
        early_stopping=True)
    mod.fit(X, y)
    return mod
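
To make the role of `mincount` concrete, here is a minimal sketch of how a frequency-thresholded vocabulary could be built. `build_vocab` is a hypothetical stand-in written for this illustration, not the actual `utils.get_vocab` implementation, and the `$UNK` placeholder for out-of-vocabulary words is likewise an assumption.

```python
from collections import Counter

def build_vocab(X, mincount=1, unk="$UNK"):
    """Hypothetical stand-in for a vocabulary builder (not `utils.get_vocab`).

    X is a list of token lists, as produced by `rnn_phi`.
    """
    counts = Counter(tok for example in X for tok in example)
    # Keep tokens that appear at least `mincount` times.
    vocab = {tok for tok, c in counts.items() if c >= mincount}
    # Reserve a placeholder for everything that was filtered out.
    vocab.add(unk)
    return sorted(vocab)

X_toy = [s.split() for s in ["profits rise", "profits fall", "shares rise"]]
vocab = build_vocab(X_toy, mincount=2)
print(vocab)  # ['$UNK', 'profits', 'rise']
```

Raising `mincount` shrinks the embedding table and pushes rare words onto the shared placeholder, which may matter for a small annotated dataset.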

In [None]:
# For detailed function usage, see sst.py
rnn_experiment = sst.experiment(
    your_train_dataframe, # TODO: add your own here
    rnn_phi,
    fit_rnn_classifier,
    vectorize=False,  # For deep learning, use `vectorize=False`.
    assess_dataframes=[your_eval_datasets]) # TODO: add your own here


## Error analysis

This section begins to build an error-analysis framework using the dicts returned by `sst.experiment`. These have the following structure:

```
'model': trained model
'phi': the feature function used
'train_dataset':
   'X': feature matrix
   'y': list of labels
   'vectorizer': DictVectorizer,
   'raw_examples': list of raw inputs, before featurizing   
'assess_datasets': list of datasets, each with the same structure as the value of 'train_dataset'
'predictions': list of lists of predictions on the assessment datasets
'metric': `score_func.__name__`, where `score_func` is an `sst.experiment` argument
'score': the `score_func` score on each of the assessment datasets
```
The following function compares predictions with gold labels and returns a `pd.DataFrame` with a boolean `correct` column, so that mistakes are easy to filter in subsequent processing:

In [20]:
def find_errors(experiment):
    """Find mistaken predictions.

    Parameters
    ----------
    experiment : dict
        As returned by `sst.experiment`.

    Returns
    -------
    pd.DataFrame

    """
    dfs = []
    for i, dataset in enumerate(experiment['assess_datasets']):
        df = pd.DataFrame({
            'raw_examples': dataset['raw_examples'],
            'predicted': experiment['predictions'][i],
            'gold': dataset['y']})
        df['correct'] = df['predicted'] == df['gold']
        df['dataset'] = i
        dfs.append(df)
    return pd.concat(dfs)
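
A hypothetical usage sketch: the toy `toy_experiment` dict below mimics only the keys that `find_errors` reads (its contents are made up), and the function is repeated so the block runs on its own. Filtering on the `correct` column isolates the misclassified examples.

```python
import pandas as pd

def find_errors(experiment):
    # Same logic as above, repeated so this block is self-contained.
    dfs = []
    for i, dataset in enumerate(experiment['assess_datasets']):
        df = pd.DataFrame({
            'raw_examples': dataset['raw_examples'],
            'predicted': experiment['predictions'][i],
            'gold': dataset['y']})
        df['correct'] = df['predicted'] == df['gold']
        df['dataset'] = i
        dfs.append(df)
    return pd.concat(dfs)

# Made-up stand-in for the dict returned by `sst.experiment`.
toy_experiment = {
    'assess_datasets': [{
        'raw_examples': ['profits soar', 'shares plunge', 'board meets'],
        'y': ['positive', 'negative', 'neutral']}],
    'predictions': [['positive', 'neutral', 'neutral']]}

analysis = find_errors(toy_experiment)
errors = analysis[~analysis['correct']]
# Count errors by (gold, predicted) pair to see which confusions dominate.
confusions = errors.groupby(['gold', 'predicted']).size()
print(errors[['raw_examples', 'gold', 'predicted']])
print(confusions)
```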

## Your implementations below
The following code cells contain a number of functions that your team will need to implement. 

In [27]:
# Implement the `get_token_counts` function such that, given a `pd.DataFrame` in the format of our datasets, 
# it tokenizes the example sentences based on whitespace and creates a count distribution over all of the tokens. 
# The function should return a `pd.Series` sorted by frequency; if you create a count dictionary `d`, then 
# `pd.Series(d).sort_values(ascending=False)` will give you what you need.

def get_token_counts(df):
    
    token_counts = ##### YOUR CODE HERE
    return token_counts


In [28]:
def test_get_token_counts(func):
    df = pd.DataFrame([
        {'sentence': 'a a b'},
        {'sentence': 'a b a'},
        {'sentence': 'a a a b.'}])
    result = func(df)
    for token, expected in (('a', 7), ('b', 2), ('b.', 1)):
        actual = result.loc[token]
        assert actual == expected, \
            "For token {}, expected {}; got {}".format(
            token, expected, actual)
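
The test above depends on whitespace tokenization treating `b` and `b.` as different tokens. The sketch below shows that counting step in plain Python on the same three sentences; it builds only the count dictionary and leaves the `pd.DataFrame` input handling and the `pd.Series` conversion to your implementation. The name `count_tokens` is hypothetical.

```python
from collections import Counter

def count_tokens(sentences):
    # Hypothetical helper: whitespace-tokenize each sentence and tally tokens.
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts

d = count_tokens(['a a b', 'a b a', 'a a a b.'])
print(d.most_common())  # [('a', 7), ('b', 2), ('b.', 1)]
```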

The primary goal of these cells is to get you thinking more about the unigram feature representation (`unigrams_phi`), which is a strong baseline for your dataset.



In [43]:
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier


# The following is a model wrapper function around TorchShallowNeuralClassifier. This function should implement 
# hyperparameter search according to this specification:

# * Set `early_stopping=True` for all experiments.
# * Using 3-fold cross-validation, exhaustively explore this set of hyperparameter combinations:
#   * The hidden dimensionality at 50, 100, and 200.
#   * The hidden activation function as `nn.Tanh()` and `nn.ReLU()`.
# * For all other parameters to `TorchShallowNeuralClassifier`, use the defaults.

def fit_shallow_neural_classifier_with_hyperparameter_search(X, y):
    
    ##### YOUR CODE HERE
    base_model = # TorchShallowNeuralClassifier(##### YOUR CODE HERE)




    opt_model = # utils.fit_classifier_with_hyperparameter_search(##### YOUR CODE HERE)
    return opt_model
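
The specification in the comments above describes a grid of 3 hidden sizes times 2 activations, so an exhaustive search with 3-fold cross-validation involves 6 × 3 = 18 cross-validation fits. The sketch below only enumerates that grid; the parameter names `hidden_dim` and `hidden_activation` are assumptions about the `TorchShallowNeuralClassifier` keyword arguments, and strings stand in for the `nn.Tanh()` and `nn.ReLU()` modules so the block runs without PyTorch.

```python
from itertools import product

# Assumed keyword names; strings stand in for the activation modules.
param_grid = {
    'hidden_dim': [50, 100, 200],
    'hidden_activation': ['nn.Tanh()', 'nn.ReLU()']}

# Cartesian product of all candidate values, one dict per setting.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 3
for combo in combos:
    print(combo)
print(len(combos), 'settings x', n_folds, 'folds =', len(combos) * n_folds, 'fits')
```

Because every fit trains a neural network, it may be worth timing a single fit on your data before launching the full search.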

## Exploring the use of BERT as an encoder for text data


In [None]:
from transformers import BertModel, BertTokenizer
import vsm


# Set up a BERT tokenizer and model based on `bert_weights_name`:
bert_weights_name = 'bert-base-uncased'
##### YOUR CODE HERE


# Implement the function `hf_cls_phi`, which uses Hugging Face functionality to encode individual examples with BERT
# and returns the final output representation above the [CLS] token.

def hf_cls_phi(text):
    # Get the ids. `vsm.hf_encode` will help; be sure to
    # set `add_special_tokens=True`.
    ##### YOUR CODE HERE


    # Get the BERT representations. `vsm.hf_represent` will help:
    ##### YOUR CODE HERE


    # Index into `reps` to get the representation above [CLS].
    # The shape of `reps` should be (1, n, 768), where n is the
    # number of tokens. You need the 0th element of the 2nd dim:
    ##### YOUR CODE HERE
    # cls_rep = 

    # These conversions should ensure that you can work with the
    # representations flexibly. 
    return cls_rep.cpu().numpy()
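
To illustrate the indexing described in the comments, the sketch below uses a random NumPy array as a stand-in for `reps` (in the actual function `reps` would be a tensor produced by the BERT model, and the token count of 5 is arbitrary). Selecting position 0 along the token dimension and dropping the batch dimension yields the `(768,)` shape that `test_hf_cls_phi` expects.

```python
import numpy as np

# Stand-in for BERT output with shape (batch=1, n_tokens=5, hidden=768).
rng = np.random.default_rng(0)
reps = rng.normal(size=(1, 5, 768))

# Position 0 of the token dimension is where [CLS] sits when special
# tokens are added; indexing the batch dimension as well removes it.
cls_rep = reps[0, 0, :]
print(cls_rep.shape)  # (768,)
```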

In [45]:
def test_hf_cls_phi(func):
    rep = func("Just testing!")

    expected_shape = (768,)
    result_shape = rep.shape
    assert rep.shape == (768,), \
        "Expected shape {}; got {}".format(
        expected_shape, result_shape)

    # String conversion to avoid precision errors:
    expected_first_val = str(0.1709)
    result_first_val = "{0:.04f}".format(rep[0])

    assert expected_first_val == result_first_val, \
        ("Unexpected representation values. Expected the "
        "first value to be {}; got {}".format(
            expected_first_val, result_first_val))

The following `predict_one_*` functions show how to map a text (str) directly to a label prediction: one of 'positive', 'negative', or 'neutral'. You may need to modify the code below to work with your dataset.

In [56]:
def predict_one_softmax(text):
    # Singleton list of feature dicts:
    feats = [softmax_experiment['phi'](text)]
    # Vectorize to get a feature matrix:
    X = softmax_experiment['train_dataset']['vectorizer'].transform(feats)
    # Standard sklearn `predict` step:
    preds = softmax_experiment['model'].predict(X)
    # Be sure to return the only member of the predictions,
    # rather than the singleton list:
    return preds[0]
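
The three steps in `predict_one_softmax` (featurize, vectorize, predict) can be traced end to end on toy data. The sketch below uses made-up training texts and trains its own small model with default `LogisticRegression` settings rather than relying on `softmax_experiment`. It also shows that `DictVectorizer.transform` silently drops tokens that were not seen during fitting.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def unigrams_phi(text):
    return Counter(text.split())

# Made-up training data, for illustration only.
train_texts = ['profits soar', 'strong growth', 'shares plunge',
               'heavy losses', 'board meets', 'report published']
train_labels = ['positive', 'positive', 'negative',
                'negative', 'neutral', 'neutral']

vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform([unigrams_phi(t) for t in train_texts])
model = LogisticRegression().fit(X_train, train_labels)

def predict_one(text):
    feats = [unigrams_phi(text)]      # singleton list of feature dicts
    X = vectorizer.transform(feats)   # 1 x n_features matrix
    return model.predict(X)[0]        # single label, not an array

# 'unexpectedly' was never seen in training, so it contributes no feature.
X_new = vectorizer.transform([unigrams_phi('profits soar unexpectedly')])
print(X_new.shape, X_new.nnz)
print(predict_one('profits soar unexpectedly'))
```

The key point is that new text must pass through the same fitted vectorizer as the training data, so that feature columns line up with the model's weights.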

If you used an RNN like the one we demoed above, then featurization is a bit more straightforward:

In [57]:
def predict_one_rnn(text):
    # List of tokenized examples:
    X = [rnn_experiment['phi'](text)]
    # Standard `predict` step on a list of lists of str:
    preds = rnn_experiment['model'].predict(X)
    # Be sure to return the only member of the predictions,
    # rather than the singleton list:
    return preds[0]