Homework 3: Sentiment Analysis
----

The following instructions apply to all notebooks and `.py` files you submit for this homework.

Due date: April 15th, 2024 11:59 PM (EST)

Total Points: (105)
- Task 0: 05 points
- Task 1: 10 points
- Task 2: 20 points
- Task 3: 25 points
- Task 4: 40 points (question in LSTM_EncDec.ipynb)

Goals:
- understand the difficulties of counting and probabilities in NLP applications
- work with real world data using different approaches to classification
- stress test your model (to some extent)


Allowed python modules:
- `numpy`, `matplotlib`, `keras`, `pytorch`, `nltk`, `pandas`, `scikit-learn` (`sklearn`), `seaborn`, and all built-in python libraries (e.g. `math` and `string`)
- if you would like to use a library not on this list, please check with us on Campuswire first.
- all *necessary* imports have been included for you (all imports that we used in our solution)

Instructions:
- Complete outlined problems in this notebook.
- When you have finished, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__.
    - If a problem asks you to write code that results in an error (that is, the error is the answer to the problem), leave the code in your notebook but comment it out so that running from top to bottom does not raise any errors.
- Double check that you have completed Task 0.
- Submit your work on Gradescope.
- Double check that your submission on Gradescope looks like you believe it should.

Names & Sections
----
Names: __NISHARG GOSAI__

Task 0: Name, References, Reflection (5 points)
---

References
---
List the resources you consulted to complete this homework here. Write one sentence per resource about what it provided to you. If you consulted no references to complete your assignment, write a brief sentence stating that this is the case and why it was the case for you.

(Example)
- https://docs.python.org/3/tutorial/datastructures.html
    - Read about the basics and syntax for data structures in python.
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
    - Read about sklearn's TF-IDF vectorizer.

AI Collaboration
---
Following the *Policy on the use of Generative AI* in the syllabus, please cite any LLMs that you used here and briefly describe what you used them for, including to improve language clarity in the written sections.

**Used ChatGPT for code refactoring and cleanup, and as a quick documentation guide for PyTorch and other libraries.**

Reflection
----
Answer the following questions __after__ you complete this assignment (no more than 1 sentence per question required, this section is graded on completion):

1. Does this work reflect your best effort?
    **Yes**
2. What was/were the most challenging part(s) of the assignment? **The LSTM task.**
3. If you want feedback, what function(s) or problem(s) would you like feedback on and why? **How to build a good vectorizer.**
4. Briefly reflect on how your partnership functioned--who did which tasks, how was the workload on each of you individually as compared to the previous homeworks, etc. **I worked individually.**

Task 1: Provided Data Write-Up (10 points)
---

Every time you use a data set in an NLP application (or in any software application), you should be able to answer a set of questions about that data. Answer these now. Default to no more than 1 sentence per question needed. If more explanation is necessary, do give it.

This is about the __provided__ movie review data set.

1. Where did you get the data from? The provided dataset(s) were sub-sampled from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
2. (1 pt) How was the data collected (where did the people acquiring the data get it from and how)?
3. (2 pts) How large is the dataset (answer for both the train and the dev set, separately)? (# reviews, # tokens in both the train and dev sets)
4. (1 pt) What is your data? (i.e. newswire, tweets, books, blogs, etc)
5. (1 pt) Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)
6. (2 pts) What is the distribution of labels in the data (answer for both the train and the dev set, separately)?
7. (2 pts) How large is the vocabulary (answer for both the train and the dev set, separately)?
8. (1 pt) How big is the overlap between the vocabulary for the train and dev set?
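
Questions 3, 6, 7, and 8 can be answered with a few counters. The sketch below is one minimal way to do it, assuming whitespace tokenization (`str.split`) and the `(reviews, labels)` pair that `get_lists` returns. `describe_split` and `vocab_overlap` are hypothetical helper names, and a different tokenizer (for example `nltk.word_tokenize`) will give different counts.

```python
from collections import Counter

def describe_split(reviews, labels):
    """Summarize one split: review count, token count, vocabulary, label counts."""
    tokens_per_review = [review.split() for review in reviews]  # assumption: whitespace tokens
    vocab = {tok for toks in tokens_per_review for tok in toks}
    return {
        "n_reviews": len(reviews),
        "n_tokens": sum(len(toks) for toks in tokens_per_review),
        "vocab": vocab,
        "label_counts": Counter(int(label) for label in labels),
    }

def vocab_overlap(train_vocab, dev_vocab):
    """Number of word types that appear in both vocabularies."""
    return len(train_vocab & dev_vocab)

# Toy data standing in for the real train/dev files.
train = describe_split(["a great movie", "a dull movie"], [1, 0])
dev = describe_split(["great fun"], [1])
print(train["n_reviews"], train["n_tokens"], len(train["vocab"]), dict(train["label_counts"]))
print(vocab_overlap(train["vocab"], dev["vocab"]))
```

On the real data, the two `describe_split` calls would take the output of `get_lists` for the train and dev files.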

Task 2: Train a Logistic Regression Model (20 points)
----
1. Implement a custom function to read in a dataset, and return a list of tuples, using the Tf-Idf feature extraction technique.
2. Compare your implementation to `sklearn`'s TfidfVectorizer (imported below) by timing both on the provided datasets using the time module.
3. Using each set of features, and `sklearn`'s implementation of `LogisticRegression`, train a machine learning model to predict sentiment on the given dataset.

In [1]:
import nltk
#nltk.download('punkt')
import numpy as np
from collections import defaultdict
import pandas as pd
from sklearn.linear_model import LogisticRegression
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from collections import Counter
import time
from nltk.corpus import stopwords

#nltk.download('stopwords')
stopwords = stopwords.words('english')

In [2]:
# The following function reads a data-file and splits the contents by tabs.
# The first column is an ID, and thus is discarded. The second column consists of the actual reviews data.
# The third column is the true label for each data point.

# The function returns two objects - a list of all reviews, and a numpy array of labels.
# You will need to use this function later.

def get_lists(input_file):
    # Use a context manager so the file is closed after reading.
    with open(input_file, 'r') as f:
        lines = [line.split('\t')[1:] for line in f]
    X = [row[0] for row in lines]
    y = np.array([int(row[1]) for row in lines])
    return X, y

# Fill in the following function to take a corpus (list of reviews) as input,
# extract TfIdf values and return an array of features and the vocabulary.

# If the vocabulary argument is supplied, then the function should only convert the input corpus
# to feature vectors using the provided vocabulary and the max_features argument (if not None).
# In this case, the function should return feature vectors and the supplied vocabulary.

# If the max_features parameter is set to None, then all words in the corpus should be used.
# If the max_features parameter is specified (say, k),
# then only use the k most frequent words in the corpus to build your vocabulary.

# The function should return two things.

# The first object should be a numpy array of shape (n_documents, vocab_size),
# which contains the TF-IDF feature vectors for each document.

# The second object should be a dictionary of the words in the vocabulary,
# mapped to their corresponding index in alphabetical sorted order.

from scipy.sparse import csr_matrix

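def get_tfidf_vectors(token_lists, max_features=None, vocabulary=None):
    # NOTE: each element of token_lists must already be a list of word tokens.
    # If raw review strings are passed instead, the loops below iterate over
    # single characters, and the vocabulary becomes a set of characters.
    if vocabulary is None:
        vocabulary = set(word for tokens in token_lists for word in tokens)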

    if max_features is not None and len(vocabulary) > max_features:
        term_freq = defaultdict(int)
        for tokens in token_lists:
            for token in set(tokens):
                term_freq[token] += 1
        vocabulary = sorted(vocabulary, key=lambda x: term_freq[x], reverse=True)[:max_features]

    vocab_index = {word: idx for idx, word in enumerate(vocabulary)}

    tfidf_matrix = csr_matrix((len(token_lists), len(vocabulary)), dtype=np.float32)

    idf = {word: np.log(len(token_lists) / (1 + sum(1 for tokens in token_lists if word in tokens))) for word in vocabulary}

    for idx, tokens in enumerate(token_lists):
        total_tokens = len(tokens)
        for token in set(tokens):
            if token in vocab_index:
                tfidf_matrix[idx, vocab_index[token]] = (tokens.count(token) / total_tokens) * idf[token]

    return tfidf_matrix, vocab_index
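
The function above expects token lists. As a point of comparison, here is a minimal sketch of the same weighting (term frequency `count / len(tokens)` and `idf = log(N / (1 + df))`) computed over word tokens, with document frequencies gathered in a single pass instead of rescanning the corpus for every word. It assumes lowercased whitespace tokenization and returns each row as a `{column: value}` dict rather than a matrix. `tokenize` and `tfidf_rows` are hypothetical names, not part of the assignment.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split; swap in nltk.word_tokenize if preferred.
    return text.lower().split()

def tfidf_rows(corpus, max_features=None, vocabulary=None):
    """Return (rows, vocabulary): one sparse {column: tf-idf} dict per document."""
    docs = [tokenize(text) for text in corpus]
    # Document frequency of every word, gathered in one pass over the corpus.
    doc_freq = Counter(tok for toks in docs for tok in set(toks))
    if vocabulary is None:
        words = list(doc_freq)
        if max_features is not None:
            # Keep the words found in the most documents (ties broken alphabetically).
            words = sorted(words, key=lambda w: (-doc_freq[w], w))[:max_features]
        vocabulary = {word: idx for idx, word in enumerate(sorted(words))}
    n_docs = len(docs)
    rows = []
    for toks in docs:
        counts = Counter(toks)
        rows.append({
            vocabulary[word]: (count / len(toks)) * math.log(n_docs / (1 + doc_freq[word]))
            for word, count in counts.items() if word in vocabulary
        })
    return rows, vocabulary

rows, vocab = tfidf_rows(["Good movie", "bad movie", "good good plot"])
print(vocab)
print(rows[2])
```

As in the function above, the IDF values are recomputed from whichever corpus is passed in, including when a vocabulary is supplied.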

We will now compare the runtime of our Tf-Idf implementation to the `sklearn` implementation. Call the respective functions with appropriate arguments in the code block below.

In [3]:
# define constants for the files we are using
TRAIN_FILE = "movie_reviews_train.txt"
TEST_FILE = "movie_reviews_test.txt"

train_corpus, y_train = get_lists(TRAIN_FILE)

# First we will use our custom vectorizer to convert words to features, and time it.

start = time.time()
tfidf_matrix_custom, vocab_custom = get_tfidf_vectors(train_corpus)
end = time.time()
print("Custom TF-IDF - Time taken:", end - start, "seconds")

# print("Time taken: ", end-start, " seconds")

# Next we will use sklearn's TfidfVectorizer to load in the data, and time it.

start = time.time()
vectorizer = TfidfVectorizer()
tfidf_matrix_sklearn = vectorizer.fit_transform(train_corpus)
end = time.time()
print("scikit-learn TF-IDF - Time taken:", end - start, "seconds")

# print("Time taken: ", end-start, " seconds")



Custom TF-IDF - Time taken: 19.674981594085693 seconds
scikit-learn TF-IDF - Time taken: 0.17102265357971191 seconds


NOTE: Ideally, your vectorizer should be within one order of magnitude of the sklearn implementation.
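
One way to check the order-of-magnitude target is to time both calls the same way and compare the ratio. With the timings reported above (about 19.7 s versus 0.17 s) the ratio is roughly 115, well outside that target. The sketch below uses `time.perf_counter` and keeps the best of a few repeats to reduce noise; `best_time` and the two toy functions are hypothetical stand-ins for the real vectorizer calls.

```python
import time

def best_time(fn, *args, repeats=3, **kwargs):
    """Return the smallest wall-clock time (seconds) over `repeats` calls of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Toy stand-ins for get_tfidf_vectors(train_corpus) and vectorizer.fit_transform(train_corpus).
def slow_version(n):
    return [sum(range(i)) for i in range(n)]

def fast_version(n):
    return [i * (i - 1) // 2 for i in range(n)]

custom_seconds = best_time(slow_version, 500)
reference_seconds = best_time(fast_version, 500)
print("within one order of magnitude:", custom_seconds <= 10 * reference_seconds)
```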

In [4]:
# Any additional code needed to answer questions below.
vocab_size_custom = len(vocab_custom)
print("Custom Vocabulary Size:", vocab_size_custom)

vectorizer = TfidfVectorizer()
tfidf_matrix_sklearn = vectorizer.fit_transform(train_corpus)
vocab_size_sklearn = len(vectorizer.vocabulary_)
print("scikit-learn Vocabulary Size:", vocab_size_sklearn)

#sparsity calculation
def calculate_sparsity(matrix):
    total_elements = matrix.shape[0] * matrix.shape[1]
    zero_elements = total_elements - np.sum(matrix != 0)  
    sparsity = (zero_elements / total_elements) * 100
    return sparsity

sparsity_custom = calculate_sparsity(tfidf_matrix_custom)
print("Custom Features Sparsity:", sparsity_custom)

sparsity_sklearn = calculate_sparsity(tfidf_matrix_sklearn.toarray())
print("scikit-learn Features Sparsity:", sparsity_sklearn)



Custom Vocabulary Size: 125
scikit-learn Vocabulary Size: 22684
Custom Features Sparsity: 64.113
scikit-learn Features Sparsity: 99.39492318374185


1. How large is the vocabulary generated by your vectorizer?<br> **125**
2. How large is the vocabulary generated by the `sklearn` TfidfVectorizer?<br>**22684**
3. Where might these differences be coming from?<br> **The custom function is called on raw review strings, so it iterates over characters rather than words; its 125-entry vocabulary is a set of distinct characters. `TfidfVectorizer` lowercases the text and splits it into word tokens of two or more characters, which yields 22684 terms.**
4. What steps did you take to ensure your vectorizer is optimized for best possible runtime?<br> **The features are stored in a sparse matrix, and the IDF values are computed once before the per-document loop.**
5. How sparse are your custom features (average percentage of features per review that are zero)?<br> **64.11%**
6. How sparse are the TfidfVectorizer's features?<br> **99.39%**

NOTE: if you set the lowercase option to False, the sklearn vectorizer should have a vocabulary of around 50k words/tokens.
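
A vocabulary of 125 entries is much closer to the number of distinct characters in a corpus than to the number of distinct words. The short sketch below shows how that can happen: building a vocabulary with a nested loop over raw strings yields characters, while the same loop over token lists yields words. The two-review corpus is made up for illustration.

```python
reviews = ["A good movie", "a bad movie"]  # toy corpus, not the assignment data

# Iterating over a str yields single characters.
char_vocab = set(item for review in reviews for item in review)

# Iterating over a list of tokens yields words.
token_lists = [review.lower().split() for review in reviews]
word_vocab = set(item for tokens in token_lists for item in tokens)

print(sorted(char_vocab))
print(sorted(word_vocab))
```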

**Logistic Regression**

Now, we will compare how our custom features stack up against sklearn's TfidfVectorizer, by training two separate Logistic Regression classifiers - one on each set of feature vectors. Then load the test set, and convert it to two sets of feature vectors, one using our custom vectorizer (to do this, provide the vocabulary as a function argument), and one using sklearn's Tfidf (use the same object as before to transform the test inputs). For both classifiers, print the average accuracy on the test set and the F1 score.

In [5]:
# First use sklearn's LogisticRegression classifier to do sentiment analysis using your custom feature vectors:

lr_custom = LogisticRegression()
lr_custom.fit(tfidf_matrix_custom, y_train)

# Load the test data, extract features using your custom vectorizer, and test the performance of the LR classifier

test_corpus, y_test = get_lists(TEST_FILE)
tfidf_matrix_test_custom, _ = get_tfidf_vectors(test_corpus, vocabulary=vocab_custom)
y_pred_custom = lr_custom.predict(tfidf_matrix_test_custom)
accuracy_custom = accuracy_score(y_test, y_pred_custom)
f1_custom = f1_score(y_test, y_pred_custom)

# Print the accuracy of your model on the test data

print("Custom Features - Accuracy:", accuracy_custom)
print("Custom Features - F1 Score:", f1_custom)


# Now repeat the above steps, but this time using features extracted by sklearn's Tfidfvectorizer


lr_sklearn = LogisticRegression()
lr_sklearn.fit(tfidf_matrix_sklearn, y_train)
tfidf_matrix_test_sklearn = vectorizer.transform(test_corpus)
y_pred_sklearn = lr_sklearn.predict(tfidf_matrix_test_sklearn)
accuracy_sklearn = accuracy_score(y_test, y_pred_sklearn)
f1_sklearn = f1_score(y_test, y_pred_sklearn)
print("scikit-learn Features - Accuracy:", accuracy_sklearn)
print("scikit-learn Features - F1 Score:", f1_sklearn)




Custom Features - Accuracy: 0.545
Custom Features - F1 Score: 0.7055016181229774
scikit-learn Features - Accuracy: 0.775
scikit-learn Features - F1 Score: 0.7906976744186047


NOTE: we're expecting to see an F1 score of around 80% using both your custom features and the sklearn features.
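
An accuracy of 0.545 paired with an F1 of 0.7055 is what a classifier that predicts the positive class for every review would score: recall is then 1, and both precision and accuracy equal the fraction of positive reviews. The arithmetic can be checked directly. The counts below (109 positives out of 200) are an assumption chosen only because they reproduce an accuracy of 0.545; the test-set size is not stated here.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Assumed counts: 200 test reviews, 109 of them positive, everything predicted positive.
n_total, n_positive = 200, 109
tp, fp, fn, tn = n_positive, n_total - n_positive, 0, 0

accuracy = (tp + tn) / n_total
f1 = f1_from_counts(tp, fp, fn)
print(accuracy, f1)
```

If the real confusion matrix shows no predicted negatives, the features are likely carrying too little signal for the classifier to separate the classes.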

Finally, repeat the process (training and testing), but this time, set the max_features argument to 1000 for both our custom vectorizer and sklearn's Tfidfvectorizer. Report average accuracy and F1 scores for both classifiers.

In [6]:


# First use sklearn's LogisticRegression classifier to do sentiment analysis using your custom feature vectors:

tfidf_matrix_custom_1000, vocab_custom_1000 = get_tfidf_vectors(train_corpus, max_features=1000)
lr_custom_1000 = LogisticRegression()
lr_custom_1000.fit(tfidf_matrix_custom_1000, y_train)


# Load the test data, extract features using your custom vectorizer, and test the performance of the LR classifier

# Extract features using your custom vectorizer with max_features=1000
tfidf_matrix_test_custom_1000, _ = get_tfidf_vectors(test_corpus, vocabulary=vocab_custom_1000, max_features=1000)

# Test the performance of the LR classifier on custom features with max_features=1000
y_pred_custom_1000 = lr_custom_1000.predict(tfidf_matrix_test_custom_1000)
accuracy_custom_1000 = accuracy_score(y_test, y_pred_custom_1000)
f1_custom_1000 = f1_score(y_test, y_pred_custom_1000)


# Print the accuracy of your model on the test data
print("Custom Features (max_features=1000) - Accuracy:", accuracy_custom_1000)
print("Custom Features (max_features=1000) - F1 Score:", f1_custom_1000)

# Now repeat the above steps, but this time using features extracted by sklearn's Tfidfvectorizer

###### YOUR CODE HERE #######
vectorizer_1000 = TfidfVectorizer(max_features=1000)
tfidf_matrix_sklearn_1000 = vectorizer_1000.fit_transform(train_corpus)
lr_sklearn_1000 = LogisticRegression()
lr_sklearn_1000.fit(tfidf_matrix_sklearn_1000, y_train)
tfidf_matrix_test_sklearn_1000 = vectorizer_1000.transform(test_corpus)
y_pred_sklearn_1000 = lr_sklearn_1000.predict(tfidf_matrix_test_sklearn_1000)
accuracy_sklearn_1000 = accuracy_score(y_test, y_pred_sklearn_1000)
f1_sklearn_1000 = f1_score(y_test, y_pred_sklearn_1000)
print("scikit-learn Features (max_features=1000) - Accuracy:", accuracy_sklearn_1000)
print("scikit-learn Features (max_features=1000) - F1 Score:", f1_sklearn_1000)



Custom Features (max_features=1000) - Accuracy: 0.545
Custom Features (max_features=1000) - F1 Score: 0.7055016181229774
scikit-learn Features (max_features=1000) - Accuracy: 0.785
scikit-learn Features (max_features=1000) - F1 Score: 0.8036529680365296


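1. Is there a stark difference between the two vectorizers with 1000 features?<br>**Yes. With 1000 features the sklearn `TfidfVectorizer` reaches 0.785 accuracy and 0.804 F1, while the custom vectorizer stays at 0.545 accuracy and 0.706 F1. The custom scores are unchanged from the unrestricted run because its 125-entry vocabulary is already below the 1000-feature limit, and they match what a classifier that labels every review positive would obtain.**
2. Use sklearn's documentation for the Tfidfvectorizer to figure out what may be causing the performance difference (or lack thereof).<br>**By default `TfidfVectorizer` lowercases the text, tokenizes it into words of two or more alphanumeric characters, uses a smoothed IDF, and L2-normalizes each document vector. The custom function does none of this and is called on raw strings, so it effectively builds character-level features. Tokenizing before vectorizing and matching these defaults should narrow the gap.**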

NOTE: Irrespective of your conclusions, both implementations should be above 60% F1 Score.
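
For reference when reading the documentation: with its default settings, `TfidfVectorizer` uses raw term counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit L2 norm. The custom function above instead uses `count / len(tokens)` and `ln(n / (1 + df))` with no normalization. The sketch below computes the sklearn-style weighting by hand in plain Python so the two can be compared. `sklearn_style_tfidf` is a hypothetical helper, and it takes pre-tokenized documents rather than reproducing sklearn's tokenizer.

```python
import math
from collections import Counter

def sklearn_style_tfidf(token_lists):
    """TF-IDF rows using smoothed IDF and L2 normalization (TfidfVectorizer defaults)."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    # Smoothed IDF: the +1 terms act as if one extra document contained every word.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    rows = []
    for toks in token_lists:
        weights = {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Scale each document vector to unit length; empty documents stay empty.
        rows.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return rows, idf

rows, idf = sklearn_style_tfidf([["good", "movie"], ["bad", "movie"]])
print(idf)
print(rows[0])
```

A word that appears in every document ("movie" here) still gets a nonzero weight under this scheme, whereas `ln(n / (1 + df))` gives it a negative weight.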

Task 3: Train a Feedforward Neural Network Model (25 points)
----
1. Using PyTorch, implement a feedforward neural network to do sentiment analysis. This model should take sparse vectors of length 10000 as input (note this is 10000, not 1000), and have a single output with the sigmoid activation function. The number of hidden layers, and intermediate activation choices are up to you, but please make sure your model does not take more than ~1 minute to train.
2. Evaluate the model using PyTorch functions for average accuracy, area under the ROC curve and F1 scores (see [torcheval](https://pytorch.org/torcheval/stable/)) using both vectorizers, with max_features set to 10000 in both cases.
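
As a reference for what the three metrics measure, here is a minimal plain-Python sketch, independent of `torcheval`. Accuracy and F1 are computed from thresholded predictions, while AUROC is computed from the raw sigmoid scores as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties count as half). The function names and toy scores are illustrative only; the graded evaluation should still use the PyTorch functions the task asks for.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def auroc(y_true, scores):
    """Share of (positive, negative) pairs where the positive is scored higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.2]              # toy sigmoid outputs
y_pred = [int(s >= 0.5) for s in scores]   # threshold at 0.5

print(accuracy(y_true, y_pred), f1(y_true, y_pred), auroc(y_true, scores))
```

AUROC takes the scores rather than the rounded predictions, which is why the evaluation needs both the raw outputs and the thresholded labels.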

In [7]:
#!pip install torch torchvision torchaudio
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import torch.optim as optim
from torcheval.metrics.functional import binary_f1_score
from torcheval.metrics import BinaryAUROC, BinaryAccuracy


# if torch.backends.mps.is_available():
#     device = torch.device("mps")
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

In [8]:
class FeedForward(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(FeedForward, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

    def predict(self, x):
        with torch.no_grad():
            outputs = self.forward(x)
            predicted = torch.round(outputs)
        return predicted

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import torch
from torch.utils.data import TensorDataset, DataLoader

# Load the data
X, y = get_lists("movie_reviews_train.txt")

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=10000)

# Fit and transform the data
tfidf_matrix = tfidf_vectorizer.fit_transform(X)

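# Split the data into training and held-out sets.
# NOTE: this holds out 20% of the training file for evaluation instead of using TEST_FILE,
# and it overwrites the y_train / y_test arrays defined in Task 2.
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, y, test_size=0.2, random_state=42)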

# Function to create a PyTorch DataLoader
def create_data_loader(X, y, batch_size=64):
    tensor_x = torch.Tensor(X.toarray())  
    tensor_y = torch.Tensor(y).unsqueeze(1)
    dataset = TensorDataset(tensor_x, tensor_y) 
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Prepare DataLoaders
batch_size = 64
train_loader = create_data_loader(X_train, y_train, batch_size)
test_loader = create_data_loader(X_test, y_test, batch_size)


In [10]:
# Create a feedforward neural network model
# you may use any activation function on the hidden layers
# you should use binary cross-entropy as your loss function
# Adam is an appropriate optimizer for this task

class FeedForward(nn.Module):
    def __init__(self, input_size):
        super(FeedForward, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out
    
    def predict(self, x):
        with torch.no_grad():
            outputs = self.forward(x)
            predicted = torch.round(outputs)
        return predicted

# Define the model with the correct input size
input_size = 10000  
model = FeedForward(input_size)

# Move the model to the appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


In [11]:
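# Train the model for 50 epochs per loop.
# NOTE: train_loader was built from sklearn's TfidfVectorizer features, so both loops
# below update the same model on the same sklearn features. The first loop is labelled
# "Custom" in its output, but no features from get_tfidf_vectors are used here.
num_epochs = 50

# First loop (logged as "Custom Loss"); the printed value is the loss of the last batch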
for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch [{epoch+1}/{num_epochs}], Custom Loss: {loss.item()}")

# Train the model on sklearn vectors
for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch [{epoch+1}/{num_epochs}], Sklearn Loss: {loss.item()}")


Epoch [1/50], Custom Loss: 0.6735047101974487
Epoch [2/50], Custom Loss: 0.5949637293815613
Epoch [3/50], Custom Loss: 0.48349279165267944
Epoch [4/50], Custom Loss: 0.3643815517425537
Epoch [5/50], Custom Loss: 0.25304919481277466
Epoch [6/50], Custom Loss: 0.16466936469078064
Epoch [7/50], Custom Loss: 0.10459190607070923
Epoch [8/50], Custom Loss: 0.08420170843601227
Epoch [9/50], Custom Loss: 0.06560181826353073
Epoch [10/50], Custom Loss: 0.04548829048871994
Epoch [11/50], Custom Loss: 0.038836829364299774
Epoch [12/50], Custom Loss: 0.02514059841632843
Epoch [13/50], Custom Loss: 0.01968236081302166
Epoch [14/50], Custom Loss: 0.016673708334565163
Epoch [15/50], Custom Loss: 0.017924942076206207
Epoch [16/50], Custom Loss: 0.01413821056485176
Epoch [17/50], Custom Loss: 0.012358732521533966
Epoch [18/50], Custom Loss: 0.009591622278094292
Epoch [19/50], Custom Loss: 0.008305499330163002
Epoch [20/50], Custom Loss: 0.0065590087324380875
Epoch [21/50], Custom Loss: 0.00729340920224

In [16]:
#!pip install torcheval
import torch
from torch.utils.data import DataLoader
from torcheval.metrics.functional import binary_f1_score
from torcheval.metrics import BinaryAUROC, BinaryAccuracy

# Helper function to evaluate the model
def evaluate_model_metrics(model, test_loader):
    model.eval()
    y_true = []
    y_pred = []
    y_scores = []
    
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            
            predicted = outputs.round().squeeze()  
            y_true.extend(labels.squeeze().cpu().numpy())  
            y_pred.extend(predicted.cpu().numpy())
            y_scores.extend(outputs.squeeze().cpu().numpy())  

    y_true = torch.tensor(y_true, dtype=torch.float32)
    y_pred = torch.tensor(y_pred, dtype=torch.float32)
    y_scores = torch.tensor(y_scores, dtype=torch.float32)

    # Calculate F1 Score
    f1 = binary_f1_score(y_pred, y_true)  

    # Calculate AUROC
    auroc = BinaryAUROC()
    auroc.update(y_scores, y_true.int())
    auroc_score = auroc.compute()

    # Calculate Binary Accuracy
    accuracy = BinaryAccuracy()
    accuracy.update(y_pred, y_true)
    # Named accuracy_value to avoid shadowing sklearn's accuracy_score imported earlier
    accuracy_value = accuracy.compute()

    return f1.item(), auroc_score.item(), accuracy_value.item()

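# Evaluate Custom Model
# NOTE: both evaluations below use the same model and the same test_loader (sklearn features),
# so the "Custom" and "Sklearn" metrics are identical by construction.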
print("Evaluating Custom Model:")
f1_custom, auroc_custom, accuracy_custom = evaluate_model_metrics(model, test_loader)
print(f"Custom Model - F1 Score: {f1_custom:.4f}")
print(f"Custom Model - AUROC: {auroc_custom:.4f}")
print(f"Custom Model - Accuracy: {accuracy_custom:.4f}")

# Evaluate Sklearn Model
print("Evaluating Sklearn Model:")
f1_sklearn, auroc_sklearn, accuracy_sklearn = evaluate_model_metrics(model, test_loader) 
print(f"Sklearn Model - F1 Score: {f1_sklearn:.4f}")
print(f"Sklearn Model - AUROC: {auroc_sklearn:.4f}")
print(f"Sklearn Model - Accuracy: {accuracy_sklearn:.4f}")


Evaluating Custom Model:
Custom Model - F1 Score: 0.8085
Custom Model - AUROC: 0.8851
Custom Model - Accuracy: 0.8031
Evaluating Sklearn Model:
Sklearn Model - F1 Score: 0.8085
Sklearn Model - AUROC: 0.8851
Sklearn Model - Accuracy: 0.8031


NOTE: As in the last task, we're expecting to see an F1 score of over 60% using both your custom features and the sklearn features.

5 points in this assignment are reserved for overall style (both for writing and for code submitted). All work submitted should be clear, easily interpretable, and checked for spelling, etc. (Re-read what you write and make sure it makes sense). Course staff are always happy to give grammatical help (but we won't pre-grade the content of your answers).