# Sample Notebook

This notebook assumes that your data is in a pandas DataFrame with `text` and `label` columns and that you are working on a text classification problem.

Adjust the file path, column names, and plots to fit your competition and dataset.

In this section:

- We first import the necessary libraries and load the data.
- We then display the first few rows of the data to get a sense of what it looks like.
- We plot the distribution of labels in the data. This can help identify any class imbalance that might affect model performance.
- We plot the distribution of text lengths. This can give an idea of the range of text lengths the model will need to handle.
- Finally, we create a word cloud of the most frequent words in the text. Frequency is not the same as importance, but it gives a quick impression of the vocabulary that dominates the data.


In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

In [None]:
# Load the data
data = pd.read_csv("/data/eval_student_summaries/prompts_test.csv")

# Display the first few rows of the data
print(data.head())

### Display the distribution of labels

In [None]:
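# Pass the column by keyword; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=data['label'])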
plt.title('Distribution of Labels')
plt.show()
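
A count plot shows imbalance visually; the same information can be read off as numbers. The following is a minimal sketch on a hypothetical toy DataFrame (`toy` and its values are assumptions for illustration, not part of the dataset) that prints class proportions and a simple majority-to-minority ratio.

```python
import pandas as pd

# Hypothetical toy data; in the notebook, use the `data` DataFrame loaded above.
toy = pd.DataFrame({"label": [0, 0, 0, 0, 0, 0, 1, 1, 2, 0]})

# Absolute counts and relative frequencies per class.
counts = toy["label"].value_counts()
proportions = toy["label"].value_counts(normalize=True)

# Ratio between the most and least frequent class; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio far above 1 suggests looking at per-class metrics rather than accuracy alone.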

### Display the length of the texts


In [None]:
# .str.len() returns NaN for missing texts, whereas .apply(len) would raise a TypeError
data['text_length'] = data['text'].str.len()
sns.histplot(data['text_length'])
plt.title('Distribution of Text Lengths')
plt.show()
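
Character counts are only one way to measure length; many models are limited by the number of words or tokens instead. As a rough sketch, assuming whitespace splitting is an acceptable stand-in for a real tokenizer, both measures can be computed side by side (the `toy` DataFrame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, including one missing text.
toy = pd.DataFrame({
    "text": ["a short one", "a somewhat longer example text", None, "mid length text"]
})

# Length in characters; missing texts become NaN.
char_len = toy["text"].str.len()

# Length in whitespace-separated words (a crude proxy for token count).
word_len = toy["text"].str.split().str.len()

# Summary statistics help when choosing a maximum sequence length.
print(char_len.describe())
print(word_len.describe())
```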

### Display the most common words in the text

In [None]:
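# Drop missing values and cast to str so that join() does not fail on NaN entries
all_text = ' '.join(data['text'].dropna().astype(str))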
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_text)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.title('Most Common Words')
plt.show()
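
A word cloud is hard to read precisely, so it can help to list the top words with their counts as well. The sketch below uses only the standard library; the sample `texts` and the simple regular-expression tokenizer are assumptions for illustration, and no stop-word filtering is applied.

```python
import re
from collections import Counter

# Hypothetical sample texts; in the notebook, iterate over data['text'] instead.
texts = [
    "The model reads the text",
    "The text is short",
    "Short text, short summary",
]

# Lowercase each text and extract runs of letters or apostrophes as tokens.
tokens = []
for text in texts:
    tokens.extend(re.findall(r"[a-z']+", text.lower()))

# Count tokens and keep the three most frequent ones.
word_counts = Counter(tokens)
top_words = word_counts.most_common(3)
print(top_words)
```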

## Performance Metrics

This section calculates and displays more detailed performance metrics for a text classification model.

This assumes that you've already trained your model and made predictions on your validation data.

In this section:

- We first calculate the classification report, which includes precision, recall, and F1-score for each class, as well as overall accuracy.
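- We then calculate and display the confusion matrix, which tabulates actual classes (rows) against predicted classes (columns). The diagonal holds the correct predictions; the off-diagonal cells show which classes are confused with each other, from which per-class true and false positives and negatives can be derived.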
- If your problem is binary classification, we also calculate and display the ROC curve, which shows the trade-off between the true positive rate and false positive rate for different threshold values. The area under the ROC curve (AUC) is also calculated as a single-number summary of model performance.

The specific metrics that are most relevant will depend on your problem.

Example: if you have imbalanced classes, you might want to focus more on precision, recall, or the F1-score rather than overall accuracy.
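
To make the point about imbalanced classes concrete, here is a small sketch with made-up labels (95 negatives, 5 positives) and a degenerate "model" that always predicts the majority class. The data is hypothetical; the metric functions are from scikit-learn.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when there are no positive predictions.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy looks high here even though the model never finds a single positive example, which is exactly what recall and F1 expose.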

### Import necessary libraries

In [None]:
import pandas as pd  # used below to display the classification report as a table
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

### Calculate the classification report

In [None]:
report = classification_report(val_data['label'], val_predictions, output_dict=True)
print(pd.DataFrame(report).transpose())


### Calculate and display the confusion matrix

In [None]:
cm = confusion_matrix(val_data['label'], val_predictions)
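sns.heatmap(cm, annot=True, fmt='d')
plt.title('Confusion Matrix')
# scikit-learn puts actual classes on the rows and predicted classes on the columns
plt.xlabel('Predicted')
plt.ylabel('Actual')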
plt.show()
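
For more than two classes, the true/false positive and negative counts have to be derived per class from the matrix. A minimal sketch, assuming the scikit-learn convention of actual classes on the rows and predicted classes on the columns (the matrix values are made up):

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = actual, columns = predicted.
cm_example = np.array([
    [5, 1, 0],
    [2, 6, 1],
    [0, 2, 7],
])

tp = np.diag(cm_example)                # correctly predicted, per class
fp = cm_example.sum(axis=0) - tp        # predicted as the class but actually another
fn = cm_example.sum(axis=1) - tp        # actually the class but predicted as another
tn = cm_example.sum() - (tp + fp + fn)  # everything else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("TP", tp, "FP", fp, "FN", fn, "TN", tn)
```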


### Calculate and display the ROC curve (binary classification only)

In [None]:
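if len(np.unique(val_data['label'])) == 2:
    # Note: roc_curve is most informative with continuous scores or positive-class
    # probabilities; hard 0/1 predictions yield a curve with a single operating point.
    fpr, tpr, _ = roc_curve(val_data['label'], val_predictions)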
    roc_auc = auc(fpr, tpr)
    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
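
The ROC cell in this notebook passes `val_predictions` to `roc_curve`. If those are hard class labels rather than scores, the curve collapses to a single operating point. The sketch below contrasts the two cases on made-up data; `y_score` stands in for positive-class probabilities, which for many scikit-learn classifiers would come from something like `model.predict_proba(X)[:, 1]` (an assumption about your model, not something this notebook defines).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Hypothetical positive-class probabilities.
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.35, 0.7, 0.8, 0.9])
# Hard labels obtained by thresholding the scores at 0.5.
y_label = (y_score >= 0.5).astype(int)

# With scores, the curve is traced over many thresholds.
fpr_s, tpr_s, _ = roc_curve(y_true, y_score)
# With hard labels, only the trivial end points and one operating point remain.
fpr_l, tpr_l, _ = roc_curve(y_true, y_label)

auc_scores = roc_auc_score(y_true, y_score)
auc_labels = roc_auc_score(y_true, y_label)
print(f"AUC from scores: {auc_scores:.3f}, AUC from hard labels: {auc_labels:.3f}")
```

On this toy data the label-based AUC understates the ranking quality that the scores actually carry.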


## Sentiment Analysis

Sentiment analysis is often framed as a binary classification problem in NLP, where the goal is to determine whether a given text is positive or negative.

In this section:

- We calculate the classification report as before, but now we specify that the classes are 'Negative' and 'Positive'.
- We calculate and display the confusion matrix as before, but now we label the axes with 'Negative' and 'Positive'.
- We calculate and display the ROC curve as before. This is particularly useful for binary classification problems like sentiment analysis, as it shows how the model's performance changes as the threshold for deciding between 'Negative' and 'Positive' is varied.
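
To illustrate what varying the threshold means, the following sketch computes the true positive rate and false positive rate at a few hand-picked thresholds. The labels, scores, and thresholds are hypothetical; each resulting (FPR, TPR) pair corresponds to one point on a ROC curve.

```python
import numpy as np

# Hypothetical labels (0 = Negative, 1 = Positive) and positive-class scores.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.35, 0.7, 0.8, 0.9])

rates = {}
for threshold in (0.3, 0.5, 0.75):
    # Predict 'Positive' when the score reaches the threshold.
    y_pred = (y_score >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    tpr = tp / int((y_true == 1).sum())  # true positive rate (recall)
    fpr = fp / int((y_true == 0).sum())  # false positive rate
    rates[threshold] = (fpr, tpr)
    print(f"threshold={threshold:.2f} FPR={fpr:.2f} TPR={tpr:.2f}")
```

Lowering the threshold catches more positives at the cost of more false alarms; raising it does the opposite.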

In [None]:
# Import necessary libraries
import pandas as pd  # used below to display the classification report as a table
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np


### Calculate the classification report

In [None]:
# target_names follow the sorted label order, so this assumes 0 = Negative and 1 = Positive
report = classification_report(val_data['label'], val_predictions, target_names=['Negative', 'Positive'], output_dict=True)
print(pd.DataFrame(report).transpose())

### Calculate and display the confusion matrix

In [None]:
cm = confusion_matrix(val_data['label'], val_predictions)
sns.heatmap(cm, annot=True, fmt='d', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### Calculate and display the ROC curve

In [None]:
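# Note: roc_curve is most informative with continuous scores or positive-class
# probabilities; hard 0/1 predictions yield a curve with a single operating point.
fpr, tpr, _ = roc_curve(val_data['label'], val_predictions)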
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()