# AMASUM Model Evaluation

This small notebook uses the ROUGE metric to measure how well AMASUM stacks up against state-of-the-art models like ChatGPT when generating summaries of customer reviews on e-commerce websites.

In [4]:
from rouge import Rouge


---

I have copied some of the text from the modelling notebook here so the summaries can be compared.

---

### Positive Model Summary Eval

In [5]:
# Instantiate the ROUGE scorer
rouge = Rouge()

# Define your extractive summary and reference summary as strings
extractive_summary = (
 '- The rotor is praised for its direct fit, perfect performance, and easy '
 'installation, making it a recommended product for Honda Accord owners.\n'
 '- The reviewer notes that the rotor is of better quality than the one it '
 'replaced, and at a cheaper price compared to local parts stores. '
 'Additionally, it solved a start problem during wet weather.\n'
 '- There are no specific issues or concerns mentioned in the review, '
 'indicating that the rotor is generally well-liked and effective.\n'
)
# ChatGPT reference summary
reference_summary = ('- This review discusses the effectiveness and quality of the rotor for a '
 '1999 Honda Accord LX, comparing it favorably to the original product '
 'supplied by Honda.\n'
 '- The rotor is praised for being a direct fit and for its perfect '
 'performance, as well as being easy to install.\n'
 '- The reviewer also notes that the rotor is of better quality than the one '
 'it replaced, and at a cheaper price compared to local parts stores. '
 'Additionally, it solved a start problem during wet weather.')

# Calculate ROUGE scores; get_scores takes the hypothesis first, then the reference
scores = rouge.get_scores(extractive_summary, reference_summary)

# Access specific ROUGE metrics (e.g., ROUGE-2, ROUGE-L)
rouge_2_f1 = scores[0]["rouge-2"]["f"]
rouge_l_f1 = scores[0]["rouge-l"]["f"]

# Print the scores
print("ROUGE-2 F1 Score:", rouge_2_f1)
print("ROUGE-L F1 Score:", rouge_l_f1)

ROUGE-2 F1 Score: 0.5032258014618106
ROUGE-L F1 Score: 0.661157019796462
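
For intuition about what the ROUGE-2 number above measures, here is a minimal pure-Python sketch of an n-gram overlap F1. It is an illustration only: `ngrams` and `rouge_n_f1` are hypothetical helpers, the tokenisation is a plain lower-case whitespace split, and repeated n-grams are counted with clipping, so the values will likely differ somewhat from what the `rouge` package reports.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(hypothesis, reference, n=2):
    """Simplified ROUGE-N F1 (assumption: lower-cased whitespace tokens)."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it appears in both texts.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())  # share of hypothesis n-grams found in the reference
    recall = overlap / sum(ref.values())     # share of reference n-grams found in the hypothesis
    return 2 * precision * recall / (precision + recall)


# Toy sentences (made up for illustration, not taken from the dataset).
print(rouge_n_f1("the rotor was easy to install", "the rotor is easy to install"))
```

The score rises only when the two summaries share the same adjacent word pairs, which is why ROUGE-2 is usually lower than ROUGE-L for the same pair of texts.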


---

This is a night-and-day difference from our baseline summaries, which, as a reminder, scored:
- **ROUGE-2 F1 Score**: 0.08053690791585993
- **ROUGE-L F1 Score**: 0.2079999950924801

Our Positive Model is generating summaries that score a ROUGE-L F1 of about 0.66 against the ChatGPT 3.5 Turbo reference, which we read as roughly **66%** of its effectiveness on this example. This indicates a substantial proficiency of AMASUM in this task.
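
To put the jump over the baseline into numbers, the scores can be divided metric by metric. This is a small sketch using only the figures printed in this notebook; the `improvement` helper and the dictionary layout are assumptions made for illustration.

```python
# Scores copied from this notebook.
baseline = {"rouge-2": 0.08053690791585993, "rouge-l": 0.2079999950924801}
positive = {"rouge-2": 0.5032258014618106, "rouge-l": 0.661157019796462}


def improvement(model, base):
    """Return the model-to-baseline ratio for each metric (hypothetical helper)."""
    return {metric: model[metric] / base[metric] for metric in base}


gains = improvement(positive, baseline)
for metric, ratio in gains.items():
    print(f"{metric}: {ratio:.1f}x the baseline")
```

On these figures the Positive Model comes out at roughly six times the baseline on ROUGE-2 and roughly three times on ROUGE-L.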

Let's take a look at the Negative Model results now.

---



### Negative Model Summary Eval

In [7]:
# Instantiate the ROUGE scorer
rouge = Rouge()

# Define your extractive summary and reference summary as strings
extractive_summary = (
 '- Brightness and compatibility issues: Many customers found that these '
 'license plate lights were not bright enough and did not fit their specific '
 'car models, leading to disappointment and the need for returns.\n'
 '- Light output: The light output was a major concern for customers, with '
 'some mentioning that the lights appeared blue instead of white and were not '
 'as bright as expected.\n'
 '- Desire for brighter options: Some customers expressed a desire for '
 'brighter license plate lights, indicating a preference for more powerful '
 'options.\n'
)
# ChatGPT reference summary
reference_summary = ('- Many customers found that these license plate lights were not bright '
 'enough and did not meet their expectations.\n'
 '- Compatibility was a major issue, with customers reporting that the lights '
 'were not suitable for their specific car models, leading to disappointment '
 'and the need for returns.\n'
 '- Some customers were particularly dissatisfied with the light output, with '
 'one customer mentioning that the lights appeared blue instead of white, and '
 'others expressing a desire for brighter options.')

# Calculate ROUGE scores; get_scores takes the hypothesis first, then the reference
scores = rouge.get_scores(extractive_summary, reference_summary)

# Access specific ROUGE metrics (e.g., ROUGE-2, ROUGE-L)
rouge_2_f1 = scores[0]["rouge-2"]["f"]
rouge_l_f1 = scores[0]["rouge-l"]["f"]

# Print the scores
print("ROUGE-2 F1 Score:", rouge_2_f1)
print("ROUGE-L F1 Score:", rouge_l_f1)

ROUGE-2 F1 Score: 0.5064935015145894
ROUGE-L F1 Score: 0.6837606787902696
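
ROUGE-L, the second figure printed for both models, is built on the longest common subsequence (LCS) of the two texts rather than on fixed-size n-grams. The sketch below shows the idea in pure Python. `lcs_length` and `rouge_l_f1` are hypothetical helpers, the tokenisation is a plain lower-case whitespace split, and the whole summary is treated as one sequence, so the result is not guaranteed to match the `rouge` package exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal tokens, otherwise carry the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis, reference):
    """Simplified ROUGE-L F1 (assumption: lower-cased whitespace tokens)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)  # how much of the hypothesis is in-order shared text
    recall = lcs / len(ref)     # how much of the reference is in-order shared text
    return 2 * precision * recall / (precision + recall)


# Toy sentences (made up for illustration, not taken from the dataset).
print(rouge_l_f1("the lights were not bright enough",
                 "the lights were not as bright as expected"))
```

Because the subsequence only has to preserve word order, not adjacency, ROUGE-L rewards summaries that cover the same points in the same order even when the wording between them differs.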


---

# Conclusion

Lovely!

With the Negative Model we score a ROUGE-L F1 of about 0.68 against the ChatGPT 3.5 Turbo reference, which we read as roughly **68%** of its effectiveness on this example.

I am more than pleased with the first iteration of this model. Fine-tuned on open-source materials, we are getting closer to state-of-the-art models like OpenAI's GPT, and with more training on the full dataset and more powerful GPUs I believe we could get even closer.

---