### This won't help your score at all, but if you are curious to see what topics are in the training contexts, here you go. 

<p >
<img src="https://maartengr.github.io/BERTopic/logo.png" alt="drawing" width="200" style="float: left; margin: 20px;"/>
 </p>
 
I can only read the topics if they are in English, so I'm working with a translated set. Thank you to @jacob34 (Clear 'n Simple) for translating everything to English in this notebook: https://www.kaggle.com/jacob34/chaii-qa-simple-google-translate
    
I'll be using BERTopic, which finds topics by clustering document embeddings and does a pretty good job of it. If you don't know this package, it is quite impressive. See the website for more details: https://maartengr.github.io/BERTopic/

I turned the GPU on because the sentence transformer computes the embeddings much faster on it.



In [None]:
# You will have to restart the kernel after running this
!pip install bertopic -qq

In [None]:
%%capture

import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

df = pd.read_csv("../input/chaii-qa-simple-google-translate/train_with_google_translations.csv")

# Drop rows with a missing translation; `topics` and `probs` line up with `docs`, not with `df`
docs = df["context_gtrans"].dropna().tolist()

# Embed the contexts with a pretrained sentence transformer and let BERTopic cluster them
sentence_model = SentenceTransformer("all-mpnet-base-v2")
topic_model = BERTopic(embedding_model=sentence_model)

topics, probs = topic_model.fit_transform(docs)

### Here we can see that most of the topics are about India, math, science, and history

In [None]:
topic_model.get_topic_info()
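
If you want to sanity-check the counts in that table, a rough equivalent is to count the topic ids yourself. This is only a sketch: `example_topics` is a made-up stand-in for the `topics` list returned by `fit_transform`, and `topic_sizes` is a helper name I invented for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the `topics` list returned by fit_transform;
# in the notebook you would pass the real list instead.
example_topics = [1, 4, -1, 1, 5, 1, -1, 4]


def topic_sizes(topic_ids):
    """Return (topic_id, document_count) pairs, largest topic first."""
    return Counter(topic_ids).most_common()


sizes = topic_sizes(example_topics)
for topic_id, count in sizes:
    print(f"Topic {topic_id}: {count} documents")
```

Calling `topic_sizes(topics)` on the real list should give counts that match the table, one document per entry in `topics`.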

### I think these topics are pretty coherent

Here are some that are easy to pick out:

Topic 1: India  
Topic 4: Thinkers and Inventors  
Topic 5: Agriculture  
Topic 6: Europe  
Topic 7: Rivers and Bodies of Water    
Topic 9: East Asia  
Topic 10: Acting and Movies  
Topic 13: Solar System  
Topic 15: World War II  
Topic 17: Disease  
Topic 18: Chemistry  
Topic 20: Biology  
Topic 26: Exotic Mammals  
Topic 27: Nuclear Chemistry/Physics  
Topic 30: Language   
Topic -1: Outliers (documents BERTopic did not assign to any topic; the top words are mostly stopwords)  

Note: Because this is a stochastic process, rerunning the model will not reproduce these exact topics, and the topic numbers above will change.
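
If you do want runs that are closer to repeatable, one option is to hand BERTopic a UMAP model with a fixed `random_state`. The sketch below is not what this notebook ran: `make_seeded_topic_model` is a helper name I made up, the seed value is arbitrary, and the other UMAP settings are meant to mirror BERTopic's documented defaults.

```python
def make_seeded_topic_model(seed=42, embedding_model=None):
    """Build a BERTopic model whose UMAP step uses a fixed random seed."""
    # Imported inside the function so the helper can be defined before
    # the packages are installed (see the pip cell at the top).
    from bertopic import BERTopic
    from umap import UMAP

    # Assumption: these values mirror BERTopic's default UMAP settings;
    # only random_state is added.
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,
    )
    return BERTopic(embedding_model=embedding_model, umap_model=umap_model)
```

In this notebook that would look like `topic_model = make_seeded_topic_model(42, sentence_model)`. Fixing the seed can make UMAP slower, and results may still differ across machines or library versions.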

In [None]:
# Print the words most associated with each topic, in topic order
for topic_num in sorted(set(topics)):
    print("\n\n", f"Topic: {topic_num}", [x[0] for x in topic_model.get_topic(topic_num)])
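
Once you have read through the words, it can help to keep your hand-picked names next to the topic numbers. Here is a minimal sketch using a plain dictionary: the names are the ones listed earlier in this notebook, while `MANUAL_LABELS`, `label_for`, and `example_topics` are names I invented for illustration. The numbers are only valid for the run that produced them.

```python
# Hand-assigned names copied from the list earlier in this notebook.
# A rerun will shuffle the topic numbers, so treat these as an example.
MANUAL_LABELS = {
    1: "India",
    4: "Thinkers and Inventors",
    5: "Agriculture",
    6: "Europe",
    7: "Rivers and Bodies of Water",
    9: "East Asia",
    10: "Acting and Movies",
    13: "Solar System",
    15: "World War II",
    17: "Disease",
    18: "Chemistry",
    20: "Biology",
    26: "Exotic Mammals",
    27: "Nuclear Chemistry/Physics",
    30: "Language",
}


def label_for(topic_id, labels=MANUAL_LABELS):
    """Return a readable name for a topic id, with a fallback."""
    if topic_id == -1:
        # -1 is the bucket for documents left out of every topic
        return "Outliers"
    return labels.get(topic_id, f"Topic {topic_id} (unlabeled)")


# Hypothetical topic ids; in the notebook you would loop over `topics`.
example_topics = [1, 13, -1, 2]
print([label_for(t) for t in example_topics])
```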

### Topics in 2 dimensions
Hover over the bubbles to see the topic name, or use the slider at the bottom to highlight a topic.

In [None]:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# Build the 2D inter-topic distance map and render it inline
fig = topic_model.visualize_topics(height=1000, width=1000)
iplot(fig)

### Topics through hierarchical clustering
Another way to visualize how the topics relate to each other.

In [None]:
topic_model.visualize_hierarchy(height=1000, width=1000)

### Hover over a square to see the similarities between the topics

In [None]:
topic_model.visualize_heatmap(height=1000, width=1000)