# Motivation
I wanted to configure a Solr index within a Docker container, and I found the library `pysolr`. I'm going to play around with it within this notebook! 

# Setup
The cells below will help to set up the rest of the notebook.

I'll start by changing the working directory to the root of the repo.  

In [1]:
%cd ..

d:\data\programming\neural-needle-drop


Next, I'm going to import some modules.

In [30]:
# Import statements
from pysolr import Solr
import requests
import mysql.connector
import traceback
import pandas as pd
import json
import os

I'm also going to set up my MySQL cursor. 

In [14]:
# Set up the connection to the MySQL server
cnx = mysql.connector.connect(
    user='root', password=os.getenv("MYSQL_PASSWORD"), 
    host='localhost', database='neural-needle-drop')

# Create a cursor 
cursor = cnx.cursor()

# Methods 
The cells below define helper functions that I'll use throughout the rest of the notebook.

In [8]:
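def add_field_to_schema(field_def, core_name, solr_url="http://localhost:8983/solr/"):

    """This method will add a field to the schema for a given core"""

    # Build the core URL, tolerating a missing trailing slash on solr_url
    core_url = f"{solr_url.rstrip('/')}/{core_name}"
    solr_cmd = {'add-field': field_def}
    r = requests.post(core_url + '/schema', json=solr_cmd)

    # Return the response so the caller can inspect the status and any error body
    return r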
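
One thing to watch with the helper above: as far as I know, the Schema API rejects `add-field` when a field with that name already exists, so re-running the schema cell can fail. Below is a minimal sketch of one way around that, choosing between `add-field` and `replace-field` based on a set of existing field names. `build_schema_command` is a hypothetical helper, and the `existing` set and the `title` field are made up for illustration; in practice the names would come from a `GET` on the core's `/schema/fields` endpoint.

```python
def build_schema_command(field_def, existing_field_names):
    """Pick `replace-field` when the name is already in the schema, else `add-field`."""
    action = "replace-field" if field_def["name"] in existing_field_names else "add-field"
    return {action: field_def}


# Hypothetical field names, standing in for what <core>/schema/fields would list
existing = {"id", "_version_", "transcription"}

# `transcription` is already present, so this becomes a replace-field command
first_cmd = build_schema_command({"name": "transcription", "type": "text_general"}, existing)

# `title` is a made-up field that is not present, so this becomes an add-field command
second_cmd = build_schema_command({"name": "title", "type": "text_general"}, existing)
```

The resulting dict can be posted as the JSON body in the same way `solr_cmd` is posted in `add_field_to_schema`.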

In [23]:
def query_to_df(query, print_error=False):
    '''Query the active MySQL database and return results in a DataFrame'''

    # Try to return the results as a DataFrame
    try:
        # Execute the query
        cursor.execute(query)

        # Fetch the results 
        res = cursor.fetchall()

        # Return a DataFrame
        return pd.DataFrame(res, columns=[i[0] for i in cursor.description])

    # If we run into an Exception, return None
    except Exception as e:
        if (print_error):
            print(f"Ran into the following error:\n{e}\nStack trace:")
            print(traceback.format_exc())
        return None

# Configuring the Index
I'm going to start by defining a schema, and then updating the `neural-needledrop` core to ensure that this schema is in place.

In [11]:
# Define the schema fields
fields = [
    {"name": "transcription", 
    "type": "text_general", 
    "stored": True, 
    "indexed": True, 
    "required": False, 
    "multiValued": False}
]

# Update the schema
for field in fields:
    add_field_to_schema(field, 'neural-needledrop', solr_url="http://localhost:8983/solr/")

# Adding Data to the Index
Now that I've configured the index, I need to add some data. First, I'll load the transcriptions from MySQL.

In [29]:
# Load in all of the transcription data
tnd_transcription_data_df = query_to_df("SELECT * FROM transcriptions", print_error=True)

# Transform the tnd_transcription_data_df to only have the video ID and the transcription
full_transcriptions_df = tnd_transcription_data_df.query("segment==-1")[["id", "text"]].rename(columns={"text": "transcription"})

With this data loaded, we can start populating the Solr core. 

In [34]:
# Create a list of dicts from the transcription DataFrame
upload_data = [{"id": row.id, "transcription": row.transcription} for row in full_transcriptions_df.itertuples()]

# Upload this data to the Solr core
url = 'http://localhost:8983/solr/neural-needledrop/update/json/docs?commit=true'
headers = {'content-type': 'application/json'}
r = requests.post(url, data=json.dumps(upload_data), headers=headers)
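
The cell above sends every document in a single request. If the transcription table grows, it may be gentler on Solr to post in batches. Here's a minimal sketch of the batching step only, assuming a batch size I picked arbitrarily; `iter_batches` is a hypothetical helper and `sample_docs` is made-up data standing in for `upload_data`.

```python
import json


def iter_batches(docs, batch_size=500):
    """Yield successive slices of `docs` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


# Made-up documents standing in for `upload_data`
sample_docs = [{"id": f"video-{i}", "transcription": f"text {i}"} for i in range(5)]

# Split into batches and serialize each one as a JSON array of documents
batches = list(iter_batches(sample_docs, batch_size=2))
payloads = [json.dumps(batch) for batch in batches]
```

Each entry in `payloads` could then be posted to the same `update/json/docs` URL, with `commit=true` left off until the final batch if a single commit is preferred.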

# Querying the Core
I've got data in the core now. With this in mind, I want to try and query something! 

In [80]:
def query_with_highlighting(query_string):

    """
    This method will query the transcriptions with the given `query_string`. It'll return a couple of different DataFrames: 

    - `results_df` - this is a DataFrame of (id, score) tuples
    - `highlights_df` - this is a DataFrame of the highlighted `transcription` snippets for each matching document, merged with `results_df` on `id`
    """

    params = {
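        # Note: with the standard query parser, the `transcription:` prefix only applies
        # to the first term; any later terms are matched against the default field
        'q': 'transcription:' + query_string,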
        'hl': 'true',
        'hl.fl': 'transcription',
        'fl': 'id, score'
    }
    response = requests.get('http://localhost:8983/solr/neural-needledrop/select', params=params)
    results = response.json()

    # Creating the results and highlights DataFrames
    results_df = pd.DataFrame.from_records(results["response"]["docs"])
    highlights_df = pd.DataFrame(results["highlighting"]).T.reset_index().rename(columns={"index": "id"})
    highlights_df = highlights_df.merge(results_df, on="id")
    
    # Return both of the DataFrames I created
    return results_df, highlights_df

query = "in rainbows"
results_df, highlights_df = query_with_highlighting(query)
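
To make the reshaping in `query_with_highlighting` easier to follow, here's a pandas-free sketch of the same idea: Solr's JSON response keeps the matched documents under `response.docs` and the snippets under a separate `highlighting` section keyed by document ID. `flatten_highlights` is a hypothetical helper, and `sample_response` is made up to mirror that shape; it is not real output from my core.

```python
def flatten_highlights(solr_response, field="transcription"):
    """Return one dict per matched document with its score and highlight snippets."""
    highlighting = solr_response.get("highlighting", {})
    rows = []
    for doc in solr_response["response"]["docs"]:
        # A document with no highlight for `field` gets an empty snippet list
        snippets = highlighting.get(doc["id"], {}).get(field, [])
        rows.append({"id": doc["id"], "score": doc.get("score"), field: snippets})
    return rows


# Made-up response in the shape of Solr's JSON output; not real query results
sample_response = {
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"id": "video-a", "score": 1.5},
            {"id": "video-b", "score": 0.7},
        ],
    },
    "highlighting": {
        "video-a": {"transcription": ["an <em>example</em> snippet"]},
        "video-b": {},
    },
}

rows = flatten_highlights(sample_response)
```

Each row pairs an ID and score with its list of snippets, which is roughly what the merge into `highlights_df` produces.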


In [81]:
highlights_df.iloc[0].transcription

["This is thanks largely <em>in</em> part to the classic records he dropped under the microphone's name <em>in</em> the early 2000s. "]
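
The highlight above only marks "in", which fits how I understand the standard query parser to read `transcription:in rainbows`: the field prefix binds to `in`, and `rainbows` becomes a separate clause. I haven't confirmed that against this core, so treat it as my best guess. If the goal is to match the words as a phrase, one option is to wrap the text in double quotes. A minimal sketch, where `build_phrase_query` is a hypothetical helper that only escapes backslashes and double quotes:

```python
def build_phrase_query(field, text):
    """Return `field:"text"` with backslashes and double quotes escaped."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


# Build a phrase query for the same search text used above
phrase_q = build_phrase_query("transcription", "in rainbows")
print(phrase_q)
```

The resulting string could be passed as the `q` parameter in place of the concatenation inside `query_with_highlighting`.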