# The Challenge
We are building an LLM-based recommender system that suggests a next-semester course load for university students.
Assume this will become a real tool to be used by real students some day.

## Task

There are two main objectives and one optional one, in order of priority:
1. Set up an evaluation system to test this LLM-based recommender so we can understand where it performs well and where it fails.
   - At a minimum, we expect that you'll create more synthetic data for this task and add it to the examples already in the data provided.
   - We also expect you'll define some form of evaluation code to make sure that the recommender is behaving correctly.
   - Correctness is somewhat subjective here. At a minimum, the recommender should avoid classes whose times clash and classes for which the student lacks the prerequisites, and it should recommend classes that are relevant given the student's course history, major, and request.
   - Look through the data and think about what is and isn't realistic about it. Think of ways to combine programmatic and AI-based solutions so that they complement each other.

2. Once you're done, convert this code into an API with FastAPI.
   - Include a Dockerfile to containerize the API.
  
3. [Optional] Make the recommender better based on what we learn from 1.
   - Keep the recommender interface the same, but feel free to modify the prompt or add additional internal steps.
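
The minimum correctness criteria in objective 1 (no time clashes, prerequisites met) can be checked with plain code rather than an LLM judge. Below is a minimal sketch under stated assumptions: `CourseInfo` and `check_recommendation` are hypothetical names, `CourseInfo` is a trimmed-down stand-in for the notebook's `Course` dataclass, and two courses are assumed to clash exactly when they share the same time-block label. Real time blocks may overlap in more subtle ways.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class CourseInfo:
    """Hypothetical stand-in for the notebook's Course dataclass."""
    course_id: str
    time_blocks: List[str]
    prerequisites: List[str]


def check_recommendation(
    picks: List[Dict[str, str]],
    catalog: Dict[str, CourseInfo],
    completed: Set[str],
) -> List[str]:
    """Return human-readable violations; an empty list means the picks pass."""
    violations: List[str] = []
    used_slots: Dict[str, str] = {}  # time-block label -> course already using it
    for pick in picks:
        cid, slot = pick["course_id"], pick["selected_time_slot"]
        course = catalog.get(cid)
        if course is None:
            violations.append(f"{cid}: not in catalog")
            continue
        if cid in completed:
            violations.append(f"{cid}: already completed")
        if slot not in course.time_blocks:
            violations.append(f"{cid}: slot {slot} is not offered")
        missing = [p for p in course.prerequisites if p not in completed]
        if missing:
            violations.append(f"{cid}: missing prerequisites {missing}")
        # Assumption: identical labels are the only way two courses can clash.
        if slot in used_slots:
            violations.append(f"{cid}: clashes with {used_slots[slot]} in slot {slot}")
        else:
            used_slots[slot] = cid
    return violations


# Made-up catalog and picks, purely for illustration.
catalog = {
    "CS101": CourseInfo("CS101", ["A", "B"], []),
    "CS201": CourseInfo("CS201", ["A"], ["CS101"]),
    "MATH101": CourseInfo("MATH101", ["A", "C"], []),
}
picks = [
    {"course_id": "CS201", "selected_time_slot": "A"},
    {"course_id": "MATH101", "selected_time_slot": "A"},
]
violations = check_recommendation(picks, catalog, completed=set())
print(violations)
```

Returning a list of violations instead of a single boolean keeps each failure countable on its own, which makes it easier to see which rule the recommender breaks most often.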

## Data
We've created a minimal set of data to get you started. It includes a set of fake courses, majors and students.
This is just a sample. Most of this challenge is about generating and improving this data to make it more realistic.
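
One aspect of realism is that a synthetic student's course history should respect prerequisites. The sketch below is one possible way to sample such a history, not the method used to build the provided data. `sample_completed_courses` is a hypothetical helper and the `prereqs` mapping is made up for illustration.

```python
import random
from typing import Dict, List


def sample_completed_courses(
    prereqs: Dict[str, List[str]], n_courses: int, seed: int
) -> List[str]:
    """Sample up to n_courses so every course appears after its prerequisites."""
    rng = random.Random(seed)  # seeded so the synthetic data is reproducible
    completed: List[str] = []
    for _ in range(n_courses):
        # A course is eligible once all of its prerequisites are completed.
        eligible = sorted(
            cid
            for cid, reqs in prereqs.items()
            if cid not in completed and all(r in completed for r in reqs)
        )
        if not eligible:
            break  # nothing left that the student could have taken
        completed.append(rng.choice(eligible))
    return completed


# Made-up prerequisite graph: course_id -> list of prerequisite course_ids.
prereqs = {
    "CS101": [],
    "CS201": ["CS101"],
    "CS301": ["CS201"],
    "MATH101": [],
}
history = sample_completed_courses(prereqs, n_courses=3, seed=0)
print(history)
```

Sampling one course at a time from the currently eligible set guarantees a valid order by construction, so no history needs to be rejected afterwards.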

## Tools
What's allowed? Literally anything apart from having someone else do this for you. 
You can use any tools you'd have regular access to in your work.

You'll also get an Anyscale API key. Please be careful with this key, don't share it or commit it anywhere. We will get a notification if you do! You've been warned 😉.

We recommend you take a look at the following libraries to help you:
1. Together API
2. Langchain
3. Ragas / DeepEval

## Submission
Submit the final code to our team. We suggest adding comments to your code to explain your reasoning and to help us when we go through it in person later.

## Evaluation
- We want to see you flex your Python muscles: write clean, production-quality code when you can.
- We want to see how you think through this somewhat open-ended problem.
  - Do you understand what's important when creating synthetic data and evaluating LLM-powered products?
  - Can you be biased towards action in the face of uncertainty and in situations where there is no single right answer?
- We want to see you solve problems in a real-world environment. Are you thoughtful and resourceful?


# Deps

In [None]:
%%bash
pip install langchain_openai==0.0.5 langchain_core==0.1.52 langchain==0.1.14

In [3]:
# Imports
from dataclasses import dataclass
from typing import List, Optional
import os

from tqdm import tqdm
# from langchain.cache import SQLiteCache
from langchain.globals import set_llm_cache
# from langchain_together import ChatTogether
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Setup LLM cache to speed things up (optional)
# set_llm_cache(SQLiteCache(database_path=".langchain.db"))

# Data

In [5]:

# Here are your dataclasses. Feel free to modify them if you want but you'll need to edit the data as well.
@dataclass
class Major:
    name: str
    required_courses: List[str]

@dataclass
class Course:
    course_id: str
    course_name: str
    time_blocks: List[str]
    prerequisites: List[str]
    department: str
    credits: int
    level: str
    instructor: str
    description: str
    selected_time_block: Optional[str] = None

@dataclass
class Student:
    student_id: str
    name: str
    majors: List[str]  # major names; get_student_majors resolves them to Major objects
    completed_courses: List[str]
    level: str
    
    def get_student_majors(self, all_majors):
        return [all_majors[major_name] for major_name in self.majors]

In [None]:
# Load data
import json
with open('courses.json', 'r') as file:
    courses_json = json.load(file)

all_courses = [] 
for course_json in courses_json:
    all_courses.append(Course(**course_json))
    
with open('students.json', 'r') as file:
    students_json = json.load(file)

all_students = [] 
for student_json in students_json:
    all_students.append(Student(**student_json))

with open('majors.json', 'r') as file:
    majors_json = json.load(file)

all_majors = {}
for major_json in majors_json:
    major = Major(**major_json)
    all_majors[major.name] = major

all_majors


# Recommender

In [8]:

# Step 0: Set the Together API key
# The key is unique to you, so do not hard-code, share, or commit it. Export TOGETHER_API_KEY
# in your shell, or replace the placeholder below locally.
os.environ.setdefault("TOGETHER_API_KEY", "<your-api-key>")

# Step 1: Create the model (Together exposes an OpenAI-compatible endpoint)
# Optional: If you have an OpenAI key, you can also try using their models
model = ChatOpenAI(base_url='https://api.together.xyz/v1', api_key=os.getenv('TOGETHER_API_KEY'), model="mistralai/Mixtral-8x7B-Instruct-v0.1")

# Step 2: Create a prompt function that describes the task to the LLM
prompt = ChatPromptTemplate.from_template(
    """Given the following student details, majors, and a list of possible courses, recommend which courses the student
should take next semester. Consider prerequisites, major requirements, and avoid schedule conflicts.
\n\nStudent Details: {student}\n\nMajors: {majors}\n\nPossible Courses: {possible_courses}.
Student Request: {request}
Your output should be a json in this format:
{{
    "courses": [
        {{"course_id": "abc123", "selected_time_slot": "A"}},
        {{"course_id": "edf234", "selected_time_slot": "C"}},
        {{"course_id": "fgh456", "selected_time_slot": "D"}}
        ...
    ],
    "explanation": "<insert explanation of why this is a good course suggestion>"
}}
""")

# Step 3: Define a processing chain
# This chain takes the student, their majors, and possible courses, generates a prompt, and gets recommendations from the LLM
output_parser = StrOutputParser()
chain = prompt | model | output_parser

In [None]:

student = all_students[3]
student_request = 'I want to get as many courses from my major done as possible, but also have some diversity. I want to do 3 courses.'

majors = student.get_student_majors(all_majors)
recommendations = chain.invoke({'request': student_request, 'possible_courses': all_courses, 'student': student, 'majors': majors})

print(recommendations)
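
The chain ends in `StrOutputParser`, so `recommendations` is a raw string and the model may wrap the JSON in extra prose. Before any programmatic evaluation, the string has to be parsed and its shape checked. The following is a minimal sketch: `parse_recommendation` is a hypothetical helper, and taking the span from the first `{` to the last `}` is a simplifying assumption that fails if the model emits several JSON objects.

```python
import json
from typing import Any, Dict, Optional


def parse_recommendation(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed recommendation, or None if it is malformed."""
    # Assumption: the JSON object spans the first '{' to the last '}'.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    courses = data.get("courses") if isinstance(data, dict) else None
    if not isinstance(courses, list):
        return None
    # Every entry must carry the two keys the prompt asks for.
    required = {"course_id", "selected_time_slot"}
    for item in courses:
        if not isinstance(item, dict) or not required <= item.keys():
            return None
    return data


# Made-up model output with surrounding chatter, for illustration only.
raw_output = (
    'Here is my suggestion: {"courses": [{"course_id": "CS101", '
    '"selected_time_slot": "A"}], "explanation": "Intro course."} Good luck!'
)
parsed = parse_recommendation(raw_output)
print(parsed)
```

Counting how often this returns `None` gives a first numeric metric (format validity) before any course-level checks run.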


# Evaluation (Your Task)
Set up your evaluation code here.

Note: I've provided some inspiration code, but it's not very good. Feel free to use it or replace it with better evaluations!

In [None]:
'''
Here is an example runner.
Pros:
- It runs several students and collects the right information so we can evaluate
- It can cache the output (once the LLM cache in the imports cell is enabled) so it can be run multiple times quickly

Cons:
- It's sequential (inefficient)
'''

def get_recommendations(student):
    majors = student.get_student_majors(all_majors)
    recommendations = chain.invoke({
        'request': student_request, 
        'possible_courses': all_courses, 
        'student': student, 
        'majors': majors
    })
    return recommendations

recommendations = []
errors = []
for student in tqdm(all_students):
    try:
        recommendations.append({'student': student, 'recommendation': get_recommendations(student)})
    except Exception as e:
        errors.append({'student': student, 'error': str(e)})

print(f'Successfully generated {len(recommendations)} / {(len(all_students))}')
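
The runner above is sequential, and most of its time is spent waiting on network calls, so a thread pool is one way to speed it up. This is a minimal sketch under stated assumptions: `run_concurrently` is a hypothetical helper, `fake_recommender` stands in for `get_recommendations` so the example runs without an API key, and provider rate limits are assumed to tolerate `max_workers` parallel requests.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, List, Tuple


def run_concurrently(
    items: List[Any], fn: Callable[[Any], Any], max_workers: int = 4
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Apply fn to every item in parallel, collecting results and errors separately."""
    results: List[Dict[str, Any]] = []
    errors: List[Dict[str, Any]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, item): item for item in items}
        # as_completed yields in finish order, so results are not in input order.
        for future in as_completed(futures):
            item = futures[future]
            try:
                results.append({"student": item, "recommendation": future.result()})
            except Exception as exc:  # keep going when one call fails
                errors.append({"student": item, "error": str(exc)})
    return results, errors


def fake_recommender(student: str) -> str:
    """Stand-in for get_recommendations; fails for one student on purpose."""
    if student == "s2":
        raise ValueError("boom")
    return student.upper()


demo_results, demo_errors = run_concurrently(["s1", "s2", "s3"], fake_recommender)
print(len(demo_results), len(demo_errors))
```

The result dictionaries use the same keys as the sequential loop, so downstream evaluation code would not need to change.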

In [None]:
'''
Here is an example evaluation.
Pros:
- It evaluates the output on several relevant metrics

Cons:
- It's sequential (inefficient)
- It produces text outputs instead of numerics, it's hard to go from this to a single 
metric we can use to decide if the model is working well or not
- It doesn't test everything we want / care about
- It uses an LLM to evaluate things like course_time_slots overlaps that could be done programmatically
- It only evaluates one student at a time and provides no aggregate metrics
'''

from langchain.evaluation import load_evaluator
from pprint import pprint
# This is equivalent to loading using the enum
from langchain.evaluation import EvaluatorType


recommendation = recommendations[0]
student = recommendation['student']
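# Rebuild the prompt with the same inputs the recommender saw (majors resolved to Major objects)
query = prompt.invoke({'request': student_request, 'possible_courses': all_courses, 'student': student, 'majors': student.get_student_majors(all_majors)})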
prediction = recommendation['recommendation']


# Multiple criteria can be specified in a single evaluator like this, though it is generally not recommended
custom_criteria = {
    "course_relevancy": "Are the selected courses relevant for the student?",
    "course_time_slots": "Are there time overlaps?",
    "prerequisites": "Before taking the recommended classes, does the student have necessary prerequisites?",
    "student_request": "Do the recommended courses follow the student request?",
}

eval_chain = load_evaluator(
    EvaluatorType.CRITERIA,
    criteria=custom_criteria,
    llm=model
)
eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)

print("Recommendation")
pprint(recommendation)

print("\nMulti-criteria evaluation")
pprint(eval_result)
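
The example evaluation above produces text for one student at a time. One way to get a single comparable number per criterion is to record each check as a boolean per student and report pass rates. This is a minimal sketch: `aggregate_pass_rates` is a hypothetical helper, and the check names and values below are made up rather than real evaluation results.

```python
from typing import Dict, List


def aggregate_pass_rates(per_student: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of students passing each check, plus an overall all-checks rate."""
    if not per_student:
        return {}
    total = len(per_student)
    check_names = sorted({name for row in per_student for name in row})
    # A check that is missing for a student counts as a failure.
    rates = {
        name: sum(row.get(name, False) for row in per_student) / total
        for name in check_names
    }
    rates["all_checks"] = sum(
        all(row.get(name, False) for name in check_names) for row in per_student
    ) / total
    return rates


# Made-up per-student outcomes, for illustration only.
rows = [
    {"no_time_clash": True, "prerequisites_met": True},
    {"no_time_clash": True, "prerequisites_met": False},
    {"no_time_clash": False, "prerequisites_met": False},
    {"no_time_clash": True, "prerequisites_met": True},
]
rates = aggregate_pass_rates(rows)
print(rates)
```

Keeping per-check rates alongside the overall rate shows which rule drives the failures, which a single score would hide.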