## This notebook:
- Recommends courses based on free text the user enters. This is closer to an information retrieval system or search engine than to a menu-driven recommender, because the user input is free text, as in a Google search.
1. User input
    - The user enters a block of free text (no preset options)
    - This block of text describes their interests
    - Process this block of text

2. Course descriptions
    - Process the text of the course descriptions

3. Matching
    - Match the topic of the user's input text against all the courses
    - Compute similarity scores
    - Rank these scores from high to low
    - Return the top n recommendations in order of similarity
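
The three steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical word-count stand-in for a real sentence encoder, `top_n_courses` is an illustrative helper, and cosine similarity is used as the score, whereas the notebook's own code ranks by FAISS L2 distance.

```python
import numpy as np

def toy_embed(texts, vocabulary):
    """Hypothetical stand-in for a sentence encoder: word-count vectors."""
    vectors = np.zeros((len(texts), len(vocabulary)), dtype=np.float32)
    for row, text in enumerate(texts):
        for word in text.lower().split():
            if word in vocabulary:
                vectors[row, vocabulary[word]] += 1.0
    return vectors

def top_n_courses(query, course_texts, n):
    """Return (course_index, cosine_similarity) pairs, highest similarity first."""
    # Steps 1 and 2: turn the query and the course descriptions into vectors
    words = sorted({w for t in course_texts + [query] for w in t.lower().split()})
    vocabulary = {word: i for i, word in enumerate(words)}
    course_vecs = toy_embed(course_texts, vocabulary)
    query_vec = toy_embed([query], vocabulary)[0]

    # Step 3: cosine similarity = dot product divided by the vector norms
    course_norms = np.linalg.norm(course_vecs, axis=1)
    query_norm = np.linalg.norm(query_vec)
    denom = np.maximum(course_norms * query_norm, 1e-12)  # avoid division by zero
    scores = course_vecs @ query_vec / denom

    # Rank from high to low and keep the n best matches
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]

courses = [
    "introduction to programming and data",
    "history of jazz music",
    "machine learning with data",
]
results = top_n_courses("learning from data", courses, n=2)
for course_index, score in results:
    print(courses[course_index], "(Similarity: %.4f)" % score)
```

With a real encoder, only the embedding step changes; the scoring and ranking logic stays the same.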

In [1]:
import pandas as pd
import numpy as np
import sklearn
import faiss

import spacy

# nlp = spacy.load('en_core_web_md')

from sentence_transformers import SentenceTransformer
import scipy.spatial

import pickle

In [2]:
def load_df(course_type):
    # course_type selects which course catalog to load:
    #   'Fall 2021'   -> credit courses, assets/free_text_f_21.csv
    #   'Winter 2022' -> credit courses, assets/free_text_w_22.csv
    #   'online'      -> online courses, Michigan Online CSV
    
    if course_type == 'Fall 2021':
        df = pd.read_csv('assets/free_text_f_21.csv')
        corpus = df['text'].tolist()
        corpus_embeddings_file = 'corpus_embeddings_f_21.pkl'
        embedder_file = 'embedder_f_21.pkl'
        
    elif course_type == 'Winter 2022':
        df = pd.read_csv('assets/free_text_w_22.csv')
        corpus = df['text'].tolist()
        corpus_embeddings_file = 'corpus_embeddings_w_22.pkl'
        embedder_file = 'embedder_w_22.pkl'
        
    elif course_type == 'online':
        df = pd.read_csv('assets/original/2021-10-19-MichiganOnline-courses.csv')
        corpus = df['description'].tolist()
        corpus_embeddings_file = 'corpus_embeddings_online.pkl'
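        embedder_file = 'embedder_online.pkl'

    else:
        # Fail loudly instead of hitting an UnboundLocalError at the return below
        raise ValueError("Unknown course_type: %r (expected 'Fall 2021', 'Winter 2022', or 'online')" % course_type)

    return corpus_embeddings_file, embedder_file, corpus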

In [3]:
def load_sentence_transformer(course_type):
    # course_type: 'Fall 2021', 'Winter 2022', or 'online' (see load_df)
    # Loads the pickled corpus embeddings and the pickled embedder for that course type
    
    corpus_embeddings_file, embedder_file, corpus = load_df(course_type)

    # Load sentences & embeddings from disk
    with open(corpus_embeddings_file, "rb") as fIn:
        stored_data = pickle.load(fIn)
        stored_corpus = stored_data['corpus']
        stored_embeddings = stored_data['embeddings']

    with open(embedder_file, "rb") as fIn:
        stored_embedder = pickle.load(fIn)
        
    return stored_data, stored_corpus, stored_embeddings, stored_embedder, corpus

In [4]:
def recommender(course_type, queries, k):
    # queries: list of free-text strings describing the user's interests
    # k: number of nearest neighbors (recommendations) to print per query
    
    stored_data, stored_corpus, stored_embeddings, stored_embedder, corpus = load_sentence_transformer(course_type)
    
    # FAISS expects a contiguous float32 matrix of shape (n_courses, d)
    corpus_matrix = np.ascontiguousarray(np.stack(stored_embeddings, axis=0), dtype=np.float32)
    d = corpus_matrix.shape[1]      # embedding dimension (768 for the stored embeddings)
    index = faiss.IndexFlatL2(d)    # exact (brute-force) L2 search
    index.add(corpus_matrix)

    # Embed all queries and search once; row i holds the k nearest courses for query i
    query_embeddings = np.asarray(stored_embedder.encode(queries), dtype=np.float32)
    distances, indices = index.search(query_embeddings, k)

    for query, query_distances, query_indices in zip(queries, distances, indices):
        print("\n======================\n")
        print("Query:", query)
        print("\nTop %d most similar sentences in corpus:" % k)
        for distance, idx in zip(query_distances, query_indices):
            print(corpus[idx], "(Distance: %.4f)" % distance)
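
For reference, the exact search that `faiss.IndexFlatL2` performs can be reproduced in plain NumPy, which may help when reading the distances printed below. FAISS reports squared Euclidean distances for this index, so smaller values mean closer matches. The function name `brute_force_l2_search` and the toy vectors are hypothetical and exist only for this sketch.

```python
import numpy as np

def brute_force_l2_search(corpus_vectors, query_vectors, k):
    """Return (squared L2 distances, indices) of the k nearest corpus rows per query."""
    corpus_vectors = np.asarray(corpus_vectors, dtype=np.float32)
    query_vectors = np.asarray(query_vectors, dtype=np.float32)

    # Pairwise differences with shape (n_queries, n_corpus, d)
    diffs = query_vectors[:, None, :] - corpus_vectors[None, :, :]
    squared = np.sum(diffs ** 2, axis=2)

    # Sort each row from nearest to farthest and keep the first k columns
    indices = np.argsort(squared, axis=1)[:, :k]
    distances = np.take_along_axis(squared, indices, axis=1)
    return distances, indices

# Toy 2-dimensional data (illustrative only)
toy_corpus = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
toy_queries = np.array([[1.0, 1.0]])
distances, indices = brute_force_l2_search(toy_corpus, toy_queries, k=2)
print(indices, distances)
```

This version materializes every pairwise difference, so it is only practical for small corpora; the FAISS index does the same exact comparison far more efficiently.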

In [27]:
queries =["I am interested in computer science",
         "I like pop music", 
        "I like Asian culture, especially Japanese history", 
        "I like to use computer skill to resolve biological problems"]
recommender('Fall 2021', queries, 5)



Query: I am interested in computer science

Top 5 most similar sentences in corpus:
Special Topics  - Discover Computer Science Topics of current interest selected by the faculty. (Distance: 127.5233)
Research Sem in Info   (Distance: 145.3734)
Prog, Info & People  Introduction to programming with a focus on applications in informatics.  Covers the fundamental elements of a modern programming language and how to access data on the internet.  Explores how humans and technology complement one another, including techniques used to coordinate groups of people working together on software development. (Distance: 150.9201)
Research Work EECS   (Distance: 155.8544)
Being Data Scientist   (Distance: 163.5696)


Query: I like pop music

Top 5 most similar sentences in corpus:
Advan Excel with VBA   (Distance: 220.4980)
Positive Communicatn   (Distance: 225.4344)
Lit&Social Change  - What Difference Can a Story Make? Sure, pop culture is fun. It’s great to watch an Academy Award winner, read a

In [29]:
queries = ['I want to learn more about data science']
recommender('online', queries, 5)



Query: I want to learn more about data science

Top 5 most similar sentences in corpus:
This course extends our understanding of information visualization. Leveraging the topics covered in Information Visualization I, we introduce interactive techniques that can be used broadly for visualization. The course will also introduce techniques for visualizing specific data types: temporal, network, hierarchical, and text.
Information visualization is a crucial step in both understanding and communicating data. In the first Information Visualization course we focused on the fundamental principles behind visualization--the most basic types of data, encodings, and the perceptual and cognitive processes that make some visualizations better than others. This class will expand on these basic principles to include interactivity. We will see how interactivity can act as a boost to expressiveness and effectiveness by studying a handful of common cross-cutting techniques. The class will also expand 