## This notebook:
- Recommends courses based on free text entered by the user. Because the input is unconstrained text, this works more like an information retrieval system or search engine (similar to Google search).
1. User input
    - The user enters a block of text (free text, no preset options)
    - This block of text describes their interests
    - Process this block of text
    
2. Course descriptions
    - Process the text of the course descriptions
    
3. Matching and ranking
    - Match the user's input text against all the courses
    - Compute similarity scores
    - Rank the scores from high to low
    - Return the top n recommendations in order of similarity
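
The three steps above can be sketched end to end on toy data. This is only an illustration: it swaps in simple word-count vectors and cosine similarity for the BERT embeddings and Faiss index used later in this notebook, and the course codes, descriptions, and helper names (`tokenize`, `cosine`, `recommend`) are made up for the demonstration.

```python
import math
import re
from collections import Counter

# Hypothetical mini-catalog; the real notebook uses the course descriptions in f_21.
courses = {
    "DEMO 101": "Introduction to programming and computer science fundamentals",
    "DEMO 202": "History and culture of Japan from ancient times to today",
    "DEMO 303": "Popular music and its role in modern culture",
}

def tokenize(text):
    # Steps 1 and 2: lowercase the text and count its words
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two word-count vectors
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(query, n):
    # Step 3: score every course, rank from high to low, keep the top n
    q = tokenize(query)
    scored = [(cosine(q, tokenize(desc)), code) for code, desc in courses.items()]
    scored.sort(reverse=True)
    return [(code, round(score, 4)) for score, code in scored[:n]]

print(recommend("I am interested in computer science", 2))
```

Word counts only match on shared vocabulary, which is one reason the notebook moves on to sentence embeddings for the real corpus.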

In [16]:
import pandas as pd
import numpy as np
import sklearn

In [5]:
online = pd.read_csv('assets/original/2021-10-19-MichiganOnline-courses.csv')
f_21 = pd.read_csv('assets/f_21_merge.csv')
w_22 = pd.read_csv('assets/w_22_merge.csv')

In [6]:
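# Work on a copy so the cleaning steps below do not modify f_21 itself
df = f_21.copy()
df.drop_duplicates(subset=['course'], inplace=True)  # keep the first row for each course
df.dropna(subset=['description'], inplace=True)      # a description is needed for embedding
df.fillna('', inplace=True)                          # blank out the remaining missing values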

In [7]:
df.head()

Unnamed: 0,Class Nbr,course,Term,Session,Acad Group,Subject,Course Title,description,Component,Time,...,Seats Remaining,Has WL,Units,sub_title,credits,requirements_distribution,consent,advisory_prerequisites,other_course_info,repeatability
0,30282,AAS 103,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Social Sci Seminar,This course seeks to introduce students to eve...,SEM,1-230PM,...,1,Y,3.0,- Reading Africa: Critical Perspectives on Po...,3,SS,,"Enrollment restricted to first-year students, ...",(Cross-Area Courses). May not be included in a...,May not be repeated for credit.
3,30276,AAS 104,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Humanities Seminar,This seminar introduces first-year students to...,SEM,1130-1PM,...,2,Y,3.0,- Black Lives and Life Writing: How We Tell S...,3,HU,With permission of instructor.,"Enrollment restricted to first-year students, ...",(Cross-Area Courses). May not be included in a...,May not be repeated for credit.
21,19186,AAS 115,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Elementary Swahili,This course is an introduction to spoken and w...,REC,1-2PM,...,13,Y,4.0,- Swahili Language and Culture,4,,,,,May not be repeated for credit.
25,26657,AAS 125,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Elem Yoruba I,This course is designed to introduce the Yorub...,REC,9-10AM,...,5,Y,4.0,- Yoruba,4,,,,May not repeat the same language at the same l...,May not be repeated for credit.
26,30898,AAS 202,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Intro Afr Diasp Stds,Is the African Diaspora a concept or an actual...,SEM,1-230PM,...,9,Y,3.0,- Global Blackness,3,,,,,May not be repeated for credit.


In [8]:
df_1 = df[['Class Nbr', 'course', 'Course Title','description']]

In [9]:
corpus = df_1['description'].tolist()
corpus[:10]

['This course seeks to introduce students to everyday life in urban Africa. The course is designed to equip students with basic and useful knowledge about the how urban residents – rich and poor, newcomers and old-timers, young and old, men and women – negotiate the challenges of living in cities.  This course focuses on networks, associational life, and relationships that are the ties that bind urban residents together.  Social organization, religious belief and practice, ethnicity, economic and political systems, the arts, and popular culture are some of the topics we will explore.  We will be approaching these themes from a variety of disciplinary perspectives, including history, anthropology, literature, political science, sociology, and economics.',
 'This seminar introduces first-year students to the intellectual community of humanities scholars working in the field of Afroamerican and African studies.  The topic of the seminar varies from year to year.',
 'This course is an intr

In [10]:
import spacy

nlp = spacy.load('en_core_web_md')

## BERT Sentence Transformer

In [11]:
#pip install sentence_transformers

In [12]:
from sentence_transformers import SentenceTransformer
import scipy.spatial
embedder = SentenceTransformer('bert-base-nli-mean-tokens')

In [13]:
%%time
corpus_embeddings = embedder.encode(corpus)

Wall time: 10min 16s


## Candidate Generation using the Faiss vector similarity search library
Faiss is a library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors.

In [14]:
#pip install faiss-cpu

Faiss resources:
1. Tutorial: https://github.com/facebookresearch/faiss/wiki/Getting-started  
2. Repository (facebookresearch): https://github.com/facebookresearch/faiss
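
To make concrete what an exact, flat L2 index such as the `IndexFlatL2` used in the next cell computes, here is a minimal pure-Python sketch of brute-force nearest-neighbour search. It is not Faiss code: the function name `flat_l2_search` and the 2-dimensional toy vectors are assumptions for illustration. Like `IndexFlatL2`, it reports squared Euclidean distances, smallest first.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # squared Euclidean distance; no square root is taken
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-dimensional "embeddings"; the sentence embeddings in this notebook have 768 dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
print(indices, distances)
```

Every stored vector is compared with the query, so the result is exact but the cost grows linearly with the corpus size. That is acceptable for a couple of thousand course descriptions.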

In [17]:
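import faiss

d = 768  # embedding dimension of 'bert-base-nli-mean-tokens'
index = faiss.IndexFlatL2(d)  # exact (brute-force) search on L2 distance
print(index.is_trained)  # a flat index needs no training
# Faiss expects a float32 array of shape (n_vectors, d)
index.add(np.asarray(corpus_embeddings, dtype='float32'))
print(index.ntotal)  # number of indexed course descriptions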

True
1910


In [33]:
def recommander(df, queries, k):
    # queries: list of free-text strings describing the user's interests
    # k: number of nearest neighbours to show for each query
    # df could be f_21 or w_22; note that `index` and `corpus` were built
    # from f_21 above, so the results always refer to that corpus

    query_embeddings = np.asarray(embedder.encode(queries), dtype='float32')
    # one batched search: row q of each array holds the k hits for queries[q]
    # (IndexFlatL2 reports squared L2 distances)
    distances, indices = index.search(query_embeddings, k)

    for q, query in enumerate(queries):
        print("\n======================\n")
        print("Query:", query)
        print("\nTop %d most similar sentences in corpus:" % k)
        for rank in range(k):
            print(corpus[indices[q, rank]], "(Distance: %.4f)" % distances[q, rank])
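
The function above prints only the matched descriptions. Because position i in the index corresponds to position i in `corpus`, the same positions can also be used to look up the course code and title. A minimal sketch, using plain lists in place of the `df_1` columns and a hypothetical helper `hits_to_courses` (the helper and the hit positions and distances below are made up for illustration):

```python
def hits_to_courses(codes, titles, indices, distances):
    """Pair each hit position with its course code, title, and distance."""
    results = []
    for pos, dist in zip(indices, distances):
        results.append({"course": codes[pos], "title": titles[pos], "distance": dist})
    return results

# Toy stand-ins for df_1['course'].tolist() and df_1['Course Title'].tolist()
codes = ["AAS 103", "AAS 104", "AAS 115"]
titles = ["Social Sci Seminar", "Humanities Seminar", "Elementary Swahili"]

# Pretend one search returned these positions and distances (made-up values)
for row in hits_to_courses(codes, titles, indices=[2, 0], distances=[1.5, 2.25]):
    print(row["course"], "-", row["title"], "(Distance: %.4f)" % row["distance"])
```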


In [34]:
queries =["I am interested in computer science",
         "I like pop music", 
        "I like Asian culture, especially Janpanese history", 
        "I like to use computer skill to resolve biological problems"]
recommander(f_21, queries, 5)



Query: I am interested in computer science

Top 5 most similar sentences in corpus:
Introduction to programming with a focus on applications in informatics.  Covers the fundamental elements of a modern programming language and how to access data on the internet.  Explores how humans and technology complement one another, including techniques used to coordinate groups of people working together on software development. (Distance: 153.2221)
Includes the study of digital synthesis techniques. Special attention is given to the relationship between technology, the creative process, and individual statement. (Distance: 160.4367)
Includes the study of digital synthesis techniques. Special attention is given to the relationship between technology, the creative process, and individual statement. (Distance: 160.4367)
Practical experience in the use of a system of computer programs for social scientists. (Distance: 162.5113)
Advanced technical communication for computer science.  Design and wri