## This notebook:
- Recommend a course based on the course description
1.    
    - Topic modeling on course descriptions
    - Take a course title (e.g. AAS 103)
    - Take the course description of this course
    - Process text of the description
    - Find the topics of this block of text
    
2.    
    - Process text of the descriptions of all other courses
    - Find the topics of those blocks of text
    
3.    
    - Match the topic of the input course and other courses
    - Compute similarity scores
    - Rank these scores from high to low
    - Return the n number of recommendations needed (num_of_rec) by order of similarity
    
- Need to improve

    - Text processing
    - Topic modeling (the recommendations are not quite logical yet since the text processing and topic modeling are not quite well-done yet)
    - Efficiency of the algorithm (slow now)

## Update
1. Because the courses already have distinct clusters such as academic groups (LSI, engineering, dentistry ...) and subject (Afroamerican sections, etc), it makes more sense that we recommend courses in the same academic group and subject.


2. Somehow, processing the texts (stopwords removal, lemma, etc) produce poorer recommendations. The results look much better without the language processing. 


3. About topic modeling -- I'm not sure how we could utilize topic modeling, since the total number of academic group is about 20 so if we cluster the courses with topic modeling, it's not going to work very well unless we use a large number of cluster like 200 - 500. We could try topic modeling in the subject level, but I think count vec and tfidf vec works pretty well, so not sure if that would be necessary. 

-------------
- Concat fall and winter -> add "semester" parameter (fall, winter, doesn't matter)
- Add LSA requirement parameter

- Maybe we also want to add filter level None
- I think we should add an option to recommenda a Michigan online course based on the content of the entered course

### Paper that helps: https://www.frontiersin.org/articles/10.3389/frai.2020.00042/full

In [11]:
import pandas as pd
import neattext.functions as nfx

In [12]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel, sigmoid_kernel

In [13]:
# Load our dataset
f_21 = pd.read_csv('assets/f_21_merge.csv')
w_22 = pd.read_csv('assets/w_22_merge.csv')

## Concat

In [14]:
def concat(f_df, w_df):
    f_df['semester'] = 'fall'
    w_df['semester'] = 'winter'
    df = pd.concat([f_df, w_df])
    return df

In [15]:
fw = concat(f_21, w_22)

## Cosine similarity

#### Parameters:

* df = input dataset
* course_title = input course
* num_or_rec = number of recommendation
* filter_level = 'academic_group', subject', or 'no_filter'
* semester = 'fall' or 'winter'
* lsa

In [16]:
def make_recommendation_cos(df, course_title, num_of_rec = 10, filter_level = 'subject', semester = 'fall', lsa = None):
    # Specify semester
    df = df[df['semester'] == semester]
    
    # Clean df
    df = df.fillna('').drop_duplicates(subset=['course']).reset_index().drop(columns='index')
    
    # Input 
    input_ag = df.loc[df['course'] == course_title, 'Acad Group'].unique()
    input_sub = df.loc[df['course'] == course_title, 'Subject'].unique()
    input_course = df.loc[df['course'] == course_title, 'Course Title'].unique()
    
    # Filter the df
    if filter_level == 'academic_group':
        df = df[df['Acad Group'].isin(input_ag)] 
    elif filter_level == 'subject':
        df = df[(df['Subject'].isin(input_sub)) | (df['Course Title'].isin(input_course))]
    
    # Merge all the text information
    df['text'] = df['Acad Group'] + ' ' + df['Subject'] + ' ' + df['Course Title'] + ' ' + df['description']

    # Reset index
    df = df.reset_index(drop = True)
    
    # Vectorize our Text
    count_vect = CountVectorizer()
    cv_mat = count_vect.fit_transform(df['text'])

    df_cv_words = pd.DataFrame(cv_mat.todense(), columns=count_vect.get_feature_names())

    # Cosine Similarity Matrix
    cosine_sim_mat = cosine_similarity(cv_mat)

    # Get Course ID/Index
    course_indices = pd.Series(df.index, index=df['course'])

    # ID for title
    idx = course_indices[course_title]

    # Course Indice
    # Search inside cosine_sim_mat
    scores = list(enumerate(cosine_sim_mat[idx]))

    # Scores
    # Sort Scores
    sorted_scores = sorted(scores, key=lambda x:x[1], reverse=True)

    # Recommender
    selected_course_indices = [i[0] for i in sorted_scores[1:]]
    selected_course_scores = [i[1] for i in sorted_scores[1:]]

    result = df[df.columns].iloc[selected_course_indices]

    rec_df = pd.DataFrame(result)

    rec_df['similarity_scores'] = selected_course_scores

    return rec_df[:num_of_rec]

## Sigmoid kernel

In [17]:
def make_recommendation_sk(df, course_title, num_of_rec = 10, filter_level = 'academic_group', semester = 'fall', lsa = None):
    # Specify semester
    df = df[df['semester'] == semester]
    
    # Clean df
    df = df.fillna('').drop_duplicates(subset=['course']).reset_index().drop(columns='index')
    
    # Input 
    input_ag = df.loc[df['course'] == course_title, 'Acad Group'].unique()
    input_sub = df.loc[df['course'] == course_title, 'Subject'].unique()
    input_course = df.loc[df['course'] == course_title, 'Course Title'].unique()
    
    # Filter the df
    if filter_level == 'academic_group':
        df = df[df['Acad Group'].isin(input_ag)] 
    elif filter_level == 'subject':
        df = df[(df['Subject'].isin(input_sub)) | (df['Course Title'].isin(input_course))]

    # Merge all the text information
    df['text'] = df['Acad Group'] + ' ' + df['Subject'] + ' ' + df['Course Title'] + ' ' + df['sub_title'] + ' ' + df['description']

    # Reset index
    df = df.reset_index(drop = True)
    
    # Vectorize our Text
    tfidf = TfidfVectorizer(max_df = 0.5, max_features=None, 
                strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                ngram_range=(1,1))

    # Fitting the TF-IDF on the 'description' text
    tfidf_matrix = tfidf.fit_transform(df['text'])

    # Compute the sigmoid kernel
    sig = sigmoid_kernel(tfidf_matrix, tfidf_matrix)

    # Reverse mapping of indices and course titles
    indices = pd.Series(df.index, index=df['course']).drop_duplicates()

    # Get the index corresponding to course title
    idx = indices[course_title]

    # Get the pairwise similarity scores 
    sig_scores = list(enumerate(sig[idx]))

    # Sort the courses
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)

    # Scores of the n most similar courses
    sig_scores = sig_scores[1:num_of_rec+1]

    # Take the indices
    course_indices = [i[0] for i in sig_scores]

    # Top 10 most similar courses
    rec_df = df[df.columns].iloc[course_indices]
    
    rec_df['sig_scores'] = sig_scores

    return rec_df

In [21]:
make_recommendation_cos(f_21, 'ECON 208', 10, 'academic_group')

Unnamed: 0,Class Nbr,course,Term,Session,Acad Group,Subject,Course Title,description,Component,Time,...,sub_title,credits,requirements_distribution,consent,advisory_prerequisites,other_course_info,repeatability,semester,text,similarity_scores
1207,30254,MIDEAST 590,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Middle East Studies (MIDEAST) Open Sections,Topics in MES,What does it mean to love? As both a literary ...,LEC,530-7PM,...,"- Sin, Sex, and Desire: Romance in the Middle...",3,,With permission of instructor.,Upper-level undergraduates or graduate student...,,May be elected three times for credit. May be...,fall,"Literature, Sci, and the Arts Middle East Stud...",0.636491
42,19146,ALA 102,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Applied Liberal Arts (ALA) Open Sections,Student in the Univ,This course will provide students with an oppo...,DIS,4-5PM,...,,1,,,Michigan Community Scholars Program participant.,,May not be repeated for credit.,fall,"Literature, Sci, and the Arts Applied Liberal ...",0.622891
1382,23618,PPE 400,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts","Philosophy, Politics and Economics (PPE) Open ...",Political Economy,This is the capstone seminar for non-Honors se...,SEM,1-230PM,...,,3,ULWR,,Completion of distribution requirements for PP...,,May not be repeated for credit.,fall,"Literature, Sci, and the Arts Philosophy, Poli...",0.620149
793,11250,FTVM 427,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts","Film, Television, and Media (FTVM) Open Sections",Screenwriting III,This advanced screenwriting seminar is a pure ...,LEC,6-9PM,...,,3,,With permission of instructor.,FTVM 310 and 410. Limited to students whose w...,,May not be repeated for credit.,fall,"Literature, Sci, and the Arts Film, Television...",0.610623
640,32817,EEB 401,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Ecology and Evolutionary Biology (EEB) Open Se...,Advanced Topics,This field course will be held at the UM Biol...,LAB,ARR,...,- Microbes in the Wild: Environmental Microbi...,2,BS,With permission of department.,Intended for senior majors. The prerequisites ...,,May be repeated for a maximum of 6 credit(s). ...,fall,"Literature, Sci, and the Arts Ecology and Evol...",0.609963
13,27259,AAS 304,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Gender&Immigr,"Refugees, migrants, immigrants, diaspora grou...",SEM,4-530PM,...,"- Refugees of Unjust Worlds: Globalization, G...",3,SS,With permission of instructor.,The seminar is intended for junior and senior ...,,May not be repeated for credit.,fall,"Literature, Sci, and the Arts Afroamerican & A...",0.608691
953,32519,HISTORY 487,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",History (HISTORY) Open Sections,Convers & Christnty,\n\r\n\r\nOur seminar investigates change of v...,SEM,2-5PM,...,,3,,,,,May not be repeated for credit.,fall,"Literature, Sci, and the Arts History (HISTORY...",0.608198
1464,21179,REEES 214,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts","Russian, East European and Eurasian Studies (R...",Rock Poetry,\n\n\nThis course provides an introduction to ...,LEC,230-4PM,...,,3,HU,,,,May not be repeated for credit.,fall,"Literature, Sci, and the Arts Russian, East Eu...",0.60798
1738,27282,WGS 304,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Women's and Gender Studies (WGS) Open Sections,Gender&Immigr,"Refugees, migrants, immigrants, diaspora grou...",SEM,4-530PM,...,"- Refugees of Unjust Worlds: Globalization, G...",3,SS,With permission of instructor.,The seminar is intended for junior and senior ...,,May not be repeated for credit.,fall,"Literature, Sci, and the Arts Women's and Gend...",0.607547
1321,21177,POLISH 214,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Polish (POLISH) Open Sections,Rock Poetry,\n\n\nThis course provides an introduction to ...,LEC,230-4PM,...,,3,HU,,,,May not be repeated for credit.,fall,"Literature, Sci, and the Arts Polish (POLISH) ...",0.605946


In [20]:
make_recommendation_sk(f_21, 'ECON 208', 10, 'academic_group')

Unnamed: 0,Class Nbr,course,Term,Session,Acad Group,Subject,Course Title,description,Component,Time,...,sub_title,credits,requirements_distribution,consent,advisory_prerequisites,other_course_info,repeatability,semester,text,sig_scores
619,11083,ECON 621,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Economics (ECON) Open Sections,Labor Economics I,,LEC,1-230PM,...,,,,,,,,fall,"Literature, Sci, and the Arts Economics (ECON)...","(619, 0.761606152801366)"
621,18980,ECON 641,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Economics (ECON) Open Sections,Inter Trade Theory,,LEC,230-4PM,...,,,,,,,,fall,"Literature, Sci, and the Arts Economics (ECON)...","(621, 0.7616033512213852)"
626,11085,ECON 695,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Economics (ECON) Open Sections,Intro Research I,,SEM,830-10AM,...,,,,,,,,fall,"Literature, Sci, and the Arts Economics (ECON)...","(626, 0.7616032090632607)"
632,36039,ECON 995,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Economics (ECON) Open Sections,Diss-Cand,,IND,ARR,...,,,,,,,,fall,"Literature, Sci, and the Arts Economics (ECON)...","(632, 0.7616027800461544)"
631,39059,ECON 990,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Economics (ECON) Open Sections,Diss-Precand,,IND,ARR,...,,,,,,,,fall,"Literature, Sci, and the Arts Economics (ECON)...","(631, 0.7616027057399164)"
607,27475,ECON 500,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Economics (ECON) Open Sections,Quantitative Methods,,DIS,8-9PM,...,,,,,,,,fall,"Literature, Sci, and the Arts Economics (ECON)...","(607, 0.7616026208594763)"
618,19078,ECON 617,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Economics (ECON) Open Sections,Game Theory,,LEC,1130-1PM,...,,,,,,,,fall,"Literature, Sci, and the Arts Economics (ECON)...","(618, 0.7616023429044292)"
628,31648,ECON 751,Fall 2021,First 7 Week Session,"Literature, Sci, and the Arts",Economics (ECON) Open Sections,Computational Econ,,LEC,1-230PM,...,,,,,,,,fall,"Literature, Sci, and the Arts Economics (ECON)...","(628, 0.7616021425456341)"
629,31649,ECON 752,Fall 2021,Second 7 Week Session,"Literature, Sci, and the Arts",Economics (ECON) Open Sections,AdvMathMethDynModel,,LEC,1-230PM,...,,,,,,,,fall,"Literature, Sci, and the Arts Economics (ECON)...","(629, 0.7616018850876954)"
612,40047,ECON 599,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Economics (ECON) Open Sections,Special Tutorial,,IND,ARR,...,,,,,,,,fall,"Literature, Sci, and the Arts Economics (ECON)...","(612, 0.7616017867992091)"
