## This notebook
- Recommends courses based on their course descriptions
1. Input course
    - Run topic modeling on the course descriptions
    - Take a course title (e.g. AAS 103)
    - Look up the description of that course
    - Process the text of the description
    - Find the topics of this block of text
    
2. All other courses
    - Process the text of the descriptions of all other courses
    - Find the topics of those blocks of text
    
3. Matching
    - Match the topics of the input course against those of the other courses
    - Compute similarity scores
    - Rank the scores from high to low
    - Return the top `num_of_rec` recommendations in order of similarity
    
- Needs improvement

    - Text processing
    - Topic modeling (the recommendations are not yet very sensible because the text processing and topic modeling still need work)
    - Efficiency of the algorithm (currently slow)
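
The three steps above could be sketched as follows. This is a minimal illustration on made-up descriptions, not the method used in the cells further down: it assumes scikit-learn's `LatentDirichletAllocation` as the topic model, and the function name `recommend_by_topic`, the toy course codes, and the `n_topics` parameter are all hypothetical.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (made up for illustration; not taken from the course dataset)
toy = pd.DataFrame({
    'course': ['LANG 101', 'LANG 102', 'HIST 210', 'HIST 220'],
    'description': [
        'elementary language course speaking reading writing grammar',
        'intermediate language course grammar reading conversation',
        'history of political movements and social change',
        'social history of political revolutions and movements',
    ],
})

def recommend_by_topic(df, course_title, num_of_rec=2, n_topics=2):
    # Use a clean 0..n-1 index so labels and row positions agree
    df = df.reset_index(drop=True)

    # Steps 1 and 2: turn every description into a topic distribution
    counts = CountVectorizer().fit_transform(df['description'])
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)  # shape: (n_courses, n_topics)

    # Step 3: compare the input course's topics with every course
    idx = df.index[df['course'] == course_title][0]
    sims = cosine_similarity(doc_topics[[idx]], doc_topics).ravel()

    # Rank from high to low, leaving out the input course itself
    out = df.assign(similarity_scores=sims).drop(index=idx)
    return out.sort_values('similarity_scores', ascending=False).head(num_of_rec)

recs = recommend_by_topic(toy, 'LANG 101')
print(recs[['course', 'similarity_scores']])
```

With only four short texts the topics themselves are not meaningful; the point is the shape of the pipeline (vectorize, fit topics, compare topic distributions, rank).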

## Update
1. The courses already fall into distinct clusters, such as academic group (LSI, engineering, dentistry, ...) and subject (Afroamerican sections, etc.), so it makes more sense to recommend courses within the same academic group and subject.


2. Somehow, processing the text (stop-word removal, lemmatization, etc.) produces poorer recommendations. The results look much better without the language processing.


3. About topic modeling: I'm not sure how we could use it. There are only about 20 academic groups, so clustering the courses with topic modeling will not work very well unless we use a large number of clusters, such as 200 to 500. We could try topic modeling at the subject level, but count vectors and TF-IDF vectors work pretty well, so I'm not sure that would be necessary.
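
One way to check point 2 is to run the same similarity ranking with and without a preprocessing step and compare the results side by side. The sketch below does this on made-up descriptions, using scikit-learn's built-in English stop-word list as the only preprocessing. The helper `top_matches` and the toy texts are hypothetical, and the lemmatization step mentioned above is not reproduced.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up descriptions for illustration only
docs = [
    'This course is an introduction to the Turkish language',
    'This course is about the history of the Ottoman Empire',
    'Advanced reading and writing in the Turkish language',
]

def top_matches(texts, query_pos, vectorizer):
    # Vectorize, then rank all other texts by cosine similarity to the query
    mat = vectorizer.fit_transform(texts)
    sims = cosine_similarity(mat[query_pos], mat).ravel()
    order = [int(i) for i in np.argsort(-sims, kind='stable') if i != query_pos]
    return [(i, round(float(sims[i]), 3)) for i in order]

raw = top_matches(docs, 0, CountVectorizer())
no_stop = top_matches(docs, 0, CountVectorizer(stop_words='english'))

print('raw text:          ', raw)
print('stop words removed:', no_stop)
```

Each result is a list of `(position, score)` pairs, so the two rankings can be compared directly for any input course.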

-------------
- Concatenate the fall and winter data and add a `semester` parameter (fall, winter, or either)
- Add an LSA requirement parameter
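
A possible shape for these two additions is sketched below, assuming both terms share the same columns. The helper names `combine_terms` and `filter_courses`, the new `semester` column, and the toy rows are hypothetical. The `requirements_distribution` column appears in the output tables later in this notebook, and it is assumed here to be what "LSA requirement" refers to.

```python
import pandas as pd

# Toy stand-ins for f_21 and w_22 (made-up rows)
fall = pd.DataFrame({
    'course': ['AAA 101', 'BBB 201'],
    'requirements_distribution': ['HU', ''],
})
winter = pd.DataFrame({
    'course': ['CCC 301', 'DDD 401'],
    'requirements_distribution': ['HU, RE', None],
})

def combine_terms(fall_df, winter_df):
    # Tag each row with its term, then stack the two frames
    return pd.concat(
        [fall_df.assign(semester='fall'), winter_df.assign(semester='winter')],
        ignore_index=True,
    )

def filter_courses(df, semester=None, requirement=None):
    # semester=None means "either term"; requirement=None means "no filter"
    if semester is not None:
        df = df[df['semester'] == semester]
    if requirement is not None:
        # A course can carry several codes, e.g. "HU, RE"
        codes = df['requirements_distribution'].fillna('').str.split(',')
        has_code = codes.apply(lambda cs: requirement in [c.strip() for c in cs])
        df = df[has_code]
    return df.reset_index(drop=True)

all_terms = combine_terms(fall, winter)
print(filter_courses(all_terms, semester='winter'))
print(filter_courses(all_terms, requirement='HU'))
```

The filtered frame could then be passed to either recommendation function in place of a single term's data.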

### Paper that helps: https://www.frontiersin.org/articles/10.3389/frai.2020.00042/full

In [1]:
import pandas as pd
import neattext.functions as nfx

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel, sigmoid_kernel

In [3]:
# Load the fall (f_21) and winter (w_22) datasets
f_21 = pd.read_csv('assets/f_21_merge.csv')
w_22 = pd.read_csv('assets/w_22_merge.csv')

## Cosine similarity

In [4]:
# df = input dataset
# course_title = input course, matched against the 'course' column (e.g. 'TURKISH 201')
# num_of_rec = number of recommendations to return
# filter_level = 'academic_group' or 'subject'

def make_recommendation_cos(df, course_title, num_of_rec = 10, filter_level = 'subject'):
    df = df.fillna('').drop_duplicates(subset=['course']).reset_index(drop=True)

    input_ag = df.loc[df['course'] == course_title, 'Acad Group'].unique()
    input_sub = df.loc[df['course'] == course_title, 'Subject'].unique()
    input_course = df.loc[df['course'] == course_title, 'Course Title'].unique()
    
    if filter_level == 'academic_group':
        df = df[df['Acad Group'].isin(input_ag)] 
    elif filter_level == 'subject':
        df = df[(df['Subject'].isin(input_sub)) | (df['Course Title'].isin(input_course))]
        
    df = df.reset_index(drop = True)
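    # Vectorize the descriptions into a bag-of-words count matrix
    count_vect = CountVectorizer()
    cv_mat = count_vect.fit_transform(df['description'])

    # Dense word-count table, kept for inspection only (not used below).
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    df_cv_words = pd.DataFrame(cv_mat.todense(), columns=count_vect.get_feature_names_out())

    # Pairwise cosine similarity between all course descriptions
    cosine_sim_mat = cosine_similarity(cv_mat)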

    # Map each course code to its row position in the filtered frame
    course_indices = pd.Series(df.index, index=df['course'])

    # Row position of the input course
    idx = course_indices[course_title]

    # (row position, similarity) pairs for the input course against every course
    scores = list(enumerate(cosine_sim_mat[idx]))

    # Sort from most to least similar
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry, which is expected to be the input course itself
    selected_course_indices = [i[0] for i in sorted_scores[1:]]
    selected_course_scores = [i[1] for i in sorted_scores[1:]]

    # Copy the selected rows so that adding a column does not touch df
    rec_df = df.iloc[selected_course_indices].copy()
    rec_df['similarity_scores'] = selected_course_scores

    return rec_df[:num_of_rec]

## Sigmoid kernel

In [5]:
def make_recommendation_sk(df, course_title, num_of_rec = 10, filter_level = 'subject'):
    df = df.fillna('').drop_duplicates(subset=['course']).reset_index(drop=True)

    input_ag = df.loc[df['course'] == course_title, 'Acad Group'].unique()
    input_sub = df.loc[df['course'] == course_title, 'Subject'].unique()
    input_course = df.loc[df['course'] == course_title, 'Course Title'].unique()
    
    if filter_level == 'academic_group':
        df = df[df['Acad Group'].isin(input_ag)] 
    elif filter_level == 'subject':
        df = df[(df['Subject'].isin(input_sub)) | (df['Course Title'].isin(input_course))]
    
    df = df.reset_index(drop = True)
    
    
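    # TF-IDF over single-word tokens; terms that appear in more than half
    # of the descriptions are ignored (max_df=0.5)
    tfidf = TfidfVectorizer(max_df=0.5, max_features=None,
                            strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                            ngram_range=(1, 1))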

    # Fitting the TF-IDF on the 'description' text
    tfidf_matrix = tfidf.fit_transform(df['description'])

    # Compute the sigmoid kernel
    sig = sigmoid_kernel(tfidf_matrix, tfidf_matrix)

    # Map each course code to its row position in the filtered frame
    indices = pd.Series(df.index, index=df['course'])

    # Row position of the input course
    idx = indices[course_title]

    # (row position, score) pairs for the input course against every course
    sig_scores = list(enumerate(sig[idx]))

    # Sort from most to least similar
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)

    # Keep the num_of_rec best matches, skipping the first entry
    # (expected to be the input course itself)
    sig_scores = sig_scores[1:num_of_rec+1]

    # Row positions of the recommended courses
    course_indices = [i[0] for i in sig_scores]

    # Copy the selected rows so that adding a column does not touch df
    rec_df = df.iloc[course_indices].copy()

    # Note: each entry is a (row position, score) tuple, so this column holds tuples
    rec_df['sig_scores'] = sig_scores

    return rec_df

In [6]:
make_recommendation_cos(f_21, 'TURKISH 201', 10, 'academic_group')

Unnamed: 0,Class Nbr,course,Term,Session,Acad Group,Subject,Course Title,description,Component,Time,...,Has WL,Units,sub_title,credits,requirements_distribution,consent,advisory_prerequisites,other_course_info,repeatability,similarity_scores
149,30046,ANTHRCUL 450,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts","Anthropology,Cultural (ANTHRCUL) Open Sections",Anthro of Insurgency,This course explores the interlinked categorie...,SEM,5-8PM,...,Y,3.00,,3,,,,,May not be repeated for credit.,0.770789
1161,32329,MELANG 440,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Middle East Languages (MELANG) Open Sections,Coptic I,Coptic was the language of early Christianity ...,REC,10-1130AM,...,Y,3.00,,3,,,,,May not be repeated for credit.,0.74786
1321,21177,POLISH 214,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Polish (POLISH) Open Sections,Rock Poetry,\n\n\nThis course provides an introduction to ...,LEC,230-4PM,...,Y,3.00,,3,HU,,,,May not be repeated for credit.,0.741101
1464,21179,REEES 214,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts","Russian, East European and Eurasian Studies (R...",Rock Poetry,\n\n\nThis course provides an introduction to ...,LEC,230-4PM,...,Y,3.00,,3,HU,,,,May not be repeated for credit.,0.741101
313,31729,BCS 350,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Bosnian/Croatian/Serbian (BCS) Open Sections,Holocaust Legacy,\n\nThe course’s primary focus is raising awar...,LEC,230-4PM,...,Y,3.00,,3,"HU, RE",,,,May not be repeated for credit.,0.739946
1026,31758,JUDAIC 350,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Judaic Studies (JUDAIC) Open Sections,Holocaust Legacy,\n\nThe course’s primary focus is raising awar...,LEC,230-4PM,...,Y,3.00,,3,"HU, RE",,,,May not be repeated for credit.,0.739946
1466,31759,REEES 350,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts","Russian, East European and Eurasian Studies (R...",Holocaust Legacy,\n\nThe course’s primary focus is raising awar...,LEC,230-4PM,...,Y,3.00,,3,"HU, RE",,,,May not be repeated for credit.,0.739946
440,22820,CLCIV 480,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Classical Civilization (CLCIV) Open Sections,Studying Antiquity,Vergil’s Aeneid is many things: a complex epic...,LEC,1-3PM,...,Y,1.00-2.00,- Vergil's Aeneid,2,,,,,May be repeated for a maximum of 6 credit(s). ...,0.736944
91,32540,AMCULT 348,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",American Culture (AMCULT) Open Sections,American Radicalism,This course offers a general history of radica...,LEC,4-530PM,...,,4.00,,4,SS,,,,May not be repeated for credit.,0.733903
929,32478,HISTORY 346,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",History (HISTORY) Open Sections,American Radicalism,This course offers a general history of radica...,DIS,12-1PM,...,Y,4.00,,4,SS,,,,May not be repeated for credit.,0.733903
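
Several rows in the output above look like the same class cross-listed under different subjects (for example POLISH 214 and REEES 214, both titled Rock Poetry), with identical similarity scores. If distinct classes are wanted, one option is to collapse rows that repeat both `Course Title` and `description` before truncating the list. The helper `drop_cross_listed` and the toy frame below are hypothetical.

```python
import pandas as pd

def drop_cross_listed(rec_df, num_of_rec=10):
    # Keep the first (highest-ranked) row of each title/description pair
    unique = rec_df.drop_duplicates(subset=['Course Title', 'description'], keep='first')
    return unique.head(num_of_rec)

# Toy ranked recommendations (made-up rows, already sorted by score)
ranked = pd.DataFrame({
    'course': ['AAA 214', 'BBB 214', 'CCC 350', 'DDD 101'],
    'Course Title': ['Rock Poetry', 'Rock Poetry', 'Holocaust Legacy', 'Elementary'],
    'description': ['text one', 'text one', 'text two', 'text three'],
    'similarity_scores': [0.74, 0.74, 0.73, 0.50],
})

deduped = drop_cross_listed(ranked, num_of_rec=3)
print(deduped[['course', 'Course Title', 'similarity_scores']])
```

Because `make_recommendation_cos` already truncates its result to `num_of_rec` rows, it would need to be called with a larger `num_of_rec` before a filter like this is applied.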


In [7]:
make_recommendation_sk(f_21, 'TURKISH 201', 10, 'academic_group')

Unnamed: 0,Class Nbr,course,Term,Session,Acad Group,Subject,Course Title,description,Component,Time,...,Has WL,Units,sub_title,credits,requirements_distribution,consent,advisory_prerequisites,other_course_info,repeatability,sig_scores
1714,23006,TURKISH 499,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Turkish Language (TURKISH) Open Sections,Ind Study in Turkish,An independent study course in the area of Tur...,IND,ARR,...,N,1.00-4.00,,1 - 4,,With permission of department.,,,May be elected three times for credit. May be...,"(1714, 0.7616087046279325)"
1712,32302,TURKISH 408,Fall 2021,First 7 Week Session,"Literature, Sci, and the Arts",Turkish Language (TURKISH) Open Sections,Ottoman Elements I,Ottoman Turkish is the common term for the Sou...,SEM,9-10AM,...,Y,1.00,,1,,,Second year proficiency in Turkish.,,May not be repeated for credit.,"(1712, 0.7616069393117031)"
1713,32305,TURKISH 409,Fall 2021,Second 7 Week Session,"Literature, Sci, and the Arts",Turkish Language (TURKISH) Open Sections,Ottoman Elements II,Ottoman Turkish is the common term for the Sou...,SEM,9-10AM,...,Y,1.00,,1,,,First year proficiency in Turkish and first ye...,,May not be repeated for credit.,"(1713, 0.7616056780203626)"
269,17964,ASIANLAN 411,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Asian Languages (ASIANLAN) Open Sections,Advanced Filipino I,"This course teaches advanced speaking, listeni...",REC,3-5PM,...,Y,3.00,,3,,With permission of instructor.,,,May be elected twice for credit.,"(269, 0.7616015575778774)"
1488,15838,RUSSIAN 103,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Russian (RUSSIAN) Open Sections,Int First Yr,This course is designed to introduce students ...,REC,12-1PM,...,,8.00,,8,,,,,May not be repeated for credit.,"(1488, 0.7616015574968404)"
866,11590,GREEKMOD 301,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Modern Greek (GREEKMOD) Open Sections,Inter Mod Greek I,This course continues the Modern Greek languag...,LEC,1-230PM,...,Y,3.00,,3,,,GREEKMOD 202.,,May not be repeated for credit.,"(866, 0.7616015034807913)"
260,14156,ASIANLAN 275,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Asian Languages (ASIANLAN) Open Sections,2nd Yr Vietnamese I,The course is aimed at improving students' spe...,REC,10-11AM,...,Y,4.00,,4,,With permission of instructor.,,,May not be repeated for credit.,"(260, 0.7616014068176152)"
758,18996,FRENCH 101,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",French (FRENCH) Open Sections,Elementary,The sequence of FRENCH 101/102 presents the es...,REC,3-4PM,...,Y,4.00,,4,,With permission of instructor.,Students with any prior study of French must t...,,May not be repeated for credit.,"(758, 0.7616013686371976)"
186,39240,ARABIC 101,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Arabic Language (ARABIC) Open Sections,Elementary Arabic I,This is the first of a two-semester course in ...,REC,1130-1PM,...,Y,5.00,,5,,With permission of instructor.,,,May not be repeated for credit.,"(186, 0.7616012135122696)"
241,14147,ASIANLAN 115,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Asian Languages (ASIANLAN) Open Sections,1st Yr Hindi I,ASIANLAN 115 is a beginner’s course designed f...,REC,9-10AM,...,Y,4.00,,4,,With permission of instructor.,,Students with prior knowledge of Hindi are enc...,May not be repeated for credit.,"(241, 0.7616010418507846)"
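
The scores above all sit close to 0.7616, which is tanh(1). This follows from scikit-learn's defaults: `sigmoid_kernel` computes `tanh(gamma * <x, y> + coef0)` with `gamma = 1 / n_features` and `coef0 = 1`, and `TfidfVectorizer` L2-normalizes each row by default, so the dot product lies in [0, 1] and `gamma` shrinks it further. Because tanh is monotonic, the ordering is the same as with `linear_kernel`, which on normalized TF-IDF rows is cosine similarity. The sketch below checks this on made-up descriptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, sigmoid_kernel

# Made-up descriptions for illustration only
docs = [
    'introductory turkish language course with speaking practice',
    'advanced turkish grammar and reading',
    'history of american political movements',
    'elementary french language course',
]

X = TfidfVectorizer().fit_transform(docs)   # rows are L2-normalized by default
n_features = X.shape[1]

lin = linear_kernel(X)      # plain dot products, i.e. cosine similarity here
sig = sigmoid_kernel(X)     # default gamma = 1 / n_features, coef0 = 1

# The sigmoid kernel is a monotonic squashing of the linear kernel
same_values = np.allclose(sig, np.tanh(lin / n_features + 1.0))

# Ranking for the first description is therefore unchanged
rank_lin = np.argsort(-lin[0], kind='stable')
rank_sig = np.argsort(-sig[0], kind='stable')

print(same_values, np.array_equal(rank_lin, rank_sig))
print(sig.min(), sig.max(), np.tanh(1.0))
```

If more spread-out scores are wanted, `linear_kernel` on the same TF-IDF matrix gives the same ranking with values in [0, 1].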
