## This notebook:
- Recommends courses based on their course descriptions

1. Topic modeling on course descriptions
    - Take a course code (e.g. AAS 103)
    - Take the description of this course
    - Process the text of the description
    - Find the topics of this block of text

2. Process the text of the descriptions of all other courses
    - Find the topics of those blocks of text

3. Match the topics of the input course against those of the other courses
    - Compute similarity scores
    - Rank these scores from high to low
    - Return the requested number of recommendations (num_of_rec), ordered by similarity

- Needs improvement
    - Text processing
    - Topic modeling (the recommendations are not yet very sensible because the text processing and topic modeling are still rough)
    - Efficiency of the algorithm (currently slow)
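
The three steps above can be sketched on a toy corpus. This is a minimal, library-free illustration under stated assumptions, not the notebook's implementation: the course codes and descriptions are invented, and `bag_of_words`, `cosine`, and `recommend` are hypothetical helper names. It compares raw word counts (the idea behind `CountVectorizer` plus cosine similarity) rather than fitting a topic model.

```python
import math
import re
from collections import Counter

# Invented toy data; the real descriptions come from the course CSV files.
courses = {
    "HIST 101": "history of the united states and its politics",
    "HIST 202": "politics and history of modern europe",
    "MATH 115": "calculus limits derivatives and integrals",
    "MATH 116": "calculus integrals sequences and series",
}

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(course, num_of_rec):
    """Rank every other course by similarity to `course`, highest first."""
    target = bag_of_words(courses[course])
    scores = [(other, cosine(target, bag_of_words(text)))
              for other, text in courses.items() if other != course]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:num_of_rec]

print(recommend("MATH 115", 2))
```

In this toy corpus the top match for `MATH 115` is `MATH 116`, since the two descriptions share the most words.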

### Helpful paper: https://www.frontiersin.org/articles/10.3389/frai.2020.00042/full

In [166]:
import pandas as pd
import neattext.functions as nfx

In [167]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel, sigmoid_kernel

In [168]:
# Load our dataset
online = pd.read_csv('assets/original/2021-10-19-MichiganOnline-courses.csv')
f_21 = pd.read_csv('assets/f_21_merge.csv')
w_22 = pd.read_csv('assets/w_22_merge.csv')

## Cosine similarity

In [169]:
def make_recommendation_cos(df, course_title, num_of_rec):
    
    df = df.fillna('').drop_duplicates(subset=['course']).reset_index(drop=True)
    
    # Clean Text
    df['description'] = df['description'].apply(nfx.remove_stopwords)
    df['description'] = df['description'].apply(nfx.remove_special_characters)
    
    # Vectorize our Text
    count_vect = CountVectorizer()
    cv_mat = count_vect.fit_transform(df['description'])
    
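    # Word counts per description (not used below).
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    df_cv_words = pd.DataFrame(cv_mat.todense(), columns=count_vect.get_feature_names_out())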
    
    # Cosine Similarity Matrix
    cosine_sim_mat = cosine_similarity(cv_mat)
    
    # Map each course code to its row position
    course_indices = pd.Series(df.index, index=df['course'])

    # Row position of the requested course
    idx = course_indices[course_title]

    # Pair every row position with its similarity to the requested course
    scores = list(enumerate(cosine_sim_mat[idx]))

    # Sort by similarity, highest first
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
    
    # Skip the first entry, which is expected to be the requested course itself
    selected_course_indices = [i[0] for i in sorted_scores[1:]]
    selected_course_scores = [i[1] for i in sorted_scores[1:]]

    # Copy the selected rows so that adding a column does not touch df
    rec_df = df.iloc[selected_course_indices].copy()

    rec_df['similarity_scores'] = selected_course_scores
    
    return rec_df.head(num_of_rec)
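
`make_recommendation_cos` builds the full course-by-course similarity matrix even though only one row of it is used. One possible way to reduce the work, sketched here on invented data, is to compare just the query row against all rows. `top_n_similar` is a hypothetical helper, not part of the notebook, and the `neattext` cleaning step is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_n_similar(df, course_title, num_of_rec):
    """Return the num_of_rec courses most similar to course_title."""
    df = df.fillna('').drop_duplicates(subset=['course']).reset_index(drop=True)
    mat = CountVectorizer().fit_transform(df['description'])

    # Row position of the query course after the index reset
    idx = df.index[df['course'] == course_title][0]

    # One row against all rows: a 1 x N result instead of an N x N matrix
    scores = cosine_similarity(mat[idx:idx + 1], mat).ravel()

    # Order positions by descending score and drop the query course itself
    order = [i for i in np.argsort(-scores) if i != idx][:num_of_rec]
    return df.iloc[order].assign(similarity_scores=scores[order])

# Invented toy data standing in for the course DataFrame
toy = pd.DataFrame({
    'course': ['HIST 101', 'HIST 202', 'MATH 115', 'MATH 116'],
    'description': [
        'history of the united states and its politics',
        'politics and history of modern europe',
        'calculus limits derivatives and integrals',
        'calculus integrals sequences and series',
    ],
})
print(top_n_similar(toy, 'MATH 115', 2)[['course', 'similarity_scores']])
```

The vectorizer is still refit on every call; caching the fitted matrix between calls would be a further saving, at the cost of keeping it in memory.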

## Sigmoid kernel

In [170]:
def make_recommendation_sk(df, course, num_of_rec):
    
    tfidf = TfidfVectorizer(min_df=3, max_features=None, 
                strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                ngram_range=(1, 3),
                stop_words='english')

    df = df.fillna('').drop_duplicates(subset=['course']).reset_index(drop=True)

    # Fitting the TF-IDF on the 'description' text
    tfidf_matrix = tfidf.fit_transform(df['description'])

    # Compute the sigmoid kernel
    sig = sigmoid_kernel(tfidf_matrix, tfidf_matrix)

    # Reverse mapping of indices and course titles
    indices = pd.Series(df.index, index=df['course']).drop_duplicates()

    # Get the index corresponding to course title
    idx = indices[course]

    # Get the pairwise similarity scores
    sig_scores = list(enumerate(sig[idx]))

    # Sort the courses
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)

    # Scores of the n most similar courses
    sig_scores = sig_scores[1:num_of_rec+1]

    # Take the indices
    course_indices = [i[0] for i in sig_scores]

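    # Rows for the n most similar courses (copied so that adding a column does not touch df)
    rec_df = df.iloc[course_indices].copy()

    # Each entry of sig_scores is a (row position, score) tuple, so this column holds tuples;
    # use [score for _, score in sig_scores] to store only the scores
    rec_df['sig_scores'] = sig_scores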

    return rec_df
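
scikit-learn's `sigmoid_kernel` computes `tanh(gamma * <x, y> + coef0)`, with defaults `gamma = 1 / n_features` and `coef0 = 1`. TF-IDF rows have unit length by default, so each dot product is at most 1; with a large vocabulary, every score therefore sits just above tanh(1) ≈ 0.7616. Because tanh is increasing, the ranking should match that of the plain dot product (`linear_kernel`). The sketch below checks both points on invented descriptions; it is an illustration only, not a recommendation for how to tune the kernel.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, sigmoid_kernel

docs = [  # invented toy descriptions
    'calculus limits derivatives and integrals',
    'calculus integrals sequences and series',
    'history of the united states and its politics',
    'politics and history of modern europe',
]
X = TfidfVectorizer().fit_transform(docs)

sig = sigmoid_kernel(X, X)   # defaults: gamma = 1 / n_features, coef0 = 1
lin = linear_kernel(X, X)    # plain dot products (cosine for unit-length rows)

# The sigmoid kernel is tanh applied elementwise to a rescaled dot product
gamma = 1.0 / X.shape[1]
assert np.allclose(sig, np.tanh(gamma * lin + 1.0))

# tanh is increasing, so both kernels order the neighbours of document 0 identically
print(np.argsort(-sig[0]), np.argsort(-lin[0]))

# Dot products of TF-IDF rows are non-negative, so no score falls below tanh(1)
print(sig.min(), np.tanh(1.0))
```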

In [171]:
make_recommendation_cos(f_21, 'AAS 103', 10)

Unnamed: 0,Class Nbr,course,Term,Session,Acad Group,Subject,Course Title,description,Component,Time,...,Has WL,Units,sub_title,credits,requirements_distribution,consent,advisory_prerequisites,other_course_info,repeatability,similarity_scores
2895,37153,MIDEAST 490,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Middle East Studies (MIDEAST) Open Sections,Topics in MES,course aims focus analyzing popular culture me...,SEM,1-230PM,...,Y,3.0,- Contemporary Culture and Media in Turkey,3,,With permission of instructor.,,,May be elected twice for credit. May be elect...,0.346603
3705,11707,PORTUG 101,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Portuguese (PORTUG) Open Sections,Elementary,PORT101 highly communicative course students s...,REC,10-11AM,...,Y,4.0,,4,,With permission of instructor.,,,May not be repeated for credit.,0.295663
1789,20075,ENVIRON 405,Fall 2021,Regular Academic Session,Environment and Sustainability,Environment (ENVIRON) Open Sections,Urban Sprawl,course investigates political imperatives poli...,LEC,10-1130AM,...,Y,3.0,,3,,,ENVIRON 350 or 370.,,May not be repeated for credit.,0.291619
234,10647,ANTHRCUL 101,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts","Anthropology,Cultural (ANTHRCUL) Open Sections",Intro to Anthro,course introduces students anthropology subdis...,DIS,4-5PM,...,N,4.0,,4,"SS, RE",,,Does not count toward requirements for the Ant...,May not be repeated for credit.,0.282542
2153,11331,HISTORY 260,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",History (HISTORY) Open Sections,U S to 1865,course analyzes history United States 1865 foc...,DIS,1-2PM,...,Y,4.0,,4,SS,,,,May not be repeated for credit.,0.275862
419,14147,ASIANLAN 115,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Asian Languages (ASIANLAN) Open Sections,1st Yr Hindi I,ASIANLAN 115 beginners course designed student...,REC,9-10AM,...,Y,4.0,,4,,With permission of instructor.,,Students with prior knowledge of Hindi are enc...,May not be repeated for credit.,0.268793
429,24765,ASIANLAN 185,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Asian Languages (ASIANLAN) Open Sections,First Year Bengali I,ASIANLAN 185 beginners course designed student...,REC,9-10AM,...,Y,4.0,,4,,With permission of instructor.,,,May not be repeated for credit.,0.268793
424,27678,ASIANLAN 145,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Asian Languages (ASIANLAN) Open Sections,1st Yr Punjabi I,ASIANLAN 145 beginners course designed student...,REC,11-12PM,...,Y,4.0,,4,,With permission of instructor.,,,May not be repeated for credit.,0.267064
442,15824,ASIANLAN 305,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Asian Languages (ASIANLAN) Open Sections,Inter Spoken Chn I,course designed spoken supplement postsecondye...,REC,11-12PM,...,Y,2.0,,2,,With permission of instructor.,,,May be elected twice for credit.,0.266815
3711,23618,PPE 400,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts","Philosophy, Politics and Economics (PPE) Open ...",Political Economy,capstone seminar nonHonors seniors Philosophy ...,SEM,1-230PM,...,Y,3.0,,3,ULWR,,Completion of distribution requirements for PP...,,May not be repeated for credit.,0.261742


In [172]:
make_recommendation_sk(f_21, 'AAS 103', 10)

Unnamed: 0,Class Nbr,course,Term,Session,Acad Group,Subject,Course Title,description,Component,Time,...,Has WL,Units,sub_title,credits,requirements_distribution,consent,advisory_prerequisites,other_course_info,repeatability,sig_scores
2895,37153,MIDEAST 490,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Middle East Studies (MIDEAST) Open Sections,Topics in MES,This course aims to focus on analyzing popular...,SEM,1-230PM,...,Y,3.0,- Contemporary Culture and Media in Turkey,3,,With permission of instructor.,,,May be elected twice for credit. May be elect...,"(2895, 0.7615990979612387)"
4674,10061,URP 423,Fall 2021,Regular Academic Session,Architecture & Urban Planning,Urban and Regional Planning (URP) Open Sections,Int U P&Env,Introduction to Urban and Environmental Planning,DIS,12-1PM,...,Y,3.0,,3,,,,,May not be repeated for credit.,"(4674, 0.7615986964810271)"
29,33106,AAS 495,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Senior Seminar,\nThis course explores cities in contemporary ...,SEM,4-530PM,...,Y,4.0,- Contemporary Africa and the World,4,ULWR,,Upperclass standing.,(Cross-Area Courses).,May be repeated for a maximum of 8 credit(s).,"(29, 0.7615981538535184)"
3681,29772,POLSCI 436,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Political Science (POLSCI) Open Sections,ResMidEastPolitics,This course will examine politics and society ...,SEM,2-4PM,...,N,3.0,,3,,With permission of instructor.,,,May not be repeated for credit.,"(3681, 0.7615977222757488)"
1789,20075,ENVIRON 405,Fall 2021,Regular Academic Session,Environment and Sustainability,Environment (ENVIRON) Open Sections,Urban Sprawl,This course investigates the political imperat...,LEC,10-1130AM,...,Y,3.0,,3,,,ENVIRON 350 or 370.,,May not be repeated for credit.,"(1789, 0.7615976026816589)"
10,36353,AAS 275,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Blk Women Pop Cult,Popular culture is an important site for creat...,SEM,1-230PM,...,Y,3.0,,3,ID,,,,May not be repeated for credit.,"(10, 0.761597517013301)"
3705,11707,PORTUG 101,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Portuguese (PORTUG) Open Sections,Elementary,PORT101 is a highly communicative course for s...,REC,10-11AM,...,Y,4.0,,4,,With permission of instructor.,,,May not be repeated for credit.,"(3705, 0.7615975007962578)"
380,32373,ASIAN 257,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Asian Studies (ASIAN) Open Sections,Great Cities in Asia,This course serves as an introduction to the h...,DIS,9-10AM,...,Y,4.0,,4,HU,,,,May be elected twice for credit.,"(380, 0.7615974415269534)"
234,10647,ANTHRCUL 101,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts","Anthropology,Cultural (ANTHRCUL) Open Sections",Intro to Anthro,This course introduces students to anthropolog...,DIS,4-5PM,...,N,4.0,,4,"SS, RE",,,Does not count toward requirements for the Ant...,May not be repeated for credit.,"(234, 0.7615972520876618)"
143,32141,ALA 421,Fall 2021,Second 7 Week Session,"Literature, Sci, and the Arts",Applied Liberal Arts (ALA) Open Sections,Creat Inclusive Comm,Residence Staff at Michigan work with resident...,SEM,ARR,...,N,1.0,,1,,With permission of instructor.,,,May be repeated for a maximum of 4 credit(s).,"(143, 0.7615971756555602)"
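
Each `sig_scores` entry in the output above is an `(index, score)` tuple, because `make_recommendation_sk` assigns the enumerated pairs directly. A small sketch of reducing such a column to a plain numeric score follows, using the first three tuples from that output; the DataFrame here is a stand-in, not the notebook's `rec_df`.

```python
import pandas as pd

# First three (row index, score) pairs copied from the output above
rec_df = pd.DataFrame({
    'course': ['MIDEAST 490', 'URP 423', 'AAS 495'],
    'sig_scores': [
        (2895, 0.7615990979612387),
        (4674, 0.7615986964810271),
        (29, 0.7615981538535184),
    ],
})

# Keep only the score so the column is numeric and sortable
rec_df['sig_scores'] = [score for _, score in rec_df['sig_scores']]
print(rec_df.dtypes['sig_scores'])
```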
