## This notebook
- Recommends courses based on their course descriptions

1. Topic modeling on the input course's description
    - Take a course title (e.g. AAS 103)
    - Take the course description of this course
    - Process the text of the description
    - Find the topics of this block of text
    
2. Repeat for all other courses
    - Process the text of the descriptions of all other courses
    - Find the topics of those blocks of text
    
3. Match and rank
    - Match the topics of the input course against those of the other courses
    - Compute similarity scores
    - Rank these scores from high to low
    - Return the requested number of recommendations (num_of_rec), ordered by similarity
    
- Needs improvement

    - Text processing
    - Topic modeling (the recommendations are not quite logical yet because the text processing and topic modeling still need work)
    - Efficiency of the algorithm (currently slow)
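
The three steps listed above can be sketched on a toy catalog. This is a minimal illustration, not the notebook's final function: the four courses and their descriptions are made up, `recommend` is a hypothetical helper, and plain word counts stand in for the topic-modeling step.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalog (hypothetical courses and descriptions, for illustration only)
toy_df = pd.DataFrame({
    "course": ["HIST 101", "HIST 202", "MATH 115", "MATH 116"],
    "text": [
        "history of modern europe and the world wars",
        "history of europe in the middle ages",
        "calculus limits derivatives and integrals",
        "calculus integrals series and differential equations",
    ],
})


def recommend(course, num_of_rec=2):
    """Return the num_of_rec courses most similar to `course`."""
    # Steps 1-2: turn every description (input course included) into a count vector
    matrix = CountVectorizer().fit_transform(toy_df["text"])

    # Step 3: similarity of every course to the input course
    idx = toy_df.index[toy_df["course"] == course][0]
    scores = cosine_similarity(matrix)[idx]

    # Drop the input course itself, then rank from high to low
    ranked = toy_df.assign(similarity_scores=scores).drop(index=idx)
    return ranked.sort_values("similarity_scores", ascending=False).head(num_of_rec)


print(recommend("MATH 115", num_of_rec=1))
```

On this toy data the closest match to MATH 115 is MATH 116, because the two descriptions share the most words.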

## Update
1. Because the courses already fall into distinct clusters, such as academic groups (LSI, engineering, dentistry, ...) and subjects (Afroamerican sections, etc.), it makes more sense to recommend courses from the same academic group and subject.


2. Somehow, processing the text (stopword removal, lemmatization, etc.) produces poorer recommendations. The results look much better without the language processing. 


3. About topic modeling: I'm not sure how we could use it. There are only about 20 academic groups, so clustering the courses with topic modeling is not going to work very well unless we use a large number of clusters, such as 200 - 500. We could try topic modeling at the subject level, but the count and TF-IDF vectorizers (CountVectorizer, TfidfVectorizer) already work pretty well, so I'm not sure that would be necessary. 
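
Points 2 and 3 compare vectorizer settings by eye. One way to make that comparison repeatable is to run the same query through several settings side by side, as sketched below. The three texts are hypothetical and `similarity_to_first` is an illustrative helper. The numbers on this toy data say nothing about the real catalog; the point is only the comparison harness.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical course descriptions, for illustration only
texts = [
    "introduction to the history of the american south",
    "the history of the american west",
    "introduction to the theory of computation",
]

# Settings to compare: raw counts, counts without English stop words, TF-IDF
vectorizers = {
    "count": CountVectorizer(),
    "count_no_stopwords": CountVectorizer(stop_words="english"),
    "tfidf": TfidfVectorizer(),
}


def similarity_to_first(vectorizer):
    """Cosine similarity of the first text to each of the other texts."""
    matrix = vectorizer.fit_transform(texts)
    return cosine_similarity(matrix)[0, 1:]


comparison = {
    name: similarity_to_first(vec).round(3).tolist()
    for name, vec in vectorizers.items()
}

for name, scores in comparison.items():
    print(name, scores)
```

Swapping `texts` for a sample of real course text and the query index for a known course would show whether a given preprocessing step helps or hurts for that course.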

-------------
- Concatenate fall and winter -> add a "semester" parameter (fall, winter, or either)
- Add an LSA requirement parameter

- Maybe we also want to add a filter level of None (no filtering)
- I think we should add an option to recommend a Michigan Online course based on the content of the entered course

### Paper that helps: https://www.frontiersin.org/articles/10.3389/frai.2020.00042/full

In [1]:
import pandas as pd
import neattext.functions as nfx

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel, sigmoid_kernel

In [411]:
# # Load our dataset
# f_21 = pd.read_csv('assets/f_21_merge.csv')
# w_22 = pd.read_csv('assets/w_22_merge.csv')

# df = pd.concat([f_21, w_22])

# df.columns

# df = df.fillna('')
# df['requirements_distribution'] = df['requirements_distribution'] + ', ' + df['other']
# df['requirements_distribution'] = [x.split(', ') for x in df['requirements_distribution']]

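# Join the fields with spaces so the last word of one field does not fuse with the first word of the next
# df['text'] = df['Course Title'] + ' ' + df['sub_title'] + ' ' + df['description'] + ' ' + df['Acad Group'] + ' ' + df['Subject']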

# df.drop_duplicates(subset='course', inplace=True)

# df.to_csv('assets/fw.csv', index=False)
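
The cell above stores `requirements_distribution` as Python lists and then writes the frame to CSV. When the file is read back with `pd.read_csv`, each list comes back as a string, so a later membership test such as `lsa in x` becomes a substring check. The sketch below shows the effect and one possible fix with `ast.literal_eval`. The two-row frame is made up, and an in-memory buffer stands in for `assets/fw.csv`.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row frame with a list-valued column
demo = pd.DataFrame({
    "course": ["HISTORY 346", "HISTORY 312"],
    "requirements_distribution": [["SS", ""], ["ID", ""]],
})

# Round-trip through CSV (in memory here, instead of a file on disk)
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After the round trip the cell holds the text "['SS', '']", not a list
raw_value = loaded.loc[0, "requirements_distribution"]
print(type(raw_value).__name__)  # str

# Parse the text back into real lists
parsed = loaded["requirements_distribution"].apply(ast.literal_eval)

# Substring check on the raw string vs. membership check on the parsed list
print("S" in raw_value)   # True: "S" is a substring of "['SS', '']"
print("S" in parsed[0])   # False: the list holds "SS" and "", not "S"
```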

## Cosine similarity

#### Parameters:

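* df = input dataset (only `make_recommendation_sk` takes it as an argument; `make_recommendation_cos` reads `assets/fw.csv` itself)
* course_title = input course, e.g. 'HISTORY 371'
* num_of_rec = number of recommendations to return (1 to 10 in `make_recommendation_cos`)
* filter_level = 'academic_group', 'subject', or 'no_filter' (any other value leaves the data unfiltered)
* semester = term to search: a value of the `Term` column such as 'Fall 2021' in `make_recommendation_cos`; 'fall' or 'winter' in `make_recommendation_sk`
* lsa = LSA requirement distribution to filter by, e.g. 'HU'; None applies no filter (accepted but currently unused by `make_recommendation_sk`)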

### Valid options for drop-down selections

In [372]:
# Course title: free text of similar format -- still thinking about whether to make it free text or only valid options

# Num of rec: 1 <= n <= 10, otherwise: invalid

# filter_level
filter_level_options = ['academic_group', 'subject']

# semester
semester_options = df['Term'].unique().tolist()

# lsa: unique requirement codes across all courses (assumes each cell holds a list)
lsa_list = {x for reqs in df['requirements_distribution'].dropna() for x in reqs}
lsa_options = sorted(lsa_list)

In [421]:
def make_recommendation_cos(course_title, num_of_rec=10, filter_level='subject', semester='Fall 2021', lsa=None):
    
    if int(num_of_rec) > 10 or int(num_of_rec) < 1:
        print('Please enter the desired number of recommendations, between 1 and 10 (inclusive).')
        
    else:
        # Read csv
        df = pd.read_csv('assets/fw.csv')

        # Specify valid courses
        valid_courses = df['course'].unique().tolist()

        if course_title not in valid_courses:
            print(f'Please enter a valid course choice. Course {course_title} is not in our list.')

        else:
            # Specify semester
            df = df[df['Term'] == semester]

            # Input 
            input_ag = df.loc[df['course'] == course_title, 'Acad Group']
            input_sub = df.loc[df['course'] == course_title, 'Subject']
            input_course = df.loc[df['course'] == course_title, 'Course Title']

            # Filter the df
            if filter_level == 'academic_group':
                df = df[df['Acad Group'].isin(input_ag)] 
            elif filter_level == 'subject':
                df = df[(df['Subject'].isin(input_sub)) | (df['Course Title'].isin(input_course))]
            
            if len(df) == 0:
                # No rows left: the course is not offered in this semester, or the semester has no courses
                print('Sorry, there is no match. Please try again with a different course or a different semester.')
                
            else:
                # Reset index
                # (reassign instead of inplace=True, which would modify a filtered slice)
                df = df.reset_index(drop=True)

                # Vectorize our Text
                count_vect = CountVectorizer()
                cv_mat = count_vect.fit_transform(df['text'])

                # Cosine Similarity Matrix
                cosine_sim_mat = cosine_similarity(cv_mat)

                # Get Course ID/Index
                course_indices = pd.Series(df.index, index=df['course'])

                # ID for title
                idx = course_indices[course_title]

                # Course Indice
                # Search inside cosine_sim_mat
                scores = list(enumerate(cosine_sim_mat[idx]))

                # Scores
                # Sort Scores
                sorted_scores = sorted(scores, key=lambda x:x[1], reverse=True)

                # Recommender: drop the input course by index rather than by rank,
                # so a tie at the top cannot leave it in the results
                selected_course_indices = [i[0] for i in sorted_scores if i[0] != idx]
                selected_course_scores = [i[1] for i in sorted_scores if i[0] != idx]

                rec_df = df.iloc[selected_course_indices].copy()

                rec_df['similarity_scores'] = selected_course_scores

                # Filter by LSA requirement distribution
                # (the column is read from CSV as text, so `lsa in x` is a substring match)
                if lsa is not None:
                    rec_df = rec_df[rec_df['requirements_distribution'].map(lambda x: lsa in x)]

                # If the query returns no results, print an error message; otherwise, return the df restricted to these columns
                
                cols_to_filter = ['course', 'Term', 'Acad Group', 'Subject', 'Course Title', 'description', 'credits', 'requirements_distribution', 'similarity_scores']
                
                if len(rec_df) == 0:
                    print('Sorry, there is no match. Please try again with a different course or choose a different LSA requirement distribution.')
                
                else:
                    # Slicing past the end is safe, so this also covers fewer than num_of_rec matches
                    return rec_df[:int(num_of_rec)][cols_to_filter]

In [448]:
make_recommendation_cos('HISTORY 371', num_of_rec=5)

Unnamed: 0,course,Term,Acad Group,Subject,Course Title,description,credits,requirements_distribution,similarity_scores
0,AMCULT 371,Fall 2021,"Literature, Sci, and the Arts",American Culture (AMCULT) Open Sections,Sex & Gender US Hist,Beginning in seventeenth-century British Ameri...,3,"['HU', 'RE', '']",0.99055
73,WGS 371,Fall 2021,"Literature, Sci, and the Arts",Women's and Gender Studies (WGS) Open Sections,Sex & Gender US Hist,Beginning in seventeenth-century British Ameri...,3,"['HU', 'RE', '']",0.988558
24,HISTORY 346,Fall 2021,"Literature, Sci, and the Arts",History (HISTORY) Open Sections,American Radicalism,This course offers a general history of radica...,4,"['SS', '']",0.742455
45,HISTORY 450,Fall 2021,"Literature, Sci, and the Arts",History (HISTORY) Open Sections,Japan to 1700,"What lies behind the image of “Cool Japan,” re...",3,"['', '']",0.736586
19,HISTORY 312,Fall 2021,"Literature, Sci, and the Arts",History (HISTORY) Open Sections,European Integration,The construction of the European Union has bee...,4,"['ID', '']",0.733807


In [447]:
df['course'].sample(3)

7329     POLSCI 630
2173    HISTORY 371
4298    STDABRD 340
Name: course, dtype: object

In [None]:
'DATASCI 606'

## Sigmoid kernel

In [17]:
def make_recommendation_sk(df, course_title, num_of_rec = 10, filter_level = 'academic_group', semester = 'fall', lsa = None):
    # Specify semester
    df = df[df['semester'] == semester]
    
    # Clean df
    df = df.fillna('').drop_duplicates(subset=['course']).reset_index(drop=True)
    
    # Input 
    input_ag = df.loc[df['course'] == course_title, 'Acad Group'].unique()
    input_sub = df.loc[df['course'] == course_title, 'Subject'].unique()
    input_course = df.loc[df['course'] == course_title, 'Course Title'].unique()
    
    # Filter the df
    if filter_level == 'academic_group':
        df = df[df['Acad Group'].isin(input_ag)] 
    elif filter_level == 'subject':
        df = df[(df['Subject'].isin(input_sub)) | (df['Course Title'].isin(input_course))]

    # Reset index first, so the new column below is added to a fresh frame rather than a filtered slice
    df = df.reset_index(drop = True)

    # Merge all the text information
    df['text'] = df['Acad Group'] + ' ' + df['Subject'] + ' ' + df['Course Title'] + ' ' + df['sub_title'] + ' ' + df['description']
    
    # Vectorize our Text
    tfidf = TfidfVectorizer(max_df = 0.5, max_features=None, 
                strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                ngram_range=(1,1))

    # Fit the TF-IDF on the merged 'text' column
    tfidf_matrix = tfidf.fit_transform(df['text'])

    # Compute the sigmoid kernel
    sig = sigmoid_kernel(tfidf_matrix, tfidf_matrix)

    # Reverse mapping of indices and course titles
    indices = pd.Series(df.index, index=df['course']).drop_duplicates()

    # Get the index corresponding to course title
    idx = indices[course_title]

    # Get the pairwise similarity scores 
    sig_scores = list(enumerate(sig[idx]))

    # Sort the courses
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)

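    # Scores of the n most similar courses (position 0 is assumed to be the input course itself)
    sig_scores = sig_scores[1:num_of_rec+1]

    # Take the indices
    course_indices = [i[0] for i in sig_scores]

    # The num_of_rec most similar courses
    rec_df = df.iloc[course_indices].copy()
    
    # Keep only the score from each (index, score) pair
    rec_df['sig_scores'] = [i[1] for i in sig_scores]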

    return rec_df
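
It may help to see how the sigmoid kernel relates to the cosine similarity used earlier. scikit-learn's `sigmoid_kernel` computes `tanh(gamma * <x, y> + coef0)`, with defaults `gamma = 1 / n_features` and `coef0 = 1`. `TfidfVectorizer` L2-normalizes each row by default, so the dot product of two rows already equals their cosine similarity. Because `tanh` is monotonic, the sigmoid kernel should therefore rank courses in the same order as cosine similarity on the same TF-IDF matrix, only on a compressed scale. The sketch below checks this on made-up texts and default vectorizer settings, not the notebook's data or its `max_df` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel, sigmoid_kernel

# Hypothetical course texts, for illustration only
texts = [
    "principles of economics supply and demand",
    "intermediate microeconomics supply demand and markets",
    "introduction to organic chemistry",
    "economics of global markets and trade",
]

tfidf_matrix = TfidfVectorizer().fit_transform(texts)

lin = linear_kernel(tfidf_matrix, tfidf_matrix)   # plain dot products
cos = cosine_similarity(tfidf_matrix)             # cosine similarity
sig = sigmoid_kernel(tfidf_matrix, tfidf_matrix)  # tanh(gamma * dot + coef0)

# Rows are unit length, so dot product and cosine similarity coincide
print(np.allclose(lin, cos))

# Reproduce the sigmoid kernel from the dot products with the default parameters
gamma = 1.0 / tfidf_matrix.shape[1]
print(np.allclose(sig, np.tanh(gamma * lin + 1.0)))

# All sigmoid scores fall between tanh(1) and tanh(1 + gamma)
print(sig.min(), sig.max())
```

If the two rankings agree on the real data as well, the choice between the two functions comes down to how readable the scores are rather than which courses get recommended.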

In [45]:
make_recommendation_sk(f_21, 'ECON 208', 10)

NameError: name 'make_recommendation_sk' is not defined