## This notebook:
- Recommends a course based on free text that the user enters. This is closer to an information retrieval system (a search engine) than to a classic recommender, because the input is free text, as in a Google search.
1. Process the user's input
    - The user enters a block of text (free text, no preset options)
    - This block of text describes their interests
    - Process this block of text
    - Find the topics of this block of text

2. Process the course descriptions
    - Process the text of the course descriptions
    - Find the topics of those blocks of text

3. Match and rank
    - Match the topics of the user's input text against all the courses
    - Compute similarity scores
    - Rank these scores from high to low
    - Return the top n recommendations in order of similarity

- Needs improvement

    - Text processing
    - Topic modeling (the recommendations are not yet sensible because the text processing and topic modeling are still rough)
    - The test below uses fuzzywuzzy, a very simple text matching library. It doesn't work here because it compares only the characters of the text, not its meaning.
    - We need to measure the similarity of documents based on their semantics, e.g. with word embeddings
    - Look at Hellinger distance: https://radimrehurek.com/gensim_3.8.3/auto_examples/tutorials/run_distance_metrics.html
    - Look at NLP using deep learning in Python: https://towardsdatascience.com/deep-learning-for-semantic-text-matching-d4df6c2cf4c5 (seems useful: detailed notebook and YouTube tutorial)
    - Look at NLP using RNNs (word2vec): https://towardsdatascience.com/text-matching-with-deep-learning-e6aa05333399
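
The matching step in the plan above needs a way to compare the topics of the user's text with the topics of each course. Hellinger distance, mentioned in the list, is one option when topics are expressed as probability distributions. The following is a minimal sketch only: the function name `hellinger`, the three-topic distributions, and the course labels are all made up for illustration and do not come from a fitted topic model.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Hypothetical topic distributions over 3 topics (each sums to 1).
user_topics = [0.7, 0.2, 0.1]
course_topics = {
    "COURSE A": [0.6, 0.3, 0.1],
    "COURSE B": [0.1, 0.1, 0.8],
}

# A smaller distance means more similar, so sort in ascending order.
ranked = sorted(course_topics, key=lambda c: hellinger(user_topics, course_topics[c]))
print(ranked)
```

The distance is 0 for identical distributions and 1 for distributions with no overlap, so unlike a similarity score it is ranked from low to high.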

In [4]:
import pandas as pd
from fuzzywuzzy import fuzz, process



In [5]:
online = pd.read_csv('assets/original/2021-10-19-MichiganOnline-courses.csv')
f_21 = pd.read_csv('assets/f_21_merge.csv')
w_22 = pd.read_csv('assets/w_22_merge.csv')

In [82]:
user_text = 'korean history'

In [76]:
def get_score(user_text, df):
    """Score each course description against user_text with fuzz.ratio (0-100)."""

    # Clean a copy so the caller's DataFrame is left untouched.
    df = df.drop_duplicates(subset=['course'])
    df = df.dropna(subset=['description'])
    df = df.fillna('')

    des_list = []
    course_title_list = []
    score_list = []
    course_list = []

    for index, row in df.iterrows():

        des = row['description']
        course_title = row['Course Title']
        course = row['course']
        # fuzz.ratio compares the raw character sequences of the two strings.
        score = fuzz.ratio(user_text, des)

        des_list.append(des)
        course_title_list.append(course_title)
        score_list.append(score)
        course_list.append(course)

    score_df = pd.DataFrame({'user_text': user_text, 'course_des': des_list, 'course': course_list,
                             'course_title': course_title_list, 'score': score_list})

    # Keep only matches with a score of at least 20.
    score_df = score_df[score_df['score'] >= 20]

    return score_df.sort_values(by='score', ascending=False)

In [83]:
get_score(user_text, f_21)

| | user_text | course_des | course | course_title | score |
|---|---|---|---|---|---|
| 1298 | korean history | Independent study. | MUSICOL 481 | Special Projects | 38 |
| 1797 | korean history | Course leads to THEORY 236. | THEORY 135 | Intr Mus Thry | 34 |
| 503 | korean history | Applied Statistics II | DATASCI 501 | Applied Stat II | 29 |
| 627 | korean history | Topics of current interest selected by the fac... | EECS 198 | Special Topics | 28 |
| 1807 | korean history | Individual work and reading for graduate stude... | THEORY 570 | Directed Indiv Study | 25 |
| 439 | korean history | Regular reports and conferences required. | CLARCH 499 | Supervised Reading | 25 |
| 1803 | korean history | Special topics that vary from term to term. | THEORY 407 | Directed Indiv Stdy | 25 |
| 945 | korean history | Theories of Pictorial Autonomy: Writing About ... | HISTART 402 | Cont Interp in A H | 24 |
| 1202 | korean history | Selected topics pertinent to mechanical engine... | MECHENG 499 | Spec Topics in M E | 24 |
| 1830 | korean history | Individual work and reading for undergraduate ... | THTREMUS 400 | Directed Reading | 24 |
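
The top matches above are all short, generic descriptions, none of which is about Korean history. A likely reason is that a character-level ratio is scaled by the combined length of the two strings, so a 14-character query can never score well against a long description, even a relevant one. The sketch below illustrates this with `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` (an assumption: the two may not return identical numbers). The helper name `char_ratio` and the long description are invented for this example.

```python
from difflib import SequenceMatcher

def char_ratio(a, b):
    """Character-level similarity on a 0-100 scale (a stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

query = "korean history"
short_irrelevant = "Independent study."
# Hypothetical long description that is clearly on topic.
long_relevant = (
    "This course surveys korean history from the earliest kingdoms to the present day. "
    "Students read primary sources, compare competing interpretations of major events, "
    "and write short essays on politics, society, religion, and culture in each period. "
    "No prior knowledge of the region or its languages is assumed for this class."
)

# The ratio is 2 * matches / (len(a) + len(b)), so a long description caps the score.
print(char_ratio(query, short_irrelevant))
print(char_ratio(query, long_relevant))
```

The short, unrelated description scores higher than the long, relevant one, which matches the pattern in the table above and is why a semantic or at least token-based measure is needed.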


In [1]:
import pandas as pd
import numpy as np
import sklearn

In [6]:
df = f_21
df.drop_duplicates(subset=['course'], inplace=True)
df.dropna(subset=['description'], inplace=True)
df.fillna('', inplace=True)

In [9]:
df.head()

| | Class Nbr | course | Term | Session | Acad Group | Subject | Course Title | description | Component | Time | ... | Seats Remaining | Has WL | Units | sub_title | credits | requirements_distribution | consent | advisory_prerequisites | other_course_info | repeatability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30282 | AAS 103 | Fall 2021 | Regular Academic Session | Literature, Sci, and the Arts | Afroamerican & African Studies (AAS) Open Sect... | Social Sci Seminar | This course seeks to introduce students to eve... | SEM | 1-230PM | ... | 1 | Y | 3.0 | - Reading Africa: Critical Perspectives on Po... | 3 | SS | | Enrollment restricted to first-year students, ... | (Cross-Area Courses). May not be included in a... | May not be repeated for credit. |
| 3 | 30276 | AAS 104 | Fall 2021 | Regular Academic Session | Literature, Sci, and the Arts | Afroamerican & African Studies (AAS) Open Sect... | Humanities Seminar | This seminar introduces first-year students to... | SEM | 1130-1PM | ... | 2 | Y | 3.0 | - Black Lives and Life Writing: How We Tell S... | 3 | HU | With permission of instructor. | Enrollment restricted to first-year students, ... | (Cross-Area Courses). May not be included in a... | May not be repeated for credit. |
| 21 | 19186 | AAS 115 | Fall 2021 | Regular Academic Session | Literature, Sci, and the Arts | Afroamerican & African Studies (AAS) Open Sect... | Elementary Swahili | This course is an introduction to spoken and w... | REC | 1-2PM | ... | 13 | Y | 4.0 | - Swahili Language and Culture | 4 | | | | | May not be repeated for credit. |
| 25 | 26657 | AAS 125 | Fall 2021 | Regular Academic Session | Literature, Sci, and the Arts | Afroamerican & African Studies (AAS) Open Sect... | Elem Yoruba I | This course is designed to introduce the Yorub... | REC | 9-10AM | ... | 5 | Y | 4.0 | - Yoruba | 4 | | | | May not repeat the same language at the same l... | May not be repeated for credit. |
| 26 | 30898 | AAS 202 | Fall 2021 | Regular Academic Session | Literature, Sci, and the Arts | Afroamerican & African Studies (AAS) Open Sect... | Intro Afr Diasp Stds | Is the African Diaspora a concept or an actual... | SEM | 1-230PM | ... | 9 | Y | 3.0 | - Global Blackness | 3 | | | | | May not be repeated for credit. |


In [10]:
df_1 = df[['Class Nbr', 'course', 'Course Title','description']]

In [17]:
corpus = df_1['description'].tolist()
corpus[:10]

['This course seeks to introduce students to everyday life in urban Africa. The course is designed to equip students with basic and useful knowledge about the how urban residents – rich and poor, newcomers and old-timers, young and old, men and women – negotiate the challenges of living in cities.  This course focuses on networks, associational life, and relationships that are the ties that bind urban residents together.  Social organization, religious belief and practice, ethnicity, economic and political systems, the arts, and popular culture are some of the topics we will explore.  We will be approaching these themes from a variety of disciplinary perspectives, including history, anthropology, literature, political science, sociology, and economics.',
 'This seminar introduces first-year students to the intellectual community of humanities scholars working in the field of Afroamerican and African studies.  The topic of the seminar varies from year to year.',
 'This course is an intr

In [13]:
corpus = list(np.unique(corpus))


## Feature Extraction
### Count Vectorizer

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

In [15]:
X_train_counts = count_vect.fit_transform(corpus[:10])
X_train_counts = pd.DataFrame(X_train_counts.toarray())
X_train_counts.columns = count_vect.get_feature_names_out()
X_train_counts



Unnamed: 0,10,12,121,1945,1980s,1989,19th,2000s,500,abandonment,...,with,within,women,word,write,writing,writings,year,you,your
0,0,2,0,1,1,1,1,1,0,0,...,6,1,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,3,0,0,0,0,1,0,0,0,0
3,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,4,0,1,0,0,0,0,1,0,0
6,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
7,1,1,0,0,0,0,0,0,0,1,...,3,0,0,0,1,0,1,0,0,1
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,1,0


In [16]:
corpus[0]

'\n\n\nThis course provides an introduction to Polish culture in the larger context of Slavic and Central European cultures through a detailed study and analysis of “music of protest” (jazz, cabaret, rock, punk) during the 1945-1989 period of Soviet dominance and during the period of transition to democracy and after the establishment of full democratic rule in Poland. This course also provides an introduction to rhetoric and contextual reading of poetry (as well as other forms of expression). We will study in detail texts by some of the most important Polish and other Central European pop, jazz, cabaret, rock, and punk authors and bands with the purpose of identifying devices and strategies used to create meanings. We will connect texts with elements of the daily lives of people in Poland (as well as other Central European nations) focusing in particular on: cultural heritage, history, politics, social issues, past and future myths (interpretations of past events and projections of pe

In [18]:
X_train_counts.loc[0]


10          0
12          2
121         0
1945        1
1980s       1
           ..
writing     0
writings    0
year        0
you         0
your        0
Name: 0, Length: 787, dtype: int64
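
The count matrix above has columns such as `with`, `you`, and `your`, which carry little information about what a course covers. One possible text processing improvement is to drop English stop words when building the vocabulary. The sketch below compares the two vocabularies on two invented sentences (not taken from the course data); `stop_words="english"` uses scikit-learn's built-in list, which may or may not be the right list for this corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, for illustration only.
docs = [
    "This course is an introduction to writing with sources.",
    "You will study the history of writing in your first year.",
]

plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words="english").fit(docs)

# vocabulary_ maps each kept token to its column index.
print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

The filtered vocabulary keeps content words such as `writing` and `history` and drops function words, which should make later similarity scores depend more on subject matter.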

### TF-IDF

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(corpus[:10])
X_train_tfidf = pd.DataFrame(X_train_tfidf.toarray())
X_train_tfidf.columns = vectorizer.get_feature_names_out()
X_train_tfidf



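| | 116 | 1773 | 19th | 202 | aas | about | access | achieve | acquire | activities | ... | will | with | women | working | would | writing | written | year | yoruba | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071739 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.143477 | 0.057279 | 0.096458 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.096458 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.183789 | 0.0 | 0.0 | 0.0 | 0.551368 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.33646 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.149977 | ... | 0.262423 | 0.0 | 0.0 | 0.0 | 0.0 | 0.131212 | 0.0 | 0.0 | 0.176424 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.080392 | 0.068341 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.047738 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.125834 | 0.0 | 0.0 | 0.0 | 0.10697 | 0.0 | 0.0 | 0.125834 | 0.0 | 0.10697 | ... | 0.0 | 0.074723 | 0.0 | 0.0 | 0.125834 | 0.093586 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.081449 | 0.0 | ... | 0.0 | 0.048366 | 0.0 | 0.0 | 0.0 | 0.181727 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.089571 | 0.0 | 0.0 | 0.133233 | 0.089571 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.102304 | 0.0 | 0.0 | 0.0 | 0.076086 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.228259 | 0.06075 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |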
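
TF-IDF vectors like the ones above can already support the match, score, and rank steps described at the top of the notebook. The sketch below uses TF-IDF with cosine similarity as a simple stand-in for topic matching; it is not the notebook's final method. The function name `recommend` and the three toy rows are hypothetical and only mimic the `course` and `description` columns used earlier.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(user_text, courses, n=3):
    """Rank courses by cosine similarity between TF-IDF vectors."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Learn the vocabulary from the course descriptions only.
    course_matrix = vectorizer.fit_transform(courses["description"])
    # Project the user's text into the same vector space.
    query_vec = vectorizer.transform([user_text])
    scores = cosine_similarity(query_vec, course_matrix).ravel()
    ranked = courses.assign(score=scores).sort_values("score", ascending=False)
    return ranked.head(n)

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "course": ["HYPO 101", "HYPO 102", "HYPO 103"],
    "description": [
        "A survey of Korean history and culture.",
        "Independent study.",
        "Applied statistics for data science.",
    ],
})
print(recommend("korean history", toy, n=2))
```

Unlike the character-level ratio, this score is zero for descriptions that share no words with the query, so short generic descriptions no longer rise to the top. It still cannot match synonyms or related concepts, which is the motivation for word vectors.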


## Word Vectors  
Word vectors, also called word embeddings, are numeric representations of individual words in which words that appear in similar contexts have similar vectors. This lets us capture some of a word's meaning mathematically.  
**word2vec has two possible training approaches:**  
**CBOW (Continuous Bag Of Words):** It predicts a word, given the surrounding context words as input

**Skip-gram:** It predicts the surrounding context words, given a word as input
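
To make the difference concrete, the sketch below builds the training examples each approach would see for one short sentence. It only generates (input, label) pairs and does not train a model; the function name `training_pairs`, the window size, and the example sentence are choices made for this illustration.

```python
def training_pairs(tokens, window=1):
    """Return (cbow_pairs, skipgram_pairs) for a list of tokens."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: the whole context is the input, the center word is the label.
        cbow.append((context, target))
        # Skip-gram: the center word is the input, each context word is a label.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

cbow, skipgram = training_pairs(["students", "study", "korean", "history"])
print(cbow[2])       # (['study', 'history'], 'korean')
print(skipgram[:3])  # [('students', 'study'), ('study', 'students'), ('study', 'korean')]
```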

In [22]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [24]:
!python -m spacy download en_core_web_md 

Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [32]:
import spacy

nlp = spacy.load('en_core_web_md')


In [33]:
len(nlp('dog').vector)


300

In [34]:
def most_similar(word, topn=5):
    word = nlp.vocab[str(word)]
    queries = [
      w for w in word.vocab 
      if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]

    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    return [(w.lower_,w.similarity(word)) for w in by_similarity[:topn+1] if w.lower_ != word.lower_]

In [35]:
most_similar("lion", topn=10)

[]
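
The empty list is probably caused by the filter rather than by the vectors. As far as I understand spaCy v3, iterating over `nlp.vocab` only yields lexemes that have already been seen, and lexeme probabilities are not shipped with the model, so `w.prob >= -15` likely rejects every candidate. This was not verified on this run. An alternative is to search the vector table directly by cosine similarity. The sketch below shows the idea with NumPy on a tiny invented table so that it runs without a model; the function name `most_similar_in_table` and the toy vectors are hypothetical. With spaCy, `nlp.vocab.vectors.most_similar` offers a comparable lookup on the real table.

```python
import numpy as np

def most_similar_in_table(word, words, vectors, topn=5):
    """Return the topn words whose vectors are closest to `word` by cosine similarity."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalize rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[words.index(word)]
    scores = unit @ query
    order = np.argsort(-scores)  # highest similarity first
    return [(words[i], float(scores[i])) for i in order if words[i] != word][:topn]

# Toy 3-dimensional vectors, invented for illustration only.
words = ["lion", "tiger", "cat", "car"]
vectors = [[0.9, 0.8, 0.1], [0.85, 0.75, 0.2], [0.7, 0.9, 0.1], [0.1, 0.0, 0.9]]
print(most_similar_in_table("lion", words, vectors, topn=2))
```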