## This notebook:
- Recommend a course based on free text the user enters (this works more like an information retrieval system or search engine, such as Google search, because the user input is free text)
1.    
    - User enters a block of text (free text, no preset options)
    - This block of text would describe their interest
    - Process this block of text
    - Find the topics of this block of text
    
2.    
    - Process text of the descriptions of courses
    - Find the topics of those blocks of text
    
3.    
    - Match the topics of the user's input text against the topics of every course
    - Compute similarity scores
    - Rank these scores from high to low
    - Return the top n recommendations, ordered by similarity
    
- Needs improvement

    - Text processing
    - Topic modeling (the recommendations are not yet sensible because the text processing and topic modeling still need work)
    - Below is my test using fuzzywuzzy, a very simple text-matching library. It does not work well because it compares only the characters of the text, not its meaning.
    - We need to measure the similarity of documents based on their semantics, e.g. with word embeddings
    - Look at Hellinger distance: https://radimrehurek.com/gensim_3.8.3/auto_examples/tutorials/run_distance_metrics.html
    - Look at NLP using Deep Learning in Python: https://towardsdatascience.com/deep-learning-for-semantic-text-matching-d4df6c2cf4c5 (seems cool, detailed notebook and YouTube tutorial)
    - Look at NLP using RNN (word2vec): https://towardsdatascience.com/text-matching-with-deep-learning-e6aa05333399
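
The three steps listed above (process the text, score it against every course, rank and return the top n) can be sketched as follows. This is only an illustration under stated assumptions: `tokenize`, `jaccard`, `recommend`, and the toy catalog are hypothetical names and data, and Jaccard token overlap is a placeholder for whatever similarity measure the notebook ends up using.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Placeholder similarity: shared tokens divided by all distinct tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user_text, courses, n=3):
    """Return the n (score, course) pairs most similar to user_text."""
    query_tokens = tokenize(user_text)                    # step 1: process user text
    scored = [
        (jaccard(query_tokens, tokenize(description)), course)  # steps 2 and 3: process and score
        for course, description in courses.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # rank from high to low
    return scored[:n]                                     # keep the top n

# Hypothetical toy catalog, not taken from the course data.
toy_courses = {
    "HIST 101": "Survey of Korean history and culture",
    "STATS 250": "Introduction to statistics and data analysis",
    "MUSIC 139": "Introduction to the study of music",
}
top = recommend("korean history", toy_courses, n=2)
print(top)
```

Swapping `jaccard` for an embedding-based score would keep the rest of the pipeline unchanged.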

In [4]:
import pandas as pd
from fuzzywuzzy import fuzz, process



In [5]:
online = pd.read_csv('assets/original/2021-10-19-MichiganOnline-courses.csv')
f_21 = pd.read_csv('assets/f_21_merge.csv')
w_22 = pd.read_csv('assets/w_22_merge.csv')

In [82]:
user_text = 'korean history'

In [76]:
def get_score(user_text, df, min_score=20):
    """Score every course description against user_text with fuzz.ratio."""
    # Clean a copy so the caller's DataFrame is not modified in place.
    df = (df.drop_duplicates(subset=['course'])
            .dropna(subset=['description'])
            .fillna('')
            .reset_index(drop=True))

    score_df = pd.DataFrame({
        'user_text': user_text,
        'course_des': df['description'],
        'course': df['course'],
        'course_title': df['Course Title'],
        # fuzz.ratio compares characters, not meaning (0-100 scale).
        'score': df['description'].map(lambda des: fuzz.ratio(user_text, des)),
    })

    # Keep only matches at or above the threshold, best first.
    score_df = score_df[score_df['score'] >= min_score]

    return score_df.sort_values(by='score', ascending=False)

In [83]:
get_score(user_text, f_21)

Unnamed: 0,user_text,course_des,course,course_title,score
1298,korean history,Independent study.,MUSICOL 481,Special Projects,38
1797,korean history,Course leads to THEORY 236.,THEORY 135,Intr Mus Thry,34
503,korean history,Applied Statistics II,DATASCI 501,Applied Stat II,29
627,korean history,Topics of current interest selected by the fac...,EECS 198,Special Topics,28
1807,korean history,Individual work and reading for graduate stude...,THEORY 570,Directed Indiv Study,25
439,korean history,Regular reports and conferences required.,CLARCH 499,Supervised Reading,25
1803,korean history,Special topics that vary from term to term.,THEORY 407,Directed Indiv Stdy,25
945,korean history,Theories of Pictorial Autonomy: Writing About ...,HISTART 402,Cont Interp in A H,24
1202,korean history,Selected topics pertinent to mechanical engine...,MECHENG 499,Spec Topics in M E,24
1830,korean history,Individual work and reading for undergraduate ...,THTREMUS 400,Directed Reading,24
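
The ranking above favours short, unrelated descriptions. A rough way to see why: `fuzz.ratio` is a character-level measure, so a long description is penalised for its length even when it is on topic. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` (an assumption: fuzzywuzzy can use a different backend, so its scores may differ slightly). `char_ratio` and the long description are hypothetical and written only for this example.

```python
from difflib import SequenceMatcher

def char_ratio(a, b):
    """Character-level similarity on a 0-100 scale (stand-in for fuzz.ratio)."""
    # ratio() is 2 * M / T: M = matched characters, T = total length of both strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())

query = "korean history"
short_unrelated = "Independent study."
# Hypothetical description, written for this example only.
long_relevant = ("This course surveys the history of Korea from the earliest "
                 "kingdoms to the present, with attention to politics and culture.")

short_score = char_ratio(query, short_unrelated)
long_score = char_ratio(query, long_relevant)
print("short, unrelated:", short_score)
print("long, relevant:  ", long_score)
```

Because T grows with the description length while M can never exceed the length of the query, a long relevant description cannot score well against a two-word query.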


In [1]:
import pandas as pd
import numpy as np
import sklearn

In [6]:
df = f_21
df.drop_duplicates(subset=['course'], inplace=True)
df.dropna(subset=['description'], inplace=True)
df.fillna('', inplace=True)

In [9]:
df.head()

Unnamed: 0,Class Nbr,course,Term,Session,Acad Group,Subject,Course Title,description,Component,Time,...,Seats Remaining,Has WL,Units,sub_title,credits,requirements_distribution,consent,advisory_prerequisites,other_course_info,repeatability
0,30282,AAS 103,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Social Sci Seminar,This course seeks to introduce students to eve...,SEM,1-230PM,...,1,Y,3.0,- Reading Africa: Critical Perspectives on Po...,3,SS,,"Enrollment restricted to first-year students, ...",(Cross-Area Courses). May not be included in a...,May not be repeated for credit.
3,30276,AAS 104,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Humanities Seminar,This seminar introduces first-year students to...,SEM,1130-1PM,...,2,Y,3.0,- Black Lives and Life Writing: How We Tell S...,3,HU,With permission of instructor.,"Enrollment restricted to first-year students, ...",(Cross-Area Courses). May not be included in a...,May not be repeated for credit.
21,19186,AAS 115,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Elementary Swahili,This course is an introduction to spoken and w...,REC,1-2PM,...,13,Y,4.0,- Swahili Language and Culture,4,,,,,May not be repeated for credit.
25,26657,AAS 125,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Elem Yoruba I,This course is designed to introduce the Yorub...,REC,9-10AM,...,5,Y,4.0,- Yoruba,4,,,,May not repeat the same language at the same l...,May not be repeated for credit.
26,30898,AAS 202,Fall 2021,Regular Academic Session,"Literature, Sci, and the Arts",Afroamerican & African Studies (AAS) Open Sect...,Intro Afr Diasp Stds,Is the African Diaspora a concept or an actual...,SEM,1-230PM,...,9,Y,3.0,- Global Blackness,3,,,,,May not be repeated for credit.


In [10]:
df_1 = df[['Class Nbr', 'course', 'Course Title','description']]

In [17]:
corpus = df_1['description'].tolist()
corpus[:10]

['This course seeks to introduce students to everyday life in urban Africa. The course is designed to equip students with basic and useful knowledge about the how urban residents – rich and poor, newcomers and old-timers, young and old, men and women – negotiate the challenges of living in cities.  This course focuses on networks, associational life, and relationships that are the ties that bind urban residents together.  Social organization, religious belief and practice, ethnicity, economic and political systems, the arts, and popular culture are some of the topics we will explore.  We will be approaching these themes from a variety of disciplinary perspectives, including history, anthropology, literature, political science, sociology, and economics.',
 'This seminar introduces first-year students to the intellectual community of humanities scholars working in the field of Afroamerican and African studies.  The topic of the seminar varies from year to year.',
 'This course is an intr

In [13]:
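# Caution: np.unique also sorts the descriptions, so positions in corpus
# no longer line up with the rows of df_1 after this step.
corpus = list(np.unique(corpus))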


## Feature Extraction
### Count Vectorizer

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

In [15]:
X_train_counts = count_vect.fit_transform(corpus[:10])
X_train_counts = pd.DataFrame(X_train_counts.toarray())
X_train_counts.columns = count_vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
X_train_counts



Unnamed: 0,10,12,121,1945,1980s,1989,19th,2000s,500,abandonment,...,with,within,women,word,write,writing,writings,year,you,your
0,0,2,0,1,1,1,1,1,0,0,...,6,1,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,3,0,0,0,0,1,0,0,0,0
3,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,4,0,1,0,0,0,0,1,0,0
6,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
7,1,1,0,0,0,0,0,0,0,1,...,3,0,0,0,1,0,1,0,0,1
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,1,0


In [16]:
corpus[0]

'\n\n\nThis course provides an introduction to Polish culture in the larger context of Slavic and Central European cultures through a detailed study and analysis of “music of protest” (jazz, cabaret, rock, punk) during the 1945-1989 period of Soviet dominance and during the period of transition to democracy and after the establishment of full democratic rule in Poland. This course also provides an introduction to rhetoric and contextual reading of poetry (as well as other forms of expression). We will study in detail texts by some of the most important Polish and other Central European pop, jazz, cabaret, rock, and punk authors and bands with the purpose of identifying devices and strategies used to create meanings. We will connect texts with elements of the daily lives of people in Poland (as well as other Central European nations) focusing in particular on: cultural heritage, history, politics, social issues, past and future myths (interpretations of past events and projections of pe

In [18]:
X_train_counts.loc[0]


10          0
12          2
121         0
1945        1
1980s       1
           ..
writing     0
writings    0
year        0
you         0
your        0
Name: 0, Length: 787, dtype: int64

## TF-IDF

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(corpus[:10])
X_train_tfidf = pd.DataFrame(X_train_tfidf.toarray())
X_train_tfidf.columns = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
X_train_tfidf



Unnamed: 0,116,1773,19th,202,aas,about,access,achieve,acquire,activities,...,will,with,women,working,would,writing,written,year,yoruba,young
0,0.0,0.0,0.0,0.0,0.0,0.071739,0.0,0.0,0.0,0.0,...,0.143477,0.057279,0.096458,0.0,0.0,0.0,0.0,0.0,0.0,0.096458
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.183789,0.0,0.0,0.0,0.551368,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.33646,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.149977,...,0.262423,0.0,0.0,0.0,0.0,0.131212,0.0,0.0,0.176424,0.0
4,0.0,0.0,0.0,0.080392,0.068341,0.0,0.0,0.0,0.0,0.0,...,0.0,0.047738,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.125834,0.0,0.0,0.0,0.10697,0.0,0.0,0.125834,0.0,0.10697,...,0.0,0.074723,0.0,0.0,0.125834,0.093586,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.081449,0.0,...,0.0,0.048366,0.0,0.0,0.0,0.181727,0.0,0.0,0.0,0.0
7,0.0,0.0,0.089571,0.0,0.0,0.133233,0.089571,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.102304,0.0,0.0,0.0,0.076086,0.0,0.0,0.0,0.0,...,0.228259,0.06075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
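
To make the weights in the table above less opaque, here is a minimal hand-rolled sketch of the default `TfidfVectorizer` weighting: raw term counts multiplied by a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The tokenisation here is a plain whitespace split, which is a simplification, and `tfidf_vectors`, `cosine`, and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def cosine(u, v):
    """Dot product of two L2-normalized sparse vectors."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

docs = [
    "introduction to korean history",
    "introduction to music theory",
    "korean history",   # plays the role of the user's query
]
vecs = tfidf_vectors(docs)
sim_history = cosine(vecs[2], vecs[0])
sim_music = cosine(vecs[2], vecs[1])
print(sim_history, sim_music)
```

In the notebook itself the query would more naturally go through `vectorizer.transform` on the already fitted vectorizer instead of being fitted together with the corpus as in this toy setup.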


## Word Vectors  
Word vectors - also called word embeddings - are numerical representations of individual words, learned so that words appearing in similar contexts end up with similar vectors. This lets us compare the meaning of words mathematically.  
**There are two possible training approaches (both from word2vec):**  
**CBOW (Continuous Bag Of Words):** predicts a word, given the surrounding context words as input

**Skip-gram:** predicts the surrounding context words, given a word as input
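
The difference between the two approaches is easiest to see in the training examples each one generates from the same sentence. The sketch below only builds those (input, target) pairs and trains nothing; `context_windows`, `cbow_examples`, and `skipgram_examples` are hypothetical helper names and the sentence is made up.

```python
def context_windows(tokens, window=2):
    """Yield (center_word, context_words) for every position in tokens."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window=2):
    """CBOW: the input is the context, the target is the center word."""
    return [(context, center) for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the input is the center word, the target is each context word."""
    return [(center, word)
            for center, context in context_windows(tokens, window)
            for word in context]

tokens = "the course covers korean history".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[2])       # context words around "covers", then the target word
print(skipgram[:3])  # (input word, one context word) pairs
```

CBOW produces one example per position, while skip-gram produces one example per (word, neighbour) pair, so skip-gram sees more training examples from the same text.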

In [44]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [24]:
#!python -m spacy download en_core_web_md 

Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [49]:
import spacy

nlp = spacy.load('en_core_web_md')
#nlp = spacy.load("en_core_web_sm")

In [33]:
len(nlp('dog').vector)


300

In [58]:
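def most_similar(word, topn=5):
    # Look up the lexeme for a single word (word should be a string, not a Doc).
    word = nlp.vocab[str(word)]
    # Candidate lexemes: same casing, above a log-probability floor, non-zero vector.
    # Author's note: raising the floor from -30 to -20 returned no similar words.
    # Iterating over the vocab only visits the lexemes it currently holds, which
    # probably explains the weak neighbours in the outputs below.
    queries = [
        w for w in word.vocab
        if w.is_lower == word.is_lower and w.prob >= -30 and np.count_nonzero(w.vector)
    ]

    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn+1] if w.lower_ != word.lower_]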

In [59]:
most_similar("king", topn=10)

[('lion', 0.47474015),
 ('he', 0.39728197),
 ('y.', 0.3512264),
 ('r.', 0.3512264),
 ('who', 0.33955762),
 ('let', 0.33448273),
 ('when', 0.32163125),
 ('was', 0.31741124),
 ('dare', 0.31443095),
 ('did', 0.31363884)]

In [60]:
most_similar("lion", topn=10)

[('king', 0.47474015),
 ('he', 0.31557545),
 ('i', 0.29816413),
 ('dare', 0.29063576),
 ('ca', 0.27542132),
 ('she', 0.2714466),
 ('when', 0.26939633),
 ('u', 0.26731068),
 ('does', 0.2552375),
 ('there', 0.2522287)]

In [62]:
most_similar("man", topn=10)

[('he', 0.68311125),
 ('who', 0.56556106),
 ('when', 0.5414661),
 ('that', 0.5309426),
 ('what', 0.51977086),
 ('she', 0.51776016),
 ('was', 0.51282316),
 ('could', 0.5085001),
 ('there', 0.50046384),
 ('why', 0.49632692)]

Sentence (or document) objects have vectors, derived from the averages of individual token vectors. This makes it possible to compare similarities between whole documents.
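
As a minimal sketch of that averaging idea, assuming made-up 3-dimensional vectors in place of the 300-dimensional ones from the model: `toy_vectors`, `doc_vector`, and `cosine` are hypothetical names, and words missing from the toy table are simply skipped.

```python
import math

# Hypothetical 3-dimensional word vectors (the spaCy model above uses 300 dimensions).
toy_vectors = {
    "chinese": [0.9, 0.1, 0.0],
    "music":   [0.1, 0.9, 0.0],
    "songs":   [0.2, 0.8, 0.1],
    "algebra": [0.0, 0.1, 0.9],
}

def doc_vector(text):
    """Average the vectors of the known words in text, dimension by dimension."""
    rows = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    return [sum(column) / len(rows) for column in zip(*rows)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = doc_vector("chinese music")
close = cosine(query, doc_vector("chinese songs"))
far = cosine(query, doc_vector("algebra"))
print(close, far)
```

One consequence of plain averaging is that word order is ignored, and frequent function words pull every document vector toward the same region unless they are removed or down-weighted.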


In [65]:
doc = nlp("I'm interested in Chinese traditional music")
len(doc.vector)

300

In [66]:
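# Note: most_similar() expects a single word. Passing a Doc makes it look up the
# whole sentence as one lexeme, which has no vector, hence the 0.0 scores below.
most_similar(doc, topn=10)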



[('kan', 0.0),
 ('mar', 0.0),
 ('sept.', 0.0),
 ('mont.', 0.0),
 ('jr.', 0.0),
 ('k.', 0.0),
 ('calif.', 0.0),
 ('ill.', 0.0),
 ("o'clock", 0.0),
 ('mich.', 0.0),
 ('might', 0.0)]

## BERT Sentence Transformer 

In [68]:
#pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting tokenizers>=0.10.3
  Downloading tokenizers-0.10.3-cp39-cp39-win_amd64.whl (2.0 MB)
Collecting torch>=1.6.0
  Downloading torch-1.10.0-cp39-cp39-win_amd64.whl (226.5 MB)
Collecting torchvision
  Downloading torchvision-0.11.1-cp39-cp39-win_amd64.whl (984 kB)
Note: you may need to restart the kernel to use updated packages.
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp39-cp39-win_amd64.whl (1.1 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
Collecting filelock
  Downloading filelock-3.4.0-py3-none-any.whl (9.8 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading P

In [69]:
from sentence_transformers import SentenceTransformer
import scipy.spatial
embedder = SentenceTransformer('bert-base-nli-mean-tokens')

Downloading: 100%|██████████| 391/391 [00:00<00:00, 392kB/s]
Downloading: 100%|██████████| 3.95k/3.95k [00:00<00:00, 661kB/s]
Downloading: 100%|██████████| 2.00/2.00 [00:00<00:00, 1.00kB/s]
Downloading: 100%|██████████| 625/625 [00:00<00:00, 313kB/s]
Downloading: 100%|██████████| 122/122 [00:00<00:00, 62.4kB/s]
Downloading: 100%|██████████| 229/229 [00:00<00:00, 115kB/s]
Downloading: 100%|██████████| 438M/438M [20:47<00:00, 351kB/s]
Downloading: 100%|██████████| 53.0/53.0 [00:00<00:00, 17.7kB/s]
Downloading: 100%|██████████| 112/112 [00:00<00:00, 37.4kB/s]
Downloading: 100%|██████████| 466k/466k [00:01<00:00, 390kB/s]
Downloading: 100%|██████████| 399/399 [00:00<00:00, 200kB/s]
Downloading: 100%|██████████| 232k/232k [00:00<00:00, 367kB/s]
Downloading: 100%|██████████| 190/190 [00:00<00:00, 95.2kB/s]


In [70]:
%%time
corpus_embeddings = embedder.encode(corpus)

Wall time: 10min 42s


## Candidate Generation using Faiss vector similarity search library  
Faiss is a library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors.



In [72]:
pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.7.1.post2-cp39-cp39-win_amd64.whl (10.1 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.1.post2
Note: you may need to restart the kernel to use updated packages.


1. Tutorial : https://github.com/facebookresearch/faiss/wiki/Getting-started  
2. facebookresearch : https://github.com/facebookresearch/faiss

In [74]:
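import faiss
d = 768                        # embedding dimension produced by the BERT-base model
index = faiss.IndexFlatL2(d)   # exact (brute-force) search on L2 distance
print(index.is_trained)        # a flat index needs no training
index.add(np.stack(corpus_embeddings, axis=0))
print(index.ntotal)            # number of vectors stored in the index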

True
1910
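
For intuition about what the flat index does at query time, here is a brute-force sketch in plain Python. It compares the query with every stored vector and keeps the k smallest squared Euclidean distances, which is the quantity `IndexFlatL2` reports. `flat_l2_search` and the 2-dimensional toy vectors are hypothetical; the real index is far more efficient but should return the same neighbours.

```python
def flat_l2_search(index_vectors, query, k):
    """Return (distances, ids) of the k stored vectors closest to query."""
    scored = []
    for i, vec in enumerate(index_vectors):
        # Squared Euclidean distance (no square root is taken).
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    distances = [dist for dist, _ in scored[:k]]
    ids = [i for _, i in scored[:k]]
    return distances, ids

# Hypothetical 2-dimensional "embeddings"; the notebook's are 768-dimensional.
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, ids = flat_l2_search(stored, [2.0, 2.0], k=2)
print(ids, distances)
```

Since the distances are squared and the embeddings are not normalised, values in the hundreds (as in the search output later in this notebook) are expected and are not directly comparable to cosine similarities.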


In [84]:
queries =["I am interested in computer science", "I like pop music", 
        "I like Asian culture, especially Janpanese history", "I like to use computer skill to resolve biological problems",
        "I like playing with data and statistics"]
query_embeddings = embedder.encode(queries)

In [85]:
k = 5                          # number of nearest neighbors to return per query
D, I = index.search(np.stack(query_embeddings, axis=0), k)     # D: distances, I: corpus indices
print(I)                   # one row of k neighbor indices per query

[[  71  476  479 1468 1796]
 [ 738  844 1005 1292  339]
 [ 188  923 1423  457  339]
 [ 463 1131 1173  671  292]
 [ 339  457 1423  665  697]]


In [93]:
# Recommendations for the first query, "I am interested in computer science"
df_1.iloc[I[0]]

Unnamed: 0,Class Nbr,course,Course Title,description
457,21972,ALA 118,"Prog, Info & People",Introduction to programming with a focus on ap...
5456,32293,COMP 416,Sem Electron Mus,Includes the study of digital synthesis techni...
5467,32295,COMP 526,Adv Stdy Elec Mus,Includes the study of digital synthesis techni...
21704,11699,POLSCI 514,Computer Usage,Practical experience in the use of a system of...
28181,22083,TCHNCLCM 497,Adv Tch Com for CS,Advanced technical communication for computer ...


In [95]:
# Recommendations for "I like pop music"
df_1.iloc[I[1]]

Unnamed: 0,Class Nbr,course,Course Title,description
11367,31586,ENGLISH 319,Lit&Social Change,"Sure, pop culture is fun. It’s great to watch ..."
13100,31784,FRENCH 272,Fr Film&Culture,"In this course, we will explore French-languag..."
14300,15889,HONORS 135,Ideas in Honors,Music is undoubtedly one of the most powerful ...
18319,13033,MUSICOL 139,Intro Study Music,A survey of musical concepts and repertories o...
3812,30291,BIOPHYS 445,Intro to Info Theory,This course introduces the basic tools of Info...


In [86]:
for query, query_embedding in zip(queries, query_embeddings):
    # Faiss expects a 2-D array of shape (n_queries, d).
    distances, indices = index.search(np.asarray(query_embedding).reshape(1, -1), k)
    print("\n======================\n")
    print("Query:", query)
    print("\nTop %d most similar sentences in corpus:" % k)
    for idx in range(k):
        print(corpus[indices[0, idx]], "(Distance: %.4f)" % distances[0, idx])



Query: I am interested in computer science

Top 5 most similar sentences in corpus:
Introduction to programming with a focus on applications in informatics.  Covers the fundamental elements of a modern programming language and how to access data on the internet.  Explores how humans and technology complement one another, including techniques used to coordinate groups of people working together on software development. (Distance: 153.2221)
Includes the study of digital synthesis techniques. Special attention is given to the relationship between technology, the creative process, and individual statement. (Distance: 160.4367)
Includes the study of digital synthesis techniques. Special attention is given to the relationship between technology, the creative process, and individual statement. (Distance: 160.4367)
Practical experience in the use of a system of computer programs for social scientists. (Distance: 162.5113)
Advanced technical communication for computer science.  Design and wri


## Reranking using Bidirectional LSTM model
**Reference:** https://mlwhiz.com/blog/2019/03/09/deeplearning_architectures_text_classification/

In [87]:
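import re
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.stem import WordNetLemmatizer, SnowballStemmer

# One-time downloads needed by normalize_text (lemmatizer and stopword list):
# nltk.download('wordnet'); nltk.download('stopwords')
toko_tokenizer = ToktokTokenizer()
wordnet_lemmatizer = WordNetLemmatizer()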

In [88]:
def normalize_text(text):
        puncts = ['/', ',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
         '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
         '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
         '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
         '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]

        def clean_text(text):
            text = str(text)
            text = text.replace('\n', '')
            text = text.replace('\r', '')
            for punct in puncts:
                if punct in text:
                    text = text.replace(punct, '')
            return text.lower()

        def clean_numbers(text):
            if bool(re.search(r'\d', text)):
                text = re.sub('[0-9]{5,}', '#####', text)
                text = re.sub('[0-9]{4}', '####', text)
                text = re.sub('[0-9]{3}', '###', text)
                text = re.sub('[0-9]{2}', '##', text)
            return text

        contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}

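        def _get_contractions(contraction_dict):
            # Try longer keys first so that e.g. "i'd've" is not matched as "i'd".
            keys = sorted(contraction_dict, key=len, reverse=True)
            contraction_re = re.compile('(%s)' % '|'.join(re.escape(key) for key in keys))
            return contraction_dict, contraction_re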

        contractions, contractions_re = _get_contractions(contraction_dict)

        def replace_contractions(text):
            def replace(match):
                return contractions[match.group(0)]
            return contractions_re.sub(replace, text)

        stopword_list = nltk.corpus.stopwords.words('english')

        def remove_stopwords(text, is_lower_case=True):
            tokens = toko_tokenizer.tokenize(text)
            tokens = [token.strip() for token in tokens]
            if is_lower_case:
                filtered_tokens = [token for token in tokens if token not in stopword_list]
            else:
                filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
            filtered_text = ' '.join(filtered_tokens)    
            return filtered_text

        def lemmatizer(text):
            tokens = toko_tokenizer.tokenize(text)
            tokens = [token.strip() for token in tokens]
            tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens]
            return ' '.join(tokens)

        def trim_text(text):
            tokens = toko_tokenizer.tokenize(text)
            tokens = [token.strip() for token in tokens]
            return ' '.join(tokens)
        
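        def remove_non_english(text):
            # NOTE: `d` must be a spell-check dictionary exposing .check(word)
            # (presumably pyenchant's enchant.Dict("en_US"), which is an assumption).
            # In this notebook `d` is the integer 768, so this helper would fail
            # if called; its call is commented out below.
            tokens = toko_tokenizer.tokenize(text)
            tokens = [token.strip() for token in tokens]
            tokens = [token for token in tokens if d.check(token)]
            eng_text = ' '.join(tokens)
            return eng_text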

        text_norm = clean_text(text)
        text_norm = clean_numbers(text_norm)
        text_norm = replace_contractions(text_norm)
#         text_norm = remove_stopwords(text_norm)
#         text_norm = remove_non_english(text_norm)
        text_norm = lemmatizer(text_norm)
        text_norm = trim_text(text_norm)
        return text_norm
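
To make the effect of the number-masking step concrete, here is the same substitution logic as `clean_numbers` on its own, using only `re`. `mask_numbers` is a hypothetical standalone copy written for this illustration. It also shows a side effect of running punctuation removal first: a range such as `1945-1989` loses its hyphen, becomes one eight-digit run, and is masked as a single long number.

```python
import re

def mask_numbers(text):
    """Replace digit runs with '#' placeholders, longest runs first."""
    if re.search(r'\d', text):
        text = re.sub('[0-9]{5,}', '#####', text)
        text = re.sub('[0-9]{4}', '####', text)
        text = re.sub('[0-9]{3}', '###', text)
        text = re.sub('[0-9]{2}', '##', text)
    return text

masked = mask_numbers("eecs 198 in 2021 with 30282 seats")
print(masked)

# After clean_text removes the hyphen, the year range is a single digit run.
year_range = "1945-1989".replace('-', '')
masked_range = mask_numbers(year_range)
print(masked_range)
```

Single digits are left untouched, since the shortest pattern requires two digits.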