## Text Similarity Analysis with Job Description
By: Karen Pu
Date: April 25, 2024

This code calculates the text similarity between applicants' "Why SyBBURE" responses and the
official SyBBURE description. Inspired by how ML is used to match candidate
resumes to the traits highlighted in a job description, we tested whether
TF-IDF similarity between applicant responses and the SyBBURE description was
associated with whether an applicant received a spot in the program.
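
One common way to turn the idea above into a single score per applicant is the cosine similarity between TF-IDF vectors. The sketch below is illustrative only: the three strings are made-up stand-ins (not program or applicant data), and fitting the vectorizer on the reference text together with the responses is an assumption rather than this notebook's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy texts, used only to illustrate the computation
reference = "students explore research and design in an inclusive community"
responses = [
    "I want research and design experience in a supportive community",
    "I enjoy cooking pasta on weekends",
]

# Fit on all texts so the reference and the responses share one vocabulary
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([reference] + responses)

# Row 0 is the reference; the remaining rows are the responses
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
for text, score in zip(responses, scores):
    print(f"{score:.3f}  {text}")
```

A per-applicant score of this kind could then be compared between applicants who did and did not receive a spot.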

In [None]:
# Description of the SyBBURE Searle Program and what it is looking for

description = """The SyBBURE Searle Program seeks to transform the scientific field into an inclusive, integrated, purposeful community—one student at a time.
To seek after this vision, we provide students with experiences in both research and design within an idealized community so that they may be transformed into the next generation of innovators, regardless of their current major or future plans.
A generous gift from Gideon Searle, a Vanderbilt alumnus, allows us to provide this distinctive opportunity for undergraduate students at Vanderbilt University to explore science at promising and exciting frontiers! The gift supports student participation in research projects conducted in labs across Vanderbilt’s campus. We partner with the faculty, post-doctoral fellows, and graduate students in these labs to provide our students with tailored, mentored immersion experiences in advanced scientific research. To bolster their research experience, students engage in team-based design projects with peers in the program to further explore their interests and solve real-world problems.
While students gravitate towards the program for the research experience, they often stay because of the community. We are an idealized research and design playground where students can work together and grow both personally and professionally, exemplifying the inclusive, integrated, purposeful community we ideally want the scientific community to be.
To establish our program culture, we create a scaffold of high expectations coupled with a dynamic, fluid, and fun environment. The philosophy of the SyBBURE Searle Program exists around a “maximum effort for maximum return” ideology. Through this year-round research and design experience, students develop and expand their skill sets, are encouraged to explore their passions and ideas, and receive personalized mentorship from their faculty research adviser, as well as the SyBBURE Searle team."""

In [None]:
import nltk  # required by the nltk.download call at the end of this cell
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# extract necessary data
import pandas as pd

df = pd.read_csv("./sample_data/sybbure_data.csv")

# .copy() makes `filtered` an independent DataFrame, so adding columns
# to it later does not raise pandas' SettingWithCopyWarning
filtered = df[["What are your goals and motivations for participating in SyBBURE?",
               "Why are you seeking an experience in research and/or design?",
               "Group Interview",
               "Spot in the program?"]].copy()

# encode 'Y' as 1 and anything else (including missing values) as 0
filtered["Interview"] = (filtered["Group Interview"] == "Y").astype(int)
filtered["Spot"] = (filtered["Spot in the program?"] == "Y").astype(int)


In [None]:
# preprocess data
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

# build these once instead of on every call to preprocess
STOP_WORDS = set(stopwords.words('english'))
PORTER = PorterStemmer()

def preprocess(text):
    """Lowercase, strip ASCII punctuation, drop stop words, and stem."""
    text = text.lower()

    # string.punctuation covers ASCII only, so characters such as the em-dash remain
    text = "".join([char for char in text if char not in string.punctuation])
    words = nltk.word_tokenize(text)

    filtered_words = [word for word in words if word not in STOP_WORDS]
    return [PORTER.stem(word) for word in filtered_words]

description_processed = preprocess(description)
filtered['Goal_Token'] = filtered["What are your goals and motivations for participating in SyBBURE?"].apply(preprocess)
filtered["Research_Token"] = filtered["Why are you seeking an experience in research and/or design?"].apply(preprocess)

# join both token lists into one space-separated string per applicant
filtered['Combine'] = filtered.apply(lambda row: " ".join(row['Goal_Token'] + row['Research_Token']), axis=1)
df = filtered[['Combine', 'Interview', 'Spot']]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
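
The `preprocess` function above deletes only the ASCII characters in `string.punctuation`. That leaves the em-dash in the description's first sentence in place and fuses hyphenated words, which is consistent with stems such as `teambas`, `postdoctor`, and `realworld` in the vocabulary printed later in the notebook. A minimal alternative, assuming it is acceptable to split on any non-word character, is sketched below; `split_words` is a hypothetical helper, not part of the notebook.

```python
import re
import string

sample = "community\u2014one team-based post-doctoral"  # \u2014 is an em-dash

# Approach used in preprocess: drop ASCII punctuation characters only
stripped = "".join(ch for ch in sample.lower() if ch not in string.punctuation)
print(stripped.split())  # the em-dash survives and the hyphenated words are fused

def split_words(text):
    """Lowercase the text and split on any run of non-word characters."""
    return [w for w in re.split(r"\W+", text.lower()) if w]

print(split_words(sample))  # six separate words
```

Switching to this behaviour would change the stems and therefore the vocabulary, so any downstream results would need to be regenerated.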

In [None]:
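# split train and test for spot in program
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
    filtered['Combine'], filtered['Spot'], test_size=0.2)

# fit the label encoder on the training labels only, then reuse it for the test labels
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.transform(Test_Y)

# restrict the TF-IDF vocabulary to the stemmed words of the program description;
# note that fit() treats each stem in the list as its own one-word document
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(description_processed)
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

print(Tfidf_vect.vocabulary_)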


{'sybbur': 95, 'searl': 87, 'program': 74, 'seek': 88, 'transform': 102, 'scientif': 86, 'field': 33, 'inclus': 49, 'integr': 51, 'purpos': 78, 'community': 9, 'on': 61, 'student': 93, 'time': 99, 'vision': 107, 'provid': 77, 'experi': 29, 'research': 82, 'design': 15, 'within': 110, 'ideal': 46, 'commun': 8, 'may': 56, 'next': 59, 'gener': 38, 'innov': 50, 'regardless': 81, 'current': 14, 'major': 54, 'futur': 37, 'plan': 69, 'gift': 40, 'gideon': 39, 'vanderbilt': 106, 'alumnu': 4, 'allow': 3, 'us': 105, 'distinct': 17, 'opportun': 62, 'undergradu': 103, 'univers': 104, 'explor': 30, 'scienc': 85, 'promis': 76, 'excit': 24, 'frontier': 35, 'support': 94, 'particip': 63, 'project': 75, 'conduct': 10, 'lab': 53, 'across': 0, 'campu': 7, 'partner': 64, 'faculti': 31, 'postdoctor': 71, 'fellow': 32, 'graduat': 41, 'tailor': 96, 'mentor': 57, 'immers': 48, 'advanc': 1, 'bolster': 6, 'engag': 21, 'teambas': 98, 'peer': 66, 'interest': 52, 'solv': 91, 'realworld': 79, 'problem': 72, 'gravit
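
Because `description_processed` is a list of stems, `Tfidf_vect.fit` above sees each stem as a separate one-word document. The learned vocabulary is unaffected, but the IDF weights are computed over those one-word documents. The sketch below, using made-up stems rather than the real description, contrasts that with fitting on a single joined string; which behaviour is preferable here is a judgment call.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stems standing in for description_processed
tokens = ["research", "design", "commun", "research", "student"]

per_token = TfidfVectorizer().fit(tokens)            # five one-word documents
joined = TfidfVectorizer().fit([" ".join(tokens)])   # one five-word document

# Both fits learn the same vocabulary
print(set(per_token.vocabulary_) == set(joined.vocabulary_))

# The IDF weights differ: the repeated stem gets a lower weight in the
# per-token fit, while a single document gives every term a weight of 1.0
terms = sorted(per_token.vocabulary_, key=per_token.vocabulary_.get)
print(dict(zip(terms, np.round(per_token.idf_, 3))))
print(dict(zip(terms, np.round(joined.idf_, 3))))
```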

In [None]:
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf, Train_Y)
# predict the labels on the held-out test set
predictions_NB = Naive.predict(Test_X_Tfidf)
print(predictions_NB)
# accuracy_score expects (y_true, y_pred)
print("Naive Bayes Accuracy Score -> ", accuracy_score(Test_Y, predictions_NB)*100)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Naive Bayes Accuracy Score ->  84.61538461538461


In [None]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
# (degree and gamma are ignored by a linear kernel, so they are omitted)
SVM = svm.SVC(C=1.0, kernel='linear')
SVM.fit(Train_X_Tfidf, Train_Y)
# predict the labels on the held-out test set
predictions_SVM = SVM.predict(Test_X_Tfidf)
print(predictions_SVM)
print(Test_Y)
# accuracy_score expects (y_true, y_pred)
print("SVM Accuracy Score -> ", accuracy_score(Test_Y, predictions_SVM)*100)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0
 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
SVM Accuracy Score ->  84.61538461538461
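
Both classifiers predict class 0 for all 52 test applicants, and 44 of the 52 printed test labels are 0, so the 84.6% accuracy is exactly what an always-predict-0 baseline would score. The sketch below re-enters the labels printed above and adds a confusion matrix and the recall for class 1; these extra metrics are a suggested check, not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Test labels copied from the printed output above
printed = (
    "0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 "
    "1 0 0 0 1 0 0 0 0 0 0 0 0 0 0"
)
test_y = np.array([int(x) for x in printed.split()])
predictions = np.zeros_like(test_y)  # both models predicted 0 for every applicant

print("accuracy:", accuracy_score(test_y, predictions))
print("confusion matrix:\n", confusion_matrix(test_y, predictions))
print("recall for class 1:", recall_score(test_y, predictions))
```

A recall of 0 for class 1 means neither model identified any admitted applicant in this split, so the accuracy figure on its own does not show an association between the description's vocabulary and admission.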


Acknowledgements:
- SyBBURE Searle Undergraduate Research Program (https://www.sybbure.org/)
- ChatGPT
- https://www.jetir.org/papers/JETIR2305459.pdf