# Sentiment analysis

## Analyses
1. ~~Simple keyword analysis + tag cloud / bar chart visualisation~~
2. ~~Keyword analysis using word embeddings to conflate synonyms~~
    * Visualization of 2.?
3. Sentiment analysis, e.g. pos/neg/neutral etc. 
4. Clustering of keywords, possibly using Carrot2 or similar
5. Compare different cohorts, e.g. based on career stage, geography etc.  (if we have this data).
6. Correlation analysis to see how responses are related
   * e.g. does working longer hours correlate with greater negative sentiment?


## TODO

* ~~Load dataframe~~
* ~~Apply basic sentiment analysis to column 1~~
  * ~~do manual check~~
* ~~Apply SA to all columns~~
* Test different variations on the SA output, e.g. just use pos?
* Test different SA algos
* Programmatically iterate over all columns

## Import data, clean & normalise

In [19]:
import pandas as pd

In [20]:
# read csv ignoring potential N/As
df = pd.read_csv('Open Ended Questions-Table 1.csv', na_filter = False)

In [21]:
df

Unnamed: 0,What are the TWO major challenges faced by embryologists in the workplace?,Unnamed: 1,"What are the TWO major challenges faced by the embryology profession, in general?",Unnamed: 3,Provide TWO suggestions to improve embryologists' working conditions,Unnamed: 5,What is your career goal?
0,1,2,1,2,1,2,Open-Ended Response
1,Equality with clinical members,Poor pay,Shortage of trained staff,Equality,Improve number,Improve pay,Continue
2,Bullying by colleagues and managers,"Poorly designed protocols, technical ignorance...",Stress of delivering high quality work without...,Presence of narcissistic individuals destroyin...,Screening for sociopathic personality traits n...,Clinics must be staffed so that there is a goo...,To survive until retirement without suffering ...
3,Working hours and too many responsibilities,Salary,Access to training,Certification,Add reading and projects into basic workload n...,Understand for how many hours a brain can be f...,Get my ESHRE certification Publish paper thro...
4,burnout/stress,poor management,not enough highly trained staff,politics,better pay,better CPD opportunities,FRCPath
...,...,...,...,...,...,...,...
1252,low salaries,weekend and holiday work,low salaries,weekend and holiday work,increase salaries,close over christmas,Dont make a big mistake
1253,Low salaries,Weekend and holiday work,Low salary,Weekend and holiday work,Higher salaries,Close over Christmas holidays,Make babies and make people happy.
1254,Respect from other departments for the complex...,Financial compensation for overtime etc.,Financial participation: Owning shares,lack of respect for the work we do by doctors,Pay them!,Invite financial incentives to the embryologis...,I reached the glass ceiling. All I can do is h...
1255,limited staff,doctors,Salaries,Limit work opportunities,Better salaries,Flexible working hours,Laboratory director


In [22]:
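# create a dict mapping the short column names to the long question text
# (each "b" column is unnamed in the CSV, so it reuses the question text of its "a" column)
long_col_names = {'Q1a': df.columns[0], 'Q1b': df.columns[0], 'Q2a': df.columns[2], 'Q2b': df.columns[2],
                  'Q3a': df.columns[4], 'Q3b': df.columns[4], 'Q4': df.columns[6]}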

In [23]:
long_col_names

{'Q1a': 'What are the TWO major challenges faced by embryologists in the workplace?',
 'Q1b': 'What are the TWO major challenges faced by embryologists in the workplace?',
 'Q2a': 'What are the TWO major challenges faced by the embryology profession, in general? ',
 'Q2b': 'What are the TWO major challenges faced by the embryology profession, in general? ',
 'Q3a': "Provide TWO suggestions to improve embryologists' working conditions",
 'Q3b': "Provide TWO suggestions to improve embryologists' working conditions",
 'Q4': 'What is your career goal?'}

In [24]:
long_col_names['Q1a']

'What are the TWO major challenges faced by embryologists in the workplace?'

In [25]:
# rename the columns to the short names
df = df.rename(columns={df.columns[0]: 'Q1a', df.columns[1]: 'Q1b', df.columns[2]: 'Q2a', df.columns[3]: 'Q2b',
                        df.columns[4]: 'Q3a', df.columns[5]: 'Q3b', df.columns[6]: 'Q4'})

In [26]:
df.columns[0]

'Q1a'

In [27]:
# remove the first row, which holds the sub-headers (1, 2, Open-Ended Response) rather than a response
df = df.iloc[1:]

In [39]:
df

Unnamed: 0,Q1a,Q1b,Q2a,Q2b,Q3a,Q3b,Q4
1,Equality with clinical members,Poor pay,Shortage of trained staff,Equality,Improve number,Improve pay,Continue
2,Bullying by colleagues and managers,"Poorly designed protocols, technical ignorance...",Stress of delivering high quality work without...,Presence of narcissistic individuals destroyin...,Screening for sociopathic personality traits n...,Clinics must be staffed so that there is a goo...,To survive until retirement without suffering ...
3,Working hours and too many responsibilities,Salary,Access to training,Certification,Add reading and projects into basic workload n...,Understand for how many hours a brain can be f...,Get my ESHRE certification Publish paper thro...
4,burnout/stress,poor management,not enough highly trained staff,politics,better pay,better CPD opportunities,FRCPath
5,Recognizion,Trust,Handson training,Trust,Good laboratory training,Troubleshooting,Be confident in the work i do Academically st...
...,...,...,...,...,...,...,...
1252,low salaries,weekend and holiday work,low salaries,weekend and holiday work,increase salaries,close over christmas,Dont make a big mistake
1253,Low salaries,Weekend and holiday work,Low salary,Weekend and holiday work,Higher salaries,Close over Christmas holidays,Make babies and make people happy.
1254,Respect from other departments for the complex...,Financial compensation for overtime etc.,Financial participation: Owning shares,lack of respect for the work we do by doctors,Pay them!,Invite financial incentives to the embryologis...,I reached the glass ceiling. All I can do is h...
1255,limited staff,doctors,Salaries,Limit work opportunities,Better salaries,Flexible working hours,Laboratory director


In [30]:
# create a dictionary to store each of the columns, indexed on the column name
answers = {}
# create a dict of strings to store the column content, indexed on the column name
responses = {}
# iterate over the dataframe columns
for series_name, series in df.items():
    print(series_name)
    print(series)
    answers[series_name] = pd.DataFrame([' '.join(df[series_name].to_list())], columns=['content'])
    responses[series_name] = answers[series_name].values[0,0]

Q1a
1                          Equality with clinical members
2                     Bullying by colleagues and managers
3             Working hours and too many responsibilities
4                                          burnout/stress
5                                             Recognizion
                              ...                        
1252                                         low salaries
1253                                         Low salaries
1254    Respect from other departments for the complex...
1255                                        limited staff
1256                                             Training
Name: Q1a, Length: 1256, dtype: object
Q1b
1                                                Poor pay
2       Poorly designed protocols, technical ignorance...
3                                                  Salary
4                                         poor management
5                                                   Trust
                         

In [31]:
responses['Q1a']

'Equality with clinical members Bullying by colleagues and managers Working hours and too many responsibilities burnout/stress Recognizion Stress created by no embryos/ less embryos/ poor quality embryos General workplace politics Burnout due to working continously through weekends without breaks fertilization outcomes are laid on the embryologists lack of leadership Recognition of our contribution in the ART field Time managment administration work work overload Time spent at work Workload Overworked Oocyte quality hard cases; bad sperm, fragile oocytes We are not being paid fairly (it\'s a specialized field) Biopsy Poor pay; lack of transparency of salary in the private sector time management with patients/doctors Pression Daily work hours and weekends Employees shortage Stress induced work environment Recognition and sallary make no mistake space space Never ever make a mistake Workload BURN OUT Repeated manual processes emotional stress A good salary Heavy workload Scientific Recog

In [35]:
# case-fold (aggressive lower-casing) the joined text of every column
for col_name in responses:
    responses[col_name] = responses[col_name].casefold()

## Sentiment Analysis using NLTK

See https://www.datacamp.com/tutorial/text-analytics-beginners-nltk

In [40]:
# import libraries
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [41]:
df['Q1a']

1                          Equality with clinical members
2                     Bullying by colleagues and managers
3             Working hours and too many responsibilities
4                                          burnout/stress
5                                             Recognizion
                              ...                        
1252                                         low salaries
1253                                         Low salaries
1254    Respect from other departments for the complex...
1255                                        limited staff
1256                                             Training
Name: Q1a, Length: 1256, dtype: object

In [13]:
# create preprocess_text function
# (assumes the NLTK tokenizer, stop-word and WordNet resources have already been fetched via nltk.download)
stop_words = set(stopwords.words('english'))  # build once; note that this list includes negations such as 'not'
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the tokens back into a string
    processed_text = ' '.join(lemmatized_tokens)
    return processed_text

In [14]:
df['Q1a'] = df['Q1a'].apply(preprocess_text)
df

Unnamed: 0,Q1a,Q1b,Q2a,Q2b,Q3a,Q3b,Q4
1,equality clinical member,Poor pay,Shortage of trained staff,Equality,Improve number,Improve pay,Continue
2,bullying colleague manager,"Poorly designed protocols, technical ignorance...",Stress of delivering high quality work without...,Presence of narcissistic individuals destroyin...,Screening for sociopathic personality traits n...,Clinics must be staffed so that there is a goo...,To survive until retirement without suffering ...
3,working hour many responsibility,Salary,Access to training,Certification,Add reading and projects into basic workload n...,Understand for how many hours a brain can be f...,Get my ESHRE certification Publish paper thro...
4,burnout/stress,poor management,not enough highly trained staff,politics,better pay,better CPD opportunities,FRCPath
5,recognizion,Trust,Handson training,Trust,Good laboratory training,Troubleshooting,Be confident in the work i do Academically st...
...,...,...,...,...,...,...,...
1252,low salary,weekend and holiday work,low salaries,weekend and holiday work,increase salaries,close over christmas,Dont make a big mistake
1253,low salary,Weekend and holiday work,Low salary,Weekend and holiday work,Higher salaries,Close over Christmas holidays,Make babies and make people happy.
1254,respect department complex work,Financial compensation for overtime etc.,Financial participation: Owning shares,lack of respect for the work we do by doctors,Pay them!,Invite financial incentives to the embryologis...,I reached the glass ceiling. All I can do is h...
1255,limited staff,doctors,Salaries,Limit work opportunities,Better salaries,Flexible working hours,Laboratory director


In [None]:
# initialize NLTK sentiment analyzer (VADER); requires the 'vader_lexicon' resource, e.g. nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()

In [48]:
# BINARY get_sentiment function
def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    sentiment = 1 if scores['pos'] > 0 else 0
    return sentiment

In [81]:
# compound get_sentiment function (redefines, and so replaces, the binary version above)
def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    return scores['compound']
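For the pos/neg/neutral variant listed under Analyses, the compound score can be bucketed into labels. The sketch below uses the ±0.05 cut-offs commonly suggested for VADER. The function name `label_from_compound` and its `threshold` parameter are illustrative, not part of NLTK, and the cut-off may need tuning for short survey answers.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to 'pos', 'neg' or 'neutral'."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neutral"

# usage in the notebook (assumes df['sentiment'] holds compound scores):
# df['label'] = df['sentiment'].apply(label_from_compound)
```

Counting the labels per column (e.g. with `value_counts()`) would then give a simple basis for a bar chart.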

In [82]:
df['sentiment'] = df['Q1a'].apply(get_sentiment)

In [83]:
from keybert import KeyBERT

In [None]:
doc = responses['Q1a']

In [None]:
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
print(keywords)

In [None]:
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2))
print(keywords)

In [None]:
# To diversify the results, we take the 2 x top_n words/phrases most similar to the document (nr_candidates). Then, from all
# top_n-sized combinations of those candidates, we extract the combination whose members are least similar to each other by cosine similarity.

kw_model.extract_keywords(doc, keyphrase_ngram_range=(2, 3), stop_words='english',
                              use_maxsum=True, nr_candidates=20, top_n=10)
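The Max Sum selection described in the comment above can be illustrated on toy vectors. This is a simplified sketch of the idea, not KeyBERT's own code: `max_sum_select`, `cosine` and the 2-D "embeddings" are made up for illustration, and it assumes `nr_candidates >= top_n`.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def max_sum_select(doc_vec, cand_vecs, top_n=2, nr_candidates=4):
    """Return indices of the top_n candidates chosen by the Max Sum idea."""
    # keep the nr_candidates candidates most similar to the document
    ranked = sorted(range(len(cand_vecs)),
                    key=lambda i: cosine(doc_vec, cand_vecs[i]), reverse=True)
    pool = ranked[:nr_candidates]

    # total similarity between all pairs inside one combination
    def pairwise_similarity(combo):
        return sum(cosine(cand_vecs[i], cand_vecs[j])
                   for i, j in combinations(combo, 2))

    # pick the combination whose members are least similar to each other
    return sorted(min(combinations(pool, top_n), key=pairwise_similarity))

# toy example: candidates 0 and 1 are near-duplicates, 2 is different, 3 is off-topic
doc = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.2], [0.5, -1.0], [-1.0, 0.0]]
picked = max_sum_select(doc, candidates, top_n=2, nr_candidates=3)
```

Because every combination is enumerated, the cost grows quickly with `nr_candidates`, which is why the pool is kept small.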

In [None]:
# To diversify the results, we can use Maximal Marginal Relevance (MMR) to create keywords / keyphrases, which is
# also based on cosine similarity. The results with high diversity:

kw_model.extract_keywords(doc, keyphrase_ngram_range=(2, 3), stop_words='english',
                          use_mmr=True, diversity=0.7, top_n=10)
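How MMR trades relevance against redundancy can be shown on toy vectors. The sketch below is a simplified illustration rather than KeyBERT's implementation: `mmr_select`, `cosine` and the 2-D "embeddings" are hypothetical, and `diversity` plays the same role as the parameter passed above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(doc_vec, cand_vecs, top_n=2, diversity=0.7):
    """Return candidate indices in the order MMR selects them."""
    relevance = [cosine(doc_vec, c) for c in cand_vecs]
    # start with the candidate most similar to the document
    selected = [max(range(len(cand_vecs)), key=relevance.__getitem__)]
    remaining = [i for i in range(len(cand_vecs)) if i not in selected]

    while remaining and len(selected) < top_n:
        def mmr_score(i):
            # penalise similarity to anything already selected
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# toy example: candidates 0 and 1 are near-duplicates, 2 is less relevant but different
doc = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.2], [0.5, -1.0]]
diverse = mmr_select(doc, candidates, top_n=2, diversity=0.7)
greedy = mmr_select(doc, candidates, top_n=2, diversity=0.0)
```

With `diversity=0.7` the near-duplicate is skipped in favour of the dissimilar candidate, whereas `diversity=0.0` reduces to a plain ranking by relevance.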

In [None]:
# Guided KeyBERT: seed keywords steer the extraction towards a topic of interest

seed_keywords = ["workload"]
kw_model.extract_keywords(doc, seed_keywords=seed_keywords, keyphrase_ngram_range=(2, 3), stop_words='english')

## Prepare embeddings

In [None]:
doc_embeddings, word_embeddings = kw_model.extract_embeddings(doc, keyphrase_ngram_range=(2, 3), stop_words='english')

In [None]:
kw_model.extract_keywords(doc, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings, keyphrase_ngram_range=(2, 3), stop_words='english')

## Generate keywords for all 7 verbatims

In [None]:
for col_name in responses:
    print(long_col_names[col_name])
    keywords = kw_model.extract_keywords(responses[col_name], keyphrase_ngram_range=(2, 4), stop_words='english', use_mmr=True, diversity=0.7, top_n=10)
    print(keywords, "\n")

In [84]:
with pd.option_context('display.max_colwidth', None):
  display(df[['Q1a', 'sentiment']])

Unnamed: 0,Q1a,sentiment
1,Equality with clinical members,0.0000
2,Bullying by colleagues and managers,-0.5994
3,Working hours and too many responsibilities,0.0000
4,burnout/stress,0.0000
5,Recognizion,0.0000
...,...,...
1252,low salaries,-0.2732
1253,Low salaries,-0.2732
1254,Respect from other departments for the complex work we do,0.4767
1255,limited staff,-0.2263


In [85]:
df['sentiment'] = df['Q1b'].apply(get_sentiment)

In [86]:
with pd.option_context('display.max_colwidth', None):
  display(df[['Q1b', 'sentiment']])

Unnamed: 0,Q1b,sentiment
1,Poor pay,-0.5423
2,"Poorly designed protocols, technical ignorance by colleagues and managers.",-0.3612
3,Salary,0.0000
4,poor management,-0.4767
5,Trust,0.5106
...,...,...
1252,weekend and holiday work,0.4019
1253,Weekend and holiday work,0.4019
1254,Financial compensation for overtime etc.,0.0000
1255,doctors,0.0000


In [95]:
df['sentiment'] = df['Q2a'].apply(get_sentiment)
with pd.option_context('display.max_colwidth', None):
  display(df[['Q2a', 'sentiment']])

Unnamed: 0,Q2a,sentiment
1,Shortage of trained staff,-0.2500
2,Stress of delivering high quality work without sufficient staff.,-0.4215
3,Access to training,0.0000
4,not enough highly trained staff,0.0000
5,Handson training,0.0000
...,...,...
1252,low salaries,-0.2732
1253,Low salary,-0.2732
1254,Financial participation: Owning shares,0.2960
1255,Salaries,0.0000


In [96]:
df['sentiment'] = df['Q2b'].apply(get_sentiment)
with pd.option_context('display.max_colwidth', None):
  display(df[['Q2b', 'sentiment']])

Unnamed: 0,Q2b,sentiment
1,Equality,0.0000
2,Presence of narcissistic individuals destroying mental health of colleagues.,-0.5574
3,Certification,0.0000
4,politics,0.0000
5,Trust,0.5106
...,...,...
1252,weekend and holiday work,0.4019
1253,Weekend and holiday work,0.4019
1254,lack of respect for the work we do by doctors,0.2023
1255,Limit work opportunities,0.3818


In [97]:
df['sentiment'] = df['Q3a'].apply(get_sentiment)
with pd.option_context('display.max_colwidth', None):
  display(df[['Q3a', 'sentiment']])

Unnamed: 0,Q3a,sentiment
1,Improve number,0.4939
2,Screening for sociopathic personality traits needs to be introduced. A 45 minute interview is no way to select for entry into the profession.,-0.2960
3,Add reading and projects into basic workload not in leisure time,0.0000
4,better pay,0.3612
5,Good laboratory training,0.4404
...,...,...
1252,increase salaries,0.3182
1253,Higher salaries,0.0000
1254,Pay them!,-0.1759
1255,Better salaries,0.4404


In [98]:
df['sentiment'] = df['Q3b'].apply(get_sentiment)
with pd.option_context('display.max_colwidth', None):
  display(df[['Q3b', 'sentiment']])

Unnamed: 0,Q3b,sentiment
1,Improve pay,0.3612
2,Clinics must be staffed so that there is a good margin of available workers so that staff are not pressured to work at 110% and then burn out.,0.5523
3,Understand for how many hours a brain can be functional performing lab procedures per day,0.0000
4,better CPD opportunities,0.6705
5,Troubleshooting,0.1779
...,...,...
1252,close over christmas,0.0000
1253,Close over Christmas holidays,0.3818
1254,Invite financial incentives to the embryologists as well. e.g. Shares!,0.7574
1255,Flexible working hours,0.2263


In [99]:
df['sentiment'] = df['Q4'].apply(get_sentiment)
with pd.option_context('display.max_colwidth', None):
  display(df[['Q4', 'sentiment']])

Unnamed: 0,Q4,sentiment
1,Continue,0.0000
2,To survive until retirement without suffering mental illness or making a catestrophic lab error.,0.2760
3,Get my ESHRE certification Publish paper through research Presentations and leading workshops Acquire lab directing skills,0.0000
4,FRCPath,0.0000
5,Be confident in the work i do Academically strong,0.7579
...,...,...
1252,Dont make a big mistake,0.2584
1253,Make babies and make people happy.,0.5719
1254,I reached the glass ceiling. All I can do is hope to study further and further and further.....,0.5106
1255,Laboratory director,0.0000


## Iterate over all columns

In [93]:
for col_name, col in df.items():
    print(col_name)
    # print(series)
    col.apply(get_sentiment)

Q1a
Q1b
Q2a
Q2b
Q3a
Q3b
Q4
sentiment


AttributeError: 'float' object has no attribute 'encode'
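The `AttributeError` most likely arises because the loop also reaches the numeric `sentiment` column added earlier: VADER expects strings, and a float has no `encode` method. One way around this, sketched below, is to score only the question columns and keep each result in its own column instead of overwriting `df['sentiment']`. The names `score_columns` and the `_sentiment` suffix are hypothetical, and the scorer is passed in as an argument so that `get_sentiment` can be reused unchanged.

```python
import pandas as pd

def score_columns(df, columns, scorer):
    """Score each listed text column; return a DataFrame of '<col>_sentiment' columns."""
    scores = pd.DataFrame(index=df.index)
    for col in columns:
        # cast to str so a stray non-string cell cannot break the scorer
        scores[f"{col}_sentiment"] = df[col].astype(str).apply(scorer)
    return scores

# usage in the notebook (assumes df, long_col_names and get_sentiment from above):
# sentiment_scores = score_columns(df, list(long_col_names), get_sentiment)
# sentiment_scores.describe()
```

Keeping the scores in a separate frame leaves `df` untouched, so the loop can be re-run with a different scorer when testing other SA algorithms.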