<h1>CTRL-F like Search</h1>
<p>The dataset contains five people, each with a summary of their career. All but one of them have won prizes or received awards for their work. The first assignment is to figure out who hasn't won any prizes or awards. First I'll show you how I tried doing it in the cells below, using a simple "CTRL+F-like search".

The first cell (#1) uses the <i>pandas.Series.str.contains</i> function with the keyword 'win' and filters the dataset by excluding the people who have the string 'win' in their summary. This narrows it down to three people.
    
After adding more keywords in the second cell (#2), only two rows (persons) were left. Verbs might not be the best keywords, so I decided to use nouns instead: 'prize' and 'award' (#3). This resulted in Rosalind Franklin, who was excluded in the first search (#1) because she had the string 'win' in her summary. I checked the summary, and apparently it doesn't contain the actual word 'win'; the search detected the string 'win' inside 'owing':<br>
<i>'owing to disagreement with her director, john randall, and her colleague maurice wilkins'.</i><br>

This approach gives unreliable and wrong results because it only matches character strings, not words, which shows the importance of tokenizing the text beforehand. The assignment starts after the next three cells; go through them first and check the output of the <i>pandas.Series.str.contains</i> function.
</p>



In [1]:
''' !pip install spacy
!pip install spacy-transformers
!pip install wikipedia
!pip install neo4j '''




In [94]:
import pandas as pd
from pathlib import Path

#read the assignment's dataset
df = pd.read_excel(Path('assigment1_data.xlsx'))

In [95]:
#1
keyword = 'win'
df[~df.summary.str.contains('(?i)'+keyword)]

Unnamed: 0,person,summary
0,Jennifer Doudna,"Jennifer Anne Doudna (; born February 19, 196..."
3,Gertrude Elion,"Gertrude ""Trudy"" Belle Elion (January 23, 1918..."
4,Rita Levi-Montalcini,Rita Levi-Montalcini (22 April 1909 – 30 Decem...


In [6]:
#2
keyword = 'win|receive'
df[~df.summary.str.contains('(?i)'+keyword)]

Unnamed: 0,person,summary
3,Gertrude Elion,"Gertrude ""Trudy"" Belle Elion (January 23, 1918..."
4,Rita Levi-Montalcini,Rita Levi-Montalcini (22 April 1909 – 30 Decem...


In [6]:
#3
keyword = 'prize|award'
df[~df.summary.str.contains('(?i)'+keyword)]

Unnamed: 0,person,summary
2,Rosalind Franklin,Rosalind Elsie Franklin (25 July 1920 – 16 Apr...
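
The false positive in cell #1 comes from substring matching. As a minimal sketch (using only Python's `re` module and the sentence fragment quoted earlier, not the assignment's dataset), a word-boundary pattern avoids matching 'win' inside 'owing'. The helper names are hypothetical.

```python
import re

# Fragment quoted earlier from Rosalind Franklin's summary.
text = ("owing to disagreement with her director, john randall, "
        "and her colleague maurice wilkins")

def contains_substring(text, keyword):
    """CTRL+F-like search: true if the keyword occurs anywhere, even inside a word."""
    return re.search(re.escape(keyword), text, flags=re.IGNORECASE) is not None

def contains_word(text, keyword):
    """Whole-word search: the keyword must start and end at a word boundary."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

substring_hit = contains_substring(text, "win")  # matches inside 'owing'
word_hit = contains_word(text, "win")            # no standalone word 'win'
print(substring_hit, word_hit)
```

The same pattern can be passed to pandas, for example `df.summary.str.contains(r'\bwin\b', case=False)`. It would still miss inflected forms such as 'won', which is where lemmatization helps.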


<b>Assignment 1:</b> <br>
As mentioned before, the first assignment is to find out which person hasn't won an award. The Python NLP package spaCy is used to improve the 'CTRL-F like search'.

First we use tokenization and lemmatization to match the keywords (relating to awards) with words in the summaries of the five persons. For information about tokenization, lemmatization, and Named Entity Recognition, please refer to the previous Techathon NLP II:  
https://github.com/DataScienceOrdina/techathon-NLP-II/blob/main/spacy-hackathon-presentation-november-2022.pdf  
https://github.com/DataScienceOrdina/techathon-NLP-II/blob/main/Assignments/Assignment-1-spaCy-101.ipynb
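
To make the idea concrete before loading spaCy: tokenization splits the text into words, and lemmatization maps each word to its base form, so that 'won' and 'awards' match the keywords 'win' and 'award'. The sketch below uses a regex tokenizer and a tiny hand-written lemma table as stand-ins. Both are assumptions for illustration only; spaCy's tokenizer and lemmatizer are far more complete.

```python
import re

# Hypothetical mini lemma table; spaCy derives lemmas from its pipeline instead.
TOY_LEMMAS = {"won": "win", "wins": "win", "awards": "award",
              "awarded": "award", "prizes": "prize", "received": "receive"}

def tokenize(text):
    """Split text into lowercase word tokens (a crude stand-in for spaCy's tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def lemmatize(token):
    """Return the base form from the toy table, or the token itself if unknown."""
    return TOY_LEMMAS.get(token, token)

def keyword_hits(text, keywords):
    """Return the tokens whose lemma equals one of the keywords."""
    return [tok for tok in tokenize(text) if lemmatize(tok) in keywords]

keywords = {"win", "award", "prize", "receive"}
hits_award = keyword_hits("She won several awards", keywords)
hits_owing = keyword_hits("owing to disagreement with her director", keywords)
print(hits_award)  # ['won', 'awards']
print(hits_owing)  # []
```

Because whole tokens are compared, 'owing' no longer counts as a hit for 'win'.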

In [26]:
import pandas as pd
from pathlib import Path
import spacy
!python -m spacy download en_core_web_sm
import re

#read the assignment's dataset
df = pd.read_excel(Path('assigment1_data.xlsx'))

#Language class with the English model 'en_core_web_sm' is loaded
nlp = spacy.load('en_core_web_sm')

#Bonus question: "cleaning" data important or not?
def alphanumericalOnly(text):
    return re.sub(r'[^a-zA-Z0-9 ]', '', text).lower()

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 12.8/12.8 MB 6.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
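
On the bonus question, it may help to see what `alphanumericalOnly` actually does to a sentence. The sketch below repeats the function from the cell above and applies it to a made-up example sentence (not taken from the dataset). Punctuation, hyphens and capitalization are all removed, which is worth keeping in mind because spaCy's tokenizer and entity recognizer may rely on exactly those cues.

```python
import re

def alphanumericalOnly(text):
    # Same function as in the notebook: keep letters, digits and spaces, then lowercase.
    return re.sub(r'[^a-zA-Z0-9 ]', '', text).lower()

# Made-up sentence for illustration only.
sample = "Dr. Levi-Montalcini won the Nobel Prize!"
cleaned = alphanumericalOnly(sample)
print(cleaned)  # dr levimontalcini won the nobel prize
```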


You can try to play around with different kinds of _keywords_ to improve the search.

In [16]:
keywords = ["win", "award", "prize", "receive"]
keywords_nlp = [nlp(k) for k in keywords]
docs = [nlp(doc) for doc in df['summary']] 
spacy.displacy.render(docs, style='ent')

In [17]:
awardCounter = {"Jennifer":0, "Rachel":0, "Rosalind":0,
                "Gertrude":0, "Rita":0}
for doc in docs:
    for token in doc:
        for k in keywords_nlp:
            if (token.lemma_.casefold() == k[0].lemma_):
                awardCounter[doc[0].text] += 1
                print(f"Person: {doc[0].text}, token: {token}, "
                      #f"token_lemma: {token.lemma_.casefold()}, " 
                      f"keyword: {k[0].lemma_}")

print(f"\n{awardCounter}")

Person: Jennifer, token: received, keyword: receive
Person: Jennifer, token: Prize, keyword: prize
Person: Jennifer, token: awards, keyword: award
Person: Jennifer, token: Award, keyword: award
Person: Jennifer, token: Prize, keyword: prize
Person: Jennifer, token: Prize, keyword: prize
Person: Jennifer, token: Prize, keyword: prize
Person: Jennifer, token: Award, keyword: award
Person: Jennifer, token: Prize, keyword: prize
Person: Rachel, token: won, keyword: win
Person: Rachel, token: Award, keyword: award
Person: Rachel, token: awarded, keyword: award
Person: Gertrude, token: Prize, keyword: prize
Person: Rita, token: awarded, keyword: award
Person: Rita, token: Prize, keyword: prize

{'Jennifer': 9, 'Rachel': 3, 'Rosalind': 0, 'Gertrude': 1, 'Rita': 2}


Now we will use word embeddings to find matches with our search words. Word embeddings were introduced in a previous Techathon, but for a quick refresher you can look them up here:

https://en.wikipedia.org/wiki/Word_embedding  
https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/
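
As a rough sketch of what happens under the hood: spaCy's default `similarity` is the cosine similarity between two word vectors. The three-dimensional vectors below are invented for illustration (real `en_core_web_md` vectors have 300 dimensions), and the function name is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A token without a word vector has an all-zero vector, so no angle is
        # defined; this sketch simply returns 0.0 in that case.
        return 0.0
    return dot / (norm_a * norm_b)

# Invented toy vectors, not real embeddings.
award = [1.0, 2.0, 0.0]
prize = [2.0, 4.0, 0.0]    # same direction as 'award'
banana = [0.0, 0.0, 3.0]   # orthogonal to 'award'
unknown = [0.0, 0.0, 0.0]  # stands in for a token without a vector

print(cosine_similarity(award, prize))
print(cosine_similarity(award, banana))
print(cosine_similarity(award, unknown))
```

Vectors pointing in the same direction score close to 1, unrelated directions score close to 0, and the length of the vectors does not matter.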

In [10]:
#Download medium sized language model, which includes word embeddings
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.1/en_core_web_md-3.4.1-py3-none-any.whl (42.8 MB)
     --------------------------------------- 42.8/42.8 MB 17.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


Try playing around with different settings for _similarityThreshold_, as well as with different _keywords_.

In [25]:
nlp = spacy.load('en_core_web_md')
keywords = ["win", "award", "prize", "receive"]
keywords_nlp = [nlp(k) for k in keywords]
docs = [nlp(doc) for doc in df['summary']] 

awardCounter = {"Jennifer":0, "Rachel":0, "Rosalind":0,
                "Gertrude":0, "Rita":0}

similarityThreshold = 0.5
for doc in docs:
    for token in doc:
        for k in keywords_nlp:
            similarityScore = token.similarity(k)
            if similarityScore > similarityThreshold:
                awardCounter[doc[0].text] += 1
                print(f'Person: {doc[0].text}, {token} <-> {k}, '
                    f'similarity: {similarityScore}')

print(f"\n{awardCounter}")

#UserWarning: [W008] Evaluating Token.similarity based on empty vectors. 
# -> when evaluating unknown tokens that have no valid word vector



Person: Jennifer, received <-> receive, similarity: 0.7367027486482727
Person: Jennifer, Prize <-> award, similarity: 0.6396464987419995
Person: Jennifer, Prize <-> prize, similarity: 0.6891455645215978
Person: Jennifer, prestigious <-> award, similarity: 0.5954339461101952
Person: Jennifer, awards <-> award, similarity: 0.8765914436826451
Person: Jennifer, awards <-> prize, similarity: 0.6013163726404726
Person: Jennifer, Award <-> award, similarity: 0.7827086338667512
Person: Jennifer, Prize <-> award, similarity: 0.6396464987419995
Person: Jennifer, Prize <-> prize, similarity: 0.6891455645215978
Person: Jennifer, recipient <-> award, similarity: 0.5305744256883699
Person: Jennifer, recipient <-> receive, similarity: 0.559757817169588
Person: Jennifer, Prize <-> award, similarity: 0.6396464987419995
Person: Jennifer, Prize <-> prize, similarity: 0.6891455645215978
Person: Jennifer, Prize <-> award, similarity: 0.6396464987419995
Person: Jennifer, Prize <-> prize, similarity: 0.68914
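
To get a feel for the threshold, the sketch below reuses the distinct token/keyword pairs from the output above (scores rounded to four decimals) and counts how many would survive different values of `similarityThreshold`. Pairs such as 'prestigious' <-> 'award' pass at 0.5 but drop out at 0.6. The helper name is hypothetical.

```python
# (token, keyword, similarity) copied from the output above, rounded to four decimals.
scores = [
    ("received", "receive", 0.7367),
    ("Prize", "award", 0.6396),
    ("Prize", "prize", 0.6891),
    ("prestigious", "award", 0.5954),
    ("awards", "award", 0.8766),
    ("awards", "prize", 0.6013),
    ("Award", "award", 0.7827),
    ("recipient", "award", 0.5306),
    ("recipient", "receive", 0.5598),
]

def matches_above(scores, threshold):
    """Return the (token, keyword) pairs whose similarity exceeds the threshold."""
    return [(tok, kw) for tok, kw, sim in scores if sim > threshold]

for threshold in (0.5, 0.6, 0.7):
    kept = matches_above(scores, threshold)
    print(f"threshold {threshold}: {len(kept)} matches -> {kept}")
```

A higher threshold removes loosely related words, but it can also remove legitimate matches, so it is a trade-off rather than a setting with one correct value.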

<b>Assignment 2:</b>
Now do it yourself: find out which one of the five persons worked at a Public Ivy School. Public Ivy is an informal term for prestigious public universities in the United States of America that are considered comparable to the Ivy League.
(See: https://en.wikipedia.org/wiki/Public_Ivy)

Hint: The words _Public Ivy School_ themselves are not in the data.

In [27]:
nlp = spacy.load('en_core_web_md')
keyword_nlp = nlp("Public Ivy School")
docs = [nlp(doc) for doc in df['summary']] 

awardCounter = {"Jennifer":0, "Rachel":0, "Rosalind":0,
                "Gertrude":0, "Rita":0}

similarityThreshold = 0.5
for doc in docs:
    for token in doc:
        similarityScore = token.similarity(keyword_nlp)
        if similarityScore > similarityThreshold:
            awardCounter[doc[0].text] += 1
            print(f'Person: {doc[0].text}, {token} <-> {keyword_nlp}, '
                  f'similarity: {similarityScore}')

print(f"\n{awardCounter}")


Person: Jennifer, University <-> Public Ivy School, similarity: 0.6291199222892067
Person: Jennifer, Institute <-> Public Ivy School, similarity: 0.5324329089480745
Person: Jennifer, graduated <-> Public Ivy School, similarity: 0.5178858279667898
Person: Jennifer, College <-> Public Ivy School, similarity: 0.6558392707641082
Person: Jennifer, Harvard <-> Public Ivy School, similarity: 0.5049717602257867
Person: Jennifer, School <-> Public Ivy School, similarity: 0.8072080710135979
Person: Jennifer, Institute <-> Public Ivy School, similarity: 0.5324329089480745
Person: Jennifer, Institutes <-> Public Ivy School, similarity: 0.5055165265838937
Person: Jennifer, University <-> Public Ivy School, similarity: 0.6291199222892067
Person: Rosalind, graduated <-> Public Ivy School, similarity: 0.5178858279667898
Person: Rosalind, College <-> Public Ivy School, similarity: 0.6558392707641082
Person: Rosalind, University <-> Public Ivy School, similarity: 0.6291199222892067
Person: Rosalind, Col



In [None]:
#Hint: comparing tokens versus comparing entities. Which would work better?
spacy.displacy.render(docs, style='ent')

In [6]:
#Answer: now comparing entities instead of tokens.
nlp = spacy.load('en_core_web_md') 
keyword_nlp = nlp("Public Ivy School")
docs = [nlp(doc) for doc in df['summary']]

awardCounter = {"Jennifer":0, "Rachel":0, "Rosalind":0,
                "Gertrude":0, "Rita":0}

similarityThreshold = 0.5
for doc in docs:
    for ent in doc.ents:
        similarityScore = ent.similarity(keyword_nlp)
        if similarityScore > similarityThreshold:
            awardCounter[doc[0].text] += 1
            print(f'Person: {doc[0].text}, {ent} <-> {keyword_nlp}, '
                  f'similarity: {similarityScore}')

print(f"\n{awardCounter}")

Person: Jennifer, the Howard Hughes Medical Institute <-> Public Ivy School, similarity: 0.507655762343079
Person: Jennifer, Pomona College <-> Public Ivy School, similarity: 0.681009594596115
Person: Jennifer, Harvard Medical School <-> Public Ivy School, similarity: 0.7992868418085236
Person: Jennifer, Lawrence Berkeley National Laboratory <-> Public Ivy School, similarity: 0.5833282496347029
Person: Rosalind, Newnham College <-> Public Ivy School, similarity: 0.6706007641001862
Person: Rosalind, Birkbeck College <-> Public Ivy School, similarity: 0.6191564816684598

{'Jennifer': 4, 'Rachel': 0, 'Rosalind': 2, 'Gertrude': 0, 'Rita': 0}
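
One reason entities behave differently from single tokens: by default spaCy represents a `Span` (and a `Doc`) by the average of its token vectors, so 'Harvard Medical School' is compared as a whole rather than word by word. A minimal sketch with invented two-dimensional vectors (the numbers and helper names are assumptions, not real embeddings):

```python
import math

def mean_vector(vectors):
    """Element-wise average of equally long vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Invented toy vectors for illustration only.
toy = {"harvard": [1.0, 0.0], "medical": [0.0, 1.0], "school": [1.0, 1.0]}
query = [1.0, 1.0]  # stands in for the keyword vector

entity_vector = mean_vector([toy["harvard"], toy["medical"], toy["school"]])
token_score = cosine_similarity(toy["harvard"], query)
entity_score = cosine_similarity(entity_vector, query)
print(entity_vector, token_score, entity_score)
```

In the cells above, `nlp("Public Ivy School")` is itself a `Doc`, so its vector is also an average over three tokens. That may explain why generic words such as 'School' and 'College' dominate the matches.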
