**Loading dataset from Kaggle**

Competition page: https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes

In [1]:
!pip install kaggle



In [2]:
# Upload kaggle.json (your Kaggle API token) to /content before running this cell
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

In [3]:
!kaggle competitions download -c nbme-score-clinical-patient-notes

Downloading nbme-score-clinical-patient-notes.zip to /content


In [4]:
!unzip *.zip

Archive:  nbme-score-clinical-patient-notes.zip
  inflating: features.csv            
  inflating: patient_notes.csv       
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


### Installing Libraries

In [5]:
!pip install rake-nltk -q

In [6]:
!pip install requests-html


### Importing Libraries

In [7]:
import numpy as np
import pandas as pd
from rake_nltk import Rake

import nltk
nltk.download('stopwords')
nltk.download('punkt')

from requests_html import HTMLSession

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [8]:
clinical_notes = pd.read_csv("/content/patient_notes.csv")

In [9]:
clinical_notes.pn_history[5]

"17 yo m, presenting with palpitations/increased heart rate, 5-6 episodes in the past few months, most recently 2 days ago. 2 days ago, episode was associated with chest pressure, shortness of breath, and lightheadedness.\r\nNo diaphoresis, no vomiting, no tremor, no loss of consciousness\r\nNo fever, no nausea or vomiting or diarrhea\r\nNo rash, no change in skin colour\r\nAppetite good\r\nDenies anxiety prior to or during this episode\r\nPHx: healthy\r\nMedications: Taking friend's prescription Adderral to help study for tests (not prescribed for him); taking a few times per week - reported no temporal relation to palpitations\r\nAllergies: none\r\nSubstances: no cigarettes, 3-4 beers on weekends, Adderall; no cocaine use or other substances\r\nSocial history: freshman in college - reported no concerns about school\r\nFamily history: father - MI at 52, mother - thyroid disease"

In [10]:
# Tokens listed here are treated as phrase delimiters, alongside the default stopwords
r = Rake(punctuations=[')', '(', ',', ':', '),', ').', '.'])

In [11]:
r.extract_keywords_from_text(clinical_notes.pn_history[500])

In [12]:
phrase_df = pd.DataFrame(r.get_ranked_phrases_with_scores(), columns = ['score','phrase'])

In [13]:
phrase_df.loc[phrase_df.score>3]

   score      phrase
0  35.5       recent episode happened one day ago
1  30.0       heart pounding 1 - 2 times
2  25.5       began college 4 months ago
3  25.333333  adderall 2 - 3 times
4  23.833333  etoh 3 - 4 drinks
5  15.0       also reports using coffee
6  10.0       taken adderall night
7  9.666667   paternal heart attack
8  9.0        permission per mother
9  9.0        maternal thyroid issue
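
To make the scores above less opaque, here is a minimal pure-Python sketch of the degree-to-frequency scoring that RAKE is based on. It is not the `rake_nltk` implementation: the stopword list, the punctuation pattern, and the function names `candidate_phrases` and `rake_scores` are assumptions made for illustration, so the numbers will not match the library's output exactly.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; rake_nltk uses NLTK's English list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "for", "to", "with", "on"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation ends a phrase, so split into chunks on it first.
    for chunk in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOPWORDS:
                # A stopword also ends the current phrase.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Return (score, phrase) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # co-occurrence count, including the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    # Word score is degree / frequency; phrase score is the sum over its words.
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(((s, p) for p, s in scored.items()), reverse=True)


sample = ("Patient reports chest pain and shortness of breath. "
          "Chest pain started two days ago.")
for score, phrase in rake_scores(sample):
    print(score, phrase)
```

Longer phrases tend to score higher because every word in a phrase adds the phrase length to its degree, which is why multi-word phrases dominate the top of the ranking.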


# RAKE and Web Scraping

In [14]:
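def extract_text():
    """Fetch the review page and return the text of its article body."""
    s = HTMLSession()
    url = 'https://www.musicradar.com/reviews/tech/akg-c214-172209'
    response = s.get(url)
    # 'div#article-body' is the container holding the review text on this page
    return response.html.find('div#article-body', first=True).text


# Rank phrases with RAKE's default settings and print those scoring above 5
r = Rake()
r.extract_keywords_from_text(extract_text())
for rating, keyword in r.get_ranked_phrases_with_scores():
    if rating > 5:
        print(rating, keyword)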

16.0 incredible 143db dynamic range
16.0 accurately capturing transient details
15.0 technical excellence ever since
15.0 rugged double mesh grille
15.0 phantom power sources ranging
14.666666666666666 terminated large diaphragm capsule
14.285714285714285 new c214 large diaphragm
14.285714285714285 c214 sounds extremely similar
9.0 without unduly affecting
9.0 spider suspension mount
9.0 solid bottom end
9.0 resistant matte greyish
9.0 remove unwanted low
9.0 one pickup pattern
9.0 incredibly low self
9.0 improving sonic accuracy
9.0 cast body provides
9.0 60 years ago
8.666666666666666 integrated capsule suspension
8.5 metal carrying case
8.333333333333334 160hz bass roll
8.0 reduces mechanical noise
7.8 purpose studio microphone
7.8 fire condenser microphone
7.75 prevent electric guitars
7.75 filter also minimises
7.619047619047619 c214 works well
7.6 loud sound sources
7.583333333333334 also performs well
7.166666666666667 frequency response curve
5.333333333333334 double bass


# YAKE

For YAKE (Yet Another Keyword Extractor), see the package page: https://pypi.org/project/yake/

### Homework: Implement keyword extraction using any algorithm other than RAKE

---