The main purpose of this project is to extract the reviews that patients or their families have written about doctors. Reviews of the same doctor are then processed so that similar opinions are retained and condensed into a brief, general summary for that doctor. Specifically, a graphical window with this summary will appear when a site user hovers over the doctor's info.

### Import libraries and the review file

In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/teohangxanh/Practice-Data-Science/master/FarLandMD/zocdoc%20reviews.csv', encoding = "ISO-8859-1", index_col=None)
pd.set_option('max_colwidth', 200)
pd.set_option("display.max_rows", 6)

In [2]:
df.columns

Index(['Doctor', 'Reviews'], dtype='object')

### Clean the dataset

In [3]:
df.head()

Unnamed: 0,Doctor,Reviews
0,"Dr. Jon Biorkman, MD",['This review is an overdue huge thank you for an extraordinary medical doctor whose care and professionalism helped me even in his absence: I was Dr. Biorkman\'s patient while in Irvine until 20...
1,,
2,"Richard McConkie, FNP-C","[""I was so pleased with the kindness and care of the staff at west valley. This was my first visit and it felt like I'd been going there forever. Dr Richard was very empathic and understanding. Li..."
3,,
4,"Dr. Crystal Song, NMD","[""I go monthly to Dr Song to strengthen my abdomen which has a hernia caused by a botched surgery 2 years ago. I am extremely satisfied by the high expertise and genuine care that Dr Song gives m..."


In [4]:
# Remove rows that have missing data in all columns: Doctor and Reviews
df.dropna(how='all', inplace=True)

In [5]:
df.head()

Unnamed: 0,Doctor,Reviews
0,"Dr. Jon Biorkman, MD",['This review is an overdue huge thank you for an extraordinary medical doctor whose care and professionalism helped me even in his absence: I was Dr. Biorkman\'s patient while in Irvine until 20...
2,"Richard McConkie, FNP-C","[""I was so pleased with the kindness and care of the staff at west valley. This was my first visit and it felt like I'd been going there forever. Dr Richard was very empathic and understanding. Li..."
4,"Dr. Crystal Song, NMD","[""I go monthly to Dr Song to strengthen my abdomen which has a hernia caused by a botched surgery 2 years ago. I am extremely satisfied by the high expertise and genuine care that Dr Song gives m..."
6,"Dr. Christopher Ciccone, MD",['Excellent service and advice as well as a perscrition for spider bite infection\n\nDr. Ciccone has been my doc for 25 years and is now the family doc for my children and husband too! His offic...
8,"Dr. Martin Maag, MD","['Very positive and encouraging, time well spent.\n\nFriendly, professional, and knowledgeable.\n\nits not often that the Doctor is the one to greet you and sit you down in the consultation room, ..."


In [6]:
print(len(df.Doctor.unique()) == df.shape[0])
print(len(df.Doctor.unique()))

True
36


Each doctor appears in exactly one row: the data set has 36 rows and 36 distinct doctors, so no doctor has reviews (from different patients or families) spread across several rows.

In [7]:
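# Remove meaningless symbols from reviews: square brackets, apostrophes, newlines, backslashes, and double spaces
pattern = '|'.join([r'\[', r'\]', r"'", r'\n', r'\\', '  '])
# regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching
df.Reviews = df.Reviews.str.replace(pattern, '', regex=True)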
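
To see what this kind of pattern does on a single string, here is a minimal standard-library sketch. It uses `re.sub` instead of pandas, and the sample text and the helper name `clean_review` are made up for illustration; they are not part of the data set or the notebook.

```python
import re

# Alternatives to strip: square brackets, apostrophes, newline characters,
# backslashes, and runs of two spaces (same idea as the pattern above).
CLEANUP_PATTERN = re.compile('|'.join([r'\[', r'\]', r"'", r'\n', r'\\', '  ']))


def clean_review(text):
    """Remove every match of CLEANUP_PATTERN from text (hypothetical helper)."""
    return CLEANUP_PATTERN.sub('', text)


# Invented sample that mimics a stringified list of reviews
sample = "['Great doctor.\n\nDr. Smith\\'s staff was kind.']"
print(clean_review(sample))
```

Note that stripping apostrophes also turns possessives and contractions into single tokens (for example "Smith's" becomes "Smiths"), which is visible in the cleaned review below.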

In [8]:
# Have a look at one review
df.iloc[0, 1]

'This review is an overdue huge thank you for an extraordinary medical doctor whose care and professionalism helped me even in his absence:I was Dr. Biorkmans patient while in Irvine until 2006, when I moved to N. Cal.When I was diagnosed 7 years ago with a mass, and I feared for my life, I looked in the mirror and asked: "When would you not be afraid?"The answer came promptly: "If I saw MY doctor" (Dr. Biorkman). So even if he does not know, he is still MY doctor - the one I always trusted,had a professional, accurate answer with no drama, got it right every single time, and helped me stay healthy.Having a doctor one trusts is essential.Dr. Biorkman has been our doctor for almost 30 years. He knows our family and cares about our health. He takes his time to listen and provides thoughtful and excellent care to every member of our family. Over the years we have made sure any insurance plan we choose gives us access to Dr. Biorkman., "Dr. Biorkman has always, and I mean always, listened 

In [9]:
import spacy
from spacy.lang.en import English

nlp = spacy.load('en_core_web_lg')
# The first review is used to test
doc = nlp(df.iloc[0, 1])

In [10]:
# Create a list of word tokens, dropping stop words and punctuation
filtered_words = [token.text for token in doc
                  if not token.is_stop and not token.is_punct]
print(filtered_words)

['review', 'overdue', 'huge', 'thank', 'extraordinary', 'medical', 'doctor', 'care', 'professionalism', 'helped', 'absence', 'Dr.', 'Biorkmans', 'patient', 'Irvine', '2006', 'moved', 'N.', 'Cal', 'diagnosed', '7', 'years', 'ago', 'mass', 'feared', 'life', 'looked', 'mirror', 'asked', 'afraid?"The', 'answer', 'came', 'promptly', 'saw', 'doctor', 'Dr.', 'Biorkman', 'know', 'doctor', 'trusted', 'professional', 'accurate', 'answer', 'drama', 'got', 'right', 'single', 'time', 'helped', 'stay', 'healthy', 'Having', 'doctor', 'trusts', 'essential', 'Dr', 'Biorkman', 'doctor', '30', 'years', 'knows', 'family', 'cares', 'health', 'takes', 'time', 'listen', 'provides', 'thoughtful', 'excellent', 'care', 'member', 'family', 'years', 'sure', 'insurance', 'plan', 'choose', 'gives', 'access', 'Dr.', 'Biorkman', 'Dr.', 'Biorkman', 'mean', 'listened', 'carefully', 'concerns', 'advised', 'sagely', 'cared', 'person', 'nt', 'matter', 'large', 'small', 'health', 'issue', 'provides', 'comfort', 'care', 'Pr

In [11]:
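# Create the pipeline 'sentencizer' component
sbd = nlp.create_pipe('sentencizer')

# Add the component to the pipeline
# (spaCy 2.x API; in spaCy 3.x this is nlp.add_pipe('sentencizer'))
nlp.add_pipe(sbd)

# Create a list of sentence strings
# Note: doc was processed before the component was added, so doc.sents
# reflects the sentence boundaries already set on doc
sents_list = [sent.text for sent in doc.sents]
print(sents_list)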

['This review is an overdue huge thank you for an extraordinary medical doctor whose care and professionalism helped me even in his absence:I was Dr. Biorkmans patient while in Irvine until 2006, when I moved to N. Cal.', 'When I was diagnosed 7 years ago with a mass, and I feared for my life, I looked in the mirror and asked: "When would you not be afraid?"The answer came promptly:', '"If I saw MY doctor" (Dr. Biorkman).', 'So even if he does not know, he is still MY doctor - the one I always trusted,had a professional, accurate answer with no drama, got it right every single time, and helped me stay healthy.', 'Having a doctor one trusts is essential.', 'Dr.', 'Biorkman has been our doctor for almost 30 years.', 'He knows our family and cares about our health.', 'He takes his time to listen and provides thoughtful and excellent care to every member of our family.', 'Over the years we have made sure any insurance plan we choose gives us access to Dr. Biorkman., "Dr. Biorkman has alway

### Use the spaCy Matcher to find patterns

In [12]:
from spacy.matcher import Matcher
doc = nlp(df.iloc[1, 1])
matcher = Matcher(nlp.vocab)
p1 = [{'ORTH': 'Dr.', 'OP': '*'},
           {'ENT_TYPE': 'PERSON'},
           {'POS': 'ADV', 'OP': '*'},
           {'LEMMA': 'be'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'}]
p2 = [{'LOWER':{'IN': ['he', 'she']}},
           {'POS': 'ADV', 'OP': '*'},
           {'LEMMA': 'be'},
           {'POS': 'ADV', 'OP': '*'},
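           # Note: 'CARDINAL' is an entity label, not a dependency label,
           # so this optional token is not expected to match anything
           {'DEP': 'CARDINAL', 'OP': '*'},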
           {'POS': 'ADJ'},
           {'POS': 'NOUN', 'OP': '*'}]
p3 = [{'LOWER':{'IN': ['he', 'she']}},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'VERB'},
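           # Note: 'TRUE' is not a dependency label produced by the English models,
           # so this optional token is not expected to match anything
           {'DEP': 'TRUE', 'OP': '*'},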
           {'POS': 'ADJ', 'OP': '*'},
           {'POS': 'NOUN'}]
p4 = [{'ORTH': 'Dr.', 'OP': '*'},
           {'ENT_TYPE': 'PERSON'},
           {'POS': 'ADV', 'OP': '*'},
           {'LEMMA': 'be'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ', 'OP': '+'},
           {'ORTH': 'and', 'OP': '*'},
           {}]
p5 = [{'POS': 'NOUN'},
           {'LEMMA': 'be'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'},
           {'ORTH': ',', 'OP': '*'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'},
           {'ORTH': ',', 'OP': '*'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'},
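           # Note: each dict describes a single token, and ',' and 'and' are
           # separate tokens, so this optional token is not expected to match
           {'ORTH': ', and', 'OP': '*'},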
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'}]
p6 = [{'LOWER':{'IN': ['he', 'she']}},
           {'LEMMA': 'be'},
           {'POS': 'ADJ'},
           {'ORTH': ',', 'OP': '?'},
           {'POS': 'ADJ', 'OP': '?'},
           {'ORTH': ',', 'OP': '?'}, 
           {'ORTH': 'and', 'OP': '?'},
           {'POS': 'ADJ', 'OP': '?'}]
patterns = [p1, p2, p3, p4, p5, p6]
# spaCy 2.x signature; in spaCy 3.x this is matcher.add("review", patterns)
matcher.add("review", None, *patterns)

# Store the text of each matched span in span_storage
span_storage = [doc[start:end].text for match_id, start, end in matcher(doc)]

In [13]:
span_storage

['Richard was very empathic',
 'Richard was very empathic and',
 'Richard was very empathic and understanding',
 'Richard was amazingly kind',
 'Richard was amazingly kind and',
 'Richard was amazingly kind and patient',
 'Dr. McConkie was very thorough',
 'McConkie was very thorough',
 'Dr. McConkie was very thorough with',
 'McConkie was very thorough with',
 'McConkie was exceptional',
 'McConkie was exceptional.',
 'He is compassionate',
 'He is compassionate,',
 'He is compassionate, professional',
 'He is compassionate, professional and',
 'He is compassionate, professional and knowledgeable',
 'Richard is really good',
 'Richard is really good at']

In [14]:
# Create a matrix storing the semantic similarity of every pair of phrases in sim_table
import numpy as np
# Parse each phrase once instead of once per pair
phrase_docs = [nlp(phrase) for phrase in span_storage]
sim_table = np.array([[d1.similarity(d2) for d2 in phrase_docs]
                      for d1 in phrase_docs], dtype=float)
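
As far as I know, spaCy's default `Doc.similarity` is a cosine similarity over the documents' (averaged) word vectors. The sketch below illustrates that measure in plain Python on toy two-dimensional vectors. The vectors and the helper names `cosine_similarity` and `similarity_matrix` are assumptions made for illustration; real spaCy vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, analogous in shape to sim_table."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors standing in for phrase vectors (invented values)
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
matrix = similarity_matrix(toy_vectors)
for row in matrix:
    print([round(value, 3) for value in row])
```

The matrix is symmetric with ones on the diagonal, which is why the rows and columns of `sim_table` mirror each other.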

In [15]:
# Convert the matrix to a DataFrame for easier inspection (pandas is already imported as pd)
sim_table = pd.DataFrame(data=sim_table, columns=span_storage, index=span_storage)

In [16]:
sim_table.head()

Unnamed: 0,Richard was very empathic,Richard was very empathic and,Richard was very empathic and understanding,Richard was amazingly kind,Richard was amazingly kind and,Richard was amazingly kind and patient,Dr. McConkie was very thorough,McConkie was very thorough,Dr. McConkie was very thorough with,McConkie was very thorough with,McConkie was exceptional,McConkie was exceptional.,He is compassionate,"He is compassionate,","He is compassionate, professional","He is compassionate, professional and","He is compassionate, professional and knowledgeable",Richard is really good,Richard is really good at
Richard was very empathic,1.0,0.968968,0.934621,0.862182,0.838039,0.830251,0.768914,0.728817,0.755126,0.711325,0.642493,0.636033,0.687397,0.66598,0.654049,0.651136,0.651739,0.793777,0.762364
Richard was very empathic and,0.968968,1.0,0.971865,0.8701,0.900403,0.891078,0.793269,0.771259,0.808346,0.783203,0.693878,0.703229,0.727726,0.727142,0.719464,0.750835,0.739513,0.818753,0.79518
Richard was very empathic and understanding,0.934621,0.971865,1.0,0.829036,0.866894,0.877677,0.794047,0.788057,0.814128,0.802807,0.692426,0.706004,0.755704,0.749318,0.752639,0.78061,0.779111,0.803617,0.782999
Richard was amazingly kind,0.862182,0.8701,0.829036,1.0,0.973329,0.92091,0.803244,0.781206,0.799558,0.771587,0.732172,0.728514,0.690763,0.692774,0.679183,0.687295,0.671169,0.887896,0.853151
Richard was amazingly kind and,0.838039,0.900403,0.866894,0.973329,1.0,0.953058,0.809739,0.803016,0.832595,0.820888,0.758224,0.769506,0.718681,0.737549,0.728503,0.768081,0.742994,0.885445,0.859264


From my observation, phrases extracted from the same sentence (for example, 'Richard was amazingly kind', 'Richard was amazingly kind and', and 'Richard was amazingly kind and patient') have a semantic similarity of at least 0.90. I therefore group phrases whose similarity is at least 0.91 and keep only the longest phrase in each group.

In [17]:
# Convert sim_table back to numpy format
sim_table = sim_table.to_numpy()

In [18]:
def similar_phrases(j, a_list, threshold=.91):
    '''Return the set of indices in a_list whose values are at least threshold.

    j is unused; it is kept so that existing calls keep working.'''
    return {i for i, value in enumerate(a_list) if value >= threshold}

In [19]:
print(sim_table)

[[1.         0.96896847 0.93462086 0.86218214 0.83803926 0.83025121
  0.76891442 0.72881661 0.75512585 0.71132539 0.6424933  0.63603267
  0.68739728 0.66598042 0.6540486  0.65113646 0.65173863 0.79377731
  0.7623641 ]
 [0.96896847 1.         0.97186464 0.87009984 0.90040294 0.89107801
  0.79326887 0.77125891 0.80834557 0.78320344 0.69387805 0.70322945
  0.72772649 0.72714238 0.719464   0.75083476 0.73951341 0.81875262
  0.79518046]
 [0.93462086 0.97186464 1.         0.82903617 0.86689408 0.87767714
  0.79404704 0.78805688 0.81412771 0.80280706 0.69242626 0.70600368
  0.75570364 0.74931773 0.75263932 0.78060981 0.77911148 0.80361706
  0.78299883]
 [0.86218214 0.87009984 0.82903617 1.         0.97332895 0.92091002
  0.80324364 0.78120552 0.79955808 0.77158698 0.73217249 0.72851399
  0.69076317 0.69277411 0.67918256 0.68729487 0.67116855 0.88789601
  0.85315085]
 [0.83803926 0.90040294 0.86689408 0.97332895 1.         0.95305769
  0.80973908 0.80301555 0.83259464 0.82088773 0.75822376 0.7

In [20]:
# Create a list of sets of indices into span_storage; each set holds phrases that are likely from the same sentence
indices = []
for j, row in enumerate(sim_table):
    group = similar_phrases(j, row)
    # Keep each distinct group only once
    if group not in indices:
        indices.append(group)
print(indices)

[{0, 1, 2}, {3, 4, 5}, {8, 6, 7}, {8, 9, 6, 7}, {8, 9, 7}, {10, 11}, {12, 13, 14}, {12, 13, 14, 15, 16}, {16, 13, 14, 15}, {17, 18}]


In [21]:
span_storage

['Richard was very empathic',
 'Richard was very empathic and',
 'Richard was very empathic and understanding',
 'Richard was amazingly kind',
 'Richard was amazingly kind and',
 'Richard was amazingly kind and patient',
 'Dr. McConkie was very thorough',
 'McConkie was very thorough',
 'Dr. McConkie was very thorough with',
 'McConkie was very thorough with',
 'McConkie was exceptional',
 'McConkie was exceptional.',
 'He is compassionate',
 'He is compassionate,',
 'He is compassionate, professional',
 'He is compassionate, professional and',
 'He is compassionate, professional and knowledgeable',
 'Richard is really good',
 'Richard is really good at']

In [22]:
def clean_subsets(my_list):
    '''Return the sets in my_list that are proper subsets of another set in my_list.'''
    # Compare against every other set, not only the adjacent ones,
    # and list each subset once so that it can be removed safely
    return [s for s in my_list if any(s < other for other in my_list)]

In [23]:
# Collect the sets that are subsets of other sets
remove_list = clean_subsets(indices)
# Remove them from indices
for i in remove_list:
    indices.remove(i)

In [24]:
# Have a look at updated indices from span_storage
indices

[{0, 1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10, 11}, {12, 13, 14, 15, 16}, {17, 18}]

In [25]:
# Retain the longest phrase from each set of phrases that are likely from the same sentence
meaningful_phrases = [max((span_storage[i] for i in group), key=len)
                      for group in indices]

In [26]:
meaningful_phrases

['Richard was very empathic and understanding',
 'Richard was amazingly kind and patient',
 'Dr. McConkie was very thorough with',
 'McConkie was exceptional.',
 'He is compassionate, professional and knowledgeable',
 'Richard is really good at']
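
The grouping steps above can be collected into a single function. The following is a minimal sketch on toy data, not the notebook's exact code: the function name `group_and_keep_longest`, the toy phrases, and the similarity values are all invented for illustration, and plain lists stand in for the NumPy matrix.

```python
def group_and_keep_longest(phrases, sim, threshold=0.91):
    """Group phrases by row-wise similarity and keep the longest per group."""
    # One candidate group per row: indices whose similarity reaches the threshold
    groups = []
    for row in sim:
        group = frozenset(i for i, score in enumerate(row) if score >= threshold)
        if group not in groups:
            groups.append(group)
    # Drop groups that are proper subsets of another group
    maximal = [g for g in groups if not any(g < other for other in groups)]
    # Keep the longest phrase of each remaining group
    return [max((phrases[i] for i in sorted(g)), key=len) for g in maximal]


# Invented phrases and similarity scores
toy_phrases = ['He is kind', 'He is kind and', 'He is kind and patient', 'She was thorough']
toy_sim = [
    [1.00, 0.97, 0.92, 0.60],
    [0.97, 1.00, 0.95, 0.62],
    [0.92, 0.95, 1.00, 0.65],
    [0.60, 0.62, 0.65, 1.00],
]
print(group_and_keep_longest(toy_phrases, toy_sim))
```

Because the groups come from a fixed threshold, a phrase can still land in two overlapping groups when neither is a subset of the other; the sketch keeps both groups in that case, as the cells above do.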

## Limitation:
The data set is not large enough to contain multiple reviews for the same doctor. Thus,
we cannot build a list of reviews from many patients for a specific doctor and extract meaningful phrases about him or her.

## Future extension:
When we have collected a large amount of data, we can
* Build a list of reviews from many patients for a specific doctor and extract meaningful phrases about him or her.
* Classify each review as positive or negative.
* Create a rating system based on the number of positive and negative words or phrases in a review (slightly different from Amazon's).
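
The last two ideas could start from something as simple as counting words from small positive and negative word lists. The sketch below is only one possible starting point under that assumption: the word lists, the names `score_review` and `classify_review`, and the sample sentences are hypothetical, and a real system would need a proper sentiment lexicon or a trained classifier.

```python
import re

# Hypothetical word lists; a real lexicon would be far larger
POSITIVE_WORDS = {'kind', 'thorough', 'compassionate', 'professional',
                  'knowledgeable', 'excellent'}
NEGATIVE_WORDS = {'rude', 'late', 'dismissive', 'rushed'}


def score_review(text):
    """Number of positive words minus number of negative words."""
    words = re.findall(r'[a-z]+', text.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives


def classify_review(text):
    """Label a review by the sign of its score."""
    score = score_review(text)
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'


print(classify_review('He is compassionate, professional and knowledgeable'))
print(classify_review('The doctor was rude and rushed'))
```

Plain word counting ignores negation ("not kind" still counts as positive), so the extracted phrases from the Matcher step would likely be a better unit to score than single words.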