Predict A Doctor's Consultation Fee 



We have all been in a situation where we visit a doctor in an emergency and find that the consultation fee is too high. As data scientists, we should be able to do better. What if you had data recording important details about a doctor and could build a model to predict that doctor's consultation fee? This is the hackathon that lets you do that.



Size of training set: 5961 records

Size of test set: 1987 records

FEATURES:

Qualification: Qualification and degrees held by the doctor

Experience: Experience of the doctor in number of years

Rating: Rating given by patients

Profile: Type of the doctor

Miscellaneous_Info: Extra information about the doctor

Fees: Fees charged by the doctor (the target variable to predict)

Place: Area and the city where the doctor is located.

In [1]:
#import the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tabulate import tabulate
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from scipy.stats import skew
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline


In [2]:

DoctorsData= pd.read_excel('Doctor_Data.xlsx') 
DoctorsData

Unnamed: 0,Qualification,Experience,Rating,Place,Profile,Miscellaneous_Info,Fees
0,"BHMS, MD - Homeopathy",24 years experience,100%,"Kakkanad, Ernakulam",Homeopath,"100% 16 Feedback Kakkanad, Ernakulam",100
1,"BAMS, MD - Ayurveda Medicine",12 years experience,98%,"Whitefield, Bangalore",Ayurveda,"98% 76 Feedback Whitefield, Bangalore",350
2,"MBBS, MS - Otorhinolaryngology",9 years experience,,"Mathikere - BEL, Bangalore",ENT Specialist,,300
3,"BSc - Zoology, BAMS",12 years experience,,"Bannerghatta Road, Bangalore",Ayurveda,"Bannerghatta Road, Bangalore ₹250 Available on...",250
4,BAMS,20 years experience,100%,"Keelkattalai, Chennai",Ayurveda,"100% 4 Feedback Keelkattalai, Chennai",250
...,...,...,...,...,...,...,...
5956,"MBBS, MS - ENT",19 years experience,98%,"Basavanagudi, Bangalore",ENT Specialist,"98% 45 Feedback Basavanagudi, Bangalore",300
5957,MBBS,33 years experience,,"Nungambakkam, Chennai",General Medicine,,100
5958,MBBS,41 years experience,97%,"Greater Kailash Part 2, Delhi",General Medicine,"97% 11 Feedback Greater Kailash Part 2, Delhi",600
5959,"MBBS, MD - General Medicine",15 years experience,90%,"Vileparle West, Mumbai",General Medicine,General Medical Consultation Viral Fever Treat...,100


In [3]:
DoctorsData.isnull().sum()

Qualification            0
Experience               0
Rating                3302
Place                   25
Profile                  0
Miscellaneous_Info    2620
Fees                     0
dtype: int64

In [4]:
DoctorsData['Rating']=DoctorsData['Rating'].fillna(0)


In [5]:
DoctorsData.drop(['Miscellaneous_Info'],axis=1,inplace=True)

In [6]:
# Fill missing values in the categorical 'Place' column with its most frequent value (the mode)

DoctorsData['Place'] = DoctorsData['Place'].fillna(DoctorsData['Place'].value_counts().index[0])

In [7]:
DoctorsData.isnull().sum()

Qualification    0
Experience       0
Rating           0
Place            0
Profile          0
Fees             0
dtype: int64

In [8]:
DoctorsData.keys()

Index(['Qualification', 'Experience', 'Rating', 'Place', 'Profile', 'Fees'], dtype='object')

In [9]:
# Show the entry with the most comma-separated qualifications, split and sorted alphabetically
sorted(DoctorsData.Qualification[DoctorsData.Qualification.apply(lambda x: len(x.split(','))).idxmax()].split(","))

[' Certificate in Cosmetic Dentistry',
 ' Certification in Full Mouth Rehabilitation',
 ' Certified Advance Course In Endodontics',
 ' Certified BPS Dentist',
 ' Certified in Orthodontics',
 ' Degree in Dental Implant',
 ' Fellowship in Advanced Endoscopic Sinus Surgery',
 ' Fellowship in Lasers & Cosmetology',
 ' Professional Implantology Training Course (PITC)',
 'Fellowship in Oral implantlogy']

In [10]:
import re
def SpaceofQual(text):
    arr = re.sub(r'\([^()]+\)', lambda x: x.group().replace(",","-"), text) # to replace ',' with '-' inside brackets only
    return ','.join(sorted(arr.lower().replace(" ","").split(",")))

In [11]:
DoctorsData.Qualification = DoctorsData.Qualification.apply(lambda x: SpaceofQual(x))
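
To see what the normalization does, here is a minimal standalone sketch of the same logic. The function name `normalize_qualification` and the second input string are illustrative assumptions; the first input is the raw value of row 3 in the table above.

```python
import re

def normalize_qualification(text):
    # Replace commas inside parentheses with '-' so they are not treated as separators.
    protected = re.sub(r'\([^()]+\)', lambda m: m.group().replace(",", "-"), text)
    # Lowercase, remove all spaces, split on the remaining commas, sort, and re-join.
    return ",".join(sorted(protected.lower().replace(" ", "").split(",")))

# Raw value taken from row 3 of the dataset preview.
from_data = normalize_qualification("BSc - Zoology, BAMS")
# Hypothetical value with a comma inside parentheses (not from the dataset).
hypothetical = normalize_qualification("MBBS, Diploma (Skin, Hair)")

print(from_data)
print(hypothetical)
```

Sorting the tokens means that two doctors listing the same degrees in a different order end up with the same string.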

In [12]:
# Define a function that joins all qualifications into one comma-separated string
def QualificationData(series):
    Quals = ''
    for i in series:
        Quals += i + ','
    return Quals

In [13]:
# List the unique qualifications along with their occurrence counts in the training set
from collections import Counter
text = QualificationData(DoctorsData.Qualification)
dfQualification = pd.DataFrame.from_dict(dict(Counter(text.split(',')).most_common()), orient='index').reset_index()
dfQualification.columns=['Qualification','Count']
dfQualification

Unnamed: 0,Qualification,Count
0,mbbs,2808
1,bds,1363
2,bams,764
3,bhms,749
4,md-dermatology,606
...,...,...
782,postgraduatediplomaindiabetology(pgdd)(univers...,1
783,certifiedinadvancedorthodontics,1
784,fdsendodontics,1
785,certificationinsmiledesigning,1
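
The counting step can also be done without building one large string. Because `QualificationData` appends a comma after every entry, the final `split(',')` yields one empty token at the end. The sketch below is an alternative that counts tokens directly and skips empty ones. It uses the five normalized values from the first rows of the dataset as sample input, and the variable names are assumptions.

```python
from collections import Counter

# Normalized qualification strings from the first five rows of the dataset.
qualifications = [
    "bhms,md-homeopathy",
    "bams,md-ayurvedamedicine",
    "mbbs,ms-otorhinolaryngology",
    "bams,bsc-zoology",
    "bams",
]

# Count every individual qualification token, ignoring empty strings.
counts = Counter(
    token
    for entry in qualifications
    for token in entry.split(",")
    if token
)

print(counts.most_common(3))
```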


In [14]:
dfQualification['code']= dfQualification.Qualification.astype('category').cat.codes

In [15]:
dfQualification.head(10)

Unnamed: 0,Qualification,Count,code
0,mbbs,2808,502
1,bds,1363,27
2,bams,764,25
3,bhms,749,29
4,md-dermatology,606,535
5,ms-ent,411,645
6,venereology&leprosy,297,783
7,md-generalmedicine,285,540
8,diplomainotorhinolaryngology(dlo),250,216
9,md-homeopathy,181,543


In [16]:
DoctorsData['Qualification'].unique()

array(['bhms,md-homeopathy', 'bams,md-ayurvedamedicine',
       'mbbs,ms-otorhinolaryngology', ...,
       'fellowshipindermatosurgery,mbbs,md-dermatology,venereology&leprosy',
       'bds,certificationinsmiledesigning',
       'dhms(diplomainhomeopathicmedicineandsurgery),md-homeopathy,postgraduatediplomainhealthcaremanagement(pgdhm)'],
      dtype=object)

In [17]:
# Applying LabelEncoder  
#,'Duration'
# from sklearn.preprocessing import LabelEncoder
# transcol=['Qualification']
# for col in DoctorsData :
    
#     for i in transcol:
        
#         if col==i  :
#             print(i)
#             labelencoder = LabelEncoder()
#             DoctorsData[col] = labelencoder.fit_transform(DoctorsData[col])

In [18]:
# Handle categorical variables: one-hot encode 'Profile'
DoctorsData = pd.get_dummies(DoctorsData, columns = [ 'Profile'], drop_first = True)
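
A small sketch of what `drop_first=True` does, using a toy frame with the six profile values that occur in this dataset. The fee values are illustrative only, and `dtype=int` is passed so that the dummy columns are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Toy frame: one row per profile value; the fees are made-up numbers.
toy = pd.DataFrame({
    "Profile": ["Ayurveda", "Dentist", "Dermatologists",
                "ENT Specialist", "General Medicine", "Homeopath"],
    "Fees": [250, 300, 500, 300, 100, 100],
})

# drop_first=True removes the first category (alphabetically, 'Ayurveda'),
# which then becomes the baseline encoded as all zeros.
encoded = pd.get_dummies(toy, columns=["Profile"], drop_first=True, dtype=int)
profile_columns = [c for c in encoded.columns if c.startswith("Profile_")]

print(profile_columns)
print(encoded.loc[0, profile_columns].tolist())  # the Ayurveda row
```

This is why the encoded table has no `Profile_Ayurveda` column: an Ayurveda doctor is represented by zeros in all five dummy columns.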

In [19]:
DoctorsData

Unnamed: 0,Qualification,Experience,Rating,Place,Fees,Profile_Dentist,Profile_Dermatologists,Profile_ENT Specialist,Profile_General Medicine,Profile_Homeopath
0,"bhms,md-homeopathy",24 years experience,100%,"Kakkanad, Ernakulam",100,0,0,0,0,1
1,"bams,md-ayurvedamedicine",12 years experience,98%,"Whitefield, Bangalore",350,0,0,0,0,0
2,"mbbs,ms-otorhinolaryngology",9 years experience,0,"Mathikere - BEL, Bangalore",300,0,0,1,0,0
3,"bams,bsc-zoology",12 years experience,0,"Bannerghatta Road, Bangalore",250,0,0,0,0,0
4,bams,20 years experience,100%,"Keelkattalai, Chennai",250,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
5956,"mbbs,ms-ent",19 years experience,98%,"Basavanagudi, Bangalore",300,0,0,1,0,0
5957,mbbs,33 years experience,0,"Nungambakkam, Chennai",100,0,0,0,1,0
5958,mbbs,41 years experience,97%,"Greater Kailash Part 2, Delhi",600,0,0,0,1,0
5959,"mbbs,md-generalmedicine",15 years experience,90%,"Vileparle West, Mumbai",100,0,0,0,1,0


In [20]:

# Strip the text suffix from Experience and the '%' sign from Rating
DoctorsData['Experience']=DoctorsData['Experience'].str.replace('years experience', '')

DoctorsData['Rating']=DoctorsData['Rating'].str.replace('%', '')




In [21]:
DoctorsData['Rating']=DoctorsData['Rating'].fillna(0)
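
The second `fillna(0)` is needed because `.str.replace` returns NaN for non-string values, including the integer 0 inserted in cell 4. The sketch below shows one alternative way to parse both columns in a single pass. It is not the notebook's method, and the toy frame is an assumption built from values in the dataset preview.

```python
import pandas as pd

toy = pd.DataFrame({
    "Experience": ["24 years experience", "12 years experience", "9 years experience"],
    "Rating": ["100%", "98%", None],
})

# Extract the leading digits from Experience and convert them to numbers.
toy["Experience"] = pd.to_numeric(
    toy["Experience"].str.extract(r"(\d+)", expand=False)
)

# Strip the '%' sign; missing ratings become 0, matching the notebook's choice.
toy["Rating"] = (
    pd.to_numeric(toy["Rating"].str.rstrip("%"), errors="coerce")
    .fillna(0)
    .astype(int)
)

print(toy)
```

Note that filling missing ratings with 0 makes "no rating" indistinguishable from a genuine 0% rating, so a separate indicator column may be worth considering.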


In [22]:
DoctorsData.drop(['Qualification','Place'],axis=1,inplace=True)
# DoctorsData.drop(['Place'],axis=1,inplace=True)

In [23]:
DoctorsData

Unnamed: 0,Experience,Rating,Fees,Profile_Dentist,Profile_Dermatologists,Profile_ENT Specialist,Profile_General Medicine,Profile_Homeopath
0,24,100,100,0,0,0,0,1
1,12,98,350,0,0,0,0,0
2,9,0,300,0,0,1,0,0
3,12,0,250,0,0,0,0,0
4,20,100,250,0,0,0,0,0
...,...,...,...,...,...,...,...,...
5956,19,98,300,0,0,1,0,0
5957,33,0,100,0,0,0,1,0
5958,41,97,600,0,0,0,1,0
5959,15,90,100,0,0,0,1,0


In [24]:
DoctorsData.dtypes

Experience                  object
Rating                      object
Fees                         int64
Profile_Dentist              uint8
Profile_Dermatologists       uint8
Profile_ENT Specialist       uint8
Profile_General Medicine     uint8
Profile_Homeopath            uint8
dtype: object

In [25]:
DoctorsData['Experience']=DoctorsData['Experience'].astype(int)
DoctorsData['Rating']=DoctorsData['Rating'].astype(int)

In [26]:
# Check for outliers: keep only rows where every column lies within 3 standard deviations of its mean

from scipy.stats import zscore
print('Before zscore',DoctorsData.shape)
z_score=np.abs(zscore(DoctorsData))
hrds=DoctorsData[(z_score<3).all(axis=1)]
print('After zscore',hrds.shape)

Before zscore (5961, 8)
After zscore (5903, 8)
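
The filter above can be illustrated on a small synthetic array. This is a sketch with made-up numbers: one extreme fee value is removed and everything else is kept. The z-score is computed by hand with the population standard deviation, which is what `scipy.stats.zscore` uses by default.

```python
import numpy as np

# Synthetic data (not from the dataset): 20 ordinary rows plus one extreme fee.
fees = np.array([100, 200, 300, 400] * 5 + [5000], dtype=float)
experience = np.array([5, 10, 15, 20] * 5 + [12], dtype=float)
data = np.column_stack([fees, experience])

# Column-wise z-scores: distance from the column mean in standard deviations.
z = np.abs((data - data.mean(axis=0)) / data.std(axis=0))

# Keep a row only if all of its columns are within 3 standard deviations.
keep = (z < 3).all(axis=1)
filtered = data[keep]

print(data.shape, filtered.shape)
```

Because the mask requires every column to pass, a single extreme value in any column removes the whole row. In the notebook the check also includes the target `Fees` and the dummy columns.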


In [27]:
x=hrds.drop(['Fees'],axis=1)
x.shape

(5903, 7)

In [28]:
y=hrds['Fees']
y=np.array(y).reshape(-1,1)

In [29]:
x.skew()

Experience                  0.897661
Rating                      0.240969
Profile_Dentist             1.241682
Profile_Dermatologists      1.718892
Profile_ENT Specialist      2.310896
Profile_General Medicine    1.557957
Profile_Homeopath           2.050262
dtype: float64

In [30]:
# splitting data as X_train and X_test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2,random_state = 42)

In [31]:
#Linear Regression
regressor = LinearRegression()  
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [32]:
y_pred = regressor.predict(X_test)

In [33]:
# calculating RMSE
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(y_test, y_pred))
rmse

176.2374718227872
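
RMSE is the square root of the mean of the squared differences between actual and predicted fees, so it is expressed in the same unit as the fee. As a sketch, the same calculation is done by hand on the first five actual/predicted pairs of the test set (the values listed in the next cell). The result differs from the full-test-set RMSE because only five rows are used.

```python
import numpy as np

# First five actual/predicted pairs from the test-set comparison table.
actual = np.array([100, 700, 200, 200, 400], dtype=float)
predicted = np.array([290.691147, 341.932181, 299.201867, 352.208990, 426.994079])

# RMSE = sqrt(mean((actual - predicted)^2))
squared_errors = (actual - predicted) ** 2
rmse_five = float(np.sqrt(squared_errors.mean()))

print(rmse_five)
```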

In [34]:
df = pd.DataFrame({'Actual': np.array(y_test)[:,0], 'Predicted': y_pred[:,0]})
df

Unnamed: 0,Actual,Predicted
0,100,290.691147
1,700,341.932181
2,200,299.201867
3,200,352.208990
4,400,426.994079
...,...,...
1176,500,356.237385
1177,300,260.106675
1178,800,420.510053
1179,200,268.047788


In [35]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(regressor,'Doctor_Fee.obj')
OuModel=joblib.load('Doctor_Fee.obj')
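
A quick way to check that a saved model works is to reload it and compare its predictions with the original. The sketch below does this on tiny synthetic data rather than the doctor dataset. The file name `Doctor_Fee_demo.obj` and the linear relation used to generate the data are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (not from the dataset): fee = 10 * experience + 100.
X = np.array([[1.0], [5.0], [10.0], [20.0]])
y = 10 * X[:, 0] + 100

model = LinearRegression().fit(X, y)

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Doctor_Fee_demo.obj")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same = bool(np.allclose(model.predict(X), restored.predict(X)))
new_fee = float(restored.predict(np.array([[15.0]]))[0])

print(same, new_fee)
```

When loading a saved model for real predictions, the input must have the same columns in the same order as the training data.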


In [None]:
# Conclusion
    1) The task was to predict the doctor's consultation fee. The notebook covers:
       . Handling null values
       . Linear Regression
    2) The Qualification column still needs to be encoded and used as a feature to improve accuracy.
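
One possible way to bring Qualification into the model is a multi-hot encoding: one 0/1 column per individual qualification, optionally restricted to the frequent ones. The notebook does not do this, so the following is only a sketch under assumptions. The sample values are the first five normalized rows, and the `min_count` threshold is hypothetical.

```python
import pandas as pd

# Normalized qualification strings from the first five rows of the dataset.
quals = pd.Series([
    "bhms,md-homeopathy",
    "bams,md-ayurvedamedicine",
    "mbbs,ms-otorhinolaryngology",
    "bams,bsc-zoology",
    "bams",
], name="Qualification")

# One 0/1 column per distinct qualification token.
multi_hot = quals.str.get_dummies(sep=",")

# Keep only qualifications that occur at least `min_count` times (hypothetical threshold).
min_count = 2
frequent = multi_hot.loc[:, multi_hot.sum(axis=0) >= min_count]

# A simple extra feature: how many qualifications each doctor lists.
n_quals = multi_hot.sum(axis=1)

print(frequent.columns.tolist())
print(n_quals.tolist())
```

With several hundred distinct qualifications in the full data, a frequency threshold like this keeps the number of added columns manageable.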