# DATA DESCRIPTION

This dataset contains details of doctor consultation fees. Some of the variables in the dataset are listed below:
- Qualification: Qualification and degrees held by the doctor

- Experience: Experience of the doctor in number of years

- Rating: Rating given by patients

- Profile: Type of the doctor

- Miscellaneous_Info: Extra information about the doctor

- Place: Area and the city where the doctor is located.

TARGET VARIABLE --> Fees: Fees charged by the doctor 

PROBLEM STATEMENT :

We have all been in a situation where we go to a doctor in an emergency and find that the consultation fees are too high. As data scientists, we should be able to do better. What if you had data recording important details about a doctor and could build a model to predict that doctor's consulting fee? This use case lets you do exactly that.

From the problem statement and the dataset, we can see that this is a regression problem. We will therefore train a regression algorithm (a support vector regressor), use GridSearchCV for hyperparameter tuning, and save the final model as a .pkl file.

# Importing the needed Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer
import warnings
warnings.filterwarnings('ignore')

# DATA PREPARATION/Loading the Data

In [None]:
df = pd.read_excel("Final_Train.xlsx")

In [None]:
# Let's see the columns of the dataset
df.columns

We have 6 independent variables and 1 target variable, i.e. Fees in the training dataset.

In [None]:
#Loading the head of the Dataset to get a general view of the Data we will be working with.
df.head()

In [None]:
df.tail()

Looking at the data, we can already see that some data cleaning is needed.

In [None]:
#Checking The Data Dimension
df.shape

In [None]:
# Let's check for null values
df.isnull().sum()

In [None]:
sns.heatmap(df.isnull())
plt.title('Null values')
plt.show()

The many white lines in the heatmap indicate that the dataset contains many null values.

In [None]:
# Let's get some more information about the dataset
df.info()

From the above we can see that two dtypes are present in the dataset, i.e., int64 and object.

In [None]:
# Let's get a general idea of the dataset with the describe method
df.describe()

# DATA CLEANING


Let's clean and adjust the Experience column.

In [None]:
# Extract years of experience
df["Experience"] = df["Experience"].str.split()
df["Experience"] = df["Experience"].str[0].astype("int")
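
As a quick check of this extraction step, here is a minimal sketch on a few made-up strings. It assumes the raw values look like "24 years experience"; the sample values are illustrative and not taken from the dataset, and the regex is just one alternative to splitting on whitespace.

```python
import pandas as pd

# Hypothetical sample values in the assumed "<N> years experience" format
sample = pd.Series(["24 years experience", "5 years experience", "12 years experience"])

# Pull out the leading run of digits and convert it to an integer
years = sample.str.extract(r"^(\d+)", expand=False).astype(int)
print(years.tolist())
```

If a value does not start with a number, `str.extract` returns a missing value and the `astype(int)` call fails, which is a useful early warning that the assumed format does not hold.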

Let's clean and adjust the Place column.

In [None]:
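# Extract areas and cities; missing places become "Unknown,Unknown"
df["Place"] = df["Place"].fillna("Unknown,Unknown")
df["Place"] = df["Place"].str.split(",")
# The last element is the city, the first is the area; strip stray spaces
df["City"] = df["Place"].str[-1].str.strip()
df["Place"] = df["Place"].str[0].str.strip()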
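
To see what this split produces, here is a small sketch on made-up values. It assumes the raw entries look like "area, city"; the sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical sample values in the assumed "<area>, <city>" format, one of them missing
place = pd.Series(["Kakkanad, Ernakulam", "Whitefield, Bangalore", None])

# Missing entries become "Unknown,Unknown" so both parts exist after the split
place = place.fillna("Unknown,Unknown")
parts = place.str.split(",")
city = parts.str[-1].str.strip()   # last element is the city
area = parts.str[0].str.strip()    # first element is the area
print(area.tolist(), city.tolist())
```

Without the `strip` calls, the city would keep the space that follows the comma, which later shows up in the dummy column names.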

Let's fill the missing values in the Rating column and group the ratings into bins.

In [None]:
# Separate ratings into bins; missing ratings get the sentinel value "-99%"
df["Rating"] = df["Rating"].fillna("-99%")
df["Rating"] = df["Rating"].str[:-1].astype("int")
bins = [-99,0,10,20,30,40,50,60,70,80,90,100]
labels = [i for i in range(11)]
df["Rating"] = pd.cut(df["Rating"],bins=bins,labels=labels,include_lowest=True)
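
The following sketch applies the same bins to a few hand-picked values to show where they land. The input values are made up for illustration.

```python
import pandas as pd

# Illustrative values; -99 stands for a missing rating
ratings = pd.Series([-99, 0, 5, 10, 95, 100])
bins = [-99, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = list(range(11))

# Intervals are closed on the right: [-99, 0], (0, 10], ..., (90, 100]
binned = pd.cut(ratings, bins=bins, labels=labels, include_lowest=True)
print(binned.tolist())
```

One thing to keep in mind with these edges: bin 0 covers everything from -99 to 0, so a genuine 0% rating and a missing rating would end up in the same bin.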

In [None]:
# Let's see the value counts of the Rating column
df['Rating'].value_counts().sort_index()

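Next we adjust the Qualification column, which holds several comma-separated entries per doctor and needs cleaning before modelling. We count how often each qualification occurs and keep the 10 most frequent ones as indicator columns.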

In [None]:
# Extract relevant qualification
df["Qualification"]=df["Qualification"].str.split(",")
Qualification ={}
for x in df["Qualification"].values:
    for each in x:
        each = each.strip()
        if each in Qualification:
            Qualification[each]+=1
        else:
            Qualification[each]=1

In [None]:
# Keep the 10 most frequent qualifications
most_qua = sorted(Qualification.items(),key=lambda x:x[1],reverse=True)[:10]
final_qua = [tup[0] for tup in most_qua]
for title in final_qua:
    df[title]=0

# Flag each doctor's qualifications; df.loc avoids chained assignment,
# which can silently write to a temporary copy
for idx,quals in zip(df.index,df["Qualification"].values):
    for q in quals:
        q = q.strip()
        if q in final_qua:
            df.loc[idx,q] = 1
df.drop("Qualification",axis=1,inplace=True)
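
The same idea can be written more compactly with `collections.Counter`. This is only a sketch on made-up rows, not the cell used in this notebook; `df_demo` and `top_n` are hypothetical names.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample rows
df_demo = pd.DataFrame(
    {"Qualification": ["MBBS, MD - Dermatology", "BDS", "MBBS", "BDS, MDS"]}
)

# Split each row into a list of stripped qualification names
quals = df_demo["Qualification"].str.split(",").apply(
    lambda items: [q.strip() for q in items]
)

# Count every qualification across all rows and keep the most frequent ones
counts = Counter(q for row in quals for q in row)
top_n = 2
top = [name for name, _ in counts.most_common(top_n)]

# One 0/1 indicator column per retained qualification
for name in top:
    df_demo[name] = quals.apply(lambda row, name=name: int(name in row))

print(df_demo[top])
```

Rare qualifications are simply dropped here, which keeps the number of columns small at the cost of losing some detail.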


In [None]:
# Let's see the value counts of the Profile column
df['Profile'].value_counts()

In [None]:
# Let's see the value counts of the City column
df['City'].value_counts()

From the above we can see that the City column contains a stray value 'e', which we will deal with next.

In [None]:
# Use df.loc to overwrite the stray entry in place (chained indexing may write to a copy)
df.loc[3980,"City"] = "Unknown"
df.loc[3980,"Place"] = "Unknown"

In [None]:
# Now let's check again whether the value 'e' has been removed.
df['City'].value_counts()

In [None]:
# Get dummies
df = pd.get_dummies(df,columns=["City","Profile"],prefix=["City","Profile"])

Since there are only a few distinct cities, we can one-hot encode (dummify) the city names.

In [None]:
df['Miscellaneous_Info'].value_counts()

Now I will drop the 'Miscellaneous_Info' column: it is free text, and I am no NLP expert.

In [None]:
#Dropping the column
df.drop("Miscellaneous_Info",axis=1,inplace=True)

In [None]:
# Let's check the head of the dataset again
df.head()

In [None]:
# Let's check whether the data has been cleaned
df.info()

- From the above we can see that some new columns have come into existence.
- There are now five dtypes present, i.e., category, int32, int64, object, and uint8.

In [None]:
# Let's check for null values again
df.isnull().sum()

From the above we can see that there are no null values left in the dataset.

# Let's do some EDA on the Dataset

In [None]:
# Now let's check the correlations in the dataset (numeric columns only, since 'Place' is still text)
sns.heatmap(df.corr(numeric_only=True))

The heatmap shows the pairwise correlations between the numeric features and the target. None of the correlations is extremely high, so we will keep every variable.
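
A heatmap with this many columns can be hard to read, so it may help to list the correlations with the target directly. The sketch below uses synthetic data; `demo` and its columns are made up and only the pattern carries over.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: Fees depends on Experience, Noise is unrelated
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Experience": rng.integers(1, 40, size=50)})
demo["Fees"] = 100 + 10 * demo["Experience"] + rng.normal(0, 20, size=50)
demo["Noise"] = rng.normal(size=50)

# Correlation of every feature with the target, strongest first
corr_with_target = (
    demo.corr()["Fees"].drop("Fees").sort_values(ascending=False)
)
print(corr_with_target)
```

On the real dataframe the same expression would be applied to `df`, with "Fees" as the target column.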

# Separating independent variable and target variable and also encoding the dataset

In [None]:
X = df.drop("Fees",axis=1)
y = df["Fees"]

# Encoding
enc = OrdinalEncoder()
X = enc.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)

# Feature scaling: fit the scaler on the training split only; X_test is scaled at prediction time
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
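
The pattern here is to learn the scaling parameters from the training split and reuse them for the test split. A minimal sketch on synthetic data follows; all names ending in `_demo`, `_tr`, or `_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit on the training split only, then apply the same transform to both splits
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

print(X_tr_s.mean(axis=0).round(6), X_te_s.shape)
```

Each split should be transformed exactly once; scaling already-scaled data a second time shifts the features away from what the model saw during training.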

# Model Building

USING SUPPORT VECTOR MACHINE

In [None]:

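def score(y_pred,y):
    # Custom metric: 1 minus the root mean squared error of the log values
    y_pred = np.log(y_pred)
    y = np.log(y)
    # Take the square root explicitly: "x**1/2" would be parsed as (x**1)/2
    return 1 - np.sqrt(np.sum((y_pred-y)**2)/len(y))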
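
Operator precedence matters when writing a square root as a power: in Python, `**` binds tighter than `/`. The sketch below shows the difference and a standalone version of the metric; `one_minus_rmsle` is a hypothetical name used only for this illustration.

```python
import numpy as np

def one_minus_rmsle(y_true, y_pred):
    """1 - sqrt(mean((log(y_pred) - log(y_true))**2)); inputs must be positive."""
    log_diff = np.log(np.asarray(y_pred, dtype=float)) - np.log(
        np.asarray(y_true, dtype=float)
    )
    return 1 - np.sqrt(np.mean(log_diff ** 2))

x = 0.16
halved = x ** 1 / 2      # parsed as (x ** 1) / 2, i.e. x divided by two
rooted = x ** (1 / 2)    # the actual square root of x

print(halved, rooted)
print(one_minus_rmsle([100, 200], [100, 200]))  # perfect predictions
```

Because the metric takes logarithms, it is only defined for strictly positive fees and predictions.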

In [None]:

# Define own scorer
scorer = make_scorer(score,greater_is_better=True)
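
One detail worth knowing about `make_scorer`: scikit-learn calls the wrapped function as `score_func(y_true, y_pred)`, with the true values first. The function defined above names its first argument `y_pred`, which is harmless here only because a squared difference is symmetric. The sketch below makes the argument order visible with a deliberately asymmetric function; `signed_error` is a hypothetical helper.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer

def signed_error(y_true, y_pred):
    # Asymmetric on purpose, so the argument order is visible in the result
    return float(np.mean(y_pred - y_true))

scorer_demo = make_scorer(signed_error)

# A model that always predicts 5.0, evaluated against true values of 1.0
X_dummy = np.zeros((3, 1))
y_dummy = np.array([1.0, 1.0, 1.0])
model = DummyRegressor(strategy="constant", constant=5.0).fit(X_dummy, y_dummy)

value = scorer_demo(model, X_dummy, y_dummy)
print(value)
```

A positive result confirms that the predictions arrive as the second argument.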

In [None]:

# support vector machine 
from sklearn.svm import SVR
m = SVR(gamma="scale")
# X_train was already scaled above, so fit on it directly
m.fit(X_train,y_train)

In [None]:
# Prediction
y_pred = m.predict(scaler.transform(X_test))
score(y_pred,y_test)

# GRIDSEARCHCV/HYPERPARAMETER TUNING

In [None]:
# Hyperparameter tuning
parameters = {"C":[0.1,1,10],"kernel":["linear","rbf","poly"]}
reg = GridSearchCV(m,param_grid=parameters,scoring=scorer,n_jobs=-1,cv=5)

In [None]:
reg.fit(X_train,y_train)
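
An alternative worth considering is to put the scaler and the model into one `Pipeline` and tune that, so the scaler is refit inside every cross-validation fold. This is a sketch on synthetic data with the default R² scoring, not the setup used in this notebook; the step names "scale" and "svr" are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data with a simple linear signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=60)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR(gamma="scale"))])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {"svr__C": [0.1, 1, 10], "svr__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With this arrangement, `search.predict` accepts unscaled features, because the fitted pipeline applies the scaler itself.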

In [None]:
reg.best_params_

In [None]:
y_pred_tuned = reg.predict(scaler.transform(X_test))
score(y_pred_tuned,y_test)

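From the above we can see that hyperparameter tuning improved the score considerably over the default SVR. Note that this is the custom log-error score defined earlier, not a classification accuracy.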

# Saving Best Model Using PKL

In [None]:
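import joblib
filename = 'fees_model.pkl'
# Save the fitted, tuned estimator (not the predictions) so it can be reloaded for inference
joblib.dump(reg.best_estimator_, filename)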
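
A saved file is only useful if it contains the fitted estimator, so that `predict` can be called after loading. The sketch below does a save and load round trip on a small synthetic model; the file name `demo_model.pkl` and the data are made up.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import SVR

# Small synthetic model standing in for the tuned estimator
X_demo = np.arange(20, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel()
model = SVR(kernel="linear").fit(X_demo, y_demo)

# Write the model to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

For real use, the fitted encoder and scaler would need to be saved as well, since new data must pass through the same preprocessing before prediction.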