### Import Python Libraries

In [488]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import socket

### Import Train and Test Data and Append them

Appending the two datasets simplifies the code, because every data transformation can be applied to both at once. After cleaning and transformation, the combined dataset is split back into train and test sets.
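
A minimal sketch of the append-then-split pattern on toy data. The values and the `is_train` marker column are hypothetical and not part of the competition files; the marker is one way to recover each part without relying on a fixed row count.

```python
import pandas as pd

# Toy stand-ins for the train and test files (hypothetical values).
train = pd.DataFrame({"Experience": ["24 years experience", "5 years experience"],
                      "Fees": [100, 350]})
test = pd.DataFrame({"Experience": ["12 years experience"]})

# Tag each part, then stack them so one transformation covers both.
combined = pd.concat([train.assign(is_train=True), test.assign(is_train=False)],
                     ignore_index=True)
combined["years_exp"] = combined["Experience"].str.split().str[0].astype(int)

# Split back using the marker; test rows have no Fees, so that column is dropped.
train_out = combined[combined["is_train"]].drop(columns="is_train")
test_out = combined[~combined["is_train"]].drop(columns=["is_train", "Fees"])
```

The transformation is written once and both parts come back with the same derived columns.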


In [489]:
df=pd.read_excel("/Users/sd15068/Downloads/Final Participant Data Folder/Final_Train.xlsx")

In [490]:
df_test=pd.read_excel("/Users/sd15068/Downloads/Final Participant Data Folder/Final_Test.xlsx")

In [491]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
big_df = pd.concat([df, df_test])

### Feature Generation

All relevant variables are converted into features the model can use. For example, the years of experience are parsed into an integer ('years_exp'), and 'Qualification' is split into three string variables, 'Qual_1', 'Qual_2' and 'Qual_3'. Similar operations are applied to the other columns. Feature engineering is probably the most important step in the entire analysis.

In [492]:
big_df['years_exp'] = big_df['Experience'].str.slice(stop=2).astype(int)
big_df['City'] = big_df['Place'].str.split(',').str[1]
big_df['Locality'] = big_df['Place'].str.split(',').str[0]

big_df['Qual_1'] = big_df['Qualification'].str.split(',').str[0]
big_df['Qual_2'] = big_df['Qualification'].str.split(',').str[1]
big_df['Qual_3'] = big_df['Qualification'].str.split(',').str[2]

# Text before the first '%' in Miscellaneous_Info; values longer than 3 characters are zeroed
big_df['Misc'] = big_df['Miscellaneous_Info'].str.split('%').str[0]
big_df['Misc_len'] = big_df['Misc'].str.len()
big_df.loc[big_df['Misc_len']>3, 'Misc'] = 0
big_df['Misc'] = big_df['Misc'].fillna(0)
big_df['Misc'] = big_df['Misc'].astype(int)
# First token after the '% ' separator; tokens longer than 3 characters or equal to ',' are zeroed
big_df['Misc_2'] = big_df['Miscellaneous_Info'].str.split('% ').str[1]
big_df['Misc_3'] = big_df['Misc_2'].str.split(' ').str[0]
big_df['Misc_3'] = big_df['Misc_3'].fillna(0)
big_df['Misc_3_len'] = big_df['Misc_3'].str.len()
big_df.loc[big_df['Misc_3_len']>3, 'Misc_3'] = 0
big_df.loc[big_df['Misc_3']==',', 'Misc_3'] = 0
big_df['Misc_3'] = big_df['Misc_3'].astype(int)
# Interaction feature: percentage scaled by the log of the count
big_df['Misc_4'] = big_df['Misc']*np.log((1+big_df['Misc_3']))

# Missing ratings count as 0%; missing categorical values get the placeholder "XXX"
big_df['Rating'] = big_df['Rating'].fillna('0%')
for col in ['City','Locality','Qualification','Profile','Qual_1','Qual_2','Qual_3']:
    big_df[col] = big_df[col].fillna("XXX")
# Drop the trailing '%' and store the rating as an integer
big_df['Rating'] = big_df['Rating'].str.slice(stop=-1).astype(int)
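
The string handling above can be checked on a small frame before running it on the full data. The sketch below uses hypothetical rows in the same column layout; `str.extract` and `.str.strip()` are substitutions made here for illustration, not the calls used in the notebook.

```python
import pandas as pd

# Hypothetical rows that mimic the raw columns.
toy = pd.DataFrame({
    "Experience": ["24 years experience", "5 years experience", "40 years experience"],
    "Qualification": ["MBBS, MD - Dermatology", "BDS", "MBBS, MS - ENT, DNB"],
    "Rating": ["98%", None, "100%"],
})

# The leading digits of the experience string become an integer.
toy["years_exp"] = toy["Experience"].str.extract(r"(\d+)", expand=False).astype(int)

# Take the first three comma-separated parts; absent parts get the "XXX" placeholder.
parts = toy["Qualification"].str.split(",")
for i in range(3):
    toy[f"Qual_{i + 1}"] = parts.str[i].str.strip().fillna("XXX")

# Missing ratings are treated as 0%, then the percent sign is removed.
toy["Rating"] = toy["Rating"].fillna("0%").str.rstrip("%").astype(int)

print(toy[["years_exp", "Qual_1", "Qual_2", "Qual_3", "Rating"]])
```

Stripping whitespace keeps values such as " MD - Dermatology" and "MD - Dermatology" from being treated as different categories later.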

### Feature Selection

Create a new dataframe with the relevant features that would go into the ML model.

In [493]:
big_df = big_df.drop(columns=['Experience','Miscellaneous_Info','Place','Qualification','Misc_len','Misc_3_len'])

In [494]:
big_df = big_df[['Qual_1','Qual_2','Qual_3','years_exp', 'Rating','Profile','Locality','City','Misc','Misc_3','Fees']]

In [495]:
# The first len(df) rows (5961) are the training rows
df_train = big_df.iloc[:len(df)]

In [496]:
df_train.describe()

| | years_exp | Rating | Misc | Misc_3 | Fees |
|---|---|---|---|---|---|
| count | 5961.0 | 5961.0 | 5961.0 | 5961.0 | 5961.0 |
| mean | 17.303976 | 42.217245 | 23.556786 | 9.677906 | 307.94464 |
| std | 11.142798 | 47.340934 | 40.828486 | 39.358833 | 190.920373 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 |
| 25% | 9.0 | 0.0 | 0.0 | 0.0 | 150.0 |
| 50% | 14.0 | 0.0 | 0.0 | 0.0 | 300.0 |
| 75% | 23.0 | 96.0 | 56.0 | 1.0 | 500.0 |
| max | 66.0 | 100.0 | 100.0 | 854.0 | 950.0 |


In [497]:
# The remaining rows, after the first len(df) = 5961, are the test rows
df_test = big_df.iloc[len(df):]

In [498]:
df_test = df_test.drop(['Fees'], axis =1)

### Prepare categorical variables for XGBoost using label encoder

Internally, XGBoost treats every problem as a regression-style predictive modeling problem that accepts only numerical input. Data in any other form must be converted to numbers first.

To convert categorical text data into numerical data the model can use, we use the LabelEncoder class from the sklearn library: import the class, fit and transform each text column, and store the encoded values in a new '_code' column that is used in place of the text.
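
A minimal sketch of label encoding on toy data. The notebook fits the encoder on the test set and maps the codes onto the training set afterwards; the variant below instead fits on the union of both columns, which is an alternative chosen here for illustration. The series names and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical train and test columns sharing one categorical feature.
train_city = pd.Series(["Mumbai", "Delhi", "Pune"])
test_city = pd.Series(["Delhi", "Mumbai"])

# Fit on the union so every category in either set receives a code.
encoder = LabelEncoder()
encoder.fit(pd.concat([train_city, test_city]))

train_codes = encoder.transform(train_city)
test_codes = encoder.transform(test_city)

# classes_ is sorted, so codes follow alphabetical order of the labels.
print(list(encoder.classes_))
print(train_codes.tolist(), test_codes.tolist())
```

The codes are arbitrary integers; they carry an ordering that the categories themselves do not have, which tree-based models tolerate better than linear ones.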

In [499]:
from sklearn.preprocessing import LabelEncoder

In [500]:
lb_encode = LabelEncoder()
df_test["Qual_1_code"] = lb_encode.fit_transform(df_test["Qual_1"])
df_test["Qual_2_code"] = lb_encode.fit_transform(df_test["Qual_2"])
df_test["Qual_3_code"] = lb_encode.fit_transform(df_test["Qual_3"])
df_test["Profile_code"] = lb_encode.fit_transform(df_test["Profile"])
df_test["City_code"] = lb_encode.fit_transform(df_test["City"])
df_test["Locality_code"] = lb_encode.fit_transform(df_test["Locality"])

In [501]:
df_test.head()

| | Qual_1 | Qual_2 | Qual_3 | years_exp | Rating | Profile | Locality | City | Misc | Misc_3 | Qual_1_code | Qual_2_code | Qual_3_code | Profile_code | City_code | Locality_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MBBS | XXX | XXX | 35 | 0 | General Medicine | Ghatkopar East | Mumbai | 0 | 0 | 48 | 243 | 195 | 4 | 6 | 142 |
| 1 | MBBS | Diploma in Otorhinolaryngology (DLO) | XXX | 31 | 0 | ENT Specialist | West Marredpally | Hyderabad | 0 | 0 | 48 | 79 | 195 | 3 | 5 | 577 |
| 2 | MBBS | DDVL | XXX | 40 | 70 | Dermatologists | KK Nagar | Chennai | 70 | 4 | 48 | 28 | 195 | 2 | 1 | 212 |
| 3 | BAMS | XXX | XXX | 0 | 0 | Ayurveda | New Ashok Nagar | Delhi | 0 | 0 | 1 | 243 | 195 | 0 | 3 | 373 |
| 4 | BDS | MDS - Conservative Dentistry & Endodontics | XXX | 16 | 100 | Dentist | Kanakpura Road | Bangalore | 0 | 0 | 2 | 157 | 195 | 1 | 0 | 231 |


In [502]:
df_test_merge_1 = df_test[['Qual_1','Qual_1_code']].drop_duplicates()
df_test_merge_2 = df_test[['Qual_2','Qual_2_code']].drop_duplicates()
df_test_merge_3 = df_test[['Qual_3','Qual_3_code']].drop_duplicates()
df_test_merge_4 = df_test[['Profile','Profile_code']].drop_duplicates()
df_test_merge_5 = df_test[['City','City_code']].drop_duplicates()
df_test_merge_6 = df_test[['Locality','Locality_code']].drop_duplicates()

### Merge features from test data to train data

In [503]:
df_train = pd.merge(df_train,df_test_merge_1[['Qual_1','Qual_1_code']],on='Qual_1', how='left')
df_train = pd.merge(df_train,df_test_merge_2[['Qual_2','Qual_2_code']],on='Qual_2', how='left')
df_train = pd.merge(df_train,df_test_merge_3[['Qual_3','Qual_3_code']],on='Qual_3', how='left')
df_train = pd.merge(df_train,df_test_merge_4[['Profile','Profile_code']],on='Profile', how='left')
df_train = pd.merge(df_train,df_test_merge_5[['City','City_code']],on='City', how='left')
df_train = pd.merge(df_train,df_test_merge_6[['Locality','Locality_code']],on='Locality', how='left')
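
A left merge keeps every training row, but a category that never occurs in the test data has no entry in the lookup table and therefore receives a missing code. The sketch below shows this on hypothetical data; counting the unmatched rows is a quick check worth running after merges like the ones above. XGBoost treats NaN as a missing value by default, so such rows do not raise an error, but the category information is lost for them.

```python
import pandas as pd

# Hypothetical lookup built from test data, as in the drop_duplicates step.
lookup = pd.DataFrame({"City": ["Delhi", "Mumbai"], "City_code": [0, 1]})
train = pd.DataFrame({"City": ["Mumbai", "Pune", "Delhi"]})

merged = train.merge(lookup, on="City", how="left")

# Rows whose category never appears in the lookup get NaN codes.
unmatched = int(merged["City_code"].isna().sum())
print(merged)
print("rows without a code:", unmatched)
```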

In [504]:
df_xgb = df_train[['Qual_1_code','Qual_2_code','Qual_3_code','years_exp', 'Rating','Profile_code','Locality_code','City_code','Misc','Misc_3','Fees']]

### Create X and Y datasets

In [505]:
X = df_xgb.drop(['Fees'], axis=1)
y = df_xgb.Fees

### Import XGBoost

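Convert the dataset into a DMatrix, an optimized data structure that XGBoost supports and that underlies its performance and efficiency gains. The scikit-learn style regressor used in the later steps takes X and y directly, so data_dmatrix itself is not used again in this notebook.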

In [506]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

In [507]:
data_dmatrix = xgb.DMatrix(data=X,label=y)

### Create the train and test sets for validation

Train and test sets for validating the results are created with the train_test_split function from sklearn's model_selection module, holding out 30% of the data (test_size=0.3). This is a single hold-out split rather than k-fold cross-validation. A random_state is assigned so that the results are reproducible.

In [508]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
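
To illustrate what test_size and random_state do, here is a small sketch on hypothetical arrays (the `_demo` names are not part of the notebook). With 20 rows and test_size=0.3, 6 rows are held out, and repeating the call with the same random_state gives the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 features.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two calls with the same random_state produce the same partition.
first = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)
second = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)
print(np.array_equal(first[1], second[1]))
```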

### XGBoost Regressor

The next step is to instantiate an XGBoost regressor object by calling the XGBRegressor() class from the XGBoost library, with the hyper-parameters passed as arguments. For a classification problem, you would use the XGBClassifier() class instead.

In [509]:
# 'reg:linear' is the deprecated name of the squared-error objective
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1, max_depth = 5, n_estimators = 10)

### Fit the regressor
Fit the regressor to the training set and make predictions on the test set using the familiar .fit() and .predict() methods.

In [510]:
xg_reg.fit(X_train,y_train)

preds = xg_reg.predict(X_test)

### RMSE

Compute the RMSE (root mean squared error) by taking the square root of the value returned by the mean_squared_error function from sklearn's metrics module.

In [511]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

RMSE: 212.685080
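
For context, the RMSE above is larger than the standard deviation of Fees reported by describe() for the training data (190.92). The two numbers come from different subsets, so the comparison is only indicative, but it suggests checking the model against a baseline that always predicts the mean. The sketch below computes RMSE by hand and compares a model with such a baseline; all values are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets and predictions.
y_demo = np.array([100.0, 300.0, 500.0])
model_preds = np.array([150.0, 250.0, 450.0])

# Baseline: predict the mean of the targets for every row.
baseline_preds = np.full_like(y_demo, y_demo.mean())

model_rmse = rmse(y_demo, model_preds)
baseline_rmse = rmse(y_demo, baseline_preds)
print(model_rmse, baseline_rmse)
```

A model is only useful if its RMSE is clearly below that of the baseline on the same rows.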


In [512]:
df_test_xgb = df_test[['Qual_1_code','Qual_2_code','Qual_3_code','years_exp', 'Rating','Profile_code','Locality_code','City_code','Misc','Misc_3']]

### Final Prediction

Use the model created to predict the Fees for test data.

In [513]:
preds_1 = xg_reg.predict(df_test_xgb)

In [514]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_test_xgb = df_test_xgb.assign(Fees=preds_1)


In [515]:
df_test_xgb.to_csv('submission23.csv')