Problem Statement

At Thumbtack, we strive to help customers get projects done by connecting them with the right local service providers (referred to as "pros" hereafter). And, of course, we use a lot of machine learning algorithms to achieve that goal at scale. Over time, we have gathered a great deal of information from both customers and pros about how they interact with each other. One question we are particularly interested in is predicting which quote will be contacted by a customer for a request (more details below). In this challenge, you will get the chance to showcase your shining machine learning skills and help us build a predictive model for future requests using a (fictional) dataset.

As context, here is a 10,000-foot view of the Thumbtack product. A customer comes to Thumbtack and submits a request for what he/she wants to accomplish. Based on that request, Thumbtack matches the customer with some number of pros by sending them an invite to quote on the request. A subset of the invited pros would be interested in this request and, thus, submit a quote to the customer. The customer receives a number of quotes (with estimated price, past reviews of the pro, profile of the pro and other information) and decides whether to contact a particular quote or not. A simple diagram might help you to understand it better (apparently, good drawing ability is not the most critical skill we are looking for). Note that we use (i), i = 1, .., 5, to denote the sequence of these events.


Each row represents a quote, and the response variable to be predicted is the column 'contacted', which indicates whether the quote was contacted by the customer (1) or not (0). The goal is to build a powerful machine learning model that predicts whether a future quote will be contacted by the customer.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
givenData = pd.read_csv('/Users/siddharthachandra/Documents/programmingProjects/ThumbtackAssignment/Thumbtack_challenge.csv')
print(givenData.describe())
print(givenData.shape)

          request_id     service_id       quote_id  has_profile_picture  \
count  107646.000000  107646.000000  107646.000000        107646.000000   
mean    21233.040930    7931.734816   53823.500000             0.981504   
std     12160.240623    4584.307559   31074.867876             0.134736   
min         1.000000       1.000000       1.000000             0.000000   
25%     10756.250000    3942.000000   26912.250000             1.000000   
50%     21240.500000    7983.500000   53823.500000             1.000000   
75%     31632.750000   11810.000000   80734.750000             1.000000   
max     42431.000000   16277.000000  107646.000000             1.000000   

       description_length     num_photos    num_reviews   num_licenses  \
count       107646.000000  107646.000000  107646.000000  107646.000000   
mean           418.838675      10.507599      36.677749       0.309449   
std            720.826076      17.030634      61.768952       0.673713   
min              0.000000   



In [3]:
#Analysis:
#price_estimate - has null values, corresponding to 'price_other'...need to equalize those...choosing 0.0  
#Response variable - data skew not present
#category - 39 unique values
# 107646 records
# unique request_ids - 42431
# unique platform - 5
#Need feature scaling for: description length, num_photos, num_reviews, price_estimate, service_distance, minutes_since_request, quote_msg_length
#Need binarization of categorical variables -> {category, platform, price_type}

givenData.columns

Index([u'request_id', u'service_id', u'quote_id', u'category', u'platform',
       u'has_profile_picture', u'description_length', u'num_photos',
       u'num_reviews', u'num_licenses', u'num_websites', u'avg_rating',
       u'price_type', u'price_estimate', u'service_distance',
       u'minutes_since_request', u'quote_message_length', u'contacted'],
      dtype='object')

In [6]:
#Analysis -> 'Contacted/not contacted' counts grouped by minutes_since_request
Y1_category = givenData[(givenData['contacted'] ==1)].groupby(['minutes_since_request'])['contacted'].count()
Y0_category = givenData[(givenData['contacted'] ==0)].groupby(['minutes_since_request'])['contacted'].count()
#.groups.keys()
cats_1 = Y1_category.sort_values(ascending = False)
cats_0 = Y0_category.sort_values(ascending = False)

print (cats_1)
print (cats_0)

minutes_since_request
2       2588
3       2255
4       1928
1       1763
5       1638
6       1391
7       1281
8       1153
9        937
10       894
11       858
12       757
13       694
14       646
15       579
16       529
17       518
18       480
19       475
20       411
22       409
21       381
23       372
26       338
24       327
28       317
25       316
27       292
30       271
29       258
        ... 
2207       1
2213       1
2087       1
2217       1
2225       1
2233       1
2234       1
2240       1
2255       1
2258       1
2163       1
2157       1
2155       1
2152       1
2088       1
2089       1
2090       1
2091       1
2093       1
2111       1
2117       1
2118       1
2121       1
2124       1
2132       1
2136       1
2137       1
2140       1
2141       1
4490       1
Name: contacted, dtype: int64
minutes_since_request
2       2217
3       2007
4       1828
5       1514
6       1418
1       1389
7       1253
8       1078
9        938
10       928
11 
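The two printed series are easier to compare as a contact rate per minute. The sketch below is a worked calculation rather than part of the original analysis: it copies the counts for minutes 1 to 10 from the output above and divides contacted by the total for each minute. The dictionary names are illustrative.

```python
# Counts copied from the printed output above (minutes 1-10 only).
contacted = {1: 1763, 2: 2588, 3: 2255, 4: 1928, 5: 1638,
             6: 1391, 7: 1281, 8: 1153, 9: 937, 10: 894}
not_contacted = {1: 1389, 2: 2217, 3: 2007, 4: 1828, 5: 1514,
                 6: 1418, 7: 1253, 8: 1078, 9: 938, 10: 928}

# Share of quotes that were contacted, per value of minutes_since_request.
contact_rate = {}
for minute in sorted(contacted):
    total = contacted[minute] + not_contacted[minute]
    contact_rate[minute] = contacted[minute] / float(total)

for minute in sorted(contact_rate):
    print("minute %2d: contact rate %.3f" % (minute, contact_rate[minute]))
```

On these ten values the rate is highest for quotes sent within the first minute and drifts toward one half afterwards, which suggests minutes_since_request carries some signal.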

In [9]:
#Randomize dataset and do a 60-20-20 split before any processing
df_randomized = givenData

#taking care of NaN values here / also try removing them completely
#df_randomized = df_randomized.fillna(0.0)
df_randomized = df_randomized[~df_randomized['price_estimate'].isnull()]

df_randomized = df_randomized.reindex(np.random.permutation(df_randomized.index))

In [18]:
#Convert category variable to discrete values
from sklearn import preprocessing

le_category = preprocessing.LabelEncoder()

#to convert into numbers
df_randomized['categoryNumber'] = le_category.fit_transform(df_randomized.category)

#adding a feature that gives us the product of rating and number of reviews
df_randomized['ratingComposite'] = df_randomized.avg_rating * df_randomized.num_reviews

df_randomized
#to convert back
#df_randomized.cat = le_category.inverse_transform(df_randomized.category)

Unnamed: 0,request_id,service_id,quote_id,category,platform,has_profile_picture,description_length,num_photos,num_reviews,num_licenses,num_websites,avg_rating,price_type,price_estimate,service_distance,minutes_since_request,quote_message_length,contacted,categoryNumber,ratingComposite
103397,40712,15903,103394,House Cleaning (One Time),computer,1,0,0,2,0,1,5.0,price_fixed,170.0,10.849617,104,636,0,18,10.0
107438,42350,10035,107552,House Cleaning (One Time),phone,1,0,0,26,0,0,5.0,price_fixed,120.0,1.354488,2352,342,0,18,130.0
101526,39952,7368,101154,Apartment Cleaning,computer,1,262,8,74,1,0,4.8,price_fixed,130.0,10.506780,38,86,1,0,355.2
13469,5354,14436,13261,House Cleaning (Recurring),computer,1,0,0,0,0,0,0.0,price_fixed,95.0,14.919060,128,528,0,19,0.0
11256,4484,10909,10925,Commercial Cleaning,computer,1,0,5,3,0,1,5.0,price_fixed,1200.0,3.942305,46,213,1,6,15.0
103266,40665,3403,103464,House Cleaning (Recurring),computer,0,127,0,0,0,0,0.0,price_fixed,60.0,9.007733,144,240,1,19,0.0
102596,40390,10468,102189,Carpet Cleaning,native,1,0,1,12,0,1,4.8,price_fixed,120.0,8.865340,2,959,1,3,57.6
104822,41297,4477,107628,House Cleaning (One Time),computer,1,292,26,39,0,1,4.7,price_fixed,150.0,33.985644,4039,22,0,18,183.3
55634,21975,5978,54868,Event DJ,phone,1,703,26,11,0,0,5.0,price_fixed,250.0,844.984508,4,103,0,11,55.0
86115,33803,11852,85464,Carpet Cleaning,computer,1,0,22,17,0,1,4.9,price_fixed,199.0,14.608757,5,2045,0,3,83.3


In [42]:
#plot analysis of data:
#Based on scatter plot analysis, it seems like there is no clear trend observed for splitting the data into the 2 output categories
%matplotlib
jet=plt.get_cmap('coolwarm')
relevant = ['ratingComposite', 'categoryNumber', 'price_type', 'avg_rating', 'num_websites', 'price_estimate', 'quote_message_length', 'service_distance', 'has_profile_picture', 'num_reviews', 'num_licenses','num_photos', 'platform','category','description_length', 'minutes_since_request', 'contacted']
df_reduced = df_randomized[relevant]
x = df_reduced['categoryNumber']
y = df_reduced['price_estimate']
z = df_reduced['contacted']
labels = [0,1]

plt.xlabel('categoryNumber')
plt.ylabel('price_estimate')
plt.scatter(x, y, s=100, c=z, cmap=jet)

cb=plt.colorbar(ticks=np.array(labels))
cb.set_ticklabels(labels)

Using matplotlib backend: MacOSX


In [43]:
# training set -> used to fit the model
# test set -> held out; used only to calculate the final performance of the classifier
# validation set -> used for selecting the model family
splitAt = int(0.2 * len(df_reduced))

test = df_reduced[0:splitAt]
cv_trainTest = df_reduced[splitAt:]
cv_test = cv_trainTest[0:splitAt]
cv_train = cv_trainTest[splitAt:]
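As a quick check on the slicing above, the split sizes can be worked out by hand. This is a sketch under one assumption: the row count 97171 is not printed anywhere in the notebook, it is derived by adding len(cv_trainTest) = 77737 and the test-set support of 19434 that the notebook reports further down. The helper name split_sizes is illustrative.

```python
def split_sizes(n_rows, holdout_fraction=0.2):
    """Return (train, validation, test) sizes for the slicing used above."""
    split_at = int(holdout_fraction * n_rows)
    n_test = split_at                      # df_reduced[0:splitAt]
    n_rest = n_rows - split_at             # df_reduced[splitAt:]
    n_validation = min(split_at, n_rest)   # cv_trainTest[0:splitAt]
    n_train = n_rest - n_validation        # cv_trainTest[splitAt:]
    return n_train, n_validation, n_test

# Assumed total: 77737 + 19434 (derived, not printed in the notebook).
n_train, n_validation, n_test = split_sizes(97171)
print("train=%d validation=%d test=%d" % (n_train, n_validation, n_test))
```

With these numbers the three parts come out at roughly 60%, 20% and 20% of the rows, matching the intended 60-20-20 split.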

In [44]:
#   3 stages to preprocess training set:
#1. fillNA for Price_estimate(done already)
#2. Normalize and scale values of x
#3. Binarize categorical variables
len(cv_trainTest)

77737

In [54]:
#model selection
from sklearn.cross_validation import KFold
# from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

output = [u'contacted']
categoricalVariables = ['category', 'platform', 'price_type']
numericalFeatures = ['ratingComposite', 'description_length','num_photos', 'num_reviews', 'num_licenses','num_websites','avg_rating', 'price_estimate', 'service_distance', 'minutes_since_request', 'quote_message_length']
features = numericalFeatures + categoricalVariables

cv_trainTestSmall = cv_trainTest[0:5000]

clf_logisticRegression = LogisticRegression(penalty='l2', C=0.00000001,random_state = 10)   #logistic regression 0.01, 0.1 --54.355
clf_svmLinear = SVC(C=0.01, kernel = 'linear', probability=True)
clf_svmPoly = SVC(C=1.0, degree = 5, kernel = 'poly', probability=True)
#clf_svmRbf = SVC(C=1.0, gamma = 1.0)
clf_randomForest = RandomForestClassifier(n_estimators=100, n_jobs=2)

#clfs = [clf_logisticRegression]
clfs = [clf_logisticRegression, clf_svmLinear, clf_svmPoly, clf_randomForest]

Y = cv_trainTestSmall[output]
X = cv_trainTestSmall[features]

kf = KFold(len(Y), n_folds=5)
scores = []

for idx in range(0,len(clfs)):
    #implement cross-validation
    cv_scores = []
    for train_index, test_index in kf:
        #pre-process train
        X_trainRaw = cv_trainTestSmall.iloc[train_index]
                    
        #1. Normalize and scale values of x
        for n in numericalFeatures:  
            X_trainRaw.loc[:, n] = (X_trainRaw[n] - X_trainRaw[n].mean()) / (X_trainRaw[n].max() - X_trainRaw[n].min())
        
        #2. Binarize categorical variables
        X_trainProcessed = X_trainRaw
        for c in categoricalVariables:
            X_trainProcessed.loc[:,c] = X_trainProcessed[c].astype('category')
                
        X_trainProcessed = pd.get_dummies(X_trainProcessed)
        categoryFeaturesTrain = X_trainProcessed.columns.difference(X_trainRaw.columns)
        features = categoryFeaturesTrain.union(numericalFeatures)    
        X_train = X_trainProcessed[features].values
        
        #pre-process test
        X_testRaw = cv_trainTestSmall.iloc[test_index]
        
        #1. Normalize and scale values of x
        #   (X_trainRaw[n] was already rescaled above, so its mean is ~0 and its range is 1 here)
        for n in numericalFeatures:  
            X_testRaw.loc[:, n] = (X_testRaw[n] - X_trainRaw[n].mean()) / (X_trainRaw[n].max() - X_trainRaw[n].min())
    
        #2. Binarize categorical variables
        X_testProcessed = X_testRaw
        for c in categoricalVariables:
            X_testProcessed.loc[:,c] = X_testProcessed[c].astype('category')

        X_testProcessed = pd.get_dummies(X_testProcessed)
        diff = categoryFeaturesTrain.difference(X_testProcessed.columns)
        
        #This adds binarized_categorical columns with 0.0 for all binarized_categorical variables that are absent in test set but present in training set
        for k in diff:
            if (k not in X_testProcessed.columns):
                X_testProcessed[k] = 0.0
                
        X_test = X_testProcessed[features].values
        Y_train, Y_test = Y.iloc[train_index].values.ravel(), Y.iloc[test_index].values.ravel()      
        
        clf = clfs[idx].fit(X_train, Y_train)
        score = clf.score(X_test, Y_test)
        cv_scores.append(score)
    scores.append(np.mean(cv_scores))

print(scores)
highScoreIndex = np.argmax(scores)
print(highScoreIndex)
clf = clfs[highScoreIndex]

[0.55080000000000007, 0.5454, 0.48499999999999999, 0.52539999999999998]
0
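One detail in the scaling steps above is worth spelling out: the statistics applied to the test fold are read from X_trainRaw after its columns have been overwritten with scaled values, so the subtraction and division have almost no effect on the test fold. A minimal library-free sketch of the alternative is shown below, where the training mean and range are captured first and then reused. The function names and the price values are hypothetical, chosen only for illustration.

```python
def fit_mean_range(values):
    """Capture the mean and (max - min) range of a training column."""
    mean = sum(values) / float(len(values))
    value_range = max(values) - min(values)
    return mean, value_range

def apply_mean_range(values, mean, value_range):
    """Scale values with previously captured training statistics."""
    if value_range == 0:
        return [0.0 for _ in values]   # constant column: avoid dividing by zero
    return [(v - mean) / value_range for v in values]

# Hypothetical price_estimate values for one fold.
train_prices = [60.0, 95.0, 120.0, 170.0, 1200.0]
test_prices = [130.0, 250.0]

# Fit on the raw training values, then apply the same numbers to both folds.
mean, value_range = fit_mean_range(train_prices)
train_scaled = apply_mean_range(train_prices, mean, value_range)
test_scaled = apply_mean_range(test_prices, mean, value_range)
print(test_scaled)
```

Keeping the fitted statistics in separate variables means the order in which the folds are transformed no longer matters.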


In [55]:
#Retrain the selected model on the 5,000-row training+validation subset (cv_trainTestSmall)

X_trainRaw = cv_trainTestSmall
Y_trainRaw = cv_trainTestSmall[output]
Y_testRaw = test[output]

#pre-process train
#1. Normalize and scale values of x
for n in numericalFeatures:  
    X_trainRaw.loc[:, n] = (X_trainRaw[n] - X_trainRaw[n].mean()) / (X_trainRaw[n].max() - X_trainRaw[n].min())

#2. Binarize categorical variables
X_trainProcessed = X_trainRaw
for c in categoricalVariables:
    X_trainProcessed.loc[:,c] = X_trainProcessed[c].astype('category')
                
X_trainProcessed = pd.get_dummies(X_trainProcessed)
categoryFeaturesTrain = X_trainProcessed.columns.difference(X_trainRaw.columns)
features = categoryFeaturesTrain.union(numericalFeatures)    
X_train = X_trainProcessed[features].values    



#pre-process test
X_testRaw = test
        
#1. Normalize and scale values of x
for n in numericalFeatures:  
    X_testRaw.loc[:, n] = (X_testRaw[n] - X_trainRaw[n].mean()) / (X_trainRaw[n].max() - X_trainRaw[n].min())
    
#2. Binarize categorical variables
X_testProcessed = X_testRaw
for c in categoricalVariables:
    X_testProcessed.loc[:,c] = X_testProcessed[c].astype('category')

X_testProcessed = pd.get_dummies(X_testProcessed)
diff = categoryFeaturesTrain.difference(X_testProcessed.columns)

#This adds binarized_categorical columns with 0.0 for all binarized_categorical variables that are absent in test set but present in training set
for k in diff:
    if (k not in X_testProcessed.columns):
        X_testProcessed[k] = 0.0

X_test = X_testProcessed[features].values

Y_train, Y_test = Y_trainRaw.values.ravel(), Y_testRaw.values.ravel()      
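The column alignment in the cell above follows one rule: the training data defines the set of dummy columns, levels missing from the test data become all-zero columns, and levels seen only in the test data are dropped when `features` is selected. The sketch below restates that rule without pandas. The helper names and the three-row example are hypothetical.

```python
def fit_dummy_columns(train_rows, column):
    """List the one-hot column names produced by the training rows."""
    levels = sorted(set(row[column] for row in train_rows))
    return ["%s_%s" % (column, level) for level in levels]

def encode(rows, column, dummy_columns):
    """One-hot encode rows against a fixed list of training columns."""
    encoded = []
    for row in rows:
        name = "%s_%s" % (column, row[column])
        # A level unseen in training matches no column and yields all zeros.
        encoded.append([1.0 if name == col else 0.0 for col in dummy_columns])
    return encoded

train_rows = [{"platform": "computer"}, {"platform": "phone"}, {"platform": "native"}]
test_rows = [{"platform": "phone"}, {"platform": "tablet"}]  # 'tablet' is unseen in training

dummy_columns = fit_dummy_columns(train_rows, "platform")
X_test_platform = encode(test_rows, "platform", dummy_columns)
print(dummy_columns)
print(X_test_platform)
```

Fixing the column list at training time keeps the feature matrix the same width and order for every fold and for the final test set.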

In [56]:
#get performance on test set
model = clf.fit(X_train, Y_train)
predictions = model.predict(X_test)
model.score(X_test, Y_test)

0.54039312545024187

In [57]:
#performance metrics of classifier
from sklearn import metrics

print(metrics.classification_report(Y_test, predictions))
print(metrics.confusion_matrix(Y_test, predictions))

             precision    recall  f1-score   support

          0       0.59      0.46      0.52     10344
          1       0.51      0.63      0.56      9090

avg / total       0.55      0.54      0.54     19434

[[4758 5586]
 [3346 5744]]
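
The figures in the classification report can be reproduced by hand from the confusion matrix, which is a useful sanity check. The sketch below assumes the scikit-learn convention that rows are true labels and columns are predicted labels. The helper name precision_recall is illustrative.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
confusion = [[4758, 5586],
             [3346, 5744]]

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / float(total)

def precision_recall(matrix, label):
    """Precision and recall for one class of a square confusion matrix."""
    true_positive = matrix[label][label]
    predicted = sum(row[label] for row in matrix)   # column sum
    actual = sum(matrix[label])                     # row sum (the support)
    return true_positive / float(predicted), true_positive / float(actual)

p0, r0 = precision_recall(confusion, 0)
p1, r1 = precision_recall(confusion, 1)
print("accuracy %.4f" % accuracy)
print("class 0: precision %.2f recall %.2f" % (p0, r0))
print("class 1: precision %.2f recall %.2f" % (p1, r1))
```

The accuracy computed this way equals the value returned by model.score on the test set, and the per-class values round to the ones in the report.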


In [60]:
#Source: http://stackoverflow.com/questions/28719067/roc-curve-and-cut-off-point-python
def Find_Optimal_Cutoff(target, predicted):
    """ Find the optimal probability cutoff point for a classification model related to event rate
    Parameters
    ----------
    target : Matrix with dependent or target data, where rows are observations

    predicted : Matrix with predicted data, where rows are observations

    Returns
    -------     
    list type, with optimal cutoff value

    """
    fpr, tpr, threshold = roc_curve(target, predicted)
    i = np.arange(len(tpr)) 
    roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)})
    roc_t = roc.iloc[(roc.tf-0).abs().argsort().values[:1]]

    return list(roc_t['threshold']) 


#plot ROC
from sklearn.metrics import roc_curve, auc

actual = Y_test
allPredictions = model.predict_proba(X_test)
predictions = [i[1] for i in allPredictions]   #column 1 = P(contacted == 1), the score roc_curve expects

false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)

# determine optimal threshold
optimalThreshold = Find_Optimal_Cutoff(actual, predictions)

plt.title('ROC')
plt.plot(false_positive_rate, true_positive_rate, 'b',
         label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#An AUC near 0.5 means the ranking is no better than chance; an AUC below 0.5 means the scores rank the two classes in reverse
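
The column of predict_proba that is passed to roc_curve matters. scikit-learn orders the columns by model.classes_, so with labels 0 and 1 the first column holds the probability of class 0, while roc_curve expects a score for the positive class. The sketch below uses a rank-based AUC on a small hypothetical sample to show that scoring with the class-0 probability mirrors the AUC around 0.5. The function name rank_auc and the toy values are illustrative.

```python
def rank_auc(labels, scores):
    """AUC as the chance that a positive outscores a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities of class 1.
labels = [0, 0, 1, 1, 0, 1]
prob_class1 = [0.2, 0.4, 0.7, 0.6, 0.5, 0.3]
prob_class0 = [1.0 - p for p in prob_class1]

auc_class1 = rank_auc(labels, prob_class1)
auc_class0 = rank_auc(labels, prob_class0)
print("AUC with P(class 1): %.3f" % auc_class1)
print("AUC with P(class 0): %.3f" % auc_class0)
```

The two values sum to one, so an AUC below 0.5 on a binary problem is often a sign that the wrong probability column was used rather than a sign that the model has learned nothing.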

In [61]:
X_testProcessed[features].columns

Index([u'avg_rating', u'category_Apartment Cleaning',
       u'category_Bar or Bat Mitzvah DJ',
       u'category_Basement or Attic Cleaning', u'category_Carpet Cleaning',
       u'category_Chimney Cleaning', u'category_Cleaning Out',
       u'category_Commercial Cleaning', u'category_Deep or Spring Cleaning',
       u'category_Dryer Vent Cleaning', u'category_Duct and Vent Cleaning',
       u'category_EDM or House Music DJ', u'category_Event DJ',
       u'category_Fireplace and Chimney Cleaning', u'category_Floor Cleaning',
       u'category_Garage Cleaning',
       u'category_Gutter Cleaning and Maintenance',
       u'category_House Cleaning (One Time)',
       u'category_House Cleaning (Recurring)', u'category_Mattress Cleaning',
       u'category_Move-in or Move-out Cleaning',
       u'category_Office Cleaning (One Time)',
       u'category_Office Cleaning (Recurring)',
       u'category_Outdoor or Balcony Cleaning', u'category_Quinceanera DJ',
       u'category_Roof Cleaning', u'c

In [68]:
#show perf wrt category, platform and priceType
#Get performance figures on the test Set for different categories:

#A. Total number of records in test set

#B. Price_type:
    #a. price_type_price_fixed
    #b. price_type_price_hourly
    
#C. Platform:
    #a. platform_other
    #b. platform_computer
    #c. platform_native
    #d. platform_phone
    #e. platform_tablet
        
#D. categories: (skipping for now)
    
totalCount = len(Y_test)
X_testProcessed[features]
# platform_other_count = [k for k in X_testProcessed[features]]
# platform_other_count


Unnamed: 0,avg_rating,category_Apartment Cleaning,category_Bar or Bat Mitzvah DJ,category_Basement or Attic Cleaning,category_Carpet Cleaning,category_Chimney Cleaning,category_Cleaning Out,category_Commercial Cleaning,category_Deep or Spring Cleaning,category_Dryer Vent Cleaning,...,platform_native,platform_other,platform_phone,platform_tablet,price_estimate,price_type_price_fixed,price_type_price_hourly,quote_message_length,ratingComposite,service_distance
103397,5.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,170.00,1.0,0.0,636.0,1.000000e+01,1.084962e+01
107438,5.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,120.00,1.0,0.0,342.0,1.300000e+02,1.354488e+00
101526,4.800000e+00,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,130.00,1.0,0.0,86.0,3.552000e+02,1.050678e+01
13469,1.796785e-16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,95.00,1.0,0.0,528.0,-1.851852e-17,1.491906e+01
11256,5.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1200.00,1.0,0.0,213.0,1.500000e+01,3.942305e+00
103266,1.796785e-16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,60.00,1.0,0.0,240.0,-1.851852e-17,9.007733e+00
102596,4.800000e+00,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,120.00,1.0,0.0,959.0,5.760000e+01,8.865340e+00
104822,4.700000e+00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,150.00,1.0,0.0,22.0,1.833000e+02,3.398564e+01
55634,5.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,250.00,1.0,0.0,103.0,5.500000e+01,8.449845e+02
86115,4.900000e+00,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,199.00,1.0,0.0,2045.0,8.330000e+01,1.460876e+01
