Problem Statement

At Thumbtack, we strive to help customers get projects done by connecting them with the right local service providers (referred to as "pros" hereafter). And, of course, we use a lot of machine learning algorithms to achieve that goal at scale. Over time, we have gathered a great deal of information from both customers and pros about how they interact with each other. One particular question that we are interested in is predicting which quote will get contacted by a customer for a request (more details below). In this challenge, you will get the chance to showcase your shining machine learning skills and help us build a predictive model for future requests using a (fictional) dataset.

As context, here is a 10,000-foot view of the Thumbtack product. A customer comes to Thumbtack and submits a request for what he/she wants to accomplish. Based on that request, Thumbtack matches the customer with some number of pros by sending them an invite to quote on the request. A subset of the invited pros will be interested in the request and, thus, submit a quote to the customer. The customer receives a number of quotes (with estimated price, past reviews of the pro, profile of the pro and other information) and decides whether to contact a particular quote or not. A simple diagram might help you to understand it better (apparently, good drawing ability is not the most critical skill we are looking for). Note that we use (i), i = 1, .., 5, to denote the sequence of these events.


Each row represents a quote, and the response variable to be predicted is the column 'contacted', which indicates whether the quote was contacted by the customer (1) or not (0). The goal is to build a powerful machine learning model to predict whether a future quote will be contacted by the customer.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [4]:
givenData = pd.read_csv('/Users/siddharthachandra/Documents/programmingProjects/ThumbtackAssignment/Thumbtack_challenge.csv')
print(givenData.describe())
print(givenData.shape)

          request_id     service_id       quote_id  has_profile_picture  \
count  107646.000000  107646.000000  107646.000000        107646.000000   
mean    21233.040930    7931.734816   53823.500000             0.981504   
std     12160.240623    4584.307559   31074.867876             0.134736   
min         1.000000       1.000000       1.000000             0.000000   
25%     10756.250000    3942.000000   26912.250000             1.000000   
50%     21240.500000    7983.500000   53823.500000             1.000000   
75%     31632.750000   11810.000000   80734.750000             1.000000   
max     42431.000000   16277.000000  107646.000000             1.000000   

       description_length     num_photos    num_reviews   num_licenses  \
count       107646.000000  107646.000000  107646.000000  107646.000000   
mean           418.838675      10.507599      36.677749       0.309449   
std            720.826076      17.030634      61.768952       0.673713   
min              0.000000   



In [5]:
#Analysis:
#price_estimate - has null values...need to impute those
#Response variable (contacted) - classes are roughly balanced, no significant skew
#category - 39 unique values
# 107646 records
# unique request_ids - 42431
# unique platform - 5
#Need feature scaling for: description_length, num_photos, num_reviews, price_estimate, service_distance, minutes_since_request, quote_message_length
#Need binarization of categorical variables -> {category, platform, price_type}
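
The observations above (missing values, class balance, number of distinct levels) can be checked with a few pandas calls. The following is a minimal sketch on a toy frame; the frame `toy` and its values are made up for illustration and are not the challenge data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for givenData (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    'category': ['Event DJ', 'Event DJ', 'Window Cleaning', 'Window Cleaning'],
    'platform': ['phone', 'tablet', 'phone', 'computer'],
    'price_estimate': [400.0, np.nan, 130.0, 120.0],
    'contacted': [1, 0, 0, 1],
})

null_counts = toy.isnull().sum()          # missing values per column
contact_rate = toy['contacted'].mean()    # share of positives; near 0.5 means little skew
n_categories = toy['category'].nunique()  # number of distinct categories
n_platforms = toy['platform'].nunique()   # number of distinct platforms

print(null_counts)
print(contact_rate, n_categories, n_platforms)
```

Running the same calls on `givenData` should reproduce the counts listed in the analysis comments, assuming the column names match.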


In [6]:
#Analysis -> counts of contacted / not contacted quotes, grouped by minutes_since_request
Y1_category = givenData[(givenData['contacted'] ==1)].groupby(['minutes_since_request'])['contacted'].count()
Y0_category = givenData[(givenData['contacted'] ==0)].groupby(['minutes_since_request'])['contacted'].count()
#.groups.keys()
cats_1 = Y1_category.sort_values(ascending = False)
cats_0 = Y0_category.sort_values(ascending = False)

print (cats_1)
print (cats_0)

minutes_since_request
2       2588
3       2255
4       1928
1       1763
5       1638
6       1391
7       1281
8       1153
9        937
10       894
11       858
12       757
13       694
14       646
15       579
16       529
17       518
18       480
19       475
20       411
22       409
21       381
23       372
26       338
24       327
28       317
25       316
27       292
30       271
29       258
        ... 
2207       1
2213       1
2087       1
2217       1
2225       1
2233       1
2234       1
2240       1
2255       1
2258       1
2163       1
2157       1
2155       1
2152       1
2088       1
2089       1
2090       1
2091       1
2093       1
2111       1
2117       1
2118       1
2121       1
2124       1
2132       1
2136       1
2137       1
2140       1
2141       1
4490       1
Name: contacted, dtype: int64
minutes_since_request
2       2217
3       2007
4       1828
5       1514
6       1418
1       1389
7       1253
8       1078
9        938
10       928
11 
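
Raw counts per minute are hard to compare because there are thousands of distinct values. One possible alternative is to bucket `minutes_since_request` and look at the contact rate per bucket. This is only a sketch: the toy data and the bin edges below are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    'minutes_since_request': [1, 2, 2, 3, 45, 50, 600, 700],
    'contacted':             [1, 1, 0, 1, 0, 1, 0, 0],
})

# Bin edges are an arbitrary illustrative choice; intervals are right-closed.
bins = [0, 10, 60, float('inf')]
labels = ['0-10 min', '11-60 min', 'over 60 min']
toy['bucket'] = pd.cut(toy['minutes_since_request'], bins=bins, labels=labels)

# The mean of a 0/1 column is the contact rate within each bucket.
rate_by_bucket = toy.groupby('bucket', observed=True)['contacted'].mean()
print(rate_by_bucket)
```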

In [7]:
#Randomize dataset and do a 60-20-20 split before any processing
df_randomized = givenData
df_randomized = df_randomized.reindex(np.random.permutation(df_randomized.index))

# training set (cv_train, 60%) -> used for training the model
# test set (test, 20%) -> held out; used only to calculate the final performance of the classifier
# validation set (cv_test, 20%) -> used for selecting the model family

splitAt = int(0.2 * len(df_randomized))

test = df_randomized[0:splitAt]
cv_trainTest = df_randomized[splitAt:]
cv_test = cv_trainTest[0:splitAt]
cv_train = cv_trainTest[splitAt:]
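
The split above depends on an unseeded permutation, so it changes on every run. A reproducible variant might look like the sketch below; the helper name `split_60_20_20` and the `seed` parameter are assumptions introduced here, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def split_60_20_20(frame, seed=0):
    """Shuffle rows reproducibly and return (train, validation, test) frames."""
    rng = np.random.RandomState(seed)
    shuffled = frame.iloc[rng.permutation(len(frame))]
    n_holdout = int(0.2 * len(shuffled))
    # .copy() gives independent frames, so later assignments do not
    # trigger pandas' SettingWithCopyWarning.
    test = shuffled.iloc[:n_holdout].copy()
    validation = shuffled.iloc[n_holdout:2 * n_holdout].copy()
    train = shuffled.iloc[2 * n_holdout:].copy()
    return train, validation, test

# Toy frame (hypothetical values) to show the resulting sizes.
toy = pd.DataFrame({'quote_id': range(10), 'contacted': [0, 1] * 5})
train, validation, test = split_60_20_20(toy, seed=42)
print(len(train), len(validation), len(test))
```

Since several quotes share a `request_id`, a row-level shuffle can place quotes from the same request on both sides of the split; whether that matters depends on how the model is meant to be used.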

In [8]:
df = cv_train
df

#   3 stages to preprocess training set:
#1. fillNA based on categories (also check with price_type)
#2. Normalize and scale values of x
#3. Binarize categorical variables

Unnamed: 0,request_id,service_id,quote_id,category,platform,has_profile_picture,description_length,num_photos,num_reviews,num_licenses,num_websites,avg_rating,price_type,price_estimate,service_distance,minutes_since_request,quote_message_length,contacted
84229,33057,14988,83509,House Cleaning (Recurring),computer,1,0,0,0,0,0,0.0,price_hourly,25.00,5.488336,16,59,0
31457,12618,3569,30672,Apartment Cleaning,phone,1,243,1,29,0,0,4.9,price_fixed,50.00,7.092609,16,358,1
18091,7263,12642,17547,Deep or Spring Cleaning,phone,1,0,0,9,0,0,4.4,price_fixed,175.00,6.248006,9,383,1
63051,24886,3013,62388,Gutter Cleaning and Maintenance,phone,1,142,11,37,2,1,5.0,price_fixed,125.00,45.709540,30,450,0
80308,31473,15397,83131,House Cleaning (One Time),phone,1,0,7,3,0,1,5.0,price_fixed,135.00,4.266813,1754,258,0
64695,25536,11050,64167,Move-in or Move-out Cleaning,tablet,1,0,16,40,0,0,4.4,price_fixed,0.00,21.895555,73,196,0
76450,29994,11191,76348,Event DJ,computer,1,0,2,15,0,1,4.7,price_fixed,400.00,39.815020,513,938,0
5704,2327,10546,5190,House Cleaning (Recurring),phone,1,0,28,10,0,1,4.2,price_fixed,105.00,24.484184,9,649,0
69255,27242,1795,68842,Duct and Vent Cleaning,tablet,1,1034,9,3,0,1,3.7,price_fixed,135.00,19.158430,31,392,0
88377,34715,7281,87898,House Cleaning (Recurring),phone,1,914,14,30,1,0,4.7,price_fixed,105.00,4.761711,21,966,0


In [9]:
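#TEST: Fill NA based on category median
uniqueCategories = pd.unique(df['category'])
for c in uniqueCategories:
    median = df[(df['category'] == c)]['price_estimate'].median()
    #print (c, median)
    # Caution: the boolean-mask selection returns a copy, so this in-place fillna
    # does not modify df itself (pandas warns about this in the output below).
    df[(df['category'] == c)].fillna(median,inplace=True)  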

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)
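
The warning above indicates that the fill is applied to a temporary copy. One way to impute the per-category median so that it actually lands in the frame is `groupby(...).transform('median')` followed by an assignment back to the column. This is a sketch on toy data with hypothetical values, not the notebook's own result.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    'category': ['Event DJ', 'Event DJ', 'Event DJ',
                 'Window Cleaning', 'Window Cleaning'],
    'price_estimate': [300.0, 500.0, np.nan, 130.0, np.nan],
})

# Median price per category, broadcast back to one value per row.
category_median = toy.groupby('category')['price_estimate'].transform('median')

# Assign the result back to the column; fillna on a filtered selection
# would only change a temporary copy.
toy['price_estimate'] = toy['price_estimate'].fillna(category_median)
print(toy)
```

Inside cross-validation, the medians would be computed on the training fold only and then mapped onto the held-out fold, so that no information leaks from the evaluation data.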


In [10]:
#TEST: Normalize and scale values of x
df_norm = df
groups = ['description_length','num_photos', 'num_reviews', 'num_licenses','num_websites','avg_rating', 'price_estimate', 'service_distance', 'minutes_since_request', 'quote_message_length']

for g in groups:  
    df_norm.loc[:, g] = (df_norm[g] - df_norm[g].mean()) / (df_norm[g].max() - df_norm[g].min())
    

df_norm.describe()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,request_id,service_id,quote_id,has_profile_picture,description_length,num_photos,num_reviews,num_licenses,num_websites,avg_rating,price_estimate,service_distance,minutes_since_request,quote_message_length,contacted
count,64588.0,64588.0,64588.0,64588.0,64588.0,64588.0,64588.0,64588.0,64588.0,64588.0,58339.0,64588.0,64588.0,64588.0,64588.0
mean,21271.858132,7940.015281,53924.485617,0.981127,-3.4722399999999995e-19,-2.213983e-18,3.42411e-18,4.366084e-18,8.690914e-18,9.13096e-18,-3.330346e-20,8.388381999999999e-19,-7.494537999999999e-19,5.088035e-19,0.469298
std,12161.214964,4580.633621,31076.610711,0.136079,0.01799799,0.03131008,0.130556,0.07397438,0.1155015,0.3292906,0.005239222,0.01855462,0.1038965,0.030174,0.49906
min,1.0,1.0,2.0,0.0,-0.01048153,-0.019555,-0.07743836,-0.03411538,-0.1312442,-0.8140748,-0.0003965367,-0.008404342,-0.0535293,-0.04219254,0.0
25%,10829.0,3958.0,27083.75,1.0,-0.01048153,-0.01769281,-0.07110924,-0.03411538,-0.1312442,0.06592525,,-0.005406531,-0.05153859,-0.02311045,0.0
50%,21274.5,7990.5,53897.5,1.0,-0.005229031,-0.00838182,-0.0500122,-0.03411538,0.06875581,0.1259252,,-0.002824312,-0.0446817,-0.004028368,0.0
75%,31650.0,11815.0,80816.25,1.0,0.003450092,0.004653562,0.01327894,-0.03411538,0.06875581,0.1859252,,0.001224439,-0.0002225119,0.02089349,1.0
max,42430.0,16274.0,107646.0,1.0,0.9895185,0.980445,0.9225616,0.9658846,0.8687558,0.1859252,0.9996035,0.9915957,0.9464707,0.9578075,1.0
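
The scaling used above is mean normalization, (x - mean) / (max - min). When the same transformation has to be applied to validation or test data, the statistics should come from the training set. A minimal sketch follows; the helper names `fit_mean_range` and `apply_mean_range` are hypothetical and the values are toy data.

```python
import pandas as pd

def fit_mean_range(train, columns):
    """Record each column's mean and (max - min) on the training data."""
    return {c: (train[c].mean(), train[c].max() - train[c].min()) for c in columns}

def apply_mean_range(frame, stats):
    """Return a scaled copy of frame using previously fitted statistics."""
    scaled = frame.copy()
    for c, (mean, value_range) in stats.items():
        scaled[c] = (scaled[c] - mean) / value_range
    return scaled

# Toy data (hypothetical values).
train = pd.DataFrame({'num_reviews': [0.0, 10.0, 20.0]})
test = pd.DataFrame({'num_reviews': [5.0, 30.0]})

stats = fit_mean_range(train, ['num_reviews'])
train_scaled = apply_mean_range(train, stats)
test_scaled = apply_mean_range(test, stats)
print(train_scaled['num_reviews'].tolist(), test_scaled['num_reviews'].tolist())
```

Test values can fall outside the training range, as the 30.0 above does, and a constant column would give a zero range, so a real implementation would need to guard against division by zero.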


In [11]:
#TEST: Binarize categorical variables
categoricalVariables = ['category', 'platform', 'price_type']

df_processed = df_norm
for c in categoricalVariables:
    df_processed.loc[:,c] = df_processed[c].astype('category')

df_processed = pd.get_dummies(df_processed)
df_processed.columns.difference(df_norm.columns)

Index([u'category_Apartment Cleaning', u'category_Bar or Bat Mitzvah DJ',
       u'category_Basement or Attic Cleaning', u'category_Carpet Cleaning',
       u'category_Chimney Cleaning', u'category_Cleaning Out',
       u'category_Commercial Cleaning', u'category_Deep or Spring Cleaning',
       u'category_Dryer Vent Cleaning', u'category_Duct and Vent Cleaning',
       u'category_EDM or House Music DJ', u'category_Event DJ',
       u'category_Fire Damage Restoration Cleaning',
       u'category_Fireplace and Chimney Cleaning', u'category_Floor Cleaning',
       u'category_Garage Cleaning',
       u'category_Gutter Cleaning and Maintenance',
       u'category_Hot Tub and Spa Cleaning and Maintenance',
       u'category_House Cleaning (One Time)',
       u'category_House Cleaning (Recurring)', u'category_Mattress Cleaning',
       u'category_Move-in or Move-out Cleaning',
       u'category_Office Cleaning (One Time)',
       u'category_Office Cleaning (Recurring)',
       u'category_Out
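
`pd.get_dummies` only creates columns for the levels it sees, so a training frame and a test frame can end up with different columns. One way to align them is to reindex the test frame to the training columns, sketched below on toy data with hypothetical values.

```python
import pandas as pd

# Toy data (hypothetical values): 'native' appears only in test,
# 'tablet' and 'computer' appear only in train.
train = pd.DataFrame({'platform': ['phone', 'tablet', 'computer'],
                      'num_photos': [1, 0, 3]})
test = pd.DataFrame({'platform': ['phone', 'native'],
                     'num_photos': [2, 5]})

train_encoded = pd.get_dummies(train, columns=['platform'])
test_encoded = pd.get_dummies(test, columns=['platform'])

# Reindex the test frame to the training columns: indicator columns missing
# from the test frame are filled with 0, and levels seen only in the test
# frame ('native') are dropped.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(list(test_aligned.columns))
```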

In [12]:
# features = [u'has_profile_picture',u'description_length', u'num_photos', u'num_reviews', u'num_licenses',
#        u'num_websites', u'avg_rating', u'price_estimate', u'service_distance',
#        u'minutes_since_request', u'quote_message_length',
#        u'category_Apartment Cleaning', u'category_Bar or Bat Mitzvah DJ',
#        u'category_Basement or Attic Cleaning', u'category_Carpet Cleaning',
#        u'category_Chimney Cleaning', u'category_Cleaning Out',
#        u'category_Commercial Cleaning', u'category_Deep or Spring Cleaning',
#        u'category_Dryer Vent Cleaning', u'category_Duct and Vent Cleaning',
#        u'category_EDM or House Music DJ', u'category_Event DJ',
#        u'category_Fire Damage Restoration Cleaning',
#        u'category_Fireplace and Chimney Cleaning', u'category_Floor Cleaning',
#        u'category_Garage Cleaning',
#        u'category_Gutter Cleaning and Maintenance',
#        u'category_Hot Tub and Spa Cleaning and Maintenance',
#        u'category_House Cleaning (One Time)',
#        u'category_House Cleaning (Recurring)', u'category_Mattress Cleaning',
#        u'category_Move-in or Move-out Cleaning',
#        u'category_Office Cleaning (One Time)',
#        u'category_Office Cleaning (Recurring)',
#        u'category_Outdoor or Balcony Cleaning', u'category_Quinceanera DJ',
#        u'category_Roof Cleaning', u'category_Rug Cleaning',
#        u'category_Solar Panel Cleaning or Inspection',
#        u'category_Spanish Music DJ', u'category_Steam Cleaning',
#        u'category_Sweet 16 DJ',
#        u'category_Swimming Pool Cleaning or Maintenance',
#        u'category_Tile and Grout Cleaning', u'category_Top 40 DJ',
#        u'category_Upholstery and Furniture Cleaning', u'category_Wedding DJ',
#        u'category_Window Blinds Cleaning', u'category_Window Cleaning',
#        u'platform_computer', u'platform_native', u'platform_other',
#        u'platform_phone', u'platform_tablet', u'price_type_price_fixed',
#        u'price_type_price_hourly', u'price_type_price_other']

output = [u'contacted']

In [27]:
#model selection
from sklearn.cross_validation import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

clf_naiveBayes = MultinomialNB()
clf_logisticRegression = LogisticRegression(penalty='l2', C=10000)   #logistic regression 0.01, 0.1 --54.355
#clf_lr = SGDClassifier(loss='log', penalty='l2', alpha=1e-3, n_iter=5, random_state=10)   #logistic regression
clf_svmLinear = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=10) #linear svm
clf_svmPoly = SVC(kernel='poly', C=1.0, decision_function_shape='ovr', degree = 4, gamma = 1.0)  # degree only applies to the poly kernel
clf_svmRbf = SVC(kernel='rbf', C=1.0, decision_function_shape='ovr', gamma = 1.0)
clf_randomForest = RandomForestClassifier(n_estimators=100, n_jobs=2)

clfs = [clf_logisticRegression]
#clfs = [clf_naiveBayes, clf_logisticRegression, clf_svmLinear, clf_svmPoly, clf_svmRbf, clf_randomForest]

Y = cv_trainTest[output]
kf = KFold(len(Y), n_folds=5)
max_score = 0   # renamed from 'max' to avoid shadowing the built-in
scores = []
categoricalVariables = ['category', 'platform', 'price_type']
numericalVariables = ['description_length','num_photos', 'num_reviews', 'num_licenses','num_websites','avg_rating', 'price_estimate', 'service_distance', 'minutes_since_request', 'quote_message_length']

for idx in range(0,len(clfs)):
    #implement cross-validation
    cv_scores = []
    for train_index, test_index in kf:
        #pre-process train
        X_trainRaw = cv_trainTest.iloc[train_index]
        #1. fillNa
        uniqueCategories = pd.unique(X_trainRaw['category'])
        for c in uniqueCategories:
            median = X_trainRaw[(X_trainRaw['category'] == c)]['price_estimate'].median()
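            # Caution: the boolean-mask selection returns a copy, so this in-place
            # fillna does not modify X_trainRaw itself.
            X_trainRaw[(X_trainRaw['category'] == c)].fillna(median,inplace=True)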
            
#         #2. Normalize and scale values of x
#         for n in numericalVariables:  
#             X_trainRaw.loc[:, n] = (X_trainRaw[n] - X_trainRaw[n].mean()) / (X_trainRaw[n].max() - X_trainRaw[n].min())
            
#         #3. Binarize categorical variables
#         X_trainProcessed = X_trainRaw
#         for c in categoricalVariables:
#             X_trainProcessed.loc[:,c] = X_trainProcessed[c].astype('category')

#         X_trainProcessed = pd.get_dummies(X_trainProcessed)
#         features = X_trainProcessed.columns.difference(X_trainRaw.columns)
        
#         X_train = X_trainProcessed[features].values
        
#         #pre-process test
#         X_testRaw = cv_trainTest.iloc[test_index]
        
#         #1. fillNa
#         for c in uniqueCategories:
#             median = X_trainRaw[(X_trainRaw['category'] == c)]['price_estimate'].median()
#             X_testRaw[(X_testRaw['category'] == c)].fillna(median,inplace=True)
            
#         #2. Normalize and scale values of x
#         for n in numericalVariables:  
#             X_testRaw.loc[:, n] = (X_testRaw[n] - X_trainRaw[n].mean()) / (X_trainRaw[n].max() - X_trainRaw[n].min())
    
#         #3. Binarize categorical variables
#         X_testProcessed = X_testRaw
#         for c in categoricalVariables:
#             X_testProcessed.loc[:,c] = X_testProcessed[c].astype('category')

#         X_testProcessed = pd.get_dummies(X_testProcessed)
#         diff = features.difference(X_testProcessed.columns)
#         for k in diff:
#             if (k not in X_testProcessed.columns):
#                 X_testProcessed[k] = 0.0
                
#         X_test = X_testProcessed[features].values
#         Y_train, Y_test = Y.iloc[train_index].values.ravel(), Y.iloc[test_index].values.ravel()

#         clf = clfs[idx].fit(X_train, Y_train)
#         score = clf.score(X_test, Y_test)
#         cv_scores.append(score)
#     scores.append(np.mean(cv_scores))

# print(scores)
# highScoreIndex = np.argmax(scores)
# print(highScoreIndex)
# clf = clfs[highScoreIndex]

----
('Gutter Cleaning and Maintenance', 109.5)
('House Cleaning (Recurring)', 100.0)
('Move-in or Move-out Cleaning', 150.0)
('Wedding DJ', 575.0)
('House Cleaning (One Time)', 135.0)
('Carpet Cleaning', 115.0)
('Event DJ', 400.0)
('Deep or Spring Cleaning', 150.0)
('Commercial Cleaning', 175.0)
('Basement or Attic Cleaning', 120.0)
('Apartment Cleaning', 90.0)
('Window Cleaning', 130.0)
('Swimming Pool Cleaning or Maintenance', 85.0)
('Office Cleaning (Recurring)', 147.99)
('Upholstery and Furniture Cleaning', 100.0)
('Tile and Grout Cleaning', 150.0)
('Sweet 16 DJ', 350.0)
('EDM or House Music DJ', 350.0)
('Quinceanera DJ', 500.0)
('Duct and Vent Cleaning', 189.0)
('Floor Cleaning', 210.0)
('Dryer Vent Cleaning', 95.0)
('Roof Cleaning', 300.0)
('Mattress Cleaning', 90.0)
('Cleaning Out', 142.5)
('Top 40 DJ', 300.0)
('Chimney Cleaning', 99.0)
('Garage Cleaning', 125.0)
('Rug Cleaning', 100.0)
('Fireplace and Chimney Cleaning', 115.0)
('Bar or Bat Mitzvah DJ', 662.5)
('Office Cleaning
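
The cross-validation loop above repeats the preprocessing by hand inside each fold. In newer scikit-learn releases (where `sklearn.model_selection` replaces `sklearn.cross_validation`), the same idea can be sketched with a `Pipeline`, which refits the preprocessing on each training fold automatically. Note the substitutions: this sketch uses a global median imputer rather than per-category medians, and `MinMaxScaler` rather than mean normalization. The data is synthetic and the scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data (hypothetical values, not the challenge dataset).
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    'platform': rng.choice(['phone', 'tablet', 'computer'], size=n),
    'price_estimate': rng.uniform(50, 500, size=n),
    'num_reviews': rng.randint(0, 100, size=n).astype(float),
})
toy.loc[rng.choice(n, size=20, replace=False), 'price_estimate'] = np.nan
toy['contacted'] = (toy['num_reviews'] + rng.normal(0, 20, size=n) > 50).astype(int)

numeric = ['price_estimate', 'num_reviews']
categorical = ['platform']

# Impute and scale numeric columns, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', MinMaxScaler())]), numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])
model = Pipeline([('prep', preprocess),
                  ('clf', LogisticRegression(C=10000, max_iter=1000))])

# 5-fold cross-validation; preprocessing is fitted on each training fold only.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, toy[numeric + categorical], toy['contacted'], cv=cv)
print(scores.mean())
```

To compare model families as in the loop above, the `clf` step can be swapped for each candidate and the mean scores compared.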

In [1]:

# Y = cv_trainTest['label'].values
# kf = KFold(len(Y), n_folds=5)

# max = 0
# scores = []

# for idx in range(0,len(clfs)):
#     #implement cross-validation
#     cv_scores = []
#     for train_index, test_index in kf:
#         X_train_counts = vectorizer.fit_transform(cv_trainTest['ocr_text_all'].values[train_index])
#         X_train = tfidf_transformer.fit_transform(X_train_counts)
#         X_test_counts = vectorizer.transform(cv_trainTest['ocr_text_all'].values[test_index])
#         X_test = tfidf_transformer.transform(X_test_counts)
#         Y_train, Y_test = Y[train_index], Y[test_index]
#         clf = clfs[idx].fit(X_train, Y_train)
#         score = clf.score(X_test, Y_test)
#         cv_scores.append(score)
#     scores.append(np.mean(cv_scores))
    
# print(scores)
# highScoreIndex = np.argmax(scores)
# print(highScoreIndex)
# clf = clfs[highScoreIndex]

NameError: name 'cv_trainTest' is not defined

In [None]:
#pd.unique(givenData.category.ravel())
#givenData[(givenData['price_estimate'] ==1)].shape