# Thumbtack Analytics Challenge
  
Thumbtack has decided to take a closer look at performance in two of its largest categories - House
Cleaning and Local Moving. Please complete the analyses suggested below and overlay your own
recommendations for how we can improve and grow our marketplace.

    ● Based on the data, what types of pros are customers interested in?
    
        Customers have a high rate of contact for pros with
         1. Recent site activity (a proxy for response time?)
         2. More reviews
         3. High search rank
         4. In the Local Moving category, price mattered: contact rates shifted 
            with differences of about $20
        Somewhat surprisingly, rating was not a leading indicator of being contacted.
        
        Differentiators for actually being hired
         1. Recent site activity (a proxy for response time?)
         2. More reviews
         

    Dashboard below is a quick glance at the variables available.

https://public.tableau.com/views/ThumbtackSampleProj/ConsumerPreferences?:language=en&:embed_code_version=3&:loadOrderID=2&:display_count=y&publish=yes&:origin=viz_share_link

    ● Based on the types of pros that customers are interested in, 
            how would you describe the quantity and quality of the search results? 
            What could be improved?

     Quantity and Quality:    
            * House Cleaning has a very large share of results that grade poorly on 
            the features associated with being hired (roughly 13x more than the 
            top-graded results). 
            * Local Moving sees fewer hires overall and shows a similar imbalance: 
            about 17x more results in grade D than in grade A.
            
     What could be improved:
            Regardless of the number of jobs actually accepted, if the wider population 
            of pros behaved more like the top-hired pros, customers would begin to 
            rely on other factors to pick whom to hire (perhaps availability, 
            response time, or positive reviews from friends/neighbors vs "the crowd").
            
    Dashboard below is a quick glance at the grading spread for all the records of 
    search instances provided.


https://public.tableau.com/views/ThumbtackSampleProjGrading/GradingDistrobution?:language=en&:retry=yes&:display_count=y&:origin=viz_share_link


# Observations post ML building 
For predicting whether a search result would be contacted, the features (columns) provided show relatively strong predictive power (test AUC of roughly 0.81 to 0.83). For distinguishing which pros get hired once contacted, the models do about as well as tossing a coin (test AUC of roughly 0.46 to 0.56).

This leads me to conclude that the features in the dataset are enough to make decent predictions of which search results get contacted in both categories, but not of who gets hired.

The following is an outline of using machine learning in Python to identify which "features" (columns) were the best predictors of being contacted and hired, based on the sample data provided.

# Prompt


Thumbtack is a marketplace for local services. Customers come to our website or mobile app to see our
directory of service professionals (example) in nearly 500 categories. As part of the search experience, customers can provide some basic details about their projects in the search filters to see pros that best match their needs. Customers can also see pros’ price estimates for their projects. From the list of pros, customers can then explore pro profiles, contact the pros that interest them, and ultimately hire a pro. In this process, Thumbtack generates revenue by charging pros for each customer that contacts them.

Downloaded files: https://drive.google.com/drive/folders/1v8wmMVvQPFBHjtGutjYA4bL9V_eEt7Ii

# Visitors CSV
This dataset contains a list of search results. Each result is a pro that matched a specific visitor’s search.

    ● row_number (integer): row number in data set
    ● visitor_id (integer): unique identifier for the visitor that the 
        search result is associated with
    ● search_timestamp (timestamp): timestamp of when the visitor loaded 
        the search results
    ● category (string): category of the visitor’s search
    ● pro_user_id (integer): unique identifier for the pro
    ● num_reviews (integer): number of reviews that the pro had at the 
        time of the search
    ● avg_rating (float): average rating across pro’s reviews
    ● pro_last_active_time_before_search (timestamp): timestamp of when 
        the pro last responded to a customer that contacted them, prior 
        to the search_timestamp
    ● cost_estimate_cents (integer): pro’s price estimate for the visitor’s 
        project, in cents. For House Cleaning searches, this is the price estimate 
        for the entire project. For Local Moving searches, this is the estimated 
        hourly rate.
    ● result_position (integer): pro’s rank in search results. Rank = 1 means 
        the pro was ranked first among the search results.
    ● service_page_viewed (boolean): TRUE indicates that the visitor clicked 
        to view the pro’s profile, FALSE otherwise


# Contacts CSV
This dataset contains a list of customers reaching out to pros. Each row is a visitor that reached out to a pro through a search in the Visitors CSV.
    
    ● visitor_id (integer): unique identifier for the visitor that reached
        out to the pro
    ● pro_user_id (integer): unique identifier for the pro that the visitor contacted
    ● contact_id (integer): unique identifier for the visitor-pro contact
    ● hired (boolean): TRUE indicates that the visitor eventually hired 
        the pro, FALSE otherwise


In [1]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

visitorsDf = pd.read_csv('ThumbTack Proj/Visitors.csv')
visitorsDf = visitorsDf[['row_number', 'visitor_id', 'search_timestamp', 'category',
       'pro_user_id', 'num_reviews', 'avg_rating','pro_last_active_time_before_search'
        , 'cost_estimate_cents','result_position', 'service_page_viewed']]
contactsDf = pd.read_csv('ThumbTack Proj/Contacts.csv')
contactsDf = contactsDf[['visitor_id', 'pro_user_id', 'contact_id', 'hired']]


In [2]:
visitorsDf.columns
len(visitorsDf)

26102

In [3]:
# Run profile report across each DF and decide which cols to keep/drop
#ProfileReport(visitorsDf)

In [4]:
contactsDf.columns

Index(['visitor_id', 'pro_user_id', 'contact_id', 'hired'], dtype='object')

In [5]:
# Run profile report across each DF and decide which cols to keep/drop
#ProfileReport(contactsDf)

# Profile Report Warnings
This output was clipped from the report's HTML rendering, which I'm withholding from the final version of this notebook. It is relevant when considering which sparse fields/features, constants, etc. to remove.

visitorsDf
    
    search_timestamp has a high cardinality: 3428 distinct values	
        High cardinality
    pro_last_active_time_before_search has a high cardinality: 14610 distinct values	
        High cardinality
    avg_rating has 1155 (4.4%) missing values	Missing
    pro_last_active_time_before_search has 1067 (4.1%) missing values	Missing
    cost_estimate_cents has 2158 (8.3%) missing values	Missing
    row_number has unique values	Unique
    num_reviews has 1155 (4.4%) zeros	Zeros

contactsDf
    
    contact_id is highly correlated with visitor_id	High correlation
    visitor_id is highly correlated with contact_id	High correlation
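
The missing-value and zero counts in the warnings above can be reproduced without the profiling report. A minimal sketch, assuming a small made-up frame `toy` in place of `visitorsDf` (the values are hypothetical, only the column names come from the data dictionary):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for visitorsDf; values are made up for illustration.
toy = pd.DataFrame({
    "num_reviews": [0, 12, 3, 0],
    "avg_rating": [np.nan, 4.8, 4.5, np.nan],
    "cost_estimate_cents": [9000, np.nan, 11000, 8500],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isna().mean().mul(100).round(1)

# Share of exact zeros per column (NaN never compares equal to 0).
zero_pct = toy.eq(0).mean().mul(100).round(1)

print(missing_pct)
print(zero_pct)
```

The identical counts for missing `avg_rating` and zero `num_reviews` (1155 each) suggest the missing ratings belong to pros with no reviews, which would be worth confirming on the real data.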

Open questions at this point - 
    
    * Are visitor IDs unique to each search, regardless of whether the 
        visitor is a returning site visitor?
            ** They appear to be unique or obscured 
    * What happens when joining on pro_user_id and visitor_id? Does the row count explode?
            ** No, the row count is unchanged (26102), so a left join is safe
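
The row-count question can also be checked mechanically rather than by comparing lengths after the fact. A minimal sketch on two hypothetical miniature tables (the names `visitors` and `contacts` and their rows are made up), using the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
visitors = pd.DataFrame({
    "visitor_id": [1, 1, 2],
    "pro_user_id": [10, 11, 10],
    "result_position": [1, 2, 1],
})
contacts = pd.DataFrame({
    "visitor_id": [1],
    "pro_user_id": [11],
    "hired": [True],
})

keys = ["visitor_id", "pro_user_id"]

# True if any (visitor, pro) pair repeats on either side.
has_dupes = visitors.duplicated(keys).any() or contacts.duplicated(keys).any()

# validate="one_to_one" raises pandas.errors.MergeError if a key repeats,
# so a silent row explosion cannot happen.
combo = visitors.merge(contacts, how="left", on=keys,
                       validate="one_to_one", indicator=True)

print(has_dupes, len(combo) == len(visitors))
print(combo["_merge"].value_counts())
```

The `_merge` column doubles as a contacted flag: `both` marks search results that have a matching contact.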
    

In [6]:
comboDf = pd.merge(visitorsDf, contactsDf, how='left', left_on=['visitor_id','pro_user_id'], right_on=['visitor_id','pro_user_id'])
len(comboDf)
#comboDf.head(40)
#comboDf[comboDf['visitor_id']==343492100068655000]

26102

In [7]:
comboDf.columns

Index(['row_number', 'visitor_id', 'search_timestamp', 'category',
       'pro_user_id', 'num_reviews', 'avg_rating',
       'pro_last_active_time_before_search', 'cost_estimate_cents',
       'result_position', 'service_page_viewed', 'contact_id', 'hired'],
      dtype='object')

In [8]:
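comboDf['search_timestamp'] = pd.to_datetime(comboDf['search_timestamp'])
comboDf['pro_last_active_time_before_search'] = pd.to_datetime(comboDf['pro_last_active_time_before_search'])
# Hours between the pro's last activity and the search. The last activity comes first,
# so values are negative; closer to zero means more recently active.
comboDf['time_since_logged_in'] = ((comboDf['pro_last_active_time_before_search']
                                -comboDf['search_timestamp'])/np.timedelta64(1,'h'))
# Rows with no match in contactsDf have NaN in 'hired', i.e. the pro was not contacted
comboDf['contacted'] = ~comboDf['hired'].isna()
comboDf['hired'] = comboDf['hired'].replace({True: 1, False: 0})
# Hour of day of the search
comboDf['hour'] = comboDf['search_timestamp'].dt.hour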
#comboDf.groupby('contacted')['contacted'].count()
#len(contactsDf)
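
With a `contacted` flag in place, contact rate by feature bucket is a one-line groupby. A minimal sketch on hypothetical rows shaped like `comboDf` (the data and the bin edges are made up for illustration, not taken from the analysis):

```python
import pandas as pd

# Hypothetical rows shaped like comboDf after the `contacted` flag is added.
toy = pd.DataFrame({
    "result_position": [1, 2, 3, 4, 5, 6, 7, 8],
    "contacted":       [True, True, False, True, False, False, False, False],
})

# Bucket search rank; the bin edges are illustrative only.
toy["rank_bucket"] = pd.cut(toy["result_position"],
                            bins=[0, 2, 5, 8],
                            labels=["1-2", "3-5", "6-8"])

# The mean of a boolean column is the share of True values, i.e. the contact rate.
contact_rate = toy.groupby("rank_bucket", observed=True)["contacted"].mean()
print(contact_rate)
```

The same pattern works for `num_reviews`, `cost_estimate_cents`, or `time_since_logged_in` by swapping the column and bin edges.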

In [9]:
#comboDf = comboDf[['row_number','category','hired','contacted',
#       'num_reviews','avg_rating','cost_estimate_cents','result_position',
#       'time_since_logged_in','hour']]

# Boolean indexing keeps only the rows of each category and preserves column dtypes
movingDf = comboDf[comboDf['category'] == 'Local Moving (under 50 miles)']
    #len(movingDf) - 7048
cleaningDf = comboDf[comboDf['category'] == 'House Cleaning']
    #len(cleaningDf) - 19054

### Moving Category
contacted_movingDf_X = movingDf[['num_reviews','avg_rating'
                     ,'cost_estimate_cents','result_position','time_since_logged_in','hour']]
contacted_movingDf_y = movingDf[['contacted']]

# Keep only contacted rows, using the frame's own flag rather than comboDf's
movingDf = movingDf[movingDf['contacted'].astype(bool)]
    #len(hired_movingDf_X) - 155
hired_movingDf_X = movingDf[['num_reviews','avg_rating'
                     ,'cost_estimate_cents','result_position','time_since_logged_in','hour']]
hired_movingDf_y = movingDf[['hired']]

### Cleaning Category 
contacted_cleaningDf_X = cleaningDf[['num_reviews','avg_rating'
                     ,'cost_estimate_cents','result_position','time_since_logged_in','hour']]
contacted_cleaningDf_y = cleaningDf[['contacted']]

# Keep only contacted rows
cleaningDf = cleaningDf[cleaningDf['contacted'].astype(bool)]
    #len(hired_cleaningDf_X)  208
hired_cleaningDf_X = cleaningDf[['num_reviews','avg_rating'
                     ,'cost_estimate_cents','result_position','time_since_logged_in','hour']]
hired_cleaningDf_y = cleaningDf[['hired']]

In [10]:
# Check sample sizes for the training sets to make sure none of the splits got too small
len(hired_cleaningDf_y) 
#hired_movingDf_y.head(50)

809

In [11]:
# Fill NaN values so the training methods don't throw errors
values = {'num_reviews':0 ,
              'avg_rating': 0.00,
              'cost_estimate_cents': 0,
              'time_since_logged_in': 0
             }
contacted_movingDf_X = contacted_movingDf_X.fillna(value=values)
hired_movingDf_X = hired_movingDf_X.fillna(value=values)
contacted_cleaningDf_X = contacted_cleaningDf_X.fillna(value=values)
hired_cleaningDf_X = hired_cleaningDf_X.fillna(value=values)
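
Filling a missing `avg_rating` with 0 makes an unrated pro look like a pro with the worst possible rating, and a 0 in `time_since_logged_in` reads as "active at search time". One hedged alternative is to keep the fill but add an indicator column so the model can tell the two cases apart. The helper `fill_with_flags`, the `_missing` suffix, and the sample rows below are all hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame with gaps, standing in for contacted_movingDf_X.
X = pd.DataFrame({
    "num_reviews": [0, 25, 4],
    "avg_rating": [np.nan, 4.9, 4.2],
    "cost_estimate_cents": [np.nan, 12000, 9500],
})

def fill_with_flags(df, fill_values):
    """Fill NaNs and record, per filled column, which rows were missing."""
    out = df.copy()
    for col, fill in fill_values.items():
        out[col + "_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(fill)
    return out

X_filled = fill_with_flags(X, {"avg_rating": 0.0, "cost_estimate_cents": 0})
print(X_filled)
```

Whether the extra columns help is an empirical question; this only shows the mechanics.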


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, auc, roc_auc_score

#Split the data set for training, with 25% of the data (the train_test_split default)
    #held out of training to use for validation tests
X_train, X_test, y_train, y_test = train_test_split(hired_cleaningDf_X.to_numpy(),
                                                hired_cleaningDf_y.to_numpy().ravel(),random_state = 0)
clf = (GradientBoostingClassifier( random_state = 0
#                                    ,learning_rate = .25,max_depth = 2,n_estimators = 8
                                     )
               .fit(X_train, y_train))
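
Because the hired subsets are small, a plain random split can leave the test set with a different share of hires than the training set. A minimal sketch of a stratified split on synthetic labels (the arrays here are made up; `stratify` is a standard `train_test_split` argument, but using it is a suggestion, not what the cell above does):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 20 positives out of 100, standing in for a small hired subset.
y = np.array([1] * 20 + [0] * 80)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the positive share the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(y_train.mean(), y_test.mean())
```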

In [37]:
grid_values = {'learning_rate': [0.25, 0.1, 0.05, 0.01],
           # integer depths; recent scikit-learn versions reject float max_depth values
           'max_depth' : [1, 2, 3, 4],
           'n_estimators': [1, 2, 4, 8]
          }

grid_clf_auc = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid_clf_auc.fit(X_train, y_train)
y_decision_fn_scores_auc = grid_clf_auc.decision_function(X_test) 

#print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_auc))
#print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
#print('Grid best score (AUC): ', grid_clf_auc.best_score_)

# Observations post parameter search
AUC scores are good for contact and pretty poor for hire (an AUC near 0.5 is no better than random ranking).

Details on AUC scores and how to interpret them for model performance:
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
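
One way to read an AUC score: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A minimal pure-Python sketch of that definition (the helper name `pairwise_auc` and the toy scores are made up; the notebook itself uses scikit-learn's `roc_auc_score`):

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
print(pairwise_auc(labels, [0.9, 0.8, 0.2, 0.1]))  # every positive outranks every negative
print(pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
print(pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9]))  # ranking is inverted
```

The three calls print 1.0, 0.5 and 0.0, which is why the hire models' scores near 0.5 amount to a coin toss.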

# contacted_movingDf_X
    Test set AUC:  0.8103483661895761
    Grid best parameter (max. AUC):  {'learning_rate': 0.25, 'max_depth': 2.0, 'n_estimators': 8}
    Grid best score (AUC):  0.8027942934093323

# hired_movingDf_X
    Test set AUC:  0.46245107632093935
    Grid best parameter (max. AUC):  {'learning_rate': 0.01, 'max_depth': 4.0, 'n_estimators': 2}
    Grid best score (AUC):  0.4946914338501484

# contacted_cleaningDf_X
    Test set AUC:  0.8298953689757402
    Grid best parameter (max. AUC):  {'learning_rate': 0.25, 'max_depth': 3.0, 'n_estimators': 8}
    Grid best score (AUC):  0.8462944244914326

# hired_cleaningDf_X
    Test set AUC:  0.5615837539053112
    Grid best parameter (max. AUC):  {'learning_rate': 0.05, 'max_depth': 1.0, 'n_estimators': 8}
    Grid best score (AUC):  0.5639773130095711


In [31]:
#Train contacted_moving classifier model at best parameter setting to obtain the 
#  weights of each feature passed 

contacted_moving_X_train, contacted_moving_X_test, contacted_moving_y_train, contacted_moving_y_test = train_test_split(contacted_movingDf_X.to_numpy(),
                                                contacted_movingDf_y.to_numpy().ravel(),random_state = 0)

contacted_moving_clf = (GradientBoostingClassifier( random_state = 0
                                    , learning_rate = .25, max_depth = 2, n_estimators = 8
                                     )
               .fit(contacted_moving_X_train, contacted_moving_y_train))

contacted_moving_clfscore = contacted_moving_clf.decision_function(contacted_moving_X_test)
contacted_moving_fpr, contacted_moving_tpr, _ = roc_curve(contacted_moving_y_test, contacted_moving_clfscore)
contacted_moving_roc_auc = auc(contacted_moving_fpr, contacted_moving_tpr)

print(contacted_moving_clf.feature_importances_)
print(contacted_movingDf_X.columns)
print('Accuracy of GBDT classifier on training set: {:.2f}'
     .format(contacted_moving_clf.score(contacted_moving_X_train, contacted_moving_y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}'
    .format(contacted_moving_clf.score(contacted_moving_X_test, contacted_moving_y_test)))
print('ROC AUC of GBDT classifier on test set: {:.2f}'
    .format(contacted_moving_roc_auc))

[0.04224316 0.02392917 0.10179443 0.82228655 0.0097467  0.        ]
Index(['num_reviews', 'avg_rating', 'cost_estimate_cents', 'result_position',
       'time_since_logged_in', 'hour'],
      dtype='object')
Accuracy of GBDT classifier on training set: 0.90
Accuracy of GBDT classifier on test set: 0.90
ROC AUC of GBDT classifier on test set: 0.81
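
The importance array is easier to read when paired with the column names and sorted. A small sketch using the values copied from the output above:

```python
# Values copied from the output above (contacted, Local Moving).
importances = [0.04224316, 0.02392917, 0.10179443, 0.82228655, 0.0097467, 0.0]
features = ["num_reviews", "avg_rating", "cost_estimate_cents",
            "result_position", "time_since_logged_in", "hour"]

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
for name, weight in ranked:
    print(f"{name:<22} {weight:.3f}")
```

Search rank carries about 82% of the importance in this model, followed by the cost estimate at about 10%.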


In [32]:
#Train hired_moving classifier model at best parameter setting to obtain the 
#  weights of each feature passed 

hired_moving_X_train, hired_moving_X_test, hired_moving_y_train, hired_moving_y_test = train_test_split(hired_movingDf_X.to_numpy(),
                                                hired_movingDf_y.to_numpy().ravel(),random_state = 0)

hired_moving_clf = (GradientBoostingClassifier( random_state = 0
                                    # note: the grid search above reported learning_rate 0.01 as best; 0.1 is used here
                                    , learning_rate = .1, max_depth = 4, n_estimators = 2
                                     )
               .fit(hired_moving_X_train, hired_moving_y_train))

hired_moving_clfscore = hired_moving_clf.decision_function(hired_moving_X_test)
hired_moving_fpr, hired_moving_tpr, _ = roc_curve(hired_moving_y_test, hired_moving_clfscore)
hired_moving_roc_auc = auc(hired_moving_fpr, hired_moving_tpr)

print(hired_moving_clf.feature_importances_)
print(hired_movingDf_X.columns)
print('Accuracy of GBDT classifier on training set: {:.2f}'
     .format(hired_moving_clf.score(hired_moving_X_train, hired_moving_y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}'
    .format(hired_moving_clf.score(hired_moving_X_test, hired_moving_y_test)))
print('ROC AUC of GBDT classifier on test set: {:.2f}'
    .format(hired_moving_roc_auc))

[0.34499328 0.         0.1612866  0.05919121 0.22233896 0.21218995]
Index(['num_reviews', 'avg_rating', 'cost_estimate_cents', 'result_position',
       'time_since_logged_in', 'hour'],
      dtype='object')
Accuracy of GBDT classifier on training set: 0.76
Accuracy of GBDT classifier on test set: 0.84
ROC AUC of GBDT classifier on test set: 0.51


In [34]:
#Train contacted_cleaning classifier model at best parameter setting to obtain the 
#  weights of each feature passed 

contacted_cleaning_X_train, contacted_cleaning_X_test, contacted_cleaning_y_train, contacted_cleaning_y_test = train_test_split(contacted_cleaningDf_X.to_numpy(),
                                                contacted_cleaningDf_y.to_numpy().ravel(),random_state = 0)

contacted_cleaning_clf = (GradientBoostingClassifier( random_state = 0
                                    , learning_rate = .25, max_depth = 3, n_estimators = 8
                                     )
               .fit(contacted_cleaning_X_train, contacted_cleaning_y_train))

contacted_cleaning_clfscore = contacted_cleaning_clf.decision_function(contacted_cleaning_X_test)
contacted_cleaning_fpr, contacted_cleaning_tpr, _ = roc_curve(contacted_cleaning_y_test, contacted_cleaning_clfscore)
contacted_cleaning_roc_auc = auc(contacted_cleaning_fpr, contacted_cleaning_tpr)

print(contacted_cleaning_clf.feature_importances_)
print(contacted_cleaningDf_X.columns)
print('Accuracy of GBDT classifier on training set: {:.2f}'
     .format(contacted_cleaning_clf.score(contacted_cleaning_X_train, contacted_cleaning_y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}'
    .format(contacted_cleaning_clf.score(contacted_cleaning_X_test, contacted_cleaning_y_test)))
print('ROC AUC of GBDT classifier on test set: {:.2f}'
    .format(contacted_cleaning_roc_auc))

[0.04768329 0.07699484 0.04274963 0.77536943 0.0477916  0.0094112 ]
Index(['num_reviews', 'avg_rating', 'cost_estimate_cents', 'result_position',
       'time_since_logged_in', 'hour'],
      dtype='object')
Accuracy of GBDT classifier on training set: 0.96
Accuracy of GBDT classifier on test set: 0.96
ROC AUC of GBDT classifier on test set: 0.83
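
The 0.96 accuracy above is easier to interpret next to a majority-class baseline. A back-of-the-envelope check using the House Cleaning counts reported earlier in this notebook (19054 search results, 809 of them contacted), and assuming the test split has roughly the same class balance:

```python
# Counts reported earlier in the notebook for House Cleaning.
total_rows = 19054      # search results in the category
contacted_rows = 809    # rows with a contact

contact_rate = contacted_rows / total_rows
# Accuracy of a model that always predicts "not contacted".
majority_baseline = 1 - contact_rate

print(f"contact rate:      {contact_rate:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

Always predicting "not contacted" already scores about 0.96, so accuracy says little here and the AUC is the more informative number.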


In [36]:
#Train hired_cleaning classifier model at best parameter setting to obtain the 
#  weights of each feature passed 

hired_cleaning_X_train, hired_cleaning_X_test, hired_cleaning_y_train, hired_cleaning_y_test = train_test_split(hired_cleaningDf_X.to_numpy(),
                                                hired_cleaningDf_y.to_numpy().ravel(),random_state = 0)

hired_cleaning_clf = (GradientBoostingClassifier( random_state = 0
                                    , learning_rate = .05, max_depth = 1, n_estimators = 8
                                     )
               .fit(hired_cleaning_X_train, hired_cleaning_y_train))

hired_cleaning_clfscore = hired_cleaning_clf.decision_function(hired_cleaning_X_test)
hired_cleaning_fpr, hired_cleaning_tpr, _ = roc_curve(hired_cleaning_y_test, hired_cleaning_clfscore)
hired_cleaning_roc_auc = auc(hired_cleaning_fpr, hired_cleaning_tpr)

print(hired_cleaning_clf.feature_importances_)
print(hired_cleaningDf_X.columns)
print('Accuracy of GBDT classifier on training set: {:.2f}'
     .format(hired_cleaning_clf.score(hired_cleaning_X_train, hired_cleaning_y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}'
    .format(hired_cleaning_clf.score(hired_cleaning_X_test, hired_cleaning_y_test)))
print('ROC AUC of GBDT classifier on test set: {:.2f}'
    .format(hired_cleaning_roc_auc))

[0. 0. 0. 0. 1. 0.]
Index(['num_reviews', 'avg_rating', 'cost_estimate_cents', 'result_position',
       'time_since_logged_in', 'hour'],
      dtype='object')
Accuracy of GBDT classifier on training set: 0.75
Accuracy of GBDT classifier on test set: 0.72
ROC AUC of GBDT classifier on test set: 0.56
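
The four training cells above differ only in their inputs and hyperparameters, so they could be folded into one helper. A minimal sketch, assuming a hypothetical function `fit_and_report` and synthetic data from `make_classification` so it runs on its own (it is not the Thumbtack sample and will not reproduce the numbers above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def fit_and_report(X, y, feature_names, **gb_params):
    """Fit a gradient boosting classifier and return test metrics plus importances."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = GradientBoostingClassifier(random_state=0, **gb_params)
    model.fit(X_train, y_train)
    scores = model.decision_function(X_test)
    return {
        "test_auc": roc_auc_score(y_test, scores),
        "test_accuracy": model.score(X_test, y_test),
        "importances": dict(zip(feature_names, model.feature_importances_)),
    }

# Synthetic data so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
names = [f"feature_{i}" for i in range(6)]
report = fit_and_report(X_demo, y_demo, names,
                        learning_rate=0.25, max_depth=2, n_estimators=8)
print(report)
```

On the real data the call would look like `fit_and_report(hired_cleaningDf_X.to_numpy(), hired_cleaningDf_y.to_numpy().ravel(), hired_cleaningDf_X.columns, learning_rate=.05, max_depth=1, n_estimators=8)`.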
