# Thumbtack Analytics Challenge
  
Thumbtack has decided to take a closer look at performance in two of its largest categories - House
Cleaning and Local Moving. Please complete the analyses suggested below and overlay your own
recommendations for how we can improve and grow our marketplace.

    ● Based on the data, what types of pros are customers interested in?
    
        Customers contact pros at a higher rate when the pro has
         1. Recent site activity (a proxy for response time?)
         2. More reviews
         3. A high search rank
        Somewhat surprisingly, rating was not a leading indicator of whether a pro was contacted.
        
        Differentiators for actually being hired
         1. Recent site activity (a proxy for response time?)
         2. More reviews
         

    The dashboard below gives a quick glance at the variables available.

https://public.tableau.com/views/ThumbtackSampleProj/ConsumerPreferences?:language=en&:embed_code_version=3&:loadOrderID=2&:display_count=y&publish=yes&:origin=viz_share_link
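
The contact-rate comparison summarized above can be approximated in pandas as well as in Tableau. The following is a minimal sketch on made-up rows, not the dashboard's actual calculation: the `results` frame, the `rank_bucket` column, and the "top 3" cut-off are illustrative assumptions, while `result_position`, `num_reviews`, and `contacted` follow the column names used later in this notebook.

```python
import pandas as pd

# Toy rows standing in for the merged search results (all values are made up).
results = pd.DataFrame({
    "result_position": [1, 2, 3, 4, 5, 6, 7, 8],
    "num_reviews":     [40, 35, 2, 0, 12, 1, 0, 3],
    "contacted":       [True, True, False, False, True, False, False, False],
})

# Bucket the search rank into "top 3" vs. everything below (assumed cut-off).
results["rank_bucket"] = pd.cut(
    results["result_position"],
    bins=[0, 3, float("inf")],
    labels=["top 3", "4+"],
)

# The mean of a boolean column is the share of rows that are True,
# i.e. the contact rate within each bucket.
contact_rate_by_rank = (
    results.groupby("rank_bucket", observed=True)["contacted"].mean()
)
print(contact_rate_by_rank)
```

The same pattern should work for `num_reviews` or recency by swapping the column passed to `pd.cut` and choosing bins that suit its range.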

     ● Based on the types of pros that customers are interested in, 
            how would you describe the quantity and quality of the search results? 
            What could be improved?

     Quantity and Quality:    
            * House Cleaning has a very large number of results that grade poorly on 
            the features associated with being hired (roughly 13x more than the top grade). 
            * Local Moving is hired less often and shows a similar imbalance: roughly 17x more 
            results in grade D than in grade A.
            
     What could be improved:
            Regardless of the number of jobs actually accepted, if the whole population of pros
            behaved the way the most-hired pros do, consumers would begin to 
            rely on other factors to pick whom to hire (perhaps availability, 
            response time, or positive reviews from friends/neighbors vs. "the crowd").
            
    The dashboard below gives a quick glance at the grade distribution across all of the 
    search-result records provided.


https://public.tableau.com/views/ThumbtackSampleProjGrading/GradingDistrobution?:language=en&:retry=yes&:display_count=y&:origin=viz_share_link

# The following is an example of using Python and machine learning to show which "features" (i.e., columns) are the best predictors of being contacted and hired.

# Prompt


Thumbtack is a marketplace for local services. Customers come to our website or mobile app to see our
directory of service professionals (example) in nearly 500 categories. As part of the search experience, customers can provide some basic details about their projects in the search filters to see pros that best match their needs. Customers can also see pros’ price estimates for their projects. From the list of pros, customers can then explore pro profiles, contact the pros that interest them, and ultimately hire a pro. In this process, Thumbtack generates revenue by charging pros for each customer that contacts them.

Downloaded files: https://drive.google.com/drive/folders/1v8wmMVvQPFBHjtGutjYA4bL9V_eEt7Ii

# Visitors CSV
This dataset contains a list of search results. Each result is a pro that matched a specific visitor’s search.

    ● row_number (integer): row number in data set
    ● visitor_id (integer): unique identifier for the visitor that the 
        search result is associated with
    ● search_timestamp (timestamp): timestamp of when the visitor loaded 
        the search results
    ● category (string): category of the visitor’s search
    ● pro_user_id (integer): unique identifier for the pro
    ● num_reviews (integer): number of reviews that the pro had at the 
        time of the search
    ● avg_rating (float): average rating across pro’s reviews
    ● pro_last_active_time_before_search (timestamp): timestamp of when 
        the pro last responded to a customer that contacted them, prior 
        to the search_timestamp
    ● cost_estimate_cents (integer): pro’s price estimate for the visitor’s 
        project, in cents. For House Cleaning searches, this is the price estimate 
        for the entire project. For Local Moving searches, this is the estimated 
        hourly rate.
    ● result_position (integer): pro’s rank in search results. Rank = 1 means 
        the pro was ranked first among the search results.
    ● service_page_viewed (boolean): TRUE indicates that the visitor clicked 
        to view the pro’s profile, FALSE otherwise


# Contacts CSV
This dataset contains a list of customers reaching out to pros. Each row is a visitor that reached out to a pro through a search in the Visitors CSV.
    
    ● visitor_id (integer): unique identifier for the visitor that reached
        out to the pro
    ● pro_user_id (integer): unique identifier for the pro that the visitor contacted
    ● contact_id (integer): unique identifier for the visitor-pro contact
    ● hired (boolean): TRUE indicates that the visitor eventually hired 
        the pro, FALSE otherwise


In [3]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

visitorsDf = pd.read_csv('ThumbTack Proj/Visitors.csv')
visitorsDf = visitorsDf[['row_number', 'visitor_id', 'search_timestamp', 'category',
       'pro_user_id', 'num_reviews', 'avg_rating','pro_last_active_time_before_search'
        , 'cost_estimate_cents','result_position', 'service_page_viewed']]
contactsDf = pd.read_csv('ThumbTack Proj/Contacts.csv')
contactsDf = contactsDf[['visitor_id', 'pro_user_id', 'contact_id', 'hired']]


In [4]:
visitorsDf.columns
len(visitorsDf)

26102

In [5]:
# Run profile report across each DF and decide which cols to keep/drop
ProfileReport(visitorsDf)


In [6]:
contactsDf.columns

Index(['visitor_id', 'pro_user_id', 'contact_id', 'hired'], dtype='object')

In [7]:
# Run profile report across each DF and decide which cols to keep/drop
ProfileReport(contactsDf)


# Profile Report Warnings
This output was clipped from the report HTML renderings, which I'm withholding from the final version of this notebook. It is relevant when considering the removal of sparse fields/features, constants, etc.

visitorsDf
    
    search_timestamp has a high cardinality: 3428 distinct values (High cardinality)
    pro_last_active_time_before_search has a high cardinality: 14610 distinct values (High cardinality)
    avg_rating has 1155 (4.4%) missing values (Missing)
    pro_last_active_time_before_search has 1067 (4.1%) missing values (Missing)
    cost_estimate_cents has 2158 (8.3%) missing values (Missing)
    row_number has unique values (Unique)
    num_reviews has 1155 (4.4%) zeros (Zeros)

contactsDf
    
    contact_id is highly correlated with visitor_id (High correlation)
    visitor_id is highly correlated with contact_id (High correlation)
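
The missing-value warnings above can also be reproduced without the full profile report. The sketch below uses a made-up four-row frame (`df`, `missing_share`, and `rating_missing_iff_no_reviews` are illustrative names, not part of the notebook) to show the share of missing values per column. It also checks whether a missing `avg_rating` lines up with `num_reviews == 0`; the matching counts of 1155 in the warnings suggest this, but do not prove it, for the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the pattern reported by the profile warnings.
df = pd.DataFrame({
    "num_reviews": [0, 12, 3, 0],
    "avg_rating": [np.nan, 4.8, 5.0, np.nan],
    "cost_estimate_cents": [9000, np.nan, 12000, 8000],
})

# Share of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# True when avg_rating is missing exactly on the rows with zero reviews.
rating_missing_iff_no_reviews = (
    df["avg_rating"].isna() == (df["num_reviews"] == 0)
).all()
print(rating_missing_iff_no_reviews)
```

If the check holds on the real data, a missing rating means "no reviews yet" rather than a data-quality problem, which would affect how the gap should be filled.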

Open questions at this point:
    
    * Are visitor IDs unique to each search, regardless of whether the 
        visitor is a returning site visitor?
            ** They appear to be unique per search, or obscured.
    * What happens when joining on pro_user_id and visitor_id? Does the row count explode?
            ** No; the row count stays at 26102, so a left join is safe.
    

In [8]:
comboDf = pd.merge(visitorsDf, contactsDf, how='left', on=['visitor_id', 'pro_user_id'])
len(comboDf)
#comboDf.head(40)
#comboDf[comboDf['visitor_id']==343492100068655000]

26102

In [9]:
comboDf.columns

Index(['row_number', 'visitor_id', 'search_timestamp', 'category',
       'pro_user_id', 'num_reviews', 'avg_rating',
       'pro_last_active_time_before_search', 'cost_estimate_cents',
       'result_position', 'service_page_viewed', 'contact_id', 'hired'],
      dtype='object')

In [10]:
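comboDf['search_timestamp'] = pd.to_datetime(comboDf['search_timestamp'])
comboDf['pro_last_active_time_before_search'] = pd.to_datetime(comboDf['pro_last_active_time_before_search'])
# Hours between the pro's last response and the search. Computed as (last active - search),
# so values are zero or negative: -5.0 means the pro last responded 5 hours before the search.
comboDf['time_since_logged_in'] = ((comboDf['pro_last_active_time_before_search']
                                - comboDf['search_timestamp']) / np.timedelta64(1, 'h'))
# A search result counts as contacted when the left join found a matching row in Contacts.csv
comboDf['contacted'] = ~comboDf['hired'].isna()
comboDf['hour'] = comboDf['search_timestamp'].dt.hour
#comboDf.groupby('contacted')['contacted'].count()
#len(contactsDf)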
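
The cell above stores the recency feature as (last active - search), which yields non-positive hours. If a positive "hours since last activity" reads more naturally, the subtraction can be flipped. The sketch below shows that variant on two made-up rows; `toy` and `hours_since_last_active` are hypothetical names and the timestamps are invented for illustration.

```python
import pandas as pd

# Two made-up search results: one contacted (hired is False), one never contacted.
toy = pd.DataFrame({
    "search_timestamp": ["2018-11-01 12:00:00", "2018-11-01 12:00:00"],
    "pro_last_active_time_before_search": ["2018-11-01 06:00:00", None],
    "hired": [False, None],
})
toy["search_timestamp"] = pd.to_datetime(toy["search_timestamp"])
toy["pro_last_active_time_before_search"] = pd.to_datetime(
    toy["pro_last_active_time_before_search"]
)

# Search time minus last activity gives a positive number of hours;
# a missing last-active timestamp propagates as NaN.
toy["hours_since_last_active"] = (
    toy["search_timestamp"] - toy["pro_last_active_time_before_search"]
) / pd.Timedelta(hours=1)

# notna() is equivalent to ~isna(): any non-null hired value implies a contact.
toy["contacted"] = toy["hired"].notna()
print(toy[["hours_since_last_active", "contacted"]])
```

Note that a contact with `hired == False` still counts as contacted, because only rows with no match in the contacts data carry a null.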

In [11]:
#comboDf = comboDf[['row_number','category','hired','contacted',
#       'num_reviews','avg_rating','cost_estimate_cents','result_position',
#       'time_since_logged_in','hour']]

# Columns used as model inputs (row_number is kept as a row identifier)
featureCols = ['row_number','num_reviews','avg_rating','cost_estimate_cents',
               'result_position','time_since_logged_in','hour']

# Boolean indexing selects the same rows as .where(...).dropna(...) and keeps the original dtypes
movingDf = comboDf[comboDf['category'] == 'Local Moving (under 50 miles)'].copy()
    #len(movingDf) - 7048
cleaningDf = comboDf[comboDf['category'] == 'House Cleaning'].copy()
    #len(cleaningDf) - 19054

### Moving Category
contacted_movingDf_X = movingDf[featureCols]
contacted_movingDf_y = movingDf[['row_number','contacted']]

# Keep only the rows where the visitor hired the pro, so 'hired' is TRUE on every remaining row
movingDf = movingDf[movingDf['hired'] == 1]
    #len(hired_movingDf_X) - 155
hired_movingDf_X = movingDf[featureCols]
hired_movingDf_y = movingDf[['row_number','hired']]

### Cleaning Category 
contacted_cleaningDf_X = cleaningDf[featureCols]
contacted_cleaningDf_y = cleaningDf[['row_number','contacted']]

cleaningDf = cleaningDf[cleaningDf['hired'] == 1]
    #len(hired_cleaningDf_X)  208
hired_cleaningDf_X = cleaningDf[featureCols]
hired_cleaningDf_y = cleaningDf[['row_number','hired']]

In [12]:
len(hired_cleaningDf_X) 


208
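
Because the hired_* frames above keep only rows where `hired` is TRUE, their target column is constant, and a classifier cannot learn a hired/not-hired boundary from them. One possible alternative, offered as a sketch rather than as this notebook's method, is to restrict the frame to contacted rows and use `hired` as the label. The `toy` frame and the names `contacted_rows`, `X`, and `y` below are illustrative assumptions on made-up data.

```python
import pandas as pd

# Made-up rows: three contacted results (one hired) and one never contacted.
toy = pd.DataFrame({
    "num_reviews": [40, 2, 12, 0],
    "result_position": [1, 3, 5, 7],
    "contacted": [True, True, True, False],
    "hired": [True, False, False, None],
})

# Only contacted pros can be hired, so they form the population for a hire model.
contacted_rows = toy[toy["contacted"]]

X = contacted_rows[["num_reviews", "result_position"]]
# After filtering, hired holds only True/False, so the cast to bool is safe.
y = contacted_rows["hired"].astype(bool)

# Both classes are present, which a classifier needs in order to fit.
print(y.value_counts())
```

With this framing the contacted model answers "who gets contacted among all results" and the hire model answers "who gets hired among those contacted".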