## Set up API

To get information from Upwork's API, you first need to request an API key. You'll receive a public key and a secret key. Then you can install the `upwork` Python package and request a token. After that, you can make requests until the cows come home (actually, only until you hit the 40K limit).

In [None]:
import upwork

# Placeholders: replace with the keys issued for your API application
pk = 'YYY'  # public key
sk = 'XXX'  # secret key
client = upwork.Client(public_key=pk, secret_key=sk)

In [None]:
client.auth.get_request_token()

In [None]:
import requests

authorize_url = client.auth.get_authorize_url()

# Invoke the authorize url
requests.get(authorize_url)

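# Open the link in a browser, authorize the app, and paste the verification code here.
# (raw_input is Python 2; on Python 3 use input instead.)
verifier = raw_input(
    'Please enter the verification code you get '
    'following this link:\n{0}\n\n> '.format(authorize_url))

# Exchange the verification code for the access token pair
access_token, access_token_secret = client.auth.get_access_token(verifier)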

## Get All the Freelancers in the Data Science & Analytics Category on Upwork

In [None]:
import json
from pymongo import MongoClient

# Use a separate name so the MongoDB client does not shadow the Upwork client
mongo_client = MongoClient()
db = mongo_client.data_scientist_profiles

In [None]:
client = upwork.Client(pk, sk, access_token, access_token_secret)
data = {'category2': 'Data Science & Analytics'}

Define a function that will get a certain number of pages of 99 data scientists each. 
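To make the paging arithmetic concrete, here is a small sketch of the offsets the function requests. The helper name `page_offsets` is made up for illustration; the formula is the same `(i - 1) * 99 + 1` used in the notebook's function.

```python
def page_offsets(num_pages, page_size=99):
    """Return the page_offset value for each of the first num_pages pages."""
    return [(page - 1) * page_size + 1 for page in range(1, num_pages + 1)]


print(page_offsets(3))  # [1, 100, 199]
```

Each page therefore starts exactly 99 records after the previous one, so consecutive pages do not overlap.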

In [None]:
def final_data_science_profiles(page_number):
    data = {'category2': 'Data Science & Analytics'}
    # range's upper bound is exclusive, so add 1 to fetch exactly page_number pages
    for i in range(1, page_number + 1):
        # get up to 99 data scientists per page
        data_scientists = client.provider_v2.search_providers(
            data=data, page_offset=(i - 1) * 99 + 1, page_size=99)
        # if no entries are returned, stop
        if len(data_scientists) == 0:
            break
        # insert records into the DB
        db.final_profiles.insert_many(data_scientists)

In [None]:
final_data_science_profiles(1500)

Practical tip: this will sometimes fail with the error "Duplicate timestamp/nonce combination, possible replay attack. Request rejected." I think this happens because the requests arrive too quickly. When actually gathering the data, I therefore had to run the function above multiple times, each time starting where the last run stopped.
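One way to avoid restarting by hand is to retry a failed page after a short pause and to report the page at which to resume. The following is only a sketch of that idea, not what I actually ran: `fetch_pages`, `fetch_page` and `store` are hypothetical names, with `fetch_page` standing in for the `search_providers` call and `store` for `insert_many`.

```python
import time


def fetch_pages(fetch_page, store, start_page=1, max_pages=1500,
                page_size=99, retries=3, delay=4):
    """Fetch pages start_page..max_pages.

    Returns the page number to resume from if a page keeps failing,
    or None once the results run out or max_pages is reached.
    fetch_page(offset, size) and store(records) are hypothetical callables.
    """
    for page in range(start_page, max_pages + 1):
        offset = (page - 1) * page_size + 1
        for _ in range(retries):
            try:
                records = fetch_page(offset, page_size)
                break
            except Exception:
                time.sleep(delay)  # pause before retrying the same page
        else:
            return page  # every attempt failed: resume from this page later
        if not records:
            return None  # no more results
        store(records)
    return None


# Possible usage with the notebook's objects (untested assumption):
# next_page = fetch_pages(
#     lambda offset, size: client.provider_v2.search_providers(
#         data=data, page_offset=offset, page_size=size),
#     db.final_profiles.insert_many)
```

If the function returns a page number, calling it again with `start_page` set to that number continues where the previous run stopped.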

In [None]:
len(db.final_profiles.distinct("id"))

Our final dataset has about 93,000 profiles of freelancers. 

## Get Detailed Profile Information

Upwork's API also provides very detailed profile information for a given ID. Therefore, once I had all the IDs for the data scientists, I could use this API call to get that information, including all the jobs they completed. Because of the error noted above, I wrapped the call in a try/except block that sleeps for 4 seconds after each failed call. I also saved time by not writing every entry to MongoDB immediately.

Theoretically `get_provider` should also work with a list of up to 20 profile ids, but I could not get that working. 

In [None]:
import time

def get_profile_details(ids):
    profile_details = []
    for user_id in ids:
        try:
            profile_detail = client.provider.get_provider(user_id)
            profile_details.append(profile_detail)
        except Exception:
            # wait before retrying; failures are usually the nonce error noted above
            time.sleep(4)
            # write everything gathered so far (insert_many rejects an empty list)
            if profile_details:
                db.final_profile_details.insert_many(profile_details)
                profile_details = []
            # retry the profile that failed and write it
            profile_detail = client.provider.get_provider(user_id)
            db.final_profile_details.insert_one(profile_detail)
    # insert any remaining profile details
    if profile_details:
        db.final_profile_details.insert_many(profile_details)

In [None]:
user_ids = db.final_profiles.distinct("id")
get_profile_details(user_ids)
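The function above retries a failed call only once, so a second failure in a row still stops the run. A more defensive variant might bound the number of retries, write in fixed-size batches, and return the IDs that never succeeded. This is a sketch under assumptions: `collect_with_retry`, `fetch` and `flush` are hypothetical names, with `fetch` standing in for `client.provider.get_provider` and `flush` for `db.final_profile_details.insert_many`.

```python
import time


def collect_with_retry(ids, fetch, flush, batch_size=500, retries=3, delay=4):
    """Fetch one record per id, retrying failures, and flush in batches.

    fetch(user_id) and flush(records) are hypothetical callables.
    Returns the ids that still failed after all retries.
    """
    buffer, failed = [], []
    for user_id in ids:
        for _ in range(retries):
            try:
                buffer.append(fetch(user_id))
                break
            except Exception:
                time.sleep(delay)  # pause before retrying the same id
        else:
            failed.append(user_id)  # give up on this id and move on
        if len(buffer) >= batch_size:
            flush(buffer)
            buffer = []
    if buffer:  # never hand an empty list to the writer
        flush(buffer)
    return failed


# Possible usage with the notebook's objects (untested assumption):
# failed_ids = collect_with_retry(
#     user_ids, client.provider.get_provider,
#     db.final_profile_details.insert_many)
```

The returned list can be passed back in as `ids` for another attempt later.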