# Creating Custom String Field for Salesforce Instance
## Example call for one input company

**Last Updated: January 1, 2021** | Created by <a href='https://www.linkedin.com/in/sophiaskowronski/'>Sophia Skowronski</a> | Data sourced from <a href='http://www.crunchbase.com/'>Crunchbase</a> via Enterprise License Agreement

Given an input company, generate a dataframe of custom strings listing its affiliated individuals and companies:

- Current Board Members
- Former Board Members
- Current Board Advisors/Observers
- Former Board Advisors/Observers
- All Investors
- All Investors, grouped by financing type

#### Overview
1. Use `autocompletes` function to find the input company's uuid.
2. Use `makequery_board_affiliations` and `go_past_1000` functions to pull all current and former board affiliations of the input company.
3. Use `primary_info` function to obtain the primary job title, primary organization, and LinkedIn URL of each individual.
4. Transform affiliations into dictionaries of concatenated strings. Save to CSV file.
5. Use `makequery_investors` and `go_past_1000` functions to pull all investors of the input company. 
6. Transform investment information into dictionaries of concatenated strings. Update CSV file & print out results.
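
Steps 2 and 5 page through results by passing the uuid of the last record received as `after_id` on the next request. The loop can be sketched as below, assuming a hypothetical in-memory `fetch_page` that stands in for the API call. The names `fetch_page`, `fetch_all`, and `fake_records` are illustrative only and are not part of the notebook.

```python
def fetch_page(records, limit, after_id=None):
    """Stand-in for one API call: return up to `limit` records after `after_id`."""
    start = 0
    if after_id is not None:
        ids = [r['uuid'] for r in records]
        start = ids.index(after_id) + 1
    return records[start:start + limit]


def fetch_all(records, limit):
    """Collect every record by repeatedly asking for the page after the last uuid seen."""
    collected = []
    after_id = None
    while True:
        page = fetch_page(records, limit, after_id)
        # An empty page means there is nothing left to fetch
        if not page:
            break
        collected.extend(page)
        # The next request starts after the most recently received record
        after_id = page[-1]['uuid']
    return collected


# Seven fake records fetched three at a time
fake_records = [{'uuid': 'id-{}'.format(i)} for i in range(7)]
all_records = fetch_all(fake_records, limit=3)
print(len(all_records))
```

Stopping on an empty page, rather than only on a target count, keeps the loop from running forever if the reported count and the returned records ever disagree.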

In [1]:
# Import relevant libraries
import time  # used to pause when the API usage limit is hit
import requests
import json
from json import JSONDecodeError
import pandas as pd
from pandas import json_normalize

# P1s Crunchbase API user key
from user_key import userkey
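
The functions below pass `userkey` directly as request parameters and unpack it with `**userkey`, so it must be a dictionary. The notebook does not show `user_key.py`. The following is only a guess at its shape: it assumes the key is sent as the `user_key` query parameter, and the environment variable name `CRUNCHBASE_USER_KEY` is hypothetical.

```python
import os

# Hypothetical contents of user_key.py: a one-entry dictionary of query parameters.
# The environment variable name is an assumption; a placeholder is used if it is unset.
userkey = {'user_key': os.environ.get('CRUNCHBASE_USER_KEY', 'INSERT_KEY_HERE')}

# The dictionary can then be merged with other request parameters
params = {**userkey, 'query': 'Salesforce'}
print(sorted(params))
```

Reading the key from the environment keeps it out of the notebook and out of version control.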

## All functions

In [2]:
def url_count(query, query_type): 
    '''
    Return the total number of results for a query (the `count` field of the deserialized JSON response).
    ''' 
    # POST method with API URL, query_type as a parameter, and passing query as json.
    r = requests.post('https://api.crunchbase.com/api/v4/searches/' + query_type, params=userkey, json=query)
    # Return total number of results of query
    return json.loads(r.text)['count']

def url_extraction(query, query_type):    
    '''
    Run a query, deserialize the results to a Python dictionary object, and append them to the global `raw` dataframe.
    '''
    # Use the global raw variable. This ensures that it can be updated if the API call needs to loop.
    global raw
    # POST method with API URL, query_type as a parameter, and passing query as json.
    r = requests.post('https://api.crunchbase.com/api/v4/searches/' + query_type, params=userkey, json=query)
    # Deserialize results of query
    result = json.loads(r.text)
    # Normalize semi-structured JSON data into a flat table, forcing it to fit into a relational data structure.
    normalized_raw = json_normalize(result['entities'])
    # Append normalized entity results to global raw variable
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    raw = pd.concat([raw, normalized_raw], ignore_index=True)

def autocompletes(query, collection_ids_list=None, limit=None):
    '''
    Suggests matching Identifier entities based on the query and entity_def_ids provided.
    
    QUERY
    Value to perform the autocomplete search with.
    
    COLLECTION_IDS_LIST
    A comma separated list of collection ids to search against. 
    Leaving this blank means it will search across all identifiers. 
    Entity defs can be constrained to specific facets by providing them as facet collections. 
    Relationship collections will resolve to their underlying entity def.
    Collection ids are: organizations, people, funding_rounds, acquisitions, investments,
    events, press_references, funds, event_appearances, ipos, ownerships, categories, 
    category_groups, locations, jobs
    
    LIMIT
    Number of results to retrieve; default = 10, max = 25
    '''
    # Create parameter dictionary to pass into GET method
    params = {**userkey, 'query': query}
    # Add input collection ids to parameters dictionary as a comma-separated string
    if collection_ids_list and isinstance(collection_ids_list, list):
        params.update({'collection_ids': ','.join(collection_ids_list)})
    # Add input limit to parameters dictionary
    if limit and isinstance(limit, int):
        params.update({'limit': limit})
    # GET method with API URL, passing the query as URL parameters.
    r = requests.get('https://api.crunchbase.com/api/v4/autocompletes', params=params)
    # Deserialize results of query
    result = json.loads(r.text)
    # Normalize semi-structured JSON data into a flat table, forcing it to fit into a relational data structure.
    normalized_result = json_normalize(result['entities'])
    # Return results of autocompletes query as pandas dataframe
    return pd.DataFrame.from_dict(normalized_result)

def makequery_board_affiliations(uuid_list, limit=1000):
    '''
    Job Search: Board Affiliations
    - Organization includes list of `uuid` values
    - Excludes `employee` and `executive` level jobs
    '''
    query = {
        'field_ids': [ # ADD FIELD IDS HERE
            'entity_def_id',
            'identifier',
            'job_type',
            'name',
            'organization_identifier',
            'person_identifier',
            'short_description',
            'is_current',
            'started_on',
            'ended_on',
            'title',
            'updated_at',
            'uuid'],
        'limit': limit, # INPUT LIMIT
        'query': [ # FILTERING CONDITIONS HERE
            {
                'type': 'predicate',
                'field_id': 'organization_identifier',
                'operator_id': 'includes',
                'values': uuid_list # INPUT UUID_LIST
            },
            {
                'type': 'predicate',
                'field_id': 'job_type',
                'operator_id': 'not_includes',
                'values': ['employee', 'executive']
            }]
    }
    return query

def primary_info(person_id, field_ids_list=['primary_job_title','primary_organization','linkedin'], card_ids_list=None):
    '''
    PERSON_ID
    UUID or permalink of desired entity
    
    FIELD_IDS
    Fields to include on the resulting entity - 
    either an array of field_id strings in JSON 
    or a comma-separated list encoded as string
    
    CARD_IDS
    Cards to include on the resulting entity - 
    array of card_id strings in JSON encoded as string
    Card Ids for Person: [degrees, event_appearances, fields, 
    founded_organizations, jobs, participated_funding_rounds, 
    participated_funds, participated_investments, partner_funding_rounds, 
    partner_investments, press_references, primary_job, primary_organization]
    '''
    # Create parameter dictionary to pass into GET method
    params = {**userkey}
    # Add input field ids to parameters dictionary
    if field_ids_list and isinstance(field_ids_list, list):
        params.update({'field_ids':','.join(field_ids_list)})
    # Add input cards ids to parameters dictionary
    if card_ids_list and isinstance(card_ids_list, list):
        params.update({'card_ids':','.join(card_ids_list)})
    # GET method with API URL, person_id in the path, and field/card ids as URL parameters.
    r = requests.get('https://api.crunchbase.com/api/v4/entities/people/' + person_id, params=params)
    # Return results of query
    result = json.loads(r.text)
    # Pull uuid of searched individual
    uuid = result['properties']['identifier']['uuid']
    name = result['properties']['identifier']['value']
    # Pull LinkedIn URL from json results (if it exists)
    try:
        linkedin = result['properties']['linkedin']['value']
    except KeyError:
        linkedin = 'NA'
    # Pull primary job title from json results (if it exists)
    try:
        title = result['properties']['primary_job_title']
    except KeyError:
        title = 'NA' 
    # Pull primary organization from json results (if it exists)
    try:
        org = result['properties']['primary_organization']['value']
    except KeyError:
        org = 'NA'   
    # Pull primary organization uuid from json results (if it exists)    
    try:
        org_uuid = result['properties']['primary_organization']['uuid']
    except KeyError:
        org_uuid = 'NA'
    return {uuid:name}, {uuid:title}, {uuid:org}, {uuid:org_uuid}, {uuid:linkedin}

def makequery_investors(uuid, limit=1000):
    '''
    Investment Search: Investors
    - Organization includes list of `uuid` values
    '''
    query = {
        'field_ids': [
            'name',
            'investor_identifier',
            'organization_identifier',
            'partner_identifiers'
        ],
        'limit': limit, # INPUT LIMIT
        'query': [ # FILTERING CONDITIONS HERE
            {
                'type': 'predicate',
                'field_id': 'organization_identifier',
                'operator_id': 'includes',
                'values': uuid
            }]
    }
    return query

def go_past_1000(query, query_type, comp_count):
    '''
    Repeat the query, paging with `after_id`, until comp_count records
    have been appended to the global `raw` dataframe.
    '''
    global raw
    # Removes after_id in case it's there before the query starts.
    query.pop('after_id', None)
    data_acq = 0
    while data_acq < comp_count:
        if data_acq != 0:
            # Saves most recently added uuid so the POST request starts after this one
            query['after_id'] = raw['uuid'].iloc[-1]
        # Extracts data
        url_extraction(query, query_type)
        # Stops if the call returned no new records, to avoid looping forever
        if len(raw) == data_acq:
            break
        # Updates data_acq variable
        data_acq = len(raw)

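def grab_uuid(x):
    # Return the uuid of the first listed partner; fall back to x if none is listed (e.g. NaN)
    try:
        return x[0]['uuid']
    except (TypeError, IndexError, KeyError):
        return x

def grab_name(x):
    # Return the name of the first listed partner; fall back to x if none is listed (e.g. NaN)
    try:
        return x[0]['value']
    except (TypeError, IndexError, KeyError):
        return x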

# Column/field mapper dictionaries
column_mapper = {'properties.organization_identifier.value':'company',
                 'properties.person_identifier.value':'person', 
                 'properties.person_identifier.uuid':'person_uuid',
                 'properties.title':'title', 
                 'properties.job_type':'job_type', 
                 'properties.started_on.value':'started_on',
                 'properties.ended_on.value':'ended_on',
                 'properties.is_current':'is_current',
                 'properties.updated_at':'record_last_updated',
                 'properties.investor_identifier.uuid':'investor_uuid',
                 'properties.investor_identifier.value':'investor_name'}
order = ['Grant','Pre Seed Round','Seed Round','Series A','Series B','Series C',
         'Series D','Series E','Series F','Series G','Series H','Series I',
         'Series J','Series K','Secondary Market','Private Equity Round',
         'Debt Financing','Angel Round','Funding Round','Venture Round',
         'Corporate Round','Non Equity Assistance', 'Convertible Note', 'Post-IPO Equity','']
shorthand = ['Grant','Pre Seed', 'Seed','A','B','C','D','E','F','G','H','I','J','K',
             'Secondary','Private Eq', 'Debt', 'Angel Rnd', 'Funding Rnd', 
             'Venture Rnd', 'Corporate Rnd', 'Non Equity Assist', 'Convert Note',
             'Post-IPO Equity','']
shorthand_map = dict(zip(order,shorthand))
order = {key: i for i, key in enumerate(order)}
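
The `order` and `shorthand_map` dictionaries turn a fixed list of financing types into a sort rank and a short label. The idea can be sketched with a shortened list; the names `stages`, `stage_rank`, and `stage_short` and the unknown type `'Series Z'` are made up for illustration. As an assumption beyond the notebook, unknown types are ranked last instead of raising an error.

```python
# A shortened copy of the financing-type list, for illustration only
stages = ['Seed Round', 'Series A', 'Series B', 'Venture Round', '']
short = ['Seed', 'A', 'B', 'Venture Rnd', '']

# Rank of each type is its position in the list; label is its shorthand
stage_rank = {stage: i for i, stage in enumerate(stages)}
stage_short = dict(zip(stages, short))

# 'Series Z' is a made-up type that is missing from the list
found = ['Series B', 'Series Z', 'Seed Round']

# Unknown types get a rank past the end of the list, so they sort last
ranked = sorted(found, key=lambda s: stage_rank.get(s, len(stage_rank)))
# Unknown types keep their full name as the label
labels = [stage_short.get(s, s) for s in ranked]

print(ranked)
print(labels)
```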

## Output

In [5]:
company = 'Salesforce'

found = autocompletes(company, ['organizations'], limit=1)
uuid = found['identifier.uuid'][0]
    
print('*'*50)
print('Searching for {}'.format(company.upper()))
print('*'*50)
print('Found {} !!!!!!!\nDESCRIPTION: {}\n'.format(found['identifier.value'][0].upper(), found['short_description'][0]))

# Make query of current/former board affiliations for companies
query = makequery_board_affiliations([uuid])

# Global raw variable
raw = pd.DataFrame()

# Run query w/ API call, which populates dataframe with query results
url_extraction(query, 'jobs')

# Filter down the query dataframe to usable fields
board_affiliations = raw[['properties.organization_identifier.value',  # Company name
                          'properties.person_identifier.uuid', # Person UUID
                          'properties.person_identifier.value',  # Person name
                          'properties.title',  # Job title of board affiliation
                          'properties.job_type',  # Crunchbase job_type
                          'properties.is_current', # Boolean of whether job is current or not
                          'properties.started_on.value', # Job start date
                          'properties.ended_on.value', # Job end date
                          'properties.updated_at'] # When Crunchbase last updated the record
                        ].sort_values(['properties.organization_identifier.value']).reset_index(drop=True) # Sort by company name
board_affiliations = board_affiliations.rename(column_mapper, axis=1)

# Get UUIDs of people
uuid_board_members = list(set(board_affiliations['person_uuid'].to_list()))

# Display
print('Total records found: {}'.format(board_affiliations.shape[0]))
print('Total unique records found: {}\n'.format(len(uuid_board_members)))
    
# Start with empty dictionaries
all_names = {}
all_titles = {}
all_orgs = {}
all_orgs_uuid = {}
all_linkedin = {}
no_primary_info = []

# For each API call, update dictionary if it's not empty
i = 0
print('Count of API calls, number of unique individuals found in query:')
while i < len(uuid_board_members):
    print(i+1,end=' ')
    person = uuid_board_members[i]
    try: 
        # API Call
        name, primary_job_title, primary_org, primary_org_uuid, linkedin = primary_info(person)
        all_names.update(name)
        # Update job title dictionary as long as it's not equal to 'NA'
        if primary_job_title[person] != 'NA':
            all_titles.update(primary_job_title)
        # Update organization dictionary as long as it's not equal to 'NA'
        if primary_org[person] != 'NA':
            all_orgs.update(primary_org)
        # Update organization uuid dictionary as long as it's not equal to 'NA'
        if primary_org_uuid[person] != 'NA':
            all_orgs_uuid.update(primary_org_uuid)
        # Update LinkedIn dictionary as long as it's not equal to 'NA'
        if linkedin[person] != 'NA':
            all_linkedin.update(linkedin)
        # If any are equal to 'NA', store in no_primary_info list for safekeeping.
        if primary_job_title[person] == 'NA' or primary_org[person] == 'NA' or linkedin[person] == 'NA' or primary_org_uuid[person] == 'NA':
            no_primary_info.append(person)
        # Continue looping
        i += 1
    except JSONDecodeError:
        print('[From Crunchbase: Usage limit exceeded. Pause for 5 seconds and continue.]',end =' ')
        time.sleep(5)

# Count of how many are missing Title, Organization, or LinkedIn
print('\n\n{} out of {} records are missing either a primary job title, primary organization, or LinkedIn url.\n'.format(len(no_primary_info),i))

# Add primary title, organization, and LinkedIn to dataframe
board_affiliations['person_title'] = board_affiliations['person_uuid'].map(all_titles)
board_affiliations['primary_org'] = board_affiliations['person_uuid'].map(all_orgs)
board_affiliations['person_linkedin'] = board_affiliations['person_uuid'].map(all_linkedin)

# Pull unique list of company names from series
company_list = list(set(board_affiliations['company'].to_list()))
company_list.sort()

# CURRENT BOARD MEMBERS
current_board_members = board_affiliations[(board_affiliations['is_current']) & 
                                           (board_affiliations['job_type']=='board_member')
                                          ].sort_values(['person'])
# FORMER BOARD MEMBERS
former_board_members = board_affiliations[(board_affiliations['is_current']==False) &
                                          (board_affiliations['job_type']=='board_member')
                                         ].sort_values(['person'])
# CURRENT BOARD ADVISORS/OBSERVERS
current_board_other = board_affiliations[(board_affiliations['is_current']) &
                                         (board_affiliations['job_type']!='board_member')
                                        ].sort_values(['person'])
# FORMER BOARD ADVISORS/OBSERVERS
former_board_other = board_affiliations[(board_affiliations['is_current'] == False) &
                                          (board_affiliations['job_type']!='board_member')
                                         ].sort_values(['person'])
# To iterate through
frames = [current_board_members, former_board_members, current_board_other, former_board_other]

# For saving to csv
all_dict = []
for df in frames:
    # Fill in affiliation dictionary
    people_dict = {}
    # For each company
    for org in company_list:
        # Filter df to unique company affiliations
        temp_df = df[df['company']==org]
        # Collapse individual names to list
        names = temp_df['person'].to_list()
        # Collapse individual organizations to list
        companies = temp_df['primary_org'].to_list()
        # Start with empty string
        board_string = ''
        # Exclude if there are no individuals affiliated
        if names != []:
            # Make temp dictionary of name:org
            board_info = dict(zip(names, companies))
            # Add them to string
            for name, company in sorted(board_info.items()):
                # If individual does not have a primary organization
                if pd.isna(company):
                    board_string += name + '; '
                # If individual has a primary organization, place into parentheses
                else:
                    board_string += name + ' (' + company + '); '
            # Remove trailing semicolon and remove extra commas
            board_string = board_string[:-2].replace(',', '')
        # Add string to main dictionary 
        people_dict[org] = board_string
    all_dict.append(people_dict)
    
# Save to CSV
with open('output/search_example.csv', 'w') as f:
    # Add header
    f.write('Company,Current Board Members,Former Board Members,Current Board Advisors/Observers,Former Board Advisors/Observers\n')
    # Add one row per company
    for key in company_list:
        f.write('%s, %s, %s, %s, %s\n' % (key,all_dict[0][key],all_dict[1][key],all_dict[2][key],all_dict[3][key]))

# Get all of the investors for P1 companies
query = makequery_investors([uuid])

# Global raw variable
raw = pd.DataFrame() 
comp_count = url_count(query, 'investments') 
go_past_1000(query, 'investments', comp_count)

# Create dataframe that contains the investor name, org name, and type of investment (for grouping)
investors = raw.sort_values('properties.organization_identifier.value')[['properties.investor_identifier.uuid',
                                                                            'properties.investor_identifier.value',
                                                                            'properties.identifier.value',
                                                                            'properties.organization_identifier.value',
                                                                           'properties.partner_identifiers']].reset_index(drop=True)
investors['properties.investor_identifier.value'] = investors['properties.investor_identifier.value'].str.strip('-')

# Extract financing type from title string
investors['type'] = investors['properties.identifier.value'].str.partition(' - ')[0].str.partition(' in ')[2]

# Pull the first listed partner's uuid and name into new dataframe columns
investors['partner_uuid'] = investors['properties.partner_identifiers'].apply(grab_uuid)
investors['partner_name'] = investors['properties.partner_identifiers'].apply(grab_name)
investors = investors.drop(['properties.identifier.value','properties.partner_identifiers'], axis=1)
investors = investors.fillna('Not Listed')

# Send through column_mapper
investors = investors.rename(column_mapper, axis=1)

# Remove duplicates
investors = pd.DataFrame(investors.groupby(['investor_uuid','investor_name','company','type','partner_uuid','partner_name']).count().reset_index())

all_investors = {}
for co in investors.company.unique():
    investor_str = ''
    # Create sub-DF with just the company's investors
    co_df = investors[investors['company']==co]
    # Turn into list
    co_investors = sorted(list(set(co_df['investor_name'].to_list())))
    for inv in co_investors:
        # Add investor to string
        investor_str += inv + '; '
    # Remove trailing semicolon
    investor_str = investor_str[:-2]
    all_investors[co] = investor_str  

all_investors_w_info = {}
for co in investors.company.unique():
    investor_str = ''
    # Create sub-DF with just the company's investors
    co_df = investors[investors['company']==co]
    # Create unique list of investment types
    lst = list(set(co_df['type'].to_list()))
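    # Sort by financing stage; types missing from `order` are placed last instead of raising a KeyError
    lst = sorted(lst, key=lambda d: order.get(d, len(order)))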
    for item in lst:
        # Create sub-list of investors with a particular investment type
        investor_type_lst = sorted(list(set(co_df['investor_name'][co_df['type']==item].to_list())))
        if investor_type_lst != []:
            # Add investment type to string
            investor_str += item + ' ('
            for com in investor_type_lst:
                # Add investor to string
                investor_str += com + '; '
            # Remove trailing semicolon and add end parenthesis
            investor_str = investor_str[:-2]
            investor_str += ') | '
    # Remove trailing characters
    investor_str = investor_str[:-3]
    all_investors_w_info[co] = investor_str  
    
# Add new concatenated strings to dataframe
mapping = pd.read_csv('output/search_example.csv')
mapping['Investors (All)'] = mapping['Company'].map(all_investors)
mapping['Investors (w/ Info)'] = mapping['Company'].map(all_investors_w_info)
mapping.to_csv('output/search_example.csv', index=False)

# Output
print('*'*50)
# Read the company name from the results rather than from a leftover loop variable
print('Results for {}'.format(mapping.loc[0, 'Company'].upper()))
print('*'*50)
for col in mapping.columns:
    print('{}:\n{}\n\n'.format(col.upper(), mapping.loc[0,col]))

**************************************************
Searching for SALESFORCE
**************************************************
Found SALESFORCE !!!!!!!
DESCRIPTION: Salesforce is a global cloud computing company that develops CRM solutions and provides business software on a subscription basis.

Total records found: 30
Total unique records found: 30

Count of API calls, number of unique individuals found in query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 

9 out of 30 records are missing either a primary job title, primary organization, or LinkedIn url.

**************************************************
Results for SALESFORCE
**************************************************
COMPANY:
Salesforce


CURRENT BOARD MEMBERS:
 Alan Hassenfeld (Salesforce); Colin Powell (Kleiner Perkins); Craig Conway (Salesforce); John Roos (Geodesic Capital); Larry Tomlinson (Salesforce); Magdalena Yesil (Broadway Angels); Marc Benioff (Salesforce); Maynard Webb (Web
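
Board strings like the one printed above are built by joining `Name (Organization)` entries with semicolons, and the notebook strips commas from them so the hand-written CSV rows stay aligned. An alternative, sketched here with Python's standard `csv` module, is to let the writer quote fields instead. `board_string` is an illustrative helper, and the people `Jane Doe` and `John Smith` and the organization `Example, Inc.` are made up.

```python
import csv
import io


def board_string(people):
    """Join (name, primary_org) pairs as 'Name (Org); Name', sorted by name."""
    parts = []
    for name, org in sorted(people, key=lambda p: p[0]):
        # People without a primary organization are listed by name only
        parts.append('{} ({})'.format(name, org) if org else name)
    return '; '.join(parts)


# 'Jane Doe', 'John Smith', and 'Example, Inc.' are made-up sample values
people = [('Marc Benioff', 'Salesforce'),
          ('Jane Doe', 'Example, Inc.'),
          ('John Smith', None)]
members = board_string(people)

# csv.writer quotes any field that contains a comma, so no commas need to be removed
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Company', 'Current Board Members'])
writer.writerow(['Salesforce', members])

# Reading the text back recovers the original string, comma included
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[1][1])
```

Using `'; '.join(...)` also avoids trimming a trailing separator after the loop.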

# Visualization of Relationships w/ turicreate & NetworkX