
## **Project Overview**
**The purpose of this notebook is to ingest, clean, and transform five raw source files into five cleaned, standardized datasets.**

This process includes:
- Parsing and standardizing inconsistent fields (e.g., dates, phone numbers, emails)
- Deduplicating records and ensuring referential integrity
- Assigning unique identifiers to deals, companies, contacts, and marketing participants
- Structuring the data into clean, relational tables ready for downstream analysis, reporting, or database ingestion

All helper functions used below are defined in `functions.py`.

---

## **Final Deliverables**

- **`deals_df`**: A clean list of deal opportunities, each with a unique `Deal_ID`
- **`historical_financial_data_df`**: Historical EBITDA metrics linked by `Deal_ID`
- **`companies_df`**: A master list of companies, each with a unique `Company_ID`
- **`contacts_df`**: A cleaned list of contacts, each with a unique `Contact_ID`
- **`marketing_participants_df`**: Event participants linked to contacts by `Contact_ID`

### Imports/Installs

In [1]:
import pandas as pd
import openpyxl
import re
from functions import *

### Business Services Pipeline - Ingestion & Cleaning

In [2]:
#Ingesting Business Services Pipeline

business_pipeline_df = pd.read_excel("/Users/sonamrupani/Desktop/Intapp Data Engineer Assessment - Data/data_files/Business Services Pipeline.xlsx",
                                     skiprows = 5,
                                     usecols = "A:V")

# leaving 'Date Added' out for additional date parsing/standardization
bsp_dtypes = {
    "Company Name": "string",
    "Project Name": "string",
    "Investment Bank": "string",
    "Banker": "string",
    "Sourcing": "string",
    "Transaction Type": "string",
    "LTM Revenue": "Float64",
    "LTM EBITDA": "Float64",
    "2014A EBITDA": "Float64",
    "2015A EBITDA": "Float64",
    "2016A EBITDA": "Float64",
    "2017A/E EBITDA": "Float64",
    "2018E EBITDA": "Float64",
    "Vertical": "string",
    "Sub Vertical": "string",
    "Enterprise Value": "Float64",
    "Est. Equity Investment": "Float64",
    "Status": "string",
    "Current Owner": "string",
    "Business Description": "string",
    "Lead MD": "string",
    "Notes": "string" 
}

#Financial columns to check
bsp_financial_columns = [
    "LTM Revenue",
    "LTM EBITDA",
    "2014A EBITDA",
    "2015A EBITDA",
    "2016A EBITDA",
    "2017A/E EBITDA",
    "2018E EBITDA",
    "Enterprise Value",
    "Est. Equity Investment"
]

#Cleansing white spaces and new lines
business_pipeline_df = cleanse_column_names(business_pipeline_df)

#Ensure proper null format - pd.NA
business_pipeline_df = modernize_nans(business_pipeline_df)

#Rename columns
business_pipeline_df = business_pipeline_df.rename(columns={"Invest. Bank": "Investment Bank", "Equity Investment Est.": "Est. Equity Investment"})

#TODO double check
#hardcoding CAD to USD rate as 0.73
business_deals, business_deals_audit_df= process_financial_dataframe(business_pipeline_df, bsp_financial_columns, cad_to_usd_rate=0.73)

business_deals = business_deals.astype(bsp_dtypes)

# 1. Backup original Date Added column
business_deals["Date Added (Original)"] = business_deals["Date Added"]

# 2. Apply the date_parsing function row by row
business_deals["Date Added"] = business_deals["Date Added (Original)"].apply(date_parsing)

display(business_deals)

Unnamed: 0,Company Name,Project Name,Date Added,Investment Bank,Banker,Sourcing,Transaction Type,LTM Revenue,LTM EBITDA,2014A EBITDA,...,Vertical,Sub Vertical,Enterprise Value,Est. Equity Investment,Status,Current Owner,Business Description,Lead MD,Notes,Date Added (Original)
0,Shermco,,2018-02-02,Harris Williams,,Auction,Sponsor to Sponsor,,,,...,Business Services,"Testing, Inspection & Certificaiton",267.0,133.5,Active,Oaktree,"Electrical testing, maintenance, and commissio...",Jeannie Blackwood,,2018-02-02 00:00:00
1,Kastle Systems,,2018-02-02,,,Trusted Netwok,Sponsor to Sponsor,,,,...,Business Services,Facilities Services,,,Active,Venturehouse,"Provider of comprehensive, turnkey security so...",Andrew Mah,,2018-02-02 00:00:00
2,CLEAResult,,2018-02-02,,,Trusted Netwok,Sponsor to Sponsor,,,,...,Business Services,Facilities Services,,,Active,General Atlantic,Provider of energy efficiency and demand manag...,Kripa Shah,,2018-02-02 00:00:00
3,PLH,,2018-02-02,Barclays,,Auction,Sponsor to Sponsor,,,,...,Business Services,Industrial & Environmental Services,680.0,340.0,Active,Energy Capital Partners,Specialty contractor serving the electric powe...,Russ Barner,,2018-02-02 00:00:00
4,BBB Industries,,2018-02-02,"Baird, Jefferies",,Auction,Sponsor to Sponsor,,,,...,Business Services,Specialty Distribution,1000.0,500.0,Active,Pamplona,Provider of remanufactured replacement parts t...,Matthew Kordonowy,,2018-02-02 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123,DH Corporation,,2016-11-01,Take-private,,Auction,Corporate Seller,,,,...,Business Services,Financial Technology,2536.02,,Dead,Finastra,Financial technology,Russ Barner,2017A/E EBITDA: CAD; originally stored as CAD;...,Nov-16
124,National Response Corporation (NRC),,2016-11-01,Harris Williams,,Auction,Sponsor to Sponsor,,,,...,Business Services,"Testing, Inspection & Certificaiton",350.0,175.0,Dead,JF Lehman,Compliance and environmental services,Matthew Kordonowy,,Nov-16
125,SunGard Insurance,,2016-10-01,Direct to Company,,Proprietary,Corporate Seller,,,,...,Business Services,Financial Technology,680.0,,Dead,FIS,Insurance software provider,Andrew Mah,,Oct-16
126,PSG,,2016-09-01,Centerview,,Auction,Other Private Buyout,,,,...,Business Services,Specialty Distribution,,,Dead,"Sagard Holdings, Fairfax Capital",Sporting goods,Kripa Shah,,Sep-16
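
The `Date Added (Original)` column above mixes full timestamps (`2018-02-02 00:00:00`) with month-year strings (`Nov-16`), and the parsed column maps the latter to the first of the month. `date_parsing` lives in `functions.py` and is not shown here, so the following is only a minimal sketch of that kind of parser. The name `parse_mixed_date` and the list of formats are assumptions made for illustration.

```python
from datetime import date, datetime

# Formats assumed for illustration; the real workbook may contain others.
ASSUMED_FORMATS = ("%b-%y", "%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%m/%d/%Y")


def parse_mixed_date(value):
    """Hypothetical stand-in for date_parsing: return a date, or None if unparseable."""
    # Excel cells that were already dates arrive as datetime objects.
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value
    if not isinstance(value, str):
        return None
    text = value.strip()
    # Try each known format in turn; month-year strings default to day 1.
    for fmt in ASSUMED_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

Returning `None` for unparseable values, rather than raising, keeps the original column available for a manual review of whatever did not match.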


### Consumer Retail & Healthcare Pipeline - Ingestion & Cleaning

In [3]:
consumer_retail_healthcare_pipeline = pd.read_excel("Consumer Retail and Healthcare Pipeline.xlsx",
                                                    skiprows = 8,
                                                    usecols = "B:W")

#Cleansing white spaces and new lines
consumer_retail_healthcare_pipeline = cleanse_column_names(consumer_retail_healthcare_pipeline)

consumer_retail_healthcare_pipeline = modernize_nans(consumer_retail_healthcare_pipeline)

#Excel formatting causing null trailing rows - remove
consumer_retail_healthcare_pipeline = consumer_retail_healthcare_pipeline.dropna(subset=['Company Name'])
cols = [col for col in consumer_retail_healthcare_pipeline.columns if col != 'Company Name']
consumer_retail_healthcare_pipeline = consumer_retail_healthcare_pipeline[~consumer_retail_healthcare_pipeline[cols].isna().all(axis=1)]

consumer_retail_health_deals = consumer_retail_healthcare_pipeline.copy()

#Date columns are included here because the cast runs after the date parsing below
crhp_dtypes = {
    "Company Name": "string",
    "Project Name": "string",
    "Banker": "string",
    "Banker Email": "string",
    "Banker Phone Number": "string",
    "Sourcing": "string",
    "Transaction Type": "string",
    "LTM Revenue": "Float64",
    "LTM EBITDA": "Float64",
    "Vertical": "string",
    "Sub Vertical": "string",
    "Enterprise Value": "Float64",
    "Est. Equity Investment": "Float64",
    "Status": "string",
    "Portfolio Company Status": "string",
    "Active Stage": "string",
    "Passed Rationale": "string",
    "Current Owner": "string",
    "Business Description": "string",
    "Lead MD": "string",
    "Date Added": "datetime64[ns]",
    "Date Added (Original)": "datetime64[ns]",
    "Invest. Bank": "string"
}

crhp_financial_cols = ['LTM Revenue', 'LTM EBITDA', 'Enterprise Value', 'Est. Equity Investment']

consumer_retail_health_deals, crhp_audit_log = process_financial_dataframe(consumer_retail_health_deals, crhp_financial_cols)

# 1. Backup original Date Added column
consumer_retail_health_deals["Date Added (Original)"] = consumer_retail_health_deals["Date Added"]

# 2. Apply the date_parsing function row by row
consumer_retail_health_deals["Date Added"] = consumer_retail_health_deals["Date Added (Original)"].apply(date_parsing)

consumer_retail_health_deals = update_data_types(consumer_retail_health_deals, crhp_dtypes)

consumer_retail_health_deals = consumer_retail_health_deals.rename(columns = {"Invest. Bank": "Investment Bank"})

display(consumer_retail_health_deals)


Unnamed: 0,Company Name,Project Name,Date Added,Investment Bank,Banker,Banker Email,Banker Phone Number,Sourcing,Transaction Type,LTM Revenue,...,Est. Equity Investment,Status,Portfolio Company Status,Active Stage,Passed Rationale,Current Owner,Business Description,Lead MD,Notes,Date Added (Original)
0,Acima Credit,,2018-01-23,,,,,Proprietary,Initial Capitalization,,...,,Active,,CIM Received,,Founder,Rent-to-own consumer financing provider for du...,Jeannie Blackwood,,2018-01-23
1,Array,Maple,2017-09-01,Jefferies; Baird,Bill Cooling (Jefferies); Shaun Westfall (Jeff...,,258-664-9089,Auction,Sponsor to Sponsor,291.0,...,198.0,Active,,IOI Submitted,,Carlyle,Provider of end-to-end beauty merchandising so...,Andrew Mah,,2017-09-01
2,Electrical Components International,,2018-02-01,Barclays,,,,,,,...,,Active,,New Deal,,,Designer and manufacturer of electrical wire h...,Kripa Shah,,2018-02-01
3,European Wax Center,Beauty,2017-10-17,SunTrust,Scott Paton,Scott Paton@SunTrust .com,942-254-1327,Auction,Other Private Buyout,,...,,Active,,CIM Received,,Founders,Operator of over 600 waxing centers across the...,Russ Barner,,2017-10-17
4,Guitar Center,,2018-02-09,Houlihan Lokey UBS,,,,,Sponsor to Sponsor,,...,,Active,,New Deal,,,Leading retailer of musical instruments in the...,Matthew Kordonowy,,2018-02-09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
186,Schweiger Dermatology,,2018-01-17,,,,,Trusted Network,Sponsor to Sponsor,,...,,Passed/Dead,,New deal,On hold,LLR Capital / Founders,Roll-up of dermatology practices / owned by LL...,Andrew Mah,,2018-01-17
187,Firebirds,,2017-12-01,North Point Advisors,,,,,,,...,,Passed/Dead,,,,,Owner and operator of 45 Firebirds branded res...,Kripa Shah,,2017-12-01
188,Pacon,,2017-12-01,Baird,,,,,,,...,,Passed/Dead,,,,,Producer and marketer of arts and crafts products,Russ Barner,,2017-12-01
189,Potpourri Group,,2017-12-01,Lincoln International,,,,,,,...,,Passed/Dead,,,,,Direct-to-consumer marketer of women's apparel...,Matthew Kordonowy,,2017-12-01
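
The cell above removes Excel-formatting leftovers in two passes: rows with no `Company Name`, then rows where every other column is null. A small sketch of the same logic on toy data, wrapped in a function, is shown below. The function name `drop_placeholder_rows` and the sample rows are invented for illustration and are not part of `functions.py`.

```python
import pandas as pd


def drop_placeholder_rows(df, key="Company Name"):
    """Drop rows with no key, then rows where every non-key column is null."""
    df = df.dropna(subset=[key])
    other_cols = [col for col in df.columns if col != key]
    all_other_null = df[other_cols].isna().all(axis=1)
    return df[~all_other_null].reset_index(drop=True)


# Toy data: row 1 has no company name, row 2 has a name but nothing else.
raw = pd.DataFrame({
    "Company Name": ["Acme", None, "Footer text", "Globex"],
    "Status": ["Active", None, None, "Dead"],
    "Lead MD": ["A. Person", None, None, None],
})
cleaned = drop_placeholder_rows(raw)
```

Only the `Acme` and `Globex` rows survive, because each has at least one populated column besides the key.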


### Contacts - Ingest & Clean

In [4]:
tier_1_contacts = pd.read_excel("Contacts.xlsx", sheet_name = "Tier 1's")
tier_1_contacts["Tier"] = 1

tier_2_contacts = pd.read_excel("Contacts.xlsx", sheet_name = "Tier 2's")
tier_2_contacts["Tier"] = 2

contacts_df = pd.concat([tier_1_contacts, tier_2_contacts], ignore_index = True)

#Cleansing white spaces and new lines
contacts_df = cleanse_column_names(contacts_df)

contacts_df = modernize_nans(contacts_df)

contacts_dtypes = {
    "Firm": "string",
    "Name": "string",
    "Title": "string",
    "Group": "string",
    "Sub-Vertical": "string",
    "E-mail": "string",
    "Phone": "string",
    "Secondary Phone": "string",
    "City": "string",
    "Coverage Person": "string",
    "Preferred Contact Method": "string"
}

contacts_df = update_data_types(contacts_df, contacts_dtypes)

contacts_df['Birthday'] = pd.to_datetime(contacts_df['Birthday'])

display(contacts_df)

Unnamed: 0,Firm,Name,Title,Group,Sub-Vertical,E-mail,Phone,Secondary Phone,City,Birthday,Coverage Person,Preferred Contact Method,Tier
0,Harris Williams,Robert Baltimore,Managing Director,Business Services,Business Services,BBaltimore@harriswilliams.com,(804) 648-0072,,"Richmond, VA",1966-02-25,Hannah Jumper,Email,1
1,Harris Williams,Brian Lucas,Managing Director,Business Services,Business Services,blucas@harriswilliams.com,(804) 648-0072,,"Richmond, VA",1953-09-03,Kripa Shah,Business Phone,1
2,Harris Williams,Luke Semple,Managing Director,Business Services,Business Services,lsemple@harriswilliams.com,(804) 648-0072,,"Richmond, VA",1962-03-27,Emily Royal,Cell Phone,1
3,Harris Williams,Drew Spitzer,Managing Director,Business Services,Business Services,aspitzer@harriswilliams.com,(804) 648-0072,,"Richmond, VA",1964-04-28,Russ Barner,Business Phone,1
4,Harris Williams,Derek Lewis,Managing Director,Business Services,Business Services,dlewis@harriswilliams.com,(804) 648-0072,,"Richmond, VA",1971-04-24,Daniel Ding,Cell Phone,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
306,Cowen,Kevin Manning,"Managing Director, Head of Diversified Industr...",Industrials,Industrial & Environmental Services,kevin.manning@cowen.com,(312) 577-2228,(773) 304-6721,"Chicago, IL",1980-09-10,Emily Royal,Cell Phone,2
307,Petsky Prunier,Sanjay Chadda,Managing Director & Partner,Marketing Services,Marketing Services,schadda@petskyprunier.com,212-842-6022,,"New York, NY",1980-09-05,Kripa Shah,Email,2
308,Petsky Prunier,Marc Flor,Director,Marketing Services,Marketing Services,mflor@petskyprunier.com,212-842-6034,,"New York, NY",1950-04-15,Hannah Jumper,Email,2
309,AdMedia,Oliver Schweitzer,Managing Director,Marketing Services,Marketing Services,oschweitzer@admediapartners.com,(212) 759-1870,,"New York, NY",1978-04-01,Jeannie Blackwood,Cell Phone,2
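
The Business Services cell calls `DataFrame.astype` directly, while the later cells go through `update_data_types`, whose body is not shown in this notebook. One plausible reason for a wrapper is to tolerate dtype maps that mention columns a sheet does not contain. The sketch below illustrates that idea only. `cast_known_columns` is a hypothetical name, and the real helper may behave differently.

```python
import pandas as pd


def cast_known_columns(df, dtypes):
    """Hypothetical stand-in for update_data_types: cast only columns present in df."""
    present = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    return df.astype(present)


# Toy frame: the dtype map names a column that does not exist.
toy = pd.DataFrame({"Name": ["Ann", None], "Tier": [1, 2]})
typed = cast_known_columns(toy, {"Name": "string", "Missing Column": "string"})
```

Casting to the nullable `"string"` dtype keeps missing names as `pd.NA` instead of turning them into the literal text `"None"`.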


### Events - Ingest & Clean

In [5]:
leaders_partners_events = pd.read_excel("Events.xlsx", sheet_name = "Leaders and Partners Dinner")
leaders_partners_events['Event'] = "Leaders and Partners Dinner"

market_recap = pd.read_excel("Events.xlsx", sheet_name="2019 Market Re-Cap")
market_recap['Event'] = "2019 Market Re-Cap"

events_df = pd.concat([leaders_partners_events, market_recap], ignore_index=True)

#Cleansing white spaces and new lines
events_df = cleanse_column_names(events_df)

events_df = modernize_nans(events_df)

events_dtypes = {
    "Name": "string",
    "E-mail": "string",
    "Attendee Status": "string",
    "Event": "string"
}

events_df = update_data_types(events_df, events_dtypes)

display(events_df)

Unnamed: 0,Name,E-mail,Attendee Status,Event
0,Rob Baltimore,BBaltimore@harriswilliams.com,RSVP'd,Leaders and Partners Dinner
1,Brian Lucas,blucas@harriswilliams.com,Declined,Leaders and Partners Dinner
2,Luke Semple,lsemple@harriswilliams.com,Checked In,Leaders and Partners Dinner
3,Andrew Spitzer,aspitzer@harriswilliams.com,No Show,Leaders and Partners Dinner
4,Derek Lewis,dlewis@harriswilliams.com,Declined,Leaders and Partners Dinner
...,...,...,...,...
105,Greg Urban,gregory.urban@ubs.com,Checked In,2019 Market Re-Cap
106,Aftab Shahsingh,aftab.shahsingh@ubs.com,Checked In,2019 Market Re-Cap
107,Brendan Ryan,brendan.ryan@raymondjames.com,Checked In,2019 Market Re-Cap
108,Garrett DeNinno,garrett.deninno@raymondjames.com,Checked In,2019 Market Re-Cap
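
Names differ between the two sources (`Rob Baltimore` here, `Robert Baltimore` in the contacts sheet) while the e-mail addresses match, so e-mail is the more reliable key for linking event participants to contacts. The following is a minimal sketch of that link, assuming a `Contact_ID` column already exists on the contacts table. The ID values, the `email_key` helper column, and the case and whitespace differences in the toy rows are illustrative assumptions.

```python
import pandas as pd


def normalize_email(series):
    """Lower-case and strip whitespace so e-mail can serve as a join key."""
    return series.astype("string").str.strip().str.lower()


contacts = pd.DataFrame({
    "Contact_ID": ["C0001", "C0002"],  # hypothetical ID format
    "Name": ["Robert Baltimore", "Brian Lucas"],
    "E-mail": ["BBaltimore@harriswilliams.com", "blucas@harriswilliams.com"],
})
events = pd.DataFrame({
    "Name": ["Rob Baltimore", "Brian Lucas"],
    "E-mail": ["bbaltimore@harriswilliams.com ", "blucas@harriswilliams.com"],
    "Attendee Status": ["RSVP'd", "Declined"],
})

contacts["email_key"] = normalize_email(contacts["E-mail"])
events["email_key"] = normalize_email(events["E-mail"])

# Left join keeps every participant, even those with no matching contact.
participants = events.merge(
    contacts[["Contact_ID", "email_key"]], on="email_key", how="left"
).drop(columns="email_key")
```

A left join leaves `Contact_ID` null for unmatched participants, which makes them easy to list for follow-up.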


### PE Comps - Ingest & Clean

In [6]:
pe_companies = pd.read_excel("PE Comps.xlsx",
                             skiprows = 2)

#Dropping empty row between header and data
pe_companies = pe_companies.drop(0)

#Cleansing white spaces and new lines
pe_companies = cleanse_column_names(pe_companies)

pe_companies = modernize_nans(pe_companies)

pe_companies['AUM(Mns)'] = pe_companies['AUM(Bns)'] * 1000

pe_companies_dtypes = {
    "Priority": "string", #empty, but keeping string for flexibility
    "Company Name": "string",
    "Website": "string",
    "Sectors": "string",
    "Sample Portfolio Companies": "string",
    "Contact Name 1": "string",
    "Contact 2": "string",
    "Comments": "string",
}

pe_companies = update_data_types(pe_companies, pe_companies_dtypes)

pe_companies = pe_companies.rename(columns = {"Contact 2": "Contact Name 2"})


for col in pe_companies.columns:
    pe_companies[col] = pe_companies[col].apply(clean_dash_text)

pe_companies_df = pe_companies[[
    "Priority", 
    "Company Name", 
    "Website", 
    "AUM(Mns)", 
    "Sectors", 
    "Sample Portfolio Companies", 
    "Contact Name 1",
    "Contact Name 2",
    "Comments"]]

display(pe_companies_df)


Unnamed: 0,Priority,Company Name,Website,AUM(Mns),Sectors,Sample Portfolio Companies,Contact Name 1,Contact Name 2,Comments
1,,AEA Investors LP,www.aeainvestors.com,10000.0,"Consumer products, Industrial","Traeger (Current), Barnet (Cosmetic)","Martin Eltrich, III, Partner",,
2,,Audax Private Equity,www.audaxprivateequity.com,11500.0,Industrial,Chem Specialty Chemicals,"Christopher Satti, Business Dev\n(857) 294, 6640",,We have experience with specialty chemicals wi...
3,,CCMP Capital,www.ccmpcapital.com,12000.0,"Consumer products, Industrial","Jetro Cash & Carry, Shoes for Crews",Richard Zannino,Will Jaudes \nPrincipal,Are big on Consumer and industrial
4,,Clayton Dubilier & Rice,"www.cdr, inc.com",18000.0,"Consumer products, Industrial","Roofing Supply Group, US Foods, HD Supply",,,
5,,Crestview Partners,www.crestview.com,20000.0,Industrial Products,"Key Safety Systems, Accuride corporation","Alex Rose, Partner",,Have industrial products focus
6,,Genstar Capital,www.gencap.com,8500.0,Industrial Products,"Pretium Packaging, Fort Dearborn Company","Tony Salewski, Managing Director\n415 834 2350",,
7,,Golden Gate Capital,www.goldengatecap.com,14000.0,"Consumer products, Industrial","Eddie Bauer, Pacsun",Dave Thomas \nManaging Director\n415 983 2700,Scott Middleman \nAssociate\n415 983 2700\nsmi...,Mr. Thomas focuses on investments in Industria...
8,,Gores Group,www.gores.com,2400.0,"Consumer products, Industrial",Sage Automotive (previously),,,One of their strategies on some of their case ...
9,,Harvest Partners,www.harvestpartners.com,2000.0,Industrial Products,"Associated Materials (Prior), Driven Brands (C...","Ira D. Kleinman, Senior Managing Director",Paige Daly \nManaging Director\n212 599 6300 e...,Did addons for Associated Materials while they...
10,,Irving Place Capital,www.irvingplacecapital.com,4400.0,"Consumer products, Industrial","Bendon, New York and Co, Rag and Bone","David Knoch, Strategic Services and Partners","Devraj Roy \nPartner, Industrials\n212 551 466...",Have both branded consumer companies and indus...
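
`clean_dash_text` is defined in `functions.py` and its exact behavior is not shown. Values such as `www.cdr, inc.com` and `(857) 294, 6640` in the output above suggest that hyphens inside values may be getting rewritten, which is worth checking. If the intent is only to blank out cells that hold a placeholder dash, a narrower rule could look like the sketch below. The function name `dash_to_missing` and the set of dash characters are assumptions.

```python
import re

# Matches cells made up solely of hyphens, en dashes, or em dashes.
PLACEHOLDER = re.compile(r"^\s*[-\u2013\u2014]+\s*$")


def dash_to_missing(value, missing=None):
    """Return `missing` for dash-only cells; leave every other value unchanged."""
    if isinstance(value, str) and PLACEHOLDER.match(value):
        return missing
    return value
```

Applied with `Series.apply(dash_to_missing, missing=pd.NA)`, this would leave hyphenated websites and phone numbers intact while still nulling placeholder cells.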


In [7]:
# FINAL INGESTED AND CLEANED DATAFRAMES

display(business_deals)
display(consumer_retail_health_deals)
display(contacts_df)
display(events_df)
display(pe_companies_df)



### Data Modeling (would become a separate stage of the pipeline)

In [8]:
#Understand cols to ensure proper joins
print(f"Business Deals: {business_deals.columns.tolist()}")
print(f"Consumer Deals: {consumer_retail_health_deals.columns.tolist()}")
print(f"Events DF: {events_df.columns.tolist()}")
print(f"Contacts DF: {contacts_df.columns.tolist()}")
print(f"PE Companies DF: {pe_companies_df.columns.tolist()}")

Business Deals: ['Company Name', 'Project Name', 'Date Added', 'Investment Bank', 'Banker', 'Sourcing', 'Transaction Type', 'LTM Revenue', 'LTM EBITDA', '2014A EBITDA', '2015A EBITDA', '2016A EBITDA', '2017A/E EBITDA', '2018E EBITDA', 'Vertical', 'Sub Vertical', 'Enterprise Value', 'Est. Equity Investment', 'Status', 'Current Owner', 'Business Description', 'Lead MD', 'Notes', 'Date Added (Original)']
Consumer Deals: ['Company Name', 'Project Name', 'Date Added', 'Investment Bank', 'Banker', 'Banker Email', 'Banker Phone Number', 'Sourcing', 'Transaction Type', 'LTM Revenue', 'LTM EBITDA', 'Vertical', 'Sub Vertical', 'Enterprise Value', 'Est. Equity Investment', 'Status', 'Portfolio Company Status', 'Active Stage', 'Passed Rationale', 'Current Owner', 'Business Description', 'Lead MD', 'Notes', 'Date Added (Original)']
Events DF: ['Name', 'E-mail', 'Attendee Status', 'Event']
Contacts DF: ['Firm', 'Name', 'Title', 'Group', 'Sub-Vertical', 'E-mail', 'Phone', 'Secondary Phone', 'City', '

In [9]:
# CREATION OF DEALS DATASET

#Create Deal_ID before splitting into the deals DF and the financial history DF so the two can be joined
business_deals = business_deals.reset_index(drop=True)
business_deals['Deal_ID'] = business_deals.index.map(lambda x: f"D{x+1:04d}")

#Table dedicated to historical financial metrics
historical_financial_data_df = business_deals[[
    "Deal_ID", "Company Name", "Project Name", 
    "2014A EBITDA", "2015A EBITDA", "2016A EBITDA", "2017A/E EBITDA", "2018E EBITDA"
]]

#Deals table without the historical EBITDA columns, which live in historical_financial_data_df
business_deals_df = business_deals[[
    'Deal_ID', 'Company Name', 'Project Name', 'Date Added', 'Investment Bank', 'Banker',
    'Sourcing', 'Transaction Type', 'LTM Revenue', 'LTM EBITDA', 'Vertical', 'Sub Vertical',
    'Enterprise Value', 'Est. Equity Investment', 'Status', 'Current Owner', 'Business Description',
    'Lead MD', 'Notes'
]].copy()

#Include missing columns from CRHP data. Will remain empty at the moment, but would normally work with client to find a way to populate
business_deals_df['Banker Email'] = pd.Series(pd.NA, dtype='string')
business_deals_df['Banker Phone Number'] = pd.Series(pd.NA, dtype='string')
business_deals_df['Portfolio Company Status'] = pd.Series(pd.NA, dtype='string')
business_deals_df['Active Stage'] = pd.Series(pd.NA, dtype='string')
business_deals_df['Passed Rationale'] = pd.Series(pd.NA, dtype='string')

#Create empty field for proper column references after concatenation
consumer_retail_health_deals['Deal_ID'] = pd.Series(pd.NA, dtype='string')

#Copy to preserve original cleaned dataframe
consumer_deals_df = consumer_retail_health_deals[[
    'Deal_ID', 'Company Name', 'Project Name', 'Date Added', 'Investment Bank', 'Banker', 'Banker Email',
    'Banker Phone Number', 'Sourcing', 'Transaction Type', 'LTM Revenue', 'LTM EBITDA', 'Vertical',
    'Sub Vertical', 'Enterprise Value', 'Est. Equity Investment', 'Status', 'Portfolio Company Status',
    'Active Stage', 'Passed Rationale', 'Current Owner', 'Business Description', 'Lead MD', 'Notes'
]].copy()

#Concatenate the two
deals_df = pd.concat([business_deals_df, consumer_deals_df], ignore_index=True)


deals_df = deals_df.reset_index(drop=True)

# Fill missing Deal_IDs
deals_df['Deal_ID'] = deals_df.apply(
    lambda row: row['Deal_ID'] if pd.notna(row['Deal_ID']) else f"D{row.name+1:04d}",
    axis=1
)

display(deals_df)


Unnamed: 0,Deal_ID,Company Name,Project Name,Date Added,Investment Bank,Banker,Sourcing,Transaction Type,LTM Revenue,LTM EBITDA,...,Status,Current Owner,Business Description,Lead MD,Notes,Banker Email,Banker Phone Number,Portfolio Company Status,Active Stage,Passed Rationale
0,D0001,Shermco,,2018-02-02,Harris Williams,,Auction,Sponsor to Sponsor,,,...,Active,Oaktree,"Electrical testing, maintenance, and commissio...",Jeannie Blackwood,,,,,,
1,D0002,Kastle Systems,,2018-02-02,,,Trusted Netwok,Sponsor to Sponsor,,,...,Active,Venturehouse,"Provider of comprehensive, turnkey security so...",Andrew Mah,,,,,,
2,D0003,CLEAResult,,2018-02-02,,,Trusted Netwok,Sponsor to Sponsor,,,...,Active,General Atlantic,Provider of energy efficiency and demand manag...,Kripa Shah,,,,,,
3,D0004,PLH,,2018-02-02,Barclays,,Auction,Sponsor to Sponsor,,,...,Active,Energy Capital Partners,Specialty contractor serving the electric powe...,Russ Barner,,,,,,
4,D0005,BBB Industries,,2018-02-02,"Baird, Jefferies",,Auction,Sponsor to Sponsor,,,...,Active,Pamplona,Provider of remanufactured replacement parts t...,Matthew Kordonowy,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
314,D0315,Schweiger Dermatology,,2018-01-17,,,Trusted Network,Sponsor to Sponsor,,12.5,...,Passed/Dead,LLR Capital / Founders,Roll-up of dermatology practices / owned by LL...,Andrew Mah,,,,,New deal,On hold
315,D0316,Firebirds,,2017-12-01,North Point Advisors,,,,,19.3,...,Passed/Dead,,Owner and operator of 45 Firebirds branded res...,Kripa Shah,,,,,,
316,D0317,Pacon,,2017-12-01,Baird,,,,,35.0,...,Passed/Dead,,Producer and marketer of arts and crafts products,Russ Barner,,,,,,
317,D0318,Potpourri Group,,2017-12-01,Lincoln International,,,,,36.0,...,Passed/Dead,,Direct-to-consumer marketer of women's apparel...,Matthew Kordonowy,,,,,,
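
The fill step above derives each missing `Deal_ID` from the row position (`row.name + 1`). That works here because the existing IDs line up with the row order, but it could produce a duplicate if an existing ID ever differed from its position. A minimal sketch of an alternative, which continues numbering after the highest existing ID, is shown below. The helper name `fill_missing_ids` and the toy data are assumptions for illustration and are not part of `functions.py`.

```python
import re

import pandas as pd


def fill_missing_ids(df, id_col="Deal_ID", prefix="D", width=4):
    """Fill missing IDs, numbering on from the highest existing ID (hypothetical helper)."""
    out = df.copy()
    existing = out[id_col].dropna().astype(str)
    # Pull the numeric part out of IDs shaped like D0001
    nums = existing.str.extract(rf"^{re.escape(prefix)}(\d+)$", expand=False).dropna().astype(int)
    next_num = int(nums.max()) + 1 if len(nums) else 1
    missing = out[id_col].isna()
    new_ids = [f"{prefix}{n:0{width}d}" for n in range(next_num, next_num + int(missing.sum()))]
    out[id_col] = out[id_col].astype("object")
    out.loc[missing, id_col] = new_ids
    return out


# Toy data: the existing IDs deliberately do not match row position
toy = pd.DataFrame({
    "Deal_ID": pd.Series(["D0001", "D0003", pd.NA, pd.NA], dtype="string"),
    "Company Name": ["A", "B", "C", "D"],
})
filled = fill_missing_ids(toy)
print(filled["Deal_ID"].tolist())
```

With this toy input, a position-based fill would assign `D0003` to the third row and collide with the second; numbering on from the maximum avoids that.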


In [10]:
#COMPANIES DF

pe_companies_subset = pe_companies_df[[
    'Company Name',
    'Website',
    'AUM(Mns)',
    'Sectors',
    'Sample Portfolio Companies',
    'Priority',
    'Comments'
]].copy()
 
deals_companies_subset = deals_df[[
    'Company Name',
    'Business Description',
    'Current Owner'
]].copy()

# Drop duplicates because same Company Name might appear in multiple deals
deals_companies_subset = deals_companies_subset.drop_duplicates(subset='Company Name')

#In production, fields such as business description, current owner, website, sectors, and sample portfolio companies would be enriched via an API call
companies_df = pd.merge(
    deals_companies_subset,
    pe_companies_subset,
    how = "outer",
    on = "Company Name"
)

#Clean up nulls
companies_df = modernize_nans(companies_df)

#Clean up after join
companies_df = companies_df.dropna(how = "all")

#Reset index
companies_df = companies_df.reset_index(drop=True)

#Create Company ID
companies_df['Company_ID'] = companies_df.index.map(lambda x: f"CO{x+1:04d}")

#Moving to first column
cols = ['Company_ID'] + [col for col in companies_df.columns if col != 'Company_ID']
companies_df = companies_df[cols]

display(companies_df)


Unnamed: 0,Company_ID,Company Name,Business Description,Current Owner,Website,AUM(Mns),Sectors,Sample Portfolio Companies,Priority,Comments
0,CO0001,5-Hour Energy,Producer of liquid energy shots,,,,,,,
1,CO0002,A Place for Mom,Senior care referral services,Silverlake / GA,,,,,,
2,CO0003,ABC Billing,Software and billing provider for the health a...,Thoma Bravo,,,,,,
3,CO0004,ACG & PRP,"Largest IHOP franchisee, currently operating 1...",,,,,,,
4,CO0005,AEA Investors LP,,,www.aeainvestors.com,10000.0,"Consumer products, Industrial","Traeger (Current), Barnet (Cosmetic)",,
...,...,...,...,...,...,...,...,...,...,...
333,CO0334,Zoë's Kitchen,Operator of over 200 owned or franchised fast-...,Public,,,,,,
334,CO0335,iCracked Inc.,Franchisor of Checkers and Rally's restaurants...,,,,,,,
335,CO0336,iHerb,"Pure play online retailer of VMS, natural/orga...",,,,,,,
336,CO0337,littleBits,Supplier of baked goods to quick service resta...,,,,,,,
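
The outer merge above joins on the raw `Company Name`, so names that differ only in case or surrounding whitespace would land on separate rows. A rough sketch of one way to check this, assuming a normalized join key and pandas' `indicator=True` option, follows. The column `_key`, the helper `normalize_name`, and the toy frames are illustrative assumptions.

```python
import pandas as pd


def normalize_name(names):
    """Build a case- and whitespace-insensitive join key (hypothetical helper)."""
    return names.str.strip().str.casefold()


# Toy stand-ins for deals_companies_subset and pe_companies_subset
deals_side = pd.DataFrame({
    "Company Name": ["Alpha Co", "beta partners "],
    "Current Owner": ["Owner A", None],
})
pe_side = pd.DataFrame({
    "Company Name": ["Beta Partners", "Gamma Capital"],
    "Website": ["www.beta.example", None],
})

left = deals_side.assign(_key=normalize_name(deals_side["Company Name"]))
right = pe_side.assign(_key=normalize_name(pe_side["Company Name"]))

# indicator=True adds a _merge column: left_only, right_only, or both
merged = left.merge(right, on="_key", how="outer",
                    suffixes=("_deals", "_pe"), indicator=True)
counts = merged["_merge"].value_counts()
print(counts)
```

The `_merge` counts give a quick view of how many companies came from only one source and how many matched across both.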


In [11]:
#Make a copy
contacts_df = contacts_df.copy()

#Strip leading/trailing spaces (DataFrame.map replaces the deprecated applymap in pandas 2.1+)
contacts_df = contacts_df.map(lambda x: x.strip() if isinstance(x, str) else x)

#Clean Name (Title Case it)
contacts_df['Name'] = contacts_df['Name'].apply(lambda x: x.title() if pd.notna(x) else x)

#Clean Email (lowercase)
contacts_df['E-mail'] = contacts_df['E-mail'].apply(lambda x: x.lower().strip() if pd.notna(x) else x)

#Clean Phone numbers (remove non-numeric)
contacts_df['Phone'] = contacts_df['Phone'].apply(clean_phone)
contacts_df['Secondary Phone'] = contacts_df['Secondary Phone'].apply(clean_phone)

#Clean Birthday (Format to MM,DD,YYYY)
contacts_df['Birthday'] = pd.to_datetime(contacts_df['Birthday'], errors='coerce')
contacts_df['Birthday'] = contacts_df['Birthday'].dt.strftime('%m,%d,%Y')

#Fill missing values in optional fields with defaults
contacts_df['Preferred Contact Method'] = contacts_df['Preferred Contact Method'].fillna('Email')
contacts_df['Tier'] = contacts_df['Tier'].fillna('Standard')

#Reset index first
contacts_df = contacts_df.reset_index(drop=True)

#Create Contact_ID
contacts_df['Contact_ID'] = contacts_df.index.map(lambda x: f"C{x+1:04d}")

#Move Contact_ID to first column for cleanliness
cols = ['Contact_ID'] + [col for col in contacts_df.columns if col != 'Contact_ID']
contacts_df = contacts_df[cols]

contacts_df = contacts_df.rename(columns={"E-mail": "Email"})

display(contacts_df)


Unnamed: 0,Contact_ID,Firm,Name,Title,Group,Sub-Vertical,Email,Phone,Secondary Phone,City,Birthday,Coverage Person,Preferred Contact Method,Tier
0,C0001,Harris Williams,Robert Baltimore,Managing Director,Business Services,Business Services,bbaltimore@harriswilliams.com,8046480072,,"Richmond, VA",02251966,Hannah Jumper,Email,1
1,C0002,Harris Williams,Brian Lucas,Managing Director,Business Services,Business Services,blucas@harriswilliams.com,8046480072,,"Richmond, VA",09031953,Kripa Shah,Business Phone,1
2,C0003,Harris Williams,Luke Semple,Managing Director,Business Services,Business Services,lsemple@harriswilliams.com,8046480072,,"Richmond, VA",03271962,Emily Royal,Cell Phone,1
3,C0004,Harris Williams,Drew Spitzer,Managing Director,Business Services,Business Services,aspitzer@harriswilliams.com,8046480072,,"Richmond, VA",04281964,Russ Barner,Business Phone,1
4,C0005,Harris Williams,Derek Lewis,Managing Director,Business Services,Business Services,dlewis@harriswilliams.com,8046480072,,"Richmond, VA",04241971,Daniel Ding,Cell Phone,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
306,C0307,Cowen,Kevin Manning,"Managing Director, Head of Diversified Industr...",Industrials,Industrial & Environmental Services,kevin.manning@cowen.com,3125772228,7733046721,"Chicago, IL",09101980,Emily Royal,Cell Phone,2
307,C0308,Petsky Prunier,Sanjay Chadda,Managing Director & Partner,Marketing Services,Marketing Services,schadda@petskyprunier.com,2128426022,,"New York, NY",09051980,Kripa Shah,Email,2
308,C0309,Petsky Prunier,Marc Flor,Director,Marketing Services,Marketing Services,mflor@petskyprunier.com,2128426034,,"New York, NY",04151950,Hannah Jumper,Email,2
309,C0310,AdMedia,Oliver Schweitzer,Managing Director,Marketing Services,Marketing Services,oschweitzer@admediapartners.com,2127591870,,"New York, NY",04011978,Jeannie Blackwood,Cell Phone,2
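
`clean_phone` is imported from `functions.py` and is not shown in this notebook; the cell comment only says it removes non-numeric characters. The sketch below is a hypothetical stand-in, not the actual implementation. It assumes digits-only output, `None` for empty results, and removal of a leading US country code on 11-digit numbers.

```python
import math
import re


def clean_phone_sketch(value):
    """Hypothetical stand-in for clean_phone: keep digits only, None if nothing is left."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None
        # Excel often reads phone numbers as floats, e.g. 8046480072.0
        if value.is_integer():
            value = int(value)
    digits = re.sub(r"\D", "", str(value))
    # Assumption: an 11-digit number starting with 1 carries a US country code
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits or None


examples = ["(804) 648-0072", "+1 804.648.0072", 8046480072.0, "n/a", None]
cleaned = [clean_phone_sketch(v) for v in examples]
print(cleaned)
```

Returning `None` instead of an empty string keeps missing phone numbers as nulls in the final table.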


In [12]:
#Create marketing_participants_df
marketing_participants_df = events_df.rename(columns={
    'Name': 'Contact Name',
    'E-mail': 'Email',
    'Event': 'Event Name'
}).copy()

#Reset index
marketing_participants_df = marketing_participants_df.reset_index(drop=True)

#Create Participant_ID
marketing_participants_df['Participant_ID'] = marketing_participants_df.index.map(lambda x: f"M{x+1:04d}")

#Move ID to first column for cleanliness
cols = ['Participant_ID'] + [col for col in marketing_participants_df.columns if col != 'Participant_ID']
marketing_participants_df = marketing_participants_df[cols]

display(marketing_participants_df)

Unnamed: 0,Participant_ID,Contact Name,Email,Attendee Status,Event Name
0,M0001,Rob Baltimore,BBaltimore@harriswilliams.com,RSVP'd,Leaders and Partners Dinner
1,M0002,Brian Lucas,blucas@harriswilliams.com,Declined,Leaders and Partners Dinner
2,M0003,Luke Semple,lsemple@harriswilliams.com,Checked In,Leaders and Partners Dinner
3,M0004,Andrew Spitzer,aspitzer@harriswilliams.com,No Show,Leaders and Partners Dinner
4,M0005,Derek Lewis,dlewis@harriswilliams.com,Declined,Leaders and Partners Dinner
...,...,...,...,...,...
105,M0106,Greg Urban,gregory.urban@ubs.com,Checked In,2019 Market Re-Cap
106,M0107,Aftab Shahsingh,aftab.shahsingh@ubs.com,Checked In,2019 Market Re-Cap
107,M0108,Brendan Ryan,brendan.ryan@raymondjames.com,Checked In,2019 Market Re-Cap
108,M0109,Garrett DeNinno,garrett.deninno@raymondjames.com,Checked In,2019 Market Re-Cap


In [14]:
#Lowercase and strip emails
contacts_df['Email'] = contacts_df['Email'].str.lower().str.strip()
marketing_participants_df['Email'] = marketing_participants_df['Email'].str.lower().str.strip()

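#Drop any existing Contact_ID columns so re-running this cell does not create Contact_ID_x / Contact_ID_y
columns_to_drop = [col for col in marketing_participants_df.columns if 'Contact_ID' in col]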
if columns_to_drop:
    marketing_participants_df = marketing_participants_df.drop(columns=columns_to_drop)

#Left-join Contact_ID onto marketing_participants_df, matching on Email
marketing_participants_df = marketing_participants_df.merge(
    contacts_df[['Contact_ID', 'Email']],
    on='Email',
    how='left'
)

#Reorder columns cleanly
cols = ['Participant_ID', 'Contact_ID'] + [col for col in marketing_participants_df.columns if col not in ['Participant_ID', 'Contact_ID']]
marketing_participants_df = marketing_participants_df[cols]

#Reset index
marketing_participants_df = marketing_participants_df.reset_index(drop=True)

display(marketing_participants_df)


Unnamed: 0,Participant_ID,Contact_ID,Contact Name,Email,Attendee Status,Event Name
0,M0001,C0001,Rob Baltimore,bbaltimore@harriswilliams.com,RSVP'd,Leaders and Partners Dinner
1,M0002,C0002,Brian Lucas,blucas@harriswilliams.com,Declined,Leaders and Partners Dinner
2,M0003,C0003,Luke Semple,lsemple@harriswilliams.com,Checked In,Leaders and Partners Dinner
3,M0004,C0004,Andrew Spitzer,aspitzer@harriswilliams.com,No Show,Leaders and Partners Dinner
4,M0005,C0005,Derek Lewis,dlewis@harriswilliams.com,Declined,Leaders and Partners Dinner
...,...,...,...,...,...,...
105,M0106,C0058,Greg Urban,gregory.urban@ubs.com,Checked In,2019 Market Re-Cap
106,M0107,C0059,Aftab Shahsingh,aftab.shahsingh@ubs.com,Checked In,2019 Market Re-Cap
107,M0108,C0060,Brendan Ryan,brendan.ryan@raymondjames.com,Checked In,2019 Market Re-Cap
108,M0109,C0061,Garrett DeNinno,garrett.deninno@raymondjames.com,Checked In,2019 Market Re-Cap
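
The left merge on `Email` silently leaves `Contact_ID` empty for participants without a matching contact, and it would duplicate participant rows if two contacts shared an email. A small sketch of how both cases could be surfaced, using the `validate` and `indicator` options of `DataFrame.merge` on toy data (the frames below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for contacts_df and marketing_participants_df
contacts = pd.DataFrame({
    "Contact_ID": ["C0001", "C0002"],
    "Email": ["a@example.com", "b@example.com"],
})
participants = pd.DataFrame({
    "Participant_ID": ["M0001", "M0002", "M0003"],
    "Email": ["a@example.com", "b@example.com", "z@example.com"],
})

# validate="many_to_one" raises MergeError if an Email appears twice in contacts
linked = participants.merge(
    contacts, on="Email", how="left", validate="many_to_one", indicator=True
)

# Participants with no matching contact
unmatched = linked.loc[linked["_merge"] == "left_only", ["Participant_ID", "Email"]]
linked = linked.drop(columns="_merge")
print(unmatched)
```

Reviewing `unmatched` before export shows which participants still need a contact record.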


In [15]:
# Save final outputs
deals_df.to_excel('final_deals.xlsx', index=False)
historical_financial_data_df.to_excel('final_financial_data.xlsx', index=False)
companies_df.to_excel('final_companies.xlsx', index=False)
contacts_df.to_excel('final_contacts.xlsx', index=False)
marketing_participants_df.to_excel('final_marketing_participants.xlsx', index=False)

print("All files saved successfully!")

All files saved successfully!
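
Since the project overview lists referential integrity as a goal, a check along the following lines could run before the export. This is a sketch under assumptions: the helper names `key_problems` and `orphan_keys` are hypothetical, and the toy frames only mimic the shape of `contacts_df` and `marketing_participants_df`.

```python
import pandas as pd


def key_problems(df, key):
    """List problems with a primary-key column: missing or duplicated values."""
    problems = []
    n_missing = int(df[key].isna().sum())
    if n_missing:
        problems.append(f"{key}: {n_missing} missing value(s)")
    dupes = df.loc[df[key].duplicated(keep=False) & df[key].notna(), key].unique()
    if len(dupes):
        problems.append(f"{key}: duplicated value(s) {sorted(dupes)}")
    return problems


def orphan_keys(child, parent, key):
    """Return non-null child key values that do not exist in the parent table."""
    return sorted(set(child[key].dropna()) - set(parent[key].dropna()))


# Toy data with one duplicated key and one orphaned reference
contacts = pd.DataFrame({"Contact_ID": ["C0001", "C0002", "C0002"]})
participants = pd.DataFrame({"Contact_ID": ["C0001", "C0009", None]})

pk_issues = key_problems(contacts, "Contact_ID")
orphans = orphan_keys(participants, contacts, "Contact_ID")
print(pk_issues)
print(orphans)
```

The same two helpers could be applied to `Deal_ID` between `deals_df` and `historical_financial_data_df`.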
