# Decoding Lair East Labs: A Blueprint for Future Investment Decisions
Hello,
This project was initiated by Akshay Shivdasani, an LEL Venture Fellow, in the summer of 2023. I see this script as the starting point of a project with two goals: helping Lair East Labs streamline the pitching process using Machine Learning, and evaluating previous investment decisions to provide guidance for future Venture Fellows.

<a id="0"></a> <br>
 ## Table of Contents  
1. [Section 0 - Install and Load Necessary Libraries](#1)
2. [Section 1 - Preprocessing Data](#2)
3. [Section 2 - Obtaining Updated Funding Data using Webscraping](#3)
4. [Section 3 - Evaluating the Success of all Venture Fellow Pitches](#4)
5. [Section 4 - List of Startups Pitched which were Successful](#5)
6. [Section 5 - An ML Algorithm to Analyze Future Startups Being Added to Pitch Deck](#6)
7. [Section 6 - Use Model to Analyze a Startup](#7)

<a id="1"></a> <br>
## Section 0 - Install and Load Necessary Libraries/Functions
To run the code in this script, you'll need a recent version of Python 3 and pip installed. The links below cover installation on macOS:

- pip: https://www.geeksforgeeks.org/how-to-install-pip-in-macos/
- Python 3: https://www.python.org/downloads/macos/

Additionally, this script was created using Visual Studio Code.

To download Visual Studio Code, visit the following link: https://code.visualstudio.com/download

In [1]:
#Download Necessary Packages
#(time, re, sys, json and statistics ship with Python and do not need to be installed)
!pip install pandas
!pip install numpy
!pip install scipy
!pip install matplotlib
!pip install seaborn
!pip install nltk
!pip install requests
!pip install bs4
!pip install openpyxl
!pip install selenium
!pip install tqdm
!pip install contractions
!pip install scikit-learn
!pip install xgboost
!pip install tensorflow
!pip install babel

After downloading these packages, we must load the following packages to execute our code.

In [2]:
#Load Necessary Packages
import pandas as pd
import numpy as np
import seaborn as sns
import requests
import openpyxl
import time
import re
import contractions
import sys
import json
from tqdm import tqdm
from bs4 import BeautifulSoup
from scipy.stats import norm
import statistics

import matplotlib.pyplot as plt
from matplotlib import rc
import matplotlib.ticker as ticker
from matplotlib.font_manager import FontProperties
import matplotlib.patheffects as path_effects

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

import xgboost as xgb
from xgboost import XGBClassifier

plt.style.use('ggplot')

In [3]:
#Required Custom Functions to Load

#Format company names for CBInsights URL format
def add_dashes(company):
    try:
        company = company.split('\n')[0].split('\t')[0].lower().replace(' ', '-')
        return company
    except:
        return company

#Function which unformats number (ex. $7.6M into 7600000)
def convert_to_number(value_str):
    if isinstance(value_str, str) and value_str != 'None':
        # Keep only digits, the decimal point and the K/M suffixes (drops '$', '+', ',' and spaces)
        value_str = re.sub(r'[^0-9MK.km]', '', value_str)

        multiplier = 1
        if 'M' in value_str or 'm' in value_str:
            multiplier = 1000000
        elif 'K' in value_str or 'k' in value_str:
            multiplier = 1000

        value_str = re.sub(r'[KkMm]', '', value_str)

        # Convert to float to handle decimal points and then multiply
        try:
            return int(float(value_str) * multiplier)
        except ValueError:
            # Nothing numeric was left after cleaning (e.g. an empty string)
            return None
    else:
        return None

def URL_from_Yahoo(company_name):
    driver.get('https://www.yahoo.com/')

    # Find and interact with the Yahoo search bar
    search_box = driver.find_element(By.NAME, 'p')  # Yahoo search input field by name
    search_box.send_keys('CBinsights ' + add_dashes(str(company_name)))
    search_box.submit()
    
    # Parse the Yahoo search results to find CBInsights URLs.
    # The results page may still be loading, so the links are re-queried on every retry.
    cbinsights_url = None
    max_retries = 3  # You can adjust the number of retries as needed
    retry_count = 0

    while retry_count < max_retries:
        # Loop through search results and find the first non-ad CBInsights URL
        search_results = driver.find_elements(By.CSS_SELECTOR, 'a')
        for result in search_results:
            try:
                link = result.get_attribute('href')
                # Note: the 'ad' check is a plain substring test, so it also skips any URL containing "ad"
                if link and 'cbinsights.com' in link and 'ad' not in link:
                    cbinsights_url = link
                    break
            except Exception:
                # Ignore stale element reference errors and move on to the next link
                pass

        if cbinsights_url:
            break

        # Sleep for a short time before retrying
        time.sleep(1)  # You can adjust the sleep duration as needed
        retry_count += 1
    
    # Navigate to the first non-ad CBInsights URL (if found)
    if cbinsights_url:
        if '/company/' in cbinsights_url and 'cb-insights' not in cbinsights_url:
            return cbinsights_url
        else:
            return None
    else:
        return None

def format_currency(amount):
    try: 
        x = "${:,.2f}".format(float(amount))
        return x
    except:
        return "Not Provided"

def bold(text):
    bold_start = '\033[1m'
    bold_end = '\033[0m'
    return bold_start + text + bold_end

class Format:
    end = '\033[0m'
    underline = '\033[4m'

def underline_text(text):
    return(Format.underline + text + Format.end)

def line(num):
    line_of_dashes = '-' * num
    return line_of_dashes

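def get_info(company, col2, df = None, col1 = 'Company Name'):
    # Look up df_final_analysis at call time. A default bound at definition time
    # would keep pointing at an empty placeholder instead of the populated DataFrame.
    if df is None:
        df = df_final_analysis
    index = df[df[col1] == company].index

    if not index.empty:
        return df.at[index[0], col2]
    else: 
        return 'Unknown'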

def filter_outliers(data):
    # Calculate the IQR for both columns
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    # Define a filter to remove outliers
    filter_mask = ((data['Funding'] >= Q1['Funding'] - 1.5 * IQR['Funding']) &
                   (data['Funding'] <= Q3['Funding'] + 1.5 * IQR['Funding']) &
                   (data['Multiple'] >= Q1['Multiple'] - 1.5 * IQR['Multiple']) &
                   (data['Multiple'] <= Q3['Multiple'] + 1.5 * IQR['Multiple']))

    # Apply the filter to the DataFrame
    data_filtered = data[filter_mask]

    # Extract the filtered data for plotting
    filtered_funding = data_filtered['Funding'].tolist()
    filtered_multiple = data_filtered['Multiple'].tolist()

    return filtered_funding, filtered_multiple

def format_y_tick(value, _):
    return f"${value/1e6:.1f}M"

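# Add data labels above the bars with formatted currency and adjust fontsize.
# Note: this relies on the module-level `ax` of the current figure and on babel's
# format_currency(value, currency), which is imported before the bar chart is drawn
# and replaces the one-argument format_currency defined above.
def add_currency_labels(bars, data):
    for bar, value in zip(bars, data):
        formatted_value = format_currency(value, 'USD')  # Change 'USD' to your desired currency code
        ax.annotate(formatted_value, xy=(bar.get_x() + bar.get_width() / 2, value),
                    xytext=(0, 3), textcoords='offset points', ha='center', fontsize=8)  # Adjust the fontsize as needed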
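
For reference, the 1.5 × IQR rule that `filter_outliers` applies can be seen on a toy example. This is a minimal sketch with made-up numbers: the `sample` values and the `iqr_bounds` helper are illustrative assumptions and are not part of the LEL dataset or the notebook's own functions.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) bounds of the k * IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up funding figures; the last value is an obvious outlier.
sample = pd.Series([100_000, 150_000, 200_000, 250_000, 5_000_000])
lower, upper = iqr_bounds(sample)

# Keep only the values inside the bounds.
kept = sample[(sample >= lower) & (sample <= upper)]
print(kept.tolist())
```

`filter_outliers` applies the same bounds to the 'Funding' and 'Multiple' columns at once and keeps a row only if both values fall inside their respective ranges.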

<a id="2"></a> <br>
## Section 1 - Preprocessing Data
Please note that this section is dedicated to creating an aggregated dataset of all past pitches before summer 2023. This script will provide this dataset, so you can skip this step if you prefer to use the pre-aggregated pitch dataset.

In [4]:
#Set directory
DIR = '/Users/akshay/Desktop/LEL Data Project/Pitch_List_Files'

In [5]:
#Read in Data
#REDACTED

<a id="3"></a> <br>
## Section 2 - Obtaining Updated Funding Data using Webscraping
The following code scrapes CBInsights for the industry, funding stage, and total funding amount of each startup on the pitch list. It can be refined further if someone finds a reliable way to extract data from platforms like Pitchbook or Crunchbase, which are typically harder to scrape (or if someone pays for API access).
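
As a rough illustration of how the funding amount is pulled out of a page, the sketch below runs the same style of regex used in the scraping loop against a small string. The `sample_html` fragment is a hypothetical stand-in; the real CBInsights markup may look different and can change at any time.

```python
import re

# Hypothetical page fragment; real CBInsights markup may differ.
sample_html = "<p>Acme has raised $7.6M over 2 rounds.</p>"

# The word "raised", one whitespace character, then a dollar amount
# with an optional M or K suffix.
funding_pattern = r"raised\s\$[\d,\.]+[MK]?"

match = re.search(funding_pattern, sample_html)
# Strip the leading word so only the amount (e.g. "$7.6M") remains.
amount = match.group(0).replace("raised ", "") if match else None
print(amount)

# A page without the phrase simply yields no match.
no_match = re.search(funding_pattern, "<p>No funding data</p>")
print(no_match)
```

The extracted string can then be passed to `convert_to_number` to obtain a plain integer.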

### Read if you'd like to add your own data:
If you have your own pitch dataset to add to this aggregated dataset, it must adhere to the following column structure:

- 'Ranking'
- 'Week'
- 'Lead'
- 'Company Name'
- 'Brief Introduction'
- 'Company Type'
- 'Industry/Sector'
- 'Location'
- 'Year Founded'
- 'Founder(s)'
- 'School (0-5)'
- 'Corporate Experience/Top Organization (0-5)'
- 'Startup Experience (0-5)'
- 'Tech Background (0-5)'
- 'Product Stage (0-5)'
- 'User/Client Stage (0-5)'
- 'Revenue (0-5)'
- 'SUM'
- 'Last Round Raised'
- 'Total Money Raised'
- 'Investors/Funding Source'
- 'TAM'
- 'Competitors'
- 'Comments/Additional Notes'

Lair East Labs uses the following ranking system for pitched startups: Yellow (potentially promising), Green (very promising), or no highlight (likely not promising). To integrate this ranking system into your dataset, manually create a column named "Ranking" at the beginning. Assign one of the following scores to each startup:

- 0 (no highlight)
- 1 (yellow highlight)
- 2 (green highlight)
- 3 (LEL invested in it)
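
If the highlight colours have been noted in a column of their own, the scores above can be assigned programmatically. This is only a sketch: the 'Highlight' helper column, the `HIGHLIGHT_TO_RANKING` mapping, and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical mapping from the spreadsheet highlight to the Ranking score.
HIGHLIGHT_TO_RANKING = {"none": 0, "yellow": 1, "green": 2, "invested": 3}

# Made-up rows; 'Highlight' is an illustrative helper column,
# not part of the required column structure.
pitches = pd.DataFrame({
    "Company Name": ["Alpha", "Beta", "Gamma"],
    "Highlight": ["green", "none", "yellow"],
})

# Insert 'Ranking' as the first column, then drop the helper column.
pitches.insert(0, "Ranking", pitches["Highlight"].map(HIGHLIGHT_TO_RANKING))
pitches = pitches.drop(columns="Highlight")
print(pitches)
```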

To merge your dataset with this column into the aggregated dataset, follow these steps:

1. Load your dataset: yourdataset = pd.read_csv('file path')
2. Preprocess it: yourdataset = yourdataset.fillna(0)
3. Load the aggregated dataset: aggregated = pd.read_csv('file path of aggregated dataset')
4. Combine your dataset with the aggregated dataset: df_final = pd.concat([aggregated, yourdataset])
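
Before merging, it can help to confirm that the new dataset really has the expected columns. The sketch below wraps the steps above in a small helper; `merge_pitches`, the abbreviated `REQUIRED_COLUMNS` list, and the sample frames are assumptions for illustration, and `ignore_index=True` is an optional addition that the steps above do not use.

```python
import pandas as pd

# Abbreviated for the sketch; the full list is the column structure given above.
REQUIRED_COLUMNS = ["Ranking", "Week", "Lead", "Company Name"]

def merge_pitches(aggregated, new_pitches, required=REQUIRED_COLUMNS):
    """Check that the new dataset has the required columns, then append it."""
    missing = [col for col in required if col not in new_pitches.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    new_pitches = new_pitches.fillna(0)
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return pd.concat([aggregated, new_pitches], ignore_index=True)

# Made-up data standing in for the real spreadsheets.
aggregated = pd.DataFrame({"Ranking": [2], "Week": [1], "Lead": ["A"], "Company Name": ["Alpha"]})
yourdataset = pd.DataFrame({"Ranking": [1], "Week": [None], "Lead": ["B"], "Company Name": ["Beta"]})

df_final = merge_pitches(aggregated, yourdataset)
print(df_final)
```

A dataset with a missing column raises a `ValueError` naming that column instead of silently producing a frame full of NaN values.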

In [6]:
#yourdataset = pd.read_csv('file path')
#yourdataset = yourdataset.fillna(0)
#aggregated = pd.read_csv('file path of aggregated dataset')
#df_final = pd.concat([aggregated, yourdataset])

In [7]:
#Set directory
DIR = '/Users/akshay/Desktop/LEL Data Project/Pitch_List_Files'

#Read in Data
df_final = pd.read_excel(DIR + '/Complete_Pitch_Dataset_from_2020_to_2023_Summer.xlsx')

#Format dataset to have a column for the amount raised in total current data and current funding stage
df_final['Total Raised'] = 0
df_final['Stage'] = 0

#Create a copy of this df, fix indexes, format columns to be ready for webscraping.
df_final_copy = df_final.copy()
df_final_copy.set_index('index', inplace=True)
df_final_copy.reset_index(inplace=True)

#Make sure every company name is a string (assigning to `row` inside iterrows would not change the DataFrame)
df_final_copy['Company Name'] = df_final_copy['Company Name'].astype(str)

The following code opens a Chrome window to scrape from. Keep the window open while the scraping loop runs; minimizing it is fine.

In [8]:
#Open a single Chrome window to webscrape from.
#To run without a visible window, uncomment the headless option below.
options = Options()
#options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

In [6]:
#Loop through every company in the dataset
checked_URLS = []
rowstodrop = []

for index, row in df_final_copy.iterrows():
    
    name = str(add_dashes(row['Company Name']))
    url = None  #Reset so a failed search cannot reuse the previous company's URL
    try:
        url = URL_from_Yahoo(name)
    except Exception:
        pass
        
    if not url:
        #Obtain most likely CBInsights URL for each company
        url = "https://www.cbinsights.com/company/" + name + "/financials"
    elif '/financials' not in url:
         url += "/financials"
    print(url)
    
    #Search up URL
    try:
        if url in checked_URLS:
            rowstodrop.append(index)
            continue
        else:
            driver.get(url)
            checked_URLS.append(url)
    except:
        continue

    #Obtain html from searched URL
    html = driver.page_source

    #Regex patterns to obtain funding amount and funding stage
    pattern1 = r"raised\s\$[\d,\.]+[MK]?"
    pattern2 = r'(?<=<\/strong>was a<strong>)\s*(.*?)(?=\s+(?:for\s+\$[\d,\.]+M)?)'
    match1 = re.search(pattern1, html)
    match2 = re.search(pattern2, html)

    #Strings to print
    total_funding = 'Total Funding Raised: {}'
    funding_stage = 'Funding Stage: {}'
    industry_print = 'Industry: {}'

    #If regex found, add to df
    if match1:
        amount = match1.group(0)
        amount = amount.replace("raised ", "").replace(" ", "")
        print(total_funding.format(amount))
        df_final_copy.at[index, "Total Raised"] = amount
    if match2:
        stage = match2.group(1)
        stage = stage.replace(" ", "")
        if stage == 'Series':
            startnum = match2.start() + 1
            stopnum = match2.start() + 9
            stage = html[startnum:stopnum]
            print(funding_stage.format(stage))
            df_final_copy.at[index, "Stage"] = stage
        else:
            print(funding_stage.format(stage))
            df_final_copy.at[index, "Stage"] = stage
    
    #Check whether the CBInsights page lists the company's industry.
    try:
        industry_start = html.find(',"subindustry":"') + len(',"subindustry":"')
        industry_end = html.find('","idSector')
        industry = html[industry_start:industry_end]
        industry = re.sub(r'\\u[0-9a-fA-F]{4}', lambda x: chr(int(x.group()[2:], 16)), industry)
        if len(industry) < 40 and len(industry) > 5:
            df_final_copy.at[index, 'Industry/Sector'] = industry
            print(industry_print.format(industry))
    except:
        pass

In [10]:
df_final_copy = df_final_copy.drop('index', axis = 1)

In [11]:
df_final_copy.drop(rowstodrop, inplace=True)
df_final_copy.reset_index(inplace = True)

In [12]:
#invested_in_total_raised_keys = REDACTED

#invested_in_stage_keys = REDACTED

invested_in_total_raised_values = [1000000, 100000, None, 180000, None, 160000, 
                                   1000000, 660000, 50000, 6370, 50000, 50000, 470000, 1860000]

invested_in_stage_values = ['Seed', 'Dead']

invested_in_funding = dict(zip(invested_in_total_raised_keys, invested_in_total_raised_values))
invested_in_stage = dict(zip(invested_in_stage_keys, invested_in_stage_values))

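for index, row in df_final_copy.iterrows():
    item = row['Company Name']
    #Write through df.at: assigning to `row` only changes a temporary copy.
    #Funding is stored as a string so that convert_to_number can parse it later.
    if item in invested_in_funding:
        df_final_copy.at[index, 'Total Raised'] = str(invested_in_funding[item])
    
    if item in invested_in_stage:
        df_final_copy.at[index, 'Stage'] = invested_in_stage[item]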

Here are the companies not found in CBInsights. 

This code can be optimized to minimize these failures by using a more comprehensive API like Pitchbook or Crunchbase (paid subscriptions).

In [7]:
#Find failures
failures = []
for index, row in df_final_copy.iterrows():
    if row['Total Raised'] == 'None' or row['Total Raised'] == 0 or row['Total Raised'] == '0':
        failures.append(row['Company Name'])

s = "Number of Companies with Total Funding Info not Found: {}/{}"
numfailures = str(len(failures))
totalstartups = str(df_final_copy.shape[0])
print(s.format(numfailures, totalstartups))

stagefailures = []
for index, row in df_final_copy.iterrows():
    if row['Stage'] == 0 or row['Stage'] == 'None':
        stagefailures.append(row['Company Name'])

s = "Number of Companies with Funding Stage Info not Found: {}/{}"
numfailures = str(len(stagefailures))
print(s.format(numfailures, totalstartups))

s = "Number of Total 'Failures': {}/{}"
resulting_list = list(failures)
resulting_list.extend(x for x in stagefailures if x not in failures)
numfailures = str(len(resulting_list))
print(s.format(numfailures, totalstartups))

In [14]:
# Initialize an empty list to store converted values
numlist = []

# Iterate through the DataFrame rows and convert 'Total Raised' values to strings
for index, row in df_final_copy.iterrows():
    numlist.append(str(convert_to_number(row['Total Raised'])))

# Update the 'Total Raised' column in the DataFrame with the converted values
for index, row in df_final_copy.iterrows():
    df_final_copy.at[index, 'Total Raised'] = numlist[index]

# Initialize another empty list for 'Total Money Raised' conversions
numlist = []

# Iterate through the DataFrame rows, attempt to convert 'Total Money Raised' to strings,
# and handle exceptions by appending 0
for index, row in df_final_copy.iterrows():
    try:
        numlist.append(str(convert_to_number(row['Total Money Raised'])))
    except:
        numlist.append(0)
        continue

# Update the 'Total Money Raised' column in the DataFrame with the converted values
for index, row in df_final_copy.iterrows():
    df_final_copy.at[index, 'Total Money Raised'] = numlist[index]

# Create a copy of the DataFrame for analysis
df_final_analysis = df_final_copy.copy()

# Iterate through the DataFrame rows and drop rows where 'Total Raised' is "None"
for index, row in df_final_analysis.iterrows():
    if row['Total Raised'] == "None":
        df_final_analysis = df_final_analysis.drop(index=index, axis=0)

In [15]:
# Set the 'index' column as the DataFrame index
df_final_analysis = df_final_analysis.set_index('index')

# Reset the DataFrame index
df_final_analysis = df_final_analysis.reset_index(drop=True)

In [8]:
#Remove startups that had already raised $1.5M or more when pitched
counter = 0
s = "Number of startups with $1.5M or more in funding when pitched (filtering these out): {}"
for index, row in df_final_analysis.iterrows():
    if row['Total Money Raised'] != 'None':
        if int(row['Total Money Raised']) >= 1500000:
            df_final_analysis.drop(index, inplace=True)
            counter += 1

print(s.format(counter))

In [17]:
df_final_analysis = df_final_analysis.reset_index(drop=True)

<a id="4"></a> <br>
## Section 3 - Evaluating the Success of all Venture Fellow Pitches
In this section, we evaluate the outcomes of past startup pitches made at Lair East Labs. Our analysis aims to gauge the effectiveness of LEL's investment decisions and identify any missed opportunities for returns. We utilize metrics like funding growth, current funding levels, and funding stage progression to gain insights into investment outcomes and potential undiscovered value.
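
To make the funding growth metric concrete, the sketch below computes a funding multiple (present-day funding divided by funding at pitch time) and sorts it into buckets like those used later in this section. The company names and figures are made up, and `pd.cut` is used here for brevity; it is not necessarily how the cells below do the bucketing.

```python
import pandas as pd

# Made-up figures: funding when pitched vs. funding today.
df = pd.DataFrame({
    "Company Name": ["Alpha", "Beta", "Gamma"],
    "Total Money Raised": [100_000, 500_000, 250_000],   # at pitch time
    "Total Raised": [100_000, 4_000_000, 30_000_000],    # present day
})

# Funding multiple = present-day funding / funding at pitch time.
df["Funding Multiple"] = df["Total Raised"] / df["Total Money Raised"]

# Bin edges are right-inclusive, so a multiple of exactly 1.0 falls in "<=1x".
bins = [0, 1, 5, 10, 20, 50, 100, float("inf")]
labels = ["<=1x", "1-5x", "5-10x", "10-20x", "20-50x", "50-100x", ">100x"]
df["Bucket"] = pd.cut(df["Funding Multiple"], bins=bins, labels=labels)
print(df[["Company Name", "Funding Multiple", "Bucket"]])
```

Startups with no recorded funding at pitch time have no defined multiple, so they need to be dropped or handled separately before this calculation.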

In [9]:
#Pie Chart of Funding Raised
    #Raise1: 0-100k
    #Raise2: 100k-500k
    #Raise3: 500k-1M
    #Raise4: 1M-5M
    #Raise5: 5M-10M
    #Raise6: 10M-20M
    #Raise7: 20M-50M
    #Raise8: 50M-1B (labelled "50M+" in the chart)
    #Raise9: Contains every company
    #Raise10: Values above 1B (not shown in the chart)

raise1 = []
raise2 = []
raise3 = []
raise4 = []
raise5 = []
raise6 = []
raise7 = []
raise8 = []
raise9 = []
raise10 = []

for index, row in df_final_analysis.iterrows():
    raise9.append(row['Company Name'])
    if int(float(row['Total Raised'])) <= 100000:
        raise1.append(row['Company Name'])
    elif int(float(row['Total Raised'])) <= 500000: 
        raise2.append(row['Company Name'])
    elif int(float(row['Total Raised'])) <= 1000000:
        raise3.append(row['Company Name'])
    elif int(float(row['Total Raised'])) <= 5000000:
        raise4.append(row['Company Name'])
    elif int(float(row['Total Raised'])) <= 10000000:
        raise5.append(row['Company Name'])
    elif int(float(row['Total Raised'])) <= 20000000:
        raise6.append(row['Company Name'])
    elif int(float(row['Total Raised'])) <= 50000000:
        raise7.append(row['Company Name'])
    elif int(float(row['Total Raised'])) <= 1000000000:
        raise8.append(row['Company Name'])
    else:
        raise10.append(row['Company Name'])

labels = ["0-100k", "100k-500k", "500k-1M", "1M-5M", "5M-10M", "10M-20M", "20M-50M", "50M+"]
data = [len(raise1), len(raise2), len(raise3), len(raise4), len(raise5), len(raise6), len(raise7), len(raise8)]

#Font
rc('font', family='serif')

# Define colors for the pie chart segments
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99', '#c2c2f0', '#ffb3e6', '#c2f0c2', '#ff6666']

# Define explode (pull out) for specific segments (here the three largest funding buckets)
explode = (0, 0, 0, 0, 0, 0.05, 0.05, 0.05)

# Create the pie chart with customized attributes
fig, ax = plt.subplots(figsize=(12, 8))
ax.pie(data, labels=labels, autopct='%1.1f%%', startangle=90, colors=colors, explode=explode,
       shadow=True, wedgeprops={'edgecolor': 'gray'})

# Add a title with a white background
ax.set_title("Funding Raised by Companies from Pitch List", fontsize=16, bbox={'facecolor': 'white', 'alpha': 0.8}, y = 1.05)

# Add a legend
plt.legend(labels, loc="best")

# Display the chart
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.show()

In [10]:
tally = {}
for item in df_final_analysis['Stage']:
    if item in tally:
        tally[item] += 1
    else:
        tally[item] = 1

labels = list(tally.keys())
data = []

for value in tally.values():
    data.append(value)

# Define colors for the pie chart segments
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99', '#c2c2f0', '#ffb3e6', '#c2f0c2', '#ff6666']

# Create a list of (label, data) tuples and sort them by data in descending order
labels = sorted(zip(labels, data), key=lambda x: x[1], reverse=True)
labels, data = zip(*labels)

#Font
rc('font', family='serif')

# Create the pie chart with customized attributes
fig, ax = plt.subplots(figsize=(24, 16))
ax.pie(data, labels=labels, autopct='%1.1f%%', startangle=90, colors=colors,
       shadow=True, wedgeprops={'edgecolor': 'gray'})
plt.subplots_adjust(left=0.1, right=0.9)

# Add a title with a white background
ax.set_title("The Funding Stage which Startups from Pitch List are at Present Day", fontsize=16, bbox={'facecolor': 'white', 'alpha': 0.8}, y = 1.05)

# Calculate the total sum of data (for percentage calculation)
total_sum = sum(data)

# Calculate percentages
percentages = [(value / total_sum) * 100 for value in data]

# Add a legend
legend_labels = [f'{label}: {value:.1f}%' for label, value in zip(labels, percentages)]
plt.legend(legend_labels, loc="best")

# Display the chart
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.show()

In [11]:
tally = {}
total_count = 0

# Count occurrences of each item and calculate the total count
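for item in df_final_analysis['Industry/Sector']:
    # Skip missing values (0 placeholders, NaN) and blank strings
    if not isinstance(item, str) or not item.split():
        continue
    # Use the first word of the sector, stripped of punctuation, as the category
    first_part = re.sub(r'[^a-zA-Z0-9]', '', item.split()[0])
    if first_part in tally:
        tally[first_part] += 1
    else:
        tally[first_part] = 1
    total_count += 1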

# Determine the threshold for the "Other" category
threshold = 0.007 * total_count

# Create a list to store data for the pie chart
labels = []
data = []

# Populate labels and data for categories exceeding the threshold
for key, value in tally.items():
    if value >= threshold:
        labels.append(key)
        data.append(value)

# Calculate the total count of items in the "Other" category
other_count = total_count - sum(data)

# Add "Other" category to labels and data
if other_count > 0:
    labels.append("Other")
    data.append(other_count)

# Define colors for the pie chart segments
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99', '#c2c2f0', '#ffb3e6', '#c2f0c2', '#ff6666']

# Create a list of (label, data) tuples and sort them by data in descending order
labels = sorted(zip(labels, data), key=lambda x: x[1], reverse=True)
labels, data = zip(*labels)

#Font
rc('font', family='serif')

# Create the pie chart with customized attributes
fig, ax = plt.subplots(figsize=(24, 16))
ax.pie(data, labels=labels, autopct='%1.1f%%', startangle=90, colors=colors,
       shadow=True, wedgeprops={'edgecolor': 'gray'})

# Add a title with a white background
ax.set_title("The Distribution/Sector of Startups Pitched", fontsize=16, bbox={'facecolor': 'white', 'alpha': 0.8}, y = 1.05)

# Calculate the total sum of data (for percentage calculation)
total_sum = sum(data)

# Calculate percentages
percentages = [(value / total_sum) * 100 for value in data]

# Add a legend
legend_labels = [f'{label}: {value:.1f}%' for label, value in zip(labels, percentages)]
plt.legend(legend_labels, loc="best")

# Display the chart
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.show()

In [12]:
df_final_multiple = df_final_analysis.copy()

for index, row in df_final_multiple.iterrows():
    #Single combined check so a row matching several conditions is only dropped once
    if row['Total Money Raised'] == "None" or row['Total Money Raised'] == 0 or row['Total Raised'] == 0:
        df_final_multiple = df_final_multiple.drop(index=index, axis=0)

df_final_multiple.reset_index(inplace=True)

#Rank startups based on the multiple they gained from original to current funding

df_final_multiple['Funding Multiple'] = 0

for index, row in df_final_multiple.iterrows():
    try:
        #convert_to_number returns None for unparseable values and the division fails on 0;
        #in both cases the multiple is left at 0
        y = float(convert_to_number(row['Total Money Raised']))
        df_final_multiple.at[index, 'Funding Multiple'] = round(float(df_final_multiple.at[index, 'Total Raised']) / y)
    except Exception:
        continue

#Pie Chart of Funding Multiples
    #multipl1: <=1
    #multipl2: 1-5
    #multipl3: 5-10
    #multipl4: 10-20
    #multipl5: 20-50
    #multipl6: 50-100
    #multipl7: 100+
    #multipl8: error
    
multipl1 = []
multipl2 = []
multipl3 = []
multipl4 = []
multipl5 = []
multipl6 = []
multipl7 = []
multipl8 = []

for index, row in df_final_multiple.iterrows():
    if float(row['Funding Multiple']) <= 1:
        multipl1.append(row['Company Name'])
    elif float(row['Funding Multiple']) <= 5:
        multipl2.append(row['Company Name'])
    elif float(row['Funding Multiple']) <= 10:
        multipl3.append(row['Company Name'])
    elif float(row['Funding Multiple']) <= 20:
        multipl4.append(row['Company Name'])
    elif float(row['Funding Multiple']) <= 50:
        multipl5.append(row['Company Name'])
    elif float(row['Funding Multiple']) <= 100:
        multipl6.append(row['Company Name'])
    elif float(row['Funding Multiple']) > 100:
        multipl7.append(row['Company Name'])
    else:
        multipl8.append(row['Company Name'])
    

labels = ["<=1x", "1-5x", "5-10x", "10-20x", "20-50x", "50-100x", ">100x", "error"]
data = [len(multipl1), len(multipl2), len(multipl3), len(multipl4), len(multipl5), len(multipl6), len(multipl7), len(multipl8)]

# Define colors for the pie chart segments
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99', '#c2c2f0', '#ffb3e6', '#c2f0c2', '#ff6666']

# Define explode (pull out) for specific segments (here the 20-50x, 50-100x and >100x segments)
explode = (0, 0, 0, 0, 0.05, 0.05, 0.05, 0)

#Font
rc('font', family='serif')

# Create the pie chart with customized attributes
fig, ax = plt.subplots(figsize=(24, 16))
ax.pie(data, labels=labels, autopct='%1.1f%%', startangle=90, colors=colors, explode=explode,
       shadow=True, wedgeprops={'edgecolor': 'gray'})

# Add a title with a white background
ax.set_title("The Multiple of Startups on the Pitch List between when Originally Pitched and Present Day", fontsize=16, 
             bbox={'facecolor': 'white', 'alpha': 0.8}, y = 1.05)

# Calculate the total sum of data (for percentage calculation)
total_sum = sum(data)

# Calculate percentages
percentages = [(value / total_sum) * 100 for value in data]

# Add a legend
legend_labels = [f'{label}: {value:.1f}%' for label, value in zip(labels, percentages)]
plt.legend(legend_labels, loc="best")

# Display the chart
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.show()

In [22]:
ranking0_funding = []
ranking0_multiple = []

ranking1_funding = []
ranking1_multiple = []

ranking2_funding = []
ranking2_multiple = []

ranking3_funding = []
ranking3_multiple = []

df_final_multiple['Ranking'] = pd.to_numeric(df_final_multiple['Ranking'], errors='coerce')
df_final_analysis['Ranking'] = pd.to_numeric(df_final_analysis['Ranking'], errors='coerce')
df_final_multiple['Total Raised'] = pd.to_numeric(df_final_multiple['Total Raised'], errors='coerce')
df_final_analysis['Total Raised'] = pd.to_numeric(df_final_analysis['Total Raised'], errors='coerce')
df_final_multiple['Funding Multiple'] = pd.to_numeric(df_final_multiple['Funding Multiple'], errors='coerce')

for index, row in df_final_multiple.iterrows():
    if row['Ranking'] == 0:
        ranking0_funding.append(row['Total Raised'])
        ranking0_multiple.append(row['Funding Multiple'])
    elif row['Ranking'] == 1:
        ranking1_funding.append(row['Total Raised'])
        ranking1_multiple.append(row['Funding Multiple'])
    elif row['Ranking'] == 2:
        ranking2_funding.append(row['Total Raised'])
        ranking2_multiple.append(row['Funding Multiple'])
    elif row['Ranking'] == 3:
        ranking3_funding.append(row['Total Raised'])
        ranking3_multiple.append(row['Funding Multiple'])
    else:
        continue

# Build a DataFrame of funding and multiple for ranking 0 (not promising)
ranking0_data = pd.DataFrame({'Funding': ranking0_funding, 'Multiple': ranking0_multiple})

# Filter outliers for ranking 0
filtered_ranking0_funding, filtered_ranking0_multiple = filter_outliers(ranking0_data)

# Build a DataFrame of funding and multiple for ranking 1 (potentially promising)
ranking1_data = pd.DataFrame({'Funding': ranking1_funding, 'Multiple': ranking1_multiple})

# Filter outliers for ranking 1
filtered_ranking1_funding, filtered_ranking1_multiple = filter_outliers(ranking1_data)

# Build a DataFrame of funding and multiple for ranking 2 (very promising)
ranking2_data = pd.DataFrame({'Funding': ranking2_funding, 'Multiple': ranking2_multiple})

# Filter outliers for ranking 2
filtered_ranking2_funding, filtered_ranking2_multiple = filter_outliers(ranking2_data)

# Build a DataFrame of funding and multiple for ranking 3 (invested in)
ranking3_data = pd.DataFrame({'Funding': ranking3_funding, 'Multiple': ranking3_multiple})

# Filter outliers for ranking 3
filtered_ranking3_funding, filtered_ranking3_multiple = filter_outliers(ranking3_data)


nonfiltered_ranking_data = {
    "Not Promising": ranking0_funding,
    "Potentially Promising": ranking1_funding,
    "Very Promising": ranking2_funding,
    "Invested In": ranking3_funding
}

filtered_ranking_data = {
    "Not Promising": filtered_ranking0_funding,
    "Potentially Promising": filtered_ranking1_funding,
    "Very Promising": filtered_ranking2_funding,
    "Invested In": filtered_ranking3_funding
}

nonfiltered_multiple_data = {
    "Not Promising": ranking0_multiple,
    "Potentially Promising": ranking1_multiple,
    "Very Promising": ranking2_multiple,
    "Invested In": ranking3_multiple
}

filtered_multiple_data = {
    "Not Promising": filtered_ranking0_multiple,
    "Potentially Promising": filtered_ranking1_multiple,
    "Very Promising": filtered_ranking2_multiple,
    "Invested In": filtered_ranking3_multiple
}

In [13]:
from statistics import mean
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rc
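# babel's format_currency(number, currency) replaces the one-argument format_currency
# helper defined in Section 0; add_currency_labels relies on the babel version.
from babel.numbers import format_currency  # Make sure you have 'babel' installed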

# Set the font family and size
rc('font', family='serif', size=12)  # Change the size to your desired value

# Calculate average total raised for filtered data
ranking_levels = [0, 1, 2, 3]
average_total_raised = []
average_total_raised.append(mean(filtered_ranking_data['Not Promising']))
average_total_raised.append(mean(filtered_ranking_data['Potentially Promising']))
average_total_raised.append(mean(filtered_ranking_data['Very Promising']))
average_total_raised.append(mean(filtered_ranking_data['Invested In']))

outliers_average_total_raised = []
outliers_average_total_raised.append(mean(nonfiltered_ranking_data['Not Promising']))
outliers_average_total_raised.append(mean(nonfiltered_ranking_data['Potentially Promising']))
outliers_average_total_raised.append(mean(nonfiltered_ranking_data['Very Promising']))
outliers_average_total_raised.append(mean(nonfiltered_ranking_data['Invested In']))

# Create the bar chart with multiple bar plots
fig, ax = plt.subplots(figsize=(10, 6))
width = 0.35  # Width of each bar
x = np.arange(len(ranking_levels))

# Define colors based on whether the bars represent outliers or non-outliers
colors_average_total_raised = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
colors_outliers_average_total_raised = ['#aec7e8', '#ffbb78', '#98df8a', '#ff9896']

bars1 = ax.bar(x - width/2, average_total_raised, width, label='Average Total Raised', color=colors_average_total_raised)
bars2 = ax.bar(x + width/2, outliers_average_total_raised, width, label='Outliers Avg. Total Raised', color=colors_outliers_average_total_raised)

# Add labels and title
ax.set_xlabel("Ranking Levels", fontsize=14)  # Adjust the fontsize as needed
ax.set_title("Average Total Raised by Ranking Levels (Including Outliers)", fontsize=16)  # Adjust the fontsize as needed

# Set x-axis labels and adjust their fontsize
x_labels = ["Not Promising", "Potentially Promising", "Very Promising", "Invested In"]
ax.set_xticks(x)
ax.set_xticklabels(x_labels, fontsize=12)  # Adjust the fontsize as needed

# Remove y-axis labels
ax.set_yticklabels([])

add_currency_labels(bars1, average_total_raised)
add_currency_labels(bars2, outliers_average_total_raised)

# Show the plot with adjustable aspect ratio
plt.gca().set_aspect('auto')

# Display the legend
#ax.legend(fontsize=12)  # Adjust the fontsize as needed

# Display the plot
plt.tight_layout()
plt.show()

In [14]:
filtered_ranking_data = {
    "Not Promising": filtered_ranking0_funding,
    "Potentially Promising": filtered_ranking1_funding,
    "Very Promising": filtered_ranking2_funding,
    "Invested In": filtered_ranking3_funding
}
# Create a figure with a single subplot
fig, ax = plt.subplots(figsize=(10, 8))

# Create violin plots for funding by ranking category
sns.violinplot(data=list(filtered_ranking_data.values()), palette="Set1", inner="quartile", ax=ax)

# Customize the appearance
ax.set_ylabel("Funding", fontsize=14)
ax.set_title("Distribution of Funding by Ranking Category (Outliers Removed)", fontsize=16)
ax.set_xlabel("Ranking Categories", fontsize=14)
ax.set_xticklabels(list(filtered_ranking_data.keys()), fontsize=12)
ax.yaxis.set_major_formatter(ticker.FuncFormatter(format_y_tick))
ax.set_ylim(0, 2.25e7)

bold_font = FontProperties(weight='bold')

medians = [np.median(data) for data in filtered_ranking_data.values()]

# Label medians
highlight_effect = [path_effects.withStroke(linewidth=3, foreground='yellow')]
for i, median in enumerate(medians):
    ax.annotate(f"Median: ${median/1e6:.1f}M", xy=(i, median), xytext=(10, 0), textcoords='offset points',
                arrowprops=dict(arrowstyle="->", color='black', path_effects=highlight_effect))

means = [np.mean(data) for data in filtered_ranking_data.values()]
# Use mean_value so the loop does not shadow statistics.mean imported earlier
for i, mean_value in enumerate(means):
    ax.annotate(f"Mean: ${mean_value/1e6:.1f}M", xy=(i, mean_value), xytext=(10, 0), textcoords='offset points',
                arrowprops=dict(arrowstyle="->", color='black', path_effects=highlight_effect))

for label in ax.get_ymajorticklabels():
    label.set_fontweight('bold')

# Show the plot
plt.show()

<a id="5"></a> <br>
## Section 4 - List of Startups Pitched which were Successful
### Based on the multiple between funding when pitched and current funding, and on the current amount of funding
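
The multiple buckets used in this section can also be written as a sorted list of upper bounds with a `bisect` lookup. This is a minimal sketch, not the notebook's own implementation: `bucket_multiple`, `UPPER_BOUNDS`, and `LABELS` are hypothetical names, and the bucket edges are assumed to match the labels defined in the cell below.

```python
from bisect import bisect_left
import math

# Inclusive upper bound of each bucket; anything above the last bound is '100x+'
UPPER_BOUNDS = [1, 5, 10, 20, 50, 100]
LABELS = ['<=1x', '1-5x', '5-10x', '10-20x', '20-50x', '50-100x', '100x+']

def bucket_multiple(multiple):
    """Return the bucket label for a funding multiple, or 'error' if it is not a usable number."""
    try:
        value = float(multiple)
    except (TypeError, ValueError):
        return 'error'
    if math.isnan(value):
        return 'error'
    # bisect_left finds the first bound that is >= value, which is the bucket index
    return LABELS[bisect_left(UPPER_BOUNDS, value)]

print(bucket_multiple(7.5))
```

Keeping the edges in one list means a threshold only has to be changed in one place, and a NaN multiple is reported as `'error'` explicitly instead of falling through a chain of comparisons.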

In [25]:
#Create list of startups that did well that we missed based on multiple and amount of funding
missed_startups = {}
potential_misses = {}
non_misses = {}

multipl1_value = '<=1x'
multipl2_value = '1-5x'
multipl3_value = '5-10x'
multipl4_value = '10-20x'
multipl5_value = '20-50x'
multipl6_value = '50-100x'
multipl7_value = '100x+'
multipl8_value = 'error'

for item in multipl4:
    potential_misses[item] = multipl4_value

for item in multipl5:
    potential_misses[item] = multipl5_value

for item in multipl6:
    missed_startups[item] = multipl6_value

for item in multipl7:
    missed_startups[item] = multipl7_value

def multipl_value(item):
    #Have company name
    #Need to get index of row from company name
    index = df_final_multiple[df_final_multiple['Company Name'] == item].index

    if not index.empty:
        #Need to get specific multiple value
        multiple = df_final_multiple.at[index[0], "Funding Multiple"]
        multiple = float(multiple)
        #Need to return the multiple value string
        if multiple <= 1:
            return multipl1_value
        elif multiple <=5:
            return multipl2_value
        elif multiple <=10:
            return multipl3_value
        elif multiple <=20:
            return multipl4_value
        elif multiple <=50:
            return multipl5_value
        elif multiple <=100:
            return multipl6_value
        elif multiple > 100:
            return multipl7_value
        else:
            # Reached only when the multiple is NaN (every comparison above is False)
            return multipl8_value
    else:
        return multipl8_value

for item in raise1:
    if item in non_misses:
        continue
    else:
        non_misses[item] = multipl_value(item)

for item in raise2:
    if item in non_misses:
        continue
    else:
        non_misses[item] = multipl_value(item)

for item in raise3:
    if item in non_misses:
        continue
    else:
        non_misses[item] = multipl_value(item)

for item in raise4:
    if item in non_misses:
        continue
    else:
        non_misses[item] = multipl_value(item)

for item in raise5:
    if item in potential_misses:
        continue
    else:
        potential_misses[item] = multipl_value(item)


for item in raise6:
    if item in missed_startups:
        continue
    else:
        missed_startups[item] = multipl_value(item)

for item in raise7:
    if item in missed_startups:
        continue
    else:
        missed_startups[item] = multipl_value(item)

for item in raise8:
    if item in missed_startups:
        continue
    else:
        missed_startups[item] = multipl_value(item)

#Add startups that were acquired to list
for index, row in df_final_multiple.iterrows():
    company_name = row['Company Name']
    stage = row['Stage']
    
    if company_name in raise9 and stage == 'Acquired' and company_name not in missed_startups:
        missed_startups[company_name] = multipl_value(company_name)
    else:
        continue

In [26]:
df_final_analysis = df_final_analysis.reset_index(drop=True)

In [15]:
def get_info(company, col2, df = df_final_analysis, col1 = 'Company Name'):
    index = df[df[col1] == company].index

    if not index.empty:
        return str(df.at[index[0], col2])
    else: 
        return 'Unknown'
        
#Create list of missed investments based on being acquired/exited, funding stage, and multiple
print(bold(underline_text('List of Missed Investments')))
print('\n' + bold('LEL Ranking System: '))
print('0 — Given to startups that weren\'t considered')
print('1 — Given to startups that were highlighted yellow (potentially promising)')
print('2 — Given to startups that were highlighted green (very promising)')
print('3 — Given to startups LEL invested in')
for item in missed_startups:
    index = df_final_analysis[df_final_analysis['Company Name'] == item].index
    if df_final_analysis.at[index[0], 'Ranking'] in [0, 1, 2]:
        print(line(100))
        print('\n' + bold(item + " — ") + get_info(item, 'Stage') + ' Stage')
        print('\n' + 'Industry/Sector: ' + get_info(item, 'Industry/Sector'))
        print('Ranking LEL Gave: ' + get_info(item, 'Ranking') + ' out of 3')
        try:
            print('\n' + 'Funding when Pitched: ' + format_currency(get_info(item, 'Total Money Raised'), 'USD'))
        except:
            print('\n' + 'Funding when Pitched: ' + get_info(item, 'Total Money Raised'))
        print(('Current Funding:      ') + format_currency(get_info(item, 'Total Raised'), 'USD'))
        print(('Multiple:             ') + missed_startups[item])
        print('\n' + 'Description: ')
        print(get_info(item, 'Brief Introduction'))

In [16]:
#Create list of startups ranked 2 that did not meet the threshold for a missed investment
print(bold(underline_text('List of Startups Given Ranking of 2 that Didn\'t Meet Threshold of Being Called a Missed Investment')))
print('\n' + bold('LEL Ranking System: '))
print('0 — Given to startups that weren\'t considered')
print('1 — Given to startups that were highlighted yellow (potentially promising)')
print('2 — Given to startups that were highlighted green (very promising)')
print('3 — Given to startups LEL invested in')


for item in non_misses:
    index = df_final_analysis[df_final_analysis['Company Name'] == item].index
    stage = df_final_analysis.at[index[0], 'Stage']
    if df_final_analysis.at[index[0], 'Ranking'] in [2] and stage != 'Acquired':
        print(line(100))
        print('\n' + bold(item + " — ") + get_info(item, 'Stage') + ' Stage')
        print('\n' + 'Industry/Sector: ' + get_info(item, 'Industry/Sector'))
        print('Ranking LEL Gave: ' + get_info(item, 'Ranking') + ' out of 3')
        try:
            print('\n' + 'Funding when Pitched: ' + format_currency(get_info(item, 'Total Money Raised'), 'USD'))
        except:
            print('\n' + 'Funding when Pitched: ' + get_info(item, 'Total Money Raised'))
        print(('Current Funding:      ') + format_currency(get_info(item, 'Total Raised'), 'USD'))
        print(('Multiple:             ') + non_misses[item])
        print('\n' + 'Description: ')
        print(get_info(item, 'Brief Introduction'))

In [17]:
numbers = []
for item in missed_startups:
    index = df_final_analysis[df_final_analysis['Company Name'] == item].index
    if df_final_analysis.at[index[0], 'Ranking'] in [0, 1, 2]:
        numbers.append(get_info(item, 'Ranking'))

# Count the occurrences of each unique number
unique_numbers = list(set(numbers))
counts = [numbers.count(num) for num in unique_numbers]

# Calculate the total length of numbers
total_length = len(numbers)

# Sort unique_numbers and counts together by unique_numbers
unique_numbers, counts = zip(*sorted(zip(unique_numbers, counts)))

# Set the font family and size
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.size'] = 12

# Create a bar plot with bars ordered by unique numbers
fig, ax = plt.subplots(figsize=(10, 6))
width = 0.35  # Width of each bar
x = range(len(unique_numbers))

# Define colors for bars
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']

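# Plot each ranking's share of the missed investments as a percentage, matching the axis label and title
bars = ax.bar(x, [100 * count / total_length for count in counts], width, color=colors[:len(counts)])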

# Set x-axis labels as unique numbers
x_labels = [str(num) for num in unique_numbers]
ax.set_xticks(x)
ax.set_xticklabels(x_labels, fontsize=10)
ax.set_ylim(0, 100)

# Calculate relative frequency (count / total length) for each unique number
relative_frequencies = [count / total_length for count in counts]

# Set custom tick positions and labels on the x-axis
ax.set_xticks(x)
ax.set_xticklabels([f"{num} ({rf:.2%})" for num, rf in zip(unique_numbers, relative_frequencies)], fontsize=10)

# Add labels and title
ax.set_xlabel("Rankings LEL Gave", fontsize=14)
plt.ylabel('Percent out of 100')
ax.set_title("Percent of 'Missed' Investments For Each LEL Ranking", fontsize=16)

# Show the plot with adjustable aspect ratio
plt.gca().set_aspect('auto')

# Display the plot
plt.tight_layout()
plt.show()

In [18]:
counts = list(counts)
counts.append(sum(counts))

# Create a bar plot with bars ordered by unique numbers
fig, ax = plt.subplots(figsize=(10, 6))
width = 0.35  # Width of each bar
x = range(len(counts))

# Define colors for bars
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']

bars = ax.bar(x, counts, width, color=colors)

# Set x-axis labels as unique numbers
x_labels = ['0', '1', '2', 'Total Num of Missed Investments']
ax.set_xticks(x)
ax.set_xticklabels(x_labels, fontsize=10)
ax.set_ylim(0, 120)

# Add count labels on top of the bars
for i, v in enumerate(counts):
    ax.text(i, v + 0.2, f"{v}", ha='center', va='bottom', fontsize=10)

# Add labels and title
ax.set_xlabel("Rankings LEL Gave", fontsize=14)
plt.ylabel('Number of Missed Investments with this Ranking')
ax.set_title("The Number of Missed Investments Categorized by LEL Ranking", fontsize=16)

# Show the plot with adjustable aspect ratio
plt.gca().set_aspect('auto')

# Display the plot
plt.tight_layout()
plt.show()

In [19]:
numbers_compared_to_all = {'0': 0, '1': 0, '2': 0, '% of Missed Investments out of Total Screened':0}

for x in numbers:
    numbers_compared_to_all[x] += 1

total0s = 0
total1s = 0
total2s = 0

for index, item in df_final_analysis.iterrows():
    ranking = int(item['Ranking'])
    if ranking == 0:
        total0s += 1
    elif ranking == 1:
        total1s += 1
    elif ranking == 2:
        total2s += 1
    else:
        pass

numbers_compared_to_all['% of Missed Investments out of Total Screened'] = round(((numbers_compared_to_all['0'] + numbers_compared_to_all['1'] + numbers_compared_to_all['2'])/ (total0s + total1s + total2s)) * 100, 2)
numbers_compared_to_all['0'] = round((numbers_compared_to_all['0'] / total0s) * 100, 2)
numbers_compared_to_all['1'] = round((numbers_compared_to_all['1'] / total1s) * 100, 2)
numbers_compared_to_all['2'] = round((numbers_compared_to_all['2'] / total2s) * 100, 2)

# Create a bar plot with bars ordered by unique numbers
fig, ax = plt.subplots(figsize=(10, 6))
width = 0.35  # Width of each bar
x = range(len(numbers_compared_to_all))

# Define colors for bars
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']

bars = ax.bar(x, list(numbers_compared_to_all.values()), width, color=colors)

# Set x-axis labels as unique numbers
x_labels = list(numbers_compared_to_all.keys())
ax.set_xticks(x)
ax.set_xticklabels(x_labels, fontsize=10)
ax.set_ylim(0, 100)

# Add percentage labels on top of the bars
for i, v in enumerate(numbers_compared_to_all.values()):
    ax.text(i, v + 0.2, f"{v}%", ha='center', va='bottom', fontsize=10)

# Add labels and title
ax.set_xlabel("Rankings LEL Gave", fontsize=14)
plt.ylabel('Percent out of 100')
ax.set_title("Percent of Startups Given their LEL Ranking that did Well", fontsize=16)

# Show the plot with adjustable aspect ratio
plt.gca().set_aspect('auto')

# Display the plot
plt.tight_layout()
plt.show()
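
The per-ranking percentages plotted above come down to dividing the number of missed startups at each ranking by the number of startups screened at that ranking. A minimal sketch of that calculation follows. `miss_rates` is a hypothetical helper and the two input lists are made-up illustrative data, not LEL figures.

```python
from collections import Counter

def miss_rates(all_rankings, missed_rankings):
    """Percent of startups at each ranking that were later classed as missed investments."""
    totals = Counter(all_rankings)
    missed = Counter(missed_rankings)
    rates = {}
    # Only rankings that actually occur are visited, so the divisor is never zero
    for ranking in sorted(totals):
        rates[ranking] = round(100 * missed[ranking] / totals[ranking], 2)
    return rates

# Hypothetical data: the ranking of every screened startup, and of the missed ones
all_rankings = [0, 0, 0, 0, 1, 1, 2, 2]
missed_rankings = [0, 1, 2]
print(miss_rates(all_rankings, missed_rankings))
```

Iterating over the rankings that are present avoids a division by zero when no startup received a given ranking.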

<a id="6"></a> <br>
## Section 5 - An ML Algorithm to Analyze Future Startups Being Added to Pitch Deck
### First, we use the funding multiples to label each startup by how large a miss it was for LEL (visualized in the previous section).
### Then we clean the text, create the training and test data, and train the ML model.

In [34]:
#Measure VC performance based on how much we missed
df_final_performance = df_final_multiple.copy()
df_final_performance = df_final_performance.drop(['index', 'Ranking', 'Week', 'Lead', 'Year Founded', 'Founder(s)', 'Last Round Raised', 'Investors/Funding Source', 'TAM', 'Competitors', 'Total Raised', 'Stage', 'Funding Multiple', 'Company Type', 'Location', 'Total Money Raised'], axis = 1)
df_final_performance['Performance'] = 0

for item in potential_misses:
    index = df_final_performance[df_final_performance['Company Name'] == item].index
    if not index.empty:
        df_final_performance.at[index[0], 'Performance'] = 1
for item in missed_startups:
    index = df_final_performance[df_final_performance['Company Name'] == item].index
    if not index.empty:
        df_final_performance.at[index[0], 'Performance'] = 2
for index, row in df_final_performance.iterrows():
    a = int(float(row['School (0-5)']))
    try:
        b = int(float(row['Corporate Experience/\nTop Organization (0-5)']))
    except:
        b = 0
    c = int(float(row['Startup Experience (0-5)']))
    d = int(float(row['Tech Background (0-5)']))
    try:
        e = int(float(row['Product Stage (0-5)']))
    except:
        e = 3
    try:
        f = int(float(row['User/Client Stage (0-5)']))
    except:
        f = 3
    try:
        g = int(float(row['Revenue (0-5)']))
    except:
        g = 3
    # Membership test: `g == 0 or 1 or 2` is always True because 1 is truthy
    if g in (0, 1, 2):
        g = 3
    df_final_performance.at[index, 'School (0-5)'] = a
    df_final_performance.at[index, 'Corporate Experience/\nTop Organization (0-5)'] = b
    df_final_performance.at[index, 'Startup Experience (0-5)'] = c
    df_final_performance.at[index, 'Tech Background (0-5)'] = d
    df_final_performance.at[index, 'Product Stage (0-5)'] = e
    df_final_performance.at[index, 'User/Client Stage (0-5)'] = f
    df_final_performance.at[index, 'Revenue (0-5)'] = g
    df_final_performance.at[index, 'SUM'] =  a + b + c + d + e + f + g

df_final_performance = df_final_performance.drop(['Company Name'], axis = 1)

In [35]:
stopWords = nltk.corpus.stopwords.words("english")
token = nltk.tokenize.RegexpTokenizer(r"\w+")
lemmatizer = nltk.stem.WordNetLemmatizer()


def cleaner(DF, col="text"):
    stopWords = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    emojis = []  # Populate with emoji tokens to strip; an empty list removes nothing

    cleaned_texts = []
    for i, row in DF.iterrows():
        sentence = str(row[col])  # Look the cell up by column label and convert to string
        sentence = sentence.lower().split(" ")
        sentence = [word for word in sentence if word not in emojis]
        sentence = [word for word in sentence if 
                  "http" not in word and 
                  "https" not in word and 
                  "@" not in word]
        sentence = [contractions.fix(word) for word in sentence]
        sentence = " ".join(sentence).lower()
        sentence = word_tokenize(sentence)
        sentence = [word for word in sentence if word not in stopWords]
        sentence = [lemmatizer.lemmatize(word) for word in sentence]
        sentence = [word.strip() for word in sentence]
        sentence = [word for word in sentence if not re.match(r"\S*\d+\S*", word)]
        sentence = [word for word in sentence if 
                  word != "rt" and 
                  word != "û_" and 
                  word != "amp" and 
                  word != "ûª" and
                  word != "ûªs" and
                  word != "ûò" and
                  word != "åè" and
                  word != "ìñ1"] 
        sentence = [re.sub(r"(.)\1{2,}\B", r"\1\1", word) for word in sentence]
        sentence = [re.sub(r"(.)\1{2,}\b", r"\1\1", word) for word in sentence]
        sentence = [word for word in sentence if len(word) > 1]
        sentence = [word for word in sentence if len(word) < 30]

        cleaned_texts.append(" ".join(sentence))

    DF[col] = cleaned_texts  # Update the original DataFrame with the cleaned column
    return DF
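
The filtering and regex steps inside `cleaner` can be tried in isolation with the standard library alone. The sketch below is a simplified stand-in, not a replacement for `cleaner`: `simple_clean` is a hypothetical name, and it skips the stop-word removal, lemmatization, and contraction expansion that need `nltk` and `contractions`.

```python
import re

def simple_clean(text):
    """Lowercase, drop links/mentions and tokens containing digits, and squeeze repeated characters."""
    words = text.lower().split()
    # Drop URLs and @mentions
    words = [w for w in words if "http" not in w and "@" not in w]
    # Drop any token that contains a digit
    words = [w for w in words if not re.search(r"\d", w)]
    # Collapse runs of three or more identical characters to two ("soooo" -> "soo")
    words = [re.sub(r"(.)\1{2,}", r"\1\1", w) for w in words]
    # Keep tokens of a plausible length
    return " ".join(w for w in words if 1 < len(w) < 30)

print(simple_clean("Soooo excited about our 2nd launch https://example.com @investor"))
# prints: soo excited about our launch
```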


In [36]:
df_final_performance = cleaner(df_final_performance, 'Brief Introduction')
df_final_performance = cleaner(df_final_performance, 'Industry/Sector')
df_final_performance = cleaner(df_final_performance, 'Comments/Additional Notes')

In [37]:
# Preprocess numerical features
numeric_features = df_final_performance[['School (0-5)',
       'Corporate Experience/\nTop Organization (0-5)',
       'Startup Experience (0-5)', 'Tech Background (0-5)',
       'Product Stage (0-5)', 'User/Client Stage (0-5)', 'Revenue (0-5)',
       'SUM']]
scaler = StandardScaler()
numeric_features = scaler.fit_transform(numeric_features)

#Preprocess textual features
text_data = df_final_performance[['Brief Introduction', 'Industry/Sector', 'Comments/Additional Notes']].astype(str)
text_columns = ['Brief Introduction', 'Industry/Sector', 'Comments/Additional Notes']
text_sequences_list = []

# Fit one shared tokenizer on every text column so that words map to integer indices;
# without fit_on_texts, texts_to_sequences would return empty sequences
tokenizer = Tokenizer(num_words=None)
for col in text_columns:
    tokenizer.fit_on_texts(text_data[col])

for col in text_columns:
    text_sequences = tokenizer.texts_to_sequences(text_data[col])  # Convert each entry to a list of word indices
    text_sequences = pad_sequences(text_sequences, maxlen=40)  # Pad or truncate to 40 tokens
    text_sequences_list.append(text_sequences)

# Combine processed textual features into a single array
text_sequences_combined = np.concatenate(text_sequences_list, axis=1)  # Concatenate column-wise (axis=1)

# Ensure numerical and textual features have the same number of samples
assert len(numeric_features) == len(text_sequences_combined), "Number of samples mismatch between numerical and textual features"

# Combine processed numerical and textual features into X
X = np.concatenate([text_sequences_combined, numeric_features], axis=1)  # Combine along axis=1

# Target variable
y = df_final_performance['Performance'].values

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
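
To make the text-to-sequence step concrete, the sketch below reproduces the idea in plain Python: build a word index from the training text, map each word to its index, and left-pad to a fixed length. `fit_vocab` and `to_padded_sequence` are hypothetical helpers, not Keras functions. Ties between equally frequent words are broken alphabetically here, which may differ from the order Keras assigns.

```python
from collections import Counter

def fit_vocab(texts):
    """Map each word to an integer index, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def to_padded_sequence(text, vocab, maxlen):
    """Convert text to word indices, skipping unknown words, then left-pad with zeros."""
    seq = [vocab[w] for w in text.split() if w in vocab]
    seq = seq[-maxlen:]  # keep the last maxlen tokens when the text is too long
    return [0] * (maxlen - len(seq)) + seq

vocab = fit_vocab(["ai platform for health", "health data platform"])
print(to_padded_sequence("health data startup", vocab, 5))
# prints: [0, 0, 0, 1, 4]
```

A word that never appeared in the fitted text ("startup" here) is dropped, which is why the vocabulary has to be fitted before any text is converted.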

In [20]:
from sklearn.metrics import mean_absolute_error

imputer = KNNImputer()

xgb_clf = XGBClassifier(eval_metric='logloss')

#pipeline = Pipeline(steps=[('imputer', imputer),('model', model)])

xgb_clf.fit(X_train, y_train)

y_pred = xgb_clf.predict(X_test)

print('XGBoost model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test, y_pred)))

#my_model = XGBRegressor(n_estimators=1000)
#my_model.fit(X_train, y_train, early_stopping_rounds=5, 
#             eval_set=[(X_test, y_test)], verbose=False)

#my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
#my_model.fit(X_train, y_train, early_stopping_rounds=5, 
#             eval_set=[(X_test, y_test)], verbose=False)

#make predictions
#predictions = my_model.predict(X_test)

#print("Mean Absolute Error : " + str(mean_absolute_error(predictions, y_test)))

In [39]:
# Save the trained model
xgb_clf.save_model('xgb_model.model')

# Load the saved model
loaded_model = xgb.Booster()
loaded_model.load_model('xgb_model.model')

<a id="7"></a> <br>
## Section 6 - Use Model to Analyze a Startup

In [44]:
#Start a Chrome session used to look up the startup and collect its info
options = Options()
driver = webdriver.Chrome(options=options)

In [21]:
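name = input('What is the name of the startup being analyzed?')

# Convert the name to the dashed form used in CB Insights URLs
nameformat = False
try:
    newname = add_dashes(name)
    nameformat = True
except Exception:
    print('Error occurred with function add_dashes, manual input required')

# Load the company page only if the name could be formatted;
# otherwise newname is undefined and the URL cannot be built
urlcheck = False
if nameformat:
    url = "https://www.cbinsights.com/company/" + newname + "/financials"
    try:
        driver.get(url)
        html = driver.page_source
        urlcheck = True
    except Exception:
        print('Error occurred: URL not found on CBInsights')
        sys.exit(1)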

manual = False
if nameformat is True and urlcheck is True:
    ask = input("Would you like to manually input the introduction and industry of " + name + " (alternative is it being done automatically by the system)? Type Y or N")
    if ask == 'Y':
        manual = True
    else:
        manual = False


if nameformat is False or urlcheck is False or manual is True:
    if nameformat is False or urlcheck is False:
        print('Error: Manual input will be required for all categories')
    introduction = input('Provide a brief introduction for ' + name)
    industry = input('Provide the industry of ' + name)
    school = input('Rank the school of the team of ' + name + ' on a scale of 0-5')
    corporate = input('Rank the corporate experience of the team of ' + name + ' on a scale of 0-5')
    startup = input('Rank the startup experience of the team of ' + name + ' on a scale of 0-5')
    tech = input('Rank the tech experience of the team of ' + name + ' on a scale of 0-5')
    productstage = input('Rank the product stage of ' + name + ' on a scale of 0-5')
    clientstage = input('Rank the user/client stage of ' + name + ' on a scale of 0-5')
    revenue = input('Rank the revenue of ' + name + ' on a scale of 0-5')
    SUM = int(float(school)) + int(float(corporate)) + int(float(startup)) + int(float(tech)) + int(float(productstage)) + int(float(clientstage)) + int(float(revenue))
    comments = input('Provide comments/additional notes for ' + name)

else:
    
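    # str.find returns -1 instead of raising, so check each marker explicitly
    introcheck = False
    description_marker = 'png","description":"'
    description_pos = html.find(description_marker)
    if description_pos != -1:
        description_start = description_pos + len(description_marker)
        description_end = html.find('","url"', description_start)
        if description_end != -1:
            introduction = html[description_start:description_end]
            introcheck = True

    industrycheck = False
    industry_marker = ',"subindustry":"'
    industry_pos = html.find(industry_marker)
    if industry_pos != -1:
        industry_start = industry_pos + len(industry_marker)
        industry_end = html.find('","idSector', industry_start)
        if industry_end != -1:
            curindustry = html[industry_start:industry_end]
            # Decode \uXXXX escape sequences left in the page source
            curindustry = re.sub(r'\\u[0-9a-fA-F]{4}', lambda x: chr(int(x.group()[2:], 16)), curindustry)
            # Accept only values of a plausible length for an industry name
            if 5 < len(curindustry) < 40:
                industry = curindustry
                industrycheck = True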

    if introcheck is False:
        introduction = input('Provide a brief introduction for ' + name)
    
    if industrycheck is False:
        industry = input('Provide the industry of ' + name)

    school = input('Rank the school of the team of ' + name + ' on a scale of 0-5')
    corporate = input('Rank the corporate experience of the team of ' + name + ' on a scale of 0-5')
    startup = input('Rank the startup experience of the team of ' + name + ' on a scale of 0-5')
    tech = input('Rank the tech experience of the team of ' + name + ' on a scale of 0-5')
    productstage = input('Rank the product stage of ' + name + ' on a scale of 0-5')
    clientstage = input('Rank the user/client stage of ' + name + ' on a scale of 0-5')
    revenue = input('Rank the revenue of ' + name + ' on a scale of 0-5')
    SUM = int(float(school)) + int(float(corporate)) + int(float(startup)) + int(float(tech)) + int(float(productstage)) + int(float(clientstage)) + int(float(revenue))
    comments = input('Provide comments/additional notes for ' + name)

startupdict = { 
    'Brief Introduction' : introduction, 
    'Industry/Sector' : industry, 
    'School (0-5)' : school, 
    'Corporate Experience/\nTop Organization (0-5)' : corporate, 
    'Startup Experience (0-5)' : startup, 
    'Tech Background (0-5)' : tech,
    'Product Stage (0-5)' : productstage,
    'User/Client Stage (0-5)' : clientstage,
    'Revenue (0-5)' : revenue,
    'SUM' : SUM,
    'Comments/Additional Notes' : comments}

# input() returns strings, so convert every rating to a number before scaling
numeric_features = np.array([
    startupdict['School (0-5)'],
    startupdict['Corporate Experience/\nTop Organization (0-5)'],
    startupdict['Startup Experience (0-5)'],
    startupdict['Tech Background (0-5)'],
    startupdict['Product Stage (0-5)'],
    startupdict['User/Client Stage (0-5)'],
    startupdict['Revenue (0-5)'],
    startupdict['SUM']
], dtype=float).reshape(1, -1)

# Reuse the scaler fitted on the training data in Section 5 (it must still be in memory);
# fitting a new StandardScaler on a single sample would map every feature to 0
numeric_features = scaler.transform(numeric_features)

# Preprocess textual features with the same three columns, in the same order, used for training
text_data = {
    'Brief Introduction': startupdict['Brief Introduction'],
    'Industry/Sector': startupdict['Industry/Sector'],
    'Comments/Additional Notes': startupdict['Comments/Additional Notes']
}
text_columns = ['Brief Introduction', 'Industry/Sector', 'Comments/Additional Notes']
text_sequences_list = []

for col in text_columns:
    text_sequences = tokenizer.texts_to_sequences([str(text_data[col])])
    text_sequences = pad_sequences(text_sequences, maxlen=40)
    text_sequences_list.append(text_sequences)

# Combine processed textual features into a single array
text_sequences_combined = np.concatenate(text_sequences_list, axis=1)

# Combine the features in the training layout (text sequences, then numeric features);
# the model expects the same number and order of columns as X in Section 5
X = np.concatenate([text_sequences_combined, numeric_features], axis=1)

dmatrix = xgb.DMatrix(X)

# Make predictions using the loaded model
predictions = loaded_model.predict(dmatrix)


#Print the predictions
print(bold(name))
print('Probability of startup not being a promising investment (0): ' + str(predictions[0][0] * 100) + "%")
print('Probability of startup being a potentially promising investment (1): ' + str(predictions[0][1] * 100) + "%")
print('Probability of startup being a very promising investment (2): ' + str(predictions[0][2] * 100) + "%")
new_predictions = np.argmax(np.array(predictions))
if int(new_predictions) == 0:
    print(bold('Ranking given: 0.'))
    print('The model does not consider this startup to be a promising investment.')
elif int(new_predictions) == 1:
    print(bold('Ranking given: 1.'))
    print('The model considers this startup to be a potentially promising investment.')
elif int(new_predictions) == 2:
    print(bold('Ranking given: 2.'))
    print('The model considers this startup to be a very promising investment.')
else:
    print(bold('Ranking given: ') + str(new_predictions))

print('\n' + bold('Introduction: ') + introduction)
print(bold('Industry: ') + industry)
print('\n' + bold('School (0-5): ') + school)
print(bold('Corporate Experience, Top Organization (0-5): ') + corporate)
print(bold('Startup Experience (0-5): ') + startup)
print(bold('Tech Background (0-5): ') + tech)
print(bold('Product Stage (0-5): ') + productstage)
print(bold('Client Stage (0-5): ') + clientstage)
print(bold('Revenue (0-5): ') + revenue)
print(bold('\n' + 'SUM: ') + str(SUM))
print(bold('Comments/Additional Notes: ') + comments)
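
A note on the scaling step: a new startup's ratings are only comparable to the training data if they are standardized with the mean and spread of the training set. The sketch below shows the arithmetic for one feature using the standard library. `fit_standardizer` and `standardize` are hypothetical helpers and the ratings are made-up values. Like `StandardScaler`, it uses the population standard deviation and treats a zero spread as 1.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Return (mean, scale) for one feature column, using the population standard deviation."""
    mean = fmean(values)
    scale = pstdev(values)
    # A zero spread is replaced by 1 so the division below is always defined
    return mean, (scale if scale > 0 else 1.0)

def standardize(value, mean, scale):
    """Express a value as a number of standard deviations from the mean."""
    return (value - mean) / scale

# Hypothetical training ratings for one 0-5 feature
train_ratings = [1, 2, 3, 4, 5]
mean, scale = fit_standardizer(train_ratings)
print(standardize(5, mean, scale))  # scaled with the training statistics

# Fitting on the single new sample instead always yields 0.0
solo_mean, solo_scale = fit_standardizer([5])
print(standardize(5, solo_mean, solo_scale))
```

With statistics taken from one sample, every feature becomes 0 and the model would see identical numeric inputs for every startup.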