<a href="https://colab.research.google.com/github/srenna/moonshot_finder/blob/main/moonshot_finder_falcon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Import and Cleaning

In [1]:
import pandas as pd

In [2]:
encodings_to_try = ['utf-8', 'latin-1', 'ISO-8859-1']

for encoding in encodings_to_try:
    try:
        df = pd.read_csv('AI EarthHack Dataset.csv', encoding=encoding)
        break
    except UnicodeDecodeError:
        continue
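One caveat with the fallback loop above: `latin-1` and `ISO-8859-1` name the same codec, and that codec maps every possible byte to a character, so it never raises `UnicodeDecodeError`. If the file is mostly UTF-8 with a few stray bytes, the fallback can silently turn multi-byte characters into mojibake. The sketch below illustrates the effect with the standard library only; `decode_with_fallback` is a hypothetical helper written for this illustration, not a pandas function.

```python
import codecs


def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes `raw`."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode the data")


# Both names resolve to the same codec, so trying both is redundant.
same_codec = codecs.lookup("latin-1").name == codecs.lookup("ISO-8859-1").name

# An en dash is three bytes in UTF-8. A single invalid byte elsewhere in the
# data makes the utf-8 attempt fail, and latin-1 then garbles the dash.
raw = "it \u2013 such as".encode("utf-8") + b" \xff"
text, used = decode_with_fallback(raw)
print(same_codec, used, repr(text))
```

If garbled characters such as `â` show up in the loaded text, it may be worth checking which encoding the loop actually settled on before cleaning the data.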

In [3]:
# check all the data is there
len(df)

# look for null values
df.isnull().sum()

# drop null values
df = df.dropna()

In [4]:
df.head()

| | id | problem | solution |
|---|---|---|---|
| 0 | 1 | The construction industry is indubitably one o... | Herein, we propose an innovative approach to m... |
| 1 | 2 | I'm sure you, like me, are feeling the heat - ... | Imagine standing on a green hill, not a single... |
| 2 | 3 | The massive shift in student learning towards ... | Implement a ""Book Swap"" program within educa... |
| 3 | 4 | The fashion industry is one of the top contrib... | The proposed solution is a garment rental serv... |
| 4 | 5 | The majority of the materials used in producin... | An innovative concept would be a modular elect... |


## Falcon 7B Model - by TII
---
1. Model Name: Falcon-7b-instruct
2. Model Parameters: 7 Billion
3. Training: Instruction-tuned Model
4. Link: https://huggingface.co/tiiuae/falcon-7b-instruct
---

In [5]:
# install dependencies
!pip install transformers
!pip install einops
!pip install accelerate



In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

# load model
model = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model)

falcon_pipeline = transformers.pipeline("text-generation",
                                        model=model,
                                        tokenizer=tokenizer,
                                        torch_dtype=torch.bfloat16,
                                        trust_remote_code=True,
                                        device_map="auto"
                                        )

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




In [7]:
# define completion function
def get_completion_falcon(user_input):
  # System instructions sent to the model ahead of every user prompt
  system = """
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  """
  prompt = f"#### System: {system}\n#### User: \n{user_input}\n\n#### Response from falcon-7b-instruct:"

  # max_length counts the prompt tokens as well as the generated ones
  falcon_response = falcon_pipeline(prompt,
                                    max_length=1000,
                                    do_sample=True,
                                    top_k=10,
                                    num_return_sequences=1,
                                    eos_token_id=tokenizer.eos_token_id,
                                    pad_token_id=tokenizer.eos_token_id
                                    )

  # The pipeline returns a list of dicts; the text is under 'generated_text'
  return falcon_response
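The prompt layout used by the completion function (a system block, a user block, then a response header) can be isolated in a small helper so it is easy to inspect without loading the model. The following is a minimal sketch: `build_prompt` and `RESPONSE_HEADER` are hypothetical names, and the system text is shortened for illustration rather than copied from the notebook.

```python
import textwrap

RESPONSE_HEADER = "#### Response from falcon-7b-instruct:"


def build_prompt(system, user):
    """Assemble the System / User / Response layout used by the notebook."""
    # Remove the indentation that triple-quoted strings pick up inside a function.
    system = textwrap.dedent(system).strip()
    return f"#### System: {system}\n#### User: \n{user}\n\n{RESPONSE_HEADER}"


# Shortened, illustrative system text (an assumption, not the notebook's prompt).
system = """
  You are a venture capital (VC) expert.
  Identify each idea as 'moonshot' or 'incremental'.
  """
prompt = build_prompt(system, "Is a book swap moonshot or incremental?")
print(prompt)
```

Because the pipeline echoes the prompt in `generated_text` by default, it may also be worth trying the text-generation pipeline's `return_full_text=False` option, which should return only the completion.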

## Prompt Engineering


In [8]:
# let's prompt
# prompt = "Explain to me the difference between nuclear fission and fusion."
# prompt = "Why is the Sky blue?"
prompt = '''Herein, we propose an innovative approach to mitigate this problem: Modular Construction.
This method embraces recycling and reuse, taking a significant stride towards a circular economy.
  Modular construction involves utilizing engineered components in a manufacturing facility that are
  later assembled on-site. These components are designed for easy disassembling, enabling them to be reused
  in diverse projects, thus significantly reducing waste and conserving resources.
  Not only does this method decrease construction waste by up to 90%, but it also decreases
  construction time by 30-50%, optimizing both environmental and financial efficiency.
  This reduction in time corresponds to substantial financial savings for businesses.
  Moreover, the modular approach allows greater flexibility, adapting to changing needs over time.
  We believe, by adopting modular construction, the industry can transit from a 'take, make and dispose'
  model to a more sustainable 'reduce, reuse, and recycle' model, driving the industry towards a more circular
  and sustainable future. The feasibility of this concept is already being proven in markets around the globe,
  indicating its potential for scalability and real-world application. Is this a moonshot idea or incremental?'''
response = get_completion_falcon(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
Herein, we propose an innovative approach to mitigate this problem: Modular Construction.
This method embraces recycling and reuse, taking a significant stride towards a circular economy.
  Modular construction involves utilizing engineered components in a manufacturing facility that are
  later assembled on-site. These components are designed for easy disassembling, enabling them to be reused
  in diverse projects, thus significantly reducing waste and conserving resou

In [9]:
prompt = '''Imagine standing on a green hill, not a single towering, noisy windmill in sight, and yet,
you're surrounded by wind power generation! Using existing, yet under-utilized technology,
I propose a revolutionary approach to harness wind energy on a commercial scale, without those
""monstrously large and environmentally damaging windmills"".
With my idea, we could start construction tomorrow and give our electrical grid the jolt it needs,
creating a future where clean, quiet and efficient energy isn't a dream, but a reality we live in.
This is not about every home being a power station, but about businesses driving a green revolution
from the ground up! Is this a moonshot idea or incremental?'''
response = get_completion_falcon(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
Imagine standing on a green hill, not a single towering, noisy windmill in sight, and yet,
you're surrounded by wind power generation! Using existing, yet under-utilized technology,
I propose a revolutionary approach to harness wind energy on a commercial scale, without those
""monstrously large and environmentally damaging windmills"".
With my idea, we could start construction tomorrow and give our electrical grid the jolt it needs,
creating a future where clean, quiet

In [10]:
prompt = '''Implement a ""Book Swap"" program within educational institutions and local communities.
This platform allows students to trade books they no longer need with others who require them,
reducing the need for new book production and hence, lowering the rate of resource depletion.
Furthermore, the platform could have a digital component to track book exchanges, giving users credits
for each trade, which they can accrue and redeem. This system encourages and amplifies the benefits of
reusing and sharing resources, thus contributing to the circular economy.   By integrating gamification,
getting students and parents involved and providing an easy-to-use platform, the program could influence
a cultural shift towards greater resource value appreciation and waste reduction.
In terms of the financial aspect, less reliance on purchasing new books could save money for students,
parents and schools. Is this idea moonshot or incremental?'''
response = get_completion_falcon(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
Implement a ""Book Swap"" program within educational institutions and local communities.
This platform allows students to trade books they no longer need with others who require them,
reducing the need for new book production and hence, lowering the rate of resource depletion.
Furthermore, the platform could have a digital component to track book exchanges, giving users credits
for each trade, which they can accrue and redeem. This system encourages and amplifies the be

In [11]:
prompt = '''This is a solution to a problem."Companies can offer products as a service,
where customers pay for access or usage rather than ownership.
This can be done through subscription models or pay-per-use systems."
Is this idea 'moonshot' or 'incremental'?'''
response = get_completion_falcon(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is a solution to a problem."Companies can offer products as a service,
where customers pay for access or usage rather than ownership.
This can be done through subscription models or pay-per-use systems."
Is this idea 'moonshot' or 'incremental'?

#### Response from falcon-7b-instruct:
The term'moonshot' can be used to describe both incremental and grand, ambitious projects. The idea of offering products as a service, rather than ownership, could be considered a moo

In [12]:
prompt = "This is the problem." + df['problem'][0] + "This is the solution." + df['solution'][0] + "Is this idea moonshot or incremental?"
response = get_completion_falcon(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the problem.The construction industry is indubitably one of the significant contributors to global waste, contributing approximately 1.3 billion tons of waste annually, exerting significant pressure on our landfills and natural resources. Traditional construction methods entail single-use designs that require frequent demolitions, leading to resource depletion and wastage.   This is the solution.Herein, we propose an innovative approach to mitigate this problem:

In [13]:
prompt = '''This is a solution to a problem."Companies can offer products as a service,
where customers pay for access or usage rather than ownership.
This can be done through subscription models or pay-per-use systems."
Is this idea 'moonshot' or 'incremental'? Answer with exactly one word either "Moonshot" or "Incremental".'''
response = get_completion_falcon(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is a solution to a problem."Companies can offer products as a service,
where customers pay for access or usage rather than ownership.
This can be done through subscription models or pay-per-use systems."
Is this idea 'moonshot' or 'incremental'? Answer with exactly one word either "Moonshot" or "Incremental".

#### Response from falcon-7b-instruct:
Moonshot.
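The one-word instruction makes the answer much easier to parse, but the model still adds punctuation (`Moonshot.`) and, as later cells show, does not always comply. A small normalizer can map compliant answers to a fixed label and flag everything else for manual review. This is a sketch under that assumption; `normalize_label` is a hypothetical helper.

```python
import re


def normalize_label(answer):
    """Map a short model answer to 'moonshot', 'incremental', or None."""
    # Drop case, whitespace, quotes, and punctuation before comparing.
    cleaned = re.sub(r"[^a-z]", "", answer.lower())
    return cleaned if cleaned in {"moonshot", "incremental"} else None


print(normalize_label("Moonshot."))
print(normalize_label(" 'Incremental' "))
print(normalize_label("This solution is neither moonshot nor incremental."))
```

Returning `None` for anything longer than a single label keeps free-text answers from being counted as a clean classification.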


In [14]:
prompt = "This is the problem." + df['problem'][0] + "This is the solution." + df['solution'][0] + "Is this idea moonshot or incremental? Answer with exactly one word either 'Moonshot' or 'Incremental'. Do not include other words."
response = get_completion_falcon(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the problem.The construction industry is indubitably one of the significant contributors to global waste, contributing approximately 1.3 billion tons of waste annually, exerting significant pressure on our landfills and natural resources. Traditional construction methods entail single-use designs that require frequent demolitions, leading to resource depletion and wastage.   This is the solution.Herein, we propose an innovative approach to mitigate this problem:

In [15]:
prompt = "This is the solution." + df['solution'][80] + "Is this idea moonshot or incremental? Answer with exactly one word either 'Moonshot' or 'Incremental'. Do not include other words."
response = get_completion_falcon(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the solution.A refined circular economy idea is a ""Tech-Device as a Service"" model. This takes the original idea of leasing and refurbishing, but enhances feasibility by operationalizing the service model in tech industries.  In this model, manufacturers or specialized service providers do not just lease the device, but also provide various essential services attached to it - such as timely software upgrades, preventive maintenance, repairs, data security, a

In [16]:
prompt = "This is the solution:" + df['solution'][1103] + "Is this idea moonshot or incremental? Answer with exactly one word either 'Moonshot' or 'Incremental'. Do not include other words."
response = get_completion_falcon(prompt)
print(response[0]['generated_text'])


#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the solution:The proposed solution involves establishing a ""Container Deposit Scheme"" (CDS), much like a ""recycling refund"", for businesses in the food and beverage industry. Here, customers pay a small deposit when they purchase a bottled or canned drink, which is refunded to them when they return the item after use. The returned containers are then collected by the manufacturers, who are responsible for the containers' recycling or reuse.  This way, the CD

In [17]:
generated_text = response[0]['generated_text']
header = "#### Response from falcon-7b-instruct:"
start_index = generated_text.find(header)
response_from_falcon = generated_text[start_index + len(header):].strip()

print(response_from_falcon)


This solution is neither moonshot nor incremental. It is a feasible, medium-term solution that could lead to circular economy innovation.
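The extraction above relies on `str.find`, which returns `-1` when the header is missing (for example, if generation stopped before the header was echoed). In that case `start_index + len(header)` still produces a valid index, and the slice returns an arbitrary tail of the prompt instead of failing. A slightly more defensive variant, sketched here with a hypothetical `extract_response` helper, makes the missing-header case explicit.

```python
HEADER = "#### Response from falcon-7b-instruct:"


def extract_response(generated_text, header=HEADER):
    """Return the text after `header`, or None when the header is absent."""
    _, found, tail = generated_text.partition(header)
    if not found:
        return None
    return tail.strip()


sample = "#### User: \nIs this moonshot?\n\n" + HEADER + "\nMoonshot.\n"
print(extract_response(sample))
print(extract_response("text without the header"))
```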


In [18]:
prompt = "This is the solution:" + df['solution'][1102] + "Is this idea moonshot or incremental? Answer with exactly one word either 'Moonshot' or 'Incremental'. Do not include other words."
response = get_completion_falcon(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the solution:An improved circular economy idea to address this is 'Bio-Degradable Packaging-as-a-Service' combined with 'Localised Composting Programs.' In this model, businesses would partner with innovative packaging suppliers to provide 100% compostable packaging made from plant-based materials.  When a customer purchases a product, they can use the packaging, and instead of returning it, they can dispose of it in their compost bin at home. If composting at h

In [19]:
prompt = "This is the solution: " + df['solution'][1102] + "Is this idea moonshot or incremental? Answer with exactly one word either 'Moonshot' or 'Incremental'. Do not include other words."
response = get_completion_falcon(prompt)
print(response[0]['generated_text'])



#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the solution: An improved circular economy idea to address this is 'Bio-Degradable Packaging-as-a-Service' combined with 'Localised Composting Programs.' In this model, businesses would partner with innovative packaging suppliers to provide 100% compostable packaging made from plant-based materials.  When a customer purchases a product, they can use the packaging, and instead of returning it, they can dispose of it in their compost bin at home. If composting at 

In [20]:
generated_text = response[0]['generated_text']
header = "#### Response from falcon-7b-instruct:"
start_index = generated_text.find(header)
response_from_falcon = generated_text[start_index + len(header):].strip()

print(response_from_falcon)


This solution is definitely 'Moonshot'. It's an extremely ambitious idea that has the potential to have a huge impact on the environment and the global economy. It's also very feasible because of the growing popularity of compostable packaging and increasing interest in circular economy businesses.


## Loop through the Spreadsheet

In [21]:
## Classify the LLM output as Moonshot, Incremental or Uncertain (BERT is loaded in a later cell, but the labels come from a spaCy keyword matcher)

In [22]:
responses = []
header = "#### Response from falcon-7b-instruct:"

# Only classify the first 20 solutions while testing the prompt
for value in df['solution'].head(20):
    prompt = "This is the solution: " + str(value) + " Is this idea moonshot or incremental? If highly ambitious, groundbreaking, or an audacious project, answer 'moonshot'. If feasible, answer 'incremental'. Do not include any # symbols."
    response = get_completion_falcon(prompt)

    # Keep only the text after the response header
    generated_text = response[0]['generated_text']
    start_index = generated_text.find(header)
    response_from_falcon = generated_text[start_index + len(header):].strip()
    print(response_from_falcon)
    responses.append(response_from_falcon)

# Create a DataFrame from the responses list
response_df = pd.DataFrame({'Generated Response': responses})

# Print the DataFrame to check the values
# print(response_df)

# Save the DataFrame to a CSV file
response_df.to_csv('responses.csv', index=False)

In the context of this proposal, the idea of modular construction can be considered "moonshot" because it involves utilizing engineered components to construct a structure that is later assembled on-site. Unlike traditional construction methods, which typically involve a lot of waste and a longer time frame to complete, modular construction reduces waste and construction time while increasing sustainability. Additionally, it is a highly feasible approach, as it has been adopted in markets around the globe.
This question is subjective and difficult to answer. However, I would say that the proposed solution is definitely an incremental idea, as it involves implementing an existing technology in a new and innovative way, rather than developing a completely new method from scratch that involves a large investment of time and money. The concept of harnessing wind power without large, noisy windmill infrastructure could be interpreted as a moonshot idea, as it involves a significant change i
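The loop above saves only the generated text, so a response can no longer be traced back to its row once null rows have been dropped. One option is to carry the `id` alongside each response. The sketch below shows the idea with the standard library and a stand-in for the model call; `classify_rows` and `fake_complete` are hypothetical names, and the shortened prompt is an assumption for illustration.

```python
import csv
import io


def classify_rows(rows, complete, limit=20):
    """Call `complete` on up to `limit` (id, solution) pairs."""
    results = []
    for row_id, solution in rows[:limit]:
        prompt = f"This is the solution: {solution} Is this idea moonshot or incremental?"
        results.append({"id": row_id, "Generated Response": complete(prompt)})
    return results


def fake_complete(prompt):
    # Stand-in for the model call so the sketch runs without a GPU.
    return "incremental" if "swap" in prompt.lower() else "moonshot"


rows = [(1, "Modular construction."), (3, "A book swap program.")]
results = classify_rows(rows, fake_complete)

# Write to an in-memory buffer; a real run would open 'responses.csv' instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "Generated Response"])
writer.writeheader()
writer.writerows(results)
print(buffer.getvalue())
```

Passing the completion function as an argument also makes it straightforward to swap in the real model call later.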

In [23]:
import pandas as pd
import torch
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from transformers import BertTokenizer, BertForSequenceClassification
import spacy


data = pd.read_csv('/content/responses.csv')
data.isnull().sum()

Generated Response    0
dtype: int64

In [24]:
# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
# Tokenize and encode the data
values = data["Generated Response"].tolist()
print(values)

['In the context of this proposal, the idea of modular construction can be considered "moonshot" because it involves utilizing engineered components to construct a structure that is later assembled on-site. Unlike traditional construction methods, which typically involve a lot of waste and a longer time frame to complete, modular construction reduces waste and construction time while increasing sustainability. Additionally, it is a highly feasible approach, as it has been adopted in markets around the globe.', 'This question is subjective and difficult to answer. However, I would say that the proposed solution is definitely an incremental idea, as it involves implementing an existing technology in a new and innovative way, rather than developing a completely new method from scratch that involves a large investment of time and money. The concept of harnessing wind power without large, noisy windmill infrastructure could be interpreted as a moonshot idea, as it involves a significant cha

In [26]:
#!pip install spacy

In [29]:
import spacy

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# List of strings to be classified
strings = values  # Make sure 'values' is defined before this point

# Function to classify a string as "moonshot" or "incremental"
def classify_as_moonshot_or_incremental(text):
    doc = nlp(text)
    words = [token.text.lower() for token in doc]
    moonshot_keywords = ["moonshot", "ambitious", "revolutionary"]
    incremental_keywords = ["incremental", "feasible", "practical"]

    # The first relevant keyword in the text decides the label
    for i, word in enumerate(words):
        if word in moonshot_keywords:
            return "moonshot"
        elif word in incremental_keywords:
            return "incremental"
        elif word == "not":
            # Check both negation patterns in one branch; a second
            # `elif word == "not"` branch would never be reached
            following = words[i + 1:i + 3]
            if following[:1] and following[0] in moonshot_keywords:
                return "incremental"  # Negation of moonshot is considered incremental
            if following == ["a", "moonshot"]:
                return "incremental"  # Handles "not a moonshot"

    return "uncertain"  # If neither moonshot nor incremental keywords are found

# Classify each string in the list
for i, string in enumerate(strings):
    classification = classify_as_moonshot_or_incremental(string)
    print(f"String {i + 1}: {classification}")


String 1: moonshot
String 2: incremental
String 3: moonshot
String 4: moonshot
String 5: moonshot
String 6: moonshot
String 7: moonshot
String 8: incremental
String 9: incremental
String 10: moonshot
String 11: incremental
String 12: moonshot
String 13: moonshot
String 14: incremental
String 15: uncertain
String 16: incremental
String 17: moonshot
String 18: incremental
String 19: moonshot
String 20: moonshot
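A quick tally makes the label distribution easier to read than the per-string listing. The sketch below hard-codes the 20 labels printed above for illustration; in the notebook they would come from calling the classifier on `strings`.

```python
from collections import Counter

# Labels copied from the output above, in order (String 1 to String 20).
labels = [
    "moonshot", "incremental", "moonshot", "moonshot", "moonshot",
    "moonshot", "moonshot", "incremental", "incremental", "moonshot",
    "incremental", "moonshot", "moonshot", "incremental", "uncertain",
    "incremental", "moonshot", "incremental", "moonshot", "moonshot",
]

counts = Counter(labels)
for label, count in counts.most_common():
    print(f"{label}: {count} ({count / len(labels):.0%})")
```

For this sample the tally is 12 moonshot, 7 incremental, and 1 uncertain. Since the keyword matcher returns on the first keyword it meets, these counts reflect word order in the responses as much as the model's actual verdicts.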


In [28]:
import csv

csv_file_path = "classifications.csv"
with open(csv_file_path, mode="w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)

    # Write the header
    writer.writerow(["Classification"])

    # Write the classified results
    for string in strings:
        writer.writerow([classify_as_moonshot_or_incremental(string)])

print(f"Classifications saved to {csv_file_path}")

Classifications saved to classifications.csv
