In [None]:
import pandas as pd

In [None]:
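# 'latin-1' and 'ISO-8859-1' name the same codec, so one entry is enough.
# latin-1 accepts any byte sequence, which makes it a last-resort fallback.
encodings_to_try = ['utf-8', 'latin-1']

for encoding in encodings_to_try:
    try:
        df = pd.read_csv('/content/AI_EarthHack_Dataset.csv', encoding=encoding)
        break
    except UnicodeDecodeError:
        continue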
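
The loop above stops at the first encoding that decodes without error, but it does not record which one succeeded. A minimal sketch of the same fallback idea on raw bytes is shown here; `decode_with_fallback` is a hypothetical helper, not part of the notebook, and it raises if every candidate fails instead of leaving the result undefined.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes, encodings: Sequence[str] = ("utf-8", "latin-1")
) -> Tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes `raw`."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {list(encodings)} could decode the input")


# b'caf\xe9' is not valid UTF-8, so the helper falls through to latin-1.
text, used = decode_with_fallback("café".encode("latin-1"))
print(used, text)
```

In the notebook, the winning encoding name could then be passed to `pd.read_csv(..., encoding=used)`. Because latin-1 never fails, a file that is mostly UTF-8 with a few bad bytes will load but may contain mis-decoded characters.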

In [None]:
# check that all rows were loaded
print(len(df))

# count null values per column
print(df.isnull().sum())

# drop rows that contain null values
df = df.dropna()

In [None]:
df.head()

Unnamed: 0,id,problem,solution
0,1,The construction industry is indubitably one o...,"Herein, we propose an innovative approach to m..."
1,2,"I'm sure you, like me, are feeling the heat - ...","Imagine standing on a green hill, not a single..."
2,3,The massive shift in student learning towards ...,"Implement a """"Book Swap"""" program within educa..."
3,4,The fashion industry is one of the top contrib...,The proposed solution is a garment rental serv...
4,5,The majority of the materials used in producin...,An innovative concept would be a modular elect...


## Dolly 3B Model - by Databricks
---

1.  Model Name: Dolly-v2-3b
2.  Model Parameters: 2.8 billion (rounded to 3B in the model name)
3.  Training: Instruction-tuned Model
4.  Link: https://huggingface.co/databricks/dolly-v2-3b

---

In [None]:
# install dependencies
!pip install transformers
!pip install sentencepiece
!pip install accelerate

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99
Collecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.25.0


In [None]:
import torch                        # allows Tensor computation with strong GPU acceleration
from transformers import pipeline   # fast way to use pre-trained models for inference
import os

In [None]:
# load model
dolly_pipeline = pipeline(model="databricks/dolly-v2-3b",
                            torch_dtype=torch.bfloat16,
                            trust_remote_code=True,
                            device_map="auto")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/databricks/dolly-v2-3b:
- instruct_pipeline.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



In [None]:
# define helper function
def get_completion_dolly(user_input):
  # `user_input` replaces the original parameter name `input`, which shadowed the Python built-in.
  # The system prompt text is kept verbatim so it matches the recorded outputs.
  system = """
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  """
  prompt = f"#### System: {system}\n#### User: \n{user_input}\n\n#### Response from Dolly-v2-3b:"
  print(prompt)  # echo the full prompt so each run can be inspected
  dolly_response = dolly_pipeline(prompt,
                                  max_new_tokens=500
                                  )
  return dolly_response
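
The helper above builds the prompt and calls the model in one step. As a minimal sketch, the prompt assembly can be pulled out into its own function so the exact text can be inspected without loading the model. `build_prompt` and `RESPONSE_HEADER` are hypothetical names introduced here, and the short system text is only a stand-in for the full one.

```python
import textwrap

RESPONSE_HEADER = "#### Response from Dolly-v2-3b:"


def build_prompt(system: str, user_input: str) -> str:
    """Assemble the system, user, and response sections into one prompt string."""
    # Remove the indentation that a triple-quoted string inside a function carries.
    system = textwrap.dedent(system).strip()
    return (
        f"#### System:\n{system}\n"
        f"#### User:\n{user_input.strip()}\n\n"
        f"{RESPONSE_HEADER}"
    )


example = build_prompt(
    """
    Identify each idea as 'moonshot' or 'incremental'.
    """,
    "Companies can offer products as a service.",
)
print(example)
```

Keeping the header in one constant also means any later code that searches for it uses the same string that was sent to the model.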

## Single Test Prompts

In [None]:
# let's prompt
prompt = '''Herein, we propose an innovative approach to mitigate this problem: Modular Construction.
This method embraces recycling and reuse, taking a significant stride towards a circular economy.
  Modular construction involves utilizing engineered components in a manufacturing facility that are
  later assembled on-site. These components are designed for easy disassembling, enabling them to be reused
  in diverse projects, thus significantly reducing waste and conserving resources.
  Not only does this method decrease construction waste by up to 90%, but it also decreases
  construction time by 30-50%, optimizing both environmental and financial efficiency.
  This reduction in time corresponds to substantial financial savings for businesses.
  Moreover, the modular approach allows greater flexibility, adapting to changing needs over time.
  We believe, by adopting modular construction, the industry can transit from a 'take, make and dispose'
  model to a more sustainable 'reduce, reuse, and recycle' model, driving the industry towards a more circular
  and sustainable future. The feasibility of this concept is already being proven in markets around the globe,
  indicating its potential for scalability and real-world application. Is this a moonshot idea or incremental?'''
response = get_completion_dolly(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
Herein, we propose an innovative approach to mitigate this problem: Modular Construction.
This method embraces recycling and reuse, taking a significant stride towards a circular economy.
  Modular construction involves utilizing engineered components in a manufacturing facility that are
  later assembled on-site. These components are designed for easy disassembling, enabling them to be reused
  in diverse projects, thus significantly reducing waste and conserving resou

In [None]:
prompt = '''Imagine standing on a green hill, not a single towering, noisy windmill in sight, and yet,
you're surrounded by wind power generation! Using existing, yet under-utilized technology,
I propose a revolutionary approach to harness wind energy on a commercial scale, without those
""monstrously large and environmentally damaging windmills"".
With my idea, we could start construction tomorrow and give our electrical grid the jolt it needs,
creating a future where clean, quiet and efficient energy isn't a dream, but a reality we live in.
This is not about every home being a power station, but about businesses driving a green revolution
from the ground up! Is this a moonshot idea or incremental? Pick one word.'''
response = get_completion_dolly(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
Imagine standing on a green hill, not a single towering, noisy windmill in sight, and yet,
you're surrounded by wind power generation! Using existing, yet under-utilized technology,
I propose a revolutionary approach to harness wind energy on a commercial scale, without those
""monstrously large and environmentally damaging windmills"".
With my idea, we could start construction tomorrow and give our electrical grid the jolt it needs,
creating a future where clean, quiet

In [None]:
prompt = '''Implement a ""Book Swap"" program within educational institutions and local communities.
This platform allows students to trade books they no longer need with others who require them,
reducing the need for new book production and hence, lowering the rate of resource depletion.
Furthermore, the platform could have a digital component to track book exchanges, giving users credits
for each trade, which they can accrue and redeem. This system encourages and amplifies the benefits of
reusing and sharing resources, thus contributing to the circular economy.   By integrating gamification,
getting students and parents involved and providing an easy-to-use platform, the program could influence
a cultural shift towards greater resource value appreciation and waste reduction.
In terms of the financial aspect, less reliance on purchasing new books could save money for students,
parents and schools. Is this idea moonshot or incremental?'''
response = get_completion_dolly(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
Implement a ""Book Swap"" program within educational institutions and local communities.
This platform allows students to trade books they no longer need with others who require them,
reducing the need for new book production and hence, lowering the rate of resource depletion.
Furthermore, the platform could have a digital component to track book exchanges, giving users credits
for each trade, which they can accrue and redeem. This system encourages and amplifies the be

In [None]:
prompt = '''This is a solution to a problem."Companies can offer products as a service,
where customers pay for access or usage rather than ownership.
This can be done through subscription models or pay-per-use systems."
Is this idea 'moonshot' or 'incremental'?'''
response = get_completion_dolly(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is a solution to a problem."Companies can offer products as a service,
where customers pay for access or usage rather than ownership.
This can be done through subscription models or pay-per-use systems."
Is this idea 'moonshot' or 'incremental'?

#### Response from Dolly-v2-3b:
'moonshot'


In [None]:
prompt = "This is the problem." + df['problem'][0] + "This is the solution." + df['solution'][0] + "Is this idea moonshot or incremental?"
response = get_completion_dolly(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the problem.The construction industry is indubitably one of the significant contributors to global waste, contributing approximately 1.3 billion tons of waste annually, exerting significant pressure on our landfills and natural resources. Traditional construction methods entail single-use designs that require frequent demolitions, leading to resource depletion and wastage.   This is the solution.Herein, we propose an innovative approach to mitigate this problem:

In [None]:
prompt = '''This is a solution to a problem."Companies can offer products as a service,
where customers pay for access or usage rather than ownership.
This can be done through subscription models or pay-per-use systems."
Is this idea 'moonshot' or 'incremental'? Answer with exactly one word either "Moonshot" or "Incremental".'''
response = get_completion_dolly(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is a solution to a problem."Companies can offer products as a service,
where customers pay for access or usage rather than ownership.
This can be done through subscription models or pay-per-use systems."
Is this idea 'moonshot' or 'incremental'? Answer with exactly one word either "Moonshot" or "Incremental".

#### Response from Dolly-v2-3b:
Moonshot


In [None]:
prompt = "This is the problem." + df['problem'][0] + "This is the solution." + df['solution'][0] + "Is this idea moonshot or incremental? Answer with exactly one word either 'Moonshot' or 'Incremental'. Do not include other words."
response = get_completion_dolly(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the problem.The construction industry is indubitably one of the significant contributors to global waste, contributing approximately 1.3 billion tons of waste annually, exerting significant pressure on our landfills and natural resources. Traditional construction methods entail single-use designs that require frequent demolitions, leading to resource depletion and wastage.   This is the solution.Herein, we propose an innovative approach to mitigate this problem:

In [None]:
prompt = "This is the solution." + df['solution'][80] + "Is this idea moonshot or incremental? Answer with exactly one word either 'Moonshot' or 'Incremental'. Do not include other words."
response = get_completion_dolly(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the solution.A refined circular economy idea is a ""Tech-Device as a Service"" model. This takes the original idea of leasing and refurbishing, but enhances feasibility by operationalizing the service model in tech industries.  In this model, manufacturers or specialized service providers do not just lease the device, but also provide various essential services attached to it â such as timely software upgrades, preventive maintenance, repairs, data security, a

In [None]:
prompt = "This is the solution:" + df['solution'][1103] + "Is this idea moonshot or incremental? Answer with exactly one word either 'Moonshot' or 'Incremental'. Do not include other words."
response = get_completion_dolly(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the solution:The proposed solution involves establishing a ""Container Deposit Scheme"" (CDS), much like a ""recycling refund"", for businesses in the food and beverage industry. Here, customers pay a small deposit when they purchase a bottled or canned drink, which is refunded to them when they return the item after use. The returned containers are then collected by the manufacturers, who are responsible for the containers' recycling or reuse.  This way, the CD

In [None]:
# generated_text = response[0]['generated_text']
# print(response[0]['generated_text'])
# header = "#### Response from Dolly-v2-3b:"
# start_index = generated_text.find(header)
# print(start_index)
# response_from_dolly = generated_text[start_index + len(header):].strip()

# print(response_from_dolly)
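
The commented-out cell above slices the text at the position returned by `str.find`, which is -1 when the header is absent, so the slice would then start at an arbitrary offset. A minimal sketch of a safer version follows; `extract_response` is a hypothetical helper, and it assumes the generated text may or may not echo the prompt.

```python
HEADER = "#### Response from Dolly-v2-3b:"


def extract_response(generated_text: str, header: str = HEADER) -> str:
    """Return the text after `header`, or the whole text if the header is missing."""
    _, found, after = generated_text.partition(header)
    return after.strip() if found else generated_text.strip()


print(extract_response("#### User: \nsome idea\n\n" + HEADER + "\n'moonshot'\n"))
print(extract_response("Moonshot"))
```

`str.partition` returns an empty separator when the header is not found, which makes the missing-header case explicit.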

In [None]:
prompt = "This is the solution:" + df['solution'][1102] + "Is this idea moonshot or incremental? Answer with exactly one word either 'Moonshot' or 'Incremental'. Do not include other words."
response = get_completion_dolly(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the solution:An improved circular economy idea to address this is 'Bio-Degradable Packaging-as-a-Service' combined with 'Localised Composting Programs.' In this model, businesses would partner with innovative packaging suppliers to provide 100% compostable packaging made from plant-based materials.  When a customer purchases a product, they can use the packaging, and instead of returning it, they can dispose of it in their compost bin at home. If composting at h

In [None]:
prompt = "This is the solution: " + df['solution'][1102] + "Is this idea moonshot or incremental? Answer with exactly one word either 'Moonshot' or 'Incremental'. Do not include other words."
response = get_completion_dolly(prompt)
print(response[0]['generated_text'])

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the solution: An improved circular economy idea to address this is 'Bio-Degradable Packaging-as-a-Service' combined with 'Localised Composting Programs.' In this model, businesses would partner with innovative packaging suppliers to provide 100% compostable packaging made from plant-based materials.  When a customer purchases a product, they can use the packaging, and instead of returning it, they can dispose of it in their compost bin at home. If composting at 

In [None]:
# generated_text = response[0]['generated_text']
# header = "#### Response from Dolly-v2-3b:"
# start_index = generated_text.find(header)
# response_from_dolly = generated_text[start_index + len(header):].strip()

# print(response_from_dolly)


This is an incredible idea. It's a very incremental and achievable moonshot.


## Loop through the Spreadsheet

In [None]:
responses = []
max_rows = 20  # only process the first 20 solutions while testing

for value in df['solution'].head(max_rows):
    prompt = "This is the solution: " + str(value) + " Is this idea moonshot or incremental? Do not include other words."
    response = get_completion_dolly(prompt)

    # Keep the generated text of the first returned sequence
    responses.append(response[0]['generated_text'])

# Create a DataFrame from the responses list
response_df = pd.DataFrame({'Generated Response': responses})

# Print the DataFrame to check the values
print(response_df)

# Save the DataFrame to a CSV file
response_df.to_csv('responses.csv', index=False)

#### System: 
  You are an expert venture capital (VC) expert.
  You are good at looking at a stack of potential startup pitches to evaluate innovative circular economy business opportunities that were crowdsourced from an exciting innovation contest.
  The term "moonshot" is often used to describe an ambitious, groundbreaking, and seemingly impossible project or goal.
  The term "incremental" will be use to describe less ambitious, more feasible ideas.
  Identify each idea as 'moonshot' or 'incremental'.
  
#### User: 
This is the solution: Herein, we propose an innovative approach to mitigate this problem: Modular Construction. This method embraces recycling and reuse, taking a significant stride towards a circular economy.   Modular construction involves utilizing engineered components in a manufacturing facility that are later assembled on-site. These components are designed for easy disassembling, enabling them to be reused in diverse projects, thus significantly reducing waste an
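
The loop above stores only the generated text, so the link between a response and its source row is lost. A minimal sketch of a batch helper that keeps the row position is shown here. `classify_batch` and `fake_complete` are hypothetical names, and the stub stands in for the model so the sketch runs without a GPU.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, List


def classify_batch(
    solutions: Iterable[str], complete: Callable[[str], str], limit: int = 20
) -> List[Dict[str, object]]:
    """Prompt `complete` for at most `limit` solutions and keep each row position."""
    rows = []
    for position, solution in enumerate(islice(solutions, limit)):
        prompt = (
            "This is the solution: " + str(solution)
            + " Is this idea moonshot or incremental? Do not include other words."
        )
        rows.append({"row": position, "response": complete(prompt).strip()})
    return rows


def fake_complete(prompt: str) -> str:
    # Stub: always answers the same way; replace with a real model call.
    return " Moonshot \n"


rows = classify_batch(["idea A", "idea B", "idea C"], fake_complete, limit=2)
print(rows)
```

In the notebook, `complete` could be something like `lambda p: get_completion_dolly(p)[0]['generated_text']`, and the resulting list of dicts can be passed directly to `pd.DataFrame`.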

In [None]:
## Use BERT to classify the LLM output as moonshot or incremental

In [None]:
import pandas as pd
import torch
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from transformers import BertTokenizer, BertForSequenceClassification
import spacy


data = pd.read_csv('/content/responses.csv')
data.isnull().sum()

Generated Response    0
dtype: int64

In [None]:
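# Load pre-trained BERT model and tokenizer.
# Note: the 2-label classification head is newly initialized, so the model
# must be fine-tuned before its predictions are meaningful.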
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Collect the generated responses into a list of strings
values = data["Generated Response"].tolist()
print(values)

['Moonshot', 'The solution: It is a moonshot.', 'Moonshot', 'Moonshot', 'Moonshot', 'Moonshot', 'moonshot', 'Moonshot!', "I would define'moonshot' as an ambitious, far-fetched, and impossible to accomplish goal. Since this definition is incompatible with'moonshot', I would not include the term in the response.", 'Moonshot', "'Moonshot'", "This is the solution: Our solution to this is to transform the way we consume fashion through the creation of a shared fashion platform â\x80\x93 a fashion library. The fashion library will function on the concept of lending versus owning; it's like Airbnb but for clothes.   \n\nCustomers become members and can borrow from a vast clothing collection for a duration of their choice, starting from day-long rentals for special occasions to month-long arrangements for regular wear. The clothes are then returned, cleaned, and made available again for the second round. This creates a constantly rotating wardrobe, reducing the need for production of new cloth
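
The `â\x80\x93` sequence in the output above looks like a UTF-8 en dash (bytes `E2 80 93`) that was decoded as latin-1 when the CSV was loaded. A minimal sketch of undoing that for a single string is shown here, assuming the text really was UTF-8 read as latin-1. `repair_mojibake` is a hypothetical helper, and it leaves the input untouched when the round trip fails.

```python
def repair_mojibake(text: str) -> str:
    """Re-decode text that was UTF-8 but got read as latin-1; otherwise return it unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in latin-1, or not valid UTF-8 afterwards: keep as-is.
        return text


print(repair_mojibake("platform â\x80\x93 a fashion library"))
print(repair_mojibake("Moonshot"))
```

Correctly decoded text with accented characters, such as `café`, fails the UTF-8 step and is returned unchanged, so the helper is safe to apply to every row.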

In [None]:
# pip install spacy


In [None]:
# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# List of strings to be classified
strings = values

# Function to classify a string as "moonshot" or "incremental"
def classify_as_moonshot_or_incremental(text):
    doc = nlp(text)

    # Check for the presence of relevant keywords in the text
    if any(token.text.lower() in ["moonshot", "ambitious", "revolutionary"] for token in doc):
        return "moonshot"
    elif any(token.text.lower() in ["incremental", "feasible", "practical"] for token in doc):
        return "incremental"
    else:
        return "uncertain"  # If neither moonshot nor incremental keywords are found

# Classify each string in the list
for i, string in enumerate(strings):
    classification = classify_as_moonshot_or_incremental(string)
    print(f"String {i + 1}: {classification}")


String 1: moonshot
String 2: moonshot
String 3: moonshot
String 4: moonshot
String 5: moonshot
String 6: moonshot
String 7: moonshot
String 8: moonshot
String 9: moonshot
String 10: moonshot
String 11: moonshot
String 12: moonshot
String 13: moonshot
String 14: moonshot
String 15: moonshot
String 16: moonshot
String 17: moonshot
String 18: moonshot
String 19: moonshot
String 20: moonshot
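
All 20 strings above are labeled moonshot. One likely contributor is that the moonshot keywords are tested first, so a response that mentions both labels (for example, one that echoes "Is this idea moonshot or incremental?") can never be labeled incremental. A minimal sketch of an order-independent variant follows. It swaps spaCy tokenization for a plain regular expression, and `classify_by_keywords` is a hypothetical name.

```python
import re

MOONSHOT_WORDS = {"moonshot", "ambitious", "revolutionary"}
INCREMENTAL_WORDS = {"incremental", "feasible", "practical"}


def classify_by_keywords(text: str) -> str:
    """Label text by keyword group; return 'uncertain' if both or neither group appears."""
    tokens = re.findall(r"[a-z]+", text.lower())
    has_moonshot = any(token in MOONSHOT_WORDS for token in tokens)
    has_incremental = any(token in INCREMENTAL_WORDS for token in tokens)
    if has_moonshot and has_incremental:
        return "uncertain"  # conflicting evidence: flag for manual review
    if has_moonshot:
        return "moonshot"
    if has_incremental:
        return "incremental"
    return "uncertain"


for sample in ["Moonshot!", "'Moonshot'", "Incremental",
               "It's a very incremental and achievable moonshot."]:
    print(repr(sample), "->", classify_by_keywords(sample))
```

Treating mixed responses as uncertain is a design choice. Counting occurrences or using the first keyword found are alternatives, but each would still guess on ambiguous text.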


In [None]:
import csv

csv_file_path = "classifications.csv"
with open(csv_file_path, mode="w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)

    # Write the header
    writer.writerow(["Classification"])

    # Write the classified results
    for string in strings:
        writer.writerow([classify_as_moonshot_or_incremental(string)])

print(f"Classifications saved to {csv_file_path}")

Classifications saved to classifications.csv
