### Goal of project: Replicate a few existing projects in Langchain and answer questions I have so I can build a full-stack use case soon. 

**Information to note:** I followed a few tutorials to write this code. I confirmed my understanding of the langchain package code by accessing the langchain open-source repo on GitHub and typing the classes, functions, methods used in the package into ChatGPT with python-like comments. The markdown in the notebook was validated as *"correct, capturing the essential points, spot on"* by ChatGPT too. 

ChatGPT comment on this strategy: "ChatGPT is to answering questions and providing information on various topics as unittests are to testing Python code and ensuring its correctness."

### What is the os module, and why is it useful for an API key?

The `os` module is part of Python's standard library and lets a program interact with the operating system it runs on. In my case that's macOS. With it I can access files, work with directories, run system commands, etc. The `os.environ` dictionary-like object is useful for an API key because any code running in the same process (including the langchain package) can read the key from it. It's also easier to manage sensitive information this way, because the value can be set outside the code (for example in the shell) instead of being hard-coded.
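As a minimal sketch of that idea, the key can be read from the environment instead of being typed into the notebook. The variable name `OPENAI_API_KEY` is the one used in the cell below; the helper `get_api_key` and the placeholder value are hypothetical and only there so the example runs on its own.

```python
import os


def get_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the environment, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it in your shell before starting the notebook")
    return value


# Demo only: a placeholder (not a real key) is set if nothing is there yet.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")
key = get_api_key()
print(key[:3] + "...")  # print only a prefix so the key never lands in notebook output
```

Reading the value this way keeps the secret out of the notebook file, so the notebook can be shared without leaking it.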

### What is an API Key doing when you call it inside your code?

When I send a prompt with an API request, I'm sending a string to the server, where my request is processed. The response is sent back to me on the client side, where I can see the output in my IDE. The API key travels with each request: it authenticates me as a verified user and lets the server confirm my permissions.
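A small sketch of where the key ends up in a request. OpenAI's HTTP API expects the key as a bearer token in the `Authorization` header; the helper name `build_auth_headers` and the placeholder key are hypothetical, and no request is actually sent here.

```python
def build_auth_headers(api_key: str) -> dict:
    """Build the HTTP headers that carry the key as a bearer token."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


headers = build_auth_headers("sk-placeholder")  # placeholder, not a real key
print(headers["Authorization"])
```

The langchain `OpenAI` class does this step for me, which is why I only have to put the key in `os.environ`.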

In [1]:
import os 
# key redacted: set OPENAI_API_KEY in your shell, or paste your own key here for local use only
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

### What is the langchain.llms module? Why is it useful/why should someone think about using langchain?

The `langchain.llms` module includes wrappers for LLM providers such as Hugging Face, DeepInfra, GooseAI, OpenAI, etc. Many of the models behind them are based on similar underlying technology, such as GPT-3 or other transformer models.

Foundational LLMs are building blocks for applications that want to use them without handling the details of training, RLHF, etc. Langchain is useful because it gives developers a way to work with those building blocks, so they can build applications that generate text from OpenAI's API and other foundational models and apply it to use cases that aren't possible in the playground or ChatGPT.

In [2]:
from langchain.llms import OpenAI

### When you access the OpenAI API, how does it know to shift temperature, token size (if you defined this), any other parameter?

When a user accesses the OpenAI API and provides parameters, this information is passed along with the prompt and the behavior of the model adjusts accordingly. 

All parameters are included in something called the *request payload*, which is the data sent by a client to the server as part of an API request. Payloads are usually in a structured format such as JSON or XML; OpenAI's API uses JSON.

The server uses the payload to process a request and send something back to the client. 
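To make the payload idea concrete, here is a rough sketch of what a completion request body could look like. The field names `model`, `prompt`, `temperature`, and `max_tokens` are real parameters of OpenAI's completions endpoint, but the specific model name and values are assumptions for illustration, and nothing is sent over the network.

```python
import json

# Assumed values for illustration; langchain fills these in from the OpenAI(...) arguments.
payload = {
    "model": "text-davinci-003",   # assumed model name
    "prompt": "What are 5 vacation destinations for someone who likes to eat pasta?",
    "temperature": 0.9,
    "max_tokens": 256,             # assumed limit
}

body = json.dumps(payload)      # what the client sends: a JSON string
received = json.loads(body)     # what the server reads back out of it
print(received["temperature"])
```

The server never sees my Python objects, only this serialized text, which is how a setting like temperature reaches the model.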

In [3]:
# creating an instance of the OpenAI class with temperature set to 0.9.
# temperature controls how random (or 'creative') the model's token choices are.
# OpenAI's API accepts values from 0 to 2: 0 is close to deterministic, higher values are more random
llm = OpenAI(temperature=0.9)
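A sketch of what temperature does to token probabilities, using the textbook formulation (divide the scores by the temperature, then apply softmax). This is not OpenAI's actual code, and the scores below are made up. A temperature of exactly 0 can't be plugged into this formula; it is usually treated as "always pick the top token".

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    biggest = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.9, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the low temperature nearly all of the probability sits on the top token; at the high temperature the three options are much closer together.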

### This output is just the transformer running on the input, the same way a query on ChatGPT would give an output, correct? 

Yes. I can specify which of OpenAI's text models and versions I'd like to access in my request payload, but in every case the output comes from a transformer model running on my input text, the same way it does for a query in ChatGPT.

In [5]:
text = "What are 5 vacation destnations for someone who likes to eat pasta?"
print(llm(text))



1. Rome, Italy 
2. Bologna, Italy 
3. Naples, Italy 
4. Verona, Italy 
5. Florence, Italy


### Agents

### What are langchain agents? What are they useful for, why are they helpful in this use case?

Agents are systems langchain has designed to interact with outside tools and services to perform more complex tasks and take actions on behalf of a user. An agent can interact with an email API to draft and send emails, manage a calendar by interacting with a calendar API, fetch information from external APIs like weather or news services, etc.

Agents are more powerful chatbots that can do more tasks and integrate with other services to provide a comprehensive user experience. 

Agents are autonomous, meaning that when given a task they can complete it and keep running while the user who programmed the agent does something else. In the example of drafting and sending emails, once the code for this process is built and implemented, the programmer doesn't have to interact at all; everything is done without their intervention. Chatbots, by contrast, require a human in the conversation at all times, so I don't know if the two are comparable.

In [6]:
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.llms import OpenAI 

In [7]:
llm = OpenAI(temperature=0)

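### What is Serp API? What is LLM-math?
SERP stands for search engine results page. A SERP API allows developers to access search engine results from Bing, Google, or Yahoo programmatically. The results come back in a structured format, such as links and snippets.

SERP APIs include the Google Search API, the Bing Search API, and third-party APIs such as SerpApi, which is what the `serpapi` tool below uses.

`llm-math` is another tool name that can be passed to `load_tools`; it uses the LLM to work through math questions. It isn't loaded in the code below.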

### What is the load_tools function?

When I call `load_tools`, I'm setting up a list of tools the agent can use. Here that's a single search tool backed by SerpApi. The search results are then processed by the OpenAI model that I defined earlier and gave a temperature to.

In [8]:
# key redacted: set SERPAPI_API_KEY in your shell, or paste your own key here for local use only
os.environ["SERPAPI_API_KEY"] = "<your-serpapi-api-key>"
tools = load_tools(['serpapi'], llm=llm)

### What is happening in the code below? Especially when you write agent=""? 

*Marked as correct in understanding by ChatGPT:* `initialize_agent` is, I'm assuming, a function in langchain's code that takes in several parameters and returns an agent object. Calling `agent.run` then uses the information from those parameters, like `tools` (the search API) and `llm` (the model I'd like to use), to complete the request. I wonder what the `agent` argument does. I wonder if zero-shot means it hasn't seen this prompt or been trained on the question I'm about to ask. (In langchain's naming, `zero-shot-react-description` refers to an agent that follows the ReAct pattern of alternating reasoning and actions, and that picks a tool based only on each tool's description, without worked examples.) And I set `verbose=True` so I can see the steps the model takes to complete the request and give me an answer.

### How does verbose work?

1. The agent starts by researching the topic online using the SERP API. It queries "Chamath Palihapitya."
2. The agent gathers information, shown after `Observation:` in the output, and then decides to do another search. It uses the information it has gathered so far to figure out what to search next.
3. The process of searching for more information, with context from the previous observation and the initial question, continues until the agent thinks it has answered the question well enough. When it reaches this point it takes the observations, sends them to OpenAI's API in a request, and gets the final output, ending the executor chain.
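The three steps above can be sketched as a plain loop. This is a toy illustration of the thought / action / observation cycle, not langchain's implementation: `fake_llm` and `fake_search` are hypothetical stand-ins that return canned values instead of calling OpenAI or a search API.

```python
def fake_llm(question, observations):
    """Stand-in for the model: decide the next step from what has been seen so far."""
    if len(observations) < 2:
        return {
            "thought": "I need more information.",
            "action": "Search",
            "action_input": f"{question} (follow-up {len(observations) + 1})",
        }
    return {"thought": "I now know the final answer.", "final_answer": " | ".join(observations)}


def fake_search(query):
    """Stand-in for the search tool: returns a canned snippet instead of calling an API."""
    return f"snippet for '{query}'"


def run_agent(question, max_steps=5, verbose=True):
    observations = []
    for _ in range(max_steps):
        step = fake_llm(question, observations)
        if verbose:
            print("Thought:", step["thought"])
        if "final_answer" in step:
            return step["final_answer"]  # the model decided it has enough information
        observation = fake_search(step["action_input"])
        if verbose:
            print("Action:", step["action"])
            print("Action Input:", step["action_input"])
            print("Observation:", observation)
        observations.append(observation)  # fed back to the model on the next pass
    return "Stopped: step limit reached."


answer = run_agent("Who is this person?")
print(answer)
```

With `verbose=True` each pass prints its thought, action, and observation, which is the same shape as the log langchain shows in the output further down.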

In [9]:
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

In [None]:
agent.run("Who is Chamath Palihapitya? Where did he go to school? Where does he work? What's the best way to get in contact with him?")



> Entering new AgentExecutor chain...
I should start by researching him online.
Action: Search
Action Input: "Chamath Palihapitya"
Observation: Chamath Palihapitiya is a Sri Lankan-born Canadian and American venture capitalist, engineer, SPAC sponsor, founder and CEO of Social Capital. Palihapitiya was an early senior executive at Facebook, working at the company from 2007 to 2011.
Thought: I should look for more information about his education and current work.
Action: Search
Action Input: "Chamath Palihapitya education"
Observation: Chamath Palihapitiya is a Sri Lankan-born Canadian and American venture capitalist, engineer, SPAC sponsor, founder and CEO of Social Capital. Palihapitiya was an early senior executive at Facebook, working at the company from 2007 to 2011.
Thought: I should look for more information about how to contact him.
Action: Search
Action Input: "Chamath Palihapitya contac

'Chamath Palihapitiya is a Sri Lankan-born Canadian and American venture capitalist, engineer, SPAC sponsor, founder and CEO of Social Capital. He went to the University of Waterloo and Stanford University. The best way to get in contact with him is to call the number 9817xxxxxxx.'

# What are these packages?

# Explain what they are and the code that makes them work. Have .md files that show the code in the file and the explanation of what is going on, i.e. remove layers of abstraction

# What is the json library?

In [16]:
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
import pandas as pd
import json 

### Where does the ChatOpenAI class come from and why is it useful here?

`ChatOpenAI` is a Langchain class that wraps OpenAI's chat models. It's used to hold a conversation with an OpenAI model, with an experience similar to ChatGPT but within the Langchain framework.

In [17]:
# using the ChatOpenAI class because it's cheaper than calling the OpenAI davinci model through the API
chat_model = ChatOpenAI(temperature = 0)

#### What is a response schema?

In the example, and broadly, a response schema is the developer specifying the expected format of a response from an API, function, or class. In the example below, `ResponseSchema` is a class that takes a name (the key I expect in the output, e.g. `input_industry`) and a description (what that key should contain). The output parser turns these into format instructions for the model and later checks that each key is present in the response.

In [62]:
# THIS CODE IS FROM LANGCHAIN, MY COMMENTS ARE ME UNDERSTANDING WHAT THE CODE DOES

""" 
class StructuredOutputParser(BaseOutputParser):
    response_schemas: List[ResponseSchema]
# used to call a method on the class itself instead of a specific instance of the class
    @classmethod
# creates a new instance of StructuredOutputParser using a list from ResponseSchema
# provides an alternative constructor. Alternative constructors are useful because they allow programmers
# to create an instance directly from the list of input objects (what does this mean?)
    def from_response_schemas(
        cls, response_schemas: List[ResponseSchema]
    ) -> StructuredOutputParser:
        return cls(response_schemas=response_schemas)

    def get_format_instructions(self) -> str:
        schema_str = "\n".join(
            [_get_sub_string(schema) for schema in self.response_schemas]
        )
        return STRUCTURED_FORMAT_INSTRUCTIONS.format(format=schema_str)

    def parse(self, text: str) -> BaseModel:
        json_string = text.split("```json")[1].strip().strip("```").strip()
        json_obj = json.loads(json_string)
        for schema in self.response_schemas:
            if schema.name not in json_obj:
                raise ValueError(
                    f"Got invalid return object. Expected key `{schema.name}` "
                    f"to be present, but got {json_obj}"
                )
        return json_obj
"""

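To unpack the "(what does this mean?)" note on alternative constructors, here is a stripped-down sketch of the same pattern. `Schema` and `Parser` are hypothetical stand-ins written for this example, not langchain's classes, and the format string only imitates the shape of the instructions printed later in the notebook.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Schema:
    name: str
    description: str


class Parser:
    def __init__(self, schemas: List[Schema]):
        self.schemas = schemas

    @classmethod
    def from_schemas(cls, schemas: List[Schema]) -> "Parser":
        # cls is the class the method was called on (Parser or a subclass),
        # so this returns a new instance of that class.
        return cls(schemas)

    def format_instructions(self) -> str:
        lines = [f'\t"{s.name}": string  // {s.description}' for s in self.schemas]
        return "{\n" + "\n".join(lines) + "\n}"


class StrictParser(Parser):
    """Subclass used only to show that from_schemas builds the right type."""


parser = Parser.from_schemas([Schema("input_industry", "The industry typed by the user.")])
strict = StrictParser.from_schemas([])
print(parser.format_instructions())
print(type(strict).__name__)
```

Because the method receives the class rather than an instance, it can be called before any parser exists, and a subclass gets a constructor of its own type for free.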

In [18]:
response_schemas = [
    ResponseSchema(name="input_industry", description="This is the input industry from the user."),
    ResponseSchema(name="standardized_industry", description="This is the industry you feel most closely matched to the ")
]

output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

In [20]:
# how is get_format_instructions() different from _get_sub_string()? The leading underscore marks a helper meant for internal use.
# get_format_instructions() is the public method, called here to retrieve the format instructions from the output parser
format_instructions = output_parser.get_format_instructions()
print(output_parser.get_format_instructions())

The output should be a markdown code snippet formatted in the following schema:

```json
{
	"input_industry": string  // This is the input industry from the user.
	"standardized_industry": string  // This is the industry you feel most closely matched to the 
}
```


In [21]:
template = """
You will be given a series of industry names from a user. 
Find the best corresponding match on the list of standardized names. 
The closest match will be the one with the closest semantic meaning. Not just string similarity. 

{format_instructions}

Wrap your final output with closed and open brackets (a list of json objects)

input_industry INPUT:
{user_industries}

STANDARDIZED INDUSTRIES: 
{standardized_industries}

YOUR RESPONSE: 
"""

prompt = ChatPromptTemplate(
    messages = [
        HumanMessagePromptTemplate.from_template(template)
    ], 
    input_variables = ["user_industries", "standardized_industries"],
    partial_variables = {'format_instructions' : format_instructions}
)
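A rough sketch of how partial variables and input variables could be combined when the prompt is formatted. This uses plain `str.format` and is an assumption about the general idea, not langchain's code; the shortened template and the helper `format_prompt_text` are hypothetical.

```python
# Shortened stand-in for the template above.
demo_template = (
    "{format_instructions}\n"
    "input_industry INPUT:\n{user_industries}\n"
    "STANDARDIZED INDUSTRIES:\n{standardized_industries}\n"
)

# Partial variables are fixed once, when the prompt object is created.
partial_variables = {"format_instructions": "Reply with a JSON list."}


def format_prompt_text(**input_variables):
    """Fill the template with the fixed values plus the values supplied at call time."""
    return demo_template.format(**partial_variables, **input_variables)


text = format_prompt_text(
    user_industries="airline, bread",
    standardized_industries="Aviation & Aerospace, Food Production",
)
print(text)
```

The split means I only pass the two values that change per call, while the format instructions ride along automatically.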

In [22]:
import requests
from bs4 import BeautifulSoup
import csv

url = "https://back2marketingschool.com/linkedin-industries-list/"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table")

# Extract table headers
headers = [th.text for th in table.find_all("th")]

# Extract table rows
rows = []
for tr in table.find_all("tr"):
    row_data = [td.text.strip() for td in tr.find_all("td")]
    if row_data:  # Skip empty rows
        rows.append(row_data)

# Save data to a CSV file
with open("linkedin_industries.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(rows)

print("Data saved to linkedin_industries.csv")

Data saved to linkedin_industries.csv


In [60]:
import pandas as pd

df = pd.read_csv('linkedin_industries.csv')
df = df.drop(['Subcatergory'], axis=1)

# Convert the values in the 'Industry Catergory' column to strings
df['Industry Catergory'] = df['Industry Catergory'].astype(str)

standardized_industries = ','.join(df['Industry Catergory'].values)
print(standardized_industries)

Education,E-Learning,Higher Education,Primary/Secondary Education,Research,nan,Construction,Civil Engineering,Construction,nan,Design,Design,Graphic Design,nan,Corporate Services,Business Supplies & Equipment,Environmental Services,Events Services,Executive Office,Facilities Services,Human Resources,Information Services,Management Consulting,Outsourcing/Offshoring,Professional Training & Coaching,Security & Investigations,Staffing & Recruiting,nan,Retail,Supermarkets,Wholesale,nan,Energy & Mining,Oil & Energy,Utilities,nan,Manufacturing,Aviation & Aerospace,Chemicals,Defense & Space,Electrical & Electronic Manufacturing,Food Production,Glass, Ceramics & Concrete,Industrial Automation,Machinery,Mechanical or Industrial Engineering,Packaging & Containers,Paper & Forest Products,Plastics,Railroad Manufacture,Renewables & Environment,Shipbuilding,Textiles,nan,Finance,Capital Markets,Financial Services,Insurance,Investment Banking,Investment Management,Venture Capital & Private Equity,nan,R

In [43]:
# checking to see if it's iterable or not
type(standardized_industries)

str

In [47]:
# split breaks a string into a list of substrings based on a delimiter. In this case ',' is the delimiter.
# probably could've used a pandas method to remove all .na values instead of this list comp. 
# then you would've only needed .join method and .values at the end
industry_list = standardized_industries.split(',')

filtered_industry_list = [industry for industry in industry_list if industry.lower() != 'nan']

filtered_standardized_industries = ', '.join(filtered_industry_list)
print(filtered_standardized_industries)

Education, E-Learning, Higher Education, Primary/Secondary Education, Research, Construction, Civil Engineering, Construction, Design, Design, Graphic Design, Corporate Services, Business Supplies & Equipment, Environmental Services, Events Services, Executive Office, Facilities Services, Human Resources, Information Services, Management Consulting, Outsourcing/Offshoring, Professional Training & Coaching, Security & Investigations, Staffing & Recruiting, Retail, Supermarkets, Wholesale, Energy & Mining, Oil & Energy, Utilities, Manufacturing, Aviation & Aerospace, Chemicals, Defense & Space, Electrical & Electronic Manufacturing, Food Production, Glass,  Ceramics & Concrete, Industrial Automation, Machinery, Mechanical or Industrial Engineering, Packaging & Containers, Paper & Forest Products, Plastics, Railroad Manufacture, Renewables & Environment, Shipbuilding, Textiles, Finance, Capital Markets, Financial Services, Insurance, Investment Banking, Investment Management, Venture Capi
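One side effect of joining on `,` and then splitting on `,` is visible in the output above: "Glass, Ceramics & Concrete" is a single industry name, but it gets split in two and rejoined as "Glass,  Ceramics & Concrete". A minimal sketch of an alternative is to filter the missing values before joining and to use a delimiter that doesn't occur inside the names. The list below is a hand-written stand-in for the DataFrame column, and the `; ` delimiter is an assumption.

```python
import math

# Stand-in for the column values: real names plus NaN for the empty rows.
column_values = ["Manufacturing", "Glass, Ceramics & Concrete", float("nan"), "Finance"]


def is_missing(value):
    """True for the float NaN that pandas uses for empty cells."""
    return isinstance(value, float) and math.isnan(value)


clean = [value for value in column_values if not is_missing(value)]
joined = "; ".join(clean)
print(joined)
```

In pandas itself, calling `dropna()` on the column before the join would do the same filtering, which is what the comment in the cell above is hinting at.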

In [50]:
# split on ',' and strip whitespace so each entry can be compared exactly,
# no matter where it sits in the comma-separated string
entries = [entry.strip() for entry in filtered_standardized_industries.lower().split(',')]
contains_nan = 'nan' in entries
print("Contains 'nan': ", contains_nan)

Contains 'nan':  False


In [53]:
""" 
This shows programmer what the input looks like. 
"""

user_input = 'air LineZ', 'airline', 'aviation', 'planes that fly', 'farming', 'bread', 'wifi networks', 'twitter media agency'
# a leading _ on a variable name signals that it's meant for internal use
# user_industries and standardized_industries are the input variables of the template defined a few cells above via
# HumanMessagePromptTemplate

# * what does format_prompt look like in code? - not too complex, probably some string formatting that fills in the template. *
# note: this passes standardized_industries (which still contains 'nan'), not filtered_standardized_industries
_input = prompt.format_prompt(user_industries=user_input, standardized_industries=standardized_industries)

print(f"There are {len(_input.messages)} message(s)")
print(f"Type: {type(_input.messages[0])}")
print("-----------------------------")
print(_input.messages[0].content)

There are 1 message(s)
Type: <class 'langchain.schema.HumanMessage'>
-----------------------------

You will be given a series of industry names from a user. 
Find the best corresponding match on the list of standardized names. 
The closest match will be the one with the closest semantic meaning. Not just string similarity. 

The output should be a markdown code snippet formatted in the following schema:

```json
{
	"input_industry": string  // This is the input industry from the user.
	"standardized_industry": string  // This is the industry you feel most closely matched to the 
}
```

Wrap your final output with closed and open brackets (a list of json objects)

input_industry INPUT:
('air LineZ', 'airline', 'aviation', 'planes that fly', 'farming', 'bread', 'wifi networks', 'twitter media agency')

STANDARDIZED INDUSTRIES: 
Education,E-Learning,Higher Education,Primary/Secondary Education,Research,nan,Construction,Civil Engineering,Construction,nan,Design,Design,Graphic Design,nan

In [54]:
output = chat_model(_input.to_messages())

In [55]:
# this is where the semantic matching happens: the chat model picks the closest standardized industry for each input.
# I'd be curious to try this with word embeddings to see how vector similarity compares and how the output changes.

print(type(output))
print(output.content)

<class 'langchain.schema.AIMessage'>


[
	{
		"input_industry": "air LineZ",
		"standardized_industry": "Aviation & Aerospace"
	},
	{
		"input_industry": "airline",
		"standardized_industry": "Aviation & Aerospace"
	},
	{
		"input_industry": "aviation",
		"standardized_industry": "Aviation & Aerospace"
	},
	{
		"input_industry": "planes that fly",
		"standardized_industry": "Aviation & Aerospace"
	},
	{
		"input_industry": "farming",
		"standardized_industry": "Farming"
	},
	{
		"input_industry": "bread",
		"standardized_industry": "Food Production"
	},
	{
		"input_industry": "wifi networks",
		"standardized_industry": "Hardware & Networking"
	},
	{
		"input_industry": "twitter media agency",
		"standardized_industry": "Marketing & Advertising"
	}
]
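For comparison with the LLM-based matching above, this is a toy sketch of what the vector-similarity approach mentioned in the comment could look like. The three-dimensional vectors are made up by hand for illustration; real embeddings would come from an embedding model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embeddings of two standardized industries.
toy_vectors = {
    "Aviation & Aerospace": [0.9, 0.1, 0.0],
    "Food Production": [0.0, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # made-up vector standing in for "planes that fly"

best = max(toy_vectors, key=lambda name: cosine_similarity(query, toy_vectors[name]))
print(best)
```

The match is picked purely by geometry here, whereas the chat model above is following written instructions, so the two approaches may disagree on odd inputs like "air LineZ".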


In [57]:
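# if the model wrapped its answer in a json code fence, keep only the text inside the fence;
# otherwise treat the whole reply as the JSON string
if "```json" in output.content:
    json_string = output.content.split("```json")[1].split("```")[0].strip()
else: 
    json_string = output.content

In [58]:
structured_data = json.loads(json_string)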
structured_data

[{'input_industry': 'air LineZ',
  'standardized_industry': 'Aviation & Aerospace'},
 {'input_industry': 'airline',
  'standardized_industry': 'Aviation & Aerospace'},
 {'input_industry': 'aviation',
  'standardized_industry': 'Aviation & Aerospace'},
 {'input_industry': 'planes that fly',
  'standardized_industry': 'Aviation & Aerospace'},
 {'input_industry': 'farming', 'standardized_industry': 'Farming'},
 {'input_industry': 'bread', 'standardized_industry': 'Food Production'},
 {'input_industry': 'wifi networks',
  'standardized_industry': 'Hardware & Networking'},
 {'input_industry': 'twitter media agency',
  'standardized_industry': 'Marketing & Advertising'}]
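To answer the "what does it mean?" question on the fence check, here is a small self-contained sketch run on a hard-coded reply instead of a real model response. The helper `extract_json` and the sample reply are hypothetical; the fence marker is built from single backticks only so the example can be shown inside a code block.

```python
import json

FENCE = "`" * 3  # three backticks

# Hand-written stand-in for a model reply that wraps its JSON in a fence.
reply = (
    f"Here is the result:\n{FENCE}json\n"
    '[{"input_industry": "bread", "standardized_industry": "Food Production"}]\n'
    f"{FENCE}"
)


def extract_json(text):
    """Return the parsed JSON, whether or not the reply wraps it in a json fence."""
    if FENCE + "json" in text:
        # keep what comes after the opening fence and before the closing one
        text = text.split(FENCE + "json")[1].split(FENCE)[0]
    return json.loads(text.strip())


print(extract_json(reply))
print(extract_json("[1, 2]"))  # no fence: the whole string is parsed as-is
```

In the run above the model returned bare JSON with no fence, which is why parsing the reply directly also worked.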

In [59]:
pd.DataFrame(structured_data)

| | input_industry | standardized_industry |
|---|---|---|
| 0 | air LineZ | Aviation & Aerospace |
| 1 | airline | Aviation & Aerospace |
| 2 | aviation | Aviation & Aerospace |
| 3 | planes that fly | Aviation & Aerospace |
| 4 | farming | Farming |
| 5 | bread | Food Production |
| 6 | wifi networks | Hardware & Networking |
| 7 | twitter media agency | Marketing & Advertising |
