
Commit b6fb4c0

Merge branch 'pre/beta' into temp
2 parents: e686541 + 214de44

32 files changed: +324 −199 lines

CHANGELOG.md

Lines changed: 19 additions & 0 deletions

Added:

## [1.15.0-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.14.1-beta.1...v1.15.0-beta.1) (2024-08-23)

### Features

* lightweighting the library ([62f32e9](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/62f32e994bcb748dfef4f7e1b2e5213a989c33cc))

### Bug Fixes

* Azure OpenAI issue ([a92b9c6](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/a92b9c6970049a4ba9dbdf8eff3eeb7f98c6c639))

## [1.14.1-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.14.0...v1.14.1-beta.1) (2024-08-21)

### Bug Fixes

* **models_tokens:** add llama2 and llama3 sizes explicitly ([b05ec16](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/b05ec16b252d00c9c9ee7c6d4605b420851c7754))

Unchanged context below the new entries:

## [1.14.0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.13.3...v1.14.0) (2024-08-20)
README.md

Lines changed: 22 additions & 0 deletions

Existing context (after `playwright install`):

**Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱

Added:

Optional modules are not installed by default. If you want to use them, install the corresponding extras yourself with the following commands:

### Installing "Other Language Models"

This group allows you to use additional language models like Fireworks, Groq, Anthropic, Hugging Face, and Nvidia AI Endpoints.

```bash
pip install scrapegraphai[other-language-models]
```

### Installing "More Semantic Options"

This group includes tools for advanced semantic processing, such as Graphviz.

```bash
pip install scrapegraphai[more-semantic-options]
```

### Installing "More Browser Options"

This group includes additional browser management options, such as BrowserBase.

```bash
pip install scrapegraphai[more-browser-options]
```

Unchanged context below the addition:

## 💻 Usage

There are multiple standard scraping pipelines that can be used to extract information from a website (or local file).

docs/README.md

Lines changed: 0 additions & 11 deletions

Removed from **Short-Term Goals**:

- Integration with more llm APIs
- Test proxy rotation implementation
- Add more search engines inside the SearchInternetNode

Removed from **Medium-Term Goals**:

- Improve SearchGraph to look into the first 5 results of the search engine

Removed from the end of the roadmap (after "Create API for the library"):

- Finetune a LLM for html content

docs/source/scrapers/llm.rst

Lines changed: 32 additions & 0 deletions

Added:

Other LLM models
^^^^^^^^^^^^^^^^

We can also pass a model instance for the chat model and the embedding model through the **model_instance** parameter.
This feature enables you to use a LangChain model instance.
You will find the model you require within the provided lists:

- `chat model list <https://python.langchain.com/v0.2/docs/integrations/chat/#all-chat-models>`_
- `embedding model list <https://python.langchain.com/v0.2/docs/integrations/text_embedding/#all-embedding-models>`_

For instance, consider the **chat model** Moonshot. We can integrate it in the following manner:

.. code-block:: python

   from langchain_community.chat_models.moonshot import MoonshotChat

   # The configuration parameters depend on the specific model you select
   llm_instance_config = {
       "model": "moonshot-v1-8k",
       "base_url": "https://api.moonshot.cn/v1",
       "moonshot_api_key": "MOONSHOT_API_KEY",
   }

   llm_model_instance = MoonshotChat(**llm_instance_config)
   graph_config = {
       "llm": {
           "model_instance": llm_model_instance,
           "model_tokens": 5000
       },
   }
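The **model_instance** pattern above is not Moonshot-specific: any LangChain-compatible chat model object can be passed. A minimal sketch of the expected config shape, where `DummyChat` is a hypothetical stand-in for a real chat model such as `MoonshotChat` (it is not part of the library):

```python
# Sketch of the graph_config shape used with a custom model instance.
# DummyChat is a hypothetical stand-in for a real LangChain chat model;
# any object implementing the chat-model interface can take its place.
class DummyChat:
    def invoke(self, messages):
        # A real model would call the provider's API here.
        return "stubbed response"

llm_model_instance = DummyChat()

graph_config = {
    "llm": {
        "model_instance": llm_model_instance,  # the pre-built model object
        "model_tokens": 5000,  # context size must be given explicitly
    },
}

print(graph_config["llm"]["model_tokens"])
```

Because the library cannot look up the context window for an arbitrary instance, `model_tokens` has to be supplied alongside it.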

examples/model_instance/.env.example

Lines changed: 1 addition & 0 deletions

MOONLIGHT_API_KEY="YOUR MOONLIGHT API KEY"
Lines changed: 53 additions & 0 deletions

```python
"""
Basic example of scraping pipeline using SmartScraper and model_instance
"""
import os
import json
from dotenv import load_dotenv
from langchain_community.chat_models.moonshot import MoonshotChat
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************

llm_instance_config = {
    "model": "moonshot-v1-8k",
    "base_url": "https://api.moonshot.cn/v1",
    "moonshot_api_key": os.getenv("MOONLIGHT_API_KEY"),
}

llm_model_instance = MoonshotChat(**llm_instance_config)

graph_config = {
    "llm": {
        "model_instance": llm_model_instance,
        "model_tokens": 10000
    },
    "verbose": True,
    "headless": True,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me what does the company do, the name and a contact email.",
    source="https://scrapegraphai.com/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
examples/moonshot/.env.example

Lines changed: 1 addition & 0 deletions

MOONLIGHT_API_KEY="YOUR MOONLIGHT API KEY"

examples/moonshot/readme.md

Lines changed: 1 addition & 0 deletions

This folder offers an example of how to use ScrapeGraph-AI with Moonshot and SmartScraperGraph. For more usage patterns, refer to the OpenAI examples.
Lines changed: 53 additions & 0 deletions

(This new script in the moonshot example folder is identical to the examples/model_instance script shown above: it configures MoonshotChat from `MOONLIGHT_API_KEY`, builds a SmartScraperGraph with the model instance, runs it against https://scrapegraphai.com/, and prints the JSON result and execution info.)

pyproject.toml

Lines changed: 23 additions & 17 deletions

```diff
 [project]
 name = "scrapegraphai"
-version = "1.14.0"
+version = "1.15.0b1"
 description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines."
 authors = [
     { name = "Marco Vinciguerra", email = "mvincig11@gmail.com" },
     { name = "Marco Perini", email = "perinim.98@gmail.com" },
```

```diff
 dependencies = [
     "langchain>=0.2.14",
-    "langchain-fireworks>=0.1.3",
-    "langchain_community>=0.2.9",
     "langchain-google-genai>=1.0.7",
-    "langchain-google-vertexai>=1.0.7",
     "langchain-openai>=0.1.22",
-    "langchain-groq>=0.1.3",
-    "langchain-aws>=0.1.3",
-    "langchain-anthropic>=0.1.11",
     "langchain-mistralai>=0.1.12",
-    "langchain-huggingface>=0.0.3",
-    "langchain-nvidia-ai-endpoints>=0.1.6",
+    "langchain_community>=0.2.9",
+    "langchain-aws>=0.1.3",
     "html2text>=2024.2.26",
     "faiss-cpu>=1.8.0",
     "beautifulsoup4>=4.12.3",
     "pandas>=2.2.2",
     "python-dotenv>=1.0.1",
     "tiktoken>=0.7",
     "tqdm>=4.66.4",
-    "graphviz>=0.20.3",
     "minify-html>=0.15.0",
     "free-proxy>=1.1.1",
     "playwright>=1.43.0",
-    "google>=3.0.0",
     "undetected-playwright>=0.3.0",
+    "google>=3.0.0",
     "semchunk>=1.0.1",
-    "browserbase>=0.3.0",
 ]
```

```diff
 burr = ["burr[start]==0.22.1"]
 docs = ["sphinx==6.0", "furo==2024.5.6"]

+# Group 1: Other Language Models
+other-language-models = [
+    "langchain-fireworks>=0.1.3",
+    "langchain-groq>=0.1.3",
+    "langchain-anthropic>=0.1.11",
+    "langchain-huggingface>=0.0.3",
+    "langchain-nvidia-ai-endpoints>=0.1.6",
+]
+
+# Group 2: More Semantic Options
+more-semantic-options = [
+    "graphviz>=0.20.3",
+]
+
+# Group 3: More Browser Options
+more-browser-options = [
+    "browserbase>=0.3.0",
+]
+
 [build-system]
 requires = ["hatchling"]
 build-backend = "hatchling.build"
```
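Moving heavy provider packages into extras means the core library can no longer assume they are importable. A common guard pattern for code paths that depend on an extra checks for the module before importing it; this is a sketch of the general technique (the `extra_installed` helper is ours, not the library's API):

```python
import importlib.util

def extra_installed(module_name: str) -> bool:
    """Return True if the optional dependency can be imported."""
    # find_spec returns None for a missing top-level module
    # without actually importing anything.
    return importlib.util.find_spec(module_name) is not None

# Example: only enable Graphviz-based features when the
# "more-semantic-options" extra is present.
if extra_installed("graphviz"):
    pass  # safe to `import graphviz` here
else:
    pass  # fall back, or raise an error suggesting `pip install scrapegraphai[more-semantic-options]`
```

The abstract_graph.py change below applies the same idea by deferring provider imports into the branches that need them.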

scrapegraphai/graphs/abstract_graph.py

Lines changed: 12 additions & 5 deletions

Module-level imports of ErnieBotChat and ChatNVIDIA are removed (they become lazy imports inside the branches that use them):

```diff
 import uuid
 import warnings
 from pydantic import BaseModel
-from langchain_community.chat_models import ErnieBotChat
-from langchain_nvidia_ai_endpoints import ChatNVIDIA
 from langchain.chat_models import init_chat_model
 from ..helpers import models_tokens
 from ..models import (
```

Inside `_create_llm`, the `known_models` list is re-wrapped and an explicit `azure` branch is added before the `fireworks` one:

```diff
         warnings.simplefilter("ignore")
         return init_chat_model(**llm_params)

-    known_models = ["chatgpt","gpt","openai", "azure_openai", "google_genai", "ollama", "oneapi", "nvidia", "groq", "google_vertexai", "bedrock", "mistralai", "hugging_face", "deepseek", "ernie", "fireworks"]
+    known_models = ["chatgpt","gpt","openai", "azure_openai", "google_genai",
+                    "ollama", "oneapi", "nvidia", "groq", "google_vertexai",
+                    "bedrock", "mistralai", "hugging_face", "deepseek", "ernie", "fireworks"]

     if llm_params["model"].split("/")[0] not in known_models and llm_params["model"].split("-")[0] not in known_models:
         raise ValueError(f"Model '{llm_params['model']}' is not supported")

     try:
+        if "azure" in llm_params["model"]:
+            model_name = llm_params["model"].split("/")[-1]
+            return handle_model(model_name, "azure_openai", model_name)
         if "fireworks" in llm_params["model"]:
             model_name = "/".join(llm_params["model"].split("/")[1:])
             token_key = llm_params["model"].split("/")[-1]
```

A stale comment above the `deepseek` branch is dropped:

```diff
             model_name = llm_params["model"].split("/")[-1]
             return handle_model(model_name, "mistralai", model_name)

-        # Instantiate the language model based on the model name (models that do not use the common interface)
         elif "deepseek" in llm_params["model"]:
             try:
                 self.model_token = models_tokens["deepseek"][llm_params["model"]]
```

The `ernie` and `nvidia` branches gain lazy imports of their provider packages:

```diff
             return DeepSeek(llm_params)

         elif "ernie" in llm_params["model"]:
+            from langchain_community.chat_models import ErnieBotChat
+
             try:
                 self.model_token = models_tokens["ernie"][llm_params["model"]]
             except KeyError:
```

```diff
             return OneApi(llm_params)

         elif "nvidia" in llm_params["model"]:
+            from langchain_nvidia_ai_endpoints import ChatNVIDIA
+
             try:
                 self.model_token = models_tokens["nvidia"][llm_params["model"].split("/")[-1]]
                 llm_params["model"] = "/".join(llm_params["model"].split("/")[1:])
```

Finally, the docstring of the abstract `run` method is unchanged except that the file now ends with a trailing newline.
