# Comparing llama-index Semantic Chunking and SentenceSplitter

### Import Packages

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import os
import random

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import ServiceContext, StorageContext, load_index_from_storage
from llama_index.core.node_parser import SentenceSplitter, SimpleNodeParser, SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceInferenceAPI



### Load HF Token

In [3]:
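# Read the token and strip the trailing newline so it is not sent as part of the credential
with open('./HF-Token.txt') as token_file:
    HF_TOKEN = token_file.read().strip()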

### Load Data

In [4]:
data_folder = "./data"
documents = SimpleDirectoryReader(input_dir=data_folder).load_data()

In [14]:
print(f"total number of documents :: {len(documents)}")
print(100*'#')
print("sample Documents ::")
print(100*'#')
print(documents[0].get_content())
print(100*'#')
print(documents[-10].get_content())
print(100*'#')
# randint includes its upper bound and could index past the end; randrange does not
print(documents[random.randrange(len(documents))].get_content())

total number of documents :: 234
####################################################################################################
sample Documents ::
####################################################################################################
Ben G. W eber
Data Science in Production:
Building Scalable Model
Pipelines with Python
####################################################################################################
8.2 Dataflow Streaming 215
a callback pattern where you provide a function that is used to
process messages as they arrive. In this example we simply print
the data field in the message and then acknowledge that the mes-
sage has been received. The for loop at the bottom of the code
block is used to keep the script running, because none of the other
commands suspend when executed.
import time
from google.cloud import pubsub_v1
subscriber = pubsub_v1.SubscriberClient ()
subscription_path = subscriber.subscription_path (
"your_project_name" ,"dsp" )
de

### Define Service Context

In [49]:
embed_model = HuggingFaceEmbedding(model_name="mixedbread-ai/mxbai-embed-large-v1")
llm = HuggingFaceInferenceAPI(
    model_name="HuggingFaceH4/zephyr-7b-beta", token=HF_TOKEN, num_output=1024
)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, chunk_size=1024)



### Define Sentence and Semantic Chunkers

In [40]:
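# NOTE: breakpoint_percentile_threshold is a percentile on a 0-100 scale (the library default is 95),
# so 0.95 is a very low cutoff and yields very small semantic chunks.
# The outputs in this notebook were produced with 0.95 as written.
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=2, embed_model=embed_model, breakpoint_percentile_threshold=0.95
)
base_splitter = SentenceSplitter(chunk_size=512)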

In [41]:
semantic_nodes = semantic_splitter.get_nodes_from_documents(documents)
base_nodes = base_splitter.get_nodes_from_documents(documents)
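
To make the effect of the percentile threshold concrete, here is a minimal sketch of percentile-based breakpoint splitting. It is a simplified illustration, not llama-index's implementation: the helper names (`percentile`, `split_at_breakpoints`), the sentences, and the distances are all made up, and it assumes the threshold is read on a 0-100 scale.

```python
def percentile(values, q):
    """Linear-interpolation percentile, with q on a 0-100 scale."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def split_at_breakpoints(sentences, distances, threshold_percentile):
    """Start a new chunk wherever the distance to the previous sentence exceeds the cutoff."""
    if len(distances) != len(sentences) - 1:
        raise ValueError("need one distance per adjacent sentence pair")
    cutoff = percentile(distances, threshold_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


# Hypothetical sentences and made-up embedding distances, for illustration only.
toy_sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
toy_distances = [0.10, 0.12, 0.80, 0.15, 0.11]

high_threshold_chunks = split_at_breakpoints(toy_sentences, toy_distances, 95)
low_threshold_chunks = split_at_breakpoints(toy_sentences, toy_distances, 0.95)
print(len(high_threshold_chunks), len(low_threshold_chunks))
```

In this toy setting a threshold of 95 splits only at the one large jump, while a value such as 0.95 splits at nearly every sentence boundary, which would produce many short chunks.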

### Sample nodes and documents

In [62]:
semantic_nodes[0].get_content()
base_nodes[0].get_content()

'Ben G. W eber\nData Science in Production:\nBuilding Scalable Model\nPipelines with Python'

'Ben G. W eber\nData Science in Production:\nBuilding Scalable Model\nPipelines with Python'

In [63]:
semantic_nodes[2].get_content()
base_nodes[2].get_content()

'Contents\nPreface vii\n0.1 Prerequisites . '

'Contents\nPreface vii\n0.1 Prerequisites . . . . . . . . . . . . . . . . . . . . vii\n0.2 Book Contents . . . . . . . . . . . . . . . . . . . viii\n0.3 Code Examples . . . . . . . . . . . . . . . . . . . x\n0.4 Acknowledgements . . . . . . . . . . . . . . . . . x\n1 Introduction 1\n1.1 Applied Data Science . . . . . . . . . . . . . . . 3\n1.2 Python for Scalable Compute . . . . . . . . . . . 4\n1.3 Cloud Environments . . . . . . . . . . . . . . . . 6\n1.3.1 Amazon W eb Services (A WS) . . . . . . . 7\n1.3.2 Google Cloud Platform (GCP) . . . . . . 8\n1.4 Coding Environments . . . . . . . . . . . . . . . 9\n1.4.1 Jupyter on EC2 . . . . . . . . . . . . . . . 9\n1.5 Datasets . . . . . . . . . . . .'
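
The two samples above differ a lot in length, so it may help to compare the chunk-size distributions of both splitters rather than individual nodes. The helper below is a hypothetical sketch (`chunk_length_stats` is not part of llama-index) that works on plain strings; in the notebook one would pass `[node.get_content() for node in semantic_nodes]` and the same for `base_nodes`.

```python
from statistics import mean, median


def chunk_length_stats(texts):
    """Return count, min, median, mean and max character length for a list of chunk texts."""
    lengths = [len(text) for text in texts]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "mean": round(mean(lengths), 1),
        "max": max(lengths),
    }


# Stand-in strings so the sketch runs on its own.
sample_chunks = ["Contents", "Preface vii", "0.1 Prerequisites . "]
print(chunk_length_stats(sample_chunks))
```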

### Creating VectorStore Indices

In [50]:
semantic_index = VectorStoreIndex(semantic_nodes, service_context=service_context)
base_index = VectorStoreIndex(base_nodes, service_context=service_context)

In [51]:
semantic_query_engine = semantic_index.as_query_engine()
basic_query_engine = base_index.as_query_engine()

### Inference

In [52]:
def get_response(engine, qry):
    response = engine.query(qry)
    print(response)
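
Since every comparison in this notebook runs the same query against both engines, a small wrapper can keep the outputs labelled and separated. This is only a sketch: `compare_engines` and `EchoEngine` are hypothetical names, and the only assumption about an engine is that it has a `query()` method, as the query engines used here do.

```python
def compare_engines(engines, query, separator=100 * "#"):
    """Run one query against several engines and print each answer under its label."""
    answers = {}
    for label, engine in engines.items():
        answers[label] = str(engine.query(query))
        print(f"[{label}]")
        print(answers[label])
        print(separator)
    return answers


class EchoEngine:
    """Hypothetical stand-in for a query engine; it only needs a query() method."""

    def __init__(self, prefix):
        self.prefix = prefix

    def query(self, text):
        return f"{self.prefix}: {text}"


demo_answers = compare_engines(
    {"semantic": EchoEngine("A"), "basic": EchoEngine("B")},
    "explain random forest model",
)
```

In the notebook, the dictionary would hold `semantic_query_engine` and `basic_query_engine` instead of the stand-ins.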

In [53]:
get_response(semantic_query_engine, "what data science principles are outlined in this book?")
get_response(basic_query_engine, "what data science principles are outlined in this book?")



The book outlines the discipline of applied data science and emphasizes the importance of quickly delivering a proof of concept and iteratively improving a model once it has been shown to provide value to an organization. It also discusses the use of Python and provides an overview of automated feature engineering. The book uses specific data sets, models, and cloud environments throughout its chapters.


The book focuses on building predictive model services for product teams, and aims to provide data scientists with a set of tools to build scalable model pipelines using Python. It assumes prior knowledge of Python and Pandas, as well as some experience with modeling packages such as scikit-learn. The book covers a range of topics, including data ingestion, data cleaning, feature engineering, model training, model evaluation, model deployment, and model monitoring. It also introduces tools and cloud environments commonly used in industry settings, such as AWS SageMaker, Google Cloud

In [55]:
get_response(semantic_query_engine, "what are best models that I can use if I have a high sparse data?")
print(100*"#")
get_response(basic_query_engine, "what are best models that I can use if I have a high sparse data?")

1. Random Forest: Random Forest is an ensemble learning method that uses multiple decision trees to make a prediction. It can handle high sparse data as it can handle missing values and noisy data.

2. Gradient Boosting: Gradient Boosting is another ensemble learning method that combines multiple weak learners to make a strong prediction. It can handle high sparse data by iteratively improving the model's performance.

3. XGBoost: XGBoost is an optimized distributed gradient boosting library that can handle high sparse data with its distributed computing capabilities.

4. LightGBM: LightGBM is a gradient boosting framework that uses a gradient-based one-side sampling algorithm to handle high sparse data efficiently.

5. CatBoost: CatBoost is a gradient boosting framework that uses a combination of gradient boosting and categorical feature boosting to handle high sparse data with categorical features.

These models can be trained using Python libraries like scikit-learn, XGBoost, LightG

In [56]:
get_response(semantic_query_engine, "explain random forest model")
print(100*"#")
get_response(basic_query_engine, "explain random forest model")



A random forest is an ensemble learning method that combines multiple decision trees to improve the accuracy and robustness of predictions. Each decision tree is trained on a random subset of features and a random subset of training data, which helps to reduce overfitting and improve the model's ability to handle noisy and irrelevant features. The final prediction is made by averaging the predictions of all the decision trees in the forest. Random forests can be used for both classification and regression tasks, and are particularly effective for handling high-dimensional data with many features. In the context of the provided text, the random forest model is being used for regression to predict weights based on various input features. After searching through the parameter space and using cross validation to select the best hyperparameters, the model is retrained on the complete training data set and applied to make predictions on the test data set. The VectorAssembler transformer is

In [57]:
get_response(semantic_query_engine, "how can i use spark to create a pipeline?")
# The semantic engine performs better here because it includes an example
print(100*"#")
get_response(basic_query_engine, "how can i use spark to create a pipeline?")



To create a pipeline in Spark, you can follow these general steps:

1. Define the stages of your pipeline, which can include data sources, transformations, and evaluations.
2. Chain the stages together using the `Pipeline` class provided by Spark MLlib.
3. Fit the pipeline to your training data using the `fit()` method.
4. Transform your test data using the `transform()` method.
5. Evaluate the performance of your pipeline using the `evaluate()` method.

Here's an example of how to create a simple pipeline in Spark:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Load the data
data = spark.createDataFrame(
    pd.read_csv('path/to/data.csv', header=None),
    schema=StructType([StructField('feature1', FloatType()),
                       StructField('feature2', FloatType()),
               

In [58]:
get_response(semantic_query_engine, "what are the spark deployments mentioned in the book")
# The semantic engine's response is poorer than the basic engine's here
print(100*"#")
get_response(basic_query_engine, "what are the spark deployments mentioned in the book")

1. Self Hosted: An engineering team manages a set of clusters and provides console and notebook access.
####################################################################################################


The book mentions three types of Spark deployments:
1. Self-hosted: An engineering team manages a set of clusters and provides console and notebook access.
2. Cloud solutions: Amazon Web Services (AWS) provides a managed Spark option called EMR, and Google Cloud Platform (GCP) has Cloud Dataproc.
3. Vendor solutions: Databricks, Cloudera, and other vendors provide fully-managed Spark environments.

The author recommends using a freely-available notebook environment for getting up and running with Spark as quickly as possible, especially for data scientists. The author also mentions that as the size of the team using Spark scales, additional considerations such as multi-tenancy and isolation become important, and self-hosted solutions require significant engineering work to support t

In [59]:
get_response(semantic_query_engine, "what are the coding environments that the book talks about?")
print(100*"#")
get_response(basic_query_engine, "what are the coding environments that the book talks about?")

1. IDEs (Integrated Development Environments)
2. Text editors
3. Notebooks (which are becoming more and more common as the place to write Python scripts)

The book mentions that the best environment to use likely varies based on what you are building, but it provides an overview of three types of coding environments for Python: IDEs, text editors, and notebooks. Notebooks are becoming increasingly popular as a place to write Python scripts for data science.
####################################################################################################


The book talks about three types of coding environments for writing Python code for data science: IDEs, text editors, and notebooks. The author recommends using notebook environments for exploratory analysis and productizing models, and text editors for building web applications with Flask and Dash. The author also mentions collaborative note-books in Databricks and Google Colab, and suggests sharing notebooks in version control sy

In [60]:
get_response(semantic_query_engine, "give me python code for logistic regression ?")
print(100*"#")
get_response(basic_query_engine, "give me python code for logistic regression ?")



Here's an example of how to build a logistic regression model using scikit-learn in Python:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('your_dataset.csv')

# Preprocess the data as needed (e.g., one-hot encoding categorical features)
X = df.drop('target', axis=1)
y = df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_pred)
print("ROC AUC:", roc_auc)
```

In this example, we first import the necessary modules from scikit-learn. We then load the dataset, preprocess it as needed, and split it into training and testing sets using the `train_test_s

In [61]:
get_response(semantic_query_engine, "give me the same using keras")
print(100*"#")
get_response(basic_query_engine, "give me the same using keras")



To achieve the same using Keras, you can follow these steps:

1. Import the required libraries:

```python
import tensorflow as tf
import keras
from keras import models, layers
```

2. Define the network structure:

```python
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10,)))
model.add(layers.Dropout(0.1))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

3. Compile the model:

```python
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=[auc])

def auc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred)[1]
    keras.backend.get_session().run(tf.local_variables_initializer())
    return auc
```

In this example, we're defining a custom metric called `auc` using the `tf.metrics.auc` function. This metric will be used to evaluate the model's performance during training and testing.

4. Train and evaluate the model:

```python
model.fit(x_train, y_train, epochs=100, va

The semantic splitter does well in some cases, for example by returning a code example for the Spark pipeline question, but it fails to give a complete answer in others, such as the Spark deployments question. <br>
Nevertheless both are equally good in retrieving the relevant nodes from the indexed data
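
One way to check the retrieval side of this comparison is to look at the nodes each answer was built from. The sketch below assumes the response object exposes `source_nodes`, each with a `node` and a `score`, as llama-index query responses do; `summarize_sources` is a hypothetical helper and the objects used to demonstrate it are stand-ins, not real responses.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=80):
    """Return (score, snippet) pairs for the nodes a response was built from."""
    summary = []
    for source in response.source_nodes:
        snippet = source.node.get_content().replace("\n", " ")[:max_chars]
        summary.append((source.score, snippet))
    return summary


# Hypothetical stand-in objects mimicking only the attributes used above.
fake_node = SimpleNamespace(
    get_content=lambda: "Self Hosted: An engineering team\nmanages a set of clusters"
)
fake_response = SimpleNamespace(
    source_nodes=[SimpleNamespace(node=fake_node, score=0.71)]
)
print(summarize_sources(fake_response, max_chars=20))
```

Calling it on `semantic_query_engine.query(...)` and `basic_query_engine.query(...)` for the same question would show whether both engines retrieve the same passages.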