### Import Packages

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import os
import random

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import ServiceContext, StorageContext, load_index_from_storage
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceInferenceAPI




### Load HF Token

In [3]:
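# Read the Hugging Face API token; strip whitespace so a trailing newline is not sent as part of the token
with open('./HF-Token.txt') as f:
    HF_TOKEN = f.read().strip()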
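
The cell above reads the token from a local file. A common alternative is to check an environment variable first and fall back to the file, which keeps the token out of the project folder. The following is a minimal sketch under that assumption; the helper name `load_hf_token` and the variable name `HF_TOKEN` are illustrative choices, not something the notebook or Hugging Face requires.

```python
import os
from pathlib import Path


def load_hf_token(env_var: str = "HF_TOKEN", token_file: str = "./HF-Token.txt") -> str:
    """Return the API token from an environment variable, else from a local file.

    Both the variable name and the file path are assumptions for illustration.
    """
    # Prefer the environment variable when it is set and non-empty
    token = os.environ.get(env_var, "").strip()
    if token:
        return token

    # Fall back to the token file, stripping the trailing newline
    path = Path(token_file)
    if path.is_file():
        token = path.read_text(encoding="utf-8").strip()

    if not token:
        raise RuntimeError(f"No token found in ${env_var} or {token_file}")
    return token


# Usage in the notebook would look like:
# HF_TOKEN = load_hf_token()
```

Raising an error when neither source yields a token makes a missing credential fail here rather than later as an opaque authentication error from the inference API.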

### Load and Create Indexed Data

In [5]:
PERSIST_DIR = "./storage"
data_folder = "./data"

# Configure the LLM and embedding model shared by index creation and loading

# Hugging Face Inference API generates the answer from the retrieved documents
llm = HuggingFaceInferenceAPI(
    model_name="HuggingFaceH4/zephyr-7b-beta", token=HF_TOKEN, num_output=1024)

# Embedding model; see https://huggingface.co/spaces/mteb/leaderboard for the list of available models
embed_model = HuggingFaceEmbedding(model_name="mixedbread-ai/mxbai-embed-large-v1", cache_folder='./embed_model')

# Bundle the LLM and embedding model into a service context, which is passed in when creating or loading the index
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, chunk_size=1024)

if not os.path.exists(PERSIST_DIR):
    
    files = os.listdir(data_folder)
    print(f"Reading the files from {data_folder} folder. files present {files}")
    
    documents = SimpleDirectoryReader(data_folder).load_data()
    parser = SimpleNodeParser()
    nodes = parser.get_nodes_from_documents(documents)
    
    print(f"number of documents -> {len(documents)} :: number of nodes -> {len(nodes)}")
    # random.choice avoids the IndexError that randint(0, len(x)) can raise, since randint includes its upper bound
    print(f"sample document :: \n {random.choice(documents)}")
    print(50 * "#")
    print(f"sample node :: \n {random.choice(nodes)}")
    
    storage_context = StorageContext.from_defaults()
    index = VectorStoreIndex(nodes=nodes, service_context=service_context, storage_context=storage_context)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    query_engine = index.as_query_engine()
    
else:
    
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context, service_context = service_context)
    query_engine = index.as_query_engine()


Reading the files from ./data folder. files present ['Data Science in Production Building Scalable Model Pipelines with Python by Ben G Weber (z-lib.org).pdf']
number of documents -> 234 :: number of nodes -> len(nodes)
sample document :: 
 Doc ID: 8ee01dcd-e66d-471f-bf52-44fbd6218eaa
Text: 196 7 Cloud Dataflow for Batch Modeling     FIGURE 7.5: Running
the managed pipeline with autoscaling. source, which can take a
significant amount of time for libraries such as Pandas. To avoid
lengthy startup delays, it’s helpful to avoid including libraries in
the requirements file that are already included in the Dataflow SDK1.
For example,...
##################################################
sample node :: 
 Node ID: f58e8ae7-c969-48d2-85c8-ffbe2991c0c7
Text: 7.2 Batch Model Pipeline 195 command line argument, as shown
below. The project parameter is needed to read and write data with
BigQuery. After running the pipeline, you can validate that the
workflow was successful by navigating to the
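
The nodes printed above are chunks of the source documents; each chunk is embedded and stored in the index. As a rough illustration of what chunking with overlap means, here is a toy word-based splitter. It is a simplified sketch, not LlamaIndex's parser: the real node parser counts tokens rather than words and tries to respect sentence boundaries, and the function name `split_into_chunks` and its parameters are hypothetical.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into word-based chunks where neighbouring chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = "a b c d e f g h i j"
    for chunk in split_into_chunks(sample, chunk_size=4, overlap=1):
        print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.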

### Inference

In [6]:
response = query_engine.query("what data science principles are outlined in this book?")
print(response)



The book focuses on building predictive model services for product teams, and aims to provide data scientists with a set of tools to build scalable model pipelines using Python. It assumes prior knowledge of Python and Pandas, as well as some experience with modeling packages such as scikit-learn. The book covers a range of topics, including data ingestion, data cleaning, feature engineering, model training, model evaluation, model deployment, and model monitoring. It also introduces tools and cloud environments commonly used in industry settings, such as AWS SageMaker, Google Cloud AI Platform, and Kubernetes. The book emphasizes the importance of version control, testing, and documentation in data science workflows, and provides guidance on how to implement these principles using Git, Docker, and Jupyter Notebooks. Overall, the book aims to provide a practical, hands-on approach to data science, with a focus on building data products for product teams.
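
Conceptually, answering a query like the one above involves embedding the question, finding the stored nodes whose embeddings are most similar to it, and handing those nodes to the LLM as context. The retrieval step can be sketched with plain cosine similarity. This is a toy illustration with made-up two-dimensional vectors, not LlamaIndex's implementation; the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          node_vecs: Sequence[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (node index, similarity) pairs for the k nodes most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(node_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Made-up embeddings standing in for three indexed nodes
    node_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    query_vec = [1.0, 0.1]
    for index, score in top_k(query_vec, node_vecs, k=2):
        print(index, round(score, 3))
```

Because only the top-ranked nodes reach the LLM, answer quality depends heavily on how well the embedding model places related questions and passages close together.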


In [7]:
response = query_engine.query("what are best models that I can use if I have a high sparse data?")
print(response)



For high sparse data, some of the best models that you can use are:

1. Random Forest: This is an ensemble learning method that uses multiple decision trees to make a prediction. It can handle high sparse data as it can handle missing values and noisy data.

2. Gradient Boosting: This is another ensemble learning method that uses multiple weak learners to make a prediction. It can handle high sparse data as it can handle missing values and noisy data.

3. XGBoost: This is an optimized distributed gradient boosting library that can handle high sparse data. It is designed to be fast and efficient, making it a popular choice for big data applications.

4. LightGBM: This is a gradient boosting framework that can handle high sparse data. It uses a tree-based learning algorithm and is known for its speed and accuracy.

5. Deep Learning Models: These are neural network models that can handle high sparse data. They can learn complex relationships between features and can handle missing value

In [8]:
response = query_engine.query("explain random forest model")
print(response)



A random forest is an ensemble learning method for classification, regression, and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random forest is an extension of the bagging meta-algorithm, which builds multiple decision trees at training time (also known as base learners), and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. In random forest, each decision tree is constructed using a random vector, where each element is selected from the training set. This is the main difference between bagging and random forest. Random forest is a supervised learning algorithm, which means it requires labeled data for training. The random forest algorithm is used for both classification and regression tasks. In classification tasks, the output of the rand

In [9]:
response = query_engine.query("how can i use spark to create a pipeline?")
print(response)



To create a pipeline using Spark, follow these steps:

1. Stage your input data in a distributed storage layer, such as S3.
2. Load the data into a Spark DataFrame using Spark's data source APIs.
3. Preprocess the data as needed, such as cleaning, transforming, or feature engineering.
4. Split the data into training and testing sets.
5. Train a machine learning model using Spark MLlib or another library.
6. Evaluate the model's performance on the testing set.
7. Save the model to persistent storage, such as S3 or a database.
8. Load the saved model into a production environment and use it to make predictions on new data.

In summary, creating a pipeline using Spark involves staging data, loading it into Spark, preprocessing it, splitting it into training and testing sets, training a model, evaluating it, saving it, and loading it into production.


In [10]:
response = query_engine.query("what are the spark deployments mentioned in the book")
print(response)



The book mentions three types of Spark deployments:

1. Self-hosted: An engineering team manages a set of clusters and provides console and notebook access.

2. Cloud solutions: AWS EMR and GCP Cloud Dataproc are mentioned as examples.

3. Vendor solutions: Databricks, Cloudera, and other vendors provide fully-managed Spark environments.

The author recommends using a freely-available notebook environment for getting up and running with Spark as quickly as possible, especially for data scientists. The author also mentions that as the size of the team using Spark scales, additional considerations such as multi-tenancy and isolation become important, and self-hosted solutions require significant engineering work to support these features. Many organizations use cloud or vendor solutions for Spark due to these factors.


In [11]:
response = query_engine.query("what are the coding environments that the book talks about?")
print(response)



The book talks about three types of coding environments for writing Python code for data science: IDEs, text editors, and notebooks. The author recommends using notebook environments for exploratory analysis and productizing models, and text editors for building web applications with Flask and Dash. The author also mentions collaborative note-books in Databricks and Google Colab, and suggests sharing notebooks in version control systems like GitHub for collaboration. The author recommends working on a remote machine like EC2 to gain experience with cloud environments and setting up Python environments outside of the local machine.


In [12]:
response = query_engine.query("give me python code for logistic regression ?")
print(response)



Here's an example of how to implement logistic regression in Python using the scikit-learn library:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv('dataset.csv')

# Preprocess the data (e.g., encoding categorical variables)
X = data.drop('target', axis=1)
y = data['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the logistic regression model to the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model on the testing data
y_pred = model.predict(X_test)

# Calculate the accuracy and ROC AUC score
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print("Accuracy:", accuracy)
print("ROC AUC:", auc)
```

In this example, we fir

In [13]:
response = query_engine.query("give me the same using keras")
print(response)



To build a neural network for predicting the likelihood of purchasing a game using Keras, follow these steps:

1. Import the required libraries:

```python
import tensorflow as tf
import keras
from keras import models, layers
import matplotlib.pyplot as plt
```

2. Define the network structure:

```python
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10,)))
model.add(layers.Dropout(0.1))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

3. Compile the model:

```python
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=[auc])
```

4. Define the `auc` metric for evaluating the model:

```python
def auc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred)[1]
    keras.backend.get_session().run(tf.local_variables_initializer())
    return auc
```

5. Train and evaluate the model:

```python
x_train, x_test, y_train, y_test = train_test_split(gamesDF.drop(['label'], axis=1