<h1>Recipe Helper Chatbot using LangChain and the Dolly-v2-3b Model</h1>


In this Kaggle notebook, we build a recipe helper chatbot using LangChain and Dolly-v2-3b, an open instruction-following language model from Databricks. The chatbot helps users find recipes, follow cooking instructions, and get personalized recommendations. We use the RecipeNLG dataset available on Kaggle to train and fine-tune the Dolly-v2-3b model so that it generates contextually relevant and informative recipe answers.

**Understanding Dolly-v2-3b:**

Dolly-v2-3b is an instruction-following language model from Databricks, built on the 2.8-billion-parameter Pythia model and fine-tuned on the databricks-dolly-15k instruction dataset. It is relatively small for a large language model, yet it can generate coherent, contextually relevant responses to natural-language instructions. We will explore the key features and architecture of the Dolly-v2-3b model, highlighting its suitability for building our recipe helper chatbot.

**Preparing the Recipe Dataset:**

We will acquire the RecipeNLG dataset from Kaggle, which contains a vast collection of cooking recipes. We will perform data preprocessing, including cleaning, formatting, and organizing the dataset to ensure compatibility with the Dolly-v2-3b model.
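
As a rough illustration of the cleaning step, the sketch below decodes the `ingredients` and `directions` columns, which the dataset stores as JSON-encoded lists inside CSV cells. The helper names `parse_list_field` and `clean_recipe` are hypothetical, and the choice to return an empty list for unreadable cells is an assumption, not something the notebook prescribes.

```python
import json


def parse_list_field(raw) -> list[str]:
    """Decode a JSON-encoded list such as '["1 c. oil", "4 eggs"]'.

    Returns an empty list when the cell is missing, empty, or not a JSON list.
    """
    if not isinstance(raw, str) or not raw.strip():
        return []
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(value, list):
        return []
    # Keep only non-empty strings, trimmed of surrounding whitespace.
    return [item.strip() for item in value if isinstance(item, str) and item.strip()]


def clean_recipe(row: dict) -> dict:
    """Return a cleaned copy of one recipe row (hypothetical helper)."""
    return {
        "title": (row.get("title") or "").strip(),
        "ingredients": parse_list_field(row.get("ingredients")),
        "directions": parse_list_field(row.get("directions")),
    }


row = {
    "title": "Devil Eggs ",
    "ingredients": '["1 dozen eggs", "1 paprika"]',
    "directions": '["Boil eggs on medium for 30mins."]',
}
print(clean_recipe(row))
```

Decoding with `json.loads` also resolves escapes such as `\u00b0` into the degree sign, which keeps oven temperatures readable in later steps.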

**Integrating LangChain:**

To work with the large recipe dataset, we will use LangChain, a framework for building applications on top of language models. Its document loaders, text splitters, embedding wrappers, and vector store integrations let us load the recipes, split them into chunks, embed them, and store them so that recipe information can be retrieved during chatbot interactions.

**Training and Fine-tuning Dolly-v2-3b:**

Using LangChain and the RecipeNLG dataset, we will train and fine-tune the Dolly-v2-3b model specifically for recipe-related tasks. We will delve into the training process, including data preparation, model configuration, and optimization techniques, so that the model becomes proficient at generating accurate and informative recipe answers.
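
One possible shape for the data-preparation step is sketched below: each recipe becomes an instruction/response record. The `instruction`, `context`, and `response` field names mirror the layout of the databricks-dolly-15k dataset, but the question wording and the helper name `to_training_record` are assumptions made for illustration only.

```python
import json


def to_training_record(title: str, ingredients: list[str], directions: list[str]) -> dict:
    """Turn one recipe into an instruction/response pair (hypothetical format)."""
    instruction = f"How do I make {title}?"
    ingredient_lines = "\n".join(f"- {item}" for item in ingredients)
    step_lines = "\n".join(
        f"{number}. {step}" for number, step in enumerate(directions, start=1)
    )
    response = f"Ingredients:\n{ingredient_lines}\n\nDirections:\n{step_lines}"
    return {"instruction": instruction, "context": "", "response": response}


record = to_training_record(
    "Devil Eggs",
    ["1 dozen eggs", "1 paprika"],
    ["Boil eggs on medium for 30mins.", "Then cool."],
)
# One JSON object per line (JSONL) is a common on-disk format for such records.
print(json.dumps(record))
```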

**Integrating the Chatbot Functionality:**

We will develop the recipe helper chatbot by incorporating the trained Dolly-v2-3b model into our application. We will build an intuitive user interface that lets users enter recipe queries and receive generated answers based on the chatbot's understanding of their natural-language input.
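
To make the query-to-answer flow more concrete, here is a minimal sketch of how a user question and a few retrieved recipe chunks could be combined into a single prompt for the model. The template wording, the `max_chars` budget, and the function name `build_prompt` are all hypothetical; the notebook does not specify a prompt format.

```python
def build_prompt(question: str, retrieved_chunks: list[str], max_chars: int = 2000) -> str:
    """Combine retrieved recipe text and the user's question into one prompt."""
    context_parts = []
    used = 0
    for chunk in retrieved_chunks:
        # Stop adding context once the character budget would be exceeded.
        if used + len(chunk) > max_chars:
            break
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(context_parts) if context_parts else "(no recipes found)"
    return (
        "Use the recipes below to answer the question. "
        "If they do not contain the answer, say so.\n\n"
        f"Recipes:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How long do I bake the apple cake?",
    ["title: Jewish Apple Cake\ndirections: Bake at 350 for 1 1/2 to 2 hours."],
)
print(prompt)
```

Capping the context by characters is a crude stand-in for a token budget; a real application would likely count tokens with the model's own tokenizer.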

In [1]:
!pip install transformers accelerate langchain sentence-transformers chromadb

Collecting langchain
  Downloading langchain-0.0.202-py3-none-any.whl (1.0 MB)
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting chromadb
  Downloading chromadb-0.3.26-py3-none-any.whl (123 kB)
Collecting langchainplus-sdk>=0.0.9 (from langchain)
  Downloading langchainplus_sdk-0.0.10-py3-none-any.whl (21 kB)
Collecting openapi-schema-pydantic<2.0,>=1.2 (from langchain)
  Downloading openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)

In [2]:
data_path = "/kaggle/input/recipenlg/RecipeNLG_dataset.csv"
display(data_path)

'/kaggle/input/recipenlg/RecipeNLG_dataset.csv'

In [3]:
import pandas as pd
recipe_df = pd.read_csv(data_path)
recipe_df.tail()

| | Unnamed: 0 | title | ingredients | directions | link | source | NER |
|---|---|---|---|---|---|---|---|
| 2231137 | 2231137 | Sunny's Fake Crepes | ["1/2 cup chocolate hazelnut spread (recommend... | ["Spread hazelnut spread on 1 side of each tor... | www.foodnetwork.com/recipes/sunny-anderson/sun... | Recipes1M | ["chocolate hazelnut spread", "tortillas", "bu... |
| 2231138 | 2231138 | Devil Eggs | ["1 dozen eggs", "1 paprika", "1 salt and pepp... | ["Boil eggs on medium for 30mins.", "Then cool... | cookpad.com/us/recipes/355411-devil-eggs | Recipes1M | ["eggs", "paprika", "salt", "choice", "miracle... |
| 2231139 | 2231139 | Extremely Easy and Quick - Namul Daikon Salad | ["150 grams Daikon radish", "1 tbsp Sesame oil... | ["Julienne the daikon and squeeze out the exce... | cookpad.com/us/recipes/153324-extremely-easy-a... | Recipes1M | ["radish", "Sesame oil", "White sesame seeds",... |
| 2231140 | 2231140 | Pan-Roasted Pork Chops With Apple Fritters | ["1 cup apple cider", "6 tablespoons sugar", "... | ["In a large bowl, mix the apple cider with 4 ... | cooking.nytimes.com/recipes/1015164 | Recipes1M | ["apple cider", "sugar", "kosher salt", "bay l... |
| 2231141 | 2231141 | Polpette in Spicy Tomato Sauce | ["1 pound ground veal", "1/2 pound sweet Itali... | ["Preheat the oven to 350.", "In a bowl, mix t... | www.foodandwine.com/recipes/polpette-spicy-tom... | Recipes1M | ["ground veal", "sausage", "bread crumbs", "mi... |


In [4]:
# Get the number of rows and columns
num_rows = recipe_df.shape[0]
num_columns = recipe_df.shape[1]

print("Number of rows:", num_rows)
print("Number of columns:", num_columns)

Number of rows: 2231142
Number of columns: 7


In [5]:
# Take a random sample of the DataFrame
sample_df = recipe_df.sample(n=500)  # Change the 'n' value to the desired sample size

# Save the sampled data to a new CSV file
sample_df.to_csv('/kaggle/working/sample_recipe.csv', index=False)
print(sample_df)

         Unnamed: 0                             title  \
1669807     1669807  Veal Sauteed With Peppers Recipe   
1202411     1202411            Holiday Corn Casserole   
729983       729983                 Jewish Apple Cake   
1846502     1846502    Spaghetti con Pomodoro e Tonno   
1347401     1347401           Nadia'S Eggplant Curry    
...             ...                               ...   
225762       225762                     Rhubarb Sauce   
881157       881157                 Honeybee Ambrosia   
1609429     1609429                 Chocolate Pudding   
1689975     1689975       Cheesy Potato Mushroom Soup   
45779         45779                        Honey Bars   

                                               ingredients  \
1669807  ["1/4 c. sliced onions", "1/2 c. sliced mushro...   
1202411  ["1/4 cup butter", "1 cup sour cream", "1 egg"...   
729983   ["4 large apples, peeled and sliced thin", "4 ...   
1846502  ["1/4 cup plus 2 tablespoons olive oil", "1 ta...   
13474
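
Note that `recipe_df.sample(n=500)` returns different rows on every run; passing `random_state` (for example `recipe_df.sample(n=500, random_state=42)`) makes the sample reproducible. As an alternative that avoids loading all 2.2 million rows into memory, the sketch below uses reservoir sampling to pick a fixed-size sample while streaming through a CSV. The function name and the tiny in-memory CSV are illustrative assumptions, not part of the notebook.

```python
import csv
import io
import random


def reservoir_sample(rows, k: int, seed: int = 42) -> list:
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = row
    return sample


# Tiny in-memory CSV standing in for RecipeNLG_dataset.csv.
csv_text = "title\n" + "\n".join(f"Recipe {i}" for i in range(100))
reader = csv.DictReader(io.StringIO(csv_text))
picked = reservoir_sample(reader, k=5)
print([row["title"] for row in picked])
```

With a real file, the same function could be fed `csv.DictReader(open(path, newline=""))`, so only `k` rows are held in memory at any time.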

In [6]:
from langchain.embeddings import HuggingFaceEmbeddings

# Download the sentence-embedding model from Hugging Face
hf_embed = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']


In [7]:
from langchain.document_loaders.csv_loader import CSVLoader

# Load the sampled CSV; CSVLoader creates one Document per row
sample_data_path = '/kaggle/working/sample_recipe.csv'
loader = CSVLoader(file_path=sample_data_path)

recipes_data = loader.load()

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# chunk_size and chunk_overlap are measured in characters (length_function=len);
# tune them for your documents and use case.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100, length_function=len
)
texts = text_splitter.split_documents(recipes_data)
len(texts)

824
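
The 500 sampled rows become 824 chunks because any row longer than 1000 characters is split into several overlapping pieces. The sketch below shows the basic idea of fixed-size chunking with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which this hypothetical `chunk_text` helper ignores.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    chunks = []
    step = chunk_size - chunk_overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


long_recipe = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(long_recipe)
print([len(piece) for piece in pieces])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.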

In [9]:
texts[2]

Document(page_content='Unnamed: 0: 729983\ntitle: Jewish Apple Cake\ningredients: ["4 large apples, peeled and sliced thin", "4 tsp. cinnamon", "5 Tbsp. sugar", "3 c. flour, unsifted", "3 tsp. baking powder", "2 1/2 tsp. vanilla", "2 1/2 c. sugar", "1/2 tsp. salt", "1/2 c. orange juice", "1 c. oil", "4 eggs"]\ndirections: ["Mix apples, cinnamon and 5 tablespoons sugar; let stand.", "In a separate bowl mix flour, baking powder, vanilla, 2 1/2 cups sugar, salt, orange juice, oil and eggs.", "Beat well.", "In a greased tube pan put a layer of dough, then apples, alternating, ending with apples.", "Bake at 350\\u00b0 for 1 1/2 to 2 hours.", "Serve warm or cold."]\nlink: www.cookbooks.com/Recipe-Details.aspx?id=602119\nsource: Gathered\nNER: ["apples", "cinnamon", "sugar", "flour", "baking powder", "vanilla", "sugar", "salt", "orange juice", "oil", "eggs"]', metadata={'source': '/kaggle/working/sample_recipe.csv', 'row': 2})

In [10]:
from langchain.vectorstores import Chroma
vector_db_store = '/kaggle/working/'

vector_db = Chroma.from_documents(documents=texts, 
                                 embedding=hf_embed,
                                 persist_directory=vector_db_store)
vector_db.persist()
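
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query embedding. The toy sketch below ranks documents by cosine similarity. The three-dimensional vectors are made up for illustration (the `all-mpnet-base-v2` model produces 768-dimensional embeddings), and Chroma's actual index and distance metric may differ from this brute-force version.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k (doc_id, score) pairs most similar to query_vec, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings standing in for real model output.
toy_store = {
    "apple cake": [0.9, 0.1, 0.0],
    "corn casserole": [0.1, 0.9, 0.0],
    "eggplant curry": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.0], toy_store, k=1))
```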

In [11]:
! zip -r data.zip /kaggle/working/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  adding: kaggle/working/ (stored 0%)
  adding: kaggle/working/chroma-embeddings.parquet (deflated 19%)
  adding: kaggle/working/sample_recipe.csv (deflated 70%)
  adding: kaggle/working/chroma-collections.parquet (deflated 50%)
  adding: kaggle/working/__notebook__.ipynb (deflated 84%)
  adding: kaggle/working/index/ (stored 0%)
  adding: kaggle/working/index/index_2f53a0ef-5430-4ede-8773-5236671f5e19.bin (deflated 10%)
  adding: kaggle/working/index/index_metadata_2f53a0ef-5430-4ede-8773-5236671f5e19.pkl (deflated 14%)
  adding: kaggle/working/index/id_to_uuid_2f53a0ef-5430-4ede-8773-5236671f5e19.pkl (deflated 36%)
  adding: kaggle/working/index/uuid_to_id_2f53a0ef-5430-4ede-8773-5236671f5e19.pkl 
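
The `tokenizers` warning above appears because the `!` shell escape forks the notebook process after the embedding model has already used parallelism. One way to sidestep it, sketched here under that assumption, is to build the archive in-process with the standard library instead of calling `zip`. The `archive_directory` helper and the temporary directory layout are hypothetical stand-ins for `/kaggle/working/`.

```python
import os
import shutil
import tempfile
import zipfile

# Setting this explicitly (ideally before the embedding model is first used)
# is the remedy the warning itself suggests for later subprocess calls.
os.environ["TOKENIZERS_PARALLELISM"] = "false"


def archive_directory(source_dir: str, archive_base: str) -> str:
    """Zip source_dir into archive_base + '.zip' and return the archive path."""
    return shutil.make_archive(archive_base, "zip", root_dir=source_dir)


with tempfile.TemporaryDirectory() as tmp:
    # Small stand-in for the notebook's working directory.
    source = os.path.join(tmp, "working")
    os.makedirs(os.path.join(source, "index"))
    with open(os.path.join(source, "sample_recipe.csv"), "w") as handle:
        handle.write("title\nJewish Apple Cake\n")

    # Write the archive outside the source directory so it does not include itself.
    archive_path = archive_directory(source, os.path.join(tmp, "data"))
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()
    print(os.path.basename(archive_path), archived_names)
```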

In [12]:
!ls

__notebook__.ipynb	    chroma-embeddings.parquet  index
chroma-collections.parquet  data.zip		       sample_recipe.csv


In [13]:
from IPython.display import FileLink
FileLink(r'data.zip')

In [14]:
# Reload the persisted vector store from the Kaggle input dataset at /kaggle/input/recipedb
persist_directory = '/kaggle/input/recipedb'
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=hf_embed)
vectordb.get()

{'ids': ['04987bea-0c4c-11ee-9b48-0242ac130202',
  '04987a0a-0c4c-11ee-9b48-0242ac130202',
  '04987a6e-0c4c-11ee-9b48-0242ac130202',
  '04987aa0-0c4c-11ee-9b48-0242ac130202',
  '04987ac8-0c4c-11ee-9b48-0242ac130202',
  '04987afa-0c4c-11ee-9b48-0242ac130202',
  '04987b2c-0c4c-11ee-9b48-0242ac130202',
  '04987b5e-0c4c-11ee-9b48-0242ac130202',
  '04987b86-0c4c-11ee-9b48-0242ac130202',
  '04987bb8-0c4c-11ee-9b48-0242ac130202',
  '049879d8-0c4c-11ee-9b48-0242ac130202',
  '04987c12-0c4c-11ee-9b48-0242ac130202',
  '04987c44-0c4c-11ee-9b48-0242ac130202',
  '04987c76-0c4c-11ee-9b48-0242ac130202',
  '04987ca8-0c4c-11ee-9b48-0242ac130202',
  '04987cd0-0c4c-11ee-9b48-0242ac130202',
  '04987d02-0c4c-11ee-9b48-0242ac130202',
  '04987d34-0c4c-11ee-9b48-0242ac130202',
  '04987d66-0c4c-11ee-9b48-0242ac130202',
  '04987d8e-0c4c-11ee-9b48-0242ac130202',
  '049877da-0c4c-11ee-9b48-0242ac130202',
  '04987636-0c4c-11ee-9b48-0242ac130202',
  '0498765e-0c4c-11ee-9b48-0242ac130202',
  '04987690-0c4c-11ee-9b48-