# Turkish Crowdfunding RAG Pipeline

This notebook demonstrates how to build a **Retrieval-Augmented Generation (RAG)** system using **LlamaIndex** for a Turkish Crowdfunding dataset.

## Steps Covered:
1. **Load Data**: Read the CSV dataset using Pandas.
2. **Create Documents**: Convert data rows into LlamaIndex Document objects using correct Turkish column names.
3. **Ingestion Pipeline**: Chunk text using a SentenceSplitter.
4. **Vector Index**: Create embeddings using a local HuggingFace model and build an index.
5. **Configure LLM**: Set up a local LLM (Ollama) or OpenAI for answering queries.
6. **Query Engine**: Ask questions about the dataset.

### Prerequisites
Ensure you have the necessary libraries installed. Uncomment and run the cell below if needed.

In [None]:
# !pip install pandas llama-index llama-index-embeddings-huggingface llama-index-llms-ollama

In [1]:
import pandas as pd
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core import Settings

  from .autonotebook import tqdm as notebook_tqdm


## Step 1: Load the Dataset

We use **Pandas** to read the `turkish_crowdfunding_data.csv` file. This allows us to inspect the data before processing.

In [2]:
# Load the dataset
csv_path = "turkish_crowdfunding_data.csv"
df = pd.read_csv(csv_path)

# Display the first few rows to understand the structure
df.head()

Unnamed: 0,id,platform_adi,kitle_fonlamasi_turu,kategori,fon_sekli,proje_adi,proje_sahibi,proje_sahibi_cinsiyet,kac_proje_destekledi,kac_projeye_abone,...,sosyal_medya,sm_sayisi,sm_takipci,etiket_sayisi,icerik_kelime_sayisi,proje_aciklamasi,hedef_miktari,toplanan_tutar,destek_orani,basari_durumu
0,1,fongogo,ödül,diğer,ya hep ya hiç,Gerçek Gizlidir Filmleri,Lob Ekibi,belirsiz,1.0,0.0,...,var,3.0,274.0,0.0,301.0,Türkiye’de ilk kez her aşaması ‘online’ olarak...,40000.0,54410.0,136%,başarılı
1,2,fongogo,ödül,diğer,ya hep ya hiç,Fongogo - Hayat Bulsun!,Fongogo Team,belirsiz,3.0,0.0,...,var,3.0,5634.0,0.0,167.0,Fongogo ekibi olarak büyümek,,,,
2,3,fongogo,ödül,diğer,ya hep ya hiç,PAYLAŞIMIN,,,,,...,,,,,,,,,,
3,4,fongogo,ödül,diğer,ya hep ya hiç,Türk Kadının 100.yıl Seyri,Deniz Tutkusu Seyirde,kadın,0.0,0.0,...,var,2.0,3338.0,5.0,287.0,‘Sözde değil Özde İzindeyiz ‘ sloganıyla Atatü...,35000.0,35450.0,101%,başarılı
4,5,fongogo,ödül,diğer,ya hep ya hiç,Doğadaki Çocuk,Seren Kutadgu,kadın,2.0,0.0,...,var,1.0,144.0,3.0,571.0,İstanbul'daki çocuk yuvalarında yaşayan dezava...,30000.0,31250.0,104%,başarılı


## Step 2: Convert to LlamaIndex Documents

LlamaIndex requires data to be in `Document` format. We will iterate through the DataFrame and create a `Document` for each row, combining relevant columns into the text payload.

In [3]:
documents = []

for index, row in df.iterrows():
    # Combine relevant fields into a single text block using the correct Turkish column names
    text_content = (
        f"Project Name: {row.get('proje_adi', 'N/A')}\n"
        f"Description: {row.get('proje_aciklamasi', 'N/A')}\n"
        f"Category: {row.get('kategori', 'N/A')}\n"
        f"Funding Goal: {row.get('hedef_miktari', 'N/A')}\n"
        f"Status: {row.get('basari_durumu', 'N/A')}"
    )
    
    # Create a Document object
    doc = Document(text=text_content, metadata={"row_id": index})
    documents.append(doc)

print(f"Created {len(documents)} documents.")

Created 1711 documents.


## Step 3: Ingestion Pipeline with SentenceSplitter

An `IngestionPipeline` allows us to apply transformations to our documents. Here we use `SentenceSplitter` to break large text into smaller, manageable chunks for embedding.

In [4]:
# Initialize the pipeline with a splitter
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=20)
    ]
)

# Run the pipeline to process documents into nodes
nodes = pipeline.run(documents=documents)

print(f"Processed {len(nodes)} nodes from {len(documents)} documents.")

Processed 1711 nodes from 1711 documents.


## Step 4: Build Vector Index with Local Embeddings

We use a local embedding model from HuggingFace to convert our text nodes into vector representations. This avoids external API calls for embeddings.

In [5]:
# Set the embedding model to a local HuggingFace model
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

# Create the VectorStoreIndex from the processed nodes
index = VectorStoreIndex(nodes)

print("Vector Index built successfully!")

Vector Index built successfully!


## Step 5: Configure LLM

To avoid memory issues, we will use a smaller, efficient model like **Llama 3.2 1B**.
**Important**: You must run the following command in your terminal before executing this cell:
`ollama pull llama3.2:1b`

In [6]:
# Configured for local usage with Ollama
Settings.llm = Ollama(model="llama3.2:1b", request_timeout=360.0)


## Step 6: Query Engine

Now that the data is loaded correctly, we can ask relevant questions.

In [9]:
# Create the query engine
query_engine = index.as_query_engine()

# Ask the specific question: "Which projects focus on education?"
# Note: Or you can ask "What is the dataset about?"
response = query_engine.query("What are the most important details about this dataset?")

# Print the response
print("Response:")
print(response)

Response:
Projeketin(status="nan") için funding_goal (n) olarak birincilik derecesi 0'a sahip olmasıdır.
