In [None]:
%pip install -U datasets huggingface_hub fsspec

Collecting fsspec
  Using cached fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)


In [None]:
from functools import lru_cache

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

@lru_cache(maxsize=None)
def load_resources(model_name):
    """Load the tokenizer and model once per model name. Returns: (tokenizer, model)"""
    return (
        AutoTokenizer.from_pretrained(model_name),
        AutoModelForQuestionAnswering.from_pretrained(model_name)
    )

def tokenize_input(tokenizer, question, context):
    """Encode the (question, context) pair as PyTorch tensors. Returns: inputs (dict)"""
    return tokenizer(question, context, return_tensors="pt")

def run_model(model, inputs):
    """Forward pass without gradient tracking. Returns: outputs (ModelOutput)"""
    with torch.no_grad():
        return model(**inputs)

def process_outputs(tokenizer, inputs, outputs):
    """Decode the tokens between the most likely start and end. Returns: answer (str)"""
    answer_start = outputs.start_logits.argmax()
    answer_end = outputs.end_logits.argmax() + 1  # +1 because the slice end is exclusive
    return tokenizer.decode(inputs["input_ids"][0][answer_start:answer_end])

def answer_question(question, context, model_name="distilbert-base-cased-distilled-squad"):
    """Complete pipeline with chained function outputs"""
    tokenizer, model = load_resources(model_name)  # cached after the first call
    inputs = tokenize_input(tokenizer, question, context)
    outputs = run_model(model, inputs)
    return process_outputs(tokenizer, inputs, outputs)

In [None]:
question = "What is the capital of Brazil?"
context = "Brasília is the capital of Brazil."
answer_question(question, context)

'Brasília'
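
`process_outputs` takes the argmax of the start and end logits independently, so the end can land before the start and the decoded slice comes back empty. A common remedy is to score (start, end) pairs jointly. The following is a minimal sketch on plain lists of logits; `best_span` and `max_answer_len` are hypothetical names that are not part of the notebook or of `transformers`.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit.

    Only pairs with start <= end < start + max_answer_len are considered.
    `end` is inclusive, so slice with [start:end + 1].
    """
    best_score = float("-inf")
    best = (0, 0)
    for start, start_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

# Independent argmaxes would give start=3 and end=1 here, i.e. an empty slice.
start_logits = [0.1, 0.2, 0.3, 5.0]
end_logits = [0.0, 4.0, 0.5, 3.0]
print(best_span(start_logits, end_logits))  # (3, 3)
```

With the model above, the inputs would be `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`.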

In [None]:
%pip install transformers sentence-transformers faiss-cpu torch



In [None]:
%pip install llama-index



In [None]:
# === 1. LOAD THE DOCUMENT AND SPLIT IT INTO CHUNKS ===
'''
def split_text(text, max_length=300):
    sentences = text.split('. ')
    chunks, current = [], ""
    for sent in sentences:
        if len(current) + len(sent) < max_length:
            current += sent + ". "
        else:
            chunks.append(current.strip())
            current = sent + ". "
    if current:
        chunks.append(current.strip())
    return chunks
'''
file = '/content/drive/MyDrive/RAG/test_1.txt'
with open(file, "r", encoding="utf-8") as f:
    document = f.read()


from llama_index.core.node_parser import SentenceSplitter

llama_splitter = SentenceSplitter(chunk_size=300, chunk_overlap=50)
chunks = llama_splitter.split_text(document)
print(f"[LlamaIndex] {len(chunks)} chunks")

[LlamaIndex] 2 chunks
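
For comparison, the chunking idea can be sketched without LlamaIndex. This is only an illustration under assumptions: it measures chunks in characters, has no overlap, and keeps a sentence whole even when it is longer than `max_length`, whereas `SentenceSplitter`, as far as I know, counts tokens and supports overlap. The regex-based sentence boundary is also a simplification.

```python
import re

def split_text(text, max_length=300):
    """Greedily pack whole sentences into chunks of roughly max_length characters."""
    # Split after ., ! or ? followed by whitespace, keeping the punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}".strip()
        if current and len(candidate) > max_length:
            chunks.append(current)  # current chunk is full, start a new one
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "One. Two two. Three three three."
print(split_text(sample, max_length=14))
# ['One. Two two.', 'Three three three.']
```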


In [None]:
# === 2. GENERATE THE EMBEDDINGS ===
from sentence_transformers import SentenceTransformer

model_embeddings = SentenceTransformer("all-MiniLM-L6-v2")  # faster than distilbert-base-uncased
embeddings = model_embeddings.encode(chunks, convert_to_numpy=True, normalize_embeddings=True)


In [None]:
# === 3. CREATE THE FAISS INDEX ===
import faiss
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
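
`IndexFlatL2` performs an exhaustive search and reports squared L2 distances. The sketch below reproduces that behaviour in plain Python on toy vectors, purely to show what the index computes; `flat_l2_search` is a hypothetical helper, not a FAISS function.

```python
def flat_l2_search(vectors, query, k):
    """Brute-force k-nearest-neighbour search by squared L2 distance.

    Returns (distances, indices), both ordered from nearest to farthest.
    """
    scored = [
        (sum((v_i - q_i) ** 2 for v_i, q_i in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [dist for dist, _ in top], [idx for _, idx in top]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # all unit length
distances, indices = flat_l2_search(vectors, [0.0, 1.0], k=2)
print(indices)  # [1, 2]
```

For unit-length vectors the squared L2 distance equals 2 - 2·cos(a, b), so ranking by L2 on normalized embeddings gives the same order as ranking by cosine similarity.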

In [None]:
query = "What is often considered the core ideal driving innovation and entrepreneurship in American culture?"
# Normalize the query the same way as the chunk embeddings
query_embedding = model_embeddings.encode([query], convert_to_numpy=True, normalize_embeddings=True)

k = 1  # number of relevant chunks to retrieve
distances, indices = index.search(query_embedding, k)

relevant_chunks = [chunks[i] for i in indices[0]]
relevant_chunks

['Supportive legal frameworks, including strong intellectual property rights, further incentivize innovation by protecting inventors and their creations.\n\nMoreover, the diverse population of the U.S. is a continuous source of varied perspectives and problem-solving approaches. Immigrants, bringing their unique experiences and talents, have historically been disproportionately represented among successful innovators and entrepreneurs. This "melting pot" of ideas often leads to cross-pollination and novel solutions. Finally, a pragmatic approach to problem-solving, often favoring practical solutions over theoretical dogma, ensures that innovations frequently address real-world needs and find broad adoption.\n\nIn essence, American innovation is a complex interplay of systemic support (education, funding, legal protection), cultural values (risk-taking, resilience), and demographic strengths (diversity), all contributing to its enduring influence.']

In [None]:
def search_faiss_chunks(
    query: str,
    model_embeddings,
    index,
    chunks,
    k: int = 5
):
    """
    Search a FAISS index and return the relevant chunks without duplicates.

    Args:
        query (str): The question or search text.
        model_embeddings: Model used to embed the query (e.g. SentenceTransformer).
        index: FAISS index that has already been trained/populated.
        chunks (List[str]): The original texts (in the same order as the embeddings).
        k (int): Maximum number of relevant results.

    Returns:
        List[str]: Relevant texts without duplicates.
    """
    # Normalize the query the same way as the chunk embeddings
    query_embedding = model_embeddings.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    k = min(k, index.ntotal)  # Avoid asking for more results than the index holds
    distances, indices = index.search(query_embedding, k)

    # Filter out invalid indices (-1)
    valid_indices = [i for i in indices[0] if i != -1]

    # Fetch the chunks and remove duplicates while preserving order
    relevant_chunks = [chunks[i] for i in valid_indices]
    relevant_chunks = list(dict.fromkeys(relevant_chunks))

    return relevant_chunks

In [None]:
qa_pairs = {
    "What kind of cultural outlook has consistently altered industries and societies globally in the U.S.?":
        "American innovation",

    "What key element promotes this creative impulse, especially within the nation's academic institutions?":
        "the nation's robust educational system",

    "How does the attitude towards mistakes contribute to progress in American innovation?":
        "a culture that tolerates failure as a learning opportunity rather than a definitive setback contributes significantly.",

    "Which financial system is essential for transforming new concepts into marketable products and services?":
        "The venture capital ecosystem",

    "What type of legal protections motivate inventors by safeguarding their creations?":
        "strong intellectual property rights",

    "How does the varied populace in the U.S. contribute to novel problem-solving?":
        "The diverse population of the U.S. is a continuous source of varied perspectives and problem-solving approaches.",

    "What is the central concept that underpins American innovation according to the text?":
        "a profound cultural mindset",

    "What is the term for the inclination to start new businesses that is strongly encouraged?":
        "entrepreneurship",

    "What ensures that novelties frequently address practical requirements?":
        "a pragmatic approach to problem-solving, often favoring practical solutions over theoretical dogma",

    "What is the common term for financial support provided to high-potential new companies?":
        "venture capital"
}

In [None]:
# --- Evaluation loop over the question/answer pairs ---
for question, expected_answer in qa_pairs.items():
    print(f"Question: {question}")
    # Retrieve the single most relevant chunk and run the extractive QA model on it
    top_chunk = search_faiss_chunks(question, model_embeddings, index, chunks, 1)[0]
    predicted_answer = answer_question(question, top_chunk)

    print(f"Model answer: {predicted_answer}")
    print(f"Expected answer: {expected_answer}")
    print("-" * 30)  # Separator between outputs

Question: What kind of cultural outlook has consistently altered industries and societies globally in the U.S.?
Model answer: risk - taking, resilience
Expected answer: American innovation
------------------------------
Question: What key element promotes this creative impulse, especially within the nation's academic institutions?
Model answer: entrepreneurship
Expected answer: the nation's robust educational system
------------------------------
Question: How does the attitude towards mistakes contribute to progress in American innovation?
Model answer: a complex interplay of systemic support ( education, funding, legal protection ), cultural values ( risk - taking, resilience ), and demographic strengths ( diversity ), all contributing to its enduring influence
Expected answer: a culture that tolerates failure as a learning opportunity rather than a definitive setback contributes significantly.
------------------------------
Question: Which financial system is
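
The loop above only prints the predicted and expected answers side by side. To turn the comparison into numbers, one option is SQuAD-style exact match and token-level F1. The sketch below is an approximation of those metrics, and `normalize_answer`, `exact_match` and `token_f1` are hypothetical helper names that do not appear in the notebook.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, strip ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The venture capital ecosystem", "venture capital ecosystem"))  # True
print(round(token_f1("venture capital", "The venture capital ecosystem"), 2))     # 0.8
```

Inside the loop, `token_f1(predicted_answer, expected_answer)` could be collected into a list and averaged at the end.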

In [None]:
search_faiss_chunks('What is often considered the core ideal driving innovation and entrepreneurship in American culture?',model_embeddings,index,chunks,1)[0]

'Supportive legal frameworks, including strong intellectual property rights, further incentivize innovation by protecting inventors and their creations.\n\nMoreover, the diverse population of the U.S. is a continuous source of varied perspectives and problem-solving approaches. Immigrants, bringing their unique experiences and talents, have historically been disproportionately represented among successful innovators and entrepreneurs. This "melting pot" of ideas often leads to cross-pollination and novel solutions. Finally, a pragmatic approach to problem-solving, often favoring practical solutions over theoretical dogma, ensures that innovations frequently address real-world needs and find broad adoption.\n\nIn essence, American innovation is a complex interplay of systemic support (education, funding, legal protection), cultural values (risk-taking, resilience), and demographic strengths (diversity), all contributing to its enduring influence.'
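
Because the reader model can only extract spans from the chunk it is given, it can help to check the retriever separately from the reader. The sketch below counts how often the expected answer appears verbatim in the retrieved chunks. `retrieval_hit_rate` is a hypothetical helper, the toy data is made up for illustration, and verbatim matching is a strict assumption that will miss paraphrased answers.

```python
def retrieval_hit_rate(qa_pairs, retrieve):
    """Fraction of questions whose expected answer occurs (case-insensitively)
    in at least one chunk returned by retrieve(question)."""
    if not qa_pairs:
        return 0.0
    hits = 0
    for question, expected in qa_pairs.items():
        retrieved = retrieve(question)
        if any(expected.lower() in chunk.lower() for chunk in retrieved):
            hits += 1
    return hits / len(qa_pairs)

# Toy stand-in for the retriever: always returns the same single chunk.
toy_chunks = ["The venture capital ecosystem funds new companies."]
toy_pairs = {
    "Which financial system funds new companies?": "venture capital",
    "What protects inventors?": "intellectual property rights",
}
print(retrieval_hit_rate(toy_pairs, lambda question: toy_chunks))  # 0.5
```

With the notebook's objects, the call would be `retrieval_hit_rate(qa_pairs, lambda q: search_faiss_chunks(q, model_embeddings, index, chunks, 1))`.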