## Follow the instructions to request access. (You'll need a Hugging Face account.)

1.  ### Open https://huggingface.co/
2.  ### Sign in to your account

3.  ### Get access to Llama 3
*   Go to the model page: https://huggingface.co/meta-llama/Meta-Llama-3-8B
*   Click on "Access repository"
*   Accept the terms of use
*   Wait for approval (usually quick)


4.   ### You will need your Hugging Face token.
*   To get it, go to "Settings"
*   Click "Access Tokens" on the left
*   Click "New token"
*   Give it a name like "llama_access" and select "Read" access ("Write" access is not needed just to download a model)
*   Copy the token (it looks like "hf_xxxxxxxxxxxxx")
*   Paste the token you copied between the quotes of `access_token = ""` in the code.
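
If you would rather not paste the token into the notebook itself, one option is to keep it in an environment variable and read it at run time. The sketch below assumes the variable is named `HF_TOKEN`; the helper `read_hf_token` is a hypothetical name used only for illustration, not part of any library.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or raise if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    # The example token above starts with "hf_"; warn (do not fail) otherwise.
    if not token.startswith("hf_"):
        print(f"Warning: the value of {env_var} does not start with 'hf_'.")
    return token
```

With the variable set, `access_token = read_hf_token()` could stand in for the hard-coded string.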


5.   ### To try a different model
*   Change `model_id` in your code to the ID of any other model on Hugging Face.
*   For Llama 2: "meta-llama/Llama-2-7b-chat-hf"
*   For Llama 3: "meta-llama/Meta-Llama-3-8B"
*   For other models: use the model ID shown on the model's Hugging Face page, after you have been granted any required access.
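
One way to make switching less error-prone is to keep the IDs in a small lookup table. This is only a sketch: the short keys (`"llama2"`, `"llama3"`) and the helper `pick_model_id` are made up for this example, while the model IDs are the two listed above.

```python
# Model IDs from the list above; the short keys are invented for this sketch.
MODEL_IDS = {
    "llama2": "meta-llama/Llama-2-7b-chat-hf",
    "llama3": "meta-llama/Meta-Llama-3-8B",
}


def pick_model_id(name):
    """Map a short key to a full model ID; pass through full 'org/model' IDs."""
    if name in MODEL_IDS:
        return MODEL_IDS[name]
    if "/" in name:
        # Looks like a full Hugging Face model ID already, so use it as given.
        return name
    raise ValueError(f"Unknown model key: {name!r}")


model_id = pick_model_id("llama2")
print(model_id)
```

Remember that each gated model needs its own access approval before it can be downloaded.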


6.  ### Requirements before you run the notebook:

* Python 3.8 or newer
* The `transformers` and `torch` libraries
* Hugging Face account with approved access
* Memory for Llama 3 8B: at least 16GB RAM
* Storage (SSD/HDD) for Llama 3 8B: ~15GB free space
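
A quick pre-flight check along these lines might look like the sketch below. The function names and default thresholds are assumptions chosen to mirror the list above. RAM is not checked, because the Python standard library has no portable call for it. The second helper shows the arithmetic behind the size figures, assuming roughly 8 billion parameters stored as 16-bit (2-byte) weights; 32-bit weights would need twice as much.

```python
import importlib.util
import shutil
import sys


def check_requirements(path=".", min_free_gb=15, min_python=(3, 8)):
    """Return simple pass/fail checks for the requirements listed above."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python_ok": sys.version_info >= min_python,
        "transformers_installed": importlib.util.find_spec("transformers") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "disk_ok": free_gb >= min_free_gb,
    }


def weights_size_gb(num_params, bytes_per_param):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, ok in check_requirements().items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")

# Assumed: about 8 billion parameters at 2 bytes each.
print(f"Estimated weights: {weights_size_gb(8e9, 2):.1f} GB")
```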


7. ### Run the cells in the notebook

In [None]:
access_token = ""

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B"
cache_dir = "./llama3_cache"  # downloaded files are stored here
# access_token comes from the cell above; resetting it to "" here would discard it

try:
    # Both calls download the files on first use, then load them from cache_dir
    tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir, token=access_token)
    model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir=cache_dir, token=access_token)
    print("Model and tokenizer loaded successfully!")
except Exception as e:
    print(f"Error loading model: {e}")

- ### The cache_dir "./llama3_cache" is a local folder where the model and tokenizer files are downloaded and stored.

- ### The tokenizer converts human-readable text into numeric token IDs that the model takes as input, and converts the model's output IDs back into text.
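
To make the idea concrete, here is a deliberately simplified toy tokenizer. It is not how the Llama tokenizer works (real tokenizers split text into subword pieces from a large learned vocabulary); the class `ToyTokenizer` and its word-level vocabulary are invented purely to show the text-to-IDs-to-text round trip.

```python
class ToyTokenizer:
    """Word-level toy tokenizer; real Llama tokenizers use subword pieces."""

    def __init__(self, corpus):
        words = sorted(set(corpus.split()))
        # ID 0 is reserved for words that are not in the vocabulary.
        self.word_to_id = {word: i + 1 for i, word in enumerate(words)}
        self.id_to_word = {i: word for word, i in self.word_to_id.items()}

    def encode(self, text):
        """Turn text into a list of integer IDs."""
        return [self.word_to_id.get(word, 0) for word in text.split()]

    def decode(self, ids):
        """Turn a list of integer IDs back into text."""
        return " ".join(self.id_to_word.get(i, "<unk>") for i in ids)


toy = ToyTokenizer("the key to life is learning")
ids = toy.encode("the key to life")
print(ids)              # a list of integers, one per word
print(toy.decode(ids))  # the original text again
```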




In [None]:
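# Simple function to generate responses
def get_llama_response(prompt_text, max_tokens=75):
    """Return the prompt plus up to max_tokens newly generated tokens, as text."""
    # Tokenize the prompt into PyTorch tensors
    inputs = tokenizer(prompt_text, return_tensors="pt")
    # Generate a continuation of at most max_tokens new tokens
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    # Decode the full sequence (prompt + continuation) back to text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)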


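The function `get_llama_response` takes your text input (`prompt_text`), feeds it to the model, and returns the generated text. Because the whole output sequence is decoded, the returned string starts with your prompt, followed by the model's continuation. `max_tokens` caps the number of newly generated tokens.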
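
Decoding `outputs[0]` yields the prompt followed by the continuation. If only the newly generated part is wanted, a possible variant is to slice off the prompt tokens before decoding. The function name `get_continuation` is hypothetical, and it takes the tokenizer and model as arguments so it does not depend on notebook globals.

```python
def get_continuation(tokenizer, model, prompt_text, max_tokens=75):
    """Generate text and return only the part that follows the prompt."""
    inputs = tokenizer(prompt_text, return_tensors="pt")
    # Number of tokens in the prompt (first and only sequence in the batch)
    prompt_len = len(inputs["input_ids"][0])
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    # The output sequence begins with the prompt tokens, so skip them
    new_ids = outputs[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)
```

It would be called as `get_continuation(tokenizer, model, "The key to life is")` once the model and tokenizer are loaded.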

In [None]:
prompt = "MLOps is a set of"
print(f"Prompt: {prompt}")
print(f"Response: {get_llama_response(prompt, max_tokens=75)}")

In [None]:
prompt = "The key to life is"
print(f"Prompt: {prompt}")
print(f"Response: {get_llama_response(prompt, max_tokens=75)}")