
Text-generation-webui manual installation on Windows WSL2 / Ubuntu

Important:

  • For a simple automatic install, use the one-click installers provided in the original repo.
  • This tech is absolutely bleeding edge; methods and tools change on a daily basis, so consider this page outdated as soon as it's updated, and expect things to break regularly
  • Look for more recent tutorials on YouTube and Reddit

Advanced WSL2 Ubuntu install 2023-05-15

Based on these reddit comments, but this too will eventually be outdated again

# 1 install WSL2 on Windows 11, then:
sudo apt update
sudo apt-get install build-essential
sudo apt install git -y

# optional: install a better terminal experience, otherwise skip to step 4
# 2 install brew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
(echo; echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"') >> /home/$USER/.bashrc
eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
brew doctor

# 3 install oh-my-posh
brew install jandedobbeleer/oh-my-posh/oh-my-posh
echo "$(brew --prefix oh-my-posh)/themes"
#	copy the path and add it below to the second eval line:
sudo nano ~/.bashrc
#	add this to the end:
#		eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
#		eval "$(oh-my-posh init bash --config '/home/linuxbrew/.linuxbrew/opt/oh-my-posh/themes/atomic.omp.json')"
#		   plugins=(
#			 git
#			 # other plugins
#		   )
#	CTRL+X to end editing
#	Y to save changes
#	ENTER to finally exit
source ~/.bashrc
exec bash

# 4 install mamba instead of conda, because it's faster https://mamba.readthedocs.io/en/latest/installation.html
mkdir github
mkdir downloads
cd downloads
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-$(uname)-$(uname -m).sh

# 5 install the correct cuda toolkit 11.7, not 12.x
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
sudo sh cuda_11.7.0_515.43.04_linux.run
nano ~/.bashrc
#	add the following line, in order to add the cuda library to the environment variable
#		export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
#	after the plugins=() code block, above conda initialize
#	CTRL+X to end editing
#	Y to save changes
#	ENTER to finally exit
source ~/.bashrc
cd ..

# 6 install ooba's textgen
mamba create --name textgen python=3.10.9
mamba activate textgen
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio -f https://download.pytorch.org/whl/cu117/torch_stable.html
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

# 7 Install 4bit support through GPTQ-for-LLaMa
mkdir repositories
cd repositories
# choose ONE of the following:
# A) for fast triton https://www.reddit.com/r/LocalLLaMA/comments/13g8v5q/fastest_inference_branch_of_gptqforllama_and/
	git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b fastest-inference-4bit
# B) for triton
	git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b triton
# C) for newer cuda
	git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda
# D) for widely compatible old cuda
	git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
# groupsize, act-order, true-sequential
#	--act-order (quantizing columns in order of decreasing activation size)
#	--true-sequential (performing sequential quantization even within a single Transformer block)
#	Those fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general.
#	--groupsize
#	Currently, groupsize and act-order do not work together and you must choose one of them.
#	Ooba: There is a pytorch branch from qwop, that allows you to use groupsize and act-order together.
#	Models without group-size (better for the 7b model)
#	Models with group-size (better from 13b upwards)
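#	(illustrative example, not one of the original steps) the flags above are the ones passed to llama.py
#	when quantizing a model yourself; check the README of the branch you cloned for the exact arguments:
#		CUDA_VISIBLE_DEVICES=0 python llama.py /path/to/llama-7b-hf c4 --wbits 4 --true-sequential --act-order --save llama7b-4bit.pt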
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install
cd ..
cd ..

# 8 Test ooba with a 4bit GPTQ model
python download-model.py 4bit/WizardLM-13B-Uncensored-4bit-128g
python server.py --wbits 4 --model_type llama --groupsize 128 --chat

# 9 install llama.cpp
cd repositories
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
nano ~/.bashrc
#	add the cuda bin folder to the path environment variable in order for make to find nvcc:
#		export PATH=/usr/local/cuda/bin:$PATH
#	after the export LD_LIBRARY_PATH line
#	CTRL+X to end editing
#	Y to save changes
#	ENTER to finally exit
source ~/.bashrc
make LLAMA_CUBLAS=1
cd models
wget https://huggingface.co/TheBloke/WizardLM-13B-Uncensored-GGML/resolve/main/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin
cd ..

# 10 test llama.cpp with GPU support
./main -t 8 -m models/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: write a story about llamas ### Response:" --n-gpu-layers 30
cd ..
cd ..

# 11 prepare ooba's textgen for llama.cpp support, by compiling llama-cpp-python with cuda GPU support
pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
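
# To confirm the rebuilt wheel actually picked up cuBLAS, a quick optional check. This assumes the GGML
# file from step 9 is copied or symlinked into text-generation-webui/models, and uses flag names current
# as of mid-2023, which may have changed since:
python -c "import llama_cpp; print(llama_cpp.__version__)"
python server.py --model wizardLM-13B-Uncensored.ggmlv3.q4_0.bin --n-gpu-layers 30 --chat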

Windows 11 WSL2 Ubuntu / Native Ubuntu

Installation guide from 2023-03-01 (outdated)

Install Ubuntu WSL2 on Windows 11

  1. Press the Windows key + X and click on "Windows PowerShell (Admin)" or "Windows Terminal (Admin)" to open PowerShell or Terminal with administrator privileges.
  2. wsl --install You may be prompted to restart your computer. If so, save your work and restart.
  3. Install Windows Terminal from Windows Store
  4. Install Ubuntu on Windows Store
  5. Choose the desired Ubuntu version (e.g., Ubuntu 20.04 LTS) and click "Get" or "Install" to download and install the Ubuntu app.
  6. Once the installation is complete, click "Launch" or search for "Ubuntu" in the Start menu and open the app.
  7. When you first launch the Ubuntu app, it will take a few minutes to set up. Be patient as it installs the necessary files and sets up your environment.
  8. Once the setup is complete, you will be prompted to create a new UNIX username and password. Choose a username and password, and make sure to remember them, as you will need them for future administrative tasks within the Ubuntu environment.
  9. If you prefer to use Windows Terminal from now on, close this console and start Windows Terminal then open a new Ubuntu console by clicking the drop down icon on top of Terminal and choose Ubuntu. Otherwise stay in the existing console window.
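
To confirm the distribution is actually running under WSL2 rather than WSL1, these standard commands can be run from the same administrator PowerShell (replace Ubuntu with the name shown in the list if it differs):

wsl --list --verbose
wsl --set-version Ubuntu 2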

Install Anaconda + Build Essentials

  1. sudo apt update
  2. sudo apt upgrade
  3. sudo apt install git
  4. sudo apt install wget
  5. mkdir downloads
  6. cd downloads/
  7. wget https://repo.anaconda.com/archive/Anaconda3-2023.03-1-Linux-x86_64.sh
  8. chmod +x ./Anaconda3-2023.03-1-Linux-x86_64.sh
  9. ./Anaconda3-2023.03-1-Linux-x86_64.sh and follow the defaults
  10. sudo apt install build-essential
  11. cd ..

Install text-generation-webui

  1. conda create -n textgen python=3.10.9
  2. conda activate textgen
  3. pip3 install torch torchvision torchaudio
  4. mkdir github
  5. cd github
  6. git clone https://github.com/oobabooga/text-generation-webui
  7. cd text-generation-webui
  8. pip install -r requirements.txt
  9. pip install chardet cchardet

Build and install GPTQ

If you want to try the triton branch, skip to Newer GPTQ-Triton

Older GPTQ-Cuda fork by oobabooga

  • Works on Windows, Linux, WSL2.
  • Supports 3 & 4 bit models
  • Only supports no-act-order models
  • Slower than triton
  • Works best with --groupsize 128 --wbits 4 and no-act-order models
  1. mkdir repositories
  2. cd repositories
  3. git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda (or try the newer https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda build)
  4. cd GPTQ-for-LLaMa
  5. python -m pip install -r requirements.txt
  6. python setup_cuda.py install if this gives an error about g++, try installing the correct g++ version: conda install -y -k gxx_linux-64=11.2.0
  7. cd ../..
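
To verify that the CUDA kernel actually built and installed, a quick optional import check (the extension module is named quant_cuda, matching the prebuilt wheel referenced in the Windows section below):

python -c "import quant_cuda; print('quant_cuda ok')"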

Newer GPTQ-Triton

The triton branch of qwopqwop200's GPTQ-for-LLaMa (or fpgaminer's GPTQ-triton):

  • Works on Linux and WSL2
  • Supports 4 bit quantized models
  • Is faster than cuda
  • Works best with the --groupsize 128 --wbits 4 flags and act-order models
  1. mkdir repositories
  2. cd repositories
  3. conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
  4. git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa (or try https://github.com/fpgaminer/GPTQ-triton)
  5. cd GPTQ-for-LLaMa
  6. pip install -r requirements.txt
  7. cd ../..

AutoGPTQ to install any (Newer Cuda, Newer Triton, older Cuda)

Alternatively you can try AutoGPTQ to install cuda, older llama-cuda, or triton variants:

  1. run one of these:
  • pip install auto-gptq to install cuda branch for newer models
  • pip install auto-gptq[llama] if your transformers is outdated or you are using older models that don't support it
  • pip install auto-gptq[triton] to install triton branch for triton compatible models
  2. cd ../..

LAN port forwarding from Ubuntu WSL

If you want to open the webui from within your home network, enable port forwarding on your Windows machine with this command in an administrator terminal: netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=7860 connectaddress=localhost connectport=7860
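
To inspect or later remove that rule, the matching netsh commands are:

netsh interface portproxy show v4tov4
netsh interface portproxy delete v4tov4 listenaddress=0.0.0.0 listenport=7860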

Install bitsandbytes cuda

  • Either always run export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib before running the server.py below
  • Or try installing pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda113

Install xformers

Allows for faster, but non-deterministic inference. Optional:

  • pip install xformers
  • then use the --xformers flag later, when running the server.py below

You're done with the Ubuntu / WSL2 installation; you can skip to the Download models section.

Windows 11 native

Install Miniconda

  1. Download and install miniconda
  2. Download and install git for windows
  3. Open Anaconda Prompt (Miniconda 3) from the Start Menu

Install text-generation-webui

  1. It should load in C:\Users\yourusername>
  2. mkdir github
  3. cd github
  4. conda create --name textgen python=3.10
  5. conda activate textgen
  6. conda install pip
  7. conda install -y -k pytorch[version=2,build=py3.10_cuda11.7*] torchvision torchaudio pytorch-cuda=11.7 cuda-toolkit ninja git -c pytorch -c nvidia/label/cuda-11.7.0 -c nvidia
  8. git clone https://github.com/oobabooga/text-generation-webui.git
  9. python -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/raw/main/bitsandbytes-0.38.1-py3-none-any.whl
  10. cd text-generation-webui
  11. pip install -r requirements.txt --upgrade
  12. mkdir repositories
  13. cd repositories
  14. git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda and cd GPTQ-for-LLaMa
  15. python -m pip install -r requirements.txt
  16. python setup_cuda.py install might fail, continue with the next command if so
  17. pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/main/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl skip this command, if the previous one didn't fail
  18. cd ..\.. (go back to text-generation-webui)
  19. pip install faust-cchardet
  20. pip install chardet
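
Before moving on, it can be worth checking that the conda-installed PyTorch actually sees your GPU (a generic sanity check, not part of the original steps):

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"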

Download models

  1. Still in your terminal, make sure you are in the /text-generation-webui/ folder and type python download-model.py
  2. select other to download a custom model
  3. paste the huggingface user/directory, for example: TheBloke/wizardLM-7B-GGML and let it download the model files
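
Alternatively, pass the model name directly on the command line (as in step 8 of the WSL2 guide above) to skip the interactive menu:

python download-model.py TheBloke/wizardLM-7B-GGML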

Run

The base command to run. You have to add further flags, depending on the model and environment you want to run in:

  1. if you are on WSL2 Ubuntu, always run export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib before running the server.py
  2. python server.py --model-menu --chat
  • --model-menu to allow the change of models in the UI
  • --chat loads the chat instead of the text completion UI
  • --wbits 4 loads a 4-bit quantized model
  • --groupsize 128 if the model specifies groupsize, add this parameter
  • --model_type llama if the model name is unknown, specify its base model. If you run LLaMA-derived models like vicuna, alpaca, gpt4-x, codecapybara or wizardLM, you have to define it as llama. If you load OPT or GPT-J models, define the flag accordingly
  • --xformers if you have properly installed xformers and want faster but nondeterministic answer generation
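
Putting the flags together, a typical invocation for a 4-bit GPTQ LLaMA-family model quantized with groupsize 128 would be (adjust the flags to your model as described above):

python server.py --model-menu --chat --wbits 4 --groupsize 128 --model_type llama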

Troubleshoot

cuda lib not found

If you get a cuda lib not found error, especially on Windows WSL2 Ubuntu, try executing export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib before running the server.py above

ModuleNotFoundError: No module named 'chardet'

  1. pip install faust-cchardet
  2. pip install chardet

or the other way around. Then try to start the server again.

No GPU support on bitsandbytes

On Windows Native, try:

  • pip uninstall bitsandbytes
  • pip install git+https://github.com/Keith-Hon/bitsandbytes-windows.git
  • here are some discussions, but note that some solutions are for Windows WSL2 and some for Windows native

Or try these prebuilt wheels on Windows:

Still having problems? Try manually copying the libraries

On Linux or Windows WSL2 Ubuntu, try:

  • make sure you run export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib before running the server.py every time!
  • alternatively, you can try pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda113 and see if it works without the above command
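
If neither of those helps, the manual copy mentioned in the heading usually means overwriting the CPU-only bitsandbytes library with the matching CUDA build inside your environment's site-packages. A rough sketch, assuming a mambaforge env named textgen with Python 3.10 and CUDA 11.7 (adjust every path and version suffix to your setup):

cd ~/mambaforge/envs/textgen/lib/python3.10/site-packages/bitsandbytes/
cp libbitsandbytes_cuda117.so libbitsandbytes_cpu.so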

Install xformers prebuilt Windows wheels

  • pip install xformers==0.0.16rc425

Prebuilt GPTQ Windows Wheels (may be outdated)

Apple Silicon

Use llama.cpp, HN discussion

Resources

3rd party models

See an up-to-date list of most models you can run locally: awesome-ai open-models

Other tools

See the awesome-ai LLM section for more tools, GUIs etc.

Other resources