Final project for the CMU course "Deep Learning Systems: Algorithms and Implementation", in which we make needle work with Apple M1 chips.
To run the code in this repo (with M1 as the NDArray backend), you need a Mac with an M1 chip. Follow the setup steps below before running our code.
First, you need to install Xcode and its SDK. You can install Xcode through the Mac App Store.
To verify that Metal is installed correctly, run the following on the command line:
xcrun metal
If you run into the following error:
xcrun: error: unable to find utility "metal", not a developer tool or in PATH
run the following command to fix it:
xcode-select --switch /Applications/Xcode.app/Contents/Developer
Then, run the following in the Mac terminal (skip these if LLVM and CMake are already installed on your system):
brew install llvm
brew install cmake
We recommend using conda to install the Python packages needed to run the unit tests in this codebase. After installing conda on your system, create the conda environment for our code:
conda env create --file environment.yaml
Then run the following to activate the environment:
conda activate dlsys-needle-m1
If you prefer, you can also skip conda and use pip to install the packages listed in environment.yaml.
Some of the unit tests require CIFAR and PTB data. You can download them by running:
python3 download_data.py
NOTE: you should run download_data.py from the project root directory, because it hardcodes the data path.
The project mainly consists of two parts: the automatic differentiation framework needle (under ./needle, written in Python) and the ndarray backends (under ndarray_backend, written in C++, supporting CPU, GPU, and M1).
First compile the backend code:
make
Then install the needle package in editable mode:
pip3 install -e .
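As a quick smoke test, you can check that the M1 backend is usable from Python. This is a minimal sketch: the ndl.m1() device constructor name is our assumption (mirroring needle's cpu()/cuda() naming); adjust it if your build exposes a different name.

```python
# Minimal smoke test: run a small matmul on the M1 backend.
import numpy as np
import needle as ndl

device = ndl.m1()  # hypothetical M1 device constructor, analogous to ndl.cpu()
a = ndl.Tensor(np.random.randn(64, 64).astype("float32"), device=device)
b = ndl.Tensor(np.random.randn(64, 64).astype("float32"), device=device)
c = a @ b  # executed by the Metal (M1) ndarray backend
print(c.shape, c.device)
```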
We combined all the local tests from hw3 and hw4 and made them available for the M1 backend, except for the mugrade tests and language model training (the latter fails due to a bug in our hw4 implementation).
According to this PyTorch GitHub issue, sequential models are currently not friendly to M1 GPUs, even for PyTorch with Apple's MPS backend (they take much longer than on CPU), so we reduced seq_len, input_size, and hidden_states for the RNN and LSTM tests so that they pass within an acceptable time.
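For a rough illustration of this kind of timing comparison, here is a minimal sketch (assuming the hw4-style nn.LSTM interface and the hypothetical ndl.m1() device constructor from above; the sizes are illustrative, not the exact values used in our tests):

```python
# Time a small LSTM forward pass on CPU vs. the M1 backend.
# nn.LSTM follows the hw4 interface: input of shape (seq_len, bs, input_size).
import time
import numpy as np
import needle as ndl
import needle.nn as nn

for device in (ndl.cpu(), ndl.m1()):  # ndl.m1() is an assumed name
    model = nn.LSTM(input_size=32, hidden_size=32, num_layers=1, device=device)
    x = ndl.Tensor(np.random.randn(16, 4, 32).astype("float32"), device=device)
    start = time.time()
    out, (h, c) = model(x)
    print(f"{device}: {time.time() - start:.4f}s")
```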
Run all unit tests:
python3 -m pytest -v
Run a subset of the tests filtered by name, e.g. "m1":
python3 -m pytest -v -k "m1"
Benchmark matrix multiplication on M1 vs. CPU:
python3 apps/benchmark_matmul.py
- You should see plots comparing matrix multiplication speed on M1 vs. CPU. For matrices larger than 100x100, M1 consistently achieves a ~3x speedup over CPU. In some cases, such as a matrix size of 2500, M1 achieves a 70x speedup!
- This is likely because the tile size in our CPU matrix multiplication code is set to 8, so the CPU matmul is significantly slower for matrices whose size is not a multiple of 8 (e.g. 500, 900, 1500, 2500). The huge speedup numbers at those sizes therefore reflect the CPU being unusually slow on them, not M1 being especially fast.
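For context, the benchmark boils down to something like the following sketch (a simplified illustration, not the exact contents of apps/benchmark_matmul.py; ndl.m1() is again an assumed device name):

```python
# Simplified matmul benchmark: compare wall-clock time on CPU vs. M1
# for a few matrix sizes, including non-multiples of the CPU tile size (8).
import time
import numpy as np
import needle as ndl

sizes = [100, 500, 1000, 2500]
for n in sizes:
    times = {}
    for name, device in (("cpu", ndl.cpu()), ("m1", ndl.m1())):  # ndl.m1() assumed
        a = ndl.Tensor(np.random.randn(n, n).astype("float32"), device=device)
        b = ndl.Tensor(np.random.randn(n, n).astype("float32"), device=device)
        start = time.time()
        (a @ b).numpy()  # force execution before stopping the timer
        times[name] = time.time() - start
    print(f"n={n}: cpu={times['cpu']:.3f}s, m1={times['m1']:.3f}s, "
          f"speedup={times['cpu'] / times['m1']:.1f}x")
```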