Instructions for using the Kellogg Linux Cluster (KLC) for our research projects.
For broader programming guides, see my R Guide, Python Guide and Stata Guide.
Processing of large datasets should be done on KLC. What counts as a large dataset? As a rule of thumb, any dataset whose size approaches your machine's RAM will not run well on your computer. Additionally, any processes that are slow and can be parallelized (for example, performing estimations using machine learning) should be run on the server. The workflow is the following:
- Upload data and scripts to the server:
  - Upload scripts to GitHub.
  - Upload data to KLC with FileZilla.
- Process the data on the server.
- Download the results:
  - Push results to GitHub and download them locally (i.e., on your own computer), or
  - Download processed data files with FileZilla.
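The round trip above boils down to a handful of commands. A minimal sketch, assuming a hypothetical project `my_project`, NetID `abc1324`, and node `klc0307` (substitute your own):

```shell
# --- On your local machine ---
git push origin main                              # 1. scripts go to GitHub

sftp abc1324@klc0307.ci.northwestern.edu <<'EOF'  # 2. data goes to KLC
put -r data my_project/data
EOF

# --- On the server (after ssh abc1324@klc0307.ci.northwestern.edu) ---
# 3. git pull the scripts, run them, then push results back to GitHub
#    or download processed files with FileZilla/sftp.
```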
- Download and install the required programs to access the KLC server:
  - GlobalProtect VPN: necessary if you want to connect to KLC from outside campus.
  - FileZilla: an FTP client used for transferring files to and from the server. Another option is Cyberduck.
  - Cygwin: if you have Windows, this is necessary for using Linux commands from the command line.
- Set up GitHub on the server:
  - You can find a more detailed explanation in this guide. Here is a summary of the commands needed to perform the setup. Make sure to substitute your username and project name, and get a personal access token to use instead of your password (you will be prompted after `git fetch`):

    ```shell
    cd existing_folder
    git init
    git remote add origin https://github.com/user123/myproject
    git fetch
    git checkout origin/main -ft
    ```
- Scripts should be synced using GitHub. Data files (raw and processed) are excluded from version control by listing the data and proc folders in the `.gitignore` file, so you will need to upload data files through an FTP client such as FileZilla.
- Again, you can look at this in-depth guide on how to work with GitHub if you're not familiar with it. You simply need to `git push` locally and `git pull` from the server.
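The push/pull cycle can be sketched as follows (the branch name `main` and the commit message are hypothetical; adapt them to your repository):

```shell
# On your local machine: stage, commit, and push the scripts.
git add scripts/
git commit -m "Update analysis scripts"
git push origin main

# On the server, inside the project folder: fetch the latest scripts.
git pull origin main
```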
- Enter login details: Start by entering your login details in FileZilla:
  - Host:
    - First, look at the list of available nodes here and select the one with the most open cores and RAM.
    - Then, input that as the host. For example, `klc0307.ci.northwestern.edu`.
  - Username: your NetID (i.e., the letter and number combination)
  - Password: the password for your NetID
  - Port: 22
- Connect: Click on Quickconnect.
- Upload files: Drag files from local folders to server folders.
  - In FileZilla, local folders are displayed in the left pane and server folders in the right pane:
    - Navigate on the left (local folders) to your local project folder.
    - Navigate on the right (server folders) to the project folder on the server.
    - Drag files from the left to the right to upload.
  - An alternative is to drag files from any folder on your system directly to the server folders. This is easier if you already have the local project folder open.
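If you prefer the command line to FileZilla's GUI, the same upload can be done with OpenSSH's `sftp` client. A sketch, where the NetID, node, and paths are hypothetical placeholders:

```shell
# Batch-upload files over SFTP (port 22, same credentials as FileZilla).
sftp -P 22 abc1324@klc0307.ci.northwestern.edu <<'EOF'
cd my_project/data
put raw_data.csv
put -r raw_subfolder
EOF
```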
- Connect off-campus: If you are not on campus, use your NetID to connect via VPN. Note that every time you want to connect to the server from off campus, you first need to connect to the VPN. Remember to disconnect from the VPN once you are done using the server.
- Open the terminal/command line: If you have a Mac, open the Terminal. If you have Windows, open the command line with Windows+R, type cmd, and press Enter (remember to install Cygwin so that you can use Linux commands from the command line).
- Input username and node: In the terminal or command line, type the following:

  ```shell
  ssh <netID>@klc<node>.ci.northwestern.edu
  ```

  Replace `<netID>` with your NetID (the letter and number combination, e.g. abc1324) and `<node>` with the node you selected from this list, e.g. 0307.
- Input password: Enter the password you created for your NetID. Now you should be connected to the server.
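If you connect often, it can help to build the target string from your NetID and node number. A small sketch (the values are hypothetical placeholders):

```shell
NETID=abc1324   # your NetID
NODE=0307       # node picked from the availability list
TARGET="${NETID}@klc${NODE}.ci.northwestern.edu"
echo "$TARGET"  # abc1324@klc0307.ci.northwestern.edu
# ssh "$TARGET"
```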
- Change directory: Next, you need to `cd` to the "parent folder" of the project you're working on. For example:

  ```shell
  cd my_project
  ```

  For projects stored in Sean's folder, you will type instead:

  ```shell
  cd /kellogg/proj/skh2820/my_project
  ```

  You can email Kellogg Research Support and copy Sean to gain access to this folder.
- Review folder structure: Here you can type `ls` to see the folder structure. You can also type `ls -ltr` to list all files with the date they were last modified, most recent at the bottom. This command can also be applied to subfolders, e.g. `ls -ltr scripts` or `ls -ltr logs`.
- Review modules: Search for the most recent version of the program you're using (e.g., R or Stata) by typing `module avail <program>/`, substituting `<program>` with your program of choice. For example, `module avail stata`.
  - You can skip this step if you already know what the most recent module of your program is.
  - When looking for R modules, don't forget to type the `/`; otherwise you will find all modules containing the letter R. Some additional tips on installing R packages on the server can be found here.
- Load modules: e.g., `module load R/4.2.0`.
- Pull scripts from GitHub: `git pull` the scripts you need from GitHub.
- Run script:
  - It's best practice to use a `00_run` script in R and Stata to run the files you want to run on the server. You can find the setup in the Stata guide or in the R guide. If you only have one script to run on the server, then you can simply run that script by itself.
  - Run the scripts the following way.

    For R:

    ```shell
    nohup R CMD BATCH --vanilla -q scripts/scriptname.R logs/scriptname.log &
    ```

    For R scripts, all of the output will be stored in `logs/scriptname.log`.

    For Stata:

    ```shell
    nohup stata-mp -b do scripts/scriptname.do &
    ```

    For Stata do-files, the scripts you run (not `00_run.do` itself, but the scripts that are run by `00_run.do`) should always create a log file, and the name of the log file should start with the name of the do-file, followed by a timestamp. For example:

    ```stata
    local project "my_project"

    ** LOG **
    time_stamp
    start_log using "${logs}/`project'_`time'.log"

    ********* (RUN SCRIPT) *********

    ** WRAP UP **
    log close
    ```
    For both R and Stata:
    - `nohup` ensures that if you get logged out, the script keeps running,
    - `&` returns the command prompt while the command is running, so you can do other things,
    - You must be in the project root directory (not one of its subfolders) to run these commands.
- Check script progress: You can review the progress of your scripts by typing the following commands:
  - `ps x`: check if the job is still running and which jobs are running.
  - `ls -ltr logs`: double-check that log files are being created in the logs folder, and when they were last updated.
  - `tail logs/logfile.log`: check the status of your job by viewing the last 10 lines of the log file.
  - `tail -100 logs/logfile.log`: shows the last 100 lines of the log file if you need to see more.
    - For Stata, you can run `tail 00_run.log` to review the log file created in the project root directory. The `nohup stata-mp` command always creates this log, which you can look at to debug in case something goes wrong and your log files aren't even recorded. An alternative to `tail` is `more`, which shows the file starting from the beginning; hit Enter to show more of it.
  - `kill <proc_number>`: kills a process. You can find the job/process number with `ps x`.
- Download results:
  - To download figures and tables, push results to GitHub and download them locally (i.e., on your own computer).
  - To download processed data files, you can use FileZilla to download them to your local project folder.
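As with uploads, downloads can also be scripted from the command line; a sketch using `scp` (the NetID, node, and paths are hypothetical placeholders):

```shell
# Copy one processed file from the server into the local proc folder.
scp abc1324@klc0307.ci.northwestern.edu:my_project/proc/results.csv proc/

# Or pull a whole folder recursively.
scp -r abc1324@klc0307.ci.northwestern.edu:my_project/proc proc_backup/
```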