
Commit

new updates
vivamoto committed May 27, 2021
1 parent abb42ef commit 6fbb6ff
Showing 19 changed files with 184 additions and 84 deletions.
Binary file modified docs/build/doctrees/about.doctree
Binary file modified docs/build/doctrees/client_configuration.doctree
Binary file modified docs/build/doctrees/code_development.doctree
Binary file modified docs/build/doctrees/custom_module.doctree
Binary file modified docs/build/doctrees/environment.pickle
Binary file modified docs/build/doctrees/exts/sphinxcontrib/README.doctree
Binary file modified docs/build/doctrees/gpucheck.doctree
Binary file modified docs/build/doctrees/index.doctree
Binary file modified docs/build/doctrees/install_gdrive.doctree
Binary file modified docs/build/doctrees/license.doctree
Binary file modified docs/build/doctrees/linux.doctree
Binary file modified docs/build/doctrees/resources.doctree
Binary file modified docs/build/doctrees/server_configuration.doctree
Binary file modified docs/build/doctrees/slurm.doctree
Binary file modified docs/build/doctrees/support.doctree
Binary file modified docs/build/doctrees/tensorflow_settings.doctree
246 changes: 184 additions & 62 deletions docs/source/gpucheck.rst
@@ -67,91 +67,170 @@ Example of error message::
Check all servers
-----------------

A more practical way to check GPU usage on all servers is with a script. The ``gpu_mon.py`` script connects to each server and checks the GPU status. It then prints a list of servers with idle and faulty GPUs, creates a bar plot, and sends an e-mail with a GPU usage summary::

$ python gpu_mon.py
Number of servers: 32

Two GPUs in use: 13 servers.
------------------------------
lince2-003
lince2-005
lince2-008
lince2-012
lince2-013
lince2-014
lince2-017
lince2-018
lince2-020
lince2-021
lince2-026
lince2-027
lince2-032

One GPU in use: 10 servers.
------------------------------
lince2-001
lince2-002
lince2-004
lince2-009
lince2-011
lince2-016
lince2-023
lince2-024
lince2-025
lince2-028

No GPUs in use: 8 servers.
------------------------------
lince2-006
lince2-007
lince2-010
lince2-019
lince2-022
lince2-029
lince2-030
lince2-031

Faulty GPUs: 0 servers.
------------------------------

Connection failure: 1 servers.
------------------------------
lince2-015
.. image:: images/gpu_usage.png
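
Before looking at the full script, the per-server check it automates can be reproduced by hand. A minimal sketch (assuming passwordless SSH to the node and that ``nvidia-smi`` accepts the ``--query-gpu`` flags) is::

    import subprocess

    def gpu_utilization(server):
        """Return the utilization (%) of each GPU on ``server``."""
        cmd = ['ssh', server,
               'nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits']
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        # nvidia-smi prints one number per GPU, e.g. "73" means 73% utilization
        return [int(line) for line in out.stdout.split() if line.isdigit()]

    print(gpu_utilization('lince2-001'))   # e.g. [73, 0]

The ``gpu_mon.py`` script below takes a different route and parses the default ``nvidia-smi`` table with regular expressions, which also lets it detect faulty GPUs from the driver's error message.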


gpu_mon script::

#!/scratch/11568881/miniconda3/bin/python
#%%
"""
This module collects GPU utilization on all servers of the lince cluster. This is useful to help
identify possible improvements in job speed and to free resources for other users.
Ideally, GPU utilization should be high most of the time.

Process:
1. Connect to all servers via SSH and collect GPU usage.
2. Create a data frame with server and both GPUs usage.
3. Create a horizontal bar chart of GPU usage by server.
4. Send a summary and plot by e-mail.
"""
import os, re, datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import smtplib, mimetypes
from email.message import EmailMessage

def message(msg, servers):
"""Format server status message."""
text = '\n'
text += msg + str(len(servers)) + " servers.\n"
text += "-"*30 + "\n"
for server in servers:
text += server + '\n'
return text

def gpustatus(result_fname, summary_fname):
"""Connect to each server and collect GPU information.
Results are saved to a CSV file and a summary text file.
"""
gpu = {} # GPU utilization
no_gpu = [] # Servers with 0 GPUs in use
one_gpu = [] # Servers with 1 GPU in use
two_gpu = [] # Servers with 2 GPUs in use
gpudown = [] # Servers with faulty GPUs
no_route = [] # Servers with connection failure
servers = [] # List of servers
df = pd.DataFrame(columns = ['Server', 'GPU 0', 'GPU 1'])

for n in range(1, 33):
# Connect to each server in the cluster and send commands
server_name = 'lince2-' + ("000" + str(n))[-3:]
servers.append(server_name)
cmd = 'ssh {} "hostname;nvidia-smi"'.format(server_name)
pipe = os.popen(cmd,'r')
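# NOTE: assumes passwordless (key-based) SSH access to the node; a password prompt here would block the script or make ssh fail.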

print("Processing server:", server_name)
for row in pipe.read().split('\n'):
#lince2-001.hpc.usp.br
server_re = re.search(r'(lince\d-(\d+))\.hpc', row)
#| 0 Tesla K20m Off | 00000000:05:00.0 Off | 0 |
gpuId_re = re.search(r'\|\s+(\d)\s+Tesla', row)
#| N/A 62C P0 104W / 225W | 78MiB / 4743MiB | 73% Default |
utilization_re = re.search(r'B \|\s+(\d+)%\s+', row)
# Read server name
if server_re:
server = server_re.group(1)
# Read GPU error message
elif "Unable to determine the device handle for GPU" in row:
gpudown.append(server)
# Read GPU ID: 0 or 1
elif gpuId_re:
gpuId = int(gpuId_re.group(1)) # GPU 0 or 1
# Read GPU utilization
elif utilization_re:
gpu[gpuId] = int(utilization_re.group(1))
if gpuId:
df.loc[len(df) + 1] = server, gpu[0], gpu[1]
if "No running processes found" in row:
nogpu.append(server)
# Identify number of GPUs in use
if not (gpu[0] or gpu[1]):
no_gpu.append(server) # 0 GPUs in use
elif gpu[0] and gpu[1]:
two_gpu.append(server) # 2 GPUs in use
else:
one_gpu.append(server) # 1 GPU in use

pipe.close()

# Connection failure: servers that returned no nvidia-smi output (e.g. SSH failure) appear in none of the lists above
checked_servers = two_gpu + one_gpu + no_gpu + gpudown
for server in servers:
if not server in checked_servers:
no_route.append(server)
checked_servers += no_route
# Summary of GPU usage
n = len(checked_servers)
summary = "Number of servers: {} \n".format(str(n))
summary += message("Two GPUs in use: ", two_gpu)
summary += message("One GPUs in use: ", one_gpu)
summary += message("No GPUs in use: ", no_gpu)
summary += message("Faulty GPUs: ", gpudown)
summary += message("Connection failure: ", no_route)
print(summary)
# Save data frame and summary
df.to_csv(result_fname)
with open(summary_fname, 'w') as f:
f.write(summary)

return

def create_plot(result, plot):
"""Create plot of GPU usage per server."""
df = pd.read_csv(result)
# Create plot
x = np.arange(len(df['Server'])) # the label locations
width = 0.35 # the width of the bars
@@ -177,23 +256,66 @@

fig.tight_layout()


# Save and show plot
plt.savefig(plot, dpi=300, bbox_inches='tight')
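# NOTE: plt.show() needs an interactive display; when the script runs unattended (e.g. from cron),
# use a non-interactive backend such as Agg or simply drop this call.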
plt.show()
return

def send_email(receiver, summary, plot_fname):
"""Send an e-mail with the GPU usage summary and the plot attached."""
# Create message and set text content
sender = 'no-reply@lince2.hpc.usp.br'
msg = EmailMessage()
msg['Subject'] = 'Lince: GPUs Status'
msg['From'] = sender
msg['To'] = receiver

# Message content
body = """*** Automatic e-mail, do not reply. ***
Status of lince servers.

See attached plot.
"""
body += summary
msg.set_content(body)

# Attach plot
with open(plot_fname, 'rb') as fp:
file_data = fp.read()
maintype, _, subtype = (mimetypes.guess_type(plot_fname)[0] or 'application/octet-stream').partition("/")
msg.add_attachment(file_data, maintype=maintype, subtype=subtype, filename=plot_fname)

# Send e-mail
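# Assumes a local MTA (e.g. Postfix) is accepting mail on localhost:25; adjust the SMTP host if needed.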
with smtplib.SMTP('localhost') as server:
server.sendmail(sender, receiver, msg.as_string())
print("Successfully sent email")


if __name__ == '__main__':
now = datetime.datetime.now()
dt = now.strftime("%Y%m%d_%H%M")
os.makedirs('plot', exist_ok=True) # make sure the output directory for the plot exists
plot_fname = 'plot/gpu_status_{}.png'.format(dt)
result_fname = 'gpu.csv'
summary_fname = 'summary.txt'

# 1. Execute GPU checks
gpustatus(result_fname, summary_fname)

# 2. Create plot
create_plot(result_fname, plot_fname)

# 3. Send result by e-mail
receiver = 'your@email.com'
with open(summary_fname, 'r') as f:
msg = f.read()
send_email(receiver, msg, plot_fname)

# 4. Delete files
for fname in [plot_fname, result_fname, summary_fname]:
os.unlink(fname)
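
To run the check periodically, the script can be scheduled with ``cron``. A sample crontab entry that runs it every day at 08:00 (the project directory is an assumption; the interpreter path is the one from the shebang)::

    0 8 * * * cd $HOME/project && /scratch/11568881/miniconda3/bin/python gpu_mon.py
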
Binary file modified docs/source/images/gpu_usage.png
22 changes: 0 additions & 22 deletions docs/source/resources.rst
@@ -32,7 +32,6 @@ The curriculum includes:
Researcher Academy
------------------

*Learn*
`Researcher Academy <https://researcheracademy.elsevier.com/>`_ provides free access to countless e-learning resources designed to support researchers on every step of their research journey. Browse our extensive module catalogue to uncover a world of knowledge, and earn certificates and rewards as you progress.

RESEARCH PREPARATION
@@ -71,27 +70,6 @@ COMMUNICATING YOUR RESEARCH
* Inclusion and Diversity for Researchers


*Career path*
A career in research can take many twists and turns. Whether you decide to stay in academia or move onto industry, `Researcher Academy’s <https://researcheracademy.elsevier.com/>`_ career resources will help you plan accordingly. Browse the different career sections for sound practical advice to tackle your future.

Career planning
^^^^^^^^^^^^^^^
Being an effective researcher requires method and order, and those strong organization skills can prove just as useful when it comes to mapping out your career.

In these career planning modules, we learn about the importance of investigating and considering all options before deciding on your next step. We discover why knowing and understanding yourself can be the key to making good choices. We explore whether a PhD is valued outside academia, and we hear why changing career, or even your field of study, can bring big benefits and opportunities for fresh and novel thinking.

Job search
^^^^^^^^^^
You’ve decided to explore career options in industry, but are there any special points to consider when job hunting? The answer is yes, and in these modules, we highlight what they are and how you can prepare for them.

You’ll find out how to take the experiences you’ve gained in the lab and translate them into practical skills any employer can relate to. You’ll also learn how to write a CV that will appeal to a company boss. Finally, we run you through a typical interview process and some of the questions you are likely to be asked.

Career guidance
^^^^^^^^^^^^^^^
As a researcher, your career path will be littered with crossroads and side tracks. Faced with all those choices, how do you decide on the right route for you? In our career guidance resources, you will find some helpful words of wisdom from seasoned colleagues.

We examine the ingredients required to make the ideal supervisor who can support and advise you on your journey. We explore how you can maintain a healthy balance between work and family life. We discover why it’s so important to be bold and nurture fresh thinking. And, we run through some of the differences you will encounter if you move from academia to industry.


Reference managers
------------------
