GSoC 2023 SN Pradeep
- Name : SN Pradeep
- Email : pradeepsn606@gmail.com
- Github : @gmrpr321
- College : Velammal College of Engineering and Technology, Madurai, Tamil Nadu, India
- Element : @pradeepsn:matrix.org
- Timezone : Indian Standard Time (UTC +05:30)
I am SN Pradeep, currently pursuing a B.E. in Computer Science. I am passionate about exploring the intersection of computer science and engineering to solve complex problems and create innovative solutions. I am also interested in astronomy and physics, and I am excited to work on an open-source project that combines these interests with my passion for computer science and engineering.
The Scraper class implements the web scraping functionality within SunPy. Some of the simpler internal clients use it to scrape web pages for data files and metadata within a specified time range (sunpy.time.timerange.TimeRange). The current implementation of the Scraper class relies heavily on Python regular expressions for formatting URLs, constructing file paths, and extracting datetime information. This reduces the overall readability and maintainability of the Scraper module.
The goal is to build a new Scraper module that retains the functionality of the existing module while improving its readability and maintainability.
Several areas of the code could be improved by using parse instead of regex. Here are some methods in sunpy.net.scraper.Scraper that would benefit:
- In the __init__ method, the pattern parameter is already formatted using kwargs. Instead of using regex to extract datetime formats from the pattern, parse can be used to extract named placeholders from the pattern.
- In the _extractDateURL method, regex is used to extract the date from a URL that follows the pattern. This can be simplified by using parse to extract named placeholders from the pattern and then parsing the date with the corresponding format.
- In the _URL_followsPattern method, TIME_CONVERSIONS (a dictionary that maps time format codes to regular expressions) is used to replace parts of the pattern so that the given URL can be checked against it. This reduces readability. A parser can be used to simplify things.
The current implementation replaces the time format codes in the pattern with regular expressions:
`TIME_CONVERSIONS = {'%Y': r'\d{4}', '%y': r'\d{2}',
                    '%b': '[A-Z][a-z]{2}', '%B': r'\W', '%m': r'\d{2}',
                    '%d': r'\d{2}', '%j': r'\d{3}',
                    '%H': r'\d{2}', '%I': r'\d{2}', '%M': r'\d{2}',
                    '%S': r'\d{2}', '%e': r'\d{3}', '%f': r'\d{6}'}
pattern = self.pattern
for k, v in TIME_CONVERSIONS.items():
    pattern = pattern.replace(k, v)
matches = re.match(pattern, url)`
This can be rewritten without using regex. Here is an example:
import datetime

input_pattern = 'http://proba2.oma.be/swap/data/bsd/%Y/%m/%d/swap_lv1_%Y%m%d_%H%M%S.fits'
input_url = 'http://proba2.oma.be/swap/data/bsd/2000/04/10/swap_lv1_20041224_120743.fits'
# formats must be listed in the order they occur in the pattern
supported_date_formats = [
    '%Y/%m/%d',
    '%Y%m%d_%H%M%S',
    '%Y-%m-%d',
    '%Y-%m-%dT%H:%M:%S.%f',
    '%Y%m%dT%H%M%S.%f',
    # can add more formats if needed
]
pattern_lst = []
input_lst = []
start_index = 0
# split the pattern into segments that each end with a supported format
for format_date in supported_date_formats:
    position = input_pattern.find(format_date, start_index)
    if position != -1:
        end_index = position + len(format_date)
        pattern_lst.append(input_pattern[start_index:end_index])
        start_index = end_index
# keep the trailing literal text ('.fits' here) as the last segment
if start_index < len(input_pattern):
    pattern_lst.append(input_pattern[start_index:])
# split the input url based on pattern_lst
start_index = 0
for pattern in pattern_lst:
    end_index = start_index + len(pattern)
    if '%Y' in pattern:
        end_index += 2  # '%Y' is 2 characters long but matches a 4-digit year
    input_lst.append(input_url[start_index:end_index])
    start_index = end_index
# the URL matches only if nothing is left over and every segment parses
matches = start_index == len(input_url)
for url_part, pattern_part in zip(input_lst, pattern_lst):
    try:
        datetime.datetime.strptime(url_part, pattern_part)
    except ValueError:
        matches = False
        break
print('Input URL matches' if matches else "Input URL doesn't match")
- In the _localfilelist method, regex is used to return a list of local file paths that match a certain pattern. This can be simplified using parse to extract named placeholders from the pattern and generate a list of file paths by substituting datetime objects in the placeholders.
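The substitution step described in the last point can be sketched with the standard library alone. The helper name `candidate_paths` and its `step` parameter are hypothetical and not part of sunpy, and `strftime` stands in here for whatever placeholder substitution the parse-based version would end up using:

```python
from datetime import datetime, timedelta


def candidate_paths(pattern, start, end, step=timedelta(days=1)):
    """Fill the strftime codes in ``pattern`` for each step in [start, end].

    Hypothetical helper for illustration only. Duplicates are dropped
    while keeping order, since several steps can map to the same path.
    """
    paths = []
    current = start
    while current <= end:
        path = current.strftime(pattern)
        if path not in paths:
            paths.append(path)
        current += step
    return paths


# Daily directories that would have to be visited for a four-day range.
directories = candidate_paths(
    "http://proba2.oma.be/swap/data/bsd/%Y/%m/%d/",
    datetime(2004, 12, 30),
    datetime(2005, 1, 2),
)
for directory in directories:
    print(directory)
```

A real implementation would probably pick the step from the smallest time unit that appears in the directory part of the pattern; a fixed one-day step is assumed here to keep the sketch short.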
These are just a few examples. Overall, the readability and maintainability of the Scraper module could be improved with the proposed implementation.
- Write utility functions to extract datetimes, calculate time ranges, and match given file paths with a specific date and time.
- Develop a partial Scraper and provide documentation for the classes created thus far.
- Create tests for all classes created thus far.
- A fully functional Scraper written to support the HTTP and FTP protocols.
- Provide adequate examples and documentation for the entire refactored module.
- Implement performance improvements and create benchmarks.
- Integration of the updated Scraper class into the sunpy codebase.
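As a rough illustration of the datetime-extraction utility mentioned in the first deliverable, the sketch below walks a pattern and a URL side by side without regex. The names `WIDTHS` and `extract_date` are hypothetical, only a handful of fixed-width format codes are covered, and the real utility would likely rely on parse rather than this hand-written loop:

```python
from datetime import datetime

# Fixed character widths of the strftime codes this sketch understands.
WIDTHS = {"%Y": 4, "%m": 2, "%d": 2, "%H": 2, "%M": 2, "%S": 2}


def extract_date(pattern, url):
    """Return the datetime encoded in ``url``, or None if it does not fit ``pattern``."""
    fields = {}
    p = u = 0
    while p < len(pattern):
        code = pattern[p:p + 2]
        if code in WIDTHS:
            value = url[u:u + WIDTHS[code]]
            if len(value) != WIDTHS[code] or not value.isdigit():
                return None
            # A code that appears twice must carry the same value both times.
            if fields.setdefault(code, value) != value:
                return None
            p += 2
            u += WIDTHS[code]
        else:
            # Literal character: it must match the URL exactly.
            if u >= len(url) or pattern[p] != url[u]:
                return None
            p += 1
            u += 1
    if u != len(url) or not fields:
        return None
    try:
        # Each code is parsed once, so strptime never sees a repeated directive.
        return datetime.strptime(" ".join(fields.values()), " ".join(fields))
    except ValueError:
        return None


swap_pattern = "http://proba2.oma.be/swap/data/bsd/%Y/%m/%d/swap_lv1_%Y%m%d_%H%M%S.fits"
good_url = "http://proba2.oma.be/swap/data/bsd/2004/12/24/swap_lv1_20041224_120743.fits"
mixed_url = "http://proba2.oma.be/swap/data/bsd/2000/04/10/swap_lv1_20041224_120743.fits"
print(extract_date(swap_pattern, good_url))
print(extract_date(swap_pattern, mixed_url))
```

The second URL is the sample used earlier in this proposal. Its directory and file name carry different dates, so this stricter sketch rejects it; whether repeated codes must agree is a design decision the refactored Scraper would need to settle.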
- Approach the mentors to understand the structure of the current scraper
- Read the documentation and get familiar with the sunpy library's codebase
- Implement parse-based URL pattern conversion
- Modify the Scraper class to use parse instead of regex for URL pattern conversion
- Update the class documentation to reflect the changes made
- Test the modified Scraper class using existing test cases and fix any issues that arise
- Refactor the matches method
- Update the method documentation to reflect the changes made
- Test the modified method using existing test cases and fix any issues that arise
- Refactor the _URL_followsPattern and _extractDateURL methods to use parse instead of regex for URL pattern matching and date extraction
- Update the method documentation to reflect the changes made
- Test the modified methods using existing test cases and fix any issues that arise
- Refactor the filelist method to use parse instead of regex for URL pattern matching and date extraction
- Update the method documentation to reflect the changes made
- Test the modified method using existing test cases and fix any issues that arise
- Refactor the _ftpfileslist, _localfilelist, and extract_files_meta methods to use parse instead of regex for URL pattern matching and date extraction
- Update the method documentation to reflect the changes made
- Test the modified methods using existing test cases and fix any issues that arise
- Finalize the documentation for the modified Scraper class, including any changes made to the method documentation
- Test the entire modified Scraper class using existing test cases and fix any remaining issues
- Work on performance improvements
- Make final code and documentation improvements
- Submit final code and documentation for review and evaluation
#6895 Marks sunpy.io readers Private
- Have you participated previously in GSoC? When? With which project? No
- Are you also applying to other projects? No
- Due to final exams, I will not be available from May 15 to May 31.
- Due to practical exams (May 1 to May 6), I can spend only a limited amount of time (2-3 hrs per day) during that period.
- On other days, I can work for 5-6 hrs.