
GSoC 2023 Akshit Tyagi


Scraper Redesign

Personal Information

Background

I'm a 19-year-old undergraduate student in the second year of my studies at Jaypee Institute of Information Technology, Noida, Uttar Pradesh, India. Relevant to the project, I've been programming in Python for a little over 5 years now (mostly small personal projects) and am familiar with working on open-source projects, sphinx, tox, and Jupyter Notebooks. My other skills include over 3 years of experience working with Linux-based systems, git, vim/nvim, containerization, socket programming, networking, and APIs. In the past I've also held an inclination towards cybersecurity, which adds to my familiarity with networking.

Physics, astronomy, technology, and science in general have always been of great interest to me. I participated as a mentor in the Kharagpur Winter of Code, a program meant to encourage people to contribute to open source while getting more familiar with programs like GSoC. Here's my KWoC Mentor Certificate. The project I mentored was a personal project of mine called DeGrasse, a Discord bot with an awkward sense of humour that sends cool information about space and astrophysics (and often quotes Neil deGrasse Tyson, the pop astrophysicist). I loved the experience of encouraging others and gradually getting them to learn more and contribute.

Relevant Projects and Achievements:

The DeGrasse Discord Bot: So far it uses NASA's Open APIs for its information, to send the Astronomy Picture of the Day or images from the Mars Rover. Working with SunPy code exposed me to pre-commit and pytest, which I plan to implement there as well (it's far from a big/active project, just for the learning experience).

PyTubeAPI: Relevant to this project, I also have experience working with classic BeautifulSoup4 web scrapers. For example, I've worked on this (really basic) personal project, which scrapes one of the many YouTube clone websites, then parses and returns the information as JSON through an API (using FastAPI), as an alternative to the official (rate-limited) YouTube Search API.

soforced: A Python script to brute-force the college Wi-Fi and obtain students' Sophos credentials, since the Sophos passwords aren't unique and repeat.

OSDC Events and Material: Aside from organising events, I have written workshops and given talks as a core member of the Open Source Developers Community. Some of the ones I'm most proud of are:

  • Wait WTF?: A troubleshooting/googling contest about being comfortable with adapting to new technology on the fly, a skill I believe is central to open-source development and software development in general.
  • Codejam-v3: An early-90s-themed hackathon with a twist, where we teamed up participants ourselves to build a fun project over a week.
  • Hitchhiker's Guide to the Cooler Parts Of The Internet: A talk I gave covering a wide range of topics I find interesting.
  • Hacktoberfest Hands-On: A workshop on getting people started contributing during Hacktoberfest, using git and the GitHub workflow to make a first PR.
  • OSDContenst: A content-writing contest that encourages people to make a non-code contribution by making a PR.

Have you participated previously in GSoC? When? With which project?

No, I haven't.

Are you also applying to other projects?

No, I'm not.

Are you eligible to receive payments from Google?

Yes, I am.

How much time do you plan to invest in the project before, during, and after the Summer of Code?

During the Summer of Code, I plan to work on the project as a 175-hour GSoC project, and I'll devote more time if the project requires it. Before then, I plan to contribute to SunPy in general and to other projects in the SunPy ecosystem, and I plan to keep contributing to the project and work on other issues and ideas after the Summer of Code ends.

Details about the project:

SunPy provides ways for the user to search and download solar data from web sources (more than 13 at the time of writing). The dataretriever submodule includes 8 such sources, whose clients are considerably “simpler” to implement than the other, more complex clients.

All the dataretriever clients rely on the Scraper class to search for and fetch data (the downloading part being handled by the parfive downloader). It is able to do so by taking advantage of the fact that the data provided by the sources is stored and organised in a very regular fashion, i.e. filenames and directories are named based on dates/time ranges. For a particular source, we can pass this pattern to the Scraper class and ask it to calculate what the path would look like for a different time range. Building on this, the Scraper can perform a lot of useful functions for accessing the data.
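To make that concrete, here is a minimal sketch of how a dataretriever client typically uses the current Scraper, based on the existing sunpy.net.scraper API; the URL pattern below is illustrative and not a real data source:

```python
from sunpy.net.scraper import Scraper
from sunpy.time import TimeRange

# Illustrative only: a strftime-style pattern of the kind dataretriever
# clients pass to Scraper, not an actual data source.
pattern = "https://example.org/data/%Y/%m/%d/instr_%Y%m%d_%H%M%S.fits"
scraper = Scraper(pattern)

timerange = TimeRange("2023-01-01", "2023-01-02")
# filelist() walks the directories implied by the pattern and returns every
# file whose parsed time falls inside the given time range.
urls = scraper.filelist(timerange)
```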

For instance, the Scraper also supports returning the range of possible paths, which (if supported by the source) involves detecting the time step at which the paths differ and generating them, depending on whether such a scheme is supported by the provided pattern. This is currently implemented using regex and strftime().
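As a rough illustration of that idea (a deliberate simplification under assumed names, not the actual implementation), expanding a strftime-style directory pattern over a time range at a chosen step could look like this:

```python
from datetime import datetime, timedelta

# Simplified sketch: the real Scraper inspects the smallest time code present
# in the pattern to decide the step; here the step is just passed in.
pattern = "https://example.org/data/%Y/%m/%d/"

def candidate_directories(start, end, step=timedelta(days=1)):
    current = start
    directories = []
    while current <= end:
        url = current.strftime(pattern)
        # Avoid emitting the same directory twice when the step is smaller
        # than the pattern's finest time code.
        if not directories or directories[-1] != url:
            directories.append(url)
        current += step
    return directories

print(candidate_directories(datetime(2023, 1, 30), datetime(2023, 2, 2)))
# ['https://example.org/data/2023/01/30/', ..., 'https://example.org/data/2023/02/02/']
```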

Part of the issue is that, over time, functions like these have grown too complex and difficult to maintain. Though regex has been used for parsing so far, we can use parse instead, which would improve code readability. The use of regex in a pattern also prevents using Python's string formatting with named arguments, since the characters used in regex are interpreted differently by the format function. Other issues include the Scraper not working properly when the regex involves the '.' character, since the pattern is split on '.' to separate the filename from the extension.

This project involves rewriting the Scraper so that it uses parse instead and is easier to work with, while keeping the same API. The overall process will involve converting the URL pattern to a file-tree and then walking that file-tree to match the query. The rewrite would also include extracting common code to higher levels, making functions more independently testable where ideal, and implementing newer functionality as well.
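As a hedged illustration of the readability gain (the pattern and URL below are invented, not an actual SunPy source), a single parse-style pattern can both format a URL with named arguments and extract those fields back out of it:

```python
import parse  # the r1chardj0n3s/parse package

# Hypothetical pattern and URL, purely for illustration.
pattern = "https://example.org/data/{year}/{month}/{day}/image_{time}.fits"

# The same pattern works for formatting with named arguments...
url = pattern.format(year="2023", month="05", day="17", time="123000")

# ...and for extracting the named fields back out of a URL.
result = parse.parse(pattern, url)
print(result.named)
# {'year': '2023', 'month': '05', 'day': '17', 'time': '123000'}
```

A regex-based pattern, by contrast, contains characters such as the braces in \d{4} that str.format would try to treat as replacement fields, which is exactly the limitation described above.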

The different issues we'll need to take care of have been discussed in brief here, and I intend to address all of these points during my work.

For instance, there's the question of whether we should split the Scraper into subclasses for the URL, FTP, and file scrapers or keep the current design of handling them as cases in a single Scraper class. I believe we should go for the former, since it'd be much easier to work with, which is why the project exists in the first place. I'll discuss decisions like these with the mentors as they come up and whenever they require the mentors' attention.
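One possible shape for such a split is sketched below; all class and method names here are purely illustrative assumptions, not an existing SunPy API, and the final design would be settled with the mentors:

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Holds the shared pattern handling and time-range logic."""

    def __init__(self, pattern):
        self.pattern = pattern

    @abstractmethod
    def filelist(self, timerange):
        """Return all files matching the pattern within the time range."""

class HTTPScraper(BaseScraper):
    def filelist(self, timerange):
        ...  # walk HTTP directory listings

class FTPScraper(BaseScraper):
    def filelist(self, timerange):
        ...  # list remote directories over FTP

class LocalFileScraper(BaseScraper):
    def filelist(self, timerange):
        ...  # glob the local filesystem
```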

The SunPy community and me:

I have experience working with the awesome and inspiringly active community at SunPy. Aside from occasionally being able to help someone out on a beginner GitHub issue or in the Matrix server, here's a quick list of PRs that I've worked on over time:

I've opened an issue in Cadair/parfive about using asyncio.TaskGroup for parallel downloading, and have also contributed to the forum via this post about whether we should work on including SWPC data access directly from SunPy. Here are links to some other discussions I've been able to contribute to.

Timeline

Community Bonding Period (May 4 - 28)

  • I'll make myself more familiar with the dataretriever clients and the use of the Scraper class. I'll attend the weekly SunPy meetings to get familiar with the development community and its practices. I'll also go over my setup once again to make sure it follows the SunPy workflow, and get better versed with tools like tox and sphinx.
  • I'll get more familiar with pytest, regex, r1chardj0n3s/parse, and parallel and asynchronous programming, and discuss project details with the mentors.
  • I also intend to work on a few open issues during this time, mostly those that relate to the net submodule (incrementally building familiarity with the package) and could overlap with my project. Aside from helping out SunPy in general, I've found this helpful for getting familiar with the coding style used in the project and in production-ready code in general. It also gives me an opportunity to make sure I'm comfortable with system-design concepts like the Factory design pattern, which is used for things like the Fido client, or something more general like what skeleton classes are. There are other projects in the ecosystem that interest me as well (like Parfive and the Helioviewer-Project/python-api), and I'll also look there for approachable and doable issues where I could help out during this period.

Coding Period

Week 1-2 (May 29 - June 11)

  • Begin working on implementing a basic version of all the major functions that the Scraper class supports/exposes in its present state, namely range, filelist, and get_timerange_from_exdict, using parse instead of regex. This will also involve appropriately dividing tasks into a set of internal functions, and working on the documentation and tests.
  • Define how we want to handle the HTTP, FTP, and file clients.

Week 3 (June 12 - June 18)

  • Move out common code (replacing regex with parse) with tests and documentation.
  • Make changes based on feedback from mentors.

Week 4-5 (June 19 - July 2)

  • Flesh out all the functions implemented so far, cover edge cases (there'll be lots of edge cases for different time ranges and time steps), and parallelise requests, having already gotten a basic version down.
  • Make changes based on feedback from mentors.

Midterm Evaluation (July 10)

  • A major part of the existing functionality, like filelist and range, should be implemented, and any functions better suited to a higher level would be moved, along with good test and documentation coverage. The work so far should ensure this is done in time for the Midterm Evaluation submission period from July 10-14.

Week 6-8 (July 3 - July 23)

  • Work on extracting metadata which doesn't exist in the URL, and start working on the URL-pattern-to-file-tree conversion.
  • Make changes based on feedback from mentors.

Week 9-10 (July 24 - August 6)

  • I expect the file-tree work to extend into this period.
  • Flesh out the newer functions implemented. Work on their documentation and increase test coverage.
  • Make changes based on feedback from mentors.

Week 11-12 (August 7 - August 20)

  • A buffer period - leaving room for unforeseen tasks that come up along the way. I'll work on any tasks that are yet to be finished.
  • If all tasks are complete, I'll cover any remaining edge cases or increase test coverage.

Final Evaluation (August 21)


Deliverables:

A fully functioning Scraper (and some newly moved-out independent functions), with all the mentioned features implemented and ready to merge into main, along with tests and documentation.

Availability:

I will have exams for around a week, from 18-26 May, during the Community Bonding period. After these exams, the college will be off for summer holidays (and hence so will all volunteering activities), and I will not have any other commitments to attend to, which makes me available to work throughout the GSoC timeline.
