
SunPy - Rewriting to improve the maintainability of Scraper

Personal Details

The Project

Abstract

sunpy.net.Scraper is a scraper object responsible for scraping data over the web and bringing metadata and data files to the different Fido clients. This project aims to rebuild the structure of the Scraper class so that it meets all the predetermined requirements, and to eliminate the pile of unmaintainable code it has accumulated. In this proposal, I will provide the implementation details for the scraper client and extend its support to Fido. I will be working with Nabobalis and Hayesla as my mentors on this project to guide me through the journey.

Reason to choose SunPy

My open-source journey began with SunPy about 5 months ago, around October, when I enjoyed exploring the visualization capabilities of SunPy and developed a keen interest in it, which led me to attempt to solve its issues. The efficient use of Python and purposeful open-source contribution were my main drives. Another driving force is that I am particularly interested in astrophysics, and SunPy has been my gateway to exploring its amazing world. Thus I am particularly set on contributing to SunPy.

Approach

First and foremost, we need to determine the structure of the scraper. Since the domain of querying is predetermined within a certain scope, the simplest way to implement it would be to use the "requests" library, a powerful tool for HTTP clients, which I presume covers most of that scope. In other cases, where we have other types of requests (like FTP), the scraper object can be instantiated based on the type of request (see the sketch after the list below); with a few changes, the "requests" library can handle FTP requests as well. The parser will work as a separate entity (an abstraction): it will provide all the utilities to the scraper through different methods, including all the methods which are currently implemented and other methods which shall be thoroughly discussed with the mentors. Here is a 3-phase plan to redevelop the scraper.

  1. Phase 1 - Develop and test the URL parser.

  2. Phase 2 - Develop and test the Scraper, and integrate it with the Parser and Fido.

  3. Phase 3 - Rewrite the examples to demonstrate the working of the Scraper.
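
As a rough illustration of instantiating the scraper based on the request type, here is a minimal sketch. The class and factory names are hypothetical, not part of the existing sunpy.net API, and the FTP branch is left open pending discussion with the mentors:

```python
from urllib.parse import urlparse

import requests


class HTTPHandler:
    """Fetch pages over HTTP(S) using the requests library."""

    def fetch(self, url):
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.text


class FTPHandler:
    """Placeholder for an FTP backend (e.g. ftplib or a requests
    extension package); shown only to illustrate the dispatch."""

    def fetch(self, url):
        raise NotImplementedError("FTP support is still under discussion")


def make_handler(url):
    # Choose a backend based on the URL scheme.
    scheme = urlparse(url).scheme
    if scheme in ("http", "https"):
        return HTTPHandler()
    if scheme == "ftp":
        return FTPHandler()
    raise ValueError(f"Unsupported scheme: {scheme!r}")
```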

Why Different Classes?

The main idea of the project is to make the code maintainable. Continuing with the object-oriented programming approach, we will conceptually divide the current scraper into two entities: a "scraper" and a "parser". I am not sure that the second entity should be called "parser", because it might do things other than URL parsing. The Scraper, on the other hand, will be an entity which purely "scrapes" and retrieves web data. By doing this we can genuinely simplify how the code is organized.

PHASE-1

In the first phase, I plan to build the "Parser" entity. It will be responsible for performing all the functions except "scraping": for example, parsing a URL to extract important metadata from it and returning the results to the Fido client being queried by the user. Another important function is generating the URL based on the arguments provided to the Fido client. As requested in the original issue, the regex currently used in the scraper for pattern matching should be replaced; I would like to propose a "brute force" approach here (detailed below).
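
To make the parsing role concrete, here is a minimal sketch of extracting a timestamp from a URL given a strftime-style pattern. Both extract_time and the code table are hypothetical illustrations, not existing sunpy API, and only a subset of the time codes is covered:

```python
import re
from datetime import datetime

# Map strftime codes to regex fragments; a minimal subset for
# illustration, the real parser would need the full set.
_TIME_CODES = {"%Y": r"\d{4}", "%m": r"\d{2}", "%d": r"\d{2}",
               "%H": r"\d{2}", "%M": r"\d{2}", "%S": r"\d{2}"}


def extract_time(url, pattern):
    """Extract a datetime from ``url`` matching the strftime-style
    ``pattern``, or return None if the pattern is not found."""
    regex = re.escape(pattern)
    for code, fragment in _TIME_CODES.items():
        regex = regex.replace(re.escape(code), fragment)
    match = re.search(regex, url)
    if match is None:
        return None
    # Re-parse the matched substring with the original pattern.
    return datetime.strptime(match.group(0), pattern)


# e.g. extract_time("https://example.org/data/2024/02/22/a.fits", "%Y/%m/%d")
# -> datetime(2024, 2, 22)
```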

Note: the parser can also handle HTML file parsing if required, but this would be a matter of discussion with the mentors, as we need to check whether it is within the scope or not.

Brute Force Approach

Now that we know the entire scope of the Scraper (that is, the different websites from which we will scrape the data), we can write down all the different base URLs of those websites. While writing a base URL, we add flags at the positions in the URL string where arguments need to be inserted. Then, based on the query arguments provided by the user, we can choose the appropriate string and insert the parameters in an orderly fashion to generate the actual URL which we will scrape. This avoids complex logic and focuses on simplicity of implementation as well. As the last part of this phase, I want to test the working of the parser, check its robustness, and make sure that it has all of the required functionality and that none of it is broken. After ensuring this, I will proceed to the next phase.
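
A minimal sketch of such a template table, with hypothetical instrument keys and archive URLs (the real table would enumerate the actual archives within the Scraper's scope); the strftime codes and format fields act as the "flags" mentioned above:

```python
from datetime import datetime

# Hypothetical registry of base-URL templates; keys and URLs are
# placeholders, not real archives.
BASE_URLS = {
    "eit": "https://example.org/eit/%Y/%m/%d/eit_%Y%m%d_%H%M.fits",
    "lyra": "https://example.org/lyra/%Y/%m/lyra_%Y%m%d-level{level}.fits",
}


def build_url(instrument, time, **kwargs):
    """Fill the flags in the chosen template: first the non-time
    keyword arguments, then the strftime time codes."""
    template = BASE_URLS[instrument].format(**kwargs)
    return time.strftime(template)


# e.g. build_url("lyra", datetime(2024, 2, 22), level=2)
# -> "https://example.org/lyra/2024/02/lyra_20240222-level2.fits"
```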

PHASE-2

After the Parser, I will focus on implementing the Scraper. The Scraper will just "scrape" data over the web and perform no other functions. All the utilities the Scraper needs will be provided by the parser (working here as an abstraction), which it calls whenever something is needed. The goal is to make the Scraper robust to failures while maintaining its simplicity. I propose to use "requests", a Python library that has been fine-tuned over time into a robust tool for making several types of queries (mostly HTTP). With some small extension packages, requests can also handle FTP requests with ease. Right now, the Scraper is instantiated as an object by the different Fido clients, which means the Scraper is basically an API-less entity. Even with the split, the rest of the codebase will go virtually untouched, thus retaining almost all of the original code. After developing the Scraper, the main task will be to test its robustness: check for any point where it fails, and write tests for the Scraper as an addition to the codebase if required.
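
As a minimal sketch of that "scrape only" role, assuming HTML directory listings fetched over HTTP (scrape_links is a hypothetical helper, not the current sunpy.net.scraper API):

```python
import re

import requests


def scrape_links(url, timeout=30):
    """Return the href targets found on the page at ``url``."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        # Robustness to failures: return nothing rather than crash;
        # a real implementation would log and possibly retry.
        return []
    # Crude href extraction; a real implementation might use a
    # proper HTML parser instead of a regular expression.
    return re.findall(r'href="([^"]+)"', response.text)
```

The parser would then filter these links against the URL pattern and extract the metadata, keeping the two responsibilities separate.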

PHASE-3

The final phase will be to rewrite the examples around the new Scraper, update the SunPy documentation, and make sure that the old documentation is properly updated everywhere the Scraper is used. I will also give final touches to the codebase, adding explanations for each method. Besides this, I can make changes, add features, etc. if requested by the mentors.

DEVELOPMENT TIMELINE

Community Bonding Period [May 4 - May 28]

  • Meet with the mentors and ask which part of the proposal needs the most work.
  • Investigate and research the part of the project which requires the most work.
  • Familiarize myself with sunpy.net.scraper, in case anything needs to be researched.

Week 1 - 2 [May 29 - June 13]

  • Finalize and concretise the scope of the scraper and parser.
  • Finalize the structure of the scraper and parser (plan its utilities, methods, etc.).

Week 3 - 4 [June 14 - June 28]

  • Develop the Parser Unit.

Week 5 - 6 [June 29 - July 12]

  • Write tests for the Parser and perform checks.
  • Give the code to the mentors for the phase-1 submission.

Week 7 - 8 [July 13 - July 27]

  • Develop the Scraper unit.
  • Integrate the Scraper unit with the Parser.

Week 9 - 10 [July 28 - Aug 10]

  • Write tests for the Scraper and perform checks.
  • Crush bugs if found.

Week 11 [Aug 11 - Aug 18]

  • Rewrite the examples for the Scraper if needed and update the documentation.
  • Final review by the mentors; implement anything they request.
  • Submit the project for final evaluation.

What I Wish to Gain from GSoC

Primarily, I aim to gain more experience working with open-source communities. I have contributed a bit to other repositories before, and hope that Summer of Code serves as an entry point for me to give back more to the community. Secondly, I'm excited about being mentored during this project. Even during my recent contributions to Sunpy, I understood that developing software at this level is different from academic projects. I want to grow these skills and hope to learn the difference between good, bad, and great code.

GSoC Experience So Far

This is my first experience with Google Summer of Code. It grabbed my attention when I learned that I would be working with the community and that we would grow mutually. I am fully dedicated to contributing back to a community that has done so much for student developers like me by providing such marvelous open-source projects, which make the process of software development handy and easy.

Are you also applying to other projects?

I'm not applying to any other projects this year.

Are you eligible to receive payments from Google?

Yes, I am fully eligible to receive payments from Google, and I am above 18.

How much time do you plan to invest in the project before, during, and after the Summer of Code?

I plan to spend ~18-20 hours per week working on this GSoC program, but I am prepared to put in more hours if the project turns out to be more difficult than anticipated. The only time I might be a little busy is during the last weeks of May (due to my semester exams), but I don't think that will interfere much with the Community Bonding Period. Except for my exams, I don't have any particular commitment that would hinder my dedication to this project.

Programming Experience

As part of my journey pursuing a B.Tech in Computer Science and Engineering, I've come across a myriad of programming languages. Right now I am most well versed in C++, its fundamentals, and object-oriented programming concepts. As a hobby, I often practice competitive programming and have been to ICPC 2021, a prestigious programming olympiad consisting of top teams from different colleges, through which I've become proficient in C++. I have good hands-on experience with Git and GitHub for uploading my personal as well as hackathon-related projects. Before contributing to SunPy, I had the experience of contributing to CodeDrills - Hire Fast, the official online judge used for conducting ICPC Regionals contests in India. Along with that, I have a decent grasp of Python, which I have used in my machine-learning projects on GitHub.

Contributions made to Sunpy

So far I've made 7 pull requests, listed below. Through working on these issues, I've gained a good grasp of the SunPy codebase, along with a decent handle on Fido clients, SunPy maps, etc.

List of Merged Pull Requests
