# **Top Repositories for GitHub Topics**

### Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Google Colab.

#### Project Outline:-

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 20 repositories from the topic page (the page contains only 20 repositories when it first loads).
- For each repository, we'll grab the repo name, username, stars and repo URL.
- For each topic we'll create a CSV file in the following format:



```
Repo Name, Username, Stars, Repo URL
three.js, mrdoob, 105000, https://github.com/mrdoob/three.js
react-three-fiber, pmndrs, 28300, https://github.com/pmndrs/react-three-fiber
```


### Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.

In [None]:
# !pip install requests --upgrade --quiet

The `--quiet` flag hides the download progress; only errors, if any, are shown.

In [None]:
import requests

In [None]:
topics_url = 'https://github.com/topics'

- Sends a `GET` request to `topics_url` using the `requests` library.  
- Displays the HTTP status code of the response to check if the request was successful (`200` indicates success).

In [None]:
response = requests.get(topics_url)

In [None]:
response.status_code

200
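The request and the status check can also be folded into one small helper. The sketch below is one possible approach, not part of the original notebook: `fetch_html` is a hypothetical name and the 10-second timeout is an arbitrary assumption. It relies on `raise_for_status()`, the `requests` method that raises `requests.HTTPError` for 4xx/5xx responses.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML text (hypothetical helper).

    Raises requests.HTTPError for 4xx/5xx responses instead of relying
    on a manual status_code check.
    """
    # Use the caller's session if one is given, else the module-level API.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)  # timeout value is an assumption
    response.raise_for_status()
    return response.text
```

With this helper, `page_contents = fetch_html(topics_url)` would stand in for the separate `requests.get` and `status_code` cells.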

In [None]:
len(response.text)

206070

- Extracts the HTML content of the response as text.  
- Takes the first 1000 characters and splits them into lines.  
- Prints each line to get a preview of the webpage's HTML structure.

In [None]:
page_contents = response.text
page_content_example = page_contents[:1000].split('\n')
for i in page_content_example:
  print(i)



<!DOCTYPE html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"
  data-a11y-animated-images="system" data-a11y-link-underlines="true"
  
  >



  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>
  <link rel="preconnect" href="https://avatars.githubusercontent.com">

  

  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-605318cbe3a1.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-bd1cb5575fff.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media=

- Saves the entire HTML content of the webpage into a file named **`webpage.html`**.  
- This allows for offline analysis and debugging of the scraped data.

In [None]:
with open('webpage.html', 'w', encoding='utf-8') as f:
  f.write(page_contents)

The saved 'webpage.html' appears under the 'Files' option in the left panel of Colab.

### Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.

In [None]:
from bs4 import BeautifulSoup

In [None]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [None]:
type(doc)

- Defines the CSS class used to locate topic titles on the webpage.  
- Uses `BeautifulSoup` to find all `<p>` tags with the specified class.  
- Comments explain an alternative method using `class_` for filtering.  
- Returns the total number of topic title tags found.

![](https://i.imgur.com/Q634wxh.png)

In [None]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})
# Equivalent: topic_title_tags = doc.find_all('p', class_=selection_class); the dict form above also works for any other attribute.
len(topic_title_tags)

30
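Passing the full class string to `find_all` matches the exact value of the `class` attribute, so it stops matching if the classes appear in a different order. As a hedged alternative, the sketch below uses Beautiful Soup's `select()` with a CSS selector on hand-written sample markup (modelled on the cards above, with the second card's classes deliberately reordered). The choice of `p.f3.Link--primary` as the selector is an assumption about which classes are distinctive.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, not downloaded from GitHub.
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
</a>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the whole class string: misses the reordered card.
exact_matches = sample_doc.find_all(
    'p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: matches any <p> that carries both classes, in any order.
css_matches = sample_doc.select('p.f3.Link--primary')

print(len(exact_matches))
print([tag.text for tag in css_matches])
```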

- Displays the first **5** topic title tags extracted from the webpage.  
- Helps verify if the correct elements have been selected before further processing.

In [None]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

- Defines the CSS class used to locate topic descriptions on the webpage.  
- Uses `BeautifulSoup` to find all `<p>` tags with the specified class.  
- Returns the total number of topic description tags found.

In [None]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selector})
len(topic_desc_tags)

30

- Displays the first **5** topic description tags extracted from the webpage.  
- Helps verify if the correct elements have been selected before further processing.

In [None]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

##### Exploring parent tags (for understanding only)

- Extracts the first topic title tag from the list.  
- Displays its raw HTML content for inspection.

In [None]:
topic_title_tag0 = topic_title_tags[0]
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

- Retrieves the **parent** element of the first topic title tag.  
- Helps analyze the surrounding HTML structure for better navigation or additional data extraction.

In [None]:
topic_title_tag0.parent

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>

- Retrieves the **name** of the parent tag of the first topic title element.  
- Helps understand the HTML hierarchy and structure for more efficient scraping.

In [None]:
topic_title_tag0.parent.name

'a'

- Retrieves the **attributes** of the parent tag of the first topic title element.  
- Helps inspect additional metadata (e.g., class names, IDs) that might be useful for scraping.

In [None]:
topic_title_tag0.parent.attrs

{'href': '/topics/3d',
 'class': ['no-underline', 'flex-1', 'd-flex', 'flex-column']}

In [None]:
topic_title_tag0.parent.parent

<div class="py-4 border-bottom d-flex flex-justify-between">
<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
<div class="flex-grow-0">
<div class="d-block" data-view-component="true">
<a aria-label="You must be signed in to star a repository" class="tooltipped tooltipped-sw btn-sm btn" data-hydro-click='{"event_type":"authentication.click","payload":{"location_in_page":"star button","repository_id":null,"auth_type":"LOG_IN","originating_url":"https://github.com/topics","user_id":null}}' data-hydro-click-hmac="

The code above is for learning purposes only.

In [None]:
url_selector = 'no-underline flex-1 d-flex flex-column'
topic_link_tags = doc.find_all('a', {'class': url_selector})
len(topic_link_tags)

30

In [None]:
topic_link_tags[0]

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>

In [None]:
topic_link_tags[0]['href']

'/topics/3d'

- Extracts the **href** attributes from `topic_link_tags` to get topic URL suffixes.  
- Prepends `'https://github.com'` to each suffix to form complete URLs.  
- Displays the first **5** full topic URLs for verification.

In [None]:
topic_url_suffix = []
for i in topic_link_tags:
  topic_url_suffix.append(i['href'])
topic_url_suffix[:5]

['/topics/3d',
 '/topics/ajax',
 '/topics/algorithm',
 '/topics/amphp',
 '/topics/android']

In [None]:
topic_urls = []
for i in topic_url_suffix:
  topic_urls.append('https://github.com' + i)

topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

Remember that the approach above is not the only option; there are many other ways to build the URLs.
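One such alternative, shown here as a minimal sketch, is `urljoin` from the standard library's `urllib.parse`, which resolves a relative path against a base URL. The three suffixes are copied from the output above; `base_url` is an illustrative variable name.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'
topic_url_suffix = ['/topics/3d', '/topics/ajax', '/topics/algorithm']

# urljoin resolves each relative path against the base URL.
topic_urls = [urljoin(base_url, suffix) for suffix in topic_url_suffix]

print(topic_urls)
```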

In [None]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [None]:
topic_title_tags[0].text

'3D'

In [None]:
topic_titles = []

for tag in topic_title_tags:
  topic_titles.append(tag.text)

topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [None]:
topic_descriptions = []
for tag in topic_desc_tags:
  topic_descriptions.append(tag.text.strip())

topic_descriptions[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']
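The three lists above are built independently, so they stay aligned only if every topic card has a title, a description, and a link. A possible alternative, sketched below on hand-written sample markup copied from the parent tags shown earlier, is to loop over each `<a>` card and read its two `<p>` children together. It assumes the first `<p>` is always the title and the second the description.

```python
from bs4 import BeautifulSoup

# Hand-written sample based on the card structure shown earlier.
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          Ajax is a technique for creating interactive web applications.
        </p>
</a>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')
card_class = 'no-underline flex-1 d-flex flex-column'

topic_rows = []
for link_tag in sample_doc.find_all('a', {'class': card_class}):
    p_tags = link_tag.find_all('p')  # assumed order: title, then description
    topic_rows.append({
        'topic': p_tags[0].text.strip(),
        'topic_desc': p_tags[1].text.strip(),
        'topic_URL': 'https://github.com' + link_tag['href'],
    })

for row in topic_rows:
    print(row)
```

Each dictionary holds one complete topic, so a missing element fails loudly at that card instead of silently shifting the later rows.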

### Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.

In [None]:
import pandas as pd

- Creates a dictionary **`data`** with three keys:  
  - `'topic'` → List of topic titles  
  - `'topic_desc'` → List of topic descriptions  
  - `'topic_URL'` → List of full topic URLs  
- Converts the dictionary into a **Pandas DataFrame** (`df`).  
- Displays the first **5** rows using `df.head()` to verify the structured data.

In [None]:
data = {'topic': topic_titles,
        'topic_desc': topic_descriptions,
        'topic_URL': topic_urls}

df = pd.DataFrame(data = data)

df.head()

Unnamed: 0,topic,topic_desc,topic_URL
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


In [None]:
df.shape

(30, 3)

After saving, the CSV file appears under the file icon in the left panel of Google Colab.

- Saves the DataFrame **`df`** to a CSV file named **`topics.csv`**.  
- Sets `index=None` to exclude the index column from the CSV file.

In [None]:
# 'index = None' will not include index in csv file
df.to_csv('topics.csv', index=None)
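To verify a saved CSV, it can be read back with `pd.read_csv`. The sketch below is a small round trip using two sample rows taken from the output above; it writes to a temporary directory (an assumption made so the example does not touch `topics.csv`) and uses `index=False`, the boolean form of the same option.

```python
import os
import tempfile

import pandas as pd

sample_df = pd.DataFrame({
    'topic': ['3D', 'Ajax'],
    'topic_desc': [
        '3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'topic_URL': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'topics.csv')
    sample_df.to_csv(csv_path, index=False)  # write without the index column
    loaded_df = pd.read_csv(csv_path)        # read it back for verification

print(loaded_df.shape)
print(list(loaded_df.columns))
```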

### **Getting Information out of a topic page**

- Extracts the **URL** of the first topic page from the `topic_urls` list.  
- Displays the URL for verification before further processing.

In [None]:
topic_page_url0 = topic_urls[0]

topic_page_url0

'https://github.com/topics/3d'

- Sends a **GET** request to the first topic page URL.  
- Displays the **HTTP status code** to check if the request was successful (`200` indicates success).

In [None]:
response = requests.get(topic_page_url0)

response.status_code

200

In [None]:
len(response.text)

514426

- Parses the HTML content of the topic page using **BeautifulSoup**.  
- Stores the parsed HTML in `topic_doc` for further data extraction.

In [None]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

- Extracts all `<h3>` tags with the specified class, which likely contain repository names.  
- Stores the extracted tags in `repo_tags` for further processing.

![](https://i.imgur.com/TYg4LQf.png)

In [None]:
repo_tags = topic_doc.find_all('h3', { 'class' : 'f3 color-fg-muted text-normal lh-condensed' } )

In [None]:
len(repo_tags)

20

In [None]:
repo_tags[0].text.split('\n')

['', 'mrdoob          /', '          three.js ']

- Extracts all `<a>` tags from the first repository tag (`repo_tags[0]`).  
- Displays the extracted anchor tags.

In [None]:
a_tags0 = repo_tags[0].find_all('a')
a_tags0

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">mrdoob</a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">three.js</a>]

- Retrieves the **href** attribute from the second `<a>` tag in `a_tags0`.  
- Likely represents the repository's relative URL on GitHub.

In [None]:
a_tags0[1]['href']

'/mrdoob/three.js'

In [None]:
a_tags0[0].text

'mrdoob'

- Initializes empty lists to store **usernames, repository names, and repository URLs**.  
- Iterates through `repo_tags` to extract:  
  - **Username** from the first `<a>` tag.  
  - **Repository name** from the second `<a>` tag.  
  - **Full repository URL** by appending the **href** value to `base_url`.

In [None]:
usernames = []
repository_names = []
repository_urls = []
base_url = 'https://www.github.com'

for tags in repo_tags:
  a_tags = tags.find_all('a')
  usernames.append(a_tags[0].text)
  repository_names.append(a_tags[1].text)
  repository_urls.append(base_url + a_tags[1]['href'])

In [None]:
usernames[:5]

['mrdoob', 'pmndrs', 'libgdx', 'BabylonJS', 'FreeCAD']

In [None]:
repository_names[:5]

['three.js', 'react-three-fiber', 'libgdx', 'Babylon.js', 'FreeCAD']

In [None]:
repository_urls[:5]

['https://www.github.com/mrdoob/three.js',
 'https://www.github.com/pmndrs/react-three-fiber',
 'https://www.github.com/libgdx/libgdx',
 'https://www.github.com/BabylonJS/Babylon.js',
 'https://www.github.com/FreeCAD/FreeCAD']

- Extracts all `<span>` tags with the **ID** `'repo-stars-counter-star'`, which likely contain star counts.  
- Retrieves the **text** from the third (`index 2`) star tag to check its value.

![](https://i.imgur.com/MkAawLX.png)

In [None]:
star_tags = topic_doc.find_all('span', { 'id' : 'repo-stars-counter-star'})
star_tags[2].text

'23.8k'

- Defines a function `parse_star_count(stars_str)` to convert GitHub star counts into integers.  
- **Logic:**  
  - Strips any extra spaces.  
  - If the last character is `'k'`, converts the value to an integer by multiplying by **1000**.  
  - Otherwise, directly converts the string to an integer.

In [None]:
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    return int(float(stars_str[:-1]) * 1000)
  return int(stars_str)

- Calls `parse_star_count()` on the text of the third (`index 2`) star tag.  
- Confirms that the function returns the star count as an integer, or reveals where it needs improvement.

In [None]:
parse_star_count(star_tags[2].text)

23800
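A slightly more defensive variant is sketched below. It is not the notebook's function: `parse_star_count_safe` is a hypothetical name, and the handling of an `'m'` suffix and of comma separators is an assumption, since neither appears in the scraped output above. It uses `round()` rather than `int()` so that floating-point error in the multiplication cannot drop the result by one.

```python
def parse_star_count_safe(stars_str):
    """Convert strings such as '987', '1,234', '23.8k' or '1.2m' to an int."""
    # The 'm' suffix and the comma handling are assumptions, not formats
    # observed on the scraped page.
    multipliers = {'k': 1_000, 'm': 1_000_000}
    cleaned = stars_str.strip().lower().replace(',', '')
    if cleaned and cleaned[-1] in multipliers:
        # round() guards against results like 1199999.9999999998
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)


for sample in ['23.8k', ' 105k ', '987', '1,234', '1.2m']:
    print(repr(sample), parse_star_count_safe(sample))
```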

- Defines `get_single_repo_info(h1_tag, star_tag)` to extract repository details.  
- **Extracts:**  
  - **Username** from the first `<a>` tag.  
  - **Repository name** from the second `<a>` tag.  
  - **Repository URL** by appending the **href** value to `base_url`.  
  - **Star count** using `parse_star_count()`.  
- Returns **(username, repository name, star count, repository URL)** as a tuple.

In [None]:
def get_single_repo_info(h1_tag, star_tag):
  # Get Username, Repository Name and Repository URL
  base_url = 'https://www.github.com'
  a_tags = h1_tag.find_all('a')
  username = a_tags[0].text
  repository_name = a_tags[1].text
  repository_url = base_url + a_tags[1]['href']

  # Get Star Info
  star_str = star_tag.text
  star_count = parse_star_count(star_str)

  return username, repository_name, star_count, repository_url

The function repeats the steps for scraping the username, repository name, and repository URL. We have already scraped these above, but wrapping the steps in a function makes it easier to collect the same information for many topics.

In [None]:
def get_topic_page(topic_url):
  # Download the Page
  response = requests.get(topic_url)

  # Check Successful response
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))


  # Parse using beautifulsoup
  topic_doc = BeautifulSoup(response.text, 'html.parser')

  return topic_doc


def get_single_repo_info(h1_tag, star_tag):
  # Get Username, Repository Name and Repository URL
  base_url = 'https://www.github.com'
  a_tags = h1_tag.find_all('a')
  username = a_tags[0].text
  repository_name = a_tags[1].text
  repository_url = base_url + a_tags[1]['href']

  # Get Star Info
  star_str = star_tag.text
  star_count = parse_star_count(star_str)

  return username, repository_name, star_count, repository_url


def get_single_topic_info(topic_doc):

  # Get the h3 tags that contain repo title, repo URL and username
  repo_tags = topic_doc.find_all('h3', { 'class' : 'f3 color-fg-muted text-normal lh-condensed' })

  # Get star tags that contains star information
  star_tags = topic_doc.find_all('span', { 'id' : 'repo-stars-counter-star'})


  single_topic_info = {
      'username' : [],
      'repository_name' : [],
      'star_count' : [],
      'repository_url' : []
  }

  # Get repo information
  if len(repo_tags) == len(star_tags):
    for i in range(len(repo_tags)):
      repo_info = get_single_repo_info(repo_tags[i], star_tags[i])
      single_topic_info['username'].append(repo_info[0])
      single_topic_info['repository_name'].append(repo_info[1])
      single_topic_info['star_count'].append(repo_info[2])
      single_topic_info['repository_url'].append(repo_info[3])
  else:
    print('Length of list \'repo_tags\' and list \'star_tags\' are not equal')
  return pd.DataFrame(data = single_topic_info)

- Defines `scrape_topics()` to extract GitHub topics and their details.  
- **Steps:**  
  1. Sends a **GET** request to `'https://github.com/topics'`.  
  2. Parses the HTML using **BeautifulSoup**.  
  3. Extracts:  
     - **Topic titles** (`<p>` tags with a specific class).  
     - **Topic descriptions** (`<p>` tags with another class).  
     - **Topic URLs** (`<a>` tags with a unique class).  
  4. Checks if all extracted lists have the same length to ensure data consistency.  
  5. Appends extracted details into a dictionary (`data`).  
  6. Converts `data` into a **Pandas DataFrame** and returns it.

In [None]:
def scrape_topics():
  topics_url = 'https://github.com/topics'

  response = requests.get(topics_url)
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topics_url))

  doc = BeautifulSoup(response.text, 'html.parser')

  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p', {'class': selection_class})

  desc_selector = 'f5 color-fg-muted mb-0 mt-1'
  topic_desc_tags = doc.find_all('p', {'class': desc_selector})

  url_selector = 'no-underline flex-1 d-flex flex-column'
  topic_link_tags = doc.find_all('a', {'class': url_selector})

  data = {
      'topic': [],
      'topic_desc': [],
      'topic_url': [],
  }

  base_url = 'https://www.github.com'

  if (len(topic_title_tags) == len(topic_desc_tags)) and (len(topic_desc_tags) == len(topic_link_tags)):
    for i in range(len(topic_title_tags)):
      data['topic'].append(topic_title_tags[i].text)
      data['topic_desc'].append(topic_desc_tags[i].text.strip())
      data['topic_url'].append(base_url + topic_link_tags[i]['href'])
  else:
    print('Length of list \'topic_title_tags\' and list \'topic_desc_tags\' and list \'topic_link_tags\' are not equal')

  return pd.DataFrame(data = data)


In [None]:
topics = scrape_topics()
topics.head()

Unnamed: 0,topic,topic_desc,topic_url
0,3D,3D refers to the use of three-dimensional grap...,https://www.github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://www.github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://www.github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://www.github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://www.github.com/topics/android


#### Finally, download and save CSVs for all topics and repos

- Defines `download_topic_repo_csvs(topics)` to scrape GitHub topics, generate CSV files, and download them as a ZIP archive.  
- **Steps:**  
  1. Creates a folder **`github_topic_wise_csv_files`** to store CSVs.  
  2. Writes a **`Guide.txt`** file explaining the folder's contents.  
  3. Scrapes GitHub topics and saves them in **`topics.csv`**.  
  4. Iterates through each topic to:  
     - Scrape the top 20 repositories.  
     - Save them in a CSV file named **`<topic_name>.csv`**.  
  5. Creates a ZIP archive of the folder.  
  6. Downloads the ZIP file using `files.download()`.  
  7. Prints a success message. ✅

In [None]:
import os
import shutil

from google.colab import files

def download_topic_repo_csvs(topics):
  folder_name = "github_topic_wise_csv_files"
  os.makedirs(folder_name, exist_ok=True)

  txt_file_content = '''
  This Zip folder contains main 'topics.csv' file that has list of topics on github that are trending.

  And each topic in 'topics.csv' file will have github link of that topic.

  By redirect to a specific topic link available in 'topics.csv', we get top 20 repositories of that topic, and that repository data is also available in zip folder.

  Each topic in 'topics.csv' files has a separate csv file in zip folder by name 'topic_name.csv'.

  Ex. In 'topics.csv' file has one topic name '3D' then there will be one file available in zip folder with name '3D.csv'.
  and that '3D.csv' file contains list of top 20 repositories of topic '3D' with username, repository name, star in that repository and url of that repository.
  '''

  file_path_guide_txt = os.path.join(folder_name, 'Guide.txt')
  with open(file_path_guide_txt, 'w') as file:
    file.write(txt_file_content)

  file_path_topic_csv = os.path.join(folder_name, 'topics.csv')
  df = scrape_topics()
  df.to_csv(file_path_topic_csv, index = None)

  for index, row in topics.iterrows():
    topic_doc = get_topic_page(row['topic_url'])
    single_topic_df = get_single_topic_info(topic_doc)
    fname = row['topic'] + '.csv'
    file_path = os.path.join(folder_name, fname)
    single_topic_df.to_csv(file_path, index = None)

  shutil.make_archive(folder_name, 'zip', folder_name)
  # shutil.make_archive(zip folder name, 'compression type ex. zip, tar, gztar', folder_name)

  files.download('{}.zip'.format(folder_name))

  print("✅All files zipped and downloaded successfully!")



In Google Colab, `files.download()` appears to work for only about 10 files per session because browsers restrict multiple automatic downloads. After that, it silently stops downloading without raising an error.

So we store all the CSVs in one folder, zip that folder, and download the single ZIP file.
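The folder-then-zip idea can be tried outside Colab with only the standard library. The sketch below is illustrative: it writes header-only CSVs into a temporary directory, and `safe_filename` is a hypothetical helper (with `'C/C++'` as a made-up topic name) showing one way to avoid characters such as `/` that would break a file path.

```python
import os
import re
import shutil
import tempfile
import zipfile


def safe_filename(topic_name):
    """Replace characters that are unsafe in file names (hypothetical helper)."""
    return re.sub(r'[^A-Za-z0-9._+#-]', '_', topic_name)


with tempfile.TemporaryDirectory() as work_dir:
    folder = os.path.join(work_dir, 'github_topic_wise_csv_files')
    os.makedirs(folder, exist_ok=True)

    # 'C/C++' is a made-up topic name used only to exercise safe_filename.
    for topic in ['3D', 'Ajax', 'C/C++']:
        csv_path = os.path.join(folder, safe_filename(topic) + '.csv')
        with open(csv_path, 'w', encoding='utf-8') as f:
            f.write('username,repository_name,star_count,repository_url\n')

    # make_archive(base_name, format, root_dir) returns the archive's path.
    zip_path = shutil.make_archive(folder, 'zip', folder)
    with zipfile.ZipFile(zip_path) as archive:
        archive_names = sorted(archive.namelist())

print(archive_names)
```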

- **Imports required libraries**:  
  - `requests` (for web requests)  
  - `BeautifulSoup` (for web scraping)  
  - `google.colab.files` (for file downloads)  
  - `os` and `shutil` (for file handling and compression)  
- **Runs the scraping workflow**:  
  1. Calls `scrape_topics()` to get a DataFrame of GitHub topics.  
  2. Passes the topics DataFrame to `download_topic_repo_csvs()`, which:  
     - Scrapes repository data for each topic.  
     - Saves topic-wise CSVs.  
     - Creates a ZIP archive and downloads it. ✅

In [None]:
import requests
from bs4 import BeautifulSoup
from google.colab import files
import os
import shutil

# The libraries above are required by the two functions below (pandas was imported earlier as pd)

topics = scrape_topics()
download_topic_repo_csvs(topics)

✅All files zipped and downloaded successfully!


## References & Future work

### Summary:-
- The project scrapes **GitHub Topics** and extracts details like topic names, descriptions, and URLs, saving them in **`topics.csv`**.  
- For each topic, it scrapes the **top 20 repositories**, extracting **username, repository name, stars, and URL**, storing them in separate **CSV files**.  
- All CSV files are zipped and **downloaded automatically** for easy access.  
- The project is implemented in **Google Colab** using `requests`, `BeautifulSoup`, `pandas`, and file-handling libraries. ✅


### References:-
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- https://docs.python-requests.org/en/v2.0.0/
- https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
- https://www.geeksforgeeks.org/create-a-new-text-file-in-python/
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html


### Ideas for Future Work
- GitHub shows only the top 20 repositories when a topic page first loads.
- To see more, we have to click the 'load more' area at the bottom of the page, which loads roughly 20 more repositories each time.
- Because the downloaded HTML contains tags for only the first 20 repositories, we scraped only the top 20.
- A future version of the script could scrape the top 100 repositories for each topic.
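One possible starting point, sketched below, is to request further pages directly rather than clicking 'load more'. This rests on an unverified assumption: that the topic page accepts a `page` query parameter. Check this in the browser's network tab before relying on it. `topic_page_urls` is a hypothetical helper that only builds the URLs.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def topic_page_urls(topic_url, num_pages):
    """Build URLs for pages 1..num_pages of a topic (hypothetical helper).

    Assumes the topic page accepts a `page` query parameter.
    """
    parts = urlsplit(topic_url)
    return [
        urlunsplit(parts._replace(query=urlencode({'page': page})))
        for page in range(1, num_pages + 1)
    ]


# 5 pages of 20 repositories each would cover the top 100 repositories.
for url in topic_page_urls('https://github.com/topics/3d', 5):
    print(url)
```

If the assumption holds, each URL could be passed to `get_topic_page` and the resulting DataFrames concatenated.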