<a href="https://colab.research.google.com/github/rohit-s-s/scrapping-githhub-topic-repository/blob/main/scrapping_github_topics_repositories.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# scrapping-github-topics-repositories

### Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.


#### Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
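The CSV layout above can be sketched with the standard library alone. The helper names `parse_star_count` and `write_repo_csv` are hypothetical, and the assumption that star counts appear on the page in an abbreviated form such as `69.7k` is not confirmed anywhere in this notebook.

```python
import csv
import io


def parse_star_count(text):
    """Convert a star label such as '69.7k' or '812' to an integer.

    Assumption: large counts are abbreviated with a trailing 'k' for thousands.
    """
    text = text.strip().lower()
    if text.endswith("k"):
        return round(float(text[:-1]) * 1000)
    return int(text)


def write_repo_csv(rows, file_obj):
    """Write (repo name, username, stars, repo URL) rows under the outlined header."""
    writer = csv.writer(file_obj, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)


rows = [
    ("three.js", "mrdoob", parse_star_count("69.7k"), "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", parse_star_count("18.3k"), "https://github.com/libgdx/libgdx"),
]
buffer = io.StringIO()  # stands in for an open file so the sketch leaves nothing on disk
write_repo_csv(rows, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the `StringIO` buffer would write the same text to disk.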

Installing the *requests* library

In [45]:
!pip install requests --upgrade --quiet

In [46]:
import requests

In [47]:
topic_url = 'https://github.com/topics'

In [48]:
response = requests.get(topic_url)

In [49]:
response.status_code

200
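A status code of 200 means the request succeeded. A minimal sketch of making that check explicit before any parsing happens: `ensure_ok` is a hypothetical helper that relies only on the `status_code` and `text` attributes used in this notebook, and `FakeResponse` exists only so the sketch runs without network access.

```python
def ensure_ok(response, url):
    """Return the page text if the status code is in the 2xx range, else raise."""
    if not 200 <= response.status_code < 300:
        raise RuntimeError(f"Failed to load {url}: status {response.status_code}")
    return response.text


class FakeResponse:
    """Stand-in for a requests response, for illustration only."""

    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text


page = ensure_ok(FakeResponse(200, "<html></html>"), "https://github.com/topics")
print(len(page))
```

With a real response object, `response.raise_for_status()` from `requests` offers a similar guard for 4xx and 5xx codes.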

In [50]:
len(response.text)

164634

In [51]:
response.text[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-8cafbcbd78f4.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-31dc14e38457.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="

In [52]:
content = response.text

In [53]:
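# Save a local copy of the page; an explicit encoding avoids platform-dependent defaults
with open("webpage.html", "w", encoding="utf-8") as f:
  f.write(content)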


In [54]:
!pip install beautifulsoup4 --upgrade --quiet

In [55]:
from bs4 import BeautifulSoup

In [56]:
doc = BeautifulSoup(content, 'html.parser')

In [57]:
topic_titles_tag = doc.find_all('p', {'class': "f3 lh-condensed mb-0 mt-1 Link--primary"})

In [58]:
len(topic_titles_tag)

30

In [59]:
topic_titles_tag[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [60]:
topic_titles_tag[0].text

'3D'

In [61]:
# Collect the text of every title tag
topic_titles = []
for tag in topic_titles_tag:
  topic_titles.append(tag.text)

In [62]:
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [63]:
topic_des_tag = doc.find_all("p", {"class":"f5 color-fg-muted mb-0 mt-1"})

In [64]:
len(topic_des_tag)

30

In [65]:
topic_des_tag[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [66]:
topic_des_tag[0].text.strip()

'3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.'

In [67]:
# Collect every description, trimming the surrounding whitespace
topic_desc = []
for tag in topic_des_tag:
  topic_desc.append(tag.text.strip())
print(topic_desc)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co
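The title and description lists line up by position only as long as every topic has both elements. A hedged sketch of reading the two from the same place instead: it assumes the description `<p>` is a following sibling of the title `<p>`, which the notebook does not confirm, and the sample HTML (including the second topic with no description) is made up for illustration. `pair_titles_with_descriptions` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Illustrative markup only; the real page structure may differ.
SAMPLE_HTML = """
<a href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
  </p>
</a>
<a href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</a>
"""


def pair_titles_with_descriptions(doc):
    """Return (title, description) tuples, using '' when no description follows."""
    pairs = []
    for title_tag in doc.find_all("p", class_="Link--primary"):
        # Look for the description among the siblings that follow this title
        desc_tag = title_tag.find_next_sibling("p", class_="color-fg-muted")
        description = desc_tag.get_text(strip=True) if desc_tag else ""
        pairs.append((title_tag.get_text(strip=True), description))
    return pairs


sample_doc = BeautifulSoup(SAMPLE_HTML, "html.parser")
pairs = pair_titles_with_descriptions(sample_doc)
print(pairs)
```

A topic without a description then yields an empty string rather than shifting every later description up by one position.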

In [68]:
topic_link_tag=doc.find_all("a",{"class":"no-underline flex-1 d-flex flex-column"})

In [69]:
topic_link_tag[0]["href"]

'/topics/3d'
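The `href` value is site-relative, so it has to be combined with the site root to form a full URL. A minimal sketch using the standard library's `urljoin`; `BASE_URL` and `absolute_url` are illustrative names, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"  # site root, written without surrounding whitespace


def absolute_url(href):
    """Resolve a site-relative href such as '/topics/3d' against the site root."""
    return urljoin(BASE_URL, href)


print(absolute_url("/topics/3d"))
```

`urljoin` also leaves an already absolute URL unchanged, which plain string concatenation would not.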

In [71]:
base_link = "https://github.com"  # site root that the relative hrefs are appended to
print(base_link+topic_link_tag[0]["href"])

https://github.com/topics/3d


In [72]:
# Build the full URL of every topic page
topic_link = []
for tag in topic_link_tag:
  topic_link.append(base_link+tag["href"])

In [73]:
!pip install pandas --upgrade --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==1.5.3, but you have pandas 2.0.3 which is incompatible.
google-colab 1.0.0 requires requests==2.27.1, but you have requests 2.31.0 which is incompatible.

In [74]:
import pandas as pd

In [75]:
topic_dict = {
    "title" : topic_titles,
    "Description" : topic_desc,
    "Link": topic_link
}

In [76]:
topic_df = pd.DataFrame(topic_dict)
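`pd.DataFrame` rejects a dict of lists whose lengths differ, so a mismatch between the scraped lists surfaces only at this step. A small sketch of checking the lengths up front with a clearer message; `check_equal_lengths` is a hypothetical helper and the sample values are shortened for illustration.

```python
def check_equal_lengths(**columns):
    """Raise ValueError unless all named lists have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


lengths = check_equal_lengths(
    title=["3D", "Ajax"],
    description=["3D refers to ...", "Ajax is a technique ..."],
    link=["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(lengths)
```

In the notebook this would be called with `topic_titles`, `topic_desc` and `topic_link` before building the dict.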

In [77]:
topic_df

| | title | Description | Link |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [78]:
topic_df.to_csv("topic.csv", index=None)

### Getting information from a topic page

In [79]:
topic_page_url = topic_link[0]

In [80]:
topic_page_url

'https://github.com/topics/3d'

In [81]:
response = requests.get(topic_page_url)

In [82]:
response.status_code

200

In [84]:
response.text[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-8cafbcbd78f4.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-31dc14e38457.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="

In [85]:
topic_doc = BeautifulSoup(response.text, "html.parser")

In [86]:
repo_tag = topic_doc.find_all("h3", {"class":"f3 color-fg-muted text-normal lh-condensed"})

In [87]:
len(repo_tag)

20

In [88]:
repo_tag[:5]

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>          /
           <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/thr

In [93]:
a_tag=repo_tag[0].find_all("a")

In [101]:
a_tag[1]["href"]

'/mrdoob/three.js'

In [102]:
username = []
repository = []
repo_link = []
for tag in repo_tag:
  # Each heading holds two links: the owner first, then the repository
  a_tag=tag.find_all("a")
  username.append(a_tag[0].text.strip())
  repository.append(a_tag[1].text.strip())
  repo_link.append(base_link+a_tag[1]["href"])

print(repo_link)

['https://github.com/mrdoob/three.js', 'https://github.com/pmndrs/react-three-fiber', 'https://github.com/libgdx/libgdx', 'https://github.com/BabylonJS/Babylon.js', 'https://github.com/ssloy/tinyrenderer', 'https://github.com/lettier/3d-game-shaders-for-beginners', 'https://github.com/aframevr/aframe', 'https://github.com/FreeCAD/FreeCAD', 'https://github.com/CesiumGS/cesium', 'https://github.com/metafizzy/zdog', 'https://github.com/isl-org/Open3D', 'https://github.com/timzhang642/3D-Machine-Learning', 'https://github.com/blender/blender', 'https://github.com/a1studmuffin/SpaceshipGenerator', 'https://github.com/domlysz/BlenderGIS', 'https://github.com/FyroxEngine/Fyrox', 'https://github.com/google/model-viewer', 'https://github.com/nerfstudio-project/nerfstudio', 'https://github.com/openscad/openscad', 'https://github.com/spritejs/spritejs']


In [103]:
topic_repos_dict ={
    "Username":username,
    "Repository":repository,
    "Link":repo_link
}

In [105]:
topic_repo_df = pd.DataFrame(topic_repos_dict)
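The three parallel lists can also be built as one record per repository, which keeps each username, name and link together. A sketch under stated assumptions: `parse_repo_cards` is a hypothetical function, the sample HTML is a simplified stand-in for the real headings, and the second heading with a single link is invented to show the skip branch.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"
REPO_HEADING_CLASS = "f3 color-fg-muted text-normal lh-condensed"  # selector used in this notebook

# Simplified, illustrative markup; real headings carry more attributes.
SAMPLE_HTML = """
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/mrdoob">mrdoob</a> /
  <a class="text-bold wb-break-word" href="/mrdoob/three.js">three.js</a>
</h3>
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/libgdx">libgdx</a>
</h3>
"""


def parse_repo_cards(html):
    """Return one dict per repository heading that has an owner link and a repo link."""
    doc = BeautifulSoup(html, "html.parser")
    records = []
    for heading in doc.find_all("h3", {"class": REPO_HEADING_CLASS}):
        links = heading.find_all("a")
        if len(links) < 2:
            continue  # skip headings that lack either the owner or the repo link
        owner_link, repo_link = links[0], links[1]
        records.append({
            "Username": owner_link.get_text(strip=True),
            "Repository": repo_link.get_text(strip=True),
            "Link": BASE_URL + repo_link["href"],
        })
    return records


records = parse_repo_cards(SAMPLE_HTML)
print(records)
```

A list of dicts like `records` can be passed directly to `pd.DataFrame`, with the dict keys becoming the column names.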

In [106]:
topic_repo_df

| | Username | Repository | Link |
|---|---|---|---|
| 0 | mrdoob | three.js | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | https://github.com/ssloy/tinyrenderer |
| 5 | lettier | 3d-game-shaders-for-beginners | https://github.com/lettier/3d-game-shaders-fo... |
| 6 | aframevr | aframe | https://github.com/aframevr/aframe |
| 7 | FreeCAD | FreeCAD | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | https://github.com/CesiumGS/cesium |
| 9 | metafizzy | zdog | https://github.com/metafizzy/zdog |


In [107]:
topic_repo_df.to_csv("topic_repo.csv", index=None)
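The outline calls for one CSV file per topic, while the cell above writes a single fixed file name. One possible way to derive a distinct name per topic is to reuse the last segment of the topic URL, sketched below; `topic_csv_name` is a hypothetical helper, and naming files after the URL slug is an assumption rather than something the notebook specifies.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def topic_csv_name(topic_url):
    """Return a CSV file name based on the last path segment of a topic URL."""
    slug = PurePosixPath(urlparse(topic_url.strip()).path).name
    if not slug:
        raise ValueError(f"No topic slug in URL: {topic_url!r}")
    return f"{slug}.csv"


for url in ["https://github.com/topics/3d", "https://github.com/topics/ajax"]:
    print(url, "->", topic_csv_name(url))
```

Using the URL slug rather than the display title avoids characters that are awkward in file names, such as the spaces in "Awesome Lists".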