# scraping-github-topics

### Pick a website and describe your objective

- We are going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description.
- For each repository, we'll grab the repo name, username, repo URL and stars.
- For each topic, we'll create a CSV file containing these repository fields.

### Use the requests library to download web pages

In [1]:
!pip install requests --upgrade

Collecting requests
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
     |████████████████████████████████| 62 kB 1.8 MB/s             
Installing collected packages: requests
  Attempting uninstall: requests
    Found existing installation: requests 2.26.0
    Uninstalling requests-2.26.0:
      Successfully uninstalled requests-2.26.0
Successfully installed requests-2.28.1


In [2]:
import requests

In [3]:
url = 'https://github.com/topics'

In [4]:
response = requests.get(url)

In [5]:
response.status_code

200

In [6]:
page_content = response.text

In [7]:
!pip install beautifulsoup4 --upgrade

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.11.1-py3-none-any.whl (128 kB)
     |████████████████████████████████| 128 kB 6.1 MB/s            
Installing collected packages: beautifulsoup4
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.10.0
    Uninstalling beautifulsoup4-4.10.0:
      Successfully uninstalled beautifulsoup4-4.10.0
Successfully installed beautifulsoup4-4.11.1


### Use Beautiful Soup to parse and extract information

In [8]:
from bs4 import BeautifulSoup

In [9]:
doc = BeautifulSoup(page_content, 'html.parser')

In [10]:
topic_title_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [11]:
topic_desc_tags = doc.find_all('p', {'class':'f5 color-fg-muted mb-0 mt-1'})

In [12]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [13]:
topic_link_tages = doc.find_all('a',{'class': 'no-underline flex-1 d-flex flex-column'})
len(topic_link_tages)

30

In [14]:
topic_link_tages[:5]

[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/algorithm">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/amphp">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>
 <p class="f

In [15]:
topic_url = "https://github.com" + topic_link_tages[0]['href']

In [16]:
topic_url

'https://github.com/topics/3d'
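
String concatenation works here because every `href` on the page is site-relative. As a slightly more defensive sketch, `urllib.parse.urljoin` from the standard library resolves relative and absolute links alike. The `BASE_URL` constant and the `to_absolute` helper are illustrative names, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'  # assumed base; same prefix as used above

def to_absolute(href, base=BASE_URL):
    """Resolve a possibly relative href against the site root."""
    return urljoin(base, href)

# A site-relative link gets the base prepended
print(to_absolute('/topics/3d'))
# An already absolute link is returned unchanged
print(to_absolute('https://github.com/topics/ajax'))
```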

In [17]:
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [18]:
topic_desc = []
for tag in topic_desc_tags:
    topic_desc.append(tag.text.strip())
topic_desc

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azu

In [19]:
topic_urls = []
for tag in topic_link_tages:
    topic_urls.append('https://github.com'+tag['href'])
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [20]:
!pip install pandas 



In [21]:
import pandas as pd

In [22]:
topic_dict = {
    'title': topic_titles,
    'description': topic_desc,
    'url': topic_urls
}
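
`pd.DataFrame` raises a `ValueError` when the lists in such a dictionary have different lengths, which can happen if one of the CSS class selectors stops matching. A small check along these lines may make that failure easier to diagnose. The helper name `check_equal_lengths` and the toy data are hypothetical.

```python
def check_equal_lengths(columns):
    """Return each column's length, raising if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Toy data standing in for the scraped lists
sample = {
    'title': ['3D', 'Ajax'],
    'description': ['first description', 'second description'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
print(check_equal_lengths(sample))
```

Calling it on `topic_dict` just before `pd.DataFrame(topic_dict)` would report which list came back short.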

In [23]:
topic_df = pd.DataFrame(topic_dict)
topic_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


### Create CSV file(s) with the extracted information

In [24]:
topic_df.to_csv('github-topics.csv', index=False)

### Scrape each topic's top repositories: username, repo name, URL and stars

In [25]:
topic_url = topic_urls[0]

In [26]:
topic_response = requests.get(topic_url)
topic_response.status_code

200

In [27]:
len(topic_response.text)

454947

In [28]:
topic_doc = BeautifulSoup(topic_response.text, 'html.parser')

In [29]:
repo_tags = topic_doc.find_all('h3', {'class':'f3 color-fg-muted text-normal lh-condensed'})

In [30]:
len(repo_tags)

20

In [31]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [32]:
a_tags = repo_tags[0].find_all('a')

In [33]:
repo_username = a_tags[0].text.strip()
repo_username

'mrdoob'

In [34]:
repo_name = a_tags[1].text.strip()
repo_name 

'three.js'

In [35]:
repo_url = 'https://github.com' + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [124]:
repo_stars = topic_doc.find_all('span', {'class':'Counter js-social-count'})
len(repo_stars)

20

In [38]:
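def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    # Counts of 1,000 or more are abbreviated, e.g. '87.6k'
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    # Otherwise the count is a plain integer
    return int(stars_str)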

In [39]:
repo_stars[0].text

'87.6k'

In [40]:
parse_star_count(repo_stars[0].text)

87600
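
Star counts on the page are abbreviated strings such as `'87.6k'`. A slightly more general sketch follows, assuming the page might also show comma-separated values or an `m` suffix. Those formats are assumptions, not something the output above confirms, and `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Convert a count string such as '87.6k' into an int."""
    text = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids off-by-one results from float arithmetic
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_count('87.6k'))
print(parse_count('950'))
```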

In [41]:
def get_repo_info(repo_tag, star_tag):
    # repo_tag is a single <h3> holding two links: the owner and the repository
    a_tags = repo_tag.find_all('a')
    repo_username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = 'https://github.com' + a_tags[1]['href']
    stars = parse_star_count(star_tag.text)
    return repo_username, repo_name, repo_url, stars

In [42]:
repo_inf = get_repo_info(repo_tags[0], repo_stars[0])
repo_inf

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 87600)

In [43]:
topic_repo_dict = {
    'repo_username': [],
    'repo_name': [],
    'repo_url': [],
    'stars': []
}

for i in range(len(repo_tags)):
    repo_inf = get_repo_info(repo_tags[i], repo_stars[i])
    topic_repo_dict['repo_username'].append(repo_inf[0])
    topic_repo_dict['repo_name'].append(repo_inf[1])
    topic_repo_dict['repo_url'].append(repo_inf[2])
    topic_repo_dict['stars'].append(repo_inf[3])

In [44]:
topic_repo_dict

{'repo_username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'ssloy',
  'aframevr',
  'lettier',
  'FreeCAD',
  'CesiumGS',
  'metafizzy',
  'timzhang642',
  'isl-org',
  'a1studmuffin',
  'blender',
  'domlysz',
  'FyroxEngine',
  'openscad',
  'google',
  'spritejs',
  'jagenjo'],
 'repo_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'tinyrenderer',
  'aframe',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'cesium',
  'zdog',
  '3D-Machine-Learning',
  'Open3D',
  'SpaceshipGenerator',
  'blender',
  'BlenderGIS',
  'Fyrox',
  'openscad',
  'model-viewer',
  'spritejs',
  'webglstudio.js'],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/libgdx/libgdx',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/BabylonJS/Babylon.js',
  'https://github.com/ssloy/tinyrenderer',
  'https://github.com/aframevr/aframe',
  'https://github.com/lettier/3d-game-shaders-for-beginners',
  'https://github.com/FreeCAD/Fr

## Writing a single function for scraping topics
## Final code

In [45]:
def get_topic_title(doc):
    topic_title_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_desc(doc):
    topic_desc_tags = doc.find_all('p', {'class':'f5 color-fg-muted mb-0 mt-1'})
    topic_desc = []
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
    return topic_desc

def get_topic_link(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    for tag in topic_link_tags:
        topic_urls.append('https://github.com' + tag['href'])
    return topic_urls

def scrap_to_topics():
    url = 'https://github.com/topics'
    response = requests.get(url)
    
    if response.status_code != 200:
        raise Exception("Failed to load {}".format(url))

    page_content = response.text

    doc = BeautifulSoup(page_content, 'html.parser')

    topic_dict = {
        'title':get_topic_title(doc) ,
        'description':get_topic_desc(doc),
        'url': get_topic_link(doc)
    }

    topic_df = pd.DataFrame(topic_dict)
    return topic_df

    #topic_df.to_csv('topics.csv', index = False)


In [46]:
scrap_to_topics()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [196]:
def get_topic_page(topic_url):
    topic_response = requests.get(topic_url)
    # Check this request's response, not the earlier global `response`
    if topic_response.status_code != 200:
        raise Exception("Failed to load {}".format(topic_url))

    topic_doc = BeautifulSoup(topic_response.text, 'html.parser')
    return topic_doc

def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    # Counts without a 'k' suffix are plain integers
    return int(stars_str)

def get_repo_info(repo_tag, star_tag):
    # Use the tag passed in, so each call describes a different repository
    a_tags = repo_tag.find_all('a')
    repo_username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = 'https://github.com' + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    # Return order matches the column order used in get_topic_repo
    return repo_username, repo_name, repo_url, stars

def get_topic_repo(topic_doc):
    repo_tags = topic_doc.find_all('h3', {'class':'f3 color-fg-muted text-normal lh-condensed'})
    repo_stars = topic_doc.find_all('span', {'class':'Counter js-social-count'})

    topic_repo_dict = {
        'repo_username': [],
        'repo_name': [],
        'repo_url': [],
        'stars': []
    }

    for i in range(len(repo_tags)):
        repo_inf = get_repo_info(repo_tags[i], repo_stars[i])
        topic_repo_dict['repo_username'].append(repo_inf[0])
        topic_repo_dict['repo_name'].append(repo_inf[1])
        topic_repo_dict['repo_url'].append(repo_inf[2])
        topic_repo_dict['stars'].append(repo_inf[3])

    topic_repo_df = pd.DataFrame(topic_repo_dict)
    return topic_repo_df

def scrap_topic(topic_url, topic_name):
    topic_df = get_topic_repo(get_topic_page(topic_url))
    topic_df.to_csv(topic_name + '.csv', index=False)
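
Topic titles such as `Command line interface` or `C++` end up directly in the CSV file names. As a hedged sketch, the title could be normalised before it is passed to `to_csv`. The helper `safe_filename` is hypothetical, and which characters to keep is a matter of taste.

```python
import re

def safe_filename(title, extension='.csv'):
    """Turn a topic title into a conservative file name."""
    # Replace every run of characters other than letters, digits,
    # dots, plus signs and hyphens with a single underscore
    cleaned = re.sub(r'[^A-Za-z0-9.+-]+', '_', title.strip())
    return cleaned.strip('_') + extension

print(safe_filename('Command line interface'))
print(safe_filename('ASP.NET'))
```

With this in place, the file name would be built with `safe_filename(topic_name)` instead of `topic_name + '.csv'`.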

In [197]:
def scraping_top_topic():
    topics_df = scrap_to_topics()
    # Iterate over the freshly scraped topics, not the earlier global topic_df
    for index, row in topics_df.iterrows():
        print("scraping top repositories for {}".format(row['title']))
        scrap_topic(row['url'], row['title'])

In [198]:
scraping_top_topic()

scraping top repositories for 3D
scraping top repositories for Ajax
scraping top repositories for Algorithm
scraping top repositories for Amp
scraping top repositories for Android
scraping top repositories for Angular
scraping top repositories for Ansible
scraping top repositories for API
scraping top repositories for Arduino
scraping top repositories for ASP.NET
scraping top repositories for Atom
scraping top repositories for Awesome Lists
scraping top repositories for Amazon Web Services
scraping top repositories for Azure
scraping top repositories for Babel
scraping top repositories for Bash
scraping top repositories for Bitcoin
scraping top repositories for Bootstrap
scraping top repositories for Bot
scraping top repositories for C
scraping top repositories for Chrome
scraping top repositories for Chrome extension
scraping top repositories for Command line interface
scraping top repositories for Clojure
scraping top repositories for Code quality
scraping top repositories for Code review
scraping top reposit