<a href="https://colab.research.google.com/github/soukarsha122/web_scraping/blob/main/web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Top Repositories for GitHub Topics

## Project Overview / Idea

- We are going to scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories from the topic page.
- For each repository, we'll record the repo name, repo URL, username, and stars.
- For each topic, we'll create a separate CSV file in the following format:

```
Repo Name, Username, Stars, Repo URL
<Data...............................>
```
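
As a minimal sketch of that file layout, the snippet below writes a header and one row in the stated column order using Python's standard `csv` module. The helper name `write_repo_rows`, the file name `example_topic.csv`, and the sample row are made up for illustration; the notebook itself writes its files with pandas.

```python
import csv
import os
import tempfile

# Column order taken from the format described above.
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

def write_repo_rows(path, rows):
    """Write the header followed by one CSV line per repository (hypothetical helper)."""
    # newline='' lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)

# Hypothetical sample row, for illustration only.
sample_rows = [
    ["example-repo", "example-user", 1200, "https://github.com/example-user/example-repo"],
]

out_path = os.path.join(tempfile.gettempdir(), "example_topic.csv")
write_repo_rows(out_path, sample_rows)

# Read the file back to confirm the layout
with open(out_path, newline="", encoding="utf-8") as f:
    written = list(csv.reader(f))
```

Reading the file back gives the header as the first row and the sample repository as the second, with the star count stored as text.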

### The ***requests*** library is used to download web pages

In [None]:
!pip install requests --upgrade --quiet

In [None]:
import requests

In [None]:
topics_url = "https://github.com/topics"

In [None]:
response = requests.get(topics_url)

In [None]:
if response.status_code != 200:
    raise Exception(f"Request failed with status code: {response.status_code}")

In [None]:
page_contents = response.text

In [None]:
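# save a local copy of the page; an explicit encoding avoids platform-dependent defaults
with open('webpage.html', 'w', encoding='utf-8') as f:
  f.write(page_contents)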

### Using **Beautiful Soup** to parse and extract information

In [None]:
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
from bs4 import BeautifulSoup

In [None]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [None]:
# topic title tags
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})

# extract the topic titles from the title tags
topic_titles = [tag.text for tag in topic_title_tags]

In [None]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [None]:
# topic description tags
description_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': description_selector})

# extract the topic descriptions from the description tags
topic_descs = [tag.text.strip() for tag in topic_desc_tags]

In [None]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [None]:
# topic link tags: the parent of each title tag carries the href of the topic page
topic_link_tags = [tag.parent for tag in topic_title_tags]

# build absolute topic URLs from the relative hrefs of the link tags
topic_url = ["https://github.com" + tag.get("href") for tag in topic_link_tags]
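
The `href` values on the topic links are site-relative paths such as `/topics/3d`, which is why the base URL is prepended. As an alternative sketch (not what the notebook uses), `urllib.parse.urljoin` from the standard library resolves a relative path against a base and leaves an already absolute URL unchanged. The variable names here are illustrative.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

# A relative href, as found on the topic link tags
relative_href = "/topics/3d"
full_url = urljoin(BASE_URL, relative_href)

# An href that is already absolute is returned unchanged
absolute_href = "https://github.com/topics/ajax"
unchanged_url = urljoin(BASE_URL, absolute_href)

print(full_url)
print(unchanged_url)
```

Plain string concatenation works as long as every `href` is relative; `urljoin` is simply more forgiving if that assumption ever stops holding.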

In [None]:
import pandas as pd

topics_dict = {

    'title' : topic_titles,
    'description' : topic_descs,
    'url' : topic_url
}
topics_df = pd.DataFrame(topics_dict)

In [None]:
topics_df

,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet

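Building a DataFrame from a dict of lists only works when every list has the same length; pandas raises a `ValueError` otherwise. That would happen here if, for example, one selector matched fewer tags than another. The sketch below is one possible pre-check, written with the standard library only; the helper name `check_equal_lengths` and the sample data are illustrative and not part of the notebook.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the column lists differ in length (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

# Illustrative data, not scraped output
columns = {
    "title": ["3D", "Ajax"],
    "description": ["3D refers to ...", "Ajax is a technique ..."],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}

lengths = check_equal_lengths(columns)
print(lengths)
```

Calling such a check before `pd.DataFrame(...)` produces an error message that names the mismatched column, which can make a changed page layout easier to diagnose.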

### Creating CSV file with extracted information

In [None]:
topics_df.to_csv('topics.csv', index=False)

### Getting information out of a topic page




In [None]:
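def parse_star_count(stars_str):
  # convert a star label such as '81.2k' or '950' to an integer
  stars_str = stars_str.strip()
  if stars_str.endswith('k'):
    # round() guards against float truncation when scaling by 1000
    return round(float(stars_str[:-1]) * 1000)
  # labels without a 'k' suffix are plain integers
  return int(stars_str)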

In [None]:
def get_repo_info(h3_tag, star_tag):
  # the h3 tag holds two links: the first is the username, the second is the repository
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = "https://github.com" + a_tags[1].get("href")
  stars = parse_star_count(star_tag.text.strip())
  return username, repo_name, stars, repo_url

In [None]:
def get_topic_repos(topic_url):

  # download the page for this topic url
  response = requests.get(topic_url)

  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))

  # parse the HTML with Beautiful Soup
  topic_doc = BeautifulSoup(response.text,'html.parser')

  repo_tags = topic_doc.find_all('h3',{'class' : 'f3 color-fg-muted text-normal lh-condensed'})
  star_tags = topic_doc.find_all('span',{'id' : 'repo-stars-counter-star'})

  topic_repo_dict = {
      'username' : [],
      'repo_name' : [],
      'stars' : [],
      'repo_url' : []
    }

  for i in range(len(repo_tags)):
    username, repo_name, stars, repo_url = get_repo_info(repo_tags[i], star_tags[i])
    topic_repo_dict['username'].append(username)
    topic_repo_dict['repo_name'].append(repo_name)
    topic_repo_dict['stars'].append(stars)
    topic_repo_dict['repo_url'].append(repo_url)

  topic_repo_df = pd.DataFrame(topic_repo_dict)

  return topic_repo_df



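Topic titles are used directly as file names when the CSV files are saved. That works for titles such as `ASP.NET` or `Command line interface`, but a title containing a path separator or another reserved character would fail or write to an unexpected location. The sketch below shows one way to guard against that; the helper name `safe_file_name`, the character set it replaces, and the title `CI/CD` are assumptions for illustration, not part of the notebook.

```python
import re

def safe_file_name(topic_title, extension=".csv"):
    """Build a file name from a topic title (hypothetical helper)."""
    # Replace characters that are invalid in file names on common filesystems.
    # Spaces and dots are kept, so ordinary titles map to the same names as before.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", topic_title).strip()
    return cleaned + extension

# "CI/CD" is a made-up title that contains a path separator
sample_titles = ["3D", "Command line interface", "ASP.NET", "CI/CD"]
names = [safe_file_name(title) for title in sample_titles]
print(names)
```

Because only reserved characters are replaced, titles without them produce the same file names as simple concatenation with `'.csv'`.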


In [None]:
import os

In [None]:
# for each topic, skip it if its csv file already exists;
# otherwise get the information about the top repositories
# for that topic and save it to a csv file
# having the name of the topic

for index, row in topics_df.iterrows():

  topic_name = row['title']
  file_name = topic_name + '.csv'

  # check for the file first, so existing topics are not downloaded again
  if os.path.exists(file_name):
    print("The file {} already exists....".format(file_name))
    continue

  print("Scraping topic {}".format(topic_name))
  topic_repos_df = get_topic_repos(row['url'])
  topic_repos_df.to_csv(file_name, index=False)

The file 3D.csv already exists....
The file Ajax.csv already exists....
The file Algorithm.csv already exists....
The file Amp.csv already exists....
The file Android.csv already exists....
The file Angular.csv already exists....
The file Ansible.csv already exists....
The file API.csv already exists....
The file Arduino.csv already exists....
The file ASP.NET.csv already exists....
The file Atom.csv already exists....
The file Awesome Lists.csv already exists....
The file Amazon Web Services.csv already exists....
The file Azure.csv already exists....
The file Babel.csv already exists....
The file Bash.csv already exists....
The file Bitcoin.csv already exists....
The file Bootstrap.csv already exists....
The file Bot.csv already exists....
The file C.csv already exists....
The file Chrome.csv already exists....
The file Chrome extension.csv already exists....
The file Command line interface.csv already exists....
The file Clojure.csv already exists....
The file Code quality.csv alrea