
# Top Repositories for GitHub Topics



## 1. Pick a website and describe your objective
- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy.


### Project Outline

- Page picked for scraping: https://github.com/topics
- The GitHub Topics page gives us a list of topics.
- For each topic, we will get the topic title, topic page URL and topic description.
- For each topic, we will get the top 25 repositories from the topic page.
- For each repository, we will take the repository name, username, stars and repository URL.
- For each topic, we will create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
flutter,flutter,122000,https://github.com/flutter/flutter
```
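The target format above can be sketched with Python's standard `csv` module. This is only an illustration of the layout: the helper name `repos_to_csv` is hypothetical, and the sample row is the one from the outline rather than scraped data.

```python
import csv
import io

def repos_to_csv(rows):
    """Render (repo_name, username, stars, repo_url) tuples as CSV text."""
    buffer = io.StringIO()
    # Use '\n' line endings so the text matches the format shown in the outline
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['Repo Name', 'Username', 'Stars', 'Repo URL'])
    writer.writerows(rows)
    return buffer.getvalue()

# Sample row taken from the project outline
sample = [('flutter', 'flutter', 122000, 'https://github.com/flutter/flutter')]
csv_text = repos_to_csv(sample)
print(csv_text)
```

Writing through `csv.writer` rather than joining strings by hand means any value that contains a comma is quoted automatically.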






## 2. Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
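As a rough sketch of the last step, a download helper might look like the following. The names `topic_url` and `download_topic_page`, the `pages` folder and the `fetch` parameter are assumptions made for illustration; `fetch` defaults to `requests.get` and exists only so the function can be tried without network access.

```python
import os

def topic_url(topic_slug, base_url='https://github.com/topics'):
    """Build the topic page URL for a slug such as '3d'."""
    return '{}/{}'.format(base_url, topic_slug)

def download_topic_page(topic_slug, folder='pages', fetch=None):
    """Download one topic page and save it as <folder>/<slug>.html."""
    if fetch is None:
        import requests  # imported lazily so `fetch` can be swapped out
        fetch = requests.get
    url = topic_url(topic_slug)
    response = fetch(url)
    # Stop early if the page could not be loaded
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, topic_slug + '.html')
    # Write as UTF-8 so non-ASCII characters in the page are preserved
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return path
```

Calling `download_topic_page('3d')` would then save the page for the `3d` topic under `pages/3d.html`.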


In [1]:
!pip install requests --upgrade --quiet

ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.

In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response.status_code

200

In [6]:
len(response.text)

128360

In [7]:
page_content = response.text

In [8]:
page_content[:250]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://'

In [9]:
# Save the page locally; UTF-8 avoids encoding errors on non-ASCII characters
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_content)

## 3. Use Beautiful Soup to parse and extract information.

- Parse and explore the structure of downloaded web pages using Beautiful soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.



In [10]:
!pip install beautifulsoup4 --upgrade --quiet


In [11]:
from bs4 import BeautifulSoup

In [12]:
doc = BeautifulSoup(page_content, 'html.parser')

In [13]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p',{'class' : selection_class})

In [14]:
len(topic_title_tags)

30
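One detail worth knowing about the lookup above: when the class value passed to `find_all` contains several space-separated classes, Beautiful Soup compares it against the exact string of the `class` attribute, so a tag with the same classes in a different order is not matched. The sketch below uses a small made-up HTML snippet (not the real GitHub page) to compare this with a CSS selector, which does not depend on class order.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second <p> has the same classes in a different order
html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="mt-1 f3 lh-condensed mb-0 Link--primary">Ajax</p>
<p class="f5 mb-0 mt-1">A description</p>
'''
sample_doc = BeautifulSoup(html, 'html.parser')

# Exact match on the full class attribute string
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
# CSS selector: every listed class must be present, in any order
by_css = sample_doc.select('p.f3.lh-condensed.Link--primary')

exact_titles = [tag.text for tag in exact]
css_titles = [tag.text for tag in by_css]
print(exact_titles, css_titles)
```

Either approach still depends on GitHub's current class names, so the selectors need to be rechecked whenever the page markup changes.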

In [15]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [16]:
desc_selector = 'f5 color-text-secondary mb-0 mt-1'
topic_desc_tags = doc.find_all('p',{'class':desc_selector})
              

In [17]:
topic_desc_tags[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>, <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>, <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>, <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>, <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [18]:
link_selector = 'd-flex no-underline'
topic_link_tags = doc.find_all('a',{'class':link_selector})

In [19]:
len(topic_link_tags)

30

In [20]:
"https://github.com"+topic_link_tags[0]['href']

'https://github.com/topics/3d'
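String concatenation works here because every `href` on the page is a path starting with `/`. A slightly more defensive option, sketched below, is `urllib.parse.urljoin` from the standard library, which also leaves an already absolute URL unchanged. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

def absolute_url(href, base=base_url):
    """Resolve a relative or absolute href against the site's base URL."""
    return urljoin(base, href)

print(absolute_url('/topics/3d'))
print(absolute_url('https://github.com/topics/3d'))
```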

In [21]:
topic_title_tags[0].text

'3D'

In [22]:
topic_desc_tags[0].text

'\n              3D modeling is the process of virtually developing the surface and structure of a 3D object.\n            '

In [23]:
topic_desc_tags[0].text.strip()

'3D modeling is the process of virtually developing the surface and structure of a 3D object.'

In [24]:
topic_titles = []

for tag in topic_title_tags:
  topic_titles.append(tag.text)

topic_titles



['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [25]:
topic_descs = []

for tag in topic_desc_tags:
  topic_descs.append(tag.text.strip())

topic_descs[:5]


['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [26]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
  topic_urls.append(base_url + tag['href'])

topic_urls


['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [27]:
!pip uninstall pandas --quiet


Proceed (y/n)? y


In [28]:
!pip install pandas==1.1.5 --quiet

ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.1 which is incompatible.

In [29]:
import pandas as pd

In [30]:
topic_dict = {
    'title' : topic_titles,
    'description' : topic_descs,
    'url' : topic_urls
}
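`pd.DataFrame` raises a `ValueError` when the lists in such a dictionary have different lengths, which can happen if one of the selectors misses a tag. A minimal sanity check, assuming a hypothetical helper named `check_equal_lengths`, could run before building the DataFrame:

```python
def check_equal_lengths(columns):
    """Return each column's length, or raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Small illustrative dictionary, not the scraped data
example = {
    'title': ['3D', 'Ajax'],
    'description': ['3D modeling...', 'Ajax is a technique...'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
print(check_equal_lengths(example))
```

The error message lists the length of every column, which points directly at the selector that returned too few or too many tags.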

In [31]:
topics_df = pd.DataFrame(topic_dict)

In [32]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [33]:
topics_df.to_csv('topics.csv', index=False)

## Getting information from a topic page

In [34]:
topic_page_url = topic_urls[0]

In [35]:
topic_page_url

'https://github.com/topics/3d'

In [36]:
response = requests.get(topic_page_url)

In [37]:
response.status_code

200

In [38]:
len(response.text)

613474

In [39]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [40]:
h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
repo_tags = topic_doc.find_all('h1',{'class': h1_selection_class})

In [41]:
len(repo_tags)

30

In [42]:
repo_tags[0] # repo_tags contain username, repo_name and repo_url

<h1 class="f3 color-text-secondary text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904

In [43]:
a_tags = repo_tags[0].find_all('a')

In [44]:
a_tags[0].text.strip() # username

'mrdoob'

In [45]:
a_tags[1].text.strip() # repo_name

'three.js'

In [46]:
repo_url = base_url + a_tags[1]['href'] # repo_url
repo_url

'https://github.com/mrdoob/three.js'

In [47]:
star_tags = topic_doc.find_all('a', {'class' : 'social-count float-none'})

In [48]:
len(star_tags)

30

In [49]:
star_tags[0].text.strip()

'71.9k'

In [50]:
# Convert a star count string such as '71.9k' into an integer (71900)
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    # A trailing 'k' means thousands
    return int(float(stars_str[:-1]) * 1000)
  return int(stars_str)

In [51]:
parse_star_count(star_tags[0].text.strip()) # stars

71900
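The function above covers the `k` suffix seen on this page. The variant below is a sketch under two assumptions that the scraped page does not confirm: that counts may also carry an `m` suffix for millions, and that they may contain thousands separators. It uses `Decimal` so the multiplication is not affected by floating-point rounding. The name `parse_star_count_safe` is hypothetical.

```python
from decimal import Decimal

# Assumed suffixes; only 'k' is actually observed in the scraped page
MULTIPLIERS = {'k': 1000, 'm': 1000000}

def parse_star_count_safe(stars_str):
    """Convert strings such as '71.9k', '1,234' or '980' to an integer."""
    text = stars_str.strip().lower().replace(',', '')
    if text and text[-1] in MULTIPLIERS:
        # Decimal keeps '71.9' exact, so 71.9 * 1000 is exactly 71900
        return int(Decimal(text[:-1]) * MULTIPLIERS[text[-1]])
    return int(text)

print(parse_star_count_safe('71.9k'))
```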

In [60]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for successful response
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def get_repo_info(h1_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    stars = parse_star_count(star_tag.text.strip())
    base_url = 'https://github.com'
    repo_url = base_url + a_tags[1]['href']
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get h1 tags containing repo title, repo URL and username
    h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h1', {'class': h1_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('a', {'class': 'social-count float-none'})
    # Get repo info
    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

In [61]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 71900, 'https://github.com/mrdoob/three.js')

In [62]:
url4 = topic_urls[4]

In [63]:
url4

'https://github.com/topics/android'

In [64]:
topic4_doc = get_topic_page(url4)

In [None]:
topic4_repos = get_topic_repos(topic4_doc)


In [None]:
topic4_repos 

In [None]:
import os
def scrape_topic(topic_url, path):
  # Skip topics whose CSV file was already created by an earlier run
  if os.path.exists(path):
    print('The file {} already exists. Skipping...'.format(path))
    return
  topic_df = get_topic_repos(get_topic_page(topic_url))
  topic_df.to_csv(path, index=None)

Write a single function to:
1. Get the list of topics from the topics page.
2. Get the list of top repos from the individual topic pages.
3. For each topic, create a CSV of the top repos for the topic.

In [None]:
# 1. Functions to get the list of topics from the topics page

def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    link_selector = 'd-flex no-underline'
    topic_link_tags = doc.find_all('a', {'class': link_selector})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topics_url))
    # Parse the downloaded page before extracting the columns
    doc = BeautifulSoup(response.text, 'html.parser')
    topic_dict = {'title': get_topic_titles(doc), 'description': get_topic_descs(doc), 'url': get_topic_urls(doc)}
    return pd.DataFrame(topic_dict)




In [None]:
import os
help(os.makedirs)

## 4. Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.



In [None]:
def scrape_topics_repos():
  print('Scraping list of topics')
  topics_df = scrape_topics()

  os.makedirs('data',exist_ok=True)

  for index,row in topics_df.iterrows():
    print('Scraping top repositories for "{}"'.format(row['title']))
    scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

In [None]:
scrape_topics_repos()
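The CSV path is built directly from the topic title, which works for the titles listed earlier but would fail for a title containing a path separator such as `/`. A minimal sketch of a safer path builder follows; the helper name `topic_csv_path` and the choice of `_` as the replacement character are assumptions.

```python
import re

def topic_csv_path(title, folder='data'):
    """Build a CSV path from a topic title, replacing unsafe characters."""
    # Keep letters, digits, '.', '_', '+' and '-'; collapse anything else to '_'
    safe = re.sub(r'[^A-Za-z0-9._+-]+', '_', title.strip())
    return '{}/{}.csv'.format(folder, safe)

print(topic_csv_path('Awesome Lists'))
print(topic_csv_path('C++'))
```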