# Top Repositories For Github Topics


## Pick a website and describe your objective
- Browse through different sites and pick on to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Juptyer notebook. Use the "New" button above.

## Project Outline
- we are going to scrap  https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page 
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:
```
Repo Name,User Name,Stars,Repo Url
three.js,mrdoob,94100,https://github.com/mrdoob/three.js
libgdx,libgdx,21800,https://github.com/libgdx/libgdx
```

## Use the requests library to download web pages

In [4]:
!pip install requests --upgrade --quiet

In [5]:
import requests

In [6]:
topics_url = 'https://github.com/topics'

In [7]:
response = requests.get(topics_url)

In [8]:
response.status_code

200

In [9]:
page_contents = response.text

In [10]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="false">\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-983b05c0927a.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" me

In [11]:
with open('webpage.html', 'w') as f:
    f.write(page_contents)

# Use Beautiful Soup to parse and extract information


In [12]:
!pip install beautifulsoup4 --upgrade --quiet

In [13]:
from bs4 import BeautifulSoup

In [14]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [15]:
selection_class= 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags= doc.find_all ('p' ,{'class': selection_class})


In [16]:
len(topic_title_tags)

30

In [17]:
topic_title_tags[:5]


[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [18]:
desc_selector ='f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all ('p', {'class' : desc_selector})

In [19]:
len(topic_desc_tags)

30

In [20]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [21]:
topic_title_tag0 = topic_title_tags [0]

In [22]:
div_tag = topic_title_tag0.parent


In [23]:
topic_link_tags = doc.find_all( 'a', {'class':'no-underline flex-grow-0'})

In [24]:
topic0_url="http://github.com" +topic_link_tags[0]['href']
print (topic0_url)

http://github.com/topics/3d


In [25]:
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [26]:
topic_descs =[]
for tag in  topic_desc_tags:
    topic_descs.append(tag.text.strip())
#     strip is used to eliminate the spaces before and after the line 
topic_descs

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source platform for building electronic devices.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azure 

In [27]:
topic_url = []
base_url='http://github.com'
for tag in topic_link_tags:
    topic_url.append(base_url+tag['href'])
topic_url
    

['http://github.com/topics/3d',
 'http://github.com/topics/ajax',
 'http://github.com/topics/algorithm',
 'http://github.com/topics/amphp',
 'http://github.com/topics/android',
 'http://github.com/topics/angular',
 'http://github.com/topics/ansible',
 'http://github.com/topics/api',
 'http://github.com/topics/arduino',
 'http://github.com/topics/aspnet',
 'http://github.com/topics/atom',
 'http://github.com/topics/awesome',
 'http://github.com/topics/aws',
 'http://github.com/topics/azure',
 'http://github.com/topics/babel',
 'http://github.com/topics/bash',
 'http://github.com/topics/bitcoin',
 'http://github.com/topics/bootstrap',
 'http://github.com/topics/bot',
 'http://github.com/topics/c',
 'http://github.com/topics/chrome',
 'http://github.com/topics/chrome-extension',
 'http://github.com/topics/cli',
 'http://github.com/topics/clojure',
 'http://github.com/topics/code-quality',
 'http://github.com/topics/code-review',
 'http://github.com/topics/compiler',
 'http://github.com/to

In [28]:
!pip install pandas --quiet

In [29]:
import pandas as pd

In [30]:
topics_dict = {
   'title' : topic_titles,
    'description' : topic_descs,
    'url': topic_url
}

In [31]:
topics_df = pd.DataFrame(topics_dict)

In [32]:
topics_df

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,http://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,http://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,http://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,http://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,http://github.com/topics/android
5,Angular,Angular is an open source web application plat...,http://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,http://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,http://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,http://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,http://github.com/topics/aspnet


## Create CSV file(s) with the extracted information


In [33]:
topics_df.to_csv ('topics.csv', index=None )

## Getting information out of a topic page 

In [77]:
topic_page_url = topic_url [1]

In [78]:
topic_page_url

'http://github.com/topics/ajax'

In [79]:
response.status_code

200

In [80]:
len(response.text)

166088

In [89]:
!pip install beautifulsoup4 --upgrade --quiet

In [93]:
from bs4 import BeautifulSoup

In [94]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [95]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})

In [92]:
repo_tags


[]

In [86]:
len(repo_tags)

0

In [85]:
repo_tags

[]