# Top Repositories for GitHub


#### 1] Import the required libraries

In [132]:
import jovian
import pandas as pd
 

### STEP 1: Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

#### Project Outline
- We're going to scrape from https://github.com/topics
- We're going to get a list of topics. For each topic we'll get topic title, topic page URL and topic description.
- For each topic we'll get the top 25 repositories in the topic from the topic page
- For each repo we'll grab the repo name, username, stars and repo URL.
- For each topic we'll create a CSV file in the following format:

```
Repo_name, Username, Stars, Repo URL
```

### STEP 2: Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
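
The last bullet above can be sketched as a small helper. This is a minimal sketch, not the notebook's own code: `topic_page_url`, `download_page` and the `get` parameter are hypothetical names. `get` is an assumption added only so the function can be tried without network access; when omitted it falls back to `requests.get`.

```python
def topic_page_url(topic_slug, base_url='https://github.com'):
    """Build a topic page URL such as https://github.com/topics/3d."""
    return '{}/topics/{}'.format(base_url, topic_slug)


def download_page(url, path, get=None):
    """Download `url` and save its HTML to `path` (hypothetical helper).

    `get` is an assumption for testability: any callable returning an
    object with `status_code` and `text` attributes, like requests.get.
    """
    if get is None:
        import requests  # imported lazily so the helper loads without it
        get = requests.get
    response = get(url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    # Write as UTF-8 so non-ASCII characters survive on every platform
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return path
```

A call such as `download_page(topic_page_url('3d'), '3d.html')` would then save one topic page per file.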

In [133]:
!pip install requests --upgrade --quiet

In [134]:
import requests

In [135]:
topics_url= 'https://github.com/topics'


In [136]:
response= requests.get(topics_url)

In [137]:
response.status_code

200

In [138]:
len(response.text)

128505

In [139]:
page_contents= response.text

In [140]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

### STEP 3: Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
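
The third bullet (extracting into lists and dictionaries) might look roughly like the sketch below. It runs on a hand-written HTML snippet rather than the live page. The class names are the ones used in this notebook, but the nesting of the tags and the `extract_topics` helper are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the real page layout may differ (assumption).
SAMPLE_HTML = """
<a class="d-flex no-underline" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-text-secondary mb-0 mt-1">
    Ajax is a technique for creating interactive web applications.
  </p>
</a>
"""


def extract_topics(doc, base_url='https://github.com'):
    """Return one dict per topic with its title, description and URL."""
    titles = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    descs = doc.find_all('p', {'class': 'f5 color-text-secondary mb-0 mt-1'})
    links = doc.find_all('a', {'class': 'd-flex no-underline'})
    # zip pairs the three tag lists position by position
    return [
        {
            'Title': title.text.strip(),
            'Description': desc.text.strip(),
            'URL': base_url + link['href'],
        }
        for title, desc, link in zip(titles, descs, links)
    ]


topics = extract_topics(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(topics)
```

A list of dictionaries like this can be passed straight to `pd.DataFrame`.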


In [141]:
!pip install beautifulsoup4 --upgrade --quiet

In [142]:
from bs4 import BeautifulSoup

In [143]:
doc = BeautifulSoup(page_contents, 'html.parser') 

In [144]:
topic_selection_class= 'f3 lh-condensed mb-0 mt-1 Link--primary'

In [145]:
topic_title_tags=doc.find_all('p',{'class': topic_selection_class})


In [146]:
len(topic_title_tags)

30

In [147]:
topic_title_tags[:3]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>]

In [148]:
description_selection_class='f5 color-text-secondary mb-0 mt-1'
topic_description_tag= doc.find_all('p', {'class': description_selection_class })

In [149]:
len(topic_description_tag)

30

In [150]:
topic_title_tags0=topic_title_tags[0]


In [151]:
link_selector='d-flex no-underline'
topic_link_tags= doc.find_all('a', {'class': link_selector})

In [152]:
len(topic_link_tags)

30

In [67]:
topic_link_tags[0]['href']

'/topics/3d'

In [68]:
topic_0_url='https://github.com' + topic_link_tags[0]['href']

In [69]:
topic_0_url

'https://github.com/topics/3d'
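
String concatenation works here because every `href` starts with `/`. As an alternative sketch, the standard library's `urljoin` resolves relative links against a base URL and leaves absolute links untouched. The `absolute_url` helper is a hypothetical name, not part of the notebook.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'


def absolute_url(href):
    """Resolve a relative href (e.g. '/topics/3d') against base_url."""
    return urljoin(base_url, href)


print(absolute_url('/topics/3d'))  # https://github.com/topics/3d
```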

In [70]:
topic_title_tags[0].text

'3D'

In [72]:
#Final Step for topic_tags

In [120]:
topic_titles=[]
for tag in topic_title_tags:
    topic_titles.append(tag.text)
    

In [121]:
topic_titles[0:4]

['3D', 'Ajax', 'Algorithm', 'Amp']

In [94]:
topic_descs=[]

In [95]:
for desc in topic_description_tag:
    topic_descs.append(desc.text.strip())

In [96]:
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [89]:
# The raw description text contains extra spaces and \n characters, so we use .strip()
# to remove the surrounding whitespace and get the data into the desired format

In [None]:
#Final step for topic_description

In [99]:
topic_descs=[]
for desc in topic_description_tag:
    topic_descs.append(desc.text.strip())

In [101]:
topic_descs

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'A

In [102]:
#Final step for topic_url

In [109]:
topic_urls=[]
base_url='https://github.com'
for url in topic_link_tags:
    topic_urls.append(base_url+url['href'].strip())

In [110]:
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

### STEP 4: Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
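
The last bullet above, reading the CSV back to verify it, can be sketched as a round trip. This is a minimal sketch under two assumptions: it uses an in-memory buffer instead of a file on disk, and it uses two sample rows taken from the output shown later in this notebook. Passing `index=False` keeps the row index out of the CSV, so no extra unnamed column appears when reading it back.

```python
import io

import pandas as pd

rows = {
    'Repo_name': ['three.js', 'libgdx'],
    'Username': ['mrdoob', 'libgdx'],
    'Stars': [71800, 18500],
    'Repo URL': ['https://github.com/mrdoob/three.js',
                 'https://github.com/libgdx/libgdx'],
}
df = pd.DataFrame(rows)

# Write the CSV into a text buffer, then rewind and read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_read = pd.read_csv(buffer)

# The columns and the row count should survive the round trip
assert list(df_read.columns) == list(df.columns)
assert len(df_read) == len(df)
```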

In [111]:
# First we create a dictionary that maps column names to the lists built above, so we can turn it into a dataframe

In [125]:
columns_dict={
    'Title': topic_titles,
    'Description': topic_descs,
    'URLs':topic_urls
}

In [126]:
len(topic_titles)

30

In [127]:
df=pd.DataFrame(columns_dict)

In [128]:
df.head()

| | Title | Description | URLs |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


In [129]:
# Now we save the dataframe to a CSV file


In [131]:
df.to_csv('FinalOutput.csv', index=False)

# Project Add-On: Getting information out of a topic page
This part of the project is optional.

In [153]:
topics_page_url=topic_urls[0]

In [154]:
topics_page_url

'https://github.com/topics/3d'

In [156]:
response=requests.get(topics_page_url)

In [157]:
response.status_code

200

In [159]:
len(response.text)

613526

In [160]:
topic_doc= BeautifulSoup(response.text , 'html.parser')

In [164]:
repo_tags= topic_doc.find_all('h1', {'class':'f3 color-text-secondary text-normal lh-condensed'} )

In [165]:
len(repo_tags)

30

In [167]:
a_tags= repo_tags[0].find_all('a')

In [168]:
len(a_tags)

2

In [171]:
a_tags[0].text.strip()

'mrdoob'

In [172]:
a_tags[1].text.strip()

'three.js'

In [174]:
repo_url= base_url + a_tags[1]['href']

In [175]:
repo_url

'https://github.com/mrdoob/three.js'

In [205]:
star_tags=topic_doc.find_all('a', {'class': 'social-count float-none'})

In [206]:
len(star_tags)

30

In [207]:
star_tags[0].text.strip()

'71.8k'

In [182]:
def parse_star_count(stars_str):
    stars_str=stars_str.strip()
    if stars_str[-1]=='k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)

In [209]:
parse_star_count(star_tags[0].text.strip())

71800
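
`parse_star_count` handles plain integers and the `k` suffix. The sketch below is a hypothetical variant that also accepts an `m` suffix and comma separators. Both of those input formats are assumptions, since the notebook only shows values such as `'71.8k'`. It rounds instead of truncating, which guards against float products that land just below a whole number.

```python
def parse_star_count_ext(stars_str):
    """Parse strings such as '950', '1,234', '71.8k' or '1.2m' into an int.

    Hypothetical variant of parse_star_count; the 'm' suffix and the
    comma handling are assumptions, not formats confirmed by the source.
    """
    multipliers = {'k': 1000, 'm': 1000000}
    text = stars_str.strip().lower().replace(',', '')
    if text and text[-1] in multipliers:
        # round() avoids truncating a product that is slightly too small
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)


print(parse_star_count_ext('71.8k'))  # 71800
```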

In [260]:
def get_repo_info(h1_tag, star_tag):
    a_tags= h1_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url= base_url + a_tags[1]['href']
    stars= parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars

In [261]:
get_repo_info(repo_tags[1], star_tags[1])

('libgdx', 'libgdx', 'https://github.com/libgdx/libgdx', 18500)

In [262]:
topic_data={
    'username' : [],
    'repo_name': [],
    'repo_url': [],
    'stars': []
}
for i in range(len(repo_tags)):
    a=get_repo_info(repo_tags[i], star_tags[i])
    topic_data['username'].append(a[0])
    topic_data['repo_name'].append(a[1])
    topic_data['repo_url'].append(a[2])
    topic_data['stars'].append(a[3])

In [263]:
# Next we get the repos for any topic page. To simplify this, we wrap the steps above in a function


In [294]:
def get_topic_repos(topic_url):
    # Download the topic page
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {} (status code {})'.format(topic_url, response.status_code))
    # Parse the page and find the repo heading tags and the star-count tags
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    repo_tags = topic_doc.find_all('h1', {'class': 'f3 color-text-secondary text-normal lh-condensed'})
    star_tags = topic_doc.find_all('a', {'class': 'social-count float-none'})

    # Collect one list per column
    topic_data = {
        'username': [],
        'repo_name': [],
        'repo_url': [],
        'stars': []
    }
    for i in range(len(repo_tags)):
        a = get_repo_info(repo_tags[i], star_tags[i])
        topic_data['username'].append(a[0])
        topic_data['repo_name'].append(a[1])
        topic_data['repo_url'].append(a[2])
        topic_data['stars'].append(a[3])
    return pd.DataFrame(topic_data)
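
The loop in `get_topic_repos` pairs `repo_tags` and `star_tags` by index, which assumes both lists have the same length. A hedged alternative is to build the records with `zip`, for example `[get_repo_info(h1, star) for h1, star in zip(repo_tags, star_tags)]`, which stops at the shorter list instead of raising an `IndexError`. The sketch below shows the second half of that idea on plain tuples. `records_to_columns` is a hypothetical helper, and the two sample records come from the output shown in this notebook.

```python
FIELDS = ('username', 'repo_name', 'repo_url', 'stars')


def records_to_columns(records, fields=FIELDS):
    """Turn a list of tuples into a dict mapping field name to a column list."""
    columns = {field: [] for field in fields}
    for record in records:
        # Pair each value with its field name by position
        for field, value in zip(fields, record):
            columns[field].append(value)
    return columns


# Tuples in the same order that get_repo_info returns them
records = [
    ('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 71800),
    ('libgdx', 'libgdx', 'https://github.com/libgdx/libgdx', 18500),
]
topic_data = records_to_columns(records)
print(topic_data['stars'])  # [71800, 18500]
```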

In [295]:
topic_repos_df=pd.DataFrame(topic_data)

In [296]:
topic_repos_df.head()

| | username | repo_name | repo_url | stars |
|---|---|---|---|---|
| 0 | mrdoob | three.js | https://github.com/mrdoob/three.js | 71800 |
| 1 | libgdx | libgdx | https://github.com/libgdx/libgdx | 18500 |
| 2 | BabylonJS | Babylon.js | https://github.com/BabylonJS/Babylon.js | 14200 |
| 3 | pmndrs | react-three-fiber | https://github.com/pmndrs/react-three-fiber | 13700 |
| 4 | aframevr | aframe | https://github.com/aframevr/aframe | 12800 |


In [303]:
url5= topic_urls[5]

In [304]:
topic5_repos=get_topic_repos(url5)

In [305]:
topic5_repos


| | username | repo_name | repo_url | stars |
|---|---|---|---|---|
| 0 | justjavac | free-programming-books-zh_CN | https://github.com/justjavac/free-programming-... | 80700 |
| 1 | angular | angular | https://github.com/angular/angular | 74000 |
| 2 | storybookjs | storybook | https://github.com/storybookjs/storybook | 62700 |
| 3 | ionic-team | ionic-framework | https://github.com/ionic-team/ionic-framework | 44800 |
| 4 | leonardomso | 33-js-concepts | https://github.com/leonardomso/33-js-concepts | 40700 |
| 5 | prettier | prettier | https://github.com/prettier/prettier | 39900 |
| 6 | SheetJS | sheetjs | https://github.com/SheetJS/sheetjs | 25900 |
| 7 | angular | angular-cli | https://github.com/angular/angular-cli | 24600 |
| 8 | angular | components | https://github.com/angular/components | 21600 |
| 9 | NativeScript | NativeScript | https://github.com/NativeScript/NativeScript | 20200 |


### STEP 5: Document and share your work

- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your Jovian profile.
- (Optional) Write a blog post about your project and share it online.
