## Top Repositories for a Topic

> **AIM**: Find the top 30 GitHub repositories for a given topic and store their details in a CSV file. Example topics: machine learning, python.

### Breakdown of the aim:
>1. Get the web page in HTML format
>2. Parse the web page and find the useful information
>3. Extract the information
>4. Write it to a CSV file

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url='https://github.com/topics/machine-learning'
response= requests.get(url)
page_contents=response.text
page_contents

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKcX

In [3]:
doc=BeautifulSoup(page_contents, 'html.parser')
doc


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-92c7d381038e.css" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-d4a90c367f0c.css" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+sol

Now that we have an HTML document,
>scroll through the source document and find how the details are organised.

We can see that each `article` tag holds the details of one repository.

In [4]:
article_tags= doc.find_all('article', {'class':'border rounded color-shadow-small color-bg-subtle my-4'})
len(article_tags)

30

There are 30 repositories listed on the page.
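
Matching on the full class string `'border rounded color-shadow-small color-bg-subtle my-4'` depends on the page keeping exactly those classes in exactly that order. A sketch of a looser match with a CSS selector, run on a made-up HTML fragment rather than the real page; which classes are stable enough to select on is an assumption:

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; the real page has many more classes.
sample_html = """
<article class="border rounded my-4"><h3><a href="/octo/one">one</a></h3></article>
<article class="my-4 rounded border extra"><h3><a href="/octo/two">two</a></h3></article>
<article class="sidebar"><h3><a href="/octo/three">three</a></h3></article>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# 'article.border.rounded' matches any <article> that has both classes,
# regardless of their order or of any extra classes on the tag.
cards = sample_doc.select('article.border.rounded')
names = [card.h3.a.text for card in cards]
print(names)
```

The selector keeps working if extra classes are added to the tag, at the cost of possibly matching more elements than intended.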

We need to extract the following information from each tag:

1. Repository name
2. Owner's username
3. Number of stars
4. Repository link

You will notice that the owner's username, the repository name, and the link are all part of an `h3` tag, while the star count sits in a separate `span`.

The `h3` has two `a` tags inside it: the first contains the owner's username and the second contains the repository name. The `href` of the second tag also holds the relative path of the repository. Let's extract this information from the `a` tags.

In [5]:
h3=article_tags[0].find('h3')
a=h3.find_all('a')

In [6]:
username=a[0].text.strip()
username

'tensorflow'

In [7]:
repositoryname=a[1].text.strip()
repositoryname

'tensorflow'

In [8]:
a[1]['href']

'/tensorflow/tensorflow'

In [9]:
Repositorylink='https://github.com'+a[1]['href'].strip()
Repositorylink

'https://github.com/tensorflow/tensorflow'
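
String concatenation works here because the `href` is a relative path that starts with `/`. A sketch of the same step with `urllib.parse.urljoin` from the standard library, which also copes with an `href` that is already absolute; `absolute_url` and `BASE_URL` are hypothetical names:

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'


def absolute_url(href, base=BASE_URL):
    """Resolve a link from the page against the site's base URL."""
    return urljoin(base, href.strip())


repo_url = absolute_url('/tensorflow/tensorflow')
already_absolute = absolute_url('https://github.com/pytorch/pytorch')
print(repo_url)          # https://github.com/tensorflow/tensorflow
print(already_absolute)  # https://github.com/pytorch/pytorch
```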

In [10]:
star=article_tags[0].find('span', class_="Counter js-social-count")
star=star.text.strip()

In [11]:
print('Repository name:', repositoryname)
print("Owner's username:", username)
print('Stars:', star)
print('Repository URL:', Repositorylink)

Repository name: tensorflow
Owner's username: tensorflow
Stars: 167k
Repository URL: https://github.com/tensorflow/tensorflow
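
The star count is scraped as display text such as `167k`, which cannot be sorted or compared numerically. A minimal sketch of converting it to an integer, assuming the suffixes `k` and `m` mean thousands and millions (only `k` appears in this notebook's output); `parse_star_count` is a hypothetical helper:

```python
def parse_star_count(text):
    """Convert display text such as '67.3k' into an approximate integer."""
    # 'm' is an assumption; only 'k' is seen in the scraped output.
    multipliers = {'k': 1_000, 'm': 1_000_000}
    cleaned = text.strip().lower().replace(',', '')
    suffix = cleaned[-1:]
    if suffix in multipliers:
        # round() guards against floating-point error in the multiplication
        return int(round(float(cleaned[:-1]) * multipliers[suffix]))
    return int(cleaned)


print(parse_star_count('167k'))   # 167000
print(parse_star_count('67.3k'))  # 67300
print(parse_star_count('1,234'))  # 1234
```

The result is only as precise as the abbreviated text on the page, so `167k` becomes 167000 rather than the exact count.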


### Define functions:
Let's define functions to make the code reusable and easier to read.

In [12]:
def get_topic_page(x):
    topic_url = 'https://github.com/topics/'+x
    response= requests.get(topic_url)
    page_contents=response.text
    doc=BeautifulSoup(page_contents, 'html.parser')
    return doc

In [13]:
doc=get_topic_page('machine-learning')
doc.title.text

'machine-learning · GitHub Topics · GitHub'

In [14]:
def parse_repostory(article_tag):
    dic={}
    h3=article_tag.find('h3')
    a=h3.find_all('a')
    dic['Owner\'s username']=a[0].text.strip()

    dic['Repository name']=a[1].text.strip()


    star=article_tag.find('span', class_="Counter js-social-count")
    dic['Stars']=star.text.strip()
    
    Repositorylink='https://github.com'+a[1]['href'].strip()
    dic['Repository URL']=Repositorylink
    
    return dic

In [15]:
parse_repostory(article_tags[0])

{"Owner's username": 'tensorflow',
 'Repository name': 'tensorflow',
 'Stars': '167k',
 'Repository URL': 'https://github.com/tensorflow/tensorflow'}

We can use a list comprehension to parse all the `article` tags in one go.

In [16]:
top_repositories=[parse_repostory(tag) for tag in article_tags]

In [17]:
top_repositories[:5]

[{"Owner's username": 'tensorflow',
  'Repository name': 'tensorflow',
  'Stars': '167k',
  'Repository URL': 'https://github.com/tensorflow/tensorflow'},
 {"Owner's username": 'huggingface',
  'Repository name': 'transformers',
  'Stars': '67.3k',
  'Repository URL': 'https://github.com/huggingface/transformers'},
 {"Owner's username": 'pytorch',
  'Repository name': 'pytorch',
  'Stars': '57.5k',
  'Repository URL': 'https://github.com/pytorch/pytorch'},
 {"Owner's username": 'keras-team',
  'Repository name': 'keras',
  'Stars': '55.7k',
  'Repository URL': 'https://github.com/keras-team/keras'},
 {"Owner's username": 'scikit-learn',
  'Repository name': 'scikit-learn',
  'Stars': '50.8k',
  'Repository URL': 'https://github.com/scikit-learn/scikit-learn'}]

In [18]:
def get_top_repositories(doc):
    article_tags = doc.find_all('article', class_='border rounded color-shadow-small color-bg-subtle my-4')
    topic_repos = [parse_repostory(tag) for tag in article_tags]
    return topic_repos

### Universal functions
We can now use the functions we've defined to get the top repositories for any topic.
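
The aim lists topics such as "machine learning", while the topic URL used earlier is `machine-learning`. A small sketch of turning a free-form topic name into that form, assuming topic slugs are lowercase with hyphens between words (inferred from the URLs in this notebook, not from GitHub documentation); `topic_to_slug` and `build_topic_url` are hypothetical names:

```python
import re

TOPICS_BASE = 'https://github.com/topics/'


def topic_to_slug(topic):
    """Lowercase a topic name and join its words with hyphens."""
    slug = topic.strip().lower()
    # Collapse runs of whitespace or underscores into a single hyphen
    return re.sub(r'[\s_]+', '-', slug)


def build_topic_url(topic):
    """Build the topic page URL in the same way get_topic_page does."""
    return TOPICS_BASE + topic_to_slug(topic)


slug = topic_to_slug('  Machine Learning ')
url = build_topic_url('Artificial Intelligence')
print(slug)  # machine-learning
print(url)   # https://github.com/topics/artificial-intelligence
```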

In [19]:
topic_page_ml = get_topic_page('machine-learning')
top_repos_ml = get_top_repositories(topic_page_ml)
len(top_repos_ml)

30

In [20]:
topic_page_ai = get_topic_page('artificial-intelligence')
top_repos_ai = get_top_repositories(topic_page_ai)
top_repos_ai[:5]

[{"Owner's username": 'ZuzooVn',
  'Repository name': 'machine-learning-for-software-engineers',
  'Stars': '26k',
  'Repository URL': 'https://github.com/ZuzooVn/machine-learning-for-software-engineers'},
 {"Owner's username": 'explosion',
  'Repository name': 'spaCy',
  'Stars': '23.8k',
  'Repository URL': 'https://github.com/explosion/spaCy'},
 {"Owner's username": 'AMAI-GmbH',
  'Repository name': 'AI-Expert-Roadmap',
  'Stars': '21k',
  'Repository URL': 'https://github.com/AMAI-GmbH/AI-Expert-Roadmap'},
 {"Owner's username": 'Lightning-AI',
  'Repository name': 'lightning',
  'Stars': '19.4k',
  'Repository URL': 'https://github.com/Lightning-AI/lightning'},
 {"Owner's username": 'facebookresearch',
  'Repository name': 'fairseq',
  'Stars': '18.6k',
  'Repository URL': 'https://github.com/facebookresearch/fairseq'}]

In [21]:
get_top_repositories(get_topic_page('python'))[:5]

[{"Owner's username": 'donnemartin',
  'Repository name': 'system-design-primer',
  'Stars': '190k',
  'Repository URL': 'https://github.com/donnemartin/system-design-primer'},
 {"Owner's username": 'tensorflow',
  'Repository name': 'tensorflow',
  'Stars': '167k',
  'Repository URL': 'https://github.com/tensorflow/tensorflow'},
 {"Owner's username": 'CyC2018',
  'Repository name': 'CS-Notes',
  'Stars': '155k',
  'Repository URL': 'https://github.com/CyC2018/CS-Notes'},
 {"Owner's username": 'TheAlgorithms',
  'Repository name': 'Python',
  'Stars': '141k',
  'Repository URL': 'https://github.com/TheAlgorithms/Python'},
 {"Owner's username": 'vinta',
  'Repository name': 'awesome-python',
  'Stars': '135k',
  'Repository URL': 'https://github.com/vinta/awesome-python'}]

## Writing information to CSV files

Let's create a function which takes a list of dictionaries and writes them to a CSV file.


In [22]:
def write_into_csv(items,path):
    '''Write a list of dictionaries to a CSV file'''
    with open(path, 'w') as f:
        # Use the keys of the first dictionary as the column headers
        header=list(items[0].keys())
        f.write(','.join(header)+'\n')
        for item in items:
            # Collect the values in header order so the columns line up
            values=[item.get(x) for x in header]
            f.write(','.join(values)+'\n')


In [23]:
write_into_csv(top_repositories,'machine-learning.csv')

In [24]:
with open('machine-learning.csv', 'r') as f:
    print(f.read())

Owner's username,Repository name,Stars,Repository URL
tensorflow,tensorflow,167k,https://github.com/tensorflow/tensorflow
huggingface,transformers,67.3k,https://github.com/huggingface/transformers
pytorch,pytorch,57.5k,https://github.com/pytorch/pytorch
keras-team,keras,55.7k,https://github.com/keras-team/keras
scikit-learn,scikit-learn,50.8k,https://github.com/scikit-learn/scikit-learn
tesseract-ocr,tesseract,46k,https://github.com/tesseract-ocr/tesseract
ageitgey,face_recognition,45.2k,https://github.com/ageitgey/face_recognition
aymericdamien,TensorFlow-Examples,42.1k,https://github.com/aymericdamien/TensorFlow-Examples
deepfakes,faceswap,41.8k,https://github.com/deepfakes/faceswap
JuliaLang,julia,39.9k,https://github.com/JuliaLang/julia
Developer-Y,cs-video-courses,39.8k,https://github.com/Developer-Y/cs-video-courses
microsoft,ML-For-Beginners,39.8k,https://github.com/microsoft/ML-For-Beginners
binhnguyennus,awesome-scalability,39.6k,https://github.com/binhnguyennus/awesome-scalab
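
Joining values with `','` works for the rows above, but a value that itself contains a comma would shift the columns. A sketch of the same step using the standard library's `csv.DictWriter`, which quotes such values; the function name `write_dicts_to_csv` and the sample row are made up for illustration:

```python
import csv
import os
import tempfile


def write_dicts_to_csv(items, path):
    """Write a list of dictionaries to a CSV file, quoting where needed."""
    if not items:
        raise ValueError('items must contain at least one dictionary')
    fieldnames = list(items[0].keys())
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(items)


# Made-up row whose name contains a comma
rows = [{'Repository name': 'demo, with comma', 'Stars': '1.2k'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'demo.csv')
    write_dicts_to_csv(rows, csv_path)
    with open(csv_path, newline='', encoding='utf-8') as f:
        read_back = list(csv.DictReader(f))

print(read_back)
```

Reading the file back yields the same dictionaries, which the manual join would not guarantee for this row.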

Perfect! We've created a CSV file containing the repositories for the topic `machine-learning`. Now let's put everything together to solve the original problem.
> **AIM**: Find the top 30 GitHub repositories for a given topic and store their details in a CSV file. Example topics: machine learning, python.

In [25]:
import requests
from bs4 import BeautifulSoup

def scrape_repos(topic, path=None):
    """Get the top repositories for a topic and write them to a CSV file"""
    # Default to a file named after the topic, e.g. 'python.csv'
    if path is None:
        path = topic + '.csv'
    topic_doc = get_topic_page(topic)
    topic_repository = get_top_repositories(topic_doc)
    write_into_csv(topic_repository, path)
    print('Top Repsitories for topic {} witten to file {}'.format(topic,path))
    return path

def get_topic_page(x):
    ''' Get the web page containing the top repositories for a topic and return a BeautifulSoup document'''
    topic_url = 'https://github.com/topics/'+x
    response= requests.get(topic_url)
    page_contents=response.text
    doc=BeautifulSoup(page_contents, 'html.parser')
    return doc


def get_top_repositories(doc):
    '''Parse the top repositories and return a list of dictionaries with info about repositories'''
    article_tags = doc.find_all('article', class_='border rounded color-shadow-small color-bg-subtle my-4')
    topic_repos = [parse_repostory(tag) for tag in article_tags]
    return topic_repos

def parse_repostory(article_tag):
    ''' Parse the information about a repository and get the required values'''
    dic={}
    h3=article_tag.find('h3')
    a=h3.find_all('a')
    dic['Owner\'s username']=a[0].text.strip()

    dic['Repository name']=a[1].text.strip()


    star=article_tag.find('span', class_="Counter js-social-count")
    dic['Stars']=star.text.strip()
    
    Repositorylink='https://github.com'+a[1]['href'].strip()
    dic['Repository URL']=Repositorylink
    
    return dic

def write_into_csv(items,path):
    '''Write a list of dictionaries to a CSV file'''
    with open(path, 'w') as f:
        # Use the keys of the first dictionary as the column headers
        header=list(items[0].keys())
        f.write(','.join(header)+'\n')
        for item in items:
            # Collect the values in header order so the columns line up
            values=[item.get(x) for x in header]
            f.write(','.join(values)+'\n')


In [26]:
scrape_repos('python')

Top Repsitories for topic python witten to file python.csv


'python.csv'

In [27]:
import pandas as pd

In [28]:
pd.read_csv('python.csv')

| | Owner's username | Repository name | Stars | Repository URL |
|---|---|---|---|---|
| 0 | donnemartin | system-design-primer | 190k | https://github.com/donnemartin/system-design-p... |
| 1 | tensorflow | tensorflow | 167k | https://github.com/tensorflow/tensorflow |
| 2 | CyC2018 | CS-Notes | 155k | https://github.com/CyC2018/CS-Notes |
| 3 | TheAlgorithms | Python | 141k | https://github.com/TheAlgorithms/Python |
| 4 | vinta | awesome-python | 135k | https://github.com/vinta/awesome-python |
| 5 | justjavac | free-programming-books-zh_CN | 94.5k | https://github.com/justjavac/free-programming-... |
| 6 | practical-tutorials | project-based-learning | 73.2k | https://github.com/practical-tutorials/project... |
| 7 | nvbn | thefuck | 72.4k | https://github.com/nvbn/thefuck |
| 8 | huggingface | transformers | 67.3k | https://github.com/huggingface/transformers |
| 9 | django | django | 65.2k | https://github.com/django/django |


In [29]:
scrape_repos('data-analysis')

Top Repsitories for topic data-analysis witten to file data-analysis.csv


'data-analysis.csv'

In [30]:
pd.read_csv('data-analysis.csv')

| | Owner's username | Repository name | Stars | Repository URL |
|---|---|---|---|---|
| 0 | scikit-learn | scikit-learn | 50.8k | https://github.com/scikit-learn/scikit-learn |
| 1 | apache | superset | 47.1k | https://github.com/apache/superset |
| 2 | pandas-dev | pandas | 34.6k | https://github.com/pandas-dev/pandas |
| 3 | metabase | metabase | 29.2k | https://github.com/metabase/metabase |
| 4 | AMAI-GmbH | AI-Expert-Roadmap | 21k | https://github.com/AMAI-GmbH/AI-Expert-Roadmap |
| 5 | streamlit | streamlit | 20k | https://github.com/streamlit/streamlit |
| 6 | gchq | CyberChef | 17k | https://github.com/gchq/CyberChef |
| 7 | microsoft | Data-Science-For-Beginners | 15.5k | https://github.com/microsoft/Data-Science-For-... |
| 8 | allinurl | goaccess | 14.9k | https://github.com/allinurl/goaccess |
| 9 | ml-tooling | best-of-ml-python | 11.2k | https://github.com/ml-tooling/best-of-ml-python |


In [31]:
scrape_repos('artificial-intelligence')

Top Repsitories for topic artificial-intelligence witten to file artificial-intelligence.csv


'artificial-intelligence.csv'

In [32]:
pd.read_csv('artificial-intelligence.csv')

| | Owner's username | Repository name | Stars | Repository URL |
|---|---|---|---|---|
| 0 | ZuzooVn | machine-learning-for-software-engineers | 26k | https://github.com/ZuzooVn/machine-learning-fo... |
| 1 | explosion | spaCy | 23.8k | https://github.com/explosion/spaCy |
| 2 | AMAI-GmbH | AI-Expert-Roadmap | 21k | https://github.com/AMAI-GmbH/AI-Expert-Roadmap |
| 3 | Lightning-AI | lightning | 19.4k | https://github.com/Lightning-AI/lightning |
| 4 | facebookresearch | fairseq | 18.6k | https://github.com/facebookresearch/fairseq |
| 5 | amark | gun | 16.2k | https://github.com/amark/gun |
| 6 | Tencent | ncnn | 15k | https://github.com/Tencent/ncnn |
| 7 | kailashahirwar | cheatsheets-ai | 14.3k | https://github.com/kailashahirwar/cheatsheets-ai |
| 8 | OpenBB-finance | OpenBBTerminal | 14.1k | https://github.com/OpenBB-finance/OpenBBTerminal |
| 9 | microsoft | recommenders | 13.6k | https://github.com/microsoft/recommenders |


In [33]:
pd.read_csv('machine-learning.csv')

| | Owner's username | Repository name | Stars | Repository URL |
|---|---|---|---|---|
| 0 | tensorflow | tensorflow | 167k | https://github.com/tensorflow/tensorflow |
| 1 | huggingface | transformers | 67.3k | https://github.com/huggingface/transformers |
| 2 | pytorch | pytorch | 57.5k | https://github.com/pytorch/pytorch |
| 3 | keras-team | keras | 55.7k | https://github.com/keras-team/keras |
| 4 | scikit-learn | scikit-learn | 50.8k | https://github.com/scikit-learn/scikit-learn |
| 5 | tesseract-ocr | tesseract | 46k | https://github.com/tesseract-ocr/tesseract |
| 6 | ageitgey | face_recognition | 45.2k | https://github.com/ageitgey/face_recognition |
| 7 | aymericdamien | TensorFlow-Examples | 42.1k | https://github.com/aymericdamien/TensorFlow-Ex... |
| 8 | deepfakes | faceswap | 41.8k | https://github.com/deepfakes/faceswap |
| 9 | JuliaLang | julia | 39.9k | https://github.com/JuliaLang/julia |
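
The cells above call the scraper once per topic by hand. A sketch of doing the same in a loop with a pause between requests, assuming a scrape function that takes a topic and returns the CSV path, as `scrape_repos` does. `scrape_topics`, `delay_seconds`, and the offline stand-in `fake_scrape` are hypothetical names:

```python
import time


def scrape_topics(topics, scrape, delay_seconds=1.0):
    """Call scrape(topic) for each topic, pausing between calls.

    Returns a dictionary mapping each topic to the path scrape returned.
    """
    paths = {}
    for index, topic in enumerate(topics):
        if index > 0:
            # Wait between requests to avoid hitting the site in a burst
            time.sleep(delay_seconds)
        paths[topic] = scrape(topic)
    return paths


def fake_scrape(topic):
    """Offline stand-in for scrape_repos; only builds the file name."""
    return topic + '.csv'


result = scrape_topics(['python', 'data-analysis'], fake_scrape, delay_seconds=0)
print(result)
```

In the notebook itself, passing `scrape_repos` instead of `fake_scrape` would write one CSV file per topic.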
