# Get Repository Information via Web-Scraping

---

## Do the imports!

In [1]:
import requests
from bs4 import BeautifulSoup

---

## Assess the problem

We are currently in the `NSS-Data-Analytics-Cohort-2` cohort. 
In GitHub, the repositories for this cohort can be found at https://github.com/NSS-Data-Analytics-Cohort-2.

## The Goal
1. Access every repository URL associated with the `NSS-Data-Analytics-Cohort-2` organization.
2. For every repository, collect:
    * Number of Commits (for the master branch)
    * Number of branches

---

## Goal 1
Accessing the repository URLs.
To do this, we need to go through every page of the organization's repository listing and grab the repository links from each one.

The first question is: how many pages should we traverse?
To answer this, we need to locate the pagination element on the main page and see how many page options it offers.
Specifically, we need to figure out what the last page number is!

In [2]:
# setting up the base URLs for the project
GITHUB_URL = 'https://github.com'
ORG_URL = f'{GITHUB_URL}/NSS-Data-Analytics-Cohort-2'

In [3]:
# pull data from the page
resp = requests.get(ORG_URL)

# We want the status code to be 200 (OK)
print(resp.status_code)

200


In [4]:
# Establish the "soup" object so we can traverse the website content
soup = BeautifulSoup(resp.text, 'html.parser')

---

### Goal 1.A
Figure out what the last page is!
To do this, we need to access the pagination elements.

In [5]:
# Now that we've created the soup.. Let's find the pagination elements.
# Check the repositories page to find out what to look for!
len(soup.findAll('div', {'class': 'pagination'}))

2

In [7]:
# `soup.findAll` returns a list of elements, and we found two matches.
# Let's look at the second one (index 1) first.
pagination_elems = soup.findAll('div', {'class': 'pagination'})
pagination_elem = pagination_elems[1]

print(pagination_elem.prettify())

<div aria-label="Pagination" class="pagination" role="navigation">
 <span class="previous_page disabled">
  Previous
 </span>
 <a class="next_page" href="/NSS-Data-Analytics-Cohort-2?page=2" rel="next">
  Next
 </a>
</div>



In [8]:
# The second match only has the Previous/Next links.
# Let's extract the first result (index 0) instead.
pagination_elems = soup.findAll('div', {'class': 'pagination'})
pagination_elem = pagination_elems[0]

print(pagination_elem.prettify())

<div aria-label="Pagination" class="pagination" role="navigation">
 <span class="previous_page disabled">
  Previous
 </span>
 <em class="current" data-total-pages="11">
  1
 </em>
 <a aria-label="Page 2" href="/NSS-Data-Analytics-Cohort-2?page=2" rel="next">
  2
 </a>
 <a aria-label="Page 3" href="/NSS-Data-Analytics-Cohort-2?page=3">
  3
 </a>
 <a aria-label="Page 4" href="/NSS-Data-Analytics-Cohort-2?page=4">
  4
 </a>
 <a aria-label="Page 5" href="/NSS-Data-Analytics-Cohort-2?page=5">
  5
 </a>
 <span class="gap">
  …
 </span>
 <a aria-label="Page 10" href="/NSS-Data-Analytics-Cohort-2?page=10">
  10
 </a>
 <a aria-label="Page 11" href="/NSS-Data-Analytics-Cohort-2?page=11">
  11
 </a>
 <a class="next_page" href="/NSS-Data-Analytics-Cohort-2?page=2" rel="next">
  Next
 </a>
</div>



In [9]:
# now, let's get all of the links from the pagination element.
links = pagination_elem.findAll('a')
links

[<a aria-label="Page 2" href="/NSS-Data-Analytics-Cohort-2?page=2" rel="next">2</a>,
 <a aria-label="Page 3" href="/NSS-Data-Analytics-Cohort-2?page=3">3</a>,
 <a aria-label="Page 4" href="/NSS-Data-Analytics-Cohort-2?page=4">4</a>,
 <a aria-label="Page 5" href="/NSS-Data-Analytics-Cohort-2?page=5">5</a>,
 <a aria-label="Page 10" href="/NSS-Data-Analytics-Cohort-2?page=10">10</a>,
 <a aria-label="Page 11" href="/NSS-Data-Analytics-Cohort-2?page=11">11</a>,
 <a class="next_page" href="/NSS-Data-Analytics-Cohort-2?page=2" rel="next">Next</a>]

In [10]:
# In our case, we are interested in the link that represents the final page.
# The last link is "Next", so the final page is the second from the end.
final_page = links[-2]
final_page

<a aria-label="Page 11" href="/NSS-Data-Analytics-Cohort-2?page=11">11</a>

In [11]:
# Now, we can get the text, and convert to an integer.
final_page_number = int(final_page.text)
final_page_number

11
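The pagination markup printed above also carries the page count directly: the `em` element with the class `current` has a `data-total-pages` attribute. Here is a minimal sketch of reading it, assuming the markup keeps that shape. The `sample_html` string is a trimmed copy of the output above standing in for the live page, and the names `sample_soup` and `total_pages` are made up for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the pagination markup shown above (stand-in for resp.text)
sample_html = """
<div aria-label="Pagination" class="pagination" role="navigation">
 <span class="previous_page disabled">Previous</span>
 <em class="current" data-total-pages="11">1</em>
 <a aria-label="Page 2" href="/NSS-Data-Analytics-Cohort-2?page=2" rel="next">2</a>
 <a aria-label="Page 11" href="/NSS-Data-Analytics-Cohort-2?page=11">11</a>
 <a class="next_page" href="/NSS-Data-Analytics-Cohort-2?page=2" rel="next">Next</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Find the element marking the current page and read its attribute
current = sample_soup.find('em', {'class': 'current'})
total_pages = int(current['data-total-pages'])
print(total_pages)
```

This does not depend on the position of the "Next" link, but it does depend on the attribute being present, so treat it as an alternative to check against the page rather than a guaranteed replacement.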

---

### Goal 1.B
Now that we have the reference to the last page, we can use it to grab the links for all the pages.
Speaking of which.. 
Notice the structure of the links from the pagination.

Example: 
```
/NSS-Data-Analytics-Cohort-2?page=3
```

It looks like, if we want to go through all the pages, we need to structure our links like the ones in the pagination element, with the base URL in front.

So, our links should look like: 
```
https://github.com/NSS-Data-Analytics-Cohort-2?page=1
https://github.com/NSS-Data-Analytics-Cohort-2?page=2
...
https://github.com/NSS-Data-Analytics-Cohort-2?page=11
```

In [12]:
# Use a list comprehension to build the page URLs!
repo_pages = [f'{ORG_URL}?page={i+1}' for i in range(final_page_number)]
repo_pages

['https://github.com/NSS-Data-Analytics-Cohort-2?page=1',
 'https://github.com/NSS-Data-Analytics-Cohort-2?page=2',
 'https://github.com/NSS-Data-Analytics-Cohort-2?page=3',
 'https://github.com/NSS-Data-Analytics-Cohort-2?page=4',
 'https://github.com/NSS-Data-Analytics-Cohort-2?page=5',
 'https://github.com/NSS-Data-Analytics-Cohort-2?page=6',
 'https://github.com/NSS-Data-Analytics-Cohort-2?page=7',
 'https://github.com/NSS-Data-Analytics-Cohort-2?page=8',
 'https://github.com/NSS-Data-Analytics-Cohort-2?page=9',
 'https://github.com/NSS-Data-Analytics-Cohort-2?page=10',
 'https://github.com/NSS-Data-Analytics-Cohort-2?page=11']
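As a side note, the same list can be built by letting the standard library assemble the query string. This is only a sketch: the `page_url` helper is a made-up name, and `final_page_number` is hard-coded here to the value found in Goal 1.A so the snippet runs on its own.

```python
from urllib.parse import urlencode

ORG_URL = 'https://github.com/NSS-Data-Analytics-Cohort-2'
final_page_number = 11  # value found in Goal 1.A


def page_url(page):
    # urlencode turns {'page': 3} into 'page=3' and escapes values if needed
    query = urlencode({'page': page})
    return f'{ORG_URL}?{query}'


# range(1, n + 1) counts 1..n, so there is no need for i+1 inside the f-string
repo_pages = [page_url(page) for page in range(1, final_page_number + 1)]
print(repo_pages[0])
print(repo_pages[-1])
```

For a single numeric parameter the f-string in the cell above is just as good; `urlencode` starts to pay off once there are several parameters or values that need escaping.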

---

### Goal 1.C

Perfect. 
Next, we need to figure out a pattern to grab all of the repository links..

Looking at the main page (`https://github.com/NSS-Data-Analytics-Cohort-2`) we can see that all of the repository elements are included in a `div` element with the `id` equal to `org-repositories`. 
Within that `div`, there is an unordered list (`ul`) with separate list elements (`li`) containing our info. 

Let's try to access those for the first pass.

In [13]:
org_repositories = soup.find(id='org-repositories') \
    .find('ul') \
    .findAll('li')


print(f'Total repositories on page: {len(org_repositories)}')
print(org_repositories[0].prettify())

Total repositories on page: 30
<li class="public source d-block py-4 border-bottom" itemprop="owns" itemscope="itemscope" itemtype="http://schema.org/Code">
 <div class="flex-justify-between d-flex">
  <div class="flex-auto">
   <h3 class="wb-break-all">
    <a class="d-inline-block" data-hovercard-type="repository" data-hovercard-url="/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-joelelle/hovercard" href="/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-joelelle" itemprop="name codeRepository">
     web-sraping-marathons-joelelle
    </a>
   </h3>
   <p class="break-word text-gray mb-0" itemprop="description">
    web-sraping-marathons-joelelle created by GitHub Classroom
   </p>
  </div>
  <div class="flex-items-center d-none d-md-flex">
   <span aria-label="Past year of activity" class="tooltipped tooltipped-s">
    <svg height="30" width="155">
     <defs>
      <lineargradient id="gradient-262606332" x1="0" x2="0" y1="1" y2="0">
       <stop offset="10%" stop-color="#c6e48b

In [14]:
# Sweet. Now, let's grab the link for the first one.
# If we can grab that one.. We can grab the rest.
first_repo = org_repositories[0]
first_repo_a_elem = first_repo.find('a')
print(first_repo_a_elem.prettify())

<a class="d-inline-block" data-hovercard-type="repository" data-hovercard-url="/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-joelelle/hovercard" href="/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-joelelle" itemprop="name codeRepository">
 web-sraping-marathons-joelelle
</a>



In [15]:
# Now.. The link!
first_repo_a_elem.get('href')

'/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-joelelle'
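Notice the `href` is relative: it starts with `/` and has no domain. To request it later, it has to be joined with the base URL. A small sketch using `urljoin` from the standard library (the `full_url` name is just for illustration):

```python
from urllib.parse import urljoin

GITHUB_URL = 'https://github.com'
href = '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-joelelle'

# urljoin resolves the relative path against the base URL
full_url = urljoin(GITHUB_URL, href)
print(full_url)
```

Plain string formatting such as `f'{GITHUB_URL}{href}'` gives the same result here; `urljoin` just handles the slashes for you.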

---

### Goal 1.D

Ok. Now we have the ability to get all the links!
We just need to stitch all of our code together.

In [16]:
def get_page_soup(url):
    print(f'Fetching website data for: {url}')
    resp = requests.get(url)
    # Fail loudly on a 4xx/5xx response instead of parsing an error page
    resp.raise_for_status()
    return BeautifulSoup(resp.text, 'html.parser')


def get_org_repositories(soup):
    print('\tGetting org repositories')
    return soup.find(id='org-repositories') \
        .find('ul') \
        .findAll('li')


def extract_org_repository_links(org_repositories):
    print('\tGetting links from repositories')
    return [repo.find('a').get('href') for repo in org_repositories]

In [17]:
all_links = []

for url in repo_pages:
    soup = get_page_soup(url)
    org_repositories = get_org_repositories(soup)
    links = extract_org_repository_links(org_repositories)
    
    all_links.extend(links)

Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2?page=1
	Getting org repositories
	Getting links from repositories
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2?page=2
	Getting org repositories
	Getting links from repositories
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2?page=3
	Getting org repositories
	Getting links from repositories
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2?page=4
	Getting org repositories
	Getting links from repositories
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2?page=5
	Getting org repositories
	Getting links from repositories
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2?page=6
	Getting org repositories
	Getting links from repositories
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2?page=7
	Getting org repositories
	Getting links from repositories
Fetching website dat

In [18]:
# insert mic-drop here
all_links

['/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-joelelle',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-unewsome',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-mtylerrobbins',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-abrunlinger',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-olsont12',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-taylorperkins',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-brandesmoore',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-fdumessa',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-landrybutler',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-Kristiangarrett',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-BrantIvey',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-st-decker',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-Didymustheblind',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-DavidMellow',
 '/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-gradyrobbin

In [20]:
print(f'Total number of links: {len(all_links)}')

Total number of links: 315
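The loop above fires one request per page, back to back. If you want to be gentler on the server, one option is to pause between requests and drop any duplicate links at the end. The sketch below is an assumption-laden variation, not what the notebook ran: `collect_links`, `delay_seconds`, and the `fake_pages` data are all invented for the example, and the fetcher is passed in as a function so the snippet runs without network access.

```python
import time


def collect_links(page_urls, fetch_links, delay_seconds=1.0):
    """Call fetch_links(url) for each URL, pausing between requests."""
    collected = []
    for index, url in enumerate(page_urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        collected.extend(fetch_links(url))
    # dict.fromkeys keeps the first occurrence of each link, in order
    return list(dict.fromkeys(collected))


# Stand-in data so the sketch runs offline (hypothetical pages and links)
fake_pages = {
    'page-1': ['/org/repo-a', '/org/repo-b'],
    'page-2': ['/org/repo-b', '/org/repo-c'],
}
demo_links = collect_links(fake_pages, fake_pages.get, delay_seconds=0)
print(demo_links)
```

In the notebook, the fetcher could be `lambda url: extract_org_repository_links(get_org_repositories(get_page_soup(url)))`, with `repo_pages` as the list of URLs.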


---

### Goal 2.A

Next, we need to pull some information per repository.

> For every repository, collect:
> * Number of commits (for the master branch)
> * Number of branches

Like before, let's start with a single repository.

In [21]:
repo_link = all_links[0]
repo_link

'/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-joelelle'

In [22]:
# Just like before!
resp = requests.get(f'{GITHUB_URL}{repo_link}')
print(f'Status code is: {resp.status_code}')

soup = BeautifulSoup(resp.text, 'html.parser')

Status code is: 200


---

### Goal 2.B

Now that we have the soup.. 
Let's focus on locating the appropriate elements. 

Looking at the website we notice that the elements of interest are inside list elements (`li`) within an unordered list (`ul`) having the class `numbers-summary`.
Within those list elements are `span` elements with a `class` equal to `num`. 
Those are what we want.

In [23]:
numbers_summary = soup.find('ul', {'class': 'numbers-summary'})
print(numbers_summary.prettify())

<ul class="numbers-summary">
 <li class="commits">
  <a data-pjax="" href="/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-joelelle/commits/master">
   <svg aria-hidden="true" class="octicon octicon-git-commit" height="16" version="1.1" viewbox="0 0 14 16" width="14">
    <path d="M10.86 7c-.45-1.72-2-3-3.86-3-1.86 0-3.41 1.28-3.86 3H0v2h3.14c.45 1.72 2 3 3.86 3 1.86 0 3.41-1.28 3.86-3H14V7h-3.14zM7 10.2c-1.22 0-2.2-.98-2.2-2.2 0-1.22.98-2.2 2.2-2.2 1.22 0 2.2.98 2.2 2.2 0 1.22-.98 2.2-2.2 2.2z" fill-rule="evenodd">
    </path>
   </svg>
   <span class="num text-emphasized">
    1
   </span>
   commit
  </a>
 </li>
 <li>
  <a data-pjax="" href="/NSS-Data-Analytics-Cohort-2/web-sraping-marathons-joelelle/branches">
   <svg aria-hidden="true" class="octicon octicon-git-branch" height="16" version="1.1" viewbox="0 0 10 16" width="10">
    <path d="M10 5c0-1.11-.89-2-2-2a1.993 1.993 0 00-1 3.72v.3c-.02.52-.23.98-.63 1.38-.4.4-.86.61-1.38.63-.83.02-1.48.16-2 .45V4.72a1.993 1.993 0 00-1-3

In [24]:
metrics = []

for li in numbers_summary.findAll('li'):
    
    # grabbing the text, and doing a little cleanup
    metric = li.find('span', {'class': 'num'}).text \
        .replace('\n', '') \
        .strip()
    
    # going ahead and casting it to an integer if something was found!
    if metric:
        metric = int(metric)
    else:
        metric = None
    
    metrics.append(metric)

In [25]:
metrics

[1, 1, 0, 0, 1]

In [26]:
commits, branches, *_ = metrics
commits, branches

(1, 1)

---

### Goal 2.C

Now, we put it all together! 
Just like before. 
This time, let's keep track of the repo link.

In [27]:
def get_numbers_summary(soup):
    return soup.find('ul', {'class': 'numbers-summary'})


def clean_metric(metric):
    metric = metric \
        .replace('\n', '') \
        .strip()
    
    if metric:
        metric = int(metric)
    else:
        metric = None

    return metric


def get_metrics(numbers_summary):
    metrics = []

    for li in numbers_summary.findAll('li'):

        # grabbing the text, and doing a little cleanup
        metric = li.find('span', {'class': 'num'}).text
        metrics.append(clean_metric(metric))
        
    return metrics

In [28]:
# Check to make sure our functions work!
numbers_summary = get_numbers_summary(soup)
commits, branches, *_ = get_metrics(numbers_summary)
commits, branches

(1, 1)
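One thing to watch: `int()` raises a `ValueError` on text like `'1,024'`. None of the outputs above contain a thousands separator, so this is an assumption about what a busier repository might show. A hedged variant of the cleanup, under the made-up name `clean_metric_with_separators`, could look like this:

```python
def clean_metric_with_separators(metric):
    """Convert scraped count text to an int, or None when the text is empty."""
    # str.split() with no arguments drops all whitespace, including newlines
    text = ''.join(metric.split())
    # Assumption: a comma, if present, is a thousands separator
    text = text.replace(',', '')
    return int(text) if text else None


print(clean_metric_with_separators('\n    1\n   '))
print(clean_metric_with_separators('1,024'))
print(clean_metric_with_separators('   '))
```

Text that is not a number at all (for example `'n/a'`) would still raise a `ValueError`, which is arguably what you want while exploring.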

In [29]:
# Let's just look at our recent project: Healthcare Bluebook (HCBB)

hcbb_links = [link for link in all_links if 'healthcare-bluebook' in link]

In [30]:
# Ok.. Functions work for one.. 
# Time to try a few of them!!
results = []

for repo_ref in hcbb_links:
    
    url = f'{GITHUB_URL}{repo_ref}'
    soup = get_page_soup(url)
    
    numbers_summary = get_numbers_summary(soup)
    commits, branches, *_, contributors = get_metrics(numbers_summary)
    
    results.append((repo_ref, commits, branches, contributors))

Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-the-unquantifiables
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-project-bluebook
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-orange-team
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-red-team
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-blue-team
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-green-team
Fetching website data for: https://github.com/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-instructors


In [31]:
results

[('/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-the-unquantifiables',
  96,
  5,
  None),
 ('/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-project-bluebook',
  61,
  3,
  None),
 ('/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-orange-team', 9, 5, None),
 ('/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-red-team', 37, 4, None),
 ('/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-blue-team', 27, 5, None),
 ('/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-green-team', 19, 3, None),
 ('/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-instructors', 7, 1, None)]
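Plain tuples work, but you have to remember which position holds which value. If that gets confusing, a named record is one option. This is only a sketch: the `RepoStats` class and its `name` property are invented for this example, and `sample_results` copies two rows from the output above so the snippet runs on its own.

```python
from typing import NamedTuple, Optional


class RepoStats(NamedTuple):
    url: str
    commits: Optional[int]
    branches: Optional[int]
    contributors: Optional[int]

    @property
    def name(self):
        # Last path segment of the repository link
        return self.url.rsplit('/', 1)[-1]


# Two rows copied from the results printed above
sample_results = [
    ('/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-orange-team', 9, 5, None),
    ('/NSS-Data-Analytics-Cohort-2/healthcare-bluebook-red-team', 37, 4, None),
]

stats = [RepoStats(*row) for row in sample_results]

# Fields are now available by name instead of by position
busiest = max(stats, key=lambda repo: repo.commits)
print(busiest.name, busiest.commits)
```

A `RepoStats` is still a tuple underneath, so indexing and unpacking keep working the same way.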

---

### Goal 3!!!

Now, of course, pandas!

In [32]:
import pandas as pd

df = pd.DataFrame(results, columns=['url', 'commits', 'branches', 'contributors'])
df

|   | url | commits | branches | contributors |
|---|-----|---------|----------|--------------|
| 0 | /NSS-Data-Analytics-Cohort-2/healthcare-bluebo... | 96 | 5 | None |
| 1 | /NSS-Data-Analytics-Cohort-2/healthcare-bluebo... | 61 | 3 | None |
| 2 | /NSS-Data-Analytics-Cohort-2/healthcare-bluebo... | 9 | 5 | None |
| 3 | /NSS-Data-Analytics-Cohort-2/healthcare-bluebo... | 37 | 4 | None |
| 4 | /NSS-Data-Analytics-Cohort-2/healthcare-bluebo... | 27 | 5 | None |
| 5 | /NSS-Data-Analytics-Cohort-2/healthcare-bluebo... | 19 | 3 | None |
| 6 | /NSS-Data-Analytics-Cohort-2/healthcare-bluebo... | 7 | 1 | None |


In [None]:
# The rest is for you to explore on your own time!