<a href="https://colab.research.google.com/github/ronaldoyw/Exploratory-Data-Analysis/blob/main/Webscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Top Repositories for GitHub Topics

##Pick a website and describe your objective
* Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
* Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
* Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

####Outline:
* Scrape https://github.com/topics
* Get a list of topics (topic title, topic page URL, topic description)
* Get top 25 repositories from the topic page for each topic
* For each repo, grab the repo name, username, stars, and repo URL
* For each topic, create a CSV file in the following format:
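
The outline does not show the target layout itself. Below is a minimal sketch of one plausible layout, assuming one row per repository with the four fields listed above. The column names (`username`, `repo_name`, `stars`, `repo_url`) and the helper `write_repos_csv` are assumptions made for illustration; the sample row uses values scraped later in this notebook.

```python
import csv
import io

# Assumed column names; the outline names the fields but not the headers.
FIELDNAMES = ['username', 'repo_name', 'stars', 'repo_url']

def write_repos_csv(rows, file_obj):
    """Write an iterable of row dicts to file_obj as CSV with a header row."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# One sample row, taken from the values extracted later in the notebook.
sample_rows = [
    {'username': 'mrdoob', 'repo_name': 'three.js', 'stars': 86100,
     'repo_url': 'https://github.com/mrdoob/three.js'},
]

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
write_repos_csv(sample_rows, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the `StringIO` buffer would write the same content to disk, one file per topic.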

---

---

##Use the requests library to download web pages
* Inspect the website's HTML source and identify the right URLs to download.
* Download and save web pages locally using the requests library.
* Create a function to automate downloading for different topics/search queries.
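
The last bullet could be approached roughly as follows. This is a sketch, not the notebook's own code: the function names `download_page` and `topic_page_url`, the `timeout` default, and the output file names are assumptions. Only `requests.get`, `raise_for_status`, and `response.text` come from the requests library.

```python
import requests

def topic_page_url(topic_slug, base_url='https://github.com'):
    """Build a topic page URL from a slug such as '3d'."""
    return f'{base_url}/topics/{topic_slug}'

def download_page(url, path, timeout=10):
    """Download url, save the HTML to path, and return the page text."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of saving an error page.
    response.raise_for_status()
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return response.text
```

A call such as `download_page(topic_page_url('3d'), '3d.html')` would then fetch and store one topic page. Nothing is downloaded until the function is called.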

In [2]:
!pip install requests --upgrade --quiet

[?25l[K     |█████▏                          | 10 kB 27.7 MB/s eta 0:00:01[K     |██████████▍                     | 20 kB 33.4 MB/s eta 0:00:01[K     |███████████████▋                | 30 kB 38.3 MB/s eta 0:00:01[K     |████████████████████▉           | 40 kB 37.3 MB/s eta 0:00:01[K     |██████████████████████████      | 51 kB 39.2 MB/s eta 0:00:01[K     |███████████████████████████████▎| 61 kB 42.7 MB/s eta 0:00:01[K     |████████████████████████████████| 62 kB 1.5 MB/s 
[?25h

In [3]:
import requests

In [4]:
topics_url = 'https://github.com/topics'

In [5]:
response = requests.get(topics_url)

In [6]:
response.status_code

200

In [7]:
len(response.text)

151836

In [8]:
page_contents = response.text

In [9]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [10]:
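# Write with an explicit encoding so non-ASCII characters save correctly on any platform.
with open('webpage.html', 'w', encoding='utf-8') as f:
  f.write(page_contents)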

##Use Beautiful Soup to parse and extract information
* Parse and explore the structure of downloaded web pages using Beautiful Soup.
* Use the right properties and methods to extract the required information.
* Create functions to extract from the page into lists and dictionaries.
* (Optional) Use a REST API to acquire additional information if required.

In [11]:
!pip install beautifulsoup4 --upgrade --quiet

[?25l[K     |██▋                             | 10 kB 27.5 MB/s eta 0:00:01[K     |█████▏                          | 20 kB 38.4 MB/s eta 0:00:01[K     |███████▊                        | 30 kB 43.2 MB/s eta 0:00:01[K     |██████████▎                     | 40 kB 29.4 MB/s eta 0:00:01[K     |████████████▉                   | 51 kB 32.5 MB/s eta 0:00:01[K     |███████████████▍                | 61 kB 35.9 MB/s eta 0:00:01[K     |██████████████████              | 71 kB 36.3 MB/s eta 0:00:01[K     |████████████████████▌           | 81 kB 38.6 MB/s eta 0:00:01[K     |███████████████████████         | 92 kB 40.9 MB/s eta 0:00:01[K     |█████████████████████████▋      | 102 kB 40.0 MB/s eta 0:00:01[K     |████████████████████████████▏   | 112 kB 40.0 MB/s eta 0:00:01[K     |██████████████████████████████▊ | 122 kB 40.0 MB/s eta 0:00:01[K     |████████████████████████████████| 128 kB 40.0 MB/s 
[?25h

In [12]:
from bs4 import BeautifulSoup

In [13]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [14]:
doc


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" media="all" rel="stylesheet"><link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com/assets/dark_dimmed-0adfa28f0e68.c

In [17]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', class_=selection_class)

In [18]:
len(topic_title_tags)

30

In [19]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [20]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', class_=desc_selector)


In [21]:
len(topic_desc_tags)

30

In [22]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [23]:
topic_link_tags = doc.find_all('a', class_='no-underline flex-1 d-flex flex-column')
len(topic_link_tags)


30

In [24]:
topic_link_tags[0]['href']

'/topics/3d'

In [25]:
topic0_url = 'https://github.com'+ topic_link_tags[0]['href']
topic0_url

'https://github.com/topics/3d'
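
String concatenation works here because every `href` on the page is a root-relative path. As an alternative sketch, the standard library's `urljoin` builds the same URL and also copes with an `href` that is already absolute. The second `href` below is a made-up value for illustration.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Root-relative href, as found on the topics page.
relative_url = urljoin(base_url, '/topics/3d')

# Hypothetical absolute href: urljoin returns it unchanged.
absolute_url = urljoin(base_url, 'https://example.com/page')

print(relative_url)
print(absolute_url)
```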

In [26]:
topic_titles = []

for tag in topic_title_tags:
  topic_titles.append(tag.text)
topic_titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [27]:
topic_descriptions = []

for tag in topic_desc_tags:
  topic_descriptions.append(tag.text.strip())
topic_descriptions[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [28]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
  topic_urls.append(base_url + tag['href'])
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [29]:
import pandas as pd


In [30]:
topics_dict = {
    'title': topic_titles, 'description': topic_descriptions, 'url': topic_urls
    }

In [31]:
topics_df = pd.DataFrame(topics_dict)
topics_df

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet
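
The extraction cells in this section can be gathered into a single function, as the checklist at the top of the section suggests. The sketch below assumes the same class strings used in the cells above. The function name `parse_topics` and the HTML fragment are illustrative; the fragment is hand-written to mimic one topic card and is not copied from GitHub.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://github.com'

def parse_topics(doc):
    """Collect topic titles, descriptions, and URLs from a parsed topics page."""
    title_tags = doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')
    desc_tags = doc.find_all('p', class_='f5 color-fg-muted mb-0 mt-1')
    link_tags = doc.find_all('a', class_='no-underline flex-1 d-flex flex-column')
    return {
        'title': [tag.text.strip() for tag in title_tags],
        'description': [tag.text.strip() for tag in desc_tags],
        'url': [BASE_URL + tag['href'] for tag in link_tags],
    }

# Hand-written fragment standing in for a downloaded page (an assumption).
sample_html = '''
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D modeling is the process of virtually developing the surface and structure of a 3D object.
  </p>
</a>
'''

topics = parse_topics(BeautifulSoup(sample_html, 'html.parser'))
print(topics)
```

The returned dictionary has the same shape as `topics_dict`, so it can be passed straight to `pd.DataFrame`.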


##Create CSV file(s) with the extracted information
* Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
* Execute the function with different inputs to create a dataset of CSV files.
* Verify the information in the CSV files by reading them back using Pandas.

In [32]:
topics_df.to_csv('githubtopics.csv', index=False)
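
The checklist also asks for the CSV to be verified by reading it back. A minimal sketch of that check follows, assuming a small stand-in DataFrame (`sample_df`) and a temporary directory so the notebook's real output file is left untouched. Both are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in data with the same columns as topics_df.
sample_df = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'description': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'githubtopics.csv')
    sample_df.to_csv(csv_path, index=False)
    # Read the file back while the temporary directory still exists.
    check_df = pd.read_csv(csv_path)

# Compare headers and cell values between the original and the round trip.
same_columns = list(check_df.columns) == list(sample_df.columns)
same_rows = check_df.values.tolist() == sample_df.values.tolist()
print(same_columns, same_rows)
```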

##Getting information out of a topic page

In [33]:
topic_page_urls = topic_urls[0]
topic_page_urls

'https://github.com/topics/3d'

In [34]:
response = requests.get(topic_page_urls)

In [35]:
response.status_code

200

In [36]:
len(response.text)

450021

In [37]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [38]:
repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')

In [39]:
len(repo_tags)


20

In [40]:
a_tags = repo_tags[0].find_all('a')
a_tags

[<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [41]:
a_tags[0].text.strip()

'mrdoob'

In [42]:
a_tags[1].text.strip()

'three.js'

In [43]:
a_tags[1]['href']

'/mrdoob/three.js'

In [44]:
repo_url = base_url + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [45]:
star_tags = topic_doc.find_all('span', class_='Counter js-social-count')
len(star_tags)

20

In [46]:
star_tags[0].text.strip()

'86.1k'

In [47]:
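def parse_star_count(stars_str):
  # Convert a star label such as '86.1k' or '950' to an integer.
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    # round() rather than int() so floating-point error cannot truncate the result.
    return round(float(stars_str[:-1]) * 1000)
  return int(stars_str)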

parse_star_count(star_tags[0].text.strip())

86100
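
The per-repository steps in this section can be combined into a pair of helper functions. The sketch below assumes the class strings used in the cells above and carries its own copy of a star-count parser so that it runs standalone. The names `get_repo_info` and `parse_topic_repos`, the dictionary keys, and the hand-written HTML fragment are assumptions for illustration.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://github.com'

def parse_star_count(stars_str):
    """Convert a star label such as '86.1k' or '950' to an integer."""
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)

def get_repo_info(h3_tag, star_tag):
    """Return (username, repo_name, stars, repo_url) for one repository card."""
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = BASE_URL + a_tags[1]['href']
    stars = parse_star_count(star_tag.text)
    return username, repo_name, stars, repo_url

def parse_topic_repos(topic_doc):
    """Collect repository details from a parsed topic page into a dict of lists."""
    repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')
    star_tags = topic_doc.find_all('span', class_='Counter js-social-count')
    repos = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}
    # Pair each heading with its star counter; assumes both lists are in page order.
    for h3_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(h3_tag, star_tag)
        repos['username'].append(username)
        repos['repo_name'].append(repo_name)
        repos['stars'].append(stars)
        repos['repo_url'].append(repo_url)
    return repos

# Hand-written fragment standing in for a downloaded topic page (an assumption).
sample_html = '''
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/mrdoob">mrdoob</a> /
  <a class="text-bold wb-break-word" href="/mrdoob/three.js">three.js</a>
</h3>
<span class="Counter js-social-count">86.1k</span>
'''

repos = parse_topic_repos(BeautifulSoup(sample_html, 'html.parser'))
print(repos)
```

The resulting dictionary can be turned into a DataFrame and saved with `to_csv`, one file per topic.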

##Document and share your work
* Add proper headings and documentation in your Jupyter notebook.
* Publish your Jupyter notebook to your Jovian profile.
* (Optional) Write a blog post about your project and share it online.