<a href="https://colab.research.google.com/github/ronaldoyw/Exploratory-Data-Analysis/blob/main/Webscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Topic Repositories for Github Repos

##Pick a website and describe your objective
* Browse through different sites and pick on to scrape. Check the "Project Ideas" section for inspiration.
* Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
* Summarize your project idea and outline your strategy in a Juptyer notebook. Use the "New" button above.

####Outline:
* scrape https://github.com/topics
* Get a list of topics (topic title, topic page URL, topic description)
* Get top 25 repositories from the topic page for each topic
* For each repo, grab the repo name, username, starts and repo URL
* For each topic, create a CSV file in the following format:

---

---

##Use the requests library to download web pages
* Inspect the website's HTML source and identify the right URLs to download.
* Download and save web pages locally using the requests library.
* Create a function to automate downloading for different topics/search queries.

In [57]:
!pip install requests --upgrade --quiet

In [58]:
import requests

In [59]:
topics_url = 'https://github.com/topics'

In [60]:
response = requests.get(topics_url)

In [61]:
response.status_code

200

In [62]:
len(response.text)

151942

In [63]:
page_contents =response.text

In [64]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [65]:
with open('webpage.html', 'w') as f:
  f.write(page_contents)

##Use Beautiful Soup to parse and extract information
* Parse and explore the structure of downloaded web pages using Beautiful soup.
* Use the right properties and methods to extract the required information.
* Create functions to extract from the page into lists and dictionaries.
* (Optional) Use a REST API to acquire additional information if required.

In [66]:
!pip install beautifulsoup4 --upgrade --quiet

In [67]:
from bs4 import BeautifulSoup

In [68]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [69]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', class_ = selection_class)

In [70]:
len(topic_title_tags)

30

In [71]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [72]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', class_=desc_selector)


In [73]:
len(topic_desc_tags)

30

In [74]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [75]:
topic_link_tags = doc.find_all('a', 'no-underline flex-1 d-flex flex-column' )
len(topic_link_tags)


30

In [76]:
topic_link_tags[0]['href']

'/topics/3d'

In [77]:
topic0_url = 'https://github.com'+ topic_link_tags[0]['href']
topic0_url

'https://github.com/topics/3d'

In [78]:
topic_titles =[]

for tag in topic_title_tags:
  topic_titles.append(tag.text)
topic_titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [79]:
topic_descriptions = []

for tag in topic_desc_tags:
  topic_descriptions.append(tag.text.strip())
topic_descriptions[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [80]:
topic_urls =[]
base_url = 'https://github.com'

for tag in topic_link_tags:
  topic_urls.append(base_url + tag['href'])
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [81]:
import pandas as pd


In [82]:
topics_dict = {
    'title': topic_titles, 'description': topic_descriptions, 'url': topic_urls
    }

In [83]:
topics_df = pd.DataFrame(topics_dict)
topics_df

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


##Create CSV file(s) with the extracted information
* Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
* Execute the function with different inputs to create a dataset of CSV files.
* Verify the information in the CSV files by reading them back using Pandas.

In [84]:
topics_df.to_csv('githubtopics.csv', index=None)

##Getting information out of a topic page

In [85]:
topic_page_urls = topic_urls[0]
topic_page_urls

'https://github.com/topics/3d'

In [86]:
response = requests.get(topic_page_urls)

In [87]:
response.status_code

200

In [88]:
len(response.text)

450017

In [89]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [90]:
repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')

In [91]:
len(repo_tags)


20

In [92]:
a_tags = repo_tags[0].find_all('a')
a_tags

[<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [93]:
a_tags[0].text.strip()

'mrdoob'

In [94]:
a_tags[1].text.strip

<function str.strip(chars=None, /)>

In [95]:
a_tags[1]['href']

'/mrdoob/three.js'

In [96]:
repo_url = base_url + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [97]:
star_tags = topic_doc.find_all('span', class_= 'Counter js-social-count')
len(star_tags)

20

In [98]:
star_tags[0].text.strip()

'86.1k'

In [99]:
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    return int(float(stars_str[:-1]) * 1000)
  return int(stars_str)

parse_star_count(star_tags[0].text.strip())

86100

In [110]:
def get_repo_info(h3_tag, star_tag):
  #return all required information about the repositories
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tags[0].text.strip())
  return username, repo_name, stars, repo_url

In [111]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 86100, 'https://github.com/mrdoob/three.js')

In [116]:
topic_repos_dict ={
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
}

for i in range(len(repo_tags)):
  repo_info = get_repo_info(repo_tags[i], star_tags[i])
  topic_repos_dict['username'].append(repo_info[0])
  topic_repos_dict['repo_name'].append(repo_info[1])
  topic_repos_dict['stars'].append(repo_info[2])
  topic_repos_dict['repo_url'].append(repo_info[3])

topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'ssloy',
  'aframevr',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'isl-org',
  'a1studmuffin',
  'blender',
  'domlysz',
  'openscad',
  'spritejs',
  'FyroxEngine',
  'google',
  'jagenjo'],
 'repo_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'tinyrenderer',
  'aframe',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'Open3D',
  'SpaceshipGenerator',
  'blender',
  'BlenderGIS',
  'openscad',
  'spritejs',
  'Fyrox',
  'model-viewer',
  'webglstudio.js'],
 'stars': [86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100,
  86100],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/libgdx/libgdx',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/BabylonJS/Babyl

In [121]:
topic_repos_df =pd.DataFrame(topic_repos_dict)

In [120]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,86100,https://github.com/mrdoob/three.js
1,libgdx,libgdx,86100,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,86100,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,86100,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,86100,https://github.com/ssloy/tinyrenderer
5,aframevr,aframe,86100,https://github.com/aframevr/aframe
6,lettier,3d-game-shaders-for-beginners,86100,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,86100,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,86100,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,86100,https://github.com/CesiumGS/cesium


In [146]:
def get_topic_page(topics_url):
  # Download the page
  response = requests.get(topics_url)
  # Cheeck successful response
  if response.status_code != 200:
    raise Exception('Failed to load page{}'.format(topics_url))
  # Parse using BEautifulSOup  
  topic_doc = BeautifulSoup(response.text, 'html.parser')
  return topic_doc

def get_repo_info(h3_tag, star_tag):
  #return all required information about the repositories
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tags[0].text.strip())
  return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
  # Download the page
  response = requests.get(topic_page_urls)
  # Cheeck successful response
  if response.status_code != 200:
    raise Exception('Failed to load page{}'.format(topics_url))
  # Parse using BEautifulSOup  
  topic_doc = BeautifulSoup(response.text, 'html.parser')
  # Get the h3 tags containing repo title, repo URL and username
  repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')
  # Get star tags
  star_tags = topic_doc.find_all('span', class_= 'Counter js-social-count')

  topic_repos_dict ={
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
  }

  # Get repo info
  for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
  
  return pd.DataFrame(topic_repos_dict)

In [151]:
url4 = topic_urls[4]
url4

'https://github.com/topics/android'

In [152]:
topic4_doc = get_topic_page(url4)

In [153]:
topic4_repos =get_topic_repos(topic4_doc)

In [154]:
topic4_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,86100,https://github.com/mrdoob/three.js
1,libgdx,libgdx,86100,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,86100,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,86100,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,86100,https://github.com/ssloy/tinyrenderer
5,aframevr,aframe,86100,https://github.com/aframevr/aframe
6,lettier,3d-game-shaders-for-beginners,86100,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,86100,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,86100,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,86100,https://github.com/CesiumGS/cesium


In [155]:
get_topic_repos(get_topic_page(topic_urls[5]))

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,86100,https://github.com/mrdoob/three.js
1,libgdx,libgdx,86100,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,86100,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,86100,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,86100,https://github.com/ssloy/tinyrenderer
5,aframevr,aframe,86100,https://github.com/aframevr/aframe
6,lettier,3d-game-shaders-for-beginners,86100,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,86100,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,86100,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,86100,https://github.com/CesiumGS/cesium


In [None]:
def scrape_topics():
  topics_url = 'https://github.com/topics'
  response = requests.get(topics_url)
  


def scrape_topics_repos():
  url = 'https://github.com/topics'
  if response.status_code != 200:
    raise Exception('Failed to load page{}'.format(topics_url))

  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p', class_ = selection_class)
  desc_selector = 'f5 color-fg-muted mb-0 mt-1'
  topic_desc_tags = doc.find_all('p', class_=desc_selector)
  topic_link_tags = doc.find_all('a', 'no-underline flex-1 d-flex flex-column' )

  topic_titles =[]
  for tag in topic_title_tags:
    topic_titles.append(tag.text)

  topic_descriptions = []
  for tag in topic_desc_tags:
    topic_descriptions.append(tag.text.strip())

  topic_urls =[]
  base_url = 'https://github.com'

  for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
  


##Document and share your work
* Add proper headings and documentation in your Jupyter notebook.
* Publish your Jupyter notebook to your Jovian profile
* (Optional) Write a blog post about your project and share it online.