## Project Introduction:


1.   **Problem statement**: <br>
The objective of this project is to extract and organize information from GitHub's topic pages. GitHub offers a wealth of information on various topics and associated repositories, but manually gathering this data can be time-consuming. The goal is to automate the process of collecting data, organizing it into structured formats, and saving it into CSV files for further analysis or usage.<br><br>


2.   **Introduction to Web Scraping and GitHub**: <br>
GitHub is a popular platform for hosting and sharing code repositories, facilitating collaboration and version control. Web scraping involves extracting data from websites, allowing users to gather information in a structured format. In this project, web scraping will be used to extract data from GitHub's topic pages.<br><br>



3. **Tools used**:
*   <u>Python</u>: Backbone of the whole project.

*   <u>Requests</u>: Used to fetch the content of GitHub's topic pages.

*   <u>BeautifulSoup</u>: It helps in navigating and searching through the HTML structure of GitHub pages.

*   <u>Pandas</u>: Used to organize the extracted data into DataFrames and save them as CSV files.<br><br>


4. **Useful Links**: <br>
https://github.com <br>
https://github.com/topics




## Project outline:
1.   We are going to scrape https://github.com/topics
2.   We will get a list of topics. For each topic, we will get the topic title, topic page URL and topic description.

3.   For each topic, we will get the top 20 repositories from the topic page.
4.   For each repository, we will grab the repo name, username (author name), stars and repo URL.

## Scrape the list of topics from GitHub:

**Use the requests library to download web pages:**

In [63]:
!pip install requests --upgrade --quiet

In [64]:
import requests

**Use Beautiful Soup to parse and extract information**:

In [65]:
!pip install beautifulsoup4 --upgrade --quiet

In [66]:
from bs4 import BeautifulSoup

**Let's create a function to download the GitHub topics page**

In [67]:
def get_topic_page():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

## Parse the information from the topic page:

**Let's create some helper functions**:

In [68]:
def get_topic_title(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', class_=selection_class)

    topic_titles = [tag.text.strip() for tag in topic_title_tags]

    return topic_titles


# topic_titles = get_topic_title(doc)
# print(topic_titles)

get_topic_title can be used to get the list of titles

In [69]:
def get_topic_url(doc):
    topic_link_selector = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', class_=topic_link_selector)
    base_url = 'https://github.com'
    topic_urls = [base_url + tag['href'] for tag in topic_link_tags]
    return topic_urls

Next, we use the get_topic_desc function to get the list of topic descriptions.

In [70]:
def get_topic_desc(doc):
    topic_desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', class_=topic_desc_selector)
    topic_descs = [tag.text.strip() for tag in topic_desc_tags]
    return topic_descs

Similar to get_topic_title and get_topic_desc, the get_topic_url function defined above grabs the URLs of the topics.

**Installing and importing pandas library to form the dataframes**:

In [71]:
!pip install pandas --quiet

In [72]:
import pandas as pd

Let's combine all these functions into a single one:

In [73]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topics_url}')
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'Topic_title': get_topic_title(doc),
        'Topic_description': get_topic_desc(doc),
        'Topic_URL': get_topic_url(doc)
    }
    return pd.DataFrame(topics_dict)
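
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if one of the CSS classes used by the helpers stops matching some topics. The following is a minimal sketch of a check that could run before building the DataFrame. The helper name `check_equal_lengths` and the sample values are assumptions for illustration and are not part of the original notebook.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length.

    `columns` maps a column name to a list of scraped values.
    Hypothetical helper, not part of the original notebook.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths


# Made-up sample in which one description is missing.
sample = {
    'Topic_title': ['3D', 'Ajax'],
    'Topic_description': ['placeholder description'],
    'Topic_URL': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
try:
    check_equal_lengths(sample)
except ValueError as error:
    print(error)
```

Reporting the length of each column makes it easier to see which selector stopped matching.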

Let's run it to extract the information related to each topic.

In [74]:
total_df = scrape_topics()
total_df.to_csv('topics.csv', index=False)

## Get top 20 repositories in the topic from the topic page:

**Let's create a function to download the page of each topic from its topic URL. This redefines the earlier get_topic_page so that it takes the URL as a parameter**

In [75]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

**Useful functions**:<br>

* get_repo_info: It will be used to scrape the username, repo_name, repo_url and star count of each repository on a topic page.

*  parse_star_count: This function converts the star count string to a number (for example, 3.2k to 3200).





In [76]:
def parse_star_count(stars_str):
    # Convert a star count string such as '3.2k' or '870' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

def get_repo_info(repo_tag, star_tag):
    a_tags = repo_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = 'https://github.com' + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
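
`parse_star_count` handles plain integers and the `k` suffix. Below is a hedged sketch of a more tolerant variant, assuming star counts might also appear with thousands separators or an `m` suffix. That is an assumption: the notebook itself only deals with the `k` form. The name `parse_star_count_tolerant` is hypothetical.

```python
def parse_star_count_tolerant(stars_str):
    """Convert strings such as '3.2k', '1,234' or '1.5m' to an int."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    text = stars_str.strip().lower().replace(',', '')
    if not text:
        raise ValueError('empty star count')
    suffix = text[-1]
    if suffix in multipliers:
        # round() guards against float results that land just below
        # a whole number, which int() would truncate.
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text)


print(parse_star_count_tolerant('3.2k'))   # 3200
print(parse_star_count_tolerant('1,234'))  # 1234
```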

In [77]:
def get_topic_repos(topic_doc):
    repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')
    star_tags = topic_doc.find_all('span', class_='Counter js-social-count')

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    # Pair each repo heading with its star counter (assumes both lists are in the same order)
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)

    return pd.DataFrame(topic_repos_dict)

In [78]:
def scrape_topic(topic_url, topic_name):
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(f'{topic_name}.csv', index=False)
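
`scrape_topic` uses the topic title directly as the file name, so a title containing a character such as `/` could produce an invalid path. The following is a minimal sketch of one way to clean the title first and to skip topics whose CSV already exists. Both `safe_filename` and `should_scrape` are hypothetical helpers and are not part of the original notebook.

```python
import os
import re


def safe_filename(topic_name, extension='.csv'):
    """Build a file name from a topic title.

    Runs of characters other than letters, digits, '.', '_' and '-'
    are replaced with a single underscore.
    """
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', topic_name.strip())
    cleaned = cleaned.strip('_') or 'topic'
    return cleaned + extension


def should_scrape(topic_name):
    """Return False when the CSV for this topic already exists on disk."""
    return not os.path.exists(safe_filename(topic_name))


print(safe_filename('Chrome extension'))  # Chrome_extension.csv
print(safe_filename('ASP.NET'))           # ASP.NET.csv
```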

**Putting it all together**: <br>


*  We have a function to get the list of topics.
*  We have a function to create a CSV file of the repos scraped from a topic page.

*   Now let's put them all together.











In [79]:
def scrape_topic_repos():
    topics_df = scrape_topics()
    print('Scraping list of topics:')
    for index, row in topics_df.iterrows():
        print(f'Scraping top repos for the topic: "{row["Topic_title"]}"')
        scrape_topic(row['Topic_URL'], row['Topic_title'])
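
Because `get_topic_page` raises an exception on any non-200 response, a single failed request stops the whole loop. Here is a sketch of one way to keep going and report the failures at the end. `run_for_each` is a hypothetical helper, and `fake_scrape` is a stand-in for `scrape_topic` so that the example runs without network access.

```python
def run_for_each(items, worker):
    """Call worker(item) for every item and keep going after errors.

    Returns a list of (item, error) pairs for the items that failed.
    Hypothetical helper, not part of the original notebook.
    """
    failures = []
    for item in items:
        try:
            worker(item)
        except Exception as error:  # same broad type the notebook raises
            failures.append((item, error))
    return failures


def fake_scrape(topic_title):
    """Stand-in for scrape_topic that fails for one made-up title."""
    if topic_title == 'Broken':
        raise Exception(f'Failed to load page for {topic_title}')


failures = run_for_each(['3D', 'Broken', 'Ajax'], fake_scrape)
for title, error in failures:
    print(f'Skipped "{title}": {error}')
```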

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [80]:
scrape_topic_repos()

Scraping list of topics:
Scraping top repos for the topic: "3D"
Scraping top repos for the topic: "Ajax"
Scraping top repos for the topic: "Algorithm"
Scraping top repos for the topic: "Amp"
Scraping top repos for the topic: "Android"
Scraping top repos for the topic: "Angular"
Scraping top repos for the topic: "Ansible"
Scraping top repos for the topic: "API"
Scraping top repos for the topic: "Arduino"
Scraping top repos for the topic: "ASP.NET"
Scraping top repos for the topic: "Awesome Lists"
Scraping top repos for the topic: "Amazon Web Services"
Scraping top repos for the topic: "Azure"
Scraping top repos for the topic: "Babel"
Scraping top repos for the topic: "Bash"
Scraping top repos for the topic: "Bitcoin"
Scraping top repos for the topic: "Bootstrap"
Scraping top repos for the topic: "Bot"
Scraping top repos for the topic: "C"
Scraping top repos for the topic: "Chrome"
Scraping top repos for the topic: "Chrome extension"
Scraping top repos for the topic: "Command-line interf

## References and Future Ideas:

**References**: <br>
*  BeautifulSoup Documentation: https://beautiful-soup-4.readthedocs.io/en/latest/

* Requests Documentation:  https://pypi.org/project/requests/

*  Pandas Documentation: https://pandas.pydata.org/docs/user_guide/10min.html
<br>

**Future Ideas**:<br>
*   Extend the script to scrape the topics listed on the other pages of https://github.com/topics, not just the first page.

*   Make it more user friendly by asking the user for the topic names to scrape.

*   Add a mechanism to refresh the data automatically at a set frequency.
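
For the first idea, here is a minimal sketch of building the URLs for additional listing pages. It assumes the topics listing accepts a `page` query parameter (for example `https://github.com/topics?page=2`). That is an assumption worth checking in a browser before relying on it. `topic_page_urls` is a hypothetical name.

```python
from urllib.parse import urlencode


def topic_page_urls(first_page, last_page, base_url='https://github.com/topics'):
    """Return one URL per page number from first_page to last_page inclusive."""
    if first_page < 1 or last_page < first_page:
        raise ValueError('expected 1 <= first_page <= last_page')
    urls = []
    for page in range(first_page, last_page + 1):
        query = urlencode({'page': page})  # assumed query parameter
        urls.append(f'{base_url}?{query}')
    return urls


for url in topic_page_urls(1, 3):
    print(url)
```

Each URL could then be downloaded and parsed with the same helper functions used for the first page.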





