<a href="https://colab.research.google.com/github/sridamju23/github-topics/blob/main/GithubTopics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project outline**


*   We are going to scrape "https://github.com/topics".
*   We will get a list of topics. For each topic, we will get the topic name, page URL, and description.
*   For each topic, we will get the top 25 repositories from the topic page.
*   For each repository, we will get the repo name, username, stars, and repo URL.
*   For each topic, we will create a CSV file in the format below:
            repo name,user name,stars,repo url
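
As a rough illustration of the per-topic CSV layout listed above, the sketch below writes one row in that column order using only the standard library. The notebook itself uses pandas for this step; `rows_to_csv`, `COLUMNS`, and the sample row are hypothetical names and placeholder values, not scraped data.

```python
import csv
import io

# Column order taken from the outline above.
COLUMNS = ["repo name", "user name", "stars", "repo url"]


def rows_to_csv(rows):
    """Serialize (repo, user, stars, url) tuples to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row for illustration only (not real scraped data).
sample_rows = [
    ("example-repo", "example-user", 1200,
     "https://github.com/example-user/example-repo"),
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```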





In [1]:
# Required libraries
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
import os


In [5]:
baseUrl = "https://github.com"  # GitHub home page address, used as the base URL

# This class groups small, independent helper methods for scraping the topics page.
class GithubTopics:
  # Store the maximum number of topics to extract.
  def __init__(self,limit:int)->None:
    self.limit =limit

  def getMainSoup(self,redirectTo):
    """
    Parameters:
    - redirectTo (str): The URL or path to redirect to for fetching the content.

    Returns:
    BeautifulSoup: A BeautifulSoup object representing the parsed HTML content.

    Raises:
    Exception: If the HTTP request to fetch the content fails or returns a non-200 status code.

    This method sends a GET request to the specified URL (baseUrl + redirectTo),
    and upon a successful response, it creates a BeautifulSoup object to parse
    the HTML content. If the HTTP request fails or the status code is not 200,
    an exception is raised with an informative error message.
    """

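    topics_response = requests.get(baseUrl+redirectTo)
    if topics_response.status_code!=200:
      raise Exception(f"Failed to fetch {baseUrl+redirectTo} (status code {topics_response.status_code})")
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    return BeautifulSoup(topics_response.text,"html.parser")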

  def getTopics(self,soup):
    """
    Extract a list of topics from a BeautifulSoup object representing a parsed HTML page.

    Parameters:
    - soup (BeautifulSoup): The BeautifulSoup object containing the parsed HTML content.

    Returns:
    list: A list of topic names extracted from the HTML page.

    This method finds and extracts topics from the HTML page using the specified CSS class
    ('f3 lh-condensed mb-0 mt-1 Link--primary') applied to 'p' tags. It limits the number
    of topics returned based on the 'limit' attribute of the class instance. The extracted
    topic names are then returned as a list.
    """
    topics_class="f3 lh-condensed mb-0 mt-1 Link--primary"
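    # find_all is the current name of the older findAll alias.
    topics_tags = soup.find_all('p',{'class':topics_class},limit = self.limit)
    # Strip surrounding whitespace so the names are clean when reused later (e.g. in file names).
    topics_name =[topic.get_text().strip() for topic in topics_tags]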
    return topics_name

  def getTopicDescriptions(self,soup):
    """
    Extract a list of descriptions from a BeautifulSoup object representing a parsed HTML page.

    Parameters:
    - soup (BeautifulSoup): The BeautifulSoup object containing the parsed HTML content.

    Returns:
    list: A list of descriptions extracted from the HTML page.

    This method finds and extracts descriptions from the HTML page using the specified CSS class
    ("f5 color-fg-muted mb-0 mt-1") applied to 'p' tags. It limits the number
    of descriptions returned based on the 'limit' attribute of the class instance. The extracted
    descriptions are then returned as a list.
    """
    description_class ="f5 color-fg-muted mb-0 mt-1"
    description_tags = soup.find_all('p',{'class':description_class},limit=self.limit)
    descriptions =[des.text.strip() for des in description_tags]
    return descriptions

  def getTopicLinks(self,soup):

    """
    Extract a list of links from a BeautifulSoup object representing a parsed HTML page.

    Parameters:
    - soup (BeautifulSoup): The BeautifulSoup object containing the parsed HTML content.

    Returns:
    list: A list of links extracted from the HTML page.

    This method finds and extracts links from the HTML page using 'a' tags that carry the
    CSS class ("no-underline flex-grow-0") and whose href contains "/topics/". It limits the
    number of links returned based on the 'limit' attribute of the class instance. Each
    relative href is prefixed with baseUrl, and the resulting absolute links are returned as a list.
    """
    topics_ref_tags = soup.find_all('a',href=GithubTopics.topics_links,class_="no-underline flex-grow-0",limit=self.limit)
    topics_ref =[baseUrl+a_links['href'] for a_links in topics_ref_tags]
    return topics_ref

  def getDataFrame(self,topics_name,descriptions,topics_ref):
    """
    Create a Pandas DataFrame from lists of topic names, descriptions, and reference links.

    Parameters:
    - topics_name (list): List of topic names.
    - descriptions (list): List of topic descriptions.
    - topics_ref (list): List of topic reference links.

    Returns:
    pd.DataFrame: A Pandas DataFrame containing the provided data.

    This method takes lists of topic names, descriptions, and reference links and
    creates a Pandas DataFrame with columns 'Topics', 'Descriptions', and 'Links'.
    The DataFrame is then returned.
    """
    data ={'Topics':topics_name,'Descriptions':descriptions,'Links':topics_ref}
    github_df = pd.DataFrame(data)
    return github_df

  def to_csv(self,dataFrame,file_name):
    """
    Save a Pandas DataFrame to a CSV file.

    Parameters:
    - dataFrame (pd.DataFrame): The Pandas DataFrame to be saved.
    - file_name (str): Name or path of the CSV file to write.

    Prints:
    - str: Status message indicating whether the CSV file was saved successfully or if there was an error.

    This method attempts to save the provided DataFrame to the CSV file given by 'file_name'
    (a bare file name is written to the current working directory). It prints a status message
    indicating the success or failure of the operation.
    """
    try:
      dataFrame.to_csv(file_name,index=False)
      print("CSV file is saved successfully")
    except Exception as e:
      print(f"There is an error: {str(e)}")

  @staticmethod
  def topics_links(href):
    """
    This static method checks whether a link's href contains "/topics/", so that only
    links to individual topic pages are selected.
    """
    return href and re.search('/topics/',href)
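
To make the behaviour of the `topics_links` filter concrete, here is a small standalone sketch that applies the same predicate to a handful of href values. When a function is passed as an attribute filter, BeautifulSoup calls it with the attribute value, which is `None` for tags that lack the attribute; that is why the predicate checks `href` before searching. The sample hrefs are hypothetical.

```python
import re


def topics_links(href):
    """Same predicate as the static method: truthy only if href contains '/topics/'."""
    return href and re.search('/topics/', href)


# Hypothetical href values; None stands for an <a> tag without an href attribute.
sample_hrefs = ["/topics/python", "/login", None, "/topics/3d", "/topics"]

# Keep only the values the predicate accepts.
matching = [h for h in sample_hrefs if topics_links(h)]
print(matching)
```

Note that the bare "/topics" path is rejected because the pattern requires the trailing slash, so the index page itself is not collected as a topic link.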


In [7]:
def topicDriver():
    """
    Driver function to scrape GitHub topics, descriptions, and links, create a DataFrame, and save it to a CSV file.
    This function initializes a GithubTopics instance, fetches the main soup,
    extracts topic information, creates a DataFrame, and saves it to a CSV file.

    Note: the 'limit' passed to GithubTopics is read from user input and caps
    the number of topics extracted.
    """
    try:
        limit = int(input("Enter the number of topics wanna get :"))
        gtm = GithubTopics(limit=limit)
        redirectTo = "/topics"
        soup = gtm.getMainSoup(redirectTo)
        topics_name = gtm.getTopics(soup)
        descriptions = gtm.getTopicDescriptions(soup)
        topics_ref = gtm.getTopicLinks(soup)
        df = gtm.getDataFrame(topics_name, descriptions, topics_ref)
        file_name = redirectTo[1:] + ".csv"
        gtm.to_csv(df, file_name)
    except Exception as e:
        print(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    topicDriver()


Enter the number of topics wanna get :10
CSV file is saved successfully
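
The driver collects names, descriptions, and links with three separate searches, and a pandas DataFrame built from a dict of lists raises a `ValueError` when those lists differ in length. The sketch below shows one possible way to check this up front with plain Python; `build_rows` is a hypothetical helper, not part of the notebook, and the sample values are placeholders.

```python
def build_rows(names, descriptions, links):
    """Pair up three parallel lists, refusing to proceed if their lengths differ."""
    lengths = {
        "names": len(names),
        "descriptions": len(descriptions),
        "links": len(links),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Scraped lists differ in length: {lengths}")
    return [
        {"Topics": name, "Descriptions": desc, "Links": link}
        for name, desc, link in zip(names, descriptions, links)
    ]


# Placeholder values for illustration only.
rows = build_rows(
    ["topic-a", "topic-b"],
    ["first description", "second description"],
    ["https://github.com/topics/topic-a", "https://github.com/topics/topic-b"],
)
print(rows[0])

# A mismatch (here, a missing description) is reported instead of silently misaligned.
try:
    build_rows(["topic-a"], [], ["https://github.com/topics/topic-a"])
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```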


In [11]:
class GithubRepos:
  def __init__(self,limit) -> None:
    self.limit = limit

  def getSoup(self,link):
    """
    Retrieve and parse the HTML content of a topic page, using a link collected earlier.

    Parameters:
    - link (str): The full URL of the topic page.

    Returns:
    BeautifulSoup: A BeautifulSoup object representing the parsed HTML content.

    Raises:
    Exception: If the HTTP request to fetch the content fails or returns a non-200 status code.

    This method sends a GET request to the specified URL, and upon a successful response,
    it creates a BeautifulSoup object to parse the HTML content. If the HTTP request fails
    or the status code is not 200, an exception is raised with an informative error message.
    """
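    response = requests.get(link)
    if response.status_code!=200:
      raise Exception(f"Failed to fetch {link} (status code {response.status_code})")
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    return BeautifulSoup(response.text,"html.parser")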

  def getRepoInfo(self,topic_soup):
    """
    Extract repository information from a BeautifulSoup object representing a topic page.

    Parameters:
    - topic_soup (BeautifulSoup): The BeautifulSoup object containing the parsed HTML content of a topic page.

    Returns:
    tuple: A tuple containing lists of user names, repo names, repo links, and stars for the top repositories.

    This method finds and extracts information about the top repositories for a given topic from the HTML page.
    It returns lists of user names, repo names, repo links, and stars for further processing.
    """
    user_name = []
    repo_name = []
    repo_link = []
    stars = []
    h_tag =topic_soup.find_all('h3',class_="f3 color-fg-muted text-normal lh-condensed",limit=self.limit)
    star_class ="Counter js-social-count"
    stars_tags = topic_soup.find_all('span',class_=star_class,limit = self.limit)
    for star_tag in stars_tags:
      star = GithubRepos.covert_to_number(star_tag.get_text())
      stars.append(int(star))
    for topic in h_tag:
      user = topic.get_text().split("/")
      user_name.append(user[0].strip())
      repo_name.append(user[1].strip())
      a_tag = topic.find('a',class_="Link text-bold wb-break-word")
      repo_link.append(baseUrl + a_tag['href'])
    return user_name,repo_name,repo_link,stars

  def to_csv(self,index,repo,df):
    """
    Save repository information to a CSV file.

    Parameters:
    - index (int): Index of the topic in the DataFrame.
    - repo (tuple): Tuple containing user names, repo names, repo links, and stars.
    - df (pd.DataFrame): The main DataFrame containing topic information to get each topic name based on index

    This method takes the extracted repository information, creates a DataFrame, and saves it to a CSV file
    with the filename based on the topic name. The index parameter is used to identify the corresponding topic.
    """
    d ={}
    d["Username"] = repo[0]
    d["Repo Name"] = repo[1]
    d["Repo Link"] = repo[2]
    d["Star"] = repo[3]
    repo_df = pd.DataFrame(d)
    file_name = df["Topics"][index]+".csv"
    repo_df.to_csv(file_name,index = False)

  @staticmethod
  def covert_to_number(star):
    """
    Convert star count string to a numerical value.

    Parameters:
    - star (str): String representing the star count, e.g., "60k".

    Returns:
    int: The numerical representation of the star count.

    This static method takes a star count string and converts it to a whole number.
    It handles suffixes 'k' (thousand), 'm' (million), and 'b' (billion) to provide
    the appropriate numerical representation.
    """
    if star is None:
      return 0
    # Parse with float() rather than eval(), which would execute scraped text as code.
    star = star.strip().lower()
    if not star:
      return 0
    multipliers = {"k":1000,"m":1000000,"b":1000000000}
    if star[-1] in multipliers:
      return round(float(star[:-1])*multipliers[star[-1]])
    return round(float(star))
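
The suffix handling described in the `covert_to_number` docstring can be tried in isolation. The following is a minimal standalone sketch, assuming star counts arrive as short strings such as "60k" or "1.2k", possibly padded with whitespace; `parse_star_count` and `SUFFIXES` are hypothetical names, and the comma removal is an extra assumption about thousands separators.

```python
# Multipliers for the suffixes named in the docstring.
SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_star_count(text):
    """Convert strings such as '60k' or '1.2m' to an int without using eval()."""
    if text is None:
        return 0
    # Normalize: trim whitespace, lowercase the suffix, drop thousands separators.
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        return 0
    multiplier = SUFFIXES.get(cleaned[-1])
    if multiplier is not None:
        # round() avoids off-by-one results from floating point multiplication.
        return round(float(cleaned[:-1]) * multiplier)
    return round(float(cleaned))


for raw in ["60k", " 1.2k\n", "987", "3m", ""]:
    print(repr(raw), parse_star_count(raw))
```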


In [12]:
def repoDriver():
    limit = int(input("Enter the number of repo wanna get :"))
    # Pass the user's choice through instead of a hard-coded limit of 10.
    githubTopics = GithubRepos(limit=limit)
    df = pd.read_csv("topics.csv")

    for index, link in enumerate(df["Links"]):
        try:
            topic_soup = githubTopics.getSoup(link)
            repo = githubTopics.getRepoInfo(topic_soup)
            githubTopics.to_csv(index, repo, df)
        except Exception as e:
            print(f"An error occurred: {str(e)}")
if __name__ == "__main__":
    repoDriver()

Enter the number of repo wanna get :15
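
`GithubRepos.to_csv` uses the topic name directly as the file name, so a name containing a path separator or stray whitespace could produce an unexpected path. The sketch below shows one possible sanitizing step; `topic_to_filename` is a hypothetical helper, the set of replaced characters is an assumption, and the sample names are illustrative rather than taken from a real run.

```python
import re


def topic_to_filename(topic_name):
    """Build '<topic>.csv', replacing characters that are unsafe in file names."""
    cleaned = topic_name.strip()
    # Collapse path separators, whitespace runs, and reserved characters into '_'.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", cleaned)
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or "unnamed_topic") + ".csv"


for name in ["Python", "  3D\n", "C++", "Amp / HTML"]:
    print(topic_to_filename(name))
```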
