## EDA Lab Assignment - 5
* Ghori Zeel Jivrajbhai - 202201287

### Scrape data from the website link provided below:

- Link: https://www.tfrrs.org/

#### Tasks:

- Extract the data from the table on the website.
- Store the data in a DataFrame using pandas or another relevant library.
- Perform preprocessing steps on the data where required.
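
As one possible preprocessing step, the `DATE` values scraped from the table (for example `Oct 10` and `Oct  9`) are plain strings with irregular spacing. A minimal sketch, assuming the meets took place in 2024 (an assumption, since the table shows only month and day) and using a small hand-typed sample rather than the live site, could normalise the whitespace and parse the values into real dates:

```python
import pandas as pd

# Hand-typed sample mirroring a few scraped rows (not fetched from the site).
sample = pd.DataFrame({
    "DATE": ["Oct 10", "Oct  9", "Sep 28"],
    "MEET NAME": [
        "PVAMU Clifton Gillard Cross Country",
        "Principia Twilight",
        "Thomas Invitational",
    ],
    "STATE": ["TX", "IL", "ME"],
})

ASSUMED_YEAR = 2024  # assumption: the table gives no year

# Collapse runs of whitespace ("Oct  9" -> "Oct 9") before parsing.
clean_date = sample["DATE"].str.split().str.join(" ")

# Append the assumed year and parse into proper datetime values.
sample["DATE"] = pd.to_datetime(clean_date + f" {ASSUMED_YEAR}", format="%b %d %Y")

# With real dates, the rows can be sorted chronologically.
print(sample.sort_values("DATE"))
```

Once the column holds datetime values, sorting and filtering by date behave chronologically instead of alphabetically.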

# **Link to the Colab file**

https://drive.google.com/file/d/1cMCdeaSIC1iv8_iVr0hu84LvFHocVt_j/view?usp=sharing

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Request the website content
site_url = 'https://www.tfrrs.org/'
site_response = requests.get(site_url, timeout=30)
site_response.raise_for_status()  # Stop early if the request failed

# Step 2: Parse the HTML of the website
html_content = BeautifulSoup(site_response.text, 'html.parser')

# Step 3: Locate the table on the main page
main_table = html_content.find('table')

# Step 4: Extract the column names (headers) from the table
column_titles = [header.text.strip() for header in main_table.find_all('th')]

# Step 5: Extract rows and find hyperlinks from the table
data_rows = []
page_links = []
for row in main_table.find_all('tr')[1:]:  # Skip header
    row_elements = []
    for cell in row.find_all('td'):
        # Check if the cell contains a link
        link_element = cell.find('a')
        if link_element and link_element.get('href'):  # Skip anchors without an href
            page_links.append(link_element['href'])
        row_elements.append(cell.text.strip())
    data_rows.append(row_elements)

# Step 6: Create a DataFrame for the main table
df_primary = pd.DataFrame(data_rows, columns=column_titles)

# Step 7: Initialize the list to store DataFrames for linked tables
scraped_tables = []
base_site_url = 'https://www.tfrrs.org/'

# Step 8: Loop through each link and scrape the linked tables
for page_link in page_links:
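    # urljoin copes with absolute, root-relative, and relative hrefs alike
    full_url = requests.compat.urljoin(base_site_url, page_link)
    page_response = requests.get(full_url, timeout=30)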

    if page_response.status_code == 200:
        page_soup = BeautifulSoup(page_response.text, 'html.parser')

        # Extract the linked table from the page
        page_table = page_soup.find('table')
        if page_table:
            # Extract the headers of the linked table
            table_headers = [header.text.strip() for header in page_table.find_all('th')]

            # Extract the rows of the linked table
            table_rows = []
            for row in page_table.find_all('tr')[1:]:
                row_data = [cell.text.strip() for cell in row.find_all('td')]
                table_rows.append(row_data)

            # Store the linked table in a DataFrame
            df_link = pd.DataFrame(table_rows, columns=table_headers)
            scraped_tables.append(df_link)

# Step 9: Preprocess the DataFrames
df_primary.dropna(inplace=True)  # Remove missing data from the main table

# Drop missing data from the linked DataFrames
for index, df in enumerate(scraped_tables):
    scraped_tables[index] = df.dropna()

# Step 10: Print the main table and linked tables
print("Primary Table DataFrame:")
print(df_primary)

for index, df in enumerate(scraped_tables):
    print(f"\nLinked Table {index+1}:")
    print(df)

# Step 11: Save the data to CSV files
df_primary.to_csv('primary_table.csv', index=False)
for index, df in enumerate(scraped_tables):
    df.to_csv(f'linked_table_{index+1}.csv', index=False)

print("Data successfully saved.")
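
When the `href` values collected in Step 5 are combined with the base URL, plain string concatenation only gives a clean result for bare relative paths. A minimal sketch of a more tolerant approach uses `urllib.parse.urljoin` from the standard library. The `href` values below are made-up examples of the shapes a link can take, not links copied from the site:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tfrrs.org/"

# Hypothetical href shapes a table cell might contain (illustrative only).
hrefs = [
    "results/xc/12345/Example_Meet",                        # relative path
    "/results/xc/12345/Example_Meet",                       # root-relative path
    "https://www.tfrrs.org/results/xc/12345/Example_Meet",  # absolute URL
    "//www.tfrrs.org/results/xc/12345/Example_Meet",        # protocol-relative URL
]

# urljoin resolves each href against the base URL, whatever its shape.
resolved = [urljoin(BASE_URL, href) for href in hrefs]

for href, url in zip(hrefs, resolved):
    print(f"{href!r} -> {url}")
```

All four forms resolve to the same address, whereas concatenation would produce that exact string only for the first one.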


Primary Table DataFrame:
      DATE                                 MEET NAME STATE  \
0   Oct 10       PVAMU Clifton Gillard Cross Country    TX   
1   Oct  9                        Principia Twilight    IL   
2   Oct  8                        CCNY John Jay Dual    NY   
3   Oct  6  Queensborough Cross Country Invitational    NY   
4   Oct  5                  Ted Castaneda XC Classic    CO   
..     ...                                       ...   ...   
70  Sep 28                     2024 Cougar Challenge    CA   
71  Sep 28         Jessup Cross Country Invitational    CA   
72  Sep 28            Harry F. Anderson Invitational    NY   
73  Sep 28                       Thomas Invitational    ME   
74  Sep 28                   Brown Bear Invitational    MA   

                        VENUE  
0                Prairie View  
1                   Principia  
2                        CCNY  
3           Van Cortland Park  
4            Colorado College  
..                        ...  
70    