# Reprocess Repositories with Missing READMEs

This notebook inspects `gov_repositories_catalog.csv` and removes every entry whose README could not be fetched, that is, every row whose `readme_snippet` is exactly `"README not available"`.

By removing these entries, a subsequent run of the `GitHubScanner` will attempt to fetch the READMEs for these repositories again.

In [1]:
import pandas as pd
import os

# Define the path to the catalog file. 
# Using a relative path makes the notebook more portable.
catalog_file = '../data/20250624_223620_a910a8e1/gov_repositories_catalog.csv'

## 1. Load and Inspect the Data

In [2]:
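if not os.path.exists(catalog_file):
    # Stop here: every later cell depends on `df`, so printing and
    # continuing would only surface as a NameError further down.
    raise FileNotFoundError(f"Catalog file not found at {catalog_file}")

df = pd.read_csv(catalog_file)
print(f"Successfully loaded {catalog_file}")
print(f"Total rows: {len(df)}")
df.info()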

Successfully loaded ../data/20250624_223620_a910a8e1/gov_repositories_catalog.csv
Total rows: 8707
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8707 entries, 0 to 8706
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   account           8707 non-null   object
 1   name              8707 non-null   object
 2   description       8707 non-null   object
 3   stars             8707 non-null   int64 
 4   forks             8707 non-null   int64 
 5   language          8707 non-null   object
 6   url               8707 non-null   object
 7   readme_snippet    8678 non-null   object
 8   last_scanned_utc  8707 non-null   object
 9   created_at        8707 non-null   object
 10  pushed_at         8707 non-null   object
dtypes: int64(2), object(9)
memory usage: 748.4+ KB


## 2. Identify and Count Repos with Missing READMEs

In [3]:
missing_readme_text = "README not available"

missing_readme_mask = df['readme_snippet'] == missing_readme_text
num_missing = missing_readme_mask.sum()

print(f"Found {num_missing} repositories with 'README not available'.")
if len(df) > 0:
    print(f"This is {num_missing / len(df):.2%} of the total repositories.")

Found 0 repositories with 'README not available'.
This is 0.00% of the total repositories.
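The `df.info()` output above lists 8678 non-null `readme_snippet` values out of 8707 rows, so 29 rows are null and are not counted by the exact string comparison. The following is a minimal sketch of a broader check, assuming that null snippets and whitespace-padded copies of the marker should also count as missing; that assumption is not confirmed by the catalog format, and `find_missing_readmes` is a hypothetical helper, not part of `GitHubScanner`.

```python
import pandas as pd

MISSING_README_TEXT = "README not available"

def find_missing_readmes(frame, column="readme_snippet", include_null=True):
    """Return a boolean mask of rows whose README snippet looks missing."""
    # Replace nulls with "" and strip whitespace so padded markers still match.
    normalized = frame[column].fillna("").astype(str).str.strip()
    mask = normalized == MISSING_README_TEXT
    if include_null:
        # Assumption: a null snippet also means the README was not fetched.
        mask = mask | frame[column].isna()
    return mask

# Tiny illustrative frame (made-up rows, not taken from the catalog).
sample = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "readme_snippet": ["Hello", "README not available", None, " README not available "],
})
sample_mask = find_missing_readmes(sample)
strict_mask = find_missing_readmes(sample, include_null=False)
print(sample_mask.tolist())
print(strict_mask.tolist())
```

Passing `include_null=False` keeps the notebook's original meaning (marker text only) while still tolerating stray whitespace.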


In [4]:

# Find rows where the combination of 'account' and 'name' is duplicated.
# 'keep=False' marks all occurrences of a duplicate combination as True.
duplicate_combinations = df[df.duplicated(subset=['account', 'name'], keep=False)]

if not duplicate_combinations.empty:
    print("Found rows with duplicate 'account' and 'name' combinations:")
    # Using display() for better formatting of DataFrames in Jupyter notebooks
    display(duplicate_combinations)
    print(f"\nTotal number of rows involved in duplicate 'account' and 'name' combinations: {len(duplicate_combinations)}")
else:
    print("No rows found with duplicate 'account' and 'name' combinations.")

# Optional: Print a summary of unique vs. total rows for context
total_unique_combinations = df.drop_duplicates(subset=['account', 'name']).shape[0]
print("\nSummary:")
print(f"Total unique 'account' and 'name' combinations: {total_unique_combinations}")
print(f"Total rows in the DataFrame: {df.shape[0]}")


No rows found with duplicate 'account' and 'name' combinations.

Summary:
Total unique 'account' and 'name' combinations: 8707
Total rows in the DataFrame: 8707


## 3. Filter Out Rows and Overwrite CSV

This step drops the identified rows from the DataFrame and overwrites the original CSV file in place, so the next `GitHubScanner` run will try to fetch those READMEs again. The overwrite is destructive: the removed rows are not kept anywhere else.

In [4]:
if num_missing > 0:
    # Keep only the rows where the readme is NOT missing
    df_reprocessed = df[~missing_readme_mask]
    
    print(f"Original number of rows: {len(df)}")
    print(f"Number of rows to remove: {num_missing}")
    print(f"New number of rows: {len(df_reprocessed)}")
    
    # Save the reprocessed dataframe back to the original file
    df_reprocessed.to_csv(catalog_file, index=False, encoding='utf-8')
    
    print(f"\nSuccessfully removed rows and updated '{catalog_file}'.")
else:
    print("No rows with 'README not available' to remove.")

Original number of rows: 9778
Number of rows to remove: 1071
New number of rows: 8707

Successfully removed rows and updated '../data/20250624_223620_a910a8e1/gov_repositories_catalog.csv'.
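Because this step overwrites the catalog in place, it may be worth keeping a copy of the previous file. The sketch below shows one way to do that with `shutil.copy2`; `save_with_backup`, the `.bak` suffix, and the demo file are hypothetical and exist only for illustration.

```python
import os
import shutil
import tempfile

import pandas as pd

def save_with_backup(frame, path, backup_suffix=".bak"):
    """Copy the existing file to path + suffix, then overwrite path."""
    backup_path = path + backup_suffix
    if os.path.exists(path):
        # copy2 also preserves the file's modification time.
        shutil.copy2(path, backup_path)
    frame.to_csv(path, index=False, encoding="utf-8")
    return backup_path

# Demonstration on a throwaway file with made-up rows.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "catalog_demo.csv")
original = pd.DataFrame({
    "name": ["a", "b"],
    "readme_snippet": ["Hello", "README not available"],
})
original.to_csv(demo_path, index=False, encoding="utf-8")

filtered = original[original["readme_snippet"] != "README not available"]
demo_backup = save_with_backup(filtered, demo_path)

backup_rows = len(pd.read_csv(demo_backup))
current_rows = len(pd.read_csv(demo_path))
print(backup_rows, current_rows)
```

In the notebook, the same helper could replace the direct `to_csv` call, for example `save_with_backup(df_reprocessed, catalog_file)`.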


## 4. Verification

Let's reload the file to confirm that the rows have been removed.

In [5]:
df_verify = pd.read_csv(catalog_file)

remaining_missing = (df_verify['readme_snippet'] == missing_readme_text).sum()

if remaining_missing == 0:
    print("Verification successful: No more rows with 'README not available'.")
else:
    print(f"Verification FAILED: Found {remaining_missing} rows with 'README not available'.")

print(f"Current total rows: {len(df_verify)}")

Verification successful: No more rows with 'README not available'.
Current total rows: 8707
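The check above confirms that no marker rows remain, but not that only marker rows were dropped. A minimal sketch of a stricter comparison follows, assuming both the pre-filter and reloaded DataFrames are still in memory; `check_removal` is a hypothetical helper and the frames are made-up data.

```python
import pandas as pd

def check_removal(before, after, column="readme_snippet",
                  marker="README not available"):
    """Summarize whether exactly the marker rows were dropped."""
    expected_dropped = int((before[column] == marker).sum())
    remaining = int((after[column] == marker).sum())
    dropped = len(before) - len(after)
    return {
        "remaining_marker_rows": remaining,
        "rows_dropped": dropped,
        "rows_expected_dropped": expected_dropped,
        "ok": remaining == 0 and dropped == expected_dropped,
    }

# Made-up frames standing in for `df` and `df_verify`.
before_demo = pd.DataFrame({
    "name": ["a", "b", "c"],
    "readme_snippet": ["Hello", "README not available", "World"],
})
after_demo = before_demo[before_demo["readme_snippet"] != "README not available"]

report = check_removal(before_demo, after_demo)
print(report)
```

With the notebook's variables this would be called as `check_removal(df, df_verify)`, run in the same session that performed the removal.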
