# Task
Apply ICR steps to the dataset at "https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset".

## Download the resume dataset

### Subtask:
Download the resume dataset from Kaggle using `kagglehub`.


**Reasoning**:
The subtask is to download the resume dataset using kagglehub. This involves importing the library and using the download function.



In [1]:
import kagglehub

resume_dataset_path = kagglehub.dataset_download("snehaanbhawal/resume-dataset")

print("Path to resume dataset files:", resume_dataset_path)

Using Colab cache for faster access to the 'resume-dataset' dataset.
Path to resume dataset files: /kaggle/input/resume-dataset


## Adapt Text Extraction and Information Extraction Code

Since the resume dataset is provided as a CSV with the text already extracted, we can skip the image preprocessing and Tesseract OCR steps. We will directly use the text from the CSV file and apply the information extraction using the Gemini model.
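The extraction cell that follows asks the model for JSON and strips a Markdown code fence from the reply before calling `json.loads`. That cleanup step can be sketched in isolation as below. This is a minimal sketch, not the notebook's own code: the helper name `parse_model_json` and the `FENCE` constant are hypothetical, and it assumes the reply holds a single JSON object, optionally wrapped in one fence.

```python
import json

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker


def parse_model_json(reply_text: str) -> dict:
    """Parse a JSON object from a model reply, tolerating one Markdown fence."""
    text = reply_text.strip()
    if text.startswith(FENCE):
        # Drop the whole opening fence line, with or without a language tag.
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if the reply ends with one.
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    parsed = json.loads(text)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed
```

Removing the opening fence by line rather than by a fixed character count also handles a bare fence with no `json` tag.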

In [2]:
import pandas as pd
import json
import time
import os
from google import genai
from google.colab import userdata

# Load the resume data from the CSV file
resume_csv_path = os.path.join(resume_dataset_path, 'Resume', 'Resume.csv')
resume_df = pd.read_csv(resume_csv_path)

# Initialize the Generative AI client
genai_client = genai.Client(api_key=userdata.get('GOOGLE_API_KEY'))

# Define the output folder for extracted information
output_folder_path = "/content/resume_json_output"
os.makedirs(output_folder_path, exist_ok=True)
print(f"Created folder: {output_folder_path}")

# Define the prompt for information extraction
# There is no image to provide as context, so the prompt carries the category and text directly
prompt_template = """
Extract the following information from the given resume:
- Category (the value of the 'Category' column, given below)
- Extracted Text (the full text of the resume)

Provide the output in the following JSON format:
{{
    "Category": "CATEGORY",
    "Extracted Text": "FULL_RESUME_TEXT"
}}

Category: {category}

Here is the resume text:

{resume_text}
"""

start_time = time.time()
total_resumes = len(resume_df)
print(f"Total resumes to process: {total_resumes}")

# Process each resume in the DataFrame
# Limiting to first 20 for demonstration purposes
for index, row in resume_df.head(20).iterrows():
    category = row['Category']
    resume_text = row['Resume_str']
    resume_id = f"resume_{index}" # Create a simple ID based on the index

    print(f"Processing resume {index + 1}/{total_resumes}: {resume_id}")

    # Format the prompt with the current category and resume text
    current_prompt = prompt_template.format(category=category, resume_text=resume_text)

    # Create content for the model (text only)
    contents = [
        {
            "text": current_prompt
        }
    ]

    try:
        # Generate content using the Gemini model
        response = genai_client.models.generate_content(model='gemini-1.5-flash', contents=contents)

        # Access the usage_metadata attribute
        usage_metadata = response.usage_metadata

        # Print the different token counts
        print(f"Input Token Count: {usage_metadata.prompt_token_count}")
        # Note: thoughts_token_count might not always be present or relevant for all models/requests
        # if hasattr(usage_metadata, 'thoughts_token_count'):
        #   print(f"Thoughts Token Count: {usage_metadata.thoughts_token_count}")
        print(f"Output Token Count: {usage_metadata.candidates_token_count}")
        print(f"Total Token Count: {usage_metadata.total_token_count}")


        # Parse the JSON output
        # Added error handling for potential issues with JSON parsing
        try:
            # Clean the response text to ensure valid JSON
            cleaned_response_text = response.text.strip()
            if cleaned_response_text.startswith('```json'):
                cleaned_response_text = cleaned_response_text[7:]
            if cleaned_response_text.endswith('```'):
                cleaned_response_text = cleaned_response_text[:-3]
            cleaned_response_text = cleaned_response_text.strip()


            extracted_information = json.loads(cleaned_response_text)

            # Save the extracted information to a JSON file
            output_path = os.path.join(output_folder_path, f"{resume_id}.json")
            with open(output_path, "w") as f:
                json.dump(extracted_information, f, indent=4)

            print(f"Saved extracted information to {output_path}")
            print("-" * 50)

        except json.JSONDecodeError as e:
            print(f"Error decoding JSON for resume {resume_id}: {e}")
            print(f"Response text: {response.text}")
            print("-" * 50)
        except Exception as e:
            print(f"An error occurred while processing resume {resume_id}: {e}")
            print("-" * 50)


    except Exception as e:
        print(f"An API error occurred for resume {index + 1}/{total_resumes}: {e}")
        print("Skipping this resume and waiting for 60 seconds before continuing.")
        print("-" * 50)
        time.sleep(60) # Pause before moving on to the next resume; the failed one is not retried


print("Information Extraction Completed.")
print(f"Total time taken: {time.time() - start_time} seconds")

Created folder: /content/resume_json_output
Total resumes to process: 2484
Processing resume 1/2484: resume_0
An API error occurred for resume 1/2484: 404 NOT_FOUND. {'error': {'code': 404, 'message': 'Publisher Model `projects/generativelanguage-ga/locations/us-central1/publishers/google/models/gemini-1.5-flash-002` was not found or your project does not have access to it. Please ensure you are using a valid model version. For more information, see: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions', 'status': 'NOT_FOUND'}}
Skipping this resume and waiting for 60 seconds before continuing.
--------------------------------------------------
Processing resume 2/2484: resume_1
An API error occurred for resume 2/2484: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.\n* Quota exceeded

KeyboardInterrupt: 
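
The output above shows two different failures: a 404 `NOT_FOUND` for the requested model, which no amount of waiting will fix, and a 429 `RESOURCE_EXHAUSTED` quota error, which may succeed after a pause. The loop treats both the same way, sleeping 60 seconds and skipping the resume. A minimal sketch of separating the two cases follows. It assumes the error exposes an HTTP status code; `ApiError`, `RETRYABLE_CODES`, and `call_with_backoff` are hypothetical names for illustration, and the real SDK exception type would need to be substituted.

```python
import time


class ApiError(Exception):
    """Hypothetical stand-in for an SDK error that carries an HTTP status code."""

    def __init__(self, code: int, message: str = ""):
        super().__init__(f"{code} {message}".strip())
        self.code = code


# Assumption: these status codes are treated as transient and worth retrying.
RETRYABLE_CODES = {429, 500, 503}


def call_with_backoff(call, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Run call(), retrying transient failures with exponential backoff.

    Non-retryable errors (for example 404) are re-raised immediately, as is
    the last error once max_attempts has been reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ApiError as err:
            if err.code not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            # Delay doubles on each attempt: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. In the loop, the model call would be wrapped in a zero-argument function or `lambda` and passed as `call`.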

## Identify image and text files

### Subtask:
Identify the location and naming convention of the image and text files within the downloaded resume dataset.


**Reasoning**:
Explore the downloaded dataset directory to understand its structure and identify the location and naming conventions of image and text files.



In [3]:
import os

# List the contents of the downloaded dataset directory
print("Contents of the dataset directory:")
for item in os.listdir(resume_dataset_path):
    print(item)

# Define the path to the dataset's main directory
dataset_main_path = os.path.join(resume_dataset_path, 'Resume')

# List the contents of the 'Resume' directory
print("\nContents of the 'Resume' directory:")
for item in os.listdir(dataset_main_path):
    print(item)

# Assuming there's a subdirectory for images (commonly 'images' or 'img') and text (commonly 'text' or 'annotations')
# Let's check common naming conventions and structures
image_dir = None
text_dir = None

for root, dirs, files in os.walk(dataset_main_path):
    for dir_name in dirs:
        if 'image' in dir_name.lower() or 'img' in dir_name.lower():
            image_dir = os.path.join(root, dir_name)
        if 'text' in dir_name.lower() or 'annotation' in dir_name.lower() or 'label' in dir_name.lower():
            text_dir = os.path.join(root, dir_name)

# Print the identified directories
print(f"\nIdentified image directory: {image_dir}")
print(f"Identified text directory: {text_dir}")

# If image directory is found, list a few files to observe naming convention
if image_dir and os.path.exists(image_dir):
    print(f"\nFirst 10 files in image directory ({image_dir}):")
    for i, file_name in enumerate(os.listdir(image_dir)):
        if i < 10:
            print(file_name)
        else:
            break

# If text directory is found, list a few files to observe naming convention
if text_dir and os.path.exists(text_dir):
    print(f"\nFirst 10 files in text directory ({text_dir}):")
    for i, file_name in enumerate(os.listdir(text_dir)):
        if i < 10:
            print(file_name)
        else:
            break

Contents of the dataset directory:
Resume
data

Contents of the 'Resume' directory:
Resume.csv

Identified image directory: None
Identified text directory: None
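
The search above walked only the `Resume` directory, so the `data` entry listed at the top level was never examined. One way to survey the whole download is to count file extensions under each top-level entry. The sketch below is an illustration under that assumption; `count_extensions` is a hypothetical helper, and nothing here asserts what `data` actually contains.

```python
import os
from collections import Counter


def count_extensions(root: str) -> dict:
    """Map each top-level entry under root to a Counter of file extensions."""
    summary = {}
    for entry in sorted(os.listdir(root)):
        entry_path = os.path.join(root, entry)
        counts = Counter()
        if os.path.isdir(entry_path):
            # Recurse through the directory and tally every file's extension.
            for _dirpath, _dirnames, filenames in os.walk(entry_path):
                for name in filenames:
                    ext = os.path.splitext(name)[1].lower() or "<none>"
                    counts[ext] += 1
        else:
            # A plain file at the top level is counted by its own extension.
            ext = os.path.splitext(entry)[1].lower() or "<none>"
            counts[ext] += 1
        summary[entry] = counts
    return summary
```

Calling `count_extensions(resume_dataset_path)` and printing each entry with `dict(counts)` would show at a glance whether any image or document files exist outside `Resume`.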


## Summary:

### Data Analysis Key Findings

*   The dataset was successfully downloaded from Kaggle to the path `/kaggle/input/resume-dataset`.
*   The dataset root contains two entries, `Resume` and `data`. The listing shows names only, so it does not establish whether `data` is a file or a directory.
*   Inside the `Resume` directory, there is a single file named `Resume.csv`.
*   The `Resume` directory contains no image or text subdirectories in the format expected from a previous task. The search walked only `Resume`, so the contents of `data` were not inspected.
*   The resume text used here comes from the `Resume.csv` file, not from individual image and text files in separate directories.

### Insights or Next Steps

*   The task of identifying image and text files in the expected format could not be completed: `Resume` holds only a CSV file, and `data` was never examined.
*   Future analysis should focus on processing the `Resume.csv` file to extract and analyze the resume text data, and should inspect `data` before ruling out image files.
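
As a first step toward the CSV-based analysis suggested above, the number of resumes per category can be tallied without calling any model. The sketch below uses only the standard library and assumes a header row with a `Category` column, as used in the extraction cell; `category_counts` is a hypothetical helper name.

```python
import csv
from collections import Counter


def category_counts(csv_file) -> Counter:
    """Count rows per `Category` in an open CSV file that has a header row."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "Category" not in reader.fieldnames:
        raise ValueError("CSV has no 'Category' column")
    return Counter(row["Category"] for row in reader)
```

A typical call would open `resume_csv_path` with `newline=""` and an explicit encoding, pass the file object in, and print `.most_common()` on the result. If a field exceeds the `csv` module's default size limit, `csv.field_size_limit` can raise it.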
