# Web Scraper
This is a deliberately simple app meant to help me learn basic Python capabilities.  I am enlisting Claude to help with this module, both as a teacher and as an accelerator.  Because it is a learning exercise, everything in here is hard-coded.

## Purpose
The purpose of this module is to extract a list of the top 2000 Spanish nouns and their genders from a blog entry I found.  The output goes into a CSV file, which a separate app will later use to generate some simple statistics.
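
To make the target concrete, here is a rough sketch of how a later script might tally genders from a two-column file (Spanish word, then gender, no header row).  The in-memory sample and the name `gender_counts` are assumptions for illustration; the separate statistics app is not part of this notebook and may work differently.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the CSV file (word,gender per row)
sample_csv = "tiempo,masculine\ngente,feminine\nvida,feminine\n"

# Read each row and count how often each gender label appears
reader = csv.reader(io.StringIO(sample_csv))
gender_counts = Counter(gender for _word, gender in reader)

print(gender_counts)
```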

In [26]:
import requests
from bs4 import BeautifulSoup
import csv
import re

# URL of the blog post to scrape (hard-coded on purpose)
url = "https://frequencylists.blogspot.com/2015/12/the-2000-most-frequently-used-spanish.html"

# Send a GET request to the URL; give up if the server does not answer in time
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop on an HTTP error instead of parsing an error page
response.encoding = 'utf-8'  # Explicitly set encoding to UTF-8

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all text from the page
all_text = soup.get_text()

## Trimming
The hard-coded site above has specific formatting that needs to be dealt with.  First we need to trim off the lines that are not useful to us: some come before the list and some after it.

In [145]:
# Split the text into lines and clean them
lines = all_text.split('\n')
lines = [line.strip() for line in lines if line.strip()]
lines = lines[5:-13]  # Remove the first 5 and last 13 lines

*******Flagged:   926. expert - experto/a - masculine, feminine
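
The fixed offsets above only work while the page layout stays exactly as it was when this was written.  As a possible alternative, and only a sketch, the entries could be picked out by their shape instead: each one appears to start with a number and a period (for example `926. expert - ...`).  The sample lines and the name `ENTRY_PATTERN` are made up for illustration.

```python
import re

# Hypothetical sample standing in for the stripped page text
sample_lines = [
    "some header text",
    "1. time - tiempo - masculine",
    "2. man - hombre - masculine",
    "some footer text",
]

# An entry starts with one or more digits, a period, and whitespace
ENTRY_PATTERN = re.compile(r"^\d+\.\s")

# Keep only the lines that look like numbered list entries
entries = [line for line in sample_lines if ENTRY_PATTERN.match(line)]

print(entries)
```

This trades the hard-coded `5` and `-13` for an assumption about how every entry begins, so any entry that is not numbered this way would be dropped silently.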


## Parsing
The ``parse_entry`` function is the main part here.  The blog author was not fully consistent in how they presented the data, so the list comes in several formats.  The four main ones below cover more than 99% of the list.  The remainder amounts to a few entries, so rather than code up every last permutation, we'll just throw those out; the effect on the statistics we generate later will be insignificant.

### Supported Formats
1.  ``word - palabra - gender``
2.  ``word - palabra - gender / palabra - gender``
3.  ``word - palabro/palabra - masculine/feminine``
4.  ``word - palabro/a - masculine/feminine``

The first one accounts for the bulk of the entries.  The remaining three are mostly variations of the same thing: words that change slightly in spelling for the other gender.  Typically, in Spanish, nouns ending in "o" are masculine and nouns ending in "a" are feminine.  But not always.  *"Palabro" is not a real word; it is used here to show that, in these last two formats, the form ending in "o" is always the masculine one.*
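
As a small illustration of the ``palabro/a`` shorthand, the sketch below expands it into both spellings.  The helper name `expand_o_a` is hypothetical, and it assumes the first form ends in "o" and the suffix is a lone "a"; anything else is rejected rather than guessed at.

```python
def expand_o_a(shorthand):
    """Expand 'palabro/a' into ('palabro', 'palabra').

    Hypothetical helper; assumes the masculine form ends in 'o'.
    """
    masculine, _, suffix = shorthand.partition("/")
    if suffix != "a" or not masculine.endswith("o"):
        raise ValueError(f"Not an o/a shorthand: {shorthand}")
    # Swap the final 'o' for 'a' to build the feminine form
    feminine = masculine[:-1] + "a"
    return masculine, feminine


print(expand_o_a("experto/a"))  # ('experto', 'experta')
```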

In all cases, the English word is discarded, as we only want to run stats on the Spanish words.  The output for these formats will always be:

``word - palabra - gender                       ``&rarr;`` palabra,gender``\
*The gender may be masculine, feminine or masculine/feminine. In the latter case, gender will be set to "both"*

``word - palabra - gender / palabra - gender  |``\
``word - palabro/palabra - masculine/feminine | ``&rarr;`` | palabro,gender``\
``word - palabro/a - masculine/feminine       |     | palabra,gender``

with each of the latter formats split into two output lines.

The remaining formats account for fewer than 20 lines and are discarded.
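
The parser that follows tells the formats apart by splitting each entry on ``-`` and ``/`` and then looking at how many pieces and how many slashes it found.  The sketch below runs that same split on the placeholder formats listed above, so the counts can be seen side by side.  The `samples` and `summary` names are only for this illustration.

```python
import re

# Placeholder entries, one per format described above
samples = [
    "word - palabra - gender",
    "word - palabra - masculine/feminine",
    "word - palabra - gender / palabra - gender",
    "word - palabro/palabra - masculine/feminine",
    "word - palabro/a - masculine/feminine",
]

summary = []
for entry in samples:
    # Split on hyphens and slashes, then trim whitespace from each piece
    parts = [part.strip() for part in re.split("[-/]", entry)]
    summary.append((len(parts), entry.count("/")))
    print(len(parts), entry.count("/"), parts)
```

The last two formats produce the same counts (five parts, two slashes), which is why the parser needs an extra check to separate them.  Note also that a Spanish word containing a hyphen would add an extra part and fall outside these cases.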

In [158]:
# Filter out entries the parser does not handle: parentheses, "plural" notes, or commas
lines = [line for line in lines if '(' not in line and ')' not in line and 'plural' not in line and ',' not in line]

def parse_entry(entry):
    parts = [part.strip() for part in re.split('[-/]',entry)]
    numSlashes = entry.count('/')
    numParts = len(parts)
    results = []
 
    if numParts < 3:
        print(f"Skipping invalid entry: {entry}")
        return results
    
    # Discard the English word
    spanish_part = parts[1]
    gender_part = parts[2]
   
    # Case 1: word - palabra - gender
    if numParts == 3:
        results.append((spanish_part, gender_part))

    # Case 2: word - palabra - masculine/feminine (one word, both genders)
    elif numParts == 4:
        if parts[2] == "masculine" and parts[3] == "feminine":
            results.append((spanish_part, "both"))

    # Case 3: word - palabra - gender / palabra - gender
    elif numParts == 5 and numSlashes == 1:
        results.append((parts[1],parts[2]))
        results.append((parts[3],parts[4]))

    # Case 4: word - palabro/a - masculine/feminine
    # (a lone "a" after the slash means: swap the final "o" for "a")
    elif numParts == 5 and parts[1].endswith("o") and parts[2] == "a":
        results.append((parts[1],parts[3]))
        results.append((parts[1][:-1]+'a',parts[4]))
        
    # Case 5: word - palabra/palabra - gender/gender
    elif numParts == 5 and numSlashes == 2:
        results.append((parts[1],parts[3]))
        results.append((parts[2],parts[4]))

    return results

## Main Part
Now that we have the clean list, let's loop through it and parse it!

In [153]:
# Parse all lines
parsed_data = []
for line in lines:
    parsed_entries = parse_entry(line)
    parsed_data.extend(parsed_entries)

print(f"Number of parsed entries: {len(parsed_data)}")
if parsed_data:
    print("First few parsed entries:")
    for entry in parsed_data[:5]:
        print(entry)
else:
    print("No entries were successfully parsed.")

# Save the data to a CSV file
with open('parsed_spanish_words.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    # writer.writerow(['Spanish Word', 'Gender'])  # Write header
    writer.writerows(parsed_data)

print("Data has been parsed and saved to parsed_spanish_words.csv")

# parts:  3 ,  Parts:  ['1. time', 'tiempo', 'masculine'] 

# parts:  3 ,  Parts:  ['2. man', 'hombre', 'masculine'] 

# parts:  3 ,  Parts:  ['3. way', 'camino', 'masculine'] 

# parts:  3 ,  Parts:  ['4. people', 'gente', 'feminine'] 

# parts:  3 ,  Parts:  ['5. life', 'vida', 'feminine'] 

# parts:  3 ,  Parts:  ['6. day', 'día', 'masculine'] 

# parts:  3 ,  Parts:  ['7. work', 'trabajo', 'masculine'] 

# parts:  3 ,  Parts:  ['8. call', 'llamada', 'feminine'] 

# parts:  3 ,  Parts:  ['9. night', 'noche', 'feminine'] 

# parts:  3 ,  Parts:  ['10. home', 'hogar', 'masculine'] 

# parts:  3 ,  Parts:  ['11. thought', 'pensamiento', 'masculine'] 

# parts:  3 ,  Parts:  ['12. money', 'dinero', 'masculine'] 

# parts:  3 ,  Parts:  ['13. name', 'nombre', 'masculine'] 

# parts:  3 ,  Parts:  ['14. father', 'padre', 'masculine'] 

# parts:  3 ,  Parts:  ['15. guy', 'chico', 'masculine'] 

# parts:  3 ,  Parts:  ['16. place', 'lugar', 'masculine'] 

# parts:  3 ,  Parts:  ['17. feel',