# Homework 3:  Web-as-Output!

Last week was dedicated to _consuming_ (or, perhaps, _gathering_) content **from** the web.

This week, this notebook invites you into the world of _producing_ content for the web. The nice thing is that
  + the actual _producing_ happens in a scripting language
  + and then the _formatting_ for the web can be done automatically
  + whew!

#### <font style="color:rgb(180,120,10);"><b>hw3pr1, parts (a) and (b)</b>:  &nbsp; "real" webscraping...</font>

This problem bridges input from the web with output to the web. Last week's use of APIs found and interpreted **structured** data, mostly JSON.  (For pre-defined APIs, JSON is what's used, most of the time!)

What if a site has information you'd like to use, but offers only HTML, not JSON? In this case, <tt>requests</tt> will provide the raw HTML (as a string), and it'll be up to us to extract the information we want! We'll use 
  + Python string-handling and <tt>string</tt> libraries, and
  + Python's _regular expression_ <tt>re</tt> library, a mini-language for string-matching and -manipulating.

First, an example.  We want to programmatically access the _best snacks_ on the <u>definitive snacks page</u>, which is [here at this url](https://www.cs.hmc.edu/~dodds/demo.html)

Alas, this snack-centric web service seems not to have a JSON API! We will have to grab the whole HTML text. HTML is always sent over as a huge string...

In [58]:
import requests

url = "https://www.cs.hmc.edu/~dodds/demo.html"
result = requests.get(url)
print(f"{result = }")

result = <Response [200]>


In [59]:
# Let's print the text we just grabbed:
snack_page = result.text
print(snack_page)

text = snack_page         # ok to have many names...

<html>
  <head>
    <title>My streamlined website</title>
  </head>
  <body>
    <h1> Welcome! </h1>
    <h2> The best numbers </h2>

    <div id="numberlist">
      <ol>
	<li class="number"> 35 </li>
	<li class="number"> 42 </li>
	<li class="number"> <a href="https://en.wikipedia.org/wiki/Rayo%27s_number">Rayo's number</a> </li>
      </ol>
    </div>

    <img src="./spam.jpg" height="84px">
    <br><br>

    <h2> The <s>only</s> best snacks </h2>

    <div id="snacklist">
      <ul>
	<li class="snack"> Poptarts </li>
	<li class="snack"> Chocolate </li>
	<li class="snack"> Coffee </li>
      </ul>
    </div>

<!--    <a href="./demo_cat.html">Aliens <3 cats!</a>  -->

    <img src="./alien.png" height="101px">

  </body>
</html>






#### <font style="color:rgb(180,120,10);"><b>hw3pr1a</b>:  &nbsp; snack-scraping, _an example to run_ </font>

For this part, follow the cells below to scrape all of the snacks from the above string.

Notice that all of the snacks have a _common context_ - namely, the HTML ``<li>`` and ``</li>`` tags in which they're embedded. In addition, they are all of _class_ ``"snack"``

<br>
<hr>

Ooh... we notice that all of the snacks are inside ``li`` tags:
+ These are _list items_ within an _unordered list_ 
+ Here is an example of one: ``<li class="snack"> Poptarts </li>``
+ Notice, too, that the ``class`` in each case is ``"snack"``

There are three ways to grab all of these snacks!
1. We can use the ``.find`` method all strings have! (We'll do this.)
1. We can use regular expressions. See part (c)!
1. We can use a library such as [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/)  [[Good choice for a **final** project, if you'd like...]]

For now, let's show how ``.find`` can work:

First, let's see/remember what the <tt>find</tt> method does:

In [1]:
#    0         1         2             # ten's place
#    0123456789012345678901234567      # one's place
s = "abcdefghijklmnopqrstuvwxy&jk"
s.find("e")                            # try 'a', 'j', 'hi', 'hit', and 'z' ! jk!

                                       # s.find("a",15)   # try ("j",15)


4
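
As a small sketch of the suggestions in the comments above, here are a few more calls to <tt>find</tt> on the same string. The variable names are only for illustration. Note the optional second argument (where to start searching) and the ``-1`` that signals "not found":

```python
s = "abcdefghijklmnopqrstuvwxy&jk"

first_j = s.find("j")        # index of the first "j"
later_j = s.find("j", 15)    # start looking at index 15, so the first "j" is skipped
multi   = s.find("hi")       # multi-character search: the index where "hi" begins
missing = s.find("z")        # there is no "z", so find returns -1

print(first_j, later_j, multi, missing)
```

The ``-1`` result is what lets a loop know when to stop searching.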

Ok! Now we can create a plan...

<br>
<hr>

Let's 
  + find each instance of ``<li class="snack">``
  + print their indices and
  + print the string between them!

In [4]:
end = 0

while True:
    start = snack_page.find('<li class="snack">', end)
    if start == -1: break     # stop if we're done!
    end = start + 42          # 42 characters!
    
    snack_slice = snack_page[ start:end ]
    print(f"{snack_slice = }")

print("\nComplete!")


snack_slice = '<li class="snack"> Poptarts </li>\n\t<li cla'
snack_slice = '<li class="snack"> Coffee </li>\n      </ul'

Complete!


Aargh!  We only got two snacks. &nbsp; ***Do you see why?***

It's because we started the next ``find`` 42 characters after the first one, at ``end``, and it <u>ate into</u> the next snack. <br> So, it could only find the first and third snacks.

<br>

Let's repeat the process, more carefully
  + We should find the following ``</li>``
  + and then continue from there!

In [5]:
end = 0

while True:
    start = snack_page.find('<li class="snack">', end)
    if start == -1: break     # stop if we're done!
    end = snack_page.find('</li>', start)  # find the correct ending!
    
    snack_slice = snack_page[ start:end+5 ]
    print(f"{snack_slice = }")

print("\nComplete!")

snack_slice = '<li class="snack"> Poptarts </li>'
snack_slice = '<li class="snack"> Chocolate </li>'
snack_slice = '<li class="snack"> Coffee </li>'

Complete!


<b>We have our snacks!</b>

Let's show how to get ***only*** the snacks, not the surrounding HTML tags...

What's needed is the offset to the front of the snack, here in the variable ``FRONT``

In [None]:
# we need the length of the search string!
FRONT = len('<li class="snack">')

end = 0

while True:
    start = snack_page.find('<li class="snack">', end)
    if start == -1: break     # stop if we're done!
    end = snack_page.find('</li>', start)  # find the correct ending!
    
    snack_slice = snack_page[ start+FRONT:end ]
    print(f"{snack_slice = }")

print("\nYay!!!")

snack_slice = ' Poptarts '
snack_slice = ' Chocolate '
snack_slice = ' Coffee '

Yay!!!


#### Scraping Success!  

We have 
+ scraped a superior snack page that, alas, did not have a JSON API...
+ written a special-purpose script that extracted the superior snacks from the page
+ and shown that we have them (by printing them, but we could put them in a ~~fridge~~ list for future snack-use!)
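
Here is one possible sketch of that last idea: collecting the snacks in a list instead of printing them. The helper name ``extract_between`` and the short ``sample`` string are assumptions made for illustration (they are not part of the assignment), and ``.strip()`` is used to drop the spaces around each snack:

```python
def extract_between(text, open_tag, close_tag):
    """Return a list of the stripped strings found between open_tag and close_tag."""
    items = []
    end = 0
    while True:
        start = text.find(open_tag, end)
        if start == -1:
            break                      # no more opening tags
        end = text.find(close_tag, start)
        if end == -1:
            break                      # unmatched opening tag: stop rather than slice badly
        items.append(text[start + len(open_tag):end].strip())
    return items

# a short stand-in for snack_page, so that this cell runs on its own
sample = '<ul><li class="snack"> Poptarts </li>\n<li class="snack"> Coffee </li></ul>'
snacks = extract_between(sample, '<li class="snack">', '</li>')
print(f"{snacks = }")
```

The same call should work on ``snack_page`` itself, since it follows the same search-then-slice loop as the cells above.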

#### <font style="color:rgb(180,120,10);"><b>hw3pr1b</b>:  &nbsp; Scrape another page and extract specific data - your choice -  from it ... </font>

For this part, find another page - as large and complicated as you'd like - and scrape one or more pieces of information -- your choice -- from it...
  + Be sure that your information-extraction involves some use of the function <tt>find</tt>  
  + _or_ some use of the ``re`` regular expression library, which is introduced and used below.
  + The other details are up to you...

Ideas? Possibilities include
+ Any page that allows you to scrape it will work -- in the past, students have used The Student Life, and then compared which college is mentioned the most...
+ or the NYTimes, to see which of two cities/states/nations is mentioned the most
+ Perhaps one or two Wikipedia page(s), or a landing page for an organization...
+ With patience, you _can_ use ``find`` and/or ``re`` to extract arbitrary information... and this is a powerful foundation 
  + worth bragging about... 🍰 


In [63]:
#
# hw3pr1, part (b)
#
import requests
from bs4 import BeautifulSoup
import re

#Counting how many times porridge is mentioned in Goldilocks and the Three Bears

url = "https://americanliterature.com/childrens-stories/goldilocks-and-the-three-bears"
result = requests.get(url)
print(f"{result = }")

goldi = result.text

soup= BeautifulSoup(goldi,'html.parser') 
text = soup.get_text()

porridge_count = len(re.findall(r"porridge",text,re.IGNORECASE))

print(f"The word 'porridge' appears {porridge_count} times in Goldilocks and the Three Bears.")


result = <Response [200]>
The word 'porridge' appears 16 times in Goldilocks and the Three Bears.


That's it for <b>hw3pr1</b>, parts (a) and (b) ...

<br>
<hr>
<br>

Onward to <b>hw3pr1</b>, part (c): &nbsp; _Writing your own web-engine_ &nbsp; (with regular expressions) 
  + We'll start by introducing _regular expressions_ - we'll see they provide a nice way to "grab" the <tt>&lt;li&gt;</tt> items from HTML...
  + In fact, they're a great toolset for pretty much ***any*** text-extraction at all!

  <br><br>

#### <font style="color:rgb(180,120,10);"><b>hw3pr2</b>  &nbsp; Regular Expressions: &nbsp;  A _better_ approach to list-item finding and extracting...</font>

The list-item example above used one function to find the items and another to "clean them up."
  + This is great! And, will work for absolutely anything you need (adding functions as you go...)

**However**, there is a very powerful "mini" pattern-matching language that can help with many text-processing tasks: ***regular expressions***
  + Sometimes called <tt>regex</tt>'es or <tt>re</tt>'s,
  + regular expressions are a very compact language for matching text patterns.
  + the Python library is <tt>re</tt>

Before unpacking the regex language, let's see it in action for the "handle list-item tags" challenge:

In [43]:
# Let's import the regular expression library (it should be built-in)
import re

In [44]:
# REs are a whole language! 
# Let's see a strategic use, to get our snacks from the snack_page above:
import re

m = re.findall(r'<li class="snack">(.*)</li>', snack_page )      # Yikes!    Common functions: findall, sub, search, match  

print(f"{m = }")                                                 # Wow!!!

NameError: name 'snack_page' is not defined

### A nice example of RE's, _Regular Expressions_!  &nbsp;&nbsp; 

No turning back now...  😊

<br>

As a goal, let's build up to that large example above.  However, we won't use ``findall`` .

It's more informative to use ``sub`` (for _substitution_), so we can see what's found -- and what it becomes.

In [6]:
# Let's try some smaller examples to build up to the snack_page example:

# fundamental capabilities:  regex matching and substitution  
#
#    the regex:
#      matcher:    replacer:   in this string:
re.sub(r"Harvey",  "Mildred",  "Harvey Mudd")           # the 'r' is for 'raw' strings. They're best for re's.

'Mildred Mudd'

In [7]:
re.sub(r"car", "cat",  "This car is careful!")          # we'll stick with substitution for now...  uh oh!  space or ,1

'This cat is cateful!'

In [16]:
re.sub(r"a.*a", "a-a", "alabama")  

'a-a'
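
The whole word collapsed to ``'a-a'`` because ``.*`` is _greedy_: it grabs as much as it can. As a small sketch for comparison, adding a ``?`` after the star makes it _lazy_, so it grabs as little as it can:

```python
import re

greedy = re.sub(r"a.*a",  "a-a", "alabama")   # .* stretches to the LAST a
lazy   = re.sub(r"a.*?a", "a-a", "alabama")   # .*? stops at the NEXT a

print(f"{greedy = }")
print(f"{lazy = }")
```

With the lazy version there are two separate matches, ``ala`` and ``ama``, and the ``b`` between them is left alone.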

In [45]:
re.sub(r"d", "dd", "Harvey Mud")          # try "Mildred Mudd"

'Harvey Mudd'

In [46]:
# ANCHORS:  Patterns can be anchored:   $ means the _end_
re.sub(r"d$", "dd", "Mildred Mud" )   # $ signifies (matches) the END of the line

'Mildred Mudd'

In [47]:
# ANCHORS:  Patterns can be anchored:   ^  means the _start_ 
re.sub(r"^M", "ℳ", "Mildred Mudd" )   # ^ signifies (matches) the START of the line  (unicode M :)

'ℳildred Mudd'

In [48]:
# PLUS  +   means one or more:
re.sub(r"i+", "i", "Isn't the aliiien skiing this weekend? AiiiIIIiiiiIIIeee!" )   # try replacing with "" or "I" or "𝒾" or "ⓘ"

"Isn't the alien sking this weekend? AiIIIiIIIeee!"

In [49]:
# SquareBrackets  [iI]  mean any from that character group:
re.sub(r"[Ii]+", "i", "Isn't the aliiien skiing this weekend? AiiiIIIiiiiIIIeee!" )   # it can vary within the group!

"isn't the alien sking this weekend? Aieee!"

In [50]:
# SquareBrackets allow ranges, e.g., [a-z]
re.sub(r"[a-z]", "*", "Aha! You've FOUND my secret: 42!")       # use a +,  add A-Z, show \w, for "word" character

"A**! Y**'** FOUND ** ******: 42!"

In [51]:
# Let's try the range [0-9] and +
re.sub(r"[0-9]+", "42",  "Aliens <3 pets! They have 45 cats, 6 lemurs, and 789 manatees!")   # DISCUSS!  no +? How to fix?!

'Aliens <42 pets! They have 42 cats, 42 lemurs, and 42 manatees!'
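
For the discussion prompts in that comment, here is one hedged sketch. The first call drops the ``+``. The second uses a _negative lookbehind_, ``(?<!<)``, which is only one of several possible ways to leave the heart alone:

```python
import re

s = "Aliens <3 pets! They have 45 cats, 6 lemurs, and 789 manatees!"

# without the +, every single digit is replaced on its own
no_plus = re.sub(r"[0-9]", "42", s)

# (?<!<) means "not immediately preceded by <", so the 3 in <3 is skipped
keep_heart = re.sub(r"(?<!<)[0-9]+", "42", s)

print(no_plus)
print(keep_heart)
```

The lookbehind checks only one character, so a multi-digit number right after a ``<`` would still be partly replaced. It is a sketch for this sentence, not a general fix.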

Ok! &nbsp;&nbsp; Let's expand our thought experiments:

In [52]:
re.sub( r"or", "and", "words or phrases" )
re.sub( r"s", "-", "words or phrases" )
re.sub( r"[aeiou]", "-", "words or phrases" )

re.sub( r"$", " [end]", "words or phrases" )
re.sub( r"^", "[start] ", "words or phrases" )

# Challenge! The dot . matches _any_ single character:  
re.sub( r".", "-", "words or phrases" )   # What will this do?

re.sub( r".s", "-S", "words or phrases" )  # And this one?!

re.sub( r".+s", "-S", "words or phrases" )  # And this one?!!

'-S'

There is one more "common" regular expression element. &nbsp;&nbsp; The star * means "zero or more" of what precedes it...

It's similar to the plus + (which means 1 or more), _but * also allows for 0 times_ !  &nbsp;&nbsp; This can be mind-bending...
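
To make the difference concrete, here is a small sketch comparing the star with the plus, with counted repetitions in braces, and with a parenthesized group. The short test strings are made up for illustration:

```python
import re

s = "4 42 422 4222"

star         = re.sub(r"42*",     "47", s)   # a 4, then zero or more 2's
plus         = re.sub(r"42+",     "47", s)   # a 4, then one or more 2's
exactly_two  = re.sub(r"42{2}",   "47", s)   # a 4, then exactly two 2's
one_to_three = re.sub(r"42{1,3}", "47", s)   # a 4, then one to three 2's

# parentheses make the repetition apply to the whole group "42"
grouped = re.sub(r"(42)+", "47", "4 42 4242 422")

print(star, plus, exactly_two, one_to_three, grouped, sep="\n")
```

Only the star version changes the lone ``4``, because only the star is satisfied by zero 2's.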

In [53]:
# The star (asterisk) matches ZERO or more times...
re.sub(r"42*", "47", "Favorite #'s:  4 42 422 4222 42222 422222")       # try + {2}  {1,3}   (42)

"Favorite #'s:  47 47 47 47 47 47"

####   Ok!  Let's break out, to a more <font color="DodgerBlue"><b>hands-on</b></font> medium...

... to try out our ``"alabama"`` and ``"Google"`` regular-expression challenges... :) 

<br><br>

We now have ***almost*** everything in that list-item-handling example from a while back. 

Let's take a look -- and add the idea of a _capture group_   &nbsp;&nbsp; (using parens)

In [60]:
m = re.findall(r'<li class="snack">(.*)</li>', snack_page )   # parens are a "capture group"   # try w/o it  # try search & sub
                                                   # each set of parens "captures" the text inside it
print(f"{m = }")                                   # it can even be used later, as \1, \2, \3, etc. 

m = [' Poptarts ', ' Chocolate ', ' Coffee ']
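
The greedy ``(.*)`` works above because each snack sits on its own line, and the dot does not match a newline. As a hedged sketch, here is what can happen when two items share a line, along with a lazy version, a whitespace-trimming version, and a use of ``\1`` in ``sub``. The one-line ``html`` string is a made-up example:

```python
import re

# hypothetical one-line page: two snack items with no newline between them
html = '<li class="snack"> Poptarts </li><li class="snack"> Coffee </li>'

greedy  = re.findall(r'<li class="snack">(.*)</li>', html)         # one oversized match
lazy    = re.findall(r'<li class="snack">(.*?)</li>', html)        # one match per item
trimmed = re.findall(r'<li class="snack">\s*(.*?)\s*</li>', html)  # \s* eats the spaces

# \1 in the replacement refers to whatever the first capture group matched
bullets = re.sub(r'<li class="snack">\s*(.*?)\s*</li>', r'+ \1\n', html)

print(f"{greedy = }")
print(f"{lazy = }")
print(f"{trimmed = }")
print(bullets)
```

This last line goes from markup back to markdown-style list items, the reverse of the direction part (c) asks for.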


#### <font style="color:rgb(180,120,10);"><b>hw3pr1, part(c)</b> &nbsp;&nbsp; Writing your own Web Engine &nbsp; _with Regular Expressions_ ... </font>

A **web engine** is an informal term for software that makes content visible in a browser. For example,
+ In Jupyter notebooks, we write _markdown_ and then VSCode renders it as _markup_
+ Similarly, this happens in Google Colab and _anywhere_ markdown is used! 
  + to do this, the syntax <tt>_italic_</tt> gets transformed into <tt>&lt;i&gt;italic&lt;/i&gt;</tt> by a "markdown-to-markup" web engine
  + from there, the browser can render the latter using its markup: &nbsp; <i>italic</i>
  + (in fact, it uses another web engine to go from markup to visible content)  
  
+ We will focus on implementing the **markdown-to-markup** step - and extending it, by adding a few features of your own design 

<b><font color="DodgerBlue">Side note</font></b>: &nbsp;&nbsp; This is an example of _meta-programming_ for software! That is, writing programs that transform one sort of programs into another, more useful sort!
+ Often, with strategic transformations along the way...
+ Metaprogramming is poised to be a much larger part of the next two decades than it was in the last two...!

<br>
<hr>
<br>

The next cell has the _starting markdown_ for our **markdown-to-markup** web engine.  

Because the next cell ***is*** markdown -- and it's in a notebook _with_ a markdown engine -- you'll see the markup, as usual!
+ As usual, you can see the markdown by double-clicking the cell
+ It's also available as a Python string in the following cell...

# Claremont's Colleges - MARKDOWN version

The Claremont Colleges are a *consortium* of **five** SoCal institutions. <br>
We list them here.

## The 5Cs: a list
+ [Pomona](https://www.pomona.edu/)
+ [CMC](https://www.cmc.edu/)
+ [Pitzer](https://www.pitzer.edu/)
+ [Scripps](https://www.scrippscollege.edu/)
+ [HMC](https://www.hmc.edu/)

The above's an _unordered_ list.  <br>
At the 5Cs, we all agree there's __no__ order!

---

## Today's featured college: [CMC](https://coloradomtn.edu/)

<img src="https://ygzm5vgh89zp-u4384.pressidiumcdn.com/wp-content/uploads/2017/06/GWS_campusview_1000x627.jpg" height=160>

---

### Also featured: &nbsp; Scripps and Pitzer and Mudd and Pomona

<img src="https://i0.wp.com/tsl.news/wp-content/uploads/2018/09/scripps.png?w=1430&ssl=1" height=100px> &nbsp; 
<img src="https://upload.wikimedia.org/wikipedia/commons/f/f9/Brant_Clock_Tower%2C_Pitzer_College%2C_2016_%28cropped%29.jpg" height=100px> &nbsp; 
<img src="https://www.hmc.edu/about/wp-content/uploads/sites/2/2020/02/campus-gv.jpg" height=100px> &nbsp;
<img src="https://upload.wikimedia.org/wikipedia/commons/4/46/Smith_Tower_and_the_San_Gabriel_Mountains.jpg" height=100px>

Are there _other_ schools in Claremont?

### Claremont destinations
+ _Pepo Melo_, a fantastic font of fruit!
+ **Starbucks**, the center of Claremont's "city," not as good as Scripps's _Motley_ 
+ ***Sancho's Tacos***, the village's newest establishment
+ ~~In-and-out CS35_Participant_3~~ (not in Claremont, alas, but close! CMC-supported!)
+ `42`nd Street Bagel, an HMC fave, definitely _well-numbered_
+ Trader Joe's, providing fuel for the walk back to Pitzer _from Trader Joe's_

---

#### Regular Expression Code-of-the-Day 
`import re`               
`pet_statement = re.sub(r'dog', 'cat', 'I <3 dogs')`

#### New Construction of the ~~Day~~ _Decade_!

<img src="https://www.cs.hmc.edu/~dodds/roberts_uc.png" height=150> <br><br>

CMC's **_Roberts Science Center_, also known as _"The Rubiks Cube"_** <br>
Currently under construction, under deadline, and undeterred by SoCal sun, or rain... 

<br><br>


In [67]:
#
# Here is a code cell, with the entire first-draft markdown of the previous cell 
# 
# stored in the Python variable      original_markdown
#

original_markdown = """

# Claremont's Colleges - MARKDOWN version

The Claremont Colleges are a *consortium* of **five** SoCal institutions. <br>
We list them here.

## The 5Cs: a list
+ [Pomona](https://www.pomona.edu/)
+ [CMC](https://www.cmc.edu/)
+ [Pitzer](https://www.pitzer.edu/)
+ [Scripps](https://www.scrippscollege.edu/)
+ [HMC](https://www.hmc.edu/)

The above's an _unordered_ list.  <br>
At the 5Cs, we all agree there's __no__ order!

---

## Today's featured college: [CMC](https://coloradomtn.edu/)

<img src="https://ygzm5vgh89zp-u4384.pressidiumcdn.com/wp-content/uploads/2017/06/GWS_campusview_1000x627.jpg" height=160>

---

### Also featured: &nbsp; Scripps and Pitzer and Mudd and Pomona

<img src="https://i0.wp.com/tsl.news/wp-content/uploads/2018/09/scripps.png?w=1430&ssl=1" height=100px> &nbsp; 
<img src="https://upload.wikimedia.org/wikipedia/commons/f/f9/Brant_Clock_Tower%2C_Pitzer_College%2C_2016_%28cropped%29.jpg" height=100px> &nbsp; 
<img src="https://www.hmc.edu/about/wp-content/uploads/sites/2/2020/02/campus-gv.jpg" height=100px> &nbsp;
<img src="https://upload.wikimedia.org/wikipedia/commons/4/46/Smith_Tower_and_the_San_Gabriel_Mountains.jpg" height=100px>

Are there _other_ schools in Claremont?

### Claremont destinations
+ _Pepo Melo_, a fantastic font of fruit!
+ **Starbucks**, the center of Claremont's "city," not as good as Scripps's _Motley_ 
+ ***Sancho's Tacos***, the village's newest establishment
+ ~~In-and-out CS35_Participant_3~~ (not in Claremont, alas, but close! CMC-supported!)
+ `42`nd Street Bagel, an HMC fave, definitely _well-numbered_
+ Trader Joe's, providing fuel for the walk back to Pitzer _from Trader Joe's_

---

#### Regular Expression Code-of-the-Day 
`import re`               
`pet_statement = re.sub(r'dog', 'cat', 'I <3 dogs')`

#### New Construction of the ~~Day~~ _Decade_!

<img src="https://www.cs.hmc.edu/~dodds/roberts_uc.png" height=150> <br><br>

CMC's **_Roberts Science Center_, also known as _"The Rubiks Cube"_** <br>
Currently under construction, under deadline, and undeterred by SoCal sun, or rain... 

<br><br>


"""

In [134]:
#
# here is a function to write a string to a file (default name: output.html)
#

def write_to_file(contents, filename="output.html"):
    """ writes the string contents to the file filename """
    with open(filename, "w") as f:     # the with-block closes the file, even if an error occurs
        print(contents, file=f)
    print(f"{filename = } written. Try opening it in a browser!")

In [133]:
#
# Let's write our original_markdown to file...
#

write_to_file(original_markdown)

filename = 'output.html' written. Try opening it in a browser!


#### <font color="Goldenrod"><b>Your hw3pr1c task</b></font> is to create a set of functions that create a markdown-to-markup transformer!
+ <b>including</b> at least these existing markdown features: headers, bold, italic, strikethrough (for Toby!), url-links, and item-lists
+ <b>and you should design</b> at least three new markdown-features of your own. <font size="-2">(This is ***modern*** markdown, not that stodgy markdown from the 90's!)</font>
+ The assignment page has several suggestions. You'll add to the markdown source to show off your new features (and customize)

<hr>

To get started, the following cells have a couple of example transformations: 
+ how to convert the word ``MARKDOWN`` to the word ``MARKUP``
+ how to convert all of the newlines to <tt>&lt;br&gt;</tt>
+ how to handle the <tt># </tt>  top-level headers, which use <tt>&lt;h1&gt;</tt> and  <tt>&lt;/h1&gt;</tt> around their contents
+ how to handle fixed-width (<tt>code-type</tt>) text, which converts backticks <tt>`</tt> to <tt>&lt;tt&gt;</tt>, e.g., <tt>&#96;code&#96;</tt> to <tt>&lt;tt&gt;code&lt;/tt&gt;</tt>

It writes out the result to a file. 
+ Reload it directly in a browser to see how well it's doing.
+ Then, dive into the other changes...
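
One possible way to organize a chain of transformations like this is to keep the functions in a list and apply them in order. This is only a sketch: ``run_pipeline`` and the two tiny stand-in steps are hypothetical names, used so the cell runs by itself.

```python
import re

def down_to_up(contents):
    """Stand-in step: replace MARKDOWN with MARKUP."""
    return contents.replace("MARKDOWN", "MARKUP")

def blank_lines_to_br(contents):
    """Stand-in step: turn each empty or spaces-only line into <br>."""
    return re.sub(r"^[ \t]*$", "<br>", contents, flags=re.MULTILINE)

def run_pipeline(contents, steps):
    """Apply each function in steps, in order, feeding each result to the next."""
    for step in steps:
        contents = step(contents)
    return contents

result = run_pipeline("MARKDOWN\n\nversion", [down_to_up, blank_lines_to_br])
print(result)
```

With this layout, adding or reordering a transformation means editing one list, and there is no final version variable to forget to update.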

In [140]:
#
# overall markdown-to-markup transformer
#

contents_v0 = original_markdown              # here is the input - be sure to run the functions, below:

contents_v1 = handle_down_to_up(contents_v0)   #   MARKDOWN to MARKUP
contents_v2 = handle_newlines(contents_v1)   #   blank lines to <br>
contents_v3 = handle_headers(contents_v2)    #   # title to <h1>title</h1>  (more needed: ## to <h2>, ... up to <h6>)
contents_v4 = handle_code(contents_v3)       #   `code` to <tt>code</tt>
contents_v5 = handle_text_stylings(contents_v4)
contents_v6 = handle_lists(contents_v5)
contents_v7 = handle_links(contents_v6)
contents_v8 = feature1(contents_v7)
contents_v9 = feature2(contents_v8)
contents_v10 = feature3(contents_v9)

final_contents = contents_v10                # here is the output - be sure it's the version you want!

write_to_file(final_contents, "output.html") # now, written to file:  Reload it in your browser!

filename = 'output.html' written. Try opening it in a browser!


In [141]:
# we can also print the final output's source - this should show the HTML (so far)
print(final_contents)    
# in addition, _do_ open up output.html in your browser and then View Source to see the same HTML (so far)

<br>
<br>
<h1>Claremont's Colleges - MARKUP version</h1>
<br>
The Claremont Colleges are a <i>consortium</i> of <b>five</b> SoCal institutions. <br>
We list them here.
<br>
<h2>The 5Cs: a list</h2>
<ul><li><a href="https://www.pomona.edu/">Pomona</a></li></ul>
<ul><li><a href="https://www.cmc.edu/">CMC</a></li></ul>
<ul><li><a href="https://www.pitzer.edu/">Pitzer</a></li></ul>
<ul><li><a href="https://www.scrippscollege.edu/">Scripps</a></li></ul>
<ul><li><a href="https://www.hmc.edu/">HMC</a></li></ul>
<br>
The above's an <i>unordered</i> list.  <br>
At the 5Cs, we all agree there's <b>no</b> order!
<br>
---
<br>
<h2>Today's featured college: <a href="https://coloradomtn.edu/">CMC</a></h2>
<br>
<img src="https://ygzm5vgh89zp-u4384.pressidiumcdn.com/wp-content/uploads/2017/06/GWS<i>campusview</i>1000x627.jpg" height=160>
<br>
---
<br>
<h3>Also featured: &nbsp; Scripps and Pitzer and Mudd and Pomona</h3>
<br>
<img src="https://i0.wp.com/tsl.news/wp-content/uploads/2018/09/scripps.png?w

In [78]:
# here is a function to change MARKDOWN to MARKUP
#
import re

def handle_down_to_up(contents):
    """ replace all instances of MARKDOWN with MARKUP """
    new_contents = re.sub(r"MARKDOWN", r"MARKUP", contents)  # simple substitution
    return new_contents

# Let's test this!
if True:
    old_contents = "This is MARKDOWN text"
    new_contents = handle_down_to_up(old_contents) 
    print(new_contents)


This is MARKUP text


In [79]:
# here is a function to handle blank lines (making them <br>)
#
import re

def handle_newlines(contents):
    """ replace each blank (whitespace-only) line with an HTML newline <br> """
    NewLines = []
    OldLines = contents.split("\n")

    for line in OldLines:
        new_line = re.sub(r"^\s*$", r"<br>", line)  # if a line has only space characters, \s, we make an HTML newline <br>
        NewLines.append(new_line)

    new_contents = "\n".join(NewLines)   # join with \n characters so it's readable by humans
    return new_contents


# Let's test this!
if True:
    old_contents = """
# Title
    
# Another title"""
    new_contents = handle_newlines(old_contents)
    print(new_contents)

<br>
# Title
<br>
# Another title


In [95]:
# here is a function to handle headers - all six levels, h1 through h6
#
import re

def handle_headers(contents):
    """ replace all of the #, ##, ###, ... ###### headers with <h1>, <h2>, <h3>, ... <h6> """
    NewLines = []
    OldLines = contents.split("\n")

    for line in OldLines:
        for i in range(6, 0, -1):  # check for h6 h5 h4 h3 h2 h1 
            pattern = r'^#{' + str(i) + r'}\s+(.*)$' # i #'s at the start (^), whitespace, then the captured title up to the end ($)
            replacement = r'<h' + str(i) + r'>\1</h' + str(i) + r'>' # wrap the captured title with <hi> and </hi>, for i from 6 down to 1
            if re.match(pattern, line):
                line = re.sub(pattern, replacement, line)
                break  # Stop after first match
        NewLines.append(line)

    new_contents = "\n".join(NewLines)   # join with \n characters so it's readable by humans
    return new_contents

# Let's test this!
if True:
    old_contents = """
# Title
<br>
# H1
## H2
### H3
#### H4
##### H5
###### H6"""
    new_contents = handle_headers(old_contents)
    print(new_contents)


<h1>Title</h1>
<br>
<h1>H1</h1>
<h2>H2</h2>
<h3>H3</h3>
<h4>H4</h4>
<h5>H5</h5>
<h6>H6</h6>
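
As an alternative sketch (not the approach used in the cell above), all six levels can be handled in one ``re.sub`` call by passing a _function_ as the replacement. The function receives each match and can compute the header level from the number of ``#`` characters. The name ``headers_one_pass`` is made up for this example:

```python
import re

def headers_one_pass(contents):
    """Convert #, ##, ... ###### headers to <h1> ... <h6> in a single substitution."""
    def to_tag(match):
        level = len(match.group(1))          # the number of # characters captured
        return f"<h{level}>{match.group(2)}</h{level}>"

    # re.MULTILINE lets ^ and $ match at the start and end of every line
    return re.sub(r"^(#{1,6})[ \t]+(.*)$", to_tag, contents, flags=re.MULTILINE)

example = headers_one_pass("# Title\n### H3\n####### seven")
print(example)
```

A line with seven ``#`` characters is left unchanged, since at most six can be followed by the required space.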


In [81]:
# here is a function to handle code - using markdown backticks
#
import re

def handle_code(contents):
    """ replace all of the backtick content with <tt> </tt> """
    NewLines = []
    OldLines = contents.split("\n")

    for line in OldLines:
        # capture the contents and wrap with <tt> and </tt>; the lazy .*? keeps two code spans on one line separate
        new_line = re.sub(r"`(.*?)`", r"<tt>\1</tt>", line)
        NewLines.append(new_line)

    new_contents = "\n".join(NewLines)   # join with \n characters so it's readable by humans
    return new_contents

# Let's test this!
if True:
    old_contents = """\
This is `42`   
<br> 
Our regex library:  `import re`"""
    new_contents = handle_code(old_contents)
    print(new_contents)

This is <tt>42</tt>   
<br> 
Our regex library:  <tt>import re</tt>


In [None]:
# functions to handle word-stylings
# completed with jen lim
#
import re

def handle_text_stylings(contents):
    """ replace tildes with <s></s> for strikethrough
    replace two asterisks **bold** and two underscores __bold__ with <b></b> for bold
    replace asterisks *italic* and underscores _italic_ for italic
    """
    contents = re.sub(r'~~(.*?)~~', r'<s>\1</s>', contents) #strikethrough
    contents = re.sub(r'\*\*(.*?)\*\*|__(.*?)__', r'<b>\1\2</b>', contents) #bold
    contents = re.sub(r'\*(.*?)\*|_(.*?)_', r'<i>\1\2</i>', contents) #italic

    return contents

# Let's test this!
if True:
    old_contents = """
~~strikethrough~~
**bold** and __bold__
*italic* and _italic_
"""
    new_contents = handle_text_stylings(old_contents)
    print(new_contents)


<s>strikethrough</s>
<b>bold</b> and <b>bold</b>
<i>italic</i> and <i>italic</i>
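
One thing to watch for: a bare ``_(.*?)_`` pattern also matches underscores _inside_ words, such as those in a filename or URL. Here is a hedged sketch of one way to be stricter, using lookarounds so that an underscore only counts when it is not glued to a word character. The function name is hypothetical, and this handles only the underscore form of italics:

```python
import re

def italicize_standalone_underscores(contents):
    """Wrap _text_ in <i>...</i> only when the underscores are not attached to word characters."""
    # (?<!\w)  : the opening _ must not follow a letter, digit, or underscore
    # [^_\n]+  : the italic text itself has no underscores or newlines
    # (?!\w)   : the closing _ must not be followed by a letter, digit, or underscore
    pattern = r"(?<!\w)_([^_\n]+)_(?!\w)"
    return re.sub(pattern, r"<i>\1</i>", contents)

print(italicize_standalone_underscores("The above's an _unordered_ list."))
print(italicize_standalone_underscores("GWS_campusview_1000x627.jpg"))
```

Since the underscore is itself a word character, ``__bold__`` is also left untouched by this pattern.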



In [122]:
def handle_lists(contents):
    """ Convert markdown lists to markup lists. """
    contents = re.sub(r'^\+ (.*)', r'<ul><li>\1</li></ul>', contents, flags=re.MULTILINE)
    return contents

# Let's test this!
if True:
    old_contents = """ 
+ list item
+ list item2
"""
    new_contents = handle_lists(old_contents)
    print(new_contents)

 
<ul><li>list item</li></ul>
<ul><li>list item2</li></ul>
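
The output above puts each item in its own ``<ul>``. As an alternative sketch, the function below groups consecutive ``+ `` lines into a single ``<ul>`` by walking through the lines and tracking whether a list is currently open. The name ``group_list_items`` is made up for illustration:

```python
import re

def group_list_items(contents):
    """Wrap each run of consecutive '+ item' lines in one <ul> ... </ul>."""
    new_lines = []
    in_list = False                          # are we currently inside a <ul>?

    for line in contents.split("\n"):
        m = re.match(r"^\+ (.*)$", line)
        if m:
            if not in_list:                  # the first item of a new list
                new_lines.append("<ul>")
                in_list = True
            new_lines.append(f"  <li>{m.group(1)}</li>")
        else:
            if in_list:                      # the list just ended
                new_lines.append("</ul>")
                in_list = False
            new_lines.append(line)

    if in_list:                              # close a list that runs to the end of the text
        new_lines.append("</ul>")
    return "\n".join(new_lines)

grouped_list = group_list_items("+ list item\n+ list item2\nafter the list")
print(grouped_list)
```

Browsers will render either version, but a single ``<ul>`` per list is closer to what a markdown engine usually emits.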



In [124]:
def handle_links(contents):
    """ Convert markdown links [text](url) into HTML links <a href='url'>text</a>. """
    contents = re.sub(r'\[(.*?)\]\((.*?)\)', r'<a href="\g<2>">\g<1></a>', contents)
    return contents

# Let's test this!
if True:
    old_contents = """
[Google](https://www.google.com)
[Goldilocks and the Three bears](https://americanliterature.com/childrens-stories/goldilocks-and-the-three-bears)
"""
    new_contents = handle_links(old_contents)
    print(new_contents)


<a href="https://www.google.com">Google</a>
<a href="https://americanliterature.com/childrens-stories/goldilocks-and-the-three-bears">Goldilocks and the Three bears</a>



In [125]:
def feature1(contents):
    """ removes backslashes that interrupt asterisks and underscores """
    contents = re.sub(r'\\\*', '*', contents) #replace \* with *
    contents = re.sub(r'\\_', '_', contents)
    return contents

# Let's test this!
if True:
    # a raw string keeps \* and \_ exactly as written and avoids invalid-escape warnings
    old_contents = r"""
 This is \*Wow\* and \_Yikes\_.
 This is \_\_Yikes\_\_.
"""
    new_contents = feature1(old_contents)
    print(new_contents)


 This is *Wow* and _Yikes_.
 This is __Yikes__.





In [129]:
def feature2(contents):
    """ converts markdown superscript and subscript to markup """
    contents = re.sub(r'@@(.*?)@@', r'<sup>\1</sup>', contents) #superscript
    contents = re.sub(r'%%(.*?)%%', r'<sub>\1</sub>', contents) #subscript
    return contents

# Let's test this!
if True:
    old_contents = """
Y = e@@x@@ (Superscript)
%%Sub%%Script (Subscript)
"""
    new_contents = feature2(old_contents)
    print(new_contents)


Y = e<sup>x</sup> (Superscript)
<sub>Sub</sub>Script (Subscript)



In [130]:
def feature3(contents):
    """ converts markdown for text colors to markup """
    contents = re.sub(r'\{color:([\w#]+)\|(.*?)\}', r'<span style="color:\1;">\2</span>', contents)
    return contents

# Let's test this!
if True:
    old_contents = """
{color:red|red text}
{color:DodgerBlue|Go Dodgers!}
"""
    new_contents = feature3(old_contents)
    print(new_contents)


<span style="color:red;">red text</span>
<span style="color:DodgerBlue;">Go Dodgers!</span>



#### <font style="color:rgb(180,120,10);"><b>hw3pr1 part(c)</b>  &nbsp;&nbsp; More transformations!</font>

Your task is to make sure you can run the above transformations:
+ For each one, one at a time, try it on the small example
+ Then, uncomment it from the large (overall) example
+ Be **sure** to change the final ``final_contents`` variable
  + Forgetting this is the most common bug (not really a bug - just not running!)

<br>
<hr>
<br>

From there, implement the other markdown-to-markup transformations as noted in [HW3's gdocs page](https://docs.google.com/document/d/17bJfQIeuNGVh5vP8Y2BjRbVSDyDUNTpIrH0lgYubiUU/edit?tab=t.0) :
+ add new functions and cells -- or reuse other ones -- as you prefer
  + do keep things organized, either way!
+ handle all six levels of headers ``<h1>`` through ``<h6>``
+ handle at least the five word-stylings noted, including _italic_, **bold**, ~~strikethrough~~, unordered lists, and [urls](https://docs.google.com/document/d/1IKZk9mbVkvsf9tl14EZD2CuNYhy3lQvO4Lnk89RmA-0/edit)
+ and, handle, at least <b><font color="DodgerBlue">three more features-or-stylings</font></b> of your own design. (See that gdocs hw page for several possibilities...)
  + Note that you're welcome to _add prose to the original markdown page_ to show off your creative transformations
  + Please don't _remove_ any of the original markdown, however -- that is for testing the various transformations, as well...

<br>

Lots of room for creativity, for sure...   

<br>

#### <font style="color:rgb(180,120,10);"><b>Be sure your <u>final output HTML</u> is present!</b></font>
+ This should show the result of _all_ the transformations:
+ both the starting ones (such as strikethrough, bold, etc.)
+ and your own creations :)



<br>
<hr>
<br>

<font color="DodgerBlue"><b>Meta-programming</b></font> -- that is, writing programs to help you write programs -- is mind-bending, for sure. 
+ As AI rises, there's no avoiding it: &nbsp;&nbsp; We have definitively entered the era of meta-programming ...

Once your neurons are suitably _"bent"_ ... you'll find ***lots*** of uses for it! 
