# Week 8 - Strings, Websites, Images, Email

This week we're going to look at some small miscellaneous topics that are important to know but don't really fit anywhere else in the course, as well as how we can use Python to interact with the Web, Images and Email. To start, let's look at some basic string manipulation.

Python comes with some built-in tools for string manipulation that make our lives a bit easier - we'll look at some built-in functionality along with the `re` module for Regex. Finally, we'll look at applying these to some HTML parsing to get some information from the web.

## Format Strings 

Usually when manipulating strings we've done so either using the `+` operator or just passing it in as another argument to the print function as below:

In [5]:
def say_hello():
    name = input("Please enter your name: ")
    print("Hello,", name)

say_hello()

Hello, sam


This is a fine way to do it, and works in a lot of situations, but what about formatting the string to look a bit nicer? We can use `f-strings` to format strings inline:

In [6]:
def say_hello():
    name = input("Please enter your name: ")
    print(f"Hello, {name}")

say_hello()

Hello, sam


This might look like it's just as good as the above - but it comes with some more powerful tools. Consider the following examples:

In [7]:
import math

name = "Sam Ball"
other_name = "John T. W. Smith"

# Print pi to 3dp
print(f"Pi is equal to roughly: {math.pi:.3f}")

# Print the name of a variable along with its value (helpful for debugging!)
print(f"{name=}")

# Pad a string to a fixed width of 20 characters (great for tables!):
print(f"Monopoly Scores:\n{'-' * 20}")
print(f"{name:20}== £100")
print(f"{other_name:20}== £2400")

Pi is equal to roughly: 3.142
name='Sam Ball'
Monopoly Scores:
--------------------
Sam Ball            == £100
John T. W. Smith    == £2400
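
The part after the colon in an f-string can also set alignment and number formatting. As a small sketch (the `format_scores` helper and the `scores` dictionary are made up for illustration), `<` left-aligns, `>` right-aligns, and `,` adds a thousands separator:

```python
scores = {"Sam Ball": 100, "John T. W. Smith": 2400}

def format_scores(scores):
    """Return one line per player: name left-aligned, score right-aligned."""
    lines = []
    for player, score in scores.items():
        # :<20 pads the name out to 20 characters on the right
        # :>6, right-aligns the number in 6 characters with a thousands separator
        lines.append(f"{player:<20}£{score:>6,}")
    return lines

for line in format_scores(scores):
    print(line)
```

Right-aligning the numbers keeps the digits lined up in a column, which is usually easier to read in a table than left-aligned values.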


## The Big Scary World of Regex

Regular Expressions (or Regex) are a way of working with text that allows us to search for patterns. Some of you will find this amazing and instantly know a million places where you can use it, and others will wonder why we are talking about this at all!

These expressions look a bit scary and indecipherable, but after a little bit of work I promise you'll be able to figure out what any regular expression is looking for.

Let's start by defining the basic rules and see how they'd be put together. First, if you want to match a single character, we have some basic operations:

* `[abc]` will match a single a, b or c. `[a-m]` will match any character from a to m.
* `[^abc]` will match any character that's *not* a, b or c.
* `.` will match any character.

Then for defining *how many* of that character, we have the following basic characters:
* `*` - zero or more occurrences (so the character is optional)
* `+` - one or more occurrences
* `?` - zero or one occurrence
* `{2}` - exactly 2 occurrences (can be any number)
* `{3,5}` - between 3 and 5 occurrences (note there is no space after the comma). Leaving out the first number means 0 to the second number - leaving out the second number means "at least the first number".

So, for example, `[abc]+` will match *any* string or substring made up of a, b and cs. `e{2}` will match all `ee`s in the text.

Before moving on to some more complex examples, let's see how this works in Python:

In [8]:
import re

text_to_search = "The rain in spain falls mainly on the plains"

# Search will give the first instance matching the pattern
print(re.search(r"[rpml]ain", text_to_search))

# Findall will give all instances
print(re.findall(r"[rpml]ain", text_to_search))

# One or more letters, followed by two l's, followed by a letter.
print(re.findall(r"[a-z]+l{2}[a-z]", text_to_search))

<re.Match object; span=(4, 8), match='rain'>
['rain', 'pain', 'main', 'lain']
['falls']
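
To check the counting rules one at a time, it can help to test a pattern against a whole string rather than searching inside a longer one. The sketch below uses `re.fullmatch` for this; the `matches` helper is just a convenience written for this example:

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"ab*", "a"))             # True: * allows zero b's
print(matches(r"ab+", "a"))             # False: + needs at least one b
print(matches(r"colou?r", "color"))     # True: ? makes the u optional
print(matches(r"e{2}", "eee"))          # False: exactly two e's, no more
print(matches(r"[0-9]{3,5}", "1234"))   # True: between 3 and 5 digits
print(matches(r"[0-9]{2,}", "7"))       # False: at least 2 digits needed
```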


With this we can already do a lot of good stuff - we can get even better by adding the following rules:

* `\d`, `\w` and `\s` are special sequences matching any digit, any word character (a letter, digit or underscore) and any whitespace character respectively.
* A pipe (`|`) allows for "or" groupings of words - for example `(green|red)\sapple` will match both "red apple" and "green apple".
* `\b` represents a word boundary (i.e. the start or end of a word).
* `^` and `$` match the beginning and end of an input respectively.

So if we want to get all the rhymes in the above phrase, we can use:


In [9]:
print(re.findall(r"\b\w{1,2}ain[lys]+\b", text_to_search))

['mainly', 'plains']
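
Here is a short sketch of the other rules in action, using a made-up sentence (`fruit_text` is not from the course material). One thing to watch out for: when a pattern contains a group in brackets, `re.findall` returns only the text of the group, so this example uses `re.finditer` and `group(0)` to get the whole match instead.

```python
import re

fruit_text = "I bought 3 green apples, 12 red apples and 1 yellow apple"

# Alternation: either colour, then a whitespace character, then "apple"
colour_pattern = r"(green|red)\sapple"
whole_matches = [m.group(0) for m in re.finditer(colour_pattern, fruit_text)]
print(whole_matches)                           # ['green apple', 'red apple']

# findall with a group returns just the bracketed part
print(re.findall(colour_pattern, fruit_text))  # ['green', 'red']

# \d+ picks out every run of digits
numbers = re.findall(r"\d+", fruit_text)
print(numbers)                                 # ['3', '12', '1']

# ^ and $ anchor the pattern to the start and end of the input
starts_with_i = bool(re.search(r"^I\b", fruit_text))
ends_with_apple = bool(re.search(r"apple$", fruit_text))
print(starts_with_i, ends_with_apple)          # True True
```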


Regex always takes a bit of trial and error and can be fiddly to get it to work - very often you start simple with the patterns and build them up bit by bit. You'll make complex patterns in no time!

If you want to practice Regex, I *highly* recommend [Regex Crosswords](https://regexcrossword.com) - I'd actually say this is the best way to learn Regex!

Now let's look at a way of using Regex in real life:

## Input Validation

To validate a postcode using Python we have two options - use an API web service to validate that the postcode is indeed valid and corresponds to an address, or check that it is in the right format manually. The [government actually provides a Regex string](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/488478/Bulk_Data_Transfer_-_additional_validation_valid_from_12_November_2015.pdf) for postcode matching; this is a bit more complex as it has to handle all the edge cases.

Let's build a function that takes in a string, and scans it for (most) valid postcodes.

Postcodes can take a few forms: 1 or 2 letters followed by 1 or 2 numbers, followed by number, letter, letter.
This would be fairly difficult to do using normal Python, but far easier using regex!

In [10]:
def check_for_postcodes(input_string):
    # 1-2 letters, 1-2 digits, an optional space, then digit, letter, letter
    return re.findall(r"[A-Za-z]{1,2}\d{1,2}\s?\d[A-Za-z]{2}", input_string)

my_string = "Sometimes I write my postcode as SW179EF, sometimes as SW17 9EF."

check_for_postcodes(my_string)

['SW179EF', 'SW17 9EF']
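
Scanning a sentence for postcodes is slightly different from validating a single input. For validation we usually want the *whole* string to be a postcode, which `re.fullmatch` checks. The sketch below is an assumption-laden example rather than the official rule: `POSTCODE_PATTERN` follows the simplified description above (ending in digit, letter, letter) and `is_valid_postcode` is a name made up for illustration.

```python
import re

# Simplified pattern (an assumption, not the official government regex):
# 1-2 letters, 1-2 digits, optional space, then digit, letter, letter
POSTCODE_PATTERN = re.compile(r"[A-Za-z]{1,2}\d{1,2}\s?\d[A-Za-z]{2}")

def is_valid_postcode(candidate):
    """Return True if the whole string (ignoring outer spaces) looks like a postcode."""
    return POSTCODE_PATTERN.fullmatch(candidate.strip()) is not None

print(is_valid_postcode("SW17 9EF"))                 # True
print(is_valid_postcode("SW179EF"))                  # True
print(is_valid_postcode("My postcode is SW17 9EF"))  # False: extra text around it
print(is_valid_postcode("SW17"))                     # False: second half missing
```

This could sit inside a loop around `input()` that keeps asking until the user types something in the right format.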

## Web Scraping

Regex is a useful building block for searching text, and mastering it will make you very valuable whenever someone wants to look for something. Thankfully, for those of us who aren't so comfortable writing out 3-line regex patterns, other people have built specific libraries for common tasks.

One such example is web scraping; websites are made of HTML code which looks like the following:
```
<html>
    <head><title>Website title</title></head>
    <body>Website contents</body>
</html>
```
Obviously most websites are a bit more complex than this - let's look at the BBC news website for example:

In [11]:
import requests

bbc = requests.get("https://www.bbc.co.uk/news")
print(bbc.text)

<!DOCTYPE html>
<html lang="en-GB" class="b-pw-1280 b-reith-sans-font no-touch" id="responsive-news">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=1">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <meta name="google-site-verification" content="Tk6bx1127nACXoqt94L4-D-Of1fdr5gxrZ7u2Vtj9YI">
    <link href="//static.bbc.co.uk" rel="preconnect" crossorigin>
    <link href="//m.files.bbci.co.uk" rel="preconnect" crossorigin>
    <link href="//nav.files.bbci.co.uk" rel="preconnect" crossorigin>
    <link href="//ichef.bbci.co.uk" rel="preconnect" crossorigin>
    <link rel="dns-prefetch" href="//mybbc.files.bbci.co.uk">
    <link rel="dns-prefetch" href="//ssl.bbc.co.uk/">
    <link rel="dns-prefetch" href="//sa.bbc.co.uk/">
    <link rel="dns-prefetch" href="//ichef.bbci.co.uk">


    <link rel="preload" as="style" href="//m.files.bbci.co.uk/modules/bbc-morph-news-page-styles/2.4.25/enhanced.

There's a lot of stuff there! How can we navigate through it? Well, the `BeautifulSoup` library gives us a lot of tools for parsing this into some form we may want to use: 

In [12]:
!pip install bs4



In [13]:
from bs4 import BeautifulSoup
# Name the parser explicitly so bs4 doesn't have to guess (and warn about it)
soup = BeautifulSoup(bbc.text, "html.parser")

print(soup.title)

<title>Home - BBC News</title>


`bs4` comes with a ton of useful tools for navigating the webpage - for example to get all the links in the page, we can just use:

In [14]:
for link in soup.find_all('a'):
    print(link.get('href'))

https://www.bbc.co.uk
#skip-to-content-link-target
https://www.bbc.co.uk/accessibility/
https://account.bbc.com/account
https://www.bbc.co.uk/notifications
https://www.bbc.co.uk
https://www.bbc.co.uk/news
https://www.bbc.co.uk/sport
https://www.bbc.co.uk/weather
https://www.bbc.co.uk/iplayer
https://www.bbc.co.uk/sounds
https://www.bbc.co.uk/bitesize
https://www.bbc.co.uk/cbeebies
https://www.bbc.co.uk/cbbc
https://www.bbc.co.uk/food
https://www.bbc.com/
https://www.bbc.com/news
https://www.bbc.com/sport
https://www.bbc.com/reel
https://www.bbc.com/worklife
https://www.bbc.com/travel
https://www.bbc.com/future
https://www.bbc.com/culture
https://www.bbc.co.uk/schedules/p00fzl9m
https://www.bbc.com/weather
https://www.bbc.co.uk/sounds
#orbit-more-drawer
https://search.bbc.co.uk/search?scope=all&destination=news_ps
https://www.bbc.co.uk
https://www.bbc.co.uk/news
https://www.bbc.co.uk/sport
https://www.bbc.co.uk/weather
https://www.bbc.co.uk/iplayer
https://www.bbc.co.uk/sounds
https://w

Most of these are hyperlinks to other parts of the website - but how can we filter to *only* these entries? Well - we can use regex!

In [15]:
links = [link.get('href') for link in soup.find_all('a')]

# Now pattern match (skipping any <a> tags that have no href at all)
hyper_pattern = re.compile(r"https://www\..*")
only_hypers = [s for s in links if s and hyper_pattern.match(s)]
print(only_hypers)

['https://www.bbc.co.uk', 'https://www.bbc.co.uk/accessibility/', 'https://www.bbc.co.uk/notifications', 'https://www.bbc.co.uk', 'https://www.bbc.co.uk/news', 'https://www.bbc.co.uk/sport', 'https://www.bbc.co.uk/weather', 'https://www.bbc.co.uk/iplayer', 'https://www.bbc.co.uk/sounds', 'https://www.bbc.co.uk/bitesize', 'https://www.bbc.co.uk/cbeebies', 'https://www.bbc.co.uk/cbbc', 'https://www.bbc.co.uk/food', 'https://www.bbc.com/', 'https://www.bbc.com/news', 'https://www.bbc.com/sport', 'https://www.bbc.com/reel', 'https://www.bbc.com/worklife', 'https://www.bbc.com/travel', 'https://www.bbc.com/future', 'https://www.bbc.com/culture', 'https://www.bbc.co.uk/schedules/p00fzl9m', 'https://www.bbc.com/weather', 'https://www.bbc.co.uk/sounds', 'https://www.bbc.co.uk', 'https://www.bbc.co.uk/news', 'https://www.bbc.co.uk/sport', 'https://www.bbc.co.uk/weather', 'https://www.bbc.co.uk/iplayer', 'https://www.bbc.co.uk/sounds', 'https://www.bbc.co.uk/bitesize', 'https://www.bbc.co.uk/cbe
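
Regex isn't the only way to filter links. As an alternative sketch, the standard library's `urllib.parse` can turn relative links into full URLs and tell us what kind of link each one is. The `absolute_links` helper and the sample hrefs below (including `/news/uk`) are made up for illustration:

```python
from urllib.parse import urljoin, urlparse

def absolute_links(hrefs, base_url):
    """Resolve hrefs against base_url and keep only http(s) links to other pages."""
    results = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # skip missing hrefs and in-page anchors
        full_url = urljoin(base_url, href)  # relative links become absolute
        if urlparse(full_url).scheme in ("http", "https"):
            results.append(full_url)
    return results

sample_hrefs = [
    "https://www.bbc.co.uk/sport",
    "#skip-to-content-link-target",
    None,
    "/news/uk",
    "mailto:someone@example.com",
]
print(absolute_links(sample_hrefs, "https://www.bbc.co.uk/news"))
# ['https://www.bbc.co.uk/sport', 'https://www.bbc.co.uk/news/uk']
```

Unlike the `https://www\.` pattern, this also keeps links that are written relative to the current page.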

Of course, we now have more links we can iterate over and build a network of how the BBC news website works (please put a delay in otherwise you'll get in trouble!)
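
One way to add that delay is sketched below. The `fetch_politely` helper is hypothetical: it takes the function that downloads a page as an argument, so the same loop works with `requests` or with a stand-in while testing, and the one second default is just a guess at a polite pause.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each url, pausing between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        pages[url] = fetch(url)
    return pages

# With the real web this might look like (not run here):
# pages = fetch_politely(only_hypers[:5], lambda url: requests.get(url, timeout=10).text)

# A stand-in fetch function so the example runs without touching the network
fake_pages = fetch_politely(["page-a", "page-b"], lambda url: url.upper(), delay_seconds=0)
print(fake_pages)  # {'page-a': 'PAGE-A', 'page-b': 'PAGE-B'}
```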

## Image Processing

We can also get all the images from the BBC website with a little bit of work:




In [16]:
imgs = [link.get('data-src') for link in soup.find_all('img')]

print(imgs)

[None, None, 'https://ichef.bbci.co.uk/news/{width}/cpsprodpb/11FB3/production/_127515637__127511687_ros_bowen_thumbnail-bright.jpg', 'https://ichef.bbci.co.uk/news/{width}/cpsprodpb/16C93/production/_127513339_gettyimages-1278767096.jpg', 'https://ichef.bbci.co.uk/news/{width}/cpsprodpb/393B/production/_127515641_johnson-index-getty.jpg', 'https://ichef.bbci.co.uk/news/{width}/cpsprodpb/145E7/production/_127513438_gettyimages-1244462094_cut.jpg', 'https://ichef.bbci.co.uk/news/{width}/cpsprodpb/6FAE/production/_124409582_058ab7c1-729c-4099-9c58-3dd425573af2.jpg', 'https://ichef.bbci.co.uk/news/{width}/cpsprodpb/6FAE/production/_124409582_058ab7c1-729c-4099-9c58-3dd425573af2.jpg', 'https://ichef.bbci.co.uk/news/{width}/cpsprodpb/0FC2/production/_127043040_index_heathrow_getty_11_oct.jpg', 'https://ichef.bbci.co.uk/news/{width}/cpsprodpb/3355/production/_127514131_gettyimages-1438801284.jpg', 'https://ichef.bbci.co.uk/news/{width}/cpsprodpb/3FC1/production/_127512361_gettyimages-1244382

Now we have a very typical problem in programming - our result is good, but not quite there. We have 3 problems:

* We have some `None` values in our list - this is pretty easy to deal with using a simple list comprehension pattern: `[a for a in my_list if a]`.
* We have this `{width}` placeholder in each of the URLs - this is a bit of a funny web thing that lets us put a number in to get an image with the specified width. Once we choose a width, we need to replace `{width}` with that number to get a valid image URL.
* We have a number of duplicate images. We can fix this by converting to a set and back to a list.

We can fix all three in two steps:

In [36]:
# Remove Nones
imgs = [a for a in imgs if a]

# Replace {width} with 400
imgs = list(set([url.replace("{width}", "400") for url in imgs]))

print(imgs)

['https://ichef.bbci.co.uk/news/400/cpsprodpb/155AE/production/_127507478_recessionlondon_index1_getty.jpg', 'https://ichef.bbci.co.uk/news/400/cpsprodpb/16C35/production/_126973239_midterms-hero-desktop.jpg', 'https://ichef.bbci.co.uk/news/400/cpsprodpb/265D/production/_127512890_warnerandsmithv2.jpg', 'https://ichef.bbci.co.uk/news/400/cpsprodpb/16C93/production/_127513339_gettyimages-1278767096.jpg', 'https://ichef.bbci.co.uk/news/400/cpsprodpb/1333/production/_127151940_89f08761-4848-4ac0-aee5-fbf4084a36b5.jpg', 'https://ichef.bbci.co.uk/news/400/cpsprodpb/14A14/production/_127500548_20_col_mortgage_cal976_x_550_mortgage_cal-1-nc.png', 'https://ichef.bbci.co.uk/news/400/cpsprodpb/2FC7/production/_127513221_islay.jpg', 'https://ichef.bbci.co.uk/news/400/cpsprodpb/F032/production/_127509416_gettyimages-1437602423.jpg', 'https://ichef.bbci.co.uk/news/400/cpsprodpb/2AF4/production/_127469901_gettyimages-1419683660.jpg', 'https://ichef.bbci.co.uk/news/400/cpsprodpb/0A51/production/_1275
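
Notice that the images above come out in a different order from the original list - a set doesn't remember the order things were added in. If the order matters, one alternative sketch is to use `dict.fromkeys`, since dictionary keys are unique and keep their insertion order. The `clean_image_urls` helper and the `example.com` URLs are made up for illustration:

```python
def clean_image_urls(raw_urls, width=400):
    """Drop missing entries, fill in {width}, and remove duplicates in order."""
    filled = [url.replace("{width}", str(width)) for url in raw_urls if url]
    # dict keys are unique and keep insertion order, so this de-duplicates
    # without shuffling the list the way set() can
    return list(dict.fromkeys(filled))

sample_urls = [
    None,
    "https://example.com/{width}/a.jpg",
    "https://example.com/{width}/b.jpg",
    "https://example.com/{width}/a.jpg",
]
print(clean_image_urls(sample_urls))
# ['https://example.com/400/a.jpg', 'https://example.com/400/b.jpg']
```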

Now we have a list of all the images, we can load them using a library like `PIL`:

In [37]:
from PIL import Image
from urllib.request import urlopen

all_imgs = []

for url in imgs:
    img = Image.open(urlopen(url))
    all_imgs.append(img)

We can check if this has worked by showing the first image:

In [34]:
all_imgs[0].show()

Now let's create a board for all the images we've found. This is fairly complex and to be fair took a load of trial and error on my part!

In [44]:
# Get the height and width of one image
w = all_imgs[0].size[0]
h = all_imgs[0].size[1]

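# Calculate the height and width of a 4 wide grid of all the images
# (this assumes every image is the same size as the first one)
w_all = w * 4
# Round the number of rows up first, so a partly filled last row still fits
h_all = h * math.ceil(len(all_imgs) / 4)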

# Create a new image with the dimensions above
im = Image.new("RGBA", (w_all, h_all))

# Iterate over the list of images and their index, adding the image to each spot
for idx, image in enumerate(all_imgs):
    im.paste(image, ((idx % 4) * w, (idx // 4) * h))

# Save the image
im.save("grid.png")
# Show the image
im.show()
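
The arithmetic behind the grid can be checked on its own, without loading any images. This sketch assumes every image is the same size; `grid_layout` is a helper written for this example, and 400 by 225 is just a sample image size:

```python
import math

def grid_layout(image_count, columns, cell_width, cell_height):
    """Return the canvas size and the top-left corner of each image."""
    rows = math.ceil(image_count / columns)  # a partly filled row still counts
    canvas_size = (columns * cell_width, rows * cell_height)
    positions = [
        ((i % columns) * cell_width, (i // columns) * cell_height)
        for i in range(image_count)
    ]
    return canvas_size, positions

size, positions = grid_layout(5, 4, 400, 225)
print(size)          # (1600, 450): 4 columns wide, 2 rows tall
print(positions[3])  # (1200, 0): last spot on the first row
print(positions[4])  # (0, 225): first spot on the second row
```

`idx % 4` gives the column (it cycles 0, 1, 2, 3) and `idx // 4` gives the row (it goes up by one every four images), which is the same trick the paste loop uses.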

Phew! Let's switch gears and look at how to use this in a different way.

## Sending Emails with Python

Say we have the above program that scrapes the BBC news homepage, gets all the images and combines them into a grid. What if we want to send this via email? Well, with a bit of work and help from our mail provider we can:


In [45]:
import smtplib
from email.mime.text import MIMEText
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart

# First, we need to read the images as bytes
with open("./grid.png", 'rb') as f:
    img_data = f.read()

from_email = '<from_email address>'
to_email = '<destination_email address>'

msg = MIMEMultipart()
msg['Subject'] = 'This is a test email'
msg['From'] = from_email
msg['To'] = to_email 

text = MIMEText("This is a test email")
msg.attach(text)

image = MIMEImage(img_data, name="grid.png")
msg.attach(image)

server = smtplib.SMTP("smtp.gmail.com", 587)
server.starttls()

server.login(from_email, '<gmail auth key>')

server.sendmail(from_email, to_email, msg.as_string())
server.quit()

(221,
 b'2.0.0 closing connection w11-20020a5d608b000000b002366f9bd717sm4016044wrt.45 - gsmtp')
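
As an alternative sketch, the standard library also has a newer `EmailMessage` class that builds the same kind of message with a little less ceremony. The `build_message` helper, the `example.com` addresses and the stand-in image bytes are all made up for illustration, and nothing is actually sent here:

```python
from email.message import EmailMessage

def build_message(sender, recipient, subject, body, image_bytes, image_name):
    """Build an email with a text body and one PNG attachment."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    # Adding an attachment turns the message into a multipart one for us
    msg.add_attachment(image_bytes, maintype="image", subtype="png", filename=image_name)
    return msg

demo = build_message(
    "sender@example.com",
    "recipient@example.com",
    "This is a test email",
    "This is a test email",
    b"stand-in bytes, not a real image",
    "grid.png",
)
print(demo.get_content_type())  # multipart/mixed

# Sending would then look something like this (not run here):
# with smtplib.SMTP("smtp.gmail.com", 587) as server:
#     server.starttls()
#     server.login(from_email, '<gmail auth key>')
#     server.send_message(demo)
```

Whichever version you use, it's worth keeping the auth key out of the code itself (for example by reading it from an environment variable), so it doesn't end up shared by accident.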

# Exercises

## Pretty Formatting

Write a function that takes in a dictionary of values and prints the key value pairs in the following format:

```
key         | value
key         | value
key         | value
```

## Email RegEx

A few weeks ago we saw that emails are in the form username@domain.tld. Write a RegEx pattern that matches emails in this form.

## Web Scraping

The [official top 40 singles chart](https://www.officialcharts.com/charts/uk-top-40-singles-chart/) contains a table with all the top 40 hits in the UK. Use `BeautifulSoup` to scrape this website to get the top 40 singles in a human readable format (`1. ANTI-HERO - Taylor Swift, 2. UNHOLY - Sam Smith & Kim Petras` etc).

You'll need to use the below query to get started, where `chart_soup` is the output of `BeautifulSoup`:

`tracks = chart_soup.find_all("div", {"class": "track"})`