# CSD 1: Beautiful Soup

In [None]:
ans = {}
ans['id_number'] = 0
ans['HW'] = 'CSD1'

Go to [ebay.com](https://www.ebay.com) and search for little boys' T-shirts. The eBay website, like any modern website, is filled with text, images and links. But if you are using Google Chrome, right-click on any page and choose "View page source" to see the raw HTML behind it.

The Python library [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) helps you parse this raw HTML and extract what you want. Run the following piece of code by clicking on it and pressing the "play" icon in the menu above, or just Ctrl + Enter:

In [None]:
from bs4 import BeautifulSoup
import requests

url = "https://il.ebay.com/b/Boys-Short-Sleeve-Sleeve-Tops-T-Shirts-Sizes-4-Up/175521/bn_4278610?rt=nc&LH_ItemCondition=1000&LH_BIN=1&LH_PrefLoc=3&_pgn=1"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

The above code imports Beautiful Soup, imports the requests library for handling web connections, assigns an ebay search results page address to a variable called `url`, "requests" this URL, stores the response in a variable called `r`, makes a `BeautifulSoup` object out of the response's `content`, and assigns it to a variable called `soup`.
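
To see the same parsing step without a network connection, here is a minimal sketch that feeds Beautiful Soup a tiny hand-written page. The markup and the names `html`, `demo_soup` and `first_paragraph` are made up for illustration, and it uses Python's built-in `"html.parser"` instead of `"lxml"` so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for r.content (hypothetical markup).
html = """
<html>
  <body>
    <p class="intro">Hello, soup!</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

# "html.parser" ships with Python; "lxml" (used above) is a third-party parser.
demo_soup = BeautifulSoup(html, "html.parser")

first_paragraph = demo_soup.find("p")  # first <p> element in the document
print(type(demo_soup))
print(first_paragraph.get_text())
```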

It is advised to visit the URL in your browser, so you have a visual understanding of what you are doing.

Print the raw HTML:

In [None]:
print(soup.prettify())

#### Q1) Replace the `### YOUR CODE HERE ###` comment to print __just the title__ of the page (as a string, without html tags).

Hint 1: The [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Hint 2: type `soup`, then `.`, then press TAB to get possible object members or methods from Jupyter

Hint 3: Keep Calm and [Stack Overflow](https://stackoverflow.com/a/51550)

In [None]:
url_title = ### YOUR CODE HERE ###
print(url_title)
ans['Q1'] = url_title

### links
This is an HTML paragraph tag or element: `<p>This is a paragraph</p>`

This is a hyperlink tag: `<a href="https://www.google.com/">Google it!</a>`

`find_all` links in a page:

In [None]:
all_links = ### YOUR CODE HERE ###
print('all_links is a: ' + str(type(all_links)))
print()
print('first 5 elements in all_links:')
print(all_links[:5])

#### Q2) How many links are there?

In [None]:
ans['Q2'] = ### YOUR answer HERE ###

Notice that the hyperlink tag has an attribute called `href` holding the link's address. In a BeautifulSoup element, to access this attribute you can think of an element as a dictionary and the attribute its key:

In [None]:
print(type(all_links[100]))
print()
print(all_links[100])
print()
print(all_links[100]['href'])

To get the actual dictionary of an element use the `attrs` member:

In [None]:
print(type(all_links[100].attrs))
print()
print(all_links[100].attrs.keys())
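
The dictionary-style access can be tried on a small hand-written snippet. This is only a sketch: the two `<a>` tags and the variable names are made up for illustration. It also shows that square brackets fail on a missing attribute, while `.get` returns `None`.

```python
from bs4 import BeautifulSoup

# Two hypothetical links: one with an address, one without.
html = (
    '<a href="https://www.google.com/" title="Search">Google it!</a>'
    '<a name="top">No address here</a>'
)
demo_soup = BeautifulSoup(html, "html.parser")
with_href, without_href = demo_soup.find_all("a")

print(with_href["href"])          # dictionary-style access to an attribute
print(with_href.attrs)            # the underlying dict of all attributes
print(without_href.get("href"))   # .get returns None instead of raising KeyError

try:
    without_href["href"]
except KeyError:
    print("this link has no href attribute")
```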

### images
Find all image elements in our eBay page and put them in a variable called `images`. You might want to find out what the [HTML tag for an image](https://www.w3schools.com/tags/tag_img.asp) is first.

In [None]:
images = ### YOUR CODE HERE ###

Get a `list` of all image titles from the `images` object, **except for the first one**. Print that list.

Hint: `alt`

In [None]:
image_titles = [img['alt'] for img in images[1:]]
print(image_titles)

What is the attribute for an image JPEG file address?

Some images have the attribute `src` and some `data-src`. This is one way to combine the two. Make sure you understand:

In [None]:
image_files_src = [img['src'] for img in images[1:]]
image_files_datasrc = [img.get('data-src', None) for img in images[1:]]
image_files = [src if datasrc is None else datasrc for src, datasrc in zip(image_files_src, image_files_datasrc)]
image_files[:5]
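
Another way to write the same combination, shown here on made-up markup, is to prefer `data-src` and fall back to `src` in a single expression. The `example.com` addresses and `alt` texts below are placeholders, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical image tags: the second one is lazy-loaded through data-src.
html = """
<img alt="logo" src="https://example.com/logo.png">
<img alt="red shirt" src="https://example.com/placeholder.gif" data-src="https://example.com/red.jpg">
<img alt="blue shirt" src="https://example.com/blue.jpg">
"""
demo_images = BeautifulSoup(html, "html.parser").find_all("img")

# Skip the first image; use data-src when it exists, otherwise src.
demo_files = [img.get("data-src") or img.get("src") for img in demo_images[1:]]
print(demo_files)
```

Because both lookups use `.get`, this variant does not raise an error if an image happens to have no `src` attribute at all.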

### prices
Let's find a shirt's price. 
Go to the URL in your browser and use the code inspection tool (F12) to look interactively at the page's source code.
Find the element that holds the price data.
Notice that the price may be nested within a few levels of HTML tags. You are searching for the "lowest" level, the one that holds the price directly.

In our case it is a `span` element with a specific class.

#### Q3) What is the specific class for span elements holding the prices?

In [None]:
ans['Q3'] = ### YOUR answer HERE ###

In [None]:
price_elements = soup.find_all('span', class_ = ### YOUR answer HERE ###)
print(price_elements[:5])

From each of these `price_elements` we extract the actual price text with the `get_text` method:

In [None]:
print(price_elements[1].get_text())

You can see that every price comes with the "ILS" prefix followed by the number. <br>
You can also see that some prices come as a range. <br>
For this project we decided to simply take the minimum price of the range. <br>
To do so we can split this string into its elements:

In [None]:
print(price_elements[1].get_text().split(' '))

Get the second element:

In [None]:
print(price_elements[1].get_text().split(' ')[1])

And convert it to a float:

In [None]:
print(float(price_elements[1].get_text().split(' ')[1]))
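
The split-based steps above work on strings shaped like the ones on this page. As a different approach, here is a hedged sketch that uses a regular expression on plain strings. The function name `min_price_from_text` and the sample strings are assumptions for illustration, not the real page content.

```python
import re


def min_price_from_text(text):
    """Return the smallest number found in a price string, or None if there is none."""
    # Drop thousands separators, then collect every integer or decimal number.
    numbers = re.findall(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return min(float(n) for n in numbers) if numbers else None


print(min_price_from_text("ILS 35.50"))               # single price
print(min_price_from_text("ILS 35.50 to ILS 60.00"))  # range: the minimum is kept
print(min_price_from_text("ILS 1,250.00"))            # thousands separator
print(min_price_from_text("See price"))               # no number at all
```

Taking the minimum of every number found matches the decision above to keep the lower end of a range, and returning `None` plays the same role as the `except` branch in `parse_price`.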

Your task is to complete the `parse_price` function so that in the end the `prices` variable will hold a list of all the shirts' prices:

In [None]:
def parse_price(price_element):
    try:
        price = ### YOUR CODE HERE ###
    except Exception:
        price = None
    return price

prices = [parse_price(price_e) for price_e in price_elements]

It's time to actually download the shirt images! The following function accepts an image file address, a shirt title and the file name for the image, and attempts to download the image to the current directory with the specified file name:

In [None]:
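def download_image(url, title, file_name):
    try:
        response = requests.get(url)  # fetch the image bytes
    except Exception:
        # the request failed (bad URL, no connection, ...): signal it with empty strings
        return '', ''
    with open(file_name, "wb") as file:
        file.write(response.content)  # save the raw bytes to disk
    return title, file_name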

Download the first image from our page, name it 'test.jpg'. Make sure it was downloaded correctly and see what the function returns:

In [None]:
download_image(### YOUR answer HERE ###)

We will now download all of the page's images, using a loop. 

First, create a folder named 'boys' in the current directory. You can do it right here in this notebook!

In [None]:
!mkdir boys
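
Note that `!mkdir boys` reports an error if the folder already exists, for example when you re-run the notebook. A pure-Python alternative is `os.makedirs` with `exist_ok=True`. The sketch below demonstrates it inside a temporary directory (an assumption made so the demo does not touch your working folder).

```python
import os
import tempfile

# Work inside a throwaway directory so the demo does not touch your project.
base_dir = tempfile.mkdtemp()
boys_dir = os.path.join(base_dir, "boys")

os.makedirs(boys_dir, exist_ok=True)  # creates the folder
os.makedirs(boys_dir, exist_ok=True)  # second call does nothing and raises no error
print(os.path.isdir(boys_dir))
```

In the notebook itself, `os.makedirs('boys', exist_ok=True)` would create the folder in the current directory.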

While downloading, fill in the blanks to correctly create a dictionary called `images_data` which will hold the title of the image, its file name, and the shirt's price:

In [None]:
from ipywidgets import IntProgress
from IPython.display import display


your_answer_1 = ### YOUR CODE HERE ###
your_answer_2 = ### YOUR CODE HERE ###


images_data = {'title': {},
               your_answer_1: {},
               'price': {}}

f = IntProgress(min = 0, max = len(images[1:])) # instantiate a progress bar
display(f) # display the bar

for i in range(len(images[1:])):
    title, file_name = download_image(image_files[i], image_titles[i], './boys/' + str(i) + '.jpg')
    images_data['title'][i] = title
    images_data['file_name'][i] = file_name
    images_data[your_answer_2][i] = prices[i]
    f.value += 1

        
ans['Q4'] = your_answer_1 
ans['Q5'] = your_answer_2

One thing that would prove useful later on is having a dataset which summarizes all we have gathered. That's what `images_data` is for. We're going to use `pandas` to make it a `DataFrame` we can easily read and write:

In [None]:
import pandas as pd
images_data_df = pd.DataFrame(images_data)
images_data_df.head()
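
To see how a dictionary of dictionaries becomes a `DataFrame`, here is a small sketch with made-up data (the column names `item` and `cost` are placeholders). The outer keys become columns and the inner keys become the row index.

```python
import pandas as pd

# Outer keys -> columns, inner keys -> row index (hypothetical data).
demo_data = {
    "item": {0: "pen", 1: "notebook"},
    "cost": {0: 1.5, 1: None},  # a missing value, like a price that failed to parse
}
demo_df = pd.DataFrame(demo_data)

print(demo_df)
print(demo_df.shape)
```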

This was fun, we got 48 images. But we're looking to get about 200 times that many, and the same number of shirt images for girls. The following code was run to get all the boys' shirt images. You can run it to see that it works, or just skim it to see how all the different elements are combined:

In [None]:
boys_url = 'https://il.ebay.com/b/Boys-Short-Sleeve-Sleeve-Tops-T-Shirts-Sizes-4-Up/175521/bn_4278610?rt=nc&LH_ItemCondition=1000&LH_BIN=1&LH_PrefLoc=3&_pgn='
max_pages = 400
boys_items_data = {'title': {}, 'file_id': {}, 'price': {}}
f = IntProgress(min = 0, max = max_pages)
display(f)
all_items_counter = 0

for page_num in range(max_pages):
    url = boys_url + str(page_num)
    try:
        r = requests.get(url)
    except Exception:
        print('Stopped at page: ' + str(page_num))
        break
    soup = BeautifulSoup(r.content, "lxml")
    images = soup.find_all('img')[1:]
    image_titles = [img['alt'] for img in images]
    image_files_src = [img['src'] for img in images]
    image_files_datasrc = [img.get('data-src', None) for img in images]
    image_files = [src if datasrc is None else datasrc for src, datasrc in zip(image_files_src, image_files_datasrc)]
    
    price_elements = soup.find_all('span', class_ = 's-item__price')
    prices = [parse_price(price_e) for price_e in price_elements]
    if len(prices) != len(images):
        # prices cannot be matched to images on this page, so record them as missing
        print('Found unequal number of prices in page_num %d' % page_num)
        prices = [None] * len(images)
        
    for i in range(len(images)):
        title, file_name = download_image(image_files[i], image_titles[i], './boys/' + str(all_items_counter + i) + '.jpg')
        boys_items_data['title'][all_items_counter + i] = title
        boys_items_data['file_id'][all_items_counter + i] = all_items_counter + i
        boys_items_data['price'][all_items_counter + i] = prices[i]
    all_items_counter += len(images)
    f.value += 1

This is how you'll get all the boys' and girls' images more quickly, using the images that were downloaded for you. You should only need to do this once.

First download the compressed file from a remote server:

In [None]:
url = "http://www.tau.ac.il/~saharon/DScourse/ebay_boys_girls_shirts.tar.gz"
r = requests.get(url)

with open("ebay_boys_girls_shirts.tar", "wb") as file:
    file.write(r.content)

Next decompress the file in the datasets folder:

In [None]:
import tarfile

with tarfile.open("ebay_boys_girls_shirts.tar") as tar:
    tar.extractall('.')
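
The downloaded file is gzip-compressed even though it was saved with a `.tar` name. That works because `tarfile.open` in its default read mode detects the compression from the file contents, not from the extension. The sketch below checks this on a tiny archive built in a temporary folder. The file names and the CSV content are made up for illustration.

```python
import io
import os
import tarfile
import tempfile

work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, "demo.tar")  # gzip data despite the ".tar" name

# Build a small gzip-compressed archive holding one CSV file.
payload = b"file_id,price\n0,35.5\n"
with tarfile.open(archive_path, "w:gz") as tar:
    info = tarfile.TarInfo(name="demo/boys_train.csv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# The default read mode detects the compression from the file contents.
with tarfile.open(archive_path) as tar:
    member_names = tar.getnames()
    tar.extractall(work_dir)

print(member_names)
print(os.path.exists(os.path.join(work_dir, "demo", "boys_train.csv")))
```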

You now have in your datasets folder all ~33K boys and girls shirts images. See that you can read the four CSVs holding the metadata for the train and test sets of images:

In [None]:
folder = 'ebay_boys_girls_shirts/'
boys_train_df = pd.read_csv(folder + 'boys_train.csv')
girls_train_df = pd.read_csv(folder + 'girls_train.csv')
boys_test_df = pd.read_csv(folder + 'boys_test.csv')
girls_test_df = pd.read_csv(folder + 'girls_test.csv')
print('N boys train images: %d' % boys_train_df.shape[0])
print('N girls train images: %d' % girls_train_df.shape[0])
print('N boys test images: %d' % boys_test_df.shape[0])
print('N girls test images: %d' % girls_test_df.shape[0])

# Finished!
Now you know how to scrape like a pro! <br>
Please hand in the CSV through Moodle so we can know that too.

In [None]:
import pandas as pd
df_ans = pd.DataFrame.from_dict(ans, orient='index')
if df_ans.shape[0] == 7:
    df_ans.to_csv('{}_{}.csv'.format(ans['HW'],str(ans['id_number'])))
else:
    print("seems like you missed a question, make sure you have run all the code blocks")