# MechanicalSoup

MechanicalSoup is a Python package for automating interaction with websites. It automatically stores and sends cookies, follows redirects, and can follow links and submit forms on a web page. It was created by M Hickford, who had long admired the Mechanize library. Mechanize, a project started by John J. Lee, enables programmatic web browsing in Python; Kovid Goyal took over its maintenance in 2017.

To read more about it, refer to [this](https://analyticsindiamag.com/mechanicalsoup-web-scraping-custom-dataset-tutorial/) article.

# Code Implementation

You can install MechanicalSoup from PyPI (the Python Package Index) using pip.

The install command also downloads MechanicalSoup's dependencies, including BeautifulSoup, requests, six, and urllib3.

In [None]:
!python -m pip install pip --upgrade --user -q
!python -m pip install MechanicalSoup wget --user -q

In [None]:
import IPython
#restart the kernel so the newly installed packages can be imported
IPython.Application.instance().kernel.do_shutdown(True)

## Quickstart

Let's check that the library works by opening a page and printing its URL:

In [None]:
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
url = "http://httpbin.org/"

browser.open(url)
print(browser.get_url())

MechanicalSoup can also follow links on the current page. Here we follow the link whose URL matches "forms":

In [None]:
browser.follow_link("forms")
browser.get_url()

Now we are on a new page, http://httpbin.org/forms/post. Let’s extract the page content:

In [None]:
browser.get_current_page()

The page is returned as a BeautifulSoup object, so you can find any tag with `find_all`:

In [None]:
browser.get_current_page().find_all('legend')
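
Because the page is a BeautifulSoup object, the usual BeautifulSoup search methods apply. The sketch below is a minimal offline illustration: the HTML snippet is made up for the example (it is not the real httpbin page), and it is parsed with Python's built-in `html.parser` so no network access is needed.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet standing in for a downloaded page
html = """
<form action="/post">
  <fieldset><legend>Pizza Size</legend></fieldset>
  <fieldset><legend>Pizza Toppings</legend></fieldset>
  <input name="custname">
</form>
"""

page = BeautifulSoup(html, "html.parser")

# find_all returns a list with every matching tag
legends = [tag.get_text(strip=True) for tag in page.find_all("legend")]
print(legends)

# find returns only the first match (or None); attributes can act as filters
name_input = page.find("input", {"name": "custname"})
print(name_input["name"])
```

The same calls should work on the object returned by `browser.get_current_page()`.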

We can also fill in forms and send POST requests with MechanicalSoup. First, select the form and print a summary of its fields:

In [None]:
browser.select_form('form[action="/post"]')
browser.get_current_form().print_summary()

To fill in the form, assign a value to each field by name:

In [None]:
browser["custname"] = "Mohit"
browser["custtel"] = "9081515151"
browser["custemail"] = "mohitmaithani@aol.com"
browser["comments"] = "please make pizza dough more soft"
browser["size"] = "large"
browser["topping"] = "mushroom"
 
#open the filled-in form in a local web browser to inspect it
browser.launch_browser()

## Let’s scrape cat images from the internet using MechanicalSoup and create a custom dataset

It’s a good use case. The first step of every data science project is to create or collect data; processing, cleaning, analysis, modeling, and tuning come afterwards. Now that we are familiar with the essential API, let’s jump straight to the code:

> * Search cats on Google Images

We open Google Images, set the search query to "cat", and submit the search form.

In [None]:
import mechanicalsoup
 
browser = mechanicalsoup.StatefulBrowser()
url = "https://www.google.com/imghp?hl=en"
 
browser.open(url)
 
#get HTML
browser.get_current_page()
 
#target the search input
browser.select_form()
browser.get_current_form().print_summary()
 
#search for a term
search_term = 'cat'
browser["q"] = search_term 
 
#submit/"click" search
browser.launch_browser()
response = browser.submit_selected()
 
print('new url:', browser.get_url())
print('response:\n', response.text[:500])

Open the results page and collect all the image tags. The result is a list of image source URLs:

In [None]:
#open URL
new_url = browser.get_url()
browser.open(new_url)

#get HTML code
page = browser.get_current_page()
all_images = page.find_all('img')

#target the source attributes of image
image_source = []
for image in all_images:
    image = image.get('src')
    image_source.append(image)

image_source[5:25]

Let’s filter out the unusable URLs

We use Python's `startswith` method to keep only the URLs that begin with `https`:

In [None]:
#keep only the links that start with "https" (images without a src are skipped)
image_source = [image for image in image_source if image and image.startswith('https')]

print(image_source)
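
Keeping only the values that start with `https` also discards relative paths that could be resolved against the page URL. As an alternative sketch (not part of the original notebook), the hypothetical helper `clean_image_urls` below uses the standard library's `urljoin` to turn relative paths into absolute URLs, and skips missing values and inline `data:` URIs. The sample inputs are invented for illustration.

```python
from urllib.parse import urljoin, urlparse


def clean_image_urls(sources, base_url):
    """Return absolute http(s) URLs built from raw `src` values."""
    cleaned = []
    for src in sources:
        if not src:  # an <img> without a src attribute gives None
            continue
        absolute = urljoin(base_url, src)  # resolves relative paths
        if urlparse(absolute).scheme in ("http", "https"):
            cleaned.append(absolute)  # data: URIs and the like are dropped
    return cleaned


# Invented sample values, similar in shape to what a page may contain
sample = [
    None,
    "/images/logo.png",
    "data:image/gif;base64,R0lGODlhAQABAAAAACw=",
    "https://example.com/cat.jpg",
]
result = clean_image_urls(sample, "https://www.google.com/search?q=cat")
print(result)
```

In the notebook this could be called as `clean_image_urls(image_source, browser.get_url())`.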

Create a local directory to store the cat images.

In [None]:
import os

#build the target directory path, e.g. ./cats
path = os.path.join(os.getcwd(), search_term + "s")

#create the directory (no error if it already exists)
os.makedirs(path, exist_ok=True)

#show where the cat images are going to be saved
path


Download the images using wget:

In [None]:
import os
import wget

#download each image into the directory stored in "path"
counter = 0
for image in image_source:
    save_as = os.path.join(path, search_term + str(counter) + '.jpg')
    wget.download(image, save_as)
    counter += 1
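
A single failing URL stops the loop above. A more defensive variant is sketched below using only the standard library. `download_all` and its `fetch` parameter are hypothetical names introduced here, and `urllib.request.urlretrieve` stands in for `wget.download`. Passing the download function as a parameter makes the loop easy to try without network access.

```python
import os
import urllib.request


def download_all(urls, directory, prefix, fetch=urllib.request.urlretrieve):
    """Download each URL into `directory`; return the paths that were saved."""
    os.makedirs(directory, exist_ok=True)
    saved = []
    for counter, url in enumerate(urls):
        save_as = os.path.join(directory, f"{prefix}{counter}.jpg")
        try:
            fetch(url, save_as)
        except (OSError, ValueError) as error:  # network or malformed-URL errors
            print(f"skipped {url}: {error}")
            continue
        saved.append(save_as)
    return saved
```

For example, `download_all(image_source, path, search_term)` would mirror the loop above while skipping any URL that cannot be fetched.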