## Python Project: Web Scraping Instagram Data With Selenium

### What is Selenium?

#### Selenium is a powerful tool for controlling web browsers programmatically and automating browser tasks. It works with all major browsers and operating systems, and its scripts can be written in several languages, e.g. Python, Java, and C#. We will be working with Python.

### Where can you run Selenium?

#### Selenium Python scripts can drive Firefox, Chrome, Internet Explorer, and other browsers on different operating systems.

### Why learn Selenium with Python?

#### Open source and portable – Selenium is an open-source, portable web testing framework.
#### Combination of tools and DSL – Selenium combines a set of tools with a DSL (Domain Specific Language) to carry out various types of tests.
#### Easier to understand and implement – Selenium commands are grouped into classes, which makes them easier to understand and implement.
#### Less burden and stress for testers – With automation, the time needed to repeat the same test scenarios on every new build drops to almost zero, which reduces the burden on the tester.
#### Cost reduction for business clients – A business pays testers for the hours spent on manual testing. An automation testing tool saves that time, and with it part of the cost.

## AUTOMATING IMAGE EXTRACTION

This notebook demonstrates web scraping Instagram with the Selenium library.
The code is written to be usable by all kinds of users and addresses common issues encountered while web scraping. It also extracts the full-size images, not only the thumbnails.

#### Importing Important Libraries

In [1]:
#!pip install selenium
#import selenium libraries
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import time #This module provides various functions to manipulate time values.

### A Selenium Python script needs the WebDriver that matches your browser. Here I am using ChromeDriver. Download the latest stable release of ChromeDriver from the link below:
https://chromedriver.chromium.org/

### Log in to your Instagram Account

In [2]:
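#specify the path to chromedriver.exe (download and save on your computer)
#Selenium 4 takes the driver path through a Service object instead of a positional argument
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('C:\\Users\\SUBRAT PATRA\\Downloads\\Compressed\\chromedriver.exe'))

#open the webpage of Instagram
driver.get("https://www.instagram.com")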

In [3]:
#target username and password
#selecting input field

username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']")))
password = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']")))


### What is CSS_SELECTOR?
#### A CSS selector combines an element selector with a selector value to identify particular elements on a web page. Like XPath, a CSS selector can locate web elements that have no ID, class, or name.
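
To make the selector `input[name='username']` concrete, here is a minimal sketch that uses only the standard library and no browser. The `LOGIN_FORM` markup and the `AttributeMatcher` and `match` names are hypothetical, made up for illustration. This is not Instagram's real page and not how Selenium resolves selectors internally.

```python
from html.parser import HTMLParser


class AttributeMatcher(HTMLParser):
    """Collect tags that match a selector of the form tag[attr='value']."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == self.tag and dict(attrs).get(self.attr) == self.value:
            self.matches.append(dict(attrs))


# Hypothetical markup, loosely shaped like a login form
LOGIN_FORM = """
<form>
  <input name="username" type="text">
  <input name="password" type="password">
  <button type="submit">Log In</button>
</form>
"""


def match(tag, attr, value, html=LOGIN_FORM):
    """Return the attributes of every tag matching tag[attr='value']."""
    parser = AttributeMatcher(tag, attr, value)
    parser.feed(html)
    return parser.matches


print(match("input", "name", "username"))
```

The selector names the tag (`input`) and then narrows it by an attribute (`name='username'`), so only the first input field is returned.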

In [4]:
#enter username and password
#replace the placeholders below with your own credentials
username.clear()
username.send_keys("your_username")
password.clear()
password.send_keys("your_password")
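
Typing credentials directly into a notebook cell makes them easy to leak when the notebook is shared. One possible alternative, sketched here under assumptions, is to read them from environment variables and fall back to an interactive prompt. The variable names `IG_USERNAME` and `IG_PASSWORD` and the helper `load_credentials` are hypothetical, chosen only for this example.

```python
import os
from getpass import getpass


def load_credentials(env=os.environ):
    """Return (username, password) from env, prompting for anything missing."""
    # IG_USERNAME / IG_PASSWORD are assumed names, not an Instagram or Selenium convention
    username = env.get("IG_USERNAME") or input("Instagram username: ")
    password = env.get("IG_PASSWORD") or getpass("Instagram password: ")
    return username, password
```

The returned values can then be passed to `username.send_keys(...)` and `password.send_keys(...)` in place of string literals.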

In [5]:
#target the login button and click it
#click() returns None, so there is nothing useful to store in a variable
WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']"))).click()

#element_to_be_clickable: an expectation that checks the element is visible and enabled so that you can click it.


#### Bang on! We are logged in.

### Handle Alerts
#### You might get a single alert or two of them. Please adjust the cells below accordingly.

In [6]:
#to proceed further, let's click "Not Now"

time.sleep(5) #delay execution for a given number of seconds

WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Not Now")]'))).click()

#the first "Not Now" alert asks whether to save the login information.


In [7]:
WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Not Now")]'))).click()

#the second "Not Now" alert asks whether to turn on notifications.
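
Since the number of alerts can vary, one option is to let the code stop as soon as no further "Not Now" button appears, instead of editing cells by hand. The sketch below is an assumption-laden illustration: `click_not_now` and `dismiss_dialogs` are hypothetical helper names, and the Selenium imports sit inside the function only so that the loop logic can be tried without a browser.

```python
def click_not_now(driver, timeout=15):
    """Click a "Not Now" button if one shows up; return False on timeout."""
    # Imported here so dismiss_dialogs below stays usable without Selenium installed
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    try:
        WebDriverWait(driver, timeout).until(
            EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Not Now")]'))
        ).click()
        return True
    except TimeoutException:
        return False


def dismiss_dialogs(try_click, max_dialogs=2):
    """Call try_click() up to max_dialogs times and stop at the first failure."""
    dismissed = 0
    for _ in range(max_dialogs):
        if not try_click():
            break
        dismissed += 1
    return dismissed


# With a live session: dismiss_dialogs(lambda: click_not_now(driver))
```

The return value is the number of dialogs that were actually dismissed, which is 1 or 2 in the situations described above.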

### What is XPATH?
#### In Selenium automation, XPath is used when an element cannot be found with the general locators such as id, class, or name. XPath (XML Path Language) is a syntax for navigating the structure of a document with path expressions. It works for both HTML and XML documents and locates an element through its position and attributes in the DOM.
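
As a small illustration of path expressions outside the browser, the sketch below runs an attribute predicate with the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. The `PAGE` markup is hypothetical and much simpler than the real Instagram page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for this example
PAGE = """
<html><body>
  <input placeholder="Search" type="text" />
  <div role="dialog">
    <button>Save Info</button>
    <button>Not Now</button>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Same idea as //input[@placeholder='Search'] in the notebook
search_inputs = root.findall(".//input[@placeholder='Search']")

# ElementTree has no contains(), so the text check is done in Python instead
not_now_buttons = [b for b in root.iter("button") if "Not Now" in (b.text or "")]

print(len(search_inputs), [b.text for b in not_now_buttons])
```

In Selenium the browser evaluates the full XPath expression, so `contains(text(), "Not Now")` can be written directly in the locator.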

### Now let's head into the "Search Box"

### Let's search for a certain hashtag. Example: (#earth)

In [8]:
#target the search input field
searchbox = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@placeholder='Search']")))
searchbox.clear()

In [9]:
#search for the hashtag earth
keyword = "#earth"
searchbox.send_keys(keyword) #send the hashtag to the search field

#this types the hashtag earth into the search box

In [10]:
#target the search result that links to the hashtag page
time.sleep(5) # Wait for 5 seconds
my_link = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@href, '/" + keyword[1:] + "/')]")))
my_link.click()

## Scroll Down

### Increase n_scrolls to select more photos (depending on screen resolution)
#### Example:

- 2 scrolls cover approx. 84 photos
- 3 scrolls cover approx. 132 photos

In [12]:
#scroll down 2 times
#increase n_scrolls to scroll more
n_scrolls = 2
for j in range(0, n_scrolls):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
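
A fixed scroll count depends on screen resolution, as noted above. A possible variation, sketched here as an assumption and not as the notebook's method, is to keep scrolling until the page height stops growing or a limit is reached. `scroll_to_bottom`, `max_scrolls`, and `pause` are hypothetical names.

```python
import time


def scroll_to_bottom(driver, max_scrolls=10, pause=2.0):
    """Scroll until the page height stops changing; return the scrolls performed."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more posts
        new_height = driver.execute_script("return document.body.scrollHeight")
        scrolls += 1
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return scrolls
```

Calling `scroll_to_bottom(driver, max_scrolls=3, pause=5)` caps the work at three scrolls, so the limit serves the same purpose as `n_scrolls`.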

In [13]:
#target all the link elements on the page

anchors = driver.find_elements(By.TAG_NAME, 'a')
anchors = [a.get_attribute('href') for a in anchors]

#narrow down all links to image links only

anchors = [a for a in anchors if str(a).startswith("https://www.instagram.com/p/")]

print('Found ' + str(len(anchors)) + ' links to images')
anchors[:5]

Found 45 links to images


['https://www.instagram.com/p/CMj-y3YFw9W/',
 'https://www.instagram.com/p/CMkE3j9nTfg/',
 'https://www.instagram.com/p/CMkTOxFAmLO/',
 'https://www.instagram.com/p/CMkIcDmAH6b/',
 'https://www.instagram.com/p/CMkAoq3BwlQ/']
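
The same post can be linked more than once on a page, and some anchors have no `href` at all. The filtering step can be sketched as a small function that also removes duplicates while keeping the original order. `filter_post_links` is a hypothetical helper name, and the sample list below is made up apart from the one URL taken from the output above.

```python
def filter_post_links(hrefs, prefix="https://www.instagram.com/p/"):
    """Keep unique hrefs that start with prefix, preserving first-seen order."""
    seen = set()
    links = []
    for href in hrefs:
        # get_attribute('href') can return None, so check the type first
        if isinstance(href, str) and href.startswith(prefix) and href not in seen:
            seen.add(href)
            links.append(href)
    return links


sample = [
    "https://www.instagram.com/p/CMj-y3YFw9W/",
    None,
    "https://www.instagram.com/explore/",
    "https://www.instagram.com/p/CMj-y3YFw9W/",
]
print(filter_post_links(sample))
```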

In [14]:
images = []

#follow each post link and keep only the image at index=1
#we need the src attribute of the <img> elements

for a in anchors:
    driver.get(a)
    time.sleep(5)
    img = driver.find_elements(By.TAG_NAME, 'img')
    img = [i.get_attribute('src') for i in img]
    if len(img) > 1: #skip pages that expose fewer than two images
        images.append(img[1])
    
images[:5]

['https://instagram.fblr1-4.fna.fbcdn.net/v/t51.2885-15/e35/161281040_916520965816769_5372746183060585742_n.jpg?tp=1&_nc_ht=instagram.fblr1-4.fna.fbcdn.net&_nc_cat=1&_nc_ohc=YjWpoFtrV5EAX9IxBoc&ccb=7-4&oh=95dc9f928cd8666494221a3144ce53bd&oe=607DA427&_nc_sid=86f79a',
 'https://instagram.fblr1-4.fna.fbcdn.net/v/t51.2885-15/e35/162543015_492536795224495_4780597981339514474_n.jpg?tp=1&_nc_ht=instagram.fblr1-4.fna.fbcdn.net&_nc_cat=1&_nc_ohc=A2gQZWmQ_AoAX8sMFDR&ccb=7-4&oh=b4aa5e35ecb03bdb378ed5e061bb0622&oe=607EE42C&_nc_sid=86f79a',
 'https://instagram.fblr1-5.fna.fbcdn.net/v/t51.2885-15/e35/p1080x1080/161254538_312692036953511_7730331982942911213_n.jpg?tp=1&_nc_ht=instagram.fblr1-5.fna.fbcdn.net&_nc_cat=111&_nc_ohc=Ihjl47PFRAcAX9k9Mv2&ccb=7-4&oh=e7ad718cd3d78e7c19fb513cacd1511a&oe=607BE29A&_nc_sid=86f79a',
 'https://instagram.fblr1-4.fna.fbcdn.net/v/t51.2885-15/e35/p1080x1080/161405238_260984018999809_950882679840762589_n.jpg?tp=1&_nc_ht=instagram.fblr1-4.fna.fbcdn.net&_nc_cat=101&_nc_ohc=

## Save images to the computer

### First we'll create a new folder for our images somewhere on our computer. Then we'll save all the images there.

In [16]:
#!pip install wget
import os
import wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py): started
  Building wheel for wget (setup.py): finished with status 'done'
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9686 sha256=d718ac99b0946ecb3ca13476957f84b1e60493112b9c55a59511c1784a05b471
  Stored in directory: c:\users\subrat patra\appdata\local\pip\cache\wheels\bd\a8\c3\3cf2c14a1837a4e04bd98631724e81f33f462d86a1d895fae0
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [17]:
path = os.getcwd()
path = os.path.join(path, keyword[1:])

#instead of hard-coding the string 'earth', I reuse the keyword defined earlier; keyword[1:] drops the leading '#'.

In [18]:
#create the directory (exist_ok avoids an error when the cell is run again)

os.makedirs(path, exist_ok=True)
path

'C:\\Users\\SUBRAT PATRA\\Project on DS and Python\\earth'

### Download  the images

In [19]:
#download images

counter = 0
for image in images:
    save_as = os.path.join(path, keyword[1:] + str(counter) + '.jpg')
    wget.download(image, save_as)
    counter += 1

100% [............................................................................] 183173 / 183173
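
If installing `wget` is not an option, the same download loop can be sketched with the standard library's `urllib.request.urlretrieve`. This is an alternative under assumptions, not the notebook's method: `download_images` is a hypothetical helper, and it simply skips empty entries.

```python
import os
import urllib.request


def download_images(urls, folder, prefix):
    """Save each URL as <prefix><index>.jpg inside folder; return the saved paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        if not url:
            continue  # skip missing src values
        save_as = os.path.join(folder, f"{prefix}{index}.jpg")
        urllib.request.urlretrieve(url, save_as)
        saved.append(save_as)
    return saved
```

With the variables from the cells above, the call would be `download_images(images, path, keyword[1:])`.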

## Conclusion:

### In conclusion, with the code above we can log in to an Instagram account and download the pictures associated with a given hashtag.