# Image Retrieval From Instagram


Goal: collect image data from Instagram and then preprocess it.

Image size: 224*224

Resolution: 

Number of images: 



#### Websites: 

This notebook's code is based on the following tutorials: 

https://medium.com/@srujana.rao2/scraping-instagram-with-python-using-selenium-and-beautiful-soup-8b72c186a058

https://edmundmartin.com/scraping-instagram-with-python/

https://michaeljsanders.com/2017/05/12/scrapin-and-scrollin.html

**Important Note:** *Remember to respect users’ rights when you download copyrighted content. Do not use images/videos from Instagram for commercial purposes.*

### 1. Import dependencies

Install Selenium with `pip install selenium` and download ChromeDriver from http://chromedriver.chromium.org/.

In [1]:
import time
import re
import json
from pandas import json_normalize  # top-level import (pandas >= 1.0); the pandas.io.json path is deprecated
import pandas as pd, numpy as np

# to install
from selenium import webdriver
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

### 2. Open the web browser: 

Selenium uses ChromeDriver to open the profile page of a given public user.

Download the latest ChromeDriver that supports your Chrome version from here: https://sites.google.com/a/chromium.org/chromedriver/downloads

Then make the driver available to Selenium in one of the following ways:

1. Put the ChromeDriver executable in the same directory as this Python script

2. Add it to your system path

3. Specify the location directly via executable_path
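
The three options above can be combined into one small lookup routine. The following is a minimal sketch that is not part of the original tutorials: the helper name `resolve_chromedriver` and its parameters are hypothetical, and it uses only the standard library to locate the executable.

```python
import shutil
from pathlib import Path


def resolve_chromedriver(explicit_path=None, search_dir="."):
    """Return a path to the ChromeDriver executable, or None if not found.

    Hypothetical helper: tries an explicit path first (option 3), then the
    given directory (option 1), then the system path (option 2).
    """
    if explicit_path and Path(explicit_path).is_file():
        return str(Path(explicit_path))
    # The file is called chromedriver.exe on Windows and chromedriver elsewhere
    for name in ("chromedriver", "chromedriver.exe"):
        candidate = Path(search_dir) / name
        if candidate.is_file():
            return str(candidate)
    # shutil.which returns None when nothing on the system path matches
    return shutil.which("chromedriver")
```

The returned path can then be used wherever the notebook asks you to specify the driver location.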


In [2]:
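# to specify
# Note: recent Selenium 4 releases no longer accept the path as a positional argument;
# there, use webdriver.Chrome(service=Service(path)) with Service from selenium.webdriver.chrome.service
browser = webdriver.Chrome('C:/Users/Anonym/Documents/GitHub/DLfM_BrandManagement/chromedriver.exe')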

By now, a new and empty Chrome window has popped up.

#### User-profile Page

If you want to open a user-profile page, specify the username as: 

In [3]:
# to specify
username='pickuplimes'
browser.get('https://www.instagram.com/'+username+'/?hl=en')

The desired webpage loads in the empty Chrome window. 

In [4]:
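# Scroll to the bottom of the page and return its height (scrollTo alone returns nothing, so Pagelength would be None)
Pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;")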
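
If the page loads more posts as you scroll, one scroll may not reach the end of the feed. Below is a minimal sketch of repeated scrolling, assuming the page height stops growing once no more content loads. The function name `scroll_to_bottom`, the pause length, and the scroll limit are illustrative choices, not part of the original notebook; `browser` is any object with Selenium's `execute_script` method.

```python
import time


def scroll_to_bottom(browser, pause=2.0, max_scrolls=20):
    """Scroll until the page height stops changing; return the final height."""
    last_height = browser.execute_script("return document.body.scrollHeight;")
    for _ in range(max_scrolls):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to load the next batch of content
        time.sleep(pause)
        new_height = browser.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return last_height
```

The `max_scrolls` limit keeps the loop from running indefinitely on very long feeds.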

#### Hashtag Page

If you want to open a hashtag page (instead of a user profile): 

In [6]:
# to specify
hashtag='food'
# to specify
browser = webdriver.Chrome('C:/Users/Anonym/Documents/GitHub/DLfM_BrandManagement/chromedriver.exe')

browser.get('https://www.instagram.com/explore/tags/'+hashtag)
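# Scroll to the bottom of the page and return its height (scrollTo alone returns nothing, so Pagelength would be None)
Pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;")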
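
The two page types differ only in how the URL is built. As a minimal sketch, the helpers below construct both URLs; the function names are hypothetical, and stripping a leading `@` or `#`, URL-quoting the name, and the trailing slash on the hashtag URL are assumptions added for robustness.

```python
from urllib.parse import quote

BASE_URL = "https://www.instagram.com"


def profile_url(username):
    """Build the profile page URL used for a user-profile page."""
    return f"{BASE_URL}/{quote(username.lstrip('@'), safe='')}/?hl=en"


def hashtag_url(hashtag):
    """Build the explore URL used for a hashtag page."""
    return f"{BASE_URL}/explore/tags/{quote(hashtag.lstrip('#'), safe='')}/"


print(profile_url("pickuplimes"))
print(hashtag_url("#food"))
```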

### 3. Parse HTML source page: 

Open the page source and parse it with Beautiful Soup. Go through the body of the HTML, **extract a link for each image on that page**, and append it to the initially empty list `links`.

In [5]:
#Extract links from user profile page
links=[]
source = browser.page_source
data=bs(source, 'html.parser')
body = data.find('body')
# `t and ...` skips <script> tags without inline text, for which t is None
script = body.find('script', string=lambda t: t and t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)
for link in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    links.append('https://www.instagram.com'+'/p/'+link['node']['shortcode']+'/')

In [73]:
# check out links list 
len(links)

12
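
To make the parsing step easier to follow without a live browser, here is a minimal sketch that runs the same idea on a small synthetic page. The HTML string and its shortcodes are invented for illustration, and the sketch uses a regular expression from the standard library instead of Beautiful Soup, so it is a variation on the cell above rather than the same code.

```python
import json
import re

# Invented sample page that mimics the window._sharedData layout used above
SAMPLE_HTML = """<html><body>
<script type="text/javascript">window._sharedData = {"entry_data": {"ProfilePage": [{"graphql": {"user": {"edge_owner_to_timeline_media": {"edges": [{"node": {"shortcode": "AAA111"}}, {"node": {"shortcode": "BBB222"}}]}}}}]}};</script>
</body></html>"""


def extract_shared_data(html):
    """Return the window._sharedData object embedded in the page as a dict."""
    match = re.search(
        r"window\._sharedData\s*=\s*(\{.*?\})\s*;\s*</script>", html, re.DOTALL
    )
    if match is None:
        raise ValueError("window._sharedData not found in page source")
    return json.loads(match.group(1))


data = extract_shared_data(SAMPLE_HTML)
edges = data["entry_data"]["ProfilePage"][0]["graphql"]["user"][
    "edge_owner_to_timeline_media"]["edges"]
links = ["https://www.instagram.com/p/" + e["node"]["shortcode"] + "/" for e in edges]
print(links)
```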

In [None]:
#Extract links from hashtag page
links=[]
source = browser.page_source
data=bs(source, 'html.parser')
body = data.find('body')
# `t and ...` skips <script> tags without inline text, for which t is None
script = body.find('script', string=lambda t: t and t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)
for link in data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']:
    links.append('https://www.instagram.com'+'/p/'+link['node']['shortcode']+'/')
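
The profile and hashtag cells differ only in the key path into the parsed JSON. A minimal sketch of a shared helper follows; the name `post_links`, the `EDGE_PATHS` table, and the sample data are hypothetical, while the key names are the ones used in the two cells above.

```python
BASE_URL = "https://www.instagram.com"

# Key paths taken from the two extraction cells above
EDGE_PATHS = {
    "profile": ("ProfilePage", "user", "edge_owner_to_timeline_media"),
    "hashtag": ("TagPage", "hashtag", "edge_hashtag_to_media"),
}


def post_links(shared_data, page_type):
    """Return post URLs from parsed window._sharedData for the given page type."""
    page_key, owner_key, edge_key = EDGE_PATHS[page_type]
    edges = shared_data["entry_data"][page_key][0]["graphql"][owner_key][edge_key]["edges"]
    return [f"{BASE_URL}/p/{edge['node']['shortcode']}/" for edge in edges]


# Invented sample data with the same nesting as a hashtag page
sample = {"entry_data": {"TagPage": [{"graphql": {"hashtag": {
    "edge_hashtag_to_media": {"edges": [{"node": {"shortcode": "CCC333"}}]}}}}]}}
print(post_links(sample, "hashtag"))
```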

### 4. Extract information (number of followers, image files) from a user's Instagram profile: 

Based on: https://edmundmartin.com/scraping-instagram-with-python/

Install the non-standard libraries: requests, BeautifulSoup

In [7]:
from random import choice
import json

# to install
import requests
from bs4 import BeautifulSoup

Rotating user agents is a common practice in web scraping and can help you avoid detection. If the caller of our class provides a list of user agents, we pick one at random from that list; otherwise we fall back to our default user agent.

In [8]:
_user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
]
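
As a standalone illustration of this fallback rule, here is a minimal sketch outside of any class. The function name `pick_user_agent` and the `ExampleAgent/1.0` string are hypothetical; the default agent is the one listed above.

```python
from random import choice

DEFAULT_USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
]


def pick_user_agent(user_agents=None):
    """Pick a random agent from the caller's list, else from the defaults."""
    if user_agents and isinstance(user_agents, list):
        return choice(user_agents)
    return choice(DEFAULT_USER_AGENTS)


# The chosen agent is sent as the User-Agent request header
headers = {"User-Agent": pick_user_agent(["ExampleAgent/1.0"])}
print(headers)
```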

Define a class called InstagramScraper: 

In [64]:
class InstagramScraper:

    def __init__(self, user_agents=None, proxy=None):
        self.user_agents = user_agents
        self.proxy = proxy

    def __random_agent(self):
        if self.user_agents and isinstance(self.user_agents, list):
            return choice(self.user_agents)
        return choice(_user_agents)

    def __request_url(self, url):
        """Thin wrapper around requests.get.
        We pass in a URL and make a request using a random user agent and the configured proxy.
        If Instagram responds with an error status code we raise requests.HTTPError;
        any other request failure is re-raised unchanged.
        If everything goes fine, we return the page's HTML."""
        try:
            response = requests.get(url, headers={'User-Agent': self.__random_agent()}, proxies={'http': self.proxy,
                                                                                                 'https': self.proxy})
            response.raise_for_status()
        except requests.HTTPError:
            raise requests.HTTPError('Received non 200 status code from Instagram')
        except requests.RequestException:
            # Re-raise the original exception so its message and traceback are kept
            raise
        else:
            return response.text


    @staticmethod
    def extract_json_data(html):
        """Instagram serves all of the information about a user as a JavaScript object.
        This means that we can extract a user's profile information and their recent posts by just
        making an HTTP request to their profile page. We simply need to turn this JavaScript object
        into JSON, which is very easy to do."""
        soup = BeautifulSoup(html, 'html.parser')
        body = soup.find('body')
        # Select the <script> tag that holds window._sharedData rather than the first one in <body>
        script_tag = body.find('script', string=lambda t: t and t.strip().startswith('window._sharedData'))
        # Strip only the assignment prefix and the trailing semicolon; removing every ';'
        # would corrupt JSON strings that contain one
        raw_string = script_tag.text.strip().split('=', 1)[1].strip().rstrip(';')
        return json.loads(raw_string)

    def profile_page_metrics(self, profile_url):
        results = {}
        try:
            response = self.__request_url(profile_url)
            json_data = self.extract_json_data(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']
        except Exception as e:
            raise e
        else:
            for key, value in metrics.items():
                if key != 'edge_owner_to_timeline_media':
                    if value and isinstance(value, dict):
                        value = value['count']
                        results[key] = value
                    elif value:
                        results[key] = value
        return results

    def profile_page_posts(self, profile_url):
        results = []
        try:
            response = self.__request_url(profile_url)
            json_data = self.extract_json_data(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']["edges"]
            #pprint(metrics)
        except Exception as e:
            raise e
        else:
            for node in metrics:
                node = node.get('node')
                #if node and isinstance(node, dict): #this line only gets most recent post out
                results.append(node)
        return results
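
To show what `profile_page_metrics` does to the user object, here is a minimal sketch of the same flattening rule applied to invented data. The function name `flatten_metrics` and the sample values are hypothetical, and unlike the method above the sketch skips dictionaries that have no `count` key instead of raising an error.

```python
def flatten_metrics(user):
    """Collapse {'count': n} entries to n and drop empty values and the post list."""
    results = {}
    for key, value in user.items():
        if key == 'edge_owner_to_timeline_media':
            continue  # the posts themselves are handled separately
        if isinstance(value, dict):
            if 'count' in value:
                results[key] = value['count']
        elif value:
            results[key] = value
    return results


# Invented sample with the same kinds of values as a profile's user object
sample_user = {
    'edge_followed_by': {'count': 1200},
    'full_name': 'Example User',
    'is_private': False,
    'biography': '',
    'edge_owner_to_timeline_media': {'count': 3, 'edges': []},
}
print(flatten_metrics(sample_user))
```

Note that falsy values such as `False` or an empty string are dropped, which matches the behaviour of the method above.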

Specify the username of the Instagram profile whose page you want to scrape. Get a list of dictionaries with all information (image, comments, etc.) about the posts on that Instagram profile. 

In [68]:
# get posts (images) from profile page 
from pprint import pprint

# to specify
username='pickuplimes'
url = 'https://www.instagram.com/'+username+'/?hl=en'

k = InstagramScraper()
results = k.profile_page_posts(url)

print('Posts on Instagram profile page: ', len(results))
pprint(results[1]['display_url'])

Posts on Instagram profile page:  12
'https://instagram.fzrh2-1.fna.fbcdn.net/v/t51.2885-15/e35/89830458_279055869752422_1934838557654693738_n.jpg?_nc_ht=instagram.fzrh2-1.fna.fbcdn.net&_nc_cat=110&_nc_ohc=TatfJYTxmiAAX_jXpLZ&oh=27788c795031ee04f31919525f5b98a0&oe=5E82CA49'


In [41]:
# get profile page metrics
from pprint import pprint

# to specify
username='pickuplimes'
url = 'https://www.instagram.com/'+username+'/?hl=en'

k = InstagramScraper()
metrics = k.profile_page_metrics(url)  # profile-level metrics such as follower counts, not posts
#pprint(metrics)

### 5. Save images from list of dicts: 

Use the requests library to download each image from the ‘display_url’ entry of the posts in the ‘results’ list and store it with the post's shortcode as the file name.

Specify the directory for storing the images. 

In [72]:
# download all images from an Instagram page 
import os
import requests
import shutil

# to specify
directory= r"C:\Users\Anonym\Documents\GitHub\DLfM_BrandManagement\images"
os.chdir(directory)

for i in range(len(results)):
    r = requests.get(results[i]['display_url'], stream=True)
    with open(results[i]['shortcode']+".jpg", 'wb') as f:
        # Set decode_content value to True, otherwise the downloaded image file's size will be zero.
        r.raw.decode_content = True
        # Copy the response stream raw data to local image file.
        shutil.copyfileobj(r.raw, f)
        # Remove the image url response object.
        del r
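
The loop above relies on `os.chdir` to decide where files end up. As an alternative, here is a minimal sketch that builds the target path explicitly. The helper name `save_image_bytes` is hypothetical, and it assumes the image has already been fetched into memory, for example via the `content` attribute of a requests response.

```python
from pathlib import Path


def save_image_bytes(image_bytes, shortcode, directory):
    """Write image bytes to <directory>/<shortcode>.jpg and return the path."""
    target_dir = Path(directory)
    # Create the directory if it does not exist yet
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{shortcode}.jpg"
    target.write_bytes(image_bytes)
    return target
```

Building the path this way leaves the working directory unchanged, so later cells are not affected by where the images are stored.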

In [43]:
# download one image only
import os
import requests
import shutil

# to specify
directory= r"C:\Users\Anonym\Documents\GitHub\DLfM_BrandManagement\images"
os.chdir(directory)

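# Look up the post by its shortcode (assumes `results` is the list of posts
# returned by profile_page_posts) and request its image URL, not the profile URL.
shortcode = "B-Tckr0AgrH"
post = next(p for p in results if p['shortcode'] == shortcode)
r = requests.get(post['display_url'], stream=True)

# Join directory and file name with a path separator
with open(os.path.join(directory, shortcode + ".jpg"), 'wb') as f: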
    # Set decode_content value to True, otherwise the downloaded image file's size will be zero.
    r.raw.decode_content = True
    # Copy the response stream raw data to local image file.
    shutil.copyfileobj(r.raw, f)
    # Remove the image url response object.
    del r