<a href="https://colab.research.google.com/github/radonys/webscrapers/blob/master/Instagram_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Instagram Scraper

## Install Modules

In [0]:
!pip install beautifulsoup4 selenium pandas requests

## Import Modules

In [0]:
# For accessing and parsing Instagram Webpages
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.request import urlopen
import json

# Structuring Collected Data
import pandas as pd

# General Libraries
import time
import re
import os
import requests

## Install Chrome Driver for scrolling Instagram pages

References:

1) How can we use Selenium Webdriver in colab.research.google.com? ([Stack Overflow](https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com))

In [0]:
!apt install chromium-chromedriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

## Variable Declarations

In [0]:
# chromedriver is found on the PATH, so no explicit driver path is passed.
browser = webdriver.Chrome(options=options)

# Instagram handles to be scraped
instagram_handles = ['walmart']

## Helper Functions for Scraping, Parsing & Saving

### Instagram Scraper Function

References:

1) Scraping Instagram with Python (using Selenium and Beautiful Soup) ([Medium](https://medium.com/@srujana.rao2/scraping-instagram-with-python-using-selenium-and-beautiful-soup-8b72c186a058))

2) How to scroll to the end of the page using selenium in python ([Stack Overflow](https://stackoverflow.com/questions/32391303/how-to-scroll-to-the-end-of-the-page-using-selenium-in-python/32629481))

In [0]:
# This code combines and adapts the two resources referenced above.

def instagram_link_scraper(suffix, tagged=False):
  
  if tagged:
    link = 'https://www.instagram.com/'+suffix+'/tagged/'
  else:
    link = 'https://www.instagram.com/'+suffix
  
  browser.get(link)
  
  page_length = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);var page_length=document.body.scrollHeight;return page_length;")
  flag = False
  links = []
  
  while not flag:
    
    previous_length = page_length
    
    # Sleep between scrolls so the traversal looks less like a bot.
    time.sleep(3)
    print(len(set(links)))
    
    page_length = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);var page_length=document.body.scrollHeight;return page_length;")
    
    # If the previous length equals the current length, we are at the end of the page.
    if previous_length == page_length:
      flag = True
    
    # Collect image/video links on the current page.
    
    if not tagged:
    
      source = browser.page_source
      data = BeautifulSoup(source, 'html.parser')
    
      # Post links in the HTML code are inside the first <span> of the body tag.
      body = data.find('body')
      span = body.find('span')
      
      for a_tag in span.find_all('a'):
        href = a_tag.get('href')
        # Skip anchors without an href; post links start with /p/.
        if href and re.match("/p/", href):
          links.append('https://www.instagram.com' + href)
    
    else:
      
      # Page refresh time.
      time.sleep(5)
      
      # find_elements(by, value) works in both Selenium 3 and Selenium 4.
      data = browser.find_elements("tag name", "article")
      a_tags = data[0].find_elements("tag name", "a")
      
      for a_tag in a_tags:
        links.append(a_tag.get_attribute('href'))
  
  return list(set(links))
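
The stopping rule used above (keep scrolling until `document.body.scrollHeight` stops growing) can be tried without a browser. The sketch below is illustrative only: `FakeBrowser`, `scroll_to_bottom`, and `max_scrolls` are hypothetical names that are not part of Selenium or of this notebook, and the stub simply replays a fixed list of page heights in place of a real `execute_script` call.

```python
import time

SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;"


class FakeBrowser:
    """Hypothetical stand-in for a Selenium driver; replays canned page heights."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._calls = 0

    def execute_script(self, script):
        # Return the next canned height, repeating the last one once exhausted.
        index = min(self._calls, len(self._heights) - 1)
        self._calls += 1
        return self._heights[index]


def scroll_to_bottom(browser, pause=0.0, max_scrolls=100):
    """Scroll until the page height stops changing; return the number of scrolls."""
    page_length = browser.execute_script(SCROLL_JS)
    scrolls = 1
    while scrolls < max_scrolls:
        time.sleep(pause)  # wait between scrolls, as the notebook does
        previous_length = page_length
        page_length = browser.execute_script(SCROLL_JS)
        scrolls += 1
        if page_length == previous_length:
            break  # height did not grow, so this is the end of the page
    return scrolls


fake = FakeBrowser([1000, 2000, 3000, 3000])
scroll_count = scroll_to_bottom(fake)
print(scroll_count)
```

The `max_scrolls` cap is an assumption added for the sketch; it bounds the loop on pages that keep loading new content.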

### Retrieve Needed Information from JSON Data

In [0]:
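def extract_needed_info(json_data):
  
  keys = ['id', 'shortcode', 'edge_media_preview_like', 'display_url', 'is_video', 'edge_media_preview_comment', 'taken_at_timestamp']
  # Alternative keys under which the comment data may appear.
  comment_fallbacks = ['edge_media_to_parent_comment', 'edge_media_to_comment']
  temp_dict = {}
  
  for key in keys:
    
    if key in json_data:
      temp_dict[key] = json_data[key]
    
    elif key == 'edge_media_preview_comment':
      # Use the first fallback key that is present; None if neither exists.
      temp_key = next((k for k in comment_fallbacks if k in json_data), None)
      temp_dict[key] = json_data.get(temp_key)
    
    else:
      # Any other missing key is recorded as None instead of raising an error.
      temp_dict[key] = None
  
  return temp_dict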
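
To make the fallback for the comment field concrete, here is a small self-contained sketch. The `sample_post` dictionary is made up for illustration (it is not real Instagram output), and `first_present` is a hypothetical helper that this notebook does not define.

```python
def first_present(mapping, candidates, default=None):
    """Return the value of the first key in `candidates` found in `mapping`."""
    for key in candidates:
        if key in mapping:
            return mapping[key]
    return default


# Made-up post metadata that only carries one of the fallback comment keys.
sample_post = {
    'shortcode': 'abc123',
    'is_video': False,
    'edge_media_to_parent_comment': {'count': 7},
}

# Preferred key first, then the fallbacks in the order the notebook tries them.
comment_keys = [
    'edge_media_preview_comment',
    'edge_media_to_parent_comment',
    'edge_media_to_comment',
]

comments = first_present(sample_post, comment_keys)
likes = first_present(sample_post, ['edge_media_preview_like'])
print(comments, likes)
```

Returning a default for a missing key, rather than raising, is a design choice of this sketch; it lets one incomplete post pass through without stopping the whole run.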

### Parse and Save Link Data

References:

1) Scraping Instagram with Python (using Selenium and Beautiful Soup) ([Medium](https://medium.com/@srujana.rao2/scraping-instagram-with-python-using-selenium-and-beautiful-soup-8b72c186a058))

2) Manually raising (throwing) an exception in Python ([Stack Overflow](https://stackoverflow.com/questions/2052390/manually-raising-throwing-an-exception-in-python))

In [0]:
def save_metadata(links, save_name, max_attempts=3):
  
  frames = []
  pending = list(links)
  
  # Retry failed links up to max_attempts times so a link that always fails cannot loop forever.
  for attempt in range(max_attempts):
    
    failed = []
    
    for link in pending:
      
      try:
        
        # Open the URL and extract JSON information.
        page = urlopen(link).read()
        source = BeautifulSoup(page, 'html.parser')
        body = source.find('body')
        script = body.find('script')
        # Strip only the assignment prefix and the trailing semicolon.
        json_extract = re.sub(r'^\s*window\._sharedData\s*=\s*|;\s*$', '', script.text)
        json_data = json.loads(json_extract)

        # Extract needed information.
        image_metadata = json_data['entry_data']['PostPage'][0]['graphql']['shortcode_media']
        temp_dict = extract_needed_info(image_metadata)
        frames.append(pd.DataFrame.from_dict(temp_dict, orient='columns'))
        
      except Exception as error:
        print(link)
        print('Caught this error: ' + repr(error))
        failed.append(link)
    
    pending = failed
    if not pending:
      break
  
  if not frames:
    print('No metadata collected for ' + save_name)
    return
  
  # DataFrame.append was removed in pandas 2.0, so the collected frames are concatenated.
  data = pd.concat(frames)
  data = data.drop_duplicates(subset = 'shortcode')
  data.index = range(len(data.index))
  data.to_csv(save_name+'data.csv')
  save_images(data, save_name)
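
The JSON extraction step in `save_metadata` can be looked at in isolation. This sketch assumes the script text has the form `window._sharedData = {...};`, which is what the function expects; the `script_text` value is invented sample data and `parse_shared_data` is a hypothetical helper. It trims only the assignment prefix and the trailing semicolon, so semicolons inside strings (for example in a caption) survive.

```python
import json
import re


def parse_shared_data(script_text):
    """Parse the JSON object assigned to window._sharedData in a script body."""
    # Drop the leading assignment and the final semicolon, nothing else.
    json_text = re.sub(r'^\s*window\._sharedData\s*=\s*|;\s*$', '', script_text)
    return json.loads(json_text)


# Invented sample that mimics the nesting the notebook reads from.
script_text = (
    'window._sharedData = {"entry_data": {"PostPage": [{"graphql": '
    '{"shortcode_media": {"shortcode": "abc123", "caption": "hi; there"}}}]}};'
)

shared = parse_shared_data(script_text)
media = shared['entry_data']['PostPage'][0]['graphql']['shortcode_media']
print(media['shortcode'], media['caption'])
```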

### Save Images

References:

1) Scraping Instagram with Python (using Selenium and Beautiful Soup) ([Medium](https://medium.com/@srujana.rao2/scraping-instagram-with-python-using-selenium-and-beautiful-soup-8b72c186a058))

In [0]:
# The code has been taken from the above resource.

def save_images(data, save_name):
  
  images_path = save_name+'/'
  
  if not os.path.isdir(images_path):
    os.mkdir(images_path)
  
  for i in range(0, len(data['display_url'])):
    
    # Download each image and name the file after the post's shortcode.
    image = requests.get(data['display_url'][i])
    with open(images_path + data['shortcode'][i] + ".jpg", 'wb') as file:
      file.write(image.content)

## Scraping and Parsing Instagram Data

Note: Data scraping for the posts and the tagged images of a handle has been separated due to processing time constraints.

In [0]:
for handle in instagram_handles:
  
  #links = instagram_link_scraper(handle)
  #save_metadata(links,handle+'_posts_')
  
  links = instagram_link_scraper(handle, tagged=True)
  print(len(links))
  save_metadata(links,handle+'_tagged_')

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
!zip -r walmart_tagged_images.zip walmart_tagged_/

In [0]:
!mv *.zip /content/drive/My\ Drive/
!mv *.csv /content/drive/My\ Drive/

## General References

1) Pandas Docs ([Pandas](https://pandas.pydata.org/pandas-docs/stable/))

2) Selenium WebDriver ([Selenium](https://www.seleniumhq.org/projects/webdriver/))