<a href="https://colab.research.google.com/github/santhosh790/data-wrangling/blob/master/Data_Wrangling_AEM_Images.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Wrangling of AEM Images:**

Data wrangling is the process of transforming raw data into another format (structured or unstructured) so that it can be analyzed. The raw data may come from one or more sources.

In this notebook, the aim is to collect a set of images from an AEM (Adobe Experience Manager) server and store them in Google Drive.

Steps:

1. Connecting to AEM
2. Downloading Images
3. Connecting to Google Drive
4. Creating Folder and Storing Images in Google Drive
5. Main function to Download All Images


**1. Connecting to AEM**

In [0]:
# Importing the required packages for the AEM connection
import urllib
import urllib.parse
import urllib.request
import urllib.error  # used when catching HTTPError during download
import base64
import json
import logging
import os

In [0]:
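def connectAuthorised(url, credentials):
    # credentials is a "user:password" string, encoded here for HTTP Basic authentication
    auth_header_userpass = "Basic " + base64.standard_b64encode(credentials.encode('ascii')).decode('ascii')
    auth_header = {'Authorization': auth_header_userpass}  # header name must not contain a trailing space
    req = urllib.request.Request(url, None, auth_header)
    return req  # the caller passes this request to urlopen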
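
To illustrate how the Basic authentication header is formed, here is a minimal sketch. The helper name `basic_auth_header` and the `user:pass` credentials are placeholders chosen for illustration, not values from the AEM server.

```python
import base64

def basic_auth_header(credentials):
    """Build an HTTP Basic Authorization header from a 'user:password' string."""
    token = base64.standard_b64encode(credentials.encode("ascii")).decode("ascii")
    return {"Authorization": "Basic " + token}

# Placeholder credentials, for illustration only
header = basic_auth_header("user:pass")
print(header)
```

The resulting dictionary can be passed as the headers argument of urllib.request.Request.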

**2. Download Images**

In [0]:
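errorFile = "download_errors.txt"  # assumed log file name; errorFile is not defined elsewhere in this notebook

def readAEMData(url, credentials):
    req = connectAuthorised(url, credentials)
    try:
        with urllib.request.urlopen(req) as response:
            return response.read()  # raw image bytes
    except urllib.error.HTTPError as e:
        # Log the failed URL with its HTTP status and let the caller continue
        writeToFile(errorFile, "Error in downloading:" + url + " (HTTP " + str(e.code) + ")")
    return b""  # empty result signals a failed download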

**3. Connecting to Google Drive**

This cell loads the drive package from Google Colab and mounts Google Drive.

While mounting, Colab asks you to authorize access to your Google Drive. Clicking the URL generates an authorization code, which must be entered into the text box in the cell output.

In [6]:
from google.colab import drive as dr
dr.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive


**Helper Functions:**
1. Load image names into a list:

Loads all image names from the input file into a Python list.


In [0]:
def loadImageNamesToList(file_name):
    the_return_list = []
    if os.path.exists(file_name):
        with open(file_name, "r") as f:  # the file is closed automatically
            for each_line in f:
                name = each_line.rstrip()
                if name != "":  # compare by value; "is not" tests identity
                    print(name)
                    the_return_list.append(name)

    return the_return_list
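
As a quick check of the loading behaviour described above, the following sketch writes a small sample file and reads it back. The function name `load_names` and the sample image names are made up for this example; it mirrors the helper but omits the print call.

```python
import os
import tempfile

def load_names(file_name):
    """Return the non-empty lines of a text file, without trailing whitespace."""
    names = []
    if os.path.exists(file_name):
        with open(file_name, "r") as f:
            for line in f:
                name = line.rstrip()
                if name != "":
                    names.append(name)
    return names

# Write a sample file that contains a blank line between two names
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("9780000000001\n\n9780000000002\n")
    sample_path = tmp.name

sample_names = load_names(sample_path)                # blank line is skipped
missing_names = load_names(sample_path + ".missing")  # a missing file gives an empty list
os.remove(sample_path)
```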

2. Writing the given text to a file

In [0]:
def writeToFile(file_name, text_to_write):
    # Append one line to the file, creating the file if it does not exist
    with open(file_name, "a+") as f:
        f.write(text_to_write + "\n")

3. Generate the fully qualified URL:

Generates the fully qualified URL of an image on the AEM server. This code assumes that each image is stored in a folder named after the image, and that this folder sits under a parent folder named after the last two characters of the image name.

In [0]:
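def imagePath(ser, image, imgExt):
    # prodPath is a global base path on the AEM server; it must be defined before this function is called
    parent = image[-2:]  # parent folder is the last two characters of the image name
    damPath = prodPath + parent + "/" + image + "/" + image + imgExt
    return ser + damPath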
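
A worked example may make the folder convention easier to follow. In this sketch the server address, the base path `/content/dam/products/`, and the image name are all hypothetical, and the base path is passed as a parameter instead of being read from a global.

```python
def image_url(server, prod_path, image, ext):
    """Build server + base path + last two characters + image folder + file name."""
    parent = image[-2:]
    return server + prod_path + parent + "/" + image + "/" + image + ext

# Hypothetical server, base path and image name
example = image_url("https://aem.example.com", "/content/dam/products/", "9780000000042", ".jpg")
print(example)
```

Here the image name ends in 42, so the file is looked up under the 42 folder.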

**4. Creating Folder and Storing images in Google Drive**

In [0]:
def storeImages(url, creds, folderName):
    if not os.path.exists(folderName):
        os.makedirs(folderName)  # Creates the folder in GDrive
    filePath = folderName + "/" + url[url.rfind("/")+1:]  # GDrive path of the image
    if not os.path.exists(filePath):  # Image is not in GDrive yet
        fileVal = readAEMData(url, creds)
        if fileVal:  # Non-empty response, i.e. the download succeeded
            with open(filePath, 'wb') as output:  # Opens a file stream in GDrive
                output.write(fileVal)  # Writes the downloaded image
    else:
        print("File Exists:" + filePath)  # Image already exists, so report it and skip the download
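
The skip-if-present logic can be tried without an AEM server or Google Drive. The sketch below is a simplified stand-in, not the notebook's function: `store_once` and `fake_fetch` are hypothetical names, the download is replaced by a stub that returns fixed bytes, and a temporary directory stands in for the Drive folder.

```python
import os
import tempfile

def store_once(url, folder, fetch):
    """Save fetch(url) under folder unless the file already exists.

    Returns True only when a new file was written."""
    os.makedirs(folder, exist_ok=True)
    file_path = os.path.join(folder, url[url.rfind("/") + 1:])
    if os.path.exists(file_path):
        return False  # already stored, nothing to download
    data = fetch(url)
    if not data:
        return False  # empty result is treated as a failed download
    with open(file_path, "wb") as out:
        out.write(data)
    return True

def fake_fetch(url):
    # Stub standing in for the real download
    return b"fake image bytes"

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "images")
    first = store_once("https://aem.example.com/a/b/cover.jpg", target, fake_fetch)
    second = store_once("https://aem.example.com/a/b/cover.jpg", target, fake_fetch)
    saved = os.listdir(target)
```

The second call finds the file from the first call and skips the download.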

**5. Main function to Download All images**

In [0]:
import time
def downloadImages(pathFile, server_url, creds):
    # creds: "user:password" string used for AEM Basic authentication
    imageList = loadImageNamesToList(pathFile)
    processedImages = loadImageNamesToList("Images_processed.txt")
    for image in imageList:
        if image not in processedImages:
            damPath = imagePath(server_url, image, ".jpg")
            storeImages(damPath, creds, "/content/gdrive/Team Drives/allImagesAutomate")
            time.sleep(1)  # pause one second between requests
            writeToFile("Images_processed.txt", image)  # record progress so a rerun skips this image
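
The main loop relies on a progress file so that an interrupted run can be resumed. The following sketch isolates that pattern under simplifying assumptions: `process_pending` is a hypothetical name, the per-image work is a no-op callback, and the progress file lives in a temporary directory.

```python
import os
import tempfile

def process_pending(names, log_path, handle):
    """Call handle(name) for every name not yet recorded in log_path.

    Each handled name is appended to log_path. Returns the names handled in this run."""
    done = set()
    if os.path.exists(log_path):
        with open(log_path) as f:
            done = {line.rstrip() for line in f if line.rstrip()}
    handled = []
    for name in names:
        if name in done:
            continue  # recorded by an earlier run or earlier in this run
        handle(name)
        with open(log_path, "a") as f:
            f.write(name + "\n")
        done.add(name)
        handled.append(name)
    return handled

with tempfile.TemporaryDirectory() as tmp_dir:
    log = os.path.join(tmp_dir, "processed.txt")
    first_run = process_pending(["a", "b"], log, lambda name: None)
    second_run = process_pending(["a", "b", "c"], log, lambda name: None)
```

On the second run only the new name is handled, because the first two are already in the progress file.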