# Download Content: Transcript and Manuscript-Images

+ **Author**: Sugato Ray
+ **Published Date**: 2018-Jun-12
+ **Version**: 1.0

#### Description: 
+ This project downloads transcripts and manuscript image files from [Bichitra: Online Tagore Variorum](http://bichitra.jdvu.ac.in/index.php), an initiative of the School of Cultural Texts and Records, Jadavpur University. 
+ It relies on a custom Python module, **`Webscrape_Bichitra_Library.py`**, imported below as `wbl`. Download the module from [here](http://localhost:8888/edit/Documents/PythonScripts/Webscrape_Bichitra_Library.py#).

***

### Some helpful shortcuts:  
<font color="red">
**Note**: To enter command mode, press the **`ESC`** key.

|**Action**|**Command Mode Shortcut**|
| :- |-------------: | 
| Run current cell only:| **`SHIFT` + `ENTER`**|
| Insert cell above:| **`A`**|
| Insert cell below:| **`B`**|
| Change cell to ***MARKDOWN***:| **`M`**|
| Change cell to ***CODE***:| **`Y`**|
| Toggle cell output scrolled:| **`SHIFT` + `O`**|
| Toggle cell output collapsed:| **`O`**|
| Toggle cell line numbering:| **`L`**|
| Toggle notebook line numbering:| **`SHIFT` + `L`**|
<font color="black">


***


## Import Necessary Packages

In [2]:
# Mandatory imports
import Webscrape_Bichitra_Library as wbl

# Additional optional imports
import webbrowser as wb
import os
import errno
import requests
from bs4 import BeautifulSoup as bsp
import urllib.parse as urlparse
from numpy import arange
import time # Refer to: https://stackoverflow.com/questions/3620943/measuring-elapsed-time-with-the-time-module
import pandas as pd
import matplotlib.pyplot as plt
from numpy import random as np_random, floor as np_floor

%matplotlib inline

## Primary Content Title of the Manuscript(s):

In [3]:
primary_content_title = "Tasher_Desh"
print("Primary Content Title: " + primary_content_title)

Primary Content Title: Tasher_Desh


## Prepare a List of Target Manuscript URLs:

In [4]:
# Line continuations are not needed inside brackets, so each URL sits on its own line.
target_url_list = [
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=9&mname=RBVBMS_009A",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=104&mname=RBVBMS_096%28i%29",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=105&mname=RBVBMS_096%28ii%29",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=106&mname=RBVBMS_096%28iii%29",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=107&mname=RBVBMS_096%28iv%29",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=108&mname=RBVBMS_096%28v%29",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=119&mname=RBVBMS_101%28i%29",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=120&mname=RBVBMS_101%28ii%29",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=121&mname=RBVBMS_101%28iii%29",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=199&mname=RBVBMS_159",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=208&mname=RBVBMS_168",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=264&mname=RBVBMS_192%28i%29",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=265&mname=RBVBMS_192%28ii%29",
    "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=589&mname=BMSF_036",
]
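
Every URL in the list follows the same pattern: a `manid` number and a percent-encoded `mname`. The sketch below shows one way such URLs could be generated from `(manid, mname)` pairs instead of being typed by hand. `build_viewer_url` is a hypothetical helper written for this illustration; it is not part of `Webscrape_Bichitra_Library`.

```python
from urllib.parse import urlencode

# Base address of the manuscript viewer, copied from the URLs listed above.
VIEWER_BASE_URL = "http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php"


def build_viewer_url(manid, mname):
    """Return a viewer URL for one manuscript (hypothetical helper, not part of wbl)."""
    # urlencode percent-encodes the parentheses in names such as "RBVBMS_096(i)".
    return VIEWER_BASE_URL + "?" + urlencode({"manid": manid, "mname": mname})


# (manid, mname) pairs taken from the first three entries of target_url_list.
manuscripts = [(9, "RBVBMS_009A"), (104, "RBVBMS_096(i)"), (105, "RBVBMS_096(ii)")]
generated_urls = [build_viewer_url(manid, mname) for manid, mname in manuscripts]

for url in generated_urls:
    print(url)
```

Keeping the pairs as data makes it easier to spot a mistyped `manid` than scanning long URL strings.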

### Show Target URLs:

In [5]:
wbl.show_TargetURLs(target_url_list)

The list of target urls: 

0: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=9&mname=RBVBMS_009A
1: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=104&mname=RBVBMS_096%28i%29
2: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=105&mname=RBVBMS_096%28ii%29
3: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=106&mname=RBVBMS_096%28iii%29
4: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=107&mname=RBVBMS_096%28iv%29
5: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=108&mname=RBVBMS_096%28v%29
6: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=119&mname=RBVBMS_101%28i%29
7: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=120&mname=RBVBMS_101%28ii%29
8: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=121&mname=RBVBMS_101%28iii%29
9: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=199&mname=RBVBMS_159
10: http://

### Open Any Targeted URL:

Alternatively, you can click any of the URLs listed above to open the page directly.

In [6]:
target_index = 11
wb.open_new_tab(target_url_list[target_index])

True

## Extract Transcript File Names from URLs:

In [7]:
fileName_list = wbl.make_FileName_List(target_url_list)

File Names: 

0: RBVBMS_009A 
1: RBVBMS_096(i) 
2: RBVBMS_096(ii) 
3: RBVBMS_096(iii) 
4: RBVBMS_096(iv) 
5: RBVBMS_096(v) 
6: RBVBMS_101(i) 
7: RBVBMS_101(ii) 
8: RBVBMS_101(iii) 
9: RBVBMS_159 
10: RBVBMS_168 
11: RBVBMS_192(i) 
12: RBVBMS_192(ii) 
13: BMSF_036 
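
The file names above match the `mname` query parameter of each URL after percent-decoding (for example, `%28i%29` becomes `(i)`). A minimal sketch of that extraction with the standard library follows. `extract_manuscript_name` is a hypothetical name used for illustration, and `wbl.make_FileName_List` may work differently internally.

```python
from urllib.parse import urlsplit, parse_qs


def extract_manuscript_name(viewer_url):
    """Return the decoded `mname` query parameter of a viewer URL (illustrative helper)."""
    # parse_qs percent-decodes values, so "%28i%29" becomes "(i)".
    query = parse_qs(urlsplit(viewer_url).query)
    return query["mname"][0]


sample_url = ("http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php"
              "?manid=104&mname=RBVBMS_096%28i%29")
print(extract_manuscript_name(sample_url))
```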


## Download Transcripts of Targeted Manuscripts:

In [8]:
subFolderPath_list = ["Tasher_Desh", "Transcripts"]
fileName_list = wbl.begin_downloadManuscriptAsFile(target_url_list, subFolderPath_list)


The list of target urls: 

0: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=9&mname=RBVBMS_009A
1: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=104&mname=RBVBMS_096%28i%29
2: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=105&mname=RBVBMS_096%28ii%29
3: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=106&mname=RBVBMS_096%28iii%29
4: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=107&mname=RBVBMS_096%28iv%29
5: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=108&mname=RBVBMS_096%28v%29
6: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=119&mname=RBVBMS_101%28i%29
7: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=120&mname=RBVBMS_101%28ii%29
8: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=121&mname=RBVBMS_101%28iii%29
9: http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=199&mname=RBVBMS_159
10: http://
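
`subFolderPath_list` names a nested folder (`Tasher_Desh/Transcripts`) that should receive the transcript files. The sketch below shows one way such a folder could be created before saving anything. `ensure_subfolder` is a hypothetical helper, not the library's own code, and the demonstration writes into a temporary directory so that it leaves nothing behind.

```python
import os
import tempfile


def ensure_subfolder(base_dir, sub_folder_path_list):
    """Create base_dir/<part1>/<part2>/... if missing and return the full path."""
    folder_path = os.path.join(base_dir, *sub_folder_path_list)
    # exist_ok=True makes the call safe to repeat when the folder already exists.
    os.makedirs(folder_path, exist_ok=True)
    return folder_path


# Demonstration inside a throwaway directory (removed when the block exits).
with tempfile.TemporaryDirectory() as demo_base_dir:
    demo_path = ensure_subfolder(demo_base_dir, ["Tasher_Desh", "Transcripts"])
    demo_created = os.path.isdir(demo_path)
    print(demo_created)
```

With `exist_ok=True` there is no need to catch the "folder already exists" error by inspecting `errno`.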

## Downloading Manuscript Images:

### 1. Setup Parameters for Downloading Manuscript Images:

+ #### Maximum Number of Image-Pages for the Manuscripts

In [9]:
#total_image_pages_list = [98,35,29,29,28,29,29,29,29,291,86,63,2,4]
total_image_pages_list = wbl.getTotalImagesNumberList(target_url_list)
print(total_image_pages_list)

[98, 35, 29, 29, 28, 29, 29, 29, 29, 291, 86, 63, 2, 4]


In [10]:
print("\nIndex" + "\t" + "Images" + "\t" + "Manuscript Name" + "\t\t" + "Manuscript URL" + "\n")
for i,target_url in enumerate(target_url_list):
    print(str(i) + "\t" + str(total_image_pages_list[i])  + "\t" + fileName_list[i] + "\t" + target_url)


Index	Images	Manuscript Name		Manuscript URL

0	98	RBVBMS_009A	http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=9&mname=RBVBMS_009A
1	35	RBVBMS_096(i)	http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=104&mname=RBVBMS_096%28i%29
2	29	RBVBMS_096(ii)	http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=105&mname=RBVBMS_096%28ii%29
3	29	RBVBMS_096(iii)	http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=106&mname=RBVBMS_096%28iii%29
4	28	RBVBMS_096(iv)	http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=107&mname=RBVBMS_096%28iv%29
5	29	RBVBMS_096(v)	http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=108&mname=RBVBMS_096%28v%29
6	29	RBVBMS_101(i)	http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=119&mname=RBVBMS_101%28i%29
7	29	RBVBMS_101(ii)	http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=120&mname=RBVBMS_101%28ii%29
8	29	RBVBMS_101(iii)	http://bichitra.jdvu.ac.in/m

### 2. Begin Downloading Manuscript Images:

In [21]:
download_index = 12
#primary_content_title = "Tasher_Desh" # moved to the very top of this notebook

# Get setup parameters for downloading manuscript-images
target_url_manuscript, manuscript_name, total_images_number, basePath = \
    wbl.setParams_Download_Manuscript_Image_Files(total_image_pages_list, \
                                                  target_url_list, \
                                                  fileName_list, \
                                                  primary_content_title, \
                                                  download_index)

# Begin manuscript-images download    
%prun image_number_list, image_name_list, image_filename_list, image_fileNamePath_list, image_url_list = \
    wbl.downloadBichitraManuscriptImages(target_url_manuscript, manuscript_name, total_images_number, basePath)


 Manuscript belongs to Title: Tasher_Desh

 Manuscript Name: RBVBMS_192(ii)

 Manuscript URL: 
http://bichitra.jdvu.ac.in/manuscript/manuscript_viewer.php?manid=265&mname=RBVBMS_192%28ii%29

 Total Number of Image-Files: 2
RelativePath: C:\Users\raysu\Documents\PythonScripts\Tasher_Desh\Manuscripts\RBVBMS_192(ii)

 Title: RBVBMS_192(ii)
 Manuscript Download BasePath: C:\Users\raysu\Documents\PythonScripts\Tasher_Desh\Manuscripts\RBVBMS_192(ii)

 Downloading Image Files for Manuscript: RBVBMS_192(ii)

http://bichitra.jdvu.ac.in/utility/image.php?fpath=Y29udGVudC9tYW51c2NyaXB0L1JCVkJNU18wMDlBLw%3D%3D&img=00000006.gif&amp;dummy=934
{'http://bichitra.jdvu.ac.in/utility/image.php?fpath': ['Y29udGVudC9tYW51c2NyaXB0L1JCVkJNU18wMDlBLw=='], 'img': ['00000006.gif'], 'dummy': ['934']}
Total Number of Images to Download: 2
http://bichitra.jdvu.ac.in/utility/image.php?fpath=Y29udGVudC9tYW51c2NyaXB0L1JCVkJNU18wMDlBLw==&img=00000001.gif&dummy=43
http://bichitra.jdvu.ac.in/utility/image.php?fpath=Y29
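
The image URLs printed above carry three query parameters: `fpath`, `img`, and `dummy`. The `fpath` value is Base64 text, and the first printed URL still contains the HTML entity `&amp;`. The sketch below, which uses only the standard library, shows one way to take such a URL apart. `describe_image_url` is a hypothetical helper for illustration, and the meaning of `dummy` is not stated in this notebook, so it is returned unchanged.

```python
import base64
import html
from urllib.parse import urlsplit, parse_qs


def describe_image_url(image_url):
    """Split a Bichitra image URL into its decoded parts (illustrative helper)."""
    # Turn "&amp;" back into "&" in case the URL was copied from HTML source.
    clean_url = html.unescape(image_url)
    # Parse only the query string, so the keys are "fpath", "img" and "dummy".
    query = parse_qs(urlsplit(clean_url).query)
    # fpath is Base64-encoded text; decode it into a readable folder path.
    folder = base64.b64decode(query["fpath"][0]).decode("utf-8")
    return {"folder": folder, "img": query["img"][0], "dummy": query["dummy"][0]}


# First image URL from the output above, copied verbatim.
sample_url = ("http://bichitra.jdvu.ac.in/utility/image.php"
              "?fpath=Y29udGVudC9tYW51c2NyaXB0L1JCVkJNU18wMDlBLw%3D%3D"
              "&img=00000006.gif&amp;dummy=934")
info = describe_image_url(sample_url)
print(info)
```

For this sample URL the decoded folder is `content/manuscript/RBVBMS_009A/`. Parsing only the query string also avoids the odd first key (`http://...image.php?fpath`) visible in the dictionary printed above.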

In [10]:
"""
def getFromManuscriptURL_TotalImagesNumber(target_url, iDebugFlag = 0):

    # Extract **total_pages_number** for a manuscript's images.
        
    #target_url = target_url_list[0]
    # Get source html text.
    source_html = requests.get(target_url)
    if iDebugFlag == 1:
        print("\n Target HTML Text: \n" + source_html.text + "\n")

    # Use BeautifulSoup to extract intended content.
    # Soup for HTML body
    soup_body = bsp(source_html.text, 'html.parser').find("body")
    # Soup for the part of the HTML body that contains the target area in the manuscript_toolbar
    soup_toolbar_manuscript = soup_body.find("div", {"class": "main_toolbar"})\
                                .find("div", {"class": "manuscript_toolbar"})\
                                .find("span", {"class": r"clearfix ui-widget ui-corner-all button_group"})
    # soup for <span></span> containing info: total-number-of-pages
    soup_page_number = soup_toolbar_manuscript.find("span", {"class": "man_button"})
    if iDebugFlag == 2:
        print("\n HTML Text for Total Number of Pages: \n" \
              + soup_page_number.prettify() + "\n")
    
    # Split into lines
    page_num_text_lines = str(soup_page_number.text).splitlines()
    # Pick last item from the list (page_num_text_lines) and 
    # split with "of " followed by stripping of white-spaces
    total_pages_number = int("".join("".join(page_num_text_lines[-1:])\
                                    .split("of ")[-1:]\
                                    ).strip()\
                            )
        
    return total_pages_number
    
def getTotalImagesNumberList(target_url_list):
    total_image_pages_list = list(getFromManuscriptURL_TotalImagesNumber(target_url) for target_url in target_url_list)
    
    return total_image_pages_list    
"""
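
The last step of the commented-out function above takes the text of the page-counter element and reads the number that follows "of ". That step can be tried offline without fetching any page. The sketch below uses a regular expression in place of the original split-and-join chain. `parse_total_pages` is a hypothetical name, and the sample text is an assumed shape of the counter, not text captured from the site.

```python
import re


def parse_total_pages(page_counter_text):
    """Return N from counter text that ends in 'of N' (illustrative helper)."""
    # Look for "of <digits>" at the very end of the stripped text.
    match = re.search(r"of\s+(\d+)\s*$", page_counter_text.strip())
    if match is None:
        raise ValueError("no 'of N' page count found in: %r" % page_counter_text)
    return int(match.group(1))


# Assumed example of counter text; the real element may be laid out differently.
sample_text = "Page\n 1 of 98"
print(parse_total_pages(sample_text))
```

Raising `ValueError` on unexpected text makes a change in the page layout visible immediately, rather than surfacing later as an unrelated conversion error.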