[Source](https://www.codementor.io/aviaryan/downloading-files-from-urls-in-python-77q3bs0un)

# Downloading Files in Python

This notebook demonstrates how to download files from URLs efficiently and correctly using Python.

The example uses the ```requests``` library, but many other libraries can achieve the same thing with different syntax.

Let's start with baby steps: how to download a file using ```requests```.

In [12]:
import requests

url = 'http://google.com/favicon.ico'
response = requests.get(url, allow_redirects=True)
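file_name = 'google.ico'
file_mode = 'wb'  # binary mode, so the bytes are written unchanged
# the with block closes the file even if the write raises an error
with open(file_name, file_mode) as file:
    file.write(response.content)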

In [3]:
import os
os.getcwd()

'/Users/kolobj/Google Drive/Python/Exercises/Files'

The above code will download the media at <http://google.com/favicon.ico> and save it as google.ico.
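
For larger files, holding the whole body in `response.content` may be more memory than you want to spend. The sketch below is one possible alternative, not part of the original example: a hypothetical helper `save_chunks` writes any iterable of byte chunks to disk, and the commented usage shows how `requests` could supply those chunks with `stream=True` and `iter_content`.

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    with open(path, 'wb') as fh:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fh.write(chunk)
                written += len(chunk)
    return written

# Assumed usage with requests (needs network access, so it is left commented out):
#   import requests
#   with requests.get(url, stream=True, allow_redirects=True) as response:
#       response.raise_for_status()
#       save_chunks(response.iter_content(chunk_size=8192), 'google.ico')
```

The chunk size of 8192 bytes is an arbitrary choice for illustration.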

Now let's take another example where url is <https://www.youtube.com/watch?v=9bZkp7q19f0>.

What do you think will happen if the above code is used to download it?

If you said that an HTML page will be downloaded, you are spot on.

Headers usually contain a **Content-Type** parameter which tells us about the type of data the url is linking to. You can inspect the headers to determine if you'd like to save the content.

In [20]:
# a naive way
# url = 'https://www.youtube.com/watch?v=9bZkp7q19f0'
# url = 'http://google.com/favicon.ico'
# url = 'https://automatetheboringstuff.com/files/rj.txt'
url = 'https://www.andrew.cmu.edu/user/lakoglu/postdoc@Heinz_files/image001.jpg'
response = requests.get(url, allow_redirects=True)
print(response.headers.get('content-type'))

image/jpeg


It works, but it is not the best approach because it downloads the whole file just to check the header.

So if the file is large, this will do nothing but waste bandwidth.

If you're concerned about bandwidth you can fetch just the header and skip over resources that don't match the content you're trying to download.

In [25]:
# define a method is_downloadable that checks for specific content types
def is_downloadable(url):
    """
    Does the url contain a downloadable resource
    """
    h = requests.head(url, allow_redirects=True)
    header = h.headers
    # the header may be missing; fall back to '' so .lower() cannot fail
    content_type = header.get('content-type') or ''
    if 'text' in content_type.lower():
        return False
    if 'html' in content_type.lower():
        return False
    return True

In [23]:
print(is_downloadable('https://www.youtube.com/watch?v=9bZkp7q19f0'))

False


In [24]:
print(is_downloadable('http://google.com/favicon.ico'))

True
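
The substring checks above can misfire, since a Content-Type value often carries parameters (for example `text/html; charset=UTF-8`) and some non-text types contain the word "text" in their subtype. As a rough sketch, assuming we only want to reject `text/*` types and anything with `html` in the subtype, the header value can be split into its media type first. The helper names `media_type` and `looks_like_media` are hypothetical, and treating a missing header as "not downloadable" is an assumption you may want to change.

```python
def media_type(content_type):
    """Return the bare, lower-cased media type, or None if the header is missing."""
    if not content_type:
        return None
    # drop parameters such as "; charset=UTF-8"
    return content_type.split(';', 1)[0].strip().lower()

def looks_like_media(content_type):
    """Reject text/* types and HTML-like subtypes; accept everything else."""
    mtype = media_type(content_type)
    if mtype is None:
        return False  # assumption: an unknown type is not worth downloading
    main, _, sub = mtype.partition('/')
    return main != 'text' and 'html' not in sub

print(looks_like_media('text/html; charset=UTF-8'))
print(looks_like_media('image/x-icon'))
```

Inside `is_downloadable`, the value passed in would be `h.headers.get('content-type')`.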


To restrict download by file size, we can get the filesize from the **Content-Length** header and then do suitable comparisons.

In [44]:
# define a method is_downloadable that checks for specific content types with a max size
def is_downloadable(url, max_length=None): 
    """
    Does the url contain a downloadable resource
    """
    h = requests.head(url, allow_redirects=True)
    header = h.headers
    
    content_length = header.get('content-length', None)
    print('content-length: {0}, max_length: {1}'.format(content_length, max_length))
    if max_length and content_length and int(content_length) > max_length:  
        return False
    
    # the header may be missing; fall back to '' so .lower() cannot fail
    content_type = header.get('content-type') or ''
    if 'text' in content_type.lower():
        return False
    if 'html' in content_type.lower():
        return False
    return True


So using the above function, we can skip downloading urls which don't link to media.

In [45]:
url = 'https://www.heinz.cmu.edu/heinz-shared/_files/img/life-at-heinz-college-images/hamburg-hall-media-component-image.jpg'
print(is_downloadable(url))
print(is_downloadable(url, max_length=2048 * 200))  # 409,600 bytes (400 KiB); 200 MB would be 200 * 1024 * 1024
print(is_downloadable(url, max_length=9400000))  # about 9.4 MB

content-length: 9399746, max_length: None
True
content-length: 9399746, max_length: 409600
False
content-length: 9399746, max_length: 9400000
True
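
Servers do not always send a Content-Length header in response to a HEAD request, and the value is a string that is not guaranteed to be numeric. The sketch below isolates the size comparison in a hypothetical helper, `exceeds_limit`, that works on any dict-like headers object. Letting a response through when its size is unknown is an assumption, and the plain dicts stand in for `h.headers` so the example runs without network access.

```python
MB = 1024 * 1024  # bytes per mebibyte, to keep limits readable

def exceeds_limit(headers, max_length):
    """True only if Content-Length is present, numeric, and above max_length."""
    if max_length is None:
        return False
    raw = headers.get('content-length')
    try:
        return int(raw) > max_length
    except (TypeError, ValueError):
        # header missing or not a number: size unknown, so do not reject
        return False

headers = {'content-length': '9399746'}  # value taken from the output above
print(exceeds_limit(headers, 200 * MB))
print(exceeds_limit(headers, 409600))
print(exceeds_limit({}, 409600))
```

Note that a plain dict is case-sensitive, while the headers object returned by `requests` is not.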


## Getting filename from URL

We can parse the url to get the filename.
Example - <http://aviaryan.in/images/profile.png>.

To extract the filename from the above URL, we can write a routine that takes everything after the last forward slash (/).

In [46]:
url = 'http://aviaryan.in/images/profile.png'
if '/' in url:  # url.find('/') returns -1 (truthy) when there is no slash, so test membership instead
    print(url.rsplit('/', 1)[1])

profile.png


In [48]:
url = 'http://aviaryan.in/images/profile.png'
if '/' in url:
    parts = url.rsplit('/', 1)
    name = parts[1]
    print(parts, name)

['http://aviaryan.in/images', 'profile.png'] profile.png


In [49]:
url = 'http://aviaryan.in/images/profile.png'
if '/' in url:
    parts = url.split('/')
    name = parts[-1]
    print(parts, name)

['http:', '', 'aviaryan.in', 'images', 'profile.png'] profile.png
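
Splitting on `/` keeps any query string or percent-encoding in the result. As an alternative sketch using only the standard library, the URL can be parsed first so that only its path component is considered. The helper name `filename_from_url` and the `example.com` URLs are made up for illustration.

```python
import posixpath
from urllib.parse import urlparse, unquote

def filename_from_url(url):
    """Return the last path segment of the URL, decoded, or None if there is none."""
    path = urlparse(url).path         # drops scheme, host, query string and fragment
    name = posixpath.basename(path)   # text after the last '/'
    return unquote(name) or None      # decode %20 etc.; empty string becomes None

print(filename_from_url('http://aviaryan.in/images/profile.png'))
print(filename_from_url('http://example.com/files/report%20final.pdf?t=1472465722965'))
print(filename_from_url('http://example.com/'))
```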


This gives the correct filename in some cases. However, there are times when the filename information is not present in the URL.

An example is something like <http://url.com/download>. In that case, the **Content-Disposition** header, if the server sends one, may contain the filename information.

Here is how to fetch it.

In [53]:
import re

def get_filename_from_cd(content_disposition):
    """
    Get filename from content-disposition
    """
    if not content_disposition:
        return None
    fname = re.findall('filename=(.+)', content_disposition)
    if len(fname) == 0:
        return None
    return fname[0]

In [57]:
url = 'https://www.atkearney.com/documents/10192/565528/Carnegie-Mellon-Heinz-College-Logo.png/f68f4055-317c-4c29-affc-ff44baf9b84c?t=1472465722965'
# url = 'https://c1.staticflickr.com/1/326/20106051581_04627c3af2_b.jpg'
response = requests.get(url, allow_redirects=True)
# print(response.headers)
content_disposition = response.headers.get('content-disposition')
print('content-disposition: {0}'.format(content_disposition))
filename = get_filename_from_cd(content_disposition)
print('filename: {0}'.format(filename))
# open(filename, 'wb').write(response.content)

content-disposition: inline; filename="Carnegie-Mellon-Heinz-College-Logo.png"
filename: "Carnegie-Mellon-Heinz-College-Logo.png"
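
Notice that the extracted filename still includes the surrounding double quotes, which would end up in the name of the saved file. One possible way to avoid that, sketched here under the assumption that the header follows the usual `name="value"` parameter syntax, is to let the standard library's `email.message.Message` parse the header. The helper name `filename_from_cd` is hypothetical and is not a replacement the original author proposed.

```python
from email.message import Message

def filename_from_cd(content_disposition):
    """Return the filename parameter of a Content-Disposition value, unquoted."""
    if not content_disposition:
        return None
    msg = Message()
    msg['Content-Disposition'] = content_disposition
    # get_filename() reads the filename parameter and strips the quotes
    return msg.get_filename()

print(filename_from_cd('inline; filename="Carnegie-Mellon-Heinz-College-Logo.png"'))
print(filename_from_cd('attachment'))
```

A filename taken from a header is still untrusted input, so it is worth checking it before using it as a path on disk.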
