First we need to get all the PDF files from the Monthly immigrant visa issuance statistics page here: 

https://travel.state.gov/content/travel/en/legal/visa-law0/visa-statistics/immigrant-visa-statistics/monthly-immigrant-visa-issuances.html 

We want this file to do the following:

- get files of the type: March 2017 - IV Issuances by FSC or Place of Birth and Visa Class
    - FSC or Place of Birth basically equals country
- output the PDFs to a single folder
- save the list of links to a CSV file
- bonus: check the CSV file for an existing list of files downloaded 
- - (assuming that all the files downloaded correctly)
- - (skip the files in that list when running that update)

In [19]:
import os 
import pandas as pd
import time
import requests
from bs4 import BeautifulSoup as bs
import json
import random


In [2]:
base_url = 'https://travel.state.gov/content/travel/en/legal/visa-law0/visa-statistics/immigrant-visa-statistics/monthly-immigrant-visa-issuances.html'



In [7]:
# Fetch the HTML content
response = requests.get(base_url)
html_content = response.content

# Parse the HTML using BeautifulSoup
soup = bs(html_content, 'lxml')

for link in soup.findAll('a'):
    if link.get('href'): #checks for the presence of an href
        if 'pdf' in link.get('href') and 'FSC' in link.get('href'):
            print(link.get('href'))
            

/content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/March%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class%20-%20Worldwide.pdf
/content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/APRIL%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
/content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/MAY%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
/content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/JUNE%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
/content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/JULY%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
/content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/AUGUST%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
/content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/SEPTEMBER%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
/co

You will notice in the printout above that the links are of the following form:

/content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/DECEMBER%202023%20-%20IV%20Issuances%20by%20FSC%20or%20Place%20of%20Birth%20and%20Visa%20Class.pdf

The actual links are of the form:

https://travel.state.gov/content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/March%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class%20-%20Worldwide.pdf 

This means that we need to prefix " https://travel.state.gov/ " to each entry

In [4]:
target_links = [] # initialize empty list to take in URLs from page
link_prefix = 'https://travel.state.gov/'
for link in soup.findAll('a'):
    if link.get('href'): #checks for the presence of an href
        if 'pdf' in link.get('href') and 'FSC' in link.get('href'):
                target_links.append(f"{link_prefix}{link.get('href')}")
print(target_links)

['https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/March%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class%20-%20Worldwide.pdf', 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/APRIL%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf', 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/MAY%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf', 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/JUNE%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf', 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/JULY%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf', 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/AUGUST%202017%20-%20IV%20Issuances%20by%20FSC%20and

Next we want to:
- loop through this list of links
- put in a time delay between each link
- download the file into a folder
- add the list of target files into a json file so that we can record what files have already been downloaded
- 

first we save the list of files into a json file. 
For our purposes, a CSV file would work better as the list of links isn't nested.

In [5]:
filename = 'target_links.json'

# Write the target_links list to a JSON file
with open(filename, 'w') as f:
    json.dump(target_links, f)

print(f"Target links have been saved to {filename}")

Target links have been saved to target_links.json


Or we could save the links to a csv file, or better yet, a tsv (tab separated file).
This is because issues come up often with commas in files

In [6]:
filename = 'target_links.tsv'

# Write the target_links list to a TSV file using the \t delimiter
with open(filename, 'w') as f:
    for link in target_links:
        f.write(f"{link}\t\n")

print(f"Target links have been saved to {filename}")

Target links have been saved to target_links.tsv


The next step is to set up the folder for downloading the files. 

If the folder does not exist: 
- we can create the folder and then
- download all the files in the target_links.tsv file.

In [8]:
folder_path = 'pdf_monthly_immigrant_visa_class_country'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)




If the folder does exist, assuming that this notebook has been run before:
- we can create a new file with all the links and then check against target_links.tsv
- - Let's call this completed_target_links.tsv
- if the link exists in target_links.cst, we can skip downloading the file
- if the link doesn't exist in target_links.tsv, we can download the file

At the end we can overwrite the target_links.tsv file entries into the completed_target_links.tsv file.
This assumes that we have downloaded each file into the folder with no errors.


In [27]:
file_name = "completed_target_links.tsv"
completed_target_links = []
if os.path.exists(file_name):
    print(f"The file '{file_name}' exists in the current directory.")
    with open(file_name, 'r') as f:
        # Read each line in the file
        for line in f:
            # Split the line by tabs to extract individual values
            values = line.strip().split('\t')
            # Assuming the link is the first column, you can append it to the list
            completed_target_links.append(values[0])  # Adjust the index if the link is in a different column
    print("List of links from the TSV file:")
    print(completed_target_links)
else:
    print(f"The file '{file_name}' does not exist in the current directory.")

The file 'completed_target_links.tsv' exists in the current directory.
List of links from the TSV file:
['https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/March%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class%20-%20Worldwide.pdf', 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/APRIL%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf', 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/MAY%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf', 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/JUNE%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf', 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/JULY%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf', 'https://travel.state.gov//content/dam/visa

In [28]:
completed_target_links

['https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/March%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class%20-%20Worldwide.pdf',
 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/APRIL%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf',
 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/MAY%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf',
 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/JUNE%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf',
 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/JULY%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf',
 'https://travel.state.gov//content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/AUGUST%202017%20-%20IV%20Issuances%20by%20FSC%

In [29]:
if 'completed_target_links' in locals():
    print('yes')
    for link in target_links:
        if link in completed_target_links:
            print(f"File already downloaded: {filename}")
        else:
            filename = os.path.basename(link)
            # Create the full file path to save the file
            file_path = os.path.join(folder_path, filename)
            
            # Download the file
            response = requests.get(link)
            
            # Check if the request was successful (status code 200)
            if response.status_code == 200:
                # Save the file
                with open(file_path, 'wb') as file:
                    file.write(response.content)
                print(f"File downloaded successfully: {filename}")
            else:
                print(f"Failed to download file: {filename}")
            
            # Introduce a delay of 0.5 to 3 seconds after each download
            time.sleep(random.uniform(0.5, 3.0))
            
else:
    print('no')
    for link in target_links:
        filename = os.path.basename(link)
        # Create the full file path to save the file
        file_path = os.path.join(folder_path, filename)
        
        # Download the file
        response = requests.get(link)
        
        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Save the file
            with open(file_path, 'wb') as file:
                file.write(response.content)
            print(f"File downloaded successfully: {filename}")
        else:
            print(f"Failed to download file: {filename}")
        
        # Introduce a delay of 0.5 to 3 seconds after each download
        time.sleep(random.uniform(0.5, 3.0))

yes
File already downloaded: NOVEMBER%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
File already downloaded: NOVEMBER%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
File already downloaded: NOVEMBER%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
File already downloaded: NOVEMBER%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
File already downloaded: NOVEMBER%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
File already downloaded: NOVEMBER%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
File already downloaded: NOVEMBER%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
File already downloaded: NOVEMBER%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
File already downloaded: NOVEMBER%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
File already downloaded: NOVEMBER%202017%20-%20IV%20Issuances%20by%20FSC%20and%20Visa%20Class.pdf
File already dow

If there are no errors in downloading the file, save the links in target_links to completed_target_links.tsv

In [21]:
filename = 'completed_target_links.tsv'

# Write the target_links list to a TSV file using the \t delimiter
with open(filename, 'w') as f:
    for link in target_links:
        f.write(f"{link}\t\n")

print(f"Target links have been saved to {filename}")

Target links have been saved to completed_target_links.tsv
