Data Science Project: Phage Lysin Sequence Alignment Analysis 

With antibiotic resistance on the rise in bacteria, including challenging pathogens such as Mycobacteria, finding alternative treatments for patients is increasingly urgent. An emerging approach is to use phage lysins to target and eradicate specific bacterial infections. Lysins are enzymes produced by bacteriophages. They possess domains that bind to bacterial cell wall components and hydrolyze them, ultimately killing the host bacterium. Understanding the conservation and variability of lysin amino acid sequences, especially in the context of mycobacterial infections, is crucial for developing effective lysin-based therapies. In this project, we will identify conserved regions, assess sequence similarity, and uncover evolutionary patterns to inform the development of targeted antibacterial therapies, focusing specifically on mycobacterial skin and lung infections.

Data Wrangling

In [1]:
# Import Modules
import requests
import json
import re

Data Collection

In [9]:
# Fetch genes data from phagesdb API

# URL of the API endpoint
url = "https://phagesdb.org/api/genes/"

# Define the chunk size
chunk_size = 10000  # You can adjust this value based on your needs

# Initialize variables for pagination
page = 1
total_records = None

# Open the JSON file in write mode
with open('phagesdb_genes.json', 'w') as file:
    # Write an opening bracket to indicate the start of a JSON array
    file.write("[\n")
   
    # Fetch data in chunks until all records are retrieved
    while total_records is None or (page - 1) * chunk_size < total_records:
        # Make a request to the API with pagination parameters
        # (the timeout keeps a stalled connection from hanging the loop)
        response = requests.get(url, params={'page': page, 'page_size': chunk_size}, timeout=60)
       
        # Check if the request was successful
        if response.status_code == 200:
            # Convert the response to JSON format
            data = response.json()
           
            # Update the total number of records if it's not set yet
            if total_records is None:
                total_records = data['count']
           
            # Write a comma before every chunk except the first, so the file
            # stays valid JSON even if a later request fails
            if page > 1:
                file.write(",\n")
           
            # Write the fetched data from the current page to the JSON file
            json.dump(data['results'], file)
           
            # Print a message to indicate progress
            print(f"Fetched {len(data['results'])} records (Chunk {page})")
        else:
            # Print an error message if the request was not successful
            print(f"Error fetching data: {response.status_code}")
            break
       
        # Increment the page number for the next request
        page += 1
   
    # Write a closing bracket to indicate the end of the JSON array
    file.write("\n]\n")

# Print a message to indicate successful completion
print("Data fetched and saved successfully!")


Fetched 10000 records (Chunk 1)
Fetched 10000 records (Chunk 2)
Fetched 10000 records (Chunk 3)
Fetched 10000 records (Chunk 4)
Fetched 10000 records (Chunk 5)
Fetched 10000 records (Chunk 6)
Fetched 10000 records (Chunk 7)
Fetched 10000 records (Chunk 8)
Fetched 10000 records (Chunk 9)
Fetched 10000 records (Chunk 10)
Fetched 10000 records (Chunk 11)
Fetched 10000 records (Chunk 12)
Fetched 10000 records (Chunk 13)
Fetched 10000 records (Chunk 14)
Fetched 10000 records (Chunk 15)
Fetched 10000 records (Chunk 16)
Fetched 10000 records (Chunk 17)
Fetched 10000 records (Chunk 18)
Fetched 10000 records (Chunk 19)
Fetched 10000 records (Chunk 20)
Fetched 10000 records (Chunk 21)
Fetched 10000 records (Chunk 22)
Fetched 10000 records (Chunk 23)
Fetched 10000 records (Chunk 24)
Fetched 10000 records (Chunk 25)
Fetched 10000 records (Chunk 26)
Fetched 10000 records (Chunk 27)
Fetched 10000 records (Chunk 28)
Fetched 10000 records (Chunk 29)
Fetched 10000 records (Chunk 30)
Fetched 10000 recor
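
The cell above stores each page as its own list inside an outer JSON array, so the file has to be flattened before use. A minimal sketch of one alternative, not the approach used in this notebook: write one record per line (JSON Lines), which needs no brackets or commas and stays readable if the download stops early. The helper names `iter_pages`, `save_as_json_lines`, and `fetch_page_from_api`, and the file name `phagesdb_genes.jsonl`, are hypothetical. The `count` and `results` fields are the ones read in the cell above.

```python
import json


def iter_pages(fetch_page, page_size):
    """Yield each page's list of records until `count` records have been seen."""
    page, seen, total = 1, 0, None
    while total is None or seen < total:
        payload = fetch_page(page, page_size)
        if total is None:
            total = payload["count"]
        results = payload["results"]
        if not results:
            # Stop instead of looping forever if the reported count is stale
            break
        seen += len(results)
        yield results
        page += 1


def save_as_json_lines(pages, path):
    """Write one JSON object per line and return how many were written."""
    written = 0
    with open(path, "w") as handle:
        for records in pages:
            for record in records:
                handle.write(json.dumps(record) + "\n")
                written += 1
    return written


def fetch_page_from_api(page, page_size, url="https://phagesdb.org/api/genes/"):
    """Fetch one page from the API; raises on HTTP errors instead of printing."""
    import requests  # imported here so the helpers above work without it

    response = requests.get(
        url, params={"page": page, "page_size": page_size}, timeout=60
    )
    response.raise_for_status()
    return response.json()
```

Under these assumptions, `save_as_json_lines(iter_pages(fetch_page_from_api, 10000), "phagesdb_genes.jsonl")` would perform the download, and each line of the result can be parsed on its own with `json.loads`.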

In [10]:
# Combine outer lists generated from chunk
from itertools import chain

# Read the JSON file
with open('phagesdb_genes.json', 'r') as file:
    data = json.load(file)

# Flatten the list of per-page lists into one list of gene records
# (chain.from_iterable runs in linear time, unlike sum(data, []))
combined_data = list(chain.from_iterable(data))

# Determine the number of gene records (dictionaries) in the final JSON data
num_records = len(combined_data)
print("Number of records in the final JSON data:", num_records)

# Define the file path for the new JSON file
output_file = 'combined_data_output.json'

# Write the combined JSON data to the new file
with open(output_file, 'w') as file:
    json.dump(combined_data, file)

print("Combined JSON data saved to", output_file)

Number of records in the final JSON data: 481205
Combined JSON data saved to combined_data_output.json


Data Definition and Cleaning

In [11]:
# Count number of dictionaries, lists, and elements in json file

# Read the JSON file
with open('combined_data_output.json', 'r') as file:
    data = json.load(file)

# Initialize counters for dictionaries, lists, and total elements
num_dicts = 0
num_lists = 0
num_elements = 0

# Function to recursively count elements
def count_elements(data):
    global num_dicts, num_lists, num_elements
    if isinstance(data, dict):
        num_dicts += 1
        for value in data.values():
            count_elements(value)
    elif isinstance(data, list):
        num_lists += 1
        for item in data:
            count_elements(item)
    else:
        num_elements += 1

# Call the function to count elements
count_elements(data)

print("Number of dictionaries:", num_dicts)
print("Number of lists:", num_lists)
print("Total number of elements:", num_elements)


Number of dictionaries: 962410
Number of lists: 481206
Total number of elements: 6736870
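
These totals are consistent with every one of the 481205 records containing one nested dictionary and one list: 962410 = 2 × 481205 dictionaries, and 481206 = 481205 + 1 lists once the outer list is counted. The totals alone do not show which keys hold the nested values. A minimal sketch of one way to find out is to tally the value types seen under each key. The function name `value_types_by_key` and the sample records are hypothetical, and the sample values are made up.

```python
from collections import Counter


def value_types_by_key(records):
    """Map each key to a Counter of the Python type names of its values."""
    types = {}
    for record in records:
        for key, value in record.items():
            types.setdefault(key, Counter())[type(value).__name__] += 1
    return types


# Made-up records; which keys are nested in the real data is an assumption here
sample = [
    {"GeneID": "g1", "PhageID": {"PhageID": "p1"}, "phams": ["1"], "Length": 300},
    {"GeneID": "g2", "PhageID": {"PhageID": "p2"}, "phams": [], "Length": 450},
]

for key, counts in value_types_by_key(sample).items():
    print(key, dict(counts))
```

Running the same function on `data` would show, for each key, how many values are strings, numbers, lists, or dictionaries.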


In [23]:
# Get keys of first dictionary in json

# Access the first dictionary (the JSON data is a list of dictionaries)
first_dict = data[0]

# Get the keys of the first dictionary
keys = first_dict.keys()

# Print the keys
print("Keys in the first dictionary:")
for key in keys:
    print(key)

Keys in the first dictionary:
GeneID
PhageID
phams
Start
Stop
Length
Name
translation
Orientation
Notes


In [12]:
# Check for Missing Values in json

missing_values = {}
for idx, item in enumerate(data):
    for key, value in item.items():
        if value is None:
            if key not in missing_values:
                missing_values[key] = []
            missing_values[key].append(idx)

if missing_values:
    print("Missing values found:")
    for key, indices in missing_values.items():
        print(f"Key: {key}, Missing in indices: {indices}")
else:
    print("No missing values found.")


No missing values found.
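
The check above only flags values that are exactly `None`. It cannot see a key that is absent from a record, and it treats an empty string as present. A minimal sketch of a broader check, assuming every record is expected to carry the ten keys printed for the first dictionary. The names `EXPECTED_KEYS`, `is_blank`, and `count_blank_fields` are hypothetical, and the sample records are made up.

```python
from collections import Counter

# Keys printed for the first dictionary; assumed to apply to every record
EXPECTED_KEYS = (
    "GeneID", "PhageID", "phams", "Start", "Stop",
    "Length", "Name", "translation", "Orientation", "Notes",
)


def is_blank(value):
    """True for None, whitespace-only strings, and empty lists or dicts."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False  # numbers, including 0, count as present


def count_blank_fields(records, expected_keys=EXPECTED_KEYS):
    """Count, per key, the records where the key is absent or blank."""
    blanks = Counter()
    for record in records:
        for key in expected_keys:
            if is_blank(record.get(key)):
                blanks[key] += 1
    return blanks


# Made-up records checked against a shortened key list
sample = [
    {"GeneID": "g1", "Notes": "lysin A", "Start": 0},
    {"GeneID": "g2", "Notes": "  "},
    {"GeneID": None, "Notes": "x", "Start": 5},
]
print(count_blank_fields(sample, expected_keys=("GeneID", "Notes", "Start")))
```

A blank `Notes` or `translation` field matters for the later steps, because such a record cannot match the lysin filter or contribute a sequence.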


In [16]:
# Filter and save any dictionaries that contain Lysin

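# Compile the pattern once; it matches any text containing "lysin"
# (for example "endolysin"), ignoring case
lysin_pattern = re.compile(r'lysin', re.IGNORECASE)

# Function to check if any form of "lysin" is present in the Notes
def contains_lysin(notes):
    # Treat missing or non-string Notes as "no match" instead of raising TypeError
    return isinstance(notes, str) and bool(lysin_pattern.search(notes))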

# Filter for dictionaries where the Notes key contains any form of "lysin"
lysin_dictionaries = []

print("Filtering dictionaries...")

for item in data:
    notes = item.get('Notes', '')
    if contains_lysin(notes):
        lysin_dictionaries.append(item)

print("Filtering complete.")

# Specify the path to the new JSON file
output_file_path = 'lysin_dictionaries.json'

# Write the filtered dictionaries to the new JSON file
print("Saving filtered dictionaries...")

with open(output_file_path, 'w') as output_file:
    json.dump(lysin_dictionaries, output_file)

num_saved_dictionaries = len(lysin_dictionaries)
print(f"Filtered dictionaries saved to '{output_file_path}'")
print(f"Number of dictionaries saved: {num_saved_dictionaries}")


Filtering dictionaries...
Filtering complete.
Saving filtered dictionaries...
Filtered dictionaries saved to 'lysin_dictionaries.json'
Number of dictionaries saved: 7191
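
Because the pattern matches any `Notes` text containing "lysin", the 7191 hits may mix several different annotations. A minimal sketch for inspecting them is to tally the `Notes` strings after lower-casing and collapsing whitespace. The function name `tally_notes` is hypothetical, and the sample annotations are made up rather than taken from the dataset.

```python
from collections import Counter


def tally_notes(records):
    """Count Notes strings after lower-casing and collapsing whitespace."""
    counts = Counter()
    for record in records:
        notes = record.get("Notes") or ""
        counts[" ".join(str(notes).lower().split())] += 1
    return counts


# Made-up annotations for illustration only
sample = [
    {"Notes": "Lysin A"},
    {"Notes": "lysin  a"},
    {"Notes": "lysin B"},
    {"Notes": "endolysin"},
]

for label, count in tally_notes(sample).most_common():
    print(f"{count:>5}  {label}")
```

Applied to `lysin_dictionaries`, the tally would show which annotation groups are present, so that each group can be aligned separately if they turn out to be distinct protein families.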


In [17]:
# Count dictionaries, lists, and elements in filtered json

with open('lysin_dictionaries.json', 'r') as file:
    data = json.load(file)

# Initialize counters for dictionaries, lists, and total elements
num_dicts = 0
num_lists = 0
num_elements = 0

# Function to recursively count elements
def count_elements(data):
    global num_dicts, num_lists, num_elements
    if isinstance(data, dict):
        num_dicts += 1
        for value in data.values():
            count_elements(value)
    elif isinstance(data, list):
        num_lists += 1
        for item in data:
            count_elements(item)
    else:
        num_elements += 1

# Call the function to count elements
count_elements(data)

print("Number of dictionaries:", num_dicts)
print("Number of lists:", num_lists)
print("Total number of elements:", num_elements)


Number of dictionaries: 14382
Number of lists: 7192
Total number of elements: 100674
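
Sequence alignment tools generally take FASTA input, so a natural next step is to export the `translation` field of each filtered record. A minimal sketch, assuming `GeneID` is suitable as a sequence identifier and `translation` holds the amino acid sequence as a plain string. The function name `to_fasta` and the sample records are hypothetical.

```python
def to_fasta(records, width=60):
    """Format records as FASTA text, skipping records without a sequence."""
    lines = []
    for record in records:
        sequence = record.get("translation") or ""
        if not sequence:
            continue
        lines.append(f">{record['GeneID']}")
        # Wrap the sequence at a fixed line width
        lines.extend(
            sequence[i:i + width] for i in range(0, len(sequence), width)
        )
    return "".join(line + "\n" for line in lines)


# Made-up records for illustration only
sample = [
    {"GeneID": "gene_1", "translation": "M" * 70},
    {"GeneID": "gene_2", "translation": ""},
]
print(to_fasta(sample), end="")
```

Writing `to_fasta(data)` to a file would produce input for an alignment program. Records with an empty `translation` are skipped rather than written as empty entries.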
