Data Science Project: Phage Lysin Sequence Alignment Analysis 

With antibiotic resistance on the rise in bacteria, including challenging pathogens such as Mycobacteria, finding alternative treatments for patients is increasingly urgent. One emerging approach is to use phage lysins to target and eradicate specific bacterial infections. Lysins are enzymes produced by bacteriophages; they possess domains that bind to bacterial cell wall components and trigger hydrolysis, ultimately killing the host bacterium. Understanding the conservation and variability of lysin amino acid sequences, especially in the context of Mycobacteria infections, is crucial for developing effective therapeutic lysin treatments. In this project, we will identify conserved regions, assess sequence similarity, and uncover evolutionary patterns to inform the development of targeted antibacterial therapies, focusing specifically on Mycobacteria skin and lung infections.

Data Wrangling

In [1]:
# Import Modules
import requests
import json
import re

Data Collection

Data Definition and Cleaning

In [2]:
# Load the combined_data_output.json file
input_file_path = 'combined_data_output.json'

with open(input_file_path, 'r') as input_file:
    data = json.load(input_file)

# Function to check if any form of "lysin" is present in the Notes
def contains_lysin(notes):
    lysin_pattern = re.compile(r'lysin', re.IGNORECASE)
    return bool(lysin_pattern.search(notes))

# Filter for dictionaries where the Notes key contains any form of "lysin"
lysin_dictionaries = []

print("Filtering dictionaries...")

for item in data:
    notes = item.get('Notes', '')
    if contains_lysin(notes):
        lysin_dictionaries.append(item)

print("Filtering complete.")

# Specify the path to the new JSON file
output_file_path = 'lysin_dictionaries.json'

# Write the filtered dictionaries to the new JSON file
print("Saving filtered dictionaries...")

with open(output_file_path, 'w') as output_file:
    json.dump(lysin_dictionaries, output_file)

num_saved_dictionaries = len(lysin_dictionaries)
print(f"Filtered dictionaries saved to '{output_file_path}'")
print(f"Number of dictionaries saved: {num_saved_dictionaries}")

# Load the combined_data_output.json file
input_file_path = 'combined_data_output.json'

with open(input_file_path, 'r') as input_file:
    combined_data_output = json.load(input_file)

Filtering dictionaries...
Filtering complete.
Saving filtered dictionaries...
Filtered dictionaries saved to 'lysin_dictionaries.json'
Number of dictionaries saved: 7191


In [3]:
# Un-nest PhageID dictionary in json file 

# Load the JSON data
with open('lysin_dictionaries.json', 'r') as file:
    data = json.load(file)

# Iterate through each gene entry
for gene_entry in data:
    # Unpack the PhageID dictionary
    phage_id_dict = gene_entry.pop('PhageID')
    for key, value in phage_id_dict.items():
        # Add each key-value pair to the outer dictionary
        gene_entry[key] = value

# Save the modified data to a new JSON file
output_file_path = 'lysin_unnested_data.json'
with open(output_file_path, 'w') as output_file:
    json.dump(data, output_file, indent=4)

print("Unnested data saved to:", output_file_path)


Unnested data saved to: lysin_unnested_data.json


In [4]:
# Investigate contents of json file

# Open the JSON file and load the data
with open('lysin_unnested_data.json', 'r') as file:
    data = json.load(file)

# Initialize counters for dictionaries, lists, and total elements
num_dicts = 0
num_lists = 0
num_elements = 0

# Function to recursively count elements
def count_elements(data):
    global num_dicts, num_lists, num_elements
    if isinstance(data, dict):
        num_dicts += 1
        # Check if the key 'key' exists in the dictionary
        if 'key' in data:
            print("Type of 'key':", type(data['key']))
        for value in data.values():
            count_elements(value)
    elif isinstance(data, list):
        num_lists += 1
        for item in data:
            count_elements(item)
    else:
        num_elements += 1

# Call the function to count elements
count_elements(data)

# Print the counts
print("Number of dictionaries:", num_dicts)
print("Number of lists:", num_lists)
print("Total number of elements:", num_elements)

Number of dictionaries: 7191
Number of lists: 7192
Total number of elements: 93483


In [5]:
# Print object types of keys in the first entry

# Get the first entry
first_entry = data[0]

for key, value in first_entry.items():
    print("Key:", key, "| Type:", type(value))

Key: GeneID | Type: <class 'str'>
Key: phams | Type: <class 'list'>
Key: Start | Type: <class 'int'>
Key: Stop | Type: <class 'int'>
Key: Length | Type: <class 'int'>
Key: Name | Type: <class 'str'>
Key: translation | Type: <class 'str'>
Key: Orientation | Type: <class 'str'>
Key: Notes | Type: <class 'str'>
Key: PhageID | Type: <class 'str'>
Key: Accession | Type: <class 'str'>
Key: HostStrain | Type: <class 'str'>
Key: Cluster | Type: <class 'str'>


In [6]:
# Print number of values for each key 

# Dictionary to store total counts for each key
total_counts = {}

# Iterate through each key
for key in data[0].keys():
    # Initialize count for the key
    count = 0
    # Iterate through each entry in the data
    for entry in data:
        # Check if the value for the key is not missing
        if entry[key] != '' and entry[key] is not None:
            count += 1
    # Store the count for the key
    total_counts[key] = count

# Print the total counts for each key
for key, count in total_counts.items():
    print(f"Key: {key} | Total Value Count: {count}")


Key: GeneID | Total Value Count: 7191
Key: phams | Total Value Count: 7191
Key: Start | Total Value Count: 7191
Key: Stop | Total Value Count: 7191
Key: Length | Total Value Count: 7191
Key: Name | Total Value Count: 7191
Key: translation | Total Value Count: 7191
Key: Orientation | Total Value Count: 7191
Key: Notes | Total Value Count: 7191
Key: PhageID | Total Value Count: 7191
Key: Accession | Total Value Count: 7181
Key: HostStrain | Total Value Count: 7191
Key: Cluster | Total Value Count: 7126


In [7]:
# Print missing values for each key

# Dictionary to store missing value counts
missing_counts = {}

# Iterate through each key
for key in data[0].keys():
    # Initialize count for the key
    count = 0
    # Iterate through each entry in the data
    for entry in data:
        # Check if the value for the key is missing
        if entry[key] == '' or entry[key] is None:
            count += 1
    # Store the count for the key
    missing_counts[key] = count

# Print the missing value counts
for key, count in missing_counts.items():
    print(f"Key: {key} | Missing Value Count: {count}")

Key: GeneID | Missing Value Count: 0
Key: phams | Missing Value Count: 0
Key: Start | Missing Value Count: 0
Key: Stop | Missing Value Count: 0
Key: Length | Missing Value Count: 0
Key: Name | Missing Value Count: 0
Key: translation | Missing Value Count: 0
Key: Orientation | Missing Value Count: 0
Key: Notes | Missing Value Count: 0
Key: PhageID | Missing Value Count: 0
Key: Accession | Missing Value Count: 10
Key: HostStrain | Missing Value Count: 0
Key: Cluster | Missing Value Count: 65


Cluster information is sometimes unknown and is missing for 65 entries in this dataset (Accession is also missing for 10 entries). The missing Cluster values will not affect the downstream analysis.
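
The missing-value tally above can also be computed for all keys in a single pass over the entries. The following is a minimal sketch, assuming each entry is a flat dictionary and that a value counts as missing when it is an empty string or None, as in the cell above. The names `count_missing` and `sample_entries` are hypothetical, and the sample data is invented purely for illustration.

```python
from collections import Counter

def count_missing(entries):
    """Count, per key, how many entries hold '' or None for that key."""
    missing = Counter()
    for entry in entries:
        for key, value in entry.items():
            if value == '' or value is None:
                missing[key] += 1
    return missing

# Invented sample data (not taken from the real dataset)
sample_entries = [
    {"GeneID": "g1", "Accession": "A1", "Cluster": "A"},
    {"GeneID": "g2", "Accession": "", "Cluster": None},
    {"GeneID": "g3", "Accession": "A3", "Cluster": ""},
]

missing = count_missing(sample_entries)
for key, count in missing.items():
    print(f"Key: {key} | Missing Value Count: {count}")
```

A Counter returns 0 for keys that were never incremented, so keys without missing values can be looked up safely even though they are not printed by the loop.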

In [8]:
# Count Duplicate values for each key

# Dictionary to store duplicate value counts
duplicate_counts = {}

# Iterate through each key
for key in data[0].keys():
    # Initialize a set to store unique values for the key
    unique_values = set()
    # Initialize count for duplicates for the key
    count = 0
    # Iterate through each entry in the data
    for entry in data:
        # Check if the value for the key is a list
        if isinstance(entry[key], list):
            # Iterate through each element in the list
            for item in entry[key]:
                # Check if the item is a duplicate
                if item in unique_values:
                    count += 1
                else:
                    unique_values.add(item)
        else:
            # Check if the value for the key is a duplicate
            if entry[key] in unique_values:
                count += 1
            else:
                unique_values.add(entry[key])
    # Store the count for duplicates for the key
    duplicate_counts[key] = count

# Print the duplicate value counts
for key, count in duplicate_counts.items():
    print(f"Key: {key} | Duplicate Value Count: {count}")


Key: GeneID | Duplicate Value Count: 0
Key: phams | Duplicate Value Count: 7007
Key: Start | Duplicate Value Count: 2227
Key: Stop | Duplicate Value Count: 2284
Key: Length | Duplicate Value Count: 6778
Key: Name | Duplicate Value Count: 2962
Key: translation | Duplicate Value Count: 3464
Key: Orientation | Duplicate Value Count: 7189
Key: Notes | Duplicate Value Count: 7128
Key: PhageID | Duplicate Value Count: 2962
Key: Accession | Duplicate Value Count: 2968
Key: HostStrain | Duplicate Value Count: 7179
Key: Cluster | Duplicate Value Count: 6889


Most importantly, each dictionary entry has a unique GeneID. It is not surprising that the other keys contain duplicate values.
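
Because the rest of the analysis relies on GeneID being unique, it may be worth checking that property explicitly rather than reading it off the duplicate counts. A minimal sketch, assuming every entry has a `GeneID` key; the helper `find_duplicate_ids` and the sample records are hypothetical and for illustration only.

```python
from collections import Counter

def find_duplicate_ids(entries, id_key="GeneID"):
    """Return the sorted list of ID values that occur in more than one entry."""
    counts = Counter(entry[id_key] for entry in entries)
    return sorted(value for value, n in counts.items() if n > 1)

# Invented sample records (not taken from the real dataset)
sample = [{"GeneID": "a_1"}, {"GeneID": "a_2"}, {"GeneID": "a_1"}]

duplicates = find_duplicate_ids(sample)
print("Duplicate GeneIDs:", duplicates)
```

Applied to the real data, an empty list would confirm the zero duplicate count reported for GeneID above.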

In [9]:
# Investigate notes field and count unique values

# Dictionary to store counts of different notes
notes_counts = {}

# Iterate through each entry in the data
for entry in data:
    # Extract the value of the "Notes" key
    notes = entry['Notes']
    # Check if the value is not empty
    if notes:
        # Update the count for the note value
        if notes in notes_counts:
            notes_counts[notes] += 1
        else:
            notes_counts[notes] = 1

# Print all unique note values and their counts
for note, count in notes_counts.items():
    print(f"Notes: {note} | Count: {count}")

Notes: b'lysin B' | Count: 2519
Notes: b'lysin A' | Count: 2772
Notes: b'endolysin' | Count: 1010
Notes: b'putative lysin A' | Count: 8
Notes: b'putative lysin B' | Count: 6
Notes: b'LysM-like endolysin' | Count: 28
Notes: b'lysin A, protease M23 domain' | Count: 63
Notes: b'lysin' | Count: 28
Notes: b'lysin A, N-acetylmuramoyl-L-alanine amidase domain' | Count: 100
Notes: b'lysin A, L-Ala-D-Glu peptidase domain' | Count: 156
Notes: b'lysin A, glycosyl hydrolase domain' | Count: 213
Notes: b'lysin A, protease C39 domain' | Count: 93
Notes: b'lysin A, N-acetylmuramoyl-L-alanine amidase' | Count: 1
Notes: b'lysin A, M23 peptidase domain' | Count: 2
Notes: b'lysin A, amidase domain' | Count: 3
Notes: b'endolysin, L-Ala-D-Glu peptidase domain' | Count: 41
Notes: b'endolysin, N-acetylmuramoyl-L-alanine amidase domain' | Count: 46
Notes: b'endolysin, protease M23 domain' | Count: 6
Notes: b'lysin A, N-acetylmuramoyl-L-alanine' | Count: 2
Notes: b'gp24, lysin' | Count: 1
Notes: b'lysin A, L-a

The Notes field contains many different variants of lysin annotations. To proceed with the analysis, we will determine how many entries fall into each of the following categories: lysin A, lysin B, endolysin, and other.
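
One way to express this categorization is an ordered list of patterns where the first match wins, so that every note receives exactly one label. The sketch below is an illustration under assumptions, not the method used in the cells that follow: `CATEGORY_PATTERNS` and `classify_note` are hypothetical names, and the regular expressions are stricter than a plain substring test (for example, "lysin amidase" is not treated as lysin A because the "a" must stand alone).

```python
import re
from collections import Counter

# Ordered (label, pattern) pairs: the first matching pattern wins, so each
# note gets exactly one label. The order mirrors the lysin A > lysin B >
# endolysin > other-lysin priority.
CATEGORY_PATTERNS = [
    ("lysin A", re.compile(r"lysin\s*a\b", re.IGNORECASE)),
    ("lysin B", re.compile(r"lysin\s*b\b", re.IGNORECASE)),
    ("endolysin", re.compile(r"endolysin", re.IGNORECASE)),
    ("putative lysin", re.compile(r"lysin", re.IGNORECASE)),
]

def classify_note(note):
    """Return the first category whose pattern occurs in the note, else 'other'."""
    for label, pattern in CATEGORY_PATTERNS:
        if pattern.search(note):
            return label
    return "other"

# Sample notes in the same style as the Notes values printed above
sample_notes = [
    "b'lysin A, protease M23 domain'",
    "b'lysin B'",
    "b'LysM-like endolysin'",
    "b'gp24, lysin'",
]

category_counts = Counter(classify_note(note) for note in sample_notes)
for label, count in category_counts.items():
    print(f"{label}: {count}")
```

Since each note maps to a single label, the category counts always sum to the number of notes, which makes the totals easy to check against the number of entries.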

In [10]:
# Initialize counters for lysin A, lysin B, and endolysin
lysin_a_count = 0
lysin_b_count = 0
endolysin_count = 0
other_lysin_count = 0  # Count for notes containing 'lysin' but not categorized as lysin A, lysin B, or endolysin

# List to store notes entries that are not lysin A, lysin B, or endolysin
other_notes = []

# Iterate through each entry in the data
for entry in data:
    # Extract the value of the "Notes" key
    notes = entry['Notes']
    # Check if the value is not empty
    if notes:
        # Convert notes to lowercase for case-insensitive comparison
        notes_lower = notes.lower()
        # Check if the entry contains any variant of lysin A, lysin B, or endolysin.
        # Keywords must be lowercase because they are compared against notes_lower.
        if any(keyword in notes_lower for keyword in ['lysin a', 'lysin b', 'endolysin', 'lysina', 'lysinb']):
            # Increment respective counters; a note matching more than one
            # category is counted once in each of them
            if 'lysin a' in notes_lower or 'lysina' in notes_lower:
                lysin_a_count += 1
            if 'lysin b' in notes_lower or 'lysinb' in notes_lower:
                lysin_b_count += 1
            if 'endolysin' in notes_lower:
                endolysin_count += 1
        # Check if the entry contains 'lysin' but is not categorized as lysin A, lysin B, or endolysin
        elif 'lysin' in notes_lower:
            other_lysin_count += 1
        else:
            # The note contains no form of 'lysin' at all; keep it for inspection
            other_notes.append(notes)

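# Calculate the total number of category matches (a note that matches two
# categories contributes twice, so this can exceed the number of entries)
total_lysin_notes = lysin_a_count + lysin_b_count + endolysin_count + other_lysin_count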

# Print the counts
print(f"Total number of entries containing any iteration of lysin A, lysin B, or endolysin: {total_lysin_notes}")
print(f"Lysin A count: {lysin_a_count}")
print(f"Lysin B count: {lysin_b_count}")
print(f"Endolysin count: {endolysin_count}")
print(f"Other lysin count: {other_lysin_count}")

# Print all notes entries that are not lysin A, lysin B, or endolysin
print("\nNotes entries that are not lysin A, lysin B, or endolysin:")
for note in other_notes:
    print(note)


Total number of entries containing any iteration of lysin A, lysin B, or endolysin: 7192
Lysin A count: 3460
Lysin B count: 2529
Endolysin count: 1152
Other lysin count: 51

Notes entries that are not lysin A, lysin B, or endolysin:


In [11]:
# Re-classify notes entries to lysin A, lysin B, endolysin, and putative lysin  

# Replace note values with the desired ones
for entry in data:
    notes = entry['Notes']
    if notes:
        notes_lower = notes.lower()
        # Keywords must be lowercase because they are compared against notes_lower
        if 'lysin a' in notes_lower or 'lysina' in notes_lower:
            entry['Notes'] = 'lysin A'
        elif 'lysin b' in notes_lower or 'lysinb' in notes_lower:
            entry['Notes'] = 'lysin B'
        elif 'endolysin' in notes_lower:
            entry['Notes'] = 'endolysin'
        elif 'lysin' in notes_lower:
            entry['Notes'] = 'putative lysin'

# Specify the path to save the new JSON file
new_json_file_path = "modified_data.json"

# Save the modified data as JSON
with open(new_json_file_path, 'w') as json_file:
    json.dump(data, json_file, indent=4)

print("New JSON file saved successfully!")

New JSON file saved successfully!


In [12]:
# Investigate notes field and count unique values

# Dictionary to store counts of different notes
notes_counts = {}

# Iterate through each entry in the data
for entry in data:
    # Extract the value of the "Notes" key
    notes = entry['Notes']
    # Check if the value is not empty
    if notes:
        # Update the count for the note value
        if notes in notes_counts:
            notes_counts[notes] += 1
        else:
            notes_counts[notes] = 1

# Print all unique note values and their counts
for note, count in notes_counts.items():
    print(f"Note: {note} | Count: {count}")

# Calculate total count from the re-classified notes
total_count = sum(notes_counts.values())

# Print the total count
print(f"Total count after modifications: {total_count}")


Note: lysin B | Count: 2529
Note: lysin A | Count: 3460
Note: endolysin | Count: 1151
Note: putative lysin | Count: 51
Total count after modifications: 7191


The JSON file has been cleaned and the data is ready for further analysis.
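
Before moving on, a quick sanity check can confirm that every Notes value was re-classified into one of the four expected labels. This is a minimal sketch under assumptions: `ALLOWED_NOTES` lists the labels assigned in the re-classification cell, while `find_unexpected_notes` and the sample records are hypothetical and shown for illustration only.

```python
# Labels assigned during re-classification
ALLOWED_NOTES = {"lysin A", "lysin B", "endolysin", "putative lysin"}

def find_unexpected_notes(entries, allowed=ALLOWED_NOTES):
    """Return (GeneID, Notes) pairs whose Notes value is not an allowed label."""
    return [
        (entry.get("GeneID"), entry.get("Notes"))
        for entry in entries
        if entry.get("Notes") not in allowed
    ]

# Invented sample records; the third still carries an unprocessed note
sample_cleaned = [
    {"GeneID": "g1", "Notes": "lysin A"},
    {"GeneID": "g2", "Notes": "endolysin"},
    {"GeneID": "g3", "Notes": "b'lysin B'"},
]

unexpected = find_unexpected_notes(sample_cleaned)
print("Entries with unexpected Notes values:", unexpected)
```

Run against the entries loaded from modified_data.json, an empty result would indicate that no raw annotation strings remain.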