# **Tutorial 1. Dealing with bigWig Data**

BigWig files are compressed, indexed binary files in a format developed by UCSC. They are primarily used in bioinformatics to store genomic signal data and to visualise it in the UCSC Genome Browser. Because the files are indexed, specific genomic regions can be read without accessing or downloading the whole file. Accessing and converting this data for machine learning purposes can nevertheless be confusing. UCSC provides its own command-line programs for extracting data from bigWigs, and Python libraries such as pyBigWig make converting bigWigs to NumPy arrays much easier.


## **UCSC Programs**

UCSC's programs to deal with bigWigs include:

**bigWigInfo** — prints out information about a bigWig file. <br><br>
**bigWigSummary** — extracts summary information from a bigWig file. <br><br>
**bigWigToWig** — converts a bigWig file to wig format. Note: if a bigWig file was created from a bedGraph, bigWigToWig will revert the file back to bedGraph. <br><br>
**bigWigToBedGraph** — converts a bigWig file to ASCII bedGraph format. <br><br>


These programs are available to download from "https://genome.ucsc.edu/goldenPath/help/bigWig.html". The executables are also included in this tutorial's GitHub repository.

Running any of these programs without arguments prints its help/usage statement.

In [10]:
!file ./bigWigInfo

#!./bigWigInfo
#!./bigWigSummary
#!./bigWigToWig
#!./bigWigToBedGraph

./bigWigInfo: Mach-O 64-bit executable x86_64
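
The `file` output above shows that the bundled binary is a macOS (Mach-O, x86_64) executable, so it may not run on other platforms. As a minimal sketch, the helper below looks for a tool in a local directory first and then on `PATH`, and returns the usage text when the tool is run without arguments. The names `find_ucsc_tool` and `print_usage` are hypothetical and not part of the UCSC tools.

```python
import os
import shutil
import subprocess


def find_ucsc_tool(name, local_dir="."):
    """Return a path to the named tool, or None if it cannot be found."""
    local_path = os.path.join(local_dir, name)
    if os.path.isfile(local_path) and os.access(local_path, os.X_OK):
        return os.path.abspath(local_path)
    return shutil.which(name)  # fall back to searching PATH


def print_usage(name, local_dir="."):
    """Run a tool without arguments and return the text it emits."""
    tool = find_ucsc_tool(name, local_dir)
    if tool is None:
        return f"{name} not found in {local_dir!r} or on PATH"
    try:
        result = subprocess.run([tool], capture_output=True, text=True)
    except OSError as err:
        # Raised, for example, when the binary was built for another platform
        return f"{name} could not be executed: {err}"
    # Usage text may go to stdout or stderr, so return both streams
    return result.stdout + result.stderr


for tool_name in ("bigWigInfo", "bigWigSummary", "bigWigToWig", "bigWigToBedGraph"):
    print(tool_name, "->", find_ucsc_tool(tool_name))
```

If a tool is reported as `None`, a build matching your operating system can be downloaded from the UCSC page linked above.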


## bigWigInfo
**BigWig files (.bw) can be accessed via URL**. For example, in genomic data from the [ENCODE Project](https://www.encodeproject.org), processed data such as p-values and fold-change signals are stored in bigWig format. Each bigWig file can be downloaded or accessed from "https://www.encodeproject.org/files/[experiment]/@@download/[experiment].bigWig", where [experiment] is the accession ID of the file itself (an ID beginning with ENCFF) rather than that of the parent experiment. Because the files are indexed, we can parse the data without downloading the files themselves.
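
The URL pattern described above can be wrapped in a small helper so that accession IDs are not pasted into long strings by hand. This is only a sketch: `encode_bigwig_url` and `ENCODE_BASE` are hypothetical names, and the function simply fills in the pattern quoted in the text.

```python
ENCODE_BASE = "https://www.encodeproject.org"


def encode_bigwig_url(file_accession):
    """Build the download URL for an ENCODE bigWig file accession."""
    accession = file_accession.strip()
    return f"{ENCODE_BASE}/files/{accession}/@@download/{accession}.bigWig"


# File accession used throughout this tutorial
print(encode_bigwig_url("ENCFF601VTB"))
```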

Take the accession ID [ENCSR817LUF](https://www.encodeproject.org/experiments/ENCSR817LUF/). This is the same experiment that we visualised earlier using the UCSC Genome Browser (1.2.2 Example Data Representations). Recall that it was a ChIP-seq experiment targeting the H3K36me3 histone modification in brain tissue.

The bigWig file containing the p-value for signal strength has the accession ID [ENCFF601VTB](https://www.encodeproject.org/files/ENCFF601VTB/). Let's run UCSC's programs on this bigWig.


In [2]:
!./bigWigInfo https://www.encodeproject.org/files/ENCFF601VTB/@@download/ENCFF601VTB.bigWig

Couldn't open https://www.encodeproject.org/files/ENCFF601VTB/@@download/ENCFF601VTB.bigWig
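
The "Couldn't open" message above means the tool could not read the remote file in this environment. One possible workaround, assuming the failure is related to remote access rather than to the file itself, is to download a local copy once and pass that path to the UCSC programs instead. The sketch below uses only the standard library; `fetch_to_local` is a hypothetical helper name. Note that bigWig files can be large, so this gives up the benefit of indexed remote access.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_to_local(url, dest):
    """Download url to dest unless dest already exists; return dest as a Path."""
    dest = Path(dest)
    if dest.exists():
        return dest  # reuse the earlier download
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest


# Example (not executed here because it needs network access):
# local = fetch_to_local(
#     "https://www.encodeproject.org/files/ENCFF601VTB/@@download/ENCFF601VTB.bigWig",
#     "ENCFF601VTB.bigWig",
# )
# The local path can then be given to bigWigInfo in place of the URL.
```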


## bigWigSummary

bigWigSummary allows us to view summary statistics (-type=mean/min/max/std/coverage) for specified genomic regions within chromosomes.
This is done using "bigWigSummary -type=mean/min/max/std/coverage file.bigWig chrom start end dataPoints". The start and end positions are the genomic coordinates we are interested in, while dataPoints is the number of bins into which the region is summarised.

Earlier, using the UCSC Genome Browser (1.2.2 Example Data Representations), we visualised the p-value signal from chromosome 1 within the region 11,084,744 - 11,095,920. Let's summarise the mean p-value across this genomic region in 10 bins.

In [None]:
!./bigWigSummary -type=mean https://www.encodeproject.org/files/ENCFF601VTB/@@download/ENCFF601VTB.bigWig chr1 11084744 11095920 10
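
To make the idea of bins concrete, the sketch below splits a per-base signal into a fixed number of near-equal bins and takes the mean of each, ignoring missing values. It illustrates the concept only: bigWigSummary may compute its answer from precomputed summary levels inside the file, so its output is not guaranteed to match this calculation exactly. `binned_mean` is a hypothetical helper, and the input is synthetic rather than read from the ENCODE file.

```python
import numpy as np


def binned_mean(values, n_bins):
    """Mean of each of n_bins near-equal chunks, ignoring NaN entries."""
    values = np.asarray(values, dtype=float)
    bins = np.array_split(values, n_bins)
    # A bin with no usable data yields NaN instead of raising a warning
    return np.array([
        np.nanmean(b) if np.any(~np.isnan(b)) else np.nan
        for b in bins
    ])


# Synthetic per-base signal: 20 positions summarised into 4 bins
signal = np.arange(20.0)
print(binned_mean(signal, 4))  # [ 2.  7. 12. 17.]
```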

## bigWigToWig & bigWigToBedGraph

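These programs are quite similar in implementation. The key difference between Wig and Bed files is how positions are encoded: a wig file groups values under `fixedStep` or `variableStep` declaration lines and uses 1-based positions, whereas a bedGraph file lists every interval explicitly on its own line as chrom, start, end and value, using 0-based, half-open coordinates.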

In [None]:
!./bigWigToWig "https://www.encodeproject.org/files/ENCFF601VTB/@@download/ENCFF601VTB.bigWig" output.wig -chrom=chr1 -start=11084744 -end=11095920


In [None]:
with open("output.wig", "r") as file:
    for line in file:
        # Each line already ends with a newline, so suppress print's own
        print(line, end="")

In [None]:
!./bigWigToBedGraph "https://www.encodeproject.org/files/ENCFF601VTB/@@download/ENCFF601VTB.bigWig" output.bed -chrom=chr1 -start=11084744 -end=11095920

In [None]:
with open("output.bed", "r") as file:
    for line in file:
        # Each line already ends with a newline, so suppress print's own
        print(line, end="")

## Converting Wig/Bed Files into NumPy Arrays

In [None]:
import numpy as np

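def parse_wig(wig_file):
    """Parse bedGraph-style lines (chrom, start, end, value) from a text file.

    Comment lines and wig declaration lines are skipped. Single-column
    fixedStep or two-column variableStep data lines are also skipped, so the
    result is empty for files that use those layouts.
    """
    data = []
    with open(wig_file, 'r') as f:
        for line in f:
            parts = line.split()  # split on any whitespace (tabs or spaces)
            # Keep only four-column data lines such as "chr1 100 200 0.5"
            if len(parts) == 4 and parts[0].startswith('chr'):
                chrom, start, end, value = parts
                data.append((chrom, int(start), int(end), float(value)))
    return data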

# Example usage
wig_file = 'output.wig'
parsed_data = parse_wig(wig_file)

# Convert to NumPy arrays
chroms, starts, ends, values = zip(*parsed_data)
starts = np.array(starts)
ends = np.array(ends)
values = np.array(values)

# Save as .npz file
np.savez('data.npz', starts=starts, ends=ends, values=values)
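
`parse_wig` above reads four-column, bedGraph-style lines. Wig files can instead contain `fixedStep` or `variableStep` blocks, where the chromosome and step are given once on a declaration line and positions are 1-based. The sketch below is one way to read all three layouts into the same 0-based, half-open `(chrom, start, end, value)` tuples. `parse_wig_lines` is a hypothetical helper, and the example text is made up for illustration rather than taken from `output.wig`.

```python
def parse_wig_lines(lines):
    """Yield (chrom, start, end, value) with 0-based, half-open coordinates."""
    mode = None
    chrom, step, span, pos = None, 1, 1, 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank, comment and display lines
        fields = line.split()
        if fields[0] in ("fixedStep", "variableStep"):
            # Declaration line: key=value options apply to the lines below
            mode = fields[0]
            opts = dict(f.split("=", 1) for f in fields[1:])
            chrom = opts["chrom"]
            span = int(opts.get("span", 1))
            if mode == "fixedStep":
                pos = int(opts["start"])
                step = int(opts.get("step", 1))
        elif len(fields) == 4:
            # bedGraph-style line: coordinates are already 0-based, half-open
            yield fields[0], int(fields[1]), int(fields[2]), float(fields[3])
        elif mode == "fixedStep":
            yield chrom, pos - 1, pos - 1 + span, float(fields[0])
            pos += step
        elif mode == "variableStep":
            start = int(fields[0])
            yield chrom, start - 1, start - 1 + span, float(fields[1])


example = """fixedStep chrom=chr1 start=101 step=10 span=5
1.5
2.5
variableStep chrom=chr2 span=2
301 0.25
chr3\t0\t4\t9.0
""".splitlines()

for record in parse_wig_lines(example):
    print(record)
```

Converting everything to one coordinate convention up front means later steps, such as building NumPy arrays, do not need to know which layout the file used.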


In [None]:
# Load .npz file
data = np.load('data.npz')

# Check what's in the .npz file
print(list(data.keys()))  # Print list of arrays in the file

# Access and inspect individual arrays
starts = data['starts']
ends = data['ends']
values = data['values']

# Print array shapes and some data points
print(f"starts array shape: {starts.shape}")
print(f"ends array shape: {ends.shape}")
print(f"values array shape: {values.shape}")

# Print some example data points
print("Example data points:")
for i in range(3):  # Print first 3 data points
    print(f"Start: {starts[i]}, End: {ends[i]}, Value: {values[i]}")


## Using pyBigWig

In [4]:
import pyBigWig



In [7]:
bw = pyBigWig.open("https://www.encodeproject.org/files/ENCFF601VTB/@@download/ENCFF601VTB.bigWig")

# Get values for specific intervals
chrom = "chr1"
start = 11084744 
end = 11095920

# Get intervals in the specified region
intervals = bw.intervals(chrom, start, end)


# Print the start position, end position, and value for each interval
if intervals is not None:
    for interval in intervals:
        #print(f"Start: {interval[0]}, End: {interval[1]}, Value: {interval[2]}")
        print(chrom, interval[0], interval[1], interval[2])
    
else:
    print("No intervals found in the specified region.")
    

# Close the BigWig file
bw.close()


chr1 11084717 11085202 0.13422000408172607
chr1 11085202 11085214 0.033240001648664474
chr1 11085214 11085679 0.13422000408172607
chr1 11085679 11085688 0.5755599737167358
chr1 11085688 11085894 0.13422000408172607
chr1 11085894 11085913 0.5755599737167358
chr1 11085913 11086093 0.13422000408172607
chr1 11086093 11086279 0.5755599737167358
chr1 11086279 11086299 1.3043099641799927
chr1 11086299 11086306 2.265320062637329
chr1 11086306 11086322 3.415019989013672
chr1 11086322 11086353 4.050429821014404
chr1 11086353 11086360 3.415019989013672
chr1 11086360 11086539 2.265320062637329
chr1 11086539 11086559 1.3043099641799927
chr1 11086559 11086566 0.5755599737167358
chr1 11086566 11086589 0.13422000408172607
chr1 11086589 11086793 0.033240001648664474
chr1 11086793 11086811 0.13422000408172607
chr1 11086811 11086817 0.5755599737167358
chr1 11086817 11086851 1.3043099641799927
chr1 11086851 11086869 2.265320062637329
chr1 11086869 11086874 3.415019989013672
chr1 11086874 11086912 4.050429
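
Note that the first interval printed above starts at 11084717, before the requested start of 11084744: the intervals overlapping the region are returned whole rather than clipped to it. For machine learning it is often more convenient to have one value per base. The sketch below expands a list of `(start, end, value)` tuples into a dense NumPy array clipped to the region. `intervals_to_dense` is a hypothetical helper, and the example reuses the first three intervals from the output above.

```python
import numpy as np


def intervals_to_dense(intervals, start, end, fill=np.nan):
    """Expand (start, end, value) intervals into one value per base in [start, end)."""
    dense = np.full(end - start, fill, dtype=float)
    for i_start, i_end, value in intervals or ():
        # Clip each interval to the requested region before writing it
        lo = max(i_start, start)
        hi = min(i_end, end)
        if lo < hi:
            dense[lo - start:hi - start] = value
    return dense


# First three intervals from the output above
example_intervals = [
    (11084717, 11085202, 0.13422000408172607),
    (11085202, 11085214, 0.033240001648664474),
    (11085214, 11085679, 0.13422000408172607),
]
dense = intervals_to_dense(example_intervals, 11084744, 11085300)
print(dense.shape)  # (556,)
```

Positions not covered by any interval keep the `fill` value, which defaults to NaN so that missing data stays distinguishable from a true zero.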

In [8]:
import numpy as np

# Function to extract per-base values from a BigWig file for a specified range
def extract_bigwig_data(bw_path, chrom, start, end):
    bw = pyBigWig.open(bw_path)
    # values() returns one float per base (nan where the file has no data),
    # so every file yields a row of the same length (end - start)
    values = bw.values(chrom, start, end)
    bw.close()
    return values

# Define genomic region of interest
chrom = 'chr1'
start = 11084744
end = 11095920

# List of BigWig files from ENCODE
bigwig_files = [
    'https://www.encodeproject.org/files/ENCFF601VTB/@@download/ENCFF601VTB.bigWig',
    'https://www.encodeproject.org/files/ENCFF359FNY/@@download/ENCFF359FNY.bigWig'
]

# Extract data from each BigWig file
data_matrix = []

for bw_file in bigwig_files:
    data = extract_bigwig_data(bw_file, chrom, start, end)
    data_matrix.append(data)

data_matrix = np.array(data_matrix)  # shape: (number of files, end - start)

print(data_matrix)
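
A matrix of tracks needs rows of equal length, and per-base values read with pyBigWig's `values` method contain NaN wherever a file has no data. The sketch below stacks several per-base tracks into a 2D array and replaces NaN with a chosen fill value. `stack_tracks` is a hypothetical helper; filling with zero is an assumption that suits some models but not others, and the input here is synthetic.

```python
import numpy as np


def stack_tracks(tracks, fill_value=0.0):
    """Stack equal-length per-base tracks into a (n_tracks, n_positions) array."""
    lengths = {len(track) for track in tracks}
    if len(lengths) != 1:
        raise ValueError(f"expected tracks of one common length, got {sorted(lengths)}")
    matrix = np.asarray(tracks, dtype=float)
    # Replace only NaN entries; all other values are left untouched
    return np.where(np.isnan(matrix), fill_value, matrix)


# Two synthetic tracks covering the same three positions
tracks = [
    [1.0, float("nan"), 3.0],
    [0.5, 0.5, float("nan")],
]
matrix = stack_tracks(tracks)
print(matrix.shape)  # (2, 3)
print(matrix)
```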

In [None]:
import matplotlib.pyplot as plt

# Plotting each dataset in data_matrix
plt.figure(figsize=(10, 6))
for i, data in enumerate(data_matrix):
    plt.plot(data, label=f'BigWig {i+1}')

# The x-axis is the index into each row, i.e. the offset from `start`
plt.xlabel(f'Position relative to {chrom}:{start}')
plt.ylabel('Raw Signal Value')
plt.title('Raw Signal Values from BigWig Files')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
import matplotlib.pyplot as plt
# Open a BigWig file
bw = pyBigWig.open("https://www.encodeproject.org/files/ENCFF601VTB/@@download/ENCFF601VTB.bigWig")


# Define the region
chrom = "chr1"
start = 11084744
end = 11095920

# Get values for the specified interval
values = bw.values(chrom, start, end)

# Close the BigWig file
bw.close()

# Plotting the data (this bigWig stores the signal p-value track, not read coverage)
plt.figure(figsize=(10, 4))
plt.plot(range(start, end), values, color='blue')
plt.fill_between(range(start, end), values, color='blue', alpha=0.3)
plt.title(f"Signal p-value in {chrom}:{start}-{end}")
plt.xlabel("Genomic Position")
plt.ylabel("Signal p-value")
plt.show()