This is an older version of the *Haemophilus influenzae* notebook in the same directory; it is a little clunkier (it uses `urllib3` to fetch the DNA sequence instead of the `requests` library used in the newer notebook). I'm just keeping it around for posterity.

In [None]:
# Needed when running in a Colab notebook
!pip install jbrowse-jupyter

In [None]:
# Needed when running in a Colab notebook
!pip install pandas
import pandas as pd

In [None]:
# don't actually need create here; it's only good for the "built-in" genomes hg19, hg38
from jbrowse_jupyter import create, launch

For this example, we'll use a really short bacterial genome (at one point it was the shortest ever sequenced), *Carsonella ruddii* (https://www.ncbi.nlm.nih.gov/nuccore/NC_018416.1/). It is an endosymbiotic bacterium that I don't think is ever found "free living" (https://en.wikipedia.org/wiki/Candidatus_Carsonella_ruddii).

In [None]:
refseq_name = 'NC_018416.1' #it would be nice to extract this from fasta or gff

All of the data we are using came originally from the NCBI link above. Things I did to make our lives easier:
* Created a bgzipped and faidx indexed (SAMtools) copy of the fasta sequence
* Created a bgzipped and tabix indexed copy of the GFF (which contains the gene annotations)
* Created an unzipped copy of the fasta sequence (to make it easier for us to read)
* Plopped all of this data in an AWS S3 bucket with CORS enabled for public access

Here we create an empty JBrowse configuration object, tell it to use a linear genome view (LGV) as opposed to a circular genome view (CGV), and set the assembly by telling it where to find the indexed fasta file.

In [None]:
import jbrowse_jupyter
config = jbrowse_jupyter.JBrowseConfig(view="LGV",)
config.set_assembly('https://s3.amazonaws.com/wormbase-modencode/test/Carsonella_ruddii/Carsonella_ruddii.fa.gz', # bgzipped
                    name=refseq_name)

In [None]:
c_ruddii_ref_url = 'https://s3.amazonaws.com/wormbase-modencode/test/Carsonella_ruddii/Carsonella_ruddii.fa' #unzipped--this is what we are going to read in this script

In [None]:
!pip install urllib3

This is the "magic" for reading the contents from a URL.

In [None]:
import urllib3
import io
http = urllib3.PoolManager()  
r = http.request('GET', c_ruddii_ref_url, preload_content=False)
r.auto_close = False
print(r.status)


Here we create our first JBrowse 2 track from the GFF3 file I mentioned above. It's pretty straightforward: we give it a URL for the data, a name that will be displayed in JBrowse 2 (the "name"), and a track_id, which is what we use to refer to it programmatically below.

In [None]:
#add gene track
geneGFFUrl = 'https://s3.amazonaws.com/wormbase-modencode/test/Carsonella_ruddii/Carsonella_ruddii.gff.gz'
config.add_track(geneGFFUrl, name="Genes",track_id="Genes")

Now we start some computation. The window and step are for calculating sliding averages: the window is how many bases we use to calculate each average, and the step is how many bases we move the "center" of the window each time through the loop.
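
To make the window/step idea concrete, here is a minimal sketch on a toy sequence. The helper `window_centers` and the `toy_*` names are made up for illustration (they aren't used elsewhere in this notebook); the range it builds mirrors the one used in the sliding-average loop further down.

```python
def window_centers(seq_len, window, step):
    """Return the center positions visited by a sliding window."""
    half = window // 2
    # start half a window in, stop half a window before the end
    return list(range(half, seq_len - half, step))

toy_seq = "ATGCGCGATATCGCGTTAAGGCCA"  # 24 bases, made up for illustration
toy_window, toy_step = 8, 4

for center in window_centers(len(toy_seq), toy_window, toy_step):
    half = toy_window // 2
    fragment = toy_seq[center - half:center + half]
    print(center, fragment)
```

Each printed fragment is `toy_window` bases long, and consecutive fragments overlap whenever the step is smaller than the window.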

In [None]:
window = 20 #window has to be an even number
step   = 5

First, read the sequence data into a variable called `seq`.
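
As a rough sketch of the same idea on an in-memory FASTA (so it runs without the network), the hypothetical helper `read_first_fasta_record` below reads only the first record and also returns the first word of the header line. That is one possible way to get the sequence name out of the fasta instead of hard-coding it; I'm assuming the header starts with the accession, as NCBI fasta headers normally do.

```python
import io

def read_first_fasta_record(handle):
    """Return (name, sequence) for the first record in a FASTA text stream."""
    name = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                break  # a second header means the first record is done
            parts = line[1:].split()
            name = parts[0] if parts else ""
            continue
        if name is not None and line:
            chunks.append(line)
    return name, "".join(chunks)

# toy data, made up for illustration
toy_fasta = io.StringIO(">toy_seq example record\nATGC\nGGCC\n>second\nTTTT\n")
toy_name, toy_record_seq = read_first_fasta_record(toy_fasta)
print(toy_name, toy_record_seq)
```

Collecting the lines in a list and joining once at the end avoids rebuilding the string on every line, which matters more for larger genomes than for this one.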

In [None]:
firstline = True
linecount=0

seq = '' #first try just putting all of the seq into a variable (rather than working on streaming sequence)

tw = io.TextIOWrapper(r)
while True:
#for line in io.TextIOWrapper(r):
  rline = tw.readline()
  if not rline:
    break
  if firstline:
    firstline = False
    continue # skip the header line
  if rline.find(">") > -1: # assuming there are other sequences in the fasta, stop after the first one
                           # note that this is a bit of a hack: streaming the data from AWS through TextIOWrapper doesn't
                           # work great (that is, it closes before we read to the end), so I hacked the fasta file that we are reading to have two
                           # copies of the bacterial genome in it
    break
  seq = seq + rline.rstrip("\n")

Now we are going to do the calculations that happen on every base (rather than the sliding averages that we'll do below). Two things are getting calculated here:
* The cumulative GC skew (basically, every time you see a G, add 1; every time you see a C, subtract 1)
* The total of each type of base

Notice that the number of As is typically roughly equal to the number of Ts, and the same is true for Gs and Cs (this pattern is known as Chargaff's second parity rule). I don't know if an accepted hypothesis for why that is has ever been presented.
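
Here is a compact sketch of both per-base calculations on a toy sequence, using `itertools.accumulate` and `collections.Counter` from the standard library. It isn't what the cell below does (that one uses an explicit loop and dictionaries), and the function name `cumulative_gc_skew` is just something I made up for the example.

```python
from collections import Counter
from itertools import accumulate

def cumulative_gc_skew(sequence):
    """Running total: +1 for each G, -1 for each C, 0 for anything else."""
    steps = (1 if base == "G" else -1 if base == "C" else 0 for base in sequence)
    return list(accumulate(steps))

toy_bases = "GGCATCCG"  # made up for illustration
toy_skew = cumulative_gc_skew(toy_bases)
toy_counts = Counter(toy_bases)

print(toy_skew)    # [1, 2, 1, 1, 1, 0, -1, 0]
print(toy_counts)
```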

In [None]:
all_freq = {}
cummulative_skew = {}
cummulative_name = {}
start = {}
end   = {}
refName = {}
skew_name = {}
firstBase = True
i = 1
for bp in seq: #loop over every base in the sequence
  if firstBase:
    if bp == "G":
      cummulative_skew[i] = 1
    elif bp == "C":
      cummulative_skew[i] = -1
    else:
      cummulative_skew[i] = 0
    firstBase = False
  # calculating cumulative skew at every base
  else:
    if bp == "G":
      cummulative_skew[i] = cummulative_skew[i-1] + 1
    elif bp == "C":
      cummulative_skew[i] = cummulative_skew[i-1] - 1
    else:
      cummulative_skew[i] = cummulative_skew[i-1]
  cummulative_name[i] = 'Cumulative GC skew'
  start[i] = i
  end[i] = i
  refName[i] = refseq_name
  skew_name[i] = 'Cumulative GC skew'
  # not really a frequency, really a count
  if bp in all_freq:
    all_freq[bp] += 1
  else:
    all_freq[bp] = 1
  i += 1
print(all_freq)

Create a pandas DataFrame for the skew data, with the columns `refName`, `start`, `end`, `name` and `score` (this is the structure that JBrowse expects).

In [None]:
skew_data = {'refName': refName,
                'start':start,
                'end':end,
                'name':skew_name,
                'score':cummulative_skew}
skew_df = pd.DataFrame(skew_data)
print(skew_df)

Now to calculate the sliding average statistics: percent GC and local GC asymmetry.
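
For a single window, the two statistics boil down to percent GC = 100 * (G + C) / window length and GC asymmetry = 100 * (G - C) / (G + C). A minimal sketch for one fragment follows; the function names are made up for illustration, and unlike the cell below this version keeps the scores as floats rather than casting them to int.

```python
def gc_percent(fragment):
    """Percent of bases in the fragment that are G or C."""
    if not fragment:
        return 0.0
    g_count = fragment.count("G")
    c_count = fragment.count("C")
    return 100 * (g_count + c_count) / len(fragment)

def gc_asymmetry(fragment):
    """100 * (G - C) / (G + C); defined as 0 when there are no Gs or Cs."""
    g_count = fragment.count("G")
    c_count = fragment.count("C")
    if g_count + c_count == 0:
        return 0.0
    return 100 * (g_count - c_count) / (g_count + c_count)

toy_fragment = "GGGCATAT"  # 3 Gs, 1 C, 8 bases; made up for illustration
print(gc_percent(toy_fragment))    # 50.0
print(gc_asymmetry(toy_fragment))  # 50.0
```

Dividing by `len(fragment)` gives the same result as dividing by the window size as long as every fragment is a full window, which is the case for the range used in the loop below.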

In [None]:
seqlen = len(seq)
GCpercent = {}
GCasym = {}
start = {}
end = {}
refName = {}
percent_name={}
asym_name={}
rowcount=0
for bp in range(int(window/2), int(seqlen-window/2), step): #note the use of range to create a number sequence to skip through the sequence string
  seqfrag = seq[bp-int(window/2):bp+int(window/2)]
  Acount = seqfrag.count('A')
  Ccount = seqfrag.count('C')
  Gcount = seqfrag.count('G')
  Tcount = seqfrag.count('T')
  refName[rowcount] = refseq_name
  start[rowcount] = bp
  end[rowcount] = bp
  percent_name[rowcount] = 'GC percent'
  asym_name[rowcount] = 'GC asymmetry'
  # casting fractions into int percents; score shouldn't have to be an int!
  GCpercent[rowcount] = int(100*(Gcount+Ccount)/window)
  GCasym[rowcount] = 0
  if Gcount+Ccount > 0:
    GCasym[rowcount]  = int(100*(Gcount-Ccount)/(Gcount+Ccount))
  rowcount = rowcount+1

And create pandas DataFrames for the percent GC and GC asymmetry data.

In [None]:
percent_data = {'refName': refName,
                'start':start,
                'end':end,
                'name':percent_name,
                'score':GCpercent}
percent_df = pd.DataFrame(percent_data)
print(percent_df)

In [None]:
asym_data = {'refName': refName,
                'start':start,
                'end':end,
                'name':asym_name,
                'score':GCasym}
asym_df = pd.DataFrame(asym_data)
print(asym_df)

I'm printing out the config just to make sure it looks OK before I start adding data (which will make the config really big).

In [None]:
print(config.get_config())

In [None]:
config.add_df_track(percent_df,'percent GC',track_id='df_percent_gc',overwrite=True)
config.add_df_track(asym_df,'GC asymmetry',track_id='df_gc_asym',overwrite=True)
config.add_df_track(skew_df,'Cumulative GC skew',track_id='df_gc_skew',overwrite=True)

Finally, we set a few things about how we want the initial view of JBrowse to look (location, tracks that are turned on) and then launch the genome browser. Note that as we zoom in, it might get a little "jerky": the config is fairly large, since all of the data we created above is stored in it. If we wanted to make this into something we showed other people, we'd want to create data files for each of these tracks.
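
As a rough idea of what "create data files" could look like, here is a sketch that writes per-base scores as bedGraph text (tab-separated, with 0-based, half-open intervals). Picking bedGraph is my assumption and not something this notebook does, and `write_bedgraph` is a made-up helper. For real use you would probably convert the result to bigWig (for example with UCSC's `bedGraphToBigWig`), host it next to the fasta, and add it as a track by URL.

```python
import io

def write_bedgraph(handle, ref_name, positions, scores):
    """Write one bedGraph line per 1-based position."""
    for pos, score in zip(positions, scores):
        # bedGraph intervals are 0-based and half-open: base `pos` is [pos-1, pos)
        handle.write(f"{ref_name}\t{pos - 1}\t{pos}\t{score}\n")

# toy data, made up for illustration; a real run would open a file instead
toy_buffer = io.StringIO()
write_bedgraph(toy_buffer, "NC_018416.1", [1, 2, 3], [1, 2, 1])
print(toy_buffer.getvalue())
```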

In [None]:
config.set_location("NC_018416.1:1000..3500")
config.set_default_session(['Genes','df_gc_skew','df_percent_gc','df_gc_asym'], False)
full_conf = config.get_config()
launch(full_conf, port=3003,height=800)