# File input/output

In this section and the next, you will create a number of data structures:

| Object Name | Object Description |
|---|---|
| gene_file | string of file name of mouse genome selection |
| fopen | list where each element is a line of gene_file |
| gene_info | dict of the contents of gene_file, where keys are gene names and values are everything associated with them |
| chroms | list of all the chromosomes included in gene_info |
| gene_counts | dict with chromosome name as key and number of genes contained as value |
| fa_file | string of file name of FASTA formatted select mouse gene sequences (selChroms.mm9.fa.zip) |
| seq_dict | dict of the fa_file content (short dict, lots of sequence) |
| cntn4 | string of Cntn4 gene identifier, uc009dcr.2 |
| hchr | string of chromosome number of Cntn4 gene location |
| hst | string of location of start site of Cntn4 gene |
| hen | string of location of end site of Cntn4 gene |
| cntn4_seq | string of sequence of Cntn4 |
| gene_lengths | dict of gene names as keys, and gene lengths as values |
| in_gene | dict of gene names as keys, and the range of chromosome indices each gene spans as values. |
| chr6_starts | dict of chromosome 6 gene names as keys, and their start sites as values. |
| tata_dist | dict of gene names as keys, and the index, if any, of the nearest TATA to its start. |


## Importing course module functions

Since you followed the instructions in [Downloading Necessary Files](https://courses.edx.org/courses/course-v1:MITx+7.QBWx+2T2022/jump_to_id/ec0617e829984077b9670f3fdbea25e4), your working directory now contains the downloaded Python module. Rather than importing the module by name, we import all of its functions directly into the current namespace with the command below, so that they can be called later without the module prefix:

In [1]:
from qbwPythonModule import *

## input/output: opening the data in Python

You have downloaded a list of genes from part of the mouse genome. These data come from the [UCSC genome browser](http://genome.ucsc.edu/), which maintains genomic information for many sequenced organisms.

First, assign the file name to a variable so that you do not have to type it too frequently.

In [2]:
gene_file='mm9_sel_chroms_knownGene.txt' # list of genes from part of the mouse genome

Reading a file is straightforward in Python. The easiest way is to use the readlines method, which returns the contents of the file as a list in which each element is one line.

In [3]:
fopen=open(gene_file).readlines()

Now that the file is simply a list, you can check its length:

In [4]:
len(fopen)

10674

There should be 10,674 lines in the file.
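As a minimal sketch of the same read-into-a-list pattern, the example below uses a `with` block so that the file is closed as soon as its lines have been read. The file it reads is a tiny temporary stand-in with made-up contents, and the names `demo_file` and `demo_lines` are hypothetical, introduced only for illustration.

```python
import os
import tempfile

# Write a tiny stand-in file (hypothetical contents, not the real gene file).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("geneA\tchr6\t+\t100\t200\n")
    tmp.write("geneB\tchr11\t-\t300\t450\n")
    demo_file = tmp.name

# The with block closes the file automatically once the lines are read.
with open(demo_file) as handle:
    demo_lines = handle.readlines()

# Clean up the temporary file.
os.remove(demo_file)

print(len(demo_lines))      # number of lines read
print(repr(demo_lines[0]))  # first line, including tabs and the newline
```

Each element of the list still ends with its newline character, which is why the lines printed in this section end in `\n`.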

You can now access fopen to peek at various parts of the list:

In [5]:
fopen[0]

'uc009auw.1\tchr6\t+\t3238518\t3267019\t3238518\t3238518\t3\t3238518,3264405,3266613,\t3238700,3264550,3267019,\t\tuc009auw.1\n'

In [6]:
fopen[1999]

'uc009dmc.2\tchr6\t-\t119265150\t119280784\t119267073\t119273733\t5\t119265150,119270438,119273666,119274218,119280512,\t119267528,119271029,119273795,119274426,119280784,\tQ8BGX3\tuc009dmc.2\n'

However, these raw lines are difficult to interpret. The course module you loaded earlier contains a function called loadGeneFile that parses this file automatically into a dictionary.

In [7]:
gene_info=loadGeneFile(gene_file)

Parsed info for 10674 genes on 4 chromosomes
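To make the parsing step concrete, here is a hedged sketch of how a single tab-delimited line could be turned into a dictionary entry. This is not the code inside loadGeneFile, which is not shown here. The function name `parse_gene_line` is hypothetical, the column order is assumed to follow the UCSC knownGene table layout, and any key other than `'chr'` and `'start'` is an assumption made for illustration.

```python
def parse_gene_line(line):
    """Split one tab-delimited gene line into (gene name, info dict)."""
    fields = line.rstrip("\n").split("\t")
    info = {
        "chr": fields[1],             # chromosome name
        "strand": fields[2],          # '+' or '-'
        "start": int(fields[3]),      # transcription start
        "end": int(fields[4]),        # transcription end (assumed key name)
        "exon_count": int(fields[7]),
        # Exon coordinates are comma-separated with a trailing comma,
        # so empty pieces are skipped.
        "exon_starts": [int(x) for x in fields[8].split(",") if x],
        "exon_ends": [int(x) for x in fields[9].split(",") if x],
    }
    return fields[0], info

# The first line of the gene file, as shown earlier in this section.
line = ('uc009auw.1\tchr6\t+\t3238518\t3267019\t3238518\t3238518\t3\t'
        '3238518,3264405,3266613,\t3238700,3264550,3267019,\t\tuc009auw.1\n')

name, record = parse_gene_line(line)
print(name, record["chr"], record["start"])
```

Applying a function like this to every line and storing each result under its gene name would give a dictionary of dictionaries, which is the shape of structure used in the rest of this section.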


Now you can use the dictionary structure to analyze the data in the file. The keys of the dictionary represent gene names. 

In [12]:
#gene_info.keys()

Each gene name can be used to access the necessary information, such as the chromosome and position of the gene on the chromosome:

In [13]:
gene_info['uc009auw.1']['chr']

'chr6'

In [14]:
gene_info['uc009auw.1']['start']

3238518


## evaluating the data

The first question to ask is: how many chromosomes are in this file? I selected only a subset to be included in this course.

To count the chromosomes, we first create an empty list. Then, for each gene, we check whether its chromosome is already in the list. If not, we add it:

In [18]:
chroms=[]

for k in gene_info.keys():
    chrom=gene_info[k]['chr']
    if chrom not in chroms:
        chroms.append(chrom)

# final list of chromosomes
chroms

['chr6', 'chr11', 'chr15', 'chr16']
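The same "collect each chromosome once" idea can also be sketched in a single expression. The example below runs on a toy dictionary with made-up gene names and coordinates (`toy_gene_info` is a hypothetical stand-in, not the real gene_info) and relies on dictionaries preserving insertion order, which holds in Python 3.7 and later.

```python
# Toy stand-in for gene_info (hypothetical gene names and coordinates).
toy_gene_info = {
    "geneA": {"chr": "chr6", "start": 100},
    "geneB": {"chr": "chr11", "start": 300},
    "geneC": {"chr": "chr6", "start": 900},
    "geneD": {"chr": "chr15", "start": 50},
}

# dict.fromkeys keeps only the first occurrence of each chromosome name,
# in the order in which the names are first seen.
toy_chroms = list(dict.fromkeys(info["chr"] for info in toy_gene_info.values()))

print(toy_chroms)  # ['chr6', 'chr11', 'chr15']
```

A plain `set` would also remove duplicates, but it does not guarantee the order of the result, so the list-based loop or `dict.fromkeys` is the safer choice when order matters.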

In [19]:
gene_counts={}

# we iterate through each of the chromosomes we have collected and count how many genes are on that chromosome
for chrom in chroms:
    chrom_count=0 
    for k in gene_info.keys():
        if gene_info[k]['chr']==chrom:
            chrom_count+=1
    gene_counts[chrom]=chrom_count

gene_counts

{'chr6': 2990, 'chr11': 3899, 'chr15': 2110, 'chr16': 1675}
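The nested loops above scan the full gene list once per chromosome. As an alternative sketch, `collections.Counter` from the standard library can tally genes per chromosome in a single pass. The toy dictionary below uses made-up gene names and is only a stand-in for gene_info.

```python
from collections import Counter

# Toy stand-in for gene_info (hypothetical gene names and coordinates).
toy_gene_info = {
    "geneA": {"chr": "chr6", "start": 100},
    "geneB": {"chr": "chr11", "start": 300},
    "geneC": {"chr": "chr6", "start": 900},
    "geneD": {"chr": "chr15", "start": 50},
}

# Counter tallies how many times each chromosome name appears.
toy_counts = Counter(info["chr"] for info in toy_gene_info.values())

print(dict(toy_counts))  # {'chr6': 2, 'chr11': 1, 'chr15': 1}
```

With four chromosomes the difference in speed is negligible, so the explicit loops remain a reasonable choice when readability for beginners is the priority.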