# ZFIN-8499 Reports


# New Download File

UniProt has released 2023_01, the first release of 2023. We expect this release to reflect the changes previewed in ZFIN-8376, which included a to_keep.csv file and a to_delete.csv file.

As a first step, I ran our preload, which downloads from https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/, filters for zebrafish records, and writes them to pre_zfin.dat.
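The preload's actual filter is not shown in this report. As a rough sketch of what filtering for zebrafish records can look like, the example below assumes the input is in the UniProtKB flat-file format, where each entry ends with a `//` line and the `OX` line carries the NCBI taxonomy ID (7955 for zebrafish). The function names are hypothetical.

```python
import re
from typing import Iterable, Iterator, List

ZEBRAFISH_TAXID = "7955"  # NCBI taxonomy ID for Danio rerio
TAXID_PATTERN = re.compile(r"NCBI_TaxID=(\d+)")


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one flat-file entry at a time; an entry ends with a '//' line."""
    entry: List[str] = []
    for line in lines:
        entry.append(line)
        if line.startswith("//"):
            yield entry
            entry = []


def is_zebrafish(entry: List[str]) -> bool:
    """Return True if any OX line of the entry names the zebrafish taxon."""
    for line in entry:
        if line.startswith("OX"):
            match = TAXID_PATTERN.search(line)
            if match and match.group(1) == ZEBRAFISH_TAXID:
                return True
    return False


def filter_zebrafish(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that belong to zebrafish entries."""
    for entry in iter_entries(lines):
        if is_zebrafish(entry):
            yield from entry
```

Matching the full numeric ID, rather than testing for the substring `NCBI_TaxID=7955`, avoids accidentally keeping entries whose taxon ID merely starts with the same digits.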

To compare the earlier to_keep.csv with the new pre_zfin.dat, I extracted just the UniProt IDs from each file and loaded them into db.sqlite, in tables named "to_keep_ids" and "pre_zfin_ids" respectively.
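The extraction step itself is not included in this notebook. A minimal sketch of one way to do it for the flat file follows, assuming the ID of interest is the primary accession (the first accession on the first `AC` line of each entry) and that each table has a single `id` column, as the queries later in this report suggest. Both helper names are hypothetical.

```python
import sqlite3
from typing import Iterable, List


def primary_accessions(lines: Iterable[str]) -> List[str]:
    """Return the first accession of each entry (first AC line after an ID line)."""
    ids: List[str] = []
    need_ac = False
    for line in lines:
        if line.startswith("ID "):
            need_ac = True
        elif need_ac and line.startswith("AC "):
            # AC lines look like "AC   Q00001; A0A000;", so take the first item
            ids.append(line[2:].split(";")[0].strip())
            need_ac = False
    return ids


def load_ids(conn: sqlite3.Connection, table: str, ids: Iterable[str]) -> None:
    """Insert IDs into a one-column table, creating it if needed."""
    # The table name is interpolated directly, so pass only trusted names.
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" (id TEXT)')
    conn.executemany(f'INSERT INTO "{table}" (id) VALUES (?)', [(i,) for i in ids])
    conn.commit()
```

With the file open, usage would be along the lines of `load_ids(conn, "pre_zfin_ids", primary_accessions(handle))`.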


# Preface 

## Section A: Comparisons

### i) pre_zfin count
pre_zfin has 52,331 unique UniProt IDs.

### ii) to_keep count
to_keep has 51,823 unique UniProt IDs.

### iii) IDs in pre_zfin, but not in to_keep
There are 3,331 IDs in pre_zfin that are not in to_keep.

### iv) IDs in to_keep, but not in pre_zfin
There are 2,823 IDs in to_keep that are not in pre_zfin.
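As a quick sanity check on the four counts above: subtracting each "only in" count from its table's total should give the same number of shared IDs from either side. The sketch below simply redoes that arithmetic with the reported figures, and assumes neither table contains duplicate or NULL IDs.

```python
# Counts reported in Section A
pre_zfin_count = 52_331
to_keep_count = 51_823
only_in_pre_zfin = 3_331
only_in_to_keep = 2_823

# IDs present in both tables, computed from each side
shared_from_pre_zfin = pre_zfin_count - only_in_pre_zfin
shared_from_to_keep = to_keep_count - only_in_to_keep

# Both routes must agree if the four counts are consistent
assert shared_from_pre_zfin == shared_from_to_keep
print(shared_from_pre_zfin)  # 49000
```

Both routes give 49,000 shared IDs, so the four counts are mutually consistent.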



## Next Steps

Import the tables from the 8395 report and generate the same reports for the new data.

# Queries supporting these results


## Initialize Database

In [1]:
!cp inputs/db.sqlite .

In [2]:
%reload_ext sql


In [3]:
%%sql
sqlite:///db.sqlite

## Database Queries

### Ai

In [4]:
%%sql

select count(distinct id) from pre_zfin_ids;

 * sqlite:///db.sqlite
Done.


count(distinct id)
52331


### Aii

In [5]:
%%sql

select count(distinct id) from to_keep_ids;

 * sqlite:///db.sqlite
Done.


count(distinct id)
51823


### Aiii

In [6]:
%%sql

create table "Aiii" as 
select * from pre_zfin_ids where id not in (select id from to_keep_ids);
select count(*) from Aiii;


 * sqlite:///db.sqlite
Done.
Done.


count(*)
3331


### Aiv

In [7]:
%%sql

create table "Aiv" as
select * from to_keep_ids where id not in (select id from pre_zfin_ids);
select count(*) from Aiv;

 * sqlite:///db.sqlite
Done.
Done.


count(*)
2823
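One caveat with the `not in` form used for Aiii and Aiv: if the subquery returns a NULL, `not in` matches no rows at all. The sketch below illustrates this on an in-memory database with invented toy IDs (not the real tables) and shows `except` as an alternative way to write the same set difference.

```python
import sqlite3

# Toy data only: these IDs are invented for illustration
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pre_zfin_ids (id TEXT);
    CREATE TABLE to_keep_ids (id TEXT);
    INSERT INTO pre_zfin_ids VALUES ('A1'), ('B2'), ('C3');
    INSERT INTO to_keep_ids VALUES ('B2'), ('D4'), (NULL);
""")

# NOT IN compares against NULL, so no row can be proven absent
not_in_rows = conn.execute(
    "SELECT id FROM pre_zfin_ids WHERE id NOT IN (SELECT id FROM to_keep_ids)"
).fetchall()

# EXCEPT computes the set difference without that pitfall
except_rows = conn.execute(
    "SELECT id FROM pre_zfin_ids EXCEPT SELECT id FROM to_keep_ids ORDER BY id"
).fetchall()

print(not_in_rows)
print(except_rows)
conn.close()
```

Note that `except` also removes duplicate rows from its result, so the two forms are only interchangeable when the IDs are already unique.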


# Export Excel Spreadsheet



In [8]:
import sqlite3
import pandas as pd

def main():

    tables = ['pre_zfin_ids', 'to_keep_ids', 'Aiii', 'Aiv']

    # Create a Pandas Excel writer using the openpyxl engine;
    # the context manager saves and closes the workbook on exit
    with pd.ExcelWriter('Notes.xlsx', engine='openpyxl') as writer:

        # Loop over the database tables
        for table in tables:
            # Read the table into a dataframe
            df = get_table_rows_as_data_frame(table)

            # Write the dataframe to a sheet named after the table
            df.to_excel(writer, sheet_name=table, index=False)


def get_table_rows_as_data_frame(tablename):
    # Connect to the database
    conn = sqlite3.connect('db.sqlite')

    # Create a cursor
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM "' + tablename + '"')
    results = cursor.fetchall()

    column_names = [description[0] for description in cursor.description]

    # Convert the results to a Pandas DataFrame
    df = pd.DataFrame(results, columns=column_names)

    cursor.close()
    conn.close()

    return df

main()

