# Prepare data for CITL
This guide uses two data sets to show how to prepare input data for CITL. Both data sets were used in the CITL article (Wei et al., 2021).

## Create from fastq
The raw data from single-cell sequencing are in fastq format. Spliced and unspliced reads can be annotated with the [velocyto CLI](http://velocyto.org/velocyto.py/tutorial/cli.html) or [kallisto](https://linnarssonlab.org/loompy/kallisto/index.html). See each tool's documentation for usage.

## Estimate RNA velocity
The tools mentioned above produce output in loom format, which contains the annotations of spliced/unspliced reads. With these annotations, RNA velocity can be estimated through either of two methods: [*velocyto*](http://velocyto.org/) or [*scVelo*](https://scvelo.readthedocs.io/). Both are compatible with CITL. The following examples use *velocyto*.

We considered two data sets in the article. Data set 1 is from the mouse P0 and P5 dentate gyrus (Zeisel et al., 2018), and data set 2 is the human week-ten fetal forebrain data set (La Manno et al., 2018). Both data sets are also used in the *velocyto* examples, and the detailed procedures for estimating RNA velocity on them are described in *velocyto*'s [notebooks](https://github.com/velocyto-team/velocyto-notebooks/tree/master/python). The key parameters of those procedures are listed below.

### Data set 1

In [1]:
# Minimum number of spliced molecules detected across all cells
min_expr_counts = 40
# Minimum number of cells that express spliced molecules of a gene
min_cells_express = 30
# Minimum number of unspliced molecules detected across all cells
min_expr_counts_U = 25
# Minimum number of cells that express unspliced molecules of a gene
min_cells_express_U = 20
# Number of top-ranked genes to select on the basis of a CV vs. mean fit
N = 3000
# k for KNN imputation
k = 500


### Data set 2

In [None]:
# Minimum number of spliced molecules detected across all cells
min_expr_counts = 30
# Minimum number of cells that express spliced molecules of a gene
min_cells_express = 20
# Minimum number of unspliced molecules detected across all cells
min_expr_counts_U = 25
# Minimum number of cells that express unspliced molecules of a gene
min_cells_express_U = 20
# Number of top-ranked genes to select on the basis of a CV vs. mean fit
N = 2000
# k for KNN imputation
k = 550
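
To make the four detection thresholds listed for both data sets concrete, here is a minimal sketch of the kind of filter they describe. It is not velocyto's implementation: the function name `detected_genes` is hypothetical, the use of inclusive comparisons (`>=`) is an assumption, and the notebooks apply the spliced and unspliced thresholds in separate steps rather than in one pass. Plain Python lists are used so the sketch has no dependencies.

```python
def detected_genes(S, U, min_expr_counts, min_cells_express,
                   min_expr_counts_U, min_cells_express_U):
    """Return the indices of genes that pass all four detection thresholds.

    S and U are genes x cells count matrices (spliced and unspliced),
    given as lists of rows. Hypothetical helper for illustration only.
    """
    kept = []
    for gene_index, (s_row, u_row) in enumerate(zip(S, U)):
        s_total = sum(s_row)                       # spliced molecules over all cells
        s_cells = sum(1 for c in s_row if c > 0)   # cells expressing spliced molecules
        u_total = sum(u_row)                       # unspliced molecules over all cells
        u_cells = sum(1 for c in u_row if c > 0)   # cells expressing unspliced molecules
        # Assumption: thresholds are inclusive lower bounds.
        if (s_total >= min_expr_counts and s_cells >= min_cells_express
                and u_total >= min_expr_counts_U
                and u_cells >= min_cells_express_U):
            kept.append(gene_index)
    return kept
```

With the values for data set 1, a gene would be kept only if it has at least 40 spliced molecules in at least 30 cells and at least 25 unspliced molecules in at least 20 cells.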

After running the code in the notebooks, the normalized expression levels and RNA velocities of the genes can be saved with the following code.

In [None]:
import csv

import numpy as np
import velocyto as vcy

# Load the data set
vlm = vcy.VelocytoLoom("dataset.loom")
#...
#...
#...(code in the notebook)

# velocyto stores matrices as genes x cells; transpose to cells x genes
velocity = np.transpose(vlm.delta_S)
np.savetxt('delta_s.csv', velocity, delimiter=',')

x = np.transpose(vlm.Sx_sz)
np.savetxt('Spliced.csv', x, delimiter=',')

# Write the gene names as a single row
with open('gene_names.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(vlm.ra['Gene'])

## Shape of the data
CITL requires three inputs: a matrix of current expression levels ("Spliced.csv"), a matrix of changes in expression levels ("delta_s.csv"), and the names of the genes ("gene_names.csv"). Suppose $n$ cells and $p$ genes remain after filtering. "Spliced.csv" should then be a matrix with $n$ rows and $p$ columns, with no header row and no index column. "delta_s.csv" has the same shape as "Spliced.csv". "gene_names.csv" has a single row containing the $p$ gene names.
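
Before passing the files to CITL, it can help to confirm that they have the shape described above. The following is a minimal sketch using only the standard library. The helper names `read_rows` and `check_citl_inputs` are hypothetical and are not part of CITL.

```python
import csv


def read_rows(path):
    """Read a CSV file into a list of rows, skipping empty lines."""
    with open(path, newline="") as handle:
        return [row for row in csv.reader(handle) if row]


def check_citl_inputs(spliced_path, delta_path, genes_path):
    """Return (n_cells, n_genes) if the three files agree, else raise ValueError."""
    spliced = read_rows(spliced_path)
    delta = read_rows(delta_path)
    genes = read_rows(genes_path)

    if not spliced:
        raise ValueError("spliced matrix is empty")
    n, p = len(spliced), len(spliced[0])

    for name, matrix in (("spliced", spliced), ("delta_s", delta)):
        # Both matrices must be n x p with the same n and p.
        if len(matrix) != n or any(len(row) != p for row in matrix):
            raise ValueError(f"{name} matrix is not {n} x {p}")
        for row in matrix:
            for value in row:
                float(value)  # raises ValueError on a header or text label

    # The gene names file must be exactly one row with p entries.
    if len(genes) != 1 or len(genes[0]) != p:
        raise ValueError(f"gene names must be one row of {p} entries")
    return n, p
```

Calling `check_citl_inputs("Spliced.csv", "delta_s.csv", "gene_names.csv")` returns the pair $(n, p)$ when the files are consistent and raises `ValueError` otherwise.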