# Modules

In Python, every script or source file is called a module. Modules can in turn be grouped into packages; a package is a folder that contains `.py` files. For example, if we saved the function that extracts codons from a string in a file named `get_codons.py`, that file would be a Python script, and as long as it sits in the same folder as the notebook it can be imported like this:

In [1]:
# import get_codons

This lets me reuse my code. Fortunately, Python ships with built-in modules (its standard library). We can also install new packages that contain modules with `pip`, the package installer for Python, or with `conda`, the package manager from Anaconda Inc.

In [2]:
import calendar

In [3]:
print(calendar.month(2022, 2))

   February 2022
Mo Tu We Th Fr Sa Su
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28



Namespaces can also be shortened with an alias. During the import, the keyword `as` is followed by the alias that will be used from then on to refer to the imported namespace:

- `import modulo`
- `import modulo as m`
- `import paquete.modulo1 as pm`
- `import paquete.subpaquete.modulo1 as psm`
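
As a small sketch of the alias forms above, using only standard-library modules (the aliases `cal` and `osp` are arbitrary names chosen for this illustration):

```python
import calendar as cal   # "import module as m" form
import os.path as osp    # "import package.module as pm" form

# The alias is just another name for the same module object
is_leap = cal.isleap(2024)
joined = osp.join("data", "IGC.annotation.tsv.gz")

print(cal.__name__)
print(is_leap)
print(joined)
```

The original module name (`calendar`) is not bound in this case; only the alias is available after the import.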

# Pandas

<img src="./imgs/pandas.png" align="center"/>

## What is Pandas?

Pandas, from *"panel data"*, is a Python library for working with tables, known as *DataFrames*, which are made up of three elements:

- The data
- The index
- The columns

| Pandas structure | Dimensions | Excel analogy |
| -- | -- | -- |
| Series | 1D | Column |
| DataFrame | 2D | Spreadsheet |
| Panel (removed in pandas 0.25) | 3D | Multiple spreadsheets |
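
To make the first two rows of the table concrete, here is a minimal sketch that builds a Series and a DataFrame by hand. The gene names are made up for the illustration:

```python
import pandas as pd

# A Series is one-dimensional: values plus an index
lengths = pd.Series([24615, 21669, 20778], name="Gene Length")

# A DataFrame is two-dimensional: data, an index (rows) and columns
genes = pd.DataFrame(
    {
        "Gene Name": ["gene_a", "gene_b", "gene_c"],  # made-up names
        "Gene Length": [24615, 21669, 20778],
    }
)

print(lengths.ndim, genes.ndim)
print(list(genes.columns))
print(list(genes.index))
```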


To use the Pandas module, it must first be imported.

```python
import pandas as pd
```

Further below we will download the file remotely using bash commands in Jupyter Notebooks:

```python
! mkdir -p data
! curl https://atgenomics-data.s3.amazonaws.com/IGC.annotation.tsv.gz -o data/IGC.annotation.tsv.gz

df = pd.read_csv("data/IGC.annotation.tsv.gz", sep='\t')
```

Alternatively, we can load remote files using only Pandas; remember to remove the `#` from the following cell:

In [4]:
# df = pd.read_csv('https://atgenomics-data.s3.amazonaws.com/IGC.annotation.tsv.gz', sep='\t')

### Using bash commands in Jupyter Notebook cells

We can run bash commands in a whole Jupyter cell with the cell magic `%%bash`:

In [5]:
%%bash

for i in {1..10}
do
    echo $i
done

1
2
3
4
5
6
7
8
9
10


IPython magic command `%ls`:

In [6]:
%ls -ltrh *ipynb

-rw-r--r-- 1 pgc288  41K Feb  8 19:27 day01.ipynb
-rw-r--r-- 1 pgc288  793 Feb  8 22:34 Untitled1.ipynb
-rw-r--r-- 1 pgc288  65K Feb  8 22:41 day02.ipynb
-rw-r--r-- 1 pgc288 7.2K Feb  8 22:54 class_and_methods.ipynb
-rw-r--r-- 1 pgc288 390K Feb 10 08:04 day03.ipynb


Linux command with `!`

In [7]:
! ls -ltrh *ipynb

! echo "This is Bash"

-rw-r--r-- 1 pgc288 staff  41K Feb  8 19:27 day01.ipynb
-rw-r--r-- 1 pgc288 staff  793 Feb  8 22:34 Untitled1.ipynb
-rw-r--r-- 1 pgc288 staff  65K Feb  8 22:41 day02.ipynb
-rw-r--r-- 1 pgc288 staff 7.2K Feb  8 22:54 class_and_methods.ipynb
-rw-r--r-- 1 pgc288 staff 390K Feb 10 08:04 day03.ipynb
This is Bash


In [8]:
myvar = !ls

In [9]:
myvar

['Untitled1.ipynb',
 '__pycache__',
 'atgtools.py',
 'class_and_methods.ipynb',
 'data',
 'day01.ipynb',
 'day02.ipynb',
 'day03.ipynb',
 'get_codons.py',
 'get_gc_content.py']

Magic commands available per line/cell

In [10]:
%lsmagic

Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %conda  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%

### Using only Python to download files

Download the remote file using ONLY Python modules

In [11]:
import os
import requests

Create the directory

In [12]:
os.makedirs('./data/', exist_ok = True)

Define a variable with the URL

In [13]:
url = 'https://atgenomics-data.s3.amazonaws.com/IGC.annotation.tsv.gz'

Use requests to fetch the content of the file at the URL

In [14]:
r = requests.get(url, allow_redirects=True)

Write the content of the response to a file

In [15]:
open('./data/IGC.annotation.tsv.gz', 'wb').write(r.content)

30375588
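
A common variation, sketched here under the assumption that the response body fits in memory, wraps the write in a `with` block so the file is closed even if an error occurs. `save_bytes` is a hypothetical helper name, and the demonstration uses a temporary directory and placeholder bytes rather than the real download:

```python
import os
import tempfile


def save_bytes(content: bytes, path: str) -> int:
    """Hypothetical helper: write `content` to `path` and return the bytes written."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)  # create missing folders
    with open(path, "wb") as handle:        # closed automatically on exit
        return handle.write(content)


# Demonstration with placeholder bytes instead of r.content
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "data", "example.bin")
    written = save_bytes(b"ACGT" * 4, target)
    size_on_disk = os.path.getsize(target)

print(written, size_on_disk)
```

With the objects defined in this notebook the call would be `save_bytes(r.content, './data/IGC.annotation.tsv.gz')`. Calling `r.raise_for_status()` beforehand makes a failed request raise an exception instead of silently saving an error page.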

Load the Pandas package and read the file into pandas

In [16]:
import pandas as pd

In [17]:
df = pd.read_csv('data/IGC.annotation.tsv.gz', sep='\t')

In [18]:
df.head()

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
0,911104.WcibK1_010100007220,24615,Complete,SP,Firmicutes,Weissella,unknown,NOG12793,0.066298,0.072897,unknown,Function unknown,EUR
1,585054.EFER_0542,21669,Complete,SP,Proteobacteria,Escherichia,unknown,NOG12793,0.058406,0.062617,unknown,Function unknown,EUR;CHN
2,SZEY-48A_GL0052647,20778,Complete,CHN,Proteobacteria,Escherichia,K01317,NOG12793,0.203631,0.21215,Enzyme Families,Function unknown,EUR;CHN
3,1048689.ECO55CA74_02930,20778,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.095501,0.096262,Enzyme Families,Function unknown,EUR;CHN
4,MH0427_GL0087973,20775,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA


### Reading Excel files, *xlsx*

We can read an Excel file directly and print the names of its worksheets

In [19]:
# xlsx = pd.ExcelFile('miarchivoExcel.xlsx')

List the names of all worksheets:

In [20]:
# xlsx.sheet_names

# df = pd.read_excel(xlsx, "Datos")

Or read the worksheet directly, specifying its name, the number of rows to skip (often the header block found in many official documents) and the column to use as the index.

In [21]:
# df = pd.read_excel('miarchivoExcel.xlsx', sheet_name='Datos', skiprows=3, index_col=1)

## Overview of *DataFrames*

It is important to always keep in mind which data type we are working with on each line. In this case, `df` is a:

In [22]:
type(df)

pandas.core.frame.DataFrame

We obtain the dimensions of the DataFrame: the number of rows and columns.

In [23]:
len(df)

1285642

In [24]:
df.shape

(1285642, 13)

What data type does each column hold? Is it the right one? Pandas assigns data types automatically, and some may need to be corrected before certain methods can be used. For example, numeric methods cannot be applied where the values are 'objects' (strings, lists, dictionaries), nor to a column that only holds categorical values.

In [25]:
df.dtypes

Gene Name                              object
Gene Length                             int64
Gene Completeness                      object
Cohort Origin                          object
Taxonomic Annotation(Phylum Level)     object
Taxonomic Annotation(Genus Level)      object
KEGG Annotation                        object
eggNOG Annotation                      object
Sample Occurence Frequency            float64
Individual Occurence Frequency        float64
KEGG Functional Categories             object
eggNOG Functional Categories           object
Cohort Assembled                       object
dtype: object

| Pandas dtype | Python | Use |
|--|--|--|
|`object`| string or mixed | Text, numbers stored as strings, lists or dictionaries |
|`int64`| int | Integers |
|`float64`| float | Decimal numbers |
|`bool`| bool | True/False values |
|`datetime64`| datetime | Date and time values |
|`category`| NA | Text categories |
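
As a hedged sketch of how an automatically assigned type might be corrected, the toy table below stores lengths as text and converts them with `pd.to_numeric`, then declares a low-cardinality text column as `category`. The column names mirror the dataset, but the values are made up:

```python
import pandas as pd

# Toy table; values are invented for the illustration
toy = pd.DataFrame(
    {
        "Gene Length": ["102", "453", "819"],                      # numbers stored as text
        "Gene Completeness": ["Complete", "Complete", "Partial"],  # few distinct labels
    }
)
print(toy.dtypes)

# Convert text to integers and the repeated labels to a categorical
toy["Gene Length"] = pd.to_numeric(toy["Gene Length"])
toy["Gene Completeness"] = toy["Gene Completeness"].astype("category")

print(toy.dtypes)
print(toy["Gene Length"].mean())  # numeric methods now work
```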

An extended description of the DataFrame. At the end we see the memory usage of this dataset: is it a little or a lot? What do you think `Non-Null Count` is?

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1285642 entries, 0 to 1285641
Data columns (total 13 columns):
 #   Column                              Non-Null Count    Dtype  
---  ------                              --------------    -----  
 0   Gene Name                           1285642 non-null  object 
 1   Gene Length                         1285642 non-null  int64  
 2   Gene Completeness                   1285642 non-null  object 
 3   Cohort Origin                       1285642 non-null  object 
 4   Taxonomic Annotation(Phylum Level)  1285642 non-null  object 
 5   Taxonomic Annotation(Genus Level)   1285642 non-null  object 
 6   KEGG Annotation                     1285642 non-null  object 
 7   eggNOG Annotation                   1285642 non-null  object 
 8   Sample Occurence Frequency          1285642 non-null  float64
 9   Individual Occurence Frequency      1285642 non-null  float64
 10  KEGG Functional Categories          1285642 non-null  object 
 11  eggNOG Func
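
One way to explore the `Non-Null Count` question on a small scale is to count present and missing values directly. This is a sketch on a made-up table with one missing entry, not on the real dataset:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Gene Name": ["gene_a", "gene_b", "gene_c"],   # made-up names
        "Cohort Assembled": ["EUR", None, "EUR;CHN"],  # one missing value
    }
)

non_null = toy.count()                # non-missing values per column
missing = toy.isna().sum()            # missing values per column
memory = toy.memory_usage(deep=True)  # bytes per column, string contents included

print(non_null)
print(missing)
print(memory.sum())
```

`count()` plus `isna().sum()` always adds up to the number of rows, which is the figure shown on the `RangeIndex` line of `df.info()`.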

The `df.head()` method returns the first 5 rows of the DataFrame. We can pass the desired number inside the parentheses, for example `df.head(20)` for the first 20 rows.

In [27]:
df.head()

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
0,911104.WcibK1_010100007220,24615,Complete,SP,Firmicutes,Weissella,unknown,NOG12793,0.066298,0.072897,unknown,Function unknown,EUR
1,585054.EFER_0542,21669,Complete,SP,Proteobacteria,Escherichia,unknown,NOG12793,0.058406,0.062617,unknown,Function unknown,EUR;CHN
2,SZEY-48A_GL0052647,20778,Complete,CHN,Proteobacteria,Escherichia,K01317,NOG12793,0.203631,0.21215,Enzyme Families,Function unknown,EUR;CHN
3,1048689.ECO55CA74_02930,20778,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.095501,0.096262,Enzyme Families,Function unknown,EUR;CHN
4,MH0427_GL0087973,20775,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA


Now the last 5.

In [28]:
df.tail()

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
1285637,665954.HMPREF1017_02554,102,Complete,SP,Bacteroidetes,Bacteroides,unknown,unknown,0.089187,0.07757,unknown,unknown,
1285638,753642.ECNC101_09444,102,Complete,SP,Proteobacteria,Escherichia,unknown,unknown,0.001579,0.000935,unknown,unknown,
1285639,768724.HMPREF9131_0965,102,Complete,SP,Firmicutes,Peptoniphilus,unknown,unknown,0.000789,0.000935,unknown,unknown,
1285640,866771.HMPREF9296_1721,102,Complete,SP,Bacteroidetes,Prevotella,unknown,unknown,0.000789,0.000935,unknown,unknown,
1285641,926026.ECO5101_18612,102,Complete,SP,Proteobacteria,Escherichia,unknown,proNOG05789,0.001579,0.001869,unknown,Function unknown,


We can use `sample()` to randomly sample rows from the DataFrame

In [29]:
df.sample()

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
3724,V1.UC31-0_GL0030911,4548,Complete,EUR,Firmicutes,unknown,K00284,COG0067;COG0069;COG0070,0.342541,0.348598,Energy Metabolism,Amino acid transport and metabolism,EUR;CHN;USA


We can place the cursor inside the parentheses of `df.sample()` and press `SHIFT + TAB` to get help for that pandas method, or for any method you are using:

![green-divider](imgs/help.png)

In [30]:
df.sample(10)

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
1125162,MH0434_GL0244560,285,Complete,EUR,Bacteroidetes,Bacteroides,unknown,COG3436,0.12865,0.137383,unknown,"Replication, recombination and repair",EUR
1168023,MH0011_GL0004548,240,Complete,EUR,Firmicutes,Faecalibacterium,unknown,NOG14713,0.209155,0.201869,unknown,Function unknown,EUR;USA
737000,MH0161_GL0014740,711,Complete,EUR,Firmicutes,Faecalibacterium,K01104,COG4464,0.399369,0.384112,Cellular Processes and Signaling,Carbohydrate transport and metabolism;Cell wal...,EUR;CHN;USA
944007,MH0266_GL0022060,474,Complete,EUR,Bacteroidetes,unknown,K00783,COG1576,0.002368,0.002804,Poorly Characterized,Function unknown,EUR;CHN
945462,T2D-63A_GL0026446,474,Complete,CHN,Firmicutes,unknown,K06950,COG1418,0.001579,0.001869,Poorly Characterized,General function prediction only,CHN
1213317,159733294-stool1_revised_scaffold26513_1_gene1...,198,Complete,USA,Bacteroidetes,unknown,unknown,unknown,0.000789,0.000935,unknown,unknown,USA
112079,V1.UC35-0_GL0110065,1866,Complete,EUR,Bacteroidetes,Porphyromonas,K00174,COG0674;COG1014,0.009471,0.008411,Carbohydrate Metabolism,Energy production and conversion,EUR
313806,MH0356_GL0008889,1269,Complete,EUR,Actinobacteria,Bifidobacterium,K01436,COG1473,0.270718,0.26729,Enzyme Families,General function prediction only,EUR;CHN
235063,MH0057_GL0019817,1407,Complete,EUR,Bacteroidetes,Paraprevotella,unknown,NOG75428,0.192581,0.178505,unknown,unknown,EUR;CHN;USA
139796,MH0274_GL0144378,1722,Complete,EUR,Bacteroidetes,unknown,unknown,NOG235407,0.080505,0.084112,unknown,Function unknown,EUR;CHN;USA


For numeric columns we can use `describe()` to obtain summary statistics such as the mean, standard deviation and quartiles. Note that if a boolean or categorical column is encoded as numbers, so that `df.dtypes` reports a numeric type, the descriptors are still computed even though they are meaningless for categories.

In [31]:
df.describe()

Unnamed: 0,Gene Length,Sample Occurence Frequency,Individual Occurence Frequency
count,1285642.0,1285642.0,1285642.0
mean,958.1168,0.1361281,0.1352274
std,728.5365,0.1941363,0.1932935
min,102.0,0.0,0.0
25%,453.0,0.01420679,0.01401869
50%,819.0,0.04814522,0.04859813
75%,1257.0,0.175217,0.1738318
max,24615.0,1.0,1.0
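
The caveat about numerically encoded categories can be seen in a small sketch. `cohort_code` is a hypothetical column invented for this example:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Gene Length": [102, 453, 819, 1257],
        "cohort_code": [1, 2, 2, 3],  # hypothetical numeric label for a category
    }
)

# describe() summarises every numeric column, meaningful or not
summary = toy.describe()
print(summary.loc["mean"])

# Declaring the codes as categorical gives count/unique/top/freq instead
toy["cohort_code"] = toy["cohort_code"].astype("category")
code_summary = toy["cohort_code"].describe()
print(code_summary)

# By default the categorical column is now left out of the numeric summary
print(list(toy.describe().columns))
```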


We can transpose a table with the `transpose()` method or the `T` property

In [32]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Gene Length,1285642.0,958.116782,728.536549,102.0,453.0,819.0,1257.0,24615.0
Sample Occurence Frequency,1285642.0,0.136128,0.194136,0.0,0.014207,0.048145,0.175217,1.0
Individual Occurence Frequency,1285642.0,0.135227,0.193294,0.0,0.014019,0.048598,0.173832,1.0


In [33]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Gene Length,1285642.0,958.116782,728.536549,102.0,453.0,819.0,1257.0,24615.0
Sample Occurence Frequency,1285642.0,0.136128,0.194136,0.0,0.014207,0.048145,0.175217,1.0
Individual Occurence Frequency,1285642.0,0.135227,0.193294,0.0,0.014019,0.048598,0.173832,1.0


Because Python is dynamically typed, we can assign a variable at any point without declaring it beforehand.

In [34]:
description = df.describe().T

and save the contents of a variable, or of a DataFrame, to a file. Called without a path, `to_csv()` returns the CSV text instead of writing it:

In [35]:
description.to_csv()

',count,mean,std,min,25%,50%,75%,max\nGene Length,1285642.0,958.116782121306,728.536548632577,102.0,453.0,819.0,1257.0,24615.0\nSample Occurence Frequency,1285642.0,0.13612813347528085,0.19413631329975223,0.0,0.0142067876874507,0.0481452249408051,0.175217048145225,1.0\nIndividual Occurence Frequency,1285642.0,0.13522741836228966,0.1932935302121455,0.0,0.014018691588785,0.0485981308411215,0.173831775700935,1.0\n'

In [36]:
description.to_csv('data/description.tsv', sep='\t', na_rep="**", index=True)
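
The `na_rep` marker matters when the file is read back. The following sketch performs a round trip in memory with `io.StringIO`, so nothing is written to disk; the small `summary` table is made up for the example:

```python
import io

import pandas as pd

summary = pd.DataFrame(
    {"count": [3.0, 3.0], "mean": [458.0, None]},  # one missing value
    index=["Gene Length", "Sample Occurence Frequency"],
)

# With no path, to_csv() returns the text; missing values are written as "**"
text = summary.to_csv(sep="\t", na_rep="**", index=True)
print(text)

# When reading, tell pandas which marker stands for a missing value
restored = pd.read_csv(io.StringIO(text), sep="\t", index_col=0, na_values="**")
print(restored)
```

Without `na_values="**"`, the `mean` column would be read back as text rather than as numbers.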

### To Copy or not To Copy

In [37]:
df_test = pd.DataFrame({'A': [1, 2, 3]})

In [38]:
df_test

Unnamed: 0,A
0,1
1,2
2,3


In [39]:
df_sub = df_test[0:2]

In [40]:
df_sub

Unnamed: 0,A
0,1
1,2


In [41]:
df_sub.loc[0, 'A'] = -1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sub.loc[0, 'A'] = -1


In [42]:
df_sub

Unnamed: 0,A
0,-1
1,2


In [43]:
df_test

Unnamed: 0,A
0,-1
1,2
2,3


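As the output above shows, in the pandas version used here modifying a subset also modified the original DataFrame; this behaviour can differ between pandas versions. To make the outcome explicit and avoid this potential problem, we can use the `copy()` method.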

In [44]:
df_test = pd.DataFrame({'A': [1, 2, 3]})

In [45]:
df_copy = df_test[0:2].copy()

In [46]:
df_copy.loc[0, 'A'] = -1

In [47]:
df_copy

Unnamed: 0,A
0,-1
1,2


In [48]:
df_test

Unnamed: 0,A
0,1
1,2
2,3


We can use **method chaining**, a technique for making several method calls in sequence, each applied to the result of the previous one, while writing the object reference only once: `df.uno().dos().tres().cuatro()`

In [49]:
df.head(100).describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Gene Length,100.0,14855.25,2779.32897,12138.0,12709.5,13570.5,16718.25,24615.0
Sample Occurence Frequency,100.0,0.162928,0.162151,0.002368,0.059984,0.093528,0.206393,0.831097
Individual Occurence Frequency,100.0,0.163607,0.159957,0.001869,0.061682,0.095794,0.213084,0.831776
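
Longer chains are often easier to read with one step per line. A minimal sketch on a made-up table, wrapping the chain in parentheses so it can span several lines:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Gene Name": ["gene_a", "gene_b", "gene_c", "gene_d"],  # made-up names
        "Gene Length": [819, 102, 1257, 453],
    }
)

# Each method returns a new object, so the next call applies to that result
longest = (
    toy
    .sort_values("Gene Length", ascending=False)  # DataFrame, longest first
    .head(2)                                      # DataFrame with 2 rows
    .set_index("Gene Name")                       # DataFrame indexed by name
    ["Gene Length"]                               # Series
)
print(longest)
```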


The `columns` and `values` attributes give access to the columns and the values of the DataFrame, respectively.

In [50]:
for x in df.columns:
    print(x)

Gene Name
Gene Length
Gene Completeness
Cohort Origin
Taxonomic Annotation(Phylum Level)
Taxonomic Annotation(Genus Level)
KEGG Annotation
eggNOG Annotation
Sample Occurence Frequency
Individual Occurence Frequency
KEGG Functional Categories
eggNOG Functional Categories
Cohort Assembled


In [51]:
[x for x in df.columns].upper()

AttributeError: 'list' object has no attribute 'upper'

In [52]:
df.columns.upper()

AttributeError: 'Index' object has no attribute 'upper'

In [53]:
df.columns.str.upper()

Index(['GENE NAME', 'GENE LENGTH', 'GENE COMPLETENESS', 'COHORT ORIGIN',
       'TAXONOMIC ANNOTATION(PHYLUM LEVEL)',
       'TAXONOMIC ANNOTATION(GENUS LEVEL)', 'KEGG ANNOTATION',
       'EGGNOG ANNOTATION', 'SAMPLE OCCURENCE FREQUENCY',
       'INDIVIDUAL OCCURENCE FREQUENCY', 'KEGG FUNCTIONAL CATEGORIES',
       'EGGNOG FUNCTIONAL CATEGORIES', 'COHORT ASSEMBLED'],
      dtype='object')
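
The two errors above occur because `upper()` is a string method, so it has to be applied to each name rather than to the list or the `Index` as a whole. A sketch of both working forms on an empty toy table; the snake_case renaming at the end is only a convention chosen for this example:

```python
import pandas as pd

toy = pd.DataFrame(columns=["Gene Name", "Gene Length", "KEGG Annotation"])

# Apply the str method to each element inside the comprehension
upper_list = [name.upper() for name in toy.columns]

# The .str accessor does the same element-wise on an Index (or a Series)
upper_index = toy.columns.str.upper()

# Assigning back renames the columns
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

print(upper_list)
print(list(toy.columns))
```

Column names without spaces can also be reached as attributes, for instance `toy.gene_name`.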

In [54]:
df.values

array([['911104.WcibK1_010100007220', 24615, 'Complete', ..., 'unknown',
        'Function unknown', 'EUR'],
       ['585054.EFER_0542', 21669, 'Complete', ..., 'unknown',
        'Function unknown', 'EUR;CHN'],
       ['SZEY-48A_GL0052647', 20778, 'Complete', ..., 'Enzyme Families',
        'Function unknown', 'EUR;CHN'],
       ...,
       ['768724.HMPREF9131_0965', 102, 'Complete', ..., 'unknown',
        'unknown', nan],
       ['866771.HMPREF9296_1721', 102, 'Complete', ..., 'unknown',
        'unknown', nan],
       ['926026.ECO5101_18612', 102, 'Complete', ..., 'unknown',
        'Function unknown', nan]], dtype=object)

What types are the following attributes?

In [55]:
type(df.shape)

tuple

In [56]:
type(df.dtypes)

pandas.core.series.Series

In [57]:
type(df.columns)

pandas.core.indexes.base.Index

In [58]:
type(df.values)

numpy.ndarray

### Selecting columns

We select columns with `[]` for a single column or `[[]]` for a list of columns.

In [59]:
df.columns

Index(['Gene Name', 'Gene Length', 'Gene Completeness', 'Cohort Origin',
       'Taxonomic Annotation(Phylum Level)',
       'Taxonomic Annotation(Genus Level)', 'KEGG Annotation',
       'eggNOG Annotation', 'Sample Occurence Frequency',
       'Individual Occurence Frequency', 'KEGG Functional Categories',
       'eggNOG Functional Categories', 'Cohort Assembled'],
      dtype='object')

In [60]:
df.Gene 

AttributeError: 'DataFrame' object has no attribute 'Gene'

In [None]:
df['Gene Name']

In [61]:
df['Gene Name'].head()

0    911104.WcibK1_010100007220
1              585054.EFER_0542
2            SZEY-48A_GL0052647
3       1048689.ECO55CA74_02930
4              MH0427_GL0087973
Name: Gene Name, dtype: object

In [62]:
df[['Gene Name', 'Gene Length']].head()

Unnamed: 0,Gene Name,Gene Length
0,911104.WcibK1_010100007220,24615
1,585054.EFER_0542,21669
2,SZEY-48A_GL0052647,20778
3,1048689.ECO55CA74_02930,20778
4,MH0427_GL0087973,20775


In [63]:
type(df['Gene Name'])

pandas.core.series.Series

In [64]:
type(df[['Gene Name']])

pandas.core.frame.DataFrame

In [65]:
type(df[['Gene Name', 'Gene Length']])

pandas.core.frame.DataFrame

When we transpose, the index labels become the column labels, and the column names become the index

In [66]:
df[['Gene Name', 'Gene Length']].head().T

Unnamed: 0,0,1,2,3,4
Gene Name,911104.WcibK1_010100007220,585054.EFER_0542,SZEY-48A_GL0052647,1048689.ECO55CA74_02930,MH0427_GL0087973
Gene Length,24615,21669,20778,20778,20775


With `set_index()` we choose the column to use as the DataFrame index.

In [67]:
tmp = df.head().set_index('Gene Length')

In [68]:
tmp.reset_index()

Unnamed: 0,Gene Length,Gene Name,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
0,24615,911104.WcibK1_010100007220,Complete,SP,Firmicutes,Weissella,unknown,NOG12793,0.066298,0.072897,unknown,Function unknown,EUR
1,21669,585054.EFER_0542,Complete,SP,Proteobacteria,Escherichia,unknown,NOG12793,0.058406,0.062617,unknown,Function unknown,EUR;CHN
2,20778,SZEY-48A_GL0052647,Complete,CHN,Proteobacteria,Escherichia,K01317,NOG12793,0.203631,0.21215,Enzyme Families,Function unknown,EUR;CHN
3,20778,1048689.ECO55CA74_02930,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.095501,0.096262,Enzyme Families,Function unknown,EUR;CHN
4,20775,MH0427_GL0087973,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA


In [69]:
df[['Gene Name', 'Gene Length']].head()

Unnamed: 0,Gene Name,Gene Length
0,911104.WcibK1_010100007220,24615
1,585054.EFER_0542,21669
2,SZEY-48A_GL0052647,20778
3,1048689.ECO55CA74_02930,20778
4,MH0427_GL0087973,20775


In [70]:
df[['Gene Name', 'Gene Length']].head().T

Unnamed: 0,0,1,2,3,4
Gene Name,911104.WcibK1_010100007220,585054.EFER_0542,SZEY-48A_GL0052647,1048689.ECO55CA74_02930,MH0427_GL0087973
Gene Length,24615,21669,20778,20778,20775


In [71]:
df[['Gene Name', 'Gene Length']].set_index('Gene Name').head()

Unnamed: 0_level_0,Gene Length
Gene Name,Unnamed: 1_level_1
911104.WcibK1_010100007220,24615
585054.EFER_0542,21669
SZEY-48A_GL0052647,20778
1048689.ECO55CA74_02930,20778
MH0427_GL0087973,20775


Unless we assign the result to a new variable or overwrite the original DataFrame, the new index is not saved.

In [72]:
df[['Gene Name', 'Gene Length']].set_index('Gene Name').head().T

Gene Name,911104.WcibK1_010100007220,585054.EFER_0542,SZEY-48A_GL0052647,1048689.ECO55CA74_02930,MH0427_GL0087973
Gene Length,24615,21669,20778,20778,20775
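
A short sketch of that point on a made-up two-row table: calling `set_index()` without keeping the result leaves the DataFrame untouched, while assigning it preserves the new index.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Gene Name": ["gene_a", "gene_b"], "Gene Length": [819, 102]}  # made-up rows
)

toy.set_index("Gene Name")            # result discarded: toy is unchanged
print(list(toy.columns))

indexed = toy.set_index("Gene Name")  # keep the result under a new name
print(indexed.loc["gene_b", "Gene Length"])

restored = indexed.reset_index()      # move the index back into a column
print(list(restored.columns))
```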


### Sorting rows

In [73]:
df.head()

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
0,911104.WcibK1_010100007220,24615,Complete,SP,Firmicutes,Weissella,unknown,NOG12793,0.066298,0.072897,unknown,Function unknown,EUR
1,585054.EFER_0542,21669,Complete,SP,Proteobacteria,Escherichia,unknown,NOG12793,0.058406,0.062617,unknown,Function unknown,EUR;CHN
2,SZEY-48A_GL0052647,20778,Complete,CHN,Proteobacteria,Escherichia,K01317,NOG12793,0.203631,0.21215,Enzyme Families,Function unknown,EUR;CHN
3,1048689.ECO55CA74_02930,20778,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.095501,0.096262,Enzyme Families,Function unknown,EUR;CHN
4,MH0427_GL0087973,20775,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA


In [74]:
df.sort_index('Gene Length').head()

  df.sort_index('Gene Length').head()


ValueError: No axis named Gene Length for object type DataFrame

In [75]:
df.sort_index(axis=1, level='Gene Length').head()

Unnamed: 0,Cohort Assembled,Cohort Origin,Gene Completeness,Gene Length,Gene Name,Individual Occurence Frequency,KEGG Annotation,KEGG Functional Categories,Sample Occurence Frequency,Taxonomic Annotation(Genus Level),Taxonomic Annotation(Phylum Level),eggNOG Annotation,eggNOG Functional Categories
0,EUR,SP,Complete,24615,911104.WcibK1_010100007220,0.072897,unknown,unknown,0.066298,Weissella,Firmicutes,NOG12793,Function unknown
1,EUR;CHN,SP,Complete,21669,585054.EFER_0542,0.062617,unknown,unknown,0.058406,Escherichia,Proteobacteria,NOG12793,Function unknown
2,EUR;CHN,CHN,Complete,20778,SZEY-48A_GL0052647,0.21215,K01317,Enzyme Families,0.203631,Escherichia,Proteobacteria,NOG12793,Function unknown
3,EUR;CHN,SP,Complete,20778,1048689.ECO55CA74_02930,0.096262,K01317,Enzyme Families,0.095501,Escherichia,Proteobacteria,NOG12793,Function unknown
4,EUR;CHN;USA,EUR,Complete,20775,MH0427_GL0087973,0.104673,K01317,Enzyme Families,0.098658,Escherichia,Proteobacteria,NOG12793,Function unknown


In [76]:
df[['Gene Length']].sort_index(axis=1).head()

Unnamed: 0,Gene Length
0,24615
1,21669
2,20778
3,20778
4,20775
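
The error in the first call arises because `sort_index()` orders by index labels and expects an axis, not a column name. A minimal sketch on a toy table with a shuffled index, contrasting it with `sort_values()`:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Gene Length": [819, 102, 453], "Cohort Origin": ["EUR", "SP", "CHN"]},
    index=[2, 0, 1],  # deliberately out of order
)

by_row_label = toy.sort_index()            # rows ordered by index label
by_col_label = toy.sort_index(axis=1)      # columns ordered alphabetically
by_value = toy.sort_values("Gene Length")  # rows ordered by a column's values

print(list(by_row_label.index))
print(list(by_col_label.columns))
print(list(by_value["Gene Length"]))
```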


### Sorting by column

In [77]:
df.sort_values('Gene Length').head()

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
1285641,926026.ECO5101_18612,102,Complete,SP,Proteobacteria,Escherichia,unknown,proNOG05789,0.001579,0.001869,unknown,Function unknown,
1285206,O2.UC13-0_GL0011042,102,Complete,EUR,Actinobacteria,unknown,unknown,unknown,0.002368,0.000935,unknown,unknown,EUR
1285205,O2.UC12-1_GL0011992,102,Complete,EUR,Actinobacteria,Collinsella,unknown,unknown,0.011839,0.01028,unknown,unknown,EUR
1285204,O2.CD3-0-PT_GL0126180,102,Complete,EUR,Firmicutes,unknown,unknown,unknown,0.001579,0.000935,unknown,unknown,EUR
1285203,O2.CD3-0-PT_GL0020295,102,Complete,EUR,Firmicutes,Clostridium,unknown,unknown,0.011839,0.013084,unknown,unknown,EUR


In [78]:
df.sort_values('Gene Length', ascending=False).head()

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
0,911104.WcibK1_010100007220,24615,Complete,SP,Firmicutes,Weissella,unknown,NOG12793,0.066298,0.072897,unknown,Function unknown,EUR
1,585054.EFER_0542,21669,Complete,SP,Proteobacteria,Escherichia,unknown,NOG12793,0.058406,0.062617,unknown,Function unknown,EUR;CHN
2,SZEY-48A_GL0052647,20778,Complete,CHN,Proteobacteria,Escherichia,K01317,NOG12793,0.203631,0.21215,Enzyme Families,Function unknown,EUR;CHN
3,1048689.ECO55CA74_02930,20778,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.095501,0.096262,Enzyme Families,Function unknown,EUR;CHN
4,MH0427_GL0087973,20775,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA


We can select a column, sort its values in descending order and take the first rows.

In [79]:
type(df.sort_values('Gene Length', ascending=False).head())

pandas.core.frame.DataFrame

In [80]:
type(df['Gene Length'].sort_values(ascending=False).head())

pandas.core.series.Series

**How many KEGG functional categories are there, and how many genes are in each category?**

In [81]:
df['KEGG Functional Categories'].value_counts()

unknown                                                                        512851
Carbohydrate Metabolism                                                         90254
Membrane Transport                                                              72775
Cellular Processes and Signaling                                                61821
Poorly Characterized                                                            61400
                                                                                ...  
Cell Motility;Membrane Transport                                                    1
Carbohydrate Metabolism;Transport and Catabolism                                    1
Carbohydrate Metabolism;Glycan Biosynthesis and Metabolism;Metabolism               1
Folding, Sorting and Degradation;Genetic Information Processing;Translation         1
Energy Metabolism;Genetic Information Processing                                    1
Name: KEGG Functional Categories, Length: 199, dtype: 
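
Some entries combine several labels separated by `;`, and `value_counts()` treats each combination as its own category. If one wanted counts per individual label instead, a possible approach (an alternative reading of the question, not what the cells here compute) is to split and explode before counting. The values below are a small made-up sample:

```python
import pandas as pd

categories = pd.Series(
    [
        "Carbohydrate Metabolism",
        "Cell Motility;Membrane Transport",
        "Membrane Transport",
        "unknown",
    ],
    name="KEGG Functional Categories",
)

# Each distinct string, combined labels included, counts as one category
print(categories.nunique())

# Split on ';' and give each label its own row before counting
per_label = categories.str.split(";").explode().value_counts()
print(per_label)
```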

In [82]:
df['KEGG Functional Categories'].unique()

array(['unknown', 'Enzyme Families', 'Membrane Transport',
       'Infectious Diseases', 'Amino Acid Metabolism',
       'Carbohydrate Metabolism', 'Cellular Processes and Signaling',
       'Metabolism', 'Signaling Molecules and Interaction',
       'Cellular Processes and Signaling;Signaling Molecules and Interaction',
       'Replication and Repair', 'Lipid Metabolism',
       'Genetic Information Processing',
       'Metabolism of Terpenoids and Polyketides',
       'Glycan Biosynthesis and Metabolism',
       'Amino Acid Metabolism;Genetic Information Processing',
       'Membrane Transport;Signaling Molecules and Interaction',
       'Metabolism of Other Amino Acids', 'Poorly Characterized',
       'Signal Transduction', 'Transcription',
       'Genetic Information Processing;Replication and Repair',
       'Metabolism of Cofactors and Vitamins',
       'Carbohydrate Metabolism;Metabolism', 'Translation',
       'Nucleotide Metabolism', 'Energy Metabolism',
       'Folding, Sorti

In [83]:
from collections import Counter

In [84]:
Counter(df['KEGG Functional Categories']).most_common()

[('unknown', 512851),
 ('Carbohydrate Metabolism', 90254),
 ('Membrane Transport', 72775),
 ('Cellular Processes and Signaling', 61821),
 ('Poorly Characterized', 61400),
 ('Amino Acid Metabolism', 56274),
 ('Metabolism', 54868),
 ('Genetic Information Processing', 49299),
 ('Translation', 46951),
 ('Replication and Repair', 46526),
 ('Transcription', 36173),
 ('Nucleotide Metabolism', 34335),
 ('Energy Metabolism', 27446),
 ('Metabolism of Cofactors and Vitamins', 26134),
 ('Enzyme Families', 22267),
 ('Folding, Sorting and Degradation', 17248),
 ('Signal Transduction', 16494),
 ('Lipid Metabolism', 13502),
 ('Glycan Biosynthesis and Metabolism', 13240),
 ('Metabolism of Terpenoids and Polyketides', 8769),
 ('Metabolism of Other Amino Acids', 4782),
 ('Xenobiotics Biodegradation and Metabolism', 3118),
 ('Signaling Molecules and Interaction', 2097),
 ('Cell Motility', 1655),
 ('Infectious Diseases', 1603),
 ('Biosynthesis of Other Secondary Metabolites', 1236),
 ('Cell Growth and Deat

In [85]:
kegg = []

for category in df['KEGG Functional Categories']:
    if category not in kegg:
        kegg.append(category)

In [86]:
kegg = {}

for category in df['KEGG Functional Categories']:
    if category not in kegg:
        kegg[category] = 1
    else:
        kegg[category] += 1

In [87]:
sorted(kegg.items(), key=lambda x: x[1], reverse=True)

[('unknown', 512851),
 ('Carbohydrate Metabolism', 90254),
 ('Membrane Transport', 72775),
 ('Cellular Processes and Signaling', 61821),
 ('Poorly Characterized', 61400),
 ('Amino Acid Metabolism', 56274),
 ('Metabolism', 54868),
 ('Genetic Information Processing', 49299),
 ('Translation', 46951),
 ('Replication and Repair', 46526),
 ('Transcription', 36173),
 ('Nucleotide Metabolism', 34335),
 ('Energy Metabolism', 27446),
 ('Metabolism of Cofactors and Vitamins', 26134),
 ('Enzyme Families', 22267),
 ('Folding, Sorting and Degradation', 17248),
 ('Signal Transduction', 16494),
 ('Lipid Metabolism', 13502),
 ('Glycan Biosynthesis and Metabolism', 13240),
 ('Metabolism of Terpenoids and Polyketides', 8769),
 ('Metabolism of Other Amino Acids', 4782),
 ('Xenobiotics Biodegradation and Metabolism', 3118),
 ('Signaling Molecules and Interaction', 2097),
 ('Cell Motility', 1655),
 ('Infectious Diseases', 1603),
 ('Biosynthesis of Other Secondary Metabolites', 1236),
 ('Cell Growth and Deat

In [88]:
{k: v for k, v in sorted(kegg.items(), key=lambda x: x[1], reverse=True)}

{'unknown': 512851,
 'Carbohydrate Metabolism': 90254,
 'Membrane Transport': 72775,
 'Cellular Processes and Signaling': 61821,
 'Poorly Characterized': 61400,
 'Amino Acid Metabolism': 56274,
 'Metabolism': 54868,
 'Genetic Information Processing': 49299,
 'Translation': 46951,
 'Replication and Repair': 46526,
 'Transcription': 36173,
 'Nucleotide Metabolism': 34335,
 'Energy Metabolism': 27446,
 'Metabolism of Cofactors and Vitamins': 26134,
 'Enzyme Families': 22267,
 'Folding, Sorting and Degradation': 17248,
 'Signal Transduction': 16494,
 'Lipid Metabolism': 13502,
 'Glycan Biosynthesis and Metabolism': 13240,
 'Metabolism of Terpenoids and Polyketides': 8769,
 'Metabolism of Other Amino Acids': 4782,
 'Xenobiotics Biodegradation and Metabolism': 3118,
 'Signaling Molecules and Interaction': 2097,
 'Cell Motility': 1655,
 'Infectious Diseases': 1603,
 'Biosynthesis of Other Secondary Metabolites': 1236,
 'Cell Growth and Death': 524,
 'Transport and Catabolism': 279,
 'Carbohyd

In [89]:
kegg

{'unknown': 512851,
 'Enzyme Families': 22267,
 'Membrane Transport': 72775,
 'Infectious Diseases': 1603,
 'Amino Acid Metabolism': 56274,
 'Carbohydrate Metabolism': 90254,
 'Cellular Processes and Signaling': 61821,
 'Metabolism': 54868,
 'Signaling Molecules and Interaction': 2097,
 'Cellular Processes and Signaling;Signaling Molecules and Interaction': 1,
 'Replication and Repair': 46526,
 'Lipid Metabolism': 13502,
 'Genetic Information Processing': 49299,
 'Metabolism of Terpenoids and Polyketides': 8769,
 'Glycan Biosynthesis and Metabolism': 13240,
 'Amino Acid Metabolism;Genetic Information Processing': 14,
 'Membrane Transport;Signaling Molecules and Interaction': 1,
 'Metabolism of Other Amino Acids': 4782,
 'Poorly Characterized': 61400,
 'Signal Transduction': 16494,
 'Transcription': 36173,
 'Genetic Information Processing;Replication and Repair': 19,
 'Metabolism of Cofactors and Vitamins': 26134,
 'Carbohydrate Metabolism;Metabolism': 278,
 'Translation': 46951,
 'Nucl

In [None]:
# pd.Series(mydictionary)
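
The dictionary-based count built in the loop above can also be obtained directly in pandas. The following is a minimal sketch on a small made-up Series (`toy` is hypothetical and stands in for `df['KEGG Functional Categories']`); it shows that `pd.Series(dict)` and `Series.value_counts()` give the same counts as the manual loop.

```python
import pandas as pd

# Hypothetical toy data, not taken from the IGC table
toy = pd.Series(['unknown', 'Translation', 'unknown', 'Metabolism', 'unknown'])

# Manual count with a dictionary, as in the loop above
counts = {}
for category in toy:
    counts[category] = counts.get(category, 0) + 1

# A dictionary converts directly into a Series (keys become the index)
from_dict = pd.Series(counts).sort_values(ascending=False)

# value_counts() counts and sorts in a single call
from_pandas = toy.value_counts()

print(from_dict)
print(from_pandas)
```

For a column with more than a million rows, `value_counts()` is usually the more convenient choice because the result stays a Series that can be filtered or plotted right away.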

In [90]:
numbers = [1, 2, 2, 2, 4]

In [91]:
list(set(numbers))

[1, 2, 4]

### Measures of central tendency

In [92]:
pd.__version__

'1.4.0'

In [93]:
df['Gene Length'].sum()

1231795176

In [94]:
df['Gene Length'].min()

102

In [95]:
df['Gene Length'].max()

24615

In [96]:
df['Gene Length'].mean()

958.116782121306

In [97]:
df['Gene Length'].median()

819.0

In [98]:
df['Gene Length'].mode()

0    219
Name: Gene Length, dtype: int64

In [99]:
df['Gene Length'].std()

728.536548632577

In [100]:
df['Gene Length'].var()

530765.5026934672

In [101]:
df['Gene Length'].quantile([.25, .5, .75])

0.25     453.0
0.50     819.0
0.75    1257.0
Name: Gene Length, dtype: float64
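
The individual calls above can be grouped into a single summary. The sketch below uses a small hypothetical Series of lengths (not values from the IGC table) to compute several statistics with `Series.agg` and to derive the interquartile range from `quantile`. It also illustrates why the mean can sit well above the median when a few very long genes are present.

```python
import pandas as pd

# Hypothetical gene lengths with one extreme value
lengths = pd.Series([102, 219, 219, 453, 819, 1257, 24615], name='Gene Length')

# Several statistics in one call; the result is a Series indexed by statistic name
summary = lengths.agg(['min', 'max', 'mean', 'median', 'std'])

# Interquartile range from the 25th and 75th percentiles
q1, q3 = lengths.quantile([.25, .75])
iqr = q3 - q1

print(summary)
print('IQR:', iqr)
```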

**Functions and attributes used today**

* `df.shape`
* `df.dtypes`
* `df.info()`
* `df.head()`
* `df.tail()`
* `df.describe()`
* `df.T`
* `df.index`
* `df.columns`
* `df.set_index()`
* `df.sort_index()`
* `df.sort_values()`
* `df.to_csv()`
* `df.sum()`
* `df.min()`
* `df.max()`
* `df.mean()`
* `df.median()`
* `df.mode()`
* `df.std()`
* `df.var()`
* `df.quantile()`

## Filtering rows by column values

In [102]:
df.columns

Index(['Gene Name', 'Gene Length', 'Gene Completeness', 'Cohort Origin',
       'Taxonomic Annotation(Phylum Level)',
       'Taxonomic Annotation(Genus Level)', 'KEGG Annotation',
       'eggNOG Annotation', 'Sample Occurence Frequency',
       'Individual Occurence Frequency', 'KEGG Functional Categories',
       'eggNOG Functional Categories', 'Cohort Assembled'],
      dtype='object')

In [103]:
df['Cohort Origin'].head()

0     SP
1     SP
2    CHN
3     SP
4    EUR
Name: Cohort Origin, dtype: object

In [104]:
df['Cohort Origin'] == 'EUR'

0          False
1          False
2          False
3          False
4           True
           ...  
1285637    False
1285638    False
1285639    False
1285640    False
1285641    False
Name: Cohort Origin, Length: 1285642, dtype: bool
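
The comparison above returns a boolean Series (a *mask*) with one value per row. As a minimal sketch on a hypothetical four-row table (`toy` and its gene names are made up for illustration), the mask can be counted, used to keep rows, or inverted:

```python
import pandas as pd

# Hypothetical miniature of the 'Cohort Origin' column
toy = pd.DataFrame({
    'Gene Name': ['g1', 'g2', 'g3', 'g4'],
    'Cohort Origin': ['SP', 'EUR', 'CHN', 'EUR'],
})

mask = toy['Cohort Origin'] == 'EUR'  # boolean Series, one value per row

n_eur = mask.sum()    # True counts as 1, so the sum is the number of matches
eur = toy[mask]       # keeps only the rows where the mask is True
not_eur = toy[~mask]  # ~ inverts the mask

print(n_eur)
print(eur)
print(not_eur)
```

Note that the filtered tables keep their original row labels, which is why `reset_index(drop=True)` is often applied afterwards.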

In [105]:
df[df['Cohort Origin'] == 'EUR']

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
4,MH0427_GL0087973,20775,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA
6,MH0373_GL0094743,20511,Complete,EUR,Firmicutes,Coprobacillus,unknown,unknown,0.332281,0.338318,unknown,unknown,EUR;CHN;USA
8,MH0315_GL0002178,18975,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.043410,0.047664,Enzyme Families,Function unknown,EUR;CHN;USA
12,V1.FI08_GL0038488,18753,Complete,EUR,Proteobacteria,unknown,unknown,bactNOG98109,0.407261,0.388785,unknown,Function unknown,EUR;CHN;USA
13,V1.CD20-0_GL0037007,18738,Complete,EUR,Proteobacteria,unknown,unknown,NOG12793,0.421468,0.402804,unknown,Function unknown,EUR;CHN;USA
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1285323,V1.UC55-4_GL0094986,102,Complete,EUR,Firmicutes,Ruminococcus,unknown,unknown,0.000789,0.000000,unknown,unknown,EUR
1285324,V1.UC55-4_GL0242705,102,Complete,EUR,Firmicutes,Roseburia,unknown,unknown,0.465667,0.467290,unknown,unknown,EUR
1285325,V1.UC58-0_GL0103882,102,Complete,EUR,Firmicutes,Subdoligranulum,unknown,unknown,0.037096,0.038318,unknown,unknown,EUR
1285326,V1.UC60-0_GL0143314,102,Complete,EUR,Firmicutes,Faecalibacterium,unknown,unknown,0.006314,0.006542,unknown,unknown,EUR


In [106]:
df[df['Cohort Origin'] == 'EUR'].reset_index(drop=True)

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
0,MH0427_GL0087973,20775,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA
1,MH0373_GL0094743,20511,Complete,EUR,Firmicutes,Coprobacillus,unknown,unknown,0.332281,0.338318,unknown,unknown,EUR;CHN;USA
2,MH0315_GL0002178,18975,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.043410,0.047664,Enzyme Families,Function unknown,EUR;CHN;USA
3,V1.FI08_GL0038488,18753,Complete,EUR,Proteobacteria,unknown,unknown,bactNOG98109,0.407261,0.388785,unknown,Function unknown,EUR;CHN;USA
4,V1.CD20-0_GL0037007,18738,Complete,EUR,Proteobacteria,unknown,unknown,NOG12793,0.421468,0.402804,unknown,Function unknown,EUR;CHN;USA
...,...,...,...,...,...,...,...,...,...,...,...,...,...
797743,V1.UC55-4_GL0094986,102,Complete,EUR,Firmicutes,Ruminococcus,unknown,unknown,0.000789,0.000000,unknown,unknown,EUR
797744,V1.UC55-4_GL0242705,102,Complete,EUR,Firmicutes,Roseburia,unknown,unknown,0.465667,0.467290,unknown,unknown,EUR
797745,V1.UC58-0_GL0103882,102,Complete,EUR,Firmicutes,Subdoligranulum,unknown,unknown,0.037096,0.038318,unknown,unknown,EUR
797746,V1.UC60-0_GL0143314,102,Complete,EUR,Firmicutes,Faecalibacterium,unknown,unknown,0.006314,0.006542,unknown,unknown,EUR


In [107]:
df[df['Gene Completeness'] == 'Complete']

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
0,911104.WcibK1_010100007220,24615,Complete,SP,Firmicutes,Weissella,unknown,NOG12793,0.066298,0.072897,unknown,Function unknown,EUR
1,585054.EFER_0542,21669,Complete,SP,Proteobacteria,Escherichia,unknown,NOG12793,0.058406,0.062617,unknown,Function unknown,EUR;CHN
2,SZEY-48A_GL0052647,20778,Complete,CHN,Proteobacteria,Escherichia,K01317,NOG12793,0.203631,0.212150,Enzyme Families,Function unknown,EUR;CHN
3,1048689.ECO55CA74_02930,20778,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.095501,0.096262,Enzyme Families,Function unknown,EUR;CHN
4,MH0427_GL0087973,20775,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1285637,665954.HMPREF1017_02554,102,Complete,SP,Bacteroidetes,Bacteroides,unknown,unknown,0.089187,0.077570,unknown,unknown,
1285638,753642.ECNC101_09444,102,Complete,SP,Proteobacteria,Escherichia,unknown,unknown,0.001579,0.000935,unknown,unknown,
1285639,768724.HMPREF9131_0965,102,Complete,SP,Firmicutes,Peptoniphilus,unknown,unknown,0.000789,0.000935,unknown,unknown,
1285640,866771.HMPREF9296_1721,102,Complete,SP,Bacteroidetes,Prevotella,unknown,unknown,0.000789,0.000935,unknown,unknown,


In [108]:
df.columns

Index(['Gene Name', 'Gene Length', 'Gene Completeness', 'Cohort Origin',
       'Taxonomic Annotation(Phylum Level)',
       'Taxonomic Annotation(Genus Level)', 'KEGG Annotation',
       'eggNOG Annotation', 'Sample Occurence Frequency',
       'Individual Occurence Frequency', 'KEGG Functional Categories',
       'eggNOG Functional Categories', 'Cohort Assembled'],
      dtype='object')

In [109]:
df.query('`Gene Completeness` == "Complete" & `Cohort Origin` == "EUR"' )

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
4,MH0427_GL0087973,20775,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA
6,MH0373_GL0094743,20511,Complete,EUR,Firmicutes,Coprobacillus,unknown,unknown,0.332281,0.338318,unknown,unknown,EUR;CHN;USA
8,MH0315_GL0002178,18975,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.043410,0.047664,Enzyme Families,Function unknown,EUR;CHN;USA
12,V1.FI08_GL0038488,18753,Complete,EUR,Proteobacteria,unknown,unknown,bactNOG98109,0.407261,0.388785,unknown,Function unknown,EUR;CHN;USA
13,V1.CD20-0_GL0037007,18738,Complete,EUR,Proteobacteria,unknown,unknown,NOG12793,0.421468,0.402804,unknown,Function unknown,EUR;CHN;USA
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1285323,V1.UC55-4_GL0094986,102,Complete,EUR,Firmicutes,Ruminococcus,unknown,unknown,0.000789,0.000000,unknown,unknown,EUR
1285324,V1.UC55-4_GL0242705,102,Complete,EUR,Firmicutes,Roseburia,unknown,unknown,0.465667,0.467290,unknown,unknown,EUR
1285325,V1.UC58-0_GL0103882,102,Complete,EUR,Firmicutes,Subdoligranulum,unknown,unknown,0.037096,0.038318,unknown,unknown,EUR
1285326,V1.UC60-0_GL0143314,102,Complete,EUR,Firmicutes,Faecalibacterium,unknown,unknown,0.006314,0.006542,unknown,unknown,EUR
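
`df.query()` and boolean indexing are two ways of writing the same filter. The sketch below, on a hypothetical toy table (the value `'Incomplete'` is invented for illustration), checks that both forms return the same rows. Column names containing spaces must be wrapped in backticks inside `query`, and in plain boolean indexing each comparison needs its own parentheses because `&` binds more tightly than `==` in Python.

```python
import pandas as pd

# Hypothetical toy table; 'Incomplete' is a made-up category
toy = pd.DataFrame({
    'Gene Completeness': ['Complete', 'Incomplete', 'Complete', 'Complete'],
    'Cohort Origin': ['EUR', 'EUR', 'CHN', 'EUR'],
})

# Boolean indexing: one parenthesised comparison per condition
by_mask = toy[(toy['Gene Completeness'] == 'Complete')
              & (toy['Cohort Origin'] == 'EUR')]

# The same filter as a query string, with backticks around names that contain spaces
by_query = toy.query('`Gene Completeness` == "Complete" and `Cohort Origin` == "EUR"')

print(by_mask.equals(by_query))
```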


In [110]:
df[(df["Taxonomic Annotation(Genus Level)"] == 'Salmonella') 
    & (df['Cohort Origin'] == 'EUR')
    & (df['KEGG Functional Categories'] == "Translation")]

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
2713,MH0025_GL0055549,4959,Complete,EUR,Proteobacteria,Salmonella,K14326,COG1112,0.03236,0.037383,Translation,"Replication, recombination and repair",EUR;CHN
1020000,MH0008_GL0018223,393,Complete,EUR,Proteobacteria,Salmonella,K02996,COG0103,0.628256,0.649533,Translation,"Translation, ribosomal structure and biogenesis",EUR;CHN;USA
1026977,V1.CD3-0-PT_GL0051806,387,Complete,EUR,Proteobacteria,Salmonella,K02994,COG0096,0.08603,0.086916,Translation,"Translation, ribosomal structure and biogenesis",EUR
1116958,V1.CD3-3-PN_GL0019322,294,Complete,EUR,Proteobacteria,Salmonella,K07574,COG1534,0.058406,0.057009,Translation,"Translation, ribosomal structure and biogenesis",EUR
1131123,V1.CD7-0-PN_GL0013326,279,Complete,EUR,Proteobacteria,Salmonella,K02965,COG0185,0.022099,0.018692,Translation,"Translation, ribosomal structure and biogenesis",EUR


In [111]:
df[df['Taxonomic Annotation(Genus Level)'].str.contains('Salmonella')]

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
1270,MH0008_GL0021722,6342,Complete,EUR,Proteobacteria,Salmonella,K06877,COG1205,0.063931,0.071028,Poorly Characterized,General function prediction only,EUR;CHN
2713,MH0025_GL0055549,4959,Complete,EUR,Proteobacteria,Salmonella,K14326,COG1112,0.032360,0.037383,Translation,"Replication, recombination and repair",EUR;CHN
2807,MH0008_GL0021723,4914,Complete,EUR,Proteobacteria,Salmonella,K00571,COG1002,0.035517,0.040187,Genetic Information Processing,Defense mechanisms,EUR
3930,T2D-77A_GL0017130,4518,Complete,CHN,Proteobacteria,Salmonella,K06877,COG1205,0.030781,0.033645,Poorly Characterized,General function prediction only,CHN
6668,V1.CD3-0-PT_GL0053841,4029,Complete,EUR,Proteobacteria,Salmonella,K03043,COG0085,0.024467,0.018692,Nucleotide Metabolism,Transcription,EUR;CHN
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1276755,O2.UC24-2_GL0128465,120,Complete,EUR,Proteobacteria,Salmonella,unknown,unknown,0.029992,0.030841,unknown,unknown,EUR;CHN;USA
1276963,NLM027_GL0023741,120,Complete,CHN,Proteobacteria,Salmonella,unknown,NOG81196,0.001579,0.001869,unknown,Function unknown,CHN
1278878,MH0095_GL0092146,117,Complete,EUR,Proteobacteria,Salmonella,K07497,COG2801,0.005525,0.006542,Genetic Information Processing,"Replication, recombination and repair",EUR
1279465,T2D-66A_GL0073432,117,Complete,CHN,Proteobacteria,Salmonella,unknown,unknown,0.380426,0.402804,unknown,unknown,CHN;USA


In [112]:
df[df['Taxonomic Annotation(Genus Level)'].str.contains('sAlmonELLa', case=False)]

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
1270,MH0008_GL0021722,6342,Complete,EUR,Proteobacteria,Salmonella,K06877,COG1205,0.063931,0.071028,Poorly Characterized,General function prediction only,EUR;CHN
2713,MH0025_GL0055549,4959,Complete,EUR,Proteobacteria,Salmonella,K14326,COG1112,0.032360,0.037383,Translation,"Replication, recombination and repair",EUR;CHN
2807,MH0008_GL0021723,4914,Complete,EUR,Proteobacteria,Salmonella,K00571,COG1002,0.035517,0.040187,Genetic Information Processing,Defense mechanisms,EUR
3930,T2D-77A_GL0017130,4518,Complete,CHN,Proteobacteria,Salmonella,K06877,COG1205,0.030781,0.033645,Poorly Characterized,General function prediction only,CHN
6668,V1.CD3-0-PT_GL0053841,4029,Complete,EUR,Proteobacteria,Salmonella,K03043,COG0085,0.024467,0.018692,Nucleotide Metabolism,Transcription,EUR;CHN
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1276755,O2.UC24-2_GL0128465,120,Complete,EUR,Proteobacteria,Salmonella,unknown,unknown,0.029992,0.030841,unknown,unknown,EUR;CHN;USA
1276963,NLM027_GL0023741,120,Complete,CHN,Proteobacteria,Salmonella,unknown,NOG81196,0.001579,0.001869,unknown,Function unknown,CHN
1278878,MH0095_GL0092146,117,Complete,EUR,Proteobacteria,Salmonella,K07497,COG2801,0.005525,0.006542,Genetic Information Processing,"Replication, recombination and repair",EUR
1279465,T2D-66A_GL0073432,117,Complete,CHN,Proteobacteria,Salmonella,unknown,unknown,0.380426,0.402804,unknown,unknown,CHN;USA


In [113]:
df[df['Taxonomic Annotation(Genus Level)'].str.contains('Salmonella')].shape

(935, 13)

In [114]:
df[df['Taxonomic Annotation(Genus Level)'].str.contains('Escherichia')].shape

(34019, 13)

In [115]:
'|'.join(['Salmonella', 'Escherichia'])

'Salmonella|Escherichia'

In [116]:
df[df['Taxonomic Annotation(Genus Level)'].str.contains('Salmonella|Escherichia')]

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
1,585054.EFER_0542,21669,Complete,SP,Proteobacteria,Escherichia,unknown,NOG12793,0.058406,0.062617,unknown,Function unknown,EUR;CHN
2,SZEY-48A_GL0052647,20778,Complete,CHN,Proteobacteria,Escherichia,K01317,NOG12793,0.203631,0.212150,Enzyme Families,Function unknown,EUR;CHN
3,1048689.ECO55CA74_02930,20778,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.095501,0.096262,Enzyme Families,Function unknown,EUR;CHN
4,MH0427_GL0087973,20775,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA
5,754082.ECSTEC7V_0573,20775,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.068666,0.073832,Enzyme Families,Function unknown,EUR;CHN
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1285627,656444.ECNG_05213,102,Complete,SP,Proteobacteria,Escherichia,unknown,unknown,0.000789,0.000935,unknown,unknown,
1285628,656444.ECNG_04574,102,Complete,SP,Proteobacteria,Escherichia,unknown,unknown,0.042620,0.049533,unknown,unknown,
1285629,656444.ECNG_05118,102,Complete,SP,Proteobacteria,Escherichia,unknown,COG1475,0.000000,0.000000,unknown,Transcription,
1285638,753642.ECNC101_09444,102,Complete,SP,Proteobacteria,Escherichia,unknown,unknown,0.001579,0.000935,unknown,unknown,


In [117]:
df[df['Taxonomic Annotation(Genus Level)'].str.contains('|'.join(['Salmonella', 'Escherichia']))]

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
1,585054.EFER_0542,21669,Complete,SP,Proteobacteria,Escherichia,unknown,NOG12793,0.058406,0.062617,unknown,Function unknown,EUR;CHN
2,SZEY-48A_GL0052647,20778,Complete,CHN,Proteobacteria,Escherichia,K01317,NOG12793,0.203631,0.212150,Enzyme Families,Function unknown,EUR;CHN
3,1048689.ECO55CA74_02930,20778,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.095501,0.096262,Enzyme Families,Function unknown,EUR;CHN
4,MH0427_GL0087973,20775,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA
5,754082.ECSTEC7V_0573,20775,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.068666,0.073832,Enzyme Families,Function unknown,EUR;CHN
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1285627,656444.ECNG_05213,102,Complete,SP,Proteobacteria,Escherichia,unknown,unknown,0.000789,0.000935,unknown,unknown,
1285628,656444.ECNG_04574,102,Complete,SP,Proteobacteria,Escherichia,unknown,unknown,0.042620,0.049533,unknown,unknown,
1285629,656444.ECNG_05118,102,Complete,SP,Proteobacteria,Escherichia,unknown,COG1475,0.000000,0.000000,unknown,Transcription,
1285638,753642.ECNC101_09444,102,Complete,SP,Proteobacteria,Escherichia,unknown,unknown,0.001579,0.000935,unknown,unknown,


In [118]:
df[df['Taxonomic Annotation(Genus Level)'].str.contains('|'.join(['Salmonella', 'Escherichia']))].shape

(34954, 13)

In [119]:
bacterias_patog = ['Salmonella', 'Escherichia']
df[df['Taxonomic Annotation(Genus Level)'].isin(bacterias_patog)]

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
1,585054.EFER_0542,21669,Complete,SP,Proteobacteria,Escherichia,unknown,NOG12793,0.058406,0.062617,unknown,Function unknown,EUR;CHN
2,SZEY-48A_GL0052647,20778,Complete,CHN,Proteobacteria,Escherichia,K01317,NOG12793,0.203631,0.212150,Enzyme Families,Function unknown,EUR;CHN
3,1048689.ECO55CA74_02930,20778,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.095501,0.096262,Enzyme Families,Function unknown,EUR;CHN
4,MH0427_GL0087973,20775,Complete,EUR,Proteobacteria,Escherichia,K01317,NOG12793,0.098658,0.104673,Enzyme Families,Function unknown,EUR;CHN;USA
5,754082.ECSTEC7V_0573,20775,Complete,SP,Proteobacteria,Escherichia,K01317,NOG12793,0.068666,0.073832,Enzyme Families,Function unknown,EUR;CHN
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1285627,656444.ECNG_05213,102,Complete,SP,Proteobacteria,Escherichia,unknown,unknown,0.000789,0.000935,unknown,unknown,
1285628,656444.ECNG_04574,102,Complete,SP,Proteobacteria,Escherichia,unknown,unknown,0.042620,0.049533,unknown,unknown,
1285629,656444.ECNG_05118,102,Complete,SP,Proteobacteria,Escherichia,unknown,COG1475,0.000000,0.000000,unknown,Transcription,
1285638,753642.ECNC101_09444,102,Complete,SP,Proteobacteria,Escherichia,unknown,unknown,0.001579,0.000935,unknown,unknown,
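
`str.contains` and `isin` are not interchangeable: `str.contains` treats its argument as a regular expression by default and matches substrings, while `isin` only matches whole values. The sketch below uses a hypothetical Series (the label `'Escherichia/Shigella'` is invented to show the difference). The outputs above do not show the row count for `isin`, so whether both approaches agree on this dataset is not confirmed here.

```python
import pandas as pd

# Hypothetical genus labels; 'Escherichia/Shigella' is made up for illustration
genus = pd.Series(['Salmonella', 'Escherichia', 'Escherichia/Shigella', 'unknown'])

names = ['Salmonella', 'Escherichia']

# Regex alternation: True wherever either name appears anywhere in the value
contains_mask = genus.str.contains('|'.join(names))

# Exact membership: True only when the whole value equals one of the names
isin_mask = genus.isin(names)

# With regex=False the pattern is searched literally, '|' included
literal_mask = genus.str.contains('|'.join(names), regex=False)

print(contains_mask.tolist())
print(isin_mask.tolist())
print(literal_mask.tolist())
```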


**Method chaining is a great power that comes with great responsibility.**

In [120]:
(df.loc[((df['Gene Length'] >= 15000) & 
         (df['Gene Completeness'] == 'Complete') &  # exact string match
         (df['KEGG Annotation'] != 'unknown')), 
       ['Gene Name', 'KEGG Annotation', 'Gene Length', 
        'KEGG Functional Categories']]
    .sort_values('Gene Length')
    .set_index('KEGG Annotation'))

Unnamed: 0_level_0,Gene Name,Gene Length,KEGG Functional Categories
KEGG Annotation,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
K12548,457398.HMPREF0326_01805,15099,Cellular Processes and Signaling
K01179,360104.CCC13826_0899,15243,Carbohydrate Metabolism
K01317,386585.ECs0542,15876,Enzyme Families
K01447,556261.HMPREF0240_03368,16590,Cellular Processes and Signaling
K00599,MH0244_GL0138579,16611,Amino Acid Metabolism
K14194,411490.ANACAC_02896,16887,Infectious Diseases
K01779,684738.LLKF_1213,17136,Amino Acid Metabolism
K01179,MH0185_GL0087605,17184,Carbohydrate Metabolism
K00599,ED50A_GL0021676,17310,Amino Acid Metabolism
K15125,591001.Acfer_0201,17715,Infectious Diseases
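
A long chain can be hard to debug, so it can help to write it once with named intermediate steps and confirm that both versions agree. This is a sketch on a hypothetical toy table (gene names and KEGG identifiers are made up), not the notebook's actual result:

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    'Gene Name': ['g1', 'g2', 'g3', 'g4'],
    'Gene Length': [16000, 900, 15500, 20000],
    'KEGG Annotation': ['K2', 'K1', 'unknown', 'K3'],
})

# One chained expression: filter rows, pick columns, sort, re-index
chained = (toy.loc[(toy['Gene Length'] >= 15000) &
                   (toy['KEGG Annotation'] != 'unknown'),
                   ['Gene Name', 'KEGG Annotation', 'Gene Length']]
              .sort_values('Gene Length')
              .set_index('KEGG Annotation'))

# The same steps with named intermediates, easier to inspect one at a time
long_annotated = (toy['Gene Length'] >= 15000) & (toy['KEGG Annotation'] != 'unknown')
selected = toy.loc[long_annotated, ['Gene Name', 'KEGG Annotation', 'Gene Length']]
ordered = selected.sort_values('Gene Length')
stepwise = ordered.set_index('KEGG Annotation')

print(chained.equals(stepwise))
```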


## Selecting fields in *DataFrames*

Selecting fields in DataFrames (*slicing*) can use three different syntaxes:
- `df[]`
- `df.loc[]`
- `df.iloc[]`

`loc` stands for *location* (label-based) and `iloc` for *integer location* (position-based).

|Operation| `df[]` | `df.loc[]`| `df.iloc[]`|
| :--- | :--- | :--- |:---|
|Select a single column by label|`df['A']`|`df.loc[:, 'A']`| `-` |
|Select a list of columns by label|`df[['A', 'C']]`|`df.loc[:, ['A', 'C']]`|`-`|
|Slice columns by label|`-`|`df.loc[:, 'A':'C']`|`-`|
|Select a single column by position|`-`|`-`|`df.iloc[:, 1]`|
|Select a list of columns by position|`-`|`-`|`df.iloc[:, [0, 2]]`|
|Slice columns by position|`-`|`-`|`df.iloc[:, 0:2]`|
|Select a single row by label|`-`|`df.loc['a']*`|`-`|
|Select a list of rows by label|`-`|`df.loc[['a', 'b']]*`|`-`|
|Slice rows by label|`df['a':'d']*`|`df.loc['b':'d']*`|`-`|
|Select a single row by position|`-`|`-`|`df.iloc[1]`|
|Select a list of rows by position|`-`|`-`|`df.iloc[[1, 3]]`|
|Slice rows by position|`df[1:4]`|`-`|`df.iloc[1:4]`|
|Select a list of rows and columns by label|`-`|`df.loc[['b', 'c'], ['A', 'C']]*`|`-`|
|Select a list of rows and columns by position|`-`|`-`|`df.iloc[[1, 3], [2, 1]]`|
|Slice rows and columns by label|`-`|`df.loc['b':'c', 'A':'C']*`|`-`|
|Slice rows and columns by position|`-`|`-`|`df.iloc[1:3, 0:2]`|

**\*** The row index must hold *string* labels.
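
To make the table concrete, here is a minimal sketch on a hypothetical 4x3 DataFrame with string row labels (`toy`, with columns `A` to `C` and rows `a` to `d`, is made up to mirror the names used in the table). The main point is that label slices include both endpoints, while position slices exclude the stop position.

```python
import pandas as pd

# Hypothetical table with string row labels, mirroring the names in the table
toy = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]},
    index=['a', 'b', 'c', 'd'],
)

by_label = toy.loc['b':'c', 'A':'C']  # label slices include both endpoints
by_position = toy.iloc[1:3, 0:2]      # position slices exclude the stop

single_row = toy.loc['a']             # one row comes back as a Series
rows_by_position = toy[1:4]           # plain [] with a slice selects rows

print(by_label)
print(by_position)
```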

So far we have selected columns using the column index, that is, the column labels.

In [121]:
df['Gene Name'].head()

0    911104.WcibK1_010100007220
1              585054.EFER_0542
2            SZEY-48A_GL0052647
3       1048689.ECO55CA74_02930
4              MH0427_GL0087973
Name: Gene Name, dtype: object

In [122]:
df[['Gene Name', 'KEGG Annotation', 'Gene Length']].head(5)

Unnamed: 0,Gene Name,KEGG Annotation,Gene Length
0,911104.WcibK1_010100007220,unknown,24615
1,585054.EFER_0542,unknown,21669
2,SZEY-48A_GL0052647,K01317,20778
3,1048689.ECO55CA74_02930,K01317,20778
4,MH0427_GL0087973,K01317,20775


*Slicing* lets us select data based on a numeric index or on the column index.

In [123]:
df[0:3].head()

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
0,911104.WcibK1_010100007220,24615,Complete,SP,Firmicutes,Weissella,unknown,NOG12793,0.066298,0.072897,unknown,Function unknown,EUR
1,585054.EFER_0542,21669,Complete,SP,Proteobacteria,Escherichia,unknown,NOG12793,0.058406,0.062617,unknown,Function unknown,EUR;CHN
2,SZEY-48A_GL0052647,20778,Complete,CHN,Proteobacteria,Escherichia,K01317,NOG12793,0.203631,0.21215,Enzyme Families,Function unknown,EUR;CHN


In [124]:
df.loc[0:4, 'Gene Name']

0    911104.WcibK1_010100007220
1              585054.EFER_0542
2            SZEY-48A_GL0052647
3       1048689.ECO55CA74_02930
4              MH0427_GL0087973
Name: Gene Name, dtype: object

In [125]:
df.loc[:, 'Gene Name']

0          911104.WcibK1_010100007220
1                    585054.EFER_0542
2                  SZEY-48A_GL0052647
3             1048689.ECO55CA74_02930
4                    MH0427_GL0087973
                      ...            
1285637       665954.HMPREF1017_02554
1285638          753642.ECNC101_09444
1285639        768724.HMPREF9131_0965
1285640        866771.HMPREF9296_1721
1285641          926026.ECO5101_18612
Name: Gene Name, Length: 1285642, dtype: object

In [126]:
df.loc[:, 'Gene Name'].head()

0    911104.WcibK1_010100007220
1              585054.EFER_0542
2            SZEY-48A_GL0052647
3       1048689.ECO55CA74_02930
4              MH0427_GL0087973
Name: Gene Name, dtype: object

In [127]:
df.loc[:, 'Gene Name':'Cohort Origin']

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin
0,911104.WcibK1_010100007220,24615,Complete,SP
1,585054.EFER_0542,21669,Complete,SP
2,SZEY-48A_GL0052647,20778,Complete,CHN
3,1048689.ECO55CA74_02930,20778,Complete,SP
4,MH0427_GL0087973,20775,Complete,EUR
...,...,...,...,...
1285637,665954.HMPREF1017_02554,102,Complete,SP
1285638,753642.ECNC101_09444,102,Complete,SP
1285639,768724.HMPREF9131_0965,102,Complete,SP
1285640,866771.HMPREF9296_1721,102,Complete,SP


In [128]:
df.iloc[:, 0:4]

Unnamed: 0,Gene Name,Gene Length,Gene Completeness,Cohort Origin
0,911104.WcibK1_010100007220,24615,Complete,SP
1,585054.EFER_0542,21669,Complete,SP
2,SZEY-48A_GL0052647,20778,Complete,CHN
3,1048689.ECO55CA74_02930,20778,Complete,SP
4,MH0427_GL0087973,20775,Complete,EUR
...,...,...,...,...
1285637,665954.HMPREF1017_02554,102,Complete,SP
1285638,753642.ECNC101_09444,102,Complete,SP
1285639,768724.HMPREF9131_0965,102,Complete,SP
1285640,866771.HMPREF9296_1721,102,Complete,SP


In [129]:
df.iloc[:, [1, 3, 2]]

Unnamed: 0,Gene Length,Cohort Origin,Gene Completeness
0,24615,SP,Complete
1,21669,SP,Complete
2,20778,CHN,Complete
3,20778,SP,Complete
4,20775,EUR,Complete
...,...,...,...
1285637,102,SP,Complete
1285638,102,SP,Complete
1285639,102,SP,Complete
1285640,102,SP,Complete


In [130]:
df.iloc[0:4, 1:5]

Unnamed: 0,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level)
0,24615,Complete,SP,Firmicutes
1,21669,Complete,SP,Proteobacteria
2,20778,Complete,CHN,Proteobacteria
3,20778,Complete,SP,Proteobacteria


What if I want to get a range of columns for *slicing*?
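
One possible answer, as a sketch: slice by label with `loc`, or translate the labels into positions with `Index.get_loc` and slice with `iloc`. The toy table below is hypothetical; it only borrows a few column names from the dataset.

```python
import pandas as pd

# Hypothetical two-row table reusing some of the dataset's column names
toy = pd.DataFrame({
    'Gene Name': ['g1', 'g2'],
    'Gene Length': [300, 150],
    'Gene Completeness': ['Complete', 'Complete'],
    'Cohort Origin': ['SP', 'EUR'],
    'KEGG Annotation': ['unknown', 'K01317'],
})

# Label-based range: both endpoints are included
by_label = toy.loc[:, 'Gene Length':'Cohort Origin']

# Position-based range: look up the positions, then add 1 to include the last column
start = toy.columns.get_loc('Gene Length')
stop = toy.columns.get_loc('Cohort Origin')
by_position = toy.iloc[:, start:stop + 1]

print(by_label.equals(by_position))
```

`get_loc` returns a single integer only when the column name is unique, which this sketch assumes.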

## Pandas profiling

In [None]:
from pandas_profiling import ProfileReport  # newer releases are published as ydata_profiling

In [None]:
profile = ProfileReport(df, title="Explorando con Pandas", explorative=True)

In [None]:
profile.to_file("reporte_IGC.html")