<a href="https://colab.research.google.com/github/restrepo/ComputationalMethods/blob/master/material/Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas

From http://pandas.pydata.org/pandas-docs/stable/

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

See also:

* https://github.com/restrepo/data-analysis
  * https://classroom.github.com/g/sSMBdBqN
  * https://classroom.github.com/a/PcbQBE7F
* https://github.com/restrepo/PythonTipsAndTricks

A good, practical book on what `Pandas` can do is:

[__Python for Data Analysis__](https://drive.google.com/open?id=0BxoOXsn2EUNIWExXbVc4SDN0YTQ)<br/>
Data Wrangling with Pandas, NumPy, and IPython<br/>
_By William McKinney_


This other book is about applications based on `Pandas`:
![image.png](https://covers.oreillystatic.com/images/0636920030515/cat.gif) [Introduction to Machine Learning with Python](https://drive.google.com/open?id=0BxoOXsn2EUNISGhrdEZ3S29fS3M)<br/>
A Guide for Data Scientists<br/>
_By Sarah Guido, Andreas Müller_

`Pandas` can be used in a similar way to `R`, which is based on similar data structures. `Pandas` can also replace graphical interfaces for working with spreadsheets such as Excel.

## Standard way to load the module

In [0]:
import pandas as pd

## Data structures

`Pandas` has two new data structures:
* `Series`, which are similar to dictionaries
* `DataFrame`, which are similar to NumPy arrays but with labeled rows and columns

Each row of a two-dimensional `DataFrame` is a `Series` whose keys are the column names, and each column is likewise a `Series` whose keys are the row indices.

An example of a `DataFrame` is a spreadsheet.
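
The relation between the two structures can be checked directly. The following is a minimal sketch with hypothetical toy data (`df_demo` and its values are not part of the original material): both a row and a column of a `DataFrame` come back as a `Series`, keyed by the column names and by the row indices respectively.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({'Name': ['Ana', 'Luis'], 'Age': [30, 41]})

row = df_demo.loc[0]    # first row: a Series keyed by the column names
col = df_demo['Age']    # one column: a Series keyed by the row indices

print(type(row).__name__, list(row.index))
print(type(col).__name__, list(col.index))
```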

### `Series`

A `Pandas` `Series` object can be initialized directly from a `Python` dictionary:

In [0]:
s=pd.Series({'Name':'Juan Valdez','Nacionality':'Colombia','Age':23})
s

and can be used like a dictionary:

In [0]:
s['Name']

'Juan Valdez'

but its keys can also be accessed as attributes, as in a namespace!

In [0]:
s.Name

'Juan Valdez'
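
Attribute access is convenient, but it only works when the key does not collide with an existing `Series` attribute. A short sketch, assuming a hypothetical key called `size` (which is also a built-in `Series` property):

```python
import pandas as pd

# Hypothetical Series with a key that clashes with the Series.size property
s_demo = pd.Series({'Name': 'Juan Valdez', 'size': 10})

by_key = s_demo['size']   # dictionary-style access returns the stored value
by_attr = s_demo.size     # attribute access returns the number of elements

print(by_key, by_attr)
```

In such cases the dictionary-style access `s_demo['size']` is the safe choice.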

### `DataFrame`

#### Initialization from a `Series`
We start with an empty `DataFrame`:

In [0]:
df=pd.DataFrame()
df

We can append a `Series` as a row of the `DataFrame`, provided that we always pass the option `ignore_index=True`:

In [0]:
df=df.append(s,ignore_index=True)
df

Unnamed: 0,Age,Nacionality,Name
0,23.0,Colombia,Juan Valdez
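
Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older versions. A possible equivalent with `pd.concat` is sketched below; the names `first`, `second` and `df_demo`, and the values of the second row, are hypothetical.

```python
import pandas as pd

first = pd.Series({'Name': 'Juan Valdez', 'Nacionality': 'Colombia', 'Age': 23})
# Hypothetical second row with one extra key
second = pd.Series({'Name': 'Ana Pérez', 'Nacionality': 'Colombia',
                    'Age': 35, 'Company': 'Acme'})

# to_frame() gives a one-column DataFrame; .T transposes it into a one-row DataFrame
rows = [first.to_frame().T, second.to_frame().T]

# ignore_index=True renumbers the rows 0, 1, ...; missing keys become NaN
df_demo = pd.concat(rows, ignore_index=True)
print(df_demo)
```

The column order and the data types of the result may differ from the output shown above, which was produced with an older pandas version.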


The `'Age'` column was stored as a float, so we can fix its data type. Each column is itself a `Series`:

In [0]:
type(df.Age)

pandas.core.series.Series

In [0]:
df['Age']=df.Age.astype(int)
df

Unnamed: 0,Age,Nacionality,Name
0,23,Colombia,Juan Valdez
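
The effect of `astype` can be verified through the `dtype` attribute. The sketch below uses hypothetical data; it also shows the nullable `'Int64'` type, which is one option when the column contains missing values, since a plain `astype(int)` raises an error on `NaN`.

```python
import pandas as pd

# Hypothetical column stored as floats
df_demo = pd.DataFrame({'Age': [23.0, 65.0]})
before = df_demo['Age'].dtype            # float64
df_demo['Age'] = df_demo['Age'].astype(int)
after = df_demo['Age'].dtype             # an integer dtype

# With missing values, the nullable integer dtype keeps them as <NA>
ages = pd.Series([23.0, float('nan')])
nullable = ages.astype('Int64')

print(before, after, nullable.dtype)
```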


To add a second row we build another `Series`:

In [0]:
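# dtype=object avoids the default-dtype warning for an empty Series
s=pd.Series(dtype=object)
# Ask for each field interactively and store it under its key
for k in ['Name','Nacionality','Age','Company']:
    var=input('{}:\n'.format(k))
    s[k]=var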

Name:
Álvaro Uribe
Nacionality:
Colombia
Age:
65
Company:
Senado


#### Exercises
* Display the resulting `Series` on the screen:

* Append it to the previous `DataFrame` and display the result:

* Fill the `NaN` values with empty strings

* Save the `Pandas` `DataFrame` as an Excel file

* Load a `Pandas` `DataFrame` from the saved Excel file

### Common operations on `DataFrames`
See https://github.com/restrepo/PythonTipsAndTricks

* __To fill a specific cell__

In [0]:
df.loc[0,'Company']='Federación de Caferos'

In [0]:
df

Unnamed: 0,Age,Nacionality,Name,Company
0,23.0,Colombia,Juan Valdez,Federación de Caferos
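
As the output shows, assigning through `.loc` creates the column when it does not exist yet. A minimal sketch with hypothetical data, which also reads the cell back with `.at` (the scalar counterpart of `.loc`):

```python
import pandas as pd

# Hypothetical two-row DataFrame without a 'Company' column
df_demo = pd.DataFrame({'Name': ['Ana', 'Luis']})

# Setting a cell in a missing column creates it; the other rows get NaN
df_demo.loc[0, 'Company'] = 'Acme'

value = df_demo.at[0, 'Company']   # fast access to a single cell
print(df_demo)
```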


## Other formats for saving and reading files

In [0]:
df.to_csv('hoja.csv',index=False)
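
The saved file can be read back with `pd.read_csv`. The sketch below does a full round trip with hypothetical data, writing to a temporary directory so that it leaves no files behind; the file name `demo.csv` is an arbitrary choice.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, used only for illustration
df_demo = pd.DataFrame({'Name': ['Ana', 'Luis'], 'Age': [30, 41]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.csv')
    # index=False leaves the row index out of the file
    df_demo.to_csv(path, index=False)
    df_back = pd.read_csv(path)

print(df_back)
```

Without `index=False` the row index is written as an extra unnamed column, which then shows up as `Unnamed: 0` when the file is read back.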

## Loading data from the cloud
See: https://github.com/kennethreitz/python-guide

In [0]:
%%writefile drive.cfg
[FILES]
CIB_Wos.xlsx                                = 0BxoOXsn2EUNIRjJkQ1VEamdJXzA

Writing drive.cfg


We follow the conventions of https://github.com/kennethreitz/python-guide

In [0]:
import os
import sys
# Make the local helper modules in ../input importable
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname('__file__'), '../input')))
from google_drive_tools import *

In [0]:
df=read_drive_excel('CIB_Wos.xlsx')

Check the size of the `DataFrame`:

In [0]:
df.shape

(415, 58)

In [0]:
df.sample()

Unnamed: 0,AB,AF,AU,BP,C1,CR,DE,DI,DT,EI,...,SU,CA,MA,PN,BE,BN,D2,SE,SP,HO
171,Background and aim: The involvement of Toll-li...,"Castiblanco, John\nVarela, Diana-Cristina\nCas...","Castiblanco, J\nVarela, DC\nCastano-Rodriguez,...",541\n,"[Castiblanco, John; Varela, Diana-Cristina; Ca...","Anaya JM, 2006, CLIN DEV IMMUNOL, V13, P185, D...",TIRAP; Mal; Tuberculosis; Systemic lupus eryth...,10.1016/j.meegid.2008.03.001\n,Article\n,,...,,,,,,,,,,


In [0]:
df=df.fillna('')
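
What `fillna('')` does can be seen on a small example. This is a sketch with hypothetical data (the column names `DE` and `SU` are borrowed from the table above, the values are made up): missing entries are counted with `isna()` and then replaced by empty strings.

```python
import pandas as pd

# Hypothetical DataFrame with two missing entries
df_demo = pd.DataFrame({'DE': ['keyword', None], 'SU': [None, 'other']})

n_missing = int(df_demo.isna().sum().sum())   # total number of missing cells
filled = df_demo.fillna('')                   # returns a new DataFrame

print(n_missing)
print(filled)
```

Since `fillna` returns a new object, the result has to be assigned back, as in `df=df.fillna('')`.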

In [0]:
df.sample()

Unnamed: 0,AB,AF,AU,BP,C1,CR,DE,DI,DT,EI,...,SU,CA,MA,PN,BE,BN,D2,SE,SP,HO
178,Background. Invasive fungal diseases are impor...,"De Pauw, Ben\nWalsh, Thomas J.\nDonnelly, J. P...","De Pauw, B\nWalsh, TJ\nDonnelly, JP\nStevens, ...",1813\n,"[Donnelly, J. Peter] Radboud Univ Nijmegen, Me...","Ascioglu S, 2002, CLIN INFECT DIS, V34, P7, DO...",,10.1086/588660\n,Article\n,,...,,,,,,,,,,


## Activities
See:
* https://github.com/ajcr/100-pandas-puzzles
* https://github.com/guipsamora/pandas_exercises

## Final remarks
With basic scripting and Pandas we already have a solid environment to analyse data. The other libraries are introduced as ways of extending the capabilities of Pandas.