# Pandas

*pandas* is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas has several advantages:
- It provides powerful tools for processing table-like data.
- It adds indexes and labels to 1-D and 2-D NumPy arrays.
- It makes missing data easy to handle.
- Pandas DataFrame objects can be used directly from R.
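
As a minimal sketch of the labeling and missing-data points above, the following wraps a small NumPy array in a DataFrame. The values, dates, and column names are made up for illustration and are not taken from any real station file.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly PM2.5 readings for two days; np.nan marks a missing value.
values = np.array([[18.0, 33.0, np.nan],
                   [28.0, np.nan, 29.0]])

# Wrap the 2-D array with row labels (dates) and column labels (hours).
df = pd.DataFrame(values,
                  index=["2011/01/01", "2011/01/02"],
                  columns=["X00", "X01", "X02"])

# Select a value by its labels instead of by integer position.
first_reading = df.loc["2011/01/01", "X00"]

# Count missing values per column, then compute column means.
# DataFrame.mean skips missing values by default.
missing_per_column = df.isna().sum()
column_means = df.mean()

print(df)
print(missing_per_column)
print(column_means)
```

Because the labels travel with the data, a selection such as `df.loc["2011/01/01", "X00"]` stays valid even if rows or columns are reordered.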

However, most meteorological data is not *table-like*, so we mainly take advantage of pandas's input/output capabilities.
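
To give a feel for that input/output side, here is a small sketch that writes a DataFrame to CSV text and reads it back. It uses an in-memory buffer rather than a file on disk, and the table contents are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row table standing in for a station record.
df = pd.DataFrame({"date": ["2011/01/01", "2011/01/02"],
                   "X00": [18.0, 28.0],
                   "X01": [33.0, 21.0]})

# Write the table as CSV text into an in-memory buffer.
# index=False leaves out the automatic 0, 1, ... row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the same text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

The same two calls accept a file path in place of the buffer, which is how they are normally used.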

In [1]:
# Loading libraries
import numpy as np
import pandas as pd

*Comma-Separated Values* (CSV) is a very common format for table-like data. Let's see how we can read such a file into Python. The file *../data/ShiLin.PM2.5.csv* contains PM2.5 data from the EPA's ShiLin station. Let's first try to read it with NumPy:

In [2]:
# Read data from a csv file
data = np.genfromtxt('../data/ShiLin.PM2.5.csv', delimiter=',', encoding='utf-8')
print("Data dimension: " + str(data.shape))
print("Show the first 5 rows:")
print(data[:5])

Data dimension: (1827, 25)
Show the first 5 rows:
[[        nan         nan         nan         nan         nan         nan
          nan         nan         nan         nan         nan         nan
          nan         nan         nan         nan         nan         nan
          nan         nan         nan         nan         nan         nan
          nan]
 [        nan 18.         33.         29.         41.         31.
  10.         13.         20.         16.         10.         23.
  16.          4.         12.         27.         31.         15.
  18.         33.         21.         17.         30.         27.
  30.        ]
 [        nan 28.         21.         29.         19.         15.
  30.         33.         23.         16.         18.         32.
  25.         16.         32.         25.         14.         22.
  25.         24.         18.         21.         32.         17.
  10.        ]
 [        nan 19.          9.          9.         25.         17.
  24.         2

Next, we read the same file with pandas.

In [3]:
data = pd.read_csv('../data/ShiLin.PM2.5.csv', encoding='utf-8')
print("Data dimension: " + str(data.shape))
print("Show the first 5 rows:")
print(data[:5])

Data dimension: (1826, 25)
Show the first 5 rows:
         date   X00   X01   X02   X03   X04        X05   X06   X07        X08  \
0  2011/01/01  18.0  33.0  29.0  41.0  31.0  10.000000  13.0  20.0  16.000000   
1  2011/01/02  28.0  21.0  29.0  19.0  15.0  30.000000  33.0  23.0  16.000000   
2  2011/01/03  19.0   9.0   9.0  25.0  17.0  24.000000  26.0  14.0  24.947511   
3  2011/01/04  16.0  26.0  25.0  22.0  31.0  23.783136  28.0  26.0  28.000000   
4  2011/01/05  30.0  26.0  27.0  26.0  33.0  22.000000  17.0  24.0  18.000000   

   ...   X14   X15   X16   X17   X18   X19   X20   X21   X22   X23  
0  ...  27.0  31.0  15.0  18.0  33.0  21.0  17.0  30.0  27.0  30.0  
1  ...  25.0  14.0  22.0  25.0  24.0  18.0  21.0  32.0  17.0  10.0  
2  ...   6.0   4.0   4.0   9.0   5.0   0.0  10.0  35.0  24.0   5.0  
3  ...  39.0  49.0  49.0  40.0  44.0  28.0  30.0  30.0  23.0  36.0  
4  ...  13.0  21.0  27.0  11.0  22.0  22.0  31.0  27.0  12.0  13.0  

[5 rows x 25 columns]


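As you can see, `pandas.read_csv` not only gives cleaner code but also produces more accurate results. `np.genfromtxt` tries to convert every field to a number, so the header line and the date column come back as `nan` and the header is counted as an extra data row (1827 rows instead of 1826). `pandas.read_csv` uses the first line as column names and keeps the dates as text.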
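
The difference can be reproduced on a tiny in-memory sample. This is only a sketch: the three-line CSV text below is made up to mimic the layout of the station file, and the `parse_dates` and `index_col` arguments are optional extras that the cells above do not use.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical three-line sample in the same layout as the station file.
sample = "date,X00,X01\n2011/01/01,18,33\n2011/01/02,28,21\n"

# genfromtxt converts every field to float by default, so the header line
# and the date column, which are not numbers, come back as nan.
as_array = np.genfromtxt(io.StringIO(sample), delimiter=",")

# read_csv treats the first line as column names and infers a type per
# column; parse_dates and index_col turn the dates into a datetime row index.
as_frame = pd.read_csv(io.StringIO(sample), parse_dates=["date"], index_col="date")

print(as_array.shape)  # includes one extra row for the header
print(as_frame.shape)  # the date column has become the index
print(as_frame)
```

With the dates as the index, a single day can be selected by label, for example `as_frame.loc["2011-01-02"]`.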

Besides data input/output, pandas also provides many functions for data manipulation. We will discuss these functions in detail in later topics.