## Pandas Tutorial ##
by Briane Paul Samson

Welcome to our second tutorial in the COMET Data Science Workshops. This notebook walks you through basic pandas functions and operations that you can use in any data analysis or data science project.

**Overview**

pandas consists of the following things:

 - A set of labeled array data structures, the primary of which are Series and DataFrame 
 - Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing 
 - An integrated group by engine for aggregating and transforming data sets 
 - Date range generation (date_range) and custom date offsets enabling the implementation of customized frequencies
 - Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format. 
 - Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value) 
 - Moving window statistics (rolling mean, rolling standard deviation, etc.) 
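 - Static and moving window linear and panel regression (listed in the source overview for early pandas releases; this functionality has since been removed from the library)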

source: https://pandas.pydata.org/pandas-docs/stable/overview.html

Samples from: https://pandas.pydata.org/pandas-docs/stable/10min.html
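
To make a few of the overview items concrete, the short sketch below touches the group-by engine, `date_range`, and moving-window statistics. The data and the column names (`team`, `points`) are made up for illustration and do not come from the workshop dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
scores = pd.DataFrame({
    "team": ["red", "blue", "red", "blue"],
    "points": [10, 20, 30, 40],
})

# Group-by engine: total points per team
totals = scores.groupby("team")["points"].sum()
print(totals)

# Date range generation: four consecutive days
days = pd.date_range("2013-01-01", periods=4)

# Moving window statistics: 2-day rolling mean over a dated Series
daily = pd.Series([1.0, 3.0, 5.0, 7.0], index=days)
rolling_mean = daily.rolling(window=2).mean()
print(rolling_mean)
```

The first rolling value is NaN because a 2-day window is not yet full on the first day.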

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print("\n".join(sorted(os.listdir("../input"))))

# Any results you write to the current directory are saved as output.

Series and DataFrames
---------------------

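**Series** - 1D labeled, homogeneously-typed array

**DataFrame** - General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns

**Panel** - General 3D labeled, also size-mutable array (deprecated in pandas 0.20 and removed in 0.25; a DataFrame with a MultiIndex is the usual replacement)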
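
The difference between "homogeneously-typed" and "heterogeneously-typed" can be seen in a small sketch. The names `ages` and `people` and their contents are invented for illustration.

```python
import pandas as pd

# A Series holds values of one dtype under a labeled index
ages = pd.Series([31, 25, 47], index=["ana", "ben", "cy"])
print(ages.dtype)  # one integer dtype for the whole Series

# A DataFrame is a collection of columns; each column may have its own dtype
people = pd.DataFrame({
    "age": [31, 25, 47],
    "name": ["Ana", "Ben", "Cy"],
    "score": [88.5, 92.0, 79.25],
})
print(people.dtypes)

# Selecting one column of a DataFrame gives back a Series
age_column = people["age"]
print(type(age_column).__name__)
```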

In [None]:
s = pd.Series([1,3,5,np.nan,6,8])
s
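
A side effect of the `np.nan` entry is worth noting: the integers are stored as floats, and most reductions skip the missing value by default. A minimal sketch, using a separate name (`values`) so the notebook's `s` is left alone:

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 3, 5, np.nan, 6, 8])

print(values.dtype)         # float64, because NaN is a floating-point value
print(values.isna().sum())  # number of missing entries
print(values.mean())        # NaN is skipped by default (skipna=True)
print(values.dropna())      # copy without the missing entry
```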

In [None]:
dates = pd.date_range('20130101', periods=6)
dates

In [None]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
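
With a date index in place, rows can be picked by label or by position. The sketch below rebuilds a similar frame under the name `demo` with a seeded generator (an assumption made here so the numbers are repeatable) and compares `.loc` with `.iloc`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
rng = np.random.default_rng(0)  # seeded so the numbers are repeatable
demo = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

first_two_days = demo.loc["2013-01-01":"2013-01-02"]  # label slice, end inclusive
first_two_rows = demo.iloc[0:2]                        # position slice, end exclusive
one_value = demo.loc[dates[0], "A"]                    # single cell by row and column label

print(first_two_days.equals(first_two_rows))
```

Label slices include the end label while position slices exclude the end position, which is why both selections return the same two rows here.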

In [None]:
data_set = pd.read_csv("../input/migration_nz.csv")
data_set
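
The real file is only available inside the Kaggle environment, so the sketch below reads a small in-memory CSV instead. Only the `Country` column name is taken from this notebook; the other columns and all values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; only the "Country" column name
# comes from the notebook, everything else is invented
csv_text = """Country,Year,Value
Country A,2001,10
Country B,2001,
Country A,2002,30
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(sample.dtypes)  # Value becomes float64 because one entry is missing

# usecols and nrows limit what is loaded, which helps with large files
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Country", "Value"], nrows=2)
print(subset)
```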

In [None]:
data_set.dtypes

In [None]:
data_set.head()

In [None]:
data_set.head(20)

In [None]:
data_set.index

In [None]:
data_set.columns

In [None]:
data_set['Country']
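
Selecting with a single column name returns a Series, while a list of names returns a DataFrame. A small sketch on a made-up frame (`toy` and its values are hypothetical; only the `Country` column name comes from the notebook):

```python
import pandas as pd

# Hypothetical toy frame, invented for this illustration
toy = pd.DataFrame({
    "Country": ["A", "B", "A", "C"],
    "Value": [1, 2, 3, 4],
})

as_series = toy["Country"]              # single name: a Series
as_frame = toy[["Country"]]             # list of names: a one-column DataFrame
counts = toy["Country"].value_counts()  # how often each label appears

print(type(as_series).__name__, type(as_frame).__name__)
print(counts)
```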

In [None]:
data_set.values

In [None]:
data_set.describe()
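
By default `describe()` on a DataFrame with mixed columns summarizes only the numeric ones; passing `include="all"` adds the rest. The frame below is a hypothetical example, not the workshop data.

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column
toy = pd.DataFrame({
    "Country": ["A", "B", "A"],
    "Value": [10.0, 20.0, 30.0],
})

numeric_only = toy.describe()             # summarizes Value only
everything = toy.describe(include="all")  # adds count/unique/top/freq for Country

print(numeric_only)
print(everything)
```

In the combined table, statistics that do not apply to a column (for example `mean` for text) show up as NaN.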

In [None]:
s = pd.Series(['a', 'a', 'b', 'c'])  # categorical-style labels, stored as strings rather than pandas' category dtype
s.describe()
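
For text labels, `describe()` reports `count`, `unique`, `top` and `freq`. The sketch below reproduces each field by hand and also converts the labels to pandas' dedicated `category` dtype; the variable names are chosen here for illustration.

```python
import pandas as pd

letters = pd.Series(["a", "a", "b", "c"])
summary = letters.describe()

# The same four facts, computed directly
n_values = letters.count()            # non-missing entries
n_unique = letters.nunique()          # distinct labels
counts = letters.value_counts()       # occurrences per label
most_common = counts.idxmax()         # label reported as "top"

# Converting to the category dtype stores each distinct label once
as_category = letters.astype("category")

print(summary)
print(as_category.cat.categories)
```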

In [None]:
s = pd.Series([
    np.datetime64("2000-01-01"),
    np.datetime64("2010-01-01"),
    np.datetime64("2010-01-01"),
])
s

In [None]:
s.describe()
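
The fields that `describe()` reports for datetime data have changed between pandas versions, so the output of the cell above may differ from older tutorials. A version-independent sketch is to ask for the individual facts directly (the name `stamps` is chosen here for illustration):

```python
import numpy as np
import pandas as pd

stamps = pd.Series([
    np.datetime64("2000-01-01"),
    np.datetime64("2010-01-01"),
    np.datetime64("2010-01-01"),
])

earliest = stamps.min()           # first timestamp
latest = stamps.max()             # last timestamp
n_unique = stamps.nunique()       # number of distinct timestamps
most_common = stamps.mode().iloc[0]  # most frequent timestamp
span = latest - earliest          # a Timedelta

print(earliest, latest, n_unique, most_common, span)
```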