<p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold"><br>
Pandas</p><br>

**pandas** is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python. 

**pandas** is built on top of **numpy** and works alongside libraries such as **scipy**, providing easy-to-use data structures and data manipulation functions with integrated indexing.
* Most numpy advantages still hold true
* pandas enables ingestion and manipulation of heterogeneous data types in an intuitive fashion
* helps combine large data sets via **merge** and **join**
* provides efficient tools to split data sets into groups, transform them, and recombine them
* provides basic plotting built on matplotlib
* handles time-series data effectively via native methods
* also has native methods for handling missing data, pivoting, sorting, descriptive statistics, quick plotting, and boolean indexing (useful for masking operations such as those in image processing)
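
As a rough illustration of the split/transform/recombine and merge ideas in the list above, here is a minimal sketch. The tables `sales` and `stores` and their contents are hypothetical, invented only for this example.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
sales = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "units": [3, 5, 2, 8],
})
stores = pd.DataFrame({
    "store": ["a", "b"],
    "city": ["Oslo", "Lima"],
})

# split the rows by store, sum each group, recombine into one row per store
totals = sales.groupby("store", as_index=False)["units"].sum()

# combine the two tables on their shared 'store' column
report = totals.merge(stores, on="store")
print(report)
```

`groupby` covers the split and recombine steps, and `merge` behaves much like a SQL join on the named column.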

The main data structures *pandas* provides are **Series** and **DataFrames**. After a brief introduction to these two data structures and data ingestion, the key features of *pandas* this notebook covers are:
* Generating descriptive statistics on data
* Data cleaning using built-in pandas functions
* Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
* Merging multiple datasets using DataFrames
* Working with timestamps and time-series data

pandas provides most of the data wrangling capabilities data scientists need, is actively maintained by its developer community, and continues to gain functionality.

**Additional Recommended Resources:**
* *pandas* Documentation: http://pandas.pydata.org/pandas-docs/stable/
* *Python for Data Analysis* by Wes McKinney
* *Python Data Science Handbook* by Jake VanderPlas

Let's get started with our first *pandas* notebook!

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Import Libraries
</p>

In [1]:
import pandas as pd

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Introduction to pandas Data Structures</p>
<br>
*pandas* has two main data structures: **Series** and **DataFrames**.

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
pandas Series</p>

A **Series** is a 1-dimensional labeled array that provides many ways to index data and supports many data types
* handles ints, floats, strings, Python objects, etc.
* can be passed to many NumPy functions because it behaves much like an array
* axis labels = **index** (like a fixed-size dictionary, but more *flexible*)
* can get and set values via index labels or integer positions
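
The points above about NumPy compatibility and dictionary-like access can be sketched as follows. The Series `prices` and its values are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# hypothetical data, invented for this example
prices = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# NumPy ufuncs accept a Series and keep its index labels
roots = np.sqrt(prices)

# get and set values by label, much like a dictionary
first = prices["a"]
prices["b"] = 16.0
prices["d"] = 25.0   # assigning to a new label adds an entry

print(roots)
print(prices)
```

Unlike a plain NumPy array, the result of `np.sqrt` is still a Series carrying the same labels.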

In [3]:
# similar to NumPy arrays, but we can define index labels via the 2nd list argument

ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser

tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object

In [4]:
ser.index

Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')

In [5]:
# get two entries by index label

ser.loc[['nancy','bob']]

nancy    300
bob      foo
dtype: object

In [6]:
# get entries by integer position
# (use .iloc; plain [] with integers on a labeled Series is deprecated in recent pandas)

ser.iloc[[4, 3, 1]]

eric    500
dan     bar
bob     foo
dtype: object

In [7]:
# get the single value at integer position 2

ser.iloc[2]

300
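
The difference between label-based and position-based access is easiest to see when the labels are themselves integers. A small sketch, using a hypothetical Series `s` that is not part of the notebook's data:

```python
import pandas as pd

# hypothetical Series whose integer labels do not match the positions
s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

by_label = s.loc[20]      # the entry whose label is 20
by_position = s.iloc[0]   # the first entry, whatever its label

print(by_label, by_position)
```

Using `.loc` for labels and `.iloc` for positions keeps the intent explicit in cases like this one.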

In [10]:
# check whether a label exists in the series index (this does not search the values)

'bob' in ser

True
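
Note that `in` tests the index labels rather than the stored values. A minimal sketch of the distinction, using a shortened copy of the series above (the variable names are chosen for this example only):

```python
import pandas as pd

small = pd.Series([100, 'foo'], index=['tom', 'bob'])

label_found = 'bob' in small                    # 'bob' is a label
value_as_label = 'foo' in small                 # 'foo' is a value, not a label
value_found = bool(small.isin(['foo']).any())   # search the values instead

print(label_found, value_as_label, value_found)
```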

In [11]:
ser

tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object

In [13]:
# double each value in the series
# ints are doubled arithmetically
# strings are repeated and concatenated ('foo' becomes 'foofoo')

ser * 2

tom         200
bob      foofoo
nancy       600
dan      barbar
eric       1000
dtype: object

In [14]:
# just square values for indices of 'nancy' and 'eric'

ser[['nancy', 'eric']] ** 2

nancy     90000
eric     250000
dtype: object
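
The squaring above works because the two selected entries happen to be ints, while the Series itself still has dtype `object`. If numeric behaviour is wanted for the whole Series, one possible approach (an assumption about what a reader might want, not a step from the original notebook) is to coerce it with `pd.to_numeric`:

```python
import pandas as pd

ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])

# errors="coerce" replaces values that cannot be parsed as numbers with NaN
numeric = pd.to_numeric(ser, errors="coerce")
squared = numeric ** 2

print(numeric.dtype)
print(squared)
```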

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
pandas DataFrame</p>

A **DataFrame** is a 2D, size-mutable, labeled data structure that supports heterogeneous data, with labeled axes for both rows and columns
* think of it as a container for Series objects that share one index (each column = a Series)
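
A quick way to see how a DataFrame relates to Series objects is to pull out one column and one row. The frame below is a hypothetical example with made-up labels:

```python
import pandas as pd

# hypothetical two-column frame, invented for this illustration
frame = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 4.0]}, index=["r1", "r2"])

col = frame["one"]      # a column is a Series indexed by the row labels
row = frame.loc["r1"]   # a row is also returned as a Series, indexed by the column labels

print(type(col).__name__, list(col.index))
print(type(row).__name__, list(row.index))
```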

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Create DataFrame from dictionary of Python Series</p>

In [16]:
# create a dictionary where each value for the key is a series
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
d

{'one': apple    100.0
 ball     200.0
 clock    300.0
 dtype: float64, 'two': apple      111.0
 ball       222.0
 cerill     333.0
 dancy     4444.0
 dtype: float64}

In [45]:
# turn that dictionary into a DataFrame (the row labels are unioned, like a FULL OUTER JOIN)

df = pd.DataFrame(d)
print(df)

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0
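
The join-like behaviour can be made explicit with `pd.concat`, which lets you choose between the union and the intersection of the row labels. This is a sketch of an equivalent construction, not the method the notebook itself uses:

```python
import pandas as pd

one = pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock'])
two = pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])

# join="outer" (the default) keeps every label and fills the gaps with NaN
outer = pd.concat([one, two], axis=1, keys=['one', 'two'])

# join="inner" keeps only the labels present in both Series
inner = pd.concat([one, two], axis=1, keys=['one', 'two'], join="inner")

print(outer)
print(inner)
```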


In [18]:
# get the indices of the DF
df.index

Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

In [19]:
# get the cols of the DF
df.columns

Index(['one', 'two'], dtype='object')

In [20]:
# create a dataframe from our dictionary but JUST for specified indices
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])

         one     two
dancy    NaN  4444.0
ball   200.0   222.0
apple  100.0   111.0


In [23]:
# create a dataframe from our dictionary but JUST for specified indices and specified cols
# col 'five' will be all NaN since our dictionary has no key 'five'

pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])

          two  five
dancy  4444.0   NaN
ball    222.0   NaN
apple   111.0   NaN


<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Create DataFrame from list of Python dictionaries</p>

In [27]:
# create a list of 2 dictionaries (1 w/ 2 KV-pairs, 1 w/ 3 KV-pairs)
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

# return that list as a dataframe (keys = cols; column order may differ across pandas versions)
# we did not provide index labels, so the list positions become the row labels
pd.DataFrame(data)

   alex  alice  dora  ema  joe
0   1.0    NaN   NaN  NaN  2.0
1   NaN   20.0  10.0  5.0  NaN


In [28]:
# provide index labels (row0 = orange, row1 = red)

pd.DataFrame(data, index = ['orange', 'red'])

        alex  alice  dora  ema  joe
orange   1.0    NaN   NaN  NaN  2.0
red      NaN   20.0  10.0  5.0  NaN


In [31]:
# subset the dataframe and only keep 3 specific columns

pd.DataFrame(data, columns=['joe', 'dora','alice'], index = ['orange','red'])

        joe  dora  alice
orange  2.0   NaN    NaN
red     NaN  10.0   20.0


<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Basic DataFrame operations</p>

In [33]:
# view original DF
df

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0


In [34]:
# get the column labeled 'one' (the first column)
df['one']

apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64

In [46]:
# create a new column holding 'one' multiplied by 'two' (NaN wherever either input is NaN)

df['three'] = df['one'] * df['two']
df

          one     two    three
apple   100.0   111.0  11100.0
ball    200.0   222.0  44400.0
cerill    NaN   333.0      NaN
clock   300.0     NaN      NaN
dancy     NaN  4444.0      NaN


In [47]:
# create a boolean column that is True where 'one' > 250 (NaN compares as False)
df['flag'] = df['one'] > 250
df

          one     two    three   flag
apple   100.0   111.0  11100.0  False
ball    200.0   222.0  44400.0  False
cerill    NaN   333.0      NaN  False
clock   300.0     NaN      NaN   True
dancy     NaN  4444.0      NaN  False
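
The `False` entries for the missing rows come from how comparisons treat NaN. A minimal sketch with a standalone copy of column 'one' shows the comparison, an explicit missing-value mask, and boolean indexing with the resulting mask:

```python
import numpy as np
import pandas as pd

one = pd.Series([100., 200., np.nan, 300., np.nan],
                index=['apple', 'ball', 'cerill', 'clock', 'dancy'])

flag = one > 250        # NaN > 250 evaluates to False
missing = one.isna()    # explicit mask for the missing entries
big = one[flag]         # boolean indexing keeps entries where the mask is True

print(flag.tolist())
print(big)
```

If missing values need to be told apart from values that fail the test, `isna()` gives a separate mask for them.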


In [48]:
# remove column 'three' and return it so it can be stored (pop returns the column; del does not)

three = df.pop('three')
three

apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64

In [49]:
# confirm that column 'three' is gone
df

          one     two   flag
apple   100.0   111.0  False
ball    200.0   222.0  False
cerill    NaN   333.0  False
clock   300.0     NaN   True
dancy     NaN  4444.0  False


In [50]:
# delete column 'two' without returning it
del df['two']
df

          one   flag
apple   100.0  False
ball    200.0  False
cerill    NaN  False
clock   300.0   True
dancy     NaN  False
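
Both `pop` and `del` modify the DataFrame in place. When the original should stay untouched, `DataFrame.drop` is an alternative that returns a new object. A small sketch with a hypothetical frame:

```python
import pandas as pd

# hypothetical frame, invented for this comparison
frame = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6]})

trimmed = frame.drop(columns=["two"])   # new DataFrame; 'frame' is unchanged
popped = frame.pop("three")             # removes 'three' in place and returns it

print(list(trimmed.columns))
print(list(frame.columns))
```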


In [51]:
# insert a new column at position 2 named 'copy_of_one' with the values from column 'one'
df.insert(2, 'copy_of_one', df['one'])
df

          one   flag  copy_of_one
apple   100.0  False        100.0
ball    200.0  False        200.0
cerill    NaN  False          NaN
clock   300.0   True        300.0
dancy     NaN  False          NaN


In [52]:
# create a new column from the first 2 values of 'one'; the remaining 3 rows are filled with NaN
df['one_upper_half'] = df['one'][:2]
df

          one   flag  copy_of_one  one_upper_half
apple   100.0  False        100.0           100.0
ball    200.0  False        200.0           200.0
cerill    NaN  False          NaN             NaN
clock   300.0   True        300.0             NaN
dancy     NaN  False          NaN             NaN
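
The NaN values appear because assigning a Series to a column aligns on index labels rather than on position. A minimal sketch of that alignment, using hypothetical data in which the labels are out of order and one label does not exist in the frame:

```python
import pandas as pd

frame = pd.DataFrame({"one": [100., 200., 300.]}, index=["apple", "ball", "clock"])

# labels in a different order, plus a label ('zebra') the frame does not have
other = pd.Series([3., 1., 9.], index=["clock", "apple", "zebra"])

# values are matched by label: 'ball' gets NaN and 'zebra' is ignored
frame["aligned"] = other
print(frame)
```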


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Pandas Data Ingestion</p>
<br>

pandas can ingest data from a variety of sources in a variety of formats and data types. 
* **CSV** - a simple file format storing **tabular** data, read via **pandas.read_csv(path)**, which outputs a **pandas DataFrame object**
* **JSON (JavaScript Object Notation)** - a format for structuring data, commonly used for communication within web apps, read via **pandas.read_json(path to JSON string/file)**, which outputs a **pandas DataFrame or Series object**
* **HTML** - the file format underlying every web page; its tables are read into a **list of pandas DataFrames** via **pandas.read_html(URL/raw HTML string)**
* **SQL** - a language used to communicate with a database via queries; **pandas.read_sql_query(SQLQuery, DatabaseConnection)** outputs a **pandas DataFrame object**, and **pandas.read_sql_table(SQLTableName, DatabaseConnection)** reads a whole relational table
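
A minimal sketch of three of these readers, using in-memory stand-ins so that no external files are needed. The sample text, the table name `scores`, and the column names are all hypothetical. The text is wrapped in `io.StringIO` because recent pandas versions prefer file-like objects over raw strings. `read_html` and `read_sql_table` are left out here because they rely on optional extra packages (an HTML parser and SQLAlchemy, respectively).

```python
import io
import sqlite3

import pandas as pd

# CSV: any path or file-like object works; StringIO stands in for a file
csv_text = "name,score\nann,1\nbob,2\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records becomes one row per record
json_text = '[{"name": "ann", "score": 1}, {"name": "bob", "score": 2}]'
from_json = pd.read_json(io.StringIO(json_text))

# SQL: an in-memory SQLite database stands in for a real database connection
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [("ann", 1), ("bob", 2)])
from_sql = pd.read_sql_query("SELECT name, score FROM scores", conn)
conn.close()

print(from_csv)
print(from_json)
print(from_sql)
```

In each case the result is an ordinary DataFrame, so everything shown earlier in this notebook applies regardless of where the data came from.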