# Intro: Python and Pandas

This notebook provides a short introduction to basic Python commands, and specifically to the basic commands of the pandas package that we will be using in class. 
You should already have completed the [DataCamp Intro to Python for Data Science](https://www.datacamp.com/courses/intro-to-python-for-data-science) course. It is free and takes about four hours. If you haven't done that course, please consider this notebook a helpful resource to familiarize yourself with Python and pandas. 
After the class you will be able to read in data from a database and execute basic data manipulations using pandas. 

## Table of Contents
1. [General Remarks](#General-Remarks)
    1. [Python Setup](#Python-Setup)
    1. [Loading Data](#Loading-Data)
1. [Data Analysis in Pandas](#Data-Analysis-in-Pandas)
    1. [Displaying Data](#Displaying-Data)
    1. [Subsetting Data](#Subsetting-Data)
    1. [Statistics](#Statistics)
    1. [Adding and Updating Data](#Adding-and-Updating-Data)
    1. [Grouping and Aggregating Data](#Grouping-and-Aggregating-Data)
    1. [Merging Dataframes](#Merging-Dataframes)
    1. [Saving a CSV](#Saving-a-CSV)

## General Remarks
---

Python: 
* Is a high-level, interpreted, general-purpose programming language named after the British comedy troupe Monty Python
* Was created by Guido van Rossum, and is maintained by an open source community
* Is the fifth most popular programming language
* Is an object-oriented language
* Is widely used in data science because it is powerful and fast, and is compatible with other languages
* Runs everywhere, is easy to learn, is highly readable and open source, and allows faster development than many other languages
* Comes with a growing and always-improving list of open-source libraries for scientific programming, data manipulation, and data analysis (e.g., Numpy, Scipy, Pandas, Scikit-Learn, Statsmodels, Matplotlib, Seaborn, PyTables, etc.)

IPython/Jupyter
* Is an enhanced, interactive Python interpreter that started as a grad school project by Fernando Perez. 
* Evolved into the IPython notebook, which allows users to archive their code, figures, and analysis in a single document, making it much easier to do reproducible research and to share it. 
* Later added support for other languages, including but not limited to Julia, Python and R. This led to a rebranding as the Jupyter Project. 

### Python Setup

- In Python, we `import` packages. The `import` command allows us to use libraries created by others in our own work by "importing" them. You can think of importing a library as opening up a toolbox and pulling out a specific tool. 
- NumPy is short for numerical python. NumPy is a lynchpin in Python's scientific computing stack. Its strengths include a powerful *N*-dimensional array object, and a large suite of functions for doing numerical computing. 
- Pandas is a Python library for data analysis built around the DataFrame object, which is modeled on the data frame in R. A DataFrame is similar to a spreadsheet, but it lets you do your analysis programmatically rather than through the point-and-click of Excel. It is a lynchpin of the PyData stack.  
- Psycopg2 and sqlalchemy are Python libraries for interfacing with a PostgreSQL database. 
- Matplotlib is the standard plotting library in Python. 
`%matplotlib inline` is a so-called "magic" function of Jupyter that enables plots to be displayed inline with the code and text of a notebook. 

#### This is what the start of a notebook might look like

In [48]:
# remember to put this line in your notebook, otherwise the visualization won't show up
%pylab inline
# import the packages
# numpy for array and matrix computation
import numpy as np

# pandas for data analysis
import pandas as pd

# matplotlib and seaborn are the data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

# sqlalchemy and psycopg2 are sql connection packages
from sqlalchemy import create_engine

# configure pandas display: set the maximum number of columns displayed to 25
pd.options.display.max_columns = 25

# use the __future__ version of division and print
from __future__ import division, print_function
import warnings
warnings.filterwarnings('ignore')

Populating the interactive namespace from numpy and matplotlib


In practice we typically load libraries like `numpy` and `pandas` with shortened aliases, e.g, `import numpy as np`. This is like saying, "`import numpy`, and wherever you see `np`, read it as `numpy`." Similarly, you'll often see `import pandas as pd`, or `import matplotlib.pyplot as plt`. 

Another shortcut is `%pylab inline`. This magic command runs both `import numpy as np` and `import matplotlib.pyplot as plt`, and also loads many numpy and matplotlib names directly into the notebook's namespace. The aliases exist because it's faster to type `plt.plot()` than `matplotlib.pyplot.plot()`, and even programmers don't like to type more than they have to. 

In documentation and in examples, you will frequently see `numpy` commands starting with the alias `np` rather than `numpy` (e.g, `np.array()` or `np.argsort`) and `pandas` commands starting with `pd` (e.g., `pd.DataFrame()` or `pd.concat()`).
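
As a small illustration of the aliases described above, the sketch below uses `np` and `pd` on made-up values. The variable names (`values`, `order`, `frame`, `stacked`) are chosen only for this example.

```python
import numpy as np
import pandas as pd

# np and pd are simply shorter names for the numpy and pandas modules
values = np.array([3, 1, 2])
order = np.argsort(values)      # indices that would sort the array

frame = pd.DataFrame({"value": values})
stacked = pd.concat([frame, frame], ignore_index=True)   # stack two copies row-wise

print(order)
print(stacked.shape)
```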

### Loading Data

Before we can start analysing the data we have to load it into memory. Pandas can read many different data formats: it allows the user to read data from a local csv or excel file, or to pull data from a relational database. Since we are working with the relational database ada_pub in this class, we will demonstrate how to use pandas to read data from a relational database. For examples of reading data from a csv file, refer to the pandas documentation [Getting Data In/Out](https://pandas.pydata.org/pandas-docs/stable/10min.html#getting-data-in-out).

The function that runs a SQL query and puts the result into a pandas dataframe (more to come) is `pd.read_sql()`. This function needs some information about the database, and the query you would like to run. 

### Establish a connection to the ada_pub database
In the simplest case, the `pd.read_sql()` function requires only two parameters to pull data. The first parameter is the connection to the database, and the second parameter is the actual SQL query. To create a connection we use the sqlalchemy package and tell it which database we want to connect to.

#### Parameter 1: connection

In [49]:
# create postgresql connection - the three '/' mean that the default host, port, username, and password are used
engine = create_engine('postgresql:///ada_pub')

#### Parameter 2: query
Depending on what data we are interested in, we can use different queries to pull different data. In this example, we will pull all the content of the table projects for the department USDA in fiscal year 2014. A string enclosed in three quotation marks is called a multi-line string. It is quite handy for writing SQL queries because newline characters are kept as part of the string instead of ending it.

In [50]:
QUERY = '''
SELECT *
FROM projects
WHERE department = 'USDA' AND fy = '2014';
'''

### Pull data from the database
Now that we know what the arguments are, we can pass them to `pd.read_sql_query()`, the variant of `pd.read_sql()` that takes a SQL query, and obtain the data.

In [51]:
# here we pass the query and the connection to the pd.read_sql_query() function 
example = pd.read_sql_query(QUERY, con=engine)

In [52]:
example

Unnamed: 0,project_id,project_terms,project_title,department,agency,ic_center,project_number,project_start_date,project_end_date,contact_pi_project_leader,other_pis,congressional_district,duns_number,organization_name,organization_city,organization_state,organization_zip,organization_country,budget_start_date,budget_end_date,cfda_code,fy,fy_total_cost,fy_total_cost_sub_projects
0,589214,Bees; Chemicals; Development; Devices; Diet; ...,IMPROVE NUTRITION FOR HONEY BEE COLONIES TO ST...,USDA,ARS,,ARS-0425868,10/1/2013,2/6/2014,"HOFFMAN, GLORIA D",,2.0,078274365,HONEY BEE RESEARCH INSTITUTE AND NATURE CENTER...,TUCSON,AZ,85719,UNITED STATES,,,10.001,2014,,
1,589224,Affect; Algae; Amendment; Appearance; base; B...,ALGAL-BASED WATER TREATMENT TECHNOLOGIES FOR S...,USDA,ARS,,ARS-0426136,1/1/2014,12/31/2018,"HALL, DAVID GOODSELL",,0.0,,SUBTROPICAL INSECTS AND HORTICULTURE RESEARCH,FORT PIERCE,FL,34945,UNITED STATES,,,10.001,2014,,
2,591601,Area; base; Biological; California; Consumpti...,BIOLOGICAL CONTROL OF PIERCE'S DISEASE OF GRAP...,USDA,NIFA,,0212205,8/20/2014,9/30/2014,"HOPKINS, DO, .",,,002236250,UNIVERSITY OF FLORIDA,GAINESVILLE,FL,32611-0110,UNITED STATES,,,10.203,2014,,
3,593659,absorption; Address; Anemia; Anemia due to Ch...,MOLECULAR MECHSNISMS OF INTESTINAL METAL ION T...,USDA,NIFA,,0217191,7/8/2014,7/9/2014,"COLLINS, JA, F..",,,002236250,UNIVERSITY OF FLORIDA,GAINESVILLE,FL,32611-0110,UNITED STATES,,,10.203,2014,,
4,597171,Climate; Decision Support Systems; Developmen...,ASSESSING CLIMATE INFORMATION NEEDS AND OPPORT...,USDA,NIFA,,0226918,8/19/2014,8/20/2014,"JONES, J.",,,002236250,UNIVERSITY OF FLORIDA,GAINESVILLE,FL,32611-0110,UNITED STATES,,,10.202,2014,,
5,599511,Attention; base; Development; Fisheries; Fish...,DEVELOPMENT OF TOOLS TO ENABLE PLACE-BASED MAN...,USDA,NIFA,,0233896,5/30/2014,9/30/2018,"STRUVE, JU, .",,,002236250,UNIVERSITY OF FLORIDA,GAINESVILLE,FL,32611-0110,UNITED STATES,,,10.203,2014,,
6,689200,Agriculture; Farming environment; New York; N...,NITROGEN LOSSES FROM AGRICULTURE: COMPARING OR...,USDA,NIFA,,1000965,10/1/2013,9/30/2016,"HOWARTH, RO.","MARINO, RO, M",,613809599,CORNELL UNIVERSITY INC,ORISKANY,NY,13424-3921,UNITED STATES,,,10.203,2014,,
7,690030,Animals; Embryo; Genome; programs; Research; ...,NATIONAL ANIMAL GENOME RESEARCH PROGRAM,USDA,NIFA,,1002276,1/6/2014,9/30/2018,"BRUEMMER, J.",,,149546160,COLORADO STATE UNIVERSITY,DENVER,CO,80203-1148,UNITED STATES,,,10.203,2014,,
8,690062,Population; Population Growth; Process; Repro...,"DYNAMICS, PERSISTENCE AND MANAGEMENT OF VERTEB...",USDA,NIFA,,1002321,1/28/2014,12/23/2018,"OLI, M.","OLI, MA, K",,002236250,UNIVERSITY OF FLORIDA,GAINESVILLE,FL,32611-0110,UNITED STATES,,,10.203,2014,,
9,689214,Development; Growth; Lead; Nematoda; next gen...,ANALYSIS OF SEXUAL REPRODUCTION IN A ROOT-KNOT...,USDA,NIFA,,1001006,1/6/2014,9/30/2018,"ENGEBRECHT, J.","BRITT, AN,",,047120084,UNIVERSITY OF CALIFORNIA DAVIS,DAVIS,CA,95618-6153,UNITED STATES,,,10.203,2014,,


We have now finished loading the data, and we are ready to do some data analysis.
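
The loading steps above can be tried without access to a PostgreSQL server. The sketch below is only an illustration: it swaps in an in-memory SQLite database from Python's standard library for ada_pub, and the table contents are made-up toy rows, not data from the class database.

```python
import sqlite3
import pandas as pd

# Assumption: an in-memory SQLite database stands in for ada_pub
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects (project_id INTEGER, department TEXT, agency TEXT, fy TEXT)"
)
conn.executemany(
    "INSERT INTO projects VALUES (?, ?, ?, ?)",
    [
        (1, "USDA", "ARS", "2014"),
        (2, "USDA", "NIFA", "2014"),
        (3, "USDA", "NIFA", "2013"),
        (4, "HHS", "NIH", "2014"),
    ],
)
conn.commit()

# Same query as in the notebook, written as a multi-line string
QUERY = '''
SELECT *
FROM projects
WHERE department = 'USDA' AND fy = '2014';
'''

# Parameter 1 is the query, parameter 2 is the connection
toy = pd.read_sql_query(QUERY, con=conn)
conn.close()

print(toy)
```

Only the two toy rows with department USDA and fiscal year 2014 come back, which mirrors what the `WHERE` clause does on the real table.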

## Data Analysis in Pandas

When we work with pandas, we think in terms of dataframes. A dataframe is the pandas representation of a spreadsheet, a SQL table, or a Stata, R, or SAS dataset. It contains column names, row indices (starting from 0), and the actual data. Dataframes are the basic objects on which we will perform our data analysis.

### Displaying Data

#### The shape of the dataframe
When we get the data, we usually want to know how many rows and columns it contains. We can find the number of rows and columns by accessing the `shape` attribute with the dot operator.

In [53]:
# shape of a dataframe (row number, column number)
example.shape
# We can see that the dataframe contains 3192 rows, and 24 columns

(3192, 24)

In [47]:
# list the variables in the data, with the number of non-missing values in each column
example.count()

project_id                    3192
project_terms                 3192
project_title                 3192
department                    3192
agency                        3192
ic_center                        0
project_number                3192
project_start_date            3192
project_end_date              3192
contact_pi_project_leader     3180
other_pis                     1072
congressional_district        1360
duns_number                   2894
organization_name             3189
organization_city             3189
organization_state            3189
organization_zip              3087
organization_country          3192
budget_start_date                0
budget_end_date                  0
cfda_code                     3192
fy                            3192
fy_total_cost                 1005
fy_total_cost_sub_projects       0
dtype: int64
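
The difference between `shape` and `count()` can be seen on a tiny dataframe. This is a minimal sketch on made-up rows; `isnull().sum()` is added here as one common way to count the missing values directly.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "agency": ["ARS", "NIFA", "NIFA", "FS"],
    "fy_total_cost": [np.nan, 490142.0, np.nan, 56954.0],
})

n_rows, n_cols = toy.shape      # shape is an attribute, so there are no parentheses
non_null = toy.count()          # non-missing values per column
missing = toy.isnull().sum()    # missing values per column

print(n_rows, n_cols)
print(non_null)
print(missing)
```

A column whose count is lower than the number of rows, such as `fy_total_cost` in the output above, contains missing values.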

#### The head and tail of the dataframe
It is also helpful to look at the first or last few rows of the data for a first impression, as well as a sanity check. We can call the `head()` and `tail()` methods for this. We can also specify in the parentheses how many rows we would like to see. Here we choose to display 10. If no number is specified, the first (or last) 5 rows are returned by default.

In [6]:
# display the first few rows of the dataframe
example.head(10)

Unnamed: 0,project_id,project_terms,project_title,department,agency,ic_center,project_number,project_start_date,project_end_date,contact_pi_project_leader,other_pis,congressional_district,duns_number,organization_name,organization_city,organization_state,organization_zip,organization_country,budget_start_date,budget_end_date,cfda_code,fy,fy_total_cost,fy_total_cost_sub_projects
0,589214,Bees; Chemicals; Development; Devices; Diet; ...,IMPROVE NUTRITION FOR HONEY BEE COLONIES TO ST...,USDA,ARS,,ARS-0425868,10/1/2013,2/6/2014,"HOFFMAN, GLORIA D",,2.0,78274365.0,HONEY BEE RESEARCH INSTITUTE AND NATURE CENTER...,TUCSON,AZ,85719,UNITED STATES,,,10.001,2014,,
1,589224,Affect; Algae; Amendment; Appearance; base; B...,ALGAL-BASED WATER TREATMENT TECHNOLOGIES FOR S...,USDA,ARS,,ARS-0426136,1/1/2014,12/31/2018,"HALL, DAVID GOODSELL",,0.0,,SUBTROPICAL INSECTS AND HORTICULTURE RESEARCH,FORT PIERCE,FL,34945,UNITED STATES,,,10.001,2014,,
2,591601,Area; base; Biological; California; Consumpti...,BIOLOGICAL CONTROL OF PIERCE'S DISEASE OF GRAP...,USDA,NIFA,,0212205,8/20/2014,9/30/2014,"HOPKINS, DO, .",,,2236250.0,UNIVERSITY OF FLORIDA,GAINESVILLE,FL,32611-0110,UNITED STATES,,,10.203,2014,,
3,593659,absorption; Address; Anemia; Anemia due to Ch...,MOLECULAR MECHSNISMS OF INTESTINAL METAL ION T...,USDA,NIFA,,0217191,7/8/2014,7/9/2014,"COLLINS, JA, F..",,,2236250.0,UNIVERSITY OF FLORIDA,GAINESVILLE,FL,32611-0110,UNITED STATES,,,10.203,2014,,
4,597171,Climate; Decision Support Systems; Developmen...,ASSESSING CLIMATE INFORMATION NEEDS AND OPPORT...,USDA,NIFA,,0226918,8/19/2014,8/20/2014,"JONES, J.",,,2236250.0,UNIVERSITY OF FLORIDA,GAINESVILLE,FL,32611-0110,UNITED STATES,,,10.202,2014,,
5,599511,Attention; base; Development; Fisheries; Fish...,DEVELOPMENT OF TOOLS TO ENABLE PLACE-BASED MAN...,USDA,NIFA,,0233896,5/30/2014,9/30/2018,"STRUVE, JU, .",,,2236250.0,UNIVERSITY OF FLORIDA,GAINESVILLE,FL,32611-0110,UNITED STATES,,,10.203,2014,,
6,689200,Agriculture; Farming environment; New York; N...,NITROGEN LOSSES FROM AGRICULTURE: COMPARING OR...,USDA,NIFA,,1000965,10/1/2013,9/30/2016,"HOWARTH, RO.","MARINO, RO, M",,613809599.0,CORNELL UNIVERSITY INC,ORISKANY,NY,13424-3921,UNITED STATES,,,10.203,2014,,
7,690030,Animals; Embryo; Genome; programs; Research; ...,NATIONAL ANIMAL GENOME RESEARCH PROGRAM,USDA,NIFA,,1002276,1/6/2014,9/30/2018,"BRUEMMER, J.",,,149546160.0,COLORADO STATE UNIVERSITY,DENVER,CO,80203-1148,UNITED STATES,,,10.203,2014,,
8,690062,Population; Population Growth; Process; Repro...,"DYNAMICS, PERSISTENCE AND MANAGEMENT OF VERTEB...",USDA,NIFA,,1002321,1/28/2014,12/23/2018,"OLI, M.","OLI, MA, K",,2236250.0,UNIVERSITY OF FLORIDA,GAINESVILLE,FL,32611-0110,UNITED STATES,,,10.203,2014,,
9,689214,Development; Growth; Lead; Nematoda; next gen...,ANALYSIS OF SEXUAL REPRODUCTION IN A ROOT-KNOT...,USDA,NIFA,,1001006,1/6/2014,9/30/2018,"ENGEBRECHT, J.","BRITT, AN,",,47120084.0,UNIVERSITY OF CALIFORNIA DAVIS,DAVIS,CA,95618-6153,UNITED STATES,,,10.203,2014,,


In [7]:
# last few rows of the dataframe
# the syntax is similar to head
example.tail()

Unnamed: 0,project_id,project_terms,project_title,department,agency,ic_center,project_number,project_start_date,project_end_date,contact_pi_project_leader,other_pis,congressional_district,duns_number,organization_name,organization_city,organization_state,organization_zip,organization_country,budget_start_date,budget_end_date,cfda_code,fy,fy_total_cost,fy_total_cost_sub_projects
3187,813637,Acids; Affect; Aldehydes; Beta-glucuronidase;...,TECHNOLOGIES FOR IMPROVING PROCESS EFFICIENCIE...,USDA,ARS,,ARS-0427437,8/9/2014,8/8/2019,"DIEN, BRUCE S",,18.0,64539612.0,U.S. AGRICULTURAL RESEARCH SERVICE,PEORIA,IL,61604,UNITED STATES,,,10.001,2014,1345714.0,
3188,813638,base; Biochemical; Biomass; Collaborations; E...,BIOCHEMICAL TECHNOLOGIES TO ENABLE THE COMMERC...,USDA,ARS,,ARS-0427438,8/21/2014,8/20/2019,"SLININGER, PATRICIA J WATSON",,18.0,64539612.0,U.S. AGRICULTURAL RESEARCH SERVICE,PEORIA,IL,61604,UNITED STATES,,,10.001,2014,833060.0,
3189,813639,Acids; Arabinose; Aspergillus; Butanols; carb...,DEVELOP TECHNOLOGIES FOR PRODUCTION OF PLATFOR...,USDA,ARS,,ARS-0427439,8/29/2014,8/28/2019,"SAHA, BADAL C",,18.0,64539612.0,U.S. AGRICULTURAL RESEARCH SERVICE,PEORIA,IL,61604,UNITED STATES,,,10.001,2014,1281630.0,
3190,813643,Agriculture; antimicrobial; Antioxidants; bas...,"ENABLE NEW MARKETABLE, VALUE-ADDED COPRODUCTS ...",USDA,ARS,,ARS-0427684,9/8/2014,9/7/2019,"MOREAU, ROBERT A",,13.0,,AGRICULTURAL RESEARCH SERVICE,WYNDMOOR,PA,19038,UNITED STATES,,,10.001,2014,2133178.0,
3191,813655,Agriculture; Anti-Bacterial Agents; Bacteria;...,NEW BIOBASED PRODUCTS AND IMPROVED BIOCHEMICAL...,USDA,ARS,,ARS-0427980,9/19/2014,9/18/2019,"BISCHOFF, KENNETH M",,18.0,64539612.0,U.S. AGRICULTURAL RESEARCH SERVICE,PEORIA,IL,61604,UNITED STATES,,,10.001,2014,1547666.0,


### Columns, rows, data selection

#### Column names
To see which columns the data contains, use the following syntax. 

In [8]:
example.columns

Index([u'project_id', u'project_terms', u'project_title', u'department',
       u'agency', u'ic_center', u'project_number', u'project_start_date',
       u'project_end_date', u'contact_pi_project_leader', u'other_pis',
       u'congressional_district', u'duns_number', u'organization_name',
       u'organization_city', u'organization_state', u'organization_zip',
       u'organization_country', u'budget_start_date', u'budget_end_date',
       u'cfda_code', u'fy', u'fy_total_cost', u'fy_total_cost_sub_projects'],
      dtype='object')

#### Single column selection
If we want to select a specific column, we can use the following syntax:

In [9]:
# select a single column: the dataframe variable name, followed by square brackets, and then put
# the column name between quotes (either single or double). 
example['agency'].head()

0     ARS
1     ARS
2    NIFA
3    NIFA
4    NIFA
Name: agency, dtype: object

In [10]:
# the same result using attribute (dot) access, which works when the column name is a valid python identifier
example.agency.head()

0     ARS
1     ARS
2    NIFA
3    NIFA
4    NIFA
Name: agency, dtype: object

#### Multiple-column selection
To select multiple columns, wrap the column names in a python list, then put the list between the brackets after the dataframe.

In [11]:
# here we selected the columns and assigned them to a new dataframe example2
example2 = example[['agency', 'project_title', 'fy_total_cost',
                           'project_start_date','project_end_date']]
example2.head()

Unnamed: 0,agency,project_title,fy_total_cost,project_start_date,project_end_date
0,ARS,IMPROVE NUTRITION FOR HONEY BEE COLONIES TO ST...,,10/1/2013,2/6/2014
1,ARS,ALGAL-BASED WATER TREATMENT TECHNOLOGIES FOR S...,,1/1/2014,12/31/2018
2,NIFA,BIOLOGICAL CONTROL OF PIERCE'S DISEASE OF GRAP...,,8/20/2014,9/30/2014
3,NIFA,MOLECULAR MECHSNISMS OF INTESTINAL METAL ION T...,,7/8/2014,7/9/2014
4,NIFA,ASSESSING CLIMATE INFORMATION NEEDS AND OPPORT...,,8/19/2014,8/20/2014
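
The two selection styles above return different kinds of objects. The following sketch, on made-up rows, shows that single brackets with a column name give a Series, while a list of names gives a dataframe, even when the list has only one element.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "agency": ["ARS", "NIFA"],
    "project_title": ["PROJECT A", "PROJECT B"],
    "fy_total_cost": [1000.0, 2000.0],
})

one_col = toy["agency"]                       # column name -> Series
two_cols = toy[["agency", "fy_total_cost"]]   # list of names -> DataFrame
still_frame = toy[["agency"]]                 # one-element list -> one-column DataFrame

print(type(one_col).__name__, type(two_cols).__name__, type(still_frame).__name__)
```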


#### Single/multiple cell(s) selection
Use the `loc` indexer for cell selection. Pass the row and column labels in the _square brackets_ after `loc`. Specify the row index first, and then the column name, separated by a comma. Note that when you select a range with a colon, both the start and the end index are included.

In [12]:
# single cell selection
# select the cell in the row with index 3 and the column project_start_date
cell = example2.loc[3, 'project_start_date']
cell

u'7/8/2014'

In [13]:
# multiple cells selection
# option 1: use a python list to explicitly list the rows/columns
cell = example2.loc[[0, 2, 4], 'project_start_date']
cell

0    10/1/2013
2    8/20/2014
4    8/19/2014
Name: project_start_date, dtype: object

In [14]:
# option 2: use colon to indicate contiguous selection
cell = example2.loc[0:4, 'project_start_date']
cell

0    10/1/2013
1     1/1/2014
2    8/20/2014
3     7/8/2014
4    8/19/2014
Name: project_start_date, dtype: object

In [15]:
# if we want to select all columns, we can use a colon symbol :.
row5 = example2.loc[5, :]
row5

agency                                                             NIFA
project_title         DEVELOPMENT OF TOOLS TO ENABLE PLACE-BASED MAN...
fy_total_cost                                                       NaN
project_start_date                                            5/30/2014
project_end_date                                              9/30/2018
Name: 5, dtype: object
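
Because `loc` works with labels, its slices include the end label. The sketch below contrasts this with `iloc`, the position-based indexer in pandas, which is not used elsewhere in this notebook and is shown here only for comparison. The rows are a hand-typed sample.

```python
import pandas as pd

toy = pd.DataFrame(
    {"project_start_date": ["10/1/2013", "1/1/2014", "8/20/2014", "7/8/2014", "8/19/2014"]}
)

by_label = toy.loc[0:2, "project_start_date"]   # labels 0, 1 and 2: end label included
by_position = toy.iloc[0:2, 0]                  # positions 0 and 1: end position excluded

# After filtering, the row labels no longer match the row positions
filtered = toy[toy["project_start_date"].str.endswith("2014")]
first_label = filtered.index[0]

print(len(by_label), len(by_position), first_label)
```

The last lines matter in practice: after subsetting, the first remaining row keeps its original label, so `loc[0, ...]` on a filtered dataframe may fail even though the dataframe is not empty.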

### Subsetting Data
#### Subsetting numerical data
Similar to the `where` clause in SQL, we can select only the rows that meet a certain condition. The syntax depends on whether the data is numerical or string. For example, if we would like to select the projects with a total cost of at least 50,000, we can subset with a greater-than-or-equal-to condition.

In [16]:
# conditional subsetting: put the conditional statement within the square brackets 
# the conditional statement here is that we want the cost to be higher than or equal to 50,000.
example3 = example2[example2['fy_total_cost'] >= 50000]
example3.head()

Unnamed: 0,agency,project_title,fy_total_cost,project_start_date,project_end_date
16,NIFA,HATCH ACT OF 1887 - NY,490142.0,6/15/2013,2/14/2014
17,NIFA,ANIMAL HEALTH AND DISEASE RESEARCH PROGRAM - OH,56954.0,1/1/2013,12/31/2013
18,NIFA,HATCH ACT OF 1887 (MULTISTATE RESEARCH FUND) - NY,765728.0,7/1/2013,2/28/2014
19,NIFA,AGRABILITY OF UTAH,180000.0,2/15/2014,2/14/2016
20,NIFA,SGU RESTORING THE BUFFALO ECONOMY EDUCATION PR...,107468.0,3/1/2014,2/28/2017


#### Subsetting string/categorical data
When the column contains string or categorical data, comparison operators are often not the best choice for data selection. Instead, we can compare each value in a column to a target list to see whether the value is included in the list. This is done by calling the `isin` method.

In [17]:
# select specific agencies
# we specify the target list within the parentheses of the `isin` method
example4 = example2[example2['agency'].isin(['NIFA', 'ARS'])]
example4.head()

Unnamed: 0,agency,project_title,fy_total_cost,project_start_date,project_end_date
0,ARS,IMPROVE NUTRITION FOR HONEY BEE COLONIES TO ST...,,10/1/2013,2/6/2014
1,ARS,ALGAL-BASED WATER TREATMENT TECHNOLOGIES FOR S...,,1/1/2014,12/31/2018
2,NIFA,BIOLOGICAL CONTROL OF PIERCE'S DISEASE OF GRAP...,,8/20/2014,9/30/2014
3,NIFA,MOLECULAR MECHSNISMS OF INTESTINAL METAL ION T...,,7/8/2014,7/9/2014
4,NIFA,ASSESSING CLIMATE INFORMATION NEEDS AND OPPORT...,,8/19/2014,8/20/2014


#### Subsetting with multiple conditions
If we want to subset the data with more than one condition, we can specify all the conditions and combine them with the `&` operator (logical AND). Remember to put every single condition within a pair of parentheses.

In [18]:
# combine both selections from above
example5 = example2[(example2['fy_total_cost'] >= 50000) &
                        (example2['agency'].isin(['NIFA', 'ARS']))]
example5.head()

Unnamed: 0,agency,project_title,fy_total_cost,project_start_date,project_end_date
16,NIFA,HATCH ACT OF 1887 - NY,490142.0,6/15/2013,2/14/2014
17,NIFA,ANIMAL HEALTH AND DISEASE RESEARCH PROGRAM - OH,56954.0,1/1/2013,12/31/2013
18,NIFA,HATCH ACT OF 1887 (MULTISTATE RESEARCH FUND) - NY,765728.0,7/1/2013,2/28/2014
19,NIFA,AGRABILITY OF UTAH,180000.0,2/15/2014,2/14/2016
20,NIFA,SGU RESTORING THE BUFFALO ECONOMY EDUCATION PR...,107468.0,3/1/2014,2/28/2017
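
Each condition used for subsetting is a boolean Series, which is why conditions can be combined. The sketch below, on made-up rows, shows `&` next to its counterparts `|` (or) and `~` (not), and how a missing cost behaves in a comparison.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "agency": ["ARS", "NIFA", "NIFA", "FS"],
    "fy_total_cost": [np.nan, 490142.0, 20000.0, 56954.0],
})

big = toy["fy_total_cost"] >= 50000            # a missing cost compares as False
wanted = toy["agency"].isin(["NIFA", "ARS"])

both = toy[big & wanted]       # rows meeting both conditions
either = toy[big | wanted]     # rows meeting at least one condition
not_wanted = toy[~wanted]      # rows whose agency is not in the list

print(both)
print(either)
print(not_wanted)
```

Storing the conditions in named variables first, as done here, is optional, but it avoids the parenthesis mistakes that are easy to make when everything is written inline.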


### Statistics
#### Descriptive stats
Pandas has integrated some very useful tools to help us understand the distribution of the data. The `describe` method computes the most commonly used descriptive statistics, such as count, mean, standard deviation and quantiles for a dataframe. 

In [19]:
# see the descriptive statistics of all numeric columns
example.describe()

Unnamed: 0,project_id,congressional_district,fy,fy_total_cost
count,3192.0,1360.0,3192.0,1005.0
mean,698709.10213,5.024265,2014.0,685839.9
std,31939.102729,8.888327,0.0,1077847.0
min,589214.0,0.0,2014.0,2363.0
25%,688547.5,1.0,2014.0,100000.0
50%,689477.5,3.0,2014.0,249983.0
75%,690275.25,5.0,2014.0,768000.0
max,813655.0,98.0,2014.0,7759400.0
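
`describe` can also be called on a single column. As a minimal sketch on made-up values: a numeric column returns the statistics shown above, while a text column returns the count, the number of unique values, the most frequent value, and its frequency.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "agency": ["ARS", "NIFA", "NIFA", "FS"],
    "fy_total_cost": [np.nan, 100.0, 200.0, 300.0],
})

numeric_summary = toy["fy_total_cost"].describe()   # count, mean, std, min, quartiles, max
text_summary = toy["agency"].describe()             # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```

Note that missing values are ignored: the count of `fy_total_cost` is 3, not 4.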


#### Value counts and unique values
For categorical values, it is often helpful to find out the unique values of a given column, and how often each value occurs.

In [20]:
# find out which different agencies appear in the data
example['agency'].unique()

array([u'ARS', u'NIFA', u'FS'], dtype=object)

In [21]:
# to count how many observations for each agency appeared in the data
example['agency'].value_counts()

NIFA    2896
FS       184
ARS      112
Name: agency, dtype: int64

The counts show that NIFA accounts for the large majority of the USDA projects in fiscal year 2014 (2,896 of 3,192), followed by FS (184) and ARS (112).
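
`value_counts` has a few options that are often useful with categorical columns. The sketch below uses a made-up Series with one missing value to show proportions (`normalize=True`) and how missing values are handled (`dropna=False`).

```python
import pandas as pd

# Toy data for illustration only, including one missing value
agency = pd.Series(["NIFA", "NIFA", "ARS", None, "NIFA", "FS"])

counts = agency.value_counts()                      # missing values are left out by default
shares = agency.value_counts(normalize=True)        # proportions instead of counts
with_missing = agency.value_counts(dropna=False)    # count the missing values too
n_unique = agency.nunique()                         # number of distinct non-missing values

print(counts)
print(shares)
print(with_missing)
print(n_unique)
```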

### Adding and Updating Data
#### Creating columns
We sometimes need to create a new column, either to store a value calculated from other columns, or to add new information to the dataframe. The syntax is given below:
`dataframe['column_name'] = value`
where:

* `dataframe` is the dataframe in which the new column is created,
* `column_name` is the name of the new column as a string,
* `value` is the value of each cell.

In [22]:
# calculate the monthly cost by dividing the fy_total_cost column by 12, 
# and assign the result to the new monthly column
example5['monthly'] = example5['fy_total_cost']/12
example5.head()

Unnamed: 0,agency,project_title,fy_total_cost,project_start_date,project_end_date,monthly
16,NIFA,HATCH ACT OF 1887 - NY,490142.0,6/15/2013,2/14/2014,40845.166667
17,NIFA,ANIMAL HEALTH AND DISEASE RESEARCH PROGRAM - OH,56954.0,1/1/2013,12/31/2013,4746.166667
18,NIFA,HATCH ACT OF 1887 (MULTISTATE RESEARCH FUND) - NY,765728.0,7/1/2013,2/28/2014,63810.666667
19,NIFA,AGRABILITY OF UTAH,180000.0,2/15/2014,2/14/2016,15000.0
20,NIFA,SGU RESTORING THE BUFFALO ECONOMY EDUCATION PR...,107468.0,3/1/2014,2/28/2017,8955.666667
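
When a column is added to a dataframe that was itself created by subsetting another dataframe, pandas may emit a `SettingWithCopyWarning`. One common way to avoid it, sketched below on made-up rows, is to make the subset an independent copy with `.copy()` first. The column name `source` is hypothetical and only used for illustration.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "agency": ["NIFA", "NIFA", "ARS"],
    "fy_total_cost": [490142.0, 20000.0, 180000.0],
})

# .copy() makes the subset an independent dataframe
subset = toy[toy["fy_total_cost"] >= 50000].copy()

subset["monthly"] = subset["fy_total_cost"] / 12   # computed from another column
subset["source"] = "toy"                           # a single value is repeated in every row

print(subset)
```

The original dataframe `toy` is left unchanged; only `subset` gets the new columns.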


### Grouping and Aggregating Data
#### Group by and aggregation functions
As in SQL, it is possible to group the dataframe by a column, apply an aggregation function to each group, and sort the result.

In [23]:
# calculate how many grants each agency funded
# step1: in the groupby method, we pass the column we want to group by; we can also select the columns
# on which we want to carry out the operation
# step2: use the count method to count the number of cases
# step3: sort the values in descending order (set the ascending parameter to False)
example.groupby('agency')['project_title'].count().sort_values(ascending=False)      

agency
NIFA    2896
FS       184
ARS      112
Name: project_title, dtype: int64

Other useful aggregation functions are:
* `sum()`: sum
* `mean()`: average
* `agg()`: use a python dictionary to specify the aggregation function for each column
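
As a sketch of the `agg()` usage mentioned above, the dictionary below maps each column to the aggregation that should be applied to it. The rows are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "agency": ["NIFA", "NIFA", "ARS", "FS"],
    "project_title": ["A", "B", "C", "D"],
    "fy_total_cost": [100.0, 300.0, 50.0, 25.0],
})

# count the projects and sum the costs per agency, largest total first
summary = (
    toy.groupby("agency")
       .agg({"project_title": "count", "fy_total_cost": "sum"})
       .sort_values("fy_total_cost", ascending=False)
)

# a single aggregation on a single column
mean_cost = toy.groupby("agency")["fy_total_cost"].mean()

print(summary)
print(mean_cost)
```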

### Merging Dataframes
Pandas provides an ability to merge (join) two datasets together, like you can do in SQL. You can store the results in a new dataframe.

In [24]:
merge_df = pd.merge(example2, example3, on="project_title", how="inner")
merge_df.head()

Unnamed: 0,agency_x,project_title,fy_total_cost_x,project_start_date_x,project_end_date_x,agency_y,fy_total_cost_y,project_start_date_y,project_end_date_y
0,NIFA,HATCH ACT OF 1887 - NY,490142.0,6/15/2013,2/14/2014,NIFA,490142.0,6/15/2013,2/14/2014
1,NIFA,HATCH ACT OF 1887 - NY,490142.0,6/15/2013,2/14/2014,NIFA,4411279.0,9/1/2013,8/31/2016
2,NIFA,HATCH ACT OF 1887 - NY,4411279.0,9/1/2013,8/31/2016,NIFA,490142.0,6/15/2013,2/14/2014
3,NIFA,HATCH ACT OF 1887 - NY,4411279.0,9/1/2013,8/31/2016,NIFA,4411279.0,9/1/2013,8/31/2016
4,NIFA,ANIMAL HEALTH AND DISEASE RESEARCH PROGRAM - OH,56954.0,1/1/2013,12/31/2013,NIFA,56954.0,1/1/2013,12/31/2013


In [25]:
merge_df.shape

(999, 9)
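
Two details of the merge output above are worth a closer look: columns that exist in both dataframes get the suffixes `_x` and `_y`, and a key that appears more than once on both sides produces one row per combination. The sketch below reproduces both effects on a few hand-typed rows, and adds `how="left"` for comparison.

```python
import pandas as pd

# Toy data for illustration only
left = pd.DataFrame({
    "project_title": ["HATCH ACT OF 1887 - NY", "HATCH ACT OF 1887 - NY", "AGRABILITY OF UTAH"],
    "fy_total_cost": [490142.0, 4411279.0, 180000.0],
})
right = pd.DataFrame({
    "project_title": ["HATCH ACT OF 1887 - NY", "HATCH ACT OF 1887 - NY"],
    "fy_total_cost": [490142.0, 4411279.0],
})

# inner join: only titles found in both dataframes; 2 x 2 matching rows give 4 rows
inner = pd.merge(left, right, on="project_title", how="inner")

# left join: every row of `left` is kept, unmatched rows get missing values
left_join = pd.merge(left, right, on="project_title", how="left")

print(inner)
print(left_join)
```

This row multiplication is why it is worth checking `shape` after a merge, especially when the join key is not unique.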

### Saving a CSV
You can save a copy of your dataframe as a .csv file.

In [30]:
example.to_csv("~/example_data.csv", encoding='utf8')
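
By default `to_csv` also writes the row index as the first column. The sketch below shows the `index=False` option, which leaves it out. To stay self-contained it writes to an in-memory buffer instead of a file on disk, which is an assumption of this example only; a file path works the same way.

```python
import io
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "agency": ["ARS", "NIFA"],
    "fy_total_cost": [1000.0, 2000.0],
})

buffer = io.StringIO()              # in-memory stand-in for a file on disk
toy.to_csv(buffer, index=False)     # index=False leaves out the row labels
csv_text = buffer.getvalue()

# read the text back in to check the round trip
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```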