# Data analysis in Python with pandas

## What is pandas?

pandas: An open source Python library for data analysis, data manipulation, and data visualisation.

Pros:
1. Tons of functionality
2. Well supported by community
3. Active development
4. Extensive documentation
5. Plays well with other packages, e.g. NumPy and scikit-learn

In [19]:
import pandas as pd

## How do I read a tabular data file into pandas?

Tabular data file: By default, read_table assumes a tab-separated file (tsv)

In [7]:
orders = pd.read_table('http://bit.ly/chiporders')

In [9]:
orders.head()

,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
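
The default separator can be checked without a network connection. The sketch below is only an illustration: the in-memory text and the names tsv_text, sample_orders and same_orders are hypothetical stand-ins, not part of the original dataset.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for a small tab-separated file.
tsv_text = "order_id\tquantity\titem_name\n1\t1\tIzze\n2\t2\tChicken Bowl\n"

# read_table splits on tabs by default, so no separator argument is needed.
sample_orders = pd.read_table(io.StringIO(tsv_text))

# read_csv parses the same text once the separator is set explicitly.
same_orders = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(sample_orders)
print(sample_orders.equals(same_orders))
```

Both calls should produce the same DataFrame, which suggests read_table is mostly a convenience for tab-separated input.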


In [17]:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('http://bit.ly/movieusers', delimiter='|', header=None, names=user_cols)

In [18]:
users.head()

,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
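
To see why header=None and names matter for a file without a header row, here is a minimal sketch on in-memory data. The two sample rows are copied from the output above; the variable names raw, cols, with_names and without_names are assumptions made for illustration.

```python
import io

import pandas as pd

# Two pipe-separated rows with no header line (hypothetical in-memory file).
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n"
cols = ["user_id", "age", "gender", "occupation", "zip_code"]

# header=None tells pandas the first line is data; names supplies the labels.
with_names = pd.read_table(io.StringIO(raw), sep="|", header=None, names=cols)

# Without those arguments the first data row is consumed as the header.
without_names = pd.read_table(io.StringIO(raw), sep="|")

print(with_names)
print(len(with_names), len(without_names))
```

In the second call one row of data is lost to the header, which is the mistake the header=None argument guards against.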


> Tip: The skiprows and skipfooter parameters are useful for omitting extra lines at the start or end of a file.
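
A small sketch of that tip, assuming a hypothetical export with one title line above the header and one summary line at the bottom. The file contents and the name trimmed are invented for illustration. The engine="python" argument is passed because skipfooter is not supported by the default C parser.

```python
import io

import pandas as pd

# Hypothetical file: a title line, the real header, two rows, a summary line.
raw = (
    "Exported report\n"
    "order_id\tquantity\n"
    "1\t1\n"
    "2\t2\n"
    "Total rows: 2\n"
)

# Skip one line at the top and one at the bottom.
# skipfooter needs the Python parsing engine.
trimmed = pd.read_table(
    io.StringIO(raw), skiprows=1, skipfooter=1, engine="python"
)

print(trimmed)
```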

## How do I select a pandas Series from a DataFrame?

Two basic data structures in pandas:
1. DataFrame: Table with rows and columns
2. Series: Each column of a DataFrame is a pandas Series

In [20]:
ufo = pd.read_csv('http://bit.ly/uforeports')

In [21]:
ufo.head()

,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [22]:
type(ufo)

pandas.core.frame.DataFrame

In [24]:
type(ufo['City'])

pandas.core.series.Series

In [26]:
city = ufo.City

In [27]:
city.head()

0                  Ithaca
1             Willingboro
2                 Holyoke
3                 Abilene
4    New York Worlds Fair
Name: City, dtype: object
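
Bracket notation and dot notation select the same Series, but dot notation only works when the column name is a valid Python identifier. The sketch below uses a tiny hypothetical DataFrame (sample) built from values shown above rather than the full ufo dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in for the ufo DataFrame.
sample = pd.DataFrame(
    {
        "City": ["Ithaca", "Holyoke"],
        "Shape Reported": ["TRIANGLE", "OVAL"],
    }
)

# Both notations return the same Series for a simple column name.
by_bracket = sample["City"]
by_dot = sample.City
print(by_bracket.equals(by_dot))

# A name containing a space cannot be written as an attribute,
# so bracket notation is the only option here.
shape = sample["Shape Reported"]
print(type(shape).__name__, shape.name)
```

Bracket notation is also the safer choice when a column name collides with an existing DataFrame attribute or method, such as shape or head.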

> Tip: To create a new Series in a DataFrame, assign to it with bracket notation.

In [28]:
ufo['location'] = ufo.City + ', ' + ufo.State

In [29]:
ufo.head()

,City,Colors Reported,Shape Reported,State,Time,location
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,"Ithaca, NY"
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,"Willingboro, NJ"
2,Holyoke,,OVAL,CO,2/15/1931 14:00,"Holyoke, CO"
3,Abilene,,DISK,KS,6/1/1931 13:00,"Abilene, KS"
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,"New York Worlds Fair, NY"
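
The assignment above uses bracket notation on the left-hand side for a reason: assigning through dot notation sets a plain Python attribute and does not add a column. A minimal sketch, assuming a hypothetical one-row DataFrame named sample and an illustrative attribute name place:

```python
import warnings

import pandas as pd

# Hypothetical one-row stand-in for the ufo DataFrame.
sample = pd.DataFrame({"City": ["Ithaca"], "State": ["NY"]})

# Bracket notation on the left creates a real column.
sample["location"] = sample.City + ", " + sample.State

# Dot notation on the left only sets an attribute; pandas warns about this,
# so the warning is silenced here to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    sample.place = sample.City + ", " + sample.State

print(list(sample.columns))
print("place" in sample.columns)
```

Reading an existing column with dot notation on the right-hand side is fine; only creating a column requires brackets.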
