Key benefits of using pandas:
- Built-in handling of missing data.
- Loading and saving data in many file formats (CSV, Excel, HDF5, etc.).
- Support for heterogeneous data.
- SQL-style joining and merging of data tables.
- Size mutability: columns can be inserted into and deleted from a table.
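
A few of these benefits can be sketched in a handful of lines. The tables and column names used here (`people`, `cities`, `age`, `city_id`) are made up purely for illustration and are not part of the data used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: mixed dtypes and one missing age.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [31.0, np.nan, 45.0],
    "city_id": [1, 2, 1],
})
cities = pd.DataFrame({"city_id": [1, 2], "city": ["Oslo", "Lima"]})

# Missing data: count the gaps, then fill them with the column mean.
n_missing = int(people["age"].isna().sum())
people["age"] = people["age"].fillna(people["age"].mean())

# SQL-style join on a shared key column.
merged = people.merge(cities, on="city_id", how="left")

# Size mutability: insert one column, then delete another.
merged["is_adult"] = merged["age"] >= 18
merged = merged.drop(columns="city_id")

print(n_missing)
print(merged)
```

Filling with the mean is only one possible strategy; dropping the incomplete rows with `dropna()` would be an equally valid choice depending on the data.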


In [1]:
# First, import the libraries
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()



In [2]:
# machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

### Series and DataFrame 

A **Series** is a one-dimensional vector of data with an index that labels every value in that vector.

In [3]:
Week = pd.Series([ "Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"])
Week

0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
5     Saturday
6       Sunday
dtype: object

**NOTE:** 

- The values of a Series are stored as a NumPy array.
- The index of a Series is a pandas Index object.

In [4]:
Week.values

array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'], dtype=object)

In [5]:
Week.index

Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')

Here the index of the Series was assigned automatically. However, we can manually assign any meaningful labels to the data in a Series:

In [6]:
Week = pd.Series([ "Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"],
                index=["Day_1","Day_2","Day_3","Day_4","Day_5","Day_6","Day_7"])
Week

Day_1       Monday
Day_2      Tuesday
Day_3    Wednesday
Day_4     Thursday
Day_5       Friday
Day_6     Saturday
Day_7       Sunday
dtype: object

In [7]:
# We can access any element using its label
Week['Day_3']

'Wednesday'

In [8]:
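# We can still access elements by position; .iloc makes this explicit
# (plain Week[2] on a labelled Series is deprecated in recent pandas versions)
Week.iloc[2]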

'Wednesday'
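
To keep label-based and position-based access apart, it can help to use the explicit `.loc` and `.iloc` accessors. The snippet below is a minimal sketch on a shortened, three-day version of the Series (the lowercase name `week` is used here only to avoid overwriting `Week`).

```python
import pandas as pd

week = pd.Series(
    ["Monday", "Tuesday", "Wednesday"],
    index=["Day_1", "Day_2", "Day_3"],
)

by_label = week.loc["Day_3"]             # label-based lookup
by_position = week.iloc[2]               # position-based lookup (0-indexed)
label_slice = week.loc["Day_1":"Day_2"]  # label slices include both endpoints
position_slice = week.iloc[0:2]          # position slices exclude the stop

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

Note the difference in slicing: the label slice includes `Day_2`, while the positional slice stops before position 2, so both return the same two days here.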

A **DataFrame** is a tabular data structure that stores Series as its columns. pandas also allows creation and manipulation of higher-dimensional data, but internally it stores the data in two-dimensional form.
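
The relationship between the two structures can be checked directly. In this small sketch the Series `temps` and `rain` and their values are hypothetical; the point is only that a table built from Series hands each column back as a Series.

```python
import pandas as pd

# Hypothetical columns; both Series share the same index labels.
temps = pd.Series([21.5, 19.0, 23.1], index=["Mon", "Tue", "Wed"])
rain = pd.Series([0, 12, 3], index=["Mon", "Tue", "Wed"])

frame = pd.DataFrame({"temp_c": temps, "rain_mm": rain})

column = frame["temp_c"]
print(type(column).__name__)  # each column comes back as a Series
print(frame.shape)            # (rows, columns)
print(frame.dtypes)           # each column keeps its own dtype
```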

In [9]:
df = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })

In [10]:
df

   A           B  C  D      E    F
0  1  2013-01-02  1  3   test  foo
1  1  2013-01-02  1  3  train  foo
2  1  2013-01-02  1  3   test  foo
3  1  2013-01-02  1  3  train  foo


In [11]:
df.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [12]:
# get a quick statistical summary of the data
df.describe()

       A  C  D
count  4  4  4
mean   1  1  3
std    0  0  0
min    1  1  3
25%    1  1  3
50%    1  1  3
75%    1  1  3
max    1  1  3
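
The summary above covers only some of the columns; the categorical and string columns are left out. One way to bring such columns in is the `include` parameter of `describe()`. The frame `small` below is a hypothetical stand-in, not the `df` defined earlier.

```python
import pandas as pd

small = pd.DataFrame({
    "score": [1.0, 2.0, 3.0, 4.0],
    "group": ["test", "train", "test", "train"],
})

numeric_summary = small.describe()            # numeric columns only
full_summary = small.describe(include="all")  # adds unique/top/freq rows for the rest

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```

Statistics that do not apply to a column (for example `mean` for `group`) show up as `NaN` in the combined summary.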


In [13]:
# Transposing the data frame
df.T

                     0                    1                    2                    3
A                    1                    1                    1                    1
B  2013-01-02 00:00:00  2013-01-02 00:00:00  2013-01-02 00:00:00  2013-01-02 00:00:00
C                    1                    1                    1                    1
D                    3                    3                    3                    3
E                 test                train                 test                train
F                  foo                  foo                  foo                  foo


In [14]:
# selecting a column
df['E']

0     test
1    train
2     test
3    train
Name: E, dtype: category
Categories (2, object): [test, train]

In [15]:
df.E

0     test
1    train
2     test
3    train
Name: E, dtype: category
Categories (2, object): [test, train]
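
Both `df['E']` and `df.E` return the same Series, but the attribute form only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. The sketch below uses a hypothetical frame with two awkward column names (`count` and `unit price`) to show where the bracket form is needed.

```python
import pandas as pd

# Hypothetical frame: one column name collides with a DataFrame method,
# another contains a space.
frame = pd.DataFrame({
    "E": ["test", "train"],
    "count": [10, 20],
    "unit price": [1.5, 2.5],
})

same = frame["E"].equals(frame.E)       # both forms return the same Series
counts = frame["count"]                 # brackets always reach the column
attr_is_method = callable(frame.count)  # attribute access finds the method instead
prices = frame["unit price"]            # names with spaces need brackets

print(same, attr_is_method)
```

For this reason, bracket access is usually the safer default in scripts, with attribute access kept for quick interactive exploration.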

In [16]:
# slicing rows
df[1:3]

   A           B  C  D      E    F
1  1  2013-01-02  1  3  train  foo
2  1  2013-01-02  1  3   test  foo
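
With the default integer index it is hard to tell whether `df[1:3]` slices by position or by label. A small sketch with a hypothetical string-indexed frame makes the behaviour easier to see: an integer slice in square brackets selects rows by position.

```python
import pandas as pd

frame = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_position = frame[1:3]    # rows at positions 1 and 2
explicit = frame.iloc[1:3]  # same rows, with the intent spelled out
by_label = frame["b":"c"]   # label slice, both endpoints included

print(by_position.index.tolist())
print(by_label.index.tolist())
```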


In [17]:
# selection by label