# Introduction to Data Analysis using Python

This notebook focuses on data analysis in Python using the pandas library.

I will start by importing the NumPy and pandas libraries.

In [1]:
import numpy as np
import pandas as pd

## Series

A Series is a one-dimensional labeled array that can hold any data type: strings, integers, floating-point numbers, Python objects, etc.

A Series can be created from:

* a Python dictionary
* an ndarray
* a scalar value
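
When the data is a single value, pandas repeats it for every label in the index. The sketch below illustrates that case; the name `s_scalar` and the labels are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# A scalar is broadcast to every label in the index.
s_scalar = pd.Series(5.0, index=['a', 'b', 'c'])
print(s_scalar)
```

An index is needed here, because it tells pandas how many times to repeat the value.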



### From ndarray

If a Series is created from an ndarray, **index** must be the same length as **data**. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [3]:
# np.random.randn(5) draws 5 samples from the standard normal distribution
a = pd.Series(np.random.randn(5), index = ['a','b','c','d','e'])
a

a   -0.056182
b    0.851058
c    1.589496
d   -0.074197
e   -1.166115
dtype: float64

In [4]:
b = pd.Series(np.random.randn(5))
b

0   -0.166011
1    0.221160
2    0.594150
3   -0.318020
4   -1.199077
dtype: float64
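
The index-length rule stated at the start of this section can be checked directly. The following is a minimal sketch with made-up values: passing two labels for three values should raise a `ValueError`, and omitting the index should give the default integer labels.

```python
import numpy as np
import pandas as pd

values = np.arange(3)

try:
    pd.Series(values, index=['a', 'b'])  # 3 values but only 2 labels
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)

# Without an explicit index, the labels are 0 .. len(data) - 1.
default_index = list(pd.Series(values).index)
print(default_index)
```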

### From dict

A Series can also be created from a dictionary, a built-in Python type that maps keys to values. The dictionary keys become the index labels.

In [5]:
d = {'a':1, 'b':2,'c':3, 'd':4}
d

{'a': 1, 'b': 2, 'c': 3, 'd': 4}

In [6]:
c = pd.Series(d)
c

a    1
b    2
c    3
d    4
dtype: int64
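
If an index is passed together with a dictionary, pandas looks up each label in the dictionary and fills labels that are missing with NaN. The example below is a small sketch of that behaviour; the dictionary `prices` and its labels are hypothetical.

```python
import pandas as pd

prices = {'a': 1, 'b': 2}

# 'c' is not a key in the dictionary, so its value becomes NaN,
# which also turns the Series into float64.
s = pd.Series(prices, index=['b', 'c', 'a'])
print(s)
```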

### Slicing

In [7]:
# Positions start at 0 in Python (R starts at 1); .iloc selects by position
a.iloc[0]

-0.056182221090673182

In [8]:
a[:3]

a   -0.056182
b    0.851058
c    1.589496
dtype: float64

In [9]:
a[3:]

d   -0.074197
e   -1.166115
dtype: float64

In [10]:
a[a > a.mean()]

b    0.851058
c    1.589496
dtype: float64

In [12]:
a.iloc[[1,4]]

b    0.851058
e   -1.166115
dtype: float64

A Series can be indexed, and its values can be set, by index label.

In [13]:
c['b']

2

In [15]:
c['c'] = 5
c

a    1
b    2
c    5
d    4
dtype: int64
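
Position-based and label-based slices differ in one detail that is easy to miss: a positional slice excludes its end, while a label slice includes it. A minimal sketch with made-up data, using `.iloc` for positions and `.loc` for labels:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_position = s.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = s.loc['a':'c']   # labels 'a' through 'c'; the end label is included

print(by_position)
print(by_label)
```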

## DataFrames

A DataFrame is a 2-dimensional labeled data structure whose columns can have different data types. It is similar to a SQL table.

It accepts several kinds of input:

* dict
* ndarray
* Series
* another DataFrame
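
As a quick illustration of the input kinds listed above, the sketch below builds a small DataFrame from each of them. All variable and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# dict of equal-length lists: keys become column names
from_dict = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})

# 2-D ndarray: column names are supplied separately
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['x', 'y'])

# a named Series becomes a single column
from_series = pd.DataFrame(pd.Series([1, 2, 3], name='x'))

# another DataFrame
from_frame = pd.DataFrame(from_dict)

print(from_dict.shape, from_array.shape, from_series.shape, from_frame.shape)
```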

### From dict of Series

In [16]:
d = {'a': pd.Series([1,2,3]),
    'b': pd.Series([4,5,6,7])}
d

{'a': 0    1
 1    2
 2    3
 dtype: int64, 'b': 0    4
 1    5
 2    6
 3    7
 dtype: int64}

In [18]:
df = pd.DataFrame(d)
df

     a  b
0  1.0  4
1  2.0  5
2  3.0  6
3  NaN  7


In [19]:
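# The Series in d are labeled 0..3, so none of these labels match and every cell is NaN
pd.DataFrame(d, index = ['one','two','thr','four'])

        a    b
one   NaN  NaN
two   NaN  NaN
thr   NaN  NaN
four  NaN  NaN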
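
The all-NaN result comes from label alignment: the `index` argument selects labels from the input Series rather than renaming them. A minimal sketch, assuming we give the Series the same string labels up front (the name `aligned` is illustrative):

```python
import pandas as pd

labels = ['one', 'two', 'thr', 'four']
data = {
    'a': pd.Series([1, 2, 3], index=labels[:3]),   # no value for 'four'
    'b': pd.Series([4, 5, 6, 7], index=labels),
}

# The labels now match, so only the truly missing cell is NaN.
aligned = pd.DataFrame(data, index=labels)
print(aligned)
```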


In [22]:
#changing the index of df
df.index = ['one','two','thr','four']
df

        a  b
one   1.0  4
two   2.0  5
thr   3.0  6
four  NaN  7


In [23]:
df.columns = ['aaa','bbb']
df

      aaa  bbb
one   1.0    4
two   2.0    5
thr   3.0    6
four  NaN    7


In [24]:
df['ccc'] = df['aaa'] * df['bbb']
df

      aaa  bbb   ccc
one   1.0    4   4.0
two   2.0    5  10.0
thr   3.0    6  18.0
four  NaN    7   NaN


In [25]:
df['ddd'] = 10
df

      aaa  bbb   ccc  ddd
one   1.0    4   4.0   10
two   2.0    5  10.0   10
thr   3.0    6  18.0   10
four  NaN    7   NaN   10


Getting a summary of a DataFrame

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, one to four
Data columns (total 4 columns):
aaa    3 non-null float64
bbb    4 non-null int64
ccc    3 non-null float64
ddd    4 non-null int64
dtypes: float64(2), int64(2)
memory usage: 160.0+ bytes




Getting the dimensions (rows, columns) of a DataFrame



In [34]:
#dimension of dataframe
df.shape

(4, 4)

In [35]:
# total number of elements (rows * columns)
df.size

16

In [36]:
#no. of rows
len(df)

4

In [37]:
#no. of columns
len(df.columns)

4

### Indexing/Selecting Data

The basics of indexing are as follows.

* *df[col]*       : selects a column and returns a Series
* *df.loc[label]* : selects a row by label and returns a Series
* *df.iloc[loc]*  : selects a row by integer location and returns a Series
* *df[5:10]*      : slices rows and returns a DataFrame
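
The four forms in the list above can be compared side by side on a tiny frame. This is only a sketch; `frame` and its row labels are made up for the illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]},
    index=['r1', 'r2', 'r3'],
)

col = frame['x']                # one column, as a Series
row_by_label = frame.loc['r2']  # one row selected by label, as a Series
row_by_pos = frame.iloc[1]      # the same row selected by position
rows = frame[0:2]               # first two rows, as a DataFrame

print(type(col).__name__, type(row_by_label).__name__, type(rows).__name__)
```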

In [26]:
df.loc['one']

aaa     1.0
bbb     4.0
ccc     4.0
ddd    10.0
Name: one, dtype: float64

In [29]:
df.iloc[1]

aaa     2.0
bbb     5.0
ccc    10.0
ddd    10.0
Name: two, dtype: float64

In [30]:
df.iloc[1:4]

      aaa  bbb   ccc  ddd
two   2.0    5  10.0   10
thr   3.0    6  18.0   10
four  NaN    7   NaN   10


In [31]:
df['aaa'].loc['two']

2.0

Indexing based on a column condition

In [41]:
df.loc[(df['bbb'] > 4)]

      aaa  bbb   ccc  ddd
two   2.0    5  10.0   10
thr   3.0    6  18.0   10
four  NaN    7   NaN   10


Indexing based on a column condition and returning only the required columns

In [42]:
df.loc[(df['bbb'] > 4), ['aaa', 'bbb']]

      aaa  bbb
two   2.0    5
thr   3.0    6
four  NaN    7
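
Several conditions can be combined in the same way. The sketch below assumes a small stand-in frame (not the notebook's `df`) and uses `&` for a logical AND. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

frame = pd.DataFrame(
    {'aaa': [1, 2, 3, 4], 'bbb': [4, 5, 6, 7]},
    index=['one', 'two', 'thr', 'four'],
)

# Rows where bbb > 4 AND aaa < 4, keeping only column 'aaa'.
mask = (frame['bbb'] > 4) & (frame['aaa'] < 4)
selected = frame.loc[mask, ['aaa']]
print(selected)
```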


In [48]:
df.iloc[:3, :2]

     aaa  bbb
one  1.0    4
two  2.0    5
thr  3.0    6


In [50]:
df['aaa']

one     1.0
two     2.0
thr     3.0
four    NaN
Name: aaa, dtype: float64

### Descriptive Statistics

Most of these methods take an **axis** argument: axis = 0 (the default) aggregates down the rows and returns one value per column, while axis = 1 aggregates across the columns and returns one value per row.

All such methods have a **skipna** option signaling whether to exclude missing data (*True* by default).
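
A small frame with one missing value makes both arguments concrete. This is a sketch with made-up numbers rather than the notebook's `df`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, 2.0, np.nan], 'y': [10.0, 20.0, 30.0]})

col_totals = frame.sum()            # axis=0: one total per column, NaN skipped
row_totals = frame.sum(axis=1)      # axis=1: one total per row, NaN skipped
strict = frame.sum(skipna=False)    # a column containing NaN sums to NaN

print(col_totals)
print(row_totals)
print(strict)
```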

In [55]:
df

      aaa  bbb   ccc  ddd
one   1.0    4   4.0   10
two   2.0    5  10.0   10
thr   3.0    6  18.0   10
four  NaN    7   NaN   10


In [59]:
df.sum(axis = 1)

one     19.0
two     27.0
thr     37.0
four    17.0
dtype: float64

In [58]:
df.sum(skipna=False)

aaa     NaN
bbb    22.0
ccc     NaN
ddd    40.0
dtype: float64

In [64]:
df.cumsum(skipna=True)

      aaa   bbb   ccc   ddd
one   1.0   4.0   4.0  10.0
two   3.0   9.0  14.0  20.0
thr   6.0  15.0  32.0  30.0
four  NaN  22.0   NaN  40.0


In [62]:
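# standardization: subtract each column's mean and divide by its standard deviation
# (ddd is constant, so its standard deviation is 0 and the result is NaN)
stand = (df - df.mean())/df.std()
stand

      aaa       bbb       ccc  ddd
one  -1.0 -1.161895 -0.949158  NaN
two   0.0 -0.387298 -0.094916  NaN
thr   1.0  0.387298  1.044074  NaN
four  NaN  1.161895       NaN  NaN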
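
The same formula can be checked by hand on a tiny frame. The sketch below uses made-up data and relies on `std()` computing the sample standard deviation (ddof=1), which is the pandas default. A constant column has zero standard deviation, so it standardizes to NaN.

```python
import pandas as pd

frame = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'const': [10.0, 10.0, 10.0]})

# x has mean 2 and sample standard deviation 1, so it maps to -1, 0, 1.
z = (frame - frame.mean()) / frame.std()
print(z)
```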


In [66]:
m = pd.Series([1,2,3,3,3,5,5,1,1,1,3,8,6])
m.mode()

0    1
1    3
dtype: int64

In [77]:
m1 = pd.DataFrame({'A':np.random.randint(0,9,size=20), 'B':np.random.randint(5,15, size=20)})
m1.mode()

   A    B
0  3  5.0
1  7  NaN


### Creating Bins

Continuous values can be discretized using *cut()* (bins of equal width based on the values) and *qcut()* (bins based on sample quantiles).

In [98]:
bi = np.random.randn(30)
bi

array([ 1.41805058,  0.50229556, -0.70807939,  0.14304366,  1.01568414,
       -0.75280382, -0.70346642, -1.35709992,  0.7990657 , -3.08614507,
       -0.37633666, -0.59411499, -0.83032898, -1.18065691,  1.74873333,
       -0.23210661, -1.90833862, -0.6351842 , -0.67019114,  0.21082989,
       -0.6381892 , -1.41421306,  0.13982245, -0.45466041, -0.23013597,
       -0.78124001, -0.11164974,  1.94078826,  1.37348855, -0.08824144])

In [101]:
pd.cut(bi, 5)

[(0.935, 1.941], (-0.07, 0.935], (-1.0754, -0.07], (-0.07, 0.935], (0.935, 1.941], ..., (-1.0754, -0.07], (-1.0754, -0.07], (0.935, 1.941], (0.935, 1.941], (-1.0754, -0.07]]
Length: 30
Categories (5, object): [(-3.0912, -2.0808] < (-2.0808, -1.0754] < (-1.0754, -0.07] < (-0.07, 0.935] < (0.935, 1.941]]

In [102]:
pd.cut(bi,5).value_counts()

(-3.0912, -2.0808]     1
(-2.0808, -1.0754]     4
(-1.0754, -0.07]      15
(-0.07, 0.935]         5
(0.935, 1.941]         5
dtype: int64

In [109]:
pd.qcut(bi,[0,0.25,0.5,0.75,1]).value_counts()

[-3.0861, -0.742]    8
(-0.742, -0.415]     7
(-0.415, 0.194]      7
(0.194, 1.941]       8
dtype: int64
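
The difference between the two functions is easier to see on data with a known answer. In this sketch (values chosen for illustration), one outlier stretches the equal-width bins of `cut()`, while `qcut()` still splits the data at the median. Passing `labels=False` returns integer bin codes instead of intervals.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4, 100]

# Two equal-width bins over the range 1..100: the outlier sits alone.
width_codes = pd.cut(values, 2, labels=False)

# Two quantile bins split at the median (3): sizes are as even as possible.
quantile_codes = pd.qcut(values, 2, labels=False)

width_counts = np.bincount(width_codes)
quantile_counts = np.bincount(quantile_codes)
print(width_counts, quantile_counts)
```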

### Working with Missing Data

In [117]:
df1 = pd.DataFrame({'one':np.random.randn(5), 'two':list('aeiou')}, index = ('a','c','e','g','h'))
df1

        one two
a -0.912216   a
c -0.660245   e
e  0.845405   i
g -0.760448   o
h  0.567884   u


In [151]:
df2 = df1.reindex(['a','b','c','d','e','g','h'])
df2

        one  two
a -0.912216    a
b       NaN  NaN
c -0.660245    e
d       NaN  NaN
e  0.845405    i
g -0.760448    o
h  0.567884    u


In [125]:
df2.isnull()

     one    two
a  False  False
b   True   True
c  False  False
d   True   True
e  False  False
g  False  False
h  False  False


In [126]:
df2.notnull()

     one    two
a   True   True
b  False  False
c   True   True
d  False  False
e   True   True
g   True   True
h   True   True
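
Because `True` counts as 1 when summed, the boolean frame returned by `isnull()` can be reduced to a count of missing values per column. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': [1.0, np.nan, 3.0], 'two': ['a', None, None]})

# Summing the boolean mask counts the missing entries in each column.
missing_per_column = frame.isnull().sum()
print(missing_per_column)
```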


In [152]:
df2['three'] = np.random.randint(0,10, size = len(df2))
df2

        one  two  three
a -0.912216    a      2
b       NaN  NaN      2
c -0.660245    e      0
d       NaN  NaN      9
e  0.845405    i      1
g -0.760448    o      5
h  0.567884    u      0


In [153]:
df2['one'].sum()

-0.9196203150984694

In [154]:
df2.iloc[4, 2] = None
df2

        one  two  three
a -0.912216    a    2.0
b       NaN  NaN    2.0
c -0.660245    e    0.0
d       NaN  NaN    9.0
e  0.845405    i    NaN
g -0.760448    o    5.0
h  0.567884    u    0.0


In [155]:
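# numeric_only=True skips the non-numeric column 'two' (required in recent pandas versions)
df2.mean(numeric_only=True)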

one     -0.183924
three    3.000000
dtype: float64

In [161]:
# fillna returns a new DataFrame; df2 itself is unchanged
df2[['one','three']].fillna(df2[['one','three']].mean())

        one  three
a -0.912216    2.0
b -0.183924    2.0
c -0.660245    0.0
d -0.183924    9.0
e  0.845405    3.0
g -0.760448    5.0
h  0.567884    0.0
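
Note that `fillna` does not modify the frame it is called on, so the result has to be assigned to a name if it should be kept. A small sketch with hypothetical names `original` and `filled`:

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({'one': [1.0, np.nan, 3.0]})

# The mean of the non-missing values (2.0) replaces the NaN in the copy only.
filled = original.fillna(original.mean())

print(original)
print(filled)
```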


In [165]:
df2.fillna(df2.mean(numeric_only=True))

        one  two  three
a -0.912216    a    2.0
b -0.183924  NaN    2.0
c -0.660245    e    0.0
d -0.183924  NaN    9.0
e  0.845405    i    3.0
g -0.760448    o    5.0
h  0.567884    u    0.0


In [166]:
df3 = df2[['one','three']]
df3.fillna(0)

        one  three
a -0.912216    2.0
b  0.000000    2.0
c -0.660245    0.0
d  0.000000    9.0
e  0.845405    0.0
g -0.760448    5.0
h  0.567884    0.0


In [170]:
df2.fillna(df2.mean(numeric_only=True)['one':'three'])

        one  two  three
a -0.912216    a    2.0
b -0.183924  NaN    2.0
c -0.660245    e    0.0
d -0.183924  NaN    9.0
e  0.845405    i    3.0
g -0.760448    o    5.0
h  0.567884    u    0.0


In [171]:
df2.dropna()

        one two  three
a -0.912216   a    2.0
c -0.660245   e    0.0
g -0.760448   o    5.0
h  0.567884   u    0.0
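
By default `dropna()` removes every row that has at least one missing value, which can discard more data than intended. The sketch below, on a made-up frame, contrasts the default with the `how` and `subset` parameters of `DataFrame.dropna`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {'one': [1.0, np.nan, 3.0], 'two': ['a', 'b', None]},
    index=['r1', 'r2', 'r3'],
)

any_missing = frame.dropna()                # drop rows with any missing value
all_missing = frame.dropna(how='all')       # drop rows only if every value is missing
subset_only = frame.dropna(subset=['one'])  # only look at column 'one'

print(list(any_missing.index), list(all_missing.index), list(subset_only.index))
```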


### Groupby

In [181]:
df3 = pd.DataFrame({'A':['one','two','one','one','three','three','two','two','four','two'],
                   'B':['aaa','bbb','ccc','ccc','ddd','bbb','aaa','bbb','eee','aaa'],
                   'C':np.random.randn(10),
                   'D':np.random.rand(10)})
df3

       A    B         C         D
0    one  aaa -1.161795  0.592734
1    two  bbb  0.356687  0.874028
2    one  ccc  0.496305  0.004049
3    one  ccc  0.066921  0.867182
4  three  ddd -0.581905  0.933954
5  three  bbb -0.540484  0.614992
6    two  aaa  0.672938  0.871442
7    two  bbb  1.233322  0.729429
8   four  eee -0.699898  0.962628
9    two  aaa  0.431778  0.665069


In [196]:
# Select the numeric columns explicitly so the string column B is not aggregated
df3.groupby('A', sort=False)[['C','D']].sum()

              C         D
A
one   -0.598570  1.463965
two    2.694726  3.139968
three -1.122389  1.548947
four  -0.699898  0.962628
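
A grouped column can also be summarized with several functions at once through `agg`. This is a minimal sketch on a made-up frame (the name `sales` is hypothetical) so that the results can be verified by hand.

```python
import pandas as pd

sales = pd.DataFrame({
    'A': ['one', 'two', 'one', 'two'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# One row per group, one column per aggregation function.
# sort=False keeps the groups in order of first appearance.
summary = sales.groupby('A', sort=False)['C'].agg(['sum', 'mean', 'count'])
print(summary)
```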


In [197]:
g1.

4