# Introduction to Data Analysis using Python

This notebook focuses on data analysis in Python using the pandas library.

I will start by importing the NumPy and pandas libraries.

In [1]:
import numpy as np
import pandas as pd

## Series

A Series is a one-dimensional labeled array that can hold any data type, such as strings, integers, floating-point numbers, and Python objects.

The data in a Series can be:

* a Python dictionary
* an array (ndarray)
* a scalar value



### From ndarray

If a Series is created from an ndarray, **index** must be the same length as **data**. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [3]:
#np.random.randn(n) draws n random samples from the standard normal distribution
a = pd.Series(np.random.randn(5), index = ['a','b','c','d','e'])
a

a   -0.056182
b    0.851058
c    1.589496
d   -0.074197
e   -1.166115
dtype: float64

In [4]:
b = pd.Series(np.random.randn(5))
b

0   -0.166011
1    0.221160
2    0.594150
3   -0.318020
4   -1.199077
dtype: float64
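The length rule stated at the top of this section can be checked directly. The following is a minimal sketch with made-up values (`values`, `ok`, and the labels are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0])

# Matching lengths: three values, three labels.
ok = pd.Series(values, index=['x', 'y', 'z'])

# Mismatched lengths: three values but only two labels raises ValueError.
try:
    pd.Series(values, index=['x', 'y'])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Without an index, the default labels are 0 .. len(data) - 1.
default_index = list(pd.Series(values).index)

print(mismatch_raised)
print(default_index)
```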

### From dict

A Series can be created from a dictionary, a built-in Python data type that maps keys to values. The keys become the index labels. (Since Python 3.7, a dictionary preserves the order in which keys were inserted.)

In [5]:
d = {'a':1, 'b':2,'c':3, 'd':4}
d

{'a': 1, 'b': 2, 'c': 3, 'd': 4}

In [6]:
c = pd.Series(d)
c

a    1
b    2
c    3
d    4
dtype: int64
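The third kind of input listed earlier, a single value, has no cell of its own in this notebook. A short sketch of it, together with what happens when a dictionary is combined with an explicit index, could look like this (the variable names and values are illustrative assumptions):

```python
import pandas as pd

# A single value is repeated once per index label.
s_scalar = pd.Series(7, index=['a', 'b', 'c'])

# With a dict, an explicit index selects and orders the keys.
# Labels missing from the dict become NaN, which makes the dtype float64.
data = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
s_subset = pd.Series(data, index=['d', 'a', 'z'])

print(s_scalar)
print(s_subset)
```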

### Slicing

In [7]:
#Positions start at 0 in Python (unlike R, which starts at 1); .iloc selects by position
a.iloc[0]

-0.056182221090673182

In [8]:
a[:3]

a   -0.056182
b    0.851058
c    1.589496
dtype: float64

In [9]:
a[3:]

d   -0.074197
e   -1.166115
dtype: float64

In [10]:
a[a > a.mean()]

b    0.851058
c    1.589496
dtype: float64

In [12]:
a.iloc[[1,4]]

b    0.851058
e   -1.166115
dtype: float64
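On a Series with string labels, an integer inside plain square brackets can be read as a position or as a label, and how pandas resolves that depends on the version. The `.loc` (label) and `.iloc` (position) accessors are explicit about it. A small sketch with fixed values, so the results are reproducible (the Series `s` is an illustrative stand-in for `a`):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]              # by position
by_label = s.loc['b']          # by label
picked = s.iloc[[1, 4]]        # positions 1 and 4, i.e. labels 'b' and 'e'
label_slice = s.loc['b':'d']   # label slices include the end label
pos_slice = s.iloc[1:3]        # position slices exclude the end position
above_mean = s[s > s.mean()]   # boolean mask, as in the cell above

print(label_slice)
print(pos_slice)
```

Note the difference between the two slices: `s.loc['b':'d']` returns three rows, while `s.iloc[1:3]` returns two.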

A Series can be indexed by label, and values can be set by label as well.

In [13]:
c['b']

2

In [15]:
c['c'] = 5
c

a    1
b    2
c    5
d    4
dtype: int64

## DataFrames

A DataFrame is a 2-dimensional labeled data structure whose columns can hold different data types. It is similar to a SQL table.

It can be built from several kinds of input:

* dict
* ndarray
* Series
* another DataFrame
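Only the dict input is demonstrated in the cells of this notebook. As a rough sketch of the other three kinds, assuming small made-up inputs (`arr`, `s`, and the labels are illustrative):

```python
import numpy as np
import pandas as pd

# From a 2-D ndarray: row and column labels are supplied separately.
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, index=['r1', 'r2', 'r3'], columns=['x', 'y'])

# From a single named Series: it becomes one column with that name.
s = pd.Series([1.5, 2.5], name='score')
df_ser = pd.DataFrame(s)

# From another DataFrame: the result has the same labels and values.
df_from_df = pd.DataFrame(df_arr)

print(df_arr)
print(df_ser)
```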

### From dict of Series

In [16]:
d = {'a': pd.Series([1,2,3]),
    'b': pd.Series([4,5,6,7])}
d

{'a': 0    1
 1    2
 2    3
 dtype: int64, 'b': 0    4
 1    5
 2    6
 3    7
 dtype: int64}

In [18]:
df = pd.DataFrame(d)
df

|   | a | b |
|---|---|---|
| 0 | 1.0 | 4 |
| 1 | 2.0 | 5 |
| 2 | 3.0 | 6 |
| 3 | NaN | 7 |


In [19]:
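#the Series in d are labeled 0..3, so none of the requested labels match and every value is NaN
pd.DataFrame(d, index = ['one','two','thr','four'])

|   | a | b |
|---|---|---|
| one | NaN | NaN |
| two | NaN | NaN |
| thr | NaN | NaN |
| four | NaN | NaN |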
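The empty result above comes from label alignment: the `index` argument selects rows by label from the input Series, and those Series only carry the default labels 0 to 3. A minimal sketch of the effect, and of one way to get labeled rows at construction time (`labels` and `d_labelled` are illustrative names):

```python
import pandas as pd

labels = ['one', 'two', 'thr', 'four']

# Series with default integer labels, as in the cell above.
d = {'a': pd.Series([1, 2, 3]), 'b': pd.Series([4, 5, 6, 7])}

# None of the requested labels exist in the Series, so every cell is NaN.
all_nan = pd.DataFrame(d, index=labels)

# Labelling the Series themselves lets the same index request line up.
d_labelled = {
    'a': pd.Series([1, 2, 3], index=labels[:3]),
    'b': pd.Series([4, 5, 6, 7], index=labels),
}
aligned = pd.DataFrame(d_labelled, index=labels)

print(all_nan)
print(aligned)
```

Assigning to `df.index` after construction, as the next cell does, relabels the existing rows instead of selecting by label, which is why it keeps the values.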


In [22]:
#changing the index of df
df.index = ['one','two','thr','four']
df

|   | a | b |
|---|---|---|
| one | 1.0 | 4 |
| two | 2.0 | 5 |
| thr | 3.0 | 6 |
| four | NaN | 7 |


In [23]:
df.columns = ['aaa','bbb']
df

|   | aaa | bbb |
|---|---|---|
| one | 1.0 | 4 |
| two | 2.0 | 5 |
| thr | 3.0 | 6 |
| four | NaN | 7 |


In [24]:
df['ccc'] = df['aaa'] * df['bbb']
df

|   | aaa | bbb | ccc |
|---|---|---|---|
| one | 1.0 | 4 | 4.0 |
| two | 2.0 | 5 | 10.0 |
| thr | 3.0 | 6 | 18.0 |
| four | NaN | 7 | NaN |


In [25]:
df['ddd'] = 10
df

|   | aaa | bbb | ccc | ddd |
|---|---|---|---|---|
| one | 1.0 | 4 | 4.0 | 10 |
| two | 2.0 | 5 | 10.0 | 10 |
| thr | 3.0 | 6 | 18.0 | 10 |
| four | NaN | 7 | NaN | 10 |


Getting a summary of a DataFrame

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, one to four
Data columns (total 4 columns):
aaa    3 non-null float64
bbb    4 non-null int64
ccc    3 non-null float64
ddd    4 non-null int64
dtypes: float64(2), int64(2)
memory usage: 160.0+ bytes




Getting the dimensions (rows, columns) of a DataFrame



In [34]:
#dimension of dataframe
df.shape

(4, 4)

In [35]:
#total number of elements (rows * columns)
df.size

16

In [36]:
#no. of rows
len(df)

4

In [37]:
#no. of columns
len(df.columns)

4

### Indexing/Selecting Data

The basics of indexing are as follows.

* *df[col]*       : selects a column and returns a Series
* *df.loc[label]* : selects a row by label and returns a Series
* *df.iloc[loc]*  : selects a row by integer location and returns a Series
* *df[5:10]*      : slices rows by position and returns a DataFrame
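The four forms in the list above can be tried side by side. This sketch rebuilds a two-column version of the notebook's `df` with fixed values so it runs on its own (the constructor call is an assumption about how to reproduce that frame, not the notebook's original cell):

```python
import pandas as pd

df = pd.DataFrame(
    {'aaa': [1.0, 2.0, 3.0, float('nan')], 'bbb': [4, 5, 6, 7]},
    index=['one', 'two', 'thr', 'four'],
)

col = df['aaa']               # column, returned as a Series
row_by_label = df.loc['two']  # row by label, returned as a Series
row_by_pos = df.iloc[1]       # row by position, returned as a Series
rows = df[1:3]                # row slice by position, returned as a DataFrame

print(type(col).__name__, type(rows).__name__)
print(list(rows.index))
```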

In [26]:
df.loc['one']

aaa     1.0
bbb     4.0
ccc     4.0
ddd    10.0
Name: one, dtype: float64

In [29]:
df.iloc[1]

aaa     2.0
bbb     5.0
ccc    10.0
ddd    10.0
Name: two, dtype: float64

In [30]:
df.iloc[1:4]

|   | aaa | bbb | ccc | ddd |
|---|---|---|---|---|
| two | 2.0 | 5 | 10.0 | 10 |
| thr | 3.0 | 6 | 18.0 | 10 |
| four | NaN | 7 | NaN | 10 |


In [31]:
df['aaa'].loc['two']

2.0

Selecting rows based on a column condition

In [41]:
df.loc[(df['bbb'] > 4)]

|   | aaa | bbb | ccc | ddd |
|---|---|---|---|---|
| two | 2.0 | 5 | 10.0 | 10 |
| thr | 3.0 | 6 | 18.0 | 10 |
| four | NaN | 7 | NaN | 10 |


Selecting rows based on a column condition and returning only the required columns

In [42]:
df.loc[(df['bbb'] > 4), ['aaa', 'bbb']]

|   | aaa | bbb |
|---|---|---|
| two | 2.0 | 5 |
| thr | 3.0 | 6 |
| four | NaN | 7 |
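Conditions like the one above can also be combined. The following is a sketch on a two-column copy of the frame with fixed values (the thresholds are arbitrary and chosen only for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {'aaa': [1.0, 2.0, 3.0, float('nan')], 'bbb': [4, 5, 6, 7]},
    index=['one', 'two', 'thr', 'four'],
)

# Wrap each condition in parentheses and combine with & (and) or | (or).
both = df.loc[(df['bbb'] > 4) & (df['aaa'] < 3)]
either = df.loc[(df['bbb'] > 6) | (df['aaa'] == 1)]

# Comparisons against NaN are False, so row 'four' never passes a test on 'aaa'.
# notna() selects the rows where a value is present.
has_value = df.loc[df['aaa'].notna(), ['aaa']]

print(both)
print(either)
print(has_value)
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why the `&` and `|` operators are used here.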


In [48]:
df.iloc[:3, :2]

|   | aaa | bbb |
|---|---|---|
| one | 1.0 | 4 |
| two | 2.0 | 5 |
| thr | 3.0 | 6 |


In [50]:
df['aaa']

one     1.0
two     2.0
thr     3.0
four    NaN
Name: aaa, dtype: float64