# Pandas

###### Pandas is a standard Python library for data analysis. It provides high-level data structures and functions designed to make working with structured or tabular data fast and easy. Pandas is built on top of NumPy, the library used to handle arrays in Python. It blends the high-performance array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL).

### Importing the library

In [1]:
import pandas as pd # pd is the conventional alias for the pandas library

In [2]:
from pandas import Series, DataFrame # import the Series and DataFrame data structures directly

### The data structures most widely used in pandas are

###### 1) Series
###### 2) DataFrame

## Series

###### A Series in pandas is a one-dimensional array-like object that stores a sequence of values and an associated array of data labels, called its index. By default the index is the integers 0 through N-1, the same positions you would use to index an array.

In [3]:
A = pd.Series([2, 3, -9, 5, 3, 7, 8]) # creating a Series from a list of values

In [4]:
A # this is what a Series looks like

0    2
1    3
2   -9
3    5
4    3
5    7
6    8
dtype: int64

In [5]:
A.values # this is to get all the values of a series in an array object

array([ 2,  3, -9,  5,  3,  7,  8], dtype=int64)

In [6]:
A.index # this is to get information about the indexing of the series

RangeIndex(start=0, stop=7, step=1)

### Creating a Series with a custom index by passing the data and an array of index labels

In [7]:
B = pd.Series([3, 2, 9, 21,6,87,98], index=['e', 'f', 'z', 'a','s','h','l'])

In [8]:
B

e     3
f     2
z     9
a    21
s     6
h    87
l    98
dtype: int64

In [9]:
B['s'] # accessing a value by a single index label

6

In [10]:
B[['a', 'f', 'h']] # accessing several values with a list of labels

a    21
f     2
h    87
dtype: int64

In [11]:
B[B > 10] #getting all the elements in the series that are larger than 10

a    21
h    87
l    98
dtype: int64

In [12]:
B * 2 # multiplying each element with 2

e      6
f      4
z     18
a     42
s     12
h    174
l    196
dtype: int64

In [13]:
'g' in B # checking whether 'g' is one of B's index labels

False
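
A point that is easy to miss: the `in` operator on a Series looks at the index labels, not at the stored values. The short sketch below rebuilds the same Series B and compares the two kinds of lookup; the variable names are only illustrative.

```python
import pandas as pd

# Same data as the Series B defined earlier
B = pd.Series([3, 2, 9, 21, 6, 87, 98],
              index=['e', 'f', 'z', 'a', 's', 'h', 'l'])

label_found = 's' in B          # 's' is an index label
value_as_label = 21 in B        # 21 is a stored value, not a label
value_found = 21 in B.values    # searches the underlying NumPy array

print(label_found, value_as_label, value_found)
```

To test whether a value (rather than a label) is present, search `B.values` as in the last lookup.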

In [14]:
dict_data = {'delhi': 53000, 'mumbai': 36100, 'jaipur': 63000, 'faridad': 87000} # a regular dictionary

#### Creating series using a dictionary

In [15]:
C = pd.Series(dict_data) 

In [16]:
C

delhi      53000
mumbai     36100
jaipur     63000
faridad    87000
dtype: int64

In [17]:
C.isnull() # flags each entry: True where the value is null/missing

delhi      False
mumbai     False
jaipur     False
faridad    False
dtype: bool
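
Every entry above is False because the dictionary had a value for each label. As a small sketch of a case where isnull reports True, the same dictionary can be passed together with an explicit index; the label 'pune' is a made-up example that does not appear in the dictionary, so it has no value to take.

```python
import pandas as pd

dict_data = {'delhi': 53000, 'mumbai': 36100, 'jaipur': 63000, 'faridad': 87000}

# 'pune' is a hypothetical label with no entry in dict_data
cities = ['delhi', 'mumbai', 'pune']

D = pd.Series(dict_data, index=cities)   # only the requested labels are kept
print(D)
print(D.isnull())

# Labels whose value is missing
missing_labels = list(D[D.isnull()].index)
print(missing_labels)
```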

## DataFrame

###### A DataFrame represents a rectangular table of data that contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all having the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dictionary, or some other collection of one-dimensional arrays.
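
To make the "dictionary of Series sharing one index" picture concrete, here is a minimal sketch that builds a DataFrame from two Series with the same index. The Series names and the labels 'a', 'b', 'c' are illustrative only.

```python
import pandas as pd

# Two Series that share the same index labels
year = pd.Series([2002, 2003, 2004], index=['a', 'b', 'c'])
population = pd.Series([2.5, 3.7, 1.6], index=['a', 'b', 'c'])

# Each dictionary key becomes a column; the shared index becomes the row index
table = pd.DataFrame({'year': year, 'pop': population})

print(table)
print(table.dtypes)   # each column keeps its own value type
```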

In [18]:
data = {'state': ['delhi', 'delhi', 'delhi', 'mumbai', 'mumbai', 'mumbai'],
 'year': [2002, 2003, 2004, 2005, 2003, 2002],
 'pop': [2.5, 3.7, 1.6, 3.4, 4.9, 6.2]} # a regular dictionary

###### Creating a DataFrame from a dictionary

In [19]:
df = pd.DataFrame(data)

In [20]:
df # this is what a DataFrame looks like

    state  year  pop
0   delhi  2002  2.5
1   delhi  2003  3.7
2   delhi  2004  1.6
3  mumbai  2005  3.4
4  mumbai  2003  4.9
5  mumbai  2002  6.2


When a DataFrame is large, we can use the head function to display the first five rows and the tail function to display the last five. We can also change the number of rows displayed by passing that number as an argument to head or tail.

In [21]:
df.head() 

    state  year  pop
0   delhi  2002  2.5
1   delhi  2003  3.7
2   delhi  2004  1.6
3  mumbai  2005  3.4
4  mumbai  2003  4.9
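
As a brief sketch of passing a row count to head and tail, the example below rebuilds the same DataFrame and asks for the first two and the last three rows.

```python
import pandas as pd

data = {'state': ['delhi', 'delhi', 'delhi', 'mumbai', 'mumbai', 'mumbai'],
        'year': [2002, 2003, 2004, 2005, 2003, 2002],
        'pop': [2.5, 3.7, 1.6, 3.4, 4.9, 6.2]}
df = pd.DataFrame(data)

first_two = df.head(2)    # first 2 rows instead of the default 5
last_three = df.tail(3)   # last 3 rows instead of the default 5

print(first_two)
print(last_three)
```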


In [22]:
df['pop'] # getting the Series stored in the pop column

0    2.5
1    3.7
2    1.6
3    3.4
4    4.9
5    6.2
Name: pop, dtype: float64

In [23]:
df.year # getting the values in the year column using attribute access

0    2002
1    2003
2    2004
3    2005
4    2003
5    2002
Name: year, dtype: int64

In [24]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four',
                             'five', 'six']) # explicit column order and row labels

When creating a DataFrame from a dictionary, we can specify the index and the columns. If a requested column is present in the dictionary, its values are used; if not, that column is filled with null/missing values.

In [25]:
frame2

       year   state  pop debt
one    2002   delhi  2.5  NaN
two    2003   delhi  3.7  NaN
three  2004   delhi  1.6  NaN
four   2005  mumbai  3.4  NaN
five   2003  mumbai  4.9  NaN
six    2002  mumbai  6.2  NaN


In [26]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])


In [27]:
frame2

       year   state  pop debt
one    2002   delhi  2.5  NaN
two    2003   delhi  3.7  NaN
three  2004   delhi  1.6  NaN
four   2005  mumbai  3.4  NaN
five   2003  mumbai  4.9  NaN
six    2002  mumbai  6.2  NaN


In [28]:
frame2['debt'] = val # assigning a whole column; values are aligned by index label and unmatched rows get NaN

In [29]:
frame2

       year   state  pop  debt
one    2002   delhi  2.5   NaN
two    2003   delhi  3.7  -1.2
three  2004   delhi  1.6   NaN
four   2005  mumbai  3.4  -1.5
five   2003  mumbai  4.9  -1.7
six    2002  mumbai  6.2   NaN


In [30]:
frame2['northern'] = frame2.state == 'delhi' # assigning to a column name that does not exist yet adds a new column

In [31]:
frame2

       year   state  pop  debt  northern
one    2002   delhi  2.5   NaN      True
two    2003   delhi  3.7  -1.2      True
three  2004   delhi  1.6   NaN      True
four   2005  mumbai  3.4  -1.5     False
five   2003  mumbai  4.9  -1.7     False
six    2002  mumbai  6.2   NaN     False


In [32]:
del frame2['northern'] # this is how you delete a column

In [33]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}} # nested dict: outer keys become columns, inner keys become row labels


In [34]:
frame3 = pd.DataFrame(pop)

In [35]:
frame3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6


In [36]:
frame3.T # transposing the DataFrame (rows and columns are swapped)

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


In [37]:
frame2.values # getting all the values in the DataFrame as a two-dimensional array

array([[2002, 'delhi', 2.5, nan],
       [2003, 'delhi', 3.7, -1.2],
       [2004, 'delhi', 1.6, nan],
       [2005, 'mumbai', 3.4, -1.5],
       [2003, 'mumbai', 4.9, -1.7],
       [2002, 'mumbai', 6.2, nan]], dtype=object)

In [38]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [39]:
index = obj.index # storing the index object of the Series

In [40]:
index

Index(['a', 'b', 'c'], dtype='object')

In [41]:
index[0:2] # slicing: selects positions 0 to 2-1=1; in general index[n:m]
           # returns the labels at positions n to m-1

Index(['a', 'b'], dtype='object')

In [42]:
frame3.columns # getting the column labels

Index(['Nevada', 'Ohio'], dtype='object')

In [43]:
'Ohio' in frame3.columns # checking whether 'Ohio' is one of the column labels

True

In [44]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [45]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e']) # reindex rearranges the data to match a new index; labels not present before (here 'e') get NaN

In [46]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
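
If NaN is not wanted for the newly introduced labels, reindex accepts a fill_value argument. The sketch below reuses the same Series and fills the missing label 'e' with 0.0; the choice of 0.0 is only an assumption for illustration.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Labels missing from obj ('e') receive fill_value instead of NaN
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)

print(filled)
print(obj)   # reindex returns a new Series; obj itself is unchanged
```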

In [47]:
new_obj = obj.drop('c') # drop returns a new Series without the entry labeled 'c'; pass inplace=True to modify obj itself instead

In [48]:
new_obj

d    4.5
b    7.2
a   -5.3
dtype: float64
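
The difference between the default behaviour of drop and inplace=True can be sketched as follows, using the same Series. The variable names are illustrative.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Default: a new Series is returned and obj keeps all four entries
copy_without_c = obj.drop('c')
still_has_c = 'c' in obj

# inplace=True: obj itself is modified and the call returns None
result = obj.drop('c', inplace=True)

print(copy_without_c)
print(obj)
```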

In [49]:
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])

In [50]:
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [51]:
data.drop(['Colorado', 'Ohio']) # removing multiple rows by label

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


In [52]:
data.drop('three', axis=1) # removing a column; axis=0 (the default) refers to rows
                           # and axis=1 refers to columns

          one  two  four
Ohio        0    1     3
Colorado    4    5     7
Utah        8    9    11
New York   12   13    15
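
Instead of remembering which axis number means what, drop also accepts the index and columns keyword arguments. This sketch rebuilds the same DataFrame and shows both forms side by side.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

no_ohio = data.drop(index='Ohio')            # same as data.drop('Ohio', axis=0)
no_three = data.drop(columns='three')        # same as data.drop('three', axis=1)
no_three_axis = data.drop('three', axis=1)

print(no_ohio)
print(no_three)
```

In both cases data itself is left unchanged, because drop returns a new object by default.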


In [53]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [54]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [55]:
obj[2:4] # slicing a Series by position: returns the entries at positions 2 and 3

c    2.0
d    3.0
dtype: float64
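
Slicing by position and slicing by label treat the end point differently: a positional slice excludes the end, while a label slice includes it. The sketch below uses iloc and loc to make the intent explicit for each kind of slice.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_position = obj.iloc[2:4]    # positions 2 and 3; position 4 is excluded
by_label = obj.loc['b':'c']    # labels 'b' through 'c'; 'c' is included

print(by_position)
print(by_label)
```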

In [56]:
obj[['b', 'a', 'd']] # selecting several entries with a list of labels

b    1.0
a    0.0
d    3.0
dtype: float64

In [57]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [58]:
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [59]:
data[data['three'] > 5] # keeping only the rows where column 'three' is greater than 5

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [60]:
data < 5 # element-wise comparison: returns a DataFrame of booleans

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [61]:
data.loc['Colorado', ['two', 'three']] # loc selects by label: row 'Colorado', columns 'two' and 'three'

two      5
three    6
Name: Colorado, dtype: int32

In [62]:
data.iloc[2, [3, 0, 1]] # iloc selects by integer position: row 2 (Utah), columns 3, 0 and 1

four    11
one      8
two      9
Name: Utah, dtype: int32

In [63]:
data.iloc[[1, 2], [3, 0, 1]] # rows at positions 1 and 2, columns at positions 3, 0 and 1

          four  one  two
Colorado     7    4    5
Utah        11    8    9
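
The loc and iloc selections above can describe exactly the same cells, once by label and once by position. The following sketch rebuilds the DataFrame and checks that the two forms agree; it also shows a label slice with loc, which includes its end label.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc[['Colorado', 'Utah'], ['four', 'one', 'two']]
by_position = data.iloc[[1, 2], [3, 0, 1]]
print(by_label.equals(by_position))

# Label slice: all rows up to and including 'Utah', column 'two' only
rows_slice = data.loc[:'Utah', 'two']
print(rows_slice)
```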


In [64]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon']) # random values, so the numbers differ on every run


In [65]:
np.abs(frame) # NumPy functions work element-wise on a DataFrame; here, the absolute value of every entry

               b         d         e
Utah    0.694292  0.252969  0.228369
Ohio    0.771812  1.096570  1.836151
Texas   0.988264  0.566439  0.481219
Oregon  0.647569  0.719157  0.660036


In [66]:
f = lambda x: x.max() - x.min() # difference between the largest and smallest value of a Series

In [67]:
frame.apply(f) # by default f is called once per column

b    1.760076
d    1.815727
e    2.496187
dtype: float64
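
By default apply calls the function once per column, as above. Passing axis='columns' calls it once per row instead. The sketch below uses a small DataFrame with fixed, made-up values (rather than random ones) so the results are reproducible.

```python
import pandas as pd

# Fixed illustrative values instead of np.random.randn
frame = pd.DataFrame([[1.0, -2.0, 3.0],
                      [4.0, 0.5, -1.0]],
                     columns=list('bde'),
                     index=['Utah', 'Ohio'])

spread = lambda x: x.max() - x.min()

per_column = frame.apply(spread)                  # one result per column
per_row = frame.apply(spread, axis='columns')     # one result per row

print(per_column)
print(per_row)
```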