# Python Pandas - Indexing and Selecting Data
The Python and NumPy indexing operators "[ ]" and attribute operator "." provide quick and easy access to Pandas data structures across a wide range of use cases. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods explained in this chapter.

### .loc()
Pandas provides the .loc accessor for purely label-based indexing. When slicing, both the start and the end bound are included. Integers are valid labels, but they refer to the label and not the position.

.loc accepts the following kinds of input −
<ul>
    <li>A single scalar label</li>
    <li>A list of labels</li>
    <li>A slice object</li>
    <li>A Boolean array</li>
</ul>
.loc takes up to two such selectors (single label, list, or slice) separated by ','. The first one selects rows and the second one selects columns.
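To make the label-versus-position distinction concrete, here is a minimal sketch on a small Series whose index labels happen to be integers. The data and variable names are made up for illustration and are not part of the examples that follow.

```python
import pandas as pd

# Integer labels that deliberately do not match positions 0, 1, 2, 3
s = pd.Series([100, 200, 300, 400], index=[10, 20, 30, 40])

by_label = s.loc[20]          # looks up the label 20, not the 21st element
by_position = s.iloc[2]       # third element by position
label_slice = s.loc[10:30]    # label slice: both bounds are included

# Row and column selectors together, separated by a comma
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
cell = df.loc['y', 'B']          # row 'y', column 'B'
block = df.loc['x':'y', ['A']]   # rows 'x' through 'y', column 'A' only

print(by_label, by_position, cell)
print(label_slice)
print(block)
```

Because the column selector in the last line is a list, the result stays a DataFrame rather than collapsing to a Series.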



In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

#select all rows for a specific column
print (df.loc[:,'A'])

a   -0.756052
b    0.532697
c    0.689525
d    0.363028
e   -2.206971
f   -0.490911
g   -0.906073
h    0.580774
Name: A, dtype: float64


In [3]:
# Select all rows for multiple columns, say list[]
print (df.loc[:,['A','C']])

          A         C
a -0.756052 -0.096841
b  0.532697  0.800913
c  0.689525 -0.259572
d  0.363028  0.815113
e -2.206971  0.820029
f -0.490911  1.949593
g -0.906073 -0.590977
h  0.580774 -0.782290


In [4]:
# Select few rows for multiple columns, say list[]
print (df.loc[['a','b','f','h'],['A','C']])

          A         C
a -0.756052 -0.096841
b  0.532697  0.800913
f -0.490911  1.949593
h  0.580774 -0.782290


In [5]:
# Select range of rows for all columns
print (df.loc['a':'h'])

          A         B         C         D
a -0.756052 -0.255452 -0.096841  1.043768
b  0.532697 -1.012841  0.800913 -1.485888
c  0.689525 -0.240167 -0.259572 -2.069470
d  0.363028  0.076389  0.815113  0.095293
e -2.206971  0.244888  0.820029  1.035065
f -0.490911 -0.212406  1.949593  1.934767
g -0.906073 -0.105332 -0.590977  0.643485
h  0.580774  0.171428 -0.782290 -0.376233


In [6]:
# Build a Boolean Series: which values in row 'a' are positive?
df.loc['a']>0

A    False
B    False
C    False
D     True
Name: a, dtype: bool
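A Boolean result like the one above becomes useful when it is passed back into .loc as a selector. The following is a small sketch with made-up data; the names mask, positive_rows and positive_b are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1.5, -0.5, 2.0, -1.0], 'B': [10, 20, 30, 40]},
    index=['a', 'b', 'c', 'd'],
)

mask = df['A'] > 0                # Boolean Series aligned with the row labels
positive_rows = df.loc[mask]      # keep only the rows where the mask is True
positive_b = df.loc[mask, 'B']    # same rows, restricted to column 'B'

print(positive_rows)
print(positive_b)
```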

### .iloc()
Pandas provides the .iloc accessor for purely integer-based (positional) indexing. As in Python and NumPy, positions are 0-based and the end bound of a slice is excluded.

The various access methods are as follows −
<ul>
    <li>An Integer</li>
    <li>A list of integers</li>
    <li>A range of values</li>
    </ul>
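As a quick sketch of the three input kinds, the frame below holds the numbers 0 to 11 so that each selected cell is easy to recognise. The frame and variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# 4 rows x 3 columns; the string index shows that .iloc ignores labels
df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=['a', 'b', 'c', 'd'], columns=['A', 'B', 'C'])

single = df.iloc[1, 2]        # an integer for the row and for the column
rows = df.iloc[[0, 3]]        # a list of integers: first and last rows
window = df.iloc[1:3, 0:2]    # slices: rows 1-2 and columns 0-1, end excluded

print(single)
print(rows)
print(window)
```

Unlike a .loc label slice, the slice 1:3 stops before position 3, so the window has two rows.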

In [8]:
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# select the first four rows (all columns)
df.iloc[:4]

          A         B         C         D
0  1.965560  0.298250  0.048732 -0.160181
1  1.277546  0.786301  2.046470 -0.734262
2 -0.953631 -1.701168 -0.941769 -0.027740
3 -0.281851 -0.546247  0.133689 -1.021846


In [9]:
df.iloc[1:5, 2:4]

          C         D
1  2.046470 -0.734262
2 -0.941769 -0.027740
3  0.133689 -1.021846
4 -0.388644  0.501823


In [12]:
# Select by lists of row and column positions
df.iloc[[1, 3, 5], [1, 3]]


          B         D
1  0.786301 -0.734262
3 -0.546247 -1.021846
5  0.524221  0.510517


In [13]:
df.iloc[1:3, :]


          A         B         C         D
1  1.277546  0.786301  2.046470 -0.734262
2 -0.953631 -1.701168 -0.941769 -0.027740


In [14]:
df.iloc[:,1:3]

          B         C
0  0.298250  0.048732
1  0.786301  2.046470
2 -1.701168 -0.941769
3 -0.546247  0.133689
4 -1.500463 -0.388644
5  0.524221 -1.265441
6  0.249769  0.456470
7  0.421264 -1.156763


# Python Pandas - Statistical Functions

Statistical methods help in understanding and analyzing the behavior of data. We will now learn a few statistical functions that we can apply to Pandas objects.

### Percent Change
Series and DataFrame objects both have the pct_change() method. It compares every element with its prior element and computes the fractional change (1.0 corresponds to a 100% increase). The first element has no prior element, so its result is NaN.



In [16]:
s = pd.Series([1,2,3,4,5,4])
print (s.pct_change())



0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64
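The definition can be checked by hand: each result is (current - previous) / previous. The sketch below recomputes the series from the example with shift() and compares it with pct_change(); the variable names manual and builtin are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 4])

previous = s.shift(1)               # each element's prior element (NaN for the first)
manual = (s - previous) / previous  # fractional change computed by hand
builtin = s.pct_change()

# Compare with a tolerance, since the two routes may round differently
print(np.allclose(manual, builtin, equal_nan=True))
```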


In [17]:
df = pd.DataFrame(np.random.randn(5, 2))
print (df.pct_change())

          0         1
0       NaN       NaN
1 -1.776352  0.777940
2 -0.384205 -5.037457
3  3.118085 -1.795935
4 -0.467003  1.627480


### Covariance
Covariance is computed between two numeric series. The Series object has a cov method that computes the covariance with another Series, and DataFrame.cov computes the pairwise covariance matrix of all columns. NA values are excluded automatically.

In [18]:
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print (s1.cov(s2))

0.7961996529199249


In [19]:
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print (frame['a'].cov(frame['b']))


0.32979579757321886


In [20]:
print (frame.cov())

          a         b         c         d         e
a  0.761723  0.329796 -0.631167  0.074172 -0.569736
b  0.329796  0.866015 -0.182075 -0.121638 -0.109112
c -0.631167 -0.182075  0.882934 -0.213180  0.512286
d  0.074172 -0.121638 -0.213180  0.698718 -0.254396
e -0.569736 -0.109112  0.512286 -0.254396  0.640230
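To see what "NA will be excluded automatically" means in practice, here is a small sketch with hand-picked values (not taken from the random data above). Only the positions where both series have a value contribute to the result.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
s2 = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

# Positions 3 and 4 each have a missing value, so only positions 0-2 are used
pairwise_cov = s1.cov(s2)

# The same number from the complete pairs alone
complete = pd.concat([s1, s2], axis=1).dropna()   # columns are named 0 and 1
manual_cov = complete[0].cov(complete[1])

print(pairwise_cov, manual_cov)
```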


### Correlation
Correlation measures the linear relationship between any two arrays of values (series). There are multiple methods to compute the correlation: pearson (the default), spearman, and kendall.

In [21]:
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

print (frame['a'].corr(frame['b']))
print (frame.corr())

-0.2271050056292941
          a         b         c         d         e
a  1.000000 -0.227105 -0.175764  0.226661 -0.126490
b -0.227105  1.000000 -0.344047 -0.285125 -0.284761
c -0.175764 -0.344047  1.000000  0.218145  0.348607
d  0.226661 -0.285125  0.218145  1.000000 -0.370016
e -0.126490 -0.284761  0.348607 -0.370016  1.000000
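The difference between the methods shows up on data that is monotonic but not linear. The sketch below computes the default Pearson correlation and then a Spearman correlation by hand, as the Pearson correlation of the ranks. The manual rank step is used here only to show what the method measures; in everyday code one would normally pass method='spearman' to corr instead.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3   # always increasing with x, but not along a straight line

pearson = x.corr(y)   # default method='pearson'

# Spearman correlation computed by hand: Pearson correlation of the ranks
spearman = x.rank().corr(y.rank())

print(round(pearson, 3), round(spearman, 3))
```

Pearson falls short of 1 because the relationship is curved, while the rank-based value reaches 1 because the ordering of the two series is identical.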


# Python Pandas - Missing Data 

Missing data is always a problem in real life scenarios. Areas like machine learning and data mining face severe issues in the accuracy of their model predictions because of poor quality of data caused by missing values. In these areas, missing value treatment is a major point of focus to make their models more accurate and valid.

### When and Why Is Data Missing?
Let us consider an online survey for a product. Often, people do not share all the information related to them. Some people share their experience but not how long they have been using the product; others share how long they have been using the product and their experience, but not their contact information. Thus, in one way or another, some part of the data is always missing, and this is very common in real-world data.

Let us now see how we can handle missing values (say NA or NaN) using Pandas.

In [26]:
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df)

        one       two     three
a -0.218071 -0.421186  0.765356
b       NaN       NaN       NaN
c -0.058490  0.269390 -1.377120
d       NaN       NaN       NaN
e  0.457416  0.798680  1.376392
f  0.821515  0.861010 -1.554490
g       NaN       NaN       NaN
h -1.244039  0.384260  0.485950


Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.

### Check for Missing Values
To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects −

In [32]:
print (df['one'].isnull())

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool


In [33]:
df['one'].notnull()

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
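Because the result of isnull() is Boolean, it can be summed or combined across an axis to summarise where values are missing. A minimal sketch, using a small hand-written frame rather than the random one above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0, np.nan],
                   'two': [np.nan, 2.0, 3.0, 4.0]})

missing_per_column = df.isnull().sum()         # True counts as 1
rows_with_missing = df.isnull().any(axis=1)    # True where any column is NA
complete_rows = df[df.notnull().all(axis=1)]   # rows with no NA at all

print(missing_per_column)
print(rows_with_missing)
print(complete_rows)
```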

### Calculations with Missing Data
<ul>
    <li>When summing data, NA values are treated as zero</li>
    <li>If the data are all NA, the sum is 0 in pandas 0.22 and later (earlier versions returned NA); pass min_count=1 to get NaN instead</li>
</ul>

In [34]:
df['one'].sum()

-0.24166949461603165

In [35]:
df = pd.DataFrame(index=[0,1,2,3,4,5],columns=['one','two'])
df

   one  two
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN


In [36]:
df['one'].sum()

0
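The zero result for all-NA data can be changed with the min_count and skipna arguments of sum(). The following sketch uses a small float Series made up for illustration.

```python
import numpy as np
import pandas as pd

all_na = pd.Series([np.nan, np.nan, np.nan])

default_sum = all_na.sum()             # NA values are skipped, leaving nothing to add
strict_sum = all_na.sum(min_count=1)   # require at least one valid value
propagated = pd.Series([1.0, np.nan]).sum(skipna=False)  # do not skip NA

print(default_sum, strict_sum, propagated)
```

Here default_sum is 0.0, while the other two results are NaN.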

### Cleaning / Filling Missing Data

Pandas provides various methods for cleaning the missing values. The fillna function can “fill in” NA values with non-null data in a couple of ways, which we have illustrated in the following sections.

### Replace NaN with a Scalar Value
The following program shows how you can replace "NaN" with "0".

In [37]:
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])

df = df.reindex(['a', 'b', 'c'])

print (df)


        one       two     three
a  0.046657  1.396719 -0.236153
b       NaN       NaN       NaN
c -0.282620 -0.882507  0.470864


In [38]:
print ("NaN replaced with '0':")
print (df.fillna(0))

NaN replaced with '0':
        one       two     three
a  0.046657  1.396719 -0.236153
b  0.000000  0.000000  0.000000
c -0.282620 -0.882507  0.470864


Here, we are filling with value zero; instead we can also fill with any other value.
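As one illustration of filling with values other than zero, fillna also accepts a dict or a Series keyed by column name, which allows a different fill value per column. The frame below is hand-written for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 4.0, 8.0]})

by_column = df.fillna({'one': 0, 'two': -1})   # a separate scalar for each column
by_mean = df.fillna(df.mean())                 # each column's own mean

print(by_column)
print(by_mean)
```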

### Fill NA Forward and Backward
The fill methods introduced in the Reindexing chapter can also be used to fill missing values −
<ul>
    <li>pad / ffill − fill values forward</li>
    <li>bfill / backfill − fill values backward</li>
</ul>

In [39]:
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

df.ffill()  # forward fill; same result as the older df.fillna(method='pad'), which recent pandas versions deprecate

        one       two     three
a  0.040491  1.228818 -0.127804
b  0.040491  1.228818 -0.127804
c  1.428146  1.531009  0.409584
d  1.428146  1.531009  0.409584
e  0.422002 -0.401342 -1.350928
f -1.307411  2.997140 -0.698750
g -1.307411  2.997140 -0.698750
h -0.106850  1.737303 -0.787782


In [40]:
df.bfill()  # backward fill; same result as the older df.fillna(method='backfill'), which recent pandas versions deprecate

        one       two     three
a  0.040491  1.228818 -0.127804
b  1.428146  1.531009  0.409584
c  1.428146  1.531009  0.409584
d  0.422002 -0.401342 -1.350928
e  0.422002 -0.401342 -1.350928
f -1.307411  2.997140 -0.698750
g -0.106850  1.737303 -0.787782
h -0.106850  1.737303 -0.787782
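When a gap spans several rows, the limit argument caps how many consecutive missing values are filled. A short sketch on a hand-written Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()                  # carry the last valid value forward
forward_limited = s.ffill(limit=1)   # fill at most one consecutive NA
backward = s.bfill()                 # pull the next valid value backward

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```

With limit=1 the second missing value in the gap stays NaN.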


### Drop Missing Values
If you want to simply exclude the missing values, then use the dropna function along with the axis argument. By default, axis=0, i.e., along row, which means that if any value within a row is NA then the whole row is excluded.

In [41]:
df.dropna()

        one       two     three
a  0.040491  1.228818 -0.127804
c  1.428146  1.531009  0.409584
e  0.422002 -0.401342 -1.350928
f -1.307411  2.997140 -0.698750
h -0.106850  1.737303 -0.787782


In [42]:
df.dropna(axis=1)

Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
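Dropping every row or column that contains any NA is often too aggressive, as the empty result above shows. The how and subset arguments of dropna narrow the rule; the sketch below uses a small hand-written frame to show the difference.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one':   [1.0, np.nan, np.nan, 4.0],
                   'two':   [1.0, 2.0,    np.nan, 4.0],
                   'three': [1.0, 2.0,    np.nan, np.nan]})

any_na = df.dropna()                  # default how='any': drop rows with any NA
all_na = df.dropna(how='all')         # drop only rows where every value is NA
by_subset = df.dropna(subset=['one']) # look at column 'one' only

print(list(any_na.index), list(all_na.index), list(by_subset.index))
```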


### Replace Missing (or) Generic Values
Often, we have to replace a generic value with some specific value. We can achieve this by applying the replace method.

Replacing NA with a scalar value is equivalent to the behavior of the fillna() function.

In [43]:
df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})

df.replace({1000:10,2000:60})

   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
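A flat dict such as the one above is applied to every column. To restrict a replacement to one column, replace also accepts a nested dict keyed by column name. The sketch below uses a shortened, made-up frame in which the value 2000 appears in both columns.

```python
import pandas as pd

df = pd.DataFrame({'one': [10, 20, 2000], 'two': [2000, 0, 30]})

everywhere = df.replace({2000: 60})           # flat dict: applies to all columns
only_one = df.replace({'one': {2000: 60}})    # nested dict: column 'one' only

print(everywhere)
print(only_one)
```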


# Python Pandas - GroupBy

Any groupby operation involves one of the following operations on the original object. They are −
<ul>
    <li>Splitting the Object</li>
    <li>Applying a function</li>
    <li>Combining the results</li>
</ul>
In many situations, we split the data into sets and we apply some functionality on each subset. In the apply functionality, we can perform the following operations −
<ul>
    <li>Aggregation − computing a summary statistic</li>
<li>Transformation − perform some group-specific operation</li>
<li>Filtration − discarding the data with some condition</li>
</ul>
Let us now create a DataFrame object and perform all the operations on it −



In [44]:
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print (df)

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690


### Split Data into Groups
A pandas object can be split along any of its axes. There are multiple ways to split an object, such as −

obj.groupby('key')
obj.groupby(['key1','key2'])
obj.groupby(key,axis=1)

The axis=1 form is deprecated in recent pandas releases (2.1 and later). Let us now see how grouping can be applied to the DataFrame object.

Example

In [45]:
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print (df.groupby('Team'))

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000027246187408>


In [46]:
df.groupby('Team').groups

{'Devils': Int64Index([2, 3], dtype='int64'),
 'Kings': Int64Index([4, 6, 7], dtype='int64'),
 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
 'Royals': Int64Index([9, 10], dtype='int64'),
 'kings': Int64Index([5], dtype='int64')}

In [47]:
df.groupby(['Team','Year']).groups

{('Devils', 2014): Int64Index([2], dtype='int64'),
 ('Devils', 2015): Int64Index([3], dtype='int64'),
 ('Kings', 2014): Int64Index([4], dtype='int64'),
 ('Kings', 2016): Int64Index([6], dtype='int64'),
 ('Kings', 2017): Int64Index([7], dtype='int64'),
 ('Riders', 2014): Int64Index([0], dtype='int64'),
 ('Riders', 2015): Int64Index([1], dtype='int64'),
 ('Riders', 2016): Int64Index([8], dtype='int64'),
 ('Riders', 2017): Int64Index([11], dtype='int64'),
 ('Royals', 2014): Int64Index([9], dtype='int64'),
 ('Royals', 2015): Int64Index([10], dtype='int64'),
 ('kings', 2015): Int64Index([5], dtype='int64')}
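Once the data is split, the three kinds of apply step listed earlier (aggregation, transformation, filtration) can be sketched as follows. The frame is a shortened copy of the team data, and the lambda functions are illustrative choices rather than the only way to express each step.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings'],
    'Points': [876, 789, 863, 673, 741],
})
grouped = df.groupby('Team')

# Aggregation: one summary value per group
mean_points = grouped['Points'].mean()

# Transformation: a result with the same length as the input,
# here each row's distance from its own team's mean
centered = grouped['Points'].transform(lambda x: x - x.mean())

# Filtration: keep only the rows of groups that satisfy a condition
multi_row_teams = grouped.filter(lambda g: len(g) >= 2)

print(mean_points)
print(centered)
print(multi_row_teams)
```

The aggregation is indexed by team, whereas the transformation and the filtration keep the original row index.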

# Python Pandas - Categorical Data
Real-world data often includes text columns whose values are repetitive. Features like gender, country, and codes are always repetitive. These are examples of categorical data.

Categorical variables can take on only a limited, and usually fixed, number of possible values. Besides the fixed set of values, categorical data might have an order, but numerical operations cannot be performed on it. Categorical is a Pandas data type.

The categorical data type is useful in the following cases −
<ul>
<li>A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.</li>
<li>The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.</li>
<li>As a signal to other python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).</li>
</ul>
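The memory claim in the first point can be checked with memory_usage(deep=True). The sketch below builds a made-up column of 3000 strings with only three distinct values; the exact byte counts depend on the pandas version and platform, so only the comparison is shown.

```python
import pandas as pd

# 3000 strings, but only three distinct values
s = pd.Series(['low', 'medium', 'high'] * 1000)
as_category = s.astype('category')

plain_bytes = s.memory_usage(deep=True)               # every string stored separately
category_bytes = as_category.memory_usage(deep=True)  # small integer codes + 3 categories

print(plain_bytes, category_bytes)
print(category_bytes < plain_bytes)
```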


### Object Creation
Categorical objects can be created in multiple ways, as described below −

By specifying the dtype as "category" when creating a pandas object.

In [48]:
s = pd.Series(["a","b","c","a"], dtype="category")
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

The number of elements passed to the series object is four, but the categories are only three. Observe the same in the output Categories.

A categorical object can also be created directly with the pd.Categorical constructor.

In [49]:
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
cat

[a, b, c, a, b, c]
Categories (3, object): [a, b, c]

In [51]:
cat = pd.Categorical(['a','b','c','a','b','c','d'], ['c', 'b', 'a'])
cat

[a, b, c, a, b, c, NaN]
Categories (3, object): [c, b, a]

Here, the second argument signifies the categories. Thus, any value which is not present in the categories will be treated as NaN.

Now, take a look at the following example, which passes ordered=True so that the categories carry a logical order (here c < b < a) −

In [52]:
cat = pd.Categorical(['a','b','c','a','b','c','d'], ['c', 'b', 'a'],ordered=True)
cat

[a, b, c, a, b, c, NaN]
Categories (3, object): [c < b < a]
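An ordered categorical changes how sorting and min/max behave. The sketch below uses the “one”, “two”, “three” example mentioned earlier; the Series name sizes is an assumption made for illustration.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['three', 'one', 'two', 'one'],
    categories=['one', 'two', 'three'],   # the logical order
    ordered=True,
))

logical = sizes.sort_values().tolist()   # sorted by category order
lexical = sorted(sizes.astype(str))      # plain strings sort alphabetically
smallest, largest = sizes.min(), sizes.max()

print(logical)
print(lexical)
print(smallest, largest)
```

Without ordered=True, min() and max() are not defined for a categorical and raise a TypeError.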

### Description
Using the .describe() method on categorical data, we get output similar to that of a Series or DataFrame of string type.



In [53]:
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})

df.describe()

        cat  s
count     3  3
unique    2  2
top       c  c
freq      2  2


In [54]:
df["cat"].describe()

count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

In [55]:
s = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
s.categories

Index(['b', 'a', 'c'], dtype='object')

In [56]:
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
cat.ordered

False