# PANDAS BASICS


Pandas is an open-source Python library for high-performance data analysis. It is built on top of NumPy, and the two are often used together: pandas relies heavily on the NumPy array to implement its data objects and shares many of NumPy's features.

                                                                                                  Written by,
                                                                                                  Shubam Joshi

                                                                                                  Compiled and Presented by,
                                                                                                  Pravveen Murugesan V
                                                                                                  Pranav G

## Key Features of Pandas

- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.
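
The data alignment and missing-data handling listed above can be illustrated with a minimal sketch. The Series names and values are made up for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative values)
q1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
q2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Arithmetic aligns on index labels; a label present on only one side gives NaN
total = q1 + q2
print(total)

# fill_value treats a missing label as 0 instead of producing NaN
total_filled = q1.add(q2, fill_value=0)
print(total_filled)
```

Alignment by label means the order of the rows in the two inputs does not matter, only the labels do.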


### Pandas generally deals with two data structures - Series and DataFrame

## Pandas Series :

Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects). 
The axis labels are collectively called index.

#### Syntax :
pandas.Series( data, index, dtype, copy) 
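
As a small sketch of how the `data`, `index` and `dtype` arguments interact, the example below builds one Series from a dict and one from a list. The names `prices` and `counts` and their values are illustrative assumptions.

```python
import pandas as pd

# From a dict: the keys become the index labels
prices = pd.Series({'apple': 1.5, 'banana': 0.25})

# From a list with an explicit index and an explicit dtype
counts = pd.Series([1, 2, 3], index=['x', 'y', 'z'], dtype='float64')

print(prices.index.tolist())
print(counts.dtype)
```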



In [2]:
#create a series from a ndarray

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

100    a
101    b
102    c
103    d
dtype: object


In [4]:
#Accessing Data from Series

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

print(s.iloc[0]) #Using position (plain s[0] on a labelled index is deprecated in recent pandas)
print(s['a']) #Using Label

#Retrieving Multiple elements 
print(s[:3])
print(s[['a','c','d']])



1
1
a    1
b    2
c    3
dtype: int64
a    1
c    3
d    4
dtype: int64


### "Apply" method on Pandas Series : Invoke function on values of Series

In [5]:
import pandas as pd
import numpy as np 

series = pd.Series([20, 21, 12], index=['a','b','c'])

#Square the values by defining a function and passing it as an argument to apply().
def square(x):
   return x**2
series.apply(square) 

#Square the values by passing an anonymous function as an argument to apply().
series.apply(lambda x: x**2)

a    400
b    441
c    144
dtype: int64

### Note : apply is a method of pandas objects. A NumPy array has no apply method, so calling it on an array raises an AttributeError
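
One reading of the note above is that `apply` belongs to pandas objects and not to NumPy arrays. The sketch below checks that and shows two possible workarounds; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([20, 21, 12])

# A NumPy array does not provide an apply method
print(hasattr(arr, 'apply'))

# Option 1: NumPy arithmetic is already element-wise
squared_arr = arr ** 2

# Option 2: wrap the array in a Series and use apply
squared_series = pd.Series(arr).apply(lambda x: x ** 2)

print(squared_arr)
print(squared_series.tolist())
```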

## DataFrames:

A DataFrame is a two-dimensional data structure, i.e. data is aligned in a tabular fashion in rows and columns.

#### Syntax :
pandas.DataFrame( data, index, columns, dtype, copy)


In [6]:
#Creating Dataframe from Lists

import pandas as pd

data = [['Jose',90],['Matt',80],['Clark',70]]
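df = pd.DataFrame(data,index=['Rank1','Rank2','Rank3'],columns=['Name','Percentage'])
df['Percentage'] = df['Percentage'].astype(float) #dtype=float on the whole frame can raise in recent pandas because 'Name' holds strings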
print(df)


        Name  Percentage
Rank1   Jose        90.0
Rank2   Matt        80.0
Rank3  Clark        70.0


In [7]:
#Creating Dataframe from a list of dicts

import pandas as pd

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}] #NaN (Not a Number) is appended in missing areas.
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [10]:
#Creating Dataframe from dict of series
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


In [3]:
#import from a csv file

import pandas as pd

dataframe = pd.read_csv(r"C:\Users\Sucha\Desktop\creditcardcsv.csv")
print(dataframe)

      Merchant_id   CCNUMBER  Transaction date  \
0      3160040998  458785658               NaN   
1      3160040998  258785659               NaN   
2      3160041996  677361687               NaN   
3      3160041996  487513198               NaN   
4      3160041996    7970257               NaN   
5      3160241992  333905636               NaN   
6      3160241992  579946416               NaN   
7      3160272997  291714540               NaN   
8      3200016990  214355890               NaN   
9      3200016990  868746595               NaN   
10     3200016990  160390553               NaN   
11     3333780991  688040418               NaN   
12     3737637735  112423763               NaN   
13     4727426967   12616870               NaN   
14     4544655328  493169486               NaN   
15     4499635964  794447079               NaN   
16     5341322312  280968687               NaN   
17     4531717685  743610028               NaN   
18     6397501503  727691964               NaN   
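
Since the CSV path above is specific to one machine, here is a minimal sketch that reads the same kind of data from an in-memory string instead. The two rows are copied from the output shown in this notebook; `csv_text` and `sample` are illustrative names.

```python
import io
import pandas as pd

# Inline stand-in for a CSV file (two rows taken from the output above)
csv_text = """Merchant_id,CCNUMBER,Transaction_amount
3160040998,458785658,1500.0
3160041996,677361687,4000.0
"""

# read_csv accepts a path or any file-like object; usecols limits the columns loaded
sample = pd.read_csv(io.StringIO(csv_text), usecols=['CCNUMBER', 'Transaction_amount'])
print(sample.shape)
print(sample.dtypes)
```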


In [6]:
dataframe.head() #returns the first n rows. Default : 5

Unnamed: 0,Merchant_id,CCNUMBER,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent
0,3160040998,458785658,,100.0,1500.0,4000,35552.25,N,5,Y,Y,0,0.0,0,Y
1,3160040998,258785659,,100.0,1500.0,4000,35552.25,N,5,Y,Y,0,0.0,0,Y
2,3160041996,677361687,,185.5,4000.0,2000,1901.876,Y,0,N,N,0,0.0,0,N
3,3160041996,487513198,,185.5,3000.0,2000,1901.876,Y,1,N,N,0,0.0,0,N
4,3160041996,7970257,,185.5,3000.0,2000,1901.876,Y,2,N,N,0,0.0,0,N


In [5]:
dataframe.tail() #returns last n rows. Default : 5

Unnamed: 0,Merchant_id,CCNUMBER,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent
3064,3160041896,7970257,,185.5,3000.0,2000,1901.876,Y,5,N,N,0,0.0,0,Y
3065,3160141996,7970257,,185.5,3000.0,2000,1901.876,Y,8,N,N,0,0.0,0,Y
3066,3162041996,7970257,,185.5,3000.0,2000,1901.876,Y,20,N,N,0,0.0,0,Y
3067,3162041996,7970257,,185.5,5542.3,4000,1901.876,Y,20,N,N,0,0.0,0,Y
3068,3162041996,7970257,,185.5,6742.8,4000,1901.876,Y,20,N,N,0,0.0,0,Y


In [16]:
dataframe.info() #prints the dtype and non-null count of each column, plus memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3069 entries, 0 to 3068
Data columns (total 15 columns):
Merchant_id                       3069 non-null int64
CCNUMBER                          3069 non-null int64
Transaction date                  0 non-null float64
Average Amount/transaction/day    3069 non-null float64
Transaction_amount                3069 non-null float64
limit                             3069 non-null int64
remaining limit                   3069 non-null float64
Is declined                       3069 non-null object
Total Number of declines/day      3069 non-null int64
isForeignTransaction              3069 non-null object
isHighRiskCountry                 3069 non-null object
Daily_chargeback_avg_amt          3069 non-null int64
6_month_avg_chbk_amt              3069 non-null float64
6-month_chbk_freq                 3069 non-null int64
isFradulent                       3069 non-null object
dtypes: float64(5), int64(6), object(4)
memory usage: 359.7+ KB


In [17]:
dataframe.describe() #summary statistics for the numeric columns in the dataset

Unnamed: 0,Merchant_id,CCNUMBER,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Total Number of declines/day,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq
count,3069.0,3069.0,0.0,3069.0,3069.0,3069.0,3069.0,3069.0,3069.0,3069.0,3069.0
mean,4977191000.0,495612100.0,,503.139303,771.338645,3006.190942,1579.522811,0.9319,41.84262,29.235549,0.262626
std,977683600.0,291196300.0,,289.231604,755.912941,1000.143792,1336.567163,2.188238,183.252377,136.989591,1.255798
min,3160041000.0,71047.0,,0.036577,1.0,2000.0,100.0,0.0,0.0,0.0,0.0
25%,4127151000.0,240886200.0,,257.142182,202.0,2000.0,734.0,0.0,0.0,0.0,0.0
50%,4970798000.0,496246000.0,,494.402452,532.0,4000.0,1386.0,0.0,0.0,0.0,0.0
75%,5835305000.0,746901800.0,,758.34686,1113.0,4000.0,2110.0,0.0,0.0,0.0,0.0
max,6665906000.0,999867700.0,,999.194706,6742.8,4000.0,35552.25,20.0,998.0,998.0,9.0


In [19]:
dataframe.columns #Returns column names

Index(['Merchant_id', 'CCNUMBER', 'Transaction date',
       'Average Amount/transaction/day', 'Transaction_amount', 'limit',
       'remaining limit', 'Is declined', 'Total Number of declines/day',
       'isForeignTransaction', 'isHighRiskCountry', 'Daily_chargeback_avg_amt',
       '6_month_avg_chbk_amt', '6-month_chbk_freq', 'isFradulent'],
      dtype='object')

In [21]:
dataframe.shape #Returns the number of rows and columns

(3069, 15)

In [23]:
dataframe.values #Extracts values of dataframe as np array

array([[3160040998, 458785658, nan, ..., 0.0, 0, 'Y'],
       [3160040998, 258785659, nan, ..., 0.0, 0, 'Y'],
       [3160041996, 677361687, nan, ..., 0.0, 0, 'N'],
       ...,
       [3162041996, 7970257, nan, ..., 0.0, 0, 'Y'],
       [3162041996, 7970257, nan, ..., 0.0, 0, 'Y'],
       [3162041996, 7970257, nan, ..., 0.0, 0, 'Y']], dtype=object)

#### set_index : Set the DataFrame index (row labels) using one or more existing columns. By default yields a new object.

#### Syntax : DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False) ####


In [3]:
#Setting Index
dataframe.set_index('CCNUMBER', inplace = True)
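
The difference between the default behaviour and `inplace=True` can be sketched on a small frame. The frame `cards` is an illustrative assumption that reuses two card numbers from the output above.

```python
import pandas as pd

cards = pd.DataFrame({'CCNUMBER': [458785658, 258785659],
                      'limit': [4000, 4000]})

# Without inplace=True, set_index returns a new DataFrame and leaves cards unchanged
indexed = cards.set_index('CCNUMBER')
print(cards.columns.tolist())
print(indexed.index.name)

# reset_index moves the index back into a regular column
restored = indexed.reset_index()
print(restored.columns.tolist())
```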

### Sorting by index

#### Syntax :
DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True)


In [8]:
dataframe.sort_index(ascending=False).head(10) #Sorting by index

Unnamed: 0,Merchant_id,CCNUMBER,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent
3068,3162041996,7970257,,185.5,6742.8,4000,1901.876,Y,20,N,N,0,0.0,0,Y
3067,3162041996,7970257,,185.5,5542.3,4000,1901.876,Y,20,N,N,0,0.0,0,Y
3066,3162041996,7970257,,185.5,3000.0,2000,1901.876,Y,20,N,N,0,0.0,0,Y
3065,3160141996,7970257,,185.5,3000.0,2000,1901.876,Y,8,N,N,0,0.0,0,Y
3064,3160041896,7970257,,185.5,3000.0,2000,1901.876,Y,5,N,N,0,0.0,0,Y
3063,4241853103,127002896,,919.007944,2587.0,4000,2977.0,N,0,N,N,0,0.0,0,N
3062,4546243015,372387730,,627.20871,281.0,2000,434.0,N,0,N,N,0,0.0,0,N
3061,5022726674,462833859,,540.67111,79.0,4000,176.0,N,8,Y,Y,0,0.0,0,Y
3060,4931552646,285574927,,782.63858,1708.0,4000,2809.0,N,0,N,N,819,781.0,2,N
3059,3931533472,500426104,,994.171111,483.0,4000,3619.0,N,0,N,N,0,0.0,0,N


### Sort by values

### Syntax
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')


In [15]:
dataframe.sort_values(by='remaining limit').head(10) #Sorting by value

Unnamed: 0,Merchant_id,CCNUMBER,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent
1611,6239886634,600213422,,230.802557,39.0,2000,100.0,N,0,N,N,0,0.0,0,N
631,5950044519,757987469,,190.214868,42.0,2000,101.0,N,0,Y,N,0,0.0,0,N
2901,4997094269,905132839,,892.597744,95.0,2000,101.0,N,0,N,N,0,0.0,0,N
2529,5696631575,360216529,,939.214543,89.0,2000,101.0,N,4,N,N,0,0.0,0,N
1365,5858053228,204456072,,802.041451,63.0,4000,101.0,N,0,N,N,0,0.0,0,N
90,5276862350,537604847,,328.09471,5.0,4000,103.0,N,3,N,N,0,0.0,0,N
2322,3508979345,18271147,,205.861419,70.0,2000,103.0,N,0,Y,N,0,0.0,0,Y
759,6577947894,803826679,,150.07261,73.0,4000,103.0,N,0,N,N,0,0.0,0,N
1494,4818869188,554472812,,644.023177,63.0,2000,104.0,N,7,Y,Y,0,0.0,0,Y
974,3553896680,444402398,,356.410382,71.0,2000,104.0,N,0,N,N,0,0.0,0,N


In [10]:
dataframe.sort_values(by='Average Amount/transaction/day', ascending = False).head(10) #Sorting by value, descending

Unnamed: 0,Merchant_id,CCNUMBER,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent
2781,4165364368,917632104,,999.194706,352.0,2000,1023.0,N,0,N,N,0,0.0,0,N
45,6324841313,84986568,,998.934275,796.0,2000,1507.0,N,0,N,N,0,0.0,0,N
32,5037335286,106373199,,998.391708,996.0,4000,3647.0,N,0,Y,N,0,0.0,0,N
2618,3527533497,929971093,,998.344074,758.0,4000,815.0,N,0,Y,Y,701,585.0,8,Y
1320,5135958475,256330385,,997.967671,124.0,4000,416.0,N,0,N,N,0,0.0,0,N
1047,3792474266,36994290,,997.621731,222.0,2000,1232.0,N,4,N,N,0,0.0,0,N
1981,5766730673,347412564,,996.989956,404.0,2000,1407.0,N,0,N,N,0,0.0,0,N
985,4516411055,131683800,,996.984323,175.0,2000,362.0,N,0,N,N,0,0.0,0,N
70,5279042349,757995264,,996.585439,182.0,2000,262.0,N,0,N,N,0,0.0,0,N
1261,5573978043,807792121,,996.151761,618.0,2000,1431.0,N,5,N,N,0,0.0,0,N


In [14]:
dataframe.sort_values(by=['Average Amount/transaction/day','remaining limit'], ascending = False).head(10) #Sorting with 2 columns

Unnamed: 0,Merchant_id,CCNUMBER,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent
2781,4165364368,917632104,,999.194706,352.0,2000,1023.0,N,0,N,N,0,0.0,0,N
45,6324841313,84986568,,998.934275,796.0,2000,1507.0,N,0,N,N,0,0.0,0,N
32,5037335286,106373199,,998.391708,996.0,4000,3647.0,N,0,Y,N,0,0.0,0,N
2618,3527533497,929971093,,998.344074,758.0,4000,815.0,N,0,Y,Y,701,585.0,8,Y
1320,5135958475,256330385,,997.967671,124.0,4000,416.0,N,0,N,N,0,0.0,0,N
1047,3792474266,36994290,,997.621731,222.0,2000,1232.0,N,4,N,N,0,0.0,0,N
1981,5766730673,347412564,,996.989956,404.0,2000,1407.0,N,0,N,N,0,0.0,0,N
985,4516411055,131683800,,996.984323,175.0,2000,362.0,N,0,N,N,0,0.0,0,N
70,5279042349,757995264,,996.585439,182.0,2000,262.0,N,0,N,N,0,0.0,0,N
1261,5573978043,807792121,,996.151761,618.0,2000,1431.0,N,5,N,N,0,0.0,0,N
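
When sorting by two columns, `ascending` can also take one flag per column. The sketch below uses a made-up frame (`df_sort`) to show a descending sort on one column with an ascending tie-break on the other.

```python
import pandas as pd

df_sort = pd.DataFrame({'limit': [2000, 4000, 2000, 4000],
                        'amount': [300.0, 150.0, 75.0, 900.0]})

# limit descending, then amount ascending within equal limits
ordered = df_sort.sort_values(by=['limit', 'amount'], ascending=[False, True])
print(ordered)
```

The second column only affects the order of rows that tie on the first, which is why the two-column output above matches the single-column one when the first column has no repeated values in the top rows.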


### Selection of data

In [53]:
dataframe[50:60] #Slicing rows by position (rows 50 to 59)

CCNUMBER,Merchant_id,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent
673582112,5175623174,,52.166931,188.0,2000,778.0,N,0,N,N,0,0.0,0,N
382156391,5952354946,,825.879678,663.0,2000,1034.0,N,0,N,N,0,0.0,0,N
764461556,4226525797,,347.476098,314.0,2000,346.0,N,0,N,N,0,0.0,0,N
590405792,4209927538,,502.49511,2960.0,4000,3851.0,N,0,Y,N,0,0.0,0,Y
314185607,4855866036,,777.830643,1844.0,4000,2444.0,N,0,N,N,0,0.0,0,N
8913910,5518736216,,397.445606,229.0,2000,753.0,N,1,N,N,0,0.0,0,N
406568282,3769223968,,770.818689,224.0,2000,255.0,N,0,N,N,0,0.0,0,N
625080054,5809891755,,421.238154,1166.0,4000,2491.0,N,0,N,N,0,0.0,0,N
934887003,4256899166,,266.778861,2076.0,4000,2402.0,N,0,N,N,0,0.0,0,N
190039661,4266029386,,924.984081,236.0,4000,528.0,N,7,N,N,0,0.0,0,Y


In [54]:
dataframe[0::2].head(20) #Selecting every second row, starting from the first

CCNUMBER,Merchant_id,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent
458785658,3160040998,,100.0,1500.0,4000,35552.25,N,5,Y,Y,0,0.0,0,Y
677361687,3160041996,,185.5,4000.0,2000,1901.876,Y,0,N,N,0,0.0,0,N
7970257,3160041996,,185.5,3000.0,2000,1901.876,Y,2,N,N,0,0.0,0,N
579946416,3160241992,,500.0,3705.2,4000,4000.0,N,0,Y,Y,800,677.2,6,Y
214355890,3200016990,,262.5,500.0,2000,1000.0,N,0,N,N,0,0.0,0,N
160390553,3200016990,,375.0,500.0,4000,2425.6,N,0,N,N,0,0.0,0,N
112423763,3737637735,,622.165491,2629.0,4000,2832.0,N,0,N,N,0,0.0,0,N
493169486,4544655328,,161.166495,301.0,4000,2636.0,N,0,N,N,0,0.0,0,N
280968687,5341322312,,82.079047,787.0,2000,1684.0,N,0,N,N,0,0.0,0,N
727691964,6397501503,,133.981453,1676.0,4000,2441.0,N,4,N,N,0,0.0,0,N


In [55]:
dataframe[['Merchant_id','Transaction_amount','limit']] #Selection using columns

CCNUMBER,Merchant_id,Transaction_amount,limit
458785658,3160040998,1500.0,4000
258785659,3160040998,1500.0,4000
677361687,3160041996,4000.0,2000
487513198,3160041996,3000.0,2000
7970257,3160041996,3000.0,2000
333905636,3160241992,3700.0,4000
579946416,3160241992,3705.2,4000
291714540,3160272997,500.0,2000
214355890,3200016990,500.0,2000
868746595,3200016990,500.0,2000


## Two main ways of indexing in Dataframe

1. Position based indexing using df.iloc
2. Label based indexing using df.loc

### Selecting Pandas data using iloc
The iloc indexer for a pandas DataFrame is used for integer-location based indexing, i.e. selection by position. It selects rows and columns by number, in the order in which they appear in the DataFrame.

Syntax : data.iloc[row selection, column selection]

Note : .iloc returns a pandas Series when a single row or a single column is selected with a scalar, and a pandas DataFrame when multiple rows or columns are selected. If you need DataFrame output for a single row, pass a one-element list.
    

In [20]:
print(type(dataframe.iloc[5])) #Returns a Series
print(type(dataframe.iloc[[5]])) #Returns a Dataframe
print(type(dataframe.iloc[0:10])) #Returns a Dataframe
print(dataframe.iloc[1:5,0:4]) #Returns 4 rows and 4 columns as Dataframe

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
   Merchant_id   CCNUMBER  Transaction date  Average Amount/transaction/day
1   3160040998  258785659               NaN                           100.0
2   3160041996  677361687               NaN                           185.5
3   3160041996  487513198               NaN                           185.5
4   3160041996    7970257               NaN                           185.5
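
The same Series-versus-DataFrame rule applies to columns. The following sketch uses a small made-up frame (`demo`) to show column selection, negative positions and a mixed list/slice selection with `iloc`.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

print(type(demo.iloc[:, 0]))     # single column selected by a scalar
print(type(demo.iloc[:, [0]]))   # single column selected by a one-element list
print(demo.iloc[-1, -1])         # last row, last column
print(demo.iloc[[0, 2], 1:3])    # rows 0 and 2, columns at positions 1 and 2
```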


## Selecting Pandas data using loc

The Pandas loc indexer can be used with DataFrames for two different use cases:

1. Selecting rows by label/index
2. Selecting rows with a boolean / conditional lookup

Syntax : data.loc[[row selection], [column selection]]


### 1. Selecting rows by label/index

Selections with loc are based on the index labels of the DataFrame rather than on row positions. If no index has been set, the labels are the default integer index.

In [81]:
print(dataframe.loc[160390553]) #Returns a series

Merchant_id                       3200016990
Transaction date                         NaN
Average Amount/transaction/day           375
Transaction_amount                       500
limit                                   4000
remaining limit                       2425.6
Is declined                                N
Total Number of declines/day               0
isForeignTransaction                       N
isHighRiskCountry                          N
Daily_chargeback_avg_amt                   0
6_month_avg_chbk_amt                       0
6-month_chbk_freq                          0
isFradulent                                N
Name: 160390553, dtype: object


In [69]:
print(dataframe.loc[[160390553,487513198]]) #Returns a dataframe

           Merchant_id  Transaction date  Average Amount/transaction/day  \
CCNUMBER                                                                   
160390553   3200016990               NaN                           375.0   
487513198   3160041996               NaN                           185.5   

           Transaction_amount  limit  remaining limit Is declined  \
CCNUMBER                                                            
160390553               500.0   4000         2425.600           N   
487513198              3000.0   2000         1901.876           Y   

           Total Number of declines/day isForeignTransaction  \
CCNUMBER                                                       
160390553                             0                    N   
487513198                             1                    N   

          isHighRiskCountry  Daily_chargeback_avg_amt  6_month_avg_chbk_amt  \
CCNUMBER                                                                      
160

In [70]:
print(dataframe.loc[[160390553,487513198],['Merchant_id','Transaction_amount','limit']])

           Merchant_id  Transaction_amount  limit
CCNUMBER                                         
160390553   3200016990               500.0   4000
487513198   3160041996              3000.0   2000
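
One detail worth seeing in code is that label slices with `loc` include both endpoints, while positional slices with `iloc` exclude the stop. The frame `scores` below is a made-up example.

```python
import pandas as pd

scores = pd.DataFrame({'math': [90, 80, 70, 60], 'art': [65, 75, 85, 95]},
                      index=['ann', 'bob', 'cy', 'dee'])

by_label = scores.loc['bob':'cy']   # label slice: both endpoints are included
by_position = scores.iloc[1:2]      # positional slice: the stop is excluded
print(len(by_label), len(by_position))

# loc can also assign to a single cell addressed by row and column label
scores.loc['dee', 'math'] = 100
print(scores.loc['dee', 'math'])
```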


### 2. Boolean / Logical indexing using .loc
With boolean indexing or logical selection, we pass an array or Series of True/False values to the .loc indexer to select the rows where your Series has True values. In most use cases, you will make selections based on the values of different columns in your data set.

In [21]:
dataframe.loc[dataframe['remaining limit'] > 2000].head(10) #returns rows based on the condition specified

Unnamed: 0,Merchant_id,CCNUMBER,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent
0,3160040998,458785658,,100.0,1500.0,4000,35552.25,N,5,Y,Y,0,0.0,0,Y
1,3160040998,258785659,,100.0,1500.0,4000,35552.25,N,5,Y,Y,0,0.0,0,Y
5,3160241992,333905636,,500.0,3700.0,4000,4000.0,N,0,Y,Y,800,677.2,6,Y
6,3160241992,579946416,,500.0,3705.2,4000,4000.0,N,0,Y,Y,800,677.2,6,Y
10,3200016990,160390553,,375.0,500.0,4000,2425.6,N,0,N,N,0,0.0,0,N
11,3333780991,688040418,,375.0,500.0,4000,2425.6,N,0,N,N,0,0.0,0,N
12,3737637735,112423763,,622.165491,2629.0,4000,2832.0,N,0,N,N,0,0.0,0,N
14,4544655328,493169486,,161.166495,301.0,4000,2636.0,N,0,N,N,0,0.0,0,N
18,6397501503,727691964,,133.981453,1676.0,4000,2441.0,N,4,N,N,0,0.0,0,N
25,5301391154,437953844,,391.522319,1655.0,4000,2080.0,N,0,N,N,0,0.0,0,N


In [22]:
dataframe.loc[(dataframe['remaining limit'] > 2000) & (dataframe['isFradulent'] == 'Y'),
              ['Merchant_id','CCNUMBER','remaining limit','isFradulent']].head(10)# Returns filtered rows and specified columns

Unnamed: 0,Merchant_id,CCNUMBER,remaining limit,isFradulent
0,3160040998,458785658,35552.25,Y
1,3160040998,258785659,35552.25,Y
5,3160241992,333905636,4000.0,Y
6,3160241992,579946416,4000.0,Y
34,4835465818,171486296,3200.0,Y
48,4061849015,71047,3854.0,Y
53,4209927538,590405792,3851.0,Y
68,5845551567,41267152,2341.0,Y
76,5766099974,821871939,3676.0,Y
119,5035552703,828768501,3169.0,Y
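
Conditions can also be combined with `|` (or) and negated with `~`. This is a minimal sketch on a made-up frame (`tx`); as with `&`, each condition needs its own parentheses.

```python
import pandas as pd

tx = pd.DataFrame({'amount': [1500.0, 42.0, 3000.0, 95.0],
                   'declined': ['N', 'N', 'Y', 'N']})

# | is element-wise OR
big_or_declined = tx.loc[(tx['amount'] > 1000) | (tx['declined'] == 'Y')]

# ~ negates a boolean mask
not_declined = tx.loc[~(tx['declined'] == 'Y')]

print(len(big_or_declined), len(not_declined))
```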


## Pandas DataFrame.isin()

This method selects rows that have one particular value, or any of several values, in a given column.

In [23]:
mask = dataframe['isFradulent'].isin(['Y']) #named mask to avoid shadowing the built-in filter()
dataframe[mask].head(10)

Unnamed: 0,Merchant_id,CCNUMBER,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent
0,3160040998,458785658,,100.0,1500.0,4000,35552.25,N,5,Y,Y,0,0.0,0,Y
1,3160040998,258785659,,100.0,1500.0,4000,35552.25,N,5,Y,Y,0,0.0,0,Y
5,3160241992,333905636,,500.0,3700.0,4000,4000.0,N,0,Y,Y,800,677.2,6,Y
6,3160241992,579946416,,500.0,3705.2,4000,4000.0,N,0,Y,Y,800,677.2,6,Y
7,3160272997,291714540,,262.5,500.0,2000,2000.0,N,0,N,N,900,345.5,7,Y
13,4727426967,12616870,,719.842203,164.0,2000,214.0,N,2,Y,Y,0,0.0,0,Y
34,4835465818,171486296,,531.366961,1921.0,4000,3200.0,N,9,N,N,0,0.0,0,Y
37,4908884376,167993897,,69.640487,470.0,2000,554.0,N,0,Y,Y,0,0.0,0,Y
38,6023215001,214898833,,790.1523,307.0,2000,558.0,N,9,N,N,0,0.0,0,Y
39,4439427249,704411858,,815.850854,71.0,4000,224.0,N,8,Y,N,0,0.0,0,Y
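
`isin` is most useful with several candidate values, and its result can be negated to exclude them. The sketch below uses a made-up frame (`tx`) with illustrative limits.

```python
import pandas as pd

tx = pd.DataFrame({'limit': [2000, 4000, 6000, 2000],
                   'flag': ['Y', 'N', 'Y', 'N']})

# True where limit is any of the listed values
in_mask = tx['limit'].isin([2000, 4000])
print(tx[in_mask])

# ~ keeps the rows whose limit is not in the list
print(tx[~in_mask])
```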


## Merge and Append (Joins) 

### Merging (Using pd.merge)

### Syntax : 

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)



In [13]:
#inner join using pd.merge
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
  
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
   

result = pd.merge(left, right, on='key')

print(left)
print(right)
print(result)


  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
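
In the example above every key appears in both frames, so all join types give the same result. The sketch below uses two made-up frames with partly different keys to contrast `how='inner'` with `how='outer'`; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

left_df = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right_df = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'C': ['C1', 'C2', 'C3']})

# inner: only keys present in both frames
inner = pd.merge(left_df, right_df, on='key', how='inner')

# outer: all keys from either frame, with NaN where one side has no match
outer = pd.merge(left_df, right_df, on='key', how='outer', indicator=True)

print(inner)
print(outer)
```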


### Merging(Using df.merge)

### Syntax : 

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)


In [20]:
#left outer join using df.merge

df11 = pd.DataFrame({'A' : [1,2], 'B' : [2, 2]})

df12 = pd.DataFrame({'A' : [4,5,6], 'B': [2,2,2]})

res = df11.merge(df12, on='B', how='left')

print(df11)
print(df12)
print(res)

   A  B
0  1  2
1  2  2
   A  B
0  4  2
1  5  2
2  6  2
   A_x  B  A_y
0    1  2    4
1    1  2    5
2    1  2    6
3    2  2    4
4    2  2    5
5    2  2    6
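
The result above has six rows because the key `B` is duplicated on both sides, so every left row pairs with every right row. As a hedged sketch, the `suffixes` and `validate` parameters from the syntax line can make this explicit; the suffix names and the `unique_keys` flag are illustrative choices.

```python
import pandas as pd

df11 = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
df12 = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

# Every left row matches every right row with the same key: 2 x 3 = 6 rows
res = df11.merge(df12, on='B', how='left', suffixes=('_left', '_right'))
print(res.shape)

# validate raises MergeError when the keys are not unique in the way declared
try:
    df11.merge(df12, on='B', how='left', validate='one_to_one')
    unique_keys = True
except pd.errors.MergeError:
    unique_keys = False
print(unique_keys)
```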


## Appending(Using pd.concat)

Concatenate pandas objects along a particular axis with optional set logic along the other axes.

### Syntax :

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)


In [6]:
#Concatenation at axis=0
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                     index=[0, 1, 2, 3])
 
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                     index=[4, 5, 6, 7])
frames = [df1, df2]
result = pd.concat(frames)
print(result)

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7


In [7]:
#Concatenation at axis=1
df3 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                    index=[2, 3, 6, 7])
   

result1 = pd.concat([df1, df3], axis=1, sort=False)
print(result1)

     A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7


In [8]:
#Concatenation at axis=1 and inner join
result3 = pd.concat([df1, df3], axis=1, join='inner')
print(result3)

    A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3
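
Two further parameters from the syntax line, `ignore_index` and `keys`, control what happens to the row labels. The sketch below uses two small made-up frames (`first`, `second`) whose indexes overlap.

```python
import pandas as pd

first = pd.DataFrame({'A': ['A0', 'A1']})
second = pd.DataFrame({'A': ['A2', 'A3']})

# Default: original index labels are kept, so 0 and 1 each appear twice
stacked = pd.concat([first, second])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())

# keys adds an outer index level that records the source frame
labelled = pd.concat([first, second], keys=['first', 'second'])
print(labelled.loc['second'])
```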


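## Appending(Using df.append)

Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use pd.concat instead, as in the cell below.

### Syntax (pandas versions before 2.0)

DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)


In [9]:
#On pandas before 2.0 this was written df1.append(df2); pd.concat gives the same result
result4 = pd.concat([df1, df2])
print(result4)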

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7


## Grouping and Summarizing Dataframes

groupby splits the rows of a DataFrame into groups, using a mapper (a dict or key function) or one or more column names, so that a function can be applied to each group and the results combined.

### Syntax :
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)


In [106]:
#Grouping and Summarising
df_grouping = dataframe.groupby('isFradulent')
print(df_grouping.Transaction_amount.mean()) #Other aggregates include sum, median, count, max, min, prod, std and var.
pd.DataFrame(df_grouping['Transaction_amount'].mean()) #Returns output in dataframe

isFradulent
N    763.575281
Y    823.288972
Name: Transaction_amount, dtype: float64


isFradulent,Transaction_amount
N,763.575281
Y,823.288972
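
Several aggregates can be computed in one pass with `agg`. This is a minimal sketch on a made-up frame (`tx`); the column names and values are illustrative and unrelated to the real dataset.

```python
import pandas as pd

tx = pd.DataFrame({'fraud': ['Y', 'N', 'Y', 'N', 'N'],
                   'amount': [1500.0, 200.0, 3000.0, 100.0, 300.0]})

# One row per group, one column per aggregate
summary = tx.groupby('fraud')['amount'].agg(['count', 'mean', 'max'])
print(summary)
```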


### Lambda Functions in Dataframe

In [24]:
# Create a function to be applied

def is_suspicious(x):
    return x > 0

#Create a new column
#Equivalent with a lambda: dataframe['6-month_chbk_freq'].apply(lambda x: x > 0)

dataframe['trans_suspicious'] = dataframe['6-month_chbk_freq'].apply(is_suspicious)
dataframe.head(10)

Unnamed: 0,Merchant_id,CCNUMBER,Transaction date,Average Amount/transaction/day,Transaction_amount,limit,remaining limit,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent,trans_suspicious
0,3160040998,458785658,,100.0,1500.0,4000,35552.25,N,5,Y,Y,0,0.0,0,Y,False
1,3160040998,258785659,,100.0,1500.0,4000,35552.25,N,5,Y,Y,0,0.0,0,Y,False
2,3160041996,677361687,,185.5,4000.0,2000,1901.876,Y,0,N,N,0,0.0,0,N,False
3,3160041996,487513198,,185.5,3000.0,2000,1901.876,Y,1,N,N,0,0.0,0,N,False
4,3160041996,7970257,,185.5,3000.0,2000,1901.876,Y,2,N,N,0,0.0,0,N,False
5,3160241992,333905636,,500.0,3700.0,4000,4000.0,N,0,Y,Y,800,677.2,6,Y,True
6,3160241992,579946416,,500.0,3705.2,4000,4000.0,N,0,Y,Y,800,677.2,6,Y,True
7,3160272997,291714540,,262.5,500.0,2000,2000.0,N,0,N,N,900,345.5,7,Y,True
8,3200016990,214355890,,262.5,500.0,2000,1000.0,N,0,N,N,0,0.0,0,N,False
9,3200016990,868746595,,375.0,500.0,2000,1425.8,N,0,N,N,0,0.0,0,N,False
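
For a simple comparison like this one, a vectorized expression gives the same result as `apply` without calling a Python function per value. The sketch below uses a small Series with values taken from the `6-month_chbk_freq` column shown above.

```python
import pandas as pd

freq = pd.Series([0, 6, 7, 0], name='6-month_chbk_freq')

via_apply = freq.apply(lambda x: x > 0)   # calls the lambda once per value
via_vector = freq > 0                     # one vectorized comparison

print(via_apply.equals(via_vector))
```

`apply` remains the more general tool when the logic cannot be written as column arithmetic or comparisons.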


## Pivot table in Dataframe

Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table are stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.

### Syntax

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')


In [8]:
dataframe.pivot_table(values='limit',aggfunc='mean',columns='Merchant_id')

Merchant_id,3160040998,3160041896,3160041996,3160141996,3160241992,3160272997,3162041996,3200016990,3333780991,3335028024,...,6659424566,6660066018,6660093468,6661273529,6661610317,6662015632,6663592624,6664866404,6665254221,6665906072
limit,4000.0,2000.0,2000.0,2000.0,4000.0,2000.0,3333.333333,2666.666667,4000.0,4000.0,...,4000.0,2000.0,2000.0,4000.0,2000.0,2000.0,2000.0,4000.0,4000.0,2000.0
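
A pivot table is often easier to read with both `index` and `columns` set. The following sketch uses a made-up frame (`tx`) and `fill_value` to replace combinations that have no rows.

```python
import pandas as pd

tx = pd.DataFrame({'merchant': ['m1', 'm1', 'm2', 'm2'],
                   'declined': ['N', 'Y', 'N', 'N'],
                   'limit': [2000, 4000, 4000, 2000]})

# Rows are merchants, columns are the declined flag, cells hold the mean limit
table = tx.pivot_table(values='limit', index='merchant', columns='declined',
                       aggfunc='mean', fill_value=0)
print(table)
```

Here `m2` has no declined rows, so that cell takes the `fill_value` instead of NaN.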
