# Pandas examples

## How to compare two dataframes

Let's create two dataframes to use as an example:

In [13]:
import pandas as pd

df1 = pd.DataFrame({'player': ['A', 'B', 'C', 'D'],
                   'points': [12, 15, 17, 24],
                   'assists': [4, 6, 7, 8]})
df1

| | player | points | assists |
|---|---|---|---|
| 0 | A | 12 | 4 |
| 1 | B | 15 | 6 |
| 2 | C | 17 | 7 |
| 3 | D | 24 | 8 |


In [14]:
df2 = pd.DataFrame({'player': ['A', 'B', 'C', 'D'],
                    'points': [12, 24, 26, 29],
                    'assists': [7, 8, 10, 13]})
df2

| | player | points | assists |
|---|---|---|---|
| 0 | A | 12 | 7 |
| 1 | B | 24 | 8 |
| 2 | C | 26 | 10 |
| 3 | D | 29 | 13 |


### Find out if the two dataframes are identical

In [15]:
df1.equals(df2)

False
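
`equals()` only answers yes or no. To see which cells differ, an element-wise comparison can help. The sketch below rebuilds the two example frames so that it runs on its own and uses `DataFrame.eq()`. It assumes both frames share the same index and column labels; the names `same`, `identical_columns` and `changed_rows` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'player': ['A', 'B', 'C', 'D'],
                    'points': [12, 15, 17, 24],
                    'assists': [4, 6, 7, 8]})
df2 = pd.DataFrame({'player': ['A', 'B', 'C', 'D'],
                    'points': [12, 24, 26, 29],
                    'assists': [7, 8, 10, 13]})

# Boolean frame: True where the two cells hold the same value
same = df1.eq(df2)

# One boolean per column: True if every row matches in that column
identical_columns = same.all()

# Rows of df1 that contain at least one differing cell
changed_rows = df1[~same.all(axis=1)]
```

With this data only the `player` column matches everywhere, so every row ends up in `changed_rows`.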

### Find the difference in numeric columns

In [16]:
df1.set_index('player').subtract(df2.set_index('player'))

| player | points | assists |
|---|---|---|
| A | 0 | -3 |
| B | -9 | -2 |
| C | -9 | -3 |
| D | -5 | -5 |
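
Setting `player` as the index matters because `subtract()` matches rows by index label, not by position. Here is a small sketch of that behaviour, assuming the same example data with the rows of `df2` deliberately reversed (`df2_reversed` and `abs_diff` are made-up names for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'player': ['A', 'B', 'C', 'D'],
                    'points': [12, 15, 17, 24],
                    'assists': [4, 6, 7, 8]})
df2 = pd.DataFrame({'player': ['A', 'B', 'C', 'D'],
                    'points': [12, 24, 26, 29],
                    'assists': [7, 8, 10, 13]})

# Reverse the row order of df2 to show that position does not matter
df2_reversed = df2.iloc[::-1]

# Rows are matched on the 'player' label before subtracting
diff = df1.set_index('player').subtract(df2_reversed.set_index('player'))

# Absolute size of each change, ignoring its direction
abs_diff = diff.abs()
```

Each player's difference is the same as in the table above, even though the second frame was reordered.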


### Find the rows that are equal or different in both dataframes

In [20]:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')

diff_df

#diff_df['Exist'].value_counts()

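| | player | points | assists | Exist |
|---|---|---|---|---|
| 0 | A | 12 | 4 | left_only |
| 1 | B | 15 | 6 | left_only |
| 2 | C | 17 | 7 | left_only |
| 3 | D | 24 | 8 | left_only |
| 4 | A | 12 | 7 | right_only |
| 5 | B | 24 | 8 | right_only |
| 6 | C | 26 | 10 | right_only |
| 7 | D | 29 | 13 | right_only |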
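
The indicator column (named `Exist` here) takes the values `left_only`, `right_only` or `both`, so it can be used as a filter. The following is a sketch using the same example data; `only_in_df1` and `in_both` are illustrative names, not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({'player': ['A', 'B', 'C', 'D'],
                    'points': [12, 15, 17, 24],
                    'assists': [4, 6, 7, 8]})
df2 = pd.DataFrame({'player': ['A', 'B', 'C', 'D'],
                    'points': [12, 24, 26, 29],
                    'assists': [7, 8, 10, 13]})

# Outer merge on all shared columns; 'Exist' records where each row came from
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')

# Rows that appear only in df1
only_in_df1 = diff_df[diff_df['Exist'] == 'left_only'].drop(columns='Exist')

# Rows that appear, with identical values, in both frames
in_both = diff_df[diff_df['Exist'] == 'both'].drop(columns='Exist')
```

With this data `in_both` is empty, because no row has the same player, points and assists in both frames.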


### Using `compare()`

In [21]:
# numpy is optional here: compare() marks equal values with NaN on its own
import numpy as np

first_df = pd.DataFrame(
    {
        "Stationary": ["Pens", "Scales",
                       "Pencils", "Geometry Box",
                       "Crayon Set"],
        "Price": [100, 50, 25, 100, 65],
        "Quantity": [10, 5, 5, 2, 1]
    },
    columns=["Stationary", "Price", "Quantity"],
)

first_df

Unnamed: 0,Stationary,Price,Quantity
0,Pens,100,10
1,Scales,50,5
2,Pencils,25,5
3,Geometry Box,100,2
4,Crayon Set,65,1


In [22]:
# create the second DataFrame by copying the first one and modifying it
second_df = first_df.copy()

second_df.loc[0, 'Price'] = 150
second_df.loc[1, 'Price'] = 70
second_df.loc[2, 'Price'] = 30
second_df.loc[0, 'Quantity'] = 15
second_df.loc[1, 'Quantity'] = 7
second_df.loc[2, 'Quantity'] = 6

second_df

Unnamed: 0,Stationary,Price,Quantity
0,Pens,150,15
1,Scales,70,7
2,Pencils,30,6
3,Geometry Box,100,2
4,Crayon Set,65,1


In [24]:
# by default, each differing column is split into two columns: self and other
first_df.compare(second_df)

| | Price (self) | Price (other) | Quantity (self) | Quantity (other) |
|---|---|---|---|---|
| 0 | 100.0 | 150.0 | 10.0 | 15.0 |
| 1 | 50.0 | 70.0 | 5.0 | 7.0 |
| 2 | 25.0 | 30.0 | 5.0 | 6.0 |


In [26]:
# with align_axis=0, each differing row is split into two rows: self and other
first_df.compare(second_df, align_axis=0)

| | | Price | Quantity |
|---|---|---|---|
| 0 | self | 100.0 | 10.0 |
| 0 | other | 150.0 | 15.0 |
| 1 | self | 50.0 | 5.0 |
| 1 | other | 70.0 | 7.0 |
| 2 | self | 25.0 | 5.0 |
| 2 | other | 30.0 | 6.0 |


By default, `compare()` returns only the rows and columns that contain at least one difference, so rows and columns where everything matches are dropped. Pass `keep_shape=True` to keep every row and column of the original dataframes:

In [35]:
first_df.compare(second_df, keep_shape=True)

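| | Stationary (self) | Stationary (other) | Price (self) | Price (other) | Quantity (self) | Quantity (other) |
|---|---|---|---|---|---|---|
| 0 | NaN | NaN | 100.0 | 150.0 | 10.0 | 15.0 |
| 1 | NaN | NaN | 50.0 | 70.0 | 5.0 | 7.0 |
| 2 | NaN | NaN | 25.0 | 30.0 | 5.0 | 6.0 |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN |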


Equal values are shown as NaN by default so that the differences stand out. Pass `keep_equal=True` to show the actual values instead:

In [36]:
first_df.compare(second_df, keep_shape=True, keep_equal=True)

| | Stationary (self) | Stationary (other) | Price (self) | Price (other) | Quantity (self) | Quantity (other) |
|---|---|---|---|---|---|---|
| 0 | Pens | Pens | 100 | 150 | 10 | 15 |
| 1 | Scales | Scales | 50 | 70 | 5 | 7 |
| 2 | Pencils | Pencils | 25 | 30 | 5 | 6 |
| 3 | Geometry Box | Geometry Box | 100 | 100 | 2 | 2 |
| 4 | Crayon Set | Crayon Set | 65 | 65 | 1 | 1 |
