# 101 Pandas Exercises for Data Analysis

## Index
#### 41. How to count the number of missing values in each column?
#### 42. How to replace missing values of multiple numeric columns with the mean?
#### 43. How to use apply function on existing columns with global variables as additional arguments?
#### 44. How to select a specific column from a dataframe as a dataframe instead of a series?
#### 45. How to change the order of columns of a dataframe?
#### 46. How to set the number of rows and columns displayed in the output?
#### 47. How to format or suppress scientific notations in a pandas dataframe?
#### 48. How to format all the values in a dataframe as percentages?
#### 49. How to filter every nth row in a dataframe?
#### 50. How to create a primary key index by combining relevant columns?


## 41. How to count the number of missing values in each column?

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('Cars93_miss.csv')
df.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,Max.Price,MPG.city,MPG.highway,AirBags,DriveTrain,...,Passengers,Length,Wheelbase,Width,Turn.circle,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,18.8,25.0,31.0,,Front,...,5.0,177.0,102.0,68.0,37.0,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,38.7,18.0,25.0,Driver & Passenger,Front,...,5.0,195.0,115.0,71.0,38.0,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,32.3,20.0,26.0,Driver only,Front,...,5.0,180.0,102.0,67.0,37.0,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,44.6,19.0,26.0,Driver & Passenger,,...,6.0,193.0,106.0,,37.0,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,,22.0,30.0,,Rear,...,4.0,186.0,109.0,69.0,39.0,27.0,13.0,3640.0,non-USA,BMW 535i


In [4]:
# Solution
df.isnull().sum().sort_values(ascending = False)


Luggage.room          19
MPG.city               9
Fuel.tank.capacity     8
Min.Price              7
Weight                 7
DriveTrain             7
Horsepower             7
Rev.per.mile           6
AirBags                6
Width                  6
Man.trans.avail        5
Origin                 5
Turn.circle            5
Max.Price              5
Cylinders              5
Length                 4
Rear.seat.room         4
Manufacturer           4
Type                   3
Make                   3
RPM                    3
MPG.highway            2
EngineSize             2
Price                  2
Passengers             2
Wheelbase              1
Model                  1
dtype: int64

In [5]:
# Alternatives

n_missings_each_col = df.apply(lambda x: x.isnull().sum())
n_missings_each_col.idxmax()  # idxmax returns the column label; argmax now returns a position



'Luggage.room'
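
The same counting idea can be tried without the CSV file. The sketch below is only an illustration: the frame `toy` and its values are made up, not taken from `Cars93_miss.csv`. It counts missing values per column, finds the column with the most gaps, and also computes the missing share per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for Cars93_miss.csv (values are invented)
toy = pd.DataFrame({
    "Price": [15.9, np.nan, 29.1, 37.7],
    "Luggage.room": [np.nan, 15.0, np.nan, np.nan],
    "Model": ["Integra", "Legend", None, "100"],
})

missing_per_col = toy.isna().sum()        # number of NaN/None values per column
most_missing = missing_per_col.idxmax()   # label of the column with the most gaps
missing_share = toy.isna().mean()         # fraction of missing values per column

print(missing_per_col.sort_values(ascending=False))
print(most_missing)
```

`isna()` and `isnull()` are aliases, so either spelling gives the same counts.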

## 42. How to replace missing values of multiple numeric columns with the mean?

In [7]:
# Replace missing values in Min.Price and Max.Price columns with their respective mean.
df = pd.read_csv('Cars93_miss.csv')
df.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,Max.Price,MPG.city,MPG.highway,AirBags,DriveTrain,...,Passengers,Length,Wheelbase,Width,Turn.circle,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,18.8,25.0,31.0,,Front,...,5.0,177.0,102.0,68.0,37.0,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,38.7,18.0,25.0,Driver & Passenger,Front,...,5.0,195.0,115.0,71.0,38.0,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,32.3,20.0,26.0,Driver only,Front,...,5.0,180.0,102.0,67.0,37.0,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,44.6,19.0,26.0,Driver & Passenger,,...,6.0,193.0,106.0,,37.0,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,,22.0,30.0,,Rear,...,4.0,186.0,109.0,69.0,39.0,27.0,13.0,3640.0,non-USA,BMW 535i


In [8]:
# Solution
# Fill each column's missing values with that column's own mean
df_out = df[['Min.Price', 'Max.Price']].apply(lambda x: x.fillna(x.mean()))
print(df_out.head())

   Min.Price  Max.Price
0  12.900000  18.800000
1  29.200000  38.700000
2  25.900000  32.300000
3  17.118605  44.600000
4  17.118605  21.459091
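
An alternative that avoids `apply` is to pass the column means straight to `fillna`, which aligns them by column label. This is a minimal sketch on an invented frame (`toy` and its numbers are assumptions for illustration), not the notebook's own solution.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names mirror the exercise
toy = pd.DataFrame({
    "Min.Price": [12.0, np.nan, 18.0],
    "Max.Price": [20.0, 30.0, np.nan],
    "Model": ["A", "B", "C"],
})

cols = ["Min.Price", "Max.Price"]
filled = toy.copy()
# toy[cols].mean() is a Series indexed by column name; fillna matches on those names
filled[cols] = toy[cols].fillna(toy[cols].mean())

print(filled)
```

Working on a copy keeps the original frame intact, which makes it easy to compare before and after.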


## 43. How to use apply function on existing columns with global variables as additional arguments?

In [15]:
df = pd.read_csv('Cars93_miss.csv')
df.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,Max.Price,MPG.city,MPG.highway,AirBags,DriveTrain,...,Passengers,Length,Wheelbase,Width,Turn.circle,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,18.8,25.0,31.0,,Front,...,5.0,177.0,102.0,68.0,37.0,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,38.7,18.0,25.0,Driver & Passenger,Front,...,5.0,195.0,115.0,71.0,38.0,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,32.3,20.0,26.0,Driver only,Front,...,5.0,180.0,102.0,67.0,37.0,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,44.6,19.0,26.0,Driver & Passenger,,...,6.0,193.0,106.0,,37.0,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,,22.0,30.0,,Rear,...,4.0,186.0,109.0,69.0,39.0,27.0,13.0,3640.0,non-USA,BMW 535i


In [21]:
df['Max.Price'].median()

19.15

In [22]:
df['Min.Price'].mean()

17.11860465116279

In [16]:
# Solution
d = {'Min.Price': np.nanmean, 'Max.Price': np.nanmedian}
DD = df[['Min.Price', 'Max.Price']]
DD.head(10)

Unnamed: 0,Min.Price,Max.Price
0,12.9,18.8
1,29.2,38.7
2,25.9,32.3
3,,44.6
4,,
5,14.2,17.3
6,19.9,
7,22.6,24.9
8,26.3,26.3
9,33.0,36.3


In [17]:
D = DD.apply(lambda x, d: x.fillna(d[x.name](x)), args=(d, ))
D.head(10)

Unnamed: 0,Min.Price,Max.Price
0,12.9,18.8
1,29.2,38.7
2,25.9,32.3
3,17.118605,44.6
4,17.118605,19.15
5,14.2,17.3
6,19.9,19.15
7,22.6,24.9
8,26.3,26.3
9,33.0,36.3
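
The lambda above can also be written as a named function, with the lookup dictionary passed either through `args` or as a keyword argument. The sketch below uses an invented frame; `fill_with` and `fillers` are hypothetical names chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen so that mean and median differ
toy = pd.DataFrame({
    "Min.Price": [10.0, np.nan, 20.0, 30.0],
    "Max.Price": [30.0, 50.0, np.nan, 100.0],
})

# Which statistic to use for each column
fillers = {"Min.Price": np.nanmean, "Max.Price": np.nanmedian}

def fill_with(col, funcs):
    """Fill NaNs in one column using the function registered under its name."""
    return col.fillna(funcs[col.name](col))

filled = toy.apply(fill_with, args=(fillers,))    # extra positional argument
filled_kw = toy.apply(fill_with, funcs=fillers)   # same thing as a keyword argument

print(filled)
```

Both calls forward the dictionary to `fill_with` once per column, so the two results are identical.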


## 44. How to select a specific column from a dataframe as a dataframe instead of a series?

In [31]:
df = pd.read_csv('Cars93_miss.csv')
df.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,Max.Price,MPG.city,MPG.highway,AirBags,DriveTrain,...,Passengers,Length,Wheelbase,Width,Turn.circle,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,18.8,25.0,31.0,,Front,...,5.0,177.0,102.0,68.0,37.0,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,38.7,18.0,25.0,Driver & Passenger,Front,...,5.0,195.0,115.0,71.0,38.0,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,32.3,20.0,26.0,Driver only,Front,...,5.0,180.0,102.0,67.0,37.0,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,44.6,19.0,26.0,Driver & Passenger,,...,6.0,193.0,106.0,,37.0,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,,22.0,30.0,,Rear,...,4.0,186.0,109.0,69.0,39.0,27.0,13.0,3640.0,non-USA,BMW 535i


In [32]:
df1 = df.iloc[:, 3:6]
df1.head()

Unnamed: 0,Min.Price,Price,Max.Price
0,12.9,15.9,18.8
1,29.2,33.9,38.7
2,25.9,29.1,32.3
3,,37.7,44.6
4,,30.0,


In [33]:
df2 = df.loc[:, ['Manufacturer','Model','Price']]
df2.head()

Unnamed: 0,Manufacturer,Model,Price
0,Acura,Integra,15.9
1,,Legend,33.9
2,Audi,90,29.1
3,Audi,100,37.7
4,BMW,535i,30.0


In [35]:
df3 = df[['Model', 'Price', 'Min.Price']]
df3.head()

Unnamed: 0,Model,Price,Min.Price
0,Integra,15.9,12.9
1,Legend,33.9,29.2
2,90,29.1,25.9
3,100,37.7,
4,535i,30.0,


In [36]:
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df

Unnamed: 0,a,b,c,d,e
0,0,1,2,3,4
1,5,6,7,8,9
2,10,11,12,13,14
3,15,16,17,18,19


In [28]:
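# Solution: pass a list of labels or positions to get a DataFrame back
# (only the result of the last expression is displayed)
type(df[['a']])
type(df.loc[:, ['a']])
type(df.iloc[:, [0]])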

pandas.core.frame.DataFrame
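
To make the difference visible, the sketch below compares single-bracket and double-bracket selection on the same small frame, and adds `Series.to_frame()` as one more way to get a one-column DataFrame. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list("abcde"))

as_series = toy["a"]             # single brackets: a Series
as_frame = toy[["a"]]            # list of labels: a one-column DataFrame
also_frame = toy["a"].to_frame() # convert an existing Series to a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The shapes tell the two apart: the Series is one-dimensional, while the DataFrame keeps a column axis.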

## 45. How to change the order of columns of a dataframe?

In [37]:
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df

Unnamed: 0,a,b,c,d,e
0,0,1,2,3,4
1,5,6,7,8,9
2,10,11,12,13,14
3,15,16,17,18,19


In [38]:
# Solution Q1
df[list('cbade')]

Unnamed: 0,c,b,a,d,e
0,2,1,0,3,4
1,7,6,5,8,9
2,12,11,10,13,14
3,17,16,15,18,19


In [39]:
df1 = df[['b', 'a', 'd']]
df1.head()

Unnamed: 0,b,a,d
0,1,0,3
1,6,5,8
2,11,10,13
3,16,15,18


In [47]:
# Solution Q2 - No hard coding
def switch_columns(df, col1=None, col2=None):
    colnames = df.columns.tolist()
    i1, i2 = colnames.index(col1), colnames.index(col2)
    colnames[i2], colnames[i1] = colnames[i1], colnames[i2]
    return df[colnames]

In [48]:
switch_columns(df, col1='c', col2='e')

Unnamed: 0,a,b,e,d,c
0,0,1,4,3,2
1,5,6,9,8,7
2,10,11,14,13,12
3,15,16,19,18,17


In [45]:
# Solution Q3
df[sorted(df.columns, reverse = True)]

Unnamed: 0,e,d,c,b,a
0,4,3,2,1,0
1,9,8,7,6,5
2,14,13,12,11,10
3,19,18,17,16,15


In [46]:
# Sort the columns in reverse alphabetical order, that is, column 'e' first through column 'a' last.
df.sort_index(axis=1, ascending=False, inplace=True)
df

Unnamed: 0,e,d,c,b,a
0,4,3,2,1,0
1,9,8,7,6,5
2,14,13,12,11,10
3,19,18,17,16,15
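
Two further variations on reordering, sketched under the assumption that a new frame is wanted rather than an in-place change: `reindex(columns=...)` for an explicit order, and a small helper that moves one column to the front. The helper `move_to_front` is a hypothetical name, not a pandas function.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list("abcde"))

def move_to_front(frame, col):
    """Return a new frame with `col` first and the other columns in their old order."""
    rest = [c for c in frame.columns if c != col]
    return frame[[col] + rest]

fronted = move_to_front(toy, "d")
reordered = toy.reindex(columns=list("cbade"))           # explicit column order
reversed_copy = toy.sort_index(axis=1, ascending=False)  # no inplace=True, so toy is unchanged

print(fronted.columns.tolist())
print(reversed_copy.columns.tolist())
```

Skipping `inplace=True` keeps `toy` in its original order, so later cells do not depend on which cell ran first.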


## 46. How to set the number of rows and columns displayed in the output?

In [49]:
# Change the pandas display settings so that printing the dataframe df shows a maximum of 10 rows and 10 columns.

df = pd.read_csv('Cars93_miss.csv')
df.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,Max.Price,MPG.city,MPG.highway,AirBags,DriveTrain,...,Passengers,Length,Wheelbase,Width,Turn.circle,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,18.8,25.0,31.0,,Front,...,5.0,177.0,102.0,68.0,37.0,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,38.7,18.0,25.0,Driver & Passenger,Front,...,5.0,195.0,115.0,71.0,38.0,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,32.3,20.0,26.0,Driver only,Front,...,5.0,180.0,102.0,67.0,37.0,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,44.6,19.0,26.0,Driver & Passenger,,...,6.0,193.0,106.0,,37.0,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,,22.0,30.0,,Rear,...,4.0,186.0,109.0,69.0,39.0,27.0,13.0,3640.0,non-USA,BMW 535i


In [51]:
# Solution
pd.set_option('display.max_columns', 10)
df.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,...,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,...,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,...,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,...,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,...,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,...,27.0,13.0,3640.0,non-USA,BMW 535i


In [52]:
# Solution

pd.set_option('display.max_rows', 10)
df

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,...,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,...,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,...,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,...,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,...,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,...,27.0,13.0,3640.0,non-USA,BMW 535i
...,...,...,...,...,...,...,...,...,...,...,...
88,Volkswagen,Eurovan,Van,16.6,19.7,...,34.0,,3960.0,,Volkswagen Eurovan
89,Volkswagen,Passat,Compact,17.6,20.0,...,31.5,14.0,2985.0,non-USA,Volkswagen Passat
90,Volkswagen,Corrado,Sporty,22.9,23.3,...,26.0,15.0,2810.0,non-USA,Volkswagen Corrado
91,Volvo,240,Compact,21.8,22.7,...,29.5,14.0,2985.0,non-USA,Volvo 240


In [53]:
pd.describe_option()

compute.use_bottleneck : bool
    Use the bottleneck library to accelerate if it is installed,
    the default is True
    Valid values: False,True
    [default: True] [currently: True]

compute.use_numexpr : bool
    Use the numexpr library to accelerate computation if it is installed,
    the default is True
    Valid values: False,True
    [default: True] [currently: True]

display.chop_threshold : float or None
    if set to a float value, all float values smaller then the given threshold
    will be displayed as exactly 0 by repr and friends.
    [default: None] [currently: None]

display.colheader_justify : 'left'/'right'
    Controls the justification of column headers. used by DataFrameFormatter.
    [default: right] [currently: right]

display.column_space No description available.
    [default: 12] [currently: 12]

display.date_dayfirst : boolean
    When True, prints and parses dates with the day first, eg 20/01/2005
    [default: False] [currently: False]

display.date_year
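
`pd.set_option` changes the setting for the rest of the session. When the limit is only needed for one print, `pd.option_context` applies it temporarily. The sketch below uses an invented 30 x 20 frame to show that the option is restored afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical frame that is larger than the limits in both directions
toy = pd.DataFrame(np.arange(600).reshape(30, 20))

before = pd.get_option("display.max_rows")

# Options are given as alternating name/value pairs and apply only inside the block
with pd.option_context("display.max_rows", 10, "display.max_columns", 10):
    inside = pd.get_option("display.max_rows")
    text = repr(toy)  # truncated representation

after = pd.get_option("display.max_rows")

print(text)
print(before, inside, after)
```

`pd.reset_option("display.max_rows")` is the matching way to undo a `pd.set_option` call.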

## 47. How to format or suppress scientific notations in a pandas dataframe?

In [55]:
# Suppress scientific notation like 'e-03' in df and print up to 4 digits after the decimal point.

df = pd.DataFrame(np.random.random(10)**10, columns=['random'])
df


Unnamed: 0,random
0,0.0002086188
1,1.491652e-13
2,0.0282579
3,0.0474576
4,0.002527086
5,0.02445606
6,7.67479e-05
7,0.01638509
8,0.07253868
9,0.001444218


In [56]:
# Solution 1: Rounding
df.round(4)


Unnamed: 0,random
0,0.0002
1,0.0
2,0.0283
3,0.0475
4,0.0025
5,0.0245
6,0.0001
7,0.0164
8,0.0725
9,0.0014


In [58]:
# Solution 2: Use apply to change format (each row is a one-element Series, so take its value explicitly)
df.apply(lambda x: '%.4f' % x.iloc[0], axis=1)


0    0.0002
1    0.0000
2    0.0283
3    0.0475
4    0.0025
5    0.0245
6    0.0001
7    0.0164
8    0.0725
9    0.0014
dtype: object

In [59]:
# or element-wise; in pandas 2.1+ applymap is deprecated in favor of DataFrame.map
df.applymap(lambda x: '%.4f' % x)

Unnamed: 0,random
0,0.0002
1,0.0
2,0.0283
3,0.0475
4,0.0025
5,0.0245
6,0.0001
7,0.0164
8,0.0725
9,0.0014
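
The solutions above either change the stored values (`round`) or turn them into strings. If only the display should change, the `display.float_format` option is another route. This sketch uses three hard-coded values instead of random numbers so the result is reproducible.

```python
import pandas as pd

# Fixed values (instead of np.random) so the printed result is repeatable
toy = pd.DataFrame({"random": [2.086188e-04, 1.491652e-13, 2.82579e-02]})

# The formatter is used only while rendering; the stored floats are untouched
with pd.option_context("display.float_format", "{:.4f}".format):
    text = repr(toy)

print(text)
print(toy["random"].iloc[1])  # still the full-precision float
```

Using `pd.set_option('display.float_format', '{:.4f}'.format)` instead would apply the same formatter for the whole session.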


## 48. How to format all the values in a dataframe as percentages?

In [61]:
df = pd.DataFrame(np.random.random(7), columns=['random'])
df

Unnamed: 0,random
0,0.857155
1,0.545197
2,0.745719
3,0.048728
4,0.948528
5,0.202956
6,0.88392


In [68]:
# Solution
out = df.style.format({'random': '{0:.3%}'.format})
out


Unnamed: 0,random
0,85.715%
1,54.520%
2,74.572%
3,4.873%
4,94.853%
5,20.296%
6,88.392%
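
`df.style.format` only affects how the frame is rendered. If percentage strings are needed as actual values (for export, say), the same format spec can be mapped over the columns. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical values chosen to avoid rounding ambiguity
toy = pd.DataFrame({"random": [0.25, 0.048728, 0.5]})

# One column: map the format function over the values
as_text = toy["random"].map("{:.3%}".format)

# All columns: apply the same mapping column by column
all_text = toy.apply(lambda col: col.map("{:.3%}".format))

print(as_text.tolist())
```

The result holds strings, so it should be kept separate from the numeric frame used for calculations.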


## 49. How to filter every nth row in a dataframe?

In [69]:
# From df, select the 'Manufacturer', 'Model' and 'Type' columns for every 20th row, starting from the first (row 0).

df = pd.read_csv('Cars93_miss.csv')
df.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,...,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,...,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,...,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,...,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,...,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,...,27.0,13.0,3640.0,non-USA,BMW 535i


In [74]:
# Solution

df.loc[::20,['Manufacturer','Model','Type']].head()

Unnamed: 0,Manufacturer,Model,Type
0,Acura,Integra,Small
20,Chrysler,LeBaron,Compact
40,Honda,Prelude,Sporty
60,Mercury,Cougar,Midsize
80,Subaru,Loyale,Small


In [77]:
# Solution

df.iloc[::20, :][['Manufacturer', 'Model', 'Type']]

Unnamed: 0,Manufacturer,Model,Type
0,Acura,Integra,Small
20,Chrysler,LeBaron,Compact
40,Honda,Prelude,Sporty
60,Mercury,Cougar,Midsize
80,Subaru,Loyale,Small
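
The step slice works by position, so it behaves the same when the index does not start at 0. The sketch below uses an invented 50-row frame with an index starting at 100, and shows an equivalent boolean mask built from row positions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index labels start at 100
toy = pd.DataFrame(
    {"Model": [f"m{i}" for i in range(50)], "Type": ["Small", "Van"] * 25},
    index=range(100, 150),
)

every_20th = toy.iloc[::20][["Model", "Type"]]   # positions 0, 20, 40
offset = toy.iloc[5::20]                         # same step, starting at position 5

# Equivalent selection with a mask on the row position
mask = np.arange(len(toy)) % 20 == 0
via_mask = toy.loc[mask, ["Model", "Type"]]

print(every_20th)
```

The mask form is handy when the condition on the position is more involved than a fixed step.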


## 50. How to create a primary key index by combining relevant columns?

In [79]:
df = pd.read_csv('Cars93_miss.csv')
df.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,...,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,...,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,...,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,...,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,...,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,...,27.0,13.0,3640.0,non-USA,BMW 535i


In [81]:
# Solution
df[['Manufacturer', 'Model', 'Type']] = df[['Manufacturer', 'Model', 'Type']].fillna('missing')
df[['Manufacturer', 'Model', 'Type']].head()

Unnamed: 0,Manufacturer,Model,Type
0,Acura,Integra,Small
1,missing,Legend,Midsize
2,Audi,90,Compact
3,Audi,100,Midsize
4,BMW,535i,Midsize


In [85]:
df.index = df.Manufacturer + '_' + df.Model + '_' + df.Type
print(df.index)


Index(['Acura_Integra_Small', 'missing_Legend_Midsize', 'Audi_90_Compact',
       'Audi_100_Midsize', 'BMW_535i_Midsize', 'Buick_Century_Midsize',
       'Buick_LeSabre_Large', 'Buick_Roadmaster_Large',
       'Buick_Riviera_Midsize', 'Cadillac_DeVille_Large',
       'Cadillac_Seville_Midsize', 'Chevrolet_Cavalier_Compact',
       'Chevrolet_Corsica_Compact', 'Chevrolet_Camaro_Sporty',
       'Chevrolet_Lumina_Midsize', 'Chevrolet_Lumina_APV_Van',
       'Chevrolet_Astro_Van', 'Chevrolet_Caprice_Large',
       'Chevrolet_Corvette_Sporty', 'missing_Concorde_Large',
       'Chrysler_LeBaron_Compact', 'Chrysler_Imperial_Large',
       'Dodge_Colt_Small', 'Dodge_Shadow_Small', 'Dodge_Spirit_Compact',
       'Dodge_Caravan_Van', 'Dodge_Dynasty_Midsize', 'Dodge_Stealth_Sporty',
       'Eagle_Summit_Small', 'Eagle_Vision_Large', 'Ford_Festiva_Small',
       'Ford_Escort_Small', 'Ford_Tempo_Compact', 'Ford_Mustang_Sporty',
       'Ford_Probe_Sporty', 'Ford_Aerostar_Van', 'Ford_Taurus_Midsize',

In [86]:
print(df.index.is_unique)

True
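
The same key can be built without overwriting the original columns, and `set_index(..., verify_integrity=True)` raises an error if the combined key has duplicates. This is a sketch on three invented rows; the index name `key` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical rows; one Manufacturer is missing on purpose
toy = pd.DataFrame({
    "Manufacturer": ["Acura", None, "Audi"],
    "Model": ["Integra", "Legend", "90"],
    "Type": ["Small", "Midsize", "Compact"],
})

key_cols = ["Manufacturer", "Model", "Type"]
parts = toy[key_cols].fillna("missing")  # temporary copy; toy keeps its missing value

# Join the three columns with '_' to form the key
key = parts["Manufacturer"].str.cat([parts["Model"], parts["Type"]], sep="_")

# verify_integrity=True raises ValueError when the key is not unique
keyed = toy.set_index(key.rename("key"), verify_integrity=True)

print(keyed.index.tolist())
```

Checking uniqueness at the moment the index is set catches duplicate keys earlier than a separate `is_unique` check would.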
