# Exploring the Metadata of Data Frames


## What is metadata?

Metadata is data about data.  Or, as [Wikipedia says]( https://en.wikipedia.org/wiki/Metadata ):

> "data that provides information about other data", but not the content of the data itself, such as the text of a message or the image itself.

For example, given a file, the data are the contents of the file.  The metadata is the data about the file: the size, date, ownership, group membership, permissions, the file type, etc.
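As a small illustration of file metadata, the sketch below writes a throwaway file and reads its metadata back with the standard library's `os.stat`.  The file name and contents are made up for the example, and the owner/permission fields are most meaningful on Unix-like systems.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name

    # The contents of the file are the "data".
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("hello, metadata\n")

    # os.stat returns the metadata: size, timestamps, owner, permissions, ...
    file_stat = os.stat(path)
    file_metadata = {
        "size_bytes": file_stat.st_size,
        "modified_utc": datetime.fromtimestamp(
            file_stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        "owner_uid": file_stat.st_uid,
        "group_gid": file_stat.st_gid,
        "permissions": stat.filemode(file_stat.st_mode),
        "is_regular_file": stat.S_ISREG(file_stat.st_mode),
    }

print(file_metadata)
```

None of these values say anything about what the file contains; they only describe the file.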

A soundtrack will contain audio, but the metadata would include the length of the recording, the recording date, the author's name, any copyright info, the performers' names, etc.

A digital photo contains the image, while its metadata includes the date, time, exposure settings, GPS info, camera make/model, etc.  Usually this information is stored in the [EXIF portion]( https://en.wikipedia.org/wiki/Exif ) of the digital image.



For a data frame, the metadata is the number of rows and columns, the column names, the column data types, the row indices, the value counts of categorical data, the min/median/mean/max of continuous data, etc.

For a web resource ( e.g. file ), the metadata is what is in the headers of the response to an HTTP request, e.g. the server type, the last update date, the content-length, the content-type, etc.

Metadata is useful for understanding the data itself.  For machine learning ( ML ), the metadata helps determine which algorithms are appropriate for the data and which transformations are needed when the data is not yet in a suitable form.



## Exploring dataframes


In [1]:
import pandas as pd


In [2]:
url = "https://ddc-datascience.s3.amazonaws.com/Projects/Project.2-Housing/Data/Housing.Data.csv"
housing = pd.read_csv( url )


### Getting the shape ( rows, columns ) of a data frame


In [3]:
housing.shape

(2637, 81)

In [4]:
type(housing.shape)

tuple

In [5]:
rows, columns = housing.shape
{
  "Rows": rows,
  "Columns" : columns
}

{'Rows': 2637, 'Columns': 81}

### Data types and nulls

In [6]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2637 entries, 0 to 2636
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   PID              2637 non-null   int64  
 1   MS SubClass      2637 non-null   int64  
 2   MS Zoning        2637 non-null   object 
 3   Lot Frontage     2188 non-null   float64
 4   Lot Area         2637 non-null   int64  
 5   Street           2637 non-null   object 
 6   Alley            180 non-null    object 
 7   Lot Shape        2637 non-null   object 
 8   Land Contour     2637 non-null   object 
 9   Utilities        2637 non-null   object 
 10  Lot Config       2637 non-null   object 
 11  Land Slope       2637 non-null   object 
 12  Neighborhood     2637 non-null   object 
 13  Condition 1      2637 non-null   object 
 14  Condition 2      2637 non-null   object 
 15  Bldg Type        2637 non-null   object 
 16  House Style      2637 non-null   object 
 17  Overall Qual  

Notice that info() provides five kinds of information:
- the number of rows and columns
- the number of non-null values in each column
- the data type of each column
- a count of columns per data type
- the amount of memory used by the data frame

How can I extract that information to work with it?

In [7]:
type(housing.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2637 entries, 0 to 2636
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   PID              2637 non-null   int64  
 1   MS SubClass      2637 non-null   int64  
 2   MS Zoning        2637 non-null   object 
 3   Lot Frontage     2188 non-null   float64
 4   Lot Area         2637 non-null   int64  
 5   Street           2637 non-null   object 
 6   Alley            180 non-null    object 
 7   Lot Shape        2637 non-null   object 
 8   Land Contour     2637 non-null   object 
 9   Utilities        2637 non-null   object 
 10  Lot Config       2637 non-null   object 
 11  Land Slope       2637 non-null   object 
 12  Neighborhood     2637 non-null   object 
 13  Condition 1      2637 non-null   object 
 14  Condition 2      2637 non-null   object 
 15  Bldg Type        2637 non-null   object 
 16  House Style      2637 non-null   object 
 17  Overall Qual  

NoneType

The return value is None ( type NoneType ): info() prints its report but returns nothing.  Hmmm.  That means that you cannot ( easily ) do anything with the information it gives you.
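One way to at least capture the printed report is the `buf` parameter of `info()`, which redirects the text into a file-like object.  The sketch below uses a small made-up frame in place of `housing`; the result is still just a string, so it would need parsing before it could be used as data.

```python
import io

import pandas as pd

# Toy frame standing in for `housing` (values are made up).
toy = pd.DataFrame({"price": [100, 200, 300], "alley": ["Pave", None, None]})

buffer = io.StringIO()
result = toy.info(buf=buffer)  # writes the report into the buffer, returns None
info_text = buffer.getvalue()  # the report as an ordinary string

print(result is None)
print(info_text)
```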


Let's put this aside for the moment.

### Some basic statistics


In [8]:
housing.describe()

Unnamed: 0,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice
count,2637.0,2637.0,2188.0,2637.0,2637.0,2637.0,2637.0,2637.0,2614.0,2636.0,...,2637.0,2637.0,2637.0,2637.0,2637.0,2637.0,2637.0,2637.0,2637.0,2637.0
mean,714130100.0,57.349261,69.166819,10044.694729,6.097459,5.569966,1971.288586,1984.202882,101.887911,438.441199,...,94.305271,46.984452,22.813424,2.368601,15.775123,2.130072,42.014031,6.243838,2007.795601,179986.230186
std,188752700.0,42.499091,23.356779,6742.549521,1.411522,1.118262,30.306986,20.913077,179.578232,449.602326,...,126.927272,66.564333,61.177638,23.1606,55.783751,35.14014,393.158781,2.722093,1.306403,78309.251522
min,526301100.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,12789.0
25%,528477000.0,20.0,58.0,7436.0,5.0,5.0,1954.0,1965.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0,129500.0
50%,535453000.0,50.0,68.0,9450.0,6.0,5.0,1973.0,1993.0,0.0,368.0,...,0.0,27.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,160000.0
75%,907187000.0,70.0,80.0,11526.0,7.0,6.0,2001.0,2004.0,164.0,732.0,...,168.0,70.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,213000.0
max,1007100000.0,190.0,313.0,164660.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,1424.0,742.0,584.0,407.0,576.0,800.0,12500.0,12.0,2010.0,745000.0


In [9]:
type(housing.describe())

pandas.core.frame.DataFrame

A data frame.  I can work with that using all the methods that a data frame has.

In [10]:
housing.describe().columns

Index(['PID', 'MS SubClass', 'Lot Frontage', 'Lot Area', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area',
       'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area',
       'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath',
       'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces',
       'Garage Yr Blt', 'Garage Cars', 'Garage Area', 'Wood Deck SF',
       'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch',
       'Pool Area', 'Misc Val', 'Mo Sold', 'Yr Sold', 'SalePrice'],
      dtype='object')

In [11]:
len(housing.describe().columns)

38

In [12]:
len(housing.describe( include = "all" ).columns)

81
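The jump from 38 to 81 columns comes from the default behaviour of `describe()`, which summarises only the numeric columns unless `include = "all"` is passed.  A minimal sketch on a made-up frame ( the column names and values are assumptions for illustration ):

```python
import pandas as pd

toy = pd.DataFrame({
    "lot_area": [8450, 9600, 11250],
    "year_built": [2003, 1976, 2001],
    "zoning": pd.Series(["RL", "RL", "RM"], dtype=object),
})

numeric_only = toy.describe()              # numeric columns only
everything = toy.describe(include="all")   # numeric and non-numeric columns

print(list(numeric_only.columns))
print(list(everything.columns))
print(list(everything.index))
```

With `include = "all"` the summary also gains the rows `unique`, `top`, and `freq`, which are filled in only for the non-numeric columns.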

In [13]:
(
  housing
  .describe( include = "all" )
  .transpose()
  .astype({"count": int})
)

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
PID,2637,,,,714130147.70383,188752674.750322,526301100.0,528477010.0,535453040.0,907187010.0,1007100110.0
MS SubClass,2637,,,,57.349261,42.499091,20.0,20.0,50.0,70.0,190.0
MS Zoning,2637,7,RL,2043,,,,,,,
Lot Frontage,2188,,,,69.166819,23.356779,21.0,58.0,68.0,80.0,313.0
Lot Area,2637,,,,10044.694729,6742.549521,1300.0,7436.0,9450.0,11526.0,164660.0
...,...,...,...,...,...,...,...,...,...,...,...
Mo Sold,2637,,,,6.243838,2.722093,1.0,4.0,6.0,8.0,12.0
Yr Sold,2637,,,,2007.795601,1.306403,2006.0,2007.0,2008.0,2009.0,2010.0
Sale Type,2637,10,WD,2286,,,,,,,
Sale Condition,2637,6,Normal,2166,,,,,,,


### Nulls

In [14]:
housing.isnull()

Unnamed: 0,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,False,False,False,False,False,False,True,False,False,False,...,False,True,False,True,False,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,...,False,True,True,False,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,...,False,True,True,True,False,False,False,False,False,False
3,False,False,False,False,False,False,True,False,False,False,...,False,True,True,True,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,...,False,True,False,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2632,False,False,False,False,False,False,True,False,False,False,...,False,True,False,False,False,False,False,False,False,False
2633,False,False,False,False,False,False,True,False,False,False,...,False,True,True,True,False,False,False,False,False,False
2634,False,False,False,False,False,False,True,False,False,False,...,False,True,True,True,False,False,False,False,False,False
2635,False,False,False,False,False,False,True,False,False,False,...,False,True,False,True,False,False,False,False,False,False


In [15]:
type(housing.isnull())

pandas.core.frame.DataFrame

In [16]:
housing.isnull().sum()

PID                 0
MS SubClass         0
MS Zoning           0
Lot Frontage      449
Lot Area            0
                 ... 
Mo Sold             0
Yr Sold             0
Sale Type           0
Sale Condition      0
SalePrice           0
Length: 81, dtype: int64
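Summing the boolean frame works because each `True` counts as 1 and each `False` as 0, so the column sums are the null counts.  A small sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "lot_frontage": [65.0, None, 80.0, None],
    "alley": pd.Series([None, None, None, "Grvl"], dtype=object),
    "lot_area": [8450, 9600, 11250, 9550],
})

null_mask = toy.isnull()       # same shape as toy, True where a value is missing
null_counts = null_mask.sum()  # column-wise sum of the True values

print(null_counts.to_dict())
```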

In [17]:
type(housing.isnull().sum())

pandas.core.series.Series

#### Calculate proportion of nulls

In [18]:
nulls = housing.isnull().sum()
has_nulls = nulls > 0   # named has_nulls to avoid shadowing the built-in filter()
( nulls[ has_nulls ].sort_values( ascending = False ) / housing.shape[0] * 100 ).round(1)

Pool QC           99.6
Misc Feature      96.4
Alley             93.2
Fence             80.0
Mas Vnr Type      60.9
Fireplace Qu      48.7
Lot Frontage      17.0
Garage Yr Blt      5.6
Garage Cond        5.6
Garage Qual        5.6
Garage Finish      5.6
Garage Type        5.5
Bsmt Exposure      2.9
BsmtFin Type 2     2.8
Bsmt Qual          2.8
BsmtFin Type 1     2.8
Bsmt Cond          2.8
Mas Vnr Area       0.9
Bsmt Half Bath     0.1
Bsmt Full Bath     0.1
BsmtFin SF 1       0.0
Garage Cars        0.0
Garage Area        0.0
Total Bsmt SF      0.0
Bsmt Unf SF        0.0
BsmtFin SF 2       0.0
dtype: float64
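Dividing the null counts by the number of rows is the same as taking the mean of the boolean frame, which gives a slightly shorter way to write the percentage.  A sketch on made-up data, comparing the two forms:

```python
import pandas as pd

toy = pd.DataFrame({
    "lot_frontage": [65.0, None, 80.0, None],
    "alley": pd.Series([None, None, None, "Grvl"], dtype=object),
    "lot_area": [8450, 9600, 11250, 9550],
})

# Count the nulls, divide by the number of rows, convert to percent.
pct_from_sum = (toy.isnull().sum() / toy.shape[0] * 100).round(1)

# The mean of a boolean column is the fraction of True values.
pct_from_mean = (toy.isnull().mean() * 100).round(1)

print(pct_from_mean.to_dict())
```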

### Data types

In [19]:
housing.dtypes

PID                 int64
MS SubClass         int64
MS Zoning          object
Lot Frontage      float64
Lot Area            int64
                   ...   
Mo Sold             int64
Yr Sold             int64
Sale Type          object
Sale Condition     object
SalePrice           int64
Length: 81, dtype: object

In [20]:
type(housing.dtypes)

pandas.core.series.Series

### Memory usage

In [21]:
housing.memory_usage(deep=False, index = False)

PID               21096
MS SubClass       21096
MS Zoning         21096
Lot Frontage      21096
Lot Area          21096
                  ...  
Mo Sold           21096
Yr Sold           21096
Sale Type         21096
Sale Condition    21096
SalePrice         21096
Length: 81, dtype: int64

In [22]:
housing.memory_usage(deep=False, index = False).sum()

1708776

In [23]:
housing.memory_usage(deep=True, index = False).sum()

7303051
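The shallow figure is the same for every column because it counts 8 bytes per row ( 2637 × 8 = 21096 ), and for an object column those 8 bytes are only a pointer.  With `deep = True`, pandas also counts the Python objects the pointers refer to, which is why the total grows.  A sketch on made-up data, with the string column forced to `object` dtype to mirror the notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "sale_price": pd.Series([215000, 105000, 172000], dtype="int64"),
    "neighborhood": pd.Series(["NAmes", "Gilbert", "StoneBr"], dtype=object),
})

shallow = toy.memory_usage(deep=False, index=False)
deep = toy.memory_usage(deep=True, index=False)

print(shallow.to_dict())
print(deep.to_dict())
```

The numeric column reports the same size either way; only the object column changes.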

## Automate: first pass

In [24]:
df = housing
df_info = pd.concat( [
  pd.DataFrame( [ df.isna().sum().to_dict() ], index = ["Nulls"] ).transpose().astype( { "Nulls": int } ),
  pd.DataFrame( [ df.dtypes.to_dict() ], index = [ "Data_Types"] ).transpose().astype( { "Data_Types": 'category' }),
  pd.DataFrame( [ df.memory_usage(deep=True, index = False).to_dict() ], index = ["Memory"] ).transpose(),
  df.describe( include = "all" ).transpose().astype( { "count": int } ).rename( columns = { "50%" : "median"} ),
], axis = "columns")
df_info["IQR"] = df_info["75%"] - df_info["25%"]   # interquartile range
df_info["range"] = df_info["max"] - df_info["min"]
df_info["sum"] = df_info["mean"] * df_info["count"]
# percent of rows that are null: divide by the row count of df, not of df_info
df_info.insert( 1, "Nulls_pct" , ( df_info["Nulls"] / df.shape[0] * 100 ).round(1) )


print( f"Memory: { (df_info['Memory'].sum() / 1000 / 1000 ):_} MB" )
print( df_info["Data_Types"].value_counts().to_json() )
df_info.sort_values(by = ["Data_Types", "max"], ascending=[True, False])


Memory: 7.303051 MB
{"object":43,"int64":27,"float64":11}


Unnamed: 0,Nulls,Nulls_pct,Data_Types,Memory,count,unique,top,freq,mean,std,min,25%,median,75%,max,IQR,range,sum
PID,0,0.0,int64,21096,2637,,,,714130147.70383,188752674.750322,526301100.0,528477010.0,535453040.0,907187010.0,1007100110.0,378710000.0,480799010.0,1883161199495.0
SalePrice,0,0.0,int64,21096,2637,,,,179986.230186,78309.251522,12789.0,129500.0,160000.0,213000.0,745000.0,83500.0,732211.0,474623689.0
Lot Area,0,0.0,int64,21096,2637,,,,10044.694729,6742.549521,1300.0,7436.0,9450.0,11526.0,164660.0,4090.0,163360.0,26487860.0
Misc Val,0,0.0,int64,21096,2637,,,,42.014031,393.158781,0.0,0.0,0.0,0.0,12500.0,0.0,12500.0,110791.0
Gr Liv Area,0,0.0,int64,21096,2637,,,,1496.98521,495.209631,334.0,1128.0,1441.0,1740.0,5642.0,612.0,5308.0,3947550.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Pool QC,2626,99.6,object,84681,11,4,Gd,4,,,,,,,,,,
Fence,2109,80.0,object,100110,528,4,MnPrv,306,,,,,,,,,,
Misc Feature,2541,96.4,object,87168,96,4,Shed,87,,,,,,,,,,
Sale Type,0,0.0,object,158288,2637,10,WD,2286,,,,,,,,,,


Notes:
- remove columns in which 20% or more of the values are null
- for columns in which 5% or fewer of the values are null, remove the rows that contain those nulls
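Reading the two notes as "drop columns that are at least 20% null" and "for columns that are at most 5% null, drop the rows holding those nulls", a minimal sketch might look like the following.  The helper name `apply_null_rules`, its parameters, and the toy data are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def apply_null_rules(frame, col_pct=20, row_pct=5):
    """Hypothetical helper: drop sparse columns, then rows with rare nulls."""
    null_frac = frame.isnull().mean()  # fraction of nulls per column

    # Rule 1: drop columns where at least col_pct percent of values are null.
    sparse_cols = null_frac[null_frac >= col_pct / 100].index
    trimmed = frame.drop(columns=sparse_cols)

    # Rule 2: for columns with some nulls, but at most row_pct percent,
    # drop the rows that contain those nulls.
    rare = (null_frac > 0) & (null_frac <= row_pct / 100)
    rare_cols = list(null_frac[rare].index)
    return trimmed.dropna(subset=rare_cols)


n = 20  # made-up data: 20 rows so that one null is exactly 5%
toy = pd.DataFrame({
    "lot_area": list(range(n)),
    "alley": pd.Series(["Grvl", "Pave"] + [None] * (n - 2), dtype=object),  # 90% null
    "garage_cars": [None] + [2.0] * (n - 1),                                # 5% null
    "lot_frontage": [60.0] * (n - 3) + [None] * 3,                          # 15% null
})

cleaned = apply_null_rules(toy)
print(cleaned.shape)
print(list(cleaned.columns))
```

Columns that fall between the two thresholds ( here `lot_frontage` ) are left untouched by this sketch; they would need a separate decision, such as imputation.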




In [25]:
for dt in df_info["Data_Types"].unique():
  print(dt)

int64
object
float64


In [26]:
df_info["Data_Types"].value_counts()


Data_Types
object     43
int64      27
float64    11
Name: count, dtype: int64

In [27]:
for dt in df_info["Data_Types"].unique():
  is_dt = df_info["Data_Types"] == dt   # named is_dt to avoid shadowing the built-in filter()
  cols = df_info[ is_dt ].index
  housing[ cols ].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2637 entries, 0 to 2636
Data columns (total 27 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   PID              2637 non-null   int64
 1   MS SubClass      2637 non-null   int64
 2   Lot Area         2637 non-null   int64
 3   Overall Qual     2637 non-null   int64
 4   Overall Cond     2637 non-null   int64
 5   Year Built       2637 non-null   int64
 6   Year Remod/Add   2637 non-null   int64
 7   1st Flr SF       2637 non-null   int64
 8   2nd Flr SF       2637 non-null   int64
 9   Low Qual Fin SF  2637 non-null   int64
 10  Gr Liv Area      2637 non-null   int64
 11  Full Bath        2637 non-null   int64
 12  Half Bath        2637 non-null   int64
 13  Bedroom AbvGr    2637 non-null   int64
 14  Kitchen AbvGr    2637 non-null   int64
 15  TotRms AbvGrd    2637 non-null   int64
 16  Fireplaces       2637 non-null   int64
 17  Wood Deck SF     2637 non-null   int64
 18  Open Por

In [28]:
is_int64 = df_info["Data_Types"] == "int64"
cols = df_info[ is_int64 ].index
housing[ cols ].nunique().sort_values(ascending = False) / housing.shape[0] * 100

Series([], dtype: float64)
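The empty result above is probably a side effect of how `Data_Types` was built in the first pass: it appears to hold NumPy dtype objects ( as categories ) rather than strings, so comparing it with the string "int64" matches nothing.  As a hedged alternative, the sketch below selects the integer columns either by comparing the dtype names as strings or with `select_dtypes`, using a made-up frame in place of `housing`.

```python
import pandas as pd

toy = pd.DataFrame({
    "pid": pd.Series([101, 102, 103, 104], dtype="int64"),
    "mo_sold": pd.Series([5, 6, 6, 4], dtype="int64"),
    "lot_frontage": [141.0, 80.0, 81.0, 93.0],
    "zoning": pd.Series(["RL", "RH", "RL", "RL"], dtype=object),
})

# Option 1: compare the dtype names as plain strings.
is_int64 = toy.dtypes.astype(str) == "int64"
int_cols = toy.columns[is_int64.to_numpy()]

# Option 2: let pandas select the columns by dtype.
int_frame = toy.select_dtypes(include="int64")

# Percent of rows holding a distinct value, per integer column.
unique_pct = (int_frame.nunique() / len(toy) * 100).sort_values(ascending=False)

print(list(int_cols))
print(unique_pct.to_dict())
```

A column whose values are 100% unique, like an identifier, is usually not useful as a model feature.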

## Automate: second pass

In [29]:
metatdata_df = pd.DataFrame()
metatdata_df

In [30]:
metatdata_df["Nulls"] = housing.isnull().sum()
metatdata_df

Unnamed: 0,Nulls
PID,0
MS SubClass,0
MS Zoning,0
Lot Frontage,449
Lot Area,0
...,...
Mo Sold,0
Yr Sold,0
Sale Type,0
Sale Condition,0


In [31]:
metatdata_df["Nulls_pct"] = ( metatdata_df["Nulls"] / housing.shape[0] * 100 ).round(1)
metatdata_df

Unnamed: 0,Nulls,Nulls_pct
PID,0,0.0
MS SubClass,0,0.0
MS Zoning,0,0.0
Lot Frontage,449,17.0
Lot Area,0,0.0
...,...,...
Mo Sold,0,0.0
Yr Sold,0,0.0
Sale Type,0,0.0
Sale Condition,0,0.0


In [32]:
metatdata_df["Data_types"] = housing.dtypes
metatdata_df

Unnamed: 0,Nulls,Nulls_pct,Data_types
PID,0,0.0,int64
MS SubClass,0,0.0,int64
MS Zoning,0,0.0,object
Lot Frontage,449,17.0,float64
Lot Area,0,0.0,int64
...,...,...,...
Mo Sold,0,0.0,int64
Yr Sold,0,0.0,int64
Sale Type,0,0.0,object
Sale Condition,0,0.0,object


In [33]:
metatdata_df["Memory"] = housing.memory_usage( deep = True)
metatdata_df

Unnamed: 0,Nulls,Nulls_pct,Data_types,Memory
PID,0,0.0,int64,21096
MS SubClass,0,0.0,int64,21096
MS Zoning,0,0.0,object,155728
Lot Frontage,449,17.0,float64,21096
Lot Area,0,0.0,int64,21096
...,...,...,...,...
Mo Sold,0,0.0,int64,21096
Yr Sold,0,0.0,int64,21096
Sale Type,0,0.0,object,158288
Sale Condition,0,0.0,object,166537


In [34]:
metatdata_df = metatdata_df.join( housing.describe( include = "all" ).transpose() )
metatdata_df

Unnamed: 0,Nulls,Nulls_pct,Data_types,Memory,count,unique,top,freq,mean,std,min,25%,50%,75%,max
PID,0,0.0,int64,21096,2637.0,,,,714130147.70383,188752674.750322,526301100.0,528477010.0,535453040.0,907187010.0,1007100110.0
MS SubClass,0,0.0,int64,21096,2637.0,,,,57.349261,42.499091,20.0,20.0,50.0,70.0,190.0
MS Zoning,0,0.0,object,155728,2637,7,RL,2043,,,,,,,
Lot Frontage,449,17.0,float64,21096,2188.0,,,,69.166819,23.356779,21.0,58.0,68.0,80.0,313.0
Lot Area,0,0.0,int64,21096,2637.0,,,,10044.694729,6742.549521,1300.0,7436.0,9450.0,11526.0,164660.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mo Sold,0,0.0,int64,21096,2637.0,,,,6.243838,2.722093,1.0,4.0,6.0,8.0,12.0
Yr Sold,0,0.0,int64,21096,2637.0,,,,2007.795601,1.306403,2006.0,2007.0,2008.0,2009.0,2010.0
Sale Type,0,0.0,object,158288,2637,10,WD,2286,,,,,,,
Sale Condition,0,0.0,object,166537,2637,6,Normal,2166,,,,,,,


In [35]:
metatdata_df = metatdata_df.astype( {"count": int }).rename( columns = {"50%": "median"})
metatdata_df

Unnamed: 0,Nulls,Nulls_pct,Data_types,Memory,count,unique,top,freq,mean,std,min,25%,median,75%,max
PID,0,0.0,int64,21096,2637,,,,714130147.70383,188752674.750322,526301100.0,528477010.0,535453040.0,907187010.0,1007100110.0
MS SubClass,0,0.0,int64,21096,2637,,,,57.349261,42.499091,20.0,20.0,50.0,70.0,190.0
MS Zoning,0,0.0,object,155728,2637,7,RL,2043,,,,,,,
Lot Frontage,449,17.0,float64,21096,2188,,,,69.166819,23.356779,21.0,58.0,68.0,80.0,313.0
Lot Area,0,0.0,int64,21096,2637,,,,10044.694729,6742.549521,1300.0,7436.0,9450.0,11526.0,164660.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mo Sold,0,0.0,int64,21096,2637,,,,6.243838,2.722093,1.0,4.0,6.0,8.0,12.0
Yr Sold,0,0.0,int64,21096,2637,,,,2007.795601,1.306403,2006.0,2007.0,2008.0,2009.0,2010.0
Sale Type,0,0.0,object,158288,2637,10,WD,2286,,,,,,,
Sale Condition,0,0.0,object,166537,2637,6,Normal,2166,,,,,,,


### Wrapping it in a function

In [36]:
def metadata( dataframe ):
  '''Given a dataframe, return a dataframe of metadata about its columns.'''
  metadata_df = pd.DataFrame()
  metadata_df["Nulls"] = dataframe.isnull().sum()
  metadata_df["Nulls_pct"] = ( dataframe.isnull().mean() * 100 ).round(1)
  metadata_df["Data_types"] = dataframe.dtypes
  # index = False leaves out the index's own memory, so every entry maps to a column
  metadata_df["Memory"] = dataframe.memory_usage( deep = True, index = False )
  # add count, unique, top, freq, mean, std, min, quartiles, and max for every column
  metadata_df = metadata_df.join( dataframe.describe( include = "all" ).transpose() )
  metadata_df = metadata_df.astype( {"count": int }).rename( columns = {"50%": "median"})
  return metadata_df

In [37]:
metadata(housing)

Unnamed: 0,Nulls,Nulls_pct,Data_types,Memory,count,unique,top,freq,mean,std,min,25%,median,75%,max
PID,0,0.0,int64,21096,2637,,,,714130147.70383,188752674.750322,526301100.0,528477010.0,535453040.0,907187010.0,1007100110.0
MS SubClass,0,0.0,int64,21096,2637,,,,57.349261,42.499091,20.0,20.0,50.0,70.0,190.0
MS Zoning,0,0.0,object,155728,2637,7,RL,2043,,,,,,,
Lot Frontage,449,17.0,float64,21096,2188,,,,69.166819,23.356779,21.0,58.0,68.0,80.0,313.0
Lot Area,0,0.0,int64,21096,2637,,,,10044.694729,6742.549521,1300.0,7436.0,9450.0,11526.0,164660.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mo Sold,0,0.0,int64,21096,2637,,,,6.243838,2.722093,1.0,4.0,6.0,8.0,12.0
Yr Sold,0,0.0,int64,21096,2637,,,,2007.795601,1.306403,2006.0,2007.0,2008.0,2009.0,2010.0
Sale Type,0,0.0,object,158288,2637,10,WD,2286,,,,,,,
Sale Condition,0,0.0,object,166537,2637,6,Normal,2166,,,,,,,


## Automate: import from GitHub

In [38]:
!curl -s -O https://raw.githubusercontent.com/rwcitek/example-c11/main/python.modules/metadata.py


In [39]:
!ls -l


total 8
-rw-r--r-- 1 root root 1163 Jun 20 21:26 metadata.py
drwxr-xr-x 1 root root 4096 Jun 18 13:23 sample_data


In [40]:
import metadata

In [41]:
md = metadata.metadata(housing)
md

Unnamed: 0,count,unique,top,freq,mean,std,min,Q1_25%,Q2_median,Q3_75%,max,Nulls,Nulls_pct,Data_types,Memory
PID,2637,2637,,,714130147.70383,188752674.750322,526301100.0,528477010.0,535453040.0,907187010.0,1007100110.0,0,0.0,int64,21096
MS SubClass,2637,16,,,57.349261,42.499091,20.0,20.0,50.0,70.0,190.0,0,0.0,int64,21096
MS Zoning,2637,7,RL,2043,,,,,,,,0,0.0,object,155728
Lot Frontage,2188,126,,,69.166819,23.356779,21.0,58.0,68.0,80.0,313.0,449,17.0,float64,21096
Lot Area,2637,1799,,,10044.694729,6742.549521,1300.0,7436.0,9450.0,11526.0,164660.0,0,0.0,int64,21096
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mo Sold,2637,12,,,6.243838,2.722093,1.0,4.0,6.0,8.0,12.0,0,0.0,int64,21096
Yr Sold,2637,5,,,2007.795601,1.306403,2006.0,2007.0,2008.0,2009.0,2010.0,0,0.0,int64,21096
Sale Type,2637,10,WD,2286,,,,,,,,0,0.0,object,158288
Sale Condition,2637,6,Normal,2166,,,,,,,,0,0.0,object,166537


In [44]:
metadata.cols_to_drop(housing)

{'Alley': 93.2, 'Pool QC': 99.6, 'Fence': 80.0, 'Misc Feature': 96.4}
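The source of `metadata.py` is not shown here, so the exact rule behind `cols_to_drop` is unknown.  Purely as an illustration, a helper that returns a similar dictionary ( column name mapped to percent null, for columns at or above a threshold ) could be sketched as below.  The name `high_null_columns`, the `threshold_pct` parameter, and the toy data are all assumptions.

```python
import pandas as pd


def high_null_columns(frame, threshold_pct):
    """Hypothetical stand-in: {column: percent null} for columns at or above threshold_pct."""
    null_pct = (frame.isnull().mean() * 100).round(1)
    return null_pct[null_pct >= threshold_pct].to_dict()


# Made-up data: 5 rows, so each null is worth 20 percentage points.
toy = pd.DataFrame({
    "alley": pd.Series([None, None, None, None, "Grvl"], dtype=object),       # 80% null
    "fireplace_qu": pd.Series([None, None, "Gd", "TA", "TA"], dtype=object),  # 40% null
    "lot_area": [8450, 9600, 11250, 9550, 14260],                             # no nulls
})

flagged = high_null_columns(toy, threshold_pct=50)
print(flagged)
```

The keys of such a dictionary can be passed straight to `DataFrame.drop( columns = ... )`.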