## Dropna - pandas

To download the House Prices dataset, please refer to the lecture **Datasets** in **Section 2** of this course.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
# We'll use the following variables:

cols_to_use = [
    "OverallQual",
    "TotalBsmtSF",
    "1stFlrSF",
    "GrLivArea",
    "WoodDeckSF",
    "BsmtUnfSF",
    "LotFrontage",
    "MasVnrArea",
    "GarageYrBlt",
    "BsmtQual",
    "FireplaceQu",
    "SalePrice",
]

In [3]:
# Load the House Prices dataset.

data = pd.read_csv("../../houseprice.csv", usecols=cols_to_use)

data.head()

,LotFrontage,OverallQual,MasVnrArea,BsmtQual,BsmtUnfSF,TotalBsmtSF,1stFlrSF,GrLivArea,FireplaceQu,GarageYrBlt,WoodDeckSF,SalePrice
0,65.0,7,196.0,Gd,150,856,856,1710,,2003.0,0,208500
1,80.0,6,0.0,Gd,284,1262,1262,1262,TA,1976.0,298,181500
2,68.0,7,162.0,Gd,434,920,920,1786,TA,2001.0,0,223500
3,60.0,7,0.0,TA,540,756,961,1717,Gd,1998.0,0,140000
4,84.0,8,350.0,Gd,490,1145,1145,2198,TA,2000.0,192,250000


In [4]:
# Let's separate into training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),
    data["SalePrice"],
    test_size=0.3,
    random_state=0,
)

X_train.shape, X_test.shape

((1022, 11), (438, 11))

## Drop when NA in any variable

In [5]:
X_train_t = X_train.dropna()
X_test_t = X_test.dropna()

X_train_t.shape, X_test_t.shape

((415, 11), (170, 11))

We dropped a lot of observations: the training set went from 1022 to 415 rows and the test set from 438 to 170. This is probably not what we want.
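To see why so many rows disappear, it can help to check how the missing values are spread across columns before dropping anything. The snippet below is a minimal sketch on a made-up `toy` DataFrame (its columns and values are assumptions for illustration, not the House Prices data). It shows that `dropna()` with default arguments removes every row that has at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the House Prices dataset.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": ["x", "y", None, "z"],
        "c": [10, 20, 30, 40],
    }
)

# Fraction of missing values in each column.
na_fraction = toy.isnull().mean()

# Number of rows with at least one missing value.
# These are exactly the rows that dropna() removes by default.
rows_with_na = toy.isnull().any(axis=1).sum()

# Keep only complete cases (rows with no missing values at all).
complete_cases = toy.dropna()

print(na_fraction)
print(rows_with_na, complete_cases.shape)
```

Even though each column is missing at most a quarter of its values, half of the rows are lost, because the missing values fall in different rows.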

## Drop when NA in certain variables

In [6]:
X_train_t = X_train.dropna(subset=["MasVnrArea", "BsmtQual"])
X_test_t = X_test.dropna(subset=["MasVnrArea", "BsmtQual"])

X_train_t.shape, X_test_t.shape

((993, 11), (422, 11))
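With `subset`, only the listed columns are checked, so missing values in any other column are left in place. A small sketch on a hypothetical `toy` DataFrame (names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the House Prices dataset.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": ["x", "y", None, "z"],
        "c": [np.nan, 20.0, 30.0, 40.0],
    }
)

# Drop rows only when "a" or "b" is missing; "c" is not inspected.
kept = toy.dropna(subset=["a", "b"])

# Row 0 survives even though "c" is missing there.
remaining_na_in_c = kept["c"].isnull().sum()

print(kept)
print(remaining_na_in_c)
```

This means the result may still contain missing values that need to be handled in a later step.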

## Drop when NA in all variables

In [7]:
X_train_t = X_train.dropna(how="all")
X_test_t = X_test.dropna(how="all")

X_train_t.shape, X_test_t.shape

((1022, 11), (438, 11))
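The shapes are unchanged here because no row in the data is empty across all 11 variables. To illustrate what `how="all"` does when such rows exist, here is a minimal sketch on a hypothetical `toy` DataFrame (made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the House Prices dataset.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [np.nan, np.nan, "z"],
    }
)

# how="all" drops a row only if every value in it is missing.
kept_all = toy.dropna(how="all")

# For comparison, the default how="any" drops a row if any value is missing.
kept_any = toy.dropna(how="any")

print(kept_all)
print(kept_any.shape)
```

Only the middle row, which is missing in both columns, is removed by `how="all"`, whereas the default removes every row in this toy example.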

## Require a minimum number of non-NA values

In [8]:
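# Note: thresh is the minimum number of non-NA values a row needs, not a fraction.
# With thresh=0.5, any row with at least one non-NA value is kept.
X_train_t = X_train.dropna(thresh=0.5)
X_test_t = X_test.dropna(thresh=0.5)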

X_train_t.shape, X_test_t.shape

((1022, 11), (438, 11))
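Because `thresh` is a count rather than a fraction, a value of 0.5 only requires one non-NA value per row, which is why no rows were dropped above. If the intent is to keep rows where at least half of the variables are present, one option is to convert the fraction into a count first. The following is a sketch under that assumption, using a hypothetical `toy` DataFrame and a made-up `min_fraction` variable:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data, not the House Prices dataset.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [2.0, np.nan, np.nan, 5.0],
        "c": [3.0, 3.0, np.nan, np.nan],
        "d": [4.0, np.nan, np.nan, 6.0],
    }
)

# Assumed requirement: at least half of the columns must be non-NA.
min_fraction = 0.5

# Convert the fraction into the integer count that thresh expects.
min_non_na = math.ceil(min_fraction * toy.shape[1])

# Keep rows with at least min_non_na non-missing values.
kept = toy.dropna(thresh=min_non_na)

print(min_non_na)
print(kept)
```

In this toy example the row with a single non-NA value and the fully empty row are dropped, while the row with one missing value is kept.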