# Imputation

Goal: Understand and clean our data so we can derive better insights

## 1. Import Libraries

In [1]:
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer


## 2. Load the Dataset

In [2]:
df = pd.read_csv("data/NY-House-Dataset-Small.csv")

# Drop BROKERTITLE and 'ADMINISTRATIVE_AREA_LEVEL_2'
df = df.drop(["BROKERTITLE", "ADMINISTRATIVE_AREA_LEVEL_2"], axis=1)

In [3]:
df.info()
df.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4815 entries, 0 to 4814
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TYPE          4815 non-null   object 
 1   PRICE         4815 non-null   int64  
 2   BEDS          4815 non-null   int64  
 3   BATH          4815 non-null   float64
 4   PROPERTYSQFT  4815 non-null   float64
 5   STATE         4815 non-null   object 
 6   MAIN_ADDRESS  4815 non-null   object 
 7   LOCALITY      4791 non-null   object 
 8   SUBLOCALITY   4815 non-null   object 
dtypes: float64(2), int64(2), object(5)
memory usage: 338.7+ KB


Index(['TYPE', 'PRICE', 'BEDS', 'BATH', 'PROPERTYSQFT', 'STATE',
       'MAIN_ADDRESS', 'LOCALITY', 'SUBLOCALITY'],
      dtype='object')

## 3. Finding Missing Data

Steps:
1. Use Descriptive Statistics to examine data
2. Identify missing values
3. Understand why the data is missing
4. Decide whether to impute or drop values
5. Document your approach
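
As a rough sketch of steps 1 and 2, the snippet below runs descriptive statistics and counts nulls per column. The small `toy` frame is made up for illustration and is not the housing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the NY housing dataset)
toy = pd.DataFrame({
    "PRICE": [250000, 0, 475000, 2147483647],
    "BEDS": [2, 3, np.nan, 4],
    "LOCALITY": ["Queens", "-", None, "Brooklyn"],
})

# Step 1: descriptive statistics expose suspicious extremes (a 0 or a huge max)
summary = toy.describe()

# Step 2: count explicit nulls per column
null_counts = toy.isna().sum()

print(summary.loc[["min", "max"]])
print(null_counts)
```

Neither the `-` placeholder nor the 0 price is counted as null here; values like these only surface when you inspect the data itself.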

### LOCALITY

Missing data is not always stored as a proper null value; placeholder strings can stand in for it. Let's take a look at how we can identify missing data.

In [5]:
# Count the missing data
df["LOCALITY"].isna().sum()

df["LOCALITY"].value_counts()

LOCALITY
New York           2468
New York County     966
Queens County       555
Kings County        462
Bronx County        179
Richmond County      58
United States        34
Na                   30
-                    22
Brooklyn              6
Queens                6
The Bronx             4
Flatbush              1
Name: count, dtype: int64

In [6]:
# Map missing values to None
def locality_to_none(word):
    if word in ["Na", "-", "United States"]:
        return None
    return word

df["LOCALITY"] = df["LOCALITY"].map(locality_to_none)
df["LOCALITY"].value_counts()

LOCALITY
New York           2468
New York County     966
Queens County       555
Kings County        462
Bronx County        179
Richmond County      58
Brooklyn              6
Queens                6
The Bronx             4
Flatbush              1
Name: count, dtype: int64

In [7]:
# Count missing data
df["LOCALITY"].isna().sum()

110
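
A possible vectorized alternative to the `map` approach is `Series.mask` combined with `isin`, sketched here on made-up values. The `placeholders` list simply mirrors the three strings treated as missing above; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample of LOCALITY-like values
locality = pd.Series(["New York", "Na", "-", "United States", None, "Queens"])

# Strings treated as "missing" in this notebook
placeholders = ["Na", "-", "United States"]

# mask() swaps in a missing value wherever the condition is True
cleaned = locality.mask(locality.isin(placeholders))

print(cleaned.isna().sum())
```

Keeping the placeholders in one list makes the decision easy to document and extend if more variants turn up.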

### PRICE

In [11]:
# Get the min and max values of PRICE
df["PRICE"].max()
df["PRICE"].min()

# Get the row with the max price value
df.loc[df["PRICE"] == df["PRICE"].max()]


| | TYPE | PRICE | BEDS | BATH | PROPERTYSQFT | STATE | MAIN_ADDRESS | LOCALITY | SUBLOCALITY |
|---|---|---|---|---|---|---|---|---|---|
| 305 | House for sale | 2147483647 | 7 | 6.0 | 10000.0 | New York, NY 10309 | 6659-6675 Amboy RdNew York, NY 10309 | New York | Richmond County |


We'll drop the row where PRICE equals the maximum 32-bit integer (2147483647) because we're unable to find or validate the actual price.

In [12]:
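# Dropping by a hard-coded index label is fragile: it depends on the current row labels,
# and it raises a KeyError once the row is gone. The result is not assigned here, so df is unchanged.
df.drop(305)

# Safer: drop by condition, which can be re-run without side effects
df = df.drop(df.loc[df["PRICE"] == 2147483647].index)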
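
The value 2147483647 is the largest signed 32-bit integer, which suggests an overflow or placeholder rather than a real price. The sketch below, on made-up rows, checks that value and drops by condition. `INT32_MAX` and `toy` are illustrative names.

```python
import numpy as np
import pandas as pd

# Largest value a signed 32-bit integer can hold: 2**31 - 1
INT32_MAX = np.iinfo(np.int32).max

# Hypothetical rows; the index labels are chosen arbitrarily
toy = pd.DataFrame({"PRICE": [315000, INT32_MAX, 260000]}, index=[304, 305, 306])

# Keep everything except the sentinel; running the filter twice changes nothing
toy = toy[toy["PRICE"] != INT32_MAX]
toy = toy[toy["PRICE"] != INT32_MAX]

print(toy.index.tolist())
```

Filtering on the condition rather than on a label keeps the cell idempotent, which matters in a notebook where cells are often re-run.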

## 4. Imputation

In [14]:
# Get the indices where "PRICE" == 0 to verify
price_0 = df.loc[df["PRICE"] == 0].index

# Map all the rows where price is 0 to None
df["PRICE"] = df["PRICE"].map(lambda price: None if price == 0 else price)
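
Note that `price_0` holds index labels rather than row positions. Because a row was dropped earlier, the two no longer coincide, so it matters which accessor is used with `price_0` later on. A minimal illustration on a made-up Series:

```python
import pandas as pd

# Hypothetical prices with labels 0..4; dropping label 1 makes labels and positions diverge
prices = pd.Series([100.0, 999.0, 0.0, 300.0, 400.0]).drop(1)

# Labels where the price is 0
zero_labels = prices.loc[prices == 0].index

by_label = prices.loc[zero_labels]      # the intended rows
by_position = prices.iloc[zero_labels]  # whatever happens to sit at those positions

print(by_label.tolist(), by_position.tolist())
```

Here `.loc` returns the zero-priced row, while `.iloc` returns its neighbor, because every position after the dropped row has shifted by one.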

### Univariate: Mean and Median Imputation

In [16]:
# Mean imputation
avg_prices = df["PRICE"].fillna(df["PRICE"].mean())
# price_0 holds index labels, so select with .loc; .iloc would pick the wrong rows after the drop above
avg_prices.loc[price_0]

# Median imputation
med_prices = df["PRICE"].fillna(df["PRICE"].median())
med_prices.loc[price_0]

62      825000.0
81      825000.0
134     825000.0
178     825000.0
278     825000.0
          ...   
4508    825000.0
4552    825000.0
4596    825000.0
4647    825000.0
4762    825000.0
Name: PRICE, Length: 79, dtype: float64
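
The `SimpleImputer` imported at the top can perform the same univariate imputation. The following is a sketch on made-up prices, assuming the `median` strategy; the column name `PRICE_IMPUTED` is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical prices with two missing entries
toy = pd.DataFrame({"PRICE": [100.0, np.nan, 300.0, 1000.0, np.nan]})

# strategy can also be "mean", "most_frequent", or "constant"
imputer = SimpleImputer(strategy="median")

# fit_transform returns a 2-D NumPy array, so flatten it back into a column
toy["PRICE_IMPUTED"] = imputer.fit_transform(toy[["PRICE"]]).ravel()

print(toy)
```

The median is less sensitive than the mean to extreme values, which is worth considering for a price column with a few very expensive listings.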

### Multivariate Imputation
Multivariate imputation estimates a column's missing values from the values of the other columns, using a machine learning model to make the prediction.
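
To make the idea concrete, here is a small sketch on made-up data in which `SQFT` is roughly 500 times `BEDS`. It assumes scikit-learn's `IterativeImputer` with its default estimator; the data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data where SQFT is roughly 500 * BEDS, with one gap
toy = pd.DataFrame({
    "BEDS": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "SQFT": [510.0, 990.0, np.nan, 2010.0, 2490.0, 3000.0],
})

# Each column with gaps is regressed on the other columns
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(filled.loc[2, "SQFT"])
```

A mean fill would have used the column average of 1800 for the gap, whereas the multivariate estimate follows the relationship with `BEDS` and lands near 1500.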

##### *Transform Categorical Data to Numeric*

In [21]:
df["SUBLOCALITY"].value_counts()

# Cast our sublocality column to be categorical
pd.Categorical(df["SUBLOCALITY"])

df["SUBLOCALITY_CODES"] = pd.Categorical(df["SUBLOCALITY"]).codes

df[["SUBLOCALITY", "SUBLOCALITY_CODES"]]

| | SUBLOCALITY | SUBLOCALITY_CODES |
|---|---|---|
| 0 | Manhattan | 10 |
| 1 | New York County | 12 |
| 2 | Richmond County | 16 |
| 3 | New York County | 12 |
| 4 | New York County | 12 |
| ... | ... | ... |
| 4810 | New York | 11 |
| 4811 | Queens County | 14 |
| 4812 | New York County | 12 |
| 4813 | Queens | 13 |
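
For reference, this is how `pd.Categorical` assigns codes, shown on made-up values. The names `sub`, `cat`, and `lookup` are illustrative.

```python
import pandas as pd

# Hypothetical sublocality values, including one missing entry
sub = pd.Series(["Queens", "Brooklyn", None, "Queens"])

cat = pd.Categorical(sub)

# Categories are sorted, and each code is a position in that sorted list
codes = cat.codes                         # missing values get the code -1
lookup = dict(enumerate(cat.categories))  # code -> original label

print(list(codes), lookup)
```

The codes are arbitrary identifiers, but a regression-based imputer treats them as ordered numbers. If that ordering is a concern, one-hot encoding (for example with `pd.get_dummies`) is a common alternative.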


##### *Iterative Imputer*
[IterativeImputer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer)

In [23]:
# Only works on numerical columns
numerical = df.select_dtypes("number")

# Create a new imputer object
it_imp = IterativeImputer(max_iter=10)

# We need to train our model on our data
it_imp = it_imp.fit(numerical)

# Use our model to transform our data
imp_data = pd.DataFrame(it_imp.transform(numerical), columns=numerical.columns)
imp_data


| | PRICE | BEDS | BATH | PROPERTYSQFT | SUBLOCALITY_CODES |
|---|---|---|---|---|---|
| 0 | 315000.0 | 2.0 | 2.000000 | 1400.000000 | 10.0 |
| 1 | 195000000.0 | 7.0 | 10.000000 | 17545.000000 | 12.0 |
| 2 | 260000.0 | 4.0 | 2.000000 | 2015.000000 | 16.0 |
| 3 | 69000.0 | 3.0 | 1.000000 | 445.000000 | 12.0 |
| 4 | 55000000.0 | 7.0 | 2.373861 | 14175.000000 | 12.0 |
| ... | ... | ... | ... | ... | ... |
| 4809 | 599000.0 | 1.0 | 1.000000 | 2184.207862 | 11.0 |
| 4810 | 245000.0 | 1.0 | 1.000000 | 2184.207862 | 14.0 |
| 4811 | 1275000.0 | 1.0 | 1.000000 | 2184.207862 | 12.0 |
| 4812 | 598125.0 | 2.0 | 1.000000 | 655.000000 | 13.0 |


In [24]:
# Check our imputed prices. price_0 holds index labels, while imp_data has a fresh
# 0..n-1 index, so convert the labels to row positions first
price_0_pos = df.index.get_indexer(price_0)
imp_data.iloc[price_0_pos]

# Save our data back to the dataframe by position; assigning imp_data[col] directly
# would align on index labels and shift every row after the dropped label 305
for col in imp_data.columns:
    df[col] = imp_data[col].to_numpy()

df.loc[price_0]

| | TYPE | PRICE | BEDS | BATH | PROPERTYSQFT | STATE | MAIN_ADDRESS | LOCALITY | SUBLOCALITY | SUBLOCALITY_CODES |
|---|---|---|---|---|---|---|---|---|---|---|
| 62 | Co-op for sale | 7.676513e+05 | 1.0 | 1.0 | 835.000000 | New York, NY 10065 | 333 E 66th St Apt 8ENew York, NY 10065 | New York County | New York | 11.0 |
| 81 | Condo for sale | 9.628645e+05 | 2.0 | 2.0 | 1065.000000 | New York, NY 10128 | 200 E 94th St Apt 414New York, NY 10128 | New York County | New York | 11.0 |
| 134 | Foreclosure | 3.230162e+06 | 5.0 | 4.0 | 3740.000000 | New York, NY 10031 | 517 W 142nd StNew York, NY 10031 | New York | New York County | 12.0 |
| 178 | House for sale | 1.840778e+06 | 4.0 | 4.0 | 2100.000000 | Staten Island, NY 10312 | 82 Vineland AveStaten Island, NY 10312 | New York | Richmond County | 16.0 |
| 278 | Co-op for sale | 5.263571e+05 | 3.0 | 1.0 | 550.000000 | New York, NY 10023 | 165 W End Ave Apt 16CNew York, NY 10023 | New York County | New York | 11.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
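
One thing to watch when writing imputed values back: assigning a Series to a DataFrame column aligns on index labels, not on row order. Since `imp_data` is built with a fresh 0..n-1 index while `df` is missing a dropped label, the two behave differently. A minimal sketch on made-up data, with illustrative names:

```python
import pandas as pd

# Hypothetical frame whose label 1 has been dropped (labels are now 0, 2, 3)
df_toy = pd.DataFrame({"PRICE": [10.0, 20.0, 30.0, 40.0]}).drop(1)

# Imputer-style output: the same three rows, but with a fresh 0..2 index
imputed = pd.Series([10.0, 30.0, 40.0])

df_toy["ALIGNED"] = imputed                # matched on index labels
df_toy["POSITIONAL"] = imputed.to_numpy()  # matched on row order

print(df_toy)
```

With label alignment, the row labeled 2 receives the value meant for the next row and the last row ends up missing. Passing `index=numerical.index` when building the imputed frame is another way to keep the labels in sync.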
