#### Data Types & Missing Values

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/NYC_Rolling_Sales_Dataset/nyc-rolling-sales.csv')

##### Convert Columns to Correct Data Types

In [2]:
# Columns that should be numeric but may have been read as text
numeric_cols = [
    'SALE PRICE', 'LAND SQUARE FEET', 'GROSS SQUARE FEET',
    'RESIDENTIAL UNITS', 'COMMERCIAL UNITS',
]
# errors='coerce' turns values that cannot be parsed into NaN
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

##### Check Missing Values

In [3]:
df.isnull().sum().sort_values(ascending=False)

GROSS SQUARE FEET                 27612
LAND SQUARE FEET                  26252
SALE PRICE                        14561
NEIGHBORHOOD                          0
Unnamed: 0                            0
BOROUGH                               0
BLOCK                                 0
TAX CLASS AT PRESENT                  0
BUILDING CLASS CATEGORY               0
LOT                                   0
APARTMENT NUMBER                      0
EASE-MENT                             0
BUILDING CLASS AT PRESENT             0
ADDRESS                               0
COMMERCIAL UNITS                      0
RESIDENTIAL UNITS                     0
ZIP CODE                              0
TOTAL UNITS                           0
YEAR BUILT                            0
TAX CLASS AT TIME OF SALE             0
BUILDING CLASS AT TIME OF SALE        0
SALE DATE                             0
dtype: int64

##### Percentage of Missing Values

In [4]:
missing_percent = (df.isnull().sum() / len(df)) * 100
missing_percent.sort_values(ascending=False)

GROSS SQUARE FEET                 32.658372
LAND SQUARE FEET                  31.049818
SALE PRICE                        17.222170
NEIGHBORHOOD                       0.000000
Unnamed: 0                         0.000000
BOROUGH                            0.000000
BLOCK                              0.000000
TAX CLASS AT PRESENT               0.000000
BUILDING CLASS CATEGORY            0.000000
LOT                                0.000000
APARTMENT NUMBER                   0.000000
EASE-MENT                          0.000000
BUILDING CLASS AT PRESENT          0.000000
ADDRESS                            0.000000
COMMERCIAL UNITS                   0.000000
RESIDENTIAL UNITS                  0.000000
ZIP CODE                           0.000000
TOTAL UNITS                        0.000000
YEAR BUILT                         0.000000
TAX CLASS AT TIME OF SALE          0.000000
BUILDING CLASS AT TIME OF SALE     0.000000
SALE DATE                          0.000000
dtype: float64
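
The counts and percentages from the two cells above can also be viewed side by side. The following is a minimal sketch on a made-up toy frame, not the real dataset; the helper name `missing_summary` and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; all values are made up for illustration.
toy = pd.DataFrame({
    "SALE PRICE": [450000.0, np.nan, 1200000.0, np.nan],
    "LAND SQUARE FEET": [2000.0, 1800.0, np.nan, 2500.0],
    "BOROUGH": [1, 2, 3, 4],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, largest count first."""
    summary = pd.DataFrame({
        "missing_count": frame.isna().sum(),
        # The mean of a boolean mask is the fraction of True values.
        "missing_percent": frame.isna().mean() * 100,
    })
    return summary.sort_values("missing_count", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the real frame should reproduce the figures shown above in a single table.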

##### Fill Missing Values

In [5]:
# Replace missing land areas with the column median (median() skips NaN by default)
df['LAND SQUARE FEET'] = df['LAND SQUARE FEET'].fillna(
    df['LAND SQUARE FEET'].median()
)

# 🐼 Module 5: Data Types & Missing Values (Pandas)

---

## 📌 Objective
Clean and prepare the NYC Rolling Sales data by correcting column data types and handling missing values, so that the dataset is ready for analysis.

---

## 📂 Dataset
**NYC Rolling Sales Dataset.** This module audits the raw CSV for inconsistencies, such as numbers stored as text and gaps in the square-footage records.
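
One way to audit for numbers stored as text is to count the non-null values in a column that fail numeric parsing. The sketch below uses a small made-up frame; the `" -  "` placeholder string and the helper name `audit_numeric_text` are assumptions for illustration, not confirmed properties of the real file.

```python
import pandas as pd

# Toy data; the " -  " placeholder is an assumed stand-in for a missing value.
raw = pd.DataFrame({
    "SALE PRICE": ["450000", " -  ", "1200000", "0"],
    "NEIGHBORHOOD": ["CHELSEA", "HARLEM", "SOHO", "ASTORIA"],
})


def audit_numeric_text(frame: pd.DataFrame, columns: list) -> dict:
    """For each column, count non-null values that cannot be parsed as numbers."""
    report = {}
    for col in columns:
        parsed = pd.to_numeric(frame[col], errors="coerce")
        # Values that were present in the raw data but became NaN after parsing.
        bad = frame.loc[parsed.isna() & frame[col].notna(), col]
        report[col] = {
            "unparseable": int(bad.size),
            "examples": sorted(bad.unique().tolist())[:5],
        }
    return report


report = audit_numeric_text(raw, ["SALE PRICE"])
print(report)
```

Listing a few example values makes it easier to tell placeholder strings apart from genuine typos before coercing them to `NaN`.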

---

## 🔧 Work Done
In this module, I implemented a three-step cleaning pipeline:

1. **Environment Setup**: 
    * Imported pandas and NumPy.
    * Loaded the dataset via `read_csv()`.
2. **Type Conversion**: 
    * Used `pd.to_numeric()` with `errors='coerce'` to convert columns that were read as `object` (string) dtype; values that cannot be parsed become `NaN`.
    * Impacted Columns: `SALE PRICE`, `LAND SQUARE FEET`, `GROSS SQUARE FEET`, `RESIDENTIAL UNITS`, and `COMMERCIAL UNITS`.
3. **Data Imputation (Filling)**: 
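    * Handled missing values in `LAND SQUARE FEET` by filling them with the column **median**, which is less sensitive to extreme values than the mean. Note that filling many gaps with a single value reduces the column's variance.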
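
Steps 2 and 3 can be illustrated on a handful of made-up values. This is a minimal sketch, not the notebook's actual data: the `" -  "` placeholder is hypothetical, and the large value is a deliberate outlier that shows why the median was preferred over the mean.

```python
import pandas as pd

# Toy values; " -  " is a hypothetical placeholder and 1000000 a deliberate outlier.
land = pd.Series(["2000", "2500", " -  ", "1800", "1000000"], name="LAND SQUARE FEET")

# Step 2: anything that cannot be parsed becomes NaN instead of raising an error.
land_num = pd.to_numeric(land, errors="coerce")

# Step 3: median() and mean() both skip NaN, but only the mean is dragged up by the outlier.
median_value = land_num.median()
mean_value = land_num.mean()
land_filled = land_num.fillna(median_value)

print(median_value, mean_value)
print(land_filled.tolist())
```

In this toy series the median is 2250.0 while the mean is 251575.0, so a mean fill would assign a land area far larger than any typical lot.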

---
