### Day_020: Data Cleaning & Missing Value Treatment
***Today's Goal:*** Make the dataset clean, usable, and analysis-ready by handling missing values and fixing obvious data issues.

#### Load the dataset

In [1]:
import pandas as pd
df = pd.read_csv('AB_NYC_2019.csv')
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


#### Check missing values 

In [2]:
df.isnull().sum().sort_values(ascending=False)

last_review                       10052
reviews_per_month                 10052
host_name                            21
name                                 16
neighbourhood_group                   0
neighbourhood                         0
id                                    0
host_id                               0
longitude                             0
latitude                              0
room_type                             0
price                                 0
number_of_reviews                     0
minimum_nights                        0
calculated_host_listings_count        0
availability_365                      0
dtype: int64
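
Raw counts are easier to judge next to percentages. The sketch below builds a combined summary on a small toy DataFrame; the values and the names `toy` and `missing_summary` are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings data (values are invented for illustration).
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "price": [120, 80, 95, 60],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

# Put missing counts and missing percentages side by side.
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```

The same two expressions should work unchanged on `df`, assuming it is loaded as in In [1].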

- In real projects, keep the original dataset unchanged and do all cleaning on a copy, so every step can be compared against the raw data or rerun from it.

In [3]:
cleaned_df = df.copy()
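
As a quick illustration of why the copy matters, the sketch below edits a copy and confirms the source frame is untouched. The frame and the names `original` and `working` are hypothetical.

```python
import pandas as pd

# Hypothetical raw data.
original = pd.DataFrame({"price": [100, 0, 250]})

# .copy() returns an independent (deep) copy by default.
working = original.copy()
working.loc[1, "price"] = 999

print(original["price"].tolist())  # the raw data keeps its values
print(working["price"].tolist())   # only the copy was modified
```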

#### Decide Strategy for Each Missing Column
- ***Don’t blindly fill everything. First decide why the data is missing.***
    - reviews_per_month → missing because the listing has no reviews yet
    - last_review → missing because the listing has never been reviewed
    - host_name / name → free-text fields, expected to have low impact on modeling
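
The explanation for the two review columns can be checked rather than assumed. A minimal sketch on a toy frame (invented values) tests whether the missing entries line up exactly with listings that have `number_of_reviews == 0`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy rows: two listings with reviews, two without.
toy = pd.DataFrame({
    "number_of_reviews": [9, 0, 45, 0],
    "last_review": ["2018-10-19", None, "2019-05-21", None],
    "reviews_per_month": [0.21, np.nan, 0.38, np.nan],
})

no_reviews = toy["number_of_reviews"] == 0
missing_rpm = toy["reviews_per_month"].isnull()
missing_last = toy["last_review"].isnull()

# True only if every missing value belongs to a zero-review listing
# and every zero-review listing has a missing value.
rpm_explained = (no_reviews == missing_rpm).all()
last_explained = (no_reviews == missing_last).all()

print(rpm_explained, last_explained)
```

If either check came back False on the real data, the "no reviews yet" explanation would not cover every missing value, and filling with 0 would need a second look.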

#### Handle missing values

In [4]:
cleaned_df['reviews_per_month'] = cleaned_df['reviews_per_month'].fillna(0)

- Assign the result back to the column instead of calling 'cleaned_df[col].fillna(0, inplace=True)'.
- The chained inplace form raises a FutureWarning in pandas 2.x: the intermediate object cleaned_df[col] always behaves as a copy, so under pandas 3.0 the inplace call will never update the original DataFrame.
- An equivalent alternative is 'cleaned_df.fillna({col: value}, inplace=True)', which operates on the DataFrame itself.


***Logical imputation***
- A missing value here usually means the listing has no reviews yet, not that the data is wrong
- Filling with 0 represents zero review activity directly, unlike a mean or an arbitrary value
- This keeps the row instead of deleting a useful listing
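
The imputation described above can be sketched on a toy frame using the dictionary form of `fillna`, which targets one column and leaves the others alone. The data and the names `toy` and `filled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: missing values in two different columns.
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio"],
    "reviews_per_month": [0.5, np.nan, np.nan],
})

# Fill only reviews_per_month; the dict maps column name -> fill value.
filled = toy.fillna({"reviews_per_month": 0})

print(filled)
```

No rows are removed, and the missing `name` entry stays missing because it was not listed in the dictionary.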

In [5]:
cleaned_df.drop(columns=['last_review'], inplace=True)

***Removing low-importance, high-missing features***
- last_review has 10,052 missing values, about 21% of the rows
- It is a date column, which needs extra processing (parsing and feature extraction) to be useful
- For price prediction, this column is expected to have low impact compared to other features
- Keeping it would add complexity without much benefit at this stage
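
For reference, the "extra processing" a date column needs might look like the sketch below. This is not what the notebook does (it drops the column); the derived feature `days_since_last_review` and the choice of the latest review as the reference date are assumptions made for illustration.

```python
import pandas as pd

# Toy dates in the same YYYY-MM-DD format as the dataset; None marks a missing review.
toy = pd.DataFrame({"last_review": ["2018-10-19", None, "2019-07-05"]})

# Parse strings into datetimes; missing entries become NaT.
parsed = pd.to_datetime(toy["last_review"])

# Assumed reference point: the most recent review in the data.
reference_date = parsed.max()

# Numeric feature: days elapsed since the listing's last review (NaN where missing).
toy["days_since_last_review"] = (reference_date - parsed).dt.days

print(toy)
```

The missing rows would still need their own treatment afterwards, which is part of the added complexity the notebook avoids by dropping the column.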

In [6]:
cleaned_df.drop(columns=['name','host_name'], inplace=True)

***Dropping irrelevant features***
- These columns are free text and cannot be used directly in numerical modeling
- They have high cardinality (many unique values)
- Encoding them as-is would add noise and increase model complexity
- They are unlikely to affect pricing patterns directly
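
The cardinality claim can be measured before dropping anything. A minimal sketch on toy data (invented rows; `cardinality_ratio` is an illustrative name) compares the share of unique values per column.

```python
import pandas as pd

# Toy rows: two free-text columns and one true categorical column.
toy = pd.DataFrame({
    "name": ["Clean & quiet apt", "Skylit Midtown Castle",
             "Cozy Entire Floor", "Spacious Studio"],
    "host_name": ["John", "Jennifer", "LisaRoxanne", "Laura"],
    "room_type": ["Private room", "Entire home/apt",
                  "Entire home/apt", "Entire home/apt"],
})

# Unique values divided by row count: close to 1.0 means nearly every row is distinct.
cardinality_ratio = toy.nunique() / len(toy)

print(cardinality_ratio.sort_values(ascending=False))
```

A ratio near 1.0 suggests an identifier-like or free-text column, while a low ratio (as for `room_type` here) suggests a categorical feature worth encoding.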

#### Recheck missing values

In [7]:
cleaned_df.isnull().sum().sort_values(ascending=False)

id                                0
host_id                           0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

#### Check for Duplicate Rows

In [8]:
cleaned_df.duplicated().sum()

np.int64(0)
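
Note that `duplicated()` with no arguments flags only rows that match on every column. As a hedged extra check, the sketch below also looks for repeated keys; treating `id` as the unique listing key is an assumption, and the rows are invented.

```python
import pandas as pd

# Toy data: id 2595 appears twice with different prices.
toy = pd.DataFrame({
    "id": [2539, 2595, 2595, 3647],
    "price": [149, 225, 230, 150],
})

# Rows identical across all columns.
full_row_dupes = toy.duplicated().sum()

# Rows that repeat an earlier id, even if other columns differ.
id_dupes = toy.duplicated(subset=["id"]).sum()

print(full_row_dupes, id_dupes)
```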

#### Validate Data Types 

In [9]:
cleaned_df.dtypes

id                                  int64
host_id                             int64
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object
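
Reading the dtype listing by eye works for a handful of columns. A small sketch of an automated check follows, assuming a hand-written list of columns that should be numeric; the toy data deliberately stores one of them as text, and `numeric_expected` is a hypothetical name.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy data: minimum_nights is deliberately stored as strings.
toy = pd.DataFrame({
    "price": [149, 225, 150],
    "latitude": [40.64749, 40.75362, 40.80902],
    "minimum_nights": ["1", "1", "3"],
})

# Columns we expect to be numeric (an assumption made for this example).
numeric_expected = ["price", "latitude", "minimum_nights"]

# Collect columns whose dtype does not match the expectation.
not_numeric = [col for col in numeric_expected if not is_numeric_dtype(toy[col])]
print(not_numeric)

# Convert the offending column; to_numeric raises if a value cannot be parsed.
toy["minimum_nights"] = pd.to_numeric(toy["minimum_nights"])
```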

#### Sanity Check: Unrealistic Values

In [10]:
(cleaned_df['price'] <= 0).sum()

np.int64(11)

In [11]:
(cleaned_df['price'] <= 0).mean() * 100

np.float64(0.02249718785151856)

- Only ~0.02% of rows (11 listings) have price <= 0, which is a negligible share of the data.
- A price of 0 or less is not a usable target for price prediction, so these rows can be dropped safely.

In [12]:
cleaned_df = cleaned_df[cleaned_df['price'] > 0]
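
The measure-then-drop pattern can be written as one guarded step. In the sketch below the data is invented and the threshold `MAX_DROP_PCT` is a hypothetical value chosen for the example, not a rule from the notebook.

```python
import pandas as pd

# Toy prices: one invalid entry out of ten.
toy = pd.DataFrame({"price": [149, 0, 225, 89, 80, 150, 60, 79, 79, 150]})

# Measure the impact first.
invalid = toy["price"] <= 0
invalid_pct = invalid.mean() * 100

# Hypothetical cutoff: drop only if the affected share is small.
MAX_DROP_PCT = 20.0

if invalid_pct <= MAX_DROP_PCT:
    # Keep valid rows and renumber the index.
    toy_clean = toy[~invalid].reset_index(drop=True)
else:
    # Too many rows affected: keep everything and investigate instead.
    toy_clean = toy.copy()

print(f"{invalid_pct:.2f}% invalid, {len(toy) - len(toy_clean)} rows dropped")
```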

In [13]:
(cleaned_df['minimum_nights'] <= 0).sum()

np.int64(0)

- There are zero rows where minimum_nights <= 0
- So there is nothing to drop here
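
The same kind of check can be run over several columns at once. The sketch below counts out-of-range values per column; the ranges in `valid_ranges` are assumptions chosen for the example (for instance, availability between 0 and 365 days), and the toy rows include two deliberate violations.

```python
import pandas as pd

# Toy data with one bad minimum_nights (0) and one bad availability_365 (400).
toy = pd.DataFrame({
    "minimum_nights": [1, 3, 0, 10],
    "availability_365": [365, 355, 194, 400],
    "latitude": [40.64749, 40.75362, 40.68514, 40.79851],
})

# Assumed valid ranges as (low, high); None means no bound on that side.
valid_ranges = {
    "minimum_nights": (1, None),
    "availability_365": (0, 365),
    "latitude": (-90, 90),
}

violations = {}
for col, (low, high) in valid_ranges.items():
    # Start with "no row is bad" and flag rows outside either bound.
    bad = pd.Series(False, index=toy.index)
    if low is not None:
        bad |= toy[col] < low
    if high is not None:
        bad |= toy[col] > high
    violations[col] = int(bad.sum())

print(violations)
```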

#### Learnings
- Learned how to identify invalid or unrealistic values using logical conditions.
- Understood how to measure the impact of problematic data before removing it.
- Practiced making data-cleaning decisions based on percentage, not emotion.
- Learned when to drop rows and when to keep them unchanged.
- Improved understanding of data validation as a separate step from data cleaning.