### Step 1: Import Libraries



In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

### Step 2: Load the Dataset

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/sumony2j/Data_Cleaning_Preprocessing/refs/heads/main/AB_NYC_2019.csv')


### Step 3: View the Data
Once the data is loaded, take a look at the first few rows to get an idea of what it contains.

**Task**: Use a function to display the first five rows of the DataFrame `df`.
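
One way to approach this is sketched below on a small toy DataFrame rather than the real `df`. The frame `toy` and its values are invented for illustration; only `head()` is the point here.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room",
                  "Private room", "Entire home/apt", "Private room"],
    "price": [149, 225, 60, 89, 200, 75],
})

# head() returns the first five rows when called without an argument.
first_rows = toy.head()
print(first_rows)
```

Passing a number, as in `toy.head(3)`, changes how many rows are returned.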

### Step 4: Check Dataset Dimensions
Knowing the size of the dataset can help you plan data processing steps.

**Task**: Use a function to display the dimensions (rows and columns) of `df`.
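
As a minimal sketch, the `shape` attribute gives the dimensions as a `(rows, columns)` tuple. The `toy` frame below is a made-up stand-in for `df`.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({"price": [149, 225, 60], "minimum_nights": [1, 1, 3]})

# shape is an attribute, not a method, so there are no parentheses.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```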


### Step 5: Get Data Overview
Understanding the data types and number of non-null entries in each column is crucial for data cleaning.

**Task**: Use a function to get a summary of `df` and its columns.
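
The sketch below shows one possible approach using `info()` on a toy frame (invented values, standing in for `df`). It captures the summary in a string buffer so it can be inspected, which is optional; calling `toy.info()` alone simply prints it.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny studio"],
    "reviews_per_month": [0.21, np.nan, 4.64],
})

# info() prints to stdout by default; passing buf captures the text instead.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The summary lists each column with its non-null count and dtype, which is usually enough to spot columns that need cleaning.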


### Step 6: Check for Missing or NULL Values
Some columns may have missing data. Identifying these will guide you in handling missing values.

**Task**: Write code to find the total number of missing values in each column of `df`.
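
A common pattern, sketched here on invented toy data, is to chain `isnull()` with `sum()`: the first marks each missing cell as `True`, and the second counts the `True` values per column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny studio"],
    "reviews_per_month": [0.21, np.nan, np.nan],
    "price": [149, 225, 60],
})

# isnull() gives a boolean frame; sum() counts True values column by column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```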


### Step 7: Drop Unnecessary Columns
Some columns in the dataset may not be relevant for analysis. In this exercise:

**Task**: Write code to drop these columns from `df`.
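
The sketch below shows the general pattern with `DataFrame.drop`. The column names in `columns_to_drop` are placeholders chosen for this toy example, not the exercise's actual list.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2],
    "host_name": ["John", "Jane"],
    "price": [149, 225],
})

# Placeholder list: substitute the columns you decide are irrelevant.
columns_to_drop = ["id", "host_name"]

# drop() returns a new DataFrame, so assign the result back.
toy = toy.drop(columns=columns_to_drop)
print(toy.columns.tolist())
```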


### Step 8: Handle Missing Values
The `reviews_per_month` column has missing values. Let's replace missing values with the most frequent value in this column.

**Task**:
- Initialize a `SimpleImputer` with the strategy `"most_frequent"`.
- Use it to fill missing values in the `reviews_per_month` column of `df`.
- After completing this task, add another code cell to explore other imputer strategies, testing at least two more. Then add a Markdown cell discussing the pros and cons of each.
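
The following is a minimal sketch of how the strategies can be compared side by side. It uses a toy column with invented values instead of the real `reviews_per_month` data, and `"mean"` and `"median"` are just two of the other strategies you could pick.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the reviews_per_month column; values are invented.
toy = pd.DataFrame({"reviews_per_month": [1.0, 1.0, 2.0, np.nan, 6.0]})

filled = {}
for strategy in ("most_frequent", "mean", "median"):
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform expects 2-D input, hence the double brackets.
    filled[strategy] = imputer.fit_transform(toy[["reviews_per_month"]]).ravel()

for strategy, values in filled.items():
    print(strategy, values)
```

On this toy column the missing entry becomes 1.0, 2.5, or 1.5 depending on the strategy, which illustrates the trade-off: the mean is pulled toward the large value 6.0, while the median and mode are less affected by it.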


### Step 9: Identify Categorical Columns
Categorical columns hold non-numeric data and will require encoding. Identify these columns in `df`.

**Task**:
- Write code to find columns with an `object` data type.
- Print each column name and the number of unique values it contains.
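
One possible approach, sketched on invented toy data, combines `select_dtypes` with `nunique`. The text columns are cast to `object` explicitly here, as an assumption, so that the sketch behaves the same on pandas versions that may infer a dedicated string dtype.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn"],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [149, 225, 60],
}).astype({"neighbourhood_group": object, "room_type": object})

# Keep only the columns whose dtype is object.
object_columns = toy.select_dtypes(include="object").columns

# Count the distinct values in each of those columns.
unique_counts = {col: toy[col].nunique() for col in object_columns}
for col, count in unique_counts.items():
    print(col, count)
```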


### Step 10: Check Unique Room Types
The `room_type` column has different categories. Listing them will help you understand the types of rentals available.

**Task**: Write code to find the unique values in the `room_type` column.
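
As a small sketch on invented toy data, `Series.unique()` returns the distinct values in order of first appearance.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

# unique() keeps the order in which values first appear.
room_types = toy["room_type"].unique()
print(list(room_types))
```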


### Step 11: Prepare Room Type Data for Encoding
Convert the `room_type` column to a NumPy array and reshape it for encoding.

**Task**:
- Convert `room_type` to a list and then to a NumPy array.
- Reshape it to have one column and many rows (use `-1` as the first dimension).
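
The reshape step can be sketched as follows on a toy column (invented values). Passing `-1` tells NumPy to infer the number of rows from the data, and the `1` fixes the number of columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

# Column -> Python list -> NumPy array, then reshape to (n_rows, 1).
room_type_array = np.array(toy["room_type"].tolist()).reshape(-1, 1)
print(room_type_array.shape)
```

scikit-learn encoders expect 2-D input of shape `(n_samples, n_features)`, which is why a flat 1-D array needs this reshape first.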


### Step 12: Encode Room Type
Use `OrdinalEncoder` to transform the `room_type` array.

**Task**:
- Initialize an `OrdinalEncoder`.
- Apply it to `room_type` and update `df['room_type']` with the encoded values.
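
A minimal sketch of the encoding step, using invented toy data in place of `df`, might look like this. With default settings, `OrdinalEncoder` sorts the categories and numbers them from 0.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})
room_type_array = np.array(toy["room_type"].tolist()).reshape(-1, 1)

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(room_type_array)

# fit_transform returns a 2-D array; flatten it before assigning to the column.
toy["room_type"] = encoded.ravel()

print(encoder.categories_[0])
print(toy["room_type"].tolist())
```

Keep in mind that the resulting integers imply an order (0 < 1 < 2) that simply follows the sorted category names, so it may not carry any real meaning for room types.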


### Step 13: One-Hot Encode Neighborhood
Convert `neighbourhood` into binary columns using `OneHotEncoder`.

**Task**:
- Extract the `neighbourhood` column as a DataFrame.
- Use `OneHotEncoder` to encode it, setting `sparse_output=False`.
- Store the result as a new DataFrame with columns named after each neighborhood.
- Add a Markdown cell after the code cell explaining why one-hot encoding is used here and what alternatives are available.


### Step 14: Add Encoded Columns to DataFrame
Add the encoded neighborhood columns back to `df`.

**Task**: Write code to concatenate the one-hot encoded neighborhood DataFrame with `df`.
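
One way to do this is `pd.concat` with `axis=1`, sketched below on two small invented frames. The name `neighbourhood_df` is an assumption standing in for your encoded DataFrame.

```python
import pandas as pd

# Toy stand-ins; the values are invented for illustration.
toy = pd.DataFrame({"neighbourhood": ["Harlem", "Midtown"], "price": [100, 200]})
neighbourhood_df = pd.DataFrame({"Harlem": [1.0, 0.0], "Midtown": [0.0, 1.0]})

# axis=1 places the frames side by side, aligning rows on the index.
combined = pd.concat([toy, neighbourhood_df], axis=1)
print(combined)
```

Because rows are aligned on the index, both frames should share the same index; otherwise the result contains extra rows filled with `NaN`.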


### Step 15: Encode Neighborhood Group
One-hot encode the `neighbourhood_group` column.

**Task**:
- Extract the `neighbourhood_group` column.
- Apply one-hot encoding to this column as you did with `neighbourhood`.
- Print the categories to confirm.


### Step 16: Final Data Check
After processing the data, check the first few rows of `df` to confirm the transformations.

**Task**: Write code to display the first few rows of `df`.
