### **Data Wrangling**

**Problem Statement**: Data Wrangling on Real Estate Market

**Dataset**: "RealEstate_Prices.csv"

**Description**: The dataset contains information about housing prices in a specific real estate
market. It includes various attributes such as property characteristics, location, sale prices,
and other relevant features. The goal is to perform data wrangling to gain insights into the
factors influencing housing prices and prepare the dataset for further analysis or modeling.

**Tasks to Perform**:
1. Import the "RealEstate_Prices.csv" dataset. Clean column names by removing spaces,
special characters, or renaming them for clarity.
2. Handle missing values in the dataset, deciding on an appropriate strategy (e.g.,
imputation or removal).
3. Perform data merging if additional datasets with relevant information are available
(e.g., neighborhood demographics or nearby amenities).
4. Filter and subset the data based on specific criteria, such as a particular time period,
property type, or location.
5. Handle categorical variables by encoding them appropriately (e.g., one-hot encoding or
label encoding) for further analysis.
6. Aggregate the data to calculate summary statistics or derived metrics such as average
sale prices by neighborhood or property type.
7. Identify and handle outliers or extreme values in the data that may affect the analysis
or modeling process.

## ------------------------------------------------------------------------------------------------------------

## **Importing Libraries**

In [None]:
import numpy as np
import pandas as pd

## ------------------------------------------------------------------------------------------------------------

## **Task 1**

In [None]:
df = pd.read_csv("RealEstate_Prices.csv")

In [None]:
df

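| | Home | Price | SqFt | Bedrooms | Bathrooms | Offers | Brick | Neighborhood | Sales_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 114300.0 | 1790 | 2.0 | 2 | 2 | No | East | 15-01-2021 |
| 1 | 2 | 114200.0 | 2030 | 4.0 | 2 | 3 | No | East | 21-09-2022 |
| 2 | 3 | 114800.0 | 1740 | 3.0 | 2 | 1 | No | East | 13-03-2022 |
| 3 | 4 | 94700.0 | 1980 | 3.0 | 2 | 3 | No | East | 31-08-2021 |
| 4 | 5 | 119800.0 | 2130 | 3.0 | 3 | 3 | No | East | 31-08-2021 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 123 | 124 | 119700.0 | 1900 | 3.0 | 3 | 3 | Yes | East | 02-10-2022 |
| 124 | 125 | 147900.0 | 2160 | 4.0 | 3 | 3 | Yes | East | 13-03-2020 |
| 125 | 126 | 113500.0 | 2070 | 2.0 | 2 | 2 | No | North | 17-08-2021 |
| 126 | 127 | 149900.0 | 2020 | 3.0 | 3 | 1 | No | West | 15-05-2022 |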


In [None]:
df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace(" ", "_")
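
The cell above handles leading, trailing, and embedded spaces. Task 1 also mentions special characters, which this dataset's columns do not appear to contain. As a minimal sketch, assuming a frame with messier headers (the column names below are hypothetical and not from the dataset), a regular expression can collapse every run of non-alphanumeric characters into a single underscore:

```python
import pandas as pd

# Hypothetical messy column names, used only to illustrate the cleaning step
demo = pd.DataFrame(columns=[" Sale Price ($)", "Sq.Ft", "Sales-date"])

demo.columns = (
    demo.columns
    .str.strip()                                      # drop surrounding whitespace
    .str.replace(r"[^0-9a-zA-Z]+", "_", regex=True)   # non-alphanumeric runs -> "_"
    .str.strip("_")                                   # drop leading/trailing "_"
)

print(list(demo.columns))
```
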

## ------------------------------------------------------------------------------------------------------------

## **Task 2**

In [None]:
df.isnull().sum()

Home            0
Price           4
SqFt            0
Bedrooms        4
Bathrooms       0
Offers          0
Brick           0
Neighborhood    0
Sales_date      0
dtype: int64

In [None]:
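price_mean = df['Price'].mean()
bedrooms_mode = df['Bedrooms'].mode()[0]

# Impute missing Price values with the mean and missing Bedrooms values with the mode.
# Assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about.
df['Price'] = df['Price'].fillna(price_mean)
df['Bedrooms'] = df['Bedrooms'].fillna(bedrooms_mode)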

In [None]:
df.isnull().sum()

Home            0
Price           0
SqFt            0
Bedrooms        0
Bathrooms       0
Offers          0
Brick           0
Neighborhood    0
Sales_date      0
dtype: int64
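
A single global mean ignores differences between neighborhoods. One alternative strategy, shown here only as a sketch on a toy frame (the prices are a few rows from the outputs above, with two values blanked out for illustration), is to fill each missing price with the median of its own neighborhood:

```python
import numpy as np
import pandas as pd

# Toy data: a handful of prices, with two set to NaN for illustration
demo = pd.DataFrame({
    "Neighborhood": ["East", "East", "East", "North", "North", "North"],
    "Price": [114300.0, np.nan, 114800.0, 69100.0, 112300.0, np.nan],
})

# transform("median") returns one value per row, aligned with the original index,
# so it can be passed straight to fillna
group_median = demo.groupby("Neighborhood")["Price"].transform("median")
demo["Price"] = demo["Price"].fillna(group_median)

print(demo)
```

Whether a group-wise median is preferable to the global mean depends on how many observations each group has; with very small groups the global statistic may be more stable.
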

## ------------------------------------------------------------------------------------------------------------

## **Task 3 & 4**

In [None]:
# Filter: keep only homes with two or fewer bedrooms
filtered_df = df[df['Bedrooms'] <= 2]
filtered_df

| | Home | Price | SqFt | Bedrooms | Bathrooms | Offers | Brick | Neighborhood | Sales_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 114300.0 | 1790 | 2.0 | 2 | 2 | No | East | 15-01-2021 |
| 11 | 12 | 123000.0 | 1870 | 2.0 | 2 | 2 | Yes | East | 20-11-2021 |
| 18 | 19 | 111400.0 | 1700 | 2.0 | 2 | 1 | Yes | East | 08-07-2021 |
| 28 | 29 | 69100.0 | 1600 | 2.0 | 2 | 3 | No | North | 12-05-2020 |
| 31 | 32 | 112300.0 | 1930 | 2.0 | 2 | 2 | Yes | North | 05-10-2022 |
| 34 | 35 | 130718.548387 | 2000 | 2.0 | 2 | 3 | No | North | 26-08-2022 |
| 36 | 37 | 117500.0 | 1880 | 2.0 | 2 | 2 | No | North | 13-02-2021 |
| 42 | 43 | 105600.0 | 1990 | 2.0 | 2 | 3 | No | East | 24-07-2020 |
| 46 | 47 | 129800.0 | 1990 | 2.0 | 3 | 2 | No | North | 06-05-2022 |
| 48 | 49 | 115900.0 | 1980 | 2.0 | 2 | 2 | No | East | 29-08-2020 |
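
Task 4 also mentions filtering by time period. `Sales_date` is stored as a day-month-year string, so it has to be parsed before it can be compared as a date. The following sketch uses four rows copied from the first output; the choice of 2022 as the period is an arbitrary assumption:

```python
import pandas as pd

# Four rows taken from the displayed dataset
demo = pd.DataFrame({
    "Home": [1, 2, 3, 4],
    "Sales_date": ["15-01-2021", "21-09-2022", "13-03-2022", "31-08-2021"],
})

# An explicit format avoids day/month ambiguity during parsing
demo["Sales_date"] = pd.to_datetime(demo["Sales_date"], format="%d-%m-%Y")

# Keep only sales that happened in 2022
sold_2022 = demo[demo["Sales_date"].dt.year == 2022]
print(sold_2022)
```
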


In [None]:
# Subset: keep only the Price, SqFt, and Offers columns
subset_df = df[["Price", "SqFt", "Offers"]]
subset_df

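| | Price | SqFt | Offers |
|---|---|---|---|
| 0 | 114300.0 | 1790 | 2 |
| 1 | 114200.0 | 2030 | 3 |
| 2 | 114800.0 | 1740 | 1 |
| 3 | 94700.0 | 1980 | 3 |
| 4 | 119800.0 | 2130 | 3 |
| ... | ... | ... | ... |
| 123 | 119700.0 | 1900 | 3 |
| 124 | 147900.0 | 2160 | 3 |
| 125 | 113500.0 | 2070 | 2 |
| 126 | 149900.0 | 2020 | 1 |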
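
No additional dataset is available in this notebook, so the merge part of Task 3 is not executed above. If a neighborhood-level table did exist, a left join on `Neighborhood` would be one way to attach it. In the sketch below the `amenities` table and its `schools_nearby` column are entirely hypothetical placeholders:

```python
import pandas as pd

# Three rows taken from the displayed dataset
homes = pd.DataFrame({
    "Home": [1, 126, 127],
    "Neighborhood": ["East", "North", "West"],
    "Price": [114300.0, 113500.0, 149900.0],
})

# Hypothetical lookup table; the values are placeholders, not real data
amenities = pd.DataFrame({
    "Neighborhood": ["East", "North"],
    "schools_nearby": [3, 5],
})

# how="left" keeps every home even when its neighborhood has no match;
# validate raises if the lookup table has duplicate neighborhoods
merged = homes.merge(
    amenities, on="Neighborhood", how="left", validate="many_to_one"
)
print(merged)
```

Homes in a neighborhood missing from the lookup table (here, West) end up with `NaN` in the joined columns, which would then need the same missing-value treatment as in Task 2.
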


## ------------------------------------------------------------------------------------------------------------

## **Task 5**

In [None]:
label_mapping = {"No": 0, "Yes": 1}
df['Brick'] = df['Brick'].map(label_mapping)

In [None]:
df['Brick']

0      0
1      0
2      0
3      0
4      0
      ..
123    1
124    1
125    0
126    0
127    0
Name: Brick, Length: 128, dtype: int64
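
Label encoding suits `Brick` because it has only two values. `Neighborhood` has three unordered values (East, North, West), so one-hot encoding may be a better fit for it. A minimal sketch on a toy frame, using `pd.get_dummies`:

```python
import pandas as pd

# Toy frame with the three neighborhood values seen in the dataset
demo = pd.DataFrame({
    "Price": [114300.0, 113500.0, 149900.0, 114200.0],
    "Neighborhood": ["East", "North", "West", "East"],
})

# Each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(demo, columns=["Neighborhood"], dtype=int)
print(encoded)
```

For linear models, passing `drop_first=True` removes one indicator column and avoids perfectly collinear features.
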

## ------------------------------------------------------------------------------------------------------------

## **Task 6**

In [None]:
grouped_data = df.groupby(['Neighborhood'])['Price'].mean().reset_index()

In [None]:
grouped_data

| | Neighborhood | Price |
|---|---|---|
| 0 | East | 124891.935484 |
| 1 | North | 111348.570381 |
| 2 | West | 159294.871795 |
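
The mean alone can hide skew and group size. As a sketch, named aggregation can compute several summary statistics per neighborhood in one pass. The toy frame below reuses four rows from the dataset, and the output column names (`mean_price`, `median_price`, `n_homes`) are illustrative choices:

```python
import pandas as pd

# Four rows taken from the displayed dataset
demo = pd.DataFrame({
    "Neighborhood": ["East", "East", "North", "West"],
    "Price": [114300.0, 114200.0, 113500.0, 149900.0],
})

summary = (
    demo.groupby("Neighborhood")
    .agg(
        mean_price=("Price", "mean"),
        median_price=("Price", "median"),
        n_homes=("Price", "size"),
    )
    .reset_index()
)
print(summary)
```
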


## ------------------------------------------------------------------------------------------------------------

## **Task 7**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128 entries, 0 to 127
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Home          128 non-null    int64  
 1   Price         128 non-null    float64
 2   SqFt          128 non-null    int64  
 3   Bedrooms      128 non-null    float64
 4   Bathrooms     128 non-null    int64  
 5   Offers        128 non-null    int64  
 6   Brick         128 non-null    int64  
 7   Neighborhood  128 non-null    object 
 8   Sales_date    128 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 9.1+ KB


In [None]:
df['Price'] = df['Price'].astype(int)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128 entries, 0 to 127
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Home          128 non-null    int64  
 1   Price         128 non-null    int32  
 2   SqFt          128 non-null    int64  
 3   Bedrooms      128 non-null    float64
 4   Bathrooms     128 non-null    int64  
 5   Offers        128 non-null    int64  
 6   Brick         128 non-null    int64  
 7   Neighborhood  128 non-null    object 
 8   Sales_date    128 non-null    object 
dtypes: float64(1), int32(1), int64(5), object(2)
memory usage: 8.6+ KB


In [None]:
from scipy import stats

# Calculate the Z-scores for the 'Price' column
df['z_score'] = stats.zscore(df['Price'])

# Define a threshold for identifying outliers (e.g., 3)
threshold = 3

# Create a boolean mask to identify outliers
outliers_mask = (df['z_score'] > threshold) | (df['z_score'] < -threshold)

# Extract the outliers
outliers = df[outliers_mask]

# Print the outliers
outliers

| | Home | Price | SqFt | Bedrooms | Bathrooms | Offers | Brick | Neighborhood | Sales_date | z_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 103 | 104 | 211200 | 2440 | 4.0 | 3 | 3 | 1 | West | 15-01-2020 | 3.038658 |


In [None]:
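# Replace the price of every flagged outlier with the median price.
# The mask is based on the Z-score; comparing the raw price with the
# Z-score threshold would treat every price as an outlier.
# The cast to int keeps the column's integer dtype.
df.loc[outliers_mask, 'Price'] = int(df['Price'].median())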

In [None]:
print("Result of the 'Price' column:")
df['Price']

Result of the 'Price' column:


0      114300
1      114200
2      114800
3       94700
4      119800
        ...  
123    119700
124    147900
125    113500
126    149900
127    124600
Name: Price, Length: 128, dtype: int32

In [None]:
from scipy import stats

# Recalculate the Z-scores for the 'Price' column to confirm no outliers remain
df['z_score'] = stats.zscore(df['Price'])

# Define a threshold for identifying outliers (e.g., 3)
threshold = 3

# Create a boolean mask to identify outliers
outliers_mask = (df['z_score'] > threshold) | (df['z_score'] < -threshold)

# Extract the outliers
outliers = df[outliers_mask]

# Print the outliers
outliers

| | Home | Price | SqFt | Bedrooms | Bathrooms | Offers | Brick | Neighborhood | Sales_date | z_score |
|---|---|---|---|---|---|---|---|---|---|---|
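
The Z-score depends on the mean and standard deviation, which are themselves pulled by extreme values. A common alternative is the interquartile-range (IQR) rule, sketched here on ten prices copied from the outputs above. The 1.5 multiplier is the conventional default, not something prescribed by the task:

```python
import pandas as pd

# Ten prices taken from the displayed rows of the dataset
prices = pd.Series(
    [94700, 113500, 114200, 114300, 114800,
     119700, 119800, 147900, 149900, 211200],
    name="Price",
)

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_outliers = prices[(prices < lower) | (prices > upper)]

print(f"bounds: [{lower}, {upper}]")
print(iqr_outliers)
```

On this small sample the rule flags only the 211200 sale, the same home the Z-score method identified.
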
