The dataset contains information about housing prices in a specific real estate 
market. It includes various attributes such as property characteristics, location, sale prices, 
and other relevant features. The goal is to perform data wrangling to gain insights into the 
factors influencing housing prices and prepare the dataset for further analysis or modeling.
Tasks to Perform:
1. Import the "RealEstate_Prices.csv" dataset. Clean column names by removing spaces, 
special characters, or renaming them for clarity.
2. Handle missing values in the dataset, deciding on an appropriate strategy (e.g., 
imputation or removal).
3. Perform data merging if additional datasets with relevant information are available 
(e.g., neighborhood demographics or nearby amenities).
4. Filter and subset the data based on specific criteria, such as a particular time period, 
property type, or location.
5. Handle categorical variables by encoding them appropriately (e.g., one-hot encoding or 
label encoding) for further analysis.
6. Aggregate the data to calculate summary statistics or derived metrics such as average 
sale prices by neighborhood or property type.
7. Identify and handle outliers or extreme values in the data that may affect the analysis 
or modeling process.
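
Task 3 depends on a second table that is not supplied with this dataset. The following is a minimal sketch of how such a merge could look, assuming a hypothetical `demographics` table and a hypothetical shared `neighborhood` key; neither exists in the real estate file, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins: the 'neighborhood' key and the demographics
# table are assumptions, not part of the original dataset.
sales = pd.DataFrame({
    "No": [1, 2, 3],
    "neighborhood": ["A", "B", "C"],
    "unit_area": [46.7, 41.4, 58.1],
})
demographics = pd.DataFrame({
    "neighborhood": ["A", "B"],
    "median_income": [52000, 61000],
})

# A left join keeps every sale, even when no demographic row matches.
# validate="many_to_one" raises if the lookup table has duplicate keys.
merged = sales.merge(
    demographics,
    on="neighborhood",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows flagged 'left_only' found no match and hold NaN in the joined columns.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(f"{len(unmatched)} sale(s) without demographic data")
```

A left join is used here so that no sale is silently dropped; checking the `_merge` column afterwards shows how many rows would need imputation or removal in the missing-value step.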

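For Task 2, dropping every incomplete row is only one option. As a rough sketch of an alternative, the example below drops rows that lack the target value and fills gaps in the remaining numeric columns with the column median. The column names mirror the dataset, but the values and the choice of `unit_area` as the target are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are invented for this example.
df = pd.DataFrame({
    "houseage": [20.3, np.nan, 6.3, 16.9],
    "distance": [287.6, 405.2, np.nan, 368.1],
    "unit_area": [46.7, 41.4, 58.1, np.nan],
})

# Share of missing values per column, useful for picking a strategy.
missing_share = df.isna().mean()
print(missing_share)

# Rows without the target cannot be used for price analysis, so drop them.
cleaned = df.dropna(subset=["unit_area"]).copy()

# Fill the remaining gaps in the feature columns with each column's median.
feature_cols = ["houseage", "distance"]
cleaned[feature_cols] = cleaned[feature_cols].fillna(cleaned[feature_cols].median())
print(cleaned)
```

The median is less sensitive to extreme values than the mean, which matters here because outliers are only handled in a later step.
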
In [2]:
import pandas as pd

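# 1. Import the dataset and clean column names
# The task names "RealEstate_Prices.csv"; adjust the path to match your local file.
data = pd.read_csv("realestate.csv")
# Strip surrounding whitespace, then replace spaces and special characters with underscores
data.columns = (
    data.columns.str.strip()
    .str.replace(r"\W+", "_", regex=True)
    .str.strip("_")
)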

# 2. Handle missing values (choose a strategy that suits the data)
# Example: remove every row that contains at least one missing value
data = data.dropna()

# 3. Perform data merging if additional datasets are available (not provided here)

# 4. Filter and subset the data based on specific criteria
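# Example: keep transactions dated 2013 or later with a distance of at most 500
# .copy() makes the subset independent of `data`, so later changes do not touch the original
filtered_data = data[(data['transactiondate'] >= 2013) & (data['distance'] <= 500)].copy()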

# 5. Handle categorical variables by one-hot encoding (if applicable)
# Example: If 'stores' is a categorical variable
filtered_data = pd.get_dummies(filtered_data, columns=['stores'])

# 6. Aggregate the data to calculate summary statistics
# Example: Calculate the mean of 'unit_area' by house age
# ('unit_area' is assumed here to hold the sale price per unit area)
average_price_by_age = filtered_data.groupby('houseage')['unit_area'].mean()

# 7. Identify and handle outliers
# Example: Keep rows whose 'unit_area' lies between the 5th and 95th percentiles (inclusive)
lower_bound = filtered_data['unit_area'].quantile(0.05)
upper_bound = filtered_data['unit_area'].quantile(0.95)
filtered_data = filtered_data[filtered_data['unit_area'].between(lower_bound, upper_bound)]

# Display the processed DataFrame
print(filtered_data.head())


    No  transactiondate  houseage   distance  latitude  longitude  unit_area   
7    8         2013.417      20.3  287.60250  24.98042  121.54228       46.7  \
10  11         2013.083      34.8  405.21340  24.97349  121.53372       41.4   
11  12         2013.333       6.3   90.45606  24.97433  121.54310       58.1   
18  19         2013.417      16.9  368.13630  24.96750  121.54451       42.3   
21  22         2013.417      10.5  279.17260  24.97528  121.54541       51.6   

    stores_0  stores_1  stores_3  stores_4  stores_5  stores_6  stores_7   
7      False     False     False     False     False      True     False  \
10     False      True     False     False     False     False     False   
11     False     False     False     False     False     False     False   
18     False     False     False     False     False     False     False   
21     False     False     False     False     False     False      True   

    stores_8  stores_9  stores_10  
7      False     False    