## Cleaning the Data

Before the analysis, the raw listings are cleaned: duplicate rows are dropped, missing garage values are filled with 0, and the columns are renamed and normalized.

## Dataset Variables

The dataset consists of the following variables:

- **Property Type**: The type or category of the property (e.g., apartment, house).
- **Address**: The specific address or location of the property.
- **Region**: The broader region or district where the property is located.
- **Rent**: The monthly rent price for the property.
- **Total Rent**: The total rent, which might include additional costs or fees.
- **Area**: The size of the property in square meters.
- **Rooms**: The number of bedrooms in the property.
- **Bathrooms**: The number of bathrooms in the property.
- **Garage**: Indicates whether the property has a garage (1 for Yes, 0 for No).
- **Furnished**: Indicates whether the property is furnished (1 for Yes, 0 for No).

With a clear understanding of these variables, we can now proceed with the data cleaning and initial analyses.
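As a quick sanity check before cleaning, the sketch below counts duplicate rows and missing values per column. The `audit` helper and the toy rows are hypothetical and are not part of the original notebook; they only illustrate the kind of inconsistencies the cleaning step is meant to handle.

```python
import pandas as pd

# Hypothetical toy rows: column names mirror the dataset, values are made up.
sample = pd.DataFrame({
    "rent": [1500.0, 1500.0, 3200.0],
    "area": [45, 45, 80],
    "garage": [1.0, 1.0, None],
    "furnished": [False, False, True],
})

def audit(df):
    """Summarize duplicate rows and missing values before cleaning."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
    }

report = audit(sample)
print(report)
```

Running the same helper on the real DataFrame before and after `clean_data` would show whether the duplicates and missing `garage` values were actually removed.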


In [1]:
import pandas as pd
import os

def clean_data(filename):
    """
    Load and clean the housing data.

    :param filename: Name of the CSV file containing housing data.
    :return: A cleaned DataFrame.
    """
    
    # Construct the full path to the data
    data_path = os.path.join("C:\\", "github", "Rio-Housing-Scrapper", "Data(29-09_2023)", filename)
    
    # Import the data
    df = pd.read_csv(data_path)

    # Remove duplicated rows
    df_cleaned = df.drop_duplicates().reset_index(drop=True)

    # Rename the Portuguese rent columns to English
    df_cleaned = df_cleaned.rename(columns={'aluguel': 'rent', 'aluguel_total': 'total_rent'})

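    # Replace NaN with 0 in the "garage" column
    # (assign the result back; in-place fillna on a selected column is deprecated in recent pandas)
    df_cleaned['garage'] = df_cleaned['garage'].fillna(0)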

    # Convert values in the "furnished" column to 1 (True) and 0 (False)
    df_cleaned['furnished'] = df_cleaned['furnished'].astype(int)

    # Remove the literal suffix ", Rio de Janeiro" from the "region" column
    df_cleaned['region'] = df_cleaned['region'].str.replace(", Rio de Janeiro", "", regex=False)

    return df_cleaned

# Usage example
df_cleaned = clean_data("housing_data.csv")

# Rename columns to readable display labels for the analysis
df_cleaned = df_cleaned.rename(columns={
    'type_of_property': 'Property Type',
    'address': 'Address',
    'region': 'Region',
    'rent': 'Rent',
    'total_rent': 'Total Rent',
    'area': 'Area',
    'rooms': 'Rooms',
    'bathrooms': 'Bathrooms',
    'garage': 'Garage',
    'furnished': 'Furnished'
})

# Display the DataFrame (pandas truncates long output to the first and last rows)
df_cleaned


| | Property Type | Address | Region | Rent | Total Rent | Area | Rooms | Bathrooms | Garage | Furnished |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apartamento | Rua Marechal Bittencourt | Riachuelo | 700.000 | 1.368 | 55 | 2 | 1 | 1.0 | 0 |
| 1 | Apartamento | Rua Tonelero | Copacabana | 4.200 | 6.012 | 140 | 5 | 2 | 1.0 | 1 |
| 2 | Apartamento | Rua Correa Dutra | Flamengo | 2.450 | 3.198 | 53 | 1 | 1 | 0.0 | 0 |
| 3 | Apartamento | Rua Theodor Herzl | Botafogo | 3.300 | 4.075 | 61 | 2 | 1 | 0.0 | 0 |
| 4 | Apartamento | Rua Hilário de Gouvêia | Copacabana | 1.750 | 2.649 | 27 | 1 | 1 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2568 | Studio e kitnet | Avenida Flamboyants da Península | Barra da Tijuca | 9.375 | 12.290 | 135 | 1 | 2 | 0.0 | 1 |
| 2569 | Apartamento | Rua Homem de Melo | Tijuca | 3.840 | 4.906 | 180 | 3 | 2 | 1.0 | 1 |
| 2570 | Casa em condomínio | Rua Lage do Muriaé | Taquara | 4.840 | 5.562 | 306 | 3 | 4 | 0.0 | 0 |
| 2571 | Casa | Avenida Arapogi | Braz de Pina | 2.910 | 3.025 | 181 | 4 | 2 | 0.0 | 0 |
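One thing worth checking in the output above: a rent of 700.000 sits next to rents such as 4.200 and 2.450. This pattern may mean the raw file uses "." as a thousands separator (Brazilian convention), which pandas reads as a decimal point by default. That is an assumption about the raw CSV, not something the notebook confirms. If it holds, a minimal sketch of the fix is to pass `thousands` and `decimal` to `pd.read_csv`; the inline CSV text below is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical raw text imitating the assumed format: "." groups thousands.
raw_csv = io.StringIO(
    "aluguel,aluguel_total\n"
    "700,1.368\n"
    "4.200,6.012\n"
)

# Treat "." as the thousands separator and "," as the decimal mark.
parsed = pd.read_csv(raw_csv, thousands=".", decimal=",")

print(parsed)
```

With these options the rent columns are parsed as whole amounts in reais, so comparisons and averages across rows are on the same scale. Whether this applies to the real data should be verified against the raw file first.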


In [3]:
# Save the cleaned DataFrame to CSV
path = r"C:\github\Rio-Housing-Scrapper\Data(29-09_2023)\housing_data_set.csv"
df_cleaned.to_csv(path, index=False)
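Writing with `index=False` keeps the row index out of the file, so reloading it does not create an extra `Unnamed: 0` column. The sketch below checks that round trip on a small stand-in DataFrame written to a temporary directory; `df_example` and its values are hypothetical, and the real notebook writes to the fixed Windows path shown above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for df_cleaned.
df_example = pd.DataFrame({
    "Region": ["Copacabana", "Tijuca"],
    "Garage": [1.0, 0.0],
    "Furnished": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "housing_data_set.csv"
    # index=False: do not write the row index as a column
    df_example.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# True when columns, dtypes, and values all survive the round trip
round_trip_ok = reloaded.equals(df_example)
print(round_trip_ok)
```

Building the output path with `pathlib.Path` relative to the project folder, rather than a hard-coded drive letter, would also make the notebook easier to run on another machine.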

# Descriptive Statistics

