# House Prices Dataset - Exploratory Data Analysis (EDA)

In this notebook, we will explore the House Prices dataset. The objectives of this EDA are to:

- **Understand the Data Structure:**  
  Examine the columns, data types, and general summary statistics.
  
- **Identify Missing Values:**  
  Determine which features have missing values and how much of each feature is missing.

- **Visualize Distributions:**  
  Look at the distribution of key variables, especially the target variable `SalePrice`, and at how `SalePrice` relates to other features.

- **Detect Outliers:**  
  Identify any outliers or unusual data points that might need further investigation.

By doing this, we will lay the foundation for effective data cleaning and feature engineering in later steps.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Enable inline plotting
%matplotlib inline

# Load the training dataset from the 'data/raw' folder
# (absolute path specific to the author's machine; adjust it to your local copy)
df = pd.read_csv('C:/Users/ycarvalho/OneDrive - EDENRED/Documentos/Data_Analysis_MLE_Kaggle/House_Prices_Dataset/data/raw/train.csv')

print("First 5 rows of the dataset:")
print(df.head())

# Display basic information about the dataset
# (df.info() prints its report itself and returns None, so it is not wrapped in print())
print("\nDataset Information:")
df.info()

# Show summary statistics of the dataset
print("\nSummary Statistics:")
print(df.describe())

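# Check for missing values: list only the affected columns, most missing first
print("\nColumns with Missing Values:")
missing_counts = df.isnull().sum()
print(missing_counts[missing_counts > 0].sort_values(ascending=False))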
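
Raw counts alone do not show what share of each column is missing, which is what decides whether to impute a column or drop it. The sketch below is one possible way to report both numbers. The helper name `missing_summary` and the small `demo` frame are hypothetical stand-ins (the column names mirror the dataset, the values are made up), so treat it as an illustration rather than part of the original analysis.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value count and percentage per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually contain gaps, sorted by severity
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Hypothetical four-row frame standing in for train.csv
demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [None, None, None, "Grvl"],
    "SalePrice": [208500, 181500, 223500, 140000],
})
print(missing_summary(demo))
```

On the real data the same call would be `missing_summary(df)`.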

# Step 2: Data Cleaning

In this step, we'll clean the House Prices dataset by addressing missing values, handling outliers, and preparing the data for feature engineering. Here are the main tasks:

1. **Identify Missing Values:**  
   We'll inspect each column to see which ones have missing data and how many values are missing.

2. **Handle Missing Values:**  
   Depending on the column:
   - For numeric columns, we can fill missing values with the median (or mean) value.
   - For categorical columns, we may fill missing values with the mode or mark them as "None" if appropriate.
   - In some cases, if a column has too many missing values, we might consider dropping it.

3. **Detect Outliers:**  
   We'll look at the distributions of key numerical features (like `SalePrice`, `GrLivArea`, etc.) to identify any outliers that might need special handling.

4. **Data Type Conversions:**  
   Ensure that each feature is stored with an appropriate type (e.g., categorical features use a categorical dtype rather than generic `object`).

Cleaning the data properly will help ensure that our subsequent feature engineering and model building steps work more effectively.
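
The missing-value rules in task 2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's final cleaning code: the function name `impute_missing`, the `drop_threshold` and `none_columns` parameters, and the `demo` frame are all hypothetical, and the choice of which columns get `"None"` instead of the mode has to be made per feature.

```python
import numpy as np
import pandas as pd


def impute_missing(frame: pd.DataFrame, drop_threshold: float = 0.8,
                   none_columns: tuple = ()) -> pd.DataFrame:
    """Drop very sparse columns, then fill the remaining gaps.

    Numeric columns get the median, columns listed in `none_columns`
    get the string "None", and other columns get their mode.
    """
    out = frame.copy()  # leave the caller's frame untouched

    # Drop columns whose share of missing values exceeds the threshold
    missing_share = out.isnull().mean()
    out = out.drop(columns=missing_share[missing_share > drop_threshold].index)

    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        elif col in none_columns:
            out[col] = out[col].fillna("None")
        else:
            mode = out[col].mode()
            if not mode.empty:  # an all-missing column has no mode
                out[col] = out[col].fillna(mode.iloc[0])
    return out


# Hypothetical frame; assumes a missing GarageType means "no garage"
demo = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0, 90.0],
    "GarageType": ["Attchd", None, "Attchd", "Detchd", "Attchd"],
    "MasVnrType": ["BrkFace", "Stone", None, "BrkFace", "BrkFace"],
    "PoolQC": [None, None, None, None, "Gd"],
})
cleaned = impute_missing(demo, drop_threshold=0.5, none_columns=("GarageType",))
print(cleaned)
```

The median is used for numeric columns because it is less sensitive to extreme values than the mean, which matters for skewed features.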


In [None]:
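# Visualize missing values (highlighted cells mark missing entries; row labels are hidden to reduce clutter)
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap="viridis", yticklabels=False)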
plt.title("Missing Values Heatmap")
plt.show()
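
For the outlier-detection task in Step 2, one common option is the interquartile-range (IQR) rule, which flags values more than `k` times the IQR beyond the first or third quartile. The sketch below assumes that rule with the conventional `k = 1.5`. The helper `iqr_outlier_mask` and the `demo` values are hypothetical, and the notebook may settle on a different criterion.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean Series that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Hypothetical rows: one house with an unusually large living area
demo = pd.DataFrame({
    "GrLivArea": [1200, 1500, 1700, 1400, 1600, 5600],
    "SalePrice": [150000, 180000, 210000, 165000, 200000, 160000],
})
mask = iqr_outlier_mask(demo["GrLivArea"])
print(demo[mask])  # rows flagged for further investigation
```

Flagged rows are candidates for inspection, not automatic removal; on the real data the call would be `iqr_outlier_mask(df["GrLivArea"])`.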