# Introduction 4: Exploratory Data Analysis (EDA)
This notebook focuses on understanding your data before modeling. You will learn how to explore the structure, distributions, and relationships in the Boston housing dataset, and see why EDA matters in any data science project.

In [1]:
# Requires scikit-learn < 1.2; load_boston was removed in version 1.2.
from sklearn.datasets import load_boston


In [2]:
# The Boston housing dataset is returned as a dictionary-like object:
# print(load_boston())

# To understand what features and metadata are included, print the dataset description:
print(load_boston()["DESCR"])


.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

Now that we've examined the dataset description, we can ask important questions: Are 506 samples (each a census tract, not an individual house) enough for a reliable model? What year is this data from, and is it still relevant?

More importantly, some features in this dataset are problematic and raise ethical concerns. The `B` variable, for example, is derived from the proportion of Black residents by town. Using such variables in a predictive model can perpetuate bias and lead to discriminatory outcomes.

This highlights why it's essential to look closely at your data, not just for technical reasons but for ethical ones as well. The Boston housing dataset has been widely used in teaching, but because of these issues `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2.
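
Because the loader is not present in every scikit-learn release, it can help to check for it before running the cells above. The following is a minimal sketch; the helper name `boston_loader_available` is hypothetical and not part of scikit-learn.

```python
def boston_loader_available():
    """Return True if the installed scikit-learn still provides load_boston."""
    try:
        # Raises ImportError if scikit-learn is missing or no longer ships the loader.
        from sklearn.datasets import load_boston  # noqa: F401
    except ImportError:
        return False
    return True


if boston_loader_available():
    print("load_boston is available in this environment.")
else:
    print("load_boston is not available in this environment.")
```

If the check fails, the cells that call `load_boston()` will not run as written.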

Beyond technical correctness, responsible data science means:
- Understanding your dataset and its context.
- Being skeptical of model performance, especially if results seem "too good."
- Considering the ethical and social impact of your models.
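
One concrete way to act on these points is to exclude a sensitive feature such as `B` before modeling. The sketch below assumes the data is held as a list of rows plus a list of feature names. The helper `drop_features` is hypothetical, and the numbers are synthetic placeholders, not real Boston values.

```python
def drop_features(rows, feature_names, to_drop):
    """Return copies of rows and feature_names without the columns named in to_drop."""
    # Indices of the columns we want to keep, in their original order.
    keep = [i for i, name in enumerate(feature_names) if name not in to_drop]
    kept_names = [feature_names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_rows, kept_names


# Synthetic placeholder values, NOT real Boston data.
feature_names = ["RM", "B", "TAX"]
rows = [
    [6.0, 390.0, 300.0],
    [5.5, 380.0, 250.0],
]

filtered_rows, filtered_names = drop_features(rows, feature_names, {"B"})
print(filtered_names)
print(filtered_rows)
```

Dropping a column does not by itself remove bias, because other features may be correlated with the one removed, so this is a starting point rather than a fix.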

Scikit-learn's API is powerful, but the most important part of data science is using it thoughtfully and responsibly.