# Lesson Objective — Ames Housing Dataset Project

In this session, we will simulate a realistic data science workflow using the Ames Housing dataset. The focus is not only on data analysis but also on **project structure, reproducibility, and documentation**, which are essential skills for advanced data science practice in both academic and industry settings.

## 1. Environment Setup

The first step involves configuring a reproducible working environment. This includes setting up Python dependencies, defining package versions, and ensuring that the project can be executed consistently across different machines. Proper environment management is critical for collaboration and long-term project maintenance.


In [1]:
# Option A: create the virtual environment from PyCharm
# https://www.jetbrains.com/help/pycharm/creating-virtual-environment.html

# Option B: use VS Code with Anaconda
# https://code.visualstudio.com/docs/python/environments
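
Once the environment is active, it helps to record which versions are actually in use. The snippet below is a minimal sketch, not part of the original lesson: `describe_environment` is a hypothetical helper name, and it relies only on the standard-library `platform` and `importlib.metadata` modules.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Map 'python' and each requested package to its installed version (None if absent)."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            versions[name] = None
    return versions


# "pandas" is the library this lesson imports; the second name is deliberately
# fake to show how a missing package is reported.
env = describe_environment(["pandas", "not-a-real-package-xyz"])
for name, version in env.items():
    print(f"{name}: {version}")
```

Pasting this output into the README or the first notebook cell gives collaborators a quick way to compare their setup with yours.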

## 2. Project Scaffolding

A well-defined project structure is essential for reproducibility, collaboration, and long-term maintainability in data science projects. One widely adopted reference is the Cookiecutter Data Science template, which provides best practices for organizing datasets, code, documentation, and outputs in a consistent and scalable way. While this structure is not mandatory, it serves as a strong starting point for professional and academic projects.

Below is a suggested folder organization for the Ames Housing project:

In [2]:
# ames-project/
# │
# ├── data/
# │   ├── raw/
# │   └── processed/
# │
# ├── notebooks/
# ├── src/
# ├── reports/
# │   └── figures/
# │
# ├── README.md
# ├── requirements.txt
# └── .gitignore

It is important to note that this structure is only a recommendation. The exact organization may vary depending on the project scope, team preferences, or specific research needs. The key principle is to maintain clarity, consistency, and separation of concerns so that anyone interacting with the project can quickly understand its components.
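
The suggested layout can also be created programmatically. The following is a sketch under the assumption that the folder and file names are exactly those in the tree above; `scaffold` is a hypothetical helper, and the demonstration writes into a temporary directory so that nothing on disk is touched.

```python
import tempfile
from pathlib import Path

# Folder and file names copied from the suggested structure above.
FOLDERS = ["data/raw", "data/processed", "notebooks", "src", "reports/figures"]
FILES = ["README.md", "requirements.txt", ".gitignore"]


def scaffold(root):
    """Create the suggested folders and empty placeholder files under root."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    return root


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    project = scaffold(Path(tmp) / "ames-project")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))

print("\n".join(created))
```

Because `mkdir` and `touch` are called with `exist_ok=True`, running the helper a second time on an existing project should leave existing files untouched.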

### Folder and File Overview
`data/`

This directory stores all datasets used in the project. It is commonly divided into:
-   `raw/` → original, immutable datasets as obtained from the source. These should never be modified directly.
-   `processed/` → cleaned or transformed datasets ready for analysis or modeling.

Maintaining this separation helps ensure reproducibility and traceability of preprocessing steps.

`notebooks/`

This folder contains Jupyter notebooks used for exploratory data analysis (EDA), prototyping models, visualization, and experimentation. Notebooks are ideal for iterative work but should eventually be complemented by modular code in the `src/` folder for production-level workflows.

`src/`

The main source code directory. This is where reusable scripts, preprocessing pipelines, feature engineering code, model training routines, and evaluation functions should live. Keeping code modular here improves maintainability and prevents notebooks from becoming overly complex.

`reports/`

This directory stores generated outputs such as analysis reports, summaries, or presentation materials.
-   `figures/` → dedicated subfolder for plots, charts, and visualizations produced during analysis.

Separating outputs from raw analysis helps organize communication artifacts clearly.

`README.md`

This file provides an overview of the project, including:
-   project objectives
-   dataset description
-   environment setup instructions
-   usage guidelines
-   key findings or notes

A good README significantly improves project usability and reproducibility.

`requirements.txt`

This file lists all Python dependencies required to run the project. It can be generated using:

In [3]:
# pip freeze > requirements.txt

and installed with:

In [4]:
# pip install -r requirements.txt

This ensures that collaborators can recreate the same computational environment.
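
To confirm that an environment really matches the pinned versions, the pins can be compared with what is installed. This is a simplified sketch: `parse_pins` and `find_mismatches` are hypothetical helpers, only `name==version` lines are handled, and the versions in `example` are made up for illustration.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines, skipping blank lines and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} where the installed version differs or is absent."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Hypothetical file content; real pins come from your own requirements.txt.
example = "pandas==2.2.2  # hypothetical pin\n\n# a comment line\nnumpy==1.26.4\n"
pins = parse_pins(example)
print(pins)
print(find_mismatches(pins))
```

An empty result from `find_mismatches` suggests the environment agrees with the file; any entry points to a package to reinstall or re-pin.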

`.gitignore`

This file specifies which files or directories Git should ignore. Typical entries include:

-   temporary files
-   environment-specific configuration files
-   large raw datasets (when not versioned)
-   system-generated artifacts

Proper use of `.gitignore` keeps the repository clean and manageable.

In [5]:
# The structure above is a suggestion; it can vary with the project and the team's preferences. What matters is keeping the organization clear and consistent so that collaboration and maintenance are easier.

In [6]:
# Explanation of each item in the structure:

# data/: stores the project's data. It can be split into subfolders such as raw/ for the raw data and processed/ for the processed data.

# notebooks/: stores the Jupyter notebooks used for data exploration, analysis, and visualization.

# src/: stores the project's source code, such as preprocessing, modeling, and evaluation scripts.

# reports/: stores the reports generated from the data analysis, including figures and charts.

# README.md: text file that gives an overview of the project, including its description, objectives, installation and usage instructions, and other relevant information.

# requirements.txt: lists the project's dependencies, making it easy to install the required libraries.
# pip freeze > requirements.txt
# pip install -r requirements.txt

# .gitignore: specifies which files or folders the Git version control system should ignore, such as temporary files, raw data, or local configuration files.

## 3. Dataset Loading

At this stage, we focus on reading the Ames Housing dataset safely and reproducibly. This includes handling file paths, verifying data integrity, and ensuring that the dataset loads correctly into a pandas DataFrame for further analysis.
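
One way to make loading safer is to wrap `pd.read_csv` in a small function that fails early with a clear message. The sketch below is illustrative only: `load_csv_checked` is a hypothetical helper, the checks are a minimal selection, and the demonstration uses a tiny stand-in file rather than the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_checked(path, required_columns=()):
    """Read a CSV after confirming the file exists, then verify basic expectations."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path.resolve()}")
    df = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"{path} contains no rows")
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"{path} is missing expected columns: {missing}")
    return df


# Demonstration on a two-row stand-in file (values borrowed from the first rows of the dataset).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_text("PID,SalePrice\n526301100,215000\n526350040,105000\n")
    sample = load_csv_checked(sample_path, required_columns=["PID", "SalePrice"])

print(sample.shape)
```

Using `pathlib.Path` keeps path handling consistent across operating systems, and resolving the path in the error message makes a wrong working directory easy to spot.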

In [7]:
#libraries
import pandas as pd

In [8]:
# Download the dataset from Kaggle and place it in the data/raw/ folder
# https://www.kaggle.com/datasets/shashanknecrothapa/ames-housing-dataset?resource=download

In [11]:
# Read the dataset with pandas
data_path = '../data/raw/AmesHousing.csv'

df = pd.read_csv(data_path)
df.head()


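|  | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 189900 |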


In [12]:
#print size of the dataset
print(f"Dataset shape: {df.shape}")

Dataset shape: (2930, 82)


## 4. Initial Data Exploration

At this stage, we conduct a structured exploratory analysis of the Ames Housing dataset to gain a comprehensive understanding of its structure, content, and potential data quality issues. This step goes beyond simple descriptive statistics — it aims to develop contextual awareness of the variables, assess data reliability, and prepare the dataset for downstream modeling tasks.

A key objective is to systematically inspect all columns in the dataframe, ensuring that no variable is overlooked during the initial profiling phase. This includes identifying data types, understanding variable semantics, detecting inconsistencies, and evaluating missing data patterns.

### Variable Typing and Contextual Understanding

Each column should first be classified according to its statistical and computational type, such as:
-   Numerical (continuous or discrete)
-   Categorical (nominal or ordinal)
-   Datetime or temporal variables
-   Identifier-like variables (e.g., IDs)

Beyond technical typing, students should also investigate the semantic meaning of each variable. For example, a variable like `Alley` in the Ames dataset refers to the type of alley access to a property. Understanding this context is essential because missing values in such variables may reflect the absence of an alley rather than incomplete data.

This contextual interpretation is fundamental for making sound preprocessing decisions.
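
The `Alley` case can be illustrated on a toy file. The sketch below assumes a raw file that writes the literal string `NA` for properties without an alley; the rows are invented, and the replacement label `"No alley"` is an arbitrary choice, not a value from the dataset.

```python
import io

import pandas as pd

# Toy CSV (hypothetical rows) that writes "NA" where a property has no alley.
csv_text = "PID,Alley\n1,Grvl\n2,NA\n3,Pave\n4,NA\n"

# With default settings, pandas parses the literal string "NA" as a missing value.
parsed = pd.read_csv(io.StringIO(csv_text))
n_missing = int(parsed["Alley"].isna().sum())

# Making the structural meaning explicit with a label of our choosing.
alley_labelled = parsed["Alley"].fillna("No alley")

print(n_missing)
print(alley_labelled.value_counts())
```

After relabelling, the absence of an alley becomes a regular category instead of a gap to impute. This is only appropriate when the documentation confirms that a missing value means "not applicable".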

### Summary Profiling Table

As part of the exploratory process, students should construct a summary table where:
-   each row corresponds to a dataset variable, and
-   each column contains descriptive profiling metrics.

This table serves as a compact diagnostic overview of the dataset.

### Description of Each Summary Column

**Number of Values** (`num of values`)

Represents the total number of non-null observations in the column. This helps verify dataset completeness and identify columns with substantial missing information.

---

**Percentage of Missing Values** (`% of missing values`)

Indicates the proportion of missing data relative to the total dataset size. This metric is critical for assessing data quality and determining whether imputation, exclusion, or domain interpretation is required.

---

**Unique Values** (`Categorical Variables`)

Counts the number of distinct categories present in categorical columns. This helps evaluate cardinality, detect potential encoding challenges, and identify variables that may behave like identifiers rather than true categorical features.

---

**Mean** (`Numerical Variables`)

Provides the arithmetic average for numeric columns. Although simple, this statistic offers a first indication of central tendency and can highlight anomalies when compared with medians or expected domain values.

---

**Python Variable Type** (`python type`)

Refers to the pandas dtype assigned to the column (e.g., int64, float64, object, datetime64). This classification is essential for determining appropriate preprocessing steps and verifying whether the inferred dtype aligns with the variable’s intended meaning.

---

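**Count of Float Values** (`float`)

Reports how many entries in the column are stored as Python `float` objects; the code in this notebook expresses it as a percentage of the column's rows. Because pandas represents a missing value as `NaN`, which is itself a float, missing entries in a string column are counted here as well. This is useful for detecting unintended type mixing or numeric coercion issues.

---

**Count of Integer Values** (`int`)

Measures how many values are stored as integers, again reported as a percentage of rows. Differences between integer counts and total numeric observations may indicate type inconsistencies or missing-value casting effects, such as an integer column being converted to float because it contains `NaN`.

---

**Count of String Values** (`str`)

Indicates the number of string-type observations, reported as a percentage of rows. This is particularly useful in columns expected to be numeric, where the presence of strings may signal parsing errors, data corruption, or inconsistent data entry.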
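
The three type columns can be illustrated on a toy frame. This is a sketch with invented rows (the column names are borrowed from the dataset); `python_type_counts` is a hypothetical helper that returns raw counts rather than percentages.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy data: a string column with gaps, an integer column, and a float column with a gap.
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, "Pave", np.nan],
    "Lot Area": [31770, 11622, 14267, 11160],
    "Lot Frontage": [141.0, np.nan, 81.0, 93.0],
})


def python_type_counts(frame):
    """Count, per column, how many entries are Python float, int, str, and so on."""
    rows = {
        col: dict(Counter(type(value).__name__ for value in frame[col].tolist()))
        for col in frame.columns
    }
    # Types that never occur in a column are filled with 0.
    return pd.DataFrame.from_dict(rows, orient="index").fillna(0).astype(int)


type_counts = python_type_counts(toy)
print(type_counts)
```

In this toy frame the two missing `Alley` entries show up under `float`, not `str`, which is why a string column with gaps appears to contain floats.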

### Validation and Consistency Check

Students should perform a manual double-check of the generated summary table to ensure correctness. Automated profiling can sometimes produce misleading results due to hidden type coercion, encoding issues, or missing-value representations.

This verification step encourages critical thinking and reinforces an important principle in advanced data science:

> Automated analysis should always be complemented by human validation.
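
Part of this double-check can be scripted as a few cross-checks between the summary table and values recomputed directly from the data. The following is a sketch on an invented toy frame; the three checks are examples, not an exhaustive validation.

```python
import numpy as np
import pandas as pd

# Toy data with a known answer: half of the Alley values are missing.
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, "Pave", np.nan],
    "Lot Area": [31770, 11622, 14267, 11160],
})

summary = pd.DataFrame({
    "Num of Values": toy.count(),
    "% of Missing Values": toy.isnull().mean() * 100,
})

# Check 1: non-null count plus missing count must equal the number of rows.
counts_add_up = bool((summary["Num of Values"] + toy.isna().sum() == len(toy)).all())

# Check 2: the percentage must agree with a value recomputed from the counts.
recomputed_pct = (1 - summary["Num of Values"] / len(toy)) * 100
pct_agrees = bool(np.allclose(summary["% of Missing Values"], recomputed_pct))

# Check 3: spot-check one column against a value worked out by hand.
alley_missing_pct = float(summary.loc["Alley", "% of Missing Values"])

print(counts_add_up, pct_agrees, alley_missing_pct)
```

The same spot-check works on the real table: `Alley` has 198 non-null values out of 2930 rows, so (2930 - 198) / 2930 ≈ 93.24% should be missing.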

In [13]:
# Beyond head(): what does this dataset represent? What is it about?

In [14]:
# Write a description of each column and of the dataset as a whole.
# Use web sources and generative AI for this.
# This work matters: understanding what each column means and what the dataset represents lets you analyze the data more deeply and draw more accurate conclusions.
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2930 non-null   int64  
 1   PID              2930 non-null   int64  
 2   MS SubClass      2930 non-null   int64  
 3   MS Zoning        2930 non-null   str    
 4   Lot Frontage     2440 non-null   float64
 5   Lot Area         2930 non-null   int64  
 6   Street           2930 non-null   str    
 7   Alley            198 non-null    str    
 8   Lot Shape        2930 non-null   str    
 9   Land Contour     2930 non-null   str    
 10  Utilities        2930 non-null   str    
 11  Lot Config       2930 non-null   str    
 12  Land Slope       2930 non-null   str    
 13  Neighborhood     2930 non-null   str    
 14  Condition 1      2930 non-null   str    
 15  Condition 2      2930 non-null   str    
 16  Bldg Type        2930 non-null   str    
 17  House Style      2930 non

In [15]:
#Order: A unique identifier for each property (integer).
#PID: Another unique identifier for each property (integer).

# Is there a difference between Order and PID? Both are unique identifiers for each property, but they may have been generated in different ways or used for different purposes. Order may be a sequential number assigned to the properties, while PID may be a more complex identifier that encodes additional information, such as the location or type of the property. To understand the difference better, check the dataset documentation or inspect the values in each column.

In [16]:
#MS SubClass: The type of dwelling involved in the sale (categorical).
# Count how often each MS SubClass code occurs, sorted by code
df['MS SubClass'].value_counts().sort_index()

MS SubClass
20     1079
30      139
40        6
45       18
50      287
60      575
70      128
75       23
80      118
85       48
90      109
120     192
150       1
160     129
180      17
190      61
Name: count, dtype: int64
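
Because `MS SubClass` is stored as `int64` but represents dwelling types, it may be worth making the categorical intent explicit. The sketch below uses a handful of codes taken from the output above; casting with `astype("category")` is one possible approach, not a step the lesson prescribes.

```python
import pandas as pd

# A few MS SubClass codes from the output above, in arbitrary order.
codes = pd.Series([20, 60, 20, 120, 50, 20], name="MS SubClass")

# As int64 the mean can be computed, but it has no meaning for dwelling-type codes.
meaningless_mean = codes.mean()

# Casting to a categorical dtype records that the numbers are labels, not quantities.
as_category = codes.astype("category")
counts = as_category.value_counts().sort_index()

print(round(meaningless_mean, 2))
print(counts)
```

Once the column is categorical, a profiling step that selects numeric dtypes should no longer report a mean for it.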

In [17]:
#MS Zoning: The general zoning classification of the sale (categorical).
# Count how often each MS Zoning category occurs, sorted alphabetically
df['MS Zoning'].value_counts().sort_index()

MS Zoning
A (agr)       2
C (all)      25
FV          139
I (all)       2
RH           27
RL         2273
RM          462
Name: count, dtype: int64

In [18]:
#Lot Frontage: Linear feet of street connected to the property (numerical).
#Lot Area: Lot size in square feet (numerical).
#Street: Type of road access to the property (categorical).
# Count how often each Street category occurs, sorted alphabetically
df['Street'].value_counts().sort_index()
# What does this mean? Grvl and Pave are the two types of street access in the dataset. Grvl stands for gravel: the property is reached by a gravel road. Pave stands for paved: the property is reached by a paved road.

Street
Grvl      12
Pave    2918
Name: count, dtype: int64

In [20]:
#Alley: Type of alley access to the property (categorical).
# Count how often each Alley category occurs, sorted alphabetically
df['Alley'].value_counts().sort_index()
# What does this mean? Grvl and Pave are the two types of alley access in the dataset. Grvl stands for gravel: the property has access to a gravel alley. Pave stands for paved: the property has access to a paved alley.
# Difference between Street and Alley: Street refers to the type of road access to the property, while Alley refers to the type of alley access. Street is either gravel or paved, while Alley is gravel, paved, or no alley access (NA).

Alley
Grvl    120
Pave     78
Name: count, dtype: int64

In [23]:
#create a table with summary statistics with:
#- name of variable
#- type of variable (numerical, categorical, etc.)
#- num of values
#- % of missing values
#- unique values (for categorical variables)
#- mean for numerical variables
#- python type of variable (int, float, object, etc.)
#- % of values whose Python type differs from the column's dominant type (int, float, str, etc.)


# def pct_inconsistent_types(col):
#     types = col.dropna().apply(lambda x: type(x).__name__)
#     if len(types) == 0:
#         return 0
#     dominant = types.mode()[0]
#     return (types != dominant).mean() * 100

# print(df['Alley'].apply(lambda x: type(x).__name__).value_counts())

# print(pct_inconsistent_types(df['Alley']))

summary_table = pd.DataFrame({
    'Num of Values': df.count(),
    '% of Missing Values': df.isnull().mean() * 100,
    'Unique Values (Categorical)': df.select_dtypes(include=['object', 'string']).nunique(),
    'Mean (Numerical)': df.select_dtypes(include='number').mean(),
    'Python Type': df.dtypes,
    # '% Inconsistent Python Types': df.apply(pct_inconsistent_types)
})

summary_table


|  | Num of Values | % of Missing Values | Unique Values (Categorical) | Mean (Numerical) | Python Type |
| --- | --- | --- | --- | --- | --- |
| 1st Flr SF | 2930 | 0.000000 | NaN | 1159.557679 | int64 |
| 2nd Flr SF | 2930 | 0.000000 | NaN | 335.455973 | int64 |
| 3Ssn Porch | 2930 | 0.000000 | NaN | 2.592491 | int64 |
| Alley | 198 | 93.242321 | 2.0 | NaN | str |
| Bedroom AbvGr | 2930 | 0.000000 | NaN | 2.854266 | int64 |
| ... | ... | ... | ... | ... | ... |
| Utilities | 2930 | 0.000000 | 3.0 | NaN | str |
| Wood Deck SF | 2930 | 0.000000 | NaN | 93.751877 | int64 |
| Year Built | 2930 | 0.000000 | NaN | 1971.356314 | int64 |
| Year Remod/Add | 2930 | 0.000000 | NaN | 1984.266553 | int64 |


In [24]:
python_types_dist = df.apply(
    lambda col: col.apply(lambda x: type(x).__name__)
                  .value_counts(normalize=True) * 100
)


# Transpose python_types_dist so that the Python types become columns and the variables become the index
python_types_dist = python_types_dist.transpose()

python_types_dist

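|  | float | int | str |
| --- | --- | --- | --- |
| Order | NaN | 100.0 | NaN |
| PID | NaN | 100.0 | NaN |
| MS SubClass | NaN | 100.0 | NaN |
| MS Zoning | NaN | NaN | 100.0 |
| Lot Frontage | 100.0 | NaN | NaN |
| ... | ... | ... | ... |
| Mo Sold | NaN | 100.0 | NaN |
| Yr Sold | NaN | 100.0 | NaN |
| Sale Type | NaN | NaN | 100.0 |
| Sale Condition | NaN | NaN | 100.0 |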


In [25]:
#merge summary_table and python_types_dist on index
final_summary = summary_table.merge(python_types_dist, left_index=True, right_index=True, how='left')

In [26]:
final_summary

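|  | Num of Values | % of Missing Values | Unique Values (Categorical) | Mean (Numerical) | Python Type | float | int | str |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1st Flr SF | 2930 | 0.000000 | NaN | 1159.557679 | int64 | NaN | 100.0 | NaN |
| 2nd Flr SF | 2930 | 0.000000 | NaN | 335.455973 | int64 | NaN | 100.0 | NaN |
| 3Ssn Porch | 2930 | 0.000000 | NaN | 2.592491 | int64 | NaN | 100.0 | NaN |
| Alley | 198 | 93.242321 | 2.0 | NaN | str | 93.242321 | NaN | 6.757679 |
| Bedroom AbvGr | 2930 | 0.000000 | NaN | 2.854266 | int64 | NaN | 100.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Utilities | 2930 | 0.000000 | 3.0 | NaN | str | NaN | NaN | 100.000000 |
| Wood Deck SF | 2930 | 0.000000 | NaN | 93.751877 | int64 | NaN | 100.0 | NaN |
| Year Built | 2930 | 0.000000 | NaN | 1971.356314 | int64 | NaN | 100.0 | NaN |
| Year Remod/Add | 2930 | 0.000000 | NaN | 1984.266553 | int64 | NaN | 100.0 | NaN |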


In [27]:
#name of columns of final_summary
final_summary.columns

Index(['Num of Values', '% of Missing Values', 'Unique Values (Categorical)',
       'Mean (Numerical)', 'Python Type', 'float', 'int', 'str'],
      dtype='str')

In [None]:
#print unique values from Alley column
print(df['Alley'].unique())
print()
#print count of unique values from Alley column
print(df['Alley'].value_counts())

In [None]:
# Fraction of rows with a non-missing Alley value (120 Grvl + 78 Pave)
(120+78)/len(df)

In [None]:
# Create the data-type column (categorical, numerical, etc.) based on the pandas dtype
def data_type(col):
    if pd.api.types.is_numeric_dtype(col):
        return 'Numerical'
    elif pd.api.types.is_string_dtype(col):
        return 'Categorical'
    elif pd.api.types.is_datetime64_any_dtype(col):
        return 'Datetime'
    else:
        return 'Other'

final_summary['Data Type'] = df.apply(data_type)

In [None]:
final_summary
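
A classification based only on the dtype labels `Order`, `PID`, and `MS SubClass` as numerical, although the first two are identifiers and the third is a coded category. One possible refinement is sketched below on invented rows; `classify_column` and its two override parameters are hypothetical names, and the override lists have to be filled in by hand from the data dictionary.

```python
import pandas as pd


def classify_column(col, identifiers=(), coded_categoricals=()):
    """Classify a column, letting manual overrides take priority over the dtype."""
    if col.name in identifiers:
        return "Identifier"
    if col.name in coded_categoricals:
        return "Categorical"
    if pd.api.types.is_datetime64_any_dtype(col):
        return "Datetime"
    if pd.api.types.is_numeric_dtype(col):
        return "Numerical"
    if pd.api.types.is_string_dtype(col) or isinstance(col.dtype, pd.CategoricalDtype):
        return "Categorical"
    return "Other"


# Toy rows using column names from the dataset.
toy = pd.DataFrame({
    "PID": [526301100, 526350040],
    "MS SubClass": [20, 20],
    "Lot Area": [31770, 11622],
    "Street": ["Pave", "Pave"],
})

labels = pd.Series({
    name: classify_column(toy[name], identifiers=("PID",), coded_categoricals=("MS SubClass",))
    for name in toy.columns
})
print(labels)
```

Keeping the overrides as explicit arguments documents which decisions came from reading the data dictionary and which came from the dtype.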

## 5. Writing the Project README

The final step of this exercise is to produce a well-structured README file that documents the project clearly and professionally. Writing good documentation is a core skill in advanced data science because it ensures reproducibility, facilitates collaboration, and allows others to quickly understand the purpose, structure, and findings of the project.
The README should be written with the assumption that the reader has no prior knowledge of the project.

### Project Title

Start with a clear and concise project title. The title should reflect the dataset, the analytical focus, or the main objective of the project. A good title immediately communicates the scope and theme of the work.

### Project Description

(Background, Motivation, and Objectives)

Provide a short but informative overview of the project. This section should include:
-   the background context of the problem or dataset
-   the motivation for conducting the analysis
-   the main objectives or research questions

The goal is to help readers understand why the project exists, what problem it addresses, and what outcomes are expected.

### Installation Instructions

Include step-by-step guidance on how to reproduce the working environment. This typically involves:
-   creating and activating a virtual environment
-   installing dependencies (e.g., via requirements.txt)
-   instructions for running notebooks, scripts, or pipelines

Clear installation instructions are essential for reproducibility and collaboration, particularly in academic and professional settings.

### Data Source and Data Structure

Describe the origin of the dataset and how it is organized. This section should include:
-   where the data was obtained (e.g., repository, competition, public dataset)
-   format of the data (CSV, database export, etc.)
-   number of observations and variables
-   general description of key features

This contextual information helps readers interpret the dataset correctly and understand its potential limitations.

### Initial Data Inspection

Provide an overview of the first exploratory analysis. This should include:
-   summary statistics for numerical variables
-   data type classification
-   missing value analysis
-   any notable patterns or irregularities

You may also include visualizations (e.g., distributions, missing-value plots) to support the description. The objective is to demonstrate familiarity with the dataset before deeper analysis.

### Summary Table of Key Variables

Include a summary table containing the most relevant variables identified during the initial exploration. The table should have:
-   a clear caption explaining its purpose
-   well-defined metrics (e.g., missing values, central tendency, type consistency)
-   appropriate formatting for readability

This table serves as a compact diagnostic overview of the dataset.
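
To place such a table in the README, the DataFrame needs to be rendered as Markdown. pandas provides `DataFrame.to_markdown`, which depends on the optional `tabulate` package; the sketch below avoids that dependency with a small hypothetical helper, `to_markdown_table`. The two rows reuse values from the summary table built earlier.

```python
import pandas as pd


def to_markdown_table(frame, float_format="{:.2f}", index_label="Variable"):
    """Render a DataFrame as a Markdown table; missing values become empty cells."""
    def fmt(value):
        if pd.isna(value):
            return ""
        if isinstance(value, float):
            return float_format.format(value)
        return str(value)

    header = [index_label] + [str(col) for col in frame.columns]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * len(header)) + " |",
    ]
    # itertuples keeps each column's own type, so integer counts are not turned into floats.
    for row in frame.itertuples(index=True, name=None):
        cells = [str(row[0])] + [fmt(value) for value in row[1:]]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


key_vars = pd.DataFrame(
    {"Num of Values": [198, 2930], "% of Missing Values": [93.242321, 0.0]},
    index=["Alley", "1st Flr SF"],
)
table_md = to_markdown_table(key_vars)
print(table_md)
```

The resulting string can be pasted under a caption in `README.md`. Rounding to two decimals is a readability choice and can be changed through `float_format`.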

### Interpretation and Preliminary Insights

Accompany the summary table with a short analytical discussion. This section should:
-   highlight important patterns or anomalies
-   identify possible data quality issues
-   discuss variable relevance for future modeling
-   suggest potential preprocessing steps

This interpretation demonstrates analytical maturity and connects descriptive statistics with the broader data science workflow.

Overall, the README should communicate the project in a structured, professional, and reproducible way. Beyond documenting technical steps, it should reflect critical thinking about the dataset, methodological choices, and next analytical directions.

In [None]:
# https://medium.datadriveninvestor.com/how-to-write-a-good-readme-for-your-data-science-project-on-github-ebb023d4a50e

#https://www.youtube.com/watch?v=vgZuTpOj9fE

In [None]:
# Create a README file with the sections below.

# 1. Project Title

# 2. Project Description (background, motivation, and objectives)
# Provide a brief overview of the project, including the problem statement, the motivation behind it, and the objectives you aim to achieve. This section should give readers a clear understanding of what the project is about and why it is important.

# 3. Installation Instructions
# Provide step-by-step instructions on how to set up the project environment. This may include installing the necessary libraries, setting up virtual environments, and any other dependencies required to run the project. You can also include instructions for running the code, such as how to execute scripts or notebooks.

# 4. Data Source and Data Structure
# Describe the data sources used in the project, including where the data was obtained and any relevant details about the data collection process. Also provide an overview of the data structure, such as the format of the data (e.g., CSV, JSON), the number of records, and the features or variables included in the dataset.

# 5. Initial Data Inspection
# Provide an initial exploration of the data, including summary statistics, data types, and any missing values. This section can also include visualizations to help understand the distribution of the data and identify any potential issues or patterns.

# 6. Summary Table of Key Variables
# Add a summary table with the most relevant variables. Include a caption and explain what the table shows and how it can be used for further analysis.

# 7. Interpretation and Preliminary Insights
# Add an analysis based on this table, highlighting any interesting findings, potential data quality issues, or insights that can be drawn from the summary statistics. This analysis should provide a deeper understanding of the dataset and inform the next steps in the data science workflow.