<a href="https://colab.research.google.com/github/veselm73/BP_temp/blob/main/code/01RAD_Ex04_HW_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 01RAD – Homework Assignment 01 (After Exercise 04)

This homework guides you through data preparation, exploratory analysis, and simple linear regression using a housing market dataset.




## Conditions and grading

- Work on the assignment individually or in a team. If you discuss specific questions with classmates, mention it in the corresponding answer.





## Submission

Submit your work as a Jupyter notebook (`.ipynb`) that runs in Google Colab. Include your name at the top of the notebook. Deadline: **November 2nd, 2025**.




## Dataset

Use the CSV file hosted at:

```
https://raw.githubusercontent.com/francji1/01RAD/main/data/sarasota_houses_mod.csv
```

Load the data with `pandas.read_csv`. The table contains 1 057 houses from the Sarasota (FL) area. Columns:

| column | description |
| --- | --- |
| `price` | sale price in USD |
| `living_area` | interior living area in square feet |
| `bathrooms` | number of bathrooms (can be fractional) |
| `bedrooms` | number of bedrooms |
| `fireplaces` | count of fireplaces |
| `lot_size` | lot size in acres |
| `age` | age of the house (years) |
| `fireplace` | boolean indicator whether the house has at least one fireplace |

You will convert the imperial units during the tasks below.




## Data preview



In [None]:
# preview the dataset
import pandas as pd

url = "https://raw.githubusercontent.com/francji1/01RAD/main/data/sarasota_houses_mod.csv"
houses = pd.read_csv(url)
houses.head()


In [None]:
# TODO: import required libraries and load the dataset into a pandas DataFrame named `houses`



## Task 01 – Data audit

Check whether the dataset contains missing values. If it does, discuss whether you can safely remove the affected observations. Identify which variables are quantitative and which are qualitative (categorical). If a variable could be treated either way, state your choice and rationale. Compute basic descriptive statistics for each variable.



In [None]:

### Suggested exchange rates and unit conversions

# Convert `price` from USD to CZK with an exchange rate of **1 USD = 23 CZK** and express the price in thousands of CZK.

# Convert areas to square metres:
#  - `living_area` (square feet) → multiply by **0.092903**.
#  - `lot_size` (acres) → multiply by **4046.86**.




In [None]:
# TODO: Task 01


## Task 02 – Unit conversion and filtering

Create a cleaned subset of the data that satisfies all of the following:

1. Convert `price` to thousands of CZK using the exchange rate given above.
2. Convert `living_area` and `lot_size` to square metres.
3. Keep only houses that are older than 10 years but not older than 50 years.
4. Keep only houses with a price below 7 500 thousand CZK and a lot size between 500 m² and 5 000 m².
5. Convert `bathrooms` and `bedrooms` to categorical variables with three levels of your choice (justify the cut points in your report).

Use this filtered dataset for the remaining tasks unless explicitly noted otherwise, and focus on these variables: `price_czk`, `living_area_m2`, `lot_size_m2`, `bedrooms_cat`, `bathrooms_cat`, `age`, `fireplace`.
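As a quick sanity check on the conversion factors, the sketch below converts a single house and applies the filters from the list above. The house values are invented for illustration and are not taken from the dataset; treating the lot-size bounds as inclusive is also an assumption.

```python
# Conversion factors given in the assignment
USD_TO_CZK = 23
SQFT_TO_M2 = 0.092903
ACRE_TO_M2 = 4046.86

# Hypothetical house, values invented for illustration only
house = {"price": 250_000, "living_area": 1_800, "lot_size": 0.5, "age": 25}

price_czk = house["price"] * USD_TO_CZK / 1000       # thousands of CZK
living_area_m2 = house["living_area"] * SQFT_TO_M2   # square metres
lot_size_m2 = house["lot_size"] * ACRE_TO_M2         # square metres

# Filters from items 3 and 4: 10 < age <= 50, price < 7 500, 500 <= lot <= 5 000
keep = (
    10 < house["age"] <= 50
    and price_czk < 7500
    and 500 <= lot_size_m2 <= 5000
)

print(price_czk, round(living_area_m2, 2), round(lot_size_m2, 2), keep)
```

On the full table, the same arithmetic applies column-wise to the pandas DataFrame.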



In [None]:
# TODO: Task 02


## Task 03 – Price comparison (fireplace vs no fireplace)

Compare the mean price of houses with a fireplace to those without one. Test the hypothesis that houses with a fireplace have a higher mean price at the 1% significance level. Clearly state the hypotheses, the test statistic you use, its value, and your conclusion.
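One possible choice of test, assuming the two groups may have unequal variances, is Welch's two-sample $t$-test; this is a suggestion, not the required method. The notation is introduced here for illustration: index 1 denotes houses with a fireplace and index 0 houses without, with sample means $\bar{x}_1, \bar{x}_0$, sample variances $s_1^2, s_0^2$, and group sizes $n_1, n_0$.

$$
H_0: \mu_1 = \mu_0 \quad \text{vs.} \quad H_1: \mu_1 > \mu_0, \qquad
T = \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_0^2}{n_0}}}
$$

Under $H_0$, $T$ is approximately $t$-distributed with degrees of freedom given by the Welch-Satterthwaite approximation, and $H_0$ is rejected at the 1% level when the one-sided p-value is below 0.01.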




# Data visualisation

## Task 04 – Exploratory plots

- Draw scatter plots for each pair of numerical variables, using colour to indicate the presence of a fireplace (`fireplace`).
- Plot boxplots (or violin plots) of `price_czk` against the categorical versions of `bedrooms`, `bathrooms`, and the boolean `fireplace` indicator.
- Display a histogram of `price_czk` and overlay a kernel density estimate.




## Task 05 – Combined categories

For the combinations of `bedrooms_cat` and `bathrooms_cat`, visualise the distribution of `price_czk`. Ensure that the plot clearly shows which combinations exist in the filtered dataset and whether price levels differ across them.




## Task 06 – Focus on two-bedroom houses

Restrict the data to houses with exactly two bedrooms (before categorisation). Plot `price_czk` against `living_area_m2`, colour the points by `fireplace`, and scale the point size according to the number of bathrooms (treat `bathrooms` as numeric for this plot).




**From this point on, continue working with the subset of two-bedroom houses unless a task specifies otherwise.**



In [None]:
# TODO: Task 06


# Simple linear regression




## Task 07 – Simple regression (with and without intercept)

Fit two linear models explaining `price_czk` by `living_area_m2`: one with an intercept and one without. Report $R^2$ and the $F$-statistic for both models. Choose the model you prefer and justify your choice. Using the selected model, answer whether price depends on living area and by how much the expected price changes if the living area increases by 20 m².
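When comparing the two models, keep in mind that $R^2$ and $F$ are usually defined differently once the intercept is dropped: the total sum of squares is no longer centred at the mean (statsmodels, for example, reports the uncentred version for a model without a constant). The sketch below works through the formulas in plain Python on toy data invented for illustration; it is not the housing data and not a prescribed solution.

```python
# Toy data, invented for illustration (x: living area in m², y: price in thousand CZK)
x = [60.0, 80.0, 100.0, 120.0, 140.0]
y = [2100.0, 2500.0, 3100.0, 3400.0, 4000.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Model with intercept: y = b0 + b1 * x
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
rss_with = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
tss_centered = sum((yi - y_bar) ** 2 for yi in y)
r2_with = 1 - rss_with / tss_centered
f_with = (tss_centered - rss_with) / (rss_with / (n - 2))

# Model through the origin: y = c1 * x
c1 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
rss_without = sum((yi - c1 * xi) ** 2 for xi, yi in zip(x, y))
tss_uncentered = sum(yi ** 2 for yi in y)  # no mean subtracted
r2_without = 1 - rss_without / tss_uncentered
f_without = (tss_uncentered - rss_without) / (rss_without / (n - 1))

# Expected change in price for +20 m² under the intercept model
delta_20 = 20 * b1

print(f"with intercept:    R2={r2_with:.4f}, F={f_with:.1f}, RSS={rss_with:.0f}")
print(f"without intercept: R2={r2_without:.4f}, F={f_without:.1f}, RSS={rss_without:.0f}")
```

On this toy data the model without an intercept has the larger $R^2$ and $F$ even though its residual sum of squares is larger, which is why these two statistics alone are a poor basis for choosing between the models.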



In [None]:
# TODO: Task 07


## Task 08 – Separate models by fireplace

Fit the same simple regression separately for houses with a fireplace and without a fireplace. Which group exhibits a stronger linear relationship between price and living area? By how much does the slope differ between the two models? Compute 95% confidence intervals for the slopes and discuss whether they overlap. Estimate the percentage difference in expected price for a 160 m² house with a fireplace versus one without a fireplace.
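For reference, these are the standard formulas behind the requested quantities, assuming each group is fitted separately with a model that includes an intercept. The subscripts $F$ (fireplace) and $N$ (no fireplace) are notation introduced here. Within one group with $n$ observations, residual sum of squares $\mathrm{RSS}$, and $S_{xx} = \sum_i (x_i - \bar{x})^2$, the 95% confidence interval for the slope is

$$
\hat{\beta}_1 \pm t_{n-2}(0.975)\,\mathrm{SE}(\hat{\beta}_1), \qquad
\mathrm{SE}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{S_{xx}}}, \qquad
\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n-2}.
$$

Taking the house without a fireplace as the baseline (one possible convention), the percentage difference in expected price at 160 m² is

$$
100 \cdot \frac{\hat{y}_F(160) - \hat{y}_N(160)}{\hat{y}_N(160)}, \qquad
\hat{y}_g(160) = \hat{\beta}_{0,g} + 160\,\hat{\beta}_{1,g}, \quad g \in \{F, N\}.
$$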



In [None]:
# TODO: Task 08


## Task 09 – Visual comparison of models

Create a scatter plot of `living_area_m2` versus `price_czk` showing the two fitted regression lines (with and without a fireplace). Add 90% confidence bands for the mean predictions. Use the plot to comment on whether expected prices differ for houses with living area below 120 m². Explain whether this comparison is appropriate.
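For a simple regression with intercept, the pointwise 90% confidence band for the mean response at living area $x_0$ can be written as follows, where $n$ is the group size, $\bar{x}$ the mean living area in the group, $S_{xx} = \sum_i (x_i - \bar{x})^2$, and $\hat{\sigma}^2 = \mathrm{RSS}/(n-2)$ (notation introduced here for illustration):

$$
\hat{\beta}_0 + \hat{\beta}_1 x_0 \;\pm\; t_{n-2}(0.95)\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
$$

The 0.95 quantile appears because a two-sided 90% band leaves 5% in each tail. The band widens as $x_0$ moves away from $\bar{x}$, which matters when judging the comparison for houses below 120 m². These are pointwise bands, not simultaneous ones.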



In [None]:
# TODO: Task 09


## Task 10 – Residual diagnostics

Plot histograms of the residuals from the models in Task 09. Overlay the density of a normal distribution with mean zero and variance equal to the estimated $\hat{\sigma}^2$ of each model. Comment on the findings and suggest further model improvements. Plot corresponding QQ plots and  discuss them.
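The quantities needed for both plots can be computed as in the sketch below, which uses only the Python standard library. The residuals are toy values invented for illustration; in the notebook they would come from the fitted models. The histogram has to be drawn on a density scale (for example `density=True` in matplotlib's `hist` or `stat='density'` in seaborn's `histplot`) for the overlay to be comparable.

```python
from math import sqrt
from statistics import NormalDist

# Toy residuals, invented for illustration only
residuals = [20.0, -50.0, 80.0, -90.0, 40.0]
n = len(residuals)
n_params = 2  # intercept and slope of a simple regression

# Unbiased estimate of the error variance: RSS / (n - p)
rss = sum(r ** 2 for r in residuals)
sigma_hat = sqrt(rss / (n - n_params))

# Normal density N(0, sigma_hat^2) evaluated on a grid of 101 points over ±3 sigma
reference = NormalDist(mu=0.0, sigma=sigma_hat)
grid = [-3 * sigma_hat + i * (6 * sigma_hat / 100) for i in range(101)]
density = [reference.pdf(g) for g in grid]

# QQ plot coordinates: sorted residuals against theoretical normal quantiles
probs = [(i + 0.5) / n for i in range(n)]
theoretical = [reference.inv_cdf(p) for p in probs]
sample = sorted(residuals)
qq_points = list(zip(theoretical, sample))

print(f"sigma_hat = {sigma_hat:.2f}")
```

Points of `qq_points` lying close to the line $y = x$ indicate that the residuals are consistent with $N(0, \hat{\sigma}^2)$. Systematic curvature in the tails suggests skewness or heavy tails.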



In [None]:
# TODO: Task 10

# Task
Analyze the housing dataset loaded above (`sarasota_houses_mod.csv`) to understand the factors influencing house prices. This involves auditing, cleaning, and transforming the data, exploring it through visualizations, and fitting linear regression models that explain house prices by relevant features, including a comparison of models for houses with and without fireplaces. Finally, analyze the residuals of the fitted models.

## Task 01 - data audit

### Subtask:
Check for missing values, identify variable types, and compute descriptive statistics.


**Reasoning**:
Check for missing values, print data types, and compute descriptive statistics as requested by the subtask instructions.



In [None]:
print("Missing values per column:")
print(houses.isnull().sum())

print("\nData types:")
houses.info()

print("\nDescriptive statistics:")
# include='all' also summarises non-numeric columns such as the boolean `fireplace`
display(houses.describe(include='all'))

## Task 02 - unit conversion and filtering

### Subtask:
Convert units and filter the dataset based on the specified criteria. Also, categorize `bathrooms` and `bedrooms`.


**Reasoning**:
Apply the unit conversions and filtering steps to the `houses` DataFrame to create the cleaned subset as specified in the instructions. Then, categorize the 'bedrooms' and 'bathrooms' columns and select the final set of columns.



In [None]:
# 1. Convert price to thousands of CZK
houses['price_czk'] = houses['price'] * 23 / 1000

# 2. Convert living_area to square metres
houses['living_area_m2'] = houses['living_area'] * 0.092903

# 3. Convert lot_size to square metres
houses['lot_size_m2'] = houses['lot_size'] * 4046.86

# 4. Filter by age (older than 10 years but not older than 50, i.e. 10 < age <= 50)
filtered_houses = houses[(houses['age'] > 10) & (houses['age'] <= 50)].copy()

# 5. Filter by price and lot size
filtered_houses = filtered_houses[(filtered_houses['price_czk'] < 7500) &
                                  (filtered_houses['lot_size_m2'] >= 500) &
                                  (filtered_houses['lot_size_m2'] <= 5000)].copy()

# 6. Categorize bedrooms
# Justification: Common sense groupings for bedrooms. 1-2 bedrooms are smaller, 3-4 are typical family homes, 5+ are larger homes.
bins_bedrooms = [0, 2.5, 4.5, filtered_houses['bedrooms'].max() + 1]
labels_bedrooms = ['1-2', '3-4', '5+']
filtered_houses['bedrooms_cat'] = pd.cut(filtered_houses['bedrooms'], bins=bins_bedrooms, labels=labels_bedrooms, right=False, include_lowest=True)


# 7. Categorize bathrooms
# Justification: Groupings based on typical bathroom counts. 1 bathroom is standard, 1.5-2 covers full and half baths, 2.5+ covers multiple full baths.
bins_bathrooms = [0, 1.25, 2.25, filtered_houses['bathrooms'].max() + 1]
labels_bathrooms = ['1', '1.5-2', '2.5+']
filtered_houses['bathrooms_cat'] = pd.cut(filtered_houses['bathrooms'], bins=bins_bathrooms, labels=labels_bathrooms, right=False, include_lowest=True)


# 8. Select specified columns (copy so that later column assignments do not act on a view)
cleaned_houses = filtered_houses[['price_czk', 'living_area_m2', 'lot_size_m2', 'bedrooms_cat', 'bathrooms_cat', 'age', 'fireplace']].copy()

display(cleaned_houses.head())
cleaned_houses.info()  # info() prints its summary directly and returns None
display(cleaned_houses.describe(include='all'))

## Task 03 - price comparison

### Subtask:
Compare the mean price of houses with a fireplace to those without one. Test the hypothesis that houses with a fireplace have a higher mean price at the 1% significance level. Clearly state the hypotheses, the test statistic you use, its value, and your conclusion.


In [None]:
# TODO: Task 03

from scipy import stats
import statsmodels.formula.api as smf
import statsmodels.api as sm


# Separate data based on fireplace presence
fireplace_yes = cleaned_houses[cleaned_houses['fireplace'] == True]['price_czk']
fireplace_no = cleaned_houses[cleaned_houses['fireplace'] == False]['price_czk']

# Calculate mean prices
mean_price_yes = fireplace_yes.mean()
mean_price_no = fireplace_no.mean()

print(f"Mean price with fireplace: {mean_price_yes:.2f} thousand CZK")
print(f"Mean price without fireplace: {mean_price_no:.2f} thousand CZK")

# Perform an independent two-sample t-test using scipy.stats
# Hypotheses:
# H0: The mean price of houses with a fireplace is equal to the mean price of houses without a fireplace. (μ_yes = μ_no)
# H1: The mean price of houses with a fireplace is higher than the mean price of houses without a fireplace. (μ_yes > μ_no)

# alternative='greater' makes the test one-sided (H1: μ_yes > μ_no), so the returned p-value needs no adjustment.
# equal_var=False selects Welch's t-test, which does not assume equal variances in the two groups.
ttest_result = stats.ttest_ind(fireplace_yes, fireplace_no, equal_var=False, alternative='greater')

t_statistic = ttest_result.statistic
p_value = ttest_result.pvalue

print(f"\nT-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Significance level
alpha = 0.01

# Conclusion
if p_value < alpha:
    print(f"\nConclusion: Since the p-value ({p_value:.4f}) is less than the significance level ({alpha}), we reject the null hypothesis.")
    print("There is sufficient evidence to conclude that houses with a fireplace have a significantly higher mean price than houses without a fireplace.")
else:
    print(f"\nConclusion: Since the p-value ({p_value:.4f}) is not less than the significance level ({alpha}), we fail to reject the null hypothesis.")
    print("There is not enough evidence to conclude that houses with a fireplace have a significantly higher mean price than houses without a fireplace.")

**Reasoning**:
Separate the DataFrame by fireplace presence, calculate the mean price for each group, and perform an independent samples t-test to compare the means. State the hypotheses, test statistic, and conclusion based on the p-value and significance level.



In [None]:
# TODO: Task 04

import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set_theme(style="whitegrid")

# 1. Scatter plots for each pair of numerical variables, using colour to indicate fireplace
numerical_vars = ['price_czk', 'living_area_m2', 'lot_size_m2', 'age']
sns.pairplot(cleaned_houses, vars=numerical_vars, hue='fireplace', palette='viridis')
plt.suptitle('Pairwise Scatter Plots of Numerical Variables by Fireplace', y=1.02)
plt.show()


# 2. Boxplots of price_czk against categorical variables
categorical_vars = ['bedrooms_cat', 'bathrooms_cat', 'fireplace']

for var in categorical_vars:
    plt.figure(figsize=(8, 6))
    # hue=var with legend=False colours each box without drawing a redundant legend
    sns.boxplot(x=var, y='price_czk', data=cleaned_houses, palette='viridis', hue=var, legend=False)
    plt.title(f'Boxplot of Price (thousand CZK) by {var}')
    plt.ylabel('Price (thousand CZK)')
    plt.xlabel(var.replace('_cat', '').capitalize()) # Clean up labels
    plt.show()


# 3. Histogram of price_czk and overlay a kernel density estimate
plt.figure(figsize=(8, 6))
sns.histplot(cleaned_houses['price_czk'], kde=True, color='skyblue', bins=30)
plt.title('Histogram and KDE of Price (thousand CZK)')
plt.xlabel('Price (thousand CZK)')
plt.ylabel('Frequency')
plt.show()

In [None]:
# TODO: Task 05

import matplotlib.pyplot as plt
import seaborn as sns

# Create a new column for combined categories (assign returns a new DataFrame, which avoids SettingWithCopyWarning)
cleaned_houses = cleaned_houses.assign(
    bedroom_bathroom_combo=cleaned_houses['bedrooms_cat'].astype(str) + ' Bedrooms, '
    + cleaned_houses['bathrooms_cat'].astype(str) + ' Bathrooms'
)

# Order the combinations by median price for better visualization
order = cleaned_houses.groupby('bedroom_bathroom_combo')['price_czk'].median().sort_values().index

# Visualize the distribution of price_czk for each combination using boxplots
plt.figure(figsize=(12, 8))
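# Passing hue together with palette avoids the seaborn deprecation warning for palette without hue
sns.boxplot(x='price_czk', y='bedroom_bathroom_combo', data=cleaned_houses, order=order, hue='bedroom_bathroom_combo', palette='viridis', legend=False)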
plt.title('Distribution of Price (thousand CZK) by Bedroom and Bathroom Combination')
plt.xlabel('Price (thousand CZK)')
plt.ylabel('Bedroom and Bathroom Combination')
plt.tight_layout()
plt.show()

# Display the count of houses in each combination to show which exist
print("Count of houses in each bedroom and bathroom combination:")
display(cleaned_houses['bedroom_bathroom_combo'].value_counts().sort_index())