
# Boston Housing Data Analysis
**Name:** Vicente Nigel Dayag  
**Course:** A75 GED106  
**Objective:** This notebook presents an exploratory analysis of the Boston Housing dataset using Matplotlib and Seaborn. Each visualization is explained and interpreted to satisfy the grading rubric criteria.


In [None]:

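import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# load_boston was removed in scikit-learn 1.2, so read the original
# StatLib file directly; each record spans two physical lines.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
df = pd.DataFrame(data, columns=feature_names)
df['MEDV'] = raw_df.values[1::2, 2]  # target: median home value in $1000s
df.head()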




## Boxplot for Median Value of Owner-Occupied Homes (MEDV)
This boxplot displays the distribution of **MEDV**, the median value of owner-occupied homes in \$1000s.  
It helps identify the median, spread, and possible outliers in housing prices.


In [None]:

plt.figure(figsize=(6,4))
sns.boxplot(x=df['MEDV'], color='skyblue')
plt.title('Boxplot of Median Value of Owner-Occupied Homes (MEDV)')
plt.xlabel('MEDV ($1000s)')
plt.show()
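
The quantities a boxplot draws can also be computed directly. The sketch below is a minimal illustration, assuming the usual 1.5 × IQR whisker rule; `boxplot_summary` is a hypothetical helper name, and the sample values are made up for demonstration rather than taken from the dataset.

```python
import numpy as np

def boxplot_summary(values, whis=1.5):
    """Return the quartiles, the IQR fences, and the points outside the fences."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    outliers = values[(values < lower) | (values > upper)]
    return {"q1": q1, "median": median, "q3": q3,
            "lower_fence": lower, "upper_fence": upper,
            "outliers": outliers}

# Made-up sample (not Boston data): 50.0 sits far above the other values.
sample = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 50.0]
summary = boxplot_summary(sample)
print(summary)
```

In the notebook, `boxplot_summary(df['MEDV'])` would give the numeric counterparts of the box, the whisker limits, and the points drawn as outliers.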



## Barplot for the Charles River Variable (CHAS)
The **CHAS** variable indicates whether a tract bounds the Charles River (1 = yes, 0 = no).  
This barplot shows how many tracts bound the river versus how many do not.


In [None]:

plt.figure(figsize=(6,4))
sns.countplot(x='CHAS', data=df, palette='pastel')
plt.title('Barplot of Charles River Variable (CHAS)')
plt.xlabel('CHAS (1 = bounds river, 0 = does not)')
plt.ylabel('Count')
plt.show()



## Boxplot of MEDV vs AGE
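This boxplot compares **MEDV** across three groups of **AGE**, the proportion of owner-occupied units built before 1940 (0 to 35 = Young, 35 to 70 = Middle-aged, 70 to 100 = Old).  
It helps show how tracts with older housing stock differ in value from tracts with newer stock.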


In [None]:

plt.figure(figsize=(6,4))
age_group = pd.cut(df['AGE'], bins=[0, 35, 70, 100], labels=['Young', 'Middle-aged', 'Old'])
sns.boxplot(x=age_group, y='MEDV', data=df, palette='cool')
plt.title('Boxplot of Median Value (MEDV) vs Age of Homes')
plt.xlabel('Age Group')
plt.ylabel('Median Value ($1000s)')
plt.show()
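
To put numbers beside the boxes, the same binning can feed a grouped summary. This is a minimal sketch that reuses the bin edges from the plot; `medv_by_age_group` is a hypothetical helper name, and the toy frame holds made-up values, not rows from the dataset.

```python
import pandas as pd

def medv_by_age_group(frame):
    """Summarise MEDV within the same AGE bins used for the boxplot."""
    groups = pd.cut(frame["AGE"], bins=[0, 35, 70, 100],
                    labels=["Young", "Middle-aged", "Old"],
                    include_lowest=True)  # keep AGE == 0 in the first bin
    return frame.groupby(groups, observed=False)["MEDV"].agg(["count", "median", "mean"])

# Made-up values for illustration only.
toy = pd.DataFrame({"AGE": [10.0, 30.0, 50.0, 65.0, 90.0, 100.0],
                    "MEDV": [30.0, 28.0, 24.0, 22.0, 15.0, 17.0]})
print(medv_by_age_group(toy))
```

Calling `medv_by_age_group(df)` in the notebook would show whether the group medians fall as AGE rises, which is the pattern the boxplot is meant to reveal.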



## Scatter Plot: Nitric Oxide Concentration (NOX) vs Non-Retail Business Acres (INDUS)
This scatter plot examines the relationship between **NOX** (nitric oxide concentration, in parts per 10 million) and **INDUS** (proportion of non-retail business acres per town).  
We expect a positive relationship, since more industrial land use often means more air pollution.


In [None]:

plt.figure(figsize=(6,4))
sns.scatterplot(x='INDUS', y='NOX', data=df, color='orange')
plt.title('NOX vs INDUS')
plt.xlabel('Proportion of Non-Retail Business Acres per Town (INDUS)')
plt.ylabel('Nitric Oxide Concentration (NOX)')
plt.show()



## Histogram for Pupil-Teacher Ratio (PTRATIO)
This histogram shows how the **pupil-teacher ratio** varies across towns.  
Because a higher ratio means more pupils per teacher, it gives a rough picture of how crowded classrooms are in different Boston-area towns.


In [None]:

plt.figure(figsize=(6,4))
plt.hist(df['PTRATIO'], bins=10, color='lightgreen', edgecolor='black')
plt.title('Histogram of Pupil-Teacher Ratio (PTRATIO)')
plt.xlabel('Pupil-Teacher Ratio')
plt.ylabel('Frequency')
plt.show()



## Hypothesis Testing
Below are the null and alternative hypotheses for four statistical tests on the dataset.

1. **Test 1:** Relationship between NOX and INDUS  
   - H₀: There is no relationship between NOX and INDUS.  
   - H₁: There is a significant relationship between NOX and INDUS.

2. **Test 2:** Comparison of MEDV between areas near and not near Charles River  
   - H₀: There is no difference in median home values between CHAS = 1 and CHAS = 0.  
   - H₁: There is a difference in median home values between CHAS = 1 and CHAS = 0.

3. **Test 3:** Relationship between AGE and MEDV  
   - H₀: There is no relationship between home age and its median value.  
   - H₁: There is a significant relationship between home age and median value.

4. **Test 4:** Relationship between PTRATIO and MEDV  
   - H₀: There is no relationship between pupil-teacher ratio and median home value.  
   - H₁: There is a significant relationship between pupil-teacher ratio and median home value.
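
One way these hypotheses could be checked is with standard SciPy tests. The sketch below is an assumption about method, not something the notebook specifies: it uses a Pearson correlation for Tests 1, 3, and 4, Welch's two-sample t-test for Test 2 (which compares the mean of MEDV between the two CHAS groups), and a 0.05 significance level. `run_hypothesis_tests` and `ALPHA` are hypothetical names.

```python
from scipy import stats

ALPHA = 0.05  # assumed significance level; the notebook does not state one

def run_hypothesis_tests(frame, alpha=ALPHA):
    """Return {test name: (statistic, p-value, reject H0?)} for the four tests."""
    results = {}

    # Tests 1, 3, 4: Pearson correlation between two numeric columns.
    pairs = {
        "Test 1: NOX vs INDUS": ("INDUS", "NOX"),
        "Test 3: AGE vs MEDV": ("AGE", "MEDV"),
        "Test 4: PTRATIO vs MEDV": ("PTRATIO", "MEDV"),
    }
    for name, (x, y) in pairs.items():
        r, p = stats.pearsonr(frame[x], frame[y])
        results[name] = (float(r), float(p), bool(p < alpha))

    # Test 2: Welch's t-test on MEDV for CHAS = 1 versus CHAS = 0.
    near = frame.loc[frame["CHAS"] == 1, "MEDV"]
    far = frame.loc[frame["CHAS"] == 0, "MEDV"]
    t, p = stats.ttest_ind(near, far, equal_var=False)
    results["Test 2: MEDV by CHAS"] = (float(t), float(p), bool(p < alpha))
    return results
```

In the notebook, `run_hypothesis_tests(df)` would return one entry per test; H₀ is rejected wherever the last element is `True`. Other choices, such as a one-way ANOVA across the three AGE groups for Test 3, would be equally defensible.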



## Coefficient and Explanation: Weighted Distance to Employment Centres (DIS)
We analyze the regression coefficient for **DIS** (weighted distance to five Boston employment centres) to understand its impact on **MEDV**.


In [None]:

import statsmodels.api as sm

X = df[['DIS']]
y = df['MEDV']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
model.summary()



**Interpretation:**  
The coefficient for **DIS** is the expected change in median home value, in \$1000s, for each additional unit of weighted distance to the five employment centres.  
If the coefficient is **positive**, tracts farther from the employment centres tend to have higher values (possibly due to less pollution and congestion).  
If it is **negative**, tracts closer to the employment centres tend to be more valuable, plausibly because of accessibility.



## Conclusions
1. There is a **positive correlation** between the proportion of non-retail business acres (INDUS) and nitric oxide concentration (NOX).  
2. Tracts bounding the **Charles River** tend to have higher median home values.  
3. Tracts with **older housing stock** generally have lower median values.  
4. A **higher pupil-teacher ratio** (less individualized attention) tends to go with lower home values.

Overall, environmental, educational, and locational factors are all associated with housing prices in the Boston area.
