<a href="https://colab.research.google.com/github/waelrash1/predictive_analytics_DT302/blob/main/Regression_Lab_Self_guided.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Self-Guided Lab on Simple and Multiple Regression using California Housing Dataset

Author: Wael Rashwan

<img src="https://th.bing.com/th/id/OIG.w_wVRNXtZ8W0CmBmw1ZR?pid=ImgGn" width="200" height="150">

## Objectives:
- Understand and apply simple linear regression.
- Understand and apply multiple linear regression.
- Learn how to interpret the results of regression models.
- Gain experience with Python’s `statsmodels` and `scikit-learn` libraries.
- Learn how to leverage the geospatial features (latitude and longitude) in the California Housing dataset to improve the predictive performance of your models.

## Setup

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
%matplotlib inline
```


## The California Housing Dataset

<img src="https://th.bing.com/th/id/OIG.IfEUcPs2z3d2iFfiYQvA?pid=ImgGn" width=300>

**Data Set Characteristics:**

 **Number of Instances:** 20640

 **Number of Attributes:** 8 numeric, predictive attributes and the target

 **Attribute Information:**
  * MedInc        median income in block group
  * HouseAge      median house age in block group
  * AveRooms      average number of rooms per household
  * AveBedrms     average number of bedrooms per household
  * Population    block group population
  * AveOccup      average number of household members
  * Latitude      block group latitude
  * Longitude     block group longitude


**Missing Attribute Values:** None

This dataset was obtained from the StatLib repository.
[link](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)

## Label/Target variable
The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).


A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

The dataset can be downloaded and loaded with the following scikit-learn function:
```python
sklearn.datasets.fetch_california_housing
```
**Reference**:

 Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
  Statistics and Probability Letters, 33 (1997) 291-297




## Part 1: Load and Explore the Data
```python
# Load the California Housing dataset
california_housing = datasets.fetch_california_housing()
df = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
target = pd.DataFrame(california_housing.target, columns=["MedHouseVal"])

# Display the first few rows of the dataset
df.head()
```


### Visualise dataset

```python
import matplotlib.pyplot as plt
df.hist(figsize=(12, 10), bins=30, edgecolor="black")
target.hist(figsize=(4, 2.5), bins=30, edgecolor="black")
plt.subplots_adjust(hspace=0.7, wspace=0.4)
```


>  ### Task 1: Familiarize yourself with the dataset.
* `df.describe()`
* `target.describe()`


### Question: Which features might be good predictors for the median house value? Write your comments and observation below.
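
One rough way to screen candidate predictors is the Pearson correlation between each feature and the target. The sketch below uses small synthetic arrays instead of the real dataset; the names `income`, `noise_feature`, and `value` are made up for illustration. Correlation only captures linear association, so treat it as a first look rather than a verdict.

```python
import numpy as np

# Synthetic stand-ins (assumed data, not the California Housing values)
rng = np.random.default_rng(0)
n = 500
income = rng.uniform(1, 10, size=n)        # plays the role of a useful feature
noise_feature = rng.normal(size=n)         # plays the role of an irrelevant feature
value = 0.4 * income + rng.normal(0, 0.5, size=n)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r_income = np.corrcoef(income, value)[0, 1]
r_noise = np.corrcoef(noise_feature, value)[0, 1]

print(f"income vs value: r = {r_income:.2f}")
print(f"noise  vs value: r = {r_noise:.2f}")
```

With the lab's data, `df.corrwith(target["MedHouseVal"])` should give the same kind of summary for all eight features at once.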






---


### **Add Your Answer here:**

*

*



---


## Part 2: Simple Linear Regression with statsmodels [Explain](https://mlu-explain.github.io/linear-regression/)
### 2.1 Simple Linear Regression Without a Constant




```python
# Use MedInc as single predictor
X = df["MedInc"]
y = target["MedHouseVal"]
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
```
> ### Task 2: Interpret the output. What does the coefficient for 'MedInc' tell you?
### Question: What is the meaning of the R-squared value in this context?
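
When a model has no constant, the fitted line is forced through the origin, and statsmodels labels the R-squared in the summary as uncentered. The sketch below, on synthetic data with assumed coefficients, shows the closed-form through-origin slope and how the uncentered R-squared differs from the usual centered one.

```python
import numpy as np

# Synthetic data with a non-zero true intercept (assumed values for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = 0.5 + 0.4 * x + rng.normal(0, 0.3, size=200)

# Least-squares slope for y = b * x (no intercept): b = sum(x*y) / sum(x^2)
beta_origin = np.sum(x * y) / np.sum(x * x)
resid = y - beta_origin * x

# Uncentered R^2 compares residuals with sum(y^2);
# centered R^2 compares them with the variation of y around its mean
r2_uncentered = 1 - np.sum(resid**2) / np.sum(y**2)
r2_centered = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

print("slope through origin:", beta_origin)
print("uncentered R^2:", r2_uncentered)
print("centered R^2:  ", r2_centered)
```

Because the uncentered total sum of squares is larger whenever the mean of `y` is not zero, the uncentered value is higher for the same residuals, so it should not be compared directly with the R-squared of a model that includes a constant.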


---


### **Add Your Answer here:**

*

*



---

### 2.2 Simple Linear Regression With a Constant
``` python

X = sm.add_constant(df["MedInc"])  # Adding a constant
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
```
> ### Task 3: Compare this output to the previous model. How has the inclusion of a constant term affected the results?

### Question: How does the coefficient for 'MedInc' change, and why?
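
To see why the slope changes once a constant is added, it can help to compute both fits by hand. This is a minimal sketch on synthetic data (the true intercept 0.5 and slope 0.4 are assumptions chosen for illustration), using the textbook formulas for simple linear regression.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)
y = 0.5 + 0.4 * x + rng.normal(0, 0.3, size=200)

# With a constant: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# Without a constant the line must pass through (0, 0)
slope_origin = np.sum(x * y) / np.sum(x * x)

print("with constant:    slope =", slope, "intercept =", intercept)
print("without constant: slope =", slope_origin)
```

In this synthetic setup the through-origin slope comes out larger, because the slope has to absorb the positive intercept that the model is not allowed to estimate.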




---


### **Add Your Answer here:**

*

*



---



## Part 3: Multiple Linear Regression with statsmodels
### 3.1 Building a Multiple Linear Regression Model
```python

X = df[["MedInc", "AveRooms", "Population", "AveOccup"]]
X = sm.add_constant(X)  # Adding a constant
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
```

> ### Task: Interpret the output. How does each of the features relate to the target variable 'MedHouseVal'?
### Question: Which variables appear to be the most significant in predicting house value?
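
The `t` and `P>|t|` columns of the summary are what "significance" refers to here. The following sketch reproduces the standard OLS calculation of coefficients, standard errors, and t-statistics with NumPy on synthetic data; the variables `x1` (related to `y`) and `x2` (unrelated) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)                      # truly related to y
x2 = rng.normal(size=n)                      # unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

# Design matrix with a leading column of ones (the constant)
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance and coefficient covariance: sigma^2 * (X'X)^-1
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistic = coefficient / standard error
t_stats = beta / se
for name, b, t in zip(["const", "x1", "x2"], beta, t_stats):
    print(f"{name:>5}: coef = {b: .3f}, t = {t: .2f}")
```

A large absolute t-statistic (and a correspondingly small p-value) suggests the coefficient is distinguishable from zero; it does not by itself say the effect is large in practical terms.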


## Part 4: Linear Regression with scikit-learn
### 4.1 Fitting a Linear Regression Model
```python

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=42)

lm = LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

```
### Print out R-squared and Mean Squared Error
```python

print("R-squared:", r2_score(y_test, predictions))
print("Mean Squared Error:", mean_squared_error(y_test, predictions))
```
> ### Task: Run the model and note down the R-squared and Mean Squared Error.
### Question: How does this model's performance compare to the previous models?
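
For reference, both metrics are simple to compute by hand. The helper names `mse_by_hand` and `r2_by_hand` below are hypothetical, and the four data points are made up; the formulas are the standard definitions that `mean_squared_error` and `r2_score` implement.

```python
import numpy as np

def mse_by_hand(y_true, y_pred):
    """Mean of the squared prediction errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def r2_by_hand(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mse_by_hand(y_true, y_pred)
r2 = r2_by_hand(y_true, y_pred)
print("MSE:", mse)   # 0.375
print("R^2:", r2)
```

When comparing with the earlier models, keep in mind that the scikit-learn figures in this part are computed on a held-out test set, whereas the statsmodels summaries above report R-squared on the same data used for fitting.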



---


### **Add Your Answer here:**

*

*



---

### 4.2 Analyzing Coefficients and Intercept
```python

print("Coefficients:", lm.coef_)
print("Intercept:", lm.intercept_)
```
> ### Task: Examine the coefficients. Which features have the largest impact on the target variable?

 ### Question: How does the intercept in this model compare to the intercept in the statsmodels models?
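
Raw coefficients depend on the units of each feature, so the largest coefficient is not necessarily the largest impact. One common remedy is to standardize the features before fitting. The sketch below illustrates this on synthetic data; `small_scale`, `large_scale`, and the helper `fit_ols` are hypothetical names, not part of the lab.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
small_scale = rng.normal(0, 1, size=n)       # e.g. a ratio-like feature
large_scale = rng.normal(0, 1000, size=n)    # e.g. a population-like feature
y = 0.5 * small_scale + 0.002 * large_scale + rng.normal(0, 0.1, size=n)

def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    X = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return beta[0], beta[1:]

raw = np.column_stack([small_scale, large_scale])
_, raw_coefs = fit_ols(raw, y)

# Standardize: subtract the mean and divide by the standard deviation of each column
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
_, std_coefs = fit_ols(scaled, y)

print("raw coefficients:         ", raw_coefs)
print("standardized coefficients:", std_coefs)
```

On the raw scale the first feature appears to dominate, while after standardization the second one does, since each standardized coefficient equals the raw coefficient multiplied by that feature's standard deviation.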


---


### **Add Your Answer here:**

*

*



---

## Part 5: Geospatial Data Analysis with California Housing Dataset
### Objective:
> Learn how to leverage the geospatial features (latitude and longitude) in the California Housing dataset to improve the predictive performance of your models.

### 5.1: Visualizing Geospatial Data
Before diving into feature engineering, it's important to understand the distribution of your data.

> ### Task 1: Plot the locations of the housing data on a map. You can use libraries like matplotlib, seaborn, or folium for this.

``` python

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 7))
plt.scatter(df['Longitude'], df['Latitude'], alpha=0.1)
plt.title('Geospatial Distribution of California Housing Data')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
```

> ### Question: What patterns do you notice? Are there particular areas with higher densities of data points?




---


### **Add Your Answer here:**

*

*



---

### 5.2: Creating Distance-Based Features

> ### Task 2: Create a new feature representing the distance of each housing block to a specific point of interest. For this example, let's use Downtown Los Angeles as a point of interest (coordinates: 34.0522° N, 118.2437° W).

```python

# Requires the geopy package (install it with `pip install geopy` if the import fails)
from geopy.distance import geodesic

# Coordinates for Downtown Los Angeles
la_coords = (34.0522, -118.2437)

df['distance_to_LA'] = df.apply(lambda row: geodesic((row['Latitude'], row['Longitude']), la_coords).km, axis=1)
```

> ### Question: How does adding this new feature affect the distribution of your data? Try plotting a histogram of this new feature.
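
Calling `geodesic` row by row can be slow on roughly 20,000 rows. A vectorized alternative is the haversine formula, sketched below with NumPy. Note the assumptions: it treats the Earth as a sphere with a mean radius of 6371 km, so results differ slightly from geopy's ellipsoidal geodesic, and the function name `haversine_km` is hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

def haversine_km(lat, lon, ref_lat, ref_lon):
    """Great-circle distance in km between (lat, lon) and a reference point, in degrees."""
    lat, lon, ref_lat, ref_lon = map(np.radians, (lat, lon, ref_lat, ref_lon))
    dlat = lat - ref_lat
    dlon = lon - ref_lon
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(ref_lat) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two sample points: Downtown Los Angeles itself and San Francisco
lats = np.array([34.0522, 37.7749])
lons = np.array([-118.2437, -122.4194])
dist = haversine_km(lats, lons, 34.0522, -118.2437)
print(dist)  # first entry is 0, second is roughly 560 km
```

Applied to the lab's data, this would look like `haversine_km(df['Latitude'], df['Longitude'], *la_coords)`.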




---


### **Add Your Answer here:**

*

*



---

### 5.3: Clustering Geospatial Data

> ### Task 3: Use a clustering algorithm like K-Means to categorize the housing blocks into different regions based on their latitude and longitude.

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=0).fit(df[['Latitude', 'Longitude']])
df['location_cluster'] = kmeans.labels_
```
### Task 4: Visualize the clusters on a map.


```python
# Initialize map centered around California
import folium
m = folium.Map(location=[36.7783, -119.4179], zoom_start=6)

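# Define one color per cluster (K-Means above was fitted with n_clusters=10)
colors = ['red', 'blue', 'green', 'purple', 'orange',
          'darkred', 'cadetblue', 'darkgreen', 'pink', 'black']

# Add points to the map
for idx, row in df.iterrows():
    cluster_idx = int(row['location_cluster'])  # column created by the K-Means step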
    folium.CircleMarker([row['Latitude'], row['Longitude']],
                        radius=5,
                        color=colors[cluster_idx],
                        fill=True,
                        fill_color=colors[cluster_idx],
                        fill_opacity=0.9).add_to(m)

# Show the map
m
```



> ### Question: Do the clusters make intuitive sense? How might these clusters be useful in your predictive model?
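
Cluster labels are categories, not quantities, so feeding the raw label (0 to 9) into a linear model would impose an arbitrary ordering. A usual approach is one-hot encoding. The sketch below shows the idea with NumPy on a small made-up label array that stands in for `kmeans.labels_`.

```python
import numpy as np

labels = np.array([0, 2, 1, 2, 0, 1])   # stand-in for kmeans.labels_
n_clusters = 3

# Row i of the identity matrix is the one-hot vector for cluster i
one_hot = np.eye(n_clusters)[labels]

# Drop the first column so the dummies are not collinear with the model's constant;
# cluster 0 then acts as the baseline category
dummies = one_hot[:, 1:]

print(one_hot)
print(dummies)
```

With pandas, `pd.get_dummies(df['location_cluster'], prefix='cluster', drop_first=True)` produces the equivalent columns, which can then be joined to the feature matrix.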




---


### **Add Your Answer here:**

*

*



---

## Conclusion and Next Steps
> ### Reflection: Take a moment to reflect on what you've learned. What concepts are clear? What concepts might need more review?

## Next Steps: Consider extending the lab by:
* Adding more variables to the multiple regression model.
* Trying different combinations of features.
* Applying data transformations.
* Incorporating external geospatial data (see some ideas in Part 6).



## Part 6: Incorporating External Geospatial Data (Homework Activity)
### Objective:
> Learn how to integrate external geospatial datasets to improve the predictive modeling of housing prices in California.

### 6.1: Identifying Relevant External Datasets
> Task: Research and list potential external geospatial datasets that could be relevant for predicting housing prices. This could include data on:

* Points of interest (schools, parks, shopping centers, etc.)
* Crime rates
* Demographic information
* Public transportation accessibility
* Air quality or environmental data

### Examples of datasets could include:

* California schools location and performance data
* Crime reports by region
* Census data for demographic information
* Public transportation stations and their frequencies

### 6.2: Accessing and Preprocessing External Datasets
> Task: Choose one of the datasets from the previous task, access it, and perform any necessary preprocessing to make it suitable for integration. This could involve cleaning the data, handling missing values, and converting the geospatial information to the same coordinate system as your California Housing dataset.

### 6.3: Merging Datasets
> Task: Merge the external dataset with the California Housing dataset based on the geographical information.

You might use spatial joins if the datasets are in a geospatial format.
Alternatively, you could calculate distances between points of interest and housing data, creating new features based on these distances.
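
The distance-based option can be sketched as follows. The example assumes the external dataset has already been reduced to a small table of point-of-interest coordinates; the `blocks` and `pois` arrays are illustrative values, and `haversine_km` is a hypothetical helper using a spherical-Earth approximation.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (spherical Earth, radius 6371 km); inputs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Illustrative housing blocks as (latitude, longitude)
blocks = np.array([[34.05, -118.25], [37.80, -122.40], [36.00, -120.00]])
# Illustrative points of interest: Downtown Los Angeles and San Francisco
pois = np.array([[34.0522, -118.2437], [37.7749, -122.4194]])

# Broadcasting (n_blocks, 1) against (1, n_pois) gives an (n_blocks, n_pois) matrix
d = haversine_km(blocks[:, 0:1], blocks[:, 1:2], pois[None, :, 0], pois[None, :, 1])

nearest_idx = d.argmin(axis=1)   # which point of interest is closest to each block
nearest_km = d.min(axis=1)       # distance to that closest point, a candidate feature
print(nearest_idx, nearest_km)
```

The full distance matrix grows as blocks times points of interest, so for a large external dataset a spatial index or a spatial join may be the more practical route.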

### 6.4: Feature Engineering
> Task: Based on the merged data, create new features that could be relevant for predicting housing prices. This could be the distance to the nearest point of interest, the average crime rate in the area, or the demographic makeup of the neighborhood.

### 6.5: Model Training and Evaluation
> Task: Re-train your predictive models from earlier in the lab, but this time include the new features from the external dataset.

Question: How have your model's performance metrics changed after including these features?


# Part 7: Apply Your Knowledge to Another Dataset
## 7.1 Load and Explore the Diabetes Dataset
<div>
<img src="https://th.bing.com/th/id/OIG.hpNJE9n4ZMnF5WBL4AHy?pid=ImgGn" width="300" />
</div>



```python

from sklearn.datasets import load_diabetes

# Load the Diabetes dataset
diabetes_data = load_diabetes()
df_diabetes = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
target_diabetes = pd.DataFrame(diabetes_data.target, columns=["Diabetes Progression"])

# Display the first few rows of the diabetes dataset
df_diabetes.head()
```

### Task: Familiarize yourself with the new dataset. What features are available? What is the target variable?
### Question: Based on your initial observations, which features do you think might be good predictors for diabetes progression?

## 7.2 Simple Linear Regression on the Diabetes Dataset
Choose one feature that you think might be a good predictor for diabetes progression.

```python
X_diabetes = sm.add_constant(df_diabetes["feature_name"])  # Replace "feature_name" with your chosen feature
y_diabetes = target_diabetes["Diabetes Progression"]

model_diabetes = sm.OLS(y_diabetes, X_diabetes).fit()
predictions_diabetes = model_diabetes.predict(X_diabetes)
model_diabetes.summary()
```
### Task: Interpret the output. What does the coefficient for your chosen feature tell you?
### Question: What is the R-squared value, and what does it tell you in this context?

## 7.3 Multiple Linear Regression on the Diabetes Dataset
Choose a set of features that you think might be good predictors for diabetes progression.

``` python

X_diabetes_multi = df_diabetes[["feature1", "feature2", "feature3"]]  # Replace with your chosen features
X_diabetes_multi = sm.add_constant(X_diabetes_multi)

model_diabetes_multi = sm.OLS(y_diabetes, X_diabetes_multi).fit()
predictions_diabetes_multi = model_diabetes_multi.predict(X_diabetes_multi)
model_diabetes_multi.summary()
```

### Task: Interpret the output. How does each of the features relate to the target variable 'Diabetes Progression'?
### Question: Which variables appear to be the most significant in predicting diabetes progression?

## 7.4 Linear Regression with scikit-learn on the Diabetes Dataset
```python
X_train_diabetes, X_test_diabetes, y_train_diabetes, y_test_diabetes = train_test_split(
    df_diabetes, target_diabetes, test_size=0.2, random_state=42)

lm_diabetes = LinearRegression()
model_diabetes_sklearn = lm_diabetes.fit(X_train_diabetes, y_train_diabetes)
predictions_diabetes_sklearn = lm_diabetes.predict(X_test_diabetes)

# Print out R-squared and Mean Squared Error
print("R-squared:", r2_score(y_test_diabetes, predictions_diabetes_sklearn))
print("Mean Squared Error:", mean_squared_error(y_test_diabetes, predictions_diabetes_sklearn))
```
### Task: Run the model and note down the R-squared and Mean Squared Error. How does this model's performance compare to the previous models on the California housing dataset?
### Question: What steps could you take to improve the model’s performance?