<a href="https://colab.research.google.com/github/royn5618/EP_23_Intro_to_ML_Workshop/blob/main/Notebooks/Linear_Regression_California_Housing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd

import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns

In [None]:
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True, return_X_y=False)['frame']

# Explore Data

## Basic Data Info

In [None]:
df.info()

**Observations**
- The dataset contains 20,640 samples with 8 features plus the target column `MedHouseVal`.

- All columns are numerical and stored as floating-point numbers.

- there are no missing values*

*But there could be data anomalies.
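
One way to look for such anomalies is to count missing values, duplicated rows, and values outside a plausible range. The sketch below is only an illustration: `toy` is a small made-up frame standing in for `df`, and the "house age must not be negative" rule is an assumed sanity check, not something the dataset documentation prescribes.

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`; the values are made up.
toy = pd.DataFrame({
    "MedInc": [3.2, 5.1, 5.1, 8.3],
    "HouseAge": [21.0, 35.0, 35.0, -1.0],  # -1.0 is deliberately implausible
})

# Number of NaN entries in each column
missing_per_column = toy.isna().sum()
# Number of rows that exactly repeat an earlier row
n_duplicate_rows = int(toy.duplicated().sum())
# Illustrative range check (an assumption): ages should not be negative
n_negative_age = int((toy["HouseAge"] < 0).sum())

print(missing_per_column)
print("duplicate rows:", n_duplicate_rows)
print("negative house ages:", n_negative_age)
```

The same three calls can be run on `df` directly; which range checks make sense depends on the column.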

## Scan a Few Data Points

In [None]:
df.head()

In [None]:
df.describe()

# Univariate Analysis

Analyzing the range, value concentration and outliers for each variable.

In [None]:
geog_columns = ['Longitude', 'Latitude']
numeric_columns = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                   'Population', 'AveOccup']

target_column = 'MedHouseVal'

In [None]:
for each_col in numeric_columns:
  sns.histplot(df[each_col], bins=30)
  plt.show()

**Observations**

- The median income has a long-tailed distribution: incomes are more or less normally distributed, but some people earn a high salary.

- The distribution of the average house age is more or less uniform.

- `AveRooms`, `AveBedrms`, `Population`, and `AveOccup` have extreme values, i.e. they potentially have outliers.
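
A common rule of thumb for flagging such extreme values is the interquartile-range (IQR) rule, which is also what box-plot whiskers are usually based on. The sketch below is a minimal illustration on made-up numbers. The helper `iqr_outlier_mask` and the multiplier `k=1.5` are assumptions for this example, not part of the notebook.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up sample with one extreme value, standing in for a column such as AveRooms
sample = np.array([4.0, 5.0, 5.0, 6.0, 6.0, 7.0, 140.0])
mask = iqr_outlier_mask(sample)
print(int(mask.sum()), "of", sample.size, "values flagged")
```

Applied to the real data, this could be called as `iqr_outlier_mask(df['AveRooms'])`. Whether a flagged value is an error or just a rare but valid observation still needs a manual look.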

# Target Analysis

Here, instead of a histogram, we use a box plot.

In [None]:
sns.boxplot(x=df[target_column])

In [None]:
sns.scatterplot(
    data=df,
    x="Longitude",
    y="Latitude",
    size=target_column,
    hue=target_column,
    palette='crest',
    alpha=0.5,
)

plt.show()

**Observations:**

- The high-valued houses are mostly located on the coastline, around the big cities of San Diego, Los Angeles, San Jose, or San Francisco.

# Bivariate Correlation Analysis

In this section, only two columns, `AveBedrms` and `AveRooms`, are selected to demonstrate fitting a linear regression line.

In [None]:
lin_reg_cols= ['AveBedrms', 'AveRooms']

We will use a scatterplot here to analyse the nature of their correlation.

In [None]:
# Keep rows with AveBedrms below 20 for a closer look
df_lin_reg = df[df[lin_reg_cols[0]] < 20][lin_reg_cols]
sns.scatterplot(data=df_lin_reg,
                x=lin_reg_cols[0],
                y=lin_reg_cols[1]
                )

plt.show()

In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np

In [None]:
reg = LinearRegression()
reg.fit(np.array(df_lin_reg['AveBedrms']).reshape(-1, 1), df_lin_reg['AveRooms'])

In [None]:
reg.coef_

In [None]:
reg.intercept_
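
For a single feature, the least-squares slope and intercept have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept is `mean(y) - slope * mean(x)`. The sketch below works this out with NumPy on made-up points. On the same data, the values should agree with `reg.coef_[0]` and `reg.intercept_` from `LinearRegression` up to floating-point error.

```python
import numpy as np

# Made-up points lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
x_centered = x - x.mean()
slope = (x_centered * (y - y.mean())).sum() / (x_centered ** 2).sum()
# The fitted line always passes through the point (mean_x, mean_y)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)
```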

In [None]:
sns.scatterplot(data=df_lin_reg,  # rows with AveBedrms below 20
                x=lin_reg_cols[0],
                y=lin_reg_cols[1]
                )
# Evaluate the fitted line y = coef * x + intercept at x = 0, 1, ..., 19
x_plot = np.arange(20)
y_plot = reg.coef_[0] * x_plot + reg.intercept_
plt.plot(x_plot, y_plot, color='black', label='pred')
plt.legend()
plt.show()

In [None]:
reg.predict(np.array(2).reshape(1, -1))

# Fitting Linear Regression on Multiple Features

Tip:

1. Use Scikit-Learn Pipeline
2. Use Standard Scaler to standardize the numeric values to zero mean and unit variance.
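
As a rough sketch of how the two tips combine, `make_pipeline` can chain a `StandardScaler` and a `LinearRegression` into one estimator, so the scaling learned from the training data is reapplied automatically at prediction time. The data below is synthetic, and the coefficients 3.0 and 0.01 are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustration only): two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(1000, 300, 200)])
y = 3.0 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 0.1, 200)

# fit() fits the scaler on the training rows, then fits the model on the
# scaled output; predict() reuses the fitted scaler before predicting.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X[:150], y[:150])
y_hat = model.predict(X[150:])

print(y_hat.shape)
```

With a pipeline there is no separate step where the test set has to be transformed by hand, which removes one common source of mistakes.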

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

In [None]:
df.head()

In [None]:
# Drop Lat and Long
df_train = df[numeric_columns]
df_target = df[target_column]

In [None]:
# split data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_train,
                                                    df_target,
                                                    test_size=0.2,
                                                    random_state=42)

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
# Train and predict using linear regression
lin_reg = LinearRegression()
# Train with the train dataset
lin_reg.fit(X_train, y_train)

In [None]:
coeff_df = pd.DataFrame(lin_reg.coef_, X_train.columns, columns=['Coefficient'])
coeff_df

In [None]:

intercept_df = pd.Series(lin_reg.intercept_)
intercept_df

In [None]:
# Make prediction
y_pred = lin_reg.predict(X_test)

In [None]:
y_pred

# Evaluating Linear Regression Models

- Mean Absolute Error
- Mean Squared Error
- Root Mean Squared Error   

Link: https://scikit-learn.org/stable/modules/model_evaluation.html
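
To make the three metrics concrete, the sketch below computes them by hand with NumPy on made-up numbers. The `metrics` functions used in this notebook should return the same values for the same inputs.

```python
import numpy as np

# Made-up targets and predictions
y_true = np.array([3.0, 1.0, 2.0, 4.0])
y_hat = np.array([2.0, 1.0, 4.0, 4.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # Mean Absolute Error: average size of the errors
mse = np.mean(errors ** 2)      # Mean Squared Error: penalizes large errors more
rmse = np.sqrt(mse)             # Root Mean Squared Error: same units as the target

print(mae, mse, rmse)
```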

In [None]:
from sklearn import metrics

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
df_eval = pd.DataFrame()
df_eval['actual'] = y_test
df_eval['predictions'] = y_pred
df_eval.head()

In [None]:
df_eval[:15].plot(kind='bar',figsize=(16,10))
plt.show()

# Revisit

In [None]:
X_train.describe()

In [None]:
scale_cols = ['Population']

## Applying Standard Scaler

Link: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Features will be transformed such that the mean is 0 and the standard deviation is 1. This matches the mean and standard deviation of a standard normal distribution, but scaling does not change the shape of the distribution.
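
Standardization subtracts the column mean and divides by the column standard deviation. The sketch below does this by hand on made-up values. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` is documented to use.

```python
import numpy as np

# Made-up values standing in for a column such as Population
values = np.array([100.0, 200.0, 300.0, 400.0, 5000.0])

mean = values.mean()
std = values.std()          # population standard deviation (ddof=0)
z = (values - mean) / std   # standardized values

print(z.mean(), z.std())
print(z)
```

The extreme value stays extreme after the transformation, since only the location and spread of the values change.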

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
scaler.fit(X_train[scale_cols])

In [None]:
X_train[scale_cols] = scaler.transform(X_train[scale_cols])

In [None]:
X_train.describe()

## Exercise

In [None]:
# Train and predict using linear regression

# Train with the train dataset


In [None]:
# Transform test set


In [None]:
# Make predictions


In [None]:
# Evaluate

# Solution



```

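# Fit a new model on the scaled training data
lin_reg_sc = LinearRegression()
lin_reg_sc.fit(X_train, y_train)

# Scale the test set with the scaler fitted on the training set
X_test[scale_cols] = scaler.transform(X_test[scale_cols])
# Predict with the model trained on scaled data (not the earlier lin_reg)
y_pred = lin_reg_sc.predict(X_test)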

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


```

