<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# TOWS - Weaknesses and Strengths

This session focuses on identifying trends in structured data with a technique called linear regression.

Linear regression can help us find relationships between different features in our data. For example, if we're interested in the relationship between net profit and particular product features, a linear regression might show that future increases in profit are more likely to come from sales in particular regions than from a particular demographic. If so, the business might identify those regions as strengths, and other regions as weaknesses.

In the business scenario below, we will use linear regression to identify a relationship between sale price and key features of the product that is being sold.

Linear regression can also help with trend analysis, which lets stakeholders predict whether aspects of their business are (a) improving, which may indicate strengths they can build on, or (b) declining, which may indicate weaknesses they need to address.

The following web page provides some suggestions as to the types of data that may be useful for business trend analysis:
[Business Queensland - Trend analysis for business improvement](https://www.business.qld.gov.au/running-business/growing-business/trend-analysis)


## Linear Regression

Linear regression is used to identify the relationship between a dependent variable and one or more independent variables, and is typically used to make predictions about future outcomes. Once the relationship has been estimated, it can be used to predict an unknown value of the dependent variable from known values of the independent variables. Linear regression fits a straight line that minimises the sum of squared differences between predicted and actual output values. For further information, see [Linear regression](https://en.wikipedia.org/wiki/Linear_regression).

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Linear_least_squares_example2.svg/440px-Linear_least_squares_example2.svg.png" style="width:50%">
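As a minimal sketch of what fitting a straight line means, the least-squares slope and intercept for a single independent variable can be computed by hand. The four data points are illustrative only and are not part of the property dataset used later.

```python
from statistics import mean

# Illustrative points (an assumption for this sketch, not the property data)
x = [1.0, 2.0, 3.0, 4.0]
y = [6.0, 5.0, 7.0, 10.0]

x_bar, y_bar = mean(x), mean(y)

# Slope = co-variation of x and y divided by the variation of x
slope = (
    sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    / sum((xi - x_bar) ** 2 for xi in x)
)
# The fitted line always passes through the point of means
intercept = y_bar - slope * x_bar

# Residuals are the differences between actual and fitted values;
# least squares chooses the line that minimises their sum of squares
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"y = {intercept:.2f} + {slope:.2f}x")  # y = 3.50 + 1.40x
```

scikit-learn's `LinearRegression`, used below, solves the same kind of problem for several independent variables at once.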

We start by importing `pandas` for data handling, `plotly.express` for visualisation, and the model, data-splitting, and metric functions we need from the `scikit-learn` Python library.

In [16]:
import pandas as pd
import plotly.express as px

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

## Business Scenario

You're working with a real estate agency in Darwin (NT), and you're trying to get an understanding of which features of houses are most likely to be valued by buyers - and therefore command a higher price in the market. Although obviously build quality and location will have a significant impact, for this analysis we're going to focus on the typical house features that are advertised: bedrooms, bathrooms, and car parking.



#### Data

Load and get to know the data.

The included dataset is derived from a [dataset of real estate property prices in Darwin](https://www.kaggle.com/datasets/thedevastator/australian-housing-data-1000-properties-sampled?resource=download).

In [180]:
file_path = "data/"
file_name = "RealEstateAU_NT_property.csv"
property_df = pd.read_csv(f"{file_path}{file_name}")
property_df

Unnamed: 0,property,price,bedrooms,bathrooms,parking
0,Apartment,99950,0,1,1
1,Apartment,175000,1,1,1
2,Apartment,180000,1,1,0
3,Apartment,180000,1,1,0
4,Apartment,215000,1,1,1
...,...,...,...,...,...
395,House,899000,5,2,3
396,House,1100000,5,3,6
397,House,650000,6,3,3
398,House,849000,6,4,6


For linear regression, we need numeric values. To include the property type in our model, we can change it to a numeric value.

In [181]:
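# Replace the property type labels with numeric codes (House = 1, Apartment = 2).
# Assign the result back: inplace=True on a column selection may not update
# the original DataFrame in newer versions of pandas.
property_df['property'] = property_df['property'].replace({'House': 1, 'Apartment': 2})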

property_df

Unnamed: 0,property,price,bedrooms,bathrooms,parking
0,2,99950,0,1,1
1,2,175000,1,1,1
2,2,180000,1,1,0
3,2,180000,1,1,0
4,2,215000,1,1,1
...,...,...,...,...,...
395,1,899000,5,2,3
396,1,1100000,5,3,6
397,1,650000,6,3,3
398,1,849000,6,4,6
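The 1/2 coding above works because there are only two property types. A common alternative, sketched here on a few made-up rows, is a 0/1 indicator column with a descriptive name. For a two-category feature it leads to the same predictions, but the column name makes it clear which category the coefficient refers to. The `toy_df` frame and the `is_apartment` column are assumptions for illustration and do not appear in the notebook.

```python
import pandas as pd

# Toy rows for illustration only (not the real dataset)
toy_df = pd.DataFrame({
    'property': ['House', 'Apartment', 'Apartment', 'House'],
    'price': [650000, 180000, 215000, 899000],
})

# 1 if the property is an apartment, 0 if it is a house
toy_df['is_apartment'] = (toy_df['property'] == 'Apartment').astype(int)

print(toy_df)
```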


In [182]:
# Get the descriptive statistics
property_df.describe()

Unnamed: 0,property,price,bedrooms,bathrooms,parking
count,400.0,400.0,400.0,400.0,400.0
mean,1.54,510273.1,2.655,1.6475,1.9425
std,0.499022,229750.8,1.076509,0.590861,1.264195
min,1.0,99950.0,0.0,1.0,0.0
25%,1.0,383750.0,2.0,1.0,1.0
50%,2.0,475000.0,3.0,2.0,2.0
75%,2.0,590000.0,3.0,2.0,2.0
max,2.0,1950000.0,6.0,5.0,10.0


Are there any obvious correlations in the data?

In [183]:

property_corr = property_df.corr(numeric_only=True)
property_corr

Unnamed: 0,property,price,bedrooms,bathrooms,parking
property,1.0,-0.2891,-0.646066,-0.168811,-0.427391
price,-0.2891,1.0,0.594463,0.601668,0.455834
bedrooms,-0.646066,0.594463,1.0,0.584556,0.543392
bathrooms,-0.168811,0.601668,0.584556,1.0,0.368719
parking,-0.427391,0.455834,0.543392,0.368719,1.0


In [184]:
# Create a heatmap to visualise the correlations
pc_fig = px.imshow(property_corr) 
pc_fig.show()

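Property type (-0.29) and parking (0.46) are the least correlated with price. Bedrooms (0.59) and bathrooms (0.60) are the most correlated with price. The negative sign for property reflects the coding (House = 1, Apartment = 2): apartments tend to sell for less than houses.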
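To compare the features at a glance, the correlations with price can be ranked by absolute value. This sketch copies the price column of the correlation matrix above into a plain dictionary; in the notebook the same values would come from `property_corr['price']`.

```python
# Correlations with price, copied from the matrix above
price_corr = {
    'property': -0.2891,
    'bedrooms': 0.594463,
    'bathrooms': 0.601668,
    'parking': 0.455834,
}

# Rank by absolute value: the sign gives the direction of the relationship,
# the magnitude gives its strength
ranked = sorted(price_corr.items(), key=lambda item: abs(item[1]), reverse=True)

for feature, r in ranked:
    print(f"{feature:<10} {r:+.3f}")
```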

#### Learning check-in 1
Throughout this notebook, we'll ask you to record an indicator of your learning. The following code facilitates this. Run the cell and follow the prompts.

In [None]:
# library to record and plot learning checkins
import sys; sys.path.append('./.local_libs'); from learning_checkin import *
# Run this cell to check-in
learning_checkin()

#### Alternate visualisations

Different visualisations can sometimes help better understand the relationships in the data. When looking for linear relationships, a pair plot can be helpful...

In [185]:
price_fig = px.scatter_matrix(property_df) # Create a pair plot to see the linearity of the variables
price_fig.show()

### Creating a model

To fit a linear regression model, we assign the dependent variable that we want to predict (price) to `y`, and draw the `X` data from the independent variables (bedrooms, bathrooms, parking, and property).

In [186]:
# Independent variables

X_data = property_df[['bedrooms','bathrooms','parking','property']]
X_data

Unnamed: 0,bedrooms,bathrooms,parking,property
0,0,1,1,2
1,1,1,1,2
2,1,1,0,2
3,1,1,0,2
4,1,1,1,2
...,...,...,...,...
395,5,2,3,1
396,5,3,6,1
397,6,3,3,1
398,6,4,6,1


In [187]:
# Dependent variable

y_data = property_df['price']
y_data

0        99950
1       175000
2       180000
3       180000
4       215000
        ...   
395     899000
396    1100000
397     650000
398     849000
399    1200000
Name: price, Length: 400, dtype: int64

As we are training a model from the data, we need to split the data into training data, and test data (a reserved portion of the data to test the model).

In [188]:
# Break the current dataset into train and test datasets
# train_size sets the proportion of rows used to train the model (here 80%);
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, train_size=0.8, random_state=0)
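To make the idea of the split concrete, here is a rough sketch of what a random 80/20 split does. The helper `simple_train_test_split` is hypothetical and is not how scikit-learn implements `train_test_split`, so it will not select the same rows as the cell above.

```python
import random

def simple_train_test_split(rows, train_size=0.8, seed=0):
    """Shuffle the row positions, then cut them into a training and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    cut = int(len(rows) * train_size)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

rows = list(range(400))  # stand-in for the 400 properties
train_rows, test_rows = simple_train_test_split(rows)

print(len(train_rows), len(test_rows))  # 320 80
```

Every row ends up in exactly one of the two parts, so the model is never tested on data it was trained on.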

In [189]:
# Create a new linear regression model
linear_model = LinearRegression() 

# Train the model with the train dataset
linear_model.fit(X_train, y_train) 

# Predict using the testing dataset
linear_predictions = linear_model.predict(X_test) 
linear_predictions

array([ 389047.19432155,  605388.33121111,  641826.4994789 ,
        605388.33121111,  425485.36258935,  355609.79648701,
        535377.89136389,  571816.05963168,  355609.79648701,
        458922.76042389,  285599.35663979,  501940.49352935,
        535377.89136389,  641826.4994789 ,  285599.35663979,
        285599.35663979,  425485.36258935,  355609.79648701,
        605388.33121111,  605388.33121111,  535377.89136389,
        638690.85530077,  422349.71841121,  641826.4994789 ,
        535377.89136389,  605388.33121111,  535377.89136389,
        285599.35663979,  538378.66179714,  535377.89136389,
        535377.89136389,  605388.33121111,  641826.4994789 ,
        425485.36258935,  355609.79648701,  535377.89136389,
        535377.89136389,  501940.49352935,  425620.23633423,
       1354589.06163853,  535377.89136389,  641826.4994789 ,
        285599.35663979,  285599.35663979,  459057.63416877,
        285599.35663979,  775576.09081707,  672128.25313531,
        355609.79648701,

In [190]:
X_test

Unnamed: 0,bedrooms,bathrooms,parking,property
132,2,1,2,2
309,3,2,2,2
341,4,2,2,1
196,3,2,2,2
246,3,1,2,1
...,...,...,...,...
14,1,1,1,2
363,4,2,2,1
304,3,2,2,2
361,4,2,2,1


In [191]:
actual_predict_df = pd.DataFrame({'actual':y_test,'predict':linear_predictions})
actual_predict_df

Unnamed: 0,actual,predict
132,279000,389047.194322
309,549000,605388.331211
341,580000,641826.499479
196,749000,605388.331211
246,485000,425485.362589
...,...,...
14,315000,285599.356640
363,660000,641826.499479
304,440000,605388.331211
361,649000,641826.499479


In [192]:

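# Plot predicted against actual prices (trendline='ols' requires the statsmodels package)
px.scatter(actual_predict_df, x='actual', y='predict', trendline='ols')

Note that the trend line summarises how the predicted prices track the actual prices: the closer the points sit to the line, and the closer its slope is to 1, the better the predictions. To quantify how well the model explains the test data, we can obtain some metrics.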

In [193]:
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(y_test, linear_predictions))

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, linear_predictions))

# The coefficients
print("Coefficients: \n", linear_model.coef_)


Coefficient of determination: 0.57
Mean squared error: 16214122543.16
Coefficients: 
 [ 70010.43984722 146330.69704234  33437.39783454  33572.27157943]




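The coefficient of determination (R-squared) represents the proportion of the variation in the dependent variable that is predictable from the independent variable(s). It is typically between 0 and 1, where 1 is perfect prediction, although it can be negative on test data when the model performs worse than simply predicting the mean.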

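The mean squared error (MSE) is the average squared difference between the predicted values and the actual values. Because the errors are squared, the MSE is expressed in squared price units; its square root (RMSE) is in the same units as price and is easier to interpret.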
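The two metrics can be written out directly from their definitions. This sketch uses only the first five actual/predicted pairs shown in the table above, so its results will differ from the full test-set figures; the functions `mse` and `r2` are illustrative helpers, not the scikit-learn implementations.

```python
import math

# First five actual/predicted pairs from the table above (illustration only)
actual = [279000, 549000, 580000, 749000, 485000]
predicted = [389047.194322, 605388.331211, 641826.499479, 605388.331211, 425485.362589]

def mse(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 minus (squared error of the model / squared error of predicting the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

print(f"MSE (5 rows): {mse(actual, predicted):.2f}")
print(f"R-squared (5 rows): {r2(actual, predicted):.2f}")

# RMSE for the full test set, using the MSE reported by the notebook
rmse = math.sqrt(16214122543.16)
print(f"RMSE (full test set): {rmse:,.0f}")
```

The RMSE works out to roughly 127,000, which gives a sense of the typical size of a prediction error in the same units as price.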

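The coefficients are listed in the same order as the columns of `X_data`: bedrooms, bathrooms, parking, and property. Each one estimates the change in predicted price for a one-unit increase in that feature while the others are held constant. For example, an additional bathroom is associated with an increase of roughly $146,000. Because property is coded as 1 for a house and 2 for an apartment, its positive coefficient means the model predicts about $33,600 more for an apartment than for a house with the same bedrooms, bathrooms, and parking.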
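Pairing each coefficient with its feature name makes the output easier to read. The values below are copied from the printed output above; in the notebook, the equivalent would be `dict(zip(X_data.columns, linear_model.coef_))`.

```python
feature_names = ['bedrooms', 'bathrooms', 'parking', 'property']
coefficients = [70010.43984722, 146330.69704234, 33437.39783454, 33572.27157943]

# Map each feature to its coefficient (same order as the X_data columns)
coef_by_feature = dict(zip(feature_names, coefficients))

for name, coef in coef_by_feature.items():
    print(f"{name:<10} {coef:>12,.0f}")

# Predicted price change from adding one bedroom and one bathroom,
# holding parking and property type constant
extra = coef_by_feature['bedrooms'] + coef_by_feature['bathrooms']
print(f"Extra bedroom + bathroom: {extra:,.0f}")
```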



#### Learning check-in 2

In [None]:
# Run this cell to check in for your learning session
learning_checkin()

### Making predictions

Experiment with the code below to see how the model can be used to predict on different combinations of independent variables.

How might this be used to address the business scenario?

In [205]:
new_df = pd.DataFrame({'bedrooms':[4],'bathrooms':[2],'parking':[2],'property':[1]})
linear_model.predict(new_df)[0]

641826.4994789017
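A prediction is just the intercept plus each coefficient multiplied by its feature value. The notebook does not print the intercept, so this sketch infers it from the prediction above; in the notebook it would be available as `linear_model.intercept_`. The helper names are hypothetical.

```python
# Coefficients copied from the model output earlier in the notebook
coef = {
    'bedrooms': 70010.43984722,
    'bathrooms': 146330.69704234,
    'parking': 33437.39783454,
    'property': 33572.27157943,
}

def weighted_sum(features):
    """Sum of coefficient * value over all features."""
    return sum(coef[name] * value for name, value in features.items())

# Infer the intercept from the known prediction for a 4-bed, 2-bath, 2-park house
known = {'bedrooms': 4, 'bathrooms': 2, 'parking': 2, 'property': 1}
implied_intercept = 641826.4994789017 - weighted_sum(known)

def predict_price(features):
    """Linear prediction: intercept + weighted sum of the features."""
    return implied_intercept + weighted_sum(features)

# Check against test row 132 (2 bed, 1 bath, 2 parking, apartment),
# which the model predicted at about 389047.19
row_132 = {'bedrooms': 2, 'bathrooms': 1, 'parking': 2, 'property': 2}
print(round(predict_price(row_132), 2))
```

Reproducing a second prediction this way is a quick check that the coefficients are being read in the right order.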

#### Learning check-in 3

In [None]:
# Run this cell to check in for your learning session
learning_checkin()

### Exploring further

Can you improve the model? 

Is the property type necessary?

Could you create different models for different property types? Would this be helpful?

#### Learning check-in 4

In [None]:
# Run this cell to check in for your learning session
learning_checkin()

#### Visualise your learning check-in data

In [None]:
# Run this cell to plot your check-ins for this session
plot_checkin()