<a href="https://colab.research.google.com/github/thuc-github/MIS710-T12023/blob/main/Week%203/MIS710_Lab3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MIS710 Lab 3 Week 3**
Author: Associate Professor Lemai Nguyen

Objective: to learn and practise linear regression models with scikit-learn

Dataset: HousingPrice

Source: https://www.kaggle.com/datasets/yasserh/housing-prices-dataset 
The dataset was modified to allow you to deal with missing data. 
**Download the modified data at the unit site.**

**To do before the class:**
1. complete Labs 0, 1 and 2
2. learn Lecture 3: Supervised Machine Learning: Linear Regression
3. download the housing.csv dataset and store it in your Google drive, MIS710 folder

**Student name:**

Student ID: 

## **Loading Libraries**


In [None]:
#load libraries
import pandas as pd #for data manipulation and analysis
import numpy as np #for working with arrays

#import data visualisation libraries 
import matplotlib.pyplot as plt
import seaborn as sns



## **Mount your Google drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## **Loading Data** 


1.   Load the dataset
2.   Explore the data



In [None]:
# load dataset
records = pd.read_csv("/content/drive/MyDrive/MIS710/Housing31.csv")

#explore the dataset
print(records)

print('Sample size:', records.shape[0])
print('Number of columns:', records.shape[1]) 

In [None]:
print(records.info())
print(records.shape)

In [None]:
#area was read in as strings (object dtype), so convert it to numbers
#values that cannot be parsed become NaN because of errors='coerce'
records['area'] = pd.to_numeric(records['area'], errors='coerce')
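
To see what `errors='coerce'` does, here is a minimal sketch on a made-up column (the values are hypothetical and are not taken from the housing data). Entries that cannot be parsed become `NaN`, which is why the missing-data checks in the next section matter.

```python
import pandas as pd

# Hypothetical toy column: numbers stored as text, with one unparseable entry
raw = pd.Series(["7420", "8960", "unknown", "7500"], name="area")

# errors='coerce' replaces anything that cannot be parsed with NaN
area = pd.to_numeric(raw, errors="coerce")

print(area.dtype)         # a float dtype, because NaN needs a float column
print(area.isna().sum())  # number of entries that could not be parsed
```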

## **Are there missing data?** 

Data preprocessing is cyclic: you move back and forth between analysing and visualising data, handling missing data, and feature engineering. For learning purposes, we show some simple techniques in sequence; you should move between these activities yourself. 

In [None]:
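#learn to use a for loop and to select part of a dataframe with iloc
#iloc[:,0:] selects all rows and all columns; looping over a dataframe yields its column names
#Count missing data in each column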
for i in records.iloc[:,0:]: 
  miss=records[i].isna().sum()
  print(i,'missing: ', miss)


In [None]:
#another way to find missing data is using the function isnull()
#read about isnull() here https://pandas.pydata.org/docs/reference/api/pandas.isnull.html 
#read further at https://www.sharpsightlabs.com/blog/pandas-isnull/ 
print(records.isnull().sum())

## **Variable analysis**

Stats and visualisation
1.  Univariate analysis
2.  Bivariate analysis
3.  Multivariate analysis




**Univariate analysis explores and visualises each variable at a time**


In [None]:
#overview 
records.describe()


In [None]:
#set the formatting for floating numbers 
pd.set_option('display.float_format', lambda x: '%.3f' % x)

data_types =['object', 'float', 'int'] 
records.describe(include=data_types)

## **You can use stats results to decide on and handle missing data**

In [None]:
#describe a numerical variable
records['area'].describe()

In [None]:
#describe categorical variables
records['furnishingstatus'].describe()

In [None]:
records['mainroad'].describe()

In [None]:
records['mainroad'].mode()[0]

In [None]:
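#Fill in missing numerical data with the mean and categorical data with the mode
#assign the result back to the column rather than using inplace=True on a column selection
records['area'] = records['area'].fillna(records['area'].mean())
#there can be more than one mode; mode()[0] takes the first one
records['furnishingstatus'] = records['furnishingstatus'].fillna(records['furnishingstatus'].mode()[0])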

#do it yourself for mainroad
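
The same pattern can be tried on a small made-up table before applying it to the real data. This is only a sketch: the values are hypothetical, and mean and mode imputation are just the simple techniques used in this lab, not the only options.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "area": [3000.0, 4000.0, np.nan, 5000.0],
    "basement": ["yes", "yes", None, "no"],
})

# Numerical column: replace NaN with the column mean (NaN is skipped when averaging)
toy["area"] = toy["area"].fillna(toy["area"].mean())

# Categorical column: replace missing values with the most frequent category
toy["basement"] = toy["basement"].fillna(toy["basement"].mode()[0])

print(toy)
print(toy.isna().sum())  # every count should now be zero
```

Re-running the missing-data count after imputation, as in the last line, is a quick way to confirm nothing was overlooked.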

**Visualise each numerical variable**

In [None]:
#using seaborn https://seaborn.pydata.org/generated/seaborn.histplot.html


In [None]:
#create a boxplot


**It's your turn: explore other numerical variables**

In [None]:
#visualise other numerical variables one at a time


**Explore each categorical variable**

In [None]:
#explore each categorical variable
print(records['furnishingstatus'].value_counts())
print('Furnishing Status mode: ', records['furnishingstatus'].mode())

In [None]:
#Another way to do it
records.furnishingstatus.value_counts()

Do it yourself for other categorical variables

**Visualise each categorical variable**

In [None]:
#Using seaborn
sns.countplot(x=records['bathrooms'])

**Visualise other categorical variables**

In [None]:
cat_variables = ['bedrooms', 'bathrooms','stories', 'parking','mainroad','guestroom','basement', 'hotwaterheating', 'airconditioning', 'prefarea', 'furnishingstatus']
for i in cat_variables:
   plt.figure()
   sns.countplot(x=records[i])


## **Multivariate visualisation**

**Display a countplot for one categorical variable grouped by a second categorical variable**
https://seaborn.pydata.org/generated/seaborn.countplot.html 

In [None]:
sns.countplot(data=records, x='prefarea', hue='mainroad')

In [None]:
#Do it yourself 


**Compare distributions of numerical variables using boxplots**
https://seaborn.pydata.org/generated/seaborn.boxplot.html

In [None]:
sns.boxplot(data=records, x='price', y='basement')

In [None]:
#Do it yourself, hint: using x=  y= and hue=

**Plotting diagrams to see relationships between two numerical variables**
https://seaborn.pydata.org/generated/seaborn.scatterplot.html 

In [None]:
sns.scatterplot(data=records, x='area', y='price')

In [None]:
#Let's move price to the first column
first_column=records.pop('price')
records.insert(0,'price',first_column)

In [None]:
records.iloc[9:14]

In [None]:
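#generate heatmaps to explore relationships
#numeric_only=True skips text columns, which recent pandas versions would otherwise reject
sns.heatmap(records.corr(numeric_only=True), square=True, cmap='Blues', annot=True)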
plt.show()

In [None]:
#generate a clustered heatmap; the dendrogram shows hierarchical clustering of the columns
#(row_cluster=False turns off clustering of the rows)
sns.clustermap(records.corr(numeric_only=True), cmap='Blues', annot=True, row_cluster=False)
plt.show()
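
To read a correlation matrix with confidence, it can help to check it on data where the answer is known in advance. The sketch below uses made-up values and assumes a pandas version that supports the `numeric_only` argument of `DataFrame.corr` (1.5 or later).

```python
import pandas as pd

# Hypothetical toy data with known relationships
toy = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "area": [10, 20, 30, 40],                # rises exactly in step with price
    "age": [40, 30, 20, 10],                 # falls exactly in step with price
    "mainroad": ["yes", "no", "yes", "no"],  # text column, not numeric
})

# numeric_only=True leaves the text column out of the calculation
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

A value near 1 means two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.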

## **Encoding data**

In [None]:
#Last week, we learned to convert categorical variables to numerical using LabelEncoder
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
records['mainroad_N'] = encoder.fit_transform(records['mainroad'])
records['basement_N'] = encoder.fit_transform(records['basement'])


In [None]:
#there are other ways of doing this, for example
records['hotwaterheating_N'] = records['hotwaterheating'].apply(lambda x: 1 if x == 'yes' else 0)

records.sample(10)

In [None]:
#Another way is getting all categorical columns
cat_variables = records.select_dtypes(include=['object']).columns
#Convert categorical columns to numeric
records[cat_variables] = records[cat_variables].apply(encoder.fit_transform)

# Display the updated dataset
print(records)

In [None]:
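#OPTIONAL
#another way, defining your OWN function
#convert categorical data to numerical 
#note: this function expects the original text values, so run it before the cell above that encodes all categorical columns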
def coding_furnishingstatus(x):
        if x=='furnished': return 3
        if x=='semi-furnished': return 2
        if x=='unfurnished': return 1
       
records['furnishingstatus_N'] = records['furnishingstatus'].apply(coding_furnishingstatus)

records.iloc[9:14]
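
The two approaches above do not produce the same codes. The sketch below, on a made-up series, shows that `LabelEncoder` numbers the classes in sorted (alphabetical) order, whereas an explicit mapping lets you choose an order that reflects the meaning of the categories. The mapping values are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy series
status = pd.Series(["unfurnished", "furnished", "semi-furnished", "furnished"])

# LabelEncoder assigns codes by sorted order of the class names
encoder = LabelEncoder()
label_codes = encoder.fit_transform(status)
print(list(encoder.classes_))  # classes in the order they were numbered
print(list(label_codes))

# An explicit mapping (illustrative values) lets you choose the order yourself
order = {"unfurnished": 1, "semi-furnished": 2, "furnished": 3}
ordinal_codes = status.map(order)
print(list(ordinal_codes))
```

This matters for linear regression, because the model treats the codes as quantities: a code of 2 is taken to be twice a code of 1.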

In [None]:
#write code to drop redundant columns
records= records.drop(['mainroad_N','basement_N','hotwaterheating_N','furnishingstatus_N'], axis=1)
print(records.info())

## **Feature Selection**

In [None]:
#feature selection
features=['area']
X=records[features]
X.head()

In [None]:
#specify the label
y=records['price']
y.head()

## **Split the Dataset**

Split arrays or matrices into random train and test subsets
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split

In [None]:
from sklearn.model_selection import train_test_split # Import train_test_split function

# Split dataset into training set 70% and test set 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70% training and 30% testing 

#inspect the split datasets
print(X_train.head())
print(y_train.head())

print('Training dataset size:',X_train.shape)
print('Test dataset size:',X_test.shape)
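
A minimal sketch of how `test_size` and `random_state` behave, using made-up arrays of 10 samples rather than the housing data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, one feature
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10) * 2

# test_size=0.3 sends 3 of the 10 samples to the test set
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
print(Xtr.shape, Xte.shape)

# Using the same random_state again reproduces exactly the same split
Xtr2, Xte2, ytr2, yte2 = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
print(np.array_equal(Xte, Xte2))
```

Fixing `random_state` makes results repeatable between runs, which is useful when comparing models on the same split.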


## **Training a Linear Regression Model**

1.   Train a model using the training dataset
2.   Make prediction using the model for the test dataset

Read about Linear Regression https://scikit-learn.org/stable/modules/linear_model.html

LinearRegression takes the arrays X and y in its `fit` method and stores the coefficients of the fitted linear model in its `coef_` attribute and the intercept in `intercept_`.
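
As a quick check of how `fit`, `coef_` and `intercept_` relate, here is a sketch on made-up data generated from y = 2x + 1 with no noise, so the fitted values are known in advance:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 2x + 1
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])  # 2-D: one row per sample, one column per feature
y_demo = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X_demo, y_demo)

print(model.coef_)       # one slope per feature, here close to 2
print(model.intercept_)  # close to 1
print(model.predict(np.array([[5.0]])))  # close to 11
```

Note that X must be two-dimensional even with a single feature, which is why the lab selects features with a list, as in `records[['area']]`.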






In [None]:
#import linear_model 
from sklearn import linear_model

#create a linear_model object
reg = linear_model.LinearRegression()

**Train a model**

In [None]:
# Train a Regression model (regressor) with the training dataset 
reg=reg.fit(X_train, y_train)

**Make predictions using the model and the test set**

In [None]:
#Make predictions for the test dataset
y_pred = reg.predict(X_test)


**Inspect the predictions and the original labels**

In [None]:
plt.scatter(y_test, y_pred) 
plt.xlabel("Actual prices") 
plt.ylabel("Predicted prices") 
plt.title("Actual prices vs Predicted prices")
plt.show()

In [None]:
#set the formatting for floating numbers 
pd.set_option('display.float_format', lambda x: '%.0f' % x)
area=X_test['area']
#inspection: compare actual and predicted prices side by side with the area
inspection=pd.DataFrame({'Area':area, 'Actual':y_test, 'Predicted':y_pred})
inspection.head(20)

**Getting the Intercept and Coefficients**

In [None]:
#coef_ is an array with one coefficient per feature; take the first (and only) one
print('%.2f' % reg.intercept_) 
print('%.2f' % reg.coef_[0])
print('Price = ', '%.2f' % reg.intercept_, ' + ', '%.2f' % reg.coef_[0], ' * ', 'Area' )


In [None]:
sns.scatterplot(data=inspection, x='Area', y='Actual')
sns.regplot(data=inspection, x='Area', y='Predicted', color='blue')

## **Performance Metrics**

In [None]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.2f}")

In [None]:
from sklearn.metrics import mean_absolute_error

# Calculate and print the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.0f}")

In [None]:
from sklearn.metrics import mean_squared_error

# Calculate and print the root mean square error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Square Error: {rmse:.0f}")
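
To see what these three metrics measure, the sketch below computes each one by hand with NumPy on made-up values and compares the result with scikit-learn. The numbers are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

errors = y_true - y_hat

# MAE: average size of the errors, ignoring their sign
mae_manual = np.mean(np.abs(errors))

# RMSE: square root of the average squared error (penalises large errors more)
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R-squared: 1 minus (squared error of the model / squared error of always predicting the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2_manual, r2_score(y_true, y_hat))
```

MAE and RMSE are in the same units as the label (here, price), so they are best judged against the typical size of the label.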

In [None]:
#Examine the performance using the descriptive stats of price 
records['price'].describe()

## **Repeat from the feature selection steps to create multiple linear regression model**

In [None]:
#run the following code and examine the correlations among the variables
records.corr()

In [None]:
#select relevant features and train and evaluate a model

# Try it yourself! 

**Do it yourself:** Repeat the above steps with the housing dataset to consolidate your learning

## Loading libraries

In [None]:
import numpy as np 
import pandas as pd 
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score,mean_squared_error

## Import dataset

In [None]:
# Load data using pandas.read_csv(filepath_or_url, sep=',')
url = 'https://raw.githubusercontent.com/thuc-github/MIS710-T12023/main/Week%203/insurance.csv'

df = pd.read_csv(url)


## EDA

* How many rows and columns are in the dataset? 
* Return the first n rows.
* What are the columns and their datatypes?
* Are there any missing values? 
* How to deal with categorical features? 
* Are there any strong correlations in the dataset?  
* What are the stats for `charges`? Plot the overall distribution of `charges`, and the distribution of `charges` for smokers and non-smokers. Practise more with the `bmi`, `age` and `sex` variables. 
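
For the question on categorical features, one option besides the label encoding used earlier in this lab is one-hot encoding with `pd.get_dummies`. The sketch below uses a few made-up rows; the column names `smoker` and `region` are assumptions for illustration, so check the actual columns with `df.info()` first.

```python
import pandas as pd

# Hypothetical rows; column names are assumed for illustration
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southwest"],
})

# One 0/1 column per category; drop_first=True drops one level per column
# so that the remaining columns are not perfectly collinear
encoded = pd.get_dummies(toy, columns=["smoker", "region"], drop_first=True, dtype=int)
print(encoded)
```

Unlike label encoding, one-hot encoding does not impose an order on categories such as regions, which suits linear regression when the categories have no natural ranking.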



## Data preparation 


1.   Prepare X, y
2.   Prepare X_train, X_test, y_train, y_test (hint: using `train_test_split`)



## Model implementation

1. Try with the original data. What's the performance?
2. Let's add data normalisation. Has the performance been improved?
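
One possible way to set up this comparison is sketched below on synthetic data (the data, coefficients and feature ranges are made up). It uses `StandardScaler` inside a pipeline as one choice of normalisation; other scalers such as `MinMaxScaler` could be swapped in. Putting the scaler in a pipeline means it is fitted on the training set only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on different scales plus random noise
rng = np.random.default_rng(0)
X_syn = np.column_stack([rng.uniform(18, 65, 200), rng.uniform(15, 45, 200)])
y_syn = 250 * X_syn[:, 0] + 300 * X_syn[:, 1] + rng.normal(0, 1000, 200)

Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1)

# Model 1: original features
plain = LinearRegression().fit(Xtr, ytr)

# Model 2: features standardised (zero mean, unit variance) before the regression
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(Xtr, ytr)

r2_plain = r2_score(yte, plain.predict(Xte))
r2_scaled = r2_score(yte, scaled.predict(Xte))
print(f"R-squared without scaling: {r2_plain:.4f}")
print(f"R-squared with scaling:    {r2_scaled:.4f}")
```

Run the same comparison on your own X and y, and think about why the two scores relate the way they do.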