# Introductory Applied Machine Learning (IAML) Coursework 1 - Semester 2, January 2021.

### Author: Nigel Goddard

## Important Instructions

#### It is important that you follow the instructions below carefully for things to work properly.

You need to set up and activate your environment as you would for your labs; see the Learn section on Labs.  **You will need to use Noteable to create one of the files you will submit (the PDF)**.  Do **NOT** create the PDF in some other way, as we will not be able to mark it.  If you want to develop your answers in your own environment, make sure you are using the same packages we are using by running the imports cell below.

Read the instructions in this notebook carefully, especially where asked to name variables with a specific name. Wherever you are required to produce code you should use code cells, otherwise you should use markdown cells to report results and explain answers. In most cases we indicate the nature of answer we are expecting (code/text), and also provide the required code/markdown cell.

The .csv files that you will be using are located in the ./datasets directory that is included in the git repository with this file.

Keep your answers brief and concise. Most written questions can be answered in 2-3 lines, a few will take longer.

Make sure to distinguish between attributes (columns of the data) and features (which typically refers only to the independent variables, i.e. excluding the target variables).

Make sure to show all your code/working.

Write readable code. While we do not expect you to follow PEP8 to the letter, the code should be adequately understandable, with plots/visualisations correctly labelled. Do use inline comments when doing something non-standard. When asked to present numerical values, make sure to represent real numbers in the appropriate precision to exemplify your answer. 

You will see <html>\\pagebreak</html> at the start of each subquestion.  ***Do not remove these; if you do, we will not be able to mark your coursework.***

### SUBMISSION Process
This assignment will account for 20% of your final mark. We ask you to submit answers to all questions.

You will submit a PDF of your Notebook, and the Notebook itself.  Your grade will be based on the PDF, we will only use the Notebook if we need to see details.  **You must use the following procedure to create the materials to submit and then submit them**.

1. Make sure your Notebook and the datasets are in Noteable and will run.  If you developed your answers in Noteable, this is already done.  If you developed your answers in your own environment, you will need to upload your Notebook to Noteable and make sure it runs without errors.

2. Select **Kernel->Restart & Run All** to create a clean copy of your submission; this will run the cells in order from top to bottom.  This may take a while (minutes) to complete. Make sure that all the output and plots have completed before you proceed, by waiting for the last cell's banner message to be printed.

3. Select **File->Download as->PDF via LaTeX (.pdf)** and wait for the PDF to be created and downloaded.

4. Select **File->Download as->Notebook (.ipynb)**

5. You now should have in your download folder the pdf and the notebook.  Rename them ***sNNNNNNN.pdf*** and ***sNNNNNNN.ipynb***, where sNNNNNNN is your matriculation number (student number), e.g. s1234567.

6. Now submit **the PDF** to Gradescope on Learn.  There is video guidance on Learn (***Assessment->Assignment Submission***) on how to do this.  It is **very important** that during Gradescope submission you **indicate which pages of your PDF correspond to which of the questions**. The video gives guidance on this; you can tick multiple pages for each question.

7. Finally submit the Notebook itself (named as indicated in **5** above) to Learn.  You do this at ***Assessment->Assignment Submission->Assignment 1 - Submit your Notebook***

The submission deadline for this assignment is **9th February 2021 at 16:00 UK time (UTC)**.  Don't leave it to the last minute!

#### IMPORTS
Execute the cell below to import all packages you will be using for this assignment.  If you are not using Noteable, make sure the python and package version numbers reported match the python and package numbers specified in the comment at the end of this cell.

In [None]:
import os
import platform
import sys
import sklearn
import numpy as np
np.random.seed(260393)
import pandas as pd
import seaborn as sns
import matplotlib as mp
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

import warnings 
warnings.filterwarnings('ignore')

print("All packages imported!")
print("python=={}".format(platform.python_version()))
print("seaborn=={}".format(sns.__version__))
print("scikit-learn=={}".format(sklearn.__version__))
print("pandas=={}".format(pd.__version__))
print("numpy=={}".format(np.__version__))
print("matplotlib=={}".format(mp.__version__))

# You should see this output:
# All packages imported!
# python==3.7.6
# seaborn==0.11.0
# scikit-learn==0.23.2
# pandas==1.1.4
# numpy==1.19.4
# matplotlib==3.2.2

\pagebreak

# Question 1 Linear Regression

#### 88 marks out of 190 for this coursework

### House Prices Dataset

The aim of this task is to predict house prices in Melbourne, Victoria, Australia using Linear Regression. The dataset consists of [historic data of houses sold over several years](https://www.kaggle.com/anthonypino/melbourne-housing-market).


***Attribute description:***

- Suburb: Suburb
- Address: Address
- Rooms: Number of bedrooms
- Price: Price in Australian dollars
- Method: sale method (we won't use this)
- Type: h (house), u (apartment), t (townhouse)
- SellerG: Real Estate Agent
- Date: Date sold
- Distance: Distance from Central Business District in kilometres
- Regionname: General Region
- Propertycount: Number of properties that exist in the suburb.
- Bedroom2 : Scraped # of Bedrooms (from different source)
- Bathroom: Number of Bathrooms
- Car: Number of carspots
- Landsize: Land Size in Sq. Metres
- BuildingArea: Building Size in Sq. Metres
- YearBuilt: Year the house was built
- CouncilArea: Governing council for the area
- Lattitude: Latitude
- Longtitude: Longitude


\pagebreak

# ========== Question 1.1 --- [21 marks] ==========

Answer (in brief) the following questions:

1. [Text] How do regression and classification tasks differ in Machine Learning? Give an example of each.
2. [Code] Read in the Melbourne_housing.csv data and name it ***aushouse***. Use functions to describe the dataset.
3. [Text] Comment on the characteristics of the attributes just displayed, including where data might be missing. Describe two different strategies for dealing with the missing data.
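
As a hedged illustration of what "functions to describe the dataset" can look like, the sketch below inspects a small made-up data frame rather than the coursework data. The frame `toy` and its columns are hypothetical; only standard pandas calls are used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, NOT the coursework data.
toy = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "type": ["h", "u", "h", "t"],
    "area": [80.0, np.nan, 120.0, 95.0],
})

toy.info()                              # column names, dtypes and non-null counts
summary = toy.describe(include="all")   # summary statistics for every column
missing = toy.isna().sum()              # number of missing values per column

print(summary)
print(missing)
```

Comparing the non-null counts with the number of rows is one quick way to spot columns with missing values.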

(1) How do regression and classification tasks differ in Machine Learning? Give an example of each.

(1) ***Your answer goes here:***

(2) Read in the Melbourne housing data and name it ***aushouse***. Use functions to describe the dataset.

In [None]:
#(2) # your code goes here

(3) Comment on the characteristics of the attributes just displayed, including where data might be missing. Describe two different strategies for dealing with the missing data.

(3) ***Your answer goes here:***

\pagebreak

# ========== Question 1.2 --- [10 marks] ==========

Answer (in brief) the following questions:

1. [Text]B One feature that could well be important in predicting price is the Type of property (house, apartment or townhouse).  What type of feature is this, and why can't we use it in linear regression?  Describe a method for using information from this type of feature in linear regression.
2. [Code]B Convert the Type feature using the method you described above. Remove features which cannot be used for linear regression, and the YearBuilt feature which we won't use.

(1) One feature that could well be important in predicting price is the Type of property (house, apartment or townhouse).  What type of feature is this, and why can't we use it in linear regression?  Describe a method for using information from this type of feature in linear regression.

(1) ***your answer goes here:***

(2) Convert the Type feature using the method you described above. Remove features which cannot be used for linear regression, and the YearBuilt feature which we won't use.

In [None]:
# (2) # Your code goes here

\pagebreak

# ========== Question 1.3 --- [13 marks] ==========

Answer (in brief) the following questions:

1. [Code] Show a correlation heatmap with values.
2. [Text] Comment on what you see in the correlation heatmap values, with regard to this task.  Which features could you drop, and why?
3. [Code] Drop the features you've identified, and remove any instances with missing attribute values.
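
The sketch below shows, on a made-up frame, the kind of mechanics these steps involve: computing a correlation matrix, dropping a column, and removing rows with missing values. The frame `toy`, its columns and the choice of which column to drop are all assumptions for illustration only. A matrix like `corr` can be passed to seaborn's `heatmap` with `annot=True` to display the values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, NOT the coursework data.
toy = pd.DataFrame({
    "rooms":    [1, 2, 3, 4, 5, 6],
    "bedrooms": [1, 2, 3, 4, 5, 6],            # exact duplicate of "rooms"
    "landsize": [300.0, np.nan, 250.0, 400.0, 350.0, 500.0],
    "price":    [400, 520, 610, 700, 830, 900],
})

corr = toy.corr()   # pairwise Pearson correlations, NaNs ignored pairwise
print(corr.round(2))

# Drop one of two perfectly correlated columns, then rows with missing values.
reduced = toy.drop(columns=["bedrooms"]).dropna()
print(reduced.shape)
```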

(1) Show a correlation heatmap with values.

In [None]:
# (1) # Your code goes here

(2) Comment on what you see in the correlation heatmap values, with regard to this task.  Which features could you drop, and why?

(2) ***your answer goes here***

(3) Drop the features you have identified, and remove any instances with missing attribute values.

In [None]:
# (3) # your code goes here

\pagebreak

# ========== Question 1.4 --- [20 marks] ==========

Answer (in brief) the following questions:

1. [Code] Create ***X*** and ***y*** from <html><var>aushouse</var></html> and then use train_test_split to create training and test sets, with the testing set being 20% of the entire data. Set the random_state to 0 for reproducibility.
Hint: Look at the Lab exercises for an example.
2. [Code] Fit a LinearRegression to the training set, print the intercept and a DataFrame showing the coefficient of each attribute.
3. [Text] Describe the meaning of the intercept and the coefficients. Comment on the coefficients, including their size, and what they tell us about the relationship between the features and the price. Are there any coefficients with surprising values?
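
For orientation, here is a minimal sketch of the split-and-fit workflow on synthetic data. Everything in it (the frame `toy`, the columns `rooms` and `distance`, and the noise-free price formula) is an assumption made for illustration and is unrelated to the coursework data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a cleaned frame (hypothetical columns and values).
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "rooms": rng.randint(1, 6, size=100),
    "distance": rng.uniform(1.0, 30.0, size=100),
})
# Noise-free target, so the fit should recover 50, 100 and -5.
toy["price"] = 50.0 + 100.0 * toy["rooms"] - 5.0 * toy["distance"]

X_toy = toy.drop(columns="price")   # features
y_toy = toy["price"]                # target

# Hold out 20% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_tr, y_tr)
coef_table = pd.DataFrame({"coefficient": model.coef_}, index=X_toy.columns)

print("intercept:", model.intercept_)
print(coef_table)
```

Because the toy target is an exact linear function of the features, the printed intercept and coefficients match the values used to generate it; real data will not behave this neatly.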

(1) Create ***X*** and ***y*** from <html><var>aushouse</var></html> and then use train_test_split to create training and test sets, with the testing set being 20% of the entire data. Set the random_state to 0 for reproducibility.

In [None]:
#(1) # your code goes here

(2) Fit a LinearRegression to the training set, print the intercept and a DataFrame showing the coefficient of each attribute.

In [None]:
#(2) # your code goes here

(3) Describe the meaning of the intercept and the coefficients. Comment on the coefficients, including their size, and what they tell us about the relationship between the features and the price.  Which house properties affect the price most?  Does the type of property matter much?

(3) ***your answer goes here***

\pagebreak

# ========== Question 1.5 --- [24 marks] ==========

Answer (in brief) the following questions:

1. [Code] Print the Root Mean Squared Error (RMSE) and <html><var>R<sup>2</sup></var></html>.
2. [Text] Explain the meaning and output of the RMSE and <html><var>R<sup>2</sup></var></html>. What do they tell us about the fit of the data?
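
The two metrics can be computed as in the sketch below, shown on four made-up values so the arithmetic can be checked by hand. The arrays `y_true` and `y_pred` are hypothetical. RMSE is taken as the square root of scikit-learn's mean squared error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 0) / 4 = 0.5
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot = 1 - 2 / 20

print("RMSE: {:.4f}".format(rmse))
print("R^2:  {:.4f}".format(r2))
```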

(1) Print the Root Mean Squared Error (RMSE) and <html><var>R<sup>2</sup></var></html>.

In [None]:
# (1) # Your code goes here

(2) Explain the meaning and output of the RMSE and <html><var>R<sup>2</sup></var></html>. What do they tell us about the fit of the data?

(2) ***Your answer goes here:***

\pagebreak

# Question 2 Naive Bayes

#### 102 marks out of 190 for this coursework

### Income dataset

The aim of this task is to predict whether a United States person has income over \$50,000. The dataset is derived from the [1994 US census data](https://archive.ics.uci.edu/ml/datasets/Adult).

***Attribute description:***

attributes in _italics_ are ones we will not use

- _age: age in years_
- work: working status
- _fnlwgt: weighting factor_
- edu: education level
- _edunum: years of education_
- marit: marital-status
- occ: occupation
- rel: relationship status
- race: race.
- sex: sex
- _cg: capital-gains_
- _cl: capital-losses_
- _hours: hours of work per week_
- over50k: income over $50,000


\pagebreak

# ========== Question 2.1 --- [16 marks] ==========

Answer (in brief) the following questions:  

1. [Text] Why is the Naive Bayes method called that?  What is "naive" about it and what is Bayesian about it?
2. [Code] Read in the income data (income.csv) and name it income.  Remove attributes we won't use (in _italics_ in the list above, i.e. _age, fnlwgt, edunum, cg, cl, hours_), and remove instances with missing data.
3. [Code] Use a library function to show the attributes, their type, and how many there are of each.
4. [Code] We'll use the integer attribute as the class to predict.  How many classes are there, and what integer values do they have?
5. [Code] Use another library function to show details about each of the other attributes (the features), including the frequency of the most prevalent category.

(1) Why is the Naive Bayes method called that?  What is "naive" about it and what is Bayesian about it?

***your answer goes here***

(2) Read in the income data (income.csv) and name it income.  Remove attributes we won't use (in _italics_ in the list above), and remove instances with missing data.

In [None]:
#(2) # Your code goes here

(3) Use a library function to show the attributes, their type, and how many there are of each.

In [None]:
#(3) # Your code goes here

(4) We'll use the integer attribute as the class to predict.  How many classes are there, and what integer values do they have?

In [None]:
#(4) # Your code goes here

(5) Use another library function to show details about each of the other attributes (the features), including the frequency of the most prevalent category.

In [None]:
#(5) # Your code goes here

\pagebreak

# ========== Question 2.2 --- [16 marks] ==========

Answer (in brief) the following questions:  

1. [Code] List the feature names.  
2. [Code] Use seaborn functions to show a bar chart for each of the features of the number of instances with each attribute value, with distinct counts for each target class shown side by side.
3. [Text] Comment on the plots you've created. Are there rare categories? Do the features look like they will be good for the classification task?

(1) List the feature names. 

In [None]:
#(1) # Your code goes here

(2) Use seaborn functions to show a bar chart for each of the features of the number of instances with each attribute value, with distinct counts for each target class shown side by side.

In [None]:
#(2) # Your code goes here

(3) Comment on the plots you've created.  Are there rare categories? Do the features look like they will be good for the classification task?

***your answer goes here***

\pagebreak

# ========== Question 2.3 --- [12 marks] ==========

Answer the following questions: 
1. [Code] Set **target_encoded** to be the array of class values, and show the values for some of the instances.
2. [Code] Use OrdinalEncoder to transform the categorical feature values to numeric values
3. [Code] Store the encodings in a data frame called income_encoded.  Show the feature values for some of the instances. 
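
As a rough sketch of how `OrdinalEncoder` behaves, the example below encodes a small made-up frame. The frame `toy` and its columns are hypothetical. Each column's categories are sorted and mapped to 0, 1, 2, ..., and the encoder returns a float array, which is wrapped back into a data frame here.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical frame, NOT the coursework data.
toy = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],
    "size": ["S", "L", "M", "S"],
})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy)   # 2-D float array, one column per feature

# Keep the original column names and index so rows stay aligned.
toy_encoded = pd.DataFrame(codes, columns=toy.columns, index=toy.index)

print(encoder.categories_)   # learned category order for each column
print(toy_encoded.head())
```

Note that the integer order is alphabetical, not meaningful, so the codes should be treated as labels rather than magnitudes.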

(1) Set ***target_encoded*** to be the array of class values, and show the values for ten of the instances.

In [None]:
#(1) # Your code goes here

(2) Use OrdinalEncoder to transform the categorical feature values to numeric values

In [None]:
#(2) # Your code goes here

(3) Store the encodings in a data frame called income_encoded.  Show the feature values for the first five instances. 

In [None]:
#(3) # Your code goes here

\pagebreak

# ========== Question 2.4 --- [22 marks] ==========

Answer the following questions: 
1. [Code]  Set ***X*** to be the data frame of independent variables, and ***y*** the array of dependent variables. Split the data into training and test sets using train_test_split, with a testing fraction of 20%, and setting the random state to zero for consistency.
2. [Code] A simple baseline for classification tasks is to always predict the most common class.  Create an array of predictions according to this baseline, and show the following performance statistics: number of misclassified instances, accuracy, F1, precision, recall.
3. [Code]  Create a confusion matrix and display it as an annotated heatmap (use sns.heatmap).
4. [Text] Comment on what you see in the statistics from (2) and the confusion matrices in (3).
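
The following sketch shows one way a majority-class baseline and its statistics could be computed, using a made-up label array `y_toy` rather than the coursework targets. The `zero_division=0` argument is an assumption made here to avoid warnings when a class is never predicted.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: six instances of class 0, two of class 1.
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 0])

# Most common class = class with the highest prior probability.
values, counts = np.unique(y_toy, return_counts=True)
majority = values[np.argmax(counts)]
baseline_pred = np.full_like(y_toy, majority)

n_wrong = int((baseline_pred != y_toy).sum())
acc = accuracy_score(y_toy, baseline_pred)
# Class 1 is never predicted, so precision is undefined; report it as 0.
prec = precision_score(y_toy, baseline_pred, zero_division=0)
rec = recall_score(y_toy, baseline_pred, zero_division=0)
f1 = f1_score(y_toy, baseline_pred, zero_division=0)

cm = confusion_matrix(y_toy, baseline_pred)                        # raw counts
cm_norm = confusion_matrix(y_toy, baseline_pred, normalize="true")  # rows sum to 1

print(majority, n_wrong, acc, prec, rec, f1)
print(cm)
print(cm_norm)
```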

(1) Set ***X*** to be the data frame of independent variables, and ***y*** the array of dependent variables. Split the data into training and test sets using ***train_test_split***, with a testing fraction of 20%, and setting the random state to zero for consistency.

In [None]:
#(1) # Your code goes here

(2) A simple baseline for classification tasks is to always predict the most common class in the training set (highest prior probability).  Print the class with the highest prior probability.  Create an array of predictions on the training set you just made, according to this baseline, and show the following performance statistics: number of misclassified instances, accuracy, F1, precision, recall.

In [None]:
#(2) # Your code goes here

(3) Create a confusion matrix comparing the training targets with the baseline predictions, and also the normalised confusion matrix.  Display the confusion matrix.  Also display the normalised confusion matrix as an annotated heatmap. ***Hint:*** Use the plot_confusion_matrix function from the labs for the heatmap.

In [None]:
#(3) # Your code goes here

(4) Comment on what you see in the statistics from (2) and the confusion matrices in (3)

(4) ***your answer goes here***

\pagebreak

# ========== Question 2.5 --- [22 marks] ==========

Answer the following questions: 
1. [Code] Train a categorical Naive Bayes classifier on the training data you made.
2. [Code] Report the classifier's accuracy, precision and recall and F1 on the **training** dataset.  Also report the confusion matrix and the normalised confusion matrix for the result.
3. [Text] Interpret the values of the accuracy, F1, precision and recall. Comment on the performance of the model, comparing to the baseline. Is the accuracy a reasonable metric to use for this dataset? Interpret the numbers in the confusion matrix. Does it look like you would expect to find in a "good" classifier?
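
A minimal sketch of fitting `CategoricalNB` and predicting on the same data is given below. The arrays `X_toy` and `y_toy` are hypothetical, already integer-coded, and deliberately trivial (the class equals the first feature), so the result says nothing about the coursework data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import CategoricalNB

# Hypothetical data: two ordinal-encoded categorical features.
X_toy = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 0]])
y_toy = np.array([0, 0, 1, 1, 0, 1])   # class equals feature 0

clf = CategoricalNB()            # default alpha=1.0 (Laplace smoothing)
clf.fit(X_toy, y_toy)

train_pred = clf.predict(X_toy)  # predictions on the data used for fitting
acc = accuracy_score(y_toy, train_pred)

print(train_pred)
print("training accuracy:", acc)
```

`CategoricalNB` expects each feature to be coded as non-negative integers, which is what an ordinal encoding provides.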

(1) Train a categorical Naive Bayes model on the training data you made, and compute its predictions on the same training data.

In [None]:
# (1) Your code goes here

(2) Report the classifier's accuracy, precision and recall and F1 on the training dataset.  Also report the confusion matrix and the normalised confusion matrix for the result.

In [None]:
# (2) # Your code goes here

(3) Interpret the values of the accuracy, F1, precision and recall. Comment on the performance of the model, comparing to the baseline. Is the accuracy a reasonable metric to use for this dataset? Interpret the numbers in the confusion matrix. Does it look like you would expect to find in a "good" classifier?

***your answer goes here***

\pagebreak

# ========== Question 2.6 --- [14 marks] ==========

Answer the following questions: 
1. [Code] Now evaluate the classifier on the testing data you made.  Report the classifier's accuracy, precision and recall and F1 on the testing dataset.  Also report the confusion matrix and the normalised confusion matrix for the result.
2. [Text] In a short paragraph (2-3 sentences) compare and comment on the results with the training data.
3. [Text] Since the categorical data has been encoded as numbers, we could now train a Gaussian Naive Bayes (GNB) classifier on the data.  Would you expect the GNB to perform better or worse than the categorical Naive Bayes (CNB) classifier, and why?

(1) Now evaluate the classifier on the testing data you made.  Report the classifier's accuracy, precision and recall and F1 on the testing dataset.  Also report the confusion matrix and the normalised confusion matrix for the result.

In [None]:
# (1) # Your code goes here

(2) In a short paragraph (2-3 sentences) compare and comment on the results with the training data.

(2) ***your answer goes here***

(3) Since the categorical data has been encoded as numbers, we could now train a Gaussian Naive Bayes (GNB) classifier on the data.  Would you expect the GNB to perform better or worse than the categorical Naive Bayes (CNB) classifier, and why?

(3) ***your answer goes here***

\pagebreak

In [None]:
# This cell's output will confirm all cells have been run if you select Kernel->Restart & Run All.
# Wait until you see the output printed
print("*****************************")
print("*                           *")
print("* All cells have been run!! *")
print("*                           *")
print("*****************************")