# Assessed Task 1

## 1 Dataset

In this task, you will have the opportunity to use the `KNeighborsRegressor` to predict a car's market price using its attributes. 

This data set consists of three types of entities: 
* the specification of an auto in terms of various characteristics;
* its assigned insurance risk rating;
* its normalized losses in use as compared to other cars. 

The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe. 

## 2 Attribute Information

1. symboling: -3, -2, -1, 0, 1, 2, 3. 
2. normalized-losses: continuous from 65 to 256. 
3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo. 
4. fuel-type: diesel, gas. 
5. aspiration: std, turbo. 
6. num-of-doors: four, two. 
7. body-style: hardtop, wagon, sedan, hatchback, convertible. 
8. drive-wheels: 4wd, fwd, rwd. 
9. engine-location: front, rear. 
10. wheel-base: continuous from 86.6 to 120.9. 
11. length: continuous from 141.1 to 208.1. 
12. width: continuous from 60.3 to 72.3. 
13. height: continuous from 47.8 to 59.8. 
14. curb-weight: continuous from 1488 to 4066. 
15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor. 
16. num-of-cylinders: eight, five, four, six, three, twelve, two. 
17. engine-size: continuous from 61 to 326. 
18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi. 
19. bore: continuous from 2.54 to 3.94. 
20. stroke: continuous from 2.07 to 4.17. 
21. compression-ratio: continuous from 7 to 23. 
22. horsepower: continuous from 48 to 288. 
23. peak-rpm: continuous from 4150 to 6600. 
24. city-mpg: continuous from 13 to 49. 
25. highway-mpg: continuous from 16 to 54. 
26. price: continuous from 5118 to 45400.

## 3 Relevant Papers

Kibler, D., Aha, D.W., & Albert, M. (1989). Instance-based prediction of real-valued attributes. Computational Intelligence, Vol 5, 51-57. 

## 4 Import Data

**Q1**: Upload the dataset of `Cars.data`. Import the dataset into a data frame named **cars**. Please also name each attribute (see *Attribute Information* section above).

In [0]:
import pandas as pd

# The file has no header row, so header=None keeps the first record as data.
df = pd.read_csv('Cars.data', sep=",", header=None)
# Name the 26 attributes in the order given in the Attribute Information section.
df.columns = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width","height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"]


Upload the data, read the file with pandas, and store it in a dataframe named **df**.

Then name each attribute by assigning the list of names to the *pandas.DataFrame.columns* attribute.
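As a side note, the naming step can also be folded into the read itself. The sketch below is only an illustration under stated assumptions: it uses a hypothetical two-row, four-column sample held in memory rather than the real `Cars.data` file, and the variable names are made up for the example. `names=` and `na_values=` are standard `pandas.read_csv` parameters.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same comma-separated layout,
# truncated to four columns for brevity (not the real file).
sample = io.StringIO("3,?,alfa-romero,gas\n1,164,audi,gas\n")
column_names = ["symboling", "normalized-losses", "make", "fuel-type"]

# names= assigns the column names while reading;
# na_values="?" turns the "?" placeholder into NaN straight away.
cars_sample = pd.read_csv(sample, header=None, names=column_names, na_values="?")

print(cars_sample)
print(cars_sample.dtypes)
```

With `na_values="?"`, a column such as `normalized-losses` is parsed as a float column from the start instead of as an object column.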

**Q2**: Select only the columns with continuous values imported into a new data frame named **numeric_cars** and report the first five instances.

Note: The first attribute 'symboling' will be treated as a categorical (discrete) variable.

In [0]:
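import numpy as np

# Mark the "?" placeholders as missing values.
df = df.replace('?',np.nan)
# Columns that contained "?" were read as objects, so select them via their NaNs,
# then add the columns pandas already parsed as numeric.
numeric_cars = pd.concat([df[df.columns[df.isnull().any()]], df.select_dtypes(include=[np.number])], axis=1, join='inner').drop(columns=['num-of-doors','symboling'])
numeric_cars.head()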

![Output](https://i.imgur.com/duAByZX.png)

Every attribute that the *Attribute Information* section lists as continuous should be of a *numpy.number* type. However, the continuous columns that contain *?* are read as object type and cannot be selected with *.select_dtypes()*.

Therefore, we first replace *?* with NaN using *.replace()*. We then concatenate the columns that contain NaN with the numeric columns, drop the *num-of-doors* and *symboling* columns since they aren't continuous, and save the result in a dataframe called **numeric_cars**.
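An alternative way to reach the same kind of result is to list the continuous columns by hand and coerce them to numbers. This is only a sketch on a hypothetical miniature frame (the values and the `continuous_cols` list are assumptions made for the example), not the approach used in the answer above.

```python
import pandas as pd

# Hypothetical miniature frame: 'horsepower' holds strings because of the "?".
raw = pd.DataFrame({
    "make": ["audi", "bmw", "dodge"],
    "horsepower": ["102", "?", "68"],
    "city-mpg": [24, 23, 31],
})

# Continuous columns chosen by hand from the attribute list.
continuous_cols = ["horsepower", "city-mpg"]

# errors="coerce" converts anything that cannot be parsed (here "?") to NaN.
numeric_sample = raw[continuous_cols].apply(pd.to_numeric, errors="coerce")

print(numeric_sample)
print(numeric_sample.dtypes)
```

Listing the columns explicitly avoids relying on which columns happen to contain missing values, at the cost of maintaining the list by hand.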

## 5 Data Cleaning

**Q3**: Replace the "?" with "NaN" in the **numeric_cars** data.  
Hint: Refer to `np.nan`

This was already done in the answer to Q2, where *.replace()* was applied to **df** before the continuous columns were selected.

**Q4**: Set **numeric_cars** data type into "float" and report the number of null values for each attribute (column).

In [0]:
numeric_cars = numeric_cars.astype(float)
numeric_cars.isna().sum(axis = 0)

![Output](https://i.imgur.com/ooGcYjy.png)

Now that the **numeric_cars** dataframe contains only continuous columns, the *.astype()* method can convert it to 'float' type.

*.isna()* returns a dataframe of *True/False* values, which can be summed per column by setting the axis to 0.
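To see why summing the boolean frame gives a per-column count, here is a small sketch on a hypothetical three-row frame (the values are invented for the illustration). Each `True` counts as 1 when summed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy = pd.DataFrame({
    "bore": [3.47, np.nan, 3.19],
    "price": [13495.0, np.nan, np.nan],
})

# isna() marks missing cells as True; sum(axis=0) adds them up column by column.
null_counts = toy.isna().sum(axis=0)
print(null_counts)
```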

**Q5**: Because `price` is the column to predict, let's remove any rows with missing `price` values, if any. And again report the number of null values for each attribute.

In [0]:
numeric_cars = numeric_cars.dropna(axis=0, subset=['price'])
numeric_cars.isna().sum(axis = 0)

![Output](https://i.imgur.com/vqRo4to.png)

To remove the rows with a missing price, use the *.dropna()* method with the subset set to 'price'.

The number of null values can be reported just as before.
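The effect of `subset` is easiest to see on a small example. The frame below is hypothetical; the point of the sketch is that only rows with a missing `price` are removed, while missing values in other columns are left in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 lacks a price, row 2 lacks a bore value.
toy = pd.DataFrame({
    "bore": [3.47, 2.68, np.nan],
    "price": [13495.0, np.nan, 16500.0],
})

# Only the 'price' column is inspected when deciding which rows to drop.
kept = toy.dropna(axis=0, subset=["price"])
print(kept)
```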

**Q6**: Replace missing values in other columns using column means and check that there are no more missing values!

In [0]:
numeric_cars = numeric_cars.fillna(numeric_cars.mean())
numeric_cars.isna().sum(axis = 0)

*.fillna()* fills NaN entries with a specified value; here we pass the column means from *.mean()*, which are computed for each column automatically, so every NaN is replaced by the mean of its own column. Reporting the null counts again confirms that no missing values remain.
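A minimal sketch of mean imputation on a hypothetical two-column frame, including an explicit check that nothing is left missing. The numbers are invented so that the means are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: each column has one missing value.
toy = pd.DataFrame({
    "bore": [3.0, np.nan, 4.0],     # mean of the known values is 3.5
    "stroke": [2.0, 3.0, np.nan],   # mean of the known values is 2.5
})

# mean() skips NaN by default and returns one value per column;
# fillna() matches those values to the columns by name.
filled = toy.fillna(toy.mean())

# Total number of remaining missing cells across the whole frame.
remaining = int(filled.isna().sum().sum())
print(filled)
print("missing values left:", remaining)
```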

**Q7**: Normalise all columns to range from 0 to 1 except for the target column.  
Hint: To normalise your data, you can calculate the following:  
$z_i=(x_i-\min(x))/(\max(x)-\min(x))$  
where $x=(x_1,x_2,\cdots,x_n)$ and $z_i$ is the $i$-th normalised value.

In [0]:
# Min-max scale every column except the target 'price'.
features = numeric_cars.loc[:, numeric_cars.columns != 'price']
numeric_cars_norm = (features - features.min()) / (features.max() - features.min())
numeric_cars[numeric_cars_norm.columns] = numeric_cars_norm

The 'price' column is excluded using the *.loc* indexer, and the remaining columns are normalised with the given formula using *.min()* and *.max()*, which operate column by column.
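As a worked illustration of the formula, the sketch below scales one feature column of a hypothetical three-row frame and leaves `price` untouched. The values are chosen so the result can be checked by hand: with a minimum of 60 and a maximum of 72, the value 66 maps to (66 - 60) / (72 - 60) = 0.5.

```python
import pandas as pd

# Hypothetical toy frame (values invented for the illustration).
toy = pd.DataFrame({
    "width": [60.0, 66.0, 72.0],
    "price": [5000.0, 10000.0, 20000.0],
})

feature_cols = toy.columns.drop("price")
features = toy[feature_cols]

# Min-max scaling: subtract the column minimum, divide by the column range.
scaled = toy.copy()
scaled[feature_cols] = (features - features.min()) / (features.max() - features.min())
print(scaled)
```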

## 6 Univariate Model [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) with Fixed k Value

In [0]:
rmse_results = {}
train_cols = numeric_cars.columns.drop('price')
train_cols

![Output](https://i.imgur.com/k2j3Juh.png)

**Q8**: For each column (minus `price`), train a model, compute its RMSE value, and add it to the dictionary `rmse_results`. For how to calculate RMSE in Python, refer to this [link](https://stackoverflow.com/questions/45173451/scikit-learn-how-to-calculate-root-mean-square-error-rmse-in-percentage). You need to use `mean_squared_error` from `sklearn.metrics` and `np.sqrt`.

Note: 
1. Randomise the order of the rows in the data frame before splitting the data into training and test sets. Set the random seed to 1.
2. The first 80% of the data is used as the training set and the rest as the test set.
3. Fit a KNN model using the default k value.
4. Report the rmse_results in ascending order of RMSE in the following format:  
`horsepower    8000.0000`   
`highway-mpg   9000.0000`


In [0]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from collections import OrderedDict

numeric_cars_rand = numeric_cars.sample(n = numeric_cars.shape[0],random_state=1)
X = pd.DataFrame(numeric_cars_rand[train_cols])        
y = pd.DataFrame(numeric_cars_rand['price'])
X_train, X_test, y_train, y_test = train_test_split(X , y, test_size = 0.2, random_state=1)

The rows in **numeric_cars** are shuffled using the *.sample()* method, with 'n' set to the number of rows in the dataframe and 'random_state' set to 1.

Now we divide the dataframe into the input variables **X** (every column except 'price') and the output variable **y** (the 'price' column).

These can now be divided into their respective training and test sets using the *train_test_split()* function from the *sklearn.model_selection* module, with 'test_size' set to 0.2 (20%).
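Note that *train_test_split()* shuffles the rows again by default, so the training set above is not literally the first 80% of the shuffled frame. A more literal reading of note 2 could be sketched as follows. This is an assumption about what the note intends, shown on a hypothetical ten-row frame, and it would generally give different RMSE values from the ones reported in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with ten rows (not the real data).
toy = pd.DataFrame({
    "horsepower": np.arange(10, dtype=float),
    "price": np.arange(10, dtype=float) * 1000,
})

# Shuffle all rows once with a fixed seed.
shuffled = toy.sample(frac=1, random_state=1)

# Take the first 80% of the shuffled rows for training, the rest for testing.
split_at = int(len(shuffled) * 0.8)
train = shuffled.iloc[:split_at]
test = shuffled.iloc[split_at:]

print(len(train), "training rows,", len(test), "test rows")
```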


In [0]:
for column in X_test:
  reg = KNeighborsRegressor().fit(X_train[[column]],y_train)
  y_pred = pd.DataFrame(reg.predict(X_test[[column]]))
  rmse_results[column] = np.sqrt(mean_squared_error(y_test,y_pred))
  
rmse_results = OrderedDict(sorted(rmse_results.items(), key=lambda t: t[1]))
for column in rmse_results:
  print("{} {:.4f}".format(column,round(rmse_results[column])))

![Output](https://i.imgur.com/IqDvkzE.png)

Now fit a model on each training column using *KNeighborsRegressor().fit()*, making sure to import it from the *sklearn.neighbors* module.

Predict the output values using the *.predict()* method and compare them with the test values to get the mean squared error using *mean_squared_error()* from the *sklearn.metrics* module, then take the square root with *numpy.sqrt* to obtain the RMSE.
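To make the two steps concrete, here is a small sketch on hypothetical data. The first part shows that a default `KNeighborsRegressor` (k = 5) predicts the average target of the five nearest training points; the second part checks that the RMSE computed by hand agrees with the `mean_squared_error` route. All arrays are invented for the illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical single-feature training data.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0])

knn = KNeighborsRegressor()  # default n_neighbors is 5
knn.fit(X_toy, y_toy)

# The five nearest neighbours of 2.0 are 0, 1, 2, 3 and 4,
# so the prediction is the mean of their targets.
pred = knn.predict(np.array([[2.0]]))

# RMSE by hand versus via mean_squared_error, on hypothetical prices.
y_true = np.array([10000.0, 12000.0, 15000.0])
y_hat = np.array([11000.0, 12000.0, 13000.0])
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(pred, rmse_manual, rmse_sklearn)
```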

Finally, sort **rmse_results** by value into a *collections.OrderedDict* and print each entry.
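An alternative for the sorting and printing step is to wrap the dictionary in a *pandas.Series*, which sorts by value and aligns the output in columns. The sketch below uses the placeholder numbers from the format example in Q8, not actual results.

```python
import pandas as pd

# Placeholder values taken from the format example, not real RMSE results.
toy_results = {"highway-mpg": 9000.0, "horsepower": 8000.0}

# sort_values() orders the entries by RMSE in ascending order.
ordered = pd.Series(toy_results).sort_values()

# float_format controls how many decimals are printed.
print(ordered.to_string(float_format="{:.4f}".format))
```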