<a id = 'top'></a>
### HANDWRITTEN DIGIT CLASSIFICATION MODEL
 - [PROJECT AIM ](#aim)
 - [DATA](#data)
    - [loading the data](#load)
    - [normalising and splitting the data into train and test sets](#nsplit)
 - [LOGISTIC REGRESSION MODEL](#log)
 - [SVM CLASSIFIER](#svm)
 - [KNN CLASSIFIER MODEL](#knn)
 - [DECISION TREE CLASSIFIER MODEL](#dtree)
 - [RANDOM FOREST CLASSIFIER MODEL](#rforest)
 - [Conclusion & Report](#conclusion)
---------------------------------------------------------------------[BACK TO TOP](#top)

In [1]:
# importing required libraries
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import plotly.express as px
import sklearn
import time,math,random

<a id = 'aim'></a>
###  Aim
1. develop a classification model that recognises handwritten digits from 0 to 9
2. tune the hyperparameters of each candidate model
3. find the model that performs best on our data
- the model takes any 8x8 greyscale image of a handwritten digit, flattened into a
  64-element vector, and assigns it to one of 10 classes (the digits 0 to 9)
<a id = 'data'></a>

### Data
- we will load the scikit-learn digits dataset, which stores
- the pixel intensities of 1797 greyscale images, each of size 8x8, as a 1797 x 64 numpy array;
  each 8x8 image is flattened into a 64-element vector
- each pixel intensity is an integer between 0 and 16 (not the full 8-bit range of 0 to 255)
- the size of the input feature dataset is 1797 x 64
- the size of the target labels dataset is 1797 (one label per image)
- the data will be split into training and test datasets
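
A quick inspection can confirm the shapes and value range of the dataset. This is a minimal sketch; the names `digits`, `X`, `y` and `first_image` are illustrative and are not used elsewhere in the notebook.

```python
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

print(X.shape, y.shape)          # feature matrix and label vector
print(X.min(), X.max())          # range of the pixel intensities
print(sorted(set(y.tolist())))   # the distinct class labels

# Each row can be reshaped back into its original 8x8 image.
first_image = X[0].reshape(8, 8)
print(first_image.shape)
```

Reshaping a row to 8x8 is also the form needed to display a digit with `plt.imshow`.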

---------------------------------------------------------------------[BACK TO TOP](#top)

<a id = 'load'></a>
#### loading the data

In [2]:
from sklearn.datasets import load_digits
data = load_digits()
xdata = data['data']
ydata = data['target']
print(xdata.shape,ydata.shape)

(1797, 64) (1797,)


<a id = 'nsplit'></a>
#### normalizing and splitting the data into train and test sets
- the features are pixel intensities, which in `load_digits` range from 0 to 16
- the cell below divides by 255, which maps them to roughly [0, 0.063]; dividing by 16 would map them to [0, 1]
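
The effect of the divisor can be checked on a small synthetic example. The sketch below assumes intensities in the range 0 to 16, as in scikit-learn's digits data; the array `pixels` and the helper `scale_by` are hypothetical stand-ins and not part of the notebook.

```python
import numpy as np

# Hypothetical stand-in for one flattened 8x8 image with intensities in 0..16.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 17, size=64).astype(float)

def scale_by(x, max_value):
    """Divide by an assumed maximum intensity."""
    return x / max_value

scaled_16 = scale_by(pixels, 16.0)    # maps 0..16 onto [0, 1]
scaled_255 = scale_by(pixels, 255.0)  # maps 0..16 onto [0, 16/255]

print(scaled_16.min(), scaled_16.max())
print(scaled_255.min(), scaled_255.max())
```

Either divisor is a constant rescaling, so it does not change the ordering of distances between samples, but it does shift the useful range of scale-sensitive hyperparameters such as `C` and `gamma`.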

In [6]:
# scale the pixel intensities by a constant factor (all results below use 255)
xdata = xdata/255 
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(xdata,ydata,test_size = .1)
print(xtrain.shape,ytrain.shape)
print(xtest.shape,ytest.shape)

(1617, 64) (1617,)
(180, 64) (180,)
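
The split above is drawn at random on every run, so the reported numbers can vary slightly between runs. A minimal sketch of a reproducible, class-balanced variant, assuming a fixed `random_state` of 0 and `stratify` on the labels (neither is used in the notebook):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class proportions
# roughly equal in the train and test parts.
xtr, xte, ytr, yte = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)
print(xtr.shape, ytr.shape)
print(xte.shape, yte.shape)
```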


<a id = 'log'></a>
### LOGISTIC REGRESSION
---------------------------------------------------------------------[BACK TO TOP](#top)

In [16]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression()
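# 'newton-cg' and 'lbfgs' support only the l2 penalty, so the grid is split
# into two parts to avoid fitting invalid penalty/solver combinations
para = [{'C':np.logspace(-3,1,10),  # inv of regularisation constant
         'penalty':['l1'],          # type of regularisation
         'solver':['liblinear', 'saga']},
        {'C':np.logspace(-3,1,10),
         'penalty':['l2'],
         'solver':['newton-cg', 'lbfgs' , 'liblinear', 'saga']}]
from sklearn.model_selection import GridSearchCV
logcv = GridSearchCV(log,para,cv=9) # 9-fold CV: each fold is held out once while the other 8 are used for training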
logcv.fit(xtrain,ytrain)
print('the best set of hyper prameters are \n',logcv.best_params_)
print('the best validation score = ',logcv.best_score_)

yhat = logcv.predict(xtest)

from sklearn.metrics import classification_report,confusion_matrix
print('Error metrics on test set \n',classification_report(ytest,yhat))
px.imshow(pd.DataFrame(confusion_matrix(ytest,yhat),columns=list(range(10)),index=list(range(10)))
                       ,text_auto=True,title='Confusion matrix on test data',
          labels={'x':'predicted','y':'actual'})


The max_iter was reached which means the coef_ did not converge



the best set of hyper prameters are 
 {'C': 10.0, 'penalty': 'l1', 'solver': 'saga'}
the best validation score =  0.9585385198979238
Error metrics on test set 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       0.95      0.95      0.95        19
           2       1.00      1.00      1.00        14
           3       1.00      0.96      0.98        23
           4       0.95      0.95      0.95        20
           5       0.95      1.00      0.97        19
           6       1.00      0.92      0.96        13
           7       0.94      1.00      0.97        16
           8       0.85      1.00      0.92        17
           9       1.00      0.84      0.91        19

    accuracy                           0.96       180
   macro avg       0.96      0.96      0.96       180
weighted avg       0.96      0.96      0.96       180







<a id = 'svm'></a>
### SVM CLASSIFIER
---------------------------------------------------------------------[BACK TO TOP](#top)

In [17]:
from sklearn.svm import SVC
svm = SVC()
para = {'C':np.logspace(-3,3,10),
        'gamma':np.logspace(-3,3,10),
        'kernel':['rbf','sigmoid','linear','poly']}

svmcv = GridSearchCV(svm,para,cv=9) # train 8 folds cross validate 9th fold
svmcv.fit(xtrain,ytrain)
print('_____________________________________________________________________')
print('the best set of hyper prameters are \n',svmcv.best_params_)
print('the best validation score is = ',svmcv.best_score_)

yhat = svmcv.predict(xtest)

print('_____________________________________________________________')
print('Error metrics on test set \n',classification_report(ytest,yhat))
px.imshow(pd.DataFrame(confusion_matrix(ytest,yhat),columns=list(range(10)),index=list(range(10)))
                       ,text_auto=True,title='Confusion matrix on test data',
          labels={'x':'predicted','y':'actual'})

_____________________________________________________________________
the best set of hyper prameters are 
 {'C': 2.154434690031882, 'gamma': 46.41588833612773, 'kernel': 'rbf'}
the best validation score is =  0.9900993171942893
_____________________________________________________________
Error metrics on test set 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        19
           2       1.00      1.00      1.00        14
           3       1.00      1.00      1.00        23
           4       1.00      1.00      1.00        20
           5       1.00      1.00      1.00        19
           6       1.00      1.00      1.00        13
           7       0.94      1.00      0.97        16
           8       1.00      1.00      1.00        17
           9       1.00      0.95      0.97        19

    accuracy                           0.99       180
   macro avg       0.99      0.
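
The best `gamma` found above (about 46.4) is plausibly large because the features were divided by 255, which makes the squared distances between samples very small. For comparison, scikit-learn's `gamma='scale'` setting uses 1 / (n_features * X.var()), which adapts to the feature scale. A small sketch computing that value for two divisors; the helper `gamma_scale` is a hypothetical name:

```python
import numpy as np
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

def gamma_scale(features):
    """Value SVC uses for gamma='scale': 1 / (n_features * features.var())."""
    return 1.0 / (features.shape[1] * features.var())

g255 = gamma_scale(X / 255.0)  # scaling used in the notebook
g16 = gamma_scale(X / 16.0)    # scaling that maps pixels to [0, 1]
print(g255, g16, g255 / g16)
```

Because the variance shrinks with the square of the divisor, the two values differ by a factor of (255/16)^2, which is why a `gamma` grid should be chosen with the feature scale in mind.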

<a id = 'knn'></a>
### KNN CLASSIFIER
---------------------------------------------------------------------[BACK TO TOP](#top)

In [19]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

para = {'n_neighbors':list(range(5,20)),'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
knncv = GridSearchCV(knn,para,cv=9) # train on 8 folds and cross validate on 9th fold
knncv.fit(xtrain,ytrain)           # train and cross validate
print('_______________________________________________________')
print('the best set of hyper prameters are \n',knncv.best_params_)
print('the best cross validation error = ', knncv.best_score_)
print('_______________________________________________________')
yhat = knncv.predict(xtest)

print('Error metrics on test set \n',classification_report(ytest,yhat))
px.imshow(pd.DataFrame(confusion_matrix(ytest,yhat),columns=list(range(10)),index=list(range(10)))
                       ,text_auto=True,title='Confusion matrix on test data',
          labels={'x':'predicted','y':'actual'})        

_______________________________________________________
the best set of hyper prameters are 
 {'algorithm': 'auto', 'n_neighbors': 5}
the best cross validation error =  0.9882371198013656
_______________________________________________________
Error metrics on test set 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        19
           2       1.00      1.00      1.00        14
           3       1.00      0.96      0.98        23
           4       0.95      1.00      0.98        20
           5       0.95      1.00      0.97        19
           6       1.00      1.00      1.00        13
           7       1.00      1.00      1.00        16
           8       1.00      1.00      1.00        17
           9       1.00      0.95      0.97        19

    accuracy                           0.99       180
   macro avg       0.99      0.99      0.99       180
weighted avg       0.99 
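
In the grid above, `algorithm` only selects the data structure used to search for neighbours, so it mainly affects speed rather than the predictions themselves (small differences are possible when distances are tied). A minimal sketch comparing the predictions, assuming a fixed `random_state` of 0 for the split and k = 5:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
xtr, xte, ytr, yte = train_test_split(X, y, test_size=0.1, random_state=0)

# Fit the same classifier with each neighbour-search algorithm.
preds = {}
for algo in ["brute", "kd_tree", "ball_tree"]:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo).fit(xtr, ytr)
    preds[algo] = knn.predict(xte)

# Fraction of test samples on which two algorithms give the same label.
agreement = np.mean(preds["brute"] == preds["kd_tree"])
print(agreement)
```

If the agreement is close to 1, searching over `algorithm` adds fitting time without changing the selected model, and the grid could be restricted to `n_neighbors`.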

<a id = 'dtree'></a>
### DECISION TREE CLASSIFIER
---------------------------------------------------------------------[BACK TO TOP](#top)

In [20]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
para = {'splitter':['best', 'random'],
        'max_depth' :[2*n for n in range(1,10)],
        'max_features':['sqrt'],  # 'auto' was an alias for 'sqrt' and is rejected by recent scikit-learn releases
        'min_samples_leaf'  : [1, 2, 4],
        'min_samples_split' : [2, 5, 10]}

dtreecv = GridSearchCV(dtree,para,cv=9)
dtreecv.fit(xtrain,ytrain)
print('_______________________________________________________')
print('the best set of hyper prameters are \n',dtreecv.best_params_)
print('the best cross validation error = ', dtreecv.best_score_)
print('_______________________________________________________')
yhat = dtreecv.predict(xtest)

print('Error metrics on test set \n',classification_report(ytest,yhat))
px.imshow(pd.DataFrame(confusion_matrix(ytest,yhat),columns=list(range(10)),index=list(range(10)))
                       ,text_auto=True,title='Confusion matrix on test data',
          labels={'x':'predicted','y':'actual'})  


_______________________________________________________
the best set of hyper prameters are 
 {'max_depth': 16, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'best'}
the best cross validation error =  0.8318021932547074
_______________________________________________________
Error metrics on test set 
               precision    recall  f1-score   support

           0       0.91      1.00      0.95        20
           1       0.89      0.84      0.86        19
           2       0.85      0.79      0.81        14
           3       0.85      0.74      0.79        23
           4       0.80      0.80      0.80        20
           5       0.84      0.84      0.84        19
           6       1.00      0.77      0.87        13
           7       0.64      0.88      0.74        16
           8       0.75      0.88      0.81        17
           9       0.75      0.63      0.69        19

    accuracy                           0.82       180
   macro 

<a id = 'rforest'></a>
### RANDOM FOREST CLASSIFIER
---------------------------------------------------------------------[BACK TO TOP](#top)

In [23]:
from sklearn.ensemble import RandomForestClassifier
rforest = RandomForestClassifier()
para = {
        'n_estimators':list(range(5,10)),
        'max_depth' :[2*n for n in range(1,10)],}

rforestcv = GridSearchCV(rforest,para,cv=9)
rforestcv.fit(xtrain,ytrain)
print('_______________________________________________________')
print('the best set of hyper prameters are \n',rforestcv.best_params_)
print('the best cross validation error = ', rforestcv.best_score_)
print('_______________________________________________________')
yhat = rforestcv.predict(xtest)

print('Error metrics on test set \n',classification_report(ytest,yhat))
px.imshow(pd.DataFrame(confusion_matrix(ytest,yhat),columns=list(range(10)),index=list(range(10)))
                       ,text_auto=True,title='Confusion matrix on test data',
          labels={'x':'predicted','y':'actual'}) 


_______________________________________________________
the best set of hyper prameters are 
 {'max_depth': 14, 'n_estimators': 9}
the best cross validation error =  0.9536140423477479
_______________________________________________________
Error metrics on test set 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       0.95      1.00      0.97        19
           2       1.00      1.00      1.00        14
           3       1.00      0.96      0.98        23
           4       0.90      0.95      0.93        20
           5       1.00      0.89      0.94        19
           6       1.00      1.00      1.00        13
           7       0.89      1.00      0.94        16
           8       1.00      1.00      1.00        17
           9       0.89      0.84      0.86        19

    accuracy                           0.96       180
   macro avg       0.96      0.96      0.96       180
weighted avg       0.96    

<a id = 'conclusion'></a>
### CONCLUSIONS
- we compare the error metrics of the various models to determine the one that performs best
  on our data
- these metrics are calculated on the test data (the data on which the model is neither trained nor cross validated)
- the precision, recall and F1 values are the weighted averages over all classes

| # | Model                    | Accuracy | Precision | Recall | F1   |
|---|--------------------------|----------|-----------|--------|------|
| 1 | logistic regression      | 0.96     | 0.96      | 0.96   | 0.96 |
| 2 | svm classifier           | 0.99     | 0.99      | 0.99   | 0.99 |
| 3 | knn classifier           | 0.99     | 0.99      | 0.99   | 0.99 |
| 4 | decision tree classifier | 0.82     | 0.83      | 0.82   | 0.82 |
| 5 | random forest classifier | 0.96     | 0.96      | 0.96   | 0.96 |
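
The comparison above can also be done programmatically. A minimal sketch that ranks the models by the test accuracies reported in this notebook; the dictionary `test_accuracy` simply restates those numbers:

```python
# Test accuracies as reported in the comparison above.
test_accuracy = {
    "logistic regression": 0.96,
    "svm": 0.99,
    "knn": 0.99,
    "decision tree": 0.82,
    "random forest": 0.96,
}

best_score = max(test_accuracy.values())
best_models = [name for name, acc in test_accuracy.items() if acc == best_score]

# Print the models from most to least accurate.
for name, acc in sorted(test_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {acc:.2f}")
print("best:", best_models)
```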


### Report 
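- the KNN and SVM classifiers perform best on the test data (the data on which the model is neither trained nor cross validated), each reaching 0.99 accuracy, ahead of logistic regression and random forest (0.96) and the decision tree (0.82)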
---------------------------------------------------------------------[BACK TO TOP](#top)

In [30]:
px.bar(y=[96,99,99,82,96] , x=['log','svm','knn','dtree','rforest'],height=400,width=600,
      title='Test Accuracy of various classification models',labels={'x':'model','y':'accuracy %'},
      color = [0,2,2,0,0],text_auto=True)