<h1 align='center'> CS353 Machine Learning Lab</h1>
<h1 align='center'>Lab-2 (19/03/21)</h1>
<h2 align='center'>Shumbul Arifa (181CO152)</h2>

## Task:
Perform Naive Bayes classification on the standard breast cancer dataset.

**Attributes:**

1. radius (mean of distances from center to points on the perimeter)

2. texture (standard deviation of gray-scale values)

3. perimeter

4. area

5. smoothness (local variation in radius lengths)

6. compactness (perimeter^2 / area - 1.0)

7. concavity (severity of concave portions of the contour)

8. concave points (number of concave portions of the contour)

9. symmetry

10. fractal dimension (“coastline approximation” - 1)

The mean, standard error, and “worst” or largest (mean of the three worst/largest values) of these features were computed for each image, resulting in 30 features. For instance, field 0 is Mean Radius, field 10 is Radius SE, field 20 is Worst Radius.
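
The field numbering described above can be checked against the copy of this dataset that ships with scikit-learn. This is a minimal sketch, assuming `sklearn.datasets.load_breast_cancer` is the same WDBC data; note that scikit-learn labels the standard-error columns as `... error` rather than `... SE`.

```python
from sklearn.datasets import load_breast_cancer

# Load the bundled WDBC data (no download needed)
data = load_breast_cancer()
names = data.feature_names

# Fields 0, 10 and 20 are the mean, standard error and "worst" value of radius
print(names[0], "|", names[10], "|", names[20])
print(data.data.shape)  # (samples, features)
```

The same offset pattern (index, index + 10, index + 20) applies to the other nine attributes.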

**class:**
- WDBC-Malignant

- WDBC-Benign

**Results:**

Accuracy Obtained:
- 93.86 % (using a Naive Bayes classifier implemented in Python)
- 94.73 % (using the scikit-learn library's Naive Bayes classifier)

## Dataset

`FuelConsumption.csv`:

I have used a fuel consumption dataset, **`FuelConsumption.csv`** (loaded below as `FuelConsumptionCo2.csv`), which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. [Dataset source](http://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64)

- **MODELYEAR** e.g. 2014
- **MAKE** e.g. Acura
- **MODEL** e.g. ILX
- **VEHICLE CLASS** e.g. SUV
- **ENGINE SIZE** e.g. 4.7
- **CYLINDERS** e.g. 6
- **TRANSMISSION** e.g. A6
- **FUEL CONSUMPTION in CITY(L/100 km)** e.g. 9.9
- **FUEL CONSUMPTION in HWY (L/100 km)** e.g. 8.9
- **FUEL CONSUMPTION COMB (L/100 km)** e.g. 9.2
- **CO2 EMISSIONS (g/km)** e.g. 182   --> low --> 0
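
The `--> low --> 0` note on the last attribute reads like a plan to turn the emission value into a class label. Below is a minimal sketch of that idea, assuming a cut-off of 200 g/km. The threshold is an arbitrary illustrative choice, not something taken from the dataset documentation, and the values are toy numbers.

```python
import pandas as pd

# Toy CO2 emission values in g/km (illustrative only)
co2 = pd.Series([196, 221, 136, 255, 244, 182], name="CO2EMISSIONS")

# Hypothetical threshold: at or below 200 g/km counts as "low" (class 0)
THRESHOLD = 200

# 0 = low emissions, 1 = high emissions
co2_class = (co2 > THRESHOLD).astype(int)
print(co2_class.tolist())
```

With a binary label like this, a classifier such as logistic regression has two classes to separate instead of one class per distinct emission value.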

# Importing Libraries

In [98]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

# Loading dataset

In [99]:
df = pd.read_csv('FuelConsumptionCo2.csv')
df.head()

Unnamed: 0,MODELYEAR,MAKE,MODEL,VEHICLECLASS,ENGINESIZE,CYLINDERS,TRANSMISSION,FUELTYPE,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG,CO2EMISSIONS
0,2014,ACURA,ILX,COMPACT,2.0,4,AS5,Z,9.9,6.7,8.5,33,196
1,2014,ACURA,ILX,COMPACT,2.4,4,M6,Z,11.2,7.7,9.6,29,221
2,2014,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,6.0,5.8,5.9,48,136
3,2014,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,12.7,9.1,11.1,25,255
4,2014,ACURA,RDX AWD,SUV - SMALL,3.5,6,AS6,Z,12.1,8.7,10.6,27,244


In [100]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067 entries, 0 to 1066
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   MODELYEAR                 1067 non-null   int64  
 1   MAKE                      1067 non-null   object 
 2   MODEL                     1067 non-null   object 
 3   VEHICLECLASS              1067 non-null   object 
 4   ENGINESIZE                1067 non-null   float64
 5   CYLINDERS                 1067 non-null   int64  
 6   TRANSMISSION              1067 non-null   object 
 7   FUELTYPE                  1067 non-null   object 
 8   FUELCONSUMPTION_CITY      1067 non-null   float64
 9   FUELCONSUMPTION_HWY       1067 non-null   float64
 10  FUELCONSUMPTION_COMB      1067 non-null   float64
 11  FUELCONSUMPTION_COMB_MPG  1067 non-null   int64  
 12  CO2EMISSIONS              1067 non-null   int64  
dtypes: float64(4), int64(4), object(5)
memory usage: 108.5+ KB


In [101]:
df.describe()

Unnamed: 0,MODELYEAR,ENGINESIZE,CYLINDERS,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG,CO2EMISSIONS
count,1067.0,1067.0,1067.0,1067.0,1067.0,1067.0,1067.0,1067.0
mean,2014.0,3.346298,5.794752,13.296532,9.474602,11.580881,26.441425,256.228679
std,0.0,1.415895,1.797447,4.101253,2.79451,3.485595,7.468702,63.372304
min,2014.0,1.0,3.0,4.6,4.9,4.7,11.0,108.0
25%,2014.0,2.0,4.0,10.25,7.5,9.0,21.0,207.0
50%,2014.0,3.4,6.0,12.6,8.8,10.9,26.0,251.0
75%,2014.0,4.3,8.0,15.55,10.85,13.35,31.0,294.0
max,2014.0,8.4,12.0,30.2,20.5,25.8,60.0,488.0


# Data Preprocessing

Columns of object type have to be converted to categorical type and then to integer codes. There are no missing values.

In [102]:
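# Columns stored as strings (object dtype) that need numeric codes
categorical_cols = ["MAKE", "MODEL", "VEHICLECLASS", "TRANSMISSION", "FUELTYPE"]

for col in categorical_cols:
    # Convert to pandas' category dtype, then replace each label with its integer code
    df[col] = df[col].astype('category').cat.codes
df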

Unnamed: 0,MODELYEAR,MAKE,MODEL,VEHICLECLASS,ENGINESIZE,CYLINDERS,TRANSMISSION,FUELTYPE,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG,CO2EMISSIONS
0,2014,0,329,0,2.0,4,10,3,9.9,6.7,8.5,33,196
1,2014,0,329,0,2.4,4,20,3,11.2,7.7,9.6,29,221
2,2014,0,330,0,1.5,4,17,3,6.0,5.8,5.9,48,136
3,2014,0,389,11,3.5,6,11,3,12.7,9.1,11.1,25,255
4,2014,0,483,11,3.5,6,11,3,12.1,8.7,10.6,27,244
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1062,2014,38,624,11,3.0,6,11,2,13.4,9.8,11.8,24,271
1063,2014,38,624,11,3.2,6,11,2,13.2,9.5,11.5,25,264
1064,2014,38,625,11,3.0,6,11,2,13.4,9.8,11.8,24,271
1065,2014,38,625,11,3.2,6,11,2,12.9,9.3,11.3,25,260


In [103]:
df.dtypes

MODELYEAR                     int64
MAKE                           int8
MODEL                         int16
VEHICLECLASS                   int8
ENGINESIZE                  float64
CYLINDERS                     int64
TRANSMISSION                   int8
FUELTYPE                       int8
FUELCONSUMPTION_CITY        float64
FUELCONSUMPTION_HWY         float64
FUELCONSUMPTION_COMB        float64
FUELCONSUMPTION_COMB_MPG      int64
CO2EMISSIONS                  int64
dtype: object
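
The category-to-code conversion used above can be inspected on a toy column. This is a sketch with hypothetical fuel-type labels (only `Z` appears in the rows shown in this notebook); it also shows how to recover the code-to-label mapping, which the in-place conversion discards.

```python
import pandas as pd

# Hypothetical labels for illustration
toy = pd.DataFrame({"FUELTYPE": ["Z", "Z", "X", "E", "D"]})

# Categories are sorted alphabetically, then numbered from 0
cat = toy["FUELTYPE"].astype("category")
codes = cat.cat.codes

# Keep the mapping so codes can be translated back to labels later
mapping = dict(enumerate(cat.cat.categories))
print(codes.tolist())
print(mapping)
```

These integer codes imply an ordering between labels that has no real meaning, so a linear model may read more into them than intended; one-hot encoding is a common alternative.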

In [104]:
X = df.loc[:, df.columns!='CO2EMISSIONS']
y = df['CO2EMISSIONS']
print(X.shape)
print(y.shape)

(1067, 12)
(1067,)


In [105]:
X.head()

Unnamed: 0,MODELYEAR,MAKE,MODEL,VEHICLECLASS,ENGINESIZE,CYLINDERS,TRANSMISSION,FUELTYPE,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG
0,2014,0,329,0,2.0,4,10,3,9.9,6.7,8.5,33
1,2014,0,329,0,2.4,4,20,3,11.2,7.7,9.6,29
2,2014,0,330,0,1.5,4,17,3,6.0,5.8,5.9,48
3,2014,0,389,11,3.5,6,11,3,12.7,9.1,11.1,25
4,2014,0,483,11,3.5,6,11,3,12.1,8.7,10.6,27


In [106]:
y.head()

0    196
1    221
2    136
3    255
4    244
Name: CO2EMISSIONS, dtype: int64

# Splitting Data

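We split the features `X` and target `y` into training and test sets with `train_test_split`, holding out 20 % of the rows for testing and fixing `random_state=5` so the split is reproducible.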

In [107]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
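
With 1,067 rows and `test_size=0.2`, scikit-learn rounds the test set up to 214 rows and leaves 853 for training, which matches the `(214,)` prediction shape reported later. A small sketch on placeholder arrays of the same length (the array contents are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset
X_demo = np.arange(1067 * 2).reshape(1067, 2)
y_demo = np.arange(1067)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5
)
print(X_tr.shape, X_te.shape)
```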

In [108]:
X_train.head()

Unnamed: 0,MODELYEAR,MAKE,MODEL,VEHICLECLASS,ENGINESIZE,CYLINDERS,TRANSMISSION,FUELTYPE,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG
533,2014,15,450,10,3.7,6,12,3,12.9,8.9,11.1,25
421,2014,11,406,10,3.7,6,2,2,12.5,8.2,10.6,27
834,2014,28,30,13,3.7,6,12,3,13.0,9.5,11.4,25
587,2014,17,316,12,3.6,6,4,1,17.7,13.0,15.6,18
305,2014,9,145,1,3.6,6,4,2,12.8,8.6,10.9,26


In [109]:
X_test.head()

Unnamed: 0,MODELYEAR,MAKE,MODEL,VEHICLECLASS,ENGINESIZE,CYLINDERS,TRANSMISSION,FUELTYPE,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG
331,2014,9,354,11,3.6,6,2,1,18.9,12.9,16.2,17
467,2014,12,573,11,3.6,6,2,2,14.8,9.9,12.6,22
938,2014,33,288,13,1.0,3,6,3,6.9,5.7,6.4,44
60,2014,2,508,2,4.0,8,3,3,14.2,9.7,12.2,23
71,2014,3,167,10,6.0,12,13,3,20.0,12.2,16.5,17


# Implementing Logistic regression

## Part - 1: Using scikit-learn

In [110]:
from sklearn import preprocessing
from sklearn.pipeline import Pipeline

In [111]:
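# Learn the mean and standard deviation from the training set only
scaler = preprocessing.StandardScaler().fit(X_train)
# transform() returns new arrays, so the results must be stored; X_train itself is unchanged
X_train_scaled = scaler.transform(X_train)
# Reuse the training statistics on the test set instead of refitting
X_test_scaled = scaler.transform(X_test)
X_train.head()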

Unnamed: 0,MODELYEAR,MAKE,MODEL,VEHICLECLASS,ENGINESIZE,CYLINDERS,TRANSMISSION,FUELTYPE,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG
533,2014,15,450,10,3.7,6,12,3,12.9,8.9,11.1,25
421,2014,11,406,10,3.7,6,2,2,12.5,8.2,10.6,27
834,2014,28,30,13,3.7,6,12,3,13.0,9.5,11.4,25
587,2014,17,316,12,3.6,6,4,1,17.7,13.0,15.6,18
305,2014,9,145,1,3.6,6,4,2,12.8,8.6,10.9,26
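
`StandardScaler` returns a new array rather than modifying its input, which is why `X_train.head()` above still shows the raw values. A minimal sketch on toy numbers, fitting on the training rows only and reusing those statistics for the test rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Fit on training rows only, then transform both sets with the same statistics
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Calling `fit_transform` on the test set would instead recompute the mean and standard deviation from the test rows, so the two sets would no longer be on the same scale.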


In [112]:
y_test.head()

331    259
467    290
938    147
60     281
71     380
Name: CO2EMISSIONS, dtype: int64

In [113]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV

In [114]:
log_reg = LogisticRegression()
# Note: this fits on the full, unscaled X and y (test rows included), not on X_train
log_reg.fit(X,y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
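
The warning above suggests scaling the data or raising `max_iter`. One way to do both is the `Pipeline` class imported earlier. The following is a sketch on synthetic data from `make_classification`, not on the fuel dataset; the step names `scale` and `clf` and the value `max_iter=1000` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with deliberately mismatched feature scales
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5
)

# The scaler is fitted on the training rows only, inside the pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
print(acc)
```

Because the scaler lives inside the pipeline, `pipe.predict` and `pipe.score` apply the training-set scaling to new data automatically.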

In [115]:
log_reg.coef_

array([[-0.00253353,  0.00453407,  0.01119432, ..., -0.00116918,
        -0.00174007,  0.00827901],
       [ 0.0011056 ,  0.00358497, -0.00634149, ..., -0.0021526 ,
        -0.00322058,  0.01579013],
       [-0.00020206,  0.00479618,  0.00098106, ..., -0.0010565 ,
        -0.00152216,  0.00626986],
       ...,
       [-0.00028809, -0.00383138,  0.00464887, ...,  0.00261999,
         0.0034896 , -0.00542003],
       [-0.00131181, -0.00352738,  0.0067809 , ...,  0.00129153,
         0.0017765 , -0.00275458],
       [ 0.00133215, -0.00148304, -0.0113834 , ...,  0.00207869,
         0.00239991, -0.00334127]])
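
For a multiclass fit, `coef_` holds one row per class and one column per feature. Since every distinct CO2 value is treated as its own class here, the matrix above has many rows. A small sketch of the shapes on made-up data with three target values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random features and three distinct integer targets (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = np.repeat([10, 20, 30], 20)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

print(clf.classes_)          # the distinct target values, treated as labels
print(clf.coef_.shape)       # (n_classes, n_features)
print(clf.intercept_.shape)  # (n_classes,)
```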

In [116]:
log_reg.intercept_

array([-1.25808651e-06,  5.48960946e-07, -1.00402601e-07, -7.64726421e-09,
        7.70985447e-07,  9.46115925e-07,  5.69692866e-08,  6.79430344e-07,
       -7.40039324e-06,  1.52732199e-07,  3.03694590e-07, -9.96103493e-07,
        1.11886275e-06,  2.58529398e-07,  3.59737939e-07,  3.99242716e-07,
        2.86720993e-07,  5.66943826e-07,  6.42083321e-07,  5.21833215e-07,
        4.53170400e-07, -1.39315808e-07, -3.41552725e-06,  6.29743921e-07,
        5.60257426e-07,  8.93208359e-07,  1.25669782e-06,  4.45177337e-07,
        5.38527327e-07,  5.51158048e-07,  5.73957440e-07,  6.57068046e-07,
        3.73147008e-07,  1.03206447e-06,  6.00470999e-07,  7.63742690e-07,
        6.29602725e-07,  1.11603927e-06,  5.52180923e-07,  8.98274387e-07,
        8.05499587e-07,  2.50347644e-07,  3.76339226e-07,  1.01325320e-06,
        7.15920135e-07,  1.14467021e-06,  9.23979152e-07,  8.24748260e-07,
        3.70773924e-07,  1.00991891e-06,  1.23179757e-06,  5.45543729e-07,
        1.38712542e-06,  

In [117]:
y_pred = log_reg.predict(X_test)
y_pred

array([294, 317, 184, 317, 294, 225, 317, 179, 179, 209, 184, 294, 271,
       209, 294, 225, 209, 317, 271, 306, 184, 271, 380, 230, 196, 294,
       292, 209, 242, 317, 292, 209, 317, 179, 271, 184, 294, 179, 294,
       294, 292, 179, 184, 209, 179, 294, 294, 207, 294, 294, 271, 207,
       317, 242, 292, 294, 179, 230, 306, 209, 294, 242, 294, 294, 179,
       179, 207, 179, 179, 184, 179, 271, 179, 294, 230, 271, 264, 207,
       179, 184, 242, 184, 292, 294, 207, 294, 184, 271, 209, 225, 179,
       179, 179, 294, 179, 294, 294, 225, 317, 179, 306, 271, 317, 184,
       209, 294, 179, 179, 292, 209, 224, 179, 179, 230, 179, 179, 184,
       271, 294, 317, 209, 294, 294, 242, 224, 306, 306, 294, 242, 179,
       306, 179, 294, 209, 251, 179, 209, 179, 294, 294, 179, 184, 271,
       306, 317, 179, 184, 184, 184, 294, 317, 242, 179, 230, 179, 317,
       294, 264, 271, 306, 242, 271, 225, 271, 209, 271, 294, 207, 271,
       242, 224, 179, 294, 317, 306, 242, 292, 271, 207, 207, 29

In [118]:
y_pred.shape

(214,)

In [119]:
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from sklearn.metrics import mean_squared_log_error,explained_variance_score
from sklearn import metrics

In [120]:
prediction = log_reg.predict(X_test)
print('\n\nMetrics for model evaluation:\n\n')

# Regression-style error metrics, treating the predicted class labels as numeric CO2 values
mean_sq_error = mean_squared_error(y_test,prediction)
mean_abs_error = mean_absolute_error(y_test,prediction)
R2_score = metrics.r2_score(y_test,prediction)
# Stored as explained_var so the imported explained_variance_score function is not shadowed
explained_var = metrics.explained_variance_score( y_test,prediction, multioutput='uniform_average')

print(f' Mean-sq-error            : {mean_sq_error}\n\n Mean-abs-error           : {mean_abs_error}\n\n R2-score                 : {R2_score}\n\n Explained-variance-score : {explained_var}\n')




Metrics for model evaluation:


 Mean-sq-error            : 2583.5934579439254

 Mean-abs-error           : 39.39719626168224

 R2-score                 : 0.4320076480785501

 Explained-variance-score : 0.5112549362589073



In [124]:
log_reg.score(X_test, y_test)

0.037383177570093455

In [121]:
from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

Model accuracy score: 0.0374
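
Accuracy counts only exact matches between the predicted and true CO2 value, so a score of 0.0374 on 214 test rows corresponds to 8 exact hits. The sketch below uses the first five test targets and predictions from this notebook to show how a prediction can be numerically close and still count as wrong. The 40 g/km tolerance is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# First five true values and predictions shown earlier in this notebook
y_true = np.array([259, 290, 147, 281, 380])
y_hat = np.array([294, 317, 184, 317, 294])

# Accuracy is the fraction of exact matches
exact = np.mean(y_true == y_hat)
acc = accuracy_score(y_true, y_hat)

# Hypothetical tolerance: share of predictions within 40 g/km of the true value
close = np.mean(np.abs(y_true - y_hat) <= 40)

print(exact, acc, close)
print(round(8 / 214, 4))  # exact hits over test rows
```

For a continuous target like this, error-based metrics or a binned class label are usually more informative than exact-match accuracy.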
