# Supervised and Unsupervised Learning

Depending on the type of the data and the model to be built, you can separate the learning problems into two broad categories:

### Supervised learning. 
They are the methods in which the training set contains additional attributes that you want to predict (target). 
  ##### Classification: 
The data in the training set belong to two or more classes or categories; then, the data, already being labeled, allow us to teach the system to recognize the characteristics that distinguish each class. When you will need to consider a new value unknown to the system, the system will evaluate its class according to its characteristics.
 ###### Regression: 
When the value to be predicted is a continuous variable. The simplest case to understand is when you want to find the line which describes the trend from a series of points represented in a scatterplot.

### Unsupervised learning. 
These are the methods in which the training set consists of a series of input values x without any corresponding
target value.
 ##### Clustering: 
 The goal of these methods is to discover groups of similar examples in a dataset.
 ##### Dimensionality reduction: 
 Reduction of a high-dimensional dataset to one with only two or three dimensions is useful not just for data
visualization, but for converting data of very high dimensionality into data of much lower dimensionality such that each of the lower dimensions conveys much more information.

In addition to these two main categories, there is a further group of methods which have the purpose of validation and evaluation of the models.

# Training Set and Testing Set
Machine learning enables learning some properties by a model from a data set and applying them to new data. This is because a common practice in machine learning is to evaluate an algorithm. This valuation consists of splitting the data into two parts, one called the training set, with which we will learn the properties of the data, and the other called the testing set, on which to test these properties.

In [10]:
import pandas as pd
from sklearn.feature_selection import RFE #recursive feature elimination
import numpy as np
import matplotlib.pyplot as plt
import os
from sklearn import datasets, linear_model, metrics

# function to calculate r-squared, MAE, RMSE
from sklearn.metrics import r2_score , mean_absolute_error,mean_squared_error

print(os.getcwd())

os.chdir('C:\\Analytics\\Personal\\Machine Learning\\Training\\R\\Dataset')

C:\Users\manish.khati\Python\Class\Day5


In scikit-learn,to use the predictive model for the linear regression you must import linear_model module and then use the manufacturer LinearRegression() constructor for creating the predictive model, which you call linreg.

In [38]:
from sklearn import linear_model
linreg = linear_model.LinearRegression()

In [39]:
#First you will need to break the 442 patients into a training set 
#(composed of the first 422 patients) and a test set (the last 20 patients)

In [40]:
from sklearn import datasets
diabetes = datasets.load_diabetes()

In [68]:
type(diabetes)


sklearn.utils.Bunch

In [42]:
diabetes1 = pd.DataFrame(data= np.c_[diabetes['data'], diabetes['target']],
columns= diabetes['feature_names'] + ['target'])

diabetes1

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068330,-0.092204,75.0
2,0.085299,0.050680,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.025930,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.041180,-0.096346,97.0
6,-0.045472,0.050680,-0.047163,-0.015999,-0.040096,-0.024800,0.000779,-0.039493,-0.062913,-0.038357,138.0
7,0.063504,0.050680,-0.001895,0.066630,0.090620,0.108914,0.022869,0.017703,-0.035817,0.003064,63.0
8,0.041708,0.050680,0.061696,-0.040099,-0.013953,0.006202,-0.028674,-0.002592,-0.014956,0.011349,110.0
9,-0.070900,-0.044642,0.039062,-0.033214,-0.012577,-0.034508,-0.024993,-0.002592,0.067736,-0.013504,310.0


In [43]:
diabetes1['is_train'] = np.random.uniform(0, 1, len(diabetes1)) <= .75
diabetes1.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target,is_train
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0,False
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0,True
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0,False
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0,True
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0,True


In [44]:
diabetes['target']

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 28

In [54]:
#x_train1 = train[diabetes['feature_names']]
#y_train1 = diabetes.target[:-20]
#x_test1 = diabetes.data[-20:]
#y_test1 = diabetes.target[-20:]

train, test = diabetes1[diabetes1['is_train']==True], diabetes1[diabetes1['is_train']==False]

features = diabetes1.columns[:10]
features
x_train1 = train[features]
y_train1 = train["target"]
x_test1 = test[features]
y_test1 = test["target"]
x_test1

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646
2,0.085299,0.050680,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.025930
12,0.016281,-0.044642,-0.028840,-0.009113,-0.004321,-0.009769,0.044958,-0.039493,-0.030751,-0.042499
19,-0.027310,-0.044642,-0.018062,-0.040099,-0.002945,-0.011335,0.037595,-0.039493,-0.008944,-0.054925
22,-0.085430,-0.044642,-0.004050,-0.009113,-0.002945,0.007767,0.022869,-0.039493,-0.061177,-0.013504
31,-0.023677,-0.044642,-0.065486,-0.081414,-0.038720,-0.053610,0.059685,-0.076395,-0.037128,-0.042499
34,0.016281,-0.044642,-0.063330,-0.057314,-0.057983,-0.048912,0.008142,-0.039493,-0.059473,-0.067351
38,-0.001882,0.050680,0.071397,0.097616,0.087868,0.075407,-0.021311,0.071210,0.071424,0.023775
41,-0.099961,-0.044642,-0.067641,-0.108957,-0.074494,-0.072712,0.015505,-0.039493,-0.049868,-0.009362
42,-0.060003,0.050680,-0.010517,-0.014852,-0.049727,-0.023547,-0.058127,0.015858,-0.009919,-0.034215


In [55]:
type(x_train1)

pandas.core.frame.DataFrame

In [70]:
#Now, apply the training set to the predictive model through the use of fit() function.

linreg.fit(x_train1,y_train1)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [57]:
#Once the model is trained you can get the ten b coefficients calculated for each variable, 
#using the coef_ attribute of the predictive model.
linreg.coef_

array([-6.36020829e+00, -3.01485913e+02,  5.31907231e+02,  3.46342208e+02,
       -3.49838703e-01, -1.70694394e+02, -2.80673030e+02,  8.56287956e+01,
        5.02940683e+02,  3.35275018e+01])

In [58]:
linreg.intercept_

152.5132298350728

In [59]:
#If you apply the test set to the linreg prediction model you will get a series of targets to 
#be compared with the values actually observed.

linreg.predict(x_test1)

array([204.72347435, 173.65446775, 116.14960928, 124.31266606,
       118.85846601,  68.8659893 ,  82.86564781, 244.92639493,
        72.20171901, 142.45135829, 112.18884849, 134.43637156,
       130.1710063 , 118.42237838, 144.36258751, 191.25269117,
        81.11493501, 172.69923517, 153.41049072, 150.0618837 ,
       157.25105621, 160.18199876,  78.50675471, 311.43230912,
       192.44092742, 209.68806923, 241.76216959, 155.28163599,
       118.44230694, 109.70323438, 117.65189772, 242.483317  ,
        62.77733439, 271.30981704, 258.78394611,  90.73962639,
       152.72577427, 194.79719178, 155.46737026, 135.66473202,
       166.11700317, 153.08559638, 190.05264897, 171.08283088,
       155.15939408, 128.6804643 , 151.51469031, 224.79570277,
       124.6773803 , 233.09360214, 162.4933103 , 159.16674312,
       172.62077386, 189.04714006, 182.12605812, 232.78824556,
       275.81836903, 109.55087142,  56.68915699, 125.6238889 ,
       207.21707159, 196.19678738, 113.41838898, 160.78

### How Good Is Your Model?
There are three metrics widely used for evaluating linear model performance.
#R-squared
#RMSE
#MAE


The R-squared metric is the most popular practice of evaluating how well your model fits the data. R-squared value designates the total proportion of variance in the dependent variable explained by the independent variable. It is a value between 0 and 1; the value toward 1 indicates a better model fit

In [61]:
# Run the model on X_test and show the first five results
list(linreg.predict(x_test1)[0:5])

[204.72347434765848,
 173.65446774877682,
 116.14960928287482,
 124.31266606471516,
 118.85846601171599]

In [63]:
# View the R-Squared score

Pred = linreg.predict(x_test1)
r_squared = r2_score(Pred, y_test1)
r_squared

0.07148722812718966

In [65]:
# Adjusted R Squared
1 - (1-r_squared)*(len(y_train1)-1)/(len(y_train1)-x_train1.shape[1]-1)

0.04309234213719537

In [66]:
# View the first five test Y values
list(y_test1)[0:5]

[151.0, 141.0, 179.0, 168.0, 68.0]

In [50]:
#The difference between the model’s predicted values and the actual values is how is we judge as model’s 
#accuracy, because a perfectly accurate model would have residuals of zero.

#The most common statistic used for quantitative Ys is the residual sum of squares

# Apply the model we created using the training data 
# to the test data, and calculate the RSS.
((y_test - linreg.predict(x_test)) **2).sum()

40091.35205379645

In [51]:
#Note: You can also use Mean Squared Error, which is RSS divided by the degrees of freedom

# Calculate the MSE
np.mean((linreg.predict(x_test) - y_test) **2)

2004.5676026898225