What we are going to cover:  

0. An end to end Scikit Learn Workflow  
1. Getting the data ready  
2. Choose the right estimator/algorithm for our problem  
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating the model
5. Improve the model  
6. Save and load a trained model  
7. Putting it all together !!

# 2. Choosing the right estimator/algorithm for your problem

#### Note that:
* Scikit-learn refers to machine learning models and algorithms as estimators.
* Classification problem: predicting a category (heart disease or not).
* Sometimes you will see `clf` (short for classifier) used as the variable name for a classification estimator.
* Regression problem: predicting a number (the selling price of a car).
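
To make the two problem types concrete, here is a minimal sketch of a classifier and a regressor side by side. The synthetic data from `make_classification` and `make_regression`, and the choice of `LogisticRegression` and `Ridge`, are assumptions for illustration only; they are not the datasets or the final models used in these notes.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge

# Synthetic data (an assumption for illustration, not the datasets used in these notes)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# A classification estimator predicts a category
clf = LogisticRegression(max_iter=1000)
clf.fit(X_clf, y_clf)
clf_preds = clf.predict(X_clf[:5])   # class labels

# A regression estimator predicts a number
reg = Ridge()
reg.fit(X_reg, y_reg)
reg_preds = reg.predict(X_reg[:5])   # continuous values

print("classifier predictions:", clf_preds)
print("regressor predictions:", reg_preds)
```

Both estimators expose the same `fit`, `predict` and `score` methods, which is why swapping one model for another later in this section takes only a line or two.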

#### ML map
<img src= "scikit-images/ml_map.png">

## 2.1- Picking a machine learning model for a regression problem

Let's use the California housing dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing

In [1]:
import numpy as np
import pandas as pd

In [2]:
import sklearn
from sklearn import datasets
#dir(datasets)

##### California Housing Dataset
* The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).
* This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
* A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.

#### Data Set Characteristics:
* Number of Instances: 20640
* Number of Attributes: 8 numeric, predictive attributes and the target
* Attribute Information:
  * `MedInc`: median income in block group
  * `HouseAge`: median house age in block group
  * `AveRooms`: average number of rooms per household
  * `AveBedrms`: average number of bedrooms per household
  * `Population`: block group population
  * `AveOccup`: average number of household members
  * `Latitude`: block group latitude
  * `Longitude`: block group longitude
* Missing Attribute Values: None

In [3]:
from sklearn.datasets import fetch_california_housing
housing= fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [4]:
# Use the features to predict the target

housing_df= pd.DataFrame(housing["data"])
housing_df

Unnamed: 0,0,1,2,3,4,5,6,7
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [5]:
housing_df= pd.DataFrame(housing["data"], columns= housing["feature_names"])
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [6]:
housing_df["target"]= housing["target"]
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [7]:
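# Note: this raises a KeyError because the target column was added as "target"
# in the previous cell, not "MedHouseVal". The later cells rely on "target".
housing_df= housing_df.drop("MedHouseVal",axis=1)
housing_df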

KeyError: "['MedHouseVal'] not found in axis"

In [8]:
# Import algorithm
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# setup random seed
np.random.seed(42)

# Create the data
X= housing_df.drop("target",axis=1)
y= housing_df["target"]      # median house price in $100,000s

# Split into train and test sets
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.2)

# Instantiate & fit the model (on the training data)
model=Ridge()
model.fit(X_train,y_train)

# Check the score of the model (on the test set)
model.score(X_test, y_test)


0.5758549611440128

#### This value shows how predictive our model is: how well the feature values let us predict the target
For regressors, `score` returns the coefficient of determination, more commonly known as R-squared (or R^2). It measures the proportion of the variance in the target that the model's predictions explain, compared with a baseline that always predicts the mean of the target.  
The best possible R^2 is 1.0. A model that always predicts the mean scores 0.0, and a model that does worse than that gets a negative score.
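
As a rough sketch of what this score measures, R^2 can be computed by hand as 1 minus the residual sum of squares divided by the total sum of squares, and then checked against `sklearn.metrics.r2_score`. The `y_true` and `y_pred` values below are made up for illustration. Note that the score can drop below 0 when predictions are worse than always guessing the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions, made up for illustration
y_true = np.array([3.0, 2.5, 4.0, 1.5, 5.0])
y_pred = np.array([2.8, 2.7, 3.6, 1.9, 4.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Same quantity from scikit-learn
r2_sklearn = r2_score(y_true, y_pred)

# Always predicting the mean of the target scores 0
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Predictions worse than the mean (here: the targets in reverse order) score below 0
r2_bad = r2_score(y_true, y_true[::-1])

print(r2_manual, r2_sklearn, r2_mean, r2_bad)
```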

##### What if Ridge doesn't work or the score doesn't fit our needs?
We can switch to another model.  
An ensemble model is a combination of smaller models that tries to make better predictions than any single model.

#### A random forest is a combination of lots of decision trees
`n_estimators` is 100 by default, which means the model uses 100 decision trees to make each prediction.  
You can think of it as 100 different models predicting the value, with their predictions combined into one.
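
The "many trees, one prediction" idea can be checked directly. The sketch below assumes synthetic data from `make_regression` (not the housing data) and uses the fitted forest's `estimators_` attribute, which holds the individual decision trees, to confirm that the forest's regression prediction matches the average of its trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X, y)

# The fitted decision trees are stored in `estimators_`
n_trees = len(forest.estimators_)

# Average the predictions of every individual tree for the first 5 samples
tree_preds = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
manual_avg = tree_preds.mean(axis=0)

# Compare with the forest's own prediction
forest_preds = forest.predict(X[:5])
matches = np.allclose(manual_avg, forest_preds)

print("number of trees:", n_trees)
print("forest prediction equals tree average:", matches)
```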

In [9]:
# Import the RandomForestRegressor model class from the ensemble models
# (the previous model, Ridge, is a linear model)
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Create the data
X= housing_df.drop("target",axis=1)
y= housing_df["target"]      # median house price in $100,000s

# Split into train and test sets
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.2)

# Instantiate & fit the model (on the training data)
model= RandomForestRegressor()
model.fit(X_train,y_train)

# Check the score of the model (on the test set)
model.score(X_test, y_test)

0.8065734772187598

-------------------------------------------------------------

## 2.2 Picking a machine learning model for a classification problem

In [12]:
heart_disease= pd.read_csv("heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [13]:
len(heart_disease)

303

Following the ML map, we should try `LinearSVC`.

In [26]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

#Setup random seed
np.random.seed(42)

# make the data
X = heart_disease.drop("target",axis=1)
y= heart_disease["target"]

# split the data
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.2)

# instantiate LinearSVC
clf= LinearSVC()
clf.fit(X_train,y_train)

#evaluate the LinearSVC
clf.score(X_test,y_test)



0.8688524590163934

In [23]:
heart_disease["target"].value_counts()

1    165
0    138
Name: target, dtype: int64

Let's try another model.

In [27]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

#Setup random seed
np.random.seed(42)

# make the data
X = heart_disease.drop("target",axis=1)
y= heart_disease["target"]

# split the data
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.2)

# instantiate RandomForestClassifier
clf= RandomForestClassifier()
clf.fit(X_train,y_train)

# evaluate the RandomForestClassifier
clf.score(X_test,y_test)

0.8524590163934426
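
Since trying another model mostly means swapping the estimator class, the comparison can be written as a loop over candidates that all see the same train/test split. This is a sketch only: it assumes synthetic data from `make_classification` in place of `heart-disease.csv`, and the `models` and `results` dictionaries are hypothetical names introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the heart disease data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# One split shared by every model, so the scores are comparable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate estimators to compare (hypothetical dictionary name)
models = {
    "LinearSVC": LinearSVC(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}

# Fit each model on the training data and score it on the test data
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)   # mean accuracy for classifiers

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```

With a test set this small, a gap of a few correct predictions between two models may not mean much, so treat close scores as roughly equal.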

Tidbit:  
    1. If you have structured data, use ensemble methods.  
    2. If you have unstructured data, use deep learning or transfer learning.