## Scikit-Learn (sklearn) Course

<span>
--- topics<br>
0. sklearn workflow overview<br>
1. preparing data (collecting, exploring, cleaning, transforming, reducing, splitting)<br>
2. defining problem / selecting machine learning model<br>
3. training model and making predictions<br>
4. evaluating model<br>
5. improving model<br>
6. saving and loading model<br>
7. putting it all together
</span>

## 2. Selecting Machine Learning Model

#### Concepts

--- model selection<br>
model selection is based on the [sklearn model selection map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)<br>
when the map suggests several models, each one should be trained and compared on the same data
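
A minimal sketch of comparing several candidate models on one train/test split. The synthetic dataset from `make_regression`, the choice of candidates, and the `n_estimators=50` setting are assumptions made for illustration, not part of the course material.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# synthetic regression data (an assumption; stands in for a real dataset)
features, target = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# candidate models, as the selection map might suggest
candidates = {
    "ridge": Ridge(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# train every candidate on the same split and record its test score (r-square)
test_scores = {}
for name, model in candidates.items():
    model.fit(features_train, target_train)
    test_scores[name] = model.score(features_test, target_test)
    print(f"{name}: {test_scores[name]:.3f}")

best_name = max(test_scores, key=test_scores.get)
print("best candidate:", best_name)
```

Which candidate wins depends on the data, so the comparison has to be repeated for each new problem.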

--- model selection tip<br>
with structured (tabular) data, ensemble models tend to perform well<br>
with unstructured data (images, audio, text), deep learning and transfer learning models tend to perform better

--- decision trees<br>
a decision tree learns a sequence of decision rules directly from the training data<br>
the fitted tree is equivalent to nested if-then-else blocks, each comparing one feature against a threshold
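
A minimal sketch of the idea, using a one-split tree (a "stump") written in plain Python. The helper names `fit_stump` and `predict_stump` and the toy data are hypothetical; sklearn's trees use a more elaborate procedure, but the learned model has the same if-then-else shape.

```python
def fit_stump(xs, ys):
    """Find the threshold on one feature that minimizes squared error."""
    best = None
    values = sorted(set(xs))
    # candidate thresholds are midpoints between consecutive distinct values
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    # the learned model is a single if-then-else block
    if x <= threshold:
        return left_mean
    else:
        return right_mean


# toy data (made up for illustration): the target jumps once the feature passes 3
rooms = [1, 2, 3, 4, 5, 6]
prices = [10.0, 12.0, 11.0, 30.0, 31.0, 29.0]

stump = fit_stump(rooms, prices)
print(stump)
print(predict_stump(stump, 2.5), predict_stump(stump, 5.5))
```

A deeper tree repeats the same search inside each branch, which produces nested if-then-else blocks.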

--- ensemble models<br>
ensemble models combine the predictions of several base estimators<br>
averaging methods build base estimators independently and average their predictions (random forest)<br>
boosting methods build base estimators sequentially, each one trying to correct the errors of the ensemble so far (gradient tree boosting)
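
A minimal sketch contrasting the two families in plain Python, assuming one-split trees as base estimators and squared error as the loss. Averaging fits each base estimator independently on a bootstrap sample; boosting fits each one on the residuals left by the previous ones. All function names, the toy data, and the learning rate are hypothetical; this is not how sklearn implements either method.

```python
import random


def fit_stump(xs, ys):
    """Base estimator: one threshold split that minimizes squared error."""
    best = None
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (values[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_averaging(xs, ys, n_estimators, rng):
    """Averaging: each stump is fitted independently on a bootstrap sample."""
    stumps = []
    for _ in range(n_estimators):
        indices = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in indices], [ys[i] for i in indices]))
    return stumps


def predict_averaging(stumps, x):
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


def fit_boosting(xs, ys, n_estimators, learning_rate):
    """Boosting: each stump is fitted on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_estimators):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - learning_rate * predict_stump(stump, x) for x, r in zip(xs, residuals)]
    return base, learning_rate, stumps


def predict_boosting(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# toy data (made up for illustration)
xs = list(range(10))
ys = [float(x * x) for x in xs]

forest = fit_averaging(xs, ys, n_estimators=25, rng=random.Random(42))
boosted_1 = fit_boosting(xs, ys, n_estimators=1, learning_rate=0.5)
boosted_20 = fit_boosting(xs, ys, n_estimators=20, learning_rate=0.5)

mse_averaging = mse(lambda x: predict_averaging(forest, x), xs, ys)
mse_boosted_1 = mse(lambda x: predict_boosting(boosted_1, x), xs, ys)
mse_boosted_20 = mse(lambda x: predict_boosting(boosted_20, x), xs, ys)
print("averaging, 25 stumps:", mse_averaging)
print("boosting, 1 stump:", mse_boosted_1)
print("boosting, 20 stumps:", mse_boosted_20)
```

The training error of the boosted model shrinks as stumps are added because each new stump targets what the earlier ones got wrong.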

--- random forest<br>
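random forest is an averaging ensemble model<br>
it combines the predictions of decision trees, each fitted on a bootstrap sample of the training data<br>
the regressor averages the tree predictions; the sklearn classifier averages the predicted class probabilities rather than taking a majority vote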
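
The sklearn random forest classifier combines its trees by averaging their predicted class probabilities instead of counting one vote per tree. A small sketch of how the two rules can disagree; the three probabilities are hypothetical values chosen for illustration.

```python
# hypothetical probabilities of the positive class from three trees for one sample
tree_probabilities = [0.9, 0.4, 0.4]

# majority voting: each tree votes for its own most likely class
votes = [1 if p >= 0.5 else 0 for p in tree_probabilities]
majority_vote = 1 if sum(votes) > len(votes) / 2 else 0

# probability averaging: average first, then pick the most likely class
mean_probability = sum(tree_probabilities) / len(tree_probabilities)
averaged_prediction = 1 if mean_probability >= 0.5 else 0

print("votes:", votes, "-> majority vote:", majority_vote)
print("mean probability:", round(mean_probability, 3), "-> averaged prediction:", averaged_prediction)
```

Averaging lets one confident tree outweigh two uncertain ones, which a plain vote cannot do.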

--- model evaluation<br>
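the score of a regressor is the coefficient of determination (r-square)<br>
r-square is the proportion of the target variance explained by the model's predictions: 1.0 is a perfect fit, 0.0 matches always predicting the target mean, and negative values are worse than that<br>
the score of a classifier is the mean accuracy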
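
A minimal sketch of r-square computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the toy values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# toy targets (made up for illustration); their mean is 6.0
y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)                 # predictions equal the targets
mean_only = r_squared(y_true, [6.0] * 4)            # always predicting the target mean
reversed_order = r_squared(y_true, y_true[::-1])    # predictions worse than the mean

print(perfect, mean_only, reversed_order)
```

The last case shows that r-square has no lower bound: a model can do arbitrarily worse than predicting the mean.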

#### Selecting regression model

In [None]:
### imports ------------------------------------------------------------------------------------------------------------

import numpy, pandas

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

In [None]:
### preparing data -----------------------------------------------------------------------------------------------------

### loading california housing regression dataset
housing_dict = fetch_california_housing()

### creating california housing dataframe
housing_df = pandas.DataFrame(data=housing_dict["data"], columns=housing_dict["feature_names"])
housing_df["MedHouseVal"] = housing_dict["target"]

### splitting data features/target
features = housing_df.drop(columns="MedHouseVal")
target = housing_df.loc[:, "MedHouseVal"]

### splitting data train/test
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)

In [None]:
### ridge regressor ----------------------------------------------------------------------------------------------------

### instantiating model (random_state only affects the stochastic "sag"/"saga" solvers)
regressor = Ridge(random_state=42)

### training model
regressor.fit(features_train, target_train)

### evaluating model
regressor.score(features_test, target_test)

In [None]:
### random forest regressor --------------------------------------------------------------------------------------------

### instantiating model
regressor = RandomForestRegressor(random_state=42)

### training model
regressor.fit(features_train, target_train)

### evaluating model
regressor.score(features_test, target_test)

#### Selecting classification model

In [None]:
### imports ------------------------------------------------------------------------------------------------------------

from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

In [None]:
### preparing data -----------------------------------------------------------------------------------------------------

### loading heart disease classification data into dataframe
heart_disease = pandas.read_csv("data-heart-disease.csv")

### splitting data features/target
features = heart_disease.drop(columns="target")
target = heart_disease.loc[:, "target"]

### splitting data train/test
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)

In [None]:
### linear support vector classifier -----------------------------------------------------------------------------------

### instantiating model
classifier = LinearSVC(random_state=42)

### training model
classifier.fit(features_train, target_train)

### evaluating model
classifier.score(features_test, target_test)

In [None]:
### random forest classifier -------------------------------------------------------------------------------------------

### instantiating model
classifier = RandomForestClassifier(random_state=42)

### training model
classifier.fit(features_train, target_train)

### evaluating model
classifier.score(features_test, target_test)