## Basic machine learning
There are three main types of statistical learning methods: supervised learning, unsupervised learning, and reinforcement learning. Each method has its own problem space and appropriate use cases.

### Supervised Learning: 
This is where you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output. The ultimate goal is to approximate the mapping function so well that when you have new input data (X), you can predict the output variables (Y) for that data.

It is called "supervised learning" because the process of an algorithm learning from the training dataset is akin to a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher.

The two main types of supervised learning problems are:

* Regression: The output variable is a real or continuous value, such as "salary" or "weight". Algorithms used for these types of problems include Linear Regression, Decision Trees, and Support Vector Regression.
* Classification: The output variable is a category, such as "red" or "blue", or "disease" or "no disease". Algorithms used for these types of problems include Logistic Regression, Naive Bayes, and Random Forest.

### Unsupervised Learning: 
This is where you only have input data (X) and no corresponding output variables. The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.

This is called unsupervised learning because there are no correct answers and there is no teacher. Algorithms are left on their own to discover interesting structure in the data.

The main types of unsupervised learning problems are:

* Clustering: The task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). Algorithms include K-Means, Hierarchical Clustering, and DBSCAN.
* Association: The task of finding interesting relationships or associations among a set of items. This is often used for market basket analysis. Algorithms include Apriori and FP-Growth.
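
As a rough illustration of the association task, the sketch below counts how often pairs of items appear together in a handful of made-up baskets and keeps the pairs whose support (the fraction of baskets containing the pair) reaches a threshold. The `baskets` data, the `min_support` value, and the `frequent_pairs` helper are all assumptions made for this example. It covers only the support-counting idea that algorithms such as Apriori build on, not a full implementation of either algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for this example
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs whose support >= min_support."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair is counted under one canonical ordering
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


pairs = frequent_pairs(baskets, min_support=0.4)
for pair, support in sorted(pairs.items()):
    print(pair, round(support, 2))
```

With this toy data, the pairs that survive the threshold are bread with milk, bread with butter, and eggs with milk.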

### Reinforcement Learning: 
Reinforcement learning is about the interaction between a learning agent and its environment. The agent takes actions in the environment to reach a goal, and the environment responds with a reward or penalty (the reinforcement signal). The agent's task is to learn a policy that maximizes its cumulative reward.

A typical example is learning to play a game like chess. The agent decides the next move, the environment changes (the opponent makes a move), and the agent receives a reward (winning or losing the game).
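
Chess is far too large for a short example, so the sketch below uses a much smaller stand-in: a five-cell corridor in which the agent starts at the left end and is rewarded only when it reaches the right end. It applies tabular Q-learning, which is one common reinforcement learning algorithm among many. The environment, the `step` function, and the values of `ALPHA`, `GAMMA`, and `EPSILON` are assumptions chosen for illustration.

```python
import random

random.seed(0)

N_STATES = 5        # states 0..4; state 4 is the goal
ACTIONS = [-1, +1]  # index 0 moves left, index 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # illustrative hyperparameters


def step(state, action):
    """Toy environment: move along the corridor, reward 1 on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


# Q[state][action_index] estimates the long-term value of an action in a state
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the current estimates, sometimes explore
        if random.random() < EPSILON:
            a = random.randrange(2)
        else:
            a = max(range(2), key=lambda i: Q[state][i])
        next_state, reward, done = step(state, ACTIONS[a])
        # Move the estimate toward reward + discounted best future value
        target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
        Q[state][a] += ALPHA * (target - Q[state][a])
        state = next_state

# Greedy policy for the non-terminal states
policy = ["left" if q[0] > q[1] else "right" for q in Q[:-1]]
print(policy)
```

After training, the greedy policy moves right in every non-terminal state, which is the shortest path to the reward.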

#### Scikit-learn
Scikit-learn is an open-source machine learning library in Python. It features various machine learning algorithms, including those for classification, regression, and clustering. It also provides tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

##### Classification: 

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model
clf = DecisionTreeClassifier()

# Fit the model to the training data
clf.fit(X_train, y_train)

# Use the trained model to make predictions on the test data
predictions = clf.predict(X_test)

##### Regression:

In [None]:
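from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# load_boston was removed in scikit-learn 1.2; the bundled diabetes dataset is used instead
from sklearn.datasets import load_diabetes

# Load dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target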

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model
reg = LinearRegression()

# Fit the model
reg.fit(X_train, y_train)

# Make predictions
predictions = reg.predict(X_test)

##### Clustering: 
Scikit-learn provides several clustering algorithms like K-Means, Hierarchical clustering, DBSCAN, etc. 

In [None]:
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = iris.data

# Initialize the model
kmeans = KMeans(n_clusters=3, random_state=0)

# Fit the model
kmeans.fit(X)

# Get cluster labels for each sample
labels = kmeans.labels_

Scikit-learn also provides many utility functions for preprocessing data, tuning hyperparameters, evaluating models, etc. that make the whole process of building and evaluating machine learning models easier and more efficient.
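
As one possible illustration of these utilities, the sketch below chains a preprocessing step and a model in a `Pipeline` and tunes two hyperparameters with `GridSearchCV`. The choice of `SVC`, the step names `scale` and `model`, and the candidate values in `param_grid` are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Chain preprocessing and the model so scaling is fit on the training folds only
pipe = Pipeline([("scale", StandardScaler()), ("model", SVC())])

# Hypothetical grid: the candidate values are illustrative only
param_grid = {"model__C": [0.1, 1, 10], "model__kernel": ["linear", "rbf"]}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the best pipeline on held-out data
```

Putting the scaler inside the pipeline keeps information from the validation folds out of the preprocessing step during the search.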

### Model evaluation
Model evaluation is a critical part of the machine learning pipeline. Once we've trained our model, we need to know how well it's performing. Model evaluation is a bit different for classification and regression problems due to the different nature of their output.
#### Classification Model Evaluation Metrics:

##### Accuracy: 
It is the ratio of correctly predicted observations to the total number of observations. It can be misleading with imbalanced classes, because a model that always predicts the majority class still scores highly.

In [None]:
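from sklearn.metrics import accuracy_score
# y_test and predictions must come from the classification example (clf),
# not the regression example, which reuses the same variable names
accuracy = accuracy_score(y_test, predictions)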
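
To see why accuracy can mislead on imbalanced data, consider the small sketch below. The labels are made up for illustration: 95 negatives and 5 positives, scored against a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 95 negatives (0) and 5 positives (1)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(acc)  # 0.95
print(rec)  # 0.0
```

The accuracy looks strong even though not a single positive case is found.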

##### Confusion Matrix: 
A table that counts predictions by actual class (rows) and predicted class (columns), so it shows which classes the model confuses with one another. Metrics such as precision, recall, and F1-score can be derived from its entries.

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predictions)


##### Precision: 
It is the ratio of correctly predicted positive observations to the total predicted positives.

##### Recall (Sensitivity): 
It is the ratio of correctly predicted positive observations to all observations that are actually positive.

##### F1 Score: 
It is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). It is high only when both precision and recall are high.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
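
The three definitions above can be checked by hand from the entries of a confusion matrix. The sketch below uses made-up binary labels (an assumption for illustration) and compares the manual results with scikit-learn's `precision_score`, `recall_score`, and `f1_score`.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up binary labels for illustration (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # correct positives / predicted positives
recall = tp / (tp + fn)     # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(tn, fp, fn, tp)
print(precision, recall, f1)

# The library functions should agree with the manual calculation
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```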

##### ROC-AUC: 
The ROC curve plots the true positive rate against the false positive rate across all classification thresholds. AUC stands for "Area Under the ROC Curve". An excellent model has an AUC near 1, a model that ranks no better than chance has an AUC near 0.5, and an AUC near 0 means the predictions are systematically inverted.

In [None]:
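from sklearn.metrics import roc_auc_score
# Iris has three classes, so pass class probabilities (from the classification split) and a multi-class strategy
proba = clf.predict_proba(X_test)
roc_auc = roc_auc_score(y_test, proba, multi_class="ovr")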
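
For a binary problem, AUC is computed from scores rather than hard labels, and it depends only on how the scores rank positives against negatives. The sketch below uses four made-up samples (an assumption for illustration) to show this, including what happens when the ranking is reversed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted scores for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Three of the four (negative, positive) pairs are ranked correctly
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75

# Points of the ROC curve, one per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)

# Reversing the ranking mirrors the AUC around 0.5
auc_reversed = roc_auc_score(y_true, 1 - y_score)
print(auc_reversed)  # 0.25
```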

#### Regression Model Evaluation Metrics:

##### Mean Absolute Error (MAE): 
It is the mean of the absolute values of the errors. It is the easiest to interpret because it is the average magnitude of the error, in the same units as the target.

##### Mean Squared Error (MSE): 
It is the mean of the squared errors. Because the errors are squared, MSE penalizes large errors more heavily than MAE does, which also makes it more sensitive to outliers.

##### Root Mean Squared Error (RMSE): 
It is the square root of the mean of the squared errors, which puts the error back in the same units as the target. When the residuals have zero mean, it equals their standard deviation.

##### R-squared (Coefficient of determination): 
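It is the proportion of the variance in the target that the model explains. A value of 1 indicates a perfect fit, and 0 corresponds to a model that always predicts the mean of the target. On held-out data the value can be negative when the model does worse than that baseline. A higher value is better.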

In [None]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# y_test and predictions here must come from the regression example (reg)
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, predictions)
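
The definitions of these four metrics can be written out directly with NumPy. The sketch below does so on four made-up values (an assumption for illustration) and checks the results against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors ** 2)     # mean squared error
rmse = np.sqrt(mse)            # root mean squared error

# R-squared: 1 minus (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)

# The library functions should agree with the manual calculation
print(mean_absolute_error(y_true, y_pred), mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```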

The choice of metric depends on your objective. In cancer screening, for example, we may prefer higher recall over higher precision, because missing a positive case is costlier than raising a false alarm. In email spam detection, we usually favor precision, so that important emails do not end up in the spam folder.
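
One common way to trade precision against recall is to move the decision threshold applied to predicted probabilities. The sketch below shows the effect on made-up labels and probabilities; the data and the three threshold values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.45, 0.6, 0.7, 0.8, 0.9])

results = {}
for threshold in (0.3, 0.5, 0.75):
    # Predict positive when the probability reaches the threshold
    y_pred = (y_prob >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    results[threshold] = (p, r)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, the lowest threshold finds every positive at the cost of more false alarms, while the highest threshold makes no false alarms but misses half of the positives.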

In the case of regression, lower values of MAE, MSE, or RMSE suggest a better fit to the data. A higher R-squared indicates a higher proportion of variance explained by the model.