# WEEK 6

## Introduction to Scikit-learn


This week covers:

- Basic machine learning concepts: training data, features, labels, and models
- An overview of the scikit-learn API and its core interfaces: estimators, transformers, and predictors
- Examples of how to use scikit-learn to train and evaluate simple machine learning models
- Tips on how to preprocess and prepare data for use with scikit-learn
- Techniques for evaluating model performance, such as train/test splits and cross-validation
- Strategies for improving model performance, such as hyperparameter tuning and ensemble methods
- Resources for learning more about scikit-learn and machine learning in general

### Installing Scikit-learn

In [None]:
#!pip install scikit-learn  # Uncomment if you are using Google Colab or another pip-based environment
#!conda install -c conda-forge scikit-learn --yes  # Uncomment if you use conda on your own computer

### Importing Scikit-learn

In [1]:
import sklearn as sk
import numpy as np      
import pandas as pd
import matplotlib.pyplot as plt


### Important basic concepts

**Training data**: In machine learning, we train models using a dataset of examples. This dataset is called the training data. The training data consists of a set of features and labels.

**Features**: A feature is an individual measurable property or characteristic of a phenomenon being observed. In the context of machine learning, features are typically numeric values or strings that represent some aspect of the data. For example, in a dataset of customer information, features might include the customer's age, gender, and income level.

**Labels**: A label is the correct answer for a given example in the training data. In supervised learning, the goal is to train a model to make predictions for new, unseen examples by learning the relationship between the features and labels in the training data. For example, in a dataset of customer information, the label might be whether or not the customer is likely to purchase a particular product.
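To make features and labels concrete, here is a minimal sketch. It uses the Iris dataset that ships with scikit-learn rather than the customer example above, and the names X and y are just the usual convention for the feature matrix and the label vector.

```python
from sklearn.datasets import load_iris

# Load the Iris dataset as a feature matrix X and a label vector y
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4): 150 examples, 4 features each
print(y.shape)             # (150,): one label per example
print(iris.feature_names)  # names of the four feature columns
print(iris.target_names)   # class names that the integer labels refer to
```

Each row of X is one example, each column is one feature, and y holds the label that belongs to the row at the same position.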

**Models**: A machine learning model is a mathematical representation of a system or process that is being studied. The model is trained on the training data and is then used to make predictions on new, unseen examples. There are many different types of machine learning models, including linear regression, logistic regression, decision trees, and neural networks. The choice of model depends on the nature of the problem being solved and the type of data being used.

**Estimator**: An estimator is any object that learns from data. This includes machine learning models as well as preprocessing objects such as scalers and encoders. Every estimator implements a fit method, which learns parameters from a dataset. What an estimator can do after fitting depends on its type: transformers add a transform method, and predictors add a predict method.

**Transformer**: A transformer is an estimator that can transform data, and it is often used as a step in a data processing pipeline. Transformers implement a transform method, which takes in a dataset and returns a transformed version of it. Most also provide fit_transform, which fits the transformer and transforms the data in one call.
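As an illustration of a transformer, the sketch below uses StandardScaler, which rescales each feature column to zero mean and unit variance. The small array X is made-up data chosen only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: 3 examples, 2 features (made-up values)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X)                   # learn the mean and standard deviation of each column
X_scaled = scaler.transform(X)  # apply the learned scaling to the data

print(scaler.mean_)             # per-column means learned during fit
print(X_scaled)                 # each column now has mean 0 and standard deviation 1
```

Fitting and transforming are separate steps so that the same scaling learned on one dataset can later be applied to another.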

**Predictor**: A predictor is an estimator that makes predictions based on a trained model. Predictors implement a predict method, which takes in a dataset and returns predictions for each example in the dataset.
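A predictor follows the same pattern, with predict in place of transform. The sketch below uses KNeighborsClassifier on a tiny made-up dataset (one feature, two classes) so that the result is easy to check by eye.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up): small values belong to class 0, large values to class 1
X = np.array([[0.0], [1.0], [9.0], [10.0]])
y = np.array([0, 0, 1, 1])

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)  # learn from the labelled examples

# Predict labels for two new, unseen examples
X_new = np.array([[0.2], [9.8]])
pred = clf.predict(X_new)
print(pred)  # [0 1]
```

The call returns one prediction per row of the input, in the same order.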

## Splitting data

**Splitting data into train and test sets**: To evaluate a machine learning model, it is common to split the data into a training set and a test set. The model is trained on the training set, and then the test set is used to evaluate its performance. scikit-learn provides the train_test_split function for this purpose.

Let's start by loading the Iris dataset from scikit-learn, as in the previous lesson.

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# test_size=0.3 means that 30% of the data is used for testing and 70% for training
# random_state=0 fixes the random seed, so the data is split in the same way on every run
# This makes the results reproducible, which is useful for debugging
# If you want to get the same results as me, use random_state=0
# If you want a different random split on every run, use random_state=None (the default)
# The split returns 4 variables:
# X_train: The features of the training set
# X_test: The features of the test set
# y_train: The labels of the training set
# y_test: The labels of the test set
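
A quick way to check a split is to look at the shapes of the resulting arrays. The sketch below repeats the split so that it runs on its own, and also confirms that reusing the same random_state reproduces the same test set. The name X_test_again is introduced only for this check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 30% of the 150 examples go to the test set, the rest to the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)

# Splitting again with the same random_state gives the same test set
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
print(np.array_equal(X_test, X_test_again))  # True
```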


Now, let's create a basic predictive model:

In [8]:
# Create a model
from sklearn import neighbors

model = neighbors.KNeighborsClassifier(n_neighbors=3) # You can change this to any model you want


# Train a model on the training data
model.fit(X_train, y_train)
print(model)

# Make predictions on the test data
y_pred = model.predict(X_test)
# y_pred is a NumPy array with one predicted label for each data point in the test set


KNeighborsClassifier(n_neighbors=3)


Now, let's evaluate this model by its accuracy.

In [9]:

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', accuracy)

Accuracy:  0.9777777777777777
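
Accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand on a small made-up example and compares the result with accuracy_score. The arrays y_true and y_hat are illustrative values, not output from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions: 4 of the 5 predictions are correct
y_true = np.array([0, 1, 2, 2, 1])
y_hat = np.array([0, 1, 2, 1, 1])

manual = np.mean(y_true == y_hat)        # fraction of matching positions
library = accuracy_score(y_true, y_hat)  # the same quantity from scikit-learn

print(manual, library)  # 0.8 0.8
```

By the same arithmetic, an accuracy of 0.9777777777777777 on a test set of 45 examples corresponds to 44 correct predictions out of 45.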
