In [2]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [8]:
import sklearn
from sklearn.datasets import load_iris

In [5]:
pip show scikit-learn

Name: scikit-learn
Version: 0.24.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: c:\users\prakriti.regmi\appdata\local\programs\python\python36\lib\site-packages
Requires: joblib, numpy, scipy, threadpoolctl
Required-by: sklearn
Note: you may need to restart the kernel to use updated packages.


## Preparing Data - Sample Dataset

In [2]:
import pandas as pd

data = pd.read_csv("bank.csv")
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [6]:
data["balance"].unique()

array([ 2143,    29,     2, ...,  8205, 14204, 16353], dtype=int64)

### Feature Vector

In [10]:
X = data.drop(columns=["y"])
print(X.shape)  # (n_samples, n_features)

(45211, 16)


In [11]:
X.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown


### Target Vector

In [13]:
y = data["y"]
print(y.shape)
print(type(y))

(45211,)
<class 'pandas.core.series.Series'>


## 2.2.1 Preparing Data - Sample Dataset

**as_frame**
* Available for datasets that support tabular data (e.g., `load_iris`, `load_wine`).
* If True, returns a Pandas DataFrame/Series for data and target.


In [8]:
df = load_iris(as_frame=True)
df.data.head()


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [9]:
df = load_iris()
df.data.head()

AttributeError: 'numpy.ndarray' object has no attribute 'head'

**return_X_y**
* Almost all dataset loader functions (e.g., `load_iris`, `load_wine`,
  `load_digits`) include this parameter.
* If True, returns (X, y) directly.
* If False (default), returns a Bunch object, which contains
  additional metadata.


In [16]:
from sklearn.datasets import load_iris

# Default behavior: return_X_y=False
bunch_data = load_iris(return_X_y=False)
print("Keys:", bunch_data.keys())
print("Feature names:", bunch_data.feature_names)
print("Target names:", bunch_data.target_names)

print("----")
# Using return_X_y=True
X, y = load_iris(return_X_y=True)
print("Shape of X (features):", X.shape)
print("Shape of y (target):", y.shape)


Keys: dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
----
Shape of X (features): (150, 4)
Shape of y (target): (150,)


# Data Pre-Processing in Scikit-learn

### ⚠️ Caution: 
This pre-processing is done **after cleaning and analyzing your data**.

---

## Pre-Processing and Transformers in Scikit-learn

### **Transformers**  
Transformers are objects designed to transform data into a format that is better suited for machine learning models. In Scikit-learn, pre-processing methods are often implemented as **Transformers**.

Transformers provide a consistent interface through the following key methods:

1. **`fit(X)`**: Learns the transformation parameters.
2. **`transform(X)`**: Uses the learned parameters to transform data.
3. **`fit_transform(X)`**: Combines both `fit` and `transform` in a single step.
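The split between `fit` and `transform` matters most when training and test data are handled separately. The following is a minimal sketch, using made-up toy arrays (`X_train` and `X_test` are illustrative, not from the notebook's datasets), of learning the parameters on training data only and reusing them on new data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, assumed purely for illustration
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                        # learns mean_ and scale_ from the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the training statistics, no refitting

print("Learned mean:", scaler.mean_)
print("Scaled test data:", X_test_scaled)
```

Calling `fit_transform` on the test set instead would recompute the statistics from the test data, so the two sets would no longer be on the same scale.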

---

### **`sklearn.preprocessing` Module**  
This module provides methods and utilities to preprocess data for machine learning models. Pre-processing helps improve the performance of models by:

- **Scaling**: Adjusting feature values to have similar ranges (e.g., Min-Max Scaling, Standardization).  
- **Normalizing**: Rescaling each sample (row) so that it has unit norm.  
- **Encoding**: Converting categorical data into numeric representations.

---

By properly pre-processing your data, you can make it more suitable for machine learning algorithms, leading to improved model performance.


# 2.4.1 Key Features of `sklearn.preprocessing` Module

The `sklearn.preprocessing` module provides essential tools for preparing data for machine learning. Below are its key features:

---

## **1. Scaling Features**  
Scaling ensures uniformity across data dimensions by standardizing or normalizing the range of features.

### Examples:  
- **`StandardScaler`**: Standardizes features to have a zero mean and unit variance.  
- **`MinMaxScaler`**: Scales features to a specified range (e.g., [0, 1]).
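
As a small sketch of min-max scaling, the example below uses a toy list of values (assumed for illustration only). Each column is mapped to the requested range with `(x - min) / (max - min)`:

```python
from sklearn.preprocessing import MinMaxScaler

# Toy data: two features on very different scales
data = [[1, 10], [2, 20], [3, 30]]

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)

print("Per-feature minimum:", scaler.data_min_)
print("Per-feature maximum:", scaler.data_max_)
print(scaled)
```

After scaling, both columns run from 0 to 1 even though their original ranges differ by a factor of ten.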

---


By leveraging these features, you can preprocess your dataset to ensure it is ready for machine learning algorithms.


In [19]:
from sklearn.preprocessing import StandardScaler

data = [[0, 0], [0, 0], [1, 1], [1, 1]]

scaler = StandardScaler()
scaler.fit(data)
transformed_data = scaler.transform(data)
transformed_data_direct = scaler.fit_transform(data)


print("\nTransformed Data (Using transform):")
print(transformed_data)

print("\nTransformed Data (Using fit_transform):")
print(transformed_data_direct)



Transformed Data (Using transform):
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]

Transformed Data (Using fit_transform):
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]



## **2. Normalization**  
Normalization adjusts rows to a unit norm, ensuring that each row's magnitude is 1.

### Example:  
- **`Normalizer`**: Applies normalization to feature vectors.
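
To make the row-wise behaviour concrete, this sketch reproduces L2 normalization by hand with NumPy and compares it with `Normalizer`. The array `X` is a small toy example chosen for illustration:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[4.0, 1.0], [1.0, 2.0], [3.0, 6.0], [0.0, 8.0]])

# Divide every row by its own Euclidean (L2) length
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
manual = X / row_norms

sk = Normalizer(norm="l2").fit_transform(X)

print("Matches manual computation:", np.allclose(manual, sk))
print("Row lengths after normalization:", np.linalg.norm(sk, axis=1))
```

Rows that point in the same direction, such as `[1, 2]` and `[3, 6]`, normalize to the same vector, because only the direction of each row is kept.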

---


In [22]:
from sklearn.preprocessing import Normalizer

data = [[4, 1], [1, 2], [3, 6], [0, 8]]

normalizer = Normalizer()

normalized_data = normalizer.fit_transform(data)
print("\nNormalized Data (Using fit_transform):")
print(normalized_data)



Normalized Data (Using fit_transform):
[[0.9701425  0.24253563]
 [0.4472136  0.89442719]
 [0.4472136  0.89442719]
 [0.         1.        ]]


## **3. Encoding Categorical Features**  
Encodes categorical data into a numerical format suitable for machine learning models.

### Examples:  
- **`OneHotEncoder`**: Encodes categorical features as a one-hot numeric array.  
- **`LabelEncoder`**: Converts categorical labels into integers.

---

In [25]:
from sklearn.preprocessing import OneHotEncoder

# Example categorical data
data = [['Red'], ['Green'], ['Blue'], ['Green'], ['Red']]
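# `sparse` is the parameter name in the version used here (0.24.2);
# scikit-learn 1.2+ renamed it to `sparse_output`
encoder = OneHotEncoder(sparse=False)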
one_hot_encoded = encoder.fit_transform(data)
print("\nOneHotEncoded Data:")
print(one_hot_encoded)




OneHotEncoded Data:
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
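
The column order of a one-hot array follows the encoder's learned categories, which are sorted. The sketch below inspects `categories_` and maps the encoded rows back with `inverse_transform`. It calls `.toarray()` on the default sparse output rather than passing `sparse=False`, which is a choice made here so the snippet does not depend on the parameter name:

```python
from sklearn.preprocessing import OneHotEncoder

data = [["Red"], ["Green"], ["Blue"], ["Green"], ["Red"]]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(data).toarray()  # convert the sparse matrix to a dense array

# One array of categories per input column, in sorted order
print("Categories:", encoder.categories_)

# Recover the original labels from the one-hot rows
decoded = encoder.inverse_transform(encoded)
print("Decoded:", decoded.ravel().tolist())
```

The first output column therefore corresponds to `Blue`, the second to `Green`, and the third to `Red`.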


In [26]:
from sklearn.preprocessing import LabelEncoder

labels = ['Red', 'Green', 'Blue', 'Green', 'Red']
print("\nOriginal Labels:")
print(labels)

label_encoder = LabelEncoder()

integer_encoded = label_encoder.fit_transform(labels)
print("\nLabelEncoded Data:")
print(integer_encoded)

# Mapping of classes
print("\nMapping of Classes:")
for encoded_label, class_label in enumerate(label_encoder.classes_):
    print(f"{class_label}: {encoded_label}")



Original Labels:
['Red', 'Green', 'Blue', 'Green', 'Red']

LabelEncoded Data:
[2 1 0 1 2]

Mapping of Classes:
Blue: 0
Green: 1
Red: 2


---


## **2.5 Getting Ready For Training: Train-Test Split**

### **Splitting Data for Training and Testing**
The `train_test_split` function is used to divide data into training and testing sets, ensuring the model is evaluated on unseen data.

---

### **Parameters and Syntax**

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,  # Proportion of data to be used for testing (float between 0 and 1)
    train_size=None,  # Optional, can be set if test_size is None
    random_state=None,  # Seed for the random number generator (for reproducibility)
    shuffle=True,  # Whether or not to shuffle the data before splitting
    stratify=None  # Ensures class distribution is similar in both train and test sets (useful for classification)
)
```


In [28]:
from sklearn.model_selection import train_test_split
import numpy as np


X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Training Features:", X_train)
print("Testing Features:", X_test)
print("Training Labels:", y_train)
print("Testing Labels:", y_test)


Training Features: [[4]
 [9]
 [3]
 [8]
 [6]
 [2]
 [5]]
Testing Features: [[ 1]
 [ 7]
 [10]]
Training Labels: [1 0 0 1 1 1 0]
Testing Labels: [0 0 1]
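
The effect of `stratify` is easier to see on imbalanced labels. The sketch below builds a synthetic 80/20 label vector (assumed for illustration) and counts the classes on each side of the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Both subsets keep the original 80/20 class ratio
print("Train class counts:", np.bincount(y_train))  # [60 15]
print("Test class counts:", np.bincount(y_test))    # [20  5]
```

Without `stratify=y`, the class counts in a small test set can drift away from the overall ratio purely by chance.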


---

## **1. Estimators in Scikit-learn**

### **Definition**:
Estimators are machine learning objects in Scikit-learn that learn patterns from data and can make predictions based on those patterns.

---

### **Key Methods of Estimators**:
1. **`fit(X, [y])`**:  
   Trains the model using the input data `X` and optional target labels `y`.  

2. **`predict(X)`**:  
   Predicts the target labels for the given input data `X`.  

3. **`predict_proba(X)`**:  
   Predicts probabilities for each class (if the model supports it).  
   *Note*: Not all estimators provide this method.
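
One way to check whether an estimator offers `predict_proba` is a plain `hasattr` test. The sketch below compares `LogisticRegression`, which provides it, with `LinearSVC`, which does not. The `max_iter` values are assumptions chosen only to give the solvers room to converge on this small dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

log_reg = LogisticRegression(max_iter=1000).fit(X, y)
svc = LinearSVC(max_iter=10000).fit(X, y)

# Record which of the two fitted models expose predict_proba
has_proba = {type(m).__name__: hasattr(m, "predict_proba") for m in (log_reg, svc)}
print(has_proba)

# Each row of probabilities sums to 1 across the classes
proba = log_reg.predict_proba(X[:3])
print(proba.round(3))
print("Row sums:", proba.sum(axis=1))
```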

---

### **Types of Estimators**:
Estimators in Scikit-learn include various models for classification and regression tasks.

#### **Examples**:
- **Classification Models**:  
  - Logistic Regression  
  - Decision Trees  

- **Regression Models**:  
  - Linear Regression  
  - Random Forest Regressor  


## How Are Estimator Classes Built in Scikit-learn?

In [30]:
class Estimator:
    def fit(self, X, y=None):
        """Fits estimator to data."""
        # Store the data to simulate learning; the trailing underscore
        # marks attributes that are set during fit (scikit-learn convention)
        self.X_ = X
        self.y_ = y
        return self

    def predict(self, X):
        """Makes predictions using the learned data."""
        # For this simple example, we just return the labels stored during fit.
        # In a real estimator, this would involve applying a trained model to the new data X
        if not hasattr(self, "X_"):
            raise ValueError("Model has not been fitted yet.")
        return self.y_  # Returning the stored labels for demonstration purposes
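
Building on the same `fit`/`predict` pattern, here is a minimal sketch of a working custom estimator. `MajorityClassifier` is a hypothetical toy class invented for this example, not part of scikit-learn. It inherits from `BaseEstimator` and `ClassifierMixin`, which supply `get_params`/`set_params` and a default accuracy-based `score` method:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Toy classifier that always predicts the most frequent training label."""

    def fit(self, X, y):
        # "Learning" here is just counting the labels
        classes, counts = np.unique(y, return_counts=True)
        self.classes_ = classes
        self.majority_ = classes[np.argmax(counts)]
        return self

    def predict(self, X):
        if not hasattr(self, "majority_"):
            raise ValueError("Model has not been fitted yet.")
        # One prediction per input sample, all equal to the majority label
        return np.full(len(X), self.majority_)


clf = MajorityClassifier().fit([[0], [1], [2], [3]], [1, 1, 1, 0])
preds = clf.predict([[5], [6]])
accuracy = clf.score([[5], [6]], [1, 0])  # score() comes from ClassifierMixin

print("Predictions:", preds)
print("Accuracy:", accuracy)
```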


## Example: Build a KNN Classifier


In [31]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load iris dataset
data = load_iris()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Set hyper-parameters for the KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)

# Train the model using the training data
clf.fit(X_train, y_train)

# Make predictions on the first 5 samples from X
print("Predictions for the first 5 samples:", clf.predict(X[:5]))

# Compute class probabilities for the first 5 samples
print("Class probabilities for the first 5 samples:", clf.predict_proba(X[:5]))


Predictions for the first 5 samples: [0 0 0 0 0]
Class probabilities for the first 5 samples: [[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]


##  Model Evaluation with sklearn.metrics

In [32]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Example for regression
y_true_regression = [3, -0.5, 2, 7]
y_pred_regression = [2.5, 0.0, 2, 8]

# Calculate metrics
mae = mean_absolute_error(y_true_regression, y_pred_regression)
mse = mean_squared_error(y_true_regression, y_pred_regression)
r2 = r2_score(y_true_regression, y_pred_regression)

# Print the results
print(f"MAE: {mae}, MSE: {mse}, R2: {r2}")


MAE: 0.5, MSE: 0.375, R2: 0.9486081370449679
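
For reference, the same three regression metrics can be computed by hand with NumPy. This sketch applies the standard definitions to the example values used above:

```python
import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean of |error|
mse = np.mean(errors ** 2)      # mean of error squared

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae}, MSE: {mse}, R2: {r2}")
```

The results agree with the `sklearn.metrics` values: MAE 0.5, MSE 0.375, and R2 of about 0.9486.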


In [34]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Example for classification
y_true_classification = [0, 1, 1, 0, 1]
y_pred_classification = [0, 1, 0, 0, 1]

# Calculate metrics
accuracy = accuracy_score(y_true_classification, y_pred_classification)
precision = precision_score(y_true_classification, y_pred_classification)
recall = recall_score(y_true_classification, y_pred_classification)
f1 = f1_score(y_true_classification, y_pred_classification)

print(f"Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}")


Accuracy: 0.8, Precision: 1.0, Recall: 0.6666666666666666, F1-Score: 0.8
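
These classification metrics all derive from the four confusion-matrix counts. As a sketch, the example below extracts the counts with `confusion_matrix` and recomputes each metric from its definition, using the same labels as above:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}")
```

Here one positive sample is missed (FN = 1) and no negative is misclassified (FP = 0), which is why precision is 1.0 while recall is 2/3.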


## Task Overview 

## Diabetes Classification

The **Pima Indian Diabetes Dataset** is available from sources like Kaggle and contains the following columns:

- **Pregnancies**
- **Glucose**
- **BloodPressure**
- **SkinThickness**
- **Insulin**
- **BMI**
- **DiabetesPedigreeFunction**
- **Age**
- **Outcome** (whether the patient has diabetes or not)

### Tasks:

#### 1. **Regression Task:**
   - **Goal:** Predict the **Blood Pressure** of the patients based on other features.
   - **Model to Use:** **Linear Regression** from Scikit-learn.

#### 2. **Classification Task:**
   - **Goal:** Predict whether the patient has diabetes (target column: **Outcome**).
   - **Models to Use:** **Logistic Regression** or **K-Nearest Neighbors (KNN)**.

### Evaluation:
Once the models are built, evaluate them appropriately using relevant metrics such as accuracy, precision, recall, F1 score, or mean squared error (for regression tasks).


In [5]:
#Task1
#import dataset
import pandas as pd
data = pd.read_csv('diabetes.csv')


In [2]:
#split the data
from sklearn.model_selection import train_test_split
X = data.drop('BloodPressure', axis=1)
y = data['BloodPressure']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [3]:
#train the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)


In [4]:
#test the model
from sklearn.metrics import mean_squared_error
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')


Mean Squared Error: 402.8523890426409


In [6]:
#Task2

#split data
from sklearn.model_selection import train_test_split
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [7]:
#train the model
from sklearn.linear_model import LogisticRegression
model_logreg = LogisticRegression()
model_logreg.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
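
The warning above suggests either raising `max_iter` or scaling the features. A minimal sketch of both remedies combined is shown below. It uses the bundled `load_breast_cancer` dataset as a stand-in (an assumption made so the snippet runs without `diabetes.csv`), so its score is not comparable with the results in this notebook:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; replace with your own X and y
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only,
# then applies the same scaling before every prediction
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

accuracy = pipe.score(X_test, y_test)
n_iter = pipe.named_steps["logisticregression"].n_iter_[0]
print(f"Test accuracy: {accuracy:.3f}")
print(f"Solver iterations used: {n_iter}")
```

Wrapping the scaler and the model in one pipeline also avoids leaking test-set statistics into the scaling step.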


In [8]:
#train the KNN model
from sklearn.neighbors import KNeighborsClassifier
model_knn = KNeighborsClassifier()
model_knn.fit(X_train, y_train)


In [10]:
#compare the metrics of both models
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred_logreg = model_logreg.predict(X_test)
y_pred_knn = model_knn.predict(X_test)

print('Logistic Regression:')
print(f'Accuracy: {accuracy_score(y_test, y_pred_logreg)}')
print(f'Precision: {precision_score(y_test, y_pred_logreg)}')
print(f'Recall: {recall_score(y_test, y_pred_logreg)}')
print(f'F1 Score: {f1_score(y_test, y_pred_logreg)}')

print('\nK-Nearest Neighbors:')
print(f'Accuracy: {accuracy_score(y_test, y_pred_knn)}')
print(f'Precision: {precision_score(y_test, y_pred_knn)}')
print(f'Recall: {recall_score(y_test, y_pred_knn)}')
print(f'F1 Score: {f1_score(y_test, y_pred_knn)}')


Logistic Regression:
Accuracy: 0.7467532467532467
Precision: 0.6379310344827587
Recall: 0.6727272727272727
F1 Score: 0.6548672566371682

K-Nearest Neighbors:
Accuracy: 0.6623376623376623
Precision: 0.5245901639344263
Recall: 0.5818181818181818
F1 Score: 0.5517241379310345
