<a href="https://colab.research.google.com/github/smdr111/ML_roadmap/blob/main/05_ml_04_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Imgur](https://i.imgur.com/5pXzCIu.png)

# Data Science and Artificial Intelligence Practicum

## MODULE 5. Machine Learning

### 5.1 - Steps of an ML project

## Preparing data for ML

### Pipeline - a conveyor

At the start of this topic we talked about automating processes. For this, scikit-learn has a dedicated **pipeline** concept. Translated literally, a pipeline is a gas (or oil) pipe. On its way from point A to point B, the gas passes through several intermediate processing stations.

Our data is the same: between its starting point and the moment it reaches the ML model, it passes through several processing steps. Above we wrote out each step by hand; with a pipeline we can combine all of the steps into a single pipeline (or conveyor).

I translate the word pipeline as "konveyer" (conveyor) because our data undergoes different changes at different stages, just as if it were moving along a conveyor.
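
To make the idea concrete, here is a minimal sketch that runs the same two steps once by hand and once through a `Pipeline`. The toy array `X` and the name `toy_pipeline` are made up for illustration and are not part of the housing example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data with one missing value (made up for illustration)
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [4.0, 40.0]])

# Manual approach: call each step by hand, feeding one output into the next
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
X_manual = StandardScaler().fit_transform(X_imputed)

# Pipeline approach: the same two steps chained into a single object
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
X_piped = toy_pipeline.fit_transform(X)

# Both approaches should give the same numbers
same_result = np.allclose(X_manual, X_piped)
print(same_result)
```

The pipeline version does nothing new; it only packages the hand-written sequence so that it can be run with a single call.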

We split the pipeline into 2 parts:
- Processing the numeric columns
- Processing the text columns

In [None]:
import pandas as pd
import numpy as np
import sklearn  # the scikit-learn library

# URL of the online dataset
URL = "https://github.com/ageron/handson-ml2/blob/master/datasets/housing/housing.csv?raw=true"
df = pd.read_csv(URL)

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
# indices of the columns we need
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # our class is only a transformer, not an estimator
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:  # the bedrooms_per_room column is optional
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
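
The two base classes are what make a hand-written class usable inside a pipeline. The sketch below uses a hypothetical, much smaller transformer (`RatioAdder`, with made-up parameters `num_ix` and `den_ix`) to show what each base class contributes: `TransformerMixin` supplies `fit_transform()`, and `BaseEstimator` supplies `get_params()`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column num_ix divided by column den_ix."""

    def __init__(self, num_ix=0, den_ix=1):
        self.num_ix = num_ix
        self.den_ix = den_ix

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        ratio = X[:, self.num_ix] / X[:, self.den_ix]
        return np.c_[X, ratio]


# Toy data, made up for illustration
X_toy = np.array([[6.0, 2.0],
                  [9.0, 3.0]])

adder = RatioAdder(num_ix=0, den_ix=1)
X_out = adder.fit_transform(X_toy)  # fit_transform is inherited from TransformerMixin
params = adder.get_params()         # get_params is inherited from BaseEstimator

print(X_out)
print(params)
```

Only `fit()` and `transform()` had to be written by hand; the rest comes from the base classes.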

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipeline = Pipeline([
          ('imputer', SimpleImputer(strategy='median')),
          ('attribs_adder', CombinedAttributesAdder(add_bedrooms_per_room = True)),
          ('std_scaler', StandardScaler())
])

Above we created a pipeline for the numeric columns (`num_pipeline`).

The pipeline consists of 3 transformers (`imputer`, `attribs_adder` and `std_scaler`); hopefully their purpose is clear to you by now.
You can give these transformers any names you like, as long as each name is unique and contains no double underscore (`__`).
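
The step names are more than labels: they let you reach an individual step later. The following is a minimal sketch on a toy pipeline (the data and the name `toy_pipeline` are made up for illustration). It looks a step up through `named_steps` and changes one of its parameters with the `<step name>__<parameter>` convention of `set_params()`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Look up a step by the name it was given
imputer_step = toy_pipeline.named_steps["imputer"]
print(imputer_step.strategy)

# Change a parameter of one step: <step name>__<parameter name>
toy_pipeline.set_params(imputer__strategy="mean")
new_strategy = toy_pipeline.named_steps["imputer"].strategy

# Toy data with one missing value (made up for illustration)
X = np.array([[1.0], [2.0], [np.nan], [9.0]])
toy_pipeline.fit(X)

# After fitting, the imputer stores the value it will fill in (the mean of 1, 2 and 9)
fill_value = toy_pipeline.named_steps["imputer"].statistics_[0]
print(fill_value)
```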

To run the pipeline, we call the `.fit_transform()` method.

In [None]:
num_pipeline.fit_transform(housing_num)

array([[ 1.27258656, -1.3728112 ,  0.34849025, ..., -0.17491646,
         0.05137609, -0.2117846 ],
       [ 0.70916212, -0.87669601,  1.61811813, ..., -0.40283542,
        -0.11736222,  0.34218528],
       [-0.44760309, -0.46014647, -1.95271028, ...,  0.08821601,
        -0.03227969, -0.66165785],
       ...,
       [ 0.59946887, -0.75500738,  0.58654547, ..., -0.60675918,
         0.02030568,  0.99951387],
       [-1.18553953,  0.90651045, -1.07984112, ...,  0.40217517,
         0.00707608, -0.79086209],
       [-1.41489815,  0.99543676,  1.85617335, ..., -0.85144571,
        -0.08535429,  1.69520292]])
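
A point worth illustrating here is the difference between `.fit_transform()` and `.transform()`. The sketch below uses made-up toy arrays (`X_train`, `X_new`), not the housing data. `.fit_transform()` learns the median and the scaling statistics from the data it is given, while `.transform()` only reuses what was already learned.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Toy data, made up for illustration
X_train = np.array([[1.0], [2.0], [3.0], [np.nan]])
X_new = np.array([[np.nan], [4.0]])

# fit_transform learns the median and the mean/std from X_train, then transforms it
X_train_prepared = toy_pipeline.fit_transform(X_train)

# transform reuses the learned values; nothing is re-estimated from X_new
X_new_prepared = toy_pipeline.transform(X_new)

learned_median = toy_pipeline.named_steps["imputer"].statistics_[0]
print(learned_median)
print(X_new_prepared)
```

In this sketch the missing value in `X_new` is filled with the median learned from `X_train`, which is the usual reason for calling `.transform()` rather than `.fit_transform()` on data that was not used for fitting.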

The pipeline for the numeric columns is ready, but what about the text columns?

For this we use the dedicated `ColumnTransformer` object, which is also a kind of pipeline. We also add the `num_pipeline` built above to the `ColumnTransformer`.

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs)
])

And here is the final, complete pipeline (`full_pipeline`).
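
To see what a `ColumnTransformer` produces, here is a minimal sketch on a made-up three-column DataFrame (`toy_df` and its column names are assumptions for illustration only). Each named group of columns is sent to its own transformer, and the results are placed side by side in the order the transformers were listed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: two numeric columns and one text column with 3 categories
toy_df = pd.DataFrame({
    "rooms": [3.0, 5.0, 4.0, 6.0],
    "income": [2.5, 4.0, 3.0, 8.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

toy_transformer = ColumnTransformer(
    [
        ("num", StandardScaler(), ["rooms", "income"]),
        ("cat", OneHotEncoder(), ["proximity"]),
    ],
    sparse_threshold=0,  # always return a dense array, which is easier to print
)

toy_prepared = toy_transformer.fit_transform(toy_df)

# 2 scaled numeric columns + 3 one-hot columns (one per category)
print(toy_prepared.shape)

# The fitted encoder remembers the categories it found, in sorted order
categories = list(toy_transformer.named_transformers_["cat"].categories_[0])
print(categories)
```

The numeric columns come first and the one-hot columns last because that is the order of the `("num", ...)` and `("cat", ...)` entries.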

To run the pipeline, it is enough to call the `.fit_transform()` method.

In [None]:
housing_prepared = full_pipeline.fit_transform(housing)

In [None]:
housing_prepared[0:5,:]

array([[ 1.27258656, -1.3728112 ,  0.34849025,  0.22256942,  0.21122752,
         0.76827628,  0.32290591, -0.326196  , -0.17491646,  0.05137609,
        -0.2117846 ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [ 0.70916212, -0.87669601,  1.61811813,  0.34029326,  0.59309419,
        -0.09890135,  0.6720272 , -0.03584338, -0.40283542, -0.11736222,
         0.34218528,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-0.44760309, -0.46014647, -1.95271028, -0.34259695, -0.49522582,
        -0.44981806, -0.43046109,  0.14470145,  0.08821601, -0.03227969,
        -0.66165785,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [ 1.23269811, -1.38217186,  0.58654547, -0.56148971, -0.40930582,
        -0.00743434, -0.38058662, -1.01786438, -0.60001532,  0.07750687,
         0.78303162,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-0.10855122,  0.5320839 ,  1

The data is ready for ML.