
## Dataset

### 1- Load the dataset

Please ensure that you have the latest scikit-learn version.
Run `!pip install --upgrade scikit-learn` if necessary.

You can load a dataset from the predefined datasets in scikit-learn, or fetch an open-source dataset from a repository such as [OpenML](https://www.openml.org).

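#### 1-1- Regression data

The iris dataset is used here as a small, purely numeric example. Note that its target holds integer class labels (0, 1, 2) rather than a continuous value.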

In [38]:
from sklearn.datasets import load_iris
import pandas as pd

dataset = load_iris(as_frame=True)
reg_data = pd.DataFrame(dataset.data, columns=dataset.feature_names)
reg_target = dataset.target
dataset.frame.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


#### 1-2- Classification data

In [39]:
from sklearn.datasets import fetch_openml
import pandas as pd

dataset = fetch_openml(name='adult', version=2, as_frame=True)
cls_data = pd.DataFrame(dataset.data, columns=dataset.feature_names)
cls_target = dataset.target
dataset.frame.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | Private | 226802.0 | 11th | 7.0 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 1 | 38.0 | Private | 89814.0 | HS-grad | 9.0 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0.0 | 0.0 | 50.0 | United-States | <=50K |
| 2 | 28.0 | Local-gov | 336951.0 | Assoc-acdm | 12.0 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0.0 | 0.0 | 40.0 | United-States | >50K |
| 3 | 44.0 | Private | 160323.0 | Some-college | 10.0 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688.0 | 0.0 | 40.0 | United-States | >50K |
| 4 | 18.0 | | 103497.0 | Some-college | 10.0 | Never-married | | Own-child | White | Female | 0.0 | 0.0 | 30.0 | United-States | <=50K |


### 2- Train Test Split

In [40]:
from sklearn.model_selection import train_test_split

X_train_reg, X_test_reg, y_train_reg, y_test_reg = \
    train_test_split(reg_data, reg_target, random_state=1, test_size=0.1)
    
X_train_cls, X_test_cls, y_train_cls, y_test_cls = \
    train_test_split(cls_data, cls_target, random_state=1, test_size=0.1)
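
For classification targets it is often useful to keep the class proportions identical in both splits. The following is a minimal sketch, assuming the iris data loaded earlier and the `stratify` parameter of `train_test_split`; the variable names are illustrative and not part of the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and target as pandas objects
X, y = load_iris(return_X_y=True, as_frame=True)

# stratify=y keeps the class proportions equal in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts().sort_index().tolist())
```

With 150 balanced samples and `test_size=0.1`, each class contributes the same number of rows to the test split.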

## Preprocessing

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
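
The encoder above is only instantiated. As a rough sketch of how it might be used, the example below fits it on a hypothetical toy column whose values resemble the `education` column of the adult dataset; the toy frame is an assumption made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column (not the real adult data)
df = pd.DataFrame({"education": ["HS-grad", "11th", "Some-college", "HS-grad"]})

ordinal_encoder = OrdinalEncoder()
# fit_transform learns the categories and maps each one to an integer code
encoded = ordinal_encoder.fit_transform(df[["education"]])

print(ordinal_encoder.categories_[0])  # learned categories, in sorted order
print(encoded.ravel())                 # one float code per row
```

By default the categories are sorted, so the codes carry no meaningful ranking unless an explicit order is passed through the `categories` parameter.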

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only (all iris features are numeric)
scaler = StandardScaler()
scaler.fit(X_train_reg)
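
A scaler is usually fitted on the training split and then applied to both splits, so that no test-set statistics leak into training. A minimal self-contained sketch, assuming the iris data and the same split settings used earlier:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, test_size=0.1
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on the training split
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# After scaling, each training feature has mean ~0 and standard deviation ~1
print(np.round(X_train_scaled.mean(axis=0), 6))
print(np.round(X_train_scaled.std(axis=0), 6))
```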

## Model

Activate the diagram display to see the visualized model.

In [43]:
from sklearn import set_config
set_config(display='diagram')

### 1- Regression

#### 1-1- Linear Regression

In [46]:
from sklearn.linear_model import LinearRegression

X, y = X_train_reg, y_train_reg
reg = LinearRegression()  # a regressor, so named `reg` rather than `clf`
reg.fit(X, y)

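#### 1-2- Logistic Regression

Despite its name, logistic regression is a classification model. It works on this data because the iris target consists of class labels.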

In [48]:
from sklearn.linear_model import LogisticRegression

X, y = X_train_reg, y_train_reg
clf = LogisticRegression(random_state=0, max_iter=1000)
clf.fit(X, y)

### 2- Classification

#### 2-1- Decision Tree

In [59]:
from sklearn.tree import DecisionTreeClassifier

X, y = X_train_cls.select_dtypes(exclude='category'), y_train_cls
clf = DecisionTreeClassifier()
clf.fit(X, y)
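
The cell above drops the categorical columns. One possible way to keep them is to encode them inside a pipeline. The sketch below uses a hypothetical toy frame that mimics two adult columns; the data, the names `X_toy` and `y_toy`, and the choice of `OrdinalEncoder` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data (not the real adult dataset)
X_toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 52],
    "workclass": pd.Categorical(
        ["Private", "Private", "Local-gov", "Private", "Private", "Local-gov"]
    ),
})
y_toy = pd.Series(["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"])

# Encode the categorical columns as integers, pass numeric columns through unchanged
preprocessor = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     make_column_selector(dtype_include="category")),
    remainder="passthrough",
)

model = make_pipeline(preprocessor, DecisionTreeClassifier(random_state=0))
model.fit(X_toy, y_toy)
print(model.predict(X_toy))
```

Wrapping the encoder and the tree in one pipeline means the same encoding is applied automatically at prediction time.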

## Evaluation

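Predict with a fitted model, then evaluate its performance.

### 1- Predict

At this point `clf`, `X` and `y` still refer to the decision tree and its training data from section 2-1, so the predictions are class labels:

In [60]:
y_pred = clf.predict(X)
y_pred[:5]

array(['<=50K', '>50K', '>50K', '<=50K', '>50K'], dtype=object)

The fraction of predictions that match the labels is the accuracy on the training set:

In [54]:
(y == y_pred).mean()

0.9984075346361216

To evaluate on held-out data, fit a model on the training split and call `score` on the test split. For a regressor such as `LinearRegression`, `score` returns the coefficient of determination (R²), not the accuracy:

In [61]:
X, y, X_test, y_test = X_train_reg, y_train_reg, X_test_reg, y_test_reg
reg = LinearRegression()
reg.fit(X, y)

In [63]:
r2 = reg.score(X_test, y_test)
print(f"The test R² score using a {reg.__class__.__name__} is {r2:.3f}")

The test R² score using a LinearRegression is 0.930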
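
The same quantities can also be computed with the functions in `sklearn.metrics`. The sketch below is self-contained and assumes the iris data and split settings used earlier. It illustrates that `score` means accuracy for a classifier and R² for a regressor; the model choices are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, test_size=0.1
)

# Classifier: score() is the mean accuracy on the given data
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))

# Regressor: score() is the coefficient of determination (R^2)
reg = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, reg.predict(X_test))

print(f"accuracy_score: {acc:.3f}  clf.score: {clf.score(X_test, y_test):.3f}")
print(f"r2_score:       {r2:.3f}  reg.score: {reg.score(X_test, y_test):.3f}")
```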
