Python For Data Science Cheat Sheet: Scikit-learn
Scikit-learn is an open source Python library that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms using a unified interface.



![Scikit-learn_Cheat_Sheet](https://img-blog.csdnimg.cn/2020071123453956.png)

# A Basic Example


In [1]:
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)


0.631578947368421

# Loading The Data
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that can be converted to numeric arrays, such as pandas DataFrames, are also acceptable.



In [2]:
import numpy as np
X = np.random.random((10,5))
y = np.array(['M','M','F','F','M','F','M','M','F','F'])
X[X < 0.7] = 0
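
Since most entries are zero after thresholding, the same data can also be passed as a SciPy sparse matrix. The sketch below is only an illustration: the names `X_dense`, `X_sparse` and `y_labels` are made up here, and KNN is just one estimator that accepts both input types.

```python
import numpy as np
from scipy import sparse
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_dense = rng.random_sample((10, 5))
X_dense[X_dense < 0.7] = 0                     # mostly zeros, as in the cell above
y_labels = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])

X_sparse = sparse.csr_matrix(X_dense)          # compressed sparse row storage

# The same estimator class accepts either representation
clf_dense = KNeighborsClassifier(n_neighbors=3).fit(X_dense, y_labels)
clf_sparse = KNeighborsClassifier(n_neighbors=3).fit(X_sparse, y_labels)

print(sparse.issparse(X_sparse), clf_sparse.predict(X_sparse).shape)
```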

# Preprocessing The Data  

## Standardization

In [3]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)

## Normalization

In [4]:
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)
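
The two scalers above act along different axes: `StandardScaler` rescales each column (feature) to zero mean and unit variance, while `Normalizer` rescales each row (sample) to unit norm. A small sketch on a made-up array `X_demo` shows the difference:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0],
                   [6.0, 8.0]])

# Normalizer works row by row: every sample ends up with L2 norm 1
X_unit = Normalizer(norm="l2").fit_transform(X_demo)
row_norms = np.linalg.norm(X_unit, axis=1)

# StandardScaler works column by column: every feature ends up with mean 0, std 1
X_std = StandardScaler().fit_transform(X_demo)
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)

print(X_unit[0])   # the row [3, 4] has norm 5, so it becomes [0.6, 0.8]
```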


## Binarization


In [5]:
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)
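
`Binarizer` maps values strictly greater than `threshold` to 1 and everything else to 0. A minimal sketch on a made-up array `X_demo`, with a second threshold of 0.6 chosen only for contrast:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X_demo = np.array([[-1.0, 0.0, 0.5],
                   [ 2.0, 0.7, 0.0]])

# threshold=0.0: only strictly positive values become 1
binary_default = Binarizer(threshold=0.0).fit_transform(X_demo)

# threshold=0.6: 0.5 now falls below the cut and becomes 0
binary_strict = Binarizer(threshold=0.6).fit_transform(X_demo)

print(binary_default)
print(binary_strict)
```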


## Encoding Categorical Features
 

In [6]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(y)
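
`LabelEncoder` sorts the distinct labels and replaces each with its index, and the mapping can be inspected or reversed afterwards. A short sketch using the same `'M'`/`'F'` labels as above (the variable names `labels`, `encoded` and `decoded` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'])

enc = LabelEncoder()
encoded = enc.fit_transform(labels)        # 'F' -> 0, 'M' -> 1 (sorted order)
decoded = enc.inverse_transform(encoded)   # back to the original strings

print(enc.classes_)
print(encoded)
```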


## Imputing Missing Values

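The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be filled with a constant, or with a statistic (mean, median, or most frequent value) of the column in which they occur.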

In [7]:
#from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=0, strategy='mean')
imp.fit_transform(X_train)


array([[-0.91090798, -1.59775374],
       [-1.0271058 ,  0.08448757],
       [ 0.59966379, -1.59775374],
       [ 0.01867465, -0.96691325],
       [ 0.48346596, -0.33607276],
       [-1.25950146,  0.29476773],
       [-1.37569929,  0.71532806],
       [-0.79471015, -1.17719341],
       [-1.14330363,  0.71532806],
       [ 2.45882905,  1.55644871],
       [-0.79471015,  0.71532806],
       [-0.79471015,  1.34616854],
       [-0.21372101, -0.33607276],
       [ 0.83205945, -0.1257926 ],
       [-0.44611666,  1.76672887],
       [ 1.41304859,  0.29476773],
       [ 0.01867465, -0.54635292],
       [ 2.22643339, -0.96691325],
       [-0.32991883, -1.17719341],
       [ 0.13487248,  0.29476773],
       [-1.0271058 ,  1.13588838],
       [-1.49189712, -1.59775374],
       [ 0.59966379, -0.54635292],
       [-1.60809495, -0.33607276],
       [-0.91090798,  1.13588838],
       [ 1.64544425, -0.1257926 ],
       [ 0.25107031,  0.71532806],
       [ 0.48346596, -1.8080339 ],
       [ 1.8778399 ,
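
To compare the strategies side by side, here is a small sketch on a made-up array `X_missing` that uses `np.nan` as the missing-value marker (the cell above treats `0` as missing instead):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([[1.0, 2.0],
                      [np.nan, 4.0],
                      [7.0, np.nan],
                      [1.0, 4.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled[strategy] = imputer.fit_transform(X_missing)

# "constant" replaces every missing entry with fill_value
filled["constant"] = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X_missing)

# Column 0 has observed values 1, 7, 1: mean 3, median 1, most frequent 1
for name, result in filled.items():
    print(name, result[1, 0])
```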

## Generating Polynomial Features

In [8]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
poly.fit_transform(X)

array([[1.        , 0.        , 0.93517581, ..., 0.        , 0.        ,
        0.8979944 ],
       [1.        , 0.        , 0.86286738, ..., 0.        , 0.        ,
        0.26350621],
       [1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.1720051 ],
       ...,
       [1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.        , 0.88345816, ..., 0.33236764, 0.30733483,
        0.28418741],
       [1.        , 0.        , 0.89006998, ..., 0.        , 0.        ,
        0.        ]])
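
The output is easier to read with a lower degree. For two features `a` and `b`, degree 2 produces the columns `1, a, b, a^2, ab, b^2`. The sketch below uses a made-up one-row array `X_demo` and also counts how many columns degree 5 produces for 5 input features, as in the cell above:

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])            # a = 2, b = 3

expanded = PolynomialFeatures(degree=2).fit_transform(X_demo)
print(expanded)                            # [1, a, b, a^2, a*b, b^2]

# With n features and degree d (bias included) there are C(n + d, d) output columns
n_out = PolynomialFeatures(degree=5).fit(np.zeros((1, 5))).n_output_features_
print(n_out, comb(5 + 5, 5))
```

The column count grows quickly with the degree, so high degrees are usually practical only for a handful of input features.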

## Training And Test Data

In [9]:
from sklearn.model_selection import train_test_split
print(X.shape)
print(y.shape)
# X and y must have the same number of samples (first dimension)
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)

(10, 5)
(10,)
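
With no `test_size` given, `train_test_split` holds out 25% of the samples, so 10 rows become 7 training and 3 test rows. The sketch below, on made-up arrays `X_demo` and `y_demo`, shows the default split and a stratified variant that keeps the class proportions equal in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(50).reshape(10, 5)
y_demo = np.array([0, 1] * 5)              # 5 samples per class

# Default: test_size=0.25, rounded up to 3 of 10 samples
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# Stratified: each class is split in the same 60/40 proportion
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=0)
print(np.bincount(y_tr2), np.bincount(y_te2))
```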


# Create Your Model

## Supervised Learning Estimators

Supervised learning models include:
- Linear Regression  
- Support Vector Machines (SVM)  
- Naive Bayes  
- KNN  


In [10]:
### Linear Regression

from sklearn.linear_model import LinearRegression
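# The `normalize` argument was removed in scikit-learn 1.2; scale features with StandardScaler beforehand if needed
lr = LinearRegression()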

### Support Vector Machines (SVM)

from sklearn.svm import SVC
svc = SVC(kernel='linear')

### Naive Bayes

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

### KNN

from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

## Unsupervised Learning Estimators

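Unsupervised learning models include:
- Principal Component Analysis (PCA)
- K Means
- Hierarchical Clustering
- Density-based clustering: Mean Shift
- Density-based clustering: DBSCAN
- Gaussian Mixture Models (GMM) and EM
- Singular Value Decomposition (SVD)
- t-distributed Stochastic Neighbor Embedding (t-SNE)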

In [11]:
### PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)

### K Means

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
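
The other estimators named in the list above follow the same construct-then-fit pattern. The sketch below is one possible way to try them on the iris data; the hyperparameters (`eps=0.5`, `min_samples=5`, three clusters or components) are illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering, DBSCAN, MeanShift
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

X_iris = load_iris().data

# Clustering: each call returns one label per sample
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_iris)
ms_labels = MeanShift().fit_predict(X_iris)                       # bandwidth estimated from the data
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_iris)    # label -1 marks noise points
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit(X_iris).predict(X_iris)

# Dimensionality reduction: each call returns a 2-column embedding
X_svd = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_iris)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_iris)

print(hier_labels.shape, X_svd.shape, X_tsne.shape)
```

Note that `TSNE` has no `transform` method for new data, so it is typically used for visualization rather than as a preprocessing step.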

# Model Fitting

In [12]:
## Supervised learning
lr.fit(X, y)
knn.fit(X_train, y_train)
svc.fit(X_train, y_train)
## Unsupervised Learning
k_means.fit(X_train)
pca_model = pca.fit_transform(X_train)
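
A float `n_components` between 0 and 1 asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on the iris data (used here only as a convenient example) shows how to check what was kept:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data                  # 150 samples, 4 features

pca = PCA(n_components=0.95)               # keep at least 95% of the variance
X_reduced = pca.fit_transform(X_iris)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_)                   # number of components actually kept
print(cumulative)                          # last entry is at least 0.95
```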

# Prediction


In [14]:
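## Supervised Estimators

y_pred = svc.predict(np.random.random((2,5)))  # predicted labels for two random samples
y_pred = lr.predict(X_test)                    # predicted continuous values
y_pred = knn.predict_proba(X_test)             # per-class probability estimates
## Unsupervised Estimators

y_pred = k_means.predict(X_test)               # index of the nearest cluster for each sample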

# Evaluate Your Model's Performance
## Classification Metrics


In [16]:
### Accuracy Score

knn.score(X_test, y_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
### Classification Report

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

### Confusion Matrix

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      0.50      0.67         2
           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         0

    accuracy                           0.33         3
   macro avg       0.33      0.17      0.22         3
weighted avg       0.67      0.33      0.44         3

[[1 0 1]
 [0 0 1]
 [0 0 0]]
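
The numbers above come from a three-sample test set, and `y_pred` at that point appears to hold k-means cluster indices rather than classifier output, so they are hard to interpret. As a sketch under different assumptions (the full iris data, a stratified split, and the illustrative names `clf` and `y_hat`), the same metrics on KNN predictions look like this:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, random_state=33, stratify=y_iris)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)                  # class labels, not probabilities

acc = accuracy_score(y_te, y_hat)          # same value as clf.score(X_te, y_te)
cm = confusion_matrix(y_te, y_hat)         # rows: true class, columns: predicted class

print(acc)
print(classification_report(y_te, y_hat))
print(cm)
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total equals the accuracy.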




## Regression Metrics

In [18]:

### Mean Absolute Error

from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2]
mean_absolute_error(y_true, y_pred)
### Mean Squared Error

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)
### R2 Score

from sklearn.metrics import r2_score
r2_score(y_true, y_pred)


-1.3461538461538463
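
A negative R2 score like the one above means the predictions fit worse than always predicting the mean of `y_true`. For comparison, here is a worked sketch with the same `y_true` and a made-up prediction vector `y_hat` that is close to the targets:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3, -0.5, 2]
y_hat = [2.5, 0.0, 2]                      # hypothetical predictions; errors are 0.5, 0.5, 0

mae = mean_absolute_error(y_true, y_hat)   # (0.5 + 0.5 + 0) / 3
mse = mean_squared_error(y_true, y_hat)    # (0.25 + 0.25 + 0) / 3
r2 = r2_score(y_true, y_hat)               # 1 - 0.5 / 6.5, since sum((y - mean)^2) = 6.5

print(mae, mse, r2)
```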

## Clustering Metrics

In [24]:

### Adjusted Rand Index

from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_true, y_pred)
### Homogeneity

from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)
### V-measure

from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)


0.7336804366512111
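
These metrics compare groupings, not label values, so renaming the clusters does not change the score. A small sketch with made-up label vectors illustrates this, along with the effect of splitting every class into singleton clusters:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

labels_true = [0, 0, 1, 1]
labels_swapped = [1, 1, 0, 0]              # same grouping, cluster names swapped
labels_split = [0, 1, 2, 3]                # every sample in its own cluster

# A pure renaming scores perfectly on all three metrics
ari_swapped = adjusted_rand_score(labels_true, labels_swapped)
hom_swapped = homogeneity_score(labels_true, labels_swapped)
v_swapped = v_measure_score(labels_true, labels_swapped)

# Singleton clusters are homogeneous but not complete, so V-measure drops
ari_split = adjusted_rand_score(labels_true, labels_split)
hom_split = homogeneity_score(labels_true, labels_split)
v_split = v_measure_score(labels_true, labels_split)

print(ari_swapped, hom_swapped, v_swapped)
print(ari_split, hom_split, v_split)
```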

## Cross-Validation


In [26]:
from sklearn.model_selection import cross_val_score

print(cross_val_score(knn, X_train, y_train, cv=4))
print(cross_val_score(lr, X, y, cv=2))

[0.5 0.5 0.5 0. ]
[-2.12560396 -3.31232586]
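
`cross_val_score` uses the estimator's default scorer: accuracy for classifiers and R2 for regressors, which is why the linear regression scores above can be negative. With only a few training rows the fold scores are also very coarse. The sketch below runs 5-fold cross-validation on the full iris data instead, wrapping the scaler and classifier in a pipeline (an assumption made here, so that scaling is refit inside each fold):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# The pipeline is cloned and refit on the training part of every fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X_iris, y_iris, cv=5)

print(scores)
print(scores.mean(), scores.std())
```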




# Tune Your Model
## Grid Search

In [38]:
#from sklearn.grid_search import GridSearchCV
from sklearn.model_selection import GridSearchCV

params = {"n_neighbors": np.arange(1,3), "metric": ["euclidean", "cityblock"]}
print(X_train.shape)
print(y_train.shape)
grid = GridSearchCV(estimator=knn,param_grid=params)
grid.fit(X_train, y_train)

print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)


(7, 5)
(7,)


ValueError: n_splits=5 cannot be greater than the number of members in each class.
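
The `ValueError` above arises because `GridSearchCV` defaults to 5-fold stratified cross-validation for classifiers, and the 7-row training set has fewer than 5 samples in at least one class. Either pass a smaller `cv` or use more data. The sketch below takes the second route with the iris data; the parameter grid is the same as above, while the names `X_iris` and `y_iris` and the explicit `cv=5` are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)    # 50 samples per class

params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params, cv=5)
grid.fit(X_iris, y_iris)

print(grid.best_score_)                    # mean cross-validated accuracy of the best setting
print(grid.best_params_)
print(grid.best_estimator_.n_neighbors)
```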

## Randomized Parameter Optimization


In [32]:
#from sklearn.grid_search import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn,
   param_distributions=params,
   cv=4,
   n_iter=8,
   random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)

1.0
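
`param_distributions` also accepts SciPy distributions, from which `n_iter` candidate settings are sampled rather than enumerated. A minimal sketch on the iris data, assuming an integer range of 1 to 14 for `n_neighbors` (an arbitrary choice for illustration):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

params = {
    "n_neighbors": randint(1, 15),         # integers drawn uniformly from 1..14
    "weights": ["uniform", "distance"],    # lists are sampled uniformly
}
rsearch = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions=params,
    cv=4,
    n_iter=8,                              # number of sampled parameter settings
    random_state=5)
rsearch.fit(X_iris, y_iris)

print(rsearch.best_score_)
print(rsearch.best_params_)
```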




# Going Further
Begin with our scikit-learn tutorial for beginners, in which you'll learn in an easy, step-by-step way how to explore handwritten digits data, how to create a model for it, how to fit your data to your model and how to predict target values. In addition, you'll make use of Python's data visualization library matplotlib to visualize your results.

In [None]:

from sklearn import set_config
 
set_config(display='diagram')  
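
With `display='diagram'`, estimators shown as the last expression of a notebook cell render as an interactive HTML diagram instead of plain text, which is most useful for composite estimators. A small sketch, assuming a two-step pipeline named `pipe` purely for illustration:

```python
from sklearn import get_config, set_config
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

set_config(display="diagram")

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe    # in a notebook, this line displays the pipeline as a diagram

print(get_config()["display"])
```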
 
