# Scikit-learn Practice Problems

This notebook contains practice problems covering essential scikit-learn operations for machine learning.

**Instructions:**
- Complete the code in each cell marked with `# TODO`
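- Run the cell and compare your result with the expected output; entries such as `> 0.8` are thresholds your result should satisfy, and specific numbers are illustrative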
- Each problem focuses on specific scikit-learn concepts for building ML pipelines


## Part 1: Pipeline Thinking

### Problem 1.1: Preprocess + Model + Eval Pipeline
Create a complete pipeline: preprocessing, model training, and evaluation.

**Expected Output:**
```
Pipeline created: True
Accuracy: > 0.8
```


In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Create pipeline with preprocessing and model
# Your code here

# print(f"Pipeline created: True")
# print(f"Accuracy: {accuracy:.2f}")
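
One possible solution sketch for Problem 1.1, assuming a `StandardScaler` followed by `LogisticRegression`. The step names `"scaler"` and `"clf"` and the `max_iter=1000` setting are illustrative choices, not requirements of the problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Chain preprocessing and the model so both are fit on the training data only
pipeline = Pipeline([
    ("scaler", StandardScaler()),                # hypothetical step name
    ("clf", LogisticRegression(max_iter=1000)),  # hypothetical step name
])
pipeline.fit(X_train, y_train)

# Evaluate on the held-out test set
accuracy = accuracy_score(y_test, pipeline.predict(X_test))

print(f"Pipeline created: {isinstance(pipeline, Pipeline)}")
print(f"Accuracy: {accuracy:.2f}")
```

Calling `pipeline.fit` fits the scaler and the classifier in sequence, and `pipeline.predict` reuses the fitted scaler before predicting.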


## Part 2: Data Splitting

### Problem 2.1: Train/Valid/Test Split
Split data into train, validation, and test sets.

**Expected Output:**
```
Train: 60%, Valid: 20%, Test: 20%
```


In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# TODO: Split into train/valid/test
# Your code here

# print(f"Train: {train_pct:.0%}, Valid: {valid_pct:.0%}, Test: {test_pct:.0%}")
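
A minimal sketch for Problem 2.1, assuming two successive calls to `train_test_split`: the first holds out 20% for testing, and the second takes 25% of the remaining 80% (that is, 20% of the total) for validation. The variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Step 1: hold out 20% of all samples as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Step 2: 25% of the remaining 80% equals 20% of the total for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

train_pct = len(X_train) / len(X)
valid_pct = len(X_valid) / len(X)
test_pct = len(X_test) / len(X)

print(f"Train: {train_pct:.0%}, Valid: {valid_pct:.0%}, Test: {test_pct:.0%}")
```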


### Problem 2.2: Cross Validation (Concept)
Use cross-validation to evaluate model performance.

**Expected Output:**
```
Cross-validation scores: [0.85, 0.87, 0.86, 0.88, 0.85]
Mean CV score: 0.86
```


In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = RandomForestClassifier(random_state=42)

# TODO: Perform cross-validation
# Your code here

# print(f"Cross-validation scores: {cv_scores}")
# print(f"Mean CV score: {mean_score:.2f}")
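
A possible sketch for Problem 2.2, assuming 5-fold cross-validation via `cv=5`. The fold count is inferred from the five scores in the expected output; the actual scores will likely differ from the illustrative ones shown there.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = RandomForestClassifier(random_state=42)

# Fit and score the model on 5 different train/validation partitions
cv_scores = cross_val_score(model, X, y, cv=5)
mean_score = cv_scores.mean()

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {mean_score:.2f}")
```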


## Part 3: Feature Processing

### Problem 3.1: Encoding
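Encode categorical values using two different methods: integer codes (LabelEncoder) and one-hot vectors (OneHotEncoder). Note that scikit-learn intends LabelEncoder for target labels; OrdinalEncoder is its counterpart for input features.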

**Expected Output:**
```
Categorical features encoded: True
```


In [None]:
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

categories = ['red', 'blue', 'green', 'red', 'blue', 'green']

# TODO: Encode categorical features
# Your code here

# print("Categorical features encoded: True")
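
A short sketch for Problem 3.1. `LabelEncoder` assigns integer codes in sorted class order, while `OneHotEncoder` expects a 2D input, so the list is reshaped into a single column first. Calling `.toarray()` on the result is one way to get a dense array that does not depend on the encoder's sparse-output settings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

categories = ['red', 'blue', 'green', 'red', 'blue', 'green']

# Integer codes: classes are sorted alphabetically (blue=0, green=1, red=2)
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(categories)

# One-hot vectors: reshape to (n_samples, 1) because the encoder expects 2D input
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(np.array(categories).reshape(-1, 1)).toarray()

print(label_encoded)   # [2 0 1 2 0 1]
print(onehot.shape)    # (6, 3)
print("Categorical features encoded: True")
```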


### Problem 3.2: Standardization
Scale features using StandardScaler (zero mean, unit variance per feature) and MinMaxScaler (rescaling each feature to the [0, 1] range by default).

**Expected Output:**
```
Features standardized: True
Mean: 0.0, Std: 1.0
```


In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# TODO: Standardize features
# Your code here

# print("Features standardized: True")
# print(f"Mean: {mean:.1f}, Std: {std:.1f}")
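
A minimal sketch for Problem 3.2. It reports the mean and standard deviation over the whole standardized array, which is one reasonable reading of the expected output; per-column statistics via `axis=0` would work as well.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Standardization: each column gets zero mean and unit variance
X_std = StandardScaler().fit_transform(X)
mean = X_std.mean()
std = X_std.std()

# Min-max scaling: each column is mapped onto [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print("Features standardized: True")
print(f"Mean: {abs(mean):.1f}, Std: {std:.1f}")  # abs() avoids printing "-0.0"
```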


### Problem 3.3: Missing Value Handling
Handle missing values in features using SimpleImputer.

**Expected Output:**
```
Missing values handled: True
```


In [None]:
from sklearn.impute import SimpleImputer
import numpy as np

X = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# TODO: Handle missing values
# Your code here

# print("Missing values handled: True")
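
A possible sketch for Problem 3.3, assuming mean imputation. The `strategy="mean"` choice is illustrative; `"median"`, `"most_frequent"`, and `"constant"` are other options `SimpleImputer` accepts.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_imputed)
print(f"Missing values handled: {not np.isnan(X_imputed).any()}")
```

Here the second column's missing entry becomes 5.0 (the mean of 2 and 8) and the third column's becomes 7.5 (the mean of 6 and 9).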


### Problem 3.4: Data Leakage Risk
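Demonstrate how to avoid data leakage: fit the scaler on the training data only, then apply it to the test data. Fitting on the full dataset or on the test set leaks test statistics into training.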

**Expected Output:**
```
Data leakage avoided: True
```


In [None]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

X, y = make_classification(n_samples=100, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Demonstrate correct way to avoid data leakage
# Your code here

# print("Data leakage avoided: True")
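
A minimal sketch for Problem 3.4: the scaler learns its mean and variance from the training split only, and the test split is transformed with those same statistics. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Correct: statistics come from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Correct: reuse the training statistics; never call fit() on the test data
X_test_scaled = scaler.transform(X_test)

# The fitted means match the training split, not the full dataset
leakage_avoided = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(f"Data leakage avoided: {leakage_avoided}")
```

The incorrect variants would be `scaler.fit(X)` before splitting or `scaler.fit_transform(X_test)`, both of which let test-set statistics influence preprocessing.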


## Part 4: Common Models

### Problem 4.1: Linear Model
Train and evaluate a linear model (LogisticRegression for classification).

**Expected Output:**
```
Linear model accuracy: > 0.7
```


In [None]:
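from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split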

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Train linear model
# Your code here

# print(f"Linear model accuracy: {accuracy:.2f}")


### Problem 4.2: Tree Model
Train and evaluate a tree model (DecisionTreeClassifier).

**Expected Output:**
```
Tree model accuracy: > 0.8
```


In [None]:
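from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier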

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Train tree model
# Your code here

# print(f"Tree model accuracy: {accuracy:.2f}")


### Problem 4.3: Ensemble Model
Train and evaluate an ensemble model (RandomForestClassifier).

**Expected Output:**
```
Ensemble model accuracy: > 0.85
```


In [None]:
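from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split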

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Train ensemble model
# Your code here

# print(f"Ensemble model accuracy: {accuracy:.2f}")
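
Problems 4.1 to 4.3 share the same fit/predict/score pattern, so one way to sketch all three is a single loop over the estimators. The dictionary keys, the `results` name, and the `max_iter=1000` setting are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear": LogisticRegression(max_iter=1000),
    "Tree": DecisionTreeClassifier(random_state=42),
    "Ensemble": RandomForestClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)          # train on the training split
    y_pred = model.predict(X_test)       # predict on the held-out split
    results[name] = accuracy_score(y_test, y_pred)
    print(f"{name} model accuracy: {results[name]:.2f}")
```

Every scikit-learn classifier exposes the same `fit` and `predict` methods, which is why the loop body does not change between models.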


## Part 5: Evaluation Metrics

### Problem 5.1: Classification Metrics
Calculate classification metrics: accuracy, precision, recall, F1-score.

**Expected Output:**
```
Accuracy: 0.85, Precision: 0.83, Recall: 0.87, F1: 0.85
```


In [None]:
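from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split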

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# TODO: Calculate classification metrics
# Your code here

# print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
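
A possible sketch for Problem 5.1. Each metric function takes the true labels first and the predictions second; the exact values will likely differ from the illustrative numbers in the expected output.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)    # fraction of correct predictions
precision = precision_score(y_test, y_pred)  # TP / (TP + FP)
recall = recall_score(y_test, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_test, y_pred)                # harmonic mean of precision and recall

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```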


### Problem 5.2: Regression Metrics
Calculate regression metrics: MSE, MAE, R².

**Expected Output:**
```
MSE: < 1.0, MAE: < 0.8, R²: > 0.9
```


In [None]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# TODO: Calculate regression metrics
# Your code here

# print(f"MSE: {mse:.2f}, MAE: {mae:.2f}, R²: {r2:.2f}")
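
A minimal sketch for Problem 5.2. Because the data is generated from a linear model with small noise (`noise=0.1`), a fitted `LinearRegression` should land well inside the expected thresholds.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)   # mean of squared residuals
mae = mean_absolute_error(y_test, y_pred)  # mean of absolute residuals
r2 = r2_score(y_test, y_pred)              # fraction of variance explained

print(f"MSE: {mse:.2f}, MAE: {mae:.2f}, R²: {r2:.2f}")
```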


### Problem 5.3: Threshold and Imbalanced Data
Handle imbalanced classification using threshold adjustment and class weights.

**Expected Output:**
```
Imbalanced data handled: True
```


In [None]:
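from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
import numpy as np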

# Create imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Handle imbalanced data
# Your code here

# print("Imbalanced data handled: True")
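
One possible sketch for Problem 5.3, combining both techniques. The choice of `LogisticRegression`, the `"balanced"` weighting, and the 0.3 threshold are illustrative assumptions; in practice the threshold would be tuned on a validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Create imbalanced dataset (roughly 90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Class weights: inversely proportional to class frequency
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

model = LogisticRegression(class_weight=class_weight, max_iter=1000)
model.fit(X_train, y_train)

# Threshold adjustment: predict positive when P(class 1) >= threshold
proba = model.predict_proba(X_test)[:, 1]
threshold = 0.3  # hypothetical value, lower than the default 0.5
y_pred_default = (proba >= 0.5).astype(int)
y_pred_custom = (proba >= threshold).astype(int)

print(f"Class weights: {class_weight}")
print(f"Predicted positives at 0.5: {y_pred_default.sum()}, at {threshold}: {y_pred_custom.sum()}")
print("Imbalanced data handled: True")
```

Lowering the threshold can only keep or increase the number of predicted positives, trading precision for recall on the minority class.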


## Part 6: Overfitting and Hyperparameter Tuning

### Problem 6.1: Regularization
Apply regularization to prevent overfitting (L1/L2 in LogisticRegression).

**Expected Output:**
```
Regularization applied: True
```


In [None]:
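from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split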

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Apply regularization (L1 and L2)
# Your code here

# print("Regularization applied: True")
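
A possible sketch for Problem 6.1, assuming the `liblinear` solver (which supports both penalties) and `C=0.1`, where a smaller `C` means stronger regularization. Both values are illustrative, and newer scikit-learn releases may emit a deprecation warning for the `penalty` argument.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# L1 (lasso-style) penalty: tends to drive some coefficients exactly to zero
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1_model.fit(X_train, y_train)

# L2 (ridge-style) penalty: shrinks coefficients without zeroing them
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1)
l2_model.fit(X_train, y_train)

n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print(f"Zero coefficients with L1: {n_zero_l1}, with L2: {n_zero_l2}")
print(f"L1 accuracy: {l1_model.score(X_test, y_test):.2f}")
print(f"L2 accuracy: {l2_model.score(X_test, y_test):.2f}")
print("Regularization applied: True")
```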


### Problem 6.2: Grid Search
Use GridSearchCV to find best hyperparameters.

**Expected Output:**
```
Best parameters found: True
Best score: > 0.8
```


In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Use GridSearchCV
# Your code here

# print("Best parameters found: True")
# print(f"Best score: {best_score:.2f}")
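
A minimal sketch for Problem 6.2. The grid values and `cv=3` are illustrative assumptions, kept small so the search runs quickly; `best_score_` is the mean cross-validated score of the best parameter combination.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical grid: 2 x 2 = 4 combinations, each evaluated with 3-fold CV
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)  # search uses the training split only

best_score = grid.best_score_
print(f"Best parameters: {grid.best_params_}")
print("Best parameters found: True")
print(f"Best score: {best_score:.2f}")
```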


### Problem 6.3: Random Search
Use RandomizedSearchCV for hyperparameter tuning (concept).

**Expected Output:**
```
Random search completed: True
```


In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Use RandomizedSearchCV
# Your code here

# print("Random search completed: True")
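
A possible sketch for Problem 6.3. Unlike a grid search, `RandomizedSearchCV` samples a fixed number of candidates (`n_iter`) from distributions. The estimator, the ranges passed to `randint`, `n_iter=5`, and `cv=3` are all illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# randint(low, high) samples integers from [low, high)
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(2, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,          # number of sampled parameter settings
    cv=3,
    random_state=42,   # makes the sampled candidates reproducible
)
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.2f}")
print("Random search completed: True")
```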


## Part 7: Interpretability and Diagnostics

### Problem 7.1: Feature Importance
Extract and visualize feature importance from tree-based models.

**Expected Output:**
```
Feature importance extracted: True
Top feature: feature_0
```


In [None]:
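from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split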

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# TODO: Extract feature importance
# Your code here

# print("Feature importance extracted: True")
# print(f"Top feature: {top_feature}")
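
A minimal sketch for Problem 7.1, covering extraction and ranking only (plotting is left out). The `feature_{i}` naming follows the expected output; which feature ranks first depends on the fitted model and may not match the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances: one value per feature, summing to 1
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]  # indices from most to least important
top_idx = int(ranking[0])
top_feature = f"feature_{top_idx}"

for idx in ranking[:5]:
    print(f"feature_{idx}: {importances[idx]:.3f}")

print("Feature importance extracted: True")
print(f"Top feature: {top_feature}")
```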


### Problem 7.2: Error Analysis
Perform error analysis: identify misclassified samples and analyze patterns.

**Expected Output:**
```
Error analysis completed: True
Misclassification rate: < 0.2
```


In [None]:
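from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split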

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# TODO: Perform error analysis
# Your code here

# print("Error analysis completed: True")
# print(f"Misclassification rate: {error_rate:.2f}")
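
A possible sketch for Problem 7.2: it locates the misclassified test samples, computes the error rate, and uses the confusion matrix to separate false positives from false negatives. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Indices (within the test set) where the prediction disagrees with the label
misclassified = np.where(y_pred != y_test)[0]
error_rate = len(misclassified) / len(y_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
false_positives = cm[0, 1]  # true 0 predicted as 1
false_negatives = cm[1, 0]  # true 1 predicted as 0

print(f"Confusion matrix:\n{cm}")
print(f"False positives: {false_positives}, False negatives: {false_negatives}")
print("Error analysis completed: True")
print(f"Misclassification rate: {error_rate:.2f}")
```

From here, inspecting `X_test[misclassified]` (for example, comparing feature means against correctly classified samples) is one way to look for patterns in the errors.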
