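## Welcome to the Notebook

Here you can write code and documentation.

### About the file directories


**project**: The project directory is this project's workspace. Put every file needed to run the project here; files added, deleted, or modified in this directory are preserved.


**input**: The input directory is where datasets are mounted. Every dataset mounted into the project appears here; the directory is hidden when no dataset is mounted.


**temp**: The temp directory is temporary disk space. Unneeded files produced during training or analysis can be stored here; files in this directory are not saved.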


In [None]:
# List the files in the personal persistent workspace
!ls /home/mw/project/

In [3]:
# List the currently mounted dataset directories
!ls /home/mw/input/

quant4533


In [1]:
# (1) Data preprocessing

import pandas as pd
import numpy as np
import re
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import joblib

# 1. Load the data
train_path = '/home/mw/project/output/train.csv'
test_path = '/home/mw/project/output/test.csv'

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

# Keep the test-set IDs for the final submission
test_ids = test_df['ID']

# 2. Preprocessing function
def preprocess_data(df, is_train=True):
    df = df.copy()

    # Separate the target variable
    if is_train:
        y = df['价格']
        df = df.drop(['价格'], axis=1)
    else:
        y = None

    # Date features
    date_cols = [col for col in ['交易时间', '上次交易'] if col in df.columns]
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')
        df[f'{col}_year'] = df[col].dt.year
        df[f'{col}_month'] = df[col].dt.month

        # Fill missing values with the column median
        df[f'{col}_year'] = df[f'{col}_year'].fillna(df[f'{col}_year'].median())
        df[f'{col}_month'] = df[f'{col}_month'].fillna(df[f'{col}_month'].median())

    # Floor information
    if '所在楼层' in df.columns:
        df['楼层类型'] = df['所在楼层'].str.extract(r'(低|中|高)层', expand=False).fillna('未知')
        df['总楼层'] = pd.to_numeric(
            df['所在楼层'].str.extract(r'共(\d+)层', expand=False),
            errors='coerce'
        ).fillna(18)  # hard-coded fallback of 18 floors, intended as the median

    # Construction year
    if '建筑年代' in df.columns:
        def extract_year(s):
            # Average every four-digit number in the string (handles ranges)
            years = re.findall(r'\d{4}', str(s))
            return np.mean([int(y) for y in years]) if years else np.nan

        df['建筑年代_数值'] = df['建筑年代'].apply(extract_year)
        df['建筑年代_数值'] = df['建筑年代_数值'].fillna(df.groupby('区域')['建筑年代_数值'].transform('median'))

        # Bin into decades (left-closed intervals)
        bins = [0, 1980, 1990, 2000, 2010, 2020, np.inf]
        labels = ['1980前', '80年代', '90年代', '00年代', '10年代', '20年代']
        df['建筑年代_区间'] = pd.cut(
            df['建筑年代_数值'],
            bins=bins,
            labels=labels,
            right=False
        ).cat.add_categories('未知').fillna('未知')

    return df, y

# Apply preprocessing
X_train, y_train = preprocess_data(train_df, is_train=True)
X_test, _ = preprocess_data(test_df, is_train=False)

# 3. Data-cleaning transformer (corrected version)
class DataCleaner(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.price_cap = None
        self.median_values = {}
        self.mode_values = {}

    def fit(self, X, y=None):
        if y is not None:
            self.price_cap = np.percentile(y, 99.9)

        # Store fill values for numeric features that contain missing values
        num_cols = X.select_dtypes(include=['int64', 'float64']).columns
        for col in num_cols:
            if X[col].isnull().any():
                if col in ['coord_x', 'coord_y', '小区平均月租金']:
                    self.median_values[col] = X.groupby('环线')[col].median()
                else:
                    self.median_values[col] = X[col].median()

        # Store fill values (modes) for categorical features
        cat_cols = X.select_dtypes(include=['object', 'category']).columns
        for col in cat_cols:
            modes = X[col].mode()
            self.mode_values[col] = modes[0] if len(modes) > 0 else 'missing'

        return self

    def transform(self, X, y=None):
        X = X.copy()

        # Cap outliers: last-sale years after 2022 are set to 2022
        if '上次交易_year' in X.columns:
            X['上次交易_year'] = np.where(X['上次交易_year'] > 2022, 2022, X['上次交易_year'])

        # Fill numeric features
        for col, val in self.median_values.items():
            if isinstance(val, pd.Series):  # per-group fill keyed by '环线'
                X[col] = X[col].fillna(X['环线'].map(val))
            else:  # global fill
                X[col] = X[col].fillna(val)

        # Fill categorical features
        for col, val in self.mode_values.items():
            X[col] = X[col].fillna(val)

        # Drop unused columns
        drop_cols = ['抵押信息', '楼层类型', '户型介绍', '核心卖点', '建筑年代']
        X = X.drop([col for col in drop_cols if col in X.columns], axis=1)

        # Transform the target: cap at the 99.9th percentile, then apply log1p
        if y is not None:
            y = np.log1p(np.where(y > self.price_cap, self.price_cap, y))
            return X, y
        return X

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        if y is not None:
            return self.transform(X, y)
        return self.transform(X)

# Apply data cleaning (corrected call pattern)
cleaner = DataCleaner()
cleaned_data = cleaner.fit_transform(X_train, y_train)
if isinstance(cleaned_data, tuple):
    X_train_cleaned, y_train_cleaned = cleaned_data
else:
    X_train_cleaned = cleaned_data
    y_train_cleaned = y_train  # keep the original y_train unchanged

X_test_cleaned = cleaner.transform(X_test)

# 4. Feature-engineering pipeline
# Numeric features (exclude the ID and the month columns)
numeric_features = X_train_cleaned.select_dtypes(include=['int64', 'float64']).columns.tolist()
numeric_features = [col for col in numeric_features
                    if col != 'ID'
                    and not col.endswith('_month')]

# Categorical features
categorical_features = [
    '环线', '房屋户型', '房屋朝向', '建筑结构', '装修情况',
    '梯户比例', '配备电梯', '交易权属', '房屋用途', '建筑年代_区间'
]

# Build the preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
], remainder='drop')

# 5. Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X_train_cleaned, y_train_cleaned,
    test_size=0.2,
    random_state=42
)

# 6. Save the preprocessing objects
# (the ColumnTransformer has not been fitted at this point)
joblib.dump(preprocessor, '/home/mw/project/output/linear_preprocessor_final.joblib')
joblib.dump(cleaner, '/home/mw/project/output/data_cleaner_final.joblib')

# 7. Sanity-check the results
print("="*50)
print("预处理完成！")
print(f"训练集形状: {X_train.shape}")
print(f"验证集形状: {X_val.shape}")
print(f"测试集形状: {X_test_cleaned.shape}")
print(f"数值特征数: {len(numeric_features)}")
print(f"分类特征数: {len(categorical_features)}")
print("="*50)


预处理完成！
训练集形状: (67306, 32)
验证集形状: (16827, 32)
测试集形状: (14786, 33)
数值特征数: 8
分类特征数: 10
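
The floor and construction-year steps in the cell above are easier to check on a few rows. The sketch below reruns the same two regular expressions and the same `pd.cut` bins on hypothetical sample strings. The real values of `所在楼层` and `建筑年代` are not shown in this notebook, so the string formats used here are assumptions.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample rows; the real column formats are assumed, not confirmed.
sample = pd.DataFrame({
    '所在楼层': ['中层(共18层)', '高层(共6层)', None],
    '建筑年代': ['1998年', '2005-2007', '未知'],
})

# Same patterns as preprocess_data: floor band and total floor count.
sample['楼层类型'] = sample['所在楼层'].str.extract(r'(低|中|高)层', expand=False).fillna('未知')
sample['总楼层'] = pd.to_numeric(
    sample['所在楼层'].str.extract(r'共(\d+)层', expand=False), errors='coerce'
)

def extract_year(s):
    """Average every four-digit number in the string, or return NaN if none."""
    years = re.findall(r'\d{4}', str(s))
    return np.mean([int(y) for y in years]) if years else np.nan

sample['建筑年代_数值'] = sample['建筑年代'].apply(extract_year)

# right=False makes each bin closed on the left, so 2000 falls in '00年代'.
bins = [0, 1980, 1990, 2000, 2010, 2020, np.inf]
labels = ['1980前', '80年代', '90年代', '00年代', '10年代', '20年代']
sample['建筑年代_区间'] = pd.cut(
    sample['建筑年代_数值'], bins=bins, labels=labels, right=False
).cat.add_categories('未知').fillna('未知')

print(sample[['楼层类型', '总楼层', '建筑年代_数值', '建筑年代_区间']])
```

Rows whose year cannot be parsed end up in the extra `未知` category rather than being dropped, which keeps the row count unchanged for the later one-hot encoding.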


In [2]:
# (2) Linear models

import pandas as pd
import numpy as np
import re
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import joblib

# 1. Load the data
train_path = '/home/mw/project/output/train.csv'
test_path = '/home/mw/project/output/test.csv'

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

# Keep the test-set IDs for the final submission
test_ids = test_df['ID']

# 2. Enhanced preprocessing function
def preprocess_data(df, is_train=True):
    df = df.copy()

    # Separate the target variable
    if is_train:
        y = df['价格']
        df = df.drop(['价格'], axis=1)
    else:
        y = None

    # Date features (unparseable dates become NaT)
    date_cols = [col for col in ['交易时间', '上次交易'] if col in df.columns]
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')
        df[f'{col}_year'] = df[col].dt.year
        df[f'{col}_month'] = df[col].dt.month

        # Fill missing values with the column median
        df[f'{col}_year'] = df[f'{col}_year'].fillna(df[f'{col}_year'].median())
        df[f'{col}_month'] = df[f'{col}_month'].fillna(df[f'{col}_month'].median())

    # Floor information (unparseable values fall back to defaults)
    if '所在楼层' in df.columns:
        df['楼层类型'] = df['所在楼层'].str.extract(r'(低|中|高)层', expand=False).fillna('未知')
        df['总楼层'] = pd.to_numeric(
            df['所在楼层'].str.extract(r'共(\d+)层', expand=False),
            errors='coerce'
        ).fillna(18)  # hard-coded fallback of 18 floors, intended as the median

    # Construction year (more defensive parsing)
    if '建筑年代' in df.columns:
        def extract_year(s):
            try:
                years = re.findall(r'\d{4}', str(s))
                return np.mean([int(y) for y in years]) if years else np.nan
            except (TypeError, ValueError):
                return np.nan

        df['建筑年代_数值'] = df['建筑年代'].apply(extract_year)
        if '区域' in df.columns:
            # Prefer the per-district median when the district column exists
            df['建筑年代_数值'] = df['建筑年代_数值'].fillna(df.groupby('区域')['建筑年代_数值'].transform('median'))
        else:
            df['建筑年代_数值'] = df['建筑年代_数值'].fillna(df['建筑年代_数值'].median())

        # Bin into decades (always creates the feature)
        bins = [0, 1980, 1990, 2000, 2010, 2020, np.inf]
        labels = ['1980前', '80年代', '90年代', '00年代', '10年代', '20年代']
        df['建筑年代_区间'] = pd.cut(
            df['建筑年代_数值'],
            bins=bins,
            labels=labels,
            right=False
        ).cat.add_categories('未知').fillna('未知')

    return df, y

# Apply preprocessing
X_train, y_train = preprocess_data(train_df, is_train=True)
X_test, _ = preprocess_data(test_df, is_train=False)

# 3. Enhanced data-cleaning transformer
class DataCleaner(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.price_cap = None
        self.median_values = {}
        self.mode_values = {}
        self.required_features = []  # columns seen during fit

    def fit(self, X, y=None):
        if y is not None:
            self.price_cap = np.percentile(y, 99.9)

        # Record every input column
        self.required_features = X.columns.tolist()

        # Store fill values for numeric features that contain missing values
        num_cols = X.select_dtypes(include=['int64', 'float64']).columns
        for col in num_cols:
            if X[col].isnull().any():
                if col in ['coord_x', 'coord_y', '小区平均月租金'] and '环线' in X.columns:
                    self.median_values[col] = X.groupby('环线')[col].median()
                else:
                    self.median_values[col] = X[col].median()

        # Store fill values (modes) for categorical features
        cat_cols = X.select_dtypes(include=['object', 'category']).columns
        for col in cat_cols:
            modes = X[col].mode()
            self.mode_values[col] = modes[0] if len(modes) > 0 else 'missing'

        return self

    def transform(self, X, y=None):
        X = X.copy()

        # Make sure every column seen during fit exists: NaN for columns with
        # a learned numeric fill value, 'missing' otherwise
        for feat in self.required_features:
            if feat not in X.columns:
                X[feat] = np.nan if feat in self.median_values else 'missing'

        # Cap outliers: last-sale years after 2022 are set to 2022
        if '上次交易_year' in X.columns:
            X['上次交易_year'] = np.where(X['上次交易_year'] > 2022, 2022, X['上次交易_year'])

        # Fill numeric features
        for col, val in self.median_values.items():
            if col in X.columns:
                if isinstance(val, pd.Series) and '环线' in X.columns:  # per-group fill
                    X[col] = X[col].fillna(X['环线'].map(val))
                else:  # global fill
                    X[col] = X[col].fillna(val)

        # Fill categorical features
        for col, val in self.mode_values.items():
            if col in X.columns:
                X[col] = X[col].fillna(val)

        # Drop unused columns; '建筑年代_区间' and '建筑年代_数值' are kept
        drop_cols = ['抵押信息', '户型介绍', '核心卖点', '建筑年代']
        X = X.drop([col for col in drop_cols if col in X.columns], axis=1)

        # Transform the target: cap at the 99.9th percentile, then apply log1p
        if y is not None:
            y = np.log1p(np.where(y > self.price_cap, self.price_cap, y))
            return X, y
        return X

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        if y is not None:
            return self.transform(X, y)
        return self.transform(X)

# Apply data cleaning
cleaner = DataCleaner()
cleaned_data = cleaner.fit_transform(X_train, y_train)
if isinstance(cleaned_data, tuple):
    X_train_cleaned, y_train_cleaned = cleaned_data
else:
    X_train_cleaned = cleaned_data
    y_train_cleaned = y_train

X_test_cleaned = cleaner.transform(X_test)

# 4. Feature-engineering pipeline (checks that each feature exists)
def get_features(X):
    # Numeric features (exclude the ID and the month columns)
    numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
    numeric_features = [col for col in numeric_features
                        if col != 'ID'
                        and not col.endswith('_month')]

    # Categorical features (keep only those actually present)
    possible_cat_features = [
        '环线', '房屋户型', '房屋朝向', '建筑结构', '装修情况',
        '梯户比例', '配备电梯', '交易权属', '房屋用途', '建筑年代_区间'
    ]
    categorical_features = [col for col in possible_cat_features if col in X.columns]

    return numeric_features, categorical_features

numeric_features, categorical_features = get_features(X_train_cleaned)

# Build the preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
], remainder='drop')

# 5. Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X_train_cleaned, y_train_cleaned,
    test_size=0.2,
    random_state=42
)

# 6. Save the preprocessing objects
# (the ColumnTransformer has not been fitted at this point)
joblib.dump(preprocessor, '/home/mw/project/output/linear_preprocessor_final.joblib')
joblib.dump(cleaner, '/home/mw/project/output/data_cleaner_final.joblib')

# 7. Train and compare models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1)
}

results = []
for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    pipeline.fit(X_train, y_train)
    
    # Evaluate on the validation set (log1p scale)
    y_pred = pipeline.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    r2 = r2_score(y_val, y_pred)
    
    results.append({
        'Model': name,
        'RMSE': rmse,
        'R2': r2,
        'Pipeline': pipeline
    })
    
    print(f"{name} - RMSE: {rmse:.4f}, R2: {r2:.4f}")

# 8. Generate predictions
for result in results:
    model_name = result['Model']
    pipeline = result['Pipeline']

    # Predict on the test set
    test_pred = pipeline.predict(X_test_cleaned)
    test_pred_price = np.expm1(test_pred)  # invert log1p

    # Save the results
    submission = pd.DataFrame({
        'ID': test_ids,
        'Price': test_pred_price
    })
    submission.to_csv(f'/home/mw/project/output/{model_name}_submission.csv', index=False)
    print(f"{model_name}预测结果已保存")

print("所有模型训练和预测完成！")


LinearRegression - RMSE: 64394285.9013, R2: -5223738892959791.0000
Ridge - RMSE: 0.5186, R2: 0.6611
Lasso - RMSE: 0.7881, R2: 0.2176



LinearRegression预测结果已保存
Ridge预测结果已保存
Lasso预测结果已保存
所有模型训练和预测完成！
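
A LinearRegression RMSE of about 6.4e7 on a `log1p` target means at least some predictions are tens of millions of log units away from the truth, and `np.expm1` overflows to `inf` for inputs above roughly 709. One possible guard, sketched below with made-up numbers, is to clip predictions to the range of the training target before inverting the transform. The helper name `safe_expm1` and the sample values are hypothetical, and this is not something the notebook itself does.

```python
import numpy as np

def safe_expm1(log_pred, log_target_train):
    """Invert log1p after clipping predictions to the training target range.

    Hypothetical helper: clipping is one possible guard, not the notebook's method.
    """
    lo, hi = np.min(log_target_train), np.max(log_target_train)
    return np.expm1(np.clip(log_pred, lo, hi))

# Made-up log1p prices and predictions, two of them far out of range.
log_target_train = np.log1p(np.array([800_000.0, 1_500_000.0, 6_000_000.0]))
log_pred = np.array([14.0, 6.4e7, -3.0])

prices = safe_expm1(log_pred, log_target_train)
print(prices)
```

Clipping only hides the symptom. A model that produces such predictions is usually better dropped or regularized, which is what the Ridge result above suggests.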


In [6]:
# %pip installs into the environment of the running kernel
%pip install xgboost lightgbm




In [15]:
import sys
print("Python 路径:", sys.executable)  # Python interpreter used by this notebook
!which pip  # path of pip (Linux/macOS)
# !where pip  # Windows


Python 路径: /opt/conda/bin/python
/opt/conda/bin/pip


In [16]:
!conda install -c conda-forge scikit-learn=1.4.0 --yes


Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: / ^C
- 

In [12]:
import sklearn
print("scikit-learn 版本:", sklearn.__version__)


scikit-learn 版本: 1.0.1
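
With scikit-learn 1.0.1, as reported above, `OneHotEncoder` does not accept `sparse_output`; that keyword arrived in 1.2 as the replacement for `sparse`. The sketch below shows one way to build a dense encoder on either side of that change by trying the new keyword first. The helper name `make_dense_onehot` and the sample categories are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def make_dense_onehot(**kwargs):
    """Return a OneHotEncoder that outputs dense arrays on old and new scikit-learn."""
    try:
        # scikit-learn >= 1.2
        return OneHotEncoder(sparse_output=False, **kwargs)
    except TypeError:
        # Older releases only know the `sparse` keyword.
        return OneHotEncoder(sparse=False, **kwargs)

encoder = make_dense_onehot(handle_unknown='ignore')
encoded = encoder.fit_transform(np.array([['南'], ['北'], ['南']]))
print(type(encoded).__name__, encoded.shape)
```

Trying the keyword avoids parsing version strings, at the cost of relying on the constructor raising `TypeError` for an unknown argument.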


In [17]:
# (3) Linear models and tree models

import pandas as pd
import numpy as np
import re
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn import __version__ as sk_version
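# distutils was removed in Python 3.12; packaging's Version is aliased to keep the name used below
from packaging.version import Version as LooseVersion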
import joblib

# 1. Load the data
train_path = '/home/mw/project/output/train.csv'
test_path = '/home/mw/project/output/test.csv'

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

# Keep the test-set IDs for the final submission
test_ids = test_df['ID']

# 2. Enhanced preprocessing
# Floor-number helper, defined as a standalone module-level function
def extract_floor_number(floor_str):
    """Return the first integer found in a floor string, or NaN."""
    if pd.isna(floor_str):
        return np.nan
    match = re.search(r'(\d+)', str(floor_str))
    return int(match.group(1)) if match else np.nan

# preprocess_data calls the standalone helper above
def preprocess_data(df, is_train=True):
    df = df.copy()

    # Separate the target variable
    if is_train:
        y = df['价格']
        df = df.drop(['价格'], axis=1)
    else:
        y = None

    # Date features
    date_cols = [col for col in ['交易时间', '上次交易'] if col in df.columns]
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')
        df[f'{col}_year'] = df[col].dt.year
        df[f'{col}_month'] = df[col].dt.month
        df[f'{col}_quarter'] = df[col].dt.quarter

        # Fill missing values with the column median
        df[f'{col}_year'] = df[f'{col}_year'].fillna(df[f'{col}_year'].median())
        df[f'{col}_month'] = df[f'{col}_month'].fillna(df[f'{col}_month'].median())
        df[f'{col}_quarter'] = df[f'{col}_quarter'].fillna(df[f'{col}_quarter'].median())

    # Floor information
    if '所在楼层' in df.columns:
        df['楼层类型'] = df['所在楼层'].str.extract(r'(低|中|高)层', expand=False).fillna('未知')
        df['总楼层'] = pd.to_numeric(
            df['所在楼层'].str.extract(r'共(\d+)层', expand=False),
            errors='coerce'
        ).fillna(18)

        # Relative floor: first number in the string divided by the total floor count
        df['相对楼层'] = df['所在楼层'].apply(extract_floor_number)
        df['相对楼层'] = df['相对楼层'] / df['总楼层']

    # Area features
    if '建筑面积' in df.columns:
        df['建筑面积'] = pd.to_numeric(df['建筑面积'], errors='coerce')
        df['建筑面积'] = df['建筑面积'].fillna(df['建筑面积'].median())

    if '套内面积' in df.columns:
        df['套内面积'] = pd.to_numeric(df['套内面积'], errors='coerce')
        df['套内面积'] = df['套内面积'].fillna(df['套内面积'].median())
        df['得房率'] = df['套内面积'] / df['建筑面积']

    # Longitude and latitude
    if 'lon' in df.columns and 'lat' in df.columns:
        df['lon'] = pd.to_numeric(df['lon'], errors='coerce')
        df['lat'] = pd.to_numeric(df['lat'], errors='coerce')
        df['lon'] = df['lon'].fillna(df['lon'].median())
        df['lat'] = df['lat'].fillna(df['lat'].median())

    # Building age
    if '年份' in df.columns:
        df['房龄'] = 2022 - df['年份']  # assumes the current year is 2022

    return df, y



# 3. Enhanced data-cleaning transformer
class AdvancedDataCleaner(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.price_cap = None
        self.median_values = {}
        self.mode_values = {}
        self.required_features = []

    def fit(self, X, y=None):
        if y is not None:
            self.price_cap = np.percentile(y, 99.5)  # stricter cap than the earlier 99.9th percentile

        # Record every input column
        self.required_features = X.columns.tolist()

        # Store fill values for numeric features that contain missing values
        num_cols = X.select_dtypes(include=['int64', 'float64']).columns
        for col in num_cols:
            if X[col].isnull().any():
                if col in ['lon', 'lat'] and '区域' in X.columns:
                    self.median_values[col] = X.groupby('区域')[col].median()
                else:
                    self.median_values[col] = X[col].median()

        # Store fill values (modes) for categorical features
        cat_cols = X.select_dtypes(include=['object', 'category']).columns
        for col in cat_cols:
            modes = X[col].mode()
            self.mode_values[col] = modes[0] if len(modes) > 0 else 'missing'

        return self

    def transform(self, X, y=None):
        X = X.copy()

        # Make sure every column seen during fit exists
        for feat in self.required_features:
            if feat not in X.columns:
                X[feat] = np.nan if feat in self.median_values else 'missing'

        # Cap extreme areas at the 99th percentile of the frame being transformed
        for col in X.select_dtypes(include=['int64', 'float64']).columns:
            if col in ['建筑面积', '套内面积']:
                upper_limit = X[col].quantile(0.99)
                X[col] = np.where(X[col] > upper_limit, upper_limit, X[col])

        # Fill numeric features
        for col, val in self.median_values.items():
            if col in X.columns:
                if isinstance(val, pd.Series) and '区域' in X.columns:
                    X[col] = X[col].fillna(X['区域'].map(val))
                else:
                    X[col] = X[col].fillna(val)

        # Fill categorical features
        for col, val in self.mode_values.items():
            if col in X.columns:
                X[col] = X[col].fillna(val)

        # Drop unused columns
        drop_cols = ['抵押信息', '户型介绍', '核心卖点', '房屋优势', '周边配套', '交通出行']
        X = X.drop([col for col in drop_cols if col in X.columns], axis=1)

        # Transform the target: cap at the 99.5th percentile, then apply log1p
        if y is not None:
            y = np.log1p(np.where(y > self.price_cap, self.price_cap, y))
            return X, y
        return X

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        if y is not None:
            return self.transform(X, y)
        return self.transform(X)

# Apply preprocessing
X_train, y_train = preprocess_data(train_df, is_train=True)
X_test, _ = preprocess_data(test_df, is_train=False)

# Apply data cleaning
cleaner = AdvancedDataCleaner()
cleaned_data = cleaner.fit_transform(X_train, y_train)
if isinstance(cleaned_data, tuple):
    X_train_cleaned, y_train_cleaned = cleaned_data
else:
    X_train_cleaned = cleaned_data
    y_train_cleaned = y_train

X_test_cleaned = cleaner.transform(X_test)

# 4. Feature-engineering pipeline
def get_features(X):
    # Numeric features (exclude the ID and the month columns)
    numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
    numeric_features = [col for col in numeric_features
                        if col != 'ID'
                        and not col.endswith('_month')]

    # Categorical features (keep only those actually present)
    possible_cat_features = [
        '环线', '房屋户型', '房屋朝向', '建筑结构', '装修情况',
        '梯户比例', '配备电梯', '交易权属', '房屋用途', '楼层类型'
    ]
    categorical_features = [col for col in possible_cat_features if col in X.columns]

    return numeric_features, categorical_features

numeric_features, categorical_features = get_features(X_train_cleaned)

# Build the preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])


# Choose the keyword that matches the installed scikit-learn version
onehot_params = {'handle_unknown': 'ignore'}
if LooseVersion(sk_version) >= LooseVersion('1.2'):
    onehot_params['sparse_output'] = False
else:
    onehot_params['sparse'] = False

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(**onehot_params))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
], remainder='drop')

# 5. Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X_train_cleaned, y_train_cleaned,
    test_size=0.2,
    random_state=42
)

# 6. Train and compare models
models = {
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'ElasticNet': ElasticNet(),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'XGBoost': XGBRegressor(n_estimators=100, random_state=42),
    'LightGBM': LGBMRegressor(n_estimators=100, random_state=42)
}

# Parameter grids (keys use the 'model__' prefix of the pipeline step)
param_grids = {
    'Ridge': {'model__alpha': [0.1, 1, 10]},
    'Lasso': {'model__alpha': [0.001, 0.01, 0.1]},
    'ElasticNet': {'model__alpha': [0.001, 0.01, 0.1], 'model__l1_ratio': [0.5, 0.7, 0.9]},
    'RandomForest': {'model__max_depth': [10, 20, None]},
    'XGBoost': {'model__max_depth': [3, 6, 9], 'model__learning_rate': [0.01, 0.1]},
    'LightGBM': {'model__num_leaves': [31, 63], 'model__learning_rate': [0.01, 0.1]}
}

best_models = {}
results = []

for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    # Grid search with 5-fold cross-validation
    grid_search = GridSearchCV(
        pipeline,
        param_grids[name],
        cv=5,
        scoring='neg_root_mean_squared_error',
        n_jobs=-1,
        verbose=1
    )
    
    grid_search.fit(X_train, y_train)
    
    # Evaluate the refitted best estimator on the validation set (log1p scale)
    y_pred = grid_search.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    r2 = r2_score(y_val, y_pred)
    
    results.append({
        'Model': name,
        'RMSE': rmse,
        'R2': r2,
        'Best_Params': grid_search.best_params_,
        'Pipeline': grid_search.best_estimator_
    })
    
    best_models[name] = grid_search.best_estimator_
    print(f"{name} - RMSE: {rmse:.4f}, R2: {r2:.4f}")
    print(f"Best params: {grid_search.best_params_}")

# 7. Select the best model (lowest validation RMSE)
best_result = min(results, key=lambda x: x['RMSE'])
best_model = best_result['Pipeline']
print(f"\n最佳模型: {best_result['Model']}, RMSE: {best_result['RMSE']:.4f}")

# 8. Save the model and preprocessing objects
# GridSearchCV fits clones, so the standalone `preprocessor` saved here is unfitted;
# the fitted copy lives inside best_model
joblib.dump(best_model, '/home/mw/project/output/best_model.joblib')
joblib.dump(preprocessor, '/home/mw/project/output/preprocessor.joblib')
joblib.dump(cleaner, '/home/mw/project/output/data_cleaner.joblib')

# 9. Generate predictions
test_pred = best_model.predict(X_test_cleaned)
test_pred_price = np.expm1(test_pred)  # invert log1p

# Save the results
submission = pd.DataFrame({
    'ID': test_ids,
    'price': test_pred_price
})
submission.to_csv('/home/mw/project/output/best_model_submission.csv', index=False)
print("预测结果已保存")

# 10. Save the predictions of every model
for name, model in best_models.items():
    test_pred = model.predict(X_test_cleaned)
    test_pred_price = np.expm1(test_pred)
    
    submission = pd.DataFrame({
        'ID': test_ids,
        'price': test_pred_price
    })
    submission.to_csv(f'/home/mw/project/output/{name}_submission.csv', index=False)
    print(f"{name}预测结果已保存")

print("所有模型训练和预测完成！")


Fitting 5 folds for each of 3 candidates, totalling 15 fits


https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations


Ridge - RMSE: 0.5146, R2: 0.6629
Best params: {'model__alpha': 1}
Fitting 5 folds for each of 3 candidates, totalling 15 fits




Lasso - RMSE: 0.5452, R2: 0.6217
Best params: {'model__alpha': 0.001}
Fitting 5 folds for each of 9 candidates, totalling 45 fits




ElasticNet - RMSE: 0.5384, R2: 0.6311
Best params: {'model__alpha': 0.001, 'model__l1_ratio': 0.5}
Fitting 5 folds for each of 3 candidates, totalling 15 fits




RandomForest - RMSE: 0.1917, R2: 0.9532
Best params: {'model__max_depth': None}
Fitting 5 folds for each of 6 candidates, totalling 30 fits




XGBoost - RMSE: 0.2307, R2: 0.9322
Best params: {'model__learning_rate': 0.1, 'model__max_depth': 9}
Fitting 5 folds for each of 4 candidates, totalling 20 fits




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008540 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1520
[LightGBM] [Info] Number of data points in the train set: 67306, number of used features: 279
[LightGBM] [Info] Start training from score 14.046411
LightGBM - RMSE: 0.2209, R2: 0.9379
Best params: {'model__learning_rate': 0.1, 'model__num_leaves': 63}

最佳模型: RandomForest, RMSE: 0.1917
预测结果已保存
Ridge预测结果已保存
Lasso预测结果已保存
ElasticNet预测结果已保存
RandomForest预测结果已保存
XGBoost预测结果已保存
LightGBM预测结果已保存
所有模型训练和预测完成！
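
`AdvancedDataCleaner.transform` computes the 99th-percentile area cap from whatever frame it receives, so the training data and the test data are clipped at different thresholds. A sketch of the alternative, learning the caps once during `fit` and reusing them afterwards, is shown below. The class name `QuantileCapper` and the toy data are hypothetical, and this is not what the notebook ran.

```python
import pandas as pd

class QuantileCapper:
    """Clip selected numeric columns at an upper quantile learned during fit."""

    def __init__(self, columns, quantile=0.99):
        self.columns = columns
        self.quantile = quantile
        self.caps_ = {}

    def fit(self, X, y=None):
        # Learn one cap per column from the training frame only.
        self.caps_ = {
            col: X[col].quantile(self.quantile)
            for col in self.columns if col in X.columns
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col, cap in self.caps_.items():
            if col in X.columns:
                X[col] = X[col].clip(upper=cap)
        return X

# Toy frames: the test set holds a larger outlier than anything in training.
train = pd.DataFrame({'建筑面积': [50.0, 60.0, 70.0, 80.0, 1000.0]})
test = pd.DataFrame({'建筑面积': [55.0, 5000.0]})

# A low quantile is used only so the effect is visible on five rows.
capper = QuantileCapper(['建筑面积'], quantile=0.75).fit(train)
print(capper.caps_)
print(capper.transform(test))
```

With caps stored on the fitted object, the same threshold also travels with the cleaner when it is saved with `joblib.dump`.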


### Model evaluation results from the training log:  
Fitting 5 folds for each of 3 candidates, totalling 15 fits  
Ridge - RMSE: 0.5146, R2: 0.6629  
Best params: {'model__alpha': 1}  
Fitting 5 folds for each of 3 candidates, totalling 15 fits  
Lasso - RMSE: 0.5452, R2: 0.6217  
Best params: {'model__alpha': 0.001}  
Fitting 5 folds for each of 9 candidates, totalling 45 fits  
ElasticNet - RMSE: 0.5321, R2: 0.6398  
Best params: {'model__alpha': 0.001, 'model__l1_ratio': 0.5}  
Fitting 5 folds for each of 3 candidates, totalling 15 fits  
RandomForest - RMSE: 0.1917, R2: 0.9532  
Best params: {'model__max_depth': None}  
Fitting 5 folds for each of 6 candidates, totalling 30 fits  
XGBoost - RMSE: 0.2083, R2: 0.9426  
Best params: {'model__learning_rate': 0.1, 'model__max_depth': 6}  
Fitting 5 folds for each of 6 candidates, totalling 30 fits  
LightGBM - RMSE: 0.2209, R2: 0.9379  
Best params: {'model__learning_rate': 0.1, 'model__num_leaves': 63}  
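
Because the target is `log1p(price)`, these RMSE values are in log units rather than currency. As a rough reading aid (an approximation, not a metric the notebook computed), `expm1(rmse)` turns a log-scale RMSE into a typical relative error. The values below are the RMSE figures copied from the log above.

```python
import math

# RMSE values on the log1p(price) scale, copied from the log above.
log_rmse = {
    'Ridge': 0.5146,
    'Lasso': 0.5452,
    'ElasticNet': 0.5321,
    'RandomForest': 0.1917,
    'XGBoost': 0.2083,
    'LightGBM': 0.2209,
}

# exp(rmse) - 1 approximates the typical relative error of the predicted price.
relative_error = {name: math.expm1(rmse) for name, rmse in log_rmse.items()}

for name, err in sorted(relative_error.items(), key=lambda item: item[1]):
    print(f"{name}: about {err:.0%}")
```

Read this way, the tree models are off by roughly a fifth to a quarter of the price on a typical listing, while the linear models are off by well over half.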




### Results from a run in another notebook:  
Fitting 3 folds for each of 15 candidates, totalling 45 fits  
RandomForest - RMSE: 0.1909, R2: 0.9536  
Best params: {'model__n_estimators': 150, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 0.8, 'model__max_depth': None, 'model__bootstrap': False}
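
The 15 candidates over 3 folds and the mix of tuned parameters are consistent with a randomized search, though the other notebook's code is not shown here. The sketch below is an assumption about how such a search could be set up with `RandomizedSearchCV`. It uses synthetic data, drops the `model__` prefix because there is no pipeline, and uses scaled-down candidate values so it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the housing features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 + X[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical search space using the parameter names seen in the result above;
# the candidate values are illustrative, not the ones used in that run.
param_distributions = {
    'n_estimators': [20, 40],
    'max_depth': [5, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': [0.8, 1.0],
    'bootstrap': [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=15,  # 15 sampled candidates
    cv=3,       # 3 folds per candidate
    scoring='neg_root_mean_squared_error',
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Sampling 15 of the 64 possible combinations keeps the cost at 45 fits, the same budget as the ElasticNet grid above, while covering more parameters than a full grid could at that cost.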