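### Classification algorithm
The goal is a classifier that accurately identifies which Red Hat customers have the most potential business value, based on their characteristics and activities.<br>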
https://www.kaggle.com/c/predicting-red-hat-business-value

Logistic regression

In [1]:
# read the data
import pandas as pd
import numpy as np
data_dir = 'data/01_redHat/'  # renamed from `dir`, which shadows the Python builtin
df_people = pd.read_csv(data_dir + 'people.csv')
df_act_train = pd.read_csv(data_dir + 'act_train.csv')

ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
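
An error of this kind may point at the file itself (a wrong path, an incomplete download, or a file too large to read in one pass) rather than at the parsing options. As a minimal sketch, assuming the file is intact, the snippet below checks that a CSV exists and is non-empty, then reads it in chunks. The file `example.csv` and its contents are hypothetical stand-ins, not the Kaggle data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a large file such as act_train.csv.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "example.csv")
pd.DataFrame({
    "people_id": [f"ppl_{i}" for i in range(10)],
    "outcome": [i % 2 for i in range(10)],
}).to_csv(csv_path, index=False)

# Confirm the file is really there and non-empty before parsing it.
file_ok = os.path.isfile(csv_path) and os.path.getsize(csv_path) > 0

# Read 4 rows at a time, then stitch the pieces back together.
chunks = pd.read_csv(csv_path, chunksize=4)
df_example = pd.concat(chunks, ignore_index=True)
print(df_example.shape)
```

Reading in chunks keeps peak memory lower during parsing; the chunk size of 4 is only for illustration.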

### Cleaning the data
Drop the columns that are mostly missing, then merge the two tables.

In [None]:
# Drop the act_train columns that have too many missing values
x = ['date','char_1','char_2','char_3','char_4','char_5','char_6','char_7','char_8','char_9']
df_act_train.drop(x, axis=1,inplace=True)

df_act_train.iloc[0] # first row after the drop
# df_act_train.columns
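
The list of columns to drop is hard-coded above. One way to arrive at such a list is to measure the fraction of missing values per column and drop everything above a cutoff. The sketch below does that on a toy frame; the column names only mimic those in `act_train.csv`, and the 0.5 threshold is an assumption, not a value taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are made up for illustration.
toy_act = pd.DataFrame({
    "activity_category": ["type 1", "type 2", "type 2", "type 4"],
    "char_1": [np.nan, np.nan, np.nan, "type 3"],
    "char_10": ["type 1", np.nan, "type 7", "type 2"],
})

# Fraction of missing values in each column.
missing_ratio = toy_act.isnull().mean()

# Hypothetical cutoff: drop columns that are more than half empty.
threshold = 0.5
sparse_cols = missing_ratio[missing_ratio > threshold].index.tolist()
toy_act_clean = toy_act.drop(columns=sparse_cols)
print(sparse_cols)
```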

In [None]:
# Drop the people column that has too many missing values
df_people.drop(['date'], axis=1, inplace=True)

In [None]:
# Join the people table and the act_train table
df = pd.merge(df_people,df_act_train, on='people_id')

# Drop columns that are of no use to the model: people_id, group_1, activity_id
df.drop(['people_id', 'group_1', 'activity_id'], axis=1, inplace=True)

df.shape
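
Both tables still contain a column named `char_10` at this point, so `pd.merge` keeps both and renames them with its default suffixes `_x` (left table) and `_y` (right table). This toy example, with made-up rows, shows where a name such as `char_10_y` comes from.

```python
import pandas as pd

# Made-up rows; only the column names follow the notebook.
toy_people = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2"],
    "char_10": [True, False],
})
toy_activities = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_1", "ppl_2"],
    "char_10": ["type 1", None, "type 3"],
    "outcome": [0, 0, 1],
})

# Default inner join on the key; overlapping names get _x / _y suffixes.
toy_merged = pd.merge(toy_people, toy_activities, on="people_id")
print(toy_merged.columns.tolist())
```

Each person row is repeated once per matching activity, so the merged frame has one row per activity.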

### Preprocessing the data
Fill missing values, normalize continuous values, and encode categorical values.

In [None]:
# Check which columns contain missing values
df.isnull().any()

In [None]:
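# char_10_y turns out to have missing values
#df.dropna(how="all") # would drop rows that are entirely NaN
# Forward fill; equivalent to fillna(method='pad'), which is deprecated in recent pandas
df['char_10_y'] = df['char_10_y'].ffill()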

df.dtypes
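
Forward filling copies the most recent non-missing value downward, so a gap at the very top of a column has nothing to copy from and stays missing. The sketch below shows that on a toy series and closes the remaining gap with a placeholder category; the `"unknown"` label is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

toy_col = pd.Series([np.nan, "type 1", np.nan, np.nan, "type 2"])

# Forward fill: each gap takes the value of the row above it.
forward_filled = toy_col.ffill()
still_missing = int(forward_filled.isnull().sum())

# The leading gap survives, so fill it with a hypothetical placeholder.
toy_filled = forward_filled.fillna("unknown")
print(toy_filled.tolist())
```

Note that forward filling treats row order as meaningful; for unordered rows, a placeholder or the most frequent value may be an easier choice to justify.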

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# Convert every column to categorical, except char_38, which is continuous
for i in df.columns:
    if i != 'char_38':
        #df[i] = df[i].astype('category')
        # Label-encode first; otherwise an error is raised because strings cannot be converted
        df[i] = le.fit_transform(df[i])
        df[i] = pd.Categorical(df[i])

# Min-max normalize char_38 and write the result back into the frame
max_min_scaler = lambda x:(x-np.min(x))/(np.max(x)-np.min(x))
df['char_38'] = max_min_scaler(df['char_38'])

df.dtypes
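
To make the two transformations concrete, here is a small sketch on made-up values. It uses `LabelEncoder` as in the cell above, and swaps in scikit-learn's `MinMaxScaler` for the hand-written lambda; both compute `(x - min) / (max - min)`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Made-up values; only the column names follow the notebook.
toy_frame = pd.DataFrame({
    "char_1": ["type 2", "type 1", "type 2", "type 3"],
    "char_38": [36.0, 76.0, 0.0, 100.0],
})

# LabelEncoder maps the sorted unique strings to 0, 1, 2, ...
toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_frame["char_1"])

# MinMaxScaler expects a 2-D input, hence the double brackets.
toy_scaler = MinMaxScaler()
toy_scaled = toy_scaler.fit_transform(toy_frame[["char_38"]]).ravel()
print(toy_codes, toy_scaled)
```

A fitted scaler can later be applied to new data with `transform`, which reuses the minimum and maximum learned here.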

### Splitting into training and test sets

In [None]:
# Separate the features X from the label y (outcome is the last column)
x = df.iloc[:,:-1]
y = df.iloc[:,-1]

# Split into a training set and a test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
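
If the two outcome classes are imbalanced, a plain random split can leave the test set with a different class mix than the training set. One option, not used in the notebook, is the `stratify` argument of `train_test_split`. A minimal sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 25% of them positive.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

# stratify=y_toy keeps the positive rate the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```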

### Fitting the model: logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
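# C is the inverse of the regularization strength, so C=1e4 (10000) means very weak regularization
# solver is the optimization algorithm: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'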
logreg = LogisticRegression(C=1e4, solver='newton-cg')
logreg.fit(x_train, y_train)

In [None]:
# Get the model's hyperparameters (the learned weights are in logreg.coef_ and logreg.intercept_)
logreg.get_params()

In [None]:
# Score the model: mean accuracy on the test set
logreg.score(x_test, y_test)
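
Because `C` is the inverse of the regularization strength, a smaller `C` pulls the learned weights more strongly toward zero. The sketch below illustrates this on synthetic data from `make_classification`; the dataset and the two `C` values are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, slightly noisy binary classification problem.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)

# Large C: weak penalty. Small C: strong penalty.
weak_reg = LogisticRegression(C=1e4, max_iter=1000).fit(X_toy, y_toy)
strong_reg = LogisticRegression(C=1e-2, max_iter=1000).fit(X_toy, y_toy)

# Compare the size of the learned weight vectors.
weak_norm = np.linalg.norm(weak_reg.coef_)
strong_norm = np.linalg.norm(strong_reg.coef_)
print(weak_norm, strong_norm)
```

The weight vector fitted with the smaller `C` should come out with the smaller norm.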

### Cross-validation

In [None]:
# 5-fold cross-validation, scored by accuracy
from sklearn.model_selection import cross_val_score
scores = cross_val_score(logreg, x, y, cv=5, scoring='accuracy')
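
`cross_val_score` returns one score per fold, and the usual summary is the mean together with the spread across folds. A minimal sketch on synthetic data (the dataset is an assumption, not the Red Hat data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=0,
)

toy_model = LogisticRegression(max_iter=1000)

# One accuracy value per fold.
cv_scores = cross_val_score(toy_model, X_toy, y_toy, cv=5, scoring="accuracy")
print(f"accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```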

In [None]:
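# ROC curve
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.metrics import auc
# Predicted probability of the positive class; roc_curve needs scores, not hard labels
y_score = logreg.predict_proba(x_test)[:, 1]

# After label encoding, outcome takes the values 0 and 1, so the positive label is 1
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score, pos_label=1)

# Plot
plt.plot(fpr, tpr, marker = 'o')
plt.show()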

In [None]:
# Compute the AUC (area under the ROC curve)
from sklearn.metrics import auc
AUC = auc(fpr, tpr)
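
A ROC curve is traced by sweeping a decision threshold over continuous scores, and the AUC is the area under that curve. The toy example below, with made-up labels and scores, computes the area both from the curve and directly with `roc_auc_score`; the two routes agree.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted scores for four samples.
toy_true = np.array([0, 0, 1, 1])
toy_score = np.array([0.1, 0.4, 0.35, 0.8])

# Route 1: build the curve, then integrate it.
toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_true, toy_score, pos_label=1)
area_from_curve = auc(toy_fpr, toy_tpr)

# Route 2: compute the area directly from labels and scores.
area_direct = roc_auc_score(toy_true, toy_score)
print(area_from_curve, area_direct)
```

Three of the four positive/negative pairs are ranked correctly here, which is why the area is 0.75.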