## Pros and cons of naive Bayes

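Pros:

  - The algorithm is simple and easy to implement: it only requires applying Bayes' rule.
  - Classification is cheap in time and memory: because the features are assumed independent, the model only needs to store a two-dimensional table of per-class, per-feature statistics.

Cons:
  - Naive Bayes assumes the attributes are mutually independent, which often does not hold in practice. The more strongly the attributes are correlated, the larger the classification error tends to be.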
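
To make the independence drawback concrete, here is a minimal pure-Python sketch. The two classes, the priors, and the likelihood values are made-up numbers chosen for illustration, and `naive_posterior` is a hypothetical helper, not a scikit-learn function. It computes the posterior once from a single binary feature and once with the same feature duplicated, which is the extreme case of correlated attributes.

```python
def naive_posterior(priors, likelihoods, observed):
    """Posterior over classes under the naive independence assumption.

    priors:      {class: P(class)}
    likelihoods: {class: [P(x_i = 1 | class) for each binary feature i]}
    observed:    list of 0/1 feature values
    """
    scores = {}
    for c, prior in priors.items():
        score = prior
        # Independence assumption: multiply the per-feature likelihoods.
        for p_one, x in zip(likelihoods[c], observed):
            score *= p_one if x == 1 else (1.0 - p_one)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


priors = {"A": 0.5, "B": 0.5}  # illustrative values only

# One informative feature.
single = naive_posterior(priors, {"A": [0.8], "B": [0.4]}, [1])

# The same feature counted twice, as if it were two independent features.
doubled = naive_posterior(priors, {"A": [0.8, 0.8], "B": [0.4, 0.4]}, [1, 1])

print(round(single["A"], 3), round(doubled["A"], 3))
```

The duplicated feature carries no new information, so the correct posterior for class A stays at about 0.667. The naive model counts the evidence twice and reports about 0.8. Real correlated attributes cause the same kind of overconfidence.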


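sklearn provides three commonly used types of naive Bayes:

  - Gaussian (`GaussianNB`): for classification with continuous features, assuming each attribute/feature is normally distributed within a class.
  - Multinomial (`MultinomialNB`): for discrete count features. In text classification, for example, we care not only whether a word appears in a document but also how many times it appears. If the document has n words in total and a given word appears m times, this resembles rolling a die n times and seeing that word's face come up m times.
  - Bernoulli (`BernoulliNB`): for binary features that only record 0 (absent) or 1 (present).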
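
The three variants differ only in how they model the likelihood of a feature vector given a class. The following sketch writes out the three log-likelihoods in plain Python. The function names and all parameter values are assumptions made for illustration; they are not the scikit-learn implementation, which also estimates these parameters from data and applies smoothing.

```python
import math


def gaussian_log_likelihood(x, means, variances):
    """Gaussian model: each continuous feature is normal within the class."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, means, variances)
    )


def multinomial_log_likelihood(counts, word_probs):
    """Multinomial model: each occurrence of word i contributes log(word_probs[i]).

    The multinomial coefficient is dropped because it is the same for every class.
    """
    return sum(c * math.log(p) for c, p in zip(counts, word_probs))


def bernoulli_log_likelihood(present, presence_probs):
    """Bernoulli model: every word contributes, whether present (1) or absent (0)."""
    return sum(
        math.log(p) if x == 1 else math.log(1.0 - p)
        for x, p in zip(present, presence_probs)
    )


# Illustrative parameters for one class and a three-word vocabulary.
probs = [0.5, 0.3, 0.2]
counts = [3, 0, 1]                     # word counts in one document
present = [1 if c > 0 else 0 for c in counts]

print(gaussian_log_likelihood([0.0], [0.0], [1.0]))
print(multinomial_log_likelihood(counts, probs))
print(bernoulli_log_likelihood(present, probs))
```

Note how the second word, which does not occur, contributes nothing in the multinomial model but contributes `log(1 - 0.3)` in the Bernoulli model. The same numbers are reused for both models only to keep the sketch short. In practice multinomial word probabilities sum to 1 over the vocabulary, while Bernoulli presence probabilities need not.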


### Classifying the iris dataset

In [4]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn import datasets
iris = datasets.load_iris()
gnb = GaussianNB()
scores=cross_val_score(gnb, iris.data, iris.target, cv=10)
print("Accuracy:%.3f"%scores.mean())

Accuracy:0.953
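
The call above passes `cv=10`, that is, 10-fold cross-validation: the data is split into 10 parts, and each part serves once as the test set while the other nine are used for training. The sketch below shows only the index bookkeeping, with a hypothetical helper `k_fold_indices`. It is a simplification: for a classifier, `cross_val_score` with an integer `cv` uses stratified folds that preserve the class proportions, which this plain version does not do.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, test))
        start += size
    return splits


splits = k_fold_indices(150, 10)  # the iris dataset has 150 samples
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Each fold trains on 135 samples and tests on 15. Because the iris samples are ordered by class, contiguous unstratified folds like these would be a poor choice on the real data, which is one reason stratification matters.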


### Kaggle competition: San Francisco Crime Classification

In [11]:
import pandas as pd  
import numpy as np  
from sklearn import preprocessing  
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss  
train = pd.read_csv('train.csv', parse_dates = ['Dates'])  
test = pd.read_csv('test.csv', parse_dates = ['Dates'])  
train.head()

| | Dates | Category | Descript | DayOfWeek | PdDistrict | Resolution | Address | X | Y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-13 23:53:00 | WARRANTS | WARRANT ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 1 | 2015-05-13 23:53:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 2 | 2015-05-13 23:33:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | VANNESS AV / GREENWICH ST | -122.424363 | 37.800414 |
| 3 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | NORTHERN | NONE | 1500 Block of LOMBARD ST | -122.426995 | 37.800873 |
| 4 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | PARK | NONE | 100 Block of BRODERICK ST | -122.438738 | 37.771541 |


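The columns have the following meanings:

    Dates: date and time of the incident
    Category: crime category, e.g. LARCENY/THEFT
    Descript: a more detailed description of the crime
    DayOfWeek: day of the week
    PdDistrict: police district
    Resolution: how the case was resolved, e.g. arrested or got away
    Address: street or block where the incident happened
    X and Y: GPS coordinates (longitude and latitude)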


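The data in train.csv spans 12 years and contains nearly 900,000 records. As the table above shows, most of the columns are categorical, such as the crime category and the day of the week.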

### (2) Feature preprocessing

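  The `LabelEncoder` class in the sklearn.preprocessing module assigns an integer code to each category; we use it to encode the crime category. The pandas function `get_dummies()` turns a categorical variable into binary 0/1 indicator columns; we use it to one-hot encode the police district, the day of the week, and the hour of the day.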
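
As a rough illustration of what these two transformations do, here is a pure-Python sketch on a few toy values. `label_encode` and `one_hot` are hypothetical helpers written for this example. They mimic the basic behavior (distinct values in sorted order, one integer code or one indicator column each) and are not the library implementations.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot(values):
    """Return (column_names, rows) with one 0/1 indicator column per distinct value."""
    columns = sorted(set(values))
    rows = [[1 if v == c else 0 for c in columns] for v in values]
    return columns, rows


# Toy inputs borrowed from the first rows of the table above.
categories = ["WARRANTS", "OTHER OFFENSES", "LARCENY/THEFT", "OTHER OFFENSES"]
codes, classes = label_encode(categories)
print(classes)
print(codes)

days = ["Wednesday", "Monday", "Wednesday"]
columns, rows = one_hot(days)
print(columns)
print(rows)
```

With only three distinct categories in the toy input, the codes run from 0 to 2. On the full training set the same idea yields codes from 0 to 38, one per crime category.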

In [12]:
# Encode the crime category (Category) as integers with LabelEncoder
leCrime = preprocessing.LabelEncoder()
crime = leCrime.fit_transform(train.Category)   # 39 crime categories
# One-hot encode day of week, police district, and hour with get_dummies
days = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)
hour = train.Dates.dt.hour
hour = pd.get_dummies(hour)
# Combine the features

trainData = pd.concat([hour, days, district], axis=1)  # concatenate the features column-wise
trainData['crime'] = crime   # append the 'crime' label column
# Apply the same encoding to the test set
days = pd.get_dummies(test.DayOfWeek)
district = pd.get_dummies(test.PdDistrict)
hour = test.Dates.dt.hour
hour = pd.get_dummies(hour)
testData = pd.concat([hour, days, district], axis=1)
trainData


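| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | CENTRAL | INGLESIDE | MISSION | NORTHERN | PARK | RICHMOND | SOUTHERN | TARAVAL | TENDERLOIN | crime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 37 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 21 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 21 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 16 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 16 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 36 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 36 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 16 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 |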


In [13]:
trainData.columns

Index([           0,            1,            2,            3,            4,
                  5,            6,            7,            8,            9,
                 10,           11,           12,           13,           14,
                 15,           16,           17,           18,           19,
                 20,           21,           22,           23,     'Friday',
           'Monday',   'Saturday',     'Sunday',   'Thursday',    'Tuesday',
        'Wednesday',    'BAYVIEW',    'CENTRAL',  'INGLESIDE',    'MISSION',
         'NORTHERN',       'PARK',   'RICHMOND',   'SOUTHERN',    'TARAVAL',
       'TENDERLOIN',      'crime'],
      dtype='object')

### (3) Modeling

In [14]:
from sklearn.naive_bayes import BernoulliNB
import time
features=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION',  
 'NORTHERN', 'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']  

X_train, X_test, y_train, y_test = train_test_split(trainData[features], trainData['crime'], train_size=0.6)  
NB = BernoulliNB()  
nbStart = time.time()  
NB.fit(X_train, y_train)  
nbCostTime = time.time() - nbStart  
#print(X_test.shape)
# X_test has 17 feature columns; each row of proba holds one sample's
# predicted probability for each of the 39 crime categories
proba = NB.predict_proba(X_test)
print("Naive Bayes training time: %.2f s" % nbCostTime)
logLoss = log_loss(y_test, proba)
print("Naive Bayes log loss: %.6f" % logLoss)



Naive Bayes training time: 1.26 s
Naive Bayes log loss: 2.613916
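
To put a log loss of about 2.61 in context, it helps to see how the metric is computed and what a model with no information would score. The sketch below is a plain-Python version of multiclass log loss under stated assumptions: `multiclass_log_loss` is a hypothetical helper, labels are integer column indices, and the clipping constant `eps` is an assumption of this sketch rather than the exact behavior of `sklearn.metrics.log_loss`.

```python
import math


def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability assigned to each sample's true class."""
    total = 0.0
    for label, row in zip(y_true, proba):
        # Clip to avoid log(0) when the model assigns zero probability.
        p = min(max(row[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)


n_classes = 39  # number of crime categories in this dataset
uniform_row = [1.0 / n_classes] * n_classes

# Arbitrary example labels; with uniform predictions the result does not depend on them.
y_true = [0, 16, 21, 37]
baseline = multiclass_log_loss(y_true, [uniform_row] * len(y_true))
print("uniform-guess log loss: %.4f" % baseline)
```

Predicting the same probability 1/39 for every category gives a log loss of ln(39), roughly 3.66, so the value reported above is clearly better than a uniform guess. Lower is better, and a perfect classifier would approach 0.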
