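# Data Analysis: Wine Tasting
Wine is a fruit wine made from fermented grapes. Its alcohol content is higher than that of beer and lower than that of baijiu (Chinese liquor). It is widely described as nutritious and good for health, and some regard it as one of the healthiest and most hygienic beverages. It is said to regulate metabolism, promote blood circulation, and keep cholesterol from rising, and also to act as a diuretic, stimulate liver function, and slow aging. It is further described as an adjunct in treating heart disease and as helping to prevent scurvy, anemia, beriberi, indigestion, and keratitis. Regular wine drinkers are said to have lower rates of heart disease, lower blood lipids, and less hardening of the arteries.

More and more people in China now enjoy wine. At present, wine is graded mainly by professional tasters. Humans perceive only four basic tastes: sweet, sour, salty, and bitter. Sensitivity to the four differs, so the corresponding substances also register in the mouth at different speeds. The tip of the tongue is most sensitive to sweetness, the sides near the tip to saltiness, the sides of the tongue to sourness, and the root of the tongue to bitterness.

So how can a wine's grade be judged more precisely? If we measure a wine's chemical composition and combine it with the tasters' verdicts, can we grade wine from its composition alone? Here we use data analysis to look at the relationship between a wine's composition and its grade. By color, wines fall into three classes: red, white, and rosé. We have collected data for a number of red and white wines, and the analysis below is based on that data.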

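First import the libraries we need, then read in the data. The data comes in two files, one for red wine and one for white wine, and each is read into its own DataFrame.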

In [224]:
import pandas as pd
import numpy as np

Read in the red wine data

In [225]:
red_wine = pd.read_csv('data/wine/winequality-red.csv', delimiter=';')
red_wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


Read in the white wine data

In [226]:
white_wine = pd.read_csv('data/wine/winequality-white.csv', delimiter=';')
white_wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


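With the data loaded, check whether it contains any missing values. A missing value is an entry that is absent from the data file, either because the value does not exist or because it was never collected.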

In [227]:
# any() is True for a column with at least one missing value
red_wine.isnull().any()

fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
quality                 False
dtype: bool

In [228]:
# any() is True for a column with at least one missing value
white_wine.isnull().any()

fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
quality                 False
dtype: bool
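
The choice of reduction matters when checking for missing values: `any()` flags a column that has at least one missing entry, while `all()` flags a column only when every entry is missing. The sketch below illustrates the difference on a small made-up DataFrame (the values and the name `toy` are hypothetical and not taken from the wine files).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing alcohol value.
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.1],
    "pH": [3.51, 3.20, 3.26],
})

has_missing = toy.isnull().any()     # True if a column has at least one NaN
all_missing = toy.isnull().all()     # True only if every value in a column is NaN
missing_counts = toy.isnull().sum()  # number of NaN values per column

print(has_missing)
print(all_missing)
print(missing_counts)
```

On data like this, `all()` reports no problem for `alcohol` even though one value is missing, so `any()` or `sum()` is usually the safer check.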

The results above show that the data has no missing values. Next, compute the pairwise correlations between the columns and display them as a heat map.

In [229]:
red_corr = red_wine.corr()
red_corr

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
fixed acidity,1.0,-0.256131,0.671703,0.114777,0.093705,-0.153794,-0.113181,0.668047,-0.682978,0.183006,-0.061668,0.124052
volatile acidity,-0.256131,1.0,-0.552496,0.001918,0.061298,-0.010504,0.07647,0.022026,0.234937,-0.260987,-0.202288,-0.390558
citric acid,0.671703,-0.552496,1.0,0.143577,0.203823,-0.060978,0.035533,0.364947,-0.541904,0.31277,0.109903,0.226373
residual sugar,0.114777,0.001918,0.143577,1.0,0.05561,0.187049,0.203028,0.355283,-0.085652,0.005527,0.042075,0.013732
chlorides,0.093705,0.061298,0.203823,0.05561,1.0,0.005562,0.0474,0.200632,-0.265026,0.37126,-0.221141,-0.128907
free sulfur dioxide,-0.153794,-0.010504,-0.060978,0.187049,0.005562,1.0,0.667666,-0.021946,0.070377,0.051658,-0.069408,-0.050656
total sulfur dioxide,-0.113181,0.07647,0.035533,0.203028,0.0474,0.667666,1.0,0.071269,-0.066495,0.042947,-0.205654,-0.1851
density,0.668047,0.022026,0.364947,0.355283,0.200632,-0.021946,0.071269,1.0,-0.341699,0.148506,-0.49618,-0.174919
pH,-0.682978,0.234937,-0.541904,-0.085652,-0.265026,0.070377,-0.066495,-0.341699,1.0,-0.196648,0.205633,-0.057731
sulphates,0.183006,-0.260987,0.31277,0.005527,0.37126,0.051658,0.042947,0.148506,-0.196648,1.0,0.093595,0.251397
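
By default, `DataFrame.corr()` computes the Pearson correlation coefficient for each pair of columns. As a rough illustration of what each cell in the table means, the coefficient can be computed by hand as below. The function name `pearson` and the sample values are hypothetical and chosen only for this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same constant factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, not taken from the wine files.
alcohol = [9.0, 10.0, 11.0, 12.0]
quality = [5, 5, 6, 7]
r = pearson(alcohol, quality)
print(round(r, 4))
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.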


In [230]:
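# Top-level chart imports like this follow the pyecharts 0.5.x API
from pyecharts import HeatMap
# Feature names label both axes
x_axis = y_axis = red_corr.index.tolist()
values = red_corr.values
# One [x_index, y_index, value] triple per cell of the correlation matrix
data = [[i, j, values[i, j]] for i in range(len(x_axis)) for j in range(len(y_axis))]
heatmap = HeatMap()
heatmap.add(
    "Correlation heatmap",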
    x_axis,
    y_axis,
    data,
    is_visualmap=True,
    visual_range=[-1, 1],
    visual_text_color="#000",
    visual_orient="horizontal",
    visual_pos="center",
    is_piecewise=True,
)
heatmap

In [231]:
white_corr = white_wine.corr()
white_corr

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
fixed acidity,1.0,-0.022697,0.289181,0.089021,0.023086,-0.049396,0.09107,0.265331,-0.425858,-0.017143,-0.120881,-0.113663
volatile acidity,-0.022697,1.0,-0.149472,0.064286,0.070512,-0.097012,0.089261,0.027114,-0.031915,-0.035728,0.067718,-0.194723
citric acid,0.289181,-0.149472,1.0,0.094212,0.114364,0.094077,0.121131,0.149503,-0.163748,0.062331,-0.075729,-0.009209
residual sugar,0.089021,0.064286,0.094212,1.0,0.088685,0.299098,0.401439,0.838966,-0.194133,-0.026664,-0.450631,-0.097577
chlorides,0.023086,0.070512,0.114364,0.088685,1.0,0.101392,0.19891,0.257211,-0.090439,0.016763,-0.360189,-0.209934
free sulfur dioxide,-0.049396,-0.097012,0.094077,0.299098,0.101392,1.0,0.615501,0.29421,-0.000618,0.059217,-0.250104,0.008158
total sulfur dioxide,0.09107,0.089261,0.121131,0.401439,0.19891,0.615501,1.0,0.529881,0.002321,0.134562,-0.448892,-0.174737
density,0.265331,0.027114,0.149503,0.838966,0.257211,0.29421,0.529881,1.0,-0.093591,0.074493,-0.780138,-0.307123
pH,-0.425858,-0.031915,-0.163748,-0.194133,-0.090439,-0.000618,0.002321,-0.093591,1.0,0.155951,0.121432,0.099427
sulphates,-0.017143,-0.035728,0.062331,-0.026664,0.016763,0.059217,0.134562,0.074493,0.155951,1.0,-0.017433,0.053678


In [232]:
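# Heat map for the white wine correlation matrix
x_axis = y_axis = white_corr.index.tolist()
values = white_corr.values
# One [x_index, y_index, value] triple per cell of the correlation matrix
data = [[i, j, values[i, j]] for i in range(len(x_axis)) for j in range(len(y_axis))]
heatmap = HeatMap()
heatmap.add(
    "Correlation heatmap",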
    x_axis,
    y_axis,
    data,
    is_visualmap=True,
    visual_range=[-1, 1],
    visual_text_color="#000",
    visual_orient="horizontal",
    visual_pos="center",
    is_piecewise=True,
)
heatmap

Next, use a relationship graph to confirm which attributes have the larger influence on wine quality.

In [233]:
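from pyecharts import Graph
data = red_corr
# One node per feature, sized by the strength of its correlation with quality.
# abs() keeps the size positive for negatively correlated features.
nodes = []
for name in data.index.values:
    nodes.append({'name': name, 'symbolSize': abs(data.at[name, 'quality']) * 50})

# One link per ordered pair of features, weighted by their correlation
links = []
for i in nodes:
    for j in nodes:
        links.append({"source": i.get('name'), "target": j.get('name'),
                      "value": data.at[i.get('name'), j.get('name')] * 50})
graph = Graph("Red wine relationship graph (force-directed layout)")
graph.add("", nodes, links, graph_edge_length=250, is_label_show=True,)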

In [239]:
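data = white_corr
# One node per feature, sized by the strength of its correlation with quality.
# abs() keeps the size positive for negatively correlated features.
nodes = []
for name in data.index.values:
    nodes.append({'name': name, 'symbolSize': abs(data.at[name, 'quality']) * 50})

# One link per ordered pair of features, weighted by their correlation
links = []
for i in nodes:
    for j in nodes:
        links.append({"source": i.get('name'), "target": j.get('name'),
                      "value": data.at[i.get('name'), j.get('name')] * 50})
graph = Graph("White wine relationship graph (force-directed layout)")
graph.add("", nodes, links, graph_edge_length=250, is_label_show=True,)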

Next, look at how each attribute correlates with the wine's quality grade:

In [240]:
corr_map = pd.concat([red_corr['quality'], white_corr['quality']], axis=1)
corr_map.columns = ['red_wine_quality', 'white_wine_quality']
corr_map.drop(index=['quality'], inplace=True)
corr_map

Unnamed: 0,red_wine_quality,white_wine_quality
fixed acidity,0.124052,-0.113663
volatile acidity,-0.390558,-0.194723
citric acid,0.226373,-0.009209
residual sugar,0.013732,-0.097577
chlorides,-0.128907,-0.209934
free sulfur dioxide,-0.050656,0.008158
total sulfur dioxide,-0.1851,-0.174737
density,-0.174919,-0.307123
pH,-0.057731,0.099427
sulphates,0.251397,0.053678
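
One way to read a table like this is to keep only the features whose absolute correlation with quality exceeds a cutoff. The sketch below applies a cutoff of 0.2 to the red wine column, using only the rows printed in the table. The names `red_quality_corr`, `THRESHOLD`, and `strong` are illustrative choices for this sketch.

```python
# Red wine correlations with quality, copied from the rows printed in the table.
red_quality_corr = {
    "fixed acidity": 0.124052,
    "volatile acidity": -0.390558,
    "citric acid": 0.226373,
    "residual sugar": 0.013732,
    "chlorides": -0.128907,
    "free sulfur dioxide": -0.050656,
    "total sulfur dioxide": -0.1851,
    "density": -0.174919,
    "pH": -0.057731,
    "sulphates": 0.251397,
}

THRESHOLD = 0.2  # assumed cutoff for a correlation worth a closer look

# Keep features whose correlation magnitude exceeds the cutoff
strong = {name: r for name, r in red_quality_corr.items() if abs(r) > THRESHOLD}

# Print the strongest correlations first, regardless of sign
for name, r in sorted(strong.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: {r:+.3f}")
```

Sorting by absolute value keeps strong negative correlations, such as volatile acidity, alongside the positive ones.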


In [241]:
from pyecharts import Bar
bar = Bar('Correlation with wine quality: red vs. white')
labels = corr_map.index.tolist()
red_values = corr_map['red_wine_quality'].tolist()
white_values = corr_map['white_wine_quality'].tolist()
# Marker lines at +0.2 and -0.2 make the stronger correlations easy to spot
bar.add('red wine', labels, red_values,
        mark_line_raw=[{'name': 'threshold', 'yAxis': '0.2'}, {'name': 'threshold', 'yAxis': '-0.2'}])
bar.add('white wine', labels, white_values,
        mark_line_raw=[{'name': 'threshold', 'yAxis': '0.2'}, {'name': 'threshold', 'yAxis': '-0.2'}])
bar

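Comparing the two series in this chart, the attributes that fall outside the range -0.2 to 0.2 are, for red wine: volatile acidity, citric acid, sulphates, and alcohol; and for white wine: volatile acidity, chlorides, density, and alcohol. Alcohol has a relatively large influence on quality for both red and white wine. Next, let's look at how alcohol content affects wine quality. A scatter plot is a good way to show how x and y vary together.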

In [245]:
from pyecharts import Scatter
scatter = Scatter('Effect of alcohol content on wine quality')
# x = alcohol content, y = quality grade, one point per wine
red_wine_alcohol = red_wine['alcohol'].tolist()
red_wine_quality = red_wine['quality'].tolist()
scatter.add('Red wine', red_wine_alcohol, red_wine_quality)

white_wine_alcohol = white_wine['alcohol'].tolist()
white_wine_quality = white_wine['quality'].tolist()
scatter.add('White wine', white_wine_alcohol, white_wine_quality)
scatter

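Because there are so many data points, the scatter plot does not show clearly how quality changes with alcohol content, so we need another way to continue the analysis. Although the pattern is not very clear, the plot does suggest that quality improves somewhat as alcohol content increases. Here we group the data by quality grade and use a box plot to look at how alcohol content is distributed within each grade and to compare the distributions across grades.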

In [243]:
x_axis = ['3', '4', '5', '6', '7', '8', '9']
y1_axis = []
for level in x_axis:
    data = red_wine[red_wine['quality'] == int(level)]
    values = [i for i in data['alcohol']]
    y1_axis.append(values)
    
y2_axis = []
for level in x_axis:
    data = white_wine[white_wine['quality'] == int(level)]
    values = [i for i in data['alcohol']]
    y2_axis.append(values)

In [244]:
from pyecharts import Boxplot
boxplot = Boxplot('Wine quality')
boxplot.add('Red wine', x_axis, boxplot.prepare_data(y1_axis))
boxplot.add('White wine', x_axis, boxplot.prepare_data(y2_axis))
boxplot
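
The per-grade comparison can also be checked numerically by grouping on quality and summarizing alcohol content within each group. The sketch below uses a small made-up DataFrame (the name `toy` and its values are hypothetical); with the real data, the same `groupby` call could be applied to `red_wine` or `white_wine`.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the wine files.
toy = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 7, 7],
    "alcohol": [9.4, 9.8, 9.6, 10.2, 10.8, 11.5, 12.1],
})

# Count, median, and mean alcohol content for each quality grade
summary = toy.groupby("quality")["alcohol"].agg(["count", "median", "mean"])
print(summary)
```

The median column corresponds to the center line of each box in a box plot, so a median that rises from grade to grade matches boxes that shift upward.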

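From the box plot above, together with the earlier scatter plot, we can broadly conclude that wine quality tends to rise as alcohol content increases. This holds for both red and white wine. A natural next step is to analyze a negatively correlated attribute, volatile acidity, against the quality grade. That is left for the reader to verify.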