### 30 Days of Spark

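#### Task 1: PySpark Data Processing

*    Step 1: Connect to a Spark environment from Python
*    Step 2: Create a DataFrame
*    Step 3: Use Spark to find the number of rows and columns
*    Step 4: Use Spark to select the samples whose class is 1
*    Step 5: Use Spark to select the samples with language > 90 or math > 90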


In [None]:
# 1. Connect to a Spark environment from Python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('pyspark') \
    .getOrCreate()
# 2. Create the DataFrame from the raw data
test = spark.createDataFrame([('001','1',100,87,67,83,98), ('002','2',87,81,90,83,83), ('003','3',86,91,83,89,63),
                            ('004','2',65,87,94,73,88), ('005','1',76,62,89,81,98), ('006','3',84,82,85,73,99),
                            ('007','3',56,76,63,72,87), ('008','1',55,62,46,78,71), ('009','2',63,72,87,98,64)],
                             ['number','class','language','math','english','physic','chemical'])
test.show()

##### Find the number of rows and columns

In [None]:
# Method 1: number of columns
column_len = len(test.columns)
print("The length of DataFrame's columns is %s" % column_len)

In [None]:
# Method 1: number of rows (collect() pulls every row to the driver, so use it on small data only)
row_len = len(test.collect())
print("The length of DataFrame's rows is %s" % row_len)

In [None]:
# Method 2: build a pandas-style (rows, columns) shape tuple
shape = (test.count(), len(test.columns))

print("The length of DataFrame's rows is %s" % shape[0])
print("The length of DataFrame's columns is %s" % shape[1])

In [None]:
# Use Spark to select the samples whose class is 1 (class is stored as a string)
test.filter(test['class'] == '1').show()

In [None]:
# Use Spark to select the samples with language > 90 or math > 90
test.filter((test['language'] > 90) | (test['math'] > 90)).show()
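
The parentheses around each comparison in the filter above matter, because in Python `|` binds more tightly than `>`. The following plain-Python sketch uses ordinary integers (the scores of student 003 from the sample data) instead of Spark columns to show how the unparenthesized form is parsed. On real Spark columns the chained comparison would typically raise an error instead of returning a wrong answer, but the parsing is the same.

```python
# Scores of student 003 from the sample data: language=86, math=91.
language, math = 86, 91

# Intended predicate: each comparison is evaluated first, then OR-ed.
with_parens = (language > 90) | (math > 90)

# Without parentheses Python parses this as: language > (90 | math) > 90,
# because the bitwise operator | has higher precedence than >.
without_parens = language > 90 | math > 90

print(with_parens)     # True
print(without_parens)  # False
```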

-----------------------------

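#### Task 2: PySpark Data Statistics

* Step 1: Read the file https://cdn.coggle.club/Pokemon.csv
* Step 2: Save the data that was read, keeping the header
* Step 3: Analyze the type and the number of distinct values of each column
* Step 4: Check whether each column contains missing values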


In [None]:
from pyspark import SparkFiles

# Download the file and register it with Spark
spark.sparkContext.addFile('https://cdn.coggle.club/Pokemon.csv')

# Load the downloaded file into a DataFrame, keeping the header row
df = spark.read.csv("file://"+SparkFiles.get("Pokemon.csv"), header=True, inferSchema= True)
df = df.withColumnRenamed('Sp. Atk', 'Sp Atk')
df = df.withColumnRenamed('Sp. Def', 'Sp Def')

In [None]:
df.show()

##### Analyze the type and the number of distinct values of each column

In [None]:
# Method 1: list of (column name, type) pairs
df.dtypes

In [None]:
# Method 2: print the schema as a tree
df.printSchema()

In [None]:
df.select('Name').count()

In [None]:
# Method 1: count distinct values by de-duplicating the column
# Either of the following two calls works

# df.select('Name').drop_duplicates().count()

df.select('Name').distinct().count()

In [91]:
columns_list = df.columns

In [92]:
columns_list

['Name',
 'Type1',
 'Type2',
 'Total',
 'HP',
 'Attack',
 'Defense',
 'SpAtk',
 'SpDef',
 'Speed',
 'Generation',
 'Legendary']

In [93]:
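for i in columns_list:
    value = df.select(i).drop_duplicates().count()
    print("Column %s has %s distinct values" % (i, value))

Column Name has 799 distinct values
Column Type1 has 18 distinct values
Column Type2 has 19 distinct values
Column Total has 200 distinct values
Column HP has 94 distinct values
Column Attack has 111 distinct values
Column Defense has 103 distinct values
Column SpAtk has 105 distinct values
Column SpDef has 92 distinct values
Column Speed has 108 distinct values
Column Generation has 6 distinct values
Column Legendary has 2 distinct values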


In [94]:
# Method 2: use the aggregate function countDistinct
import pyspark.sql.functions as F
for i in columns_list:
    print(df.agg(F.countDistinct(i).alias(i)).collect())



[Row(Name=799)]
[Row(Type1=18)]
[Row(Type2=18)]
[Row(Total=200)]
[Row(HP=94)]
[Row(Attack=111)]
[Row(Defense=103)]
[Row(SpAtk=105)]
[Row(SpDef=92)]
[Row(Speed=108)]
[Row(Generation=6)]
[Row(Legendary=2)]


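> The two results above differ for the `Type2` column. Inspecting the data shows that `Type2` contains missing values: the first method counts the null value as one distinct value, whereas `countDistinct` excludes nulls before counting.
> Below, we first check each column for missing values and then repeat Method 1 with nulls dropped.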
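
The difference between the two counting methods can be reproduced without Spark. The sketch below uses a made-up toy list with `None` standing in for NULL; it only illustrates the counting semantics described above and is not how Spark implements them.

```python
# A toy column with missing values, standing in for Type2.
type2 = ["Poison", "Poison", None, "Flying", None]

# Like select(col).distinct().count(): NULL forms its own distinct row.
distinct_rows = len(set(type2))

# Like countDistinct(col): NULLs are ignored before counting.
distinct_non_null = len({v for v in type2 if v is not None})

print(distinct_rows, distinct_non_null)  # 3 2
```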

##### Check whether each column contains missing values

In [None]:
# Drop nulls in each column before de-duplicating and counting
for i in columns_list:
    value = df.select(i).dropna().drop_duplicates().count()
    print("Column %s has %s distinct values" % (i, value))

In [None]:
# Fraction of missing values in each column
df.agg(*[(1 - (F.count(c) / F.count('*'))).alias(c) for c in df.columns]).show()
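
The expression above relies on `count(c)` skipping nulls while `count('*')` counts every row. A plain-Python sketch of the same arithmetic, with a made-up four-element column and the hypothetical helper name `missing_ratio`:

```python
def missing_ratio(values):
    """Fraction of entries that are None, computed as 1 - count(c) / count(*)."""
    non_null = sum(v is not None for v in values)  # what count(c) counts
    total = len(values)                            # what count('*') counts
    return 1 - non_null / total


type2 = ["Poison", None, "Flying", None]
print(missing_ratio(type2))  # 0.5
```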

In [95]:
# Number of missing values in each column
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])

In [96]:
df_agg.show()

+----+-----+-----+-----+---+------+-------+-----+-----+-----+----------+---------+
|Name|Type1|Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|
+----+-----+-----+-----+---+------+-------+-----+-----+-----+----------+---------+
|   0|    0|  386|    0|  0|     0|      0|    0|    0|    0|         0|        0|
+----+-----+-----+-----+---+------+-------+-----+-----+-----+----------+---------+



---------------------------------------------------

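#### Task 3: PySpark Grouped Aggregation

* Step 1: Read the file https://cdn.coggle.club/Pokemon.csv
* Step 2: Learn how to use groupby for grouped aggregation
* Step 3: Learn how to use agg for grouped aggregation
* Step 4: Learn how to use transform
* Step 5: Use groupby, agg and transform to compute the mean HP within each Type 1 group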

In [2]:
from pyspark import SparkFiles

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('pyspark') \
    .getOrCreate()
spark.sparkContext.addFile('Pokemon.csv')

# On Windows, change file:// to file:///
df = spark.read.csv("file:///"+SparkFiles.get("Pokemon.csv"), header=True, inferSchema= True)
df = df.withColumnRenamed('Sp. Atk', 'Sp Atk')
df = df.withColumnRenamed('Sp. Def', 'Sp Def')

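##### Step 2: Learn how to use groupby for grouped aggregation

PySpark DataFrames also support the common split-apply-combine strategy for working with grouped data: the data is grouped by some condition, a function is applied to each group, and the results are combined back into a DataFrame.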
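
As a plain-Python illustration of split-apply-combine (not Spark code), the sketch below groups a few `(Type 1, HP)` pairs taken from the first rows of the Pokemon data and averages each group. The variable names are chosen for this example only.

```python
from collections import defaultdict

# (Type 1, HP) pairs for the first rows of the Pokemon data.
rows = [("Grass", 45), ("Grass", 60), ("Grass", 80), ("Grass", 80),
        ("Fire", 39), ("Fire", 58), ("Fire", 78)]

# Split: collect the HP values of each group.
groups = defaultdict(list)
for type1, hp in rows:
    groups[type1].append(hp)

# Apply + combine: reduce each group to one value and gather the results.
mean_hp = {type1: sum(hps) / len(hps) for type1, hps in groups.items()}
print(mean_hp)
```

In Spark the same three phases are expressed declaratively by `groupby` followed by an aggregation, and the work is distributed across partitions instead of running in a single Python loop.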

In [None]:
df.show()

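After `df.groupby()`, you can call the built-in summary methods to get the corresponding results (similar to `GroupBy` in pandas).

The numeric aggregations accept one or more column names to restrict the result to those columns.

* `.count()`: returns the number of rows in each group.
* `.mean()`: returns the mean of each group.
* `.avg()`: returns the average of each group; it is the same operation as `.mean()`.
* `.sum()`: returns the sum of each group.
* `.max()`: returns the maximum of each group.
* `.min()`: returns the minimum of each group.


> In PySpark, `mean()` and `avg()` on grouped data are aliases: both compute the arithmetic mean of each numeric column per group and return identical results.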

In [None]:
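# Group by one column and count the rows in each group
df.groupby('Type 1').count().show()

In [None]:
# Group by one column and compute the mean of the selected columns per group
df.groupby('Type 1').mean("Total", "HP").show()

In [None]:
# Group by one column and compute the average of every numeric column per group
df.groupby('Type 1').avg().show()

In [None]:
# Group by one column and compute the maximum of every numeric column per group
df.groupby('Type 1').max().show()

In [None]:
# Group by one column and return the maximum of the selected column
df.groupby('Type 1').max("HP").show()

In [None]:
# Group by one column and compute the minimum of every numeric column per group
df.groupby('Type 1').min().show()

In [None]:
# Group by one column and compute the sum of every numeric column per group
df.groupby('Type 1').sum().show()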

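##### Step 3: Learn how to use agg for grouped aggregation

With `agg()`, several aggregations can be computed at once; that is, different aggregate functions can be applied to different columns.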



In [None]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum/max/min

df.groupby('Type 1','Type 2').agg(F.count('HP').alias('count'),
                        F.max('HP').alias('max_hp'),
                        F.min('Attack').alias('min_attack')).show()

On a PySpark DataFrame, `where()` or `filter()` can be used to filter the rows of the aggregated result.

In [None]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum/max/min

df.groupby('Type 1','Type 2').agg(F.count('HP').alias('count'),
                        F.max('HP').alias('max_hp'),
                        F.min('Attack').alias('min_attack')) \
                        .where(F.col('min_attack') >= 40).show()

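##### Step 4: Learn how to use transform

`transform()` returns a new DataFrame. It is mainly used to apply a custom function to a DataFrame, which makes such functions easy to chain.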
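
Conceptually, `df.transform(func)` just calls `func(df)` and returns the result, which is what allows several custom steps to be chained. The sketch below shows this with `MiniFrame`, a hypothetical stand-in class written for this example (it is not part of PySpark and only tracks column names).

```python
class MiniFrame:
    """Hypothetical stand-in for a DataFrame: just a list of column names."""

    def __init__(self, columns):
        self.columns = list(columns)

    def select(self, *cols):
        return MiniFrame(cols)

    def transform(self, func):
        # The whole idea: hand the frame to func and return its result.
        return func(self)


def sort_columns_asc(frame):
    return frame.select(*sorted(frame.columns))


def drop_first_column(frame):
    return frame.select(*frame.columns[1:])


frame = MiniFrame(["Name", "HP", "Attack"])
result = frame.transform(sort_columns_asc).transform(drop_first_column)
print(result.columns)  # ['HP', 'Name']
```

Written without `transform`, the same pipeline would read inside-out as `drop_first_column(sort_columns_asc(frame))`; the method form keeps the steps in execution order.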

In [None]:
df.show()

In [None]:
# def cast_all_to_float(input_df):
#     return input_df.select([(col(col_name) + 10) for col_name in input_df.columns])
def sort_columns_asc(input_df):
    return input_df.select(*sorted(input_df.columns))
df.transform(sort_columns_asc).show()



##### Step 5: Use groupby, agg and transform to compute the mean HP within each Type 1 group

In [None]:
# Group by Type 1 and compute the mean HP
df.groupby('Type 1').mean('HP').show()

In [None]:
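# Group by Type 1 and compute the mean HP with agg
from pyspark.sql.functions import mean

# alias() must be applied to the aggregated column, not to the resulting DataFrame
df.groupby('Type 1').agg(mean('HP').alias('Mean of HP')).show()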

In [3]:
type1_df = df.select('Type 1', 'HP')
# type1_df.show()
rows = type1_df.collect()
# {Type 1: value:{HP:, count:}}
result_dict = {}
# d = [{'name': 'Alice', 'age': 1}]
# output = spark.createDataFrame(d).collect()
for row in rows:
    if row['Type 1'] not in result_dict:
        # key
        result_dict[row['Type 1']] = {}

        result_dict[row['Type 1']]['HP'] = row['HP']
        result_dict[row['Type 1']]['count'] = 1
    else:
        result_dict[row['Type 1']]['HP'] += row['HP']
        result_dict[row['Type 1']]['count'] += 1

print(result_dict)

{'Grass': {'HP': 4709, 'count': 70}, 'Fire': {'HP': 3635, 'count': 52}, 'Water': {'HP': 8071, 'count': 112}, 'Bug': {'HP': 3925, 'count': 69}, 'Normal': {'HP': 7573, 'count': 98}, 'Poison': {'HP': 1883, 'count': 28}, 'Electric': {'HP': 2631, 'count': 44}, 'Ground': {'HP': 2361, 'count': 32}, 'Fairy': {'HP': 1260, 'count': 17}, 'Fighting': {'HP': 1886, 'count': 27}, 'Psychic': {'HP': 4026, 'count': 57}, 'Rock': {'HP': 2876, 'count': 44}, 'Ghost': {'HP': 2062, 'count': 32}, 'Ice': {'HP': 1728, 'count': 24}, 'Dragon': {'HP': 2666, 'count': 32}, 'Dark': {'HP': 2071, 'count': 31}, 'Steel': {'HP': 1761, 'count': 27}, 'Flying': {'HP': 283, 'count': 4}}


In [4]:
result_df = []
for k, v in result_dict.items():
    temp = {'Type 1':k,'mean':v['HP'] / v['count']}
    # temp[k] = v['HP'] / v['count']
    result_df.append(temp)
result_df


[{'Type 1': 'Grass', 'mean': 67.27142857142857},
 {'Type 1': 'Fire', 'mean': 69.90384615384616},
 {'Type 1': 'Water', 'mean': 72.0625},
 {'Type 1': 'Bug', 'mean': 56.88405797101449},
 {'Type 1': 'Normal', 'mean': 77.27551020408163},
 {'Type 1': 'Poison', 'mean': 67.25},
 {'Type 1': 'Electric', 'mean': 59.79545454545455},
 {'Type 1': 'Ground', 'mean': 73.78125},
 {'Type 1': 'Fairy', 'mean': 74.11764705882354},
 {'Type 1': 'Fighting', 'mean': 69.85185185185185},
 {'Type 1': 'Psychic', 'mean': 70.63157894736842},
 {'Type 1': 'Rock', 'mean': 65.36363636363636},
 {'Type 1': 'Ghost', 'mean': 64.4375},
 {'Type 1': 'Ice', 'mean': 72.0},
 {'Type 1': 'Dragon', 'mean': 83.3125},
 {'Type 1': 'Dark', 'mean': 66.80645161290323},
 {'Type 1': 'Steel', 'mean': 65.22222222222223},
 {'Type 1': 'Flying', 'mean': 70.75}]

In [7]:
output = spark.createDataFrame(result_df)
output.show()

+--------+-----------------+
|  Type 1|             mean|
+--------+-----------------+
|   Grass|67.27142857142857|
|    Fire|69.90384615384616|
|   Water|          72.0625|
|     Bug|56.88405797101449|
|  Normal|77.27551020408163|
|  Poison|            67.25|
|Electric|59.79545454545455|
|  Ground|         73.78125|
|   Fairy|74.11764705882354|
|Fighting|69.85185185185185|
| Psychic|70.63157894736842|
|    Rock|65.36363636363636|
|   Ghost|          64.4375|
|     Ice|             72.0|
|  Dragon|          83.3125|
|    Dark|66.80645161290323|
|   Steel|65.22222222222223|
|  Flying|            70.75|
+--------+-----------------+


In [8]:
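# Group by Type 1 and compute the mean HP
# using a hand-written grouping and averaging function
def com_mean(input_df):
    # Use the DataFrame passed in by transform(), not the global df
    type1_df = input_df.select('Type 1', 'HP')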
    # type1_df.show()
    rows = type1_df.collect()
    # {Type 1: value:{HP:, count:}}
    result_dict = {}
    # d = [{'name': 'Alice', 'age': 1}]
    # output = spark.createDataFrame(d).collect()
    for row in rows:
        if row['Type 1'] not in result_dict:
            # key
            result_dict[row['Type 1']] = {}

            result_dict[row['Type 1']]['HP'] = row['HP']
            result_dict[row['Type 1']]['count'] = 1
        else:
            result_dict[row['Type 1']]['HP'] += row['HP']
            result_dict[row['Type 1']]['count'] += 1
    result_df = []
    for k, v in result_dict.items():
        temp = {'Type 1':k,'mean':v['HP'] / v['count']}
        # temp[k] = v['HP'] / v['count']
        result_df.append(temp)
    output = spark.createDataFrame(result_df)
    return output

df.transform(com_mean).show()
    

+--------+-----------------+
|  Type 1|             mean|
+--------+-----------------+
|   Grass|67.27142857142857|
|    Fire|69.90384615384616|
|   Water|          72.0625|
|     Bug|56.88405797101449|
|  Normal|77.27551020408163|
|  Poison|            67.25|
|Electric|59.79545454545455|
|  Ground|         73.78125|
|   Fairy|74.11764705882354|
|Fighting|69.85185185185185|
| Psychic|70.63157894736842|
|    Rock|65.36363636363636|
|   Ghost|          64.4375|
|     Ice|             72.0|
|  Dragon|          83.3125|
|    Dark|66.80645161290323|
|   Steel|65.22222222222223|
|  Flying|            70.75|
+--------+-----------------+



---------------------------------------------------

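#### Task 4: Spark SQL Basic Syntax

* Step 1: Use Spark SQL to redo the data filtering from Task 1
    * Select the samples whose class is 1
    * Select the samples with language > 90 or math > 90
* Step 2: Use Spark SQL to redo the statistics from Task 2 (counting the columns is optional)
    * Analyze the type and the number of distinct values of each column
    * Check whether each column contains missing values
* Step 3: Use Spark SQL to redo the grouped statistics from Task 3
    * Compute the mean HP within each Type 1 group

##### Step 1: Use Spark SQL to redo the data filtering from Task 1
* Select the samples whose class is 1
* Select the samples with language > 90 or math > 90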

In [9]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [20]:
# Task 1 data
task_1 = spark.createDataFrame([('001','1',100,87,67,83,98), ('002','2',87,81,90,83,83), ('003','3',86,91,83,89,63),
                            ('004','2',65,87,94,73,88), ('005','1',76,62,89,81,98), ('006','3',84,82,85,73,99),
                            ('007','3',56,76,63,72,87), ('008','1',55,62,46,78,71), ('009','2',63,72,87,98,64)],
                             ['number','class','language','math','english','physic','chemical'])

In [23]:
# Register the DataFrame as a SQL temporary view
task_1.createOrReplaceTempView("task_1")

In [24]:
sqlDF = spark.sql("SELECT * FROM task_1")
sqlDF.show()

+------+-----+--------+----+-------+------+--------+
|number|class|language|math|english|physic|chemical|
+------+-----+--------+----+-------+------+--------+
|   001|    1|     100|  87|     67|    83|      98|
|   002|    2|      87|  81|     90|    83|      83|
|   003|    3|      86|  91|     83|    89|      63|
|   004|    2|      65|  87|     94|    73|      88|
|   005|    1|      76|  62|     89|    81|      98|
|   006|    3|      84|  82|     85|    73|      99|
|   007|    3|      56|  76|     63|    72|      87|
|   008|    1|      55|  62|     46|    78|      71|
|   009|    2|      63|  72|     87|    98|      64|
+------+-----+--------+----+-------+------+--------+



In [27]:
# Select the samples whose class is 1
sql1 = spark.sql("SELECT * FROM task_1 WHERE class=1")
sql1.show()

+------+-----+--------+----+-------+------+--------+
|number|class|language|math|english|physic|chemical|
+------+-----+--------+----+-------+------+--------+
|   001|    1|     100|  87|     67|    83|      98|
|   005|    1|      76|  62|     89|    81|      98|
|   008|    1|      55|  62|     46|    78|      71|
+------+-----+--------+----+-------+------+--------+



In [28]:
# Select the samples with language > 90 or math > 90
sql2 = spark.sql("SELECT * FROM task_1 WHERE language>90 OR math>90")
sql2.show()

+------+-----+--------+----+-------+------+--------+
|number|class|language|math|english|physic|chemical|
+------+-----+--------+----+-------+------+--------+
|   001|    1|     100|  87|     67|    83|      98|
|   003|    3|      86|  91|     83|    89|      63|
+------+-----+--------+----+-------+------+--------+



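##### Step 2: Use Spark SQL to redo the statistics from Task 2 (counting the columns is optional)
* Analyze the type and the number of distinct values of each column
* Check whether each column contains missing values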

In [12]:
df = spark.read.csv('Pokemon.csv', header=True, inferSchema= True)

In [13]:
df.show()

+--------------------+------+------+-----+---+------+-------+-------+-------+-----+----------+---------+
|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp. Atk|Sp. Def|Speed|Generation|Legendary|
+--------------------+------+------+-----+---+------+-------+-------+-------+-----+----------+---------+
|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|     65|     65|   45|         1|    false|
|             Ivysaur| Grass|Poison|  405| 60|    62|     63|     80|     80|   60|         1|    false|
|            Venusaur| Grass|Poison|  525| 80|    82|     83|    100|    100|   80|         1|    false|
|VenusaurMega Venu...| Grass|Poison|  625| 80|   100|    123|    122|    120|   80|         1|    false|
|          Charmander|  Fire|  null|  309| 39|    52|     43|     60|     50|   65|         1|    false|
|          Charmeleon|  Fire|  null|  405| 58|    64|     58|     80|     65|   80|         1|    false|
|           Charizard|  Fire|Flying|  534| 78|    84|  

In [57]:
# Remove spaces and dots from the column names so they do not cause errors in SQL
df = df.withColumnRenamed('Type 1', 'Type1')
df = df.withColumnRenamed('Type 2', 'Type2')
df = df.withColumnRenamed('Sp. Atk', 'SpAtk')
df = df.withColumnRenamed('Sp. Def', 'SpDef')

In [58]:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("Pokemon")


In [59]:
sql3 = spark.sql("SELECT * FROM Pokemon")
sql3.show()

+--------------------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+
|                Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|
+--------------------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+
|           Bulbasaur|Grass|Poison|  318| 45|    49|     49|   65|   65|   45|         1|    false|
|             Ivysaur|Grass|Poison|  405| 60|    62|     63|   80|   80|   60|         1|    false|
|            Venusaur|Grass|Poison|  525| 80|    82|     83|  100|  100|   80|         1|    false|
|VenusaurMega Venu...|Grass|Poison|  625| 80|   100|    123|  122|  120|   80|         1|    false|
|          Charmander| Fire|  null|  309| 39|    52|     43|   60|   50|   65|         1|    false|
|          Charmeleon| Fire|  null|  405| 58|    64|     58|   80|   65|   80|         1|    false|
|           Charizard| Fire|Flying|  534| 78|    84|     78|  109|   85|  100|         1|    false|


In [60]:
# Inspect the type of each column
sql4 = spark.sql("DESCRIBE TABLE Pokemon")
sql4.show()

+----------+---------+-------+
|  col_name|data_type|comment|
+----------+---------+-------+
|      Name|   string|   null|
|     Type1|   string|   null|
|     Type2|   string|   null|
|     Total|      int|   null|
|        HP|      int|   null|
|    Attack|      int|   null|
|   Defense|      int|   null|
|     SpAtk|      int|   null|
|     SpDef|      int|   null|
|     Speed|      int|   null|
|Generation|      int|   null|
| Legendary|  boolean|   null|
+----------+---------+-------+



In [61]:
df.columns

['Name',
 'Type1',
 'Type2',
 'Total',
 'HP',
 'Attack',
 'Defense',
 'SpAtk',
 'SpDef',
 'Speed',
 'Generation',
 'Legendary']

In [69]:
# Count the distinct values in each column
sql5 = spark.sql("SELECT COUNT(DISTINCT Name) as Name, COUNT(DISTINCT Type1) as Type1,  COUNT(DISTINCT Type2) as Type2, COUNT(DISTINCT Total) as Total, \
                 COUNT(DISTINCT HP) as HP, COUNT(DISTINCT Attack) as Attack, COUNT(DISTINCT Defense) as Defense, COUNT(DISTINCT SpAtk) as SpAtk, COUNT(DISTINCT SpDef) as SpDef,  \
                 COUNT(DISTINCT Speed) as Speed, COUNT(DISTINCT Generation) as Generation,  COUNT(DISTINCT Legendary) as Legendary FROM Pokemon")
sql5.show()

+----+-----+-----+-----+---+------+-------+-----+-----+-----+----------+---------+
|Name|Type1|Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|
+----+-----+-----+-----+---+------+-------+-----+-----+-----+----------+---------+
| 799|   18|   18|  200| 94|   111|    103|  105|   92|  108|         6|        2|
+----+-----+-----+-----+---+------+-------+-----+-----+-----+----------+---------+



In [97]:
# Check for missing values (shown here for Type2)
sql6 = spark.sql("select count(*) from Pokemon where Type2 is null")
sql6.show()

+--------+
|count(1)|
+--------+
|     386|
+--------+



In [106]:
# The same check for Type2, written with isnull()
sql7 = spark.sql("select count(*) from Pokemon where isnull(Type2)='true'")
sql7.show()

+--------+
|count(1)|
+--------+
|     386|
+--------+
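
The queries above check one column at a time. Because `COUNT(col)` skips NULLs, `COUNT(*) - COUNT(col)` gives the number of missing values, so several columns can be checked in one statement. The sketch below demonstrates the query shape on an in-memory SQLite table with a few made-up rows so that it runs without Spark; the assumption is that the same standard-SQL string could be passed to `spark.sql` against the `Pokemon` view.

```python
import sqlite3

# In-memory table with a handful of rows; None is stored as NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pokemon (Name TEXT, Type1 TEXT, Type2 TEXT)")
conn.executemany(
    "INSERT INTO Pokemon VALUES (?, ?, ?)",
    [("Bulbasaur", "Grass", "Poison"),
     ("Charmander", "Fire", None),
     ("Charmeleon", "Fire", None),
     ("Charizard", "Fire", "Flying")],
)

# COUNT(col) skips NULLs, so COUNT(*) - COUNT(col) is the number of NULLs.
query = """
SELECT COUNT(*) - COUNT(Name)  AS Name,
       COUNT(*) - COUNT(Type1) AS Type1,
       COUNT(*) - COUNT(Type2) AS Type2
FROM Pokemon
"""
missing = conn.execute(query).fetchone()
print(missing)  # (0, 0, 2)
```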



##### Step 3: Use Spark SQL to redo the grouped statistics from Task 3
* Compute the mean HP within each Type 1 group

In [115]:
# Compute the mean HP within each Type 1 group
sql8 = spark.sql("select mean(HP) from Pokemon group by Type1")
sql8.show()

+-----------------+
|         mean(HP)|
+-----------------+
|          72.0625|
|            67.25|
|65.22222222222223|
|65.36363636363636|
|             72.0|
|          64.4375|
|74.11764705882354|
|70.63157894736842|
|          83.3125|
|            70.75|
|56.88405797101449|
|59.79545454545455|
|69.90384615384616|
|         73.78125|
|66.80645161290323|
|69.85185185185185|
|67.27142857142857|
|77.27551020408163|
+-----------------+
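
The result above lists the means without saying which group each one belongs to, because the query does not select the grouping column. Selecting `Type1` alongside the aggregate fixes that. The sketch below shows the query shape on an in-memory SQLite table built from the first rows of the data (SQLite spells the function `AVG`); the table contents and the alias `mean_hp` are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pokemon (Name TEXT, Type1 TEXT, HP INTEGER)")
conn.executemany(
    "INSERT INTO Pokemon VALUES (?, ?, ?)",
    [("Bulbasaur", "Grass", 45), ("Ivysaur", "Grass", 60),
     ("Venusaur", "Grass", 80), ("VenusaurMega Venusaur", "Grass", 80),
     ("Charmander", "Fire", 39), ("Charmeleon", "Fire", 58),
     ("Charizard", "Fire", 78)],
)

# Select the grouping column too, so every mean is labelled with its group.
query = """
SELECT Type1, AVG(HP) AS mean_hp
FROM Pokemon
GROUP BY Type1
ORDER BY Type1
"""
result = conn.execute(query).fetchall()
for type1, mean_hp in result:
    print(type1, mean_hp)
```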



---------------------------------------------------

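#### Task 5: Spark ML Feature Encoding

* Step 1: Study the feature-encoding module of Spark ML
    * https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#feature
    * https://spark.apache.org/docs/latest/ml-features.html
* Step 2: Read Pokemon.csv and understand what each field means
* Step 3: Apply OneHotEncoder to the categorical attributes
* Step 4: Apply MinMaxScaler to the numeric attributes
* Step 5: Apply PCA to the encoded attributes to reduce dimensionality (choose the number of dimensions yourself)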
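
Before using the library class, it can help to see what min-max scaling computes. The plain-Python sketch below rescales a list of values to a target range with the usual formula `(x - min) / (max - min) * (new_max - new_min) + new_min`; the function name `min_max_scale` is made up for this example. Spark's `MinMaxScaler` applies the same idea per feature to a vector column, so the numeric columns would first have to be assembled into a vector.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map everything to the midpoint of the target range.
        return [0.5 * (new_min + new_max)] * len(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


# HP of the first five rows of the Pokemon data.
hp = [45, 60, 80, 80, 39]
scaled = min_max_scale(hp)
print(scaled)
```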

In [23]:
from pyspark import SparkFiles

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('pyspark') \
    .getOrCreate()
spark.sparkContext.addFile('Pokemon.csv')

# On Windows, change file:// to file:///
df = spark.read.csv("file:///"+SparkFiles.get("Pokemon.csv"), header=True, inferSchema= True)
df = df.withColumnRenamed('Sp. Atk', 'Sp Atk')
df = df.withColumnRenamed('Sp. Def', 'Sp Def')

22/03/18 00:34:17 WARN SparkContext: The path Pokemon.csv has been added already. Overwriting of added paths is not supported in the current version.


In [4]:
# Step 2: Read Pokemon.csv and understand what each field means
df.show()

+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+
|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp Atk|Sp Def|Speed|Generation|Legendary|
+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+
|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|    65|    65|   45|         1|    false|
|             Ivysaur| Grass|Poison|  405| 60|    62|     63|    80|    80|   60|         1|    false|
|            Venusaur| Grass|Poison|  525| 80|    82|     83|   100|   100|   80|         1|    false|
|VenusaurMega Venu...| Grass|Poison|  625| 80|   100|    123|   122|   120|   80|         1|    false|
|          Charmander|  Fire|  null|  309| 39|    52|     43|    60|    50|   65|         1|    false|
|          Charmeleon|  Fire|  null|  405| 58|    64|     58|    80|    65|   80|         1|    false|
|           Charizard|  Fire|Flying|  534| 78|    84|     78|   109|    8

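The data describes the creatures in Pokémon. The fields are:
* `Name`: name, a string;
* `Type 1`: primary type, categorical;
* `Type 2`: secondary type, categorical (may be missing);
* `Total`: sum of the six base stats (HP, Attack, Defense, Sp Atk, Sp Def, Speed), numeric;
* `HP`: hit points, numeric;
* `Attack`: normal attack strength, numeric;
* `Defense`: normal defense strength, numeric;
* `Sp Atk`: special attack strength, numeric;
* `Sp Def`: special defense strength, numeric;
* `Speed`: speed, numeric;
* `Generation`: the generation the Pokémon belongs to; this dataset contains 6 distinct values, so the field can be treated as categorical;
* `Legendary`: whether the Pokémon is legendary, Boolean.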



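##### Step 3: Apply OneHotEncoder to the categorical attributes

> One-hot encoding maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values.
> For string-typed input data, it is common to encode the categorical features with `StringIndexer` first.
> `OneHotEncoder` can transform multiple columns, returning a one-hot-encoded output vector column for each input column.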
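
The two stages can be sketched in plain Python. The helpers below are hypothetical and written for this example: `string_index` orders labels by descending frequency (ties broken alphabetically), and `one_hot` drops the last category so that it maps to an all-zeros vector. These mirror what I understand to be the default behaviour of `StringIndexer` and `OneHotEncoder`, which is also why the quote above says "at most" a single one-value.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent label first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, size, drop_last=True):
    """Binary vector with a 1 at `index`; with drop_last the last category is all zeros."""
    length = size - 1 if drop_last else size
    return [1.0 if i == index else 0.0 for i in range(length)]


type1 = ["Grass", "Grass", "Grass", "Fire", "Fire", "Water"]
mapping = string_index(type1)
encoded = [one_hot(mapping[label], len(mapping)) for label in type1]

print(mapping)                            # {'Grass': 0, 'Fire': 1, 'Water': 2}
print(encoded[0], encoded[3], encoded[5])  # [1.0, 0.0] [0.0, 1.0] [0.0, 0.0]
```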

In [125]:
# One-hot encode Type 1, Type 2 and Generation
from pyspark.ml.feature import StringIndexer

# First encode the categorical string features with StringIndexer
inputs = ["Type 1", "Type 2"]
outputs = ["Type1Index", "Type2Index"]
indexer = StringIndexer(inputCols=inputs, outputCols=outputs)
indexed_df = indexer.fit(df).transform(df)
indexed_df.show()



Py4JJavaError: An error occurred while calling o649.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 325.0 failed 1 times, most recent failure: Lost task 0.0 in stage 325.0 (TID 276) (192.168.31.58 executor driver): org.apache.spark.SparkException: Failed to execute user defined function (StringIndexerModel$$Lambda$4096/0x0000000841584040: (string) => double)
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:136)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.SparkException: StringIndexer encountered NULL value. To handle or skip NULLS, try setting StringIndexer.handleInvalid.
	at org.apache.spark.ml.feature.StringIndexerModel.$anonfun$getIndexer$1(StringIndexer.scala:396)
	at org.apache.spark.ml.feature.StringIndexerModel.$anonfun$getIndexer$1$adapted(StringIndexer.scala:391)
	... 17 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:476)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:429)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
	at jdk.internal.reflect.GeneratedMethodAccessor95.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function (StringIndexerModel$$Lambda$4096/0x0000000841584040: (string) => double)
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:136)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	... 1 more
Caused by: org.apache.spark.SparkException: StringIndexer encountered NULL value. To handle or skip NULLS, try setting StringIndexer.handleInvalid.
	at org.apache.spark.ml.feature.StringIndexerModel.$anonfun$getIndexer$1(StringIndexer.scala:396)
	at org.apache.spark.ml.feature.StringIndexerModel.$anonfun$getIndexer$1$adapted(StringIndexer.scala:391)
	... 17 more


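**Running this directly fails because `Type 2` contains missing values. They need to be handled first; one option is to fill them with a new category, "unkown" (spelled this way in the code and output below).**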

In [126]:
df = df.fillna('unkown', subset = "Type 2")

In [127]:
df.show()

+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+
|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp Atk|Sp Def|Speed|Generation|Legendary|
+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+
|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|    65|    65|   45|         1|    false|
|             Ivysaur| Grass|Poison|  405| 60|    62|     63|    80|    80|   60|         1|    false|
|            Venusaur| Grass|Poison|  525| 80|    82|     83|   100|   100|   80|         1|    false|
|VenusaurMega Venu...| Grass|Poison|  625| 80|   100|    123|   122|   120|   80|         1|    false|
|          Charmander|  Fire|unkown|  309| 39|    52|     43|    60|    50|   65|         1|    false|
|          Charmeleon|  Fire|unkown|  405| 58|    64|     58|    80|    65|   80|         1|    false|
|           Charizard|  Fire|Flying|  534| 78|    84|     78|   109|    8

In [131]:
# Run again after filling the nulls; this time it succeeds
from pyspark.ml.feature import StringIndexer

# First encode the categorical features with StringIndexer
inputs = ["Type 1", "Type 2"]
outputs = ["Type1Index", "Type2Index"]
indexer = StringIndexer(inputCols=inputs, outputCols=outputs)
indexed_df = indexer.fit(df).transform(df)
indexed_df.show()



+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+----------+----------+
|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp Atk|Sp Def|Speed|Generation|Legendary|Type1Index|Type2Index|
+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+----------+----------+
|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|    65|    65|   45|         1|    false|       2.0|       3.0|
|             Ivysaur| Grass|Poison|  405| 60|    62|     63|    80|    80|   60|         1|    false|       2.0|       3.0|
|            Venusaur| Grass|Poison|  525| 80|    82|     83|   100|   100|   80|         1|    false|       2.0|       3.0|
|VenusaurMega Venu...| Grass|Poison|  625| 80|   100|    123|   122|   120|   80|         1|    false|       2.0|       3.0|
|          Charmander|  Fire|unkown|  309| 39|    52|     43|    60|    50|   65|         1|    false|       5.0|       0.0|
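
In the output above, `unkown` receives `Type2Index` 0.0. With its default `stringOrderType` of `frequencyDesc`, `StringIndexer` assigns index 0 to the most frequent label, which suggests that `unkown` is the most common `Type 2` value. A plain-Python sketch of that ordering follows. The helper name `frequency_labels` and the toy column are made up for illustration, and breaking ties alphabetically is an assumption about recent Spark versions.

```python
from collections import Counter


def frequency_labels(values):
    """Order labels by descending count; ties are broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ordered]


# Toy column, not the real Pokemon data.
toy_type2 = ["unkown", "Poison", "unkown", "Flying", "Poison", "unkown"]

labels = frequency_labels(toy_type2)
index_of = {label: float(i) for i, label in enumerate(labels)}

print(labels)              # ['unkown', 'Poison', 'Flying']
print(index_of["unkown"])  # 0.0
```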


In [132]:
# One-hot encode the newly generated index columns
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCols=["Type1Index", "Type2Index","Generation"],
                        outputCols=["TypeVec1", "TypeVec2","GenerationVec"])
model = encoder.fit(indexed_df)
encoded = model.transform(indexed_df)
encoded.show()

+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+----------+----------+--------------+---------------+-------------+
|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp Atk|Sp Def|Speed|Generation|Legendary|Type1Index|Type2Index|      TypeVec1|       TypeVec2|GenerationVec|
+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+----------+----------+--------------+---------------+-------------+
|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|    65|    65|   45|         1|    false|       2.0|       3.0|(17,[2],[1.0])| (18,[3],[1.0])|(6,[1],[1.0])|
|             Ivysaur| Grass|Poison|  405| 60|    62|     63|    80|    80|   60|         1|    false|       2.0|       3.0|(17,[2],[1.0])| (18,[3],[1.0])|(6,[1],[1.0])|
|            Venusaur| Grass|Poison|  525| 80|    82|     83|   100|   100|   80|         1|    false|       2.0|       3.0|(17,[2],[1.0])| (18,[3],[1
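
Each encoded cell is printed as a sparse vector `(size, [indices], [values])`, so `(17,[2],[1.0])` is a length-17 vector with a single 1.0 at position 2. The length is one less than the number of categories because `OneHotEncoder` drops the last category by default (`dropLast=True`) and encodes it as all zeros. A minimal plain-Python sketch, where `one_hot` is a hypothetical helper rather than a Spark function, and the count of 18 categories is an assumption implied by the vector size of 17:

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list; the last category becomes all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


grass = one_hot(2, 18)               # index 2, as in the Type1Index column
print(len(grass), grass.index(1.0))  # 17 2
print(sum(one_hot(17, 18)))          # 0.0 (the dropped last category)
```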

##### Step 4: Apply MinMaxScaler to the numeric attribute fields

* `Total`
* `HP`
* `Attack`
* `Defense`
* `Sp Atk`
* `Sp Def`
* `Speed`

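> `MinMaxScaler` transforms a dataset of vector rows, rescaling each feature to a specific range (often $[0, 1]$). It takes two parameters: `min` (default `0.0`), the lower bound after transformation, shared by all features; and `max` (default `1.0`), the upper bound after transformation, shared by all features.

> `MinMaxScaler` computes summary statistics on a dataset and produces a `MinMaxScalerModel`. The model can then transform each feature individually so that it falls within the given range.

The rescaled value for a feature E is computed as:

$$Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} \times (max - min) + min$$

When $E_{max} = E_{min}$, the result is $Rescaled(e_i) = 0.5 \times (max + min)$.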
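
The formula and its constant-feature special case can be checked with a few lines of plain Python. This is a sketch of the arithmetic only, not Spark's implementation. The function name `min_max_rescale` is hypothetical, and the `Total` minimum of 180 and maximum of 780 are assumptions chosen to be consistent with the scaled values printed in this section.

```python
def min_max_rescale(value, e_min, e_max, new_min=0.0, new_max=1.0):
    """Rescale one value of a feature whose observed range is [e_min, e_max]."""
    if e_max == e_min:
        # Constant feature: every value maps to the midpoint of the target range.
        return 0.5 * (new_max + new_min)
    return (value - e_min) / (e_max - e_min) * (new_max - new_min) + new_min


print(min_max_rescale(405, 180, 780))  # 0.375
print(min_max_rescale(5, 5, 5))        # 0.5
```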

In [9]:
# MinMaxScaler expects a Vector column, so assemble the numeric columns into a vector
from pyspark.ml.feature import VectorAssembler
num_columns = ['Total','HP','Attack','Defense','Sp Atk','Sp Def','Speed']
num_columns_new = ['TotalNew','HPNew','AttackNew','DefenseNew','Sp AtkNew','Sp DefNew','SpeedNew']

vecAssembler = VectorAssembler(inputCols=num_columns, outputCol='features')
output = vecAssembler.transform(df)
output.show()

+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+--------------------+
|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp Atk|Sp Def|Speed|Generation|Legendary|            features|
+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+--------------------+
|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|    65|    65|   45|         1|    false|[318.0,45.0,49.0,...|
|             Ivysaur| Grass|Poison|  405| 60|    62|     63|    80|    80|   60|         1|    false|[405.0,60.0,62.0,...|
|            Venusaur| Grass|Poison|  525| 80|    82|     83|   100|   100|   80|         1|    false|[525.0,80.0,82.0,...|
|VenusaurMega Venu...| Grass|Poison|  625| 80|   100|    123|   122|   120|   80|         1|    false|[625.0,80.0,100.0...|
|          Charmander|  Fire|  null|  309| 39|    52|     43|    60|    50|   65|         1|    false|[309.0,39.0,52.0,...|
|       
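
Conceptually, `VectorAssembler` concatenates the chosen columns of each row into one numeric vector, which is what the `features` column shows. A plain-Python sketch of that per-row step, using the Bulbasaur values from the table above (`assemble` is a hypothetical helper, not part of PySpark):

```python
def assemble(row, input_cols):
    """Collect the selected columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]


num_columns = ['Total', 'HP', 'Attack', 'Defense', 'Sp Atk', 'Sp Def', 'Speed']
bulbasaur = {'Name': 'Bulbasaur', 'Total': 318, 'HP': 45, 'Attack': 49,
             'Defense': 49, 'Sp Atk': 65, 'Sp Def': 65, 'Speed': 45}

print(assemble(bulbasaur, num_columns))
# [318.0, 45.0, 49.0, 49.0, 65.0, 65.0, 45.0]
```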

In [18]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# First convert the Total column into vector form
vecAssembler = VectorAssembler(inputCols=["Total"], outputCol="c2_new_")
# Min-max transform
mmScaler = MinMaxScaler(inputCol='c2_new_', outputCol='mm_c2_')
pipeline = Pipeline(stages=[vecAssembler, mmScaler])
pipeline_fit = pipeline.fit(df)
df = pipeline_fit.transform(df)
df.show()

+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+-------+--------------------+-------+--------------------+
|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp Atk|Sp Def|Speed|Generation|Legendary| c2_new|               mm_c2|c2_new_|              mm_c2_|
+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+-------+--------------------+-------+--------------------+
|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|    65|    65|   45|         1|    false|[318.0]|              [0.23]|[318.0]|              [0.23]|
|             Ivysaur| Grass|Poison|  405| 60|    62|     63|    80|    80|   60|         1|    false|[405.0]|             [0.375]|[405.0]|             [0.375]|
|            Venusaur| Grass|Poison|  525| 80|    82|     83|   100|   100|   80|         1|    false|[525.0]|[0.5750000000000001]|[525.0]|[0.5750000000000001]|
|VenusaurMega Venu...| Grass|Poiso

In [26]:
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

copy_df = df
for numC in num_columns:
    # MinMaxScaler expects a Vector column, so wrap each column in a one-element vector
    vecAssembler = VectorAssembler(inputCols=[numC], outputCol=numC+'features')
    # Apply the min-max transform
    scaler = MinMaxScaler(inputCol=numC+'features', outputCol="scaled" + numC)
    pipeline = Pipeline(stages=[vecAssembler, scaler])
    pipeline_fit = pipeline.fit(copy_df)
    copy_df = pipeline_fit.transform(copy_df)
    # Drop the intermediate vector column
    copy_df = copy_df.drop(numC+'features')
copy_df.show()


+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp Atk|Sp Def|Speed|Generation|Legendary|         scaledTotal|            scaledHP|        scaledAttack|       scaledDefense|        scaledSp Atk|        scaledSp Def|         scaledSpeed|
+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|    65|    65|   45|         1|    false|              [0.23]|[0.1732283464566929]|[0.23783783783783...|[0.19555555555555...|[0.29891304347826...|[0.2142857142857143]|[0.22857142857142...|


##### Step 5: Apply PCA to the encoded attributes for dimensionality reduction (choose the number of dimensions yourself)