# Dataset Overview
The dataset from the official website contains missing values, and the Kaggle dataset contains duplicate rows.

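**The original dataset from the official website** has 303 rows and 14 columns, with row indices 0 to 302. In it:
- the `ca` column has 4 missing values, at row indices 166, 192, 287, and 302
- the `thal` column has 2 missing values, at row indices 87 and 266

After the rows with missing values are removed, the dataset has 297 rows and 14 columns.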

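- The `thal` values are recoded as 0, 1, and 2.
- The `target` values are recoded as present (1) and absent (0), which makes binary classification straightforward.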
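
The cleaning steps described above for the original dataset can be sketched on a tiny in-memory sample. This is only an illustration under stated assumptions: the four-row CSV text and its column subset are made up, and the `thal` mapping (3 → 0, 6 → 1, 7 → 2) is a guess, since the text does not say which original code maps to which new value.

```python
import io
import pandas as pd

# Hypothetical sample imitating the original file, where '?' marks a missing value.
raw = io.StringIO(
    "age,ca,thal,target\n"
    "63,0,6,0\n"
    "67,3,3,2\n"
    "53,?,7,1\n"
    "38,1,?,4\n"
)

# Parse '?' as NaN while reading, then drop the incomplete rows.
sample = pd.read_csv(raw, na_values="?").dropna().reset_index(drop=True)

# Columns that contained NaN were read as float; restore integer codes.
sample = sample.astype({"ca": int, "thal": int})

# Assumed recoding of thal to 0, 1, 2 (the exact mapping is not given in the text).
sample["thal"] = sample["thal"].map({3: 0, 6: 1, 7: 2})

# Collapse the target into a binary label: 0 = absent, 1 = present.
sample["target"] = (sample["target"] > 0).astype(int)

print(sample)
```

Passing `na_values="?"` to `pd.read_csv` means the placeholder never reaches the DataFrame as a string, so the affected columns stay numeric.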

**The Kaggle dataset** has 1025 rows and 14 columns. Of these, 723 rows are duplicates, leaving 302 rows after de-duplication.

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("heart1025.csv")
#df = pd.read_csv("heart303.csv")

In [3]:
df.shape

(1025, 14)

In [4]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [5]:
df.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


# Handling Missing and Duplicate Values

To keep the analysis accurate and model building consistent, the data must be checked first.

References
- [数据预处理—数据清洗（3）—重复值处理](https://blog.csdn.net/weixin_42831571/article/details/103430346)
- [数据预处理——缺失值处理(细讲具体代码实现](https://blog.csdn.net/H_2am/article/details/107814853)
- [数据预处理之重复值](https://blog.csdn.net/liuyanlin610/article/details/122730558)
- [pandas数据清洗之处理缺失、重复、异常数据](https://xiejava.blog.csdn.net/article/details/122767225)
- [数据清洗之数据预处理](https://www.cnblogs.com/xingnie/p/12264505.html)

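## Handling Missing Values (Original Dataset)
Check the data for NaN values. If any exist, they can be dropped or filled with the mean, provided this does not distort the overall analysis.

There are usually two ways to handle missing values: drop the samples that contain them, or fill them in (for example with the mean, median, or mode, or with a specified constant). There is no settled answer as to which works better. Results vary from dataset to dataset, so in practice it is worth trying both and keeping whichever performs best.

Note, however, that when the training set is small and missing values are relatively frequent, dropping the affected samples is not recommended: it shrinks the training set further and leaves the model less to learn from. In addition, missing values in the test set cannot be handled by dropping rows, because every test sample must be predicted.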
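
The two strategies above can be compared side by side. The following is a minimal sketch on a made-up toy DataFrame (the values and the choice of median for `chol` and mode for `thal` are assumptions for illustration, not the notebook's actual procedure).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'chol' is numeric, 'thal' is a categorical code.
toy = pd.DataFrame({
    "chol": [212.0, np.nan, 174.0, 203.0],
    "thal": [3.0, 2.0, np.nan, 3.0],
})

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2: fill instead, using the median for the numeric column
# and the mode (most frequent value) for the categorical one.
filled = toy.fillna({
    "chol": toy["chol"].median(),
    "thal": toy["thal"].mode()[0],
})

print(len(dropped), "rows remain after dropping")
print(filled)
```

In a train/test setting, the fill values would normally be computed on the training set only and then reused for the test set.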

In [7]:
# Count missing values per column
# In the original dataset, missing values are marked with '?'
df.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [8]:
# Replace '?' with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
df = df.replace('?', np.nan)

In [9]:
# Count missing values again
df.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

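## Handling Duplicates (Kaggle Dataset)
Duplicate data is a recurring problem in real-world data collection, processing, and analysis, and it can materially affect the results of data analysis or data mining. For example, in logistic regression duplicate records distort the model's goodness of fit, and in predictive analysis they reduce accuracy. Handling duplicates properly is therefore important.

De-duplication is the main way to handle duplicates, but it should be applied with caution in the following cases:
1. The data was deliberately resampled with repetition to correct class imbalance.
2. In a classification model, a class has too little training data, and samples were simply copied to increase its count.
3. Transactional data, especially in money-related business scenarios, where duplicates such as repeated orders or repeated outbound-shipment requests appear.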
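
Before removing duplicates, it can help to inspect them first. The sketch below uses a made-up toy DataFrame (the values are hypothetical) to show the `keep` parameter of `duplicated()` and a de-duplication that also renumbers the index.

```python
import pandas as pd

# Hypothetical toy frame in which the second and fourth rows repeat the first.
toy = pd.DataFrame({
    "age": [52, 52, 70, 52],
    "chol": [212, 212, 174, 212],
})

# By default, duplicated() flags every repeat after the first occurrence.
n_dup = int(toy.duplicated().sum())

# keep=False flags all members of a duplicated group, which is useful for
# reviewing the rows before deciding whether removal is appropriate.
groups = toy[toy.duplicated(keep=False)]

# Return a de-duplicated copy and renumber the index from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dup, len(groups), len(deduped))
```

Without `reset_index(drop=True)`, the surviving rows keep their original index labels, so the index is no longer a contiguous range.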


In [10]:
# Flag duplicate rows
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1020     True
1021     True
1022     True
1023     True
1024     True
Length: 1025, dtype: bool

In [11]:
# Count duplicate rows
df.duplicated().sum()

723

In [12]:
# Drop duplicate rows
# inplace=True modifies the DataFrame in place
df.drop_duplicates(inplace=True)

In [13]:
df.shape

(302, 14)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 302 entries, 0 to 878
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       302 non-null    int64  
 1   sex       302 non-null    int64  
 2   cp        302 non-null    int64  
 3   trestbps  302 non-null    int64  
 4   chol      302 non-null    int64  
 5   fbs       302 non-null    int64  
 6   restecg   302 non-null    int64  
 7   thalach   302 non-null    int64  
 8   exang     302 non-null    int64  
 9   oldpeak   302 non-null    float64
 10  slope     302 non-null    int64  
 11  ca        302 non-null    int64  
 12  thal      302 non-null    int64  
 13  target    302 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 35.4 KB


# Converting Nominal Features from Integer Codes to Descriptive Strings

In [15]:
# Map integer codes to descriptive labels with Series.map.
# This avoids chained assignment (df[col][mask] = ...), which triggers
# SettingWithCopyWarning and does not update df when copy-on-write is enabled.
# Note: a code missing from its mapping would become NaN.
label_maps = {
    'sex': {0: 'female', 1: 'male'},
    'cp': {0: 'typical angina', 1: 'atypical angina',
           2: 'non-anginal pain', 3: 'asymptomatic'},
    'fbs': {0: 'lower than 120mg/ml', 1: 'greater than 120mg ml'},
    'restecg': {0: 'normal', 1: 'ST-T wave abnormality',
                2: 'left ventricular hyper trophy'},
    'exang': {0: 'no', 1: 'yes'},
    'slope': {0: 'upsloping', 1: 'flat', 2: 'downsloping'},
    'thal': {0: 'unknown', 1: 'normal',
             2: 'fixed defect', 3: 'reversable defect'},
}
for col, mapping in label_maps.items():
    df[col] = df[col].map(mapping)

In [16]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,male,typical angina,125,212,lower than 120mg/ml,ST-T wave abnormality,168,no,1.0,downsloping,2,reversable defect,0
1,53,male,typical angina,140,203,greater than 120mg ml,normal,155,yes,3.1,upsloping,0,reversable defect,0
2,70,male,typical angina,145,174,lower than 120mg/ml,ST-T wave abnormality,125,yes,2.6,upsloping,0,reversable defect,0
3,61,male,typical angina,148,203,lower than 120mg/ml,ST-T wave abnormality,161,no,0.0,downsloping,1,reversable defect,0
4,62,female,typical angina,138,294,greater than 120mg ml,ST-T wave abnormality,106,no,1.9,flat,3,fixed defect,0


In [17]:
df.dtypes

age           int64
sex          object
cp           object
trestbps      int64
chol          int64
fbs          object
restecg      object
thalach       int64
exang        object
oldpeak     float64
slope        object
ca            int64
thal         object
target        int64
dtype: object

# Converting Discrete Nominal and Ordinal Features to One-Hot Encoding

In [18]:
# Expand each nominal (string-valued) column into one indicator column per category
df = pd.get_dummies(df)
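
Called with no extra arguments, `pd.get_dummies` encodes every object-typed column and, in recent pandas versions, produces boolean indicators. The sketch below, on a made-up toy DataFrame, shows a few optional parameters (`columns`, `dtype`, `drop_first`) that may be useful; it is an illustration, not a change to the notebook's pipeline.

```python
import pandas as pd

# Hypothetical toy frame with two string-valued features.
toy = pd.DataFrame({
    "age": [52, 62, 70],
    "sex": ["male", "female", "male"],
    "slope": ["flat", "upsloping", "flat"],
})

# Encode only the listed columns and emit 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["sex", "slope"], dtype=int)

# drop_first=True drops the first level of each feature, which removes the
# perfectly collinear column that full one-hot encoding introduces.
compact = pd.get_dummies(toy, columns=["sex", "slope"], dtype=int, drop_first=True)

print(encoded.columns.tolist())
print(compact.columns.tolist())
```

Dropping one level per feature mainly matters for linear models with an intercept; tree-based models are generally unaffected by the redundant column.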

In [19]:
df.head()

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,target,sex_female,sex_male,cp_asymptomatic,...,restecg_normal,exang_no,exang_yes,slope_downsloping,slope_flat,slope_upsloping,thal_fixed defect,thal_normal,thal_reversable defect,thal_unknown
0,52,125,212,168,1.0,2,0,False,True,False,...,False,True,False,True,False,False,False,False,True,False
1,53,140,203,155,3.1,0,0,False,True,False,...,True,False,True,False,False,True,False,False,True,False
2,70,145,174,125,2.6,0,0,False,True,False,...,False,False,True,False,False,True,False,False,True,False
3,61,148,203,161,0.0,1,0,False,True,False,...,False,True,False,True,False,False,False,False,True,False
4,62,138,294,106,1.9,3,0,True,False,False,...,False,True,False,False,True,False,True,False,False,False


In [20]:
df.iloc[0]

age                                         52
trestbps                                   125
chol                                       212
thalach                                    168
oldpeak                                    1.0
ca                                           2
target                                       0
sex_female                               False
sex_male                                  True
cp_asymptomatic                          False
cp_atypical angina                       False
cp_non-anginal pain                      False
cp_typical angina                         True
fbs_greater than 120mg ml                False
fbs_lower than 120mg/ml                   True
restecg_ST-T wave abnormality             True
restecg_left ventricular hyper trophy    False
restecg_normal                           False
exang_no                                  True
exang_yes                                False
slope_downsloping                         True
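slope_flat                               False
slope_upsloping                          False
thal_fixed defect                        False
thal_normal                              False
thal_reversable defect                    True
thal_unknown                             False
Name: 0, dtype: object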

# Exporting the Processed Dataset to a CSV File

In [21]:
df.to_csv('process_heart1025.csv',index=False)
#df.to_csv('process_heart303.csv',index=False)