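## Chapter 2: Data Cleaning and Feature Processing

The data we receive is usually not clean: it contains missing values, outliers, and similar problems, and it needs some processing before any analysis or modeling can continue. The first step after obtaining data is therefore data cleaning. In this chapter we work through missing values, duplicate values, strings, and data conversion, cleaning the data into a form that is ready for analysis or modeling.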

In [1]:
import pandas as pd
import numpy as np

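#### 2.1 Observing and Handling Missing Values
The data we receive often contains many missing values. For example, the Cabin column contains NaN. Do other columns have missing values too, and how should those values be handled?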

In [20]:
df = pd.read_csv(r'joyful-pandas-master\data\train.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


###### 2.1.1 Task 1: Observing missing values
- (1) Check the number of missing values in each feature

In [21]:
df.isnull().sum()

Unnamed: 0       0
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
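
As a complement to the count above, the share of missing values per column is often easier to compare across columns. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration and are not the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real table
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "Q"],
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_ratio = toy.isnull().mean()  # fraction of missing values per column

print(missing_count)
print(missing_ratio)
```

Because `isnull()` returns booleans, `sum()` counts the `True` entries and `mean()` gives their proportion.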

- (2) View the data in the Age, Cabin, and Embarked columns. Each of these tasks can be solved in several ways, so try as many as you can.

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   891 non-null    int64  
 1   PassengerId  891 non-null    int64  
 2   Survived     891 non-null    int64  
 3   Pclass       891 non-null    int64  
 4   Name         891 non-null    object 
 5   Sex          891 non-null    object 
 6   Age          714 non-null    float64
 7   SibSp        891 non-null    int64  
 8   Parch        891 non-null    int64  
 9   Ticket       891 non-null    object 
 10  Fare         891 non-null    float64
 11  Cabin        204 non-null    object 
 12  Embarked     889 non-null    object 
dtypes: float64(2), int64(6), object(5)
memory usage: 90.6+ KB
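
`df.info()` summarizes the non-null counts, but the values themselves can also be viewed by selecting the columns directly. The sketch below uses a hypothetical frame (`toy`) with the same column names, and shows two equivalent selections plus a filter for rows where Age is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic table
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

cols = ["Age", "Cabin", "Embarked"]
by_label = toy[cols]                             # select columns by a list of labels
by_loc = toy.loc[:, cols]                        # same selection through .loc
missing_age_rows = toy[toy["Age"].isnull()]      # only the rows where Age is missing

print(by_label)
print(missing_age_rows)
```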


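###### 2.1.2 Task 2: Handling missing values
- (1) What are the usual approaches to handling missing values?
- (2) Try handling the missing values in the Age column
- (3) Try different methods that handle missing values across the whole table at once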

In [23]:
# Drop every row that contains a missing value
df.dropna().head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


In [24]:
# Fill missing values with a number
df.fillna(0).head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,0,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,0,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,0,S
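
The two cells above work on the whole table. For item (2), which targets only the Age column, one possible approach is to drop or fill on that column alone. This is a sketch on hypothetical values (`toy`), and the choice of mean or median as the fill value is an assumption for illustration, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; NaN marks a missing value
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

age_dropped = toy.dropna(subset=["Age"])             # drop rows where Age is missing
age_mean = toy["Age"].fillna(toy["Age"].mean())      # fill with the column mean
age_median = toy["Age"].fillna(toy["Age"].median())  # fill with the column median

print(age_dropped)
print(age_median)
```

Filling with 0, as in the whole-table cell, would put implausible ages into the column, which is why a statistic of the observed values is a common alternative.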


[Thought 1] What parameters do dropna and fillna take, and how is each one used?

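#### The dropna function
- axis: defaults to 0 (drop rows); axis=1 drops columns
- how: either any or all; any (the default) drops a row or column that contains at least one missing value, all drops it only when every value is missing
- subset: restricts the search for missing values to a given set of columns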

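#### The fillna function
- value: the value used to fill the gaps
- method: the fill method, forward (ffill) or backward (bfill); recent pandas versions deprecate this parameter in favor of the ffill() and bfill() methods
- axis: the axis to fill along; the default axis=0 fills down each column
- inplace: if True, modifies the calling object instead of returning a new one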

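[Thought 2] When searching for missing values, np.nan is a better choice than None. Why?

- Searching for np.nan is less likely to miss values, because np.nan is the default missing-value marker in pandas: read_csv loads empty fields as NaN, and a None placed in a numeric column is converted to NaN as well.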
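
A short sketch of how pandas treats the two markers, using hypothetical Series. It also shows why an equality test against np.nan does not work and `isnull()` is the usual way to detect missing values.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype stays float64
num = pd.Series([1.0, None, np.nan])

# In a text Series, both None and np.nan are reported as missing
txt = pd.Series(["a", None, np.nan])

# NaN is not equal to itself, so `== np.nan` cannot be used to find it
nan_equals_itself = (np.nan == np.nan)

print(num.dtype)
print(num.isnull().tolist())
print(txt.isnull().tolist())
print(nan_equals_itself)
```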

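#### 2.2 Observing and Handling Duplicate Values
For one reason or another, the data may contain duplicate rows. If it does, how should they be handled?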

###### 2.2.1 Task 1: Check the data for duplicate values

In [30]:
df[df.duplicated()]

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked


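###### 2.2.2 Task 2: Handling duplicate values
- (1) What ways are there to handle duplicate values?
- (2) Handle the duplicate values in our data
- The more methods you try, the better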

In [31]:
df.drop_duplicates().head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
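
Since the Titanic table has no fully duplicated rows, the effect of `drop_duplicates` is easier to see on a small hypothetical frame. The sketch below (with made-up `Name` and `Ticket` values) also shows the `subset` and `keep` parameters.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly; row 3 repeats only the name
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Ann"],
    "Ticket": ["T1", "T2", "T1", "T3"],
})

dup_mask = toy.duplicated()                                   # True for repeats of an earlier full row
deduped = toy.drop_duplicates()                               # keep the first of each full-row duplicate
by_name_last = toy.drop_duplicates(subset=["Name"], keep="last")   # compare on Name only, keep last
all_same_name = toy[toy.duplicated(subset=["Name"], keep=False)]   # every row whose Name repeats

print(dup_mask.tolist())
print(deduped)
print(by_name_last)
```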


###### 2.2.3 Task 3: Save the cleaned data in csv format

In [32]:
# to_csv returns None when it is given a path, so the result is not assigned
df.to_csv(r'joyful-pandas-master\data\train_test.csv')
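
The `Unnamed: 0` style columns in the outputs come from writing the row index to the file and reading it back as an ordinary column. A minimal sketch of this round trip, using an in-memory buffer and a hypothetical frame so that no file is needed:

```python
import io
import pandas as pd

# Hypothetical two-row frame
toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

with_index = toy.to_csv()                # no path given: returns the CSV text, index included
without_index = toy.to_csv(index=False)  # leave the row index out of the file

reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```

Passing `index=False` when saving, or `index_col=0` when loading, are two common ways to avoid accumulating these extra columns.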

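#### 2.3 Observing and Processing Features
Looking over the features, we can roughly divide them into two groups:
Numeric features: Survived, Pclass, Age, SibSp, Parch, Fare. Of these, Survived and Pclass are discrete numeric features, while Age, SibSp, Parch, and Fare are continuous numeric features.

Text features: Name, Sex, Cabin, Embarked, Ticket. Of these, Sex, Cabin, Embarked, and Ticket are categorical text features.

Numeric features can usually be used for model training directly, but continuous variables are sometimes discretized to make the model more stable and robust. Text features generally have to be converted into numeric features before they can be used in modeling.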

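###### 2.3.1 Task 1: Bin (discretize) the ages
- (1) What is a binning operation?

- (2) Split the continuous variable Age into 5 equal-width age groups, labeled with the categorical values 1 to 5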

In [35]:
df = pd.read_csv(r'joyful-pandas-master\data\train_test.csv')
df['Agecut'] = pd.cut(df['Age'], 5,labels = ['1','2','3','4','5'])
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Agecut
0,0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,2
1,1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3
2,2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,2
3,3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,3
4,4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,3
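
Binning maps each value of a continuous variable to one of a small number of intervals. To make the equal-width case concrete, the sketch below runs `pd.cut` on hypothetical ages and uses `retbins=True` to return the interval edges that pandas computed.

```python
import numpy as np
import pandas as pd

# Hypothetical ages spanning 0 to 80
ages = pd.Series([0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0])

# An integer bin count asks for that many equal-width intervals over the data range
binned, edges = pd.cut(ages, 5, labels=[1, 2, 3, 4, 5], retbins=True)

print(binned.tolist())
print(edges)
```

With a range of 0 to 80, each bin is 16 wide; pandas lowers the first edge slightly so that the minimum value is included in the first bin.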


- (3) Split the continuous variable Age into the five age groups [0,5) [5,15) [15,30) [30,50) [50,80), labeled with the categorical values 1 to 5

In [37]:
df['Agecut']=pd.cut(df.Age,[0,5,15,30,50,80], right=False, labels=[1,2,3,4,5])
df['Agecut']

0        3
1        4
2        3
3        4
4        4
      ... 
886      3
887      3
888    NaN
889      3
890      4
Name: Agecut, Length: 891, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
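
With `right=False` the intervals are closed on the left and open on the right, so a value equal to the last edge falls outside every bin. The sketch below illustrates this boundary behavior on hypothetical ages.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on or next to the bin edges
ages = pd.Series([4.0, 5.0, 29.0, 30.0, 80.0, np.nan])

left_closed = pd.cut(ages, [0, 5, 15, 30, 50, 80], right=False, labels=[1, 2, 3, 4, 5])

# 5 lands in [5,15) and 30 lands in [30,50); 80 is outside [50,80) and becomes NaN
print(left_closed.tolist())
```

If the maximum value should be kept, one option is to make the last edge larger than the largest age in the data.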

- (4) Split the continuous variable Age at the 10%, 30%, 50%, 70%, and 90% quantiles into five age groups, labeled with the categorical values 1 to 5

In [38]:
df['Agecut'] = pd.qcut(df['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = ['1','2','3','4','5'])
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Agecut
0,0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,2
1,1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,5
2,2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3
3,3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,4
4,4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,4
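
`pd.qcut` places the edges at quantiles of the data rather than at fixed values. A sketch on hypothetical ages 1 to 100 shows how many values land in each bin, and that values above the last quantile given (here 90%) are left unassigned.

```python
import numpy as np
import pandas as pd

# Hypothetical ages 1, 2, ..., 100
ages = pd.Series(np.arange(1, 101, dtype=float))

# Edges at the 0%, 10%, 30%, 50%, 70%, and 90% quantiles give five bins
binned = pd.qcut(ages, [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=[1, 2, 3, 4, 5])
counts = binned.value_counts().sort_index().tolist()
unassigned = int(binned.isnull().sum())

# Adding 1.0 as a final edge covers the top 10% with a sixth bin
full = pd.qcut(ages, [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0], labels=[1, 2, 3, 4, 5, 6])

print(counts)
print(unassigned)
```

Whether the top 10% should be left out or given its own label depends on how the task statement is read; the sixth bin here is only one possible interpretation.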


- (5) Save each of the results obtained above in csv format

In [41]:
# to_csv returns None when it is given a path, so the result is not assigned
df.to_csv(r'joyful-pandas-master\data\train_test2.csv')

###### 2.3.2 Task 2: Converting text variables
- (1) View the text variables and the categories each one contains

In [46]:
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [47]:
df['Cabin'].value_counts()

B96 B98        4
C23 C25 C27    4
G6             4
D              3
F2             3
              ..
D56            1
B102           1
B50            1
B4             1
C148           1
Name: Cabin, Length: 147, dtype: int64

In [45]:
df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

- (2) Represent the text variables Sex, Cabin, and Embarked with the numeric values 1, 2, 3, 4, 5, and so on

In [48]:
df['Sex_replace'] = df['Sex'].replace(['male','female'],[1,2])
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Agecut,Sex_replace
0,0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,2,1
1,1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,5,2
2,2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3,2
3,3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,4,2
4,4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,4,1
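
The cell above covers Sex only. For columns with more categories, such as Cabin and Embarked, writing out every value by hand is impractical. One possible alternative, sketched here on hypothetical data, is an explicit `map` for small category sets and `pd.factorize` for large ones.

```python
import pandas as pd

# Hypothetical rows with the same column names as the Titanic table
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Cabin": [None, "C85", None],
})

# Explicit mapping: readable when there are only a few categories
sex_code = toy["Sex"].map({"male": 1, "female": 2})

# factorize assigns 0, 1, 2, ... in order of first appearance
emb_codes, emb_uniques = pd.factorize(toy["Embarked"])
emb_from_one = emb_codes + 1  # shift so that the codes start at 1

# Missing values receive the code -1 by default
cabin_codes, cabin_uniques = pd.factorize(toy["Cabin"])

print(sex_code.tolist())
print(emb_from_one.tolist(), list(emb_uniques))
print(cabin_codes.tolist())
```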


- (3) Represent the text variables Sex, Cabin, and Embarked with one-hot encoding

- I have forgotten how to do this and need to review it.
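
For reference, one common way to one-hot encode in pandas is `pd.get_dummies`. This is a minimal sketch on hypothetical data, not the notebook author's solution; `dtype=int` is used so that the indicator columns hold 0 and 1 rather than booleans.

```python
import pandas as pd

# Hypothetical rows with the same column names as the Titanic table
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Each listed column is replaced by one indicator column per category
one_hot = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

print(one_hot)
```

The new columns are named `<column>_<category>`, and the columns not listed in `columns` are passed through unchanged. A column with many distinct values, such as Cabin, would produce one indicator per value, so it is often simplified before encoding.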

###### 2.3.3 Task 3: Extract a Titles feature from the free-text Name feature (Titles are Mr, Miss, Mrs, and so on)

In [50]:
# Capture the run of letters directly before the first period, e.g. Mr, Miss, Mrs
df['Titles'] = df['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Agecut,Sex_replace,Title,Titles
0,0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,2,1,Mr,Mr
1,1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,5,2,Mrs,Mrs
2,2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3,2,Miss,Miss
3,3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,4,2,Mrs,Mrs
4,4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,4,1,Mr,Mr
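
To show how the capture group behaves, here is a sketch on a few names. The first three follow the rows above; "Smith, Dr. John" is a hypothetical name added to show a title outside the Mr/Miss/Mrs set. It compares a general letters-before-a-period pattern with an alternation restricted to three titles.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley",
    "Smith, Dr. John",  # hypothetical name, for illustration only
])

# General pattern: any run of letters that is followed by a period
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Restricted pattern: alternation with |, so other titles yield NaN
restricted = names.str.extract(r"\b(Mr|Miss|Mrs)\.", expand=False)

print(titles.tolist())
print(restricted.tolist())
```

Note that square brackets in a regular expression define a set of single characters, so alternatives between whole words have to be written with `|` inside a group.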
