# Walmart Sales Data

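#### Introduction: This dataset contains Walmart sales data. Several of Walmart's retail stores across the country struggle with inventory management, so how can supply be matched to demand? As a data scientist, you can use the data to provide useful insights and build a predictive model that forecasts sales for the next X months or years.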

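Variable definitions:
- Store: store number
- Date: week of sales
- Weekly_Sales: the store's sales for that week
- Holiday_Flag: whether the week is a holiday week
- Temperature: temperature on the day of sale
- Fuel_Price: cost of fuel in the region
- CPI: consumer price index
- Unemployment: unemployment rate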

# Load the data

In [22]:
import pandas as pd

In [2]:
original_data = pd.read_csv("walmart_stores_data.csv")
original_data

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,05-02-2010,1643690.90,0,42.31,2.572,211.096358,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.242170,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,05-03-2010,1554806.68,0,46.50,2.625,211.350143,8.106
...,...,...,...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88,3.997,192.013558,8.684
6431,45,05-10-2012,733455.07,0,64.89,3.985,192.170412,8.667
6432,45,12-10-2012,734464.36,0,54.47,4.000,192.327265,8.667
6433,45,19-10-2012,718125.53,0,56.47,3.969,192.330854,8.667


# Assess the data

## Assess tidiness

Draw 10 random rows to check whether the data is tidy.

In [3]:
original_data.sample(10)

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
266,2,15-06-2012,1962924.3,0,80.56,3.393,221.40099,6.891
1271,9,13-07-2012,536537.64,0,82.39,3.256,225.677146,5.277
5362,38,17-06-2011,356797.0,0,86.84,3.935,129.0432,13.736
1005,8,05-03-2010,881503.95,0,45.64,2.625,214.721659,6.299
156,2,07-05-2010,2042581.71,0,71.28,2.835,210.001102,8.2
2685,19,23-03-2012,1342254.55,0,56.72,4.054,137.65529,7.943
3268,23,08-06-2012,1568048.54,0,56.82,3.746,138.117419,4.125
3048,22,17-12-2010,1527682.99,0,30.46,3.139,136.529281,8.572
1022,8,02-07-2010,852333.75,0,74.78,2.669,214.592812,6.315
3485,25,11-02-2011,615666.78,1,21.18,3.239,206.076386,7.343


The data is tidy: each row is an observation, each column is a variable, and each cell is a single value.

## Assess cleanliness

In [7]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         6435 non-null   int64  
 1   Date          6435 non-null   object 
 2   Weekly_Sales  6435 non-null   float64
 3   Holiday_Flag  6435 non-null   int64  
 4   Temperature   6435 non-null   float64
 5   Fuel_Price    6435 non-null   float64
 6   CPI           6435 non-null   float64
 7   Unemployment  6435 non-null   float64
dtypes: float64(5), int64(2), object(1)
memory usage: 402.3+ KB


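Three columns have unsuitable types: Store should be str, Date should be datetime, and Holiday_Flag should be bool.

### Missing data

No data is missing: every column has 6435 non-null values out of 6435 entries.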
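
The non-null counts from `info()` can also be confirmed by counting missing values per column. A minimal sketch on a small illustrative frame (the `sample` name and the third row, which has a deliberate gap, are made up for the example):

```python
import pandas as pd

# Small illustrative frame; the third row is invented and has a gap on purpose.
sample = pd.DataFrame({
    "Store": [1, 1, 2],
    "Weekly_Sales": [1643690.90, 1641957.44, None],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real data, `original_data.isna().sum()` should report 0 for every column, which agrees with the `info()` output above.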

### Duplicate data

In [10]:
original_data.duplicated().sum()

0

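There are no duplicate rows.

Store (the store number) identifies a store, but each store has rows for many different weeks, so repeated Store values do not count as duplicates.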
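
Because a store appears once per week, a stricter check is whether any (Store, Date) pair repeats, which a full-row comparison can miss when only the sales figure differs. A hedged sketch, using an illustrative `sample` frame whose last two rows are invented:

```python
import pandas as pd

# Illustrative rows; the last row is an invented repeat of the first store-week
# with a different sales figure.
sample = pd.DataFrame({
    "Store": [1, 1, 2, 1],
    "Date": ["05-02-2010", "12-02-2010", "05-02-2010", "05-02-2010"],
    "Weekly_Sales": [1643690.90, 1641957.44, 2000000.00, 1000000.00],
})

# Full-row comparison: no two rows are identical in every column.
full_row_duplicates = sample.duplicated().sum()

# Key-based comparison: the same store reports the same week twice.
key_duplicates = sample.duplicated(subset=["Store", "Date"]).sum()

print(full_row_duplicates, key_duplicates)
```

The same call, `original_data.duplicated(subset=["Store", "Date"]).sum()`, could be run on the real data; its result is not shown in this notebook.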

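### Inconsistent data

No variable in this dataset has values that refer to the same thing under different names, so there is no inconsistency problem.

### Invalid or erroneous data

Use the `describe` method for a quick look at the key statistics of the data.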

In [12]:
original_data.describe()

Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
count,6435.0,6435.0,6435.0,6435.0,6435.0,6435.0,6435.0
mean,23.0,1046965.0,0.06993,60.663782,3.358607,171.578394,7.999151
std,12.988182,564366.6,0.255049,18.444933,0.45902,39.356712,1.875885
min,1.0,209986.2,0.0,-2.06,2.472,126.064,3.879
25%,12.0,553350.1,0.0,47.46,2.933,131.735,6.891
50%,23.0,960746.0,0.0,62.67,3.445,182.616521,7.874
75%,34.0,1420159.0,0.0,74.94,3.735,212.743293,8.622
max,45.0,3818686.0,1.0,100.14,4.468,227.232807,14.313


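No invalid or erroneous data was found: Weekly_Sales is always positive, Holiday_Flag stays between 0 and 1, and the ranges of the other columns look plausible.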
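
Reading ranges off `describe()` can be complemented with explicit rule checks. The sketch below is one possible way to do it; the rules themselves (positive sales, a 0/1 flag, unemployment as a percentage between 0 and 100) are assumptions about what counts as valid, and the third row of `sample` is invented so that two rules fail:

```python
import pandas as pd

# Illustrative frame; the third row is invented and violates two rules on purpose.
sample = pd.DataFrame({
    "Weekly_Sales": [1643690.90, 1641957.44, -10.0],
    "Holiday_Flag": [0, 1, 2],
    "Unemployment": [8.106, 8.106, 8.106],
})

# Each rule is a boolean Series: True where the row passes (assumed rules).
checks = {
    "Weekly_Sales is positive": sample["Weekly_Sales"] > 0,
    "Holiday_Flag is 0 or 1": sample["Holiday_Flag"].isin([0, 1]),
    "Unemployment is a percentage": sample["Unemployment"].between(0, 100),
}

# Number of rows that fail each rule.
failures = {name: int((~passed).sum()) for name, passed in checks.items()}
print(failures)
```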

# Clean the data

In [14]:
cleaned_data = original_data.copy()
cleaned_data.head(10)

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,05-02-2010,1643690.9,0,42.31,2.572,211.096358,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.24217,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,05-03-2010,1554806.68,0,46.5,2.625,211.350143,8.106
5,1,12-03-2010,1439541.59,0,57.79,2.667,211.380643,8.106
6,1,19-03-2010,1472515.79,0,54.58,2.72,211.215635,8.106
7,1,26-03-2010,1404429.92,0,51.45,2.732,211.018042,8.106
8,1,02-04-2010,1594968.28,0,62.27,2.719,210.82045,7.808
9,1,09-04-2010,1545418.53,0,65.86,2.77,210.622857,7.808


Fix the issues identified during assessment:
- Store should be converted to str
- Date should be converted to datetime
- Holiday_Flag should be converted to bool

Convert Store to str

In [16]:
cleaned_data["Store"] = cleaned_data["Store"].astype(str)
cleaned_data["Store"]

0        1
1        1
2        1
3        1
4        1
        ..
6430    45
6431    45
6432    45
6433    45
6434    45
Name: Store, Length: 6435, dtype: object

Convert Date to datetime

In [23]:
cleaned_data["Date"] = pd.to_datetime(cleaned_data["Date"], format="%d-%m-%Y")
cleaned_data["Date"]

0      2010-02-05
1      2010-02-12
2      2010-02-19
3      2010-02-26
4      2010-03-05
          ...    
6430   2012-09-28
6431   2012-10-05
6432   2012-10-12
6433   2012-10-19
6434   2012-10-26
Name: Date, Length: 6435, dtype: datetime64[ns]
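
The explicit `format` matters because the dates are written day first. A short sketch of what happens to the first Date value with and without it (the variable names are illustrative):

```python
import pandas as pd

raw = "05-02-2010"  # first Date value in the dataset

# With the explicit day-first format this is 5 February 2010.
explicit = pd.to_datetime(raw, format="%d-%m-%Y")

# Without a format, pandas reads an ambiguous string month first: 2 May 2010.
default = pd.to_datetime(raw)

print(explicit.date(), default.date())
```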

Convert Holiday_Flag to bool

In [26]:
cleaned_data["Holiday_Flag"] = cleaned_data["Holiday_Flag"].astype(bool)
cleaned_data["Holiday_Flag"]

0       False
1        True
2       False
3       False
4       False
        ...  
6430    False
6431    False
6432    False
6433    False
6434    False
Name: Holiday_Flag, Length: 6435, dtype: bool

# Save the cleaned data

In [27]:
cleaned_data.to_csv("walmart_stores_data_clean.csv",index=False)
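
A CSV file stores only text, so the str and datetime types set above are not kept when the file is read back. The sketch below shows one way to restore them on load; it writes to an in-memory buffer instead of a file, and the two-row `cleaned` frame is a stand-in for `cleaned_data`:

```python
import io

import pandas as pd

# Stand-in for cleaned_data: first and last store-weeks from the dataset.
cleaned = pd.DataFrame({
    "Store": ["1", "45"],
    "Date": pd.to_datetime(["05-02-2010", "26-10-2012"], format="%d-%m-%Y"),
    "Holiday_Flag": [False, False],
})

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Plain read: Store comes back as an integer and Date as text.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Read with explicit types: Store stays text and Date is parsed as datetime.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"Store": str}, parse_dates=["Date"])

print(naive.dtypes)
print(restored.dtypes)
```

Holiday_Flag needs no extra argument here because pandas reads `True`/`False` columns as bool.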