<div class="alert alert-block alert-success">
    
# FIT5196 Task 1 in Assessment 2
#### Student Name: Deshui Yu      Liangjing Yang
#### Student ID: 34253599      34060871

Date: 28/09/2024

    
</div>

<div class="alert alert-block alert-danger">
    
## Table of Contents

</div>    

[1. Introduction](#Intro) <br>
[2. Importing Libraries](#libs) <br>
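[3. Examining Raw Data](#examine) <br>
[4. Detect and fix errors in dirty_data](#load) <br>
$\;\;\;\;$[4.1. Fix the date](#Reg_Exp) <br>
$\;\;\;\;$[4.2. Fix the season](#Reg_Exp) <br>
$\;\;\;\;$[4.3. Fix the customer_lat, customer_long](#Reg_Exp) <br>
$\;\;\;\;$[4.4. Fix the nearest_warehouse](#Reg_Exp) <br>
$\;\;\;\;$[4.5. Fix the distance_to_nearest_warehouse](#Reg_Exp) <br>
$\;\;\;\;$[4.6. Fix the order_total](#Reg_Exp) <br>
$\;\;\;\;$[4.10. Fix the is_happy_customer](#Reg_Exp) <br>
$\;\;\;\;$[4.11. Fix the is_expedited_delivery](#Reg_Exp) <br>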
[5. Writing to CSV/JSON File](#write) <br>
$\;\;\;\;$[5.1. Verification - using the sample files](#test_xml) <br>
[6. Summary](#summary) <br>
[7. References](#Ref) <br>

-------------------------------------

<div class="alert alert-block alert-warning">

## 1.  Introduction  <a class="anchor" name="Intro"></a>
    
</div>

This project involves cleansing and analyzing a retail transactional dataset from DigiCO, an online electronics store in Melbourne. The task is to detect and fix errors, impute missing values, and remove outliers using exploratory data analysis (EDA). Cleaned data will be saved in the required output files, and the process will be documented in the final report.

<div class="alert alert-block alert-warning">
    
## 2.  Importing Libraries  <a class="anchor" name="libs"></a>
 </div>

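The packages used in this assessment are imported below. They serve the following purposes:

* **pandas:** to load, manage and analyse the data.
* **numpy:** to build the matrices used in the least-squares price estimation.
* **math:** for the trigonometric functions in the Haversine distance.
* **ast:** to parse the `shopping_cart` strings into Python lists.
* **collections.Counter:** to count the items that appear in the shopping carts.
* **nltk:** for sentiment analysis of customer reviews (`SentimentIntensityAnalyzer`).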

In [102]:
import pandas as pd
import numpy as np
import math
import ast
from collections import Counter
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk


<div class="alert alert-block alert-warning">

## 3.  Examining Raw Data <a class="anchor" name="examine"></a>

 </div>

All three files contain the same columns: order_id, customer_id, date, nearest_warehouse, shopping_cart, order_price, delivery_charges, customer_lat, customer_long, coupon_discount, order_total, season, is_expedited_delivery, distance_to_nearest_warehouse, latest_customer_review and is_happy_customer. Of these, coupon_discount, delivery_charges, the item quantities in shopping_cart, order_id, customer_id and latest_customer_review are error-free. In Group181_missing_data.csv, the is_happy_customer column has missing values and its data type is float64; Group181_dirty_data.csv contains erroneous data; and Group181_outlier_data.csv contains outliers.

From the logic of the data, date and season are related, as are customer_lat, customer_long and distance_to_nearest_warehouse.

In [103]:
dirty_file_path = 'Group181_dirty_data.csv'
dirty_data = pd.read_csv(dirty_file_path)
dirty_data['error'] = 0
# dirty_data.info()

In [132]:
missing_file_path = 'Group181_missing_data.csv'
missing_data = pd.read_csv(missing_file_path)
missing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   order_id                       500 non-null    object 
 1   customer_id                    500 non-null    object 
 2   date                           500 non-null    object 
 3   nearest_warehouse              445 non-null    object 
 4   shopping_cart                  500 non-null    object 
 5   order_price                    485 non-null    float64
 6   delivery_charges               460 non-null    float64
 7   customer_lat                   500 non-null    float64
 8   customer_long                  500 non-null    float64
 9   coupon_discount                500 non-null    int64  
 10  order_total                    485 non-null    float64
 11  season                         500 non-null    object 
 12  is_expedited_delivery          500 non-null    boo

In [105]:
outlier_file_path = 'Group181_outlier_data.csv'
outlier_data = pd.read_csv(outlier_file_path)

<div class="alert alert-block alert-warning"> 

## 4.  Detect and fix errors in dirty_data <a class="anchor" name="load"></a>

</div>

<div class="alert alert-block alert-info">
    
### 4.1. Fix the date <a class="anchor" name="Reg_Exp"></a>

Examining the data shows that some date values, which should be in YYYY-MM-DD format, were instead written as YYYY-DD-MM or DD-MM-YYYY. Because the season is determined by the month of the order, the season column also has to be rechecked and corrected once the dates have been fixed.
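
As a minimal sketch of this idea, assuming the only malformed layouts are the two named above, a single date string can be normalised by checking where the four-digit year sits and whether the middle field can be a month. The helper name `normalise_date` is hypothetical and not part of the notebook. A string whose day and month are both at most 12 cannot be told apart from YYYY-DD-MM, so it is left unchanged.

```python
from datetime import datetime

def normalise_date(date_str):
    """Return date_str as YYYY-MM-DD, or None if it cannot be repaired."""
    parts = date_str.split('-')
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    if len(parts[2]) == 4:
        # DD-MM-YYYY: the year is the last field
        year, month, day = parts[2], parts[1], parts[0]
    elif int(parts[1]) > 12:
        # YYYY-DD-MM: the middle field cannot be a month
        year, month, day = parts[0], parts[2], parts[1]
    else:
        # Already YYYY-MM-DD (or indistinguishable from it)
        year, month, day = parts
    try:
        # strptime rejects impossible dates such as month 13 or 31 February
        parsed = datetime.strptime(f'{year}-{month}-{day}', '%Y-%m-%d')
    except ValueError:
        return None
    return parsed.strftime('%Y-%m-%d')

for raw in ['2019-23-01', '23-01-2019', '2019-01-23']:
    print(raw, '->', normalise_date(raw))
```
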

In [106]:
# Check and mark invalid dates
# reference from chatGPT
# Temporarily convert the 'date' column to datetime format, marking invalid dates as NaT (Not a Time)
temp_dates = pd.to_datetime(dirty_data['date'], errors='coerce')
# Find the invalid date
invalid_dates_temp = dirty_data[temp_dates.isna()]
# print(invalid_dates_temp["date"])

In [107]:
# Convert the date column to datetime; invalid values become NaT
temp_dates = pd.to_datetime(dirty_data['date'], errors='coerce')

# Mark rows whose date could not be parsed with 1 (error detected)
dirty_data.loc[temp_dates.isna(), 'error'] = 1

# Get the indices of the rows in dirty_data where error == 1
error_indices = dirty_data.index[dirty_data['error'] == 1]

# Print these row indices
print(error_indices)

# Repair the dates and mark the repaired rows
fixed_dates = []
for idx, date_str in enumerate(dirty_data['date'].astype(str)):
    parts = date_str.split('-')
    # Check if format is DD-MM-YYYY (four-digit year in the last field) and fix to YYYY-MM-DD
    if len(parts) == 3 and len(parts[2]) == 4:
        fixed_dates.append(f'{parts[2]}-{parts[1]}-{parts[0]}')  # DD-MM-YYYY -> YYYY-MM-DD
        dirty_data.loc[idx, 'error'] = 2  # mark only the current row as fixed
    # Check if format is YYYY-DD-MM and fix to YYYY-MM-DD
    elif len(parts) == 3 and int(parts[1]) > 12 and int(parts[2]) <= 12:
        fixed_dates.append(f'{parts[0]}-{parts[2]}-{parts[1]}')  # YYYY-DD-MM -> YYYY-MM-DD
        dirty_data.loc[idx, 'error'] = 2  # mark only the current row as fixed
    # Keep original format for other cases
    else:
        fixed_dates.append(date_str)

# Convert the repaired dates back to datetime and replace the original 'date' column
dirty_data['date'] = pd.to_datetime(fixed_dates, errors='coerce')


Int64Index([33, 43, 69, 110, 172, 240, 246, 285, 291, 371, 443, 466, 467], dtype='int64')


<div class="alert alert-block alert-info">
    
### 4.2. Fix the season <a class="anchor" name="Reg_Exp"></a>

Since we know that the date and season data are logically related, once the date data has been corrected, the season data should also be updated accordingly.
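
The month-to-season relationship can be sketched as a simple lookup table. This is an illustrative sketch rather than the notebook's own code: the names `SEASON_BY_MONTH`, `season_of` and `season_is_wrong` are hypothetical, and the month groups are the Southern Hemisphere seasons used in the cells of this section.

```python
from datetime import date

# Southern Hemisphere seasons, matching the month groups used in this notebook
SEASON_BY_MONTH = {
    12: 'Summer', 1: 'Summer', 2: 'Summer',
    3: 'Autumn', 4: 'Autumn', 5: 'Autumn',
    6: 'Winter', 7: 'Winter', 8: 'Winter',
    9: 'Spring', 10: 'Spring', 11: 'Spring',
}

def season_of(d):
    """Return the season implied by the month of a date."""
    return SEASON_BY_MONTH[d.month]

def season_is_wrong(d, recorded_season):
    """True when the recorded season disagrees with the date."""
    return season_of(d) != recorded_season

print(season_of(date(2019, 2, 7)))
print(season_is_wrong(date(2019, 2, 7), 'Autumn'))
```

With a pandas datetime column, the same table can be applied in one step with `df['date'].dt.month.map(SEASON_BY_MONTH)`.
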

In [75]:
dirty_data['date'] = pd.to_datetime(dirty_data['date'], errors='coerce')

# Define a function that returns the season for a given month
def get_correct_season(month):
    if month in [12, 1, 2]:
        return 'Summer'
    elif month in [3, 4, 5]:
        return 'Autumn'
    elif month in [6, 7, 8]:
        return 'Winter'
    elif month in [9, 10, 11]:
        return 'Spring'

# Step 1: mark rows with a wrong season

# Select the rows that have not been marked as errors
row_season_data = dirty_data[dirty_data['error'] == 0].copy()

# Extract the month
row_season_data['month'] = row_season_data['date'].dt.month

# Apply the function to get the correct season
row_season_data['correct_season'] = row_season_data['month'].apply(get_correct_season)

# Check whether the season column is correct
row_season_data['season_is_wrong'] = row_season_data['season'] != row_season_data['correct_season']

# Get the indices of the rows with a wrong season
wrong_season_indices = row_season_data.index[row_season_data['season_is_wrong']]

# Mark the corresponding rows in dirty_data as errors
dirty_data.loc[wrong_season_indices, 'error'] = 1

# Drop the temporary columns, ignoring any that do not exist
dirty_data.drop(columns=['month', 'correct_season', 'season_is_wrong'], inplace=True, errors='ignore')


# Get the indices of the rows in dirty_data where error == 1
error_indices = dirty_data.index[dirty_data['error'] == 1]

# Print these row indices
print(error_indices)

Int64Index([  8,  24,  57,  60,  66,  95, 100, 115, 116, 132, 138, 147, 195,
            199, 211, 212, 213, 222, 245, 250, 255, 267, 323, 446, 462, 468,
            496],
           dtype='int64')


In [108]:
fixed_seasons = []

# Iterate over every row and correct the season
for idx, row in dirty_data.iterrows():
    month = row['date'].month  # extract the month
    # If error == 1, correct the season
    if dirty_data.loc[idx, 'error'] == 1:
        if month in [9, 10, 11]:
            fixed_seasons.append('Spring')
        elif month in [12, 1, 2]:
            fixed_seasons.append('Summer')
        elif month in [3, 4, 5]:
            fixed_seasons.append('Autumn')
        elif month in [6, 7, 8]:
            fixed_seasons.append('Winter')
        # Update the error flag to 2 (fixed)
        dirty_data.loc[idx, 'error'] = 2
    else:
        # If error is not 1, keep the original season value
        fixed_seasons.append(dirty_data.loc[idx, 'season'])

# Replace the original season column with the corrected one
dirty_data['season'] = fixed_seasons

# Print the corrected date and season columns
print(dirty_data[['date', 'season']])

          date  season
0   2019-01-23  Summer
1   2019-11-07  Spring
2   2019-01-14  Summer
3   2019-10-31  Spring
4   2019-04-02  Autumn
..         ...     ...
495 2019-11-03  Spring
496 2019-02-07  Autumn
497 2019-05-29  Autumn
498 2019-05-03  Autumn
499 2019-09-06  Spring

[500 rows x 2 columns]


<div class="alert alert-block alert-info">
    
### 4.3. Fix the customer_lat, customer_long<a class="anchor" name="Reg_Exp"></a>

In [109]:
for idx, row in dirty_data.iterrows():
    lat_issue = row['customer_lat'] > 0  # latitude should be negative
    long_issue = row['customer_long'] < 0  # longitude should be positive

    # If either the latitude or the longitude is wrong, mark error as 1
    if lat_issue or long_issue:
        dirty_data.loc[idx, 'error'] = 1
        
# Get the indices of the rows in dirty_data where error == 1
error_indices = dirty_data.index[dirty_data['error'] == 1]

# Print these row indices
print(error_indices)


Int64Index([ 48,  68,  77,  87,  93, 136, 137, 140, 187, 196, 198, 216, 220,
            230, 242, 297, 299, 361, 366, 393, 408, 427, 438, 452, 464, 469,
            495],
           dtype='int64')


In [110]:
# Step 2: fix the rows marked error == 1 (swap the wrong latitude and longitude)
for idx, row in dirty_data.iterrows():
    # Only fix rows whose error flag is 1
    if dirty_data.loc[idx, 'error'] == 1:
        # Condition: latitude is positive or longitude is negative
        condition = (dirty_data.loc[idx, 'customer_long'] < 0) or (dirty_data.loc[idx, 'customer_lat'] > 0)
        
        if condition:
            # Swap the values of customer_lat and customer_long
            temp_lat = dirty_data.loc[idx, 'customer_lat']
            dirty_data.loc[idx, 'customer_lat'] = dirty_data.loc[idx, 'customer_long']
            dirty_data.loc[idx, 'customer_long'] = temp_lat
            
            # Update error to 2 to show the row has been fixed
            dirty_data.loc[idx, 'error'] = 2
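
The detection and swap used in this section can be sketched on plain values. The sketch relies on the same assumption as the cells above, namely that every customer is located where a valid latitude is negative and a valid longitude is positive. `fix_coordinates` is a hypothetical helper, and the coordinates are illustrative values rather than rows from the dataset.

```python
def fix_coordinates(lat, lon):
    """Swap lat and lon when their signs show they were entered the wrong way round.

    Returns (lat, lon, was_swapped).
    """
    if lat > 0 or lon < 0:
        return lon, lat, True
    return lat, lon, False

# Illustrative values only
print(fix_coordinates(144.9631, -37.8136))
print(fix_coordinates(-37.8136, 144.9631))
```
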

<div class="alert alert-block alert-info">
    
### 4.4. Fix the nearest_warehouse<a class="anchor" name="Reg_Exp"></a>

In [111]:
warehouse_counts = dirty_data['nearest_warehouse'].value_counts()
print(warehouse_counts)

for idx, row in dirty_data.iterrows():
    # If error == 0, check nearest_warehouse
    if dirty_data.loc[idx, 'error'] == 0:
        # If nearest_warehouse is not one of the three valid warehouse names, mark error = 1
        if dirty_data.loc[idx, 'nearest_warehouse'] not in ["Nickolson", "Thompson", "Bakers"]:
            dirty_data.loc[idx, 'error'] = 1

Thompson     187
Nickolson    186
Bakers       108
bakers         9
thompson       7
nickolson      3
Name: nearest_warehouse, dtype: int64


In [112]:
# Iterate over every row in dirty_data
for idx, row in dirty_data.iterrows():
    # If error == 1, check nearest_warehouse
    if dirty_data.loc[idx, 'error'] == 1:
        # Get the current nearest_warehouse value in lower case for comparison
        warehouse = dirty_data.loc[idx, 'nearest_warehouse'].lower()
        
        # If the warehouse is "bakers", "thompson" or "nickolson", capitalise its first letter
        if warehouse in ["bakers", "thompson", "nickolson"]:
            dirty_data.loc[idx, 'nearest_warehouse'] = warehouse.capitalize()
            dirty_data.loc[idx, 'error'] = 2  # mark the row as fixed

<div class="alert alert-block alert-info">
    
### 4.5. Fix the distance_to_nearest_warehouse<a class="anchor" name="Reg_Exp"></a>

In [113]:
# Load the warehouse data
warehouse = pd.read_csv("warehouses.csv")

# Build dictionaries that map each warehouse name to its latitude and longitude
lat = dict(zip(warehouse['names'], warehouse['lat']))
lon = dict(zip(warehouse['names'], warehouse['lon']))

# Haversine formula for the distance between two points
def haversine(lat1, lon1, lat2, lon2):
    dLat = (lat2 - lat1) * math.pi / 180.0
    dLon = (lon2 - lon1) * math.pi / 180.0
    lat1 = lat1 * math.pi / 180.0
    lat2 = lat2 * math.pi / 180.0

    a = (pow(math.sin(dLat / 2), 2) + 
         pow(math.sin(dLon / 2), 2) * 
         math.cos(lat1) * math.cos(lat2))

    # Earth radius in kilometres
    rad = 6378
    c = 2 * math.asin(math.sqrt(a))
    return rad * c

# Initialise the distance_computed column
dirty_data['distance_computed'] = None

# For each customer record, compute the distance to nearest_warehouse
for index, row in dirty_data.iterrows():
    # Customer latitude and longitude
    customer_lat = row['customer_lat']
    customer_long = row['customer_long']
    
    # Name of the nearest warehouse
    nearest_warehouse = row['nearest_warehouse']
    
    # Check that the warehouse name exists in both dictionaries
    if nearest_warehouse in lat and nearest_warehouse in lon:
        # Distance between the customer and the nearest warehouse
        dist = round(haversine(customer_lat, customer_long, lat[nearest_warehouse], lon[nearest_warehouse]), 4)
        
        # Store the computed distance in distance_computed
        dirty_data.loc[index, 'distance_computed'] = dist
        

In [114]:
# List of the indices of rows whose distances do not match
mismatch_indices = []

# Iterate over every row in dirty_data
for idx, row in dirty_data.iterrows():
    # Only process rows where error == 0
    if dirty_data.loc[idx, 'error'] == 0:
        # Original distance_to_nearest_warehouse and newly computed distance_computed
        original_distance = dirty_data.loc[idx, 'distance_to_nearest_warehouse']
        computed_distance = dirty_data.loc[idx, 'distance_computed']
        
        # If the two values differ, overwrite the original and mark the row as fixed
        if original_distance != computed_distance:
            mismatch_indices.append(idx)
            dirty_data.loc[idx, 'distance_to_nearest_warehouse'] = computed_distance
            dirty_data.loc[idx, 'error'] = 2

# Print the indices of the mismatching rows
print(f"Rows whose distance values do not match and need fixing: {mismatch_indices}")

# Drop the distance_computed column
dirty_data.drop(columns=['distance_computed'], inplace=True)

# Print the corrected data for checking
print(dirty_data.loc[mismatch_indices, ['distance_to_nearest_warehouse', 'error']])

Rows whose distance values do not match and need fixing: [5, 18, 19, 29, 34, 39, 76, 78, 88, 96, 99, 112, 124, 143, 162, 165, 174, 207, 254, 274, 279, 311, 324, 346, 352, 356, 381, 385, 389, 396, 400, 401, 439, 440, 484]
     distance_to_nearest_warehouse  error
5                           3.4051      2
18                          1.3150      2
19                          0.7133      2
29                          0.7403      2
34                          1.5612      2
39                          0.9221      2
76                          2.1753      2
78                          5.5590      2
88                          0.7450      2
96                          0.9259      2
99                          2.1706      2
112                         5.0801      2
124                         0.9636      2
143                         0.6619      2
162                         1.5315      2
165                         1.7155      2
174                         0.9547      2
207                         0.8881      2
254                   

In [115]:
# Get the indices of the rows in dirty_data that are still marked error == 0
error_indices = dirty_data.index[dirty_data['error'] == 0]

# Print these row indices
print(error_indices)

Int64Index([  0,   1,   2,   3,   4,   6,   7,   8,   9,  10,
            ...
            488, 489, 490, 491, 492, 494, 496, 497, 498, 499],
           dtype='int64', length=392)
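
Because the recorded and recomputed distances are floating-point values rounded to four decimals, one option is to compare them within a small tolerance instead of testing exact equality. The following is a sketch under that assumption: `haversine_km` and `distance_matches` are hypothetical names, the tolerance of 0.001 km is an arbitrary choice, and the coordinates are illustrative rather than taken from `warehouses.csv`. The radius of 6378 km is the one used in this notebook.

```python
import math

EARTH_RADIUS_KM = 6378  # same radius as the notebook's haversine function

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def distance_matches(recorded, computed, tol=1e-3):
    """True when two distances agree to within tol kilometres."""
    return math.isclose(recorded, computed, abs_tol=tol)

# Illustrative coordinates only
d = round(haversine_km(-37.80, 144.95, -37.81, 144.96), 4)
print(distance_matches(d, d + 0.0004))
print(distance_matches(d, d + 0.5))
```
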


<div class="alert alert-block alert-info">
    
### 4.6. Fix the order_total<a class="anchor" name="Reg_Exp"></a>

In [83]:
# Counter for how many times each item (brand) is sold
brand_counter = Counter()

# Iterate over outlier_data to collect all brands
for index, row in outlier_data.iterrows():
    # Convert the shopping_cart string into a Python list
    shopping_cart = ast.literal_eval(row['shopping_cart'])
    # Iterate over every item in the cart
    for item in shopping_cart:
        brand_name = item[0]  # brand name of the item
        brand_counter[brand_name] += 1  # count the occurrences of each brand

# List of all brands
item_types = list(brand_counter.keys())
print(item_types)

# Initialise matrix A and vector b
A = np.zeros((len(outlier_data), len(item_types)))  # shape: (number of orders, number of brands)
b = np.zeros(len(outlier_data))  # order prices

# Fill matrix A and vector b from outlier_data
for index, row in outlier_data.iterrows():
    # Convert the shopping_cart string into a Python list (ast.literal_eval is safer than eval)
    shopping_cart = ast.literal_eval(row["shopping_cart"])
    # Store the order price in b
    b[index] = row["order_price"]
    # Iterate over each item and quantity in the cart
    for item in shopping_cart:
        brand_name = item[0]
        quantity = item[1]
        # Only handle items that belong to the known item_types
        if brand_name in item_types:
            # Column index of the item in item_types
            item_index = item_types.index(brand_name)
            # Add the quantity to the matching position in A
            A[index, item_index] += quantity

# Keep only rows where A is not all zeros and b is not NaN
valid_indices = np.where(A.any(axis=1) & ~np.isnan(b))[0]
A = A[valid_indices]
b = b[valid_indices]

# Solve for the unit price of each brand with np.linalg.lstsq()
prices, residuals, rank, s = np.linalg.lstsq(A, b, rcond=None)

# Round the prices to whole numbers
rounded_prices = np.round(prices)

# Convert the result into a brand -> price dictionary
price_dict = dict(zip(item_types, rounded_prices))

# Print the price of each brand
print("Brand Prices (Rounded):")
for brand, price in price_dict.items():
    print(f"Brand: {brand}, Price: {int(price)}")  # print as an integer

['iAssist Line', 'Lucent 330S', 'Toshika 750', 'Thunder line', 'Olivia x460', 'Universe Note', 'iStream', 'Alcon 10', 'Candle Inferno', 'pearTV']
Brand Prices (Rounded):
Brand: iAssist Line, Price: 2225
Brand: Lucent 330S, Price: 1230
Brand: Toshika 750, Price: 4320
Brand: Thunder line, Price: 2180
Brand: Olivia x460, Price: 1225
Brand: Universe Note, Price: 3450
Brand: iStream, Price: 150
Brand: Alcon 10, Price: 8950
Brand: Candle Inferno, Price: 430
Brand: pearTV, Price: 6310


In [116]:
dirty_data['order_computed'] = None
# Iterate over every row
for idx, row in dirty_data.iterrows():
    # Convert the shopping_cart string into a Python list
    shopping_cart = ast.literal_eval(row['shopping_cart'])
    # If error == 0, compute the total price of the cart
    if dirty_data.loc[idx, 'error'] == 0:
        total_price = 0
        # Iterate over every item in the cart
        for item in shopping_cart:
            brand, quantity = item  # each item has the form (brand, quantity)
            # Add the price of the current item
            if brand in price_dict:
                total_price += price_dict[brand] * quantity
        # Store the computed price in order_computed
        dirty_data.loc[idx, 'order_computed'] = total_price
    else:
        # If error is not 0, set order_computed to order_price
        dirty_data.loc[idx, 'order_computed'] = row['order_price']
# print(dirty_data['order_computed'])

In [117]:
import itertools

# Flag unmarked rows whose recomputed price differs from the recorded order_price
mismatch_rows = dirty_data[dirty_data['order_computed'] != dirty_data['order_price']]
mismatch_indices = mismatch_rows.index.tolist()
for idx in mismatch_indices:
    if dirty_data.loc[idx, 'error'] == 0:
        dirty_data.loc[idx, 'error'] = 1

# List of the row indices for which a matching cart was found
successful_matches = []

# Iterate over all rows with error == 1
for idx, row in dirty_data[dirty_data['error'] == 1].iterrows():
    shopping_cart = ast.literal_eval(row['shopping_cart'])
    original_order_price = row['order_price']

    # Keep the quantities in the cart unchanged; only the brands are replaced
    quantities = [item[1] for item in shopping_cart]

    # Generate all brand permutations so that each brand appears at most once
    if len(quantities) <= len(price_dict):  # make sure there are enough brands
        brand_combinations = itertools.permutations(price_dict.keys(), len(quantities))

        # Try every possible brand combination
        for possible_combination in brand_combinations:
            temp_total = 0
            temp_shopping_cart = []

            # Pair each quantity with a brand from the current combination
            for brand, quantity in zip(possible_combination, quantities):
                temp_total += price_dict[brand] * quantity
                temp_shopping_cart.append((brand, quantity))

            # Accept the cart if its price matches the original order_price (within a tolerance of 5)
            if abs(temp_total - original_order_price) <= 5:
                print(f"Match found in row {idx}, replaced shopping cart: {temp_shopping_cart}")
                
                # Update shopping_cart and order_computed in dirty_data
                dirty_data.loc[idx, 'shopping_cart'] = str(temp_shopping_cart)  # store the cart as a string
                dirty_data.loc[idx, 'order_computed'] = temp_total
                dirty_data.loc[idx, 'error'] = 2  # match found, mark the row as fixed
                
                # Record the row index
                successful_matches.append(idx)
                break  # stop searching once a match is found

# Print the matched row indices and their count
print(f"Matched row indices: {successful_matches}")
print(f"Number of matched rows: {len(successful_matches)}")

Match found in row 4, replaced shopping cart: [('Thunder line', 1), ('Universe Note', 1), ('iAssist Line', 2)]
Match found in row 11, replaced shopping cart: [('Thunder line', 1), ('Olivia x460', 2), ('Candle Inferno', 1)]
Match found in row 23, replaced shopping cart: [('Universe Note', 2), ('Candle Inferno', 2)]
Match found in row 30, replaced shopping cart: [('iAssist Line', 1), ('Lucent 330S', 1), ('Universe Note', 1), ('Alcon 10', 1)]
Match found in row 44, replaced shopping cart: [('iAssist Line', 1), ('Lucent 330S', 1), ('Toshika 750', 1), ('pearTV', 2)]
Match found in row 56, replaced shopping cart: [('Toshika 750', 2), ('Alcon 10', 2), ('Candle Inferno', 2)]
Match found in row 98, replaced shopping cart: [('Thunder line', 1), ('Alcon 10', 1), ('iAssist Line', 2)]
Match found in row 130, replaced shopping cart: [('Toshika 750', 2), ('Olivia x460', 2), ('Universe Note', 1), ('iStream', 1)]
Match found in row 173, replaced shopping cart: [('Toshika 750', 1), ('Alcon 10', 1), ('Candle Inferno', 1), ('Thunder line', 2)]
Match found in row 181, replaced shopping cart: [('iAssist Line', 1), ('iStream', 1), ('Thunder line', 2), ('Candle Inferno', 1)]
Match found in row 189, replaced shopping cart: [('Toshika 750', 1), ('Universe Note', 2)]
Match found in row 206

In [118]:
# Initialise a counter
replace_count = 0

# Iterate over all rows with error == 1
for idx, row in dirty_data[dirty_data['error'] == 1].iterrows():
    # Overwrite order_price with the value of order_computed
    dirty_data.loc[idx, 'order_price'] = dirty_data.loc[idx, 'order_computed']
    
    # Set error to 2 (fixed)
    dirty_data.loc[idx, 'error'] = 2
    
    # Increment the counter after each replacement
    replace_count += 1

# Drop the order_computed column
dirty_data.drop(columns=['order_computed'], inplace=True)

# Print how many rows were replaced
print(f"Replaced order_price in {replace_count} rows")

# Check that the replacement and the drop succeeded
print(dirty_data[['order_price', 'error']])

Replaced order_price in 20 rows
     order_price  error
0           8130      0
1           2750      0
2           6820      0
3           8555      0
4          10080      2
..           ...    ...
495         6735      2
496        26035      0
497        30475      0
498         4110      2
499        21570      2

[500 rows x 2 columns]


In [119]:
# Initialise the total_computed column
dirty_data['total_computed'] = None
# Iterate over every row
for idx, row in dirty_data.iterrows():
    order_price = row['order_price']
    
    # If error == 0, compute the order total
    if dirty_data.loc[idx, 'error'] == 0:
        delivery_charges = row['delivery_charges']
        coupon_discount = row['coupon_discount'] / 100  # convert the percentage discount to a fraction
        # Order total: apply the discount to order_price first, then add the delivery charges
        total_computed = order_price * (1 - coupon_discount) + delivery_charges
        # Store the computed total in total_computed
        dirty_data.loc[idx, 'total_computed'] = total_computed
    else:
        # If error is not 0, set total_computed to order_total
        dirty_data.loc[idx, 'total_computed'] = row['order_total']

In [120]:
mismatch_rows = dirty_data[dirty_data['total_computed'] != dirty_data['order_total']]
mismatch_indices = mismatch_rows.index.tolist()
for idx in mismatch_indices:
    if dirty_data.loc[idx, 'error'] == 0:
        dirty_data.loc[idx, 'error'] = 1
        
# Get the indices of the rows in dirty_data where error == 1
error_indices = dirty_data.index[dirty_data['error'] == 1]
# Print these row indices
print(error_indices)

Int64Index([  7,  13,  25,  27,  41,  71, 103, 106, 111, 118, 141, 157, 179,
            186, 268, 289, 300, 316, 357, 358, 379, 404, 410, 448, 449, 481,
            487],
           dtype='int64')


In [121]:
# Initialise a counter
replace_count = 0

# Iterate over all rows with error == 1
for idx, row in dirty_data[dirty_data['error'] == 1].iterrows():
  
    # Overwrite order_total with the value of total_computed
    dirty_data.loc[idx, 'order_total'] = dirty_data.loc[idx, 'total_computed']
    
    # Set error to 2 (fixed)
    dirty_data.loc[idx, 'error'] = 2
    
    # Increment the counter after each replacement
    replace_count += 1

# Drop the total_computed column
dirty_data.drop(columns=['total_computed'], inplace=True)

# Print how many rows were replaced
print(f"Replaced order_total in {replace_count} rows")

# Check that the replacement and the drop succeeded
print(dirty_data[['order_total', 'error']])

Replaced order_total in 27 rows
     order_total  error
0        7415.05      0
1        2555.36      0
2        5878.29      0
3        6499.12      0
4       10148.21      2
..           ...    ...
495      5135.42      2
496     23502.94      0
497     29017.60      0
498      3951.68      2
499     21643.81      2

[500 rows x 2 columns]
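
The order_total rule applied in this section is `order_total = order_price * (1 - coupon_discount / 100) + delivery_charges`. A minimal sketch of that check, with hypothetical helper names, made-up numbers, and an assumed tolerance of 0.01 to absorb floating-point rounding:

```python
def expected_order_total(order_price, coupon_discount, delivery_charges):
    """coupon_discount is a percentage, e.g. 10 means 10% off the order price.

    The discount applies to order_price only, not to the delivery charges.
    """
    return order_price * (1 - coupon_discount / 100) + delivery_charges

def total_is_consistent(order_total, order_price, coupon_discount,
                        delivery_charges, tol=0.01):
    """True when the recorded total agrees with the recomputed one within tol."""
    expected = expected_order_total(order_price, coupon_discount, delivery_charges)
    return abs(order_total - expected) <= tol

# Hypothetical order: price 1000, 10% coupon, delivery charge 50.5
print(expected_order_total(1000, 10, 50.5))
print(total_is_consistent(950.5, 1000, 10, 50.5))
print(total_is_consistent(1050.5, 1000, 10, 50.5))
```
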


<div class="alert alert-block alert-info">
    
### 4.10. Fix the is_happy_customer<a class="anchor" name="Reg_Exp"></a>

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Make sure the vader_lexicon resource is downloaded
nltk.download('vader_lexicon')

# Initialise the SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Make sure the latest_customer_review column is of string type
dirty_data['latest_customer_review'] = dirty_data['latest_customer_review'].astype(str)

# Define a function that classifies the sentiment of a review
def compute_sentiment(review_text):
    # Treat empty or missing reviews as positive by default
    # (astype(str) turns missing values into the strings "None" or "nan")
    if not review_text or review_text.strip() in ("", "None", "nan"):
        return True
    polarity_score = sia.polarity_scores(review_text)['compound']
    return polarity_score >= 0.05  # True means positive sentiment, False means negative

# Apply the sentiment analysis and store the result in test_sentiment
dirty_data['test_sentiment'] = dirty_data['latest_customer_review'].apply(compute_sentiment)

# Print the results for checking
print(dirty_data[['latest_customer_review', 'test_sentiment']])


In [129]:
# List of the indices of modified rows
modified_indices = []

# Iterate over all rows with error == 0
for idx, row in dirty_data[dirty_data['error'] == 0].iterrows():
    # If test_sentiment differs from is_happy_customer
    if row['test_sentiment'] != row['is_happy_customer']:
        # Assign test_sentiment to is_happy_customer
        dirty_data.loc[idx, 'is_happy_customer'] = row['test_sentiment']
        
        # Mark error as 2 (fixed)
        dirty_data.loc[idx, 'error'] = 2
        
        # Record the modified row index
        modified_indices.append(idx)

# Print the number and indices of modified rows
print(f"Number of rows with is_happy_customer modified: {len(modified_indices)}")
print(f"Modified row indices: {modified_indices}")
dirty_data.drop(columns=['test_sentiment'], inplace=True)

Number of rows with is_happy_customer modified: 0
Modified row indices: []


<div class="alert alert-block alert-info">
    
### 4.11. Fix the is_expedited_delivery<a class="anchor" name="Reg_Exp"></a>

In [133]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# dirty_data is assumed to contain the columns: distance_to_nearest_warehouse, is_expedited_delivery, is_happy_customer, delivery_charges, season
dirty_data['transfer_is_expedited_delivery'] = dirty_data['is_expedited_delivery'].astype(int)
# NOTE: transfer_delivery_charges is the target (delivery_charges) truncated to an integer,
# so using it as a feature largely explains the near-perfect R^2 values reported below.
dirty_data['transfer_delivery_charges'] = dirty_data['delivery_charges'].astype(int)
# 1. Spring
spring_data = dirty_data[dirty_data['season'] == 'Spring']
X_spring = spring_data[['distance_to_nearest_warehouse', 'transfer_is_expedited_delivery', 'transfer_delivery_charges']]
y_spring = spring_data['delivery_charges']

X_train_spring, X_test_spring, y_train_spring, y_test_spring = train_test_split(X_spring, y_spring, test_size=0.2, random_state=42)
lm_for_spring = LinearRegression()
lm_for_spring.fit(X_train_spring, y_train_spring)

# Evaluate the Spring model
y_pred_spring = lm_for_spring.predict(X_test_spring)
mse_spring = mean_squared_error(y_test_spring, y_pred_spring)
r2_spring = r2_score(y_test_spring, y_pred_spring)

print(f"Spring model coefficients: {lm_for_spring.coef_}")
print(f"Spring model intercept: {lm_for_spring.intercept_}")
print(f"Spring mean squared error: {mse_spring}")
print(f"Spring R^2: {r2_spring}")
print("="*50)

# 2. Summer
summer_data = dirty_data[dirty_data['season'] == 'Summer']
X_summer = summer_data[['distance_to_nearest_warehouse', 'transfer_is_expedited_delivery', 'transfer_delivery_charges']]
y_summer = summer_data['delivery_charges']

X_train_summer, X_test_summer, y_train_summer, y_test_summer = train_test_split(X_summer, y_summer, test_size=0.2, random_state=42)
lm_for_summer = LinearRegression()
lm_for_summer.fit(X_train_summer, y_train_summer)

# Evaluate the Summer model
y_pred_summer = lm_for_summer.predict(X_test_summer)
mse_summer = mean_squared_error(y_test_summer, y_pred_summer)
r2_summer = r2_score(y_test_summer, y_pred_summer)

print(f"Summer model coefficients: {lm_for_summer.coef_}")
print(f"Summer model intercept: {lm_for_summer.intercept_}")
print(f"Summer mean squared error: {mse_summer}")
print(f"Summer R^2: {r2_summer}")
print("="*50)

# 3. Autumn
autumn_data = dirty_data[dirty_data['season'] == 'Autumn']
X_autumn = autumn_data[['distance_to_nearest_warehouse', 'transfer_is_expedited_delivery', 'transfer_delivery_charges']]
y_autumn = autumn_data['delivery_charges']

X_train_autumn, X_test_autumn, y_train_autumn, y_test_autumn = train_test_split(X_autumn, y_autumn, test_size=0.2, random_state=42)
lm_for_autumn = LinearRegression()
lm_for_autumn.fit(X_train_autumn, y_train_autumn)

# Evaluate the Autumn model
y_pred_autumn = lm_for_autumn.predict(X_test_autumn)
mse_autumn = mean_squared_error(y_test_autumn, y_pred_autumn)
r2_autumn = r2_score(y_test_autumn, y_pred_autumn)

print(f"Autumn model coefficients: {lm_for_autumn.coef_}")
print(f"Autumn model intercept: {lm_for_autumn.intercept_}")
print(f"Autumn mean squared error: {mse_autumn}")
print(f"Autumn R^2: {r2_autumn}")
print("="*50)

# 4. Winter
winter_data = dirty_data[dirty_data['season'] == 'Winter']
X_winter = winter_data[['distance_to_nearest_warehouse', 'transfer_is_expedited_delivery', 'transfer_delivery_charges']]
y_winter = winter_data['delivery_charges']

X_train_winter, X_test_winter, y_train_winter, y_test_winter = train_test_split(X_winter, y_winter, test_size=0.2, random_state=42)
lm_for_winter = LinearRegression()
lm_for_winter.fit(X_train_winter, y_train_winter)

# Evaluate the Winter model
y_pred_winter = lm_for_winter.predict(X_test_winter)
mse_winter = mean_squared_error(y_test_winter, y_pred_winter)
r2_winter = r2_score(y_test_winter, y_pred_winter)

print(f"Winter model coefficients: {lm_for_winter.coef_}")
print(f"Winter model intercept: {lm_for_winter.intercept_}")
print(f"Winter mean squared error: {mse_winter}")
print(f"Winter R^2: {r2_winter}")
print("="*50)

Spring model coefficients: [0.0443219  0.09064435 0.99640716]
Spring model intercept: 0.7729074802802245
Spring mean squared error: 0.11672739159582231
Spring R^2: 0.9994893205299628
Summer model coefficients: [-0.04765955 -0.2089709   1.00441788]
Summer model intercept: 0.24917508158780777
Summer mean squared error: 0.09475318126791515
Summer R^2: 0.9994946574110137
Autumn model coefficients: [ 0.01651381 -0.0550806   1.00799579]
Autumn model intercept: -0.11635017972062656
Autumn mean squared error: 0.09653793045460139
Autumn R^2: 0.9990401882816209
Winter model coefficients: [-0.01032114  0.08726834  0.99555374]
Winter model intercept: 0.7776426571535637
Winter mean squared error: 0.08407617718869072
Winter R^2: 0.9988974177835975
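
The comment at the top of the cell above lists `is_happy_customer` among the relevant columns. As a hedged sketch of an alternative, assuming the delivery charge within one season is linear in the distance, the expedited flag and the happy-customer flag, the model can be fitted without using the charge itself as a feature. The function name `fit_delivery_model`, the use of `np.linalg.lstsq` in place of scikit-learn, and the synthetic coefficients are all assumptions made for this illustration, not results from the dataset.

```python
import numpy as np

def fit_delivery_model(distance, expedited, happy, charges):
    """Least-squares fit of charges = b0 + b1*distance + b2*expedited + b3*happy."""
    X = np.column_stack([np.ones(len(distance)), distance, expedited, happy])
    coef, *_ = np.linalg.lstsq(X, charges, rcond=None)
    return coef

# Synthetic data with made-up coefficients (not estimated from the dataset)
rng = np.random.default_rng(0)
distance = rng.uniform(0.1, 5.0, size=50)
expedited = rng.integers(0, 2, size=50)
happy = rng.integers(0, 2, size=50)
charges = 40 + 8 * distance + 15 * expedited - 6 * happy

coef = fit_delivery_model(distance, expedited, happy, charges)
print(np.round(coef, 3))
```

To follow the per-season approach of this section, the rows would be filtered by season first and one set of coefficients fitted for each season. A row whose recorded `is_expedited_delivery` gives a much larger residual than the flipped value would then be a candidate for correction.
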


<div class="alert alert-block alert-warning"> 

## 5.  Detect and remove outlier rows <a class="anchor" name="load"></a>

</div>

In [134]:

# Compute the mean and standard deviation of delivery_charges
mean_delivery_charges = outlier_data['delivery_charges'].mean()
std_delivery_charges = outlier_data['delivery_charges'].std()

# Use the 3-sigma rule to determine the lower and upper bounds
lower_bound = mean_delivery_charges - 3 * std_delivery_charges
upper_bound = mean_delivery_charges + 3 * std_delivery_charges

# Identify the outliers (below the lower bound or above the upper bound)
outliers = outlier_data[(outlier_data['delivery_charges'] < lower_bound) | (outlier_data['delivery_charges'] > upper_bound)]

# Print the detected outliers
print(f"Detected {len(outliers)} outlier rows")
print(outliers[['delivery_charges']])

# Remove the outlier rows from the original data
cleaned_data = outlier_data.drop(outliers.index)

# Save the processed data
cleaned_data.to_csv('cleaned_outlier_data.csv', index=False)

# Check the data after removing the outliers
print(f"Rows remaining after removal: {len(cleaned_data)}")

Detected 3 outlier rows
     delivery_charges
276           137.670
330           145.995
472           131.070
Rows remaining after removal: 497
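
The 3-sigma rule used above can be sketched with the standard library alone. `three_sigma_outliers` is a hypothetical helper and the charges are made-up values. `statistics.stdev` is the sample standard deviation, which is also what `Series.std()` computes by default in pandas.

```python
import statistics

def three_sigma_outliers(values):
    """Return the values that lie more than 3 standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    lower, upper = mean - 3 * std, mean + 3 * std
    return [v for v in values if v < lower or v > upper]

# Made-up delivery charges: 29 ordinary values and one extreme value
charges = [70.0, 80.0] * 14 + [70.0, 500.0]
print(three_sigma_outliers(charges))
```

An extreme value inflates both the mean and the standard deviation, so with very few observations the rule may fail to flag it. It works here because the sample is large enough.
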
