<div class="alert alert-block alert-success">
    
# FIT5196 Task 1 in Assessment 2
#### Student Name: Deshui Yu      Liangjing Yang
#### Student ID: 34253599      34060871

Date: 28/09/2024

    
</div>

<div class="alert alert-block alert-danger">
    
## Table of Contents

</div>    

[1. Introduction](#Intro) <br>
[2. Importing Libraries](#libs) <br>
[3. Examining Raw Data](#examine) <br>
[4. Detect and Fix Errors in dirty_data](#load) <br>
$\;\;\;\;$[4.1. Fix the date and season](#Reg_Exp) <br>
$\;\;\;\;$[4.2. Fix customer_lat, customer_long, distance_to_nearest_warehouse and nearest_warehouse](#Read) <br>
$\;\;\;\;$[4.3. Whatever else](#latin) <br>
[5. Writing to CSV/JSON File](#write) <br>
$\;\;\;\;$[5.1. Verification - using the sample files](#test_xml) <br>
[6. Summary](#summary) <br>
[7. References](#Ref) <br>

-------------------------------------

<div class="alert alert-block alert-warning">

## 1.  Introduction  <a class="anchor" name="Intro"></a>
    
</div>

This project involves cleansing and analyzing a retail transactional dataset from DigiCO, an online electronics store in Melbourne. The task is to detect and fix errors, impute missing values, and remove outliers using exploratory data analysis (EDA). Cleaned data will be saved in the required output files, and the process will be documented in the final report.

<div class="alert alert-block alert-warning">
    
## 2.  Importing Libraries  <a class="anchor" name="libs"></a>
 </div>

The packages used in this assessment are imported below. They serve the following purposes:

* **pandas:** to load, manage and analyze the data.
* **numpy:** to perform numerical computations such as solving linear systems.
* **math:** to evaluate the trigonometric functions in the haversine distance.
* **ast:** to safely parse the `shopping_cart` strings into Python lists.

In [1]:
import ast
import math

import numpy as np
import pandas as pd

<div class="alert alert-block alert-warning">

## 3.  Examining Raw Data <a class="anchor" name="examine"></a>

 </div>

All three files contain the same columns: order_id, customer_id, date, nearest_warehouse, shopping_cart, order_price, delivery_charges, customer_lat, customer_long, coupon_discount, order_total, season, is_expedited_delivery, distance_to_nearest_warehouse, latest_customer_review and is_happy_customer. Of these, coupon_discount, delivery_charges, the item quantities in shopping_cart, order_id, customer_id and latest_customer_review contain no errors. In Group181_missing_data.csv the is_happy_customer column has missing values and is stored as float64; Group181_dirty_data.csv contains erroneous values; and Group181_outlier_data.csv contains outliers.

The columns are also logically related: season is determined by date, and distance_to_nearest_warehouse is determined by customer_lat and customer_long.

In [None]:
dirty_file_path = 'Group181_dirty_data.csv'
dirty_data = pd.read_csv(dirty_file_path)
dirty_data.info()  # info() prints its report directly and returns None
print(dirty_data.describe())

In [None]:
missing_file_path = 'Group181_missing_data.csv'
missing_data = pd.read_csv(missing_file_path)
missing_data.info()  # info() prints its report directly and returns None
print(missing_data.describe())

<div class="alert alert-block alert-warning"> 

## 4.  Detect and fix errors in dirty_data <a class="anchor" name="load"></a>

</div>

<div class="alert alert-block alert-info">
    
### 4.1. Fix the date and season <a class="anchor" name="Reg_Exp"></a>

</div>

Inspecting the `date` column shows that some values, which should be in YYYY-MM-DD format, were instead written as YYYY-DD-MM or DD-MM-YYYY. Because the season is determined by the date, the `season` column must also be corrected once the dates are fixed.

In [None]:
# Check and mark invalid dates
# reference from chatGPT
# Temporarily convert the 'date' column to datetime format, marking invalid dates as NaT (Not a Time)
temp_dates = pd.to_datetime(dirty_data['date'], errors='coerce')
# Find the invalid date
invalid_dates_temp = dirty_data[temp_dates.isna()]
print(invalid_dates_temp["date"])

In [8]:
fixed_dates = []
# Loop through each date string in the 'date' column
for date_str in dirty_data['date']:
    parts = date_str.split('-')
    # DD-MM-YYYY: the four-digit year comes last, so reverse the parts
    if len(parts) == 3 and len(parts[2]) == 4:
        fixed_dates.append(f'{parts[2]}-{parts[1]}-{parts[0]}')  # DD-MM-YYYY -> YYYY-MM-DD
    # YYYY-DD-MM: the middle part is too large to be a month, so swap day and month
    elif len(parts) == 3 and int(parts[1]) > 12 and int(parts[2]) <= 12:
        fixed_dates.append(f'{parts[0]}-{parts[2]}-{parts[1]}')  # YYYY-DD-MM -> YYYY-MM-DD
    # Keep the original string otherwise
    # (a YYYY-DD-MM value whose day is <= 12 is ambiguous and cannot be detected here)
    else:
        fixed_dates.append(date_str)
# Convert the fixed dates back to datetime format and replace the original 'date' column
dirty_data['date'] = pd.to_datetime(fixed_dates, format='%Y-%m-%d', errors='coerce')
print(dirty_data['date'].dtype)

Since we know that the date and season data are logically related, once the date data has been corrected, the season data should also be updated accordingly.

In [None]:
fixed_seasons = []
# Derive the season from the corrected date (southern-hemisphere seasons)
for date in dirty_data['date']:
    month = date.month  # Extract the month (NaN if the date could not be parsed)
    if month in [9, 10, 11]:
        fixed_seasons.append('Spring')
    elif month in [12, 1, 2]:
        fixed_seasons.append('Summer')
    elif month in [3, 4, 5]:
        fixed_seasons.append('Autumn')
    elif month in [6, 7, 8]:
        fixed_seasons.append('Winter')
    else:
        fixed_seasons.append(None)  # Unparseable date: leave the season missing
# Replace the original season column with the generated fixed_seasons
dirty_data['season'] = fixed_seasons

print(dirty_data[['date', 'season']])
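
The month-to-season rule above can also be written as a lookup table applied with `Series.dt.month.map`. This is only a minimal sketch, not the method used in this notebook: the `SEASON_BY_MONTH` dictionary, the `season_from_dates` helper and the sample dates are illustrative names and values, and a row with an unparseable date (`NaT`) simply receives a missing season.

```python
import pandas as pd

# Southern-hemisphere seasons keyed by month number (illustrative name)
SEASON_BY_MONTH = {
    12: 'Summer', 1: 'Summer', 2: 'Summer',
    3: 'Autumn', 4: 'Autumn', 5: 'Autumn',
    6: 'Winter', 7: 'Winter', 8: 'Winter',
    9: 'Spring', 10: 'Spring', 11: 'Spring',
}

def season_from_dates(dates: pd.Series) -> pd.Series:
    """Map a datetime Series to season names; NaT maps to a missing value."""
    return dates.dt.month.map(SEASON_BY_MONTH)

# Hypothetical sample dates, not taken from the assignment data
sample = pd.Series(pd.to_datetime(['2019-01-15', '2019-07-01', None]))
seasons = season_from_dates(sample)
print(seasons.tolist())
```

Keeping the rule in one dictionary makes it easy to reuse the same mapping when checking the `season` column of the other files.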

<div class="alert alert-block alert-info">
    
### 4.2. Fix the customer_lat, customer_long, distance_to_nearest_warehouse and nearest_warehouse <a class="anchor" name="Read"></a>

</div>

Since the business is based in Melbourne, customer coordinates should lie close to latitude -37.8136° and longitude 144.9631°. However, some rows have their latitude and longitude values swapped. After correcting these rows, the distance_to_nearest_warehouse and nearest_warehouse fields, which are derived from the coordinates, may also need to be recomputed.

In [None]:
# Check for latitude greater than 0 (latitude in Australia should be less than 0)
lat_issue = dirty_data.loc[dirty_data['customer_lat'] > 0]
print(lat_issue[['customer_lat', 'customer_long']])

# Check for longitude less than 0 (longitude in Australia should be greater than 0)
long_issue = dirty_data.loc[dirty_data['customer_long'] < 0]
print(long_issue[['customer_lat', 'customer_long']])

# Condition to find rows where either longitude < 0 or latitude > 0
condition = (dirty_data['customer_long'] < 0) | (dirty_data['customer_lat'] > 0)

# Select the rows matching the condition and swap the latitude and longitude
#reference from chatGPT
dirty_data.loc[condition, ['customer_lat', 'customer_long']] = \
    dirty_data.loc[condition, ['customer_long', 'customer_lat']].values

Data after fixing latitude and longitude:
   customer_lat  customer_long
0    -37.822570     144.952745
1    -37.818625     144.985920
2    -37.824845     144.957647
3    -37.809950     144.950436
4    -37.800566     144.952814


In [8]:
def haversine(lat1, lon1, lat2, lon2):
    dLat = (lat2 - lat1) * math.pi / 180.0  # Latitude difference in radians
    dLon = (lon2 - lon1) * math.pi / 180.0  # Longitude difference in radians
    lat1 = (lat1) * math.pi / 180.0         # Start latitude in radians
    lat2 = (lat2) * math.pi / 180.0         # End latitude in radians
    
    # The 'a' term of the haversine formula
    a = (pow(math.sin(dLat / 2), 2) + 
         pow(math.sin(dLon / 2), 2) * 
         math.cos(lat1) * math.cos(lat2))
    
    rad = 6378  # Earth radius in kilometres
    c = 2 * math.asin(math.sqrt(a))  # Central angle of the great-circle arc
    return rad * c  # Distance in kilometres
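
For reference, the function above evaluates the haversine formula. Written with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ in radians, and with the Earth radius $R = 6378$ km used in the code, it reads:

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right), \qquad d = 2R \arcsin\sqrt{a}
$$

As a quick sanity check, two points on the same meridian that are one degree of latitude apart give $d = R \cdot \pi / 180 \approx 111.3$ km.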

In [None]:
warehouse = pd.read_csv("warehouses.csv")
lat = dict(zip(warehouse.names, warehouse.lat))
lon = dict(zip(warehouse.names, warehouse.lon))

for index, row in missing_data.iterrows():
    customer_lat = row['customer_lat']  # Customer latitude
    customer_long = row['customer_long']  # Customer longitude
    
    min_distance = float('inf')  # Start with an infinite minimum distance
    nearest_name = None  # No nearest warehouse found yet
    
    # Compute the distance from the customer to every warehouse
    for name in lat:
        dist = round(haversine(customer_lat, customer_long, lat[name], lon[name]), 4)  # Customer-to-warehouse distance
        if dist < min_distance:  # Closer warehouse found: update distance and name
            min_distance = dist
            nearest_name = name
    
    # Write the nearest distance and warehouse name back into missing_data
    missing_data.at[index, 'nearest_distance'] = min_distance
    missing_data.at[index, 'nearest_name'] = nearest_name
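
The row-by-row search above can also be expressed with NumPy broadcasting, which computes all customer-to-warehouse distances in one matrix. The following is a sketch under stated assumptions rather than the notebook's own implementation: `haversine_matrix` is a hypothetical helper, and the warehouse names and coordinates are invented for illustration instead of being read from `warehouses.csv`.

```python
import numpy as np

EARTH_RADIUS_KM = 6378  # same radius as the haversine function in this notebook

def haversine_matrix(cust_lat, cust_lon, wh_lat, wh_lon):
    """Distances in km between every customer (rows) and every warehouse (columns)."""
    lat1 = np.radians(np.asarray(cust_lat, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(cust_lon, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(wh_lat, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(wh_lon, dtype=float))[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Hypothetical warehouse and customer coordinates, for illustration only
wh_names = np.array(['A', 'B'])
wh_lat, wh_lon = [-37.80, -37.90], [144.95, 145.05]
cust_lat, cust_lon = [-37.81, -37.89], [144.96, 145.04]

dist = haversine_matrix(cust_lat, cust_lon, wh_lat, wh_lon)
nearest_idx = dist.argmin(axis=1)              # column index of the closest warehouse
nearest_name = wh_names[nearest_idx]           # closest warehouse per customer
nearest_distance = dist.min(axis=1).round(4)   # distance to it, rounded as in the loop
print(nearest_name, nearest_distance)
```

The results could then be compared with the existing `nearest_warehouse` and `distance_to_nearest_warehouse` columns to locate rows that disagree.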


In [None]:
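# shopping_cart & order_price
import ast

# NOTE: missing_data_Nickolson is not defined in this part of the notebook;
# it is assumed to be a subset of missing_data created elsewhere.
for index, row in missing_data_Nickolson.iterrows():
    
    # Convert the string in the shopping_cart column into a Python list
    # (ast.literal_eval is used instead of eval because it only parses literals)
    shopping_cart = ast.literal_eval(row['shopping_cart'])
    
    # Print the parsed cart, its type and its elements
    print("Cart contents:", shopping_cart)            # The whole cart
    print("Cart type:", type(shopping_cart))          # Expected to be a list
    print("First item:", shopping_cart[0])            # First (name, quantity) pair
    print("Second item:", shopping_cart[1])           # Second (name, quantity) pair
    
    # Access the fields of individual items
    print("First item name:", shopping_cart[0][0])      # Name of the first item
    print("Second item quantity:", shopping_cart[1][1]) # Quantity of the second item
    
    # Stop after the first row; this loop is only an inspection
    break
  
item_types = set()  # Set of distinct item names

for index, row in missing_data_Nickolson.iterrows():
    shopping_cart = ast.literal_eval(row['shopping_cart'])  # Parse the cart string
    for item in shopping_cart:
        item_types.add(item[0])  # Add the item name to the set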


In [None]:
import numpy as np

# Coefficient matrix a
a = np.array([[1, 2], [3, 5]])

# Right-hand-side vector b
b = np.array([1, 2])

# Solve the linear system a * x = b with numpy's linalg.solve()
x = np.linalg.solve(a, b)

# Print the solution x
print("Solution of the system:", x)
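
The toy 2×2 system above hints at how the same idea might apply to shopping carts: if each order price is assumed to equal the sum of quantity × unit price over the items in the cart, every order contributes one linear equation in the unknown unit prices. The sketch below illustrates that assumption only. The item names, quantities and prices are made up, and `np.linalg.lstsq` is used in place of `np.linalg.solve` so that more orders than items can be handled.

```python
import numpy as np

# Hypothetical carts in the same [(item, quantity), ...] shape as shopping_cart
carts = [
    [('item_a', 1), ('item_b', 2)],
    [('item_a', 3), ('item_b', 1)],
    [('item_b', 4)],
]
order_prices = np.array([500.0, 500.0, 800.0])  # made-up order prices

# Fix a column order for the unknown unit prices
items = sorted({name for cart in carts for name, _ in cart})
col = {name: j for j, name in enumerate(items)}

# Quantity matrix: one row per order, one column per item
A = np.zeros((len(carts), len(items)))
for i, cart in enumerate(carts):
    for name, qty in cart:
        A[i, col[name]] = qty

# Least-squares solution of A @ unit_prices = order_prices
unit_prices, *_ = np.linalg.lstsq(A, order_prices, rcond=None)
price_by_item = dict(zip(items, unit_prices.round(2)))
print(price_by_item)
```

With real data, orders whose recorded price disagrees with `A @ unit_prices` would be candidates for an error in either `shopping_cart` or `order_price`.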

order total & order