### 05.20.2025

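#### This code cleans the CSV file “recovered_utf8_for_excel.csv”
#### The problem: the old file has many rows, close to 20,000, with more than 4 fields, which makes loading fail with an error
#### The goal is to merge the extra columns into the fourth column so the file loads without errors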
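
To make the failure concrete, here is a minimal sketch of the kind of load error described above. The three-line sample is made up for illustration (it is not taken from the real file), and it assumes the load was done with `pd.read_csv` using the default parser.

```python
import io

import pandas as pd

# Hypothetical sample: the header has 4 fields, the last row has 6.
sample = (
    "department,title,ask,answer\n"
    "A,t1,q1,a1\n"
    "B,t2,q2,part1,part2,part3\n"
)

try:
    pd.read_csv(io.StringIO(sample))
    error_message = None
except pd.errors.ParserError as exc:
    # pandas refuses rows that have more fields than the header.
    error_message = str(exc)

print(error_message)
```

Rows with fewer fields than the header do not raise; pandas fills the missing cells with NaN, which is why short rows only show up later as missing values.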



#### 1. Merge every row with more than 4 fields down to exactly 4 fields

In [3]:

import csv
import os
import pandas as pd
from collections import Counter

# Define paths
input_csv_path = 'recovered_utf8_for_excel.csv'
output_csv_path = 'cleaned_recovered_utf8_for_excel.csv'

# Step 1: Count problematic rows
def count_problematic_rows(input_path, expected_fields=4):
    """
    Count rows with incorrect number of fields and log their details.
    
    Args:
        input_path (str): Path to the input CSV file
        expected_fields (int): Expected number of fields per row
    Returns:
        tuple: (problematic_rows, field_counts)
            - problematic_rows: List of tuples (line_number, row, field_count)
            - field_counts: Counter of field counts
    """
    problematic_rows = []
    field_counts = Counter()
    
    with open(input_path, 'r', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        for line_num, row in enumerate(reader, 1):
            field_count = len(row)
            field_counts[field_count] += 1
            if field_count != expected_fields:
                problematic_rows.append((line_num, row, field_count))
    
    return problematic_rows, field_counts

# Step 2: Clean CSV by merging extra fields into the 4th field
def clean_csv(input_path, output_path, expected_fields=4):
    """
    Clean the CSV by merging extra fields into the 4th field for rows with >4 fields.
    Rows with <4 fields are padded with empty strings.
    
    Args:
        input_path (str): Path to the input CSV file
        output_path (str): Path to save the cleaned CSV
        expected_fields (int): Expected number of fields per row
    """
    with open(input_path, 'r', encoding='utf-8') as infile, \
         open(output_path, 'w', encoding='utf-8', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
        
        for row in reader:
            if len(row) > expected_fields:
                # Merge every field past the first (expected_fields - 1) into the last kept field
                keep = expected_fields - 1
                cleaned_row = row[:keep] + [','.join(row[keep:])]
            elif len(row) < expected_fields:
                # Pad with empty strings if fewer fields
                cleaned_row = row + [''] * (expected_fields - len(row))
            else:
                # Row is already correct
                cleaned_row = row
            writer.writerow(cleaned_row)

# Step 3: Verify the cleaned CSV
def verify_cleaned_csv(output_path, expected_fields=4):
    """
    Verify that the cleaned CSV has the correct number of fields in all rows.
    
    Args:
        output_path (str): Path to the cleaned CSV
        expected_fields (int): Expected number of fields per row
    Returns:
        bool: True if all rows have the expected number of fields, False otherwise
    """
    with open(output_path, 'r', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        for line_num, row in enumerate(reader, 1):
            if len(row) != expected_fields:
                print(f"Verification failed: Line {line_num} has {len(row)} fields")
                return False
    return True
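
The merge-and-pad rule can be checked on a tiny in-memory sample before running it on the full file. This is only a sketch: `normalize_row` is a hypothetical helper that mirrors the per-row branches of `clean_csv`, and the sample rows are invented.

```python
import csv
import io


def normalize_row(row, expected_fields=4):
    """Hypothetical helper: apply the same merge/pad rule as clean_csv to one row."""
    if len(row) > expected_fields:
        keep = expected_fields - 1
        # Re-join the overflow with commas so the text of the last field is preserved.
        return row[:keep] + [','.join(row[keep:])]
    # Pads short rows; adds nothing when the row already has the expected length.
    return row + [''] * (expected_fields - len(row))


sample = 'dept,title,ask,one,two,three\ndept,title\n'
rows = [normalize_row(r) for r in csv.reader(io.StringIO(sample))]

# Write with QUOTE_ALL, as clean_csv does, then read back to confirm the field count.
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_ALL).writerows(rows)
reread = list(csv.reader(io.StringIO(out.getvalue())))

print(reread)
```

Quoting every field on output is what keeps the re-joined commas inside the fourth field from being split again on the next read.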


In [4]:
# Run the analysis and cleaning
print("Step 1: Analyzing problematic rows...")
problematic_rows, field_counts = count_problematic_rows(input_csv_path)

# Print summary
print("\nField count distribution:")
for field_count, count in sorted(field_counts.items()):
    print(f"Rows with {field_count} fields: {count}")
print(f"Total problematic rows: {len(problematic_rows)}")
print("\nSample problematic rows (first 5):")
for line_num, row, field_count in problematic_rows[:5]:
    print(f"Line {line_num}: {field_count} fields - {row}")

# Clean the CSV
print("\nStep 2: Cleaning the CSV...")
clean_csv(input_csv_path, output_csv_path)

# Verify the cleaned CSV
print("\nStep 3: Verifying the cleaned CSV...")
if verify_cleaned_csv(output_csv_path):
    print(f"Success: All rows in {output_csv_path} have exactly 4 fields.")
else:
    print(f"Error: Some rows in {output_csv_path} still have incorrect field counts.")

# Optional: Preview the cleaned CSV with pandas
print("\nPreview of cleaned CSV:")
df = pd.read_csv(output_csv_path, quoting=csv.QUOTE_ALL, encoding='utf-8')
print(df.head())


Step 1: Analyzing problematic rows...

Field count distribution:
Rows with 1 fields: 12467
Rows with 2 fields: 1824
Rows with 3 fields: 1925
Rows with 4 fields: 84344
Rows with 5 fields: 3233
Rows with 6 fields: 1924
Rows with 7 fields: 1613
Rows with 8 fields: 1345
Rows with 9 fields: 1099
Rows with 10 fields: 968
Rows with 11 fields: 841
Rows with 12 fields: 637
Rows with 13 fields: 553
Rows with 14 fields: 472
Rows with 15 fields: 420
Rows with 16 fields: 376
Rows with 17 fields: 342
Rows with 18 fields: 302
Rows with 19 fields: 238
Rows with 20 fields: 225
Rows with 21 fields: 170
Rows with 22 fields: 176
Rows with 23 fields: 119
Rows with 24 fields: 128
Rows with 25 fields: 117
Rows with 26 fields: 89
Rows with 27 fields: 90
Rows with 28 fields: 57
Rows with 29 fields: 76
Rows with 30 fields: 58
Rows with 31 fields: 54
Rows with 32 fields: 46
Rows with 33 fields: 30
Rows with 34 fields: 31
Rows with 35 fields: 34
Rows with 36 fields: 22
Rows with 37 fields: 18
Rows with 38 fields:

#### 2. Check that the newly generated data has only 4 fields

In [6]:

import csv
import os
import pandas as pd
from collections import Counter

# Define path to the cleaned CSV
cleaned_csv_path = 'cleaned_recovered_utf8_for_excel.csv'

# Step 1: Verify the number of fields in each row
def verify_csv_fields(csv_path, expected_fields=4):
    """
    Verify that all rows in the CSV have the expected number of fields.
    
    Args:
        csv_path (str): Path to the CSV file
        expected_fields (int): Expected number of fields per row
    Returns:
        tuple: (all_valid, field_counts, problematic_rows)
            - all_valid: True if all rows have expected_fields, False otherwise
            - field_counts: Counter of field counts
            - problematic_rows: List of tuples (line_number, row, field_count) for problematic rows
    """
    problematic_rows = []
    field_counts = Counter()
    
    with open(csv_path, 'r', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        for line_num, row in enumerate(reader, 1):
            field_count = len(row)
            field_counts[field_count] += 1
            if field_count != expected_fields:
                problematic_rows.append((line_num, row, field_count))
    
    all_valid = len(problematic_rows) == 0
    return all_valid, field_counts, problematic_rows

# Step 2: Run the verification
print(f"Verifying CSV: {cleaned_csv_path}")
if not os.path.exists(cleaned_csv_path):
    print(f"Error: File not found at {cleaned_csv_path}")
else:
    all_valid, field_counts, problematic_rows = verify_csv_fields(cleaned_csv_path)

    # Print field count distribution
    print("\nField count distribution:")
    for field_count, count in sorted(field_counts.items()):
        print(f"Rows with {field_count} fields: {count}")

    # Report results
    if all_valid:
        print(f"\nSuccess: All rows in {cleaned_csv_path} have exactly 4 fields.")
    else:
        print(f"\nError: Found {len(problematic_rows)} rows with incorrect field counts.")
        print("Sample problematic rows (first 5):")
        for line_num, row, field_count in problematic_rows[:5]:
            print(f"Line {line_num}: {field_count} fields - {row}")

    # Step 3: Preview the cleaned CSV
    print("\nPreview of cleaned CSV:")
    try:
        df = pd.read_csv(cleaned_csv_path, quoting=csv.QUOTE_ALL, encoding='utf-8')
        print(df.head())
        print(f"\nTotal rows: {len(df)}")
    except Exception as e:
        print(f"Error reading CSV with pandas: {str(e)}")
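

One caveat about the "Line N" values reported by the verification code: `enumerate(reader, 1)` counts parsed records, not physical lines. If a quoted field contains a line break, the two numbers drift apart. The sketch below uses an invented sample to show the difference; `reader.line_num` is the attribute of the standard `csv` reader that tracks physical lines.

```python
import csv
import io

# Hypothetical sample: the first record has a quoted field that spans two lines.
sample = 'a,b,c,"first line\nsecond line"\nd,e,f\n'

reader = csv.reader(io.StringIO(sample))
report = []
for record_num, row in enumerate(reader, 1):
    # reader.line_num is the number of physical lines consumed so far.
    report.append((record_num, reader.line_num, len(row)))

print(report)
```

When hunting for a bad row in a text editor, `reader.line_num` is usually the more useful number to log.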


Verifying CSV: cleaned_recovered_utf8_for_excel.csv

Field count distribution:
Rows with 4 fields: 116910

Success: All rows in cleaned_recovered_utf8_for_excel.csv have exactly 4 fields.

Preview of cleaned CSV:
  ﻿department        title                                                ask  \
0       营养保健科  小儿肥胖超重该如何治疗  女宝宝，刚7岁，这一年，察觉到，我家孩子身上肉很多，而且，食量非常的大，平时都不喜欢吃去玩，...   
1       营养保健科  小儿肥胖超重该怎样医治  男孩子，刚4岁，最近，发现，我家孩子体重要比别的孩子重很多，而且，最近越来越能吃了，还特别的...   
2       营养保健科  小儿肥胖能吃该如何治疗  男宝，已经5岁，今年，察觉到，孩子身上越来越肉乎了，同时，吃的饭也比一般孩子多，平时都不喜欢...   
3       营养保健科  小儿肥胖能吃该如何医治  女宝宝，目前2岁，近期，观察到，我家孩子越来越胖了，而且，吃起来好像也特别不节制，叫他运动也...   
4       营养保健科   小儿肥胖懒应怎样治疗  男孩，7岁，上小学了，这一年，观察到，孩子身上越来越肉乎了，而且，食量非常的大，平时都不喜欢...   

                                              answer  
0  孩子出现肥胖症的情况。家长要通过孩子运功和健康的饮食来缓解他的症状，可以先让他做一些有氧运动...  
1  孩子一旦患上肥胖症家长要先通过运动和饮食来改变孩子的情况，要让孩子做一些他这个年龄段能做的运...  
2  当孩子患上肥胖症的时候家长可以增加孩子的运动量和控制他的饮食来改变症状，像游泳，爬坡这类游泳...  
3  当孩子患上肥胖症的时候家长可以增加孩子的运动量和控制他的饮食来改变症状，家长要监督孩子做一些...  
4  当孩子患上肥胖症的时候家长可以增加孩子的运动
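

The first column name in the preview starts with an invisible U+FEFF character. A likely cause (an assumption, since the original file is not shown) is that the source file begins with a UTF-8 byte order mark and was opened with `encoding='utf-8'`, which keeps the mark as part of the first field. The sketch below uses invented header bytes to compare the two codecs, and shows why the mark can no longer be stripped once it has been written inside quotes.

```python
import csv
import io

# Hypothetical header bytes: a UTF-8 BOM followed by the column names.
raw_bytes = '\ufeffdepartment,title,ask,answer\n'.encode('utf-8')

# 'utf-8' keeps the BOM as the first character of the first field.
header_plain = next(csv.reader(io.StringIO(raw_bytes.decode('utf-8'))))

# 'utf-8-sig' drops a BOM that sits at the very start of the data.
header_sig = next(csv.reader(io.StringIO(raw_bytes.decode('utf-8-sig'))))

# After a QUOTE_ALL rewrite the BOM sits behind the opening quote,
# so it is no longer at the start and 'utf-8-sig' leaves it alone.
quoted_bytes = '"\ufeffdepartment","title","ask","answer"\n'.encode('utf-8')
header_quoted = next(csv.reader(io.StringIO(quoted_bytes.decode('utf-8-sig'))))

print(repr(header_plain[0]), repr(header_sig[0]), repr(header_quoted[0]))
```

Opening the original input with `encoding='utf-8-sig'` in the cleaning step would avoid carrying the mark into the cleaned file.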

#### 3. After merging down to 4 fields, remove rows that contain missing values

In [4]:

import csv
import os
import pandas as pd

# Define paths
input_csv_path = 'cleaned_recovered_utf8_for_excel.csv'
output_csv_path = 'no_missing_cleaned_recovered_utf8_for_excel.csv'

# Step 1: Inspect the original cleaned CSV
def inspect_csv(csv_path):
    """
    Inspect the CSV to confirm structure and missing values.
    """
    print(f"Inspecting {csv_path}...")
    try:
        # Read first few lines
        with open(csv_path, 'r', encoding='utf-8-sig') as infile:
            print("\nFirst 5 lines (raw):")
            for i, line in enumerate(infile, 1):
                if i > 5:
                    break
                print(f"Line {i}: {line.strip()}")
        
        # Load with pandas
        df = pd.read_csv(csv_path, encoding='utf-8-sig', quoting=csv.QUOTE_ALL)
        print("\nPandas DataFrame Info:")
        print(df.info())
        print("\nFirst 5 rows:")
        print(df.head())
        print(f"\nActual columns: {df.columns.tolist()}")
        print(f"\nMissing values:")
        for col in df.columns:
            print(f"{col}: {df[col].isna().sum()}")
        print(f"Total rows: {len(df)}")
    except Exception as e:
        print(f"Error inspecting CSV: {str(e)}")

# Step 2: Remove rows with missing values and fix column names
def remove_missing_values_csv(input_path, output_path):
    """
    Remove rows with missing values in department, ask, or answer, and fix column names.
    """
    try:
        # Load the CSV with pandas, handling BOM
        df = pd.read_csv(input_path, encoding='utf-8-sig', quoting=csv.QUOTE_ALL)
        
        # Rename columns; the BOM sits inside the quoted header, so 'utf-8-sig' does not strip it and it is mapped explicitly
        rename_dict = {
            '\ufeffdepartment': 'department',
            'title': 'extra',
            'ask': 'ask',
            'answer': 'answer'
        }
        df = df.rename(columns=rename_dict)
        
        # Remove rows with missing values in critical columns
        original_rows = len(df)
        df = df.dropna(subset=['department', 'ask', 'answer'])
        removed_rows = original_rows - len(df)
        
        # Ensure extra column is not null
        df['extra'] = df['extra'].fillna('')
        
        # Save the new CSV
        df.to_csv(output_path, index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)
        print(f"\nNew CSV saved to {output_path}")
        print(f"Original rows: {original_rows}")
        print(f"Rows after removing missing values: {len(df)}")
        print(f"Rows removed: {removed_rows}")
        print(f"New columns: {df.columns.tolist()}")
        
    except Exception as e:
        print(f"Error processing CSV: {str(e)}")

# Step 3: Verify the new CSV
def verify_new_csv(csv_path, expected_fields=4, expected_columns=('department', 'extra', 'ask', 'answer')):
    """
    Verify that the new CSV has no missing values and correct structure.
    """
    print(f"\nVerifying {csv_path}...")
    try:
        # Check field counts with csv.reader
        with open(csv_path, 'r', encoding='utf-8') as infile:
            reader = csv.reader(infile)
            headers = next(reader, None)
            row_count = 1
            problematic_rows = []
            for line_num, row in enumerate(reader, 2):
                row_count += 1
                if len(row) != expected_fields:
                    problematic_rows.append((line_num, row, len(row)))
            
            print(f"Headers: {headers}")
            print(f"Total rows (including header): {row_count}")
            if problematic_rows:
                print(f"Found {len(problematic_rows)} problematic rows:")
                for line_num, row, field_count in problematic_rows[:5]:
                    print(f"Line {line_num}: {field_count} fields - {row}")
            else:
                print(f"All rows have exactly {expected_fields} fields.")
        
        # Check with pandas
        df = pd.read_csv(csv_path, encoding='utf-8', quoting=csv.QUOTE_ALL)
        print("\nPandas DataFrame Info:")
        print(df.info())
        print(f"\nActual columns: {df.columns.tolist()}")
        print(f"Expected columns: {expected_columns}")
        print(f"Total rows: {len(df)}")
        print(f"\nMissing values:")
        for col in expected_columns:
            print(f"{col}: {df[col].isna().sum()}")
        
    except Exception as e:
        print(f"Error verifying CSV: {str(e)}")

# Run the inspection, processing, and verification
print("Step 1: Inspecting the original cleaned CSV...")
inspect_csv(input_csv_path)

print("\nStep 2: Removing rows with missing values and fixing CSV...")
remove_missing_values_csv(input_csv_path, output_csv_path)

print("\nStep 3: Verifying the new CSV...")
verify_new_csv(output_csv_path)
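

A note on the `fillna('')` step: with default settings, pandas reads an empty CSV field back as NaN, even when it was written as `""`. So the verification may still report missing values in `extra` although the column was filled before saving. The sketch below uses a made-up two-row frame to show the round trip; `keep_default_na=False` is one way to read empty strings back literally.

```python
import csv
import io

import pandas as pd

# Hypothetical frame with one empty string in the 'extra' column.
df_demo = pd.DataFrame({"department": ["A", "B"], "extra": ["", "x"]})

buf = io.StringIO()
df_demo.to_csv(buf, index=False, quoting=csv.QUOTE_ALL)

# Default read: the empty field comes back as NaN.
buf.seek(0)
default_read = pd.read_csv(buf)

# keep_default_na=False: the empty field stays an empty string.
buf.seek(0)
literal_read = pd.read_csv(buf, keep_default_na=False)

print(default_read["extra"].isna().sum(), literal_read["extra"].isna().sum())
```

Whether NaN or an empty string is the better representation depends on the downstream code; the point is only that the two reads disagree.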





Step 1: Inspecting the original cleaned CSV...
Inspecting cleaned_recovered_utf8_for_excel.csv...

First 5 lines (raw):
Line 1: "﻿department","title","ask","answer"
Line 2: "营养保健科","小儿肥胖超重该如何治疗","女宝宝，刚7岁，这一年，察觉到，我家孩子身上肉很多，而且，食量非常的大，平时都不喜欢吃去玩，请问：小儿肥胖超重该如何治疗。","孩子出现肥胖症的情况。家长要通过孩子运功和健康的饮食来缓解他的症状，可以先让他做一些有氧运动，比如慢跑，爬坡，游泳等，并且饮食上孩子多吃黄瓜，胡萝卜，菠菜等，禁止孩子吃一些油炸食品和干果类食物，这些都是干热量高脂肪的食物，而且不要让孩子总是吃完就躺在床上不动，家长在治疗小儿肥胖期间如果孩子情况严重就要及时去医院在医生的指导下给孩子治疗。"
Line 3: "营养保健科","小儿肥胖超重该怎样医治","男孩子，刚4岁，最近，发现，我家孩子体重要比别的孩子重很多，而且，最近越来越能吃了，还特别的懒，请问：小儿肥胖超重该怎样医治。","孩子一旦患上肥胖症家长要先通过运动和饮食来改变孩子的情况，要让孩子做一些他这个年龄段能做的运动，如游泳，慢跑等，要给孩子多吃一些像苹果，猕猴桃，胡萝卜等食物，禁止孩子吃高热量，高脂肪的食物，像蛋糕，干果，曲奇饼干等，严格的控制孩子的饮食，不要让他暴饮暴食，多运动对改变孩子肥胖都是有好处的，在治疗小儿肥胖期间如果情况严重，建议家长先带孩子去医院检查一下孩子肥胖症的原因在针对性的治疗。"
Line 4: "营养保健科","小儿肥胖能吃该如何治疗","男宝，已经5岁，今年，察觉到，孩子身上越来越肉乎了，同时，吃的饭也比一般孩子多，平时都不喜欢吃去玩，请问：小儿肥胖能吃该如何治疗。","当孩子患上肥胖症的时候家长可以增加孩子的运动量和控制他的饮食来改变症状，像游泳，爬坡这类游泳运动对肥胖的症状都很好的效果，像冬瓜，西红柿这样高纤维的蔬菜要多吃一些，孩子不可以吃像蛋糕，夏威夷果这些高热量的食物，而且不要让孩子总是吃完就躺在床上不动，家长在治疗小儿肥胖期间如果孩子情况严重就要及时去医院在医生的指导下给孩子治疗。"
Line 5: "营养保健科