# Global Superstore Case Study
## Notebook 01: Data Understanding and Cleaning

**Objective:** Load the dataset, explore its structure, and perform initial cleaning to prepare it for analysis.

### Step 1: Loading the Dataset

In [11]:
# Import necessary libraries
import pandas as pd

# Load the dataset
df = pd.read_csv("superstore.csv")

# Preview the first few rows of the dataset
print("\nDataset:")
print(df.head())


Dataset:
          Category         City        Country Customer.ID     Customer.Name  \
0  Office Supplies  Los Angeles  United States   LS-172304  Lycoris Saunders   
1  Office Supplies  Los Angeles  United States   MV-174854     Mark Van Huff   
2  Office Supplies  Los Angeles  United States   CS-121304      Chad Sievert   
3  Office Supplies  Los Angeles  United States   CS-121304      Chad Sievert   
4  Office Supplies  Los Angeles  United States   AP-109154    Arthur Prichep   

   Discount Market  记录数               Order.Date        Order.ID  ... Sales  \
0       0.0     US    1  2011-01-07 00:00:00.000  CA-2011-130813  ...    19   
1       0.0     US    1  2011-01-21 00:00:00.000  CA-2011-148614  ...    19   
2       0.0     US    1  2011-08-05 00:00:00.000  CA-2011-118962  ...    21   
3       0.0     US    1  2011-08-05 00:00:00.000  CA-2011-118962  ...   111   
4       0.0     US    1  2011-09-29 00:00:00.000  CA-2011-146969  ...     6   

    Segment                Ship.Da

### Step 2: Exploring the Dataset

2.1 Dataset Summary:

In [17]:
# Check the dataset structure (df.info() prints its report directly and returns None)
print("Dataset Information:")
df.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 27 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Category        51290 non-null  object 
 1   City            51290 non-null  object 
 2   Country         51290 non-null  object 
 3   Customer.ID     51290 non-null  object 
 4   Customer.Name   51290 non-null  object 
 5   Discount        51290 non-null  float64
 6   Market          51290 non-null  object 
 7   记录数             51290 non-null  int64  
 8   Order.Date      51290 non-null  object 
 9   Order.ID        51290 non-null  object 
 10  Order.Priority  51290 non-null  object 
 11  Product.ID      51290 non-null  object 
 12  Product.Name    51290 non-null  object 
 13  Profit          51290 non-null  float64
 14  Quantity        51290 non-null  int64  
 15  Region          51290 non-null  object 
 16  Row.ID          51290 non-null  int64  
 17  Sales     

Explanation: The data looks clean, with no missing values. However, column 7, "记录数" (Chinese for "number of records"), has a non-English name. We need to confirm what it represents in this dataset and then rename it in a later step.
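A minimal sketch of the rename, assuming the column is a per-row record counter (its name translates literally to "number of records"). The new name `Record.Count` is a hypothetical choice that follows the dotted naming style of the other columns, and `sample` is a small stand-in frame rather than the real dataset.

```python
import pandas as pd

# Small stand-in frame that reuses the original column name
sample = pd.DataFrame({"Category": ["Office Supplies"], "记录数": [1]})

# "Record.Count" is an assumed name, not one taken from the dataset documentation
renamed = sample.rename(columns={"记录数": "Record.Count"})

print(renamed.columns.tolist())
```

The same `rename` call would apply to `df` once the meaning of the column is confirmed.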

In [18]:
# Check Statistics for numeric columns
print("\nSummary Statistics:")
print(df.describe())


Summary Statistics:
           Discount      记录数        Profit      Quantity       Row.ID  \
count  51290.000000  51290.0  51290.000000  51290.000000  51290.00000   
mean       0.142908      1.0     28.610982      3.476545  25645.50000   
std        0.212280      0.0    174.340972      2.278766  14806.29199   
min        0.000000      1.0  -6599.978000      1.000000      1.00000   
25%        0.000000      1.0      0.000000      2.000000  12823.25000   
50%        0.000000      1.0      9.240000      3.000000  25645.50000   
75%        0.200000      1.0     36.810000      5.000000  38467.75000   
max        0.850000      1.0   8399.976000     14.000000  51290.00000   

              Sales  Shipping.Cost          Year       weeknum  
count  51290.000000   51290.000000  51290.000000  51290.000000  
mean     246.498440      26.375818   2012.777208     31.287112  
std      487.567175      57.296810      1.098931     14.429795  
min        0.000000       0.002000   2011.000000      1.00000

Explanation: "Profit" ranges from -6599.978 to 8399.976 against a median of 9.24, which suggests outliers, and "Sales" may contain some as well. Both columns need to be inspected further. Overall, however, the quality of this dataset looks good.
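One way to carry out that inspection is the interquartile range (IQR) rule, sketched below. This is an assumption about the method, not necessarily the approach used later in this case study. The helper `iqr_outliers` and the multiplier `k=1.5` are illustrative, and `sample` mixes the extreme "Profit" values from the summary above with made-up filler values.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Stand-in data: the min and max come from the summary statistics,
# the remaining values are illustrative only
sample = pd.DataFrame(
    {"Profit": [0.0, 9.24, 36.81, 12.5, -6599.978, 8399.976, 20.0, 5.0]}
)

mask = iqr_outliers(sample["Profit"])
print(sample[mask])
```

On the real data, `iqr_outliers(df["Profit"])` and `iqr_outliers(df["Sales"])` would give masks that can be counted with `.sum()` before deciding how to treat the flagged rows.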

In [19]:
# Display column names
print("\nColumns in the Dataset:")
print(df.columns.tolist())



Columns in the Dataset:
['Category', 'City', 'Country', 'Customer.ID', 'Customer.Name', 'Discount', 'Market', '记录数', 'Order.Date', 'Order.ID', 'Order.Priority', 'Product.ID', 'Product.Name', 'Profit', 'Quantity', 'Region', 'Row.ID', 'Sales', 'Segment', 'Ship.Date', 'Ship.Mode', 'Shipping.Cost', 'State', 'Sub.Category', 'Year', 'Market2', 'weeknum']
