## Loading the data

In [3]:
import pandas as pd

# Load the dataset
data = pd.read_excel('/content/Online Retail.xlsx')

# Display the first few rows
print(data.head())


  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

          InvoiceDate  UnitPrice  CustomerID         Country  
0 2010-12-01 08:26:00       2.55     17850.0  United Kingdom  
1 2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
2 2010-12-01 08:26:00       2.75     17850.0  United Kingdom  
3 2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
4 2010-12-01 08:26:00       3.39     17850.0  United Kingdom  


## Exploring the data

In [4]:
# Count the null values in each column
print(data.isnull().sum())


InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64


**Observations:**

*   Description has 1,454 missing values.
*   CustomerID has 135,080 missing values, roughly 25% of all rows.
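
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of that calculation, assuming a small toy frame (`sample`, with made-up values) in place of the real dataset; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values, real column names)
sample = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER", "HANGER"],
    "CustomerID": [17850.0, None, None, 13047.0],
})

# Missing count and missing percentage side by side, one row per column
missing = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": (sample.isnull().mean() * 100).round(2),
})
print(missing)
```

Running the same two expressions on `data` instead of `sample` should give the per-column percentages for the full dataset.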



In [5]:
# Inspect column dtypes and non-null counts
# (info() prints its report itself and returns None, so print() is not needed)
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


**Observations:**

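*   InvoiceNo, StockCode, Description, and Country are stored as `object` and can be treated as categorical or text columns.
*   Quantity (int64) and UnitPrice (float64) are numeric.
*   CustomerID is stored as float64, most likely because it contains missing values; it is an identifier, not a quantity.
*   InvoiceDate is of type datetime64[ns].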
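
If the dtypes should reflect what the columns mean, one option is to cast them explicitly. This is a minimal sketch on a toy frame (`sample`, hypothetical values), assuming pandas' nullable `Int64` dtype and the `category` dtype are acceptable choices here; the notebook itself does not perform this step.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values, real column names)
sample = pd.DataFrame({
    "CustomerID": [17850.0, None, 13047.0],
    "Country": ["United Kingdom", "France", "United Kingdom"],
})

# Nullable integer dtype: IDs become whole numbers, missing IDs become <NA>
sample["CustomerID"] = sample["CustomerID"].astype("Int64")

# Category dtype suits a column with few distinct, repeated values
sample["Country"] = sample["Country"].astype("category")

print(sample.dtypes)
```

The `Int64` cast only works when every non-missing value is a whole number, which appears to be the case for the IDs shown in the output above.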




In [6]:
# Summary statistics for the numeric and datetime columns
print(data.describe())


            Quantity                    InvoiceDate      UnitPrice  \
count  541909.000000                         541909  541909.000000   
mean        9.552250  2011-07-04 13:34:57.156386048       4.611114   
min    -80995.000000            2010-12-01 08:26:00  -11062.060000   
25%         1.000000            2011-03-28 11:34:00       1.250000   
50%         3.000000            2011-07-19 17:17:00       2.080000   
75%        10.000000            2011-10-19 11:27:00       4.130000   
max     80995.000000            2011-12-09 12:50:00   38970.000000   
std       218.081158                            NaN      96.759853   

          CustomerID  
count  406829.000000  
mean    15287.690570  
min     12346.000000  
25%     13953.000000  
50%     15152.000000  
75%     16791.000000  
max     18287.000000  
std      1713.600303  


**Observations:**
1. **Quantity:**

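*   Negative values indicate possible errors or returned items.
*   The range from -80,995 to 80,995 is far wider than the interquartile range (1 to 10), which suggests outliers. The matching extremes may be a single large order and its reversal.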

2. **UnitPrice:**

*   Negative values (minimum -11,062.06) are likely data errors.
*   The maximum value (38,970) is far above the 75th percentile (4.13), so it is probably an outlier.

3. **CustomerID:**

*   Represents unique customers but has missing values (406,829 non-null out of 541,909 rows).
*   IDs range from 12,346 to 18,287 with no negative values. Statistics such as the mean and standard deviation are not meaningful for an identifier.
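
The suspicious rows noted above can be isolated with boolean masks before deciding what to do with them. The sketch below uses a toy frame (`sample`) with made-up invoice numbers. The check for a "C" prefix on InvoiceNo is an assumption about how cancellations might be marked, not something the output above confirms.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values, real column names)
sample = pd.DataFrame({
    "InvoiceNo": ["100001", "C100002", "100003", "100004"],
    "Quantity": [6, -1, 8, 1],
    "UnitPrice": [2.55, 27.50, 1.85, -11062.06],
})

# Rows with negative quantities (possible returns or errors)
negative_qty = sample[sample["Quantity"] < 0]

# Rows with zero or negative prices (likely errors)
bad_price = sample[sample["UnitPrice"] <= 0]

# Assumption: a leading "C" marks a cancelled invoice
is_cancellation = sample["InvoiceNo"].astype(str).str.startswith("C")

print(len(negative_qty), len(bad_price), int(is_cancellation.sum()))
```

Comparing `negative_qty` against `is_cancellation` on the real data would show whether the negative quantities are explained by cancellations or need separate handling.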

In [8]:
# Drop rows where CustomerID is missing
data = data.dropna(subset=['CustomerID'])

# Replace missing values in Description with "Unknown"
data['Description'] = data['Description'].fillna('Unknown')

# Check the result
print("Remaining Missing Values:")
print(data.isnull().sum())

Remaining Missing Values:
InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Description'] = data['Description'].fillna('Unknown')
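
The message above is pandas' SettingWithCopyWarning: the frame returned by `dropna` is flagged as possibly derived from another frame, so the later column assignment looks ambiguous to pandas. The missing-value counts show the assignment still worked. One common way to avoid the warning is to take an explicit copy after filtering. A minimal sketch on a toy frame (`sample` and `cleaned` are hypothetical names with made-up values):

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values, real column names)
sample = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER"],
    "CustomerID": [17850.0, 13047.0, None],
})

# .copy() makes the filtered frame an independent object, so the
# column assignment below is unambiguous
cleaned = sample.dropna(subset=["CustomerID"]).copy()
cleaned["Description"] = cleaned["Description"].fillna("Unknown")

print(cleaned.isnull().sum())
```

Applied to the notebook, this would mean appending `.copy()` to the `dropna` call in the cell above.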
