### 1. Import Required Libraries
    pandas (pd): Used for data manipulation and analysis.
    numpy (np): Supports numerical computations.
    matplotlib.pyplot (plt): Used for data visualization.
    seaborn (sns): Provides advanced statistical plotting.
    
    sklearn.preprocessing:
       LabelEncoder: Converts categorical values into numerical codes.
       StandardScaler: Standardizes numerical features (mean=0, std=1).
       MinMaxScaler: Scales numerical values to a fixed range (0 to 1).

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
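
To make the three preprocessing classes listed above concrete, here is a minimal sketch on made-up values. The `regions` and `incomes` arrays are assumptions for illustration, and the sketch only shows what each transformer produces, not how this notebook necessarily applies them later.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler

# Illustrative values only (assumed for this sketch, not the full dataset)
regions = ["North", "South", "West", "East", "North"]
# Scalers expect a 2-D array of shape (n_samples, n_features)
incomes = np.array([[30000.0], [45000.0], [55000.0], [65000.0]])

# LabelEncoder assigns integer codes following the sorted order of the class names
encoder = LabelEncoder()
region_codes = encoder.fit_transform(regions)
print(list(encoder.classes_), region_codes)

# StandardScaler subtracts the mean and divides by the population standard deviation
standardized = StandardScaler().fit_transform(incomes)
print(standardized.mean(), standardized.std())

# MinMaxScaler maps the column minimum to 0 and the column maximum to 1
rescaled = MinMaxScaler().fit_transform(incomes)
print(rescaled.ravel())
```

scikit-learn documents `LabelEncoder` as intended for target labels. Its integer codes imply an ordering, so for a nominal feature such as `region`, one-hot encoding is a common alternative.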

### 2. Load the Dataset
    Reads the dataset from the specified file path and loads it into a pandas DataFrame (df).
    This step makes the data available for analysis and manipulation in Python.

In [None]:
# Load the dataset (the path is relative to the notebook's working directory)
df = pd.read_csv("../data/Customer Purchasing Behaviors.csv")
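
`pd.read_csv` raises `FileNotFoundError` if the path does not resolve, so it can help to try the call on data that needs no file. The sketch below parses a small in-memory sample. It contains only the rows and columns visible in this notebook's `df.head()` output, and the names `sample_csv` and `sample_df` are assumptions for illustration.

```python
import io
import pandas as pd

# In-memory sample in the same layout
# (assumption: the file is comma-separated with a header row)
sample_csv = io.StringIO(
    "user_id,age,annual_income,purchase_amount,loyalty_score,region\n"
    "1,25,45000,200,4.5,North\n"
    "2,34,55000,350,7.0,South\n"
    "3,45,65000,500,8.0,West\n"
    "4,22,30000,150,3.0,East\n"
)

# read_csv accepts a file-like object as well as a path
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # dtypes are inferred per column
```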

### 3. Check General Information
    ✅df.info(): Displays the column names, data types, non-null counts, and memory usage.
                 Helps detect missing values and distinguish categorical from numerical columns.
    ✅df.head(): Returns the first five rows for a quick overview of the dataset.
                 Helps confirm that the dataset loaded correctly.

In [None]:
# Check general information
df.info()         # Prints column types and non-null counts (returns None, so no print() needed)
print(df.head())  # Display the first five rows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 238 entries, 0 to 237
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   user_id             238 non-null    int64  
 1   age                 238 non-null    int64  
 2   annual_income       238 non-null    int64  
 3   purchase_amount     238 non-null    int64  
 4   loyalty_score       238 non-null    float64
 5   region              238 non-null    object 
 6   purchase_frequency  238 non-null    int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 13.1+ KB
   user_id  age  annual_income  purchase_amount  loyalty_score region  \
0        1   25          45000              200            4.5  North   
1        2   34          55000              350            7.0  South   
2        3   45          65000              500            8.0   West   
3        4   22          30000              150            3.0   East   
4        5   29       
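
`df.info()` reports non-null counts, but a per-column count of missing values is often easier to scan. The following is a minimal sketch on a small hypothetical frame (`demo_df`) with gaps inserted deliberately, since the dataset here shows 238 non-null values in every column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberate gaps (not the real dataset)
demo_df = pd.DataFrame({
    "age": [25, 34, np.nan, 22],
    "region": ["North", None, "West", "East"],
    "loyalty_score": [4.5, 7.0, 8.0, 3.0],
})

# Number and share of missing values per column
missing_counts = demo_df.isna().sum()
missing_share = demo_df.isna().mean()
print(missing_counts)
print(missing_share)

# Split the column names into numerical and non-numerical groups
numeric_cols = demo_df.select_dtypes(include="number").columns.tolist()
categorical_cols = demo_df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```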

### 4. Generate Summary Statistics
df.describe() provides descriptive statistics for the numerical columns:

✅Count (number of non-null values).<br>
✅Mean, Std (Standard Deviation).<br>
✅Min, Max, 25th, 50th, 75th percentiles (Quartiles).<br>
    
    Helps identify outliers and data ranges; a count below the total number of rows signals missing values.

In [10]:
# Summary statistics for the numerical columns
print(df.describe())

          user_id         age  annual_income  purchase_amount  loyalty_score  \
count  238.000000  238.000000     238.000000       238.000000     238.000000   
mean   119.500000   38.676471   57407.563025       425.630252       6.794118   
std     68.848868    9.351118   11403.875717       140.052062       1.899047   
min      1.000000   22.000000   30000.000000       150.000000       3.000000   
25%     60.250000   31.000000   50000.000000       320.000000       5.500000   
50%    119.500000   39.000000   59000.000000       440.000000       7.000000   
75%    178.750000   46.750000   66750.000000       527.500000       8.275000   
max    238.000000   55.000000   75000.000000       640.000000       9.500000   

       purchase_frequency  
count          238.000000  
mean            19.798319  
std              4.562884  
min             10.000000  
25%             17.000000  
50%             20.000000  
75%             23.000000  
max             28.000000  
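
One common way to turn the quartiles from `describe()` into an outlier check is the 1.5 × IQR rule. The sketch below illustrates that rule and is not necessarily the method used later in this analysis. The `purchases` values are hypothetical, and 5000 is included as a deliberate outlier.

```python
import pandas as pd

# Hypothetical purchase amounts; 5000 is a planted outlier
purchases = pd.Series([150, 200, 350, 500, 5000], name="purchase_amount")

# Quartiles and interquartile range
q1 = purchases.quantile(0.25)
q3 = purchases.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = purchases[(purchases < lower) | (purchases > upper)]
print(outliers)
```

Applied to the summary above, `purchase_amount` has Q1 = 320 and Q3 = 527.5, so the fences are 8.75 and 838.75, and the observed range of 150 to 640 falls inside them.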


### 5. Check Unique Values in Categorical Columns
    df.select_dtypes(include=["object"]) selects only the object-dtype (categorical) columns.
    .nunique() counts the number of unique values in each of those columns.
    
    This helps:
    ✅Identify categorical variables for encoding.
    ✅Detect high-cardinality columns (columns with too many unique values).
    ✅Spot potential errors, such as inconsistent formatting in categorical data.

In [6]:
# Check unique values in categorical columns
for col in df.select_dtypes(include=["object"]).columns:
    print(f"{col}: {df[col].nunique()} unique values")

region: 4 unique values
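
Inconsistent formatting inflates the unique count: "North" and "north " would be counted as two categories. Below is a minimal sketch of spotting and normalizing such values with a hypothetical `raw_regions` series. The actual `region` column reports 4 unique values, so it may not need this step.

```python
import pandas as pd

# Hypothetical messy values (not taken from the dataset)
raw_regions = pd.Series(["North", "north ", "South", "WEST", "West"], name="region")

# Before cleaning: every spelling variant counts as its own category
print(raw_regions.nunique())
print(raw_regions.value_counts())

# Strip surrounding whitespace and normalize capitalization
clean_regions = raw_regions.str.strip().str.title()
print(clean_regions.nunique())
print(clean_regions.value_counts())
```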


### Conclusion
    This initial data exploration helps us understand the structure of the dataset, detect missing values, identify categorical variables, and assess numerical distributions. Based on these checks, we can decide which preprocessing steps are needed before building a predictive model, such as handling missing data, encoding categorical features, and scaling numerical values.

    Moving forward, we can refine the dataset by addressing inconsistencies and outliers and by applying feature transformations, so that the regression analysis on customer loyalty scores receives high-quality input.