
## **BASIC STATISTICS**

## Descriptive Analysis and Data Preprocessing on Sales & Discount Dataset

#### *Introduction*



*   To perform descriptive analytics, visualize data distributions, and preprocess the dataset for further analysis.



### Descriptive Analysis for Numerical Columns



*   **Objective:** To compute and analyze basic statistical measures for numerical columns in the dataset.
*   **Steps:**


In [None]:
!git clone https://github.com/vaisshnavee1410/basic_statistics_.ipynb.git

In [None]:
!pip install geopandas

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('sales_data_with_discounts.csv')

In [None]:
# Selecting numerical columns
numerical_cols = ['Volume', 'Avg Price', 'Total Sales Value',
                  'Discount Rate (%)', 'Discount Amount', 'Net Sales Value']

In [None]:
# Compute mean, median, standard deviation
mean_values = df[numerical_cols].mean()
median_values = df[numerical_cols].median()
std_values = df[numerical_cols].std()



In [None]:
# Compute mode (taking the first mode value)
mode_values = df[numerical_cols].mode().iloc[0]

In [None]:
# Combine results into a single DataFrame
stats_df = pd.DataFrame({
    'Mean': mean_values,
    'Median': median_values,
    'Mode': mode_values,
    'Standard Deviation': std_values
})

In [None]:
# Display the results
print(stats_df)
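
As a compact cross-check of the summary table, the same mean, median, and standard deviation can be produced in one call with `DataFrame.agg`. The sketch below is only illustrative: `toy` and its values are made up and stand in for two of the numerical columns, so the numbers say nothing about the sales file itself.

```python
import pandas as pd

# Hypothetical toy data standing in for two of the numerical columns.
toy = pd.DataFrame({
    "Volume": [1, 2, 2, 3, 7],
    "Avg Price": [100.0, 200.0, 200.0, 300.0, 700.0],
})

# agg returns one row per statistic; transpose so each column becomes a row,
# which mirrors the layout of stats_df.
summary = toy.agg(["mean", "median", "std"]).T

# Add the first mode per column, the same convention used in the notebook.
summary["mode"] = toy.mode().iloc[0]

print(summary)
```

Note that `std` here is the sample standard deviation (`ddof=1`), which is also what `df[numerical_cols].std()` returns.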

### Data Visualization



*   **Objective:** To visualize the distribution and relationship of numerical and categorical variables in the dataset.
*   **Histograms:**



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew

In [None]:
# Load the dataset
file_path = "sales_data_with_discounts.csv"
df = pd.read_csv(file_path)

In [None]:
# Identify numerical columns
numerical_cols = ['Volume', 'Avg Price', 'Total Sales Value',
                  'Discount Rate (%)', 'Discount Amount', 'Net Sales Value']

In [None]:
# Plot histograms for all numerical columns
plt.figure(figsize=(12, 8))

for i, col in enumerate(numerical_cols, 1):
    plt.subplot(2, 3, i)  # Arrange in 2 rows, 3 columns
    sns.histplot(df[col], bins=20, kde=True, color='blue')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel("Frequency")

plt.tight_layout()
plt.show()

In [None]:
# Calculate skewness for each numerical column
skewness_values = df[numerical_cols].apply(skew)

# Display skewness
print("\nSkewness of Each Column:")
print(skewness_values)
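
One detail worth keeping in mind when reading these numbers: `scipy.stats.skew` with its default settings returns the population-moment coefficient g1, whereas pandas' `Series.skew()` returns the sample-adjusted version G1, so the two can differ noticeably on small samples. The sketch below reproduces both by hand on a made-up series (`s` is hypothetical data, not taken from the sales file).

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 12], dtype=float)  # made-up, right-skewed values
n = len(s)

# Central moments computed with the population (divide-by-n) convention.
dev = s - s.mean()
m2 = (dev ** 2).mean()
m3 = (dev ** 3).mean()

# g1: population-moment skewness (the default scipy.stats.skew convention).
g1 = m3 / m2 ** 1.5

# G1: sample-adjusted skewness, the convention used by pandas' Series.skew().
G1 = g1 * (n * (n - 1)) ** 0.5 / (n - 2)

print(f"g1 = {g1:.3f}, G1 = {G1:.3f}, pandas skew = {s.skew():.3f}")
```

For a dataset with many rows the adjustment factor is close to 1, so the choice rarely changes the interpretation.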

In [None]:
# Create boxplots for all numerical columns
plt.figure(figsize=(12, 8))

for i, col in enumerate(numerical_cols, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(y=df[col], color='red')
    plt.title(f'Boxplot of {col}')
    plt.ylabel(col)

plt.tight_layout()
plt.show()

In [None]:
# Interpretation of skewness
for col in numerical_cols:
    skew_value = skewness_values[col]

    # Thresholds are symmetric around zero: |skew| <= 0.5 is treated as symmetric,
    # 0.5 < |skew| <= 1 as moderate, and |skew| > 1 as high.
    if skew_value > 1:
        skew_type = "Highly Right-Skewed (Positive Skew)"
    elif 0.5 < skew_value <= 1:
        skew_type = "Moderately Right-Skewed"
    elif -1 <= skew_value < -0.5:
        skew_type = "Moderately Left-Skewed"
    elif skew_value < -1:
        skew_type = "Highly Left-Skewed (Negative Skew)"
    else:
        skew_type = "Approximately Symmetric"

    print(f"{col}: Skewness = {skew_value:.2f} ({skew_type})")
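
The 0.5 and 1.0 cut-offs used for the interpretation are a common rule of thumb rather than a formal test. A minimal sketch of the same rule as a reusable helper follows; `classify_skew` is a hypothetical name introduced here, and the sample values are invented.

```python
def classify_skew(value: float) -> str:
    """Label a skewness value using the 0.5 / 1.0 rule of thumb."""
    magnitude = abs(value)
    if magnitude <= 0.5:
        return "Approximately Symmetric"
    direction = "Right" if value > 0 else "Left"
    strength = "Highly" if magnitude > 1 else "Moderately"
    return f"{strength} {direction}-Skewed"


# Invented skewness values covering each branch, including the boundary at -1.0.
for v in (1.8, 0.7, 0.1, -0.7, -1.0, -2.3):
    print(f"{v:+.1f}: {classify_skew(v)}")
```

Working on the absolute value keeps the left and right tails symmetric, so a value of exactly -1.0 is labelled the same way as +1.0.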



*   **Boxplots:**


In [None]:
# Function to detect outliers using IQR
def detect_outliers_iqr(data, col):
    Q1 = data[col].quantile(0.25)  # First quartile (25th percentile)
    Q3 = data[col].quantile(0.75)  # Third quartile (75th percentile)
    IQR = Q3 - Q1  # Interquartile Range
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
    return outliers

In [None]:
# Identify and print outliers for each numerical column
print("\nOutlier Analysis:")

for col in numerical_cols:
    outliers = detect_outliers_iqr(df, col)
    num_outliers = len(outliers)
    print(f"{col}: {num_outliers} outliers detected.")
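
To make the IQR rule concrete, here is a small worked example on made-up numbers (the series `values` is hypothetical). It follows the same steps as `detect_outliers_iqr`: compute the quartiles, build fences at 1.5 × IQR beyond them, and flag anything outside.

```python
import pandas as pd

values = pd.Series([10, 12, 13, 14, 15, 100], name="toy")  # made-up data

# Quartiles (pandas uses linear interpolation by default).
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Fences: points beyond these are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print("Outliers:", outliers.tolist())
```

Here Q1 = 12.25 and Q3 = 14.75, so the fences are 8.5 and 18.5 and only the value 100 is flagged.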



*  **Bar Chart Analysis for Categorical Column:**

In [None]:
# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print("Categorical Columns Identified:", categorical_cols)

In [None]:
# Plot bar charts for each categorical column
plt.figure(figsize=(12, 8))

for i, col in enumerate(categorical_cols[:6], 1):  # Limit to 6 categorical columns for better visualization
    plt.subplot(2, 3, i)
    sns.countplot(y=df[col], palette="viridis", order=df[col].value_counts().index)
    plt.title(f'Frequency of {col}')
    plt.xlabel("Count")
    plt.ylabel(col)

plt.tight_layout()  # Prevent subplot titles and labels from overlapping
plt.show()

In [None]:
# Display category distribution for each categorical column
for col in categorical_cols:
    print(f"\nColumn: {col}")
    print(df[col].value_counts(normalize=True) * 100)

### Standardization of Numerical Variables



*   **Objective:** To scale numerical variables to a common scale, improving the dataset's suitability for analytical models.


*   **Steps:**





In [None]:
import pandas as pd

In [None]:
df.head()

In [None]:
# Standardize each numerical column
df_standardized = pd.DataFrame() # Create an empty DataFrame
for col in numerical_cols:
    mean = df[col].mean()  # Calculate mean (μ)
    std_dev = df[col].std()  # Calculate standard deviation (σ)
    df_standardized[col] = (df[col] - mean) / std_dev  # Apply Z-score formula


# Display first few rows after standardization
print("Standardized Data Sample:")
df_standardized[numerical_cols].head()
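
A quick sanity check for the z-score formula z = (x − μ) / σ is that every standardized column should end up with mean 0 and standard deviation 1. The sketch below verifies this on a made-up frame (`toy` and its values are hypothetical) using a vectorised form of the same calculation.

```python
import pandas as pd

# Made-up columns on very different scales.
toy = pd.DataFrame({
    "Volume": [1, 2, 3, 4, 5],
    "Net Sales Value": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
})

# Vectorised z-score. DataFrame.std() defaults to the sample standard
# deviation (ddof=1), the same convention as Series.std().
z = (toy - toy.mean()) / toy.std()

print(z.round(3))
print("means ~ 0:", bool((z.mean().abs() < 1e-12).all()))
print("stds  ~ 1:", bool(((z.std() - 1).abs() < 1e-12).all()))
```

Because the two toy columns differ only by a constant factor, their standardized values come out identical, which illustrates that the transformation removes the original units.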

In [None]:
# Compare the distributions before and after standardization.

fig, axes = plt.subplots(len(numerical_cols), 2, figsize=(12, 4 * len(numerical_cols)))

for i, col in enumerate(numerical_cols):
    # Original Data Histogram
    sns.histplot(df[col], bins=30, kde=True, ax=axes[i, 0], color='blue')
    axes[i, 0].set_title(f"Original Distribution: {col}")

    # Standardized Data Histogram
    sns.histplot(df_standardized[col], bins=30, kde=True, ax=axes[i, 1], color='red')
    axes[i, 1].set_title(f"Standardized Distribution: {col}")

plt.tight_layout()
plt.show()

### Conversion of Categorical Data into Dummy Variables



*   **Objective:** To transform categorical variables into a format that can be provided to ML algorithms.
*   **Steps:**


In [None]:
# Get categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print("Categorical Columns Identified:", categorical_cols)

In [None]:
# Apply One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Display first few rows of transformed data
print("Data After One-Hot Encoding:")
df_encoded.head()
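
To see what `drop_first=True` does, consider a tiny made-up frame with one categorical column. The column names `Region` and `Sales` and their values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Region": ["North", "South", "East", "North"],  # made-up categories
    "Sales": [10, 20, 30, 40],
})

# One dummy column per category, minus the first (alphabetically: "East").
encoded = pd.get_dummies(toy, columns=["Region"], drop_first=True)

print(encoded)
print(list(encoded.columns))
```

The dropped category becomes the baseline: a row with `Region_North` and `Region_South` both false (or 0, depending on the pandas version) is an `East` row. Dropping one level this way avoids perfectly collinear dummy columns, which matters for linear models.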

### Conclusion



*   Summarize the key findings from the descriptive analytics and data visualizations.


After performing descriptive analytics and data visualizations, we have identified several important patterns and insights.

**Descriptive Statistics (Mean, Median, Standard Deviation)**

*   Sales, discounts, and quantity sold exhibit high variability, indicating fluctuations across different transactions.
*   Right-skewed distributions were observed in sales and discount values, meaning some products had significantly higher sales and discounts than others.
*   Outliers were detected, especially in sales and quantity, suggesting promotional impacts or extreme purchasing behavior.

**Data Distribution (Histograms & Boxplots)**

*   Sales and discount amounts are not normally distributed, with most values concentrated at the lower end.
*   Outliers were identified, particularly in sales and discount values, which could indicate seasonal trends, promotional campaigns, or data entry errors.

**Bar Charts for Categorical Features**

*   Some product categories and regions have significantly higher representation than others.
*   Uneven distribution of sales across different categories suggests that certain products contribute more to total revenue.

**One-Hot Encoding Results**

*   Categorical data was successfully converted into dummy variables, making it suitable for machine learning models.
*   Some categories had high cardinality, leading to a larger number of features in the dataset.



*   Reflect on the importance of data preprocessing steps like standardization and one-hot encoding in data analysis and machine learning.



- Prevents Bias → Standardization avoids dominance of large-scale variables.
- Enhances Interpretability → Categorical encoding makes feature relationships clearer.
- Required for ML Algorithms → Many models demand properly formatted numerical input.
- One-Hot Encoding transformed categorical variables into machine-readable formats, avoiding incorrect interpretations.
- Sales & Discount Trends: Sales are highly skewed, with a few high-value transactions contributing significantly to revenue.
- Regional & Product Category Insights: Certain regions and product categories dominate, which can guide inventory and marketing strategies.
- Preprocessed Data for ML: Standardization and encoding have prepared the dataset for predictive modeling (e.g., sales forecasting, customer segmentation).
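
The point about large-scale variables dominating can be illustrated with a distance calculation, as used by methods such as k-nearest neighbours or k-means. The sketch below is a toy illustration only: the three rows and the `distance` helper are made up for this purpose.

```python
import pandas as pd

# A and B differ by 9 units of Volume; A and C differ by 500 in Net Sales Value.
toy = pd.DataFrame(
    {"Volume": [1, 10, 1], "Net Sales Value": [50_000.0, 50_000.0, 50_500.0]},
    index=["A", "B", "C"],
)


def distance(frame: pd.DataFrame, i: str, j: str) -> float:
    """Euclidean distance between two rows of a DataFrame."""
    return float(((frame.loc[i] - frame.loc[j]) ** 2).sum() ** 0.5)


# Z-score standardization, column by column.
z = (toy - toy.mean()) / toy.std()

raw_ab, raw_ac = distance(toy, "A", "B"), distance(toy, "A", "C")
z_ab, z_ac = distance(z, "A", "B"), distance(z, "A", "C")

print(f"raw:          A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"standardized: A-B = {z_ab:.3f}, A-C = {z_ac:.3f}")
```

On the raw data, A looks far closer to B than to C simply because `Net Sales Value` is measured in much larger numbers. After standardization the two gaps are equal, because each column now contributes in units of its own spread.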